___

Main: `Data collection`

Data Collector: `@crispengari`

Data: `Football Prediction`

Packages: `BeautifulSoup`, `requests`, `pandas`

Description: `Collecting data for football predictions.`

Data Source: [`www.forebet.com`](https://www.forebet.com/en/football-predictions/predictions-1x2/2022-05-18)

Date: `2022-05-20`
___

_Note that this data was collected from `2022-04-24` to `2022-05-19`._



In this notebook we are going to use web scraping to collect data from [www.forebet.com](https://www.forebet.com/en/football-predictions/predictions-1x2/2022-05-18).

### Football Predictions Dataset

We are going to scrape the daily football predictions data using `BeautifulSoup`.


### Imports
In the following code cell we are going to import all the packages that we need for scraping the data and saving it in a `csv` file.



In [1]:
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs


### Scraping a single page

In the following code cell we are going to scrape the football predictions for the day `2022-05-18` for testing purposes.

In [2]:
url = r'https://www.forebet.com/en/football-predictions/predictions-1x2/2022-05-18'
html = requests.get(url)
soup = bs(html.content, 'html.parser')

In the following code cell we are going to define the columns of data that we are going to store in a `csv` and the fields that we are going to `scrape`.

In [3]:
"""
home_win = 1
draw = 0
away_win = 2
"""
columns =  [
 "home_team", "away_team", "home_win_probability", "draw_probability", "away_win_probability",
 "temperature",
 "avg_goals", "home_win_odds", "draw_odds", "away_win_odds", "predicted_home_team_goals", "predicted_away_team_goals",
 "actual_home_team_goals", "actual_away_team_goals", "home_result"
]

In the following code cell we are going to scrape the data for the defined columns based on a single match. We will later on iterate over these results to create a giant `csv` file.

In [4]:
day_teams = soup.find_all("div", {"class": "rcnt tr_0"})
home_team = day_teams[0].find('span', {'class': 'homeTeam'}).text
away_team = day_teams[0].find('span', {'class': 'awayTeam'}).text
home_win_probability, draw_probability, away_win_probability  = [float(prob.text) for prob in day_teams[0].find('div', {'class': 'fprc'}).find_all('span')]
temperature = float(day_teams[0].find('span', {'class': 'wnums'}).text.replace('°', ''))
avg_goals = float(day_teams[0].find('div', {'class': 'avg_sc'}).text)
home_win_odds, draw_odds, away_win_odds = [float(prob.text) for prob in day_teams[0].find('div', {'class': 'haodd'}).find_all('span')[:3]]
predicted_home_team_goals, predicted_away_team_goals = [int(i.strip()) for i in day_teams[0].find('div', {'class': 'ex_sc'}).text.split('-')]
actual_home_team_goals, actual_away_team_goals = [int(i.strip()) for i in day_teams[0].find('b', {'class': 'l_scr'}).text.split('-')]
home_result = 0 if actual_home_team_goals == actual_away_team_goals else 1 if actual_home_team_goals > actual_away_team_goals else 2
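
The nested conditional expression that computes `home_result` is compact but easy to misread. The same logic can be sketched as two small helpers. The names `parse_score` and `encode_home_result` are hypothetical (they are not part of this notebook), and the `"2 - 1"` score format is assumed from the `split('-')` calls above.

```python
def parse_score(text):
    """Split a score string such as '2 - 1' into (home_goals, away_goals)."""
    home, away = (int(part.strip()) for part in text.split("-"))
    return home, away


def encode_home_result(home_goals, away_goals):
    """Return 0 for a draw, 1 for a home win, 2 for an away win."""
    if home_goals == away_goals:
        return 0
    return 1 if home_goals > away_goals else 2


home_goals, away_goals = parse_score("2 - 1")
print(encode_home_result(home_goals, away_goals))  # 1
```

Keeping the encoding in one function means the single-page test and the full loop cannot drift apart.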

### Creating a giant football prediction dataset

Our football prediction dataset covers the dates from `2022-04-24` to `2022-05-19`, so we need to generate every date in this range. To collect the data we are going to follow these steps in order:

1. Programmatically change the date in the url `https://www.forebet.com/en/football-predictions/predictions-1x2/<date>` for each of the generated dates.
2. Loop through the results and store them in a list as tuples.
3. Create a dataframe based on the columns defined above and save it as a `csv` file.
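
The first step can be sketched without any third-party package. This is a minimal illustration using the standard-library `datetime` module rather than the `pd.date_range` call used in this notebook; `BASE_URL` and `daily_urls` are hypothetical names introduced only for this example.

```python
from datetime import date, timedelta

BASE_URL = "https://www.forebet.com/en/football-predictions/predictions-1x2/"


def daily_urls(start, end):
    """Return one predictions URL per day from start to end, inclusive."""
    n_days = (end - start).days + 1
    return [BASE_URL + (start + timedelta(days=i)).isoformat() for i in range(n_days)]


urls = daily_urls(date(2022, 4, 24), date(2022, 5, 19))
print(len(urls))  # 26
print(urls[0])    # https://www.forebet.com/en/football-predictions/predictions-1x2/2022-04-24
```

The range is inclusive on both ends, so it yields 26 pages to request.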


In [5]:
dates = pd.date_range(start="2022-04-24",end="2022-05-19")

In [6]:
data = list()
for date in dates:
  url = 'https://www.forebet.com/en/football-predictions/predictions-1x2/' + date.strftime('%Y-%m-%d')
  html = requests.get(url)
  soup = bs(html.content, 'html.parser')
  # now we can scrape the data.
  day_teams = soup.find_all("div", {"class": "rcnt"})
  for day_team in day_teams:
    try:
      home_team = day_team.find('span', {'class': 'homeTeam'}).text
      away_team = day_team.find('span', {'class': 'awayTeam'}).text
      home_win_probability, draw_probability, away_win_probability  = [float(prob.text) for prob in day_team.find('div', {'class': 'fprc'}).find_all('span')]
      temperature = float(day_team.find('span', {'class': 'wnums'}).text.replace('°', ''))
      avg_goals = float(day_team.find('div', {'class': 'avg_sc'}).text)
      home_win_odds, draw_odds, away_win_odds = [float(prob.text) for prob in day_team.find('div', {'class': 'haodd'}).find_all('span')[:3]]
      predicted_home_team_goals, predicted_away_team_goals = [int(i.strip()) for i in day_team.find('div', {'class': 'ex_sc'}).text.split('-')]
      actual_home_team_goals, actual_away_team_goals = [int(i.strip()) for i in day_team.find('b', {'class': 'l_scr'}).text.split('-')]
      home_result = 0 if actual_home_team_goals == actual_away_team_goals else 1 if actual_home_team_goals > actual_away_team_goals else 2
      
      data.append((home_team, away_team, home_win_probability, draw_probability, away_win_probability, temperature,
            avg_goals, home_win_odds, draw_odds, away_win_odds, predicted_home_team_goals, predicted_away_team_goals,
            actual_home_team_goals, actual_away_team_goals, home_result))
    except Exception:
      # skip matches with missing or unparsable fields
      pass
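
Because the loop swallows every exception, a match that fails to parse disappears without a trace. One possible refinement, shown here only as a sketch, is to catch just the errors these parsing expressions are likely to raise and to count the skipped rows. `collect_rows` and `parse_row` are hypothetical names, and the rows are plain dictionaries with made-up teams so that the example runs offline; in the notebook they would be the `rcnt` elements.

```python
def collect_rows(raw_rows, parse_row):
    """Parse each raw row, keeping successes and counting failures."""
    parsed, skipped = [], 0
    for raw in raw_rows:
        try:
            parsed.append(parse_row(raw))
        except (AttributeError, ValueError):
            # AttributeError: an element was missing (e.g. find() returned None)
            # ValueError: text could not be converted or unpacked
            skipped += 1
    return parsed, skipped


def parse_row(raw):
    """Toy parser standing in for the BeautifulSoup lookups."""
    home, away = (int(p.strip()) for p in raw["score"].split("-"))
    return raw["home_team"], raw["away_team"], home, away


raw_rows = [
    {"home_team": "Team A", "away_team": "Team B", "score": "2 - 1"},
    {"home_team": "Team C", "away_team": "Team D", "score": ""},  # no final score
]
rows, skipped = collect_rows(raw_rows, parse_row)
print(len(rows), skipped)  # 1 1
```

Reporting the skipped count per day makes it easier to notice when a change in the page layout starts dropping every match.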

In [7]:
dataframe = pd.DataFrame(data, columns=columns)
dataframe.head()

| | home_team | away_team | home_win_probability | draw_probability | away_win_probability | temperature | avg_goals | home_win_odds | draw_odds | away_win_odds | predicted_home_team_goals | predicted_away_team_goals | actual_home_team_goals | actual_away_team_goals | home_result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Colorado Rapids | Charlotte FC | 41.0 | 29.0 | 30.0 | 10.0 | 1.17 | 1.55 | 4.0 | 5.75 | 2 | 0 | 0 | 0 | 0 |
| 1 | Santos Guápiles | Saprissa | 31.0 | 37.0 | 32.0 | 22.0 | 3.27 | 2.63 | 3.25 | 2.38 | 2 | 2 | 2 | 1 | 1 |
| 2 | Malacateco | C.D. Guastatoya | 43.0 | 39.0 | 18.0 | 24.0 | 1.95 | 1.8 | 3.25 | 4.0 | 1 | 0 | 0 | 0 | 0 |
| 3 | Santa Tecla | Atlético Marte | 31.0 | 46.0 | 24.0 | 20.0 | 2.01 | 2.2 | 3.2 | 2.88 | 1 | 1 | 1 | 1 | 0 |
| 4 | CD Águila | Luís Ángel Firpo | 50.0 | 31.0 | 20.0 | 24.0 | 2.56 | 1.73 | 3.2 | 4.5 | 3 | 0 | 1 | 0 | 1 |


Now we can save our `csv` as `football-predictions-2022-04-24-to-2022-05-19.csv`

In [8]:
dataframe.to_csv('football-predictions-2022-04-24-to-2022-05-19.csv', index=False)

print("Done")

Done


In [9]:
len(dataframe)

1985

### Football Prediction dataset

This dataset contains `1985` rows of data for football match predictions.

### Content
This dataset contains the following `15` fields.

1. `home_team`
* The name of the team playing at home.
2. `away_team`
* The name of the team playing away.
3. `home_win_probability`
* The predicted probability, as a percentage, that the home team wins the match.
4. `draw_probability`
* The predicted probability, as a percentage, that the match ends in a draw at full time.
5. `away_win_probability`
* The predicted probability, as a percentage, that the away team wins the match.
6. `temperature`
* A float, the temperature in degrees when the match was played.
7. `avg_goals`
* The average number of goals for the teams.
8. `home_win_odds`
* The odds for the home team to win the match.
9. `draw_odds`
* The odds for the match to end in a draw at full time.
10. `away_win_odds`
* The odds for the away team to win the match.
11. `predicted_home_team_goals`
* The predicted number of goals for the home team.
12. `predicted_away_team_goals`
* The predicted number of goals for the away team.
13. `actual_home_team_goals`
* The number of goals the home team actually scored at full time.
14. `actual_away_team_goals`
* The number of goals the away team actually scored at full time.
15. `home_result`
* A number (`0=draw`, `1=win`, `2=lose`) indicating whether the home team won, lost or drew the match at full time.
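
A quick way to sanity-check these fields is to confirm that the three probabilities sum to roughly 100 and that `home_result` agrees with the actual goals. The sketch below uses the standard-library `csv` module on three rows and a subset of columns copied from the `dataframe.head()` output above. `check_row` is a hypothetical helper, and the tolerance of `2.0` percentage points is an assumption made to allow for rounding on the site.

```python
import csv
import io

# Three rows taken from the notebook's `dataframe.head()` output (subset of columns).
sample = """home_team,away_team,home_win_probability,draw_probability,away_win_probability,actual_home_team_goals,actual_away_team_goals,home_result
Colorado Rapids,Charlotte FC,41.0,29.0,30.0,0,0,0
Santos Guápiles,Saprissa,31.0,37.0,32.0,2,1,1
Santa Tecla,Atlético Marte,31.0,46.0,24.0,1,1,0
"""


def check_row(row, tolerance=2.0):
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    total = sum(float(row[key]) for key in
                ("home_win_probability", "draw_probability", "away_win_probability"))
    if abs(total - 100.0) > tolerance:
        problems.append(f"probabilities sum to {total}")
    home = int(row["actual_home_team_goals"])
    away = int(row["actual_away_team_goals"])
    expected = 0 if home == away else 1 if home > away else 2
    if int(row["home_result"]) != expected:
        problems.append(f"home_result should be {expected}")
    return problems


for row in csv.DictReader(io.StringIO(sample)):
    print(row["home_team"], check_row(row))
```

The same function could be applied to every row of the saved `csv` file by passing an open file handle to `csv.DictReader` instead of the in-memory sample.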
