# Obtaining the FootballPredictions data
This script retrieves the short descriptions that the site https://footballpredictions.com/ publishes before each match.

FootballPredictions (FP) is a football news website that lets you look up the articles published before a match by setting the date of that match in the URL.

E.g. to find the matches played on 23/10/2022, use the following URL:
https://footballpredictions.com/footballpredictions/?date=23-10-2022
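The date in the URL follows a day-month-year pattern. A minimal sketch of building such a URL from a Python date is shown here; the helper name `build_fp_url` is hypothetical and is not part of football_predictions.py, and the base URL is copied from the example above.

```python
from datetime import date

# Base URL copied from the example above.
BASE_URL = "https://footballpredictions.com/footballpredictions/"


def build_fp_url(match_day: date) -> str:
    """Return the FP listing URL for a match date, formatted as DD-MM-YYYY."""
    return f"{BASE_URL}?date={match_day.strftime('%d-%m-%Y')}"


print(build_fp_url(date(2022, 10, 23)))
# https://footballpredictions.com/footballpredictions/?date=23-10-2022
```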

The class used to obtain descriptions (news) from FootballPredictions web pages is FootballPredictions (football_predictions.py).

In [1]:
from football_predictions import FootballPredictions
import pandas as pd
import numpy as np
import util_strings as utils

In [2]:
matches = pd.read_csv(utils.dataset_without_text, index_col=0)
ta = FootballPredictions(matches)

# Obtaining links
The match dates are grouped, and each distinct date is set in a URL that will be used to search for the matches played on that day.

In [3]:
ta.get_urls()
ta.save_urls(utils.json_link_matches)
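The internals of `get_urls` are not shown in this notebook. As a rough illustration of the grouping step described above, the sketch below maps each distinct match date to one URL. The function `urls_by_date` and the toy DataFrame are assumptions made for the example; only the `date` column name and the base URL come from this notebook.

```python
import json

import pandas as pd

BASE_URL = "https://footballpredictions.com/footballpredictions/"


def urls_by_date(matches: pd.DataFrame) -> dict:
    """Map each distinct match date (ISO string) to its FP listing URL."""
    days = (
        pd.to_datetime(matches["date"])
        .dt.normalize()       # drop any time-of-day component
        .drop_duplicates()    # one URL per day, however many matches
        .sort_values()
    )
    return {
        d.strftime("%Y-%m-%d"): f"{BASE_URL}?date={d.strftime('%d-%m-%Y')}"
        for d in days
    }


# Toy data for illustration only, not taken from the real dataset.
toy = pd.DataFrame({
    "date": ["2022-10-23", "2022-10-23", "2022-10-24"],
    "home": ["team_a", "team_b", "team_c"],
    "away": ["team_d", "team_e", "team_f"],
})
urls = urls_by_date(toy)
print(json.dumps(urls, indent=2))
```

Two matches on the same day share one URL, so the number of pages to fetch equals the number of distinct match days rather than the number of matches.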

Once the URLs have been obtained by combining the main link with the date of the matches, the page at each URL is searched for all the links under the Serie A div, which lead to the pages of the pre-match descriptions.

In [4]:
ta.read_urls(utils.json_link_matches)
ta.get_predictions(utils.matches_description, True)
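How `get_predictions` locates the Serie A block is not shown here. The sketch below illustrates the general idea of collecting every link nested inside one div, using only the standard library. The div id `serie-a`, the sample markup, and the class name `SectionLinkParser` are all hypothetical; the real page structure may differ.

```python
from html.parser import HTMLParser


class SectionLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside one target <div>."""

    def __init__(self, section_id: str):
        super().__init__()
        self.section_id = section_id
        self.depth = 0   # > 0 while the parser is inside the target div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1          # nested div inside the section
            elif attrs.get("id") == self.section_id:
                self.depth = 1           # entering the target section
        elif tag == "a" and self.depth > 0 and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


# Hypothetical markup, written only to exercise the parser.
SAMPLE = """
<div id="other"><a href="/other/x-vs-y">x</a></div>
<div id="serie-a">
  <div class="match"><a href="/preview/home1-vs-away1">p1</a></div>
  <div class="match"><a href="/preview/home2-vs-away2">p2</a></div>
</div>
<a href="/footer">footer</a>
"""

parser = SectionLinkParser("serie-a")
parser.feed(SAMPLE)
print(parser.links)
```

Tracking the div nesting depth keeps links from other leagues and from the page footer out of the result.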

# Text normalization

Games for which the search returned no results are looked up again and downloaded.

In [None]:
ta.read_all_predictions(utils.matches_description)
ta.recovery_games()

In [None]:
#d = ta.recoveries[(ta.recoveries.home == 'Juventus') & (ta.recoveries.away == 'Napoli')].date
ta.matches[(ta.matches.home == 'Lecce') & (ta.matches.away == 'Cagliari')]
#ta.recoveries[(ta.recoveries.home == 'Juventus') & (ta.recoveries.away == 'Napoli')]
ta.df[(ta.df.home == 'Lecce') & (ta.df.away == 'Cagliari')]

The fix_dates method saves all the news to a CSV file, here the path stored in utils.cleaned_matches_description.

In [7]:
ta.matches_not_found()
ta.fix_dates(utils.cleaned_matches_description)

2019-08-30 00:00:00 bologna spal
2022-03-06 00:00:00 genoa empoli


In [8]:
# Print every match that has no row in recoveries with the same home, away and season
for i, k in ta.matches.iterrows():
    if len(ta.recoveries[(ta.recoveries.home == k.home) & (ta.recoveries.away == k.away) & (ta.recoveries.season == k.season)]) == 0:
        print(k.home, k.away, k.date)

genoa empoli 2022-03-06 00:00:00
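The row-by-row loop above can also be expressed as a left anti-join. This is a sketch of that alternative, not the method used in football_predictions.py; the function `missing_matches` and the toy frames are assumptions, while the key columns `home`, `away` and `season` are the ones used in the loop.

```python
import pandas as pd

KEYS = ["home", "away", "season"]


def missing_matches(matches: pd.DataFrame, recoveries: pd.DataFrame) -> pd.DataFrame:
    """Return rows of `matches` with no row in `recoveries` sharing the same keys."""
    found = recoveries[KEYS].drop_duplicates()
    # indicator=True adds a `_merge` column telling where each row came from
    merged = matches.merge(found, on=KEYS, how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


# Toy data for illustration only.
toy_matches = pd.DataFrame({
    "home": ["team_a", "team_b", "team_c"],
    "away": ["team_b", "team_c", "team_a"],
    "season": ["2021-2022"] * 3,
})
toy_recoveries = pd.DataFrame({
    "home": ["team_a", "team_c"],
    "away": ["team_b", "team_a"],
    "season": ["2021-2022"] * 2,
})
missing = missing_matches(toy_matches, toy_recoveries)
print(missing)
```

A single merge avoids filtering `recoveries` once per match, which matters little at this dataset size but scales better.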


In [9]:
ta.recoveries[(ta.recoveries.home == 'bologna') & (ta.recoveries.away == 'hellas verona')]

index,date,home,away,description,prediction,season
572,2020-01-19,bologna,hellas verona,Stadio Renato Dall’Ara will host Sunday’s foot...,N,2019-2020
928,2021-01-16,bologna,hellas verona,Bologna will be aiming to record their first l...,P,2020-2021
1168,2021-09-12,bologna,hellas verona,Bologna and Verona take on each other at Stadi...,V,2021-2022
1535,2022-08-21,bologna,hellas verona,Bologna and Verona face each other at Stadio R...,N,2022-2023


In [10]:
ta.recoveries[((ta.recoveries.home == 'bologna') & (ta.recoveries.away == 'hellas verona')) | ((ta.recoveries.away == 'bologna') & (ta.recoveries.home == 'hellas verona'))]

index,date,home,away,description,prediction,season
388,2019-08-25,hellas verona,bologna,Stadio Marc’Antonio Bentegodi will host Sunday...,N,2019-2020
572,2020-01-19,bologna,hellas verona,Stadio Renato Dall’Ara will host Sunday’s foot...,N,2019-2020
928,2021-01-16,bologna,hellas verona,Bologna will be aiming to record their first l...,P,2020-2021
1127,2021-05-17,hellas verona,bologna,Monday’s football game between Verona and Bolo...,N,2020-2021
1168,2021-09-12,bologna,hellas verona,Bologna and Verona take on each other at Stadi...,V,2021-2022
1354,2022-01-21,hellas verona,bologna,Stadio Marc’Antonio Bentegodi will host Friday...,V,2021-2022
1535,2022-08-21,bologna,hellas verona,Bologna and Verona face each other at Stadio R...,N,2022-2023


In [11]:
print(len(ta.matches), len(ta.recoveries))  # the two datasets differ in size: recoveries has not been cleaned of double entries (recoveries, postponed matches)

1640 1639
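One way to inspect the double entries mentioned in the comment above is to list every fixture that appears more than once for the same season. This is only a sketch under the assumption that `home`, `away` and `season` identify a fixture; the function `duplicated_fixtures` and the toy data are hypothetical.

```python
import pandas as pd

KEYS = ["home", "away", "season"]


def duplicated_fixtures(df: pd.DataFrame) -> pd.DataFrame:
    """Return all rows whose (home, away, season) combination occurs more than once."""
    # keep=False marks every member of a duplicated group, not just the repeats
    return df[df.duplicated(subset=KEYS, keep=False)].sort_values(KEYS)


# Toy data for illustration only: the first and third rows are the same fixture.
toy = pd.DataFrame({
    "date": ["2021-01-10", "2021-01-17", "2021-02-03"],
    "home": ["team_a", "team_b", "team_a"],
    "away": ["team_b", "team_c", "team_b"],
    "season": ["2020-2021"] * 3,
})
dupes = duplicated_fixtures(toy)
print(dupes)
```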


In [12]:
ta.df[ta.df.prediction=='NAN']

,home,away,date,result,season,description,prediction
1091,hellas verona,spezia,2021-05-01,N,2020-2021,After suffering a 1-0 loss to Inter on Matchda...,NAN
1336,udinese,atalanta,2022-01-09,P,2021-2022,,NAN
1552,juventus,spezia,2022-08-31,V,2022-2023,,NAN
1561,lazio,napoli,2022-09-03,P,2022-2023,,NAN
1560,milan,inter,2022-09-03,V,2022-2023,,NAN
1559,fiorentina,juventus,2022-09-03,N,2022-2023,,NAN
1562,cremonese,sassuolo,2022-09-04,N,2022-2023,,NAN
1563,hellas verona,sampdoria,2022-09-04,V,2022-2023,,NAN
1564,udinese,roma,2022-09-04,V,2022-2023,,NAN
1565,spezia,bologna,2022-09-04,N,2022-2023,,NAN


In [13]:
len(ta.matches), len(ta.df)  # compare the sizes of the two datasets: the output shows they differ by one row

(1640, 1639)