# Soccer Analytics - Predicting Soccer Match Results to Improve Chances of Winning Bets

## Objective
- __Scrape match, player and team metrics and features from the web.__<br>
- __Transform the aggregated data into a training dataset using domain knowledge and machine learning algorithms.__<br>
- __Build prediction models on the training data and predict results for the current 2018-2019 season.__<br>
- __Compare the model's predictions with those from betting websites and football fans to gauge how effective the model is and where it stands.__<br>

This Jupyter notebook contains four main web-scraping routines: match-level data, player metrics summary, scoreboard data and season-level team metrics.

## 01- WebScraping
__The Beautiful Soup Python package is used to scrape soccer-related data.__<br>
__Dataset Source__:<br>
Understat - [Understat.com](https://understat.com/)<br><br>

### A. Scrape EPL Match Statistics for five seasons - 2014 to 2018

In [1]:
import requests, bs4, os
import numpy as np
import pandas as pd

Set `k` to 2014, 2015, 2016, 2017 or 2018 to scrape the individual match stats for that season.
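Rather than editing `k` by hand for each run, the season years can be generated in a loop. A minimal sketch, assuming the league URL pattern used in the next cell; the helper name `season_urls` is hypothetical and not part of the original notebook:

```python
# Same base URL as the scraping cell; one page per season start year.
BASE_URL = "https://understat.com/league/EPL/"


def season_urls(first_season=2014, last_season=2018):
    """Return one league URL per season, inclusive on both ends (hypothetical helper)."""
    return [BASE_URL + str(year) for year in range(first_season, last_season + 1)]


for url in season_urls():
    print(url)
```

Each URL could then be passed to `requests.get` in turn, with the season year stored alongside the scraped rows.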

In [103]:
import requests, bs4, os
import json
import numpy as np
import pandas as pd

scrblink ="https://understat.com/league/EPL/"
k=2018
player_perf=pd.DataFrame()
final=pd.DataFrame()
res = requests.get(scrblink+str(k))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "lxml")  # name the parser explicitly to avoid the bs4 parser warning
print('Web Scrape - Get Final Team and Player Scoreboard from Understat.com')
## Match list for the season (JSON embedded in a script tag)
matches = soup.find_all('script')
mat2=matches[1].text.strip()
mat3 =mat2.encode('utf8').decode('unicode_escape')
ma= mat3.find("'")
md =mat3.find(")")
matstr = mat3[ma+1:md-1]
match = json.loads(matstr)
page=[m['id'] for m in match]  # collect the match IDs for the season
link ="https://understat.com/match/"
final_df=pd.DataFrame()

try:
    for k in range(0,len(page)):
        res = requests.get(link+str(page[k]))
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, "lxml")  # name the parser explicitly to avoid the bs4 parser warning
        print('Web Scrape - Get Team Performance Metrics from Understat.com')
        ## Match Details
        titles = soup.find_all('title')
        ## Match Stats
        home_team = soup.find_all('div', attrs={'class':'progress-home'})
        away_team = soup.find_all('div', attrs={'class':'progress-away'})
        draw_chance= soup.find_all('div', attrs={'class':'progress-draw'})
        print('Web Scraping - Completed')
        homefeat=[]
        awayfeat=[]
        drawfeat=[]
        for i in range(0,len(home_team)):
            a= home_team[i].text.strip()
            if (a==''):
                homefeat.append(home_team[i]['title'])
            else:
                homefeat.append(a)
        for i in range(0,len(away_team)):
            a= away_team[i].text.strip()
            if (a==''):
                awayfeat.append(away_team[i]['title'])
            else:
                awayfeat.append(a)
        for i in range(0,len(draw_chance)):
            a= draw_chance[i].text.strip()
            if (a==''):
                drawfeat.append(draw_chance[i]['title'])
            else:
                drawfeat.append(a)
        c = pd.DataFrame(np.hstack((homefeat,awayfeat)))
        soccer_df = c.transpose()
        soccer_df['Dchance']=drawfeat
        soccer_df.rename(columns = {0:'Hteam',1:'Hchance',2:'Hgoals',3:'HxG',4:'Hshots',5:'Hshotstrgt',6:'Hdeep',7:'Hppda',8:'Hxpts',9:'Ateam',10:'Achance',11:'Agoals',12:'AxG',13:'Ashots',14:'Ashotstrgt',15:'Adeep',16:'Appda',17:'Axpts'},inplace=True)
        ftr=[]
        for i in range(0,len(soccer_df.index)):
            # Scraped values are strings, so compare the goals as integers
            hgoals=int(soccer_df['Hgoals'][i])
            agoals=int(soccer_df['Agoals'][i])
            if (hgoals>agoals):
                ftr.append(soccer_df['Hteam'][i])
            elif(hgoals<agoals):
                ftr.append(soccer_df['Ateam'][i])
            else:
                ftr.append('Draw')
        soccer_df['FTR']=ftr
        soccer_df['Season']=titles[0].text.split('|')[2]
        soccer_df['League']=titles[0].text.split('|')[1]
        soccer_df['Description']=titles[0].text.split('|')[0]
        date_str = titles[0].text.split('|')[0].split('(')[1]
        frmt_date =date_str.strip()[:17]
        soccer_df['MatchDate']=frmt_date
        final_df = pd.concat([final_df, soccer_df])
        print("All "+str(k)+"th match details obtained")
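except Exception as err:
    # Report the failure instead of silently ignoring it; matches scraped so far remain in final_df
    print("Scraping stopped at match index "+str(k)+": "+str(err))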





Web Scrape - Get Final Team and Player Scoreboard from Understat.com
Web Scrape - Get Team Performance Metrics from Understat.com
Web Scraping - Completed
All 0th match details obtained
Web Scrape - Get Team Performance Metrics from Understat.com
Web Scraping - Completed
All 1th match details obtained
...
Web Scrape - Get Team Performance Metrics from Understat.com
Web Scraping - Completed
All 215th match details obtained
...
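The scraping cell above slices the JSON payload out of a `<script>` tag by searching for the first quote and the closing parenthesis. The same idea can be sketched with a regular expression. This assumes the page embeds its data as `JSON.parse('...')` with hex-escaped characters, which is what the `unicode_escape` decoding step in the cell suggests. The function name, the variable name `datesData` and the sample ID are illustrative only:

```python
import json
import re

# Matches the quoted argument of a JSON.parse('...') call (assumed page layout).
_JSON_PARSE = re.compile(r"JSON\.parse\('(.*?)'\)", re.DOTALL)


def extract_embedded_json(script_text):
    """Decode the first JSON.parse('...') payload found in a script's text."""
    match = _JSON_PARSE.search(script_text)
    if match is None:
        raise ValueError("no JSON.parse('...') call found in script text")
    # Same decoding step as the notebook: turn \xNN escapes into characters.
    decoded = match.group(1).encode("utf8").decode("unicode_escape")
    return json.loads(decoded)


# Made-up script content in the assumed format.
sample_script = r"var datesData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x2211643\x22\x7D\x5D');"
print(extract_embedded_json(sample_script))
```

Raising an error when no payload is found makes a layout change on the site visible immediately, instead of failing later inside `json.loads` on a wrongly sliced string.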

In [101]:
len(page) ## There are 380 matches played in one complete season

380

In [104]:
final_df.to_csv('soccer_data.csv')
final_df

Unnamed: 0,Hteam,Hchance,Hgoals,HxG,Hshots,Hshotstrgt,Hdeep,Hppda,Hxpts,Ateam,...,Ashotstrgt,Adeep,Appda,Axpts,Dchance,FTR,Season,League,Description,MatchDate
0,Manchester United,28%,2,1.51,8,6,3,15.83,1.17,Leicester,...,4,10,11.46,1.50,33%,Manchester United,2018/2019,EPL,Manchester United 2 - 1 Leicester (August 10 2...,August 10 2018)
0,Newcastle United,8%,1,0.97,15,2,8,17.45,0.39,Tottenham,...,5,3,5.66,2.46,15%,Tottenham,2018/2019,EPL,Newcastle United 1 - 2 Tottenham (August 11 20...,August 11 2018)
0,Watford,64%,2,1.42,19,5,8,8.70,2.19,Brighton,...,0,2,14.25,0.55,26%,Watford,2018/2019,EPL,Watford 2 - 0 Brighton (August 11 2018),August 11 2018)
0,Huddersfield,2%,0,0.40,6,1,2,11.55,0.17,Chelsea,...,4,4,12.67,2.73,10%,Chelsea,2018/2019,EPL,Huddersfield 0 - 3 Chelsea (August 11 2018),August 11 2018)
0,Fulham,20%,0,0.64,15,6,3,4.53,0.93,Crystal Palace,...,10,8,22.79,1.75,32%,Crystal Palace,2018/2019,EPL,Fulham 0 - 2 Crystal Palace (August 11 2018),August 11 2018)
0,Bournemouth,68%,2,2.60,12,4,12,5.63,2.24,Cardiff,...,1,5,15.82,0.55,21%,Bournemouth,2018/2019,EPL,Bournemouth 2 - 0 Cardiff (August 11 2018),August 11 2018)
0,Wolverhampton Wanderers,36%,2,0.89,11,4,4,25.33,1.45,Everton,...,5,2,15.60,1.17,38%,Draw,2018/2019,EPL,Wolverhampton Wanderers 2 - 2 Everton (August ...,August 11 2018)
0,Southampton,26%,0,1.02,18,3,3,16.58,1.04,Burnley,...,6,7,7.81,1.68,27%,Draw,2018/2019,EPL,Southampton 0 - 0 Burnley (August 12 2018),August 12 2018)
0,Liverpool,99%,4,4.34,18,8,16,8.03,2.98,West Ham,...,2,3,25.19,0.01,1%,Liverpool,2018/2019,EPL,Liverpool 4 - 0 West Ham (August 12 2018),August 12 2018)
0,Arsenal,5%,0,0.45,9,3,8,13.56,0.28,Manchester City,...,8,8,10.95,2.58,14%,Manchester City,2018/2019,EPL,Arsenal 0 - 2 Manchester City (August 12 2018),August 12 2018)
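The scraped table stores win and draw chances as strings such as `28%`, and match dates as strings such as `August 10 2018)`, where the trailing parenthesis comes from slicing the page title. Before modelling, these would typically be converted to numbers and dates. A small sketch with hypothetical helper names, not part of the original notebook:

```python
from datetime import date, datetime


def parse_chance(text):
    """Convert a percentage string such as '28%' to a fraction (0.28)."""
    return float(text.strip().rstrip("%")) / 100.0


def parse_match_date(text):
    """Convert 'August 10 2018)' (or without the parenthesis) to a date object."""
    cleaned = text.strip().rstrip(")").strip()
    # %B is the full month name; this assumes English month names as on the site.
    return datetime.strptime(cleaned, "%B %d %Y").date()


print(parse_chance("28%"))
print(parse_match_date("August 10 2018)"))
```

With pandas, these could be applied column-wise, for example `final_df['Hchance'].map(parse_chance)`.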


#### Data Quality Check - Check for empty fields

- __2014-15 season check for NA__

In [88]:
final_df.isna().any()

Hteam          False
Hchance        False
Hgoals         False
HxG            False
Hshots         False
Hshotstrgt     False
Hdeep          False
Hppda          False
Hxpts          False
Ateam          False
Achance        False
Agoals         False
AxG            False
Ashots         False
Ashotstrgt     False
Adeep          False
Appda          False
Axpts          False
Dchance        False
FTR            False
Season         False
League         False
Description    False
MatchDate      False
dtype: bool

- __2015-16 season check for NA__

In [91]:
final_df.isna().any()

Hteam          False
Hchance        False
Hgoals         False
HxG            False
Hshots         False
Hshotstrgt     False
Hdeep          False
Hppda          False
Hxpts          False
Ateam          False
Achance        False
Agoals         False
AxG            False
Ashots         False
Ashotstrgt     False
Adeep          False
Appda          False
Axpts          False
Dchance        False
FTR            False
Season         False
League         False
Description    False
MatchDate      False
dtype: bool

- __2016-17 season check for NA__

In [95]:
final_df.isna().any()

Hteam          False
Hchance        False
Hgoals         False
HxG            False
Hshots         False
Hshotstrgt     False
Hdeep          False
Hppda          False
Hxpts          False
Ateam          False
Achance        False
Agoals         False
AxG            False
Ashots         False
Ashotstrgt     False
Adeep          False
Appda          False
Axpts          False
Dchance        False
FTR            False
Season         False
League         False
Description    False
MatchDate      False
dtype: bool

- __2017-18 season check for NA__

In [98]:
final_df.isna().any()

Hteam          False
Hchance        False
Hgoals         False
HxG            False
Hshots         False
Hshotstrgt     False
Hdeep          False
Hppda          False
Hxpts          False
Ateam          False
Achance        False
Agoals         False
AxG            False
Ashots         False
Ashotstrgt     False
Adeep          False
Appda          False
Axpts          False
Dchance        False
FTR            False
Season         False
League         False
Description    False
MatchDate      False
dtype: bool

- __2018-19 season check for NA__

In [105]:
final_df.isna().any()

Hteam          False
Hchance        False
Hgoals         False
HxG            False
Hshots         False
Hshotstrgt     False
Hdeep          False
Hppda          False
Hxpts          False
Ateam          False
Achance        False
Agoals         False
AxG            False
Ashots         False
Ashotstrgt     False
Adeep          False
Appda          False
Axpts          False
Dchance        False
FTR            False
Season         False
League         False
Description    False
MatchDate      False
dtype: bool
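`isna()` only flags genuinely missing values (NaN or None). The scraped fields arrive as strings, so an empty string or a placeholder such as `n/a` would pass the checks above. A hedged sketch of a stricter check using `pd.to_numeric(..., errors="coerce")`; the helper name and the toy rows are made up for illustration:

```python
import pandas as pd


def count_unparseable(df, numeric_columns):
    """Per column, count values that cannot be read as numbers (including '')."""
    counts = {}
    for col in numeric_columns:
        # errors="coerce" turns anything non-numeric into NaN instead of raising.
        converted = pd.to_numeric(df[col], errors="coerce")
        counts[col] = int(converted.isna().sum())
    return counts


# Toy data: one empty string and one placeholder, neither of which isna() detects.
toy = pd.DataFrame({"Hgoals": ["2", "", "1"], "HxG": ["1.51", "0.97", "n/a"]})
print(toy.isna().any().any())
print(count_unparseable(toy, ["Hgoals", "HxG"]))
```

Percentage columns such as `Hchance` would need the `%` sign stripped before a check like this is meaningful.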

### B. Scrape EPL Player Summary & Match Statistics for five seasons - 2014 to 2018

In [116]:
import requests, bs4, os
import json
import numpy as np
import pandas as pd
link ="https://understat.com/match/"
start=4439
end=4758
player_df=pd.DataFrame()
player_df3=pd.DataFrame()
for k in range(start,end):
    res = requests.get(link+str(k))
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "lxml")  # name the parser explicitly to avoid the bs4 parser warning
    print('Web Scrape - Get Player Performance Metrics per Match from Understat.com')
    ## Player JSON Data
    titles=soup.find('title')
    players = soup.find_all('script')
    date_str = titles.text.split('|')[0].split('(')[1].strip()
    season = titles.text.split('|')[2].strip()
    description=  titles.text.split('|')[0].strip()
    league =  titles.text.split('|')[1].strip()
    v2=players[2].text.strip()
    v3 =v2.encode('utf8').decode('unicode_escape')
    c = v3.find("'")
    d =v3.find(");")
    newstr = v3[c+1:d-1]
    y = json.loads(newstr)
    f1 = list(y.keys())
    final=pd.DataFrame()
    for i in range(0,len(f1)):
        if (f1[i]=='h'):
            home = pd.DataFrame.from_dict(y[f1[i]],orient='index')
            home.reset_index(inplace=True)
            home.drop(columns=['index'],inplace=True)
        if (f1[i]=='a'):
            away = pd.DataFrame.from_dict(y[f1[i]],orient='index')  # use the current key ('a'), not a hard-coded position
            away.reset_index(inplace=True)
            away.drop(columns=['index'],inplace=True)
    hcol = home.columns
    hcol_new=[]
    acol = away.columns
    acol_new=[]
    for i in range(0,len(hcol)):
        hcol_new.append('h_'+hcol[i])
    for i in range(0,len(acol)):
        acol_new.append('a_'+acol[i]) 
    total = hcol_new+acol_new
    final=pd.concat([home,away],axis=1)
    final['Season']=season
    final['League']=league
    final['Description']=description
    final['MatchDate']=date_str
    final['MatchID']=k
    player_df = pd.concat([player_df, final])
    player_df2=pd.concat((home,away))
    player_df2['Season']=season
    player_df2['League']=league
    player_df2['Description']=description
    player_df2['MatchDate']=date_str
    player_df2['MatchID']=k
    player_df3=pd.concat([player_df3,player_df2])
    total=total+['Season','League','Description','MatchDate','MatchID']
    print("All "+str(k)+"th match details obtained")
player_df.columns=total
print('Web Scraping - Completed')





Web Scrape - Get Player Performance Metrics per Match from Understat.com
All 4439th match details obtained
Web Scrape - Get Player Performance Metrics per Match from Understat.com
All 4440th match details obtained
...
Web Scrape - Get Player Performance Metrics per Match from Understat.com
All 4755th match details obtained
...

In [117]:
player_df.to_csv('player_data.csv')
player_df3.to_csv('player_datav2.csv')
player_df3.shape

(8745, 26)
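The long-format table (`player_df3`) stacks home and away players, but the rows do not record which side a player was on. A minimal sketch of flattening the per-match roster into one record per player with an explicit side flag. It assumes the payload has the shape used in the cell above (keys `'h'` and `'a'`, each mapping an entry ID to a dict of stats); the function name, the `side` column and the sample values are hypothetical:

```python
def roster_to_records(rosters, match_id):
    """Flatten {'h': {...}, 'a': {...}} into one record per player appearance."""
    records = []
    for side in ("h", "a"):
        for entry in rosters.get(side, {}).values():
            record = dict(entry)           # copy so the input is not modified
            record["side"] = side          # hypothetical column: 'h' = home, 'a' = away
            record["MatchID"] = match_id
            records.append(record)
    return records


# Made-up sample in the assumed shape; real payloads carry many more fields.
sample_rosters = {
    "h": {"1": {"player": "Home Player 1", "goals": "1"},
          "2": {"player": "Home Player 2", "goals": "0"}},
    "a": {"3": {"player": "Away Player 1", "goals": "0"}},
}
records = roster_to_records(sample_rosters, match_id=4439)
print(len(records))
```

Passing the accumulated records to `pd.DataFrame(records)` once, after the loop over matches, would give a table comparable to `player_df3` without concatenating inside the loop.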

### C. Scrape EPL Scoreboard (League Table) Statistics for five seasons - 2014 to 2018

In [129]:
import requests, bs4, os
import json
import numpy as np
import pandas as pd
link ="https://understat.com/league/EPL/"
start=2014
end=2019
player_perf=pd.DataFrame()
final=pd.DataFrame()
for k in range(start,end):
    res = requests.get(link+str(k))
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "lxml")  # name the parser explicitly to avoid the bs4 parser warning
    print('Web Scrape - Get Final Team and Player Scoreboard from Understat.com')
    ## Team Final Scoreboard JSON Data
    titles=soup.find('title')
    players = soup.find_all('script')
    v2=players[2].text.strip()
    v3 =v2.encode('utf8').decode('unicode_escape')
    c = v3.find("'")
    d =v3.find(");");
    #print(c)
    newstr = v3[c+1:d-1]
    #print(newstr)
    y = json.loads(newstr)
    f1 = list(y.keys())
    for i in range(0,len(f1)):
        for j in range(0,len(y[f1[i]]['history'])):
            home = pd.DataFrame.from_dict(y[f1[i]]['history'][j],orient='index').transpose()
            home.reset_index(inplace=True)
            home.drop(columns=['index'],inplace=True)
            home['id']=y[f1[i]]['id']
            home['team']=y[f1[i]]['title']
            home['season']=str(k)
            final=pd.concat([final,home])
    ## Player Scoreboard
    p2=players[3].text.strip()
    p3 =p2.encode('utf8').decode('unicode_escape')
    c1 = p3.find("'")
    d1 =p3.find(");");
    #print(c)
    newstr1 = p3[c1+1:d1-1]
    #print(newstr)
    player_y = json.loads(newstr1)
    # Do not reset player_perf here: it is created once before the season loop so that all seasons accumulate
    for s in range(0,len(player_y)):
        players = pd.DataFrame.from_dict(player_y[s],orient='index').transpose()
        players.reset_index(inplace=True)
        players.drop(columns=['index'],inplace=True)
        players['season']=str(k)
        player_perf=pd.concat([player_perf,players])
    print("All "+str(k)+"th match details obtained")
team_brd=final[['id', 'season','team','h_a', 'xG', 'xGA', 'npxG', 'npxGA', 'ppda', 'ppda_allowed', 'deep',
       'deep_allowed', 'scored', 'missed', 'xpts', 'result', 'date', 'wins',
       'draws', 'loses', 'pts', 'npxGD']]
team_brd.set_index('id',inplace=True)
player_brd=player_perf[['id','season', 'player_name', 'games', 'time', 'goals', 'xG', 'assists', 'xA',
       'shots', 'key_passes', 'yellow_cards', 'red_cards', 'position',
       'team_title', 'npg', 'npxG', 'xGChain', 'xGBuildup']]
player_brd.set_index('id',inplace=True)
print('Web Scraping - Completed')





Web Scrape - Get Final Team and Player Scoreboard from Understat.com
All 2014th match details obtained
Web Scrape - Get Final Team and Player Scoreboard from Understat.com
All 2015th match details obtained
Web Scrape - Get Final Team and Player Scoreboard from Understat.com
All 2016th match details obtained
Web Scrape - Get Final Team and Player Scoreboard from Understat.com
All 2017th match details obtained
Web Scrape - Get Final Team and Player Scoreboard from Understat.com
All 2018th match details obtained
Web Scraping - Completed


In [251]:
player_brd.shape

(4227, 18)

In [268]:
player_brd.to_csv('player_performance.csv')

In [236]:
player_brd

id,season,player_name,games,time,goals,xG,assists,xA,shots,key_passes,yellow_cards,red_cards,position,team_title,npg,npxG,xGChain,xGBuildup
4536,2014,Nani,1,41,0,0,0,0,0,0,0,0,S,Manchester United,0,0,0.0731491595506668,0.0731491595506668
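The scoreboard cell builds a one-row DataFrame per game and concatenates it onto `final` inside nested loops, which copies the accumulated frame on every step. A common alternative is to collect plain dicts and build the DataFrame once at the end. A sketch under the assumption that each team entry carries `id`, `title` and a `history` list, as the cell's indexing implies; the function name and the sample values are made up:

```python
def history_to_records(teams, season):
    """Return one record per team per game, tagged with team id, name and season."""
    records = []
    for team in teams.values():
        for game in team["history"]:
            record = dict(game)            # copy the per-game stats
            record["id"] = team["id"]
            record["team"] = team["title"]
            record["season"] = str(season)
            records.append(record)
    return records


# Made-up sample in the assumed shape.
sample_teams = {
    "1": {"id": "1", "title": "Team A",
          "history": [{"h_a": "h", "xG": 1.2, "pts": 3},
                      {"h_a": "a", "xG": 0.4, "pts": 0}]},
    "2": {"id": "2", "title": "Team B",
          "history": [{"h_a": "a", "xG": 0.9, "pts": 0}]},
}
rows = history_to_records(sample_teams, 2018)
print(len(rows))
```

Extending a single list across all seasons and calling `pd.DataFrame(rows)` after the loop should yield the same columns that `team_brd` selects.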


### D. Scrape EPL Team Metric Statistics for five seasons - 2014 to 2018
These cover metrics such as attacking play, position on the pitch, formation used, situational play and shot timing.

Example link: https://understat.com/team/Leicester/2015
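The cell below repeats the same steps for each metric group (`situation`, `formation`, `gameState`, `timing`, `shotZone`, `attackSpeed`): suffix the team's own stats with `F`, suffix the nested `against` stats with `A`, then derive `xGD` and per-shot rates. A minimal sketch of that pattern as a single function, assuming each category holds numeric `shots`, `goals` and `xG` plus an `against` dict with the same keys; the function name and the sample numbers are hypothetical:

```python
def _per_shot(xg, shots):
    """xG per shot, or None when there were no shots (avoids division by zero)."""
    return xg / shots if shots else None


def flatten_for_against(group):
    """Turn {category: {stats..., 'against': {stats...}}} into flat rows."""
    rows = []
    for category, stats in group.items():
        row = {"category": category}
        for key, value in stats.items():
            if key != "against":
                row[key + "F"] = value     # stats for the team
        for key, value in stats["against"].items():
            row[key + "A"] = value         # stats conceded to opponents
        row["xGD"] = row["xGF"] - row["xGA"]
        row["xGF/Sh"] = _per_shot(row["xGF"], row["shotsF"])
        row["xGA/Sh"] = _per_shot(row["xGA"], row["shotsA"])
        rows.append(row)
    return rows


# Made-up sample in the assumed shape.
sample_group = {
    "OpenPlay": {"shots": 10, "goals": 2, "xG": 1.5,
                 "against": {"shots": 8, "goals": 1, "xG": 1.0}},
    "FromCorner": {"shots": 0, "goals": 0, "xG": 0.0,
                   "against": {"shots": 4, "goals": 0, "xG": 0.5}},
}
flat = flatten_for_against(sample_group)
print(flat[0]["xGD"])
```

Unlike the pandas division in the cell, this sketch returns `None` for categories with zero shots. Calling it once per metric group would replace the near-identical branches.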

In [68]:
import requests, bs4, os
import json
import numpy as np
import pandas as pd
scrape_link='https://understat.com/team/'
main_link ='https://understat.com/league/EPL/'
season=2018
res = requests.get(main_link+str(season))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "lxml")  # name the parser explicitly to avoid the bs4 parser warning
print('Web Scrape - Get Team Details from Understat.com')
## Team Final Scoreboard JSON Data
teams = soup.find_all('script')
t2=teams[2].text.strip()
t3 =t2.encode('utf8').decode('unicode_escape')
c = t3.find("'")
d =t3.find(");")
teamstr = t3[c+1:d-1]
x = json.loads(teamstr)
l=list(x.keys())
teams=[]
fsituation=pd.DataFrame()
fformation=pd.DataFrame()
fgamestate=pd.DataFrame()
ftiming=pd.DataFrame()
fshotzones=pd.DataFrame()
fattackspeed=pd.DataFrame()
fresult=pd.DataFrame()

for i in range(0,len(l)):
    c =(x[l[i]]['title'])
    d =c.replace(' ','_')
    #print(d)
    teams.append(d)
print('Team Details Extracted')
print("End of Web scraping")
for k in range(0,len(teams)):
    name=teams[k]
    res = requests.get(scrape_link+name+"/"+str(season))
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "lxml")  # name the parser explicitly to avoid the bs4 parser warning
    print('Web Scrape - Get Team Metrics from Understat.com')
    ## Team Final Scoreboard JSON Data
    titles=soup.find('title')
    players = soup.find_all('script')
    v2=players[2].text.strip()
    v3 =v2.encode('utf8').decode('unicode_escape')
    c = v3.find("'")
    d =v3.find(");");
    #print(c)
    newstr = v3[c+1:d-1]
    #print(newstr)
    y = json.loads(newstr)
    print("Iteration "+str(k)+" Team Name "+name+" Data Length "+str(len(y)))
    f1 = list(y.keys())
    team=titles.text.split('|')[0].split(' ')[0]
    description=titles.text.split('|')[0]
    #season='2015'
    situation=pd.DataFrame()
    formation=pd.DataFrame()
    gamestate=pd.DataFrame()
    timing=pd.DataFrame()
    shotzones=pd.DataFrame()
    attackspeed=pd.DataFrame()
    result=pd.DataFrame()
    for i in range(0,len(f1)):
        if (f1[i]=='situation'):
            finaway=pd.DataFrame()
            home = pd.DataFrame.from_dict(y[f1[i]],orient='index')
            home.rename(columns=lambda x:x+'F',inplace=True)
            home.drop(columns='againstF',inplace=True)
            f2 = list(y[f1[i]].keys())
            for j in range(0,len(f2)):
                #print(y[f1[i]][f2[j]])
                against=pd.DataFrame.from_dict(y[f1[i]][f2[j]]['against'],orient='index').transpose()
                against['situation']=f2[j]
                finaway=pd.concat([finaway,against])
            finaway.set_index('situation',inplace=True)
            finaway.rename(columns=lambda x:x+'A',inplace=True)
            situation=pd.concat([home,finaway],axis=1, join='inner')
            situation['xGD']=situation['xGF']-situation['xGA']
            situation['xGF/Sh']=situation['xGF']/situation['shotsF']
            situation['xGA/Sh']=situation['xGA']/situation['shotsA']
            situation['team']=teams[k]
            situation['description']=description
            situation['situation']=situation.index
            situation['season']=season
            situation=situation[['season','team', 'description', 'situation','shotsF', 'goalsF', 'xGF', 'shotsA', 'goalsA', 'xGA', 'xGD', 'xGF/Sh',
           'xGA/Sh']]
            situation.reset_index(inplace=True)
        if (f1[i]=='formation'):
            finaway=pd.DataFrame()
            home = pd.DataFrame.from_dict(y[f1[i]],orient='index')
            home.rename(columns=lambda x:x+'F',inplace=True)
            home.drop(columns='againstF',inplace=True)
            f2 = list(y[f1[i]].keys())
            for j in range(0,len(f2)):
               # print(y[f1[i]][f2[j]])
                against=pd.DataFrame.from_dict(y[f1[i]][f2[j]]['against'],orient='index').transpose()
                against['formation']=f2[j]
                finaway=pd.concat([finaway,against])
            finaway.set_index('formation',inplace=True)
            finaway.rename(columns=lambda x:x+'A',inplace=True)
            formation=pd.concat([home,finaway],axis=1, join='inner')
            formation['xGD']=formation['xGF']-formation['xGA']
            formation['xGF90']=formation['xGF']/(formation['timeF']/90)
            formation['xGA90']=formation['xGA']/(formation['timeF']/90)
            formation['team']=teams[k]
            formation['description']=description
            formation['formation']=formation.index
            formation['season']=season
            formation=formation[['season','team', 'description','formation',  'statF', 'timeF', 'shotsF', 'goalsF', 'xGF', 'shotsA','goalsA', 'xGA', 'xGD', 'xGF90', 'xGA90']]
           # formation.reset_index(inplace=True)
        if (f1[i]=='gameState'):
            finaway=pd.DataFrame()
            home = pd.DataFrame.from_dict(y[f1[i]],orient='index')
            home.rename(columns=lambda x:x+'F',inplace=True)
            home.drop(columns='againstF',inplace=True)
            f2 = list(y[f1[i]].keys())
            for j in range(0,len(f2)):
               #print(y[f1[i]][f2[j]])
                against=pd.DataFrame.from_dict(y[f1[i]][f2[j]]['against'],orient='index').transpose()
                against['gameState']=f2[j]
                finaway=pd.concat([finaway,against])
            finaway.set_index('gameState',inplace=True)
            finaway.rename(columns=lambda x:x+'A',inplace=True)
            gamestate=pd.concat([home,finaway],axis=1, join='inner')
            gamestate['xGD']=gamestate['xGF']-gamestate['xGA']
            gamestate['xGF90']=gamestate['xGF']/(gamestate['timeF']/90)
            gamestate['xGA90']=gamestate['xGA']/(gamestate['timeF']/90)
            gamestate['team']=teams[k]
            gamestate['description']=description
            gamestate['gamestate']=gamestate.index
            gamestate['season']=season
            gamestate=gamestate[['season','team', 'description','gamestate',  'statF', 'timeF', 'shotsF', 'goalsF', 'xGF', 'shotsA','goalsA', 'xGA', 'xGD', 'xGF90', 'xGA90']]
            gamestate.reset_index(inplace=True)
        if (f1[i]=='timing'):
            finaway=pd.DataFrame()
            home = pd.DataFrame.from_dict(y[f1[i]],orient='index')
            home.rename(columns=lambda x:x+'F',inplace=True)
            home.drop(columns='againstF',inplace=True)
            f2 = list(y[f1[i]].keys())
            for j in range(0,len(f2)):
                #print(y[f1[i]][f2[j]])
                against=pd.DataFrame.from_dict(y[f1[i]][f2[j]]['against'],orient='index').transpose()
                against['timing']=f2[j]
                finaway=pd.concat([finaway,against])
            finaway.set_index('timing',inplace=True)
            finaway.rename(columns=lambda x:x+'A',inplace=True)
            timing=pd.concat([home,finaway],axis=1, join='inner')
            timing['xGD']=timing['xGF']-timing['xGA']
            timing['xGF/Sh']=timing['xGF']/timing['shotsF']
            timing['xGA/Sh']=timing['xGA']/timing['shotsA']
            timing['team']=teams[k]
            timing['description']=description
            timing['timing']=timing.index
            timing['season']=season
            timing=timing[['season','team', 'description', 'timing','shotsF', 'goalsF', 'xGF', 'shotsA', 'goalsA', 'xGA', 'xGD', 'xGF/Sh','xGA/Sh']]
        if (f1[i]=='shotZone'):
            finaway=pd.DataFrame()
            home = pd.DataFrame.from_dict(y[f1[i]],orient='index')
            home.rename(columns=lambda x:x+'F',inplace=True)
            home.drop(columns='againstF',inplace=True)
            f2 = list(y[f1[i]].keys())
            for j in range(0,len(f2)):
                #print(y[f1[i]][f2[j]])
                against=pd.DataFrame.from_dict(y[f1[i]][f2[j]]['against'],orient='index').transpose()
                against['shotzones']=f2[j]
                finaway=pd.concat([finaway,against])
            finaway.set_index('shotzones',inplace=True)
            finaway.rename(columns=lambda x:x+'A',inplace=True)
            shotzones=pd.concat([home,finaway],axis=1, join='inner')
            shotzones['xGD']=shotzones['xGF']-shotzones['xGA']
            shotzones['xGF/Sh']=shotzones['xGF']/shotzones['shotsF']
            shotzones['xGA/Sh']=shotzones['xGA']/shotzones['shotsA']
            shotzones['team']=teams[k]
            shotzones['description']=description
            shotzones['shotzones']=shotzones.index
            shotzones['season']=season
            shotzones=shotzones[['season','team', 'description', 'shotzones', 'statF', 'shotsF', 'goalsF', 'xGF', 'shotsA', 'goalsA', 'xGA', 'xGD','xGF/Sh', 'xGA/Sh']]
            # shotzones.reset_index(inplace=True)
        if (f1[i]=='attackSpeed'):
            finaway=pd.DataFrame()
            home = pd.DataFrame.from_dict(y[f1[i]],orient='index')
            home.rename(columns=lambda x:x+'F',inplace=True)
            home.drop(columns='againstF',inplace=True)
            f2 = list(y[f1[i]].keys())
            for j in range(0,len(f2)):
                #print(y[f1[i]][f2[j]])
                against=pd.DataFrame.from_dict(y[f1[i]][f2[j]]['against'],orient='index').transpose()
                against['attackspeed']=f2[j]
                finaway=pd.concat([finaway,against])
            finaway.set_index('attackspeed',inplace=True)
            finaway.rename(columns=lambda x:x+'A',inplace=True)
            attackspeed=pd.concat([home,finaway],axis=1, join='inner')
            attackspeed['xGD']=attackspeed['xGF']-attackspeed['xGA']
            attackspeed['xGF/Sh']=attackspeed['xGF']/attackspeed['shotsF']
            attackspeed['xGA/Sh']=attackspeed['xGA']/attackspeed['shotsA']
            attackspeed['team']=teams[k]
            attackspeed['description']=description
            attackspeed['attackspeed']=attackspeed.index
            attackspeed['season']=season
            attackspeed=attackspeed[['season','team', 'description', 'attackspeed', 'statF', 'shotsF', 'goalsF', 'xGF', 'shotsA', 'goalsA', 'xGA', 'xGD','xGF/Sh', 'xGA/Sh']]
            attackspeed.reset_index(inplace=True)
        if (f1[i]=='result'):
            finaway=pd.DataFrame()
            home = pd.DataFrame.from_dict(y[f1[i]],orient='index')
            home.rename(columns=lambda x:x+'F',inplace=True)
            home.drop(columns='againstF',inplace=True)
            f2 = list(y[f1[i]].keys())
            for j in range(0,len(f2)):
                #print(y[f1[i]][f2[j]])
                against=pd.DataFrame.from_dict(y[f1[i]][f2[j]]['against'],orient='index').transpose()
                against['result']=f2[j]
                finaway=pd.concat([finaway,against])
            finaway.set_index('result',inplace=True)
            finaway.rename(columns=lambda x:x+'A',inplace=True)
            result=pd.concat([home,finaway],axis=1, join='inner')
            result['xGD']=result['xGF']-result['xGA']
            result['xGF/Sh']=result['xGF']/result['shotsF']
            result['xGA/Sh']=result['xGA']/result['shotsA']
            result['team']=teams[k]
            result['description']=description
            result['result']=result.index
            result['season']=season
            result=result[['season','team', 'description', 'result', 'shotsF', 'goalsF', 'xGF', 'shotsA', 'goalsA', 'xGA', 'xGD', 'xGF/Sh','xGA/Sh']]
            # result.reset_index(inplace=True)
    fsituation=pd.concat([fsituation,situation])
    fformation=pd.concat([fformation,formation])
    fgamestate=pd.concat([fgamestate,gamestate])
    ftiming=pd.concat([ftiming,timing])
    fshotzones=pd.concat([fshotzones,shotzones])
    fattackspeed=pd.concat([fattackspeed,attackspeed])
    fresult=pd.concat([fresult,result])
print("Web scraping for "+str(season)+" completed")





Web Scrape - Get Team Details from Understat.com
Team Details Extracted
End of Web scraping
Web Scrape - Get Team Metrics from Understat.com
Iteration 0 Team Name Everton Data Length 7
Web Scrape - Get Team Metrics from Understat.com
Iteration 1 Team Name Bournemouth Data Length 7
Web Scrape - Get Team Metrics from Understat.com
Iteration 2 Team Name Southampton Data Length 7
Web Scrape - Get Team Metrics from Understat.com
Iteration 3 Team Name Leicester Data Length 7
Web Scrape - Get Team Metrics from Understat.com
Iteration 4 Team Name Crystal_Palace Data Length 7
Web Scrape - Get Team Metrics from Understat.com
Iteration 5 Team Name Chelsea Data Length 7
Web Scrape - Get Team Metrics from Understat.com
Iteration 6 Team Name West_Ham Data Length 7
Web Scrape - Get Team Metrics from Understat.com
Iteration 7 Team Name Tottenham Data Length 7
Web Scrape - Get Team Metrics from Understat.com
Iteration 8 Team Name Arsenal Data Length 7
Web Scrape - Get Team Metrics from Understat.com
It
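
The timing, shot-zone, attack-speed and result branches of the scraping cell above repeat the same steps: suffix the team's own columns with `F`, pull the nested `against` dictionary into columns suffixed with `A`, join the two sides, and derive `xGD`, `xGF/Sh` and `xGA/Sh`. A minimal sketch of how that pattern could be factored into one helper follows. The function name `flatten_for_against` and the sample numbers are hypothetical, and the input shape is assumed from how the loop indexes `y[f1[i]]`; it is not the notebook's original code.

```python
import pandas as pd


def flatten_for_against(stats, label):
    """Flatten {category: {..., 'against': {...}}} into one row per category.

    Hypothetical helper: mirrors the steps of the timing/shotZone/attackSpeed/
    result branches, assuming each category holds 'shots', 'goals' and 'xG'
    plus a nested 'against' dict with the same three keys.
    """
    # "For" side: every top-level field except the nested 'against' dict.
    home = pd.DataFrame.from_dict(stats, orient='index')
    home = home.drop(columns='against').add_suffix('F')

    # "Against" side: one row per category, built from the nested dicts.
    away = pd.DataFrame.from_dict(
        {key: value['against'] for key, value in stats.items()},
        orient='index',
    ).add_suffix('A')

    out = pd.concat([home, away], axis=1, join='inner')
    out['xGD'] = out['xGF'] - out['xGA']
    out['xGF/Sh'] = out['xGF'] / out['shotsF']
    out['xGA/Sh'] = out['xGA'] / out['shotsA']
    out[label] = out.index  # keep the category label as a regular column
    return out


# Made-up numbers in the assumed shape (not real Understat data).
sample = {
    '1-15': {'stat': '1-15', 'shots': 10, 'goals': 2, 'xG': 1.5,
             'against': {'shots': 8, 'goals': 1, 'xG': 1.0}},
    '16-30': {'stat': '16-30', 'shots': 4, 'goals': 0, 'xG': 0.5,
              'against': {'shots': 5, 'goals': 1, 'xG': 0.75}},
}
timing_demo = flatten_for_against(sample, 'timing')
print(timing_demo[['timing', 'shotsF', 'xGF', 'shotsA', 'xGA', 'xGD']])
```

With a helper like this, each branch would reduce to one call plus the `team`, `description` and `season` assignments. The gamestate branch would still need its own per-90 columns.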

#### Check that all seven datasets scraped from Understat contain 20 teams in each run

In [69]:
# Each scraped dataset should cover all 20 EPL teams for the season.
datasets = [
    ("Situation", fsituation),
    ("Formation", fformation),
    ("Gamestate", fgamestate),
    ("Timing", ftiming),
    ("Shotzones", fshotzones),
    ("Attackspeed", fattackspeed),
    ("Result", fresult),
]
for name, frame in datasets:
    status = "Pass" if len(frame.team.unique()) == 20 else "Fail"
    print(status + " - " + name)

Pass - Situation
Pass - Formation
Pass - Gamestate
Pass - Timing
Pass - Shotzones
Pass - Attackspeed
Pass - Result


#### Export as CSV for data engineering and EDA

In [70]:
fsituation.to_csv('situation.csv')
fformation.to_csv('formation.csv')
fgamestate.to_csv('gamestate.csv')
ftiming.to_csv('timing.csv')
fshotzones.to_csv('shotzones.csv')
fattackspeed.to_csv('attackspeed.csv')
fresult.to_csv('result.csv')
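
The export above writes to fixed file names, so re-running the notebook for another season would overwrite the previous files. A hedged sketch of one way to avoid that is to put the season in the file name. The helper `export_frames`, the `<name>_<season>.csv` naming scheme and the demo row are assumptions for illustration, not part of the original workflow. It also passes `index=False`, on the assumption that the row labels are not needed because the category is already stored as a column; that changes the CSV layout, so any downstream reader that expects the index column would need adjusting.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_frames(frames, season, out_dir='.'):
    """Write each DataFrame to '<name>_<season>.csv' and return the paths.

    Hypothetical helper; the naming scheme is an assumption.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, frame in frames.items():
        target = out_path / f'{name}_{season}.csv'
        # index=False skips the row labels when writing the file.
        frame.to_csv(target, index=False)
        written.append(target)
    return written


# Made-up single-row frame, written to a temporary directory for the demo.
demo = {'timing': pd.DataFrame({'season': [2018],
                                'team': ['Everton'],
                                'xGF': [1.5]})}
with tempfile.TemporaryDirectory() as tmp:
    paths = export_frames(demo, 2018, tmp)
    names = [p.name for p in paths]
    round_trip = pd.read_csv(paths[0])
print(names)
```

In the notebook this could be called as `export_frames({'situation': fsituation, 'formation': fformation}, season)`, extended with the remaining five frames.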

In [53]:
fsituation.team.unique()

array(['Aston_Villa', 'Everton', 'Bournemouth', 'Southampton',
       'Leicester', 'West_Bromwich_Albion', 'Sunderland',
       'Crystal_Palace', 'Norwich', 'Chelsea', 'West_Ham', 'Tottenham',
       'Arsenal', 'Swansea', 'Stoke', 'Newcastle_United', 'Liverpool',
       'Manchester_City', 'Manchester_United', 'Watford'], dtype=object)