# Collecting Data from ESPN

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import re

I am going to start with one MLS team, LAFC. This lets me build and test the scraper on a single team before expanding it to the rest of the league. My process for this scraper is to:

1. Create a list of all game IDs for LAFC over the course of their history
2. Use these game IDs to pull game data from the ESPN website
3. Remove any games that have happened in the 2020 season so far
4. Expand this data collection to include all teams in the MLS
5. Use this data to create a model to predict the results of the 2020 MLS season (if it ever actually happens)

### 1. List of game IDs for LAFC

#### 1.a First attempt to pull game IDs

In [2]:
#request the page and use Beautiful Soup to parse out the content
lafc_page=requests.get("https://global.espn.com/soccer/team/_/id/18966/lafc")

lafc_soup= BeautifulSoup(lafc_page.content, "html.parser")

# find the body of the webpage
[type(item) for item in list(lafc_soup.children)]

lafc_body=list(lafc_soup.children)[3]
lafc_body

#find the fixture data (attrs is a dict: attribute name -> value)
lafc_fixtures=lafc_body.findAll("section",{"class":"col-b"})[0]

#find the links to each game's webpage
lafc_games=lafc_fixtures.findAll("a",{"class":"competitors"})


#create a list of LAFC game IDs
lafc_game_id=[]
for game in lafc_games:
    if game.has_attr("href"):
        lafc_game_id.append(game["href"][-6:])
        
print(lafc_game_id)

['560836', '560829', '560541', '561813', '569598', '569593', '561794', '561786', '561768']


<b><i>Looking at the list of game IDs, it seems I was only able to pull IDs for a few of LAFC's games. On closer inspection of the page I scraped, these are IDs from the 2020 season only, which are exactly the games I do not want. The list also includes CONCACAF Champions League games, which I may or may not want to include at this point.</i></b>

In [4]:
#rename the lafc_game_id list to lafc_game_id_2020; the code is kept in the comments below.

    ##lafc_game_id_2020 = []
    ##lafc_game_id_2020=lafc_game_id
    ##del(lafc_game_id)
    
lafc_game_id_2020

['560836',
 '560829',
 '560541',
 '561813',
 '569598',
 '569593',
 '561794',
 '561786',
 '561768']

#### 1.b Reworking the scraper above to pull all game data for LAFC

In [5]:
#request the page and use Beautiful Soup to parse out the content
lafc_page= requests.get("https://www.espn.com/soccer/team/results/_/id/18966/season")

lafc_soup= BeautifulSoup(lafc_page.content, "html.parser")

# find the body of the webpage
[type(item) for item in list(lafc_soup.children)]
lafc_body=list(lafc_soup.children)[3]
lafc_body

#Find all the games in the body
lafc_games=lafc_body.findAll("tr",{"class":"Table__TR Table__TR--sm Table__even"})

print("Number of Games for LAFC:",len(lafc_games))
print("")
print("Output for one game")
print("-----")
print(lafc_games[0])

Number of Games for LAFC: 89

Output for one game
-----
<tr class="Table__TR Table__TR--sm Table__even" data-idx="0"><td class="Table__TD"><div class="matchTeams">Sun, Mar 8</div></td><td class="Table__TD"><div class="local flex items-center"><a class="AnchorLink Table__Team" href="/soccer/team/_/id/18966/lafc" tabindex="0">LAFC</a></div></td><td class="Table__TD"><span class="Table__Team score"><a class="AnchorLink" href="/soccer/team/_/id/18966/lafc" tabindex="0"><figure class="Image aspect-ratio--parent Logo Logo__sm"><div class="Image__Wrapper aspect-ratio--1x1"></div></figure></a><a class="AnchorLink" href="/soccer/match/_/gameId/561813" tabindex="0">3 - 3</a><a class="AnchorLink" href="/soccer/team/_/id/10739/philadelphia-union" tabindex="0"><figure class="Image aspect-ratio--parent Logo Logo__sm"><div class="Image__Wrapper asp

<b><i>Looking at the output for an individual game, there are multiple links and classes that contain AnchorLink. I decided to find all links in the body whose text is FT or FT-Pens and extract the game ID from each one.</i></b>

In [6]:
#findall links where text is FT or FT-Pens
links=lafc_body.findAll("a",{"class":"AnchorLink"},text=("FT","FT-Pens"))
print("Number of links:",len(links))
print("")
print("Output for one link")
print("-----")
print(links[0])

Number of links: 89

Output for one link
-----
<a class="AnchorLink" href="/soccer/match/_/gameId/561813" tabindex="0">FT</a>


In [7]:
#find game_id's

lafc_game_ids=[]
for link in links:
    #append the last 6 characters of the link, which hold the game ID
    lafc_game_ids.append(link["href"][-6:])
print("Number of games:",len(lafc_game_ids))
print("")
print("Output for game")
print("-----")
print(lafc_game_ids[0])
print("")
#Does the number of games = links = game_ids?
print("We collected all of the game_IDs:",len(lafc_games)==len(links)==len(lafc_game_ids))

Number of games: 89

Output for game
-----
561813

We collected all of the game_IDs: True
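
Slicing the last six characters works for the links above, but it silently assumes every game ID is exactly six digits long. A minimal sketch of an alternative, assuming the ID always follows `gameId/` or `gameId=` in the link (both forms appear in this notebook); `extract_game_id` and `GAME_ID_RE` are hypothetical names, not part of the scraper:

```python
import re

# Matches the digits after "gameId/" or "gameId=" in an ESPN match link.
GAME_ID_RE = re.compile(r"gameId[/=](\d+)")

def extract_game_id(href):
    """Return the numeric game ID found in href, or None if there is none."""
    match = GAME_ID_RE.search(href)
    return match.group(1) if match else None

print(extract_game_id("/soccer/match/_/gameId/561813"))  # 561813
print(extract_game_id("/soccer/team/_/id/18966/lafc"))   # None
```

Returning None for links without a game ID makes it easy to filter out team links before building the list.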


### 2. Accessing Game Data

In [8]:
def home_stats(game_ids):
    game_date=[]
    game_id=[]
    event=[]
    team=[]
    opponent=[]
    home=[]
    goals=[]
    fouls=[]
    yellow_cards=[]
    red_cards=[]
    offsides=[]
    corners=[]
    saves=[]
    possession=[]
    shots=[]
    shots_on_goal=[]
    
    for game in game_ids:
        page=requests.get("https://www.espn.com/soccer/match?gameId="+game)
        soup=BeautifulSoup(page.content,"html.parser")
        if soup.find("div",{"id":"gamepackage-game-information"}):
            date_data = soup.find("div",{"id":"gamepackage-game-information"})
            game_date.append(np.datetime64(date_data.find("span",{"data-behavior":"date_time"})["data-date"]))
        else:
            game_date.append(None)
        game_id.append(game)
        #event for match
        if soup.find("div",{"class":"game-details header"}):
            event.append(soup.find("div",{"class":"game-details header"}).text.replace("\n","").replace("  ",""))
        else:
            event.append(None)
        #home indicator (1.0 = home side)
        home.append(1.0)
        #home team: on this page the home side sits in the div whose class contains "team away"
        regex = re.compile('.*team away.*')
        home_team=soup.find("div",{"class":regex})
        if home_team:
            team.append(home_team.find("span",{"class":"long-name"}).text)
        else:
            team.append(None)
        #opponent
        opp_regex = re.compile('.*team home.*')
        opp=soup.find("div",{"class":opp_regex})
        if opp:
            opponent.append(opp.find("span",{"class":"long-name"}).text)
        else:
            opponent.append(None)
        #home_goals (None when the team block or the score is missing)
        score=home_team.find("span",{"class":"score icon-font-after"}) if home_team else None
        goals.append(score.text.replace("\n","").replace("\t","") if score else None)
                #stats
        #stats: gather what the page offers first, so a missing stat becomes None
        #and every list still receives exactly one value per game
        found={}
        poss_value=shots_value=sog_value=None
        statistics=soup.find("div",{"class":"stat-list"})
        if statistics:
            for stat in statistics.findAll("td",{"data-home-away":"home"}):
                found[stat.get("data-stat")]=stat.text
            vis_stats=soup.find("div",{"class":"data-vis"})
            if vis_stats:
                for p in vis_stats.findAll("span",{"class":"chartValue"}):
                    if p.get("data-home-away") == "home":
                        poss_value=float(p.text.replace("%",""))/100
                for shot_stat in vis_stats.findAll("span",{"class":"number"}):
                    if shot_stat.get("data-home-away") == "home":
                        s=shot_stat.text.split(" ")
                        shots_value=float(s[0])
                        sog_value=float(s[1].replace("(","").replace(")",""))
        fouls.append(found.get("foulsCommitted"))
        yellow_cards.append(found.get("yellowCards"))
        red_cards.append(found.get("redCards"))
        offsides.append(found.get("offsides"))
        corners.append(found.get("wonCorners"))
        saves.append(found.get("saves"))
        possession.append(poss_value)
        shots.append(shots_value)
        shots_on_goal.append(sog_value)
            
                
    home_results=pd.DataFrame(zip(game_date,game_id,event,team,opponent,home,goals,fouls,yellow_cards,red_cards,
                                  offsides,corners,saves,possession,shots,shots_on_goal),
                         columns=["game_date","game_id","event", "team","opponent","home","goals","fouls",
                                  "yellow_cards", "red_cards", "offsides","corners","saves","possession",
                                  "shots","shots_on_goal"])
    
    return home_results
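
The shot parsing in `home_stats` splits the text on a space, which assumes it always looks like `16 (3)`: total shots, then shots on goal in parentheses. That format is inferred from the parsing code, not confirmed against the page. A regular expression makes the assumption explicit and returns None values instead of raising when the text has another shape. `parse_shots` is a hypothetical helper, sketched here only to illustrate the idea:

```python
import re

# Assumed format: "<shots> (<shots on goal>)", e.g. "16 (3)".
SHOTS_RE = re.compile(r"^\s*(\d+)\s*\((\d+)\)\s*$")

def parse_shots(text):
    """Return (shots, shots_on_goal) as floats, or (None, None) if unparseable."""
    match = SHOTS_RE.match(text)
    if match is None:
        return None, None
    return float(match.group(1)), float(match.group(2))

print(parse_shots("16 (3)"))  # (16.0, 3.0)
print(parse_shots("n/a"))     # (None, None)
```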

In [9]:
def away_stats(game_ids):
    game_date=[]
    game_id=[]
    event=[]
    team=[]
    opponent=[]
    home=[]
    goals=[]
    fouls=[]
    yellow_cards=[]
    red_cards=[]
    offsides=[]
    corners=[]
    saves=[]
    possession=[]
    shots=[]
    shots_on_goal=[]
    
    for game in game_ids:
        page=requests.get("https://www.espn.com/soccer/match?gameId="+game)
        soup=BeautifulSoup(page.content,"html.parser")
        if soup.find("div",{"id":"gamepackage-game-information"}):
            date_data = soup.find("div",{"id":"gamepackage-game-information"})
            game_date.append(np.datetime64(date_data.find("span",{"data-behavior":"date_time"})["data-date"]))
        else:
            game_date.append(None)
        game_id.append(game)
        #event for match
        if soup.find("div",{"class":"game-details header"}):
            event.append(soup.find("div",{"class":"game-details header"}).text.replace("\n","").replace("  ",""))
        else:
            event.append(None)
        #home indicator (0.0 = away side)
        home.append(0.0)
        #away team: on this page the away side sits in the div whose class contains "team home"
        regex = re.compile('.*team home.*')
        away_team=soup.find("div",{"class":regex})
        if away_team:
            team.append(away_team.find("span",{"class":"long-name"}).text)
        else:
            team.append(None)
        #opponent
        opp_regex = re.compile('.*team away.*')
        opp=soup.find("div",{"class":opp_regex})
        if opp:
            opponent.append(opp.find("span",{"class":"long-name"}).text)
        else:
            opponent.append(None)
        #away_goals (None when the team block or the score is missing)
        score=away_team.find("span",{"class":"score icon-font-before"}) if away_team else None
        goals.append(score.text.replace("\n","").replace("\t","") if score else None)
                #stats
        #stats: gather what the page offers first, so a missing stat becomes None
        #and every list still receives exactly one value per game
        found={}
        poss_value=shots_value=sog_value=None
        statistics=soup.find("div",{"class":"stat-list"})
        if statistics:
            for stat in statistics.findAll("td",{"data-home-away":"away"}):
                found[stat.get("data-stat")]=stat.text
            vis_stats=soup.find("div",{"class":"data-vis"})
            if vis_stats:
                for p in vis_stats.findAll("span",{"class":"chartValue"}):
                    if p.get("data-home-away") == "away":
                        poss_value=float(p.text.replace("%",""))/100
                for shot_stat in vis_stats.findAll("span",{"class":"number"}):
                    if shot_stat.get("data-home-away") == "away":
                        s=shot_stat.text.split(" ")
                        shots_value=float(s[0])
                        sog_value=float(s[1].replace("(","").replace(")",""))
        fouls.append(found.get("foulsCommitted"))
        yellow_cards.append(found.get("yellowCards"))
        red_cards.append(found.get("redCards"))
        offsides.append(found.get("offsides"))
        corners.append(found.get("wonCorners"))
        saves.append(found.get("saves"))
        possession.append(poss_value)
        shots.append(shots_value)
        shots_on_goal.append(sog_value)
            
                
    away_results=pd.DataFrame(zip(game_date,game_id,event,team,opponent,home,goals,fouls,yellow_cards,red_cards,
                                  offsides,corners,saves,possession,shots,shots_on_goal),
                         columns=["game_date","game_id","event", "team","opponent","home","goals","fouls",
                                  "yellow_cards", "red_cards", "offsides","corners","saves","possession",
                                  "shots","shots_on_goal"])
    
    return away_results
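
`home_stats` and `away_stats` differ only in a handful of values: the home flag, the two class patterns, the score class, and the `data-home-away` value. One way to avoid maintaining two near-identical functions would be to keep those values in a small lookup table and pass the side as an argument. The sketch below shows only that table; `SideConfig` and `SIDES` are hypothetical names, and the strings are copied from the two functions above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SideConfig:
    home_flag: float       # value stored in the "home" column
    team_pattern: str      # regex for the div holding this side's team
    opponent_pattern: str  # regex for the div holding the other team
    score_class: str       # class of the span holding this side's score
    data_side: str         # value of the data-home-away attribute

# Everything that differs between the two functions, in one place.
SIDES = {
    "home": SideConfig(1.0, ".*team away.*", ".*team home.*",
                       "score icon-font-after", "home"),
    "away": SideConfig(0.0, ".*team home.*", ".*team away.*",
                       "score icon-font-before", "away"),
}

for name, cfg in SIDES.items():
    print(name, cfg.home_flag, cfg.team_pattern, cfg.score_class)
```

A single function could then take `SIDES["home"]` or `SIDES["away"]` and run the same parsing steps with those values.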

In [13]:
def get_results(tournament,game_ids):
    home=home_stats(game_ids)
    away=away_stats(game_ids)
    
    #DataFrame.append was removed in pandas 2.0, so stack the two frames with pd.concat
    results=pd.concat([home,away])
    results.sort_values(["game_id"],inplace=True)
    results.reset_index(drop=True,inplace=True)
    
    results.to_csv(tournament+".csv")
    return results.head(5)

In [14]:
# Remove all games for 2020 season

lafc_game_ids=[x for x in lafc_game_ids if x not in lafc_game_id_2020]
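
The comprehension above checks every ID against a list, which is fine for 89 games. Once the scraper covers every MLS team, converting the exclusion list to a set keeps each lookup cheap while preserving the original order. A small sketch using IDs taken from the outputs above; `drop_excluded` is a hypothetical helper name:

```python
def drop_excluded(game_ids, excluded):
    """Return game_ids in their original order, without any ID in excluded."""
    excluded_set = set(excluded)  # set membership tests do not scan a list
    return [g for g in game_ids if g not in excluded_set]

all_ids = ["561813", "502360", "502377", "561794"]
ids_2020 = ["561813", "561794"]
print(drop_excluded(all_ids, ids_2020))  # ['502360', '502377']
```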

In [15]:
get_results("lafc_games",lafc_game_ids)



Unnamed: 0,game_date,game_id,event,team,opponent,home,goals,fouls,yellow_cards,red_cards,offsides,corners,saves,possession,shots,shots_on_goal
0,2018-10-28 20:30:00,502360,"2018 Major League Soccer, Regular Season",LAFC,Sporting Kansas City,0.0,1,12,1,0,1,4,5,0.56,16.0,3.0
1,2018-10-28 20:30:00,502360,"2018 Major League Soccer, Regular Season",Sporting Kansas City,LAFC,1.0,2,13,3,1,0,2,1,0.44,8.0,7.0
2,2018-10-21 21:00:00,502377,"2018 Major League Soccer, Regular Season",Vancouver Whitecaps,LAFC,0.0,2,18,3,0,3,1,3,0.27,10.0,4.0
3,2018-10-21 21:00:00,502377,"2018 Major League Soccer, Regular Season",LAFC,Vancouver Whitecaps,1.0,2,11,1,0,0,4,2,0.73,20.0,5.0
4,2018-10-13 02:00:00,502378,"2018 Major League Soccer, Regular Season",Houston Dynamo,LAFC,0.0,2,13,2,0,0,2,4,0.4,9.0,5.0


The file above shows LAFC's games since the beginning of their campaign in 2018. Each game_id has two rows, one for each team in the match. The home column marks the home team with 1.0 and the away team with 0.0. All other data points are fairly self-explanatory and will be explored further during the next phase.
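
Because each match contributes two rows, a quick sanity check is to group the rows by game_id and confirm that every group has exactly one home row and one away row. The sketch below does this with the standard library on three of the rows shown above, trimmed to a few columns; `check_pairs` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Three rows from the output above, trimmed to the columns needed here.
# Game 502378 has only one row because the preview stops after five rows.
sample = """game_id,team,opponent,home,goals
502360,LAFC,Sporting Kansas City,0.0,1
502360,Sporting Kansas City,LAFC,1.0,2
502378,Houston Dynamo,LAFC,0.0,2
"""

def check_pairs(rows):
    """Return the game_ids that lack exactly one home and one away row."""
    flags = defaultdict(list)
    for row in rows:
        flags[row["game_id"]].append(float(row["home"]))
    return [gid for gid, f in flags.items() if sorted(f) != [0.0, 1.0]]

rows = list(csv.DictReader(io.StringIO(sample)))
print(check_pairs(rows))  # ['502378']
```

On the full CSV the expectation would be an empty list; any game_id it returns points to a match where one side failed to scrape.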