# Collecting Data from ESPN

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import re

I am going to start by working with a single MLS team, LAFC. This lets me build the scraper for one team first and then expand from there. My process for this scraper is to:

1. Create a list of all game IDs for LAFC over the course of their history
2. Use these game IDs to pull game data from the ESPN website
3. Remove any games that have been played in the 2020 season so far
4. Expand this data collection to include all teams in the MLS
5. Use this data to create a model to predict the results of the 2020 MLS season (if it ever actually happens)

### 1. List of game IDs for LAFC

#### 1.a First attempt to pull game IDs

In [2]:
#request the page and use Beautiful Soup to parse out the content
lafc_page=requests.get("https://global.espn.com/soccer/team/_/id/18966/lafc")
lafc_page

lafc_soup= BeautifulSoup(lafc_page.content, "html.parser")
lafc_soup


<!DOCTYPE html>

<html class="no-icon-fonts" lang="en">
<head>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="IE=edge,chrome=1" http-equiv="x-ua-compatible"/>
<meta content="initial-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport"/>
<meta content="origin-when-cross-origin" name="referrer"/>
<link href="https://global.espn.com/football/team/_/id/18966/lafc" rel="canonical"/>
<title>LAFC  News and Scores - ESPN</title>
<meta content="Get the latest LAFC news, scores, stats, standings, rumors, and more from ESPN." name="description">
<meta content="lafc, soccer, soccer, scores, news, schedule, highlights" name="news_keywords">
<meta content="lafc, soccer, soccer, scores, news, schedule, highlights" name="keywords"/>
<meta content="116656161708917" property="fb:app_id"/>
<meta content="ESPN.com" property="og:site_name"/>
<meta content="https://global.espn.com/football/team/_/id/18966/lafc" property="og:url"/>
<meta content="LAFC  News an

In [3]:
# find the body of the webpage
[type(item) for item in list(lafc_soup.children)]

lafc_body=list(lafc_soup.children)[3]
lafc_body

#find the fixture data
lafc_fixtures=lafc_body.findAll("section",{"class":"col-b"})[0]

#find the link to each game's webpage
lafc_games=lafc_fixtures.findAll("a",{"class":"competitors"})


#create a list of LAFC game IDs
lafc_game_id=[]
for game in lafc_games:
    if game.has_attr("href"):
        lafc_game_id.append(game["href"][-6:])
        
print(lafc_game_id)

['560836', '560829', '560541', '561813', '569598', '569593', '561794', '561786', '561768']


<b><i>Looking at the list of game IDs, it seems I have only been able to pull IDs for a few of LAFC's games. On further inspection of the webpage I used at the beginning, these are game IDs from the 2020 season only, which are exactly the IDs I do not want. It also looks like I have included CONCACAF Champions League games, which at this point I may or may not want to include.</i></b>

In [4]:
lafc_games[0]

<a class="competitors" href="/football/matchstats?gameId=560836"><div class="team team-a"><div class="team__content"><div class="team__banner"><div class="team__banner__wrapper"><picture><source srcset="https://a.espncdn.com/combiner/i?img=/i/teamlogos/soccer/500/228.png&amp;h=40&amp;w=40"/><img class="team-logo" data-default-src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/soccer/500/228.png&amp;h=40&amp;w=40" src="https://a.espncdn.com/combiner/i?img=/i/teamlogos/soccer/500/228.png&amp;h=40&amp;w=40"/></picture><svg class="team__svg team__svg--primary" viewbox="0 0 176 80" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"><defs><lineargradient gradienttransform="translate(-5 -710)" gradientunits="userSpaceOnUse" id="linear-gradient" x1="108.61" x2="90.78" y1="752.49" y2="770.32"><stop offset="0" stop-opacity="0"></stop><stop offset="1"></stop></lineargradient></defs><polygon class="team__stroke" points="81 80 80 80 0 0 1 0 81 80"></polygon><polygon c

In [5]:
#rename the lafc_game_id list to lafc_game_id_2020; the code is in the comments below.

    ##lafc_game_id_2020 = []
    ##lafc_game_id_2020=lafc_game_id
    ##del(lafc_game_id)
    
lafc_game_id_2020

['560836',
 '560829',
 '560541',
 '561813',
 '569598',
 '569593',
 '561794',
 '561786',
 '561768']

#### 1.b Reworking the scraper above to pull all game data for LAFC

In [6]:
#request the page and use Beautiful Soup to parse out the content
lafc_page= requests.get("https://www.espn.com/soccer/team/results/_/id/18966/season")

lafc_soup= BeautifulSoup(lafc_page.content, "html.parser")
lafc_soup


<!DOCTYPE doctype html>

<html lang="en">
<head>
<!-- ESPNFITT | 4d4b5641d13b | 3596 | 1d4236e078642a9e7f7732436b3c1c306ce20400 | Thu, 19 Mar 2020 22:53:16 GMT -->
<script type="text/javascript">
        ;(function(){
            function gc(n){var r=document.cookie.match("(^|;) ?"+n+"=([^;]*)(;|$)");return r?r[2]:null}function sc(n){document.cookie=n}function smpl(n){var r=n/100;return!!r&&Math.random()<=r}var _nr=!1,_nrCookie=gc("_nr");null!==_nrCookie?"1"===_nrCookie&&(_nr=!0):smpl(100)?(_nr=!0,sc("_nr=1; path=/")):(_nr=!1,sc("_nr=0; path=/"));;
            _nr && window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var o=e[n]={exports:{}};t[n][0].call(o.exports,function(e){var o=t[n][1][e];return r(o||e)},o,o.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({1:[function(t,e,n){function r(t){try{s.console&&console.log(t)}catch(e){}}var o,i=t("ee"),a=t(23),s={};try{o=localStorage

In [7]:
# find the body of the webpage
[type(item) for item in list(lafc_soup.children)]
lafc_body=list(lafc_soup.children)[3]
lafc_body

<html lang="en">
<head>
<!-- ESPNFITT | 4d4b5641d13b | 3596 | 1d4236e078642a9e7f7732436b3c1c306ce20400 | Thu, 19 Mar 2020 22:53:16 GMT -->
<script type="text/javascript">
        ;(function(){
            function gc(n){var r=document.cookie.match("(^|;) ?"+n+"=([^;]*)(;|$)");return r?r[2]:null}function sc(n){document.cookie=n}function smpl(n){var r=n/100;return!!r&&Math.random()<=r}var _nr=!1,_nrCookie=gc("_nr");null!==_nrCookie?"1"===_nrCookie&&(_nr=!0):smpl(100)?(_nr=!0,sc("_nr=1; path=/")):(_nr=!1,sc("_nr=0; path=/"));;
            _nr && window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var o=e[n]={exports:{}};t[n][0].call(o.exports,function(e){var o=t[n][1][e];return r(o||e)},o,o.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({1:[function(t,e,n){function r(t){try{s.console&&console.log(t)}catch(e){}}var o,i=t("ee"),a=t(23),s={};try{o=localStorage.getItem("__nr_flags").spl
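
Picking the root element with `list(lafc_soup.children)[3]` depends on how many doctype and whitespace nodes come before the `<html>` tag, which can differ from page to page. A minimal sketch of a position-independent lookup is shown below. It assumes a small stand-in document; `sample_html`, `sample_soup`, and `sample_root` are illustrative names that are not part of the notebook.

```python
from bs4 import BeautifulSoup

# Small stand-in document; the real ESPN page is far larger.
sample_html = """<!DOCTYPE html>

<html lang="en">
<head><title>Sample</title></head>
<body><table><tr class="Table__TR Table__TR--sm Table__even"><td>Sun, Mar 8</td></tr></table></body>
</html>"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Look the root element up by tag name instead of by its position among the children.
sample_root = sample_soup.find("html")

print(sample_root.name)
print(len(sample_root.find_all("tr")))
```

In the notebook, the same idea would read `lafc_body = lafc_soup.find("html")`, and the later `findAll` calls would work on it unchanged.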

In [8]:
#Find all the games in the body
lafc_games=lafc_body.findAll("tr",{"class":"Table__TR Table__TR--sm Table__even"})

print("Number of Games for LAFC:",len(lafc_games))
print("")
print("Output for one game")
print("-----")
print(lafc_games[0])

Number of Games for LAFC: 89

Output for one game
-----
<tr class="Table__TR Table__TR--sm Table__even" data-idx="0"><td class="Table__TD"><div class="matchTeams">Sun, Mar 8</div></td><td class="Table__TD"><div class="local flex items-center"><a class="AnchorLink Table__Team" href="/soccer/team/_/id/18966/lafc" tabindex="0">LAFC</a></div></td><td class="Table__TD"><span class="Table__Team score"><a class="AnchorLink" href="/soccer/team/_/id/18966/lafc" tabindex="0"><figure class="Image aspect-ratio--parent Logo Logo__sm"><div class="Image__Wrapper aspect-ratio--1x1"></div></figure></a><a class="AnchorLink" href="/soccer/match/_/gameId/561813" tabindex="0">3 - 3</a><a class="AnchorLink" href="/soccer/team/_/id/10739/philadelphia-union" tabindex="0"><figure class="Image aspect-ratio--parent Logo Logo__sm"><div class="Image__Wrapper asp

<b><i>Looking at the output for the individual game, there are multiple links and classes that contain AnchorLink. I have decided to find all the links in the body whose text is FT or FT-Pens. From these I can extract the game ID.</i></b>
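
The first attempt took the last six characters of each link, which works as long as every game ID has exactly six digits and sits at the very end of the link. As a hedged alternative, the sketch below pulls the ID out with a regular expression that covers both link shapes seen so far (`?gameId=...` and `/gameId/...`). The names `GAME_ID_PATTERN` and `extract_game_id` are hypothetical helpers, not part of the notebook.

```python
import re

# Matches the digits that follow "gameId=" or "gameId/"; both forms appear in the links above.
GAME_ID_PATTERN = re.compile(r"gameId[=/](\d+)")


def extract_game_id(href):
    """Return the game ID embedded in a match link, or None if there is none."""
    match = GAME_ID_PATTERN.search(href)
    return match.group(1) if match else None


print(extract_game_id("/soccer/match/_/gameId/561813"))       # 561813
print(extract_game_id("/football/matchstats?gameId=560836"))  # 560836
print(extract_game_id("/soccer/team/_/id/18966/lafc"))        # None
```

Returning `None` for links without a game ID makes it easy to skip team links and other non-match anchors when looping over a page.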

In [9]:
#find all links whose text is FT or FT-Pens
links=lafc_body.findAll("a",{"class":"AnchorLink"},text=("FT","FT-Pens"))
print("Number of links:",len(links))
print("")
print("Output for one link")
print("-----")
print(links[0])

Number of links: 89

Output for one link
-----
<a class="AnchorLink" href="/soccer/match/_/gameId/561813" tabindex="0">FT</a>


In [10]:
#find the game IDs

lafc_game_ids=[]
for link in links:
    #append the last 6 characters of the link (the game ID) to the list
    lafc_game_ids.append(link["href"][-6:])
print("Number of games:",len(lafc_game_ids))
print("")
print("Output for game")
print("-----")
print(lafc_game_ids[0])
print("")
#Does the number of games = links = game_ids?
print("We collected all of the game_IDs:",len(lafc_games)==len(links)==len(lafc_game_ids))

Number of games: 89

Output for game
-----
561813

We collected all of the game_IDs: True
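
With the full list and the 2020 list both available, step 3 of the process (dropping 2020 games) and the URL building for step 2 can be sketched as follows. This is only an illustration under assumptions: `drop_2020_games` and `match_url` are hypothetical helpers, the sample list mixes one real ID (`561813`, which appears in both lists above) with two made-up placeholders, and the match URL is assumed to be the ESPN host joined with the relative link seen in the output above.

```python
from urllib.parse import urljoin

# 2020 game IDs, copied from the output of the first attempt.
lafc_game_id_2020 = ['560836', '560829', '560541', '561813', '569598',
                     '569593', '561794', '561786', '561768']


def drop_2020_games(all_ids, ids_2020):
    """Return all_ids in their original order, without any ID found in ids_2020."""
    excluded = set(ids_2020)
    return [game_id for game_id in all_ids if game_id not in excluded]


def match_url(game_id):
    """Build a match page URL (assumed pattern: host + the relative link seen above)."""
    return urljoin("https://www.espn.com", f"/soccer/match/_/gameId/{game_id}")


# Hypothetical sample: "561813" is real, the other two IDs are placeholders.
sample_ids = ["561813", "000001", "000002"]

kept_ids = drop_2020_games(sample_ids, lafc_game_id_2020)
print(kept_ids)
print([match_url(game_id) for game_id in kept_ids])
```

In the notebook, `sample_ids` would be replaced by `lafc_game_ids`, leaving only the games played before the 2020 season.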
