<h1 align=center> Scraping Championship data</h1>

# Introduction

The purpose of this notebook is to collect match data from the <b>English Championship</b>. All the information comes from the webpage www.flashscore.com. To scrape it, we will use Selenium to remotely control the browser and BeautifulSoup to parse the page's HTML, which contains the information we want.

## Import the libraries

In [1]:
import pandas as pd 
import numpy as np
from time import sleep
from bs4 import BeautifulSoup as bs #import beautiful soup

#! pip install selenium 
from selenium import webdriver

## Read the web page

It is important to know the structure of the page we are going to scrape. To explore it, we will open this match: <a href=https://www.flashscore.com/match/faNi9ZPf/#match-statistics;0> Reading vs Birmingham</a>

The first step is to create a browser. Check the version of the browser you want to use (Chrome, Firefox, Edge, etc.) and download the matching 'driver'. Save the driver in a location of your choice and note its file path, which is needed in the next cell. The `executable_path` argument used here is the Selenium 3 way of passing that path; Selenium 4 expects a `Service` object instead.

In [2]:
browser = webdriver.Chrome(executable_path = r'C:\Users\Aldo\Documents\Data science\Projects\web_scraping\chromedriver87.exe')

Put the page URL in a variable

In [3]:
url='https://www.flashscore.com/match/faNi9ZPf/#match-statistics;0'

Make a BeautifulSoup element 

In [4]:
#open the webpage
browser.get(url)
#Save the html code in a variable
html = browser.page_source
soup = bs(html, 'html.parser')


#close all pages
#browser.quit()

## Getting data from the webpage

Now that the HTML code is saved as a BeautifulSoup object, it is possible to look for specific data. We can look up an element by its id or by its class; keep in mind that several elements can share the same class.
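As a minimal sketch of the difference between the two lookups, the snippet below parses a small hand-written HTML fragment (an illustration, not a copy of the real page). An id lookup returns one element, while a class lookup returns a list.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration only; the real page is far larger.
sample_html = """
<div id="utime">09.12.2020 13:45</div>
<div class="statText">64%</div>
<div class="statText">36%</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# An id should be unique on a page, so find() returns the single matching tag.
date_tag = sample_soup.find("div", {"id": "utime"})
print(date_tag.text)

# A class can be shared, so find_all() returns every matching tag in page order.
stat_tags = sample_soup.find_all("div", {"class": "statText"})
print([tag.text for tag in stat_tags])
```

`findAll`, used throughout this notebook, is the older spelling of `find_all`; both return the same list.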

### Match info

First, find the id of the element you want. You can do this with the inspect mode of the browser in which you opened the page
(Ctrl + Shift + C). In this case, we will get the date through the id 'utime'.

In [5]:
date = soup.findAll('div',{'id':'utime'})[0]
date = date.text
date

'09.12.2020 13:45'
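The date comes back as plain text. If a real date object is needed later, a conversion along these lines could be used. It assumes the site writes dates as day.month.year followed by hour:minute, which the page itself does not state.

```python
from datetime import datetime

raw_date = "09.12.2020 13:45"

# Assumption: the text is day.month.year hour:minute (so this is 9 December).
match_datetime = datetime.strptime(raw_date, "%d.%m.%Y %H:%M")
print(match_datetime.isoformat())
```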

In [6]:
description = soup.findAll('span',{'class':'description__country'})[0].a.text
description

'Championship - Round 17'

Getting the home team name

In [7]:
home = soup.findAll('div',{'class':'team-text tname-home'})[0]
home = home.a.text
home

'Reading'

Getting the away team name

In [8]:
away = soup.findAll('div',{'class':'team-text tname-away'})[0]
away = away.a.text
away

'Birmingham'
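The lookups so far share one pattern: take the first match and read its text. Indexing with `[0]` raises an `IndexError` when nothing matches, so a small helper can make the missing case explicit. `first_text` is a hypothetical name, and the HTML fragment is hand-written for the demonstration. Note that the helper reads all the text inside the tag rather than only the `<a>` child.

```python
from bs4 import BeautifulSoup

def first_text(soup, name, attrs, default=None):
    """Return the stripped text of the first matching tag, or `default` if none matches."""
    tag = soup.find(name, attrs)
    return tag.get_text(strip=True) if tag is not None else default

# Hand-written fragment that only contains the home team.
sample_soup = BeautifulSoup(
    '<div class="team-text tname-home"><a>Reading</a></div>', "html.parser"
)

home_name = first_text(sample_soup, "div", {"class": "team-text tname-home"})
away_name = first_text(sample_soup, "div", {"class": "team-text tname-away"}, default="unknown")
print(home_name, away_name)
```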

In [9]:
match_result = soup.find_all("div", {"id":"event_detail_current_result"})[0]
scores = match_result.findAll('span',{'class':'scoreboard'})
score = [score.text for score in scores]
score
home_score = score[0]
away_score = score[1]


print ('home score: '+ home_score)
print ('away score: '+ away_score)


home score: 1
away score: 2


### Stats
Now that we have the basic match info, it is time to get the match statistics. We will use the id of the full-time statistics; the ids for each period are:

<p>full time match id: <b> 'tab-statistics-0-statistic' </b> </p> 
<p>first half time id:<b> 'tab-statistics-1-statistic' </b> </p>
<p>second half time id:<b> 'tab-statistics-2-statistic' </b> </p>
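One way to switch between the three periods is to keep the ids in a dictionary. The sketch below runs on a hand-written fragment that imitates the container ids and the title class used in this notebook; `PERIOD_IDS` and `stat_titles` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Container ids listed above, keyed by a readable period name.
PERIOD_IDS = {
    "full_time": "tab-statistics-0-statistic",
    "first_half": "tab-statistics-1-statistic",
    "second_half": "tab-statistics-2-statistic",
}

def stat_titles(soup, period="full_time"):
    """Return the stat titles found in the container of the given period."""
    container = soup.find("div", {"id": PERIOD_IDS[period]})
    if container is None:
        return []  # this period has no statistics block on the page
    titles = container.find_all("div", {"class": "statText statText--titleValue"})
    return [title.text for title in titles]

# Hand-written fragment with two of the three containers.
sample_soup = BeautifulSoup(
    """
    <div id="tab-statistics-0-statistic">
      <div class="statText statText--titleValue">Corner Kicks</div>
      <div class="statText statText--titleValue">Fouls</div>
    </div>
    <div id="tab-statistics-1-statistic">
      <div class="statText statText--titleValue">Corner Kicks</div>
    </div>
    """,
    "html.parser",
)
print(stat_titles(sample_soup))
print(stat_titles(sample_soup, "second_half"))
```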

In [10]:
match = soup.findAll('div', {'id':'tab-statistics-0-statistic'})[0]

Home stats

In [11]:
home_values = match.findAll('div', {'class':'statText statText--homeValue'})
home_value = [home.text for home in home_values]
home_value

['64%',
 '11',
 '2',
 '8',
 '1',
 '4',
 '1',
 '1',
 '9',
 '0',
 '1',
 '577',
 '15',
 '89',
 '28']

In [12]:
away_values = match.findAll('div', {'class':'statText statText--awayValue'})
away_value = [away.text for away in away_values]
away_value

['36%',
 '5',
 '3',
 '0',
 '2',
 '1',
 '2',
 '1',
 '25',
 '1',
 '5',
 '318',
 '20',
 '127',
 '43']
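Every value is scraped as text, including the percent sign. If numbers are needed for analysis, a conversion like the following could be applied; `to_number` is a hypothetical helper, not part of the original notebook.

```python
def to_number(raw):
    """Convert a scraped stat such as '64%' or '577' to a float."""
    return float(raw.strip().rstrip("%"))

# First values of the home list shown above.
sample_values = ["64%", "11", "2"]
numeric_values = [to_number(value) for value in sample_values]
print(numeric_values)
```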

In [16]:
browser.quit()

## Put all data in a dataframe

### First Create a data frame

The columns will be the match stat titles

In [13]:
titles = match.findAll('div',{'class':'statText statText--titleValue'})
#List with all the titles
actual_titltes = [title.text for title in titles]

#add a prefix to identify which team the stat belongs to.
home_titles = ['Home ' + sub for sub in actual_titltes]
away_titles = ['Away ' + sub for sub in actual_titltes]
all_titles = home_titles + away_titles

all_titles

['Home Ball Possession',
 'Home Goal Attempts',
 'Home Shots on Goal',
 'Home Shots off Goal',
 'Home Blocked Shots',
 'Home Corner Kicks',
 'Home Offsides',
 'Home Goalkeeper Saves',
 'Home Fouls',
 'Home Red Cards',
 'Home Yellow Cards',
 'Home Total Passes',
 'Home Tackles',
 'Home Attacks',
 'Home Dangerous Attacks',
 'Away Ball Possession',
 'Away Goal Attempts',
 'Away Shots on Goal',
 'Away Shots off Goal',
 'Away Blocked Shots',
 'Away Corner Kicks',
 'Away Offsides',
 'Away Goalkeeper Saves',
 'Away Fouls',
 'Away Red Cards',
 'Away Yellow Cards',
 'Away Total Passes',
 'Away Tackles',
 'Away Attacks',
 'Away Dangerous Attacks']

...and the match info 

In [14]:
info_columns = ['Description','Date','Home','Away','FTHG','FTAG'] 
info_columns

['Description', 'Date', 'Home', 'Away', 'FTHG', 'FTAG']

Create a list with all the column names

In [15]:
all_columns = info_columns + all_titles

Create an empty dataframe with the column names

In [17]:
df = pd.DataFrame(columns=all_columns)
df

Unnamed: 0,Description,Date,Home,Away,FTHG,FTAG,Home Ball Possession,Home Goal Attempts,Home Shots on Goal,Home Shots off Goal,...,Away Corner Kicks,Away Offsides,Away Goalkeeper Saves,Away Fouls,Away Red Cards,Away Yellow Cards,Away Total Passes,Away Tackles,Away Attacks,Away Dangerous Attacks


Create a list with all the values, in the same order as the columns

In [18]:
match_data = [description, date, home, away, home_score, away_score]

all_values = match_data + home_value + away_value

Create a dictionary; it will be used to fill the dataframe

In [19]:

keys_list = all_columns 
values_list = all_values
zip_iterator = zip(keys_list, values_list)
match_dictionary = dict(zip_iterator)

print(match_dictionary)

{'Description': 'Championship - Round 17', 'Date': '09.12.2020 13:45', 'Home': 'Reading', 'Away': 'Birmingham', 'FTHG': '1', 'FTAG': '2', 'Home Ball Possession': '64%', 'Home Goal Attempts': '11', 'Home Shots on Goal': '2', 'Home Shots off Goal': '8', 'Home Blocked Shots': '1', 'Home Corner Kicks': '4', 'Home Offsides': '1', 'Home Goalkeeper Saves': '1', 'Home Fouls': '9', 'Home Red Cards': '0', 'Home Yellow Cards': '1', 'Home Total Passes': '577', 'Home Tackles': '15', 'Home Attacks': '89', 'Home Dangerous Attacks': '28', 'Away Ball Possession': '36%', 'Away Goal Attempts': '5', 'Away Shots on Goal': '3', 'Away Shots off Goal': '0', 'Away Blocked Shots': '2', 'Away Corner Kicks': '1', 'Away Offsides': '2', 'Away Goalkeeper Saves': '1', 'Away Fouls': '25', 'Away Red Cards': '1', 'Away Yellow Cards': '5', 'Away Total Passes': '318', 'Away Tackles': '20', 'Away Attacks': '127', 'Away Dangerous Attacks': '43'}
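`zip` stops at the shorter of its inputs, so a match with fewer values than columns would silently produce a shorter dictionary. A small sketch of a stricter alternative follows; `build_row` is a hypothetical name.

```python
def build_row(columns, values):
    """Pair column names with values, refusing mismatched lengths."""
    if len(columns) != len(values):
        raise ValueError(f"{len(columns)} columns but {len(values)} values")
    return dict(zip(columns, values))

# Plain zip drops the unmatched column without any warning.
truncated = dict(zip(["Home", "Away", "FTHG"], ["Reading", "Birmingham"]))
print(truncated)

row = build_row(["Home", "Away"], ["Reading", "Birmingham"])
print(row)
```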


Fill the first row of the dataframe with the info 

In [20]:
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
pd.concat([df, pd.DataFrame([match_dictionary])], ignore_index=True)


Unnamed: 0,Description,Date,Home,Away,FTHG,FTAG,Home Ball Possession,Home Goal Attempts,Home Shots on Goal,Home Shots off Goal,...,Away Corner Kicks,Away Offsides,Away Goalkeeper Saves,Away Fouls,Away Red Cards,Away Yellow Cards,Away Total Passes,Away Tackles,Away Attacks,Away Dangerous Attacks
0,Championship - Round 17,09.12.2020 13:45,Reading,Birmingham,1,2,64%,11,2,8,...,1,2,1,25,1,5,318,20,127,43


...and here we have all the match details
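When several matches are collected, an alternative to adding rows one at a time is to gather the dictionaries in a list and build the dataframe once. In the sketch below the second row is a made-up placeholder (not scraped data) that carries one extra key, to show that pandas fills the missing cell with NaN.

```python
import pandas as pd

rows = [
    {"Home": "Reading", "Away": "Birmingham", "FTHG": "1", "FTAG": "2"},
    # Placeholder row for illustration only, with one extra statistic.
    {"Home": "Team A", "Away": "Team B", "FTHG": "0", "FTAG": "0", "Home Red Cards": "1"},
]

# Columns are the union of all keys; absent values become NaN.
matches = pd.DataFrame(rows)
print(matches)
```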

## Scraping multiple pages

Now that we understand how to get the data from a single match, the next step is to do the same for multiple matches. To do that, we will open the results page that lists all the matches and build a list with every match id (the list is called "Match_ids"). We will then use the list to locate the element with each id and click on it, which opens the match details in a new window. That new window is NOT yet the one that contains the data we want: another click on "statistics" is necessary to open the right view.

Once we are in the right view, we will create a BeautifulSoup object from the page and save it in a list called "list_soups", which will later provide the description and match details. This process is repeated for each id.

Finally, we will create a dataframe, fill it with the data from all the ids, and save it as a CSV file.

### Import extra libraries

In [21]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

chrome_options = Options()

chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
# Pass the argument 1 to allow and 2 to block
chrome_options.add_experimental_option("prefs", { 
    "profile.default_content_setting_values.notifications": 2
    })

### Getting ids of each match

Create a selenium object to navigate in the main page 

In [26]:
browser = webdriver.Chrome(executable_path = r'C:\Users\Aldo\Documents\Data science\Projects\web_scraping\chromedriver87.exe')
url_main = 'https://www.flashscore.com/football/england/championship/results/'
browser.get(url_main)

Before creating the BeautifulSoup object, make sure to click on "show more matches" to load the full page; otherwise the list will be incomplete.
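The click can also be automated. The sketch below is one possible approach rather than the method used in this notebook: it clicks the link until it stops appearing. `click_show_more` is a hypothetical helper, and the CSS selector in the usage comment is a guess that must be checked in the browser's inspect mode.

```python
from time import sleep

def click_show_more(driver, locator, max_clicks=30, pause=2.0):
    """Click the element found by `locator` until it disappears; return the click count."""
    clicks = 0
    while clicks < max_clicks:
        links = driver.find_elements(*locator)
        if not links:
            break  # nothing left to expand
        # A JavaScript click avoids failures when a banner overlaps the link.
        driver.execute_script("arguments[0].click();", links[0])
        clicks += 1
        sleep(pause)  # give the page time to append the new rows
    return clicks

# Hypothetical usage with the `browser` created above (selector is an assumption):
# from selenium.webdriver.common.by import By
# click_show_more(browser, (By.CSS_SELECTOR, "a.event__more"))
```

The `max_clicks` limit keeps the loop from running forever if the link never disappears.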

In [28]:
html_index = browser.page_source
soup_index = bs(html_index , 'html.parser')

Look for the ids of all the matches; these ids will be used to 'click' on each match and get the details.

In [29]:
event_match = soup_index.findAll('div',{'class': 'event__match event__match--static event__match--oneLine'})

In [30]:
Match_ids = [ event_id['id'] for event_id in event_match ]

print('List length: '+ str(len(Match_ids)))

List length: 198


Double-click here to see how to save the list
<!--
#Save the List as a file
with open("Match_ids", "w") as file:
    file.write(str(Match_ids))

#Load the List file 
file1 = open("Match_ids", "r")
file_content= file1.read()
-->
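Writing `str(Match_ids)` to a file stores the list as text, and reading it back returns one long string rather than a list. A minimal sketch of a round trip that restores the list, using JSON from the standard library; the helper names and the file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def save_ids(ids, path):
    """Write the list of ids to `path` as JSON."""
    Path(path).write_text(json.dumps(ids), encoding="utf-8")

def load_ids(path):
    """Read the list of ids back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

# Demonstration in a temporary folder so no file is left behind.
with tempfile.TemporaryDirectory() as tmp:
    ids_file = Path(tmp) / "match_ids.json"
    save_ids(["g_1_GEB07Df7", "g_1_rL3nVGnE"], ids_file)
    restored = load_ids(ids_file)

print(restored)
```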

In [31]:
browser.quit()

Now that we have the id of every match, it is possible to click on those elements. To do that, we create a new driver.

In [32]:
from selenium.webdriver.common.action_chains import ActionChains

In [33]:
driver = webdriver.Chrome(executable_path = r'C:\Users\Aldo\Documents\Data science\Projects\web_scraping\chromedriver87.exe',
                         options = chrome_options)

url_main = 'https://www.flashscore.com/football/england/championship/results/'
driver.implicitly_wait(5)
driver.get(url_main)

Click on "show more matches" to load the complete page

In [34]:
#make sure the driver is focused on the main window

wb = driver.window_handles[0]
driver.switch_to.window(wb)

#match = driver.find_element_by_id(match_id)
#match.click()



In [36]:
#For a quick demonstration you can use only the first 5 elements of the list, or run the complete list.
Match_ids[0:5]


['g_1_GEB07Df7',
 'g_1_rL3nVGnE',
 'g_1_2NJmTfHQ',
 'g_1_bg5JNYvl',
 'g_1_YBcWKWO6']
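The ids above look like a fixed prefix followed by a short key, and the match page opened earlier has a similar key in its URL (`faNi9ZPf`). If the two really are the same key, the statistics page could be opened directly with `driver.get` instead of clicking. This is an assumption that the notebook does not confirm, so test it on one id first. `match_stats_url` is a hypothetical helper.

```python
PREFIX = "g_1_"  # prefix seen on every id in Match_ids

def match_stats_url(element_id):
    """Build a statistics URL from an element id such as 'g_1_GEB07Df7'.

    Assumes the text after the prefix is the key used in match URLs.
    """
    if not element_id.startswith(PREFIX):
        raise ValueError(f"unexpected id format: {element_id!r}")
    match_key = element_id[len(PREFIX):]
    return f"https://www.flashscore.com/match/{match_key}/#match-statistics;0"

print(match_stats_url("g_1_GEB07Df7"))
```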

In [37]:
#scraping code
error_list = []
list_soups = []
wb = driver.window_handles[0]  # handle of the results page
c = 0

for match_id in Match_ids[0:5]:
    try:
        # wait for the match row, scroll to it, then click to open the match window
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, match_id))
        )
        ActionChains(driver).move_to_element(element).perform()
        element.click()
    except Exception:
        print('error identifying Match_id [' + str(c) + '] :' + match_id)

    try:
        wa = driver.window_handles[-1]
        driver.switch_to.window(wa)  # focus on the new window
        # click on the statistics tab
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, 'li-match-statistics'))
        )
        element.click()
        sleep(4)  # give the statistics time to render
        soup_match = bs(driver.page_source, 'html.parser')
        list_soups.append(soup_match)
        driver.close()
        driver.switch_to.window(wb)  # go back to the results page
    except Exception:
        driver.switch_to.window(wb)
        error_list.append(match_id)
    sleep(4)
    c = c + 1


print(' ')
print('total Soups: ' + str(len(list_soups)))
print(' ')
print('total error ' + str(len(error_list)))
print('Error list: ' + str(error_list))




 
total Soups: 5
 
total error 0
Error list: []


In [38]:
driver.quit()

<b> Read it in case of error:</b> 
An error in this code means that the data for a match was not collected. This can happen for several reasons.

In some cases the id is NOT located because the browser was manipulated manually while the code was running, the internet connection was poor, or the computer went into suspend mode due to inactivity. Sometimes the match has no statistics tab, so the BeautifulSoup object can NOT be created. Whatever the cause, the error list collects the ids whose match data was NOT retrieved, and you can use that list to run the code (#scraping code) again with just those ids.

Before running the code again:
1. You can save the data you already have in a dataframe and append the new data later (this is explained further on in this notebook).
2. Or you can create a new soup list (with another name) to hold the new BeautifulSoup objects and combine the lists later.
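The rerun can also be expressed as a loop that retries only the failed ids. The sketch below is independent of Selenium: `scrape_one` stands for any function that returns the data for one id or raises an exception, and `flaky_scrape` only simulates a single failure. Both names are hypothetical.

```python
def scrape_with_retries(match_ids, scrape_one, max_rounds=3):
    """Call `scrape_one` for every id, retrying failures for up to `max_rounds` rounds."""
    results = {}
    pending = list(match_ids)
    for _ in range(max_rounds):
        failed = []
        for match_id in pending:
            try:
                results[match_id] = scrape_one(match_id)
            except Exception:
                failed.append(match_id)  # keep it for the next round
        pending = failed
        if not pending:
            break
    return results, pending

# Simulated scraper: the second id fails on its first attempt only.
attempts = {}

def flaky_scrape(match_id):
    attempts[match_id] = attempts.get(match_id, 0) + 1
    if match_id == "g_1_rL3nVGnE" and attempts[match_id] == 1:
        raise RuntimeError("simulated timeout")
    return f"soup for {match_id}"

soups, still_failing = scrape_with_retries(["g_1_GEB07Df7", "g_1_rL3nVGnE"], flaky_scrape)
print(len(soups), still_failing)
```

Ids that keep failing after the last round are returned in the second value, which plays the role of `error_list`.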

In [44]:
#use the error list to run the code again, after saving the data you already have 
Match_ids = error_list
len(Match_ids)

0

In [45]:
Match_ids

[]

### Save the data in a dataframe

Double-click to see how to get the titles

<!-- 

browser = webdriver.Chrome(executable_path = r'C:\Users\Aldo\Documents\Data science\Projects\web_scraping\chromedriver87.exe')
url='https://www.flashscore.com/match/faNi9ZPf/#match-statistics;0'
browser.get(url)
html = browser.page_source
soup = bs(html, 'html.parser')
match = soup.findAll('div', {'id':'tab-statistics-0-statistic'})[0]
titles = match.findAll('div',{'class':'statText statText--titleValue'})

browser.quit()

actual_titltes = [title.text for title in titles] 

-->


In [41]:
#List with all the titles
actual_titltes =[
 'Ball Possession',
 'Goal Attempts',
 'Shots on Goal',
 'Shots off Goal',
 'Blocked Shots',
 'Corner Kicks',
 'Offsides',
 'Goalkeeper Saves',
 'Fouls',
 'Red Cards',
 'Yellow Cards',
 'Total Passes',
 'Tackles',
 'Attacks',
 'Dangerous Attacks']

Create an empty dataframe with the columns

In [42]:
#add a prefix to identify which team the stat belongs to.
home_titles = ['Home ' + sub for sub in actual_titltes]
away_titles = ['Away ' + sub for sub in actual_titltes]
all_titles = home_titles + away_titles


info_columns = ['Info','Date','Home','Away','FTHG','FTAG'] 
all_columns = info_columns + all_titles

df = pd.DataFrame(columns=all_columns)
df

Unnamed: 0,Info,Date,Home,Away,FTHG,FTAG,Home Ball Possession,Home Goal Attempts,Home Shots on Goal,Home Shots off Goal,...,Away Corner Kicks,Away Offsides,Away Goalkeeper Saves,Away Fouls,Away Red Cards,Away Yellow Cards,Away Total Passes,Away Tackles,Away Attacks,Away Dangerous Attacks


Fill the dataframe

In [43]:
count = 0

#Remember you can get statistics for the first half, second half, or complete match by changing statistics_time
complete = 'tab-statistics-0-statistic'
first_half = 'tab-statistics-1-statistic'
second_half ='tab-statistics-2-statistic'
statistics_time = complete


for soup in list_soups:
    # a missing element raises IndexError ([0] on an empty list)
    # or AttributeError (.a is None), so only those are caught
    try:
        info = soup.findAll('span',{'class':'description__country'})[0].a.text
    except (IndexError, AttributeError):
        info = np.nan
    
    try:
        date = soup.findAll('div',{'id':'utime'})[0].text
    except (IndexError, AttributeError):
        date = np.nan
    
    try:
        home = soup.findAll('div',{'class':'team-text tname-home'})[0].a.text
    except (IndexError, AttributeError):
        home = np.nan
    
    try:
        away = soup.findAll('div',{'class':'team-text tname-away'})[0].a.text
    except (IndexError, AttributeError):
        away = np.nan

    try:  
        match_result = soup.find_all("div", {"id":"event_detail_current_result"})[0]
        scores = match_result.findAll('span',{'class':'scoreboard'})
        score = [score.text for score in scores]
        home_score = score[0]
        away_score = score[1]
    except (IndexError, AttributeError):
        home_score = np.nan
        away_score = np.nan
        
    #try:
    match = soup.findAll('div', {'id':statistics_time})[0]

    home_values = match.findAll('div', {'class':'statText statText--homeValue'})
    home_value = [home.text for home in home_values]

    away_values = match.findAll('div', {'class':'statText statText--awayValue'})
    away_value = [away.text for away in away_values]

    #---------------
    titles = match.findAll('div',{'class':'statText statText--titleValue'})
    #List with all the titles
    actual_titltes = [title.text for title in titles]
    
    #except:
    #   home_value = []
    #   away_value = []
    #   actual_titltes = []
        

    #add a prefix to identify which team the stat belongs to.
    home_titles = ['Home ' + sub for sub in actual_titltes]
    away_titles = ['Away ' + sub for sub in actual_titltes]
    all_titles = home_titles + away_titles

    all_columns = info_columns + all_titles
        #----------------------------------------------------------

    #dataframe values
    match_data = [info, date, home, away, home_score, away_score]
    all_values = match_data + home_value + away_value

    #create a dictionary
    keys_list = all_columns 
    values_list = all_values
    zip_iterator = zip(keys_list, values_list)
    match_dictionary = dict(zip_iterator)

    #append the row to the dataframe (DataFrame.append was removed in pandas 2.0)
    df = pd.concat([df, pd.DataFrame([match_dictionary])], ignore_index=True)
    #print ( 'NOT found list_soups[{}]'.format(count) )
df

Unnamed: 0,Info,Date,Home,Away,FTHG,FTAG,Home Ball Possession,Home Goal Attempts,Home Shots on Goal,Home Shots off Goal,...,Away Corner Kicks,Away Offsides,Away Goalkeeper Saves,Away Fouls,Away Red Cards,Away Yellow Cards,Away Total Passes,Away Tackles,Away Attacks,Away Dangerous Attacks
0,Championship - Round 18,12.12.2020 09:00,Birmingham,Watford,0,1,45%,9,2,5,...,6,2,2,10,0.0,1.0,386,15,113,43
1,Championship - Round 18,12.12.2020 09:00,Derby,Stoke,0,0,62%,11,5,3,...,3,1,5,14,,1.0,310,11,95,34
2,Championship - Round 18,12.12.2020 09:00,Bournemouth,Huddersfield,5,0,49%,13,8,3,...,2,2,3,8,,1.0,463,16,114,48
3,Championship - Round 18,12.12.2020 09:00,Derby,Stoke,0,0,62%,11,5,3,...,3,1,5,14,,1.0,310,11,95,34
4,Championship - Round 18,12.12.2020 09:00,Luton,Preston,3,0,46%,19,5,10,...,4,2,2,7,,,357,21,90,36
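Rows 1 and 3 of the table above show the same match (Derby vs Stoke), so it is worth checking for repeats before saving. A minimal sketch on a three-row sample copied from that output follows; treating Date, Home and Away together as the identifying key is an assumption.

```python
import pandas as pd

# Three rows taken from the output above (rows 0, 1 and 3).
sample = pd.DataFrame({
    "Date": ["12.12.2020 09:00"] * 3,
    "Home": ["Birmingham", "Derby", "Derby"],
    "Away": ["Watford", "Stoke", "Stoke"],
})

key = ["Date", "Home", "Away"]

# duplicated() marks every repeat after the first occurrence.
repeats = sample[sample.duplicated(subset=key)]
print(repeats)

deduplicated = sample.drop_duplicates(subset=key).reset_index(drop=True)
print(len(deduplicated))
```

A repeated row may mean that a click opened the wrong match window, so it can be worth scraping the affected ids again.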


If the data looks right, it is time to save the file.

In [129]:
df.to_csv('Championship_data.csv', index=False)

<b> Read it in case of error:</b> 
1. Rename the dataframe.

In [130]:
df_all = df

2. Run the code again (#scraping code) using the error list.
3. Create a new empty dataframe with the column names.
4. Fill the dataframe.
5. Combine the new dataframe with the previous one.

In [None]:
df_all2 = pd.concat([df_all, df], ignore_index=True) 
df_all2
#df_all2.to_csv('Championship_data.csv', index=False)

6. Repeat this error-handling process as necessary.