# Let's Football!

The **FIFA World Cup** is the most widely viewed sporting event in the world; its latest edition, held in Russia in 2018, attracted *3.5 billion* viewers. The matches were played between 14 June and 15 July. 32 teams competed in a total of 64 matches across 12 stadiums. Deschamps' French team won the tournament, beating Croatia **4-2** in the final, while Croatia's *Modrić* still bagged the *Golden Ball*.

**Fun fact**: *The global human population is estimated at around 7.7 billion*

### From text to Python

Let's store the **data** above in _Python_ variables

In [251]:
event_name = 'World Cup'

print(event_name)

World Cup


In [None]:
# 1. String data - they go wrapped in quotes '' or ""
host_organisation = 'FIFA'
host_nation = 'Russia'
winning_nation = 'France'

# 2. Integer data 
version = 2018
total_matches = 64
total_venues = 12

# 3. Float data - store decimal values
total_viewers_billion = 3.5
human_population_billion = 7.7
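
Once values are stored with the right type, they can be used directly in calculations. A minimal sketch using only the numbers quoted above (`viewer_share` is a name introduced here for illustration):

```python
# values quoted in the introduction above
total_viewers_billion = 3.5
human_population_billion = 7.7
total_matches = 64

# float / float -> float; round() keeps two decimal places
viewer_share = round(total_viewers_billion / human_population_billion, 2)

# show the type of an integer and of a float variable
print(type(total_matches).__name__, type(viewer_share).__name__)
print(f"Roughly {viewer_share:.0%} of the world watched")
```

The same division on string data (for example `'3.5' / '7.7'`) would raise a `TypeError`, which is why scraped text usually needs converting before analysis.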

<p style="font-size:16px;font-family:helvetica">
Collecting data manually like this quickly gets tedious. Next, we'll look at how to automate the extraction of the data we need.
</p>

<hr>

### Collecting Data

In [297]:
###########################
# IMPORT NECESSARY PACKAGES
###########################

# highly optimised calculations on large arrays
import numpy as np

# collection of functions to manipulate data
import pandas as pd

# base library in Python for plotting graphs
# import matplotlib.pyplot as plt

# for better visualizations
# import seaborn as sns

# for plotting on maps
# from mpl_toolkits.basemap import Basemap

# sending and receiving HTTP requests
import requests

# parsing html data 
import bs4
from bs4 import BeautifulSoup as bsoup

# for easy searching and replacing using regular expression
import re

# better dictionary
from collections import defaultdict 

# pretty print values
from pprint import pprint

### Part 1: Get sample data

We'll use the Wikipedia article on the [2018 FIFA World Cup](https://en.wikipedia.org/wiki/2018_FIFA_World_Cup) to get more data on the event

> The *Art* of retrieving data from the web is **Web Scraping**


In [258]:
## Get the wikipedia page's html content using Python
## Pointers - https://realpython.com/python-requests/  

import requests
from bs4 import BeautifulSoup as bsoup

# send an HTTP GET request to the Wikipedia server for the required page
html = requests.get('https://en.wikipedia.org/wiki/2018_FIFA_World_Cup')
print(f"requests returned a {type(html)} object")

# get string version of Response object returned by requests
text = html.text
print(f"text is a {type(text)} object")
      
# convert to a BeautifulSoup object for easy parsing
soup = bsoup(text, 'html.parser')

requests returned a <class 'requests.models.Response'> object
text is a <class 'str'> object
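
A request can fail or hang, so it may be worth checking the response before parsing it. The sketch below is one possible wrapper, not part of the original cell: `fetch_html`, `WIKI_URL` and the User-Agent value are names made up for this example. `requests.get` has no timeout unless one is passed, and `raise_for_status()` turns 4xx/5xx responses into exceptions.

```python
import requests

WIKI_URL = 'https://en.wikipedia.org/wiki/2018_FIFA_World_Cup'


def fetch_html(url, timeout=10):
    """Return the page's HTML as a string, or raise if the request fails."""
    # a descriptive User-Agent is polite to the server; this value is a placeholder
    headers = {'User-Agent': 'worldcup-notebook/0.1 (learning exercise)'}
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise requests.HTTPError for 4xx / 5xx responses
    response.raise_for_status()
    return response.text
```

Calling `text = fetch_html(WIKI_URL)` would then give the same kind of string as `html.text` above.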


In [444]:
# pretty print contents of the webpage
print(soup.prettify())

The box on the far right of the page (the infobox) has a quick overview - let's get that **DATA**

<p align="center" style="text-align:center">
  <img src="https://drive.google.com/uc?export=view&id=1XDoUcWBuiTwnBn_-cLvvuVXNI-N9SOTs" alt="Wiki Home - 2018 Fifa WC" width="720p">
</p>
<p align="center" style="text-align:center"> Source: Wikipedia</p> 

OK, we have the page and know where to *scrape* data from - but how do we do it?

We'll analyze the page's HTML. While you are on the Wikipedia page, right-click on the box we need to collect data from and select **Inspect**. The panel that opens displays the HTML of the Wikipedia page.


<p align="center" style="text-align:center"> <b>Opening Developer Tools</b></p> 
<p align="center" style="text-align:center">
  <img src="https://drive.google.com/uc?export=view&id=105CHEVsaUtU2iRPYSXCcb8rCQuGrTNaV" alt="Wiki Home - 2018 Fifa WC" >
</p>

<p align="center" style="text-align:center"><b>Data Highlighted</b></p> 
<p align="center" style="text-align:center">
  <img src="https://drive.google.com/uc?export=view&id=1Iak8Bf5yB1Ni6KKZAqcZ-z90A60FBpDB" alt="Wiki Home - 2018 Fifa WC" >
</p>

<p align="center" style="text-align:center"> Source: Wikipedia</p> 

We need some way to identify the box, and we'll use its HTML tags for this. When you hover over different tags in the developer tools panel, the corresponding section of the webpage lights up.

In [445]:
# TODO 1: Fill the 'attrs' parameter of the 'soup.find' method with the class of the <table> tag 
# that completely wraps the information box
infobox = soup.find('table', attrs={'class': 'infobox vcalendar'})

# infobox = soup.find('table', attrs={'class': ''})

# prints the content of the information box
print(infobox.prettify())
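
`find` returns `None` when nothing matches, in which case a call such as `infobox.prettify()` raises an `AttributeError`. The sketch below shows the check on a tiny hand-written table (the HTML and the `sample_*` names are made up for this example). It also uses the `class_` keyword, which matches a tag that has the given class among its classes.

```python
from bs4 import BeautifulSoup

# tiny stand-in for the real page (made up for this example)
sample_html = '''
<table class="infobox vcalendar">
  <tr><th scope="row">Host country</th><td>Russia</td></tr>
</table>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# class_ matches a table that has 'infobox' among its classes
found = sample_soup.find('table', class_='infobox')
# find returns None when nothing matches, so check before using the result
missing = sample_soup.find('table', class_='no-such-class')

if found is not None:
    print(found.th.get_text(strip=True), '->', found.td.get_text(strip=True))
print(missing)
```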





In [274]:
# the infobox html contains multiple <tr> tags;
# rows holding an attribute-value pair have both a <th> and a <td>
# e.g. Host country : Russia - 'Host country' is the attribute and 'Russia' its value

# TODO 2: complete the call to the find_all method for returning all the <tr> tags
infobox_rows = infobox.find_all('tr')
# infobox_rows = infobox.find_all('')


print(f"There are {len(infobox_rows)} <tr> tags in the html of the infobox \n")

# create a dictionary to store attribute-value pairs, or key-value if you prefer
infobox_data = {}

''' 
 Sample <tr> tag:
 
 <tr>
    <th scope="row">
        Host country
    </th>
    <td>
        Russia
    </td>
 </tr>
  
  
 Checking the content of 'infobox', you can see that
 <tr> tags that hold data have both a <th> and a <td> tag as children:
 the attribute name is in <th> and the attribute value in <td>
'''

# TODO3: Loop through 'infobox_rows'
# for row in ... :
for row in infobox_rows:
    # check if both <th> and <td> tags are present
    if row.th is not None and row.td is not None:
        # get text content from the tags
        attribute = row.th.text
        value = row.td.text
        
        # replace non-breaking spaces ('\xa0') with regular spaces
        # https://stackoverflow.com/a/11566398/9734484
        attribute = attribute.replace('\xa0', ' ')
        value = value.replace('\xa0', ' ')
        
        # TODO4: add to 'infobox_data' the data
        # use attribute as key and value as value to the dictionary
#         infobox_data[...] = ...
        infobox_data[attribute] = value
    
    
pprint(infobox_data)

There are 22 <tr> tags in the html of the infobox 

{'Attendance': '3,031,768 (47,371 per match)',
 'Best goalkeeper': ' Thibaut Courtois',
 'Best player(s)': ' Luka Modrić',
 'Best young player': ' Kylian Mbappé',
 'Champions': ' France (2nd title)',
 'Dates': '14 June – 15 July',
 'Fair play award': ' Spain',
 'Fourth place': ' England',
 'Goals scored': '169 (2.64 per match)',
 'Host country': 'Russia',
 'Matches played': '64',
 'Runners-up': ' Croatia',
 'Teams': '32 (from 5 confederations)',
 'Third place': ' Belgium',
 'Top scorer(s)': ' Harry Kane (6 goals)',
 'Venue(s)': '12 (in 11 host cities)'}
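
Every value in `infobox_data` is still a string, often with extra text after the number. One possible way to pull out the leading number is a small regular expression. This is a sketch only; `leading_number` is a helper name introduced here, not part of the assignment.

```python
import re


def leading_number(value):
    """Return the integer at the start of value, or None if there is none."""
    match = re.match(r'\s*(\d[\d,]*)', value)
    if match is None:
        return None
    # drop thousands separators before converting
    return int(match.group(1).replace(',', ''))


print(leading_number('32 (from 5 confederations)'))
print(leading_number('3,031,768 (47,371 per match)'))
print(leading_number(' France (2nd title)'))
```

Values that do not start with a number, such as the champions entry, come back as `None` instead of raising an error.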


In [8]:
# TODO 5: Print the value of the attribute 'Teams' from infobox_data
# print(...)
print(infobox_data['Teams'])

32 (from 5 confederations)

In [275]:
# json is a file format with a data representation similar to Python dicts
# We'll save our infobox data into a json file on disk
# https://realpython.com/python-json/
import json

# json.dumps - serialize a Python object to a JSON-formatted string (used further below)

with open('data/infobox-data.json', 'w') as fp:
    '''
    json.dump(data, file_pointer, indent=...) - serialize data as JSON and write it to a file
    
    Input:
        data - data to be stored
        file_pointer - open file object to write to
        indent - indentation level to use
    '''
    json.dump(infobox_data, fp, indent=2)
    
del infobox_data


# TODO 6: Fill arguments for the json.load() function call to load json data into infobox variable
with open('data/infobox-data.json', 'r') as fp:
    '''
    # json.load(file_pointer) - load saved JSON encoded data from disk
    
    Input:
        file_pointer - location to be read from
    '''
#     infobox_data  = json.load(...)
    infobox_data  = json.load(fp)
    
    
# TODO 7: save json data as a string with indentation 2
# Fill arguments to json.dumps()
# infobox_data_string = json.dumps(..., ...)
infobox_data_string = json.dumps(infobox_data, indent=2)

# print the contents
print(infobox_data_string)

{
  "Host country": "Russia",
  "Dates": "14 June \u2013 15 July",
  "Teams": "32 (from 5 confederations)",
  "Venue(s)": "12 (in 11 host cities)",
  "Champions": " France (2nd title)",
  "Runners-up": " Croatia",
  "Third place": " Belgium",
  "Fourth place": " England",
  "Matches played": "64",
  "Goals scored": "169 (2.64 per match)",
  "Attendance": "3,031,768 (47,371 per match)",
  "Top scorer(s)": " Harry Kane (6 goals)",
  "Best player(s)": " Luka Modri\u0107",
  "Best young player": " Kylian Mbapp\u00e9",
  "Best goalkeeper": " Thibaut Courtois",
  "Fair play award": " Spain"
}
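
The `\u2013` and `\u0107` sequences in the output are JSON escapes for non-ASCII characters; `json.dumps` escapes them by default. A short sketch of the `ensure_ascii=False` option, which keeps the characters readable (the `sample` dictionary just reuses two entries from above):

```python
import json

sample = {'Best player(s)': 'Luka Modrić', 'Dates': '14 June – 15 July'}

escaped = json.dumps(sample, indent=2)                       # default: ASCII-only output
readable = json.dumps(sample, indent=2, ensure_ascii=False)  # keeps the characters as-is

print(readable)

# both strings decode back to the same dictionary
assert json.loads(escaped) == json.loads(readable) == sample
```

When writing such output to a file, opening it with `encoding='utf-8'` avoids depending on the platform's default encoding.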


### Part 2: Teams by confederation

<p style="font-size:16px;font-family:helvetica"> Let's get details of all the teams that played in the 2018 FIFA World Cup
and the confederation each of them represents</p>

<p align="center" style="text-align:center">
  <img src="https://drive.google.com/uc?export=view&id=1Uo2AnHRCDxPKDQivs_052lvBVHwaEPUx" alt="Teams by confederation" >
</p>
<p align="center" style="text-align:center"> Source: Wikipedia </p>

In [279]:
# TODO 8: Find the class of the table that wraps the required data
# multicol_tables = soup.find_all('table', attrs={'class':'...'})
multicol_tables = soup.find_all('table', attrs={'class':'multicol'})

# let's check if we got ourselves more than one table
print(len(multicol_tables))

3


In [446]:
# TODO 9: Print out each of the three tables we found
# Find the index of the table for our data
# idx = ...
idx = 0
print(multicol_tables[idx].prettify())

In [None]:
# save the table
teams_by_confederation_table = multicol_tables[idx]

#### Find confederation tags

In [285]:
'''
All the confederation names are inside <dl> tags

<dl>
     <dt>
          <a href="/wiki/2018_FIFA_World_Cup_qualification_(AFC)" title="2018 FIFA World Cup qualification (AFC)">
           AFC
          </a>
          (5)
     </dt>
</dl>
''' 

# TODO 10: Use the appropriate function to 'find all' <dl> tags in teams_by_confederation_table
# confederation_tags = ...
confederation_tags = teams_by_confederation_table.find_all('dl')

print(confederation_tags[0].prettify())

<dl>
 <dt>
  <a href="/wiki/2018_FIFA_World_Cup_qualification_(AFC)" title="2018 FIFA World Cup qualification (AFC)">
   AFC
  </a>
  (5)
 </dt>
</dl>



In [294]:
# print only the text inside the html 
print(confederation_tags[0].text)

AFC (5)


In [304]:
'''
TODO 11: 
1. Loop through the confederation_tags list
2. print text in each confederation tag

Example output - 

AFC (5)
CAF (5)
CONCACAF (3)
CONMEBOL (5)
OFC (0)
UEFA (14)

'''

# for tag in ...:
#     print(...)
    
for tag in confederation_tags:
    print(tag.text)
    
# TODO(Optional): Print the text in the tags using list comprehension
# print([tag.text for tag in confederation_tags])

AFC (5)
CAF (5)
CONCACAF (3)
CONMEBOL (5)
OFC (0)
UEFA (14)


#### Clean confederation names

In [306]:
from collections import defaultdict 

'''
Remove the extra details from the text of each 'confederation_tags' entry, keeping only the confederation name

Input: AFC (5)
Output: AFC
'''

confederation_text = "AFC (5)"


def clean_confederation_text(text):
    # TODO 12: Use the appropriate string operation to SPLIT the name and the other info into a list of two strings
    # Input: "AFC (5)"
    # Output: ["AFC", "(5)"]

    # confederation_split = ...
    confederation_split = text.split(" ")


    # TODO 13: Get the text string from the list
    # confederation_cleaned = ...
    confederation_cleaned = confederation_split[0]

    return confederation_cleaned


confederation_cleaned = clean_confederation_text(confederation_text)

print(confederation_cleaned)

AFC


In [307]:
# TODO 14: Loop through all the confederation tags and save only the name part
# You have to give as argument to the clean_confederation_text function the TEXT in tags

# confederations = [clean_confederation_text(...) for tag in confederation_tags]
confederations = [clean_confederation_text(tag.text) for tag in confederation_tags]

print(confederations)

['AFC', 'CAF', 'CONCACAF', 'CONMEBOL', 'OFC', 'UEFA']
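
Splitting on a space throws away the team count in parentheses. If that number is wanted as well, a regular expression can capture both parts at once. The following is a sketch under that assumption; `parse_confederation` and `CONFEDERATION_PATTERN` are names introduced here.

```python
import re

# name, optional spaces, then a number in parentheses
CONFEDERATION_PATTERN = re.compile(r'(\w+)\s*\((\d+)\)')


def parse_confederation(text):
    """Split text such as 'AFC (5)' into ('AFC', 5)."""
    match = CONFEDERATION_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f'unexpected confederation text: {text!r}')
    name, count = match.groups()
    return name, int(count)


print(parse_confederation('AFC (5)'))
print(parse_confederation('UEFA (14)\n'))
```

Raising on unexpected text makes a change in the page layout visible immediately instead of silently producing wrong names.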


#### Find list of teams in each confederation

In [320]:
'''
Info on teams in each confederation is in <ul> tag directly following
the <dl> tag of the corresponding confederation

TODO 15: Use the appropriate function to 'find all' <ul> tags in teams_by_confederation_table
'''

# teams_list_tags = ...
teams_list_tags = teams_by_confederation_table.find_all('ul')

# print text in one of the teams list tag
print(teams_list_tags[0].text)

 Australia (36)
 Iran (37)
 Japan (61)
 Saudi Arabia (67)
 South Korea (57)


In [447]:
# Now we have a list with all teams in a confederation as a single entity
# Let's split that too

afc_teams_tag = teams_list_tags[0]

# On inspection you can see that each team's <a> tag has the team name as its text
# eg: <a href="/wiki/Australia_national_soccer_team" title="Australia national soccer team">Australia</a>

# TODO 16: Find all <a> tags within the <ul> tag - afc_teams_tag
afc_teams_a_tags = afc_teams_tag.find_all('a')


'''
Output:

[<a href="/wiki/Australia_national_soccer_team" title="Australia national soccer team">Australia</a>,
<a href="/wiki/Iran_national_football_team" title="Iran national football team">Iran</a>,
<a href="/wiki/Japan_national_football_team" title="Japan national football team">Japan</a>, 
<a href="/wiki/Saudi_Arabia_national_football_team" title="Saudi Arabia national football team">Saudi Arabia</a>, 
<a href="/wiki/South_Korea_national_football_team" title="South Korea national football team">South Korea</a>]
'''

pprint(afc_teams_a_tags)

[<a href="/wiki/Australia_national_soccer_team" title="Australia national soccer team">Australia</a>,
 <a href="/wiki/Iran_national_football_team" title="Iran national football team">Iran</a>,
 <a href="/wiki/Japan_national_football_team" title="Japan national football team">Japan</a>,
 <a href="/wiki/Saudi_Arabia_national_football_team" title="Saudi Arabia national football team">Saudi Arabia</a>,
 <a href="/wiki/South_Korea_national_football_team" title="South Korea national football team">South Korea</a>]


In [332]:
print([team_a_tag.text for team_a_tag in afc_teams_a_tags])

['Australia', 'Iran', 'Japan', 'Saudi Arabia', 'South Korea']


In [338]:
# def get_list_of_team_names_from_conf_tags(tags):
#     teams_a_tag = tags.find_all('a')
    
#     team_names = [team_a_tag.text for team_a_tag in teams_a_tag]
    
#     return team_names


# afc_teams = get_list_of_team_names_from_conf_tags(afc_teams_tag)
# print(afc_teams)
print(f'Confederation list has {len(confederations)} items')
print(f'Teams list has {len(teams_list_tags)} items, each having all the teams in a confederation')

Confederation list has 6 items
Teams list has 6 items, each having all the teams in a confederation


> So, first confederation in `confederations` will correspond to the first team names tag in `teams_list_tags`

#### Store teams by confederation

In [342]:
# Now we will add all the teams to their corresponding confs and use a dict to save it

from collections import defaultdict

# TODO 17: Create a dictionary with values pre-initialized as list
# Pointer - https://docs.python.org/3.3/library/collections.html#collections.defaultdict
# teams_by_confederation = defaultdict(...)
teams_by_confederation = defaultdict(list)



for i, confederation in enumerate(confederations):
    # find the teams tag corresponding to the confederation
    confederation_teams_tag = teams_list_tags[i]
    
    # TODO 18: Find all <a> tags in 'confederation_teams_tag'
    # for anchor in ...:
    for anchor in confederation_teams_tag.find_all('a'):
        
        # TODO 19: extract only text from the anchor tag
        # Example
        # Input: <a href="/wiki/Australia_national_soccer_team" title="Australia national soccer team">Australia</a>
        # Output: Australia
        # team_name = ...
        team_name = anchor.text
        
        # TODO 20: Append the team name to the corresponding confederation key in 'teams_by_confederation'
        # teams_by_confederation[...].append(team_name)
        teams_by_confederation[confederation].append(team_name)


'''
Output: ['Belgium', 'Croatia', 'Denmark', 'England', 'France', 'Germany', 'Iceland', 'Poland', 'Portugal', 
        'Russia', 'Serbia', 'Spain', 'Sweden', 'Switzerland']
'''
print(teams_by_confederation['UEFA'])

['Belgium', 'Croatia', 'Denmark', 'England', 'France', 'Germany', 'Iceland', 'Poland', 'Portugal', 'Russia', 'Serbia', 'Spain', 'Sweden', 'Switzerland']
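
The loop above pairs each confederation with its teams by position, using `enumerate` and an index. The same pairing can be written with `zip`. The sketch below uses plain lists as stand-ins for the scraped tags; the `sample_*` names are made up for this example.

```python
# stand-ins for the scraped values: names and team lists in matching order
sample_confederations = ['AFC', 'CONCACAF']
sample_team_lists = [
    ['Australia', 'Iran', 'Japan', 'Saudi Arabia', 'South Korea'],
    ['Costa Rica', 'Mexico', 'Panama'],
]

# zip stops at the shorter input without complaint, so check lengths first
assert len(sample_confederations) == len(sample_team_lists)

sample_by_confederation = {
    confederation: teams
    for confederation, teams in zip(sample_confederations, sample_team_lists)
}

print(sample_by_confederation['CONCACAF'])
```

Either form works; `zip` simply avoids the manual index lookup.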


In [345]:
# TODO 21: Save the teams_by_confederation dictionary as a NEW JSON file called 'teams-by-confederation.json'

# with open(... , ...) as fp:
#     json.dump(..., ..., indent=2)
    
with open('data/teams-by-confederation.json', 'w') as fp:
    json.dump(teams_by_confederation, fp, indent=2)
    
    
# Open the 'teams-by-confederation.json' file to see its content, play with the indent parameter

### Part 3: Stadiums and capacity (Optional)

<p align="center" style="text-align:center">
  <img src="https://drive.google.com/uc?export=view&id=1UM3qMIQhalJ3Z0RsMOIO5mpVLWeKkRe_" alt="Stadiums and Capacity">
</p>
<p align="center" style="text-align:center"> Source: Wikipedia </p>


In [448]:
# TODO: 
# 1. Find all the tables having same class as the required table
# 2. Inspect each of the returned tables to find the index of the required table
# wikitables = soup.find_all(..., attrs=...)

wikitables = soup.find_all('table', attrs={'class': 'wikitable'})

# i = 0
i = 3
stadiums_table = wikitables[i]
stadiums_table

In [350]:
# stadium names are inside <a> tags within <td> tags, whereas regions are in <th> tags
stadium_and_capacity_tags = stadiums_table.find_all('td')

# stadium tags are td with <a> having title
# capacity tags are td with <b> and td.text having 'Capacity'
# there are images also in td tags but their <a> has class "image"
stadiums = []
capacities = []

for tag in stadium_and_capacity_tags:
    # TODO: check for the string 'Capacity' in the tag's text
    # if ... in ...:
    if 'Capacity' in tag.text:
        capacities.append(tag.b.text)
        
        
    # TODO: check if class attribute of <a> tag is image
    # if not, it will be stadium name, add the stadium NAME to the 'stadiums' list
    # elif tag.a is not None and tag.a.get(...) != ['image']:
    elif tag.a is not None and tag.a.get('class') != ['image']:
        # stadiums.append(...)
        stadiums.append(tag.a.text)
        
assert len(stadiums) == len(capacities)


print(f"Stadiums list contains \n {stadiums} \n")
print(f"Capacities list contains \n {capacities} \n")

Stadiums list contains 
 ['Luzhniki Stadium', 'Otkritie Arena', 'Krestovsky Stadium', 'Fisht Olympic Stadium', 'Volgograd Arena', 'Rostov Arena', 'Nizhny Novgorod Stadium', 'Kazan Arena', 'Samara Arena', 'Mordovia Arena', 'Kaliningrad Stadium', 'Central Stadium'] 

Capacities list contains 
 ['78,011', '44,190', '64,468', '44,287', '43,713', '43,472', '43,319', '42,873', '41,970', '41,685', '33,973', '33,061'] 



In [352]:
# TODO: store in a dictionary stadiums as key and their capacity as values

stadiums_and_capacities = {stadium: capacity for stadium, capacity in zip(stadiums, capacities)}

'''
Output:

{'Luzhniki Stadium': '78,011',
 'Otkritie Arena': '44,190',
 'Krestovsky Stadium': '64,468',
 'Fisht Olympic Stadium': '44,287',
 'Volgograd Arena': '43,713',
 'Rostov Arena': '43,472',
 'Nizhny Novgorod Stadium': '43,319',
 'Kazan Arena': '42,873',
 'Samara Arena': '41,970',
 'Mordovia Arena': '41,685',
 'Kaliningrad Stadium': '33,973',
 'Central Stadium': '33,061'}
 
'''
pprint(stadiums_and_capacities)

{'Central Stadium': '33,061',
 'Fisht Olympic Stadium': '44,287',
 'Kaliningrad Stadium': '33,973',
 'Kazan Arena': '42,873',
 'Krestovsky Stadium': '64,468',
 'Luzhniki Stadium': '78,011',
 'Mordovia Arena': '41,685',
 'Nizhny Novgorod Stadium': '43,319',
 'Otkritie Arena': '44,190',
 'Rostov Arena': '43,472',
 'Samara Arena': '41,970',
 'Volgograd Arena': '43,713'}
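
The capacities are still strings with a thousands separator, so they cannot be compared or summed as numbers yet. A minimal sketch of the conversion, using three entries copied from the output above (`sample_capacities` and `capacities_as_int` are names introduced here):

```python
# three entries copied from the output above
sample_capacities = {
    'Luzhniki Stadium': '78,011',
    'Otkritie Arena': '44,190',
    'Central Stadium': '33,061',
}

# remove the thousands separator, then convert to int
capacities_as_int = {
    stadium: int(capacity.replace(',', ''))
    for stadium, capacity in sample_capacities.items()
}

# the key whose value is largest
largest = max(capacities_as_int, key=capacities_as_int.get)
print(largest, capacities_as_int[largest])
```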


<hr>

### Part 4: Group stage results

<p align="center" style="text-align:center">
  <img src="https://drive.google.com/uc?export=view&id=1p1Lwd9zR1uh7-vAyZj-37qeuYaqS4nwh" alt="Group Stage Table" >
</p>
<p align="center" style="text-align:center"> Source: Wikipedia </p>


In [353]:
# Let's get all tables with a class of 'wikitable' and find the ones we need

wikitables = soup.find_all('table', attrs={'class': 'wikitable'})

print(len(wikitables))

19


<p style="font-size:16px;font-family:helvetica">
We got 19 tables, and we need to pick out the 8 that correspond to the groups.

> Quick question: **How many ways can you select 8 tables from 19?**


And by the way, the above question is *COMPLETELY UNRELATED* to our purpose. Sorry, if you went on to calculate factorials :)
</p>
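
For anyone who did reach for the factorials: the count is the binomial coefficient "19 choose 8", and Python's standard library can compute it directly (`math.comb` needs Python 3.8 or newer).

```python
import math

# n choose k = n! / (k! * (n - k)!)
ways = math.comb(19, 8)
print(ways)

# same result straight from the factorial formula
assert ways == math.factorial(19) // (math.factorial(8) * math.factorial(11))
```

This prints 75582.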

In [354]:
# All of our tables have a 'Pos' column
# Let's filter the tables by this information

[i for i, wikitable in enumerate(wikitables) if 'Pos' in wikitable.text]

[4, 5, 6, 7, 8, 9, 10, 11, 17]

<p style="font-size:16px;font-family:helvetica">
So 9 tables contain 'Pos', and tables 4-11 form a run of 8 consecutive tables. Most likely those are the ones we need. 

NB: 
<b> Skeptics are welcome to print the contents of table 17 to check this hypothesis</b>
</p>

In [358]:
# Do you remember what was the last TODO number?
# Yeah, that's what I thought. Let's scroll up!



# TODO 22: select EIGHT tables 4 to 11
# all_group_table_tags = ...
all_group_table_tags = wikitables[4:12]

assert len(all_group_table_tags) == 8

In [449]:
# let's inspect contents of Group A table to get a hang of its structure
groupA = all_group_table_tags[0]

print(groupA.prettify())

In [360]:
# If you check the contents of groupA, each table row is a <tr> tag

# TODO 23: find all <tr> tags in groupA
groupA_table_rows = groupA.find_all('tr')

In [450]:
# the first row holds the attribute titles - we don't need 'Qualification'
# all the columns except 'Team' and 'Qualification' have the attribute name in th > abbr.title
first_row = groupA_table_rows[0]

print(first_row.prettify())

In [368]:
pprint(first_row.contents)

['\n',
 <th scope="col" width="28"><abbr title="Position">Pos</abbr>
</th>,
 '\n',
 <th scope="col" width="180">Team<div class="plainlinks hlist navbar mini" style="float:right"><span style="margin-right:-0.125em">[ </span><ul><li class="nv-view"><a href="/wiki/Template:2018_FIFA_World_Cup_Group_A_table" title="Template:2018 FIFA World Cup Group A table"><abbr title="View this template">v</abbr></a></li><li class="nv-talk"><a href="/wiki/Template_talk:2018_FIFA_World_Cup_Group_A_table" title="Template talk:2018 FIFA World Cup Group A table"><abbr title="Discuss this template">t</abbr></a></li><li class="nv-edit"><a class="external text" href="https://en.wikipedia.org/w/index.php?title=Template:2018_FIFA_World_Cup_Group_A_table&amp;action=edit"><abbr title="Edit this template">e</abbr></a></li></ul><span style="margin-left:-0.125em"> ]</span></div>
</th>,
 '\n',
 <th scope="col" width="28"><abbr title="Played">Pld</abbr>
</th>,
 '\n',
 <th scope="col" width="28"><abbr title="Won">W</abb

In [363]:
# let's print the text of each child of first_row
for col in first_row.contents:
    print(col.text)
    

AttributeError: 'NavigableString' object has no attribute 'text'

In [376]:
# print the type of the last 'col' which errored
print(col, type(col), '\n')

print(first_row.contents[0], type(first_row.contents[0]))

print(first_row.contents[1], type(first_row.contents[1]))

first_row.contents[0]


 <class 'bs4.element.NavigableString'> 


 <class 'bs4.element.NavigableString'>
<th scope="col" width="28"><abbr title="Position">Pos</abbr>
</th> <class 'bs4.element.Tag'>


'\n'

The **\n** string is a `bs4.element.NavigableString` object and is causing errors. We don't need that anyway, throw it out!
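
Comparing against `'\n'` works here, but there are other ways to skip the strings between tags. The sketch below shows two alternatives on a cut-down, hand-written header row (the HTML and the `sample_row` name are made up for this example): filtering children by type, and asking `find_all` for the cell tags directly.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# a cut-down header row, written by hand for this example
sample_row = BeautifulSoup(
    '<table><tr>\n'
    '<th><abbr title="Position">Pos</abbr>\n</th>\n'
    '<th>Team\n</th>\n'
    '</tr></table>',
    'html.parser',
).tr

# option 1: keep only children that are real tags
tags_only = [child for child in sample_row.contents if isinstance(child, Tag)]

# option 2: ask for the cell tags directly, without descending into them
cells = sample_row.find_all(['th', 'td'], recursive=False)

print([cell.get_text(strip=True) for cell in cells])
```

Both options also cope with whitespace strings that are not exactly `'\n'`, which the equality check would let through.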

In [384]:
# TODO 24: keep only the children of first_row that are not '\n' strings
# first_row_cleaned = [content for content in first_row if content != ...]
first_row_cleaned = [content for content in first_row if content != '\n']

        
for col in first_row_cleaned:
    print(col.find(text=True))

Pos
Team
Pld
W
D
L
GF
GA
GD
Pts
Qualification



In [393]:
# All titles except 'Team' and 'Qualification' are stored in the 'title' attribute of an <abbr> tag 
print(first_row_cleaned[0])
print(first_row_cleaned[0].find('abbr').get('title'))

<th scope="col" width="28"><abbr title="Position">Pos</abbr>
</th>
Position


In [398]:
titles = []

# TODO 25: Loop through tags in first_row_cleaned
# for row in ...:
for row in first_row_cleaned:

    # TODO 26: check for <abbr> tag
    # recursive=False - return the tag only if it is a direct child
    # if row.find(..., recursive=False):
    if row.find('abbr', recursive=False):
        titles.append(row.abbr.get('title'))
    else:
        # for 'Team' and 'Qualification' columns
        titles.append(row.find(text=True).strip())

In [399]:
print(titles)

['Position', 'Team', 'Played', 'Won', 'Drawn', 'Lost', 'Goals for', 'Goals against', 'Goal difference', 'Points', 'Qualification']


In [400]:
titles = titles[:-1]

print(len(titles), titles)

10 ['Position', 'Team', 'Played', 'Won', 'Drawn', 'Lost', 'Goals for', 'Goals against', 'Goal difference', 'Points']


In [401]:
# first row is the table headers and the rest has team info

team_rows = groupA_table_rows[1:]
len(team_rows)

4

In [403]:
# get first team
team_row = team_rows[0]

# remove NavigableStrings - '\n'
team_row_cleaned = [content for content in team_row if content!='\n']

pprint(team_row_cleaned)

[<th scope="row" style="text-align: center;font-weight: normal;background-color:#BBF3BB;">1
</th>,
 <td style="text-align: left; white-space:nowrap;font-weight: normal;background-color:#BBF3BB;"><span style="white-space:nowrap"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="630" data-file-width="945" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Flag_of_Uruguay.svg/23px-Flag_of_Uruguay.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Flag_of_Uruguay.svg/35px-Flag_of_Uruguay.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Flag_of_Uruguay.svg/45px-Flag_of_Uruguay.svg.png 2x" width="23"/> </span><a href="/wiki/Uruguay_national_football_team" title="Uruguay national football team">Uruguay</a></span>
</td>,
 <td style="font-weight: normal;background-color:#BBF3BB;">3
</td>,
 <td style="font-weight: normal;background-color:#BBF3BB;">3
</td>,
 <td style="font-weight: normal;background-color

In [404]:
teams_data = []

# get the row data 
team_data = [col.get_text(strip=True) for col in team_row_cleaned][:-1]

print(team_data)

['1', 'Uruguay', '3', '3', '0', '0', '5', '0', '+5', '9']
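
Each row is a list of strings. Pairing it with the header names gives a record that is easier to read, and the numeric columns can be converted on the way. This is a sketch only: `team_record` is a name introduced here, and the replacement of the typographic minus sign (U+2212) is an assumption about how negative goal differences may appear on the page.

```python
# header names and the Uruguay row, copied from the outputs above
titles = ['Position', 'Team', 'Played', 'Won', 'Drawn', 'Lost',
          'Goals for', 'Goals against', 'Goal difference', 'Points']
team_data = ['1', 'Uruguay', '3', '3', '0', '0', '5', '0', '+5', '9']

# pair each header with its value; every column except 'Team' is numeric
# (assumption: a typographic minus, if present, is swapped for '-' so int() accepts it)
team_record = {
    title: value if title == 'Team' else int(value.replace('\u2212', '-'))
    for title, value in zip(titles, team_data)
}

print(team_record['Team'], team_record['Goal difference'], team_record['Points'])
```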


#### Let's make some functions

In [432]:
def get_group_table_headers(header_row):
    '''
    Returns list of table headers
    
    Input: Tag corresponding to first row of a group data table
    Output: List of table headers
        ['Position', 'Team', 'Played', 'Won', 'Drawn', 'Lost', 'Goals for', 'Goals against', 
        'Goal difference', 'Points']
    '''
    header_row_cleaned = [col for col in header_row.contents if col!='\n']
    
    titles = []
    # TODO(optional): Find out difference b/w .text vs get_text() vs find(text=True)
    for row in header_row_cleaned:
        if row.find('abbr', recursive=False):
            titles.append(row.abbr.get('title'))
        else:
            titles.append(row.find(text=True).strip())

    # remove qualification header
    titles = titles[:-1]
    
    return titles


def get_group_team_data(team_row, num_headers):
    '''
    Returns list of column values given data row tag in group data table
    
    Input: 
        team_row - Tag corresponding to a team in group data table
        num_headers - Number of headers present
    
    Output: List of data related to that team in the group
        ['1', 'Uruguay', '3', '3', '0', '0', '5', '0', '+5', '9']
    '''
    
    team_data = [col.get_text(strip=True) for col in team_row.contents if col!='\n']
    
    # remove value of Qualification column if present
    if len(team_data) == (num_headers + 1):
        team_data = team_data[:-1]
    
    return team_data



def get_group_teams_data(group_table, num_headers):
    '''
    Returns data of all teams in a group table
    
    Input: 
        group_table - tags corresponding to each row in the group table except first - header
        num_headers - Number of headers present
    
    Output: List of data related to all teams in the group
    '''
    
    teams_data = []
    for team_row in group_table:
        # get data of a single team
        team_data = get_group_team_data(team_row, num_headers)
        
        # check no of cols
        assert len(team_data) == num_headers
        
        # add team data to the teams data list
        teams_data.append(team_data)
    
    
    return teams_data

    


def get_group_table_data(table_tag):
    '''
    Returns table data with headers
    
    Input: 
        table_tag - tag corresponding to a group table
    
    Output: full table data formatted
    '''
    
    
    table_rows = table_tag.find_all('tr')
    
    header_row = table_rows[0]
    team_rows = table_rows[1:]
    
    header_data = get_group_table_headers(header_row)
    
    num_headers = len(header_data)
    
    teams_data = get_group_teams_data(team_rows, num_headers)
    
    # TODO(optional): add header and team data to single list
#     print(team_rows)
    table_data = [header_data] + teams_data
    assert len(table_data) == 5
    assert table_data[0] == header_data
    assert table_data[1:] == teams_data
    
    return table_data


def get_group_tables_data(all_table_tags):
    '''
    Returns all of the tables' data
    
    Input: 
        all_table_tags - list of tags corresponding to each group table
    
    Output: List of table data (header row plus team rows), one entry per group
    '''
    
    group_tables = [get_group_table_data(table_tags) for table_tags in all_table_tags]
    
    return group_tables

In [434]:
# get the first team row (index 1; index 0 is the header) returned by get_group_table_data on the first table
(get_group_table_data(all_group_table_tags[0])[1])

['1', 'Uruguay', '3', '3', '0', '0', '5', '0', '+5', '9']

In [435]:
import numpy as np

# create a numpy array for storing all the group tables
all_group_tables = np.array(get_group_tables_data(all_group_table_tags))
all_group_tables.shape

(8, 5, 10)

<p style="font-size:16px;font-family:helvetica">
    <ul>
        <li>The <code>all_group_tables</code> numpy array contains the data of all <b><i>8</i></b> groups.</li>
        <li>Each group has <b><i>5</i></b> rows in its table - one header row and 4 team rows.</li>
        <li>Each row in the table has <b><i>10</i></b> attributes, such as Position, Team and Played.</li>
     </ul>
</p>
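
To get a feel for how indexing changes the shape of such a 3-D array, here is a small sketch on a dummy array of the same shape (`demo` is a stand-in, filled with numbers instead of scraped strings):

```python
import numpy as np

# dummy array with the same shape as all_group_tables
demo = np.arange(8 * 5 * 10).reshape(8, 5, 10)

print(demo[0].shape)         # one group's table
print(demo[:, 0, :].shape)   # an integer index drops the row axis
print(demo[:, :1, :].shape)  # a slice keeps the row axis
print(demo[:, 1:, :].shape)  # team rows only
```

An integer index removes its axis, while a slice keeps it, even when the slice selects a single row.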

In [436]:
group_names = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']

print(group_names)

['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']


In [438]:
# TODO(optional): find difference b/w repeat and tile
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.repeat.html
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.tile.html

repeat_group_names = np.repeat(group_names, 4)
print(repeat_group_names)

['A' 'A' 'A' 'A' 'B' 'B' 'B' 'B' 'C' 'C' 'C' 'C' 'D' 'D' 'D' 'D' 'E' 'E'
 'E' 'E' 'F' 'F' 'F' 'F' 'G' 'G' 'G' 'G' 'H' 'H' 'H' 'H']
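
For comparison, a short sketch of `np.repeat` next to `np.tile` on a smaller list (`letters` is made up for this example):

```python
import numpy as np

letters = ['A', 'B', 'C']

repeated = np.repeat(letters, 2)  # each element repeated in place
tiled = np.tile(letters, 2)       # the whole sequence repeated

print(repeated)
print(tiled)
```

`repeat` keeps copies of the same element next to each other, which is what we need to label four consecutive team rows with the same group name.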


In [439]:
def combine_group_tables(all_group_tables, group_names, num_teams_in_group=4):
        '''
        Combines all the group tables to a single elongated list
        
        '''
        
        assert isinstance(all_group_tables, np.ndarray)
        assert len(all_group_tables) == len(group_names)
        
        # save headers
        headers = all_group_tables[0][0]

        '''
        all_group_tables[:, :, :] - returns all the data in all_group_tables
        
        all_group_tables[0, :, :] - returns all data of the first group 
        all_group_tables[0, :, :].shape == (5, 10)
        
        all_group_tables[:, 0, :] - returns first row data(header) from each table
        all_group_tables[:, 0, :].shape == (8, 10)
        
        all_group_tables[:, 1:, :] - returns all row data except the first row(header) in each table
        all_group_tables[:, 1:, :].shape == (8, 4, 10)
        '''
        # remove header from all groups
        all_group_tables_no_header = all_group_tables[:, 1:, :]
        
        # all_group_tables_no_header.shape == (8, 4, 10)
        assert all_group_tables_no_header.shape[1] == num_teams_in_group        
        
        # TODO(optional): reshape all_group_tables_no_header to (32, 10)
        # create a single table of all groups
        all_group_single_table = all_group_tables_no_header.reshape(-1, len(headers))
        
        # TODO(optional): repeat group names
        # add group name to all tables
        
        
        group_names_repeated = np.repeat(group_names, num_teams_in_group)
        
        ### Add group_name as last col of each row
        
        # the current last column index is all_group_tables_no_header.shape[2] - 1
        # (indexing starts at 0), so inserting at shape[2] appends a new last column
        col_index_to_insert = all_group_tables_no_header.shape[2]
        
        # Insert group names at the set index across columns(axis=1)
        # https://docs.scipy.org/doc/numpy/reference/generated/numpy.insert.html
        table_with_group_names = np.insert(all_group_single_table, col_index_to_insert
                                                      , group_names_repeated, axis=1)
        
        # add Group to header list
        # reshape headers to (1, 11) instead of (11,) to avoid future problems
        headers = np.append(headers, 'Group').reshape(1, -1)
        
        # Concatenate headers and table data across rows(axis=0)
        group_table_with_header = np.append(headers, table_with_group_names, axis=0)
        
        return group_table_with_header

In [442]:
# get a single combined table with a header row and data of 32 teams in the group stage
group_stage_results = combine_group_tables(all_group_tables, group_names)
group_stage_results.shape

(33, 11)

In [None]:
# save data as a csv file
with open('data/group_stage_results.csv', 'w', encoding="utf-8") as fp:
    for row in group_stage_results:
        join_row_to_string = ','.join(row) + '\n'
        fp.write(join_row_to_string)
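
Joining with `','` works for the group tables because none of their values contain a comma. For data that does, such as the attendance figure from the infobox, the standard `csv` module quotes the field automatically. A sketch, with an in-memory buffer standing in for the file (`rows` and `buffer` are names made up for this example):

```python
import csv
import io

# one row uses a value that contains commas (the attendance figure from the infobox)
rows = [['Attribute', 'Value'],
        ['Matches played', '64'],
        ['Attendance', '3,031,768 (47,371 per match)']]

buffer = io.StringIO()  # stands in for an open file
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())

# reading it back gives the original rows, commas included
buffer.seek(0)
round_trip = list(csv.reader(buffer))
```

When writing to a real file with `csv.writer`, the Python documentation recommends opening it with `newline=''`.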

<hr>

<p style="font-size:30px;font-family:Arial;text-align:center">
    <i>Congratulations on completing the First part of the Assignment</i>
</p>

<p align="center" style="text-align:center">
  <img src="https://drive.google.com/uc?export=view&id=16zQgxLkXPK-IRCAZu9ZV271voD5kI576" alt="Congrats">
</p>
<p align="center" style="text-align:center"> Source: media.tenor.com </p>

<p style="font-size:20px;font-family:cursive">
    Don't forget to STRETCH before you move on to the <b>Data Analysis</b> section
</p>

<p align="center" style="text-align:center">
  <img src="https://drive.google.com/uc?export=view&id=17lciATYFeJfoENoKR_alSAf1G_pzBWSg" alt="CNN" width="720p">
</p>
<p align="center" style="text-align:center"> Source: media.tenor.com </p>

<hr>