**DESCRIPTION:**

This Jupyter Notebook collects the group-stage standings tables of the Qatar 2022 World Cup, one table per group. The tables are scraped from a web page with the pandas library and then saved to disk as a binary (pickle) file.

<img src='images/worldcup_qatar_2022.jpg'>

[**Image source**](https://as01.epimg.net/meristation/imagenes/2022/11/18/betech/1668795424_212051_1668799749_noticia_normal.jpg)

# 1. Data collection of World Cup Qatar 2022 groups with Pandas

In [1]:
# To install Pandas
# !pip install pandas

**Libraries**

In [2]:
# pandas can scrape tables from simple websites written in HTML.
import pandas as pd

In [3]:
# Pandas version
pd.__version__

'1.5.3'

In [4]:
# Uppercase letters, used to label the groups (A, B, C, ...)
from string import ascii_uppercase as alphabet

In [5]:
# To save dictionary as a binary file
import pickle

## 1.1 Web scraping with Pandas

In [6]:
# website link
website = 'https://web.archive.org/web/20221115040351/https://en.wikipedia.org/wiki/2022_FIFA_World_Cup'

In [7]:
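# Workaround for the following error:
#  URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: ...
# Run the next two lines only if that error appears. They turn off
# certificate verification for every HTTPS request made in this session.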
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
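
The assignment above changes HTTPS handling for the whole session. A narrower alternative, sketched here on the assumption that only this one download is affected, is to build an SSL context explicitly and pass it to `urllib.request.urlopen`. The helper names `build_context` and `fetch_html` are hypothetical and are not part of this notebook.

```python
import ssl
import urllib.request


def build_context(verify: bool = True) -> ssl.SSLContext:
    """Return an SSL context; verification is disabled only on request."""
    context = ssl.create_default_context()
    if not verify:
        # check_hostname must be switched off before verify_mode is relaxed.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def fetch_html(url: str, verify: bool = True) -> str:
    """Download a page as text using a context scoped to this one call."""
    with urllib.request.urlopen(url, context=build_context(verify)) as response:
        return response.read().decode("utf-8")


# Hypothetical usage (needs network access, so it is left commented out):
# import io
# html = fetch_html(website, verify=False)
# all_tables = pd.read_html(io.StringIO(html))
```

With this approach the relaxed verification applies to a single request, and every other connection keeps the default checks.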

In [8]:
# Save all the tables in the "all_tables" variable.
all_tables = pd.read_html(website)

In [9]:
# Number of tables
len(all_tables)

92
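
`pd.read_html` returns every `<table>` element on the page, which is why the list is so long. Its `match` parameter keeps only the tables that contain text matching a string or regular expression. The sketch below shows the effect on a small inline HTML snippet invented for illustration, not on the Wikipedia page; like any `read_html` call, it assumes an HTML parser such as `lxml` is installed.

```python
import io

import pandas as pd

# Two small tables invented for illustration; only the second one
# contains the text "Pts".
html = """
<table>
  <tr><th>Venue</th><th>City</th></tr>
  <tr><td>Stadium 1</td><td>City 1</td></tr>
</table>
<table>
  <tr><th>Pos</th><th>Team</th><th>Pts</th></tr>
  <tr><td>1</td><td>Qatar</td><td>0</td></tr>
  <tr><td>2</td><td>Ecuador</td><td>0</td></tr>
</table>
"""

# Without `match`, every table is returned.
every_table = pd.read_html(io.StringIO(html))

# With `match`, only tables containing the given text are returned.
standings_only = pd.read_html(io.StringIO(html), match="Pts")

print(len(every_table), len(standings_only))
print(standings_only[0])
```

On the real page, a pattern such as `"Pts"` would probably still return more than the eight group tables, so the result would need to be checked by hand.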

In [10]:
# Uncomment to print every table with its index and find the group tables

# for idx, table in enumerate(all_tables):
#     print('**'*40)
#     print(idx)
#     print(table)

# There are a total of 92 data tables.

In [11]:
# Group A table
all_tables[12]

Unnamed: 0,Pos,"Team.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:""[ ""}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:"" ]""}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar a>span,.mw-parser-output .navbar a>abbr{text-decoration:inherit}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-parser-output .navbar-ct-mini{font-size:114%;margin:0 4em}vte",Pld,W,D,L,GF,GA,GD,Pts,Qualification
0,1,Qatar (H),0,0,0,0,0,0,0,0,Advance to knockout stage
1,2,Ecuador,0,0,0,0,0,0,0,0,Advance to knockout stage
2,3,Senegal,0,0,0,0,0,0,0,0,
3,4,Netherlands,0,0,0,0,0,0,0,0,
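
The second column header in the output above is the word "Team" followed by stylesheet text that leaked in from the page's navigation bar. Section 1.2 fixes this by renaming the column by position. A name-based alternative is sketched below, assuming the leaked text always starts with `.mw-parser-output`; the header used here is a shortened stand-in for the real one.

```python
import pandas as pd

# Stand-in for a scraped group table: the second header is "Team"
# followed by leaked stylesheet text (shortened here for readability).
messy = pd.DataFrame(
    {
        "Pos": [1, 2],
        "Team.mw-parser-output .navbar{display:inline}vte": ["Qatar (H)", "Ecuador"],
        "Pld": [0, 0],
    }
)

# Keep only the part of each header that precedes the leaked CSS.
# rename() returns a new DataFrame and leaves `messy` unchanged.
cleaned = messy.rename(columns=lambda name: name.split(".mw-parser-output")[0])

print(list(cleaned.columns))
```

Headers that do not contain the marker pass through unchanged, so the same call can be applied to every table.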


In [12]:
# Locate the remaining group tables (each one is 7 tables after the previous group)
all_tables[19]   # Group B
all_tables[26]   # Group C
all_tables[33]   # Group D
all_tables[40]   # Group E
all_tables[47]   # Group F
all_tables[54]   # Group G
all_tables[61]   # Group H
print('All groups found')

All groups found
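
The indices above were found by inspection, so they are tied to this exact snapshot of the page. As a rough sketch of a less position-dependent check, tables can be selected by content instead. The rule used here (a table with `Pld` and `Pts` columns and exactly four rows) is an assumption, and `looks_like_group_table` is a hypothetical helper; the list `tables` is a stand-in for `all_tables`.

```python
import pandas as pd


def looks_like_group_table(table: pd.DataFrame) -> bool:
    """Heuristic (an assumption): standings columns present and four teams."""
    columns = {str(name) for name in table.columns}
    return {"Pld", "Pts"}.issubset(columns) and len(table) == 4


# Stand-in for `all_tables`: one unrelated table and two group-like tables.
tables = [
    pd.DataFrame({"Stadium": ["Stadium 1"], "Capacity": [40000]}),
    pd.DataFrame({"Team": list("ABCD"), "Pld": [0] * 4, "Pts": [0] * 4}),
    pd.DataFrame({"Team": list("EFGH"), "Pld": [0] * 4, "Pts": [0] * 4}),
]

# Collect the positions of every table that passes the heuristic.
group_indices = [i for i, t in enumerate(tables) if looks_like_group_table(t)]
print(group_indices)
```

On the real page the heuristic may also match tables that are not group standings, so the returned indices should still be compared with the ones found by hand.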


In [13]:
# Index of the last group table (Group H): 7 groups after Group A, 7 tables apart
12 + 7*7

61
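
The same arithmetic gives every group index at once: the group tables start at 12 and are 7 positions apart. The sketch below uses illustrative variable names (`first_index`, `step`, `n_groups`) to show how `range` and `zip` pair each index with a group letter.

```python
from string import ascii_uppercase as alphabet

# Group A is at index 12 and each following group is 7 tables later.
first_index, step, n_groups = 12, 7, 8
last_index = first_index + step * (n_groups - 1)

# range() excludes its stop value, so stop one past the last index.
indices = list(range(first_index, last_index + 1, step))

# zip() stops at the shorter input, so only the letters A to H are used.
letter_to_index = dict(zip(alphabet, indices))
print(letter_to_index)
```

Because `zip` stops at the shorter sequence, the full 26-letter alphabet can be passed in without slicing it first.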

## 1.2 Save groups in a dictionary

In [14]:
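# Empty dictionary
dict_groups = {}

# Build a dictionary with all World Cup groups.
# rename() and drop() return new DataFrames, so all_tables is left untouched
# and re-running this cell does not remove further columns.
for letter, i in zip(alphabet, range(12, 62, 7)):
    df = all_tables[i]
    df = df.rename(columns={df.columns[1]: 'Team'})   # Column 1 holds the team name; give it a clean header
    df = df.drop(columns=df.columns[-1])   # Drop the last column ('Qualification')
    dict_groups[f'Group {letter}'] = df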

In [15]:
dict_groups.keys()

dict_keys(['Group A', 'Group B', 'Group C', 'Group D', 'Group E', 'Group F', 'Group G', 'Group H'])

In [16]:
# Showing last table
dict_groups['Group H']

Unnamed: 0,Pos,Team,Pld,W,D,L,GF,GA,GD,Pts
0,1,Portugal,0,0,0,0,0,0,0,0
1,2,Ghana,0,0,0,0,0,0,0,0
2,3,Uruguay,0,0,0,0,0,0,0,0
3,4,South Korea,0,0,0,0,0,0,0,0


## 1.3 Export dictionary as a binary file

In [17]:
with open('data/Dictionary_of_groups', 'wb') as output:   #wb: write as binary
    pickle.dump(dict_groups, output)

In [18]:
# Redundant: the with block above has already closed the file, so this call does nothing.
output.close()
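
A quick way to confirm the export worked is to load the file back and compare it with the original. The sketch below does a round trip with a small stand-in for `dict_groups` and writes to a temporary directory, so it makes no assumption about the `data/` folder.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for `dict_groups` with a single small table.
groups = {"Group A": pd.DataFrame({"Team": ["Qatar (H)", "Ecuador"], "Pts": [0, 0]})}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "Dictionary_of_groups"

    with open(path, "wb") as output:  # wb: write as binary
        pickle.dump(groups, output)

    with open(path, "rb") as source:  # rb: read as binary
        restored = pickle.load(source)

# equals() compares both the values and the column types of the two tables.
print(restored["Group A"].equals(groups["Group A"]))
```

`pickle.load` can execute arbitrary code while reading a file, so it should only be used on files from a trusted source.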