<a href="https://colab.research.google.com/github/JuanjoRestrepo/Qatar-2022/blob/main/WorldCup_Qatar2022_Predictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
from string import ascii_uppercase as alphabet
import pickle
import time

# **1. Data Collection**

In [2]:
#all_tables = pd.read_html('https://en.wikipedia.org/wiki/2022_FIFA_World_Cup')
all_tables = pd.read_html('https://web.archive.org/web/20221115040351/https://en.wikipedia.org/wiki/2022_FIFA_World_Cup')

## **1.1 Getting the groups**

Since there are 8 groups (A to H) and, on the linked page, their tables appear every 7 positions starting at index 12, we can work out the index of each group's table as follows:
- The `range` stop is 12 + 7*8 = 68 (exclusive), so the last table read is at index 12 + 7*7 = 61.
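
The arithmetic above can be sketched as a small helper. The function name `group_table_indices` and its parameters are hypothetical; the defaults (first table at 12, step of 7, 8 groups) are the values observed on the archived page and may not hold for other snapshots.

```python
from string import ascii_uppercase


def group_table_indices(first=12, step=7, n_groups=8):
    """Map each group label to the position of its table in all_tables."""
    # The stop value is exclusive: 12 + 7 * 8 = 68, so the last index is 61.
    stop = first + step * n_groups
    return {
        f"Group {letter}": index
        for letter, index in zip(ascii_uppercase, range(first, stop, step))
    }


indices = group_table_indices()
```

Keeping the three numbers as parameters makes it easier to adjust the lookup if the page layout shifts.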

In [3]:
all_tables[12]
all_tables[19]
all_tables[26]
all_tables[61]

Unnamed: 0,Pos,Teamvte,Pld,W,D,L,GF,GA,GD,Pts,Qualification
0,1,Portugal,0,0,0,0,0,0,0,0,Advance to knockout stage
1,2,Ghana,0,0,0,0,0,0,0,0,Advance to knockout stage
2,3,Uruguay,0,0,0,0,0,0,0,0,
3,4,South Korea,0,0,0,0,0,0,0,0,


As we can see, the last group (H) is at index 61, so we will loop over the range computed above and make a few changes to the format of the original table.

We will rename the 'Teamvte' column to just 'Team'. There is one exception: in the first group (A) the team column name starts with 'Team.mw' followed by CSS text that leaked into the header, so we first inspect that column name:
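
As an alternative to renaming by position, a prefix check handles both header variants. This is only a sketch: `normalize_team_header` is a hypothetical helper, and the second sample header is a shortened stand-in for the long name found in Group A.

```python
def normalize_team_header(column_name):
    """Collapse any header that starts with 'Team' down to plain 'Team'."""
    if column_name.startswith("Team"):
        return "Team"
    return column_name


# Shortened stand-ins for the headers found in the group tables.
headers = ["Pos", "Teamvte", "Team.mw-parser-output .navbar{display:inline}vte", "Pld"]
cleaned = [normalize_team_header(header) for header in headers]
```

Because pandas accepts a function as the `columns` argument of `DataFrame.rename`, the same helper could be passed as `df.rename(columns=normalize_team_header)`.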

In [4]:
all_tables[12].columns[1]

'Team.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:"[ "}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:" ]"}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar a>span,.mw-parser-output .navbar a>abbr{text-decoration:inherit}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-parser-output .navbar-ct-mini{font-size:114%;margin:0 4em}vte'

We map each table index to its corresponding group.

In [5]:
for letter, i in zip(alphabet, range(12, 68, 7)):
  print(letter, i)

A 12
B 19
C 26
D 33
E 40
F 47
G 54
H 61


In [6]:
dict_tables = {}
for letter, i in zip(alphabet, range(12, 68, 7)):
  df = all_tables[i]
  df.rename(columns={df.columns[1]: 'Team'}, inplace=True)
  df.pop('Qualification')
  dict_tables[f'Group {letter}'] = df

In [7]:
dict_tables.keys()

dict_keys(['Group A', 'Group B', 'Group C', 'Group D', 'Group E', 'Group F', 'Group G', 'Group H'])

In [8]:
dict_tables['Group H']

Unnamed: 0,Pos,Team,Pld,W,D,L,GF,GA,GD,Pts
0,1,Portugal,0,0,0,0,0,0,0,0
1,2,Ghana,0,0,0,0,0,0,0,0
2,3,Uruguay,0,0,0,0,0,0,0,0
3,4,South Korea,0,0,0,0,0,0,0,0


## **1.2 Exporting our dictionary**
We open a file named 'dict_table' in binary write mode (the file handle is called `output`) and dump the dictionary of tables into it.

In [9]:
with open('dict_table', 'wb') as output:
  pickle.dump(dict_tables, output)
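
A later step would read the file back with `pickle.load`. The round trip below is a minimal sketch: it uses a plain dictionary of lists as a stand-in for `dict_tables` and writes to a temporary directory, both of which are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for dict_tables: plain lists instead of DataFrames (illustrative only).
sample_tables = {"Group H": ["Portugal", "Ghana", "Uruguay", "South Korea"]}

# Write to a temporary directory so the example does not touch the working folder.
path = os.path.join(tempfile.mkdtemp(), "dict_table")

with open(path, "wb") as output:
    pickle.dump(sample_tables, output)

# Reading uses binary mode as well ('rb' mirrors the 'wb' used for writing).
with open(path, "rb") as source:
    restored = pickle.load(source)
```

Pickle files can execute code when loaded, so only unpickle files you created or otherwise trust.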

# **2. Extracting the data for every World Cup from 1930 to 2018 and the 2022 matches**

## **2.1 Matches from 1930 to 2018**

We will use web scraping to get the matches from **1930** to **2018** with 'requests' and bs4.

In [10]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

# Extract the data for every World Cup from 1930 to 2018
# The scraping is done with 'requests' and bs4
years = [1930, 1934, 1938, 1950, 1954, 1958, 1962, 1966, 1970, 1974,
         1978, 1982, 1986, 1990, 1994, 1998, 2002, 2006, 2010, 2014,
         2018]

def get_matches(year):
  if str(year) == '2022':  # accept the year as int or str
    web =  f'https://web.archive.org/web/20221115040351/https://en.wikipedia.org/wiki/{year}_FIFA_World_Cup'
  else:
    web = f'https://en.wikipedia.org/wiki/{year}_FIFA_World_Cup'

  response = requests.get(web)
  content = response.text  # HTML content of the page
  soup = BeautifulSoup(content, 'lxml')

  matches = soup.find_all('div', class_='footballbox')

  home = []
  score = []
  away = []

  for game in matches:
      home.append(game.find('th', class_='fhome').get_text())
      score.append(game.find('th', class_='fscore').get_text())
      away.append(game.find('th', class_='faway').get_text())

  dict_football = {'home': home,
                  'score': score,
                  'away': away}

  df_football = pd.DataFrame(dict_football)
  df_football['year'] = year
  return df_football

# Historical data for every World Cup played
fifa = [get_matches(year) for year in years]
df_fifa = pd.concat(fifa, ignore_index=True)
df_fifa.to_csv('fifa_worldcup_historical_data.csv', index=False)

# Data for the Qatar 2022 World Cup
df_fixture = get_matches('2022')
df_fixture.to_csv('fifa_worldcup_fixture.csv', index=False)
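
Values returned by `get_text()` may need light cleaning before use: Wikipedia's team cells typically place a non-breaking space next to the flag icon, and scores are written with an en dash. The helpers below are a sketch only. `clean_team` and `split_score` are hypothetical names, and the handling of suffixes such as "(a.e.t.)" or placeholders such as "Match 1" is an assumption about how those cells look.

```python
import re

# Home and away goals around a dash; accepts the en dash and a plain hyphen.
SCORE_RE = re.compile(r"(\d+)\s*[–-]\s*(\d+)")


def clean_team(name):
    """Trim surrounding whitespace, including non-breaking spaces (U+00A0)."""
    return name.replace("\xa0", " ").strip()


def split_score(score):
    """Return (home_goals, away_goals), or None when no score is present."""
    found = SCORE_RE.search(score)
    if found is None:
        return None
    return int(found.group(1)), int(found.group(2))
```

Returning `None` for cells without a score keeps unplayed fixtures distinguishable from 0–0 draws.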

Using Chromedriver, we scrape the pages to build a DataFrame of every World Cup up to 2018.

Note: the web scraping was run both in Google Colab and locally, in a separate script called "selenium-world-cup.py".

In [11]:
!pip install selenium
!apt update
!apt install chromium-chromedriver

Collecting selenium
  Downloading selenium-4.18.1-py3-none-any.whl (10.0 MB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.24.0-py3-none-any.whl (460 kB)
Collecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.11.1-py3-none-any.whl (17 kB)
Collecting outcome (from trio~=0.17->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl (10 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl (24 kB)
Collecting h11<1,>=0.9.0 (from wsproto>=0.14->trio-websocket~=0.9->selenium)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)

In [16]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import pandas as pd
import time

#path = 'C:\Users\Juan Jose Restrepo\Desktop\WC 2022\chromedriver-win64\chromedriver.exe'
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=options)

def getMissingData(year):
    web = f'https://en.wikipedia.org/wiki/{year}_FIFA_World_Cup'
    print(f'\nGetting the Matches of WC {year}')

    # Parent node: each match is a <tr> row holding three <th> cells,
    # <th class="fhome"> (home team), <th class="fscore"> (score) and
    # <th class="faway"> (away team).
    # The XPath //th[@class="fhome"]/.. selects that parent row.

    # Load the page and find all rows containing match information
    driver.get(web)
    matches = driver.find_elements(by='xpath', value='//th[@class="fhome"]/..')

    # Lists that will hold the match data
    home = []
    score = []
    away = []

    # Walk through the match rows and split them into home team, score, and away team
    for match in matches:
        home.append(match.find_element(by='xpath', value='./th[1]').text)
        score.append(match.find_element(by='xpath', value='./th[2]').text)
        away.append(match.find_element(by='xpath', value='./th[3]').text)

    # Build a DataFrame from the lists
    data = {'Home': home, 'Score': score, 'Away': away}
    df_football = pd.DataFrame(data)
    df_football['year'] = year
    time.sleep(2)

    return df_football



years = [1930, 1934, 1938, 1950, 1954, 1958, 1962, 1966, 1970, 1974,
         1978, 1982, 1986, 1990, 1994, 1998, 2002, 2006, 2010, 2014,
         2018]

# Collect the DataFrame of every World Cup in a list
fifa = [getMissingData(year) for year in years]
# Close the WebDriver
driver.quit()

# Concatenate all the DataFrames into one
df_fifa = pd.concat(fifa, ignore_index=True)
df_fifa.to_csv('fifa_worldcup_missing_data.csv', index=False)

print('Web Scraping Done!')


Getting the Matches of WC 1930

Getting the Matches of WC 1934

Getting the Matches of WC 1938

Getting the Matches of WC 1950

Getting the Matches of WC 1954

Getting the Matches of WC 1958

Getting the Matches of WC 1962

Getting the Matches of WC 1966

Getting the Matches of WC 1970

Getting the Matches of WC 1974

Getting the Matches of WC 1978

Getting the Matches of WC 1982

Getting the Matches of WC 1986

Getting the Matches of WC 1990

Getting the Matches of WC 1994

Getting the Matches of WC 1998

Getting the Matches of WC 2002

Getting the Matches of WC 2006

Getting the Matches of WC 2010

Getting the Matches of WC 2014

Getting the Matches of WC 2018
Web Scraping Done!
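
The XPath used in `getMissingData`, `//th[@class="fhome"]/..`, finds every home-team cell and then steps up to its parent row. The sketch below illustrates that selection offline with the standard library's `xml.etree.ElementTree` instead of Selenium. The markup is a hand-written, simplified stand-in rather than the real page, and "Team A" and "Team B" are made-up names.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed stand-in for the match rows (not the real page markup).
SAMPLE = """
<table>
  <tr><th class="fhome">Italy</th><th class="fscore">0–0</th><th class="faway">Poland</th></tr>
  <tr><th class="fhome">Team A</th><th class="fscore">2–1</th><th class="faway">Team B</th></tr>
  <tr><th>Header row without match cells</th></tr>
</table>
"""

root = ET.fromstring(SAMPLE.strip())

# ".." steps from each matching <th class="fhome"> up to its parent <tr>.
rows = root.findall(".//th[@class='fhome']/..")

# Read the three cells of every row by position, like ./th[1], ./th[2], ./th[3].
matches = [tuple(cell.text for cell in row.findall("th")) for row in rows]
```

Selecting the parent row first keeps the home team, score, and away team of one match together, and the row without an `fhome` cell is skipped.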


Data for the Qatar 2022 World Cup

In [None]:
df_fixture = get_matches('2022')
df_fixture.to_csv('fifa_worldcup_fixture.csv', index=False)