# FIFA World Cup 2018 Predictions

## Introduction
We will predict the winning team of the 2018 FIFA World Cup by collecting, organizing and analyzing data! We will use data from the 2016-17 season and two Python libraries (**BeautifulSoup** and **Pandas**) to accomplish our task: BeautifulSoup (BS) for scraping data off the web and Pandas for data manipulation.

## Methods and Processes
**Part 1: The Methodology**
<br>
We will grade each player in the top European clubs on a merit basis: a goal earns the player 3 points and an assist earns 1. Each country then inherits the points of its players. Our prediction will be the country with the most points!
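
The scoring rule above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic: the helper names `player_score` and `country_scores` and the sample tuples are hypothetical and do not appear in the notebook.

```python
from collections import defaultdict

GOAL_POINTS = 3    # points per goal, as stated in the methodology
ASSIST_POINTS = 1  # points per assist, as stated in the methodology


def player_score(goals, assists):
    """Return the merit score of a single player."""
    return goals * GOAL_POINTS + assists * ASSIST_POINTS


def country_scores(players):
    """Sum player scores per country.

    `players` is an iterable of (country, goals, assists) tuples.
    """
    totals = defaultdict(int)
    for country, goals, assists in players:
        totals[country] += player_score(goals, assists)
    return dict(totals)


# Hypothetical sample data, for illustration only
sample_players = [
    ("Country A", 10, 4),   # 10*3 + 4*1 = 34
    ("Country A", 2, 7),    # 2*3 + 7*1 = 13
    ("Country B", 12, 0),   # 12*3 + 0*1 = 36
]
scores = country_scores(sample_players)
print(scores)
print(max(scores, key=scores.get))  # country with the most points
```

With these made-up numbers, Country A collects 47 points from two players and beats Country B, whose single player contributes 36.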
<br><br>
**Part 2: Data to be Analyzed**
<br>
Our data will come from the top clubs in Europe:
1. Premier League
    * Chelsea (winners)
    * Tottenham Hotspur
    * Manchester City
    * Liverpool
    * Arsenal
    * Manchester United
    * Everton
<br> <br>
2. La Liga
    * Real Madrid (winners)
    * Barcelona
    * Atletico Madrid
<br> <br>  
3. Bundesliga
    * Bayern Munich (winners)
    * Dortmund 
<br> <br>   
4. Ligue 1
    * Monaco (winners)
    * Paris Saint Germain
<br> <br>    
5. Serie A
    * Juventus (winners)
    * Roma 
    
## Implementation
For each team above we will use BS to scrape the players in its first-team squad. We will manipulate the scraped data so that each player in our dataset has a name, country, club, position, club goals and club assists. Thereafter we will assign each player a score and add it to their country's score. Then we will analyze the results and predict a winner!

**In the following cell we will import all of our dependencies**

In [40]:
# import dependencies

from bs4 import BeautifulSoup as soup   # used for web scraping (traversing html)
from urllib.request import urlopen   # web-client for grabbing webpages
import pandas as pd   # used for manipulating data 

**In the following cell we will scrape data off Wikipedia on players and their country, club and position**

In [41]:
def make_dataset_of_players_info():

    # urls needed for scraping
    pl_teams_url = [
        'https://en.wikipedia.org/wiki/Real_Madrid_C.F.',
        'https://en.wikipedia.org/wiki/FC_Barcelona',
        'https://en.wikipedia.org/wiki/Atl%C3%A9tico_Madrid',
        'https://en.wikipedia.org/wiki/AS_Monaco_FC',
        'https://en.wikipedia.org/wiki/Paris_Saint-Germain_F.C.',
        'https://en.wikipedia.org/wiki/Juventus_F.C.',
        'https://en.wikipedia.org/wiki/A.S._Roma',
        'https://en.wikipedia.org/wiki/FC_Bayern_Munich',
        'https://en.wikipedia.org/wiki/Borussia_Dortmund',
        'https://en.wikipedia.org/wiki/Chelsea_F.C.',
        'https://en.wikipedia.org/wiki/Tottenham_Hotspur_F.C.',
        'https://en.wikipedia.org/wiki/Manchester_City_F.C.',
        'https://en.wikipedia.org/wiki/Liverpool_F.C.',
        'https://en.wikipedia.org/wiki/Arsenal_F.C.',
        'https://en.wikipedia.org/wiki/Manchester_United_F.C.',
        'https://en.wikipedia.org/wiki/Everton_F.C.'
    ]

    ''' used for storing names of entities for the players info dataset '''
    country = []
    position = []
    name = []
    club = []

    for url in pl_teams_url:

        ''' pull data from web-page'''
        # establish connection and download the html file
        u_client = urlopen(url)
        # offloads content into a variable
        page_html = u_client.read()
        # close connection with client
        u_client.close()

        ''' parse the html via BS '''
        page_soup = soup(page_html, 'html.parser')
        players_soup = page_soup.find_all('tr', {'class': 'vcard agent'})
        players_soup = players_soup[0: 24]
        for player_info in players_soup:
            country.append(player_info.find('span', {'class': 'flagicon'}).a['title'])
            position.append(player_info.find('td', {'style': 'text-align: center;'}).a.text)
            try:
                name.append(player_info.find('span', {'class': 'fn'}).a.text)
            except AttributeError:
                # the name is not wrapped in a link, so read the span text directly
                name.append(player_info.find('span', {'class': 'fn'}).text)
            club.append(page_soup.h1.text)
            
            
    ''' write data to csv file '''
    df = pd.DataFrame({
        'Name':name, 
        'Country': country,
        'Club': club,
        'Position': position})
    df = df[['Name', 'Country', 'Club', 'Position']]
    df.to_csv(path_or_buf='players_info.csv', index=False)
    print(df.head())
    

''' scrape data off the web and make a dataset of players and their information '''
make_dataset_of_players_info()


             Name     Country              Club Position
0    Keylor Navas  Costa Rica  Real Madrid C.F.       GK
1   Dani Carvajal       Spain  Real Madrid C.F.       DF
2   Jesús Vallejo       Spain  Real Madrid C.F.       DF
3    Sergio Ramos       Spain  Real Madrid C.F.       DF
4  Raphaël Varane      France  Real Madrid C.F.       DF
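
To see what the row parsing does without hitting the network, here is a minimal sketch that runs the same selectors against a small inline HTML snippet. The markup, the function name `parse_squad_rows` and the player and country names are hypothetical; the snippet only imitates the structure the scraper above expects and is not a copy of any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the squad-table rows the scraper expects
SAMPLE_HTML = """
<table>
  <tr class="vcard agent">
    <td style="text-align: center;"><a href="#">GK</a></td>
    <td><span class="flagicon"><a href="#" title="Country A">flag</a></span></td>
    <td><span class="fn"><a href="#">Player One</a></span></td>
  </tr>
  <tr class="vcard agent">
    <td style="text-align: center;"><a href="#">DF</a></td>
    <td><span class="flagicon"><a href="#" title="Country B">flag</a></span></td>
    <td><span class="fn">Player Two</span></td>
  </tr>
</table>
"""


def parse_squad_rows(html):
    """Return one dict per player row found in `html`."""
    page = BeautifulSoup(html, "html.parser")
    players = []
    for row in page.find_all("tr", {"class": "vcard agent"}):
        name_span = row.find("span", {"class": "fn"})
        # Use the link text when the name is linked, otherwise the span text
        name_tag = name_span.a if name_span.a is not None else name_span
        position_cell = row.find("td", {"style": "text-align: center;"})
        players.append({
            "Name": name_tag.get_text(strip=True),
            "Country": row.find("span", {"class": "flagicon"}).a["title"],
            "Position": position_cell.a.get_text(strip=True),
        })
    return players


rows = parse_squad_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

The second row has no link inside the `fn` span, which is the case the `try`/`except AttributeError` in the cell above is there to handle.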


**In the following cell we will scrape data off espnfc on the players who scored the most goals in each league**

In [42]:

def make_dataset_of_top_goalscorers():
    goals_scored_in_top_leagues = [
        'http://www.espnfc.com/english-premier-league/23/statistics/scorers?season=2016',
        'http://www.espnfc.com/spanish-primera-division/15/statistics/scorers?season=2016',
        'http://www.espnfc.com/german-bundesliga/10/statistics/scorers?season=2016',
        'http://www.espnfc.com/french-ligue-1/9/statistics/scorers',
        'http://www.espnfc.com/italian-serie-a/12/statistics/scorers?season=2016'
    ]
    
    
    names_of_players = []
    names_of_clubs = []
    goals_scored = []

    for league in goals_scored_in_top_leagues:

        ''' pull data from the webpage '''
        # establish connection and download html file
        u_client = urlopen(league)
        # offload content into variable 
        page_html = u_client.read()
        # close connection with client
        u_client.close()

        ''' parse html via BS '''
        goals_scorers_soup = soup(page_html, 'html.parser')
        stats = goals_scorers_soup.find('div', {'class': 'stats-top-scores'})
        players_soup = stats.findAll('td', {'headers': 'player'})
        clubs_soup = stats.findAll('td', {'headers': 'team'})
        goals_soup = stats.findAll('td', {'headers': 'goals'})

        ''' sort the data into columns '''
        for player, team, goals in zip(players_soup, clubs_soup, goals_soup):
            names_of_players.append(player.text)
            names_of_clubs.append(team.text)
            goals_scored.append(goals.text)
            
    ''' write data to csv file'''
    df = pd.DataFrame({
        'Name': names_of_players,
        'Club': names_of_clubs,
        'Goals Scored': goals_scored
    })
    df = df[['Name', 'Club', 'Goals Scored']]
    df.to_csv(path_or_buf="top_goalscorers.csv", index=False)
    print(df.head())
            

''' scrape data off the web and make a dataset of top goal scorers'''
make_dataset_of_top_goalscorers()
    

             Name               Club Goals Scored
0      Harry Kane  Tottenham Hotspur           29
1   Romelu Lukaku            Everton           25
2  Alexis Sánchez            Arsenal           24
3   Sergio Agüero    Manchester City           20
4     Diego Costa            Chelsea           20
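
The cell above selects table cells by their `headers` attribute and stores the goal counts as text. The sketch below shows the same selection on a hypothetical inline table and converts the counts to integers as they are read. The markup, the function name `parse_scorers` and the names in the table are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the statistics table the scraper expects
SAMPLE_STATS = """
<div class="stats-top-scores">
  <table>
    <tr>
      <td headers="player">Player One</td>
      <td headers="team">Club X</td>
      <td headers="goals">12</td>
    </tr>
    <tr>
      <td headers="player">Player Two</td>
      <td headers="team">Club Y</td>
      <td headers="goals">9</td>
    </tr>
  </table>
</div>
"""


def parse_scorers(html):
    """Return (name, club, goals) tuples, with goals converted to int."""
    page = BeautifulSoup(html, "html.parser")
    stats = page.find("div", {"class": "stats-top-scores"})
    players = stats.find_all("td", {"headers": "player"})
    clubs = stats.find_all("td", {"headers": "team"})
    goals = stats.find_all("td", {"headers": "goals"})
    # zip pairs the n-th player cell with the n-th club and goals cell
    return [
        (p.get_text(strip=True), c.get_text(strip=True), int(g.get_text(strip=True)))
        for p, c, g in zip(players, clubs, goals)
    ]


scorers = parse_scorers(SAMPLE_STATS)
print(scorers)
```

Converting at scrape time is optional here, since the counts are written to CSV and pandas infers a numeric column when the file is read back.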


**In the following cell we will scrape data off espnfc on the players who gave the most assists in each league**

In [43]:

def make_dataset_of_players_with_most_assists():
    assists_made_in_top_leagues = [
        'http://www.espnfc.com/english-premier-league/23/statistics/assists?season=2016',
        'http://www.espnfc.com/spanish-primera-division/15/statistics/assists?season=2016',
        'http://www.espnfc.com/german-bundesliga/10/statistics/assists?season=2016',
        'http://www.espnfc.com/french-ligue-1/9/statistics/assists',
        'http://www.espnfc.com/italian-serie-a/12/statistics/assists?season=2016'
    ]
    
    ''' storing variables needed for converting to csv file '''
    names_of_players = []
    names_of_clubs = []
    assists_given = []

    for league in assists_made_in_top_leagues:

        ''' pull data from the webpage '''
        # establish connection and download html file
        u_client = urlopen(league)
        # offload content into variable 
        page_html = u_client.read()
        # close connection with client
        u_client.close()

        ''' parse html via BS '''
        assists_given_soup = soup(page_html, 'html.parser')
        stats = assists_given_soup.find('div', {'id': 'stats-top-assists'})
        players_soup = stats.findAll('td', {'headers': 'player'})
        clubs_soup = stats.findAll('td', {'headers': 'team'})
        # note: assists are named as goals in the html, hence the following code
        assists_soup = stats.findAll('td', {'headers': 'goals'})

        
        ''' sort the data into columns '''
        for player, team, assists in zip(players_soup, clubs_soup, assists_soup):
            names_of_players.append(player.text)
            names_of_clubs.append(team.text)
            assists_given.append(assists.text)
        
          
    ''' write data to csv file'''
    df = pd.DataFrame({
        'Name': names_of_players,
        'Club': names_of_clubs,
        'Assists Given': assists_given
    })
    df = df[['Name', 'Club', 'Assists Given']]
    df.to_csv(path_or_buf="top_assists.csv", index=False)
    print(df.head())
    

''' scrape data off the web and make a dataset of top players who gave the most assists '''
make_dataset_of_players_with_most_assists()
    

                Name               Club Assists Given
0    Kevin De Bruyne    Manchester City            18
1  Christian Eriksen  Tottenham Hotspur            15
2   Gylfi Sigurdsson       Swansea City            13
3      Cesc Fàbregas            Chelsea            12
4     Alexis Sánchez            Arsenal            10
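
All three scraping cells repeat the same open, read and close steps. One possible way to factor that out is a small helper built on a context manager. The name `fetch_html` and the `timeout` default are hypothetical choices, not part of the notebook.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=30):
    """Download `url` and return the raw response body as bytes."""
    # The context manager closes the connection even if read() raises
    with urlopen(url, timeout=timeout) as client:
        return client.read()


# A data: URL keeps this demonstration offline;
# real calls would pass one of the http URLs from the lists above
demo_page = fetch_html("data:text/plain,hello%20world")
print(demo_page)
```

The returned bytes can be passed straight to `soup(page_html, 'html.parser')`, just like the result of `u_client.read()` in the cells above.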


**In the following cell we will merge all the data we have collected**

In [44]:

def merge_collected_data():
    
    ''' organize data for manipulation '''
    # import general player info
    players_info = pd.read_csv('players_info.csv', encoding='cp1252')
    players_info_names = players_info['Name']
    players_info_country = players_info['Country']
    players_info_club = players_info['Club']
    
    # import the goalscorers
    goalscorer_info = pd.read_csv('top_goalscorers.csv', encoding='cp1252')
    goalscorer_info_names = goalscorer_info['Name']    
    goalscorer_info_club = goalscorer_info['Club']
    goalscorer_info_goals = goalscorer_info['Goals Scored']
    
    # import the assist givers
    assists_info = pd.read_csv('top_assists.csv', encoding='cp1252')
    assists_info_names = assists_info['Name']
    assists_info_club = assists_info['Club']
    assists_info_assists = assists_info['Assists Given']
    
    ''' columns for our merged dataset '''
    output_goals = []    
    output_assists = []
    for i in players_info_names:
        output_goals.append(0)
        output_assists.append(0)
    
    ''' goals the player has scored '''
    for i in range(len(goalscorer_info_names)):
        name_info = goalscorer_info_names[i]
        for j in range(len(players_info_names)):
            if name_info == players_info_names[j]:
                output_goals[j] = goalscorer_info_goals[i]
                break
                
    ''' assists player has made '''            
    for i in range(len(assists_info_names)):
        name_info = assists_info_names[i]
        for j in range(len(players_info_names)):
            if name_info == players_info_names[j]:
                output_assists[j] = assists_info_assists[i]
                break
                
                
    ''' create new dataframe for merging '''
    # make new dataframe
    goals_assists_df = pd.DataFrame({'Goals': output_goals, 'Assists': output_assists})
    # merge dataframe
    players_info = players_info.join(goals_assists_df)
    
    ''' write new dataset to csv file'''
    players_info.to_csv(path_or_buf='merged.csv', index=False)
    

''' merge data '''
merge_collected_data()
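
The nested loops above amount to a left join on the player's name. As a sketch of an alternative, the same result can be expressed with `DataFrame.merge`. The three small frames below are hypothetical stand-ins for the CSV files, and the column names `Goals` and `Assists` are chosen to match the merged dataset.

```python
import pandas as pd

# Hypothetical stand-ins for players_info.csv, top_goalscorers.csv and top_assists.csv
players = pd.DataFrame({
    "Name": ["Player One", "Player Two", "Player Three"],
    "Country": ["Country A", "Country B", "Country A"],
    "Club": ["Club X", "Club Y", "Club X"],
})
goals = pd.DataFrame({"Name": ["Player One", "Player Three"], "Goals": [12, 5]})
assists = pd.DataFrame({"Name": ["Player Two"], "Assists": [7]})

merged = (
    players
    .merge(goals, on="Name", how="left")     # keep every player, add goals where known
    .merge(assists, on="Name", how="left")   # same for assists
    .fillna({"Goals": 0, "Assists": 0})      # players missing from a table get 0
    .astype({"Goals": int, "Assists": int})  # undo the float cast caused by NaN
)
print(merged)
```

One difference to keep in mind: the loops stop at the first player whose name matches, while `merge` matches every row with that name, so duplicate names in either table would behave differently.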

In [45]:

def predict_the_winning_country():
    
    ''' import needed datasets and columns '''
    players_info = pd.read_csv('merged.csv', encoding='cp1252')
    player_name = players_info['Name']
    player_country = players_info['Country']
    player_goals = players_info['Goals']
    player_assists = players_info['Assists']
    
    ''' returns a list of unique countries'''
    def find_unique_countries(countries):
        countries = sorted(countries)
        last = ''
        list_of_countries = []
        for country in countries:
            if last != country:
                list_of_countries.append(country)
                last = country
        return list_of_countries

    
    ''' create a dictionary so we can increment chances the country has of winning '''
    countries = find_unique_countries(player_country)
    zeros = []
    for i in range(len(countries)):
        zeros.append(0)
    dictionary_of_countries = dict(zip(countries, zeros))
    
    ''' assign scores to the countries based on goals and assists'''
    for country, goals, assists in zip(player_country, player_goals, player_assists):
        points_for_goal = goals * 3
        points_for_assists = assists * 1
        dictionary_of_countries[country] += points_for_goal + points_for_assists
        
    
    ''' rank the countries by score (sorting keeps countries that are tied on points) '''
    winners = sorted(dictionary_of_countries.items(), key=lambda item: item[1], reverse=True)
    print('The top 10 teams that are most likely to win the world cup are: ')
    for i, (country, points) in enumerate(winners[:10], start=1):
        reason = '(points:' + str(points) + ')'
        print(str(i)+') ', country, '\t', reason)
    
    

''' predict the winning country '''
predict_the_winning_country()

The top 10 teams that are most likely to win the world cup are: 
1)  Spain 	 (points:425)
2)  France 	 (points:405)
3)  Argentina 	 (points:358)
4)  England 	 (points:294)
5)  Brazil 	 (points:265)
6)  Belgium 	 (points:230)
7)  Uruguay 	 (points:209)
8)  Germany 	 (points:142)
9)  Portugal 	 (points:114)
10)  Poland 	 (points:113)
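
For comparison, the ranking step can also be written with a pandas `groupby`. This is a sketch under the assumption that the data has the `Country`, `Goals` and `Assists` columns of `merged.csv`; the rows below are hypothetical.

```python
import pandas as pd

# Hypothetical rows shaped like merged.csv
merged = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country A", "Country C"],
    "Goals": [10, 12, 2, 12],
    "Assists": [4, 0, 7, 0],
})

ranking = (
    merged
    .assign(Points=3 * merged["Goals"] + merged["Assists"])  # 3 per goal, 1 per assist
    .groupby("Country")["Points"]
    .sum()                                                   # countries inherit player points
    .sort_values(ascending=False)
)
print(ranking.head(10))
```

Countries tied on points, such as Country B and Country C here, each keep their own row in the ranking.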
