# World Cup 2018 Analysis

In this notebook, I analyze the last three games of the 2018 World Cup:
* France vs Belgium (1st semi-final)
* Croatia vs England (2nd semi-final)
* France vs Croatia (final)

### Here are my plans:
1. Import the dataset
1. Set up the teams
1. Compare the teams on value, forwards, midfielders, defenders, goalkeeper, overall score, and head-to-head history
1. Make a prediction before the game
1. Compare the prediction and the result

In [None]:
# Import libraries
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import Image

from plotly.offline import iplot, init_notebook_mode
from geopy.geocoders import Nominatim
import plotly.plotly as py

import os
print(os.listdir("../input"))


# 1. Data Preparation
## 1.1 Load Data

In [None]:
df = pd.read_csv("../input/fifa-18-demo-player-dataset/CompleteDataset.csv", index_col=0)
df.head()

In [None]:
columns = ['Name','Age','Nationality','Overall','Value']
data = pd.DataFrame(df,columns=columns)
data.head()

## 1.2 Summarize Data

In [None]:
data.info()

## 1.3 Preprocess Data

I will convert the Value column from a string to a number by stripping the € sign and scaling by the M (millions) or K (thousands) suffix.

In [None]:
# Supporting function for converting string values into numbers
def str2number(amount):
    if amount[-1] == 'M':
        return float(amount[1:-1])*1000000
    elif amount[-1] == 'K':
        return float(amount[1:-1])*1000
    else:
        return float(amount[1:])
    
data['Value'] = data['Value'].apply(lambda x: str2number(x))
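
As a quick sanity check of the conversion above, the sketch below applies the same parsing rule to a few hand-written strings. The function name `parse_value` and the sample strings are illustrative assumptions, not rows taken from the dataset.

```python
def parse_value(amount: str) -> float:
    """Convert a string such as '€95.5M' or '€450K' to a number of euros."""
    # Drop the leading currency symbol, then scale by the suffix if there is one.
    number = amount[1:]
    if number.endswith('M'):
        return float(number[:-1]) * 1_000_000
    if number.endswith('K'):
        return float(number[:-1]) * 1_000
    return float(number)

# Made-up sample strings covering the three branches (M, K, no suffix)
samples = ['€95.5M', '€450K', '€0']
parsed = [parse_value(s) for s in samples]
print(parsed)
```

Checking a handful of strings like this before applying the function to the whole column makes it easier to spot a suffix that the parser does not handle.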


# 2. Data Visualization
## 2.1 Age

In [None]:
plt.figure(figsize=(16,8))
sns.set_style("whitegrid")
plt.title('Grouping players by Age', fontsize=30, fontweight='bold', y=1.05,)
plt.xlabel('Age',fontsize=25)
plt.ylabel('Count',fontsize=25)
sns.countplot(x="Age", data=data, palette="hls");
plt.show()

Looking at the graph above, I see that players come from a wide range of ages.  I believe players perform best between ages 20 and 29.  Players younger than 20 do not have much experience, and players older than 29 probably do not have as much strength.  So I will deduct 5% from the overall score of players younger than 20 and 2% from the overall score of players older than 29.  Older players have less strength than younger players, but they have more experience, which is why I deduct only 2% for them.

## 2.2 Overall

In [None]:
plt.figure(figsize=(16,8))
sns.set_style("whitegrid")
plt.title('Grouping players by Overall', fontsize=30, fontweight='bold', y=1.05,)
plt.xlabel('Overall', fontsize=25)
plt.ylabel('Count', fontsize=25)
sns.countplot(x="Overall", data=data, palette="hls");
plt.show()

In [None]:
data.Overall.describe()

In [None]:
player_over_80 = len(data[data.Overall > 80])*100/len(data)
print('Only {:.1f}% of players have an overall score greater than 80.'.format(player_over_80))
print('Most players who play in World Cup games have an overall score above 80.')

The overall score ranges from 46 to 94 points.  The majority of scores fall between 62 and 71.

## 2.3 Nationality

In [None]:
'''
a = data.Nationality.value_counts().reset_index()
a.columns=['Nationality','Count']
a[:20]
'''
top20_nation = data.groupby('Nationality').size().reset_index(name='Count').sort_values('Count',ascending = False)[:20]
top20_nation

In [None]:
plt.figure(figsize=(16,20))

countries = list(top20_nation.loc[::-1,'Nationality'])
pos = np.arange(len(countries))
count = list(top20_nation.loc[::-1,'Count'])

plt.barh(pos, count, align='center', alpha=.8)
plt.yticks(pos, countries, fontsize=25)
plt.xlabel('Count', fontsize=25)
plt.title('Number of Players by Country', fontsize=30, fontweight='bold')
 
plt.show()

England has the most players, about 500 more than Germany in second place.

## 2.4 Relationship between Value and Age 

In [None]:
plt.figure(figsize=(16,16))
sns.set_style("whitegrid")
plt.title('Relationship between Value, Overall and Age', fontsize=30, fontweight='bold', y=1.05,)
plt.xlabel('Age', fontsize=25)
plt.ylabel('Overall', fontsize=25)

age = data["Age"].values
overall = data["Overall"].values
value = data["Value"].values

plt.scatter(age, overall, s = value/50000, edgecolors='black', color="red")
plt.show()

* Players with an overall score above 75 are the ones with noticeable value.
* At a young age, players do not have high overall scores.  The highest score is under 85.
* As age increases, players gain experience, so both the overall score and the value increase until age 33.
* After age 33, players' scores and values decrease.
* Ages 25 to 33 are a great time for a player to shine.


In [None]:
sns.pairplot(data[['Age','Overall','Value']])

## 2.5 Relationship between Overall and Value

In [None]:
data['AgeRange'] = pd.cut(data.Age, bins = [0,23,33,45],labels = ['Young','Mature','Old'])
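
To make the bin edges concrete: `pd.cut` uses right-inclusive intervals by default, so with the bins above a 23-year-old is labelled Young and a 33-year-old is labelled Mature. A minimal sketch on made-up ages (the `ages` series is an assumption for illustration):

```python
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges
ages = pd.Series([17, 23, 24, 33, 34, 40])

# Same bins and labels as the notebook: (0, 23], (23, 33], (33, 45]
age_range = pd.cut(ages, bins=[0, 23, 33, 45], labels=['Young', 'Mature', 'Old'])
labels = age_range.astype(str).tolist()
print(labels)
```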

In [None]:
# Use the 'hue' argument to provide a factor variable
sns.lmplot( x="Overall", y="Value", data=data, fit_reg=False, hue='AgeRange', legend=False)
 
# Move the legend to an empty part of the plot
plt.legend(loc='upper left')


Looking at the graph above, I see that older players have the lowest values because they will retire soon.  Mature players have the highest values.  Younger players' values come second; as they get older, their values might increase.

## 2.6 Percentages between Young, Mature, and Old Players

In [None]:
plt.figure(figsize=(16,8))
plt.title("Percentages of Young, Mature, and Old Players", fontsize = 20, fontweight = 'bold')
age_counts = data.AgeRange.value_counts()
plt.rcParams['font.size'] = 20.0
# Take the labels from the counts themselves so each slice is labelled correctly
plt.pie(age_counts, labels=age_counts.index, autopct='%1.1f%%', startangle=0)
plt.axis('equal')
plt.show()

* Players older than 33 make up only 4.1% of the player pool.  Few players keep playing after 33; most retire before then.
* Young and Mature players make up 40.4% and 55.5% respectively.

## 2.7 Value of the Top 20 Players in Each of the Top 10 Countries

In [None]:
def get_top20_players(country):
    top20players = data[data.Nationality == country].sort_values('Value',ascending = False)[:20]
    return top20players

top10_nation_list = top20_nation.Nationality[:10].tolist()

frames = []
for i in range(len(top10_nation_list)):
    temp_df = get_top20_players(top10_nation_list[i])
    frames.append(temp_df)
top_players_in_top10 = pd.concat(frames)
top_players_in_top10.head(3)

In [None]:
plt.figure(figsize=(20,14))
sns.boxplot(x="Nationality",y = 'Value',data = top_players_in_top10)
plt.title("Value of Top 20 players in each Top 10 Countries", fontsize = 30, fontweight = 'bold')
plt.xlabel('Countries', fontsize=25)
plt.ylabel('Value', fontsize=25)
plt.show()

Looking at the graph above:
* Brazil has the single most valuable player.
* Players from England and Japan have values very close to those of their teammates.
* In terms of median value, Spain ranks highest.

# 3. Match Analysis
The comparison uses only the 11 players in each starting lineup.

## 3.1 France vs Belgium (1st semi-final)

In [None]:
# Return the positional index of every row whose Name contains one of the given player names
def get_location(player_list,data):
    location=[]
    for idx,s in enumerate(data.Name):
        for player in player_list:
            if player in s:
                location.append(idx)
    return location
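
Because `get_location` tests `player in s`, it matches substrings, so a short name can also pick up longer names that contain it. The sketch below shows that effect on a small made-up table and contrasts it with an exact match via `Series.isin`. The helper names `get_location_substring` and `get_location_exact` and the sample rows are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Made-up table: one name is a prefix of another
toy = pd.DataFrame({'Name': ['K. Walker', 'K. Walker-Peters', 'H. Kane', 'J. Stones']})
wanted = ['K. Walker', 'H. Kane']

def get_location_substring(player_list, data):
    # Same rule as get_location: a row matches if any wanted name is a substring of it
    return [idx for idx, s in enumerate(data.Name)
            for player in player_list if player in s]

def get_location_exact(player_list, data):
    # Hypothetical alternative: a row matches only if its Name equals a wanted name
    return [idx for idx, is_match in enumerate(data.Name.isin(player_list)) if is_match]

substring_hits = get_location_substring(wanted, toy)
exact_hits = get_location_exact(wanted, toy)
print(substring_hits, exact_hits)
```

Exact matching avoids the prefix problem, but it still returns every row when two different players share exactly the same short name, so the result should be checked either way.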


In [None]:
# Players who are younger than 20 keep 95% of their overall score
# Players who are older than 29 keep 98% of their overall score
def overall_adjusted_score(input_data):
    data = input_data.copy()
    # Cast to float first so the scaled scores are not forced back into an integer column
    data["Overall"] = data["Overall"].astype(float)
    data.loc[data.Age < 20, "Overall"] *= 0.95
    data.loc[data.Age > 29, "Overall"] *= 0.98
    return data.Overall.mean()
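
To see what the adjustment does, here is a small worked sketch on three made-up players who all have the same raw score. It uses the same factors as the function above (0.95 and 0.98); the function name `adjusted_mean_overall` and its keyword parameters are hypothetical and exist only for this illustration.

```python
import pandas as pd

def adjusted_mean_overall(players, young_factor=0.95, old_factor=0.98):
    """Mean Overall after scaling players under 20 and over 29."""
    scores = players['Overall'].astype(float)
    # Start with a factor of 1.0 for everyone, then overwrite the two age groups
    factor = pd.Series(1.0, index=players.index)
    factor[players['Age'] < 20] = young_factor
    factor[players['Age'] > 29] = old_factor
    return (scores * factor).mean()

# Made-up players: one young, one in the 20 to 29 range, one older
toy = pd.DataFrame({'Age': [19, 25, 31], 'Overall': [80, 80, 80]})
result = adjusted_mean_overall(toy)
print(round(result, 2))  # (76.0 + 80.0 + 78.4) / 3
```

Keeping the factors as parameters makes it easy to try other penalties without editing the function body.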

### 3.1.1 Teams Analysis

In [None]:
Image("../input/worldcup2018/FranceVsBelgium.png")

In [None]:
FrancePlayers = ["H. Lloris","B. Pavard","R. Varane","S. Umtiti","L. Hernandez","N. Kante","P. Pogba","K. Mbappe","A. Griezmann","B. Matuidi","O. Giroud"]
BelgiumPlayers = ["R. Lukaku","E. Hazard","M. Fellaini","K. De Bruyne","M. Dembele","A. Witsel","J. Vertonghen","V. Kompany","T. Alderweireld","N. Chadli","T. Courtois"]

In [None]:
all_france_players =data[data.Nationality == "France"]
France_lineups = all_france_players.iloc[get_location(FrancePlayers,all_france_players),:]
France_lineups


France_lineups contains only 8 of the 11 players.  The three missing players are L. Hernandez, N. Kante, and K. Mbappe.  I found the data for those players on sofifa.com.

* K. Mbappe: age:20, Overall: 88, Value: 81M
* L. Hernandez: age:22, Overall: 83, Value: 29.5M
* N. Kante: age:27, Overall: 89, Value: 63M

In [None]:
# Add missing players to France_lineups dataframe
# (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
missing_france_players = pd.DataFrame([
    {'Name': 'K. Mbappe', 'Age': 20, 'Nationality': 'France', 'Overall': 88, 'Value': 81000000},
    {'Name': 'L. Hernandez', 'Age': 22, 'Nationality': 'France', 'Overall': 83, 'Value': 29500000},
    {'Name': 'N. Kante', 'Age': 27, 'Nationality': 'France', 'Overall': 89, 'Value': 63000000},
])
France_lineups = pd.concat([France_lineups, missing_france_players], ignore_index=True)


In [None]:
# complete France Lineups
France_lineups

In [None]:
# Belgium
all_belgium_players =data[data.Nationality == "Belgium"]
Belgium_lineups = all_belgium_players.iloc[get_location(BelgiumPlayers,all_belgium_players),:]
Belgium_lineups

Belgium_lineups contains 10 players.  The one missing player is M. Dembele.

* M. Dembele: age:30, Overall: 82, Value: 0

In [None]:
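# Add missing players to Belgium_lineups dataframe (pd.concat replaces the removed DataFrame.append)
missing_belgium_players = pd.DataFrame([{'Name': 'M. Dembele', 'Age': 30, 'Nationality': 'Belgium', 'Overall': 82, 'Value': 0}])
Belgium_lineups = pd.concat([Belgium_lineups, missing_belgium_players], ignore_index=True)
Belgium_lineups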

### Age

In [None]:
print("France's average age: {:.2f}, compared with Belgium's average age: {:.2f}.".format(France_lineups.Age.mean(),Belgium_lineups.Age.mean()))
print("France's players are younger than Belgium's on average. This suggests Belgium has more experience, while France's younger players have more strength.")

In [None]:
plt.figure(figsize=(16,8))
sns.set_style("whitegrid")
plt.title('France Players Age', fontsize=30, fontweight='bold', y=1.05,)
plt.xlabel('Age', fontsize=25)
plt.ylabel('Count', fontsize=25)
sns.countplot(x="Age", data=France_lineups, palette="hls");
plt.show()

In [None]:
plt.figure(figsize=(16,8))
sns.set_style("whitegrid")
plt.title('Belgium Players Age', fontsize=30, fontweight='bold', y=1.05,)
plt.xlabel('Age', fontsize=25)
plt.ylabel('Count', fontsize=25)
sns.countplot(x="Age", data=Belgium_lineups, palette="hls");
plt.show()

### Value

I will replace the values of A. Witsel and M. Dembele with Belgium's average value.

In [None]:
names = ['A. Witsel','M. Dembele']
mean = Belgium_lineups.Value.mean()
Belgium_lineups.loc[get_location(names,Belgium_lineups),'Value'] = mean
Belgium_lineups
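
One caveat with the replacement above: the mean is computed while the placeholder value of 0 is still in the column, which pulls the average down. A possible alternative, sketched below on made-up numbers, averages only the non-zero values before filling. This is an assumption about what might be preferable, not what the notebook does, and the names `toy`, `mean_known`, and `filled` are hypothetical.

```python
import pandas as pd

# Made-up values: two known players and two with a placeholder of 0
toy = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                    'Value': [40e6, 60e6, 0.0, 0.0]})

# Mean including the zeros (the approach used in the cell above)
mean_with_zeros = toy['Value'].mean()

# Mean over the known, non-zero values only
mean_known = toy.loc[toy['Value'] > 0, 'Value'].mean()

# Fill the placeholders with the mean of the known values
filled = toy.copy()
filled.loc[filled['Value'] == 0, 'Value'] = mean_known
print(mean_with_zeros, mean_known, filled['Value'].sum())
```

Which version is more appropriate depends on whether a value of 0 means "unknown" or is a real figure. Because total value is one of the scored categories, the choice can change the comparison.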

In [None]:
print('Here is total value of France: €{:.2f}M, \nand here is total value of Belgium: €{:.2f}M'.format(France_lineups.Value.sum()/1000000,Belgium_lineups.Value.sum()/1000000))


Belgium gets one point here because its total value is greater than France's.

### Forward

In [None]:
France_Forward_Players = ["K. Mbappe","A. Griezmann","B. Matuidi","O. Giroud"]
Belgium_Forward_Players = ["R. Lukaku","E. Hazard","M. Fellaini","K. De Bruyne"]

In [None]:
France_Forward_Lineups=France_lineups.iloc[get_location(France_Forward_Players,France_lineups),:]
Belgium_Forward_Lineups=Belgium_lineups.iloc[get_location(Belgium_Forward_Players,Belgium_lineups),:]

In [None]:
print("With adjusted overall score for forward lineups, France's average Overall score: {:.2f},comparing against Belgium's average Overall score: {:.2f}.".format(overall_adjusted_score(France_Forward_Lineups),overall_adjusted_score(Belgium_Forward_Lineups)))

Belgium's forward players are stronger than France's, so Belgium receives one point.

### Midfielder

In [None]:
France_Midfielder_Players = ["N. Kante","P. Pogba"]
Belgium_Midfielder_Players = ["M. Dembele","A. Witsel"]

In [None]:
France_Midfielder_Lineups=France_lineups.iloc[get_location(France_Midfielder_Players,France_lineups),:]
Belgium_Midfielder_Lineups=Belgium_lineups.iloc[get_location(Belgium_Midfielder_Players,Belgium_lineups),:]

In [None]:
print("With adjusted overall score for midfielder lineups, France's average Overall score: {:.2f},comparing against Belgium's average Overall score: {:.2f}.".format(overall_adjusted_score(France_Midfielder_Lineups),overall_adjusted_score(Belgium_Midfielder_Lineups)))

France's midfield players are stronger than Belgium's, so France receives one point.

### Defender

In [None]:
France_Defender_Players = ["B. Pavard","R. Varane","S. Umtiti","L. Hernandez"]
Belgium_Defender_Players = ["J. Vertonghen","V. Kompany","T. Alderweireld","N. Chadli"]

In [None]:
France_Defender_Lineups=France_lineups.iloc[get_location(France_Defender_Players,France_lineups),:]
Belgium_Defender_Lineups=Belgium_lineups.iloc[get_location(Belgium_Defender_Players,Belgium_lineups),:]

In [None]:
print("With adjusted overall score for defender lineups, France's average Overall score: {:.2f},comparing against Belgium's average Overall score: {:.2f}.".format(overall_adjusted_score(France_Defender_Lineups),overall_adjusted_score(Belgium_Defender_Lineups)))

Belgium's defenders are stronger than France's, so Belgium receives one point.

### Goalkeeper

In [None]:
France_Goalkeeper_Players = ["H. Lloris"]
Belgium_Goalkeeper_Players = ["T. Courtois"]

In [None]:
France_Goalkeeper_Lineups=France_lineups.iloc[get_location(France_Goalkeeper_Players,France_lineups),:]
Belgium_Goalkeeper_Lineups=Belgium_lineups.iloc[get_location(Belgium_Goalkeeper_Players,Belgium_lineups),:]

In [None]:
print(France_Goalkeeper_Lineups)
print(Belgium_Goalkeeper_Lineups)

Belgium's goalkeeper is stronger than France's, so Belgium receives one point.

### Overall Score

In [None]:
print("With unadjusted overall score, France's average Overall score: {:.2f},comparing against Belgium's average Overall score: {:.2f}.".format(France_lineups.Overall.mean(),Belgium_lineups.Overall.mean()))
print("Belgium's overall score is higher than France's overall score.")

In [None]:
print("With adjusted overall score, France's average Overall score: {:.2f},comparing against Belgium's average Overall score: {:.2f}.".format(overall_adjusted_score(France_lineups),overall_adjusted_score(Belgium_lineups)))
print("Belgium's overall score is higher than France's overall score.")

Belgium's overall score is higher than France's in both the adjusted and unadjusted versions, so Belgium gets one point.

### History games
#### Head-to-head results of all previous games between France and Belgium

| France Won | Drawn | Belgium Won |
| ------------- |:-------------:| -----:|
| 24 | 19 | 30 |

[Here](https://www.11v11.com/teams/france/tab/opposingTeams/opposition/Belgium/) is where I found that data.



In [None]:
print("Total games: {}".format(24+19+30))
print("France winrate: {:.2f}%".format(24*100/73))
print("Belgium winrate: {:.2f}%".format(30*100/73))
print("Belgium has a better chance of winning than France.")
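
The same arithmetic can be wrapped in a small helper so that it can be reused for the other matches and also reports the share of draws. This is a sketch; the function name `head_to_head_rates` and the dictionary keys are hypothetical, and the input numbers are the ones from the table above.

```python
def head_to_head_rates(wins_a, draws, wins_b):
    """Return the percentage of games won by team A, drawn, and won by team B."""
    total = wins_a + draws + wins_b
    return {
        'team_a': 100 * wins_a / total,
        'draw': 100 * draws / total,
        'team_b': 100 * wins_b / total,
    }

# France (team A) vs Belgium (team B): 24 wins, 19 draws, 30 losses
rates = head_to_head_rates(24, 19, 30)
for outcome, pct in rates.items():
    print("{}: {:.2f}%".format(outcome, pct))
```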

### 3.1.2 Prediction

|         | **France** | **Belgium** |
| ------------- |:-------------:| -----:|
| Value | 0 | 1 |
| Forward | 0 | 1 |
| Midfielder | 1 | 0 |
| Defender | 0 | 1 |
| Goalkeeper | 0 | 1 |
| Overall score | 0 | 1 |
| History games | 0 | 1 |
| **Total** | **1** | **6** |

**Belgium clearly has the advantage over France, so I think Belgium is more likely to win this game.  France looks like the weaker team.**
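
The scoreboard above can also be tallied in code, which makes it easier to repeat the same procedure for the other two matches. A minimal sketch: the dictionary `category_winner` simply restates the table, and the variable names are assumptions for illustration.

```python
from collections import Counter

# One entry per category, naming the team that received the point in the table
category_winner = {
    'Value': 'Belgium',
    'Forward': 'Belgium',
    'Midfielder': 'France',
    'Defender': 'Belgium',
    'Goalkeeper': 'Belgium',
    'Overall score': 'Belgium',
    'History games': 'Belgium',
}

# Count the points per team and pick the team with the most points
totals = Counter(category_winner.values())
predicted_winner = totals.most_common(1)[0][0]
print(dict(totals), predicted_winner)
```

Note that this gives every category the same weight, which is the scheme the table uses.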

### 3.1.3 Result

In [None]:
Image("../input/worldcup2018/FranceVsBelgiumResult.png")

**France beat Belgium.  Looking at the stats after the game, Belgium controlled the game with 64% possession and made 594 passes with 91% accuracy, which suggests that Belgium was the better team.  However, Belgium could not create many shots and had only 3 shots on target.**

**On the other hand, France was the weaker team and could not control the game, but it took a lot of shots (19).  Whenever France had the ball, they tried to shoot, and 5 of those shots were on target.  In the end, France scored once and won the game.**

## We can analyze the other games in the same way to see which team has the advantage.

## 3.2 Croatia vs England (2nd semi-final)
### 3.2.1 Teams Analysis

In [None]:
Image("../input/worldcup2018/CroatiaVsEngland.png")

In [None]:
CroatiaPlayers = ["D. Subašić","S. Vrsaljko","D. Lovren","D. Vida","I. Strinić","M. Brozović","A. Rebić","L. Modrić","I. Rakitić","I. Perišić", "M. Mandžukić"]
EnglandPlayers = ["R. Sterling","H. Kane","D. Alli","J. Lingard","A. Young","K. Trippier","J. Henderson", "H. Maguire","J. Stones","K. Walker","J. Pickford"]

In [None]:
all_croatia_players =data[data.Nationality == "Croatia"]
Croatia_lineups = all_croatia_players.iloc[get_location(CroatiaPlayers,all_croatia_players),:]
Croatia_lineups

In Croatia_lineups, we found 11 rows.  However, D. Lovren appears twice and D. Vida is missing.  I found information about D. Vida [here](https://sofifa.com/player/199206/domagoj-vida/).

* D. Vida: age:29, Overall: 80, Value: 11.5M


In [None]:
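# Add missing players to Croatia_lineups dataframe (pd.concat replaces the removed DataFrame.append)
missing_croatia_players = pd.DataFrame([{'Name': 'D. Vida', 'Age': 29, 'Nationality': 'Croatia', 'Overall': 80, 'Value': 11500000}])
Croatia_lineups = pd.concat([Croatia_lineups, missing_croatia_players], ignore_index=True)
Croatia_lineups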

In [None]:
# Remove the D. Lovren row with overall score 58 and keep the one with overall score 81.
# drop returns a new DataFrame, so assign the result back.
Croatia_lineups = Croatia_lineups.drop(index=10)
Croatia_lineups
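
Dropping a hard-coded row label works, but it is fragile if the row order changes. As a sketch of an alternative on made-up rows, the duplicate can be resolved by sorting on Overall and keeping the first row per name. The two D. Lovren scores are the ones mentioned above; the second player and the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up table with one duplicated name
toy = pd.DataFrame({
    'Name': ['D. Lovren', 'Another Player', 'D. Lovren'],
    'Overall': [81, 75, 58],
})

# Keep the highest-rated row for each name
deduped = (toy.sort_values('Overall', ascending=False)
              .drop_duplicates(subset='Name', keep='first')
              .reset_index(drop=True))
print(deduped)
```

This assumes the higher-rated row is always the intended player, which holds for the case described here but may not hold in general.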

In [None]:
all_england_players =data[data.Nationality == "England"]
England_lineups = all_england_players.iloc[get_location(EnglandPlayers,all_england_players),:]
England_lineups

We found 13 rows.  However, J. Lingard appears twice, and K. Walker-Peters, who is not in the lineup, was also matched.  We need to remove those two rows.

In [None]:
# drop returns a new DataFrame, so assign the result back
England_lineups = England_lineups.drop(England_lineups.index[[7,12]])

In [None]:
print("Croatia's average age: {:.2f}, compared with England's average age: {:.2f}.".format(Croatia_lineups.Age.mean(),England_lineups.Age.mean()))
print("England's players are younger than Croatia's on average. This suggests Croatia has more experience than England, while England's younger players have more strength.")

### History Games
#### Results of previous games between Croatia and England in World Cup history

| Date | Match | Score |
| ------------- |:-------------:| -----:|
| 10 Sep 2008 | Croatia v England | 1-4 |
| 09 Sep 2009 | England v Croatia | 5-1 |


[Here](https://www.11v11.com/teams/england/tab/opposingTeams/opposition/Croatia/) is where I found that data.

They only played twice, and England won both games.

### 3.2.2 Prediction

### 3.2.3 Result

## 3.3 France vs Croatia (Final)

### 3.3.1 Teams Analysis

In [None]:
Image("../input/worldcup2018/FranceVsCroatia.png")

### 3.3.2 Prediction

### 3.3.3 Result