# Analysing and Predicting the Women's World Cup

<ul>
<li><a href="#motivation">Motivation</a></li>
    <ul>
    <li><a href="#requirements">Requirements</a></li>
    </ul>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
    <ul>
    <li><a href="#question_1">Questions</a></li>
    </ul>
<li><a href="#modeling">Modeling</a></li>
    <ul>
    <li><a href="#preparation">Data Preparation</a></li>
    <li><a href="#preprocess">Preprocessing</a></li>
    <li><a href="#features">Features</a></li>
    <li><a href="#model_selection">Model Selection</a></li>
    </ul>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='motivation'></a>
## Motivation

The 2019 FIFA Women's World Cup is the eighth edition of the tournament. It takes place every four years, in June and July, and brings together teams from every continent. This edition is being held in France, and 24 teams qualified for the final tournament ([Wikipedia](https://en.wikipedia.org/wiki/2019_FIFA_Women%27s_World_Cup)).

Despite the similarities, women's football is nowhere near as visible as men's football (at least not in Brazil, and we imagine the same holds in most of the world), and finding data about previous matches wasn't easy. We found nothing on FIFA's website or from any other "official" provider, but we did find a dataset on [Kaggle](https://www.kaggle.com/alexkaechele/womens-world-cup) (many thanks to its author for entering this data by hand).

This analysis and modeling aims to predict the winners from the round of 16 through to the final. The data and code used are available on our GitHub.

We are very excited to find out who is going to win, and we hope you enjoy the results as much as we enjoyed working on them.

<a id='requirements'></a>
### Requirements

**python 3.7.3**

* matplotlib==3.0.3
* numpy==1.16.3
* pandas==0.24.2
* seaborn==0.9.0

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

from IPython.display import display, HTML

import plots as ps

import warnings
warnings.filterwarnings('ignore')

sns.set()
sns.set_palette("GnBu_d", 6)
%matplotlib inline

def create_link(anchor_id):
    # render an empty HTML anchor so other cells can link to this position
    display(HTML(f'<a id="{anchor_id}"></a>'))

---
Loading data

In [None]:
scores_raw = pd.read_csv('womens_world_cup_data.csv')
ranking = pd.read_csv('womens_world_cup_rankings.csv')

<a id='wrangling'></a>
## Data Wrangling

In [None]:
display(scores_raw.shape)
display(ranking.shape)

In [None]:
scores_raw.head(3)

In [None]:
ranking.head(3)

The country names in the `ranking` data contain upper-case letters. To make them consistent with the `scores` data, we convert them to lower case.

In [None]:
ranking['team'] = ranking['team'].str.lower()
ranking.head(3)

#### Merging the DataFrames

To make the data easier to analyse, we merge the two DataFrames.

In [None]:
all_matches_i_j = (scores_raw.merge(ranking, left_on='Team_i', right_on='team')
                             .rename(columns={'Team_i': 'team_i',
                                              'rating': 'rating_i',
                                              'rank': 'rank_i'})
                             .drop(columns=['team'])
                             .merge(ranking, left_on='Team_j', right_on='team')
                             .rename(columns={'Team_j': 'team_j',
                                              'rating': 'rating_j',
                                              'rank': 'rank_j'})
                             .drop(columns=['team']))


all_matches = all_matches_i_j.rename(columns={'team_i': 'team_a',
                                              'home_i': 'home_a',
                                              'score_i': 'score_a',
                                              'rank_i': 'rank_a',
                                              'rating_i': 'rating_a',
                                              'team_j': 'team_b',
                                              'home_j': 'home_b',
                                              'score_j': 'score_b',
                                              'rank_j': 'rank_b',
                                              'rating_j': 'rating_b',})

display(all_matches.head())
display(all_matches.shape)
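
Because `merge` performs an inner join by default, any match involving a team that is missing from `ranking` is silently dropped. The snippet below is a minimal sketch of a coverage check. It uses plain Python and hypothetical toy data (the helper `teams_missing_from_ranking` and the team lists are illustrative assumptions, not part of the original notebook).

```python
def teams_missing_from_ranking(matches, ranked_teams):
    """Return the team names that appear in matches but not in the ranking."""
    played = {team for match in matches for team in match}
    return sorted(played - set(ranked_teams))


# Hypothetical toy data, not taken from the real CSV files.
toy_matches = [("brazil", "france"), ("japan", "atlantis")]
toy_ranked = ["brazil", "france", "japan"]

missing = teams_missing_from_ranking(toy_matches, toy_ranked)
print(missing)
```

On the real data, comparing `scores_raw.shape` with `all_matches.shape` gives a quick hint of whether any rows were lost in the merge.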

<a id='eda'></a>
## Exploratory Data Analysis

Here are some questions we are going to address in this section:

[1. *Is a team more likely to win when playing at home?*](#question_1)

[2. *How many matches happened per year?*](#question_2)

[3. *Which teams have the most wins?*](#question_3)

[4. *Which teams have the most losses?*](#question_4)

[5. *Are higher ratings related to more wins?*](#question_5)

[6. *How are the scores distributed?*](#question_6)

In [None]:
pd.plotting.scatter_matrix(all_matches, figsize=(20,20));

Some observations:

* Countries with higher ratings play more games, as we can see in `rating_i X rating_i` *(the same holds for rank_i, rating_j and rank_j)*.
* There are more games in 2018.
* There are some correlations between the scores of team A and the ratings.

In [None]:
all_matches.sample(2, random_state=23)

<a id='modeling'></a>
## Modeling

<a id='preparation'></a>
### Data Preparation

**Guarantee that team A and team B respect lexical order**

As shown above, `scores_raw` contains one row per match, with the two teams, whether each team played at home, the scores, and the year. Because the World Cup is held in a single host country, we do not consider home advantage relevant for this prediction, so we drop those columns.

In [None]:
scores = all_matches.drop(columns=['home_a', 'home_b'])
display(scores.head())

In [None]:
def order_teams(df, columns=('team', 'score', 'rank', 'rating')):
    """Swap the A and B sides in place so that team_a <= team_b alphabetically."""
    cols_a = [c + '_a' for c in columns]
    cols_b = [c + '_b' for c in columns]

    # rows where team_b sorts before team_a need their two sides swapped
    swap = df['team_b'] < df['team_a']
    df.loc[swap, cols_a + cols_b] = df.loc[swap, cols_b + cols_a].values

    return df

scores = order_teams(scores)
scores.head()
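
The reason for the ordering is that the same pairing can appear as "A vs B" or "B vs A"; sorting the two sides gives every pairing a single representation, so the model sees consistent features and labels. The sketch below shows the idea on a single match in plain Python. The function `canonical_match` and the example scores are hypothetical and only illustrate what `order_teams` does row by row.

```python
def canonical_match(team_a, score_a, team_b, score_b):
    """Order the two sides alphabetically so each pairing has one representation."""
    if team_b < team_a:
        # swap both the team names and the values that belong to them
        team_a, score_a, team_b, score_b = team_b, score_b, team_a, score_a
    return team_a, score_a, team_b, score_b


# The same (made-up) result written from both points of view.
first = canonical_match("sweden", 1, "brazil", 3)
second = canonical_match("brazil", 3, "sweden", 1)

print(first)
print(first == second)
```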

Defining the target value:

In [None]:
# Build the target label from the scores and inspect its distribution
labels = ('win_a', 'draw', 'win_b')

scores['target'] = labels[1]
scores.loc[scores.score_a > scores.score_b, ['target']] = labels[0]
scores.loc[scores.score_a < scores.score_b, ['target']] = labels[2]

scores['target'].value_counts()

In [None]:
for c in scores.columns:
    print(f'{c} = {scores[c].unique()} \n')

We treat the ratings as continuous variables and all the other features as categorical.

In [None]:
X_raw = scores.drop(columns=['target', 'score_a', 'score_b'])
y = scores['target']

display(X_raw.shape)
display(X_raw.sample(3, random_state=13))

display(y.shape)
display(y.sample(3, random_state=13))

<a id='features'></a>
### Features

In [None]:
all_countries = np.union1d(X_raw.team_a.unique(), X_raw.team_b.unique())

In [None]:
# encoding categorical features with a one-hot encoder
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(categories=[all_countries, all_countries])

feat_cats_raw = ohe.fit_transform(X_raw[['team_a', 'team_b']].astype(str))

feat_cats = pd.DataFrame(feat_cats_raw.todense(), columns=ohe.get_feature_names()).astype(int)
feat_cats.shape
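
One-hot encoding turns each team column into one indicator column per country, so the two team columns together produce twice as many features as there are countries. The following is a minimal pure-Python sketch of that idea; the helpers `one_hot` and `encode_match` and the three-country list are hypothetical and do not reproduce scikit-learn's implementation.

```python
def one_hot(value, categories):
    """Return a list with a 1 at the position of value and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


def encode_match(team_a, team_b, categories):
    """Concatenate the indicator vectors of both teams."""
    return one_hot(team_a, categories) + one_hot(team_b, categories)


# Hypothetical, alphabetically sorted category list.
toy_countries = ["brazil", "france", "japan"]

row = encode_match("brazil", "japan", toy_countries)
print(row)
print(len(row) == 2 * len(toy_countries))
```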

In [None]:
# standardizing numerical features (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

feat_nums_raw = scaler.fit_transform(X_raw[['rating_a', 'rating_b']])
feat_nums = pd.DataFrame(feat_nums_raw, columns=['rating_a', 'rating_b'])
display(feat_nums.describe())
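
Standardization subtracts the column mean from each value and divides by the column standard deviation, which is why `describe()` should report a mean close to 0. The sketch below shows the computation in plain Python, using the population standard deviation as `StandardScaler` does. The function `standardize` and the rating values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(value - mu) / sigma for value in values]


# Made-up ratings, only to illustrate the transformation.
toy_ratings = [1500.0, 1700.0, 1900.0, 2100.0]
scaled = standardize(toy_ratings)

print(scaled)
```

Note that `describe()` reports the sample standard deviation, so the `std` row it shows for the scaled columns is slightly above 1 rather than exactly 1.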

In [None]:
# merging data
X = feat_nums.join(feat_cats)
print(f'shape all merged: {X.shape}')

##### About Cross Validation

We have few data points, so we are NOT going to follow the usual data split:
`Train | Cross Validation | Test`

Instead, we use k-fold cross-validation so that no data is thrown away. We define 4 folds and train each proposed algorithm 4 times, each time fitting on k-1 folds and evaluating the metric on the remaining one. This gives 4 scores per model, and the final metric for each model is their average. We then compare the models to choose the best one.

We use Stratified K-Fold to guarantee that each fold keeps roughly the same class proportions as the full dataset.
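
To make the stratification idea concrete, here is a minimal sketch that deals the samples of each class into the folds round-robin. It is a simplified illustration, not scikit-learn's actual algorithm (there is no shuffling, for instance), and the function `stratified_folds` and the toy labels are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Assign each sample index to a fold, dealing out every class round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Made-up, imbalanced labels: draws are the minority class.
toy_labels = ["win_a"] * 8 + ["win_b"] * 8 + ["draw"] * 4
folds = stratified_folds(toy_labels, n_folds=4)

for fold in folds:
    counts = {c: sum(toy_labels[i] == c for i in fold) for c in ("win_a", "draw", "win_b")}
    print(counts)
```

Every fold ends up with the same mix of classes, so no fold is left without draws to evaluate on.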

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix

num_folds = 4
kf = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=43)
# materialise the splits once so every model is evaluated on the same folds
k_fold_indexes = list(kf.split(X, y))

def train_model_k_fold(model, score, train_data=X, target_data=y):
    """Fit the model on each training split and return the mean test score."""
    fold_scores = []

    for train, test in k_fold_indexes:
        X_train, X_test = train_data.iloc[train], train_data.iloc[test]
        y_train, y_test = target_data.iloc[train], target_data.iloc[test]

        model.fit(X_train, y_train)
        fold_scores.append(score(model, X_test, y_test))

    return np.mean(fold_scores)

## The Model - First Scene - Compare some algorithms

### Choosing a score

At this point we have to choose a metric for our models to optimize. We found the [F-beta score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html) a good fit for our purpose, specifically with `beta=2`, which weighs recall more heavily than precision. This suits us because we want to find as many of the true positives of each class as possible. The score is computed per class and then combined into a weighted average based on the number of occurrences of each class.

There is some imbalance between the 3 target classes (`draw`, `win_a` and `win_b`), so `draw` will probably be sacrificed in order to predict `win_a` and `win_b` better. We are OK with that.
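
For reference, the F-beta score of one class is `(1 + beta^2) * P * R / (beta^2 * P + R)`, where `P` is precision and `R` is recall. The sketch below computes the support-weighted version by hand in plain Python. The helpers `fbeta` and `weighted_fbeta` and the toy predictions are hypothetical and only meant to show how the averaging works, not to replace `sklearn.metrics.fbeta_score`.

```python
def fbeta(precision, recall, beta=2.0):
    """F-beta score for a single class; beta > 1 favours recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def weighted_fbeta(y_true, y_pred, classes, beta=2.0):
    """Average the per-class F-beta scores, weighted by each class's support."""
    total = 0.0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)
        actual = sum(t == c for t in y_true)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        total += actual * fbeta(precision, recall, beta)
    return total / len(y_true)


# Made-up labels: the single draw is never predicted correctly.
toy_true = ["win_a", "win_a", "win_b", "win_b", "draw"]
toy_pred = ["win_a", "win_a", "win_b", "win_a", "win_b"]

result = weighted_fbeta(toy_true, toy_pred, ("win_a", "draw", "win_b"))
print(round(result, 4))
print(fbeta(0.5, 1.0) > fbeta(1.0, 0.5))
```

The last line illustrates the recall emphasis: with `beta=2`, a class with perfect recall and mediocre precision scores higher than the reverse.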

In [None]:
from sklearn.metrics import fbeta_score, accuracy_score

from sklearn.metrics import make_scorer

score = make_scorer(fbeta_score, beta=2, labels=labels, average='weighted')
# score = make_scorer(accuracy_score)

### Comparing models

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

models = []

models.append(("Decision Tree", DecisionTreeClassifier(random_state=4)))
models.append(("Logistic Regression", LogisticRegression(solver='lbfgs', C=0.1, multi_class='auto', random_state=4)))
models.append(("SVC kernel=linear", SVC(kernel='linear', random_state=4)))
models.append(("SVC kernel=poly", SVC(kernel='poly', gamma='auto', random_state=23)))
models.append(("SVC kernel=rbf", SVC(kernel='rbf', gamma='auto', random_state=4)))
models.append(("SVC kernel=sigmoid", SVC(kernel='sigmoid', gamma='auto', random_state=4)))
models.append(("Random Forest", RandomForestClassifier(n_estimators=10, random_state=4)))
models.append(("Gradient Boosting", GradientBoostingClassifier(random_state=4)))


for name, model in models:
    print(f'{name}: {train_model_k_fold(model, score)}')
    ps.plot_learning_curve(model, name, X, y, cv=k_fold_indexes, scoring=score)


### Model Analysis

#### Decision Tree
`f2-score = 0.63`

The learning curve shows that the model overfits the data heavily: the large gap between the training and validation curves indicates high variance, and more data could reduce this overfitting. That said, decision trees tend to overfit a lot, even more so without feature selection, which we did not do.

#### Logistic Regression
`f2-score = 0.69`

The learning curve shows a very nice pattern: a high score with the two curves close together. We did some simple hyperparameter tuning at this stage (cheating, OK!), adjusting `C` to regularize our 230+ features =D. The model responds very well for our purpose and is one of the best choices.

#### SVC
`kernel: linear -> f2-score = 0.64`

`kernel: poly -> f2-score = 0.38`

`kernel: rbf -> f2-score = 0.67`

`kernel: sigmoid -> f2-score = 0.67`

Here we did some simple hyperparameter tuning again (cheating). In my view, it makes no sense to choose an SVM model without choosing a kernel, so we created 4 SVC models, one for each basic kernel available in sklearn.

The linear kernel has high variance (it overfits the data) and the poly kernel has high bias (it underfits the data). Sigmoid and rbf are both nice!

#### Ensemble Methods
`Gradient Boosting -> f2-score = 0.63`

`Random Forest -> f2-score = 0.64`

Both have reasonable scores, but the learning curves make it easy to see that both overfit. In both cases I think more data could fix the overfitting; sadly we don't have more data, so let's move on.

### Final Comments

There are many other models we could explore, but for our purpose these are enough to find a good solution.

I decided to take Logistic Regression and SVC (with the sigmoid and rbf kernels) to the next stage for further tuning.

## The Model - Second Scene - Stressing some algorithms

In this part we search for the best hyperparameters for the chosen models.

In [None]:
from sklearn.model_selection import GridSearchCV

### Logistic Regression

In [None]:
parameters_lr = {'solver': ('newton-cg', 'sag', 'saga', 'lbfgs'),
                 'C': np.unique(np.geomspace(0.001, 1, num=15, dtype=float)),
                 'max_iter': np.unique(np.geomspace(50, 200, num=3, dtype=int)),
                 'class_weight': (None,
                                  'balanced',
                                  {'win_a': (354/(354-151)), 'win_b': (354/(354-151)), 'draw': (354/(354-52))},
                                  {'win_a': (354/(354-151)), 'win_b': (354/(354-151)), 'draw': (354/(354-(2*52)))})}

lr = LogisticRegression(multi_class='auto')

gd_model_lr = GridSearchCV(lr,
                           parameters_lr,
                           n_jobs=8,
                           cv=k_fold_indexes,
                           iid=True,
                           scoring=score,
                           verbose=1)
gd_model_lr.fit(X, y)

display(gd_model_lr.best_estimator_)
gd_model_lr.best_score_


### SVC

In [None]:
parameters_scv = {'kernel': ('sigmoid', 'rbf'),
                  'C': np.unique(np.geomspace(0.01, 1000, num=10, dtype=float)),
                  'coef0': np.unique(np.geomspace(0.01, 10, num=5, dtype=float)),
                  'class_weight': (None,
                                   'balanced',
                                   {'win_a': (354/(354-151)), 'win_b': (354/(354-151)), 'draw': (354/(354-52))},
                                   {'win_a': (354/(354-151)), 'win_b': (354/(354-151)), 'draw': (354/(354-(2*52)))}),
                  'gamma': list(np.unique(np.geomspace(0.0001, 10, num=7, dtype=float))) + ['scale', 'auto'],
                  'tol': np.unique(np.geomspace(0.0001, 10, num=6, dtype=float))}


svc = SVC(decision_function_shape='ovo', random_state=37)

gd_model_svc = GridSearchCV(svc,
                            parameters_scv,
                            n_jobs=8,
                            cv=k_fold_indexes,
                            iid=True,
                            scoring=score,
                            verbose=1)
gd_model_svc.fit(X, y)


display(gd_model_svc.best_estimator_)
gd_model_svc.best_score_
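
A grid search fits one model per parameter combination and per fold, so the grid above gets expensive quickly. The sketch below estimates its size in plain Python. The helper `grid_size` is hypothetical, and the option counts are read off the SVC grid above under the assumption that `np.unique` removes no values (2 kernels, 10 values of `C`, 5 of `coef0`, 4 class weights, 7 + 2 values of `gamma`, 6 of `tol`).

```python
def grid_size(option_counts):
    """Multiply the number of options of every parameter in the grid."""
    size = 1
    for count in option_counts.values():
        size *= count
    return size


# Number of options per parameter, assumed from the SVC grid definition.
svc_option_counts = {
    "kernel": 2,
    "C": 10,
    "coef0": 5,
    "class_weight": 4,
    "gamma": 9,
    "tol": 6,
}

n_candidates = grid_size(svc_option_counts)
n_fits = n_candidates * 4  # one fit per fold, with num_folds = 4

print(n_candidates, n_fits)
```

Under these assumptions the search covers 21600 candidates, or 86400 fits across the 4 folds, which explains why the cell uses `n_jobs=8`.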

## The Model - Last Scene - The Final One

In [None]:
best_model = gd_model_svc.best_estimator_

print('The final Model with the hyper parameters is:')
display(best_model)

print(f'The fbeta score of it is {train_model_k_fold(best_model, score)}')

ps.plot_learning_curve(best_model, 'Final Model - Learning Curve', X, y, cv=k_fold_indexes, scoring=score);

### Confusion Matrix

In [None]:
confusion_matrixes = []
    
for train, test in k_fold_indexes:       
    X_train, X_test, y_train, y_test = \
        X.iloc[train], X.iloc[test], y.iloc[train], y.iloc[test]

    best_model.fit(X_train, y_train)
    cm = confusion_matrix(y_test, best_model.predict(X_test), labels=labels)
    confusion_matrixes.append(cm)
    
ps.print_confusion_matrixes(confusion_matrixes, labels)

### Final Conclusions

We do not have a separate test set for a final evaluation of the model because, as we said, we took a k-fold approach. At this point we just want to look at the model's confusion matrix, which means plotting one confusion matrix for each model fitted across the k folds. The plots are above.

We can see that our model's weakness is predicting draws. The recall and precision of each win class (`win_a` and `win_b`) are good, and we are happy with that. Since we are going to use the model to predict the knockout phase, where a match cannot end in a draw, a model that rarely predicts draws is actually very useful.

Overall we are satisfied with the model and the results.
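
Precision and recall per class can be read directly from a confusion matrix whose rows are the true classes and whose columns are the predicted classes (the layout returned by `confusion_matrix`). The sketch below shows the arithmetic in plain Python. The helper `per_class_metrics` and the matrix values are hypothetical and are not the results of our model.

```python
def per_class_metrics(cm, labels):
    """Compute precision and recall per class from a confusion matrix.

    cm[i][j] counts the samples of true class i that were predicted as class j.
    """
    metrics = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        actual = sum(cm[i])                     # row sum: samples of this class
        predicted = sum(row[i] for row in cm)   # column sum: predictions of this class
        metrics[label] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return metrics


# Made-up confusion matrix in the order ('win_a', 'draw', 'win_b').
toy_cm = [[30, 2, 6],
          [8, 1, 4],
          [5, 1, 32]]

metrics = per_class_metrics(toy_cm, ("win_a", "draw", "win_b"))
for label, values in metrics.items():
    print(label, values)
```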

## Predict the rest of the Women's World Cup

To predict, we retrain the chosen model on all the data.

In [None]:
final_model = best_model.fit(X, y)

def preprocess(df_raw):    
    df = (df_raw.merge(ranking, left_on='team_a', right_on='team')
                .rename(columns={'rating': 'rating_a',
                                 'rank': 'rank_a'})
                .drop(columns=['team'])
                .merge(ranking, left_on='team_b', right_on='team')
                .rename(columns={'rating': 'rating_b',
                                 'rank': 'rank_b'})
                .drop(columns=['team']))
    
    df = order_teams(df, columns=['team', 'rank', 'rating'])
    
    feat_nums_raw = scaler.transform(df[['rating_a', 'rating_b']])
    feat_nums = pd.DataFrame(feat_nums_raw, columns=['rating_a', 'rating_b'])

    feat_cats_raw = ohe.transform(df[['team_a', 'team_b']].astype(str))
    feat_cats = pd.DataFrame(feat_cats_raw.todense(), columns=ohe.get_feature_names()).astype(int)
    
    return feat_nums.join(feat_cats), df

def display_results(round_list):
    r = {'team_a': [a for a, _ in round_list],
         'team_b': [b for _, b in round_list],}

    df = pd.DataFrame.from_dict(r)
    
    preprocessed, preds = preprocess(df)
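    preds['prediction'] = final_model.predict(preprocessed)
    # default the winner to team_a (this also covers a predicted draw),
    # then overwrite it with team_b wherever the model predicts 'win_b'
    preds['winner'] = preds.team_a
    b_wins = preds['prediction'] == 'win_b'
    preds.loc[b_wins, 'winner'] = preds.loc[b_wins, 'team_b']
    display(preds)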

In [None]:
round_16 = [('norway', 'australia'),
            ('england', 'cameroon'),
            ('france', 'brazil'),
            ('spain', 'united states'),
            ('italy', 'china'),
            ('netherlands', 'japan'),
            ('germany', 'nigeria'),
            ('sweden', 'canada'),] 

display_results(round_16)

In [None]:
round_8 = [('australia', 'england'),
           ('france', 'united states'),
           ('china', 'netherlands'),
           ('germany', 'sweden')] 

display_results(round_8)

In [None]:
round_4 = [('england', 'united states'),
           ('netherlands', 'sweden'),] 

display_results(round_4)

In [None]:
third_place = [('england', 'netherlands'),] 

display_results(third_place)

In [None]:
final = [('united states', 'sweden'),] 

display_results(final)