# Checklist

## Frame the problem and look at the big picture
[X] Define the objective.

    - Predict the winner of each game in the World Cup

[X] What are the current solutions, if any?

    - Many

[X] How should performance be measured?

    - By how often the predicted winner matches the actual result.

[X] List the assumptions you or others have made so far.

    - SPI will predict the winner
    - Gather the SPI of the starters for each team
    - Use the cumulative SPI for each team

[ ] Verify assumptions if possible
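
One way to check the SPI assumptions listed above is to sum the starters' SPI per team, predict the side with the higher total, and compare against known results. The sketch below is only an illustration: the team names, SPI values, and results are made up, and `predict_winner` is a hypothetical helper, not part of the original analysis.

```python
import pandas as pd

# Hypothetical starter-level SPI values (real data would have 11 starters per team).
starters = pd.DataFrame({
    "team": ["A", "A", "B", "B", "C", "C"],
    "spi": [80.0, 75.0, 70.0, 72.0, 85.0, 60.0],
})

# Cumulative SPI per team.
team_spi = starters.groupby("team")["spi"].sum()

# Hypothetical past results to test the assumption against.
results = pd.DataFrame({
    "team1": ["A", "B", "A"],
    "team2": ["B", "C", "C"],
    "winning_team": ["A", "C", "C"],
})

def predict_winner(row):
    """Pick the side with the higher cumulative SPI ('Draw' on a tie)."""
    spi1, spi2 = team_spi[row["team1"]], team_spi[row["team2"]]
    if spi1 == spi2:
        return "Draw"
    return row["team1"] if spi1 > spi2 else row["team2"]

results["predicted"] = results.apply(predict_winner, axis=1)

# Share of matches where the prediction matches the actual winner.
accuracy = (results["predicted"] == results["winning_team"]).mean()
print(accuracy)
```

The same accuracy figure doubles as the performance measure named earlier in this section.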



## Get the Data
[X] List data you need and how much is needed

    - Historical World Cup data for wins and losses?
    - SPI of all starters on national teams

[ ] Find and document where you get data

[ ] Get the data

[ ] Convert the data to a format you can manipulate

[ ] Check size and type of data (time series, sample, geographical, etc.)

[ ] Sample a test set, put it aside, and don't look at it. 
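
Setting a test set aside can be as simple as sampling a fraction of the rows and dropping them from the working copy. The sketch below uses a made-up frame (`match_id` and `spi_diff` are hypothetical columns) and an assumed 20% split; `train_test_split` from scikit-learn, which this notebook imports further down, is a common alternative.

```python
import pandas as pd

# Stand-in for the full dataset (hypothetical values).
data = pd.DataFrame({
    "match_id": range(10),
    "spi_diff": [i * 1.5 for i in range(10)],
})

# Hold out 20% of the rows; a fixed random_state keeps the split reproducible.
test_set = data.sample(frac=0.2, random_state=42)

# Everything that was not sampled becomes the training set.
train_set = data.drop(test_set.index)

print(len(train_set), len(test_set))
```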


## Explore the data to gain insights
Note: try to get insights from a field expert for these steps.

[ ] Create a copy of the data for exploration

[ ] Create a Jupyter notebook to keep a record of data exploration

[ ] Study each attribute and its characteristics:

    - Name    
    - Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
        - .info(), .describe(), .shape, .head()
    - Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
    - Possibly useful for the task?
    - Type of distribution (Gaussian, uniform, logarithmic, etc.)

[ ] For supervised learning, identify the target attribute(s)

[ ] Visualize the data.

[ ] Study the correlations between attributes

[ ] Identify promising transformations you may want to apply. 

[ ] Document what you have learned
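
The attribute study and correlation steps above map onto a handful of pandas calls. This is a minimal sketch on a made-up exploration copy; the column names mimic the comparisons data used later in the notebook, but the values are invented.

```python
import pandas as pd

# Hypothetical exploration copy with invented values.
explore = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "goals_z": [0.5, -0.2, 1.1, 0.0],
    "xg_z": [0.4, -0.1, 0.9, 0.1],
})

explore.info()                  # column names, dtypes, non-null counts
summary = explore.describe()    # count, mean, std, min, quartiles, max (numeric columns)
shape = explore.shape           # (rows, columns)
first_rows = explore.head()     # first few rows for a quick look

# Pairwise Pearson correlations between the numeric attributes.
corr = explore[["goals_z", "xg_z"]].corr()
print(corr)
```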
    

## Prepare the data to better expose the underlying data patterns to ML algorithms
Notes: 

    - Work on copies of data (Keep the original dataset intact).
    - Write functions for all data transformations you apply, for five reasons:
        1. To easily prepare the data the next time you get a fresh dataset
        2. To apply these transformations in future projects
        3. To clean and prepare the test set
        4. To clean and prepare new data instances
        5. To make it easy to treat your preparation choices as hyperparameters
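
To illustrate the notes above, a transformation written as a function can work on a copy and expose its choices as arguments. The sketch below is an assumption about how that might look: `fill_missing` and its `strategy` parameter are hypothetical names, and the data is made up.

```python
import pandas as pd

def fill_missing(df, columns, strategy="median"):
    """Return a copy of df with NaNs in `columns` filled according to `strategy`."""
    out = df.copy()  # keep the original dataset intact
    for col in columns:
        if strategy == "median":
            value = out[col].median()
        elif strategy == "mean":
            value = out[col].mean()
        elif strategy == "zero":
            value = 0.0
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        out[col] = out[col].fillna(value)
    return out

# Made-up data with one missing value.
raw = pd.DataFrame({"xg_z": [1.0, None, 3.0, 8.0]})

# The same function serves different preparation choices.
by_median = fill_missing(raw, ["xg_z"], strategy="median")
by_zero = fill_missing(raw, ["xg_z"], strategy="zero")
```

Because the strategy is an argument, it can be varied during fine-tuning like any other hyperparameter, and the same function can be reused on the test set or on new data.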
 

[ ] **Data Cleaning**:

    - Fix or remove outliers (optional)
    - Fill in missing values (e.g., with zero, mean, median, etc.) or drop their rows (or columns)

[ ] Feature Selection (optional)

    - Drop the attributes that provide no useful information for the task.

[ ] **Feature engineering**, where appropriate:

    - Discretize continuous features.
    - Decompose features (e.g., categorical, date/time, etc.)
    - Aggregate features into promising new features.

[ ] **Feature Scaling** 

    - Standardize or normalize features
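
The two scaling options above differ in what they preserve: standardization centers on the mean and divides by the standard deviation, while min-max normalization maps values into [0, 1]. A minimal sketch with made-up SPI values follows; the function names are hypothetical.

```python
import pandas as pd

def standardize(series):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    return (series - series.mean()) / series.std(ddof=0)

def min_max_normalize(series):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    return (series - series.min()) / (series.max() - series.min())

# Made-up SPI values.
spi = pd.Series([60.0, 70.0, 80.0, 90.0])

standardized = standardize(spi)
normalized = min_max_normalize(spi)
print(standardized.tolist())
print(normalized.tolist())
```

When a held-out test set exists, compute the mean, standard deviation, minimum, and maximum on the training data only and reuse them for the test data.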


## Explore many different models and short-list the best ones
Note: try to automate these steps as much as possible.


[ ] Train many quick and dirty models from different categories, using standard parameters.

    - Linear
    - Naive Bayes
    - SVM
    - Random Forest
    - Neural net
    - etc.

[ ] Measure and compare their performance

    - For each model, use **N-fold cross-validation** and compute the standard deviation of their performance.

[ ] Analyze the most significant variables for each algorithm

[ ] Analyze the types of errors the models make

[ ] Quick round of feature selection and engineering

[ ] One or two more quick iterations of the 5 previous steps

[ ] Short-list the top three to five most promising models, preferring models that make different types of errors
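
The cross-validation step above (one score per fold, then the mean and standard deviation) can be sketched without any modeling library. Everything here is illustrative: `k_fold_scores` is a hypothetical helper, the threshold classifier is a toy stand-in for a real model, and the data is synthetic rather than World Cup data. In practice, scikit-learn's `cross_val_score` covers the same ground.

```python
import numpy as np

def k_fold_scores(X, y, fit, predict, k=5, seed=0):
    """Return one accuracy score per fold of a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        predictions = predict(model, X[val_idx])
        scores.append(np.mean(predictions == y[val_idx]))
    return np.array(scores)

# Toy model: the threshold sits midway between the two class means.
def fit_threshold(X, y):
    return (X[y == 0].mean() + X[y == 1].mean()) / 2

def predict_threshold(threshold, X):
    return (X > threshold).astype(int)

# Synthetic, well-separated one-feature data.
X = np.concatenate([np.linspace(0, 1, 20), np.linspace(2, 3, 20)])
y = np.concatenate([np.zeros(20, dtype=int), np.ones(20, dtype=int)])

scores = k_fold_scores(X, y, fit_threshold, predict_threshold, k=5)
print(scores.mean(), scores.std())
```

Comparing the mean alone can mislead; a model with a slightly lower mean but a much smaller standard deviation across folds may be the safer pick for the short-list.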


## Fine-tune your models and combine them into a solution.
Note: 

    - You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning.
    - Automate what you can.
    

[ ] Fine-tune hyperparameters using **cross-validation**

    - Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with zero or the median value?)
    - Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you might prefer a Bayesian optimization approach (e.g., using Gaussian process priors).

[ ] Try **ensemble methods**. Combining your best models will often perform better than running them individually.

[ ] Once you are confident about your final model, measure its performance on the test set to estimate the generalization error. 

**Note: Don't tweak your model after measuring the generalization error: you would just start overfitting the test set.**
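
The random search preferred in the fine-tuning step above fits in a few lines of plain Python. In this sketch the search space, the `toy_objective` function, and all parameter names are hypothetical; a real objective would return a cross-validated score computed on the training data, never on the test set.

```python
import random

def random_search(objective, space, n_iter=20, seed=0):
    """Sample n_iter random configurations and return the best (score, params)."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Hypothetical search space: preparation choices sit next to a model setting.
space = {
    "fill_strategy": ["zero", "mean", "median"],
    "top_n_players": [11, 13, 15],
    "regularization": [0.01, 0.1, 1.0, 10.0],
}

# Stand-in objective; replace with a cross-validated score in practice.
def toy_objective(params):
    score = 0.5
    score += 0.1 if params["fill_strategy"] == "median" else 0.0
    score -= abs(params["regularization"] - 1.0) / 100
    return score

best_score, best_params = random_search(toy_objective, space, n_iter=20)
print(best_score, best_params)
```

Unlike grid search, the number of evaluations is fixed by `n_iter` rather than by the size of the grid, which is why it scales better as more hyperparameters are added.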


## Present solution

[ ] Document what you have done

[ ] Create Presentation


## Launch, monitor, and Maintain

### https://www.kaggle.com/code/launay10christian/world-cup-prediction/notebook

Good source for baseline

## Glossary

**SPI** (Soccer Power Index) - a rating designed to provide the best possible objective representation of a team's current overall skill level.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker
import matplotlib.ticker as plticker
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

### Variables for the 2018, 2014, and 2010 World Cups and for teams in the 2022 World Cup

In [2]:
years = [2018, 2014, 2010]

In [3]:
teams_2022 = ['Qatar', 'Netherlands', 'Senegal', 'Ecuador', 'England', 'USA', 'Wales', 'Iran', 'Argentina', 'Poland', 'Mexico', 'Saudi Arabia', 'France', 'Denmark', 'Tunisia', 'Australia', 'Germany', 'Spain', 'Japan', 'Costa Rica', 'Belgium', 'Croatia', 'Canada', 'Morocco', 'Brazil', 'Switzerland', 'Serbia', 'Cameroon', 'Portugal', 'Uruguay', 'Ghana', 'Korea Republic']

## World Cup 2018 Matches  

In [4]:
matches = pd.read_csv('wc_matches.csv')

In [5]:
matches.columns

Index(['date', 'league_id', 'league', 'team1', 'team2', 'spi1', 'spi2',
       'prob1', 'prob2', 'probtie', 'proj_score1', 'proj_score2', 'score1',
       'score2', 'xg1', 'xg2', 'nsxg1', 'nsxg2', 'adj_score1', 'adj_score2'],
      dtype='object')

In [6]:
# Establish the winner of each match ('Draw' when the scores are equal)
matches['winning_team'] = np.select(
    [matches['score2'] > matches['score1'], matches['score2'] < matches['score1']],
    [matches['team2'], matches['team1']],
    default='Draw',
)

# Add the absolute goal difference as a column
matches['goal_difference'] = np.absolute(matches['score2'] - matches['score1'])

# matches.head()

In [7]:
matches.shape

(64, 22)

## World Cup Comparisons Data

In [8]:
comparisons = pd.read_csv('world_cup_comparisons.csv')

In [9]:
comparisons.columns

Index(['player', 'season', 'team', 'goals_z', 'xg_z', 'crosses_z',
       'boxtouches_z', 'passes_z', 'progpasses_z', 'takeons_z', 'progruns_z',
       'tackles_z', 'interceptions_z', 'clearances_z', 'blocks_z', 'aerials_z',
       'fouls_z', 'fouled_z', 'nsxg_z'],
      dtype='object')

In [10]:
comparisons.shape

(5899, 19)

### Change comparisons to world cups 2018, 2014, and 2010

Keep only the 2010, 2014, and 2018 seasons in `comparisons`.

In [11]:
comparisons = comparisons.loc[comparisons['season'].isin(years)]

In [12]:
comparisons.shape

(1668, 19)

Keep only the teams in `comparisons` that are in the 2022 World Cup.

In [13]:
comparisons = comparisons.loc[comparisons['team'].isin(teams_2022)]

In [14]:
comparisons.shape

(1094, 19)

In [43]:
year2010 = [2010]
year2014 = [2014]
year2018 = [2018]

# Germany

In [16]:
ger = ["Germany"]

In [17]:
ger_years = comparisons.loc[comparisons["team"].isin(ger)]

In [18]:
ger2010year = ger_years.loc[ger_years["season"].isin(year2010)]

In [19]:
ger2010xg = ger2010year.sort_values(by=["xg_z"], ascending=False)

In [20]:
ger2010xg13 = ger2010xg.head(13)

#### Top 13 players from Germany in 2010 by xG

### Remove columns

In [21]:
ger2010xg13 = ger2010xg13.drop(columns=['player', 'season', 'team', 'boxtouches_z', 'progpasses_z', 'progruns_z', 'crosses_z', 'passes_z', 'takeons_z', 'tackles_z', 'interceptions_z', 'clearances_z', 'blocks_z', 'aerials_z'])


In [22]:
ger2010 = ger2010xg13.mean()

In [23]:
ger2010

goals_z     0.816923
xg_z        0.513077
fouls_z    -0.294615
fouled_z   -0.134615
nsxg_z      0.328462
dtype: float64

In [24]:
(.513077 + .328462) / .816923

1.0301325828750079

### Fouled

In [25]:
-0.134615 - -0.294615

0.16
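
Retyping the rounded means by hand is easy to get wrong; the same two quantities can be read straight from the `ger2010` Series. The sketch rebuilds that Series from the printed output above so it runs on its own. The names `xg_to_goals` and `fouled_minus_fouls` are hypothetical labels, not from the original analysis.

```python
import pandas as pd

# Rebuilt from the printed ger2010 output so the snippet is self-contained.
ger2010 = pd.Series({
    "goals_z": 0.816923,
    "xg_z": 0.513077,
    "fouls_z": -0.294615,
    "fouled_z": -0.134615,
    "nsxg_z": 0.328462,
})

# (xG + non-shot xG) relative to goals, as in the ratio cell above.
xg_to_goals = (ger2010["xg_z"] + ger2010["nsxg_z"]) / ger2010["goals_z"]

# Fouled minus fouls, as in the "Fouled" cell above.
fouled_minus_fouls = ger2010["fouled_z"] - ger2010["fouls_z"]

print(xg_to_goals, fouled_minus_fouls)
```

Inside the notebook, the Series produced by `.mean()` can be used directly, which also avoids the rounding in the printed values.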

# Switzerland
* 2010: -1.76 
* 2014: 1.954
* 2018: 2.143

In [26]:
swits = ['Switzerland']

In [27]:
swiss_year = comparisons.loc[comparisons["team"].isin(swits)]

In [28]:
swiss2010year = swiss_year.loc[swiss_year["season"].isin(year2010)]

In [29]:
ch2010xg = swiss2010year.sort_values(by=["xg_z"], ascending=False)

In [30]:
ch2010xg13 = ch2010xg.head(13)

In [31]:
ch2010xg13 = ch2010xg13.drop(columns=['player', 'season', 'team', 'boxtouches_z', 'progpasses_z', 'progruns_z', 'crosses_z', 'passes_z', 'takeons_z', 'tackles_z', 'interceptions_z', 'clearances_z', 'blocks_z', 'aerials_z'])


In [32]:
ch2010 = ch2010xg13.mean()

In [33]:
ch2010

goals_z    -0.222308
xg_z        0.122308
fouls_z     0.429231
fouled_z    0.042308
nsxg_z      0.269231
dtype: float64

In [34]:
(.122308 + .269231) / -0.222308

-1.761245659175558

### Fouled

In [35]:
.042308 - .429231

-0.38692299999999996

### 2014

In [36]:
swiss2014year = swiss_year.loc[swiss_year["season"].isin(year2014)]

In [37]:
ch2014xg = swiss2014year.sort_values(by=["xg_z"], ascending=False)

In [38]:
ch2014xg13 = ch2014xg.head(13)

In [39]:
ch2014xg13 = ch2014xg13.drop(columns=['player', 'season', 'team', 'boxtouches_z', 'progpasses_z', 'progruns_z', 'crosses_z', 'passes_z', 'takeons_z', 'tackles_z', 'interceptions_z', 'clearances_z', 'blocks_z', 'aerials_z'])


In [40]:
ch2014 = ch2014xg13.mean()

In [41]:
ch2014

goals_z     0.387692
xg_z        0.550769
fouls_z     0.323077
fouled_z    0.382308
nsxg_z      0.206923
dtype: float64

In [55]:
(.550769 + .206923) / .387692

1.954365836798283

### Fouled

In [54]:
.382308 - .323077

0.05923099999999998

### 2018

In [44]:
swiss2018year = swiss_year.loc[swiss_year["season"].isin(year2018)]

In [47]:
ch2018xg = swiss2018year.sort_values(by=["xg_z"], ascending=False)

In [48]:
ch2018xg13 = ch2018xg.head(13)

In [49]:
ch2018xg13 = ch2018xg13.drop(columns=['player', 'season', 'team', 'boxtouches_z', 'progpasses_z', 'progruns_z', 'crosses_z', 'passes_z', 'takeons_z', 'tackles_z', 'interceptions_z', 'clearances_z', 'blocks_z', 'aerials_z'])


In [50]:
ch2018 = ch2018xg13.mean()

In [51]:
ch2018

goals_z     0.241538
xg_z        0.330769
fouls_z    -0.040000
fouled_z    0.116154
nsxg_z      0.186923
dtype: float64

In [52]:
(.330769 + .186923) / 0.241538

2.143314923531701

### Fouled

In [53]:
.116154 - -0.040000

0.156154
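
The filter, sort, top-13, drop, and mean steps above are repeated for every team and season, so they could be folded into one function. This is a sketch under assumptions: `team_profile` and `KEEP` are hypothetical names, and the demo frame is made up so the snippet runs without `world_cup_comparisons.csv`. Selecting the five `KEEP` columns leaves the same columns as the `drop(columns=[...])` calls above.

```python
import pandas as pd

# The five columns left after the drop(columns=[...]) step in the cells above.
KEEP = ["goals_z", "xg_z", "fouls_z", "fouled_z", "nsxg_z"]

def team_profile(comparisons, team, season, top_n=13):
    """Mean z-scores of a team's top_n players (ranked by xg_z) in one season."""
    rows = comparisons[(comparisons["team"] == team) & (comparisons["season"] == season)]
    top = rows.sort_values(by="xg_z", ascending=False).head(top_n)
    return top[KEEP].mean()

# Tiny made-up frame standing in for the comparisons data.
demo = pd.DataFrame({
    "player": ["p1", "p2", "p3", "q1"],
    "season": [2010, 2010, 2010, 2014],
    "team": ["Switzerland"] * 4,
    "goals_z": [0.2, 0.4, -1.0, 0.9],
    "xg_z": [0.5, 0.3, -0.8, 0.7],
    "fouls_z": [0.1, 0.3, 0.0, 0.2],
    "fouled_z": [0.0, 0.2, 0.5, 0.1],
    "nsxg_z": [0.4, 0.2, -0.5, 0.6],
})

profile = team_profile(demo, "Switzerland", 2010, top_n=2)
print(profile)
```

With the real data, a call such as `team_profile(comparisons, "Switzerland", 2014)` would stand in for the corresponding block of cells.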

# Spain

In [None]:
es = ['Spain']

In [None]:
es_year = comparisons.loc[comparisons["team"].isin(es)]

In [None]:
es2010year = es_year.loc[es_year["season"].isin(year2010)]

In [None]:
es2010xg = es2010year.sort_values(by=["xg_z"], ascending=False)

In [None]:
es2010xg13 = es2010xg.head(13)

In [None]:
es2010xg13 = es2010xg13.drop(columns=['player', 'season', 'team', 'boxtouches_z', 'progpasses_z', 'progruns_z', 'crosses_z', 'passes_z', 'takeons_z', 'tackles_z', 'interceptions_z', 'clearances_z', 'blocks_z', 'aerials_z'])


In [None]:
es2010 = es2010xg13.mean()

In [None]:
es2010

In [None]:
(.539231 + .762308) / .180000

### Fouled

In [None]:
.464615 - -0.138462

#  Portugal

In [None]:
port = ['Portugal']

In [None]:
port_year = comparisons.loc[comparisons["team"].isin(port)]

In [None]:
port2010year = port_year.loc[port_year["season"].isin(year2010)]

In [None]:
port2010xg = port2010year.sort_values(by=["xg_z"], ascending=False)

In [None]:
port2010xg13 = port2010xg.head(13)

In [None]:
port2010xg13 = port2010xg13.drop(columns=['player', 'season', 'team', 'boxtouches_z', 'progpasses_z', 'progruns_z', 'crosses_z', 'passes_z', 'takeons_z', 'tackles_z', 'interceptions_z', 'clearances_z', 'blocks_z', 'aerials_z'])


In [None]:
port2010 = port2010xg13.mean()

In [None]:
port2010

In [None]:
(.284615 + .053077) / .531538

### Fouled

In [None]:
-0.030000 - .236154

# Netherlands

In [None]:
ned = ['Netherlands']

In [None]:
ned_year = comparisons.loc[comparisons["team"].isin(ned)]

In [None]:
ned2010year = ned_year.loc[ned_year["season"].isin(year2010)]

In [None]:
ned2010xg = ned2010year.sort_values(by=["xg_z"], ascending=False)

In [None]:
ned2010xg13 = ned2010xg.head(13)

In [None]:
ned2010xg13 = ned2010xg13.drop(columns=['player', 'season', 'team', 'boxtouches_z', 'progpasses_z', 'progruns_z', 'crosses_z', 'passes_z', 'takeons_z', 'tackles_z', 'interceptions_z', 'clearances_z', 'blocks_z', 'aerials_z'])


In [None]:
ned2010 = ned2010xg13.mean()

In [None]:
ned2010

In [None]:
(.126923 + .127692) / .396923

### Fouled

In [None]:
.439231 - .368462

## World Cup Data from 2018 - 2010

Loaded into the variable `df`.

In [None]:
# df = pd.concat(
#     map(pd.read_csv, ['FIFA - 2018.csv', 'FIFA - 2014.csv', 'FIFA - 2010.csv', 'FIFA - 2006.csv', 'FIFA - 2002.csv', 'FIFA - 1998.csv', 'FIFA - 1994.csv', 'FIFA - 1990.csv', 'FIFA - 1986.csv', 'FIFA - 1982.csv', 'FIFA - 1978.csv', 'FIFA - 1974.csv', 'FIFA - 1970.csv' ]), ignore_index=True)
# print(df)

df = pd.concat(
    map(pd.read_csv, ['FIFA - 2018.csv', 'FIFA - 2014.csv', 'FIFA - 2010.csv']))


In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.tail()

### Keep only teams in the 2022 World Cup

In [None]:
# teams_2022 = ['Qatar', 'Netherlands', 'Senegal', 'Ecuador', 'England', 'USA', 'Wales', 'Iran', 'Argentina', 'Poland', 'Mexico', 'Saudi Arabia', 'France', 'Denmark', 'Tunisia', 'Australia', 'Germany', 'Spain', 'Japan', 'Costa Rica', 'Belgium', 'Croatia', 'Canada', 'Morocco', 'Brazil', 'Switzerland', 'Serbia', 'Cameroon', 'Portugal', 'Uruguay', 'Ghana', 'Korea Republic']

In [None]:
df = df.loc[df['Team'].isin(teams_2022)]

In [None]:
df.head()

In [None]:
df.tail()