
## 1. Defining the Question

### a) Specifying the Question

You have been recruited as a football analyst at Mchezopesa Ltd and given the task below.

Predict the result of a game between team 1 and team 2, based on which team is home and which is away, and on whether or not the game is a friendly. Team rank must be included in the training data.

### b) Defining the Metric for Success

* Predict how many goals the home team scores.
* Predict how many goals the away team scores.
* Figure out from the home team’s perspective if the game is a Win, Lose or Draw (W, L, D)

### c) Understanding the context 

A more detailed explanation and history of the rankings is available here: [link](https://en.wikipedia.org/wiki/FIFA_World_Rankings) 

An explanation of the ranking procedure is available here: [Link](https://www.fifa.com/fifa-world-ranking/procedure/men.html)


### d) Recording the Experimental Design

Expected flow for the assessment:
* Perform your EDA
* Perform any necessary feature engineering 
* Check for multicollinearity
* Start building the model
* Cross-validate the model
* Compute RMSE
* Create residual plots for your models, and assess their heteroscedasticity using Bartlett’s test
* Perform appropriate regressions on the data including your justification
* Challenge your solution by providing insights on how you can make improvements.

### e) Data Relevance

Our dataset contains information about past matches with the relevant details required for our analysis including the ranking of teams, goals scored, type of match among others. The data is therefore relevant for use in building a model to obtain our desired predictions.

## 2. Reading the Data

In [747]:
# importing dependencies
#

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
from matplotlib import pyplot as plt

In [748]:
#loading the dataset
results=pd.read_csv("results.csv")
fifa_ranking=pd.read_csv("fifa_ranking.csv")

## 3. Checking the Data

In [749]:
# Determining the no. of records in our dataset
#
print('fifa_ranking',fifa_ranking.shape)
print('results',results.shape)

fifa_ranking (13995, 16)
results (40839, 9)


In [750]:
# Previewing the top of our dataset
#
fifa_ranking.head()


Unnamed: 0,rank,country_full,country_abrv,total_points,previous_points,rank_change,cur_year_avg,cur_year_avg_weighted,last_year_avg,last_year_avg_weighted,two_year_ago_avg,two_year_ago_weighted,three_year_ago_avg,three_year_ago_weighted,confederation,rank_date
0,1,Germany,GER,0.0,57.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
1,2,Italy,ITA,0.0,57.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
2,3,Switzerland,SUI,0.0,50.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
3,4,Sweden,SWE,0.0,55.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
4,5,Argentina,ARG,0.0,51.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CONMEBOL,1993-08-08


In [751]:
results.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


In [752]:
# Previewing the bottom of our dataset
# 
fifa_ranking.tail()

Unnamed: 0,rank,country_full,country_abrv,total_points,previous_points,rank_change,cur_year_avg,cur_year_avg_weighted,last_year_avg,last_year_avg_weighted,two_year_ago_avg,two_year_ago_weighted,three_year_ago_avg,three_year_ago_weighted,confederation,rank_date
13990,103,Wales,WAL,0.0,384.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,2000-08-09
13991,104,Vietnam,VIE,0.0,393.0,-3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,AFC,2000-08-09
13992,105,Mozambique,MOZ,0.0,373.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CAF,2000-08-09
13993,106,Uganda,UGA,0.0,355.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CAF,2000-08-09
13994,107,,,,,,,,,,,,,,,


In [753]:
results.tail()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
40834,2019-07-18,American Samoa,Tahiti,8,1,Pacific Games,Apia,Samoa,True
40835,2019-07-18,Fiji,Solomon Islands,4,4,Pacific Games,Apia,Samoa,True
40836,2019-07-19,Senegal,Algeria,0,1,African Cup of Nations,Cairo,Egypt,True
40837,2019-07-19,Tajikistan,North Korea,0,1,Intercontinental Cup,Ahmedabad,India,True
40838,2019-07-20,Papua New Guinea,Fiji,1,1,Pacific Games,Apia,Samoa,True


In [754]:
# Checking whether each column has an appropriate datatype
#
print(fifa_ranking.dtypes)
print(results.dtypes)

rank                         int64
country_full                object
country_abrv                object
total_points               float64
previous_points            float64
rank_change                float64
cur_year_avg               float64
cur_year_avg_weighted      float64
last_year_avg              float64
last_year_avg_weighted     float64
two_year_ago_avg           float64
two_year_ago_weighted      float64
three_year_ago_avg         float64
three_year_ago_weighted    float64
confederation               object
rank_date                   object
dtype: object
date          object
home_team     object
away_team     object
home_score     int64
away_score     int64
tournament    object
city          object
country       object
neutral         bool
dtype: object


In [755]:
results['date']= pd.to_datetime(results['date'])

In [756]:
fifa_ranking['rank_date']=pd.to_datetime(fifa_ranking['rank_date'])

## 4. External Data Source Validation

Making sure your data matches something outside of the dataset is very important. It allows you to ensure that the measurements are roughly in line with what they should be and it serves as a check on what other things might be wrong in your dataset. External validation can often be as simple as checking your data against a single number, as we will do here.

### a) Validation

Some features are available on the FIFA ranking page [Link](https://www.fifa.com/fifa-world-ranking/ranking-table/men/index.html).

The link to our dataset is provided [here](https://drive.google.com/open?id=1BYUqaEEnFtAe5lvzJh9lpVpR2MAvERUc)
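
A single-number check of this kind could be sketched as follows. The helper `check_rank` is hypothetical, and the toy rows are copied from the `fifa_ranking` preview in section 3 rather than from an external source; in practice the expected rank would be read off the FIFA ranking page.

```python
import pandas as pd

# Toy rows copied from the fifa_ranking preview (not an external source).
ranking = pd.DataFrame({
    "country_full": ["Germany", "Italy", "Switzerland"],
    "rank": [1, 2, 3],
    "rank_date": pd.to_datetime(["1993-08-08"] * 3),
})

def check_rank(df, country, date, expected_rank):
    """Return True if `country` held `expected_rank` on `date` in `df`."""
    mask = (df["country_full"] == country) & (df["rank_date"] == pd.Timestamp(date))
    ranks = df.loc[mask, "rank"]
    return len(ranks) == 1 and int(ranks.iloc[0]) == expected_rank

# expected_rank is the value we would look up externally (assumed here).
germany_ok = check_rank(ranking, "Germany", "1993-08-08", expected_rank=1)
print(germany_ok)
```

A mismatch, or more than one matching row, would be a signal to inspect the dataset before modelling.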

## 5. Tidying the Dataset

In [757]:
# Checking for Outliers
#
fifa_ranking.describe()

Unnamed: 0,rank,total_points,previous_points,rank_change,cur_year_avg,cur_year_avg_weighted,last_year_avg,last_year_avg_weighted,two_year_ago_avg,two_year_ago_weighted,three_year_ago_avg,three_year_ago_weighted
count,13995.0,13994.0,13994.0,13994.0,13994.0,13994.0,13994.0,13994.0,13994.0,13994.0,13994.0,13994.0
mean,94.36463,0.0,115.123339,-0.035015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
std,54.733547,0.0,183.991128,5.607973,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
min,1.0,0.0,0.0,-72.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.0,0.0,13.0,-2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,94.0,0.0,35.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,141.0,0.0,64.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,202.0,0.0,842.0,92.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [758]:
results.describe()

Unnamed: 0,home_score,away_score
count,40839.0,40839.0
mean,1.745709,1.188105
std,1.749145,1.40512
min,0.0,0.0
25%,1.0,0.0
50%,1.0,1.0
75%,2.0,2.0
max,31.0,21.0
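
The summaries above already hint at possible anomalies: several ranking columns hold only zeros, and the maximum scores (31 and 21) sit far above the 75th percentile. A minimal sketch of how such cases could be flagged, assuming a toy frame with illustrative values and Tukey's 1.5 × IQR rule (a choice made for this sketch, not one taken from the notebook):

```python
import pandas as pd

# Toy frame mimicking the shape of the real tables (values are illustrative).
toy = pd.DataFrame({
    "total_points": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "home_score": [1, 0, 2, 1, 3, 31],
})

# Columns with a single distinct value carry no information for a model.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=True) <= 1]

# Flag scores above Q3 + 1.5 * IQR (Tukey's rule).
q1, q3 = toy["home_score"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
outliers = toy[toy["home_score"] > upper]

print(constant_cols)
print(outliers)
```

Flagged scores are not necessarily errors, since very one-sided internationals do occur, so they are better reviewed than dropped automatically.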


In [759]:
# Checking for Anomalies
#


In [760]:
# Identifying the Missing Data
#
fifa_ranking.isnull().sum()

rank                       0
country_full               1
country_abrv               1
total_points               1
previous_points            1
rank_change                1
cur_year_avg               1
cur_year_avg_weighted      1
last_year_avg              1
last_year_avg_weighted     1
two_year_ago_avg           1
two_year_ago_weighted      1
three_year_ago_avg         1
three_year_ago_weighted    1
confederation              1
rank_date                  1
dtype: int64

In [761]:
results.isnull().sum()

date          0
home_team     0
away_team     0
home_score    0
away_score    0
tournament    0
city          0
country       0
neutral       0
dtype: int64

In [762]:
# Dealing with the Missing Data
#
fifa_ranking.dropna(inplace=True)

In [763]:
# More data cleaning procedures
#
fifa_ranking.duplicated().sum()

0

In [764]:
results.duplicated().sum()

0

In [765]:
#dropping irrelevant columns
results.drop(['city'],axis=1,inplace=True)
fifa_ranking.drop(['country_abrv','confederation','total_points','cur_year_avg','cur_year_avg_weighted','last_year_avg','last_year_avg_weighted','two_year_ago_avg','two_year_ago_weighted','three_year_ago_avg','three_year_ago_weighted'],axis=1,inplace=True)

## Merging the Datasets

In [766]:
#splitting year and month
results['year'] = results.date.dt.year

results['month'] = results.date.dt.month

In [767]:
results.drop(['date'],axis=1,inplace=True)

In [768]:
# keeping only matches played from 1993, the year the rankings started, up to 2018
results=results[(results['year'] > 1992) & (results['year'] < 2019)] 

In [769]:
fifa_ranking['year'] = fifa_ranking.rank_date.dt.year

fifa_ranking['month'] = fifa_ranking.rank_date.dt.month

In [770]:
fifa_ranking.drop(['rank_date'],axis=1,inplace=True)

In [771]:
home = pd.merge(fifa_ranking, results, how = 'inner', left_on = ['year','month','country_full'], right_on = ['year','month','home_team'])


In [772]:
home.rename(columns={'rank':'home_rank','previous_points':'home_previous_points','rank_change':'home_rank_change'}, inplace = True)

In [773]:
home.drop(['country_full'],axis=1,inplace=True)

In [774]:
away = pd.merge(fifa_ranking, results, how = 'inner', left_on = ['year','month','country_full'], right_on = ['year','month','away_team'])


In [775]:
away.drop(['country_full'],axis=1,inplace=True)

In [776]:
away.rename(columns={'rank':'away_rank','previous_points':'away_previous_points','rank_change':'away_rank_change'}, inplace = True)

In [777]:
away

Unnamed: 0,away_rank,away_previous_points,away_rank_change,year,month,home_team,away_team,home_score,away_score,tournament,country,neutral
0,3,50.0,9.0,1993,8,Sweden,Switzerland,1,2,Friendly,Sweden,False
1,5,51.0,5.0,1993,8,Peru,Argentina,0,1,FIFA World Cup qualification,Peru,False
2,5,51.0,5.0,1993,8,Paraguay,Argentina,1,3,FIFA World Cup qualification,Paraguay,False
3,5,51.0,5.0,1993,8,Colombia,Argentina,2,1,FIFA World Cup qualification,Colombia,False
4,8,55.0,-5.0,1993,8,Venezuela,Brazil,1,5,FIFA World Cup qualification,Venezuela,False
...,...,...,...,...,...,...,...,...,...,...,...,...
4095,94,424.0,-1.0,2000,8,United States,Barbados,7,0,FIFA World Cup qualification,United States,False
4096,96,410.0,0.0,2000,8,Malaysia,New Zealand,0,0,Merdeka Tournament,Malaysia,False
4097,96,410.0,0.0,2000,8,Malaysia,New Zealand,0,2,Merdeka Tournament,Malaysia,False
4098,101,384.0,2.0,2000,8,Bahrain,Jordan,0,2,Friendly,Syria,True


In [778]:
# the key columns are identical on both sides, so a single `on` list is enough
home_away = pd.merge(home, away, how='inner', on=['year','month','home_team','away_team','home_score','away_score','tournament','country','neutral'])
home_away.head()

Unnamed: 0,home_rank,home_previous_points,home_rank_change,year,month,home_team,away_team,home_score,away_score,tournament,country,neutral,away_rank,away_previous_points,away_rank_change
0,4,55.0,0.0,1993,8,Sweden,Switzerland,1,2,Friendly,Sweden,False,3,50.0,9.0
1,4,55.0,0.0,1993,8,Sweden,France,1,1,FIFA World Cup qualification,Sweden,False,12,45.0,7.0
2,5,51.0,5.0,1993,8,Argentina,Peru,2,1,FIFA World Cup qualification,Argentina,False,70,16.0,8.0
3,5,51.0,5.0,1993,8,Argentina,Paraguay,0,0,FIFA World Cup qualification,Argentina,False,67,22.0,1.0
4,8,55.0,-5.0,1993,8,Brazil,Mexico,1,1,Friendly,Brazil,False,14,42.0,11.0
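
The merging steps above attach the monthly rank of each team to every match. A compact sketch of the same idea on toy data, assuming a hypothetical helper `attach_rank` that renames the ranking columns and merges them straight onto the results, once per side (the rows reuse values from the preview above):

```python
import pandas as pd

results_demo = pd.DataFrame({
    "year": [1993, 1993],
    "month": [8, 8],
    "home_team": ["Sweden", "Argentina"],
    "away_team": ["Switzerland", "Peru"],
    "home_score": [1, 2],
    "away_score": [2, 1],
})
ranking_demo = pd.DataFrame({
    "year": [1993] * 4,
    "month": [8] * 4,
    "country_full": ["Sweden", "Switzerland", "Argentina", "Peru"],
    "rank": [4, 3, 5, 70],
})

def attach_rank(matches, ranks, side):
    """Attach the monthly rank of the `side` ('home' or 'away') team."""
    renamed = ranks.rename(
        columns={"country_full": f"{side}_team", "rank": f"{side}_rank"}
    )
    return matches.merge(renamed, on=["year", "month", f"{side}_team"], how="inner")

merged = attach_rank(attach_rank(results_demo, ranking_demo, "home"), ranking_demo, "away")
print(merged)
```

Because each merge starts from the match table, no third merge is needed to recombine the home and away frames.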


In [779]:
home_away.isnull().sum()

home_rank               0
home_previous_points    0
home_rank_change        0
year                    0
month                   0
home_team               0
away_team               0
home_score              0
away_score              0
tournament              0
country                 0
neutral                 0
away_rank               0
away_previous_points    0
away_rank_change        0
dtype: int64

In [780]:
home_away.duplicated().sum()

48

In [781]:
home_away = home_away.drop_duplicates()

In [782]:
home_away.drop(['year','month'],axis=1,inplace=True)

In [783]:
def result(row):
  if row['home_score'] < row['away_score']:
    outcome = 'Lose'
  elif row['home_score'] > row['away_score']:
    outcome = 'Win'
  else:
    outcome = 'Draw'
  return outcome

home_away['result'] = home_away.apply(result, axis=1)
home_away

Unnamed: 0,home_rank,home_previous_points,home_rank_change,home_team,away_team,home_score,away_score,tournament,country,neutral,away_rank,away_previous_points,away_rank_change,result
0,4,55.0,0.0,Sweden,Switzerland,1,2,Friendly,Sweden,False,3,50.0,9.0,Lose
1,4,55.0,0.0,Sweden,France,1,1,FIFA World Cup qualification,Sweden,False,12,45.0,7.0,Draw
2,5,51.0,5.0,Argentina,Peru,2,1,FIFA World Cup qualification,Argentina,False,70,16.0,8.0,Win
3,5,51.0,5.0,Argentina,Paraguay,0,0,FIFA World Cup qualification,Argentina,False,67,22.0,1.0,Draw
4,8,55.0,-5.0,Brazil,Mexico,1,1,Friendly,Brazil,False,14,42.0,11.0,Draw
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3746,82,453.0,-2.0,Bosnia and Herzegovina,Turkey,2,0,Friendly,Bosnia and Herzegovina,False,30,588.0,-1.0,Win
3747,83,452.0,-2.0,Haiti,Honduras,0,4,Friendly,United States,True,49,528.0,2.0,Lose
3748,85,435.0,7.0,El Salvador,Haiti,1,2,Friendly,United States,True,83,452.0,-2.0,Lose
3749,90,440.0,0.0,Albania,Cyprus,0,0,Friendly,Albania,False,63,503.0,-1.0,Draw
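
The row-wise `apply` above can also be written in vectorised form. A minimal sketch on toy scores, assuming `np.select` as the replacement (the labels match those produced by `result`):

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"home_score": [1, 1, 2, 0], "away_score": [2, 1, 1, 0]})

# Conditions are checked in order; rows matching neither fall through to 'Draw'.
conditions = [
    scores["home_score"] > scores["away_score"],
    scores["home_score"] < scores["away_score"],
]
scores["result"] = np.select(conditions, ["Win", "Lose"], default="Draw")
print(scores)
```

On a few thousand rows the difference is small, but the vectorised form avoids a Python-level function call per row.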


In [784]:
home_away['neutral']=pd.to_numeric(home_away['neutral'])

## 6. Exploratory Analysis

In [785]:
# Summarising the univariate statistics and recording our observations
#
home_away.describe(include='all')

Unnamed: 0,home_rank,home_previous_points,home_rank_change,home_team,away_team,home_score,away_score,tournament,country,neutral,away_rank,away_previous_points,away_rank_change,result
count,3703.0,3703.0,3703.0,3703,3703,3703.0,3703.0,3703,3703,3703,3703.0,3703.0,3703.0,3703
unique,,,,179,181,,,47,184,2,,,,3
top,,,,Mexico,Zambia,,,Friendly,United States,False,,,,Win
freq,,,,62,53,,,1359,159,2871,,,,1872
mean,66.983527,166.269241,1.444775,,,1.674318,1.061572,,,,69.436943,162.52849,1.219822,
std,47.001345,219.679412,6.106529,,,1.665199,1.238667,,,,49.01251,217.975731,6.518582,
min,1.0,0.0,-28.0,,,0.0,0.0,,,,1.0,0.0,-30.0,
25%,28.0,30.0,-2.0,,,1.0,0.0,,,,28.0,29.0,-2.0,
50%,60.0,47.0,0.0,,,1.0,1.0,,,,61.0,46.0,0.0,
75%,100.0,293.0,3.0,,,2.0,2.0,,,,102.5,279.5,3.0,


In [786]:
home_away['tournament'].value_counts()

Friendly                                      1359
FIFA World Cup qualification                   687
UEFA Euro qualification                        409
African Cup of Nations qualification           238
African Cup of Nations                          83
AFC Asian Cup qualification                     80
Copa América                                    72
FIFA World Cup                                  53
CFU Caribbean Cup qualification                 46
Amílcar Cabral Cup                              46
Gulf Cup                                        44
CFU Caribbean Cup                               36
COSAFA Cup                                      35
UNCAF Cup                                       33
Gold Cup                                        33
AFF Championship                                32
UEFA Euro                                       27
Confederations Cup                              27
CECAFA Cup                                      26
Oceania Nations Cup            

In [787]:
home_away.corr()

Unnamed: 0,home_rank,home_previous_points,home_rank_change,home_score,away_score,neutral,away_rank,away_previous_points,away_rank_change
home_rank,1.0,-0.236222,0.019347,-0.131751,0.217652,0.024255,0.431671,-0.079422,0.00134
home_previous_points,-0.236222,1.0,-0.081292,0.047629,-0.054131,0.011689,-0.069653,0.882088,-0.041093
home_rank_change,0.019347,-0.081292,1.0,0.068069,-0.059297,0.051178,0.012879,-0.064373,0.128704
home_score,-0.131751,0.047629,0.068069,1.0,-0.146512,0.000765,0.341807,-0.106291,-0.042303
away_score,0.217652,-0.054131,-0.059297,-0.146512,1.0,0.088694,-0.184533,0.044324,0.075035
neutral,0.024255,0.011689,0.051178,0.000765,0.088694,1.0,0.037767,0.001108,0.055901
away_rank,0.431671,-0.069653,0.012879,0.341807,-0.184533,0.037767,1.0,-0.254479,0.015475
away_previous_points,-0.079422,0.882088,-0.064373,-0.106291,0.044324,0.001108,-0.254479,1.0,-0.070033
away_rank_change,0.00134,-0.041093,0.128704,-0.042303,0.075035,0.055901,0.015475,-0.070033,1.0


## Checking for Multicollinearity

In [788]:
# The diagonal of the inverse correlation matrix holds each variable's VIF score
correlations=home_away.corr()
pd.DataFrame(np.linalg.inv(correlations.values), index = correlations.index, columns=correlations.columns)

Unnamed: 0,home_rank,home_previous_points,home_rank_change,home_score,away_score,neutral,away_rank,away_previous_points,away_rank_change
home_rank,2.21807,2.280845,-0.039929,0.42661,-0.500434,0.026886,-1.590882,-2.174098,0.022349
home_previous_points,2.280845,7.85572,0.173864,-0.111872,-0.15915,-0.037476,-2.293984,-7.337687,-0.171688
home_rank_change,-0.039929,0.173864,1.041587,-0.103662,0.093594,-0.056252,0.046842,-0.102686,-0.143037
home_score,0.42661,-0.111872,-0.103662,1.286934,-0.04205,0.017358,-0.610968,0.114809,0.082293
away_score,-0.500434,-0.15915,0.093594,-0.04205,1.199513,-0.110184,0.487458,0.166233,-0.099441
neutral,0.026886,-0.037476,-0.056252,0.017358,-0.110184,1.01742,-0.073581,0.015648,-0.039975
away_rank,-1.590882,-2.293984,0.046842,-0.610968,0.487458,-0.073581,2.447934,2.434975,-0.023827
away_previous_points,-2.174098,-7.337687,-0.102686,0.114809,0.166233,0.015648,2.434975,7.933362,0.22402
away_rank_change,0.022349,-0.171688,-0.143037,0.082293,-0.099441,-0.039975,-0.023827,0.22402,1.040559
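
The variance inflation factor (VIF) of each variable is the corresponding diagonal entry of the inverse correlation matrix, which is why only the diagonal of the table above matters here. A small sketch on synthetic data, where the column names `a`, `b`, `c` are illustrative and `c` is built as a near copy of `a`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
# c is almost a copy of a, so both should show a high VIF.
c = a + 0.1 * rng.normal(size=n)
features = pd.DataFrame({"a": a, "b": b, "c": c})

corr = features.corr()
# Keep only the diagonal of the inverse correlation matrix: one VIF per column.
vif = pd.Series(np.diag(np.linalg.inv(corr.values)), index=corr.index, name="VIF")
print(vif.round(2))
```

Reading the diagonal of the table above the same way, `home_previous_points` (about 7.86) and `away_previous_points` (about 7.93) are the only variables above the threshold of 5.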


In [789]:
# dropping the variables with a VIF score above 5 (home_previous_points and away_previous_points)
home_away.drop(['home_previous_points','away_previous_points'],axis=1,inplace=True)

In [790]:
#checking how vif score has been affected
correlations=home_away.corr()
pd.DataFrame(np.linalg.inv(correlations.values), index = correlations.index, columns=correlations.columns)

Unnamed: 0,home_rank,home_rank_change,home_score,away_score,neutral,away_rank,away_rank_change
home_rank,1.554079,-0.087994,0.459508,-0.453516,0.036984,-0.913023,0.074771
home_rank_change,-0.087994,1.034436,-0.101757,0.096144,-0.054352,0.081446,-0.142758
home_score,0.459508,-0.101757,1.285242,-0.044484,0.017009,-0.646428,0.07924
away_score,-0.453516,0.096144,-0.044484,1.196002,-0.110628,0.436225,-0.103956
neutral,0.036984,-0.054352,0.017009,-0.110628,1.016894,-0.079284,-0.039652
away_rank,-0.913023,0.081446,-0.646428,0.436225,-0.079284,1.698933,-0.091195
away_rank_change,0.074771,-0.142758,0.07924,-0.103956,-0.039652,-0.091195,1.033054


## 7. Implementing the Solution

In [791]:
## approach 1-polynomial

In [792]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics 

### Model 1: Predicting home team scores

In [793]:


# selecting the predictors (X) and the target (y)
X=home_away[['home_rank','home_rank_change','away_score','neutral','away_rank','away_rank_change']]
y=home_away['home_score']

In [794]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [795]:
# Feature scaling
# We now need to perform feature scaling. We execute the following code to do so:
# 
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [796]:
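# Reducing the scaled features to a single linear discriminant.
# LDA is a classification technique, so each distinct goal count in y_train is treated as a class.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA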
lda = LDA(n_components=1)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

In [797]:
# Training the Algorithm
# ---
# To train the algorithm we use the fit() method of the LinearRegression class
# ---
# 
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [798]:
# Making Predictions
# ---
# To make predictions on the test data, execute the following
# ---
# 
y_pred = regressor.predict(X_test)

# To compare the actual output values for X_test with the predicted values
# 
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df


Unnamed: 0,Actual,Predicted
1047,0,1.737227
3058,4,2.409315
2822,3,1.656856
2084,4,1.394097
1384,1,1.773403
...,...,...
2826,2,1.996165
1729,10,4.039461
23,1,0.642943
680,0,0.937953


In [799]:
# Evaluating the Algorithm
# ---
# 
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))


Mean Absolute Error: 1.1154434186452846
Mean Squared Error: 2.5097212069768093
Root Mean Squared Error: 1.5842099630341961


## Cross-Validation

In [800]:
from sklearn.model_selection import KFold

# We will use the same 6 independent variables for this
X = home_away[['home_rank', 'away_rank', 'away_score','away_rank_change','home_rank_change','neutral']].values
y = home_away['home_score'].values

folds = KFold(n_splits=5)

# note that if you have a KFold object, you can figure out how many folds you set up 
# for it using get_n_splits
print('we are using ' +str(folds.get_n_splits(X)) + ' folds')

# We now create and assess 5 models based on the folds we created.
RMSES = [] # We will use this list to keep track of the RMSE of each model
count = 1 # Fold counter, used only to label the printed output
for train_index, test_index in folds.split(X):
  print('\nTraining model ' + str(count))
  
  # set up the train and test based on the split determined by KFold
  # With 5 folds, we will end up with 80% of our data in the training set, and 20% in the test set, just as above
  X_train, X_test = X[train_index], X[test_index]
  y_train, y_test = y[train_index], y[test_index]
  
  # fit a model accordingly
  regressor = LinearRegression()  
  regressor.fit(X_train, y_train)
  
  # assess the accuracy of the model
  y_pred = regressor.predict(X_test)
  
  rmse_value =  np.sqrt(metrics.mean_squared_error(y_test, y_pred))
  RMSES.append(rmse_value)
  
  print('Model ' + str(count) + ' Root Mean Squared Error:',rmse_value)
  count = count + 1

we are using 5 folds

Training model 1
Model 1 Root Mean Squared Error: 1.29330583339767

Training model 2
Model 2 Root Mean Squared Error: 1.3616056535215242

Training model 3
Model 3 Root Mean Squared Error: 1.5536523303181777

Training model 4
Model 4 Root Mean Squared Error: 1.5323220680835352

Training model 5
Model 5 Root Mean Squared Error: 1.591281554265497


In [801]:
np.mean(RMSES)

1.4664334879172807
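
The manual loop above can be condensed with `cross_val_score`. A minimal sketch, assuming synthetic stand-in data (`X_demo`, `y_demo`) and a pipeline that refits the scaler inside each fold, so that no information from a test fold reaches training; this is an alternative to the loop above, not a reproduction of it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))  # stand-in for the six predictors
y_demo = X_demo @ rng.normal(size=6) + rng.normal(scale=0.5, size=200)

model = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn maximises scores, so RMSE is exposed as a negative number.
neg_rmse = cross_val_score(model, X_demo, y_demo, cv=cv,
                           scoring="neg_root_mean_squared_error")
rmse_per_fold = -neg_rmse
print(rmse_per_fold, rmse_per_fold.mean())
```

Shuffling the folds is also an assumption of this sketch; the loop above uses `KFold` without shuffling, so its folds follow the row order of `home_away`.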

In [802]:
# assessing heteroscedasticity using Bartlett's test (residuals taken from the last fold)
residuals = np.subtract(y_pred, y_test)
residuals.mean()

0.018880602912629024

In [803]:
import scipy as sp
import scipy.stats  # loads the stats submodule so that sp.stats is available

test_result, p_value = sp.stats.bartlett(y_pred, residuals)
test_result, p_value

(335.3230770267489, 6.659501190097627e-75)

In [804]:
# To interpret the results we must also compute a critical value of the chi squared distribution.
# Bartlett's statistic has k - 1 degrees of freedom, where k is the number of samples compared (here 2).
degree_of_freedom = 2 - 1
# The critical value comes from a chosen significance level, not from the p-value:
# 1 - p_value rounds to 1.0 in floating point, which gives a critical value of inf.
alpha = 0.05
probability = 1 - alpha

critical_value = sp.stats.chi2.ppf(probability, degree_of_freedom)
print(round(critical_value, 3))

# If the test_result is greater than the critical value, then we reject our null
# hypothesis. This would mean that there are patterns to the variance of the data

# Otherwise, we can identify no patterns, and we fail to reject the null hypothesis that 
# the variance is homogeneous across our data


3.841


In [805]:
if (test_result > critical_value):
  print('the variances are unequal, and the model should be reassessed')
else:
  print('the variances are homogeneous!')

the variances are unequal, and the model should be reassessed
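
Bartlett's test compares the variances of two or more samples. A hedged sketch of one common way to apply it to residuals, using synthetic predictions and residuals (all names are illustrative): split the residuals at the median prediction and test whether the two halves share a variance. This grouping differs from the one used in the cells above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y_hat = rng.uniform(0, 4, size=400)  # stand-in predictions
# Noise whose spread grows with the prediction, i.e. heteroscedastic by design.
resid = rng.normal(scale=0.2 + 0.5 * y_hat)

low = resid[y_hat <= np.median(y_hat)]
high = resid[y_hat > np.median(y_hat)]

stat, p_value = stats.bartlett(low, high)
alpha = 0.05
critical = stats.chi2.ppf(1 - alpha, df=1)  # k - 1 = 1 for two groups
print(stat, p_value, critical)
print("unequal variances" if stat > critical else "no evidence of unequal variances")
```

Comparing `stat` with `critical` and comparing `p_value` with `alpha` are equivalent decisions, so either one can be reported.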


### Model 2: Predicting away team scores

In [806]:
X = home_away[['home_rank', 'away_rank', 'home_score','away_rank_change','home_rank_change','neutral']].values
y = home_away['away_score'].values

In [807]:
poly_reg = PolynomialFeatures(degree =2) 
X_poly = poly_reg.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=0)


pol_reg = LinearRegression()
pol_reg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [808]:
coeff_2 =([pol_reg.coef_])
coeff_2


[array([-3.67885355e-15,  9.03347685e-03, -1.17696838e-02,  1.28618477e-01,
         1.85833733e-02, -1.78117484e-02,  1.54819868e-01,  3.47332867e-05,
        -5.26892652e-05, -5.73037982e-04,  2.90493166e-04, -1.69048927e-04,
         2.72321963e-03,  4.82409026e-05, -3.21206057e-04, -1.41766471e-04,
         3.55097558e-05, -2.52939136e-03, -3.51414100e-03, -5.24843896e-03,
         6.02142858e-04, -3.28798241e-02, -1.97817456e-04,  4.05720577e-04,
         2.82650419e-03,  3.17999474e-04,  1.23935136e-02,  1.54819868e-01])]

In [809]:
y_pred = pol_reg.predict(X_test)

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))


Mean Absolute Error: 0.8307244779393816
Mean Squared Error: 1.1734080265364757
Root Mean Squared Error: 1.083239597935967


In [810]:
from sklearn.model_selection import KFold

# We will use the same 6 independent variables for this
X = home_away[['home_rank', 'away_rank', 'home_score','away_rank_change','home_rank_change','neutral']].values
y = home_away['away_score'].values
poly_reg = PolynomialFeatures(degree =2) 
X_poly = poly_reg.fit_transform(X)

folds = KFold(n_splits=5)

# We now create and assess 5 models based on the folds we created.
RMSES = [] # We will use this list to keep track of the RMSE of each model
count = 1 # Fold counter, used only to label the printed output
for train_index, test_index in folds.split(X_poly):
  print('\nTraining model ' + str(count))
  
  # set up the train and test based on the split determined by KFold
  # With 5 folds, we will end up with 80% of our data in the training set, and 20% in the test set, just as above
  X_train, X_test = X_poly[train_index], X_poly[test_index]
  y_train, y_test = y[train_index], y[test_index]
  
  # fit a model accordingly
  regressor = LinearRegression()  
  regressor.fit(X_train, y_train)
  
  # assess the accuracy of the model
  y_pred = regressor.predict(X_test)
  
  rmse_value =  np.sqrt(metrics.mean_squared_error(y_test, y_pred))
  RMSES.append(rmse_value)
  
  print('Model ' + str(count) + ' Root Mean Squared Error:',rmse_value)
  count = count + 1


Training model 1
Model 1 Root Mean Squared Error: 1.121803058126355

Training model 2
Model 2 Root Mean Squared Error: 1.1896472883937605

Training model 3
Model 3 Root Mean Squared Error: 1.1417120330943966

Training model 4
Model 4 Root Mean Squared Error: 1.0957484128523198

Training model 5
Model 5 Root Mean Squared Error: 1.1034255794469519


In [811]:
np.mean(RMSES)

1.1304672743827566

In [813]:
# Bartlett's test (residuals taken from the last fold)
residuals = np.subtract(y_pred, y_test)
residuals.mean()

0.023045235137249098

In [814]:
import scipy as sp
import scipy.stats  # loads the stats submodule so that sp.stats is available

test_result, p_value = sp.stats.bartlett(y_pred, residuals)
test_result, p_value

(321.7664604704001, 5.972156267063777e-72)

In [815]:
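# Bartlett's statistic has k - 1 degrees of freedom, where k is the number of samples compared (here 2).
degree_of_freedom = 2 - 1
# The critical value comes from a chosen significance level, not from the p-value:
# 1 - p_value rounds to 1.0 in floating point, which gives a critical value of inf.
alpha = 0.05
probability = 1 - alpha

critical_value = sp.stats.chi2.ppf(probability, degree_of_freedom)
print(round(critical_value, 3))


3.841


In [816]:
if (test_result > critical_value):
  print('the variances are unequal, and the model should be reassessed')
else:
  print('the variances are homogeneous!')

the variances are unequal, and the model should be reassessed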


## Logistic Regression

In [817]:
home_away.head()

Unnamed: 0,home_rank,home_rank_change,home_team,away_team,home_score,away_score,tournament,country,neutral,away_rank,away_rank_change,result
0,4,0.0,Sweden,Switzerland,1,2,Friendly,Sweden,False,3,9.0,Lose
1,4,0.0,Sweden,France,1,1,FIFA World Cup qualification,Sweden,False,12,7.0,Draw
2,5,5.0,Argentina,Peru,2,1,FIFA World Cup qualification,Argentina,False,70,8.0,Win
3,5,5.0,Argentina,Paraguay,0,0,FIFA World Cup qualification,Argentina,False,67,1.0,Draw
4,8,-5.0,Brazil,Mexico,1,1,Friendly,Brazil,False,14,11.0,Draw


In [824]:
# Import label encoder 
from sklearn import preprocessing
# label_encoder object knows how to understand word labels. 
label_encoder = preprocessing.LabelEncoder()
# Encode labels in columns neutral and result.
# LabelEncoder sorts the classes, so result becomes Draw=0, Lose=1, Win=2.
home_away['neutral']= label_encoder.fit_transform(home_away['neutral'])
home_away['result']= label_encoder.fit_transform(home_away['result'])

In [825]:
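# NOTE: home_score and away_score determine result exactly, so keeping them as features
# leaks the target into the model; the perfect confusion matrix below reflects this.
X=home_away[['home_rank','home_rank_change','home_score','away_score','neutral','away_rank','away_rank_change']]
y=home_away['result']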

In [826]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state=25)

In [827]:
# Fitting our model
# 
from sklearn.linear_model import LogisticRegression

LogReg = LogisticRegression()
LogReg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [828]:
# Using our model to make a prediction
#
y_pred = LogReg.predict(X_test)

In [829]:
# Evaluating the model
#
from sklearn.metrics import confusion_matrix
# stored under a new name so that the imported function is not overwritten
cm = confusion_matrix(y_test, y_pred)
cm

array([[240,   0,   0],
       [  0, 296,   0],
       [  0,   0, 575]])
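
A confusion matrix with no off-diagonal entries is what one would expect when the features include both scores, since the result is a direct function of them. A hedged sketch of a classifier restricted to information available before kick-off, using synthetic data (the generated ranks and outcomes are illustrative, not taken from `home_away`), with scaling and a higher `max_iter` to avoid the convergence warning:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 600
home_rank = rng.integers(1, 200, size=n)
away_rank = rng.integers(1, 200, size=n)
neutral = rng.integers(0, 2, size=n)

# Synthetic outcome: a better (lower) home rank makes a home win more likely.
strength = (away_rank - home_rank) / 50 + rng.normal(size=n)
outcome = np.where(strength > 0.5, 2, np.where(strength < -0.5, 1, 0))  # 0=Draw, 1=Lose, 2=Win

X_pre = np.column_stack([home_rank, away_rank, neutral])
X_tr, X_te, y_tr, y_te = train_test_split(X_pre, outcome, test_size=0.3, random_state=25)

# Scaling happens inside the pipeline, so it is fitted on the training split only.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

cm_demo = confusion_matrix(y_te, pred, labels=[0, 1, 2])
print(cm_demo)
print(accuracy_score(y_te, pred))
```

On the real data the same pipeline could be fitted on `home_rank`, `away_rank`, the rank changes, `neutral`, and an indicator for friendlies; its accuracy would then be a fairer measure of predictive power.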

## 8. Challenging the solution

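Our data could not produce a reliable model. Under 5-fold cross-validation the mean RMSE was about 1.47 goals for the home-score model and 1.13 goals for the away-score model, which is close to the mean scores themselves (about 1.67 and 1.06 goals). The perfect confusion matrix from the logistic regression is not evidence of a good classifier either, because `home_score` and `away_score`, which determine the result exactly, were among the features.
More data could help, for example match statistics such as each team's average possession, or the average rating of the players.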





## 9. Follow up questions

> At this point, we can refine our question or collect new data, all in an iterative process to get at the truth.



### a). Did we have the right data?

### b). Do we need other data to answer our question?

### c). Did we have the right question?