# Grade: /20 Mark(s)

# Assignment 04: Model Selection & Cross Validation

### You're a Data Scientist!
You are working as a Junior Data Scientist for a professional football (er, soccer) club.  The owner of the team is very interested in seeing how the use of data can help improve the team's performance, and perhaps win them a championship!

The draft is coming up soon (that's when you get to pick new players for your team), and the owner has asked you to create a model to help score potential draftees.  The model should look at attributes of the player and predict what their "rating" will be once they start playing professionally.

The football club's data team has provided you with data for 17,993 footballers from the league.  Your job: work with the Senior Data Scientist to build a model or models, perform model selection, and make predictions on players you have not yet seen.

### The Dataset

The data is stored in a csv file called `footballer_data.csv`.  The data contain 52 columns, including some information about the player, their skills, and their overall measure as an effective footballer.

Most features relate to the player's abilities in football-related skills, such as passing, shooting, dribbling, etc.  Some features are rated on a 1-5 scale (5 being the best), others are rated on a 0-100 scale (100 being the best), and others still are categorical (e.g. work rate is coded as low, medium, or high).

The target variable (or $y$ variable, if you will) is `overall`.  This is an overall measure of the footballer's skill and is rated from 0 to 100.  The most amazingly skilled footballer would be rated 100, whereas I would struggle to score more than a 20. The model(s) you build should use the other features to predict `overall`.


### Follow These Steps before submitting
Once you are finished, make sure you complete the following steps.

1.  Restart your kernel by clicking 'Kernel' > 'Restart & Run All'.

2.  Fix any errors which result from this.

3.  Repeat steps 1. and 2. until your notebook runs without errors.

4.  Submit your completed notebook to OWL by the deadline.


### Preliminaries
---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error,make_scorer
pd.set_option('display.max_columns', 500)

%matplotlib inline

### Question 1: /2 Mark(s)

Read in the data and take a look at the dataframe.  There should be 52 columns. The outcome of interest is called `overall` which gives an overall measure of player performance. Not all of the other columns are particularly useful for modelling though (for instance, `ID` is just a unique identifier for the player.  This is essentially an arbitrary number and has no bearing on the player's rating).

The Senior Data Scientist thinks the following columns should be removed:

* ID
* club
* club_logo
* birth_date
* flag
* nationality
* photo
* potential

The Senior Data Scientist would also like the following columns converted into dummy variables:

* work_rate_att
* work_rate_def
* preferred_foot

Clean the data according to the Senior Data Scientist's instructions.
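
The two cleaning steps above (dropping columns, then creating dummy variables) can be sketched on a tiny made-up table. The frame `toy` and its values are hypothetical stand-ins, not rows from `footballer_data.csv`; only the column names are borrowed from the lists above.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real data (values are made up).
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "age": [21, 27, 33],
    "preferred_foot": ["Left", "Right", "Right"],
    "work_rate_att": ["High", "Low", "Medium"],
    "overall": [70, 82, 75],
})

# Drop the arbitrary identifier, then one-hot encode only the named columns.
cleaned = toy.drop(columns=["ID"])
encoded = pd.get_dummies(
    cleaned,
    columns=["preferred_foot", "work_rate_att"],
    drop_first=True,  # a column with k levels becomes k - 1 indicator columns
)

print(encoded.columns.tolist())
```

Using `drop_first=True` is a design choice: it avoids indicator columns that sum to one, which would be perfectly collinear with the intercept of a linear regression.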

In [None]:
df = pd.read_csv("footballer_data.csv")
data = df.drop(['ID','club','club_logo','flag', 'nationality','photo','potential', 'birth_date'], axis = 'columns')
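# One-hot encode the three categorical columns named above; drop_first=True
# drops one level per column so the dummies are not collinear with the intercept.
model_data = pd.get_dummies(data, columns=['work_rate_att', 'work_rate_def', 'preferred_foot'], drop_first=True)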
model_data


### Question 2: /1 Mark(s)

The data should all be numerical now. Before we begin modelling, it is important to obtain a baseline for the accuracy of our predictive models. Compute the absolute errors that result if we use the median of the `overall` variable as the prediction for every player. This will serve as our baseline performance. Plot the distribution of the errors and print their mean and standard deviation.
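
As a small worked example of the median baseline, here is the arithmetic on five made-up ratings. The array `ratings` is hypothetical and chosen only so the numbers are easy to check by hand.

```python
import numpy as np

# Hypothetical ratings, not taken from the dataset.
ratings = np.array([60, 65, 66, 70, 89])

baseline_prediction = np.median(ratings)              # 66.0
abs_errors = np.abs(ratings - baseline_prediction)    # [6, 1, 0, 4, 23]

print(abs_errors.mean())  # 6.8, the baseline mean absolute error
print(abs_errors.std())
```

The median is a natural baseline here because it is the constant prediction that minimizes mean absolute error.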

In [None]:
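# Baseline: predict the median rating for every player and measure the absolute error.
ab_error = abs(data['overall'] - np.median(data['overall']))
sns.histplot(ab_error, kde=True)  # distplot is deprecated in recent seaborn releases
print(np.mean(ab_error))
print(np.std(ab_error))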

### Question 3: /3 Mark(s)
To prepare the data for modelling, the Senior Data Scientist recommends you use `sklearn.model_selection.train_test_split` to separate the data into a training set and a test set.

The Senior Data Scientist would like to estimate the performance of the final selected model to within +/- 0.25 units using mean absolute error as the loss function of choice.  Decide on an appropriate size for the test set, then use `train_test_split` to split the features and target variables into appropriate sets.
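
One way to reason about the test-set size, sketched under a normal approximation: the half-width of a 95% interval for the mean absolute error is roughly $1.96\, s / \sqrt{n}$, where $s$ is the standard deviation of the absolute errors and $n$ is the number of test rows. Solving for $n$ gives $n \ge (1.96\, s / 0.25)^2$. The helper `required_test_size` and the value `assumed_std = 5.0` are hypothetical; the model's error spread is not known before fitting, so the value would have to come from something like the baseline errors.

```python
import math

def required_test_size(abs_error_std, half_width, z=1.96):
    """Smallest n with z * abs_error_std / sqrt(n) <= half_width."""
    return math.ceil((z * abs_error_std / half_width) ** 2)

assumed_std = 5.0    # hypothetical spread of the absolute errors
n_players = 17_993   # dataset size stated in the introduction

n_required = required_test_size(assumed_std, half_width=0.25)
print(n_required)                # rows needed under the assumption above
print(n_required / n_players)    # the same number as a fraction of the data
```

A smaller error spread would need fewer test rows, so a size chosen from the baseline spread is likely to be conservative for a model that beats the baseline.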

In [None]:
Y = model_data.overall
X = model_data.drop("overall", axis = "columns")

# Hold out 10% of the 17,993 players (about 1,800) as the test set.
Train_x,Test_x,Train_y,Test_y = train_test_split(X,Y,test_size = 0.10,random_state = 0)

### Question 4: /1 Mark(s)


The Senior Data Scientist wants you to fit a linear regression to the data as a first model.  Use sklearn to build a model pipeline which fits a linear regression to the data. (This will be a very simple, one-step pipeline but we will expand it later.) You can read up on sklearn pipelines [here](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). Note that the sklearn linear regression adds its own intercept so you don't need to create a column of 1s.
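
A minimal sketch of a one-step pipeline on synthetic data, to show that `LinearRegression` fits the intercept itself. The arrays `X_demo` and `y_demo` are made up for illustration and contain no column of 1s.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
# Noiseless synthetic target with intercept 3 and slopes 2 and -1.
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

demo_pipe = Pipeline([("regression", LinearRegression())])
demo_pipe.fit(X_demo, y_demo)

# Each step can be retrieved by the name given in the pipeline definition.
fitted = demo_pipe.named_steps["regression"]
print(fitted.intercept_, fitted.coef_)
```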

In [None]:
# One-step pipeline: ordinary least squares on the raw features.
pipe_model = Pipeline([
           ("regression", LinearRegression())
])

linear_model = pipe_model.fit(Train_x,Train_y)


### Question 5: /3 Mark(s)

The Senior Data Scientist wants a report of this model's cross validation score.  Use 5-fold cross validation to estimate the out-of-sample performance of this model.  You may find sklearn's `cross_val_score` useful.
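
As a sketch of how `cross_val_score` can report mean absolute error, here is a run on synthetic data. The arrays are made up. The built-in scorer string `"neg_mean_absolute_error"` returns negated errors (sklearn scorers follow a "higher is better" convention), so the sign is flipped before averaging.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))
# Synthetic linear target with Gaussian noise of standard deviation 0.5.
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=300)

demo_pipe = Pipeline([("regression", LinearRegression())])

# One score per fold; each fold is held out once while the rest is used to fit.
neg_mae = cross_val_score(demo_pipe, X_demo, y_demo, cv=5,
                          scoring="neg_mean_absolute_error")
cv_mae = -neg_mae

print(cv_mae.mean(), cv_mae.std())
```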

In [None]:
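# Score with mean absolute error, the loss the Senior Data Scientist asked for,
# so the result is comparable to the baseline and to the final test estimate.
cross_result = cross_val_score(pipe_model,
                                    Train_x,
                                    Train_y,
                                    cv = 5,
                                    scoring = make_scorer(mean_absolute_error))

print(f"5-fold cross validation MAE: {np.mean(cross_result)}")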


### Question 6: /3 Mark(s)

That's impressive!  Your model seems to be very accurate, but now the Senior Data Scientist wants to try and make it more accurate.  Scouts have shared with the Senior Data Scientist that players hit their prime in their late 20s, and as they age they become worse overall.

The Senior Data Scientist wants to add a quadratic term for age to the model.  Repeat the steps above (creating a pipeline, validating the model, etc) for a model which includes a quadratic term for age.
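
One possible way to keep the quadratic age term inside the pipeline itself is a `ColumnTransformer` that expands only the `age` column. This is a sketch on synthetic data, not the notebook's own solution; the frame `demo`, the `passing` column values, and the target formula (which peaks at age 28) are all made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "age": rng.integers(17, 40, size=400),
    "passing": rng.uniform(30, 90, size=400),
})
# Hypothetical noiseless rating: 1.6 + 5.6*age - 0.1*age^2 + 0.2*passing.
rating = 80 - 0.1 * (demo["age"] - 28) ** 2 + 0.2 * demo["passing"]

# Expand age into (age, age^2); every other column passes through unchanged.
add_age_squared = ColumnTransformer(
    [("age_poly", PolynomialFeatures(degree=2, include_bias=False), ["age"])],
    remainder="passthrough",
)
quad_pipe = Pipeline([
    ("age_terms", add_age_squared),
    ("regression", LinearRegression()),
])
quad_pipe.fit(demo, rating)

# Coefficient order follows the transformer output: age, age^2, passing.
coefs = quad_pipe.named_steps["regression"].coef_
print(coefs)
```

Building the squared term inside the pipeline means the same transformation is applied automatically to any data passed to `predict`, including a held-out test set.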

In [None]:
# Add the squared age term as an extra column, then reuse the linear pipeline.
newtrain = Train_x.assign(age2 = Train_x.age**2)
cross_result = cross_val_score(pipe_model,
                                    newtrain,
                                    Train_y,
                                    cv = 5,
                                    scoring = make_scorer(mean_absolute_error))
print(f"5-fold cross validation MAE with age squared: {np.mean(cross_result)}")



### Question 7: /3 Mark(s)


The Senior Data Scientist isn't too happy that the quadratic term has not improved the fit of the model much and now wants to include quadratic and interaction terms for every feature (that's a total of 1080 features!).

Add sklearn's `PolynomialFeatures` to your pipeline from part C.  Report the cross validation score.
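
The feature count quoted above can be checked with a short sketch. The value `n_features = 45` is an assumption: it is the number of input columns for which a degree-2 expansion without a bias column gives exactly 1080 outputs.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 45  # assumed input width, chosen to match the 1080 quoted above

# Fitting on a dummy row is enough for the transformer to count its outputs.
poly = PolynomialFeatures(degree=2, include_bias=False)
poly.fit(np.zeros((1, n_features)))
print(poly.n_output_features_)

# Closed form: the linear terms plus all squares and pairwise interactions.
closed_form = n_features + comb(n_features + 1, 2)
print(closed_form)
```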

In [None]:
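# PolynomialFeatures(degree=2) already generates age**2 along with every other
# square and pairwise interaction, so start from Train_x rather than newtrain.
# The 1080 features quoted above match 45 input columns, which should be
# Train_x without the extra age2 column.
pipe_model_poly = Pipeline([
          ('poly',PolynomialFeatures(degree=2, include_bias=False)),
           ("regression", LinearRegression())
])


cross_result_poly = cross_val_score(pipe_model_poly,
                               Train_x,
                               Train_y,
                               cv = 5,
                               scoring = make_scorer(mean_absolute_error)
                               )
print(f"5-fold cross validation MAE with quadratic and interaction terms: {np.mean(cross_result_poly)}")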


### Question 8: /2 Mark(s)

The Senior Data Scientist is really happy with the results of adding every interaction into the model and wants to explore third order interactions (that is adding cubic terms to the model).

This is not a good idea!  Talk them down from the ledge.  Write them an email in the cell below explaining what could happen if you add too many interactions.

---

Hey Boss,

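I got your email about adding cubic terms to the model.  I don't think we should, because of the bias-variance tradeoff: as the number of fitted parameters grows, the model's bias decreases but its variance increases.  Past a certain number of parameters the extra variance outweighs the reduction in bias, so the model starts fitting noise in the training data and its error on unseen players goes up, even though the training error keeps falling.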
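
To put rough numbers on the argument in the email, the sketch below counts the polynomial features at each degree. The helper `n_polynomial_features` is hypothetical, and `n_inputs = 45` is an assumption chosen to be consistent with the 1080 quadratic features mentioned in Question 7.

```python
from math import comb

def n_polynomial_features(n_inputs, degree):
    """Number of monomials of total degree 1..degree in n_inputs variables."""
    return comb(n_inputs + degree, degree) - 1  # minus the constant term

n_inputs = 45  # assumed; consistent with 1080 degree-2 features

counts = {degree: n_polynomial_features(n_inputs, degree) for degree in (1, 2, 3)}
for degree, count in counts.items():
    print(degree, count)
```

Under that assumption the cubic model would have 17,295 columns, close to the 17,993 players in the whole dataset and more than the roughly 16,200 rows left for training after a 10% test split. An unregularized least squares fit with more columns than rows can typically reproduce the training ratings exactly, which says nothing about how it would rate new players.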

### Question 9:  /2 Mark(s)

You've successfully talked the Senior Data Scientist out of adding cubic terms to the model. Good job!

Based on the cross validation scores, which model would you choose?  Estimate the performance of your chosen model on the test data you held out, and do the following:

- Compute a point estimate for the generalization error.
- Compute a confidence interval for the generalization error.  
- Plot the distribution of the absolute errors.

Is the test error close to the cross validation error of the model you chose? Why do you think this is the case?
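
A minimal sketch of the point estimate and interval on synthetic predictions, assuming a normal approximation for the mean of the absolute errors. The arrays `y_true` and `y_pred` are made up; the key detail is that the standard error is computed from the absolute errors, because those are the quantities being averaged.

```python
import numpy as np

rng = np.random.default_rng(3)
y_true = rng.uniform(50, 90, size=1800)               # hypothetical ratings
y_pred = y_true + rng.normal(scale=2.0, size=1800)    # hypothetical predictions

abs_errors = np.abs(y_true - y_pred)

# Point estimate: the mean absolute error on the held-out rows.
mae = abs_errors.mean()

# 95% interval from the standard error of the mean absolute error.
standard_error = abs_errors.std(ddof=1) / np.sqrt(len(abs_errors))
ci_low, ci_high = mae - 1.96 * standard_error, mae + 1.96 * standard_error

print(mae, (ci_low, ci_high))
```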


In [None]:


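# Refit the selected model on the full training set, then score it once on the test set.
pipe_model_poly.fit(Train_x, Train_y)

pred_y = pipe_model_poly.predict(Test_x)

# The loss is the absolute error, so the interval uses the spread of the
# absolute errors rather than the spread of the signed residuals.
abs_errors = np.abs(Test_y - pred_y)

generalization_error = mean_absolute_error(Test_y, pred_y)

standard_error = np.std(abs_errors, ddof=1) / np.sqrt(len(abs_errors))
test_interval = generalization_error + 1.96 * standard_error * np.array([-1, 1])

sns.histplot(abs_errors, kde=True)  # distplot is deprecated in recent seaborn releases

print(generalization_error)
print(test_interval)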

