## Day 24 Lecture 1 Assignment

In this assignment, we will build our first logistic regression model on numeric data. We will use the FIFA soccer ratings dataset loaded below and analyze the model generated for this dataset.

In [17]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.calibration import calibration_curve
import seaborn as sns

In [3]:
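def remove_correlated_features(dataset, threshold):
    # Deletes (in place) each column whose correlation with an earlier, kept column is >= threshold
    col_corr = set()
    # Correlate numeric/boolean columns only; recent pandas versions raise on text columns such as Name
    corr_matrix = dataset.select_dtypes(include=['number', 'bool']).corr()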
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i]
                col_corr.add(colname)
                if colname in dataset.columns:
                    print(f'Deleted {colname} from dataset.')
                    del dataset[colname]

    return dataset

In [4]:
soccer_data = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/fifa_ratings.csv')

In [5]:
soccer_data.head()

Unnamed: 0,ID,Name,Overall,Crossing,Finishing,HeadingAccuracy,ShortPassing,Volleys,Dribbling,Curve,...,LongShots,Aggression,Interceptions,Positioning,Vision,Penalties,Composure,Marking,StandingTackle,SlidingTackle
0,158023,L. Messi,94,84,95,70,90,86,97,93,...,94,48,22,94,94,75,96,33,28,26
1,20801,Cristiano Ronaldo,94,84,94,89,81,87,88,81,...,93,63,29,95,82,85,95,28,31,23
2,190871,Neymar Jr,92,79,87,62,84,84,96,88,...,82,56,36,89,87,81,94,27,24,33
3,192985,K. De Bruyne,91,93,82,55,92,82,86,85,...,91,76,61,87,94,79,88,68,58,51
4,183277,E. Hazard,91,81,84,61,89,80,95,83,...,80,54,41,87,89,86,91,34,27,22


The response for our logistic regression model is a binary label, "Elite" or "Not Elite", indicating whether the player has an overall rating greater than or equal to 75. This corresponds to roughly the top 10% of soccer players in the dataset. Create the response column.

In [8]:
# answer goes here
soccer_data['elite'] = soccer_data['Overall'] >= 75
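A quick way to confirm that a threshold yields roughly the intended share of elite players is to take the mean of the boolean column. The sketch below uses a small made-up `toy` frame (an assumption for illustration, not the FIFA data), so the share it prints says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical ratings, not taken from the FIFA dataset
toy = pd.DataFrame({"Overall": [94, 76, 75, 74, 69, 63, 60, 58, 56, 50]})

# Same rule as above: elite means an overall rating of at least 75
toy["elite"] = toy["Overall"] >= 75

# The mean of a boolean column is the fraction of True values
elite_share = toy["elite"].mean()
print(f"share of elite players: {elite_share:.0%}")
```

Running the same `.mean()` on `soccer_data['elite']` would show how close the real data is to the "top 10% or so" mentioned above.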




Address potential collinearity issues by removing the appropriate features. There is no universally agreed upon technique for doing so, so feel free to use any reasonable method. We have provided the convenience function *remove_correlated_features* at the top as one way of doing so, and we use a threshold of 0.9 for that function to reduce correlation among features.

In [10]:
# answer goes here
soccer_data = remove_correlated_features(soccer_data, 0.9)
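As one possible alternative to the provided helper, the sketch below prunes columns using the absolute correlation and the upper triangle of the correlation matrix. It is a variation, not the course's method: the helper above compares signed correlations, whereas this version also catches strong negative ones. The `toy` frame and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),  # nearly identical to "a"
    "b": rng.normal(size=200),                       # independent of "a"
})

# Absolute correlations, so strong negative correlations are caught too
corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column that is highly correlated with a column to its left
to_drop = [col for col in upper.columns if (upper[col] >= 0.9).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Unlike the helper, this version returns a new frame and leaves the input untouched, which makes it easier to rerun a cell without side effects.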




Split the data into train and test, with 80% training and 20% testing. Be sure to leave out columns that would not make sense in the model, like the player ID column.

In [13]:
# answer goes here


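# NOTE: X still contains "Overall" and "elite". The label is derived from "Overall",
# so both columns leak the answer; exclude them too when fitting a model for real use.
X = soccer_data.drop(columns=["ID", "Name"])
y = soccer_data["elite"]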

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    # Fixed seed so the split is reproducible
    random_state=1969,
)

display(f"n train records: {X_train.shape[0]}")
display(f"n test records: {X_test.shape[0]}")


'n train records: 12897'

'n test records: 3225'
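When choosing which columns to leave out, it is worth checking for target leakage: the `elite` label is computed directly from `Overall`, so keeping either column among the features lets a model read off the answer. Below is a minimal sketch of a leak-free split on a made-up `toy` frame. The column values are random and the feature names are only borrowed from the dataset for readability.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1969)
toy = pd.DataFrame({
    "ID": np.arange(100),
    "Overall": rng.integers(50, 95, size=100),
    "Finishing": rng.integers(20, 95, size=100),
    "Vision": rng.integers(20, 95, size=100),
})
toy["elite"] = toy["Overall"] >= 75

# Drop identifiers plus the target and the column it was derived from
X = toy.drop(columns=["ID", "Overall", "elite"])
y = toy["elite"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1969
)
print(list(X_train.columns), X_train.shape[0], X_test.shape[0])
```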

Fit the logistic regression model using the statsmodels package and print out the coefficient summary. Which variables appear to be the most important, and what effect do they have on the probability of a player being elite?

In [14]:
# answer goes here

# Storing columns since we're going to overwrite
# X with a numpy array (which will delete its column names)
cols = X_train.columns

# Perform ANOVAs for each of our features and outcome
selector = SelectKBest(f_classif, k=2)

selector.fit(X_train, y_train)

X_train = selector.transform(X_train)
X_test = selector.transform(X_test)

# We don't have to transform this back into a dataframe
# this is just being done for better display
selected_cols = cols[selector.get_support()]
X_train = pd.DataFrame(X_train, columns=selected_cols, index=y_train.index)
X_train.head()





Unnamed: 0,Overall,elite
2297,74,False
15122,56,False
1141,76,True
5770,69,False
3853,71,False
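`f_classif` ranks each feature by a one-way ANOVA F-statistic, the ratio of between-class variance to within-class variance. The sketch below computes that ratio by hand for a single feature on made-up numbers. `anova_f` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
import numpy as np

def anova_f(x, y):
    """One-way ANOVA F-statistic of feature x across the classes in y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    grand_mean = x.mean()
    groups = [x[y == c] for c in classes]
    # Between-group sum of squares: how far each class mean is from the grand mean
    ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of the values around their own class mean
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    msb = ssb / (len(classes) - 1)        # mean square between
    msw = ssw / (len(x) - len(classes))   # mean square within
    return msb / msw

# Made-up feature values for two well-separated classes
x = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
y = np.array([0, 0, 0, 1, 1, 1])
f_stat = anova_f(x, y)
print(f_stat)
```

A feature that separates the classes well has a large F-statistic, which is why columns tied directly to the label rise to the top of the ranking.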


We have yet to discuss how to evaluate the model, which will happen next week, but one intuitive way to see if our model predictions are reasonable is to plot a calibration curve. In essence, the probabilities predicted by a good model will match the observed proportions of outcomes (i.e., if we take all of the predictions around 70% made by our model, the corresponding observed outcomes should be Elite about 70% of the time).

First, make predictions on the test set and join them to the corresponding true outcomes. Then, use the *calibration_curve* function in scikit-learn to plot a calibration curve. What do you see?

There is some helpful code for creating calibration plots at the link below:
https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html#sphx-glr-auto-examples-calibration-plot-calibration-curve-py
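To make the idea concrete, the sketch below computes the points of a calibration curve by hand: predictions are grouped into equal-width probability bins, and each bin's mean predicted probability is compared with the observed fraction of positives. `binned_calibration` is a simplified, hypothetical helper for illustration, not scikit-learn's implementation, and the probabilities and labels are made up.

```python
import numpy as np

def binned_calibration(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed positive rate) per non-empty bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0 .. n_bins - 1
    bin_ids = np.digitize(y_prob, edges[1:-1])
    mean_pred, frac_pos = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():  # skip bins that received no predictions
            mean_pred.append(y_prob[mask].mean())
            frac_pos.append(y_true[mask].mean())
    return np.array(mean_pred), np.array(frac_pos)

# Made-up predictions: four at 10% (none positive), ten at 70% (seven positive)
y_prob = np.array([0.1] * 4 + [0.7] * 10)
y_true = np.array([0] * 4 + [1] * 7 + [0] * 3)

mean_pred, frac_pos = binned_calibration(y_true, y_prob)
print(mean_pred, frac_pos)
```

For a well-calibrated model the two arrays are close to each other, so the plotted points lie near the diagonal.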

In [19]:
model = LogisticRegression()

In [28]:
model = LogisticRegression()
model.fit(X_train, y_train)

# Predicted probability of the positive class (elite) for each training row
pred_df = pd.DataFrame(model.predict_proba(X_train)[:, 1], columns=["prob"])
# Attach the rating and the true label, aligned by position
pred_df["Overall"] = X_train["Overall"].reset_index(drop=True)
pred_df["elite"] = y_train.reset_index(drop=True)
display(pred_df)



Unnamed: 0,prob,Overall,elite
0,0.000792,74,False
1,0.002273,56,False
2,0.992412,76,True
3,0.001061,69,False
4,0.000944,71,False
...,...,...,...
12892,0.992841,75,True
12893,0.001509,63,False
12894,0.989243,82,True
12895,0.991475,78,True


In [29]:
y_pred = model.predict(X_test)

confusion_mat = confusion_matrix(y_true=y_test, y_pred=y_pred)

# We could put this in dataframe with row/column
# labels to make it easier to read
confusion_df = pd.DataFrame(confusion_mat,
                            columns=['predicted_0', 'predicted_1'],
                            index=['actual_0', 'actual_1'])
confusion_df

Unnamed: 0,predicted_0,predicted_1
actual_0,2843,0
actual_1,0,382
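The counts in the confusion matrix can be turned into summary rates with a little arithmetic. The sketch below copies the four counts from the table above and computes accuracy, precision, and recall. The variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix above
tn, fp = 2843, 0   # actual_0 row: predicted_0, predicted_1
fn, tp = 0, 382    # actual_1 row: predicted_0, predicted_1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted elite players that are elite
recall = tp / (tp + fn)        # share of elite players that were found
print(total, accuracy, precision, recall)
```

Perfect scores on held-out data are usually a cue to check for leakage. Here the two features kept by `SelectKBest`, shown in the earlier output, are `Overall` and `elite`, the label itself and the column it was derived from.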


In [0]:
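# answer goes here
# calibration_curve expects the true labels and the predicted probability of the positive class
y_prob = model.predict_proba(X_test)[:, 1]
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=5, strategy='uniform')
plt.plot(prob_pred, prob_true, marker='o')  # model calibration
plt.plot([0, 1], [0, 1], linestyle='--')    # perfect calibration reference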





We see that the lower predicted probabilities tend to be well calibrated: when the model predicts a 20% likelihood of a player being elite, for example, we tend to see about 20% in reality, which is a good sign. However, the calibration falters quite a bit for the more confident predictions; weaker calibration at the extremes is fairly common for probabilistic models, although not always to this extent.