## Day 23 Lecture 1 Assignment

In this assignment, we will explore feature selection and dimensionality reduction techniques. We will use both the FIFA ratings dataset and the Chicago traffic crashes dataset.

In [16]:
%reload_ext nb_black
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import (
    SelectKBest,
    f_classif,
    f_regression,
    mutual_info_regression,
)
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

from scipy import stats

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
crash_data = pd.read_csv(
    "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/traffic_crashes_chicago.csv"
)
soccer_data = pd.read_csv(
    "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/fifa_ratings.csv"
)

In [3]:
soccer_data.head()

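|   | ID | Name | Overall | Crossing | Finishing | HeadingAccuracy | ShortPassing | Volleys | Dribbling | Curve | ... | LongShots | Aggression | Interceptions | Positioning | Vision | Penalties | Composure | Marking | StandingTackle | SlidingTackle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 158023 | L. Messi | 94 | 84 | 95 | 70 | 90 | 86 | 97 | 93 | ... | 94 | 48 | 22 | 94 | 94 | 75 | 96 | 33 | 28 | 26 |
| 1 | 20801 | Cristiano Ronaldo | 94 | 84 | 94 | 89 | 81 | 87 | 88 | 81 | ... | 93 | 63 | 29 | 95 | 82 | 85 | 95 | 28 | 31 | 23 |
| 2 | 190871 | Neymar Jr | 92 | 79 | 87 | 62 | 84 | 84 | 96 | 88 | ... | 82 | 56 | 36 | 89 | 87 | 81 | 94 | 27 | 24 | 33 |
| 3 | 192985 | K. De Bruyne | 91 | 93 | 82 | 55 | 92 | 82 | 86 | 85 | ... | 91 | 76 | 61 | 87 | 94 | 79 | 88 | 68 | 58 | 51 |
| 4 | 183277 | E. Hazard | 91 | 81 | 84 | 61 | 89 | 80 | 95 | 83 | ... | 80 | 54 | 41 | 87 | 89 | 86 | 91 | 34 | 27 | 22 |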


We will begin with the Chicago traffic crashes dataset, focusing on removing columns with significant missing data.

Remove all columns with more than 5% missing data from the dataframe. (The *missingness summary* function we wrote a few exercises ago will speed this process up significantly.) Print out the columns that were removed, and the proportion of missing data for each column.

In [4]:
# answer goes here
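def missingness_summary(df, print_log=False, sort=None):
    """Return the proportion of missing values in each column of df.

    sort: "asc" or "desc" to order the result; None keeps column order.
    print_log: if True, also print the resulting Series.
    """
    missings = df.isna().mean()

    if sort == "asc":
        missings = missings.sort_values()
    elif sort == "desc":
        missings = missings.sort_values(ascending=False)

    if print_log:
        print(missings)

    return missings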




In [5]:
miss_report = missingness_summary(crash_data, sort="desc")

In [6]:
miss_report["WORKERS_PRESENT_I"]

0.9983520538937424

In [7]:
# Columns whose proportion of missing values exceeds 5%
dropped = miss_report[miss_report > 0.05]

# The task asks for the removed columns and their missing proportions
print(dropped)

crash = crash_data.drop(columns=dropped.index)

In [8]:
missingness_summary(crash, sort="desc")

REPORT_TYPE                      0.023012
MOST_SEVERE_INJURY               0.005795
INJURIES_UNKNOWN                 0.005776
INJURIES_NO_INDICATION           0.005776
INJURIES_REPORTED_NOT_EVIDENT    0.005776
INJURIES_NON_INCAPACITATING      0.005776
INJURIES_INCAPACITATING          0.005776
INJURIES_FATAL                   0.005776
INJURIES_TOTAL                   0.005776
NUM_UNITS                        0.003755
BEAT_OF_OCCURRENCE               0.000011
STREET_DIRECTION                 0.000005
STREET_NAME                      0.000003
POSTED_SPEED_LIMIT               0.000000
TRAFFIC_CONTROL_DEVICE           0.000000
CRASH_DATE                       0.000000
DEVICE_CONDITION                 0.000000
WEATHER_CONDITION                0.000000
TRAFFICWAY_TYPE                  0.000000
LIGHTING_CONDITION               0.000000
FIRST_CRASH_TYPE                 0.000000
DATE_POLICE_NOTIFIED             0.000000
ALIGNMENT                        0.000000
ROADWAY_SURFACE_COND             0

Next, we will shift our focus to the FIFA ratings dataset and explore univariate feature selection techniques. We will treat "Overall" as the response and the other ratings as features.

Using the correlations between the response and features, identify the 5 features with the greatest univariate correlation to the response.

In [9]:
# answer goes here
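# Keep numeric columns only (corr() fails on "Name" in recent pandas)
# and exclude "ID", which is an identifier rather than a rating
ratings = soccer_data.select_dtypes("number").drop(columns="ID")

# Absolute correlation with the response, excluding the response itself
ratings.corr()["Overall"].drop("Overall").abs().sort_values(ascending=False).head(5)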





Reactions       0.847739
Composure       0.801749
ShortPassing    0.722720
BallControl     0.717933
LongPassing     0.585104
Name: Overall, dtype: float64

Use sklearn's "SelectKBest" class to select the top 5 features using two different scoring functions: f_regression and mutual_info_regression. Print out the top 5 columns selected by each scoring function. How do they compare to the ones selected by univariate correlation?

In [None]:
soccer_data.info()

In [13]:
X = soccer_data.drop(columns=["Overall", "Name", "ID"])
y = soccer_data["Overall"]

In [17]:
# answer goes here

# f_regression
selector = SelectKBest(score_func=f_regression, k=5)

# Use `.fit()` method so the selector can 'learn' from our data
selector.fit(X, y)

# Use `.transform()` method so the selector can apply
# what it learned in `.fit()`
k_best = selector.transform(X)

# We can see/rank which features were the best
score_df = pd.DataFrame({"feature": X.columns, "f_score": selector.scores_})
score_df = score_df.sort_values("f_score", ascending=False)
print(score_df.head())




         feature       f_score
13     Reactions  41177.634074
25     Composure  29009.062753
3   ShortPassing  17626.722526
9    BallControl  17146.460079
8    LongPassing   8391.416302
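
The f_regression ranking matches the correlation ranking exactly, and that is expected: for a single feature, the F-statistic is a monotone function of the squared Pearson correlation, F = r² / (1 − r²) · (n − 2). A minimal sketch of that identity follows, assuming small synthetic arrays (`X_demo` and `y_demo` are made up for illustration and are not the FIFA ratings):

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
n = 500
X_demo = rng.normal(size=(n, 3))
y_demo = 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(size=n)

# F-statistics as computed by sklearn (features are centered by default)
f_scores, _ = f_regression(X_demo, y_demo)

# Pearson correlation of each feature with the response
r = np.array(
    [np.corrcoef(X_demo[:, j], y_demo)[0, 1] for j in range(X_demo.shape[1])]
)

# Rebuild the F-statistics from the correlations
f_from_r = r**2 / (1 - r**2) * (n - 2)

print(np.allclose(f_scores, f_from_r))
print(np.argsort(-f_scores), np.argsort(-np.abs(r)))
```

Because the mapping from |r| to F is strictly increasing, sorting by either score gives the same order, so the two methods will always pick the same top k features.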


In [19]:
# mutual_info_regression
# (scores vary slightly between runs: the estimator adds small random noise)
selector = SelectKBest(score_func=mutual_info_regression, k=5)

# Use `.fit()` method so the selector can 'learn' from our data
selector.fit(X, y)

# Use `.transform()` method so the selector can apply
# what it learned in `.fit()`
k_best = selector.transform(X)

# We can see/rank which features were the best
score_df = pd.DataFrame({"feature": X.columns, "mi_score": selector.scores_})
score_df = score_df.sort_values("mi_score", ascending=False)
print(score_df.head())

         feature  mi_score
13     Reactions  0.564317
25     Composure  0.441333
9    BallControl  0.416656
3   ShortPassing  0.351961
5      Dribbling  0.249920


array([[90, 97, 96, 95, 96],
       [81, 88, 94, 96, 95],
       [84, 96, 95, 94, 94],
       ...,
       [38, 45, 44, 47, 41],
       [42, 51, 52, 21, 46],
       [48, 43, 51, 51, 43]], dtype=int64)
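
To answer the comparison question, the three top-5 lists can be compared as sets. The sketch below simply copies the feature names from the printed outputs above; the variable names are hypothetical. With a separate fitted selector per scoring function, the same names could presumably be read off with `X.columns[selector.get_support()]` instead of being typed by hand.

```python
# Top-5 feature names copied from the printed outputs (not recomputed here)
corr_top5 = {"Reactions", "Composure", "ShortPassing", "BallControl", "LongPassing"}
f_top5 = {"Reactions", "Composure", "ShortPassing", "BallControl", "LongPassing"}
mi_top5 = {"Reactions", "Composure", "BallControl", "ShortPassing", "Dribbling"}

# Features chosen by both scoring functions
print("Selected by both:", sorted(f_top5 & mi_top5))

# Features unique to each scoring function
print("Only f_regression:", sorted(f_top5 - mi_top5))
print("Only mutual_info_regression:", sorted(mi_top5 - f_top5))

# f_regression versus plain univariate correlation
print("f_regression matches correlation:", f_top5 == corr_top5)
```

In this run, four features are shared by all three methods; mutual information swaps LongPassing for Dribbling, which may point to a relationship with "Overall" that is not purely linear.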

Shifting our focus from feature selection to dimensionality reduction, perform PCA on the ratings provided, excluding "Overall". Then, answer the following questions:

- What percentage of the total variance is captured by the first component? What about the first two, or the first three?
- Looking at the components themselves, how would you interpret the first two components in plain English?

In [None]:
# answer goes here
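
One possible approach, shown here as a minimal sketch rather than a worked answer: standardize the ratings, fit PCA, then inspect the cumulative explained variance and the component loadings. The frame `ratings_demo` below is synthetic (two hidden "skills" driving six columns whose names are borrowed from the dataset); with the real data it would be replaced by the `X` defined earlier.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ratings (values are made up for illustration)
rng = np.random.default_rng(0)
n = 300
attack = rng.normal(size=n)
defense = rng.normal(size=n)


def noisy(signal):
    """Return the signal plus a small amount of independent noise."""
    return signal + rng.normal(scale=0.3, size=n)


ratings_demo = pd.DataFrame(
    {
        "Finishing": noisy(attack),
        "Volleys": noisy(attack),
        "LongShots": noisy(attack),
        "Marking": noisy(defense),
        "StandingTackle": noisy(defense),
        "SlidingTackle": noisy(defense),
    }
)

# PCA is scale-sensitive, so standardize each column first
scaled = StandardScaler().fit_transform(ratings_demo)
pca = PCA().fit(scaled)

# Cumulative share of variance captured by the first 1, 2, and 3 components
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var[:3])

# Loadings: how strongly each original column contributes to PC1 and PC2
loadings = pd.DataFrame(
    pca.components_[:2].T, index=ratings_demo.columns, columns=["PC1", "PC2"]
)
print(loadings.round(2))
```

Reading the loadings column by column is what supports a plain-English interpretation: columns with large same-signed weights on a component move together along it, while opposite signs indicate a contrast between groups of ratings.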



