# Feature Engineering

What are features?
- variables we use to help predict our target. 
- not our target variable
- not all of the independent variables we start with. 
- the independent variables we END with, the ones we use in modeling. 

Why would we choose some variables and not others? We might leave a variable out because:
- it doesn't influence the target
- it may cause the model to overfit
- it has too many missing values
- it depends on (carries the same information as) another attribute
- it is categorical with too many unique values to encode
- it is too computationally expensive
- it holds information that could lead to discrimination or unethical decisions

Why would we create new features?
- two variables depend on each other, so blend them into one
- a categorical variable has too many values, so bin it into fewer categories
- a continuous variable has a lot of noise, so bin or smooth it
- a calculation on 2 variables is more informative than either alone, like length x width
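
A minimal sketch of two of the ideas above: combining two variables and binning a categorical one. The dataframe and its column names (`length`, `width`, `city`) are made up for illustration and are not part of the student dataset.

```python
import pandas as pd

# Hypothetical toy data; none of these columns come from the student dataset.
toy = pd.DataFrame({
    "length": [2.0, 3.0, 4.0, 5.0],
    "width": [1.0, 2.0, 2.0, 3.0],
    "city": ["Austin", "Dallas", "Waco", "Austin"],
})

# Calculation of 2 variables: area = length x width.
toy["area"] = toy["length"] * toy["width"]

# Binning a categorical: keep values seen at least twice,
# lump everything else into a single "other" category.
counts = toy["city"].value_counts()
frequent = counts[counts >= 2].index
toy["city_binned"] = toy["city"].where(toy["city"].isin(frequent), "other")

print(toy)
```

The threshold of 2 is arbitrary here; on real data it would depend on how many rows each category needs to be useful.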

Why do we try to limit the number of features? 
- curse of dimensionality
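
One way to see the curse of dimensionality is that distances between random points start to look alike as the number of features grows, so "near" and "far" lose meaning. The sketch below is an illustration on uniform random data, not a proof; the helper name `relative_contrast` and the sizes used are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_features, n_points=500):
    """Return (farthest - nearest) / nearest distance from one query point."""
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# With 2 features the nearest point is much closer than the farthest one.
contrast_low = relative_contrast(2)

# With 1000 features nearly all points sit at a similar distance.
contrast_high = relative_contrast(1000)

print(contrast_low, contrast_high)
```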


What is feature engineering?
- creating new features
- removing features
- selecting top features
- transforming features

Goal in feature engineering: 

I want to make it easy for the computer to see the patterns

Algorithmic feature selection methods: 

- Filter feature selection methods: rank the features by a statistic such as correlation with the target, and select the highest-ranked ones. A filter has no ability to check for things like confidential info. Because it scores each feature on its own, it won't pick up cases where the impact of 3 features together is strong but individually weak. It could also end up giving you 3 features that all carry the same information.

- Wrapper feature selection methods: create n different models on different sets of features, evaluate their performance, and keep the features used by the model that performed best. Computationally expensive.
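
The redundancy weakness of a filter method can be sketched on synthetic data. Everything below is made up for illustration: `a` and `a_copy` are two noisy copies of the same signal, and `noise` is unrelated to the target. Ranking by absolute correlation is used here as a simple stand-in for a filter.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
signal = rng.normal(size=n)

# Hypothetical features: two near-duplicates and one pure-noise column.
toy = pd.DataFrame({
    "a": signal + rng.normal(scale=0.1, size=n),
    "a_copy": signal + rng.normal(scale=0.1, size=n),
    "noise": rng.normal(size=n),
})
target = pd.Series(3 * signal + rng.normal(scale=0.5, size=n), name="target")

# Filter step: rank each feature by its absolute correlation with the target.
ranking = toy.corrwith(target).abs().sort_values(ascending=False)
top_two = ranking.index[:2].tolist()

# The filter never looks at how the features relate to each other.
redundancy = toy["a"].corr(toy["a_copy"])

print(ranking)
print(top_two, redundancy)
```

Both selected features carry essentially the same information, which is exactly what the filter cannot see.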


Importance of scaling:

If one variable has significantly larger units than another, it is going to have inflated importance. So, scale before running feature selection!

Must scale X's, do not scale y. 
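
As a rough sketch of what min-max scaling does, here it is by hand with NumPy. The numbers are hypothetical; the point is that the min and max are learned from the training rows only and then reused on any other rows.

```python
import numpy as np

# Hypothetical values: one feature in the thousands, one between 1 and 5.
X_train_toy = np.array([[1000.0, 1.0],
                        [3000.0, 5.0],
                        [2000.0, 3.0]])
X_new_toy = np.array([[2500.0, 2.0]])

# Learn the column-wise min and max from the training rows only.
col_min = X_train_toy.min(axis=0)
col_max = X_train_toy.max(axis=0)

def min_max(X):
    """Rescale each column to the 0-1 range seen in training."""
    return (X - col_min) / (col_max - col_min)

train_scaled_toy = min_max(X_train_toy)
new_scaled_toy = min_max(X_new_toy)

print(train_scaled_toy)
print(new_scaled_toy)
```

After scaling, both columns live on the same 0 to 1 range, so neither dominates just because of its units.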

**Features are the difference!**

## Math Student Grades 

http://archive.ics.uci.edu/ml/machine-learning-databases/00320/

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
- 1 school - student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
- 2 sex - student's sex (binary: "F" - female or "M" - male)
- 3 age - student's age (numeric: from 15 to 22)
- 4 address - student's home address type (binary: "U" - urban or "R" - rural)
- 5 famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
- 6 Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart)
- 7 Medu - mother's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
- 8 Fedu - father's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
- 9 Mjob - mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
- 10 Fjob - father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
- 11 reason - reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
- 12 guardian - student's guardian (nominal: "mother", "father" or "other")
- 13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
- 14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
- 15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
- 16 schoolsup - extra educational support (binary: yes or no)
- 17 famsup - family educational support (binary: yes or no)
- 18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- 19 activities - extra-curricular activities (binary: yes or no)
- 20 nursery - attended nursery school (binary: yes or no)
- 21 higher - wants to take higher education (binary: yes or no)
- 22 internet - Internet access at home (binary: yes or no)
- 23 romantic - with a romantic relationship (binary: yes or no)
- 24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- 25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
- 26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
- 27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- 28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- 29 health - current health status (numeric: from 1 - very bad to 5 - very good)
- 30 absences - number of school absences (numeric: from 0 to 93)

These grades are related with the course subject, Math:

- 31 G1 - first period grade (numeric: from 0 to 20)
- 32 G2 - second period grade (numeric: from 0 to 20)
- 33 G3 - final grade (numeric: from 0 to 20, output target)

We will aim to predict G3. 

## First Pass

- take care of nulls
- data errors
- data types
- dummy vars
- split
- scaling
- features (select kbest, recursive feature elimination)

In [104]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Wrangle

#### Acquire
Acquire from local drive. 
The source of the file: http://archive.ics.uci.edu/ml/machine-learning-databases/00320/. 

In [105]:
df = pd.read_csv("student/student-mat.csv", sep=";")

In [106]:
df.head()

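| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |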


#### Summarize

In [107]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
school        395 non-null object
sex           395 non-null object
age           395 non-null int64
address       395 non-null object
famsize       395 non-null object
Pstatus       395 non-null object
Medu          395 non-null int64
Fedu          395 non-null int64
Mjob          395 non-null object
Fjob          395 non-null object
reason        395 non-null object
guardian      395 non-null object
traveltime    395 non-null int64
studytime     395 non-null int64
failures      395 non-null int64
schoolsup     395 non-null object
famsup        395 non-null object
paid          395 non-null object
activities    395 non-null object
nursery       395 non-null object
higher        395 non-null object
internet      395 non-null object
romantic      395 non-null object
famrel        395 non-null int64
freetime      395 non-null int64
goout         395 non-null int64
Dalc          395 no

#### Nulls

No missing values

#### Data Errors, Outliers, Types

Do we need to correct any issues? 

**Numeric Columns**

In [108]:
df.describe()

| | age | Medu | Fedu | traveltime | studytime | failures | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 |
| mean | 16.696203 | 2.749367 | 2.521519 | 1.448101 | 2.035443 | 0.334177 | 3.944304 | 3.235443 | 3.108861 | 1.481013 | 2.291139 | 3.55443 | 5.708861 | 10.908861 | 10.713924 | 10.41519 |
| std | 1.276043 | 1.094735 | 1.088201 | 0.697505 | 0.83924 | 0.743651 | 0.896659 | 0.998862 | 1.113278 | 0.890741 | 1.287897 | 1.390303 | 8.003096 | 3.319195 | 3.761505 | 4.581443 |
| min | 15.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| 25% | 16.0 | 2.0 | 2.0 | 1.0 | 1.0 | 0.0 | 4.0 | 3.0 | 2.0 | 1.0 | 1.0 | 3.0 | 0.0 | 8.0 | 9.0 | 8.0 |
| 50% | 17.0 | 3.0 | 2.0 | 1.0 | 2.0 | 0.0 | 4.0 | 3.0 | 3.0 | 1.0 | 2.0 | 4.0 | 4.0 | 11.0 | 11.0 | 11.0 |
| 75% | 18.0 | 4.0 | 3.0 | 2.0 | 2.0 | 0.0 | 5.0 | 4.0 | 4.0 | 2.0 | 3.0 | 5.0 | 8.0 | 13.0 | 13.0 | 14.0 |
| max | 22.0 | 4.0 | 4.0 | 4.0 | 4.0 | 3.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 75.0 | 19.0 | 19.0 | 20.0 |


**Object Columns**

How many unique values in each column?
We need to answer this so that we know if creating dummy variables makes sense (or if it ends up creating way too many columns). 

1. Create a boolean mask of the columns indicating whether the datatype is object or not. 

In [109]:
# df.dtypes == 'object' returns a series. 
# convert this to an array
mask = np.array(df.dtypes == "object")
mask

array([ True,  True, False,  True,  True,  True, False, False,  True,
        True,  True,  True, False, False, False,  True,  True,  True,
        True,  True,  True,  True,  True, False, False, False, False,
       False, False, False, False, False, False])

2. filter the dataframe columns by using the mask

In [110]:
# using iloc with a boolean mask keeps only the column positions
# where mask is True (and filters out those where mask is False)

obj_df = df.iloc[:, mask]
obj_df.columns

Index(['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob',
       'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities',
       'nursery', 'higher', 'internet', 'romantic'],
      dtype='object')
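
As an alternative to building the mask by hand, pandas can select columns by dtype directly. A minimal sketch on a made-up three-column frame (the values are illustrative only); `exclude="number"` is used here as an assumption that every non-numeric column should be treated as categorical, which would also pick up boolean or datetime columns if any existed.

```python
import pandas as pd

# Hypothetical miniature of the student data: two text columns, one numeric.
toy = pd.DataFrame({
    "school": ["GP", "MS"],
    "age": [18, 17],
    "sex": ["F", "M"],
})

# Keep every column that is not numeric.
obj_toy = toy.select_dtypes(exclude="number")
obj_cols = obj_toy.columns.tolist()

print(obj_cols)
```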

3. loop through all the object columns and generate value counts of each unique value. 

In [111]:
# loop through each column name in the list of columns
# print the value_counts 

for col in obj_df.columns:
    print(obj_df[col].value_counts())
    print("\n")

GP    349
MS     46
Name: school, dtype: int64


F    208
M    187
Name: sex, dtype: int64


U    307
R     88
Name: address, dtype: int64


GT3    281
LE3    114
Name: famsize, dtype: int64


T    354
A     41
Name: Pstatus, dtype: int64


other       141
services    103
at_home      59
teacher      58
health       34
Name: Mjob, dtype: int64


other       217
services    111
teacher      29
at_home      20
health       18
Name: Fjob, dtype: int64


course        145
home          109
reputation    105
other          36
Name: reason, dtype: int64


mother    273
father     90
other      32
Name: guardian, dtype: int64


no     344
yes     51
Name: schoolsup, dtype: int64


yes    242
no     153
Name: famsup, dtype: int64


no     214
yes    181
Name: paid, dtype: int64


yes    201
no     194
Name: activities, dtype: int64


yes    314
no      81
Name: nursery, dtype: int64


yes    375
no      20
Name: higher, dtype: int64


yes    329
no      66
Name: internet, dtype: int64


no    

In [112]:
df.nunique()

school         2
sex            2
age            8
address        2
famsize        2
Pstatus        2
Medu           5
Fedu           5
Mjob           5
Fjob           5
reason         4
guardian       3
traveltime     4
studytime      4
failures       4
schoolsup      2
famsup         2
paid           2
activities     2
nursery        2
higher         2
internet       2
romantic       2
famrel         5
freetime       5
goout          5
Dalc           5
Walc           5
health         5
absences      34
G1            17
G2            17
G3            18
dtype: int64

#### Dummy Variables

In [113]:
# create df with new dummy vars
dummy_df = pd.get_dummies(obj_df, dummy_na=False, drop_first=True)

In [114]:
# concatenate the dataframe with dummies to our original dataframe
# via column (axis=1)
df = pd.concat([df, dummy_df], axis=1)

In [115]:
# drop object columns from df
df.drop(columns=obj_df.columns, inplace=True)
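
To see what `drop_first=True` does, here is a small sketch on a made-up frame. For each categorical column the first category in sorted order is dropped, so a column with n categories becomes n - 1 dummy columns; the dropped category is implied when all of its dummies are 0.

```python
import pandas as pd

# Hypothetical rows using two of the dataset's column names.
toy = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "Mjob": ["teacher", "health", "other"],
})

toy_dummies = pd.get_dummies(toy, drop_first=True)
dummy_cols = toy_dummies.columns.tolist()

# "sex_F" and "Mjob_health" are the dropped baseline categories.
print(dummy_cols)
```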

In [116]:
# df.info()

#### Split

Split data into train, validate, test

In [117]:
from sklearn.model_selection import train_test_split

train_validate, test = train_test_split(df, test_size=.2, 
                                        random_state=123)

train, validate = train_test_split(train_validate, 
                                   test_size=.3, random_state=123)
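
The two-step split holds out 20% for test and then 30% of the remainder for validate, which works out to roughly 56% train, 24% validate and 20% test. A quick sketch that checks the resulting sizes, using a stand-in frame with the same number of rows (395) instead of the real file:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with as many rows as student-mat.csv; the column is a placeholder.
toy = pd.DataFrame({"row_id": np.arange(395)})

train_validate_toy, test_toy = train_test_split(toy, test_size=.2,
                                                random_state=123)
train_toy, validate_toy = train_test_split(train_validate_toy,
                                           test_size=.3, random_state=123)

sizes = (len(train_toy), len(validate_toy), len(test_toy))
print(sizes)
```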

#### Split into X & y dataframes

- y = G3


In [118]:
# x df's are all cols except G3
X_train = train.drop(columns=['G3'])
X_validate = validate.drop(columns=['G3'])
X_test = test.drop(columns=['G3'])

# y df's are just G3
y_train = train[['G3']]
y_validate = validate[['G3']]
y_test = test[['G3']]

#### Explore

#### Scale

In [119]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(copy=True).fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_validate_scaled = scaler.transform(X_validate)
X_test_scaled = scaler.transform(X_test)



Create dataframes out of the scaled arrays that were generated by the scaler transform.

In [120]:
# rebuild each dataframe, reusing the column names and row index
# of the unscaled dataframe it came from
X_train_scaled = pd.DataFrame(X_train_scaled,
                              columns=X_train.columns,
                              index=X_train.index)

X_validate_scaled = pd.DataFrame(X_validate_scaled,
                                 columns=X_validate.columns,
                                 index=X_validate.index)

X_test_scaled = pd.DataFrame(X_test_scaled,
                             columns=X_test.columns,
                             index=X_test.index)

#### Feature Selection

1. SelectKBest
2. RFE: Recursive Feature Elimination

**SelectKBest**

- filter method  
- find and keep the attributes with the highest correlation to the target variable. 

How? 
1. the correlation between each attribute & the target is computed.  
2. each correlation is converted to an F-score and then a p-value.  
3. the k attributes with the highest scores are kept.  
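
The three steps above can be sketched by hand and compared with scikit-learn's `f_regression`. The data is synthetic and the variable names are made up for the example; the conversion used is F = r² / (1 - r²) × (n - 2), with the p-value taken from an F distribution with (1, n - 2) degrees of freedom.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(1)
n = 100

# Synthetic data: column 0 drives the target strongly, column 2 not at all.
X_toy = rng.normal(size=(n, 3))
y_toy = 2.0 * X_toy[:, 0] + 1.0 * X_toy[:, 1] + rng.normal(size=n)

# Step 1: Pearson correlation between each column and the target.
r = np.array([np.corrcoef(X_toy[:, j], y_toy)[0, 1]
              for j in range(X_toy.shape[1])])

# Step 2: convert each correlation to an F-score, then to a p-value.
f_manual = r**2 / (1 - r**2) * (n - 2)
p_manual = stats.f.sf(f_manual, 1, n - 2)

# The same scores from scikit-learn, for comparison.
f_sklearn, p_sklearn = f_regression(X_toy, y_toy)

# Step 3: keep the k columns with the highest F-scores.
k = 2
top_k = np.argsort(f_manual)[::-1][:k]

print(f_manual, f_sklearn)
print(top_k)
```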

In [121]:
from sklearn.feature_selection import SelectKBest, f_regression

Initialize the f_selector object, defining the scoring method. 

In [122]:
f_selector = SelectKBest(f_regression, k=13)

Fit the object to our X and y data (train!). 
This will score, rank and ID the top k features. 

In [123]:
f_selector = f_selector.fit(X_train_scaled, y_train.G3)

Transform our dataset to reduce it to the K best features. 

In [124]:
X_train_reduced = f_selector.transform(X_train_scaled)

print(X_train.shape)
print(X_train_reduced.shape)

(221, 41)
(221, 13)


In [125]:
# boolean mask with one entry per column: True if the column was kept
f_support = f_selector.get_support()
print(f_support)

[ True  True  True  True  True  True False False False False False False
 False  True  True False  True False False False False  True False False
 False False False False False False  True False  True False False False
 False False  True False False]


Create a dataframe with just the selected features. 

In [126]:
# using iloc, the df will filter out all the index locations 
# (columns number) where mask is false
# the : before the comma is for rows (so if we wanted to filter rows 
# we could say like 10:20), and after the comma is for columns. 

X_reduced_scaled = X_train_scaled.iloc[:,f_support]

This new dataframe is ready for modeling! 

X_reduced_scaled.head()

View the features selected: 

In [128]:
# columns to keep:
f_feature = X_train_scaled.iloc[:,f_support].columns.tolist()
f_feature

['age',
 'Medu',
 'Fedu',
 'traveltime',
 'studytime',
 'failures',
 'G1',
 'G2',
 'sex_M',
 'Mjob_other',
 'reason_reputation',
 'guardian_other',
 'higher_yes']

We could run through it again with a different k value, and select those best features. 
We can then run the different dataframes through models, and select the best model. 
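
A rough sketch of that loop, assuming a plain linear regression and validation RMSE as the yardstick (both are choices made for this example, not something fixed by the notes). Synthetic data from `make_regression` stands in for the student data so the block runs on its own; in the notebook you would use `X_train_scaled`, `y_train.G3` and the validate split instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 features, only 5 of them informative.
X_toy, y_toy = make_regression(n_samples=300, n_features=20, n_informative=5,
                               noise=10.0, random_state=123)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=.3,
                                            random_state=123)

results = {}
for k in (3, 5, 10, 20):
    # Fit the selector on train only, then apply it to both splits.
    selector = SelectKBest(f_regression, k=k).fit(X_tr, y_tr)
    model = LinearRegression().fit(selector.transform(X_tr), y_tr)
    preds = model.predict(selector.transform(X_val))
    results[k] = np.sqrt(mean_squared_error(y_val, preds))

# The k with the lowest validation RMSE.
best_k = min(results, key=results.get)
print(results, best_k)
```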

**Recursive Feature Elimination: RFE**

Wrapper method

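Recursively build model after model with fewer and fewer features. At each step the model is fit, the features are ranked by importance (for a linear model, the size of their coefficients), and the weakest feature is dropped. This repeats until only the number of features we asked for remains. Those are the features we will keep. Because the ranking relies on coefficient size, the features should be scaled first.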
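
To make the mechanics concrete, here is a hand-rolled sketch of how scikit-learn's `RFE` eliminates features one at a time with a linear model: fit, drop the feature with the smallest absolute coefficient, and repeat. The helper `manual_rfe` is hypothetical and the data is synthetic; it is meant as an illustration, not a replacement for the library class.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: 8 features, only 3 of them informative.
X_toy, y_toy = make_regression(n_samples=200, n_features=8, n_informative=3,
                               noise=5.0, random_state=123)

def manual_rfe(X, y, n_keep):
    """Drop the weakest feature (smallest |coefficient|) until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = LinearRegression().fit(X[:, remaining], y)
        weakest = int(np.argmin(np.abs(model.coef_)))
        remaining.pop(weakest)
    return sorted(remaining)

kept_manual = manual_rfe(X_toy, y_toy, n_keep=3)

# The same selection using the library class.
rfe_toy = RFE(LinearRegression(), n_features_to_select=3).fit(X_toy, y_toy)
kept_sklearn = np.flatnonzero(rfe_toy.support_).tolist()

print(kept_manual, kept_sklearn)
```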

In [129]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

Initialize the linear regression object

In [130]:
lm = LinearRegression()

Initialize the RFE object, setting the hyperparameters to be our linear model above (lm), and the number of features we want returned. 

In [144]:
# recent scikit-learn versions require the number of features as a keyword
rfe = RFE(lm, n_features_to_select=13)

In [145]:
X_rfe = rfe.fit_transform(X_train_scaled, y_train.G3)

Save the X_rfe for later, to feed into a model. 

In [147]:
mask = rfe.support_

In [149]:
X_reduced_scaled_rfe = X_train_scaled.iloc[:,mask]

In [150]:
# features selected using rfe
X_reduced_scaled_rfe.columns.tolist()

['age',
 'traveltime',
 'failures',
 'famrel',
 'absences',
 'G1',
 'G2',
 'Mjob_health',
 'Mjob_other',
 'Mjob_services',
 'schoolsup_yes',
 'famsup_yes',
 'internet_yes']

In [151]:
# features selected using selectkbest
X_reduced_scaled.columns.tolist()

['age',
 'Medu',
 'Fedu',
 'traveltime',
 'studytime',
 'failures',
 'G1',
 'G2',
 'sex_M',
 'Mjob_other',
 'reason_reputation',
 'guardian_other',
 'higher_yes']
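
To compare the two selections side by side, the lists printed above can be turned into sets. This sketch copies the feature names from the outputs above; the variable names are made up for the example.

```python
# Feature lists copied from the SelectKBest and RFE outputs above.
kbest_features = ['age', 'Medu', 'Fedu', 'traveltime', 'studytime',
                  'failures', 'G1', 'G2', 'sex_M', 'Mjob_other',
                  'reason_reputation', 'guardian_other', 'higher_yes']
rfe_features = ['age', 'traveltime', 'failures', 'famrel', 'absences',
                'G1', 'G2', 'Mjob_health', 'Mjob_other', 'Mjob_services',
                'schoolsup_yes', 'famsup_yes', 'internet_yes']

# Features both methods agree on, and those unique to each method.
shared = sorted(set(kbest_features) & set(rfe_features))
only_kbest = sorted(set(kbest_features) - set(rfe_features))
only_rfe = sorted(set(rfe_features) - set(kbest_features))

print("both:", shared)
print("SelectKBest only:", only_kbest)
print("RFE only:", only_rfe)
```

The two methods agree on 6 of the 13 features, including both earlier grades (G1 and G2), and differ on the other 7.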