**Feature engineering**
- creating new features (variables); features are the variables we use to build a model
- removing features
- selecting the top features
- combining features
- transforming features
- calculating features, e.g. using length and width to calculate area


**Feature engineering is where you can fine-tune your model for accuracy. Trying different sets of features can lead to a model that fits the data better. It also leaves room for creativity.**
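
As a small illustration of the "calculating" and "removing" steps listed above, here is a minimal sketch on made-up data. The `rooms` frame and its column names are hypothetical and are not part of the student dataset used below.

```python
import pandas as pd

# Hypothetical toy data: the column names and values are made up for illustration.
rooms = pd.DataFrame({"length": [2.0, 4.0, 3.0], "width": [3.0, 5.0, 3.0]})

# Calculate a new feature from two existing ones.
rooms["area"] = rooms["length"] * rooms["width"]

# Remove the features that the new one replaces.
rooms_reduced = rooms.drop(columns=["length", "width"])
print(rooms_reduced)
```

Whether to keep `length` and `width` next to `area` is a modeling choice; dropping them here just shows the mechanics.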

In [54]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [55]:
df = pd.read_csv('student-mat.csv', sep = ';')

In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    

**No missing values**

In [57]:
df.describe()

,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
count,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0
mean,16.696203,2.749367,2.521519,1.448101,2.035443,0.334177,3.944304,3.235443,3.108861,1.481013,2.291139,3.55443,5.708861,10.908861,10.713924,10.41519
std,1.276043,1.094735,1.088201,0.697505,0.83924,0.743651,0.896659,0.998862,1.113278,0.890741,1.287897,1.390303,8.003096,3.319195,3.761505,4.581443
min,15.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,3.0,0.0,0.0
25%,16.0,2.0,2.0,1.0,1.0,0.0,4.0,3.0,2.0,1.0,1.0,3.0,0.0,8.0,9.0,8.0
50%,17.0,3.0,2.0,1.0,2.0,0.0,4.0,3.0,3.0,1.0,2.0,4.0,4.0,11.0,11.0,11.0
75%,18.0,4.0,3.0,2.0,2.0,0.0,5.0,4.0,4.0,2.0,3.0,5.0,8.0,13.0,13.0,14.0
max,22.0,4.0,4.0,4.0,4.0,3.0,5.0,5.0,5.0,5.0,5.0,5.0,75.0,19.0,19.0,20.0


In [58]:
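# boolean mask: True for the columns stored as object (string) dtype
mask = np.array(df.dtypes =='object')

In [59]:
# keep only the categorical (object) columns
object_df = df.iloc[:,mask]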

In [60]:
object_df

,school,sex,address,famsize,Pstatus,Mjob,Fjob,reason,guardian,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic
0,GP,F,U,GT3,A,at_home,teacher,course,mother,yes,no,no,no,yes,yes,no,no
1,GP,F,U,GT3,T,at_home,other,course,father,no,yes,no,no,no,yes,yes,no
2,GP,F,U,LE3,T,at_home,other,other,mother,yes,no,yes,no,yes,yes,yes,no
3,GP,F,U,GT3,T,health,services,home,mother,no,yes,yes,yes,yes,yes,yes,yes
4,GP,F,U,GT3,T,other,other,home,father,no,yes,yes,no,yes,yes,no,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,MS,M,U,LE3,A,services,services,course,other,no,yes,yes,no,yes,yes,no,no
391,MS,M,U,LE3,T,services,services,course,mother,no,no,no,no,no,yes,yes,no
392,MS,M,R,GT3,T,other,other,course,other,no,no,no,no,no,yes,no,no
393,MS,M,R,LE3,T,services,other,course,mother,no,no,no,no,no,yes,yes,no


In [61]:
for col in object_df.columns:
    print(object_df[col].value_counts())
    print('\n')

GP    349
MS     46
Name: school, dtype: int64


F    208
M    187
Name: sex, dtype: int64


U    307
R     88
Name: address, dtype: int64


GT3    281
LE3    114
Name: famsize, dtype: int64


T    354
A     41
Name: Pstatus, dtype: int64


other       141
services    103
at_home      59
teacher      58
health       34
Name: Mjob, dtype: int64


other       217
services    111
teacher      29
at_home      20
health       18
Name: Fjob, dtype: int64


course        145
home          109
reputation    105
other          36
Name: reason, dtype: int64


mother    273
father     90
other      32
Name: guardian, dtype: int64


no     344
yes     51
Name: schoolsup, dtype: int64


yes    242
no     153
Name: famsup, dtype: int64


no     214
yes    181
Name: paid, dtype: int64


yes    201
no     194
Name: activities, dtype: int64


yes    314
no      81
Name: nursery, dtype: int64


yes    375
no      20
Name: higher, dtype: int64


yes    329
no      66
Name: internet, dtype: int64


no    

In [62]:
dummy_df = pd.get_dummies(object_df, dummy_na = False, drop_first = True)

In [63]:
df = pd.concat([df, dummy_df], axis = 1)

In [64]:
df.drop(columns = object_df.columns, inplace = True)
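
The `drop_first = True` argument used above leaves out one level of each categorical column, so that level becomes the baseline and the remaining dummy columns are not perfectly collinear. A minimal sketch on a toy column (the `toy` frame is made up; it only reuses the three `guardian` levels seen in the value counts):

```python
import pandas as pd

# Toy column reusing the three guardian levels from the value counts above.
toy = pd.DataFrame({"guardian": ["mother", "father", "other", "mother"]})

all_levels = pd.get_dummies(toy)                 # one dummy column per level
dropped = pd.get_dummies(toy, drop_first=True)   # first level (alphabetically) is left out

print(all_levels.columns.tolist())
print(dropped.columns.tolist())
```

With `drop_first=True`, a row with zeros in every remaining `guardian_*` column stands for the dropped level (`father` in this toy case).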

In [65]:
## Let's split the data

from sklearn.model_selection import train_test_split

In [66]:
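# first hold out 20% of the rows as the test set,
# then split the remaining 80% into train (70%) and validate (30%)
train_validate, test = train_test_split(df, test_size = 0.2, random_state = 123)
train, validate = train_test_split(train_validate, test_size = 0.3, random_state = 123)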

In [67]:
train.shape, validate.shape, test.shape

((221, 42), (95, 42), (79, 42))
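
The two-step split above works out to roughly 56% train, 24% validate and 20% test. A quick sketch to check the row counts, using a plain array of 395 positions as a stand-in for the student data (an assumption made so the example runs without the CSV file):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(395)  # stand-in for the 395 rows of the student data

train_validate, test = train_test_split(rows, test_size=0.2, random_state=123)
train, validate = train_test_split(train_validate, test_size=0.3, random_state=123)

n = len(rows)
for name, part in [("train", train), ("validate", validate), ("test", test)]:
    print(f"{name}: {len(part)} rows ({len(part) / n:.0%})")
```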

In [68]:
# split into X and y
X_train = train.drop(columns = ['G3'])
X_validate = validate.drop(columns = ['G3'])
X_test = test.drop(columns = ['G3'])


y_train = train[['G3']]
y_validate = validate[['G3']]
y_test = test[['G3']]

In [69]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(copy = True)
# fit on the training set only, so validate and test do not influence the scaling
scaler = scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_validate_scaled = scaler.transform(X_validate)
X_test_scaled = scaler.transform(X_test)
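
With its default range of 0 to 1, min-max scaling maps each value to `(x - min) / (max - min)`, where `min` and `max` come from the data the scaler was fitted on. The sketch below redoes that arithmetic by hand on hypothetical ages (the numbers are made up) to show why validate or test values can land outside [0, 1]:

```python
import numpy as np

# Hypothetical ages; only the training values define the scale.
train_age = np.array([15.0, 16.0, 18.0, 21.0])
validate_age = np.array([17.0, 22.0])

lo, hi = train_age.min(), train_age.max()

def min_max(x):
    """Rescale x using the training minimum and maximum."""
    return (x - lo) / (hi - lo)

print(min_max(train_age))     # training values stay within [0, 1]
print(min_max(validate_age))  # a value above the training max scales above 1
```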

In [81]:
# rebuild DataFrames from the scaled arrays, keeping the original column names and row index
X_train_scaled = pd.DataFrame(X_train_scaled, columns = X_train.columns, index = X_train.index)

X_validate_scaled = pd.DataFrame(X_validate_scaled, columns = X_validate.columns, index = X_validate.index)

X_test_scaled = pd.DataFrame(X_test_scaled, columns = X_test.columns, index = X_test.index)

In [71]:
from sklearn.feature_selection import SelectKBest, f_regression

In [72]:
# Initialize the f_selector object, defining the scoring method

f_selector = SelectKBest(f_regression, k = 13)

In [73]:
# fit the object to our X and y data(train)

# this will score, rank, and ID our top K features

f_selector = f_selector.fit(X_train_scaled, y_train.G3)

In [74]:
# Transform our dataset to reduce to the K best features


X_train_reduced = f_selector.transform(X_train_scaled)

print(X_train.shape)
print(X_train_reduced.shape)

(221, 41)
(221, 13)
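
To make the scoring step less of a black box: for each feature taken on its own, `f_regression` is based on the Pearson correlation `r` between that feature and the target, turned into an F-statistic, `F = r**2 / (1 - r**2) * (n - 2)`. The sketch below reproduces that idea with NumPy on synthetic data (the data and the choice `k = 2` are assumptions for illustration, not the student dataset), then keeps the `k` highest-scoring features.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))                                  # five synthetic features
y = 3 * X[:, 0] + X[:, 2] + rng.normal(scale=0.5, size=n)    # only features 0 and 2 matter

# Pearson correlation of each feature with the target.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())

# F-statistic of a one-feature linear regression, with n - 2 degrees of freedom.
f_scores = r ** 2 / (1 - r ** 2) * (n - 2)

k = 2
top_k = np.sort(np.argsort(f_scores)[-k:])   # indices of the k highest scores
print(f_scores.round(1))
print(top_k)
```

Because each feature is scored in isolation, this kind of filter does not account for features that carry overlapping information.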


In [75]:
f_support = f_selector.get_support()
print(f_support)

[ True  True  True  True  True  True False False False False False False
 False  True  True False  True False False False False  True False False
 False False False False False False  True False  True False False False
 False False  True False False]


In [76]:
f_feature = X_train_scaled.iloc[:, f_support].columns.tolist()
f_feature

['age',
 'Medu',
 'Fedu',
 'traveltime',
 'studytime',
 'failures',
 'G1',
 'G2',
 'sex_M',
 'Mjob_other',
 'reason_reputation',
 'guardian_other',
 'higher_yes']

In [77]:
X_reduced_scaled = X_train_scaled.iloc[:, f_support]
X_reduced_scaled


## This data frame is now ready for model building

,age,Medu,Fedu,traveltime,studytime,failures,G1,G2,sex_M,Mjob_other,reason_reputation,guardian_other,higher_yes
142,0.000000,1.00,1.00,0.000000,0.666667,0.000000,0.357143,0.578947,0.0,0.0,0.0,0.0,1.0
326,0.333333,0.75,0.75,0.000000,0.000000,0.000000,0.714286,0.789474,1.0,1.0,1.0,0.0,1.0
88,0.166667,0.50,0.50,0.333333,0.333333,0.333333,0.500000,0.526316,1.0,0.0,1.0,0.0,1.0
118,0.333333,0.25,0.75,0.666667,0.333333,0.333333,0.357143,0.368421,1.0,1.0,0.0,0.0,1.0
312,0.666667,0.25,0.50,0.000000,0.333333,0.333333,0.642857,0.578947,1.0,1.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
229,0.333333,0.50,0.25,0.333333,0.666667,0.000000,0.571429,0.526316,0.0,1.0,0.0,0.0,1.0
61,0.166667,0.25,0.25,1.000000,0.000000,0.000000,0.428571,0.421053,0.0,0.0,0.0,0.0,1.0
38,0.000000,0.75,1.00,0.000000,0.666667,0.000000,0.571429,0.631579,0.0,0.0,0.0,0.0,1.0
243,0.166667,1.00,1.00,0.000000,0.000000,0.000000,0.642857,0.631579,1.0,0.0,0.0,0.0,1.0


**Recursive Feature Elimination (RFE)**
- wrapper method: fit a model on all the features, rank them by importance (for a linear model, the size of each coefficient), drop the weakest, and refit on what is left. This repeats until only the requested number of features remains. Those are the features we will keep
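
The elimination loop can be sketched by hand to show the idea. This is a simplified stand-in, not scikit-learn's implementation: it uses synthetic data, ordinary least squares from NumPy, and drops one feature per round based on the absolute size of its coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))                                    # five synthetic features
y = 4 * X[:, 1] - 3 * X[:, 3] + rng.normal(scale=0.1, size=200)  # only 1 and 3 matter

remaining = list(range(X.shape[1]))   # indices of the features still in play
n_keep = 2

while len(remaining) > n_keep:
    # Fit ordinary least squares (with an intercept) on the remaining features.
    A = np.column_stack([np.ones(len(y)), X[:, remaining]])
    coef = np.linalg.lstsq(A, y, rcond=None)[0][1:]   # skip the intercept
    # Drop the feature whose coefficient is smallest in absolute value.
    weakest = int(np.argmin(np.abs(coef)))
    remaining.pop(weakest)

print(remaining)
```

Comparing coefficients only makes sense when the features are on a similar scale, which is one reason to run this kind of selection on the scaled data.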

In [78]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [79]:
# initialize the linear regression object


lm = LinearRegression()

In [87]:
# Initialize the RFE object with the linear model above (lm) as the estimator
# and the number of features we want returned
# (n_features_to_select must be passed by keyword in recent scikit-learn versions)

rfe = RFE(lm, n_features_to_select = 13)

# let's fit and transform

X_rfe = rfe.fit_transform(X_train_scaled, y_train.G3)

In [93]:
# rfe.support_ outputs an array of booleans indicating if that column is selected or not

mask = rfe.support_

In [96]:
# we can then use the mask to select the columns selected and create a new dataframe that
# we can use for model building


X_reduced_scaled_rfe = X_train_scaled.iloc[:,mask]

In [98]:
# let's check the columns that were selected

X_reduced_scaled_rfe.columns.tolist()

['age',
 'traveltime',
 'failures',
 'famrel',
 'absences',
 'G1',
 'G2',
 'Mjob_health',
 'Mjob_other',
 'Mjob_services',
 'schoolsup_yes',
 'famsup_yes',
 'internet_yes']
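
The two methods do not pick the same 13 columns. A short sketch comparing the two lists printed above (the names are copied from the outputs; the variable names are just for illustration):

```python
# Feature names copied from the SelectKBest and RFE outputs above.
kbest_features = ['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures',
                  'G1', 'G2', 'sex_M', 'Mjob_other', 'reason_reputation',
                  'guardian_other', 'higher_yes']
rfe_features = ['age', 'traveltime', 'failures', 'famrel', 'absences', 'G1', 'G2',
                'Mjob_health', 'Mjob_other', 'Mjob_services', 'schoolsup_yes',
                'famsup_yes', 'internet_yes']

shared = [f for f in kbest_features if f in rfe_features]
only_kbest = [f for f in kbest_features if f not in rfe_features]
only_rfe = [f for f in rfe_features if f not in kbest_features]

print('both:', shared)
print('SelectKBest only:', only_kbest)
print('RFE only:', only_rfe)
```

Features chosen by both methods, such as `G1` and `G2`, are reasonable first candidates when building a model.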