# Feature Engineering


## Math Student Grades 

http://archive.ics.uci.edu/ml/machine-learning-databases/00320/

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
- 1 school - student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
- 2 sex - student's sex (binary: "F" - female or "M" - male)
- 3 age - student's age (numeric: from 15 to 22)
- 4 address - student's home address type (binary: "U" - urban or "R" - rural)
- 5 famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
- 6 Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart)
- 7 Medu - mother's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
- 8 Fedu - father's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
- 9 Mjob - mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
- 10 Fjob - father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
- 11 reason - reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
- 12 guardian - student's guardian (nominal: "mother", "father" or "other")
- 13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
- 14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
- 15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
- 16 schoolsup - extra educational support (binary: yes or no)
- 17 famsup - family educational support (binary: yes or no)
- 18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- 19 activities - extra-curricular activities (binary: yes or no)
- 20 nursery - attended nursery school (binary: yes or no)
- 21 higher - wants to take higher education (binary: yes or no)
- 22 internet - Internet access at home (binary: yes or no)
- 23 romantic - with a romantic relationship (binary: yes or no)
- 24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- 25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
- 26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
- 27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- 28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- 29 health - current health status (numeric: from 1 - very bad to 5 - very good)
- 30 absences - number of school absences (numeric: from 0 to 93)

These grades relate to the course subject, Math:

- 31 G1 - first period grade (numeric: from 0 to 20)
- 32 G2 - second period grade (numeric: from 0 to 20)
- 33 G3 - final grade (numeric: from 0 to 20, output target)

We will aim to predict G3. 

## First Pass

- take care of nulls
- data errors
- data types
- dummy vars
- split
- scaling
- features (SelectKBest, recursive feature elimination)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Wrangle

#### Acquire
Read the file from the local drive. 
It was downloaded from http://archive.ics.uci.edu/ml/machine-learning-databases/00320/

In [50]:
df = pd.read_csv("student/student-mat.csv", sep=";")

# We will predict G3. G1 and G2 are the earlier period grades for the
# same course, so drop them to keep them out of the feature set.

df.drop(columns=['G1','G2'], inplace=True)

#### Summarize
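A minimal sketch of what a summary step could look like. It uses a small hypothetical frame, `toy`, in place of `df` so that it runs on its own; the column names mirror the dataset, but the values are invented.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS", "MS"],
    "age": [15, 16, 17, 18],
    "absences": [0, 4, 10, 2],
    "G3": [10, 12, 8, 15],
})

# Number of rows and columns.
n_rows, n_cols = toy.shape

# Data type of each column.
dtypes = toy.dtypes

# Descriptive statistics; by default only numeric columns are included.
numeric_summary = toy.describe()

print(n_rows, n_cols)
print(dtypes)
print(numeric_summary)
```

The same three calls (`shape`, `dtypes`, `describe()`) apply unchanged to `df`.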

#### Nulls
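One way to check for nulls is to count them per column. The sketch below uses a hypothetical frame, `toy`, with missing values inserted on purpose; it makes no claim about whether the real dataset contains any.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberately missing values.
toy = pd.DataFrame({
    "age": [15, 16, np.nan, 18],
    "Mjob": ["teacher", None, "other", "health"],
    "G3": [10, 12, 8, 15],
})

# Count of nulls per column.
null_counts = toy.isnull().sum()

# Share of nulls per column (mean of a boolean column is its fraction of True).
null_share = toy.isnull().mean()

print(null_counts)
print(null_share)
```

Running `df.isnull().sum()` on the real data shows whether any column needs imputing or dropping before the later steps.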

#### Data Errors, Outliers, Types

Do we need to correct any issues? 

**Numeric Columns**

**Object Columns**

How many unique values in each column?
We need to answer this so that we know if creating dummy variables makes sense (or if it ends up creating way too many columns). 

1. Create a boolean mask of the columns indicating whether the datatype is object or not. 

In [51]:
# df.dtypes == 'object' returns a series. 
# convert this to an array
mask = np.array(df.dtypes == "object")

2. Filter the dataframe columns by using the mask.

In [52]:
# using iloc, the df will filter out all the index locations 
# (column numbers) where mask is False 

obj_df = df.iloc[:, mask]

3. Loop through all the object columns and generate value counts of each unique value. 

In [53]:
# loop through each column name in the list of columns
# print the value_counts 

#for col in obj_df.columns:
#    print(obj_df[col].value_counts())
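
The three steps above can be condensed when only the number of unique values per column is needed. This is a sketch on a hypothetical frame, `toy`; it selects non-numeric columns with `select_dtypes`, which for this frame gives the same columns as the object mask.

```python
import pandas as pd

# Hypothetical frame with two text columns and one numeric column.
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS", "MS"],
    "Mjob": ["teacher", "health", "other", "teacher"],
    "age": [15, 16, 17, 18],
})

# Keep only the non-numeric columns.
text_cols = toy.select_dtypes(exclude="number")

# Number of distinct values in each of those columns.
unique_counts = text_cols.nunique()
print(unique_counts)

# Value counts per column, as in step 3.
for col in text_cols.columns:
    print(text_cols[col].value_counts())
```

A column with only a handful of distinct values is a reasonable candidate for dummy variables; one with many distinct values would add many columns.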

#### Dummy Variables

In [54]:
# create df with new dummy vars
dummy_df = pd.get_dummies(obj_df, dummy_na=False, drop_first=True)

In [55]:
# concatenate the dataframe with dummies to our original dataframe
# via column (axis=1)
df = pd.concat([df, dummy_df], axis=1)

In [56]:
# drop object columns from df
df.drop(columns=obj_df.columns, inplace=True)
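
To see what `drop_first=True` does, here is the same encode, concatenate, and drop sequence on a hypothetical three-row frame, `toy`. The values are invented; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical frame: two text columns and one numeric column.
toy = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "Mjob": ["teacher", "health", "other"],
    "age": [15, 16, 17],
})

obj_cols = ["school", "Mjob"]

# One indicator column per category, minus the first category of each column.
dummies = pd.get_dummies(toy[obj_cols], dummy_na=False, drop_first=True)

# Replace the text columns with their indicator columns.
encoded = pd.concat([toy.drop(columns=obj_cols), dummies], axis=1)

print(list(encoded.columns))
# ['age', 'school_MS', 'Mjob_other', 'Mjob_teacher']
```

The dropped categories (`GP` and `health` here) become the baseline: a row with all of a column's indicators set to zero belongs to that baseline category.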

#### Split

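Split the data into train, validate, and test sets. With the sizes used here, 20% of the rows go to test and 30% of the remainder to validate, which gives roughly 56% train, 24% validate, and 20% test.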

In [57]:
from sklearn.model_selection import train_test_split
train_validate, test = train_test_split(df, test_size=.2, random_state=123)
train, validate = train_test_split(train_validate, test_size=.3, random_state=123)
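
A quick check of the proportions produced by the two-stage split. The frame `toy` is a hypothetical 1000-row stand-in chosen so the arithmetic is easy to follow (0.8 × 0.7 = 0.56 and 0.8 × 0.3 = 0.24).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 1000-row frame; only the row count matters here.
toy = pd.DataFrame({"x": np.arange(1000), "G3": np.arange(1000) % 21})

# Same two-stage split as above.
train_validate, test = train_test_split(toy, test_size=.2, random_state=123)
train, validate = train_test_split(train_validate, test_size=.3, random_state=123)

# Share of the original rows that ended up in each split.
shares = {
    "train": len(train) / len(toy),
    "validate": len(validate) / len(toy),
    "test": len(test) / len(toy),
}
print(shares)
```

Printing `train.shape`, `validate.shape`, and `test.shape` on the real data is a cheap way to confirm the split behaved as intended.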

#### Split into X & y dataframes

- y = G3


In [58]:
# x df's are all cols except G3
X_train = train.drop(columns=['G3'])
X_validate = validate.drop(columns=['G3'])
X_test = test.drop(columns=['G3'])

# y df's are just G3
y_train = train[['G3']]
y_validate = validate[['G3']]
y_test = test[['G3']]

#### Explore

#### Scale

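In [None]:
from sklearn.preprocessing import MinMaxScaler

# create the scaler and fit it on the training data only
scaler = MinMaxScaler()
scaler.fit(X_train)

# apply the same fitted scaler to all three splits
X_train_scaled = scaler.transform(X_train)
X_validate_scaled = scaler.transform(X_validate)
X_test_scaled = scaler.transform(X_test)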
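
The scaler is fit on the training split only, so the validate and test splits are scaled with the training minimum and maximum. A small sketch with hypothetical arrays shows one consequence: values outside the training range land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two features, e.g. age and absences.
X_train_toy = np.array([[15.0, 0.0], [18.0, 10.0], [22.0, 30.0]])
X_validate_toy = np.array([[16.0, 40.0]])   # 40 exceeds the training max of 30

# Learn the min and max of each column from the training data only.
scaler = MinMaxScaler()
scaler.fit(X_train_toy)

train_scaled = scaler.transform(X_train_toy)
validate_scaled = scaler.transform(X_validate_toy)

print(train_scaled)      # each column spans 0 to 1
print(validate_scaled)   # second value is above 1
```

Note that `transform` returns a NumPy array, so the column names are lost; wrapping the result in `pd.DataFrame(..., columns=X_train.columns, index=X_train.index)` is one way to keep them.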

#### Feature Selection

1. SelectKBest
2. RFE: Recursive Feature Elimination
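
A minimal sketch of both selectors on hypothetical data where the target depends only on columns `a` and `c`. The choices of `f_regression` as the scoring function, `LinearRegression` as the RFE estimator, and two features to keep are assumptions made for illustration, not settings taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

# Hypothetical data: five features, target built from "a" and "c" plus noise.
rng = np.random.default_rng(123)
X_toy = pd.DataFrame(rng.normal(size=(200, 5)), columns=["a", "b", "c", "d", "e"])
y_toy = 3 * X_toy["a"] - 2 * X_toy["c"] + rng.normal(scale=0.1, size=200)

# SelectKBest: score each feature against the target, keep the top k.
kbest = SelectKBest(score_func=f_regression, k=2)
kbest.fit(X_toy, y_toy)
kbest_features = X_toy.columns[kbest.get_support()].tolist()

# RFE: fit the model, drop the weakest feature, repeat until 2 remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X_toy, y_toy)
rfe_features = X_toy.columns[rfe.get_support()].tolist()

print(kbest_features)
print(rfe_features)
```

On the real data the selectors would be fit on the scaled training split only, with a one-dimensional target such as `y_train['G3']`.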