# Data Audit Report

This is the first part of our Competition 2 work, in which we perform our preprocessing steps on the data. More details can be found in our [README.MD](README.md) file.

### Importing Our Required Packages

In [1]:
#importing required libraries and packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import backend as bk
from sklearn import preprocessing
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from pandas import ExcelWriter

#set plot style to 'ggplot' and don't limit the number of columns shown when displaying DataFrames
plt.style.use('ggplot')
pd.options.display.max_columns = None

### Read the Data In

In [2]:
#importing our data
df = pd.read_excel('Data/Comp2_Raw_Data.xls')
df.head(1)

FileNotFoundError: [Errno 2] No such file or directory: 'Data/Comp2_Raw_Data.xls'

### Renaming and Dropping Columns

To make our dataset cleaner, we renamed our columns. We also dropped the ID column because it adds no value to our model.

In [None]:
df = df.drop(columns = ['ID'])
df.columns = ['Credit_Limit', 'Gender', 'Education', 'Marriage',  'Age', 'Pay_Sept', 'Pay_Aug', 'Pay_Jul', 'Pay_Jun', 'Pay_May', 'Pay_Apr', 
             'Bill_Amt_Sept', 'Bill_Amt_Aug', 'Bill_Amt_Jul', 'Bill_Amt_Jun', 'Bill_Amt_May', 'Bill_Amt_Apr','Pay_Amt_Sept', 'Pay_Amt_Aug',
             'Pay_Amt_Jul', 'Pay_Amt_Jun', 'Pay_Amt_May', 'Pay_Amt_Apr', 'Default']
df.head(5)

### Creating A Target DF and a Feature DF 

We separated our data into a `df_target` dataframe, which holds our target variable. This way we don't accidentally scale or transform it, or include it as a feature in our feature selection/reduction below.

In [None]:
#copy our target variable to its own df
df_target = df[['Default']].copy()
#change the data type to categorical
df_target['Default'] = pd.Categorical(df_target.Default)
#drop Default from our feature df
df = df.drop(['Default'], axis=1)
df.head(5)

In [None]:
#checking that our data was transferred properly
df_target.head(5)

### Changing Our DataTypes

We want our fields to be floats rather than integers, so we convert them here.

In [None]:
#change column datatypes to float
for col in df:
    df[col]=pd.to_numeric(df[col], errors='coerce', downcast='float')
df.dtypes
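
As a small illustration of the conversion above, the sketch below runs `pd.to_numeric` on a made-up column (the values are ours, not from the competition data). It shows that `errors='coerce'` replaces anything unparseable with `NaN` instead of raising an error.

```python
import pandas as pd

# Made-up column containing one value that cannot be parsed as a number.
raw = pd.Series(["1.0", "2", "oops"])

# errors="coerce" turns unparseable entries into NaN instead of raising;
# downcast="float" asks pandas for the smallest float dtype that fits.
converted = pd.to_numeric(raw, errors="coerce", downcast="float")

print(converted.tolist())
print(converted.dtype)
```

Because coerced values become `NaN` silently, a missing-value check after the conversion is a sensible follow-up.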

### EDA On Our Data

Here we check for missing values and begin our preprocessing steps to transform and scale our data.

In [None]:
df.isna().sum()

Since our data has no missing values, we can move on without worrying about imputation.

Here we want to visualize our data using histograms.

In [None]:
df.hist(figsize=[15,15])

We can see that some fields appear skewed. How much is hard to judge visually, so we compute a numerical skew value to make it clearer.

In [None]:
#check the skew of the data numerically
df.skew()

We also want to compute some descriptive statistics on our data.


In [None]:
df.describe()

We want to scale and transform our continuous fields. We copy these to a new dataframe so we don't affect our categorical variables.

In [None]:
#list our continuous fields
columns = ['Bill_Amt_Apr', 'Bill_Amt_May', 'Bill_Amt_Jun', 'Bill_Amt_Jul', 'Bill_Amt_Aug', 'Bill_Amt_Sept', 
           'Pay_Amt_Apr', 'Pay_Amt_May', 'Pay_Amt_Jun', 'Pay_Amt_Jul', 'Pay_Amt_Aug', 'Pay_Amt_Sept', 'Credit_Limit']
#copy our continuous fields to their own df
df_cont = df[columns].copy()

In [None]:
df_cont.describe()

The first thing we do is handle outliers in all of our continuous columns. We cap every value so that it lies within 3 standard deviations of the column mean.

In [None]:
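#clip each continuous column to within 3 standard deviations of its mean
for col in df_cont.columns:
    u_bound = df_cont[col].mean() + 3 * df_cont[col].std()
    l_bound = df_cont[col].mean() - 3 * df_cont[col].std()
    #assign the clipped column back rather than using chained indexing, which may not modify df_cont
    df_cont[col] = df_cont[col].clip(lower=l_bound, upper=u_bound)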
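
To make the outlier step concrete, here is a minimal sketch of capping at 3 standard deviations on a made-up column (the numbers are invented for illustration and are not from our dataset). It uses `Series.clip`, which is one way to apply both bounds in a single call.

```python
import pandas as pd

# Made-up column: 19 ordinary values plus one extreme value.
amounts = pd.Series([100.0] * 19 + [10_000.0])

# Bounds are computed from the unclipped data.
mean, std = amounts.mean(), amounts.std()
upper = mean + 3 * std
lower = mean - 3 * std

# Values outside [lower, upper] are moved to the nearest bound.
clipped = amounts.clip(lower=lower, upper=upper)

print(amounts.max(), "->", clipped.max())
```

Only the extreme value moves; the ordinary values are left untouched.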

Here we normalize our data to reduce the skewness. We use scikit-learn's `normalize` function, which with `norm='l2'` rescales each row to unit length.

In [None]:
df_cont = pd.DataFrame(preprocessing.normalize(df_cont,norm='l2'),columns = df_cont.columns)
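
It helps to see what L2 normalization does to individual rows. By default, scikit-learn's `normalize` works row by row (one customer at a time), not column by column. The sketch below reproduces that row-wise behaviour with plain NumPy on a made-up array, assuming no row is all zeros.

```python
import numpy as np

# Made-up data: rows are samples, columns are features.
X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [1.0, 0.0]])

# Divide every row by its own Euclidean (L2) length.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / row_norms

print(X_unit)
```

Note that the first two rows end up identical even though the second is twice as large: row-wise normalization keeps the proportions within a row but discards its overall magnitude.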

Here we use `MinMaxScaler` to scale all of our data so it is between `[0,1]`.

In [None]:
min_max_scaler = preprocessing.MinMaxScaler()
df_cont = pd.DataFrame(min_max_scaler.fit_transform(df_cont),columns = df_cont.columns)
df_cont.describe()
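
For reference, with its default range of `[0,1]`, `MinMaxScaler` applies `(x - min) / (max - min)` to each column separately. The sketch below does the same arithmetic by hand in pandas on a made-up frame (the column names are borrowed from our data, the values are invented).

```python
import pandas as pd

# Made-up values for two columns.
toy = pd.DataFrame({
    "Bill_Amt": [0.0, 50.0, 200.0],
    "Credit_Limit": [10.0, 20.0, 30.0],
})

# Column-wise min-max scaling: each column's min maps to 0 and max to 1.
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```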

We check our skew again and see that everything looks reasonable except our Pay_Amt columns, which need more work.

In [None]:
df_cont.skew()

Because the Pay_Amt columns are still heavily right-skewed, we apply a log transform to them. After min-max scaling the minimum value is 0, and the log of zero is undefined, so we add a small constant (.001) to the data first.

In [None]:
pay_cols = ['Pay_Amt_Apr', 'Pay_Amt_May', 'Pay_Amt_Jun', 'Pay_Amt_Jul', 'Pay_Amt_Aug', 'Pay_Amt_Sept']
for col in pay_cols:
    df_cont[col] = np.log(df_cont[col] + .001)
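
The effect of the shifted log transform can be checked on a small made-up series that, like our scaled Pay_Amt columns, lies in `[0,1]` with most values near zero. This is only a sketch on invented numbers.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values in [0, 1], including an exact zero.
scaled = pd.Series([0.0, 0.001, 0.002, 0.005, 0.01, 0.05, 1.0])

# Adding a small constant keeps the zero entry finite after the log.
logged = np.log(scaled + 0.001)

skew_before = scaled.skew()
skew_after = logged.skew()
print(skew_before, skew_after)
```

On this toy series the skew shrinks but does not vanish, so it is worth rechecking `skew()` after the transform rather than assuming the problem is solved.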

In [None]:
df_cont.describe()

In [None]:
df_cont.skew()

The skew seems to be fixed! We can add the transformed columns back to our original dataframe.

In [None]:
df_cont.reset_index(drop=True, inplace=True)
for col in df_cont:
    df[col] = df_cont[col]
df.head(5)

### Feature Selection Phase

Here we look to reduce the dimensionality of our data. We use RFE and correlation analysis to select features. We run both to see whether they recommend similar variables, which tells us which ones are actually strong.

In [None]:
#here we use RFE to select 8 features. Pass in the two dataframes as well as how many features you want it to select
rfe_cols = bk.rfe_select(df, df_target, 8)
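
The `bk.rfe_select` helper lives in our backend module and is not shown in this notebook. As a rough idea of what such a helper could look like, here is a sketch built on scikit-learn's `RFE` with a logistic regression estimator. The function name `rfe_select_sketch`, the choice of estimator, and the synthetic data are all assumptions made for illustration, not the backend's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression


def rfe_select_sketch(features, target, n_features):
    """Hypothetical stand-in for bk.rfe_select: return the kept column names."""
    estimator = LogisticRegression(max_iter=1000)
    selector = RFE(estimator, n_features_to_select=n_features)
    # Flatten the target so a one-column DataFrame is also accepted.
    selector.fit(features, np.asarray(target).ravel())
    return list(features.columns[selector.support_])


# Synthetic data: only columns "a" and "b" drive the label.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
y = pd.DataFrame({"Default": (X["a"] + X["b"] > 0).astype(int)})

selected = rfe_select_sketch(X, y, 2)
print(selected)
```

RFE repeatedly fits the estimator and drops the weakest feature until the requested number remains, so the result depends on the estimator used.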

In [None]:
#make a df from that list of values
df_RFE = df[rfe_cols].copy()
df_RFE.head(1)

Here we try correlation analysis.

In [None]:
#here we compute the standard correlation between each feature and the target. To sort the values, this method takes their magnitude, so the +/- sign of each correlation is lost.
#if you want to see the raw values, set "True" to "False"
bk.correlate(df, df_target, 8, "True")
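
Since `bk.correlate` is defined in our backend module and not shown here, the sketch below illustrates the general idea of ranking features by the absolute value of their correlation with the target. The function name `top_correlated`, the toy data, and the assumption that the target is a numeric Series are ours, not taken from the backend.

```python
import pandas as pd


def top_correlated(features, target, n):
    """Hypothetical helper: the n features with the largest |correlation|."""
    corr = features.corrwith(target)
    # Taking abs() lets us sort by strength, at the cost of the sign.
    return corr.abs().sort_values(ascending=False).head(n)


# Made-up features and a numeric 0/1 target.
toy = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [6, 4, 4, 1, 1],
    "x3": [2, 2, 3, 2, 3],
})
y = pd.Series([0, 0, 1, 1, 1])

raw = toy.corrwith(y)
top = top_correlated(toy, y, 2)
print(raw)
print(top)
```

Here `x2` is negatively correlated with the target but still ranks second once the sign is dropped, which is why the raw values are worth inspecting as well.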

In [None]:
df_corr = df[['Pay_Apr','Pay_May', 'Pay_Jun', 'Pay_Jul', 'Pay_Aug', 'Pay_Sept', 'Credit_Limit', 'Bill_Amt_Apr']].copy()
df_corr.head(1)

Lastly, we try PCA.

In [None]:
#we chose to run PCA on the variables selected by our correlation analysis because it gave us the best results
pca = PCA(n_components=4)
principalComponents1 = pca.fit_transform(df_corr)
principalDf = pd.DataFrame(data=principalComponents1,
                           columns=['PC1', 'PC2', 'PC3', 'PC4'])
#total share of variance explained by the 4 components
sum(pca.explained_variance_ratio_)

In [None]:
principalDf.head()

### First Iteration of Testing Our Feature Selection

We want to test the features that we have selected so far and see how the models perform.

First we run a logistic regression model on all of our features.

In [None]:
df.head()
#Default sometimes reappears in df when cells are rerun, so drop it again; errors='ignore' avoids a KeyError if it is already gone
df = df.drop(columns=['Default'], errors='ignore')

In [None]:
bk.make_model(df,df_target)

Next we make a model from our Correlation results

In [None]:
bk.make_model(df_corr,df_target)

Here we try the features that our RFE selection gave us

In [None]:
#makes a df with the features we want, then runs the regression model on it
bk.make_model(df_RFE,df_target)

Lastly, we run the model using our PCA components.

In [None]:
#runs the regression model on the principal components
bk.make_model(principalDf,df_target)

## It appears that our correlation model is the best performing right now. We will continue to modify this as we go forward and try to tweak our model.

# This is It for Part 1. Please refer to the [README.MD](README.md) for our next steps and plans. 

In [None]:
df.head()

In [None]:
df_target.head()

In [None]:
df.shape

In [None]:
df_target.shape

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn import metrics
#hold out 20% of the data for testing
X_train, x_test, y_train, y_test = train_test_split(df,df_target, test_size= 0.20, random_state=2019)
#rejoin the training features and target so rows can be resampled together
oversample = pd.concat([X_train,y_train],axis=1)
max_size = oversample['Default'].value_counts().max()
lst = [oversample]

#for each class, draw extra rows with replacement until it matches the largest class
for class_index, group in oversample.groupby('Default'):
    lst.append(group.sample(max_size-len(group), replace=True))
X_train = pd.concat(lst)
#split the balanced training set back into target and features
y_train = X_train['Default'].copy()
del X_train['Default']
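
The cell above balances the training classes by random oversampling with replacement. The sketch below runs the same idea on a tiny made-up frame so the resulting class counts are easy to verify. The `random_state` argument is added here only to make the toy example repeatable.

```python
import pandas as pd

# Made-up training data: 8 rows of class 0 and 2 rows of class 1.
train = pd.DataFrame({"feature": range(10), "Default": [0] * 8 + [1] * 2})

max_size = train["Default"].value_counts().max()
parts = [train]

# Top up each class with resampled rows until it matches the largest class.
for _, group in train.groupby("Default"):
    extra = group.sample(max_size - len(group), replace=True, random_state=0)
    parts.append(extra)

balanced = pd.concat(parts)
counts = balanced["Default"].value_counts()
print(counts)
```

Oversampling only the training split keeps duplicated rows out of the test set. Any cross-validation run on the oversampled training data can still place copies of the same row in different folds, so those scores may be optimistic.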

In [None]:
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
X = X_train
Y = y_train

In [None]:
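#shuffle=True is needed for random_state to have an effect in KFold
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
cart = DecisionTreeClassifier()
num_trees = 100
#pass the base tree positionally, since the keyword name differs between scikit-learn versions
model = BaggingClassifier(cart, n_estimators=num_trees, random_state=7)
results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold)
print(results.mean())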

In [None]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np


In [None]:
#'reg:linear' was renamed to 'reg:squarederror' in newer XGBoost releases
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 100)

In [None]:
xg_reg.fit(X,Y)

preds = xg_reg.predict(x_test)

In [None]:
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))