<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Hyperparameter Tuning

## *Data Science Unit 4 Sprint 2 Assignment 4*

## Your Mission, should you choose to accept it...

To hyperparameter tune and extract every ounce of accuracy out of this telecom customer churn dataset: <https://drive.google.com/file/d/1dfbAsM9DwA7tYhInyflIpZnYs7VT-0AQ/view> 

## Requirements

- Load the data
- Clean the data if necessary (it will be)
- Create and fit a baseline Keras MLP model to the data.
- Hyperparameter tune (at least) the following parameters:
 - batch_size
 - training epochs
 - optimizer
 - learning rate (if applicable to optimizer)
 - momentum (if applicable to optimizer)
 - activation functions
 - network weight initialization
 - dropout regularization
 - number of neurons in the hidden layer
 
 You must use Grid Search and Cross Validation for your initial pass of the above hyperparameters
 
 Try and get the maximum accuracy possible out of this data! You'll save big telecoms millions! Doesn't that sound great?


# Day 1

### Learning Objectives
- Describe the foundational components of a neural network
- Implement a Perceptron from scratch in Python

#### Input Layer:

The input layer is where the feature data from the dataframe enters the network.

#### Hidden Layer:

These are the layers that exist between the input layer and the output layer. You can have one hidden layer or many hidden layers.

#### Output Layer:

This is the answer/result of the neurons in our neural network. A neuron's outputs can be used as inputs for the next layer of neurons or be the final output(s) of the neural network.

#### Neuron:

The neuron receives inputs, multiplies the inputs by their weights, sums everything up, and then applies the activation function to the sum. It usually uses a continuous activation function.
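As a minimal sketch of the neuron described above (the input values, weights, and the `neuron` helper name are hypothetical, chosen only for illustration), the weighted sum and activation can be written in a few lines of NumPy:

```python
import numpy as np

def sigmoid(z):
    """Continuous activation that squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias=0.0):
    """Multiply inputs by weights, sum them (plus a bias), then activate."""
    weighted_sum = np.dot(inputs, weights) + bias
    return sigmoid(weighted_sum)

# Hypothetical example values, not taken from the tips dataset
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25, 0.0])

# Weighted sum is 0.5 - 0.5 + 0.0 = 0.0, and sigmoid(0.0) is 0.5
activation = neuron(x, w)
```

A positive bias would raise the weighted sum and therefore push the activation above 0.5.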

#### Weight:

This is the amount of positive or negative effect an input has on the final output.

#### Activation Function:

The activation function is how the neural network normalizes the results after inputs, weights, and biases have been applied within the neuron.

#### Node Map:

The node map shows how the features of the dataframe or the outputs of earlier neurons are further processed throughout the neural network. It visualizes inputs, outputs, and hidden layers at a high level.

#### Perceptron:

Simply put, a perceptron consists of four distinct parts:

    Inputs
    Weights
    Weighted Sum
    Activation Function (Output)

Unlike the general neuron described above, a perceptron uses a binary (step) activation function: it either activates or it does not.

Perceptrons most often classify data into two classes (0 or 1), which is why they are also known as linear binary classifiers.
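To make the four parts concrete, here is a small from-scratch sketch. It is not the notebook's own `NNet` class: the `Perceptron` class, the bias term, the learning rate, and the AND-gate data are all assumptions added for illustration. It uses a step activation and the classic perceptron update rule.

```python
import numpy as np

class Perceptron:
    """Linear binary classifier with a step activation (illustrative sketch)."""

    def __init__(self, n_inputs, learning_rate=1.0):
        self.weights = np.zeros(n_inputs)
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, X):
        # Weighted sum followed by a binary step: activate only above zero
        weighted_sum = np.dot(X, self.weights) + self.bias
        return np.where(weighted_sum > 0, 1, 0)

    def fit(self, X, y, epochs=20):
        for _ in range(epochs):
            for xi, target in zip(X, y):
                # Error is -1, 0, or 1; weights only move on a mistake
                error = target - self.predict(xi)
                self.weights += self.learning_rate * error * xi
                self.bias += self.learning_rate * error
        return self

# Logical AND is linearly separable, so the perceptron rule can learn it
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])

and_gate = Perceptron(n_inputs=2).fit(X_and, y_and)
predictions = and_gate.predict(X_and)
```

Because the activation is all-or-nothing, the update uses the raw error rather than a gradient, which is why this rule only works for linearly separable problems.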


#### Inputs -> Outputs
Explain the flow of information through a neural network from inputs to outputs. Be sure to include: inputs, weights, bias, and activation functions. How does it all flow from beginning to end?

The number of inputs and outputs depends on your network. Each input comes either from a neuron in the previous layer or from the initial feature values in the dataframe. Each input is multiplied by a weight, which can be negative or positive depending on whether that input should push the neuron toward or away from activating. The weighted inputs are summed together with a bias, which shifts the activation curve, and the activation function turns that sum into the neuron's output. That output is passed on as an input to the next layer until the output layer is reached.


### Imports

In [59]:
!pip install category-encoders



In [60]:
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV


from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.datasets import mnist
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

import category_encoders as ce

In [61]:
#Load Data
df = sns.load_dataset('tips')
df.describe()

| | total_bill | tip | size |
|---|---|---|---|
| count | 244.0 | 244.0 | 244.0 |
| mean | 19.785943 | 2.998279 | 2.569672 |
| std | 8.902412 | 1.383638 | 0.9511 |
| min | 3.07 | 1.0 | 1.0 |
| 25% | 13.3475 | 2.0 | 2.0 |
| 50% | 17.795 | 2.9 | 2.0 |
| 75% | 24.1275 | 3.5625 | 3.0 |
| max | 50.81 | 10.0 | 6.0 |


In [62]:
print(df.shape)
df.head()

(244, 7)


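| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |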


In [63]:
def prep(df, target):
    
    """
    This function will:
    1. Change "size" into a categorical to be one-hot encoded
    2. Sum total_bill and tip and put the result into 3 bins
    3. Add the tip as a fraction of the bill ("tip_pct")
    4. Split the data into train/test
    5. Create X and y train/test
    6. Map the target ('sex') to a binary column
    7. Process X train/test data by one-hot encoding the categoricals
    8. Return X_train, y_train, X_test, y_test
    """
    df = df.copy()  # work on a copy so the caller's dataframe is not modified
    df['size'] = df['size'].astype(str)
    df['bill_tip_sum'] = pd.qcut(df['total_bill']+df['tip'], 3, labels=['low', 'medium', 'high'])
    df['tip_pct'] = df['tip']/df['total_bill']
    
    training, testing = train_test_split(df, test_size=.2)
    
    X_train = training.drop(columns=target)
    y_train = training[target]
    X_test = testing.drop(columns=target)
    y_test = testing[target]
    
    processor = make_pipeline(
        ce.OneHotEncoder(use_cat_names=True),  
#         SimpleImputer(strategy='median'), # Use when a normalized dataframe is needed
#         StandardScaler() # Use when a normalized dataframe is needed
    )
    
    gender = {'Female': 0, 'Male': 1}
    y_train = y_train.map(gender)
    y_test = y_test.map(gender)
    
    X_process_train = processor.fit_transform(X_train)
    X_process_test = processor.transform(X_test)
    
    return X_process_train,y_train, X_process_test, y_test

In [64]:
X_train, y_train, X_test, y_test = prep(df, 'sex')
print(X_train.shape) 
print(X_test.shape) 
print(y_train.shape) 
print(y_test.shape)
X_train.head()

(195, 20)
(49, 20)
(195,)
(49,)


Unnamed: 0,total_bill,tip,smoker_Yes,smoker_No,day_Thur,day_Fri,day_Sat,day_Sun,time_Lunch,time_Dinner,size_2,size_3,size_4,size_1,size_6,size_5,bill_tip_sum_low,bill_tip_sum_medium,bill_tip_sum_high,tip_pct
57,26.41,1.5,0,1,0,0,1,0,0,1,1,0,0,0,0,0,0,0,1,0.056797
188,18.15,3.5,1,0,0,0,0,1,0,1,0,1,0,0,0,0,0,1,0,0.192837
154,19.77,2.0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0.101163
2,21.01,3.5,0,1,0,0,0,1,0,1,0,1,0,0,0,0,0,1,0,0.166587
206,26.59,3.41,1,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,1,0.128244


In [65]:
class NNet:
    def __init__(self):
        
        # Number of inputs must equal the number of features
        self.inputs = 20
        # Only one output node b/c only trying to predict one thing
        self.outputNodes = 1
        
        self.weights = np.random.rand(self.inputs, self.outputNodes)
     
    # Squishify
    def sigmoid(self, s):
        return 1 / (1+np.exp(-s))
    
    # Create 0 or 1 from predicted activated output
    def binary(self, X):
        binary = self.feed_forward(X)
        binary = [1 if x > .9999 else 0 for x in binary]
        return binary
    
     
    def feed_forward(self, X):
        """Calculate the NNet inference using the feed forward, aka predict """
        
        # Combining  inputs and weights in a weighted sum
        self.input_sum = np.dot(X, self.weights)
        
        # Apply activation function to the weighted sum
        self.output_activated = self.sigmoid(self.input_sum)
        
        return self.output_activated

In [66]:
nn = NNet()

In [67]:
y_pred1 = nn.binary(X_train)
score = accuracy_score(y_train, y_pred1)

y_pred2 = nn.binary(X_test)
score2 = accuracy_score(y_test, y_pred2)

print(f"Mean baseline for our target(Males) is {round(df['sex'].value_counts(normalize=True).max()*100, 2)}%")
print(f"The accuracy of the train is {round(score*100, 2)}%")
print(f"The accuracy of the test is {round(score2*100, 2)}%")

Mean baseline for our target(Males) is 64.34%
The accuracy of the train is 57.95%
The accuracy of the test is 69.39%


# Day 2

### Learning Objectives
- Explain the intuition behind backpropagation
- Implement gradient descent + backpropagation on a feedforward neural network

In [68]:
# I want activations that correspond to negative weights to be lower
# and activations that correspond to positive weights to be higher

class NNetbackprop:
    def __init__(self):
        # Set up Architecture of Neural Network
        self.inputs = 20
        self.hiddenNodes = 3
        self.outputNodes = 1

        # Initial Weights
        # 20x3 Matrix Array for the First Layer
        self.weights1 = np.random.rand(self.inputs, self.hiddenNodes)
       
        # 3x1 Matrix Array for Hidden to Output
        self.weights2 = np.random.rand(self.hiddenNodes, self.outputNodes)
        
    def sigmoid(self, s):
        return 1 / (1+np.exp(-s))
    
    def sigmoidPrime(self, s):
        # Derivative of the sigmoid, where s is an already-activated value, s = sigmoid(z)
        return s * (1 - s)
    
    def feed_forward(self, X):
        """
        Calculate the NN inference using feed forward.
        aka "predict"
        """
        
        # Weighted sum of inputs => hidden layer
        self.hidden_sum = np.dot(X, self.weights1)
        
        # Activations of weighted sum
        self.activated_hidden = self.sigmoid(self.hidden_sum)
        
        # Weight sum between hidden and output
        self.output_sum = np.dot(self.activated_hidden, self.weights2)
        
        # Final activation of output
        self.activated_output = self.sigmoid(self.output_sum)
        
        return self.activated_output
        
    def backward(self, X,y,o):
        """
        Backward propagate through the network
        """
        
        # Error in Output
        # Calculate the error, the difference between true y value and the predicted
        self.o_error = y - o
        
        # Apply Derivative of Sigmoid to error
        # How far off are we in relation to the Sigmoid f(x) of the output
        # ^- aka hidden => output
        # Which direction do we want to go 
        self.o_delta = self.o_error * self.sigmoidPrime(o)
        
        # z2 error
        # Applying the o_delta correction to weights2 transposed
        self.z2_error = self.o_delta.dot(self.weights2.T)
        
        # How much of that "far off" can be explained by the input => hidden
        # Apply sigmoid derivative to the error
        self.z2_delta = self.z2_error * self.sigmoidPrime(self.activated_hidden)
        
        # Adjustment to first set of weights (input => hidden)
        # Applying adjustments to the weights
        self.weights1 += X.T.dot(self.z2_delta)
        
        # Adjustment to second set of weights (hidden => output)
        # Applying adjustments to the weights
        self.weights2 += self.activated_hidden.T.dot(self.o_delta)
        

    def train(self, X, y):
        o = self.feed_forward(X)
        self.backward(X,y,o)
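The class above adds the full correction to the weights on every pass. A common variation, sketched here under stated assumptions, scales each correction by a learning rate so the steps stay small. The toy OR data, the `learning_rate` value, the 2-3-1 layer sizes, and the variable names are all hypothetical and are not part of the notebook's class.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Toy data (logical OR), used only to watch the loss move
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_toy = np.array([[0], [1], [1], [1]], dtype=float)

w1 = rng.random((2, 3))   # input => hidden
w2 = rng.random((3, 1))   # hidden => output
learning_rate = 0.1       # assumed value, would normally be tuned

losses = []
for _ in range(2000):
    # Feed forward
    hidden = sigmoid(X_toy @ w1)
    output = sigmoid(hidden @ w2)
    losses.append(float(np.mean((y_toy - output) ** 2)))

    # Backward pass: same deltas as the class above
    o_delta = (y_toy - output) * output * (1 - output)
    h_delta = (o_delta @ w2.T) * hidden * (1 - hidden)

    # Scaled weight updates
    w2 += learning_rate * hidden.T @ o_delta
    w1 += learning_rate * X_toy.T @ h_delta
```

Comparing `losses[0]` with `losses[-1]` shows whether training is making progress; a learning rate that is too large can make the loss oscillate instead of shrink.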

### Preprocessing

In [69]:
def y_input(y):
    "Changes our y into a list of individual arrays"
    y_list = []
    # Iterate over the argument rather than the global y_train
    for value in y:
        new = np.array([value])
        y_list.append(new)
    return y_list
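As an alternative sketch, the same column shape can be produced without a loop by reshaping. The `to_column` helper name is hypothetical and the labels below are made-up example values.

```python
import numpy as np

def to_column(y):
    """Turn a 1-D sequence of labels into an (n, 1) column array."""
    return np.asarray(y, dtype=float).reshape(-1, 1)

labels = [0, 1, 1, 0]        # illustrative labels only
column = to_column(labels)   # shape (4, 1), one row per observation
```

An `(n, 1)` array lines up with the `(n, 1)` network output, so `y - o` subtracts row by row instead of broadcasting into an `(n, n)` matrix.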

In [70]:
nnbp = NNetbackprop()

In [71]:
ytrain = y_input(y_train)
nnbp.train(X_train, ytrain)

### Backpropagation

In [83]:
# ---1st ERROR---
# Output error: the difference between the true y and the prediction (y - o)
# The sigmoid derivative is then applied to it to decide which direction to go:
# self.o_delta = self.o_error * self.sigmoidPrime(o)
# i.e. how much more sigmoid activation would have pushed us towards the right answer

nnbp.o_error

array([[-0.81432757],
       [-0.81391112],
       [ 0.18577946],
       [ 0.18570363],
       [ 0.18556909],
       [ 0.18918028],
       [-0.81454891],
       [ 0.18583457],
       [ 0.18591579],
       [ 0.18584719],
       [ 0.18554578],
       [ 0.18690983],
       [ 0.18629526],
       [ 0.18545895],
       [ 0.18930769],
       [-0.81387394],
       [ 0.18818647],
       [-0.8136505 ],
       [-0.81336792],
       [-0.80970152],
       [-0.81222417],
       [ 0.18726095],
       [-0.81377936],
       [ 0.18589344],
       [ 0.18729134],
       [ 0.18813667],
       [ 0.18606041],
       [-0.81192573],
       [-0.81449639],
       [ 0.18590369],
       [-0.81444791],
       [ 0.18727127],
       [ 0.19004425],
       [ 0.18585442],
       [-0.81371311],
       [ 0.18716468],
       [ 0.18582581],
       [ 0.18681872],
       [ 0.18546522],
       [ 0.18567938],
       [ 0.18555616],
       [ 0.18594604],
       [ 0.18548815],
       [-0.81386708],
       [ 0.19065313],
       [ 0

In [85]:
# Apply sigmoid derivative to the error
# Which direction do we want to go
# self.o_delta = self.o_error * sigmoidprime(o)

nnbp.o_delta.shape

(195, 1)

In [86]:
# z2 error
# Applying the  o-delta/correction to weights2 transformed
# These are the errors from the output to the hidden layer

nnbp.z2_error.shape

(195, 3)

In [87]:
# How much on the sigmoid curve we want to move
# Being the delta this is the direction we will be traveling
# For each observation, how much more sigmoid activation from this layer would have 
# pushed us towards the right answer?

nnbp.z2_delta.shape

(195, 3)

In [129]:
#Calculation to update the weights
X_train.T.dot(nnbp.z2_delta)

| | 0 | 1 | 2 |
|---|---|---|---|
| total_bill | -0.0008669517 | -0.0003819881 | -0.001028224 |
| tip | -0.0002491814 | -0.000112803 | -0.000279906 |
| smoker_Yes | -0.0002436398 | -0.0001079564 | -0.0002696525 |
| smoker_No | -5.476197e-06 | -4.43813e-06 | -9.721217e-06 |
| day_Thur | -1.959782e-06 | -1.498097e-06 | -2.356769e-06 |
| day_Fri | -2.900559e-05 | -5.800273e-06 | -4.910028e-05 |
| day_Sat | -0.000218393 | -0.0001051281 | -0.0002279937 |
| day_Sun | 2.42357e-07 | 3.190668e-08 | 7.696697e-08 |
| time_Lunch | -2.009212e-06 | -1.366094e-06 | -2.575745e-06 |
| time_Dinner | -0.0002471068 | -0.0001110284 | -0.000276798 |


In [130]:
# Update hidden layer weights

nnbp.activated_hidden.T.dot(nnbp.o_delta)

array([[-4.83444101],
       [-4.8347326 ],
       [-4.83402782]])

In [131]:
# Train my 'net
nnbp = NNetbackprop()

# Number of Epochs / Iterations
for i in range(10000):
    if (i+1 in [1,2,3,4,5]) or ((i+1) % 1000 ==0):
        print('+' + '---' * 3 + f'EPOCH {i+1}' + '---'*3 + '+')
        print('Input: \n', X_train)
        print('Actual Output: \n', ytrain)
        print('Predicted Output: \n', str(nnbp.feed_forward(X_train)))
        print("Loss: \n", str(np.mean(np.square(ytrain - nnbp.feed_forward(X_train)))))
    nnbp.train(X_train,ytrain)

+---------EPOCH 1---------+
Input: 
      total_bill   tip  smoker_Yes  smoker_No  day_Thur  day_Fri  day_Sat  \
199       13.51  2.00           1          0         1        0        0   
47        32.40  6.00           0          1         0        0        0   
35        24.06  3.60           0          1         0        0        1   
145        8.35  1.50           0          1         1        0        0   
11        35.26  5.00           0          1         0        0        0   
6          8.77  2.00           0          1         0        0        0   
193       15.48  2.02           1          0         1        0        0   
15        21.58  3.92           0          1         0        0        0   
8         15.04  1.96           0          1         0        0        0   
162       16.21  2.00           0          1         0        0        0   
63        18.29  3.76           1          0         0        0        1   
121       13.42  1.68           0          1       

# Day 3

### Learning Objectives
- Introduce the Keras Sequential Model API
- Learn How to Select Model Architecture
- Discuss the trade-off between various activation functions

In [88]:
#Model
model = Sequential()

#Input (input_dim=20 features) feeding the first dense layer
model.add(Dense(16, input_dim=20, activation='relu'))

#Hidden
model.add(Dense(32, kernel_initializer='normal', activation='relu'))
model.add(Dense(32, kernel_initializer='normal', activation='relu'))

#Output
model.add(Dense(1, activation='linear'))

#Compile
model.compile(loss='mean_squared_error',
              metrics=['mean_squared_error'],
              optimizer='adam')

#Fit & Evaluate
history = model.fit(X_train, y_train, epochs=100, verbose=False, validation_split=.1)
scores = model.evaluate(X_test, y_test, verbose=0)



In [89]:
print(f"The MSE of our neural net is {scores[1]}")
print(f"The RMSE of our neural net is {round(np.sqrt(scores[1]), 2)}")

The MSE of our neural net is 0.21952718496322632
The RMSE of our neural net is 0.4699999988079071


### Activation Functions

#### Step Function

- Binary activation function
- Updating weights through backpropagation is impossible due to its all-or-nothing nature
- Its derivative is zero everywhere except at the threshold, so there is no slope to drive weight updates


#### Linear Function

- Passes the signal on to the next layer scaled by a constant factor
- The derivative of a linear activation is constant (a horizontal line), which means every weight update is scaled by the same amount regardless of the input
- Only used for very simple tasks where interpretability is important

#### Sigmoid Function

- Sigmoid and the functions derived from it are usually better at classification problems
- Great activation function since it's continuously differentiable
- Slope approaches 0 quickly when moving away from zero
- Higher slope around 0 pushes our y more quickly to one of the extremes
- Useful for binary classification
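The slope behaviour listed above can be checked numerically. This is a small sketch with hypothetical helper names (`sigmoid`, `sigmoid_slope`); it evaluates the derivative `s * (1 - s)` at a few points.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_slope(z):
    """Derivative of the sigmoid with respect to its input z."""
    s = sigmoid(z)
    return s * (1.0 - s)

slope_at_zero = sigmoid_slope(0.0)    # the maximum slope, exactly 0.25
slope_far_out = sigmoid_slope(10.0)   # nearly flat far from zero
```

The tiny slope far from zero is the reason saturated sigmoid units learn slowly: the gradient passed back through them is close to nothing.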

#### Tanh Function

- Output is zero-centered, ranging from -1 to 1
- Steeper in the middle than sigmoid
- A rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
- Same drawbacks as sigmoid, such as gradients that flatten out when moving away from zero
- Higher derivative at 0, which moves weights toward the extremes faster
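The relationship between tanh and sigmoid can be verified with a short sketch; the sample points in `x` are arbitrary values picked for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-4.0, 4.0, 9)   # arbitrary sample points

# tanh is a sigmoid stretched to the range (-1, 1)
rescaled_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0
matches_tanh = np.allclose(np.tanh(x), rescaled_sigmoid)

# Slope at zero: 1 - tanh(0)^2 for tanh versus 0.25 for sigmoid
tanh_slope_at_zero = 1.0 - np.tanh(0.0) ** 2
```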

#### ReLU Function (Rectified Linear Unit)

- Typically used for the initial layers
- Generally better at obtaining optimal model fit
- Commonly used as an activation function in neural networks
- Outputs 0 when the weighted sum is negative and passes positive values through unchanged
- Derivative looks like a step function
- Turns off a portion of our less important neurons, which decreases computational load
- Can lead to dead neurons, since a neuron whose weighted sum stays negative never activates and receives no gradient
- May want to adjust weights that were initialized negative

#### Leaky ReLU

- Avoids having a gradient of 0 on the left side of the derivative function
- Even 'dead' neurons have a chance of being revived with enough iterations
- The slope of the leaky side can even be tuned as a hyperparameter
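A short NumPy sketch of both variants follows. The function names and the `alpha=0.1` slope are assumptions for illustration; `alpha` is the leaky-side slope mentioned above.

```python
import numpy as np

def relu(z):
    """Zero for negative inputs, identity for positive inputs."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.1):
    """Like ReLU, but negative inputs keep a small slope of alpha."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, 0.0, 3.0])   # illustrative pre-activations
relu_out = relu(z)               # negative input is zeroed out
leaky_out = leaky_relu(z)        # negative input is scaled by alpha
```

Because the leaky version never outputs a flat zero for negative inputs, some gradient always flows back, which is what gives a dead neuron the chance to recover.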

#### Softmax Function

- Good for multiclass classification problems
- Takes any set of inputs and translates them into probabilities that sum to 1
- Typically used in the output layer, with one probability per class
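As a sketch of how softmax turns raw scores into probabilities (the scores below are made up, and subtracting the maximum is a common numerical-stability step rather than something the notes above describe):

```python
import numpy as np

def softmax(scores):
    """Convert a 1-D array of scores into probabilities that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - np.max(scores)   # avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps)

probs = softmax([2.0, 1.0, 0.1])        # illustrative scores for 3 classes
predicted_class = int(np.argmax(probs)) # index of the most likely class
```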

# Day 4

### Learning Objectives
- Describe the major hyperparameters to tune
- Implement an experiment tracking framework
- Search the hyperparameter space using RandomSearch (Optional)

In [100]:
# Important Hyperparameters
inputs = X_train.shape[1]
epochs = 75
batch_size = 10


# Create Model
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(inputs,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(1))

# Compile Model
model.compile(optimizer='adam', loss='mse', metrics=['mse', 'mae'])

# Fit Model
model.fit(X_train, y_train, 
          validation_data=(X_test,y_test), 
          epochs=epochs, 
          batch_size=batch_size
         )

Train on 195 samples, validate on 49 samples
Epoch 1/75
...
Epoch 75/75


<tensorflow.python.keras.callbacks.History at 0x7fd0e4c8d358>

In [106]:
# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

# Replaced with tips dataset
# # load dataset
# url ="https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

# dataset = pd.read_csv(url, header=None).values

# # split into input (X) and output (Y) variables
# X = dataset[:,0:8]
# Y = dataset[:,8]

# Function to create model, required for KerasClassifier
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=20, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# create model
model = KerasClassifier(build_fn=create_model, verbose=0)

# define the grid search parameters
# batch_size = [10, 20, 40, 60, 80, 100]
# param_grid = dict(batch_size=batch_size, epochs=epochs)

# define the grid search parameters
param_grid = {'batch_size': [10, 20, 40, 60, 80, 100],
              'epochs': [20]}

# Create Grid Search
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1)
grid_result = grid.fit(X_train, y_train)

# Report Results
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print(f"Means: {mean}, Stdev: {stdev} with: {param}") 



Best: 0.5948718190193176 using {'batch_size': 10, 'epochs': 20}
Means: 0.5948718190193176, Stdev: 0.0691832654313919 with: {'batch_size': 10, 'epochs': 20}
Means: 0.5794872045516968, Stdev: 0.05936326262982861 with: {'batch_size': 20, 'epochs': 20}
Means: 0.5589743852615356, Stdev: 0.014504753621327159 with: {'batch_size': 40, 'epochs': 20}
Means: 0.5794872045516968, Stdev: 0.026148816459833527 with: {'batch_size': 60, 'epochs': 20}
Means: 0.5589743852615356, Stdev: 0.0725237681066358 with: {'batch_size': 80, 'epochs': 20}
Means: 0.5589743753274282, Stdev: 0.09428091042619949 with: {'batch_size': 100, 'epochs': 20}


In [108]:
# define the grid search parameters
param_grid = {'batch_size': [20],
              'epochs': [20, 40, 60,200]}

# Create Grid Search
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1)
grid_result = grid.fit(X_train, y_train)

# Report Results
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print(f"Means: {mean}, Stdev: {stdev} with: {param}")



Best: 0.6410256624221802 using {'batch_size': 20, 'epochs': 20}
Means: 0.6410256624221802, Stdev: 0.07977614491729938 with: {'batch_size': 20, 'epochs': 20}
Means: 0.5948718190193176, Stdev: 0.03161237761816035 with: {'batch_size': 20, 'epochs': 40}
Means: 0.5743589997291565, Stdev: 0.047557015090999834 with: {'batch_size': 20, 'epochs': 60}
Means: 0.6153846283753713, Stdev: 0.0905821708302442 with: {'batch_size': 20, 'epochs': 200}
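The two searches above each hold one parameter fixed. Searching both together multiplies the number of models to fit, which this standard-library sketch makes explicit. The `expand_grid` helper and the `n_folds` value are assumptions for illustration; the actual number of folds depends on the `cv` setting of `GridSearchCV`.

```python
from itertools import product

# The values tried above, combined into one joint grid
param_grid = {'batch_size': [10, 20, 40, 60, 80, 100],
              'epochs': [20, 40, 60, 200]}

def expand_grid(grid):
    """List every combination of parameter values as a dict."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

combinations = expand_grid(param_grid)

n_folds = 5  # assumed number of cross-validation folds
total_fits = len(combinations) * n_folds
```

Each additional hyperparameter multiplies `len(combinations)` again, which is why a full grid over many parameters quickly becomes expensive.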


# Assignment

In [14]:
df2 = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn+(1).csv')

In [24]:
print(df2.shape)
df2.head()

(7043, 21)


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [21]:
df2.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [25]:
df2.describe()

| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 |
| mean | 0.162147 | 32.371149 | 64.761692 |
| std | 0.368612 | 24.559481 | 30.090047 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.5 |
| 50% | 0.0 | 29.0 | 70.35 |
| 75% | 0.0 | 55.0 | 89.85 |
| max | 1.0 | 72.0 | 118.75 |


In [26]:
df2.describe(exclude='number')

Unnamed: 0,customerID,gender,Partner,Dependents,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,TotalCharges,Churn
count,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043.0,7043
unique,7043,2,2,2,2,3,3,3,3,3,3,3,3,3,2,4,6531.0,2
top,8111-RKSPX,Male,No,No,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,,No
freq,1,3555,3641,4933,6361,3390,3096,3498,3088,3095,3473,2810,2785,3875,4171,2365,11.0,5174


(7032, 21)

In [46]:
def prep2(df, target):
    
    
    # Drop rows where TotalCharges is blank, then convert the column to float
    rows = df[df['TotalCharges'] == " "]
    df = df.drop(rows.index, axis=0)
    df['TotalCharges'] = df['TotalCharges'].astype(float)
    
    # Split
    training, testing = train_test_split(df, test_size=.2)
    
    # Make X, y and drop the target and the high-cardinality customerID column
    X_train = training.drop(columns=[target, 'customerID'])
    y_train = training[target]
    X_test = testing.drop(columns=[target, 'customerID'])
    y_test = testing[target]
    
    
    processor = make_pipeline(
        ce.OneHotEncoder(use_cat_names=True),  
#         SimpleImputer(strategy='median'), # Use when a normalized dataframe is needed
#         StandardScaler() # Use when a normalized dataframe is needed
    )
    
    binary = {'No': 0, 'Yes': 1}
    y_train = y_train.map(binary)
    y_test = y_test.map(binary)
    
    X_process_train = processor.fit_transform(X_train)
    X_process_test = processor.transform(X_test)
    
    return X_process_train,y_train, X_process_test, y_test

In [47]:
Xtrain, ytrain, Xtest, ytest = prep2(df2, 'Churn')

In [48]:
print(Xtrain.shape)
Xtrain.head()

(5625, 45)


Unnamed: 0,gender_Female,gender_Male,SeniorCitizen,Partner_Yes,Partner_No,Dependents_No,Dependents_Yes,tenure,PhoneService_Yes,PhoneService_No,...,Contract_Two year,Contract_One year,PaperlessBilling_Yes,PaperlessBilling_No,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Bank transfer (automatic),PaymentMethod_Mailed check,MonthlyCharges,TotalCharges
53,1,0,1,1,0,1,0,8,1,0,...,0,0,1,0,1,0,0,0,80.65,633.3
3785,0,1,1,1,0,1,0,10,1,0,...,0,0,1,0,0,1,0,0,89.8,914.3
5366,1,0,0,1,0,0,1,66,1,0,...,1,0,1,0,0,1,0,0,59.75,3996.8
3426,0,1,0,0,1,1,0,1,1,0,...,0,0,1,0,0,1,0,0,69.8,69.8
3104,0,1,1,0,1,1,0,9,1,0,...,0,0,1,0,0,1,0,0,74.05,678.45


In [57]:
# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

# Function to create model, required for KerasClassifier
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=45, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# create model
model = KerasClassifier(build_fn=create_model, verbose=0)

# define the grid search parameters
# batch_size = [10, 20, 40, 60, 80, 100]
# param_grid = dict(batch_size=batch_size, epochs=epochs)

# define the grid search parameters
param_grid = {'batch_size': [10, 20, 40, 60, 80, 100],
              'epochs': [20, 40, 60, 100]}

# Create Grid Search
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=4)
grid_result = grid.fit(Xtrain, ytrain)

# Report Results
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print(f"Means: {mean}, Stdev: {stdev} with: {param}") 

Best: 0.8012444376945496 using {'batch_size': 100, 'epochs': 60}
Means: 0.7957333326339722, Stdev: 0.010675538021228769 with: {'batch_size': 10, 'epochs': 20}
Means: 0.7278222242991129, Stdev: 0.05588617371486254 with: {'batch_size': 10, 'epochs': 40}
Means: 0.7797333399454752, Stdev: 0.012866481979977896 with: {'batch_size': 10, 'epochs': 60}
Means: 0.7914666732152303, Stdev: 0.005134067389353745 with: {'batch_size': 10, 'epochs': 100}
Means: 0.7921777764956156, Stdev: 0.008821095438618167 with: {'batch_size': 20, 'epochs': 20}
Means: 0.781511127948761, Stdev: 0.021361459277727535 with: {'batch_size': 20, 'epochs': 40}
Means: 0.793066680431366, Stdev: 0.012138453802542713 with: {'batch_size': 20, 'epochs': 60}
Means: 0.7847111225128174, Stdev: 0.019186834908845082 with: {'batch_size': 20, 'epochs': 100}
Means: 0.7989333271980286, Stdev: 0.009420528921098374 with: {'batch_size': 40, 'epochs': 20}
Means: 0.7756444414456686, Stdev: 0.01702960965129327 with: {'batch_size': 40, 'epochs': 4

## Stretch Goals:

- Try to implement Random Search Hyperparameter Tuning on this dataset
- Try to implement Bayesian Optimization tuning on this dataset using hyperas or hyperopt (if you're brave)
- Practice hyperparameter tuning other datasets that we have looked at. How high can you get MNIST? Above 99%?
- Study for the Sprint Challenge
 - Can you implement both perceptron and MLP models from scratch with forward and backpropagation?
 - Can you implement both perceptron and MLP models in keras and tune their hyperparameters with cross validation?
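As a starting point for the random search stretch goal, here is a standard-library sketch that samples configurations instead of enumerating a full grid. The `sample_configs` helper and the `dropout` and `hidden_units` ranges are hypothetical; each sampled dict would still need to be passed to a model-building function and scored with cross validation.

```python
import random

# Candidate values; dropout and hidden_units ranges are assumptions
search_space = {
    'batch_size': [10, 20, 40, 60, 80, 100],
    'epochs': [20, 40, 60, 100],
    'dropout': [0.0, 0.1, 0.2, 0.3],
    'hidden_units': [8, 12, 16, 32],
}

def sample_configs(space, n_samples, seed=7):
    """Draw n_samples random configurations from the search space."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_samples)]

# 5 random candidates instead of all 6 * 4 * 4 * 4 = 384 combinations
candidates = sample_configs(search_space, n_samples=5)
```

Fixing the seed keeps the sampled candidates reproducible between runs, which makes results easier to compare.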