## Deliverable 3: Optimize the Model

Up to this point, the existing model has reached an accuracy of 72.56%. The goal is to raise accuracy above 75%. Starting from the initial file, the following changes were made before the data was saved as the application_df file:
- Application Type has been binned so that all application types with fewer than 200 occurrences are grouped together
- Classification has been binned so that all classifications with fewer than 1000 occurrences are grouped together
- Categorical data has been encoded with OneHotEncoder
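
The binning steps listed above can be sketched as follows. This is a minimal illustration on toy data, not the code that produced the application_df file: the `app_types` series, the `cutoff` of 3, and the "Other" label are assumptions chosen for the example (the notebook describes cutoffs of 200 and 1000).

```python
import pandas as pd

# Toy stand-in for a categorical column such as APPLICATION_TYPE (illustrative values).
app_types = pd.Series(["T3"] * 5 + ["T4"] * 3 + ["T9"] * 1 + ["T12"] * 1)

# Hypothetical cutoff for this toy data; labels seen fewer times than this are grouped.
cutoff = 3

counts = app_types.value_counts()
rare = counts[counts < cutoff].index

# Keep frequent labels as they are and replace every rare label with "Other".
binned = app_types.where(~app_types.isin(rare), "Other")
binned_counts = binned.value_counts()
print(binned_counts)
```

Grouping rare labels first keeps the number of one-hot columns small once the categorical data is encoded.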

One observation: the Ask Amount column has 8747 unique values, which is likely a major problem for the neural network, since it may treat them as unrelated unique values rather than as points on a sequence. This column should either be removed or binned into groups.

In [12]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd
import tensorflow as tf
import os
from tensorflow.keras.callbacks import ModelCheckpoint
from pathlib import Path

# Import and read the preprocessed application_df.csv.
application_df = pd.read_csv("Resources/application_df.csv")
application_df.head()


Unnamed: 0,STATUS,ASK_AMT,IS_SUCCESSFUL,APPLICATION_TYPE_Other,APPLICATION_TYPE_T10,APPLICATION_TYPE_T19,APPLICATION_TYPE_T3,APPLICATION_TYPE_T4,APPLICATION_TYPE_T5,APPLICATION_TYPE_T6,...,INCOME_AMT_1-9999,INCOME_AMT_10000-24999,INCOME_AMT_100000-499999,INCOME_AMT_10M-50M,INCOME_AMT_1M-5M,INCOME_AMT_25000-99999,INCOME_AMT_50M+,INCOME_AMT_5M-10M,SPECIAL_CONSIDERATIONS_N,SPECIAL_CONSIDERATIONS_Y
0,1,5000,1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1,108590,1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1,5000,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1,6692,1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1,142590,1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [13]:
ask_counts = application_df.ASK_AMT.value_counts()
ask_counts

5000        25398
10478           3
15583           3
63981           3
6725            3
            ...  
5371754         1
30060           1
43091152        1
18683           1
36500179        1
Name: ASK_AMT, Length: 8747, dtype: int64

Binning by frequency alone poses a problem: aside from $5000, most applicants asked for an individualized amount, so the bins will need to be defined as value ranges.

In [14]:
min_ask = application_df.ASK_AMT.min()
max_ask = application_df.ASK_AMT.max()
median_ask = application_df.ASK_AMT.median()

print(f'Min: {min_ask}, Median: {median_ask}, Max: {max_ask}')

Min: 5000, Median: 5000.0, Max: 8597806340


The data is heavily skewed to the right, with values ranging from $5000 to about $8.6 billion. The median equals the minimum, so at least half of the rows sit at $5000.
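
One way to put numbers on that skew is sketched below. The `amounts` series is toy data made up for the example, and the choice of quantiles is an assumption; the same calls could be pointed at `application_df.ASK_AMT`.

```python
import pandas as pd

# Toy amounts (illustrative only): mostly the minimum, plus a few much larger requests.
amounts = pd.Series([5000] * 8 + [6692, 108590, 142590, 5_371_754])

# Quantiles show where the bulk of the data sits relative to the maximum.
quantiles = amounts.quantile([0.5, 0.75, 0.9, 1.0])

# Fraction of rows that sit exactly at the minimum value.
share_at_min = (amounts == amounts.min()).mean()

# Sample skewness; a positive value indicates a long right tail.
skewness = amounts.skew()

print(quantiles)
print(f"Share at minimum: {share_at_min:.2f}, skewness: {skewness:.2f}")
```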

In [15]:
five_thousand = application_df.ASK_AMT[application_df.ASK_AMT == 5000]
five_thousand.value_counts()

5000    25398
Name: ASK_AMT, dtype: int64

In [16]:
# Unique ask amounts strictly between 5000 and 35000 (both endpoints are excluded)
under_thirtyfive = application_df.ASK_AMT[(application_df.ASK_AMT > 5000) & (application_df.ASK_AMT < 35000)]
under_thirtyfive.nunique()


2185

In [17]:
# Unique ask amounts strictly between 35000 and 100000 (both endpoints are excluded)
under_hundred = application_df.ASK_AMT[(application_df.ASK_AMT > 35000) & (application_df.ASK_AMT < 100000)]
under_hundred.nunique()


2044

In [18]:
# Unique ask amounts strictly between 100000 and 500000 (both endpoints are excluded)
under_fivehundred = application_df.ASK_AMT[
    (application_df.ASK_AMT > 100000) & (application_df.ASK_AMT < 500000)]
under_fivehundred.nunique()


2289

In [19]:

# Unique ask amounts strictly above 500000 (500000 itself is excluded)
over_halfmillion = application_df.ASK_AMT[application_df.ASK_AMT > 500000]
over_halfmillion.nunique()


2226
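
Together with the single value 5000, the four counts above add up to 8745 unique values, two short of the 8747 reported earlier. That would be consistent with two of the edge values (35000, 100000, 500000) appearing in the data, since every filter uses strict inequalities. The sketch below shows one way to count unique values per range in a single pass with `pd.cut`, where each bin includes its upper edge. The `ask` series and the label strings are toy assumptions for the example; the edges mirror the ranges explored above.

```python
import numpy as np
import pandas as pd

# Toy amounts (illustrative only), including values that sit exactly on an edge.
ask = pd.Series([5000, 5000, 6692, 35000, 63981, 100000, 142590, 500000, 5_371_754])

# With the default right=True each bin is (low, high], so an edge value such as
# 35000 falls in the bin that ends at 35000 rather than being dropped.
edges = [0, 5000, 35000, 100000, 500000, np.inf]
labels = ["5000", "5001-35000", "35001-100000", "100001-500000", "over 500000"]
ranges = pd.cut(ask, bins=edges, labels=labels)

# Number of distinct amounts in each range.
unique_per_range = ask.groupby(ranges, observed=False).nunique()
print(unique_per_range)

# Every distinct amount is assigned to exactly one range.
print(unique_per_range.sum() == ask.nunique())
```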

In [20]:
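# Abandoned attempt, left commented out: each assignment below would overwrite
# the entire ASK_AMT column with a single value instead of updating one row.
# for row in application_df.ASK_AMT:
#     if row == 5000:
#         application_df.ASK_AMT = 0
#     elif row < 35000:
#         application_df.ASK_AMT = 1
#     elif row < 100000:
#         application_df.ASK_AMT = 2
#     elif row < 500000:
#         application_df.ASK_AMT = 3
#     else:
#         application_df.ASK_AMT = 4

# application_df.head()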
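
A vectorized version of the same if/elif logic might look like the sketch below. It uses `np.select` on a toy frame rather than on `application_df`; the `df` frame and the `ASK_BIN` column name are hypothetical, introduced only for this example.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative only) standing in for application_df.
df = pd.DataFrame({"ASK_AMT": [5000, 6692, 35000, 63981, 142590, 5_371_754]})

# Conditions are evaluated in order, like the if/elif chain: the first match wins.
conditions = [
    df["ASK_AMT"] == 5000,
    df["ASK_AMT"] < 35000,
    df["ASK_AMT"] < 100000,
    df["ASK_AMT"] < 500000,
]
codes = [0, 1, 2, 3]

# ASK_BIN is a hypothetical new column; writing to it leaves ASK_AMT untouched.
# Rows matching none of the conditions get the default code 4.
df["ASK_BIN"] = np.select(conditions, codes, default=4)
print(df)
```

Writing the codes to a separate column keeps the raw amounts available in case the bin edges need to be revisited.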

In [21]:
application_df.ASK_AMT.dtype

dtype('int64')