# Optimization attempt 1

In [None]:
# Repeat the processing steps to get the data to the same place. Then, play around with optimization by bucketing asking amount and 

# Import our dependencies
import pandas as pd
import tensorflow as tf
import keras_tuner as k
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE

# Read the charity_data.csv from the provided cloud URL.
application_df = pd.read_csv("https://static.bc-edx.com/data/dl-1-2/m21/lms/starter/charity_data.csv")

# Do most basic processing outlined in instructions
application_df = application_df.drop(columns=['EIN', 'NAME'])

application_df.head()





Unnamed: 0,APPLICATION_TYPE,AFFILIATION,CLASSIFICATION,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,SPECIAL_CONSIDERATIONS,ASK_AMT,IS_SUCCESSFUL
0,T10,Independent,C1000,ProductDev,Association,1,0,N,5000,1
1,T3,Independent,C2000,Preservation,Co-operative,1,1-9999,N,108590,1
2,T5,CompanySponsored,C3000,ProductDev,Association,1,0,N,5000,0
3,T3,CompanySponsored,C2000,Preservation,Trust,1,10000-24999,N,6692,1
4,T3,Independent,C1000,Heathcare,Trust,1,100000-499999,N,142590,1


#### Process Optimization Option 1:
First, I need to figure out which columns in the processed DataFrame are important. I do this below by fitting a random forest and inspecting its feature importances. I also want to add more layers and nodes to my model to improve accuracy as much as possible.

In [None]:
# Make a copy of the application DataFrame
important_df = application_df.copy()

# Add the same application and classification cutoffs as before
# Define application_cutoff
application_cutoff = 500

# Identify applications with counts below the cutoff
application_types_to_replace = important_df['APPLICATION_TYPE'].value_counts()
application_types_to_replace = application_types_to_replace[application_types_to_replace < application_cutoff].index.tolist()

# Replace in dataframe
for app in application_types_to_replace:
    important_df['APPLICATION_TYPE'] = important_df['APPLICATION_TYPE'].replace(app,"Other")

# Define classification_cutoff
classification_cutoff = 1000

# Identify classifications with counts below the cutoff
classifications_to_replace = important_df['CLASSIFICATION'].value_counts()
classifications_to_replace = classifications_to_replace[classifications_to_replace < classification_cutoff].index.tolist()

# Replace in dataframe
for cls in classifications_to_replace:
    important_df['CLASSIFICATION'] = important_df['CLASSIFICATION'].replace(cls,"Other")

# List of columns with categorical data
categorical_columns = [
    'APPLICATION_TYPE', 'AFFILIATION', 'CLASSIFICATION', 'USE_CASE', 'ORGANIZATION',
    'INCOME_AMT', 'SPECIAL_CONSIDERATIONS'
]

# Convert categorical data to numeric with `pd.get_dummies`
application_dummies = pd.get_dummies(important_df[categorical_columns], drop_first=True)

# Add dummies to copy of DataFrame for model building
important_df = pd.concat([important_df, application_dummies], axis=1)

# Drop original non-numeric columns 
important_df = important_df.drop(columns=categorical_columns)

# Check the processed DataFrame
important_df.head()

Unnamed: 0,STATUS,ASK_AMT,IS_SUCCESSFUL,APPLICATION_TYPE_T10,APPLICATION_TYPE_T19,APPLICATION_TYPE_T3,APPLICATION_TYPE_T4,APPLICATION_TYPE_T5,APPLICATION_TYPE_T6,APPLICATION_TYPE_T7,...,ORGANIZATION_Trust,INCOME_AMT_1-9999,INCOME_AMT_10000-24999,INCOME_AMT_100000-499999,INCOME_AMT_10M-50M,INCOME_AMT_1M-5M,INCOME_AMT_25000-99999,INCOME_AMT_50M+,INCOME_AMT_5M-10M,SPECIAL_CONSIDERATIONS_Y
0,1,5000,1,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,1,108590,1,False,False,True,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
2,1,5000,0,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
3,1,6692,1,False,False,True,False,False,False,False,...,True,False,True,False,False,False,False,False,False,False
4,1,142590,1,False,False,True,False,False,False,False,...,True,False,False,True,False,False,False,False,False,False


In [None]:
# Define features and target
X = important_df.drop(columns='IS_SUCCESSFUL')
y = important_df['IS_SUCCESSFUL']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the model
rf = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [200, 300, 400],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Define an instance of GridSearchCV and fit it to the training data
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, verbose=2, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Retrieve the best estimator found by the search
# (GridSearchCV refits it by default, so the extra fit is redundant but harmless)
best_rf = grid_search.best_estimator_
best_rf.fit(X_train, y_train)

# Predict outcomes
y_pred = best_rf.predict(X_test)

# Evaluate the random forest
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')

# Additional classification report
print(classification_report(y_test, y_pred))

# Further evaluate feature importances
feature_importances = best_rf.feature_importances_
features = X.columns

# Create a DataFrame for better readability
feature_importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

feature_importance_df

Fitting 3 folds for each of 216 candidates, totalling 648 fits
Accuracy: 0.7258
              precision    recall  f1-score   support

           0       0.73      0.66      0.69      3196
           1       0.72      0.79      0.75      3664

    accuracy                           0.73      6860
   macro avg       0.73      0.72      0.72      6860
weighted avg       0.73      0.73      0.72      6860



Unnamed: 0,Feature,Importance
11,AFFILIATION_Independent,0.450664
26,ORGANIZATION_Trust,0.083658
17,CLASSIFICATION_C2100,0.055752
6,APPLICATION_TYPE_T5,0.046338
2,APPLICATION_TYPE_T10,0.04484
1,ASK_AMT,0.044696
3,APPLICATION_TYPE_T19,0.040266
19,CLASSIFICATION_Other,0.038596
5,APPLICATION_TYPE_T4,0.03288
4,APPLICATION_TYPE_T3,0.023084


After seeing which columns are most important, I think it is safe to drop the columns with an importance of 0.0005 or less. They offer little to no value for training the model, which makes them excess noise that should be cut out.

In [None]:
important_df_2 = important_df.copy()

# Identify features with an importance of 0.0005 or less
low_importance_features = feature_importance_df[feature_importance_df['Importance'] <= 0.000500]['Feature'].tolist()

# Remove the useless columns
important_df_2 = important_df_2.drop(columns=low_importance_features)

# Run the importance model again to make sure things didn't drastically change
# Define features and target using the reduced DataFrame
X = important_df_2.drop(columns='IS_SUCCESSFUL')
y = important_df_2['IS_SUCCESSFUL']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the model
rf = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [200, 300, 400],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Define an instance of GridSearchCV and fit it to the training data
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, verbose=2, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Retrieve the best estimator found by the search
# (GridSearchCV refits it by default, so the extra fit is redundant but harmless)
best_rf = grid_search.best_estimator_
best_rf.fit(X_train, y_train)

# Predict outcomes
y_pred = best_rf.predict(X_test)

# Evaluate the random forest
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')

# Additional classification report
print(classification_report(y_test, y_pred))

# Further evaluate feature importances
feature_importances = best_rf.feature_importances_
features = X.columns

# Create a DataFrame for better readability
feature_importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

feature_importance_df

Fitting 3 folds for each of 216 candidates, totalling 648 fits
Accuracy: 0.7265
              precision    recall  f1-score   support

           0       0.73      0.66      0.69      3196
           1       0.73      0.78      0.75      3664

    accuracy                           0.73      6860
   macro avg       0.73      0.72      0.72      6860
weighted avg       0.73      0.73      0.73      6860



Unnamed: 0,Feature,Importance
10,AFFILIATION_Independent,0.433845
21,ORGANIZATION_Trust,0.085329
13,CLASSIFICATION_C2100,0.060145
5,APPLICATION_TYPE_T5,0.047692
0,ASK_AMT,0.04599
1,APPLICATION_TYPE_T10,0.043435
2,APPLICATION_TYPE_T19,0.043271
15,CLASSIFICATION_Other,0.041304
4,APPLICATION_TYPE_T4,0.036481
3,APPLICATION_TYPE_T3,0.021566


In [None]:
# Check value counts
# Define columns
all_columns = important_df_2.columns.tolist()

for i in all_columns:
    print(f'Value Counts for the column named: {i}')
    print(important_df_2[i].value_counts())


Value Counts for the column named: ASK_AMT
ASK_AMT
5000        25398
10478           3
15583           3
63981           3
6725            3
            ...  
5371754         1
30060           1
43091152        1
18683           1
36500179        1
Name: count, Length: 8747, dtype: int64
Value Counts for the column named: IS_SUCCESSFUL
IS_SUCCESSFUL
1    18261
0    16038
Name: count, dtype: int64
Value Counts for the column named: APPLICATION_TYPE_T10
APPLICATION_TYPE_T10
False    33771
True       528
Name: count, dtype: int64
Value Counts for the column named: APPLICATION_TYPE_T19
APPLICATION_TYPE_T19
False    33234
True      1065
Name: count, dtype: int64
Value Counts for the column named: APPLICATION_TYPE_T3
APPLICATION_TYPE_T3
True     27037
False     7262
Name: count, dtype: int64
Value Counts for the column named: APPLICATION_TYPE_T4
APPLICATION_TYPE_T4
False    32757
True      1542
Name: count, dtype: int64
Value Counts for the column named: APPLICATION_TYPE_T5
APPLICATION_TYPE_

In [None]:
# Copy the important_df
optimization_df_1 = important_df_2.copy()

# Build the model optimization attempt one.
# Define features and target
X = optimization_df_1.drop(columns='IS_SUCCESSFUL')
y = optimization_df_1['IS_SUCCESSFUL']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the StandardScaler
X_scaler = scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

# Check dimensions
print(f"Training Data Shape: {X_train_scaled.shape}")
print(f"Test Data Shape: {X_test_scaled.shape}")


Training Data Shape: (27439, 30)
Test Data Shape: (6860, 30)


In [12]:
# Define the model - deep neural net, i.e., the number of input features and hidden nodes for each layer.
nn = tf.keras.models.Sequential()

# First hidden layer (also declares the 30 input features)
nn.add(tf.keras.layers.Dense(units=25, activation="relu", input_dim=30))

# Second hidden layer
nn.add(tf.keras.layers.Dense(units=110, activation="relu"))  # units_0

# Third hidden layer
nn.add(tf.keras.layers.Dense(units=285, activation="relu"))  # units_1

# Fourth hidden layer
nn.add(tf.keras.layers.Dense(units=135, activation="relu"))  # units_2

# Fifth hidden layer
nn.add(tf.keras.layers.Dense(units=210, activation="relu"))  # units_3

# Output layer
nn.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

# Check the structure of the model
nn.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_6 (Dense)             (None, 25)                775       
                                                                 
 dense_7 (Dense)             (None, 110)               2860      
                                                                 
 dense_8 (Dense)             (None, 285)               31635     
                                                                 
 dense_9 (Dense)             (None, 135)               38610     
                                                                 
 dense_10 (Dense)            (None, 210)               28560     
                                                                 
 dense_11 (Dense)            (None, 1)                 211       
                                                                 
Total params: 102651 (400.98 KB)
Trainable params: 102
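
The parameter counts in the summary above can be checked by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit, so its count is `(inputs + 1) * units`. The sketch below reproduces the per-layer numbers for the architecture used in this notebook. The helper names `dense_param_count` and `sequential_param_counts` are made up for illustration and are not part of Keras.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return (n_inputs + 1) * n_units


def sequential_param_counts(n_features, layer_units):
    """Per-layer parameter counts for a plain stack of Dense layers."""
    counts = []
    n_inputs = n_features
    for n_units in layer_units:
        counts.append(dense_param_count(n_inputs, n_units))
        n_inputs = n_units  # this layer's outputs feed the next layer
    return counts


# Unit sizes taken from the model definition in this notebook
layer_units = [25, 110, 285, 135, 210, 1]

counts_30 = sequential_param_counts(30, layer_units)
total_30 = sum(counts_30)
print(counts_30, total_30)  # [775, 2860, 31635, 38610, 28560, 211] 102651
```

With 29 input features instead of 30, only the first layer changes (750 instead of 775 parameters), which gives the 102,626 total reported for the later models.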

In [13]:
# Compile the model
nn.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [14]:
# Train the model
model = nn.fit(X_train_scaled, y_train, epochs=100)

Epoch 1/100
...
Epoch 77/100
Epoch 

In [15]:
# Get classification_report
y_pred_prob = nn.predict(X_test_scaled)
y_pred = (y_pred_prob > 0.5).astype(int)  
print(classification_report(y_test, y_pred))

# Evaluate the model using the test data
model_loss, model_accuracy = nn.evaluate(X_test_scaled,y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

              precision    recall  f1-score   support

           0       0.73      0.65      0.69      3196
           1       0.72      0.79      0.76      3664

    accuracy                           0.73      6860
   macro avg       0.73      0.72      0.72      6860
weighted avg       0.73      0.73      0.73      6860

215/215 - 1s - loss: 0.5911 - accuracy: 0.7277 - 558ms/epoch - 3ms/step
Loss: 0.5910669565200806, Accuracy: 0.7276967763900757


In [16]:
# Export our model to HDF5 file
nn.save("AlphabetSoupCharity_OptimizationAttempt1.h5")



## Optimization attempt 2:
Because the target variable is slightly imbalanced, I want to see whether oversampling helps the model by giving it a more even distribution of target values. I implement SMOTE to balance the target variable in the training set and check whether performance increases or decreases.
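
As a rough illustration of the idea behind SMOTE, the sketch below creates synthetic minority rows by interpolating between an existing minority row and one of its nearest minority neighbours. It is a simplified, standard-library-only sketch under those assumptions, not the imbalanced-learn implementation. The function name `synthesize_minority_sample` and the toy data are hypothetical.

```python
import random


def synthesize_minority_sample(minority_rows, k=2, rng=None):
    """Interpolate between a random minority row and one of its k nearest minority neighbours."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority_rows)
    # Rank the other minority rows by squared Euclidean distance to the base row
    others = [row for row in minority_rows if row is not base]
    others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    # The new point lies on the segment between the base row and the neighbour
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Toy minority-class rows (hypothetical, two scaled features each)
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
rng = random.Random(42)
synthetic = [synthesize_minority_sample(toy_minority, k=2, rng=rng) for _ in range(3)]
print(synthetic)
```

Because the synthetic rows are built from distances between rows, it makes sense to resample after scaling, and only on the training split so the test set keeps its original distribution.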

In [19]:
# Create a copy of application df
optimization_df_2 = important_df_2.copy()

# Build the model attempt two
# Define features and target
X = optimization_df_2.drop(columns='IS_SUCCESSFUL')
y = optimization_df_2['IS_SUCCESSFUL']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the StandardScaler
X_scaler = scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

# Apply SMOTE to balance the target variable in the training set
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

# Check dimensions
print(f"Training Data Shape: {X_train_resampled.shape}")
print(f"Test Data Shape: {X_test_scaled.shape}")

Training Data Shape: (29194, 29)
Test Data Shape: (6860, 29)


In [20]:
# Define the model - deep neural net, i.e., the number of input features and hidden nodes for each layer.
nn_2 = tf.keras.models.Sequential()

# First hidden layer (also declares the 29 input features)
nn_2.add(tf.keras.layers.Dense(units=25, activation="relu", input_dim=29))

# Second hidden layer
nn_2.add(tf.keras.layers.Dense(units=110, activation="relu"))  # units_0

# Third hidden layer
nn_2.add(tf.keras.layers.Dense(units=285, activation="relu"))  # units_1

# Fourth hidden layer
nn_2.add(tf.keras.layers.Dense(units=135, activation="relu"))  # units_2

# Fifth hidden layer
nn_2.add(tf.keras.layers.Dense(units=210, activation="relu"))  # units_3

# Output layer
nn_2.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

# Check the structure of the model
nn_2.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_12 (Dense)            (None, 25)                750       
                                                                 
 dense_13 (Dense)            (None, 110)               2860      
                                                                 
 dense_14 (Dense)            (None, 285)               31635     
                                                                 
 dense_15 (Dense)            (None, 135)               38610     
                                                                 
 dense_16 (Dense)            (None, 210)               28560     
                                                                 
 dense_17 (Dense)            (None, 1)                 211       
                                                                 
Total params: 102626 (400.88 KB)
Trainable params: 102

In [21]:
# Compile the model
nn_2.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [22]:
# Train the model
model_2 = nn_2.fit(X_train_resampled, y_train_resampled, epochs=100)

Epoch 1/100
...
Epoch 77/100
Epoch 78

In [23]:
# Get classification_report
y_pred_prob_2 = nn_2.predict(X_test_scaled)
y_pred_2 = (y_pred_prob_2 > 0.5).astype(int)  
print(classification_report(y_test, y_pred_2))

# Evaluate the model using the test data
model_loss, model_accuracy = nn_2.evaluate(X_test_scaled,y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

              precision    recall  f1-score   support

           0       0.71      0.68      0.70      3196
           1       0.73      0.76      0.74      3664

    accuracy                           0.72      6860
   macro avg       0.72      0.72      0.72      6860
weighted avg       0.72      0.72      0.72      6860

215/215 - 0s - loss: 0.5741 - accuracy: 0.7224 - 460ms/epoch - 2ms/step
Loss: 0.5740903615951538, Accuracy: 0.722449004650116


In [24]:
# Export our model to HDF5 file
nn_2.save("AlphabetSoupCharity_OptimizationAttempt2.h5")



## Optimization Attempt 3:
Since SMOTE didn't increase the accuracy, and because there is such a large imbalance in the `ASK_AMT` value counts, I want to turn the entire column into a binary column: 0 for values equal to 5000 and 1 for every other value. Then I want to run the model without SMOTE to see if the accuracy increases.

In [25]:
optimization_df_3 = important_df_2.copy()

# Create a binary column where 5000 is 0 and all other values are 1
optimization_df_3["ASK_AMT_Binary"] = (optimization_df_3["ASK_AMT"] != 5000).astype(int)

# Drop Original column
optimization_df_3 = optimization_df_3.drop(columns=(['ASK_AMT']))

# Display value counts for the binary column
print(optimization_df_3["ASK_AMT_Binary"].value_counts())

ASK_AMT_Binary
0    25398
1     8901
Name: count, dtype: int64


In [None]:
# Build the model optimization attempt three.
# Define features and target
X = optimization_df_3.drop(columns='IS_SUCCESSFUL')
y = optimization_df_3['IS_SUCCESSFUL']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the StandardScaler
X_scaler = scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

# Check dimensions
print(f"Training Data Shape: {X_train_scaled.shape}")
print(f"Test Data Shape: {X_test_scaled.shape}")

Training Data Shape: (27439, 29)
Test Data Shape: (6860, 29)


In [27]:
# Define the model - deep neural net, i.e., the number of input features and hidden nodes for each layer.
nn_3 = tf.keras.models.Sequential()

# First hidden layer (also declares the 29 input features)
nn_3.add(tf.keras.layers.Dense(units=25, activation="relu", input_dim=29))

# Second hidden layer
nn_3.add(tf.keras.layers.Dense(units=110, activation="relu"))  # units_0

# Third hidden layer
nn_3.add(tf.keras.layers.Dense(units=285, activation="relu"))  # units_1

# Fourth hidden layer
nn_3.add(tf.keras.layers.Dense(units=135, activation="relu"))  # units_2

# Fifth hidden layer
nn_3.add(tf.keras.layers.Dense(units=210, activation="relu"))  # units_3

# Output layer
nn_3.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

# Check the structure of the model
nn_3.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_18 (Dense)            (None, 25)                750       
                                                                 
 dense_19 (Dense)            (None, 110)               2860      
                                                                 
 dense_20 (Dense)            (None, 285)               31635     
                                                                 
 dense_21 (Dense)            (None, 135)               38610     
                                                                 
 dense_22 (Dense)            (None, 210)               28560     
                                                                 
 dense_23 (Dense)            (None, 1)                 211       
                                                                 
Total params: 102626 (400.88 KB)
Trainable params: 102

In [28]:
# Compile the model
nn_3.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [29]:
# Train the model
model_3 = nn_3.fit(X_train_scaled, y_train, epochs=100)

Epoch 1/100
...
Epoch 77/100
Epoch 78

In [30]:
# Get classification_report
y_pred_prob_3 = nn_3.predict(X_test_scaled)
y_pred_3 = (y_pred_prob_3 > 0.5).astype(int)  
print(classification_report(y_test, y_pred_3))

# Evaluate the model using the test data
model_loss, model_accuracy = nn_3.evaluate(X_test_scaled,y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

              precision    recall  f1-score   support

           0       0.72      0.66      0.69      3196
           1       0.73      0.78      0.75      3664

    accuracy                           0.72      6860
   macro avg       0.72      0.72      0.72      6860
weighted avg       0.72      0.72      0.72      6860

215/215 - 1s - loss: 0.6307 - accuracy: 0.7235 - 513ms/epoch - 2ms/step
Loss: 0.6306723356246948, Accuracy: 0.7234693765640259


In [None]:
# Export our model to HDF5 file
nn_3.save("AlphabetSoupCharity_OptimizationAttempt3.h5")



# Neural Network Model Optimization Report

---
## Overview of the Analysis
The purpose of this report is to outline the reasoning behind my attempts to optimize a neural network model so that it predicts successful campaigns with 75% or higher accuracy. I used a CSV file containing data for more than 34,000 organizations that received funding from the nonprofit foundation Alphabet Soup.

---
## Results

---
### Data Preprocessing

- **Target Variable:** This model is trying to predict successful campaigns by using the `IS_SUCCESSFUL` column found in the dataset, which indicates whether an applicant was successful or not.
    
- **Feature Variables:** The feature variables are all other columns in the pandas DataFrame after removing the required non-variable columns. These features include various applicant attributes such as `APPLICATION_TYPE`, `ASK_AMT`, `CLASSIFICATION`, and `INCOME_AMT`.
    
- **Removed Variables:** Based on the value counts of all columns in the dataset, I identified the identification columns that carry no predictive value and removed them from the DataFrame. In the end, only the columns `EIN` and `NAME` were removed at this stage.
  
---
### Compiling, Training, and Evaluating the Model

- **Neurons, Layers, and Activation Functions:**
    
    - When I initially ran the model, there was only a basic architecture with three layers. However, when optimizing the model, additional layers and neurons were added to enhance model performance.
        
    - The activation functions used included ReLU for hidden layers and sigmoid for the output layer to handle the binary classification problem. ReLU was chosen to introduce non-linearity and mitigate vanishing gradients, while sigmoid was used for its probabilistic output, aligning with the classification task. Furthermore, I stuck to using the `adam` optimizer when building my models. 
      
- **Model Performance:**
    
    - The initial model did not achieve the desired accuracy. Several approaches were taken in hopes of increasing the prediction accuracy of the model: iterations of feature selection, data balancing, and architectural changes were made in an attempt to improve performance.
      
- **Optimization Attempts:**
    
    1. **Adding Additional Neural Network Layers and Nodes on Top of Feature Selection:** I ran a random forest model to predict the `IS_SUCCESSFUL` target and used [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to tune its hyperparameters with 3-fold cross-validation. I then trained the best-performing model, evaluated its accuracy score, and analyzed which features influenced predictions the most. Finally, I removed low-importance features in an attempt to reduce excessive noise in the dataset. I classified low-importance features as those with an importance of 0.0005 (0.05%) or less. The neural network trained on the reduced feature set reached a test accuracy of about 0.728.
        
    2. **SMOTE Oversampling:** Due to the slight imbalance of the target classes, I tried an over-sampling technique to see if that improved the model's accuracy. I used [SMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) to balance the target variable distribution in the training set, but this did not improve accuracy (about 0.722 on the test set). Since SMOTE did not yield meaningful performance gains, it was removed in subsequent model refinements to simplify preprocessing.
        
    3. **ASK_AMT Binarization:** Due to the **extreme** imbalance in `ASK_AMT`, where 5000 was the only value with a count over 5, the column was converted into a binary feature indicating whether the value was 5000 or something else. The model was then retrained without SMOTE, but test accuracy (about 0.723) did not improve over the first attempt.
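
To make the activation choices described above concrete, here is a minimal standard-library sketch of ReLU, sigmoid, and the 0.5 decision threshold that the notebook applies to the predicted probabilities. The function names and the sample inputs are illustrative assumptions. Keras computes the same functions on whole tensors.

```python
import math


def relu(x):
    """ReLU: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid: squash a real number into (0, 1); fine for moderate inputs."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_label(probability, threshold=0.5):
    """Turn a predicted probability into a 0/1 class label."""
    return int(probability > threshold)


# Hypothetical pre-activation values for illustration
pre_activations = [-2.0, 0.0, 3.0]

hidden = [relu(v) for v in pre_activations]            # [0.0, 0.0, 3.0]
probabilities = [sigmoid(v) for v in pre_activations]  # values strictly between 0 and 1
labels = [predict_label(p) for p in probabilities]     # [0, 0, 1]
print(hidden, probabilities, labels)
```

Note that a probability of exactly 0.5 maps to class 0 here, because the comparison is strict, matching the `> 0.5` test used in the evaluation cells.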
      
---
## Summary

---
Despite multiple optimization attempts, including feature selection, oversampling, and feature engineering, the final model did not achieve the 75% target accuracy; all three attempts scored between roughly 0.72 and 0.73 on the test set. The transformation of `ASK_AMT` into a binary variable did not yield significant improvement, and SMOTE failed to enhance performance, so it was removed from later iterations to focus on other optimization strategies.

**Recommendation:** Given the large class imbalances in several features, replacing the neural network with a Random Forest classifier could yield better results. Random Forests can handle imbalanced data more effectively by adjusting class weights. This approach can improve classification performance without the need for artificial data balancing techniques like SMOTE, which did not provide meaningful gains in accuracy.
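
As a hedged sketch of what class weighting means here, the snippet below computes "balanced" weights as `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight="balanced"`. The counts are the `IS_SUCCESSFUL` value counts for the full dataset shown earlier; in practice the weights would be derived from the training split only. The helper name `balanced_class_weights` is made up for illustration.

```python
def balanced_class_weights(class_counts):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# IS_SUCCESSFUL value counts from the notebook output
target_counts = {1: 18261, 0: 16038}

weights = balanced_class_weights(target_counts)
print(weights)  # the rarer class (0) gets a weight above 1, the more common class below 1
```

With weights like these, errors on the less frequent class cost slightly more during training, which is the effect the recommendation relies on instead of generating synthetic rows.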

----
## Citations

---

1. I used [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) in association with [ChatGPT](https://chatgpt.com/) to cross validate the random forest model. This allowed me to find the optimal model and find out the features that most influenced the model's predictions. 
2. I used [SMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) in association with [ChatGPT](https://chatgpt.com/) to over-sample my imbalanced target variable in an attempt to improve model accuracy.