# Tweaking the Features

Neural net tuning raised accuracy from 72.87% to 73.42%, but the target is still 75%. Here, we train the tuned network on every combination of one to eight of the nine candidate features to see which subsets give the highest accuracy. The full nine-feature set is tested separately at the end.

In [5]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import tensorflow as tf
import itertools

In [6]:
def create_app_df(cols) -> pd.DataFrame:
    #  Import and read the charity_data.csv.
    #application_df = pd.read_csv("https://static.bc-edx.com/data/dl-1-2/m21/lms/starter/charity_data.csv")
    application_df = pd.read_csv("Data/charity_data.csv")

    # Keep the 8 most common application types and collapse the rest into "Other"
    for app in application_df["APPLICATION_TYPE"].value_counts().index[8:]:
        application_df['APPLICATION_TYPE'] = application_df['APPLICATION_TYPE'].replace(app,"Other")

    # Keep the 5 most common classifications and collapse the rest into "Other"
    for cls in application_df["CLASSIFICATION"].value_counts().index[5:]:
        application_df['CLASSIFICATION'] = application_df['CLASSIFICATION'].replace(cls,"Other")

    application_df = application_df[[*cols, "IS_SUCCESSFUL"]]

    application_df = pd.get_dummies(application_df)
    return application_df

In [7]:
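from typing import Iterator, List

def get_column_combos() -> Iterator[List[str]]:
    all_cols = ["APPLICATION_TYPE","AFFILIATION","CLASSIFICATION","USE_CASE","ORGANIZATION","STATUS","INCOME_AMT","SPECIAL_CONSIDERATIONS","ASK_AMT"]
    # range() stops before len(all_cols), so this yields subsets of 1 to 8 columns;
    # the full 9-column set is not generated here.
    for combo_count in range(1,len(all_cols)):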
        for combo in itertools.combinations(all_cols, combo_count):
            yield list(combo)
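
To get a feel for how many models the generator above will ask us to train, the subsets can be counted directly. This is a minimal sketch using only the standard library; it mirrors the `range(1, len(all_cols))` bound from the cell above, and the names `generated` and `closed_form` are just illustrative.

```python
import itertools
from math import comb

all_cols = ["APPLICATION_TYPE", "AFFILIATION", "CLASSIFICATION", "USE_CASE",
            "ORGANIZATION", "STATUS", "INCOME_AMT", "SPECIAL_CONSIDERATIONS", "ASK_AMT"]
n = len(all_cols)

# Count by enumeration, using the same bounds as get_column_combos()
generated = sum(1 for k in range(1, n) for _ in itertools.combinations(all_cols, k))

# Count with binomial coefficients: C(9,1) + C(9,2) + ... + C(9,8)
closed_form = sum(comb(n, k) for k in range(1, n))

print(generated, closed_form)  # 510 510
```

That is 2^9 = 512 possible subsets, minus the empty set and the full set, so each pass through the training loop is one of 510 runs.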

In [26]:
def create_model(num_cols):
    # Define the model: a deep neural net with num_cols input features and three hidden layers.
    # Reuse the best-performing architecture from the tuning notebook. It won't be the best fit
    # for every feature subset, but it gives each subset a fair chance.
    # Layer 0: selu, 6 units; layer 1: selu, 13 units; layer 2: tanh, 7 units; output: sigmoid, 1 unit
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Dense(units=6, activation="selu", input_dim=num_cols))
    model.add(tf.keras.layers.Dense(units=13, activation="selu"))
    model.add(tf.keras.layers.Dense(units=7, activation="tanh"))
    model.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

    # Compile the model
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    
    return model

In [8]:
# Beware, this can take almost 6 hours to run!
all_model_stats = []

for col_combo in get_column_combos():
    application_df = create_app_df(col_combo)

    model_summary = { 'cols': application_df.columns }

    # Split our preprocessed data into our features and target arrays
    features = application_df.drop("IS_SUCCESSFUL", axis=1)
    target = application_df["IS_SUCCESSFUL"]

    # Split the preprocessed data into a training and testing dataset
    X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=1)

    # Create a StandardScaler instance
    # Fit the StandardScaler
    X_scaler = StandardScaler().fit(X_train)

    # Scale the data
    X_train_scaled = X_scaler.transform(X_train)
    X_test_scaled = X_scaler.transform(X_test)

    model = create_model(len(features.columns))

    # Train the model
    history = model.fit(
        X_train_scaled,
        y_train,
        epochs=34,
        verbose=0
        )

    model_loss, model_accuracy = model.evaluate(X_test_scaled,y_test,verbose=2)
    model_summary["accuracy"] = model_accuracy
    model_summary["loss"] = model_loss

    all_model_stats.append(model_summary)

268/268 - 0s - loss: 0.6677 - accuracy: 0.5720 - 426ms/epoch - 2ms/step
268/268 - 0s - loss: 0.6185 - accuracy: 0.6917 - 408ms/epoch - 2ms/step
268/268 - 0s - loss: 0.6719 - accuracy: 0.5784 - 416ms/epoch - 2ms/step
268/268 - 0s - loss: 0.6902 - accuracy: 0.5354 - 414ms/epoch - 2ms/step
268/268 - 0s - loss: 0.6720 - accuracy: 0.6000 - 413ms/epoch - 2ms/step
268/268 - 0s - loss: 0.6914 - accuracy: 0.5292 - 417ms/epoch - 2ms/step
268/268 - 0s - loss: 0.6877 - accuracy: 0.5346 - 412ms/epoch - 2ms/step
268/268 - 0s - loss: 0.6915 - accuracy: 0.5292 - 411ms/epoch - 2ms/step
268/268 - 0s - loss: 0.6907 - accuracy: 0.5342 - 411ms/epoch - 2ms/step
268/268 - 0s - loss: 0.5851 - accuracy: 0.7191 - 408ms/epoch - 2ms/step
268/268 - 0s - loss: 0.6462 - accuracy: 0.6170 - 421ms/epoch - 2ms/step
268/268 - 0s - loss: 0.6613 - accuracy: 0.5785 - 420ms/epoch - 2ms/step
268/268 - 0s - loss: 0.6477 - accuracy: 0.6251 - 425ms/epoch - 2ms/step
268/268 - 0s - loss: 0.6678 - accuracy: 0.5720 - 421ms/epoch - 2

In [20]:
all_cols = ["APPLICATION_TYPE","AFFILIATION","CLASSIFICATION","USE_CASE","ORGANIZATION","STATUS","INCOME_AMT","SPECIAL_CONSIDERATIONS","ASK_AMT"]

# Print all summaries, in order of descending accuracy
for i, model_stats in enumerate(sorted(all_model_stats, key=lambda x: x["accuracy"], reverse=True)):
    # Recover the original feature names from the (possibly one-hot encoded) column names
    cols = [col for col in all_cols if any(x.startswith(col) for x in model_stats["cols"])]
    print(f'Number {i+1} most effective\nColumns: {", ".join(cols)}\n{model_stats["accuracy"]}\n{model_stats["loss"]}\n--------------')

Number 1 most effective
Columns: APPLICATION_TYPE, AFFILIATION, CLASSIFICATION, ORGANIZATION, INCOME_AMT
0.7318950295448303
0.5562520027160645
--------------
Number 2 most effective
Columns: APPLICATION_TYPE, AFFILIATION, CLASSIFICATION, ORGANIZATION, INCOME_AMT, ASK_AMT
0.7302623987197876
0.5550990700721741
--------------
Number 3 most effective
Columns: APPLICATION_TYPE, AFFILIATION, CLASSIFICATION, ORGANIZATION, INCOME_AMT, SPECIAL_CONSIDERATIONS, ASK_AMT
0.729912519454956
0.5551552772521973
--------------
Number 4 most effective
Columns: APPLICATION_TYPE, AFFILIATION, CLASSIFICATION, ORGANIZATION, STATUS, INCOME_AMT, SPECIAL_CONSIDERATIONS, ASK_AMT
0.729912519454956
0.5570317506790161
--------------
Number 5 most effective
Columns: APPLICATION_TYPE, AFFILIATION, CLASSIFICATION, ORGANIZATION, INCOME_AMT, SPECIAL_CONSIDERATIONS
0.7295626997947693
0.5575858354568481
--------------
Number 6 most effective
Columns: APPLICATION_TYPE, AFFILIATION, CLASSIFICATION, USE_CASE, ORGANIZATION, I

In [27]:
# Do once more on the full feature set to test its comparative accuracy
application_df = create_app_df(all_cols)

model_summary = { 'cols': application_df.columns }

# Split our preprocessed data into our features and target arrays
features = application_df.drop("IS_SUCCESSFUL", axis=1)
target = application_df["IS_SUCCESSFUL"]

# Split the preprocessed data into a training and testing dataset
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=1)

# Create a StandardScaler instance
# Fit the StandardScaler
X_scaler = StandardScaler().fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

model = create_model(len(features.columns))

# Train the model
history = model.fit(
    X_train_scaled,
    y_train,
    epochs=34,
    verbose=0
    )

model_loss, model_accuracy = model.evaluate(X_test_scaled,y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

268/268 - 0s - loss: 0.5588 - accuracy: 0.7269 - 424ms/epoch - 2ms/step
Loss: 0.5587639212608337, Accuracy: 0.7268804907798767


# Analysis

The most important result here, I think, is the test I ran as an afterthought: the attempt to reproduce my tuning results. The full feature set, with the same neural network parameters, scored 72.69% here versus 73.42% during tuning, about 0.7 percentage points lower. To me, this says that the tuning gains were mostly random chance, and that I need to tune again using different feature sets.
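
One way to put a number on that run-to-run randomness would be to repeat the same configuration several times and look at the spread. The sketch below is only an outline under assumptions: `train_and_score` is a hypothetical callable that would wrap the train/evaluate steps from this notebook and return test accuracy for a given seed, and `fake_score` is a stand-in so the example runs without TensorFlow.

```python
import statistics
from typing import Callable, Dict, Iterable

def summarize_runs(train_and_score: Callable[[int], float],
                   seeds: Iterable[int]) -> Dict[str, float]:
    """Run the same configuration once per seed and summarize the accuracies."""
    scores = [train_and_score(seed) for seed in seeds]
    return {
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two runs
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }

# Stand-in for a real training run (hypothetical values, not notebook results)
def fake_score(seed: int) -> float:
    return 0.72 + 0.001 * (seed % 5)

summary = summarize_runs(fake_score, range(5))
print(summary)
```

If the spread across seeds turned out to be comparable to the 0.7-point gap seen here, differences of that size between feature sets or tuning runs could not be told apart from noise. In a real run, the seed would need to be applied to both the weight initialization and the shuffling inside `train_and_score`.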

The following features give the highest accuracy, given the neural network parameters obtained from AlphabetSoupCharity_Tuning: APPLICATION_TYPE, AFFILIATION, CLASSIFICATION, ORGANIZATION, INCOME_AMT. However, given how much random chance appears to influence these results, I feel I should ignore the single highest-scoring set and instead count how often each feature appears in the top 10 feature sets:

* APPLICATION_TYPE - 10 times
* AFFILIATION - 10 times
* CLASSIFICATION - 10 times
* USE_CASE - 3 times
* ORGANIZATION - 10 times
* STATUS - 2 times
* INCOME_AMT - 8 times
* SPECIAL_CONSIDERATIONS - 5 times
* ASK_AMT - 6 times
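
A tally like the one above can be produced with `collections.Counter`. This is a minimal sketch that uses only the five feature sets fully visible in the printed output earlier, so its counts are for the top 5, not the top 10; `top_sets` is an illustrative name.

```python
from collections import Counter

all_cols = ["APPLICATION_TYPE", "AFFILIATION", "CLASSIFICATION", "USE_CASE",
            "ORGANIZATION", "STATUS", "INCOME_AMT", "SPECIAL_CONSIDERATIONS", "ASK_AMT"]

# The five highest-accuracy feature sets, copied from the printed summaries
top_sets = [
    ["APPLICATION_TYPE", "AFFILIATION", "CLASSIFICATION", "ORGANIZATION", "INCOME_AMT"],
    ["APPLICATION_TYPE", "AFFILIATION", "CLASSIFICATION", "ORGANIZATION", "INCOME_AMT", "ASK_AMT"],
    ["APPLICATION_TYPE", "AFFILIATION", "CLASSIFICATION", "ORGANIZATION", "INCOME_AMT",
     "SPECIAL_CONSIDERATIONS", "ASK_AMT"],
    ["APPLICATION_TYPE", "AFFILIATION", "CLASSIFICATION", "ORGANIZATION", "STATUS", "INCOME_AMT",
     "SPECIAL_CONSIDERATIONS", "ASK_AMT"],
    ["APPLICATION_TYPE", "AFFILIATION", "CLASSIFICATION", "ORGANIZATION", "INCOME_AMT",
     "SPECIAL_CONSIDERATIONS"],
]

# Count how many of the top sets each feature appears in
counts = Counter(col for feature_set in top_sets for col in feature_set)

for col in all_cols:
    print(f"* {col} - {counts[col]} times")
```

A `Counter` returns 0 for features that never appear, so USE_CASE is still listed here even though none of the top five sets include it.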

From these top 10 numbers, I can glean the following:

* APPLICATION_TYPE, AFFILIATION, CLASSIFICATION, and ORGANIZATION are obviously critical features, as they are present in every one of the top 10
* INCOME_AMT is also very important to successful training
* STATUS and USE_CASE are practically inconsequential
* ASK_AMT and SPECIAL_CONSIDERATIONS are borderline, so I'm left to decide whether to keep them

Since I want the features that appear most consistently in the high-accuracy sets, I will go with APPLICATION_TYPE, AFFILIATION, CLASSIFICATION, ORGANIZATION, and INCOME_AMT. The others appear too rarely in the top 10 for me to be happy with their influence. (Incidentally, this is also the feature set that gave me the highest accuracy here, though I don't know whether that means much.)

Since the accuracies here are rather low compared to the Tuning notebook, I'm going to try tuning again. First, though, I want to try feature tweaking once more, this time varying the number of values collapsed into the Other categories. That will be done in AlphabetSoupCharity_FeatureTweaksPt2.