# Part 1: RNF Function Testing and Evaluation
## Step 1: Preprocessing US Social Security Data for RNF (Relative Name Frequency) Function
#### Notes
* Same as in main model
* Like main model, the unique name-gender-frequency US Social Security data will be used as the RNF function's frame of reference, or its 'df_train' parameter

In [129]:
# Import Dependencies
import pandas as pd
import os
import psycopg2
from sqlalchemy import create_engine
from getpass import getpass
import random
from nltk.corpus import names
import nltk

In [130]:
# Folder Path
path = (r"C:\Users\benmo\OneDrive\Desktop\Capstone Project\Final-Project-Sunshine-Segment3\Resources\Resources_ML\names")
  
# Change the directory
os.chdir(path)

names_list = []

for file in os.listdir():
    file_path = rf"{path}\{file}"
    year = file[3:7]
    df_temp = pd.read_csv(file_path, names=["first_name", "gender", "frequency"])
    df_temp["year"] = year
    names_list.append(df_temp)

In [131]:
# Combine all 'names_list' items (one DF per US Social Security names file) into one large DF
# *NOTE: 'DataFrame.append' was removed in pandas 2.0, so 'pd.concat' is used here instead
total_us_ss_df = pd.concat(names_list, ignore_index=True)

# Display output (which is one large dataframe that has combined all of the DFs in 'names_list')
total_us_ss_df

Unnamed: 0,first_name,gender,frequency,year
0,Mary,F,7065,1880
1,Anna,F,2604,1880
2,Emma,F,2003,1880
3,Elizabeth,F,1939,1880
4,Minnie,F,1746,1880
...,...,...,...,...
2020858,Zykell,M,5,2020
2020859,Zylus,M,5,2020
2020860,Zymari,M,5,2020
2020861,Zyn,M,5,2020


In [132]:
# Grouping the above DF by first name and gender and summing the frequency
# Grouping by first name yields one row per unique name across all of the US Social Security Data's annual name files
# Grouping by gender keeps the M and F frequencies of any multi-gender first name separate, so each such name has one row per gender
# Summing frequency tallies each unique name-gender combination across all of the years (1880-2020)
# *NOTE: Each annual US Social Security name file lists a name, its associated gender, and the number of Social Security card applications for births in that year with that name and gender. A name-gender combination appears in a year's file only if it occurred at least 5 times in that year.
sum_frequencies_df = total_us_ss_df.groupby(["first_name", "gender"])[["frequency"]].sum().reset_index()

# Displaying output
sum_frequencies_df

Unnamed: 0,first_name,gender,frequency
0,Aaban,M,120
1,Aabha,F,46
2,Aabid,M,16
3,Aabidah,F,5
4,Aabir,M,10
...,...,...,...
111467,Zyvion,M,5
111468,Zyvon,M,7
111469,Zyyanna,F,6
111470,Zyyon,M,6


In [133]:
# Using a negated boolean mask with loc on the above DF's 'first_name' column to find all of the names that do not have a duplicate
# This separates the single-gender names from the names that are historically multi-gender. A name is multi-gender if it appears in the data under both genders, i.e. at some point in the US from 1880-2020 it had 5 or more instances in a year with a Male associated gender and 5 or more instances in a year with a Female associated gender; such names are not included in this extraction
not_duplicated = sum_frequencies_df.loc[~sum_frequencies_df.duplicated(subset=["first_name"], keep=False)]

# Pivoting the above DF so that the two values of the 'gender' column ('F' and 'M') become two distinct columns holding each name's frequency for that gender, with 'first_name' set as the index
# Each name in this DF has a value in only one of the 'F'/'M' columns (the other is filled with 0); the pivot is done so that this DF has the same layout as the multi-gender DF it is combined with later
not_duplicated_pivoted = not_duplicated.pivot_table(values="frequency", index="first_name", columns="gender").fillna(0)

# Removing the leftover 'gender' label from the columns axis
not_duplicated_pivoted.columns.name=None

# Resetting index
not_duplicated_final = not_duplicated_pivoted.reset_index()

# Displaying output
not_duplicated_final

Unnamed: 0,first_name,F,M
0,Aaban,0.0,120.0
1,Aabha,46.0,0.0
2,Aabid,0.0,16.0
3,Aabidah,5.0,0.0
4,Aabir,0.0,10.0
...,...,...,...
89251,Zyvion,0.0,5.0
89252,Zyvon,0.0,7.0
89253,Zyyanna,6.0,0.0
89254,Zyyon,0.0,6.0


In [134]:
# Same thing as above except with the opposite data--all multi-gender first names
duplicated = sum_frequencies_df.loc[sum_frequencies_df.duplicated(subset=["first_name"], keep=False)]
duplicated_pivoted = duplicated.pivot_table(values="frequency", index="first_name", columns="gender")
duplicated_pivoted.columns.name=None
duplicated_final = duplicated_pivoted.reset_index()
duplicated_final

Unnamed: 0,first_name,F,M
0,Aaden,5,4975
1,Aadi,16,933
2,Aadyn,16,555
3,Aalijah,149,244
4,Aaliyah,94707,101
...,...,...,...
11103,Zymir,5,1048
11104,Zyon,661,3026
11105,Zyonn,5,59
11106,Zyree,16,116


In [135]:
# Combining the two DFs created in the two cells above and sorting them alphabetically by first name
# *NOTE: 'pd.concat' replaces 'DataFrame.append', which was removed in pandas 2.0
final_best_df = pd.concat([not_duplicated_final, duplicated_final]).sort_values(by="first_name").fillna(0)

# Changing the 'F' and 'M' frequency columns in the above DF to be an integer datatype rather than a float to get rid of the decimal for each value
final_best_df = final_best_df.astype({"F": int, "M": int})

# Displaying output
final_best_df

Unnamed: 0,first_name,F,M
0,Aaban,0,120
1,Aabha,46,0
2,Aabid,0,16
3,Aabidah,5,0
4,Aabir,0,10
...,...,...,...
89251,Zyvion,0,5
89252,Zyvon,0,7
89253,Zyyanna,6,0
89254,Zyyon,0,6


## Step 2: Create RNF Function
#### Notes:
* Same as in main model

In [136]:
# Defining the 'generate_preds' function with 'df_train' and 'df_test' as its parameters
# *NOTE: 'df_test' is the input DF given by the function user; the function uses the preprocessed data to output a gender prediction for each name within 'df_test'
# *NOTE: 'df_train' is the data preprocessed in Step 1, which the algorithm below uses to calculate the probability of each 'df_test' name being male/female
def generate_preds(df_train, df_test):
    
    # Strips whitespace from all values of the input DF's 'first_name' column to eliminate formatting errors that could keep a name from matching the preprocessed data
    df_test.first_name = df_test.first_name.str.strip()
    # Sorts test values alphabetically by first name (optional; it does not affect the predictions)
    df_test = df_test.sort_values(by='first_name')
    
    # Calculate the probability (as a percentage) of each name in the preprocessed data being male from its male/female frequencies, and store it in a new 'pom' column of the preprocessed DF (final_best_df)
    df_train['pom'] = ((df_train.M)/(df_train.M + df_train.F))*100
    # Calculate the probability (as a percentage) of each name being female in the same way, and store it in a new 'pof' column
    df_train['pof'] = ((df_train.F)/(df_train.M + df_train.F))*100
    
    # Left join of 'df_test' with 'df_train' on the 'first_name' column: every name in 'df_test' is kept and gains the gender probabilities of its match in 'df_train' (if any). Names that appear only in 'df_train' are not added, because the function user only wants predictions for the names they gave the function ('df_test')
    df_merged = df_test.merge(df_train, on='first_name', how='left')
    
    # Instantiating empty list to store values output by below loop
    preds = []
    # Iterating through each row in the merged DF 
    # *NOTE: 'itertuples()' method is used for DF looping efficiency and since it provides ability to access each DF column using dot notation
    for t in df_merged.itertuples():
        # States that if merged DF 'pom'--probability of male-- column is above 50% and 'pof'--probability of female-- column is below 50% to add an 'M' to the 'preds' list because the model has dictated that the gender prediction for the respective name is male 
        if (t.pom > 50 and t.pof < 50):
            preds.append('M')
        # States that if merged DF 'pof'--probability of female-- column is above 50% and 'pom'--probability of male-- column is below 50% to add an 'F' to the 'preds' list because the model has dictated that the gender prediction for the respective name is female
        elif (t.pof > 50 and t.pom < 50):
            preds.append('F')
        # States that if the probability of a name being male/female is 50%-50% to output 'EV'--standing for even-- to the 'preds' list since the respective gender neutral name's gender cannot be predicted using this model due to insufficient data 
        elif (t.pom == 50 and t.pof == 50):
            preds.append('EV')
        # States that if none of the above can be executed to output 'U'--standing for unknown-- to the 'preds' list since the name cannot be predicted by the model due to there being no data on this name (name is not included in 'df_train')
        else:
            preds.append('U')
    # Creating a new column in the 'df_test' DF and assigning it to the 'preds' list which contains all of the gender predictions output by the model
    df_test['gender'] = preds 
    
    # Returns the same DF that the function-user input with another column that contains a gender prediction for each name from the input DF
    return df_test
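
For comparison, the same decision rule can be sketched without a Python loop and without modifying either input frame. `generate_preds_vectorized` and the toy frames below are hypothetical names and invented values; this is an illustrative variant under those assumptions, not the function used in the main model. Unlike the original, it keeps the input row order instead of sorting.

```python
import numpy as np
import pandas as pd


def generate_preds_vectorized(df_train, df_test):
    """Hypothetical loop-free variant of generate_preds (illustration only)."""
    # Work on copies so the caller's frames are left untouched.
    train = df_train[["first_name", "F", "M"]].copy()
    test = df_test[["first_name"]].copy()
    test["first_name"] = test["first_name"].str.strip()

    # Share of male records per name, as a percentage.
    train["pom"] = train["M"] / (train["M"] + train["F"]) * 100

    merged = test.merge(train[["first_name", "pom"]], on="first_name", how="left")

    # Names missing from df_train have pom == NaN, so every comparison
    # is False for them and the default 'U' applies.
    conditions = [merged["pom"] > 50, merged["pom"] < 50, merged["pom"] == 50]
    merged["gender"] = np.select(conditions, ["M", "F", "EV"], default="U")
    return merged[["first_name", "gender"]]


# Toy inputs with invented counts.
toy_train = pd.DataFrame(
    {"first_name": ["Aaban", "Avery", "Robin"], "F": [0, 3000, 50], "M": [120, 100, 50]}
)
toy_test = pd.DataFrame({"first_name": [" Avery", "Aaban", "Robin", "Zorro"]})
toy_preds = generate_preds_vectorized(toy_train, toy_test)
print(toy_preds)
```

Because 'pom' and 'pof' always sum to 100, checking 'pom' alone is enough to reproduce the four outcomes ('M', 'F', 'EV', 'U').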

## Step 3: Create Labeled DF from NLTK Default Name-Gender Supervised Data for RNF Testing/Evaluation
#### Notes
* Created from 'male.txt' and 'female.txt' files that come with NLTK library
* Original data was two separate name lists: one for male and one for female
* The two distinct lists are each given a gender column ("Actual") and then combined into one DataFrame
* Resulting DF will be used to evaluate the RNF function's gender predictions (that will be made later) by using this actual gender data to determine accuracy

In [137]:
# Feature extractor (not used in Part 1; it is redefined in Part 2)
def gender_features(word):
    return {'last_letter': word[-1]}

# Preparing a list of examples and corresponding class labels.
labeled_names = ([(name, 'M') for name in names.words('male.txt')] +
                 [(name, 'F') for name in names.words('female.txt')])

In [138]:
random.shuffle(labeled_names)

In [139]:
test_data_eval = pd.DataFrame(labeled_names, columns=["first_name", "Actual"])
test_data_eval = test_data_eval.sort_values(by='first_name').reset_index(drop=True)
test_data_eval

Unnamed: 0,first_name,Actual
0,Aamir,M
1,Aaron,M
2,Abagael,F
3,Abagail,F
4,Abbe,F
...,...,...
7939,Zorro,M
7940,Zsa Zsa,F
7941,Zsazsa,F
7942,Zulema,F


## Step 4: Extract 'first_name' Column
#### Notes
* Extracting 'first_name' column from DF above and assigning it to a variable
* Variable will be used further along in this analysis

In [140]:
rnf_test = test_data_eval[["first_name"]]
rnf_test

Unnamed: 0,first_name
0,Aamir
1,Aaron
2,Abagael
3,Abagail
4,Abbe
...,...
7939,Zorro
7940,Zsa Zsa
7941,Zsazsa
7942,Zulema


## Step 5: Run RNF Function
#### Notes
* Running the RNF function with the first name data extracted above as the 'df_test' input parameter
* The RNF function uses 'final_best_df' as its 'df_train' input, meaning it uses the gender probabilities calculated from the US Social Security name-gender-frequency data to predict the gender of each name passed via the 'df_test' input parameter, which is the 'rnf_test' variable defined above

In [141]:
rnf_preds_df = generate_preds(final_best_df, rnf_test).reset_index(drop=True)
rnf_preds_df


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


Unnamed: 0,first_name,gender
0,Aamir,M
1,Aaron,M
2,Abagael,F
3,Abagail,F
4,Abbe,F
...,...,...
7939,Zorro,U
7940,Zsa Zsa,U
7941,Zsazsa,F
7942,Zulema,F


## Step 6: Join 'Actual' Column to RNF Function Output
#### Notes
* The 'Actual' column (containing the actual genders of the labeled data) is extracted from the 'test_data_eval' DF to display a side-by-side comparison of the RNF gender predictions vs. their actual genders
* Purpose is for performing accuracy evaluation calculation on the RNF function's predictions ('gender' column below)

In [142]:
actual = test_data_eval["Actual"]
rnf_final = rnf_preds_df.join(actual)
rnf_final

Unnamed: 0,first_name,gender,Actual
0,Aamir,M,M
1,Aaron,M,M
2,Abagael,F,F
3,Abagail,F,F
4,Abbe,F,F
...,...,...,...
7939,Zorro,U,M
7940,Zsa Zsa,U,F
7941,Zsazsa,F,F
7942,Zulema,F,F


#### Notes (cont.)
* As seen below, the RNF function can only output a gender prediction for first names that are contained in the 'df_train' parameter passed to the function
* Names in the 'df_test' parameter that are not in 'df_train' receive a "U" prediction, standing for 'unknown', because the RNF function has no name-gender-frequency data from which to calculate these names' gender probabilities and therefore cannot output a gender prediction

In [143]:
rnf_final.value_counts("gender")

gender
F     4269
M     2627
U     1046
EV       2
dtype: int64

## Step 7: Accuracy Evaluation
#### Notes
* An empty list is declared to hold results
* A for loop is used to iterate through the above DF ('rnf_final')
* Six conditionals are declared that will compare the 'gender' column (RNF gender predictions) to the 'Actual' column (actual genders)
* Each conditional represents a possible outcome of the two columns from 'rnf_final' being compared and each one contains the appropriate subsequent action to accurately document whether the RNF gender prediction was correct or incorrect 
* Each "C" (correct) and "I" (incorrect) is appended to the empty list declared prior 
* Finally this list is appended as a column to the 'rnf_final' DF for convenient further analysis

In [144]:
accuracy_list = []
for row in rnf_final.itertuples():
    if row.gender == "M" and row.Actual == "M":
        accuracy_list.append("C")
    elif row.gender == "M" and row.Actual == "F":
        accuracy_list.append("I")
    elif row.gender == "F" and row.Actual == "F":
        accuracy_list.append("C")
    elif row.gender == "F" and row.Actual == "M":
        accuracy_list.append("I")
    elif row.gender == "U":
        accuracy_list.append("I")
    elif row.gender == "EV":
        accuracy_list.append("I")
rnf_final["Evaluation"] = accuracy_list

In [145]:
rnf_final.replace(to_replace={})

Unnamed: 0,first_name,gender,Actual,Evaluation
0,Aamir,M,M,C
1,Aaron,M,M,C
2,Abagael,F,F,C
3,Abagail,F,F,C
4,Abbe,F,F,C
...,...,...,...,...
7939,Zorro,U,M,I
7940,Zsa Zsa,U,F,I
7941,Zsazsa,F,F,C
7942,Zulema,F,F,C


In [146]:
rnf_final

Unnamed: 0,first_name,gender,Actual,Evaluation
0,Aamir,M,M,C
1,Aaron,M,M,C
2,Abagael,F,F,C
3,Abagail,F,F,C
4,Abbe,F,F,C
...,...,...,...,...
7939,Zorro,U,M,I
7940,Zsa Zsa,U,F,I
7941,Zsazsa,F,F,C
7942,Zulema,F,F,C
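
Since 'Actual' only ever holds "M" or "F", the six-branch loop above reduces to a single equality check: a prediction is correct exactly when it equals the actual gender, and "U" or "EV" can never match. A minimal sketch on a hypothetical toy frame (invented rows):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for 'rnf_final' (rows invented).
toy_final = pd.DataFrame(
    {
        "first_name": ["Aaron", "Abbe", "Zorro", "Robin", "Abbey"],
        "gender": ["M", "F", "U", "EV", "F"],
        "Actual": ["M", "F", "M", "F", "M"],
    }
)

# "C" where the prediction matches the label, "I" everywhere else.
toy_final["Evaluation"] = np.where(
    toy_final["gender"] == toy_final["Actual"], "C", "I"
)
print(toy_final)
```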


## Step 8: RNF Function Accuracy Score Evaluation

In [30]:
rnf_final.value_counts("Evaluation")

Evaluation
C    6292
I    1652
dtype: int64

In [4]:
accuracy_score = (6292/(6292+1652))*100

In [5]:
print(f"When using the US Social Security Data as its base, the relative name frequency model accurately predicted {accuracy_score:.2f}% of the default NLTK name-gender dataset.")

When using the US Social Security Data as its base, the relative name frequency model accurately predicted 79.20% of the default NLTK name-gender dataset.
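
The overall score above counts "U" and "EV" outputs as incorrect. As a rough complement, the counts already printed in this notebook can be split into coverage (how often the function gave an "M"/"F" answer) and accuracy on the names it did answer. The variable names below are illustrative; the numbers are copied from the value counts shown earlier.

```python
# Counts copied from the 'gender' and 'Evaluation' value counts above.
pred_counts = {"F": 4269, "M": 2627, "U": 1046, "EV": 2}
correct = 6292

total = sum(pred_counts.values())
answered = pred_counts["F"] + pred_counts["M"]

overall_accuracy = correct / total * 100    # "U"/"EV" counted as incorrect
coverage = answered / total * 100           # share of names given "M" or "F"
answered_accuracy = correct / answered * 100  # accuracy on answered names only

print(f"Overall accuracy:  {overall_accuracy:.2f}%")
print(f"Coverage:          {coverage:.2f}%")
print(f"Answered accuracy: {answered_accuracy:.2f}%")
```

Separating the two makes it clearer how much of the error comes from names missing in the training data rather than from wrong predictions.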


# Part 2: NLTK Library Naive Bayes Classification Model Testing/Evaluation
## Step 1: Preprocess US Social Security Name-Gender Data for Naive Bayes
#### Notes
* This is the same process used in the main ML model file to preprocess the US Social Security name-gender data so it can be used as training for the Naive Bayes Classification Model
* The US Social Security name-gender data must be preprocessed into the format that the library's documentation specifies for training an NLTK Naive Bayes Classification Model
* For this specific library-model instance, the training data must be a list of tuples, where each tuple contains a Python dictionary whose key is 'last three letters' and whose value is the last three letters of a name being fed to the model for training
 - **Note**: It doesn't have to be the last three letters; this can be modified by the user as desired (our choice of the last three letters is explained in the README)
* Outside the dictionary, but still within the tuple, there is one more value: the target/output variable, which in our case is the gender, represented as either "M" or "F"

In [147]:
# Import dependencies 
import random
from nltk.corpus import names
import nltk
import sklearn
import pandas as pd
import numpy as np

#### Note
* The below DF is the raw data being used to train the NLTK Naive Bayes Classification Model 
* Originally, the data used for the output ("gender") column came from 'final_best_df', which is a list of unique names from the US Social Security data
* Further, the original output ("gender") was the result of running the RNF function on each unique name and its relative male/female frequencies to calculate gender probability and output a prediction
* However, upon further consideration we realized that this was not optimal, although the effect is minute, due to the following:
* The Naive Bayes Classification Model outputs a gender prediction from how often each feature value in its training data occurs with each gender. With the feature extractor used here, the only feature is the last three letters of the name (e.g. if the model is passed "Ben", it looks at how frequent the ending 'Ben' is among male names and among female names). Therefore, if the model is trained on only the lone gender prediction output by the RNF function, rather than seeing each multi-gender name once as female and once as male, it can be argued that it would be missing data that could increase the model's accuracy
* This would only affect the multi-gender name instances, which is 11,108 names out of 110k+ (< 10%)
* Further, this ultimately should not significantly affect the model's accuracy because the RNF model always output the gender prediction that was most probable for each given name based on the male/female frequency data for each name
* In summary, the endings of the multi-gender names were previously associated only with the prediction that was output by the RNF model (since that was used as the target variable). It was decided instead to use the data that includes a separate row for each name-gender combination (Avery, M and Avery, F as two distinct rows) so that the endings of these multi-gender names are associated with both genders. For example, before there was a row that had the following data:
 - First Name: Avery
 - M-Freq: 100
 - F-Freq: 3000
 - RNF Gender Prediction: Female
* In that case the ending 'ery' would get a tally only in the female category within the NLTK algorithm, whereas the new method gives a tally to both the male and female categories since there is now an Avery instance for each, so it is possible that these could add up to change the NLTK algorithm's output/prediction for any given name
* **Update:** After comparing the results of both methods, the newer method provided the NLTK algorithm with one additional correct gender prediction
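
With the 'gender_features' extractor defined in this notebook, each name contributes a single feature: its last three letters. The sketch below tallies that feature per gender under both labeling schemes, using only the standard library and invented rows. It mimics the counting step for illustration and is not NLTK's implementation; `tally` is a hypothetical helper.

```python
from collections import Counter


def gender_features(word):
    return {"last three letters": word[-3:]}


# Scheme A: one row per name, labeled with the RNF prediction only.
rows_rnf_label = [("Avery", "F"), ("Emery", "F")]
# Scheme B: one row per name-gender combination.
rows_both_genders = [("Avery", "F"), ("Avery", "M"), ("Emery", "F")]


def tally(rows):
    """Count how often each (ending, gender) pair occurs (hypothetical helper)."""
    counts = Counter()
    for name, gender in rows:
        ending = gender_features(name)["last three letters"]
        counts[(ending, gender)] += 1
    return counts


tally_a = tally(rows_rnf_label)
tally_b = tally(rows_both_genders)
print(tally_a)
print(tally_b)
```

Under scheme A the ending 'ery' is never seen with "M"; under scheme B it is, which is the difference the note above describes.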

In [148]:
nltk_training_data = sum_frequencies_df[["first_name", "gender"]]
nltk_training_data

Unnamed: 0,first_name,gender
0,Aaban,M
1,Aabha,F
2,Aabid,M
3,Aabidah,F
4,Aabir,M
...,...,...
111467,Zyvion,M
111468,Zyvon,M
111469,Zyyanna,F
111470,Zyyon,M


In [149]:
# Converting the above DF's rows to a list of [first_name, gender] pairs
labeled_names = nltk_training_data.to_numpy().tolist()

# Creating a function that takes a string input and returns a dictionary with a key of 'last three letters' and a value of the last three letters of the input string
# *NOTE: This function and the list comprehension below were taken from the NLTK library's documentation explaining how to use the library
def gender_features(word):
    return {'last three letters': word[-3:]}

# Using a list comprehension with the function created above to return a new list that pairs the last three letters of each name with the gender from each item in 'labeled_names'
featuresets = [(gender_features(n), gender)
               for (n, gender) in labeled_names]

# Displaying output
featuresets

[({'last three letters': 'ban'}, 'M'),
 ({'last three letters': 'bha'}, 'F'),
 ({'last three letters': 'bid'}, 'M'),
 ({'last three letters': 'dah'}, 'F'),
 ({'last three letters': 'bir'}, 'M'),
 ({'last three letters': 'lla'}, 'F'),
 ({'last three letters': 'ada'}, 'F'),
 ({'last three letters': 'dam'}, 'M'),
 ({'last three letters': 'dan'}, 'M'),
 ({'last three letters': 'rsh'}, 'M'),
 ({'last three letters': 'dav'}, 'M'),
 ({'last three letters': 'aya'}, 'F'),
 ({'last three letters': 'den'}, 'F'),
 ({'last three letters': 'den'}, 'M'),
 ({'last three letters': 'esh'}, 'M'),
 ({'last three letters': 'han'}, 'M'),
 ({'last three letters': 'hav'}, 'M'),
 ({'last three letters': 'van'}, 'M'),
 ({'last three letters': 'dhi'}, 'M'),
 ({'last three letters': 'ira'}, 'F'),
 ({'last three letters': 'ran'}, 'M'),
 ({'last three letters': 'vik'}, 'M'),
 ({'last three letters': 'ika'}, 'F'),
 ({'last three letters': 'hya'}, 'F'),
 ({'last three letters': 'yan'}, 'M'),
 ({'last three letters': 

## Step 2: Instantiate NLTK Naive Bayes Classification Model & Train Model Using Preprocessed US Social Security Name-Gender Data

In [150]:
classifier = nltk.NaiveBayesClassifier.train(featuresets)
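
To show what training and querying look like end to end, here is a minimal sketch on a handful of invented names. The toy names and the `toy_` variables are assumptions for illustration; only `NaiveBayesClassifier.train`, `classify`, and `prob_classify` are NLTK calls.

```python
import nltk


def gender_features(word):
    return {"last three letters": word[-3:]}


# Invented toy training data in the same (name, gender) shape as 'labeled_names'.
toy_names = [
    ("Maria", "F"), ("Daria", "F"), ("Victoria", "F"),
    ("Brandon", "M"), ("Landon", "M"), ("Gordon", "M"),
]
toy_featuresets = [(gender_features(n), g) for (n, g) in toy_names]
toy_classifier = nltk.NaiveBayesClassifier.train(toy_featuresets)

# Hard prediction for an unseen name with a known ending.
toy_pred = toy_classifier.classify(gender_features("Gloria"))
print(toy_pred)

# Class probabilities, useful for spotting low-confidence predictions.
toy_dist = toy_classifier.prob_classify(gender_features("Eldon"))
print(toy_dist.prob("M"), toy_dist.prob("F"))
```

Unlike the RNF function, the classifier always returns "M" or "F", even for endings it has never seen, so the probabilities are worth inspecting when a prediction matters.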

## Step 3: Preprocess NLTK Name-Gender Data to Test NLTK Naive Bayes Classification Model
#### Notes
* Although this name-gender data comes with the installation of the NLTK library, it does not come in the required format to be fed into the NLTK NBCM, so preprocessing is necessary
* This is the same data passed to the RNF function via its 'df_test' parameter (same data used to test RNF function)

In [154]:
# Collecting every row of the rnf_final DF as a named tuple via itertuples() for easier looping
# (every row is evaluated as either "C" or "I", so all rows are kept)
accuracy_list_1 = list(rnf_final.itertuples())

In [36]:
accuracy_list_1

[Pandas(Index=0, first_name='Aamir', gender='M', Actual='M', Evaluation='C'),
 Pandas(Index=1, first_name='Aaron', gender='M', Actual='M', Evaluation='C'),
 Pandas(Index=2, first_name='Abagael', gender='F', Actual='F', Evaluation='C'),
 Pandas(Index=3, first_name='Abagail', gender='F', Actual='F', Evaluation='C'),
 Pandas(Index=4, first_name='Abbe', gender='F', Actual='F', Evaluation='C'),
 Pandas(Index=5, first_name='Abbey', gender='F', Actual='F', Evaluation='C'),
 Pandas(Index=6, first_name='Abbey', gender='F', Actual='M', Evaluation='I'),
 Pandas(Index=7, first_name='Abbi', gender='F', Actual='F', Evaluation='C'),
 Pandas(Index=8, first_name='Abbie', gender='F', Actual='F', Evaluation='C'),
 Pandas(Index=9, first_name='Abbie', gender='F', Actual='M', Evaluation='I'),
 Pandas(Index=10, first_name='Abbot', gender='M', Actual='M', Evaluation='C'),
 Pandas(Index=11, first_name='Abbott', gender='M', Actual='M', Evaluation='C'),
 Pandas(Index=12, first_name='Abby', gender='F', Actual='F'

In [157]:
# Extracting only the first names from the list above and storing them in a new list
# *NOTE: the loop variable is called 'row' so it does not overwrite the 'names' corpus imported from NLTK
new_list = []

for row in accuracy_list_1:
    new_list.append(row.first_name)

new_list

['Aamir',
 'Aaron',
 'Abagael',
 'Abagail',
 'Abbe',
 'Abbey',
 'Abbey',
 'Abbi',
 'Abbie',
 'Abbie',
 'Abbot',
 'Abbott',
 'Abby',
 'Abby',
 'Abdel',
 'Abdul',
 'Abdulkarim',
 'Abdullah',
 'Abe',
 'Abel',
 'Abelard',
 'Abigael',
 'Abigail',
 'Abigale',
 'Abner',
 'Abra',
 'Abraham',
 'Abram',
 'Acacia',
 'Ace',
 'Ada',
 'Adah',
 'Adair',
 'Adaline',
 'Adam',
 'Adams',
 'Adara',
 'Addie',
 'Addie',
 'Addis',
 'Adel',
 'Adela',
 'Adelaide',
 'Adele',
 'Adelice',
 'Adelina',
 'Adelind',
 'Adeline',
 'Adella',
 'Adelle',
 'Adena',
 'Adey',
 'Adger',
 'Adi',
 'Adiana',
 'Adina',
 'Aditya',
 'Adlai',
 'Adnan',
 'Adolf',
 'Adolfo',
 'Adolph',
 'Adolphe',
 'Adolpho',
 'Adolphus',
 'Adora',
 'Adore',
 'Adoree',
 'Adorne',
 'Adrea',
 'Adria',
 'Adriaens',
 'Adrian',
 'Adrian',
 'Adriana',
 'Adriane',
 'Adrianna',
 'Adrianne',
 'Adrick',
 'Adrien',
 'Adrien',
 'Adriena',
 'Adrienne',
 'Aeriel',
 'Aeriela',
 'Aeriell',
 'Ag',
 'Agace',
 'Agamemnon',
 'Agata',
 'Agatha',
 'Agathe',
 'Aggi',
 'Aggi

In [162]:
# Applying the 'gender_features' function defined earlier (Part 2, Step 1) to the list of names created above to convert the data into the format required for it to be passed into the NLTK NBCM
i_names_input = [gender_features(n) for n in new_list]

i_names_input

[{'last three letters': 'mir'},
 {'last three letters': 'ron'},
 {'last three letters': 'ael'},
 {'last three letters': 'ail'},
 {'last three letters': 'bbe'},
 {'last three letters': 'bey'},
 {'last three letters': 'bey'},
 {'last three letters': 'bbi'},
 {'last three letters': 'bie'},
 {'last three letters': 'bie'},
 {'last three letters': 'bot'},
 {'last three letters': 'ott'},
 {'last three letters': 'bby'},
 {'last three letters': 'bby'},
 {'last three letters': 'del'},
 {'last three letters': 'dul'},
 {'last three letters': 'rim'},
 {'last three letters': 'lah'},
 {'last three letters': 'Abe'},
 {'last three letters': 'bel'},
 {'last three letters': 'ard'},
 {'last three letters': 'ael'},
 {'last three letters': 'ail'},
 {'last three letters': 'ale'},
 {'last three letters': 'ner'},
 {'last three letters': 'bra'},
 {'last three letters': 'ham'},
 {'last three letters': 'ram'},
 {'last three letters': 'cia'},
 {'last three letters': 'Ace'},
 {'last three letters': 'Ada'},
 {'last 

In [163]:
# Creating a new list variable to store results
# Looping through every test name's feature dictionary from the above list and calling the classifier object's (NLTK NBCM) classify method to generate a gender prediction for each of the test names
i_names_pred = []
for features in i_names_input:
    pred = classifier.classify(features)
    i_names_pred.append(pred)

i_names_pred

['M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'F',
 'F',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'F',
 'F',
 'F',
 'F',
 'M',
 'F',
 'M',
 'M',
 'F',
 'F',
 'F',
 'M',
 'M',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'F',
 'M',
 'F',
 'F',
 'M',
 'M',
 'M',
 'F',
 'F',
 'F',
 'F',
 'M',
 'M',
 'M',
 'F',
 'F',
 'M',
 'F',
 'M',
 'F',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'F',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'M',
 'M',
 'F',
 'F',
 'M',
 'F',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'M',
 'M',
 'F',
 'M',
 'M',
 'F',
 'F',
 'F',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'F',
 'F',
 'F'

## Step 4: Creating New DF with Test First Names and Associated NLTK NBCM Predictions
#### Notes
* The NLTK Naive Bayes Classification Model was trained using the US Social Security name-gender data rather than the default NLTK name-gender data because the default NLTK name-gender data is being used for testing. If the model were trained on the testing data, its evaluation would likely not be valid, since any overfitting would go undetected when the model is scored on names it has already seen

In [164]:
mega_list_df = pd.DataFrame(data=[new_list, i_names_pred]).T.rename(columns={0: "first_name", 1: "NLTK_gender"})
mega_list_df

Unnamed: 0,first_name,NLTK_gender
0,Aamir,M
1,Aaron,M
2,Abagael,M
3,Abagail,M
4,Abbe,F
...,...,...
7939,Zorro,M
7940,Zsa Zsa,F
7941,Zsazsa,F
7942,Zulema,F


## Step 5: Adding to Above DF to Contribute to Final Results DF: 'mega_list_df'
#### Notes
* Various columns from the 'rnf_final' DF are appended to 'mega_list_df' for final overall model evaluation purposes

In [165]:
mega_list_df["RNF_gender"] = rnf_final["gender"]
mega_list_df["Actual"] = rnf_final["Actual"]
mega_list_df["RNF_eval"] = rnf_final["Evaluation"]
mega_list_df

Unnamed: 0,first_name,NLTK_gender,RNF_gender,Actual,RNF_eval
0,Aamir,M,M,M,C
1,Aaron,M,M,M,C
2,Abagael,M,F,F,C
3,Abagail,M,F,F,C
4,Abbe,F,F,F,C
...,...,...,...,...,...
7939,Zorro,M,U,M,I
7940,Zsa Zsa,F,U,F,I
7941,Zsazsa,F,F,F,C
7942,Zulema,F,F,F,C


#### Notes
* Since the current 'RNF_eval' column marks every prediction that is not correct as "I" (wrong "M"/"F" predictions as well as "U" and "EV"), filtering this column on "I" will not suffice when creating our final hybrid model predictions
* This is because although we have the luxury of knowing whether each RNF function gender prediction was correct right now, we do not have that luxury when running the main ML model since it is run using unsupervised data (Ontario Sunshine List genders are not known; the whole point of our model is to generate these predictions as accurately as possible)
* Therefore, to get the most accurate and honest evaluation of our model, it was decided that we must only take the NLTK-NBCM ('NLTK_gender' column below) predictions for the first names that the RNF model could not predict ("U" and "EV" predictions), because this is how it is done in the main model due to the use of unsupervised data
* To do this, another column must be added that classifies the RNF predictions in more detail, as seen below ('RNF_Better_Eval')
* Essentially, the for loop below further classifies the RNF gender predictions so we know which rows to take the NLTK-NBCM predictions for in creating the final hybrid model's predictions (remember, the real model passes only the first names that could not be predicted by the RNF model, i.e. "U" or "EV", to the NLTK-NBCM)

In [168]:
# Classifying the RNF predictions in more detail (see the notes above) so we know which rows take the NLTK-NBCM prediction when creating the final hybrid model's predictions
rnf_u_names = []
for row in mega_list_df.itertuples():
    if row.RNF_gender == "U":
        rnf_u_names.append("RNF-Unknown")
    elif row.RNF_gender in ("M", "F") and row.RNF_eval == "I":
        rnf_u_names.append("RNF-Incorrect")
    elif row.RNF_gender in ("M", "F") and row.RNF_eval == "C":
        rnf_u_names.append("RNF-Correct")
    else:
        rnf_u_names.append("RNF-EV")
        
mega_list_df["RNF_Better_Eval"] = rnf_u_names

mega_list_df

Unnamed: 0,first_name,NLTK_gender,RNF_gender,Actual,RNF_eval,RNF_Better_Eval
0,Aamir,M,M,M,C,RNF-Correct
1,Aaron,M,M,M,C,RNF-Correct
2,Abagael,M,F,F,C,RNF-Correct
3,Abagail,M,F,F,C,RNF-Correct
4,Abbe,F,F,F,C,RNF-Correct
...,...,...,...,...,...,...
7939,Zorro,M,U,M,I,RNF-Unknown
7940,Zsa Zsa,F,U,F,I,RNF-Unknown
7941,Zsazsa,F,F,F,C,RNF-Correct
7942,Zulema,F,F,F,C,RNF-Correct
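
The classification loop above can also be written without iterating row by row. The sketch below is one possible vectorized version using `numpy.select`. The `toy_df` frame and its five rows are invented for illustration and are not part of the test data, and the way `RNF_eval` is rebuilt here ("C" only when the RNF prediction matches the actual gender) is an assumption based on the notes above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for 'mega_list_df' (values invented for illustration only)
toy_df = pd.DataFrame({
    "first_name": ["Aaron", "Abbe", "Zorro", "Jordan", "Kim"],
    "RNF_gender": ["M", "F", "U", "EV", "M"],
    "Actual":     ["M", "F", "M", "M", "F"],
})

# Assumption: "C" only when the RNF prediction matches the actual gender, else "I"
toy_df["RNF_eval"] = np.where(toy_df["RNF_gender"] == toy_df["Actual"], "C", "I")

# Conditions are checked in order, mirroring the if / elif chain of the loop
is_valid = toy_df["RNF_gender"].isin(["M", "F"])
conditions = [
    toy_df["RNF_gender"] == "U",
    is_valid & (toy_df["RNF_eval"] == "I"),
    is_valid & (toy_df["RNF_eval"] == "C"),
]
labels = ["RNF-Unknown", "RNF-Incorrect", "RNF-Correct"]

# Rows that match none of the conditions fall through to "RNF-EV"
toy_df["RNF_Better_Eval"] = np.select(conditions, labels, default="RNF-EV")
print(toy_df)
```

On a frame of roughly 8,000 rows either approach is fast enough; the vectorized form mainly keeps the four rules visible in one place.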


In [169]:
mega_list_df.value_counts("RNF_Better_Eval")

RNF_Better_Eval
RNF-Correct      6292
RNF-Unknown      1046
RNF-Incorrect     604
RNF-EV              2
dtype: int64
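
To put these counts in proportion, the short sketch below converts them to shares of the test names. The counts are copied from the output above; the variable names are only illustrative.

```python
# Counts copied from the 'RNF_Better_Eval' value counts above
rnf_counts = {
    "RNF-Correct": 6292,
    "RNF-Unknown": 1046,
    "RNF-Incorrect": 604,
    "RNF-EV": 2,
}
total_names = sum(rnf_counts.values())  # 7944 test names in total

# Print each label with its count and its share of all test names
for label, count in rnf_counts.items():
    share = count / total_names
    print(f"{label:14}{count:6d}{share:9.2%}")

# Names the hybrid model hands over to the NLTK-NBCM ("U" and "EV" predictions)
fallback_names = rnf_counts["RNF-Unknown"] + rnf_counts["RNF-EV"]
print(f"Passed to NLTK-NBCM: {fallback_names} ({fallback_names / total_names:.2%})")
```

The 'RNF-Correct' share is the RNF function's stand-alone accuracy, since "U" and "EV" predictions are counted as incorrect in 'RNF_eval'.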

# Step 6: Generating Final Hybrid Model Gender Predictions for Test Names Data
#### Notes
* Declaring an empty list variable to hold results
* Looping through 'mega_list_df'
* Applying conditionals so that if the RNF function output a valid gender prediction (correct or incorrect), that gender is appended to the list; if the RNF function output "U" or "EV", the NLTK gender prediction is appended instead
* This is done because, in our ML model, only the names given an RNF prediction of "U" or "EV" are passed to the NLTK Naive Bayes Classifier; the incorrect RNF predictions are not. When we use our model to predict the gender of the unique first names in the Sunshine List, we cannot tell which RNF predictions are correct or incorrect because the true gender is not given. Replicating that process here makes the evaluation more representative of our actual ML model and therefore more credible
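
The fallback rule in the notes above can be sketched in a vectorized form as follows. This is a minimal illustration on an invented `toy_df`, not the notebook's own cell, which uses a for loop.

```python
import pandas as pd

# Toy stand-in for 'mega_list_df' (values invented for illustration only)
toy_df = pd.DataFrame({
    "first_name":  ["Aaron", "Abbe", "Zorro", "Jordan"],
    "NLTK_gender": ["M", "F", "M", "F"],
    "RNF_gender":  ["M", "F", "U", "EV"],
})

# Keep the RNF prediction when it is a valid gender ("M" or "F");
# otherwise ("U" or "EV") fall back to the NLTK-NBCM prediction
rnf_is_valid = toy_df["RNF_gender"].isin(["M", "F"])
toy_df["hybrid_preds"] = toy_df["RNF_gender"].where(rnf_is_valid, toy_df["NLTK_gender"])
print(toy_df)
```

Note that the rule never looks at the actual gender, which is what makes it applicable to the Sunshine List names.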

In [170]:
final_hybrid_gender_preds = []

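for row in mega_list_df.itertuples():
    # Valid RNF prediction (whether correct or not): keep the RNF gender
    if (row.RNF_Better_Eval == "RNF-Correct") or (row.RNF_Better_Eval == "RNF-Incorrect"):
        final_hybrid_gender_preds.append(row.RNF_gender)
    # No usable RNF prediction ("U" or "EV"): fall back to the NLTK-NBCM gender
    elif (row.RNF_Better_Eval == "RNF-Unknown") or (row.RNF_Better_Eval == "RNF-EV"):
        final_hybrid_gender_preds.append(row.NLTK_gender)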

final_hybrid_gender_preds

['M',
 'M',
 'F',
 'F',
 'F',
 ...]

In [171]:
mega_list_df["hybrid_preds"] = final_hybrid_gender_preds
mega_list_df

Unnamed: 0,first_name,NLTK_gender,RNF_gender,Actual,RNF_eval,RNF_Better_Eval,hybrid_preds
0,Aamir,M,M,M,C,RNF-Correct,M
1,Aaron,M,M,M,C,RNF-Correct,M
2,Abagael,M,F,F,C,RNF-Correct,F
3,Abagail,M,F,F,C,RNF-Correct,F
4,Abbe,F,F,F,C,RNF-Correct,F
...,...,...,...,...,...,...,...
7939,Zorro,M,U,M,I,RNF-Unknown,M
7940,Zsa Zsa,F,U,F,I,RNF-Unknown,F
7941,Zsazsa,F,F,F,C,RNF-Correct,F
7942,Zulema,F,F,F,C,RNF-Correct,F


In [172]:
# Compare each hybrid prediction with the actual gender: "C" = correct, "I" = incorrect
final_accuracy = []
for row in mega_list_df.itertuples():
    if row.Actual == "M" and row.hybrid_preds == "M":
        final_accuracy.append("C")
    elif row.Actual == "M" and row.hybrid_preds == "F":
        final_accuracy.append("I")
    elif row.Actual == "F" and row.hybrid_preds == "F":
        final_accuracy.append("C")
    elif row.Actual == "F" and row.hybrid_preds == "M":
        final_accuracy.append("I")

mega_list_df["hybrid_eval"] = final_accuracy
mega_list_df.value_counts("hybrid_eval")     

hybrid_eval
C    7112
I     832
dtype: int64

In [173]:
# Counts taken from the 'hybrid_eval' value counts above (7112 correct, 832 incorrect)
hybrid_accuracy_score = (7112)/(7112+832)*100
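
The evaluation and accuracy score above can also be derived directly from a column comparison, which avoids retyping the counts by hand. The sketch below shows the idea on an invented `toy_df`; the names and values are illustrative assumptions only.

```python
import pandas as pd

# Toy data (invented): actual genders and hybrid predictions
toy_df = pd.DataFrame({
    "Actual":       ["M", "F", "F", "M", "F"],
    "hybrid_preds": ["M", "F", "M", "M", "F"],
})

# True where the hybrid prediction matches the actual gender
is_correct = toy_df["Actual"] == toy_df["hybrid_preds"]
toy_df["hybrid_eval"] = is_correct.map({True: "C", False: "I"})

# The mean of a boolean Series is the fraction of True values
toy_accuracy = is_correct.mean() * 100
print(f"Toy accuracy: {toy_accuracy:.2f}%")

# Cross-tabulation of actual vs. predicted gender (a simple confusion matrix)
print(pd.crosstab(toy_df["Actual"], toy_df["hybrid_preds"]))
```

The cross-tabulation additionally shows whether errors are concentrated in one gender, which a single accuracy figure hides.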

In [174]:
print(f"Once all of the 'U' or 'EV' gender predictions (Unknown / even-probability RNF predictions, both of which were evaluated as 'I', i.e. 'Incorrect') are extracted from the RNF output and passed to the NLTK Naive Bayes Classification Model, the overall hybrid model's accuracy score increases from ~79.2% to {hybrid_accuracy_score:.2f}%.")

Once all of the 'U' or 'EV' gender predictions (Unknown / even-probability RNF predictions, both of which were evaluated as 'I', i.e. 'Incorrect') are extracted from the RNF output and passed to the NLTK Naive Bayes Classification Model, the overall hybrid model's accuracy score increases from ~79.2% to 89.53%.
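
The counts reported so far also show how much of the improvement comes from the NLTK-NBCM fallback. The following is a small worked calculation, assuming the counts above; the variable names are illustrative. It relies on the fact that the hybrid prediction equals the RNF prediction for every row with a valid RNF gender.

```python
# Counts copied from the outputs above
rnf_correct = 6292       # 'RNF-Correct'
rnf_incorrect = 604      # 'RNF-Incorrect'
fallback_total = 1046 + 2  # 'RNF-Unknown' + 'RNF-EV', all passed to the NLTK-NBCM
hybrid_correct = 7112    # 'hybrid_eval' == "C"
hybrid_incorrect = 832   # 'hybrid_eval' == "I"

# Rows with a valid RNF prediction keep it, so any extra correct (or incorrect)
# hybrid predictions must come from the names passed to the NLTK-NBCM
fallback_correct = hybrid_correct - rnf_correct
fallback_incorrect = hybrid_incorrect - rnf_incorrect

# Sanity check: the two fallback outcomes account for every fallback name
assert fallback_correct + fallback_incorrect == fallback_total

print(f"NLTK-NBCM correct on fallback names: {fallback_correct} of {fallback_total} "
      f"({fallback_correct / fallback_total:.2%})")
```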


# Displaying Final Hybrid ML Model DF

In [190]:
# Dropping the prior (more high-level) 'RNF_eval' column used in part 1 to determine how many of the RNF Function's predictions were correct because we now have the more telling 'RNF_Better_Eval' column
# Displaying the final DF which presents all relevant columns including the final hybrid predictions and their evaluations
mega_list_df_final = mega_list_df.drop(["RNF_eval"], axis=1)
column_names_final = ["first_name", "Actual", "RNF_gender", "NLTK_gender", "RNF_Better_Eval", "hybrid_preds", "hybrid_eval"]
mega_list_df_final = mega_list_df_final.reindex(columns=column_names_final)
mega_list_df_final

Unnamed: 0,first_name,Actual,RNF_gender,NLTK_gender,RNF_Better_Eval,hybrid_preds,hybrid_eval
0,Aamir,M,M,M,RNF-Correct,M,C
1,Aaron,M,M,M,RNF-Correct,M,C
2,Abagael,F,F,M,RNF-Correct,F,C
3,Abagail,F,F,M,RNF-Correct,F,C
4,Abbe,F,F,F,RNF-Correct,F,C
...,...,...,...,...,...,...,...
7939,Zorro,M,U,M,RNF-Unknown,M,C
7940,Zsa Zsa,F,U,F,RNF-Unknown,F,C
7941,Zsazsa,F,F,F,RNF-Correct,F,C
7942,Zulema,F,F,F,RNF-Correct,F,C
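
As a side note on the cell above, `reindex(columns=...)` keeps only the columns that are listed, so a column left out of the list is dropped in the same step. The sketch below shows this on a one-row `toy_df` that is invented for illustration.

```python
import pandas as pd

# One-row toy frame (invented) with the same kind of columns as 'mega_list_df'
toy_df = pd.DataFrame({
    "first_name":  ["Aaron"],
    "NLTK_gender": ["M"],
    "RNF_gender":  ["M"],
    "Actual":      ["M"],
    "RNF_eval":    ["C"],
})

# 'RNF_eval' is not listed, so reindex both reorders and drops it
final_columns = ["first_name", "Actual", "RNF_gender", "NLTK_gender"]
toy_final = toy_df.reindex(columns=final_columns)
print(toy_final.columns.tolist())
```

Keeping the explicit `drop` call, as the notebook does, is still a reasonable choice because it documents the intent.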


# NLTK Naive Bayes Classification Model Accuracy on Its Own

In [181]:
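# Compare each NLTK-NBCM prediction with the actual gender: "C" = correct, "I" = incorrect
final_accuracy_1 = []
for row in mega_list_df.itertuples():
    if row.Actual == "M" and row.NLTK_gender == "M":
        final_accuracy_1.append("C")
    elif row.Actual == "M" and row.NLTK_gender == "F":
        final_accuracy_1.append("I")
    elif row.Actual == "F" and row.NLTK_gender == "F":
        final_accuracy_1.append("C")
    elif row.Actual == "F" and row.NLTK_gender == "M":
        final_accuracy_1.append("I")

# Work on a copy so that adding 'NLTK_eval' does not also modify 'mega_list_df_final'
NLTK_mega_list_df = mega_list_df_final.copy()
NLTK_mega_list_df["NLTK_eval"] = final_accuracy_1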
NLTK_mega_list_df

Unnamed: 0,first_name,NLTK_gender,RNF_gender,Actual,RNF_Better_Eval,hybrid_preds,hybrid_eval,NLTK_eval
0,Aamir,M,M,M,RNF-Correct,M,C,C
1,Aaron,M,M,M,RNF-Correct,M,C,C
2,Abagael,M,F,F,RNF-Correct,F,C,I
3,Abagail,M,F,F,RNF-Correct,F,C,I
4,Abbe,F,F,F,RNF-Correct,F,C,C
...,...,...,...,...,...,...,...,...
7939,Zorro,M,U,M,RNF-Unknown,M,C,C
7940,Zsa Zsa,F,U,F,RNF-Unknown,F,C,C
7941,Zsazsa,F,F,F,RNF-Correct,F,C,C
7942,Zulema,F,F,F,RNF-Correct,F,C,C


In [185]:
NLTK_mega_list_df.value_counts("NLTK_eval")

NLTK_eval
C    6373
I    1571
dtype: int64

In [186]:
NLTK_accuracy_score = (6373) / (6373+1571)*100
NLTK_accuracy_score

80.2240684793555

In [187]:
print(f"The accuracy score of the NLTK Naive Bayes Classification Model alone, when trained with the same training data fed into the RNF function (US Social Security Data), is {NLTK_accuracy_score:.2f}%. That is only about 1 percentage point better than the RNF function alone when trained with the same data.")

The accuracy score of the NLTK Naive Bayes Classification Model alone, when trained with the same training data fed into the RNF function (US Social Security Data), is 80.22%. That is only about 1 percentage point better than the RNF function alone when trained with the same data.
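
The three accuracy scores reported in this notebook can be placed side by side as follows. The counts are copied from the outputs above; the dictionary keys and variable names are illustrative.

```python
total_names = 7944  # number of test names

# Correct predictions per model, copied from the value counts above
correct_counts = {
    "RNF function alone": 6292,  # "U" and "EV" predictions count as incorrect
    "NLTK-NBCM alone": 6373,
    "Hybrid model": 7112,
}

# Accuracy in percent for each model
accuracy = {model: count / total_names * 100 for model, count in correct_counts.items()}
for model, score in accuracy.items():
    print(f"{model:20}{score:6.2f}%")

# Differences are expressed in percentage points, not relative percent
gap_nltk_vs_rnf = accuracy["NLTK-NBCM alone"] - accuracy["RNF function alone"]
gain_hybrid_vs_nltk = accuracy["Hybrid model"] - accuracy["NLTK-NBCM alone"]
print(f"NLTK-NBCM vs RNF: {gap_nltk_vs_rnf:.2f} percentage points")
print(f"Hybrid vs NLTK-NBCM: {gain_hybrid_vs_nltk:.2f} percentage points")
```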
