# Naive Bayes for Gender Prediction Classifier

We are given USA baby names, approx 250k female and male names<br>
Now, given unknown name to us, predict whether it's a boy or a girl name

Dataset:<br>
<b>Taken from Back4App, use fake_application option and generate master-key and application-id</b><br>
<b>Pull Data from their API as shown in the code</b><br>

We will be using Naive Bayes to solve this problem. Enough though accuracy is not much great but from this notebook
we can learn how to build Naive Classifier from scratch for binary classification.

In [None]:
'''
Compare two values and who so ever is greater tag it as male or female and check the accuracy on test dataset

P(Male | Name) = P(Name | Male)*P(Male) / P(Name)
               = P(n|male) * P(a|Male) * P(m|Male) * P(e|Male) * P(Male)
               = log(P(n|male)) + log(P(a|Male)).....+log(P(Male))
        
P(Female | Name) = Same fashion calculation

As P(Name) will be coming for both male and female class, we don't cal at all. It acts as a normalizer
'''

# Data Gathering

In [42]:
import json
import urllib
import requests
import pandas as pd

url = 'https://parseapi.back4app.com/classes/Complete_List_Names?count=1&limit=250000'
headers = {
    'X-Parse-Application-Id': 'paste your hascode api key', # This is the fake app's application id
    'X-Parse-Master-Key': 'paste your hashcode master-key' # This is the fake app's readonly master key
}
data = json.loads(requests.get(url, headers=headers).content.decode('utf-8')) # Here you have the data that you need

# Data Processing 

In [165]:
#converting dict to pandas
df = pd.DataFrame.from_dict(data['results'])

#all lowercase conversion
df['Name'] = df['Name'].apply(lambda x: x.lower())

#dropping duplicates if any
df = df.drop_duplicates()

#filtering required columns
df_final = df[['Name','Gender']]

#shuffling data
df_final = df_final.sample(frac=1).reset_index(drop=True)

#splitting into train and test
test_ratio = 0.3
train_idx = int(test_ratio * len(df_final))
df_train,df_test = df_final.iloc[:train_idx],df_final.iloc[train_idx:]

# Feature Engineering

- In our case we are generating the distribution of each character for both male and female class
- This will help in knowing which characters are seen more freq in male / female names  

In [172]:
from collections import defaultdict


#responsible for storing character count for each male and female class, 0:female, 1:male
dict_mapper = {0:defaultdict(int),1:defaultdict(int)}

def letter_count(string,pos):
    for i in string:
        dict_mapper[pos][i] +=1

for i in range(len(df_train)):
    string = df_train.iloc[i]['Name']
    gender = df_train.iloc[i]['Gender']
    letter_count(string.lower(),1 if gender == 'male' else 0)

# Priors Calculation

In [173]:
def priors_cal(df_final):
    
    mf_data = df_final['Gender'].value_counts()
    p_male = mf_data['male'] / len(df_final)
    p_female = mf_data['female'] / len(df_final)
    
    return p_male,p_female,mf_data

# Likelihood Calculation

In [174]:
import math

def output_probs(name,letter_count,pos,prob):
    
    sum_prob = prob
    '''
    reason for adding 1 is called Laplace smoothing if we do not see the character in male or female class then
    prob will be 0 and log(0) will raise an error
    
    This is unlikely to happen in character counts but think about spam filtering case where test words are not present 
    training dataset
    '''
    for i in name.lower():
        sum_prob += math.log(dict_mapper[pos][i] + 1 / (letter_count+26))
    
    
    return sum_prob


def accuracy(female_letter_count,male_letter_count,p_male,p_female,df_test):
    
    
    correct = 0
    for i in range(len(df_test)):
        string = df_final.iloc[i]['Name'].lower()
        gender = df_final.iloc[i]['Gender']
        
        p_name_female = output_probs(string,female_letter_count,0,p_female)
        p_name_male = output_probs(string,male_letter_count,1,p_male)

        if(p_name_male > p_name_female):
            
            if(gender == 'male'):
                correct +=1
        
        else:
            if(gender == 'female'):
                correct +=1
                
        
    return correct / len(df_test) * 100


p_male,p_female,mf_data = priors_cal(df_train)
female_letter_count,male_letter_count = sum(list(dict_mapper[0].values())),sum(list(dict_mapper[1].values()))


#Using test dataset to check the accuracy of our model
accuracy(female_letter_count,male_letter_count,p_male,p_female,df_test)

66.34914285714287

# Improvements Thoughts
You can see, the accuracy is not so great<br>
Can you think of different features to solve this problem?<br>
For instance instead of counting each character, can we calculate the pair count ? 'Darshan' --> (d,a), (a,r) ...
and apply the same logic. This is called 2-gram technique, where we are generating pairs of length 2 and solving the problem<br>