<h1>Bayesian Modeling</h1>

<p>This project uses a Na&iuml;ve Bayes classifier to predict, from a baby name alone, whether the name is more likely to be given to a girl or a boy. The classifier relies on features commonly found in names, namely short prefixes and suffixes.</p>

<h3>Python Initialization</h3>

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split

<h3>The <code>hash_features</code> and <code>names2features</code> functions:</h3>
<p>The <code>hash_features</code> and <code>names2features</code> functions below convert plain-text names into fixed-length binary feature vectors that the classifier can work with.</p>

In [2]:
def hash_features(baby: str, b: int, fix: int):
    """
    This function takes a baby name and converts it into a feature vector.
    
    Inputs:
    ------
    baby : a string representing the baby's name to be hashed
    b    : the number of dimensions to be in the feature vector
    fix  : the number of chunks to extract and hash from each string
    
    Outputs:
    -------
    v : a feature vector representing the input string
    """
    v = np.zeros(b)
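    for m in range(fix):
        # Hash the first m characters; for m = 0 this is the bare 'prefix' tag.
        feature_string = 'prefix' + baby[:m]
        v[hash(feature_string) % b] = 1
        # Hash the last m characters; note that baby[-0:] is the whole name.
        feature_string = 'suffix' + baby[-m:]
        v[hash(feature_string) % b] = 1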
        
    return v
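
<p>One caveat worth noting: in Python 3, the built-in <code>hash</code> of a string is salted per process unless <code>PYTHONHASHSEED</code> is set, so the same name can map to different indices in different sessions. The sketch below is one possible workaround, not part of the original project: it swaps in <code>zlib.crc32</code> as a deterministic hash. The name <code>stable_hash_features</code> is hypothetical.</p>

```python
import zlib

import numpy as np


def stable_hash_features(baby: str, b: int, fix: int) -> np.ndarray:
    """Variant of hash_features using a deterministic hash (assumption: CRC32)."""
    v = np.zeros(b)
    for m in range(fix):
        for tag, chunk in (('prefix', baby[:m]), ('suffix', baby[-m:])):
            # crc32 returns the same unsigned integer in every Python process
            index = zlib.crc32((tag + chunk).encode('utf-8')) % b
            v[index] = 1
    return v


v = stable_hash_features('Sarah', b=128, fix=3)
print(int(v.sum()))  # number of active features, at most 2 * fix
```

<p>With a deterministic hash, probabilities estimated in one session stay valid for feature vectors computed in another.</p>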

In [3]:
def names2features(filename: str, b: int=128, fix: int=3, usefile: bool=True):
    """
    This function takes a file of baby names and generates feature vectors.
    
    Inputs:
    ------
    filename : the filepath string or names to use
    b        : the number of dimensions to be in the feature vector
    fix      : the number of chunks to extract and hash from each string
    usefile  : whether to load a file or use a name directly
    
    Outputs:
    -------
    x : n feature vectors of dimension (nxb)
    """
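    if usefile:
        with open(filename, 'r') as f:
            # keep one name per line, skipping blank lines
            baby_names = [line.rstrip() for line in f if line.strip()]
    else:
        # treat the argument as newline-separated names rather than a path
        baby_names = filename.split('\n')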
    
    n = len(baby_names)
    x = np.zeros((n, b))
    for i in range(n):
        x[i,:] = hash_features(baby_names[i], b, fix)
        
    return x

<p>In the code cell above, <code>names2features</code> reads every name in the given file and converts it into a <code>b</code>-dimensional binary feature vector (128 dimensions by default). For each name, <code>hash_features</code> assembles prefix and suffix substrings (their number is controlled by the parameter <code>fix</code>), hashes each substring, and sets the entry at index hash value modulo <code>b</code> to 1.</p>
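
<p>To make the substring step concrete, the sketch below lists the strings that would be hashed for one name. The helper <code>feature_strings</code> is hypothetical and only mirrors the slicing in <code>hash_features</code>; it does not hash anything.</p>

```python
def feature_strings(baby: str, fix: int = 3):
    """List the tagged substrings that hash_features would hash (illustrative helper)."""
    strings = []
    for m in range(fix):
        strings.append('prefix' + baby[:m])   # first m characters
        strings.append('suffix' + baby[-m:])  # last m characters (whole name when m = 0)
    return strings


print(feature_strings('Sarah'))
# ['prefix', 'suffixSarah', 'prefixS', 'suffixh', 'prefixSa', 'suffixah']
```

<p>Note that for <code>m = 0</code> the prefix chunk is empty while the suffix chunk is the entire name, because <code>baby[-0:]</code> is the same slice as <code>baby[0:]</code>.</p>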

<h3>The <code>generate_features</code> function:</h3>
<p>The Python function <code>generate_features</code> below transforms the girl and boy names into feature vectors, attaches the labels, and returns both in shuffled order.</p>

In [4]:
def generate_features(dim: int=128):
    """
    Inputs:
    ------
    dim : desired dimension of the features
        
    Outputs:
    -------
    X : n feature vectors of dimensionality d (nxd)
    y : n labels (-1 = girl, +1 = boy) (n)
    """
    girl_names = names2features('../data/girls_train.csv', b=dim)
    boy_names = names2features('../data/boys_train.csv', b=dim)
    
    X = np.concatenate([girl_names, boy_names])
    y = np.concatenate([-np.ones(len(girl_names)), np.ones(len(boy_names))])
    
    # shuffle samples and labels with the same random permutation
    ii = np.random.permutation(len(y))
    
    return X[ii,:], y[ii]
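
<p>The shuffling step relies on indexing <code>X</code> and <code>y</code> with the same permutation, which keeps every feature vector paired with its label. A minimal sketch on toy data (the arrays are made up for illustration, and the seeded <code>np.random.default_rng</code> generator is an assumption added for repeatability):</p>

```python
import numpy as np

# Toy stand-ins: two "girl" rows (all zeros) and three "boy" rows (all ones)
girl_rows = np.zeros((2, 4))
boy_rows = np.ones((3, 4))

X = np.concatenate([girl_rows, boy_rows])
y = np.concatenate([-np.ones(len(girl_rows)), np.ones(len(boy_rows))])

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable
ii = rng.permutation(len(y))     # one permutation shared by X and y
X_shuffled, y_shuffled = X[ii, :], y[ii]

# Every all-ones row still carries the label +1, every all-zeros row -1
print(np.all((X_shuffled[:, 0] == 1) == (y_shuffled == 1)))
```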

<h2>The Na&iuml;ve Bayes Classifier</h2>

<p>The Na&iuml;ve Bayes classifier is a linear classifier based on Bayes' rule. The following cells walk through the steps needed to implement it.</p>

<h3>Part One: Class Probability</h3>

<p>The function <b><code>naive_bayes_prior</code></b> estimates the class probability $P(y)$, that is, the probability that a sample in the training set is positive or negative, independent of its features. It applies plus-one smoothing by adding one artificial sample per class.</p>

In [5]:
def naive_bayes_prior(X, y):
    """
    This function calculates the prior probability.
    This is the probability that a sample in the training
    set is positive or negative, independent of the features.
    
    Inputs:
    ------
    X : n input vectors of d dimensions (nxd)
    y : n labels (-1 or +1) (n)
    
    Outputs:
    -------
    pos : probability P(y=1)
    neg : probability P(y=-1)
    """
    # plus-one smoothing: add one artificial sample of each class
    y = np.concatenate([y, [-1, 1]])
    n = len(y)
    pos = np.sum(y == 1) / n
    
    return pos, 1 - pos
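
<p>As a small worked example of the smoothing (toy labels, not the project's data): with three positive labels and one negative label, adding one artificial sample per class gives $P(y=1) = (3+1)/(4+2) = 2/3$.</p>

```python
import numpy as np

y = np.array([1, 1, 1, -1])                  # toy labels: three positive, one negative
y_smoothed = np.concatenate([y, [-1, 1]])    # one artificial sample per class

pos = np.mean(y_smoothed == 1)   # (3 + 1) / (4 + 2)
neg = 1 - pos                    # (1 + 1) / (4 + 2)
print(pos, neg)
```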

<h3>Part Two: Conditional Probability</h3>

<p>The function <b><code>naive_bayes_conditional</code></b> estimates the conditional probabilities $P([\mathbf{x}]_{\alpha}|y)$. By construction, the features are binary, so each feature is modeled with a <b>categorical</b> distribution over the values 0 and 1. The function returns two vectors of length $d$ (the number of features), one per class label, whose entry $\alpha$ is the probability that feature $\alpha$ equals 1 given that label.</p>

In [6]:
def naive_bayes_conditional(X, y):
    """
    This function computes the conditional probability of each feature.
    
    Inputs:
    ------
    X : n input vectors of d dimensions (nxd)
    y : n labels (-1 or +1) (n)
    
    Outputs:
    -------
    pos_prob : probability vector of p(x_alpha = 1 | y = 1) (d)
    neg_prob : probability vector of p(x_alpha = 1 | y = -1) (d)
    """
    # plus-one smoothing: give each class one all-ones and one all-zeros sample
    n, d = X.shape
    X = np.concatenate([X, np.ones((2, d)), np.zeros((2, d))])
    y = np.concatenate([y, [-1, 1, -1, 1]])
    
    pos = X[y == 1]
    neg = X[y == -1]
    
    pos_count = len(pos)
    neg_count = len(neg)
    
    pos_feature_count = np.sum(pos, axis=0)
    neg_feature_count = np.sum(neg, axis=0)
    
    pos_prob = pos_feature_count / pos_count
    neg_prob = neg_feature_count / neg_count
    
    return pos_prob, neg_prob
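
<p>Appending one all-ones and one all-zeros row per class is equivalent to the closed-form estimate $(c_\alpha + 1)/(n_y + 2)$, where $c_\alpha$ counts the samples of class $y$ with feature $\alpha$ equal to 1 and $n_y$ is the number of samples in that class. A minimal sketch on toy data; the helper name <code>smoothed_conditional</code> is hypothetical:</p>

```python
import numpy as np


def smoothed_conditional(X, y, label):
    """Closed-form plus-one smoothed estimate of P(x_alpha = 1 | y = label)."""
    rows = X[y == label]
    return (rows.sum(axis=0) + 1) / (len(rows) + 2)


X = np.array([[1, 0], [1, 1], [0, 0]])   # toy binary features
y = np.array([1, 1, -1])

pos_prob = smoothed_conditional(X, y, 1)    # [3/4, 2/4]
neg_prob = smoothed_conditional(X, y, -1)   # [1/3, 1/3]
print(pos_prob, neg_prob)
```

<p>The smoothing keeps every estimate strictly between 0 and 1, so the logarithms taken later are always finite.</p>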

<h3>Part Three: Log Likelihood</h3>

<p>The function <b><code>loglikelihood</code></b> calculates the log likelihood $\log P(\mathbf{x}|y)$ for each point in X_test given its label in y_test. Under the Na&iuml;ve Bayes assumption, the likelihood is the product of the conditional probabilities of the individual features, and because $\log(ab) = \log a + \log b$, the log likelihood is the sum of their logarithms.</p>

In [7]:
def loglikelihood(pos_prob, neg_prob, X_test, y_test):
    """
    This function calculates the loglikelihood of each point.
    
    Inputs:
    ------
    pos_prob : conditional probabilities for the positive class (d)
    neg_prob : conditional probabilities for the negative class (d)
    X_test   : features (nxd)
    y_test   : labels (-1 or +1) (n)
    
    Outputs:
    -------
    loglikelihood : loglikelihood of each point in X_test (n)
    """
    n, d = X_test.shape
    loglikelihood = np.zeros(n)
    
    for i in range(n):
        x = X_test[i,:]
        y = y_test[i]
        if y == 1:
            ll = np.sum(x * np.log(pos_prob) + (1 - x) * np.log(1 - pos_prob))
        else:
            ll = np.sum(x * np.log(neg_prob) + (1 - x) * np.log(1 - neg_prob))
            
        loglikelihood[i] = ll
        
    return loglikelihood
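
<p>The per-sample loop above can also be written as two matrix-vector products. The sketch below is an alternative formulation, not the notebook's implementation; <code>loglikelihood_vectorized</code> is a hypothetical name, and it assumes the probabilities lie strictly between 0 and 1, which the plus-one smoothing guarantees.</p>

```python
import numpy as np


def loglikelihood_vectorized(pos_prob, neg_prob, X_test, y_test):
    """Log likelihood of each row of X_test under the probabilities of its label."""
    ll_pos = X_test @ np.log(pos_prob) + (1 - X_test) @ np.log(1 - pos_prob)
    ll_neg = X_test @ np.log(neg_prob) + (1 - X_test) @ np.log(1 - neg_prob)
    # pick the positive-class value where the label is +1, else the negative one
    return np.where(y_test == 1, ll_pos, ll_neg)


# toy values for illustration
pos_prob = np.array([0.75, 0.5])
neg_prob = np.array([1 / 3, 1 / 3])
X_test = np.array([[1.0, 0.0], [0.0, 1.0]])
y_test = np.array([1, -1])

ll = loglikelihood_vectorized(pos_prob, neg_prob, X_test, y_test)
print(ll)  # first entry is log(0.75) + log(0.5)
```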

<h3>Part Four: Na&iuml;ve Bayes Prediction</h3>

<p>A test point $\mathbf{x}_{test}$ is classified as positive if the log ratio $\log\left(\frac{P(y=1 | \mathbf{x} = \mathbf{x}_{test})}{P(y=-1|\mathbf{x} = \mathbf{x}_{test})}\right) > 0$ and as negative otherwise. <b><code>naive_bayes_prediction</code></b> first calculates this log ratio for each test point using Bayes' rule, under which the evidence $P(\mathbf{x})$ cancels and the ratio reduces to $\log P(\mathbf{x}|y=1) + \log P(y=1) - \log P(\mathbf{x}|y=-1) - \log P(y=-1)$, and then predicts the label from its sign.</p>

In [8]:
def naive_bayes_prediction(pos, neg, pos_prob, neg_prob, X_test):
    """
    This function returns the prediction of each provided point.
    
    Inputs:
    ------
    pos      : class probability for the positive class
    neg      : class probability for the negative class
    pos_prob : conditional probabilities for the positive class (d)
    neg_prob : conditional probabilities for the negative class (d)
    X_test   : features (nxd)
    
    Outputs:
    -------
    preds : prediction of each point in X_test (n)
    """
    n, d = X_test.shape
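    # log joint probability log P(x, y) for each class (the Bayes' rule numerators)
    num = loglikelihood(pos_prob, neg_prob, X_test, np.ones(n)) + np.log(pos)
    den = loglikelihood(pos_prob, neg_prob, X_test, -np.ones(n)) + np.log(neg)
    
    # log ratio; its sign decides the predicted label
    p = num - den
    preds = -np.ones(n)
    preds[p > 0] = 1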
    
    return preds
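
<p>Because the features are binary, the log ratio is an affine function of $\mathbf{x}$, which is why Na&iuml;ve Bayes is a linear classifier here: the ratio equals $\mathbf{w}^\top\mathbf{x} + w_0$ with $[\mathbf{w}]_\alpha = \log\frac{\theta^+_\alpha}{\theta^-_\alpha} - \log\frac{1-\theta^+_\alpha}{1-\theta^-_\alpha}$ and $w_0 = \log\frac{P(y=1)}{P(y=-1)} + \sum_\alpha \log\frac{1-\theta^+_\alpha}{1-\theta^-_\alpha}$, where $\theta^+_\alpha$ and $\theta^-_\alpha$ denote the entries of <code>pos_prob</code> and <code>neg_prob</code>. The sketch below checks this on toy values; the helper name <code>naive_bayes_linear</code> and the symbols $\mathbf{w}$, $w_0$, and $\theta$ are notation introduced here, not part of the original code.</p>

```python
import numpy as np


def naive_bayes_linear(pos, neg, pos_prob, neg_prob):
    """Return weights w and offset w0 such that the log ratio is X @ w + w0."""
    # per-feature contribution to the log ratio when the feature is 0
    off = np.log(1 - pos_prob) - np.log(1 - neg_prob)
    w = np.log(pos_prob) - np.log(neg_prob) - off
    w0 = np.log(pos) - np.log(neg) + off.sum()
    return w, w0


# toy values for illustration
pos, neg = 2 / 3, 1 / 3
pos_prob = np.array([0.75, 0.5])
neg_prob = np.array([1 / 3, 1 / 3])
X_test = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

w, w0 = naive_bayes_linear(pos, neg, pos_prob, neg_prob)
log_ratio = X_test @ w + w0
preds = np.where(log_ratio > 0, 1.0, -1.0)   # same sign rule as naive_bayes_prediction
print(preds)
```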

We can now train the classifier, measure its error on a held-out test set, and then try it out with an interactive name classification script:

In [9]:
print('Loading data ...')
X, y = generate_features(dim=128)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print('Training classifier ...')
pos, neg = naive_bayes_prior(X_train, y_train)
pos_prob, neg_prob = naive_bayes_conditional(X_train, y_train)

print('Testing classifier ...')
prediction = naive_bayes_prediction(pos, neg, pos_prob, neg_prob, X_test)
error = np.mean(prediction != y_test)
print('Test error: %.2f%%' % (100 * error))

Loading data ...
Training classifier ...
Testing classifier ...
Test error: 28.06%


In [10]:
while True:
    print('Please enter a baby name>')
    name = input()
    if len(name) < 1:
        break
    
    x = names2features(name, b=128, usefile=False)
    prediction = naive_bayes_prediction(pos, neg, pos_prob, neg_prob, x)
    if prediction[0] > 0:
        print("%s, I am sure you are a baby boy.\n" % name)
    else:
        print("%s, I am sure you are a baby girl.\n" % name)

Please enter a baby name>
Sarah
Sarah, I am sure you are a baby girl.

Please enter a baby name>
John
John, I am sure you are a baby boy.

Please enter a baby name>

