# Recreate the Naive Bayes Algorithm
## Steven Glover
***

To recreate the Naïve Bayes algorithm, I used a series of functions to compartmentalize the individual tasks and then created a function wrapper to execute the functions in sequence. The functions that the algorithm uses are as follows:

**Probability Calculations:** 
1.	def independant_prob: This function takes a column from a data frame in the form of an array and calculates the probability of each unique item in the array with respect to the entire array. The calculation has two stages. In the first stage, I combined the numpy unique function with the zip function to create a dictionary whose keys are the unique values in the array and whose values are their counts. In the second stage, I created a dictionary of probabilities by dividing each count by the total number of items in the original array. The function returns the dictionary of probabilities. 
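The two stages can be sketched on a toy column as follows. This is only a minimal illustration: it uses `collections.Counter` from the standard library as a stand-in for the numpy unique and zip combination, and the helper name `value_probabilities` and the sample values are hypothetical.

```python
from collections import Counter

def value_probabilities(values):
    """Return {unique value: share of all items} for an iterable of labels."""
    # Stage 1: count how often each unique value occurs
    counts = Counter(values)
    # Stage 2: divide every count by the total number of items
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Hypothetical toy column (not taken from the Titanic data)
sex = ['male', 'female', 'male', 'male']
sex_probs = value_probabilities(sex)
print(sex_probs)
```

The probabilities in the returned dictionary always sum to 1, which is a quick sanity check when applying the same idea to a real column.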

**Preprocessing:**
2.	def nb_preprocessing: This function handles several of the preprocessing tasks needed to generate predictions with the Naïve Bayes algorithm. It takes two inputs: the dataset as a data frame and the name of the y variable. Before discussing the mechanics of this function, I would like to describe my strategy for calculating the likelihood (given probability) component of the Naïve Bayes computation. I subset the X_train dataset by the response variable and stored the resulting data frames in a python list (two data frames for a binary response such as survived/died). I then used another custom function to generate a dictionary of the probabilities of each unique value of a variable given the response. Additionally, I wanted the function to scale to more than two y responses, which I achieved through label encoding (more on this below). The function has four stages and then returns its results:
    - Calculating the prior probability of the response using the independant_prob() function described in #1.
    - Label encoding the response. Label encoding converts a series of string values to numbers. For instance, if the response were “died” or “lived”, the label encoder would convert the strings to 0 and 1. Why is this needed? Recall that I subset the data frames by the y response and store them in a list. Each data frame is stored at the list index that matches its label-encoded y value, which lets me keep track of which subset I am using while processing the data iteratively. It also lets the project scale beyond a binary response.
    - Creating a test/train split.
    - Subsetting X_train by the y response and storing the data frames in a python list.
    - The function returns the list of X_train data frames, X_test, y_train, y_test, the sorted unique encoded responses, the names of the x variables, the label encoder object from scikit-learn, and the prior probabilities in the form of a python dictionary.<br>
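The link between label encoding and list position can be sketched without any libraries. The snippet below is an assumption-laden illustration, not the notebook's function: it mimics scikit-learn's LabelEncoder by numbering the classes in sorted order, skips the test/train split, and uses hypothetical toy rows.

```python
# Hypothetical toy data: one feature row per passenger plus a string response
rows = [{'sex': 'female'}, {'sex': 'male'}, {'sex': 'male'}, {'sex': 'female'}]
response = ['lived', 'died', 'died', 'lived']

# Label encoding: map each class to its position in the sorted class list
classes = sorted(set(response))
encode = {label: code for code, label in enumerate(classes)}
encoded = [encode[label] for label in response]

# Store each class's rows at the list index that equals its encoded label
subsets = [[] for _ in classes]
for row, code in zip(rows, encoded):
    subsets[code].append(row)

# The encoded label doubles as the index of the matching subset
print(subsets[encode['died']])
```

Because the encoded label is also the list index, the same lookup works unchanged for three or more response classes.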

<br>**Likelihood / Given Probabilities:**
3.	conditional_prob: This function returns a dictionary of dictionaries that contains the given probabilities (likelihoods) for each class of a categorical variable given the response. The function takes two arguments: the list of X_train subsets and a list of the unique response values, both of which are returned by the nb_preprocessing function described in #2. It also uses the independant_prob function described in #1 to generate the probability dictionary for each column of each data frame as it iterates over them.<br>
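The shape of that nested dictionary can be sketched on toy data. This is a simplified stand-in rather than the notebook's implementation: rows are plain dictionaries instead of data frames, the outer keys are integers rather than the '0.0' / '1.0' strings used later, and the helper `column_probabilities` is hypothetical.

```python
from collections import Counter

def column_probabilities(rows, column):
    """Share of each category of `column` within one class subset."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

# Hypothetical subsets, indexed by the encoded response (0 = died, 1 = lived)
subsets = [
    [{'sex': 'male', 'pclass': '3'}, {'sex': 'male', 'pclass': '2'},
     {'sex': 'female', 'pclass': '3'}, {'sex': 'male', 'pclass': '3'}],
    [{'sex': 'female', 'pclass': '1'}, {'sex': 'male', 'pclass': '1'}],
]

# Outer key: response class; middle key: column; inner key: category
likelihoods = {
    code: {column: column_probabilities(rows, column) for column in rows[0]}
    for code, rows in enumerate(subsets)
}
print(likelihoods[0]['sex'])
```

Note that a category that never appears in a subset (here, pclass '2' among the survivors) has no entry at all, so a later lookup for it would fail rather than return zero.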

<br>**Generate Predictions:**<br>
4.	Naive_Bayes_Predictions: This function returns a list of the predictions and a list of the probabilities associated with those predictions. It takes three input arguments: the X_test data frame, the dictionary of likelihoods, and the dictionary of prior probabilities. Using the dictionary of likelihoods created in #3 and the dictionary of prior probabilities created with the function from #1, it iterates over each row of the X_test set and calculates the posterior probability of each y response. It then chooses the y response with the greatest probability for the row and appends the prediction and its normalized probability to python lists. After every row of X_test has been processed, both lists are returned. <br>

<br>**Prediction Accuracy Report:**<br>
5.	accuaracy_report: This function prints the accuracy score, the precision/recall table, and a row-normalized confusion matrix for the predictions. 

**The Wrapper Function:**
6.	Complete_NaiveBayes: The function wraps all the above functions together to provide preprocessing, generate predictions using the Naïve Bayes algorithm, print an accuracy report with the label encoding mappings, and return the lists of predictions/probabilities. The function can do all of this with only two inputs: the data frame and the name of the y response. 

**<font color = red> Note: </font>** the function requires every column to have the object (string) datatype. You will receive an error message if you feed the function a data frame that has float or int columns. 


## 1) Import and Preprocess the Data

In [34]:
import numpy as np
import pandas as pd

data = pd.read_csv('C:/Users/Steven Glover/Jupyter Notebooks/Fall Semester/KNN/titanic3.csv')
print('columns: ',data.columns,'\n')
print('shape: ', data.shape,'\n')
print('null: \n', pd.isnull(data).sum())
print('-------------------------------')
print('Unique Values: ', data.nunique())

columns:  Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object') 

shape:  (1310, 14) 

null: 
 pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64
-------------------------------
Unique Values:  pclass          3
survived        2
name         1307
sex             2
age            98
sibsp           7
parch           8
ticket        929
fare          281
cabin         186
embarked        3
boat           27
body          121
home.dest     369
dtype: int64


### Drop Name, Cabin, Ticket, Boat, Home.dest, Body & Get Rid of Null Values

In [35]:
data = data.drop(['name','cabin','ticket','boat','home.dest','body'], axis = 1)
data = data.dropna()
data = data.reset_index(drop = True)
display(data.info())

for i in data.columns.tolist():
    print(i,'\n',data[i].unique())
    print('---------------')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1043 entries, 0 to 1042
Data columns (total 8 columns):
pclass      1043 non-null float64
survived    1043 non-null float64
sex         1043 non-null object
age         1043 non-null float64
sibsp       1043 non-null float64
parch       1043 non-null float64
fare        1043 non-null float64
embarked    1043 non-null object
dtypes: float64(6), object(2)
memory usage: 65.3+ KB


None

pclass 
 [ 1.  2.  3.]
---------------
survived 
 [ 1.  0.]
---------------
sex 
 ['female' 'male']
---------------
age 
 [ 29.       0.9167   2.      30.      25.      48.      63.      39.      53.
  71.      47.      18.      24.      26.      80.      50.      32.      36.
  37.      42.      19.      35.      28.      45.      40.      58.      22.
  41.      44.      59.      60.      33.      17.      11.      14.      49.
  76.      46.      27.      64.      55.      70.      38.      51.      31.
   4.      54.      23.      43.      52.      16.      32.5     21.      15.
  65.      28.5     45.5     56.      13.      61.      34.       6.      57.
  62.      67.       1.      12.      20.       0.8333   8.       0.6667
   7.       3.      36.5     18.5      5.      66.       9.       0.75
  70.5     22.5      0.3333   0.1667  40.5     10.      23.5     34.5     20.5
  30.5     55.5     38.5     14.5     24.5     74.       0.4167  11.5     26.5   ]
---------------
sibsp 
 [ 

### Convert All Data Columns to Strings Prior to Preprocessing

In [36]:
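# Keep only the categorical predictors and the response; copy the selection
# so that reassigning its columns below does not trigger SettingWithCopyWarning
data = data[['pclass' , 'sex' , 'embarked' , 'survived']].copy()

for col in data.columns.tolist():
    data[col] = data[col].astype(str)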

## <font color = blue> The following functions and code are the prototype for the Complete_NaiveBayes function in the .py script file </font>
The NaiveBayes script will be imported at the end of this notebook to confirm that it works correctly. 

In [4]:
def independant_prob(arr, show = 'n'):
    '''Return a dictionary that maps each unique value in an array to its
    independent probability (its count divided by the total number of items).
    Set show = 'y' to also print the count and probability dictionaries.'''
    #get the count dictionary (call np.unique once and unpack both results)
    values, counts = np.unique(arr, return_counts= True)
    cat_dict = dict(zip(values, counts))
    prob_dict = {}
    total = sum(cat_dict.values())
    for key in cat_dict.keys():
        prob_dict[key] = cat_dict[key] / total
    
    #Print a summary of the calculations
    if show == 'y':
        print('Count Dictionary:')
        print(cat_dict,'\n')
        print('Probability Dictionary')
        print(prob_dict)
    
    return prob_dict

## Conditional Probability

In [5]:
def conditional_prob(data_list, uniqueY):
    '''Return a nested dictionary of likelihoods keyed by
    response -> column -> category.'''
    given_prob_dict = {}
    for data_set in range(len(uniqueY)):
        #the subset of X_train whose encoded response equals data_set
        df = data_list[data_set]
        y_given = {}
        for cat in df.columns.tolist():
            #need to make sure that the input stays strings at conversion
            y_given[str(cat)] = independant_prob(np.array(df[cat]).astype(str))
        #need to keep the float string format
        given_prob_dict[str(float(data_set))] = y_given 
    return given_prob_dict

***
### Subset X_train by Output Category and Store Each Subset in a List

In [6]:
def nb_preprocessing(df, y):
    # sklearn.cross_validation was removed from scikit-learn; train_test_split lives in model_selection
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    
    #subset X and Y
    x_names = [x for x in df.columns.tolist() if x not in [y]]
    X = df[x_names]
    
    #store the y column name and identify the y variable
    y_col = y
    y = df[y_col]

    """In order for this function to be usable with multiple classes,
    the y values should be label encoded. The label encoding is necessary
    because when the naive bayes algorithm is iteratively making its predictions,
    it will need to associate the prediction with the position in the list that its X subset
    is in. To easily calculate given probabilities, I will subset the X_train by its Y_train category."""
    
    #label encode the y 
    """note the encoded y will still need to be converted back to a string for the train test split.
    I will need to account for this when I split X_train into subsets."""
    
    le = LabelEncoder()
    y = le.fit_transform(y)
    
    y = y.astype('float').astype('str')
    
    y_prob = independant_prob(y.astype(str))
    
    # Test Train Split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Turn the datasets back into dataframes
    X_train = pd.DataFrame(data = X_train, columns = x_names).reset_index(drop=True)
    X_test = pd.DataFrame(data = X_test, columns = x_names).reset_index(drop=True)
    y_train = pd.DataFrame(data = y_train, columns = ['survived']).reset_index(drop=True)
    y_test = pd.DataFrame(data = y_test, columns = ['survived']).reset_index(drop=True)

    #--------------------------------------
    #sort unique label encoded Ys. 
    """since the Ys have been label encoded and sorted, the X_train
    dataframes will be stored to the list in the order associated with the y
    variable. This will allow us to match the y response with the position in the list
    when generating the predictions."""
    
    uniqueY = sorted(pd.Series(y).astype(float).unique().tolist())
    X_train_list = []
    
    for subset in range(len(uniqueY)):
        subset_df = X_train[y_train.survived == str(float(subset))]
        # Make sure every column is still a string (astype returns a new frame,
        # so the result has to be assigned rather than compared with ==)
        subset_df = subset_df.astype(str)
        # add subset to a list
        X_train_list.append(subset_df)
                        
    return X_train_list, X_test, y_train, y_test, uniqueY, x_names, le, y_prob

In [16]:
data = data[['pclass' , 'sex' , 'embarked' , 'survived']] 
X_train_list, X_test, y_train, y_test, uniqueY, x_names, label_encoder, y_prob = nb_preprocessing(data, 'survived')

# Naive Bayes Calculation
Each piece is broken out into its own function and then combined. We will use pclass to test.

In [17]:
print('Survived Probabilites')
print('----------------------------')
survived_prob = independant_prob(np.array(data.survived).astype(str))
survived_prob

Survived Probabilites
----------------------------


{'0.0': 0.59252157238734415, '1.0': 0.40747842761265579}

In [18]:
probabilities = conditional_prob(X_train_list, uniqueY)
probabilities

{'0.0': {'embarked': {'C': 0.13934426229508196,
   'Q': 0.061475409836065573,
   'S': 0.79918032786885251},
  'pclass': {'1.0': 0.17008196721311475,
   '2.0': 0.23360655737704919,
   '3.0': 0.59631147540983609},
  'sex': {'female': 0.17418032786885246, 'male': 0.82581967213114749}},
 '1.0': {'embarked': {'C': 0.31502890173410403,
   'Q': 0.034682080924855488,
   'S': 0.6502890173410405},
  'pclass': {'1.0': 0.42196531791907516,
   '2.0': 0.2861271676300578,
   '3.0': 0.29190751445086704},
  'sex': {'female': 0.69075144508670516, 'male': 0.30924855491329478}}}
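To see how these dictionaries turn into a prediction, here is a worked calculation for a single hypothetical passenger (first class, female, embarked at C). The priors and likelihoods are copied from the outputs above and rounded to four decimals; the variable names are illustrative only and are not part of the notebook's functions.

```python
import math

# Priors and likelihoods copied (rounded) from the outputs above
priors = {'0.0': 0.5925, '1.0': 0.4075}
likelihoods = {
    '0.0': {'pclass': {'1.0': 0.1701}, 'sex': {'female': 0.1742}, 'embarked': {'C': 0.1393}},
    '1.0': {'pclass': {'1.0': 0.4220}, 'sex': {'female': 0.6908}, 'embarked': {'C': 0.3150}},
}

# Hypothetical passenger: first class, female, embarked at C
passenger = {'pclass': '1.0', 'sex': 'female', 'embarked': 'C'}

# Unnormalized score per class: prior times the product of the likelihoods
scores = {
    outcome: priors[outcome] * math.prod(
        likelihoods[outcome][col][value] for col, value in passenger.items()
    )
    for outcome in priors
}

# Pick the class with the larger score and normalize it into a probability
prediction = max(scores, key=scores.get)
posterior = scores[prediction] / sum(scores.values())
print(prediction, round(posterior, 2))
```

With these rounded inputs the passenger is classified as '1.0' (survived) with a normalized probability of roughly 0.94. The exact figure depends on the random test/train split that produced the likelihoods.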

### Naive Bayes Predictions and Probabilities

In [10]:
def Naive_Bayes_Predictions(X_test,probabilities, survived_prob):
    testObs = X_test
    classification_list = []
    classification_probs = []
    #iterate over each row
    for row in range(testObs.shape[0]):
        predictions = {}
        #calculate the probability for each outcome separately
        for outcome in survived_prob.keys():
            outcome = str(outcome)
            #iterate over each column in the row
            prob_list = []
            prob_list.append(survived_prob[outcome])
            for col in testObs.columns.tolist():
                #pull probabilities out of the 3x nested dictionary. keys are as follows: y response, column, 
                #category in the column
                prob_list.append(probabilities[outcome][col][testObs.loc[row, col]])   
            predictions[outcome] = np.prod(np.array(prob_list)) 
        
        #get the outcome with the greatest probability
        pred_key = max(predictions, key=predictions.get)
        classification_list.append(pred_key)
        classification_probs.append(predictions[pred_key] / sum(predictions.values()))
        
    return classification_list, classification_probs

In [19]:
classification_list, classification_probs = Naive_Bayes_Predictions(X_test, probabilities, survived_prob)

### Accuracy Report

In [12]:
def accuaracy_report(y_test, predictions):
    from sklearn.metrics import classification_report, confusion_matrix
    print('The Accuracy Score is: ',np.array(np.sum(np.equal(predictions, y_test)) / y_test.shape[0]),'\n')
    print('The Classification Report')
    print('-------------------------')
    print(classification_report(y_test, predictions),'\n')
    print('The Confusion Matrix')
    print('-------------------------')
    print(pd.DataFrame(confusion_matrix(y_test, predictions)).apply(lambda x: x / sum(x), axis=1))

In [20]:
accuaracy_report(np.array(y_test),np.array(classification_list).reshape(len(classification_list),1))

The Accuracy Score is:  0.8229665071770335 

The Classification Report
-------------------------
             precision    recall  f1-score   support

        0.0       0.84      0.88      0.86       130
        1.0       0.78      0.73      0.76        79

avg / total       0.82      0.82      0.82       209
 

The Confusion Matrix
-------------------------
          0         1
0  0.876923  0.123077
1  0.265823  0.734177
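The confusion matrix above is row-normalized: each row (a true class) is divided by its own total, so the rows sum to 1 and the diagonal equals the recall for each class (0.876923 and 0.734177 match the 0.88 and 0.73 in the report). A minimal pure-Python sketch of that calculation on hypothetical labels, with an illustrative helper name, looks like this:

```python
def row_normalized_confusion(y_true, y_pred, labels):
    """Confusion matrix whose rows (true classes) each sum to 1."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        counts[index[truth]][index[guess]] += 1
    # Divide each row by its total so entry [i][j] is P(predicted j | true i)
    return [[cell / sum(row) for cell in row] for row in counts]

# Hypothetical labels (not taken from the notebook's test split)
y_true = ['0.0', '0.0', '0.0', '0.0', '1.0', '1.0']
y_pred = ['0.0', '0.0', '0.0', '1.0', '1.0', '0.0']

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
matrix = row_normalized_confusion(y_true, y_pred, ['0.0', '1.0'])
print(accuracy)
print(matrix)
```

The sketch assumes every label has at least one true observation; otherwise the row total would be zero.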


# Bring It All Together: Prototype for the .py Script

In [14]:
def Complete_NaiveBayes_test(df, y):
    #check for categorical variables: every column of the input df must have the object dtype
    for col in df.columns.tolist():
        if df[col].dtype != 'object':
            return 'All columns must have a dtype of object'
        
    # get test train list (pass the y argument through instead of hardcoding 'survived')
    X_train_list, X_test, y_train, y_test, uniqueY, x_names, label_encoder, y_prob  = nb_preprocessing(df, y)
        
    # generate conditional probabilites     
    probabilities = conditional_prob(X_train_list, uniqueY)
    
    # generate predictions and the list of probabilites
    classification_list, classification_probs = Naive_Bayes_Predictions(X_test,probabilities, y_prob)
    
    # print the accuracy report
    accuaracy_report(np.array(y_test),np.array(classification_list).reshape(len(classification_list),1))
    
    #print the label encoding categories
    print('\nThe label encoding mappings')
    print('-----------------------------')
    for val in uniqueY:
        #inverse_transform expects an array-like, so wrap the code in a list
        print(label_encoder.inverse_transform([int(val)])[0],' : ',val)
        
    #return the classification list and the probabilities    
    print('\nThe Classification Report & Classification Probabilites')
    print('---------------------------------------------------------')       
    return classification_list, classification_probs

In [27]:
Complete_NaiveBayes_test(data, 'survived')

The Accuracy Score is:  0.8038277511961722 

The Classification Report
-------------------------
             precision    recall  f1-score   support

        0.0       0.80      0.83      0.82       109
        1.0       0.81      0.77      0.79       100

avg / total       0.80      0.80      0.80       209
 

The Confusion Matrix
-------------------------
          0         1
0  0.834862  0.165138
1  0.230000  0.770000

The label encoding mappings
-----------------------------
0.0  :  0.0
1.0  :  1.0

The Classification Report & Classification Probabilites
---------------------------------------------------------


(['0.0',
  '1.0',
  '1.0',
  '0.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '1.0',
  '1.0',
  '0.0',
  '0.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '1.0',
  '1.0',
  '1.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '1.0',
  '1.0',
  '1.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '1.0',
  '0.0',
  '1.0',
  '0.0',
  '1.0',
  '1.0',
  '1.0',
  '1.0',
  '0.0',
  '0.0',
  '0.0',
  '0.0',
  '0.0',
  '0.0',
  '0.0',
  '0.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '0.0',
  '1.0',
  '0.0',
  '0.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '1.0',
  '0.0',
  '0.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '1.0',
  '0.0',
  '0.0',
  '0.0',
  '1.0',
  '0.0',
  '0.0',
  '0.0',
  '1.0',
  '0.0',
  '1.0',
  '0.0',
  '0.0',
  '0.0',
  '1.0',
  '0.0',
  '0.0',
  '1.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '0.0',
  '0.0',
  '0.0',
  '1.0',
  '0.0',
  '1.0',
  '1.0',
  '1.0',
  '1.0',
  '0.0',
  '0.0',
 

In [30]:
from NaiveBayes import Complete_NaiveBayes
Complete_NaiveBayes(data, 'survived')

The Accuracy Score is:  0.7942583732057417 

The Classification Report
-------------------------
             precision    recall  f1-score   support

        0.0       0.81      0.82      0.81       115
        1.0       0.77      0.77      0.77        94

avg / total       0.79      0.79      0.79       209
 

The Confusion Matrix
-------------------------
          0         1
0  0.817391  0.182609
1  0.234043  0.765957

The label encoding mappings
-----------------------------
0.0  :  0.0
1.0  :  1.0

The Classification Report & Classification Probabilites
---------------------------------------------------------


(['1.0',
  '0.0',
  '1.0',
  '0.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '1.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '1.0',
  '0.0',
  '1.0',
  '0.0',
  '1.0',
  '0.0',
  '0.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '1.0',
  '0.0',
  '0.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '1.0',
  '0.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '1.0',
  '0.0',
  '0.0',
  '0.0',
  '1.0',
  '0.0',
  '1.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '1.0',
  '1.0',
  '1.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '1.0',
  '0.0',
  '1.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '0.0',
  '1.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '1.0',
  '0.0',
  '0.0',
  '1.0',
  '0.0',
  '1.0',
  '0.0',
  '0.0',
  '1.0',
  '0.0',
  '1.0',
  '0.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '0.0',
  '0.0',
  '1.0',
  '1.0',
  '1.0',
 

### Convert the Response from 0/1 to died/survived to Illustrate the Label Encoding

In [37]:
data = data[['pclass' , 'sex' , 'embarked' , 'survived']]
sur = []
for i in data.survived.tolist():
    if i == '1.0':
        sur.append('survived')
    else:
        sur.append('died')
        
data.survived =  sur

for col in data.columns.tolist():
    data[col] = data[col].astype(str)
    

0       survived
1       survived
2           died
3           died
4           died
5       survived
6       survived
7           died
8       survived
9           died
10          died
11      survived
12      survived
13      survived
14      survived
15          died
16      survived
17      survived
18          died
19      survived
20      survived
21      survived
22      survived
23      survived
24          died
25      survived
26      survived
27      survived
28      survived
29          died
          ...   
1013        died
1014        died
1015        died
1016        died
1017        died
1018        died
1019        died
1020        died
1021    survived
1022        died
1023        died
1024        died
1025        died
1026        died
1027    survived
1028        died
1029        died
1030        died
1031    survived
1032        died
1033        died
1034        died
1035        died
1036        died
1037    survived
1038        died
1039        died
1040        di