
# **HW7: Naive Bayes Classifier**

### **TODO: Ramsay Ward**

**Attention: This is an individual assignment.**


For this week's homework we are going to explore one new classification technique:

  - Naive Bayes

We are reusing the version of the Melbourne housing data set from HW5, to predict the housing type as one of three possible categories:

  - 'h' house
  - 'u' duplex
  - 't' townhouse

In addition to building our own Naive Bayes classifier, we are going to compare the performance of our classifier to the [Gaussian Naive Bayes Classifier](https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes) available in the scikit-learn library.


In [1]:
# Import libraries


# These are the libraries you will use for this assignment. Do not import any other library.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import calendar
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

## **Question 1: Import the Dataset**

Load the training set and set a variable for the target column, "Type"

In [2]:
# write your code here
df_melb = pd.read_csv('melb_data_train.csv')
target = df_melb['Type']
df_melb.head()


Unnamed: 0,Rooms,Type,Price,Date,Distance,Postcode,Bathroom,Car,Landsize,BuildingArea,YearBuilt
0,2,h,399000,7/5/16,8.7,3032,1,1.0,904,53.0,1985.0
1,3,h,1241000,28/08/2016,13.9,3165,1,1.0,643,,
2,2,u,550000,8/7/17,3.0,3067,1,1.0,1521,,
3,3,u,691000,24/06/2017,8.4,3072,1,1.0,170,,
4,2,u,657500,19/11/2016,4.6,3122,1,1.0,728,73.0,1965.0


## **Question 2 - Fix a column of data to be numeric**

* Upon examining our dataframe df_melb with the dtypes method, it's observed that the "Date" column is classified as an object type. Recognizing the potential utility of this column, our objective is to transform it into a format representing seconds since the Unix epoch.

* To accomplish this while adhering to the constraints of the libraries already imported, we'll introduce a new column named `unixtime`. Attention must be paid to standardizing the date strings, as they may present inconsistencies in formatting.

* After converting the dates, verify the correctness of your transformation by printing the minimum and maximum values of the `unixtime` column. Finally, remove the original `Date` column from the dataframe.

* `normalize_date` accepts the date string as shown in the df_melb 'Date' column, and returns a date in a standardized format.

* Your function needs to have a complete docstring

In [3]:
# Write your function here
def normalize_date(d):
    """
    Normalizes a date string to a four-digit year and parses it into a timestamp.

    This function takes a date string in the format 'DD/MM/YY' or 'DD/MM/YYYY',
    checks the length of the year, and converts two-digit year formats to four-digit
    by prefixing '20' to the year. It is assumed that the date is in the 2000s.

    Parameters:
    d (str): The date string to be normalized.

    Returns:
    pandas.Timestamp: The date parsed from the normalized 'DD/MM/YYYY' string.
    """
    parts = d.split('/')
    if len(parts[2]) == 2:
        # Two-digit year: assume the 2000s and expand it to four digits
        parts[2] = '20' + parts[2]
        d = '/'.join(parts)
    # Parse the day/month/year string into a pandas Timestamp
    date = pd.to_datetime(d, format='%d/%m/%Y')
    return date





## **Question 3 Normalize the Date Column**

* Use the `normalize_date` function provided to ensure all dates in the "Date" column of your `df_melb` DataFrame have a four-digit year format.


* Convert the normalized date strings to Unix time, creating a new column "unixtime" in your DataFrame. Remember, Unix time represents the number of seconds since January 1, 1970. This is an involved step that will entail the following operations:

* * The [apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) method in pandas allows you to apply a function along an axis of the DataFrame. When you use `df_melb['Date'].apply(normalize_date)`, you're applying the normalize_date function to each element in the "Date" column. This step ensures all date formats are standardized.

* * A [lambda function](https://www.w3schools.com/python/python_lambda.asp) is a small anonymous function defined with the keyword lambda. In this context, it's used to apply a series of operations on the date strings. Try to understand the operations being performed inside the lambda function: parsing the date string into a structured time object, and then converting that time object into Unix time.

* * The [time.strptime](https://docs.python.org/3/library/time.html#time.strptime) function parses a string representing a time according to a format. The format "%d/%m/%Y" tells the function that the input string is in day/month/year format. Play around with this function separately to see how it converts string representations of dates into structured time objects.

* * Converting to Unix Time with calendar.timegm: The [calendar.timegm](https://docs.python.org/3/library/calendar.html) function takes a time tuple in UTC and returns the corresponding Unix timestamp. Experiment with this function separately by passing structured time objects to see how it returns the number of seconds since the Unix epoch.

* * The solution code chains two `apply` operations. This might look complex, but it's just applying one function after the other to the same column. Try to apply each function separately to better understand their effects before chaining them.


* Remove the original "Date" column from your DataFrame as it's no longer necessary.


* Calculate and print the minimum and maximum Unix time values found in your "unixtime" column to verify the successful conversion of date strings.
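
The two standard-library calls described above can be tried on a single string before applying them to the whole column. The following is a minimal sketch, assuming a date that is already in four-digit-year form; the variable names are illustrative and not part of the assignment.

```python
import calendar
import time

# Illustrative sample value in the same day/month/year layout as the "Date" column.
date_str = "28/08/2016"

# Parse the string into a struct_time (a structured time object with no timezone attached).
parsed = time.strptime(date_str, "%d/%m/%Y")

# Interpret the struct_time as UTC and convert it to seconds since the Unix epoch.
unix_seconds = calendar.timegm(parsed)

print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday)
print(unix_seconds)  # 1472342400
```

Because `calendar.timegm` assumes UTC, the result is the same on every machine, which is not true of `time.mktime`, which interprets the tuple in the local timezone.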



In [4]:
# Normalize each date and convert it to Unix time, then drop the original "Date" column.
# calendar.timegm treats the time tuple as UTC, so the result does not depend on the local timezone.
df_melb['unixtime'] = df_melb['Date'].apply(normalize_date).apply(lambda x: calendar.timegm(x.timetuple()))
df_melb.drop(columns=['Date'], inplace=True)

# Print the minimum and maximum Unix time

print( df_melb['unixtime'].min())
print( df_melb['unixtime'].max())




1454544000
1506124800


In [5]:
df_melb.head()

Unnamed: 0,Rooms,Type,Price,Distance,Postcode,Bathroom,Car,Landsize,BuildingArea,YearBuilt,unixtime
0,2,h,399000,8.7,3032,1,1.0,904,53.0,1985.0,1462579200
1,3,h,1241000,13.9,3165,1,1.0,643,,,1472342400
2,2,u,550000,3.0,3067,1,1.0,1521,,,1499472000
3,3,u,691000,8.4,3072,1,1.0,170,,,1498262400
4,2,u,657500,4.6,3122,1,1.0,728,73.0,1965.0,1479513600


## **Question 4: Calculating the prior probabilities**

* Calculate the prior probabilities for each possible "Type" in `df_melb` and populate a dictionary, `dict_priors`, where the key is the possible "Type" values and the value is the prior probabilities. Show the dictionary. Do not hardcode the possible values of "Type".  Don't forget about [value counts](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html).
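
As a small illustration of how `value_counts` leads to priors, here is a sketch on made-up labels (not the homework data). It assumes the `normalize=True` option, which divides each count by the number of rows; dividing the raw counts by the length of the column is equivalent.

```python
import pandas as pd

# Toy labels for illustration only.
s_types = pd.Series(["h", "u", "h", "t", "h", "u", "h", "u", "t", "h"])

# normalize=True returns relative frequencies instead of raw counts.
dict_toy_priors = s_types.value_counts(normalize=True).to_dict()

print(dict_toy_priors)
print(sum(dict_toy_priors.values()))  # priors should add up to (approximately) 1
```

The keys come from the data itself, so no class label is hardcoded.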

In [6]:
# Write your code here
n = len(target)
type_count = target.value_counts()

dict_priors = (type_count/n).to_dict()

print(dict_priors)



{'h': 0.452, 'u': 0.418, 't': 0.13}


## **Question 5: Create a model for the distribution of all of the numeric attributes**

* For each class, and for each attribute calculate the sample mean and sample standard deviation.  You should store the model in a nested dictionary, `dict_nb_model`, such that `dict_nb_model['h']['Rooms']` is a tuple containing the mean and standard deviation for the target Type 'h' and the attribute 'Rooms'.  Show the model using the `display` function. You should ignore entries that are `NaN` in the mean and [standard deviation](https://pandas.pydata.org/docs/reference/api/pandas.Series.std.html) calculation.

* The docstring included with this function is designed to offer detailed guidance and insight into its purpose and implementation. Given the complexity and intricacy involved in crafting this function, it's important to thoroughly comprehend the logic and steps described in the docstring. Carefully study and internalize the methodology outlined, as this understanding will be critical in effectively implementing the function. Take this opportunity to engage deeply with the material, and ensure that you grasp the function's objectives, the parameters it requires, and the output it generates. This hands-on experience is invaluable and will improve your ability to work with Naive Bayes models and similar machine learning techniques.
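
To see what "ignore `NaN` entries" means in practice, the sketch below builds the same kind of nested dictionary on a tiny made-up frame. The frame and the names `df_toy` and `toy_model` are assumptions for illustration; pandas' `mean()` and `std()` skip `NaN` by default, and `std()` returns the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; one 'h' row has a missing Rooms value.
df_toy = pd.DataFrame({
    "Type": ["h", "h", "h", "u", "u"],
    "Rooms": [3.0, 4.0, np.nan, 1.0, 2.0],
})

toy_model = {}
for cls in df_toy["Type"].unique():
    # Keep only the rows of the current class.
    df_cls = df_toy[df_toy["Type"] == cls]
    # mean() and std() drop NaN entries before computing the statistic.
    toy_model[cls] = {"Rooms": (df_cls["Rooms"].mean(), df_cls["Rooms"].std())}

print(toy_model["h"]["Rooms"])
print(toy_model["u"]["Rooms"])
```

For class 'h' the statistics are computed from the two observed values 3 and 4 only, so the missing entry does not pull the mean toward zero.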

In [7]:
df_melb.head()

Unnamed: 0,Rooms,Type,Price,Distance,Postcode,Bathroom,Car,Landsize,BuildingArea,YearBuilt,unixtime
0,2,h,399000,8.7,3032,1,1.0,904,53.0,1985.0,1462579200
1,3,h,1241000,13.9,3165,1,1.0,643,,,1472342400
2,2,u,550000,3.0,3067,1,1.0,1521,,,1499472000
3,3,u,691000,8.4,3072,1,1.0,170,,,1498262400
4,2,u,657500,4.6,3122,1,1.0,728,73.0,1965.0,1479513600


In [21]:
def build_naive_bayes_model(df, target_col):
    """
    Builds a Naive Bayes model dictionary for each target class in the dataframe.
    For each class, calculates the mean and standard deviation for every feature column,
    excluding the target column.

    Parameters:
    - df: pandas.DataFrame, the dataframe containing the dataset.
    - target_col: str, the name of the target column in the dataframe.

    Returns: dict_nb_model
    - A dictionary where each key is a target class, and each value is another dictionary.
      This inner dictionary has feature column names as keys, and tuples of (mean, std deviation)
      as values for that feature with respect to the target class.
    """

    dict_nb_model = {}
    unique_classes = df[target_col].unique()

    for targetc in unique_classes:
        # Rows that belong to the current class
        class_df = df[df[target_col] == targetc]
        class_model = {}

        for column in df.columns:
            if column != target_col:
                # mean() and std() skip NaN entries by default
                mean = class_df[column].mean()
                std_dv = class_df[column].std()
                class_model[column] = (mean, std_dv)

        dict_nb_model[targetc] = class_model

    return dict_nb_model
















In [9]:
# Call your function and store in a variable
dict_nb_model = build_naive_bayes_model(df_melb, 'Type')
dict_nb_model

{'h': {'Rooms': (3.269911504424779, 0.725826420112775),
  'Price': (1189022.3451327435, 586296.5794417894),
  'Distance': (12.086725663716816, 7.397501132737295),
  'Postcode': (3103.8982300884954, 98.35750345419703),
  'Bathroom': (1.5619469026548674, 0.6720871086493074),
  'Car': (1.7777777777777777, 0.932759177140425),
  'Landsize': (932.9646017699115, 3830.7934157687173),
  'BuildingArea': (156.2433962264151, 54.62662837301433),
  'YearBuilt': (1954.900826446281, 32.4618763471547),
  'unixtime': (1485717578.761062, 13838562.05060146)},
 'u': {'Rooms': (2.0430622009569377, 0.5908453859944255),
  'Price': (634207.1770334928, 217947.32866736987),
  'Distance': (8.760287081339714, 5.609778714430756),
  'Postcode': (3120.4545454545455, 87.18475679946476),
  'Bathroom': (1.1818181818181819, 0.4222815154866222),
  'Car': (1.1483253588516746, 0.47231993860297056),
  'Landsize': (436.23444976076553, 1394.3403794653257),
  'BuildingArea': (83.85585585585585, 45.95943801516662),
  'YearBuilt'

In [10]:
dict_nb_model['h']['Rooms']

(3.269911504424779, 0.725826420112775)

## **Question 6: Write a function that calculates the probability of a Gaussian**

* Given the mean ($\mu$), standard deviation ($\sigma$), and an observed point, `x`, return the probability density at `x`.
* Use the formula $p(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$ ([wiki](https://en.wikipedia.org/wiki/Normal_distribution)).  You should use [numpy's exp](https://numpy.org/doc/stable/reference/generated/numpy.exp.html) function in your solution.
* Complete the docstring for the function you will implement.
* Your function should return the probability density of x in the Gaussian distribution defined by mu and sigma.
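
Two quick sanity checks can help when testing an implementation of this formula. The sketch below uses a stand-in function named `gaussian_pdf` (a hypothetical name, not the one required by the assignment) and assumes `sigma > 0`.

```python
import numpy as np

def gaussian_pdf(mu, sigma, x):
    """Gaussian density at x; a stand-in written for this illustration."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Check 1: at x = mu the exponent is zero, so the density equals 1 / (sigma * sqrt(2 * pi)).
peak = gaussian_pdf(0.0, 2.0, 0.0)
print(peak, 1 / (2.0 * np.sqrt(2 * np.pi)))

# Check 2: a Riemann sum over a wide grid (10 standard deviations each side) should be close to 1.
xs = np.linspace(-20.0, 20.0, 4001)
area = np.sum(gaussian_pdf(0.0, 2.0, xs)) * (xs[1] - xs[0])
print(area)

# A density is not a probability: for a small sigma it can exceed 1.
narrow_peak = gaussian_pdf(0.0, 0.1, 0.0)
print(narrow_peak)
```

The last check is a reminder that the returned value is a density, which is why it is only compared across classes rather than read as a probability on its own.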

In [11]:
def get_probability(mu, sigma, x):
    """
    Calculate the probability density of x for a Gaussian distribution.

    This function computes the value of the probability density function (PDF)
    for a normal (Gaussian) distribution given the mean (mu), standard deviation
    (sigma), and a value x. The calculation is based on the formula for the PDF
    of a Gaussian distribution.

    Parameters:
    - mu (float): The mean of the Gaussian distribution.
    - sigma (float): The standard deviation of the Gaussian distribution. Sigma must be positive.
    - x (float): The point at which the PDF is evaluated.

    Returns:
    - float: The probability density of x in the Gaussian distribution defined by mu and sigma.

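    Note:
    - The function assumes sigma > 0 and does not check it. sigma = 0 causes a division by zero
      (a ZeroDivisionError, or inf/nan with a NumPy warning, depending on the input types), and a
      negative sigma silently returns a negative value.
    - This function does not check for the validity of the input types. Passing non-numeric types for mu, sigma, or x will result in a TypeError.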
    """
    # Write your code here
    p1 = 1/(sigma*np.sqrt(2*np.pi))
    p2 = np.exp(-0.5 * ((x-mu)/sigma)**2)
    return p1*p2


In [12]:
# Test it
p = get_probability( 0, 2, 0.5)
print(p)

0.19333405840142462



## **Question 7 Write the Naive Bayes classifier function**

* The Naive Bayes classifier function, `nb_class`, below should take as parameters the prior probability dictionary, `dict_priors`, the dictionary containing all of the Gaussian distribution information for each attribute, `dict_nb_model`, and a single observation row (a series generated from iterrows) of the test dataframe.

* It should return a single target classification.

* For this problem, all of our attributes are numeric and modeled as Gaussians, so we don't worry about categorical data.

* Make sure to skip attributes that do not have a value in the observation.  Do not hardcode the possible classification types.
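
The product of a prior and many small densities can become extremely small, so a common variant sums logarithms instead of multiplying. The sketch below is one possible way to do that, not the required solution; `log_gaussian`, `nb_class_log`, and the toy dictionaries are hypothetical names and values chosen for illustration. Since the logarithm is monotonic, the class with the largest log-score is also the class with the largest product.

```python
import numpy as np
import pandas as pd

def log_gaussian(mu, sigma, x):
    """Natural log of the Gaussian density at x (assumes sigma > 0)."""
    return -np.log(sigma * np.sqrt(2 * np.pi)) - 0.5 * ((x - mu) / sigma) ** 2

def nb_class_log(priors, model, observation):
    """Return the class with the highest log-posterior score."""
    scores = {}
    for cls, prior in priors.items():
        score = np.log(prior)
        for attribute, (mu, sigma) in model[cls].items():
            value = observation.get(attribute)
            # Skip attributes that are missing from the observation.
            if pd.notna(value):
                score += log_gaussian(mu, sigma, value)
        scores[cls] = score
    return max(scores, key=scores.get)

# Toy model for illustration: (mean, standard deviation) per class and attribute.
toy_priors = {"h": 0.5, "u": 0.5}
toy_model = {
    "h": {"Rooms": (3.0, 1.0), "Price": (1.0e6, 3.0e5)},
    "u": {"Rooms": (2.0, 1.0), "Price": (6.0e5, 2.0e5)},
}

print(nb_class_log(toy_priors, toy_model, {"Rooms": 4, "Price": np.nan}))
print(nb_class_log(toy_priors, toy_model, {"Rooms": 1.5, "Price": 5.5e5}))
```

In the first call the missing price is skipped, so only the prior and the `Rooms` density contribute to each score.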



In [22]:
def nb_class(dict_priors, dict_nb_model, observation):
    """
    Classifies an observation based on the Naive Bayes model and prior probabilities.

    Parameters:
    - dict_priors (dict): A dictionary of the prior probabilities for each class.
    - dict_nb_model (dict): The Naive Bayes model built with `build_naive_bayes_model`,
                            containing mean and standard deviation for each attribute by class.
    - observation (dict): A dictionary representing the observation to classify, with attributes
                          as keys and attribute values as values.

    Returns:
    - The class with the highest posterior probability for the given observation.

    Note:
    - This function assumes that `dict_nb_model` has been constructed with the same structure
      as output by `build_naive_bayes_model`.
    - `observation` should have the same attributes as those used to build `dict_nb_model`.
    """
    dict_score = dict()
    for target in dict_priors.keys():
        # Initialize the score for this class with the prior probability
        score = dict_priors[target]

        for attribute, (mean, std_dev) in dict_nb_model[target].items():
            # Calculate conditional probability only if the attribute's value is not NaN
            if pd.notna(observation[attribute]):
                value = observation[attribute]
                pd_c = get_probability(mean, std_dev, value)
                score *= pd_c
        dict_score[target] = score

    # Find the class with the maximum score (posterior probability)
    max_class = max(dict_score, key = dict_score.get)
    # return the max class
    return max_class

## **Question 8 Calculate the accuracy using Naive Bayes classifier function on the test set**

Here, you will need to apply the same preprocessing and feature engineering techniques we applied to the training set.

* Start by loading your test dataset from a CSV file named 'melb_data_test.csv' into a pandas DataFrame called df_test.

* Apply the normalize_date function to the "Date" column to ensure all dates are in a consistent format as we did for the training set. This function should already be defined and implemented correctly in your notebook based on previous instructions.

* Convert the normalized date strings in the "Date" column to Unix time (seconds since the epoch) and store these in a new column named 'unixtime'.
After conversion, drop the original "Date" column as it's no longer needed.

* Iterate over each row in the df_test DataFrame, preparing each observation for classification by the Naive Bayes model.
* Drop the target column ('Type') from each observation and convert the row to a dictionary format using `.to_dict()`. This is necessary as the `nb_class` function expects the observation as a dictionary.
* Use the `nb_class` function to make predictions. This function should already be defined and requires dict_priors and dict_nb_model which should have been prepared earlier.
* Store each prediction in a list called predictions.

* Calculate the accuracy as the mean of the correct predictions by comparing the predictions list with the actual values in the 'Type' column of df_test.
* Print the accuracy as a percentage with two decimal places.
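
The last two steps can be sketched on made-up values as follows. The names `s_actual` and `toy_predictions` are illustrative; the idea is that the mean of a boolean Series is the fraction of `True` entries, and an f-string with `:.2f` rounds to two decimal places.

```python
import pandas as pd

# Toy ground truth and predictions for illustration only.
s_actual = pd.Series(["h", "u", "t", "h", "u"])
toy_predictions = ["h", "u", "h", "h", "t"]

# Element-wise comparison gives a boolean Series; its mean is the share of correct predictions.
is_correct = pd.Series(toy_predictions, index=s_actual.index) == s_actual
accuracy_pct = is_correct.mean() * 100

print(f"Accuracy: {accuracy_pct:.2f}%")  # Accuracy: 60.00%
```

Formatting at print time avoids displaying floating-point noise such as trailing `...99999` digits.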

In [16]:
# Write your code here
df_test = pd.read_csv('melb_data_test.csv')

# calendar.timegm treats the time tuple as UTC, so the result does not depend on the local timezone
df_test['unixtime'] = df_test['Date'].apply(normalize_date).apply(lambda x: calendar.timegm(x.timetuple()))
df_test.drop(columns=['Date'], inplace=True)



In [23]:
predictions = []

# write your code here
for index, row in df_test.iterrows():
  observation = row.drop('Type').to_dict()
  prediction = nb_class(dict_priors, dict_nb_model, observation)
  predictions.append(prediction)

correct_predictions = sum(predictions == df_test['Type'])
total_predictions = len(predictions)
accuracy = correct_predictions / total_predictions * 100

In [27]:
#predictions
accuracy

56.99999999999999

## **Question 9 Use `scikit-learn` to do the same thing**

* Now that we understand the inner workings of the Naive Bayes algorithm, let's compare our results to [scikit-learn's Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html) implementation.

* Use the [GaussianNB](https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes) to train using the `df_melb` dataframe and test using the `df_test` dataframe.
* Remember to split `df_melb` into a `df_X` with the numerical attributes, and a `s_y` with the target column. On the `df_melb` frame you will have to fill the empty attributes via imputation since the `scikit-learn` library cannot handle missing values.
* Use the same method you used in the last homework (filling the training data with the mean of the non-nan values).
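
Mean imputation can be sketched on a tiny made-up frame as shown below. The frames and column values are assumptions for illustration; the point is that the fill values are computed from the training data only and then reused for the test data, so no information from the test set enters the model.

```python
import numpy as np
import pandas as pd

# Toy frames for illustration only.
df_toy_train = pd.DataFrame({"Car": [1.0, np.nan, 3.0], "YearBuilt": [1960.0, 1980.0, np.nan]})
df_toy_test = pd.DataFrame({"Car": [np.nan, 2.0], "YearBuilt": [np.nan, 2000.0]})

# Column means of the training data; NaN entries are skipped.
train_means = df_toy_train.mean()

# fillna with a Series fills each column with the value stored under that column's label.
df_toy_train_filled = df_toy_train.fillna(train_means)
df_toy_test_filled = df_toy_test.fillna(train_means)

print(train_means.to_dict())
print(df_toy_test_filled)
```

Every imputed value sits exactly at the column mean, which can distort the per-class distributions that Gaussian Naive Bayes estimates when many values are missing.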

In [28]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def train_test_gaussian_nb(df_train, df_test, target_col):
    """
    Trains and tests a Gaussian Naive Bayes model using scikit-learn,
    with imputation applied to handle missing values in both training and test data.

    Parameters:
    - df_train (pd.DataFrame): The training DataFrame.
    - df_test (pd.DataFrame): The test DataFrame to evaluate the model.
    - target_col (str): The name of the target column.

    Returns:
    - The accuracy of the model on the test set.
    """
    # Imputation on training data
    dict_imputation = {}
    for column in df_train.columns:
        if df_train[column].isnull().any():
            dict_imputation[column] = df_train[column].mean()
    df_train_imputed = df_train.fillna(value=dict_imputation)

    # Prepare training data
    X_train = df_train_imputed.drop(columns=[target_col])
    y_train = df_train_imputed[target_col]

    # Imputation on test data using training data means
    df_test_imputed = df_test.fillna(value=dict_imputation)

    # Prepare test data
    X_test = df_test_imputed.drop(columns=[target_col])
    y_test = df_test_imputed[target_col]

    # Training Gaussian Naive Bayes model
    gnb = GaussianNB()
    gnb.fit(X_train, y_train)

    # Testing the model
    y_pred = gnb.predict(X_test)

    # Calculating accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Printing the accuracy
    print("Accuracy:", accuracy)

    return accuracy




In [29]:
target_col = 'Type'
train_test_gaussian_nb(df_melb, df_test, target_col)


Accuracy: 0.37


0.37

## **Question 10 Do you think imputation hurt or helped the classifier?**

Write your answer here.

I think imputation hurt the classifier. Without imputation, our own classifier (which skips missing attributes) reached about 57% accuracy on the test set, while the scikit-learn model trained on mean-imputed data reached only 37% (0.37). Thus it is fair to say imputation hurt the model.