**Copyright: © NexStream Technical Education, LLC**.
All rights reserved

# Bayes Theorem - Part 1

Create a Colab script which calculates P(A|B) using Bayes theorem for the example provided in the lecture; that is, if you test positive how probable is it that you have cancer?
- Create a Python script to implement the bayes_theorem function
- Create a runner to call your function and output the result, P(A|B) for the following
Cases:
-  Case 1: Cancer occurs in 1% of the people your age and the test is 99% reliable
-  Case 2: Cancer occurs in 3% of the people your age and the test is 90% reliable
-  Repeat the tests 3 times each (using your previous P(A|B) result in the
subsequent run) and record P(A|B) for each repeated test. How did your results
improve with each additional test run?
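The two cases can be sanity-checked by counting people instead of multiplying probabilities. The sketch below is illustrative only: the helper name `posterior_by_counting` and the population of 10,000 are hypothetical, and it assumes that a test which is "r% reliable" gives a false positive for (100 - r)% of healthy people.

```python
def posterior_by_counting(prevalence_pct, reliability_pct, population=10_000):
    """Share of positive tests that belong to people who really have the disease."""
    sick = population * prevalence_pct // 100
    healthy = population - sick
    # Sick people who test positive (true positives)
    true_positives = sick * reliability_pct // 100
    # Healthy people who test positive anyway (assumed complement of reliability)
    false_positives = healthy * (100 - reliability_pct) // 100
    return true_positives / (true_positives + false_positives)

print(posterior_by_counting(1, 99))            # Case 1: 99 / (99 + 99) = 0.5
print(round(posterior_by_counting(3, 90), 4))  # Case 2: 270 / (270 + 970) = 0.2177
```

Even with a 99% reliable test, half of the positives in Case 1 come from the much larger healthy group, which is why a single positive result only raises the belief to 50%.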

Summary of the lecture example:
- P(A):  probability you have cancer prior to getting tested (called the 'belief')
- P(B):  probability you test positive  (called the 'evidence')
- P(A|B):  probability you have cancer given the test is positive

- Bayes Theorem:
$$P(A|B)=\frac{P(B|A)\cdot P(A)}{P(B)}$$
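
The evidence term P(B) is not given directly; it can be expanded with the law of total probability. As a worked sketch for Case 1, assuming the false-positive rate is the complement of the test reliability, $P(B|\neg A) = 1 - P(B|A)$ (an assumption that is consistent with the doctest values in this notebook):

$$P(B) = P(B|A)\cdot P(A) + P(B|\neg A)\cdot P(\neg A) = 0.99\cdot 0.01 + 0.01\cdot 0.99 = 0.0198$$

$$P(A|B) = \frac{P(B|A)\cdot P(A)}{P(B)} = \frac{0.99\cdot 0.01}{0.0198} = 0.5$$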

<br>

**def bayes_theorem(p_a, p_b_given_a):**
- Inputs:
  - *p_a* = probability of event 'a', e.g., you have cancer prior to getting tested
  - *p_b_given_a* = probability of event 'b' given event 'a' has occurred, e.g. the probability that you test positive given you have cancer.
- Returns *p_b*, the probability of event 'b' (returned for test purposes), and *p_a_given_b*, the probability of event 'a' given event 'b', e.g. the probability that you have cancer given you test positive.

<br>

Follow the steps as outlined in the following code cell to implement the functions.  Make sure your code passes the embedded doctests.


In [1]:
#In-class assignment in Bayes lecture
#If you test positive how probable is it that you have cancer?
import numpy as np


#Calculate P(A|B) given P(A), P(B|A)
"""def bayes_theorem(p_a, p_b_given_a):
  p_not_a = None
  p_b_given_not_a = None
  p_b = None
  p_a_given_b = None

  return p_b, p_a_given_b"""


def bayes_theorem(p_a, p_b_given_a):
    # probability of not having the disease
    p_not_a = 1 - p_a

    # probability of testing positive given that you don't have the disease (false positive):
    # the complement of the test reliability, so 1% for Case 1 and 10% for Case 2
    p_b_given_not_a = 1 - p_b_given_a

    # total probability of testing positive
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a

    # probability of having the disease given that you tested positive
    p_a_given_b = (p_b_given_a * p_a) / p_b

    return p_b, p_a_given_b

#-------------------------------------------------------------------------------------------------
#Test with the following doctest test vectors.
#DO NOT EDIT THE TEST CODE!!!!

#Scenario and definitions
#   P(A):  probability that event (e.g. cancer) occurs in % of population (belief)
#   P(B):  probability test is positive (evidence)
#   P(B|A):  probability test is positive if event occurs, i.e. reliability of test
#            when multiply by P(A) then get the rate of true positives
#   P(B|not A): probability test is positive if event does not occur
#            when multiply by P(not A) then get the rate of false positives

#Case 1.  Cancer occurs in 1% of the people your age and the test is 99% reliable
#Case 2.  Cancer occurs in 3% of the people your age and the test is 90% reliable

"""def bayes_runner(p_a, p_b_given_a, iterations):
  pagb_list = []
  pb_list = []
  for i in range(iterations):
    p_b, p_a_given_b = bayes_theorem(p_a, p_b_given_a)
    p_a = p_a_given_b
    pagb_list.append(p_a_given_b)
    pb_list.append(p_b)
  return pb_list, pagb_list"""

def bayes_runner(p_a, p_b_given_a, iterations):
    pagb_list = []
    pb_list = []

    for i in range(iterations):
        p_b, p_a_given_b = bayes_theorem(p_a, p_b_given_a)

        # updating prior belief with new evidence
        p_a = p_a_given_b

        pagb_list.append(p_a_given_b)
        pb_list.append(p_b)

    return pb_list, pagb_list


#Case 1.  Cancer occurs in 1% of the people your age and the test is 99% reliable
pb_1, pagb_1 = bayes_runner(0.01, 0.99, 3)

#Case 2.  Cancer occurs in 3% of the people your age and the test is 90% reliable
pb_2, pagb_2 = bayes_runner(0.03, 0.90, 3)

import doctest
"""
   >>> print(np.round(pagb_1, 4))
   [0.5    0.99   0.9999]
   >>> print(np.round(pb_1, 4))
   [0.0198 0.5    0.9802]
   >>> print(np.round(pagb_2, 4))
   [0.2177 0.7147 0.9575]
   >>> print(np.round(pb_2, 4))
   [0.124  0.2742 0.6718]
"""
doctest.testmod()


TestResults(failed=0, attempted=4)

# Naive Bayes Multinomial Classifier - Part 2

Create a Colab script which uses a Naive Bayes classifier to detect email spam messages. Note, you may NOT use any machine learning libraries (e.g. sklearn) for this section.
- Use the message 'subject' line (you can use the body but it may be much more complex)
- Create a training set of Wanted and Unwanted messages
- Create your 'model'
  - Parse the words in the emails and create conditional probabilities for the Wanted and Unwanted messages
- Test your model with other messages not used in the training set
- Record and reflect on your results, i.e. evaluate the performance, and compare against
typical performance of the algorithm (do some searches to find the average performance
of a Naive Bayes based spam filter).
- A training dataset has been provided as materials for this assignment. You may find another, or scrape your own emails if you prefer - just describe what dataset you used or created in your reflections.
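
The modelling steps above can be sketched on a toy training set. This is a minimal illustration, not the assignment solution: the subjects and the names `train_counts` and `predict_smoothed` are hypothetical, and it swaps in two common techniques that differ from the approach used in the cells below, namely add-one (Laplace) smoothing for words unseen in one class and summing log probabilities instead of multiplying raw probabilities.

```python
import math
from collections import Counter

def train_counts(messages):
    """messages: list of (label, subject) pairs, label is 'ham' or 'spam'."""
    counts = {'ham': Counter(), 'spam': Counter()}
    docs = Counter()
    for label, subject in messages:
        docs[label] += 1
        # Same crude stopword rule as the assignment: skip words shorter than 4 chars
        counts[label].update(w for w in subject.lower().split() if len(w) > 3)
    vocab = set(counts['ham']) | set(counts['spam'])
    return counts, docs, vocab

def predict_smoothed(subject, counts, docs, vocab):
    total_docs = sum(docs.values())
    scores = {}
    for label in counts:
        # Start from the log prior P(label)
        score = math.log(docs[label] / total_docs)
        total_words = sum(counts[label].values())
        for w in subject.lower().split():
            if len(w) > 3 and w in vocab:
                # Add-one smoothing keeps unseen words from producing log(0)
                score += math.log((counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data, for illustration only
toy = [
    ('ham', 'meeting agenda for monday'),
    ('ham', 'daily gas nomination'),
    ('spam', 'cheap software discount offer'),
    ('spam', 'online banking alert discount'),
]
model = train_counts(toy)
print(predict_smoothed('discount software', *model))          # spam
print(predict_smoothed('gas nomination for monday', *model))  # ham
```

Working in log space avoids numerical underflow when many small probabilities are combined, and smoothing lets both classes score every known word.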


<br>

- Follow the steps as outlined in the following code cells to implement the functions.
- Record your comments and reflections for this part in a text cell at the end of this section.



In [2]:
#Mount your google drive and copy the dataset to the current working directory (!cp),
#or change the working directory to the folder (%cd).
#from google.colab import drive
#drive.mount('/content/drive')
 # Mount drive
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)
# cd (change directory) to the folder which contains the dataset
# YOUR CODE HERE...
%cd /content/drive/MyDrive/Colab Notebooks/

Mounted at /content/drive/
/content/drive/MyDrive/Colab Notebooks


In [3]:
#Read the dataset as a Pandas dataframe.
#and examine the head of the file.
#An example is shown for the provided dataset.
#This code assumes the dataset csv file is in the working directory.

import pandas as pd

df = pd.read_csv('./spam_training_dataset.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0


In [4]:
#Split each row on the line return '\r'
#An example is shown for the provided dataset.
df['Subject'] = df.text.str.split('\r').str.get(0)
print(df['Subject'])

0              Subject: enron methanol ; meter # : 988291
1                   Subject: hpl nom for january 9 , 2001
2                                   Subject: neon retreat
3       Subject: photoshop , windows , office . cheap ...
4                            Subject: re : indian springs
                              ...                        
5166                        Subject: put the 10 on the ft
5167             Subject: 3 / 4 / 2000 and following noms
5168                Subject: calpine daily gas nomination
5169    Subject: industrial worksheets for august 2000...
5170              Subject: important online banking alert
Name: Subject, Length: 5171, dtype: object


In [5]:
#Setup dataframes with 80% train and 20% test
#An example is shown for the provided dataset.
cutoff = int(len(df)*0.8)
train = df[:cutoff]
test = df[cutoff:]

#Reset the indices of the test rows (reset_index returns a new frame, so assign it)
test = test.reset_index(drop=True)
test

Unnamed: 0.1,Unnamed: 0,label,text,label_num,Subject
0,3017,ham,Subject: 3 - urgent - to avoid loss of informa...,0,Subject: 3 - urgent - to avoid loss of informa...
1,451,ham,Subject: nominations for eastrans reciept for ...,0,Subject: nominations for eastrans reciept for ...
2,307,ham,Subject: change in operating companies\r\nplea...,0,Subject: change in operating companies
3,3752,spam,Subject: we deliver to your door within 24 hou...,1,Subject: we deliver to your door within 24 hou...
4,2612,ham,Subject: natural gas nomination for 03 / 01\r\...,0,Subject: natural gas nomination for 03 / 01
...,...,...,...,...,...
1030,1518,ham,Subject: put the 10 on the ft\r\nthe transport...,0,Subject: put the 10 on the ft
1031,404,ham,Subject: 3 / 4 / 2000 and following noms\r\nhp...,0,Subject: 3 / 4 / 2000 and following noms
1032,2933,ham,Subject: calpine daily gas nomination\r\n>\r\n...,0,Subject: calpine daily gas nomination
1033,1409,ham,Subject: industrial worksheets for august 2000...,0,Subject: industrial worksheets for august 2000...


In [6]:
#Train the model

#Create empty dictionaries of ham (wanted), and spam (unwanted) messages.
#This will create the histogram bins needed for ham and spam messages.
#The dicts will set keys = email word, value = word frequency (count)
#ham = None
#spam = None

ham = {}
spam = {}



#Loop over length of the training set
"""for i in range(len(train)):

  #Read an example (X) == row from the training set
  #Hint:  https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
  X = None

  #If the example is ham (can get this from the 'label' or 'label_num' column)
  #then set a reference to the ham dictionary, else set reference to spam dict.
  if X.label == None
    toAppend = None
  else:
    toAppend = None"""

# Iterate over training set
for i in range(len(train)):
    # Extract the i-th row in the training set
    X = train.loc[i]

    # Determine which dictionary to append to
    if X['label'] == 'ham':
        toAppend = ham
    else:
        toAppend = spam




  #Loop over the words in the example.
  #Hint: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html
  #Hint: In this case str is the X.text field in the df
"""  for word in None:

    #Check for 'stopwords' by ignoring any words with length < 4
    if None > 3:
      #Check if word is in the dictionary and increment count if so
      #otherwise add the word to the dictionary and set the count to 1
      if word in None:
        None
      else:
        None"""

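# Loop over words in the text.
# As written, this loop sits outside the training loop above, so it only counts the
# words of the last training row; that is why the dictionaries printed below hold
# eight ham words and no spam words. Nest it under the row loop, as the complete
# program further down does, to count every message.
for word in X['text'].split():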
        # Ignore 'stopwords' shorter than 4 characters
        if len(word) > 3:
            # Increment word count if already in dictionary, else add to dictionary
            if word in toAppend:
                toAppend[word] += 1
            else:
                toAppend[word] = 1


#Set the lengths for each of the ham and spam entries
#so can calculate the probability of each word in the ham and spam histograms.
lenHam = len(train[train.label == "ham"])
lenSpam = len(train[train.label == "spam"])

#Loop over each word and frequency in the dicts and calculate their weightings.
#Hint:  https://docs.python.org/3/tutorial/datastructures.html#dictionaries (Looping Techniques)
#Ham histogram probabilities
"""for None, None in None
  ham[None] = None
#Spam histogram probabilities
for None, None in None
  spam[None] = None"""

# Loop over each word and frequency in the dictionaries and calculate their weightings
for word, count in ham.items():
    ham[word] = count / lenHam
for word, count in spam.items():
    spam[word] = count / lenSpam


#Finally sort the dictionaries by values in descending order (highest weighting first)
#and keep the items with the top 25 weightings.
#Note, you should experiment with keeping different amounts of items.
ham = dict(sorted(ham.items(), key=lambda item: item[1], reverse=True)[:25])
print(ham)
spam = dict(sorted(spam.items(), key=lambda item: item[1], reverse=True)[:25])
print(spam)

{'Subject:': 0.00033818058843422386, 'occidental': 0.00033818058843422386, 'battleground': 0.00033818058843422386, 'meter': 0.00033818058843422386, '1485': 0.00033818058843422386, 'october': 0.00033818058843422386, '2000': 0.00033818058843422386, 'thank': 0.00033818058843422386}
{}


In [7]:
#Make predictions
#def predict(text):
#Input: test message text
#Return: 'ham' or 'spam' label

# Loop over words in test message
#   if not stopword (<4 char) then accumulate wanted or unwanted probabilities
# Note, for this implementation, if the word is not in the ham or spam dictionary,
# just ignore it, i.e. don't multiply in a 0 probability to avoid the zero frequency problem.
# If the wanted score > unwanted score then return 'ham', else return 'spam'

#YOUR CODE HERE

"""def predict(text):
  w_score = None
  u_score = None
  for word in None
    if None > 3:
      if word in None:
        w_score = None
      if word in None:
        u_score = None

  #Compare w_score against u_score and return 'ham' or 'spam' label
  if None > None:
    return None
  else:
    return None"""


def predict(text):
    w_score = 1
    u_score = 1
    # Loop over words in test message
    for word in text.split():
        # Ignore 'stopwords' shorter than 4 characters
        if len(word) > 3:
            # If word in ham dictionary, multiply its probability into the wanted score
            if word in ham:
                w_score *= ham[word]
            # If word in spam dictionary, multiply its probability into the unwanted score
            if word in spam:
                u_score *= spam[word]
    # Compare wanted score against unwanted score and return label
    if w_score > u_score:
        return 'ham'
    else:
        return 'spam'


In [8]:
#Test the model
#Loop over the test dataset
#  extract an example (row) and pass the Subject field to the predict function
#  if the returned label matches the example label then increment the correct
#Calculate a percent correct

#YOUR CODE HERE

#correct = 0

"""for i in range(len(test)):
  email = None
  if predict(None) == None:
    correct = None"""

# Test the model
correct = 0
for i in range(len(test)):
    email = test.iloc[i]
    if predict(email.text) == email.label:
        correct += 1

print(correct/len(test))

0.30917874396135264


### Complete Program for Clarity

In [9]:
import pandas as pd

# Read the dataset as a Pandas dataframe.
df2 = pd.read_csv('./spam_training_dataset.csv')

# Split each row on the line return '\r'
df2['Subject'] = df2.text.str.split('\r').str.get(0)

# Setup dataframes with 80% train and 20% test
cutoff = int(len(df2)*0.8)
train = df2[:cutoff]
test = df2[cutoff:]

# Reset the indices of the test rows
test = test.reset_index(drop=True)

# Create empty dictionaries of ham (wanted), and spam (unwanted) messages.
ham = {}
spam = {}

# Loop over length of the training set
for i in range(len(train)):
    X = train.iloc[i] # Read an example (X) == row from the training set

    # If the example is ham then set a reference to the ham dictionary, else set reference to spam dict.
    if X.label == "ham":
        toAppend = ham
    else:
        toAppend = spam

    # Loop over the words in the example.
    for word in X.text.split(' '):
        # Check for 'stopwords' by ignoring any words with length < 4
        if len(word) > 3:
            # Check if word is in the dictionary and increment count if so
            # otherwise add the word to the dictionary and set the count to 1
            if word in toAppend:
                toAppend[word] += 1
            else:
                toAppend[word] = 1

# Set the lengths for each of the ham and spam entries
lenHam = len(train[train.label == "ham"])
lenSpam = len(train[train.label == "spam"])

# Loop over each word and frequency in the dicts and calculate their weightings.
# Ham histogram probabilities
for word, count in ham.items():
    ham[word] = count/lenHam
# Spam histogram probabilities
for word, count in spam.items():
    spam[word] = count/lenSpam

# Finally sort the dictionaries by values in descending order (highest weighting first)
# and keep the items with the top 25 weightings.
ham = dict(sorted(ham.items(), key=lambda item: item[1], reverse=True)[:25])
spam = dict(sorted(spam.items(), key=lambda item: item[1], reverse=True)[:25])

# Make predictions
def predict(text):
    w_score = 0
    u_score = 0
    for word in text.split(' '):
        if len(word) > 3:
            if word in ham:
                w_score += ham[word]
            if word in spam:
                u_score += spam[word]

    # Compare w_score against u_score and return 'ham' or 'spam' label
    if w_score > u_score:
        return 'ham'
    else:
        return 'spam'

# Test the model
correct = 0
for i in range(len(test)):
    email = test.iloc[i]
    if predict(email.text) == email.label:
        correct += 1

print(correct/len(test))


0.8289855072463768
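
When reflecting on performance, accuracy alone can hide how a spam filter fails, since blocking a wanted message and letting spam through are different kinds of error. The sketch below is a minimal illustration with hypothetical label lists and hypothetical helper names (`confusion_counts`, `summarize`); it treats 'spam' as the positive class.

```python
def confusion_counts(actual, predicted, positive='spam'):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn

def summarize(actual, predicted):
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    accuracy = (tp + tn) / len(actual)
    # Precision: of the messages flagged as spam, how many really were spam
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the real spam messages, how many were flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical labels, for illustration only
actual = ['ham', 'ham', 'spam', 'spam', 'ham', 'spam']
predicted = ['ham', 'spam', 'spam', 'ham', 'ham', 'spam']
print(confusion_counts(actual, predicted))  # (2, 1, 1, 2)
print(summarize(actual, predicted))
```

In the notebook, the two lists would come from `test.label` and from calling `predict` on each test row.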


**Enter your model performance and reflections here.**
.
.
.




# Naive Bayes Gaussian Classifier (from scratch) - Part 3

Create a Colab script which uses a Naive Bayes classifier to predict one of the 3 species of flowers based on input attributes from the Iris dataset.
Note, you may NOT use any machine learning libraries (e.g. sklearn) for this section.
- Create an 80/20 split of the dataset (provided)
- Generate the Gaussian distributions for each of your attributes in each of the classes (species) for your training data
- Use the test data to predict the species
- Calculate and record the accuracy of your classifier
- The Iris dataset has been provided as materials for this assignment, however for testing purposes, please run the provided cell which accesses the dataset directly and performs a train/test split on the data (this is the only use of a sklearn function in this section).

Summary of lecture notes:
- Generate an initial guess (prior probabilities) for each of the classes based on the number of examples for each class, e.g.
  - P(Setosa) = 50/150
  - P(Versicolor) = 50/150
  - P(Virginica) = 50/150

  Note: depending on your train/test split, the initial guess probabilities may be different.
- Calculate the score for each of the classes, e.g.

*SetosaScore =  P(Setosa) * g(SepalLength | Setosa) * g(SepalWidth | Setosa) * g(PetalLength | Setosa) * g(PetalWidth | Setosa)*
etc.

where:
$$g(x)= \frac{1}{\sigma \sqrt{2\pi}}\exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right) $$
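
The score formula can be sketched in plain Python. This is an illustration under stated assumptions rather than the required solution: the helper names `gaussian_pdf` and `log_score` are hypothetical, the prior of 1/3 follows the 50/150 example above rather than an actual split, and the score is accumulated as a sum of logarithms, which ranks the classes in the same order as the product. The means and standard deviations are the rounded Setosa and Versicolor training statistics printed later in this notebook.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian density g(x) for mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def log_score(example, prior, means, stds):
    """log(prior) plus the sum of log g(x_i) over the features."""
    score = math.log(prior)
    for x, mu, sigma in zip(example, means, stds):
        score += math.log(gaussian_pdf(x, mu, sigma))
    return score

# Feature order: sepal_length, sepal_width, petal_length, petal_width
example = (4.9, 3.1, 1.5, 0.1)
setosa = log_score(example, 1 / 3,
                   (4.961538, 3.353846, 1.469231, 0.230769),
                   (0.332150, 0.366237, 0.160843, 0.107981))
versicolor = log_score(example, 1 / 3,
                       (5.945946, 2.732432, 4.229730, 1.305405),
                       (0.531557, 0.330869, 0.468369, 0.201310))
print(setosa > versicolor)  # True: the example looks far more like Setosa
```

Summing logs is a design choice that avoids multiplying several very small densities together; the class with the largest log score is also the class with the largest product.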



In [17]:
#Load the dataset into a Pandas dataframe and split the data.
#The cell has been provided to set up a random seed for testing purposes.
#Please do NOT alter the code.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
df = pd.read_csv(csv_url, names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])
#print(df.head())


feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
for col in feature_cols:
  df[col] = pd.to_numeric(df[col])

X_train, X_test, y_train, y_test = train_test_split(df[feature_cols], df['species'], test_size=0.2, random_state=1)
print(X_train)
print(y_train)

     sepal_length  sepal_width  petal_length  petal_width
91            6.1          3.0           4.6          1.4
135           7.7          3.0           6.1          2.3
69            5.6          2.5           3.9          1.1
128           6.4          2.8           5.6          2.1
114           5.8          2.8           5.1          2.4
..            ...          ...           ...          ...
133           6.3          2.8           5.1          1.5
137           6.4          3.1           5.5          1.8
72            6.3          2.5           4.9          1.5
140           6.7          3.1           5.6          2.4
37            4.9          3.1           1.5          0.1

[120 rows x 4 columns]
91     Iris-versicolor
135     Iris-virginica
69     Iris-versicolor
128     Iris-virginica
114     Iris-virginica
            ...       
133     Iris-virginica
137     Iris-virginica
72     Iris-versicolor
140     Iris-virginica
37         Iris-setosa
Name: species, Length: 120,

In [18]:
#Calculate the mean and standard deviation of the training set
#Hint: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
#Hint: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html


#train_mean = None
#train_std = None

X_train['species'] = y_train
train_mean = X_train.groupby('species').mean()
train_std = X_train.groupby('species').std()

#-------------------------------------------------------------------------------------------------
#Test with the following doctest test vectors.
#DO NOT EDIT THE TEST CODE!!!!


import doctest
"""
   >>> print(np.round(train_mean['petal_length']['Iris-setosa'], 4))
   1.4692
   >>> print(np.round(train_std['petal_length']['Iris-setosa'], 4))
   0.1608
   >>> print(np.round(train_mean['sepal_width']['Iris-virginica'], 4))
   2.9523
   >>> print(np.round(train_std['sepal_width']['Iris-virginica'], 4))
   0.3069
"""
doctest.testmod()


TestResults(failed=0, attempted=4)
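
One detail worth knowing when comparing against other implementations: pandas `std()` computes the sample standard deviation (dividing by n - 1) by default, whereas the population form divides by n. The sketch below uses Python's standard `statistics` module and hypothetical petal lengths to show the difference; it is illustrative only.

```python
import statistics

# Hypothetical petal lengths for one species, for illustration only
petal_lengths = [1.4, 1.5, 1.3, 1.6]

sample_std = statistics.stdev(petal_lengths)       # divides by n - 1
population_std = statistics.pstdev(petal_lengths)  # divides by n

print(round(sample_std, 4))      # 0.1291
print(round(population_std, 4))  # 0.1118
```

On small per-class groups the two values differ noticeably, so the choice can shift the Gaussian densities slightly.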

In [19]:
#Prediction
#  def predict(example):
#    init best_score and best_species
#    loop over each of the species
#       set initial species_score=0
#       loop over the feature_cols (defined in the provided cell)
#           extract the mean and std for the species feature
#           calculate the species_score (accumulate products)
#       if species_score > best_score
#           update best_score and best_species
#     return best_species

import math

"""def predict(example):
  best_score = None
  best_species = None
  for species in None:
    species_score = None
    for col in None:
      s = None
      m = None
      species_score = None
    if None > None:
      best_species = None
      best_score = None
  return best_species"""

def predict(example):
  best_score = -math.inf
  best_species = None
  for species in y_train.unique():
    species_score = 0
    for col in feature_cols:
      x = example[col]
      mean = train_mean[col][species]
      std = train_std[col][species]
      exponent = math.exp(-((x-mean)**2 / (2 * std**2 )))
      species_score += math.log((1 / (math.sqrt(2 * math.pi) * std)) * exponent)
    if species_score > best_score:
      best_species = species
      best_score = species_score
  return best_species



In [20]:
#Test the classifier

#-------------------------------------------------------------------------------------------------
#Test with the following doctest test vectors.
#DO NOT EDIT THE TEST CODE!!!!

t = 0
for i in range(len(X_test)):
  test_example = X_test.iloc[i]
  #print(test_example)
  pred = predict(test_example)
  actual = y_test.iloc[i]
  if pred == actual:
    t+=1


import doctest
"""
   >>> print(np.round(t/len(X_test), 4))
   0.9667
"""

doctest.testmod()


TestResults(failed=0, attempted=1)

### Part 3: Code all in one place for clarity

In [21]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import math

csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
df = pd.read_csv(csv_url, names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])

feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
for col in feature_cols:
  df[col] = pd.to_numeric(df[col])

X_train, X_test, y_train, y_test = train_test_split(df[feature_cols], df['species'], test_size=0.2, random_state=1)

#Calculate the mean and standard deviation of the training set
X_train['species'] = y_train
train_mean = X_train.groupby('species').mean()
train_std = X_train.groupby('species').std()

#Print train_mean and train_std for doctest
print(train_mean)
print(train_std)

#Prediction
def predict(example):
  best_score = -math.inf
  best_species = None
  for species in y_train.unique():
    species_score = 0
    for col in feature_cols:
      x = example[col]
      mean = train_mean[col][species]
      std = train_std[col][species]
      exponent = math.exp(-((x-mean)**2 / (2 * std**2 )))
      species_score += math.log((1 / (math.sqrt(2 * math.pi) * std)) * exponent)
    if species_score > best_score:
      best_species = species
      best_score = species_score
  return best_species

#Test the classifier
t = 0
for i in range(len(X_test)):
  test_example = X_test.iloc[i]
  pred = predict(test_example)
  actual = y_test.iloc[i]
  if pred == actual:
    t+=1

print(np.round(t/len(X_test), 4))


#-------------------------------------------------------------------------------------------------
#Test with the following doctest test vectors.
#DO NOT EDIT THE TEST CODE!!!!


import doctest
"""
   >>> print(np.round(train_mean['petal_length']['Iris-setosa'], 4))
   1.4692
   >>> print(np.round(train_std['petal_length']['Iris-setosa'], 4))
   0.1608
   >>> print(np.round(train_mean['sepal_width']['Iris-virginica'], 4))
   2.9523
   >>> print(np.round(train_std['sepal_width']['Iris-virginica'], 4))
   0.3069
   >>> print(np.round(t/len(X_test), 4))
   0.9667
"""

doctest.testmod()

                 sepal_length  sepal_width  petal_length  petal_width
species                                                              
Iris-setosa          4.961538     3.353846      1.469231     0.230769
Iris-versicolor      5.945946     2.732432      4.229730     1.305405
Iris-virginica       6.525000     2.952273      5.534091     2.020455
                 sepal_length  sepal_width  petal_length  petal_width
species                                                              
Iris-setosa          0.332150     0.366237      0.160843     0.107981
Iris-versicolor      0.531557     0.330869      0.468369     0.201310
Iris-virginica       0.628814     0.306889      0.556962     0.284138
0.9667


TestResults(failed=0, attempted=5)

# Naive Bayes Gaussian Classifier (sklearn) - Part 4

Repeat Part 3 using Scikit-learn functions:

Hint: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

In [15]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
df = pd.read_csv(csv_url, names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])
#print(df.head())


feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
for col in feature_cols:
  df[col] = pd.to_numeric(df[col])

X_train, X_test, y_train, y_test = train_test_split(df[feature_cols], df['species'], test_size=0.2, random_state=1)
print(X_train)
print(y_train)

     sepal_length  sepal_width  petal_length  petal_width
91            6.1          3.0           4.6          1.4
135           7.7          3.0           6.1          2.3
69            5.6          2.5           3.9          1.1
128           6.4          2.8           5.6          2.1
114           5.8          2.8           5.1          2.4
..            ...          ...           ...          ...
133           6.3          2.8           5.1          1.5
137           6.4          3.1           5.5          1.8
72            6.3          2.5           4.9          1.5
140           6.7          3.1           5.6          2.4
37            4.9          3.1           1.5          0.1

[120 rows x 4 columns]
91     Iris-versicolor
135     Iris-virginica
69     Iris-versicolor
128     Iris-virginica
114     Iris-virginica
            ...       
133     Iris-virginica
137     Iris-virginica
72     Iris-versicolor
140     Iris-virginica
37         Iris-setosa
Name: species, Length: 120,

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


#Note, use the X_train, X_test, y_train, y_test from the provided
#cell in Part 3

#Instantiate the Naive Bayes classifier

clf = GaussianNB()

#Fit the model

clf.fit(X_train, y_train)

#Run the predictions

y_pred = clf.predict(X_test)

#Calculate the score

"""clf = None
clf.None
y_pred = clf.None
"""
score = clf.score(X_test, y_test)


#-------------------------------------------------------------------------------------------------
#Test with the following doctest test vectors.
#DO NOT EDIT THE TEST CODE!!!!

import doctest
"""
   >>> print(np.round(clf.score(X_test, y_pred), 4))
   1.0
"""

doctest.testmod()


TestResults(failed=0, attempted=1)
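
Note that the doctest above scores the test features against `y_pred`, the model's own predictions, so it reports 1.0 for any fitted model; the accuracy against the held-out labels is the value stored in `score`. The pure-Python sketch below illustrates the difference with hypothetical labels and a hypothetical `accuracy` helper.

```python
def accuracy(y_true, y_hat):
    """Fraction of positions where the two label sequences agree."""
    matches = sum(t == p for t, p in zip(y_true, y_hat))
    return matches / len(y_true)

# Hypothetical labels, for illustration only
y_test_demo = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-virginica']
y_pred_demo = ['Iris-setosa', 'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica']

print(accuracy(y_pred_demo, y_pred_demo))  # 1.0, predictions always agree with themselves
print(accuracy(y_test_demo, y_pred_demo))  # 0.75, agreement with the true labels
```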