# CSCM35, CSLM35 Big Data and Data Mining
### by Dr. Jingjing Deng

This data set contains sepal and petal measurements (length and width) for three types of iris (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray.

The rows are the samples and the columns are Sepal Length, Sepal Width, Petal Length, and Petal Width.

Reference:
- https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html

## Classification with Naive Bayes

In [58]:
import matplotlib.pyplot as plt
from sklearn import datasets
from collections import defaultdict
# importing Statistics module
import statistics as stat
from scipy import stats
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
print("Dataset loaded")

Dataset loaded


### Cross Validation
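A single stratified hold-out split (80% train, 20% test) is used here rather than k-fold cross-validation.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html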

In [59]:
print(iris.data.shape)
print(iris.target.shape)

(150, 4)
(150,)


In [60]:
import numpy as np
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(iris.data,iris.target,test_size=0.2, stratify = iris.target,random_state = 42)

print("x_train shape:",x_train.shape,"x_test shape:",x_test.shape)
print("y_train shape:",y_train.shape,"y_test shape:",y_test.shape)
species , counts = np.unique(y_train, return_counts = True)
print(np.asarray((species, counts)).T)
species , counts = np.unique(y_test, return_counts = True)
print(np.asarray((species, counts)).T)

x_train shape: (120, 4) x_test shape: (30, 4)
y_train shape: (120,) y_test shape: (30,)
[[ 0 40]
 [ 1 40]
 [ 2 40]]
[[ 0 10]
 [ 1 10]
 [ 2 10]]


Scaling the data

In [61]:
stdscale = StandardScaler()
x_train = stdscale.fit_transform(x_train)
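# Reuse the statistics fitted on the training set; refitting on x_test would scale the two sets differently
x_test = stdscale.transform(x_test)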
print("Data is scaled via the Standard Scaler")

Data is scaled via the Standard Scaler


The terminology used in the Bayesian method of probability is as follows:
<br>A is called the proposition.
<br>B is called the evidence.
<br>P(A) is called the prior probability of the proposition.
<br>P(B) is called the prior probability of the evidence.
<br>P(A|B) is called the posterior probability.
<br>P(B|A) is called the likelihood.
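
As a small numeric illustration of how these terms combine under Bayes' theorem, P(A|B) = P(B|A) * P(A) / P(B), the sketch below uses made-up probabilities. The values and variable names are illustrative assumptions and are not taken from the iris data.

```python
# Hypothetical numbers chosen only to illustrate Bayes' theorem.
p_a = 0.3              # P(A): prior probability of the proposition
p_b_given_a = 0.8      # P(B|A): likelihood of the evidence if A holds
p_b_given_not_a = 0.1  # P(B|not A): likelihood of the evidence if A does not hold

# P(B): prior probability of the evidence, via the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# P(A|B): posterior probability of the proposition given the evidence
p_a_given_b = p_b_given_a * p_a / p_b

print("P(B)   = %.4f" % p_b)
print("P(A|B) = %.4f" % p_a_given_b)
```

The classifier built later follows the same pattern, with A standing for a class and B for the observed feature values.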

Reference:
- https://medium.com/@rangavamsi5/na%C3%AFve-bayes-algorithm-implementation-from-scratch-in-python-7b2cc39268b9
- https://www.geeksforgeeks.org/naive-bayes-classifiers/
- https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/
        

# Grouping the data
Each class is mapped to the individual samples belonging to that class.

In [62]:
def group_class(data,target):
    # Map each class label to the list of feature vectors that carry that label
    target_map = defaultdict(list)
    for features, label in zip(data, target):
        if not features.any():
            continue  # skip rows whose features are all zero
        target_map[label].append(features)  # the label comes from target, not from a column of data
    return target_map

In [63]:
group = group_class(x_train,y_train)
print ("Grouped into %s classes: %s" % (len(group.keys()), group.keys()))
# print(group[0])

Grouped into 3 classes: dict_keys([0, 2, 1])


Reference for mean and standard deviation and zip:
- https://docs.python.org/3/library/statistics.html
- https://www.geeksforgeeks.org/statistical-functions-python-set-1averages-measure-central-location/
- https://realpython.com/python-zip-function/

# Summary
Compute the (mean, standard deviation) pair for each feature of the training set.

In [64]:
def summarize(samples):
    """
    Use zip to transpose the rows so that each tuple holds one feature across all samples.
    Yield the sample standard deviation and the mean for each feature.
    """
    for feature in zip(*samples):
        yield {
            'stdev': stat.stdev(feature),
            'mean': stat.mean(feature)
        }
print("Summary function defined")
#usage: 
print ("Feature Summary: %s" % [i for i in summarize(x_train)])

Summary function defined
Feature Summary: [{'stdev': 1.0041928905068676, 'mean': -1.2065580016577352e-15}, {'stdev': 1.0041928905068678, 'mean': -1.9935442185925468e-15}, {'stdev': 1.0041928905068678, 'mean': 4.844504427244563e-16}, {'stdev': 1.0041928905068676, 'mean': 1.6581354345124311e-15}]
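
The standard deviations reported above are about 1.0042 rather than exactly 1. A likely explanation is that `StandardScaler` normalises with the population standard deviation (denominator n), while `statistics.stdev` computes the sample standard deviation (denominator n - 1), so with 120 training rows the ratio is sqrt(120/119), roughly 1.0042. A minimal sketch that checks this on synthetic data (the random array is an assumption standing in for one column of `x_train`):

```python
import math
import statistics as stat

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for a single feature with 120 samples
data = rng.normal(loc=5.0, scale=2.0, size=(120, 1))

scaled = StandardScaler().fit_transform(data)[:, 0]

population_std = np.std(scaled)           # ddof=0, the quantity StandardScaler sets to 1
sample_std = stat.stdev(scaled.tolist())  # n - 1 denominator, as in statistics.stdev

print(population_std, sample_std, math.sqrt(120 / 119))
```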


# Building the model
Features and class<br>
Features: 
- Sepal length sl
- Sepal width sw
- Petal length pl
- Petal width pw
<br><br>
Class:
- Setosa s
- Versicolor ve
- Virginica vi


## Posterior = (Class Prior*Likelihood)/(Predictor Prior)

In [65]:
#probabilities of individual categories
#Class Prior
def prior_prob(group, target, data):
    total = float(len(data))
    result = len(group[target]) / total
    return result

for target_class in [0, 1, 2]:
    prior_probcalled = prior_prob(group, target_class, x_train)
    print('P(%s): %s' % (target_class, prior_probcalled))

P(0): 0.3333333333333333
P(1): 0.3333333333333333
P(2): 0.3333333333333333


# Train the model
We calculate the mean and standard deviation of each feature to learn from the training set. Using the grouped classes above, the (mean, standard deviation) pair is computed for each feature of each class.

These summaries are used to calculate the class likelihoods.

Reference: 
- https://www.geeksforgeeks.org/difference-between-dict-items-and-dict-iteritems-in-python/

In [66]:
def train(train_list, target):
    '''
    For each class, store:
        1. prior_probcalled: the prior probability of the class, P(class), e.g. P(Iris-virginica)
        2. Summary: list of {'mean': 0.0, 'stdev': 0.0} dicts, one per feature
    Return a dict mapping each class to these two entries.
    '''
    group = group_class(train_list, target)
    summaries = {}
    for label, features in group.items():
        summaries[label] = {
            'prior_probcalled': prior_prob(group, label, train_list),
            'Summary': [i for i in summarize(features)],
        }
    return summaries

summaries = train(x_train, y_train)
print(summaries)

{0: {'prior_probcalled': 0.3333333333333333, 'Summary': [{'stdev': 0.3683599758072141, 'mean': -1.0229893930683729}, {'stdev': 0.8934701230479329, 'mean': 0.820924097042234}, {'stdev': 0.09136539442393624, 'mean': -1.301716621475558}, {'stdev': 0.1518922489233574, 'mean': -1.2508575153646249}]}, 2: {'prior_probcalled': 0.3333333333333333, 'Summary': [{'stdev': 0.8179090891738221, 'mean': 0.9175060509815539}, {'stdev': 0.7922953328073759, 'mean': -0.1529903999033278}, {'stdev': 0.32712147411454845, 'mean': 1.0277457294965155}, {'stdev': 0.3538084970102825, 'mean': 1.0994379213994365}]}, 1: {'prior_probcalled': 0.3333333333333333, 'Summary': [{'stdev': 0.5693008500693436, 'mean': 0.1054833420868153}, {'stdev': 0.6914658560795356, 'mean': -0.6679336971389123}, {'stdev': 0.25164140430617404, 'mean': 0.2739708919790439}, {'stdev': 0.24624029261503796, 'mean': 0.1514195939651933}]}}


# Likelihood
The likelihood is the product of the normal probability densities of each feature given the class, e.g. for Setosa: P(sl|s) × P(sw|s) × P(pl|s) × P(pw|s)

Reference:
- https://en.wikipedia.org/wiki/Normal_distribution

Once we have the likelihood function, we use it to calculate the <b>joint probabilities</b> as the product of the prior probability and the likelihood.

Reference:
- https://stackoverflow.com/questions/43602270/what-is-probability-density-function-in-the-context-of-scipy-stats-norm
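
For reference, the normal density used for each per-feature term is N(x; µ, σ) = exp(-(x - µ)² / (2σ²)) / (σ √(2π)). The sketch below evaluates it by hand and accumulates the product over four features, alongside the equivalent log-space sum using `scipy.stats.norm.logpdf`, which is a common way to avoid underflow when many small densities are multiplied. The feature values, means, and standard deviations are invented for illustration, and `gaussian_pdf` is a hypothetical helper name.

```python
import math

from scipy import stats


def gaussian_pdf(x, mean, stdev):
    """Normal density N(x; mean, stdev) written out explicitly."""
    exponent = -((x - mean) ** 2) / (2 * stdev ** 2)
    return math.exp(exponent) / (stdev * math.sqrt(2 * math.pi))


# Illustrative values only: one sample with four features and one class summary.
sample = [-0.9, 0.7, -1.3, -1.2]
means = [-1.0, 0.8, -1.3, -1.25]
stdevs = [0.37, 0.89, 0.09, 0.15]

likelihood = 1.0      # product of the per-feature densities
log_likelihood = 0.0  # sum of the per-feature log densities
for x, mu, sigma in zip(sample, means, stdevs):
    likelihood *= gaussian_pdf(x, mu, sigma)
    log_likelihood += stats.norm.logpdf(x, loc=mu, scale=sigma)

# Both routes should agree up to floating-point error
print(likelihood, math.exp(log_likelihood))
```

Note that these are densities, not probabilities, so individual terms can exceed 1; only the normalised posterior is a probability.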

In [67]:
def normal_pdf(x,mean,stdev):
    return stats.norm(mean,stdev).pdf(x)

def get_prediction(test_vector):
    '''
    :param test_vector: single list of features to test
    :return:
    Return the target class with the largest/best posterior probability
    '''
    posterior_probs = posterior_probabilities(test_vector)
    best_target = max(posterior_probs, key=posterior_probs.get)
    return best_target
    
def joint_probabilities(test_row):
    '''
    :param test_row: single list of features to test; new data
    :return:
    Use the normal_pdf(x, mean, stdev) to calculate the Normal Probability for each feature
    Take the product of all Normal Probabilities and the Prior Probability.
    '''
    joint_probs = {}
    for target, features in summaries.items():
        total_features = len(features['Summary'])
        likelihood = 1
        for index in range(total_features):
            feature = test_row[index]
            mean = features['Summary'][index]['mean']
            stdev = features['Summary'][index]['stdev']
            normal_prob = normal_pdf(feature, mean, stdev)
            likelihood *= normal_prob
        prior_prob = features['prior_probcalled']
        joint_probs[target] = prior_prob * likelihood
    return joint_probs

def posterior_probabilities(test_row):
    '''
    :param test_row: single list of features to test; new data
    :return: dictionary mapping each class to its posterior probability
    For each class:
        1. Calculate the likelihood as the product of the Normal PDFs N(x; µ, σ) of each feature (x), i.e. P(feature | class)
        2. Multiply the likelihood by the class prior to get the joint probability.
        3. Divide the joint probability by the marginal probability (the sum of the joint probabilities over all classes).
    E.g.
    prior_prob: P(setosa)
    likelihood: P(sepal length | setosa) * P(sepal width | setosa) * P(petal length | setosa) * P(petal width | setosa)
    joint_prob: prior_prob * likelihood
    marginal_prob: predictor prior probability
    posterior_prob = joint_prob / marginal_prob
    '''
    posterior_probs = {}
    joint_probabilities1 = joint_probabilities(test_row)
    marginal_prob = marginal_pdf(joint_probabilities1)
    for target, joint_prob in joint_probabilities1.items():
        posterior_probs[target] = joint_prob / marginal_prob
    return posterior_probs

def marginal_pdf(joint_probabilities1):
    '''
    :param joint_probabilities1: dict mapping each class to its joint probability
    :return:
    Marginal Probability Density Function (Predictor Prior Probability)
    Joint Probability = prior * likelihood
    Marginal Probability is the sum of all joint probabilities for all classes.
    marginal_pdf =
      [P(setosa) * P(sepal length | setosa) * P(sepal width | setosa) * P(petal length | setosa) * P(petal width | setosa)]
    + [P(versicolour) * P(sepal length | versicolour) * P(sepal width | versicolour) * P(petal length | versicolour) * P(petal width | versicolour)]
    + [P(virginica) * P(sepal length | virginica) * P(sepal width | virginica) * P(petal length | virginica) * P(petal width | virginica)]
    '''
    marginal_prob = sum(joint_probabilities1.values())
    return marginal_prob

def predict(test_set):
    '''
    Predict the likeliest target for each row of the test_set.
    Return a list of predicted targets.
    '''
    predictions = []
    for row in test_set:
        result = get_prediction(row)
        predictions.append(result)
    return predictions


def accuracy(test_set, predicted):
    '''
    :param test_set: list of actual classes for the test data
    :param predicted: list of predicted classes
    :return:
    Calculate the fraction of predictions that match the actual classes.
    '''
    correct = 0
    actual = test_set
    for x, y in zip(actual, predicted):
        if x == y:
            correct += 1
    return correct / float(len(test_set))

In [68]:
# Debugging note: the first attempt compared a Python list with a NumPy array
print(predicted_list)
print(y_test)
# accuracy = accuracy(y_test, predicted_list)
# print('Accuracy: %.3f' % accuracy)
print("type of predicted_list:",type(predicted_list))
print("Type of y_test:",type(y_test))

[0, 2, 1, 1, 0, 2, 0, 0, 2, 1, 2, 2, 2, 1, 0, 0, 0, 1, 1, 2, 0, 2, 1, 2, 2, 2, 1, 0, 2, 0]
[0 2 1 1 0 1 0 0 2 1 2 2 2 1 0 0 0 1 1 2 0 2 1 2 2 1 1 0 2 0]
type of predicted_list: <class 'list'>
Type of y_test: <class 'numpy.ndarray'>


In [55]:
print(y_test.tolist())

[0, 2, 1, 1, 0, 1, 0, 0, 2, 1, 2, 2, 2, 1, 0, 0, 0, 1, 1, 2, 0, 2, 1, 2, 2, 1, 1, 0, 2, 0]


In [69]:
predicted_list = predict(x_test)
# Store the score under a new name so the accuracy() function is not overwritten
acc = accuracy(y_test.tolist(), predicted_list)
print('Accuracy: %.3f' % (acc*100))

Accuracy: 93.333
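
As an optional cross-check, scikit-learn ships a Gaussian Naive Bayes classifier, `GaussianNB`. The sketch below repeats the same stratified 80/20 split on the unscaled data and reports its accuracy. It is an independent reference rather than the implementation above, and small differences are possible because the two may estimate the per-class variances slightly differently.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=42
)

# Fit the library classifier on the training split and score it on the held-out split
model = GaussianNB().fit(x_train, y_train)
reference_accuracy = accuracy_score(y_test, model.predict(x_test))
print("GaussianNB accuracy: %.3f" % (reference_accuracy * 100))
```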


REFERENCES: 

https://stackoverflow.com/questions/68799909/classification-accuracy-with-sklearn-in-percentage 