# Machine Learning Lab 1.1 - Naive Bayes Pt 1

The goal of this lab is to build a Naive Bayes classifier from scratch that classifies food reviews as positive or negative. Effectively, you are automating the process we performed in class.

### Imports

In [25]:
# RUN THIS
import numpy as np

## Load Data

The cell below loads the reviews and their corresponding labels from `simple-food-reviews.txt` into two lists, `labels` and `reviews`.

Each review has a label denoting if the review is positive (1) or negative (-1). 

In [26]:
# RUN THIS
labels = []
reviews = []
with open("simple-food-reviews.txt", "r") as f:
    lines = f.readlines()
    for line in lines:
        line = line.replace("\n", "")
        words = line.split(" ")
        label = int(words[0])
        review = " ".join(words[1:])
        labels.append(label)
        reviews.append(review)
        
print("Labels: \n", labels)
print("Reviews: \n", reviews)

Labels: 
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
Reviews: 
 ['the food is lovely', 'this is a great restaurant', 'i really enjoyed my food', 'i enjoyed the experience at the restaurant', 'we had a lovely meal', 'my food tasted great', 'the food was lovely and the service was not bad', 'the service was great', 'what a lovely restaurant', 'the food the service and the restaurant was great', 'this restaurant is lovely', 'the service is terrible', 'the food tasted awful', 'this is a bad restaurant  ', 'the food was really bad', 'the service and the food was terrible', 'we had a terrible experience', 'avoid this restaurant', 'avoid the food', 'the meal was terrible', 'the service was bad']
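
As a rough illustration of the file layout the loader relies on (label first, then the review text), a single line can be parsed as sketched below. The `sample_line` string and the other `sample_*` names are made up for this example and are not read from the file.

```python
# Hypothetical sample line in the "<label> <review text>" layout used above
sample_line = "-1 the service was bad\n"

# Split once on the first space: the label comes first, the rest is the review
label_text, review_text = sample_line.rstrip("\n").split(" ", 1)
sample_label = int(label_text)

print(sample_label, repr(review_text))
```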


## Process Features

#### Question 1
Create a 'bag of words' from the dataset of `reviews`. This 'bag of words' should be a list containing each **unique** word contained in the dataset. Do not do this manually, i.e. do not just directly specify `bag_of_words = ['the', 'food', 'is', 'lovely', ...]`. If you do this the answer will be marked wrong.

In [27]:
bag_of_words = []
for sentence in reviews:
    for word in sentence.split():
        # Append each word only the first time it is seen, preserving order
        if word not in bag_of_words:
            bag_of_words.append(word)
            
print(bag_of_words)


['the', 'food', 'is', 'lovely', 'this', 'a', 'great', 'restaurant', 'i', 'really', 'enjoyed', 'my', 'experience', 'at', 'we', 'had', 'meal', 'tasted', 'was', 'and', 'service', 'not', 'bad', 'what', 'terrible', 'awful', 'avoid']
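
One possible alternative, sketched here on a made-up two-review dataset (`sample_reviews` and `sample_bag` are illustrative names, not part of the lab), is to rely on the fact that dictionary keys keep their insertion order in Python 3.7+. This removes duplicates while keeping the order in which words first appear.

```python
# Hypothetical mini-dataset, used only for illustration
sample_reviews = ["the food is lovely", "the food tasted awful"]

# dict keys keep insertion order (Python 3.7+), so this drops duplicates
# while preserving the order in which each word first appears
sample_bag = list(dict.fromkeys(
    word for review in sample_reviews for word in review.split()
))

print(sample_bag)
```

A plain `set` would also remove duplicates, but it does not guarantee the first-appearance ordering that the feature vectors in this lab depend on.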


#### Question 2
Using the bag of words, finish the function `create_features`, which transforms a text review into a boolean vector indicating which words in `bag_of_words` are present in that review.

The ordering of the words represented in the vector should follow the order in which those words first appear in the dataset. I.e., the first element of the first vector should correspond to 'the', the second should correspond to 'food' and so on.

The correct representation for the first review is given to test the correctness of your implementation of `create_features`.

In [28]:
def create_features(review):
    # Words present in this review (a set makes membership checks cheap)
    review_words = set(review.split())

    # One entry per word in bag_of_words, in order: 1 if present, else 0
    review_features = [1 if word in review_words else 0 for word in bag_of_words]

    # Note: return a python list of 1s or 0s 
    return review_features


# DO NOT MODIFY ANYTHING UNDER HERE
review_1_features = create_features(reviews[0])
print(f"Your review 1 features: {review_1_features}")

correct_features = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(f"Correct review 1 features: {correct_features}\n")

if review_1_features == correct_features:
    print(f"create_features correct for first review: review_1_features = correct_features")
else:
    print(f"create_features incorrect for first review: review_1_features != correct_features")

Your review 1 features: [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Correct review 1 features: [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

create_features correct for first review: review_1_features = correct_features
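
Note that the vector records presence, not counts. The small sketch below uses a made-up vocabulary and review (`sample_vocab`, `sample_review`) to show that a word repeated in a review still yields a single 1.

```python
# Hypothetical vocabulary and review, for illustration only
sample_vocab = ["the", "food", "service", "great"]
sample_review = "the food the service"

# 'the' appears twice in the review but is only marked as present once
words_present = set(sample_review.split())
sample_vector = [1 if word in words_present else 0 for word in sample_vocab]

print(sample_vector)
```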


#### Process All Reviews

In [29]:
# RUN THIS
review_features = [create_features(review) for review in reviews]
print("First two feature vectors:")
print(review_features[:2])

First two feature vectors:
[[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]


## Train Test Split
This time round we will not create a validation split since we are not changing any hyperparameters.

This will be done for you in this lab. In this case the train and test reviews are chosen manually for convenience, but this is not usually the case; usually they are chosen randomly.
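
For reference, a random split could look roughly like the sketch below. It is not used in this lab; the toy data, the seed value, and the 10% test fraction are all assumptions made for illustration.

```python
import random

# Hypothetical toy data standing in for the lab's reviews and labels
sample_reviews = ["good", "great", "lovely", "bad", "awful",
                  "terrible", "fine", "poor", "nice", "grim"]
sample_labels = [1, 1, 1, -1, -1, -1, 1, -1, 1, -1]

rng = random.Random(0)  # fixed seed (an arbitrary choice) so the split is repeatable
indices = list(range(len(sample_reviews)))
rng.shuffle(indices)

n_test = max(1, round(0.1 * len(indices)))  # hold out roughly 10% for testing
test_idx, train_idx = indices[:n_test], indices[n_test:]

sample_train = [(sample_reviews[i], sample_labels[i]) for i in train_idx]
sample_test = [(sample_reviews[i], sample_labels[i]) for i in test_idx]

print(len(sample_train), len(sample_test))
```

Shuffling indices rather than the lists themselves keeps each review paired with its label.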

**N.B.** The below code block was modified from the previous version

In [30]:
# RUN THIS

print("###############")
print("# TRAIN SPLIT #")
print("###############\n")
# Train Split - ~90%
train_reviews = reviews[:10] + reviews[11:-1]
train_features = review_features[:10] + review_features[11:-1]
train_labels = labels[:10] + labels[11:-1]

print("train_reviews:\n", train_reviews)
print("train_features:\n", train_features)
print("train_labels\n", train_labels)


print("\n###############")
print("# TEST SPLIT #")
print("###############\n")
# Test Split - ~10%
test_reviews = [reviews[10], reviews[-1]]
test_features = [review_features[10], review_features[-1]]
test_labels = [labels[10], labels[-1]]

print("test_reviews:\n", test_reviews)
print("test_features:\n", test_features)
print("test_labels\n", test_labels)

###############
# TRAIN SPLIT #
###############

train_reviews:
 ['the food is lovely', 'this is a great restaurant', 'i really enjoyed my food', 'i enjoyed the experience at the restaurant', 'we had a lovely meal', 'my food tasted great', 'the food was lovely and the service was not bad', 'the service was great', 'what a lovely restaurant', 'the food the service and the restaurant was great', 'the service is terrible', 'the food tasted awful', 'this is a bad restaurant  ', 'the food was really bad', 'the service and the food was terrible', 'we had a terrible experience', 'avoid this restaurant', 'avoid the food', 'the meal was terrible']
train_features:
 [[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0,

## Naive Bayes - The fun stuff

### Learning

#### Question 3
Compute the prior probabilities of the classes/labels 'negative' (-1) and 'positive' (1) using the training dataset. Set these probabilities to the variables `p_neg` and `p_pos` respectively.

In [31]:
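# Prior of each class = fraction of training reviews with that label
p_pos = train_labels.count(1) / len(train_labels)
p_neg = train_labels.count(-1) / len(train_labels)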

print(f"p_pos = {p_pos}")
print(f"p_neg = {p_neg}")

p_pos = 0.5263157894736842
p_neg = 0.47368421052631576
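
Written out, the priors above are the fraction of training reviews in each class, where $N_c$ is the number of training reviews with label $c$ and $N$ is the total number of training reviews. The training split contains 10 positive and 9 negative reviews, so:

$$
P(C = c) = \frac{N_c}{N}, \qquad
P(C = \text{positive}) = \frac{10}{19} \approx 0.526, \qquad
P(C = \text{negative}) = \frac{9}{19} \approx 0.474
$$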


#### Question 4

Compute the class conditional models for each feature and class. Each element should correspond to the conditional probability of a word occurring given an associated class. The rows should correspond to the classes, with the first row corresponding to `positive` and the second to `negative`. The columns should correspond to the features and should have the same ordering as the words in the vector representation of reviews, matching the shape `(2, len(bag_of_words))` of the array created for you below.

I.e., I wish for you to create a similar table (specifically a numpy array) as the one seen in the lecture.

In [36]:
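# The variable is created for you but is filled with zeros
class_conditionals = np.zeros((2, len(bag_of_words)))

# Number of training reviews in each class
n_pos = train_labels.count(1)
n_neg = train_labels.count(-1)

for i in range(len(bag_of_words)):
    # Count the training reviews of each class that contain word i
    pos_count = 0
    neg_count = 0
    for features, label in zip(train_features, train_labels):
        if features[i] == 1 and label == 1:
            pos_count += 1
        elif features[i] == 1 and label == -1:
            neg_count += 1
    # Row 0 holds P(word | positive), row 1 holds P(word | negative)
    class_conditionals[0, i] = pos_count / n_pos
    class_conditionals[1, i] = neg_count / n_neg
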
print(f"\nclass_conditionals =\n {class_conditionals}")


class_conditionals =
 [[0.5        0.5        0.2        0.4        0.1        0.3
  0.4        0.4        0.2        0.1        0.2        0.2
  0.1        0.1        0.1        0.1        0.1        0.1
  0.3        0.2        0.3        0.1        0.1        0.1
  0.         0.         0.        ]
 [0.66666667 0.44444444 0.22222222 0.         0.22222222 0.22222222
  0.         0.22222222 0.         0.11111111 0.         0.
  0.11111111 0.         0.11111111 0.11111111 0.11111111 0.11111111
  0.33333333 0.11111111 0.22222222 0.         0.22222222 0.
  0.44444444 0.11111111 0.22222222]]
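
The same counting can be expressed more compactly with NumPy. The sketch below uses made-up toy data (`sample_features`, `sample_labels`) rather than the lab's variables, and relies on the fact that the mean of a 0/1 column equals the fraction of rows containing a 1.

```python
import numpy as np

# Hypothetical toy data: 4 reviews over a 3-word vocabulary
sample_features = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
])
sample_labels = np.array([1, 1, -1, -1])

# For each class, the mean of a 0/1 column is the fraction of that class's
# reviews containing the word, i.e. an estimate of P(word present | class)
sample_conditionals = np.vstack([
    sample_features[sample_labels == 1].mean(axis=0),   # row 0: positive
    sample_features[sample_labels == -1].mean(axis=0),  # row 1: negative
])

print(sample_conditionals)
```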


### Inference

Next we will infer the associated class/label for the review 'the service was bad', whose features were computed previously.

In [16]:
# RUN THIS
infer_review = test_reviews[1]
infer_features = test_features[1]

print("Infer Review: ", infer_review)
print("Infer Features: ", infer_features)

Infer Review:  the service was bad
Infer Features:  [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0]


#### Question 5
Compute the class conditional models for `infer_features` for both the positive and negative classes. Remember to use the conditional independence assumption.

**a)** Compute P(X=infer_features|C=positive) 

In [None]:
# Class conditional model for infer_features and positive class
class_cond_pos = ...

print(class_cond_pos)

**b)** Compute P(X=infer_features|C=negative) 

In [None]:
# Class conditional model for infer_features and negative class
class_cond_neg = ...

print(class_cond_neg)
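
To illustrate the conditional independence assumption, here is a minimal sketch on made-up numbers (not the lab's values). It assumes the Bernoulli form of Naive Bayes, in which a word that is present contributes P(word | class) and a word that is absent contributes 1 - P(word | class); check this against the convention used in the lecture. The names `sample_theta` and `sample_x` are hypothetical.

```python
import numpy as np

# Hypothetical values for a 3-word vocabulary (not the lab's numbers)
sample_theta = np.array([0.5, 0.2, 0.4])  # P(word present | class)
sample_x = np.array([1, 0, 1])            # which words the review contains

# Present words contribute theta, absent words contribute (1 - theta);
# conditional independence lets us multiply the per-word terms
per_word = np.where(sample_x == 1, sample_theta, 1 - sample_theta)
sample_likelihood = per_word.prod()

print(sample_likelihood)
```

Here the product is 0.5 × 0.8 × 0.4.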

#### Question 6
Compute the probability of infer_features (i.e. the normalisation term) using the formula provided in the last lecture. 

I.e. compute P(X=infer_features)

In [None]:
# TODO: Calc value of P(X=infer_features) and set to p_infer_features
p_infer_features = ... 

print(p_infer_features)
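
The lecture formula is not reproduced in this notebook. Assuming it is the law of total probability over the two classes, the normalisation term for a feature vector $x$ would be:

$$
P(X = x) = P(X = x \mid C = \text{positive})\,P(C = \text{positive}) + P(X = x \mid C = \text{negative})\,P(C = \text{negative})
$$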

#### Question 7
Now, using the previous results, compute the conditional probability of the class 'positive' given the data infer_features, and then compute the same for 'negative'.

**a)** Compute P(C=positive|infer_features)

In [None]:
# TODO: Set p_cond_pos to P(C=positive|infer_features)
p_cond_pos = ...

print(p_cond_pos)

**b)** Compute P(C=negative|infer_features)

In [None]:
# TODO: Set p_cond_neg to P(C=negative|infer_features)
p_cond_neg = ...

print(p_cond_neg)

**Tip:** Check your answers for `p_cond_pos` and `p_cond_neg` by summing them and seeing if they add up to 1, which they should, due to the sum rule.

In [None]:
p_cond_pos + p_cond_neg
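
As a sanity check of the idea, the sketch below applies Bayes' rule to made-up likelihoods and priors (all `sample_*` names and values are hypothetical, not the lab's answers) and confirms that the two posteriors sum to 1.

```python
# Hypothetical likelihoods and priors, chosen only for illustration
sample_like_pos, sample_like_neg = 0.02, 0.06
sample_prior_pos, sample_prior_neg = 0.5, 0.5

# Normalisation term: total probability of the features over both classes
sample_evidence = (sample_like_pos * sample_prior_pos
                   + sample_like_neg * sample_prior_neg)

# Bayes' rule for each class
sample_post_pos = sample_like_pos * sample_prior_pos / sample_evidence
sample_post_neg = sample_like_neg * sample_prior_neg / sample_evidence

print(sample_post_pos, sample_post_neg, sample_post_pos + sample_post_neg)
```

Because both posteriors share the same denominator, they sum to 1 up to floating-point rounding.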

#### Question 8
Now finally predict the most likely class/label for this review.

In [None]:
# Which class has highest probability?
pred_label = ...

print(f"The predicted label for the review '{infer_review}' is {pred_label}")
# print(pred_label)
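
One way to pick the most likely class, sketched on made-up posteriors (`sample_posteriors` is a hypothetical name and the values are illustrative), is to take the label with the largest posterior probability.

```python
# Hypothetical posteriors keyed by the lab's label values (1 and -1)
sample_posteriors = {1: 0.25, -1: 0.75}

# Pick the label whose posterior probability is largest
sample_pred = max(sample_posteriors, key=sample_posteriors.get)

print(sample_pred)
```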

# END