Machine learning INF284 Group Exam: Markus, Nikita, Bo og Andre

# Task 1 Machine learning on tabular mushrooms

# Task 2 Sentiment analysis 

Looking at this task, we noticed it had a resemblance to the sentiment analysis task like in the third lab exercise. We took this into account and tried using the same methods as in the exercise, and seeing what we could change, do differently, or do better. The two models shown in the lab are Bernoulli Naive Bayes and Multinomial Naive Bayes. 

Why would we use Bernoulli and Multinomial Naive Bayes?
The Bernoulli Naive Bayes model looks at the true/false value if a certain word exists, to decide which kind of sentiment it has based on what it has learned. 
The Multinomial Naive Bayes model looks at the word count of multiple words and then decides the sentiment.

We first have to start with the initialization of data frame, containing both the text and sentiment connected to it, and this applies for the training and test data.

In [1]:
# opening the train and test files
f_train = open("task_2/3class/train.json", encoding="utf-8")
f_test = open("task_2/3class/test.json", encoding="utf-8")

import json
import pandas

data_train = json.load(f_train)
data_test = json.load(f_test)

df_train = pandas.DataFrame(data_train)
df_test = pandas.DataFrame(data_train)

print(df_train.head())

        sent_id                                               text     label
0  201911-01-01                                      Philips 190G6   Neutral
1  201911-02-01  Med integrerte høyttalere som på ingen måte er...   Neutral
2  201911-02-02                             Eller bedrar skinnet ?  Negative
3  201911-03-01  De fleste skjermer har et diskret design , med...   Neutral
4  201911-03-02  Men 190G6 fra Philips er en helt annen historie .   Neutral


We can remove the sent_id column in both data frames, as it is redundant because the model only requires the text and sentiment label.

In [2]:
df_train = df_train.drop('sent_id', axis=1)
df_test = df_test.drop('sent_id', axis=1)
print(df_train.head())

                                                text     label
0                                      Philips 190G6   Neutral
1  Med integrerte høyttalere som på ingen måte er...   Neutral
2                             Eller bedrar skinnet ?  Negative
3  De fleste skjermer har et diskret design , med...   Neutral
4  Men 190G6 fra Philips er en helt annen historie .   Neutral


If we look at the text above in the first five rows, we can see that there are things we can remove, like characters and numbers.
We should also make all the characters lowercase so that there isn't a difference between "Test" and "test".
We create a function to handle a line or string of text and clean it of non-necessities.

In [3]:
import re
def cleanText(string):
    # removing these characters from the string
    toRemove = [":", ",", ".", '"', "-", "/", "?", "«", "(", '»', ")","0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
    pattern = '[' + ''.join(toRemove) + ']'
    # Remove characters matched by pattern
    string = re.sub(pattern, '', string)
    
    listOfWords = string.split(" ") # making the string into a list of words
    listOfWords = [word.lower() for word in listOfWords] # making words lowercase
    listOfWords = [word for word in listOfWords if word != ""] # removing empty characters in the list    
    return " ".join(listOfWords) #returning string from the list of words joined by one space


Using this function on each text in our data frame to help remove the text of unnecessary characters. Lets look at a couple of  lines as an example to show the difference.

In [4]:
print(df_train["text"][0])
print(df_train["text"][3])
print(df_train["text"][6])
print(df_train["text"][37])
print(df_train["text"][123])

Philips 190G6
De fleste skjermer har et diskret design , med smale rammer og slank fot .
LES OGSÅ :
I hvert fall når man ikke er alene hjemme ...
Kan vennskapet fastholdes ?


In [5]:
df_train["text"] = df_train["text"].apply(cleanText)

Let's look at the difference.

In [6]:
print(df_train["text"][0])
print(df_train["text"][3])
print(df_train["text"][6])
print(df_train["text"][37])
print(df_train["text"][123])

philips g
de fleste skjermer har et diskret design med smale rammer og slank fot
les også
i hvert fall når man ikke er alene hjemme
kan vennskapet fastholdes


One more step we can do is to lemmatize the words, which means taking the words and putting them into their ground forms. This will make it much simpler for the algorithm to see similarities and differences in the sentences, making this sentiment analysis task much easier with higher word counts for positive, neutral, and negative words.

In [7]:
from nltk.stem.snowball import NorwegianStemmer
lemmatizer = NorwegianStemmer()

Let's lemmatize each word.

In [8]:
for index in range(len(df_train["text"])):
    line = df_train["text"][index]
    words = line.split(" ")
    lemmatized = []
    for word in words:
        lemmatized.append(lemmatizer.stem(word))
    line = " ".join(lemmatized)
    df_train["text"][index] = line

In [9]:
print(df_train["text"][0])
print(df_train["text"][3])
print(df_train["text"][6])
print(df_train["text"][37])
print(df_train["text"][123])

philip g
de flest skjerm har et diskr design med smal ramm og slank fot
les også
i hvert fall når man ikk er alen hjemm
kan vennskap fasthold


Well we can see that there are still some unnecessary words that we can remove to help the accuracy of the model. This would be for example words like "i", "er", etc. To help us remove stop words, we used nltk.

In [10]:
import nltk
from nltk.corpus import stopwords
no_stopwords = stopwords.words("norwegian")

Let's go through the dataframe to remove all these unnecessary words.
We have to go through and retrieve each string, split up the words, and then check if each word is in this stopwords list, and if so, remove it. Again let's use the same examples as before to see the difference it makes.

In [11]:
for index in range(len(df_train["text"])):
    line = df_train["text"][index]
    words = line.split(" ")
    for word in words:
        if word in no_stopwords:
            words.remove(word)
    line = " ".join(words)
    df_train["text"][index] = line

In [12]:
print(df_train["text"][0])
print(df_train["text"][3])
print(df_train["text"][6])
print(df_train["text"][37])
print(df_train["text"][123])

philip g
flest skjerm et diskr design smal ramm slank fot
les
hvert fall man ikk alen hjemm
vennskap fasthold


Now that we have handled the text to a sufficient degree, it is time to apply our machine learning models on the training data. This requires first a few steps.

First we need to convert our training data set into a numerical representation using a vectorizer. We can use the CountVectorizer, which uses the number of occurrences of each word.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

X_train_data = vectorizer.fit_transform(df_train["text"])

We also need to take our test data and transform it so that the model can use the data.

In [14]:
X_test_data = vectorizer.transform(df_test["text"])

Next after setting up and transforming our data into a numerical representation, we need to choose which classifier or model we want to use on our data. With these sentiment analysis tasks, there are a couple popular models, such as the Naive Bayes Classifiers, Bernoulli and Multinomial. We first use the Bernoulli model to see the accuracy score.

In [15]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

BernbClassifier = BernoulliNB()

BernbClassifier.fit(X_train_data, df_train["label"])
prediction = BernbClassifier.predict(X_test_data)
accuracy = accuracy_score(df_test["label"], prediction)
print(f"{BernbClassifier} accuracy:", accuracy)

BernoulliNB() accuracy: 0.6395334253104227


The accuracy of this model is about 64%. This means that using the Bernoulli Naive Bayes model, it has predicted 64% of the texts correctly. 

We can try the Multinomial Naive Bayes model to see if it performs with a higher accuracy score.

In [16]:
from sklearn.naive_bayes import MultinomialNB

MnbClassifier = MultinomialNB()

MnbClassifier.fit(X_train_data, df_train["label"])
prediction = MnbClassifier.predict(X_test_data)
accuracy = accuracy_score(df_test["label"], prediction)
print(f"{MnbClassifier} accuracy:", accuracy)

MultinomialNB() accuracy: 0.727580584472595


Why does this model perform better than the previous one? As explained before, the multinomial model takes the number of occurrences of each word based on the sentiment to calculate the probability, to make a prediction.

With these classifiers, we left out an optional alpha value parameter, which is then chosen by default. The alpha value decides the smoothing of the estimated probabilities of the model. But how do we know which alpha value to choose so that it maximizes our performance? Cross validation should be the answer! Cross validation will use the model to check multiple values of alpha to determine the best alpha value to use as a parameter for the classifier.

In [17]:
from sklearn.model_selection import GridSearchCV

alpha_vals = [0.001, 0.01, 0.1, 1, 10, 100]

cv = GridSearchCV(MultinomialNB(),{'alpha': alpha_vals},cv=5,n_jobs=-1,verbose=0)

cv.fit(X_train_data, df_train["label"])
print("Best alpha value:", cv.best_params_['alpha'])

Best alpha value: 10


Given the 'best' alpha value, 10, we can now use this alpha value in our MultinomialNB classifier to see the new accuracy score.

In [18]:
MnbClassifier = MultinomialNB(alpha=10)

MnbClassifier.fit(X_train_data, df_train["label"])
prediction = MnbClassifier.predict(X_test_data)
accuracy = accuracy_score(df_test["label"], prediction)
print(f"{MnbClassifier} accuracy:", accuracy)

MultinomialNB(alpha=10) accuracy: 0.6229775492286467


As we can see, with the alpha value of 10, the model performs worse. We have tried changing the range of the alpha_vals as well as the number of folds and nothing performs better than fine-tuning the different values, ourselves. We have found that with an alpha value of 0.2, the model performs better.

In [19]:
MnbClassifier = MultinomialNB(alpha=0.2)

MnbClassifier.fit(X_train_data, df_train["label"])
prediction = MnbClassifier.predict(X_test_data)
accuracy = accuracy_score(df_test["label"], prediction)
print(f"{MnbClassifier} accuracy:", accuracy)

MultinomialNB(alpha=0.2) accuracy: 0.7817634516493165


This is definitely a better result than using BernoulliNB and MultinomialNB (with a default alpha value), with an accuracy score of 78%. We determined that this result is acceptable. So in conclusion, we have taken our data, removed unnecessary characters and numbers, lemmatized the words into their ground forms, and then removed all the stop words. This turns our data into something more usable for the models we have chosen. The best result we have seen thus far is with a fine-tuned Multinomial Naive Bayes model, with a self-defined alpha value.

# Task 3 Convolutional neural networks

In this task, the assignment is about training a convolutional neural network (CNN) as a binary classifier from the dataset that we have been provided with. This is a CIFAR-10 dataset that consists of 60000 images that are 32x32 colored images and will identify one of the following categories; airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. Each of these has 6000 images that are divided into 50000 for the training model and 10000 for the testing model.

In [20]:
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.optimizers import Adam
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt

ModuleNotFoundError: No module named 'tensorflow'