# Intelligent Systems 2023: Practical Assignment 10

## Machine Learning Introduction

Your name: Amund Strøm

Your VUnetID: ast101

If you do not provide your name and VUnetID we will not accept your submission. 

### Preliminaries

At the end of this exercise you should be able to work with some basic Machine Learning concepts, and implement and evaluate simple classifiers for *spam classification* using the popular machine learning library scikit-learn (https://scikit-learn.org/stable/).
Scikit-learn offers many helpful methods for creating simple machine learning models and for performing data science tasks.

In this assignment you will:
1. Use pandas to read a dataset from a comma-separated-value (.csv) file.
2. Create tf-idf feature vectors with scikit-learn.
3. Create and evaluate basic classification models.
4. Learn to improve classification models for textual data.




### Practicalities

Follow this notebook step by step. For this course you need to modify the Python programs we provide. You can do the exercises in any programming editor you like, but please still fill in the answers in this notebook as usual.

Please use your studentID+Assignment10.ipynb as the name of the Notebook, and fill in the missing cells.   

Note: unlike the courses dedicated to programming we will not evaluate the style of the programs. But we will, however, test your programs on other data that we provide, and your program should give the correct output to the test-data as well.

As mentioned, the assignment is graded as pass/fail. To pass you need either fully working code or an explanation of what you tried and what didn't work for the tasks that you were unable to complete (you can use multi-line comments or a text cell).


### Install some packages

First we need to install some additional packages that we will use throughout this assignment.
This might take a while.


In [44]:
# !pip install pandas
# !pip install scikit-learn

## Training classification models with scikit-learn

With this notebook, you have downloaded a small .csv file containing a public spam/ham SMS dataset that is often used for text classification purposes.
We will load this dataset with the pandas library (https://pandas.pydata.org/), which is often used for data analysis.


In [45]:
#load data
import pandas as pd
df = pd.read_csv('spam.csv', encoding="ISO-8859-1")
#drop every column that contains at least one missing value
df.dropna(how="any", inplace=True, axis=1)
#rename the two remaining columns
df.columns = ['label', 'message']
df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


As you can see, the resulting pandas dataframe contains an index column, a label, and the message.
Let's first have a look at the class distribution.

## Task 1

For this first task, we ask you to do a basic data science task. Try to get an idea about the dataset by checking how balanced/unbalanced the dataset is. To do this, you need to compute the proportion of the *ham* and the *spam* class.

Find a Pandas function to compute the frequency of the labels to get an idea of the label distribution. 
Then write a short description of your results.
What percentage of the messages are labelled as spam?

*Hint: Have a look at the Pandas documentation (https://pandas.pydata.org/docs/). There are many ways to get your answer!*

In [46]:
freq = df['label'].value_counts()   # Get values
freq = freq / len(df) * 100         # Convert to percentage
print(freq)


label
ham     86.593683
spam    13.406317
Name: count, dtype: float64
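
As a cross-check on the manual percentage computation, `value_counts` can also return proportions directly through its `normalize` parameter. The sketch below uses a small made-up label column rather than the real `spam.csv`, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy labels standing in for df['label']; not the real dataset.
toy_labels = pd.Series(["ham"] * 7 + ["spam"] * 3, name="label")

# normalize=True returns fractions of the total instead of raw counts.
proportions = toy_labels.value_counts(normalize=True)
percentages = proportions * 100

print(percentages.round(1))
```

Applied to the notebook's dataframe, the same call would be `df['label'].value_counts(normalize=True)`.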


In [47]:
MyReport1 = """
Roughly 87% of the messages are labelled as ham, which means that about 13% are labelled as spam.
"""

The following code snippet will create textual features, as discussed in last week's lecture. We will create tf-idf vectors and append them to our pandas dataframe.
Then we will perform a simple train/test split of our dataset, using the scikit-learn splitting functions.

Have a look at the different parts that we created. What do the dataframes X_train, y_train, X_test, y_test contain?
Try to understand what is happening here by also having a look at the scikit-learn documentation (https://scikit-learn.org/stable/).

In [48]:
#imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

#compute the tf-idf vectors for the messages and create a new dataframe for them
v = TfidfVectorizer()
tf_idf = v.fit_transform(df['message'])
df_tfidf = pd.DataFrame(tf_idf.toarray(), columns=v.get_feature_names_out())

#combine the original dataframe with the dataframe for the tf-idf vectors
dataframes = [df, df_tfidf]
df_new = pd.concat(dataframes, axis=1)

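#split the dataset into training and test set
#note: test_size=0.9 puts 90% of the rows into the test set and only 10% into the training set
train, test = train_test_split(df_new, test_size=0.9)

#separate feature matrices X from label vector y
#note: columns 0 and 1 of df_new are 'label' and 'message', so the tf-idf columns start at position 2;
#the slice 3: below therefore also leaves out the first tf-idf column
X_train = train.iloc[:, 3:]
X_test = test.iloc[:, 3:]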
y_train = train['label']
y_test = test['label']


In [49]:
#let's have a look at the different dataframes here
X_train

Unnamed: 0,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,02,0207,...,ó_,û_,û_thanks,ûªm,ûªt,ûªve,ûï,ûïharry,ûò,ûówell
3025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1072,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
711,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2089,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4530,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
991,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
626,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4492,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
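
To see what each of the four parts contains, it can help to repeat the same steps on a corpus small enough to inspect by eye. The following is a minimal sketch under stated assumptions: the toy messages, the `toy_` variable names, and the choice of `stratify` and `random_state` are illustrative and not part of the assignment code.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus; the notebook itself uses df['message'] and df['label'].
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham", "spam", "ham", "ham", "spam", "ham"],
    "message": [
        "see you at lunch",
        "win a free prize now",
        "call me later",
        "free entry win cash",
        "are you home",
        "lunch tomorrow then",
        "claim your free prize",
        "ok see you later",
    ],
})

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy["message"])  # sparse matrix, one row per message
toy_y = toy["label"]

# stratify keeps the ham/spam ratio similar in both parts;
# random_state makes the split reproducible.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=0
)

print(toy_X.shape)                          # (number of messages, vocabulary size)
print(toy_X_train.shape, toy_X_test.shape)  # rows are split, columns stay the same
print(list(toy_y_test))                     # labels belonging to the test rows
```

The feature matrices hold one row per message and one column per vocabulary term, while the label vectors hold the matching ham/spam labels for the same rows.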


## Naive Bayes Classification


In the lecture, we introduced the naive Bayes classification algorithm and computed various examples by hand. Here, we will use scikit-learn to train your own first classification model for spam classification.
However, all examples from the lecture were using categorical features, while our tf-idf vectors here are real-valued features. 
Thus, the model used here will be slightly different than what we have seen in the lecture.



### Task 2

Use the training and test set created in the previous cell and train a Naive Bayes classifier using scikit-learn.
Please have a look at the documentation on how to train a classification model using X_train and y_train as input.
Afterwards compute the accuracy of your classifier.


In [50]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)

n_test = X_test.shape[0]               # total number of test messages
miss = (y_test != y_pred).sum()        # number of misclassified messages
acc = 100 - ((miss * 100) / n_test)    # accuracy in percent

print("The accuracy of the Naive Bayes classifier is: ", acc)

The accuracy of the Naive Bayes classifier is:  94.03788634097707


As you might have seen, the accuracy of your Naive Bayes classifier should be over 85%.
This seems to be a very good score for a very simple classification model and simple tf-idf features.

### Task 3

Have a look at different evaluation metrics for your classifier and discuss the suitability of accuracy for the spam classification task.
Have a look at the definition of accuracy and come up with another metric that is better suited for our problem.

*Hint: Have a look at this documentation and try out different evaluation metrics: https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics*

In [51]:
#try out other evaluation metrics here  
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

print(accuracy_score(y_test, y_pred))

print(f1_score(y_test, y_pred, average='weighted'))

0.9403788634097707
0.9398619833264199


In [52]:
MyReport2 = """
Accuracy is probably not the best metric in our case, since our dataset contains far more ham than spam. A classifier can therefore perform well on accuracy even if it guesses
wrong on all the spam cases, because the ham cases simply outweigh them. It may therefore be a better idea to look at the f1-score, which is the harmonic mean of precision and recall.
Precision in this context represents the proportion of predicted spam messages that are actually spam, while recall represents the proportion of actual spam messages that are
correctly predicted as spam.

The idea is that the cases where it actually matters to get the prediction right (predicting spam) will be valued more in the metric for the classifier.
"""
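
The point about class imbalance can be checked with a deliberately useless classifier. The sketch below uses made-up labels (87 ham, 13 spam, roughly mirroring the distribution found in Task 1) and a prediction of "ham" for every message. It also shows that a weighted average of per-class F1 scores is still dominated by the majority class, so the spam-specific scores are the ones that expose the failure.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 87 ham and 13 spam, not the real test set.
y_true = ["ham"] * 87 + ["spam"] * 13
# A useless classifier that predicts ham for every message.
y_all_ham = ["ham"] * 100

acc = accuracy_score(y_true, y_all_ham)
spam_precision = precision_score(y_true, y_all_ham, pos_label="spam", zero_division=0)
spam_recall = recall_score(y_true, y_all_ham, pos_label="spam")
spam_f1 = f1_score(y_true, y_all_ham, pos_label="spam", zero_division=0)
weighted_f1 = f1_score(y_true, y_all_ham, average="weighted", zero_division=0)

print("accuracy:       ", acc)
print("spam precision: ", spam_precision)
print("spam recall:    ", spam_recall)
print("spam f1:        ", spam_f1)
print("weighted f1:    ", weighted_f1)

# Rows are true classes, columns are predicted classes, in the order ham, spam.
print(confusion_matrix(y_true, y_all_ham, labels=["ham", "spam"]))
```

Accuracy comes out at 0.87 even though no spam message is caught, while recall and F1 for the spam class are both 0.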

### Task 4

Come up with any improvements for the classification model here.
You can come up with a new method and/or different features to improve the classification.
Can you beat the baseline Naive Bayes model?

If you try out a different classification model, the training of the model might take a couple of seconds.

Write at least 10 sentences describing your improvements and why they help to improve the model.

In [53]:
from sklearn import svm

clf = svm.SVC(kernel='linear')
y_pred = clf.fit(X_train, y_train).predict(X_test)

print(f1_score(y_test, y_pred, average='weighted'))

0.9439982724859741


In [54]:
MyReport3 = """
When choosing a classification model you have to consider many factors about the data, such as its characteristics and patterns. The classification model I have chosen is
the support vector machine (SVM), and we can see that it is a minor improvement (weighted F1 of about 94.4%) over the Naive Bayes classifier (about 94.0%). Looking at the
data suggests why the SVM is an improvement.

Naive Bayes assumes that the features are conditionally independent given the class. This simplifies the model but may miss relationships between features. In our case it
treats each word independently and without context, which is of course wrong, as words are connected to each other to create meaning. An SVM instead finds the optimal
hyperplane to separate the classes and weighs all features jointly, so it does not depend on this independence assumption.

Naive Bayes is also sensitive to imbalanced datasets, meaning it can become biased toward the class with the majority of examples. In our case there are far more
examples of non-spam messages than of spam messages. An SVM tends to be more robust when dealing with this problem.

Compared to Naive Bayes, an SVM has more hyperparameters to tune, meaning you can fit it to the dataset much better. Finding good hyperparameters may require a lot of
resources. In my case I have only set the kernel parameter to 'linear', but through trial and error you could probably improve the model even more by using
different hyperparameters.
"""
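
The trial-and-error search over hyperparameters mentioned in the report can be automated with a cross-validated grid search. This is a minimal sketch under stated assumptions: the toy messages, the parameter grid, the `f1_macro` scoring choice, and the three folds are all illustrative, and the real dataset would need a larger and more carefully chosen grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus; in the notebook this would be the SMS messages and labels.
toy_messages = [
    "see you at lunch", "call me later", "are you home yet",
    "ok see you tomorrow", "running late sorry", "thanks for dinner",
    "win a free prize now", "free entry win cash", "claim your free prize",
    "urgent call now to win", "free cash prize claim now", "win free tickets today",
]
toy_labels = ["ham"] * 6 + ["spam"] * 6

# With the vectorizer inside the pipeline, tf-idf is refit on each training fold only.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", SVC()),
])

# Candidate settings to compare; the grid itself is an illustrative choice.
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
    "svc__class_weight": [None, "balanced"],
}

# Every combination is scored with 3-fold cross-validation on macro-averaged F1.
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=3)
search.fit(toy_messages, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

The `class_weight="balanced"` option is one way to counter class imbalance in an SVM, which is why it is included among the candidates.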

## Final Task: Collect all the results

Uncomment and run this cell (and all the cells above) to generate the text file that you have to hand in together with the notebook on Canvas!

### Please hand in only the text file which is generated by this method!

In [55]:
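def exportToText(*args):
    # args[0] is the output filename; the remaining arguments are the report strings
    with open(args[0], "w") as f:
        # note: the loop covers all arguments, so the filename is written as the first line too
        for argument in args:
            f.write("{}\n".format(argument))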

exportToText("assignment10.txt", MyReport1, MyReport2, MyReport3)