
In [1]:
!gdown -O /tmp/ml.py 174lBNvDBJSVWs3OpNL3a68cnhWIcWYuY
%run /tmp/ml.py



# Text classification

Your assignment is to create a classifier for SMS texts that detects which messages are spam and which are not. In the dataset, the label "ham" identifies messages that are not spam.

To classify text, it is common to transform the text into a vector representation. In a simple bag-of-words representation, every word in the collection is a feature, and for each document we simply count how often each word appears. So if our dictionary consists of [ i, am, hungry, and, thirsty ], then the text "I am hungry" becomes (1, 1, 1, 0, 0), the text "I am thirsty" becomes (1, 1, 0, 0, 1), and the text "I am hungry and I am thirsty" becomes (2, 2, 1, 1, 1). Since these are now numbers, we can train a classifier like before.
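The counting described above can be reproduced in a few lines. This is a minimal sketch using only the standard library; the helper name `bag_of_words` is made up for illustration and is not part of the notebook or of scikit-learn.

```python
from collections import Counter

# Dictionary from the example above; a word's position is its feature index.
vocabulary = ["i", "am", "hungry", "and", "thirsty"]

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in a text (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return tuple(counts[word] for word in vocabulary)

for text in ["I am hungry", "I am thirsty", "I am hungry and I am thirsty"]:
    print(text, "->", bag_of_words(text, vocabulary))
# I am hungry -> (1, 1, 1, 0, 0)
# I am thirsty -> (1, 1, 0, 0, 1)
# I am hungry and I am thirsty -> (2, 2, 1, 1, 1)
```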

# Text parsing

Several decisions affect the vectorization of text. Commonly, sentences are split on whitespace and punctuation marks to get words. Words are often lowercased and reduced to their stem (e.g. walk, walked, and walking are all converted to the stem 'walk'), and a list of relatively meaningless words, the so-called 'stopwords', is removed.
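To make these steps concrete, here is a rough sketch of such a parsing pipeline. The stopword list and the suffix-stripping `crude_stem` function are toy stand-ins invented for this example, not a real stemming algorithm. Note that scikit-learn's `CountVectorizer` lowercases by default and removes stopwords when `stop_words='english'` is set, but does not stem words.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"i", "am", "and", "the", "a", "to", "is"}

def crude_stem(word):
    """Toy stemmer: strip a few common suffixes, keeping at least 3 letters."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def parse(text):
    """Lowercase, split on non-letter characters, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(parse("I walked, and the dog is walking!"))
# ['walk', 'dog', 'walk']
```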

In [11]:
from pipetorch import DFrame
import pandas as pd
import numpy as np
from sklearn.utils import resample
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

In [23]:
df = DFrame.read_from_kaggle('uciml/sms-spam-collection-dataset',
                             encoding = 'latin-1')

#### Check the DataFrame. The import created three empty columns; remove them.

In [24]:
df = df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])
df

| | v1 | v2 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will Ì_ b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


#### Convert the categories in column v1 to numbers. Since we want to detect spam, it makes sense to use 1 for spam and 0 for ham.

In [25]:
df['v1'] = df.v1.apply(lambda x: x.strip().lower() == 'spam') * 1
df

| | v1 | v2 |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | 1 | This is the 2nd time we have tried 2 contact u... |
| 5568 | 0 | Will Ì_ b going to esplanade fr home? |
| 5569 | 0 | Pity, * was in mood for that. So...any other s... |
| 5570 | 0 | The guy did some bitching but I acted like i'd... |


In [26]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=5, encoding='latin-1', stop_words='english')

In [27]:
features = vectorizer.fit_transform(df.v2).toarray()

In [28]:
features

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

#### Look at the shape, how many texts are there and how many words in the dictionary?

In [29]:
features.shape

(5572, 1602)

#### Use n-fold cross-validation to compute the recall. Take the average recall over the folds. Depending on the number of splits, you should see a recall of around 87% for n=10. What does recall stand for?

In [33]:
# Write a loop for the n-fold CV, for example with 10 splits.
# For each split, train a logistic regression model on the training fold
# and take the average recall score over the validation folds.
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

kf = KFold(n_splits=10, random_state=0, shuffle=True)
y = df.v1.to_numpy()
recall = []
for train, valid in kf.split(y):
    train_X = features[train]
    valid_X = features[valid]
    train_y = y[train]
    valid_y = y[valid]

    model = LogisticRegression(solver='liblinear', multi_class='auto')
    model.fit(train_X, train_y)
    pred_y = model.predict(valid_X)
    recall.append(recall_score(valid_y, pred_y))
sum(recall)/len(recall)

0.876043569857012
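As a reminder of what this number means: recall is the fraction of actual positives (here, spam messages) that the classifier finds, TP / (TP + FN). The sketch below computes it by hand on made-up labels; the function name `recall_by_hand` and the toy data are for illustration only.

```python
def recall_by_hand(y_true, y_pred, positive=1):
    """Recall = true positives / (true positives + false negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Made-up example: 4 spam messages (1) and 6 ham messages (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# 3 of the 4 spam messages are caught, so recall is 0.75.
print(recall_by_hand(y_true, y_pred))
```

A recall of about 0.88 therefore means that roughly one in eight spam messages in the validation folds is still classified as ham.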

Show the frequency of spam and ham. It appears the dataset is heavily skewed.

In [37]:
df['v1'] = df['v1'].factorize()[0]
df.groupby('v1').v1.count()

v1
0    4825
1     747
Name: v1, dtype: int64
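To see why this skew matters, consider a trivial baseline that labels every message as ham. The arithmetic below uses the class counts shown above; the "always ham" classifier is a thought experiment, not something the notebook trains.

```python
n_ham, n_spam = 4825, 747          # class counts from the output above
total = n_ham + n_spam

spam_share = n_spam / total

# A classifier that always predicts ham gets every ham message right
# and every spam message wrong.
baseline_accuracy = n_ham / total
baseline_recall = 0 / n_spam

print(f"spam share:        {spam_share:.3f}")         # 0.134
print(f"baseline accuracy: {baseline_accuracy:.3f}")  # 0.866
print(f"baseline recall:   {baseline_recall:.3f}")    # 0.000
```

Accuracy alone would make this useless baseline look good, which is one reason to evaluate a spam detector with recall.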

#### Since there is a big skew in the dataset, try to balance the training set and repeat the experiment. See what happens to the recall. You should see a big improvement.

You can use `resample` to take a sample of a certain size from a collection.

In [47]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Split the data by class: 0 = ham (majority), 1 = spam (minority).
class_0 = df[df['v1'] == 0]
class_1 = df[df['v1'] == 1]

# Upsample spam (sampling with replacement) to the size of the ham class.
class_1_upsampled = resample(class_1, n_samples=len(class_0), random_state=0)
balanced_df = pd.concat([class_0, class_1_upsampled])

balanced_features = vectorizer.fit_transform(balanced_df['v2']).toarray()

y = balanced_df['v1'].to_numpy()

# Note: the model is scored on the same data it was trained on,
# so this recall is an optimistic estimate.
model = LogisticRegression()
model.fit(balanced_features, y)
pred_y = model.predict(balanced_features)

recall = recall_score(y, pred_y)
print('Recall Score:', recall)


Recall Score: 0.9975129533678756


#### Repeat the KFold experiment on the balanced dataset.

In [None]:
class_1_upsampled = resample(class_1, n_samples=len(class_0), random_state=0)
balanced_df = pd.concat([class_0, class_1_upsampled])

balanced_features = vectorizer.fit_transform(balanced_df['v2']).toarray()

kf = KFold(n_splits=10, random_state=0, shuffle=True)
y = balanced_df['v1'].to_numpy()
recall = []

for train, valid in kf.split(y):
    train_X = balanced_features[train]
    valid_X = balanced_features[valid]
    train_y = y[train]
    valid_y = y[valid]

    model = LogisticRegression(solver='liblinear', multi_class='auto')
    model.fit(train_X, train_y)
    pred_y = model.predict(valid_X)
    recall.append(recall_score(valid_y, pred_y))

average_recall = sum(recall) / len(recall)
average_recall

0.9877248222690266
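One caveat with the experiment above: the spam class is upsampled before the data is split, so copies of the same spam message can end up in both the training and the validation fold, which tends to inflate the score. A more cautious variant balances only the training indices inside each fold. The sketch below shows the index handling on toy labels; the helper `balanced_train_indices` and the 20/5 label array are assumptions for illustration, and `StratifiedKFold` is swapped in for `KFold` to keep the class ratio equal across folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

def balanced_train_indices(train_idx, y, random_state=0):
    """Upsample the minority class (label 1) within the training indices only."""
    pos = train_idx[y[train_idx] == 1]
    neg = train_idx[y[train_idx] == 0]
    pos_up = resample(pos, replace=True, n_samples=len(neg),
                      random_state=random_state)
    return np.concatenate([neg, pos_up])

# Toy labels standing in for df.v1 (hypothetical data: 20 ham, 5 spam).
y = np.array([0] * 20 + [1] * 5)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in skf.split(np.zeros(len(y)), y):
    bal_idx = balanced_train_indices(train_idx, y)
    # The validation fold is left untouched, so no copy of a validation
    # message can appear in the balanced training data.
    assert set(bal_idx).isdisjoint(valid_idx)
    print("train:", np.bincount(y[bal_idx]), "valid:", np.bincount(y[valid_idx]))
```

In the notebook, the same idea would mean fitting on `features[bal_idx]` and `y[bal_idx]` (built from the original, unbalanced `df`) and predicting on `features[valid_idx]`. Fitting the vectorizer on the training texts only would be stricter still.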

#### Also compute the precision.

In [45]:
from sklearn.metrics import precision_score

class_1_upsampled = resample(class_1, n_samples=len(class_0), random_state=0)
balanced_df = pd.concat([class_0, class_1_upsampled])

balanced_features = vectorizer.fit_transform(balanced_df['v2']).toarray()

kf = KFold(n_splits=10, random_state=0, shuffle=True)
y = balanced_df['v1'].to_numpy()
precision = []

for train, valid in kf.split(y):
    train_X = balanced_features[train]
    valid_X = balanced_features[valid]
    train_y = y[train]
    valid_y = y[valid]

    model = LogisticRegression(solver='liblinear', multi_class='auto')
    model.fit(train_X, train_y)
    pred_y = model.predict(valid_X)
    precision.append(precision_score(valid_y, pred_y))

average_precision = sum(precision) / len(precision)
average_precision

0.99773939889738
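Precision complements recall: it is the fraction of messages flagged as spam that really are spam, TP / (TP + FP). The two can pull in opposite directions, as this small sketch on made-up labels illustrates; the function `precision_recall` and both toy predictors are hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up example: 2 spam messages (1) and 6 ham messages (0).
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
flag_everything = [1] * 8                  # flags every message as spam
cautious = [1, 0, 0, 0, 0, 0, 0, 0]        # flags only one message

print(precision_recall(y_true, flag_everything))  # (0.25, 1.0)
print(precision_recall(y_true, cautious))         # (1.0, 0.5)
```

Flagging everything catches all spam but buries it in false alarms, while the cautious predictor never raises a false alarm but misses half the spam. Reporting both metrics shows which trade-off a model makes.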

In [None]:
halt_notebook()