<a href="https://colab.research.google.com/github/LMAPcoder/Machine-Learning-Lab/blob/main/Exercise_sheet11.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab Programming Machine Learning**

Name: Leonardo Antiqui

Group 2 Monday

## Exercise Sheet 11

### Exercise 1: Preprocessing Text Data

The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics.

In [None]:
import numpy as np
import pandas as pd
import re
import nltk
nltk.download('stopwords') #download stopwords
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.datasets import fetch_20newsgroups

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
categories = ['comp.graphics', 'sci.med'] #Two categories > binary classification problem
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=3116)

The returned dataset is a scikit-learn Bunch: a simple holder object whose fields can be accessed either as Python dict keys or as object attributes.

In [None]:
#Labels of the document sets available
print('Target names:',twenty_train.target_names)

Target names: ['comp.graphics', 'sci.med']


In [None]:
#Targets
unique, counts = np.unique(twenty_train.target, return_counts=True)
print('Classes:',dict(zip(unique, counts)))

Classes: {0: 584, 1: 594}


In [None]:
#Number of samples in the dataset
print('Number of posts:',len(twenty_train.data))

Number of posts: 1178


In [None]:
#First ten lines of the first post
print("\n".join(twenty_train.data[0].split("\n")[:10]))

From: kaminski@netcom.com (Peter Kaminski)
Subject: Re: Krillean Photography
Lines: 101
Organization: The Information Deli - via Netcom / San Jose, California

[Newsgroups: m.h.a added, followups set to most appropriate groups.]

In <1993Apr19.205615.1013@unlv.edu> todamhyp@charles.unlv.edu (Brian M.
Huey) writes:



Vectorizing the posts with the bag-of-words technique
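
As a minimal sketch of the idea, the toy example below builds a count matrix from two already tokenized documents. The function name `bag_of_words` and the sample tokens are made up for illustration and are not part of this notebook's pipeline, which is implemented further down.

```python
from collections import Counter

def bag_of_words(tokenized_docs):
    """Return (count matrix, vocabulary) for a list of token lists."""
    vocab = []   # words in order of first appearance
    index = {}   # word -> column position
    for tokens in tokenized_docs:
        for tok in tokens:
            if tok not in index:
                index[tok] = len(vocab)
                vocab.append(tok)
    matrix = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[tok] for tok in vocab])
    return matrix, vocab

# Hypothetical, already preprocessed documents
docs = [["scan", "imag", "scan"], ["patient", "imag"]]
matrix, vocab = bag_of_words(docs)
print(vocab)   # ['scan', 'imag', 'patient']
print(matrix)  # [[2, 1, 0], [0, 1, 1]]
```

Each row is one document and each column one vocabulary word, which is the same layout as the feature matrix built below.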

In [None]:
twenty_train.data[0]

'From: kaminski@netcom.com (Peter Kaminski)\nSubject: Re: Krillean Photography\nLines: 101\nOrganization: The Information Deli - via Netcom / San Jose, California\n\n[Newsgroups: m.h.a added, followups set to most appropriate groups.]\n\nIn <1993Apr19.205615.1013@unlv.edu> todamhyp@charles.unlv.edu (Brian M.\nHuey) writes:\n\n>I am looking for any information/supplies that will allow\n>do-it-yourselfers to take Krillean Pictures.\n\n(It\'s "Kirlian".  "Krillean" pictures are portraits of tiny shrimp. :)\n\n[...]\n\n>One might extrapolate here and say that this proves that every object\n>within the universe (as we know it) has its own energy signature.\n\nI think it\'s safe to say that anything that\'s not at 0 degrees Kelvin\nwill have its own "energy signature" -- the interesting questions are\nwhat kind of energy, and what it signifies.\n\nI\'d check places like Edmund Scientific (are they still in business?) --\nor I wonder if you can find ex-Soviet Union equipment for sale somewher

In [None]:
#Preprocessing
def preprocessing_string(text):
  text = re.sub(r'\S*@\S*\s?', '', text) #remove e-mail addresses
  text = text.replace('\n',' ').replace('_',' ') #line breaks and underscores to spaces
  text = re.sub(r'[^\w\s]', ' ', text) #punctuation to spaces
  text = re.sub(r'\s\s+', ' ', text) #collapse repeated whitespace
  text = text.lower()
  return text

def nonStopwords(word_list):
  #remove English stopwords
  word_list = [word for word in word_list if word not in stopwords.words('english')]
  return word_list

def preprocessing_list(word_list):
  word_list = [word for word in word_list if len(word)>1] #drop empty and one-character tokens
  word_list = [PorterStemmer().stem(word) for word in word_list] #reduce words to their stem
  return word_list

def preprocessing(text):
  #full pipeline: clean the string, tokenize, remove stopwords, stem
  text = preprocessing_string(text)
  word_list = text.split(' ')
  word_list = nonStopwords(word_list)
  word_list = preprocessing_list(word_list)
  return word_list

In [None]:
preprocessing(twenty_train.data[0])[:5]

['peter', 'kaminski', 'subject', 'krillean', 'photographi']
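
To make the string-cleaning steps easier to follow, the sketch below traces them on a made-up sentence. It repeats the regular expressions of `preprocessing_string` under the hypothetical name `clean` and leaves out stopword removal and stemming, which need NLTK.

```python
import re

def clean(text):
    text = re.sub(r'\S*@\S*\s?', '', text)            # drop tokens containing '@'
    text = text.replace('\n', ' ').replace('_', ' ')  # '_' counts as \w, so handle it here
    text = re.sub(r'[^\w\s]', ' ', text)              # punctuation -> space
    text = re.sub(r'\s\s+', ' ', text)                # collapse runs of whitespace
    return text.lower()

# Made-up input, not taken from the dataset
sample = "Contact: john@example.com\nHello__World!! It's   great."
cleaned = clean(sample)
tokens = [w for w in cleaned.split(' ') if len(w) > 1]

print(repr(cleaned))  # 'contact hello world it s great '
print(tokens)         # ['contact', 'hello', 'world', 'it', 'great']
```

The trailing space and the stray `s` left over from `It's` show why the pipeline filters out tokens of length one or zero after splitting.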

In [None]:
def vocab_O(data):
  word_counts = {} #word -> vector with its number of occurrences in each post
  N_docs = len(data) #number of posts
  for idx, post in enumerate(data):
    word_list = preprocessing(post)
    for word in word_list:
      if word not in word_counts:
        word_counts[word] = np.zeros(N_docs)
      if word_counts[word][idx] == 0: #count each word only once per post
        word_counts[word][idx] = word_list.count(word)
  vocab = list(word_counts) #vocabulary in order of first appearance
  vector = np.array([word_counts[word] for word in vocab]).T #shape: (posts, words)
  return vector,vocab

In [None]:
%%time
vector_O,vocab = vocab_O(twenty_train.data)

CPU times: user 43.3 s, sys: 4.07 s, total: 47.4 s
Wall time: 50.6 s


In [None]:
#Snippet of the feature matrix with occurrences as values
pd.DataFrame(data=vector_O, index=None, columns=vocab).iloc[:5,:10]

Unnamed: 0,peter,kaminski,subject,krillean,photographi,line,101,organ,inform,deli
0,2.0,2.0,5.0,3.0,11.0,1.0,1.0,1.0,2.0,1.0
1,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,2.0,0.0,1.0,1.0,0.0
4,1.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0


In [None]:
print('Feature matrix size:',vector_O.shape)

Feature matrix size: (1178, 16448)


Term Frequency times Inverse Document Frequency (TF–IDF)

Term Frequency:

$TF(t,d) = \frac{\text{count of t in d}}{\text{number of words in d}}$

Document Frequency: number of documents in which the word is present.

$DF(t) = \text{number of the N documents that contain t}$

Inverse Document Frequency:

$IDF(t) = \log{\frac{N}{DF(t)+1}}$

TF-IDF:

$\text{TF-IDF}(t,d) = TF(t,d) \times IDF(t)$
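
The formulas above can be checked by hand on a small count matrix. The sketch below is plain Python with made-up counts; the helper name `tf_idf_toy` is hypothetical. It uses the same $\log(N/(DF+1))$ variant, so a term that appears in every document gets a negative IDF.

```python
import math

def tf_idf_toy(counts):
    """counts: list of rows (documents), each a list of term counts."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # DF: number of documents in which each term appears
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # IDF with the +1 in the denominator, as in the formula above
    idf = [math.log(n_docs / (d + 1)) for d in df]
    # TF: count divided by the number of words in the document
    tf = [[c / sum(row) for c in row] for row in counts]
    tfidf = [[tf[i][j] * idf[j] for j in range(n_terms)] for i in range(n_docs)]
    return tf, idf, tfidf

# Three toy documents, three terms (made-up numbers)
counts = [[2, 1, 0],
          [0, 1, 1],
          [0, 1, 3]]
tf, idf, tfidf = tf_idf_toy(counts)
print(idf)       # term 0: log(3/2) > 0, term 1: log(3/4) < 0, term 2: log(3/3) = 0
print(tfidf[0])  # first document
```

The second term occurs in all three documents, so its weight is slightly negative. This is the same effect that produces small negative TF-IDF values for words found in every post.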

In [None]:
def tf_idf(vector):
  tf = vector/vector.sum(axis=1, keepdims=True)
  df = np.count_nonzero(vector,axis=0)
  idf = np.log(len(vector)/(df+1))
  tf_idf = tf*idf
  return tf, tf_idf

In [None]:
vector_tf ,vector_tf_idf = tf_idf(vector_O)

In [None]:
#Snippet of the feature matrix with tf as values
pd.DataFrame(data=vector_tf, index=None, columns=vocab).iloc[:5,:10]

Unnamed: 0,peter,kaminski,subject,krillean,photographi,line,101,organ,inform,deli
0,0.004357,0.004357,0.010893,0.006536,0.023965,0.002179,0.002179,0.002179,0.004357,0.002179
1,0.0,0.0,0.013699,0.0,0.0,0.013699,0.0,0.013699,0.0,0.0
2,0.0,0.0,0.002611,0.0,0.0,0.002611,0.0,0.002611,0.0,0.0
3,0.0,0.0,0.005495,0.0,0.0,0.010989,0.0,0.005495,0.005495,0.0
4,0.017241,0.0,0.017241,0.0,0.0,0.017241,0.0,0.034483,0.0,0.0


In [None]:
#Snippet of the feature matrix with tf-idf as values
pd.DataFrame(data=vector_tf_idf, index=None, columns=vocab).iloc[:5,:10]

Unnamed: 0,peter,kaminski,subject,krillean,photographi,line,101,organ,inform,deli
0,0.015993,0.0238,-9e-06,0.027702,0.100203,1.3e-05,0.010182,6.6e-05,0.007683,0.012386
1,0.0,0.0,-1.2e-05,0.0,0.0,8.2e-05,0.0,0.000413,0.0,0.0
2,0.0,0.0,-2e-06,0.0,0.0,1.6e-05,0.0,7.9e-05,0.0,0.0
3,0.0,0.0,-5e-06,0.0,0.0,6.5e-05,0.0,0.000166,0.009688,0.0
4,0.063282,0.0,-1.5e-05,0.0,0.0,0.000103,0.0,0.00104,0.0,0.0


Splitting the data

In [None]:
# X_data = vector_tf_idf
X_data = vector_O
y_data = twenty_train.target.reshape(-1,1)
data = np.hstack((X_data,y_data))
data_train, data_valid, data_test = np.split(data,[int(len(data)*0.8),int(len(data)*0.9)])

In [None]:
X_train = data_train[:,:-1]
y_train = data_train[:,-1]
X_valid = data_valid[:,:-1]
y_valid = data_valid[:,-1]
X_test = data_test[:,:-1]
y_test = data_test[:,-1]

In [None]:
print('X_train shape:',X_train.shape)
print('y_train shape:',y_train.shape)
print('X_valid shape:',X_valid.shape)
print('y_valid shape:',y_valid.shape)
print('X_test shape:',X_test.shape)
print('y_test shape:',y_test.shape)

X_train shape: (942, 16448)
y_train shape: (942,)
X_valid shape: (118, 16448)
y_valid shape: (118,)
X_test shape: (118, 16448)
y_test shape: (118,)
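
The split sizes follow directly from the two cut points passed to `np.split`, which cuts an array at the given row indices in the same way as slicing. A short sketch of that arithmetic, using a plain list in place of the data matrix (the variable names are illustrative):

```python
n_posts = 1178                   # number of posts reported above
cut_valid = int(n_posts * 0.8)   # end of the training part
cut_test = int(n_posts * 0.9)    # end of the validation part

rows = list(range(n_posts))      # stand-in for the rows of the data matrix
train = rows[:cut_valid]
valid = rows[cut_valid:cut_test]
test = rows[cut_test:]

print(cut_valid, cut_test)                # 942 1060
print(len(train), len(valid), len(test))  # 942 118 118
```

Because the posts were shuffled when the dataset was loaded, taking contiguous blocks in this way still gives a random split.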


In [None]:
np.hstack((y_train,y_valid)).shape

(1060,)

### Exercise 2: Implementing Naive Bayes Classifier for Text Data

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.

Conditional distribution:
\begin{align}
p(y|\mathbf{x}) = \frac{p(y)p(\mathbf{x}|y)}{p(\mathbf{x})}
\end{align}
Using the naive conditional independence assumption:
\begin{align}
p(\mathbf{x}|y) = \prod_{m=1}^{M}p(x_m|y)
\end{align}
Finally, the prediction is:
\begin{align}
\hat{y}_{new} = \arg \max_{y} \left[p(y)\prod_{m=1}^{M}p(x_{new,m}|y)\right] = \arg \max_{y} \left[\log p(y) + \sum_{m=1}^{M} \log p(x_{new,m}|y)\right]
\end{align}
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of $p(x_m|y)$.
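
The switch from a product to a sum of logarithms in the prediction rule matters numerically. The sketch below uses made-up probabilities to show that multiplying many small factors underflows to zero in floating point, while the sum of their logarithms stays usable for an arg max.

```python
import math

# 200 made-up per-feature probabilities of 0.01 each
probs = [0.01] * 200

product = 1.0
for p in probs:
    product *= p   # true value is 1e-400, below the smallest positive float

log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # roughly -921.03
```

With thousands of vocabulary words per document, every class score would collapse to zero in the product form, so the comparison between classes is only possible in log space.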

Since our data is very sparse, with some words present in just one document, a Gaussian distribution is not a suitable model for the features. Therefore, a **multinomial distribution** is used.

The multinomial distribution is parametrized by a vector for each class $\theta_{y}=(\theta_{y1},...,\theta_{yM})$, where $\theta_{ym}=p(x_m|y)$ is the probability that word $m$ occurs in class $y$. The parameters $\theta_y$ are estimated by a smoothed version of maximum likelihood:
\begin{align}
\hat{\theta}_{ym} = \frac{N_{ym}+\alpha}{N_y+\alpha M}
\end{align}
where $N_{ym}=\sum_{\mathbf{x} \in D_y} x_m$ is the number of times feature $m$ appears in the samples of class $y$ in the training set $D$ (with $D_y \subseteq D$ the samples of class $y$), and $N_y=\sum_{m=1}^{M} N_{ym}$ is the total count of all features for class $y$.

If $x_{nm}$ is the number of occurrences of feature $m$ in document $n$:
\begin{align}
N_{ym}=\sum_{\mathbf{x} \in D_y} x_m = \sum_{n:\, y_n = y} x_{nm}
\end{align}
And the conditional probability of the new data point:
\begin{align}
p(x_{new,m}|y) = p(x_{m}|y)^{x_{new,m}}
\end{align}
Putting it all together:
\begin{align}
\hat{y}_{new} = \arg \max_{y} \left[\log p(y) + \sum_{m=1}^{M} x_{new,m} \log \left(\frac{N_{ym}+\alpha}{N_y+\alpha M}\right)\right]
\end{align}
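
The decision rule above can be verified by hand on a tiny example. The following is a minimal sketch in plain Python, assuming raw counts as features and $\alpha=1$; the function names `fit_multinomial_nb` and `predict` and the toy data are hypothetical and separate from the NumPy implementation used in this notebook.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Return classes, log priors and log thetas from count rows X and labels y."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_theta = [], []
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior.append(math.log(len(rows) / len(X)))
        n_ym = [sum(row[m] for row in rows) for m in range(n_features)]
        n_y = sum(n_ym)
        # smoothed estimate (N_ym + alpha) / (N_y + alpha * M)
        log_theta.append([math.log((n + alpha) / (n_y + alpha * n_features))
                          for n in n_ym])
    return classes, log_prior, log_theta

def predict(x_new, classes, log_prior, log_theta):
    # score(y) = log p(y) + sum_m x_new_m * log theta_ym
    scores = [lp + sum(x * lt for x, lt in zip(x_new, row))
              for lp, row in zip(log_prior, log_theta)]
    return classes[scores.index(max(scores))]

# Toy counts for M = 3 words, two documents per class (made-up numbers)
X = [[3, 0, 1], [2, 1, 0], [0, 2, 3], [0, 3, 1]]
y = [0, 0, 1, 1]
classes, log_prior, log_theta = fit_multinomial_nb(X, y, alpha=1.0)

print(math.exp(log_theta[1][0]))                          # about 0.0833 (1/12), not 0
print(predict([2, 0, 1], classes, log_prior, log_theta))  # 0
print(predict([0, 2, 1], classes, log_prior, log_theta))  # 1
```

Word 0 never occurs in class 1, yet smoothing gives it probability $1/12$ instead of zero, so a single unseen word cannot drive a class score to $-\infty$.
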
To improve the performance of the model, the counts of the new data point are transformed before prediction:

$x_{new,m} = \text{TF-IDF}(m,new)$

In [None]:
#Function to calculate logarithm of prior probability of class
def prior(y):
  C = np.unique(y)
  P = np.zeros(C.shape)
  for n,c in enumerate(C):  
    prior = np.log((y == c).mean())
    P[n] = prior
  return P

#Function to calculate logarithm of parameters thetas
def multinomial_matrix(X,y,alpha):
  C = np.unique(y)
  M = X.shape[1]
  P = np.zeros((len(C),M))
  for n,c in enumerate(C):
    N_ym = X[y == c].sum(axis=0)
    N_y = X[y == c].sum()
    p_m_y = (N_ym + alpha)/(N_y + alpha*M)
    P[n] = np.log(p_m_y)
  return P

In [None]:
#Training the multinomial naive bayes "model"
alpha = 1
log_MNmtx = multinomial_matrix(X_train,y_train,alpha)
log_prior = prior(y_train)

In [None]:
#Snip of the multinomial matrix
pd.DataFrame(data=log_MNmtx, index=None, columns=vocab).iloc[:,:10]

Unnamed: 0,peter,kaminski,subject,krillean,photographi,line,101,organ,inform,deli
0,-8.053039,-11.348876,-5.128286,-11.348876,-10.655729,-4.895251,-9.962582,-5.194018,-6.305451,-11.348876
1,-8.592835,-9.816611,-5.151287,-8.058753,-7.641859,-5.20149,-9.480138,-5.127099,-6.284385,-10.327436


In [None]:
#Logarithm of the prior probability
log_prior

array([-0.71027822, -0.67630468])

In [None]:
#Predictor
def prediction(matrix,prior,X):
  N = X.shape[0]
  pred = np.zeros(N)

  X_tf ,X_tf_idf = tf_idf(X) #the occurrences are converted to tf-idf

  for n,x in enumerate(X_tf_idf):
    #use the arguments (log-theta matrix and log-prior), not the global variables
    y_hat = np.argmax(prior + (x*matrix).sum(axis=1))
    pred[n] = y_hat
  return pred

#Classification Accuracy value
accuracy = lambda y,y_hat: (y == y_hat).mean()

In [None]:
pred_test = prediction(log_MNmtx,log_prior,X_test)

In [None]:
print('Accuracy on test data:',accuracy(y_test,pred_test))

Accuracy on test data: 0.9745762711864406


### Exercise 3: Implementing SVM Classifier via Scikit-Learn

In [None]:
from sklearn import svm
from sklearn.model_selection import GridSearchCV

In [None]:
#Note: degree is only used by the 'poly' kernel and is ignored for 'sigmoid'
clf = svm.SVC(kernel='sigmoid', degree=7, C=10, max_iter=1000)
clf.fit(np.vstack((X_train,X_valid)), np.hstack((y_train,y_valid)))



SVC(C=10, degree=7, kernel='sigmoid', max_iter=1000)

In [None]:
%%time
grid = [
    {'kernel': ['linear', 'rbf', 'sigmoid'],
    'C': [1,10,100]},
    {'kernel': ['poly'],
    'degree': [2,3],
    'C': [1,10,100]}
    ]
clf = svm.SVC(max_iter=1000)
GS = GridSearchCV(
    estimator=clf, #model
    param_grid=grid, #list of dictionaries with hyperparameter values
    cv=5, #K-fold cross validation
    return_train_score=True, #training score
    verbose=2
)
GS.fit(np.vstack((X_train,X_valid)), np.hstack((y_train,y_valid)))

In [None]:
print('Best set of hyperparameters: ',GS.best_params_)

Best set of hyperparameters:  {'C': 1, 'kernel': 'sigmoid'}
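
To get a feeling for the cost of this search, the sketch below enumerates the candidates defined by the grid above with `itertools.product`. The helper `expand` is hypothetical and only mirrors how a list of dictionaries is expanded: each dictionary contributes the Cartesian product of its value lists.

```python
from itertools import product

grid = [
    {'kernel': ['linear', 'rbf', 'sigmoid'],
     'C': [1, 10, 100]},
    {'kernel': ['poly'],
     'degree': [2, 3],
     'C': [1, 10, 100]},
]

def expand(grid):
    """List every hyperparameter combination described by the grid."""
    candidates = []
    for sub_grid in grid:
        keys = sorted(sub_grid)
        for values in product(*(sub_grid[k] for k in keys)):
            candidates.append(dict(zip(keys, values)))
    return candidates

candidates = expand(grid)
n_folds = 5
print(len(candidates))            # 15 (9 from the first dict, 6 from the second)
print(len(candidates) * n_folds)  # 75 cross-validation fits
```

With the default `refit=True`, `GridSearchCV` additionally trains the best candidate once more on the full data passed to `fit`.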


In [None]:
#Separate name so that the accuracy function defined earlier is not overwritten
svm_accuracy = GS.best_estimator_.score(X_test,y_test)
print('Accuracy on test data:',svm_accuracy)

Accuracy on test data: 0.9745762711864406


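We can observe that the SVM reaches the same test accuracy as Naive Bayes (0.9746, i.e. 115 of the 118 test posts). This may mean that both models are close to the best performance achievable on this task, although the test set is small and equal scores do not show that the two models misclassify the same posts.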