- ## 1. Binary Classification on Text Data.
In this problem, you will implement several machine learning
techniques from the class to perform classification on text data. Throughout the problem, we
will be working on the NLP with Disaster Tweets Kaggle competition, where the task is to predict
whether or not a tweet is about a real disaster.

In [644]:
#Hrudai Battini HW 2, Part 2 Applied Machine Learning
import numpy as np
import seaborn as sns
import os
import pandas as pd
import nltk
import string
import re
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from matplotlib import pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import nltk
nltk.download('wordnet')
nltk.download("omw-1.4")
nltk.download('stopwords')



[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hruda\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\hruda\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hruda\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

- ### (a) Download the data. 
Download the training and test data from Kaggle, and answer the follow-
ing questions: (1) how many training and test data points are there? and (2) what percentage
of the training tweets are of real disasters, and what percentage is not? Note that the meaning
of each column is explained in the data description on Kaggle.

In [645]:
X_Train = pd.read_csv("train.csv")
X_Test = pd.read_csv("test.csv")

#1
print(len(X_Train),'training points.')
print(len(X_Test),'testing points')
#2 
num_Tweets = X_Train['target'].value_counts()
print(num_Tweets[0]/len(X_Train), "are not of real Disasters.")
print(num_Tweets[1]/len(X_Train), "are of real Disasters.")

7613 training points.
3263 testing points
0.5703402075397347 are not of real Disasters.
0.4296597924602653 are of real Disasters.


- ### (b) Split the training data. 
Since we do not know the correct values of labels in the test data,
we will split the training data from Kaggle into a training set and a development set (a de-
velopment set is a held out subset of the labeled data that we set aside in order to fine-tune
models, before evaluating the best model(s) on the test data). Randomly choose 70% of the
data points in the training data as the training set, and the remaining 30% of the data as the
development set. Throughout the rest of this problem we will keep these two sets fixed. The
idea is that we will train different models on the training set, and compare their performance
on the development set, in order to decide what to submit to Kaggle.

In [646]:
X_train,X_dev = train_test_split(X_Train,train_size=0.7)
#Combined split datasets
df = pd.concat([X_train,X_dev])
lenx = len(X_train)


- ### (c) Preprocess the data. 
Since the data consists of tweets, they may contain significant amounts
of noise and unprocessed content. You may or may not want to do one or all of the following.
Explain the reasons for each of your decision (why or why not).
- • Convert all the words to lowercase.
- • Lemmatize all the words (i.e., convert every word to its root so that all of “running,” “run,”
and “runs” are converted to “run” and and all of “good,” “well,” “better,” and “best” are
converted to “good”; this is easily done using nltk.stem).
- • Strip punctuation.
- • Strip the stop words, e.g., “the”, “and”, “or”.
- • Strip @ and urls. (It’s Twitter.)
- • Something else? Tell us about it

In [647]:
l = WordNetLemmatizer()
t = nltk.tokenize.WhitespaceTokenizer()
def lemat(text):
    st =  [l.lemmatize(x,pos='a') for x in t.tokenize(text)]
    num = 0
    for x in st:
        st[num] = x.translate(str.maketrans('','',string.punctuation))
        num+=1
    
    return " ".join(st)

In [648]:
import enchant

t = nltk.tokenize.WhitespaceTokenizer()

stop_words = set(stopwords.words('english'))
eng_dict = enchant.Dict("en")

def stopw(text):
    out = [w for w in t.tokenize(text) if not w in stop_words]
    return " ".join(out)

def stripURLS(text):
    out2 = [l for l in t.tokenize(text) if not re.search(r'http\S+',l) or re.search(r'www\S+',l)]
    return " ".join(out2)

def isEnglish(text):
    out = [w for w in t.tokenize(text) if eng_dict.check(w)]
    return " ".join(out)


In [649]:
#Convert to Lowercase to not confuse same letters avoid bad comparisons
df["text"] = df['text'].str.lower()
#Lemmatize to simplify comparisons
#Punctuation to simplify comparisons as well
df["text"] = df["text"].apply(lemat)
#Removes Stopwords to directly target keywords
df['text'] = df['text'].apply(stopw)
#Removes Urls and @s from strings as they are irrelevant to keyword comparisons
df['text'] = df['text'].apply(stripURLS)
#Remove non english words
df['text'] = df['text'].apply(isEnglish)
Y_train = X_train.loc[:,'target']
Y_dev = X_dev.loc[:,'target']

#df = df.drop('target',axis=1)

X_train = df.iloc[:lenx,:]
X_dev = df.iloc[lenx:,:]


#print(df["text"][32])



- ### (d) Bag of words model.
 The next task is to extract features in order to represent each tweet using the binary “bag of words” model, as discussed in lectures. The idea is to build a vocabulary of the words appearing in the dataset, and then to represent each tweet by a feature vector x whose length is the same as the size of the vocabulary, where xi =1 if the i’th vocabulary word appears in that tweet, and xi =0 otherwise. In order to build the vocabulary, you should choose some threshold M, and only include words that appear in at least k different tweets; this is important both to avoid run-time and memory issues, and to avoid noisy/unreliable features that can hurt learning. Decide on an appropriate threshold M, and discuss how you made this decision. Then, build the bag of words feature vectors for both the training and development sets, and report the total number of features in these vectors.

In order to construct these features, we suggest using the CountVectorizer class in sklearn. A couple of notes on using this function: (1) you should set the option “binary=True” in order to ensure that the feature vectors are binary; and (2) you can use the option “min_df=M” in order to only include in the vocabulary words that appear in at least M different tweets. Finally,make sure you fit CountVectorizer only once on your training set and use the same instance to process both your training and development sets (don’t refit it on your development set a second time).

Important: at this point you should only be constructing feature vectors for each data point using the text in the “text” column. You should ignore the “keyword” and “location” columns for now.

In [650]:
#Threshold M Decision 
from sklearn.feature_extraction.text import CountVectorizer

# vectorize the training set
# Threshold = 0.75% as 10% returns 0 words, 1% narrows the pool to 70 which is unsuitable for the study
# Between 0.5% and 1% is the sweet spot. 
count_vect = CountVectorizer(binary=True,min_df=20)
X_train = count_vect.fit_transform(X_train["text"]).toarray() #Use combined dataframe am i supposed to only use X_train??
X_dev = count_vect.transform(X_dev['text']).toarray()
names = count_vect.get_feature_names_out()
print(X_train.shape)
print(X_dev.shape)

colLen = len(df.columns)

(5329, 456)
(2284, 456)


- ### (e) Logistic regression. 
In this question, we will be training logistic regression models using bag of words feature vectors obtained in part (d). We will use the F1-score as the evaluation metric.

Note that the F1-score, also known as F-score, is the harmonic mean of precision and recall. Recall that precision and recall are:

precision = $\frac{true positives}{true positives + false positives}$ $recall = \frac{true positives}{true positives+false negatives}$ .

F1-score is the harmonic mean (or, see it as a weighted average) of precision and recall:

F1 = $\frac{2}{precision^{−1} +recall^{−1}} = 2\frac{precision·recall}{precision +recall}$

We use F1-score because it gives a more comprehensive view of classifier performance than accuracy. For more information on this metric see F1-score.

We ask you to train the following classifiers. We suggest using the LogisticRegression implementation in sklearn.

1.  Train a logistic regression model without regularization terms. You will notice that the default sklearn logistic regression utilizes L2 regularization. You can turn off L2 regu-larization by changing the penalty parameter. Report the F1 score in your training and in your development set. Comment on whether you observe any issues with overfitting or underfitting.


In [651]:
regr = LogisticRegression(penalty='none',solver='saga')

regr.fit(X_train,Y_train)


Y_xdev_hat = pd.DataFrame()
Y_xtrain_hat = pd.DataFrame()

Y_xtrain_hat["accuracy"] = regr.predict(X_train)
Y_xdev_hat["accuracy"] = regr.predict(X_dev)

F1_xt_1 = f1_score(Y_train,Y_xtrain_hat)
F1_xd_1 = f1_score(Y_dev,Y_xdev_hat)

#The Development set is underfitting compared to the training set
print(F1_xt_1,F1_xd_1)

0.7640236686390532 0.7003806416530722




2.  Train a logistic regression model with L1 regularization. Sklearn provides some good examples for implementation. Report the performance on both the training and the development sets.


In [652]:
regr = LogisticRegression(penalty='l1',solver='saga')
regr.fit(X_train,Y_train)


Y_xdev_hat = pd.DataFrame()
Y_xtrain_hat = pd.DataFrame()
theta = pd.DataFrame()
theta['words'] = pd.DataFrame(data=regr.coef_[0],index=names)
Y_xtrain_hat["accuracy"] = regr.predict(X_train)
Y_xdev_hat["accuracy"] = regr.predict(X_dev)

F1_xt_2 = f1_score(Y_train,Y_xtrain_hat)
F1_xd_2 = f1_score(Y_dev,Y_xdev_hat)

#The Development set is underfitting compared to the training set
print(F1_xt_2,F1_xd_2)

0.7501817300702689 0.6973610331274566


3. Similarly, train a logistic regression model with L2 regularization. Report the perfor-mance on the training and the development sets.


In [653]:
regr = LogisticRegression(penalty='l2',solver='saga')
regr.fit(X_train,Y_train)


Y_xdev_hat = pd.DataFrame()
Y_xtrain_hat = pd.DataFrame()

Y_xtrain_hat["accuracy"] = regr.predict(X_train)
Y_xdev_hat["accuracy"] = regr.predict(X_dev)

F1_xt_3 = f1_score(Y_train,Y_xtrain_hat)
F1_xd_3 = f1_score(Y_dev,Y_xdev_hat)

print(F1_xt_3,F1_xd_3)

0.7543817527010803 0.7053620784964069


4. Which one of the three classifiers performed the best on your training and developmentset? Did you observe any overfitting and did regularization help reduce it? Support your answers with the classifier performance you got.


In [654]:
#Of the three classifiers, having no penalty performed the best. I observed normal fitting by all three methods and regularization seemed to underfit the data a tad bit. 
# The values I got for training and development sets respectively for None, L1 and L2 regularization are as follows. It is noted that None performed the best with L2 regularization being a close second. 
print(F1_xt_1,F1_xd_1)
print(F1_xt_2,F1_xd_2)
print(F1_xt_3,F1_xd_3)

### I can modify these values as I see fit 

0.7640236686390532 0.7003806416530722
0.7501817300702689 0.6973610331274566
0.7543817527010803 0.7053620784964069


5. Inspect the weight vector of the classifier with L1 regularization (in other words, look at the θ you got after training). You can access the weight vector of the trained model using the coef_ attribute of a LogisticRegression instance. What are the most important words for deciding whether a tweet is about a real disaster or not?

In [655]:

#print(theta.columns)
theta = theta.sort_values(by='words',ascending=False)
print(theta.loc[:,'words'][:5])
#10 Most important words deciding if a tweet is about a real disaster
    

spill         4.210104
derailment    3.721984
bombing       3.657467
airport       3.618402
wildfire      3.161164
Name: words, dtype: float64


- ### (f ) Bernoulli Naive Bayes.
 Implement a Bernoulli Naive Bayes classifier to predict the probability of whether each tweet is about a real disaster. Train this classifier on the training set, and report its F1-score on the development set.

Important: For this question you should implement the classifier yourself similar to what was shown in class, without using any existing machine learning libraries such as sklearn. You may only use basic libraries such as numpy.

As you work on this problem, you may find that some words in the vocabulary occur in the
development set but are not in the training set. As a result, the standard Naive Bayes model
learns to assign them an occurrence probability of zero, which becomes problematic when
we observe this "zero probability" event on our development set.

The solution to this problem is a form of regularization called Laplace smoothing or additive
smoothing. The idea is to use "pseudo-counts", i.e. to increment the number of times we
have seen each word or document by some number of "virtual" occurrences α. Thus, the
Naive Bayes model will behave as if every word or document has been seen at least αtimes.

More formally, the ψjk parameter of Bernoulli Naive Bayes is the probability of observing
word j within class k. Its normal maximum likelihood estimate is
ψjk =njk
nk
,

where nk is the number of documents of class k and njk is the number of documents of class
k that contain word j. In Laplace smoothing, we increment each counter njk by α(thus we
count each word an extra αtimes), and the resulting estimate for ψjk becomes:

ψjk = njk +α
nk +2α.
It’s normal to take α=1.

In [660]:
#Bernoulli Naive Bayes Implementation 

n = X_train.shape[0]
d = X_train.shape[1]
K = 2 #Binary classes = 2
a = 1 #Alpha for smoothing

psis = np.zeros([K,d])
phis = np.zeros([K])

for k in range(K):
    X_k = X_train[Y_train == k]
    
    psis[k] = (np.sum(X_k, axis=0)+a)/(np.sum(X_k)+2*a)
    phis[k] = X_k.shape[0] / float(n)
    

 

In [667]:
#Naive_Bayes Prediction function from Lecture
def nb_predictions(x, psis, phis):
    n,d = x.shape
    x = np.reshape(x, (1, n, d))
    psis = np.reshape(psis, (K, 1, d))
    
    # clip probabilities to avoid log(0)
    psis = psis.clip(1e-14, 1-1e-14)
    
    # compute log-probabilities
    logpy = np.log(phis).reshape([K,1])
    logpxy = x * np.log(psis) + (1-x) * np.log(1-psis)
    logpyx = logpxy.sum(axis=2) + logpy

    return logpyx.argmax(axis=0).flatten(), logpyx.reshape([K,n])

idx, logpyx = nb_predictions(X_dev, psis, phis)
print("Probabilites:", phis)


Probabilites: [0.57571777 0.42428223]


In [668]:
F1_BNB = f1_score(Y_dev,idx)
print(F1_BNB)

0.7005891805034815


- ### (g) Model comparison. 
You just implemented a generative classifier and a discriminative classifier. Reflect on the following:
- • Which model performed the best in predicting whether a tweet is of a real disaster or not? Include your performance metric in your response. Comment on the pros and cons of using generative vs discriminative models.
- • Think about the assumptions that Naive Bayes makes. How are the assumptions different from logistic regressions? Discuss whether it’s valid and efficient to use Bernoulli Naive Bayes classifier for natural language texts.

1. The Logistic Regression Model outperforms the Naive Bayes Model in prediciting whether a tweet is of a real disaster. The difference between the F1 scores is minimal, almost nonexistant. For Logistic Regression and Naive Bayes are listed below. The pros and cons of using generative vs discriminative models are numerous. The pros of using the discriminative model, Logistic Regression, is that we get a score for each instance of a word based on frequency in the twitter text dataset. Furthremore we are able to interpret the models value as the conditional probability of finding y given x. Cons with the discriminative appraoch are that words that are poorly classifying data in the context of missclassifying a point. Using the generative model, Naive Bayes, for this dataset of text classification is good because we are able to filter out a lot of unncessary spam in the text. The pros of this method is generating values and dealing with other texts. The main con is that given an outlier in the dataset, the data will be skewed significantly. 
2. The assumptions that naive bayes makes are that it assumes every event is independent, so words in the text are independent of other words. This leads to over and under confidence in the accuracy of the models. Comparitvely a logistic regressions assumption is that there is little correlation between explanatory variables. They can lead the model to incorect interpretation. With the assumptions taken by Bernoulli Naive Bayes it is a valid classifier for Natural Language Texts but it is not efficient as text independancy could be a limiting factor in emphasizing the connection between texts.

In [673]:
#Logistic Regression L2, Naive Bayes 
print("Logistic Regression: ", F1_xd_3, "\nNaive Bayes: ", F1_BNB)

Logistic Regression:  0.7053620784964069 
Naive Bayes:  0.7005891805034815


- ### (h) N-gram model. 
The N-gram model is similar to the bag of words model, but instead of using individual words we use N-grams, which are contiguous sequences of words. For example, using N =2, we would says that the text “Alice fell down the rabbit hole” consists of the sequence of 2-grams: ["Alice fell", "fell down", "down the", "the rabbit", "rabbit hole"], and the following sequence of 1-grams: ["Alice", "fell", "down", "the", "rabbit", "hole"]. All eleven of these symbols may be included in the vocabulary, and the feature vector x is defined according to xi = 1 if the i’th vocabulary symbol occurs in the tweet, and xi = 0 otherwise. Using N =2, construct feature representations of the tweets in the training and development tweets. Again, you should choose a threshold M, and only include symbols in the vocabulary that occur in at least M different tweets in the training set. Discuss how you chose the threshold M, and report the total number of 1-grams and 2-grams in your vocabulary. In addition, take 10 2-grams from your vocabulary, and print them out.

Then, implement a logistic regression and a Bernoulli classifier to train on 2-grams. You may reuse the code in (e) and (f ). You may also choose to use or not use a regularization term, depending on what you got from (e). Report your results on training and development set.Do these results differ significantly from those using the bag of words model? Discuss what this implies about the task.

Again, we suggest using CountVectorizer to construct these features. In order to include both 1-gram and 2-gram features, you can set ngram_range=(1,2). Note also that in this case, since there are probably many different 2-grams in the dataset, it is especially important carefully set min_df in order to avoid run-time and memory issues.

- ### (i) Determine performance with the test set
Re-build your feature vectors and re-train your preferred classifier (either bag of word or n-gram using either logistic regression or Bernoulli naive bayes) using the entire Kaggle training data (i.e. using all of the data in both the training
and development sets). Then, test it on the Kaggle test data. Submit your results to Kaggle, and report the resulting F1-score on the test data, as reported by Kaggle. Was this lower or higher than you expected? Discuss why it might be lower or higher than your expectation.