# Applied Machine Learning Homework 4
Due 12/15/21 11:59PM EST

### Q1: Natural Language Processing

We will train a supervised learning model to predict whether a tweet has a positive or negative sentiment.

#### Dataset loading & dev/test splits

1.1) Load the Twitter dataset from the NLTK library

In [1]:
import nltk
nltk.download('twitter_samples')
from nltk.corpus import twitter_samples 

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\Ivan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\twitter_samples.zip.


1.2) Load the positive & negative tweets

In [2]:
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

1.3) Create a development & test split (80/20 ratio):

In [21]:
##Combine Datasets
import pandas as pd
df_positive = pd.DataFrame(all_positive_tweets)
df_negative = pd.DataFrame(all_negative_tweets)
df_positive['sentiment'] = "positive"
df_negative['sentiment'] = "negative"
df = pd.concat([df_positive,df_negative], ignore_index=True)
df.rename(columns={0: 'tweets'}, inplace=True)
sentiment = df["sentiment"]

In [208]:
#Train test split
import numpy as np
from sklearn.model_selection import train_test_split
dev_text,test_text,dev_y,test_y = train_test_split(df["tweets"],df["sentiment"], test_size = 0.2,random_state = 2021)
print(dev_y.value_counts())
print(test_y.value_counts())

negative    4009
positive    3991
Name: sentiment, dtype: int64
positive    1009
negative     991
Name: sentiment, dtype: int64
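
The split above is purely random, so the class counts differ slightly between the development set (4009 negative, 3991 positive) and the test set (991 negative, 1009 positive). If exact class balance matters, `train_test_split` accepts a `stratify=` argument. The idea behind stratification can be sketched in plain Python; the helper name `stratified_split` and the toy data below are illustrative assumptions, not part of the assignment.

```python
import random
from collections import Counter, defaultdict


def stratified_split(texts, labels, test_size=0.2, seed=2021):
    """Split so that every label keeps the same dev/test proportion."""
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    rng = random.Random(seed)
    dev_idx, test_idx = [], []
    for label in sorted(indices_by_label):
        indices = indices_by_label[label]
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        dev_idx.extend(indices[n_test:])

    dev_texts = [texts[i] for i in dev_idx]
    test_texts = [texts[i] for i in test_idx]
    dev_labels = [labels[i] for i in dev_idx]
    test_labels = [labels[i] for i in test_idx]
    return dev_texts, test_texts, dev_labels, test_labels


# Toy data: 10 positive and 10 negative placeholder tweets.
toy_texts = [f"tweet {i}" for i in range(20)]
toy_labels = ["positive"] * 10 + ["negative"] * 10

toy_dev_text, toy_test_text, toy_dev_y, toy_test_y = stratified_split(toy_texts, toy_labels)
print(Counter(toy_dev_y))
print(Counter(toy_test_y))
```

With the real data, passing `stratify=df["sentiment"]` to `train_test_split` should give the same kind of per-class guarantee without a custom helper.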


#### Data preprocessing

We will preprocess the data before tokenizing it: remove the `#` symbol, hyperlinks, stop words, and punctuation. You can use the `re` package in Python to find and replace these strings.
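
As a rough sketch of how the `re`-based steps fit together, the three string-level clean-ups (`#`, hyperlinks, punctuation) can be combined into one helper. The function name `clean_tweet` and the sample tweet are made up for illustration; stop-word removal and stemming are left out because they need a tokenizer.

```python
import re

URL_PATTERN = re.compile(r"http\S+")     # "http" followed by non-whitespace
PUNCT_PATTERN = re.compile(r"[^\w\s]")   # anything that is not a word char or whitespace


def clean_tweet(text: str) -> str:
    """Strip '#', hyperlinks and punctuation, then collapse whitespace."""
    text = text.replace("#", "")
    text = URL_PATTERN.sub("", text)
    text = PUNCT_PATTERN.sub("", text)
    return " ".join(text.split())


sample = "Loving #NLP! Read more at https://t.co/abc123 :)"
cleaned_sample = clean_tweet(sample)
print(cleaned_sample)
```

Removing hyperlinks before punctuation matters here: if punctuation were stripped first, `https://t.co/abc123` would turn into `httpstcoabc123`, which `http\S+` would still match in this case, but the order keeps the intent explicit.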

1.4) Replace the `#` symbol with '' in every tweet

In [209]:
# Remove the '#' symbol by replacing it with an empty string
dev_text = [text.replace('#', '') for text in dev_text]
test_text = [text.replace('#', '') for text in test_text]


1.5) Replace hyperlinks with '' in every tweet

In [210]:
# Remove hyperlinks: "http" followed by any run of non-whitespace characters
import re
dev_text = [re.sub(r'http\S+', '', text) for text in dev_text]
test_text = [re.sub(r'http\S+', '', text) for text in test_text]

1.6) Remove all stop words

In [211]:
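from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def removestop(sentence):
    # Tokenize, then drop every token found in scikit-learn's English stop-word list
    token_words = word_tokenize(sentence)
    stop_sentence = [word for word in token_words if word not in ENGLISH_STOP_WORDS]
    return " ".join(stop_sentence)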

dev_text = [removestop(text) for text in dev_text]
test_text = [removestop(text) for text in test_text]
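
One detail worth checking in this step: scikit-learn's `ENGLISH_STOP_WORDS` contains lowercase words only, so a case-sensitive membership test keeps capitalized stop words such as "The". The sketch below shows the difference using a tiny hypothetical stop-word set and whitespace tokenization in place of `word_tokenize`; both are simplifications for illustration.

```python
# Hypothetical, deliberately tiny stop-word list (lowercase, like ENGLISH_STOP_WORDS).
TOY_STOP_WORDS = frozenset({"the", "is", "a", "to", "and"})


def remove_stop_words(sentence, stop_words=TOY_STOP_WORDS, ignore_case=True):
    """Drop stop words; optionally compare in lowercase."""
    kept = []
    for word in sentence.split():
        key = word.lower() if ignore_case else word
        if key not in stop_words:
            kept.append(word)
    return " ".join(kept)


example = "The weather is great and I want to go out"
case_sensitive = remove_stop_words(example, ignore_case=False)
case_insensitive = remove_stop_words(example, ignore_case=True)
print(case_sensitive)
print(case_insensitive)
```

Whether the lowercase comparison helps on the tweet data is an empirical question; it changes which tokens reach the vectorizer, so the scores may shift.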


1.7) Remove all punctuation

In [212]:
#code here
dev_text = [ re.sub(r'[^\w\s]', '', text) for text in dev_text ]
test_text = [ re.sub(r'[^\w\s]', '', text) for text in test_text ]


1.8) Apply stemming on the development & test datasets using Porter algorithm

In [213]:
# Stem every token with NLTK's Porter stemmer
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
porter = PorterStemmer()
def stemSentence(sentence):
    # Tokenize, stem each token, and rebuild the sentence
    token_words = word_tokenize(sentence)
    stem_sentence = [porter.stem(word) for word in token_words]
    return " ".join(stem_sentence)

dev_text = [stemSentence(text) for text in dev_text]
test_text = [stemSentence(text) for text in test_text]


#### Model training

1.9) Create bag of words features for each tweet in the development dataset

In [214]:
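# Fit the vocabulary on the development set only, then encode each tweet as token counts
from sklearn.feature_extraction.text import CountVectorizer
vector = CountVectorizer()
dev_text = vector.fit_transform(dev_text)
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names()
feature_names = vector.get_feature_names_out()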
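
To make the bag-of-words representation concrete, here is a minimal pure-Python sketch of the same idea: build a vocabulary from the development documents, then count tokens per document. The helper names and toy documents are assumptions for illustration; `CountVectorizer` additionally lowercases text, applies its own token pattern, and returns a sparse matrix.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct token to a column index (alphabetical order)."""
    tokens = sorted({token for doc in docs for token in doc.split()})
    return {token: index for index, token in enumerate(tokens)}


def bag_of_words(docs, vocab):
    """Return one row of token counts per document; unseen tokens are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(token for token in doc.split() if token in vocab)
        rows.append([counts.get(token, 0) for token in vocab])
    return rows


toy_dev_docs = ["good day good mood", "bad day"]
toy_vocab = build_vocabulary(toy_dev_docs)            # learned from dev data only
toy_dev_matrix = bag_of_words(toy_dev_docs, toy_vocab)
toy_test_matrix = bag_of_words(["good unseen word"], toy_vocab)

print(toy_vocab)
print(toy_dev_matrix)
print(toy_test_matrix)
```

The test document reuses the development vocabulary, which mirrors calling `fit_transform` on the development set and `transform` on the test set.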




1.10) Train a supervised learning model of choice on the development dataset

In [215]:
#code here
from sklearn.linear_model import LogisticRegressionCV
lr = LogisticRegressionCV(max_iter = 1000).fit(dev_text, dev_y)
lr.score(dev_text,dev_y)

0.899375
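
Because `dev_text` is overwritten with the feature matrix, the raw tweets have to be rebuilt later. One possible alternative, sketched here under assumptions, is to bundle the vectorizer and classifier in a scikit-learn pipeline so the text lists stay untouched. The toy tweets and variable names are hypothetical, and plain `LogisticRegression` is swapped in for `LogisticRegressionCV` because four samples are too few for cross-validation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical, already-preprocessed toy tweets.
toy_dev_text = [
    "good great happy",
    "great love happy",
    "bad awful sad",
    "awful hate sad",
]
toy_dev_y = ["positive", "positive", "negative", "negative"]

# The pipeline fits the vectorizer and the classifier together on raw strings.
bow_model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
bow_model.fit(toy_dev_text, toy_dev_y)

# New text goes through the same fitted vocabulary automatically.
toy_predictions = bow_model.predict(["happy great day", "sad awful day"])
print(list(toy_predictions))
```

Swapping `CountVectorizer()` for `TfidfVectorizer()` in the same pipeline would then compare the two feature types on identical preprocessing.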

1.11) Create TF-IDF features for each tweet in the development dataset

In [216]:
from sklearn.feature_extraction.text import TfidfVectorizer
# dev_text now holds the bag-of-words matrix, so re-split the raw tweets and preprocess them again.
# The same random_state reproduces the earlier split, so dev_y and test_y are unchanged.
dev_text,test_text2,dev_y,test_y = train_test_split(df["tweets"],df["sentiment"], test_size = 0.2,random_state = 2021)
#Replace the # symbol with '' in every tweet
dev_text = [text.replace('#','') for text in dev_text]
test_text2 = [text.replace('#','') for text in test_text2]
#Replace hyperlinks with '' in every tweet
dev_text = [re.sub(r'http\S+', '', text) for text in dev_text]
test_text2 = [re.sub(r'http\S+', '', text) for text in test_text2]
#Remove all punctuation
dev_text = [ re.sub(r'[^\w\s]', '', text) for text in dev_text ]
test_text2 = [ re.sub(r'[^\w\s]', '', text) for text in test_text2 ]
#Apply stemming on the development & test datasets using Porter algorithm
dev_text = [stemSentence(text) for text in dev_text]
test_text2 = [stemSentence(text) for text in test_text2]

## Stop words are removed by TfidfVectorizer itself (after stemming) instead of by removestop()
vector2 = TfidfVectorizer(stop_words = "english")
dev_text = vector2.fit_transform(dev_text)
dev_text.shape



(8000, 14511)
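
For intuition about what the TF-IDF matrix contains, the weighting can be reproduced by hand on a toy corpus. With its default settings, `TfidfVectorizer` uses the smoothed inverse document frequency `idf(t) = ln((1 + n) / (1 + df(t))) + 1` and then scales each row to unit L2 norm. The sketch below follows that recipe in plain Python; the function name and toy documents are illustrative assumptions, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(docs)

    # Document frequency: number of documents containing each token.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights = [term_freq[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


toy_docs = ["good day", "bad day"]
toy_tfidf_vocab, toy_tfidf = tfidf_rows(toy_docs)
print(toy_tfidf_vocab)
for row in toy_tfidf:
    print([round(value, 3) for value in row])
```

In the first toy document, "good" receives a larger weight than "day" because "day" appears in every document and is therefore less informative.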

1.12) Train the same supervised learning algorithm on the development dataset with TF-IDF features

In [217]:
#code here
lr2 = LogisticRegressionCV(max_iter = 1000).fit(dev_text, dev_y)
lr2.score(dev_text,dev_y)

0.94625

1.13) Compare the performance of the two models on the test dataset

In [219]:
#code here
test1 = vector.transform(test_text)
test2 = vector2.transform(test_text2)
print("Score for test dataset using BOW:", lr.score(test1, test_y))
print("Score for test dataset using TF-IDF:", lr2.score(test2, test_y))
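## Both models overfit: development accuracy (about 0.90 for BOW, 0.95 for TF-IDF) is well above test accuracy (about 0.75 for both).
## The TF-IDF model still performs slightly better than the BOW model on the test set.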

Score for test dataset using BOW: 0.7455
Score for test dataset using TF-IDF: 0.7535
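
The size of the overfitting gap, and of the difference between the two models, can be read directly from the reported accuracies. The short calculation below uses only the numbers printed in the cells above; the variable names are illustrative.

```python
# Accuracies copied from the cell outputs above.
scores = {
    "BOW": {"dev": 0.899375, "test": 0.7455},
    "TF-IDF": {"dev": 0.94625, "test": 0.7535},
}
n_test = 2000  # 1009 positive + 991 negative test tweets

# Gap between development and test accuracy for each model.
gaps = {name: s["dev"] - s["test"] for name, s in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: dev-test accuracy gap = {gap:.3f}")

# Test-accuracy difference expressed as a number of tweets.
extra_correct = round((scores["TF-IDF"]["test"] - scores["BOW"]["test"]) * n_test)
print(f"TF-IDF classifies {extra_correct} more test tweets correctly than BOW")
```

A difference of this size on 2000 tweets is small, so it is hard to say from one split alone whether TF-IDF is genuinely better; the two pipelines also handle stop words differently, which may account for part of it.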
