# **Applied Machine Learning Homework 5**
**Due 12 Dec, 2022 (Monday) 11:59 PM EST**

### Natural Language Processing
We will train a supervised model to predict whether a movie review is positive or negative.

####  **Dataset loading & dev/test splits**

**1.0) Load the movie reviews dataset from NLTK library**

In [185]:
# Standard library
import re
import string

# Third-party
import nltk
import pandas as pd
from nltk.corpus import movie_reviews, stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Corpora and tokenizer models used below
nltk.download("movie_reviews")
nltk.download("stopwords")
nltk.download("punkt")

# A set makes the stop-word membership test in 1.6 O(1) per token
stop = set(stopwords.words("english"))

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [68]:
negative_fileids = movie_reviews.fileids('neg')
positive_fileids = movie_reviews.fileids('pos')

# Each review is stored as a list of tokens; join them into one space-separated string
pos_document = [(' '.join(movie_reviews.words(file_id)), 'pos') for file_id in positive_fileids]
neg_document = [(' '.join(movie_reviews.words(file_id)), 'neg') for file_id in negative_fileids]

# Lists of positive and negative reviews
pos_list = [pos[0] for pos in pos_document]
neg_list = [neg[0] for neg in neg_document]

**1.1) Make a data frame that has reviews and its label**

In [69]:
# code here
# Positive reviews first, then negative ones, each paired with its label
df = pd.DataFrame({
  'review': pos_list + neg_list,
  'label': ['positive'] * len(pos_list) + ['negative'] * len(neg_list),
})

In [70]:
df.head()

| | review | label |
|---|---|---|
| 0 | films adapted from comic books have had plenty... | positive |
| 1 | every now and then a movie comes along from a ... | positive |
| 2 | you ' ve got mail works alot better than it de... | positive |
| 3 | " jaws " is a rare film that grabs your attent... | positive |
| 4 | moviemaking is a lot like being the general ma... | positive |


**1.2) Look at the class distribution of the movie reviews**

In [71]:
# code here
print(df['label'].value_counts())

positive    1000
negative    1000
Name: label, dtype: int64


**1.3) Create a development & test split (80/20 ratio):**

In [143]:
# code here
text=df.drop(columns=['label']) #dataframe
y=df['label'] #series

dev_text, test_text, dev_y, test_y = train_test_split(text, y, test_size=0.2, random_state=0)

#### **Data preprocessing**
We will do some preprocessing before tokenizing the data: removing the `#` symbol, hyperlinks, stop words, and punctuation. You may use the `re` package for this.

**1.4) Replace the `#` symbol with '' in every review**

In [144]:
# code here
# Vectorized string replace; regex=False treats '#' as a literal character
dev_text['review'] = dev_text['review'].str.replace('#', '', regex=False)
test_text['review'] = test_text['review'].str.replace('#', '', regex=False)

**1.5) Replace hyperlinks with '' in every review**

We understand that removing hyperlinks is not straightforward for this dataset, since the hyperlinks themselves contain spaces. Feel free to skip this question and proceed with the rest of the analysis. We will not deduct marks for this question.

In [None]:
# code here
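
Because tokenization has put spaces inside each hyperlink, one possible approach is a regular expression that tolerates whitespace around `:`, `/`, and `.`. The sketch below is only an illustration under stated assumptions: the names `SPACED_URL` and `remove_spaced_urls` and the sample sentence are hypothetical, the pattern only handles links that begin with `http` or `https`, and it can over-match when a sentence-ending period directly follows a link.

```python
import re

# Matches links such as "http : / / www . example . com / reviews", where the
# tokenizer has inserted spaces around ':', '/', and '.'.
# Assumption: every link starts with http or https.
SPACED_URL = re.compile(
    r"https?\s*:\s*/\s*/\s*"        # scheme, possibly split by spaces
    r"[\w-]+(?:\s*\.\s*[\w-]+)+"    # host labels joined by (spaced) dots
    r"(?:\s*/\s*[\w~-]+)*",         # optional path segments
    flags=re.IGNORECASE,
)

def remove_spaced_urls(text):
    """Replace spaced-out hyperlinks with '' and collapse leftover whitespace."""
    without_urls = SPACED_URL.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()

sample = "visit http : / / www . example . com / reviews for more"
print(remove_spaced_urls(sample))  # visit for more
```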

**1.6) Remove all stop words**

In [145]:
# code here
for i in range(len(dev_text)):
  tokens=word_tokenize(dev_text.iloc[i,0])
  tokens=[word for word in tokens if word not in stop]
  dev_text.iloc[i,0]=" ".join(tokens)

for i in range(len(test_text)):
  tokens=word_tokenize(test_text.iloc[i,0])
  tokens=[word for word in tokens if word not in stop]
  test_text.iloc[i,0]=" ".join(tokens)
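
The filtering step above can also be written as a small reusable function. The sketch below assumes a tiny, made-up stop list (`STOP_WORDS`) in place of NLTK's English list, and it splits on whitespace instead of calling `word_tokenize`, which is a simplification that relies on the reviews having been joined from tokens with single spaces in 1.0.

```python
# Hypothetical mini stop list for illustration; the notebook uses NLTK's list.
STOP_WORDS = frozenset({"the", "a", "is", "of", "and"})

def drop_stop_words(text):
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

print(drop_stop_words("the plot is a mess"))  # plot mess
```

Using a set (or `frozenset`) keeps each membership test constant-time, which matters when the check runs once per token across 2,000 reviews.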

**1.7) Remove all punctuation**

In [148]:
# code here
for i in range(len(dev_text)):
  for character in string.punctuation:
    dev_text.iloc[i,0]=dev_text.iloc[i,0].replace(character, '')

for i in range(len(test_text)):
  for character in string.punctuation:
    test_text.iloc[i,0]=test_text.iloc[i,0].replace(character, '')
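
The nested loop above calls `replace` once per punctuation character for every review. A possible alternative, sketched here with the hypothetical names `PUNCT_TABLE` and `strip_punctuation`, is a single `str.translate` pass that deletes every character in `string.punctuation`.

```python
import string

# Translation table that maps every punctuation character to "delete".
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Delete all characters in string.punctuation in one pass."""
    return text.translate(PUNCT_TABLE)

print(strip_punctuation("it's #1!"))  # its 1
```

Like the loop, this deletes only the punctuation characters themselves, so any spaces that surrounded them remain in the text.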

**1.8) Apply stemming on the development & test datasets using the Porter algorithm**

In [149]:
#code here
porter=PorterStemmer()

def stemsentence(sentence):
  token_words=word_tokenize(sentence)
  stem_sentence=[porter.stem(word) for word in token_words]
  return " ".join(stem_sentence)

for i in range(len(dev_text)):
  dev_text.iloc[i,0]=stemsentence(dev_text.iloc[i,0])

for i in range(len(test_text)):
  test_text.iloc[i,0]=stemsentence(test_text.iloc[i,0])

#### **Model training**

**1.9) Create bag of words features for each review in the development dataset**

In [176]:
#code here
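vector=CountVectorizer()

# Note: fit_transform is called on one review at a time, so each row is encoded
# against that review's own vocabulary and the column indices are not comparable
# across rows. The model in 1.10 refits `vector` on all development reviews at once.
BOW=[0] * len(dev_text)
for i in range(len(dev_text)):
  a=[dev_text.iloc[i,0]]
  BOW[i]=vector.fit_transform(a)

dev_text['BOW']=BOW
dev_text.head()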

| | review | BOW |
|---|---|---|
| 582 | note may consid portion follow text spoiler fo... | `(0, 210)\t1\n (0, 189)\t1\n (0, 62)\t2\n ...` |
| 159 | last two film shine snow fall cedar australian... | `(0, 149)\t1\n (0, 289)\t2\n (0, 100)\t5\n ...` |
| 1827 | best thing lake placid 80 minut long glad wast... | `(0, 11)\t2\n (0, 105)\t1\n (0, 56)\t4\n (...` |
| 318 | year initi releas scream horror send veteran h... | `(0, 297)\t1\n (0, 133)\t1\n (0, 205)\t2\n ...` |
| 708 | upon time solitari ogr name shrek mike myer re... | `(0, 296)\t1\n (0, 280)\t1\n (0, 248)\t2\n ...` |
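
For count vectors to be comparable across reviews, every review has to be counted against the same vocabulary. A minimal pure-Python sketch of that idea follows, assuming whitespace tokenization and two made-up reviews; `docs`, `vocab`, and `bag_of_words` are illustrative names, not part of the assignment.

```python
from collections import Counter

docs = ["good film good cast", "bad film"]

# One vocabulary shared by all documents, sorted so the column order is stable.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Count how often each vocabulary word occurs in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [bag_of_words(doc) for doc in docs]
print(vocab)   # ['bad', 'cast', 'film', 'good']
print(matrix)  # [[0, 1, 1, 2], [1, 0, 1, 0]]
```

Column 2 means `film` in both rows, which is what lets a classifier learn one weight per word.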


**1.10) Train a Logistic Regression model on the development dataset**

In [179]:
dev_x_BOW=vector.fit_transform(list(dev_text['review']))
test_x_BOW=vector.transform(list(test_text['review']))

In [180]:
#code here
lr=LogisticRegression().fit(dev_x_BOW, dev_y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [181]:
lr.score(test_x_BOW, test_y)

0.83
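
The convergence warning above suggests raising `max_iter`. A minimal sketch of that option is shown below on toy data. The toy reviews, the use of `make_pipeline`, and the value `max_iter=1000` are assumptions chosen for illustration, and the 0.83 reported above comes from the default settings, not from this code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the preprocessed reviews (illustrative only).
toy_reviews = ["great film love cast", "bad plot awful act",
               "love great plot", "awful bad film"]
toy_labels = ["positive", "negative", "positive", "negative"]

# The pipeline fits the vectorizer on the training text and reuses the same
# vocabulary at prediction time; max_iter=1000 raises the solver's iteration cap.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(toy_reviews, toy_labels)

predictions = model.predict(["love great cast", "awful bad act"])
print(predictions)
```

Bundling the vectorizer and the classifier also makes it harder to fit the vocabulary on test data by accident.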

**1.11) Create TF-IDF features for each review in the development dataset**

In [177]:
#code here
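vector2=TfidfVectorizer()

# Note: fitting on a single review means the IDF is computed from one document,
# so every term gets the same IDF and each row has its own vocabulary.
# The model in 1.12 refits `vector2` on all development reviews at once.
TF=[0] * len(dev_text)
for i in range(len(dev_text)):
  a=[dev_text.iloc[i,0]]
  TF[i]=vector2.fit_transform(a)

dev_text['TF']=TF
dev_text.head()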

| | review | BOW | TF |
|---|---|---|---|
| 582 | note may consid portion follow text spoiler fo... | `(0, 210)\t1\n (0, 189)\t1\n (0, 62)\t2\n ...` | `(0, 37)\t0.020302737602678898\n (0, 6)\t0.0...` |
| 159 | last two film shine snow fall cedar australian... | `(0, 149)\t1\n (0, 289)\t2\n (0, 100)\t5\n ...` | `(0, 83)\t0.038604571824109146\n (0, 280)\t0...` |
| 1827 | best thing lake placid 80 minut long glad wast... | `(0, 11)\t2\n (0, 105)\t1\n (0, 56)\t4\n (...` | `(0, 110)\t0.06868028197434452\n (0, 72)\t0....` |
| 318 | year initi releas scream horror send veteran h... | `(0, 297)\t1\n (0, 133)\t1\n (0, 205)\t2\n ...` | `(0, 248)\t0.03059950306810523\n (0, 170)\t0...` |
| 708 | upon time solitari ogr name shrek mike myer re... | `(0, 296)\t1\n (0, 280)\t1\n (0, 248)\t2\n ...` | `(0, 55)\t0.03454442679267334\n (0, 262)\t0....` |


**1.12) Train the Logistic Regression model on the development dataset with TF-IDF features**

In [182]:
#code here
dev_x_TF=vector2.fit_transform(list(dev_text['review']))
test_x_TF=vector2.transform(list(test_text['review']))

In [183]:
lr=LogisticRegression().fit(dev_x_TF, dev_y)

In [184]:
lr.score(test_x_TF, test_y)

0.8325

**1.13) Compare the performance of the two models on the test dataset. Explain the difference in the results obtained.**

In [None]:
#code here
# code in 1.10 and 1.12

The Logistic Regression model trained on bag-of-words features reaches a test accuracy of 0.83, and the model trained on TF-IDF features reaches 0.8325. With 400 test reviews (20% of 2,000), that gap of 0.0025 corresponds to a single review (333 versus 332 classified correctly), so the two models perform essentially the same on this split and the ordering could change with a different split. One plausible reason TF-IDF does at least as well is that it gives more weight to terms that are frequent within a review but rare across the corpus, and less weight to terms that appear in most reviews, which makes individual reviews easier to tell apart. Note also that the bag-of-words model stopped at the solver's iteration limit (see the warning in 1.10), so its score comes from a fit that had not fully converged.
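
To make the weighting described above concrete, the sketch below computes TF-IDF by hand for three toy reviews. It uses raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which are the defaults documented for scikit-learn's `TfidfVectorizer`. The toy corpus and the names `doc_freq`, `idf`, and `tfidf` are assumptions for illustration only.

```python
import math
from collections import Counter

docs = ["good film good cast", "bad film", "film plot"]
n_docs = len(docs)
tokenized = [doc.split() for doc in docs]

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    """Term count times IDF, scaled so the vector has unit L2 norm."""
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

weights = tfidf(tokenized[1])  # the review "bad film"
print(weights)
```

Here `film` occurs in every toy review and gets the minimum IDF of 1, so `bad`, which occurs in only one review, receives the larger weight in the second review.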