# Assignment 4 optional
author: Dominika Maciąg

# IMDB Dataset of 50K Movie Reviews
## About the dataset that I am using:
https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download
    
IMDb is an online database that contains information related to films, television series, video games, 
and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, 
and fan and critical reviews. 

The dataset that I am using contains 50 000 reviews of movies which are positive and negative.


In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

def remove_punctuation(text):
    import string
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

df = pd.read_csv('IMDB Dataset.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


# Exercise 1 
Let's prepare our data

In [42]:
# Replace all missing (nan) revies with empty "" string.
# To aply given function we need firstly to remove all the NaN from "review" row in our dataframe.

df["review"].replace(np.nan, "", inplace=True)

#short test:
df["review"][38] == df["review"][38]

True

In [43]:
# We also need to remove "br" from reviews since it was used to create new line in review on the IMDB website.
df["review"] = df["review"].str.replace("<br />", '')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming tec...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [44]:
# Remove punctuation from reviews using the given function.
df["review"] = df["review"].apply(remove_punctuation)
print(df["review"][4]) #testing if we removed punctuation from reviews

Petter Matteis Love in the Time of Money is a visually stunning film to watch Mr Mattei offers us a vivid portrait about human relations This is a movie that seems to be telling us what money power and success do to people in the different situations we encounter This being a variation on the Arthur Schnitzlers play about the same theme the director transfers the action to the present time New York where all these different characters meet and connect Each one is connected in one way or another to the next person but no one seems to know the previous point of contact Stylishly the film has a sophisticated luxurious look We are taken to see how these people live and the world they live in their own habitatThe only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits A big city is not exactly the best place in which human relations find sincere fulfillment as one discerns is the case with most of the people we encounterThe acting is

In [45]:
# d)  Set all positive ratings to 1 and negative to -1.
df["sentiment"].replace('positive', 1, inplace = True)
df["sentiment"].replace('negative', -1, inplace = True)
df.head() # let's test if we replaced ratings successfully

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production The filming tech...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically theres a family where a little boy J...,-1
4,Petter Matteis Love in the Time of Money is a ...,1


In [46]:
# In exercise 3 we can see that as most popular words we get numbers so we will remove all the numbers from our reviews
df["review"] = df["review"].str.replace('\d+', '', regex=True)

# Exercise 2
a) Split dataset into training and test sets. \
b) Transform reviews into vectors using CountVectorizer.

In [47]:
# a)
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, train_size=0.8, test_size=0.2)

In [48]:
# b)
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(list(train["review"]))
x_test = vectorizer.transform(list(test["review"]))

# Summary:

# Exercise 3
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were). \
b) Print 10 most positive and 10 most negative words.

In [49]:
#a)
y_train = train["sentiment"]

model = LogisticRegression(solver='lbfgs', max_iter=20000)
model.fit(x_train, y_train)

In [50]:
#b)
# Get the indices of the sorted cofficients
ascending = np.argsort(model.coef_.flatten()) 
descending = ascending[::-1]

vocab = np.array(vectorizer.get_feature_names_out())

# Print most positive words
print("Most positive words: ", end="")
for i in range(10):
    print(vocab[descending[i]], end=", ")
print("\n")

# Print most negative words
print("Most negative words: ", end="")
for i in range(10):
    print(vocab[ascending[i]], end=", ")
print("\n")

Most positive words: refreshing, finest, erotic, disappoint, perfect, excellent, penny, funniest, superb, wonderfully, 

Most negative words: waste, disappointment, worst, forgettable, mstk, awful, wasting, boring, poorly, fails, 



# Exercise 4
a) Predict the sentiment of test data reviews. \
b) Predict the sentiment of test data reviews in terms of probability. \
c) Find five most positive and most negative reviews. \
d) Calculate the accuracy of predictions.

In [51]:
# a)
model.predict(x_test)

array([-1, -1, -1, ...,  1,  1,  1], dtype=int64)

In [52]:
# b)
pred = model.predict_proba(x_test)
print(pred)

[[9.99168374e-01 8.31626484e-04]
 [9.14657223e-01 8.53427765e-02]
 [9.99999999e-01 7.64490298e-10]
 ...
 [2.73958293e-02 9.72604171e-01]
 [1.90502878e-03 9.98094971e-01]
 [1.28552185e-02 9.87144781e-01]]


In [53]:
# c) Find five most positive and most negative reviews.
ascendingproba = np.argsort(pred[:,[0]].flatten()) 

#     Positive reviews
print("Five most positive reviews: ")
for i in range (5):
    print(i+1, ")")
    print(list(train["review"])[ascendingproba[i]])

print()
print()
print("Five most negative reviews: ")
#     Negative reviews
descendingproba = ascendingproba[::-1]
for i in range (5):
    print(i+1, ")")
    print(list(train["review"])[descendingproba[i]])

Five most positive reviews: 
1 )
I am not a very good writer so Ill keep this short World at War is the best WWII documentary that Ive seen Ive seen different WWII documentaries not only EnglishNorth American and this documentary seems to be the most complete WWII documentary that Ive seen I think it could talk a bit more about the Great Depression and whyhow Hitler got to power but it does a very good job at covering the war It seems to be complete and objectivefair to everyone It does not exaggerate or diminish roles of different nations It has a lot of original footage including color footage and many eye witnesses it was made in s when a lot more were alive It has great music and narrator AllinAll I gave this one  because its that good I havent seen specials in DVD version so I cannot comment on those
2 )
This is a truly wonderful love story I liked the songs however even if you do not you have to love the story Peter OToole is at his best and Petula Clark is doing fine as well I f

In [55]:
# d) Calculate the accuracy of predictions.
y_test = test["sentiment"]
model.score(x_test, y_test)

0.8896

# Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.

a) Redo exercises 2-5 using limited dictionary.
b) Check the impact of all the words from the dictionary.
c) Compare accuracy of predictions and the time of evaluation.

In [56]:
# 5a)
words = ["refreshing","finest", "disappoint", "perfect", "excellent", "penny", "funniest", "superb", "wonderfully", "waste", "disappointment", "worst", "forgettable", "mstk", "awful", "wasting", "boring", "poorly", "fails","erotic"]

# Excercise 2
# Transform reviews into vectors using CountVectorizer. We limiting dictionary of CountVectorizer
vectorizer_dict = CountVectorizer(vocabulary=words)
x_train_dict = vectorizer_dict.fit_transform(list(train["review"]))
x_test_dict = vectorizer_dict.transform(list(test["review"]))

# Excercise 3
# Train LogisticRegression model on training data
y_train_dict = train["sentiment"]
model_dict = LogisticRegression(solver='lbfgs', max_iter=20000)
model_dict.fit(x_train_dict, y_train_dict)