# IMDB Reviews Analysis

**Dilruba Benzerler 090190309 benzerler19@itu.edu.tr**

**Kaan Kaymaz 090180333 kaymaz18@itu.edu.tr**

### Dataset Link and Explanation

This [dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/code) contains reviews of films collected from IMDB. After preprocessing, it has two columns, "text" and "label": "text" contains the text of each film review, and "label" contains the sentiment of the review, encoded as 0 (negative) or 1 (positive). We use the first 20000 rows of reviews.

In [1]:
import opendatasets as od
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
od.download("https://www.kaggle.com/datasets/columbine/imdb-dataset-sentiment-analysis-in-csv-format/download?datasetVersionNumber=1")

Skipping, found downloaded files in "./imdb-dataset-sentiment-analysis-in-csv-format" (use force=True to force download)


In [3]:
data = pd.read_csv("IMDB Dataset.csv")

In [4]:
df = pd.get_dummies(data, columns = ['sentiment'])
df.drop("sentiment_negative",axis=1,inplace=True)

In [5]:
df.rename(columns={"review":"text","sentiment_positive":"label"},inplace=True)

In [6]:
# Keep the first 20000 reviews; copy so that later column assignments do not act on a slice
df = df[0:20000].copy()

In [7]:
# Strip HTML tags (such as <br />) from each review
df["text"] = [BeautifulSoup(i, "html.parser").get_text() for i in df["text"]]



In [8]:
df.head()

| | text | label |
|:--|:--|:--|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. The filming tec... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


In [9]:
df.shape

(20000, 2)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    20000 non-null  object
 1   label   20000 non-null  uint8 
dtypes: object(1), uint8(1)
memory usage: 175.9+ KB


**Our data consists of 20000 rows and 2 columns: a "text" column with the movie reviews and a "label" column indicating whether the sentiment of each review is positive or negative.**

In [11]:
df.iloc[0,0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wo

**The "text" column contains reviews of various films on IMDB.com.**

**The data is well organized in a neat tabular format, so it can be considered structured, although the review text itself is free-form. The only preparation needed is to convert the text data to numeric data.**

***To make our text data usable, some Natural Language Processing (NLP) operations will be performed. These operations clean the text in several steps. First, the texts contain special characters, that is, characters that are neither alphabetic nor numeric, such as punctuation marks and other symbols. To remove these we will use functions from the "re" library. The next step is tokenization, which splits sentences into individual words. Then we will remove stopwords such as "a", "and", and "the". The following step is lemmatization, which groups together the different forms of the same word. The last step is vectorization, which converts text data to numeric data. With this step, our previously unusable text data becomes usable in numeric form. The tokenization, stopword removal, and lemmatization functions that we will use are provided by the "nltk" library.***
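
The steps described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the notebook's actual code: the `STOPWORDS` set and `LEMMAS` table are tiny hypothetical stand-ins for the stopword list and lemmatizer that "nltk" would provide, and `preprocess` and `vectorize` are names invented for this example.

```python
import re
from collections import Counter

# Toy stand-ins (assumptions): a real run would use nltk's stopword list
# and lemmatizer instead of these hand-written tables.
STOPWORDS = {"a", "an", "and", "the", "is", "was", "of", "to", "it"}
LEMMAS = {"movies": "movie", "films": "film", "loved": "love", "scenes": "scene"}

def preprocess(text):
    """Clean, tokenize, remove stopwords, and lemmatize one review."""
    # Replace every character that is not a letter, digit, or whitespace
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    tokens = text.split()  # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [LEMMAS.get(t, t) for t in tokens]

def vectorize(token_lists):
    """Build a bag-of-words count matrix: one row per review, one column per word."""
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

reviews = ["I loved the movies!", "The scenes, and the film..."]
vocabulary, matrix = vectorize([preprocess(r) for r in reviews])
print(vocabulary)  # ['film', 'i', 'love', 'movie', 'scene']
print(matrix)      # [[0, 1, 1, 1, 0], [1, 0, 0, 0, 1]]
```

Each row of `matrix` plays the same role as one row of a vectorized dataframe: a count of how often each vocabulary word occurs in that review.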

**Below is the final numerized version of the data.**

In [16]:
text_vectorized

Unnamed: 0,aa,aaa,aaaaaaahhhhhhggg,aaaaah,aaaaargh,aaaaarrrrrrgggggghhhhhh,aaaaaw,aaaahhhhhhh,aaaand,aaaggghhhhhhh,...,zwick,zy,zzzz,zzzzip,zzzzzzzzz,zzzzzzzzzzzz,zzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**The text_vectorized dataframe will be used as the independent variables and the df["label"] column as the dependent variable when fitting the models. The algorithms specified in the Methods and Algorithms section will be applied and their accuracy will be calculated.**

### Objectives of the Project

Text data is difficult to work with directly, so it should be converted to a numeric form that algorithms can process. Hence,

   *1. The first aim is to convert text data to numeric data.*

The label column of the dataset contains 0 if a review expresses negative sentiment and 1 if it expresses positive sentiment. Therefore,

   *2. The second aim is to check the accuracy of this column.*

When the data was examined, it was understood that the reviews could be grouped under certain topics. Therefore,

   *3. Our third aim is to identify the topics that are mainly mentioned in the reviews and to classify the reviews under these topics.*
   
   *4. Our fourth aim is to create a new label column using sentiment analysis, independent of the label column given in the data.*

### Methods and Algorithms 

To achieve the first aim, we will use the "nltk", "re" and "spacy" libraries. Using these libraries, we will remove special characters and stopwords and split sentences into words. In the last step, we will convert these words into numeric data.

To achieve the second aim, Logistic Regression, Naive Bayes, Decision Tree and Random Forest algorithms will be trained to predict the "label" column from the vectorized text. The accuracy obtained from these models will be used to judge whether the "label" column is correct.
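
A comparison of these four classifiers could look roughly like the sketch below. It assumes scikit-learn, which the proposal does not name explicitly, and uses a handful of made-up reviews in place of df["text"] and df["label"]; the split ratio and model settings are illustrative choices only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for df["text"] and df["label"]
texts = [
    "a wonderful and moving film", "great acting and a wonderful story",
    "i loved this brilliant movie", "a brilliant and moving story",
    "a boring and terrible film", "terrible acting and a dull story",
    "i hated this boring movie", "a dull and awful story",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag-of-words features, then a stratified train/test split
X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each model and record its accuracy on the held-out reviews
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

On the real data, a consistently high held-out accuracy across models would suggest that the labels are learnable from the text, while reviews that every model misclassifies would be natural candidates for manual inspection.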

To achieve the third aim, we need to create clusters. The K-Means algorithm and LDA (Latent Dirichlet Allocation) will be used to generate them.

To achieve the fourth aim, a Pipeline will be created to clean, tokenize, vectorize (using "CountVectorizer"), and classify the reviews. An SVM algorithm will be run through this Pipeline and new labels will be created.
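
One possible shape for such a Pipeline is sketched below, assuming scikit-learn's `Pipeline`, `CountVectorizer`, and `LinearSVC` (a linear SVM, chosen here only for illustration). The training texts and labels are made up, and the sketch assumes the pipeline is trained on a labelled subset before it assigns new labels to other reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical labelled subset used to train the sentiment pipeline
train_texts = [
    "a wonderful and moving film", "i loved this brilliant movie",
    "a boring and terrible film", "i hated this dull movie",
]
train_labels = [1, 1, 0, 0]

# CountVectorizer lowercases, tokenizes, drops English stopwords, and counts words;
# LinearSVC then classifies the resulting count vectors.
sentiment_pipeline = Pipeline([
    ("vectorize", CountVectorizer(lowercase=True, stop_words="english")),
    ("classify", LinearSVC()),
])
sentiment_pipeline.fit(train_texts, train_labels)

# Create new labels for reviews the pipeline has not seen
new_reviews = ["what a brilliant film", "such a dull and boring story"]
new_labels = sentiment_pipeline.predict(new_reviews)
print(list(new_labels))
```

Bundling the vectorizer and the classifier in one Pipeline means raw review text can be passed directly to `fit` and `predict`, so the same preprocessing is applied at training and labelling time.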

### Task Distribution 

The whole project will largely be carried out together. However, Dilruba will work mainly on the first and second aims, while Kaan will work mainly on the third and fourth aims. Dilruba will therefore research the "nltk" and "spacy" libraries, and Kaan will explore how to do topic modeling with "LDA". Which algorithms will be used in the project, and how they will be used, will be discussed and decided together.

### Calendar
| **Task**                                                  | **Deadline**| 
|:---------------------|:------------------|
|   Finding and examining the dataset                       |   Nov 14  |
|   Planning the whole project and completing the proposal  |   Dec 2     |
|   Executing first two aims of the project                               |   Dec 9     | 
|   Executing last two aims of the project                                |   Dec 16    | 
|   Checking/comparing results                              |   Dec 23    | 
|   Project Delivery                                        |   Dec 30    | 

Deadlines and tasks may change or be extended as the project progresses.

# Atabey's notes

You must provide some samples of the dataset(s) and explain what they contain. What is the shape of the dataset? What features does it have? Is the data structured or unstructured? How are you planning on converting text into a usable dataset?

Your questions are too vague. How are you planning on testing the accuracy of sentiments attached to each review? How are you going to figure out topics? What do you mean by 'sentiment' and what do you mean by 'topic'? How are you going to model them? What type of ML models are there that would help you answer this question?

This proposal is weak. It reads like something written in less than an hour. Also, the data is fairly well studied: there are close to 50 notebooks that studied this data on kaggle only. You must do something substantially different from them.