### **Data Preparation: Extract 50k Reviews from 1M**

In [None]:
import pandas as pd
from collections import Counter

The original data can be downloaded from https://www.kaggle.com/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset?select=rotten_tomatoes_critic_reviews.csv

Because the file is about 200MB and holds more than a million reviews, I keep it in Google Drive.

Mount your Google Drive:

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


We want to extract 50k balanced reviews (25k per class). The first 80k rows of the file are enough to cover this.

In [None]:
# Read only the first 80,000 rows instead of loading the whole file
df = pd.read_csv("drive/MyDrive/colab Data/NLPFinal/rotten_tomatoes_critic_reviews.csv", nrows=80000)
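Taking the first 80k rows works here because they turn out to hold more than 25k reviews of each class. When that is not known in advance, one option is to keep reading chunks until every class has reached its quota. The following is only a minimal sketch of that idea: the helper `collect_balanced`, its parameters, and the toy CSV are hypothetical and are not part of this notebook.

```python
import io
import pandas as pd

def collect_balanced(csv_source, labels, per_class, chunksize,
                     label_col="review_type", text_col="review_content"):
    """Read a CSV chunk by chunk, keeping the first `per_class` complete rows per label."""
    pieces = {label: [] for label in labels}   # label -> list of DataFrame pieces
    counts = {label: 0 for label in labels}    # label -> rows kept so far
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        chunk = chunk.dropna(subset=[label_col, text_col])
        for label in labels:
            need = per_class - counts[label]
            if need > 0:
                rows = chunk[chunk[label_col] == label].head(need)
                pieces[label].append(rows)
                counts[label] += len(rows)
        # Stop reading as soon as every class is full
        if all(counts[label] >= per_class for label in labels):
            break
    return pd.concat([p for label in labels for p in pieces[label]],
                     ignore_index=True)

# Toy stand-in for the real file (hypothetical data)
toy_csv = io.StringIO(
    "review_type,review_content\n"
    "Fresh,good\n"
    "Fresh,great\n"
    "Rotten,bad\n"
    "Fresh,fine\n"
    "Rotten,\n"
    "Rotten,dull\n"
    "Fresh,fun\n"
)
sample = collect_balanced(toy_csv, labels=("Fresh", "Rotten"),
                          per_class=2, chunksize=3)
print(sample)
```

The trade-off is a little more code in exchange for not having to guess how many rows are needed.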


In [None]:
df.head(3)

Unnamed: 0,rotten_tomatoes_link,critic_name,top_critic,publisher_name,review_type,review_score,review_date,review_content
0,m/0814255,Andrew L. Urban,False,Urban Cinefile,Fresh,,2010-02-06,A fantasy adventure that fuses Greek mythology...
1,m/0814255,Louise Keller,False,Urban Cinefile,Fresh,,2010-02-06,"Uma Thurman as Medusa, the gorgon with a coiff..."
2,m/0814255,,False,FILMINK (Australia),Fresh,,2010-02-09,With a top-notch cast and dazzling special eff...


We only need the review content and the review type (Fresh or Rotten)

In [None]:
# .copy() makes df an independent DataFrame rather than a slice of the original
df = df[['review_type','review_content']].copy()

Drop rows with missing values

In [None]:
# Count missing cells before and after dropping incomplete rows
print(df.isnull().sum().sum())
df = df.dropna()
print(df.isnull().sum().sum())

10051
0


Check the class balance of the reviews

In [None]:
Counter(df['review_type'])

Counter({'Fresh': 41175, 'Rotten': 28774})
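The counts above can also be turned into class shares and a quick check that each class can supply 25k rows. This is a small sketch that hard-codes the two counts from the output above rather than reading the real DataFrame; the variable names are illustrative.

```python
import pandas as pd

# Class counts reported by the Counter output above
counts = pd.Series({"Fresh": 41175, "Rotten": 28774})

per_class = 25000                            # target size per class
shares = counts / counts.sum()               # fraction of reviews in each class
enough = bool((counts >= per_class).all())   # True if every class has at least 25k rows

print(shares.round(3))
print("enough rows per class:", enough)
```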

Rename the columns

In [None]:
# Assign the result instead of renaming in place
df = df.rename(columns={"review_type": "target", "review_content": "content"})
df.head(2)


Unnamed: 0,target,content
0,Fresh,A fantasy adventure that fuses Greek mythology...
1,Fresh,"Uma Thurman as Medusa, the gorgon with a coiff..."


Select the first 25k reviews of each class

In [None]:
pos = df[df['target'] == 'Fresh']
neg = df[df['target'] == 'Rotten']
pos=pos[:25000]
neg=neg[:25000]
print(pos.shape)
print(neg.shape)

(25000, 2)
(25000, 2)
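The same per-class selection can be written in one step with `groupby(...).head(n)`, which keeps the first `n` rows of each group. A minimal sketch on toy data (the frame `toy` and its contents are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "target": ["Fresh", "Rotten", "Fresh", "Fresh", "Rotten", "Rotten"],
    "content": ["a", "b", "c", "d", "e", "f"],
})

per_class = 2
# First `per_class` rows of every class, in their original row order
balanced = toy.groupby("target").head(per_class)
print(balanced)
```

Unlike the two-slice approach, the result keeps the classes interleaved in their original order, which makes no difference once the rows are shuffled.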


Combine the two classes into one DataFrame

In [None]:
dfFinal=pd.concat([pos,neg],ignore_index=True)

Shuffle the data

In [None]:
print(dfFinal.shape)
print(Counter(dfFinal['target']))
dfFinal = dfFinal.sample(frac=1).reset_index(drop=True)
dfFinal.head(10)

(50000, 2)
Counter({'Fresh': 25000, 'Rotten': 25000})


Unnamed: 0,target,content
0,Rotten,"Compelling in fits and starts, actor-director ..."
1,Fresh,Quite simply one of the finest comic romances ...
2,Rotten,A psychological thriller that dangles over the...
3,Fresh,The General is something of a salute to Boorma...
4,Rotten,You'd think that a movie that opens with a gra...
5,Fresh,"That Ridley Scott guy, he directs things prett..."
6,Rotten,Anderson's novel solution to slow-moving stret...
7,Fresh,"... with its superb cast, its literate screenp..."
8,Fresh,Okay teen/romantic comedy with pleasant players.
9,Rotten,[The love triangle] plays more like canned hea...
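`sample(frac=1)` without a seed gives a different order on every run. If a repeatable shuffle is wanted, passing `random_state` is one way to get it. A small sketch on toy data; the seed value and the frame `toy` are arbitrary choices for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "target": ["Fresh", "Rotten", "Fresh", "Rotten", "Fresh", "Rotten"],
    "content": ["a", "b", "c", "d", "e", "f"],
})

SEED = 42  # arbitrary; any fixed integer makes the shuffle repeatable

shuffled_a = toy.sample(frac=1, random_state=SEED).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=SEED).reset_index(drop=True)

# Same seed, same order; the rows themselves are unchanged
print(shuffled_a.equals(shuffled_b))
```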


Convert the target to a binary sentiment label: 'Rotten' becomes 0 and 'Fresh' becomes 1

In [None]:
def to_01Sentiment(target):
    # 'Rotten' maps to 0; any other label ('Fresh' here) maps to 1
    if target == 'Rotten':
        return 0
    else:
        return 1

dfFinal['sentiment'] = dfFinal.target.apply(to_01Sentiment)
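An alternative to the helper function is an explicit dictionary with `Series.map`. The sketch below uses toy data, and `LABEL_TO_SENTIMENT` is a name introduced here for illustration. One difference from the function above: a label missing from the dictionary becomes `NaN` instead of silently mapping to 1, which makes unexpected values easier to spot.

```python
import pandas as pd

LABEL_TO_SENTIMENT = {"Rotten": 0, "Fresh": 1}

toy = pd.DataFrame({"target": ["Fresh", "Rotten", "Fresh"]})
toy["sentiment"] = toy["target"].map(LABEL_TO_SENTIMENT)
print(toy)

# A label that is not in the dictionary ends up as NaN
odd = pd.Series(["Fresh", "Unknown"]).map(LABEL_TO_SENTIMENT)
print(odd.isna().sum(), "unmapped label(s)")
```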



In [None]:
dfFinal.drop(columns=['target'],inplace=True)
dfFinal.head(10)

Unnamed: 0,content,sentiment
0,"Compelling in fits and starts, actor-director ...",0
1,Quite simply one of the finest comic romances ...,1
2,A psychological thriller that dangles over the...,0
3,The General is something of a salute to Boorma...,1
4,You'd think that a movie that opens with a gra...,0
5,"That Ridley Scott guy, he directs things prett...",1
6,Anderson's novel solution to slow-moving stret...,0
7,"... with its superb cast, its literate screenp...",1
8,Okay teen/romantic comedy with pleasant players.,1
9,[The love triangle] plays more like canned hea...,0


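The longest review is 257 characters. This is measured on `df`, the full cleaned set of about 70k reviews, so it is an upper bound for the 50k sample in `dfFinal`.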

In [None]:
max(df.content.apply(len))

257

Save the 50k reviews to a CSV file for later use

In [None]:
dfFinal.to_csv('movie_review_RT50K.csv',index=False)

### **Load Data (run this the next time you need the data)**

In [None]:
import pandas as pd
from collections import Counter

In [None]:
df=pd.read_csv('movie_review_RT50K.csv')
df.head(10)

Unnamed: 0,content,sentiment
0,"Compelling in fits and starts, actor-director ...",0
1,Quite simply one of the finest comic romances ...,1
2,A psychological thriller that dangles over the...,0
3,The General is something of a salute to Boorma...,1
4,You'd think that a movie that opens with a gra...,0
5,"That Ridley Scott guy, he directs things prett...",1
6,Anderson's novel solution to slow-moving stret...,0
7,"... with its superb cast, its literate screenp...",1
8,Okay teen/romantic comedy with pleasant players.,1
9,[The love triangle] plays more like canned hea...,0


In [None]:
Counter(df['sentiment'])

Counter({0: 25000, 1: 25000})
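Beyond counting the classes, a few assertions can confirm that the file survived the round trip through CSV. The following is a sketch only: `check_dataset` is a hypothetical helper, and an in-memory buffer with toy rows stands in for `movie_review_RT50K.csv`.

```python
import io
import pandas as pd

def check_dataset(frame, per_class):
    """Raise AssertionError if the frame is not a clean, balanced dataset."""
    assert list(frame.columns) == ["content", "sentiment"]
    assert frame["content"].notna().all()              # no review lost to NaN
    assert set(frame["sentiment"].tolist()) == {0, 1}  # only the two labels
    assert (frame["sentiment"].value_counts() == per_class).all()

toy = pd.DataFrame({
    "content": ["good", "bad", "fine", "dull"],
    "sentiment": [1, 0, 1, 0],
})

# Write and re-read through an in-memory buffer instead of a real file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

check_dataset(reloaded, per_class=2)
print("dataset looks consistent")
```

The `notna` check matters because `read_csv` by default treats strings such as "NA" or "null" as missing values, so a review consisting of only such a token would come back as `NaN`.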