# _*Exploration and Preparation*_

In [1]:
import textblob as tb
from wordcloud import WordCloud
import re 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 

### Description

##### The core dataset contains 50,000 IMDB movie reviews, split evenly into 25k train and 25k test sets. The overall label distribution is balanced (25k positive and 25k negative).
##### In the entire collection, no more than 30 reviews are allowed for any given movie, because reviews of the same movie tend to have correlated ratings. Further, the train and test sets contain disjoint sets of movies, so no significant performance gain can be obtained by memorizing movie-unique terms and their association with observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus, reviews with more neutral ratings are not included in the train/test sets.
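
The labeling rule described above can be sketched as a small function. This is only an illustration of the two thresholds: `score_to_label` is a hypothetical helper, and the CSV used in this notebook already ships with a `Sentiment` column, so the function is not needed to reproduce the results.

```python
from typing import Optional


def score_to_label(score: int) -> Optional[str]:
    """Map a 1-10 rating to a sentiment label (hypothetical helper)."""
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    # Ratings of 5 or 6 are treated as neutral and excluded from the labeled sets.
    return None


labels = [score_to_label(s) for s in (2, 5, 9)]
print(labels)
```

Returning `None` for the middle ratings makes the excluded band explicit, so such rows can be filtered out before training.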

### Exploration 

In [47]:
train_data = pd.read_csv('Train_reviews.csv')
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24984 entries, 0 to 24983
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   24984 non-null  int64 
 1   Text_Review  24984 non-null  object
 2   Sentiment    24984 non-null  object
dtypes: int64(1), object(2)
memory usage: 585.7+ KB


In [48]:
train_data.head()

,Unnamed: 0,Text_Review,Sentiment
0,0,['Bromwell High is a cartoon comedy. It ran at...,positive
1,1,['Homelessness (or Houselessness as George Car...,positive
2,2,['Brilliant over-acting by Lesley Ann Warren. ...,positive
3,3,['This is easily the most underrated film inn ...,positive
4,4,['This is not the typical Mel Brooks film. It ...,positive


In [49]:
train_data = train_data.drop('Unnamed: 0',axis=1)
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24984 entries, 0 to 24983
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Text_Review  24984 non-null  object
 1   Sentiment    24984 non-null  object
dtypes: object(2)
memory usage: 390.5+ KB


In [50]:
a = train_data.sample(10)
a

,Text_Review,Sentiment
4815,['I am so upset that ABC is giving up on yet a...,positive
22927,"[""This movie was just plain bad. Just about ev...",negative
16267,"[""this film tries to be immensely clever, and ...",negative
23608,['I thought I read somewhere that this was the...,negative
15595,['Lorne Michaels once again proves that he has...,negative
14098,['Recap: Doctor Markov has developed a new the...,negative
13208,"[""The finale of the Weissmuller Tarzan movies ...",negative
5070,['This movie of 370 minutes was aired by the I...,positive
10964,"['""Der Todesking""-Jorg Buttgereit\'s second fu...",positive
22749,"[""'Anita and Me' is a drama about growing up i...",negative


In [51]:
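# Print each sampled review in full, numbered from 1, with its label.
for j, (review, sentiment) in enumerate(zip(a.Text_Review, a.Sentiment), start=1):
    print(f'{j}) {review} - {sentiment}')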

1) ['I am so upset that ABC is giving up on yet another show that has the chance to be a real winner. This show is so good, the writing and storyline were great, an actual original idea for a show instead of another boring reality show. The casting was spectacular! Not only were the characters and actors right on, but these are a very talented set of actors. The concept and idea is really a new and cool idea for a TV show, many of us share this whole idea of "connections". I really love the characters of Steven, Laura, Whitney and Damien. But to be honest there is not one person connected to this show that I did not like, even those that only were in for a few episodes (Sheri Appleby for example). The acting and characters are so easy to like and so talented!!. I wish ABC had given this show more of a chance, and not interrupted the show midway, Also it was not advertised enough. Truly unfair!! to everyone!!. This show showed great promise. I for one will let ABC know how I feel and wi

###  Observation for Cleaning 
##### The only artifacts seen in the ten random samples that need cleaning are the `<br />` tags, the surrounding `[ ]` brackets, and the wrapping quotes (`' '` or `" "`).
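
Ten samples are a small basis for this observation, so it may be worth counting the artifacts over the whole column. The following is a minimal sketch under assumptions: the pattern names and `count_artifacts` are illustrative and do not come from the notebook.

```python
import re

# Patterns for the artifacts noted above; the names are illustrative.
ARTIFACTS = {
    "br_tag": re.compile(r"<br\s*/?>"),
    "list_open": re.compile(r"^\[['\"]"),
    "list_close": re.compile(r"['\"]\]$"),
}


def count_artifacts(reviews):
    """Return how many reviews contain each artifact at least once."""
    counts = {name: 0 for name in ARTIFACTS}
    for text in reviews:
        for name, pattern in ARTIFACTS.items():
            if pattern.search(text):
                counts[name] += 1
    return counts


sample = [
    "['Great film.<br /><br />Loved it.']",
    '["It\'s fine."]',
    "No artifacts here.",
]
print(count_artifacts(sample))
```

The function accepts any iterable of strings, so `count_artifacts(train_data['Text_Review'])` should work on the real column as well.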

### Cleaning

In [52]:
def clean_txt(text):
    """Remove HTML line breaks and the list-literal wrapper from a review."""
    # Replace each <br /> with a space so adjacent sentences are not fused.
    text = re.sub(r'<br />', ' ', text)
    # Strip the leftover list brackets and quotes: ["..."] or ['...'].
    text = text.replace('["', "")
    text = text.replace('"]', "")
    text = text.replace("['", "")
    text = text.replace("']", "")
    return text
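
The raw reviews look like the string form of a single-element Python list, so one possible alternative to chained `replace` calls is to parse each string with `ast.literal_eval`. The sketch below is not the notebook's method: `unwrap_review` and `BR_RUN` are hypothetical names, and the approach assumes most rows are valid list literals, falling back to the raw string when they are not.

```python
import ast
import re

# One or more consecutive <br /> tags, with optional whitespace between them.
BR_RUN = re.compile(r"(?:<br\s*/?>\s*)+")


def unwrap_review(raw):
    """Hypothetical alternative to clean_txt: parse the list literal, then drop <br /> tags."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        parsed = raw  # not a valid Python literal: keep the text as-is
    if isinstance(parsed, list):
        text = " ".join(str(part) for part in parsed)
    else:
        text = raw
    # Collapse each run of <br /> tags into a single space.
    return BR_RUN.sub(" ", text).strip()


print(unwrap_review("['Great film.<br /><br />Loved it.']"))
print(unwrap_review("Plain text stays unchanged."))
```

Parsing the literal also resolves escaped quotes inside a review, which plain string replacement leaves untouched.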

In [53]:
train_data['Text_Review'] = train_data['Text_Review'].apply(clean_txt)
a = train_data.sample(10)

In [54]:
a

,Text_Review,Sentiment
24150,"Disappointing, predictable film in which a wom...",negative
7223,"""What is love? What is this longing in our hea...",positive
9934,Vincent Price's follow-up to HOUSE OF WAX (195...,positive
15749,If you go see this movie you'll be holding a g...,negative
19174,"I wanted to like this film, yes its a SAW, bla...",negative
4151,this is one of the funniest shows i have ever ...,positive
6655,Anyone who had never seen anything like the fi...,positive
15776,"Every once in a while, a group of friends, wit...",negative
18290,I do agree with everything Calamine has said! ...,negative
17306,1. Aliens resemble plush toys and hand puppets...,negative


### Feature Extraction 