# IMDB Dataset Sentiment Analysis

**Step 1: Load Dataset from Directory**

We build the dataset from the ground up using the review text files in the directories. First, we read the files with Python tools.

In [1]:
import os
import pandas as pd
import numpy as np

data = []
ratings = []
label = []

# Reviews live in train/neg/ (label 0) and train/pos/ (label 1).
# The rating is the part of the file name between '_' and '.'.
for folder, sentiment in (('train/neg/', 0), ('train/pos/', 1)):
    for filename in os.listdir(folder):
        # Use a context manager so every file handle is closed after reading.
        with open(folder + filename, 'r', encoding='utf-8') as f:
            data.append(f.read())
        ratings.append(int(filename.split('_')[1].split('.')[0]))
        label.append(sentiment)

dfdict = {"data": data, "ratings": ratings, "label": label}

df = pd.DataFrame.from_dict(dfdict)
df

Unnamed: 0,data,ratings,label
0,"The name ""cult movie"" is often given to films ...",4,0
1,Director Ron Atkins is certifiably insane. Thi...,1,0
2,Laughed a lot - because it is so incredibly ba...,1,0
3,"Follow-up to 1973's ""Walking Tall"" continues t...",1,0
4,Now isn't it? Considering all the good work do...,1,0
...,...,...,...
24995,John Thaw is a an excellent actor. I have to a...,10,1
24996,In watching how the two brothers interact and ...,10,1
24997,There's so many things to fall for in Aro Tolb...,9,1
24998,It all begins with a series of thefts of seemi...,7,1


The DataFrame now contains both the negative reviews (label 0) and the positive reviews (label 1), 25,000 rows in total. Each row holds the review text, the rating parsed from the file name, and the label.
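
The loading loop above takes the rating from each file name with two chained `split` calls. The same parsing can be sketched as a small helper, assuming file names of the form `<id>_<rating>.txt`. The helper name `parse_review_filename` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def parse_review_filename(name):
    """Split a file name like '<id>_<rating>.txt' into (id, rating).

    The '<id>_<rating>.txt' layout is an assumption inferred from the
    split('_') / split('.') calls in the loading loop.
    """
    stem = Path(name).stem                 # '123_7.txt' -> '123_7'
    review_id, rating = stem.split("_")    # '123_7' -> '123', '7'
    return int(review_id), int(rating)


print(parse_review_filename("123_7.txt"))  # (123, 7)
```

Unpacking into exactly two names makes the helper raise a `ValueError` on a file name that does not match the expected pattern, instead of silently reading the wrong field.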

In [2]:
df.describe()

Unnamed: 0,ratings,label
count,25000.0,25000.0
mean,5.47772,0.5
std,3.466477,0.50001
min,1.0,0.0
25%,2.0,0.0
50%,5.5,0.5
75%,9.0,1.0
max,10.0,1.0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   data     25000 non-null  object
 1   ratings  25000 non-null  int64 
 2   label    25000 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 586.1+ KB


Next we will tokenize the text to build the dataset and create two neural network models: one to predict the label (negative or positive) and one to predict the rating from the text.


**Step 2: Cleaning the Data**

This step is important: the data contains many HTML tags and a great many stop words. Both need to be removed before the text is tokenized.

In [4]:
df['data'] = df['data'].str.lower()
df.head()

Unnamed: 0,data,ratings,label
0,"the name ""cult movie"" is often given to films ...",4,0
1,director ron atkins is certifiably insane. thi...,1,0
2,laughed a lot - because it is so incredibly ba...,1,0
3,"follow-up to 1973's ""walking tall"" continues t...",1,0
4,now isn't it? considering all the good work do...,1,0


All text is now in lower case, so the same word written in different cases maps to a single token. Next we strip the HTML tags and then remove punctuation, so that only the words remain.

In [5]:
df['data'] = df['data'].str.replace(r'<[^<>]*>', '', regex=True)

In [6]:
df.iloc[0]['data']

'the name "cult movie" is often given to films which continue to be screened, or to sell in home movie format, more than a generation after they were first released. superchick, which was first released in 1973, now comes into this category. its cult status is largely due to ongoing interest in it by those women who regard it as an early and effective feminist film.despite the "superwoman" connotation, "superchick" is not a cartoon character but a very competent young lady working as an air stewardess - a career option which in the 1970\'s was commonly regarded as one of the most glamorous open to any girl, and which also enables her to emulate the traditional matelot who reputedly has a wife in every port. since she holds black belt status in karate, she is in a position to make it quite clear that she is very happy with her bachelor existence, and is in no way beholden to any of her extensive suite of male admirers. this film is a situation comedy which avoids the generally much shor
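
In the output above, `film.despite` runs together with no space. This is what happens when a tag that separated two sentences is replaced with an empty string. The sketch below assumes the raw review had `<br />` tags at that point (the sample string is made up for illustration) and compares that behaviour with replacing each tag by a space.

```python
import re

TAG_RE = re.compile(r"<[^<>]*>")   # same pattern as in the cell above
SPACES_RE = re.compile(r"\s+")


def strip_tags(text):
    """Replace each HTML tag with a space, then collapse repeated whitespace."""
    without_tags = TAG_RE.sub(" ", text)
    return SPACES_RE.sub(" ", without_tags).strip()


# Hypothetical sample, not taken verbatim from the dataset.
sample = "an early and effective feminist film.<br /><br />despite the connotation"

print(TAG_RE.sub("", sample))  # an early and effective feminist film.despite the connotation
print(strip_tags(sample))      # an early and effective feminist film. despite the connotation
```

With pandas, the same idea would be to pass `' '` instead of `''` as the replacement string. Once punctuation is removed, glued words like these end up as single vocabulary entries.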

In [7]:
# Remove punctuation: anything that is neither a word character nor whitespace.
df['data'] = df['data'].str.replace(r'[^\w\s]', '', regex=True)

# \w also matches digits and underscores, so remove those in a second pass.
df['data'] = df['data'].str.replace(r'[0-9_]', '', regex=True)

df['data']

0        the name cult movie is often given to films wh...
1        director ron atkins is certifiably insane this...
2        laughed a lot  because it is so incredibly bad...
3        followup to s walking tall continues the reall...
4        now isnt it considering all the good work done...
                               ...                        
24995    john thaw is a an excellent actor i have to ad...
24996    in watching how the two brothers interact and ...
24997    theres so many things to fall for in aro tolbu...
24998    it all begins with a series of thefts of seemi...
24999    probably my alltime favorite movie a story of ...
Name: data, Length: 25000, dtype: object

In [8]:
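from nltk.corpus import stopwords

# Requires the NLTK stop word corpus: nltk.download('stopwords')
stop = set(stopwords.words('english'))  # a set makes the membership test fast

df['data'] = df['data'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))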

Now the text has no capital letters, no punctuation and no stop words. We move on to the next step.
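
The order of the cleaning steps matters. Punctuation is removed before the stop words, so a contraction such as `isn't` becomes `isnt` and no longer matches a stop list that stores it with the apostrophe. The sketch below shows the effect with a tiny, hypothetical stop list (NLTK's English list is much longer). The function names are illustrative and not part of the notebook.

```python
import re

# Hypothetical stop list for illustration only.
STOP_WORDS = {"now", "it", "all", "the", "isn't"}


def remove_stop_words(text):
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def notebook_order(text):
    """Strip all punctuation first, then drop stop words (as in the cells above)."""
    no_punct = re.sub(r"[^\w\s]", "", text)
    return remove_stop_words(no_punct)


def apostrophe_aware(text):
    """Keep apostrophes until the stop words are gone, then strip them."""
    no_punct = re.sub(r"[^\w\s']", "", text)
    filtered = remove_stop_words(no_punct)
    return filtered.replace("'", "")


text = "now isn't it? considering all the good work"
print(notebook_order(text))    # isnt considering good work
print(apostrophe_aware(text))  # considering good work
```

Whether leftovers such as `isnt` matter depends on the model. Negations can carry sentiment, so keeping them may even be desirable.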

**Step 3: Tokenization and Vectorization of the Words**

Now we prepare the words for processing by turning them into numbers, a process known as vectorization.
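
To make "turning words into numbers" concrete, here is a minimal pure-Python sketch of count (bag-of-words) vectorization. It is not the notebook's implementation. In practice a library class such as scikit-learn's `CountVectorizer` would normally do this, and the `cv` object used further down is presumably one, although the cell that defines it is not shown. The function names `fit_vocabulary` and `transform` are hypothetical.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each distinct word to a column index, in alphabetical order."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    return {word: idx for idx, word in enumerate(vocab)}


def transform(docs, vocab):
    """Turn each document into a row of word counts, one column per vocabulary word."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts.get(word, 0) for word in vocab])
    return rows


docs = ["good movie good cast", "bad movie"]
vocab = fit_vocabulary(docs)
print(vocab)                   # {'bad': 0, 'cast': 1, 'good': 2, 'movie': 3}
print(transform(docs, vocab))  # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Each review becomes one row whose length equals the vocabulary size. With tens of thousands of distinct words, most entries are zero, which is why libraries store the result as a sparse matrix.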

In [9]:
df.head()

Unnamed: 0,data,ratings,label
0,name cult movie often given films continue scr...,4,0
1,director ron atkins certifiably insane ultralo...,1,0
2,laughed lot incredibly bad sorry folks definit...,1,0
3,followup walking tall continues reallife drama...,1,0
4,isnt considering good work done danzelclive jo...,1,0


In [21]:
import nltk

# Join the reviews with a space so that the last word of one review
# is not glued to the first word of the next.
words = ' '.join(df['data'])

tokens = nltk.word_tokenize(words)

# Printing every token exceeds the notebook's IOPub data rate limit
# (NotebookApp.iopub_data_rate_limit), so show only the count and a sample.
print(len(tokens))
print(tokens[:20])
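
When many reviews are concatenated into one string before tokenizing, the separator matters. The toy example below uses made-up reviews and plain whitespace splitting in place of `nltk.word_tokenize`, and shows how joining without a space fuses words across review boundaries.

```python
# Made-up reviews for illustration.
reviews = ["great movie", "terrible plot"]

glued = "".join(reviews)    # no separator between reviews
spaced = " ".join(reviews)  # one space between reviews

print(glued.split())   # ['great', 'movieterrible', 'plot']
print(spaced.split())  # ['great', 'movie', 'terrible', 'plot']
```

Fused tokens such as `movieterrible` inflate the vocabulary with words that occur only once.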



In [11]:
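# cv is presumably a fitted vectorizer; the cell that defines it is not shown here.
# In scikit-learn 1.2 and later, get_feature_names() is replaced by get_feature_names_out().
cv.get_feature_names()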



['aa',
 'aaa',
 'aaaaaaah',
 'aaaaah',
 'aaaaatchkah',
 'aaaahhhhhhh',
 'aaaand',
 'aaaarrgh',
 'aaah',
 'aaand',
 'aaargh',
 'aaaugh',
 'aachen',
 'aada',
 'aadha',
 'aadmittedly',
 'aag',
 'aage',
 'aaghh',
 'aah',
 'aahhh',
 'aaip',
 'aaja',
 'aakash',
 'aaker',
 'aakrosh',
 'aames',
 'aamess',
 'aamesthe',
 'aamir',
 'aan',
 'aankh',
 'aankhen',
 'aap',
 'aapke',
 'aapkey',
 'aardman',
 'aardmans',
 'aardvarks',
 'aargh',
 'aaron',
 'aarons',
 'aarp',
 'aarrrgh',
 'aasize',
 'aatish',
 'aauugghh',
 'aavjo',
 'aaww',
 'ab',
 'aback',
 'abahy',
 'abanazer',
 'abandon',
 'abandoned',
 'abandoning',
 'abandoningindian',
 'abandonment',
 'abandonmentshe',
 'abandonof',
 'abandons',
 'abanks',
 'abas',
 'abashed',
 'abashidze',
 'abatement',
 'abating',
 'abattoirs',
 'abba',
 'abbad',
 'abbas',
 'abbasi',
 'abbasmustan',
 'abbey',
 'abbeys',
 'abbeythe',
 'abbie',
 'abbot',
 'abbotcostello',
 'abbots',
 'abbott',
 'abbotts',
 'abbottwe',
 'abbreviated',
 'abbu',
 'abby',
 'abbys',
 'abb

The stuff of nightmares. The values are numerical, and for some reason all of them actually are numerical.