# Preparing the IMDB movie review data for text processing

## 1. Obtaining the movie review dataset

I downloaded the dataset from https://ai.stanford.edu/~amaas/data/sentiment/ and extracted the archive with WinRAR. The extraction can also be done in Python.
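As a rough sketch of the Python route, the standard-library `tarfile` module can unpack the archive. The archive name `aclImdb_v1.tar.gz`, the target directory, and the helper name `extract_archive` are assumptions here; adjust them to wherever the download was saved.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir="."):
    """Unpack a gzip-compressed tar archive into dest_dir."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dest_dir)


# Assumed file name of the downloaded archive; change it if yours differs.
archive = "aclImdb_v1.tar.gz"
if os.path.exists(archive):
    extract_archive(archive, dest_dir=".")
```

The `os.path.exists` guard only keeps the snippet from failing when the archive is not in the current working directory.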

## 2. Preprocessing the movie dataset into a more convenient format

A. I use pyprind to show a progress bar and the estimated time to completion for a long-running task, here reading the 50,000 review files.

B. Then I collect all the text files and store them in a pandas DataFrame.

In [9]:
import pyprind
import pandas as pd
import os
import sys

basepath = r'C:\Users\matin\OneDrive\Desktop\Code\NLP\NLP_Notes_Projects\Internet Movie Database (IMDb)-Sentiment Analysis\aclImdb_v1\aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000, stream=sys.stdout)

docs = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            docs.append([txt, labels[l]])
            pbar.update()

df = pd.DataFrame(docs, columns=['review', 'sentiment'])
print(df.head())


                                              review  sentiment
0  I went and saw this movie last night after bei...          1
1  Actor turned director Bill Paxton follows up h...          1
2  As a recreational golfer with some knowledge o...          1
3  I saw this film in a sneak preview, and it is ...          1
4  Bill Paxton has taken the true story of the 19...          1


Save the data to a CSV file:

In [10]:
df.to_csv("imdb_reviews.csv", index=False, encoding="utf-8")

In [13]:
df.shape

(50000, 2)

# Introducing the bag-of-words model

## 1. Transforming words into feature vectors

In [28]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet and one and one is two'])

# bag is a sparse document-term matrix: one row per sentence, one column per vocabulary index;
# each entry counts how many times that word occurs in that sentence.
bag = count.fit_transform(docs)

# vocabulary_ maps every unique word found across all sentences to its column index in bag.
vocabulary = count.vocabulary_



In [29]:
bag.toarray()

array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 1],
       [2, 3, 2, 1, 1, 1, 2, 1, 1]])

In [31]:
vocabulary

{'the': 6,
 'sun': 4,
 'is': 1,
 'shining': 3,
 'weather': 8,
 'sweet': 5,
 'and': 0,
 'one': 2,
 'two': 7}
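To make the link between the vocabulary and the columns of `bag` concrete, the same counts can be reproduced by hand. This is only an illustrative sketch, not how scikit-learn is implemented: it assumes lowercased tokens of two or more word characters (mirroring `CountVectorizer`'s default token pattern) and alphabetically ordered column indices. The helper name `bag_of_words` is made up for this example.

```python
import re
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count_rows) for a list of documents."""
    # Lowercase and keep tokens of at least two word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # Assign column indices to the unique words in alphabetical order.
    words = sorted({tok for toks in tokenized for tok in toks})
    vocabulary = {word: idx for idx, word in enumerate(words)}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in words])
    return vocabulary, rows


docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet and one and one is two']
vocabulary, rows = bag_of_words(docs)
for row in rows:
    print(row)
```

Column 1 holds the count for 'is', so the 3 in the last row says that 'is' occurs three times in the third sentence.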

## 2. Assessing word relevancy via term frequency-inverse document frequency

The overall idea of this method is to weight each word by how informative it is for a given sentence. For example, if we have 3 sentences and the word 'good' appears in all of them, the word does not help us distinguish the sentences from each other, so it receives a low weight. By contrast, a word such as 'girl' that appears in only one sentence receives a high tf-idf weight in that sentence. Again, we use sklearn: `TfidfTransformer` converts the raw counts produced by `CountVectorizer` into tf-idf values.

In [30]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True,
                         norm='l2',
                         smooth_idf=True)
np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]
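The numbers above can be reproduced by hand. The sketch below assumes the formulas that scikit-learn documents for `smooth_idf=True` and `norm='l2'`: idf(t) = ln((1 + n) / (1 + df(t))) + 1, tf-idf = tf × idf, followed by scaling each row to unit Euclidean length. The function name `tfidf_row` and the hard-coded counts (copied from the bag-of-words matrix) are for illustration only.

```python
import math

# Raw term counts, one row per sentence (the bag-of-words matrix from earlier).
counts = [[0, 1, 0, 1, 1, 0, 1, 0, 0],
          [0, 1, 0, 0, 0, 1, 1, 0, 1],
          [2, 3, 2, 1, 1, 1, 2, 1, 1]]

n_docs = len(counts)
# Document frequency: in how many sentences each term appears.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]
# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def tfidf_row(row):
    """Weight one row of counts by idf and L2-normalize it."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]


for row in counts:
    print([round(v, 2) for v in tfidf_row(row)])
```

In the third sentence 'is' has the highest raw count (3), yet it ends up with a lower weight than 'and' (count 2), because 'is' occurs in every sentence and therefore has the smallest idf.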


## 3. Cleaning text data

Next we have to clean the text, because the reviews contain HTML markup, punctuation, and other non-letter characters that should be removed. Emoticons are kept and moved to the end of the cleaned text.

In [38]:
import re

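def preprocessor(text):
    # Remove HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :-) ;( =D before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace runs of non-word characters with one space,
    # then append the emoticons with the '-' nose removed
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text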

preprocessor(df.loc[0, 'review'])


'i went and saw this movie last night after being coaxed to by a few friends of mine i ll admit that i was reluctant to see it because from what i knew of ashton kutcher he was only able to do comedy i was wrong kutcher played the character of jake fischer very well and kevin costner played ben randall with such professionalism the sign of a good movie is that it can toy with our emotions this one did exactly that the entire theater which was sold out was overcome by laughter during the first half of the movie and were moved to tears during the second half while exiting the theater i not only saw many women in tears but many full grown men as well trying desperately not to let anyone see them crying this movie was great and i suggest that you go see it before you judge '

In [39]:
df['review'] = df['review'].apply(preprocessor)
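A short made-up string containing an HTML tag and a few emoticons makes each cleaning step easier to see. The block repeats the `preprocessor` definition only so that it runs on its own; the input string is invented for the example.

```python
import re


def preprocessor(text):
    # Remove HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word characters, append emoticons without '-'
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text


sample = '</a>This :) is :( a test :-)!'
print(preprocessor(sample))
# → this is a test :) :( :)
```

The tag and the exclamation mark are gone, the text is lowercased, and the three emoticons appear at the end with `:-)` normalized to `:)`.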

## 4. Processing documents into tokens

Now we need a tokenizer to split the documents into individual words.
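A minimal sketch of such a tokenizer, assuming the text has already been cleaned so that splitting on whitespace is enough (the function name `tokenizer` and the sample sentence are chosen for this example):

```python
def tokenizer(text):
    """Split a cleaned document into tokens on whitespace."""
    return text.split()


print(tokenizer('runners like running and thus they run'))
# → ['runners', 'like', 'running', 'and', 'thus', 'they', 'run']
```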

Another useful technique is **word stemming**, the process of transforming a word into its root form. The stemming tokenizer defined here can later be used in our pipeline.

In [40]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

def tokenizer_porter(text):
    # Stem each individual word, not the whole text
    return [porter.stem(word) for word in text.split()]
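As a quick check of a stemming tokenizer, the sketch below runs one on a short made-up sentence. It assumes `nltk` is installed and redefines the function so that the block runs on its own.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()


def tokenizer_porter(text):
    """Split on whitespace and reduce every token to its Porter stem."""
    return [porter.stem(word) for word in text.split()]


stems = tokenizer_porter('runners like running and thus they run')
print(stems)
```

Here 'runners' is reduced to 'runner' and 'running' to 'run', so different inflections of the same word map to a shared token. Stems produced this way are not guaranteed to be real words.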

# Training a logistic regression model for document classification