# Data description & Problem statement: 
The IMDB movie reviews dataset is a set of 50,000 labeled reviews, half of which are positive and the other half negative. It is widely used as a sentiment analysis benchmark, which makes it a convenient way to compare our own results against existing models. The archive also ships 50,000 additional unlabeled reviews, which is the subset used below for topic modeling. The dataset is available online and can be downloaded directly from Stanford’s website.

# Workflow:
- Load the dataset
- Data cleaning (e.g. remove formatting and punctuation)
- Text vectorization, using "Bag of Words" technique
- Use "Latent Dirichlet Allocation" for document clustering (i.e. topic modeling)
- Determine, sort and print most important words/features for each topic

In [1]:
import sklearn
import numpy as np
import scipy as sc
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
from sklearn import preprocessing
%matplotlib inline

from scipy import stats

import warnings
warnings.filterwarnings("ignore")

# we install and import the spacy package for some advanced tokenization techniques:
import spacy

# we also install and import mglearn package (using !pip install mglearn) for some interesting visualization of results:
import mglearn

In [2]:
!tree aclImdb

Folder PATH listing for volume OS
Volume serial number is 3EA9-93A4
C:\USERS\RHASH\DOCUMENTS\DATASETS\NLP PROJECTS (SKLEARN & SPARK)\ACLIMDB
├───test
│   ├───neg
│   └───pos
├───train
│   ├───neg
│   └───pos
└───unsupervised
    └───unsup


In [3]:
ls

 Volume in drive C is OS
 Volume Serial Number is 3EA9-93A4

 Directory of C:\Users\rhash\Documents\Datasets\NLP projects (sklearn & Spark)

09/10/2018  06:38 PM    <DIR>          .
09/10/2018  06:38 PM    <DIR>          ..
09/10/2018  04:05 PM    <DIR>          .ipynb_checkpoints
09/10/2018  11:29 AM    <DIR>          aclImdb
09/10/2018  10:15 AM        84,125,825 aclImdb_v1.tar.gz
09/10/2018  11:57 AM    <DIR>          cache
09/10/2018  06:38 PM           144,551 IMDb movie review (sklearn).ipynb
09/10/2018  05:35 PM             8,800 IMDb review (topic modeling, sklearn).ipynb
               3 File(s)     84,279,176 bytes
               5 Dir(s)  419,847,401,472 bytes free


# Load and prepare the text data:

In [4]:
# load the unlabeled reviews (used here as the training text for topic modeling):
from sklearn.datasets import load_files
# load_files returns a Bunch; the "unsupervised" folder carries no sentiment labels, so only .data is used
reviews_train = load_files("aclImdb/unsupervised/")
text_train = reviews_train.data

print("type of text_train: {}".format(type(text_train)), "\n") 
print("length of text_train: {}".format(len(text_train)), "\n")

print("text_train[0]:\n{}".format(text_train[0]))

type of text_train: <class 'list'> 

length of text_train: 50000 

text_train[0]:
b'this is a passive movie by ace director anthony minghella, the movie has an awesome star cast and they all give competent performances early in their careers.<br /><br />the movie does not have much of a plot and though the story seems veers to a cliched and predictable end, there are enough minor twists that abound in the movie, making it quite an enjoyable watch. the standout features of the movie include its tight script, terrific lines and smart performances.<br /><br />the plot in itself is no great shakes but this movie is a fun watch for a relaxed evening.<br /><br />an enjoyable and pleasant 7!'
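
The `b'...'` prefix in the output above shows that `load_files` returns each document as raw bytes. The following is a minimal sketch of that behaviour, assuming a throwaway temporary folder with two made-up review files (the folder and file contents are hypothetical, not part of the dataset). It also shows that passing `encoding` makes `load_files` return decoded strings instead.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a tiny, hypothetical folder layout: one sub-folder per category.
with tempfile.TemporaryDirectory() as root:
    folder = Path(root) / "unsup"
    folder.mkdir()
    (folder / "0.txt").write_text("a fine film<br />worth watching", encoding="utf-8")
    (folder / "1.txt").write_text("dull and slow", encoding="utf-8")

    raw = load_files(root)                        # documents are returned as bytes
    decoded = load_files(root, encoding="utf-8")  # documents are returned as str

print(type(raw.data[0]), type(decoded.data[0]))
print(raw.target_names)  # category names are taken from the sub-folder names
```

Keeping the documents as bytes, as this notebook does, is why the cleaning step uses byte literals such as `b"<br />"`.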


In [5]:
# text_train contains some HTML line breaks (<br />). 
# It is better to clean the data and remove this formatting before we proceed:

text_train = [doc.replace(b"<br />", b" ") for doc in text_train]
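
The workflow also mentions removing punctuation, which the cell above does not do. A minimal sketch of one way to do it on byte strings follows; the helper name `clean_review` and the sample text are assumptions made for illustration. Note that `CountVectorizer`'s default token pattern already ignores punctuation, so this step is optional for the bag-of-words model used here.

```python
import string

# Translation table mapping every ASCII punctuation byte to a space.
PUNCT = string.punctuation.encode("ascii")
TABLE = bytes.maketrans(PUNCT, b" " * len(PUNCT))

def clean_review(doc: bytes) -> bytes:
    """Hypothetical helper: strip HTML line breaks and ASCII punctuation."""
    doc = doc.replace(b"<br />", b" ")   # same replacement as in the notebook
    doc = doc.translate(TABLE)           # punctuation -> spaces, other bytes untouched
    return b" ".join(doc.split())        # collapse runs of whitespace

sample = b"a fun watch.<br /><br />an enjoyable and pleasant 7!"
print(clean_review(sample))
```

Because only ASCII punctuation bytes are mapped, multi-byte UTF-8 characters in the reviews pass through unchanged.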

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(max_features=12000, 
                       ngram_range=(1, 1),
                       max_df=0.2)

X = vect.fit_transform(text_train)
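
To make the effect of `max_df` concrete, here is a small sketch on a made-up four-document corpus (the documents and the `0.5` threshold are assumptions chosen for illustration, not the notebook's settings). Terms that occur in more than the given fraction of documents are dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "great movie great cast",
    "great plot",
    "boring movie",
    "great fun",
]

# Plain bag of words: one column per distinct token, values are counts.
vect_all = CountVectorizer()
X_all = vect_all.fit_transform(docs)

# max_df=0.5 drops tokens found in more than 50% of the documents.
vect_rare = CountVectorizer(max_df=0.5)
X_rare = vect_rare.fit_transform(docs)

print(sorted(vect_all.vocabulary_))   # includes "great"
print(sorted(vect_rare.vocabulary_))  # "great" (3 of 4 documents) is removed
print(X_all.shape, X_rare.shape)
```

In the notebook, `max_df=0.2` plays the same role on the full corpus: very common words carry little topic information, so they are excluded before fitting LDA.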

# Document clustering with Latent Dirichlet Allocation (LDA):

In [7]:
from sklearn.decomposition import LatentDirichletAllocation 
# n_components sets the number of topics (older scikit-learn releases called this n_topics)
lda = LatentDirichletAllocation(n_components=10, learning_method="batch",
                                max_iter=25, random_state=0)

# We build the model and transform the data in one step  
document_topics = lda.fit_transform(X)
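
The shapes of the fitted objects are easier to see on a toy example. The sketch below uses a made-up four-document corpus and two topics (both are assumptions for illustration, with separate variable names so the notebook's `lda` and `X` are not overwritten). `fit_transform` returns one row per document with one weight per topic, and `components_` holds one row per topic with one weight per vocabulary word.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "war soldier gun fight battle",
    "gun fight hero battle action",
    "love family wedding romance",
    "romance love young family life",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, learning_method="batch",
                                    max_iter=25, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_counts)

print(toy_doc_topics.shape)        # (number of documents, number of topics)
print(toy_doc_topics.sum(axis=1))  # each row is a distribution over topics
print(toy_lda.components_.shape)   # (number of topics, vocabulary size)
```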

In [8]:
# For each topic (a row in components_), sort the features by weight in descending order
# (argsort sorts ascending; [:, ::-1] reverses each row)
sorting = np.argsort(lda.components_, axis=1)[:, ::-1] 

# Get the feature names from the vectorizer 
# (scikit-learn >= 1.0; older releases use vect.get_feature_names() instead)
feature_names = np.array(vect.get_feature_names_out())
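
The sorting step can be checked by hand on a small matrix. The following sketch uses a made-up 2x4 topic-word matrix and vocabulary (both hypothetical) and prints the top words per topic with plain NumPy, as a rough stand-in for what a helper such as `mglearn.tools.print_topics` displays.

```python
import numpy as np

# Hypothetical topic-word weights: 2 topics (rows) x 4 words (columns).
toy_components = np.array([
    [0.1, 5.0, 2.0, 0.3],
    [4.0, 0.2, 0.1, 3.0],
])
toy_features = np.array(["gun", "love", "family", "fight"])

# Column indices per topic, ordered from highest to lowest weight.
order = np.argsort(toy_components, axis=1)[:, ::-1]

# Keep the two highest-weighted words of each topic.
top_words = toy_features[order[:, :2]]

for topic_idx, words in enumerate(top_words):
    print("topic {}: {}".format(topic_idx, " ".join(words)))
```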

In [9]:
# Print out the 10 topics: 
mglearn.tools.print_topics(topics=range(10), feature_names=feature_names,   
                           sorting=sorting, topics_per_chunk=5, n_words=10)

topic 0       topic 1       topic 2       topic 3       topic 4       
--------      --------      --------      --------      --------      
life          your          action        us            character     
love          ever          game          world         actors        
young         know          fight         war           director      
world         why           war           these         films         
man           ve            scenes        our           scenes        
family        thing         hero          why           does          
beautiful     worst         where         those         here          
us            say           off           american      script        
through       did           gun           know          better        
between       didn          guy           did           work          


topic 5       topic 6       topic 7       topic 8       topic 9       
--------      --------      --------      --------      --------      
role