#**Sentiment Analysis (Text Classification) Using IMDB Movie Reviews Dataset**

**Objective**
In this notebook, I will use the IMDB movie reviews dataset to perform Sentiment Analysis (Text Classification) in Python.

**Steps**
1. Define assets
2. Import necessary libraries
3. Understanding Text Processing (Stop Words)
4. Loading and Processing the IMDB dataset
5. Split into Training and Testing Sets
6. Convert text into numerical features
7. Training the Model
8. Evaluate the Model
9. Visualize Model Performance
10. Test Model with New Reviews

##**Step 1. Define assets**

###**Link**
For this project, I will need to download the IMDB movie reviews dataset.
Link Here --> [IMDB](https://ai.stanford.edu/~amaas/data/sentiment/)

###**Citation**
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

###**Folders**
There are two top-level directories [train/, test/] corresponding to
the training and test sets. Each contains [pos/, neg/] directories for
the reviews with binary labels positive and negative. Within these
directories, reviews are stored in text files named following the
convention id_rating.txt where id is a unique id and rating is
the star rating for that review on a 1-10 scale.
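
The naming convention above can be unpacked with a few lines of Python. This is a minimal sketch: the helper name `parse_review_filename` and the example path are hypothetical and not part of the dataset's own tooling.

```python
import os

def parse_review_filename(filename):
    """Split a name like '123_8.txt' into (review_id, rating)."""
    # Drop any leading directories and the .txt extension.
    stem, _ext = os.path.splitext(os.path.basename(filename))
    review_id, rating = stem.split("_")
    return int(review_id), int(rating)

# Hypothetical file name that follows the id_rating.txt convention.
review_id, rating = parse_review_filename("train/pos/123_8.txt")
print(review_id, rating)
```

The rating is not needed for the binary labels used in this notebook, since the pos/ and neg/ folders already encode them, but it could be useful for a finer-grained analysis.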

Each of the labeled pos/ and neg/ folders contains 12,500 files. To keep run times short, I will load only a subset from each (the first 500 files in the code below).
- test/
    - neg/
    - pos/
- train/
    - neg/
    - pos/
    - unsup/

I will be using train/pos/ and train/neg/ for training and test/pos/ and test/neg/ for testing.

##**Step 2. Import Necessary Libraries**
The libraries I will be using are as follows:

- nltk
- scikit-learn
    - model_selection (train_test_split)
    - feature_extraction.text (TfidfVectorizer)
    - naive_bayes (MultinomialNB)
    - metrics (classification_report, accuracy_score)
- matplotlib

I will also use the os module to access the dataset files and the random module to shuffle the data.

In [None]:
import os
import random
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
import matplotlib.pyplot as plt

##**Step 3: Understanding Text Processing (Stopwords)**
Some words carry almost no information by themselves, such as "and", "to", and "is". These are called stopwords.
Ignoring these words reduces noise in the dataset and lets the model focus on the words with more value.
The Natural Language Toolkit (NLTK) already defines a list of stopwords, so you don't have to enter them manually.
We can download the stopwords with the following Python code:

In [None]:
nltk.download('stopwords')
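
To illustrate what removing stopwords does to a sentence, here is a minimal sketch. It uses a tiny hand-written set (`SAMPLE_STOPWORDS`, a hypothetical name) instead of the real NLTK list so that it runs without any downloads.

```python
# A tiny hand-written stopword set, for illustration only. The real NLTK
# list (nltk.corpus.stopwords.words('english')) is much longer.
SAMPLE_STOPWORDS = {"and", "to", "is", "the", "a", "of", "this"}

def remove_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [word for word in text.lower().split() if word not in stopwords]

print(remove_stopwords("This movie is a joy to watch"))
# → ['movie', 'joy', 'watch']
```

Note that `TfidfVectorizer(stop_words='english')`, used later in this notebook, relies on scikit-learn's own built-in English list rather than the NLTK one.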

##**Step 4: Loading and Processing the IMDB dataset**
The two training files I'm working with are:
- train/pos/
- train/neg/

These folders need to be defined in the code:

In [None]:
train_pos_directory = '/Users/darienprall/Documents/GitHub/aclImdb/train/pos'
train_neg_directory = '/Users/darienprall/Documents/GitHub/aclImdb/train/neg'

Now, I need to store the contents of the positive and negative review files in two lists. I can do this by first listing the names of the files in each directory. I then build the full path for each file by joining the directory and the filename. Once I have the full path, I can open each file and add its contents to the list.

Because the full dataset is so large (50,000 labeled reviews in total), I will only read the first 500 files from each folder.

In [None]:
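# Estimated run time 39mins, reduced file count for faster run time
def read_reviews(directory, limit=500):
    """Read the contents of the first `limit` files listed in `directory`."""
    reviews = []
    for filename in os.listdir(directory)[:limit]:
        # The with-block closes each file as soon as it has been read.
        with open(os.path.join(directory, filename), encoding='utf-8') as f:
            reviews.append(f.read())
    return reviews

positive_reviews = read_reviews(train_pos_directory)
negative_reviews = read_reviews(train_neg_directory)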

Now that I have the two lists, I can concatenate them into one big list that contains all the text reviews. The first part of the list holds the positive reviews, and the second part holds the negative reviews.

In [None]:
texts = positive_reviews + negative_reviews

With all reviews stored, I need to do the following:
- Create a list of labels for each review as positive (1) or negative (0)
- Make sure there's randomness by shuffling the dataset

In [None]:
# Create a list of labels for each review as positive (1) or negative (0)
labels = [1] * len(positive_reviews) + [0] * len(negative_reviews)

# Ensure Randomness
data = list(zip(texts, labels))
random.shuffle(data)
# Unpack into the same names used below so texts and labels stay paired
texts, labels = zip(*data)

#print(texts[:2])
#print(labels[:2])
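
The zip, shuffle, unzip pattern above only works if the unpacked values are assigned back to the names that later cells use. Otherwise the texts are shuffled while the labels keep their original order. Here is a minimal sketch that checks the pairing on toy data. The sample reviews and variable names are hypothetical.

```python
import random

# Toy stand-ins for the review texts and labels (hypothetical data).
sample_texts = ["great film", "loved it", "terrible plot", "boring"]
sample_labels = [1, 1, 0, 0]
expected = dict(zip(sample_texts, sample_labels))

random.seed(0)  # fixed seed so the shuffle is repeatable
pairs = list(zip(sample_texts, sample_labels))
random.shuffle(pairs)
shuffled_texts, shuffled_labels = zip(*pairs)

# Every text must still carry the label it started with.
still_aligned = all(
    expected[text] == label
    for text, label in zip(shuffled_texts, shuffled_labels)
)
print(still_aligned)
```

A check like this is cheap to run right after shuffling and catches misaligned labels before any model is trained.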

##**Step 5: Split Into Testing and Training Sets**
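Using train_test_split from sklearn, I set test_size = 0.2 so that 20% of the reviews are held out for testing and the remaining 80% are used for training. Setting random_state = 42 makes the split reproducible.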

In [None]:
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size = 0.2, random_state = 42)

print(f"Training data size: {len(X_train)}")
print(f"Test data size: {len(X_test)}")

##**Step 6: Convert Text to Numerical Values**
Machine learning models need numerical features. To get them, I will use TF-IDF vectorization to convert the text into numeric form. I fit the vectorizer on the training data and transform it, then transform the testing data using the same fitted vectorizer.

In [None]:
# 100 file run time average 5mins
# 500 file run time average 39 mins
# 5000 file run time 
vectorizer = TfidfVectorizer(stop_words = 'english', max_features = 2000)

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(f"Training data shape: {X_train_tfidf.shape}")
print(f"Testing data shape: {X_test_tfidf.shape}")
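
To make the idea behind TF-IDF concrete, here is a simplified hand calculation on a toy corpus. It is a sketch under assumptions, not a reimplementation of `TfidfVectorizer`. It uses raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1` (which, to my understanding, is what scikit-learn uses by default), and L2 normalization. The corpus and names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and whitespace-tokenized.
docs = [
    "good movie good acting".split(),
    "bad movie".split(),
    "good fun".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: weight} from raw counts and a smoothed IDF, L2-normalized."""
    counts = Counter(doc)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

bad_movie_weights = tfidf(docs[1])
print(bad_movie_weights)
```

In the second document, "bad" receives a higher weight than "movie" because "bad" appears in only one document while "movie" appears in two, so it is more distinctive.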

##**Step 7: Train the Model**

In [None]:
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)
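
For intuition about what a multinomial Naive Bayes classifier learns, here is a small from-scratch sketch. It works on raw word counts with Laplace smoothing rather than TF-IDF features, and it is an illustration of the general technique, not scikit-learn's implementation. The function names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit class log-priors and Laplace-smoothed word log-likelihoods."""
    vocab = {word for doc in docs for word in doc}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_class_docs in class_counts.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = (
            math.log(n_class_docs / len(docs)),
            {w: math.log((word_counts[label][w] + alpha) / total) for w in vocab},
        )
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log-posterior; unseen words are skipped."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(log_likelihood[w] for w in doc if w in log_likelihood)
    return max(model, key=score)

# Hypothetical toy training data: 1 = positive, 0 = negative.
train_docs = [["great", "fun"], ["great", "acting"], ["dull", "plot"], ["dull", "boring"]]
train_labels = [1, 1, 0, 0]
nb = train_nb(train_docs, train_labels)
print(predict_nb(nb, ["great", "plot"]), predict_nb(nb, ["dull", "acting"]))
```

Each class is scored by adding the log-prior to the log-likelihood of every word in the review, and the class with the highest total wins.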

##**Step 8: Evaluate the Model**

In [None]:
y_prediction = model.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_prediction)
print(f"The accuracy score is: {accuracy:.4f}")

print("\nClassification Report: ")
print(classification_report(y_test, y_prediction))
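
The precision and recall columns of the classification report can be reproduced by hand, which helps when interpreting them. The sketch below uses hypothetical labels and predictions, not results from this notebook, and the helper name `precision_recall` is my own.

```python
def precision_recall(y_true, y_pred, positive_label):
    """Compute precision and recall for one class from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    predicted_pos = sum(1 for _, p in pairs if p == positive_label)
    actual_pos = sum(1 for t, _ in pairs if t == positive_label)
    # Precision: of everything predicted as this class, how much was right.
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    # Recall: of everything truly in this class, how much was found.
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

# Hypothetical labels and predictions, not results from this notebook.
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 1, 1, 0, 0]

print(precision_recall(toy_true, toy_pred, positive_label=1))
print(precision_recall(toy_true, toy_pred, positive_label=0))
```

In this toy case the positive class has perfect recall but lower precision, because one negative review was wrongly predicted as positive.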

##**Classification Report Analysis**

###Precision Score
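The classification report shows a precision of 49% for the negative class and 46% for the positive class. In other words, of the reviews the model predicted as negative, only 49% were actually negative, and of those it predicted as positive, only 46% were actually positive. This is not a well-performing model based on these results.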

I increased the number of input files from 50 to 100 to 500, but the percentages stayed about the same. I may have to use more than 5,000 files to see any significant change, but that is not guaranteed to increase the accuracy.

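Another reason for the low precision could be that the Naive Bayes model is not the best fit for this dataset. However, scores this close to 50% on a balanced two-class problem are what random guessing would produce, which more often points to a data-preparation problem, such as labels that no longer line up with the shuffled texts, than to the amount of data or the choice of model.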

The model was better at correctly predicting positive reviews than negative reviews.
