![Logo](http://www.hva.nl/webfiles/1524744682263/img/logo.svg "Hogeschool van Amsterdam")
# Data Mining & Data Analysis 
## Individual Assignment
Student: Joost Buskermolen (500709241)

Each individual student needs to show his/her data analysis and data mining skills by doing an individual assignment. This assignment is a follow-up to the assignment for the course Data Processing. For Data Mining and Data Analysis you need to build a more or less sophisticated classifier for movie reviews. The classifier should be able to classify the sentiment of the review (positive or negative).

**Its model is built with training data from both:**

1. The dataset found at [Kaggle](https://www.kaggle.com/c/word2vec-nlp-tutorial/data "Kaggle") (In this case the dataset grabbed from VLO, converted to Excel for compatibility reasons.)
2. An additional (large) set of reviews from another movie review website (also through Kaggle), Rotten Tomatoes in this case, [which you can find by clicking this link](https://www.kaggle.com/abhipoo/sentiment-rotten-tomatoes/downloads/sentiment-rotten-tomatoes.zip/1 "Download dataset")

**Its accuracy needs to be at least 75%**

The accuracy is around 85% and therefore above the required minimum. Run all the cells below, and the accuracy will be printed.


### First, we need to import the necessary libraries into the project:

In [1]:
import pandas as pd
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

1. ```pandas```: I used Pandas (imported as pd in the code) to hold the structured training data in dataframes. With this I could easily import Excel sheets / CSV files and convert them to a usable dataframe.
2. ```sklearn```: Short for scikit-learn; it is used to analyse the training data and make predictions based on it.
3. ```CountVectorizer```: The CountVectorizer provides a simple way to tokenize a collection of text documents and build a vocabulary of known words.
4. ```MultinomialNB```: Naive Bayes uses probabilities, estimated from the tokenized training data, to classify new entries.
5. ```accuracy_score, confusion_matrix```: These functions from the metrics module are used to determine the accuracy of the trained model later on.


### Next step, reading the Excel sheet and converting it to a dataframe, and showing the first 3 records:
Below I import the dataset supplied by HvA with the columns id, sentiment & review, using id as the index. After that I call the head() method of the dataframe to view the first rows (3 in this case).

In [None]:
# read_excel has no 'columns' argument; 'usecols' selects the columns to load
hva_df = pd.read_excel("datasets/HvA_Traindata.xlsx", header=0, index_col='id', usecols=['id', 'sentiment', 'review'])
hva_df.head(3)


### Check ratio between positive and negative sentiments:

In [None]:
hva_df.sentiment.value_counts()
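
`value_counts()` returns absolute counts. To read off the ratio directly, pandas can normalise the counts to fractions. A minimal sketch on a made-up sentiment column (the values and the name `toy_sentiment` are illustrative only, not taken from the HvA dataset):

```python
import pandas as pd

# Hypothetical sentiment labels, only for illustration.
toy_sentiment = pd.Series([1, 0, 1, 1, 0, 1, 0, 1], name="sentiment")

counts = toy_sentiment.value_counts()                # absolute count per class
ratios = toy_sentiment.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(ratios)
```

A ratio close to 0.5 for each class means the dataset is balanced, in which case plain accuracy is a reasonable metric.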

## To meet the conditions for this assignment an extra training dataset is necessary.
For this I downloaded an additional (large) set of reviews and sentiment [from Kaggle](https://www.kaggle.com/abhipoo/sentiment-rotten-tomatoes/downloads/sentiment-rotten-tomatoes.zip/1 "Download dataset") gathered from [Rottentomatoes](https://rottentomatoes.com "Rotten Tomatoes"). 

This dataset was divided into two folders, positive (pos) and negative (neg), each containing one review per separate file. Because that is not a suitable format to import directly into a dataframe, I wrote a small script that writes each review to its own row of a CSV file, since copying and pasting ~2000 reviews by hand is not practical. This script works as follows:

1. Create a list of all filenames available in the folder, for both the positive and the negative reviews.
2. Create the file positive.csv and write the headers id, sentiment & review
3. For each file in the list positives: assign an id, set sentiment to 1 and write its contents to the review column
4. Stop once every file in the list has been written.

Follow the same steps for the negative reviews and you end up with two files: positive.csv and negative.csv

In [None]:
import csv
import os
from os.path import isfile, join

positives = [f for f in os.listdir('datasets/pos/') if isfile(join('datasets/pos/', f))]
negatives = [f for f in os.listdir('datasets/neg/') if isfile(join('datasets/neg/', f))]

headers = ['id', 'sentiment', 'review']
review_id = 1  # running id shared by both files ('id' would shadow the built-in)

with open('datasets/positive.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=headers)
    writer.writeheader()

    # One row per positive review file; sentiment 1 means positive.
    for file in positives:
        with open(join('datasets/pos/', file)) as content:
            writer.writerow({'id': review_id, 'sentiment': 1, 'review': content.read()})
        review_id += 1

with open('datasets/negative.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=headers)
    # No header row here: the two files are concatenated later,
    # so only positive.csv carries the header.

    # One row per negative review file; sentiment 0 means negative.
    for file in negatives:
        with open(join('datasets/neg/', file)) as content:
            writer.writerow({'id': review_id, 'sentiment': 0, 'review': content.read()})
        review_id += 1

### Ending up with two separate files isn't ideal, so I merged their contents into one file: combined.csv

In [None]:
# Concatenate the two CSV files; the context manager closes all three files afterwards.
with open('datasets/positive.csv', 'r') as positive, \
     open('datasets/negative.csv', 'r') as negative, \
     open('datasets/combined.csv', 'w') as merged:
    merged.write(positive.read() + negative.read())

### Create the pandas dataframe from the combined.csv file
And display the last five rows with the tail function

In [None]:
extra_df = pd.read_csv("datasets/combined.csv", header=0, index_col='id')
extra_df.tail(5)

### Combining the two dataframes hva_df & extra_df
Because we want to train the classifier on both datasets, we need to 'append' one dataframe to the other. Below I appended the extra_df to hva_df and assigned it to df.

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent.
df = pd.concat([hva_df, extra_df], ignore_index=True)
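
Both dataframes use their own 'id' column as index, so the ids can overlap. A small sketch of what `ignore_index=True` does in that situation, using two made-up dataframes (`first` and `second` are hypothetical stand-ins for `hva_df` and `extra_df`):

```python
import pandas as pd

# Two tiny, made-up dataframes indexed by 'id'; note the duplicate id 10.
first = pd.DataFrame({"sentiment": [1, 0], "review": ["good", "bad"]},
                     index=pd.Index([10, 11], name="id"))
second = pd.DataFrame({"sentiment": [1], "review": ["fine"]},
                      index=pd.Index([10], name="id"))

# ignore_index=True drops the original ids and renumbers the rows from 0.
combined = pd.concat([first, second], ignore_index=True)

print(combined.index.tolist())
```

The original ids are discarded, which is acceptable here because only the sentiment and review columns are used for training.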

For reference, the following command shows how many positive and negative reviews the combined dataframe contains.

In [None]:
df.sentiment.value_counts()

### Tokenize dataframe (df) through CountVectorizer:
The CountVectorizer provides a simple way to tokenize the reviews in the dataframe and build a vocabulary of known words; new reviews can later be encoded with that same vocabulary. By default this scikit-learn class converts all words to lowercase and ignores punctuation. With `max_features=3000` only the 3000 most frequent words are kept, and `binary=True` records whether a word occurs rather than how often.

(This step takes a while, depending on your hardware)

In [None]:
vect = CountVectorizer(max_features=3000, binary=True)
X = vect.fit_transform(df.review)
X.toarray()
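
To make the tokenization concrete, here is a minimal sketch on two made-up sentences (the names `toy_reviews` and `toy_vect` are illustrative and not part of the assignment code). It shows the lowercasing, the dropped punctuation, and the effect of `binary=True`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews with mixed case, punctuation and a repeated word.
toy_reviews = ["A GREAT movie, great cast!", "Not great... a boring movie."]

toy_vect = CountVectorizer(binary=True)
toy_matrix = toy_vect.fit_transform(toy_reviews)

# The learned vocabulary; columns of the matrix follow this alphabetical order.
vocabulary = sorted(toy_vect.vocabulary_)
print(vocabulary)
print(toy_matrix.toarray())
```

The single-letter word "a" does not appear in the vocabulary because the default token pattern only keeps tokens of two or more characters, and "great" is recorded as 1 in the first row even though it occurs twice.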

### To calculate accuracy, I'll split the dataframe into two unequal parts: 85% for training, 15% for testing.
The function ```train_test_split``` from scikit-learn is used for this, with test_size set to 0.15. The split is only needed to calculate the accuracy. Later on we will undo this step to make our training set bigger again. Note that the vectorizer is fitted again here, on the training part only, so the vocabulary contains no words that occur only in the test set.

In [None]:
from sklearn.model_selection import train_test_split
X = df.review
y = df.sentiment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
X_train_vect = vect.fit_transform(X_train)

```X_train_vect``` is now transformed into the right format to give to the Naive Bayes model.
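
Because the split is random, the reported accuracy changes slightly on every run. A sketch of how the split could be made reproducible, assuming a fixed seed is acceptable (the seed value 42 and the toy data are arbitrary choices, not part of the original notebook):

```python
from sklearn.model_selection import train_test_split

# Made-up data: 100 placeholder reviews with alternating labels.
toy_X = [f"review {i}" for i in range(100)]
toy_y = [i % 2 for i in range(100)]

# random_state fixes the shuffle; stratify keeps the class ratio similar in both parts.
split_a = train_test_split(toy_X, toy_y, test_size=0.15, random_state=42, stratify=toy_y)
split_b = train_test_split(toy_X, toy_y, test_size=0.15, random_state=42, stratify=toy_y)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = split_a
print(len(toy_X_train), len(toy_X_test))
print(split_a[1] == split_b[1])  # same seed, same test set
```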

### Next, we need to instantiate a Naive Bayes model from sklearn and fit it to our training data:
The following code creates the model and prints its score on the training data, i.e. the accuracy on the reviews it was trained on (not to be confused with the accuracy on unseen test data).

In [None]:
nb = MultinomialNB()
nb.fit(X_train_vect, y_train)
print("Fitness: "+ str(nb.score(X_train_vect, y_train)))
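
For reference, the standard decision rule of multinomial Naive Bayes can be written as follows. The symbols are introduced here for illustration and do not appear in the code above:

$$
\hat{y} = \arg\max_{c} \Big[ \log P(c) + \sum_{i} x_i \log \theta_{ci} \Big],
\qquad
\theta_{ci} = \frac{N_{ci} + \alpha}{N_c + \alpha n}
$$

Here $x_i$ is the value of word $i$ in the review (0 or 1 when the vectorizer uses `binary=True`), $N_{ci}$ is the total count of word $i$ over the training reviews of class $c$, $N_c = \sum_i N_{ci}$, $n$ is the number of words in the vocabulary, and $\alpha$ is the smoothing parameter, which defaults to 1.0 in `MultinomialNB`. The smoothing keeps a word that never occurred in one class from forcing that class's probability to zero.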

### Now I'll vectorize the test set, and use that set to predict if each review is either positive or negative.
This gives us the opportunity to calculate the accuracy using ```accuracy_score``` from sklearn.metrics. The last line uses the ```math``` library to truncate the percentage to one decimal (```math.floor``` rounds down rather than to the nearest value).

In [None]:
X_test_vect = vect.transform(X_test)
y_pred = nb.predict(X_test_vect)
import math
acc = accuracy_score(y_test, y_pred) * 100
print("Accuracy: " + str(math.floor(acc*10)/10) + "%")

### Up next, a confusion matrix:

In [None]:
confusion_matrix(y_test, y_pred, labels=[0, 1])
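
With `labels=[0, 1]`, scikit-learn puts the true labels on the rows and the predicted labels on the columns. A small sketch of how to read such a matrix, using made-up counts (the numbers in `toy_cm` are hypothetical and not the result of this notebook):

```python
# Hypothetical confusion matrix in scikit-learn's layout for labels=[0, 1]:
# rows are the true labels, columns are the predicted labels.
toy_cm = [[420, 80],
          [70, 430]]

tn, fp = toy_cm[0]  # true negatives, false positives
fn, tp = toy_cm[1]  # false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```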

### Because we now know the accuracy, we don't need to split the data in train/test anymore.
Therefore I will redefine the training set and build my Naive Bayes model again below.

In [None]:
X_vect = vect.fit_transform(X)
nb = MultinomialNB()
nb.fit(X_vect, y)
print("Fitness: "+ str(nb.score(X_vect, y)))
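
As an alternative way to organise the same two steps, scikit-learn can chain the vectorizer and the classifier into one object. This is only a sketch on made-up reviews, not the approach used in this notebook; `toy_model` and the training sentences are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data: 1 = positive, 0 = negative.
toy_train = ["great wonderful film", "loved this great movie",
             "terrible boring film", "awful waste boring movie"]
toy_labels = [1, 1, 0, 0]

# The pipeline vectorizes and classifies in one fit/predict call.
toy_model = make_pipeline(CountVectorizer(max_features=3000, binary=True),
                          MultinomialNB())
toy_model.fit(toy_train, toy_labels)

predictions = toy_model.predict(["great wonderful", "boring awful"])
print(predictions)
```

With a pipeline, raw text goes in directly, so there is no separate `transform` call to forget when predicting.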

### Last but not least, I made a function to input your own review and test the classifier:
Enter your input in the box below and run the code till the end. You will get a result based on the trained classifier.

In [None]:
while True:
    userinput = input("Enter the sentence you would like to predict: ")
    if userinput:
        # vect.transform accepts any iterable of strings, so a one-item list is enough
        user_pred = nb.predict(vect.transform([userinput]))[0]
        if user_pred == 1:
            print("Your input was positive!")
        else:
            print("Your input was negative :(")
    else:
        print("No user input was given :(")
    # An empty answer counts as 'Y'
    repeat = input("Would you like to try another review? Y/n: ") or 'Y'
    if repeat.strip().lower() in ['n', 'no']:
        print("Thank you for using Joost's sentiment analysis.")
        break
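
The classifier can also report how confident it is. Below is a sketch of a small helper that returns both the label and the predicted probability; `describe_sentiment` and the toy training data are hypothetical and only serve to keep the example self-contained:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def describe_sentiment(text, vectorizer, classifier):
    """Return (label, probability) for one review text."""
    features = vectorizer.transform([text])
    probabilities = classifier.predict_proba(features)[0]
    # classes_ lists the labels in the same order as the probability columns.
    best = probabilities.argmax()
    label = "positive" if classifier.classes_[best] == 1 else "negative"
    return label, float(probabilities[best])


# Made-up training data: 1 = positive, 0 = negative.
demo_reviews = ["great wonderful film", "loved this great movie",
                "terrible boring film", "awful waste boring movie"]
demo_labels = [1, 1, 0, 0]

demo_vect = CountVectorizer(binary=True)
demo_nb = MultinomialNB().fit(demo_vect.fit_transform(demo_reviews), demo_labels)

label, confidence = describe_sentiment("great wonderful", demo_vect, demo_nb)
print(label, round(confidence, 2))
```

In the notebook itself the same helper could be called with the fitted `vect` and `nb` objects instead of the demo ones.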