## Sentiment Classification for Product Reviews


### Date: 2nd June, 2019


### STEPS PERFORMED 
- Basic <b>data preprocessing</b> has been performed on the provided train dataset.
- Using a 70/30 <b>hold-out split</b> (`train_test_split`), the train dataset has been divided into train and validation subsets. This is a single split, not k-fold cross validation.
- Various <b>models</b> have been created and their <b>accuracy</b> has been checked against predictions on the held-out subset.
- Finally, <b>labels are predicted</b> for the provided test dataset.

## Importing Required Libraries for Preprocessing, Feature Extraction and Model Creation

In [1]:
from __future__ import division
import pandas as pd
from nltk.tokenize import RegexpTokenizer
import re
import numpy as np
import nltk
from nltk.corpus import reuters
from nltk.corpus import stopwords 
from nltk.util import ngrams
from itertools import chain
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize import MWETokenizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

## Reading Train, Labels and Test Datasets

In [None]:
# Train Dataset
train=pd.read_csv('train_data.csv',encoding="UTF-8")
train.head()

In [None]:
# Test Dataset
test=pd.read_csv('test_data.csv',encoding="UTF-8")
test.head()

In [None]:
# Label Dataset
train_labels=pd.read_csv("train_label.csv",encoding="UTF-8")
train_labels.head()

In [None]:
train['label'] = train_labels['label']
train.tail()

In [None]:
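# Copy the selected columns so later edits do not modify (or warn about) the original frames
train_df = train[['label','text']].copy()
test_df = test[['text']].copy()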

## Preprocessing

In [None]:
import re 
def preprocess(string):
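    """
    String cleaning for the review text.
    Removes line breaks and quotes, replaces every digit with the token "digit",
    then strips surrounding whitespace and lower-cases the result.
    """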
    string = re.sub(r"\n", "", string)    
    string = re.sub(r"\r", "", string) 
    string = re.sub(r"[0-9]", "digit", string)
    string = re.sub(r"\'", "", string)    
    string = re.sub(r"\"", "", string)    
    return string.strip().lower()
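
To make the effect of these substitutions concrete, the sketch below applies a copy of `preprocess` to two made-up review strings (the sample texts are illustrative and not taken from the dataset). Two behaviours are worth noticing: line breaks are deleted rather than replaced by a space, so the words on either side of a newline are joined, and every single digit becomes its own `digit` token.

```python
import re

def preprocess(string):
    """Copy of the cleaning function above, repeated so this cell runs on its own."""
    string = re.sub(r"\n", "", string)
    string = re.sub(r"\r", "", string)
    string = re.sub(r"[0-9]", "digit", string)
    string = re.sub(r"\'", "", string)
    string = re.sub(r"\"", "", string)
    return string.strip().lower()

# Hypothetical sample reviews, for illustration only
sample_a = "  Great phone!\r\nBought 2 for my \"kids\", they're happy.  "
sample_b = "Rated 10/10"

cleaned_a = preprocess(sample_a)
cleaned_b = preprocess(sample_b)

print(cleaned_a)
print(cleaned_b)
```

If joined words or repeated `digit` tokens turn out to hurt the features, replacing line breaks with a space or matching `[0-9]+` would be possible variations, but neither is what the notebook does.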

In [None]:
# Clean every training review (DataFrame.set_value was removed in pandas 1.0)
train_df = train_df.assign(text=train_df['text'].astype(str).apply(preprocess))


In [None]:
# Clean every test review (DataFrame.set_value was removed in pandas 1.0)
test_df = test_df.assign(text=test_df['text'].astype(str).apply(preprocess))

In [None]:
# Hold-out split: 70% of the labelled data for training, 30% for validation
# (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split

X = train_df['text'].tolist()
y = np.array(train_df["label"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)
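
As a rough illustration of what this split does, the sketch below runs `train_test_split` with the same `test_size` and `random_state` on ten made-up reviews. The texts and labels are placeholders, not the real data. It shows the 70/30 sizes and that a fixed seed reproduces the same partition.

```python
from sklearn.model_selection import train_test_split

# Ten hypothetical reviews with made-up, balanced labels
texts = ["review %d" % i for i in range(10)]
labels = ["positive", "negative"] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=5
)
print(len(X_train), len(X_test))

# Repeating the call with the same seed yields the same partition
X_train_again, X_test_again, _, _ = train_test_split(
    texts, labels, test_size=0.3, random_state=5
)
print(X_test == X_test_again)
```

The notebook does not stratify the split. Passing `stratify=labels` would keep the class proportions similar in both subsets, which may matter if the sentiment classes are imbalanced.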

## Creating Models and Checking the accuracy using split Datasets
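- The models are built with scikit-learn's `Pipeline`, which chains the feature-extraction steps and the classifier into a single estimator.
- A single `fit` or `predict` call then runs count vectorization, TF-IDF weighting and classification in order, so these steps do not have to be invoked manually.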
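
The sketch below illustrates that point on a tiny made-up corpus: it runs the vectorizer, the TF-IDF transformer and a classifier by hand, then does the same through a `Pipeline`, and checks that both give identical predictions. The documents and labels are invented for illustration, and a plain `LogisticRegression` is used here instead of the one-vs-rest wrapper to keep the example short.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, for illustration only
docs = [
    "great product works well",
    "love it great quality",
    "excellent value very happy",
    "terrible product broke fast",
    "awful quality very disappointed",
    "bad value do not buy",
]
labels = ["positive"] * 3 + ["negative"] * 3
new_docs = ["great quality", "terrible and awful"]

# Manual route: every step is fitted and applied explicitly
vectorizer = CountVectorizer(ngram_range=(1, 2))
tfidf = TfidfTransformer(use_idf=True)
clf = LogisticRegression(random_state=0)
train_features = tfidf.fit_transform(vectorizer.fit_transform(docs))
clf.fit(train_features, labels)
manual_pred = clf.predict(tfidf.transform(vectorizer.transform(new_docs)))

# Pipeline route: the same three steps behind one fit/predict interface
pipe = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("clf", LogisticRegression(random_state=0)),
])
pipe.fit(docs, labels)
pipe_pred = pipe.predict(new_docs)

print(list(manual_pred) == list(pipe_pred))
```

Besides being shorter, the pipeline makes sure the vectorizer and TF-IDF statistics are learned only from the data passed to `fit`, which avoids accidentally fitting them on validation or test text.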

In [None]:
# Pipeline for the logistic regression model: unigram and bigram counts, TF-IDF weighting, one-vs-rest classifier
model = Pipeline([('vectorizer', CountVectorizer(ngram_range=(1,2))),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', OneVsRestClassifier(LogisticRegression(random_state=0)))])

In [None]:
# Fitting the model
model.fit(X_train, y_train)

In [None]:
# Prediction
pred = model.predict(X_test)

In [None]:
# Confusion Matrix (rows: true labels, columns: predicted labels)
from sklearn.metrics import confusion_matrix, accuracy_score
confusion_matrix(y_test, pred)


In [None]:
# get the accuracy
accuracy_score(y_test, pred)
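
A small worked example may help when reading these two metrics. `confusion_matrix` expects the true labels first and the predictions second; swapping them transposes the matrix, while accuracy stays the same. The label arrays below are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for five reviews
y_true = np.array([0, 0, 0, 1, 1])
y_hat = np.array([0, 0, 1, 1, 1])

cm = confusion_matrix(y_true, y_hat)          # rows: true, columns: predicted
cm_swapped = confusion_matrix(y_hat, y_true)  # arguments in the wrong order
acc = accuracy_score(y_true, y_hat)

print(cm)
print(cm_swapped)
print(acc)

# Accuracy is the share of the diagonal (correct predictions) in the matrix
acc_from_cm = cm.trace() / cm.sum()
```

Here one class-0 review is misclassified as class 1, so the off-diagonal count sits in the first row when the arguments are in the documented order.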

## Predicting the Labels for provided Test Dataset

In [None]:
# Refitting the model on the full training set
model.fit(train_df['text'], train_df['label'])

In [None]:
# Take the first 50,000 test reviews (all of them if the file has fewer rows)
t = test_df['text'][0:50000]

In [None]:
# Prediction
pred = model.predict(t)

In [None]:
# Converting to a DataFrame
predDf = pd.DataFrame(pred)
predDf['test_id'] = test['test_id']

predDf.columns = ['label','test_id']

In [None]:
# Writing into csv (pandas handles the file itself, so the csv module is not needed)
predDf.to_csv('predict_label.csv', index=False)
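
For reference, the sketch below shows the shape of the resulting file using made-up predictions and IDs. The column names `label` and `test_id` follow the notebook; the values are placeholders. It writes to an in-memory buffer instead of `predict_label.csv` so that it runs without touching the disk.

```python
import io
import pandas as pd

# Hypothetical predictions and IDs, for illustration only
pred = ["positive", "negative", "positive"]
test_ids = [101, 102, 103]

# Building the frame from a dict names both columns in one step
pred_df = pd.DataFrame({"label": pred, "test_id": test_ids})

buffer = io.StringIO()
pred_df.to_csv(buffer, index=False)  # index=False leaves out the row index column
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back confirms the header and the row count
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.shape)
```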