# Sentiment Analysis on Mobile Banking reviews

The task is to learn sentiment from the labelled reviews and predict the sentiment of a new review. The steps are listed below:

### Importing the modules

In [1]:
import pandas as pd
import numpy as np
import nltk
import future
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn import metrics
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.preprocessing import label_binarize
from sklearn.linear_model import LogisticRegression

from sklearn.pipeline import Pipeline
from sklearn import svm
from sklearn.svm import LinearSVC
from sklearn.svm import SVR

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer  # needed by cleanText when stemming=True

In [2]:
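col_names = ["test", "sentiment"]

# Note: train.csv has its own header row; because names= is passed without header=0,
# that row is read as data (row 0 in the output). Passing header=0 would skip it.
# error_bad_lines was deprecated in pandas 1.3 and removed in 2.0 (on_bad_lines='skip' replaces it).
data = pd.read_csv('train.csv',names=col_names,error_bad_lines=False)
print(data.head())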

                                                test  sentiment
0                                               text  sentiment
1  For a movie that gets no respect there sure ar...          0
2  Bizarre horror movie filled with famous faces ...          0
3  A solid, if unremarkable film. Matthau, as Ein...          0
4  It's a strange feeling to sit alone in a theat...          0
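
Row 0 in the output above is the CSV file's own header line, read in as if it were a review. The following is a minimal sketch of how `header=0` lets `names` replace that header instead of duplicating it. The two-review sample is made up for illustration and only assumes a layout like `train.csv` (a `text,sentiment` header followed by data rows).

```python
import io
import pandas as pd

# Hypothetical two-review sample standing in for train.csv
sample_csv = io.StringIO(
    "text,sentiment\n"
    "\"A solid, if unremarkable film.\",0\n"
    "\"Great fun from start to finish.\",1\n"
)

col_names = ["test", "sentiment"]

# header=0 marks the first line as a header that `names` should replace,
# so it is not kept as a data row
clean = pd.read_csv(sample_csv, names=col_names, header=0)

print(clean.shape)
print(clean["sentiment"].dtype)
```

With the header row skipped, the `sentiment` column is parsed as integers rather than strings, which also avoids a stray `sentiment` class in the labels.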


### Data Labeling

- In this step we load the data, in which each review is labelled as <br>
  > 0 for Poor <br>
  > 1 for Good
- Read the data from __'train.csv'__; the commented-out lines show how a separate label column could be added

In [3]:
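def label_data():
    # Load the labelled reviews; the commented-out lines below sketch how a
    # separate 'label' column could be derived from 'sentiment'
    col_names = ["test", "sentiment"]
    data = pd.read_csv('train.csv',names=col_names,error_bad_lines=False)
    return data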
#     labels = []
#     for cell in rows['sentiment']:
#         if cell == 1:
#             labels.append('1')   #Good
#         else:
#             labels.append('0')   #Poor
            
#     rows['label'] = labels
#     del rows['review']
#     return rows
print(data.head())

                                                test  sentiment
0                                               text  sentiment
1  For a movie that gets no respect there sure ar...          0
2  Bizarre horror movie filled with famous faces ...          0
3  A solid, if unremarkable film. Matthau, as Ein...          0
4  It's a strange feeling to sit alone in a theat...          0


### Data Cleaning

Remove all the rows containing blank cells. The resultant data is stored as __'labelled_dataset.csv'__

In [4]:
def clean_data(data):
    # Drop every row that contains a blank (NaN) cell and save the result
    data = data.dropna()
    data.to_csv('labelled_dataset.csv', index=False)
    return data

In [5]:
print(data.head())

                                                test  sentiment
0                                               text  sentiment
1  For a movie that gets no respect there sure ar...          0
2  Bizarre horror movie filled with famous faces ...          0
3  A solid, if unremarkable film. Matthau, as Ein...          0
4  It's a strange feeling to sit alone in a theat...          0


### Data preprocessing

The following text preprocessing steps are implemented to convert raw reviews to cleaned reviews, so that feature extraction in the next step is easier.

- remove html tags using BeautifulSoup
- remove non-letter characters such as digits and symbols
- convert to lower case
- remove stop words such as "the" and "and" if needed
- convert to root words by stemming if needed

In [6]:
def cleanText(raw_text, remove_stopwords=False, stemming=False, split_text=False):
    '''
    Convert a raw review to a cleaned review
    '''
    text = BeautifulSoup(raw_text, 'lxml').get_text()  #remove html
    letters_only = re.sub("[^a-zA-Z]", " ", text)  # remove non-character
    words = letters_only.lower().split() # convert to lower case 
    
    if remove_stopwords: # remove stopword
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
        
    if stemming: # stemming
        stemmer = SnowballStemmer('english')
        words = [stemmer.stem(w) for w in words]
        
    if split_text:  # split text
        return words
    
    return " ".join(words)

In [7]:
def modelEvaluation(predictions, y_test_set):
    # Print accuracy, classification report and confusion matrix for the predictions
    
    print ("\nAccuracy on validation set: % 5.2f " %(accuracy_score(y_test_set, predictions)))
    print ("\nClassification report : \n", metrics.classification_report(y_test_set, predictions))
    print ("\nConfusion Matrix : \n", metrics.confusion_matrix(y_test_set, predictions))

### Bag of Words

Sentiment analysis of a given text can be done in two steps. First, we need a word embedding that converts the text into a numerical representation. Second, we fit the numerical representations to machine learning algorithms or deep learning architectures.

One common approach to word embedding is frequency-based embedding, such as the Bag of Words (BoW) model. A BoW model learns a vocabulary from a given corpus and represents each document by counts of the words it contains. In this part, we explore the performance of BoW with supervised learning algorithms. Here is the workflow:

- Step 1 : Preprocess raw reviews to cleaned reviews
- Step 2 : Create BoW using CountVectorizer / TfidfVectorizer in sklearn
- Step 3 : Transform review text to numerical representations (feature vectors)
- Step 4 : Fit feature vectors to supervised learning algorithm (e.g. Naive Bayes, Logistic regression, etc.)
- Step 5 : Improve the model performance by GridSearch
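
Step 5 is not shown in the cells below, so here is a minimal sketch of how steps 2 to 5 could be chained with `Pipeline` and `GridSearchCV`. The toy corpus, the parameter grid and the choice of `cv=3` are assumptions made for illustration; they are not the settings used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = positive review, 0 = negative review
texts = [
    "great film with a wonderful cast",
    "excellent story and great acting",
    "loved every minute, a perfect movie",
    "wonderful and amazing, the best film",
    "a favorite of mine, well made",
    "amazing direction and excellent music",
    "worst film ever, a waste of time",
    "boring plot and terrible acting",
    "awful movie, nothing works",
    "poor script and bad direction",
    "terrible and boring, the worst",
    "bad acting, awful waste of money",
]
labels = [1] * 6 + [0] * 6

# Steps 2-4: vectorise the text and fit a classifier in one estimator
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step 5: candidate values, addressed as <step name>__<parameter name>
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

grid = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
grid.fit(texts, labels)

print(grid.best_params_)
print(grid.best_score_)
print(grid.predict(["a wonderful, excellent film"]))
```

Putting the vectorizer inside the pipeline means it is refitted on each training fold, so the validation folds do not leak vocabulary into the features.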

In [8]:
if __name__ == '__main__':
#     data = label_data()
#     data = clean_data(data)        ----------
    #prints first 5 rows of the dataset
    print(data.head())    

                                                test  sentiment
0                                               text  sentiment
1  For a movie that gets no respect there sure ar...          0
2  Bizarre horror movie filled with famous faces ...          0
3  A solid, if unremarkable film. Matthau, as Ein...          0
4  It's a strange feeling to sit alone in a theat...          0


### Visualisation

In [109]:
#     # Plot distribution of rating
#     plt.figure(figsize=(12,8))
#     # sns.countplot(data['Rating'])
#     data['sentiment'].value_counts().sort_index().plot(kind='bar')
#     plt.title('Distribution of Rating')
#     plt.xlabel('sentiment')
#     plt.ylabel('Count')

In [99]:
#     # Plot number of reviews for top 20 brands
#     brands = data["test"].value_counts()
#     # brands.count()
#     plt.figure(figsize=(12,8))
#     brands[:40].plot(kind='bar')
#     plt.title("Top 40 review")

In [112]:
#     # Plot number of reviews for top 50 products
#     products = data["test"].value_counts()
#     plt.figure(figsize=(12,8))
#     products[:50].plot(kind='bar')
#     plt.title("Number of Reviews for Top 50 Products")

In [113]:
#     # Plot distribution of review length
#     review_length = data["test"].dropna().map(lambda x: len(x))
#     plt.figure(figsize=(12,8))
#     review_length.loc[review_length < 1500].hist()
#     plt.title("Distribution of Review Length")
#     plt.xlabel('Review length (Number of character)')
#     plt.ylabel('Count')

In [9]:
    #split data into training and testing set
    x_train, x_test, y_train, y_test = train_test_split(data['test'], data['sentiment'], test_size=0.1, random_state=0)
    
    
    #If the labels are words instead of numbers, they can be converted programmatically with the following line
    #y_train = label_binarize(y_train, classes=[0, 1, 2])
#     x_test_cleaned = pd.read_csv('test.csv',names=col_names,error_bad_lines=False)
#     x_test = x_test_cleaned[:2000]

In [10]:
    # Preprocess text data in training set and validation set
    x_train_cleaned = [cleanText(d) for d in x_train]
    x_test_cleaned = [cleanText(d) for d in x_test]

### CountVectorizer with Multinomial Naive Bayes (Benchmark Model)

Now that we have cleaned reviews, the next step is to convert them into numerical representations for the machine learning algorithms.

In sklearn library, we can use CountVectorizer which implements both tokenization and occurrence counting in a single class. The output is a sparse matrix representation of a document.
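
To make the idea concrete, here is a plain-Python sketch of what a bag-of-words representation looks like. It is a simplification and not how `CountVectorizer` is implemented: it splits on whitespace only and returns dense lists rather than a sparse matrix. The helper name `bag_of_words` and the three sample documents are made up for illustration.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count_matrix) for whitespace-tokenised documents."""
    tokenised = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of all distinct words in the corpus
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    matrix = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # One row per document, one column per vocabulary word
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = ["good movie", "bad movie", "good good acting"]
vocab, counts = bag_of_words(docs)

print(vocab)
for row in counts:
    print(row)
```

Each row has one entry per vocabulary word, so most entries are zero once the vocabulary grows to tens of thousands of words, which is why a sparse matrix is the natural storage format.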

In [11]:
    # Fit and transform the training data to a document-term matrix using CountVectorizer
    countVect = CountVectorizer() 
    x_train_countVect = countVect.fit_transform(x_train_cleaned)
    # get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn releases
    print ("Number of features : %d \n" %len(countVect.get_feature_names()))
    print ("Show some feature names : \n", countVect.get_feature_names()[::1000])

Number of features : 70383 

Show some feature names : 
 ['aa', 'afterworld', 'analyst', 'armistead', 'awwwwww', 'beaches', 'biochemical', 'boos', 'budding', 'capri', 'chao', 'clanging', 'companies', 'cooper', 'crucial', 'deathscythe', 'desperadoes', 'dislocated', 'dragonballz', 'eerieness', 'entente', 'exist', 'feels', 'flounce', 'frumpish', 'gertrude', 'grading', 'haggard', 'heiden', 'honkin', 'ignorance', 'infringement', 'ishmael', 'judgment', 'kisses', 'larky', 'limos', 'luske', 'marguerite', 'meet', 'misawa', 'motorist', 'neanderthals', 'nozzle', 'organics', 'pantomime', 'perms', 'pleadings', 'preform', 'pucking', 'rantings', 'reified', 'revitalize', 'rout', 'savalas', 'selina', 'shoes', 'sled', 'soundtract', 'starring', 'subjecting', 'swith', 'tenderfoot', 'titillates', 'tricked', 'unclean', 'untapped', 'victimized', 'watling', 'withdrawal', 'yuletide']


In [12]:
    # Train MultinomialNB classifier
    mnb = MultinomialNB()
    mnb.fit(x_train_countVect, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [13]:
    # Evaluate the model on validation set
    predictions = mnb.predict(countVect.transform(x_test_cleaned))
    modelEvaluation(predictions, y_test)
    print(predictions)


Accuracy on validation set:  0.85 

Classification report : 
               precision    recall  f1-score   support

           0       0.88      0.81      0.84      1241
           1       0.83      0.89      0.86      1260

    accuracy                           0.85      2501
   macro avg       0.85      0.85      0.85      2501
weighted avg       0.85      0.85      0.85      2501


Confusion Matrix : 
 [[1007  234]
 [ 138 1122]]
['1' '0' '1' ... '1' '0' '0']


In [119]:
print(data['sentiment'].value_counts())

0            12500
1            12500
sentiment        1
Name: sentiment, dtype: int64


### TfidfVectorizer with Logistic Regression

Some words appear frequently but carry little information about the sentiment of a particular review. Instead of using occurrence counting, we can use the tf-idf transform to scale down the impact of words that appear frequently in a given corpus.

In sklearn library, we can use TfidfVectorizer which implements both tokenization and tf-idf weighted counting in a single class.
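
As an illustration of the weighting, the sketch below computes tf-idf by hand in plain Python. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. The helper `tfidf_rows` and the sample sentences are hypothetical, and tokenisation is simple whitespace splitting.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return (vocabulary, idf, rows) with L2-normalised tf-idf rows."""
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({w for tokens in tokenised for w in tokens})
    n_docs = len(tokenised)
    # Document frequency: in how many documents each word occurs
    df = Counter(w for tokens in tokenised for w in set(tokens))
    # Smoothed inverse document frequency
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}
    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = [tf[w] * idf[w] for w in vocabulary]
        # L2 normalisation; assumes the document is not empty
        norm = math.sqrt(sum(x * x for x in weights))
        rows.append([x / norm for x in weights])
    return vocabulary, idf, rows

docs = ["the movie was great", "the movie was awful", "the acting was great"]
vocab, idf, rows = tfidf_rows(docs)

print(idf["the"], idf["great"], idf["awful"])
```

A word that occurs in every document, such as "the" here, gets the lowest weight, while a word that occurs in a single document gets the highest.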

In [14]:
    # Fit and transform the training data to a document-term matrix using TfidfVectorizer
    # Note: the vectorizer is fitted on the raw x_train here, while later cells transform
    # the cleaned text; fitting on x_train_cleaned would keep preprocessing consistent
    tfidf = TfidfVectorizer(min_df=5) #minimum document frequency of 5
    x_train_tfidf = tfidf.fit_transform(x_train)
    print ("Number of features : %d \n" %len(tfidf.get_feature_names()))
    print ("Show some feature names : \n", tfidf.get_feature_names()[::1000])

    # Logistic Regression
    lr = LogisticRegression()
    lr.fit(x_train_tfidf, y_train)

Number of features : 25937 

Show some feature names : 
 ['00', 'ambiguous', 'barbaric', 'breasts', 'cheerfulness', 'consideration', 'decked', 'doped', 'eschews', 'flips', 'gorier', 'his', 'insisting', 'krige', 'mahatma', 'moms', 'oddest', 'perspectives', 'prophecy', 'rendition', 'satirized', 'siren', 'stepped', 'teenage', 'twentieth', 'vulgar']




LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [15]:
    # Look at the top 10 features with smallest and the largest coefficients
    feature_names = np.array(tfidf.get_feature_names())
    sorted_coef_index = lr.coef_[0].argsort()
    print ("Total number of features = " + str(len(sorted_coef_index)))
    print ('\nTop 10 features with smallest coefficients :\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
    print ('Top 10 features with largest coefficients : \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Total number of features = 25937

Top 10 features with smallest coefficients :
['worst' 'bad' 'awful' 'waste' 'boring' 'poor' 'terrible' 'nothing'
 'worse' 'no']

Top 10 features with largest coefficients : 
['great' 'excellent' 'best' 'perfect' 'wonderful' 'amazing' 'well' 'loved'
 'love' 'favorite']


In [16]:

    # Evaluate on the validation set
    predictions = lr.predict(tfidf.transform(x_test_cleaned))
    print(predictions)

    modelEvaluation(predictions, y_test)

['1' '0' '1' ... '1' '0' '0']

Accuracy on validation set:  0.88 

Classification report : 
               precision    recall  f1-score   support

           0       0.87      0.89      0.88      1241
           1       0.89      0.87      0.88      1260

    accuracy                           0.88      2501
   macro avg       0.88      0.88      0.88      2501
weighted avg       0.88      0.88      0.88      2501


Confusion Matrix : 
 [[1108  133]
 [ 161 1099]]


### SVM and Trees

This section uses two further kinds of algorithm, SVM and tree-based models (Random Forest and Decision Tree), to determine which suits this task better.

### Support Vector Machine

Here we implement a multi-class SVM for sentiment analysis. More information about it can be found on [this](http://scikit-learn.org/stable/modules/svm.html) link. <br>

- Tip 1 : Here, different kernels can be tried out with `SVC`, for example linear, poly, rbf, sigmoid or precomputed (`LinearSVC` itself is always linear).
- Tip 2 : The parameter values given below can be tweaked to obtain different results.
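
A minimal sketch of Tip 1, assuming a small made-up corpus: it fits `SVC` with several kernels on the same tf-idf features and records the training accuracy of each. The data, `C=1.0` and `gamma="scale"` are illustrative choices only, and training accuracy on eight sentences says nothing about real performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy corpus: 1 = positive review, 0 = negative review
texts = [
    "great film and wonderful acting",
    "excellent story, loved it",
    "the best movie of the year",
    "amazing and perfect in every way",
    "worst film and terrible acting",
    "boring story, hated it",
    "a waste of time and money",
    "awful and poor in every way",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

features = TfidfVectorizer().fit_transform(texts)

training_accuracy = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # Same data and regularisation strength, only the kernel changes
    clf = SVC(kernel=kernel, C=1.0, gamma="scale")
    clf.fit(features, labels)
    training_accuracy[kernel] = clf.score(features, labels)

print(training_accuracy)
```

On the full review set, the comparison would be made on the held-out validation data instead, for example with the `modelEvaluation` helper defined earlier.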

### Using LinearSVC

Here you can tweak the parameters of LinearSVC as you see fit. Refer to [this](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) link for the available options.

In [17]:
    #x_train_subset = tfidf.transform(x_train_cleaned[:100])
    x_train_input = tfidf.transform(x_train_cleaned)
    svr_lin = LinearSVC(multi_class='ovr',C=1.0,loss='squared_hinge', dual=False)
    svr_lin.fit(x_train_input, y_train)
    y_svr_lin_predicted = svr_lin.predict(tfidf.transform(x_test_cleaned))
#     print(y_svr_lin_predicted)
    sample = ["Bikas is very Bad.But some time it works beeter"]
    sample = tfidf.transform(sample).toarray()
    sentiment = svr_lin.predict(sample)
    print(sentiment)

['0']


In [18]:
# Test Manually 
sample = ["Bikas is very Bad.But some time it works beeter.It is not okey to use"]
sample = tfidf.transform(sample).toarray()
sentiment = svr_lin.predict(sample)
print(sentiment)

['1']


In [19]:
    modelEvaluation(y_svr_lin_predicted, y_test)


Accuracy on validation set:  0.88 

Classification report : 
               precision    recall  f1-score   support

           0       0.88      0.89      0.88      1241
           1       0.89      0.88      0.88      1260

    accuracy                           0.88      2501
   macro avg       0.88      0.88      0.88      2501
weighted avg       0.88      0.88      0.88      2501


Confusion Matrix : 
 [[1102  139]
 [ 150 1110]]


### Functions for Model Evaluation

scikit-learn provides several functions for model evaluation. To learn more about them, follow the links below:
- [accuracy score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score)
- [f_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html)
- [f1_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score)
- [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix)
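
To show how these metrics relate to each other, the sketch below recomputes accuracy, precision, recall and F1 in plain Python from the confusion matrix printed for the Multinomial Naive Bayes benchmark above. The helper name `report` is made up for this example; rows are taken to be true classes and columns predicted classes, as in scikit-learn's `confusion_matrix`.

```python
# Confusion matrix of the Naive Bayes benchmark:
# rows are true classes, columns are predicted classes
confusion = [[1007, 234],
             [138, 1122]]

def report(confusion):
    """Return (accuracy, {class: (precision, recall, f1)})."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Accuracy: correctly classified reviews (the diagonal) over all reviews
    accuracy = sum(confusion[i][i] for i in range(n_classes)) / total
    per_class = {}
    for c in range(n_classes):
        true_pos = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n_classes))  # column sum
        actual = sum(confusion[c])                                  # row sum
        precision = true_pos / predicted
        recall = true_pos / actual
        f1 = 2 * precision * recall / (precision + recall)
        per_class[c] = (precision, recall, f1)
    return accuracy, per_class

accuracy, per_class = report(confusion)
print("accuracy: %.2f" % accuracy)
for c, (precision, recall, f1) in per_class.items():
    print("class %d: precision %.2f, recall %.2f, f1 %.2f" % (c, precision, recall, f1))
```

The rounded values agree with the classification report printed for that model.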

In [20]:
    print (str(metrics.accuracy_score(y_test, y_svr_lin_predicted)))
#     print "Fscore of this SVM = " + str(metrics.precision_recall_fscore_support(y_test, y_svr_lin_predicted, pos_label=2, average='weighted'))
#     print "F-1 score of this SVM = " + str(metrics.f1_score(y_test, y_svr_lin_predicted, pos_label=2, average='weighted'))
#     print "confusion matrix = " + str(metrics.confusion_matrix(y_test, y_svr_lin_predicted))

0.8844462215113954


In [128]:
#     not run this code 
#     print "Accuracy of this SVM = " + str(metrics.accuracy_score(y_test, y_svr_lin_predicted))
#     print "Fscore of this SVM = " + str(metrics.precision_recall_fscore_support(y_test, y_svr_lin_predicted, pos_label=2, average='weighted'))
#     print "F-1 score of this SVM = " + str(metrics.f1_score(y_test, y_svr_lin_predicted, pos_label=2, average='weighted'))
#     print "confusion matrix = " + str(metrics.confusion_matrix(y_test, y_svr_lin_predicted))

### Random Forest

Refer to [this](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#id1) link for more information

In [22]:
    rand = RandomForestClassifier()
    rand.fit(x_train_input, y_train)
    y_rand_predicted = rand.predict(tfidf.transform(x_test_cleaned))
    print(y_rand_predicted)
   



['1' '0' '1' ... '0' '0' '0']


In [23]:
    modelEvaluation(y_rand_predicted, y_test)


Accuracy on validation set:  0.74 

Classification report : 
               precision    recall  f1-score   support

           0       0.71      0.83      0.76      1241
           1       0.80      0.67      0.72      1260

    accuracy                           0.74      2501
   macro avg       0.75      0.75      0.74      2501
weighted avg       0.75      0.74      0.74      2501


Confusion Matrix : 
 [[1025  216]
 [ 422  838]]


In [None]:
    print("Result Of Accuracy")

In [None]:
    print (rand.score(tfidf.transform(x_test_cleaned), y_test))
    print("Result Of Accuracy")

In [None]:
#        Not Run
#     print ("Accuracy of Random Forest = " + str(rand.score(tfidf.transform(x_test_cleaned), y_test)))
#     print ("Fscore of this SVM = " + str(metrics.precision_recall_fscore_support(y_test, y_predicted, pos_label=2, average='weighted')))
#     print ("F-1 score of this SVM = " + str(metrics.f1_score(y_test, y_predicted, pos_label=2, average='weighted')))
#     print ("confusion matrix = " + str(metrics.confusion_matrix(y_test, y_predicted)))

### Decision Tree

Refer to [this](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) link for more information

In [None]:
    decTree = DecisionTreeClassifier()
    decTree.fit(x_train_input, y_train)
    y_decTree_predicted = decTree.predict(tfidf.transform(x_test_cleaned))
#     sample = ["Dhaka wasa water is not useable.Very bad"]
#     sample = tfidf.transform(sample).toarray()
#     sentiment = clf.predict(sample)
#     print(sentiment)
    

In [None]:
    modelEvaluation(y_decTree_predicted, y_test)

In [None]:
    print (decTree.score(tfidf.transform(x_test_cleaned), y_test))
#     print "Fscore of this SVM = " + str(metrics.precision_recall_fscore_support(y_test, y_decTree_predicted, pos_label=2, average='weighted'))
#     print "F-1 score of this SVM = " + str(metrics.f1_score(y_test, y_decTree_predicted, pos_label=2, average='weighted'))
#     print "confusion matrix = " + str(metrics.confusion_matrix(y_test, y_decTree_predicted))

In [None]:
#     Not Run
#     print "Accuracy of Decision Tree = " + str(decTree.score(tfidf.transform(x_test_cleaned), y_test))
#     print "Fscore of this SVM = " + str(metrics.precision_recall_fscore_support(y_test, y_decTree_predicted, pos_label=2, average='weighted'))
#     print "F-1 score of this SVM = " + str(metrics.f1_score(y_test, y_decTree_predicted, pos_label=2, average='weighted'))
#     print "confusion matrix = " + str(metrics.confusion_matrix(y_test, y_decTree_predicted))

# My Code

In [28]:
col_names = ["test"]
# Load the new reviews, read as a single text column
datas = pd.read_csv('data_store/yes.csv',names=col_names,error_bad_lines=False)
# Convert the column to a plain Python list of review strings
datas = datas['test'].tolist()

In [29]:

for review in datas:  # `review` avoids overwriting the `data` DataFrame loaded earlier
    sample = [review]
    sample = tfidf.transform(sample)
    sentiment = svr_lin.predict(sample)
    print(review,"Sentiment : ",sentiment)
#     print('********************************',sentiment) 
    

company Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Sentiment :  ['1']
b Senti

In [30]:
# datas is a plain Python list after .tolist(), so use len() and slicing
# instead of the DataFrame attributes .shape and .head()
print(len(datas))
print(datas[:5])