# scikit-learn is used to perform sentiment analysis on Yelp reviews. The dataset consists of Yelp reviews of local businesses in the city of Markham.
    ## SELECT review.text, review.stars FROM review JOIN business ON review.business_id = business.id WHERE business.city = 'Markham';
    ## Export the result set to an Excel spreadsheet. The exported file is in Excel XML format.
    ## Re-save the Excel XML file in .xlsx format so that pandas can read it with read_excel.

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:
import nltk

In [4]:
Markham = pd.read_excel('Markham.xlsx')

# The dataframe Markham has the following columns: text and stars. The sentiment analysis uses only these two columns.

In [5]:
Markham.head()

Unnamed: 0,text,stars
0,"Like Dave, I've taken my Corolla here a few ti...",2
1,I started coming here because of my service ad...,5
2,The receptionists are not helpful. I went the...,1
3,I've taken my Corolla here a few times for rep...,2
4,Experienced a body ecu failure in my 6 month o...,5


In [6]:
Markham.count()  # number of rows before dropping rows with missing values

text     33923
stars    33923
dtype: int64

# First, we clean up the dataframe by dropping any rows with missing values. The row counts before and after are identical, so this dataset has no missing values.

In [7]:
Markham.dropna(inplace=True)

In [8]:
Markham.count()  # number of rows after dropping rows with missing values

text     33923
stars    33923
dtype: int64

# Binary Classification Problem
## Next, reviews with stars = 3 are treated as neutral and removed from the dataframe.
## A new column 'Sentiment' is created to serve as the target for our model. Reviews with more than 3 stars are encoded as 1, indicating a positive sentiment for the business. Reviews with fewer than 3 stars are encoded as 0, indicating a negative sentiment for the business.

In [9]:
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added below
Markham = Markham[Markham['stars'] != 3].copy()
Markham['Sentiment'] = np.where(Markham['stars'] > 3, 1, 0)
Markham.head(10)

Unnamed: 0,text,stars,Sentiment
0,"Like Dave, I've taken my Corolla here a few ti...",2,0
1,I started coming here because of my service ad...,5,1
2,The receptionists are not helpful. I went the...,1,0
3,I've taken my Corolla here a few times for rep...,2,0
4,Experienced a body ecu failure in my 6 month o...,5,1
5,I have owned nothing but Toyota Corolla's sinc...,5,1
6,Mark is the guy you want to see in service. Go...,5,1
7,This is usually where I go to get my vehicle s...,4,1
8,3.5 stars Brought my car in because it had a ...,3,0
9,A regular 45min oil change was what I came for...,2,0


# Class Imbalance
    ## The mean of the Sentiment column (about 0.71) and the row counts for each class show that the classes are imbalanced: roughly 71% of the reviews are positive.

In [104]:
Markham['Sentiment'].value_counts()

1    16471
0     6610
Name: Sentiment, dtype: int64

In [105]:
Markham['Sentiment'].mean()

0.7136172609505654

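# Split of the data into training and test sets using the text and sentiment columns.
    ## The train_test_split call below does not pass the stratify argument, so the split is random rather than stratified. Passing stratify=Markham['Sentiment'] would guarantee the same class proportions in both sets. The class balance can be checked from the mean and the row counts for each class in y_train and y_test.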

In [38]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(Markham['text'], 
                                                    Markham['Sentiment'], 
                                                    test_size = 0.1,
                                                    random_state=0)
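To illustrate what a stratified split does, here is a minimal dependency-free sketch. The helper name `stratified_split` and the toy labels are assumptions made for illustration; they are not part of the notebook and this is not scikit-learn's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.1, seed=0):
    """Return (train_idx, test_idx) with near-equal class ratios in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    # Split every class separately so each part keeps the class proportions
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with a 70/30 imbalance, similar in spirit to the Sentiment column
labels = [1] * 70 + [0] * 30
train_idx, test_idx = stratified_split(labels, test_size=0.1)

train_mean = sum(labels[i] for i in train_idx) / len(train_idx)
test_mean = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_mean, test_mean)
```

With scikit-learn itself, the same effect is obtained by passing `stratify=Markham['Sentiment']` to `train_test_split`.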

In [39]:
print('X_train first entry:\n\n', X_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)
print('\n\nX_test first entry:\n\n', X_test.iloc[0])
print('\n\nX_test shape: ', X_test.shape)
print('\n\ny_train counts:\n\n', y_train.value_counts())
print('\n\ny_train mean:\n\n', y_train.mean())
print('\n\ny_test counts:\n\n', y_test.value_counts())
print('\n\ny_test mean:\n\n', y_test.mean())

X_train first entry:

 Worst service ever...we got together with four families near Chinese New Year and safe to say we're never coming back. The waitress was extremely rude, coughed on her hands and continued setting up tables, without washing her hands. So we politely brought this to the manager. Turns out he didnt even say sorry, although he was not rude. He just had nothing to say. What kind of management is that? The worst thing is that the entire staff up front started talking about us behind our backs. They were mocking us and shot stares at our table all night. They were cursing at us, saying how we had Alzheimer's because we had elderly people, and also said we had menopause to our ladies. Absolutely horrible and unacceptable. We didn't even get an apology all night. As they (the staff) were sitting around for dinner, they continued gossiping with other staff. They intentionally spoke up so we would hear, bullying us not thinking we would react to it since we're not an aggress

# Feature Selection Approach 1 : CountVectorizer (bag-of-words)
    # X_train is a series of 20,772 review texts. These texts are converted into a matrix of token counts using CountVectorizer, which implements the bag-of-words approach: it counts how often each word occurs in each review text.
    # First, the CountVectorizer is fit to the training data. Fitting tokenizes each text by finding all sequences of at least two letters or digits separated by word boundaries, converts the text to lowercase, and builds a vocabulary from these tokens. With stop_words='english', common English stop words are left out of the vocabulary.

In [53]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english').fit(X_train)
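The tokenization and counting described above can be sketched in plain Python. This is a simplified illustration, not CountVectorizer's implementation: the two example documents are made up, and stop-word removal is left out.

```python
import re
from collections import Counter

# Two or more word characters between word boundaries
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

docs = ["The food was great, great service!",
        "Service was slow. I was not happy."]

# Vocabulary: every distinct training token, sorted and mapped to a column index
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

def count_vector(text):
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]

print(vocabulary)
print(count_vector(docs[0]))
```

The single-letter token "I" never enters the vocabulary, and a word that was not seen during fitting contributes nothing to the vector.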

#### The vocabulary is built from every token that occurred in the training data. Every 500th feature is listed below; the sample includes numbers, misspellings, and elongated words.

In [54]:
vect.get_feature_names()[::500]

['00',
 '380',
 'acai',
 'alaska',
 'approve',
 'awestruck',
 'beetle',
 'blues',
 'broke',
 'campbell',
 'chan',
 'cinnamony',
 'communicating',
 'coors',
 'crêpes',
 'deems',
 'diiiiieeee',
 'doom',
 'edamame',
 'erupted',
 'fades',
 'fizz',
 'freeboard',
 'genes',
 'gravitated',
 'happened',
 'holli',
 'illustrates',
 'instantly',
 'jetta',
 'kipz',
 'leaveeeeeee',
 'lord',
 'marche',
 'messing',
 'monotone',
 'neared',
 'o5j4qjld9a0x3w6hepgdjg',
 'ornaments',
 'parchment',
 'pharmacies',
 'pomeranian',
 'prison',
 'quicky',
 'reeeaaalllly',
 'restarting',
 'roux',
 'sceptical',
 'shaker',
 'sincerely',
 'soem',
 'squatting',
 'stubborn',
 'swiss',
 'tendou',
 'tofufa',
 'trusting',
 'unlicensed',
 'vert',
 'wednesdays',
 'wowza',
 '不過我覺得落粉炸反而比較好吃一點']

#### The length of the vocabulary gives the number of features: 30,867 in the run shown below.

In [55]:
len(vect.get_feature_names())

30867

### The transform method converts the text in X_train to a document-term matrix, the bag-of-words representation of X_train, where each row corresponds to a review text and each column to a word from the training vocabulary in vect.
    
### This representation is stored in a SciPy sparse matrix. The entries in this matrix are the number of times each word appears in each document. Because the number of words in the training vocabulary is much larger than the number of words that might appear in a single review text, most entries of this matrix are zero.

In [43]:
X_train_vectorized = vect.transform(X_train)

X_train_vectorized

<20772x31170 sparse matrix of type '<class 'numpy.int64'>'
	with 1692357 stored elements in Compressed Sparse Row format>
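To get a feel for how sparse this matrix is, the figures from the repr above can be plugged into a short calculation. The dictionary at the end is only a toy picture of "store the non-zero entries"; SciPy's CSR format uses compact arrays instead.

```python
# Shape and stored-element count copied from the sparse matrix repr above
n_rows, n_cols, n_stored = 20772, 31170, 1692357

density = n_stored / (n_rows * n_cols)
avg_terms_per_review = n_stored / n_rows
print(f"density: {density:.4%}")
print(f"distinct vocabulary terms per review: {avg_terms_per_review:.1f}")

# A sparse row keeps only the non-zero entries, here as {column_index: count}
dense_row = [0, 2, 0, 0, 1, 0, 0, 0]
sparse_row = {j: v for j, v in enumerate(dense_row) if v != 0}
print(sparse_row)
```

Well under 1% of the entries are non-zero, which is why a dense array would waste almost all of its memory on zeros.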

# Multinomial Naive Bayes Classifier

In [32]:
from sklearn import naive_bayes
model = naive_bayes.MultinomialNB()
model.fit(X_train_vectorized, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [33]:
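from sklearn.metrics import roc_auc_score, f1_score
predictions = model.predict(vect.transform(X_test))
# Note: AUC is computed here from hard 0/1 predictions, not from scores
print('AUC: ', roc_auc_score(y_test, predictions))
# Note: for a binary problem, micro-averaged F1 is the same value as accuracy
print('F1 score: ', f1_score(y_test, predictions, average='micro'))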

AUC:  0.844044439692
F1 score:  0.881476347254
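A note on reading the F1 value: when every sample has exactly one label, micro-averaged F1 is equal to accuracy, so it can hide weak performance on the minority class. The sketch below checks this on made-up labels with hand-written metric functions; the function names are illustrative and are not the scikit-learn API.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def class_counts(y_true, y_pred, cls):
    """True positives, false positives and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    return tp, fp, fn

def f1_for_class(y_true, y_pred, cls):
    tp, fp, fn = class_counts(y_true, y_pred, cls)
    return 2 * tp / (2 * tp + fp + fn)

def micro_f1(y_true, y_pred):
    """Pool TP, FP and FN over all classes, then compute a single F1."""
    totals = [class_counts(y_true, y_pred, c) for c in set(y_true) | set(y_pred)]
    tp, fp, fn = (sum(col) for col in zip(*totals))
    return 2 * tp / (2 * tp + fp + fn)

# Made-up imbalanced example: the classifier leans towards the positive class
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

print('accuracy:', accuracy(y_true, y_pred))
print('micro F1:', micro_f1(y_true, y_pred))
print('F1, negative class only:', f1_for_class(y_true, y_pred, 0))
```

Reporting the F1 of the negative class, or a macro average, gives a clearer picture on imbalanced data such as this one.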


# 5-fold Cross-Validation

In [49]:
from sklearn import naive_bayes
from sklearn import model_selection
from sklearn.metrics import roc_auc_score, f1_score
model = naive_bayes.MultinomialNB()

# vect was fitted on X_train only; here it transforms every review in Markham
X_all = vect.transform(Markham['text'])
predictions = model_selection.cross_val_predict(model, X_all, Markham['Sentiment'], cv=5)
print('AUC: ', roc_auc_score(Markham['Sentiment'], predictions))
print('F1 score: ', f1_score(Markham['Sentiment'], predictions, average='micro'))

AUC:  0.835473533412
F1 score:  0.869373077423
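The idea behind cross-validated predictions is that every review is predicted exactly once, by a model that never saw it during training. The sketch below shows that bookkeeping with plain contiguous folds and a hypothetical stand-in model (`majority_class`); scikit-learn's own fold assignment for classifiers differs, so treat this as an illustration only.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_val_predict_sketch(fit_predict, X, y, k=5):
    """Fill in one out-of-fold prediction per sample."""
    predictions = [None] * len(X)
    for train_idx, test_idx in k_fold_indices(len(X), k):
        fold_pred = fit_predict([X[i] for i in train_idx],
                                [y[i] for i in train_idx],
                                [X[i] for i in test_idx])
        for i, p in zip(test_idx, fold_pred):
            predictions[i] = p
    return predictions

# Hypothetical stand-in model: predict the majority class of the training fold
def majority_class(X_train, y_train, X_test):
    majority = max(set(y_train), key=y_train.count)
    return [majority] * len(X_test)

X = list(range(12))
y = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
preds = cross_val_predict_sketch(majority_class, X, y, k=5)
print(preds)
```

For a stricter estimate, the vectorizer would also be refitted inside each fold so that the vocabulary is built from the training part only.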


In [52]:
# cross_val_predict returns a NumPy array, which has no value_counts method,
# so wrap it in a pandas Series first
pd.Series(predictions).value_counts()

# Support Vector Machine Classifier

In [29]:
from sklearn import svm
model = svm.SVC(kernel = 'linear', C=0.1, class_weight = 'balanced')
model.fit(X_train_vectorized, y_train)

SVC(C=0.1, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [31]:
from sklearn.metrics import roc_auc_score, f1_score
predictions = model.predict(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions))
print('F1 score: ', f1_score(y_test, predictions, average='micro'))

AUC:  0.917028390818
F1 score:  0.925489516548


# Logistic Regression Classifier
    ### The feature matrix X_train_vectorized is used to train a logistic regression model, which works well for high-dimensional sparse data [Komarek, Paul, and Andrew W. Moore. "Fast Robust Logistic Regression for Large Sparse Datasets with Binary Outputs." In AISTATS. 2003.].

In [161]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### First, X_test is transformed using the vectorizer that was fitted to the training data. The transformed X_test is then used to make predictions and compute the area under the ROC curve (AUC). Note that any words in X_test that didn't appear in X_train are ignored. An AUC score of about 0.912 is achieved.

In [162]:
from sklearn.metrics import roc_auc_score
predictions = model.predict(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.912116244943
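The AUC values in this notebook are computed from hard 0/1 predictions. AUC is defined over rankings, so it is usually computed from continuous scores; with hard labels it collapses to the average of the true positive rate and the true negative rate. The sketch below uses a hand-written pairwise AUC on made-up numbers to show that the two can differ.

```python
def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities of the positive class
y_true = [1, 1, 1, 0, 0]
probabilities = [0.9, 0.8, 0.55, 0.6, 0.1]
hard_labels = [int(p >= 0.5) for p in probabilities]

print('AUC from probabilities:', auc(y_true, probabilities))
print('AUC from hard labels:  ', auc(y_true, hard_labels))
```

In scikit-learn the scores would come from `model.predict_proba(X)[:, 1]` or `model.decision_function(X)`, passed to `roc_auc_score` in place of `model.predict(X)`.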


### Below are the 20 smallest and 20 largest coefficients from the model.
    # The model has connected words like 'worst' 'terrible' 'mediocre' 'bland' 'horrible' 'worse' 'overpriced' 'poor' 'disappointing' 'undercooked' 'disappointment' 'sick' 'rude' 'zero' 'ridiculous' 'wouldn' 'dirty' 'poisoning' 'cockroach' 'minimum' to negative reviews and
    # words like 'excellent' 'amazing' 'delicious' 'complaints' 'yummy' 'tasty' 'reasonable' 'satisfied' 'love' 'highly' 'best' 'solid' 'great' 'favourites' 'reasonably' 'glad' 'superb' 'favourite' 'complaint' 'notch' to positive reviews.

In [124]:
# Put the feature names in a NumPy array so they can be indexed by position
# (newer scikit-learn releases replace get_feature_names() with get_feature_names_out())
feature_names = np.array(vect.get_feature_names())

# Indices that sort the model coefficients from smallest to largest
sorted_coef_index = model.coef_[0].argsort()

# The 20 largest coefficients are indexed using [:-21:-1], so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:20]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-21:-1]]))

Smallest Coefs:
['worst' 'terrible' 'mediocre' 'bland' 'horrible' 'worse' 'overpriced'
 'poor' 'disappointing' 'undercooked' 'disappointment' 'sick' 'rude' 'zero'
 'ridiculous' 'wouldn' 'dirty' 'poisoning' 'cockroach' 'minimum']

Largest Coefs: 
['excellent' 'amazing' 'delicious' 'complaints' 'yummy' 'tasty'
 'reasonable' 'satisfied' 'love' 'highly' 'best' 'solid' 'great'
 'favourites' 'reasonably' 'glad' 'superb' 'favourite' 'complaint' 'notch']
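The slicing used to pick the extreme coefficients can be tried on a small list. The coefficient values and words below are made up for illustration; only the indexing pattern mirrors the cell above.

```python
coefs = [0.4, -1.2, 2.5, 0.0, -0.3, 1.1]
names = ['good', 'worst', 'excellent', 'table', 'slow', 'tasty']

# Plain-Python argsort: indices that sort the coefficients in ascending order
order = sorted(range(len(coefs)), key=coefs.__getitem__)

k = 2
smallest = [names[i] for i in order[:k]]            # k most negative
largest = [names[i] for i in order[:-(k + 1):-1]]   # k most positive, largest first

print(order)
print(smallest)
print(largest)
```

The slice `[:-(k + 1):-1]` walks backwards from the end and stops after k items, which is why `[:-21:-1]` returns the 20 largest coefficients in descending order.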


# Feature Selection Approach 2 : Rescale features using Tf–idf (Term frequency-inverse document frequency)

#### Tf–idf weights features based on how important they are to a text. A high weight is given to features that appear frequently within a particular text but rarely in the corpus. Features with low tf–idf are either common across all texts or used rarely and only in long texts.
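As a worked example of the weighting, the sketch below computes tf–idf by hand for three tiny made-up documents. It assumes the smoothed variant that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row; other tf–idf definitions exist.

```python
import math
from collections import Counter

docs = [["great", "food", "great", "service"],
        ["slow", "service"],
        ["great", "price"]]

n_docs = len(docs)
# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """Term count times idf, scaled so the vector has unit Euclidean length."""
    weights = {t: c * idf(t) for t, c in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for term in ("great", "food"):
    print(term, round(idf(term), 4))
print(tfidf(docs[0]))
```

"food" occurs in one document and "great" in two, so "food" gets the higher idf; within the first document "great" still ends up with the largest weight because it occurs twice.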

#### Fit tf–idf vectorizer to the training data.

#### The number of features can be reduced substantially by specifying a minimum number of documents (min_df) in which a word needs to appear to become part of the vocabulary. Words that appear in only a few texts and are unlikely to be useful predictors are removed as a result. For example, with min_df = 5, any word that appears in fewer than five documents is removed from the vocabulary.
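The effect of min_df can be reproduced on a toy corpus. The helper `build_vocabulary` and the documents are assumptions for illustration; the helper keeps a term only if it occurs in at least min_df documents.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, min_df=1):
    """Keep terms whose document frequency is at least min_df."""
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

docs = [["great", "food", "great"],
        ["great", "service"],
        ["slow", "service"],
        ["great", "soem", "food"]]

for min_df in (1, 2, 3):
    vocab = build_vocabulary(docs, min_df)
    print(min_df, len(vocab), vocab)
```

Repeated occurrences inside one document count once, so "great" has a document frequency of 3 here, and one-off misspellings such as "soem" are the first terms to disappear.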


#### With min_df=1, TfidfVectorizer returns the same number of features as CountVectorizer. However, with an AUC of 0.884, there was no improvement in model performance compared to the bag-of-words approach (AUC of about 0.912). As the results below show, increasing min_df reduces the number of features substantially but does not improve model performance.

    min_df = 1; no. of features = 28886; AUC = 0.884
    min_df = 2; no. of features = 16616; AUC = 0.885
    min_df = 3; no. of features = 13061; AUC = 0.886
    min_df = 4; no. of features = 11136; AUC = 0.885
    min_df = 5; no. of features = 9871; AUC = 0.885
    min_df = 6; no. of features = 8940; AUC = 0.886
    min_df = 7; no. of features = 8214; AUC = 0.886
    min_df = 8; no. of features = 7626; AUC = 0.887
    min_df = 9; no. of features = 7128; AUC = 0.887
    min_df = 10; no. of features = 6745; AUC = 0.887
    min_df = 50; no. of features = 2619; AUC = 0.889
    min_df = 100; no. of features = 1705; AUC = 0.888
    min_df = 200; no. of features = 1039; AUC = 0.885

In [152]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TfidfVectorizer to the training data, specifying a minimum document frequency of 3
vect = TfidfVectorizer(min_df=3).fit(X_train)
len(vect.get_feature_names())

13061

### Next, transform the training data, fit a logistic regression model, make predictions on the transformed test data, and compute the AUC score.

In [153]:
X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.886065368546


### Features with the smallest tf–idf either appear commonly across all reviews or only rarely in very long reviews. Whereas features with the largest tf–idf appear frequently in a review, but rarely across all reviews. 

### Extracting the smallest and largest coefficients from the model results in the list of words connected to negative and positive reviews.

In [154]:
# Put the feature names in a NumPy array so they can be indexed by position
feature_names = np.array(vect.get_feature_names())

# Take each feature's maximum tf-idf value over the training reviews and sort the features by it
sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

# The 20 largest tf-idf values are indexed using [:-21:-1], so the list returned is in order of largest to smallest
print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:20]]))
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-21:-1]]))

Smallest tfidf:
['artfully' 'registering' 'guidelines' 'jackfruit' 'rectangles' 'clip'
 'freed' 'forbid' 'awakens' 'belonged' 'gouge' 'arch' 'releases'
 'awkwardness' 'idiotic' 'lighten' 'admitting' 'acidity' 'shuts' 'behold']

Largest tfidf: 
['ann' 'ale' 'tongue' 'veg' 'kabob' 'burger' 'fab' 'yes' 'bagels' 'afghani'
 'stinky' 'ladies' 'awful' 'yoghurt' 'teacher' 'macaroon' 'peri' 'gelato'
 'popice' 'cheeseburgers']


In [155]:
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:20]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-21:-1]]))

Smallest Coefs:
['not' 'worst' 'terrible' 'bland' 'mediocre' 'horrible' 'no' 'disappointed'
 'rude' 'poor' 'disappointing' 'nothing' 'wasn' 'okay' 'bad' 'overpriced'
 'worse' 'better' 'don' 'wouldn']

Largest Coefs: 
['great' 'delicious' 'good' 'amazing' 'love' 'best' 'excellent'
 'definitely' 'friendly' 'tasty' 'nice' 'perfect' 'and' 'try' 'always'
 'favourite' 'fresh' 'reasonable' 'quick' 'awesome']


# A problem with the bag-of-words approach is that word order is disregarded. As a result, the model classifies both of the sentences below as negative reviews.

In [164]:
# These reviews are treated the same by bag-of-words model
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[0 0]
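The reason both sentences get the same prediction can be checked directly: once word order is dropped, they contain exactly the same words. The sketch uses a plain regular expression as a stand-in for the vectorizer's tokenizer.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase, keep tokens of two or more word characters, count them."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

a = bag_of_words('not an issue, phone is working')
b = bag_of_words('an issue, phone is not working')

print(a)
print(a == b)  # prints True
```

Identical word counts give identical feature vectors, so any model trained on single-word features must assign both sentences the same label.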


# Feature Selection Approach 3 : n-grams
### Add some context by adding features for sequences of adjacent words, known as n-grams. For example, bigrams, which count pairs of adjacent words, could give features such as 'is working' versus 'not working'. Trigrams, which count triplets of adjacent words, could give features such as 'not an issue'.

### Create n-gram features by passing a tuple to the parameter ngram_range, where the values correspond to the minimum length and maximum lengths of sequences.
### For example, if ngram_range=(1,2), CountVectorizer will create features using the individual words, as well as the bigrams. Although n-grams can be powerful in capturing meaning, longer sequences can cause an explosion of the number of features. 

### By adding bigrams and trigrams, the number of features increased to 203,834. The AUC score improved to about 0.923 after training the logistic regression model on the new features.
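How n-grams separate the two phone sentences can be seen by generating them by hand. `word_ngrams` is an illustrative helper, not the vectorizer's code; like the vectorizer it drops punctuation first, so n-grams can span the comma.

```python
import re

def word_ngrams(text, ngram_range=(1, 2)):
    """All word n-grams with lengths from ngram_range[0] to ngram_range[1]."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

a = set(word_ngrams('not an issue, phone is working', (1, 3)))
b = set(word_ngrams('an issue, phone is not working', (1, 3)))

print(sorted(a - b))  # n-grams only in the first sentence
print(sorted(b - a))  # n-grams only in the second sentence
```

The single words are shared, but 'not an issue' appears only in the first sentence and 'not working' only in the second. Whether a model can use them depends on those n-grams occurring often enough in the training data to pass min_df.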

In [181]:
# Fit the CountVectorizer to the training data, specifying a minimum
# document frequency of 3 and extracting 1-grams, 2-grams and 3-grams
vect = CountVectorizer(min_df=3, ngram_range=(1,3)).fit(X_train)

X_train_vectorized = vect.transform(X_train)

len(vect.get_feature_names())

203834

In [182]:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.923306003708


If we take a look at which features our model associated with negative reviews, we can see that we now have bigrams such as 'not worth' and 'definitely not', while for positive reviews we have 'not bad', 'better than' and 'not too'.

In [185]:
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:20]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-21:-1]]))

Smallest Coefs:
['terrible' 'worst' 'horrible' 'bland' 'mediocre' 'poor' 'overpriced'
 'disappointing' 'rude' 'worse' 'disappointed' 'wouldn' 'not' 'not worth'
 'okay' 'definitely not' 'awful' 'wasn' 'dirty' 'bad']

Largest Coefs: 
['delicious' 'amazing' 'excellent' 'great' 'love' 'best' 'not bad' 'tasty'
 'better than' 'not too' 'good' 'definitely' 'yummy' 'fantastic' 'awesome'
 'friendly' 'reasonable' 'nice' 'perfect' 'must']


If we again try to predict 'not an issue, phone is working' and 'an issue, phone is not working', the output below shows that the n-gram model still labels both as negative ([0 0]). The first review is therefore still misclassified, so on this dataset adding n-grams did not resolve this particular example.

In [184]:
# Re-check the two example reviews with the n-gram model
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[0 0]
