# Logistic Regression Binary Classification: Proof of Concept

In this notebook I demonstrate the use of another popular ML technique, Logistic Regression, as a binary classifier for Genuine and Deceptive reviews over the small ground truth Chicago Hotel dataset.

The accuracies produced in this notebook are not to be trusted as they are based on a very small dataset (1600 samples) with a TINY testing set (480 samples). However, it shows a proof of concept using simple Bag-Of-Words features such as count vectorization and Tf-idf to give a baseline model to improve on with larger datasets and more features.

Without further ado, let's continue.

We start by importing the modules we need.

In [1]:
import os
import numpy as np
import pandas as pd

Then, we load our data files and use the Pandas module to turn them into easy-to-manipulate data frames.

In this dataset, we have 1600 samples: 800 are deceptive and 800 are genuine. Within each class, 400 reviews are positive and 400 are negative.

In [2]:
neg_deceptive_folder_path = r"../data/hotels/negative_polarity/deceptive_from_MTurk/"
neg_true_folder_path = r'../data/hotels/negative_polarity/truthful_from_Web/'
pos_deceptive_folder_path = r'../data/hotels/positive_polarity/deceptive_from_MTurk/'
pos_true_folder_path = r'../data/hotels/positive_polarity/truthful_from_TripAdvisor/'

sentiment_class = []
reviews = []
deceptive_class =[]

for i in range(1,6):
    positive_true = pos_true_folder_path + 'fold' + str(i) 
    positive_deceptive = pos_deceptive_folder_path + 'fold' + str(i)
    negative_true = neg_true_folder_path + 'fold' + str(i) 
    negative_deceptive = neg_deceptive_folder_path + 'fold' + str(i) 
    for data_file in sorted(os.listdir(negative_deceptive)):
      sentiment_class.append('negative')
      deceptive_class.append(str(data_file.split('_')[0]))
      with open(os.path.join(negative_deceptive, data_file)) as f:
        contents = f.read()
        reviews.append(contents)
    for data_file in sorted(os.listdir(negative_true)):
      sentiment_class.append('negative')
      deceptive_class.append(str(data_file.split('_')[0]))
      with open(os.path.join(negative_true, data_file)) as f:
        contents = f.read()
        reviews.append(contents)
    for data_file in sorted(os.listdir(positive_deceptive)):
      sentiment_class.append('positive')
      deceptive_class.append(str(data_file.split('_')[0]))
      with open(os.path.join(positive_deceptive, data_file)) as f:
        contents = f.read()
        reviews.append(contents)
    for data_file in sorted(os.listdir(positive_true)):
      sentiment_class.append('positive')
      deceptive_class.append(str(data_file.split('_')[0]))
      with open(os.path.join(positive_true, data_file)) as f:
        contents = f.read()
        reviews.append(contents)


df = pd.DataFrame({'sentiment':sentiment_class,'review':reviews,'deceptive':deceptive_class})

df.loc[df['deceptive']=='d','deceptive']=1
df.loc[df['deceptive']=='t','deceptive']=0

df.loc[df['sentiment']=='positive','sentiment']=1
df.loc[df['sentiment']=='negative','sentiment']=0

data_x = df['review']

data_y = np.asarray(df['deceptive'],dtype=int)

Then, we split our data into training and testing sets. In my experiments, a 70/30 split worked best. The random_state parameter seeds the shuffle, so the same random split is reproduced on every run.

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.3, random_state=0)
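
As a quick check of what random_state does, here is a minimal sketch on made-up data (the `toy_x` and `toy_y` arrays are hypothetical stand-ins, not the hotel corpus). It shows that two calls with the same seed give the same split. It also shows the optional `stratify` argument, which this notebook does not use, for keeping the class ratio equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 fake "reviews" with balanced 0/1 labels.
toy_x = np.array([f"review {i}" for i in range(20)])
toy_y = np.array([0, 1] * 10)

# Same random_state -> the shuffle is seeded, so both calls give the same split.
a_train, a_test, a_y_train, a_y_test = train_test_split(
    toy_x, toy_y, test_size=0.3, random_state=0)
b_train, b_test, b_y_train, b_y_test = train_test_split(
    toy_x, toy_y, test_size=0.3, random_state=0)
same_split = bool((a_test == b_test).all())

# Optional (not used in this notebook): stratify keeps the class ratio
# the same in the training and testing parts.
s_train, s_test, s_y_train, s_y_test = train_test_split(
    toy_x, toy_y, test_size=0.3, random_state=0, stratify=toy_y)

print("identical split with the same seed:", same_split)
print("test labels with stratify:", s_y_test)
```

Changing random_state to another integer gives a different, but equally reproducible, split.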

We now want to initialize our feature extractors, CountVectorizer and TfidfTransformer. CountVectorizer simply turns reviews into vectors of word counts, and TfidfTransformer turns those word count vectors into Tf-idf vectors.


In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

cv = CountVectorizer()
tfidf = TfidfTransformer()

Now we wish to fit our feature extractors to the training data.

In [5]:
X_train_count = cv.fit_transform(X_train) # Learn the vocabulary from the training reviews and turn them into count vectors
X_test_count = cv.transform(X_test) # Only transform the test reviews, reusing the training vocabulary

X_train_tfidf = tfidf.fit_transform(X_train_count) # Learn the IDF weights from the training counts and turn them into Tf-idf vectors
X_test_tfidf = tfidf.transform(X_test_count) # Only transform the test count vectors into Tf-idf vectors
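
For reference, scikit-learn also ships TfidfVectorizer, which does the counting and the Tf-idf weighting in one step. The sketch below uses a few made-up sentences (the `toy_train` and `toy_test` lists are hypothetical) to check that, with default settings, the one-step and two-step routes produce the same vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical mini-corpus, just for illustration.
toy_train = ["the room was clean",
             "the staff was rude",
             "clean room and friendly staff"]
toy_test = ["the staff was friendly and the lobby was clean"]

# Two-step route, as in this notebook: counts first, then Tf-idf weights.
cv_toy = CountVectorizer()
tfidf_toy = TfidfTransformer()
train_two_step = tfidf_toy.fit_transform(cv_toy.fit_transform(toy_train))
test_two_step = tfidf_toy.transform(cv_toy.transform(toy_test))

# One-step route: TfidfVectorizer with default settings.
vec = TfidfVectorizer()
train_one_step = vec.fit_transform(toy_train)
test_one_step = vec.transform(toy_test)

# Words unseen during fitting (here "lobby") are simply ignored at transform time.
matches = (np.allclose(train_two_step.toarray(), train_one_step.toarray())
           and np.allclose(test_two_step.toarray(), test_one_step.toarray()))
print("same vectors:", matches)
print("vocabulary size:", len(vec.vocabulary_))
```

Keeping the two steps separate, as done here, leaves the raw count vectors available as a feature set of their own.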

And now we fit the logistic regression classifier to our Tf-idf vectors.

In [6]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(random_state=0, solver='lbfgs')
logreg.fit(X_train_tfidf, y_train) 

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

And now we test our model on our held-out testing data.

In [7]:
y_predictions_logreg = logreg.predict(X_test_tfidf)
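
The fit-on-train, transform-only-on-test routine used so far can also be bundled into a single object. Below is a minimal sketch using scikit-learn's Pipeline on made-up reviews (the `toy_reviews`, `toy_labels` and `new_reviews` names and their contents are hypothetical). It is not how this notebook is built, just an equivalent packaging of the same three steps.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = deceptive, 0 = genuine.
toy_reviews = [
    "amazing luxurious hotel best stay of my life",
    "room was fine, check-in took ten minutes",
    "my husband and I loved every single moment here",
    "parking cost 40 dollars and the elevator was slow",
    "truly the most wonderful experience ever",
    "bathroom was small but the bed was comfortable",
]
toy_labels = [1, 0, 1, 0, 1, 0]

# Same three steps as above, chained so that fit() only ever sees training data.
pipe = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("logreg", LogisticRegression(random_state=0, solver="lbfgs")),
])
pipe.fit(toy_reviews, toy_labels)

# predict() applies transform (not fit_transform) to new text before classifying.
new_reviews = ["best hotel ever, truly amazing", "the elevator was slow"]
preds = pipe.predict(new_reviews)
print(preds)
```

With only six training sentences the predictions themselves mean nothing; the point is only that the vectorizers are never refitted on unseen text.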

Alright! Let's get our metrics.

In [10]:
from sklearn import metrics

yp=["Genuine" if prediction == 0 else "Deceptive" for prediction in list(y_predictions_logreg)]
output_fm = pd.DataFrame({'Review':list(X_test) ,'True(0)/Deceptive(1)':yp})
print(output_fm)
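# target_names must follow the label order: 0 = Genuine, 1 = Deceptive (a set has no guaranteed order)
print(metrics.classification_report(y_test, y_predictions_logreg, target_names=['Genuine', 'Deceptive']))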
print('Overall score: ', logreg.score(X_test_tfidf, y_test))

                                                Review True(0)/Deceptive(1)
0    This hotel has the worst rooms we have stayed ...              Genuine
1    Terrible experience, I will not stay here agai...            Deceptive
2    Spent a wonderful night at the Amalfi with fri...              Genuine
3    The concierge was so helpful. Without hesitati...              Genuine
4    Oh my goodness, this has got to be one of the ...            Deceptive
5    I booked two rooms four months in advance at t...              Genuine
6    My husband and I made a reservation at the She...            Deceptive
7    I travel to chicago quite often, and have stay...            Deceptive
8    My husband and I and our 10 month old took a q...              Genuine
9    My Family and I went to this hotel on holiday ...              Genuine
10   Stayed at the Chicago Affinia Dec 10 thru Dec ...              Genuine
11   I was really excited to be visiting Chicago fo...            Deceptive
12   Attende

90% Accuracy! Let's wrap it up here boys, gg.
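
To see what the metrics calls are doing, here is a small hand-checkable sketch on made-up labels (the `y_true_toy` and `y_pred_toy` lists are hypothetical, not results from this notebook). Accuracy is just the fraction of matching labels, and the confusion matrix shows where the misses fall.

```python
from sklearn import metrics

# Hypothetical labels: 0 = Genuine, 1 = Deceptive.
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]

# 3 of the 5 predictions match the true labels.
accuracy = metrics.accuracy_score(y_true_toy, y_pred_toy)

# Rows are true classes, columns are predicted classes, both in label order 0, 1.
cm = metrics.confusion_matrix(y_true_toy, y_pred_toy)

# target_names is given in label order so each row of the report is named correctly.
report = metrics.classification_report(
    y_true_toy, y_pred_toy, target_names=["Genuine", "Deceptive"])

print(accuracy)
print(cm)
print(report)
```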