## NLP With Hotel Review Part 2

### Pallavi Chintaluri

In [1]:
# Import base packages. Other specific packages will be imported at the time of modelling

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Ignore warnings

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Utilise helper functions to enhance visualizations

def PlotBoundaries(model, X, Y, dot_size=20, figsize=(10, 7)):
    '''
    Helper function that plots the decision boundaries of a model and data (X,Y)
    code modified from: https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
    '''
    
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))

    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.figure(figsize=figsize)
    plt.contourf(xx, yy, Z, alpha=0.4)

    # Plot the data points on top of the decision regions
    plt.scatter(X[:, 0], X[:, 1], c=Y, s=dot_size, edgecolor='k')
    plt.show()

In [10]:
# Import data 

clean_test_df = pd.read_csv(r'C:\Users\palla\clean_data\clean_test_dataframe.csv')
clean_train_df = pd.read_csv(r'C:\Users\palla\clean_data\clean_train_dataframe.csv')

# Separate X and y variables for the two datasets

# The training data is called remain data to facilitate train-validate split for later questions
X_remain = clean_train_df.drop(columns = 'rating')
y_remain = clean_train_df['rating']

X_test = clean_test_df.drop(columns = 'rating')
y_test = clean_test_df['rating']

# Create train and validate sets
from sklearn.model_selection import train_test_split

X_train, X_validate, y_train, y_validate = \
    train_test_split(X_remain, y_remain, test_size = 0.3,
                     random_state=1)

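# Split the validation set in half to create a small sample set as well
# (random_state fixed so the split is reproducible)
X_validate, X_sample, y_validate, y_sample = \
    train_test_split(X_validate, y_validate, test_size = 0.5,
                     random_state=1)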

In [12]:
# Look at the shapes of all our datasets

print(f'Shape of test set: {X_test.shape}')
print(f'Shape of validation set: {X_validate.shape}')
print(f'Shape of train set: {X_train.shape}')
print(f'Shape of sample set: {X_sample.shape}')

Shape of test set: (4267, 2743)
Shape of validation set: (1920, 2743)
Shape of train set: (8958, 2743)
Shape of sample set: (1920, 2743)


Q1. Employ a linear classifier on this dataset:

 - Fit a logistic regression model to this data with the solver set to lbfgs. What is the accuracy score on the test set?
 - What are the 20 words most predictive of a good review (from the positive review column)? What are the 20 words most predictive of a bad review (from the negative review column)? Use the regression coefficients to answer this question
 - Reduce the dimensionality of the dataset using PCA, what is the relationship between the number of dimensions and run-time for a logistic regression?
 - List one advantage and one disadvantage of dimensionality reduction

In [21]:
# Import the packages for linear classification models

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [17]:
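# Attempt to use bagofwords to remove common English words from the remain and test sets.
# In this case we consider our original remain and test sets prior to creating validate and sample sets.
# Note: these datasets are already vectorised (one numeric column per word), and iterating over a
# DataFrame yields its column names, so the vectoriser below is fit on the 2,743 column names rather
# than on review text. The transformed matrices are not used by the models that follow.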

bagofwords = CountVectorizer(stop_words="english")
bagofwords.fit(X_remain)

X_remain_transformed = bagofwords.transform(X_remain) 
X_test_transformed = bagofwords.transform(X_test) 

# We can see how the dimensions have changed
print(X_remain.shape)
print(X_remain_transformed.shape)

(12798, 2743)
(2743, 2801)


1a. Fit a logistic regression model to this data with the solver set to lbfgs. What is the accuracy score on the test set?

In [19]:
# Fitting a model and setting the solver to lbfgs
logreg = LogisticRegression(solver = 'lbfgs', C = 0.1) #Define the model
logreg.fit(X_remain, y_remain) # Fit to remain dataset

# Training and test accuracy scores
print(f"Train score: {logreg.score(X_remain, y_remain)}") 
print(f"Test score: {logreg.score(X_test, y_test)}")

Train score: 0.7225347710579778
Test score: 0.7194750410124209


The results above show a training accuracy of **~72.3%** and a test accuracy of **~71.9%**. The two scores are close, which suggests the model is not overfitting at this level of regularisation (`C = 0.1`).

1b. What are the 20 words most predictive of a good review (from the positive review column)? What are the 20 words most predictive of a bad review (from the negative review column)? Use the regression coefficients to answer this question
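
The notebook does not record an answer for this part. As a minimal sketch of one way to approach it, the fitted coefficients can be paired with the column names and sorted. The toy data below stands in for the real dataset, and the `p_` / `n_` column prefixes, the binary target (1 = good review), and the `top_words` helper are assumptions made for illustration, not taken from the original data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the vectorised review data. The "p_" / "n_" prefixes for
# positive- and negative-review word columns are an assumption.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(
    rng.integers(0, 3, size=(200, 6)),
    columns=["p_great", "p_friendly", "p_clean", "n_dirty", "n_rude", "n_noisy"],
)
# Synthetic binary target that rises with "p_great" and falls with "n_dirty"
y_toy = (X_toy["p_great"] - X_toy["n_dirty"] + rng.normal(0, 0.5, 200) > 0).astype(int)

toy_logreg = LogisticRegression(solver="lbfgs", C=0.1).fit(X_toy, y_toy)

# One coefficient per column, indexed by column name
coefs = pd.Series(toy_logreg.coef_[0], index=X_toy.columns)

def top_words(coefs, prefix, n=20, most_positive=True):
    """Hypothetical helper: the n largest (or smallest) coefficients
    among columns whose name starts with the given prefix."""
    subset = coefs[coefs.index.str.startswith(prefix)]
    return subset.nlargest(n) if most_positive else subset.nsmallest(n)

# Largest positive coefficients push towards a good review,
# most negative coefficients push towards a bad review
top_good = top_words(coefs, "p_", n=20, most_positive=True)
top_bad = top_words(coefs, "n_", n=20, most_positive=False)
print(top_good)
print(top_bad)
```

On the real data the same idea would use `logreg.coef_[0]` with the column names of `clean_train_df` (minus `rating`), provided the coefficients are read before the features are rescaled or transformed.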

1c. Reduce the dimensionality of the dataset using PCA, what is the relationship between the number of dimensions and run-time for a logistic regression?

In [23]:
# Scale the dataset prior to reducing dimensions

scaler = StandardScaler() # Define the scaler
scaler.fit(X_remain) # Fit the scaler to the remain dataset
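# Note: this overwrites X_remain and X_test with scaled NumPy arrays,
# so every later cell (including the timing cells) sees the scaled data
X_remain = scaler.transform(X_remain)
X_test = scaler.transform(X_test)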

In [24]:
# Let's say we want to keep 90% of the variance
my_PCA = PCA(n_components = 0.9) # Define model
my_PCA.fit(X_remain) # Fit the remain dataset

# Transform remain and test
X_remain_PCA = my_PCA.transform(X_remain)
X_test_PCA = my_PCA.transform(X_test)

In [25]:
print(f'Original: {X_remain.shape}')
print(f'PCA Transformed: {X_remain_PCA.shape}')

Original: (12798, 2743)
PCA Transformed: (12798, 1891)


The number of dimensions has dropped from 2,743 to 1,891, a reduction of 852 (about 31%), while retaining 90% of the variance. Let us look at the impact on run time and accuracy of the models.
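
To make the meaning of `n_components = 0.9` concrete, here is a small sketch on synthetic data (not the hotel reviews) showing that a float value asks PCA for the smallest number of components whose cumulative explained variance reaches that fraction. The array `X_demo` and its shape are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-in: 300 rows, 20 correlated features built from 5 latent factors
latent = rng.normal(size=(300, 5))
X_demo = latent @ rng.normal(size=(5, 20)) + 0.3 * rng.normal(size=(300, 20))
X_demo = StandardScaler().fit_transform(X_demo)

# Fit a full PCA and accumulate the explained-variance ratios
full_pca = PCA().fit(X_demo)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 90%
n_for_90 = int(np.searchsorted(cumulative, 0.9) + 1)

# PCA with a float n_components should settle on the same number
pca_90 = PCA(n_components=0.9).fit(X_demo)
print(n_for_90, pca_90.n_components_)
```

Plotting `cumulative` against the component index is a common way to judge whether a different variance threshold would give a better trade-off between size and information kept.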

In [26]:
# Do the same but fit on the PCA transformed data
# (default settings, so C = 1.0 here rather than the C = 0.1 used above)
my_logreg_PCA = LogisticRegression()

# Fitting to PCA data
my_logreg_PCA.fit(X_remain_PCA,y_remain)

# Scoring on PCA train and test sets
print(f'Train Score: {my_logreg_PCA.score(X_remain_PCA, y_remain)}')
print(f'Test Score: {my_logreg_PCA.score(X_test_PCA, y_test)}')

Train Score: 0.8546647913736521
Test Score: 0.7654089524255917


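Both accuracy scores have increased: the train set now has an accuracy of **~85.5%** and the test set **~76.5%**. The gap between train and test has also widened from under 1 to about 9 percentage points, which points to more overfitting. Note that this model differs from the first in more than the PCA step: it is fit on scaled data and uses the default `C = 1.0` rather than `C = 0.1`, so the gain cannot be attributed to dimensionality reduction alone.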
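
One way to isolate the effect of PCA would be to hold everything else fixed. The sketch below does this on synthetic data from `make_classification` (not the hotel reviews): both pipelines share the same scaler and the same `C`, and differ only in the PCA step. The names `baseline` and `with_pca` and all dataset sizes are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_syn, y_syn = make_classification(
    n_samples=600, n_features=40, n_informative=10, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=1
)

# Same scaler and same regularisation strength in both pipelines
baseline = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(solver="lbfgs", C=0.1, max_iter=1000)),
])
with_pca = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.9)),
    ("logreg", LogisticRegression(solver="lbfgs", C=0.1, max_iter=1000)),
])

scores = {}
for name, pipe in {"baseline": baseline, "with_pca": with_pca}.items():
    pipe.fit(X_tr, y_tr)  # scaler and PCA are fit on the training split only
    scores[name] = (pipe.score(X_tr, y_tr), pipe.score(X_te, y_te))
    print(f"{name}: train={scores[name][0]:.3f}, test={scores[name][1]:.3f}")
```

Wrapping the steps in a `Pipeline` also avoids overwriting the original feature matrices, since the scaled and projected data only exist inside the pipeline.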

In [27]:
%%timeit
logreg.fit(X_remain, y_remain)

3.69 s ± 235 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [28]:
%%timeit
my_logreg_PCA.fit(X_remain_PCA,y_remain)

2.23 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


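The model fit on the reduced data is faster: about 2.23 s per fit against 3.69 s on the full feature set, roughly a 40% reduction for 31% fewer dimensions. Fewer dimensions mean less work per optimiser iteration, so run time tends to fall as dimensions are removed, although two measurements alone cannot show the shape of that relationship. Two caveats apply: these timings cover only the logistic regression fit, not the PCA step itself, and the two models use different `C` values, which can also affect how quickly the solver converges.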
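
To see the relationship over more than two points, the fit could be timed for several values of `n_components`. This is a rough sketch on synthetic data using `time.perf_counter`; the component counts and dataset size are arbitrary choices, and single timings like these are noisy, so no particular output is claimed.

```python
import time
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, scaled before PCA
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=100, n_informative=20, random_state=1
)
X_scaled = StandardScaler().fit_transform(X_syn)

results = []
for k in [5, 20, 50, 100]:
    # Project onto the first k principal components
    X_k = PCA(n_components=k).fit_transform(X_scaled)

    # Time only the logistic regression fit, as in the cells above
    model = LogisticRegression(solver="lbfgs", max_iter=1000)
    start = time.perf_counter()
    model.fit(X_k, y_syn)
    elapsed = time.perf_counter() - start

    results.append((k, X_k.shape[1], elapsed))
    print(f"{k:>3} components: fit took {elapsed:.4f} s")
```

In a notebook, `%timeit` on each fit would give more stable estimates than a single `perf_counter` measurement.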

1d. List one advantage and one disadvantage of dimensionality reduction.

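The lower run time seen above is one of the key advantages: **increased computational efficiency and reduced resource use**. 
The disadvantage of this approach is **loss of information**. Here PCA keeps 90% of the variance and discards the rest, and the resulting components are combinations of the original columns, so they can no longer be read as individual words. Because each component still draws on all of the original features, PCA is generally preferred to simply dropping features.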