### Objective: To scrape consumer reviews from a set of web pages and to evaluate the performance of text classification algorithms on the data. 
The reviews have been divided into seven categories here: http://mlg.ucd.ie/modules/yalp

In [None]:
# Importing the required libraries
import pandas as pd
import numpy as np
import bs4
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import urllib.request
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn import metrics
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

import re 
import sys
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import warnings

In [2]:
link = 'http://mlg.ucd.ie/modules/yalp/'

#### Above link provide reviews which have been divided into seven categories.
- Each review has a star rating. We will assume that 1-star to 3- star reviews are “negative”, and 4-star to 5-star reviews are “positive”.

In [3]:
#Fetching the HTML code from the web page 
response=urllib.request.urlopen(link)
html = response.read().decode()
lines = html.strip().split("\n")
for line in lines:
    print(line)

<!DOCTYPE html>
<html lang="en">
<head>
    <meta name="robots" content="noindex">  
    <meta name="description" content="Content on this site is posted for teaching purposes only. Original data is from yelp.com">
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Yalp Home</title>
    <link rel="shortcut icon" href="images/favicon.ico">
    <!-- Bootstrap core CSS -->
    <link href="assets/css/bootstrap.css" rel="stylesheet">
    <!-- Custom styles for this template -->
    <link href="assets/css/style.css" rel="stylesheet">
    <link href="assets/css/font-awesome.min.css" rel="stylesheet">
    <script src="assets/js/modernizr.js"></script>
</head>
<body>
    <div class="container mtb">
        <div class="row">
        <div class="col-md-12">
          <h3 class="info"><a href="index.html" class="info">Yalp</a> &mdash; Home</h3>
        </div>
        </div>
       

### Task 1: Select three review categories

In [4]:
links = []
parser = bs4.BeautifulSoup(html)
for linkk in parser.findAll('a'):
    links.append(linkk.get('href'))
print(links)

['index.html', 'automotive_list.html', 'cafes_list.html', 'fashion_list.html', 'gym_list.html', 'hair_salons_list.html', 'hotels_list.html', 'restaurants_list.html']



Here I am selecting categories (Gym, Hair saloon, Hotels)

In [5]:
links = [link+links[4],link+links[5],link+links[6]] # Storing the links of the three categories

In [6]:
links

['http://mlg.ucd.ie/modules/yalp/gym_list.html',
 'http://mlg.ucd.ie/modules/yalp/hair_salons_list.html',
 'http://mlg.ucd.ie/modules/yalp/hotels_list.html']

In [7]:
sublinks=[]
for i in links:
    a=[]
    response = urllib.request.urlopen(i)
    html = response.read().decode()
    parser = bs4.BeautifulSoup(html)
    for linkk in parser.findAll('a'):
        a.append(linkk.get('href')) # Appending the list with the link of the reviews for individual businesses inside each category
    sublinks.append(a)

In [8]:
sublinks = [ [link+c for c in a] for a in sublinks]
sublinks

[['http://mlg.ucd.ie/modules/yalp/index.html',
  'http://mlg.ucd.ie/modules/yalp/review_set_ei_4xj1zfVEczaJV7ZdsNA.html',
  'http://mlg.ucd.ie/modules/yalp/review_set_TbsSnPjQsmwmszV95uLJ4w.html',
  'http://mlg.ucd.ie/modules/yalp/review_set_25Qa4NoliJ75F_6u3nqZhw.html',
  'http://mlg.ucd.ie/modules/yalp/review_set_ip-0MsQaogm18KQhhnTqTQ.html',
  'http://mlg.ucd.ie/modules/yalp/review_set_63lkQ2JyOIYWg2b-vMkrrw.html',
  'http://mlg.ucd.ie/modules/yalp/review_set_RSTitmG9qW4LYXQBfNNqpQ.html',
  'http://mlg.ucd.ie/modules/yalp/review_set_hM4mJ29nMMuauqAUStcy8g.html',
  'http://mlg.ucd.ie/modules/yalp/review_set_WJZY8-AVQMhbp84se7H8Bw.html',
  'http://mlg.ucd.ie/modules/yalp/review_set_9xcEdoY2D2O0P18vGsCDMw.html',
  'http://mlg.ucd.ie/modules/yalp/review_set_31cCwS3xzuz89qS__Pw7bg.html',
  'http://mlg.ucd.ie/modules/yalp/review_set_uIrEaKOgWk0OOlolxwU1PA.html',
  'http://mlg.ucd.ie/modules/yalp/review_set_60VqleCpYQWhhUAq3wOh2A.html',
  'http://mlg.ucd.ie/modules/yalp/review_set_vbFlzg9V

In [9]:
df=[] # Creating an empty list to store three dataframes.
i=0
for data in sublinks:
    e=[] # stores the Reviews of each category for every iteration
    rat=[] # stores the Ratings of each category for every iteration
    for d in data:
        response = urllib.request.urlopen(d)
        html = response.read().decode()
        parser = bs4.BeautifulSoup(html,"html.parser")
        for linkk in parser.find_all('p',class_='review-text'):
            e.append(linkk.text)
        for ratings in parser.findAll('img'):
            if ratings.get('alt')!=None:
                if int(ratings.get('alt')[0])<4:  #  1-star to 3-star reviews are “negative”, and 4-star to 5-star reviews as “positive”.
                    rat.append('Negative')
                else:
                    rat.append('Positive')
        
    df.append(pd.DataFrame(e)) # Appending the three datadrame into a list
    df[i]['Ratings']=rat       # Creating a new column Ratings which stores the rating "Positive" and "Negative".
    i=i+1

In [10]:
# Renaming the column names in dataframes
df[0].columns=['Gym Review','Ratings']
df[1].columns=['Hair Salon Review','Ratings']
df[2].columns=['Hotel Review','Ratings']

In [11]:
# Creating csv files for three categories
df[0].to_csv('Gym.csv', index=False)
df[1].to_csv('HairSalon.csv', index=False)
df[2].to_csv('Hotel.csv', index=False)

#### Loading the data using an appropriate data structure.

In [12]:
gym_df = pd.read_csv('Gym.csv')  # reading gym csv
hair_saln_df = pd.read_csv('HairSalon.csv') # reading Hair salons csv
hotel_df = pd.read_csv('Hotel.csv') # reading Hotels csv

In [13]:
gym_df.head(5)

Unnamed: 0,Gym Review,Ratings
0,If you're looking for boxing in the East Valle...,Positive
1,I was really excited to try a fun workout rout...,Negative
2,I was interested in taking a boxing bootcamp c...,Negative
3,I worked out at 1 on 1 boxing for a bout 6 mon...,Positive
4,This place literally KICKED my butt every. sin...,Positive


In [14]:
hair_saln_df.head(5)

Unnamed: 0,Hair Salon Review,Ratings
0,"One of the best barbershops I've been to, with...",Positive
1,Took my son in for a haircut. Barber was great...,Positive
2,"Walked in, said hi. The only barber in there d...",Negative
3,I came here 10 minutes before 9am to get a hai...,Negative
4,"Great haircut. No fuss no muss, I asked for la...",Positive


In [15]:
hotel_df.head(5)

Unnamed: 0,Hotel Review,Ratings
0,Melissa took us on a tour of Asia in the space...,Positive
1,With a group of seven of us visiting Montreal ...,Positive
2,Melissa is a gem! My fiancé found her tour on ...,Positive
3,A perfect day in Montreal! Melissa outfitted u...,Positive
4,I had a really great food truck tour with Meli...,Positive


###  Applying Text Preprocessing steps for classification

A range of steps can be used for text preprocessing:
- Minimum term length: Exclude terms of length < 2.
- Case conversion: Converting all terms to lowercase.
- Stop-word filtering: Remove terms that appear on a pre-defined "blacklist" of terms that are highly frequent and do- not convey useful information.
- Low frequency filtering: Remove terms that appear in very few reviews.
- Stemming: Reduce words to their stems (or base forms).

In [16]:
#Function to remove punctuation and non-alphabetic characters

def cleanPunc(sentence): # Function to clean the word of any punctuation or special characters
    cleaned = re.sub(r'[?|!|\'|"|#]',r'',sentence)
    cleaned = re.sub(r'[.|,|)|(|\|/]',r' ',cleaned)
    cleaned = cleaned.strip()
    cleaned = cleaned.replace("\n"," ")
    return cleaned
def keepAlpha(sentence):
    alpha_sent = ""
    for word in sentence.split():
        alpha_word = re.sub('[^a-z A-Z]+', ' ', word)
        alpha_sent += alpha_word
        alpha_sent += " "
    alpha_sent = alpha_sent.strip()
    return alpha_sent


#Converting all terms of 'gym review' , 'hair salon review' and 'hotel review' to lowercase.
gym_df['Gym Review'] = gym_df['Gym Review'].str.lower()  
hair_saln_df['Hair Salon Review'] = hair_saln_df['Hair Salon Review'].str.lower()
hotel_df['Hotel Review'] = hotel_df['Hotel Review'].str.lower()

#Cleaning punctuation from 'gym review' , 'hair salon review' and 'hotel review'
gym_df['Gym Review'] = gym_df['Gym Review'].apply(cleanPunc)
hair_saln_df['Hair Salon Review'] = hair_saln_df['Hair Salon Review'].apply(cleanPunc)
hotel_df['Hotel Review'] = hotel_df['Hotel Review'].apply(cleanPunc)

#Removing non-alphabetic characters from 'gym review' , 'hair salon review' and 'hotel review'
gym_df['Gym Review'] = gym_df['Gym Review'].apply(keepAlpha)
hair_saln_df['Hair Salon Review'] = hair_saln_df['Hair Salon Review'].apply(keepAlpha)
hotel_df['Hotel Review'] = hotel_df['Hotel Review'].apply(keepAlpha)

In [17]:
# Function to remove all the stop-words

stop_words = set(stopwords.words('english'))
re_stop_words = re.compile(r"\b(" + "|".join(stop_words) + ")\\W", re.I)
def removeStopWords(sentence):
    global re_stop_words
    return re_stop_words.sub(" ", sentence)

#Removing all the stop-words from 'gym review' , 'hair salon review' and 'hotel review'
gym_df['Gym Review'] = gym_df['Gym Review'].apply(removeStopWords)
hair_saln_df['Hair Salon Review'] = hair_saln_df['Hair Salon Review'].apply(removeStopWords)
hotel_df['Hotel Review'] = hotel_df['Hotel Review'].apply(removeStopWords)

In [18]:
# Function to performing Stemming
stemmer = SnowballStemmer("english")
def stemming(sentence):
    stemSentence = ""
    for word in sentence.split():
        stem = stemmer.stem(word)
        stemSentence += stem
        stemSentence += " "
    stemSentence = stemSentence.strip()
    return stemSentence

# performing Stemming on 'gym review' , 'hair salon review' and 'hotel review'
gym_df['Gym Review'] = gym_df['Gym Review'].apply(stemming)
hair_saln_df['Hair Salon Review'] = hair_saln_df['Hair Salon Review'].apply(stemming)
hotel_df['Hotel Review'] = hotel_df['Hotel Review'].apply(stemming)

In [19]:
# Factorize creates a unique numeric representation of the Ratings with Negative as 1 and Positive as 0
gym_df['category_id'] = df[0]['Ratings'].factorize()[0]
hair_saln_df['category_id'] = df[1]['Ratings'].factorize()[0]
hotel_df['category_id'] = df[1]['Ratings'].factorize()[0]

In [20]:
gym_df.head(5)

Unnamed: 0,Gym Review,Ratings,category_id
0,your look box east valley high recommend gym o...,Positive,0
1,realli excit tri fun workout routin would also...,Negative,1
2,interest take box bootcamp class research foun...,Negative,1
3,work box bout month love price reason although...,Positive,0
4,place liter kick butt everi singl time actual ...,Positive,0


In [21]:
hair_saln_df.head(5)

Unnamed: 0,Hair Salon Review,Ratings,category_id
0,one best barbershop ive great price honest cou...,Positive,0
1,took son haircut barber great exact want clean...,Positive,0
2,walk said hi barber courtesi say hi back busi ...,Negative,1
3,came minut get haircut open saturday accord ho...,Negative,1
4,great haircut fuss muss ask layer v shape back...,Positive,0


In [22]:
hotel_df.head(5)

Unnamed: 0,Hotel Review,Ratings,category_id
0,melissa took us tour asia space hour definit k...,Positive,0
1,group seven us visit montreal week look tour w...,Positive,0
2,melissa gem fianc found tour viator check trip...,Positive,1
3,perfect day montreal melissa outfit us bike de...,Positive,1
4,realli great food truck tour melissa montreal ...,Positive,0


In [23]:
# Creating dictionaries
category_id_df = gym_df[['Ratings', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Ratings']].values)
id_to_category

{0: 'Positive', 1: 'Negative'}

### Task2.1 For Gym category dataset
#### a: Apply appropriate preprocessing steps to create a numeric representation of the data, suitable for classification.

In [24]:
#Preprocessing step to create a numeric representation of the data using TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=5, ngram_range=(1, 2), stop_words='english')

review_feature = vectorizer.fit_transform(gym_df['Gym Review'])

rating_label = gym_df.category_id


#### b: Building classification model to distinguish between positive and negative reviews

In [25]:
#Category: Gym

# Split the data into a training set, a vaidation set, and a test set
X_train_plus_valid, X_test, y_train_plus_valid, y_test \
    = train_test_split(review_feature, rating_label, random_state=123,\
                                    train_size = 0.7)

X_train, X_valid, y_train, y_valid \
    = train_test_split(X_train_plus_valid, \
                                        y_train_plus_valid, random_state=123, \
                                        train_size = 0.5/0.7)


# LogisticRegression Classifier
clf = LogisticRegression(random_state=0,class_weight='balanced').fit(X_train, y_train) #fitting gym training set

#### c.  Test the predictions of the classification model using an appropriate evaluation strategy.

In [26]:
y_pred = clf.predict(X_valid) #predicting validation set whether it is positive or negative review
accuracy = accuracy_score(y_valid, y_pred)
print("Accuracy: " +  str(accuracy))

# Cross validation using gym dataset
scores = cross_val_score(clf, vectorizer.fit_transform(gym_df['Gym Review']),rating_label , cv=10,n_jobs=-1,scoring='accuracy')
print('Cross Validation Experiment scores', scores.mean())

Accuracy: 0.885
Cross Validation Experiment scores 0.89


 This model is giving 88% accuracy on unseen data of gym and it's cross validation score is also high i.e. 89%
 We can say that this model can predict the reviews 88% accurately.

### Task2.2 For Hair Salons category dataset
#### a: Apply appropriate preprocessing steps to create a numeric representation of the data, suitable for classification.

In [27]:
#Preprocessing step to create a numeric representation of the data using TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=5, ngram_range=(1, 2), stop_words='english')

review_feature = vectorizer.fit_transform(hair_saln_df['Hair Salon Review'])

rating_label = hair_saln_df.category_id

#### b: Building classification model to distinguish between positive and negative reviews

In [28]:
#Category: Hair and Salons 

# Split the data into a training set, a vaidation set, and a test set
X_train_plus_valid, X_test, y_train_plus_valid, y_test \
    = train_test_split(review_feature, rating_label, random_state=123,\
                                    train_size = 0.7)

X_train, X_valid, y_train, y_valid \
    = train_test_split(X_train_plus_valid, \
                                        y_train_plus_valid, random_state=123, \
                                        train_size = 0.5/0.7)


# LogisticRegression Classifier
clf = LogisticRegression(random_state=0,class_weight='balanced').fit(X_train, y_train) #fitting hair salon training set

#### c. Test the predictions of the classification model using an appropriate evaluation strategy.

In [29]:
y_pred = clf.predict(X_valid) #predicting validation set whether it is positive or negative review
accuracy = accuracy_score(y_valid, y_pred)
print("Accuracy: " +  str(accuracy))

# Cross validation using hair salon dataset
scores = cross_val_score(clf, vectorizer.fit_transform(hair_saln_df['Hair Salon Review']),rating_label , cv=10,n_jobs=-1,scoring='accuracy')
print('Cross Validation Experiment scores', scores.mean())

Accuracy: 0.92
Cross Validation Experiment scores 0.938


This model perform well and it is having 92% accuracy on unseen data of hair salons. It's cross validation score is also very high i.e. 93.8%. 

### Task2.3 For Hotels category dataset

#### a: Apply appropriate preprocessing steps to create a numeric representation of the data, suitable for classification.

In [30]:
#Preprocessing step to create a numeric representation of the data using TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=5, ngram_range=(1, 2), stop_words='english')

review_feature = vectorizer.fit_transform(hotel_df['Hotel Review'])

rating_label = hotel_df.category_id

#### b: Building classification model to distinguish between positive and negative reviews

In [31]:
#Category: Hotels

# Split the data into a training set, a vaidation set, and a test set
X_train_plus_valid, X_test, y_train_plus_valid, y_test \
    = train_test_split(review_feature, rating_label, random_state=123,\
                                    train_size = 0.7)

X_train, X_valid, y_train, y_valid \
    = train_test_split(X_train_plus_valid, \
                                        y_train_plus_valid, random_state=123, \
                                        train_size = 0.5/0.7)


# LogisticRegression Classifier
clf = LogisticRegression(random_state=0,class_weight='balanced').fit(X_train, y_train) #fitting hotel training set

#### c. Test the predictions of the classification model using an appropriate evaluation strategy.

In [32]:
y_pred = clf.predict(X_valid) #predicting validation set whether it is positive or negative review
accuracy = accuracy_score(y_valid, y_pred)
print("Accuracy: " +  str(accuracy))

print(metrics.classification_report(y_valid, y_pred))

# Cross validation using hotel dataset
scores = cross_val_score(clf, vectorizer.transform(hotel_df['Hotel Review']),rating_label , cv=10,n_jobs=-1,scoring='accuracy')
print('Cross Validation Experiment scores', scores.mean())

Accuracy: 0.6775
              precision    recall  f1-score   support

           0       0.79      0.80      0.79       311
           1       0.27      0.26      0.26        89

    accuracy                           0.68       400
   macro avg       0.53      0.53      0.53       400
weighted avg       0.67      0.68      0.68       400

Cross Validation Experiment scores 0.633


LogisticRegression model is giving 67.7% accuracy on unseen data of hotels and it's cross validation score is very low i.e. 63%. If we check the classification report of this model, we observed that the precision rate and recall rate of 'positive' reviews is less than 'negative' reviews. We can say that this model can predict 'negative' reivews more accurately than 'positive' reviews.

## Task 3(a): Train a classification model on the data from category 'Gym'. Evaluate its performance on data from category 'Hair Saloon' and data from category 'Hotels'

In [33]:
model1 = LogisticRegression(random_state=0,class_weight='balanced') # LogisticRegression Classifier

X_train, X_test, y_train, y_test = train_test_split(gym_df['Gym Review'], gym_df['Ratings'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = model1.fit(X_train_tfidf, y_train)


# Evaluating performance on hair salon dataset
y_pred = model1.predict(count_vect.transform(hair_saln_df['Hair Salon Review']))
print('Performance on data from category Hair Saloon')
accuracy1 = metrics.accuracy_score(hair_saln_df['Ratings'], y_pred)
print("Accuracy: " +  str(accuracy1))

print(metrics.classification_report(hair_saln_df['Ratings'], y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(hair_saln_df['Ratings'], y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Performance on data from category Hair Saloon
Accuracy: 0.8635
              precision    recall  f1-score   support

    Negative       0.63      0.94      0.75       442
    Positive       0.98      0.84      0.91      1558

    accuracy                           0.86      2000
   macro avg       0.80      0.89      0.83      2000
weighted avg       0.90      0.86      0.87      2000

Confusion Matrix


Predicted,Negative,Positive,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Negative,415,27,442
Positive,246,1312,1558
All,661,1339,2000


From the above confusion matrix and classification report, we can say that trained LogisticRegression model on 'Gym' dataset classified 'Hair salons reviews' and predicted the ratings which is having more precision of "Positive" reviews (98%) than precision of "Negative" reviews (63%). We can say that model is predicting positive reviews more accurately than negative reviews.

In [34]:
# Evaluating performance on hotel dataset

y_pred = model1.predict(count_vect.transform(hotel_df['Hotel Review']))
print('Performance on data from category Hotel')
accuracy2 = metrics.accuracy_score(hotel_df['Ratings'], y_pred) 
print("Accuracy: " +  str(accuracy2))

print(metrics.classification_report(hotel_df['Ratings'], y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(hotel_df['Ratings'], y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Performance on data from category Hotel
Accuracy: 0.819
              precision    recall  f1-score   support

    Negative       0.72      0.92      0.81       824
    Positive       0.93      0.75      0.83      1176

    accuracy                           0.82      2000
   macro avg       0.82      0.83      0.82      2000
weighted avg       0.84      0.82      0.82      2000

Confusion Matrix


Predicted,Negative,Positive,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Negative,759,65,824
Positive,297,879,1176
All,1056,944,2000


From the above confusion matrix and classification report, we can say that trained LogisticRegression model on 'Gym' dataset classified 'Hotel reviews' and predicted the ratings with 81.9% accuracy. It can also predict 'negative and 'positive' review accurately because it's precision and recall rates are more than 70% for both the reviews.

### Task 3(b): Train a classification model on the data from category 'Hair Salons'. Evaluate its performance on data from category 'Gym' and data from category 'Hotels'

In [35]:
model2 = LogisticRegression(random_state=0,class_weight='balanced') # LogisticRegression Classifier

X_train, X_test, y_train, y_test = train_test_split(hair_saln_df['Hair Salon Review'], hair_saln_df['Ratings'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = model2.fit(X_train_tfidf, y_train)

# Evaluating performance on gym dataset

y_pred = model2.predict(count_vect.transform(gym_df['Gym Review']))
print('Performance on data from category GYM')
accuracy1 = metrics.accuracy_score(gym_df['Ratings'], y_pred)
print("Accuracy: " +  str(accuracy1))

print(metrics.classification_report(gym_df['Ratings'], y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(gym_df['Ratings'], y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Performance on data from category GYM
Accuracy: 0.8535
              precision    recall  f1-score   support

    Negative       0.75      0.88      0.81       701
    Positive       0.93      0.84      0.88      1299

    accuracy                           0.85      2000
   macro avg       0.84      0.86      0.84      2000
weighted avg       0.86      0.85      0.86      2000

Confusion Matrix


Predicted,Negative,Positive,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Negative,617,84,701
Positive,209,1090,1299
All,826,1174,2000


From the above confusion matrix and classification report, we can say that trained LogisticRegression model on 'Hair Salon' dataset classified 'Gym reviews' and predicted the ratings with 85.3% accurately. 

In [36]:
# Evaluating performance on hotel dataset

y_pred = model2.predict(count_vect.transform(hotel_df['Hotel Review']))
print('Performance on data from category Hotel')
accuracy2 = metrics.accuracy_score(hotel_df['Ratings'], y_pred) 
print("Accuracy: " +  str(accuracy2))

print(metrics.classification_report(hotel_df['Ratings'], y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(hotel_df['Ratings'], y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Performance on data from category Hotel
Accuracy: 0.8065
              precision    recall  f1-score   support

    Negative       0.70      0.92      0.80       824
    Positive       0.93      0.73      0.82      1176

    accuracy                           0.81      2000
   macro avg       0.82      0.82      0.81      2000
weighted avg       0.84      0.81      0.81      2000

Confusion Matrix


Predicted,Negative,Positive,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Negative,760,64,824
Positive,323,853,1176
All,1083,917,2000


From the above confusion matrix and classification report, we can say that trained LogisticRegression model on 'Hair Salon' dataset classified 'Hotel reviews' and predicted the ratings with 80.65% accurately. It is having False negative rate greater than False positive rate.

### Task 3(c): Train a classification model on the data from category 'Hotels'. Evaluate its performance on data from category 'Gym' and data from category  'Hair Saloon'

In [37]:
model3 = LogisticRegression(random_state=0,class_weight='balanced') # LogisticRegression Classifier

X_train, X_test, y_train, y_test = train_test_split(hotel_df['Hotel Review'], hotel_df['Ratings'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = model3.fit(X_train_tfidf, y_train)


# Evaluating performance on gym dataset

y_pred = model3.predict(count_vect.transform(gym_df['Gym Review']))
print('Performance on data from category GYM')
accuracy1 = metrics.accuracy_score(gym_df['Ratings'], y_pred) 
print("Accuracy: " +  str(accuracy1))

print(metrics.classification_report(gym_df['Ratings'], y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(gym_df['Ratings'], y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Performance on data from category GYM
Accuracy: 0.8695
              precision    recall  f1-score   support

    Negative       0.80      0.84      0.82       701
    Positive       0.91      0.88      0.90      1299

    accuracy                           0.87      2000
   macro avg       0.85      0.86      0.86      2000
weighted avg       0.87      0.87      0.87      2000

Confusion Matrix


Predicted,Negative,Positive,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Negative,592,109,701
Positive,152,1147,1299
All,744,1256,2000


From the above confusion matrix and classification report, we can say that trained LogisticRegression model on 'Hotel' dataset classified 'Gym reviews' and predicted the ratings with 86.7% accurately. It is having precision rate for negative and positive reviews are 80% and 91% respectively.

In [38]:
# Evaluating performance on hair salon dataset

y_pred = model3.predict(count_vect.transform(hair_saln_df['Hair Salon Review']))
print('Performance on data from category Hair Salon')
accuracy2 = metrics.accuracy_score(hair_saln_df['Ratings'], y_pred)
print("Accuracy: " +  str(accuracy2))

print(metrics.classification_report(hair_saln_df['Ratings'], y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(hair_saln_df['Ratings'], y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Performance on data from category Hair Salon
Accuracy: 0.904
              precision    recall  f1-score   support

    Negative       0.73      0.91      0.81       442
    Positive       0.97      0.90      0.94      1558

    accuracy                           0.90      2000
   macro avg       0.85      0.91      0.87      2000
weighted avg       0.92      0.90      0.91      2000

Confusion Matrix


Predicted,Negative,Positive,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Negative,401,41,442
Positive,151,1407,1558
All,552,1448,2000


From the above confusion matrix and classification report, we can say that trained LogisticRegression model on 'Hotel' dataset classified 'Hair Salons reviews' and predicted the ratings with 90.6% accuracy. It is having good precision rate for positive and negative reviews i.e. 97% and 73% respectively.

### Conclusion:
- I would say that logistic Regression is one of the best model for text classification. It use statistical method for predicting binary classes. It computes the probability of an event occurrence whether the review is positive or negative. I have given class_weights as 'balanced' which will balance the imbalance classes in the dataset.
- In task 2, I have used cross validation to evaluate LogisticRegression model for all three category datasets. 
- In task 3, LogisticRegression classifier performed well when used on an unseen data of each of the three category datasets (Gym, Hair salons, Hotels). It is having the highest classification rate. It is also avoiding the problem of overfitting and underfitting well. And, it is giving more than 75% of precision and recall rate of 'positive' and 'negative' reviews which proved that LogisticRegression model will predict the reviews accurately for these three category datasets.