# COMP47670 Assignment 2
## Robert Keenan 15333066
## Date: 27/04/20

The aim of this assignment was to scrape a number of customer reviews from a set of websites and to evaluate the performance of a number of text classification algorithms on the data. 

Firstly, I needed to make use of BeautifulSoup to be able to scrape the reviews off of the website for a select number of catergories before storing the reviews and their sentiments, the review as positive or negative depending on the number of stars. 
The review categories are located on the website http://mlg.ucd.ie/modules/yalp/ and are separated into a number of different categories. 

## Review Categories
The first step was to choose 3 of the 7 categories to scrap data from. From here, I could inspect the webpage using 'Inspect Element' on the browswer where I could identify any of the common characteristics of the categories for access. The 3 chosen categories are given below:

- Automotive Category (132 businesses)
- Gym Category (122 businesses)
- Hotel Category (113 businesses)

From here, the aim was to enter into each categories webpage which contains a list of businesses in the categories shown above. Each link for a business contains a number of reviews related to that business and it was required to store each of these reviews in a dataset. 

### Obtaining each category's URL
The first step involved in scraping the category URLs which would allow me to enter the category page where all of the businesses are listed. To do this, I initially investigated how the webpage was set up using 'Inspect Element' in the browser. To obtain the URLs of each category link, I noticed that each link was off a class ="category". More specifically, I could obtain the URLs for each category by searching for the `<a>` tag in the HTML using BeautifulSoup. The `<a>` tag represents a hyperlink that is present on the webpage. 

For example, the hyperlink corresponding to the list of automotive companies is given in the `<a>` as automotive_list.html. As a result, the corresponding link needed to enter the webpage containing the list of businesses is http://mlg.ucd.ie/modules/yalp/automotive_list.html . This link can be obtained by gathering each of the  hyperlinks in a `<a>` tag and adding a `/` to the front of the corresponding html link so it can be joined to the original Yalp Homepage URL. 

To do this, I built a function as shown below which using the BeautifulSoup HTML parser could obtain all of the `<a>` tags with the hyperlinks and store them in a list along with the prepending `/`.

I needed to also make sure that the homepage URL or the index URL for 'Yalp-Home' was not included in the list of category links. The function is shown below. It was built to take an argument of the main URL (the yalp homepage) as well as a list of the categories wanted, `categories_wanted`. 

This function is called inside another function which is used to gather the URLs of the businesses. It is commented fully below. 

In [164]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import nltk
from langdetect import detect #COMMENT THAT YOU NEED PIP INSTALL FOR THIS
from sklearn.feature_extraction.text import *
from nltk.stem import WordNetLemmatizer 
from sklearn.feature_extraction import text
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn import linear_model
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from statistics import mean

#This is the homepage URL for the Yalp website as given in the assignment specification
link = "http://mlg.ucd.ie/modules/yalp"


#This function is used to gather the category links as well as being called in get_reviews_store() to gather the 
# businesses
def get_url_categories(link, categories_wanted):
    #The list of URLs for the catergories
    category_links=[]
    #The list of businesses in each category
    business_links = []
    response = requests.get(link)
    #Initialise BeautifulSoup as soup for the response content using the HTML parser
    soup = BeautifulSoup(response.content, 'html.parser')
    #Iterate through any <a> tags in the HTML which correspond to Hyperlinks
    for url in soup.find_all('a'):
        #Obtain the hyperlink URL
        current_link = url.get('href')
        #If it is the index URL (Yalp-Home), ignore it
        if current_link == 'index.html':
            continue
        #If there was a valid list of categories provided as argument
        if categories_wanted != None:
            #If any of the categories given in the list categories_wanted are in the current link append to the 
            #list along with a pre-pended '/'
            if any(category in current_link for category in categories_wanted):
                category_links.append('/'+current_link)
                
        #The list of business URLs is being obtained
        else:
            business_links.append('/'+current_link)
    #If there were valid categories given in a list, return the hyperlinks of the categories
    if categories_wanted != None: 
        return category_links
    #Else return the list of business hyperlinks
    else:
        return business_links
    
    

### Obtaining the Business URLs and Reviews before storing
Once I have obtained the URLs for the categories, I need to obtain the business links in each category using BeautifulSoup before then obtaining each review for each of these and storing the sentiment (positive/negative) and the review text. 

To do this, I built another function which could take the link (original Yalp homepage link) and the category which needs to be obtained. I again used 'Inspect Element' to see how the webpage was structured. To obtain the different business links, I used the function shown above `get_url_categories` and then I could iterate through this list of companies and using BeautifulSoup to analyse the content of an individual business' review webpage

Using 'Inspect Element', I could see that a review review had a class of `class = "review"` and had a `<div>` tag which is associated with a division or section in the webpage. 

To obtain the sentiment of a review, I can use the stars associated with it. The stars are an image which can be found using the `<img>` tag in the HTML with BeautifulSoup. The `alt` attribute of the `<img>` tag can be used to find how many stars the review image actually corresponds to. I can then take a review as negative if the `alt` attribute is less than 4 stars and a review is positive if the `alt` attribute is 4 or 5 stars. 

To obtain the text of a review, the `<p>` tag is used as well as the class `class = "review-text"`. The text of the review can then be stored and appended to a dataframe as well as the corresponding sentiment.

**Note**: Through my analysis of the reviews and the business in each category, I noticed that some of the businesses are located in the province of Quebec (QC) in Canada where French is commonly spoken in everyday life. As a result, some of these reviews are in French compared to the vast majority of other reviews in English. 
As the French words will still be associated with the 'sentiment' of the review (positive/negative), I will carry out an analysis to see if stemming has any effect on the English data using English stop words and thus, it would have no effect leaving the French words unstemmed. 

In [124]:
def get_reviews_store(link, category):  
    #The sentiment values. -1 for a negative review and +1 for a positive review
    df = pd.DataFrame(columns=['Review Sentiment', 'Review Text'])
    #Companies
    #This will return a list of companies
    companies = get_url_categories(link+category, None)
    
    #Iterate through the different companies URLs and obtain the reviews
    for company in companies:
        #Initialise BeautifulSoup for the response for each company's webpage.
        response = requests.get(link+company)
        soup = BeautifulSoup(response.content, 'html.parser')
        
        
        #Iterate through each review using the <div> tag as well as the class review
        for review in soup.find_all('div', class_ = 'review'):
            #Use the <alt> attribute of the <img> tag to get the number of stars for the review
            stars = review.find('img')['alt']
            #The star is produced like '1-star' so I need to split on the '-' and take the number of stars
            num_stars = int(stars.split('-')[0])
            
            #Obtain the review text using the <p> tag and the class review-text
            review_text = review.find('p', class_ = 'review-text')
            #Strip white space at end
            review_text = review_text.get_text(strip=True)
            
            #Negative review if the number of stars is less than 4 
            if num_stars < 4:
                class_label = -1
            #Positive review if the number of stars is 4 or 5
            else:
                class_label = 1
            #Append to the data frame
            df_add = pd.DataFrame({'Review Sentiment': class_label, 'Review Text':review_text}, index=[0])
            df = df.append(df_add, ignore_index=True)
    df = df.astype({'Review Sentiment': 'int32'})
    #reviews_of_category = (class_labels, review_text_sections)
    return df
        

## Obtaining the data frames 
Shown below is how I firstly obtained the URLs for the different categories through which the `get_reviews_store` function can be used to gather the reviews and sentiments into Dataframes which were set as `automotive_reviews` `gym_reviews` and `hotel_reviews`.



In [125]:
#Obtaining the category links
categories = get_url_categories(link, ['automotive', 'gym', 'hotel'])
#Obtaining the reviews and storing as a dataframe called automotive_reviews
automotive_reviews = get_reviews_store(link, categories[0])

I can then print out the categories URLs which were gathered from the `get_url_categories` to validate that they were correct as shown below

In [126]:
print(categories)

['/automotive_list.html', '/gym_list.html', '/hotels_list.html']


I can also validate that each dataframe for the reviews is in the proper format with the sentiments in one column and the review text in another column. 

In [127]:
automotive_reviews

Unnamed: 0,Review Sentiment,Review Text
0,-1,The man that was working tonight (8-12-17) was...
1,-1,Chris is a very rude person. Gave me an attitu...
2,1,One of my favorite gas station to stop at. The...
3,-1,Oh thank Heaven for Seven Eleven! I don't know...
4,1,Five stars because of the guy who works weekda...
...,...,...
1995,-1,"Typical used cars, Typical used car salesman w..."
1996,-1,What a joke! So I see an SUV on craigslist lis...
1997,1,I had called on a car thinking it was a privat...
1998,1,I purchased a car from here. Was very happy wi...


In [128]:
gym_reviews = get_reviews_store(link, categories[1])

In [129]:
hotel_reviews = get_reviews_store(link, categories[2])

All of the reviews and sentiments for the 3 chosen categories have been scraped from the website and stored in 3 dataframes with columns of 'Review Sentiment' for the review type( positive = 1, negative = -1) and 'Review Text' where each review's text is stored. 

The next step is to create an appropriate numeric representation of the data as it now takes the form of numeric sentiments and character/word reviews. The key to doing this is to tokenise the text where each token is a word to get a numeric representation of the data in each review. 

## Task 2
As described before, the content of each review is textual and not numeric. In order to analyse the review text data, we need to get a numeric representation of the data which can be done through tokenising. This is the process where the raw review text data is split up into tokens where each token is corresponding to an individual word/term in the review text.

There are multiple ways of doing this such as splitting the raw review text on white space/spaces between the words or even punctuation. In some cases, such as social media data (Twitter), certain special characters can be used to split the associated data such as '#'

To do this, I can make use of scikit learn's (https://scikit-learn.org/stable/)tokenising functions and more specifically can use stemming and lematisation through their vectorizers. To obtain optimal and minimal computation of the number of documents to be studied, it is often intended to reduce the number of terms used to represent a document or in our case, a review's text. These include:

- Minimum term length: Exclude terms of length < 2
- Case conversion: All terms are set to lowercase
- Stemming: Remove the endings of terms/words to remove the tense/plurals. 
- Lemmatisation: Process to reduce a term to its canonical form. 
- Stop-word filtering: Remove highly frequent terms which appear in a given list of words which can be gathered from the NLTK library. These words are highly frequent words with little information like 'and' & 'the'.
- Low frequency filtering: Remove terms that appear in very few reviews

However it was found that the stop-words mentioned above given by the NLTK library are not stemmed or lematised so sometimes the vectorizer can see a stop-word as plural. For example, I encountered this issue with the word 'always'. When the Lemma Tokenizer built below is applied to this word, it stems and lemmatises it to 'alway'. The word 'always' is contained in the English stop-words list and thus, there is a mismatch. 

As a result, I decided to stem the English step words using `PorterStemmer()` provided by NLTK. As a result, the stemmed stop-words could be applied to the stemmed review text. 

I can then create a vectorizer using the lemma tokenizer defined below as well as the stemmed English stop-words. I decided to use a TF-IDF vectorizer which weights the score of a certain term in a document. This will allow me to make a document-term matrix out of the data. The TF-IDF vectorizer is useful compared to the Count Vectorizer as it takes into account how many documents/reviews the term appears in. For example, if the term 'car' appears 100 times in one review but the term 'bad' appears 10 times in 10 reviews, then 'bad; would rank higher in terms of importance than the other word 'car' which didn't appear in as many separate reviews. This is how it weights the score of a certain term:

I also used the `make_pipeline()` from Sklearn to allow me to easily apply a vectorizer and model to a certain categories of reviews. This allows me to apply a list of transforms to transform the data into a document-term matrix before applying a final estimator or classification model to the data. This is a useful way in Sklearn of shortening the amount of code nedeed for a repeated process. 



In [130]:
#Declaring the lemmatizer for the stop words
lemmatizer = WordNetLemmatizer()
#Obtain the English stop words from NLTK
stopWords = set(stopwords.words('english'))
#Stem each of the words in the link
stopWords = [lemmatizer.lemmatize(word) for word in stopWords]

#Implementing a lemma tokenizer which lemmatises the terms (converts to canonical form)
def lemma_tokenizer(text):
    # use the standard scikit-learn tokenizer first
    standard_tokenizer = CountVectorizer().build_tokenizer()
    #Create tokens from the text
    tokens = standard_tokenizer(text)
    # I am then using NLTK to perform lemmatisation on each token
    lemmatizer = WordNetLemmatizer()
    #The lemmatised tokens list is provided here
    lemma_tokens = []
    #For each token created by the CountVectorizer, I can lemmatize this token using the WordNetLemmatizer()
    for token in tokens:
        lemma_tokens.append( lemmatizer.lemmatize(token) )
    return lemma_tokens

#Declaring the vectorizer with the lemma tokenizer and the custom stop words. 
vectorizer = TfidfVectorizer(tokenizer=lemma_tokenizer, stop_words=stopWords ,min_df = 10)

### Which is the best classifier?
I can perform a very short analysis on the Automotive data to see what classifier is most accurate in terms of its cross validation score. 

To do this, I can create 3 different pipelines using `make_pipeline` for Naive Bayes, Random Forest and then Linear Regression.

In [141]:
#The targets are the review sentiments or class labels for positive and negative
automotive_targets  = automotive_reviews['Review Sentiment']
#The data is the review text
automotive_data = automotive_reviews['Review Text']

gym_data = gym_reviews['Review Text'] 
gym_targets = gym_reviews['Review Sentiment']

hotel_data = hotel_reviews['Review Text'] 
hotel_targets = hotel_reviews['Review Sentiment']

#Naive Bayes pipeline
pipe_NB = make_pipeline(vectorizer, MultinomialNB())

#Random Forest pipeline
pipe_RF = make_pipeline(vectorizer, RandomForestClassifier(n_estimators=100))

#Linear Regression pipeline
#Defining the default solver as "lbfgs" which is the current default but may be different depending on versionm
pipe_logistic = make_pipeline(vectorizer, linear_model.LogisticRegression(solver="lbfgs"))

###  Naive Bayes
Firstly I can test the Naive Bayes classifier on the 3 categories worth of review using the cross validation measure. k-fold cross validation involves splitting the training data into k segments and setting one of these segments to be test data while the other k-1 segments are set as training data. A model is created from this training data and used to predict the test data and a score is recorded. This process is then repeated k-times until an average cross validation score can be obtained. It is a useful method for combatting overfitting and demonstrates a model's general performance at predicting unseen data. 

In [151]:
automotive_scores = cross_val_score(pipe_NB, automotive_data, automotive_targets, cv=10, scoring="accuracy")
print(automotive_scores)
naive_automotive_mean = automotive_scores.mean() 
naive_automotive_mean

[0.915 0.91  0.92  0.89  0.905 0.885 0.875 0.875 0.895 0.92 ]


0.899

In [152]:
gym_scores = cross_val_score(pipe_NB, gym_data, gym_targets, cv=10, scoring="accuracy")
print(gym_scores)
naive_gym_mean =gym_scores.mean() 
naive_gym_mean

[0.885 0.89  0.895 0.88  0.87  0.93  0.905 0.86  0.835 0.86 ]


0.881

In [153]:
hotel_scores = cross_val_score(pipe_NB, hotel_data, hotel_targets, cv=10, scoring="accuracy")
print(hotel_scores)
naive_hotel_mean =hotel_scores.mean() 
naive_hotel_mean

[0.91  0.82  0.875 0.965 0.885 0.825 0.87  0.895 0.865 0.865]


0.8775000000000001

### Random Forest
I can then test on the Random Forest classifier to see how it performs. Although decision trees like Random Forest are quite easy to implement and debug as well as providing classification results for a variety of different categorical or numerical classes they can suffer from overfitting. This then leads to different parts of the tree needing to be pruned which can be computationally intensive. 

In comparison, Naive Bayes does not overfit the data as much. 

In [154]:
automotive_scores = cross_val_score(pipe_RF, automotive_data, automotive_targets, cv=10, scoring="accuracy")
print(automotive_scores)

RF_automotive_mean= automotive_scores.mean() 
RF_automotive_mean

[0.895 0.88  0.91  0.885 0.87  0.835 0.84  0.825 0.85  0.895]


0.8684999999999998

In [155]:
gym_scores = cross_val_score(pipe_RF, gym_data, gym_targets, cv=10, scoring="accuracy")
print(gym_scores)
RF_gym_mean = gym_scores.mean() 
RF_gym_mean

[0.88  0.88  0.89  0.87  0.84  0.895 0.9   0.87  0.85  0.85 ]


0.8724999999999999

In [156]:
hotel_scores = cross_val_score(pipe_RF, hotel_data, hotel_targets, cv=10, scoring="accuracy")
print(hotel_scores)
RF_hotel_mean = hotel_scores.mean() 
RF_hotel_mean

[0.9   0.865 0.9   0.935 0.885 0.875 0.825 0.895 0.86  0.84 ]


0.8780000000000001

### Logistic Regression
Logistic regression uses the logistic function to model a binary dependent variable which in our case is the review being positive or negative.  

In [157]:
automotive_scores = cross_val_score(pipe_logistic, automotive_data, automotive_targets, cv=10, scoring="accuracy")
print(automotive_scores)

LR_automotive_mean = automotive_scores.mean() 
LR_automotive_mean

[0.93  0.88  0.91  0.895 0.895 0.87  0.87  0.885 0.865 0.915]


0.8915

In [158]:
gym_scores = cross_val_score(pipe_logistic, gym_data, gym_targets, cv=10, scoring="accuracy")
print(gym_scores)
LR_gym_mean =gym_scores.mean() 
LR_gym_mean

[0.875 0.895 0.905 0.87  0.85  0.92  0.92  0.88  0.83  0.84 ]


0.8785000000000001

In [159]:
hotel_scores = cross_val_score(pipe_logistic, hotel_data, hotel_targets, cv=10, scoring="accuracy")
print(hotel_scores)
LR_hotel_mean =hotel_scores.mean() 
LR_hotel_mean

[0.92  0.85  0.88  0.93  0.89  0.87  0.905 0.9   0.895 0.86 ]


0.89

### Chosen Classifier
Random Forest has a long run time for the 10-fold cross validation and obtains the worst cross validation performance for each category. This could be due to the Random Forest algorithm's tendency to overfit the data. 

The Naive Bayes and Logistic Regression classifiers are the highest performing classifiers for all 3 categories. Naive Bayes obtains slightly better accuracy in the categories of Automotive (89.9& vs. 89.1%) and Gym (88.1% vs. 87.85%) and slightly lower accuracy for Hotels (87.5% vs. 89% accuracy).

To make a decision on the best posssible classifier to take forward into Part 3, I will find the average cross validation accuracy for each classifer over the 3 categories. 

In [168]:
mean([LR_hotel_mean, LR_gym_mean, LR_automotive_mean])

0.8866666666666667

In [169]:
mean([naive_automotive_mean, naive_gym_mean, naive_hotel_mean])

0.8858333333333334

As a result of the above, it would appear that the Logistic Regression classifier is the better performer by a very small margin of 0.1%. As a result, a clear choice is not presented. 

As Logistic Regression can sometimes can sometimes suffer from overfitting of the data and may not be able to accurately predict data not in its own review category, I will choose to use **Naive-Bayes** for the rest of this assignment. 

## Part 3 Evaluating Classifier Models on Other Categories
In this section, I will need to analyse the performance of the Naive Bayes classifier when it is trained on one category, say Category A, and then tested on category B and C.
This is important to test as it verifies the generalisation ability of the model at accurately predicting test data that is both unseen and somewhat unrelated to the trained model's category. 

I can firstly use the pipeline that has been created to use the automotive, gym and hotel Naive Bayes models. From here, I have built a function called `obtain_predictions()` which will use the model to predict the test data. It will then output a confusion matrix and a classification report on the resulting predictions. 

In [170]:
automotive_model = pipe_NB.fit(automotive_data, automotive_targets)
gym_model = pipe_NB.fit(gym_data, gym_targets)
hotel_model = pipe_NB.fit(hotel_data, hotel_targets)

### Part (i)
In Part (i), I will firstly use the model that was trained on the Automotive data called `automotive_model`

In [176]:
from sklearn.metrics import *
def obtain_predictions(model, test_data, test_targets):
    predictions = model.predict(test_data)
    cm = confusion_matrix(test_targets, predictions,labels=[-1,1])
    print(classification_report(test_targets, predictions,labels = [-1,1], target_names=["Negative Review","Positive Review"]))
    print(cm)

In [177]:
#Testing automotive model on gyms
obtain_predictions(automotive_model, gym_data, gym_targets)

                 precision    recall  f1-score   support

Negative Review       0.93      0.66      0.77       701
Positive Review       0.84      0.97      0.90      1299

       accuracy                           0.86      2000
      macro avg       0.89      0.82      0.84      2000
   weighted avg       0.87      0.86      0.86      2000

[[ 465  236]
 [  36 1263]]


In the case of the confusion matrix above, the following terms are translate as follows:

- True Negtive: A review classified as Negative which is actually Negative in reality.
- False Negative: A review classified as Negative which is actually Positive in reality.
- False Positive: A review classified as Positive which is actually Negative in reality. 
- True Positive: A review classified as Positive which is actually Positive in reality


Note in the above confusion matrix. The first item of the first row is the number of True Negatives, the first value of the 2nd row is the number of False Negatives, the 2nd value of the 2nd row is the number of True Positives and the 2nd value of the 1st row is the number of False Positives. 


I can also obtain the same for the Category C results:

In [178]:
#Testing on Hotels
obtain_predictions(automotive_model, hotel_data, hotel_targets)

                 precision    recall  f1-score   support

Negative Review       0.94      0.85      0.90       824
Positive Review       0.90      0.97      0.93      1176

       accuracy                           0.92      2000
      macro avg       0.92      0.91      0.91      2000
   weighted avg       0.92      0.92      0.92      2000

[[ 701  123]
 [  41 1135]]


I can comment on the results separately for each of the tested Categories

- Gym: The Naive Bayes obtains a classification accuracy of 86%. This is reasonably good classification accuracy but it might also be misinterpreted. The column heading above called 'support' denotes the number of examples of each class label (positive or negative) as we can see there is quite a large imbalance in the test data. Classification accuracy can be an inaccurate measured of classifier performance in this case. F1_score is a high performing measure which takes the harmonic mean of precision and recall and is a better indicator of classifier performance in imbalanced test data sets. The F1-Score is 0.90 for Positive reviews which shows that the classifier is quite good at predicting positive reviews but is poor in relation to Negative reviews with a F1-score of 0.77 which suggests that the model predicts too many real negative reviews as positive. This can be seen in the confusion matrix where there are over 236 False Positives in comparison to only 465 True Negatives. 

- Hotel: Again the Naive Bayes classifier performs very well for predicting the hotel data. It achieves a classification accuracy of 92% but again the test set is imbalanced with 1176 positive reviews and 824 negative reviews. The F1-score for the hotel data results are much better than the Gym data. A F1-Score of 0.93 for positive reviews and 0.90 for negative reviews suggests that the model performs extremely well for hotel data. This could be to do with similar language to describe a stay at a hotel and an automotive garage as both of them are providing a service. 

### Part (ii)
In Part (ii), I can use the model trained on the gym data to make predictions for both the automotive data and hotel data. Using the `obtain_predictions()` function, I can again gather the classification results. 

In [179]:
obtain_predictions(gym_model, automotive_data, automotive_targets)

                 precision    recall  f1-score   support

Negative Review       0.77      0.93      0.84       788
Positive Review       0.95      0.82      0.88      1212

       accuracy                           0.86      2000
      macro avg       0.86      0.87      0.86      2000
   weighted avg       0.88      0.86      0.86      2000

[[731  57]
 [221 991]]


In [180]:
obtain_predictions(gym_model, hotel_data, hotel_targets)

                 precision    recall  f1-score   support

Negative Review       0.94      0.85      0.90       824
Positive Review       0.90      0.97      0.93      1176

       accuracy                           0.92      2000
      macro avg       0.92      0.91      0.91      2000
   weighted avg       0.92      0.92      0.92      2000

[[ 701  123]
 [  41 1135]]


Again I can comment on the above results:

- Automotive Data: The Naive Bayes obtains a classification accuracy of 86% which is the same which the Automotive model predicted for Gyms. Again, this is reasonably good classification accuracy. There is again an imbalance in the test data for the Automotive category with 1212 positive reviews and 788 negative reviews. Using the F1-score, I can comment on the performance of the model. The F1-Score obtained for positive reviews is again high at 0.88 which is quite close to the classification accuracy. It achieves a F1-score of 0.84 for negative reviews which again is quite respectable but we suffer from a lot of False Negative reviews which is shown as 221 in the confusion matrix

- Hotel Data: The Naive Bayes classifier again performs very well with the Hotel data but this time trained on the Gym data. It obtains a classification accuracy of 92%. The test classes are imbalanced as we have already found out and the F1-scores are very high also which indicates that the model is very good at generalising data for the Hotel category. It obtains a F1-score for positive reviews of 0.93 and 0.90 for negative reviews which is very good. There is a low amount of False Positives and False Negatives at 123 and 41 respectively. This suggests that the Naive Bayes model trained on the Gym data performs very well with Hotel data. 

### Part (iii)
In Part (iii), I can use the model trained on the hotel data to make predictions for both the automotive data and gym data. Using the `obtain_predictions()` function, I can again gather the classification results. 

In [181]:
obtain_predictions(hotel_model, automotive_data, automotive_targets)

                 precision    recall  f1-score   support

Negative Review       0.77      0.93      0.84       788
Positive Review       0.95      0.82      0.88      1212

       accuracy                           0.86      2000
      macro avg       0.86      0.87      0.86      2000
   weighted avg       0.88      0.86      0.86      2000

[[731  57]
 [221 991]]


In [182]:
obtain_predictions(hotel_model, gym_data, gym_targets)

                 precision    recall  f1-score   support

Negative Review       0.93      0.66      0.77       701
Positive Review       0.84      0.97      0.90      1299

       accuracy                           0.86      2000
      macro avg       0.89      0.82      0.84      2000
   weighted avg       0.87      0.86      0.86      2000

[[ 465  236]
 [  36 1263]]


Again I can comment on the above results:

- Automotive Data: Again we find that the classification results for the hotel model on automotive data corresponds to about 86% classification accuracy. The F1-scores for both positive and negative reviews are 0.88 and 0.84 are quite good but with a large amount of False Negatives of 221, this cases both of these values to remain below 0.9. 

- Gym Data: The classifier training on hotel data performs very well with Positive reviews obtaining an F1-score of 0.90. However, it obtains quite a poor F1-score for negative reviews and this is due to the number of False positives given in the confusion matrix (236) which are False Negatives in the case of the Negative class. These cause the recall to decrease dramatically to 0.66 which is quite poor. As a result, I can conclude that the Naive Bayes classifier trained on hotel data is quite good at classifying positive reviews but poor at classifying negative reviews. 