# Assignment 2 - Text Classification

# Data Science in Python (Mixed Delivery) - COMP47670

## Author - Brian Graham

## Student Number - 14434712

## Date - 27/04/2020

The aim of this assignment is to scrape the review pages of a number of companies within 3 chosen industries, and to use the scraped data to build a model that predicts whether a review is positive or negative.

# User-Defined Functions

In [1]:
# ------------------------------------------------------------------------------------------------------------------------------
# 
# Function to perform a cross validation and display the results 
# ------------------------------------------------------------------------------------------------------------------------------
# 
# Arguments: 
#     model      - the classifier model being used 
#     classifier - the name of the classifier being used
#     data       - the dataset
#     target     - the outcome of each data point
#     num_folds  - the number of folds used in the cross validation
# ------------------------------------------------------------------------------------------------------------------------------

def cross_validation(model, classifier, data, target, num_folds):
    from sklearn.model_selection import cross_val_score

    # Perform the cross validation
    scores = cross_val_score(model, data, target, cv=num_folds)
    print("Classifier: "+ classifier +"\n")
    
    # Display the individual fold classification accuracy
    print("Individual Fold Scores:")
    for score in scores:
        print("\t%.4f" % float(score))
        
    # Display the average classification accuracy
    print("\nAverage Fold Score:")
    print("\t%.4f" % float(scores.mean()))
    
# ------------------------------------------------------------------------------------------------------------------------------
# 
# Function to analyse and display the outcome of a classifier versus the actual outcome
# ------------------------------------------------------------------------------------------------------------------------------
# 
# Arguments:
#     predicted              - the output of the classifier (predicted outcomes)
#     actual                 - the actual outcomes from the datasets
#     training_category_name - the name of the training dataset
#     test_category_name     - the name of the test dataset
# ------------------------------------------------------------------------------------------------------------------------------

def classifier_performance(predicted, actual, training_category_name, test_category_name):
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import precision_score
    from sklearn.metrics import recall_score
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import f1_score
    
    # Evaluate the performance of the classifier with various machine learning metrics
    accuracy = accuracy_score(actual, predicted)
    precision = precision_score(actual, predicted, average="weighted")
    recall = recall_score(actual, predicted, average="weighted")
    # Rows are the actual classes and columns the predicted classes, ordered [negative, positive].
    # The result is stored under a new name so that the imported function is not shadowed.
    matrix = confusion_matrix(actual, predicted, labels=["negative", "positive"])
    f1_score_pos = f1_score(actual, predicted, pos_label="positive")
    f1_score_neg = f1_score(actual, predicted, pos_label="negative")
    
    # Calculate the average (macro) F1-score
    f1_average = (float(f1_score_neg) + float(f1_score_pos))/2
    
    # Calculate the amount of actual positives and negatives to show the balance of the dataset
    num_pos = actual.count("positive")
    num_neg = actual.count("negative")
    percentage_pos = format(num_pos/(num_pos + num_neg), ".2%")
    percentage_neg = format(num_neg/(num_pos + num_neg), ".2%")
    
    # Print the information to the screen
    print("Trained on:              "+ training_category_name)
    print("Tested on:               "+ test_category_name +"\n")
    
    print("Number of Positives:     %4d (%6s of total)" % (num_pos, percentage_pos))
    print("Number of Negatives:     %4d (%6s of total)\n" % (num_neg, percentage_neg))
    
    print("Classification Accuracy: %s" % format(float(accuracy), ".2%"))
    print("Precision:               %s" % format(float(precision), ".2%"))
    print("Recall:                  %s" % format(float(recall), ".2%"))
    print("F1-Score:                %s\n" % format(float(f1_average), ".2%"))
    
    print("Confusion Matrix:        |%4d\t%4d|" % (int(matrix[0][0]), int(matrix[0][1])))
    print("                         |%4d\t%4d|" % (int(matrix[1][0]), int(matrix[1][1])))
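
The F1-score reported by ***classifier_performance*** is the unweighted mean of the two per-class F1-scores, which scikit-learn calls the macro average. The following is a minimal sketch that checks the two agree, using a small set of made-up labels (not taken from the scraped data):

```python
from sklearn.metrics import f1_score

# Hypothetical labels for illustration only; they are not from the Yalp data
actual = ["positive", "positive", "negative", "positive", "negative", "negative"]
predicted = ["positive", "negative", "negative", "positive", "positive", "negative"]

# Per-class F1-scores, computed the same way as in classifier_performance
f1_pos = f1_score(actual, predicted, pos_label="positive")
f1_neg = f1_score(actual, predicted, pos_label="negative")
manual_average = (f1_pos + f1_neg) / 2

# The same quantity in a single call
macro_f1 = f1_score(actual, predicted, average="macro")

print("Manual average: %.4f" % manual_average)
print("Macro F1:       %.4f" % macro_f1)
```

Both lines should print the same value, so either form could be used in the function.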

# Part 1. Web Scraping

## 1.1. Industry Selection

The first step in the assignment is to scrape a number of web pages for reviews on companies in 3 chosen industries.
My chosen industries are the Automotive industry, the Cafe industry and the Gym industry.

In [2]:
industries = ["automotive","cafes","gym"]

## 1.2. Industry Home Pages

The first step in the web scraping is to scrape each industry home page for the URLs of the individual company review pages.
The URLs of the company review pages are conveniently formatted in a reasonably uniform way and are all the same length (38 characters), so each one can be sliced out of its "h5" tag at a known offset and appended to a list. Each list is then stored in a dictionary keyed by the industry name.

In [3]:
import bs4
import requests

# The unchanging part of the URL when getting the home pages of each of the industries from Yalp.
base_url = "http://mlg.ucd.ie/modules/yalp"
company_urls = {}

for industry in industries:
    company_links = []
    
    # Get the web page code in HTML
    home_page_html = requests.get(base_url +"/"+ industry +"_list.html")
    parser = bs4.BeautifulSoup(home_page_html.text,"html.parser")
    
    # Filter out all the "h5" tags, as this is where all the company URLs are.
    # The character position at which the URL starts grows by one for every extra
    # digit in the running company count (18 for 1-9, 19 for 10-99, and so on).
    for count, match in enumerate(parser.find_all("h5"), start=1):
        start_of_url = 17 + len(str(count))
        company_links.append(str(match)[start_of_url:start_of_url + 38])
            
    # Enter each list of company URLs for each industry into a dictionary
    company_urls.update({industry: company_links})
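
Slicing the string form of each tag works here because the URLs all have the same length, but it depends on the exact markup. A hedged alternative is to read the ***href*** attribute of the link inside each "h5" tag. The sample HTML below is a made-up stand-in, on the assumption that each "h5" heading wraps an `<a>` link; the real Yalp markup may differ.

```python
import bs4

# Hypothetical markup for illustration; the structure of the real pages is assumed
sample_html = """
<h5>1. <a href="reviews_company_a.html">Company A</a></h5>
<h5>2. <a href="reviews_company_b.html">Company B</a></h5>
<h5>No link here</h5>
"""

parser = bs4.BeautifulSoup(sample_html, "html.parser")

company_links = []
for heading in parser.find_all("h5"):
    link = heading.find("a")
    # Skip headings without a link rather than failing on them
    if link is not None and link.has_attr("href"):
        company_links.append(link["href"])

print(company_links)
```

This approach does not need the character offset to be tracked as the company count grows.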

## 1.3. Scrape Company Review Page

The next step is to extract the review text and ratings from each company review page. This is done in a similar way to the scraping of the company URLs in 1.2.
The outcome is a dictionary ***industry_reviews*** that holds all the reviews and ratings for each industry, accessible by ***industry_reviews[industry]***, where ***industry*** is one of ["***automotive***", "***cafes***", "***gym***"].

In [4]:
industry_reviews = {}

for industry in industries:
    review_text = []
    ratings = []
    
    for company_url in company_urls[industry]:
        
        # Parse the web page for each company
        review_page_html = requests.get(base_url +"/"+ company_url).text    
        parser = bs4.BeautifulSoup(review_page_html,"html.parser")
        
        # Filter out the ratings and reviews by getting the lines with the "p" tag
        for match in parser.find_all("p"):
            
            # Find the rating number, with 1-3 being a "negative" outcome and 4 or 5 being a "positive" outcome
            if str(match)[10:16] == "rating":
                if int(str(match)[28:29]) > 3:
                    ratings.append("positive")
                else:
                    ratings.append("negative")
 
            # Find the review text
            if str(match)[10:21] == "review-text":
                review_text.append(str(match)[23:-4])

    # Dictionary of the industries with associated ratings and reviews
    industry_reviews.update({industry: {"Ratings": ratings, "Reviews": review_text}})
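
The slice offsets in the cell above are consistent with markup of roughly the form `<p class="rating"><img alt="5-star" ...>` and `<p class="review-text">...</p>`. Under that assumption (the sample HTML below is made up, not copied from the site), a sketch that selects the tags by class rather than by character position could look like this:

```python
import bs4

# Hypothetical markup for illustration; the structure of the real pages is assumed
sample_html = """
<p class="rating"><img alt="5-star" src="stars.png"/></p>
<p class="review-text">Friendly staff and quick service.</p>
<p class="rating"><img alt="2-star" src="stars.png"/></p>
<p class="review-text">Waited an hour and nobody apologised.</p>
"""

parser = bs4.BeautifulSoup(sample_html, "html.parser")

ratings = []
for tag in parser.find_all("p", class_="rating"):
    # Assumed: the star count is the first character of the image's alt text
    stars = int(tag.find("img")["alt"][0])
    # Ratings of 4 or 5 are "positive"; ratings of 1 to 3 are "negative"
    ratings.append("positive" if stars > 3 else "negative")

# get_text() returns the text content only, without any nested markup
review_text = [tag.get_text() for tag in parser.find_all("p", class_="review-text")]

print(list(zip(ratings, review_text)))
```

Unlike the string slice, ***get_text()*** drops any nested tags and decodes HTML entities, so the extracted text may differ slightly from that produced by the original cell.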

# Part 2. 

## 2.1. Pre-Processing

Having successfully scraped the desired web pages for their reviews and ratings, the next step is to process the data so that the review text is represented numerically.

### Stop Words

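Initially, I will investigate the impact of stemming versus lemmatization on the performance of the classifier that will be used. Stemming heuristically strips suffixes to reduce a word to its stem (for example, "running" becomes "run"), whereas lemmatization maps a word to its dictionary form (its lemma). Both aim to group related terms together. The starting point is scikit-learn's built-in English stop-word list, to which each method is applied below.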

In [5]:
from sklearn.feature_extraction import text
stop_words = text.ENGLISH_STOP_WORDS

#### Stemming

In [6]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
stemmed_stop_words = []
for stop_word in stop_words:
    stemmed_stop_words.append(stemmer.stem(stop_word))

#### Lemmatization

In [7]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemma_stop_words = []
for stop_word in stop_words:
    lemma_stop_words.append(lemmatizer.lemmatize(stop_word))

### Declare TF-IDF Vectorizer

To give a more reflective view of the review text against the outcome of the review (positive or negative), Term Frequency-Inverse Document Frequency (TF-IDF) weighting is used. This weights each term by its frequency within a review, but decreases the weight of terms that appear in almost every review, as these do not reflect either way on the outcome.
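
For reference, with scikit-learn's default vectorizer settings (smoothed IDF and L2 normalisation, neither of which is overridden in this notebook), the weight of a term $t$ in a document $d$ is computed as below. Here $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
$$

Each document vector is then scaled to unit Euclidean length. A term that appears in every document has $\mathrm{idf}(t) = 1$, the smallest possible value, so it is down-weighted relative to rarer terms rather than removed.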

The TF-IDF vectorizer also performs case conversion, so that all tokens are lowercase, and by default ignores tokens shorter than 2 characters. I also add the extra argument ***min_df = 10*** to remove terms that appear in fewer than 10 reviews, as these low-frequency terms are really outliers that can be very hard to fit in a model.

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
stemmed_vectorizer = TFIDF(stop_words=stemmed_stop_words, min_df = 10)
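
The ***min_df*** threshold counts the documents a term appears in, not its total number of occurrences. The following minimal sketch illustrates this on three made-up documents, using a threshold of 2 so that the effect is visible on a tiny corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents: "great" occurs three times but in one document only,
# while "coffee" occurs once in each of two documents
documents = [
    "great great great service",
    "coffee was fine",
    "coffee and cake",
]

# Keep only terms that appear in at least 2 documents
vectorizer = TfidfVectorizer(min_df=2)
vectorizer.fit(documents)

print(sorted(vectorizer.vocabulary_))
```

Only "coffee" survives, even though "great" has more occurrences in total.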

## 2.2. Model Training

The next step is to build a dataset of reviews and associated ratings so that the algorithm can attempt to fit different word patterns to a positive or negative outcome.
Each industry's reviews are mapped to a numerical representation using the TF-IDF vectorizer. The vectorizer is refitted for each industry, so each of the three matrices has its own vocabulary.

In [9]:
automotive_data = stemmed_vectorizer.fit_transform(industry_reviews["automotive"]["Reviews"])
cafes_data = stemmed_vectorizer.fit_transform(industry_reviews["cafes"]["Reviews"])
gym_data = stemmed_vectorizer.fit_transform(industry_reviews["gym"]["Reviews"])

## 2.3. Classifier Selection

I will now pick a classifier for my predictive model. The 3 options given in the assignment specifications are:

* Naive-Bayes
* Logistic Regression
* Random Forest

For each classifier, I will evaluate the model using cross-validation, which splits the dataset into *k* segments (folds). *k-1* segments are used to train the model, with the remaining segment used for testing. This is repeated until every segment has been used once as the test set, and the performance of the model is averaged over these *k* tests.
I will run each classifier on each of the industry datasets with *k* = 5 and see which performs best on average.
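
The splitting scheme described above can be sketched in plain Python. This is an illustration only: the helper ***k_fold_indices*** is hypothetical, and it ignores shuffling and stratification (for a classifier and an integer number of folds, scikit-learn's ***cross_val_score*** uses stratified folds, which also keep the class balance similar in every fold).

```python
def k_fold_indices(num_samples, num_folds):
    """Yield (train_indices, test_indices) pairs for a simple, unshuffled k-fold split."""
    indices = list(range(num_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [num_samples // num_folds] * num_folds
    for i in range(num_samples % num_folds):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(num_samples=10, num_folds=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample appears in exactly one test fold, so each review is used for testing once and for training *k-1* times.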

### Naive-Bayes

In [10]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes_model = MultinomialNB()
cross_validation(naive_bayes_model, "Naive-Bayes", automotive_data, industry_reviews["automotive"]["Ratings"], 5)

Classifier: Naive-Bayes

Individual Fold Scores:
	0.9127
	0.9102
	0.8850
	0.8872
	0.9148

Average Fold Score:
	0.9020


In [11]:
cross_validation(naive_bayes_model, "Naive-Bayes", cafes_data, industry_reviews["cafes"]["Ratings"], 5)

Classifier: Naive-Bayes

Individual Fold Scores:
	0.8254
	0.8504
	0.8200
	0.8095
	0.8596

Average Fold Score:
	0.8330


In [12]:
cross_validation(naive_bayes_model, "Naive-Bayes", gym_data, industry_reviews["gym"]["Ratings"], 5)

Classifier: Naive-Bayes

Individual Fold Scores:
	0.8803
	0.9000
	0.8925
	0.8825
	0.8446

Average Fold Score:
	0.8800


### Logistic Regression

In [13]:
from sklearn import linear_model
lr_model = linear_model.LogisticRegression(solver="lbfgs")
cross_validation(lr_model, "Logistic Regression", automotive_data, industry_reviews["automotive"]["Ratings"], 5)

Classifier: Logistic Regression

Individual Fold Scores:
	0.9052
	0.9027
	0.8900
	0.8972
	0.9123

Average Fold Score:
	0.9015


In [14]:
cross_validation(lr_model, "Logistic Regression", cafes_data, industry_reviews["cafes"]["Ratings"], 5)

Classifier: Logistic Regression

Individual Fold Scores:
	0.8180
	0.8728
	0.8475
	0.8221
	0.8571

Average Fold Score:
	0.8435


In [15]:
cross_validation(lr_model, "Logistic Regression", gym_data, industry_reviews["gym"]["Ratings"], 5)

Classifier: Logistic Regression

Individual Fold Scores:
	0.8903
	0.8825
	0.8925
	0.8925
	0.8421

Average Fold Score:
	0.8800


### Random Forest

In [16]:
from sklearn.ensemble import RandomForestClassifier

random_forest_model = RandomForestClassifier(n_estimators=100)
cross_validation(random_forest_model, "Random Forest", automotive_data, industry_reviews["automotive"]["Ratings"], 5)

Classifier: Random Forest

Individual Fold Scores:
	0.8678
	0.8853
	0.8425
	0.8421
	0.8772

Average Fold Score:
	0.8630


In [17]:
cross_validation(random_forest_model, "Random Forest", cafes_data, industry_reviews["cafes"]["Ratings"], 5)

Classifier: Random Forest

Individual Fold Scores:
	0.8204
	0.8828
	0.8350
	0.8346
	0.8672

Average Fold Score:
	0.8480


In [18]:
cross_validation(random_forest_model, "Random Forest", gym_data, industry_reviews["gym"]["Ratings"], 5)

Classifier: Random Forest

Individual Fold Scores:
	0.8853
	0.8750
	0.8750
	0.8925
	0.8471

Average Fold Score:
	0.8750


From the implementation of the 3 classifiers, all of the models perform quite well. Averaged over the three datasets, Naive-Bayes achieves 87.2% accuracy, Logistic Regression 87.5% and Random Forest 86.2%. However, Random Forest does not perform as well when it comes to runtime, taking much longer to run than the other models.

As the performance of Naive-Bayes and Logistic Regression is extremely similar, I will pick Naive-Bayes arbitrarily, disregarding Random Forest mainly due to its runtime, but also its slightly inferior fold score.
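
The overall averages for the three classifiers can be reproduced from the average fold scores printed above. The dictionary below copies those printed values by hand:

```python
# Average fold scores reported above, in the order automotive, cafes, gym
average_fold_scores = {
    "Naive-Bayes": [0.9020, 0.8330, 0.8800],
    "Logistic Regression": [0.9015, 0.8435, 0.8800],
    "Random Forest": [0.8630, 0.8480, 0.8750],
}

# Mean accuracy of each classifier over the three datasets
overall = {name: sum(scores) / len(scores) for name, scores in average_fold_scores.items()}

for name, mean_score in overall.items():
    print("%-20s %.4f" % (name, mean_score))
```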

### Stemming vs. Lemmatization

Having picked my classifier as Naive-Bayes, I will see whether stemming or lemmatization yields a higher classification accuracy. Having evaluated the model previously on stemming, I will now train and test the model using lemmatization.

In [19]:
lemma_vectorizer = TFIDF(stop_words=lemma_stop_words, min_df = 10)
lemma_automotive_data = lemma_vectorizer.fit_transform(industry_reviews["automotive"]["Reviews"])
cross_validation(naive_bayes_model, "Naive-Bayes", lemma_automotive_data, industry_reviews["automotive"]["Ratings"], 5)

Classifier: Naive-Bayes

Individual Fold Scores:
	0.9102
	0.9077
	0.8800
	0.8872
	0.9148

Average Fold Score:
	0.9000


In [20]:
lemma_cafes_data = lemma_vectorizer.fit_transform(industry_reviews["cafes"]["Reviews"])
cross_validation(naive_bayes_model, "Naive-Bayes", lemma_cafes_data, industry_reviews["cafes"]["Ratings"], 5)

Classifier: Naive-Bayes

Individual Fold Scores:
	0.8229
	0.8454
	0.8225
	0.8070
	0.8571

Average Fold Score:
	0.8310


In [21]:
lemma_gym_data = lemma_vectorizer.fit_transform(industry_reviews["gym"]["Reviews"])
cross_validation(naive_bayes_model, "Naive-Bayes", lemma_gym_data, industry_reviews["gym"]["Ratings"], 5)

Classifier: Naive-Bayes

Individual Fold Scores:
	0.8853
	0.9025
	0.8975
	0.8825
	0.8446

Average Fold Score:
	0.8825


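The two methods of generalising text yield quite similar results, with a difference of just 0.05 percentage points on average over the 3 datasets (87.17% for stemming against 87.12% for lemmatization), with stemming performing slightly better. Note that in the cells above the stemmer and lemmatizer are applied to the stop-word list only, and the review text itself is tokenized unchanged, so the comparison really measures the effect of the two stop-word lists.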

Thus, as there is little difference in this application, I will stick with stemming.
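
For completeness, stemming could also be applied to the review tokens themselves by passing a custom tokenizer to the vectorizer. The sketch below is an assumption about how that might be done, not what was run in this notebook: the ***stemming_tokenizer*** helper and the two sample reviews are made up for illustration, and no results are claimed for it.

```python
import re
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
# Same token definition as the vectorizer's default: two or more word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")

def stemming_tokenizer(document):
    """Split a document into tokens and reduce each one to its Porter stem."""
    return [stemmer.stem(token) for token in token_pattern.findall(document)]

# Made-up reviews for illustration only
reviews = ["The staff were running late", "He runs a friendly garage"]

# token_pattern=None signals that the custom tokenizer replaces the default pattern
stemming_vectorizer = TfidfVectorizer(tokenizer=stemming_tokenizer, token_pattern=None)
stemming_vectorizer.fit(reviews)

print(sorted(stemming_vectorizer.vocabulary_))
```

With this tokenizer, "running" and "runs" share the single feature "run". A stop-word list passed alongside it would need to be stemmed with the same function so that the two stay consistent.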

# Part 3. Prediction

In the final section, the aim is to train a classifier on 1 of the 3 categories chosen previously in the assignment (automotive, cafes and gym), and test the trained model on the unseen data in the other 2 categories, and see how the model fares.

In accordance with the labelling specified in the question, I will denote automotive to be Category A, cafes to be Category B and gym to be Category C.

## 3.1. Training on Automotive

The model is first trained by giving it both the training data and the actual outcomes for the training data.

In [22]:
vectorizer = TFIDF(stop_words=stemmed_stop_words, min_df = 10)

automotive_train = vectorizer.fit_transform(industry_reviews["automotive"]["Reviews"])

naive_bayes_automotive = MultinomialNB()
naive_bayes_automotive.fit(automotive_train, industry_reviews["automotive"]["Ratings"])

Now that the classifier has been trained to predict the outcome of a review (positive or negative) from its text, it is tested on unseen data, namely the ***cafes*** and ***gym*** reviews, and its predictions are compared with the actual outcomes below. The vectorizer fitted on the ***automotive*** reviews is reused through ***transform***, so the test reviews are represented using the training vocabulary.
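
The train-on-one-category, test-on-another pattern can be sketched on a handful of made-up reviews (none of the text below comes from the scraped data). The key point is that the vectorizer is fitted on the training reviews only and then reused to transform the unseen reviews, so both matrices share the same columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up reviews standing in for a training category and an unseen category
train_reviews = ["great service and friendly staff", "terrible service and rude staff",
                 "friendly and helpful", "rude and slow"]
train_ratings = ["positive", "negative", "positive", "negative"]
unseen_reviews = ["friendly barista and great coffee", "slow and rude waiter"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training reviews only
train_matrix = vectorizer.fit_transform(train_reviews)
# transform reuses that vocabulary, so unseen words such as "barista" are dropped
unseen_matrix = vectorizer.transform(unseen_reviews)

model = MultinomialNB()
model.fit(train_matrix, train_ratings)
predictions = model.predict(unseen_matrix)

print(train_matrix.shape, unseen_matrix.shape)
print(list(predictions))
```

Words that only occur in the unseen category contribute nothing to the prediction, which is one reason accuracy can drop when a model is moved between industries.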

In [23]:
predicted_automotive_cafes = naive_bayes_automotive.predict(vectorizer.transform(industry_reviews["cafes"]["Reviews"]))
classifier_performance(predicted_automotive_cafes, industry_reviews["cafes"]["Ratings"], "Automotive", "Cafes")

Trained on:              Automotive
Tested on:               Cafes

Number of Positives:     1462 (73.10% of total)
Number of Negatives:      538 (26.90% of total)

Classification Accuracy: 83.05%
Precision:               82.47%
Recall:                  83.05%
F1-Score:                75.64%

Confusion Matrix:        | 279	 259|
                         |  80	1382|


In [24]:
predicted_automotive_gym = naive_bayes_automotive.predict(vectorizer.transform(industry_reviews["gym"]["Reviews"]))
classifier_performance(predicted_automotive_gym, industry_reviews["gym"]["Ratings"], "Automotive", "Gyms")

Trained on:              Automotive
Tested on:               Gyms

Number of Positives:     1299 (64.95% of total)
Number of Negatives:      701 (35.05% of total)

Classification Accuracy: 85.95%
Precision:               86.52%
Recall:                  85.95%
F1-Score:                83.42%

Confusion Matrix:        | 469	 232|
                         |  49	1250|


The performance of the model when tested on the ***cafes*** dataset is reasonably good, with a classification accuracy of over 83%. It is also important to look at the F1-score, as the dataset is unbalanced, with 73% of the reviews being positive, and the F1-score (here the unweighted average of the F1-scores of the two classes) is a more reflective measure of the performance of a classifier on an unbalanced dataset. The F1-score of about 76% is over 7 percentage points lower than the accuracy, which shows that the imbalance in the dataset does affect the reliability of the classification accuracy.
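
The accuracy and F1 figures for the ***cafes*** test can be reproduced by hand from the confusion matrix printed above. scikit-learn lays the matrix out with actual classes as rows and predicted classes as columns, and the labels sorted alphabetically, so "negative" comes first.

```python
# Confusion matrix from the output above: rows are actual, columns are predicted,
# with classes ordered [negative, positive]
matrix = [[279, 259],
          [80, 1382]]

true_neg, false_pos = matrix[0]
false_neg, true_pos = matrix[1]
total = true_neg + false_pos + false_neg + true_pos

accuracy = (true_pos + true_neg) / total

# F1 for a class = 2*TP / (2*TP + FP + FN), treating that class as the one of interest
f1_positive = 2 * true_pos / (2 * true_pos + false_pos + false_neg)
f1_negative = 2 * true_neg / (2 * true_neg + false_neg + false_pos)
f1_average = (f1_positive + f1_negative) / 2

print("Accuracy:    %.2f%%" % (100 * accuracy))
print("F1 positive: %.2f%%" % (100 * f1_positive))
print("F1 negative: %.2f%%" % (100 * f1_negative))
print("F1 average:  %.2f%%" % (100 * f1_average))
```

The F1-score of the minority "negative" class is much lower than that of the "positive" class, and this is what pulls the average F1-score below the accuracy.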

It is a similar story when classifying the ***gym*** dataset, with a classification accuracy of 86% and an F1-score of over 83%. This dataset is less unbalanced, with 65% of the reviews being positive compared to 73% for ***cafes***.

It is also worth noting that the precision and recall in both cases are between 82% and 87%, which is a good reflection on the performance of the classifier. However, the classifier trained here using the ***automotive*** dataset yields a false-positive rate that is significantly higher than the false-negative rate. Reading the confusion matrices with actual classes as rows and predicted classes as columns (negative first), 259 of the 538 negative ***cafes*** reviews (48%) and 232 of the 701 negative ***gym*** reviews (33%) are classed as positive, while only 80 of 1462 (about 5%) and 49 of 1299 (about 4%) positive reviews are classed as negative. In terms of reviews, this does not matter hugely, and the businesses being reviewed would certainly prefer to have bad reviews classed as positive than the other way around.

## 3.2. Training on Cafes

The procedure is repeated, this time training the classifier on the ***cafes*** dataset and testing on ***automotive*** and ***gym***.

In [25]:
cafes_train = vectorizer.fit_transform(industry_reviews["cafes"]["Reviews"])

naive_bayes_cafes = MultinomialNB()
naive_bayes_cafes.fit(cafes_train, industry_reviews["cafes"]["Ratings"])

In [26]:
predicted_cafes_automotive = naive_bayes_cafes.predict(vectorizer.transform(industry_reviews["automotive"]["Reviews"]))
classifier_performance(predicted_cafes_automotive, industry_reviews["automotive"]["Ratings"], "Cafes", "Automotive")

Trained on:              Cafes
Tested on:               Automotive

Number of Positives:     1212 (60.60% of total)
Number of Negatives:      788 (39.40% of total)

Classification Accuracy: 80.50%
Precision:               81.05%
Recall:                  80.50%
F1-Score:                79.91%

Confusion Matrix:        | 634	 154|
                         | 236	 976|


In [27]:
predicted_cafes_gym = naive_bayes_cafes.predict(vectorizer.transform(industry_reviews["gym"]["Reviews"]))
classifier_performance(predicted_cafes_gym, industry_reviews["gym"]["Ratings"], "Cafes", "Gym")

Trained on:              Cafes
Tested on:               Gym

Number of Positives:     1299 (64.95% of total)
Number of Negatives:      701 (35.05% of total)

Classification Accuracy: 83.90%
Precision:               84.79%
Recall:                  83.90%
F1-Score:                80.61%

Confusion Matrix:        | 427	 274|
                         |  48	1251|


The performance of the model when tested on the ***automotive*** dataset is good, with a classification accuracy of just over 80%. The F1-score, precision and recall are all around the 80% mark as well.

The performance of the model when classifying the ***gym*** dataset is even better, with a classification accuracy of about 84% and an F1-score of over 80%. This dataset is slightly more unbalanced than the ***automotive*** dataset, with 65% of the reviews being positive compared to 60%, although neither would really be classed as unbalanced.

The precision and recall of the classifier are also quite good, with figures between 80% and 85% for the tests.

Finally, the number of false negatives (236) is higher than the number of false positives (154) when classifying the ***automotive*** dataset. This would be a bad thing for these automotive companies, as they would receive fewer positive ratings than they should.

## 3.3. Training on Gym

Finally, the procedure is repeated, training the classifier on the ***gym*** dataset and testing on ***automotive*** and ***cafes***.

In [28]:
gym_train = vectorizer.fit_transform(industry_reviews["gym"]["Reviews"])

naive_bayes_model = MultinomialNB()
naive_bayes_model.fit(gym_train, industry_reviews["gym"]["Ratings"])

In [29]:
predicted_gym_automotive = naive_bayes_model.predict(vectorizer.transform(industry_reviews["automotive"]["Reviews"]))
classifier_performance(predicted_gym_automotive, industry_reviews["automotive"]["Ratings"], "Gym", "Automotive")

Trained on:              Gym
Tested on:               Automotive

Number of Positives:     1212 (60.60% of total)
Number of Negatives:      788 (39.40% of total)

Classification Accuracy: 79.90%
Precision:               83.54%
Recall:                  79.90%
F1-Score:                79.80%

Confusion Matrix:        | 729	  59|
                         | 343	 869|


In [30]:
predicted_gym_cafes = naive_bayes_model.predict(vectorizer.transform(industry_reviews["cafes"]["Reviews"]))
classifier_performance(predicted_gym_cafes, industry_reviews["cafes"]["Ratings"], "Gym", "Cafes")

Trained on:              Gym
Tested on:               Cafes

Number of Positives:     1462 (73.10% of total)
Number of Negatives:      538 (26.90% of total)

Classification Accuracy: 86.35%
Precision:               86.03%
Recall:                  86.35%
F1-Score:                81.12%

Confusion Matrix:        | 337	 201|
                         |  72	1390|


Finally, the classifier performs similarly well when trained using the ***gym*** dataset, with figures between approximately 80% and 87% for accuracy, precision, recall and F1-score.