# COMP47670 Assignment 2:
Name: Oisin Marron

Student Number: 16401562

In [817]:
import urllib.request
import bs4
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
from sklearn import linear_model
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/oisinmarron/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Task 1: Scrape review data

<font color='red'>Create hyperlinks:</font>

The following function builds the URL of each category's listing page from the category name passed in, requests the page, and parses the returned HTML.
If a request does not succeed, an error message is printed. The parsed page for each of the three categories is returned in a list.
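
As a minimal sketch of the link construction described above, the URL can be built separately from the request. The helper names `category_url` and `fetch_html` are hypothetical and not part of this notebook. The sketch relies on `urllib.request.urlopen` raising `HTTPError` for error statuses, so a failed request is caught as an exception rather than detected through a status code.

```python
import urllib.error
import urllib.request

BASE_URL = "http://mlg.ucd.ie/modules/yalp/"

def category_url(category):
    """Build the listing-page URL for one category name."""
    return BASE_URL + category + "_list.html"

def fetch_html(url):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read().decode()
    except urllib.error.HTTPError as err:
        print("Error retrieving data on " + url + " (status " + str(err.code) + ")")
    except urllib.error.URLError as err:
        print("Error retrieving data on " + url + " (" + str(err.reason) + ")")
    return None

# Only the URLs are built here; no request is sent until fetch_html() is called
urls = [category_url(c) for c in ["automotive", "fashion", "hotels"]]
```

Keeping the URL construction in its own function makes it possible to check the links without touching the network.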

In [797]:
# Function to build the listing-page URL for each category and parse its HTML
def collect_data(categories):
    links = []
    for l in categories:
        link = "http://mlg.ucd.ie/modules/yalp/" + l + "_list.html"
        # Requests the category listing page
        response = urllib.request.urlopen(link)

        # Prints an error message if the page was not retrieved successfully
        if (response.code != 200):
            print("Error retrieving data on " + link)

        html = response.read().decode()

        # Parses the html text
        parser = bs4.BeautifulSoup(html, "html.parser")

        links.append(parser)
    return links



<font color='red'>Create Datasets:</font>

The following two functions collect and organise the required data from each review, i.e. the review text and its class label.
The argument of the separate_data() function is the parsed HTML of the chosen category's listing page, which contains a link to each reviewed business. These links are gathered in the 'businesses' list and then followed to collect the text and rating of every review of each business.

The HTML of each business page is then retrieved. By finding the relevant HTML tags, the function extracts the review text and star ratings. Each rating is converted to a class label: 4 or 5 stars become "Positive" and anything lower becomes "Negative".

The create_dataframe() function combines the ratings and review text into a pandas DataFrame, which separate_data() returns.
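
The star-to-label rule used by separate_data() can be sketched on its own. The function name `label_from_stars` is hypothetical, and the sketch assumes the star value has already been extracted as an integer.

```python
def label_from_stars(stars):
    """Map a 1-5 star rating to the binary class label used in this notebook."""
    return "Positive" if stars >= 4 else "Negative"

# Ratings of 4 and 5 are positive; 1, 2 and 3 are negative
labels = [label_from_stars(s) for s in [1, 2, 3, 4, 5]]
print(labels)
```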

In [798]:
# Function to separate review text and rating values from other html text
def separate_data(link):

    businesses = [] # Used to store each business' link add on

    # Finds each business' link add on
    for reviews in link.find_all("h5"):
        for revLink in reviews.find_all("a"):
            rev = str(revLink) # Change text to string type
            rev = rev.split('"') 
            businesses.append(rev[1])
        
    businessReviews = []
    businessRatings = []
    allReviews = []

    # Pulls out review text and ratings from each review, and stores in a list
    for business in businesses:
    
        # Opens business review page and stores its html data
        businessLink = "http://mlg.ucd.ie/modules/yalp/" + business
        businessRev = urllib.request.urlopen(businessLink)
        reviews = businessRev.read().decode()
            
        reviewParser = bs4.BeautifulSoup(reviews, "html.parser")
        
        # Prints an error message if the business page was not retrieved successfully
        if (businessRev.code != 200):
            print("Error retrieving data on " + businessLink)
            
        # Finds all review ratings and records them in their corresponding class label form
        for individRating in reviewParser.find_all("p", attrs={'class':'rating'}):
            rating = str(individRating)
            rating = rating.split('"')
            
            if (rating[3][0] == "4") or (rating[3][0] == "5"):
                classLabel = "Positive"
            else:
                classLabel = "Negative"  
            businessRatings.append(classLabel)
        
        # Finds and stores all review text and removes unwanted symbols and text
        for individReview in reviewParser.find_all("p", attrs={'class':'review-text'}):
            reviewText = str(individReview)
            reviewText = reviewText.split('"')
            reviewText = reviewText[2]
            reviewText = reviewText.replace('</p>', '')
            reviewText = reviewText.replace('>', '')
          
            businessReviews.append(reviewText)
        
    allReviews = create_dataframe(businessReviews, businessRatings)
    return allReviews

def create_dataframe(reviews, ratings):
    reviewList = pd.DataFrame(list(zip(ratings, reviews)), columns =["Ratings", "Reviews"])
    return reviewList

In [799]:
categories = ["automotive", "fashion", "hotels"] # My three chosen categories

# Calling of functions to acquire datasets
links = collect_data(categories) 

category1 = separate_data(links[0])
category2 = separate_data(links[1])
category3 = separate_data(links[2])

In [800]:
print (category1[0:5])
print (category2[0:5])
print (category3[0:5])

    Ratings                                            Reviews
0  Negative  The man that was working tonight (8-12-17) was...
1  Negative  Chris is a very rude person. Gave me an attitu...
2  Positive  One of my favorite gas station to stop at. The...
3  Negative  Oh thank Heaven for Seven Eleven! I don't know...
4  Positive  Five stars because of the guy who works weekda...
    Ratings                                            Reviews
0  Positive  Looking for the best tactical supplies? Look n...
1  Negative  Stood in line like an idiot for 5 minutes to p...
2  Positive  Another great store with quality Equipment. Th...
3  Positive  The Problem with this store is not that they h...
4  Positive  Great place! We went in at almost closing time...
    Ratings                                            Reviews
0  Positive  Melissa took us on a tour of Asia in the space...
1  Positive  With a group of seven of us visiting Montreal ...
2  Positive  Melissa is a gem! My fiancé found her tour

## Task 2: Classification Models Creation

### Part a: Apply Preprocessing Steps:

<font color='red'>Remove Symbols and Numbers:</font>

First, all numbers are removed from the review text; punctuation and other symbols are stripped later, during tokenisation. After scanning portions of the data, I found that many of the numbers are dates or times of the events being described, good or bad.
They are therefore unlikely to help the classifier, so they are removed.
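
The effect of this step on a single string can be sketched with the standard `re` module. The sample sentence and the function name `remove_numbers` are made up for illustration.

```python
import re

def remove_numbers(text):
    """Delete every run of digits, leaving all other characters untouched."""
    return re.sub(r"\d+", "", text)

sample = "Arrived at 10am on 8-12-17"  # hypothetical review fragment
cleaned = remove_numbers(sample)
print(cleaned)  # Arrived at am on --
```

Only the digits disappear. The leftover hyphens are punctuation, which is dropped when the text is tokenised.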

In [801]:
# Removes every run of digits from the review text
category1["Reviews"] = category1["Reviews"].str.replace(r'\d+', '', regex=True)
category2["Reviews"] = category2["Reviews"].str.replace(r'\d+', '', regex=True)
category3["Reviews"] = category3["Reviews"].str.replace(r'\d+', '', regex=True)

<font color='red'>Case Conversion, Filtering and Stemming:</font>

Using CountVectorizer, the preprocess() function tokenises its argument, converts all letters to lowercase, removes punctuation, filters out single-character tokens, discards terms that appear in fewer than 10 reviews, and stems the remaining words. Stop-word removal is also requested, but scikit-learn only applies stop_words and token_pattern when analyzer is 'word'. Because a custom analyser is supplied here, common words such as "about" and "you" remain in the vocabulary.

The preprocess() function passes stem_review() to CountVectorizer as a custom analyser, which stems each token produced by the default analyser.

The review text is converted into a document-term count matrix using the .fit_transform() method.

The data is then returned as a pandas DataFrame.
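
The bag-of-words steps listed above (lowercasing, tokenising, stop-word filtering and a minimum document frequency) can be sketched with the standard library alone. Stemming is left out, and the stop-word set, sample reviews and function names are hypothetical. This is an illustration of the idea, not how CountVectorizer is implemented internally.

```python
import re
from collections import Counter

TOY_STOP_WORDS = {"the", "a", "was", "and", "is"}  # hypothetical, not NLTK's list
TOKEN_RE = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def tokenize(text):
    """Lowercase, tokenise, and drop stop words."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in TOY_STOP_WORDS]

def bag_of_words(docs, min_df=2):
    """Return (vocabulary, count rows), keeping terms found in >= min_df docs."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each term occurs at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(t for t, df in doc_freq.items() if df >= min_df)
    rows = [[Counter(tokens)[t] for t in vocab] for tokens in tokenized]
    return vocab, rows

docs = [
    "The service was great and the staff was friendly",
    "Great staff, great price",
    "Rude staff and a long wait",
]
vocab, rows = bag_of_words(docs, min_df=2)
print(vocab)  # ['great', 'staff']
print(rows)   # [[1, 1], [2, 1], [0, 1]]
```

Terms that occur in only one review, such as "price" or "rude", fall below the document-frequency threshold and are dropped from the columns.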

In [802]:
ignore_words = set(stopwords.words("english")) # Assign downloaded stopwords to variable

stemmer = PorterStemmer() # Assigns variable to a stemming method
analyzer = CountVectorizer().build_analyzer() # Default analyser: lowercases and tokenises the text

In [803]:
# Function to stem the words of its argument
def stem_review(reviews): 
    return (stemmer.stem(word) for word in analyzer(reviews))

# Function that preprocesses the review data, returning it in data frame form
def preprocess(reviews):
    # Note: scikit-learn ignores stop_words and token_pattern when analyzer is a callable
    vectorizer = CountVectorizer(stop_words=list(ignore_words), min_df = 10, analyzer=stem_review, token_pattern=r'\b[^\d\W]+\b')
    X = vectorizer.fit_transform(reviews) # Learns the vocabulary and builds the term-count matrix
    # Column names are the stemmed vocabulary terms
    df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
    return df

In [804]:
# Preprocessed data of each category
reviews1 = preprocess(category1["Reviews"])
reviews2 = preprocess(category2["Reviews"])
reviews3 = preprocess(category3["Reviews"])

In [805]:
reviews1[0:5]

Unnamed: 0,aaa,abl,about,abov,absolut,ac,accept,accid,accommod,accord,...,yelp,yesterday,yet,you,young,your,yourself,yr,zero,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<font color='red'>Prepare Ratings Array:</font>

The test_Ratings() function below converts the class labels into a NumPy column array so they can be passed directly to the classification model.

In [806]:
# Function that appropriately shapes and structures the ratings datasets
def test_Ratings(ratings):
    ratingArray = np.asarray(ratings) 
    ratingArray = ratingArray.reshape(len(ratings), 1) # Shapes depending on size of dataset
    return ratingArray

In [807]:
ratings1 = test_Ratings(category1["Ratings"])
ratings2 = test_Ratings(category2["Ratings"])
ratings3 = test_Ratings(category3["Ratings"])

### Part b: Build Classification Model:

The classification() function uses the prepared review and rating data from part (a) to build a classification model for the corresponding category.

The function splits its arguments into training and test sets. 20% of the review and rating data is held out and used to test how well a model trained on the other 80% performs. The split is stratified, so both sets keep the same proportion of positive and negative reviews.

The Naive Bayes classifier is used. I also tested the SVM and Logistic Regression classifiers and found that, on average, Naive Bayes produced more accurate models.

The training data is fitted to the model. The test reviews are then used to predict a rating for each review, and the predictions are compared with the actual test ratings to measure the model's accuracy.
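
The idea behind a stratified 80/20 hold-out split can be sketched in plain Python. This is not scikit-learn's algorithm, and the function name `stratified_holdout` and the 70/30 label mix are hypothetical. It only shows that each class is split in the same ratio.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_size=0.2, seed=1):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["Positive"] * 70 + ["Negative"] * 30
train_idx, test_idx = stratified_holdout(labels)
test_positive = sum(labels[i] == "Positive" for i in test_idx)
print(len(train_idx), len(test_idx), test_positive)  # 80 20 14
```

With 70% positive labels overall, 14 of the 20 test samples are positive, so the test set mirrors the class balance of the full dataset.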

In [808]:
# Function that splits review data into train and test datasets, creates a model and tests it
def classification(reviews, ratings):
    X_train, X_test, Y_train, Y_test = train_test_split(reviews, ratings, test_size=0.2, random_state=1, stratify=ratings)
    model = MultinomialNB() # Naive Bayes classifier 
    model.fit(X_train,Y_train.ravel()) # Creates model
    predicted = model.predict(X_test) # Predicts rating test results using model
    return(accuracy_score(Y_test, predicted)) # Compares predicted and test results for accuracy

In [809]:
classification_accuracy = []
classification_accuracy.append(classification(reviews1, ratings1)) # Category 1 classification model
classification_accuracy.append(classification(reviews2, ratings2)) # Category 2 classification model
classification_accuracy.append(classification(reviews3, ratings3)) # Category 3 classification model

### Part c: Test Predictions of Classification Model:

<font color='red'>Hold Out Strategy:</font>

First, I tested the model using the hold-out method described above (80% of the data is used for training the model and 20% for testing).
With each category's model tested against its own held-out data, the following accuracy table was created.
Every model scores at least 86.5% accuracy, which is a positive result.

In [810]:
# Presents the hold out strategy results from above
holdout_results = pd.DataFrame(classification_accuracy, columns=["Accuracy"], index=["Category 1", "Category 2", "Category 3"])
print(holdout_results)

            Accuracy
Category 1    0.9075
Category 2    0.8750
Category 3    0.8650


<font color='red'>Cross-Validation:</font>

Second, I used the cross-validation strategy. This splits each dataset into five blocks of 20% and repeats the hold-out procedure five times, each time testing on one block and training on the remaining 80%.

The five accuracy scores for each category, as well as their mean, are displayed below.

Each mean is lower than the corresponding hold-out accuracy above, which suggests the hold-out split was a 'lucky' one and that a single split does not reliably indicate accuracy across the entire dataset.
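
How five folds partition a dataset can be sketched as follows. The generator `kfold_indices` is a hypothetical, unstratified and unshuffled simplification, not the splitter that cross_val_score uses internally. The last lines recompute the mean of the five fold accuracies reported for Category 2 in this section.

```python
def kfold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) for k contiguous, near-equal folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(12, k=5))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)  # [3, 3, 2, 2, 2]

# Mean of the five Category 2 fold accuracies
category2_scores = [0.8175, 0.8375, 0.8650, 0.8450, 0.8350]
category2_mean = sum(category2_scores) / len(category2_scores)
print(round(category2_mean, 4))  # 0.84
```

Every sample lands in exactly one test fold, so each review is used for testing once and for training four times.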

In [811]:
# Function that uses Naive Bayes to create a classification model using cross-validation
def cross_validation(reviews, ratings):
    model = MultinomialNB()
    scores = cross_val_score(model, reviews, ratings.ravel(), cv=5, scoring='accuracy') # Cross Validates review and rating data
    scores = np.append(scores, scores.mean()) # Adds mean of cross validation to the array
    return(scores)

In [812]:
cross_valid_list = []
cross_valid_list.append(cross_validation(reviews1, ratings1)) # Category 1 Cross-validation
cross_valid_list.append(cross_validation(reviews2, ratings2)) # Category 2 Cross-validation
cross_valid_list.append(cross_validation(reviews3, ratings3)) # Category 3 Cross-validation
cross_valid = pd.DataFrame(cross_valid_list, columns=["Fold 1", "Fold 2", "Fold 3", "Fold 4", "Fold 5", "Mean"], index=["Category 1", "Category 2", "Category 3"])
print(cross_valid)

              Fold 1    Fold 2  Fold 3   Fold 4    Fold 5      Mean
Category 1  0.902743  0.880299  0.8950  0.87218  0.909774  0.891999
Category 2  0.817500  0.837500  0.8650  0.84500  0.835000  0.840000
Category 3  0.825436  0.885000  0.8625  0.88000  0.837093  0.858006


## Task 3: Evaluation Against Other Categories

<font color='red'>Prediction Model Testing:</font>

The following test_model() function builds a prediction model from the first review and rating arguments. This time, 100% of that data is used to train the model.

The second review argument is transformed using the vocabulary learned from the training reviews, and the model then predicts a rating for each of these reviews.

The accuracy score is calculated by comparing this prediction array with the actual ratings array passed as the second rating argument.

The function returns this accuracy value.
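
The key point in this step is that the vocabulary is learned from the training category only and then reused for the test category. A minimal standard-library sketch of that idea follows, with hypothetical function names and sample reviews. It does not reproduce CountVectorizer's stemming or min_df filtering.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(train_docs):
    """Learn the sorted vocabulary from the training documents only."""
    return sorted({t for d in train_docs for t in tokenize(d)})

def transform(docs, vocab):
    """Count vocabulary terms per document; unseen words are ignored."""
    rows = []
    for d in docs:
        counts = Counter(tokenize(d))
        rows.append([counts[t] for t in vocab])
    return rows

train_docs = ["Great oil change", "Rude mechanic"]  # hypothetical training reviews
test_docs = ["Great hotel, great pool"]             # hypothetical test review
vocab = fit_vocabulary(train_docs)
test_rows = transform(test_docs, vocab)
print(vocab)      # ['change', 'great', 'mechanic', 'oil', 'rude']
print(test_rows)  # [[0, 2, 0, 0, 0]]
```

Words such as "hotel" and "pool" never appear in the training vocabulary, so they contribute nothing to the prediction. This is one reason accuracy can drop when a model is applied to a different category.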

In [813]:
# Function to create a classification model from one category and test it on another category
def test_model(train_rev, train_rating, test_rev, test_rating):
    vectorizer = CountVectorizer(stop_words=list(ignore_words), min_df = 10, analyzer=stem_review, token_pattern=r'\b[^\d\W]+\b')
    reviews_train = vectorizer.fit_transform(train_rev) # Learns the vocabulary from the training reviews
    reviews_test = vectorizer.transform(test_rev) # Maps test reviews onto the training vocabulary before prediction

    model = MultinomialNB()
    model.fit(reviews_train,train_rating.ravel()) # Creates model

    predicted = model.predict(reviews_test) # Predicts test ratings
    return(accuracy_score(test_rating, predicted)) # Compares predicted and test ratings

### Part a: Category A Model:

A prediction model is created from the category A data and used to predict the class labels of categories B and C from their review data.

The accuracy is high and close to the hold-out accuracy reported above.
Given that the hold-out strategy tests on data drawn from the same category as the training data, this suggests the model is effective and covers a wide enough range of words and phrases to generalise to the other two categories.

In [814]:
# Creates model using category A and predicts category B and C using this model
model_catA = []
model_catA.append(test_model(category1["Reviews"], ratings1, category2["Reviews"], ratings2))
model_catA.append(test_model(category1["Reviews"], ratings1, category3["Reviews"], ratings3))
catA_exp = pd.DataFrame(model_catA, columns=["Category A Model"], index=["Category B Test", "Category C Test"])
print(catA_exp)

                 Category A Model
Category B Test            0.8515
Category C Test            0.8710


### Part b: Category B Model:

A prediction model is created from the category B data and used to predict the class labels of categories A and C from their review data.

The accuracy is again quite high; however, the category A prediction (80.95%) is noticeably less accurate than that of category C (86.10%). I would assume this is because category A uses a much smaller set of distinct words.

This would mean category A's model includes a smaller range of words, but each with greater frequency. This may be why the category A model is so accurate: a tighter group of words gives a clearer indication of how each word is used, and hence of the more likely class.
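
One way to probe this assumption would be to compare the vocabulary sizes of two categories and how much they overlap. The sketch below uses hypothetical function names and toy reviews rather than the scraped data, and measures overlap with the Jaccard index.

```python
import re

def vocabulary(docs):
    """Set of lowercase tokens (two or more word characters) across docs."""
    return {t for d in docs for t in re.findall(r"\b\w\w+\b", d.lower())}

def jaccard(a, b):
    """Share of terms common to both vocabularies."""
    return len(a & b) / len(a | b) if a | b else 0.0

vocab_a = vocabulary(["Quick oil change", "Rude staff"])       # toy category A
vocab_b = vocabulary(["Friendly staff", "Quick checkout"])     # toy category B
overlap = jaccard(vocab_a, vocab_b)
print(len(vocab_a), len(vocab_b), round(overlap, 3))  # 5 4 0.286
```

Running the same comparison on category1["Reviews"], category2["Reviews"] and category3["Reviews"] would show whether category A really has the smaller vocabulary.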

In [815]:
# Creates model using category B and predicts category A and C using this model
model_catB = []
model_catB.append(test_model(category2["Reviews"], ratings2, category1["Reviews"], ratings1))
model_catB.append(test_model(category2["Reviews"], ratings2, category3["Reviews"], ratings3))
catB_exp = pd.DataFrame(model_catB, columns=["Category B Model"], index=["Category A Test", "Category C Test"])
print(catB_exp)

                 Category B Model
Category A Test            0.8095
Category C Test            0.8610


### Part c: Category C Model:

A prediction model is created from the category C data and used to predict the class labels of categories A and B from their review data.

Again, the accuracy is quite high; however, it is lower still for category A (77.95%).
As above, I would assume this is because category C has a much wider, less compact range of words than category A.
With each category having a similar number of reviews, category C has a much larger list of words, making it harder to associate each word, and hence each review, with a class.

In [816]:
# Creates model using category C and predicts category A and B using this model
model_catC = []
model_catC.append(test_model(category3["Reviews"], ratings3, category1["Reviews"], ratings1))
model_catC.append(test_model(category3["Reviews"], ratings3, category2["Reviews"], ratings2))
catC_exp = pd.DataFrame(model_catC, columns=["Category C Model"], index=["Category A Test", "Category B Test"])
print(catC_exp)

                 Category C Model
Category A Test            0.7795
Category B Test            0.8510
