In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn.metrics
import sklearn.feature_extraction.text
import sklearn.cluster
import matplotlib.cm as cm
from sklearn.datasets import make_blobs
import sklearn.linear_model

#Quality-of-life libraries used to show the progress of slow algorithms
from ipywidgets import IntProgress
from IPython.display import display
import time

# Introduction: The Problem
As stated, the problem we are facing is the construction of two dictionaries (or rather, groups of words) that hold either negative or positive words. Throughout this notebook, I will describe my thought process and the three attempts I made.

In [2]:
#Filtered Reviews is a file from Exercise 1 that contains the Philadelphian reviews with non-empty values.
reviews = pd.read_csv("filtered_reviews.csv", encoding="latin").dropna()
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 893906 entries, 0 to 893905
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   review_id    893906 non-null  object 
 1   user_id      893906 non-null  object 
 2   business_id  893906 non-null  object 
 3   stars        893906 non-null  float64
 4   useful       893906 non-null  int64  
 5   funny        893906 non-null  int64  
 6   cool         893906 non-null  int64  
 7   text         893906 non-null  object 
 8   date         893906 non-null  object 
dtypes: float64(1), int64(3), object(5)
memory usage: 68.2+ MB


# Attempt 1: Highest Value
To start things off, I considered this: since we already know which reviews are "Positive" and which are "Negative", I can tag them without needing a classifier. But what counts as "Positive"? From a mathematical standpoint, I considered all reviews of 4 or 5 stars to be Positive and all reviews of 1 or 2 stars to be Negative. Reviews with 3 stars are considered "Neutral", carry no signal for either side, and are therefore removed.

After fitting a TF-IDF vectorizer, we can compute one vector for the POSITIVE reviews and one for the NEGATIVE reviews. The words with the highest TF-IDF values in each vector should then be the positive and negative words.

In [9]:
#Trim neutral reviews
data = reviews[~reviews["stars"].isin([3])].copy()
#Mark reviews with P for Positive (4-5 stars) and N for Negative (1-2 stars)
data["category"] = data["stars"].apply(lambda x : "P" if x > 3 else "N")

In [10]:
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(max_features=3_000, stop_words="english",
                                                             min_df=0.0, max_df=0.5)
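#One document per business: all of its reviews concatenated (note: .sum() joins them without a separator)
documents = data.groupby(["business_id"])["text"].sum().to_frame()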
vectorizer.fit(documents["text"])

In [11]:
#Attempt 1
evaluationData = data.sample(n=10000)

inv_dic = {j:i for i,j in vectorizer.vocabulary_.items()}

for clusterType in ["P", "N"]:
    clusterReviews = evaluationData.loc[evaluationData["category"] == clusterType]
    text = clusterReviews["text"].sum()
    
    #One TF-IDF vector for the whole category; keep the 10 highest-scoring features, best first
    vector = vectorizer.transform([text])
    bestScores = np.argsort(vector.toarray()[0])
    bestScores = np.flip(bestScores[-10:])
    
    finalPrint = "For Cluster " + clusterType + " top words: "
    for i in bestScores:
        finalPrint += inv_dic[i] + ", "
    print(finalPrint[:-2])

For Cluster P top words: delicious, menu, restaurant, chicken, cheese, bar, pizza, beer, sauce, dinner
For Cluster N top words: restaurant, bar, table, cheese, pizza, chicken, car, menu, fries, manager


# Results
The first attempt proved fruitless. First off, the vectorizer took a good while to fit. Furthermore, the two categories shared most of their top words, so the lists tell us little about sentiment. This could mean several things:
- Perhaps each review contains several noise words, and taking the reviews one by one increases the chance of surfacing noise words such as "menu", "restaurant" or food-related words, which belong to both groups.
- Perhaps the classification of "Positive/Negative" is incorrect.
- Or perhaps picking the highest TF-IDF values was the wrong method.
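
To put a number on how little the two lists differ, the small check below compares the two top-10 lists printed by Attempt 1. The lists are copied by hand from that output; the variable names (`top_p`, `top_n`, `shared`, `only_p`, `only_n`) are only illustrative and are not part of the notebook.

```python
# Top-10 words per category, copied from the Attempt 1 output.
top_p = ["delicious", "menu", "restaurant", "chicken", "cheese",
         "bar", "pizza", "beer", "sauce", "dinner"]
top_n = ["restaurant", "bar", "table", "cheese", "pizza",
         "chicken", "car", "menu", "fries", "manager"]

# Words that show up in both lists carry no sentiment signal.
shared = [w for w in top_p if w in top_n]
# Words unique to one list are the only candidates for a dictionary.
only_p = [w for w in top_p if w not in top_n]
only_n = [w for w in top_n if w not in top_p]

print(f"Shared ({len(shared)}/10):", ", ".join(shared))
print("Only in P:", ", ".join(only_p))
print("Only in N:", ", ".join(only_n))
```

Only the words unique to one list are usable, and even those are mostly topical rather than sentiment words.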

# Attempt 2: Grouped Reviews
We now try a slightly different approach:
To address the first bullet point, this time we group the reviews by business and vectorize each business as a single document. This way, each business has its own vector, and we hope to reduce the "restaurant stop words".

For the second bullet point, we bring back the 3-star reviews, but we count them as both positive and negative. The idea is that they may contain information necessary for our classification.

Lastly, we still use the TF-IDF values, but this time each word is classified as Positive or Negative by comparing the sums of its scores in the two categories.

This time, we make two groups, Positive and Negative businesses, which are separated by their average review score (rounded). Then, for each feature of our vectorizer, we sum that feature's TF-IDF value over each group. Whichever group has the higher sum (and therefore the heavier use of the word) gets the word!

In [16]:
data = reviews.copy()
evalData = data.groupby(["business_id"])["text"].sum().to_frame()
evalData["stars"] = data.groupby(["business_id"])["stars"].mean()
evalData["stars"] = evalData["stars"].apply(lambda x : round(x))

vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(max_features=3_000, stop_words="english")
matrix = vectorizer.fit_transform(evalData["text"]).toarray()
evalData["vector"] = [matrix[i].tolist() for i in range(len(matrix))]

evalData

business_id,text,stars,vector
--OS_I7dnABrXvRCCuWOGQ,I have to say Len's auto body is the WORST. I ...,4,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.03867995..."
-0M0b-XhtFagyLmsBtOe8w,REVIEW OF PARIS FLEA MARKET: Accidentally popp...,4,"[0.03978147804701085, 0.0, 0.0, 0.0, 0.0122694..."
-0PN_KFPtbnLQZEeb23XiA,While there didn't seem to be anything wrong w...,3,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
-0TffRSXXIlBYVbb5AwfTg,We went for my husbands birthday in a fairly l...,4,"[0.0027283746590056526, 0.011635157099922692, ..."
-0eUa8TsXFFy0FCxHYmrjg,Sandwiches had good taste and were a good size...,4,"[0.0, 0.009885542095808109, 0.0, 0.01519419388..."
...,...,...,...
zy_g2wKTNIB7EQdG73_Xaw,Einstein medical center in North Philadelphia ...,2,"[0.008139410954805908, 0.021544465692502488, 0..."
zyghhZzPgb1bRAIYB-oi1w,"I haven't written a review in years, but I fee...",5,"[0.0, 0.023168742201064813, 0.0, 0.0, 0.0, 0.0..."
zz-fcqurtm77bZ_rVvo2Lw,Best lunch truck on campus hands down. I've m...,4,"[0.0, 0.014573003981617209, 0.0, 0.02239887766..."
zz3E7kmJI2r2JseE6LAnrw,This is your typical Asian super market. I've ...,4,"[0.003403086645432603, 0.03377901847945415, 0...."


In [18]:
#Attempt 2
wordMap = vectorizer.vocabulary_
wordMap = { j:i for i,j in wordMap.items()}

positiveBusinesses = evalData.loc[evalData["stars"] >= 4]
negativeBusinesses = evalData.loc[evalData["stars"] <= 3]
positiveDic = {}
negativeDic = {}
#For every feature, sum its TF-IDF value over each group and give the word to the larger sum
for index in range(3000):
    pScore = positiveBusinesses["vector"].apply(lambda x : x[index]).sum()
    nScore = negativeBusinesses["vector"].apply(lambda x : x[index]).sum()
    if pScore > nScore:
        positiveDic[index] = pScore
    else:
        negativeDic[index] = nScore
        
positiveDic = [wordMap[i[0]] for i in sorted(positiveDic.items(), key=lambda x:x[1])]
negativeDic = [wordMap[i[0]] for i in sorted(negativeDic.items(), key=lambda x:x[1])]
positiveDic.reverse()
negativeDic.reverse()

print("Positive Words: ", ", ".join(positiveDic[:50]), "\n")
print("Negative Words: ", ", ".join(negativeDic[:50]))

Positive Words:  food, great, place, good, time, like, just, service, really, ve, delicious, best, friendly, nice, got, chicken, love, definitely, staff, hair, did, recommend, work, store, coffee, ordered, don, amazing, new, little, menu, bar, came, try, philly, restaurant, experience, shop, day, cheese, people, went, make, fresh, didn, come, know, salon, going, beer 

Negative Words:  order, pizza, said, told, customer, location, called, minutes, fries, asked, delivery, bad, manager, phone, wings, money, rude, worst, line, employees, waiting, pay, horrible, customers, chinese, wrong, terrible, doctor, drive, apartment, management, card, paid, waited, received, insurance, lady, starbucks, charge, woman, hotel, stay, cashier, employee, orders, slow, attitude, sales, charged, poor


# Results
Once again, we see a similar pattern: several noise words such as "food" or "order" remain, but more targeted words now appear as well. This is progress. I realise that avoiding the noise words is perhaps not something that can be solved exactly, but rather something that should be pushed aside.
- Noise words sit on the classification line that separates Positive and Negative reviews. They would belong in both dictionaries, but they are not important either way.
- Including the 3-star reviews seemed to give better results, although in previous tests I noticed them leaking some negative words into the positive dictionary.

# Attempt 3: Introducing Importance
The factor we seem to be missing is importance. By introducing a classification algorithm (logistic regression) and training it on our data, we can extract the importance of each feature. This way, we can list the highest-valued words of each dictionary first.

Additionally, after some real-world pondering, I realised that the mathematical approach to classifying our data was not entirely correct. If someone were to see a restaurant with an average of 3 stars, their reaction would most likely be to avoid it. Therefore we end up with this:

We once more vectorize our data and train a logistic regression model on it. Then, using the model's coefficients, we extract the most important words (both negative and positive). Finally, we separate them with the same algorithm as in the previous attempt, but with one difference: negative reviews have 1, 2 and 3 stars, while positive ones have 4 and 5.

In [45]:
#Attempt 3
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(max_features=3_000, stop_words="english")
matrix = vectorizer.fit_transform(evalData["text"]).toarray()
evalData["vector"] = [matrix[i].tolist() for i in range(len(matrix))]

In [46]:
logReg = sklearn.linear_model.LogisticRegression(max_iter=200)
logReg.fit(list(evalData["vector"]), evalData["stars"])

In [47]:
evalData["predict"] = logReg.predict(list(evalData["vector"]))

In [49]:
wordMap = vectorizer.vocabulary_
wordMap = { j:i for i,j in wordMap.items()}

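#coef_ holds one row of weights per star class, so the same word can appear more than once below
values = logReg.coef_
flatValues = np.sort(values.flatten())

highestWords = []
highestIndexes = []
highestValues = np.flip(flatValues) #largest coefficients first
for v in highestValues:
    #Locate the (class, feature) position of this coefficient
    i,j = np.where(np.isclose(values, v))
    i, j = i[0], j[0]
    highestWords.append(wordMap[j])
    highestIndexes.append(j)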

print("Highest Valued Words: \n", ", ".join(highestWords[:50]), "\n")

Highest Valued Words: 
 great, amazing, best, delicious, great, worst, definitely, friendly, recommend, highly, super, bad, order, gem, love, good, perfect, terrible, love, horrible, delicious, reasonable, pretty, horrible, professional, philly, manager, favorite, rude, wonderful, company, excellent, friendly, drive, good, worst, overpriced, best, knowledgeable, told, definitely, rude, awesome, fantastic, honest, money, easy, don, money, incredible 



In [50]:
#Interestingly, the 3-star reviews contain the word "bad". Therefore, 3-star reviews are the Bad threshold o_o
positiveBusinesses = evalData.loc[evalData["stars"] >= 4]
negativeBusinesses = evalData.loc[evalData["stars"] <= 3]
tags = []
for index in highestIndexes:
    pScore = positiveBusinesses["vector"].apply(lambda x : x[index]).sum()
    nScore = negativeBusinesses["vector"].apply(lambda x : x[index]).sum()
    
    if (pScore > nScore):
        tags.append("P")
    else:
        tags.append("N")
        
positiveWords = []
negativeWords = []
for i,j in zip(tags, highestWords):
    if i == "P":
        positiveWords.append(j)
    else:
        negativeWords.append(j)

print("Positive Words: ", ", ".join(positiveWords[:50]), "\n")
print("Negative Words: ", ", ".join(negativeWords[:50]))

Positive Words:  great, amazing, best, delicious, great, definitely, friendly, recommend, highly, super, gem, love, good, perfect, love, delicious, reasonable, pretty, professional, philly, favorite, wonderful, company, excellent, friendly, good, best, knowledgeable, definitely, awesome, fantastic, honest, easy, don, incredible, decent, awesome, beautiful, ok, helpful, nice, food, thank, fun, happy, helpful, atmosphere, fresh, unique, reviews 

Negative Words:  worst, bad, order, terrible, horrible, horrible, manager, rude, drive, worst, overpriced, told, rude, money, money, dirty, avoid, awful, management, told, paid, lady, told, said, manager, order, poor, customers, shut, said, employees, disgusting, nasty, employees, said, chinese, slow, pay, called, line, disgusting, asked, gross, worse, bad, attitude, mediocre, charged, asked, unprofessional


# Results
This time, we get some satisfying results! Each dictionary has distinct and powerful words for its category. If I were to note one thing about this entire method, however, it is that it is limited: the number of words I can place in the two categories is capped by the number of features my vectorizer has created. Of course, there is no way to make an infinite list of positive or negative words, but in theory I could classify EVERY word in the reviews I have.
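
The cap mentioned above comes from the vectorizer's max_features setting. A small sketch on made-up sentences (the names `toy_docs`, `capped` and `uncapped` are hypothetical) shows how the setting bounds the vocabulary, and with it the number of words that can ever reach a dictionary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, only to compare vocabulary sizes.
toy_docs = [
    "great food and friendly staff",
    "rude manager and cold pizza",
    "delicious pizza and great coffee",
]

# max_features keeps only the most frequent terms across the corpus.
capped = TfidfVectorizer(max_features=5).fit(toy_docs)
# Without the cap, every token the vectorizer recognises becomes a feature.
uncapped = TfidfVectorizer().fit(toy_docs)

print("Capped vocabulary:  ", len(capped.vocabulary_), "words")
print("Uncapped vocabulary:", len(uncapped.vocabulary_), "words")
```

Raising or removing the cap lets more words through, at the cost of a wider matrix and a slower fit on the full review set.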

# Theorizing
A more generalized approach, or rather a continuation, would be to involve some form of vectorized classification to introduce more words. The logistic regression model, or even better a neural network, could learn from the large sample of positive/negative words and learn to classify new ones! With this, we are no longer limited by the vectorizer, but only by our data and time.

The only problem, and the main reason I did not build this, is the sheer cost of feeding it every word from every review: too laborious to be worth a preview. The results should be similar, if not better.