## CSE150A- Mini Project

###### Members
* Luis Diaz
* Stephanie Moore
* Darren Chang

###### Overview
For this assignment, we used a dataset of amazon reviews to look at unigrams and bigrams. The dataset is in JSON format and has information on the number of votes, review ID, user ID, review text, rating, genre ID, and genre. For this assignment we want to find the TF-IDF of the reviews and use it to predict the rating.

###### Methodology
We are going to first read 10,000 reviews from our dataset and clean them from punctuation and capitalization. We'll then get the number of unigrams and bigrams in our corpus and then get unique values. We can then try to predict rating using ridge regression on the 1000 most common unigrams & bigrams. After this, we will adapt both models to tfidf ones and see predict the rating.

In [1]:
import numpy as np
import urllib
import scipy.optimize
import random
from collections import defaultdict # Dictionaries with default values
import nltk
from nltk.util import ngrams
import operator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
import string
from nltk.stem.porter import *
from sklearn import linear_model
import ast
import pandas as pd

In [2]:
def parseData(fname):
    for l in urllib.urlopen(fname):
        yield ast.literal_eval(l)

In [3]:
def parseDataFromFile(fname):
    for l in open(fname):
        yield ast.literal_eval(l)

**Reading the first 10,000 without capitalization or punctuation.**

In [4]:
data_ = list(parseDataFromFile("train_Category.json"))
data = data_[:10000]

In [5]:
data[0]

{'n_votes': 0,
 'review_id': 'r99763621',
 'user_id': 'u17334941',
 'review_text': "Genuinely enthralling. If Collins or Bernard did invent this out of whole cloth, they deserve a medal for imagination. Lets leave the veracity aside for a moment - always a touchy subject when it comes to real life stories of the occult - and talk about the contents. \n The Black Alchemist covers a period of two years in which Collins, a magician, and Bernard, a psychic, undertook a series of psychic quests that put them in opposition with the titular Black Alchemist. As entertainment goes, the combination of harrowing discoveries, ancient lore, and going down the pub for a cigarette and a Guinness, trying to make sense of it all while a hen party screams at each other, is a winner. It is simultaneously down to earth and out of this world. \n It reads fast, both because of the curiousity and because Collins has a very clear writing style. Sometimes its a little clunky or over repetitive and there's a fe

In [6]:
# clean data: remove capitalization and punctuation
punct = string.punctuation

for d in data:
    d['review_text'] = d['review_text'].lower() # lowercase string
    d['review_text'] = [c for c in d['review_text'] if not ((c in punct) or (c == '\n'))] # non-punct characters
    d['review_text'] = ''.join(d['review_text']) # convert back to string

In [7]:
#cleaned review text
data[0]['review_text']

'genuinely enthralling if collins or bernard did invent this out of whole cloth they deserve a medal for imagination lets leave the veracity aside for a moment  always a touchy subject when it comes to real life stories of the occult  and talk about the contents  the black alchemist covers a period of two years in which collins a magician and bernard a psychic undertook a series of psychic quests that put them in opposition with the titular black alchemist as entertainment goes the combination of harrowing discoveries ancient lore and going down the pub for a cigarette and a guinness trying to make sense of it all while a hen party screams at each other is a winner it is simultaneously down to earth and out of this world  it reads fast both because of the curiousity and because collins has a very clear writing style sometimes its a little clunky or over repetitive and theres a few meetings that get underreported but i am very much quibbling here mostly important he captures his own and

**Unigrams and bigrams in the reviews.**

In [8]:
def gen_ngrams(corpus, n):
    global ngramCount
    ngramCount= defaultdict(int)
    
    global totalNgrams
    totalNgrams= 0
    for d in corpus:
        t = d['review_text']
        words = t.split() # tokenizes
        Ngrams = list(ngrams(words, n))
        for i in Ngrams:
            totalNgrams += 1
            ngramCount[i] += 1
    return ngramCount

In [9]:
unigrams = gen_ngrams(data, 1)
totalUnigrams = totalNgrams
unigramCount = ngramCount

In [10]:
# total unigrams found
totalUnigrams 

1511677

In [11]:
# unique unigrams
len(unigramCount)

73286

In [12]:
# top 10 frequent
ucounts = [(unigramCount[b], b) for b in unigramCount]
ucounts.sort()
ucounts.reverse()
ucounts[:10]

[(73431, ('the',)),
 (44301, ('and',)),
 (39577, ('a',)),
 (36821, ('to',)),
 (36581, ('i',)),
 (32552, ('of',)),
 (21889, ('is',)),
 (21468, ('in',)),
 (20110, ('it',)),
 (19353, ('this',))]

In [13]:
bigrams = gen_ngrams(data, 2)
totalBigrams = totalNgrams
bigramCount = ngramCount

In [14]:
# total bigrams found
totalBigrams 

1501677

In [15]:
# unique bigrams
len(bigramCount)

521502

In [16]:
# top 10 frequent
bcounts = [(bigramCount[b], b) for b in bigramCount]
bcounts.sort()
bcounts.reverse()
bcounts[:5]

[(7927, ('of', 'the')),
 (5850, ('this', 'book')),
 (5627, ('in', 'the')),
 (3189, ('and', 'the')),
 (3183, ('is', 'a'))]

**least squares using the 1000 most common unigrams and 1000 most common bigrams. Scored using the MSE.**

In [17]:
# MSE
def MSE(Y, YH):
     return np.square(Y - YH).mean()

In [18]:
ugrams = [u[1] for u in ucounts[:1000]]
unigramId = dict(zip(ugrams, range(len(ugrams))))
unigramSet = set(ugrams)

def ug_feature(datum):
    feat = [0]*len(unigramSet)
    t = datum['review_text']
    words = t.strip().split() # tokenizes
    ubigrams = list(ngrams(words, 1))
    
    for u in unigrams:
        if not (u in unigramSet): continue
        feat[unigramId[u]] += 1
    feat.append(1)
    return feat

In [19]:
bg_X = [ug_feature(d) for d in data]
y = [d['rating'] for d in data] #The prediction target should be the ‘rating’ field in each review

In [20]:
clf = linear_model.Ridge(1.0, fit_intercept=False)
clf.fit(bg_X, y)
theta = clf.coef_
predictions = clf.predict(bg_X)

In [21]:
#MSE of unigrams
MSE(y, predictions)

1.3453519100001436

In [22]:
bgrams = [b[1] for b in bcounts[:1000]]
bigramId = dict(zip(bgrams, range(len(bgrams))))
bigramSet = set(bgrams)

def bg_feature(datum):
    feat = [0]*len(bigramSet)
    t = datum['review_text']
    words = t.strip().split() # tokenizes
    bigrams = list(ngrams(words, 2))
    
    for b in bigrams:
        if not (b in bigramSet): continue
        feat[bigramId[b]] += 1
    feat.append(1)
    return feat

In [23]:
bg_X = [bg_feature(d) for d in data]

In [24]:
clf = linear_model.Ridge(1.0, fit_intercept=False)
clf.fit(bg_X, y)
theta = clf.coef_
predictions = clf.predict(bg_X)

In [25]:
#MSE of bigrams
MSE(y, predictions)

1.0178804824879226

**Experiment above using 1000 most common unigrams and bigrams. Some combination of unigrams and bigrams.Scored using MSE**

In [26]:
counts = [(unigramCount[w], w) for w in unigramCount]
counts.sort()
counts.reverse()
words = [w[1] for w in counts[:1000]]
wordId = dict(zip(words, range(len(words))))
wordSet = set(words)

# unigram feature
def ug_feature(datum):
    feat = [0]*len(wordSet)
    t = datum['review_text']
    t = t.lower() # lowercase string
    t = [c for c in t if not (c in punct)] # non-punct characters
    t = ''.join(t) # convert back to string
    words = t.strip().split() # tokenizes
    for w in words:
        if not (w in wordSet): continue
        feat[wordId[w]] += 1
    feat.append(1)
    return feat

totalBigrams = totalNgrams
bigramCount = ngramCount

In [27]:
# 1000 of the most common out of a list of unigrams and bigrams
comb = counts + bcounts
len(comb)

594788

In [28]:
comb.sort(key = operator.itemgetter(0), reverse=True) # sort unigrams and bigrams by counts

In [29]:
# Are there unigrams and bigrams in comb?
l = {'unigrams':0, 'bigrams':0}
for e in comb:
    if len(e[1]) == 2:
        l['bigrams'] += 1
    else:
        l['unigrams'] += 1
l # yes! and its sorted!

{'unigrams': 73286, 'bigrams': 521502}

In [30]:
ncounts = comb
ngramz = [b[1] for b in ncounts[:1000]]
ngramId = dict(zip(ngramz, range(len(ngramz))))
ngramSet = set(ngramz)

def ng_feature(datum):
    feat = [0]*len(ngramSet)
    t = datum['review_text']
    words = t.strip().split() # tokenizes
    bigrams = list(ngrams(words, 2))
    unigrams = list(ngrams(words, 1))
    
    for b in bigrams:
        if not (b in ngramSet): continue
        feat[ngramId[b]] += 1
    
    for u in unigrams:
        if not (u in ngramSet): continue
        feat[ngramId[u]] += 1
    feat.append(1) 
    
    return feat

In [31]:
X = [ng_feature(d) for d in data]
y = [d['rating'] for d in data]

In [32]:
clf = linear_model.Ridge(fit_intercept=False)
clf.fit(X, y)
theta = clf.coef_
predictions = clf.predict(bg_X)

In [33]:
MSE(y, predictions)

1.5050550974282946

**Inverse document frequency of the words ‘stories’, ‘magician’, ‘psychic’, ‘writing’, and ‘wonder’.**

In [34]:
docs = [d['review_text'] for d in data] # list of documents

#instantiate CountVectorizer()
cv = CountVectorizer()
 
# this steps generates word counts for the words in your docs
word_count_vector = cv.fit_transform(docs)

In [35]:
word_count_vector.shape

(10000, 73250)

In [36]:
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(word_count_vector)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [37]:
# print idf values
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names(),columns=["idf_weights"])

In [38]:
ws = ['stories', 'magician', 'psychic', 'writing', 'wonder']
d = {}
for w in ws:
    d[w] = df_idf.loc[w]
pd.DataFrame(d)

Unnamed: 0,stories,magician,psychic,writing,wonder
idf_weights,3.571873,7.074946,6.952344,3.296703,5.062946


### tf-idf scores in the first review:

In [39]:
doc = [d['review_text'] for d in data]

In [40]:
# count matrix
count_vector = cv.transform(docs)
 
# tf-idf scores
tf_idf_vector = tfidf_transformer.transform(count_vector)

In [41]:
feature_names = cv.get_feature_names()
 
#get tfidf vector for first document
first_document_vector = tf_idf_vector[0]
 
#print the scores
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])

In [42]:
d = {}
for w in ws:
    d[w] = df.loc[w]
pd.DataFrame(d)

Unnamed: 0,stories,magician,psychic,writing,wonder
tfidf,0.048453,0.095974,0.188621,0.044721,0.06868


**Adapted td-idf unigram & bigram model to use the tfidf scores for the 1000 most common unigrams & bigrams.**

In [43]:
# extract count features and apply TF-IDF normalization 
docs = [d['review_text'] for d in data]
tfidf = TfidfVectorizer(max_features=1000).fit_transform(docs)
tfidf

<10000x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 594591 stored elements in Compressed Sparse Row format>

In [44]:
y = [d['rating'] for d in data]

In [45]:
clf = linear_model.Ridge(fit_intercept=False)
clf.fit(tfidf, y)
theta = clf.coef_

In [46]:
predictions = clf.predict(tfidf)

In [47]:
MSE(y, np.array([int(p) for p in predictions]))

2.0481

###### Assessment
To assess our models we used Mean Square Error. Models that performed better would have a lower MSE. Under those conditions our best model was a bag of words model trained with the 1000 most common bigrams.

###### What you learned
We can see that sometimes it is better to use bigrams over unigrams fro detecting rating.

###### Source code and datasets
Datasets were used from Julian McaAuley http://cseweb.ucsd.edu/classes/fa19/cse258-a/files/ . The amazon review files are under assignment1.tar.gz.