# Text-mining on webscraped phone reviews (iPhone + Samsung)

### Group members:

Allesandro Girelli

Cyprien Nielly

Katie Chang

Sebastien Moeller

Viktor Malesevic

### Introduction:

This project applies text mining to phone reviews collected from several websites (Amazon, Reddit, Influenster, etc.) in order to advise phone manufacturers on the issues their customers face.

# Part 1: Webscraping the data

We decided to webscrape Amazon reviews, which we did in R using the 'rvest' package.

The results are in the CSV file 'Reviews.csv', which concatenates the reviews of the iPhone X, iPhone 8, and Samsung S8.

# Part 2: Pre-processing the webscraped data

### 2.1 Importing the data

#### Before reading the data we import some useful libraries:

In [1]:
# Pandas for data manipulation
import pandas as pd

# nltk for all text data pre-processing

#nltk.download()
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag

**NOTE**: the nltk data needs to be downloaded once with the command 'nltk.download()' (after 'import nltk'), which opens a window. In this window, click on 'all packages' and then 'download'.

Now we import the dataset:

In [2]:
data = pd.read_csv('Reviews.csv', encoding = 'ISO-8859-1')
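A note on the `encoding` argument: the cleaning code in 2.4 strips the sequence 'â\x80\x99', which is exactly what a typographic apostrophe stored as UTF-8 looks like once its bytes are decoded as ISO-8859-1. The sketch below reproduces that effect on a made-up string. Whether 'Reviews.csv' is really UTF-8 is an assumption; we did not verify it.

```python
# Hypothetical review text containing a typographic apostrophe (U+2019).
original = "it’s great"

# Encode as UTF-8, then decode with the wrong codec, which is what happens
# if a UTF-8 file is read with encoding='ISO-8859-1'.
raw_bytes = original.encode("utf-8")
garbled = raw_bytes.decode("ISO-8859-1")
print(repr(garbled))

# Reversing the mistake recovers the original text.
repaired = garbled.encode("ISO-8859-1").decode("utf-8")
print(repaired)
```

If the file does turn out to be UTF-8, reading it with that encoding would avoid having to strip such sequences afterwards.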

In [7]:
data.head()

Unnamed: 0.1,Unnamed: 0,source,product,comments,stars
0,1,Amazon,Samsung S8,BEWARE!99% of the negative reviews are SELLER ...,5.0
1,2,Amazon,Samsung S8,So far the best phone I ever had. Man it's be...,5.0
2,3,Amazon,Samsung S8,I was skeptical about buying this phone off Am...,5.0
3,4,Amazon,Samsung S8,This phone should be a no-brainer. Easily the ...,3.0
4,5,Amazon,Samsung S8,I haven't owned a Galaxy phone since the Galax...,4.0


In [11]:
del data['Unnamed: 0']

In [13]:
data.head()

Unnamed: 0,source,product,comments,stars
0,Amazon,Samsung S8,BEWARE!99% of the negative reviews are SELLER ...,5.0
1,Amazon,Samsung S8,So far the best phone I ever had. Man it's be...,5.0
2,Amazon,Samsung S8,I was skeptical about buying this phone off Am...,5.0
3,Amazon,Samsung S8,This phone should be a no-brainer. Easily the ...,3.0
4,Amazon,Samsung S8,I haven't owned a Galaxy phone since the Galax...,4.0


In [14]:
data.describe()

Unnamed: 0,stars
count,5746.0
mean,4.320745
std,1.247721
min,1.0
25%,4.0
50%,5.0
75%,5.0
max,5.0


#### So far we have a dataset of about 5,700 reviews (5,746 with a star rating), composed of 4 columns:

The source: Amazon

The product: iPhone 8, iPhone X or Samsung S8

The comment/review: raw text written by the customer

The rating: from 1 to 5 stars.

#### For now we focus only on the comments for each phone model (ignoring the rating)

In [16]:
commentsiX = data[data['product'] == 'iPhone X']
commentsi8 = data[data['product'] == 'iPhone 8']
commentsS8 = data[data['product'] == 'Samsung S8']

commentsiX = commentsiX['comments']
commentsi8 = commentsi8['comments']
commentsS8 = commentsS8['comments']

### 2.2 Removing unnecessary characters and words

First we remove special characters, accents, and punctuation, and convert all words to lower case.

For code see 2.4
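As a minimal illustration of this step, here is a standard-library sketch. The function name `clean_text` and the sample sentence are hypothetical, and the project code in 2.4 uses its own character list rather than `string.punctuation`.

```python
import string
import unicodedata

def clean_text(text):
    """Lowercase, strip accents, and replace punctuation with spaces."""
    text = str(text).lower()
    # Decompose accented characters (e.g. 'é' -> 'e' + combining accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Map every ASCII punctuation character to a space.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    text = text.translate(table)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

print(clean_text("Great caméra!!! Battery-life: so-so..."))
```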

### 2.3 Tokenizing comments into monograms

Next we 'tokenize' the comments into 'monograms' (unigrams), 'bigrams', or more generally 'n-grams'.
We chose to build monograms, which means splitting each comment into individual words (bigrams would group the words two by two, and so on).


For code see 2.4
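To make the difference concrete, here is a small sketch on a made-up comment. It splits on whitespace only, which is a simplification of what a real tokenizer does.

```python
# Hypothetical, already-cleaned comment.
comment = "battery drains fast after update"

# Monograms (unigrams): one token per word.
unigrams = comment.split()

# Bigrams: each word paired with the word that follows it.
bigrams = list(zip(unigrams, unigrams[1:]))

print(unigrams)
print(bigrams)
```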

### 2.4 Lemmatizing the monograms

In this section we create three functions that build on each other to handle special-character removal, tokenizing, and lemmatizing. All of this is done in the function 'tokenList', which relies on the two functions 'get_wordnet_pos' and 'lemmatize'.

**get_wordnet_pos** supports the lemmatization: it assigns each word its part of speech (adjective, noun, verb, etc.), using the constants of the **wordnet** corpus module from nltk.

**lemmatize** lemmatizes each word, using the **get_wordnet_pos** function we just created together with **pos_tag** and **WordNetLemmatizer.lemmatize** from nltk.

Finally, **tokenList** is the main function:

It first removes the special characters from the comments, makes sure each value is a string, and deletes some particular words like 'iphone', 'samsung', etc.

Then, it tokenizes each comment, removes English stopwords, and lemmatizes the remaining tokens.

In [None]:
# Given a list of tokens passed through nltk.pos_tag(tokens), this returns the
# pos argument needed for the lemmatization in a new list of pairs: [word, type]
def get_wordnet_pos(tokensTag):
    # tokensTag is a list of (word, tag) tuples, which are immutable,
    # so the result is built in a new list
    tokenNew = []  
    for i in range(len(tokensTag)):
        # Save the current word being identified
        tokenNew.append([tokensTag[i][0]])
        # Append the type
        if tokensTag[i][1][0] == 'J':
            tokenNew[i].append(wordnet.ADJ)
            
        elif tokensTag[i][1][0] == 'V':
            tokenNew[i].append(wordnet.VERB)
            
        elif tokensTag[i][1][0] == 'N':
            tokenNew[i].append(wordnet.NOUN)
            
        elif tokensTag[i][1][0] == 'R':
            tokenNew[i].append(wordnet.ADV)
        
        elif tokensTag[i][1][0] == '?':
            tokenNew[i].append(wordnet.ADJ_SAT)
        else:
            tokenNew[i].append(wordnet.ADJ)
            
    return tokenNew

# Lemmatization that takes each word's part of speech into account
def lemmatize(words):
        
    wordType = get_wordnet_pos(pos_tag(words))
    wordnet_lemmatizer = WordNetLemmatizer()    
    for i in range(len(wordType)):
        words[i] = wordnet_lemmatizer.lemmatize(wordType[i][0], pos = wordType[i][1])
    
    return words 

def tokenList(my_list):
    # Cast to str (guards against NaN or numeric entries) and lowercase,
    # so that the same word always contributes to the same token count
    comments = [str(item).lower() for item in my_list]
    # Characters to replace with a space
    transformation = {a:' ' for a in ['@','/','#','.','\\','!',',','(',')','{','}','[',']','-','~', '*','?','+', '8', '7', '6', ';', ':', '|']}
    # Characters to delete outright (str.maketrans needs keys of exactly one character)
    transB = {b:'' for b in ['’','"']}
    comments = [item.translate(str.maketrans(transformation)) for item in comments]
    comments = [item.translate(str.maketrans(transB)) for item in comments]
    # Remove context words that carry no meaning here, plus URL fragments and encoding debris.
    # Longer patterns come first ('https' before 'http', 'com' before 'co') so that they can match.
    comments = [item.replace('iphone', ' ').replace('samsung', ' ').replace('galaxy', ' ').replace('apple', ' ').replace('plus', ' ').replace(
            ' x ', ' ').replace('’', '').replace("'", '').replace('https', '').replace('http', '').replace('com', ' ').replace('co', ' ').replace(
                    '201', ' ').replace('0', ' ').replace('â\x80\x99', '').replace('í¢ä\x89åä\x8b¢', '').replace('phone', ' ') for item in comments]

    # nltk's tokenizer
    tkzer = TweetTokenizer(preserve_case = False, strip_handles = True, reduce_len = True)
    # Build the stopword set once instead of reloading the list for every word
    stop_words = set(stopwords.words('english'))

    tokens = []
    for idx in range(len(comments)):
        # Tokenize the comment, then remove English stopwords
        words = [word for word in tkzer.tokenize(comments[idx]) if word not in stop_words]
        # Lemmatize each remaining word according to its part of speech
        words = lemmatize(words)
        tokens.append(words)
        # Progress report
        print('Tokenizing: ',idx+1, ' / ', len(comments))

    return tokens
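The tag mapping done by get_wordnet_pos can also be sketched as a dictionary lookup. The version below is an illustration only, not the code we ran: the names `TAG_TO_WORDNET` and `to_wordnet_pos` are hypothetical, the tags are written by hand instead of coming from pos_tag, and the WordNet codes are spelled out as plain strings so that the sketch runs without nltk.

```python
# WordNet part-of-speech codes. In nltk these are the values of
# wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV.
TAG_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(tagged_tokens, default="a"):
    """Map (word, Penn Treebank tag) pairs to (word, WordNet POS) pairs.

    The first letter of the tag decides the category; anything else falls
    back to `default` (adjective, as in get_wordnet_pos above).
    """
    return [(word, TAG_TO_WORDNET.get(tag[:1], default))
            for word, tag in tagged_tokens]

# Hand-written tags standing in for the output of nltk.pos_tag.
tagged = [("phones", "NNS"), ("charging", "VBG"), ("quickly", "RB"), ("the", "DT")]
print(to_wordnet_pos(tagged))
```

Falling back to noun instead of adjective is a common alternative, since noun is the default part of speech of WordNetLemmatizer.lemmatize.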

Some other useful functions:

In [None]:
# Given a list of ordered tokens from one document, this returns
# the list of all groups of n neighbouring tokens (as tuples).
def nGrams(input_list, n):
    return list(zip(*[input_list[i:] for i in range(n)]))

# Returns the nGrams of every document in a corpus of token lists, as one flat list
def listGrams(input_list, n):
    output = []
    for tokens in input_list:
        output.extend(nGrams(tokens, n))
    return output
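One possible use of these helpers is counting the most frequent bigrams across all comments. The sketch below repeats the two functions so that it runs on its own; the tiny corpus is made up for illustration.

```python
from collections import Counter

def nGrams(input_list, n):
    return list(zip(*[input_list[i:] for i in range(n)]))

def listGrams(input_list, n):
    output = []
    for tokens in input_list:
        output.extend(nGrams(tokens, n))
    return output

# Hypothetical tokenized comments (already cleaned and lemmatized).
corpus = [
    ["battery", "life", "bad"],
    ["great", "battery", "life"],
    ["screen", "crack", "easily"],
]

# Count every bigram in the corpus and show the most frequent one.
bigram_counts = Counter(listGrams(corpus, 2))
print(bigram_counts.most_common(1))
```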

# Part 3: Creation of a Term Frequency - Inverse Document Frequency matrix (TF-IDF)

In [None]:
#%% TF - IDF Matrix Construction
# The token lists are assumed to come from tokenList (2.4) applied to each phone's comments
tokensiX = tokenList(commentsiX)
tokensi8 = tokenList(commentsi8)
tokensS8 = tokenList(commentsS8)

# All lemmatized tokens joined back into one string per comment
lemmiX = [' '.join(tokens) for tokens in tokensiX]
lemmi8 = [' '.join(tokens) for tokens in tokensi8]
lemmS8 = [' '.join(tokens) for tokens in tokensS8]

In [None]:
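#%%
from sklearn.feature_extraction.text import TfidfVectorizer

# Ignore terms found in more than 95% of the comments or in fewer than 2 comments
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
termdoc = tfidf_vectorizer.fit_transform(lemmiX)
# Dense view for inspection only; zeros are blanked out for readability
TFM = pd.DataFrame(termdoc.todense()).replace(0, '')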
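To show what the TF-IDF weights represent, here is a hand-rolled sketch on three made-up comments. It uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which to our understanding matches the TfidfVectorizer defaults; the function name `tfidf` is hypothetical, and the max_df, min_df and stopword filtering are left out.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

docs = ["battery life bad", "battery life great", "screen crack"]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:8s} {weight:.3f}")
```

In the first comment, 'bad' gets a higher weight than 'battery' because it appears in fewer comments.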

# Part 4: Non-negative Matrix Factorization (NMF)

In [None]:
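#%%
from sklearn.decomposition import NMF

# Number of NMF dimensions, which can also be interpreted as topics in this case.
# This is the "beauty" of NMF. The value 40 is arbitrary
n_dimensions = 40
model = NMF(n_components=n_dimensions, init='random')
W = model.fit_transform(termdoc)  # comments x dimensions
H = model.components_             # dimensions x terms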
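To give an idea of what the factorization does, here is a pure-Python sketch of NMF with the multiplicative update rules of Lee and Seung on a tiny made-up matrix. It is an illustration only: the helper names are hypothetical, and the solver that scikit-learn runs internally may differ.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh)) ** 0.5

def nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate V (n x m, non-negative) by W (n x k) times H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() for _ in range(k)] for _ in range(n)]
    H = [[rng.random() for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H

# Toy "comments x terms" matrix with two obvious groups of terms.
V = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
W_toy, H_toy = nmf(V, k=2)
print(round(frobenius_error(V, W_toy, H_toy), 4))
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative, which is what lets each row of H be read as a weighted list of terms.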

In [None]:
#%%
W = pd.DataFrame(W).replace(0, '')
H = pd.DataFrame(H).replace(0, '')

In [None]:
#%%
# Since NMF dimensions can be interpreted as topics, let's look at the dimensions
# get_feature_names_out() replaces get_feature_names(), which recent scikit-learn releases no longer provide
words = tfidf_vectorizer.get_feature_names_out()
n_top_words = 20 # print 20 words by dimension. You can change this number

# Loop for each dimension: what words are the most dominant in each dimension
for i_dimension, dimension in enumerate(model.components_):
    print("Topic #%d:" % i_dimension)
    print(" ".join([words[i] for i in dimension.argsort()[:-n_top_words - 1:-1]]))
    print()

# Can you interpret these dimensions as humanly intelligible topics?
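The slice `[:-n_top_words - 1:-1]` used above is compact but not obvious. The sketch below reproduces it with plain lists on a made-up topic row, so the vocabulary and weights are hypothetical.

```python
# Hypothetical vocabulary and one row of model.components_ (one topic).
words = ["battery", "camera", "charge", "screen", "price"]
topic_weights = [0.9, 0.1, 0.7, 0.0, 0.3]
n_top_words = 3

# Indices sorted by ascending weight, like numpy's argsort.
order = sorted(range(len(topic_weights)), key=lambda i: topic_weights[i])

# [:-n - 1:-1] walks backwards from the end and stops after n items,
# which gives the n largest weights in descending order.
top_indices = order[:-n_top_words - 1:-1]
top_words = [words[i] for i in top_indices]
print(top_words)
```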

# Part 5: Topic extraction with Latent Dirichlet Allocation

In [None]:
#%% LDA ANALYSIS
from gensim import corpora, models

#dictionary = corpora.Dictionary(tokens)
dictionaryiX = corpora.Dictionary(tokensiX)
dictionaryi8 = corpora.Dictionary(tokensi8)
dictionaryS8 = corpora.Dictionary(tokensS8)
#print(dictionary)

In [None]:
#%%
#corpus = [dictionary.doc2bow(text) for text in tokens]
corpusiX = [dictionaryiX.doc2bow(text) for text in tokensiX]
corpusi8 = [dictionaryi8.doc2bow(text) for text in tokensi8]
corpusS8 = [dictionaryS8.doc2bow(text) for text in tokensS8]
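For readers unfamiliar with the bag-of-words format built above, the sketch below mimics the idea with the standard library: each distinct token gets an integer id, and each comment becomes a list of (id, count) pairs. The functions `build_dictionary` and `doc2bow` here are simplified stand-ins, not gensim's implementation, and the ids gensim assigns may differ.

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def doc2bow(token2id, doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Hypothetical tokenized comments.
docs = [["battery", "life", "battery"], ["screen", "battery"]]
token2id = build_dictionary(docs)
corpus = [doc2bow(token2id, doc) for doc in docs]
print(token2id)
print(corpus)
```

Word order is discarded in this representation; only the counts per comment are passed on to the LDA model.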

In [None]:
#%%
# Long computation time!!!
#ldamodel = models.ldamodel.LdaModel(corpus, num_topics = 40, id2word = dictionary, passes = 10)
ldaiX = models.ldamodel.LdaModel(corpusiX, num_topics = 40, id2word = dictionaryiX, passes = 10)

In [None]:
#%%
ldai8 = models.ldamodel.LdaModel(corpusi8, num_topics = 40, id2word = dictionaryi8, passes = 10)

In [None]:
#%%
ldaS8 = models.ldamodel.LdaModel(corpusS8, num_topics = 40, id2word = dictionaryS8, passes = 10)

In [None]:
#%%
# Top 10 words associated with the 40 topics we clustered the data into
#print(ldamodel.print_topics(num_topics = 40, num_words = 10))
print(ldaiX.print_topics(num_topics = 40, num_words = 10))

In [None]:
#%%
print(ldai8.print_topics(num_topics = 40, num_words = 10))

In [None]:
#%%
print(ldaS8.print_topics(num_topics = 40, num_words = 10))