# Text-mining on webscraped phone reviews (iPhone + Samsung)

### Group members:

Allesandro Girelli

Cyprien Nielly

Katie Chang

Sebastien Moeller

Viktor Malesevic

### Introduction:

This project applies text mining to phone reviews collected from several websites (Amazon, Reddit, Influenster, etc.) in order to advise phone manufacturers on potential issues faced by customers.

# Part 1: Webscraping the data

We decided to webscrape Amazon reviews, which we did in R using the 'rvest' package. The second set of reviews we scraped were Google Shopping reviews, using the code explained below.

The results are in our csv file, 'Reviews.csv'. This file concatenates data from iPhone X, iPhone 8, and Samsung S8 reviews.

###### Scraping Google Shopping Reviews:
To scrape the Google Shopping reviews we used the BeautifulSoup package, along with urllib.request to download the HTML of each URL.

In [1]:
import urllib.request as urll
from bs4 import BeautifulSoup
import re
import pandas as pd

The code is built so that, given the URL of the first page of reviews, the functions scrape all pages available for the product.

The following function takes a URL and scrapes all reviews from one page only.

In [2]:
def googlePageScrap(url):
    # Initializing BeautifulSoup
    page = urll.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    
    # Saving the nodes containing the reviews and ratings
    rev = soup.find_all('div', attrs={'class':'review-content'})
    rat = soup.find_all('div', attrs={'class':'_OBj'})
    
    # Ratings need to be extracted from the soup
    rating = []
    for i in range(len(rat)):
        # Convert to string otherwise .find() doesn't work
        rat[i] = str(rat[i])
        # Finding the index of the rating
        idx = rat[i].find('aria-label="')+len('aria-label="')
        # Save the rating in the rating list
        rating.append(int(rat[i][idx]))
    
    # The first entry is the average rating between all, therefore we delete it
    rating = rating[1:]
    
    # Building the output
    output = []
    for i in range(len(rev)):
        # Building meta data
        meta = []
        meta.append(rating[i])
        # Reviews can use the function .text to extract the actual reviews
        meta.append(rev[i].text[1:-20])
        # Save meta data with the comment ( Rating + Review )
        output.append(meta)
    
    return output
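
The rating lookup in `googlePageScrap` relies on plain string indexing: it finds `aria-label="` and reads the character right after it. A minimal sketch of that step in isolation, assuming the rating is a single digit in that position. The helper name `extract_rating` and the sample markup are made up for illustration and are not actual Google Shopping HTML:

```python
def extract_rating(node_html):
    """Return the single digit that follows aria-label=" in node_html."""
    marker = 'aria-label="'
    pos = node_html.find(marker)
    if pos == -1:
        # str.find returns -1 when the marker is absent
        return None
    return int(node_html[pos + len(marker)])

# Hypothetical markup, for illustration only
sample = '<div class="_OBj"><span aria-label="4 out of 5 stars"></span></div>'
print(extract_rating(sample))
```

The explicit check for `-1` is a design choice of this sketch: without it, a node lacking the attribute would silently be indexed at an unrelated position.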

To go through the remaining pages, we need to know how many there are. The `googleMaxPage(url)` function takes the URL of a page of reviews as input, reads the total number of reviews and returns the number of pages still to visit after the first one. This is needed because only 10 reviews are displayed per page and we want as many reviews as possible.

In [4]:
# Returns the number of pages of reviews that need to be visited
def googleMaxPage(url):
    page = urll.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    maxPage = soup.find('span', attrs={'class':'pag-n-to-n-txt'})
    maxPage = re.sub('[^0-9]','', maxPage.text[7:])
    # This returns the total number of reviews
    maxPage = int(maxPage)
    # There are 10 reviews per page
    maxPage = int(maxPage/10)-1
    
    return maxPage
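
The page arithmetic can be checked on its own. The sketch below reproduces the calculation used in `googleMaxPage` and compares it with a ceiling-based variant that also counts a final, partially filled page. Both function names are hypothetical, and whether that last partial page is actually reachable on the site is an assumption that would need to be verified:

```python
import math

REVIEWS_PER_PAGE = 10  # as stated in the text above

def extra_pages_floor(total_reviews):
    # Same arithmetic as googleMaxPage: pages to visit after the first one
    return int(total_reviews / REVIEWS_PER_PAGE) - 1

def extra_pages_ceil(total_reviews):
    # Alternative that also counts a last page holding fewer than 10 reviews
    return max(math.ceil(total_reviews / REVIEWS_PER_PAGE) - 1, 0)

for n in (100, 95, 7):
    print(n, extra_pages_floor(n), extra_pages_ceil(n))
```

The two versions agree when the review count is a multiple of 10 and differ by one page otherwise.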

Finally, the following function visits each page of reviews and scrapes the data.

In [6]:
# Scrapes all reviews, starting from page 1 given as the input url
def googleScrap(url):
    
    maxPage = googleMaxPage(url)
    output = googlePageScrap(url)
    
    # Go to each page of reviews and add them to the output list
    for i in range(maxPage):
        urlPage = str(url + ',rstart:'+ str(i+1) +'0')
        new = googlePageScrap(urlPage)
        output = output + new
        print(i+1,' / ',maxPage)
    
    # Convert the list into a pandas dataframe
    output_df = pd.DataFrame(output)
    output_df.columns = ['stars', 'comments']

    return output_df 
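
The pagination in `googleScrap` works by appending `,rstart:10`, `,rstart:20`, and so on to the first-page URL. A small sketch of just that URL construction, with a hypothetical helper `page_urls` and a placeholder URL rather than a real product page:

```python
def page_urls(first_page_url, extra_pages):
    """Yield the URL of each page after the first, following the ',rstart:N0' pattern."""
    for i in range(extra_pages):
        # str(i + 1) + '0' in the original is the same as (i + 1) * 10
        yield first_page_url + ',rstart:' + str((i + 1) * 10)

base = 'https://example.com/reviews?prds=paur:XYZ,rsort:1'  # placeholder URL
urls = list(page_urls(base, 3))
for u in urls:
    print(u)
```

Separating URL generation from downloading makes it possible to inspect the list of pages before sending any request.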

The actual scraping was done on the following URLs:

In [None]:
url = 'https://www.google.com/shopping/product/8330308525491645368/reviews?output=search&q=apple+iphone+8&prds=paur:ClkAsKraX0HAK8DTxuQ7a_QLy1VVmmdGjqus04Pco-mYuwDAnhwY-2yRVwjDEGi_xEsNVx11gFrfhnTT-NU5F1Cv8xkFN2t2KRFYe2S0bJHgJvnYxGmE7Xr8KxIZAFPVH73QWHZfxejgLJJNz1emAuExXPnAxg,rsort:1'
iPhone8 = googleScrap(url)

url = 'https://www.google.com/shopping/product/5196767965601398683/reviews?output=search&q=iphone+x&oq=iphone+x&prds=paur:ClkAsKraX6xXlTCTDvTg5n66BfqjZtUzj5mRPstz9QYmLjncZZBAQRRtobM8Pe5XLEZX0CP8x5UxXIzT52WhOhO2moZSRoKU0aTE6QE0f-R3zq1xhh45Jvza8BIZAFPVH70FzB4_QX4D05ZaAMc8F9sjUFRwvg,rsort:1'
iPhoneX = googleScrap(url)

url = 'https://www.google.com/shopping/product/2874873357294577697/reviews?output=search&q=galaxy+s8&oq=galaxy+s8&prds=paur:ClkAsKraX4MdXEv-XobV-tsudUmMvrTaF0oUQFrnUCBf-gngBeSUnGe1TQzRN-qEvUxg11H4haqP6POwtI-P9rAtftKbUh-e4yFNzeeFNldak82GgWHBlGI__xIZAFPVH72dPmpO1V1eoP7Y9BbJQh6EoOBh5Q,rsort:1'
samsungS8 = googleScrap(url)

The reviews were exported to csv files to be combined with the Amazon reviews.

In [None]:
iPhoneX.to_csv('GoogleiPhoneX.csv')

iPhone8.to_csv('GoogleiPhone8.csv')

samsungS8.to_csv('GoogleSamsungS8.csv')

#### Merging the data

Now that we have all our reviews we combine them into one csv file to be used in part 2.

We begin by importing all csv files, merge the data so that they follow the same format, combine everything into one data frame and then export it to `Reviews.csv`.

In [None]:
data0 = pd.read_csv('AmazonSamsungS8.csv', index_col = 0)
data1 = pd.read_csv('AmazoniPhone8.csv', index_col = 0)
data2 = pd.read_csv('AmazoniPhoneX.csv', index_col = 0)
data3 = pd.read_csv('GoogleSamsungS8.csv', index_col = 0)
data4 = pd.read_csv('GoogleiPhone8.csv', index_col = 0)
data5 = pd.read_csv('GoogleiPhoneX.csv', index_col = 0)

# Unifying format to merge meta data
data0 = pd.DataFrame(data0[['comments', 'stars']])
data0['source'] = 'Amazon'
data0['product'] = 'Samsung S8'
data0 = data0[['source', 'product', 'comments', 'stars']]

data1 = pd.DataFrame(data1[['comments', 'stars']])
data1['source'] = 'Amazon'
data1['product'] = 'iPhone 8'
data1 = data1[['source', 'product', 'comments', 'stars']]

data2 = pd.DataFrame(data2[['comments', 'stars']])
data2['source'] = 'Amazon'
data2['product'] = 'iPhone X'
data2 = data2[['source', 'product', 'comments', 'stars']]

data3 = pd.DataFrame(data3[['comments', 'stars']])
data3['source'] = 'Google Shopping'
data3['product'] = 'Samsung S8'
data3 = data3[['source', 'product', 'comments', 'stars']]

data4 = pd.DataFrame(data4[['comments', 'stars']])
data4['source'] = 'Google Shopping'
data4['product'] = 'iPhone 8'
data4 = data4[['source', 'product', 'comments', 'stars']]

data5 = pd.DataFrame(data5[['comments', 'stars']])
data5['source'] = 'Google Shopping'
data5['product'] = 'iPhone X'
data5 = data5[['source', 'product', 'comments', 'stars']]

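# NOTE: data6 (the Influenster reviews) is not loaded in the lines above; it must be read into a DataFrame before this step
data6 = pd.DataFrame(data6[['CommentBox_Content', 'CommentBox_Rating']])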
data6.columns = ['comments', 'stars']
data6['source'] = 'Influenster'
data6['product'] = 'iPhone X'
data6 = data6[['source', 'product', 'comments', 'stars']]

group2Meta = pd.concat([data0, data1, data2, data3, data4, data5, data6])

group2Meta.to_csv('Reviews.csv')
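
Because the same four steps are repeated for every file, they could be wrapped in one function. The following is only a sketch of that idea: `tag_reviews` is a hypothetical helper, and the two small frames are made-up stand-ins for the scraped csv files:

```python
import pandas as pd

def tag_reviews(df, source, product):
    """Return a copy of df reduced to the four columns used in Reviews.csv."""
    out = df[['comments', 'stars']].copy()
    out['source'] = source
    out['product'] = product
    return out[['source', 'product', 'comments', 'stars']]

# Tiny made-up frames standing in for the scraped files
amazon_s8 = pd.DataFrame({'comments': ['Great phone'], 'stars': [5], 'extra': ['x']})
google_i8 = pd.DataFrame({'comments': ['Battery is weak'], 'stars': [2]})

merged = pd.concat(
    [tag_reviews(amazon_s8, 'Amazon', 'Samsung S8'),
     tag_reviews(google_i8, 'Google Shopping', 'iPhone 8')],
    ignore_index=True,
)
print(merged)
```

Passing the source and product as arguments keeps each file's labels next to the frame they describe, which reduces the chance of a copy-paste slip between blocks.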

# Part 2: Pre-processing the webscraped data

### 2.1 Importing the data

#### Before reading the data we import some useful libraries:

In [1]:
# Pandas for data manipulation
import pandas as pd

# nltk for all text data pre-processing
import nltk
# nltk.download()
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag

**NOTE**: the nltk data needs to be downloaded with the command '`nltk.download()`', which opens a window. In this window, click on 'all packages' and then 'download'.

Now we import the dataset:

As the reviews come from different sources and contain characters such as emojis, we need to specify the encoding option, because the default can no longer interpret the raw data.

In [2]:
data = pd.read_csv('Reviews.csv', encoding = 'ISO-8859-1')
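
One thing to keep in mind with `ISO-8859-1` is that it maps every possible byte to a character, so reading never fails, but text stored in another encoding may come out altered. The sketch below illustrates this on a made-up string; it says nothing about which encoding `Reviews.csv` actually uses:

```python
original = 'Great phone 😀'
raw = original.encode('utf-8')  # bytes as a UTF-8 source would store them

# ISO-8859-1 decodes any byte sequence without raising an error...
as_latin1 = raw.decode('iso-8859-1')
# ...but the 4-byte emoji turns into 4 unrelated characters
print(len(original), len(as_latin1))

# Decoding with the matching codec restores the original text
as_utf8 = raw.decode('utf-8')
print(as_utf8 == original)
```

If emojis matter for the later analysis, it may be worth checking a few decoded comments by eye after loading.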

In [7]:
data.head()

|   | Unnamed: 0 | source | product | comments | stars |
|---|---|---|---|---|---|
| 0 | 1 | Amazon | Samsung S8 | BEWARE!99% of the negative reviews are SELLER ... | 5.0 |
| 1 | 2 | Amazon | Samsung S8 | So far the best phone I ever had. Man it's be... | 5.0 |
| 2 | 3 | Amazon | Samsung S8 | I was skeptical about buying this phone off Am... | 5.0 |
| 3 | 4 | Amazon | Samsung S8 | This phone should be a no-brainer. Easily the ... | 3.0 |
| 4 | 5 | Amazon | Samsung S8 | I haven't owned a Galaxy phone since the Galax... | 4.0 |


In [11]:
del data['Unnamed: 0']

In [13]:
data.head()

|   | source | product | comments | stars |
|---|---|---|---|---|
| 0 | Amazon | Samsung S8 | BEWARE!99% of the negative reviews are SELLER ... | 5.0 |
| 1 | Amazon | Samsung S8 | So far the best phone I ever had. Man it's be... | 5.0 |
| 2 | Amazon | Samsung S8 | I was skeptical about buying this phone off Am... | 5.0 |
| 3 | Amazon | Samsung S8 | This phone should be a no-brainer. Easily the ... | 3.0 |
| 4 | Amazon | Samsung S8 | I haven't owned a Galaxy phone since the Galax... | 4.0 |


In [14]:
data.describe()

|   | stars |
|---|---|
| count | 5746.0 |
| mean | 4.320745 |
| std | 1.247721 |
| min | 1.0 |
| 25% | 4.0 |
| 50% | 5.0 |
| 75% | 5.0 |
| max | 5.0 |


#### So far we have a dataset of about 5,700 reviews (5,746 with a rating), composed of 4 columns:

The source: Amazon, Google Shopping or Influenster

The product: iPhone 8, iPhone X or Samsung S8

The comment/review: raw text written by the customer

The rating: from 1 to 5 stars.

#### For the moment we will only focus on the comments for each phone model (without rating)

In [16]:
commentsiX = data[data['product'] == 'iPhone X']
commentsi8 = data[data['product'] == 'iPhone 8']
commentsS8 = data[data['product'] == 'Samsung S8']

commentsiX = commentsiX['comments']
commentsi8 = commentsi8['comments']
commentsS8 = commentsS8['comments']
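
The same per-product split can also be written with `groupby`, which avoids one filter per phone model. This is a sketch on made-up rows that share the column names of `Reviews.csv`; the name `comments_by_product` is hypothetical:

```python
import pandas as pd

# Made-up rows with the same column names as Reviews.csv
data = pd.DataFrame({
    'product': ['iPhone X', 'iPhone 8', 'Samsung S8', 'iPhone X'],
    'comments': ['Love the screen', 'Solid upgrade', 'Great camera', 'Too expensive'],
})

# One Series of comments per product, keyed by product name
comments_by_product = {
    name: group['comments'] for name, group in data.groupby('product')
}

print(comments_by_product['iPhone X'].tolist())
```

With this layout, adding another phone model to the dataset requires no change to the splitting code.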

### 2.2 Removing unnecessary characters and words

### 2.3 Tokenizing comments into monograms

### 2.4 Lemmatizing the monograms

# Part 3: Creation of a Term Frequency-Inverse Document Frequency (TF-IDF) matrix

# Part 4: Non-negative Matrix Factorization (NMF)

# Part 5: Topic extraction with Latent Dirichlet Allocation