## Social Media Analytics

# Webscraping Project
## Introduction to Text Mining
## Preprocessing


##### Felix Funes 20220306 | Paula Catalan 20221048 | Efstathia Styliagkatzi 20220078 | Alisson Tapia 20221156 | S M Abrar Hossain Asif 20220223


After extracting the necessary data we believe was relevant for the purpose of this project.  It is time for data preprocessing which involves cleaning and transforming the data obtained from best buy web pages before it can be analyzed.


Some of the importan reason of pre-process our data are:

*  Data Quality: Web pages are often unstructured, messy and noisy, which makes it challenging to extract clean and accurate data. Data preprocessing helps to clean and filter out irrelevant or inaccurate data, so that only the useful information is retained.

*  Data Consistency: Data from web pages can come in different formats, with varying levels of detail and granularity. Data preprocessing helps to standardize the data format, so that it can be easily compared and analyzed across different web pages.

* Data Integration: Data collected from web scraping is often stored in different formats or data sources, which makes it difficult to combine and integrate. Data preprocessing helps to transform and merge the data into a consistent format that can be easily combined and integrated with other data sources.

*  Efficiency: Preprocessing can also help to optimize the data processing pipeline by reducing the amount of data that needs to be processed, and by optimizing the data format for efficient storage and retrieval.

* Accuracy: Preprocessing can also help to reduce errors and inconsistencies that can arise from incomplete, missing or incorrect data, which can lead to inaccurate analysis and decision-making.




In [289]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download()
import csv
import pandas as pd
import numpy as np
import nltk 
from bs4 import BeautifulSoup
import re
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\madel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\madel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\madel\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\madel\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


In [246]:
# Load dataset
dtypes = {'device':'category','user':'category','rating':'integer','ownership_lenght':'category'}
reviews_to_clean = pd.read_excel("ExtractedReviewsDataCollection1.xlsx", sheet_name="Sheet1", index_col = 'user', engine='openpyxl')
reviews_to_clean = reviews_to_clean.drop(reviews_to_clean.columns[0], axis=1)


In [247]:
# Check first rows
reviews_to_clean.head()

Unnamed: 0_level_0,device,rating,text,date,ownership_length
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
BigG,Apple - iPhone 14 128GB - Midnight (Verizon),5,Apple makes the best cellphone on the market h...,2023-02-03,less than 1 week
Jp44087,Apple - iPhone 14 128GB - Midnight (Verizon),5,"Ease of use, good battery life, 128gb fits me ...",2023-02-03,3 weeks
GamerDadLife,Apple - iPhone 14 128GB - Midnight (Verizon),5,Love it works great and the red color is the m...,2022-12-24,2 weeks
LevanaP,Apple - iPhone 14 128GB - Midnight (Verizon),5,Been a long time iPhone user. This is a awesom...,2023-04-14,1 week
Anonymous,Apple - iPhone 14 128GB - Midnight (Verizon),5,My wife dropped her phone right AFTER the Appl...,2023-04-15,3 weeks


In [248]:
# Describe dataset
summary=reviews_to_clean.describe(include='all')
summary=summary.transpose()
summary.head(len(summary))

  summary=reviews_to_clean.describe(include='all')


Unnamed: 0,count,unique,top,freq,first,last,mean,std,min,25%,50%,75%,max
device,373.0,8.0,Apple - iPhone 14 128GB - Midnight (Verizon),112.0,NaT,NaT,,,,,,,
rating,373.0,,,,NaT,NaT,4.702413,0.775928,1.0,5.0,5.0,5.0,5.0
text,373.0,369.0,Apple makes the best cellphone on the market h...,2.0,NaT,NaT,,,,,,,
date,373.0,130.0,2022-12-02 00:00:00,15.0,2022-09-17,2023-04-21,,,,,,,
ownership_length,373.0,10.0,1 week,109.0,NaT,NaT,,,,,,,


In [253]:
# View text of review 
reviews = reviews_to_clean['text']
reviews = reviews.tolist()
# print the first 5 reviews for verification purposes
print("All review:\n", reviews)

All review:
 ['Apple makes the best cellphone on the market hands down', 'Ease of use, good battery life, 128gb fits me just fine', 'Love it works great and the red color is the most gorgeous iPhone color ever', 'Been a long time iPhone user. This is a awesome phone. Full size screen, much faster & so many ways to customize!!! I love it!', 'My wife dropped her phone right AFTER the Apple protect plan expired.', 'Liked the color but the otterbox covers most of the color', 'The perfect iPhone! this thing is amazing for anyone in the family! Kids or Grandma! or anyone else! super powerful, you can run a business off of it!', 'We were going to purchase iPhones through Verizon but when we stopped by Best Buy which I have 0% finance through Best Buy and the person who has excellent knowledge and started to finish assisted all the way. my husband and I were very much satisfied and glad to shop at best buy', 'So far, so good. I used android since I got a cellphone 30 years ago. What a differen

In [256]:
# Removing the HTML tags from the text using BeautifulSoup 
rawtext = [BeautifulSoup(review, 'html.parser').get_text() for review in reviews]

# print the first 5 cleaned reviews for verification purposes
print(' All review without HTML:\n' , rawtext)

 All review without HTML:
 ['Apple makes the best cellphone on the market hands down', 'Ease of use, good battery life, 128gb fits me just fine', 'Love it works great and the red color is the most gorgeous iPhone color ever', 'Been a long time iPhone user. This is a awesome phone. Full size screen, much faster & so many ways to customize!!! I love it!', 'My wife dropped her phone right AFTER the Apple protect plan expired.', 'Liked the color but the otterbox covers most of the color', 'The perfect iPhone! this thing is amazing for anyone in the family! Kids or Grandma! or anyone else! super powerful, you can run a business off of it!', 'We were going to purchase iPhones through Verizon but when we stopped by Best Buy which I have 0% finance through Best Buy and the person who has excellent knowledge and started to finish assisted all the way. my husband and I were very much satisfied and glad to shop at best buy', 'So far, so good. I used android since I got a cellphone 30 years ago. W

In [266]:
# Normalize case
# The typical is normalizing to lower case
normalizedText = [review.lower() for review in rawtext]
print("Normalized text:\n", normalizedText[0:10])

Normalized text:
 ['apple makes the best cellphone on the market hands down', 'ease of use, good battery life, 128gb fits me just fine', 'love it works great and the red color is the most gorgeous iphone color ever', 'been a long time iphone user. this is a awesome phone. full size screen, much faster & so many ways to customize!!! i love it!', 'my wife dropped her phone right after the apple protect plan expired.', 'liked the color but the otterbox covers most of the color', 'the perfect iphone! this thing is amazing for anyone in the family! kids or grandma! or anyone else! super powerful, you can run a business off of it!', 'we were going to purchase iphones through verizon but when we stopped by best buy which i have 0% finance through best buy and the person who has excellent knowledge and started to finish assisted all the way. my husband and i were very much satisfied and glad to shop at best buy', 'so far, so good. i used android since i got a cellphone 30 years ago. what a dif

In [273]:
# Remove punctation and other characters such as "&"
charsToRemove = r"\?|\.|\!|\;|\.|\"|\,|\(|\)|\&|[0-9]"
textWOPunctuation = [re.sub(charsToRemove, '', review) for review in normalizedText]
print("Text without punctuation:\n",textWOPunctuation[0:10])

Text without punctuation:
 ['apple makes the best cellphone on the market hands down', 'ease of use good battery life gb fits me just fine', 'love it works great and the red color is the most gorgeous iphone color ever', 'been a long time iphone user this is a awesome phone full size screen much faster  so many ways to customize i love it', 'my wife dropped her phone right after the apple protect plan expired', 'liked the color but the otterbox covers most of the color', 'the perfect iphone this thing is amazing for anyone in the family kids or grandma or anyone else super powerful you can run a business off of it', 'we were going to purchase iphones through verizon but when we stopped by best buy which i have % finance through best buy and the person who has excellent knowledge and started to finish assisted all the way my husband and i were very much satisfied and glad to shop at best buy', 'so far so good i used android since i got a cellphone  years ago what a difference easy to le

In [274]:
stop_words = set(stopwords.words('english'))
print (stop_words)

{"won't", 'd', 'that', 'into', "isn't", 'be', 'don', 'this', 'down', 'yourselves', 'any', 'herself', "you'll", 'won', 'those', "didn't", 'or', 'between', 'didn', 'not', 'out', 'ourselves', 'will', 'over', "it's", "haven't", 'all', 'because', "she's", 've', 'a', 'her', 'its', 'o', 'should', 'itself', 'against', 'below', "wasn't", 'why', 'being', 'when', 'does', 'his', 'own', 'too', 'hers', "mightn't", "you've", 'is', 't', "shan't", 'doesn', 'some', 'it', 'each', 'of', 're', 'couldn', 'same', 'theirs', 'my', 'whom', 'yourself', 'until', 'll', 'only', 'who', 'had', 'after', 'we', 'himself', 'an', 'and', 'having', 'few', 'how', 'ain', 'needn', 'ours', 'in', "aren't", 'there', "mustn't", 'can', 'has', "don't", 'now', 'further', 'myself', 'with', 'hasn', 'if', 'off', 'hadn', 'yours', 'where', 'have', "shouldn't", 'they', 'me', 'here', 'under', 'your', 'such', 'themselves', 'than', 'am', 'the', 'them', 'i', 'y', "hasn't", 'was', 'mustn', 'by', 'these', 'once', 'just', 'wouldn', "you're", 'no'

In [278]:
# First, we need to tokenize text - Break it into words
from nltk.tokenize import word_tokenize
text = ''.join(textWOPunctuation)
tokenizedText = word_tokenize(text)

print("List of words:\n",tokenizedText)

List of words:
 ['apple', 'makes', 'the', 'best', 'cellphone', 'on', 'the', 'market', 'hands', 'downease', 'of', 'use', 'good', 'battery', 'life', 'gb', 'fits', 'me', 'just', 'finelove', 'it', 'works', 'great', 'and', 'the', 'red', 'color', 'is', 'the', 'most', 'gorgeous', 'iphone', 'color', 'everbeen', 'a', 'long', 'time', 'iphone', 'user', 'this', 'is', 'a', 'awesome', 'phone', 'full', 'size', 'screen', 'much', 'faster', 'so', 'many', 'ways', 'to', 'customize', 'i', 'love', 'itmy', 'wife', 'dropped', 'her', 'phone', 'right', 'after', 'the', 'apple', 'protect', 'plan', 'expiredliked', 'the', 'color', 'but', 'the', 'otterbox', 'covers', 'most', 'of', 'the', 'colorthe', 'perfect', 'iphone', 'this', 'thing', 'is', 'amazing', 'for', 'anyone', 'in', 'the', 'family', 'kids', 'or', 'grandma', 'or', 'anyone', 'else', 'super', 'powerful', 'you', 'can', 'run', 'a', 'business', 'off', 'of', 'itwe', 'were', 'going', 'to', 'purchase', 'iphones', 'through', 'verizon', 'but', 'when', 'we', 'stopped'

In [279]:
# Let's create a list with all words that are not part of the stop words list
cleanedText = []
for t in tokenizedText:
    if t not in stop_words:
        cleanedText.append(t)
print("Text without stopwords:\n", cleanedText) 

Text without stopwords:
 ['apple', 'makes', 'best', 'cellphone', 'market', 'hands', 'downease', 'use', 'good', 'battery', 'life', 'gb', 'fits', 'finelove', 'works', 'great', 'red', 'color', 'gorgeous', 'iphone', 'color', 'everbeen', 'long', 'time', 'iphone', 'user', 'awesome', 'phone', 'full', 'size', 'screen', 'much', 'faster', 'many', 'ways', 'customize', 'love', 'itmy', 'wife', 'dropped', 'phone', 'right', 'apple', 'protect', 'plan', 'expiredliked', 'color', 'otterbox', 'covers', 'colorthe', 'perfect', 'iphone', 'thing', 'amazing', 'anyone', 'family', 'kids', 'grandma', 'anyone', 'else', 'super', 'powerful', 'run', 'business', 'itwe', 'going', 'purchase', 'iphones', 'verizon', 'stopped', 'best', 'buy', '%', 'finance', 'best', 'buy', 'person', 'excellent', 'knowledge', 'started', 'finish', 'assisted', 'way', 'husband', 'much', 'satisfied', 'glad', 'shop', 'best', 'buyso', 'far', 'good', 'used', 'android', 'since', 'got', 'cellphone', 'years', 'ago', 'difference', 'easy', 'learn', 'us

In [280]:
# Now, let's concatenate it again to a sentence
newText = ''
for t in cleanedText:
  newText = newText + t + ' '
newText = newText.rstrip() # rstring (remove spaces at the right of the string)
print("Full sentence with changes so far:\n", newText)

Full sentence with changes so far:
 apple makes best cellphone market hands downease use good battery life gb fits finelove works great red color gorgeous iphone color everbeen long time iphone user awesome phone full size screen much faster many ways customize love itmy wife dropped phone right apple protect plan expiredliked color otterbox covers colorthe perfect iphone thing amazing anyone family kids grandma anyone else super powerful run business itwe going purchase iphones verizon stopped best buy % finance best buy person excellent knowledge started finish assisted way husband much satisfied glad shop best buyso far good used android since got cellphone years ago difference easy learn use senior population member highly recommend afraid smaller keyboard problemthe phone pretty - n't like shutting screen side buttons side get pressed together trying press one picture taken - also button silence phone hard finger reach move itlong overdue upgrade iphone totally worth especially ad

### we can think about some keyword we can replace in this part
#### Many times terms need to be replaced for other terms
### E.g., wi-fi, wifi, internet, wi fi
#### In this example: tub by bathtub
#### newText = (" "+newText+" ").replace(" tub "," bathtub ")
#### print("Text the replacements:\n", newText)

In [285]:
# lemmatize the text - reduce terms to their origin
from nltk.stem.wordnet import WordNetLemmatizer

In [288]:
# We are using the tokanized text
lem = WordNetLemmatizer()
lemmatizedText= []
for t in cleanedText:
    lemWord = lem.lemmatize(t)
    lemmatizedText.append(lemWord)
print("Lemmatized text :\n",lemmatizedText)
# Not that much of a difference - let's see steming

Lemmatized text :
 ['apple', 'make', 'best', 'cellphone', 'market', 'hand', 'downease', 'use', 'good', 'battery', 'life', 'gb', 'fit', 'finelove', 'work', 'great', 'red', 'color', 'gorgeous', 'iphone', 'color', 'everbeen', 'long', 'time', 'iphone', 'user', 'awesome', 'phone', 'full', 'size', 'screen', 'much', 'faster', 'many', 'way', 'customize', 'love', 'itmy', 'wife', 'dropped', 'phone', 'right', 'apple', 'protect', 'plan', 'expiredliked', 'color', 'otterbox', 'cover', 'colorthe', 'perfect', 'iphone', 'thing', 'amazing', 'anyone', 'family', 'kid', 'grandma', 'anyone', 'else', 'super', 'powerful', 'run', 'business', 'itwe', 'going', 'purchase', 'iphones', 'verizon', 'stopped', 'best', 'buy', '%', 'finance', 'best', 'buy', 'person', 'excellent', 'knowledge', 'started', 'finish', 'assisted', 'way', 'husband', 'much', 'satisfied', 'glad', 'shop', 'best', 'buyso', 'far', 'good', 'used', 'android', 'since', 'got', 'cellphone', 'year', 'ago', 'difference', 'easy', 'learn', 'use', 'senior', 

In [290]:
# Check the differences after this
# Verbs like "stayed -> stay" or "arranged -> arrang"
stem = PorterStemmer()
stemmedText = []
for t in cleanedText:
    stemmedWord = stem.stem(t)
    stemmedText.append(stemmedWord)
print("Stemmed text :\n",stemmedText) 

Stemmed text :
 ['appl', 'make', 'best', 'cellphon', 'market', 'hand', 'downeas', 'use', 'good', 'batteri', 'life', 'gb', 'fit', 'finelov', 'work', 'great', 'red', 'color', 'gorgeou', 'iphon', 'color', 'everbeen', 'long', 'time', 'iphon', 'user', 'awesom', 'phone', 'full', 'size', 'screen', 'much', 'faster', 'mani', 'way', 'custom', 'love', 'itmi', 'wife', 'drop', 'phone', 'right', 'appl', 'protect', 'plan', 'expiredlik', 'color', 'otterbox', 'cover', 'colorth', 'perfect', 'iphon', 'thing', 'amaz', 'anyon', 'famili', 'kid', 'grandma', 'anyon', 'els', 'super', 'power', 'run', 'busi', 'itw', 'go', 'purchas', 'iphon', 'verizon', 'stop', 'best', 'buy', '%', 'financ', 'best', 'buy', 'person', 'excel', 'knowledg', 'start', 'finish', 'assist', 'way', 'husband', 'much', 'satisfi', 'glad', 'shop', 'best', 'buyso', 'far', 'good', 'use', 'android', 'sinc', 'got', 'cellphon', 'year', 'ago', 'differ', 'easi', 'learn', 'use', 'senior', 'popul', 'member', 'highli', 'recommend', 'afraid', 'smaller', '