<a href="https://colab.research.google.com/github/1391819/notebooks-ml/blob/main/NLP/apple-reviews-sentiment-analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About 
Sentiment analysis of Apple's Trustpilot reviews, scraped with BeautifulSoup and scored with TextBlob.

## Data
- Extracted and collected business reviews from Apple's Trustpilot page
- Created additional features from reviews for a more in-depth data analysis 

## Stack 
- TextBlob 
- BeautifulSoup
- Data Cleaning and Interpretation

## Extracting and collecting business reviews

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [2]:
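reviews = []

# Scrape review pages 1 to 4 (range excludes the stop value)
for page_number in range(1, 5):
  response = requests.get("https://www.trustpilot.com/review/www.apple.com", params={"page": page_number})
  response.raise_for_status()  # fail early on a blocked or missing page

  soup = BeautifulSoup(response.text, "html.parser")

  # NOTE: this generated class name may change whenever the site is rebuilt
  review_divs = soup.find_all("div", class_="styles_reviewContent__0Q2Tg")

  for container in review_divs:
    raw_content = container.find("p")
    if raw_content is not None:  # skip containers without a <p> body
      reviews.append(raw_content.text)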
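
The scraping loop builds one URL per review page. As a minimal sketch using only the standard library, the same URLs can be generated up front and inspected before any request is sent. `build_page_urls` is a hypothetical helper name, not part of the notebook, and the page range 1 to 4 is taken from the loop above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.trustpilot.com/review/www.apple.com"

def build_page_urls(base_url, first_page, last_page):
    """Return one URL per review page, from first_page to last_page inclusive."""
    return [
        f"{base_url}?{urlencode({'page': page})}"
        for page in range(first_page, last_page + 1)
    ]

page_urls = build_page_urls(BASE_URL, 1, 4)
for url in page_urls:
    print(url)
```

Four pages yielded the 80 reviews analysed below, i.e. about 20 reviews per page.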

## Analysing the data

In [3]:
df = pd.DataFrame(np.array(reviews), columns=["review"])

In [None]:
len(df["review"])

In [5]:
df["word_count"] = df["review"].apply(lambda x: len(x.split()))

In [6]:
df["char_count"] = df["review"].str.len()

In [7]:
def average_words(x):
  # Mean length of the whitespace-separated words; 0.0 for an empty review
  words = x.split()
  return sum(len(word) for word in words) / len(words) if words else 0.0

In [8]:
df["average_word_length"] = df["review"].apply(average_words)

In [9]:
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # the stop-word corpus must be downloaded once before use

stop_words = stopwords.words("english")

df["stopword-count"] = df["review"].apply(lambda x: len([word for word in x.split() if word.lower() in stop_words]))

df["stopword-rate"] = df["stopword-count"] / df["word_count"]
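
To make the feature definitions concrete, here is a minimal plain-Python sketch that computes the same five features for a single string. The sample sentence and the tiny `SAMPLE_STOP_WORDS` set are made up for illustration (the notebook uses NLTK's English list), and `review_features` is a hypothetical helper name.

```python
# Hypothetical, tiny stop-word set for illustration; the notebook uses NLTK's English list
SAMPLE_STOP_WORDS = {"i", "the", "to", "a", "my", "is", "and"}

def review_features(review, stop_words):
    """Compute the per-review features used above for a single string."""
    words = review.split()
    word_count = len(words)
    stopword_count = sum(1 for word in words if word.lower() in stop_words)
    return {
        "word_count": word_count,
        "char_count": len(review),
        # Guard against empty reviews to avoid dividing by zero
        "average_word_length": sum(len(w) for w in words) / word_count if word_count else 0.0,
        "stopword_count": stopword_count,
        "stopword_rate": stopword_count / word_count if word_count else 0.0,
    }

features = review_features("I went to the store and my phone is fixed", SAMPLE_STOP_WORDS)
print(features)
```

Here 6 of the 10 words are stop words, so the stop-word rate is 0.6. The character count includes spaces, which is why it is larger than the word count times the average word length.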

In [10]:
df.sort_values(by="stopword-rate")

index,review,word_count,char_count,average_word_length,stopword-count,stopword-rate
45,Bought I Mac 4K in 2017 great at first now it’...,77,370,3.805195,23,0.298701
4,Apple sells very poor quality products. I had...,36,195,4.388889,11,0.305556
20,Will never use apple again. Have always loved ...,65,344,4.307692,22,0.338462
68,Apple make terrible products that just don’t l...,69,401,4.826087,24,0.347826
26,"apple iphone 7 support will stop in a month , ...",131,615,3.702290,47,0.358779
...,...,...,...,...,...,...
40,If you have booked your appointment with servi...,86,423,3.930233,48,0.558140
36,I visited the Apple store today to have my Mac...,127,661,4.212598,73,0.574803
39,My review is about someone who has treated me ...,33,167,4.090909,19,0.575758
57,My AirPods were not working so when I worked w...,157,856,4.452229,91,0.579618


In [11]:
df.describe()

statistic,word_count,char_count,average_word_length,stopword-count,stopword-rate
count,80.0,80.0,80.0,80.0,80.0
mean,87.6625,473.3875,4.401452,39.8625,0.453038
std,52.097266,287.268299,0.343675,23.729885,0.063197
min,14.0,77.0,3.70229,7.0,0.298701
25%,58.75,328.75,4.209081,24.0,0.411502
50%,79.5,409.5,4.372685,35.0,0.452425
75%,102.25,566.75,4.616591,48.0,0.5
max,371.0,1998.0,5.225352,155.0,0.586957


## Data cleaning

- Lowercasing the text and removing redundant tokens (punctuation, stop words, etc.)

In [12]:
df.review

0     I have received message from 20697 my iphone w...
1     Purchased a new MacBook Air a year ago & withi...
2     Longtime Apple user I’ve been lucky enough to ...
3     Apple won't help... I need to reset my securit...
4     Apple sells very poor quality products.  I had...
                            ...                        
75    I am not sure how Apple grown this much when t...
76    I tried many methods on YouTube but no video c...
77    My iphone 6S finally crashed. I had insurance ...
78    I’ve had Apple technology for a long time. I s...
79    Short, sweet and to the point. Needed an Apple...
Name: review, Length: 80, dtype: object

In [13]:
# Lower casing
df["lowercase"] = df["review"].apply(lambda x: " ".join(word.lower() for word in x.split()))

In [14]:
# Punctuation
df["punctuation"] = df["lowercase"].str.replace(r"[^\w\s]", "", regex=True)

In [15]:
# Stop words 
df["stopwords"] = df["punctuation"].apply(lambda x: " ".join(word for word in x.split() if word not in stop_words))
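
The three cleaning steps can be sketched for a single string in plain Python. This is only an illustration: the sample text is a shortened version of review 3, `SAMPLE_STOP_WORDS` is a hypothetical stand-in for NLTK's English list, and `clean_review` is a made-up helper name.

```python
import re

# Hypothetical, tiny stop-word set for illustration; the notebook uses NLTK's English list
SAMPLE_STOP_WORDS = {"i", "to", "my"}

def clean_review(review, stop_words):
    """Lowercase, strip punctuation, then drop stop words, in that order."""
    lowered = " ".join(word.lower() for word in review.split())
    # Remove every character that is neither a word character nor whitespace
    no_punctuation = re.sub(r"[^\w\s]", "", lowered)
    return " ".join(word for word in no_punctuation.split() if word not in stop_words)

cleaned = clean_review("Apple won't help... I need to reset my security questions", SAMPLE_STOP_WORDS)
print(cleaned)
```

The order of the steps matters. Punctuation is stripped before the stop-word filter runs, so contractions lose their apostrophes first (`won't` becomes `wont`). That is a likely reason why tokens such as `dont` and `im` survive in the frequency count.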

In [16]:
# Creating a frequency count to track recurring words
pd.Series(" ".join(df["stopwords"]).split()).value_counts()[:30]

apple       154
phone        50
customer     38
iphone       34
support      33
service      32
dont         32
new          27
years        24
back         24
never        23
products     22
time         19
store        18
buy          18
use          18
company      17
2            16
im           16
money        16
problem      15
would        15
help         15
3            14
call         14
like         14
could        14
hours        14
going        13
get          13
dtype: int64
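
The same frequency count can be reproduced without pandas. A minimal sketch with `collections.Counter` follows; the sample texts are made up and `top_words` is a hypothetical helper name.

```python
from collections import Counter

def top_words(texts, n):
    """Return the n most frequent whitespace-separated tokens across all texts."""
    counts = Counter(word for text in texts for word in text.split())
    return counts.most_common(n)

# Made-up cleaned reviews, for illustration only
sample = ["apple phone support", "apple customer service", "apple phone"]
top = top_words(sample, 2)
print(top)
```

Scanning the most common tokens this way is a quick method for spotting candidates for a custom stop-word list.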

In [17]:
other_stop_words = ["get", "told"] # a lot more can be added, testing required
df["cleaned_review"] = df["stopwords"].apply(lambda x: " ".join(word for word in x.split() if word not in other_stop_words))
pd.Series(" ".join(df["cleaned_review"]).split()).value_counts()[:30]

apple       154
phone        50
customer     38
iphone       34
support      33
dont         32
service      32
new          27
years        24
back         24
never        23
products     22
time         19
use          18
buy          18
store        18
company      17
im           16
money        16
2            16
would        15
help         15
problem      15
hours        14
3            14
like         14
call         14
could        14
going        13
one          12
dtype: int64

In [18]:
df.head()

index,review,word_count,char_count,average_word_length,stopword-count,stopword-rate,lowercase,punctuation,stopwords,cleaned_review
0,I have received message from 20697 my iphone w...,71,441,5.225352,34,0.478873,i have received message from 20697 my iphone w...,i have received message from 20697 my iphone w...,received message 20697 iphone snatched 15augus...,received message 20697 iphone snatched 15augus...
1,Purchased a new MacBook Air a year ago & withi...,57,329,4.789474,23,0.403509,purchased a new macbook air a year ago & withi...,purchased a new macbook air a year ago within...,purchased new macbook air year ago within firs...,purchased new macbook air year ago within firs...
2,Longtime Apple user I’ve been lucky enough to ...,93,514,4.526882,42,0.451613,longtime apple user i’ve been lucky enough to ...,longtime apple user ive been lucky enough to n...,longtime apple user ive lucky enough need supp...,longtime apple user ive lucky enough need supp...
3,Apple won't help... I need to reset my securit...,80,406,4.0875,32,0.4,apple won't help... i need to reset my securit...,apple wont help i need to reset my security qu...,apple wont help need reset security questions ...,apple wont help need reset security questions ...
4,Apple sells very poor quality products. I had...,36,195,4.388889,11,0.305556,apple sells very poor quality products. i had ...,apple sells very poor quality products i had a...,apple sells poor quality products macbook air ...,apple sells poor quality products macbook air ...


## Lemmatization using TextBlob

In [19]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

from textblob import Word

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [20]:
df["lemmatized"] = df["cleaned_review"].apply(lambda x: " ".join(Word(word).lemmatize() for word in x.split()))

## Sentiment analysis

In [21]:
from textblob import TextBlob

In [22]:
# TextBlob returns two sentiment metrics:
# polarity: how negative (-1) or positive (+1) a review is
# subjectivity: from factual (0) to opinionated (1)
df["polarity"] = df["lemmatized"].apply(lambda x: TextBlob(x).sentiment.polarity)
df["subjectivity"] = df["lemmatized"].apply(lambda x: TextBlob(x).sentiment.subjectivity)

In [23]:
df.drop(["lowercase", "punctuation", "stopwords", "cleaned_review", "lemmatized"], axis=1, inplace=True)

In [26]:
df.describe()

statistic,word_count,char_count,average_word_length,stopword-count,stopword-rate,polarity,subjectivity
count,80.0,80.0,80.0,80.0,80.0,80.0,80.0
mean,87.6625,473.3875,4.401452,39.8625,0.453038,0.001603,0.480444
std,52.097266,287.268299,0.343675,23.729885,0.063197,0.265305,0.212866
min,14.0,77.0,3.70229,7.0,0.298701,-0.816667,0.0
25%,58.75,328.75,4.209081,24.0,0.411502,-0.129911,0.365278
50%,79.5,409.5,4.372685,35.0,0.452425,0.0,0.47399
75%,102.25,566.75,4.616591,48.0,0.5,0.175694,0.598295
max,371.0,1998.0,5.225352,155.0,0.586957,1.0,1.0


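The mean polarity is 0.001603, which is close to neutral on TextBlob's -1 to +1 scale rather than clearly negative, even though the lowest-scoring reviews below read as strongly negative.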

In [27]:
df.sort_values(by="polarity")

index,review,word_count,char_count,average_word_length,stopword-count,stopword-rate,polarity,subjectivity
11,"Really bad camera for iphone 13, I’m so disapp...",27,151,4.629630,10,0.370370,-0.816667,0.805556
24,Moved countries and will not let me reset pass...,79,404,4.126582,43,0.544304,-0.600000,0.900000
6,"Any Apple device is bulls*it, upgrade to iOS 1...",41,228,4.585366,15,0.365854,-0.496212,0.818182
19,Just wanted to say how pathetic it was that I ...,55,294,4.363636,27,0.490909,-0.466667,0.366667
61,I call for support get no support they have ve...,29,153,4.310345,11,0.379310,-0.437500,0.700000
...,...,...,...,...,...,...,...,...
2,Longtime Apple user I’ve been lucky enough to ...,93,514,4.526882,42,0.451613,0.259259,0.487037
36,I visited the Apple store today to have my Mac...,127,661,4.212598,73,0.574803,0.320000,0.420000
64,"I would give a zero if I could, I ordered some...",101,507,4.029703,53,0.524752,0.328571,0.378571
27,The presentation of their items is superb. loo...,14,77,4.571429,7,0.500000,0.900000,0.875000
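
Polarity scores are easier to summarise once they are bucketed into coarse labels. The sketch below is one possible way to do that. `label_polarity` is a hypothetical helper, and the 0.05 neutral band is an arbitrary assumption, not something TextBlob defines.

```python
def label_polarity(score, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse sentiment label.

    The 0.05 neutral band is an arbitrary assumption, not a TextBlob setting.
    """
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

# Quartile values taken from the df.describe() output above
for score in (-0.129911, 0.0, 0.175694):
    print(score, label_polarity(score))
```

In the notebook this could be applied with `df["polarity"].apply(label_polarity)`. Counting the labels would then show how the reviews split between negative, neutral and positive.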
