<a href="https://colab.research.google.com/github/KelvinLam05/cleaning_primark_reviews/blob/main/cleaning_primark_reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Web Scraper**

Web Scraper is a Chrome browser extension for extracting data from web pages. With it, we can create a plan (a sitemap) that describes how a website should be traversed and what should be extracted. The scraped data can then be exported as CSV.

**Sitemap used to scrape the Trustpilot reviews page:**

`{"_id":"primark_reviews","startUrl":["https://www.trustpilot.com/review/www.primark.co.uk?page=[2-54]"],"selectors":[{"id":"item_reviewed","parentSelectors":["_root"],"type":"SelectorText","selector":"li.breadcrumb_breadcrumb__lJO__:nth-of-type(4)","multiple":false,"delay":0,"regex":""},{"id":"review_wrapper","parentSelectors":["_root"],"type":"SelectorElement","selector":"article","multiple":true,"delay":0},{"id":"headline","parentSelectors":["review_wrapper"],"type":"SelectorText","selector":"a.styles_linkwrapper__73Tdy","multiple":false,"delay":0,"regex":""},{"id":"body","parentSelectors":["review_wrapper"],"type":"SelectorText","selector":"p","multiple":false,"delay":0,"regex":""},{"id":"rating","parentSelectors":["review_wrapper"],"type":"SelectorElementAttribute","selector":".star-rating_starRating__4rrcf img","multiple":false,"delay":0,"extractAttribute":"alt"},{"id":"date_published","parentSelectors":["review_wrapper"],"type":"SelectorText","selector":"time","multiple":false,"delay":0,"regex":""}]}`
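To see which fields the sitemap extracts and how they nest, the JSON can be inspected in Python. This is a minimal sketch using an abbreviated copy of the sitemap above (three of the six selectors, with some keys omitted for brevity); the variable names are illustrative only.

```python
import json

# Abbreviated copy of the sitemap above (three of the six selectors).
sitemap_json = '''
{"_id": "primark_reviews",
 "startUrl": ["https://www.trustpilot.com/review/www.primark.co.uk?page=[2-54]"],
 "selectors": [
   {"id": "review_wrapper", "parentSelectors": ["_root"],
    "type": "SelectorElement", "selector": "article", "multiple": true},
   {"id": "body", "parentSelectors": ["review_wrapper"],
    "type": "SelectorText", "selector": "p", "multiple": false},
   {"id": "date_published", "parentSelectors": ["review_wrapper"],
    "type": "SelectorText", "selector": "time", "multiple": false}
 ]}
'''

sitemap = json.loads(sitemap_json)

# Map each selector id to (parent, CSS selector) to show the traversal plan.
plan = {s["id"]: (s["parentSelectors"][0], s["selector"]) for s in sitemap["selectors"]}
for field, (parent, css) in plan.items():
    print(f"{parent} -> {field}: {css}")
```

Each review is an `article` element under the page root, and the text fields are read from inside that wrapper.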

**Load packages and data**

In [22]:
# Importing libraries
import pandas as pd
import numpy as np

In [23]:
# Load dataset
df = pd.read_csv('/content/customer_reviews_of_the_primark_online.csv')

In [24]:
# Examine the data
df.head()

| | web-scraper-order | web-scraper-start-url | item_reviewed | headline | body | rating | date_published |
|---|---|---|---|---|---|---|---|
| 0 | 1647082407-4370 | https://www.trustpilot.com/review/www.primark.... | Primark | Primark in Sutton Surrey has the rudest… | Primark in Sutton Surrey has the rudest manage... | Rated 1 out of 5 stars | Dec 26, 2021 |
| 1 | 1647082345-4114 | https://www.trustpilot.com/review/www.primark.... | Primark | It would seem I'm not on my own in… | It would seem I'm not on my own in losing a re... | Rated 1 out of 5 stars | Updated Oct 8, 2020 |
| 2 | 1647082387-4292 | https://www.trustpilot.com/review/www.primark.... | Primark | Coercing our tenagers into a untested… | Coercing our tenagers into a untested that has... | Rated 1 out of 5 stars | Aug 23, 2021 |
| 3 | 1647082387-4293 | https://www.trustpilot.com/review/www.primark.... | Primark | Recently bought to packs of boxers from… | Recently bought to packs of boxers from them. ... | Rated 1 out of 5 stars | Aug 22, 2021 |
| 4 | 1647082151-3370 | https://www.trustpilot.com/review/www.primark.... | Primark | Very good | Competative prices\n\nShops are messy, like a ... | Rated 3 out of 5 stars | Sep 16, 2011 |


In [25]:
# Overview of all variables and their data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1059 entries, 0 to 1058
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   web-scraper-order      1059 non-null   object
 1   web-scraper-start-url  1059 non-null   object
 2   item_reviewed          1059 non-null   object
 3   headline               1059 non-null   object
 4   body                   1043 non-null   object
 5   rating                 1059 non-null   object
 6   date_published         1059 non-null   object
dtypes: object(7)
memory usage: 58.0+ KB


**Pre-processing**

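We concatenate the headline and body columns of the Pandas DataFrame into a single column called text. Because concatenating a string with NaN yields NaN, any review with a missing body also ends up with a missing text value.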

In [26]:
df['text'] = df['headline'] + ' ' + df['body']
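The effect of a missing body on the concatenation can be checked on a small example. This is a sketch on made-up toy rows, not the scraped data. The `text_filled` column is a hypothetical alternative that keeps the headline when the body is missing; the notebook itself drops such rows later.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the scraped reviews; the second review has no body.
toy = pd.DataFrame({
    "headline": ["Very good", "Rude staff"],
    "body": ["Competitive prices", np.nan],
})

# Plain concatenation: a missing body makes the whole result missing.
toy["text"] = toy["headline"] + " " + toy["body"]

# Alternative: treat a missing body as an empty string so the headline survives.
toy["text_filled"] = (toy["headline"] + " " + toy["body"].fillna("")).str.strip()

print(toy[["text", "text_filled"]])
```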

In [27]:
# Extract publisher from the column web-scraper-start-url
df['publisher'] = df['web-scraper-start-url'].str.split('.').str[1]
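Splitting the whole URL on dots works here because every start URL begins with `https://www.trustpilot.com`. A slightly more defensive sketch parses the host name first with the standard library. It still assumes hosts of the form `www.<name>.<tld>`, and the sample URLs below are illustrative.

```python
from urllib.parse import urlparse

import pandas as pd

# Sample start URLs in the same shape as the scraped column.
urls = pd.Series([
    "https://www.trustpilot.com/review/www.primark.co.uk?page=2",
    "https://www.trustpilot.com/review/www.primark.co.uk?page=54",
])

# Parse the host name, then take the label after "www".
# Assumes hosts of the form www.<name>.<tld>.
publisher = urls.map(lambda u: urlparse(u).hostname.split(".")[1])
print(publisher.tolist())
```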

In [28]:
# Drop unwanted columns
df.drop(['web-scraper-order', 'web-scraper-start-url', 'headline', 'body'], axis = 1, inplace = True)

In [29]:
# Checking for missing values
df.isnull().sum().sort_values(ascending = False)

text              16
item_reviewed      0
rating             0
date_published     0
publisher          0
dtype: int64

In [30]:
# Drop all rows with NaN values
df = df.dropna()

In [31]:
# Reset the index
df = df.reset_index(drop = True)

In [32]:
# Extract rating from string
df['rating'] = df['rating'].str.slice(6, 7)

In [33]:
# Convert the rating column to an integer
df['rating'] = df['rating'].astype(int)
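Slicing by position relies on every value starting with exactly `Rated ` followed by one digit. As a sketch of an alternative that does not depend on character positions, a regular expression can capture the number directly. The sample strings are copied from the output above.

```python
import pandas as pd

ratings = pd.Series(["Rated 1 out of 5 stars", "Rated 3 out of 5 stars"])

# Capture the first run of digits after "Rated"; expand=False returns a Series.
extracted = ratings.str.extract(r"Rated (\d+)", expand=False).astype(int)
print(extracted.tolist())
```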

In [34]:
# Remove the 'Updated' prefix from the date_published column
df['date_published'] = df['date_published'].str.replace('Updated', '')

In [35]:
# Parse the date strings and keep only the calendar date
df['date_published'] = pd.to_datetime(df['date_published']).dt.date

To convert a string (object) column to the datetime64[ns] data type, we can use the Pandas to_datetime() function. The previous step left plain Python date objects in the column, so we convert it once more.

In [36]:
df['date_published'] = pd.to_datetime(df['date_published'])
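The date clean-up can also be done in a single pass. The sketch below uses sample values from the output above and assumes every date follows the pattern abbreviated month, day, four-digit year (for example `Oct 8, 2020`) with English month names.

```python
import pandas as pd

dates = pd.Series(["Dec 26, 2021", "Updated Oct 8, 2020", "Sep 16, 2011"])

# Drop the optional "Updated" prefix and any surrounding whitespace.
cleaned = dates.str.replace("Updated", "", regex=False).str.strip()

# Parse with an explicit format: abbreviated month, day, four-digit year.
parsed = pd.to_datetime(cleaned, format="%b %d, %Y")
print(parsed.dtype)
```

Passing an explicit `format` makes the parse fail loudly if a value does not match the expected pattern, rather than being interpreted some other way.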

In [37]:
# Change the order of the Pandas DataFrame columns
df.insert(0, 'text', df.pop('text'))
df.insert(1, 'rating', df.pop('rating'))
df.insert(3, 'publisher', df.pop('publisher'))
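An equivalent way to reorder columns is to select them in the desired order in one step. This is a sketch on a one-row toy frame with the same column names, not the notebook's data.

```python
import pandas as pd

# Toy frame with the columns in the order they have before reordering.
toy = pd.DataFrame({
    "item_reviewed": ["Primark"],
    "rating": [1],
    "date_published": pd.to_datetime(["2021-12-26"]),
    "text": ["Rude staff"],
    "publisher": ["trustpilot"],
})

# Select the columns in the desired order in a single step.
toy = toy[["text", "rating", "item_reviewed", "publisher", "date_published"]]
print(toy.columns.tolist())
```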

To confirm that the data has been preprocessed successfully, we can run head() and info().

In [38]:
df.head()

| | text | rating | item_reviewed | publisher | date_published |
|---|---|---|---|---|---|
| 0 | Primark in Sutton Surrey has the rudest… Prima... | 1 | Primark | trustpilot | 2021-12-26 |
| 1 | It would seem I'm not on my own in… It would s... | 1 | Primark | trustpilot | 2020-10-08 |
| 2 | Coercing our tenagers into a untested… Coercin... | 1 | Primark | trustpilot | 2021-08-23 |
| 3 | Recently bought to packs of boxers from… Recen... | 1 | Primark | trustpilot | 2021-08-22 |
| 4 | Very good Competative prices\n\nShops are mess... | 3 | Primark | trustpilot | 2011-09-16 |


In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1043 entries, 0 to 1042
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   text            1043 non-null   object        
 1   rating          1043 non-null   int64         
 2   item_reviewed   1043 non-null   object        
 3   publisher       1043 non-null   object        
 4   date_published  1043 non-null   datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 40.9+ KB


**Export the DataFrame to CSV format**

In [40]:
df.to_csv('/content/customer_reviews_of_the_primark_online_cleaned.csv', index = False, header = True)
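CSV stores dates as plain text, so the datetime type is lost when the file is read back unless the column is parsed again. The sketch below shows the round trip on a one-row toy frame, writing to an in-memory buffer instead of the file above.

```python
import io

import pandas as pd

cleaned = pd.DataFrame({
    "text": ["Very good Competitive prices"],
    "rating": [3],
    "date_published": pd.to_datetime(["2011-09-16"]),
})

# Write to an in-memory buffer instead of a file, without the index column.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# CSV stores dates as text, so ask read_csv to parse the column again.
restored = pd.read_csv(buffer, parse_dates=["date_published"])
print(restored.dtypes)
```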