## Web-scraping Online Reviews

In this notebook I scrape online reviews of `MoneyGram International` from the <a href="https://www.trustpilot.com/review/www.moneygram.com">Trustpilot</a> website. MoneyGram International is an American peer-to-peer payments and money transfer company. The reviews offer insight into how satisfied customers are with its service.

The dataset was scraped using the Requests and BeautifulSoup Python libraries.

**Import the necessary libraries**

In [1]:
import requests 
from bs4 import BeautifulSoup
import pandas as pd 

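Let's build the function that will navigate through the HTML tags and extract the data. Keep in mind that the function will differ depending on the structure of the website, since every website is built differently. The class names used below were copied from Trustpilot's markup at the time of scraping and may change.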

In [2]:
base_url = 'https://www.trustpilot.com/review/www.moneygram.com'

def get_reviews_from_page(url):
    reviews_df = []
    response = requests.get(url)

    if response.status_code == 200:

        soup = BeautifulSoup(response.text, 'html.parser')

        reviews = soup.find_all('div', class_="styles_reviewCardInner__EwDq2")

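        for review in reviews:
            # Reset every field so a missing tag never reuses the previous review's value
            name = rev = country_name = rating = None
            review_title = reviews_date = None
            consumer_info_wrapper = review.find('aside', "styles_consumerInfoWrapper__KP3Ra")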

            if consumer_info_wrapper:
                consumer_details_wrapper = consumer_info_wrapper.find('a', "link_internal__7XN06 link_wrapper__5ZJEx styles_consumerDetails__ZFieb")
                if consumer_details_wrapper:
                    consumer_name = consumer_details_wrapper.find('span', "typography_heading-xxs__QKBS8 typography_appearance-default__AAY17")
                    
                    if consumer_name:
                        name = consumer_name.get_text()
                    reviews_no = consumer_details_wrapper.find('div', 'styles_consumerExtraDetails__fxS4S')
                    if reviews_no:
                        rev_no = reviews_no.find('span', 'typography_body-m__xgxZ_ typography_appearance-subtle__8_H2l')
                        if rev_no:
                            rev = rev_no.get_text()

                        origin = reviews_no.find('div', 'typography_body-m__xgxZ_ typography_appearance-subtle__8_H2l styles_detailsIcon__Fo_ua')
                        if origin:
                            country = origin.find('span')
                            if country:
                                country_name = country.get_text(strip=True)

            section_wrapper = review.find('section', 'styles_reviewContentwrapper__zH_9M')
            if section_wrapper:
                review_rating_div = section_wrapper.find('div', 'styles_reviewHeader__iU9Px') 
                if review_rating_div:
                    star_rating_div = review_rating_div.find('div', 'star-rating_starRating__4rrcf star-rating_medium__iN6Ty')

                    if star_rating_div:
                        img_url = star_rating_div.find('img')
                        if img_url:
                            # The rating is stored in the image's alt text
                            rating = img_url.get('alt')

                reviews_wrapper = section_wrapper.find('div', 'styles_reviewContent__0Q2Tg')
                if reviews_wrapper:
                        
                        reviews_headings_link = reviews_wrapper.find('a', 'link_notUnderlined__szqki')
                        if reviews_headings_link:
                            reviews_headings_h2 = reviews_headings_link.find('h2', 'typography_heading-s__f7029 typography_appearance-default__AAY17')
                            if reviews_headings_h2:
                                review_title = reviews_headings_h2.get_text()

                        reviews_headings = reviews_wrapper.find('p', 'typography_color-black__5LYEn')
                        if reviews_headings:
                            reviews_text = reviews_headings.get_text()
                        else:
                            reviews_text = 'No review text'

                        reviews_date_wrapper = reviews_wrapper.find('p', 'typography_body-m__xgxZ_ typography_appearance-default__AAY17')
                        if reviews_date_wrapper:
                            reviews_date = reviews_date_wrapper.get_text()
                        
                        reviews_df.append({
                            "Author": name,
                            "Reviews": rev,
                            "Location": country_name,
                            "Ratings": rating,
                            "Review Title": review_title,
                            "Review Text": reviews_text,
                            "Date of Experience": reviews_date
                        })
                            
    # Always return a list (possibly empty) so callers can iterate over it safely
    return reviews_df
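
The pattern the function relies on (look up a tag by class, then fall back to a default when it is missing) can be illustrated on a small HTML snippet. This is only a minimal sketch: the markup, the class names `card`, `author` and `body`, and the helper `text_or_default` are invented for illustration and are not part of Trustpilot's pages or of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these class names are made up for the example
SAMPLE_HTML = """
<div class="card">
  <span class="author">Lester</span>
  <p class="body">Great rates.</p>
</div>
<div class="card">
  <span class="author">Laura</span>
</div>
"""

def text_or_default(parent, tag, class_name, default=None):
    """Return the stripped text of the first matching child, or `default`."""
    node = parent.find(tag, class_=class_name) if parent is not None else None
    return node.get_text(strip=True) if node is not None else default

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [
    {
        "Author": text_or_default(card, "span", "author"),
        # The second card has no body paragraph, so the default is used
        "Review Text": text_or_default(card, "p", "body", "No review text"),
    }
    for card in soup.find_all("div", class_="card")
]
print(rows)
```

Because every field gets a value on every iteration, a review with a missing tag cannot pick up a value left over from the previous review.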

After building the function, we test it on the first page of the website to see if it works.

In [4]:
# Test the function by getting their reviews from the first page
get_reviews_from_page(base_url)

[{'Author': 'Lester',
  'Reviews': '8 reviews',
  'Location': 'GB',
  'Ratings': 'Rated 5 out of 5 stars',
  'Review Title': 'Best one out there',
  'Review Text': 'For every transaction, I compare most of the top players in the exchange game and MoneyGram seems to triumph with the best rate and lowest fees. Quite a few businesses will show you an unbelievable exchange rate, but have high transfer fees, which basically comes down to lower than what MoneyGram offers. I guess MoneyGram understands the meaning of transparency. Kudos.',
  'Date of Experience': 'Date of experience: May 22, 2024'},
 {'Author': 'Laura',
  'Reviews': '3 reviews',
  'Location': 'US',
  'Ratings': 'Rated 5 out of 5 stars',
  'Review Title': 'Easy to use and convenient ',
  'Review Text': 'My mom used to purchase money orders at a local currency exchange branch, I am happy that I learned of MoneyGram as it saves us money on fees when sending funds to Mexico. It is also much more convenient. ',
  'Date of Experien

We were able to obtain the reviews on the first page, which shows that the function is working.

Now, since the reviews are spread across multiple pages, we have to write code that goes through every page and extracts the data.

In [5]:
# Start with the first page
page_number = 1
reviews_data = []
while True:
    # Construct the URL for the current page
    page_url = f"{base_url}?page={page_number}"
    
    # Get reviews from the current page
    reviews_dict_list = get_reviews_from_page(page_url)
    if reviews_dict_list:  # skip pages that failed to load or had no reviews
        reviews_data.append(reviews_dict_list)
    # Check if there is a next page (this logic depends on the website's structure).
    # Adjust the selector below to match the "next page" button.
    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    next_button = soup.find('a', 
                            class_='link_internal__7XN06 button_button__T34Lr button_m__lq0nA button_appearance-outline__vYcdF button_squared__21GoE link_button___108l pagination-link_next__SDNU4 pagination-link_rel__VElFy')
    
    if next_button:
        page_number += 1  # Move to the next page
    else:
        break  # No more pages

print("Scraping complete.")
#reviews_data

Scraping complete.
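
The loop above downloads each page twice: once for the reviews and once to look for the "next page" button. A possible alternative, sketched here under the assumption that a page past the end yields no reviews, is to stop as soon as a page comes back empty. The function `scrape_all_pages`, its `max_pages` safety cap and the `fake_site` stand-in are hypothetical names introduced for this example, and the stopping rule differs from the next-button check used in the notebook.

```python
def scrape_all_pages(fetch_page, max_pages=1000):
    """Collect rows page by page until a page yields nothing.

    `fetch_page` is any callable that takes a page number and returns a
    list of review dicts, or None / an empty list when there are none.
    """
    rows = []
    for page_number in range(1, max_pages + 1):
        page_rows = fetch_page(page_number)
        if not page_rows:
            break  # assumed to mean we ran past the last page
        rows.extend(page_rows)
    return rows

# Stand-in for the real site so the sketch runs without network access
fake_site = {1: [{"Author": "A"}, {"Author": "B"}], 2: [{"Author": "C"}]}
all_rows = scrape_all_pages(lambda n: fake_site.get(n))
print(len(all_rows))
```

With the scraper from this notebook, the callable could be `lambda n: get_reviews_from_page(f"{base_url}?page={n}")`. Note that this version returns a flat list of dictionaries rather than one list per page.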


Done! We managed to scrape without any errors. Now let's flatten the per-page lists of dictionaries into a single dataframe. If you run this code, keep in mind that you will probably get different results, because <a href="https://www.trustpilot.com">Trustpilot</a> is a dynamic website whose reviews are updated continuously.

In [16]:
# Convert the dict list to a pandas dataframe
import itertools
moneygram_reviews = pd.DataFrame(list(itertools.chain.from_iterable(reviews_data)))
moneygram_reviews.head()

| | Author | Reviews | Location | Ratings | Review Title | Review Text | Date of Experience |
|---|---|---|---|---|---|---|---|
| 0 | Lester | 8 reviews | GB | Rated 5 out of 5 stars | Best one out there | For every transaction, I compare most of the t... | Date of experience: May 22, 2024 |
| 1 | Laura | 3 reviews | US | Rated 5 out of 5 stars | Easy to use and convenient | My mom used to purchase money orders at a loca... | Date of experience: May 03, 2024 |
| 2 | MrSimps | 3 reviews | GB | Rated 5 out of 5 stars | Straightforward and Hassle Free | I was using another money transfer company and... | Date of experience: March 04, 2024 |
| 3 | Lydia Martinez | 5 reviews | US | Rated 5 out of 5 stars | I have been using Moneygram since the 80's. | I have been using Moneygram since the 80's to ... | Date of experience: May 06, 2024 |
| 4 | Anonymous | 1 review | US | Rated 5 out of 5 stars | What made my experience great is how… | What made my experience great is how easy, fas... | Date of experience: May 30, 2024 |


## Data Wrangling

**Import libraries**

As you may have noticed, three columns need to be fixed: `Reviews`, `Ratings` and `Date of Experience`.

* **Reviews:** remove the *reviews* text; the column should have an integer datatype.
* **Ratings:** keep only the number from *Rated - out of 5 stars*; the column should have an integer datatype.
* **Date of Experience:** remove the *Date of experience:* prefix; the column should have a datetime datatype.

In [18]:
from datetime import datetime
import re

# Remove the unwanted string characters from the date
moneygram_reviews['Date of Experience'] = [string[20:] for string in moneygram_reviews['Date of Experience']]
date_format = '%B %d, %Y'
moneygram_reviews['Date of Experience'] = [datetime.strptime(date, date_format) for date in moneygram_reviews['Date of Experience']]

In [21]:
# Extract number of reviews from text
reviews_pattern = r'\b\d+\b'
moneygram_reviews['Reviews'] = [int(re.search(reviews_pattern, review).group()) for review in moneygram_reviews['Reviews']]

In [22]:
# Extract the ratings
ratings_pattern = r'Rated (\d+) out of 5 stars'
moneygram_reviews['Ratings'] = [int(re.search(ratings_pattern, rating).group(1)) for rating in moneygram_reviews['Ratings']]
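
The same three cleaning steps can also be written with pandas' vectorized string methods instead of list comprehensions. The sketch below runs on a two-row illustrative sample (the `sample` dataframe is made up for the example, not scraped data), and it strips the date prefix by its text rather than by a fixed character count, which is an alternative to the slicing used above.

```python
import pandas as pd

# Illustrative rows in the same raw format as the scraped columns
sample = pd.DataFrame({
    "Reviews": ["8 reviews", "1 review"],
    "Ratings": ["Rated 5 out of 5 stars", "Rated 1 out of 5 stars"],
    "Date of Experience": [
        "Date of experience: May 22, 2024",
        "Date of experience: March 04, 2024",
    ],
})

# Pull the first run of digits out of "8 reviews" / "1 review"
sample["Reviews"] = sample["Reviews"].str.extract(r"(\d+)", expand=False).astype(int)

# Capture the star count from "Rated N out of 5 stars"
sample["Ratings"] = (
    sample["Ratings"].str.extract(r"Rated (\d+) out of 5", expand=False).astype(int)
)

# Drop the literal prefix, then parse dates such as "May 22, 2024"
sample["Date of Experience"] = pd.to_datetime(
    sample["Date of Experience"].str.replace("Date of experience: ", "", regex=False),
    format="%B %d, %Y",
)

print(sample.dtypes)
```

Note that `astype(int)` raises an error if any row fails to match the pattern, so rows with missing values would need to be handled first.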

In [28]:
moneygram_reviews.head(3)

| | Author | Reviews | Location | Ratings | Review Title | Review Text | Date of Experience |
|---|---|---|---|---|---|---|---|
| 0 | Lester | 8 | GB | 5 | Best one out there | For every transaction, I compare most of the t... | 2024-05-22 |
| 1 | Laura | 3 | US | 5 | Easy to use and convenient | My mom used to purchase money orders at a loca... | 2024-05-03 |
| 2 | MrSimps | 3 | GB | 5 | Straightforward and Hassle Free | I was using another money transfer company and... | 2024-03-04 |


In [25]:
moneygram_reviews.dtypes

Author                        object
Reviews                        int64
Location                      object
Ratings                        int64
Review Title                  object
Review Text                   object
Date of Experience    datetime64[ns]
dtype: object

Great! The dataset columns now have the proper datatypes. You can save the dataframe to a CSV file.


In [None]:
moneygram_reviews.to_csv("C:\\MoneyGram Reviews.csv", index=False)