# Scraping User Reviews from PakWheels

In this notebook we scrape user reviews from the PakWheels website using Python libraries such as `requests` and `BeautifulSoup`.

**Imports**

In [2]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd
import re
import os
from json import loads, dumps
from tqdm import tqdm
from time import sleep

## Scraping User Reviews

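The function below scrapes every user review for a car, identified by its make and model. It walks the paginated review pages starting at `start` and stops once the site redirects away from the requested page.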

In [34]:
def get_car_reviews(make, model, start=1):
    """Scrape all reviews for a make/model slug, beginning at page `start`."""
    all_reviews = []
    while True:
        url = f'https://www.pakwheels.com/new-cars/{make}/{model}/reviews/?page={start}'
        response = requests.get(url)
        # A final URL that differs from the requested one (a redirect) is treated as the end of pagination.
        if response.url != url:
            break
        soup = BeautifulSoup(response.content, 'lxml')
        for div in soup.find('div', class_='well mb20').find_all('div', class_='mb40'):
            header = div.find('div', class_='col-md-9')
            review_tagline = header.find('h3').text.strip()
            car_name = header.find('a', class_='generic-gray').text.strip()
            meta_info = header.find('p', class_='date').text.strip()
            # The familiarity line is optional.
            familiarity_div = header.find('div', class_='familiarity')
            familiarity = familiarity_div.text.strip() if familiarity_div is not None else ''
            # The review body and overall rating are read from JSON embedded in a <script> tag.
            try:
                review_json = loads(div.find('script').text)
            except (AttributeError, ValueError):
                review_json = {}
            try:
                review_body = review_json['reviewBody'].strip()
            except (KeyError, TypeError, AttributeError):
                review_body = ''
            try:
                overall_rating = review_json['reviewRating']['ratingValue']
            except (KeyError, TypeError):
                overall_rating = float('nan')
            # The five category ratings appear in a fixed order; each value is the image's alt text.
            rating_items = div.find('ul', class_=re.compile(r'review-rating list-unstyled clearfix')).find_all('li')
            style, comfort, fuel_economy, performance, value_for_money = (
                item.find('img').get('alt').strip() for item in rating_items[:5]
            )
            helpfulness = div.find('div', class_='col-md-6').find_all('div')[-1].text.strip()
            review_obj = {
                'car_model': car_name,
                'review_title': review_tagline,
                'reviewer_info': meta_info,
                'familiarity': familiarity,
                'review_text': review_body,
                'style': style,
                'comfort_rating': comfort,
                'fuel_economy': fuel_economy,
                'perfomance': performance,  # key spelling kept to match the existing dataset column
                'value_for_money': value_for_money,
                'overall_rating': overall_rating,
                'helpful_votes': helpfulness
            }
            all_reviews.append(review_obj)
        start += 1

    return all_reviews
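
The review body and overall rating are read from a JSON block embedded in each review's `<script>` tag. As a minimal sketch of that step in isolation, the hypothetical helper `parse_review_json` below parses such a string and falls back to the same defaults as the scraper (`''` and `nan`). The sample payload is made up for illustration; only the keys `reviewBody`, `reviewRating`, and `ratingValue` come from the scraper.

```python
import json

def parse_review_json(script_text):
    """Return (review_body, overall_rating) from a review's embedded JSON string."""
    try:
        data = json.loads(script_text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        data = {}
    if not isinstance(data, dict):
        data = {}
    # Missing or non-string body falls back to an empty string.
    body = data.get('reviewBody')
    body = body.strip() if isinstance(body, str) else ''
    # Missing rating falls back to NaN.
    rating_block = data.get('reviewRating')
    if isinstance(rating_block, dict):
        rating = rating_block.get('ratingValue', float('nan'))
    else:
        rating = float('nan')
    return body, rating

# Hypothetical payload, shaped like the fields the scraper reads.
sample = '{"reviewBody": " Great mileage. ", "reviewRating": {"ratingValue": 4}}'
body, rating = parse_review_json(sample)
missing_body, missing_rating = parse_review_json('not json')
print(repr(body), rating)
```

Parsing the JSON once and reading both fields from the result avoids decoding the same string twice per review.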

**Utility Functions**

In [7]:
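def create_directories(make, models):
    """Create reviews/<make>/<model> for every model of the given make."""
    for model in models:
        # exist_ok=True makes the call a no-op when the directory already exists.
        os.makedirs(os.path.join('reviews', make, model), exist_ok=True)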

In [None]:
def format_string(text):
    """Convert a display name to a URL slug, e.g. 'Mercedes Benz' -> 'mercedes-benz'."""
    return text.replace(' ', '-').lower()
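
To make the mapping concrete, here is a small sketch of how a display name becomes the URL that `get_car_reviews` requests. `to_slug` and `reviews_url` are hypothetical names used only for this illustration; the URL pattern is the one used in the scraper. Whether the resulting slug matches the site's actual URL for every model is not guaranteed.

```python
def to_slug(name):
    """Lowercase a display name and replace spaces with hyphens."""
    return name.replace(' ', '-').lower()

def reviews_url(make, model, page=1):
    """Build the review-page URL for a make/model pair (pattern taken from the scraper)."""
    return (f'https://www.pakwheels.com/new-cars/'
            f'{to_slug(make)}/{to_slug(model)}/reviews/?page={page}')

print(to_slug('Mercedes Benz'))          # mercedes-benz
print(reviews_url('Suzuki', 'Wagon R'))
# https://www.pakwheels.com/new-cars/suzuki/wagon-r/reviews/?page=1
```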

## Loading Make-Model List and Creating Directories

We load the list of car makes and models from a JSON file and create one directory per make/model pair under `reviews/`.

In [8]:
with open('json/make_model_list.json', 'r') as f:
    make_models_list = loads(f.read())

for entry in make_models_list:
    create_directories(entry['make'], entry['models'])
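
The loading code implies that `json/make_model_list.json` holds a list of objects, each with a `make` string and a `models` list. The sketch below assumes that shape, using made-up sample entries and a temporary directory rather than the real file, to show the directory tree that results. `create_tree` is a hypothetical stand-in for the notebook's directory helper.

```python
import json
import os
import tempfile

# Assumed file shape, inferred from how the notebook indexes the loaded data.
sample_json = '''
[
  {"make": "Suzuki", "models": ["Alto", "Wagon R"]},
  {"make": "Honda", "models": ["Civic"]}
]
'''

def create_tree(root, make_models):
    """Create <root>/<make>/<model> for every entry and return the created paths."""
    created = []
    for entry in make_models:
        for model in entry['models']:
            path = os.path.join(root, entry['make'], model)
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created

with tempfile.TemporaryDirectory() as root:
    paths = create_tree(root, json.loads(sample_json))
    # Compute these inside the block, before the temporary directory is removed.
    relative = sorted(os.path.relpath(p, root) for p in paths)
    all_exist = all(os.path.isdir(p) for p in paths)

print(relative)
```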

In [12]:
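# Scrape every make/model directory created above and save non-empty results as JSON.
for make in os.listdir('reviews'):
    for model in tqdm(os.listdir(os.path.join('reviews', make)), desc=f'Processing {make}'):
        try:
            reviews = get_car_reviews(format_string(make), format_string(model))
            if reviews:
                with open(os.path.join('reviews', make, model, 'reviews_list.json'), 'w') as f:
                    f.write(dumps(reviews))
            sleep(5)  # pause between models
        except Exception:
            # Print the failing make/model and move on; unlike a bare except,
            # this still lets KeyboardInterrupt stop the run.
            print(make, model)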

Processing Volkswagen: 100%|██████████| 7/7 [00:48<00:00,  6.99s/it]
Processing Porsche: 100%|██████████| 8/8 [00:55<00:00,  6.96s/it]
Processing Hino: 100%|██████████| 1/1 [00:07<00:00,  7.28s/it]
Processing Bugatti: 100%|██████████| 2/2 [00:13<00:00,  6.58s/it]
Processing Power: 100%|██████████| 1/1 [00:06<00:00,  6.80s/it]
Processing United: 100%|██████████| 2/2 [00:14<00:00,  7.30s/it]
Processing Mitsubishi:  50%|█████     | 12/24 [01:19<00:59,  4.94s/it]

Mitsubishi Minicab


Processing Mitsubishi:  88%|████████▊ | 21/24 [02:21<00:16,  5.36s/it]

Mitsubishi Lancer Evolution X


Processing Mitsubishi: 100%|██████████| 24/24 [02:42<00:00,  6.78s/it]
Processing Hummer: 100%|██████████| 3/3 [00:20<00:00,  6.93s/it]
Processing Renault: 100%|██████████| 1/1 [00:06<00:00,  6.75s/it]
Processing GMC: 100%|██████████| 2/2 [00:13<00:00,  6.70s/it]
Processing Ford:  83%|████████▎ | 10/12 [01:03<00:10,  5.12s/it]

Ford F 150


Processing Ford: 100%|██████████| 12/12 [01:16<00:00,  6.36s/it]
Processing Tank: 100%|██████████| 1/1 [00:06<00:00,  6.99s/it]
Processing Pontiac: 100%|██████████| 1/1 [00:06<00:00,  6.80s/it]
Processing Mercedes Benz:   3%|▎         | 1/29 [00:00<00:21,  1.31it/s]

Mercedes Benz GLA Class


Processing Mercedes Benz:  48%|████▊     | 14/29 [01:29<01:20,  5.34s/it]

Mercedes Benz CLS Class


Processing Mercedes Benz:  97%|█████████▋| 28/29 [03:05<00:05,  5.41s/it]

Mercedes Benz SLK Class


Processing Mercedes Benz: 100%|██████████| 29/29 [03:07<00:00,  6.46s/it]


Mercedes Benz Brabus


Processing Changan:  81%|████████▏ | 13/16 [01:38<00:17,  5.80s/it]

Changan Kaghan XL


Processing Changan: 100%|██████████| 16/16 [02:01<00:00,  7.60s/it]
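
In the output above, the make/model names printed between progress bars are the pairs for which `get_car_reviews` raised an exception. Because the loop prints only the name, the cause is lost. One possible variation, sketched here with a hypothetical `scrape_all` helper and a stand-in fetch function instead of real network calls, records the exception alongside each failed pair so the failures can be inspected or retried later.

```python
def scrape_all(pairs, fetch):
    """Call fetch(make, model) for each pair; collect results and failures separately."""
    results, failures = {}, []
    for make, model in pairs:
        try:
            results[(make, model)] = fetch(make, model)
        except Exception as exc:  # keep going, but remember why it failed
            failures.append((make, model, f'{type(exc).__name__}: {exc}'))
    return results, failures

def fake_fetch(make, model):
    """Stand-in for get_car_reviews. The error is simulated; the real cause
    of the failures in the notebook output is not known."""
    if model == 'Minicab':
        raise AttributeError("'NoneType' object has no attribute 'find_all'")
    return [{'car_model': f'{make} {model}'}]

results, failures = scrape_all(
    [('Mitsubishi', 'Lancer'), ('Mitsubishi', 'Minicab')], fake_fetch
)
for make, model, reason in failures:
    print(f'{make} {model} failed: {reason}')
```

Passing the fetch function as an argument keeps the bookkeeping testable without touching the network.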


**A sample of the user review data**

In [3]:
df = pd.read_csv('datasets/dataset_2.csv')
df.sample(5)

Unnamed: 0,make,model,car_model,review_title,reviewer_info,familiarity,review_text,style,comfort_rating,fuel_economy,perfomance,value_for_money,overall_rating,helpful_votes
330,Suzuki,Wagon R,2014 Suzuki Wagon R VXL,My WagonR Experience,"Posted by Izaan Siddiqui on Jun 26, 2014",Familiarity: I owned this car.,"Superior fuel economy,performance,comfort and ...",4 rating,5 rating,5 rating,5 rating,3 rating,4.0,(5 out of 5 people found this review helpful)
2231,Hyundai,Elantra,2021 Hyundai Elantra GLS,DRL Stopped Working and No Customer Support,"Posted by Waqas Dogar on Apr 13, 2023",Familiarity: I owned this car.,I am extremely disappointed with my 2021 Elant...,5 rating,5 rating,4 rating,4 rating,5 rating,4.0,(2 out of 2 people found this review helpful)
603,Suzuki,Alto,2012 Suzuki Alto VXR (CNG),Best CNG fuel econonmy car,"Posted by Muhammad Khalil on Feb 05, 2019",Familiarity: I owned this car.,Car is best for small family with amazing fuel...,3 rating,3 rating,5 rating,5 rating,4 rating,4.0,(1 out of 1 person found this review helpful)
302,Honda,Civic,2013 Honda Civic VTi Oriel Prosmatec 1.8 i-VTEC,civic,"Posted by Zain Zia on May 28, 2013",Familiarity: I owned this car.,Exterior: stylish n sporty look\r\n\r\nInterio...,5 rating,5 rating,5 rating,4 rating,4 rating,4.0,(7 out of 12 people found this review helpful)
2750,Suzuki,Cultus,2018 Suzuki Cultus VXR,Cultus 2019,"Posted by Najam on Jul 19, 2019",Familiarity: I owned this car.,Car looks great when you look at it from front...,3 rating,4 rating,4 rating,4 rating,3 rating,3.0,(2 out of 2 people found this review helpful)


In [4]:
print(f'Number of entries in the dataset: {df.shape[0]}')

Number of entries in the dataset: 4775


In [5]:
df.dtypes

make                object
model               object
car_model           object
review_title        object
reviewer_info       object
familiarity         object
review_text         object
style               object
comfort_rating      object
fuel_economy        object
perfomance          object
value_for_money     object
overall_rating     float64
helpful_votes       object
dtype: object
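
Apart from `overall_rating`, every column is stored as text (`object`), including the category ratings (`'4 rating'`), the vote counts, and the posting date. A possible next step, sketched below with hypothetical helper names and standard-library parsing only, is to turn these strings into numbers and dates. The string formats are assumed from the sample rows shown earlier; strings that do not match return `None`.

```python
import re
from datetime import datetime

def parse_rating(text):
    """Extract the integer from a string such as '4 rating'."""
    match = re.fullmatch(r'\s*(\d+)\s+rating\s*', text)
    return int(match.group(1)) if match else None

def parse_helpful_votes(text):
    """Extract (helpful, total) from '(7 out of 12 people found this review helpful)'."""
    match = re.search(r'(\d+) out of (\d+) (?:people|person) found', text)
    return (int(match.group(1)), int(match.group(2))) if match else None

def parse_reviewer_info(text):
    """Split 'Posted by <name> on <Mon DD, YYYY>' into (name, date)."""
    match = re.fullmatch(r'Posted by (.+) on (\w{3} \d{1,2}, \d{4})', text.strip())
    if not match:
        return None
    # %b assumes English month abbreviations (the default C locale).
    posted = datetime.strptime(match.group(2), '%b %d, %Y').date()
    return match.group(1), posted

print(parse_rating('4 rating'))
print(parse_helpful_votes('(7 out of 12 people found this review helpful)'))
print(parse_reviewer_info('Posted by Zain Zia on May 28, 2013'))
```

With pandas, helpers like these could be applied column-wise, for example `df['style'].map(parse_rating)`. Missing values (NaN) would need to be dropped or filled first, since the helpers expect strings.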