# Scraping Ads from PakWheels

In this notebook we scrape ads (cars listings) from the PakWheels website using Python libraries such as `requests` and `BeautifulSoup`.

**Imports**

In [2]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
import os
from json import loads, dumps
from tqdm import tqdm

## Scraping User Reviews

A function to scrape all listings for a car of specific body type.

In [2]:
def scrape_ad(url, b_type):
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')

        featured = 1 if soup.find('div', class_='mb40 pos-rel').find('div', class_='featured-ribbon pointer') else 0
        car_name = soup.find('h1').text
        location = soup.find('p', class_='detail-sub-heading').find('a').text.strip()
        car_specifics = soup.find('table', class_=re.compile(r'table table-bordered text-center table-engine-detail fs16')).find_all('td')
        model = car_specifics[0].text.strip()
        mileage = car_specifics[1].text.strip()
        engine_type = car_specifics[2].text.strip()
        transmission = car_specifics[3].text.strip()
        try:
            car_features = soup.find('ul', class_=re.compile(r'list-unstyled car-feature-list nomargin')).find_all('li')
            features = [feature.text.strip() for feature in car_features]
        except:
            features = []
        sellers_comments = soup.find('h2', id='scroll_seller_comments').find_next_sibling('div').text.strip()
        car_details = soup.find('ul', id='scroll_car_detail').find_all('li')
        details = {car_details[idx].text.strip(): car_details[idx + 1].text.strip() for idx in range(0, len(car_details), 2)}
        ad_no = details['Ad Ref #']
        del details['Ad Ref #']
        price = soup.find('div', class_='price-box').text.strip()
        seller_details = soup.find('div', class_='owner-detail-main').text.strip().split('\n\n')[0]

        return {
            'Ad Ref': ad_no,
            'url': url,
            'Featured': featured,
            'Vehicle': car_name,
            'Location': location,
            'Model': model,
            'Vehicle Type': b_type,
            'Mileage': mileage,
            'Engine Type': engine_type,
            'Transmission': transmission,
            'Features': features,
            'Details': details,
            'Price': price,
            'Seller Details': seller_details,
            "Seller's Comments": sellers_comments,
        }
    except:
        return {
            'url': url
        }

## Scraping Ads in Parallel

This function takes a list of URLs and a type of advertisement, and it scrapes the data concurrently. 

In [18]:
def scrape_ads_in_parallel(urls, b_type):
    results = []  # Use a list to hold only valid results
    with ThreadPoolExecutor(max_workers=10) as executor:
        future_to_index = {executor.submit(scrape_ad, url, b_type): index for index, url in enumerate(urls)}

        # Initialize tqdm progress bar
        with tqdm(total=len(urls), desc=f"Scraping {b_type}", unit="ad") as pbar:
            for future in as_completed(future_to_index):
                index = future_to_index[future]
                result = future.result()
                if result is not None:  # Only append valid results
                    results.append(result)
                pbar.update(1)  # Update progress bar

    return results

In [5]:
for body_type in os.listdir('data'):
    if 'urls.txt' in os.listdir(os.path.join('data', body_type)) and 'data.json' not in os.listdir(os.path.join('data', body_type)) and body_type in ['Hatchback']:
        with open(os.path.join('data', body_type, 'urls.txt'), 'r') as f:
            urls = f.read().strip().split('\n')
        results = scrape_ads_in_parallel(urls[:10], body_type)
        with open(os.path.join('data', body_type, 'data.json'), 'w') as f:
            f.write(dumps(results))


Scraping Hatchback: 100%|██████████| 10/10 [00:02<00:00,  3.64ad/s]


**A sample from cars listing data**

In [5]:
df = pd.read_csv('datasets/dataset_1.csv')
df.sample(5)

Unnamed: 0,Ad Ref,url,Featured,Vehicle,Location,Model,Vehicle Type,Mileage,Engine Type,Transmission,Features,Details,Price,Seller Details,Seller's Comments
24086,9329974,https://www.pakwheels.com/used-cars/toyota-cor...,0,Toyota Corolla XLi VVTi 2014,"Nazimabad, Karachi Sindh",2014,Sedan,"113,000 km",Petrol,Manual,"['Air Conditioning', 'CD Player', 'Power Mirro...","{'Registered In': 'Karachi', 'Color': 'White',...",PKR 22.9 lacs,"Shani\nMember Since Oct 18, 2024",corrola XLI convert GLIoutr 70% original inne...
33018,9261499,https://www.pakwheels.com/used-cars/honda-city...,0,Honda City 1.3 i-VTEC Prosmatec 2021,"DHA Phase 5, Lahore Punjab",2021,Sedan,"41,000 km",Petrol,Automatic,"['ABS', 'AM/FM Radio', 'Air Conditioning', 'CD...","{'Registered In': 'Lahore', 'Color': 'Maroon',...","PKR 41 lacs\n\nFinancing starts at PKR 102,666...","Salman\nMember Since Jan 03, 2019",Honda City 1.3 prosmatec. Only front bumper pa...
17946,9330761,https://www.pakwheels.com/used-cars/toyota-aqu...,1,Toyota Aqua G 2021,Karachi Sindh,2021,Hatchback,"25,000 km",Hybrid,Automatic,"['ABS', 'AM/FM Radio', 'Air Bags', 'Air Condit...","{'Registered In': 'Un-Registered', 'Color': 'B...",PKR 83 lacs,Dealer:\nGAZIANI AUTOMOBILES,Toyota AquaG packageNEW SHAPEModel: 2021Fresh ...
35940,9042959,https://www.pakwheels.com/used-cars/honda-city...,0,Honda City EXi 2001,"Peoples Colony, Gujranwala Punjab",2001,Sedan,"230,000 km",Petrol,Manual,"['AM/FM Radio', 'Air Conditioning', 'Cassette ...","{'Registered In': 'Gujranwala', 'Color': 'Whit...",PKR 10.3 lacs,"Umer Akbar\nMember Since Oct 07, 2023",Urgent sale\n Mention PakWheels.com when ca...
32499,9282253,https://www.pakwheels.com/used-cars/toyota-pra...,0,Toyota Prado TX Limited 2.7 2020,"Hayatabad, Peshawar KPK",2020,SUV,"25,438 km",Petrol,Automatic,"['ABS', 'AM/FM Radio', 'Air Bags', 'Air Condit...","{'Registered In': 'Un-Registered', 'Color': 'B...",PKR 2.75 crore,"Jan Malang\nMember Since Oct 06, 2024",Urgent Sell Toyota Prado TX Limited Edition 2....


In [6]:
print(f'Number of entries in the dataset: {df.shape[0]}')

Number of entries in the dataset: 53784


In [9]:
df.dtypes

Ad Ref                int64
url                  object
Featured              int64
Vehicle              object
Location             object
Model                 int64
Vehicle Type         object
Mileage              object
Engine Type          object
Transmission         object
Features             object
Details              object
Price                object
Seller Details       object
Seller's Comments    object
dtype: object