# Importing Required Libraries

We start by importing the libraries used for fetching pages, parsing HTML, decoding JSON, and handling tabular data:

- `requests`: Used to send HTTP requests to websites and retrieve the content.
- `BeautifulSoup` from `bs4`: A library for parsing HTML and XML documents. It's useful for extracting data from web pages.
- `json`: A built-in Python library for working with JSON data.
- `pandas`: A powerful data manipulation and analysis library, especially useful for handling tabular data.


In [1]:
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

# Initializing an Empty List

We initialize an empty list called `extracted_data`. As we loop through the web pages, each record extracted from a page is appended to this list as a dictionary, so it accumulates the full dataset across all pages.


In [2]:
extracted_data = []

# Defining a Function to Fetch Web Content

We define a function named `res` that takes a URL as an argument. This function sends an HTTP GET request to the specified URL using the `requests.get()` method and returns the server's response.

- `url`: The web address from which we want to fetch the content.
- The function returns the `Response` object, which carries the status code, the headers, and the body of the reply (here, the HTML of the page).


In [3]:
def res(url):
    # A timeout keeps one stalled connection from hanging the whole scrape.
    return requests.get(url, timeout=30)
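
A request can also fail outright (connection error, timeout, HTTP error status). The sketch below is one possible, more defensive variant of `res`, not part of the original notebook: the name `fetch`, its `session` and `timeout` parameters, and the `OfflineSession` stand-in are all hypothetical and exist only for illustration. It returns `None` instead of raising, so a caller could skip the page and carry on.

```python
import requests


def fetch(url, session=None, timeout=30):
    """Return the response for url, or None if the request fails."""
    # Use the given session if there is one, otherwise the requests module itself.
    http = session if session is not None else requests
    try:
        response = http.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx statuses into exceptions
    except requests.RequestException as exc:
        print(f"Request failed for {url}: {exc}")
        return None
    return response


class OfflineSession:
    """Hypothetical stand-in that always fails, so the demo needs no network."""

    def get(self, url, timeout=None):
        raise requests.ConnectionError("network unavailable")


result = fetch("https://example.com/", session=OfflineSession())
print(result)  # None
```

Passing a real `requests.Session()` as `session` would reuse the underlying connection across pages, which may help when fetching many pages from the same host.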

# Defining a Function to Extract JSON Data from HTML Content

The function `extract_json_from_soup` is designed to extract structured data from a webpage by parsing JSON objects embedded within `<script>` tags. Here's how the function works:

1. **Finding `<script>` Tags**:
   - The function first searches for all `<script>` tags in the HTML content with the attribute `type='application/ld+json'`. These tags typically contain JSON-LD data, which is a common format for embedding structured data into web pages.

2. **Looping Through Script Tags**:
   - The function iterates through each of these `<script>` tags. An index `i` tracks the iteration and is printed as a simple progress indicator.

3. **Parsing the JSON Content**:
   - Inside the loop, the function tries to load the JSON content using `json.loads()`. If successful, it reads the following fields with `.get()`, so any field missing from the JSON object is stored as `None`:
     - `description`: A brief description of the item.
     - `manufacturer`: The manufacturer of the item.
     - `modelDate`: The model date of the item.
     - `engineDisplacement`: The engine displacement of the vehicle, read from the nested `vehicleEngine` object.
     - `price`: The price of the item, read from the nested `offers` object.
     - `mileageFromOdometer`: The mileage from the odometer.
     - `vehicleTransmission`: The type of transmission in the vehicle.
     - `fuelType`: The type of fuel the vehicle uses.
     - `itemCondition`: The condition of the item.
     - `url`: The URL where the item is listed, also read from `offers`.

4. **Appending Extracted Data**:
   - The extracted information is then stored in a dictionary named `data` and appended to the `extracted_data` list, which was initialized earlier.

5. **Handling JSON Decoding Errors**:
   - If the JSON content cannot be decoded (e.g., due to a syntax error), the resulting `json.JSONDecodeError` is caught, an error message is printed, and the loop moves on to the next tag.

This function turns the JSON-LD embedded in each page into flat dictionaries, one per `<script>` tag, that can later be loaded into a DataFrame.


In [4]:
def extract_json_from_soup(soup):
    # Find all <script> tags with type "application/ld+json"
    script_tags = soup.find_all('script', type='application/ld+json')
    for i, script in enumerate(script_tags):
        print(i, end=" ")  # progress indicator
        try:
            # script.string is None for an empty tag; "" makes json.loads
            # raise JSONDecodeError instead of TypeError.
            json_content = json.loads(script.string or "")
        except json.JSONDecodeError as e:
            print(f"Failed to decode JSON: {e}")
            continue

        # Skip JSON-LD blocks that are not a single object (e.g. a list).
        if not isinstance(json_content, dict):
            continue

        # Nested objects may be absent or null; fall back to empty dicts.
        engine = json_content.get('vehicleEngine')
        offers = json_content.get('offers')
        engine = engine if isinstance(engine, dict) else {}
        offers = offers if isinstance(offers, dict) else {}

        # Extract the relevant information
        data = {
            "description": json_content.get('description'),
            "manufacturer": json_content.get('manufacturer'),
            "modelDate": json_content.get('modelDate'),
            "engineDisplacement": engine.get('engineDisplacement'),
            "price": offers.get('price'),
            "mileageFromOdometer": json_content.get('mileageFromOdometer'),
            "vehicleTransmission": json_content.get('vehicleTransmission'),
            "fuelType": json_content.get('fuelType'),
            "itemCondition": json_content.get('itemCondition'),
            "url": offers.get('url'),
        }

        extracted_data.append(data)
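
To see what the function works with, the sketch below runs the same lookup on a small inline page instead of a live one. The HTML fragment and every value in it are made up for illustration, and only a few of the fields are read; real listing pages will carry more data and may be structured differently.

```python
import json
from bs4 import BeautifulSoup

# Made-up page fragment: one valid JSON-LD block, one broken one,
# and one ordinary script tag that the lookup should ignore.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@type": "Car", "manufacturer": "Honda", "modelDate": 2017,
 "vehicleEngine": {"engineDisplacement": "1500cc"},
 "offers": {"price": 100, "url": "https://example.com/listing/1"}}
</script>
<script type="application/ld+json">{not valid json}</script>
<script type="text/javascript">var x = 1;</script>
</head></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")
rows = []
for script in soup.find_all("script", type="application/ld+json"):
    try:
        payload = json.loads(script.string or "")
    except json.JSONDecodeError:
        continue  # skip blocks that are not valid JSON
    rows.append({
        "manufacturer": payload.get("manufacturer"),
        "engineDisplacement": payload.get("vehicleEngine", {}).get("engineDisplacement"),
        "price": payload.get("offers", {}).get("price"),
        "fuelType": payload.get("fuelType"),  # absent in the sample, so None
    })

print(rows)
```

Only the first block produces a row: the second fails to decode and the third does not match the `type` filter. Fields missing from the JSON object come back as `None`, which is why empty cells can show up in the final table.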

# Scraping Multiple Pages and Extracting Data

This section iterates through multiple pages of used-car listings and extracts data from each page. Here's a breakdown of the process:

1. **Setting the Base URL**:
   - The variable `bas_url` holds the base URL for the used-car search on the website. It ends with `?page=`, so the page number can be appended directly.

2. **Looping Through Pages**:
   - A `for` loop is used to iterate through page numbers from 1 to 2199. For each page number, the URL is constructed by appending the page number to the `bas_url`.

3. **Fetching Web Content**:
   - The constructed URL is passed to the `res()` function, which sends an HTTP GET request to retrieve the page content.

4. **Checking Response Status**:
   - The code checks if the HTTP response status code is 200 (indicating a successful request). If the status code is 200, it proceeds to parse the content.

5. **Parsing HTML Content**:
   - The HTML content of the response is parsed using `BeautifulSoup` with the `html.parser` option.

6. **Extracting Data**:
   - The `extract_json_from_soup()` function is called to extract structured JSON data from the parsed HTML content.

7. **Printing URL and Response**:
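   - The URL and the HTTP response object are printed for every page, whether or not the request succeeded. Because `extract_json_from_soup()` prints its tag indices without a newline, they appear on the same output line, just before the URL.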

This loop helps to collect data from a large number of pages and is useful for gathering extensive datasets for analysis.


In [5]:
bas_url = "https://www.pakwheels.com/used-cars/search/-/?page="
for i in range(1, 2200):
    url = bas_url + str(i)
    response = res(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        extract_json_from_soup(soup)
    print(url, "     ", response)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 https://www.pakwheels.com/used-cars/search/-/?page=1       <Response [200]>
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 https://www.pakwheels.com/used-cars/search/-/?page=2       <Response [200]>
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 https://www.pakwheels.com/used-cars/search/-/?page=3       <Response [200]>
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 https://www.pakwheels.com/used-cars/search/-/?page=4       <Response [200]>
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 https://www.pakwheels.com/used-cars/search/-/?page=5       <Response [200]>
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 https://www.pakwheels.c
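
With a couple of thousand pages, it can help to pause between requests and to record which pages did not return status 200 so they can be retried later. The following is a minimal sketch of that idea, not the notebook's own loop: `scrape_pages`, its parameters, `FakeResponse`, and `fake_fetch` are hypothetical names, and the demonstration uses a stand-in fetch function so it runs without network access.

```python
import time


def scrape_pages(base_url, first_page, last_page, fetch, handle, delay=1.0):
    """Fetch base_url + page number for each page; return the failed page numbers.

    fetch(url) must return an object with a status_code attribute;
    handle(response) is called only for responses with status 200.
    """
    failed_pages = []
    for page in range(first_page, last_page + 1):
        url = f"{base_url}{page}"
        response = fetch(url)
        if response.status_code == 200:
            handle(response)
        else:
            failed_pages.append(page)
        time.sleep(delay)  # pause between requests to go easy on the server
    return failed_pages


# Offline demonstration with a stand-in fetch function (no network needed).
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code


def fake_fetch(url):
    # Pretend that page 2 is temporarily unavailable.
    return FakeResponse(503 if url.endswith("=2") else 200)


handled = []
failed = scrape_pages("https://example.com/search?page=", 1, 3,
                      fake_fetch, handled.append, delay=0)
print(failed)  # [2]
```

In the notebook, `fetch` could be `res`, and `handle` could be a small function that builds the `BeautifulSoup` object from `response.content` and passes it to `extract_json_from_soup()`.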

# Creating and Displaying a DataFrame

In this section, we convert the extracted data into a pandas DataFrame and print it for review:

1. **Creating the DataFrame**:
   - The `pd.DataFrame(extracted_data)` command creates a DataFrame from the `extracted_data` list, which contains dictionaries of extracted information from the web pages.

2. **Printing the DataFrame**:
   - The `print(df)` command displays the DataFrame in the console, allowing us to inspect the structured data collected from the web pages.

The DataFrame provides a tabular representation of the data, making it easier to analyze and work with.


In [6]:
df = pd.DataFrame(extracted_data)
print(df)

                                    description manufacturer  modelDate  \
0                                          None         None        NaN   
1           Honda Civic 2017 for sale in Lahore        Honda     2017.0   
2       Toyota Vitz 2018 for sale in Bahawalpur       Toyota     2018.0   
3       Toyota Corolla 2010 for sale in Karachi       Toyota     2010.0   
4       Suzuki Ciaz 2017 for sale in Bahawalpur       Suzuki     2017.0   
...                                         ...          ...        ...   
65240     Suzuki Swift 2017 for sale in Chakwal       Suzuki     2017.0   
65241    Prince Pearl 2022 for sale in Peshawar       Prince     2022.0   
65242     Honda City 2007 for sale in Islamabad        Honda     2007.0   
65243  Suzuki Mehran 2018 for sale in Islamabad       Suzuki     2018.0   
65244   Suzuki Wagon R 2017 for sale in Karachi       Suzuki     2017.0   

      engineDisplacement      price mileageFromOdometer vehicleTransmission  \
0                   
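
Row 0 of the output above shows `None` in the visible columns, which suggests that some JSON-LD blocks on a page do not describe a car listing. One optional way to handle this, sketched here under that assumption, is to drop records in which every field is `None` before building the DataFrame. The helper name `drop_empty_rows` is hypothetical, and the sample records are shortened to three fields.

```python
def drop_empty_rows(rows):
    """Keep only the records that have at least one non-None value."""
    return [row for row in rows
            if any(value is not None for value in row.values())]


sample = [
    {"description": None, "manufacturer": None, "price": None},
    {"description": "Honda Civic 2017 for sale in Lahore",
     "manufacturer": "Honda", "price": None},
]

cleaned = drop_empty_rows(sample)
print(len(sample), "->", len(cleaned))  # 2 -> 1
```

Applied to the notebook's data, this would be `pd.DataFrame(drop_empty_rows(extracted_data))`. Records with only some fields missing are kept.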

# Saving Data to a CSV File

This section saves the collected data to a CSV file for further analysis or storage:

1. **Saving the DataFrame**:
   - The `df.to_csv('extracted_data.csv', index=False)` command writes the DataFrame `df` to a CSV file named `extracted_data.csv`.
   - The `index=False` parameter ensures that the DataFrame index is not included as a separate column in the CSV file.

This step allows you to export the structured data to a widely used format that can be easily shared, analyzed, or imported into other applications.


In [7]:
df.to_csv('extracted_data.csv', index=False)
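
A quick way to confirm that the export worked is to read the file back and compare its shape with the DataFrame. The sketch below shows the round trip on a tiny sample written to an in-memory buffer instead of a file; the sample rows, including the price, are made up for illustration.

```python
import io

import pandas as pd

# Made-up sample rows; the second one has no price.
sample_rows = [
    {"manufacturer": "Honda", "modelDate": 2017, "price": 1000000},
    {"manufacturer": "Toyota", "modelDate": 2018, "price": None},
]
df_sample = pd.DataFrame(sample_rows)

# Write to an in-memory text buffer with the same arguments as above.
buffer = io.StringIO()
df_sample.to_csv(buffer, index=False)

# Read the CSV text back and check that nothing was lost.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.shape == df_sample.shape)  # True
print(list(restored.columns))
```

Missing values are written as empty fields and come back as `NaN`. Because `index=False` was used, the restored frame has no extra index column.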