# Overview

## Topic: Analyzing the cars sold on Carvago, the largest used car market with vehicles from all over Europe.



### Members





### Brief introduction

Carvago is the largest platform for used car transactions in Europe, which makes it a rich source of data. We collect prices, detailed technical specifications, top equipment features, and ratings for each listing. With this dataset we aim to build a broad picture of the European used car market: the market share of electric versus traditional fuel vehicles, engine performance across different segments, and the equipment features that characterize top-rated vehicles. We also analyze how these equipment features affect the overall rating of a car.

We are also developing a price prediction model to help users find accurate and reliable pricing information. By exposing the pricing dynamics of the used car market, the model should help buyers make informed decisions.


## License
The data is collected from [Carvago](https://carvago.com/) for educational and research purposes. All information used complies with the regulations and policies of Carvago. The user of this data is responsible for ensuring that their usage complies with all relevant laws and regulations.



In [None]:
#import
import requests
import re
from bs4 import BeautifulSoup
import json
import pandas as pd
import numpy as np


# Crawl data

## Crawl list of links

Before crawling information about cars, we need to observe how the website operates. 

When we open the 'https://carvago.com/cars' page, a list of cars is displayed, sorted newest first. At the top right corner there is a navigation bar for moving to the next page. Each page displays 20 cars, and inspecting the source code shows that the page contains 20 links, one to the details page of each car.

Additionally, we can navigate between pages using the following URL format: https://carvago.com/cars?page={i}, where i represents the page we want to move to.

Before crawling detailed information about the cars, we need the URL of each car. We therefore iterate through these pages and extract each car's ID and slug from the page source to build its link.
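As a minimal sketch of the pagination arithmetic described above, the snippet below builds the listing-page URLs needed to cover a target number of cars. The helper name `listing_page_urls` and the constant names are hypothetical; the URL format and the 20-cars-per-page figure come from the observations above.

```python
import math

# URL format and page size as observed on the listing pages.
BASE_URL = "https://carvago.com/cars?page={}"
CARS_PER_PAGE = 20

def listing_page_urls(n_cars, per_page=CARS_PER_PAGE):
    """Return the listing-page URLs needed to cover n_cars cars."""
    n_pages = math.ceil(n_cars / per_page)
    return [BASE_URL.format(page) for page in range(1, n_pages + 1)]

urls = listing_page_urls(3000)
print(len(urls), urls[0], urls[-1])
```

With 20 cars per page, 3000 cars require 150 pages, which matches the page range used in the crawl.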

In [None]:
# Crawl the URLs of 3000 cars and save them to data/links.txt
def get_id(url):

    # Open the output file for writing
    with open('data/links.txt', 'w') as f:

        # Each listing page holds links to 20 cars, so 150 pages give 3000 links
        for i in range(1, 151):
            print(i)
            response = requests.get(url.format(i))

            # Raw source of the listing page
            text = response.text

            # Find every "id"/"slug" pair in the page source
            matches = re.findall(r'"id":"(\d{8})","slug":"([\w-]+)",', text)

            for match in matches:
                # car_id avoids shadowing the built-in id()
                car_id, slug = match
                # Write the link to the file instead of printing it
                f.write(f'https://carvago.com/car/{car_id}/{slug}\n')

# Call the function with a URL template that contains the page-number placeholder
#get_id('https://carvago.com/cars?page={}')

After successfully crawling links to detailed information for each car, we will save all of them to the file **data/links.txt** in preparation for the next step.
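To illustrate how a link is assembled from an `id`/`slug` pair, here is a small offline sketch that applies the same regular expression to a made-up fragment. The `sample` string is hypothetical and only imitates the JSON embedded in a listing page; the de-duplication step is an optional addition, not part of the original crawl.

```python
import re

# Same pattern as in the crawl: an 8-digit id followed by a slug.
PAIR_PATTERN = r'"id":"(\d{8})","slug":"([\w-]+)",'

def extract_links(text):
    """Build car detail links from id/slug pairs, dropping repeats."""
    links = [
        f"https://carvago.com/car/{car_id}/{slug}"
        for car_id, slug in re.findall(PAIR_PATTERN, text)
    ]
    # dict.fromkeys keeps the first occurrence and preserves order
    return list(dict.fromkeys(links))

# Hypothetical page fragment (the third entry repeats the first).
sample = (
    '{"id":"12345678","slug":"skoda-octavia-2-0-tdi","price":1},'
    '{"id":"87654321","slug":"bmw-320d","price":2},'
    '{"id":"12345678","slug":"skoda-octavia-2-0-tdi","price":1}'
)

links = extract_links(sample)
print(links)
```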

## Crawl detailed information for each car

After obtaining the paths to the details of the cars to be crawled, we will implement a function to crawl the detailed information.
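Part of the detail page is read with a regular expression rather than through HTML tags. The sketch below shows how the price pattern used in the crawl function behaves on a made-up fragment; the helper name `parse_price` and the `sample` string are hypothetical, and the real page source may differ.

```python
import re

# Same pattern as in the crawl function, split over two lines for readability.
PRICE_PATTERN = (
    r'"price":(\d+),"price_currency":\{"id":(\d+),'
    r'"name":"(.*?)","const_key":"CURRENCY_(.*?)"\}'
)

def parse_price(text):
    """Return (price, currency_name), or (None, None) if nothing matches."""
    match = re.search(PRICE_PATTERN, text)
    if match is None:
        return None, None
    price, _currency_id, currency_name, _const_key = match.groups()
    return int(price), currency_name

# Hypothetical fragment imitating the JSON embedded in a detail page.
sample = (
    '..."price":15999,"price_currency":'
    '{"id":1,"name":"Euro","const_key":"CURRENCY_EUR"}...'
)

print(parse_price(sample))
print(parse_price("<html>no price here</html>"))
```

Returning a pair of `None` values on a miss keeps the caller's unpacking safe when a page lacks the price block.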

In [None]:
def get_data(text, pattern):
    """
    Return a tuple with the groups extracted from text by pattern.
    If nothing matches, return a tuple of four None values.
    """
    match = re.search(pattern, text)
    if match:
        return match.groups()
    else:
        return None, None, None, None 

def extract_car_info(url):
    """
    url: link to the detail page of one car
    return df: a dataframe containing the information of that car
    In some cases (for example, the car is no longer for sale) a message is printed and None is returned
    """
    try:
        response = requests.get(url)
        if response.status_code == 200:
            html = response.content
            soup = BeautifulSoup(html, 'html.parser')
            if (soup.title.text == 'Carvago.com | Buying a used car online. Warranty and home delivery.'):
                print('Page no longer available: the car has been sold')
                return None
            print('200 OK')

            car_info_divs = soup.find_all('div', class_='css-1b3bvbl')  # Adjust the class if needed

            carname = ""

            for car_info_div in car_info_divs:
                attribute_name_div = car_info_div.find('div', class_='css-f6f0k1')
                if attribute_name_div is not None:
                    attribute_name = attribute_name_div.text.strip()

                    if attribute_name == 'Model':
                        car_link_div = car_info_div.find('div', class_='css-qres8i')
                        if car_link_div is not None:
                            car_link_tag = car_link_div.find('a')
                            if car_link_tag is not None:
                                car_link = car_link_tag['href']
                                carname = car_link.split('/')[2:]  # Keep the URL path segments from index 2 onward
                                carname = " ".join(carname)
            
            # Collect the attribute names (e.g. 'Make', 'Model', ...)
            keys = []
            keys.append('CARNAME')
            keys.append('ID')
            keys.extend([div.get_text() for div in soup.find_all('div', {'class': 'css-f6f0k1'})])

            # Collect the attribute values (e.g. 'Chevrolet', 'Orlando', ...)
            values = []
            values.append(carname)
            values.append(url.split('/')[4])
            for key, div in zip(keys[2: ], soup.find_all('div', {'class': 'css-1b3bvbl'})):
                if key == 'VIN':
                    if div.find('div', {'class': 'css-10odsig'}) is not None:
                        value = div.find('div', {'class': 'css-10odsig'}).get_text()
                    else:
                        value = div.find('div', {'class': 'css-1yu0dva'}).get_text()
                else:
                    value = div.find('div', {'class': 'css-qres8i'}).get_text()
                values.append(value)

            # Consumption lives in a different tag, so it is handled separately
            consumption = [div.get_text() for div in soup.find_all('div', {'class': 'css-c55dw0'})]
            keys.append('Consumption')
            values.append('; '.join(consumption))

            # Get the price and its currency
            price_pattern = r'"price":(\d+),"price_currency":\{"id":(\d+),"name":"(.*?)","const_key":"CURRENCY_(.*?)"\}'
            price, currency_id, currency_name, currency_const_key = get_data(response.text, price_pattern)

            if price is not None:
                keys.append('Price')
                values.append(price)

            if currency_name is not None:
                keys.append('Currency')
                values.append(currency_name)

            # Get the highlighted features (featured tags)
            tags_pattern = r'"featured_tags":(\[.*?\])'
            tags_json = get_data(response.text, tags_pattern)[0]
            # Fall back to an empty list when the page has no featured tags
            featured_tags = json.loads(tags_json) if tags_json is not None else []
            keys.append('Tags')
            
            # Join the tag names into a single string
            values.append('; '.join([tag['name'] for tag in featured_tags]))

            df = pd.DataFrame([values], columns=keys)

            return df
                
    except requests.exceptions.RequestException as e:
        print(f"Failed to get the webpage at {url} due to error: {e}")

Read the text file obtained from the previous step, iterate through all the links, and call the function to crawl detailed information for each link.
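The loop described here can be sketched in isolation as follows. The names `crawl_all`, `fetch`, and `delay` are hypothetical, the optional pause between requests is an addition that the original crawl does not use, and `fake_fetch` is only a stand-in for `extract_car_info` so the example runs offline.

```python
import time

def crawl_all(links, fetch, delay=0.0):
    """Call fetch(link) for each link and keep the non-None results."""
    results = []
    for i, link in enumerate(links, start=1):
        print(f"{i}/{len(links)}")
        item = fetch(link)
        if item is None:
            # e.g. the car has been sold and its page is gone
            continue
        results.append(item)
        if delay:
            time.sleep(delay)  # optional pause between requests
    return results

# Stand-in for extract_car_info: returns None for "sold" links.
def fake_fetch(link):
    return None if link.endswith("sold") else {"url": link}

demo = crawl_all(["a", "b-sold", "c"], fake_fetch)
print(demo)
```

Using `enumerate` keeps the progress counter in step with the loop, and skipped links never reach the result list.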

In [None]:
# Read the crawled links (one URL per line)
with open('data/link2.txt', 'r') as file:
    links = file.read().splitlines()

list_dataframe = []
data = pd.DataFrame()
# Loop through each link and extract information
for i, link in enumerate(links, start=1):
    print(i, '/', len(links))
    car_info = extract_car_info(link)
    if car_info is None:
        print("Skipping link due to error")
        continue
    list_dataframe.append(car_info)


temp_df = list_dataframe.copy()

In [None]:
def rename_duplicates(df):
    # Make column names unique by suffixing repeats: 'X', 'X', 'X' -> 'X', 'X1', 'X2'
    cols = pd.Series(df.columns)
    for dup in cols[cols.duplicated()].unique(): 
        cols[cols[cols == dup].index.values.tolist()] = [dup + str(i) if i != 0 else dup for i in range(sum(cols == dup))]
    df.columns = cols
    return df

list_dataframe = [rename_duplicates(df) for df in list_dataframe]


# Union of all column names, sorted so the column order is deterministic
all_columns = sorted(set().union(*[set(df.columns) for df in list_dataframe]))

list_dataframe = [df.reindex(columns=all_columns, fill_value=np.nan) for df in list_dataframe]

crawldata = pd.concat(list_dataframe, ignore_index=True)

Check the size of the crawled dataframe.

In [None]:
crawldata.shape

As we can see, the number of rows does not match the number of crawled links. Some cars have been removed from sale, but their detail-page links have not been taken down. Only when we open such a link do we discover that the car is no longer available.
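One way to see which links produced no row is to compare the IDs in the link list with the IDs that were actually crawled. This is a minimal sketch: the helper name `missing_ids` and the sample values are hypothetical, and it assumes every link has the form `https://carvago.com/car/{id}/{slug}` as written by the link crawl.

```python
def missing_ids(links, crawled_ids):
    """Return the IDs found in links but absent from crawled_ids."""
    crawled = {str(car_id) for car_id in crawled_ids}
    missing = []
    for link in links:
        car_id = link.split('/')[4]  # same index used to fill the ID column
        if car_id not in crawled:
            missing.append(car_id)
    return missing

# Hypothetical example: three links, two of which were crawled.
sample_links = [
    "https://carvago.com/car/11111111/car-a",
    "https://carvago.com/car/22222222/car-b",
    "https://carvago.com/car/33333333/car-c",
]
sample_crawled = ["11111111", "33333333"]

print(missing_ids(sample_links, sample_crawled))
```

In the notebook, the second argument could be the `ID` column of the crawled dataframe, for example `missing_ids(links, crawldata['ID'])`.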

# Save to data/crawl_data.csv

In [None]:
crawldata.to_csv("data/crawl_data.csv", header=True)