# Project 6. [SF-DST Car Price Prediction] 

### Done by: Eugenia Voytik, 27.09.2021.

![photo](https://miro.medium.com/max/648/1*kQBj7l-Y1WPZfX9nKIYL1Q.jpeg)

### This notebook is used to parse the new train data from the `auto.ru` website for the project. These new data will be used together with the available dataset (collected from the same website on 09.09.2020) for the project to improve the prediction.

In [3]:
# Install all necessary libraries
!pip install requests beautifulsoup4 

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.10.0-py3-none-any.whl (97 kB)
[K     |████████████████████████████████| 97 kB 1.3 MB/s eta 0:00:01
Collecting soupsieve>1.2
  Downloading soupsieve-2.2.1-py3-none-any.whl (33 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.10.0 soupsieve-2.2.1


In [1]:
# Import libraries that will be used in the project
import requests
from bs4 import BeautifulSoup 

import pandas as pd

import re
import time
import json

Hereby we'll upload the existing dataset collected from the `auto.ru` website on 09.09.2020 to look at the data structure and clarify the data that we need to parse from the website. 

In [87]:
test_df = pd.read_csv('test.csv')
test_df.head()

Unnamed: 0,bodyType,brand,car_url,color,complectation_dict,description,engineDisplacement,enginePower,equipment_dict,fuelType,...,vehicleConfiguration,vehicleTransmission,vendor,Владельцы,Владение,ПТС,Привод,Руль,Состояние,Таможня
0,лифтбек,SKODA,https://auto.ru/cars/used/sale/skoda/octavia/1...,синий,,"Все автомобили, представленные в продаже, прох...",1.2 LTR,105 N12,"{""engine-proof"":true,""tinted-glass"":true,""airb...",бензин,...,LIFTBACK ROBOT 1.2,роботизированная,EUROPEAN,3 или более,,Оригинал,передний,Левый,Не требует ремонта,Растаможен
1,лифтбек,SKODA,https://auto.ru/cars/used/sale/skoda/octavia/1...,чёрный,,ЛОТ: 01217195\nАвтопрага Север\nДанный автомоб...,1.6 LTR,110 N12,"{""cruise-control"":true,""asr"":true,""esp"":true,""...",бензин,...,LIFTBACK MECHANICAL 1.6,механическая,EUROPEAN,1 владелец,,Оригинал,передний,Левый,Не требует ремонта,Растаможен
2,лифтбек,SKODA,https://auto.ru/cars/used/sale/skoda/superb/11...,серый,"{""id"":""20026336"",""name"":""Ambition"",""available_...","Все автомобили, представленные в продаже, прох...",1.8 LTR,152 N12,"{""cruise-control"":true,""tinted-glass"":true,""es...",бензин,...,LIFTBACK ROBOT 1.8,роботизированная,EUROPEAN,1 владелец,,Оригинал,передний,Левый,Не требует ремонта,Растаможен
3,лифтбек,SKODA,https://auto.ru/cars/used/sale/skoda/octavia/1...,коричневый,"{""id"":""20803582"",""name"":""Ambition"",""available_...",КОМПЛЕКТ ЗИМНЕЙ (ЛЕТНЕЙ) РЕЗИНЫ ПО СЕЗОНУ В ПО...,1.6 LTR,110 N12,"{""cruise-control"":true,""roller-blind-for-rear-...",бензин,...,LIFTBACK AUTOMATIC 1.6,автоматическая,EUROPEAN,1 владелец,,Оригинал,передний,Левый,Не требует ремонта,Растаможен
4,лифтбек,SKODA,https://auto.ru/cars/used/sale/skoda/octavia/1...,белый,,ЛОТ: 01220889\nАвтопрага Север\n\nВы можете по...,1.8 LTR,152 N12,"{""cruise-control"":true,""asr"":true,""esp"":true,""...",бензин,...,LIFTBACK AUTOMATIC 1.8,автоматическая,EUROPEAN,1 владелец,,Оригинал,передний,Левый,Не требует ремонта,Растаможен


In [None]:
test_df.info()

Let's look at the list of brands that were used for the dataset's collection before. To reduce then the size of the parsed dataset we'll use the same brands.

In [88]:
brands = [brand.lower() for brand in test_df.brand.unique()]
brands

['skoda',
 'audi',
 'honda',
 'volvo',
 'bmw',
 'nissan',
 'infiniti',
 'mercedes',
 'toyota',
 'lexus',
 'volkswagen',
 'mitsubishi']

### Step I. Collecting the URLs to all auto ads in a file.

At this step for each brand that will be used for the analysis (12 brands) we'd like to extract the links to all autos' ads in a file. This file will be used for the next step for the data collection.

In [89]:
def collect_car_urls(
    brand: str
):
    """
    Collect the all car urls for a specified brand from the `auto.ru` website. 
    For each brand the number of available pages is calculated and the urls from all these pages are saved 
    into a list.
    """
    main_url = f'https://auto.ru/cars/{brand}/all/'   
    main_response = requests.get(main_url)   
    main_soap = BeautifulSoup(main_response.content.decode('utf-8'), 'html.parser')    
    _ = main_soap.find('span', class_='ButtonWithLoader__content').text.replace(u'\xa0', '')
    urls_total = int(re.findall(r'\d+', _)[0])
    ads_per_page = len(main_soap.find_all('a', class_='Link ListingItemTitle__link'))
    pages_num = urls_total // ads_per_page
    
    all_urls = []
    
    for page_num in range(1, pages_num):
        if page_num % 20 == 0:
            print(f"Extracting page {page_num} from {pages_num}...")
        page_url = f'{main_url}?page={page_num}'   
        page_response = requests.get(page_url)
        time.sleep(0.1)
        page_soap = BeautifulSoup(page_response.content.decode('utf-8'), 'html.parser')   
        
        all_urls.extend([a.get('href') for a in page_soap.find_all('a', class_='Link ListingItemTitle__link')])

    return all_urls

In [102]:
urls = []
for brand in brands:
    print(f"Extracting data for the brand {brand}:")
    all_urls = collect_car_urls(brand)
    urls.extend(all_urls)

Extracting data for the brand skoda:
Extracting page 20 from 228...
Extracting page 40 from 228...
Extracting page 60 from 228...
Extracting page 80 from 228...
Extracting page 100 from 228...
Extracting page 120 from 228...
Extracting page 140 from 228...
Extracting page 160 from 228...
Extracting page 180 from 228...
Extracting page 200 from 228...
Extracting page 220 from 228...
Extracting data for the brand audi:
Extracting page 20 from 273...
Extracting page 40 from 273...
Extracting page 60 from 273...
Extracting page 80 from 273...
Extracting page 100 from 273...
Extracting page 120 from 273...
Extracting page 140 from 273...
Extracting page 160 from 273...
Extracting page 180 from 273...
Extracting page 200 from 273...
Extracting page 220 from 273...
Extracting page 240 from 273...
Extracting page 260 from 273...
Extracting data for the brand honda:
Extracting page 20 from 125...
Extracting page 40 from 125...
Extracting page 60 from 125...
Extracting page 80 from 125...
Extrac

In [103]:
train_df = pd.DataFrame({'car_url': urls})
train_df.to_csv('train_df.csv', index=False)

In [104]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130210 entries, 0 to 130209
Data columns (total 1 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   car_url  130210 non-null  object
dtypes: object(1)
memory usage: 1017.4+ KB


### Step II. Retrieve data for all saved car URLs

At this step for all saved URLs we need to extract the same information that is already saved in our test dataset from 09.09.2020. To do this let's look at the columns of this dataset.

In [105]:
extracted_columns = test_df.columns.to_list()
extracted_columns

['bodyType',
 'brand',
 'car_url',
 'color',
 'complectation_dict',
 'description',
 'engineDisplacement',
 'enginePower',
 'equipment_dict',
 'fuelType',
 'image',
 'mileage',
 'modelDate',
 'model_info',
 'model_name',
 'name',
 'numberOfDoors',
 'parsing_unixtime',
 'priceCurrency',
 'productionDate',
 'sell_id',
 'super_gen',
 'vehicleConfiguration',
 'vehicleTransmission',
 'vendor',
 'Владельцы',
 'Владение',
 'ПТС',
 'Привод',
 'Руль',
 'Состояние',
 'Таможня']

To form the updated training dataset, I decided to extract the same columns (features) that are included in the test dataset (32 features) excluding 2 columns: model info and vendor whose information is not informative. To possibly improve the model I'm planning to add also several additional columns:
- 'views' - how many times the ad was viewed
- 'date_added' - the date when the ad was posted on the portal
- 'region' - the region where the car is located
- 'price' - out target column

In the result we need to get a dataset containing 34 features.

In [117]:
def extract_url_data(url: str, extracted_columns: list):
    """
    For the specified url extract information from the `auto.ru` webpage about all necessary features mentioning
    in the extracted_columns list, such as model_name, mileage, price, etc.
    The function returns the list of values for all specified features in the same order as its fields are 
    presented in the extracted_columns list.
    """
    response = requests.get(url)
    page = BeautifulSoup(response.content.decode('utf-8'), 'html.parser')
    
    try:
        catalog_url = page.find(
            'a', class_='Link SpoilerLink CardCatalogLink SpoilerLink_type_default').get('href')
        response_catalog = requests.get(catalog_url)
        page_catalog = BeautifulSoup(response_catalog.content.decode('utf-8'), 'html.parser')
    except:
        pass
    try:
        json_data_catalog = json.loads(
            page_catalog.find('script', type="application/json", id='initial-state').string)
    except:
        pass
    try:
        json_data_equip = json.loads(
            page.find('script', type="application/json", id='initial-state').string)
    except:
        pass
    try:
        data = json.loads(
            page.find('script', type="application/ld+json").string)
        data = flatten(data)
    except:
        pass
    
    try:
        data['complectation_dict'] = [
        k for k, v in json_data_catalog['state']['compare']['selected'][0]['options'].items() if v == 1]
    except:
        pass
    try:
        data['equipment_dict'] = json_data_equip['card']['vehicle_info']['equipment']
    except:
        pass
    try:
        mileage = page.find(
            'li', class_='CardInfoRow CardInfoRow_kmAge').find_all('span')[-1].text.replace(u'\xa0', u'')
        data['mileage'] = int(re.findall(r'\d+', mileage)[0])
    except:
        pass
    try:
        data['model_name'] = page.find_all(
        'div', class_='InfoPopup InfoPopup_theme_plain InfoPopup_withChildren BreadcrumbsPopup')[1].text
    except:
        pass
    try:
        data['parsing_unixtime'] = int(time.time())
    except:
        pass
    try:
        data['sell_id'] = int(re.findall(
            r'\d+', page.find('div', class_='CardHead__infoItem CardHead__id').text)[0])
    except:
        pass
    try:
        data['super_gen'] = json.loads(
            page.find('div', id="sale-data-attributes").get('data-bem'))
    except:
        pass
    try:
        data['Владельцы'] = page.find(
            'li', class_='CardInfoRow CardInfoRow_ownersCount').find_all('span')[-1].text.replace(u'\xa0', u' ')
    except:
        pass
    try:
        data['Владение'] = page.find(
            'li', class_='CardInfoRow CardInfoRow_owningTime').find_all('span')[-1].text
    except:
        pass
    try:
        data['ПТС'] = page.find(
            'li', class_='CardInfoRow CardInfoRow_pts').find_all('span')[-1].text
    except:
        pass
    try:
        data['Привод'] = page.find(
            'li', class_='CardInfoRow CardInfoRow_drive').find_all('span')[-1].text
    except:
        pass
    try:
        data['Руль'] = page.find('li', class_='CardInfoRow CardInfoRow_wheel').find_all('span')[-1].text
    except:
        pass
    try:
        data['Состояние'] = page.find(
            'li', class_='CardInfoRow CardInfoRow_state').find_all('span')[-1].text
    except:
        pass
    try:
        data['Таможня'] = page.find(
            'li', class_='CardInfoRow CardInfoRow_customs').find_all('span')[-1].text
    except:
        pass
    try:
        data['description'] = re.sub('\W+', ' ', data['description'])
    except:
        pass
    
    # additional features
    try:
        data['views'] =  page.find(
            'div', class_='CardHead__infoItem CardHead__views').text.split()[0]
    except:
        pass
    try:
        data['date_added'] = page.find(
            'div', class_='CardHead__infoItem CardHead__creationDate').text 
    except:
        pass
    try:
        data['region'] = page.find(
            'div', class_='CardBreadcrumbs').find_all(
            'div', class_='CardBreadcrumbs__item')[-1].text.replace(u'\xa0', u' ')
    except:
        pass
    
    output = []
    try:
        for col in extracted_columns:
            output.append(data.get(col, None))    
    except:
        pass
    if not output:
        output = [None] * len(extracted_columns)
    return output

In [107]:
df_combined = pd.read_csv('train_df.csv')
df_combined.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130210 entries, 0 to 130209
Data columns (total 1 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   car_url  130210 non-null  object
dtypes: object(1)
memory usage: 1017.4+ KB


In [113]:
len(df_combined.car_url.values.tolist())

130210

In [151]:
final_list = []
for n, url in enumerate(df_combined.car_url.values.tolist()[79201:]):
    if n % 50 == 0:
        print(f"The # of current processing URL is {n}, url is {url}.")
    if n % 1000 == 0:
        print(f"Create a dataframe from the list and concetenate it with the already collected data.")
        df1 = pd.DataFrame(data=final_list, columns=extracted_columns + ['views', 'date_added', 'region', 'price'])
        final_df = pd.concat([final_df, df1], ignore_index=True)
        final_df.to_csv('train_df_full_part1.csv', index=False)
        final_list = []
    final_list.append(extract_url_data(url, extracted_columns + ['views', 'date_added', 'region', 'price']))

The # of current processing URL is 0, url is https://auto.ru/cars/new/group/toyota/fortuner/22457063/22457511/1103711810-2d2c1e05/.
Create a dataframe from the list and concetenate it with the already collected data.
The # of current processing URL is 50, url is https://auto.ru/cars/used/sale/toyota/camry/1105400421-246d155d/.
The # of current processing URL is 100, url is https://auto.ru/cars/used/sale/toyota/rav_4/1103962161-0506bfcd/.
The # of current processing URL is 150, url is https://auto.ru/cars/used/sale/toyota/land_cruiser/1104857475-a30206f3/.
The # of current processing URL is 200, url is https://auto.ru/cars/new/group/toyota/camry/22813335/22813560/1105389268-0c560cfb/.
The # of current processing URL is 250, url is https://auto.ru/cars/used/sale/toyota/land_cruiser_prado/1105077237-74f0de00/.
The # of current processing URL is 300, url is https://auto.ru/cars/used/sale/toyota/yaris/1105399813-3843c07f/.
The # of current processing URL is 350, url is https://auto.ru/cars/

The # of current processing URL is 3300, url is https://auto.ru/cars/used/sale/toyota/rav_4/1105402112-e65837e5/.
The # of current processing URL is 3350, url is https://auto.ru/cars/used/sale/toyota/camry/1105400543-ed7ae23d/.
The # of current processing URL is 3400, url is https://auto.ru/cars/used/sale/toyota/highlander/1104811223-0a0912dc/.
The # of current processing URL is 3450, url is https://auto.ru/cars/used/sale/toyota/hilux_surf/1105402596-e8c448cc/.
The # of current processing URL is 3500, url is https://auto.ru/cars/new/group/toyota/land_cruiser_prado/22495145/22496061/1104717062-6ae14d83/.
The # of current processing URL is 3550, url is https://auto.ru/cars/new/group/toyota/camry/22813335/22813675/1103971886-d3b55172/.
The # of current processing URL is 3600, url is https://auto.ru/cars/used/sale/toyota/land_cruiser/1104873877-6219432f/.
The # of current processing URL is 3650, url is https://auto.ru/cars/used/sale/toyota/camry/1105402581-a8bab4f4/.
The # of current proce

The # of current processing URL is 6600, url is https://auto.ru/cars/new/group/toyota/rav_4/21678468/21678548/1104996016-e419f805/.
The # of current processing URL is 6650, url is https://auto.ru/cars/used/sale/toyota/hilux/1104938298-826f02f8/.
The # of current processing URL is 6700, url is https://auto.ru/cars/used/sale/toyota/rav_4/1105338185-bb1725b0/.
The # of current processing URL is 6750, url is https://auto.ru/cars/used/sale/toyota/camry/1105402574-bbe04416/.
The # of current processing URL is 6800, url is https://auto.ru/cars/new/group/toyota/rav_4/21678504/22537343/1104963137-e0e9d5dc/.
The # of current processing URL is 6850, url is https://auto.ru/cars/used/sale/toyota/land_cruiser/1105401076-e0448eae/.
The # of current processing URL is 6900, url is https://auto.ru/cars/used/sale/toyota/corolla/1105402126-0ab91ea1/.
The # of current processing URL is 6950, url is https://auto.ru/cars/used/sale/toyota/land_cruiser/1105222495-22d7ca53/.
The # of current processing URL is 7

The # of current processing URL is 9900, url is https://auto.ru/cars/used/sale/toyota/camry/1105178999-ae2e8dc0/.
The # of current processing URL is 9950, url is https://auto.ru/cars/new/group/toyota/rav_4/21678460/21678548/1105041061-9333c262/.
The # of current processing URL is 10000, url is https://auto.ru/cars/used/sale/toyota/auris/1105376150-cc00e951/.
Create a dataframe from the list and concetenate it with the already collected data.
The # of current processing URL is 10050, url is https://auto.ru/cars/used/sale/toyota/highlander/1104811223-0a0912dc/.
The # of current processing URL is 10100, url is https://auto.ru/cars/used/sale/toyota/land_cruiser/1103912479-7d64e572/.
The # of current processing URL is 10150, url is https://auto.ru/cars/used/sale/toyota/highlander/1105013889-dfd5415b/.
The # of current processing URL is 10200, url is https://auto.ru/cars/used/sale/toyota/sienna/1102782281-7a7b03f1/.
The # of current processing URL is 10250, url is https://auto.ru/cars/used/s

The # of current processing URL is 13150, url is https://auto.ru/cars/used/sale/toyota/rav_4/1105335458-522988e3/.
The # of current processing URL is 13200, url is https://auto.ru/cars/used/sale/toyota/land_cruiser/1105042166-afc13e35/.
The # of current processing URL is 13250, url is https://auto.ru/cars/new/group/toyota/rav_4/21678468/21678548/1104996016-e419f805/.
The # of current processing URL is 13300, url is https://auto.ru/cars/used/sale/toyota/camry/1105402581-a8bab4f4/.
The # of current processing URL is 13350, url is https://auto.ru/cars/used/sale/toyota/alphard/1105367919-c23789fd/.
The # of current processing URL is 13400, url is https://auto.ru/cars/used/sale/toyota/land_cruiser/1105229909-049db308/.
The # of current processing URL is 13450, url is https://auto.ru/cars/used/sale/toyota/camry/1105402581-a8bab4f4/.
The # of current processing URL is 13500, url is https://auto.ru/cars/used/sale/toyota/land_cruiser/1105401076-e0448eae/.
The # of current processing URL is 1355

The # of current processing URL is 16450, url is https://auto.ru/cars/used/sale/toyota/hilux/1104938298-826f02f8/.
The # of current processing URL is 16500, url is https://auto.ru/cars/new/group/toyota/rav_4/21678468/21678562/1104203698-35e5b666/.
The # of current processing URL is 16550, url is https://auto.ru/cars/new/group/toyota/land_cruiser/22905534/22947809/1104844023-ce267fd2/.
The # of current processing URL is 16600, url is https://auto.ru/cars/used/sale/toyota/alphard/1105401188-72d58268/.
The # of current processing URL is 16650, url is https://auto.ru/cars/used/sale/toyota/alphard/1104805170-06c566b2/.
The # of current processing URL is 16700, url is https://auto.ru/cars/used/sale/toyota/highlander/1104811223-0a0912dc/.
The # of current processing URL is 16750, url is https://auto.ru/cars/used/sale/toyota/land_cruiser/1105222495-22d7ca53/.
The # of current processing URL is 16800, url is https://auto.ru/cars/new/group/toyota/rav_4/21678468/21678562/1104203698-35e5b666/.
The

The # of current processing URL is 19700, url is https://auto.ru/cars/used/sale/lexus/nx/1105239073-f38ea531/.
The # of current processing URL is 19750, url is https://auto.ru/cars/used/sale/lexus/nx/1105218823-658e9f5d/.
The # of current processing URL is 19800, url is https://auto.ru/cars/new/group/lexus/gx/21660380/23018337/1104577727-96c31ef0/.
The # of current processing URL is 19850, url is https://auto.ru/cars/used/sale/lexus/gs/1105274117-b9865539/.
The # of current processing URL is 19900, url is https://auto.ru/cars/used/sale/lexus/nx/1105149990-0c8c7923/.
The # of current processing URL is 19950, url is https://auto.ru/cars/new/group/lexus/nx/21131112/21143116/1105245719-24b4c5bc/.
The # of current processing URL is 20000, url is https://auto.ru/cars/new/group/lexus/rx/21662861/21663202/1104843954-193ad66f/.
Create a dataframe from the list and concetenate it with the already collected data.
The # of current processing URL is 20050, url is https://auto.ru/cars/new/group/lexu

The # of current processing URL is 23050, url is https://auto.ru/cars/used/sale/volkswagen/tiguan/1105046396-a164b587/.
The # of current processing URL is 23100, url is https://auto.ru/cars/used/sale/volkswagen/passat/1105100026-f25308c9/.
The # of current processing URL is 23150, url is https://auto.ru/cars/used/sale/volkswagen/tiguan/1105191455-c72ae5d2/.
The # of current processing URL is 23200, url is https://auto.ru/cars/used/sale/volkswagen/polo/1105174493-9113bf34/.
The # of current processing URL is 23250, url is https://auto.ru/cars/used/sale/volkswagen/polo/1105153343-ce09a512/.
The # of current processing URL is 23300, url is https://auto.ru/cars/used/sale/volkswagen/touareg/1105390160-2bf1e30e/.
The # of current processing URL is 23350, url is https://auto.ru/cars/used/sale/volkswagen/jetta/1105326536-5245bc06/.
The # of current processing URL is 23400, url is https://auto.ru/cars/new/group/volkswagen/touareg/21307095/22777816/1104730120-a71fb6d5/.
The # of current processi

The # of current processing URL is 26300, url is https://auto.ru/cars/used/sale/volkswagen/polo/1105019802-c8cd62b6/.
The # of current processing URL is 26350, url is https://auto.ru/cars/new/group/volkswagen/tiguan/22680193/22688476/1105370329-b4bfdfe0/.
The # of current processing URL is 26400, url is https://auto.ru/cars/used/sale/volkswagen/teramont/1105359052-a118a355/.
The # of current processing URL is 26450, url is https://auto.ru/cars/used/sale/volkswagen/transporter/1105401011-7dd98add/.
The # of current processing URL is 26500, url is https://auto.ru/cars/used/sale/volkswagen/teramont/1105275371-5807a05e/.
The # of current processing URL is 26550, url is https://auto.ru/cars/used/sale/volkswagen/tiguan/1105367902-3d98b1ea/.
The # of current processing URL is 26600, url is https://auto.ru/cars/used/sale/volkswagen/transporter/1105401011-7dd98add/.
The # of current processing URL is 26650, url is https://auto.ru/cars/new/group/volkswagen/tiguan/22680193/22688476/1105134226-d8c

The # of current processing URL is 29550, url is https://auto.ru/cars/used/sale/volkswagen/touareg/1105402245-71b6911b/.
The # of current processing URL is 29600, url is https://auto.ru/cars/used/sale/volkswagen/touareg/1105401739-00900e5f/.
The # of current processing URL is 29650, url is https://auto.ru/cars/used/sale/volkswagen/touareg/1105269196-75deaa11/.
The # of current processing URL is 29700, url is https://auto.ru/cars/used/sale/volkswagen/touareg/1105402113-31b3fe10/.
The # of current processing URL is 29750, url is https://auto.ru/cars/used/sale/volkswagen/tiguan/1105401934-023d6bae/.
The # of current processing URL is 29800, url is https://auto.ru/cars/used/sale/volkswagen/teramont/1105356373-e3bf0f21/.
The # of current processing URL is 29850, url is https://auto.ru/cars/used/sale/volkswagen/caddy/1103994372-c39438e1/.
The # of current processing URL is 29900, url is https://auto.ru/cars/used/sale/volkswagen/tiguan/1105401421-2ff3f777/.
The # of current processing URL is 

The # of current processing URL is 32750, url is https://auto.ru/cars/used/sale/volkswagen/tiguan/1105401421-2ff3f777/.
The # of current processing URL is 32800, url is https://auto.ru/cars/new/group/volkswagen/tiguan/22680194/22688476/1105142445-358c0984/.
The # of current processing URL is 32850, url is https://auto.ru/cars/used/sale/volkswagen/tiguan/1105292937-a514517b/.
The # of current processing URL is 32900, url is https://auto.ru/cars/used/sale/volkswagen/tiguan/1105402309-a7108576/.
The # of current processing URL is 32950, url is https://auto.ru/cars/used/sale/volkswagen/polo/1105401788-8184ae3e/.
The # of current processing URL is 33000, url is https://auto.ru/cars/new/group/volkswagen/tiguan/22680193/22688476/1105370329-b4bfdfe0/.
Create a dataframe from the list and concetenate it with the already collected data.
The # of current processing URL is 33050, url is https://auto.ru/cars/used/sale/volkswagen/touareg/1105402245-71b6911b/.
The # of current processing URL is 33100

The # of current processing URL is 36000, url is https://auto.ru/cars/used/sale/volkswagen/tiguan/1101693038-8ff32698/.
Create a dataframe from the list and concetenate it with the already collected data.
The # of current processing URL is 36050, url is https://auto.ru/cars/new/group/volkswagen/tiguan/22680194/22688476/1105118403-c829ea3f/.
The # of current processing URL is 36100, url is https://auto.ru/cars/used/sale/volkswagen/polo/1105401441-b6f10348/.
The # of current processing URL is 36150, url is https://auto.ru/cars/used/sale/volkswagen/touareg/1105063052-f0ad6657/.
The # of current processing URL is 36200, url is https://auto.ru/cars/new/group/volkswagen/tiguan/22680194/22688476/1105293642-0d0fc0f6/.
The # of current processing URL is 36250, url is https://auto.ru/cars/used/sale/volkswagen/polo/1105401974-307e917d/.
The # of current processing URL is 36300, url is https://auto.ru/cars/new/group/volkswagen/tiguan/22680193/22688476/1105134226-d8c29e5f/.
The # of current process

The # of current processing URL is 39150, url is https://auto.ru/cars/used/sale/volkswagen/teramont/1105356373-e3bf0f21/.
The # of current processing URL is 39200, url is https://auto.ru/cars/used/sale/volkswagen/tiguan/1105262244-0c65f551/.
The # of current processing URL is 39250, url is https://auto.ru/cars/used/sale/volkswagen/tiguan/1105401934-023d6bae/.
The # of current processing URL is 39300, url is https://auto.ru/cars/used/sale/volkswagen/jetta/1105400451-13b03539/.
The # of current processing URL is 39350, url is https://auto.ru/cars/used/sale/volkswagen/multivan/1103895569-c5549469/.
The # of current processing URL is 39400, url is https://auto.ru/cars/used/sale/volkswagen/tiguan/1105401421-2ff3f777/.
The # of current processing URL is 39450, url is https://auto.ru/cars/used/sale/volkswagen/tiguan/1105401350-a5000ac2/.
The # of current processing URL is 39500, url is https://auto.ru/cars/used/sale/volkswagen/tiguan/1105292937-a514517b/.
The # of current processing URL is 39

The # of current processing URL is 42350, url is https://auto.ru/cars/used/sale/mitsubishi/asx/1105265828-29d63138/.
The # of current processing URL is 42400, url is https://auto.ru/cars/used/sale/mitsubishi/l200/1105257385-fb3b4000/.
The # of current processing URL is 42450, url is https://auto.ru/cars/used/sale/mitsubishi/outlander/1105120681-64a9bf1a/.
The # of current processing URL is 42500, url is https://auto.ru/cars/used/sale/mitsubishi/pajero/1104954474-97479374/.
The # of current processing URL is 42550, url is https://auto.ru/cars/used/sale/mitsubishi/outlander/1105215173-375105e7/.
The # of current processing URL is 42600, url is https://auto.ru/cars/used/sale/mitsubishi/lancer/1105209408-26673efa/.
The # of current processing URL is 42650, url is https://auto.ru/cars/used/sale/mitsubishi/pajero/1105161019-e0f60b01/.
The # of current processing URL is 42700, url is https://auto.ru/cars/used/sale/mitsubishi/eclipse/1105390191-c582268a/.
The # of current processing URL is 427

The # of current processing URL is 45600, url is https://auto.ru/cars/new/group/mitsubishi/outlander/21397561/22971556/1103297457-d71dc129/.
The # of current processing URL is 45650, url is https://auto.ru/cars/used/sale/mitsubishi/colt/1105399657-e2f10da7/.
The # of current processing URL is 45700, url is https://auto.ru/cars/used/sale/mitsubishi/galant/1105402618-193be41e/.
The # of current processing URL is 45750, url is https://auto.ru/cars/used/sale/mitsubishi/outlander/1105376117-9b68fb0c/.
The # of current processing URL is 45800, url is https://auto.ru/cars/used/sale/mitsubishi/pajero_sport/1105349839-574f6c0e/.
The # of current processing URL is 45850, url is https://auto.ru/cars/new/group/mitsubishi/outlander/21397560/22971552/1103297473-7f542029/.
The # of current processing URL is 45900, url is https://auto.ru/cars/used/sale/mitsubishi/pajero/1103504536-61a438ed/.
The # of current processing URL is 45950, url is https://auto.ru/cars/used/sale/mitsubishi/asx/1105398705-49f80

The # of current processing URL is 48800, url is https://auto.ru/cars/used/sale/mitsubishi/outlander/1105375211-49d7de3a/.
The # of current processing URL is 48850, url is https://auto.ru/cars/used/sale/mitsubishi/pajero/1105402238-6657017d/.
The # of current processing URL is 48900, url is https://auto.ru/cars/used/sale/mitsubishi/pajero_sport/1105402312-7ed058b2/.
The # of current processing URL is 48950, url is https://auto.ru/cars/new/group/mitsubishi/outlander/21397559/21397679/1103297469-43d418ad/.
The # of current processing URL is 49000, url is https://auto.ru/cars/used/sale/mitsubishi/pajero_sport/1104036017-06bb6a59/.
Create a dataframe from the list and concetenate it with the already collected data.
The # of current processing URL is 49050, url is https://auto.ru/cars/used/sale/mitsubishi/galant/1105402618-193be41e/.
The # of current processing URL is 49100, url is https://auto.ru/cars/used/sale/mitsubishi/outlander/1105066450-9a474f4b/.
The # of current processing URL is 4

In [152]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130201 entries, 0 to 130200
Data columns (total 36 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   bodyType              129988 non-null  object 
 1   brand                 129988 non-null  object 
 2   car_url               129969 non-null  object 
 3   color                 129988 non-null  object 
 4   complectation_dict    105267 non-null  object 
 5   description           129988 non-null  object 
 6   engineDisplacement    129988 non-null  object 
 7   enginePower           129988 non-null  object 
 8   equipment_dict        129969 non-null  object 
 9   fuelType              129988 non-null  object 
 10  image                 130065 non-null  object 
 11  mileage               103034 non-null  float64
 12  modelDate             129986 non-null  float64
 13  model_info            0 non-null       object 
 14  model_name            103034 non-null  object 
 15  

In [153]:
final_df.head()

Unnamed: 0,bodyType,brand,car_url,color,complectation_dict,description,engineDisplacement,enginePower,equipment_dict,fuelType,...,Владение,ПТС,Привод,Руль,Состояние,Таможня,views,date_added,region,price
0,лифтбек,SKODA,https://auto.ru/cars/used/sale/skoda/octavia/1...,белый,"[cruise-control, multi-wheel, xenon, heated-wa...",Автомобиль приобретался у официального дилера ...,1.8 LTR,180 N12,"{'cruise-control': True, 'asr': True, 'tinted-...",бензин,...,,Оригинал,передний,Левый,Не требует ремонта,Растаможен,76.0,24 сентября,в Тюмени,999000.0
1,лифтбек,SKODA,https://auto.ru/cars/used/sale/skoda/rapid/110...,белый,"[cruise-control, multi-wheel, heated-wash-syst...",Группа компаний Элан Моторс официальный дилер ...,1.6 LTR,110 N12,"{'cruise-control': True, 'glonass': True, 'asr...",бензин,...,,Оригинал,передний,Левый,Не требует ремонта,Растаможен,259.0,24 сентября,в Санкт-Петербурге,1179000.0
2,лифтбек,SKODA,https://auto.ru/cars/new/group/skoda/rapid/217...,белый,"[cruise-control, heated-wash-system, airbag-pa...",Специальные предложения на автомобили в наличи...,1.6 LTR,90 N12,"{'cruise-control': True, 'glonass': True, 'esp...",бензин,...,,,,,,,,,,1464100.0
3,лифтбек,SKODA,https://auto.ru/cars/used/sale/skoda/octavia/1...,синий,"[cruise-control, multi-wheel, heated-wash-syst...",Купим Ваш автомобиль ДОРОГО Гарантированная с...,1.4 LTR,150 N12,"{'cruise-control': True, 'esp': True, 'usb': T...",бензин,...,,Оригинал,передний,Левый,Не требует ремонта,Растаможен,31.0,25 сентября,в Тюмени,1420000.0
4,внедорожник 5 дв.,SKODA,https://auto.ru/cars/new/group/skoda/karoq/217...,серый,"[cruise-control, multi-wheel, heated-wash-syst...",ЛОТ 01267595 Скидка на автомобиль при покупке ...,1.4 LTR,150 N12,"{'cruise-control': True, 'asr': True, 'esp': T...",бензин,...,,,,,,,,,,2653190.0
