### Task:

#### 1. Scrape the web page given below for a car brand/manufacturer of your choice (for example BMW or Toyota). The result should be a DataFrame of car ads with the following columns:

- car model
- year of production
- engine volume
- mileage
- fuel type
- gearbox mechanism
- steering wheel location (left/right)
- color
- body type (sedan, station wagon, etc.)
- location (city)
- price in USD

url: `https://www.mashina.kg/`

#### 2. Prepare data for modelling.


- convert numeric columns stored as object type to int or float
- handle missing values `(fill with the mean or the mode, or, for example, use the dependence of mileage on year of production)`
- encode categorical data (Ordinal and One-Hot Encodings) `(1 and 0, or dummies. For example, steering wheel location: left = 0, right = 1. When a column has more than two options, use dummies)`
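
The three preparation steps above can be sketched on a toy table. This is only an illustration: the English column names and all values are made up, and filling mileage by year is one possible reading of "dependence of mileage on year of production".

```python
import pandas as pd

# Toy data; column names and values are illustrative, not scraped.
ads = pd.DataFrame({
    "year": ["2008", "2015", "2008", "2015"],
    "mileage": ["180000", None, "200000", "90000"],
    "steering_wheel": ["left", "right", "left", "left"],
    "fuel": ["petrol", "diesel", "hybrid", "petrol"],
})

# 1. Object columns holding numbers -> numeric dtype (bad values become NaN).
for col in ["year", "mileage"]:
    ads[col] = pd.to_numeric(ads[col], errors="coerce")

# 2. Fill missing mileage with the mean mileage of cars from the same year,
#    falling back to the overall mean if a year has no known mileage at all.
ads["mileage"] = ads["mileage"].fillna(
    ads.groupby("year")["mileage"].transform("mean")
)
ads["mileage"] = ads["mileage"].fillna(ads["mileage"].mean())

# 3. Two-valued category -> 0/1; multi-valued category -> dummy columns.
ads["steering_wheel"] = ads["steering_wheel"].map({"left": 0, "right": 1})
ads = pd.get_dummies(ads, columns=["fuel"], drop_first=True)

print(ads)
```

`drop_first=True` drops one dummy per category so the remaining columns are not perfectly collinear, which matters for a linear model with an intercept.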

#### 3. Build a Linear Regression model for car price prediction. Show the score and the coefficients, interpret the coefficients, and answer the questions below.

`X = all features, excluding the price`

`y = price in USD`

1. How does the year of car manufacture affect the price of the car? (coefficient of `Year` in equation)
2. How strongly does the mileage of the car affect the price? (coefficient of `Mileage` in equation)
3. Does the location of the steering wheel affect the cost? (coefficient of `Location` in equation)
4. Does the engine volume affect the cost? (coefficient of `Engine volume` in equation)
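
As a rough sketch of how the four questions can be read off a fitted model, the example below fits `LinearRegression` on synthetic data. The feature names, the price rule, and the noise level are all invented for illustration and say nothing about real prices on mashina.kg.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic ads (assumption: made-up features and price rule).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Year": rng.integers(1995, 2023, n),
    "Mileage": rng.integers(0, 300_000, n),
    "Steering wheel (right=1)": rng.integers(0, 2, n),
    "Engine volume": rng.choice([1.5, 2.0, 2.4, 3.0], n),
})
y = (
    500 * (X["Year"] - 1995)
    - 0.02 * X["Mileage"]
    - 1000 * X["Steering wheel (right=1)"]
    + 2000 * X["Engine volume"]
    + rng.normal(0, 500, n)
)

reg = LinearRegression().fit(X, y)
coefficients = pd.Series(reg.coef_, index=X.columns)

print("R^2:", reg.score(X, y))
print(coefficients)
```

Each coefficient is the expected change in price (USD) for a one-unit increase in that feature with the other features held fixed, so the `Year` coefficient is dollars per year and the `Mileage` coefficient is dollars per kilometre.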

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

In [223]:
html_text = requests.get("https://m.mashina.kg/search/toyota/all/?currency=2&sort_by=upped_at+desc&page=2").text

In [None]:
soup_list = []

for i in range(291):
    url = f"https://m.mashina.kg/search/toyota/all/?currency=2&sort_by=upped_at+desc&page={i}"
    html_text = requests.get(url).text
    
    soup = BeautifulSoup(html_text, 'html.parser')
    
    soup_list.append(soup)

    time.sleep(1)
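
In [None]:
# A slightly more defensive variant of the loop above, as a sketch only.
# `fetch_pages` and its parameters are hypothetical names introduced here;
# the timeout value and the choice of pages are assumptions.

```python
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://m.mashina.kg/search/toyota/all/"


def fetch_pages(pages, session=None, delay=1.0):
    """Download the given result pages and return one BeautifulSoup per page."""
    if session is None:
        session = requests.Session()  # reuse one connection for all requests
    soups = []
    for page in pages:
        params = {"currency": 2, "sort_by": "upped_at desc", "page": page}
        response = session.get(BASE_URL, params=params, timeout=30)
        response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
        soups.append(BeautifulSoup(response.text, "html.parser"))
        time.sleep(delay)  # stay polite to the server
    return soups
```

For example, `soup_list = fetch_pages(range(1, 4))` would fetch the first three result pages.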

In [None]:
model = []
year_of_prod = []
engine_volume = []
mileage = []
fuel =[]
gearbox = []
steering_wheel = []
color =[]
body_type = []
price_usd = []
price_kgs = []
location = []

In [None]:
def squash(tag):
    # Text of the tag with every space and newline removed.
    return tag.text.replace(" ", "").replace("\n", "")

for page in soup_list:
    for ad in page.find_all('div', class_='list-item list-label'):
        # Look up each block once and reuse it for all fields it contains.
        price_lines = ad.find("div", class_="block price").text.split("\n")
        year_miles = squash(ad.find("p", class_="year-miles"))
        body_fuel = squash(ad.find("p", class_="body-type")).split(",")
        wheel_mileage = squash(ad.find("p", class_="volume")).split(",")

        model.append(squash(ad.find('h2', class_="name")))
        price_usd.append(price_lines[2].replace(" ", "")[1:])
        price_kgs.append(price_lines[4].replace(" ", "")[:-3])
        year_of_prod.append(year_miles.replace(",", "").split(".")[0][:-1])
        engine_volume.append(year_miles.replace(",", "")[6:].split("л")[0])
        gearbox.append(year_miles.split(",")[2])
        color.append(ad.find("div", class_="item-info-wrapper").find("p", class_="year-miles").i['title'])
        body_type.append(body_fuel[0])
        fuel.append(body_fuel[1])
        mileage.append(wheel_mileage[1][:-2])
        steering_wheel.append(wheel_mileage[0][4:])
        location.append(ad.find("p", class_="city").text.split("\n")[1].replace(" ", ""))
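
In [None]:
# The fixed-position slicing above breaks silently if a field changes length.
# A regular expression is one alternative, sketched below. The sample string is
# hypothetical, reconstructed from the slicing logic; the real markup may differ.
# `parse_year_engine_gearbox` is a name introduced here for illustration.

```python
import re

# Assumed layout of the "year-miles" line: year, engine volume, gearbox.
sample = "2008 г., 2.4 л, автомат"

YEAR_ENGINE_GEARBOX = re.compile(r"(\d{4})\s*г\.?\s*,\s*([\d.]+)\s*л\s*,\s*(\S+)")


def parse_year_engine_gearbox(text):
    """Split a 'year, engine volume, gearbox' line into typed fields."""
    match = YEAR_ENGINE_GEARBOX.search(text)
    if match is None:
        return None  # unexpected layout: let the caller decide what to do
    year, volume, gearbox = match.groups()
    return int(year), float(volume), gearbox


print(parse_year_engine_gearbox(sample))
```

Returning `None` for lines that do not match makes ads with an unusual layout easy to count and inspect instead of producing shifted values.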

In [None]:
df = pd.DataFrame(list(zip(model,price_usd,price_kgs,year_of_prod,engine_volume,gearbox,color,body_type,fuel,mileage,steering_wheel,location)),
                  columns = ['модель', 'цена usd', 'цена kgs', 'год производства', 'обьем двигателя', 'КПП', 'цвет', 'кузов', 'топливо', 'пробег', 'расположение руля', 'локация'])

In [45]:
# Numeric columns that may have been read as object dtype.
numeric_columns = ['цена usd', 'год производства', 'обьем двигателя', 'пробег']
for col in numeric_columns:
    df[col] = df[col].astype(float)

In [41]:
df = pd.read_csv("zhunusov_mashinakg.csv")

In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5820 entries, 0 to 5819
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         5820 non-null   int64  
 1   модель             5820 non-null   object 
 2   цена usd           5786 non-null   float64
 3   цена kgs           5820 non-null   int64  
 4   год производства   5820 non-null   float64
 5   обьем двигателя    5820 non-null   object 
 6   КПП                5811 non-null   object 
 7   цвет               5820 non-null   object 
 8   кузов              5820 non-null   object 
 9   топливо            5820 non-null   object 
 10  пробег             5700 non-null   float64
 11  расположение руля  5819 non-null   object 
 12  локация            5820 non-null   object 
dtypes: float64(3), int64(2), object(8)
memory usage: 591.2+ KB


In [43]:
df = df.drop(columns=["Unnamed: 0","цена kgs"])

In [44]:
df = df.dropna()



In [46]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split as tts
from sklearn.preprocessing import LabelEncoder

In [47]:
le = LabelEncoder()

In [48]:
df['модель'] = le.fit_transform(df['модель'])
df['цвет'] = le.fit_transform(df['цвет'])
df['кузов'] = le.fit_transform(df['кузов'])
df['расположение руля'] = le.fit_transform(df['расположение руля'])
df['локация'] = le.fit_transform(df['локация'])
df['топливо'] = le.fit_transform(df['топливо'])
df['КПП'] = le.fit_transform(df['КПП'])
df['обьем двигателя'] = le.fit_transform(df['обьем двигателя'])


In [49]:
X = df.copy()
y = X.pop("цена usd")

X_train,X_test,y_train,y_test = tts(X,y, test_size=0.2)

In [62]:
model = GradientBoostingRegressor()
model.fit(X_train,y_train)

In [63]:
model.score(X_train,y_train)

0.926309042575296

In [64]:
model.score(X_test,y_test)

0.7440814930733601

In [65]:
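# index=False avoids writing the row index, which reappears as "Unnamed: 0" on reload.
df.to_csv("Processed_toyota.csv", index=False)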