# I. Importing Data

First, we import the libraries we need, *numpy* and *pandas*, and list every data file available in the input directory.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

As the listing above shows, the data is spread across many files, one per brand, and each file is named after its brand. We can therefore take the file name from each path (without the extension) and use it as the key for that brand's dataframe.
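The idea can be sketched with `pathlib`, whose `stem` attribute drops both the directory and the extension. The paths below are hypothetical stand-ins for the real input directory, and `brand_from_path` is a helper name introduced only for this illustration.

```python
from pathlib import PurePosixPath

def brand_from_path(path):
    """Return the file name without directory or extension (hypothetical helper)."""
    return PurePosixPath(path).stem

# Hypothetical paths, used only to illustrate the naming scheme
example_paths = [
    "/kaggle/input/cars/audi.csv",
    "/kaggle/input/cars/unclean cclass.csv",
]
brands = [brand_from_path(p) for p in example_paths]
print(brands)
```

Unlike `split('.')[0]`, `stem` removes only the last extension, so a file name that itself contains dots would be kept whole.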

In [None]:
sources = []

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        sources.append(os.path.join(dirname, filename))
        
car_names = [j.split('/')[-1].split('.')[0] for j in sources]
df = {}
for c, s in zip(car_names, sources):
    df[c] = pd.read_csv(s)

In [None]:
for n in car_names:
    print(n)
    print(df[n].columns)

As shown above, the dataframes do not all use the same column names, even where the columns hold the same feature (*'fuelType'* and *'fuel type'*, for example, both record the fuel a car uses). We therefore rename some columns so that they match across dataframes.
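One way to sketch this step is a single mapping from variant names to canonical names, applied to every dataframe. The two toy frames and the `canonical` mapping below are assumptions for illustration, not the real data.

```python
import pandas as pd

# Toy frames standing in for two brand files with mismatched column names
frames = {
    "brand_a": pd.DataFrame({"fuelType": ["Petrol"], "engineSize": [1.4]}),
    "brand_b": pd.DataFrame({"fuel type": ["Diesel"], "engine size": [2.0]}),
}

# Variant name -> canonical name (assumed mapping for this sketch)
canonical = {"fuel type": "fuelType", "engine size": "engineSize"}

# rename() ignores keys that are absent, so the same mapping is safe for every frame
frames = {name: frame.rename(columns=canonical) for name, frame in frames.items()}

for name, frame in frames.items():
    print(name, list(frame.columns))
```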

In [None]:
df['hyundi'] = df['hyundi'].rename(columns={'tax(£)':'tax'})
df['unclean cclass'] = df['unclean cclass'].rename(columns={'fuel type':'fuelType', 'engine size':'engineSize',
                                                            'fuel type2':'fuelType2', 'engine size2':'engineSize2'})

# II. Exploratory Data Analysis

Next, I want to find which features of a car contribute most to its price. We can estimate this with an F-test (scikit-learn's `f_regression`, a univariate linear-regression test closely related to ANOVA), using the price as the target and every other feature as a candidate. Categorical features such as *model*, *transmission* and *fuel type* are excluded here.
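For intuition, the score that `f_regression` assigns to a feature can be computed by hand: take the Pearson correlation r between the feature and the target and convert it to F = r^2 / (1 - r^2) * (n - 2). The sketch below does this with NumPy on synthetic data; the variable names and generated values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data: 'price' depends strongly on 'year' and not at all on 'noise'
year = rng.normal(size=n)
noise = rng.normal(size=n)
price = 3.0 * year + rng.normal(scale=0.5, size=n)

def f_score(x, y):
    """Univariate F-statistic of y regressed on a single feature x."""
    r = np.corrcoef(x, y)[0, 1]  # Pearson correlation
    return r**2 / (1 - r**2) * (len(x) - 2)

scores = {"year": f_score(year, price), "noise": f_score(noise, price)}
print(scores)
```

A larger score means a stronger linear relationship with the target; a feature whose effect on price is non-linear can score low under this test.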

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression

# Numeric features only; price is the target
features = df['audi'].drop(['price', 'model', 'transmission', 'fuelType'], axis=1)
target = df['audi']['price']

# Keep the 3 features with the highest F-scores against price
selector = SelectKBest(f_regression, k=3)
selector.fit(features, target)

# get_support() returns a boolean mask over the input columns
important_cols = list(features.columns[selector.get_support()])
important_cols

From the result above, **year, miles per gallon** and **engine size** are the three features most strongly related to price. They are listed in column order, not ranked by importance. Note that this analysis covers only the Audi data; we assume, without checking, that other brands would give a similar result.

In [None]:
df['toyota'].head(3)

# III. Model Building

Now we build a model that predicts the price of a car. The model's features are:

**1.** Brand

**2.** Year being bought

**3.** Model

**4.** Transmission

**5.** Mileage

**6.** Fuel Type

**7.** Miles per gallon

**8.** Engine Size

To make the predictions more accurate, **we build one model per brand**. Each brand carries its own intrinsic value that is hard to express in numbers (we could try, but we won't here). Besides, building a model for each brand is also an attempt to remove the effect of perceived luxury, which we can't measure directly.

We wrap these steps in functions. Later, when we need a model for a new brand, we only have to call these functions, or loop over the brands to get all the models at once.

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def made_up_encoder(dx, param):
    # Categories of `param`, ordered from lowest to highest mean price
    a = dx[[param, 'price']].groupby(param).mean().sort_values('price', ascending=True).index
    return a

def convert_to_num(dx, param, mu):
    a = []
    for i in dx[param]:
        for j, k in enumerate(mu):
            if i == k: 
                a.append(j)
                break
    return a

def made_up_transformer(ds, params):
    for p in params:
        a = made_up_encoder(ds, p)
        ds[p] = convert_to_num(ds, p, a)
    return ds

def the_model(x_train, y_train):
    model = Sequential()
    model.add(Dense(10, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(5, activation='relu'))
    model.add(Dropout(0.2))
    # Note: a ReLU output can get stuck at 0; a linear output is the usual choice for regression
    model.add(Dense(1, activation='relu'))
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    model.fit(x_train, y_train, epochs=15)
    return model

def process_for_model(dv, params):
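    cols = dv.columns
    # Encode on a copy so the caller's dataframe is left untouched
    dv_new = made_up_transformer(dv.copy(), params)
    # Scale every column, including price, to the [0, 1] range
    dv_new = pd.DataFrame(MinMaxScaler().fit_transform(dv_new), columns=cols)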
    
    x_ = dv_new.drop(['price'], axis=1)
    y_ = dv_new.price
    
    x_train, x_valid, y_train, y_valid = train_test_split(x_, y_, test_size=0.2, random_state=42)
    
    model_car = the_model(x_train, y_train)
    
    return model_car, x_train, x_valid, y_train, y_valid

**Here we use an artificial neural network as the model.**

Some of the features are categorical. We could simply use a label encoder, but that is not what we do here, because these categorical features also contribute to price: some models are more expensive than others simply because of the model, and the same goes for fuel type.

So we do something similar to a label encoder, except that a value with a higher average price gets a higher label (a BMW X7 would be labeled higher than a BMW X5, for example).
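The price-ordered encoding can be sketched on a toy table. The models and prices below are made up for illustration, and the `map`-based lookup is a compact variant of the same idea rather than the notebook's exact code.

```python
import pandas as pd

# Toy listings; models and prices are made up for illustration
cars = pd.DataFrame({
    "model": ["X5", "X7", "1 Series", "X5", "X7", "1 Series"],
    "price": [30000, 60000, 15000, 34000, 64000, 17000],
})

# Order the categories by mean price, cheapest first
ordered = cars.groupby("model")["price"].mean().sort_values().index

# Map each category to its position in that ordering
codes = {category: rank for rank, category in enumerate(ordered)}
cars["model_code"] = cars["model"].map(codes)

print(codes)
```

Because the labels are derived from price, computing them on the full dataset lets information from the validation rows reach the features; deriving the ordering from the training split only would avoid that.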

In [None]:
model = {}
x_train = {}
x_valid = {}
y_train = {}
y_valid = {}

car_brands = ['audi', 'toyota', 'skoda', 'ford', 'vauxhall', 'bmw', 'vw', 'hyundi', 'merc']

for c in car_brands:
    model[c], x_train[c], x_valid[c], y_train[c], y_valid[c] = process_for_model(df[c],
                                                                                 ['model', 'transmission', 'fuelType'])

In [None]:
from sklearn.metrics import mean_absolute_error

scaler_price = {}
y_predict = {}
y_predict_conv = {}
y_valid_conv = {}
the_errors = np.zeros([len(car_brands), 2])

for c in car_brands:
    scaler_price[c] = MinMaxScaler().fit(np.array(df[c].price).reshape(-1, 1))

for i, c in enumerate(car_brands):
    y_predict[c] = model[c].predict(x_valid[c])
    
    y_predict_conv[c] = scaler_price[c].inverse_transform(y_predict[c])
    y_valid_conv[c] = scaler_price[c].inverse_transform(np.array(y_valid[c]).reshape(-1, 1))
    
    the_errors[i, 0] = mean_absolute_error(y_predict[c], y_valid[c])
    the_errors[i, 1] = mean_absolute_error(y_predict_conv[c], y_valid_conv[c])
    
df_errors = pd.DataFrame(the_errors, index=car_brands, columns=['MAE on scale', 'MAE on real price'])
df_errors

The results above show that our model predicts worst for Hyundi (the dataset's spelling; the brand is of course Hyundai). On average, its predicted price is off by 11,268 pounds in either direction, which is obviously not a good number.

The other models are quite satisfying: none of them has a large error.
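The two error columns are linked by the scaling itself: min-max scaling divides by (max - min), so the MAE in real prices equals the scaled MAE times the price range. A minimal NumPy check on made-up prices and predictions:

```python
import numpy as np

# Made-up prices and predictions, for illustration only
prices = np.array([5000.0, 12000.0, 20000.0, 45000.0])
predicted = np.array([6000.0, 11000.0, 23000.0, 40000.0])

lo, hi = prices.min(), prices.max()

def scale(values):
    """Min-max scale using the range of the true prices."""
    return (values - lo) / (hi - lo)

mae_real = np.mean(np.abs(predicted - prices))
mae_scaled = np.mean(np.abs(scale(predicted) - scale(prices)))

# Multiplying by the price range recovers the error in pounds
print(mae_real, mae_scaled * (hi - lo))
```

A brand with a wide price range can therefore show a small scaled error and still be far off in pounds.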