<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Raleway:wght@100&display=swap" rel="stylesheet">

<h1 style="text-transform: uppercase; text-align: center; font-weight: 100;">
Linear regression in TensorFlow using RNN</h1>
<p style="text-align: center; font-weight: 100;">Predicting Volkswagen car prices</p><br/>

Project requirements:
- The only dataset requirement is a minimum of 10k rows.
- Any type of neural network is allowed: a classical one is acceptable, but a convolutional or recurrent one earns a better grade.
- Project presentation: besides practical questions about the model, there will also be theoretical ones.
The topic is free to choose.

Theory required for the project presentation:
- Practical side.
- Discuss the scheme of an artificial neuron.
- The activation function and why it is so important.
- Explain how a neuron learns: the steps and the algorithm.
- How does a basic neural network work? (The more you know, the better, including more complex networks.)
- Discuss the neural network learning algorithm based on backpropagation (the error backpropagation method).
- Into which subsets is the data divided, and why?
Knowledge of ML and statistics.
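
The neuron scheme, the activation function and the learning steps listed above can be illustrated with a minimal NumPy sketch. It is not part of the project's model: the names `sigmoid` and `train_neuron`, the learning rate, the number of epochs and the toy AND data are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    """Activation function: squashes the weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def train_neuron(X, y, lr=0.5, epochs=10000, seed=0):
    """Train a single sigmoid neuron with gradient descent on squared error."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])  # one weight per input
    b = 0.0                                      # bias
    for _ in range(epochs):
        a = sigmoid(X @ w + b)           # forward pass: weighted sum + activation
        error = a - y                    # difference between output and target
        grad_z = error * a * (1.0 - a)   # chain rule through the sigmoid
        w -= lr * (X.T @ grad_z) / len(y)  # weight update
        b -= lr * grad_z.mean()            # bias update
    return w, b

# Toy data (hypothetical): the logical AND of two inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)

w, b = train_neuron(X, y)
predictions = (sigmoid(X @ w + b) > 0.5).astype(int)
print(predictions)
```

Backpropagation in a multi-layer network repeats the same chain-rule step layer by layer, passing the error from the output back towards the inputs.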

Data source: the data were taken from the popular Polish automotive website Otomoto (https://www.otomoto.pl/).

1. Import libraries and create dataframe:

In [1]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels as sm
import tensorflow as tf
# from model import VolkswagenModel
# from colorama import init, Fore, Style

print(f"np: {np.__version__}, pd: {pd.__version__}, tf: {tf.__version__}, mpl: {mpl.__version__}, sb: {sns.__version__}, sm: {sm.__version__}")
print("Libs loaded.")

# read_csv already returns a DataFrame; the first column of the file is dropped.
df = pd.read_csv("./data/otomoto.csv")
df = df.drop(df.columns[0], axis=1)
df.columns = ['Price', 'Year', 'Mileage', 'Tank capacity', 'Fuel type', 'Model', 'Estimation']
print("Data technical info")
df.info()
df

np: 1.23.2, pd: 1.4.4, tf: 2.11.0, mpl: 3.5.3, sb: 0.12.1, sm: 0.13.2
Libs loaded.
Data technical info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14915 entries, 0 to 14914
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Price          14915 non-null  int64
 1   Year           14915 non-null  int64
 2   Mileage        14915 non-null  int64
 3   Tank capacity  14915 non-null  int64
 4   Fuel type      14915 non-null  int64
 5   Model          14915 non-null  int64
 6   Estimation     14915 non-null  int64
dtypes: int64(7)
memory usage: 815.8 KB


Unnamed: 0,Price,Year,Mileage,Tank capacity,Fuel type,Model,Estimation
0,24900,2010,166000,1390,1,16,0
1,16500,2007,194000,1968,7,39,0
2,364510,2022,5400,1968,7,8,0
3,9999,2005,196425,1390,1,35,2
4,39900,2010,227000,1968,7,32,0
...,...,...,...,...,...,...,...
14910,119500,2020,52000,1498,1,41,2
14911,31000,2002,511000,2461,7,9,0
14912,23500,2011,223000,1390,1,16,1
14913,29800,2012,190000,1598,7,16,2


In [None]:
description = df.describe()
description

In [None]:
check = len(df)  # expected number of non-null values in every column
flag = True
for i, column in enumerate(description.columns):
    # Row 0 of describe() is the "count" row.
    if description.iloc[0, i] != check:
        print(f"Column '{column}' has a different number of values than the dataframe has rows.")
        print(f"Problem at cell: (0, {i})")
        flag = False
    else:
        print(f"Checked: (0, {i})")

print("All columns passed" if flag else "Not passed")

2. Clean data:

    The data was cleaned previously in the 'scratchpad.py' file, and all records are now represented by integers (the values of each column are documented in the 'model.py' file). A value of 0 represents an error in reading the data. The "Estimation" column contains many 0 values because not every listing on the website included this information.

In [None]:
# Count rows in which at least one of the required columns holds the error value 0.
required_cols = ["Price", "Year", "Mileage", "Tank capacity", "Model"]
counter = int((df[required_cols] == 0).any(axis=1).sum())
print(counter)


Deleting outliers


In [None]:
plt.figure()
# A scatter plot shows individual cars, so extreme prices and mileages are easy to spot.
sns.scatterplot(x=df["Price"], y=df["Mileage"])
plt.show()
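
One common way to drop outliers before training is the interquartile-range (IQR) rule. The sketch below is an assumption about how this step could be done, not the notebook's actual method: the helper name `remove_outliers_iqr`, the factor `k=1.5` and the small demo frame with illustrative values are all hypothetical.

```python
import pandas as pd

def remove_outliers_iqr(frame, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]

# Illustrative values only; the last row is an extreme price with a very low mileage.
demo = pd.DataFrame({
    "Price": [24900, 16500, 9999, 39900, 31000, 23500, 29800, 364510],
    "Mileage": [166000, 194000, 196425, 227000, 211000, 223000, 190000, 5400],
})
cleaned = remove_outliers_iqr(demo, ["Price", "Mileage"])
print(len(demo), "->", len(cleaned))
```

On the real data the same call would be `remove_outliers_iqr(df, ["Price", "Mileage"])`; whether 1.5 is a suitable factor for car prices is a judgment call that the histograms and box plots can inform.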

Normalization
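
Features such as Year and Mileage live on very different scales, which usually slows down or destabilises neural-network training. A minimal sketch of min-max scaling follows; it is one possible approach rather than the notebook's chosen method, and the helper name `min_max_scale` and the demo values are hypothetical.

```python
import pandas as pd

def min_max_scale(frame, columns):
    """Rescale the listed columns to [0, 1]; return a copy and the (min, max) pairs used.

    Assumes each column has at least two distinct values (otherwise max - min is 0).
    """
    scaled = frame.copy()
    params = {}
    for col in columns:
        lo, hi = frame[col].min(), frame[col].max()
        params[col] = (lo, hi)
        scaled[col] = (frame[col] - lo) / (hi - lo)
    return scaled, params

# Illustrative values only.
demo = pd.DataFrame({
    "Year": [2005, 2010, 2022],
    "Mileage": [196425, 166000, 5400],
})
scaled, params = min_max_scale(demo, ["Year", "Mileage"])
print(scaled)
```

The returned `params` should be computed on the training subset only and then reused for the validation and test subsets, so that no information leaks from the held-out data.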

Histograms

In [None]:
def make_hist(col_name, bins_val):
    # Avoid the names min/max, which would shadow Python's built-ins.
    min_val = df[col_name].min()
    max_val = df[col_name].max()
    print(f"Lowest {col_name} value: {min_val}. Highest {col_name} value: {max_val}.")
    plot_hist = df[col_name].plot.hist(bins=bins_val, grid=True)
    plot_hist.set_title(f"Number of cars per {col_name.upper()} bin")
    plot_hist.set_xlabel(f"{col_name}")
    plot_hist.set_ylabel("Number of observations")
    plt.show()

In [None]:
make_hist("Price", 57)

In [None]:
year_min = df["Year"].min()
year_max = df["Year"].max()
make_hist("Year", int(year_max - year_min))  # roughly one bin per production year

In [None]:
make_hist("Mileage", 70)

In [None]:
make_hist("Tank capacity", 10) # default bins = 10

# The bins still need to be adjusted

In [None]:
make_hist("Fuel type", 10) # default bins = 10

In [None]:
NO_cars_fuel = (
    df['Fuel type'].value_counts()
    .rename_axis('type of fuel')
    .reset_index(name='NO of cars with specific fuel type')
)
# Fuel codes as noted by the author (Polish names translated to English).
# The mapping must go from code to name; mapping name -> code yields NaN.
fuel_names = {1: "Petrol", 2: "Petrol+LPG", 3: "Petrol+CNG", 4: "Electric", 5: "Hybrid", 6: "Hydrogen", 7: "Diesel"}
NO_cars_fuel['name of fuel'] = NO_cars_fuel['type of fuel'].map(fuel_names)
NO_cars_fuel

In [None]:
make_hist("Model", 29)

In [None]:
make_hist("Estimation", 10) # default bins = 10

Number of cars in each category, shown per column.

In [None]:
# value_counts() per column; rename_axis(...) and reset_index(name=...) name both
# output columns explicitly, which behaves the same in pandas 1.x and 2.x.
# pr = df['Price'].value_counts().rename_axis('price_cat').reset_index(name='NO_cars')
# yr = df['Year'].value_counts().rename_axis('year_cat').reset_index(name='NO_cars')
# mil = df['Mileage'].value_counts().rename_axis('mileage_cat').reset_index(name='NO_cars')
# tank = df['Tank capacity'].value_counts().rename_axis('tank_capacity').reset_index(name='NO_cars')
# fuel = df['Fuel type'].value_counts().rename_axis('type_fuel').reset_index(name='NO_cars')
mod = df['Model'].value_counts().rename_axis('type_model').reset_index(name='NO_cars')
est = df['Estimation'].value_counts().rename_axis('type_est').reset_index(name='NO_cars')

def create_table(table_data, table_title):
    fig = plt.figure()
    table = plt.table(cellText=table_data.values, colLabels=table_data.columns, loc='upper left')
    table.auto_set_font_size(False)
    table.set_fontsize(14)
    table.scale(1, 1.5)
    plt.subplots_adjust(left=0.2, bottom=0.2)
    plt.title(table_title)
    plt.axis('off')
    return fig

# table1 = create_table(pr, "Pr")
# table2 = create_table(yr, "Yr")
# table3 = create_table(mil, "Mil")
# table4 = create_table(tank, "Tank")
# table5 = create_table(fuel, "Fuel")
table6 = create_table(mod, "Mod")
table7 = create_table(est, "Est")

Box plots

In [None]:
sns.boxplot(y=df['Price']).set(title="PRICE boxplot", xlabel="", ylabel="Price")
sns.despine()

In [None]:
# # create FacetGrid
# grid = sns.FacetGrid(pd.melt(df.reset_index(), id_vars=['index']), col='variable', col_wrap=2)
#
# # plot boxplots on each facet
# grid.map(sns.boxplot, 'variable', 'value')
#
# # set titles and labels
# for ax in grid.axes.flat:
#     ax.set_title(ax.get_title().split('=')[1])
#     ax.set_xlabel('')
#     ax.set_ylabel('')
#
# plt.subplots_adjust(top=0.9)
# grid.fig.suptitle('Boxplots Grid')
#
# plt.show()

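def grid_boxplots(*args, n_cols=2):
    """Draw one boxplot per column name passed in, arranged in a grid."""
    col_list = list(args)
    n_rows = -(-len(col_list) // n_cols)  # ceiling division
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(5 * n_cols, 4 * n_rows), squeeze=False)
    for ax, col in zip(axes.ravel(), col_list):
        sns.boxplot(y=df[col], ax=ax)
        ax.set_title(f"{col} boxplot")
    for ax in axes.ravel()[len(col_list):]:
        ax.set_visible(False)  # hide unused grid cells
    fig.tight_layout()
    return fig
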

Heat Map

In [None]:
correlation_matrix = np.corrcoef(df.values.T)

fig, ax = plt.subplots(figsize=(7, 7))
sns.set(font_scale=1.1)
sns.heatmap(data=correlation_matrix, square=True, cbar=True, annot=True, annot_kws={'size': 10}, xticklabels=df.columns, yticklabels=df.columns, fmt=".2f", linewidth=.5, cmap=sns.cubehelix_palette(as_cmap=True))

Pair plot

In [None]:
# pairplot expects a DataFrame; the columns to plot are selected with `vars`.
sns.pairplot(df, vars=df.columns.tolist(), height=1.5)

Model Building

Splitting data into sets
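
The data is typically divided into three subsets: a training set used to fit the weights, a validation set used to tune hyperparameters and detect overfitting during training, and a test set used once for the final, unbiased evaluation. A minimal sketch of such a split follows; the helper name `split_frame`, the 70/15/15 proportions and the seed are assumptions, not values fixed by the project.

```python
import numpy as np
import pandas as pd

def split_frame(frame, train=0.7, val=0.15, seed=42):
    """Shuffle the rows and split them into train / validation / test parts."""
    shuffled = frame.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    train_df = shuffled.iloc[:n_train]
    val_df = shuffled.iloc[n_train:n_train + n_val]
    test_df = shuffled.iloc[n_train + n_val:]  # whatever remains
    return train_df, val_df, test_df

# Hypothetical stand-in for the real dataframe.
demo = pd.DataFrame({"Price": np.arange(100), "Year": np.arange(100)})
train_df, val_df, test_df = split_frame(demo)
print(len(train_df), len(val_df), len(test_df))
```

Shuffling before splitting matters here because listings scraped from a website may be ordered, for example by date or price.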

Model Architecture
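
A possible architecture, sketched with the Keras API bundled in TensorFlow. This is only an assumption about how a recurrent regressor could be set up for this tabular data: the six input features are treated as a sequence of six scalar time steps, and the function name `build_model`, the layer size and the optimizer are hypothetical choices.

```python
import tensorflow as tf

N_FEATURES = 6  # Year, Mileage, Tank capacity, Fuel type, Model, Estimation

def build_model(n_features=N_FEATURES, units=32):
    """Small recurrent regressor: each feature is fed as one step of a scalar sequence."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features, 1)),
        tf.keras.layers.SimpleRNN(units),
        tf.keras.layers.Dense(1),  # linear output for the price
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

model = build_model()
model.summary()
```

With this input shape, a feature matrix `X` of shape `(n_rows, 6)` would need to be reshaped to `(n_rows, 6, 1)` before being passed to the model.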

Training the model

Evaluating the model
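
For a regression target such as price, the model is usually judged on the test set with error metrics rather than accuracy. The sketch below computes three common ones with NumPy; the function name `regression_metrics` and the sample numbers are hypothetical and are not results from this project.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return mean absolute error, root mean squared error and R^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Made-up prices and predictions, for illustration only.
metrics = regression_metrics([24900, 16500, 39900], [26000, 15000, 41000])
print(metrics)
```

MAE and RMSE are expressed in the same unit as the price, which makes them easy to interpret; RMSE penalises large errors more strongly than MAE.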

### DEPLOY - Docker, GitHub repo