# Data Cleaning
This notebook presents our approach to automatically filling in missing data.<br>
The main idea is to train models that recognize patterns in features such as brand and power, and to use them to predict a missing value (for instance, the correct vehicle model).<br>
We aim to preserve as much data as possible, that is, to minimize the number of deleted records. We are aware that model-generated values may not be reliable and could affect the original goal of predicting prices, but this is the path we want to explore.

## Imports and paths

In [1]:
import pandas as pd
import pathlib as pl
import helper_functions as help_me

In [2]:
to_fill_path = pl.Path('datasets', '4_filling.csv')

## Data loading

In [3]:
to_fill_df = pd.read_csv(
    filepath_or_buffer=to_fill_path,
    sep=',',
    header=0,
    index_col='ID'
)
to_fill_df.isna().sum()

Cena                     0
Marka_pojazdu         3274
Model_pojazdu         3232
Wersja_pojazdu       48038
Generacja_pojazdu    41646
Moc_KM                3732
Pojemnosc_cm3         4683
Rodzaj_paliwa         3410
Naped                13044
Skrzynia_biegow       3771
Typ_nadwozia          3358
Liczba_drzwi          4362
dtype: int64

## Create training dataset & testing datasets
In this chapter we create the datasets (one training set and several testing sets) used to predict the values of a chosen target column.

In [4]:
training_set = to_fill_df.dropna(
    subset=["Marka_pojazdu", "Model_pojazdu", "Wersja_pojazdu", "Generacja_pojazdu", "Moc_KM", "Pojemnosc_cm3", "Rodzaj_paliwa", "Naped", "Skrzynia_biegow", "Typ_nadwozia", "Liczba_drzwi"]
)
training_set

| ID | Cena | Marka_pojazdu | Model_pojazdu | Wersja_pojazdu | Generacja_pojazdu | Moc_KM | Pojemnosc_cm3 | Rodzaj_paliwa | Naped | Skrzynia_biegow | Typ_nadwozia | Liczba_drzwi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 25900.0 | Renault | Megane | 1.6 16V 110 | III (2008-2016) | 110.0 | 1598.0 | Gasoline | Front wheels | Manual | station_wagon | 5.0 |
| 4 | 5999.0 | Ford | Focus | 1.6 TDCi FX Silver / Silver X | Mk2 (2004-2011) | 90.0 | 1560.0 | Diesel | Front wheels | Manual | compact | 5.0 |
| 8 | 11900.0 | Renault | Scenic | 1.5 dCi Authentique | III (2009-2013) | 105.0 | 1461.0 | Diesel | Front wheels | Manual | minivan | 5.0 |
| 10 | 38900.0 | Audi | A4 | 2.0 TDI | B8 (2007-2015) | 143.0 | 1968.0 | Diesel | Front wheels | Manual | sedan | 4.0 |
| 14 | 38900.0 | Audi | A3 | 1.5 TFSI Sport S tronic | 8V (2012-) | 150.0 | 1498.0 | Gasoline | Front wheels | Automatic | compact | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 135375 | 45900.0 | Ford | Mondeo | 2.0 TDCi Ambiente | Mk5 (2014-) | 150.0 | 1997.0 | Diesel | Front wheels | Manual | sedan | 5.0 |
| 135381 | 3000.0 | Citroën | Xsara | II 1.6i Exclusive | II (2001-2004) | 110.0 | 1587.0 | Gasoline | Front wheels | Manual | compact | 5.0 |
| 135385 | 32500.0 | Opel | Insignia | 2.8 T V6 Cosmo 4x4 | A (2008-2017) | 260.0 | 2792.0 | Gasoline | 4x4 (attached automatically) | Automatic | sedan | 4.0 |
| 135389 | 2300.0 | Peugeot | 307 | 2.0 HDi | I (2001-2005) | 90.0 | 1997.0 | Diesel | Front wheels | Manual | city_cars | 5.0 |


In [5]:
brand_testing_set = help_me.create_testing_set(
    df=to_fill_df,
    target_column='Marka_pojazdu'
)

model_testing_set = help_me.create_testing_set(
    df=to_fill_df,
    target_column='Model_pojazdu'
)

version_testing_set = help_me.create_testing_set(
    df=to_fill_df,
    target_column='Wersja_pojazdu'
)

gen_testing_set = help_me.create_testing_set(
    df=to_fill_df,
    target_column='Generacja_pojazdu'
)

power_testing_set = help_me.create_testing_set(
    df=to_fill_df,
    target_column='Moc_KM'
)

capacity_testing_set = help_me.create_testing_set(
    df=to_fill_df,
    target_column='Pojemnosc_cm3'
)

fuel_testing_set = help_me.create_testing_set(
    df=to_fill_df,
    target_column='Rodzaj_paliwa'
)

drive_testing_set = help_me.create_testing_set(
    df=to_fill_df,
    target_column='Naped'
)

transmission_testing_set = help_me.create_testing_set(
    df=to_fill_df,
    target_column='Skrzynia_biegow'
)

body_testing_set = help_me.create_testing_set(
    df=to_fill_df,
    target_column='Typ_nadwozia'
)

door_testing_set = help_me.create_testing_set(
    df=to_fill_df,
    target_column='Liczba_drzwi'
)
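
The body of `help_me.create_testing_set` is not shown in this notebook, so the following is only a minimal sketch of what such a helper might do. It assumes that a row goes into the testing set when the target column is missing and every other column is present, so that the remaining columns can serve as features. The function name `create_testing_set_sketch` and the toy data are hypothetical and are not part of `helper_functions`.

```python
import pandas as pd


def create_testing_set_sketch(df: pd.DataFrame, target_column: str) -> pd.DataFrame:
    """Hypothetical stand-in for help_me.create_testing_set.

    Assumption: a row belongs to the testing set when the target is missing
    and all other columns are filled in, so they can be used as features.
    """
    feature_columns = df.columns.drop(target_column)
    target_missing = df[target_column].isna()
    features_complete = df[feature_columns].notna().all(axis=1)
    return df.loc[target_missing & features_complete]


# Toy data, invented for illustration only.
toy_df = pd.DataFrame(
    {
        "Cena": [25900.0, 5999.0, 11900.0, 38900.0],
        "Marka_pojazdu": ["Renault", None, None, "Audi"],
        "Moc_KM": [110.0, 90.0, None, 143.0],
    },
    index=pd.Index([2, 4, 8, 10], name="ID"),
)

toy_brand_testing_set = create_testing_set_sketch(toy_df, "Marka_pojazdu")
print(toy_brand_testing_set.index.tolist())  # [4]
```

Only row 4 is selected: row 8 also lacks a brand, but its `Moc_KM` is missing too, so under this assumption it cannot be used for prediction. The real helper may apply a different rule.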


In [6]:
help_me.auto_test(
    training_set=training_set,
    testing_set=brand_testing_set
)

DEBUG:helper_functions:Target column: Marka_pojazdu, Target type: object
DEBUG:helper_functions:Classification


array(['Citroën', 'Audi', 'Opel', ..., 'Renault', 'Toyota', 'Renault'],
      shape=(1183,), dtype=object)
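
The debug log above reports the target type (`object`) and then the chosen task (`Classification`), which suggests that `auto_test` picks the task from the dtype of the target column. The actual logic lives in `helper_functions` and is not shown here, so the snippet below is just a sketch under that assumption. The name `infer_task_sketch` and the `"Regression"` label for numeric targets are hypothetical.

```python
import pandas as pd


def infer_task_sketch(target: pd.Series) -> str:
    """Hypothetical rule: numeric targets -> regression, anything else -> classification."""
    if pd.api.types.is_numeric_dtype(target):
        return "Regression"
    return "Classification"


# Toy columns, invented for illustration only.
toy_brand = pd.Series(["Renault", "Ford", None], dtype=object)
toy_power = pd.Series([110.0, 90.0, None])

print(infer_task_sketch(toy_brand))  # Classification
print(infer_task_sketch(toy_power))  # Regression
```

Under this rule, text columns such as `Marka_pojazdu` would be handled by a classifier, while numeric columns such as `Moc_KM` would be handled by a regressor.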