___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___

# WELCOME!

## Introduction
Welcome to the "***AutoScout Data Analysis Project***". This is the capstone project of the ***Data Analysis*** module. The **Auto Scout** data used in this project was scraped from an online car trading company in 2019 and contains many features of 9 different car models. In this project, you will have the opportunity to apply many commonly used techniques for Data Cleaning and Exploratory Data Analysis with Python libraries such as Numpy, Pandas, Matplotlib, Seaborn and Scipy, and then analyze the cleaned dataset.

The project consists of 3 parts:
* The first part covers 'data cleaning': incorrect headers, incorrect formats, anomalies, and dropping useless columns.
* The second part covers 'filling data': handling missing values and transforming categorical features to numeric ones.
* The third part covers 'handling outliers' with the help of visualisation libraries, and extracting some insights.


### Columns:

**General Columns**
* url: url of autos
* short_description, description: Description of autos (in English and German) written by users

**Categorical Columns**
* make_model, make, model: Model of autos. Ex:Audi A1
* body_type, body: Body type of autos Example: van, sedans
* vat: VAT deductible, price negotiable
* registration, first_registration: First registration date and year of autos.
* prev_owner, previous_owners: Number of previous owners
* type: new or used
* next_inspection, inspection_new: information about inspection (inspection date,..)
* body_color, body_color_original: Color of auto Ex: Black, red
* paint_type: Paint type of auto Ex: Metallic, Uni/basic
* upholstery: Upholstery information (texture, color)
* gearing_type: Type of gear Ex: automatic, manual
* fuel : fuel type Ex: diesel, benzine
* co2_emission, emission_class, emission_label: emission information
* drive_chain: drive chain Ex: front,rear, 4WD
* consumption: consumption of auto in city, country and combination (lt/100 km)
* country_version
* entertainment_media
* safety_security
* comfort_convenience
* extras

**Quantitative Columns**
* price: Price of cars
* km: km of autos
* hp: horsepower of autos (kW)
* displacement: displacement of autos (cc)
* warranty: warranty period (month)
* weight: weight of auto (kg)
* nr_of_doors: number of doors
* nr_of_seats : number of seats
* cylinders: number of cylinders
* gears: number of gears



# PART- 1 `( Data Cleaning )`

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import warnings;
warnings.filterwarnings("ignore")
import re
pd.set_option("display.max_columns",None)
pd.set_option("display.max_rows",None)

In [2]:
auto_org = pd.read_json('scout_car.json', lines=True)

In [3]:
auto = auto_org.copy()

In [4]:
auto.head(3).T

Unnamed: 0,0,1,2
url,https://www.autoscout24.com//offers/audi-a1-sp...,https://www.autoscout24.com//offers/audi-a1-1-...,https://www.autoscout24.com//offers/audi-a1-sp...
make_model,Audi A1,Audi A1,Audi A1
short_description,Sportback 1.4 TDI S-tronic Xenon Navi Klima,1.8 TFSI sport,Sportback 1.6 TDI S tronic Einparkhilfe plus+m...
body_type,Sedans,Sedans,Sedans
price,15770,14500,14640
vat,VAT deductible,Price negotiable,VAT deductible
km,"56,013 km","80,000 km","83,450 km"
registration,01/2016,03/2017,02/2016
prev_owner,2 previous owners,,1 previous owner
kW,,,


In [5]:
auto.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 54 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   url                            15919 non-null  object 
 1   make_model                     15919 non-null  object 
 2   short_description              15873 non-null  object 
 3   body_type                      15859 non-null  object 
 4   price                          15919 non-null  int64  
 5   vat                            11406 non-null  object 
 6   km                             15919 non-null  object 
 7   registration                   15919 non-null  object 
 8   prev_owner                     9091 non-null   object 
 9   kW                             0 non-null      float64
 10  hp                             15919 non-null  object 
 11  Type                           15917 non-null  object 
 12  Previous Owners                9279 non-null  

## Change column names

In [6]:
auto.columns = auto.columns.str.lower().str.replace(' ','_', regex=False).str.replace('.','', regex=False).str.replace('\n','', regex=False).str.replace('_&','', regex=False)

auto.columns

Index(['url', 'make_model', 'short_description', 'body_type', 'price', 'vat',
       'km', 'registration', 'prev_owner', 'kw', 'hp', 'type',
       'previous_owners', 'next_inspection', 'inspection_new', 'warranty',
       'full_service', 'non-smoking_vehicle', 'null', 'make', 'model',
       'offer_number', 'first_registration', 'body_color', 'paint_type',
       'body_color_original', 'upholstery', 'body', 'nr_of_doors',
       'nr_of_seats', 'model_code', 'gearing_type', 'displacement',
       'cylinders', 'weight', 'drive_chain', 'fuel', 'consumption',
       'co2_emission', 'emission_class', 'comfort_convenience',
       'entertainment_media', 'extras', 'safety_security', 'description',
       'emission_label', 'gears', 'country_version', 'electricity_consumption',
       'last_service_date', 'other_fuel_types', 'availability',
       'last_timing_belt_service_date', 'available_from'],
      dtype='object')
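
As a minimal sketch of the renaming step on a toy frame (the three headers below are illustrative, loosely modeled on the raw ones, and are not taken from the file), passing `regex=False` makes every replacement literal, so `'.'` cannot be read as the regex wildcard:

```python
import pandas as pd

# Toy frame with hypothetical raw headers (not the real file)
toy = pd.DataFrame(columns=['Previous Owners', 'Nr. of Doors', '\nComfort & Convenience\n'])

toy.columns = (
    toy.columns.str.lower()
    .str.replace(' ', '_', regex=False)   # spaces -> underscores
    .str.replace('.', '', regex=False)    # literal dot, not "any character"
    .str.replace('\n', '', regex=False)   # stray newlines from scraping
    .str.replace('_&', '', regex=False)   # 'comfort_&_convenience' -> 'comfort_convenience'
)
print(list(toy.columns))
```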

## url column

In [7]:
auto.drop('url', axis=1, inplace=True)

## make_model column

In [8]:
auto['make_model'].value_counts(dropna=False)

Audi A3           3097
Audi A1           2614
Opel Insignia     2598
Opel Astra        2526
Opel Corsa        2219
Renault Clio      1839
Renault Espace     991
Renault Duster      34
Audi A2              1
Name: make_model, dtype: int64

## 'Make' column

In [9]:
auto['make'] = auto['make'].str.replace('\n','')

In [10]:
auto.drop('make', axis=1, inplace=True)

* This column includes only the makes. It matches the make_model column, so it was dropped.

## 'Model' column

In [11]:
auto['model'] = auto['model'].str[1]

In [12]:
auto.drop('model', axis=1, inplace=True)

* This column includes only the models. It matches the make_model column, so it was dropped.

## displacement column

In [13]:
#auto['displacement'].str[0].str.replace(',','').str.replace('\n','').str.replace(' cc','').value_counts().index

In [14]:
auto['displacement'].str[0].str.replace(',','').str.findall('\d+').str[0].astype('float')

0         1422.0
1         1798.0
2         1598.0
3         1422.0
4         1422.0
5         1598.0
6         1598.0
7         1422.0
8         1598.0
9          999.0
10        1598.0
11        1395.0
12        1395.0
13        1395.0
14         999.0
15        1598.0
16        1598.0
17        1422.0
18         999.0
19        1598.0
20        1598.0
21        1395.0
22        1395.0
23        1598.0
24         999.0
25         999.0
26         999.0
27         999.0
28        1422.0
29         999.0
30        1422.0
31        1422.0
32         999.0
33        1422.0
34        1422.0
35         999.0
36         999.0
37        1422.0
38         999.0
39        1422.0
40         999.0
41        1422.0
42        1422.0
43        1422.0
44        1422.0
45         999.0
46        1422.0
47        1598.0
48         999.0
49         999.0
50        1422.0
51         999.0
52         999.0
53        1598.0
54        1422.0
55         999.0
56         999.0
57         999.0
58        1422

In [15]:
auto['displacement'] = auto['displacement'].str[0].str.replace(',','').str.findall('\d+').str[0].astype('float')

## Short description column

In [16]:
sd_disp = auto['short_description'].str.findall('\d\.\d').str[0].astype(float)

In [17]:
auto['sd_disp'] = sd_disp*1000

In [18]:
auto['sd_disp'] = auto['sd_disp'].replace(1600,1598).replace(1800,1798)

In [19]:
auto.loc[auto['sd_disp']<800,'sd_disp'] = np.nan

In [20]:
def disp(d1,d2):
    if (d1>4000) | (d1<700) | np.isnan(d1):
        if np.isnan(d2):
            return d1
        else:
            return d2
    else:
        return d1

In [21]:
auto['displacement'] = auto.apply(lambda x: disp(x['displacement'],x['sd_disp']),axis=1)
    

In [22]:
auto[['displacement', 'sd_disp']].value_counts()

displacement  sd_disp
1598.0        1598.0     3677
999.0         1000.0     1238
1398.0        1400.0     1004
1399.0        1400.0      558
1229.0        1200.0      541
1956.0        2000.0      534
1422.0        1400.0      441
1490.0        1500.0      398
1461.0        1500.0      332
1395.0        1400.0      298
1968.0        2000.0      265
1149.0        1200.0      213
1197.0        1200.0      150
1400.0        1400.0      114
1600.0        1598.0      103
1248.0        1300.0       96
1364.0        1400.0       93
1618.0        1598.0       88
1500.0        1500.0       85
1498.0        1500.0       66
1798.0        1798.0       59
2000.0        2000.0       49
998.0         1000.0       48
1200.0        1200.0       42
898.0         900.0        40
1300.0        1300.0       37
1000.0        1000.0       36
2480.0        2500.0       19
1398.0        4000.0       17
1998.0        2000.0       15
1397.0        1400.0       11
1997.0        2000.0       11
999.0         5700

* The displacement values in the short description column were extracted, compared, and matched with the displacement column. Missing or implausible displacement values (below 700 or above 4000) were filled in this way.
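
The same fallback rule can also be written without a row-wise `apply`. The sketch below uses made-up values and a hypothetical output column `displacement_filled`; it is one possible vectorized equivalent of `disp()`, not the code that produced the counts above:

```python
import numpy as np
import pandas as pd

# Toy values: displacement from its own column, sd_disp parsed from the description
toy = pd.DataFrame({
    'displacement': [1598.0, np.nan, 2.0, 16000.0, np.nan],
    'sd_disp':      [1598.0, 1400.0, 1200.0, np.nan, np.nan],
})

# Same rule as disp(): distrust values that are missing, below 700 or above 4000
suspect = toy['displacement'].isna() | (toy['displacement'] < 700) | (toy['displacement'] > 4000)

# Use sd_disp only where the original is suspect AND sd_disp is available
use_sd = suspect & toy['sd_disp'].notna()
toy['displacement_filled'] = toy['displacement'].mask(use_sd, toy['sd_disp'])
print(toy)
```

As in `disp()`, a suspect value is kept when the description offers no alternative, so such rows still need attention later.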

In [23]:
auto.drop(['short_description', 'sd_disp'], axis=1, inplace=True)

## body_type column

In [24]:
auto['body_type'].value_counts(dropna=False)

Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
NaN                60
Off-Road           56
Coupe              25
Convertible         8
Name: body_type, dtype: int64

## body column 

In [25]:
auto['body'] = auto['body'].str[1]

In [26]:
auto['body'].value_counts(dropna=False)

Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
NaN                60
Off-Road           56
Coupe              25
Convertible         8
Name: body, dtype: int64

**This column matches the body_type column**

In [27]:
auto['body_type'].value_counts(dropna=False)

Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
NaN                60
Off-Road           56
Coupe              25
Convertible         8
Name: body_type, dtype: int64

In [28]:
auto[(~(auto['body']==auto['body_type']) & (auto['body_type'].isnull()))]['body'].value_counts(dropna=False)

NaN    60
Name: body, dtype: int64

In [29]:
auto.drop('body', axis=1,inplace=True)

* Since body and body_type columns exactly match, body column was dropped.
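
Because `NaN == NaN` evaluates to False, a plain `==` comparison reports missing rows as mismatches. A small sketch on toy data (values are illustrative) of two ways to confirm that a pair of columns really agrees:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'body':      ['Sedans', 'Van', np.nan, 'Compact'],
    'body_type': ['Sedans', 'Van', np.nan, 'Compact'],
})

# Count rows that differ and are not simply missing in both columns
both_missing = toy['body'].isna() & toy['body_type'].isna()
mismatches = ((toy['body'] != toy['body_type']) & ~both_missing).sum()
print(mismatches)

# Series.equals treats NaNs in the same position as equal
same = toy['body'].equals(toy['body_type'])
print(same)
```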

## Price Column

In [30]:
auto['price'].sort_values()

8594        13
8828       120
6066       255
8829       331
8827      4950
8825      4990
8826      5250
8824      5300
13770     5445
8823      5450
8822      5490
8820      5499
3235      5555
8821      5600
13763     5700
13762     5800
13760     5850
13759     5890
8818      5890
13761     5900
13757     5900
13758     5900
13756     5938
8819      5950
8602      5970
8600      5970
8601      5970
8597      5990
13755     6000
13754     6000
5712      6000
8599      6100
8598      6200
13753     6200
8596      6200
13752     6200
13751     6250
8816      6250
13749     6290
8807      6290
8806      6299
8805      6300
8803      6380
8802      6390
13750     6400
8801      6400
8799      6450
8804      6450
8800      6479
8796      6480
8797      6480
8798      6480
6398      6480
8795      6489
8794      6490
13421     6490
13423     6490
8793      6490
13748     6499
8792      6499
8791      6500
6063      6500
13747     6500
13746     6500
13745     6500
8790      6500
8788      

In [31]:
auto[auto['price']<4000].T

Unnamed: 0,6066,8594,8828,8829
make_model,Opel Astra,Opel Corsa,Opel Corsa,Opel Corsa
body_type,Station wagon,Sedans,Compact,Other
price,255,13,120,331
vat,,,,
km,"5,563 km",123 km,12 km,10 km
registration,06/2018,06/2018,01/2019,01/2019
prev_owner,,,,
kw,,,,
hp,100 kW,66 kW,66 kW,66 kW
type,"[, Used, , Diesel (Particulate Filter)]","[, Used, , Gasoline]","[, New, , Gasoline]","[, New, , Gasoline]"


In [32]:
price_outlier = auto[auto['price']<4000].index

In [33]:
auto.drop(price_outlier, inplace=True)

* 4 rows were dropped since their price and km are meaningless

## vat column 

In [34]:
auto.vat.value_counts(dropna=False)

VAT deductible      10980
NaN                  4509
Price negotiable      426
Name: vat, dtype: int64

## km column

In [35]:
auto.km = auto.km.str.replace(',','').str.findall('\d+').str[0].astype('float')

* km column is cleaned and converted to float. '-' rows are converted to np.nan. 
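
A hedged alternative for the same cleaning step uses `str.extract` with `pd.to_numeric`. The sample strings are made up, and `'- km'` is an assumption about how the missing entries look in the raw data:

```python
import pandas as pd

# Toy values in the same shape as the raw km column
km_raw = pd.Series(['56,013 km', '80,000 km', '- km', '10 km'])

km = pd.to_numeric(
    km_raw.str.replace(',', '', regex=False)   # drop thousands separators
          .str.extract(r'(\d+)', expand=False), # first run of digits, NaN if none
    errors='coerce',
)
print(km.tolist())
```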

## registration column

In [36]:
auto.registration = auto.registration.replace('-/-',np.nan)

In [37]:
auto.registration = pd.to_datetime(auto.registration)

In [38]:
auto['year'] = pd.DatetimeIndex(auto['registration']).year 

In [39]:
auto.drop('registration', axis=1, inplace=True)

* registration column was converted to datetime. 
* year column was created from registration column. 
* All null values are converted to np.nan
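
If the month/year layout is known, giving `pd.to_datetime` an explicit format avoids any guessing about day-first versus month-first parsing. A sketch on toy strings, assuming every valid entry follows `MM/YYYY`:

```python
import pandas as pd

reg = pd.Series(['01/2016', '03/2017', '-/-'])

# errors='coerce' turns unparseable entries such as '-/-' into NaT
parsed = pd.to_datetime(reg, format='%m/%Y', errors='coerce')
year = parsed.dt.year
print(year.tolist())
```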

## First Registration column

In [40]:
auto['first_registration'] = auto['first_registration'].str[1].astype('float')

In [41]:
auto['first_registration'].value_counts(dropna=False)

2018.0    4520
2016.0    3674
2017.0    3273
2019.0    2851
NaN       1597
Name: first_registration, dtype: int64

In [42]:
auto['year'].value_counts(dropna=False)

2018.0    4520
2016.0    3674
2017.0    3273
2019.0    2851
NaN       1597
Name: year, dtype: int64

In [43]:
auto[(auto['year'].isnull())]['first_registration'].value_counts(dropna=False)

NaN    1597
Name: first_registration, dtype: int64

In [44]:
auto.drop('first_registration',axis=1, inplace=True)

* The year and 'First Registration' columns are the same, so 'First Registration' was dropped.

## prev_owner and previous_owners columns

In [45]:
auto.prev_owner.value_counts(dropna=False)

1 previous owner     8294
NaN                  6824
2 previous owners     778
3 previous owners      17
4 previous owners       2
Name: prev_owner, dtype: int64

In [46]:
auto.prev_owner = auto.prev_owner.str[0].astype('float')

In [47]:
auto.prev_owner.value_counts(dropna=False)

1.0    8294
NaN    6824
2.0     778
3.0      17
4.0       2
Name: prev_owner, dtype: int64

### previous_owners column

In [48]:
auto['previous_owners'] = auto['previous_owners'].astype('str').str.findall('\d+').str[0].astype(float)

In [49]:
auto['previous_owners'].value_counts(dropna=False)

1.0    8294
NaN    6636
2.0     778
0.0     188
3.0      17
4.0       2
Name: previous_owners, dtype: int64

* These two columns are combined with the apply method.

In [50]:
def prev_owner_combine(p1,p2):
    # Fall back to whichever value is present; flag rows where both exist but disagree
    if pd.isnull(p1):
        return p2
    elif pd.isnull(p2) or p1 == p2:
        return p1
    else:
        return 'conflict'

In [51]:
auto['prev_owner'] = auto.apply(lambda x: prev_owner_combine(x['prev_owner'],x['previous_owners']), axis=1)

In [52]:
auto['prev_owner'].value_counts(dropna=False)

1.0    8294
NaN    6636
2.0     778
0.0     188
3.0      17
4.0       2
Name: prev_owner, dtype: int64

In [53]:
auto.drop(['previous_owners'],axis=1, inplace=True)

* prev_owner and Previous Owners columns were combined and Previous Owners column was dropped.
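
For comparison, pandas offers `combine_first` for this kind of merge. The sketch below runs on toy values; unlike `prev_owner_combine` it silently prefers the first column when both are present, so the disagreement check is done separately:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'prev_owner':      [1.0, np.nan, 2.0, np.nan],
    'previous_owners': [1.0, 0.0, np.nan, np.nan],
})

# Take prev_owner where present, otherwise fall back to previous_owners
combined = toy['prev_owner'].combine_first(toy['previous_owners'])

# Rows where both columns hold a value but disagree (none in this toy data)
both_present = toy['prev_owner'].notna() & toy['previous_owners'].notna()
conflicts = both_present & (toy['prev_owner'] != toy['previous_owners'])
print(combined.tolist(), conflicts.sum())
```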

## kW column

In [54]:
auto.drop('kw',axis=1,inplace=True)

* Since there is no meaningful data in kW column it was dropped.

## hp column

In [55]:
auto.hp = auto.hp.str.findall('\d+').str[0].astype('float')

In [56]:
#auto.hp = auto.hp.str[:-3].replace('-',np.nan).astype('float')

## type column

In [57]:
auto['new_used'] = auto.type.str[1]

In [58]:
auto.new_used.value_counts(dropna=False)

Used              11094
New                1648
Pre-registered     1364
Employee's car     1011
Demonstration       796
NaN                   2
Name: new_used, dtype: int64

* new_used column was created. It records whether the car is new, used, pre-registered, a demonstration car, or an employee's car.

* The type column also included fuel type information.

In [59]:
auto['fuel_type'] = auto.type.str[3]

In [60]:
benzine = auto.type.str[3].str.contains('Benzine', na=False, regex=True)

In [61]:
auto.loc[benzine,'fuel_type'] = 'benzine'

In [62]:
# named super_fuel so the built-in super() is not shadowed
super_fuel = auto.type.str[3].str.contains('Super', na=False, regex=True)

In [63]:
gasoline = auto.type.str[3].str.contains('Gasoline', na=False, regex=True)

In [64]:
auto.loc[super_fuel,'fuel_type'] = 'benzine'

In [65]:
auto.loc[gasoline,'fuel_type'] = 'benzine'

In [66]:
diesel = auto['fuel_type'].isin(['Diesel (Particulate Filter)', 'Diesel'])

In [67]:
auto.loc[diesel,'fuel_type'] = 'diesel'

In [68]:
gas = auto['fuel_type'].isin(['LPG','Liquid petroleum gas (LPG)',\
                              'CNG','CNG (Particulate Filter)',\
                              'Biogas','Domestic gas H'])
          

In [69]:
auto.loc[gas,'fuel_type'] = 'gas'

In [70]:
others = auto['fuel_type'].isin(['Others', 'Others (Particulate Filter)', 'Electric'])

In [71]:
auto.loc[others,'fuel_type'] = 'others'

In [72]:
auto.drop('type',axis=1, inplace=True)

* fuel_type column was created.
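
The chain of masks above can be condensed into a single keyword lookup. This is only a sketch: the sample strings, the `rules` list and the helper `group_fuel` are hypothetical, and unlike the cells above it sends every unmatched label to `'others'`:

```python
import numpy as np
import pandas as pd

# Made-up raw labels in the style of the scraped fuel text
raw_fuel = pd.Series(['Diesel (Particulate Filter)', 'Gasoline', 'Super 95',
                      'Regular/Benzine 91', 'LPG', 'Electric', np.nan])

# Ordered (keyword, group) rules; the first keyword found in the label wins
rules = [('Diesel', 'diesel'), ('Benzine', 'benzine'), ('Super', 'benzine'),
         ('Gasoline', 'benzine'), ('LPG', 'gas'), ('CNG', 'gas'), ('gas', 'gas')]

def group_fuel(value):
    # Keep missing values missing; anything unmatched becomes 'others'
    if pd.isnull(value):
        return np.nan
    for keyword, group in rules:
        if keyword in value:
            return group
    return 'others'

fuel_type = raw_fuel.apply(group_fuel)
print(fuel_type.tolist())
```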

## fuel column

In [73]:
fuel = auto['fuel'].str[1]

In [74]:
benzine = fuel.str.contains('Benzine')

In [75]:
fuel[benzine] = 'benzine'

In [76]:
super_fuel = fuel.str.contains('Super')

In [77]:
fuel[super_fuel] = 'benzine'

In [78]:
gasoline = fuel.str.contains('Gasoline')

In [79]:
fuel[gasoline] = 'benzine'

In [80]:
diesel = fuel.str.contains('Diesel')

In [81]:
fuel[diesel] = 'diesel'

In [82]:
gas = fuel.isin(['LPG','Liquid petroleum gas (LPG)',\
                              'CNG','CNG (Particulate Filter)',\
                              'Biogas','Domestic gas H'])

In [83]:
fuel[gas] = 'gas'

In [84]:
fuel.value_counts(dropna=False)

benzine                        8546
diesel                         7298
gas                              64
Others                            5
Electric                          1
Others (Particulate Filter)       1
Name: fuel, dtype: int64

In [85]:
auto.drop('fuel', axis=1, inplace=True)

* fuel column fully matches the fuel_type column. This was checked, and then fuel was dropped.

## 'Next Inspection', 'Inspection new' columns

In [86]:
list_inspection = []
for i in auto["next_inspection"]:
    if type(i) == float:
        list_inspection.append(i)
    elif type(i) == list:
        list_inspection.append(i[0].strip())
    else:
        list_inspection.append(i.replace("\n",""))

In [87]:
auto['next_inspection_date'] = list_inspection

In [88]:
auto['next_inspection_date'] = pd.to_datetime(auto['next_inspection_date'])

* Next inspection date column was created.

* The next_inspection column also included car emission data, but this data was messy, with emission classes and emission values mixed together, so it was not used.

In [89]:
#auto['car_emission'] = auto['next_inspection'].str[1].str.replace('\n','').str[:-16]

In [90]:
#auto['car_emission'] = auto['car_emission'].replace('',np.nan)

In [91]:
#auto['car_emission'] = auto['car_emission'].replace('0 k',0)

In [92]:
auto.drop('next_inspection', axis=1, inplace=True)

## 'Inspection new' column

In [93]:
def inspection(a):
    if type(a) == list:
        return a[0].replace('\n', '')
    elif type(a) == str:
        return a.replace('\n', '')
    else:
        return a

In [94]:
auto['inspection_new'] = auto['inspection_new'].apply(inspection)

* Inspection new: The other parts of this column show fuel consumption. Since the same data appears in the consumption column below, this part was not extracted.

In [95]:
#auto['fuel_cons_comb'] = auto['inspection_new'].str[2].str[:-16]

#auto['fuel_cons_comb'] = auto['fuel_cons_comb'].replace('',np.nan).astype('float')

In [96]:
#auto['fuel_cons_city'] = auto['inspection_new'].str[4].str[:-16]

In [97]:
#auto['fuel_cons_city'] = auto['fuel_cons_city'].replace('',np.nan).astype('float')

In [98]:
#auto['fuel_cons_country'] = auto['inspection_new'].str[6].str[:-16]

#auto['fuel_cons_country'] = auto['fuel_cons_city'].replace('',np.nan).astype('float')

In [99]:
#auto['car_emission'] = auto['car_emission'].astype('float')

## co2_emission column

In [100]:
auto['co2_emission'] = auto['co2_emission'].str[0].str.findall("\d+").str[0].astype('float')

## emission class column

In [101]:
auto['emission_class'] = auto['emission_class'].str[0].str.replace('\n','')

In [102]:
auto['emission_class'].value_counts(dropna=False)

Euro 6          10137
NaN              3627
Euro 6d-TEMP     1844
Euro 6c           127
Euro 5             78
Euro 6d            62
Euro 4             40
Name: emission_class, dtype: int64

In [103]:
auto['emission_class'] = auto['emission_class'].replace(['Euro 6','Euro 6d-TEMP','Euro 6d', 'Euro 6c'], 'Euro 6')

## Consumption

In [104]:
def consume_combined(a):
    if type(a) == list:
        if len(a) >3:
            for i in a:
                if 'comb' in i:
                    return i
        else:
            return a[0]            
    
    else:
        return a
    
auto['consumption_comb'] = auto['consumption'].apply(consume_combined)

In [105]:
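def cleaning_consumption(a):
    # Pull the first number (e.g. "3.8" or "10.5") out of a list entry or a string
    if type(a) == list:
        if len(a) == 0:
            return np.nan
        a = a[0]
    if type(a) == str:
        # \d+ keeps multi-digit values such as 10.5 intact
        b = re.findall(r"\d+\.?\d*", a)
        return b[0] if b else np.nan
    else:
        return a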

In [106]:
def consume_city(a):
    if type(a) == list:
        if len(a) >3:
            for i in a:
                if 'city' in i:
                    return i
        else:
            return a[1]           
    
    else:
        return a
    
auto['consumption_city'] = auto['consumption'].apply(consume_city)


In [107]:
def consume_country(a):
    if type(a)== list:
        if len(a) >3:
            for i in a:
                if 'country' in i:
                    return i
        else:
            return a[2]            
    
    else:
        return a
    
auto['consumption_country'] = auto['consumption'].apply(consume_country)

In [108]:

auto['consumption_comb'] = auto['consumption_comb'].apply(cleaning_consumption).astype('float')
auto['consumption_city'] = auto['consumption_city'].apply(cleaning_consumption).astype('float')
auto['consumption_country'] = auto['consumption_country'].apply(cleaning_consumption).astype('float')

auto.drop("consumption",axis=1,inplace=True)

## Warranty column

In [109]:
def clean_warranty(a):
    if type(a) == list:
        b = re.findall(r'\d+', a[0])
        if len(b)== 0:
            return np.nan
        else:
            return b[0]
    elif type(a) ==str:
        b = re.findall(r'\d+', a)
        if len(b)== 0:
            return np.nan
        else:
            return b[0]
    else:
        return a

In [110]:
auto['warranty'] = auto['warranty'].apply(clean_warranty)

In [111]:
auto['warranty'] = auto['warranty'].astype('float')

* Only the number of months was extracted from the warranty column; the other parts of the column are meaningless.

## Full Service column

In [112]:
auto.drop('full_service', axis=1, inplace=True)

* Since there is no meaningful data in this column it was dropped.

## Non-smoking Vehicle column

In [113]:
auto['non-smoking_vehicle'].str[0].value_counts()

\n    7177
Name: non-smoking_vehicle, dtype: int64

In [114]:
auto.drop('non-smoking_vehicle', axis=1, inplace=True)

* Since there is no meaningful data in this column it was dropped.

## 'null' column

In [115]:
auto.drop('null', axis=1, inplace=True)

* Since there is no meaningful data in null column it was dropped.

## Offer Number column

In [116]:
auto.drop('offer_number',axis=1,inplace=True)

* Offer number column includes only the ids of the listings. It was dropped.

## 'Body Color' column

In [117]:
auto['body_color'] = auto['body_color'].str[1]

## 'Paint Type' column

In [118]:
auto['paint_type'] = auto['paint_type'].str[0].str[1:-1]

## body_color_original column

In [119]:
auto['body_color_original'] = auto['body_color_original'].str[0].str[1:-1]

In [120]:
auto['body_color_original'].isnull().sum()

3757

In [121]:
auto.drop('body_color_original',axis=1,inplace=True)

* Color names in this column were in different languages and did not match with body_color. It was dropped. 

## upholstery column

In [122]:
auto['upholstery_material'] = auto['upholstery'].str[0].str.replace('\n','').str.split(', ').str[0]

In [123]:
list_color = ['Black','Grey','Brown','Beige', 'Blue', 'White']
for i in list_color:
    auto['upholstery_material'] = auto['upholstery_material'].replace(i,np.nan)

In [124]:
auto['upholstery_color'] = auto['upholstery'].str[0].str.replace('\n','').str.replace(', ','')

In [125]:
list_uph_mat = ['Cloth', 'Part leather', 'Full leather', 'Other', 'Velour', 'alcantara']
for i in list_uph_mat:
    auto['upholstery_color'] = auto['upholstery_color'].str.replace(i,'')

In [126]:
auto['upholstery_color'] = auto['upholstery_color'].replace('',np.nan)

* upholstery column was cleaned and split into two new columns, upholstery_color and upholstery_material. The upholstery column was then dropped.

In [127]:
auto.drop('upholstery', axis=1,inplace=True)
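
The material/color split can also be sketched by matching each comma-separated token against the two known vocabularies. The sample strings are made up (assuming the cleaned values look like `'Cloth, Black'`), and `pick` is a hypothetical helper:

```python
import numpy as np
import pandas as pd

uph = pd.Series(['Cloth, Black', 'Part leather, Grey', 'Black', 'Full leather', np.nan])

materials = ['Cloth', 'Part leather', 'Full leather', 'Other', 'Velour', 'alcantara']
colors = ['Black', 'Grey', 'Brown', 'Beige', 'Blue', 'White']

parts = uph.str.split(', ')   # NaN stays NaN, strings become lists of tokens

def pick(tokens, allowed):
    # Return the first token found in `allowed`, or NaN when there is none
    if not isinstance(tokens, list):
        return np.nan
    return next((t for t in tokens if t in allowed), np.nan)

material = parts.apply(lambda t: pick(t, materials))
color = parts.apply(lambda t: pick(t, colors))
print(pd.DataFrame({'material': material, 'color': color}))
```

Matching against explicit lists keeps a color-only entry such as `'Black'` out of the material column without a separate clean-up pass.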

## nr_of_doors column

In [128]:
auto['nr_of_doors'] = auto['nr_of_doors'].str[0].str.replace('\n','').astype(float)

## nr_of_seats column

In [129]:
auto['nr_of_seats'] = auto['nr_of_seats'].str[0].str.replace('\n','').astype(float)

## model_code column

In [130]:
auto['model_code'] = auto['model_code'].str[0].str.replace('\n','')

In [131]:
auto.drop('model_code',axis=1,inplace=True)

* Since make_model has no null values, this column adds no information and was dropped.

## gearing_type column

In [132]:
auto['gearing_type'] = auto['gearing_type'].str[1]

## cylinders column

In [133]:
auto['cylinders'] = auto['cylinders'].str[0].str.replace('\n','')

## weight column

In [134]:
auto['weight'] = auto['weight'].str[0].str.replace(',','').str.findall('\d+').str[0].astype('float')

## drive_chain column

In [135]:
auto['drive_chain'] = auto['drive_chain'].str[0].str.replace('\n','')

## comfort_convenience column

In [136]:
auto['comfort_convenience'] = auto['comfort_convenience'].astype('str').str.replace('[','').str.replace("]",'')

* This column was only converted to a string with the brackets removed; it will be transformed with the get_dummies function later.

## entertainment_media column

In [137]:
auto['entertainment_media'] = auto['entertainment_media'].astype('str').str.replace('[','').str.replace("]",'')

* This column was only converted to a string with the brackets removed; it will be transformed with the get_dummies function later.

## extras column

In [138]:
auto['extras'] = auto['extras'].astype('str').str.replace('[','').str.replace("]",'')

* This column was only converted to a string with the brackets removed; it will be transformed with the get_dummies function later.

## safety_security column

In [139]:
auto['safety_security'] = auto['safety_security'].astype('str').str.replace('[','').str.replace("]",'')

* This column was only converted to a string with the brackets removed; it will be transformed with the get_dummies function later.
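
A rough sketch of how such a column could later be expanded into indicator columns with `Series.str.get_dummies`. The item names are invented, and the assumption is that each cell holds quoted items separated by `', '`; real cells may need extra cleaning (for example leftover newline characters) first:

```python
import pandas as pd

# Toy strings: quoted items separated by ', ', plus the text 'nan' for a missing list
extras = pd.Series(["'Alloy wheels', 'Roof rack'", "'Alloy wheels'", "nan"])

dummies = (
    extras.replace('nan', '')              # astype('str') turns missing values into the text 'nan'
          .str.replace("'", '', regex=False)  # drop the quotes left over from the list repr
          .str.get_dummies(sep=', ')          # one 0/1 column per distinct item
)
print(dummies)
```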

## description column

In [140]:
auto.drop('description',axis=1,inplace=True)

* This column was dropped since it contains German descriptions of the cars written by users.

## emission_label column

In [141]:
auto['emission_label'] = auto['emission_label'].str[0].str.findall('\((.*?)\)').str[0]

In [142]:
auto['emission_label'].value_counts(dropna=False)

NaN           11970
Green          3553
No sticker      381
Blue              8
Yellow            2
Red               1
Name: emission_label, dtype: int64

In [143]:
auto.drop('emission_label',axis=1,inplace=True)

* This column was dropped since its information was not considered important for price.

## gears column

In [144]:
auto['gears'] = auto['gears'].str[0].str.findall("\d+").str[0].astype('float')

## country_version column

In [145]:
auto['country_version'] = auto['country_version'].str[0].str.replace('\n','')

In [146]:
auto.drop('country_version',axis=1,inplace=True)

* This column was dropped since it had many null values that are difficult to fill.

## electricity_consumption column

In [147]:
auto.loc[auto['electricity_consumption'].isnull()==False, 'electricity_consumption'] = 1

In [148]:
auto.loc[auto['electricity_consumption'].isnull()==True, 'electricity_consumption'] = 0

In [149]:
auto['electricity_consumption'].value_counts(dropna=False)

0    15778
1      137
Name: electricity_consumption, dtype: int64
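
The two `.loc` assignments above amount to a presence flag, which can be sketched in one step. The toy cell content is invented; only whether a value exists matters here:

```python
import numpy as np
import pandas as pd

# Toy column: most rows are missing, one holds some scraped list
toy = pd.Series([np.nan, ['\n0 kWh/100 km (comb)\n'], np.nan])

# 1 where any value is present, 0 where it is missing
flag = toy.notna().astype(int)
print(flag.tolist())
```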

## last_service_date column

In [150]:
auto['last_service_date'] = pd.to_datetime(auto['last_service_date'].str[0].str.replace('\n','').replace('',np.nan))

In [151]:
auto.drop('last_service_date',axis=1,inplace=True)

* This column was dropped since 96.8% is null.
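
A null share like the one quoted above can be computed per column with `isnull().mean()`. The sketch uses toy data, and the 70% cutoff is an arbitrary assumption for illustration, not a threshold used in this project:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'last_service_date': [np.nan, '2019-01-01', np.nan, np.nan],
    'price': [15770, 14500, 14640, 11000],
})

# Percentage of missing values per column
null_share = toy.isnull().mean().mul(100).round(1)

# Columns above the (assumed) cutoff are candidates for dropping
sparse_cols = null_share[null_share > 70].index.tolist()
print(null_share)
print(sparse_cols)
```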

## other_fuel_types column

In [152]:
auto.drop('other_fuel_types',axis=1,inplace=True)

* Since there is no meaningful data in this column it was dropped.

## availability column

In [153]:
auto['availability'] = auto['availability'].str.findall('\d+')

In [154]:
auto.drop('availability',axis=1,inplace=True)

* Since there is no meaningful data in this column it was dropped.

## last_timing_belt_service_date column

In [155]:
auto.drop('last_timing_belt_service_date',axis=1,inplace=True)

* Since there is no meaningful data in this column it was dropped.

## available_from column

In [156]:
auto.drop('available_from',axis=1,inplace=True)

* Since there is no meaningful data in this column it was dropped.

## Clean data was saved to a csv file called "auto_scout_clean.csv"

In [157]:
auto.to_csv("auto_scout_clean.csv", index=False)