## Introduction
The AutoScout Data Analysis Project examines the factors that affect used-car prices. Its goal is to prepare the data needed for a good price prediction model and, ultimately, to make good price predictions. The dataset comes from a company that sells cars online; it covers 9 different car models and is heterogeneous and messy. The work can be divided into 3 stages.

The project consists of 3 parts:
* The first part covers 'data cleaning': incorrect headers, incorrect formats, anomalies, and dropping useless columns.
* The second part covers 'filling data': missing values are handled and categorical features are converted to numeric ones.
* The third part covers 'handling outliers of data' with the help of visualisation libraries, and some insights are extracted.

In this project, we use Python libraries such as NumPy, pandas, Matplotlib, Seaborn, and SciPy both to clean the dataset and to analyse the cleaned result.

In [1]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re

In [2]:
import pandas as pd
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 5)
pd.set_option('display.max_colwidth', 30)

In [3]:
df = pd.read_json('scout_car.json', lines=True, orient='records', convert_axes=True, dtype=True,convert_dates=True)
df.head(3)

Unnamed: 0,url,make_model,short_description,body_type,price,vat,km,registration,prev_owner,kW,hp,Type,Previous Owners,Next Inspection,Inspection new,Warranty,Full Service,Non-smoking Vehicle,null,Make,Model,Offer Number,First Registration,Body Color,Paint Type,Body Color Original,Upholstery,Body,Nr. of Doors,Nr. of Seats,Model Code,Gearing Type,Displacement,Cylinders,Weight,Drive chain,Fuel,Consumption,CO2 Emission,Emission Class,\nComfort & Convenience\n,\nEntertainment & Media\n,\nExtras\n,\nSafety & Security\n,description,Emission Label,Gears,Country version,Electricity consumption,Last Service Date,Other Fuel Types,Availability,Last Timing Belt Service Date,Available from
0,https://www.autoscout24.co...,Audi A1,Sportback 1.4 TDI S-tronic...,Sedans,15770,VAT deductible,"56,013 km",01/2016,2 previous owners,,66 kW,"[, Used, , Diesel (Particu...",\n2\n,"[\n06/2021\n, \n99 g CO2/k...","[\nYes\n, \nEuro 6\n]","[\n, \n, \n4 (Green)\n]","[\n, \n]","[\n, \n]",[],\nAudi\n,"[\n, A1, \n]",[\nLR-062483\n],"[\n, 2016, \n]","[\n, Black, \n]",[\nMetallic\n],[\nMythosschwarz\n],"[\nCloth, Black\n]","[\n, Sedans, \n]",[\n5\n],[\n5\n],[\n0588/BDF\n],"[\n, Automatic, \n]","[\n1,422 cc\n]",[\n3\n],"[\n1,220 kg\n]",[\nfront\n],"[\n, Diesel (Particulate F...","[[3.8 l/100 km (comb)], [4...",[\n99 g CO2/km (comb)\n],[\nEuro 6\n],"[Air conditioning, Armrest...","[Bluetooth, Hands-free equ...","[Alloy wheels, Catalytic C...","[ABS, Central door lock, D...","[\n, Sicherheit:, , Deakt...",,,,,,,,,
1,https://www.autoscout24.co...,Audi A1,1.8 TFSI sport,Sedans,14500,Price negotiable,"80,000 km",03/2017,,,141 kW,"[, Used, , Gasoline]",,,,,,,[],\nAudi\n,"[\n, A1, \n]",,"[\n, 2017, \n]","[\n, Red, \n]",,,"[\nCloth, Grey\n]","[\n, Sedans, \n]",[\n3\n],[\n4\n],[\n0588/BCY\n],"[\n, Automatic, \n]","[\n1,798 cc\n]",[\n4\n],"[\n1,255 kg\n]",[\nfront\n],"[\n, Gasoline, \n]","[[5.6 l/100 km (comb)], [7...",[\n129 g CO2/km (comb)\n],[\nEuro 6\n],"[Air conditioning, Automat...","[Bluetooth, Hands-free equ...","[Alloy wheels, Sport seats...","[ABS, Central door lock, C...",[\nLangstreckenfahrzeug da...,[\n4 (Green)\n],[\n7\n],,,,,,,
2,https://www.autoscout24.co...,Audi A1,Sportback 1.6 TDI S tronic...,Sedans,14640,VAT deductible,"83,450 km",02/2016,1 previous owner,,85 kW,"[, Used, , Diesel (Particu...",\n1\n,,,"[\n, \n, \n99 g CO2/km (co...",,,[],\nAudi\n,"[\n, A1, \n]",[\nAM-95365\n],"[\n, 2016, \n]","[\n, Black, \n]",[\nMetallic\n],[\nmythosschwarz metallic\n],"[\nCloth, Black\n]","[\n, Sedans, \n]",[\n4\n],[\n4\n],,"[\n, Automatic, \n]","[\n1,598 cc\n]",,,[\nfront\n],"[\n, Diesel (Particulate F...","[[3.8 l/100 km (comb)], [4...",[\n99 g CO2/km (comb)\n],[\nEuro 6\n],"[Air conditioning, Cruise ...","[MP3, On-board computer]","[Alloy wheels, Voice Control]","[ABS, Central door lock, D...","[\n, Fahrzeug-Nummer: AM-9...",[\n4 (Green)\n],,,,,,,,


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 54 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   url                            15919 non-null  object 
 1   make_model                     15919 non-null  object 
 2   short_description              15873 non-null  object 
 3   body_type                      15859 non-null  object 
 4   price                          15919 non-null  int64  
 5   vat                            11406 non-null  object 
 6   km                             15919 non-null  object 
 7   registration                   15919 non-null  object 
 8   prev_owner                     9091 non-null   object 
 9   kW                             0 non-null      float64
 10  hp                             15919 non-null  object 
 11  Type                           15917 non-null  object 
 12  Previous Owners                9279 non-null  

# Part-1: Data Cleaning

### Before examining the dataset, we first create an empty dataframe. We will examine each feature in turn, add it to this dataframe, and carry out our operations on it.

In [5]:
data  = pd.DataFrame()

### We will examine the columns of the dataset one by one. We start by splitting the make_model column into two parts.

In [6]:
temp = df["make_model"].copy()

In [7]:
data["make"] = temp.str.split(' ', expand=True)[0]

In [8]:
data["model"] = temp.str.split(' ', expand=True)[1]
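
The split above can also be done in one pass. The following is a minimal sketch on a small hypothetical sample (the `sample` and `parts` names are illustrative, not from the notebook); it assumes the make is always the first space-separated token.

```python
import pandas as pd

# Hypothetical sample mirroring the make_model column.
sample = pd.Series(["Audi A1", "Opel Astra", "Renault Espace"])

# Split once on the first space so both parts come from a single call.
# With n=1, any further spaces stay inside the second (model) part.
parts = sample.str.split(" ", n=1, expand=True)
parts.columns = ["make", "model"]
print(parts)
```

Splitting once avoids running the same `str.split` twice and keeps both new columns aligned by construction.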

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   make    15919 non-null  object
 1   model   15919 non-null  object
dtypes: object(2)
memory usage: 248.9+ KB


### We drop the url and make_model columns from the dataset. The url has no effect on price, and make_model is used in its split form (make and model).

In [22]:
df.drop(["url","make_model"], axis=1).head(2)  # preview only; these columns will be dropped.

Unnamed: 0,short_description,body_type,price,vat,km,registration,prev_owner,kW,hp,Type,Previous Owners,Next Inspection,Inspection new,Warranty,Full Service,Non-smoking Vehicle,null,Make,Model,Offer Number,First Registration,Body Color,Paint Type,Body Color Original,Upholstery,Body,Nr. of Doors,Nr. of Seats,Model Code,Gearing Type,Displacement,Cylinders,Weight,Drive chain,Fuel,Consumption,CO2 Emission,Emission Class,\nComfort & Convenience\n,\nEntertainment & Media\n,\nExtras\n,\nSafety & Security\n,description,Emission Label,Gears,Country version,Electricity consumption,Last Service Date,Other Fuel Types,Availability,Last Timing Belt Service Date,Available from
0,Sportback 1.4 TDI S-tronic...,Sedans,15770,VAT deductible,"56,013 km",01/2016,2 previous owners,,66 kW,"[, Used, , Diesel (Particu...",\n2\n,"[\n06/2021\n, \n99 g CO2/k...","[\nYes\n, \nEuro 6\n]","[\n, \n, \n4 (Green)\n]","[\n, \n]","[\n, \n]",[],\nAudi\n,"[\n, A1, \n]",[\nLR-062483\n],"[\n, 2016, \n]","[\n, Black, \n]",[\nMetallic\n],[\nMythosschwarz\n],"[\nCloth, Black\n]","[\n, Sedans, \n]",[\n5\n],[\n5\n],[\n0588/BDF\n],"[\n, Automatic, \n]","[\n1,422 cc\n]",[\n3\n],"[\n1,220 kg\n]",[\nfront\n],"[\n, Diesel (Particulate F...","[[3.8 l/100 km (comb)], [4...",[\n99 g CO2/km (comb)\n],[\nEuro 6\n],"[Air conditioning, Armrest...","[Bluetooth, Hands-free equ...","[Alloy wheels, Catalytic C...","[ABS, Central door lock, D...","[\n, Sicherheit:, , Deakt...",,,,,,,,,
1,1.8 TFSI sport,Sedans,14500,Price negotiable,"80,000 km",03/2017,,,141 kW,"[, Used, , Gasoline]",,,,,,,[],\nAudi\n,"[\n, A1, \n]",,"[\n, 2017, \n]","[\n, Red, \n]",,,"[\nCloth, Grey\n]","[\n, Sedans, \n]",[\n3\n],[\n4\n],[\n0588/BCY\n],"[\n, Automatic, \n]","[\n1,798 cc\n]",[\n4\n],"[\n1,255 kg\n]",[\nfront\n],"[\n, Gasoline, \n]","[[5.6 l/100 km (comb)], [7...",[\n129 g CO2/km (comb)\n],[\nEuro 6\n],"[Air conditioning, Automat...","[Bluetooth, Hands-free equ...","[Alloy wheels, Sport seats...","[ABS, Central door lock, C...",[\nLangstreckenfahrzeug da...,[\n4 (Green)\n],[\n7\n],,,,,,,


In [10]:
temp = df["body_type"].copy()  # 60 values are missing.

In [11]:
temp.value_counts(dropna=False)

Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
NaN                60
Off-Road           56
Coupe              25
Convertible         8
Name: body_type, dtype: int64

In [12]:
data["body_type"] = temp

In [13]:
data[data["body_type"].notna()]

Unnamed: 0,make,model,body_type
0,Audi,A1,Sedans
1,Audi,A1,Sedans
2,Audi,A1,Sedans
3,Audi,A1,Sedans
4,Audi,A1,Sedans
...,...,...,...
15914,Renault,Espace,Van
15915,Renault,Espace,Van
15916,Renault,Espace,Van
15917,Renault,Espace,Van


In [14]:
data[~data["body_type"].notna()]

Unnamed: 0,make,model,body_type
3175,Audi,A3,
3255,Audi,A3,
3975,Audi,A3,
3997,Audi,A3,
4206,Audi,A3,
4297,Audi,A3,
4298,Audi,A3,
5718,Opel,Astra,
5938,Opel,Astra,
5940,Opel,Astra,


In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   make       15919 non-null  object
 1   model      15919 non-null  object
 2   body_type  15859 non-null  object
dtypes: object(3)
memory usage: 373.2+ KB


### Let's examine short_description.

In [16]:
temp = df["short_description"].copy() 

In [17]:
temp.head(50)

0     Sportback 1.4 TDI S-tronic...
1                    1.8 TFSI sport
2     Sportback 1.6 TDI S tronic...
3           1.4 TDi Design S tronic
4     Sportback 1.4 TDI S-Tronic...
5     1.6 TDI Sport DSG *SHZ*Xen...
6     Sportback 1.6 TDI S-TRONIC...
7     Sportback 1.4 TDI admired ...
8     SPB 1.6 TDI S-tronic Metal...
9     SPORTBACK TFSI ULTRA 95 S-...
10    SPORTBACK1.6 TDI 116 CV S ...
11    Sportback Sport »1.4 TFSI|...
12    Sportback 1.4 TFSI S-troni...
13    1.4 TFSI 150ch COD Ambitio...
14    Sportback 1.0 TFSI S-TRoni...
15    SPB 1.6 TDI 116 CV S troni...
16    SPB 1.6 TDI 116 CV Design ...
17    1.4 TDi S tronic*S-Line*Na...
18    1.0 TFSI 95pk Automaat Adr...
19    Sportback 1.6 TDI 116 CV S...
20    SPB 1.6 TDI 116 CV S troni...
21    Sportback Sport 1.4 TFSI S...
22                1.4 TFSI S tronic
23    SPB 1.6 TDI 116 CV S troni...
24    SPB 1.0 TFSI ultra S troni...
25    1.0 TFSI ultra Sportback T...
26    1.0 TFSI *PDC*SHZ*Klimaaut...
27    Sportback 1.0 TFSI Att

In [18]:
data["cc"] = temp.str.extract(r"(\d\.\d)").astype("float")  # extracted this way, 5068 values are missing.

In [19]:
data["cc"].value_counts(dropna=False)

NaN    5068
1.6    3891
1.4    2535
1.0    1334
1.2     957
1.5     890
2.0     888
1.3     135
1.8      60
0.9      49
4.0      28
2.5      21
4.3       9
5.7       8
5.5       3
1.7       3
5.0       3
1.1       3
0.8       2
3.0       2
6.0       2
9.8       2
3.9       2
6.1       1
5.1       1
7.8       1
5.6       1
8.9       1
0.2       1
4.2       1
2.2       1
9.6       1
5.3       1
8.5       1
0.7       1
7.9       1
0.0       1
4.6       1
0.6       1
9.9       1
7.3       1
8.8       1
8.4       1
2.8       1
4.5       1
2.3       1
0.3       1
Name: cc, dtype: int64
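
The value counts above include entries such as 9.8 or 0.0 that are unlikely to be engine sizes, because the pattern matches any digit-dot-digit sequence in the description. As a rough sketch, such matches can be flagged for manual review. The sample strings, the `LOW`/`HIGH` bounds, and the `suspicious` name are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical descriptions; the last two are made up to show failure modes.
sample = pd.Series([
    "Sportback 1.4 TDI S-tronic",
    "1.8 TFSI sport",
    "SPORTBACK TFSI ULTRA 95",      # no d.d pattern, so the result is NaN
    "Sondermodell 9.8 (invented)",  # matches, but is not an engine size
])

# Same pattern as in the notebook, written as a raw string.
cc = sample.str.extract(r"(\d\.\d)", expand=False).astype("float")

# Assumed plausibility window in litres; adjust it to the data at hand.
LOW, HIGH = 0.8, 3.0
suspicious = cc.notna() & ~cc.between(LOW, HIGH)
print(sample[suspicious])
```

Rows flagged this way could be set to NaN or inspected by hand before the column is used for modelling.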

In [20]:
data[~data["cc"].notna()]

Unnamed: 0,make,model,body_type,cc
9,Audi,A1,Sedans,
29,Audi,A1,Compact,
48,Audi,A1,Compact,
55,Audi,A1,Compact,
63,Audi,A1,Sedans,
...,...,...,...,...
15914,Renault,Espace,Van,
15915,Renault,Espace,Van,
15916,Renault,Espace,Van,
15917,Renault,Espace,Van,


### Let's examine price.

In [21]:
temp = df["price"].copy() 

In [22]:
data["price"]=temp

In [23]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   make       15919 non-null  object 
 1   model      15919 non-null  object 
 2   body_type  15859 non-null  object 
 3   cc         10851 non-null  float64
 4   price      15919 non-null  int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 622.0+ KB


### Let's examine vat.

In [24]:
temp = df["vat"].copy() 

In [26]:
data["vat"] = temp

In [27]:
data

Unnamed: 0,make,model,body_type,cc,price,vat
0,Audi,A1,Sedans,1.4,15770,VAT deductible
1,Audi,A1,Sedans,1.8,14500,Price negotiable
2,Audi,A1,Sedans,1.6,14640,VAT deductible
3,Audi,A1,Sedans,1.4,14500,
4,Audi,A1,Sedans,1.4,16790,
...,...,...,...,...,...,...
15914,Renault,Espace,Van,,39950,VAT deductible
15915,Renault,Espace,Van,,39885,VAT deductible
15916,Renault,Espace,Van,,39875,VAT deductible
15917,Renault,Espace,Van,,39700,VAT deductible


### Let's examine km.

In [41]:
temp = df["km"].copy() 

In [42]:
temp

0        56,013 km
1        80,000 km
2        83,450 km
3        73,000 km
4        16,200 km
           ...    
15914         - km
15915     9,900 km
15916        15 km
15917        10 km
15918         - km
Name: km, Length: 15919, dtype: object

In [43]:
data["km"] = temp.str.replace(r'\D+', '', regex=True)

In [48]:
data[data["km"]==""]   # 1024 values are "" (empty strings).

Unnamed: 0,make,model,body_type,cc,price,vat,km
743,Audi,A1,Sedans,,25900,,
869,Audi,A1,Sedans,1.0,21300,VAT deductible,
946,Audi,A1,Compact,,21406,,
977,Audi,A1,Compact,1.0,21200,VAT deductible,
980,Audi,A1,Compact,,21100,,
...,...,...,...,...,...,...,...
15890,Renault,Espace,Station wagon,,42490,VAT deductible,
15902,Renault,Espace,Sedans,,41043,VAT deductible,
15912,Renault,Espace,Van,,39950,VAT deductible,
15914,Renault,Espace,Van,,39950,VAT deductible,


In [51]:
data.km.value_counts(dropna=False)

10       1045
         1024
1         367
5         170
50        148
         ... 
61109       1
30980       1
69035       1
13093       1
81295       1
Name: km, Length: 6690, dtype: int64
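
After the digits are extracted, the km column still holds strings, and the "- km" entries are empty strings rather than NaN. One possible follow-up, sketched here on a hypothetical three-row sample, is to convert the column with `pd.to_numeric` so that empty strings become proper missing values.

```python
import pandas as pd

# Hypothetical sample in the same format as the km column.
sample = pd.Series(["56,013 km", "- km", "15 km"])

# Keep only the digits, as in the notebook; "- km" becomes "".
digits = sample.str.replace(r"\D+", "", regex=True)

# Convert to numbers; anything that cannot be parsed (such as "") becomes NaN.
km = pd.to_numeric(digits, errors="coerce")
print(km)
```

With a numeric dtype, the missing mileage values show up in `isna()` and `info()` like those of any other column.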

### Let's examine registration.

In [52]:
temp = df["registration"].copy() 

In [53]:
temp

0        01/2016
1        03/2017
2        02/2016
3        08/2016
4        05/2016
          ...   
15914        -/-
15915    01/2019
15916    03/2019
15917    06/2019
15918    01/2019
Name: registration, Length: 15919, dtype: object

In [56]:
data["registration"] = temp.str.extract(r'(\d{4})')

In [76]:
data["registration"] = data["registration"].astype("float")

In [77]:
data["registration"].value_counts(dropna=False)   # 1597 values are missing.

2018.0    4522
2016.0    3674
2017.0    3273
2019.0    2853
NaN       1597
Name: registration, dtype: int64
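
Casting to float turns the years into values like 2018.0. If whole-number years are preferred, pandas' nullable `Int64` dtype is one option. The sketch below uses a hypothetical sample, and the `year` name is illustrative.

```python
import pandas as pd

# Hypothetical sample in the same MM/YYYY format; "-/-" marks a missing date.
sample = pd.Series(["01/2016", "-/-", "03/2019"])

# Extract the four-digit year; rows without one become NaN.
year_text = sample.str.extract(r"(\d{4})", expand=False)

# Go through float first, then to nullable integers, which can hold <NA>.
year = pd.to_numeric(year_text, errors="coerce").astype("Int64")
print(year)
```

Whether `Int64` or float is the better choice depends on the later modelling steps, since some libraries expect plain NumPy dtypes.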

### Let's examine prev_owner.

In [59]:
temp = df["prev_owner"].copy()

In [60]:
temp.value_counts(dropna=False)

1 previous owner     8294
NaN                  6828
2 previous owners     778
3 previous owners      17
4 previous owners       2
Name: prev_owner, dtype: int64

In [61]:
data["prev_owner"] = temp.str.extract(r"(\d+)")

In [73]:
data["prev_owner"] = data["prev_owner"].astype("float")

In [74]:
data["prev_owner"].value_counts(dropna=False)   # 6828 values are missing.

1.0    8294
NaN    6828
2.0     778
3.0      17
4.0       2
Name: prev_owner, dtype: int64

### Let's examine hp.

In [64]:
temp = df["hp"].copy()

In [66]:
temp.value_counts(dropna=False)

85 kW     2542
66 kW     2122
81 kW     1402
100 kW    1308
110 kW    1112
70 kW      888
125 kW     707
51 kW      695
55 kW      569
118 kW     516
92 kW      466
121 kW     392
147 kW     380
77 kW      345
56 kW      286
54 kW      276
103 kW     253
87 kW      232
165 kW     194
88 kW      177
60 kW      160
162 kW      98
- kW        88
74 kW       81
96 kW       72
71 kW       59
101 kW      47
67 kW       40
154 kW      39
122 kW      35
119 kW      30
164 kW      27
135 kW      24
82 kW       22
52 kW       22
1 kW        20
78 kW       20
146 kW      18
294 kW      18
141 kW      16
57 kW       10
104 kW       8
120 kW       8
112 kW       7
191 kW       7
117 kW       6
155 kW       6
184 kW       5
65 kW        4
76 kW        4
90 kW        4
149 kW       3
80 kW        3
93 kW        3
168 kW       3
98 kW        3
140 kW       2
86 kW        2
167 kW       2
270 kW       2
143 kW       2
150 kW       2
89 kW        2
53 kW        2
63 kW        2
40 kW        2
228 kW    

In [67]:
data["power_hp"] = temp.str.extract(r"(\d+)")  # note: the raw values in this column are given in kW.

In [78]:
data["power_hp"] = data["power_hp"].astype("float")

In [82]:
data["power_hp"].value_counts(dropna=False)   # 88 values are missing.

85.0     2542
66.0     2122
81.0     1402
100.0    1308
110.0    1112
70.0      888
125.0     707
51.0      695
55.0      569
118.0     516
92.0      466
121.0     392
147.0     380
77.0      345
56.0      286
54.0      276
103.0     253
87.0      232
165.0     194
88.0      177
60.0      160
162.0      98
NaN        88
74.0       81
96.0       72
71.0       59
101.0      47
67.0       40
154.0      39
122.0      35
119.0      30
164.0      27
135.0      24
82.0       22
52.0       22
1.0        20
78.0       20
146.0      18
294.0      18
141.0      16
57.0       10
120.0       8
104.0       8
191.0       7
112.0       7
155.0       6
117.0       6
184.0       5
90.0        4
76.0        4
65.0        4
149.0       3
98.0        3
93.0        3
80.0        3
168.0       3
150.0       2
63.0        2
140.0       2
86.0        2
89.0        2
40.0        2
167.0       2
53.0        2
228.0       2
127.0       2
143.0       2
270.0       2
9.0         1
44.0        1
123.0       1
195.0 

### Let's examine Type.

In [86]:
temp = df["Type"].copy()   # vehicle condition and fuel type are extracted from this column.

In [88]:
temp.apply(pd.Series)[1]

0                  Used
1                  Used
2                  Used
3                  Used
4                  Used
              ...      
15914               New
15915              Used
15916    Pre-registered
15917    Pre-registered
15918     Demonstration
Name: 1, Length: 15919, dtype: object

In [89]:
data["vehicle_condition"] = temp.apply(pd.Series)[1]

In [91]:
data["vehicle_condition"].value_counts(dropna=False)  # 2 tane eksik değer var.

Used              11096
New                1650
Pre-registered     1364
Employee's car     1011
Demonstration       796
NaN                   2
Name: vehicle_condition, dtype: int64
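
`temp.apply(pd.Series)` expands every list into a full row, which can be slow on roughly 16,000 rows. As an alternative sketch, the `.str` accessor can index into list-valued cells directly. The sample below is hypothetical and assumes the same layout as the Type column, with the condition at position 1 and the fuel at position 3.

```python
import numpy as np
import pandas as pd

# Hypothetical sample shaped like the Type column: lists plus a missing row.
sample = pd.Series([
    ["", "Used", "", "Gasoline"],
    ["", "New", "", "Diesel"],
    np.nan,
])

# .str[i] picks the i-th element of each list; missing rows stay NaN.
condition = sample.str[1]
fuel = sample.str[3]
print(condition.tolist(), fuel.tolist())
```

This yields the same elements as `temp.apply(pd.Series)[1]` and `[3]` without building the intermediate wide dataframe.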

In [94]:
data[~data["vehicle_condition"].notna()]

Unnamed: 0,make,model,body_type,cc,price,vat,km,registration,prev_owner,power_hp,vehicle_condition
2765,Audi,A3,Sedans,2.0,17900,,115137.0,2016.0,,110.0,
5237,Audi,A3,Sedans,1.6,25400,,,,,85.0,


In [100]:
temp.apply(pd.Series)[3]  # 2 values are missing.

0          Diesel (Particulate Filter)
1                             Gasoline
2          Diesel (Particulate Filter)
3          Diesel (Particulate Filter)
4          Diesel (Particulate Filter)
                     ...              
15914      Diesel (Particulate Filter)
15915    Super 95 / Super Plus 98 (...
15916                           Diesel
15917                           Diesel
15918                         Super 95
Name: 3, Length: 15919, dtype: object

In [101]:
data["fuel_1"] = temp.apply(pd.Series)[3]

In [102]:
data["fuel_1"].value_counts(dropna=False)

Diesel (Particulate Filter)                                                                                                       4315
Super 95                                                                                                                          3338
Gasoline                                                                                                                          3175
Diesel                                                                                                                            2982
Super 95 / Regular/Benzine 91                                                                                                      424
Regular/Benzine 91                                                                                                                 354
Super E10 95                                                                                                                       331
Super 95 (Particulate Filter)                          

In [103]:
data

Unnamed: 0,make,model,body_type,cc,price,vat,km,registration,prev_owner,power_hp,vehicle_condition,fuel_1
0,Audi,A1,Sedans,1.4,15770,VAT deductible,56013,2016.0,2.0,66.0,Used,Diesel (Particulate Filter)
1,Audi,A1,Sedans,1.8,14500,Price negotiable,80000,2017.0,,141.0,Used,Gasoline
2,Audi,A1,Sedans,1.6,14640,VAT deductible,83450,2016.0,1.0,85.0,Used,Diesel (Particulate Filter)
3,Audi,A1,Sedans,1.4,14500,,73000,2016.0,1.0,66.0,Used,Diesel (Particulate Filter)
4,Audi,A1,Sedans,1.4,16790,,16200,2016.0,1.0,66.0,Used,Diesel (Particulate Filter)
...,...,...,...,...,...,...,...,...,...,...,...,...
15914,Renault,Espace,Van,,39950,VAT deductible,,,,147.0,New,Diesel (Particulate Filter)
15915,Renault,Espace,Van,,39885,VAT deductible,9900,2019.0,1.0,165.0,Used,Super 95 / Super Plus 98 (...
15916,Renault,Espace,Van,,39875,VAT deductible,15,2019.0,1.0,146.0,Pre-registered,Diesel
15917,Renault,Espace,Van,,39700,VAT deductible,10,2019.0,,147.0,Pre-registered,Diesel


### Let's examine Previous Owners.

In [105]:
temp = df["Previous Owners"].copy()   # unlike prev_owner above, this column contains 0 values. Based on this, we will drop one of the two columns.

In [111]:
data["previous_owners"]= temp.apply(pd.Series)[0].str.extract(r"(\d+)")

In [114]:
data["previous_owners"].value_counts(dropna=False)

1      8294
NaN    6640
2       778
0       188
3        17
4         2
Name: previous_owners, dtype: int64

In [115]:
data[data["previous_owners"]=="0"]

Unnamed: 0,make,model,body_type,cc,price,vat,km,registration,prev_owner,power_hp,vehicle_condition,fuel_1,previous_owners
47,Audi,A1,Sedans,1.6,11790,,60000,2016.0,,85.0,Used,Diesel,0
418,Audi,A1,Sedans,1.6,15900,,58000,2016.0,,66.0,Used,Diesel,0
586,Audi,A1,Sedans,1.0,13500,,50707,2016.0,,70.0,Used,Gasoline,0
648,Audi,A1,Sedans,1.4,12900,,64000,2016.0,,66.0,Used,Diesel (Particulate Filter),0
734,Audi,A1,Sedans,,30000,,0,,,85.0,New,Gasoline,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15301,Renault,Espace,Van,,37290,,,,,146.0,New,Diesel,0
15408,Renault,Espace,Van,,24900,,91883,2016.0,,118.0,Used,Diesel,0
15668,Renault,Espace,Van,,39290,,0,,,118.0,New,Diesel,0
15853,Renault,Espace,Van,2.0,43950,VAT deductible,100,,,147.0,New,Diesel,0
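
To decide which of the two owner columns to keep, it helps to check whether they ever disagree and which one carries extra information. The following is a small sketch on hypothetical data; the `demo` frame and the mask names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: prev_owner is float, previous_owners is still text.
demo = pd.DataFrame({
    "prev_owner": [2.0, np.nan, 1.0, np.nan],
    "previous_owners": ["2", "0", "1", np.nan],
})

prev_b = pd.to_numeric(demo["previous_owners"], errors="coerce")

# Rows where both columns have a value but the values differ.
both = demo["prev_owner"].notna() & prev_b.notna()
disagree = both & (demo["prev_owner"] != prev_b)

# Rows where only previous_owners has a value (for example the 0 entries).
only_in_b = demo["prev_owner"].isna() & prev_b.notna()

print(int(disagree.sum()), int(only_in_b.sum()))
```

If the columns never disagree and one of them only adds values, as the counts above suggest for previous_owners, that column is the natural one to keep.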


### Let's examine Next Inspection.

In [117]:
temp = df["Next Inspection"].copy()  # this column can be dropped. It has 12384 missing values, and it is not among the important factors when buying a car.

In [119]:
data["Next_Inspection"]= temp.apply(pd.Series)[0].str.extract(r'(\d{4})')

In [122]:
data["Next_Inspection"].value_counts(dropna=False)

NaN     12384
2021     1601
2020      694
2022      688
2019      438
2023       47
2018       38
2017       13
2016        6
2001        5
2014        1
2024        1
1921        1
1955        1
1999        1
Name: Next_Inspection, dtype: int64

### Let's examine Inspection new.

In [123]:
temp = df["Inspection new"].copy()     # this column can be dropped (11987 missing values), or the empty values can be filled with No.

In [127]:
temp.apply(pd.Series)[0].str.extract(r'(\w+)')

Unnamed: 0,0
0,Yes
1,
2,
3,
4,Yes
...,...
15914,
15915,
15916,Yes
15917,


In [128]:
data["Inspection_new"] = temp.apply(pd.Series)[0].str.extract(r'(\w+)')

In [129]:
data["Inspection_new"].value_counts(dropna=False)

NaN    11987
Yes     3932
Name: Inspection_new, dtype: int64
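
Since the column only ever contains "Yes", one of the options mentioned above is to treat missing entries as "No". The sketch below shows that option on a hypothetical sample. Treating a missing value as "No" is an assumption about how sellers fill in the listing, not something the data confirms.

```python
import numpy as np
import pandas as pd

# Hypothetical sample shaped like Inspection_new.
sample = pd.Series(["Yes", np.nan, np.nan, "Yes"])

# Assumption: a listing that does not mention a new inspection has none.
filled = sample.fillna("No")

# A boolean flag is often easier to feed into a model than text.
is_new = filled.eq("Yes")
print(filled.value_counts())
```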

In [130]:
data[~data["Inspection_new"].notna()]

Unnamed: 0,make,model,body_type,cc,price,vat,km,registration,prev_owner,power_hp,vehicle_condition,fuel_1,previous_owners,Next_Inspection,Inspection_new
1,Audi,A1,Sedans,1.8,14500,Price negotiable,80000,2017.0,,141.0,Used,Gasoline,,,
2,Audi,A1,Sedans,1.6,14640,VAT deductible,83450,2016.0,1.0,85.0,Used,Diesel (Particulate Filter),1,,
3,Audi,A1,Sedans,1.4,14500,,73000,2016.0,1.0,66.0,Used,Diesel (Particulate Filter),1,,
5,Audi,A1,Sedans,1.6,15090,,63668,2016.0,1.0,85.0,Used,Diesel (Particulate Filter),1,,
8,Audi,A1,Sedans,1.6,16700,,57000,2016.0,1.0,85.0,Used,Diesel (Particulate Filter),1,2020,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15912,Renault,Espace,Van,,39950,VAT deductible,,,,147.0,New,Diesel (Particulate Filter),,,
15913,Renault,Espace,Van,,39950,VAT deductible,1000,2019.0,,165.0,Demonstration,Super 95,,,
15914,Renault,Espace,Van,,39950,VAT deductible,,,,147.0,New,Diesel (Particulate Filter),,,
15915,Renault,Espace,Van,,39885,VAT deductible,9900,2019.0,1.0,165.0,Used,Super 95 / Super Plus 98 (...,1,2022,


### Let's examine Warranty.

In [131]:
temp = df["Warranty"].copy()   # 11066 missing values; the column can be dropped, or the empty values can be treated as no warranty.

In [132]:
temp.apply(pd.Series)[0].str.extract(r'(\d{1,3})')

Unnamed: 0,0
0,
1,
2,
3,
4,
...,...
15914,24
15915,
15916,
15917,


In [133]:
data["guarante"] = temp.apply(pd.Series)[0].str.extract(r'(\d{1,3})')

In [135]:
data["guarante"].value_counts(dropna=False)

NaN    11066
12      2594
24      1118
60       401
36       279
48       149
6        125
72        59
3         33
23        11
18        10
20         7
25         6
2          5
16         4
50         4
26         4
34         3
13         3
19         3
4          3
1          3
21         2
14         2
46         2
45         2
22         2
11         2
17         2
9          2
28         2
7          1
30         1
8          1
56         1
49         1
47         1
65         1
10         1
15         1
33         1
40         1
Name: guarante, dtype: int64

### Let's examine Full Service.

In [136]:
temp = df["Full Service"].copy()   # we can drop this. It has no usable data.

In [139]:
temp.apply(pd.Series)[0].str.extract(r"(\w+)")

Unnamed: 0,0
0,
1,
2,
3,
4,
...,...
15914,
15915,
15916,
15917,


### Let's examine Non-smoking Vehicle.

In [140]:
temp = df["Non-smoking Vehicle"].copy()  # can be dropped. It has no data.

In [141]:
temp.apply(pd.Series)[0]

0         \n
1        NaN
2        NaN
3         \n
4         \n
        ... 
15914    NaN
15915     \n
15916     \n
15917    NaN
15918    NaN
Name: 0, Length: 15919, dtype: object

### Let's examine null. It contains no values, so it will be dropped.

### Let's examine Make. We already extracted it above, so it will be dropped.

### Let's examine Model. We already extracted it above, so it will be dropped.

### Let's examine Offer Number. It will be dropped; it is unnecessary information with no use here.

### Let's examine First Registration. It holds the same data as the registration column extracted above, so it would be a duplicate and one of the two must be dropped.

In [142]:
temp = df['First Registration'].copy()

In [148]:
temp.apply(pd.Series)[1].str.extract(r'(\d{4})')

Unnamed: 0,0
0,2016
1,2017
2,2016
3,2016
4,2016
...,...
15914,
15915,2019
15916,2019
15917,2019


In [154]:
data["first_registration"] = temp.apply(pd.Series)[1].str.extract(r'(\d{4})')

In [155]:
data["first_registration"].value_counts(dropna=False)  # 1597 values are missing. It holds the same data as registration above, so keeping one of them is enough.

2018    4522
2016    3674
2017    3273
2019    2853
NaN     1597
Name: first_registration, dtype: int64

In [156]:
data[~data["first_registration"].notna()][["first_registration","registration"]]

Unnamed: 0,first_registration,registration
122,,
710,,
734,,
741,,
743,,
...,...,...
15896,,
15902,,
15907,,
15912,,
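
Matching value counts suggest that the two columns are duplicates, but they do not prove it row by row. A quick check with `Series.equals`, which treats NaN values in the same position as equal, can confirm this before one column is dropped. The sketch below uses a hypothetical three-row frame.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: registration is float, first_registration is text.
demo = pd.DataFrame({
    "registration": [2016.0, np.nan, 2019.0],
    "first_registration": ["2016", np.nan, "2019"],
})

# Bring both columns to the same dtype before comparing.
first = pd.to_numeric(demo["first_registration"], errors="coerce")

# equals() is True only if every row matches, with NaN matching NaN.
same = demo["registration"].equals(first)

if same:
    demo = demo.drop(columns="first_registration")
print(same, list(demo.columns))
```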


In [None]:
df['Body Color']