## Introduction
In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a [classifieds](https://en.wikipedia.org/wiki/Classified_advertising) section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle. A few modifications were made to the original dataset:

- We sampled 50,000 data points from the full dataset.
- We dirtied the dataset a bit to more closely resemble what you would expect from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with).

The data dictionary provided with the data is as follows:

| Column name | Description |
|-------------|-------------|
| dateCrawled | When this ad was first crawled. All field values are taken from this date. |
| name | Name of the car. |
| seller | Whether the seller is private or a dealer. |
| offerType | The type of listing. |
| price | The price on the ad to sell the car. |
| abtest | Whether the listing is included in an A/B test. |
| vehicleType | The vehicle type. |
| yearOfRegistration | The year in which the car was first registered. |
| gearbox | The transmission type. |
| powerPS | The power of the car in PS. |
| model | The car model name. |
| kilometer | How many kilometers the car has driven. |
| monthOfRegistration | The month in which the car was first registered. |
| fuelType | What type of fuel the car uses. |
| brand | The brand of the car. |
| notRepairedDamage | Whether the car has damage that is not yet repaired. |
| dateCreated | The date on which the eBay listing was created. |
| nrOfPictures | The number of pictures in the ad. |
| postalCode | The postal code for the location of the vehicle. |
| lastSeen | When the crawler last saw this ad online. |


**The aim of this project is to clean the data and analyze the included used car listings.**

**Let's start**
- Import the libraries

In [1]:
import pandas as pd 
import numpy as np

In [2]:
# load the dataset (the file is read with Latin-1 encoding)
autos = pd.read_csv("autos.csv", encoding="Latin-1")
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


In [3]:
# get the information about your dataset
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64 
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64 
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64 
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64 
 12  monthOfRegistration  371528 non-null  int64 
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-nu

In [4]:
# print the column names and check whether they need to be modified
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'kilometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

- The column names are in camelCase, but we prefer snake_case, so we convert them to snake_case.

In [5]:
# first approach to cleaning the column names:
# build a function that cleans one name, then loop over the columns and apply it
def clean_columns(c):
    c = c.strip()
    c = c.replace("yearOfRegistration" , "registration_year")
    c = c.replace("monthOfRegistration" , "registration_month")
    c = c.replace("notRepairedDamage" , "unrepaired_damage")
    c = c.replace("dateCreated" , "ad_created")
    c = c.lower()
    return c

new_columns = []

for column in autos.columns:
    column = clean_columns(column)
    new_columns.append(column)
    
autos.columns = new_columns
autos.columns

Index(['datecrawled', 'name', 'seller', 'offertype', 'price', 'abtest',
       'vehicletype', 'registration_year', 'gearbox', 'powerps', 'model',
       'kilometer', 'registration_month', 'fueltype', 'brand',
       'unrepaired_damage', 'ad_created', 'nrofpictures', 'postalcode',
       'lastseen'],
      dtype='object')

- This approach works, but not all the names end up in snake_case (for example `datecrawled` and `offertype` are only lowercased), so we will try another approach.

In [6]:
# second approach: use the df.rename method to change each name
autos.rename({"gearbox" : "gear_box"} , axis = 1 , inplace = True)
autos.columns

Index(['datecrawled', 'name', 'seller', 'offertype', 'price', 'abtest',
       'vehicletype', 'registration_year', 'gear_box', 'powerps', 'model',
       'kilometer', 'registration_month', 'fueltype', 'brand',
       'unrepaired_damage', 'ad_created', 'nrofpictures', 'postalcode',
       'lastseen'],
      dtype='object')

- The problem now is that we would have to write a rename for each of the 20 columns, so let's do something blunt but easy instead.

In [7]:
# third approach: assign all the new names at once from a single list
new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'date_created', 'nr_of_pictures', 'postal_code',
       'last_seen'] 
autos.columns = new_columns
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'date_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
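
As an alternative to typing the names by hand, the camelCase to snake_case conversion can be sketched with a regular expression. This is only a minimal sketch: the helper name `camel_to_snake` is hypothetical (it is not part of pandas), it cannot split names that contain no capital letters (such as `abtest`), and it does not perform the manual renames (such as `kilometer` to `odometer`), so a hand-written mapping is still needed for those.

```python
import re

def camel_to_snake(name):
    """Hypothetical helper: put '_' between a lowercase letter/digit and a capital, then lowercase."""
    # The pattern matches the empty position between e.g. "e" and "C" in "dateCrawled".
    with_underscores = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    return with_underscores.lower()

original = ["dateCrawled", "offerType", "yearOfRegistration", "powerPS", "nrOfPictures", "abtest"]
converted = [camel_to_snake(c) for c in original]
print(converted)
# ['date_crawled', 'offer_type', 'year_of_registration', 'power_ps', 'nr_of_pictures', 'abtest']
```

With a DataFrame, the same function could be passed as `autos.rename(columns=camel_to_snake)`, since `rename` accepts a callable.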

- explore the data to know what other cleaning tasks need to be done

In [8]:
autos.describe(include = "all")

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,date_created,nr_of_pictures,postal_code,last_seen
count,371528,371528,371528,371528,371528.0,371528,333659,371528.0,351319,371528.0,351044,371528.0,371528.0,338142,371528,299468,371528,371528.0,371528.0,371528
unique,280500,233531,2,2,,2,8,,2,,251,,,7,40,2,114,,,182806
top,2016-03-24 14:49:47,Ford_Fiesta,privat,Angebot,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:45:59
freq,7,657,371525,371516,,192585,95894,,274214,,30070,,,223857,79640,263182,14450,,,17
mean,,,,,17295.14,,,2004.577997,,115.549477,,125618.688228,5.734445,,,,,0.0,50820.66764,
std,,,,,3587954.0,,,92.866598,,192.139578,,40112.337051,3.712412,,,,,0.0,25799.08247,
min,,,,,0.0,,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,,,,1150.0,,,1999.0,,70.0,,125000.0,3.0,,,,,0.0,30459.0,
50%,,,,,2950.0,,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49610.0,
75%,,,,,7200.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71546.0,


In [9]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   date_crawled        371528 non-null  object
 1   name                371528 non-null  object
 2   seller              371528 non-null  object
 3   offer_type          371528 non-null  object
 4   price               371528 non-null  int64 
 5   ab_test             371528 non-null  object
 6   vehicle_type        333659 non-null  object
 7   registration_year   371528 non-null  int64 
 8   gearbox             351319 non-null  object
 9   power_ps            371528 non-null  int64 
 10  model               351044 non-null  object
 11  odometer            371528 non-null  int64 
 12  registration_month  371528 non-null  int64 
 13  fuel_type           338142 non-null  object
 14  brand               371528 non-null  object
 15  unrepaired_damage   299468 non-null  object
 16  da

- The dataset here is already cleaner than the data on Kaggle.

- For some practice, I can dirty the data myself and then clean it again.

In [10]:
print(autos["odometer"].dtype)
autos["odometer"].unique()

int64


array([150000, 125000,  90000,  40000,  30000,  70000,   5000, 100000,
        60000,  20000,  80000,  50000,  10000], dtype=int64)

In [11]:
# change the odometer column name
autos = autos.rename({"odometer" : "odometer_km"} , axis = 1)

#### A number of text columns have almost the same value in every row
- seller, offer_type

In [12]:
print(autos["seller"].unique())
autos["seller"].value_counts()

['privat' 'gewerblich']


privat        371525
gewerblich         3
Name: seller, dtype: int64

In [13]:
print(autos["offer_type"].unique())
autos["offer_type"].value_counts()

['Angebot' 'Gesuch']


Angebot    371516
Gesuch         12
Name: offer_type, dtype: int64
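
Rather than checking columns one at a time, nearly constant columns can be flagged by looking at the share of the most frequent value in each column. The sketch below uses a small toy DataFrame (not the real data), and the 0.85 threshold is an arbitrary assumption chosen only for illustration.

```python
import pandas as pd

# Toy data: "seller" is almost constant, "brand" is not
toy = pd.DataFrame({
    "seller": ["privat"] * 9 + ["gewerblich"],
    "brand": ["vw", "audi", "bmw", "vw", "audi", "bmw", "vw", "audi", "bmw", "vw"],
})

# Share of the most frequent value in each column
top_share = {col: toy[col].value_counts(normalize=True).iloc[0] for col in toy.columns}

# Assumed threshold: flag a column when one value covers at least 85% of rows
near_constant = [col for col, share in top_share.items() if share >= 0.85]
print(near_constant)  # ['seller']
```

Columns flagged this way carry very little information and are candidates for dropping.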

- Let's continue exploring the data, specifically looking for data that doesn't look right. We'll start by analyzing the odometer_km and price columns

In [14]:
# see how many unique values there are
print(autos["seller"].unique().shape)
autos["seller"].unique()

(2,)


array(['privat', 'gewerblich'], dtype=object)

In [15]:
# get some statistics
autos["seller"].describe()

count     371528
unique         2
top       privat
freq      371525
Name: seller, dtype: object

In [16]:
# count how often each value occurs in the series
autos["seller"].value_counts()

privat        371525
gewerblich         3
Name: seller, dtype: int64

In [17]:
autos["seller"].sort_index(ascending = False).head()

371527    privat
371526    privat
371525    privat
371524    privat
371523    privat
Name: seller, dtype: object

- Outliers can be removed with a boolean mask such as df[(df["col"] > x) & (df["col"] < y)]

- Exploring the odometer_km and price columns
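
To make the filtering pattern concrete, here is a minimal sketch on a toy DataFrame (the values are made up). It also shows one detail worth knowing: a mask built with `>` and `<` excludes the boundary values, while `Series.between` includes both ends by default.

```python
import pandas as pd

# Toy prices, including obvious outliers at both ends
df = pd.DataFrame({"price": [0, 1, 500, 7200, 500000, 99999999]})

# Strict comparison: the boundaries 1 and 500000 are excluded
strict = df[(df["price"] > 1) & (df["price"] < 500000)]
print(strict["price"].tolist())  # [500, 7200]

# between() is inclusive on both ends by default
inclusive = df[df["price"].between(1, 500000)]
print(inclusive["price"].tolist())  # [1, 500, 7200, 500000]
```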

In [18]:
# let's see the price column if there are outliers or not
# the number of unique values
autos["price"].unique().shape

(5597,)

In [19]:
# statistics
autos["price"].describe()

count    3.715280e+05
mean     1.729514e+04
std      3.587954e+06
min      0.000000e+00
25%      1.150000e+03
50%      2.950000e+03
75%      7.200000e+03
max      2.147484e+09
Name: price, dtype: float64

In [20]:
autos["price"].value_counts(dropna = False)

0         10778
500        5670
1500       5394
1000       4649
1200       4594
          ...  
9654          1
138000        1
14260         1
12469         1
8188          1
Name: price, Length: 5597, dtype: int64

- Surely no one is selling a car for free, yet 10,778 listings have a price of zero.
- We will either change these values or remove them as outliers.

In [21]:
# remove the rows with price zero
autos = autos[autos["price"] != 0]
autos["price"].value_counts(dropna = False)

500       5670
1500      5394
1000      4649
1200      4594
2500      4438
          ... 
36995        1
19986        1
445000       1
36480        1
8188         1
Name: price, Length: 5596, dtype: int64

- We have removed all the rows with a price of zero. I am not sure about cars priced at 500 (dollars, or whatever the currency is), but I am going to leave them in.

In [22]:
autos["price"].value_counts().sort_index()

1             1189
2               12
3                8
4                1
5               26
              ... 
32545461         1
74185296         1
99000000         1
99999999        15
2147483647       1
Name: price, Length: 5596, dtype: int64

- From the last cell we can see that some prices are only one or two dollars, which seems implausible, and that a few are absurdly high.
- I will keep prices between 1 and 500,000.

In [23]:
autos = autos[autos["price"].between(1 , 500000)]
autos["price"].value_counts().sort_index(ascending = False)

500000       2
488997       1
487000       1
485000       1
466000       1
          ... 
5           26
4            1
3            8
2           12
1         1189
Name: price, Length: 5557, dtype: int64

In [24]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 360650 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   date_crawled        360650 non-null  object
 1   name                360650 non-null  object
 2   seller              360650 non-null  object
 3   offer_type          360650 non-null  object
 4   price               360650 non-null  int64 
 5   ab_test             360650 non-null  object
 6   vehicle_type        326555 non-null  object
 7   registration_year   360650 non-null  int64 
 8   gearbox             342959 non-null  object
 9   power_ps            360650 non-null  int64 
 10  model               342386 non-null  object
 11  odometer_km         360650 non-null  int64 
 12  registration_month  360650 non-null  int64 
 13  fuel_type           330739 non-null  object
 14  brand               360650 non-null  object
 15  unrepaired_damage   293923 non-null  object
 16  da

- The dataset is now reduced to 360,650 rows.
- With the same technique we will explore the `odometer_km` column.

In [25]:
# unique values first
print(autos["odometer_km"].unique().shape)
autos["odometer_km"].value_counts().sort_index()

(13,)


5000        6018
10000       1869
20000       5534
30000       5937
40000       6319
50000       7532
60000       8593
70000       9673
80000      10905
90000      12349
100000     15479
125000     37371
150000    233071
Name: odometer_km, dtype: int64

- From the previous cell, the odometer has 13 unique values between 5,000 and 150,000 km, which looks plausible, so there is no problem here.

In [26]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 360650 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   date_crawled        360650 non-null  object
 1   name                360650 non-null  object
 2   seller              360650 non-null  object
 3   offer_type          360650 non-null  object
 4   price               360650 non-null  int64 
 5   ab_test             360650 non-null  object
 6   vehicle_type        326555 non-null  object
 7   registration_year   360650 non-null  int64 
 8   gearbox             342959 non-null  object
 9   power_ps            360650 non-null  int64 
 10  model               342386 non-null  object
 11  odometer_km         360650 non-null  int64 
 12  registration_month  360650 non-null  int64 
 13  fuel_type           330739 non-null  object
 14  brand               360650 non-null  object
 15  unrepaired_damage   293923 non-null  object
 16  da

In [27]:
autos.head(2)

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,date_created,nr_of_pictures,postal_code,last_seen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50


### Exploring the date columns

In [28]:
autos[["date_crawled" , "date_created" , "last_seen"]].dtypes

date_crawled    object
date_created    object
last_seen       object
dtype: object

In [29]:
autos[["date_crawled" , "date_created" , "last_seen"]][0:3]

Unnamed: 0,date_crawled,date_created,last_seen
0,2016-03-24 11:52:17,2016-03-24 00:00:00,2016-04-07 03:16:57
1,2016-03-24 10:58:45,2016-03-24 00:00:00,2016-04-07 01:46:50
2,2016-03-14 12:52:21,2016-03-14 00:00:00,2016-04-05 12:47:46


- Notice that the first 10 characters represent the date, so take this 10-character substring.
- Use the `normalize` parameter to get the relative frequency of each value instead of raw counts.
- Sort the result from earliest to latest.
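
The steps above can be sketched on three sample timestamps. For comparison, the sketch also parses the strings with `pd.to_datetime` and drops the time part with `dt.normalize()`, which gives the same distribution with real datetime values as the index. The explicit `format` string is an assumption based on how the timestamps look in this dataset.

```python
import pandas as pd

s = pd.Series(["2016-03-24 11:52:17", "2016-03-24 10:58:45", "2016-03-14 12:52:21"])

# Substring approach: keep the first 10 characters (YYYY-MM-DD)
by_slice = s.str[:10].value_counts(normalize=True, dropna=False).sort_index()
print(by_slice)

# Alternative: parse to datetime, then set the time part to midnight
parsed = pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S")
by_date = parsed.dt.normalize().value_counts(normalize=True).sort_index()
print(by_date)
```

The substring version is enough for counting; the parsed version is more convenient if date arithmetic is needed later.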

In [30]:

autos["date_crawled"].str[:10].value_counts(dropna = False , normalize = True).sort_index()

2016-03-05    0.025546
2016-03-06    0.014482
2016-03-07    0.035655
2016-03-08    0.033467
2016-03-09    0.034113
2016-03-10    0.032647
2016-03-11    0.032771
2016-03-12    0.036240
2016-03-13    0.015783
2016-03-14    0.036329
2016-03-15    0.033423
2016-03-16    0.030204
2016-03-17    0.031648
2016-03-18    0.013118
2016-03-19    0.035270
2016-03-20    0.036398
2016-03-21    0.035688
2016-03-22    0.032494
2016-03-23    0.032006
2016-03-24    0.029915
2016-03-25    0.032799
2016-03-26    0.031973
2016-03-27    0.030226
2016-03-28    0.035062
2016-03-29    0.034124
2016-03-30    0.033534
2016-03-31    0.031873
2016-04-01    0.034149
2016-04-02    0.035095
2016-04-03    0.038810
2016-04-04    0.037627
2016-04-05    0.012785
2016-04-06    0.003128
2016-04-07    0.001617
Name: date_crawled, dtype: float64

In [31]:
# by the same method investigate the other columns
autos["date_created"].str[:10].value_counts(dropna = False , normalize = True).sort_index()

2014-03-10    0.000003
2015-03-20    0.000003
2015-06-11    0.000003
2015-06-18    0.000003
2015-08-07    0.000003
                ...   
2016-04-03    0.038999
2016-04-04    0.037735
2016-04-05    0.011618
2016-04-06    0.003119
2016-04-07    0.001553
Name: date_created, Length: 114, dtype: float64

In [32]:
autos["registration_year"].describe()

count    360650.000000
mean       2004.433035
std          81.015348
min        1000.000000
25%        1999.000000
50%        2004.000000
75%        2008.000000
max        9999.000000
Name: registration_year, dtype: float64

- Some values are clearly wrong, such as the year 9999, which is far in the future, and the year 1000, when there were no cars.

- I will take 2016 as the highest acceptable year, since the ads were crawled in 2016, and 1950 as the lowest, and see what we get.

In [33]:
autos["registration_year"].between(1950 , 2016).value_counts(dropna = False)

True     346509
False     14141
Name: registration_year, dtype: int64
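
Before dropping rows it helps to know what share of the data is affected. The short sketch below first does the arithmetic with the two counts shown above, then shows how the same share can be read directly from a boolean mask, using made-up toy years.

```python
import pandas as pd

# Counts copied from the value_counts() output above
inside, outside = 346509, 14141
print(round(outside / (inside + outside), 4))  # 0.0392

# The mean of a boolean mask is the share of True values (toy data)
years = pd.Series([1000, 1993, 2004, 2016, 2017, 9999])
valid = years.between(1950, 2016)
share_outside = (~valid).mean()
print(share_outside)  # 0.5
```

On the real data, roughly 3.9% of the listings fall outside the 1950 to 2016 range.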

In [34]:
# remove those rows
autos = autos[autos["registration_year"].between(1950 , 2016)]


In [35]:
# distribution of the remaining registration years
autos["registration_year"].value_counts(dropna = False , normalize = True).sort_index(ascending = False)

2016    0.026600
2015    0.008433
2014    0.013763
2013    0.017665
2012    0.027009
          ...   
1954    0.000040
1953    0.000049
1952    0.000032
1951    0.000049
1950    0.000061
Name: registration_year, Length: 67, dtype: float64

### Exploring price by brand

In [36]:
agg_dic = {}

for b in autos["brand"].unique():
    selected_rows = autos[autos["brand"] == b]
    mean = selected_rows["price"].mean()
    agg_dic[b] = mean
    
agg_dic

{'volkswagen': 5397.604268891069,
 'audi': 9086.796172187138,
 'jeep': 11196.708609271524,
 'skoda': 6530.51084957705,
 'bmw': 8458.90055669345,
 'peugeot': 3266.049956933678,
 'ford': 3680.766059082338,
 'mazda': 4076.357959872492,
 'nissan': 4708.931268561731,
 'renault': 2437.9086769081305,
 'mercedes_benz': 8551.515954118873,
 'seat': 4541.999227083012,
 'honda': 4005.6602834163155,
 'fiat': 2889.312851044709,
 'opel': 2968.663656088961,
 'mini': 10080.620657854288,
 'smart': 3632.480877616747,
 'hyundai': 5568.614583333333,
 'sonstige_autos': 14855.573144441023,
 'alfa_romeo': 4291.759945130316,
 'subaru': 4391.99863574352,
 'volvo': 5238.4203721223585,
 'mitsubishi': 3407.4788091068303,
 'kia': 5855.187734668335,
 'suzuki': 4044.3902991840437,
 'lancia': 3289.9955654101996,
 'porsche': 43580.66805555556,
 'citroen': 3733.2394949285863,
 'toyota': 5339.680571046174,
 'chevrolet': 7060.284398388026,
 'dacia': 5922.862427745665,
 'daihatsu': 1775.0974632843793,
 'trabant': 1919.8592

### Aggregate data for some car brands

In [37]:
# aggregate for each brand, or at least 5 or 6 brands, to practice aggregation and get the idea
dic = {}
print(autos["brand"].unique())

mean = autos.loc[autos["brand"] == "volkswagen" , "price"].mean()
dic["volkswagen"] = mean
dic
autos["brand"].unique()[0]

['volkswagen' 'audi' 'jeep' 'skoda' 'bmw' 'peugeot' 'ford' 'mazda'
 'nissan' 'renault' 'mercedes_benz' 'seat' 'honda' 'fiat' 'opel' 'mini'
 'smart' 'hyundai' 'sonstige_autos' 'alfa_romeo' 'subaru' 'volvo'
 'mitsubishi' 'kia' 'suzuki' 'lancia' 'porsche' 'citroen' 'toyota'
 'chevrolet' 'dacia' 'daihatsu' 'trabant' 'chrysler' 'jaguar' 'daewoo'
 'rover' 'saab' 'land_rover' 'lada']


'volkswagen'

In [38]:
sorted_brands = autos["brand"].value_counts(dropna = False).sort_values(ascending = False)
top_20 = sorted_brands.head(20).index

In [39]:
dic = {}

for i in top_20:
    rows = autos[autos["brand"] == i]
    mean = rows["price"].mean()
    dic[i] = mean
    
dic

{'volkswagen': 5397.604268891069,
 'bmw': 8458.90055669345,
 'opel': 2968.663656088961,
 'mercedes_benz': 8551.515954118873,
 'audi': 9086.796172187138,
 'ford': 3680.766059082338,
 'renault': 2437.9086769081305,
 'peugeot': 3266.049956933678,
 'fiat': 2889.312851044709,
 'seat': 4541.999227083012,
 'skoda': 6530.51084957705,
 'mazda': 4076.357959872492,
 'smart': 3632.480877616747,
 'citroen': 3733.2394949285863,
 'nissan': 4708.931268561731,
 'toyota': 5339.680571046174,
 'hyundai': 5568.614583333333,
 'mini': 10080.620657854288,
 'sonstige_autos': 14855.573144441023,
 'volvo': 5238.4203721223585}
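
The loop above builds the dictionary one brand at a time. The same aggregation can be sketched with `groupby`, which is the more idiomatic pandas route. The toy DataFrame and its brand names are made up for illustration, and `head(2)` stands in for the top 20 used above.

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["vw", "vw", "audi", "audi", "bmw", "vw"],
    "price": [1000, 3000, 9000, 11000, 8000, 2000],
})

# Most common brands (top 2 here instead of top 20)
top_brands = toy["brand"].value_counts().head(2).index

# Mean price per brand, restricted to the most common brands
mean_price = (
    toy[toy["brand"].isin(top_brands)]
    .groupby("brand")["price"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_price.to_dict())  # {'audi': 10000.0, 'vw': 2000.0}
```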

## Try it yourself

In [43]:
autos["price"] = autos["price"].astype(str)
autos["price"] = autos["price"] + "$"

In [45]:
autos["price"] = autos["price"] + ","

In [55]:
# "$" is a regex metacharacter, so replace it as a literal string
autos["price"].str.replace("$" , "" , regex = False).str.replace("," , "" , regex = False)

0           480
1         18300
2          9800
3          1500
4          3600
          ...  
371523     2200
371524     1199
371525     9200
371526     3400
371527    28990
Name: price, Length: 346509, dtype: object

In [61]:
# analyzing autos["odometer_km"]


autos.odometer_km.unique().shape
autos.odometer_km.unique()

array([150000, 125000,  90000,  30000,  70000, 100000,  60000,   5000,
        20000,  80000,  50000,  40000,  10000], dtype=int64)

In [62]:
autos.odometer_km.describe()

count    346509.000000
mean     125493.508105
std       39843.678401
min        5000.000000
25%      100000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [74]:
autos["odometer_km"].value_counts().sort_values()

10000       1780
20000       5368
5000        5544
30000       5798
40000       6210
50000       7353
60000       8427
70000       9437
80000      10611
90000      11983
100000     15026
125000     36008
150000    222964
Name: odometer_km, dtype: int64

In [78]:
autos.odometer_km.value_counts().sort_index( )

5000        5544
10000       1780
20000       5368
30000       5798
40000       6210
50000       7353
60000       8427
70000       9437
80000      10611
90000      11983
100000     15026
125000     36008
150000    222964
Name: odometer_km, dtype: int64

In [81]:
autos["price"].unique().shape

(5511,)

In [86]:
# strip the "$" and "," characters as literal strings, then convert back to integers
autos["price"] = autos["price"].str.replace("$" , "" , regex = False).str.replace("," , "" , regex = False)
autos["price"] = autos["price"].astype(int)
autos["price"].dtype

dtype('int32')
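
The clean-up above can be sketched on a few sample strings. In a regular expression `$` means "end of string" rather than a dollar sign, so the sketch passes `regex=False` to make both replacements literal. The sample values are made up to mimic the dirtied column.

```python
import pandas as pd

# Strings that mimic the dirtied price column
raw = pd.Series(["480$,", "18300$,", "9800$,"])

cleaned = (
    raw.str.replace("$", "", regex=False)   # remove the literal dollar sign
       .str.replace(",", "", regex=False)   # remove the literal comma
       .astype(int)                         # convert the digits back to integers
)
print(cleaned.tolist())  # [480, 18300, 9800]
```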

In [89]:
autos["price"].unique().shape

(5511,)

In [92]:
autos["price"].value_counts(dropna = False).sort_values()

8188        1
18230       1
83990       1
319         1
2625        1
         ... 
2500     4242
1200     4332
1000     4362
1500     5091
500      5463
Name: price, Length: 5511, dtype: int64

In [96]:
autos["price"].value_counts(dropna = False).sort_index(ascending = True).head()

1    1106
2      11
3       7
4       1
5      26
Name: price, dtype: int64

In [98]:
autos["price"].value_counts(dropna =False).sort_index(ascending= False).head()

500000    2
488997    1
487000    1
485000    1
466000    1
Name: price, dtype: int64

In [102]:
autos["price"].value_counts().sort_values(ascending = False)

500      5463
1500     5091
1000     4362
1200     4332
2500     4242
         ... 
51750       1
2625        1
319         1
83990       1
8188        1
Name: price, Length: 5511, dtype: int64

In [106]:
autos["price"].value_counts().sort_index().tail()

466000    1
485000    1
487000    1
488997    1
500000    2
Name: price, dtype: int64

In [107]:
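# mark rows whose price is outside 550-500000 as missing (note: this sets every column in those rows to NaN, not only price)
autos[~ autos["price"].between(550 , 500000)] = np.nan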

In [111]:
autos["price"].value_counts(dropna = False , normalize = True)

NaN        0.087172
1500.0     0.014692
1000.0     0.012588
1200.0     0.012502
2500.0     0.012242
             ...   
51995.0    0.000003
2828.0     0.000003
2501.0     0.000003
7898.0     0.000003
706.0      0.000003
Name: price, Length: 5227, dtype: float64
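
Assigning `np.nan` through a row mask blanks every column of the selected rows. If the goal is to mark only the price as missing, one option is `Series.where`, sketched below on toy data (the brand names and prices are made up).

```python
import pandas as pd

toy = pd.DataFrame({"brand": ["a", "b", "c"], "price": [100, 5000, 900000]})

# Keep prices inside the range; everything else in this column becomes NaN
in_range = toy["price"].between(550, 500000)
toy["price"] = toy["price"].where(in_range)

print(toy["brand"].tolist())         # ['a', 'b', 'c']
print(toy["price"].isna().tolist())  # [True, False, True]
```

The other columns stay intact, so the rows can still be used for analyses that do not depend on price.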