## **Outliers**
Outliers are cases whose data values are **very different from those of the majority of cases** in the dataset. An outlier is an observation that lies an **abnormal distance** from the other values.

- Treating outliers is important because they can change the results of a data analysis.
- The mean can be strongly affected by the presence of an outlier.
  - **Example:**
      - **Mean without outlier**
          - (12+13+14+15+16) / 5 = 14
      
      - **Mean with outlier**
          -  (12+500+14+15+16) / 5 = 111.4
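
The arithmetic above can be checked with a minimal sketch. The median is not part of the original example; it is included here only as an illustrative comparison, since it moves far less than the mean when one value is replaced by an outlier.

```python
import numpy as np

# The two samples from the example above
values = np.array([12, 13, 14, 15, 16])
with_outlier = np.array([12, 500, 14, 15, 16])

mean_clean = values.mean()              # (12+13+14+15+16) / 5
mean_outlier = with_outlier.mean()      # (12+500+14+15+16) / 5

# Added for comparison only (not in the original example)
median_clean = np.median(values)
median_outlier = np.median(with_outlier)

print(f"mean:   {mean_clean} -> {mean_outlier}")
print(f"median: {median_clean} -> {median_outlier}")
```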

<br>

<p align="center">
  <img src="https://miro.medium.com/max/1352/1*xsJKdRtENPJn4WWx604LGQ.png" height="350" width="600" title="hover text" alt="Normal distribution">
  
</p>





### **Different techniques to remove outliers**
- Percentile, Quantile
- IQR 
- Z-Score

In [8]:
data_path = 'https://raw.githubusercontent.com/codebasics/py/master/ML/FeatureEngineering/1_outliers/Exercise/AB_NYC_2019.csv'

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [10]:
data  = pd.read_csv(data_path)
data.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [11]:
data.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')

In [12]:
data.shape

(48895, 16)

In [13]:
data['price']

0        149
1        225
2        150
3         89
4         80
        ... 
48890     70
48891     40
48892    115
48893     55
48894     90
Name: price, Length: 48895, dtype: int64

In [14]:
min_threshold = data.price.quantile(0.2)
max_threshold = data.price.quantile(0.95)

In [15]:
min_threshold

60.0

In [16]:
max_threshold

355.0

In [17]:
data[data.price < min_threshold].sample(5)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
2220,1042806,My Other Little Guestroom,2680820,Linda,Queens,Flushing,40.75334,-73.81699,Private room,59,1,281,2019-06-11,3.7,3,322
5245,3791554,Cosy Inwood apartment,5810195,Emilie,Manhattan,Inwood,40.85988,-73.92709,Private room,39,1,13,2019-06-30,1.18,2,233
36997,29414633,private room,221574115,Ahmet,Brooklyn,Bensonhurst,40.61208,-74.00181,Shared room,15,10,0,,,2,0
34991,27735126,Cozy shared place by Central Park Manhattan,209386156,Abraham,Manhattan,East Harlem,40.80033,-73.94201,Shared room,49,2,60,2019-06-22,5.54,9,109
37654,29856229,GREAT FURNISHED BEDROOM NEAR MIDTOWN MANHATTAN,221836975,Jon,Queens,Jackson Heights,40.74945,-73.89299,Private room,50,2,9,2019-07-02,1.38,3,365


In [18]:
data[data.price > max_threshold].sample(5)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
2355,1146653,Luxury 1 Bedroom Central Park Views,836168,Henry,Manhattan,Upper West Side,40.79208,-73.96482,Entire home/apt,1000,30,24,2016-01-27,0.33,11,364
12273,9492678,Gorgeous 1BR in Williamsburg church,1335392,Matthew,Brooklyn,Williamsburg,40.71584,-73.95868,Entire home/apt,450,2,0,,,1,0
41409,32237413,Sonder | Stock Exchange | Expansive 3BR + Kitchen,219517861,Sonder (NYC),Manhattan,Financial District,40.70631,-74.01098,Entire home/apt,503,2,7,2019-05-28,2.1,327,294
28881,22268514,4 bedroom / 2 bathroom LOCATION LOCATION LOCATION,13532838,Nik,Manhattan,Chinatown,40.71703,-73.9923,Entire home/apt,400,3,0,,,1,0
45909,35000403,Central Park 4 Bed Steps to Train + Ground floor!,2133558,Felipe,Manhattan,Upper West Side,40.7983,-73.96115,Entire home/apt,450,4,4,2019-07-01,3.08,1,345


In [19]:
new_df = data[(data.price > min_threshold) & (data.price < max_threshold)]
new_df

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.10,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48888,36484087,"Spacious Room w/ Private Rooftop, Central loca...",274321313,Kat,Manhattan,Hell's Kitchen,40.76392,-73.99183,Private room,125,4,0,,,1,31
48889,36484363,QUIT PRIVATE HOUSE,107716952,Michael,Queens,Jamaica,40.69137,-73.80844,Private room,65,1,0,,,2,163
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27


In [20]:
data.shape[0] - new_df.shape[0]

12514
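
One detail worth keeping in mind is that the filter above uses strict comparisons (`>` and `<`), so rows whose price is exactly equal to a threshold are dropped as well. The sketch below illustrates this on a small made-up series; the values and the `lo`/`hi` thresholds are hypothetical and are not taken from the Airbnb data.

```python
import pandas as pd

# Hypothetical prices with several values tied at the thresholds
prices = pd.Series([50, 60, 60, 60, 100, 150, 200, 355, 355, 400])
lo, hi = 60, 355  # assumed thresholds, chosen only for illustration

# Strict comparison: values equal to a threshold are removed
strict = prices[(prices > lo) & (prices < hi)]

# Inclusive comparison: values equal to a threshold are kept
inclusive = prices[(prices >= lo) & (prices <= hi)]

print(len(strict), len(inclusive))
```

Which variant is appropriate depends on whether values sitting exactly on a threshold should count as outliers.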

In [21]:
# plt.scatter(data.price < min_threshold , data.price > max_threshold)



---

---





In [22]:
d = np.array([1,2,3,4,24,23,13,3,23,23,13,199,32,3,32,32,3,4323,32,42,13,1,1])
d

array([   1,    2,    3,    4,   24,   23,   13,    3,   23,   23,   13,
        199,   32,    3,   32,   32,    3, 4323,   32,   42,   13,    1,
          1])

In [23]:
d = pd.Series(d)
d

0        1
1        2
2        3
3        4
4       24
5       23
6       13
7        3
8       23
9       23
10      13
11     199
12      32
13       3
14      32
15      32
16       3
17    4323
18      32
19      42
20      13
21       1
22       1
dtype: int64

In [24]:
# Use distinct names so the built-in min() and max() are not shadowed
min_d, max_d = d.quantile([0.3, 0.9])
print(f'Minimum threshold is: {min_d}')
print(f'Maximum threshold is: {max_d}')

Minimum threshold is: 3.0
Maximum threshold is: 40.00000000000001


In [25]:
new_d = d[(d > min_d) & (d < max_d)]
new_d

3      4
4     24
5     23
6     13
8     23
9     23
10    13
12    32
14    32
15    32
18    32
20    13
dtype: int64

In [26]:
d.mean()

210.65217391304347

In [27]:
new_d.mean()

22.0
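
The percentile steps in the cells above could be wrapped in a small helper so the same trimming can be reused on other columns. This is only a sketch: `trim_by_quantile` is a hypothetical name, and it keeps the strict comparisons used above.

```python
import pandas as pd

def trim_by_quantile(series, lower_q, upper_q):
    """Keep values strictly between two quantiles (hypothetical helper)."""
    lower, upper = series.quantile([lower_q, upper_q])
    return series[(series > lower) & (series < upper)]

# Same toy data as in the cells above
d = pd.Series([1, 2, 3, 4, 24, 23, 13, 3, 23, 23, 13, 199,
               32, 3, 32, 32, 3, 4323, 32, 42, 13, 1, 1])

trimmed = trim_by_quantile(d, 0.3, 0.9)
print(d.mean(), trimmed.mean())
```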




---







## **Detecting and Removing Outliers Using IQR**

**IQR = Q3 - Q1**

- Q1 = df.quantile(0.25)
- Q3 = df.quantile(0.75)

**And**
- lower_limit = Q1 - 1.5 * IQR
- upper_limit = Q3 + 1.5 * IQR


**Any data point between `lower_limit` and `upper_limit` is a valid point; any data point lower than `lower_limit` or larger than `upper_limit` is an outlier.**
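
The rule above can be sketched as a small function. The name `iqr_bounds`, the parameter `k`, and the sample values are assumptions made for illustration; `k=1.5` mirrors the multiplier in the formulas above.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower_limit, upper_limit) from the IQR rule (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up sample with one obviously extreme value
s = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

lower_limit, upper_limit = iqr_bounds(s)
outliers = s[(s < lower_limit) | (s > upper_limit)]
valid = s[(s >= lower_limit) & (s <= upper_limit)]

print(lower_limit, upper_limit)
print(outliers.tolist())
```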


In [29]:
hw_data_path = 'https://raw.githubusercontent.com/codebasics/py/master/ML/FeatureEngineering/3_outlier_IQR/Exercise/height_weight.csv'


In [30]:
data = pd.read_csv(hw_data_path)
data.head()

Unnamed: 0,gender,height,weight
0,Male,73.847017,241.893563
1,Male,68.781904,162.310473
2,Male,74.110105,212.740856
3,Male,71.730978,220.04247
4,Male,69.881796,206.349801


In [31]:
data.shape

(10000, 3)

In [32]:
data.describe()

Unnamed: 0,height,weight
count,10000.0,10000.0
mean,66.36756,161.440357
std,3.847528,32.108439
min,54.263133,64.700127
25%,63.50562,135.818051
50%,66.31807,161.212928
75%,69.174262,187.169525
max,78.998742,269.989699


In [33]:
Q1_height = data.height.quantile(0.25)
Q3_height = data.height.quantile(0.75)

In [34]:
Q1_height , Q3_height

(63.505620481218955, 69.1742617268347)

In [35]:
Q1_weight = data.weight.quantile(0.25)
Q3_weight = data.weight.quantile(0.75)

In [36]:
Q1_weight , Q3_weight

(135.8180513055015, 187.16952486868348)

In [37]:
IQR_height = Q3_height - Q1_height
IQR_height

5.668641245615746

In [38]:
IQR_weight = Q3_weight - Q1_weight
IQR_weight

51.35147356318197

In [39]:
lower_limit_height = Q1_height - 1.5 * IQR_height
upper_limit_height = Q3_height + 1.5 * IQR_height

In [40]:
print(f'Lower limit of height values is: {lower_limit_height}')
print(f'Upper limit of height values is: {upper_limit_height}')

Lower limit of height values is: 55.00265861279534
Upper limit of height values is: 77.67722359525831


In [41]:
lower_limit_weight = Q1_weight - 1.5 * IQR_weight
upper_limit_weight = Q3_weight + 1.5 * IQR_weight

In [42]:
print(f'Lower limit of weight values is: {lower_limit_weight}')
print(f'Upper limit of weight values is: {upper_limit_weight}')

Lower limit of weight values is: 58.79084096072856
Upper limit of weight values is: 264.19673521345646


**Outlier values of height**

In [43]:
data[(data.height < lower_limit_height) | (data.height > upper_limit_height)]

Unnamed: 0,gender,height,weight
994,Male,78.095867,255.690835
1317,Male,78.462053,227.342565
2014,Male,78.998742,269.989699
3285,Male,78.52821,253.889004
3757,Male,78.621374,245.733783
6624,Female,54.616858,71.393749
7294,Female,54.873728,78.60667
9285,Female,54.263133,64.700127


**Outlier values of weight**

In [44]:
data[(data.weight < lower_limit_weight) | (data.weight > upper_limit_weight)]

Unnamed: 0,gender,height,weight
2014,Male,78.998742,269.989699


In [45]:
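without_outlier = data[(data.height > lower_limit_height) & (data.height < upper_limit_height)]

# Note: this filters `data` again, so it overwrites the height-filtered result above and only the
# weight limits apply in the output below; filter `without_outlier` instead to apply both.
without_outlier = data[(data.weight > lower_limit_weight) & (data.weight < upper_limit_weight)]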

In [46]:
without_outlier

Unnamed: 0,gender,height,weight
0,Male,73.847017,241.893563
1,Male,68.781904,162.310473
2,Male,74.110105,212.740856
3,Male,71.730978,220.042470
4,Male,69.881796,206.349801
...,...,...,...
9995,Female,66.172652,136.777454
9996,Female,67.067155,170.867906
9997,Female,63.867992,128.475319
9998,Female,69.034243,163.852461


In [47]:
without_outlier.describe()

Unnamed: 0,height,weight
count,9999.0,9999.0
mean,66.366297,161.429501
std,3.845646,32.091686
min,54.263133,64.700127
25%,63.505347,135.817009
50%,66.317899,161.201891
75%,69.172069,187.152394
max,78.621374,255.863326
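
To drop rows that are outliers in either column, the two conditions can be combined into a single boolean mask. The sketch below uses a small made-up DataFrame rather than the height/weight dataset, and `inside_iqr` is a hypothetical helper name.

```python
import pandas as pd

def inside_iqr(series, k=1.5):
    """Boolean mask: True where the value lies strictly inside the IQR limits."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series > q1 - k * iqr) & (series < q3 + k * iqr)

# Hypothetical data: row 7 has an extreme height, row 6 an extreme weight
df = pd.DataFrame({
    "height": [60, 62, 64, 66, 68, 70, 72, 95],
    "weight": [120, 130, 140, 150, 160, 170, 400, 180],
})

# A row is kept only if it passes both checks
mask = inside_iqr(df["height"]) & inside_iqr(df["weight"])
clean = df[mask]

print(clean.shape)
```

Computing both masks on the original DataFrame, as done here, means the limits for one column are not affected by rows already removed for the other.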


---
### **Using Z-Score**

##### Formula for Z-score: (Observation - Mean) / Standard Deviation
##### $z = (X - \mu) / \sigma$
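
A minimal sketch of how this formula might be applied to flag outliers follows. The cutoff of 3 is a common convention and an assumption here, as are the sample values; `ddof=0` treats σ as the population standard deviation, whereas pandas defaults to `ddof=1`.

```python
import pandas as pd

# Made-up sample: twenty values near 50 plus one extreme value
s = pd.Series([48, 49, 50, 51, 52] * 4 + [200])

# z = (X - mean) / standard deviation
z = (s - s.mean()) / s.std(ddof=0)

threshold = 3  # assumed cutoff, not specified in the text above
outliers = s[z.abs() > threshold]
kept = s[z.abs() <= threshold]

print(outliers.tolist())
print(len(kept))
```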