![](https://lh3.googleusercontent.com/proxy/ANoOGq2MC7o4P-a1oBAsw5PbIVIS5u0WL8F_0YWVbgruR2JWARjYs6TCSxDX_EvWFAbwq_B-w1hVSsxxLjCeSGmi6UHafxxjTWn0NMiTOa_WDwGhgXHsAVjghM9BFOWwGTQjq-2TzkdX2QM)

### Summary of this notebook :
- About the dataset :

Open Exoplanet Catalogue Tables
This repository contains simple ASCII tables that are generated from the Open Exoplanet Catalogue. The Open Exoplanet Catalogue is a database of all discovered extra-solar planets. New planets are usually added within 24 hours of their announcement.

- In this notebook I will explain the following, using the dataset described above:

1- Find Outliers using visualizations : (Box plots, Scatter plots, Data distribution)

2- Find Outliers using numerical and statistical methods
____________________________________________________________________________________


- I will answer questions like:


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [None]:
df= pd.read_csv('../input/open-exoplanet-catalogue/oec.csv')
df.head()


In [None]:
df.columns

In [None]:
df_planet= df[['PeriodDays','DistFromSunParsec',
       'HostStarMassSlrMass', 'HostStarRadiusSlrRad', 'HostStarMetallicity',
       'HostStarTempK']]
df_planet.head()

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_planet.isnull(),yticklabels=False ,cbar=False)
# Heatmap of missing entries: highlighted cells mark the NaN values in the dataset


In [None]:
df_planet = df_planet.fillna(value=df_planet.mean())  # fill NaN values with each column's mean
plt.figure(figsize=(20,10))
sns.heatmap(df_planet.isnull(), yticklabels=False, cbar=False)  # check that no NaN values remain

### Find Outliers using visualizations

A useful first step in finding outliers is to visualize the data, which gives an initial look at these peculiar data points. Some of the techniques used are:
 - Box plots
 - Scatter plots
 - Data distribution

### Boxplot

The first approach we will look at is the boxplot.

A boxplot visualizes data using a five-number summary:
 - Median
 - Q1
 - Q3
 - Minimum
 - Maximum

 <img src=https://www.simplypsychology.org/boxplot.jpg width="400">
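To make the five-number summary concrete, here is a minimal sketch in plain Python that computes it for a small list. The sample values and the helper name `five_number_summary` are made up for illustration and are not part of the exoplanet dataset. The `inclusive` method of `statistics.quantiles` is used because it interpolates linearly between data points, which should match the default behaviour of `DataFrame.quantile`.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return minimum, Q1, median, Q3 and maximum of a list of numbers."""
    # n=4 splits the data into quartiles and returns the three cut points
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "Q1": q1,
        "median": median,
        "Q3": q3,
        "max": max(values),
    }

# Hypothetical sample with one value (50) far from the rest
sample = [2, 4, 4, 5, 6, 7, 8, 9, 50]
summary = five_number_summary(sample)
print(summary)
```

In this toy sample the maximum sits far above Q3, which is exactly the kind of point a boxplot draws as an isolated marker beyond the whisker.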

In [None]:
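# One figure per feature, so each boxplot gets its own axis scale
fig1, ax1 = plt.subplots(figsize=(8,8))
fig2, ax2 = plt.subplots(figsize=(8,8))
fig3, ax3 = plt.subplots(figsize=(8,8))

# Pass each column explicitly as x= (positional arguments behave differently across seaborn versions)
sns.boxplot(x=df_planet['PeriodDays'], ax=ax1, linewidth=2.5)
sns.boxplot(x=df_planet['DistFromSunParsec'], ax=ax2, linewidth=2.5)
sns.boxplot(x=df_planet['HostStarMetallicity'], ax=ax3, linewidth=2.5)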



#### The outliers are clearly visible here as the points drawn beyond the whiskers!

### Scatter plots
The next technique for visualizing the data and looking for outliers is the loyal fellow, the scatter plot. We plot the points and look for those that lie far from the main cluster or do not follow the overall pattern.

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(x = df_planet['HostStarMassSlrMass'], y =df_planet['HostStarTempK'])
plt.show()


In [None]:
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(x = df_planet['HostStarMassSlrMass'], y =df_planet['HostStarRadiusSlrRad'])
plt.show()


## Find Outliers numerically

As the visualizations show, we can see that outliers exist, but we cannot tell exactly which data points they are just by looking at the plots.

Additionally, if a dataframe has dozens of features and many records, visualizing each one becomes really tedious.


Thus, some of the ways to find outliers numerically that we are going to explore are:
 - Z-score
 - IQR score

### Z-score

First we need to standardize the dataset by subtracting each column's mean $\mu$ and dividing by its standard deviation $\sigma$:
\begin{equation*}
Normalized = \frac{x - \mu}{\sigma}
\end{equation*}
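As a worked example of the formula, the sketch below standardizes a small list by hand. The sample values and the helper name `z_scores` are hypothetical. It uses the population standard deviation (`pstdev`), which corresponds to the `ddof=0` default of `scipy.stats.zscore`.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values: subtract the mean, divide by the population std."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (ddof=0)
    return [(x - mu) / sigma for x in values]

# Hypothetical sample with mean 5 and population standard deviation 2
sample = [2, 4, 4, 4, 5, 5, 7, 9]
scores = z_scores(sample)
print(scores)
```

A value of 9 in this sample ends up two standard deviations above the mean, so it would be flagged by a threshold of 1.5 but not by a threshold of 3.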


In [None]:
df_planet.hist(figsize= (15, 15));

In [None]:
z_scores = stats.zscore(df_planet)

planet_stand = pd.DataFrame(data = z_scores, columns = df_planet.columns)
planet_stand.head()

In [None]:
planet_stand.hist(figsize= (15, 15));

In [None]:
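threshold = 4
# Flag every value whose z-score lies more than `threshold` standard deviations from the mean
mask = (planet_stand < -threshold) | (planet_stand > threshold)
display(np.where(mask))  # (row indices, column indices) of the flagged values
mask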

### IQR (Inter-Quartile Range)


In descriptive statistics, the interquartile range (IQR), also called the midspread, middle 50%, or H‑spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles.
<img src=https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Boxplot_vs_PDF.svg/1200px-Boxplot_vs_PDF.svg.png width="500">



In [None]:
Q1 = df_planet.quantile(0.25)
Q3 = df_planet.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

##### So from the diagram above, we can separate the typical values from the outliers with the following rule:
##### a value is not an outlier if it lies between (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR)
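The 1.5 * IQR rule can be sketched on a small list in plain Python. The sample values and the helper name `iqr_outliers` are assumptions made for illustration. The multiplier `k=1.5` is the conventional choice, not something dictated by the dataset.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the lower fence, upper fence and the values outside them."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [x for x in values if x < lower or x > upper]
    return lower, upper, outliers

# Hypothetical sample: Q1 = 4 and Q3 = 8, so IQR = 4
sample = [2, 4, 4, 5, 6, 7, 8, 9, 50]
lower, upper, outliers = iqr_outliers(sample)
print(lower, upper, outliers)
```

Here the fences are -2 and 14, so only the value 50 is reported as an outlier.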

In [None]:
# Keep values inside the fences; values outside them (the outliers) are replaced by NaN
df_planet[(df_planet > (Q1 - 1.5 * IQR)) & (df_planet < (Q3 + 1.5 * IQR))]