#### I. DETECTING OUTLIERS

**1. Plot a histogram and density plot**
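```python
import seaborn as sns
sns.histplot(data['X'], bins=20, kde=True)  # bins: number of bars; kde=True overlays the density curve (histplot replaces the deprecated distplot)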
```

**2. Plot a boxplot**
```python
sns.boxplot(data=data)  # points drawn beyond the whiskers are potential outliers
```

**3. Interquartile Range**
```python
import numpy as np

q25, q50, q75 = np.percentile(data['X'], [25, 50, 75])
iqr = q75 - q25

# Limits beyond which a point is treated as an outlier (1.5 is a common choice)
low_limit = q25 - 1.5 * iqr
high_limit = q75 + 1.5 * iqr

# Identify the outlying points
outliers = [x for x in data['X'] if x > high_limit or x < low_limit]
```
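
The same rule can be wrapped in a small helper that returns a boolean mask, which is often handier than a list of values. This is only a minimal sketch: the function name `iqr_outlier_mask` and the sample values are made up for illustration, and `k=1.5` is just the conventional multiplier.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean Series marking points outside [q25 - k*iqr, q75 + k*iqr]."""
    s = pd.Series(values, dtype=float)
    q25, q75 = np.percentile(s, [25, 75])
    iqr = q75 - q25
    return (s < q25 - k * iqr) | (s > q75 + k * iqr)


# Hypothetical sample: one value sits far from the rest
x = pd.Series([10, 12, 11, 13, 12, 11, 95])
mask = iqr_outlier_mask(x)
outliers = x[mask]
print(outliers.tolist())
```

The mask can also be inverted (`x[~mask]`) to keep only the points inside the limits.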

**4. Residuals**
> **Approach:**
> - _Standardized_: the residual divided by its standard error.
> - _Deleted_: the residual for an observation when the model is fit on all data excluding that observation.
> - _Studentized_: the deleted residual divided by its estimated standard error.
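
The three residual types above can be computed directly from an ordinary least squares fit. The following is a minimal NumPy sketch, not a definitive implementation: the function name `ols_residual_diagnostics` and the toy data are invented for illustration, and it assumes a linear model with an intercept, with the standardized residual using the leverage-adjusted standard error.

```python
import numpy as np


def ols_residual_diagnostics(X, y):
    """Fit OLS with an intercept and return several residual types."""
    X = np.column_stack([np.ones(len(y)), X])      # add intercept column
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta                           # raw residuals

    # Leverage h_i: diagonal of the hat matrix X (X'X)^-1 X'
    h = np.einsum('ij,ji->i', X, np.linalg.pinv(X))

    s2 = resid @ resid / (n - p)                   # residual variance
    standardized = resid / np.sqrt(s2 * (1 - h))

    # Deleted residual: y_i minus the prediction from a fit without point i
    deleted = resid / (1 - h)

    # Residual variance with point i left out, then studentize
    s2_del = ((n - p) * s2 - resid**2 / (1 - h)) / (n - p - 1)
    studentized_deleted = resid / np.sqrt(s2_del * (1 - h))

    return {"raw": resid, "standardized": standardized,
            "deleted": deleted, "studentized_deleted": studentized_deleted}


# Toy data (assumed): a noisy line with the last point pushed upward
x = np.linspace(0, 10, 20)
y = 2 + 3 * x + 0.5 * np.sin(3 * x)
y[-1] += 10

diag = ols_residual_diagnostics(x, y)
print(int(np.argmax(np.abs(diag["studentized_deleted"]))))
```

The closed-form expressions avoid refitting the model once per observation; the leave-one-out quantities follow from the leverage values alone.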


#### II. EDA

**1. Sampling from DataFrames**
- For large datasets, working on a random sample makes computation faster.
- Random sampling is also used to split the data into Train/Dev/Test sets.
```python
# Take 5 random rows; replace=False ensures no row is drawn twice
sample = data.sample(n=5, replace=False)
print(sample.iloc[:, -3:])  # show only the last 3 columns
```
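
A Train/Dev/Test split can be built on the same `sample` method by shuffling once and slicing. This is a sketch under assumptions: the toy DataFrame, the 70/15/15 ratios, and `random_state=42` are illustrative choices, not values from these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with 100 rows
data = pd.DataFrame({"X": np.arange(100), "y": np.arange(100) % 2})

# Shuffle all rows once (frac=1), then cut into three disjoint slices
shuffled = data.sample(frac=1, random_state=42)
n_train = round(0.70 * len(shuffled))
n_dev = round(0.15 * len(shuffled))

train = shuffled.iloc[:n_train]
dev = shuffled.iloc[n_train:n_train + n_dev]
test = shuffled.iloc[n_train + n_dev:]

print(len(train), len(dev), len(test))
```

Slicing one shuffled copy guarantees that no row lands in more than one set, which separate calls to `sample` would not.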

**2. Visualization**

**_a. Seaborn approach_**
```python
sns.boxplot(data=data)  # one box per numeric column of the DataFrame
```
**_b. Matplotlib approach_**
```python
# In a Jupyter notebook, render plots inline
%matplotlib inline
import matplotlib.pyplot as plt

# ls='' (line style) draws no connecting line and marker='o' draws dots, giving a scatter plot
plt.plot(data['X1'], data['X2'], ls='', marker='o', label='feature1-2')
# A second call draws on the same axes in a different color
plt.plot(data['X3'], data['X4'], ls='', marker='o', label='feature3-4')
plt.legend()  # show the labels
```
**_c. Histogram_**
```python
plt.hist(data['X'], bins=25)
```

**_d. Customizing Plots_**
```python
fig, ax = plt.subplots()  # fig: the whole figure; ax: the axes (the actual plotting area)
ax.set_yticks([0, 1, 2])                    # tick positions (example values)
ax.set_yticklabels(['low', 'mid', 'high'])  # one label per tick (example values)
ax.set(xlabel='xlabel', ylabel='ylabel', title='Title')
```
>**By group**
```python
# Mean of each feature per species; the outer parentheses let the chain span lines
(data.groupby('species').mean()
     .plot(color=['red', 'blue', 'black', 'green'],  # one color per feature column
           fontsize=10.0, figsize=(4, 4)))
```

**_e. Pair plot for Features_**
- Pair plot:
```python
sns.pairplot(data, hue='species', height=3)  # hue colors the points by species; height replaces the old size argument
```
- Hexbin Plot:
```python
sns.jointplot(x=data['X'], y=data['Y'], kind='hex')
```
- Facet Grid:
```python
# One histogram panel per species
plot = sns.FacetGrid(data, col='species',
                     margin_titles=True)
plot.map(plt.hist, 'X1', color='green')
```
