<a href="https://colab.research.google.com/github/DLPY/Unsupervised-Learning-Session-2/blob/main/Local_Outlier_Factor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Statistical Approach to Outlier Detection: Box Plots

In [None]:
import seaborn as sns
import random
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/DLPY/Unsupervised-Learning-Session-2/main/bikerental.csv')

- **instant**: record index
- **dteday** : date
- **season** : season (1:winter, 2:spring, 3:summer, 4:fall)
- **yr** : year (0: 2011, 1:2012)
- **mnth** : month ( 1 to 12)
- **hr** : hour (0 to 23)
- **holiday** : whether the day is a holiday or not
- **weekday** : day of the week
- **workingday** : 1 if the day is neither a weekend nor a holiday, otherwise 0
- **weathersit** : weather situation
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- **temp** : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
- **atemp**: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
- **hum**: Normalized humidity. The values are divided by 100 (max)
- **windspeed**: Normalized wind speed. The values are divided by 67 (max)
- **casual**: count of casual users
- **registered**: count of registered users
- **cnt**: count of total rental bikes including both casual and registered
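
The min-max formula in the column descriptions above can be inverted to recover approximate physical units. The cell below is a minimal sketch: the helper name `denormalize` and the sample value 0.5 are illustrative assumptions, while the constants come from the list above.

```python
def denormalize(value, v_min, v_max):
    """Invert min-max scaling: value = (t - v_min) / (v_max - v_min)."""
    return value * (v_max - v_min) + v_min

# Constants taken from the column descriptions; 0.5 is an arbitrary sample value
temp_c = denormalize(0.5, -8, 39)      # normalized temp -> degrees Celsius
atemp_c = denormalize(0.5, -16, 50)    # normalized feeling temp -> degrees Celsius
humidity_pct = 0.5 * 100               # hum was divided by 100
wind = 0.5 * 67                        # windspeed was divided by 67

print(temp_c, atemp_c, humidity_pct, wind)
```

Applied to a whole column, the same arithmetic works on a pandas Series, for example `denormalize(df.temp, -8, 39)`.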

**Display descriptive statistics for the dataframe**

In [None]:
df.describe()

**Display first 5 rows of the dataframe**

In [None]:
df.head()

**Generate the box plots for the attributes**: 
*   **windspeed**: wind speed
*   **hum**: relative humidity
*   **casual**: number of non-registered user rentals generated


In [None]:
# Set the style before creating the axes so that it applies to them
sns.set_style("darkgrid")
sns.set_palette("bone_r")
f, axes = plt.subplots(nrows=1, ncols=3, figsize=(20, 8))

# One box plot per attribute, each on its own axis
fig1 = sns.boxplot(y=df.windspeed, ax=axes[0])
fig2 = sns.boxplot(y=df.hum, ax=axes[1])
fig3 = sns.boxplot(y=df.casual, ax=axes[2])
plt.show()

**For the attribute windspeed, calculate the value of the lower quartile (Q1)**

In [None]:
df.windspeed.quantile(.25)

**For windspeed, calculate the interquartile range (Q3 - Q1)**

In [None]:
IQR = df.windspeed.quantile(.75) - df.windspeed.quantile(.25)
IQR

**Calculate the whisker length (1.5 * IQR) for the attribute windspeed**

In [None]:
whisker  = (df.windspeed.quantile(.75) - df.windspeed.quantile(.25)) * 1.5
whisker

**Calculate the lower and upper range for windspeed, beyond which any point is classified as an outlier**

In [None]:
lower_range = df.windspeed.quantile(.25) - whisker 
upper_range = df.windspeed.quantile(.75) + whisker 
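
To make the arithmetic concrete, here is a minimal sketch of the same steps on a small toy series. The data and the names `values`, `lower_fence` and `upper_fence` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data: nine ordinary values and one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = values.quantile(0.25)   # lower quartile
q3 = values.quantile(0.75)   # upper quartile
iqr = q3 - q1                # interquartile range
whisker = 1.5 * iqr          # whisker length

lower_fence = q1 - whisker
upper_fence = q3 + whisker

# Keep only the values that fall outside the two fences
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, q3, lower_fence, upper_fence)
print(outliers.tolist())
```

With pandas' default linear interpolation, the quartiles here are 3.25 and 7.75, so the fences are -3.5 and 14.5 and only the value 100 is flagged.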

**Print the data points that lie beyond the lower and upper range: Outliers**

In [None]:
df.query('windspeed > @upper_range | windspeed < @lower_range' )

**Define a function** *findoutliers* **for finding out the outliers for other attributes as well**

In [None]:
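def findoutliers(col):
    # Whisker length is 1.5 times the interquartile range of the column
    whisker = (col.quantile(.75) - col.quantile(.25)) * 1.5
    lower_range = col.quantile(.25) - whisker
    upper_range = col.quantile(.75) + whisker
    # Select the rows of the global df whose value falls outside the range
    return df[(col > upper_range) | (col < lower_range)]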
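
As a variation, the helper can take the dataframe and the column name as arguments so that it does not depend on a global variable. This is only a sketch: the name `find_outliers`, the parameter `k` and the toy data are assumptions made for illustration.

```python
import pandas as pd

def find_outliers(frame, column, k=1.5):
    """Return the rows of `frame` whose `column` value lies outside the IQR fences."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    whisker = k * (q3 - q1)
    mask = (frame[column] < q1 - whisker) | (frame[column] > q3 + whisker)
    return frame[mask]

# Hypothetical toy data with one low humidity value and one high rental count
toy = pd.DataFrame({
    "hum": [0.40, 0.50, 0.55, 0.60, 0.65, 0.70, 0.0],
    "casual": [10, 12, 11, 13, 12, 300, 11],
})

hum_outliers = find_outliers(toy, "hum")
casual_outliers = find_outliers(toy, "casual")
print(hum_outliers)
print(casual_outliers)
```

On the bike rental data the equivalent call would be `find_outliers(df, "hum")`.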

**Print the outliers for the attribute hum**

In [None]:
findoutliers(df.hum)

**Print the outliers for the attribute casual**

In [None]:
cas = findoutliers(df.casual)

In [None]:
cas

**Analysing the Outliers for Casual Rental**

Generate box plots for Casual rental by season

In [None]:
sns.set_palette("Paired")
# Set the figure size before plotting so that it applies to this figure
plt.rcParams["figure.figsize"] = (9, 6.5)
ax = sns.boxplot(x="season", y="casual", data=df)
plt.title('Casual Rentals by Season')
plt.show()

Generate boxplots for casual rental by month

In [None]:
sns.set_palette("Paired")
# Set the figure size before plotting so that it applies to this figure
plt.rcParams["figure.figsize"] = (9, 6.5)
ax = sns.boxplot(x="mnth", y="casual", data=df)
plt.title('Casual Rentals by Month')
plt.show()

Generate boxplots for Casual rentals by weekday

In [None]:
sns.set_palette("Paired")
# Set the figure size before plotting so that it applies to this figure
plt.rcParams["figure.figsize"] = (9, 6.5)
ax = sns.boxplot(x="weekday", y="casual", data=df)
plt.title('Casual Rentals by Weekday')
plt.show()

In [None]:
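sns.set_palette("Paired")
# Set the figure size before plotting so that it applies to this figure
plt.rcParams["figure.figsize"] = (9, 6.5)
# Split each weekday's box by the holiday flag
ax = sns.boxplot(x="weekday", y="casual", hue='holiday', data=df)
plt.title('Casual Rentals by Weekday and Holiday')
plt.legend(loc='upper right')
plt.show()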
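
Each box in a grouped box plot is drawn from the quartiles of its own group, so the same value can be an outlier in one group and ordinary in another. The sketch below counts outliers per group on hypothetical toy data; the helper name `iqr_outlier_mask` and the column `is_outlier` are assumptions, not part of the dataset.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Boolean mask marking values outside the IQR fences of `series`."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    whisker = k * (q3 - q1)
    return (series < q1 - whisker) | (series > q3 + whisker)

# Hypothetical toy data: the value 40 appears in both seasons
toy = pd.DataFrame({
    "season": [1, 1, 1, 1, 1, 3, 3, 3, 3, 3],
    "casual": [2, 3, 4, 5, 40, 30, 35, 40, 45, 50],
})

# transform applies the mask to each season separately and keeps the row order
toy["is_outlier"] = (
    toy.groupby("season")["casual"].transform(iqr_outlier_mask).astype(bool)
)
counts = toy.groupby("season")["is_outlier"].sum()
print(counts)
```

Here 40 rentals is flagged in season 1, where the fences are 0 and 8, but not in season 3, where the fences are 20 and 60.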

# Local Outlier Factor

**Generate Train data and Outliers**

In [None]:
# Importing the libraries
import random
import pandas as pd

pct = [.01, .08, .12]
amounts = [1000, 2000, 3000]
rows = []

# Generate train data: the charge is about 4% of the amount, with up to 5% noise
for i in range(0, 1000):
    amount = random.choice(amounts) * random.uniform(.95, 1.05)
    bank_charge = amount * .04 * random.uniform(.95, 1.05)
    rows.append({'Amount': amount, 'Charge': bank_charge})

# Generate outliers: the charge rate is drawn from pct instead of the usual 4%
for i in range(0, 10):
    amount = random.choice(amounts) * random.uniform(.95, 1.05)
    bank_charge = amount * random.choice(pct) * random.uniform(.95, 1.05)
    rows.append({'Amount': amount, 'Charge': bank_charge})

# Build the DataFrame once; concatenating inside the loop is slow and repeats index 0
charges = pd.DataFrame(rows)

**Display first 10 rows**

In [None]:
charges.head(10)

**Display Bottom 10 rows**

In [None]:
charges.tail(10)

**Fit the model for outlier detection** 

In [None]:
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
# Select the number of neighbors
clf = LocalOutlierFactor(n_neighbors=30)
# Standardize both columns so that Amount and Charge contribute equally to distances
normalized_df = (charges - charges.mean()) / charges.std()

# fit_predict fits the model and returns the labels (1 = inlier, -1 = outlier)
clf.fit_predict(normalized_df)
# negative_outlier_factor_ holds the negated LOF scores: the more negative, the more anomalous
results = clf.negative_outlier_factor_
charges['LOF'] = results.tolist()
charges['PCT'] = charges['Charge'] / charges['Amount']
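
To see how the labels and scores behave, the following sketch fits the same estimator on a tiny toy dataset. The data (a tight grid plus one distant point) and the variable names are hypothetical, and `n_neighbors=5` is chosen only because the toy set is small.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical toy data: a tight 5x4 grid of points plus one distant point
grid = np.array([[x * 0.1, y * 0.1] for x in range(5) for y in range(4)])
X = np.vstack([grid, [[10.0, 10.0]]])

lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(X)             # 1 = inlier, -1 = outlier
scores = lof.negative_outlier_factor_   # near -1 for inliers, far below -1 for outliers

# The distant point is the last row of X
print(labels[-1], scores[-1])
print(int(np.argmin(scores)))
```

The distant point receives the label -1 and the most negative score, which is why thresholds such as `LOF < -1.5` can be used to pick out the strongest outliers.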

**Visualize the results on a Scatter Plot**

In [None]:
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (9, 6.5)
# Plot every data point; the marker size grows with the magnitude of the LOF score
plt.scatter(charges.Amount, charges.Charge, c='black', s=charges.LOF * -1, label="Data Points")
# Work on a copy of the scores and zero out points with LOF > -2 so they get no circle
lof = charges['LOF'].to_numpy().copy()
lof[lof > -2] = 0
# Scale the circle radius to [0, 1]: 0 for inliers, 1 for the most extreme outlier
radius = (lof.max() - lof) / (lof.max() - lof.min())
# Draw red circles around the outliers on the scatter plot
plt.scatter(charges.Amount, charges.Charge, s=500 * radius, edgecolors="r", facecolors="none", label="Outliers")
legend = plt.legend(loc="upper left")
# legend_handles (Matplotlib >= 3.7) replaces the older legendHandles attribute
legend.legend_handles[0].set_sizes([10])
legend.legend_handles[1].set_sizes([20])
plt.show()


**Display rows with LOF<-1.5**

In [None]:
charges.query('LOF < -1.5')