# Using Data Mining Techniques in the Real Estate Industry

**Author**: _Madalina-Alina Racovita, 1st year master's student in **Computational Optimization at the Faculty of Computer Science**, UAIC, Iasi, Romania_

![title](./Images/iris_outlier_graph.png)

<h1>Task 6 - Outliers Detection<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-dependencies-&amp;-environment-configuration" data-toc-modified-id="Import-dependencies-&amp;-environment-configuration-1">Import dependencies &amp; environment configuration</a></span></li><li><span><a href="#Load-dataframes" data-toc-modified-id="Load-dataframes-2">Load dataframes</a></span><ul class="toc-item"><li><span><a href="#Labeled-dataframe" data-toc-modified-id="Labeled-dataframe-2.1">Labeled dataframe</a></span></li><li><span><a href="#Real-estates-dataframe" data-toc-modified-id="Real-estates-dataframe-2.2">Real estates dataframe</a></span></li></ul></li><li><span><a href="#Univariate-analysis:-mean-+/--k*sd,-1.5IQR-rule" data-toc-modified-id="Univariate-analysis:-mean-+/--k*sd,-1.5IQR-rule-3">Univariate analysis: mean +/- k*sd, 1.5IQR rule</a></span></li><li><span><a href="#Multivariate-analysis" data-toc-modified-id="Multivariate-analysis-4">Multivariate analysis</a></span><ul class="toc-item"><li><span><a href="#Outlier-detection-using-Mahalanobis-distance" data-toc-modified-id="Outlier-detection-using-Mahalanobis-distance-4.1">Outlier detection using Mahalanobis distance</a></span></li><li><span><a href="#Local-outlier-factor" data-toc-modified-id="Local-outlier-factor-4.2">Local outlier factor</a></span></li><li><span><a href="#Projection-methods:-Stahel-Donoho-extensions-for-asymmetrical-distributions" data-toc-modified-id="Projection-methods:-Stahel-Donoho-extensions-for-asymmetrical-distributions-4.3">Projection methods: Stahel-Donoho extensions for asymmetrical distributions</a></span></li><li><span><a href="#Autoencoders" data-toc-modified-id="Autoencoders-4.4">Autoencoders</a></span></li><li><span><a href="#Outliers-detection-using-decession-trees" data-toc-modified-id="Outliers-detection-using-decession-trees-4.5">Outliers detection using decession trees</a></span></li></ul></li><li><span><a href="#References" 
data-toc-modified-id="References-5">References</a></span></li></ul></div>

## Import dependencies & environment configuration

In [14]:
import pandas as pd
import os
import matplotlib
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
import numpy as np
from scipy.io import loadmat

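# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
# Apply seaborn's default plot theme (the 'seaborn' matplotlib style name is deprecated in recent matplotlib releases)
sns.set_theme()

# Show every column and row, on a single line, without truncating cell contents
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', None)  # None replaces the deprecated -1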

## Load dataframes

In [7]:
os.listdir('./Data')

['seismic-bumps.arff']

### Labeled dataframe

The original **Wine dataset from the UCI Machine Learning Repository** is a multiclass classification dataset with 13 attributes and 3 classes. The data are the results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars; the analysis determined the quantities of 13 constituents found in each of the three types of wine. In the outlier detection version loaded here, classes 2 and 3 are used as inliers, and class 1 is downsampled to 10 instances that serve as outliers.

In [21]:
data = loadmat('./Data/wine.mat')
len(data['X'][0])

13

In [26]:
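# Build the dataframe directly from the feature matrix: one column per attribute
n_features = data['X'].shape[1]
df_wine = pd.DataFrame(data['X'], columns=['Feature' + str(i) for i in range(n_features)])
# loadmat returns the labels as a 2-D array; flatten it into a 1-D column (1 = outlier, 0 = inlier)
df_wine['Outlier'] = data['y'].ravel()
df_wine.head()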

| | Feature0 | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | Feature6 | Feature7 | Feature8 | Feature9 | Feature10 | Feature11 | Feature12 | Outlier |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13.29 | 1.97 | 2.68 | 16.8 | 102.0 | 3.0 | 3.23 | 0.31 | 1.66 | 6.0 | 1.07 | 2.84 | 1270.0 | 1 |
| 1 | 14.3 | 1.92 | 2.72 | 20.0 | 120.0 | 2.8 | 3.14 | 0.33 | 1.97 | 6.2 | 1.07 | 2.65 | 1280.0 | 1 |
| 2 | 13.68 | 1.83 | 2.36 | 17.2 | 104.0 | 2.42 | 2.69 | 0.42 | 1.97 | 3.84 | 1.23 | 2.87 | 990.0 | 1 |
| 3 | 14.06 | 2.15 | 2.61 | 17.6 | 121.0 | 2.6 | 2.51 | 0.31 | 1.25 | 5.05 | 1.06 | 3.58 | 1295.0 | 1 |
| 4 | 14.22 | 1.7 | 2.3 | 16.3 | 118.0 | 3.2 | 3.0 | 0.26 | 2.03 | 6.38 | 0.94 | 3.31 | 970.0 | 1 |


In [28]:
df_wine['Outlier'].value_counts()

0    119
1    10 
Name: Outlier, dtype: int64
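
With 10 outliers among 129 rows, the labeled class is a small minority. The share can be computed directly from the label column. The snippet below is a minimal sketch: `labels` is a hypothetical stand-in that only reproduces the counts printed above, so that it runs without `wine.mat`.

```python
import pandas as pd

# Hypothetical stand-in for df_wine['Outlier']: 119 inliers (0) and 10 outliers (1),
# matching the counts printed above.
labels = pd.Series([0] * 119 + [1] * 10, name='Outlier')

# Relative frequency of each class; the entry for label 1 is the outlier share.
class_share = labels.value_counts(normalize=True)
contamination = class_share[1]
print(round(contamination, 4))
```

The resulting share, 10/129 (roughly 7.75%), is the kind of value that can later serve as the expected proportion of outliers when a detector asks for one.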

### Real estates dataframe

## Univariate analysis: mean +/- k*sd, 1.5IQR rule
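
Both rules named in this heading judge a value against a summary of its own column. The first flags values farther than `k` standard deviations from the mean; the second flags values outside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]`. The snippet below is a minimal sketch of the two rules, not the analysis of this notebook: the function names, the toy `sample` and the choice `k=2.0` are assumptions made for illustration.

```python
import numpy as np

def sd_rule_outliers(values, k=3.0):
    """Flag values farther than k sample standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    mean, sd = values.mean(), values.std(ddof=1)
    return np.abs(values - mean) > k * sd

def iqr_rule_outliers(values, factor=1.5):
    """Flag values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - factor * iqr) | (values > q3 + factor * iqr)

# Toy sample (made up for illustration): one value lies far from the rest.
sample = np.array([10, 11, 12, 12, 13, 13, 14, 15, 16, 60])
sd_flags = sd_rule_outliers(sample, k=2.0)
iqr_flags = iqr_rule_outliers(sample)
print(sample[sd_flags], sample[iqr_flags])
```

On this toy sample both rules flag only the value 60. With `k=3.0` the mean and standard deviation rule no longer flags it, because the extreme value itself inflates the standard deviation; the quartile-based rule is less sensitive to that effect.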

## Multivariate analysis

### Outlier detection using Mahalanobis distance
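
The Mahalanobis distance of an observation `x` from the sample mean `mu` is `sqrt((x - mu)^T Sigma^{-1} (x - mu))`, where `Sigma` is the sample covariance matrix. Unlike a per-column rule, it accounts for correlations between features. The snippet below is a minimal sketch under stated assumptions: the function name, the toy data and the cutoff `2.5` are hypothetical and chosen only for illustration. A common alternative is to pick the cutoff from a chi-square quantile, assuming approximately normal data.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of every row of X from the sample mean."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates a near-singular covariance
    # Row-wise quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    d2 = np.einsum('ij,jk,ik->i', centered, cov_inv, centered)
    return np.sqrt(np.maximum(d2, 0.0))

# Toy data (made up): ten points close to the line y = x, plus one point (2, 8)
# whose coordinates are individually unremarkable but break the correlation.
x = np.arange(10, dtype=float)
y = x + np.where(x % 2 == 0, 0.5, -0.5)
X = np.vstack([np.column_stack([x, y]), [2.0, 8.0]])

distances = mahalanobis_distances(X)
cutoff = 2.5  # hypothetical threshold chosen for this toy example
flags = distances > cutoff
print(np.flatnonzero(flags))
```

Only the last row is flagged here, even though neither of its coordinates falls outside the range of the other points. This is the case a multivariate distance is meant to catch.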

### Local outlier factor
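
The local outlier factor (LOF) compares the local density around a point with the local density around its `k` nearest neighbors. Scores close to 1 indicate a density similar to that of the neighbors, and scores well above 1 indicate an isolated point. The snippet below is a minimal NumPy sketch of the standard definition, not the implementation used in this notebook: the function name, the toy grid and `k=3` are assumptions made for illustration, and the code assumes there are no duplicate points.

```python
import numpy as np

def local_outlier_factor(X, k):
    """LOF score of every row of X, using Euclidean distance and k neighbors."""
    X = np.asarray(X, dtype=float)
    # Pairwise distances; a point is never its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    # Indices of and distances to the k nearest neighbors of each point.
    nn_idx = np.argsort(dist, axis=1)[:, :k]
    nn_dist = np.take_along_axis(dist, nn_idx, axis=1)
    k_dist = nn_dist[:, -1]  # distance to the k-th nearest neighbor
    # Reachability distance from p to neighbor o: max(d(p, o), k-distance(o)).
    reach = np.maximum(nn_dist, k_dist[nn_idx])
    # Local reachability density: inverse of the mean reachability distance.
    lrd = 1.0 / reach.mean(axis=1)
    # LOF: mean density of the neighbors relative to the point's own density.
    return lrd[nn_idx].mean(axis=1) / lrd

# Toy data (made up): a regular 5 x 5 grid plus one isolated point at (10, 10).
grid = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
X = np.vstack([grid, [10.0, 10.0]])

scores = local_outlier_factor(X, k=3)
print(int(scores.argmax()))
```

The grid points all receive scores near 1, while the isolated point receives by far the largest score. Which cutoff separates "near 1" from "outlier" is a judgment call that depends on the data.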

### Projection methods: Stahel-Donoho extensions for asymmetrical distributions

### Autoencoders

### Outliers detection using decision trees

## References

1. **Wine dataset**: http://odds.cs.stonybrook.edu/wine-dataset/