# DIQ Project

Project ID: 49

Surname: Azank dos Santos

Name: Felipe

Person Code: 10919711

Dataset: ```house.csv```

Objective: Comparison between prediction results before and after the Outlier detection and handling

## 1 - Import necessary libraries

In [None]:
import pandas as pd # for dataframe manipulation
import numpy as np  # for array manipulation and math
from dirty_accuracy import injection #python code to inject the outliers

## 2 - Data Importation and common analysis

In [None]:
df_pandas = pd.read_csv("house.csv")

In [None]:
df_pandas.info();

At first glance, the dataset does not show any missing values.

In [None]:
# features overview
df_pandas.describe().T;

Using the describe function, it's possible to see that the columns of the dataset don't contain anything out of the ordinary (count-like columns have no negative values, and there are no extreme maximum or minimum values).

In [None]:
#for column in df_pandas.columns:
#   print(f"{column} unique values: {df_pandas[column].nunique()}, datatype: {df_pandas[column].dtype}")

Analysing more closely, some columns are stored as integers but should be treated as categorical features when doing feature engineering for the Machine Learning part, since their values do not express a quantity. This is the case for columns such as "OverallCond" and "YrSold".

However, since the objective is to study how performance changes with outlier removal, dummy variables won't be generated for these columns.

In [None]:
# example of the above features described
df_pandas['OverallCond'].unique()

## 3 - Outlier Injection

Considering that this is a classification problem, the target value should not be part of the injection process.

In [None]:
df_50, df_40, df_30, df_20, df_10 =  injection(
    df_pandas,
    seed= 72,
    name= 'house',
    name_class= 'SaleCondition'
)

In [None]:
df_50.head(10)

In [None]:
df_pandas.head(10)

A quick look at the dataframes shows that the outlier injection has succeeded and added unusual values.
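
The visual check can be complemented with a count of how many numeric cells actually differ between two frames. The sketch below is only illustrative: `count_changed_cells` is a hypothetical helper (it is not part of `dirty_accuracy`), the toy frames stand in for `df_pandas` and `df_50`, and it assumes both frames share the same index and columns.

```python
import pandas as pd

def count_changed_cells(original, modified):
    """Return the number of differing cells per numeric column (hypothetical helper)."""
    numeric_cols = original.select_dtypes(include="number").columns
    # Element-wise comparison; assumes identical index and column order
    changed = original[numeric_cols].ne(modified[numeric_cols])
    return changed.sum()

# Toy data standing in for the original and an injected dataset
clean = pd.DataFrame({"LotArea": [8450, 9600, 11250], "YrSold": [2008, 2007, 2008]})
dirty = clean.copy()
dirty.loc[1, "LotArea"] = 999999  # simulated injected outlier

changes = count_changed_cells(clean, dirty)
print(changes)
```

A per-column count like this makes it easier to confirm that the target column was left untouched.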

## 4 - Computing Machine Learning Model

The two selected machine learning models are: K-Nearest Neighbors (KNN) and Support-Vector Machine (SVM).

These models were chosen because KNN is known for handling outliers poorly: it is a non-parametric model that does not learn a function of the data, so it struggles to generalize to points that are very different from the training set. The SVM, on the other hand, is used here as a parametric model (linear kernel), which generalizes better but is still expected to show some change in performance. Studying both may help clarify the impact of outliers on each type of model.

Note: Tree-based models were not chosen since they (especially ensemble models) usually handle outliers well.

In [None]:
from sklearn.neighbors import KNeighborsClassifier           #KNN
from sklearn.svm import SVC                                  #support-vector-classifier

## Train_Test Split

Before preprocessing the features, the data must be split into training and test sets. Since all datasets have the same shape (rows and columns), the same split function and random seed can be applied to each of them.

In [None]:
from sklearn.model_selection import train_test_split

def split_data(df, test_percent=0.3):
    # feature columns (every column except the target, wherever it is located)
    X = df.drop(columns=['SaleCondition'])

    # target column
    y = df['SaleCondition']

    #train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_percent, random_state=72)

    return X_train, X_test, y_train, y_test, X, y

In [None]:
X_train, X_test, y_train, y_test, X, y = split_data(df_pandas)
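
The effect of reusing the seed can be checked on toy data. This is a minimal sketch, assuming two frames with the same number of rows (`df_a` and `df_b` are made-up stand-ins for `df_pandas` and one of the injected datasets): with the same `random_state`, `train_test_split` should select the same row positions in both.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Two toy frames with the same number of rows but different values
df_a = pd.DataFrame({"x": np.arange(10), "y": list("ababababab")})
df_b = pd.DataFrame({"x": np.arange(10) * 100, "y": list("ababababab")})

train_a, test_a = train_test_split(df_a, test_size=0.3, random_state=72)
train_b, test_b = train_test_split(df_b, test_size=0.3, random_state=72)

# True when both test sets contain exactly the same rows, in the same order
same_rows = test_a.index.equals(test_b.index)
print(same_rows)
```

Keeping the split fixed means that differences in the scores come from the injected values rather than from a different partition of the rows.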

### Feature Engineering

In order to work with all columns from the dataset, the categorical features need to be converted into a form the ML models can use. Since this is not the main goal of the project, only the following transformations are applied:

* One_Hot_Encoder for categorical columns: so that they can be fed to the models
* Standard_Scaler for numerical columns: KNN is distance sensitive, which makes standardization highly necessary

We apply them by creating a scikit-learn pipeline.

In [None]:
from sklearn.pipeline import make_pipeline #pipeline construction
from sklearn.preprocessing import StandardScaler, OneHotEncoder #preprocessing techniques
from sklearn.compose import ColumnTransformer #in order to split between numerical and categorical features

def build_pipelines(X):
    # x input is needed for gathering data dimensions

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), list(X.select_dtypes(include=['int','float']).columns)),
            ('cat', OneHotEncoder(handle_unknown='infrequent_if_exist'), list(X.select_dtypes(include=['object']).columns))
        ])

    pipe_1 = make_pipeline(preprocessor, KNeighborsClassifier(n_neighbors=5))
    pipe_2 = make_pipeline(preprocessor, SVC(kernel='linear', C=1))

    return pipe_1, pipe_2


In [None]:
pipe_knn, pipe_svm = build_pipelines(X)

pipe_knn
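
Because the pipeline standardizes the numerical columns before KNN, an outlier can also distort the scaling step itself. The sketch below uses made-up numbers (not values from `house.csv`) to show how a single extreme value inflates the standard deviation and compresses the remaining points after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

clean = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
dirty = np.vstack([clean, [[1000.0]]])  # same points plus one extreme value

clean_scaled = StandardScaler().fit_transform(clean)
dirty_scaled = StandardScaler().fit_transform(dirty)

# Spread of the five original points after scaling, with and without the outlier
spread_clean = clean_scaled.max() - clean_scaled.min()
spread_dirty = dirty_scaled[:5].max() - dirty_scaled[:5].min()
print(spread_clean, spread_dirty)
```

When the inliers are squeezed together like this, distances along that feature carry much less information for KNN.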

## Fitting and Evaluating

To evaluate the models, we use accuracy and precision. Precision is added so that we can detect any classification bias in our dataset.

In [None]:
def create_predictons(X_train, y_train, X_test, pipe_1, pipe_2):
    pipe_1.fit(X_train, y_train)
    pipe_2.fit(X_train, y_train)

    y_pred_1 = pipe_1.predict(X_test)
    y_pred_2 = pipe_2.predict(X_test)

    return y_pred_1, y_pred_2

In [None]:
y_pred_knn, y_pred_svm = create_predictons(X_train, y_train, X_test, pipe_knn, pipe_svm)

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score
import warnings

def evaluate(y_true, y_pred_1, y_pred_2, print_=False):
    warnings.filterwarnings('ignore')
    # warnings are silenced because this is a multi-class classification, which leads to some divisions by 0 when computing precision
    # these cases are treated as 0 when computing the macro average

    scores = []

    acc1 = accuracy_score(y_true, y_pred_1)
    prec1 = precision_score(y_true, y_pred_1, average='macro')
    scores.append((acc1, prec1))


    acc2 = accuracy_score(y_true, y_pred_2)
    prec2 = precision_score(y_true, y_pred_2, average='macro')
    scores.append((acc2, prec2))

    if print_:

        print(f"result pipeline 1 (KNN)")
        print(f"accuracy: {acc1}")
        print(f"precision: {prec1}\n")

        print(f"result pipeline 2 (SVM)")
        print(f"accuracy: {acc2}")
        print(f"precision: {prec2}")


    return scores

In [None]:
scores_df_original = evaluate(y_test, y_pred_knn, y_pred_svm, print_=True)
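
The reason for reporting macro precision next to accuracy can be seen in a small example. This is a sketch with toy labels (not predictions from the notebook): a classifier that always predicts the majority class still gets a high accuracy, while the macro-averaged precision drops because the class that is never predicted contributes 0.

```python
from sklearn.metrics import accuracy_score, precision_score

# Imbalanced toy labels: 8 samples of one class, 2 of another
y_true = ["Normal"] * 8 + ["Abnorml"] * 2
y_pred = ["Normal"] * 10  # always predicts the majority class

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 without emitting a warning
prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
print(acc, prec)
```

Passing `zero_division=0` is one possible alternative to silencing all warnings globally.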

## 5 - Applying to every outlier-injected dataset

Now that every function and process is defined, we can apply the pipeline and get the results for all the datasets injected with outliers.

In [None]:
from sklearn.neighbors import KNeighborsClassifier           #KNN
from sklearn.svm import SVC                                  #support-vector-classifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline                      #pipeline construction
from sklearn.preprocessing import StandardScaler, OneHotEncoder #preprocessing techniques
from sklearn.compose import ColumnTransformer                   #in order to split between numerical and categorical features
from sklearn.metrics import accuracy_score, precision_score
import warnings

def compute_model(df, test_percent=0.3):

    X_train, X_test, y_train, y_test, X, y = split_data(df, test_percent)
    pipe_knn, pipe_svm = build_pipelines(X)
    y_pred_knn, y_pred_svm = create_predictons(X_train, y_train, X_test, pipe_knn, pipe_svm)
    scores_result = evaluate(y_test, y_pred_knn, y_pred_svm)

    return scores_result

In [None]:
compute_model(df_pandas)

In [None]:
# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
names = ['original', '10% outliers', '20% outliers', '30% outliers', '40% outliers', '50% outliers']
datasets = [df_pandas, df_10, df_20, df_30, df_40, df_50]

rows = []
for name, df in zip(names, datasets):
    score = compute_model(df)
    rows.append({'Data_Frame': name, 'KNN_accuracy': score[0][0], 'KNN_precision_': score[0][1],
                 'SVM_accuracy': score[1][0], 'SVM_precision': score[1][1]})

results = pd.DataFrame(rows)
results

Analysing the results, it is possible to see that, apart from a few unusual cases, all evaluation metrics of the models decayed. One possible explanation for why the results did not show a larger decay is the high number of categorical features created by the One_Hot_Encoder step, which leads to the "curse of dimensionality" problem. By the end of the notebook, it is possible to see the impact on performance if only numerical features are considered.
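
The dimensionality effect mentioned above can be illustrated with random data. The sketch below is not computed on `house.csv`: `distance_contrast` is a hypothetical helper that measures how much the farthest point differs from the nearest one, relative to the nearest, for uniformly random points. As the number of dimensions grows, this contrast tends to shrink, so distance-based models see all points as similarly far away.

```python
import numpy as np

def distance_contrast(n_points, n_dims, seed=0):
    """Relative gap between the farthest and nearest neighbor of the first point."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_points, n_dims))
    # Euclidean distances from the first point to all the others
    d = np.linalg.norm(X[1:] - X[0], axis=1)
    return (d.max() - d.min()) / d.min()

low_dim = distance_contrast(500, 2)
high_dim = distance_contrast(500, 200)
print(low_dim, high_dim)
```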

## 6 - Outlier detection

To try to bring the model performance back to its original level, let's apply two outlier detection algorithms in order to correct the dataset:

* Local Outlier Factor (LOF)
* Connectivity-Based Outlier Factor (COF) Algorithm

### 6.1 - Local Outlier Factor (LOF)

LOF measures how isolated a given point is from its neighbors. This is a suitable measure here because the injected outliers lie outside the min-max range of each column. The implementation is based on the code developed by the professor during the exercise sessions.

In [None]:
from sklearn.neighbors import LocalOutlierFactor

def LOF_outlier_detection(df,contamination):
    # contamination only sets the threshold behind the labels returned by fit_predict;
    # the raw scores used below do not depend on it
    clf = LocalOutlierFactor(n_neighbors=4, contamination=contamination)

    clf.fit_predict(df.select_dtypes(include=['int','float']))

    LOF_scores = clf.negative_outlier_factor_

    # flag rows whose score is more than one standard deviation below the mean score
    threshold = np.mean(LOF_scores) - np.std(LOF_scores)
    outliers_index = df[LOF_scores < threshold].index

    return outliers_index
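
The thresholding rule used in the function can be tried on synthetic data. This is a minimal sketch, assuming a made-up 2D cluster with one obvious outlier appended as the last row; it uses the same `n_neighbors=4` and the same "mean minus one standard deviation" cut on `negative_outlier_factor_`.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# 50 points around the origin plus one far-away point (row 50)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), [[100.0, 100.0]]])

lof = LocalOutlierFactor(n_neighbors=4)
lof.fit_predict(X)
scores = lof.negative_outlier_factor_  # close to -1 for inliers, much lower for outliers

# Same rule as in the function above: one standard deviation below the mean score
threshold = scores.mean() - scores.std()
flagged = np.where(scores < threshold)[0]
print(flagged)
```

Since the cut depends on the mean and spread of the scores, a few very extreme points can raise the bar for milder outliers, which may be worth keeping in mind when reading the results.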

### 6.2 - Connectivity-Based Outlier Factor (COF) Algorithm

This technique is an improvement on the LOF algorithm: a similar measure of isolation is computed, and the values are then sorted and ranked to detect the outliers in the data. The goal of using this algorithm is to check whether it does a better job than its simpler version. Its implementation can be found in the ```pyod``` library.



In [None]:
from pyod.models.cof import COF

def COF_outlier_detection(df, contamination):
    X = df.select_dtypes(include=['int','float']).values
    clf = COF(n_neighbors=4, contamination=contamination)

    clf.fit(X)
    outliers = clf.predict(X)

    outlier_index = np.where(outliers==1)[0]

    return outlier_index

## 7 - Outlier Handling

Now that the functions to detect outliers are defined, we need to decide how to replace the detected values. Given the target column distribution (mostly one category), a simple approach is to treat the detected outliers as missing values and replace them with the mean of the feature they belong to (single imputation).

In [None]:
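def replace_outliers(df,contamination):
    # LOF returns index labels; COF returns row positions, converted to labels here
    outliers_index_1 = LOF_outlier_detection(df,contamination)
    outliers_index_2 = df.index[COF_outlier_detection(df,contamination)]

    df1 = df.copy()
    df2 = df.copy()

    for column in df.select_dtypes(include=['int','float']).columns:
        # cast to float so the mean can be stored in integer columns
        df1[column] = df1[column].astype(float)
        df2[column] = df2[column].astype(float)
        # .loc on the frame avoids chained assignment, which may leave the frame unchanged
        df1.loc[outliers_index_1, column] = df[column].drop(outliers_index_1).mean()
        df2.loc[outliers_index_2, column] = df[column].drop(outliers_index_2).mean()

    return df1, df2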
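
The single imputation step can be followed on a tiny example. This is a sketch with invented values: the column name is only illustrative and `outlier_rows` is assumed to come from one of the detectors. The mean is computed from the rows that were not flagged and then written into the flagged rows.

```python
import pandas as pd

df = pd.DataFrame({"GrLivArea": [1000.0, 1200.0, 1100.0, 90000.0]})
outlier_rows = [3]  # row labels assumed to have been flagged by a detector

# Mean computed only from the rows that were not flagged
clean_mean = df["GrLivArea"].drop(outlier_rows).mean()

fixed = df.copy()
fixed.loc[outlier_rows, "GrLivArea"] = clean_mean
print(fixed)
```

Excluding the flagged rows from the mean matters: including the extreme value would pull the replacement towards the outlier itself.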

## 8 - Evaluating after Handling

In [None]:
levels = ['10%', '20%', '30%', '40%', '50%']
datasets = [df_10, df_20, df_30, df_40, df_50]

# the same contamination is used for every dataset
# alternative: one value per injection level, e.g. [0.1, 0.2, 0.3, 0.4, 0.499]
contamination = 0.1

def score_row(name, score):
    # one row of the results table
    return {'Data_Frame': name, 'KNN_accuracy': score[0][0], 'KNN_precision_': score[0][1],
            'SVM_accuracy': score[1][0], 'SVM_precision': score[1][1]}

# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
rows = []
for level, df in zip(levels, datasets):
    # baseline: dataset with injected outliers, no handling
    rows.append(score_row(f'{level} outliers', compute_model(df)))

    # detect with LOF and COF, then replace the flagged values
    df_LOF, df_COF = replace_outliers(df, contamination=contamination)

    #computing with LOF substitution
    rows.append(score_row(f'{level} outliers_LOF_Fixed', compute_model(df_LOF)))

    #computing with COF substitution
    rows.append(score_row(f'{level} outliers_COF_Fixed', compute_model(df_COF)))

results = pd.DataFrame(rows)
results

## 9 - Comparisons and problems with high dimensionality

After evaluating the results, it is possible to see that, in general, the accuracy of the models increased, and in some cases the precision increased drastically. However, there are cases where the opposite happens, which suggests that some outlier values were not replaced properly.

One phenomenon that could explain the difficulty of isolating clear outliers is the curse of dimensionality: given the high dimensionality of the dataset (many columns), distances between points grow and become increasingly similar to one another, making it extremely hard to tell what is merely a large difference and what is indeed an outlier.

Two ways of dealing with this can be considered:

* 1 - Apply Principal Component Analysis (PCA) to reduce the number of dimensions: this weakens the curse of dimensionality, but the outlier measures get mixed across the original columns

* 2 - Apply the outlier detection procedures column-wise, so that the index of each problematic value is identified per column
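
The column-wise option could look like the sketch below. It swaps in a different, simpler technique than LOF or COF, namely the interquartile range (IQR) rule, purely for illustration; `iqr_outlier_mask`, the factor `k=1.5` and the toy values are assumptions and were not part of the experiments above.

```python
import pandas as pd

def iqr_outlier_mask(df, k=1.5):
    """Boolean mask of cells outside [Q1 - k*IQR, Q3 + k*IQR], per numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons against a Series align on column names
    return numeric.lt(q1 - k * iqr) | numeric.gt(q3 + k * iqr)

toy = pd.DataFrame({
    "LotArea": [8000, 8500, 9000, 9500, 10000, 500000],
    "OverallCond": [5, 6, 5, 7, 5, 6],
})
mask = iqr_outlier_mask(toy)
print(mask.sum())
```

Because each column is checked on its own, the mask identifies exactly which cell is suspicious, so only that value needs to be replaced rather than the whole row.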

## Conclusion

In the end, it is possible to see that the outlier handling was indeed helpful in improving the precision metric and had little effect on the accuracy of the models.