# Module 4: Anomaly Detection
## Practice: Outlier Reduction for Linear Regression
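In this session, we'll fit a `LinearRegression` model on the `boston` dataset included in `scikit-learn`. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells below assume an older release.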

Having already worked with this dataset,
you may remember it as a simple yet broadly representative linear regression problem.

## Getting started - imports

In [None]:
%matplotlib inline
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

np.random.seed(10)

## Loading dataset
The first order of business is to load the dataset.

In [None]:
boston = load_boston()
print(boston.DESCR)

In [None]:
boston.feature_names

In [None]:
type(boston.data)

Pull the columns from the dataset into the variables `X` (every feature column, excluding the target) and `y` (the target).

In [None]:
# Split into X and y sets   #P4001

X = boston.data
y = boston.target

# Print out some basic shape data on the arrays
print("X, y shape:", X.shape, y.shape)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)  #P4002

# verify split shapes and contents
print("X_train.shape: ", X_train.shape)
print("y_train.shape: ", y_train.shape)
print("X_test.shape: ", X_test.shape)
print("y_test.shape: ", y_test.shape)

Run cross-validation on a plain linear regression model.

In [None]:
naive_model = LinearRegression() #P4003
scores = cross_val_score(estimator=naive_model, X=X_train, y=y_train, cv=5)
print("Scores: ", scores)
print("Mean score (5 folds): ", np.mean(scores))
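To make the fold count and the metric explicit, the cross-validation step could be sketched as follows. This is a minimal sketch on synthetic data from `make_regression` (an assumption, standing in for the Boston features); the `KFold` settings and the `neg_mean_absolute_error` scorer are illustrative choices, not settings taken from this session.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 200 rows, 13 features
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 noise=10.0, random_state=10)

# Explicit 5-fold splitter, so the fold count is not left to the default
folds = KFold(n_splits=5, shuffle=True, random_state=10)

# With no scoring argument, a regressor is scored with R^2
r2_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=folds)

# Error metrics are negated by scikit-learn so that higher is always better;
# flip the sign to read them as plain mean absolute errors
mae_scores = -cross_val_score(LinearRegression(), X_demo, y_demo, cv=folds,
                              scoring="neg_mean_absolute_error")

print("R^2 per fold:", r2_scores)
print("MAE per fold:", mae_scores)
```

Reporting MAE here keeps the cross-validation numbers in the same units as the `mean_absolute_error` value computed on the test set.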

Fit this model on the training dataset.

In [None]:
# Fit a model normally (nothing new, no pipelining) #P4004
naive_model.fit(X_train, y_train)

Make predictions on the test set and score them with mean absolute error.

In [None]:
naive_predictions = naive_model.predict(X_test) #P4005
# print(X_test.shape, naive_predictions.shape)

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, naive_predictions)



## What methods are available to us for outlier reduction?
We could try `KMeans` or an `EllipticEnvelope` again, but we're going to explore a few more options. 

In [None]:
from sklearn.ensemble import IsolationForest

# Construct IsolationForest and fit it on the training features
# (it is unsupervised, so any y passed to fit is ignored)
iso_forest = IsolationForest(contamination=0.05).fit(X_train)

# Get labels from classifier and cull outliers #P4006
iso_outliers = iso_forest.predict(X_train)==-1
print(f"Num of outliers = {np.sum(iso_outliers)}")
X_iso = X_train[~iso_outliers]
y_iso = y_train[~iso_outliers]
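The labelling convention used above can be checked in isolation. The following is a minimal sketch on synthetic data (an assumption; the array sizes and the shifted cluster are made up for illustration) showing that `predict` returns `+1` for inliers and `-1` for outliers, and that `contamination` sets the share of training rows that falls below the decision threshold.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(10)

# Synthetic stand-in data (assumption): 190 typical rows plus 10 shifted rows
X_typical = rng.normal(loc=0.0, scale=1.0, size=(190, 5))
X_shifted = rng.normal(loc=8.0, scale=1.0, size=(10, 5))
X_demo = np.vstack([X_typical, X_shifted])

# contamination is the expected share of outliers in the training data
detector = IsolationForest(contamination=0.05, random_state=10).fit(X_demo)

labels = detector.predict(X_demo)             # +1 = inlier, -1 = outlier
scores = detector.decision_function(X_demo)   # negative scores are flagged
outlier_mask = labels == -1

print("Flagged rows:", int(outlier_mask.sum()), "of", len(X_demo))

# Boolean masking keeps only the rows labelled as inliers
X_kept = X_demo[~outlier_mask]
```

Because the threshold is derived from the training scores, roughly `contamination * n_samples` rows are flagged whether or not the data contains genuine anomalies, so the value is worth tuning rather than taking as given.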


In [None]:
# Fit a linear regression model without outliers
iso_model = LinearRegression()
iso_model.fit(X_iso, y_iso)

# Cross validate the new model
iso_scores = cross_val_score(estimator=iso_model, 
                             X=X_iso, y=y_iso)
print(iso_scores)
print("Mean CV score w/ IsolationForest:", np.mean(iso_scores))

iso_predictions = iso_model.predict(X_test)
mean_absolute_error(y_test, iso_predictions)

## Alternatives to IsolationForest: OneClassSVM
If culling outliers with `IsolationForest` does not clearly improve the test error, it's time to try something else.  
The code below will look very similar to the above, but using `OneClassSVM` in place of the `IsolationForest`:

In [None]:
from sklearn.svm import OneClassSVM

# Construct OneClassSVM (kernel='rbf') and fit it on the training features
# (it is unsupervised, so any y passed to fit is ignored)
svm = OneClassSVM(kernel='rbf').fit(X_train)

# Get labels from classifier and cull outliers #P4007
svm_outliers = svm.predict(X_train)==-1
print(f"Num of outliers = {np.sum(svm_outliers)}")
X_svm = X_train[~svm_outliers]
y_svm = y_train[~svm_outliers]
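How many rows `OneClassSVM` flags depends heavily on `nu` and on feature scaling. Per the scikit-learn documentation, `nu` is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, and it defaults to 0.5, so the default detector can flag a large share of the training rows. The sketch below compares the default against a standardized pipeline with a smaller `nu`; the synthetic data, the `StandardScaler` step, and `nu=0.05` are assumptions for illustration, not part of the original practice.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(10)

# Synthetic stand-in data (assumption): four features on very different scales
X_demo = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])

# Default settings (nu=0.5) on the raw features
default_svm = OneClassSVM(kernel="rbf").fit(X_demo)
n_default = int(np.sum(default_svm.predict(X_demo) == -1))

# Illustrative alternative: standardize first, then use a small nu
scaled_svm = make_pipeline(StandardScaler(),
                           OneClassSVM(kernel="rbf", nu=0.05)).fit(X_demo)
n_small_nu = int(np.sum(scaled_svm.predict(X_demo) == -1))

print("Flagged with default nu=0.5:", n_default, "of", len(X_demo))
print("Flagged with scaling, nu=0.05:", n_small_nu, "of", len(X_demo))
```

Checking the printed outlier count before refitting the regression is a quick way to notice when the detector has discarded far more training data than intended.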


In [None]:
# Fit a linear regression model without outliers

svm_model = LinearRegression().fit(X_svm, y_svm)

# Cross validate the new model
svm_scores = cross_val_score(estimator=svm_model, 
                             X=X_svm, y=y_svm)
print(svm_scores)
print("Mean CV score w/ OneClassSVM:", np.mean(svm_scores))

# Make predictions with the fitted model
svm_predictions = svm_model.predict(X_test)

mean_absolute_error(y_test, svm_predictions)


## Summary Analysis

Of the anomaly detection algorithms used, which gave the largest marginal improvement in performance over the naive model?

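## Additional Tasks
Vary the detector parameters (for example `contamination` for `IsolationForest`, or `kernel` and `nu` for `OneClassSVM`) and the performance measure used, then compare the results against the naive model.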
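One way to organize such an experiment is a small sweep that refits the regression for each detector setting and records the test error. This is a minimal sketch on synthetic data (an assumption, since it does not load the Boston data); the helper name `mae_after_culling` and the list of `contamination` values are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=13,
                                 noise=15.0, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.33, random_state=10)


def mae_after_culling(contamination):
    """Cull flagged training rows, refit, and return (rows kept, test MAE)."""
    detector = IsolationForest(contamination=contamination,
                               random_state=10).fit(X_tr)
    keep = detector.predict(X_tr) == 1
    model = LinearRegression().fit(X_tr[keep], y_tr[keep])
    return int(keep.sum()), mean_absolute_error(y_te, model.predict(X_te))


# Baseline: no rows removed
baseline_model = LinearRegression().fit(X_tr, y_tr)
baseline_mae = mean_absolute_error(y_te, baseline_model.predict(X_te))
print(f"baseline: rows={len(X_tr)}, MAE={baseline_mae:.3f}")

# Sweep a few illustrative contamination values
results = {}
for c in (0.01, 0.05, 0.10, 0.20):
    results[c] = mae_after_culling(c)
    print(f"contamination={c:.2f}: rows={results[c][0]}, MAE={results[c][1]:.3f}")
```

The test set is left untouched in every run, so the MAE values stay comparable across settings and against the baseline.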