In [None]:
import numpy as np, pandas as pd
import matplotlib.pyplot as plt, seaborn as sns
from sklearn.linear_model import LinearRegression

In [None]:
df = pd.read_csv("housing.data", sep=" +", engine="python", header=None, names=["CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"])
df

In [None]:
selected_data = df.loc[:, ["LSTAT", "MEDV"]]
selected_data

In [None]:
selected_data.shape[0]

Assumption: a data point is an outlier if it lies beyond the boxplot whiskers, i.e. more than 1.5 times the box length (interquartile range, IQR) below the first quartile or above the third quartile. The factor 1.5 is a matter of convention.

In [None]:
sns.boxplot(y = selected_data["LSTAT"])
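The whisker rule assumed above can be sketched by hand. This is a minimal illustration on made-up numbers, not on the housing data; the helper name `tukey_fences` and the array `toy` are hypothetical. It assumes the default whisker factor of 1.5 and NumPy's default linear interpolation for percentiles.

```python
import numpy as np

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up sample with one obviously extreme value.
toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

lower, upper = tukey_fences(toy)
# Points outside the fences are the ones a boxplot draws as individual markers.
outliers = toy[(toy < lower) | (toy > upper)]
print(lower, upper, outliers)
```

Note that the fences are measured from the quartiles (the box edges), not from the median.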

In [None]:
X = selected_data.loc[:, ["LSTAT"]].values.reshape(-1, 1)
X

In [None]:
y = selected_data.loc[:, ["MEDV"]].values.reshape(-1, 1)
y

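The baseline model fitted below scores about 0.54. Note that `score` returns R² (the coefficient of determination), not an accuracy.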

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lin_reg = make_pipeline(StandardScaler(), LinearRegression())
lin_reg.fit(X, y)
lin_reg.score(X, y)
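To make the meaning of `score` concrete, here is a small sketch that recomputes R² by hand as 1 - SS_res / SS_tot and compares it with the value returned by scikit-learn. The data is synthetic and the names `x_toy`, `y_toy`, `model` are hypothetical; it is not the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noisy linear relationship (illustration only).
rng = np.random.default_rng(0)
x_toy = rng.uniform(0, 10, size=(50, 1))
y_toy = 3.0 - 0.5 * x_toy[:, 0] + rng.normal(0, 1, size=50)

model = LinearRegression().fit(x_toy, y_toy)
pred = model.predict(x_toy)

ss_res = np.sum((y_toy - pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(x_toy, y_toy))  # the two values should agree
```

Because the score is computed on the same data the model was fitted on, it describes goodness of fit, not predictive performance on unseen data.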

METHOD I: Z-score
https://www.statisticshowto.com/probability-and-statistics/z-score/
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.zscore.html
The z-score is computed like the standardization done by StandardScaler: each value is shifted by the column mean and divided by the column standard deviation.

In [None]:
from scipy import stats
z_score = np.abs(stats.zscore(selected_data))
z_score
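As a sanity check of what `stats.zscore` computes, the sketch below reproduces it by hand on a made-up table. The DataFrame `toy` is hypothetical. It assumes the SciPy default `ddof=0` (population standard deviation), which is why `std(ddof=0)` is used on the pandas side, whose own default is `ddof=1`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up two-column table (illustration only).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 10.0],
    "b": [5.0, 5.5, 6.0, 6.5, 7.0],
})

# z = (value - column mean) / column standard deviation
manual = (toy - toy.mean()) / toy.std(ddof=0)

# SciPy works column-wise by default (axis=0).
scipy_z = stats.zscore(toy.to_numpy())

print(np.allclose(manual.to_numpy(), scipy_z))
```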

The threshold is a value chosen by the developer / data scientist.
It is also a matter of convention.
A threshold of 3 is the most common choice, and it is used here.
The higher the threshold, the more "tolerant" the z-score filter is of outliers.
The next cell keeps only the rows whose values all lie less than 3 standard deviations from the column mean.

In [None]:
threshold = 3
# (z_score < threshold) is an element-wise boolean mask.
# .all(axis=1) reduces it to one boolean per row: True only if every column is below the threshold.
selected_data_z = selected_data[(z_score < threshold).all(axis=1)]
selected_data_z.shape[0]
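The row-wise filtering step can be seen on a small made-up table. This is only a sketch: the DataFrame `toy` is hypothetical, and the z-scores are computed by hand with `ddof=0` instead of calling SciPy, so that the result is a plain DataFrame.

```python
import numpy as np
import pandas as pd

# 19 identical values plus one extreme value in "x"; "y" has no extreme values.
toy = pd.DataFrame({
    "x": [10.0] * 19 + [100.0],
    "y": np.arange(20, dtype=float),
})

z = np.abs((toy - toy.mean()) / toy.std(ddof=0))

# One boolean per row: True only if every column is within the threshold.
row_ok = (z < 3).all(axis=1)
kept = toy[row_ok]

print(len(toy), len(kept))
```

A row is dropped as soon as a single column exceeds the threshold, so adding more columns to the check can only remove more rows.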

In [None]:
sns.boxplot(y = selected_data_z["LSTAT"])

In [None]:
X = selected_data_z.loc[:, ["LSTAT"]].values.reshape(-1, 1)
X

In [None]:
y = selected_data_z.loc[:, ["MEDV"]].values.reshape(-1, 1)
y

With the z-score filter the score (R²) is about 0.56, so slightly better.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lin_reg_z = make_pipeline(StandardScaler(), LinearRegression())
lin_reg_z.fit(X, y)
lin_reg_z.score(X, y)

METHOD II: IQR method
https://www.thoughtco.com/what-is-the-interquartile-range-rule-3126244#:~:text=The%20interquartile%20range%20is%20calculated,is%20spread%20about%20the%20median.
We take the 1st and 3rd quartiles. Q3 - Q1 is the so-called interquartile range (IQR), which sets the "tolerance" for outliers. This rule is typically stricter than a z-score threshold of 3.

In [None]:
Q1 = selected_data_z.quantile(0.25)
Q3 = selected_data_z.quantile(0.75)
IQR = Q3 - Q1
print(Q1)
print()
print(Q3)
print()
print(IQR)

Analogous to the z-score condition. With the z-score the cutoff was 3 standard deviations from the mean; here it is 1.5 * IQR below Q1 or above Q3.

In [None]:
# Element-wise OR needs "|"; Python's "or" raises ValueError on DataFrames.
iqr_outlier = (selected_data_z < (Q1 - 1.5 * IQR)) | (selected_data_z > (Q3 + 1.5 * IQR))

This leaves even fewer rows: every row in which at least one value is marked True is removed as an outlier.

In [None]:
# Element-wise negation needs "~"; Python's "not" raises ValueError on a Series.
selected_data_iqr = selected_data_z[~iqr_outlier.any(axis=1)]
selected_data_iqr.shape[0]
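Boolean masks on a DataFrame are combined element-wise with `|` and negated with `~`; the Python keywords `or` and `not` try to reduce the whole object to a single truth value and fail. The sketch below shows this on a made-up table (the name `toy` and its values are hypothetical) and then applies the 1.5 * IQR rule.

```python
import pandas as pd

# Made-up table: "x" contains one extreme value, "y" contains none.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0],
    "y": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

below = toy < (q1 - 1.5 * iqr)
above = toy > (q3 + 1.5 * iqr)

# "or" asks for the truth value of a whole DataFrame, which is ambiguous.
try:
    below or above
    or_failed = False
except ValueError:
    or_failed = True

outlier_mask = below | above                  # element-wise OR
cleaned = toy[~outlier_mask.any(axis=1)]      # keep rows with no flagged value

print(or_failed, len(cleaned))
```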

In [None]:
sns.boxplot(y = selected_data_iqr["LSTAT"])

The result also improves, to a score (R²) of about 0.60.

In [None]:
lin_reg_iqr = make_pipeline(StandardScaler(), LinearRegression())
X = selected_data_iqr.loc[:, ["LSTAT"]].values.reshape(-1, 1)
y = selected_data_iqr.loc[:, ["MEDV"]].values.reshape(-1, 1)
lin_reg_iqr.fit(X, y)
lin_reg_iqr.score(X, y)