# CSC3831 Final Assessment - Part I: Data Engineering



In [1]:
import pandas as pd
import seaborn as sns
import missingno as msno
import matplotlib.pyplot as plt
import numpy as np

houses_corrupted = pd.read_csv('https://raw.githubusercontent.com/PaoloMissier/CSC3831-2021-22/main/IMPUTATION/TARGET-DATASETS/CORRUPTED/HOUSES/houses_0.1_MAR.csv', header=0)

# for reproducibility
SEED = 42

Above we've loaded in a corrupted version of a housing dataset. The anomalies need to be dealt with and missing values imputed.

### 1. Data Understanding [7]
- Perform ad hoc EDA to understand and describe what you see in the raw dataset
  - Include graphs, statistics, and written descriptions as appropriate
  - Any extra information about the data you can provide here is useful, think about performing an analysis (ED**A**), what would you find interesting or useful?
- Identify features with missing records, outlier records


Exploratory Data Analysis is the process of using querying and visualization techniques to examine the surface properties of acquired data. This includes:

- Simple statistical analysis
- Distribution of attributes
- Relationships between pairs or small numbers of attributes

First, we check the data itself: its shape and its first rows. Then we print tables of summary statistics and information about the columns. Finally, we plot some graphs to visualise the data and form initial hypotheses about it.


In [None]:

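# First rows, to get a feel for the columns and value ranges
print(houses_corrupted.head())
print()

# Summary statistics for every column
print(houses_corrupted.describe(include='all'))
print()

# Column dtypes and non-null counts (info() prints directly and returns None)
houses_corrupted.info()
print()

# Number of missing values per column
print(houses_corrupted.isnull().sum())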



In [None]:

msno.matrix(houses_corrupted)
houses_corrupted.hist(bins=50, figsize=(20, 15))
plt.show()


## Results

### 1. Data Understanding
We can see that the Unnamed: 0 column is not necessary, so we are going to drop it. There are also missing values in the median_income, housing_median_age and population columns.
All columns are numerical, which will help later on because dealing with categorical data can require additional steps.
The summary also gives us information about the distributions, in particular the quartiles, the mean and the standard deviation.
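
The missing-value counts are easier to compare as percentages of the dataset. The sketch below shows one way to build such a table. The helper name `missing_summary` and the toy frame are assumptions made for illustration; the values are made up and are not taken from the housing data.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(df)).round(2),
    })
    # Keep only the columns that actually have gaps, largest first
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy frame standing in for houses_corrupted (made-up values)
toy = pd.DataFrame({
    "median_income": [8.3, np.nan, 7.2, 5.6],
    "population": [322.0, 2401.0, np.nan, np.nan],
    "households": [126.0, 1138.0, 177.0, 219.0],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, the same function could be called as `missing_summary(houses_corrupted)`.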

### 2. Plots

The missingno matrix shows the missing data in three columns, and the histograms show the distribution of the data in each column.
Moreover, some outliers are visible in the data. We are going to deal with them in the next steps, especially because:

> Outliers can skew a dataset and influence our measure of centre
> and spread. Depending on skew use the metrics


Based on these findings, we first clean the dataset by removing the useless column.

In [None]:
# errors="ignore" makes the cell safe to re-run once the column is already dropped
houses_corrupted.drop(columns=["Unnamed: 0"], errors="ignore", inplace=True)

sns.histplot(houses_corrupted['median_house_value'], kde=True)

There are no string features, so we can skip the encoding step and do not have to worry about categorical data. We will focus only on the numerical data, and we do not need to look for malformed strings either.


Then, we are going to make a pairplot to look for any relationship between the features. We are going to use the seaborn library to make the pairplot. It is also going to be useful to describe the distribution of the data, the skewness, and the outliers.

> Def:
> Pairs plots are a series of scatter plots of every attribute present in the data.
> - Useful for an initial look at relationships in the data
> - Visually overwhelming for final reports
> - Utilise to detect interesting relationships to analyse further
>
> Source: 1_Introduction_to_Data_Science.pdf

In [None]:
sns.pairplot(houses_corrupted)
print(houses_corrupted.corr())

## Results :

We can see that there is a strong correlation between total_bedrooms and households. Some other features are also strongly correlated with each other, as we can see on this pairplot.
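
Reading the strongest relationships off a large correlation table is error-prone, so it can help to list the top pairs directly. The following is a minimal sketch under assumptions: `top_correlated_pairs` is a hypothetical helper and the toy data is synthetic, built so that two columns are strongly related.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = df.corr().abs()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.stack().dropna().sort_values(ascending=False).head(n)

# Synthetic stand-in for the housing features
rng = np.random.default_rng(42)
households = rng.normal(500, 100, size=300)
toy = pd.DataFrame({
    "households": households,
    "total_bedrooms": 1.1 * households + rng.normal(0, 10, size=300),
    "housing_median_age": rng.normal(30, 10, size=300),
})
pairs = top_correlated_pairs(toy, n=3)
print(pairs)
```

On the real data, `top_correlated_pairs(houses_corrupted)` would give the same kind of ranking.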

On the diagonal, we can see the distribution of each feature. Most features are not normally distributed: they are positively skewed, as in figure 2.1 of 1_Introduction_to_Data_Science.pdf, and they contain some outliers. This will be useful to know for the next steps.
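
The skew seen on the diagonal can also be quantified with `DataFrame.skew()`, where values well above 0 indicate a long right tail. The sketch below uses synthetic columns (a log-normal one and a normal one) as stand-ins; the column names are only illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "population": rng.lognormal(mean=7.0, sigma=0.7, size=2000),  # long right tail
    "latitude": rng.normal(35.6, 2.1, size=2000),                 # roughly symmetric
})

# Sample skewness per column: about 0 for symmetric data, positive for right-skewed data
skewness = toy.skew()
print(skewness.sort_values(ascending=False))
```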

### 3. Imputation [10]
- Identify which features should be imputed and which should be removed
  - Provide a written rationale for this decision
- Impute the missing records using KNN imputation
- Impute the missing records using MICE imputation
- Compare both imputed datasets feature distributions against each other and the non-imputed data
- Build a regressor on all three datasets
  - Use regression models to predict house median price
  - Compare regressors of non-imputed data against imputed data
  - **Note**: If you're struggling to compare against the original dataset focus on comparing the two imputed datasets against each other


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
# IterativeImputer is experimental: this enabling import must run before it is imported
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Load your dataset

# Identify features to impute or remove
missing_threshold = 0.3
features_to_impute = []
features_to_remove = []

for column in houses_corrupted.columns:
    missing_percentage = houses_corrupted[column].isnull().sum() / houses_corrupted.shape[0]
    if missing_percentage > missing_threshold:
        features_to_remove.append(column)
    elif missing_percentage > 0:
        features_to_impute.append(column)

print("Features to impute:", features_to_impute)
print("Features to remove:", features_to_remove)

houses_corrupted = houses_corrupted.drop(columns=features_to_remove)

X = houses_corrupted.drop(columns=['median_house_value'])
y = houses_corrupted['median_house_value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

knn_imputer = KNNImputer(n_neighbors=5)
X_train_knn_imputed = knn_imputer.fit_transform(X_train)
X_test_knn_imputed = knn_imputer.transform(X_test)

mice_imputer = IterativeImputer(random_state=SEED)
X_train_mice_imputed = mice_imputer.fit_transform(X_train)
X_test_mice_imputed = mice_imputer.transform(X_test)

def plot_feature_distributions(original, knn_imputed, mice_imputed, feature_names):
    # squeeze=False keeps axes two-dimensional even when only one feature is plotted
    fig, axes = plt.subplots(len(feature_names), 3, figsize=(15, len(feature_names) * 5), squeeze=False)
    for i, feature in enumerate(feature_names):
        # The imputers return NumPy arrays, so look up the feature's column position
        # in the original frame instead of reusing the loop index
        col = original.columns.get_loc(feature)
        sns.histplot(original[feature], ax=axes[i, 0], kde=True, color='blue')
        axes[i, 0].set_title(f'Original {feature}')
        sns.histplot(knn_imputed[:, col], ax=axes[i, 1], kde=True, color='green')
        axes[i, 1].set_title(f'KNN Imputed {feature}')
        sns.histplot(mice_imputed[:, col], ax=axes[i, 2], kde=True, color='red')
        axes[i, 2].set_title(f'MICE Imputed {feature}')
    plt.tight_layout()
    plt.show()

plot_feature_distributions(X_train, X_train_knn_imputed, X_train_mice_imputed, features_to_impute)

def build_and_evaluate_regressor(X_train, X_test, y_train, y_test):
    # regressor = DecisionTreeRegressor(random_state=SEED)
    # regressor = RandomForestRegressor(random_state=SEED)

    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return mse, r2

X_train_original = X_train.dropna()
y_train_original = y_train[X_train_original.index]
X_test_original = X_test.dropna()
y_test_original = y_test[X_test_original.index]

mse_original, r2_original = build_and_evaluate_regressor(X_train_original, X_test_original, y_train_original, y_test_original)
print(f'Original Data - MSE: {mse_original}, R2: {r2_original}')

# KNN Imputed data
mse_knn, r2_knn = build_and_evaluate_regressor(X_train_knn_imputed, X_test_knn_imputed, y_train, y_test)
print(f'KNN Imputed Data - MSE: {mse_knn}, R2: {r2_knn}')

# MICE Imputed data
mse_mice, r2_mice = build_and_evaluate_regressor(X_train_mice_imputed, X_test_mice_imputed, y_train, y_test)
print(f'MICE Imputed Data - MSE: {mse_mice}, R2: {r2_mice}')

## Results

### Interpretation
We can see that KNN imputation produces a distribution very similar to the original data, more so than MICE.
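
A visual comparison of histograms can be backed up with a number. One option, sketched below, is the two-sample Kolmogorov-Smirnov statistic from `scipy.stats.ks_2samp`: the smaller it is, the closer the two distributions. The data here is synthetic and the two "imputed" arrays are hypothetical stand-ins, not the KNN and MICE outputs of this notebook.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
observed = rng.lognormal(mean=1.2, sigma=0.5, size=1000)  # stands in for the observed values

# Two hypothetical imputed versions of the same feature (10% of rows replaced):
# one fills the gaps with plausible draws, the other with the mean (which creates a spike)
imputed_similar = np.concatenate([observed[:900], rng.lognormal(1.2, 0.5, size=100)])
imputed_mean = np.concatenate([observed[:900], np.full(100, observed.mean())])

ks_similar = ks_2samp(observed, imputed_similar).statistic
ks_mean = ks_2samp(observed, imputed_mean).statistic
print(f"KS vs plausible fill: {ks_similar:.3f}, KS vs mean fill: {ks_mean:.3f}")
```

Applied to this notebook, the same call could compare a column of `X_train.dropna()` with the matching column of each imputed array.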

### Imputation

KNN/MICE: the two imputers give quite similar results, and both are good when compared to the original data.

### Regression
I tried DecisionTree, RandomForest and LinearRegression to predict median_house_value. They gave quite similar results, so I kept LinearRegression because, as we discussed in class, it allows us to keep the interpretability without the complexity of the other models.
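
A single train/test split can make models look closer or further apart than they are. As a sketch of a more stable comparison, the block below scores the three model families with 5-fold cross-validation. The data is synthetic and linear by construction, so the scores only illustrate the procedure and say nothing about the housing results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

SEED = 42
rng = np.random.default_rng(SEED)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=400)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=SEED),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=SEED),
}

# The same shuffled folds are reused for every model so the scores are comparable
cv = KFold(n_splits=5, shuffle=True, random_state=SEED)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in models.items()
}
for name, r2 in scores.items():
    print(f"{name}: mean R2 = {r2:.3f}")
```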


### 2. Outlier Identification [10]
- Utilise a statistical outlier detection approach (i.e., **no** KNN, LOF, 1Class SVM)
- Utilise an algorithmic outlier detection method of your choice
- Compare results and decide what to do with identified outliers
  - Include graphs, statistics, and written descriptions as appropriate
- Explain what you are doing, and why your analysis is appropriate
- Comment on benefits/detriments of statistical and algorithmic outlier detection approaches


Outliers in a dataset are either:
- Rare: Appear with low frequency relative to the rest of the data (inliers)
- Unusual: Do not fit the data distribution


Missing at Random (MAR) describes data where there may be a systematic
reason why some of the data is missing: the probability that a value is missing
depends on other observed variables, but not on the missing value itself.
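
To make the MAR idea concrete, here is a small simulation in which values of one column go missing with a probability that depends on another, fully observed column. Everything in it is an assumption for illustration: the column names mirror the housing data, but the values and the missingness rates (18% and 2%) are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 4000
toy = pd.DataFrame({
    "median_income": rng.lognormal(1.2, 0.5, size=n),
    "population": rng.lognormal(7.0, 0.7, size=n),
})

# MAR mechanism: population is more likely to be missing when the OBSERVED income is high
high_income = toy["median_income"] > toy["median_income"].median()
p_missing = np.where(high_income, 0.18, 0.02)
toy.loc[rng.random(n) < p_missing, "population"] = np.nan

rate_high = toy.loc[high_income, "population"].isnull().mean()
rate_low = toy.loc[~high_income, "population"].isnull().mean()
overall = toy["population"].isnull().mean()
print(f"missing rate: high income {rate_high:.2%}, low income {rate_low:.2%}, overall {overall:.2%}")
```

Comparing missing rates across groups of an observed feature, as in the last lines, is also a quick diagnostic to run on real data.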




Before continuing, I think it is relevant to deal with the missing values in this section. We are going to fill them with the KNN imputer because it works well for numerical data and takes the correlation between the features into account. Moreover, the outlier detection methods need data without missing values, so it makes sense to fill the missing values before detecting the outliers.

In [None]:
# Use this dataset for comparison against the imputed datasets
houses = pd.read_csv('https://raw.githubusercontent.com/PaoloMissier/CSC3831-2021-22/main/IMPUTATION/TARGET-DATASETS/ORIGINAL/houses.csv', header=0)

original_columns = houses_corrupted.columns
knn_imputer = KNNImputer(n_neighbors=5)
houses_corrupted = pd.DataFrame(knn_imputer.fit_transform(houses_corrupted), columns=original_columns)

print("Missing values per column:")
print( houses_corrupted.isnull().sum() )

houses_corrupted.head()

### MADs and Z-scores

We are now going to use the robust Z-score, as discussed in the lecture, because the data is skewed and includes outliers.

> MADs Def:
> Median of all absolute deviations from the median
> $\mathrm{MAD} = 1.483 \cdot \operatorname{med}_{i=1,\dots,n}\left(\left|x_i - \operatorname{med}_{j=1,\dots,n}(x_j)\right|\right)$

> Z-score Def:
> Z-score is a conversion of standard deviation from a normal distribution to a
> standard normal distribution.


$$X = \{x_1, \dots, x_n\}, \qquad \mathrm{rob\_z}_i = \frac{x_i - \operatorname{med}(x)}{\mathrm{MAD}}$$
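
As a worked example of the two definitions, the sketch below computes the robust Z-score for five made-up values. It also checks that multiplying by 0.6745 (a form often seen in code) matches dividing by the 1.483 factor from the lecture definition, up to rounding of the constants. The function name `robust_z` is an assumption for illustration.

```python
import numpy as np

def robust_z(x: np.ndarray) -> np.ndarray:
    """Robust Z-score: (x_i - med(x)) / MAD, with MAD = 1.483 * med(|x_i - med(x)|)."""
    med = np.median(x)
    mad = 1.483 * np.median(np.abs(x - med))
    return (x - med) / mad

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
z = robust_z(x)
# median = 3, absolute deviations = [2, 1, 0, 1, 97], their median = 1, so MAD = 1.483
print(np.round(z, 2))

# Alternative form: 0.6745 * (x - med) / (unscaled MAD), since 1 / 1.483 is about 0.6743
raw_mad = np.median(np.abs(x - np.median(x)))
z_alt = 0.6745 * (x - np.median(x)) / raw_mad

flags = np.abs(z) > 3  # only the value 100 exceeds the threshold
```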


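I considered scaling the data with the standard scaler several times, but I decided not to and to work with the raw data. Standard scaling only shifts and rescales each feature, so it does not remove the skew, and its mean and standard deviation are themselves affected by the outliers.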
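
The point about scaling can be checked directly: standardisation is a shift and a rescale, so the skewness of a feature is the same before and after. The sketch below uses a synthetic log-normal sample as a stand-in for a skewed housing feature.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
x = rng.lognormal(mean=7.0, sigma=0.7, size=2000).reshape(-1, 1)  # right-skewed sample

x_scaled = StandardScaler().fit_transform(x)

skew_raw = skew(x.ravel())
skew_scaled = skew(x_scaled.ravel())
print(f"skewness before scaling: {skew_raw:.3f}, after scaling: {skew_scaled:.3f}")
```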




In [None]:
from scipy import stats
from sklearn.ensemble import IsolationForest
import matplotlib.pyplot as plt
import seaborn as sns

columns_to_analyze = ['population', 'median_income', 'housing_median_age']

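def detect_outliers_zscore(data, threshold=3):
    # Robust Z-score: rob_z_i = (x_i - med(x)) / MAD, with MAD = 1.483 * med(|x_i - med(x)|)
    median = np.median(data)
    # Raw (unscaled) median absolute deviation
    mad = np.median(np.abs(data - median))
    # Multiplying by 0.6745 is equivalent to dividing by the 1.483 factor (1 / 1.483 is about 0.6745)
    robust_z_scores = 0.6745 * (data - median) / mad
    return np.abs(robust_z_scores) > threshold

def detect_outliers_isolation_forest(data):
    # contamination is the expected share of outliers in the data
    isolation_forest = IsolationForest(contamination=0.05, random_state=SEED)
    # fit_predict returns -1 for outliers and 1 for inliers
    outliers = isolation_forest.fit_predict(data)
    return outliers == -1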

def compare_outliers(data, columns):
    fig, axes = plt.subplots(len(columns), 2, figsize=(15, len(columns) * 5))
    for i, column in enumerate(columns):
        # Outlier detection with both methods
        outliers_zscore = detect_outliers_zscore(data[column])
        outliers_iforest = detect_outliers_isolation_forest(data[[column]])
        
        sns.histplot(data[column], ax=axes[i, 0], kde=True, color='blue')
        sns.histplot(data[column][outliers_zscore], ax=axes[i, 0], kde=True, color='red')
        axes[i, 0].set_title(f'{column} - Z-score')
        axes[i, 0].text(0.05, 0.95, 'Blue: Original\nRed: Outliers', transform=axes[i, 0].transAxes,
                        fontsize=10, verticalalignment='top', bbox=dict(boxstyle='round,pad=0.3', edgecolor='black', facecolor='white'))
        
        sns.histplot(data[column], ax=axes[i, 1], kde=True, color='blue')
        sns.histplot(data[column][outliers_iforest], ax=axes[i, 1], kde=True, color='red')
        axes[i, 1].set_title(f'{column} - Isolation Forest')
        axes[i, 1].text(0.05, 0.95, 'Blue: Original\nRed: Outliers', transform=axes[i, 1].transAxes,
                        fontsize=10, verticalalignment='top', bbox=dict(boxstyle='round,pad=0.3', edgecolor='black', facecolor='white'))
    
    plt.tight_layout()
    plt.show()

compare_outliers(houses_corrupted, columns_to_analyze)

## Results : 
###  Chart analysis : 

**Population**

- Z-score: Outliers are mainly located on the right of the distribution (very high values).
- Isolation Forest: Outliers are similar but slightly more numerous, with slightly wider detection.

**Median Income**

- Z-score: Only very high values are detected as outliers.
- Isolation Forest: Detection is more diverse, with anomalies in the high and low parts, indicating that it considers very low and very high incomes to be abnormal.

**Housing Median Age**

- Z-score: Detects few anomalies, only at the right end (high values).
- Isolation Forest: More diversity in anomalies, with better capture of extreme ages.
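
Besides comparing the charts, the agreement between the two detectors can be summarised with counts and a Jaccard index (points flagged by both, divided by points flagged by either). The sketch below is a minimal illustration on synthetic values with ten injected extremes; it is not run on the housing data, and the variable names are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

SEED = 42
rng = np.random.default_rng(SEED)
# 990 ordinary values plus 10 injected extremes at the end of the array
values = np.concatenate([rng.normal(0, 1, size=990), rng.uniform(40, 60, size=10)])

# Statistical detector: robust Z-score with a threshold of 3
med = np.median(values)
mad = 1.483 * np.median(np.abs(values - med))
mask_z = np.abs((values - med) / mad) > 3

# Algorithmic detector: Isolation Forest with 5% expected contamination
forest = IsolationForest(contamination=0.05, random_state=SEED)
mask_if = forest.fit_predict(values.reshape(-1, 1)) == -1

both = int((mask_z & mask_if).sum())
either = int((mask_z | mask_if).sum())
jaccard = both / either if either else 1.0
print(f"Z-score: {mask_z.sum()}, Isolation Forest: {mask_if.sum()}, both: {both}, Jaccard: {jaccard:.2f}")
```

Note that Isolation Forest flags roughly the share given by `contamination` whatever the data looks like, while the number of points flagged by the robust Z-score depends on the data.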

Explanations for the differences:
The Z-score assumes a roughly normal (symmetric) distribution. If the data are asymmetrical or non-normal, it may miss outliers or detect too few, which is the case in our data.
Isolation Forest is more flexible because it does not assume any particular shape for the data. It is therefore more effective at detecting anomalies in skewed or complex data. I chose contamination=0.1 at first, then I tried 0.05 because I thought that 0.1 flagged too many points, especially far from the centre of the distribution.
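
The effect of the contamination parameter can be seen by refitting the same forest with different values and counting the flagged points. This is a sketch on a synthetic skewed sample. With a fixed random seed the trees are identical across runs, so only the decision threshold moves.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

SEED = 42
rng = np.random.default_rng(SEED)
values = rng.lognormal(mean=7.0, sigma=0.7, size=1000).reshape(-1, 1)

flagged = {}
for contamination in (0.01, 0.05, 0.10):
    forest = IsolationForest(contamination=contamination, random_state=SEED)
    # fit_predict returns -1 for the points scored as most anomalous
    flagged[contamination] = int((forest.fit_predict(values) == -1).sum())

for contamination, count in flagged.items():
    print(f"contamination={contamination:.2f}: {count} points flagged")
```

Because the parameter directly sets how many points are flagged, it is better read as a prior belief about the outlier share than as something the model discovers.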

Conclusion:
The two methods give different perspectives on outliers. The Z-score is simple but limited for non-normal distributions, while Isolation Forest is more robust for complex structures but requires finding the right parameter.



### 4. Conclusions & Thoughts [3]
- Discuss methods used for anomaly detection, pros/cons of each method
- Discuss challenges/difficulties in anomaly detection implementation
- Discuss methods used for imputation, pros/cons of each method
- Discuss challenges/difficulties in imputation implementation

This coursework allowed us to combine the different methods covered in class and to test them in practice. Since we saw several methods for each task, we had to make choices.

# Anomaly detection choice

## Z-score
For anomaly detection, I chose the Z-score as a statistical method and Isolation Forest as an algorithmic method. I chose the Z-score because it is simple and quick to set up, but it is limited by the assumption that the data is normally distributed. To reduce this limitation, I used the robust Z-score that we covered in class.

## Isolation Forest
Isolation Forest is more robust because it does not make assumptions about the distribution of the data. However, it is more complex to set up and requires finding the right contamination parameter. It also builds each tree on a small random sub-sample of the data, which is both an advantage (it is fast) and a disadvantage (the result depends on the sample and the seed). I tried to do my best based on my personal research and the knowledge provided in class. I would be very happy to have feedback on my choices and results. I do not know how feedback works in England since I am a French university exchange student here for my third year.

# Imputer choice
For imputation, I chose KNN and MICE (IterativeImputer). Both are multivariate imputation algorithms. The advantages and disadvantages of KNN listed below come from this website [1].

## KNN Impute:
### Advantages
- Enhances Data Accuracy (tries to fill NaN with accurate values)
- Preserves Data Structure (maintains the relationships and distribution of the data as shown in the above plot)
- Handles Numeric Data Effectively (int/float dtypes, where it can make accurate estimations for missing values.)
- Integration with Scikit-Learn (easy to integrate with data preprocessing pipeline)
### Disadvantages
- Sensitive to the Choice of k (selecting an inappropriate value for k may lead to either over-smoothing (too generalized) or overfitting (too specific) imputations). In our case, we used the default value of k=5. I also tried 7, 3 and 1, but I found 5 to be the best value.
- Highly Computational (can be time-consuming for larger datasets to calculate the distance between two data points)
- Handling Categorical Data (imputing discrete values can be challenging, but KNN Imputer remains applicable when the data is encoded). In our case, it does not matter because we only have numerical data.
- Impact of Outliers (too many outliers in the data may lead to wrong imputations)
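
The sensitivity to k can be measured when the true values are known: hide some observed entries, impute them with several values of k, and compare the imputations with the hidden truth. The sketch below does this on synthetic data where one column depends on the other two. The candidate values 1, 3, 5 and 7 are the ones mentioned above, and everything else (data, 10% hiding rate) is an assumption for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

SEED = 42
rng = np.random.default_rng(SEED)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
c = 2.0 * a - b + rng.normal(scale=0.1, size=n)  # c is predictable from a and b
X_true = np.column_stack([a, b, c])

# Hide about 10% of column c (completely at random here, for simplicity)
hidden = rng.random(n) < 0.10
X_missing = X_true.copy()
X_missing[hidden, 2] = np.nan

rmse_by_k = {}
for k in (1, 3, 5, 7):
    X_imputed = KNNImputer(n_neighbors=k).fit_transform(X_missing)
    err = X_imputed[hidden, 2] - X_true[hidden, 2]
    rmse_by_k[k] = float(np.sqrt(np.mean(err ** 2)))

best_k = min(rmse_by_k, key=rmse_by_k.get)
print(rmse_by_k, "best k:", best_k)
```

In this notebook, the uncorrupted `houses` dataset could play the role of `X_true` for the entries that are missing in `houses_corrupted`.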



## MICE Impute:
Based on these websites [2, 3], here are some advantages and disadvantages of the MICE algorithm:
### Advantages
- Flexibility: MICE (Multiple Imputation by Chained Equations) is highly flexible and can be applied to datasets with various types of variables, including binary, categorical, and continuous data. It allows each variable to be modeled according to its specific distribution, enhancing the accuracy of imputations.
- Handling Complex Datasets: MICE can be used effectively in large datasets with hundreds of variables, making it suitable for complex data structures where traditional joint models may not be appropriate.
- Uncertainty Quantification: By generating multiple imputations for each missing value, MICE accounts for the uncertainty inherent in missing data, leading to more accurate standard errors and reducing the risk of false precision that can occur with single imputation methods.
- Auxiliary Variables: The method allows the inclusion of auxiliary variables that are not part of the main analysis but can improve the quality of imputations by making the Missing At Random (MAR) assumption more plausible.
- Superefficiency: MICE can sometimes provide more precise statistical inferences than maximum likelihood methods by using additional information that may not be accessible to the analyst.

### Disadvantages
- Assumption of MAR: MICE assumes that data are Missing At Random (MAR), meaning the probability of missingness depends only on observed values. If this assumption is violated, it can lead to biased estimates.
- Complex Implementation: The method is more complex to implement compared to simpler imputation techniques. It requires careful consideration in setting up the imputation model and validating the quality of generated data.
- Potential for Misleading Results: Without appropriate care and insight, MICE can yield nonsensical or misleading results. It requires scientific and statistical judgment at various stages, such as diagnosing the missing data problem and setting up a robust imputation model.
- Computational Expense: MICE can be computationally expensive, particularly with large datasets or when a high number of imputations is required to achieve stable results.
- Lack of Guidance for Complex Data Structures: There is limited guidance on how to incorporate design factors from complex sampling designs or how to handle nested, hierarchical, or autocorrelated data within the MICE framework.
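
Several points above rely on generating multiple imputations. By default, scikit-learn's `IterativeImputer` returns a single completed dataset, but it can be applied repeatedly with `sample_posterior=True` and different seeds to obtain several. The sketch below illustrates this on synthetic data. The choice of five imputations and the toy data are assumptions, and this is not what the imputation cell earlier in this notebook does.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the next import)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)
n = 300
a = rng.normal(size=n)
b = a + rng.normal(scale=0.5, size=n)  # b is correlated with a
X = np.column_stack([a, b])
missing = rng.random(n) < 0.15
X[missing, 1] = np.nan

# Five completed datasets, each drawn from the predictive distribution with a different seed
imputations = np.stack([
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
])

# Per-cell standard deviation across the five draws:
# zero for observed cells, positive for imputed cells
spread = imputations.std(axis=0)
print("mean spread on imputed cells:", spread[missing, 1].mean())
```

The spread on the imputed cells is a rough picture of the imputation uncertainty that a single imputation hides.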

The main difficulty I encountered in implementing the imputers was interpreting the results. There are still parts of the method that are unclear to me, and I want to practise more to deepen my understanding of the field. To that end, I am also taking the biomedical and AI course, and I will join the Formula Student club, which aims to win a speed race with an autonomous electric vehicle on a circuit.

## References
1. KNN imputer: https://medium.com/@karthikheyaa/k-nearest-neighbor-knn-imputer-explained-1c56749d0dd7
2. MICE: https://www.machinelearningplus.com/machine-learning/mice-imputation/
3. MICE: https://pmc.ncbi.nlm.nih.gov/articles/PMC3074241/