# Introduction:

There are other notebooks that find the outliers, using statistical methods and what not, but in this notebook, we'll hunt outliers and come up with methods to convert them to normal values, using common sense.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
from pathlib import Path
from IPython.display import display
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import lightgbm as lgbm
import xgboost as xgb
import catboost

In [2]:
from warnings import filterwarnings
filterwarnings("ignore")

# Loading Data

In [3]:
# setting a base path variable for easy access
BASE_PATH = Path("/kaggle/input/playground-series-s3e6")
train = pd.read_csv(BASE_PATH / "train.csv").drop(columns=["id"])

test = pd.read_csv(BASE_PATH / "test.csv")
# we need the test id column to make the submission
test_idx = test.id
test = test.drop(columns=["id"])

original = pd.read_csv("/kaggle/input/paris-housing-price-prediction/ParisHousing.csv")

In [4]:
# features presence check
all(original.columns == train.columns)

True

In [5]:
all_datasets = {"train": train, "test": test,"original": original}

# Clipping wild values, before we proceed!
Our solutions to outliers depend quite a lot on taking and using average/mean and we know mean as a metric is very sensitive to outliers or wildy large values, which are present in almost every feature in this competition's datasets.

So we first clip the absolutely wild values before we proceed.
For more on clipping these wild values, checkout my other notebook: **[Setting upper bounds may help!](https://www.kaggle.com/khawajaabaidullah/ps3e6-setting-upper-bounds-may-help/)**

In [6]:
for dataset in all_datasets.values():
    dataset["attic"] = dataset.attic.clip(upper=10000)
    dataset["floors"] = dataset.floors.clip(upper=100)
    dataset["squareMeters"] = dataset.squareMeters.clip(upper=100000)
    dataset["basement"] = dataset.basement.clip(upper=10000)
    dataset["garage"] = dataset.garage.clip(upper=1000)
    dataset["cityCode"] = dataset.cityCode.clip(upper=99999)

# Little preprocessing

In [7]:
def preprocess(train, original):
    X = train.drop(columns="price")
    y = train.price
    X_org = original.drop(columns="price")
    y_org = original.price
    
    return X, y, X_org, y_org

# Train models and check score

In [8]:
def cross_validate(X, y, X_org, y_org, model, model_verbose):
    N_FOLDS = 5
    cv_scores = np.zeros(N_FOLDS)
    feature_importances_all_folds = np.zeros(shape=(N_FOLDS, len(X.columns)))
    
    kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=1337)
    
    for fold_num, (train_idx, val_idx) in enumerate(kf.split(X)):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
        
        if model=="XGBoost":
            model = xgb.XGBRegressor()
            
        elif model=="LightGBM":
            model = lgbm.LGBMRegressor()
            
        elif model=="CatBoost":
            model = catboost.CatBoostRegressor(eval_metric="RMSE")
        
        model.fit(X_train, y_train,
                  eval_set=[(X_val, y_val)],
                  early_stopping_rounds=50,
                 verbose=model_verbose)
        
        y_preds = model.predict(X_val)
        
        # to calculate rmse instead of mse, we set squared=False
        rmse = mean_squared_error(y_val, y_preds, squared=False)
        cv_scores[fold_num] = rmse        
        print(f"Fold {fold_num} \t RMSE: {rmse}")
        
    avg_rmse = np.mean(cv_scores)
    print(f"AVG RMSE: {avg_rmse}")

In [9]:
def try_new_changes(X, y, X_org, y_org):    
    models = ["XGBoost", "LightGBM", "CatBoost"]
    
    for model in models:
        print(f"\n{'-'*30} {model} {'-'*30}")
        
        verbose = False
        if model=="LightGBM":
            verbose = -1
            
        cross_validate(X, y, X_org, y_org, model=model, model_verbose=verbose)

### Let's run it and set a baseline

In [10]:
try_new_changes(*preprocess(train, original))


------------------------------ XGBoost ------------------------------
Fold 0 	 RMSE: 173732.7748770942
Fold 1 	 RMSE: 139404.0495448856
Fold 2 	 RMSE: 97092.11937356321
Fold 3 	 RMSE: 217363.71838511084
Fold 4 	 RMSE: 189092.2865408749
AVG RMSE: 163336.98974430576

------------------------------ LightGBM ------------------------------
Fold 0 	 RMSE: 169410.45032032923
Fold 1 	 RMSE: 132249.73024736258
Fold 2 	 RMSE: 133168.19074826475
Fold 3 	 RMSE: 251895.71308501367
Fold 4 	 RMSE: 159952.15709558927
AVG RMSE: 169335.2482993119

------------------------------ CatBoost ------------------------------
Fold 0 	 RMSE: 172447.5853235419
Fold 1 	 RMSE: 129193.30550874898
Fold 2 	 RMSE: 116990.25254906624
Fold 3 	 RMSE: 241655.03520647835
Fold 4 	 RMSE: 146499.10749395363
AVG RMSE: 161357.05721635782


# Prelimiary Data Analysis to see if we can spot anything weird

In [11]:
pd.concat([dataset.isnull().sum().rename(f"Missing in {dataset_name}") 
               for dataset_name, dataset in all_datasets.items()],
         axis=1)

Unnamed: 0,Missing in train,Missing in test,Missing in original
squareMeters,0,0.0,0
numberOfRooms,0,0.0,0
hasYard,0,0.0,0
hasPool,0,0.0,0
floors,0,0.0,0
cityCode,0,0.0,0
cityPartRange,0,0.0,0
numPrevOwners,0,0.0,0
made,0,0.0,0
isNewBuilt,0,0.0,0


In [12]:
pd.concat([train.dtypes.rename("Data Type")] + \
          [dataset.nunique().rename(f"{dataset_name} UniqueValues") for dataset_name, dataset in all_datasets.items()],
          axis=1).sort_values(by="train UniqueValues")

Unnamed: 0,Data Type,train UniqueValues,test UniqueValues,original UniqueValues
hasYard,int64,2,2.0,2
hasPool,int64,2,2.0,2
hasStorageRoom,int64,2,2.0,2
isNewBuilt,int64,2,2.0,2
hasStormProtector,int64,2,2.0,2
cityPartRange,int64,10,10.0,10
numPrevOwners,int64,10,10.0,10
hasGuestRoom,int64,11,11.0,11
made,int64,33,32.0,32
numberOfRooms,int64,100,100.0,100


### INSIGHTS:
1. **made** represents the year probably in which the house was made. This feature in train contains 33 values but only does 32 in test and original. Let's inveestigate that first.

# Outliers in "made" feature:
**made** represents the year probably in which the house was made. This feature in train contains 33 values but only does 32 in test and original. Let's inveestigate!

In [13]:
# Let's first verify that the "made" values for test and original are the same 32
test.made.unique().sort() == original.made.unique().sort()

True

In [14]:
# Let's find the values that's only present in train
set(train.made.unique()) - set(test.made.unique())

{10000}

### Thoughts:
This is definitely an anomalous value as 10000 makes no sense for a year.
Let's see which rows contain this value.

In [15]:
train[train.made == 10000]

Unnamed: 0,squareMeters,numberOfRooms,hasYard,hasPool,floors,cityCode,cityPartRange,numPrevOwners,made,isNewBuilt,hasStormProtector,basement,attic,garage,hasStorageRoom,hasGuestRoom,price
2113,68038,41,0,0,54,87120,3,6,10000,1,1,6537,6304,366,0,0,6807415.1
3608,80062,81,1,0,35,67157,9,4,10000,0,1,732,6475,758,0,4,8007951.1
19124,80062,52,0,0,84,67099,9,4,10000,0,0,7677,5017,148,0,4,8007951.1
19748,80062,58,0,1,86,40408,7,8,10000,0,0,7059,7307,287,0,2,8007951.1
21400,80062,78,0,0,84,59457,4,7,10000,1,0,6382,9507,298,1,4,8007951.1


In [16]:
# lets see if there are outliers in test or original
display(test[test.made > 2021])
display(original[original.made > 2021])

Unnamed: 0,squareMeters,numberOfRooms,hasYard,hasPool,floors,cityCode,cityPartRange,numPrevOwners,made,isNewBuilt,hasStormProtector,basement,attic,garage,hasStorageRoom,hasGuestRoom


Unnamed: 0,squareMeters,numberOfRooms,hasYard,hasPool,floors,cityCode,cityPartRange,numPrevOwners,made,isNewBuilt,hasStormProtector,basement,attic,garage,hasStorageRoom,hasGuestRoom,price


### Thoughts:
No outliers in test and original, but definitely in train. We definitely need to fix this by using some valid value. Maybe the most recent year one?

## Fixing
If the house is newly built, replace outlier value with 2021, otherwise replace it with the average value.
1. Where isNewBuilt == 1, replace outlier value with 2021.
2. Where isNewBuilt == 0, replace it with average.

In [17]:
outlier_newly_built_idx = train[(train.made==10000) & (train.isNewBuilt==1)].index
train.loc[outlier_newly_built_idx, "made"] = 2021

outlier_old_built_idx = train[(train.made==10000) & (train.isNewBuilt==0)].index
train.loc[outlier_old_built_idx, "made"] = int(train.made.mean())

# Outliers in "garage" feature
The garage feature contains value for how big the garage is. It should definitely be in all cases a fraction of the total house size.

But let's see if there are instances where the garage size is greater than the actual house size, which will definitely be an outlier.

## Scanning

In [18]:
for dataset_name, dataset in all_datasets.items():
    print(f"{'-'*30} {dataset_name.upper()} OUTLIERS {'-'*30}")
    print(f"Number of rows with garage size greater than house size: " , len(dataset[dataset.garage > dataset.squareMeters]))
    print()

------------------------------ TRAIN OUTLIERS ------------------------------
Number of rows with garage size greater than house size:  261

------------------------------ TEST OUTLIERS ------------------------------
Number of rows with garage size greater than house size:  186

------------------------------ ORIGINAL OUTLIERS ------------------------------
Number of rows with garage size greater than house size:  59



### INSIGHTS:
Welp, that's a LOT of outliers, not just in train, but also in test and original. and most importancely in test!!

### What should we do?
Okay, since there are quite a lot these outliers in test set as well, we should come up with a strat to address this issue. Dropping rows with outliers it not an option anymore!

### Possible Solutions:
One thing that I'm thinking is that we should find the average garage to squareMeters ratio and use that to fix the garage sizes for these outlier values.

In [19]:
avg_garage_to_house_size_ratio = np.mean(train.garage / train.squareMeters)
avg_garage_to_house_size_ratio * 100

6.215760178269915

## Hmm,
So on average, a garage takes only 6.21 percentance of a house's whole area, which makes complete sense.

## Fixing
To fix the outliers, we should multiply this mean value with squareMeters for the outlier records only and replace the value in garage feature with the new value.

In [20]:
for dataset_name, dataset in all_datasets.items():
    print(f"{'-'*30} {dataset_name.upper()} Ooutliers Resolution {'-'*30}")
    garage_outliers = dataset.garage > dataset.squareMeters
    outlier_indices = dataset[garage_outliers].index
    print("Before: ")
    print(f"\tNumber of rows with garage size greater than house size: " , len(dataset[garage_outliers]))
    print(f"\tAvg garage size of outliers: ", np.mean(dataset[garage_outliers].garage))
    
    dataset.loc[outlier_indices, "garage"] = (dataset.loc[outlier_indices].squareMeters * \
                                                  avg_garage_to_house_size_ratio).astype("int64").to_numpy()
    print("After: ")
    print(f"\tNumber of rows with garage size greater than house size: " , len(dataset[dataset.garage > dataset.squareMeters]))
    print(f"\tAvg garage size of outliers: ", np.mean(dataset[garage_outliers].garage))
    print("\n")

------------------------------ TRAIN Ooutliers Resolution ------------------------------
Before: 
	Number of rows with garage size greater than house size:  261
	Avg garage size of outliers:  753.0459770114943
After: 
	Number of rows with garage size greater than house size:  0
	Avg garage size of outliers:  27.50191570881226


------------------------------ TEST Ooutliers Resolution ------------------------------
Before: 
	Number of rows with garage size greater than house size:  186
	Avg garage size of outliers:  746.6451612903226
After: 
	Number of rows with garage size greater than house size:  0
	Avg garage size of outliers:  28.333333333333332


------------------------------ ORIGINAL Ooutliers Resolution ------------------------------
Before: 
	Number of rows with garage size greater than house size:  59
	Avg garage size of outliers:  729.0508474576271
After: 
	Number of rows with garage size greater than house size:  0
	Avg garage size of outliers:  24.983050847457626




### Thoughts:
Fixed, Yes! Hopefully it will improve performace!

# Outliers in Attic feature
Attic represents the attic size of the house in sq meters. Let's see if like garage it contains any values that are greater than the actual hosue size!

## Scaning:

In [21]:
for dataset_name, dataset in all_datasets.items():
    print(f"{'-'*30} {dataset_name.upper()} Attic Outliers Scan {'-'*30}")
    attic_outliers = dataset.attic > dataset.squareMeters
    outlier_indices = dataset[attic_outliers].index
    print(f"\tNumber of rows with attic size greater than house size: " , len(dataset[attic_outliers]))
    print("\n")

------------------------------ TRAIN Attic Outliers Scan ------------------------------
	Number of rows with attic size greater than house size:  1675


------------------------------ TEST Attic Outliers Scan ------------------------------
	Number of rows with attic size greater than house size:  1161


------------------------------ ORIGINAL Attic Outliers Scan ------------------------------
	Number of rows with attic size greater than house size:  531




### INSIGHTS:
Again, quite a lot outliers, let's do the same to what we did with garage

### Let's check the average attic to house size ratio

In [23]:
avg_attic_to_house_size_ratio = np.mean(train.attic / train.squareMeters)
avg_attic_to_house_size_ratio

0.5574592792662547

#### INSIGHTS:
On average an attic takes up 55.7% of the house, which kinda does make sense

## Fixing

In [24]:
for dataset_name, dataset in all_datasets.items():
    print(f"{'-'*30} {dataset_name.upper()} Attic Outliers Resolution {'-'*30}")
    attic_outliers = dataset.attic > dataset.squareMeters
    outlier_indices = dataset[attic_outliers].index
    print("Before: ")
    print(f"\tNumber of rows with attic size greater than house size: " , len(dataset[attic_outliers]))
    print(f"\tAvg atticc size of outliers: ", np.mean(dataset[attic_outliers].attic))
    
    dataset.loc[outlier_indices, "attic"] = (dataset.loc[outlier_indices].squareMeters * \
                                                  avg_attic_to_house_size_ratio).astype("int64").to_numpy()
    print("After: ")
    print(f"\tNumber of rows with attic size greater than house size: " , len(dataset[dataset.attic > dataset.squareMeters]))
    print(f"\tAvg attic size of outliers: ", np.mean(dataset[attic_outliers].attic))
    print("\n")

------------------------------ TRAIN Attic Outliers Resolution ------------------------------
Before: 
	Number of rows with attic size greater than house size:  1675
	Avg atticc size of outliers:  6723.622089552239
After: 
	Number of rows with attic size greater than house size:  0
	Avg attic size of outliers:  1813.0525373134328


------------------------------ TEST Attic Outliers Resolution ------------------------------
Before: 
	Number of rows with attic size greater than house size:  1161
	Avg atticc size of outliers:  6684.083548664944
After: 
	Number of rows with attic size greater than house size:  0
	Avg attic size of outliers:  1882.875968992248


------------------------------ ORIGINAL Attic Outliers Resolution ------------------------------
Before: 
	Number of rows with attic size greater than house size:  531
	Avg atticc size of outliers:  6706.887005649717
After: 
	Number of rows with attic size greater than house size:  0
	Avg attic size of outliers:  1818.09604519774




# Outliers in "basement" feature
Basement feature represents the basement size. It should again, not be greater than the actual house size!

## Scanning

In [25]:
for dataset_name, dataset in all_datasets.items():
    print(f"{'-'*30} {dataset_name.upper()} Basement Outliers Scan {'-'*30}")
    basement_outliers = dataset.basement > dataset.squareMeters
    outlier_indices = dataset[basement_outliers].index
    print(f"\tNumber of rows with attic size greater than house size: " , len(dataset[basement_outliers]))
    print("\n")

------------------------------ TRAIN Basement Outliers Scan ------------------------------
	Number of rows with attic size greater than house size:  1660


------------------------------ TEST Basement Outliers Scan ------------------------------
	Number of rows with attic size greater than house size:  1130


------------------------------ ORIGINAL Basement Outliers Scan ------------------------------
	Number of rows with attic size greater than house size:  517




### INSIGHTS:
Quite some outliers, let's repeat the same procedure as we did with the attic and garage size!

### Let's check the average basement to house size ratio

In [26]:
avg_basement_to_house_size_ratio = np.mean(train.basement / train.squareMeters)
avg_basement_to_house_size_ratio

0.5503883387299767

#### INSIGHTS:
Basement size on average is pretty close to that average attic size, i.e. around 55% which makes sense.

## Fixing

In [27]:
for dataset_name, dataset in all_datasets.items():
    print(f"{'-'*30} {dataset_name.upper()} Basement Outliers Resolution {'-'*30}")
    basement_outliers = dataset.basement > dataset.squareMeters
    outlier_indices = dataset[basement_outliers].index
    print("Before: ")
    print(f"\tNumber of rows with basement size greater than house size: " , len(dataset[basement_outliers]))
    print(f"\tAvg basement size of outliers: ", np.mean(dataset[basement_outliers].attic))
    
    dataset.loc[outlier_indices, "basement"] = (dataset.loc[outlier_indices].squareMeters * \
                                                  avg_basement_to_house_size_ratio).astype("int64").to_numpy()
    print("After: ")
    print(f"\tNumber of rows with basement size greater than house size: " , len(dataset[dataset.basement > dataset.squareMeters]))
    print(f"\tAvg basement size of outliers: ", np.mean(dataset[basement_outliers].attic))
    print("\n")

------------------------------ TRAIN Basement Outliers Resolution ------------------------------
Before: 
	Number of rows with basement size greater than house size:  1660
	Avg basement size of outliers:  1655.9855421686748
After: 
	Number of rows with basement size greater than house size:  0
	Avg basement size of outliers:  1655.9855421686748


------------------------------ TEST Basement Outliers Resolution ------------------------------
Before: 
	Number of rows with basement size greater than house size:  1130
	Avg basement size of outliers:  1647.0823008849557
After: 
	Number of rows with basement size greater than house size:  0
	Avg basement size of outliers:  1647.0823008849557


------------------------------ ORIGINAL Basement Outliers Resolution ------------------------------
Before: 
	Number of rows with basement size greater than house size:  517
	Avg basement size of outliers:  1698.4255319148936
After: 
	Number of rows with basement size greater than house size:  0
	Avg b

# Let's battle test our changes

In [29]:
try_new_changes(*preprocess(train, original))


------------------------------ XGBoost ------------------------------
Fold 0 	 RMSE: 175362.71682899437
Fold 1 	 RMSE: 133822.6716131682
Fold 2 	 RMSE: 105264.85522386897
Fold 3 	 RMSE: 217881.52558033343
Fold 4 	 RMSE: 182954.45690835686
AVG RMSE: 163057.24523094436

------------------------------ LightGBM ------------------------------
Fold 0 	 RMSE: 170917.8771655546
Fold 1 	 RMSE: 131590.53817086088
Fold 2 	 RMSE: 132909.33195526747
Fold 3 	 RMSE: 252372.22355577798
Fold 4 	 RMSE: 160201.94556735177
AVG RMSE: 169598.38328296255

------------------------------ CatBoost ------------------------------
Fold 0 	 RMSE: 172064.76170779858
Fold 1 	 RMSE: 128173.43722953029
Fold 2 	 RMSE: 117618.48021624243
Fold 3 	 RMSE: 230910.2587129031
Fold 4 	 RMSE: 141071.15041725553
AVG RMSE: 157967.617656746


# Comparison results:
1. CatBoost's average cv RMSE decreased from **161357.05** to **157967.61** which is absolutely bonkers!
2. XGBoost's average cv RMSE decreased from **163336.9897** to **163057.2452**, a ~300 points improvement.
3. LightGBM's average cv RMSE increased from **169335.2482** to **169598.3832**, a ~250 points deterioration.

# Conclusion:
Overall fixing these outliers using the techniques we've come up with has resulted in imporved results and I do believe that we can further improve these results quite a lot using hyperparamters tuning. You may have other techniques that you might wanna try instead of multiplying with average, feel free to test and best of luck!

### If you found this notebook useful, please consider upvoting, as this will serve as a token of appreciation for my work! Thank you for reading this far!