# Train vs Test vs Original Dataset - Comparing datasets and finding anomalies

# Introduction:
In this notebook, we will explore the similarities and discrepancies between the competition and original datasets and check whether either contains anomalous values.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
from pathlib import Path
import xgboost as xgb
import lightgbm as lgbm
import catboost
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_squared_error
from IPython.display import display
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler

In [3]:
from warnings import filterwarnings
filterwarnings("ignore")

# Loading Data

In [4]:
# setting a base path variable for easy access
BASE_PATH = Path("/kaggle/input/playground-series-s3e6")
train = pd.read_csv(BASE_PATH / "train.csv").drop(columns=["id"])

test = pd.read_csv(BASE_PATH / "test.csv")
# we need the test id column to make the submission
test_idx = test.id
test = test.drop(columns=["id"])

original = pd.read_csv("/kaggle/input/paris-housing-price-prediction/ParisHousing.csv")

In [35]:
# features presence check
all(original.columns == train.columns)

True
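The elementwise comparison above only works when both frames have the same number of columns in the same order. As a minimal sketch, a set-based check can also report which columns differ. The helper name `compare_columns` and the toy frames below are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def compare_columns(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Report columns that appear in only one frame, and whether the order matches."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return {
        "only_in_left": sorted(left_cols - right_cols),
        "only_in_right": sorted(right_cols - left_cols),
        "same_order": list(left.columns) == list(right.columns),
    }


# Toy frames standing in for train and test (the toy test frame has no target column)
toy_train = pd.DataFrame({"squareMeters": [50], "made": [2005], "price": [1.0]})
toy_test = pd.DataFrame({"squareMeters": [60], "made": [2010]})

report = compare_columns(toy_train, toy_test)
print(report)
```

With the real data, `compare_columns(original, train)` should report no differences if the check above returned `True`.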

# Data Analysis

In [7]:
pd.concat([train.isnull().sum().rename("Missing In Train"),
          test.isnull().sum().rename("Missing in Test"),
          original.isnull().sum().rename("Missing in Original")], axis=1)

| | Missing In Train | Missing in Test | Missing in Original |
|---|---|---|---|
| squareMeters | 0 | 0.0 | 0 |
| numberOfRooms | 0 | 0.0 | 0 |
| hasYard | 0 | 0.0 | 0 |
| hasPool | 0 | 0.0 | 0 |
| floors | 0 | 0.0 | 0 |
| cityCode | 0 | 0.0 | 0 |
| cityPartRange | 0 | 0.0 | 0 |
| numPrevOwners | 0 | 0.0 | 0 |
| made | 0 | 0.0 | 0 |
| isNewBuilt | 0 | 0.0 | 0 |


In [12]:
pd.concat([train.dtypes.rename("Data Type"),
          train.nunique().rename("Train UniqueValues"),
          test.nunique().rename("Test UniqueValues"),
          original.nunique().rename("Original UniqueValues")], axis=1)\
            .sort_values(by="Train UniqueValues")

| | Data Type | Train UniqueValues | Test UniqueValues | Original UniqueValues |
|---|---|---|---|---|
| hasYard | int64 | 2 | 2.0 | 2 |
| hasPool | int64 | 2 | 2.0 | 2 |
| hasStorageRoom | int64 | 2 | 2.0 | 2 |
| isNewBuilt | int64 | 2 | 2.0 | 2 |
| hasStormProtector | int64 | 2 | 2.0 | 2 |
| cityPartRange | int64 | 10 | 10.0 | 10 |
| numPrevOwners | int64 | 10 | 10.0 | 10 |
| hasGuestRoom | int64 | 11 | 11.0 | 11 |
| made | int64 | 33 | 32.0 | 32 |
| numberOfRooms | int64 | 100 | 100.0 | 100 |


### INSIGHTS:
1. **made** probably represents the year in which the house was built. This feature has 33 unique values in train but only 32 in test and original. Let's investigate that first.
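A mismatch in unique counts like this one can be searched for across all columns at once. The sketch below lists, per shared column, the values that occur in one frame but not in a reference frame. The helper `unseen_values`, its `max_unique` cutoff, and the toy frames are assumptions made for illustration.

```python
import pandas as pd


def unseen_values(candidate: pd.DataFrame, reference: pd.DataFrame, max_unique: int = 100) -> dict:
    """For each shared low-cardinality column, list values in `candidate` absent from `reference`."""
    result = {}
    for col in candidate.columns.intersection(reference.columns):
        if reference[col].nunique() > max_unique:
            continue  # skip near-continuous columns, where unseen values are expected
        extra = set(candidate[col].unique().tolist()) - set(reference[col].unique().tolist())
        if extra:
            result[col] = sorted(extra)
    return result


# Toy frames standing in for train and original
toy_train = pd.DataFrame({"made": [1995, 2005, 10000], "hasPool": [0, 1, 1]})
toy_original = pd.DataFrame({"made": [1995, 2005, 2021], "hasPool": [0, 1, 0]})

flagged = unseen_values(toy_train, toy_original)
print(flagged)
```

On the real data this would be called as `unseen_values(train, original)`. The cutoff of 100 is arbitrary and would need tuning for other datasets.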

## 1. Hunting the Anomalous value for "made" in train:

In [36]:
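# Let's first verify that test and original share the same 32 "made" values.
# ndarray.sort() sorts in place and returns None, so comparing two .sort() results
# is always True regardless of the data; comparing sets checks the actual values.
set(test.made.unique()) == set(original.made.unique())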

True

In [40]:
# Let's find the values that are only present in train
set(train.made.unique()) - set(test.made.unique())

{10000}

### Thoughts:
This is definitely an anomalous value as 10000 makes no sense for a year.
Let's see which rows contain this value.

In [38]:
train[train.made == 10000]

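| | squareMeters | numberOfRooms | hasYard | hasPool | floors | cityCode | cityPartRange | numPrevOwners | made | isNewBuilt | hasStormProtector | basement | attic | garage | hasStorageRoom | hasGuestRoom | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2113 | 68038 | 41 | 0 | 0 | 54 | 87120 | 3 | 6 | 10000 | 1 | 1 | 6537 | 6304 | 366 | 0 | 0 | 6807415.1 |
| 3608 | 80062 | 81 | 1 | 0 | 35 | 67157 | 9 | 4 | 10000 | 0 | 1 | 732 | 6475 | 758 | 0 | 4 | 8007951.1 |
| 19124 | 80062 | 52 | 0 | 0 | 84 | 67099 | 9 | 4 | 10000 | 0 | 0 | 7677 | 5017 | 148 | 0 | 4 | 8007951.1 |
| 19748 | 80062 | 58 | 0 | 1 | 86 | 40408 | 7 | 8 | 10000 | 0 | 0 | 7059 | 7307 | 287 | 0 | 2 | 8007951.1 |
| 21400 | 80062 | 78 | 0 | 0 | 84 | 59457 | 4 | 7 | 10000 | 1 | 0 | 6382 | 9507 | 298 | 1 | 4 | 8007951.1 |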


### Thoughts:
We definitely need to fix this by replacing it with a valid value. Perhaps the most recent valid year?
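One way this fix might look is sketched below: any `made` value outside the set of valid years is replaced with the most recent valid year. The function name `fix_made` and the toy frames are assumptions for illustration, and choosing the most recent year (rather than, say, the median or dropping the rows) is only one option.

```python
import pandas as pd


def fix_made(df: pd.DataFrame, valid_years: list) -> pd.DataFrame:
    """Return a copy where `made` values outside `valid_years` become the latest valid year."""
    fixed = df.copy()
    latest = max(valid_years)
    # Flag every row whose year is not among the valid ones (e.g. 10000)
    mask = ~fixed["made"].isin(valid_years)
    fixed.loc[mask, "made"] = latest
    return fixed


# Toy frames standing in for train and test
toy_train = pd.DataFrame({"made": [1995, 10000, 2005]})
toy_test = pd.DataFrame({"made": [1995, 2005, 2021]})

fixed = fix_made(toy_train, toy_test["made"].unique().tolist())
print(fixed["made"].tolist())
```

On the real data, the valid years could be taken from `test.made.unique()` or `original.made.unique()`, since the anomalous value appears only in train.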

# More coming soon!