# CLEANING DATA

IN THIS NOTEBOOK WE GET RID OF ERRORS AND NOISE BY CLEANING OUT INCONSISTENCIES, DETECTING MISPLACED VALUES AND MOVING THEM INTO THE RIGHT CELLS.

#### DEALING WITH THE FOLLOWING FEATURES
* price_doc: sale price (this is the target variable)
* id: transaction id
* timestamp: date of transaction
* full_sq: total area in square meters, including loggias, balconies and other non-residential areas
* life_sq: living area in square meters, excluding loggias, balconies and other non-residential areas
* floor: for apartments, floor of the building
* max_floor: number of floors in the building
* material: wall material
* build_year: year built
* num_room: number of living rooms
* kitch_sq: kitchen area
* state: apartment condition
* product_type: owner-occupier purchase or investment
* sub_area: name of the district

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib nbagg
import xgboost as xgb
import seaborn as sns

In [2]:
train = pd.read_csv("../../Dataset/train.csv/train.csv", encoding= "utf_8")
test = pd.read_csv("../../Dataset/test.csv/test.csv", encoding= "utf_8")

# IMPORTANT NOTES ABOUT FEATURES
THESE ARE SUMMARY NOTES GATHERED FROM KAGGLE DISCUSSIONS AND FROM QUESTIONS AND ANSWERS WITH SBERBANK

* CHECK LIFE SQ, FULL SQ, KITCH SQ FOR CONSISTENCY (DONE)
* BUILD YEAR CAN BE IN THE FUTURE (PRE-INVESTMENT TYPE) (DONE)
* BUILD YEAR 0 AND 1 ARE MISTAKES (DONE)
* CHECK TRAIN AND TEST PRODUCT TYPES (DONE)
* CHECK NUM OF ROOMS FOR CONSISTENCY (DONE)
* MATERIAL EXPLAINED: 1 - panel, 2 - brick, 3 - wood, 4 - mass concrete, 5 - breezeblock, 6 - mass concrete plus brick
* STATE EXPLAINED: 4 IS BEST, 1 IS WORST
* KITCHEN IS INCLUDED IN LIFE SQ; CHECK FOR INCONSISTENCY (DONE)
* FULL SQ > LIFE SQ (MOST PROBABLY) (DONE)
* KM DISTANCES ARE AIRLINE (STRAIGHT-LINE) DISTANCES
* RAION POPUL AND FULL ALL ARE THE SAME QUANTITY CALCULATED FROM DIFFERENT SOURCES
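
The material codes in the notes above can be turned into readable labels when inspecting value counts. The following is only a minimal sketch: `MATERIAL_LABELS` and `label_material` are hypothetical names that are not part of the notebook, and the demo values are made up.

```python
import pandas as pd

# Code-to-label mapping copied from the material note above.
MATERIAL_LABELS = {
    1: "panel",
    2: "brick",
    3: "wood",
    4: "mass concrete",
    5: "breezeblock",
    6: "mass concrete plus brick",
}


def label_material(codes: pd.Series) -> pd.Series:
    """Map numeric material codes to labels; unknown or missing codes become NaN."""
    return codes.map(MATERIAL_LABELS)


# Toy values for illustration only, not taken from the dataset.
demo = pd.Series([1.0, 2.0, None, 6.0])
labels = label_material(demo)
```

With the real data, something like `label_material(train.material).value_counts()` should give the same counts as the numeric version, only with readable names.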

### FIRST SET OF FEATURES

In [3]:
first_feat = ["id","timestamp","price_doc", "full_sq", "life_sq",
"floor", "max_floor", "material", "build_year", "num_room",
"kitch_sq", "state", "product_type", "sub_area"]

In [4]:
first_feat = ["id","timestamp", "full_sq", "life_sq",
"floor", "max_floor", "material", "build_year", "num_room",
"kitch_sq", "state", "product_type", "sub_area"]

#### CORRECTION RULES FOR FULL_SQ AND LIFE_SQ (APPLY TO TRAIN AND TEST):
 * IF LIFE SQ > FULL SQ SET LIFE SQ TO NP.NAN
 * IF LIFE SQ < 5 SET LIFE SQ TO NP.NAN
 * IF FULL SQ < 5 SET FULL SQ TO NP.NAN
 * KITCH SQ MUST BE < LIFE SQ (OTHERWISE SET KITCH SQ TO NP.NAN)
 * IF KITCH SQ == 0 OR 1 SET KITCH SQ TO NP.NAN
 * CHECK FOR OUTLIERS IN LIFE SQ, FULL SQ AND KITCH SQ
 * LIFE SQ / FULL SQ MUST BE CONSISTENT (0.3 IS A CONSERVATIVE MINIMUM RATIO)
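
The threshold rules above are applied cell by cell to train and test in the cells that follow. As a rough sketch, they could also be bundled into one function that works on either frame. `clean_areas` and its `min_sq` parameter are hypothetical names, the outlier and ratio rules are left out because the notebook uses different cut-offs for train and test, and the demo frame is made up.

```python
import numpy as np
import pandas as pd


def clean_areas(df: pd.DataFrame, min_sq: float = 5.0) -> pd.DataFrame:
    """Return a copy of df with implausible area values replaced by NaN."""
    out = df.copy()
    # Living area larger than total area is treated as an error.
    out.loc[out["life_sq"] > out["full_sq"], "life_sq"] = np.nan
    # Areas below the minimum threshold are treated as missing.
    out.loc[out["life_sq"] < min_sq, "life_sq"] = np.nan
    out.loc[out["full_sq"] < min_sq, "full_sq"] = np.nan
    # The kitchen is part of the living area, so it must be smaller.
    out.loc[out["kitch_sq"] >= out["life_sq"], "kitch_sq"] = np.nan
    # Kitchen areas of 0 or 1 are treated as placeholders.
    out.loc[out["kitch_sq"].isin([0, 1]), "kitch_sq"] = np.nan
    return out


# Toy frame for illustration only.
demo = pd.DataFrame({
    "full_sq": [50.0, 40.0, 3.0, 60.0],
    "life_sq": [30.0, 45.0, 2.0, 35.0],
    "kitch_sq": [8.0, 6.0, 1.0, 40.0],
})
cleaned = clean_areas(demo)
```

Comparisons against NaN evaluate to False, so a row whose `life_sq` was already set to NaN is not touched again by the kitchen rule.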

In [5]:
bad_index = train[train.life_sq > train.full_sq].index
train.loc[bad_index, "life_sq"] = np.nan
bad_index

Index([ 1084,  1188,  1822,  1863,  2009,  4385,  6336,  6531,  6993,  7208,
        8101,  9237,  9256,  9482,  9646, 11332, 11711, 11784, 12569, 13546,
       13629, 13797, 14799, 16067, 16116, 16284, 20672, 21080, 22412, 22611,
       22804, 24296, 24428, 26264, 26342, 26363, 29302],
      dtype='int64')

In [6]:
equal_index = [601,1896,2791]
test.loc[equal_index, "life_sq"] = test.loc[equal_index, "full_sq"]
equal_index

[601, 1896, 2791]

In [7]:
bad_index = test[test.life_sq > test.full_sq].index
test.loc[bad_index, "life_sq"] = np.nan
bad_index

Index([64, 119, 171, 464, 2027, 2031, 2804, 5187, 5383], dtype='int64')

In [8]:
bad_index = train[train.life_sq < 5].index
train.loc[bad_index, "life_sq"] = np.nan
bad_index

Index([  104,   858,  1596,  2778,  3426,  3800,  4138,  4311,  5879,  9676,
       ...
       30341, 30353, 30364, 30375, 30391, 30419, 30435, 30441, 30453, 30458],
      dtype='int64', length=435)

In [9]:
bad_index = test[test.life_sq < 5].index
test.loc[bad_index, "life_sq"] = np.nan
bad_index

Index([  44,   75,   89,  188,  194,  254,  266,  270,  303,  319,
       ...
       7635, 7640, 7646, 7648, 7650, 7653, 7654, 7655, 7656, 7659],
      dtype='int64', length=330)

In [10]:
bad_index = train[train.full_sq < 5].index
train.loc[bad_index, "full_sq"] = np.nan
bad_index

Index([11332, 16289, 16738, 17194, 17932, 18600, 22171, 22412, 22722, 22795,
       22871, 23048, 23228, 23573, 23726, 24296, 24627, 24892, 25569, 25887,
       26264, 26363, 26386, 26582, 26925, 27154],
      dtype='int64')

In [11]:
bad_index = test[test.full_sq < 5].index
test.loc[bad_index, "full_sq"] = np.nan
bad_index

Index([464, 5383, 6350], dtype='int64')

In [12]:
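# Row 13117 has the build year recorded in kitch_sq; copy it into build_year.
# Its kitch_sq is set to NaN by the kitch_sq >= life_sq check in the next cell.
kitch_is_build_year = [13117]
train.loc[kitch_is_build_year, "build_year"] = train.loc[kitch_is_build_year, "kitch_sq"]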

In [13]:
bad_index = train[train.kitch_sq >= train.life_sq].index
train.loc[bad_index, "kitch_sq"] = np.nan
bad_index

Index([ 8056,  8949,  9172, 10048, 10187, 10368, 10440, 10539, 10680, 10728,
       11241, 11246, 11446, 11520, 12245, 12423, 13117, 14697, 15588, 15729,
       15930, 15970, 16412, 16888, 17901, 18404, 18792, 19396, 20053, 20258,
       20422, 21058, 21243, 21415, 21529, 22137, 22191, 22260, 22415, 22458,
       22549, 22863, 22979, 23216, 23219, 23244, 23256, 23336, 23692, 23789,
       24057, 26167, 26236, 26255, 26336, 26498, 26780, 26813, 26833, 26850,
       26899, 27118, 27429, 27650, 27931, 27996, 28268, 28314, 28434, 28709,
       28712, 28734, 28938, 28997, 29225, 29588, 29668, 30269],
      dtype='int64')

In [14]:
bad_index = test[test.kitch_sq >= test.life_sq].index
test.loc[bad_index, "kitch_sq"] = np.nan
bad_index

Index([   3,   11,   60,  136,  139,  238,  540,  907, 1429, 1792, 1916, 2816,
       2907, 3500, 3852, 3859, 4295, 4511, 4621, 4839, 4865, 5537, 5583, 5706,
       5952, 6000, 6048, 6337, 6496, 6883, 7229, 7277],
      dtype='int64')

In [15]:
bad_index = train[train.kitch_sq.isin([0, 1])].index
train.loc[bad_index, "kitch_sq"] = np.nan
bad_index

Index([ 8111,  8144,  8186,  8216,  8268,  8366,  8498,  8499,  8547,  8618,
       ...
       30449, 30450, 30451, 30453, 30455, 30458, 30459, 30464, 30465, 30468],
      dtype='int64', length=6235)

In [16]:
bad_index = test[test.kitch_sq.isin([0, 1])].index
test.loc[bad_index, "kitch_sq"] = np.nan
bad_index

Index([   1,    4,    5,    6,    7,    9,   12,   14,   15,   16,
       ...
       7646, 7648, 7649, 7650, 7653, 7654, 7655, 7656, 7658, 7659],
      dtype='int64', length=2128)

In [17]:
bad_index = train[(train.full_sq > 210) & (train.life_sq / train.full_sq < 0.3)].index
train.loc[bad_index, "full_sq"] = np.nan
bad_index

Index([1478, 1610, 2425, 2780, 3527, 5944, 7207], dtype='int64')

In [18]:
bad_index = test[(test.full_sq > 150) & (test.life_sq / test.full_sq < 0.3)].index
test.loc[bad_index, "full_sq"] = np.nan
bad_index

Index([], dtype='int64')

In [19]:
bad_index = train[train.life_sq > 300].index
train.loc[bad_index, ["life_sq", "full_sq"]] = np.nan
bad_index

Index([128, 22785, 27793], dtype='int64')

In [20]:
bad_index = test[test.life_sq > 200].index
test.loc[bad_index, ["life_sq", "full_sq"]] = np.nan
bad_index

Index([5975], dtype='int64')

#### BUILD YEAR CAN BE IN THE FUTURE (DEPENDS ON PRODUCT TYPE)
* CHECK BUILD YEAR FOR EACH PRODUCT TYPE
* CHECK BUILD YEAR FOR CONSISTENCY (IF BUILD YEAR < 1500 SET TO NP.NAN)
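
One possible way to run both checks is sketched below: implausibly old build years are replaced with NaN, then the remaining range is summarised per product type so that future years can be spotted. `clean_build_year`, `build_year_range_by_product` and `min_year` are hypothetical names, and the demo rows are made up.

```python
import numpy as np
import pandas as pd


def clean_build_year(df: pd.DataFrame, min_year: int = 1500) -> pd.DataFrame:
    """Return a copy of df where build years before min_year are set to NaN."""
    out = df.copy()
    out.loc[out["build_year"] < min_year, "build_year"] = np.nan
    return out


def build_year_range_by_product(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise min, max and non-missing count of build_year per product type."""
    return df.groupby("product_type")["build_year"].agg(["min", "max", "count"])


# Toy frame for illustration only.
demo = pd.DataFrame({
    "product_type": ["Investment", "Investment", "OwnerOccupier", "OwnerOccupier"],
    "build_year": [1975.0, 0.0, 2017.0, 1.0],
})
cleaned = clean_build_year(demo)
summary = build_year_range_by_product(cleaned)
```

A `max` well beyond the transaction dates for one product type would be consistent with the note that build years can lie in the future.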

In [21]:
train.product_type.value_counts(normalize= True)

product_type
Investment       0.638246
OwnerOccupier    0.361754
Name: proportion, dtype: float64

In [22]:
test.product_type.value_counts(normalize= True)

product_type
Investment       0.655132
OwnerOccupier    0.344868
Name: proportion, dtype: float64

In [23]:
bad_index = train[train.build_year < 1500].index
train.loc[bad_index, "build_year"] = np.nan
bad_index

Index([ 9441,  9620,  9700,  9745,  9764, 10122, 10142, 10260, 10294, 10329,
       ...
       30330, 30354, 30368, 30377, 30391, 30394, 30425, 30429, 30430, 30451],
      dtype='int64', length=903)

In [24]:
bad_index = test[test.build_year < 1500].index
test.loc[bad_index, "build_year"] = np.nan
bad_index

Index([   1,    4,    9,   11,   20,   39,   52,   58,   60,   62,
       ...
       7596, 7616, 7622, 7626, 7633, 7640, 7648, 7653, 7654, 7659],
      dtype='int64', length=558)

#### CHECK NUM OF ROOMS
* IS THERE AN OUTLIER?
* A VERY SMALL OR VERY LARGE NUMBER?
* LIFE SQ / NUM ROOM > MIN ROOM SQ (LET'S SAY 5 SQ PER ROOM MIGHT BE OK)
* IF NUM ROOM == 0 SET TO NP.NAN
* DETECT ABNORMAL NUM ROOM GIVEN LIFE SQ AND FULL SQ
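
The area-per-room rule above can be expressed as a boolean mask. This is only a sketch of one way suspicious rows might be located; it is not necessarily how the hard-coded indices in the cells below were found. `flag_small_rooms` and `min_room_sq` are hypothetical names, and the demo frame is made up.

```python
import pandas as pd


def flag_small_rooms(df: pd.DataFrame, min_room_sq: float = 5.0) -> pd.Series:
    """Flag rows whose living area per room is below min_room_sq."""
    # Treat a room count of 0 as missing so the division cannot blow up.
    rooms = df["num_room"].where(df["num_room"] > 0)
    # Comparisons involving NaN are False, so rows with missing data are not flagged.
    return (df["life_sq"] / rooms) < min_room_sq


# Toy frame for illustration only.
demo = pd.DataFrame({
    "life_sq": [30.0, 20.0, 40.0, None],
    "num_room": [2, 10, 0, 3],
})
flags = flag_small_rooms(demo)
suspicious_index = demo[flags].index
```

Rows flagged this way would still need a manual look before `num_room` is overwritten, since a small `life_sq` can be the faulty value instead.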

In [25]:
bad_index = train[train.num_room == 0].index
train.loc[bad_index, "num_room"] = np.nan

In [26]:
bad_index = test[test.num_room == 0].index
test.loc[bad_index, "num_room"] = np.nan

In [27]:
bad_index = [10076, 11621, 17764, 19390, 24007, 26713, 29172]
train.loc[bad_index, "num_room"] = np.nan
bad_index

[10076, 11621, 17764, 19390, 24007, 26713, 29172]

In [28]:
bad_index = [3174, 7313]
test.loc[bad_index, "num_room"] = np.nan
bad_index

[3174, 7313]

#### CHECK FLOOR AND MAX FLOOR
* FLOOR == 0 AND MAX FLOOR == 0: POSSIBLE? THIS DOES NOT OCCUR IN TEST, SO SET BOTH TO NP.NAN
* FLOOR == 0 OR MAX FLOOR == 0: FLOOR == 0 DOES NOT OCCUR IN TEST, SO SET IT TO NP.NAN IN TRAIN; SET MAX FLOOR == 0 TO NP.NAN IN BOTH TRAIN AND TEST
* FLOOR MUST BE <= MAX FLOOR (IF FLOOR > MAX FLOOR, SET MAX FLOOR TO NP.NAN)
* CHECK FOR OUTLIERS
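
The zero and ordering rules above could be combined into a single pass, sketched below. `clean_floors` is a hypothetical name, the demo frame is made up, and the sketch applies the `floor == 0` rule to any frame it is given, whereas the notebook applies that rule to train only. Outliers are not handled here.

```python
import numpy as np
import pandas as pd


def clean_floors(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with inconsistent floor and max_floor values set to NaN."""
    out = df.copy()
    # A floor or building height of 0 is treated as missing.
    out.loc[out["floor"] == 0, "floor"] = np.nan
    out.loc[out["max_floor"] == 0, "max_floor"] = np.nan
    # An apartment cannot sit above the top floor; distrust max_floor in that case.
    out.loc[out["floor"] > out["max_floor"], "max_floor"] = np.nan
    return out


# Toy frame for illustration only.
demo = pd.DataFrame({
    "floor": [3.0, 0.0, 10.0, 0.0],
    "max_floor": [9.0, 5.0, 5.0, 0.0],
})
cleaned = clean_floors(demo)
```

Handling the zeros first means the rows where both values are 0 end up with both columns missing, which matches the first rule in the list.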

In [29]:
bad_index = train[(train.floor == 0) & (train.max_floor == 0)].index
train.loc[bad_index, ["max_floor", "floor"]] = np.nan
bad_index

Index([15363, 17932, 21222, 23637], dtype='int64')

In [30]:
bad_index = train[train.floor == 0].index
train.loc[bad_index, "floor"] = np.nan
bad_index

Index([5085, 5333, 18669, 21921, 25424], dtype='int64')

In [31]:
bad_index = train[train.max_floor == 0].index
train.loc[bad_index, "max_floor"] = np.nan
bad_index

Index([ 8216,  8499,  8531,  8912,  9423, 10086, 10142, 10224, 10294, 10331,
       ...
       30140, 30187, 30206, 30221, 30273, 30299, 30400, 30426, 30439, 30450],
      dtype='int64', length=546)

In [32]:
bad_index = test[test.max_floor == 0].index
test.loc[bad_index, "max_floor"] = np.nan
bad_index

Index([   7,   15,   16,   68,   72,   88,  135,  166,  207,  241,
       ...
       7447, 7468, 7521, 7525, 7566, 7588, 7590, 7610, 7614, 7658],
      dtype='int64', length=233)

In [33]:
bad_index = train[train.floor > train.max_floor].index
train.loc[bad_index, "max_floor"] = np.nan
bad_index

Index([ 8268,  9161,  9257,  9309,  9388,  9412,  9442,  9452,  9482,  9561,
       ...
       30234, 30244, 30257, 30262, 30317, 30341, 30353, 30360, 30391, 30398],
      dtype='int64', length=947)

In [34]:
bad_index = test[test.floor > test.max_floor].index
test.loc[bad_index, "max_floor"] = np.nan
bad_index

Index([   5,   14,   32,   44,   73,   89,  151,  156,  177,  179,
       ...
       7612, 7616, 7622, 7630, 7640, 7646, 7648, 7649, 7653, 7659],
      dtype='int64', length=410)

In [35]:
train.floor.describe(percentiles= [0.9999])

count     30295.000000
mean          7.673081
std           5.319135
min           1.000000
50%           7.000000
99.99%       40.911800
max          77.000000
Name: floor, dtype: float64

In [36]:
bad_index = [23584]
train.loc[bad_index, "floor"] = np.nan
bad_index

[23584]

#### CHECK MATERIAL

In [37]:
train.material.value_counts()

material
1.0    14197
2.0     2993
5.0     1561
4.0     1344
6.0      803
3.0        1
Name: count, dtype: int64

In [38]:
test.material.value_counts()

material
1    5241
2     958
4     619
5     487
6     356
3       1
Name: count, dtype: int64

#### CHECK STATE

In [39]:
train.state.value_counts()

state
2.0     5844
3.0     5790
1.0     4855
4.0      422
33.0       1
Name: count, dtype: int64

In [40]:
bad_index = train[train.state == 33].index
train.loc[bad_index, "state"] = np.nan
bad_index

Index([10089], dtype='int64')

In [41]:
test.state.value_counts()

state
2.0    2662
1.0    2266
3.0    1913
4.0     127
Name: count, dtype: int64

### SAVE TEST AND TRAIN AS CLEAN

In [42]:
test.to_csv("./clean data/test_clean.csv", index= False, encoding= "utf_8")
train.to_csv("./clean data/train_clean.csv", index = False, encoding= "utf_8")