# Using Data Mining Techniques into Real Estates Industry

**Authors**:
-  _Madalina-Alina Racovita, 1st year master's student on Computational Optimization at Faculty of Computer Science, UAIC, Iasi, Romania_
-  _Buterchi Andreea, 1st year master's student on Advanced Studies in Computer Science at Faculty of Computer Science, UAIC, Iasi, Romania_

![title](real-estates.jpg)

<h1>Data Mining Laboratory - 1st Task. Exploratory Data Analysis <span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Using-Data-Mining-Techniques-into-Real-Estates-Industry" data-toc-modified-id="Using-Data-Mining-Techniques-into-Real-Estates-Industry-1">Using Data Mining Techniques into Real Estates Industry</a></span><ul class="toc-item"><li><span><a href="#Motivation" data-toc-modified-id="Motivation-1.1">Motivation</a></span></li><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.2">Introduction</a></span></li><li><span><a href="#Import-dependencies-&amp;-environment-configuration" data-toc-modified-id="Import-dependencies-&amp;-environment-configuration-1.3">Import dependencies &amp; environment configuration</a></span></li><li><span><a href="#Load-dataframes" data-toc-modified-id="Load-dataframes-1.4">Load dataframes</a></span><ul class="toc-item"><li><span><a href="#RCON-dataframes-loading" data-toc-modified-id="RCON-dataframes-loading-1.4.1">RCON dataframes loading</a></span></li><li><span><a href="#RSFR-dataframes-loading" data-toc-modified-id="RSFR-dataframes-loading-1.4.2">RSFR dataframes loading</a></span></li></ul></li><li><span><a href="#Columns-description" data-toc-modified-id="Columns-description-1.5">Columns description</a></span></li><li><span><a href="#Investigation-of-the-RCON-and-RSFR-dataset" data-toc-modified-id="Investigation-of-the-RCON-and-RSFR-dataset-1.6">Investigation of the RCON and RSFR dataset</a></span></li><li><span><a href="#Non-numerical-columns" data-toc-modified-id="Non-numerical-columns-1.7">Non-numerical columns</a></span></li></ul></li></ul></div>

## Motivation

The way the value of a house is currently set in the real estate industry is not necessarily statistically robust. A person takes part in the real estate market in one of two roles: seller or buyer. In either role, that person may undervalue or overvalue a property by reasoning subjectively.

Given the hundreds of property sales that occur in any given period, it makes sense to analyze those sales from multiple points of view in order to establish a more accurate value for a house with given characteristics. Regression analysis is a statistical approach to modeling the relationship between one or more independent (explanatory) variables (the characteristics of a house) and a dependent variable (the value or selling price of the house). Existing models have shown that regression analysis is a viable way to estimate the true value of a house. Classification can also help distinguish between different kinds of properties.


## Introduction 

This notebook covers the **exploratory data analysis**, **feature selection** and **feature engineering** required for building a regression or classification model that predicts the value of a dependent variable (for instance, the sale price of a house in the regression case, or the type of the property in the classification case) as accurately as possible by minimizing a cost function.
![title](house-banner.png)

## Import dependencies & environment configuration

In [1]:
import pandas as pd
from scipy.stats import skew
import os
pd.set_option('display.max_columns', None)

## Load dataframes

In [2]:
os.listdir('./Data')

['RCON_12011.assessor.tsv',
 'RCON_53033.assessor.tsv',
 'RSFR_12011.assessor.tsv',
 'RSFR_53033.assessor.tsv']

### RCON dataframes loading

In [4]:
df_rcon1 = pd.read_csv("./Data/RCON_12011.assessor.tsv", sep = "\t")

In [5]:
df_rcon1.shape

(21637, 63)

In [6]:
df_rcon2 = pd.read_csv("./Data/RCON_53033.assessor.tsv", sep = "\t")

In [7]:
df_rcon2.shape

(11265, 63)

We are going to **merge** these two dataframes of **RCON real estate** records (stacking their rows with `pd.concat`) so that they can be analyzed as a whole.

In [8]:
df_rcon = pd.concat([df_rcon1, df_rcon2])
df_rcon.shape

(32902, 63)
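
One detail worth keeping in mind: `pd.concat` keeps each input frame's original row labels by default, so the combined frame can contain repeated index values. The following is a minimal sketch on made-up toy frames (`part_a`, `part_b` and their values are assumptions for illustration, not the assessor data) showing how `ignore_index=True` produces a fresh index instead:

```python
import pandas as pd

# Toy stand-ins for the two county files (values are made up).
part_a = pd.DataFrame({"CountyFipsCode": [12011, 12011], "LivingSqft": [1180, 800]})
part_b = pd.DataFrame({"CountyFipsCode": [53033, 53033], "LivingSqft": [950, 1400]})

# Default behaviour: the row labels of both inputs are kept, so 0 and 1 appear twice.
stacked = pd.concat([part_a, part_b])
print(stacked.index.tolist())        # [0, 1, 0, 1]
print(stacked.index.is_unique)       # False

# ignore_index=True relabels the rows 0..n-1.
stacked_clean = pd.concat([part_a, part_b], ignore_index=True)
print(stacked_clean.index.tolist())  # [0, 1, 2, 3]
```

Whether a fresh index is preferable depends on the later steps; label-based selection such as `.loc[0]` returns several rows when labels repeat.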

In [9]:
print("Number of observations in RCON dataset: ", df_rcon.shape[0])
print("Number of predictors in RCON dataset: ", df_rcon.shape[1] - 1)

Number of observations in RCON dataset:  32902
Number of predictors in RCON dataset:  62


### RSFR dataframes loading

In [10]:
df_rsfr1 = pd.read_csv("./Data/RSFR_12011.assessor.tsv", sep = "\t")

In [11]:
df_rsfr1.shape

(32838, 63)

In [12]:
df_rsfr2 = pd.read_csv("./Data/RSFR_53033.assessor.tsv", sep = "\t")

In [13]:
df_rsfr2.shape

(53041, 63)

As with the RCON dataframes, the same step is applied to the RSFR real estate data: **the __df_rsfr1__ and __df_rsfr2__ dataframes are merged**.

In [14]:
df_rsfr = pd.concat([df_rsfr1, df_rsfr2])

In [15]:
print("Number of observations in RSFR dataset: ", df_rsfr.shape[0])
print("Number of predictors in RSFR dataset: ", df_rsfr.shape[1] - 1)

Number of observations in RSFR dataset:  85879
Number of predictors in RSFR dataset:  62


## Columns description

-  **CountyFipsCode** = five-digit Federal Information Processing Standards code which uniquely identifies counties and county equivalents in the United States
-  **BuildingCode** = 
-  **StructureCode** = 
-  **StructureNbr** = 
-  **LandSqft** = 
-  **LivingSqft** = 
-  **GarageSqft** = 
-  **BasementSqft** = 
-  **BasementFinishedSqft** = 
-  **AtticSqft** = 
-  **Bedrooms** = 
-  **TotalRooms** = 
-  **TotalBaths** = 
-  **FirePlaces** = 
-  **YearBuilt** = 
-  **EffectiveYearBuilt** = 
-  **Condition** = 
-  **ConditionCode** = 
-  **Quality** = 
-  **QualityCode** = 
-  **GarageCarportCode** = 
-  **GarageNoOfCars** = 
-  **HasPatioPorch** = 
-  **PatioPorchCode** = 
-  **HasPool** = 
-  **PoolCode** = 
-  **Zonning** = 
-  **LandValue** = 
-  **ImprovementValue** = 
-  **TotalValue** = 
-  **AssessedYear** = 
-  **PropTaxAmount** = 
-  **City** = the city where the real estate can be found
-  **State** =  the state where the real estate can be found
-  **Zip** = zip code 
-  **Latitude** = latitude coordinate for the house
-  **Longitude** = longitude coordinate for the house
-  **BuildingShapeCode** = 
-  **ConstructionCode** = 
-  **Stories** = 
-  **UnitsInBuilding** = 
-  **FoundationCode** = 
-  **ExteriorCode** = 
-  **RoofCode** = 
-  **CoolingCode** = 
-  **HeatingCode** = 
-  **HeatingSourceCode** = 
-  **IsWaterfront** = 
-  **View** = numerical variable 
-  **ViewScore** = 
-  **LastSaleDate** = 
-  **LastSalePrice** = 
-  **DocType** = 
-  **DeedType** = 
-  **TransType** = 
-  **ArmsLengthFlag** = 
-  **DistressCode** = 
-  **StatusDate** = 
-  **SellDate** = 
-  **SellPrice** = 
-  **OwnerOccupied** = 
-  **DistrsdProp** = 
-  **IsFixer**

## Investigation of the RCON and RSFR dataset

In [16]:
df_rcon.head()

Unnamed: 0,CountyFipsCode,BuildingCode,StructureCode,StructureNbr,LandSqft,LivingSqft,GarageSqft,BasementSqft,BasementFinishedSqft,AtticSqft,Bedrooms,TotalRooms,TotalBaths,FirePlaces,YearBuilt,EffectiveYearBuilt,Condition,ConditionCode,Quality,QualityCode,GarageCarportCode,GarageNoOfCars,HasPatioPorch,PatioPorchCode,HasPool,PoolCode,Zonning,LandValue,ImprovementValue,TotalValue,AssessedYear,PropTaxAmount,City,State,Zip,Latitude,Longitude,BuildingShapeCode,ConstructionCode,Stories,UnitsInBuilding,FoundationCode,ExteriorCode,RoofCode,CoolingCode,HeatingCode,HeatingSourceCode,IsWaterfront,View,ViewScore,LastSaleDate,LastSalePrice,DocType,DeedType,TransType,ArmsLengthFlag,DistressCode,StatusDate,SellDate,SellPrice,OwnerOccupied,DistrsdProp,IsFixer
0,12011,,,1,3999,1180,0,0,0,0,2,0,2.0,0,1977,0,AVE,2,QAV,6,,0,False,,False,,,6059,54499,60560,2016,1616.0,,FL,33321,26.206,-80.265,,1,0,1,,,,,,,False,,0,1998-03-27 00:00:00,40000,,,R,True,,2017-02-10 00:00:00,1998-03-27,40000.0,False,0,0.0
1,12011,,,1,3999,800,0,0,0,0,1,0,1.5,0,1973,0,AVE,2,QAV,6,,0,False,,False,,,5079,45759,50840,2016,1560.0,,FL,33319,26.171,-80.231,,1,0,1,,,,,,,False,,0,2006-10-06 00:00:00,100000,W,,R,True,,2017-02-10 00:00:00,2006-10-06,100000.0,False,0,0.0
2,12011,,,1,3999,825,0,0,0,0,2,0,1.0,0,1968,0,AVE,2,QAV,6,,0,False,,False,,RM-18,7439,66979,74420,2016,420.0,,FL,33020,26.018,-80.155,,1,0,1,,,,,,,False,,0,2003-12-05 00:00:00,78000,G,,R,True,,2017-02-10 00:00:00,2003-12-05,78000.0,True,0,0.0
3,12011,,,1,3999,750,0,0,0,0,1,0,1.5,0,1989,0,AVE,2,QAV,6,,0,False,,False,,,4929,44329,49260,2016,1300.0,,FL,33063,26.263,-80.232,,1,0,1,,,,,,,False,,0,2006-11-28 00:00:00,111500,W,,R,True,,2017-02-10 00:00:00,2006-11-28,111500.0,True,0,0.0
4,12011,,,1,3999,1250,0,0,0,0,2,0,2.0,0,1988,0,AVE,2,QAV,6,,0,False,,False,,R-4C,13959,125669,139630,2016,880.0,,FL,33442,26.297,-80.158,,1,0,1,,,,,,,False,,0,2009-03-04 00:00:00,85500,G,,R,True,S,2017-02-10 00:00:00,2009-03-04,,True,3,0.0


In [17]:
df_rcon.describe()

Unnamed: 0,CountyFipsCode,BuildingCode,StructureCode,StructureNbr,LandSqft,LivingSqft,GarageSqft,BasementSqft,BasementFinishedSqft,AtticSqft,Bedrooms,TotalRooms,TotalBaths,FirePlaces,YearBuilt,EffectiveYearBuilt,ConditionCode,QualityCode,GarageNoOfCars,PatioPorchCode,LandValue,ImprovementValue,TotalValue,AssessedYear,PropTaxAmount,City,Zip,Latitude,Longitude,BuildingShapeCode,ConstructionCode,Stories,UnitsInBuilding,FoundationCode,ExteriorCode,RoofCode,CoolingCode,HeatingSourceCode,View,ViewScore,LastSalePrice,DeedType,SellPrice,DistrsdProp,IsFixer
count,32902.0,11256.0,3.0,32902.0,32902.0,32902.0,32902.0,32902.0,32902.0,32902.0,32902.0,32902.0,32902.0,32902.0,32902.0,32902.0,32902.0,32902.0,32902.0,522.0,32902.0,32902.0,32902.0,32902.0,32893.0,0.0,32902.0,32902.0,32902.0,0.0,32902.0,32902.0,32902.0,27.0,57.0,0.0,42.0,1404.0,9879.0,32902.0,32902.0,0.0,21985.0,32902.0,32902.0
mean,26056.128868,3.052239,-1.0,0.999757,61474.12,1133.080633,11.961188,16.20792,7.072883,0.0,1.880311,0.0,1.825581,0.320862,1982.841621,0.0,2.050058,6.38177,0.0,1.994253,41177.05,181130.6,222309.4,2015.992067,2576.931475,,55379.678409,33.490786,-94.593287,,0.73342,1.235153,21.85475,8.0,5.368421,,10.857143,2.269943,9.952121,0.765729,319458.4,,207088.7,0.228527,0.0
std,19465.486063,3.453175,0.0,0.023389,176929.7,3144.911885,66.242595,115.874763,58.321785,0.0,0.708913,0.0,0.645229,0.466815,12.478862,0.0,0.232634,1.252693,0.0,0.531147,457465.7,904200.0,1259713.0,0.188168,2960.705532,,30798.861244,10.170521,19.956632,,0.528476,3.023849,46.507234,0.0,1.576972,,2.64641,0.68934,4.297152,1.070396,4340805.0,,216901.7,0.882176,0.0
min,12011.0,1.0,-1.0,0.0,3999.0,99.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1950.0,0.0,1.0,0.0,0.0,1.0,0.0,17.0,1000.0,2010.0,0.0,,0.0,25.962,-122.518,,0.0,0.0,0.0,8.0,0.0,,3.0,1.0,2.0,0.0,0.0,,2250.0,0.0,0.0
25%,12011.0,1.0,-1.0,1.0,3999.0,840.0,0.0,0.0,0.0,0.0,1.0,0.0,1.5,0.0,1974.0,0.0,2.0,6.0,0.0,2.0,7559.0,65769.0,74575.0,2016.0,935.0,,33065.0,26.144,-122.19,,0.0,0.0,1.0,8.0,5.0,,12.0,2.0,12.0,0.0,60000.0,,82500.0,0.0,0.0
50%,12011.0,1.0,-1.0,1.0,3999.0,1040.0,0.0,0.0,0.0,0.0,2.0,0.0,2.0,0.0,1980.0,0.0,2.0,6.0,0.0,2.0,15020.0,116309.0,138400.0,2016.0,1889.0,,33319.0,26.233,-80.257,,1.0,0.0,1.0,8.0,5.0,,12.0,2.0,12.0,0.0,125000.0,,159900.0,0.0,0.0
75%,53033.0,8.0,-1.0,1.0,15355.0,1270.0,0.0,0.0,0.0,0.0,2.0,0.0,2.0,1.0,1991.0,0.0,2.0,6.0,0.0,2.0,43889.0,216659.0,270000.0,2016.0,3291.0,,98032.0,47.544,-80.15,,1.0,2.0,12.0,8.0,7.0,,12.0,2.0,12.0,2.0,230000.0,,265000.0,0.0,0.0
max,53033.0,9.0,-1.0,3.0,3236574.0,408116.0,1120.0,3470.0,2730.0,0.0,9.0,0.0,7.0,1.0,2016.0,0.0,5.0,9.0,0.0,5.0,70193200.0,104794700.0,142109000.0,2016.0,97836.0,,98354.0,47.859,-80.076,,3.0,43.0,402.0,8.0,7.0,,12.0,6.0,15.0,4.0,675000000.0,,9159830.0,5.0,0.0


In [18]:
print("Number of numerical predictors RCON: ", len(list(df_rcon.describe())))

Number of numerical predictors RCON:  45
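
The count above relies on `describe()` summarising only numeric columns by default, which leaves boolean and object columns out. As a hedged alternative, the columns can be grouped by kind explicitly with `select_dtypes`. The sketch below uses a made-up toy frame (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy frame with one column of each kind seen in the assessor data (values are made up).
toy = pd.DataFrame({
    "LivingSqft": [1180, 800, 825],     # integer -> numeric
    "TotalBaths": [2.0, 1.5, 1.0],      # float   -> numeric
    "HasPool": [False, False, True],    # boolean
    "State": ["FL", "FL", "FL"],        # text
})

# "number" matches integer and float columns but not booleans.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
bool_cols = toy.select_dtypes(include="bool").columns.tolist()
# Everything that is neither numeric nor boolean (here: the text column).
other_cols = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(numeric_cols)  # ['LivingSqft', 'TotalBaths']
print(bool_cols)     # ['HasPool']
print(other_cols)    # ['State']
```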


In [19]:
df_rsfr.head()

Unnamed: 0,CountyFipsCode,BuildingCode,StructureCode,StructureNbr,LandSqft,LivingSqft,GarageSqft,BasementSqft,BasementFinishedSqft,AtticSqft,Bedrooms,TotalRooms,TotalBaths,FirePlaces,YearBuilt,EffectiveYearBuilt,Condition,ConditionCode,Quality,QualityCode,GarageCarportCode,GarageNoOfCars,HasPatioPorch,PatioPorchCode,HasPool,PoolCode,Zonning,LandValue,ImprovementValue,TotalValue,AssessedYear,PropTaxAmount,City,State,Zip,Latitude,Longitude,BuildingShapeCode,ConstructionCode,Stories,UnitsInBuilding,FoundationCode,ExteriorCode,RoofCode,CoolingCode,HeatingCode,HeatingSourceCode,IsWaterfront,View,ViewScore,LastSaleDate,LastSalePrice,DocType,DeedType,TransType,ArmsLengthFlag,DistressCode,StatusDate,SellDate,SellPrice,OwnerOccupied,DistrsdProp,IsFixer
0,12011,,,1,5250,1825,0,0,0,0,3,0,2.0,0,1989,0,AVE,3,QAV,6,,0,True,5.0,True,Y,PRD-5Q,36749,208159,244910,2016,3262.0,,FL,33322,26.149,-80.282,,1,1,1,8.0,5.0,,12.0,,,False,,1,2004-10-06 00:00:00,294000,W,,R,True,O,2017-04-15 11:48:00,2004-10-06,294000.0,True,1,0.0
1,12011,,,1,7817,971,0,0,0,0,0,0,0.0,0,1958,0,AVE,3,QAV,6,,0,False,,False,,RS-5,23449,50479,73930,2016,1794.0,,FL,33068,26.208,-80.213,,1,1,1,8.0,5.0,,12.0,,,False,,0,2008-12-09 00:00:00,81000,G,,R,True,S,2017-02-10 00:00:00,2008-12-09,81000.0,False,3,0.0
2,12011,,,1,5927,1859,0,0,0,0,3,0,2.0,0,1991,0,AVE,3,QAV,6,,0,False,,False,,PUD,40009,232599,272610,2016,3112.0,,FL,33323,26.134,-80.316,,1,1,1,8.0,5.0,,12.0,,,False,,0,2006-08-28 00:00:00,375000,W,,R,True,,2017-02-10 00:00:00,2006-08-28,375000.0,True,0,0.0
3,12011,,,1,7053,1540,0,0,0,0,4,0,2.0,0,1960,0,AVE,3,QAV,6,,0,False,,False,,R-1C,42319,148759,191080,2016,1774.0,,FL,33023,26.009,-80.214,,1,1,1,8.0,5.0,,12.0,,,False,,0,2004-03-25 00:00:00,193000,G,,R,True,,2017-02-10 00:00:00,2004-03-25,193000.0,True,0,0.0
4,12011,,,1,7931,1862,0,0,0,0,4,0,2.0,0,1977,0,AVE,3,QAV,6,,0,False,,False,,RS-5,45599,170939,216540,2016,2161.0,,FL,33322,26.152,-80.297,,1,1,1,8.0,5.0,,12.0,,,False,,0,1998-06-29 00:00:00,113000,,,R,True,,2017-02-10 00:00:00,1998-06-29,,True,0,0.0


In [20]:
df_rsfr.describe()

Unnamed: 0,CountyFipsCode,BuildingCode,StructureCode,StructureNbr,LandSqft,LivingSqft,GarageSqft,BasementSqft,BasementFinishedSqft,AtticSqft,Bedrooms,TotalRooms,TotalBaths,FirePlaces,YearBuilt,EffectiveYearBuilt,ConditionCode,QualityCode,GarageNoOfCars,PatioPorchCode,LandValue,ImprovementValue,TotalValue,AssessedYear,PropTaxAmount,City,Zip,Latitude,Longitude,BuildingShapeCode,ConstructionCode,Stories,UnitsInBuilding,FoundationCode,ExteriorCode,RoofCode,CoolingCode,HeatingSourceCode,View,ViewScore,LastSalePrice,DeedType,SellPrice,DistrsdProp,IsFixer
count,85879.0,52780.0,1.0,85879.0,85879.0,85879.0,85879.0,85879.0,85879.0,85879.0,85879.0,85879.0,85879.0,85879.0,85879.0,85879.0,85879.0,85879.0,85879.0,34458.0,85879.0,85879.0,85879.0,85879.0,85854.0,0.0,85879.0,85879.0,85879.0,0.0,85879.0,85879.0,85879.0,32697.0,32832.0,1.0,26054.0,52771.0,12803.0,85879.0,85879.0,0.0,39632.0,85879.0,85879.0
mean,37347.20445,1.022717,-1.0,1.010655,16222.82,1874.517647,211.161623,355.455431,179.267341,0.0,2.827292,0.0,1.977003,0.535474,1975.227867,0.0,3.211914,6.523073,0.0,2.770097,185035.8,268059.4,453097.1,2015.996542,5160.540091,,73126.300947,39.357506,-106.169047,,0.378812,1.386183,0.956031,8.231397,5.188048,24.0,11.994934,2.467757,4.705069,0.693336,293805.5,,382558.1,0.244833,0.0
std,19935.47468,0.416729,,0.133182,60835.78,1029.014108,281.659989,653.931705,385.812408,0.0,1.460753,0.0,1.210666,0.498743,18.869927,0.0,0.485548,1.243311,0.0,1.322711,284386.6,428053.4,637738.2,0.121223,4094.548236,,31643.595395,10.409928,20.398196,,0.487082,0.51094,0.347963,1.718927,1.963427,,0.196633,0.753664,3.734096,1.04169,4413295.0,,364530.3,0.914076,0.0
min,12011.0,1.0,-1.0,0.0,3999.0,99.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1950.0,0.0,1.0,0.0,0.0,1.0,99.0,499.0,1100.0,2010.0,0.0,,0.0,25.957,-122.526,,0.0,0.0,0.0,8.0,5.0,24.0,3.0,1.0,2.0,0.0,0.0,,2667.0,0.0,0.0
25%,12011.0,1.0,-1.0,1.0,5859.0,1264.0,0.0,0.0,0.0,0.0,2.0,0.0,1.0,0.0,1957.0,0.0,3.0,6.0,0.0,2.0,55794.0,150489.0,240000.0,2016.0,3090.0,,33313.0,26.178,-122.284,,0.0,1.0,1.0,8.0,5.0,24.0,12.0,2.0,2.0,0.0,118000.0,,211000.0,0.0,0.0
50%,53033.0,1.0,-1.0,1.0,7741.0,1650.0,0.0,0.0,0.0,0.0,3.0,0.0,2.0,1.0,1976.0,0.0,3.0,6.0,0.0,2.0,117999.0,217999.0,352000.0,2016.0,4437.0,,98027.0,47.386,-122.108,,0.0,1.0,1.0,8.0,5.0,24.0,12.0,2.0,3.0,0.0,205200.0,,305000.0,0.0,0.0
75%,53033.0,1.0,-1.0,1.0,10704.0,2262.0,440.0,670.0,0.0,0.0,4.0,0.0,3.0,1.0,1992.0,0.0,3.0,8.0,0.0,5.0,242999.0,309999.0,540000.0,2016.0,6222.0,,98092.0,47.605,-80.276,,1.0,2.0,1.0,8.0,5.0,24.0,12.0,3.0,8.0,1.0,339450.0,,447500.0,0.0,0.0
max,53033.0,9.0,-1.0,4.0,6542712.0,69854.0,5430.0,14610.0,9740.0,0.0,128.0,0.0,68.0,1.0,2016.0,0.0,6.0,10.0,0.0,5.0,26176000.0,98818000.0,124994000.0,2016.0,94182.0,,98354.0,47.778,-80.077,,3.0,7.0,68.0,21.0,29.0,24.0,12.0,6.0,15.0,3.0,1280000000.0,,22000000.0,5.0,0.0


In [21]:
print("Number of numerical predictors RSFR: ", len(list(df_rsfr.describe())))

Number of numerical predictors RSFR:  45


## Non-numerical columns

In [22]:
df_rcon.select_dtypes(include=object).columns.tolist()

['Condition',
 'Quality',
 'GarageCarportCode',
 'PoolCode',
 'Zonning',
 'State',
 'HeatingCode',
 'LastSaleDate',
 'DocType',
 'TransType',
 'DistressCode',
 'StatusDate',
 'SellDate']

In [23]:
df_rcon[df_rcon.select_dtypes(include=object).columns.tolist()].head()

Unnamed: 0,Condition,Quality,GarageCarportCode,PoolCode,Zonning,State,HeatingCode,LastSaleDate,DocType,TransType,DistressCode,StatusDate,SellDate
0,AVE,QAV,,,,FL,,1998-03-27 00:00:00,,R,,2017-02-10 00:00:00,1998-03-27
1,AVE,QAV,,,,FL,,2006-10-06 00:00:00,W,R,,2017-02-10 00:00:00,2006-10-06
2,AVE,QAV,,,RM-18,FL,,2003-12-05 00:00:00,G,R,,2017-02-10 00:00:00,2003-12-05
3,AVE,QAV,,,,FL,,2006-11-28 00:00:00,W,R,,2017-02-10 00:00:00,2006-11-28
4,AVE,QAV,,,R-4C,FL,,2009-03-04 00:00:00,G,R,S,2017-02-10 00:00:00,2009-03-04


In [24]:
df_rsfr.select_dtypes(include=object).columns.tolist() == df_rcon.select_dtypes(include=object).columns.tolist()

True

As can be seen from the output above, the same non-numerical columns appear in the RSFR dataframe.

There seems to be a pattern in the values of the __date__ columns. We are going to verify whether each value ends with a zeroed time component ('T000000'). If so, we are going to drop that part, since it carries no information for the model: the hour, minute and second at which the house was sold are all set to 0.
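
The check described above can be sketched as follows. This is only an illustration on made-up strings modelled on the `LastSaleDate` and `StatusDate` values displayed earlier; the series name `dates` and the parsing approach are assumptions, not the notebook's actual implementation:

```python
import pandas as pd

# Toy date strings modelled on the values displayed earlier (made up for illustration).
dates = pd.Series([
    "1998-03-27 00:00:00",
    "2006-10-06 00:00:00",
    "2017-04-15 11:48:00",   # carries a real time of day
    None,                    # missing value
])

# Unparseable or missing entries become NaT instead of raising.
parsed = pd.to_datetime(dates, errors="coerce")

# A timestamp has a zeroed time component when it equals its own midnight.
# NaT never compares equal, so missing values are masked out explicitly.
has_time = parsed.notna() & (parsed != parsed.dt.normalize())
print(has_time.tolist())   # [False, False, True, False]

# The time part is safe to drop only if no row carries a real time of day.
print(has_time.any())      # True
```

On this toy input the third value keeps a real time of day, so dropping the time part for the whole column would lose information.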

In [25]:
df_rsfr['SellDate']

0        2004-10-06
1        2008-12-09
2        2006-08-28
3        2004-03-25
4        1998-06-29
            ...    
53036    2004-01-08
53037    1987-11-09
53038    1989-08-09
53039    1992-04-28
53040    2006-07-11
Name: SellDate, Length: 85879, dtype: object

In [26]:
skew(df_rcon1['CountyFipsCode'])

0.0

In [27]:
skew(df_rcon1['StructureCode'])

nan
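
The `nan` returned for `StructureCode` is consistent with missing values in that column: `scipy.stats.skew` propagates NaN by default (`nan_policy='propagate'`). One hedged alternative, sketched here on a made-up toy column, is the pandas `Series.skew` method, which skips missing values by default. It uses a bias-corrected estimator, so its values can differ from `scipy.stats.skew` with default settings, and it still returns NaN when fewer than three non-missing values are available.

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry (values are made up).
col = pd.Series([1.0, 2.0, np.nan, 10.0])

# Series.skew skips missing values by default (skipna=True).
skew_skipping_nan = col.skew()

# Same result as dropping the missing values first.
skew_after_dropna = col.dropna().skew()

print(skew_skipping_nan, skew_after_dropna)
```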

In [28]:
print(df_rcon1.dtypes)

CountyFipsCode      int64
BuildingCode      float64
StructureCode     float64
StructureNbr        int64
LandSqft            int64
                   ...   
SellDate           object
SellPrice         float64
OwnerOccupied        bool
DistrsdProp         int64
IsFixer           float64
Length: 63, dtype: object


In [29]:
df_rcon1['BuildingCode'].value_counts(normalize=True)

Series([], Name: BuildingCode, dtype: float64)

In [30]:
# del df_rcon1['BuildingCode']

In [31]:
for obs in list(df_rcon1['StructureCode']):
    if not pd.isnull(obs):
        print(obs)

-1.0


In [32]:
# del df_rcon1['StructureCode']
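
Instead of inspecting and deleting columns one at a time, candidates for removal can be listed in one pass. The following is a minimal sketch on a made-up toy frame (the frame `toy`, its values, and the final clean-up step are assumptions for illustration) that finds columns that are entirely missing or hold a single distinct value:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the situations seen above (values are made up).
toy = pd.DataFrame({
    "BuildingCode": [np.nan, np.nan, np.nan],   # entirely missing
    "StructureCode": [np.nan, -1.0, np.nan],    # one distinct non-missing value
    "AtticSqft": [0, 0, 0],                     # constant
    "LivingSqft": [1180, 800, 825],             # informative
})

# Columns with no observed value at all.
empty_cols = toy.columns[toy.isna().all()].tolist()

# Columns whose non-missing values are all identical (nunique ignores NaN by default).
n_distinct = toy.nunique()
constant_cols = n_distinct[n_distinct == 1].index.tolist()

print(empty_cols)      # ['BuildingCode']
print(constant_cols)   # ['StructureCode', 'AtticSqft']

# Hypothetical clean-up step: drop both groups at once.
reduced = toy.drop(columns=empty_cols + constant_cols)
print(reduced.columns.tolist())   # ['LivingSqft']
```

A column with a single observed value and many missing entries may still carry information through its missingness pattern, so the resulting lists are best reviewed before anything is dropped.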