## 1. Data wrangling

## 1.1 Contents<a id='2.1_Contents'></a>
* [1 Data wrangling](#2_Data_wrangling)
  * [1.1 Contents](#2.1_Contents)
  * [1.2 Introduction](#2.2_Introduction)
  * [1.3 Imports](#2.3_Imports)
  * [1.4 Load The House Price Data](#2.4_Load_The_House_Price_Data)
  * [1.5 Filtering Single Family Property Type](#2.5_Filtering_Single_Family_Property_Type) 
  * [1.6 Missing Values](#2.6_Missing_Values) 
    * [1.6.1 Features With > 90% Missing Values](#1.6.1_Features_With_>90%_Missing_Values)
    * [1.6.2 Features With 80%-90% Missing Values](#1.6.2_Features_With_80%_-_90%_Missing_Values)
    * [1.6.3 Features With 70%-80% Missing Values](#1.6.3_Features_With_70%_-_80%_Missing_Values)
    * [1.6.4 Features With 50%-70% Missing Values](#1.6.4_Features_With_50%_-_70%_Missing_Values)
    * [1.6.5 Features With 10%-50% Missing Values](#1.6.5_Features_With_10%_-_50%_Missing_Values)
      * [1.6.5.1 Garage](#1.6.5.1_Garage)
      * [1.6.5.2 Living](#1.6.5.2_Living) 
      * [1.6.5.3 Dining](#1.6.5.3_Dining) 
      * [1.6.5.4 Kitchen](#1.6.5.4_Kitchen)
  * [1.7 Subdivisions And Their Facts](#1.7_Subdivisions_And_their_Facts)
  * [1.8 Fill Null For Kitchen, Dining and Living](#1.8_Fill_Null_For_Kitchen_Dining_Living)
  * [1.9 Listing Price](#1.9_Listing_Price)
  * [1.10 Bedrooms](#1.10_Bedrooms)     
  * [1.11 Bathrooms](#1.11_Bathrooms)
  * [1.12 Stories](#1.12_Stories)
  * [1.13 Style](#1.13_Style)
  * [1.14 Year Built](#1.14_Year_Built)
  * [1.15 Building Sqft](#1.15_Building_Sqft)
  * [1.16 Lot Size](#1.16_Lot_Size)
  * [1.17 Maintenance Fee](#1.17_Maintenance_Fee)
  * [1.18 Fireplace](#1.18_Fireplace)
  * [1.19 HOA Mandatory](#1.19_HOA_Mandatory)
  * [1.20 Other Fees](#1.20_Other_Fees)
  * [1.21 Roof](#1.21_Roof)

## 1.2 Introduction<a id='2.2_Introduction'></a>

In this section I investigate data scraped from www.HAR.com. Data cleaning is done at this stage because most columns are stored as text and need to be converted to numeric values. I will remove features with lots of missing values and create new features.

## 1.3 Imports<a id='2.3_Imports'></a>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import substring
import os
import re
from fuzzywuzzy import process
import warnings
warnings.filterwarnings('ignore')

## 1.4 Load The House Price Data<a id='2.4_Load_The_House_Price_Data'></a>

In [2]:
data= pd.read_csv('../data/raw/Houston_Home_List.csv',encoding = "ISO-8859-1")
print('data shape is:',data.shape)

data shape is: (15102, 101)


In [3]:
data.columns

Index(['Unnamed: 0', 'image_link', 'Listing Price:', 'Address:', 'City:',
       'State:', 'Zip Code:', 'County:', 'Subdivision:', 'Legal Description:',
       ...
       'Extra Room:', 'Wine Room:', 'Carport Description:',
       'Median Appraised Value / Square ft.:', 'Den:', 'Utility Room Desc:',
       'Sunroom:', 'Guest Suite:', 'Bath:', 'Garage Apartment:'],
      dtype='object', length=101)

## 1.5 Filtering Single Family Property Type<a id='2.5_Filtering_Single_Family_Property_Type'></a>

Since we are going to analyze images and other house features, it is important that all records be as similar as possible. For example, lots have no images of buildings or rooms, and the features of multi-family properties differ from those of single-family homes. Let's see which property types the dataset contains:

In [4]:
data['Property Type:'].value_counts()

Single-Family                          11141
Lots                                    1551
Townhouse/Condo - Townhouse              950
Townhouse/Condo - Condominium            594
Mid/Hi-Rise Condo                        436
Country Homes/Acreage                    154
Multi-Family - Duplex                    107
Multi-Family - Fourplex                   46
Country Homes/Acreage - Free Standi       46
Multi-Family - 5 Plus                     38
Multi-Family - Triplex                    15
Multi-Family - Multiple Detached Dw        9
Country Homes/Acreage - Manufacture        4
Lot & Acreage - Residential                3
Residential - Condo                        2
Residential - Townhouse                    1
Single Family                              1
Name: Property Type:, dtype: int64

The majority of properties are single-family, so I keep them and remove the rest of the types.

In [5]:
# reset_index returns a new dataframe, so later in-place edits do not touch `data`
single_family_df = data[data['Property Type:']=='Single-Family'].reset_index(drop=True)
len(single_family_df)

11141
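
The counts above include one row labelled `Single Family` (no hyphen), which an exact match on `'Single-Family'` skips. A minimal sketch of one way to catch such spelling variants, assuming it is acceptable to normalize case, spaces and hyphens before comparing. The toy frame and the padded lower-case variant are made up for illustration:

```python
import pandas as pd

# Toy data: the first two labels appear in the counts above,
# the padded lower-case variant is hypothetical.
toy = pd.DataFrame({'Property Type:': ['Single-Family', 'Single Family',
                                       'Lots', ' single-family ']})

# Normalize case and treat runs of spaces/hyphens as a single space
normalized = (toy['Property Type:']
              .str.strip()
              .str.lower()
              .str.replace(r'[\s\-]+', ' ', regex=True))

single_family_toy = toy[normalized == 'single family'].reset_index(drop=True)
print(len(single_family_toy))
```

Whether the extra row belongs in the analysis is a judgment call; the notebook itself keeps the exact match.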

In [6]:
single_family_df.head()

Unnamed: 0.1,Unnamed: 0,image_link,Listing Price:,Address:,City:,State:,Zip Code:,County:,Subdivision:,Legal Description:,...,Extra Room:,Wine Room:,Carport Description:,Median Appraised Value / Square ft.:,Den:,Utility Room Desc:,Sunroom:,Guest Suite:,Bath:,Garage Apartment:
0,85,['https://photos.harstatic.com/190618667/hr/im...,"$ 575,000 ($232.98/sqft.) $Convert",1316 Hadley Street,Houston,TX,77002,Harris County,Austin Hadley Place,LT 4 BLK 1 AUSTIN HADLEY PLACE,...,,,,,,,,,,
1,88,['https://photos.harstatic.com/190420550/hr/im...,"$ 465,000 ($221.85/sqft.) $Convert",110 Pierce Street,Houston,TX,77002,Harris County,Modern Midtown (View subdivision price trend),LT 12 BLK 1 MODERN MIDTOWN,...,,,,$223.83,,,,,,
2,89,['https://photos.harstatic.com/190088153/hr/im...,"$ 450,000 ($223.33/sqft.) $Convert",118 Pierce Street,Houston,TX,77002,Harris County,Modern Midtown (View subdivision price trend),LT 8 BLK 1 MODERN MIDTOWN,...,,,,$223.83,,,,,,
3,99,['https://photos.harstatic.com/189387790/hr/im...,"$ 259,000 ($203.30/sqft.) $Convert",311 N Milby Street,Houston,TX,77003,Harris County,Merkels Sec 01 (View subdivision price trend),LT 3 BLK 15 MERKELS SEC 1,...,,,,,"['12 x 17, 1st', '12 , 17, 1st']","['12 x 7, 1st', '12 , 7, 1st']",,,,
4,108,['https://photos.harstatic.com/177650081/hr/im...,"$ 236,999 ($196.19/sqft.) $Convert \r\n\r\n\r...",216 Hutcheson,Houston,TX,77003,Harris County,MERKELS (View subdivision price trend),LT 9 BLK 5 MERKELS SEC 1,...,,,,,,,,,,


In our dataset, `State` and `Property Type` are the same for every house, so we can remove them (along with the leftover index column `Unnamed: 0`):

In [7]:
single_family_df.drop(['Unnamed: 0','State:','Property Type:'],axis=1,inplace=True)

## 1.6 Missing Values<a id='2.6_Missing_Values'></a>

In [8]:
# function to find missing values, returning count and %
def missing_cal(df):
    """Calculate the count and percentage of missing values
    for each column of the dataframe passed in as parameter"""
    missing = pd.concat([df.isnull().sum(), 100 * df.isnull().mean()], axis=1)
    missing.columns=['count', '%']
    # rows keep the dataframe's column order; callers sort where needed
    return missing

In [9]:
missing = missing_cal(single_family_df)
missing

Unnamed: 0,count,%
image_link,0,0.000000
Listing Price:,3,0.026928
Address:,0,0.000000
City:,0,0.000000
Zip Code:,0,0.000000
...,...,...
Utility Room Desc:,7178,64.428687
Sunroom:,10909,97.917602
Guest Suite:,11008,98.806211
Bath:,9449,84.812853
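
The following sub-sections walk through the features band by band. As a rough sketch of that idea, the bands can also be assigned in one step with `pd.cut`. The toy frame, the helper name `missing_summary` and the bucket labels are assumptions for illustration, not part of the notebook:

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values per column."""
    out = pd.concat([df.isnull().sum(), 100 * df.isnull().mean()], axis=1)
    out.columns = ['count', '%']
    return out

# Toy frame (illustrative data only)
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],                      # 0% missing
    'b': [1, np.nan, np.nan, np.nan],       # 75% missing
    'c': [np.nan, np.nan, 3, 4],            # 50% missing
})

summary = missing_summary(toy)

# Bucket edges mirror the bands used in the sub-sections;
# intervals are right-closed, e.g. (70, 80].
bins = [-0.1, 10, 50, 70, 80, 90, 100]
labels = ['<=10%', '10-50%', '50-70%', '70-80%', '80-90%', '>90%']
summary['bucket'] = pd.cut(summary['%'], bins=bins, labels=labels)
print(summary)
```

One design point: right-closed intervals put a feature with exactly 50% or 70% missing into a bucket, whereas a pair of strict comparisons such as `> 50` and `< 70` would leave it out of every band.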


## 1.6.1 Features With >90% Missing Values<a id='1.6.1_Features_With_>90%_Missing_Values'></a>

Let's take a look at features with more than 90% missing values: 

In [10]:
missing = missing_cal(single_family_df)
nan_90 = missing.loc[missing['%']>90].index
print('Number of Features with more than 90% None: ',len(nan_90))

Number of Features with more than 90% None:  9


In [11]:
missing.loc[nan_90].sort_values(by="%")

Unnamed: 0,count,%
Extra Room:,10068,90.368908
Median Appraised Value / Square ft.:,10217,91.70631
Media Room:,10254,92.038417
Carport Description:,10523,94.452922
Water Amenity:,10747,96.463513
Garage Apartment:,10822,97.136702
Sunroom:,10909,97.917602
Wine Room:,11002,98.752356
Guest Suite:,11008,98.806211


We need to see what kind of information each of these features holds:

In [12]:
for item in nan_90:
    print('Value Count for '+item)
    print(single_family_df[item].value_counts())
    print('-'*100)

Value Count for Media Room:
['2nd', '2nd']                          27
['16x12, 2nd', '4.88 x 3.66(m)']        17
['14x13, 2nd', '4.27 x 3.96(m)']        15
['15x13, 2nd', '4.57 x 3.96(m)']        15
['18x12, 2nd', '5.49 x 3.66(m)']        13
                                        ..
['11 x 14, 2nd', '11 , 14, 2nd']         1
["17'4X20'3, 2nd", "17'4,20'3, 2nd"]     1
['17x19, 3rd', '5.18 x 5.79(m)']         1
['10X22\'7", 3rd', '3.05(m)']            1
['26x18, 2nd', '7.92 x 5.49(m)']         1
Name: Media Room:, Length: 454, dtype: int64
----------------------------------------------------------------------------------------------------
Value Count for Water Amenity:
Lake View                                                     119
Lake View, Lakefront                                           83
Pond                                                           55
Lakefront                                                      48
Bayou Frontage, Bayou View                                

* Values for `Media Room`, `Extra Room`, `Wine Room`, `Sunroom`, `Guest Suite` and `Garage Apartment` are mostly the dimensions of each of those rooms, along with some meaningless values (such as `Yes` for `Garage Apartment`).
* `Water Amenity` has too many unique categories, and there is no way to fill the remaining missing values with the correct category.
* `Carport Description` has 3 different categories covering a total of 611 houses; the rest do not have a carport, so I will fill all missing values with a new category, 'not applicable'.
* `Median Appraised Value / Square ft.:` is a fact (based on active listings) for each subdivision and can be filled with the value from the same subdivision.

In [13]:
# Replacing missing values for 'Carport Description:' with 'not applicable'
# (.loc avoids chained assignment, which may silently fail to update the dataframe)
single_family_df.loc[single_family_df['Carport Description:'].isnull(), 'Carport Description:'] = 'not applicable'

# Dropping 'Media Room:', 'Water Amenity:', 'Extra Room:', 'Wine Room:', 
#'Sunroom:', 'Guest Suite:', 'Garage Apartment:'
single_family_df.drop(['Media Room:', 'Water Amenity:', 'Extra Room:', 'Wine Room:', 'Sunroom:', 'Guest Suite:', 
                       'Garage Apartment:'],axis=1,inplace=True)

## 1.6.2 Features With 80%-90% Missing Values<a id='1.6.2_Features_With_80%_-_90%_Missing_Values'></a>

The next step is to look at the features with more than 80% missing values:

In [14]:
missing = missing_cal(single_family_df)
nan_80 = missing.loc[missing['%']>80].index
print('Number of Features with more than 80% None: ',len(nan_80))

Number of Features with more than 80% None:  14


In [15]:
missing.loc[nan_80].sort_values(by="%")

Unnamed: 0,count,%
Average Square Ft.:,9412,84.480747
Average Price/Square Ft.:,9412,84.480747
Market Area Name:,9413,84.489723
Home For Sales:,9413,84.489723
Average List Price:,9413,84.489723
Home For Lease:,9413,84.489723
Average Lease:,9413,84.489723
Average Lease/Square Ft.:,9413,84.489723
Bath:,9449,84.812853
Den:,9486,85.14496


In [16]:
#printing value count for each feature with more than 80 none value
for item in nan_80:
    print('Value Count for '+item)
    print(single_family_df[item].value_counts())
    print('-'*100)

Value Count for Private Pool Desc:
In Ground                                355
Gunite, In Ground                        344
Gunite, Heated, In Ground                234
Gunite                                   217
Heated, In Ground                         92
Gunite, Heated, In Ground, Salt Water     46
Gunite, Heated                            40
Above Ground                              28
Heated, In Ground, Salt Water             21
Gunite, In Ground, Salt Water             20
In Ground, Salt Water                     16
Gunite, Salt Water                        13
Gunite, Heated, Salt Water                10
Enclosed, Heated, In Ground                8
Fiberglass, In Ground                      5
Heated                                     5
In Ground, Vinyl Lined                     5
Enclosed, In Ground                        4
Fiberglass                                 4
Salt Water                                 4
Enclosed, Gunite, In Ground                2
Heated, Salt Water  

* `Controlled Access` categories are a mix of 'Automatic', 'Driveway', 'Manned' and 'Intercom', which makes me believe the rest of the houses do not have any type of controlled access. I think filling missing values with 'no controlled access' is reasonable.
* As with `Water Amenity`, there are many categories for `Private Pool Desc`. Counting each category within the `Private Pool:` groups shows that some houses without a private pool still have a pool description. This probably happened by mistake, so I decided to drop this column.
* `Master Planned Community` and `Market Area Name` categories seem to duplicate the subdivision name; we will deal with them later in the subdivision section.
* `Home For Sales`, `Average List Price`, `Average Square Ft.`, `Average Price/Square Ft.`, `Home For Lease`, `Average Lease` and `Average Lease/Square Ft.` are facts (based on active listings) for each subdivision and can be filled with the value from the same subdivision.
* `Den` and `Bath` hold dimensions along with other values such as '1th', which I think is a typo, so I decided to drop them.

In [17]:
#counting 'Private Pool Desc:' category for `Private Pool:` groups
single_family_df.groupby('Private Pool:')['Private Pool Desc:'].value_counts()

Private Pool:  Private Pool Desc:                   
No             In Ground                                 14
               Enclosed, Heated, In Ground                6
               Above Ground                               4
               Heated, In Ground                          4
               Gunite                                     3
               Gunite, In Ground                          3
               Fiberglass                                 2
               Gunite, Heated, In Ground                  1
Yes            Gunite, In Ground                        341
               In Ground                                341
               Gunite, Heated, In Ground                233
               Gunite                                   214
               Heated, In Ground                         88
               Gunite, Heated, In Ground, Salt Water     46
               Gunite, Heated                            40
               Above Ground                    

In [18]:
single_family_df.drop(['Private Pool Desc:','Bath:','Den:'],axis=1,inplace=True)
# Replacing missing values for 'Controlled Access:' with 'no controlled access'
single_family_df.loc[single_family_df['Controlled Access:'].isnull(), 'Controlled Access:'] = 'no controlled access'

## 1.6.3 Features With 70%-80% Missing Values<a id='1.6.3_Features_With_70%_-_80%_Missing_Values'></a>

Now I investigate features with between 70% and 80% missing values:

In [19]:
missing = missing_cal(single_family_df)
nan_70 = missing.loc[((missing['%']>70 )& (missing['%']<80))].index
print('Number of Features with more than 70% None: ',len(nan_70))

Number of Features with more than 70% None:  2


In [20]:
missing.loc[nan_70].sort_values(by="%")

Unnamed: 0,count,%
Family Room:,7858,70.532268
Primary Bath:,8315,74.634234


In [21]:
#printing value count for each feature with more than 70 none value
for item in nan_70:
    print('Value Count for '+item)
    print(single_family_df[item].value_counts())
    print('-'*100)

Value Count for Family Room:
['1st', '1st']                                  75
['18x16, 1st', '5.49 x 4.88(m)']                52
['20x16, 1st', '6.10 x 4.88(m)']                51
['21x17, 1st', '6.40 x 5.18(m)']                39
['15x15, 1st', '4.57 x 4.57(m)']                37
                                                ..
["16'10X18'1, 1st", "16'10,18'1, 1st"]           1
['30x20, 1st', '9.14 x 6.10(m)']                 1
["23'x16', 2nd", "23',16', 2nd"]                 1
['23\'8" X 18\', 1st', '23\'8" , 18\', 1st']     1
['28 x 21, 1st', '28 , 21, 1st']                 1
Name: Family Room:, Length: 939, dtype: int64
----------------------------------------------------------------------------------------------------
Value Count for Primary Bath:
['1st', '1st']                                                     765
['2nd', '2nd']                                                     308
['3rd', '3rd']                                                     128
['12x10, 1st', '3.66

`Family Room` and `Primary Bath` hold the dimensions of the family room and the master bathroom. Every house should have these rooms, so the missing values cannot be treated as 0. Since I cannot fill in values for more than 70% of the houses, dropping these features is appropriate.

In [22]:
single_family_df.drop(['Family Room:','Primary Bath:'],axis=1,inplace=True)

## 1.6.4 Features With 50%-70% Missing Values<a id='1.6.4_Features_With_50%_-_70%_Missing_Values'></a>

The next step is to look at features with between 50% and 70% missing values:

In [23]:
missing = missing_cal(single_family_df)
nan_50_70 = missing.loc[((missing['%']>50 )& (missing['%']<70))].index
print('Number of Features with more than 50% and less than 70% None: ',len(nan_50_70))

Number of Features with more than 50% and less than 70% None:  6


In [24]:
missing.loc[nan_50_70].sort_values(by="%")

Unnamed: 0,count,%
Front Door:,6471,58.082757
Breakfast:,6724,60.353649
Garage Carport:,6734,60.443407
Utility Room Desc:,7178,64.428687
Game Room:,7367,66.125123
Study/Library:,7658,68.737097


In [25]:
#printing value count for each feature with more than 50 none value
for item in nan_50_70:
    print('Value Count for '+item)
    print(single_family_df[item].value_counts())
    print('-'*100)

Value Count for Breakfast:
['1st', '1st']                              192
['10x10, 1st', '3.05 x 3.05(m)']            155
['11x10, 1st', '3.35 x 3.05(m)']            149
['12x10, 1st', '3.66 x 3.05(m)']            147
['10x9, 1st', '3.05 x 2.74(m)']             108
                                           ... 
['10 x 9-8, 1st', '10 , 9-8, 1st']            1
['19\'6"x11\', 1st', '19\'6",11\', 1st']      1
['12.5X13, 1st', '3.81 x 3.96(m)']            1
['8 X 13, 1st', '8 , 13, 1st']                1
['11 x 19, 1st', '11 , 19, 1st']              1
Name: Breakfast:, Length: 769, dtype: int64
----------------------------------------------------------------------------------------------------
Value Count for Game Room:
['2nd', '2nd']                          72
['18x14, 2nd', '5.49 x 4.27(m)']        52
['19x16, 2nd', '5.79 x 4.88(m)']        44
['18x16, 2nd', '5.49 x 4.88(m)']        37
['16x14, 2nd', '4.88 x 4.27(m)']        37
                                        ..
['38x22, 2nd', 

We can calculate the area of `Utility Room`, `Study/Library`, `Game Room` and `Breakfast` from the dimensions we have, and assume the missing values are zero for houses that do not have these rooms. I drop `Garage Carport` and `Front Door` since there is no information for the rest of the houses.

In [26]:
def area_calc(item, pattern=r"([\d.]+)(?:.*?([\d.]+))?.*?[x\*].*?([\d.]+)(?:.*?([\d.]+))?"):
    """Calculate a room's area (sqft) from the dimension strings passed in as parameter.
    The regular expression finds the dimensions and groups them as feet and inches.
    Returns None for non-list input (e.g. NaN) or when a dimension cannot be parsed."""
    regex = re.compile(pattern, re.IGNORECASE)
    if not isinstance(item, list):
        return None
    # keep only imperial entries: they contain 'x' or '*' and are not the '(m)' conversion
    dims = [i.replace('[', '').strip() for i in item
            if ('x' in i.lower() or '*' in i) and '(m)' not in i]
    area = 0
    for d in dims:
        match = regex.findall(d.replace(' ', ''))
        try:
            # groups are (feet, inches, feet, inches); an empty group counts as 0
            feet_inches = [float(g) if g else 0 for g in match[0]]
        except (IndexError, ValueError):
            return None
        area += (feet_inches[0] + feet_inches[1] / 12) * (feet_inches[2] + feet_inches[3] / 12)
    return area
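
To make the feet-and-inches arithmetic concrete, here is a minimal sketch that is separate from `area_calc`. The helper name `dim_to_sqft` and its simplified regular expression are assumptions for illustration; it only handles a single imperial string such as `12x14` or `13X10'6`, and the caller is expected to skip the metric `(m)` entries:

```python
import re

# One side of a dimension: feet, optionally followed by 'inches (e.g. 10'6)
_SIDE = r"(\d+(?:\.\d+)?)(?:'(\d+(?:\.\d+)?))?"
# Two sides separated by 'x', 'X' or '*'
_DIM = re.compile(_SIDE + r"[^0-9x*]*[x*][^0-9]*" + _SIDE, re.IGNORECASE)

def dim_to_sqft(text):
    """Return the area in square feet, or None if no dimension is found."""
    m = _DIM.search(text.replace(' ', ''))
    if m is None:
        return None
    ft1, in1, ft2, in2 = (float(g) if g else 0.0 for g in m.groups())
    # inches are converted to fractions of a foot before multiplying
    return (ft1 + in1 / 12) * (ft2 + in2 / 12)

for sample in ["12x14", "13X10'6", "17'4X20'3", "1st"]:
    print(sample, dim_to_sqft(sample))
```

For example, `13X10'6` is read as 13 ft by 10 ft 6 in, so the area is 13 × 10.5 square feet, while a floor label such as `1st` has no separator and yields `None`.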

In [27]:
# Calculating area for Utility room
single_family_df['UtilitySqft'] = single_family_df['Utility Room Desc:'].str.split(',')
single_family_df['UtilitySqft'] = single_family_df['UtilitySqft'].apply(area_calc)

# Calculating area for Study/Library room
single_family_df['StudySqft'] = single_family_df['Study/Library:'].str.split(',')
single_family_df['StudySqft'] = single_family_df['StudySqft'].apply(area_calc)

# Calculating area for Game room
single_family_df['GameSqft'] = single_family_df['Game Room:'].str.split(',')
single_family_df['GameSqft'] = single_family_df['GameSqft'].apply(area_calc)

# Calculating area for Breakfast
single_family_df['BreakfastSqft'] = single_family_df['Breakfast:'].str.split(',')
single_family_df['BreakfastSqft'] = single_family_df['BreakfastSqft'].apply(area_calc)

single_family_df.update(single_family_df[['UtilitySqft','StudySqft','GameSqft','BreakfastSqft']].fillna(0))

In [28]:
single_family_df.drop(list(nan_50_70),axis=1,inplace=True)

So far I have investigated features with more than 50% missing values. I still need to dig deeper and fill in the features that are facts (based on active listings) for each subdivision, such as `Home For Sales`, `Average List Price`, `Average Square Ft.`, `Average Price/Square Ft.`, `Home For Lease`, `Average Lease` and `Average Lease/Square Ft.`. Before that, let's take a look at features with more than 10% missing values:

## 1.6.5 Features With 10%-50% Missing Values<a id='1.6.5_Features_With_10%_-_50%_Missing_Values'></a>

In [29]:
missing = missing_cal(single_family_df)
nan_10_50 = missing.loc[((missing['%']>10 )& (missing['%']<50))].index
print('Number of Features with more than 10% and less than 50% None: ',len(nan_10_50))

Number of Features with more than 10% and less than 50% None:  35


In [30]:
missing.loc[nan_10_50].sort_values(by="%")

Unnamed: 0,count,%
Garage(s):,1292,11.596805
Tax Rate:,1323,11.875056
Dishwasher:,1413,12.682883
Bedroom Desc:,1682,15.097388
Median Appraised Value:,1746,15.671843
Median Year Built:,1746,15.671843
Median Lot Square Ft.:,1746,15.671843
Median Square Ft.:,1746,15.671843
Single Family Properties:,1746,15.671843
County / Zip Code:,1746,15.671843


In [31]:
#printing value counts for each feature with 10%-50% missing values
for item in nan_10_50:
    print('Value Count for '+item)
    print(single_family_df[item].value_counts())
    print('-'*100)

Value Count for Garage(s):
2 / Attached                                    5626
2 / Detached                                    1063
3 / Attached                                     686
1 / Detached                                     341
1 / Attached                                     320
                                                ... 
6 / Attached                                       1
5 / Attached,Attached/Detached,Detached,Over       1
8 / Attached,Attached/Detached,Oversized           1
4 / Detached,Tandem                                1
4 / Attached/Detached,Detached                     1
Name: Garage(s):, Length: 114, dtype: int64
----------------------------------------------------------------------------------------------------
Value Count for Living:
['15x13, 1st', '4.57 x 3.96(m)']         58
['20x15, 1st', '6.10 x 4.57(m)']         57
['14x12, 1st', '4.27 x 3.66(m)']         57
['20x16, 1st', '6.10 x 4.88(m)']         55
['18x15, 1st', '5.49 x 4.57(m)']         55
 

`Room Description`, `Countertop`, `Floors`, `Bedroom Desc`, `Kitchen Desc`, `Bathroom Description`, `Connections`, `Oven`, `Range`, `Energy Feature`, `Interior`, `Exterior` and `Financing Considered` are purely descriptive. We cannot fill them with guessed values, since those may not be accurate, and I don't think they are relevant to our analysis, so I will drop all of them.
We also do not need `County / Zip Code:` since we already have another column for zip codes.
I could not find any information about `Single Family Properties:`, so this column will be dropped as well.

In [32]:
single_family_df.drop(['Room Description:', 'Countertop:', 'Floors:', 'Bedroom Desc:', 'Kitchen Desc:', 
                       'Bathroom Description:','Connections:', 'Oven:', 'Range:', 'Energy Feature:',
                       'Interior:', 'Exterior:', 'Financing Considered:','Single Family Properties:','County / Zip Code:'], axis=1,inplace=True)

`Ice Maker`, `Microwave`, `Compactor`, `Dishwasher`, `Disposal` and `Area Pool` are 'Yes/No' categories, and I think it is reasonable to fill missing values with 'No'. This is a little optimistic, since some houses may have those features and the owner or agent simply forgot to record them, but for now filling with 'No' is the best way to deal with them.

In [33]:
# fill all Yes/No columns in one step (avoids chained assignment)
yes_no_cols = ['Disposal:', 'Ice Maker:', 'Compactor:', 'Area Pool:', 'Microwave:', 'Dishwasher:']
single_family_df[yes_no_cols] = single_family_df[yes_no_cols].fillna('No')

At this point I investigate the other features individually:

## 1.6.5.1 Garage<a id='1.6.5.1_Garage'></a>

In [34]:
single_family_df['Garage(s):'].value_counts()

2 / Attached                                    5626
2 / Detached                                    1063
3 / Attached                                     686
1 / Detached                                     341
1 / Attached                                     320
                                                ... 
6 / Attached                                       1
5 / Attached,Attached/Detached,Detached,Over       1
8 / Attached,Attached/Detached,Oversized           1
4 / Detached,Tandem                                1
4 / Attached/Detached,Detached                     1
Name: Garage(s):, Length: 114, dtype: int64

The important part of this feature is the number of garages each house has. Since two-car garages are by far the most common value among these single-family homes, it is reasonable to fill missing values with '2'.

In [35]:
single_family_df['Garage(s):'] = single_family_df['Garage(s):'].fillna('2')
# the garage count is the first space-separated token, e.g. '2 / Attached' -> 2
single_family_df['Garage'] = single_family_df['Garage(s):'].str.split(' ').str[0].astype(int)
single_family_df.drop('Garage(s):',axis=1,inplace=True)
single_family_df['Garage'].value_counts()

2     8703
3     1445
1      758
4      175
5       21
6       11
8        7
7        4
24       2
10       1
40       1
56       1
9        1
57       1
63       1
26       1
42       1
27       1
51       1
20       1
21       1
45       1
22       1
58       1
Name: Garage, dtype: int64

As you can see, some houses have more than 10 garages, which is odd. After checking the images for some of these houses on www.HAR.com, it seems they have only 2 garages, so I replace every value above 8 with 2, which is the median of this feature.

In [36]:
single_family_df.loc[single_family_df['Garage']>8, 'Garage'] = int(single_family_df['Garage'].median())
single_family_df['Garage'].value_counts()

2    8720
3    1445
1     758
4     175
5      21
6      11
8       7
7       4
Name: Garage, dtype: int64
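
The garage clean-up above can be sketched end to end on a toy series. This is an illustrative variant, not the notebook's exact code: it assumes the count is always the leading integer of the string and reuses the threshold of 8 from the cell above. The sample values are made up:

```python
import pandas as pd

# Toy values in the same 'count / type' layout as 'Garage(s):'
garage_raw = pd.Series(['2 / Attached', '3 / Detached', None,
                        '24 / Attached', '1 / Detached'])

# Missing rows become '2'; the leading integer is the garage count
count = (garage_raw.fillna('2')
         .str.extract(r'^\s*(\d+)', expand=False)
         .astype(int))

# Counts above 8 are treated as data-entry errors and replaced by the median
median = int(count.median())
cleaned = count.where(count <= 8, median)
print(cleaned.tolist())
```

`Series.where` keeps values that satisfy the condition and substitutes the median elsewhere, which avoids assigning through a chained index.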

## 1.6.5.2 Living<a id='1.6.5.2_Living'></a>

To calculate the living room area, I pass the living room dimensions to the area calculation function, which returns the area. In a later step I will fill missing values with the average living room area per subdivision.

In [37]:
single_family_df['LivingSqft'] = single_family_df['Living:'].str.split(',')
single_family_df['LivingSqft'] = single_family_df['LivingSqft'].apply(area_calc)        

In [38]:
single_family_df['LivingSqft'].describe()

count    6294.000000
mean      288.996780
std       152.801571
min         0.000000
25%       210.000000
50%       272.125000
75%       342.000000
max      6651.000000
Name: LivingSqft, dtype: float64
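
The planned fill of missing living room areas with a per-subdivision average could look roughly like the sketch below. It is an assumption about how that step might be done, not the notebook's code: `SubName` stands in for whichever subdivision column is used, the fallback to the overall mean is my own addition, and the data is made up:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'SubName': ['A', 'A', 'A', 'B', 'B', 'C'],
    'LivingSqft': [200.0, 300.0, np.nan, 150.0, np.nan, np.nan],
})

# Mean of the known areas within each subdivision, aligned to the rows
group_mean = toy.groupby('SubName')['LivingSqft'].transform('mean')

# Fill from the subdivision first; subdivisions with no known area
# (here 'C') fall back to the overall mean
toy['LivingSqft_filled'] = (toy['LivingSqft']
                            .fillna(group_mean)
                            .fillna(toy['LivingSqft'].mean()))
print(toy)
```

Extreme values such as the maximum shown in the summary above would pull a mean upwards, so a median (`transform('median')`) may be a safer choice.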

## 1.6.5.3 Dining<a id='1.6.5.3_Dining'></a>

In [39]:
single_family_df['Dining:'].value_counts()

['12x11, 1st', '3.66 x 3.35(m)']     220
['13x11, 1st', '3.96 x 3.35(m)']     203
['13x12, 1st', '3.96 x 3.66(m)']     184
['14x12, 1st', '4.27 x 3.66(m)']     175
['1st', '1st']                       172
                                    ... 
['15x12,  1st', '4.57 x 3.66(m)']      1
["13X10'6, 1st", '3.96(m)']            1
['17 x 17, 2nd', '17 , 17, 2nd']       1
['15X18, 1st', '4.57 x 5.49(m)']       1
['16 x 12, 2nd', '16 , 12, 2nd']       1
Name: Dining:, Length: 1308, dtype: int64

In [40]:
single_family_df['DiningSqft'] = single_family_df['Dining:'].str.split(',')
single_family_df['DiningSqft'] = single_family_df['DiningSqft'].apply(area_calc) 

In [41]:
single_family_df[['Dining:','DiningSqft']].sample(20,random_state=101)

Unnamed: 0,Dining:,DiningSqft
10595,"['12x14, 1st', '3.66 x 4.27(m)']",168.0
9396,"['13X10, 1st', '3.96 x 3.05(m)']",130.0
6552,"['10x9, 1st', '3.05 x 2.74(m)']",90.0
2512,"['12x12, 1st', '3.66 x 3.66(m)']",144.0
7776,,
2965,,
8038,"['10x11, 1st', '3.05 x 3.35(m)']",110.0
7054,"['12x16, 1st', '3.66 x 4.88(m)']",192.0
2269,"['12x13, 2nd', '3.66 x 3.96(m)']",156.0
1475,,


## 1.6.5.4 Kitchen<a id='1.6.5.4_Kitchen'></a>

I use the same function to calculate the kitchen area in sqft.

In [42]:
single_family_df['KitchenSqft'] = single_family_df['Kitchen:'].str.split(',')
single_family_df['KitchenSqft'] = single_family_df['KitchenSqft'].apply(area_calc) 

In [43]:
single_family_df[['Kitchen:','KitchenSqft']].sample(20,random_state=101)

Unnamed: 0,Kitchen:,KitchenSqft
10595,"['12x19, 1st', '3.66 x 5.79(m)']",228.0
9396,"['13X11, 1st', '3.96 x 3.35(m)']",143.0
6552,"['11x8, 1st', '3.35 x 2.44(m)']",88.0
2512,"['9x12, 1st', '2.74 x 3.66(m)']",108.0
7776,"['0X0, 1st', '0,0, 1st']",0.0
2965,,
8038,,
7054,"['13x17, 1st', '3.96 x 5.18(m)']",221.0
2269,"['15x12, 2nd', '4.57 x 3.66(m)']",180.0
1475,,


Now we can drop old living, dining and kitchen columns:

In [44]:
single_family_df.drop(['Living:', 'Dining:', 'Kitchen:'], axis=1,inplace=True)

## 1.7 Subdivisions And Their Facts<a id='1.7_Subdivisions_And_their_Facts'></a>

To standardize subdivision names, I scraped all subdivision names from HAR.com and will replace each name with the correct one based on similarity:

In [45]:
sub_df = pd.read_csv('../data/raw/Subdivision.csv')
sub_df.drop(['Unnamed: 0'],axis=1,inplace=True)

In [46]:
sub_df.head()

Unnamed: 0,Subdivision,Zip,Med.Appraisal,Avg.Sqft.,Avg.Yr.Built
0,MARLOWE CONDOS,77002,"$522,701",1100,2018.0
1,Modern Midtown,77002,"$469,147",2096,2014.0
2,Midtowne Plaza,77002,"$439,282",2507,1999.0
3,Macgregor Demerritt,77002,"$438,234",2034,1930.0
4,Hermann Lofts Condo,77002,"$385,446",1546,1998.0


As the table above shows, Med.Appraisal, Avg.Sqft. and Avg.Yr.Built are fixed for each subdivision, so we can use them to fill missing values in the corresponding columns.
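
A minimal sketch of how such a lookup could fill missing values, assuming the currency strings are first converted to numbers. The two subdivision rows come from the table above; the `homes` frame and the column names `MedAppraisalNum` and `MedianAppraised` are hypothetical:

```python
import pandas as pd

sub_toy = pd.DataFrame({
    'Subdivision': ['Modern Midtown', 'Midtowne Plaza'],
    'Med.Appraisal': ['$469,147', '$439,282'],
})

# '$469,147' -> 469147.0
sub_toy['MedAppraisalNum'] = (sub_toy['Med.Appraisal']
                              .str.replace(r'[$,]', '', regex=True)
                              .astype(float))
lookup = sub_toy.set_index('Subdivision')['MedAppraisalNum']

# Hypothetical listings with gaps in the subdivision-level fact
homes = pd.DataFrame({
    'SubName': ['Modern Midtown', 'Modern Midtown', 'Midtowne Plaza', 'Unknown'],
    'MedianAppraised': [469147.0, None, None, None],
})

# Only missing entries are filled; unmatched subdivisions stay NaN
homes['MedianAppraised'] = homes['MedianAppraised'].fillna(homes['SubName'].map(lookup))
print(homes)
```

The mapping relies on subdivision names matching exactly, which is why the names need to be standardized first.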

In [47]:
single_family_df[['Subdivision:','Subdivision Name:','Market Area Name:','Master Planned Community:','Zip Code:']].head()

Unnamed: 0,Subdivision:,Subdivision Name:,Market Area Name:,Master Planned Community:,Zip Code:
0,Austin Hadley Place,,Midtown - Houston,,77002
1,Modern Midtown (View subdivision price trend),Modern Midtown,,,77002
2,Modern Midtown (View subdivision price trend),Modern Midtown,,,77002
3,Merkels Sec 01 (View subdivision price trend),Merkels,,,77003
4,MERKELS (View subdivision price trend),Merkels,,,77003
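
The similarity-based replacement of subdivision names can be sketched with the standard library's `difflib` in place of `fuzzywuzzy`, which keeps the example free of third-party dependencies. The helper names, the clean-up rules, the `0.8` cutoff and the short `canonical` list are assumptions for illustration:

```python
import difflib
import re

# Toy list of canonical names (a stand-in for the scraped subdivision list)
canonical = ['Modern Midtown', 'Merkels', 'Midtowne Plaza']

def clean_name(raw):
    """Strip the '(View subdivision price trend)' suffix and 'Sec NN' markers."""
    name = re.sub(r'\(View subdivision price trend\)', '', raw)
    name = re.sub(r'\bSec\s*\d+\b', '', name, flags=re.IGNORECASE)
    return name.strip()

def match_subdivision(raw, choices, cutoff=0.8):
    """Return the closest canonical name, or None if nothing is similar enough."""
    lowered = {c.lower(): c for c in choices}
    hits = difflib.get_close_matches(clean_name(raw).lower(), list(lowered),
                                     n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

for raw in ['MERKELS (View subdivision price trend)',
            'Merkels Sec 01 (View subdivision price trend)',
            'Austin Hadley Place']:
    print(raw, '->', match_subdivision(raw, canonical))
```

Returning `None` below the cutoff leaves unmatched names visible for manual review instead of forcing them onto the nearest candidate.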


In [48]:
single_family_df[['Subdivision:','Subdivision Name:','Market Area Name:','Master Planned Community:']].isna().sum()

Subdivision:                    5
Subdivision Name:            1746
Market Area Name:            9413
Master Planned Community:    9843
dtype: int64

It seems all four of these columns carry the same information. `Subdivision Name:` has the cleanest names and `Market Area Name:` has more standardized names, so I will fill missing values in `Subdivision Name:` with the `Market Area Name:` values and see how many missing values remain.

In [49]:
single_family_df['SubName'] = single_family_df['Subdivision Name:'].fillna(single_family_df['Market Area Name:'])

In [50]:
single_family_df['SubName'].isna().sum()

18

In [51]:
single_family_df[['Subdivision:','Subdivision Name:','Market Area Name:','Master Planned Community:']].loc[single_family_df['SubName'].isna()]

Unnamed: 0,Subdivision:,Subdivision Name:,Market Area Name:,Master Planned Community:
67,Mckinney Lndg Sub (View subdivision price trend),,,
498,VERMONT STREET GROVE,,,
629,Magnolia Grove (View subdivision price trend),,,
755,SUNSET HEIGHTS (View subdivision price trend),,,
847,24th Street Manor (View subdivision price trend),,,
922,Heights Homes/Herkimer Sub (View subdivision ...,,,
1358,Shepherd Oaks (View subdivision price trend),,,
2097,Oaks of Lawndale,,,
3161,Lakeside T/H (View subdivision price trend),,,
3163,Lakeside T/H (View subdivision price trend),,,


I am dropping these 18 rows since I cannot find the correct subdivision name for them.

In [52]:
single_family_df=single_family_df[~single_family_df['SubName'].isnull()]

In [53]:
single_family_df.drop(['Subdivision Name:','Subdivision:','Market Area Name:','Master Planned Community:'],
                      axis=1,inplace=True)

Now we can take a look at the features related to each subdivision: `Median Appraised Value:`, `Median Year Built:`,
`Median Lot Square Ft.:`, `Median Square Ft.:` and `Neighborhood Value Range:`.

In [54]:
single_family_df[['Median Appraised Value:', 'Median Year Built:', 'Median Lot Square Ft.:', 'Median Square Ft.:', 'Neighborhood Value Range:']].isna().sum()

Median Appraised Value:      1728
Median Year Built:           1728
Median Lot Square Ft.:       1728
Median Square Ft.:           1728
Neighborhood Value Range:    1728
dtype: int64

These features have the same number of missing values, which suggests the values are missing for whole subdivisions. Let's take a look:

In [55]:
single_family_df.SubName.loc[single_family_df['Median Appraised Value:'].isna()].value_counts()

Katy - Old Towne                   134
Spring Branch                      128
Cypress South                      105
Medical Center South                98
Hockley                             74
                                  ... 
Fort Bend County North/Richmond      1
Memorial Villages                    1
Pasadena                             1
Memorial West                        1
Lake Conroe Area                     1
Name: SubName, Length: 73, dtype: int64

In [56]:
sub_df.loc[sub_df.Subdivision=='Cypress South']

Unnamed: 0,Subdivision,Zip,Med.Appraisal,Avg.Sqft.,Avg.Yr.Built


In [57]:
single_family_df['Year Built:'].loc[single_family_df['Median Lot Square Ft.:'].isna()].value_counts()

2020   / Builder               1235
2019   / Builder                 72
2021   / Builder                 68
2019   / Appraisal District      22
2017   / Appraisal District      16
                               ... 
1989   / Appraisal District       1
2017   / Appraisal                1
1925   / Appraisal District       1
1972   / Seller                   1
1999   / Appraisal                1
Name: Year Built:, Length: 98, dtype: int64
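
As a rough check on how much of the gap comes from new construction, the counts visible in the output above can be summed directly. This gives only a lower bound, since the truncated rows are not included; the variable names are illustrative.

```python
# Counts copied from the visible rows of the output above
recent_builder = {"2020 / Builder": 1235, "2019 / Builder": 72, "2021 / Builder": 68}
total_missing = 1728  # rows with missing subdivision statistics

# Share of the affected rows that are builder-reported homes from 2019 to 2021
builder_share = sum(recent_builder.values()) / total_missing
print(f"{builder_share:.1%}")
```

The three visible builder rows alone cover roughly 80% of the affected listings.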

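HAR.com seems to have no information for these subdivisions, since more than 90% of the affected listings are in new subdivisions (builder-reported homes from 2019 to 2021 alone account for 1,375 of the rows shown above). So I will drop those rows.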

In [58]:
single_family_df=single_family_df[~single_family_df['Median Appraised Value:'].isnull()]

In [59]:
single_family_df[['Median Appraised Value:', 'Median Year Built:', 'Median Lot Square Ft.:', 'Median Square Ft.:', 'Neighborhood Value Range:']]

Unnamed: 0,Median Appraised Value:,Median Year Built:,Median Lot Square Ft.:,Median Square Ft.:,Neighborhood Value Range:
1,"$469,147",2014.0,1450,2096,$441 - $476 K
2,"$469,147",2014.0,1450,2096,$441 - $476 K
3,"$141,314",1938.0,4700,1086,$90 - $213 K
4,"$141,314",1938.0,4700,1086,$90 - $213 K
5,"$404,408",2016.0,1586,2058,$329 - $443 K
...,...,...,...,...,...
11136,"$75,165",1950.0,6325,1160,$56 - $111 K
11137,"$106,856",1963.0,7100,1313,$37 - $170 K
11138,"$106,856",1963.0,7100,1313,$37 - $170 K
11139,"$106,856",1963.0,7100,1313,$37 - $170 K


Now we need to clean these features and convert them to the right types.

In [60]:
# regex=False makes '$' a literal character rather than a regex end-of-string anchor
single_family_df['MedianApp'] = single_family_df['Median Appraised Value:'].str.replace('$','',regex=False).str.replace(',','',regex=False).str.strip()
single_family_df['MedianApp'] = pd.to_numeric(single_family_df['MedianApp'],errors='coerce')

In [61]:
single_family_df['MedianApp']

1        469147
2        469147
3        141314
4        141314
5        404408
          ...  
11136     75165
11137    106856
11138    106856
11139    106856
11140    106856
Name: MedianApp, Length: 9395, dtype: int64
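
Some pandas versions treat the `str.replace` pattern as a regular expression by default, and in a regular expression `$` is the end-of-string anchor rather than a dollar sign. The sketch below shows the difference with the standard `re` module; `parse_dollars` is a hypothetical helper, not part of the notebook.

```python
import re

def parse_dollars(text):
    """Convert a string such as '$469,147' to an int; return None if it is not numeric."""
    cleaned = text.replace("$", "").replace(",", "").strip()  # literal replacements
    return int(cleaned) if cleaned.isdigit() else None

# As a regex, '$' matches the empty position at the end, so the sign survives
regex_result = re.sub("$", "", "$469,147")
literal_result = parse_dollars("$469,147")
```

Passing `regex=False` to `str.replace` gives the literal behaviour regardless of the pandas default.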

In [62]:
single_family_df['MedianYearBlt'] = pd.to_datetime(single_family_df['Median Year Built:'],format='%Y').dt.year

In [63]:
single_family_df['MedianYearBlt']

1        2014
2        2014
3        1938
4        1938
5        2016
         ... 
11136    1950
11137    1963
11138    1963
11139    1963
11140    1963
Name: MedianYearBlt, Length: 9395, dtype: int64

In [64]:
# Note: despite its name, MedianSqft is built from the median *lot* square footage column
single_family_df['MedianSqft'] = single_family_df['Median Lot Square Ft.:'].str.replace(',','').str.strip()
single_family_df['MedianSqft'] = pd.to_numeric(single_family_df['MedianSqft'],errors='coerce')

In [65]:
single_family_df['MedianSqft'] 

1        1450
2        1450
3        4700
4        4700
5        1586
         ... 
11136    6325
11137    7100
11138    7100
11139    7100
11140    7100
Name: MedianSqft, Length: 9395, dtype: int64

In [66]:
single_family_df['NeighborValRangeMin'] = single_family_df['Neighborhood Value Range:'].apply(lambda x:x.split('-')[0].strip().replace('$',''))
single_family_df['NeighborValRangeMin'] = pd.to_numeric(single_family_df['NeighborValRangeMin'],errors='coerce')

In [67]:
single_family_df['NeighborValRangeMin'] 

1        441
2        441
3         90
4         90
5        329
        ... 
11136     56
11137     37
11138     37
11139     37
11140     37
Name: NeighborValRangeMin, Length: 9395, dtype: int64

In [68]:
single_family_df['NeighborValRangeMax'] = single_family_df['Neighborhood Value Range:'].apply(lambda x:x.split('-')[1].replace('$','').replace('K','').strip())
single_family_df['NeighborValRangeMax'] = pd.to_numeric(single_family_df['NeighborValRangeMax'],errors='coerce')

In [69]:
single_family_df['NeighborValRangeMax'] 

1        476
2        476
3        213
4        213
5        443
        ... 
11136    111
11137    170
11138    170
11139    170
11140    170
Name: NeighborValRangeMax, Length: 9395, dtype: int64
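
Both ends of the range can also be pulled out in one pass with a regular expression. This is a minimal sketch that assumes every value follows the `$low - $high K` pattern seen above; `parse_value_range` is a hypothetical name.

```python
import re

RANGE_RE = re.compile(r"\$\s*([\d,.]+)\s*-\s*\$\s*([\d,.]+)\s*K")

def parse_value_range(text):
    """Return (low, high) in thousands of dollars from e.g. '$441 - $476 K', else None."""
    match = RANGE_RE.search(text)
    if match is None:
        return None
    low, high = (float(g.replace(",", "")) for g in match.groups())
    return low, high

example_range = parse_value_range("$441 - $476 K")
```

Strings that do not fit the pattern return `None` instead of a partially parsed value, which makes unexpected formats easy to spot.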

In [70]:
single_family_df.drop(['Median Appraised Value:', 'Median Year Built:', 'Median Lot Square Ft.:', 'Median Square Ft.:', 'Neighborhood Value Range:'],
                      axis=1,inplace=True)

In [71]:
single_family_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9395 entries, 1 to 11140
Data columns (total 66 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   image_link                            9395 non-null   object 
 1   Listing Price:                        9392 non-null   object 
 2   Address:                              9395 non-null   object 
 3   City:                                 9395 non-null   object 
 4   Zip Code:                             9395 non-null   int64  
 5   County:                               9395 non-null   object 
 6   Legal Description:                    9339 non-null   object 
 7   Bedrooms:                             9366 non-null   object 
 8   Baths:                                9373 non-null   object 
 9   Stories:                              9392 non-null   object 
 10  Style:                                9395 non-null   object 
 11  Year Built:     

Some features are 100% null. Let's drop them:

In [72]:
single_family_df.drop(['Home For Sales:', 'Average List Price:', 'Average Square Ft.:', 'Average Price/Square Ft.:',
                       'Home For Lease:','Average Lease:','Average Lease/Square Ft.:'],axis=1,inplace=True)

Now we check the missing values again:

In [73]:
missing = missing_cal(single_family_df)
missing.loc[missing['%'] > 0].sort_values(by="%",ascending=False)

Unnamed: 0,count,%
Median Appraised Value / Square ft.:,8471,90.164981
LivingSqft,4045,43.054816
Fireplace:,3383,36.008515
DiningSqft,2986,31.782863
KitchenSqft,2752,29.292177
Taxes w/o Exemp:,1506,16.029803
Median Price / Square ft.:,924,9.835019
Tax Rate:,863,9.185737
Primary Bedroom:,640,6.812134
Lot Size:,485,5.16232


In [74]:
single_family_df.groupby('SubName').count()['Taxes w/o Exemp:']

SubName
1829 Bering Drive      1
A L Coan               1
ADELAIDE               0
ALYS PARK              1
ARCADIA COURT          1
                      ..
Yorkdale Tr            0
Yorkshire              4
Young Mens             3
Young Samuel           1
Zan Wun Patio Homes    1
Name: Taxes w/o Exemp:, Length: 2111, dtype: int64

In [75]:
single_family_df[['Taxes w/o Exemp:','Listing Price:','Median Price / Square ft.:','Tax Rate:','SubName']]

Unnamed: 0,Taxes w/o Exemp:,Listing Price:,Median Price / Square ft.:,Tax Rate:,SubName
1,"$11, 138/2019","$ 465,000 ($221.85/sqft.) $Convert",,2.4216,Modern Midtown
2,"$11, 292/2019","$ 450,000 ($223.33/sqft.) $Convert",,2.4216,Modern Midtown
3,"$4, 480/2019","$ 259,000 ($203.30/sqft.) $Convert",$188.44,2.5716,Merkels
4,,"$ 236,999 ($196.19/sqft.) $Convert \r\n\r\n\r...",$188.44,,Merkels
5,"$9, 932/2019","$ 390,000 ($210.81/sqft.) $Convert",,2.5716,East End On The Bayou
...,...,...,...,...,...
11136,"$2, 607/2019","$ 189,000 ($111.77/sqft.) $Convert",$118.87,2.8732,Merilyn Place
11137,"$2, 145/2019","$ 177,900 ($135.18/sqft.) $Convert",$92.11,2.8732,South Houston
11138,"$3, 108/2019","$ 169,500 ($146.12/sqft.) $Convert",$92.11,2.8732,South Houston
11139,,"$ 165,000 ($145.37/sqft.) $Convert",$92.11,,South Houston


The tax rate and `Median Price / Square ft.:` should be the same for all houses in a subdivision, whereas the taxes paid differ for each house depending on the tax rate, the total house sqft and other factors. I think filling null values for taxes paid with the subdivision average is appropriate, but first we need to clean these columns.

In [76]:
# regex=False makes '$' a literal character rather than a regex end-of-string anchor
single_family_df['MedianPrice/Sqft'] = pd.to_numeric(
    single_family_df['Median Price / Square ft.:'].str.replace('$','',regex=False).str.strip(),
    errors='coerce')

# Keep the amount before the '/<year>' suffix and strip ',', spaces and '$'
single_family_df['PaidTax'] = pd.to_numeric(
    single_family_df['Taxes w/o Exemp:'].apply(
        lambda x: x.split('/')[0].replace(',','').replace(' ','').replace('$','').strip()
        if not pd.isna(x) else None),
    errors='coerce')
single_family_df['TaxRate'] = pd.to_numeric(single_family_df['Tax Rate:'],errors='coerce')
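
The tax strings carry both an amount and a tax year, as in `$11, 138/2019`. A minimal sketch of a parser that keeps both parts is shown below; `parse_tax` is a hypothetical helper and the notebook itself keeps only the amount.

```python
import re

def parse_tax(text):
    """Split a string like '$11, 138/2019' into (amount, year); None for missing input."""
    if not isinstance(text, str):
        return None
    amount_part, _, year_part = text.partition("/")
    digits = re.sub(r"[^\d.]", "", amount_part)  # drop '$', ',' and spaces
    if not digits:
        return None
    year = int(year_part) if year_part.strip().isdigit() else None
    return float(digits), year

example_tax = parse_tax("$11, 138/2019")
```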

In [77]:
single_family_df['PaidTax'] = single_family_df.groupby('SubName')['PaidTax'].transform(lambda x: x.fillna(x.mean()))
single_family_df['TaxRate'] = single_family_df.groupby('SubName')['TaxRate'].transform(lambda x: x.fillna(x.mean()))
single_family_df['MedianPrice/Sqft'] = single_family_df.groupby('SubName')['MedianPrice/Sqft'].transform(lambda x: x.fillna(x.mean()))

In [78]:
single_family_df['MedianPrice/Sqft'].isna().sum()

924
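
A likely reason the count does not move is that a group mean is undefined when every value in the group is missing, so there is nothing to fill with. The toy example below illustrates this; the data is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "SubName": ["A", "A", "A", "B", "B"],
    "TaxRate": [2.4, None, 2.6, None, None],
})

# Fill each missing value with the mean of its own subdivision
toy["TaxRate"] = toy.groupby("SubName")["TaxRate"].transform(lambda s: s.fillna(s.mean()))

# Subdivision B has no observed value to average, so its rows stay null
remaining = int(toy["TaxRate"].isna().sum())
```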

We still have some null values, apparently because there is no information for those subdivisions. Let's drop them.

In [79]:
single_family_df=single_family_df[~single_family_df['MedianPrice/Sqft'].isnull()]

In [80]:
single_family_df['TaxRate'].isna().sum()

53

In [81]:
single_family_df=single_family_df[~single_family_df['TaxRate'].isnull()]

In [82]:
single_family_df['PaidTax'].isna().sum()

68

In [83]:
single_family_df=single_family_df[~single_family_df['PaidTax'].isnull()]

In [84]:
single_family_df.drop(['Taxes w/o Exemp:','Tax Rate:','Median Price / Square ft.:'],axis=1,inplace=True)

## 1.8 Fill Null For Kitchen, Dining and Living<a id='1.8_Fill_Null_For_Kitchen_Dining_Living'></a>

I think we can fill null values for the living, dining and kitchen areas with the mean sqft of each subdivision, since most houses within a subdivision are nearly the same.

In [85]:
# transform keeps the original row index, so the result can be assigned back directly
single_family_df['LivingSqft'] = single_family_df.groupby('SubName')['LivingSqft'].transform(lambda x: x.fillna(x.mean()))
single_family_df['DiningSqft'] = single_family_df.groupby('SubName')['DiningSqft'].transform(lambda x: x.fillna(x.mean()))
single_family_df['KitchenSqft'] = single_family_df.groupby('SubName')['KitchenSqft'].transform(lambda x: x.fillna(x.mean()))

In [86]:
single_family_df['LivingSqft'].isna().sum()

367

In [87]:
single_family_df['DiningSqft'].isna().sum()

445

In [88]:
single_family_df['KitchenSqft'].isna().sum()

234

We still have some null values, apparently because there is no information for those subdivisions. Let's drop them.

In [89]:
single_family_df=single_family_df[~single_family_df['DiningSqft'].isnull()]

In [90]:
single_family_df['LivingSqft'].isna().sum()

197

In [91]:
single_family_df['KitchenSqft'].isna().sum()

75

In [92]:
single_family_df=single_family_df[~single_family_df['LivingSqft'].isnull()]

In [93]:
single_family_df['KitchenSqft'].isna().sum()

51

In [94]:
single_family_df=single_family_df[~single_family_df['KitchenSqft'].isnull()]
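
The three drops above could also be written as a single call. This is a sketch on toy data, assuming the goal is to keep only rows where all three room areas are present.

```python
import pandas as pd

toy = pd.DataFrame({
    "LivingSqft": [300.0, None, 280.0, 250.0],
    "DiningSqft": [120.0, 110.0, None, 100.0],
    "KitchenSqft": [150.0, 140.0, 160.0, None],
})

# Drop every row that is missing any of the three room areas in one call
room_cols = ["LivingSqft", "DiningSqft", "KitchenSqft"]
complete = toy.dropna(subset=room_cols)
```

The final set of rows is the same as with three sequential filters; only the intermediate counts are skipped.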

## 1.9 Listing Price<a id='1.9_Listing_Price'></a>

In [95]:
single_family_df['Listing Price:'].isna().sum()

2

In [96]:
single_family_df=single_family_df[~single_family_df['Listing Price:'].isnull()]

In [97]:
single_family_df['Listing Price:']=single_family_df['Listing Price:'].str.split(' ').str[1]
single_family_df['Listing Price:']=single_family_df['Listing Price:'].str.replace(',','')
single_family_df['Listing Price:']=pd.to_numeric(single_family_df['Listing Price:'])
single_family_df.rename(columns = {'Listing Price:':'ListingPrice','Address:':'Address', 'Zip Code:':'ZipCode', 'County:':'County',
                                 'Subdivision:':'sub', 'Legal Description:':'Legal'},inplace=True)

In [98]:
single_family_df['ListingPrice'].describe()

count    7.655000e+03
mean     5.318598e+05
std      6.400071e+05
min      1.000000e+00
25%      2.402450e+05
50%      3.450000e+05
75%      5.597000e+05
max      1.450000e+07
Name: ListingPrice, dtype: float64
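
Splitting on spaces works because every listing string starts with `$ ` followed by the price. A slightly more defensive sketch with a regular expression is shown below; `parse_listing_price` is a hypothetical helper.

```python
import re

PRICE_RE = re.compile(r"\$\s*(\d[\d,]*)")

def parse_listing_price(text):
    """Return the first dollar amount in a listing string as an int, or None."""
    match = PRICE_RE.search(text) if isinstance(text, str) else None
    return int(match.group(1).replace(",", "")) if match else None

example_price = parse_listing_price("$ 465,000 ($221.85/sqft.) $Convert")
```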

## 1.10 Bedrooms<a id='1.10_Bedrooms'></a>

In [99]:
single_family_df[['Bedrooms:','Bedroom:','Primary Bedroom:']]

Unnamed: 0,Bedrooms:,Bedroom:,Primary Bedroom:
10,3 Bedroom(s),"[""12' X 10', 3rd"", '12\'10\' X 11\'4"", 1st', ""...","['16\'4"" X 12\'8"", 3rd', '16\'4"" , 12\'8"", 3rd']"
22,3 Bedroom(s),"['12X11, 3rd', '12X11, 1st', '3.66 x 3.35(m)',...","['15X14, 3rd', '4.57 x 4.27(m)']"
23,3 Bedroom(s),"['12X11, 1st', '12X11, 3rd', '3.66 x 3.35(m)',...","['15X14, 3rd', '4.57 x 4.27(m)']"
29,3 Bedroom(s),"['13 x 11, 1st', '11 x 10, 3rd', '13 , 11, 1st...","['17 x 13, 3rd', '17 , 13, 3rd']"
30,3 Bedroom(s),"['11x10, 3rd', '15x11, 1st', '3.35 x 3.05(m)',...","['19X13, 3rd', '5.79 x 3.96(m)']"
...,...,...,...
11135,3 Bedroom(s),"['10x11, 1st', '10x11, 1st', '10x11, 1st', '3....",
11137,3 Bedroom(s),"['10x8, 1st', '10x8, 1st', '3.05 x 2.44(m)', '...","['12x14, 1st', '3.66 x 4.27(m)']"
11138,3 Bedroom(s),"['12x11, 1st', '12x11, 1st', '3.66 x 3.35(m)',...","['13x15, 1st', '3.96 x 4.57(m)']"
11139,3 Bedroom(s),"['12x12, 1st', '12x12, 1st', '3.66 x 3.66(m)',...","['13x15, 1st', '3.96 x 4.57(m)']"


In [100]:
single_family_df[['Bedrooms:','Bedroom:','Primary Bedroom:','Average Bedrooms:']].isna().sum()

Bedrooms:             15
Bedroom:              57
Primary Bedroom:     447
Average Bedrooms:     27
dtype: int64

First I calculate the area for `Bedroom:` and `Primary Bedroom:` separately. Missing primary bedroom areas are set to 0 and missing bedroom areas are filled with the subdivision average. The two are then added to give the total bedroom area in sqft.

In [101]:
single_family_df['Primary_Bedroom_clean'] = single_family_df['Primary Bedroom:'].str.split(',')
single_family_df['Primary_Bedroom_clean'] = single_family_df['Primary_Bedroom_clean'].apply(area_calc)
single_family_df['Primary_Bedroom_clean'] = single_family_df['Primary_Bedroom_clean'].fillna(0)

single_family_df['Bedroom_clean'] = single_family_df['Bedroom:'].str.split(',')
single_family_df['Bedroom_clean'] = single_family_df['Bedroom_clean'].apply(area_calc)
single_family_df['Bedroom_clean'] = single_family_df.groupby('SubName')['Bedroom_clean'].transform(lambda x: x.fillna(x.mean()))

In [102]:
single_family_df['TotalBedSqft'] = single_family_df['Bedroom_clean'] + single_family_df['Primary_Bedroom_clean']

In [103]:
pd.options.display.max_colwidth = 100
single_family_df[['Bedrooms:','Bedroom:','Primary Bedroom:','Primary_Bedroom_clean','Bedroom_clean','TotalBedSqft']].sample(20,random_state=100)

Unnamed: 0,Bedrooms:,Bedroom:,Primary Bedroom:,Primary_Bedroom_clean,Bedroom_clean,TotalBedSqft
7745,4 Bedroom(s),"['11X11, 2nd', '15X11, 2nd', '15X11, 2nd', '3.35 x 3.35(m)', '4.57 x 3.35(m)', '4.57 x 3.35(m)']","['14X14, 1st', '4.27 x 4.27(m)']",196.0,451.0,647.0
8887,4 Bedroom(s),"['12x12, 2nd', '12x16, 2nd', '12x15, 2nd', '3.66 x 3.66(m)', '3.66 x 4.88(m)', '3.66 x 4.57(m)']","['15x26, 1st', '4.57 x 7.92(m)']",390.0,516.0,906.0
4646,4 Bedroom(s),"['12x11, 2nd', '10x9, 2nd', '13x10, 2nd', '3.66 x 3.35(m)', '3.05 x 2.74(m)', '3.96 x 3.05(m)']","['14x13, 1st', '4.27 x 3.96(m)']",182.0,352.0,534.0
10263,3 Bedroom(s),"['13 x 12, 1st', '13 x 12, 1st', '13 , 12, 1st', '13 , 12, 1st']","['23 x 18, 1st', '23 , 18, 1st']",414.0,312.0,726.0
8844,4 - 5 Bedroom(s),"['12x14, 2nd', '11x13, 2nd', '11x13, 2nd', '3.66 x 4.27(m)', '3.35 x 3.96(m)', '3.35 x 3.96(m)']","['15x16, 1st', '4.57 x 4.88(m)']",240.0,454.0,694.0
10629,4 Bedroom(s),"['12x12, 2nd', '12x10, 2nd', '11x11, 2nd', '3.66 x 3.66(m)', '3.66 x 3.05(m)', '3.35 x 3.35(m)']","['16x14, 2nd', '4.88 x 4.27(m)']",224.0,385.0,609.0
3473,4 Bedroom(s),"['14 x 11, 2nd', '12 x 12, 2nd', '12 x 12, 2nd', '14 , 11, 2nd', '12 , 12, 2nd', '12 , 12, 2nd']","['15 x 13, 1st', '15 , 13, 1st']",195.0,442.0,637.0
9711,5 Bedroom(s),"['15-13, 2nd', '13-12, 2nd', '17-12, 2nd', '16-12, 2nd', '15-13, 2nd', '13-12, 2nd', '17-12, 2nd...","['18-15, 1st', '18-15, 1st']",0.0,0.0,0.0
2452,3 Bedroom(s),"['10.10x 13.11, 1st', '13x 11, 1st', '3.08 x 4.00(m)', '3.96 x 3.35(m)']","['12x 13.2, 1st', '3.66 x 4.02(m)']",158.4,275.411,433.811
2482,4 Bedroom(s),"['14x12, 1st', '14x12, 1st', '4.27 x 3.66(m)', '4.27 x 3.66(m)']","['13x12, 1st', '14x13, 1st', '3.96 x 3.66(m)', '4.27 x 3.96(m)']",338.0,336.0,674.0
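
The sample above suggests how the areas are derived: imperial entries such as `14X14` contribute width times length, metric duplicates are ignored, and entries written with other separators such as `18-15` end up as 0. The sketch below reproduces that behaviour under those assumptions. It is not the notebook's `area_calc`, which is defined elsewhere, and it does not handle feet-and-inches notation.

```python
import re

DIM_RE = re.compile(r"(\d+(?:\.\d+)?)\s*[xX]\s*(\d+(?:\.\d+)?)")

def imperial_area(entries):
    """Sum width * length over entries like '14X14, 1st', skipping metric duplicates."""
    total = 0.0
    for entry in entries:
        if "(m)" in entry:      # metric copy of a room that was already counted
            continue
        match = DIM_RE.search(entry)
        if match:               # entries such as '18-15, 1st' do not match
            total += float(match.group(1)) * float(match.group(2))
    return total

primary_area = imperial_area(["14X14, 1st", "4.27 x 4.27(m)"])
```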


In [104]:
single_family_df['Bedrooms:'].isnull().sum()

15

There are 15 houses with no information about bedrooms, so I will drop those rows.

In [105]:
single_family_df=single_family_df[~single_family_df['Bedrooms:'].isnull()]

In [106]:
single_family_df['Bedrooms:']=single_family_df['Bedrooms:'].str.split(' ').str[0]
single_family_df['Bedrooms:']=single_family_df['Bedrooms:'].astype(int)
single_family_df.rename(columns = {'Bedrooms:':'NoBed'},inplace=True)
single_family_df['NoBed'].describe()

count    7640.000000
mean        3.728403
std         0.790657
min         1.000000
25%         3.000000
50%         4.000000
75%         4.000000
max        10.000000
Name: NoBed, dtype: float64

In [107]:
single_family_df['Average Bedrooms:'].isna().sum()

27

In [108]:
single_family_df.groupby('SubName')['Average Bedrooms:'].value_counts()

SubName                           Average Bedrooms:
ASHFORD MANOR                     3.52                  5
Aberdeen Green                    3.46                  7
Aberdeen Trails                   3.84                  4
Afton Oaks                        3.56                 24
Airport Blvd Estates              3.47                  1
                                                       ..
Wrights Landing At Legends Trace  3.64                  5
Wyndham Village                   4.16                  3
YAUPON TRLS                       3.67                  1
Yorkshire                         4.59                  4
Young Mens                        2.85                  4
Name: Average Bedrooms:, Length: 1054, dtype: int64

In [109]:
single_family_df['Average Bedrooms:'].loc[single_family_df.SubName=='Afton Oaks']

2338    3.56
2339    3.56
2345    3.56
2351    3.56
2352    3.56
2353    3.56
2354    3.56
2355    3.56
2356    3.56
2359    3.56
2363    3.56
2376    3.56
2378    3.56
2383    3.56
2384    3.56
2391    3.56
2394    3.56
2403    3.56
2408    3.56
2412    3.56
2417    3.56
2418    3.56
2421    3.56
2426    3.56
Name: Average Bedrooms:, dtype: float64

It seems the average number of bedrooms is the same within each subdivision, so we can fill null values with the subdivision mean.

In [110]:
single_family_df['Average Bedrooms:'] = single_family_df.groupby('SubName')['Average Bedrooms:'].transform(lambda x: x.fillna(x.mean()))

In [111]:
single_family_df['Average Bedrooms:'].isna().sum()

27

The count is unchanged, so there is no information for those subdivisions. Let's drop these rows.

In [112]:
single_family_df=single_family_df[~single_family_df['Average Bedrooms:'].isnull()]
single_family_df.rename(columns ={'Average Bedrooms:':'AvgBed'},inplace=True)

In [113]:
single_family_df.drop(['Bedroom:','Primary Bedroom:','Primary_Bedroom_clean','Bedroom_clean'],axis=1,inplace=True)

## 1.11 Bathrooms<a id='1.11_Bathrooms'></a>

In [114]:
single_family_df['Baths:'].isnull().sum()

3

In [115]:
single_family_df=single_family_df[~single_family_df['Baths:'].isnull()]

In [116]:
single_family_df[['Baths:']]

Unnamed: 0,Baths:
10,3 Full & 1 Half Bath(s)
22,3 Full & 1 Half Bath(s)
23,3 Full & 1 Half Bath(s)
29,3 Full & 1 Half Bath(s)
30,3 Full & 1 Half Bath(s)
...,...
11135,1 Full Bath(s)
11137,1 Full & 1 Half Bath(s)
11138,1 Full & 1 Half Bath(s)
11139,2 Full Bath(s)


In [117]:
single_family_df['FullBath']=single_family_df['Baths:'].str.split(' ').str[0].astype(int)

In [118]:
single_family_df['FullBath']

10       3
22       3
23       3
29       3
30       3
        ..
11135    1
11137    1
11138    1
11139    2
11140    1
Name: FullBath, Length: 7610, dtype: int32

In [119]:
No_Bath = single_family_df['Baths:'].str.split('&').str[1].str.strip()
No_Bath = No_Bath.fillna('0')  # listings without a '& n Half' part have no half baths
single_family_df['HalfBath']=[int(item[0]) for item in No_Bath.str.split(' ')]
single_family_df[['FullBath','HalfBath']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7610 entries, 10 to 11140
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   FullBath  7610 non-null   int32
 1   HalfBath  7610 non-null   int64
dtypes: int32(1), int64(1)
memory usage: 148.6 KB
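
Both counts can also be read from the string in one step. This is a minimal sketch assuming the `n Full & m Half Bath(s)` pattern shown above; `parse_baths` is a hypothetical helper.

```python
import re

BATH_RE = re.compile(r"(\d+)\s+Full(?:\s*&\s*(\d+)\s+Half)?")

def parse_baths(text):
    """Return (full, half) counts from e.g. '3 Full & 1 Half Bath(s)', else None."""
    match = BATH_RE.search(text)
    if match is None:
        return None
    full = int(match.group(1))
    half = int(match.group(2)) if match.group(2) else 0
    return full, half

example_baths = parse_baths("3 Full & 1 Half Bath(s)")
```

A string without a `Full` count returns `None` rather than having its first number treated as the number of full baths.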


In [120]:
single_family_df.drop('Baths:',axis=1,inplace=True)

## 1.12 Stories<a id='1.12_Stories'></a>

In [121]:
single_family_df['Stories:'].value_counts()

2       3812
1       3193
3        371
1.5      167
4         52
2.5       12
5          1
2576       1
Name: Stories:, dtype: int64

One house is listed with 2576 stories, which I believe is a typo. Let's take a look at this home.

In [122]:
single_family_df['Stories:'].isnull().sum()

1

In [123]:
single_family_df=single_family_df[~single_family_df['Stories:'].isnull()]

In [124]:
single_family_df.rename(columns ={'Stories:':'Stories'},inplace=True)
single_family_df['Stories']=pd.to_numeric(single_family_df['Stories'])

In [125]:
single_family_df['Address'].loc[single_family_df['Stories']>4]

535        5334 Calle Cadiz Place
10448    22614 Auburn Valley Lane
Name: Address, dtype: object

The first one is an apartment and the second one is a two-story house, so we now replace 2576 with 2.

In [126]:
# A single .loc call on the frame avoids chained assignment
single_family_df.loc[single_family_df['Stories']==2576,'Stories']=2

In [127]:
single_family_df['Stories'].value_counts()

2.0    3813
1.0    3193
3.0     371
1.5     167
4.0      52
2.5      12
5.0       1
Name: Stories, dtype: int64

## 1.13 Style<a id='1.13_Style'></a>

In [128]:
single_family_df['Style:'].value_counts()

Traditional                                       5373
Contemporary/Modern                                674
Ranch                                              332
Contemporary/Modern,Traditional                    190
Other Style                                        165
                                                  ... 
Contemporary/Modern,English,French,Traditional       1
French,Mediterranean,Traditional                     1
Georgian,Victorian                                   1
English,French                                       1
Contemporary/Modern,Victorian                        1
Name: Style:, Length: 80, dtype: int64

In [129]:
single_family_df['Style:'].isnull().sum()

0

In [130]:
single_family_df.rename(columns ={'Style:':'Style'},inplace=True)

In [131]:
single_family_df['Style'].unique()

array(['Contemporary/Modern', 'Contemporary/Modern,Traditional',
       'Traditional', 'Split Level', 'Other Style', 'Ranch',
       'Mediterranean,Traditional', 'Colonial', 'French',
       'Contemporary/Modern,Ranch,Traditional', 'Georgian,Traditional',
       'Colonial,Traditional', 'Mediterranean',
       'Colonial,Georgian,Traditional', 'Other Style,Traditional',
       'Contemporary/Modern,French', 'Contemporary/Modern,English,French',
       'French,Traditional', 'Spanish,Traditional', 'Georgian',
       'Colonial,Georgian', 'English,French,Traditional', 'Spanish',
       'Victorian', 'Traditional,Victorian',
       'Contemporary/Modern,Mediterranean',
       'Contemporary/Modern,English,French,Traditional',
       'Contemporary/Modern,French,Traditional', 'Other Style,Ranch',
       'Colonial,Contemporary/Modern',
       'Contemporary/Modern,Split Level,Traditional',
       'Georgian,Victorian', 'Colonial,French',
       'Contemporary/Modern,Victorian',
       'Contemporary/Mod

## 1.14 Year Built<a id='1.14_Year_Built'></a>

In [134]:
single_family_df['Year Built:'].isnull().sum()

27

In [135]:
single_family_df['Year Built:'].value_counts()

2020   / Builder               973
2005   / Appraisal District    155
2006   / Appraisal District    150
2015   / Appraisal District    147
1955   / Appraisal District    140
                              ... 
1875   / Appraisal District      1
1941   / Appraisal               1
1964   / Builder                 1
1975   / Builder                 1
1910   / Appraisal               1
Name: Year Built:, Length: 282, dtype: int64

In [136]:
single_family_df=single_family_df[~single_family_df['Year Built:'].isnull()]
single_family_df['Year Built:']=single_family_df['Year Built:'].apply(lambda x:str(x).split(' ')[0])
single_family_df['Year Built:']=pd.to_datetime(single_family_df['Year Built:'],format='%Y').dt.year
single_family_df['Year Built:'].value_counts()

2020    1020
2015     191
2014     173
2006     173
2005     171
        ... 
1907       1
1934       1
1918       1
1905       1
1896       1
Name: Year Built:, Length: 113, dtype: int64
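
Each raw value also names where the year came from (builder, appraisal district, seller). The notebook keeps only the year; the sketch below shows how both parts could be kept if the source were ever needed. `parse_year_built` is a hypothetical helper.

```python
def parse_year_built(text):
    """Split '2020   / Builder' into (2020, 'Builder'); source is None when absent."""
    year_part, _, source_part = text.partition("/")
    year = int(year_part.strip())
    source = source_part.strip() or None
    return year, source

example_year = parse_year_built("2020   / Builder")
```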

In [137]:
single_family_df.rename(columns ={'Year Built:':'YearBuilt'},inplace=True)

## 1.15 Building Sqft<a id='1.15_Building_Sqft'></a>

In [138]:
single_family_df['Building Sqft.:']

10                  1,897176(m²)  /Builder
22                  2,162201(m²)  /Builder
23                  2,162201(m²)  /Builder
29                  2,016187(m²)  /Builder
30       1,920178(m²)  /Appraisal District
                        ...                
11133    1,651153(m²)  /Appraisal District
11137    1,316122(m²)  /Appraisal District
11138    1,160108(m²)  /Appraisal District
11139    1,135105(m²)  /Appraisal District
11140       83978(m²)  /Appraisal District
Name: Building Sqft.:, Length: 7582, dtype: object

In [139]:
single_family_df['Building Sqft.:'].isnull().sum()

14

In [140]:
single_family_df=single_family_df[~single_family_df['Building Sqft.:'].isnull()]
# The sqft and m² figures are fused (e.g. '1,897176(m²)'), so slice off the sqft part:
# 5 characters when a thousands comma is present ('1,897'), otherwise 3 ('839').
# This assumes every value lies between 100 and 9,999 sqft.
single_family_df['Building Sqft.:']=single_family_df['Building Sqft.:'].apply(lambda x:x[0:5] if ',' in x else x[0:3])
single_family_df['Building Sqft.:']=single_family_df['Building Sqft.:'].str.replace(',','')
single_family_df['Building Sqft.:']=pd.to_numeric(single_family_df['Building Sqft.:'])
single_family_df.rename(columns ={'Building Sqft.:':'BuildSqft'},inplace=True)
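
Fixed-width slicing depends on the number of digits in the square footage. One alternative, sketched here as an assumption rather than the notebook's method, is to try every split of the fused digits and keep the one where the two parts agree under the sqft to m² conversion. `split_sqft` is a hypothetical helper.

```python
import re

M2_PER_SQFT = 0.092903  # square metres in one square foot

def split_sqft(text):
    """Recover square feet from fused strings such as '1,897176(m²)  /Builder'."""
    match = re.match(r"\s*([\d,]+)\(", text)
    if match is None:
        return None
    digits = match.group(1).replace(",", "")
    best = None
    for i in range(1, len(digits)):
        if digits[i] == "0":            # the metric part would not start with zero
            continue
        sqft, m2 = int(digits[:i]), int(digits[i:])
        error = abs(sqft * M2_PER_SQFT - m2)
        if error <= 1 and (best is None or error < best[0]):
            best = (error, sqft)        # keep the most consistent split
    return best[1] if best else None

example_sqft = split_sqft("1,897176(m2)  /Builder")
```

Very small values can have more than one consistent split, so results for single or double digit areas should be treated with caution.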

## 1.16 Lot Size<a id='1.16_Lot_Size'></a>

In [141]:
single_family_df.YearBuilt[single_family_df['Lot Size:'].isnull()].value_counts()

2020    336
2019     10
2021      3
2013      3
2007      3
2016      2
2017      2
1955      2
2006      2
1965      2
2018      2
1952      1
1962      1
1978      1
1967      1
1968      1
1969      1
1973      1
1975      1
1977      1
1994      1
1979      1
1980      1
1982      1
1990      1
2000      1
2002      1
2004      1
2005      1
2008      1
2009      1
2015      1
1920      1
Name: YearBuilt, dtype: int64

Since 336 of the null values belong to homes built in 2020, which are presumably still under construction, I will drop all rows with a null lot size.

In [142]:
single_family_df=single_family_df[~single_family_df['Lot Size:'].isnull()]

In [143]:
single_family_df['Lot Size:']

10       1,866 Sqft.173(m²)  /Appraisal District
22       1,435 Sqft.133(m²)  /Appraisal District
23       1,435 Sqft.133(m²)  /Appraisal District
29       1,431 Sqft.133(m²)  /Appraisal District
30       1,556 Sqft.145(m²)  /Appraisal District
                           ...                   
11133    7,100 Sqft.660(m²)  /Appraisal District
11137    7,100 Sqft.660(m²)  /Appraisal District
11138    7,100 Sqft.660(m²)  /Appraisal District
11139    7,100 Sqft.660(m²)  /Appraisal District
11140    7,100 Sqft.660(m²)  /Appraisal District
Name: Lot Size:, Length: 7179, dtype: object

In [144]:
single_family_df['Lot Size:']=single_family_df['Lot Size:'].str.replace(',','')
# Take the leading number and convert acres to square feet (1 acre = 43,560 sqft)
single_family_df['Lot Size:']=single_family_df['Lot Size:'].apply(
    lambda x: float(x.split(' ')[0])*43560 if 'Acres' in x else float(x.split(' ')[0]))

single_family_df.rename(columns ={'Lot Size:':'LotSize'},inplace=True)

## 1.17 Maintenance Fee<a id='1.17_Maintenance_Fee'></a>

In [145]:
single_family_df['Maintenance Fee:']

10       ['$ 1195 / Annually', 'Mandatory / $1195 / Annually']
22       ['$ 1195 / Annually', 'Mandatory / $1195 / Annually']
23       ['$ 1195 / Annually', 'Mandatory / $1195 / Annually']
29       ['$ 1200 / Annually', 'Mandatory / $1200 / Annually']
30       ['$ 2250 / Annually', 'Mandatory / $2250 / Annually']
                                 ...                          
11133                                                       No
11137                                                       No
11138                                                       No
11139                                                       No
11140                                                       No
Name: Maintenance Fee:, Length: 7179, dtype: object

In [146]:
single_family_df['Maintenance Fee:'].isnull().sum()

20

In [147]:
single_family_df['Maintenance Fee:'].value_counts()

No                                                         1853
['$ 450 / Annually', 'Mandatory / $450 / Annually']         129
['$ 650 / Annually', 'Mandatory / $650 / Annually']         128
['$ 600 / Annually', 'Mandatory / $600 / Annually']         115
['$ 350 / Annually', 'Mandatory / $350 / Annually']         114
                                                           ... 
['$ 1845 / Annually', 'Mandatory / $1845 / Annually']         1
['$ 257 / Annually', 'Mandatory / $257 / Annually']           1
['$ 66 / Monthly', 'Mandatory / $66 / Monthly']               1
['$ 1165 / Annually', 'Mandatory / $1165 / Annually']         1
['$ 10509 / Annually', 'Mandatory / $10509 / Annually']       1
Name: Maintenance Fee:, Length: 821, dtype: int64

In [148]:
single_family_df['Maintenance Fee:'].isin(['No','No / $0','Voluntary / Annually','Voluntary /0/ Annually']).sum()

1918

In [149]:
def MaintenanceFee(fee):
    """Convert a raw 'Maintenance Fee:' entry to an annual fee in dollars."""
    # Missing values arrive as float NaN; treat them as no fee
    if isinstance(fee, float):
        return 0
    # Keep the first list element, e.g. "['$ 1195 / Annually'", then split amount from period
    parts = fee.split(',')[0].split('/')
    # After splitting on '/', only the first word is left to compare
    if parts[0].strip() in ['No','Voluntary','Mandatory']:
        return 0
    amount = float(parts[0].replace('$','').replace('\'','').replace(' ','').replace('[',''))
    period = parts[1].replace('\'','').strip() if len(parts) > 1 else ''
    # Scale quarterly and monthly fees to a yearly amount; anything else is taken as annual
    return amount * {'Quarterly': 4, 'Monthly': 12}.get(period, 1)

In [150]:
single_family_df['MaintenanceFee'] = single_family_df['Maintenance Fee:'].apply(MaintenanceFee)

In [151]:
single_family_df.drop('Maintenance Fee:',axis=1,inplace=True)
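
As a cross-check on the parsing above, here is a minimal regex-based sketch of the same idea: find the first `$ amount / period` pair and scale it to a yearly figure. The names `annual_fee` and `PERIODS_PER_YEAR` are hypothetical, and this is not the function used in this notebook. It returns 0.0 whenever no dollar amount followed by a period is found, which differs slightly from `MaintenanceFee`.

```python
import re

# Multipliers that convert a fee per period to a fee per year
PERIODS_PER_YEAR = {'Annually': 1, 'Quarterly': 4, 'Monthly': 12}

def annual_fee(raw):
    """Return the yearly fee in dollars, or 0.0 when no amount is listed."""
    if not isinstance(raw, str):
        return 0.0  # missing values (NaN) are treated as no fee
    # First occurrence of '$ <amount> / <period>'
    match = re.search(r'\$\s*(\d+(?:\.\d+)?)\s*/\s*(\w+)', raw)
    if match is None:
        return 0.0  # e.g. 'No'
    amount, period = match.groups()
    # Unknown period names are assumed to be annual
    return float(amount) * PERIODS_PER_YEAR.get(period, 1)

print(annual_fee("['$ 66 / Monthly', 'Mandatory / $66 / Monthly']"))
```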

## 1.18 Fireplace<a id='1.18_Fireplace'></a>

In [152]:
single_family_df['Fireplace:'].value_counts()

1/Gaslog Fireplace                                                             1340
1                                                                               852
1/Gas Connections                                                               693
1/Wood Burning Fireplace                                                        453
1/Gas Connections, Gaslog Fireplace                                             384
                                                                               ... 
1/Mock Fireplace, Wood Burning Fireplace                                          1
3/Freestanding, Gaslog Fireplace                                                  1
4/Mock Fireplace                                                                  1
3/Gas Connections, Gaslog Fireplace, Mock Fireplace, Wood Burning Fireplace       1
/Gaslog Fireplace, Wood Burning Fireplace                                         1
Name: Fireplace:, Length: 84, dtype: int64

In [153]:
pd.Series([str(x)[0]  for x in single_family_df['Fireplace:'] if x is not None]).value_counts()

1    4464
n    2076
2     444
3      98
/      62
4      27
5       8
dtype: int64

In [154]:
single_family_df['Fireplace:']=single_family_df['Fireplace:'].apply(lambda x:int(str(x)[0]) if str(x)[0]
                                                                    in ['1','2','3','4','5','6','7'] else 0)

In [155]:
single_family_df['Fireplace:'].value_counts()

1    4464
0    2138
2     444
3      98
4      27
5       8
Name: Fireplace:, dtype: int64
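
The lambda above reads only the first character, which is enough here because the largest count in the data is 5. A small variant that reads every leading digit, so that a count of 10 or more would not be cut short, could look like the sketch below. The function name `fireplace_count` is an assumption, and this is not the code used in this notebook.

```python
import re

def fireplace_count(raw):
    """Return the number at the start of a fireplace entry, or 0 if there is none."""
    # str() turns NaN into 'nan', which has no leading digits
    match = re.match(r'\d+', str(raw))
    return int(match.group()) if match else 0

print(fireplace_count('1/Gaslog Fireplace'))
print(fireplace_count('/Gaslog Fireplace, Wood Burning Fireplace'))
```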

## 1.19 HOA Mandatory<a id='1.19_HOA_Mandatory'></a>

In [156]:
single_family_df['HOA Mandatory:']

10       Yes
22       Yes
23       Yes
29       Yes
30       Yes
        ... 
11133     No
11137     No
11138     No
11139     No
11140     No
Name: HOA Mandatory:, Length: 7179, dtype: object

We can fill the null values in the HOA column with 'No'.

In [157]:
single_family_df['HOA Mandatory:'] = single_family_df['HOA Mandatory:'].fillna('No')
single_family_df.rename(columns ={'HOA Mandatory:':'HOA'},inplace=True)

In [158]:
missing = missing_cal(single_family_df)
missing.loc[missing['%'] > 0].sort_values(by="%",ascending=False)

Unnamed: 0,count,%
Median Appraised Value / Square ft.:,7179,100.0
Legal,7,0.097507


'Median Appraised Value / Square ft.:' is 100% missing, so we can drop the column. We also drop the 7 rows with a null value for Legal.

In [159]:
single_family_df.drop('Median Appraised Value / Square ft.:',axis = 1, inplace=True)
single_family_df=single_family_df[~single_family_df['Legal'].isnull()]
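
The missing-value handling in this section (filling HOA with 'No', dropping a fully empty column, dropping rows without a legal description) can be tried on a toy frame. The frame `toy` and its `AllMissing` column are made up for illustration. `dropna` is used here as an equivalent shortcut, not as the exact calls from the cells above.

```python
import numpy as np
import pandas as pd

# Toy data: one missing HOA value, one missing Legal value, one empty column
toy = pd.DataFrame({
    'HOA Mandatory:': ['Yes', np.nan, 'No'],
    'Legal': ['LT 1 BLK 1', 'LT 2 BLK 1', np.nan],
    'AllMissing': [np.nan, np.nan, np.nan],
})

# Fill missing HOA flags with 'No'
toy['HOA Mandatory:'] = toy['HOA Mandatory:'].fillna('No')
# Drop columns in which every value is missing
toy = toy.dropna(axis=1, how='all')
# Drop rows without a legal description and renumber the index
toy = toy.dropna(subset=['Legal']).reset_index(drop=True)

print(toy)
```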

Let's take a look at the dataset's columns to see whether any of them need to be renamed:

In [160]:
single_family_df.columns

Index(['image_link', 'ListingPrice', 'Address', 'City:', 'ZipCode', 'County',
       'Legal', 'NoBed', 'Stories', 'Style', 'YearBuilt', 'BuildSqft',
       'LotSize', 'Fireplace:', 'Heating:', 'Cooling:', 'Ice Maker:',
       'Microwave:', 'Compactor:', 'Dishwasher:', 'Disposal:', 'Roof:',
       'Foundation:', 'Private Pool:', 'Exterior Type:', 'Lot Description:',
       'Controlled Access:', 'Water Sewer:', 'Unit Location:', 'Area Pool:',
       'Dwelling Type:', 'HOA', 'List Type:', 'Other Fees:', 'AvgBed',
       'Average Baths:', 'Carport Description:', 'UtilitySqft', 'StudySqft',
       'GameSqft', 'BreakfastSqft', 'Garage', 'LivingSqft', 'DiningSqft',
       'KitchenSqft', 'SubName', 'MedianApp', 'MedianYearBlt', 'MedianSqft',
       'NeighborValRangeMin', 'NeighborValRangeMax', 'MedianPrice/Sqft',
       'PaidTax', 'TaxRate', 'TotalBedSqft', 'FullBath', 'HalfBath',
       'MaintenanceFee'],
      dtype='object')

In [161]:
single_family_df.reset_index(inplace=True,drop=True)

In [162]:
single_family_df.rename(columns = {'City:':'City','Fireplace:':'Fireplace', 'Heating:':'Heating', 'Cooling:':'Cooling',
                                   'Ice Maker:':'IceMaker', 'Microwave:':'Microwave','Compactor:':'Compactor',
                                   'Dishwasher:':'Dishwasher','Disposal:':'Disposal','Roof:':'Roof',
                                   'Foundation:':'Foundation','Private Pool:':'PrivatePool',
                                   'Exterior Type:':'ExteriorType','Lot Description:':'LotDes',
                                   'Controlled Access:':'ControlAccess','Water Sewer:':'WaterSewer',
                                   'Unit Location:':'UnitLoc','Area Pool:':'AreaPool','Dwelling Type:':'DwellingType',
                                   'List Type:':'ListType','Other Fees:':'OtherFees','Average Baths:':'AvgBaths',
                                   'Carport Description:':'CarportDescription'},inplace=True)

In [163]:
single_family_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7172 entries, 0 to 7171
Data columns (total 58 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   image_link           7172 non-null   object 
 1   ListingPrice         7172 non-null   int64  
 2   Address              7172 non-null   object 
 3   City                 7172 non-null   object 
 4   ZipCode              7172 non-null   int64  
 5   County               7172 non-null   object 
 6   Legal                7172 non-null   object 
 7   NoBed                7172 non-null   int32  
 8   Stories              7172 non-null   float64
 9   Style                7172 non-null   object 
 10  YearBuilt            7172 non-null   int64  
 11  BuildSqft            7172 non-null   int64  
 12  LotSize              7172 non-null   float64
 13  Fireplace            7172 non-null   int64  
 14  Heating              7172 non-null   object 
 15  Cooling              7172 non-null   o

## 1.20 Other Fees<a id='1.20_Other_Fees'></a>

In [164]:
single_family_df.OtherFees.value_counts()

No / 0        3453
Yes / 250     1350
Yes / 200      462
Yes / 300      265
Yes / 150      217
              ... 
Yes / 95         1
Yes / 2237       1
Yes / 2948       1
Yes / 1075       1
Yes / 180        1
Name: OtherFees, Length: 200, dtype: int64

We need to clean this column. To do so, I keep only the numeric amount after the slash, so 'No / 0' becomes 0 and 'Yes / 250' becomes 250.

In [165]:
single_family_df.OtherFees = single_family_df.OtherFees.apply(lambda x: float(x.split('/')[1].strip()))

In [166]:
single_family_df.OtherFees

0         80.0
1         80.0
2         80.0
3       1100.0
4        250.0
         ...  
7167       0.0
7168       0.0
7169       0.0
7170       0.0
7171       0.0
Name: OtherFees, Length: 7172, dtype: float64
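
For reference, the same extraction can be written as a small named function that is easy to test on single values. The name `parse_other_fee` is an assumption. Like the lambda above, it expects every entry to contain a slash and raises an error otherwise.

```python
def parse_other_fee(raw):
    """Return the amount after the slash in entries such as 'Yes / 250'."""
    # partition splits once on the first '/', giving (before, '/', after)
    _, _, amount = raw.partition('/')
    return float(amount.strip())

print(parse_other_fee('No / 0'), parse_other_fee('Yes / 250'))
```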

In [167]:
single_family_df.head()

Unnamed: 0,image_link,ListingPrice,Address,City,ZipCode,County,Legal,NoBed,Stories,Style,...,MedianSqft,NeighborValRangeMin,NeighborValRangeMax,MedianPrice/Sqft,PaidTax,TaxRate,TotalBedSqft,FullBath,HalfBath,MaintenanceFee
0,"['https://photos.harstatic.com/188738555/hr/img-1.jpeg?ts=2020-08-24T11:45:48.617', 'https://pho...",379990,1727 Eado Point Lane,Houston,77003,Harris County,LT 13 BLK 2 EADO POINT,3,3.0,Contemporary/Modern,...,1485,51,383,192.51,2169.0,2.6554,472.333333,3,1,1195.0
1,"['https://photos.harstatic.com/189115063/hr/img-1.jpeg?ts=2020-08-31T17:38:09.017', 'https://pho...",394990,2712 EaDo Grove Lane,Houston,77003,Harris County,LT 21 BLK 1 EaDo Grove,3,3.0,Contemporary/Modern,...,1485,51,383,192.51,2169.0,2.6554,474.0,3,1,1195.0
2,"['https://photos.harstatic.com/186855660/hr/img-1.jpeg?ts=2020-06-26T17:06:15.333', 'https://pho...",394990,2714 EaDo Grove Lane,Houston,77003,Harris County,LT 22 BLK 1 EaDo Grove,3,3.0,Contemporary/Modern,...,1485,51,383,192.51,2169.0,2.6554,474.0,3,1,1195.0
3,"['https://photos.harstatic.com/188972348/hr/img-1.jpeg?ts=2020-08-26T12:59:30.780', 'https://pho...",409000,1806 Elite Drive,Houston,77003,Harris County,LT 8 BLK 1 ELITE TOWNHOMES LLC,3,4.0,Contemporary/Modern,...,1434,35,383,186.01,886.0,2.5264,474.0,3,1,1200.0
4,"['https://photos.harstatic.com/189100987/hr/img-1.jpeg?ts=2020-09-04T12:19:37.237', 'https://pho...",408000,2614 Capitol Street,Houston,77003,Harris County,LT 7 BLK 1 CAPITOL OAKS SEC 1 2ND AMEND,3,3.0,"Contemporary/Modern,Traditional",...,1563,360,527,200.05,8899.0,2.5466,522.0,3,1,2250.0


## 1.21 Roof<a id='1.21_Roof'></a>

In [168]:
single_family_df.Roof.unique()

array(['Composition', 'Aluminum', 'Other', 'Tile', 'Composition, Other',
       'Aluminum, Composition', 'Composition, Slate', 'Slate',
       'Composition, Tile', 'Wood Shingle', 'Built Up, Composition',
       'Aluminum, Other', 'Built Up, Tile', 'Other, Tile',
       'Aluminum, Composition, Other', 'Other, Wood Shingle', 'Built Up',
       'Aluminum, Other, Wood Shingle', 'Aluminum, Slate',
       'Composition, Wood Shingle'], dtype=object)

To reduce the number of categories in the roof type column, I create a list of standard roof types and fuzzy-match each value in the column against it. Any value that scores 90 or higher against a standard type is replaced with that type.

In [169]:
standard_roof=['Composition','Aluminum','Tile','Slate','Wood Shingle','Built Up','Other']
# For each standard roof type
for roof in standard_roof:
    # Score every value in the Roof column against the standard roof type
    matches = process.extract(roof, single_family_df.Roof,
                 limit = single_family_df.shape[0])
    # For each possible_match with similarity score >= 90
    for possible_match in matches:
        if possible_match[1] >= 90:
            # Replace the matched value with the standard roof type
            matching = single_family_df.Roof == possible_match[0]
            single_family_df.loc[matching , 'Roof'] = roof

In [170]:
single_family_df.Roof.unique()

array(['Composition', 'Aluminum', 'Other', 'Tile', 'Slate',
       'Wood Shingle', 'Built Up'], dtype=object)
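
Because every raw roof value is a comma-separated list of standard names, the categories can also be collapsed without similarity scores. The sketch below is a different technique from the fuzzy matching used above: it splits each value and keeps the first standard type, in list order, that appears. The function name `collapse_roof` is an assumption, and the result is not guaranteed to agree with the fuzzy-matching output in every case.

```python
STANDARD_ROOF = ['Composition', 'Aluminum', 'Tile', 'Slate',
                 'Wood Shingle', 'Built Up', 'Other']

def collapse_roof(raw, standard=STANDARD_ROOF):
    """Map a comma-separated roof description to one standard roof type."""
    # Individual roof types listed in the raw value
    parts = {part.strip() for part in raw.split(',')}
    # Earlier entries in the standard list take priority
    for name in standard:
        if name in parts:
            return name
    # Anything unrecognised falls back to 'Other'
    return 'Other'

print(collapse_roof('Built Up, Tile'))
```

The priority order is a design choice: with 'Composition' first, any mixed roof that includes composition shingles is labelled 'Composition'.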

## 1.22 Foundation<a id='1.21_Foundation'></a>

In [171]:
single_family_df.Foundation.unique()

array(['Slab', 'Block & Beam', 'Slab on Builders Pier', 'Pier & Beam',
       'Other, Slab', 'Pier & Beam, Slab', 'Slab, Slab on Builders Pier',
       'Pier & Beam, Slab on Builders Pier', 'Other',
       'Block & Beam, Slab', 'Other, Pier & Beam',
       'Block & Beam, Pier & Beam', 'Block & Beam, Slab on Builders Pier',
       'Other, Slab on Builders Pier', 'Block & Beam, Other, Pier & Beam',
       'Block & Beam, Other', 'On Stilts'], dtype=object)

In [173]:
standard_foundation=['Slab','Block & Beam','Pier & Beam','On Stilts','Other']
# For each standard foundation type
for found in standard_foundation:
    # Score every value in the Foundation column against the standard foundation type
    matches = process.extract(found, single_family_df.Foundation,
                 limit = single_family_df.shape[0])
    # For each possible_match with similarity score >= 90
    for possible_match in matches:
        if possible_match[1] >= 90:
            # Replace the matched value with the standard foundation type
            matching = single_family_df.Foundation == possible_match[0]
            single_family_df.loc[matching , 'Foundation'] = found
print(single_family_df.Foundation.unique())

['Slab' 'Block & Beam' 'Pier & Beam' 'Other' 'On Stilts']


In [174]:
single_family_df.ExteriorType.unique()

array(['Cement Board', 'Cement Board, Stucco', 'Brick, Wood',
       'Brick, Cement Board, Wood', 'Brick, Cement Board',
       'Brick & Wood, Cement Board', 'Stucco', 'Brick, Vinyl, Wood',
       'Brick, Other', 'Other', 'Brick', 'Brick & Wood', 'Brick, Stucco',
       'Wood', 'Brick Veneer, Cement Board, Stone', 'Stone, Stucco',
       'Brick & Wood, Stucco', 'Aluminum',
       'Brick & Wood, Cement Board, Wood', 'Cement Board, Stone',
       'Asbestos, Wood', 'Brick, Stone, Stucco, Vinyl, Wood',
       'Brick Veneer, Stucco', 'Cement Board, Stone, Wood',
       'Aluminum, Vinyl', 'Stone', 'Stucco, Wood', 'Other, Wood',
       'Brick & Wood, Brick', 'Stone, Stucco, Wood',
       'Brick Veneer, Cement Board', 'Asbestos, Brick & Wood',
       'Brick, Vinyl', 'Brick Veneer', 'Cement Board, Other, Stone, Wood',
       'Brick, Stone, Stucco', 'Brick, Cement Board, Stucco',
       'Brick Veneer, Stone', 'Other, Stucco', 'Stone & Wood, Stucco',
       'Unknown', 'Cement Board, Wood', 'Alumi

In [175]:
single_family_df.to_csv('../data/processed/SingleFamily.csv',index=False)