## 2. Data wrangling

## 2.1 Contents<a id='2.1_Contents'></a>
* [2 Data wrangling](#2_Data_wrangling)
  * [2.1 Contents](#2.1_Contents)
  * [2.2 Introduction](#2.2_Introduction)
  * [2.3 Imports](#2.3_Imports)
  * [2.4 Load The House Price Data](#2.4_Load_The_House_Price_Data)
  * [2.5 Filtering Single Family Property Type](#2.5_Filtering_Single_Family_Property_Type) 
  * [2.6 Missing Values](#2.6_Missing_Values) 
  * [2.7 Garage](#2.7_Garage) 
  * [2.8 Living](#2.8_Living) 
  * [2.9 Dining](#2.9_Dining) 
  * [2.10 Kitchen](#2.10_Kitchen)
  * [2.11 Subdivision](#2.11_Subdivision)

## 2.2 Introduction<a id='2.2_Introduction'></a>

In this section I investigate data scraped from www.HAR.com. Most of the cleaning happens at this stage, because nearly every column is categorical (stored as text) and needs to be converted to numeric form. I remove features with many missing values and create new features.

## 2.3 Imports<a id='2.3_Imports'></a>

In [400]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import substring
import re
from fuzzywuzzy import process
import warnings
warnings.filterwarnings('ignore')

## 2.4 Load The House Price Data<a id='2.4_Load_The_House_Price_Data'></a>

In [534]:
data= pd.read_csv('../Prediction House Price Using Image Processing/Data/Houston_Home_List.csv',encoding = "ISO-8859-1")
print('data shape is:',data.shape)

data shape is: (15102, 101)


In [535]:
data.columns

Index(['Unnamed: 0', 'image_link', 'Listing Price:', 'Address:', 'City:',
       'State:', 'Zip Code:', 'County:', 'Subdivision:', 'Legal Description:',
       ...
       'Extra Room:', 'Wine Room:', 'Carport Description:',
       'Median Appraised Value / Square ft.:', 'Den:', 'Utility Room Desc:',
       'Sunroom:', 'Guest Suite:', 'Bath:', 'Garage Apartment:'],
      dtype='object', length=101)

## 2.5 Filtering Single Family Property Type<a id='2.5_Filtering_Single_Family_Property_Type'></a>

Since we are going to analyze images together with other house features, the records should be as comparable as possible. For example, lots have no images of a building or rooms, and the features of multi-family properties differ from those of single-family homes. Let's see what property types the dataset contains:

In [536]:
data['Property Type:'].value_counts()

Single-Family                          11141
Lots                                    1551
Townhouse/Condo - Townhouse              950
Townhouse/Condo - Condominium            594
Mid/Hi-Rise Condo                        436
Country Homes/Acreage                    154
Multi-Family - Duplex                    107
Multi-Family - Fourplex                   46
Country Homes/Acreage - Free Standi       46
Multi-Family - 5 Plus                     38
Multi-Family - Triplex                    15
Multi-Family - Multiple Detached Dw        9
Country Homes/Acreage - Manufacture        4
Lot & Acreage - Residential                3
Residential - Condo                        2
Residential - Townhouse                    1
Single Family                              1
Name: Property Type:, dtype: int64

The majority of properties are single-family, so I keep them and remove the other types. Note that the exact match below leaves out the one row labeled `Single Family` (without the hyphen).

In [537]:
single_family_df = data[data['Property Type:']=='Single-Family']
single_family_df.reset_index(drop=True,inplace=True)
len(single_family_df)

11141
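
The exact-match filter above keeps 11141 rows and skips the single `Single Family` row. A minimal sketch of normalizing the label before filtering is shown here on made-up rows; the helper name `normalize_property_type` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def normalize_property_type(labels: pd.Series) -> pd.Series:
    """Lower-case the labels and treat hyphens and spaces as the same separator."""
    return (labels.str.strip()
                  .str.lower()
                  .str.replace(r"[-\s]+", " ", regex=True))

# Toy rows for illustration only; not taken from the real dataset
toy = pd.DataFrame({"Property Type:": ["Single-Family", "Lots", "Single Family", None]})

# Missing labels become NaN, which never equals the target string
is_single_family = normalize_property_type(toy["Property Type:"]) == "single family"
single_family_toy = toy[is_single_family].reset_index(drop=True)
print(len(single_family_toy))
```

Whether the extra row is worth keeping is a judgment call; the sketch only shows that both spellings can be caught with one comparison.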

In [538]:
single_family_df.head()

Unnamed: 0.1,Unnamed: 0,image_link,Listing Price:,Address:,City:,State:,Zip Code:,County:,Subdivision:,Legal Description:,...,Extra Room:,Wine Room:,Carport Description:,Median Appraised Value / Square ft.:,Den:,Utility Room Desc:,Sunroom:,Guest Suite:,Bath:,Garage Apartment:
0,85,"['https://photos.harstatic.com/190618667/hr/img-1.jpeg?ts=2020-10-19T10:42:58.003', 'https://pho...","$ 575,000 ($232.98/sqft.) $Convert",1316 Hadley Street,Houston,TX,77002,Harris County,Austin Hadley Place,LT 4 BLK 1 AUSTIN HADLEY PLACE,...,,,,,,,,,,
1,88,"['https://photos.harstatic.com/190420550/hr/img-1.jpeg?ts=2020-10-15T13:09:36.753', 'https://pho...","$ 465,000 ($221.85/sqft.) $Convert",110 Pierce Street,Houston,TX,77002,Harris County,Modern Midtown (View subdivision price trend),LT 12 BLK 1 MODERN MIDTOWN,...,,,,$223.83,,,,,,
2,89,"['https://photos.harstatic.com/190088153/hr/img-1.jpeg?ts=2020-10-08T13:52:48.230', 'https://pho...","$ 450,000 ($223.33/sqft.) $Convert",118 Pierce Street,Houston,TX,77002,Harris County,Modern Midtown (View subdivision price trend),LT 8 BLK 1 MODERN MIDTOWN,...,,,,$223.83,,,,,,
3,99,"['https://photos.harstatic.com/189387790/hr/img-1.jpeg?ts=2020-10-22T12:28:16.593', 'https://pho...","$ 259,000 ($203.30/sqft.) $Convert",311 N Milby Street,Houston,TX,77003,Harris County,Merkels Sec 01 (View subdivision price trend),LT 3 BLK 15 MERKELS SEC 1,...,,,,,"['12 x 17, 1st', '12 , 17, 1st']","['12 x 7, 1st', '12 , 7, 1st']",,,,
4,108,"['https://photos.harstatic.com/177650081/hr/img-1.jpeg?ts=2019-08-30T14:53:35.547', 'https://pho...","$ 236,999 ($196.19/sqft.) $Convert \n\n\n Reduced 1.25%\n Reduced 1.25%\n\nReduced 1.25%\n X...",216 Hutcheson,Houston,TX,77003,Harris County,MERKELS (View subdivision price trend),LT 9 BLK 5 MERKELS SEC 1,...,,,,,,,,,,


In our dataset `State:` and `Property Type:` are now the same for every house, so we can remove them, along with the leftover index column `Unnamed: 0`:

In [539]:
single_family_df.drop(['Unnamed: 0','State:','Property Type:'],axis=1,inplace=True)

## 2.6 Missing Values<a id='2.6_Missing_Values'></a>

In [540]:
# Return the count and percentage of missing values for each column of df
def missing_cal(df):
    missing = pd.concat([df.isnull().sum(), 100 * df.isnull().mean()], axis=1)
    missing.columns=['count', '%']
    return missing

In [541]:
missing = missing_cal(single_family_df)
missing

| | count | % |
|---|---|---|
| image_link | 0 | 0.000000 |
| Listing Price: | 3 | 0.026928 |
| Address: | 0 | 0.000000 |
| City: | 0 | 0.000000 |
| Zip Code: | 0 | 0.000000 |
| ... | ... | ... |
| Utility Room Desc: | 7178 | 64.428687 |
| Sunroom: | 10909 | 97.917602 |
| Guest Suite: | 11008 | 98.806211 |
| Bath: | 9449 | 84.812853 |


Let's take a look at features with more than 90% missing values:

In [542]:
missing = missing_cal(single_family_df)
nan_90 = missing.loc[missing['%']>90].index
print('Number of Features with more than 90% None: ',len(nan_90))

Number of Features with more than 90% None:  9


In [543]:
missing.loc[nan_90].sort_values(by="%")

| | count | % |
|---|---|---|
| Extra Room: | 10068 | 90.368908 |
| Median Appraised Value / Square ft.: | 10217 | 91.70631 |
| Media Room: | 10254 | 92.038417 |
| Carport Description: | 10523 | 94.452922 |
| Water Amenity: | 10747 | 96.463513 |
| Garage Apartment: | 10822 | 97.136702 |
| Sunroom: | 10909 | 97.917602 |
| Wine Room: | 11002 | 98.752356 |
| Guest Suite: | 11008 | 98.806211 |
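
This section repeats the same missing-percentage filter with different bounds (90%, 80%, 70%, 50%). As a rough sketch, the bands could also be computed in one pass. The toy frame, the `col` helper and the band labels below are made up for illustration.

```python
import numpy as np
import pandas as pd

def col(n_missing, n_rows=8):
    """Toy column with the given number of missing values."""
    return [np.nan] * n_missing + [1.0] * (n_rows - n_missing)

# Toy frame for illustration only
toy = pd.DataFrame({
    "kept": col(0),        # 0% missing
    "half_plus": col(5),   # 62.5% missing
    "mostly": col(6),      # 75% missing
    "nearly_all": col(7),  # 87.5% missing
    "empty": col(8),       # 100% missing
})

missing_pct = 100 * toy.isnull().mean()

# Each band is (lower, upper]: strictly above the lower bound, up to and including the upper
bounds = [(90, 100), (80, 90), (70, 80), (50, 70)]
bands = {
    f"{lo}-{hi}%": list(missing_pct[(missing_pct > lo) & (missing_pct <= hi)].index)
    for lo, hi in bounds
}
print(bands)
```

Making the bounds explicit in one place also avoids a band label drifting away from the filter that produced it.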


We need to see what kind of information each of these features holds:

In [544]:
for item in nan_90:
    print('Value Count for '+item)
    print(single_family_df[item].value_counts())
    print('-'*100)

Value Count for Media Room:
['2nd', '2nd']                        27
['16x12, 2nd', '4.88 x 3.66(m)']      17
['15x13, 2nd', '4.57 x 3.96(m)']      15
['14x13, 2nd', '4.27 x 3.96(m)']      15
['13x15, 2nd', '3.96 x 4.57(m)']      13
                                      ..
['17x18, 2nd', '5.18 x 5.49(m)']       1
['23 x 18, 3rd', '23 , 18, 3rd']       1
['17X22, 2nd', '5.18 x 6.71(m)']       1
['20x14.7, 2nd', '6.10 x 4.48(m)']     1
['23 x 15, 2nd', '23 , 15, 2nd']       1
Name: Media Room:, Length: 454, dtype: int64
----------------------------------------------------------------------------------------------------
Value Count for Water Amenity:
Lake View                                                     119
Lake View, Lakefront                                           83
Pond                                                           55
Lakefront                                                      48
Bayou Frontage, Bayou View                                      9
Bayou View    

* The values of `Media Room`, `Extra Room`, `Wine Room`, `Sunroom`, `Guest Suite` and `Garage Apartment` are mostly room dimensions, mixed with uninformative entries such as `Yes` for `Garage Apartment`, so I drop these columns.
* `Water Amenity` has too many unique categories, and there is no reliable way to fill the remaining missing values with the correct category, so I drop it as well.
* `Carport Description` has 3 different categories across a total of 611 houses. The remaining houses presumably have no carport, so I fill all missing values with a new category, 'not applicable'.
* `Median Appraised Value / Square ft.:` is a subdivision-level statistic (based on active listings) and can be filled with the value from the same subdivision.

In [545]:
# Replacing None value for 'Carport Description:' with 'not applicable'
single_family_df['Carport Description:'] = single_family_df['Carport Description:'].fillna('not applicable')

# Dropping 'Media Room:', 'Water Amenity:', 'Extra Room:', 'Wine Room:',
# 'Sunroom:', 'Guest Suite:', 'Garage Apartment:'
single_family_df.drop(['Media Room:', 'Water Amenity:', 'Extra Room:', 'Wine Room:', 'Sunroom:', 'Guest Suite:', 
                       'Garage Apartment:'],axis=1,inplace=True)
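
The idea of filling a subdivision-level statistic from other listings in the same subdivision could look roughly like the sketch below. It assumes the column has already been converted to numbers (in the raw data it is text such as `$223.83`), and the toy values are made up.

```python
import numpy as np
import pandas as pd

col = "Median Appraised Value / Square ft.:"

# Toy listings for illustration only
toy = pd.DataFrame({
    "Subdivision:": ["A", "A", "B", "B", "C"],
    col: [223.83, np.nan, np.nan, 150.0, np.nan],
})

# 'first' takes the first non-null value in each group and broadcasts it to every row
group_value = toy.groupby("Subdivision:")[col].transform("first")
toy[col] = toy[col].fillna(group_value)
print(toy)
```

A subdivision with no known value at all (group `C` here) stays missing, so this only helps where at least one listing in the group carries the statistic.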

The next step is to look at the features with more than 80% missing values:

In [546]:
missing = missing_cal(single_family_df)
nan_80 = missing.loc[missing['%']>80].index
print('Number of Features with more than 80% None: ',len(nan_80))

Number of Features with more than 80% None:  14


In [547]:
missing.loc[nan_80].sort_values(by="%")

| | count | % |
|---|---|---|
| Average Square Ft.: | 9412 | 84.480747 |
| Average Price/Square Ft.: | 9412 | 84.480747 |
| Market Area Name: | 9413 | 84.489723 |
| Home For Sales: | 9413 | 84.489723 |
| Average List Price: | 9413 | 84.489723 |
| Home For Lease: | 9413 | 84.489723 |
| Average Lease: | 9413 | 84.489723 |
| Average Lease/Square Ft.: | 9413 | 84.489723 |
| Bath: | 9449 | 84.812853 |
| Den: | 9486 | 85.14496 |


In [548]:
# print the value counts for each feature with more than 80% missing values
for item in nan_80:
    print('Value Count for '+item)
    print(single_family_df[item].value_counts())
    print('-'*100)

Value Count for Private Pool Desc:
In Ground                                355
Gunite, In Ground                        344
Gunite, Heated, In Ground                234
Gunite                                   217
Heated, In Ground                         92
Gunite, Heated, In Ground, Salt Water     46
Gunite, Heated                            40
Above Ground                              28
Heated, In Ground, Salt Water             21
Gunite, In Ground, Salt Water             20
In Ground, Salt Water                     16
Gunite, Salt Water                        13
Gunite, Heated, Salt Water                10
Enclosed, Heated, In Ground                8
Fiberglass, In Ground                      5
Heated                                     5
In Ground, Vinyl Lined                     5
Salt Water                                 4
Fiberglass                                 4
Enclosed, In Ground                        4
Above Ground, Heated                       2
Heated, Salt Water  

* `Controlled Access` categories are combinations of 'Automatic', 'Driveway', 'Manned' and 'Intercom', which suggests that the remaining houses have no controlled access of any type. Filling missing values with 'no controlled access' seems reasonable.
* As with `Water Amenity`, `Private Pool Desc` has many categories. Counting its categories within each `Private Pool:` group shows that some houses without a private pool still have a pool description, which was probably entered by mistake, so I drop this column.
* `Master Planned Community` and `Market Area Name` categories seem to duplicate the subdivision name; we deal with them later in the Subdivision section.
* `Home For Sales`, `Average List Price`, `Average Square Ft.`, `Average Price/Square Ft.`, `Home For Lease`, `Average Lease` and `Average Lease/Square Ft.` are subdivision-level statistics (based on active listings) and can be filled with the value from the same subdivision.
* `Den` and `Bath` hold dimensions along with other values such as '1th', which I take to be a typo, so I drop them.

In [549]:
#counting 'Private Pool Desc:' category for `Private Pool:` groups
single_family_df.groupby('Private Pool:')['Private Pool Desc:'].value_counts()

Private Pool:  Private Pool Desc:                   
No             In Ground                                 14
               Enclosed, Heated, In Ground                6
               Above Ground                               4
               Heated, In Ground                          4
               Gunite                                     3
               Gunite, In Ground                          3
               Fiberglass                                 2
               Gunite, Heated, In Ground                  1
Yes            Gunite, In Ground                        341
               In Ground                                341
               Gunite, Heated, In Ground                233
               Gunite                                   214
               Heated, In Ground                         88
               Gunite, Heated, In Ground, Salt Water     46
               Gunite, Heated                            40
               Above Ground                    

In [550]:
single_family_df.drop(['Private Pool Desc:','Bath:','Den:'],axis=1,inplace=True)
# Replacing None value for 'Controlled Access:' with 'no controlled access'
single_family_df['Controlled Access:'] = single_family_df['Controlled Access:'].fillna('no controlled access')
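
The consistency check between `Private Pool:` and `Private Pool Desc:` can also be expressed as a direct count of contradictory rows. The following is a small sketch on made-up rows; the labels `described` and `no description` are introduced here only for readability.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "Private Pool:": ["Yes", "No", "No", "Yes", "No"],
    "Private Pool Desc:": ["Gunite", "In Ground", None, None, None],
})

# Rows flagged 'No' that nevertheless carry a pool description
contradictory = (toy["Private Pool:"] == "No") & toy["Private Pool Desc:"].notna()
print(int(contradictory.sum()))

# Cross-tabulate the flag against "has a description" to see all four cases at once
has_desc = toy["Private Pool Desc:"].notna().map({True: "described", False: "no description"})
summary = pd.crosstab(toy["Private Pool:"], has_desc)
print(summary)
```

A single number for the contradictory rows makes it easier to judge whether dropping the column or correcting the flag is the better fix.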

Now I investigate features with between 70% and 80% missing values:

In [551]:
missing = missing_cal(single_family_df)
nan_70 = missing.loc[((missing['%']>70 )& (missing['%']<80))].index
print('Number of Features with more than 70% None: ',len(nan_70))

Number of Features with more than 70% None:  2


In [552]:
missing.loc[nan_70].sort_values(by="%")

| | count | % |
|---|---|---|
| Family Room: | 7858 | 70.532268 |
| Primary Bath: | 8315 | 74.634234 |


In [553]:
#printing value count for each feature with more than 70 none value
for item in nan_70:
    print('Value Count for '+item)
    print(single_family_df[item].value_counts())
    print('-'*100)

Value Count for Family Room:
['1st', '1st']                          75
['18x16, 1st', '5.49 x 4.88(m)']        52
['20x16, 1st', '6.10 x 4.88(m)']        51
['21x17, 1st', '6.40 x 5.18(m)']        39
['19x18, 1st', '5.79 x 5.49(m)']        37
                                        ..
['0x0, 1st', '0,0, 1st']                 1
["16'6x19'8, 1st", "16'6,19'8, 1st"]     1
['22.3 x 20, 1st', '22.3 , 20, 1st']     1
['14X20, 1st', '4.27 x 6.10(m)']         1
['17x30, 1st', '5.18 x 9.14(m)']         1
Name: Family Room:, Length: 939, dtype: int64
----------------------------------------------------------------------------------------------------
Value Count for Primary Bath:
['1st', '1st']                          765
['2nd', '2nd']                          308
['3rd', '3rd']                          128
['12x10, 1st', '3.66 x 3.05(m)']         31
['12x7, 1st', '3.66 x 2.13(m)']          20
                                       ... 
["15' x 12', 1st", "15' , 12', 1st"]      1
['18X8, 3rd',

`Family Room` and `Primary Bath` hold the dimensions of the family room and the primary bathroom. Every house should have these rooms, so a missing value cannot be read as 0. Since I cannot fill in values for more than 70% of the houses, I drop these features.

In [554]:
single_family_df.drop(['Family Room:','Primary Bath:'],axis=1,inplace=True)

The next step is to look at features with between 50% and 70% missing values:

In [555]:
missing = missing_cal(single_family_df)
nan_50_60 = missing.loc[((missing['%']>50 )& (missing['%']<70))].index
print('Number of Features with more than 50% and less than 60% None: ',len(nan_50_60))

Number of Features with more than 50% and less than 60% None:  6


In [556]:
missing.loc[nan_50_60].sort_values(by="%")

| | count | % |
|---|---|---|
| Front Door: | 6471 | 58.082757 |
| Breakfast: | 6724 | 60.353649 |
| Garage Carport: | 6734 | 60.443407 |
| Utility Room Desc: | 7178 | 64.428687 |
| Game Room: | 7367 | 66.125123 |
| Study/Library: | 7658 | 68.737097 |


In [557]:
#printing value count for each feature with more than 50 none value
for item in nan_50_60:
    print('Value Count for '+item)
    print(single_family_df[item].value_counts())
    print('-'*100)

Value Count for Breakfast:
['1st', '1st']                              192
['10x10, 1st', '3.05 x 3.05(m)']            155
['11x10, 1st', '3.35 x 3.05(m)']            149
['12x10, 1st', '3.66 x 3.05(m)']            147
['10x9, 1st', '3.05 x 2.74(m)']             108
                                           ... 
["11' x 10', 1st", "11' , 10', 1st"]          1
['12.4 x 17.6, 1st', '12.4 , 17.6, 1st']      1
["14'6x12'5, 1st", "14'6,12'5, 1st"]          1
['11 x 14, 1st', '11 , 14, 1st']              1
["11'2 x 12', 1st", "11'2 , 12', 1st"]        1
Name: Breakfast:, Length: 769, dtype: int64
----------------------------------------------------------------------------------------------------
Value Count for Game Room:
['2nd', '2nd']                        72
['18x14, 2nd', '5.49 x 4.27(m)']      52
['19x16, 2nd', '5.79 x 4.88(m)']      44
['16x14, 2nd', '4.88 x 4.27(m)']      37
['18x16, 2nd', '5.49 x 4.88(m)']      37
                                      ..
['20 x 16, 2nd', '20 , 16, 

There is nothing to fill the missing values with for these features: no dimensions are recorded for the `Utility Room`, `Study/Library`, `Game Room` and `Breakfast` areas, and I do not know the `Garage Carport` status or the `Front Door` direction of the remaining houses. These features are dropped as well.

In [558]:
single_family_df.drop(list(nan_50_60),axis=1,inplace=True)

So far I have investigated features with more than 50% missing values. I still need to dig further and fill in the features that are subdivision-level statistics (based on active listings), such as `Home For Sales`, `Average List Price`, `Average Square Ft.`, `Average Price/Square Ft.`, `Home For Lease`, `Average Lease` and `Average Lease/Square Ft.`. Before that, let's take a look at features with between 10% and 50% missing values:

In [559]:
missing = missing_cal(single_family_df)
nan_10_50 = missing.loc[((missing['%']>10 )& (missing['%']<50))].index
print('Number of Features with more than 10% and less than 50% None: ',len(nan_10_50))

Number of Features with more than 10% and less than 50% None:  35


In [560]:
missing.loc[nan_10_50].sort_values(by="%")

| | count | % |
|---|---|---|
| Garage(s): | 1292 | 11.596805 |
| Tax Rate: | 1323 | 11.875056 |
| Dishwasher: | 1413 | 12.682883 |
| Bedroom Desc: | 1682 | 15.097388 |
| Median Appraised Value: | 1746 | 15.671843 |
| Median Year Built: | 1746 | 15.671843 |
| Median Lot Square Ft.: | 1746 | 15.671843 |
| Median Square Ft.: | 1746 | 15.671843 |
| Single Family Properties: | 1746 | 15.671843 |
| County / Zip Code: | 1746 | 15.671843 |

In [561]:
# print the value counts for each feature with between 10% and 50% missing values
for item in nan_10_50:
    print('Value Count for '+item)
    print(single_family_df[item].value_counts())
    print('-'*100)

Value Count for Garage(s):
2 / Attached           5626
2 / Detached           1063
3 / Attached            686
1 / Detached            341
1 / Attached            320
                       ... 
4 / Detached,Tandem       1
20 / Attached             1
42 / Attached             1
21 / Attached             1
7 / Attached              1
Name: Garage(s):, Length: 114, dtype: int64
----------------------------------------------------------------------------------------------------
Value Count for Living:
['15x13, 1st', '4.57 x 3.96(m)']                                    58
['14x12, 1st', '4.27 x 3.66(m)']                                    57
['20x15, 1st', '6.10 x 4.57(m)']                                    57
['20x16, 1st', '6.10 x 4.88(m)']                                    55
['18x15, 1st', '5.49 x 4.57(m)']                                    55
                                                                    ..
["13.5 x 11.1, 1st", "13'6 , 11'2, 1st"]                             1

Aliana                      120
Houston Heights             109
Tavola                       98
Oak Forest ( East )          94
Westbury                     81
                           ... 
Merilyn Place                 1
Mission Trace (Fortbend)      1
Harbor Homesite               1
Waterhill Homes On Ralph      1
Loftus Oaks                   1
Name: Subdivision Name:, Length: 2111, dtype: int64
----------------------------------------------------------------------------------------------------
Value Count for County / Zip Code:
77407.0    229
77373.0    225
77077.0    211
77379.0    209
77008.0    199
          ... 
77355.0      1
77362.0      1
77401.0      1
77450.0      1
77503.0      1
Name: County / Zip Code:, Length: 116, dtype: int64
----------------------------------------------------------------------------------------------------
Value Count for Single Family Properties:
3,292    120
3,936    109
852      102
3,845     94
3,399     81
        ... 
503        1
304      

`Room Description`, `Countertop`, `Floors`, `Bedroom Desc`, `Kitchen Desc`, `Bathroom Description`, `Connections`, `Oven`, `Range`, `Energy Feature`, `Interior`, `Exterior` and `Financing Considered` are descriptive text fields. We cannot fill their missing values reliably, since a guess may be inaccurate, and I do not think they are relevant to our analysis, so I drop all of them.

In [562]:
single_family_df.drop(['Room Description:', 'Countertop:', 'Floors:', 'Bedroom Desc:', 'Kitchen Desc:', 
                       'Bathroom Description:','Connections:', 'Oven:', 'Range:', 'Energy Feature:',
                       'Interior:', 'Exterior:', 'Financing Considered:'], axis=1,inplace=True)

`Ice Maker`, `Microwave`, `Compactor`, `Dishwasher`, `Disposal` and `Area Pool` are 'Yes/No' categories, and I think it is reasonable to fill their missing values with 'No'. This is a simplification, since some houses may have these features and the owner or agent simply did not record them, but for now filling with 'No' is the most practical way to deal with them.

In [563]:
# Treat a missing Yes/No appliance or amenity flag as 'No'
yes_no_cols = ['Disposal:', 'Ice Maker:', 'Compactor:', 'Area Pool:', 'Microwave:', 'Dishwasher:']
single_family_df[yes_no_cols] = single_family_df[yes_no_cols].fillna('No')
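
Since the goal stated in the introduction is numeric features, the 'Yes/No' columns can be encoded as 1/0 in the same step as the fill. This is only a sketch on made-up rows, using two of the column names above.

```python
import pandas as pd

yes_no_cols = ["Disposal:", "Dishwasher:"]

# Toy rows for illustration only
toy = pd.DataFrame({
    "Disposal:": ["Yes", None, "No"],
    "Dishwasher:": [None, "Yes", "Yes"],
})

for col in yes_no_cols:
    # Missing is treated as 'No' (the same assumption as in the text), then mapped to 1/0
    toy[col] = toy[col].fillna("No").map({"Yes": 1, "No": 0})

print(toy)
```

Any value other than 'Yes' or 'No' would become NaN under this mapping, which is a convenient way to surface unexpected labels.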

At this point I investigate the other features individually:

## 2.7 Garage<a id='2.7_Garage'></a>

In [564]:
single_family_df['Garage(s):'].value_counts()

2 / Attached           5626
2 / Detached           1063
3 / Attached            686
1 / Detached            341
1 / Attached            320
                       ... 
4 / Detached,Tandem       1
20 / Attached             1
42 / Attached             1
21 / Attached             1
7 / Attached              1
Name: Garage(s):, Length: 114, dtype: int64

The important part of this feature is the number of garage spaces each house has. Two is by far the most common value in the counts above, so it is reasonable to fill missing values with '2'.

In [565]:
single_family_df['Garage(s):'] = single_family_df['Garage(s):'].fillna('2')
# The garage count is the first token, e.g. '2 / Attached' -> 2
single_family_df['garage'] = single_family_df['Garage(s):'].str.split(' ').str[0].astype(int)
single_family_df.drop('Garage(s):',axis=1,inplace=True)
single_family_df['garage'].value_counts()

2     8703
3     1445
1      758
4      175
5       21
6       11
8        7
7        4
24       2
10       1
40       1
56       1
9        1
57       1
63       1
26       1
42       1
27       1
51       1
20       1
21       1
45       1
22       1
58       1
Name: garage, dtype: int64

As you can see, some houses list more than 8 garage spaces, with values as high as 63, which is implausible. After checking the images of some of these houses on www.HAR.com, they appear to have only 2, so I replace every value above 8 with 2, the median of this feature.

In [566]:
single_family_df.loc[single_family_df['garage']>8, 'garage'] = int(single_family_df['garage'].median())
single_family_df['garage'].value_counts()

2    8720
3    1445
1     758
4     175
5      21
6      11
8       7
7       4
Name: garage, dtype: int64
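
The two garage steps (take the leading count, then replace implausible values with the median) can be condensed as in the sketch below. The sample strings follow the format shown above, but the series itself is made up.

```python
import pandas as pd

# Toy values for illustration only
raw = pd.Series(["2 / Attached", "2 / Detached", None, "42 / Attached", "3 / Attached"])

# The leading integer is the garage count; missing entries default to 2 as in the text
count = (raw.fillna("2")
            .str.extract(r"^\s*(\d+)", expand=False)
            .astype(int))

# Replace implausible counts (> 8) with the median of the column
median = int(count.median())
cleaned = count.where(count <= 8, median)
print(cleaned.tolist())
```

Note that the median here is computed with the outliers still included, as in the cell above; because the median is robust to a handful of extreme values, this makes little difference in practice.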

## 2.8 Living<a id='2.8_Living'></a>

To calculate the living area I multiply the dimensions of the living room and return the area in square feet. In the next step I will fill missing values with the average living room area per subdivision.

In [567]:
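def area_calc(item, pattern=r"([\d.]+)(?:.*?([\d.]+))?.*?[x\*].*?([\d.]+)(?:.*?([\d.]+))?"):
    """Return the room area in square feet from a split dimension string, or None if it cannot be parsed."""
    # Non-list input (e.g. NaN for a missing room) yields None
    if not isinstance(item, list):
        return None
    regex = re.compile(pattern, re.IGNORECASE)
    # Keep the feet/inches pieces that contain a separator; skip the metric '(m)' duplicate
    dims = [piece.replace('[', '').strip() for piece in item
            if ('x' in piece or 'X' in piece or '*' in piece) and '(m)' not in piece]
    # Entries that only name the floor (e.g. '1st') have no dimensions and keep area 0
    area = 0
    for d in dims:
        match = regex.findall(d.replace(' ', ''))
        try:
            # Groups are (feet, inches, feet, inches); an empty group counts as 0
            ft1, in1, ft2, in2 = [float(g) if g else 0 for g in match[0]]
        except (IndexError, ValueError):
            return None
        area += (ft1 + in1 / 12) * (ft2 + in2 / 12)
    return area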

In [568]:
single_family_df['clean_living'] = single_family_df['Living:'].str.split(',')
single_family_df['clean_living'] = single_family_df['clean_living'].apply(area_calc)        
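
To make the feet-and-inches arithmetic concrete, here is a simplified, standalone sketch. The pattern `DIM` and the function `parse_area` are hypothetical stand-ins for illustration, not the notebook's `area_calc`.

```python
import re

# Hypothetical simplified pattern: feet with optional inches on each side of the separator
DIM = re.compile(r"(\d+(?:\.\d+)?)(?:'(\d+))?[\"']?\s*[xX*]\s*(\d+(?:\.\d+)?)(?:'(\d+))?")

def parse_area(text):
    """Return square feet for strings such as "15'6 x 12'6" or '12x11', else None."""
    m = DIM.search(text)
    if m is None:
        return None
    # A missing inches group counts as 0 inches
    ft1, in1, ft2, in2 = (float(g) if g else 0.0 for g in m.groups())
    return (ft1 + in1 / 12) * (ft2 + in2 / 12)

print(parse_area("12x11"))        # 12 ft * 11 ft = 132.0
print(parse_area("15'6 x 12'6"))  # 15.5 ft * 12.5 ft = 193.75
print(parse_area("1st"))          # None: no dimensions to parse
```

One design difference: this sketch returns `None` when no dimensions are present, which keeps "unknown" distinct from a genuine area of 0.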

In [569]:
single_family_df['clean_living'].describe()

count    6294.000000
mean      288.996780
std       152.801571
min         0.000000
25%       210.000000
50%       272.125000
75%       342.000000
max      6651.000000
Name: clean_living, dtype: float64
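
The planned fill with the average living room area per subdivision might look like the sketch below. It uses made-up values and additionally assumes that a recorded area of 0 (the minimum in the summary above) should be treated as missing.

```python
import numpy as np
import pandas as pd

# Toy listings for illustration only
toy = pd.DataFrame({
    "Subdivision:": ["A", "A", "A", "B", "B"],
    "clean_living": [200.0, 300.0, np.nan, 0.0, 180.0],
})

# Assumption: an area of 0 means "not recorded", so treat it as missing
area = toy["clean_living"].replace(0, np.nan)

# Mean of the known areas within each subdivision, aligned to every row
sub_mean = area.groupby(toy["Subdivision:"]).transform("mean")
toy["clean_living"] = area.fillna(sub_mean)
print(toy)
```

A subdivision with a single listing, or with no known areas at all, would still be left missing or filled from one observation, so a fallback such as the overall median may be needed.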

## 2.9 Dining<a id='2.9_Dining'></a>

In [570]:
single_family_df['Dining:'].value_counts()

['12x11, 1st', '3.66 x 3.35(m)']              220
['13x11, 1st', '3.96 x 3.35(m)']              203
['13x12, 1st', '3.96 x 3.66(m)']              184
['14x12, 1st', '4.27 x 3.66(m)']              175
['1st', '1st']                                172
                                             ... 
['9x9, 2nd', '2.74 x 2.74(m)']                  1
['14x29, 1st', '4.27 x 8.84(m)']                1
['14\'2"x9\'9", 1st', '14\'2",9\'9", 1st']      1
['17x8, 2nd', '5.18 x 2.44(m)']                 1
["15'6 x 12'6, 1st", "15'6 , 12'6, 1st"]        1
Name: Dining:, Length: 1308, dtype: int64

In [571]:
single_family_df['clean_dining'] = single_family_df['Dining:'].str.split(',')
single_family_df['clean_dining'] = single_family_df['clean_dining'].apply(area_calc) 

In [572]:
single_family_df[['Dining:','clean_dining']].sample(20,random_state=101)

| | Dining: | clean_dining |
|---|---|---|
| 10595 | ['12x14, 1st', '3.66 x 4.27(m)'] | 168.0 |
| 9396 | ['13X10, 1st', '3.96 x 3.05(m)'] | 130.0 |
| 6552 | ['10x9, 1st', '3.05 x 2.74(m)'] | 90.0 |
| 2512 | ['12x12, 1st', '3.66 x 3.66(m)'] | 144.0 |
| 7776 | | |
| 2965 | | |
| 8038 | ['10x11, 1st', '3.05 x 3.35(m)'] | 110.0 |
| 7054 | ['12x16, 1st', '3.66 x 4.88(m)'] | 192.0 |
| 2269 | ['12x13, 2nd', '3.66 x 3.96(m)'] | 156.0 |
| 1475 | | |


## 2.10 Kitchen<a id='2.10_Kitchen'></a>

I use the same function to calculate the kitchen area in square feet.

In [573]:
single_family_df['clean_kitchen'] = single_family_df['Kitchen:'].str.split(',')
single_family_df['clean_kitchen'] = single_family_df['clean_kitchen'].apply(area_calc) 

In [574]:
single_family_df[['Kitchen:','clean_kitchen']].sample(20,random_state=101)

| | Kitchen: | clean_kitchen |
|---|---|---|
| 10595 | ['12x19, 1st', '3.66 x 5.79(m)'] | 228.0 |
| 9396 | ['13X11, 1st', '3.96 x 3.35(m)'] | 143.0 |
| 6552 | ['11x8, 1st', '3.35 x 2.44(m)'] | 88.0 |
| 2512 | ['9x12, 1st', '2.74 x 3.66(m)'] | 108.0 |
| 7776 | ['0X0, 1st', '0,0, 1st'] | 0.0 |
| 2965 | | |
| 8038 | | |
| 7054 | ['13x17, 1st', '3.96 x 5.18(m)'] | 221.0 |
| 2269 | ['15x12, 2nd', '4.57 x 3.66(m)'] | 180.0 |
| 1475 | | |


Now we can drop old living, dining and kitchen columns:

In [575]:
single_family_df.drop(['Living:', 'Dining:', 'Kitchen:'], axis=1,inplace=True)

## 2.11 Subdivision<a id='2.11_Subdivision'></a>

To standardize subdivision names, I scraped all subdivision names from HAR.com and will replace each name with the correct one based on string similarity:

In [576]:
sub_df = pd.read_csv('../Prediction House Price Using Image Processing/Data/Subdivision.csv')
sub_df.drop(['Unnamed: 0'],axis=1,inplace=True)

In [577]:
sub_df.head()

| | Subdivision | Zip | Med.Appraisal | Avg.Sqft. | Avg.Yr.Built |
|---|---|---|---|---|---|
| 0 | MARLOWE CONDOS | 77002 | $522,701 | 1100 | 2018.0 |
| 1 | Modern Midtown | 77002 | $469,147 | 2096 | 2014.0 |
| 2 | Midtowne Plaza | 77002 | $439,282 | 2507 | 1999.0 |
| 3 | Macgregor Demerritt | 77002 | $438,234 | 2034 | 1930.0 |
| 4 | Hermann Lofts Condo | 77002 | $385,446 | 1546 | 1998.0 |


As the table above shows, `Med.Appraisal`, `Avg.Sqft.` and `Avg.Yr.Built` are recorded once per subdivision, so we can use them to fill missing values in the corresponding columns.

In [578]:
single_family_df[['Subdivision:','Subdivision Name:','Market Area Name:','Master Planned Community:']].sample(20)

| | Subdivision: | Subdivision Name: | Market Area Name: | Master Planned Community: |
|---|---|---|---|---|
| 821 | Stude Sec 02 (View subdivision price trend) | Stude | | |
| 6516 | North Forest (View subdivision price trend) | North Forest (Houston) | | |
| 5503 | Southlake (View subdivision price trend) | Southlake (Houston) | | |
| 4575 | Greenwood Forest (View subdivision price trend) | Greenwood Forest | | |
| 5489 | Lakeside Place (View subdivision price trend) | Lakeside Place (Houston) | | |
| 3774 | Medical Center | | Medical Center Area | |
| 5482 | Ashford Forest Lake Sec 01 (View subdivision price trend) | Ashford Forest Lake | | |
| 3191 | BRIARGROVE PARK (View subdivision price trend) | Briargrove Park | | |
| 10881 | LakeHouse | | Katy - Old Towne | |
| 9609 | Fieldstone Sec 3 (View subdivision price trend) | Fieldstone | | Fieldstone |


In [579]:
single_family_df[['Subdivision:','Subdivision Name:','Market Area Name:','Master Planned Community:']].isna().sum()

Subdivision:                    5
Subdivision Name:            1746
Market Area Name:            9413
Master Planned Community:    9843
dtype: int64

These 4 columns appear to carry the same information. `Subdivision:` has the fewest null values, but `Market Area Name:` has more standardized names, so I fill missing `Subdivision Name:` values with the `Market Area Name:` values and check how many missing values remain.

In [580]:
single_family_df['SubName'] = single_family_df['Subdivision Name:'].fillna(single_family_df['Market Area Name:'])

In [581]:
single_family_df['SubName'].isna().sum()

18

In [582]:
single_family_df[['Subdivision:','Subdivision Name:','Market Area Name:']].loc[single_family_df['SubName'].isna()]

| | Subdivision: | Subdivision Name: | Market Area Name: |
|---|---|---|---|
| 67 | Mckinney Lndg Sub (View subdivision price trend) | | |
| 498 | VERMONT STREET GROVE | | |
| 629 | Magnolia Grove (View subdivision price trend) | | |
| 755 | SUNSET HEIGHTS (View subdivision price trend) | | |
| 847 | 24th Street Manor (View subdivision price trend) | | |
| 922 | Heights Homes/Herkimer Sub (View subdivision price trend) | | |
| 1358 | Shepherd Oaks (View subdivision price trend) | | |
| 2097 | Oaks of Lawndale | | |
| 3161 | Lakeside T/H (View subdivision price trend) | | |
| 3163 | Lakeside T/H (View subdivision price trend) | | |


I drop these 18 rows since I cannot find the correct subdivision name for them.

In [583]:
single_family_df=single_family_df[~single_family_df['SubName'].isnull()]
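
The dropped rows still carry a raw `Subdivision:` value. As an alternative to dropping them, one could try cleaning that value into a usable name. The sketch below shows the idea on three strings in the format seen in the tables above; the cleaning rules are assumptions and may not suit every name.

```python
import pandas as pd

raw = pd.Series([
    "Mckinney Lndg Sub (View subdivision price trend)",
    "VERMONT STREET GROVE",
    "Stude Sec 02 (View subdivision price trend)",
])

cleaned = (raw.str.replace(r"\s*\(View subdivision price trend\)", "", regex=True)
              .str.replace(r"\s+Sec\s+\d+$", "", regex=True)  # drop a trailing section number
              .str.strip()
              .str.title())                                    # unify upper/lower case
print(cleaned.tolist())
```

Whether the cleaned names actually match the reference list would still need to be checked, for example with the similarity matching in the next cell.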

In [584]:
single_family_df.drop(['Subdivision Name:','Subdivision:','Market Area Name:','Master Planned Community:'],
                      axis=1,inplace=True)

In [585]:
print(len(single_family_df.SubName.unique()))
print(len(sub_df.Subdivision.unique()))

2179
3995


In [None]:

correct_subdivision = sub_df.Subdivision.unique()

# For each reference subdivision name
for sub in correct_subdivision:
    print(sub)
    # Score every house-level subdivision name against it
    matches = process.extract(sub, single_family_df['SubName'],
                 limit = single_family_df.shape[0])
    
    print(matches[:5])
# # For each possible_match with similarity score >= 90
#     for possible_match in matches:
#         if possible_match[1] >= 90:
#       # Find rows whose subdivision name equals the matched string
#         matching_sub = single_family_df['SubName'] == possible_match[0]
#         single_family_df.loc[matching_sub , 'SubName'] = sub

# Print unique values to confirm mapping
# ('Subdivision:' was dropped above, so the cleaned names live in 'SubName')
print( single_family_df['SubName'].unique())  
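
The commented-out loop above describes a threshold-based name mapping. A minimal sketch of that idea follows, using the standard library's `difflib` as a stand-in for `fuzzywuzzy`, so scores are on a 0 to 1 scale rather than 0 to 100 and will not match `process.extract` exactly. The helper name `map_to_reference`, the cutoff of 0.9, and the sample reference list are assumptions for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def map_to_reference(name, reference_names, cutoff=0.9):
    """Return the best-scoring reference name if it reaches the cutoff,
    otherwise return the original name unchanged."""
    best = max(reference_names, key=lambda ref: similarity(name, ref))
    return best if similarity(name, best) >= cutoff else name

# Hypothetical reference list and raw names
reference = ["Sunset Heights", "Shepherd Oaks", "Magnolia Grove"]
raw_names = ["SUNSET HEIGHTS", "Shepherd Oak", "Oaks of Lawndale"]

mapped = [map_to_reference(n, reference) for n in raw_names]
print(mapped)
```

Looping over the unique raw names and mapping each one once is usually much cheaper than scoring every row against every reference name.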

In [587]:
single_family_df.groupby('SubName')['Home For Sales:'].value_counts()

SubName                   Home For Sales:
1960/Cypress Creek North  143.0               9
1960/Cypress Creek South  218.0               8
                          218                 2
Aldine Area               387.0              19
Bear Creek South          327                52
                                             ..
Upper Kirby               89                  2
Waller                    310                40
Washington East/Sabine    82.0                6
Westchase Area            147                 1
Willow Meadows Area       92.0                1
Name: Home For Sales:, Length: 85, dtype: int64

In [588]:
single_family_df.groupby('Subdivision:')['Average List Price:'].value_counts().isna().sum()

KeyError: 'Subdivision:'

In [589]:
single_family_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11123 entries, 0 to 11140
Data columns (total 64 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   image_link                            11123 non-null  object 
 1   Listing Price:                        11120 non-null  object 
 2   Address:                              11123 non-null  object 
 3   City:                                 11123 non-null  object 
 4   Zip Code:                             11123 non-null  int64  
 5   County:                               11123 non-null  object 
 6   Legal Description:                    11022 non-null  object 
 7   Bedrooms:                             11090 non-null  object 
 8   Baths:                                11097 non-null  object 
 9   Stories:                              11119 non-null  object 
 10  Style:                                11123 non-null  object 
 11  Year Built:    

## 2.12 Listing Price<a id='2.12_Listing_Price'></a>

In [590]:
single_family_df['Listing Price:'].isna().sum()

3

In [591]:
single_family_df=single_family_df[~single_family_df['Listing Price:'].isnull()]

In [592]:
single_family_df['Listing Price:']=single_family_df['Listing Price:'].str.split(' ').str[1]
single_family_df['Listing Price:']=single_family_df['Listing Price:'].str.replace(',','')
single_family_df['Listing Price:']=pd.to_numeric(single_family_df['Listing Price:'])
single_family_df.rename(columns = {'Listing Price:':'listing_price','Address:':'address', 'Zip Code:':'zip_code', 'County:':'county',
                                 'Subdivision:':'sub', 'Legal Description:':'legal'},inplace=True)
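
The split-and-replace steps above depend on the price always being the second space-separated token. A slightly more tolerant parser can be sketched with a regular expression; the function name `parse_price` and the sample strings are hypothetical, and the `'$ 346,850'` layout is only inferred from the `split(' ')` call above.

```python
import re

# First run of digits, allowing thousands separators and an optional decimal part
PRICE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def parse_price(raw):
    """Convert a price string such as '$ 346,850' to a float.
    Returns None for non-strings or strings without digits."""
    if not isinstance(raw, str):
        return None
    match = PRICE.search(raw)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))

print(parse_price("$ 346,850"))
```

Applied with `Series.apply`, this would leave unparseable entries as missing values instead of raising.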

In [593]:
single_family_df['listing_price'].describe()

count    1.112000e+04
mean     5.103105e+05
std      6.177403e+05
min      1.000000e+00
25%      2.399900e+05
50%      3.468495e+05
75%      5.399225e+05
max      1.450000e+07
Name: listing_price, dtype: float64

## 2.13 Bedrooms<a id='2.8_Bedrooms'></a>

In [598]:
single_family_df[['Bedrooms:','Bedroom:','Primary Bedroom:']]

Unnamed: 0,Bedrooms:,Bedroom:,Primary Bedroom:
0,3 Bedroom(s),"['13x10, 1st', '15x11, 2nd', '3.96 x 3.05(m)', '4.57 x 3.35(m)']","['19x16, 3rd', '5.79 x 4.88(m)']"
1,3 Bedroom(s),"['13x11, 3rd', '13x10, 1st', '3.96 x 3.35(m)', '3.96 x 3.05(m)']","['18x12, 3rd', '5.49 x 3.66(m)']"
2,3 Bedroom(s),"['13 x 11, 1st', '13 x 11, 3rd', '13 , 11, 1st', '13 , 11, 3rd']","['19 x 13, 3rd', '19 , 13, 3rd']"
3,3 Bedroom(s),"['15 x 12, 1st', '9 x 12, 1st', '13 x 13, 1st', '15 , 12, 1st', '9 , 12, 1st', '13 , 13, 1st']",
4,3 Bedroom(s),"['13X10, 1st', '10X10, 1st', '13X10, 1st', '3.96 x 3.05(m)', '3.05 x 3.05(m)', '3.96 x 3.05(m)']",
...,...,...,...
11136,3 Bedroom(s),"['13 x 11, 1st', '13 x 10, 1st', '13 , 11, 1st', '13 , 10, 1st']","['11 x 16, 1st', '11 , 16, 1st']"
11137,3 Bedroom(s),"['10x8, 1st', '10x8, 1st', '3.05 x 2.44(m)', '3.05 x 2.44(m)']","['12x14, 1st', '3.66 x 4.27(m)']"
11138,3 Bedroom(s),"['12x11, 1st', '12x11, 1st', '3.66 x 3.35(m)', '3.66 x 3.35(m)']","['13x15, 1st', '3.96 x 4.57(m)']"
11139,3 Bedroom(s),"['12x12, 1st', '12x12, 1st', '3.66 x 3.66(m)', '3.66 x 3.66(m)']","['13x15, 1st', '3.96 x 4.57(m)']"


In [599]:
single_family_df[['Bedrooms:','Bedroom:','Primary Bedroom:']].isna().sum()

Bedrooms:            33
Bedroom:             99
Primary Bedroom:    672
dtype: int64

In [600]:
single_family_df=single_family_df[~single_family_df['Primary Bedroom:'].isnull()]

In [601]:
single_family_df[['Bedrooms:','Bedroom:','Primary Bedroom:']].isna().sum()

Bedrooms:            0
Bedroom:            27
Primary Bedroom:     0
dtype: int64

In [602]:
single_family_df=single_family_df[~single_family_df['Bedroom:'].isnull()]

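The `Bedroom:` feature lists the dimensions of the non-primary bedrooms, and `Primary Bedroom:` holds the dimensions of the primary bedroom. I am using the same `area_calc` function as before to calculate the total bedroom area.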

In [603]:
single_family_df['Primary_Bedroom_clean']=single_family_df['Primary Bedroom:'].str.split(',')
single_family_df['TotalBedSqft'] = single_family_df['Bedroom:'].str.split(',')
single_family_df['TotalBedSqft'] = single_family_df['TotalBedSqft'].apply(area_calc) + single_family_df['Primary_Bedroom_clean'].apply(area_calc)
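
`area_calc` is defined earlier in the notebook and is not shown in this section. As a rough illustration of the kind of computation involved, the sketch below sums width times length over the imperial entries and skips the metric duplicates marked `(m)`. The name `room_area_sqft` and its parsing rules are assumptions, not the notebook's actual implementation.

```python
import re

# Matches dimensions such as '13x10', '13X10', '13 x 11' or '11.5x11'
DIMENSION = re.compile(r"(\d+(?:\.\d+)?)\s*[xX]\s*(\d+(?:\.\d+)?)")

def room_area_sqft(raw):
    """Sum the areas (in square feet) of the rooms listed in a string like
    "['13X10, 3rd', '12X11, 1st', '3.96 x 3.05(m)']"."""
    if not isinstance(raw, str):
        return 0.0
    total = 0.0
    # Each room is a quoted entry inside the stringified list
    for entry in re.findall(r"'([^']*)'", raw):
        if "(m)" in entry:
            continue  # metric duplicate of an entry already counted
        match = DIMENSION.search(entry)
        if match:
            total += float(match.group(1)) * float(match.group(2))
    return total

bedrooms = "['13X10, 3rd', '12X11, 1st', '3.96 x 3.05(m)', '3.66 x 3.35(m)']"
primary = "['14X13, 3rd', '4.27 x 3.96(m)']"
total = room_area_sqft(bedrooms) + room_area_sqft(primary)
print(total)
```

Entries written as `'13 , 11, 1st'` contain no `x` separator and are therefore ignored, which avoids counting those duplicates twice.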

In [604]:
pd.options.display.max_colwidth = 100
single_family_df[['Bedrooms:','Bedroom:','Primary Bedroom:','TotalBedSqft']].sample(20,random_state=100)

Unnamed: 0,Bedrooms:,Bedroom:,Primary Bedroom:,TotalBedSqft
376,5 Bedroom(s),"['13x12, 2nd', '13x12, 2nd', '13x11, 2nd', '13x12, 2nd', '3.96 x 3.66(m)', '3.96 x 3.66(m)', '3....","['18x14, 2nd', '5.49 x 4.27(m)']",863.0
1720,3 Bedroom(s),"['13X10, 3rd', '12X11, 1st', '3.96 x 3.05(m)', '3.66 x 3.35(m)']","['14X13, 3rd', '4.27 x 3.96(m)']",444.0
9279,3 Bedroom(s),"['11x12, 2nd', '10x12, 2nd', '3.35 x 3.66(m)', '3.05 x 3.66(m)']","['15x15, 1st', '4.57 x 4.57(m)']",477.0
10427,4 Bedroom(s),"['12x10, 1st', '11x10, 1st', '12x10, 1st', '3.66 x 3.05(m)', '3.35 x 3.05(m)', '3.66 x 3.05(m)']","['17x13, 1st', '5.18 x 3.96(m)']",571.0
2437,3 Bedroom(s),"['10x10, 1st', '10x10, 1st', '3.05 x 3.05(m)', '3.05 x 3.05(m)']","['12x12, 1st', '3.66 x 3.66(m)']",344.0
4269,3 Bedroom(s),"['12x13, 1st', '12x13, 1st', '3.66 x 3.96(m)', '3.66 x 3.96(m)']","['19x14, 1st', '5.79 x 4.27(m)']",578.0
4514,3 Bedroom(s),"['11x10, 1st', '11.5x11, 1st', '3.35 x 3.05(m)', '3.51 x 3.35(m)']","['16.5x13, 2nd', '5.03 x 3.96(m)']",451.0
6457,3 Bedroom(s),"['10x11, 1st', '11x11, 1st', '3.05 x 3.35(m)', '3.35 x 3.35(m)']","['14x12, 1st', '4.27 x 3.66(m)']",399.0
6217,4 Bedroom(s),"['11x14, 1st', '11x14, 1st', '15x11, 1st', '3.35 x 4.27(m)', '3.35 x 4.27(m)', '4.57 x 3.35(m)']","['15x11, 1st', '4.57 x 3.35(m)']",638.0
5558,4 Bedroom(s),"['16x11, 1st', '11x10, 1st', '12x11, 1st', '4.88 x 3.35(m)', '3.35 x 3.05(m)', '3.66 x 3.35(m)']","['18x12, 1st', '5.49 x 3.66(m)']",634.0


In [606]:
single_family_df.drop(['Bedroom:','Primary Bedroom:','Primary_Bedroom_clean'],axis=1,inplace=True)

In [608]:
single_family_df['Bedrooms:']=single_family_df['Bedrooms:'].str.split(' ').str[0]
single_family_df['Bedrooms:']=single_family_df['Bedrooms:'].astype(int)
single_family_df.rename(columns = {'Bedrooms:':'NoBed'},inplace=True)
single_family_df['NoBed'].describe()

count    10421.000000
mean         3.680357
std          0.775109
min          2.000000
25%          3.000000
50%          4.000000
75%          4.000000
max         10.000000
Name: NoBed, dtype: float64

## 2.14 Bathrooms<a id='2.9_Bathrooms'></a>

In [609]:
single_family_df['Baths:'].isnull().sum()

0

In [610]:
single_family_df[['Baths:']]

Unnamed: 0,Baths:
0,3 Full & 1 Half Bath(s)
1,3 Full & 1 Half Bath(s)
2,3 Full & 1 Half Bath(s)
5,3 Full & 1 Half Bath(s)
6,3 Full & 1 Half Bath(s)
...,...
11134,2 Full Bath(s)
11136,2 Full Bath(s)
11137,1 Full & 1 Half Bath(s)
11138,1 Full & 1 Half Bath(s)


In [611]:
single_family_df['full_bath']=single_family_df['Baths:'].str.split(' ').str[0].astype(int)

In [612]:
single_family_df['full_bath']

0        3
1        3
2        3
5        3
6        3
        ..
11134    2
11136    2
11137    1
11138    1
11139    2
Name: full_bath, Length: 10421, dtype: int32

In [613]:
# Text after '&' holds the half-bath count, e.g. '1 Half Bath(s)'
No_Bath = single_family_df['Baths:'].str.split('&').str[1].str.strip()
# Listings without an '&' part have no half baths
No_Bath = No_Bath.fillna('0')
single_family_df['half_bath']=[int(item[0]) for item in No_Bath.str.split(' ')]
single_family_df[['full_bath','half_bath']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10421 entries, 0 to 11139
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   full_bath  10421 non-null  int32
 1   half_bath  10421 non-null  int64
dtypes: int32(1), int64(1)
memory usage: 203.5 KB
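
The two parsing steps above can also be expressed as a single regular-expression helper. This is a sketch only; the name `parse_baths` is hypothetical, and the pattern assumes the `'N Full & M Half Bath(s)'` wording seen in the output above.

```python
import re

FULL = re.compile(r"(\d+)\s+Full")
HALF = re.compile(r"(\d+)\s+Half")

def parse_baths(raw):
    """Return (full, half) bath counts from a string like
    '3 Full & 1 Half Bath(s)'. A missing part counts as 0."""
    full = FULL.search(raw)
    half = HALF.search(raw)
    return (int(full.group(1)) if full else 0,
            int(half.group(1)) if half else 0)

print(parse_baths("3 Full & 1 Half Bath(s)"))
```

Matching on the words `Full` and `Half` means the result does not depend on token positions.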


In [614]:
single_family_df.drop('Baths:',axis=1,inplace=True)

In [615]:
single_family_df[['full_bath','half_bath']].tail()

Unnamed: 0,full_bath,half_bath
11134,2,0
11136,2,0
11137,1,1
11138,1,1
11139,2,0


## 2.15 Stories<a id='2.15_Stories'></a>

In [616]:
single_family_df['Stories:'].value_counts()

2       5178
1       4003
3        849
1.5      213
4        153
2.5       21
5          2
2576       1
Name: Stories:, dtype: int64

In [617]:
single_family_df['Stories:'].isnull().sum()

1

In [618]:
single_family_df=single_family_df[~single_family_df['Stories:'].isnull()]

In [619]:
single_family_df.rename(columns ={'Stories:':'stories'},inplace=True)
single_family_df['stories']=pd.to_numeric(single_family_df['stories'])
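
The `value_counts` output above contains one listing with 2576 stories, which is almost certainly a data-entry error, and `pd.to_numeric` keeps it unchanged. One possible way to flag such rows is sketched below on made-up data; the upper bound of 5 is an assumption based on the other values observed in the column.

```python
import pandas as pd

# Hypothetical sample mirroring the kinds of values seen in 'Stories:'
toy = pd.DataFrame({"stories": ["2", "1", "1.5", "2576", "3"]})
toy["stories"] = pd.to_numeric(toy["stories"])

MAX_STORIES = 5  # assumed upper bound for a single-family home

# between() is inclusive on both ends by default
plausible = toy["stories"].between(1, MAX_STORIES)
flagged = toy.loc[~plausible, "stories"].tolist()
cleaned = toy[plausible]
print(flagged, len(cleaned))
```

Whether flagged rows should be dropped or corrected by hand is a judgment call that depends on how many there are.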

## 2.16 Style<a id='2.16_Style'></a>

In [620]:
single_family_df['Style:'].value_counts()

Traditional                              7159
Contemporary/Modern                      1185
Ranch                                     366
Contemporary/Modern,Traditional           354
Other Style                               215
                                         ... 
Other Style,Split Level                     1
Contemporary/Modern,Ranch,Split Level       1
Colonial,Georgian,Traditional               1
English,Georgian,Traditional                1
Contemporary/Modern,English,French          1
Name: Style:, Length: 84, dtype: int64

In [621]:
single_family_df['Style:'].isnull().sum()

0

In [622]:
single_family_df.rename(columns ={'Style:':'style'},inplace=True)

## 2.17 Year Built<a id='2.17_Year_Built'></a>

In [623]:
single_family_df['Year Built:'].isnull().sum()

58

In [624]:
single_family_df['Year Built:'].value_counts()

2020   / Builder               2436
2006   / Appraisal District     196
2005   / Appraisal District     189
2014   / Appraisal District     184
2015   / Appraisal District     182
                               ... 
1989   / Appraisal                1
1984   / Seller                   1
2021   / Seller                   1
1921   / Appraisal                1
1970   / Seller                   1
Name: Year Built:, Length: 294, dtype: int64

In [625]:
single_family_df=single_family_df[~single_family_df['Year Built:'].isnull()]
single_family_df['Year Built:']=single_family_df['Year Built:'].apply(lambda x:str(x).split(' ')[0])
single_family_df['Year Built:']=pd.to_datetime(single_family_df['Year Built:'],format='%Y').dt.year
single_family_df['Year Built:'].value_counts()

2020    2515
2019     259
2015     237
2014     234
2006     224
        ... 
1921       1
1875       1
1916       1
1934       1
1880       1
Name: Year Built:, Length: 113, dtype: int64
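
The year extraction above takes the first space-separated token of each entry. An alternative sketch that looks for a four-digit year anywhere in the string is shown below; the name `parse_year` and the accepted range of 1800 to 2099 are assumptions.

```python
import re

# Four-digit year between 1800 and 2099 (assumed plausible range)
YEAR = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")

def parse_year(raw):
    """Extract the build year from a string like '2020   / Builder'.
    Returns None when no plausible year is found."""
    if not isinstance(raw, str):
        return None
    match = YEAR.search(raw)
    return int(match.group(1)) if match else None

print(parse_year("2020   / Builder"))
```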

In [626]:
single_family_df.rename(columns ={'Year Built:':'year_built'},inplace=True)

In [49]:
single_family_df['Building Sqft.:']

0        2,096195(m²)  /Appraisal District
1        2,015187(m²)  /Appraisal District
2        2,468229(m²)  /Appraisal District
3                   2,878267(m²)  /Builder
4                   2,073193(m²)  /Builder
                       ...                
11146    1,630151(m²)  /Appraisal District
11148    1,316122(m²)  /Appraisal District
11149    1,160108(m²)  /Appraisal District
11150    1,135105(m²)  /Appraisal District
11151       83978(m²)  /Appraisal District
Name: Building Sqft.:, Length: 11044, dtype: object

In [50]:
single_family_df['Building Sqft.:'].isnull().sum()

18

In [51]:
single_family_df=single_family_df[~single_family_df['Building Sqft.:'].isnull()]
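# Keep only the leading square-foot figure; the pattern also handles values of 10,000 sq ft and above
single_family_df['Building Sqft.:']=single_family_df['Building Sqft.:'].str.extract(r'^(\d{1,3}(?:,\d{3})+|\d{1,3})',expand=False)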
single_family_df['Building Sqft.:']=single_family_df['Building Sqft.:'].str.replace(',','')
single_family_df['Building Sqft.:']=pd.to_numeric(single_family_df['Building Sqft.:'])
single_family_df.rename(columns ={'Building Sqft.:':'build_Sq'},inplace=True)
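
Because the raw strings fuse the square-foot figure with the square-metre figure (`2,096195(m²)` is 2,096 sq ft followed by 195 m²), it can help to sanity-check a parse against the known conversion factor. The sketch below is one way to do that; the function names and the 2% tolerance are assumptions, and the last sample string is made up to show a five-digit area.

```python
import re

SQFT_PER_M2 = 10.7639  # square feet in one square metre

# Group 1: square feet (with or without thousands separators)
# Group 2: the square-metre digits that follow immediately
FUSED = re.compile(r"^(\d{1,3}(?:,\d{3})+|\d{1,3})(\d+)\(m²\)")

def parse_building_sqft(raw):
    """Split a fused string such as '2,096195(m²)  /Appraisal District'
    into (sqft, m2). Returns None if the pattern does not match."""
    match = FUSED.match(raw)
    if match is None:
        return None
    sqft = int(match.group(1).replace(",", ""))
    m2 = int(match.group(2))
    return sqft, m2

def is_consistent(sqft, m2, tolerance=0.02):
    """True if the two figures agree up to rounding."""
    return abs(sqft / SQFT_PER_M2 - m2) <= max(1.0, tolerance * m2)

samples = [
    "2,096195(m²)  /Appraisal District",
    "83978(m²)  /Appraisal District",
    "10,250952(m²)  /Builder",  # hypothetical large home
]
parsed = [parse_building_sqft(s) for s in samples]
print(parsed)
```

Rows where `is_consistent` returns `False` would point to a mis-split and could be inspected by hand.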

In [52]:
single_family_df.year_built[single_family_df['Lot Size:'].isnull()].value_counts()

2020    857
2019     34
2021     18
2013      6
2018      5
2016      5
2017      3
2015      3
1955      2
1965      2
1967      2
1975      2
1978      2
1979      2
1982      2
2009      2
2007      2
2000      1
1994      1
1930      1
1935      1
1945      1
1950      1
1952      1
1962      1
2008      1
1968      1
1969      1
1973      1
2006      1
1976      1
1977      1
2005      1
2004      1
1980      1
2002      1
1985      1
1990      1
1920      1
Name: year_built, dtype: int64

Since 857 of the rows with a null lot size are 2020 builds, which are presumably still under construction, I will drop all rows with a null `Lot Size:`.

In [53]:
single_family_df=single_family_df[~single_family_df['Lot Size:'].isnull()]

In [54]:
single_family_df['Lot Size:']

0        2,173 Sqft.202(m²)  /Appraisal District
1        1,446 Sqft.134(m²)  /Appraisal District
2        1,786 Sqft.166(m²)  /Appraisal District
4        1,788 Sqft.166(m²)  /Appraisal District
9        2,460 Sqft.229(m²)  /Appraisal District
                          ...                   
11146    7,975 Sqft.741(m²)  /Appraisal District
11148    7,100 Sqft.660(m²)  /Appraisal District
11149    7,100 Sqft.660(m²)  /Appraisal District
11150    7,100 Sqft.660(m²)  /Appraisal District
11151    7,100 Sqft.660(m²)  /Appraisal District
Name: Lot Size:, Length: 10055, dtype: object

In [55]:
single_family_df['Lot Size:']=single_family_df['Lot Size:'].str.replace(',','')
single_family_df['Lot Size:']=single_family_df['Lot Size:'].apply(lambda x:float(x.split(' ')[0])*43560 if 'Acres' in 
                                                                  x else float(x.split(' ')[0]))

# single_family_df['Lot Size:']=pd.to_numeric(single_family_df['Lot Size:'])
single_family_df.rename(columns ={'Lot Size:':'lot_size'},inplace=True)
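
The unit handling above can be sketched as a small standalone function. The name `parse_lot_size` is hypothetical, and the `'1.5 Acres'` sample is an assumed format inferred from the `'Acres' in x` check, since no acre-based rows appear in the output shown.

```python
import re

SQFT_PER_ACRE = 43560

# Leading number with optional thousands separators and decimal part
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")

def parse_lot_size(raw):
    """Return the lot size in square feet from strings such as
    '2,173 Sqft.202(m²)  /Appraisal District' or '1.5 Acres'."""
    match = NUMBER.match(raw.strip())
    if match is None:
        return None
    value = float(match.group(0).replace(",", ""))
    # Convert acres to square feet; square-foot values pass through
    return value * SQFT_PER_ACRE if "Acres" in raw else value

print(parse_lot_size("2,173 Sqft.202(m²)  /Appraisal District"))
```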

In [56]:
single_family_df.lot_size.isnull().sum()

0

In [57]:
single_family_df['Maintenance Fee:']

0        $ 1304 / Annually
1        $ 1200 / Annually
2        $ 2000 / Annually
4        $ 1200 / Annually
9                       No
               ...        
11146                   No
11148                   No
11149                   No
11150                   No
11151                   No
Name: Maintenance Fee:, Length: 10055, dtype: object

In [58]:
single_family_df['Maintenance Fee:'].isnull().sum()

28

In [59]:
single_family_df['Maintenance Fee:'].value_counts()

No                   2973
$ 1200 / Annually     170
$ 450 / Annually      163
$ 600 / Annually      155
$ 400 / Annually      154
                     ... 
$ 257 / Annually        1
$ 190 / Monthly         1
$ 110 / Quarterly       1
$ 2811 / Annually       1
No / Monthly            1
Name: Maintenance Fee:, Length: 929, dtype: int64

In [60]:
single_family_df['Maintenance Fee:'].isin(['No','No / $0','Voluntary / Annually','Voluntary /0/ Annually']).sum()

3091

In [61]:
single_family_df.drop('Maintenance Fee:',axis=1,inplace=True)

In [62]:
single_family_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10055 entries, 0 to 11151
Data columns (total 66 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   image_link                  10055 non-null  object 
 1   listing_price               10055 non-null  int64  
 2   City:                       10055 non-null  object 
 3   zip_code                    10055 non-null  int64  
 4   county                      10055 non-null  object 
 5   sub                         10050 non-null  object 
 6   legal                       10038 non-null  object 
 7   bedrooms                    10055 non-null  int32  
 8   stories                     10055 non-null  float64
 9   style                       10055 non-null  object 
 10  year_built                  10055 non-null  int64  
 11  build_Sq                    10055 non-null  int64  
 12  lot_size                    10055 non-null  float64
 13  Living:                     601

In [63]:
single_family_df['Living:'].isnull().sum()

4043

In [64]:
new_missing=missing_cal(single_family_df)

In [65]:
new_missing

Unnamed: 0,count,%
image_link,0,0.000000
listing_price,0,0.000000
City:,0,0.000000
zip_code,0,0.000000
county,0,0.000000
...,...,...
Neighborhood Value Range:,1198,11.914470
Median Price / Square ft.:,2086,20.745898
full_bath,0,0.000000
half_bath,0,0.000000


In [66]:
new_missing[new_missing['%']>0].sort_values("%")

Unnamed: 0,count,%
sub,5,0.049727
Average Baths:,15,0.14918
legal,17,0.16907
HOA Mandatory:,35,0.348086
Average Bedrooms:,47,0.467429
Primary Bedroom:,617,6.136251
Tax Rate:,854,8.493287
Median Lot Square Ft.:,1198,11.91447
Neighborhood Value Range:,1198,11.91447
Subdivision Name:,1198,11.91447
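
`missing_cal` is defined earlier in the notebook and is not shown in this section. Judging only from the `count` and `%` columns in its output, an equivalent helper might look like the sketch below; the name `missing_summary` and the toy frame are assumptions.

```python
import pandas as pd

def missing_summary(df):
    """Per-column null count and null percentage, indexed by column name."""
    counts = df.isnull().sum()
    percents = 100 * df.isnull().mean()
    return pd.DataFrame({"count": counts, "%": percents})

# Hypothetical toy frame: column 'a' is half missing, column 'b' is complete
toy = pd.DataFrame({"a": [1, None, 3, None], "b": ["x", "y", "z", "w"]})
summary = missing_summary(toy)
print(summary[summary["%"] > 0].sort_values("%"))
```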


In [67]:
single_family_df['Disposal:'].value_counts()

Yes    8153
No      309
Name: Disposal:, dtype: int64

In [69]:
single_family_df['Fireplace:'].value_counts()

1/Gaslog Fireplace                                 1783
1                                                  1087
1/Gas Connections                                   878
1/Wood Burning Fireplace                            542
1/Gas Connections, Gaslog Fireplace                 445
                                                   ... 
4/Mock Fireplace                                      1
/Freestanding, Wood Burning Fireplace                 1
1/Gas Connections, Stove                              1
2/Stove, Wood Burning Fireplace                       1
1/Freestanding, Gas Connections, Mock Fireplace       1
Name: Fireplace:, Length: 92, dtype: int64

In [70]:
pd.Series([str(x)[0]  for x in single_family_df['Fireplace:'] if x is not None]).value_counts()

1    5605
n    3605
2     573
3     130
/      93
4      37
5      10
7       1
6       1
dtype: int64

In [71]:
single_family_df['Fireplace:']=single_family_df['Fireplace:'].apply(lambda x:int(str(x)[0]) if str(x)[0]
                                                                    in ['1','2','3','4','5','6','7'] else 0)

In [72]:
single_family_df['Fireplace:'].value_counts()

1    5605
0    3698
2     573
3     130
4      37
5      10
7       1
6       1
Name: Fireplace:, dtype: int64
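
Reading only the first character works here because every observed count is a single digit. A sketch that extracts the full leading number and treats everything else as zero is shown below; the name `fireplace_count` is hypothetical.

```python
import re

LEADING_COUNT = re.compile(r"^\s*(\d+)")

def fireplace_count(raw):
    """Return the leading integer of a string such as '1/Gaslog Fireplace'.
    Missing values and strings without a leading number give 0."""
    if not isinstance(raw, str):
        return 0
    match = LEADING_COUNT.match(raw)
    return int(match.group(1)) if match else 0

print(fireplace_count("2/Stove, Wood Burning Fireplace"))
```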

In [73]:
single_family_df['Median Price / Square ft.:'].value_counts()

$303.75     108
$259.49      95
$146.71      82
$127.37      77
$139.32      60
           ... 
$126.72       1
$132.07       1
$100.69       1
$94.82        1
$118.87       1
Name: Median Price / Square ft.:, Length: 1369, dtype: int64

In [74]:
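# regex=False so that '$' is removed as a literal character, not read as the end-of-string anchor
single_family_df['Median Price / Square ft.:']=pd.to_numeric(single_family_df['Median Price / Square ft.:'].str.replace('$','',regex=False).str.strip())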

In [75]:
single_family_df['Subdivision Name:']

0               Modern Midtown
1               Modern Midtown
2                          NaN
4             ELITE TWNHMS LLC
9                          NaN
                 ...          
11146    South Houston Terrace
11148            South Houston
11149            South Houston
11150            South Houston
11151            South Houston
Name: Subdivision Name:, Length: 10055, dtype: object

In [76]:
single_family_df['Average Bedrooms:']

0        3.00
1        3.00
2        2.11
4        3.00
9        2.55
         ... 
11146    2.93
11148    3.01
11149    3.01
11150    3.01
11151    3.01
Name: Average Bedrooms:, Length: 10055, dtype: float64

In [77]:
single_family_df.drop(['Living:','Kitchen Desc:','Dining:','Kitchen:','Interior:','Countertop:','Energy Feature:'
                      ,'Exterior:','Connections:','Oven:','Taxes w/o Exemp:','Range:'
                       ,'Floors:','Room Description:','Financing Considered:','Bathroom Description:'
                       ,'County / Zip Code:','Single Family Properties:','Bedroom Desc:','Subdivision Name:','Primary Bedroom:'],axis=1,inplace=True)

In [78]:
new_missing=missing_cal(single_family_df)
new_missing[new_missing['%']>0].sort_values("%")

Unnamed: 0,count,%
sub,5,0.049727
Average Baths:,15,0.14918
legal,17,0.16907
HOA Mandatory:,35,0.348086
Average Bedrooms:,47,0.467429
Tax Rate:,854,8.493287
Median Square Ft.:,1198,11.91447
Median Lot Square Ft.:,1198,11.91447
Median Year Built:,1198,11.91447
Median Appraised Value:,1198,11.91447
