## 2. Data wrangling<a id='2_Data_wrangling'></a>

## 2.1 Contents<a id='2.1_Contents'></a>
* [2 Data wrangling](#2_Data_wrangling)
  * [2.1 Contents](#2.1_Contents)
  * [2.2 Introduction](#2.2_Introduction)
  * [2.3 Imports](#2.3_Imports)
  * [2.4 Load The House Price Data](#2.4_Load_The_House_Price_Data)
  * [2.5 Filtering Single Family Property Type](#2.5_Filtering_Single_Family_Property_Type) 
  * [2.6 Missing Values](#2.6_Missing_Values) 
  * [2.7 Garage](#2.7_Garage) 
  * [2.8 Living](#2.8_Living) 

## 2.2 Introduction<a id='2.2_Introduction'></a>

In this section I investigate data scraped from www.HAR.com. Data cleaning is done at this stage because the features are stored as categorical (text) values and need to be converted to numeric ones. I will remove features with a lot of missing values and create new features.

## 2.3 Imports<a id='2.3_Imports'></a>

In [43]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import substring
import re
from fuzzywuzzy import process
import warnings
warnings.filterwarnings('ignore')

## 2.4 Load The House Price Data<a id='2.4_Load_The_House_Price_Data'></a>

In [2]:
data= pd.read_csv('../Prediction House Price Using Image Processing/Data/Houston_Home_List.csv',encoding = "ISO-8859-1")
print('data shape is:',data.shape)

data shape is: (15102, 101)


In [3]:
data.columns

Index(['Unnamed: 0', 'image_link', 'Listing Price:', 'Address:', 'City:',
       'State:', 'Zip Code:', 'County:', 'Subdivision:', 'Legal Description:',
       ...
       'Extra Room:', 'Wine Room:', 'Carport Description:',
       'Median Appraised Value / Square ft.:', 'Den:', 'Utility Room Desc:',
       'Sunroom:', 'Guest Suite:', 'Bath:', 'Garage Apartment:'],
      dtype='object', length=101)

## 2.5 Filtering Single Family Property Type<a id='2.5_Filtering_Single_Family_Property_Type'></a>

Since we are going to analyze images together with other house features, the records should be as similar to one another as possible. For example, lots have no images of a building or rooms, and the features of multi-family properties differ from those of single-family homes. Let's see what property types the dataset contains:

In [4]:
data['Property Type:'].value_counts()

Single-Family                          11141
Lots                                    1551
Townhouse/Condo - Townhouse              950
Townhouse/Condo - Condominium            594
Mid/Hi-Rise Condo                        436
Country Homes/Acreage                    154
Multi-Family - Duplex                    107
Country Homes/Acreage - Free Standi       46
Multi-Family - Fourplex                   46
Multi-Family - 5 Plus                     38
Multi-Family - Triplex                    15
Multi-Family - Multiple Detached Dw        9
Country Homes/Acreage - Manufacture        4
Lot & Acreage - Residential                3
Residential - Condo                        2
Residential - Townhouse                    1
Single Family                              1
Name: Property Type:, dtype: int64

The majority of properties are single-family homes, so I keep them and remove the other types.

In [5]:
# Keep only single-family homes; reset_index returns a new frame, so later in-place edits do not touch `data`
single_family_df = data[data['Property Type:']=='Single-Family'].reset_index(drop=True)
len(single_family_df)

11141
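The counts above include one record labelled `Single Family` (without the hyphen), which an exact match on `'Single-Family'` leaves out. The following is a minimal sketch of a more tolerant filter, assuming toy data and a hypothetical helper `is_single_family` that is not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for the 'Property Type:' column (labels taken from the counts above)
toy = pd.DataFrame({
    "Property Type:": ["Single-Family", "Single Family", "Lots",
                       "Multi-Family - Duplex", None],
})

def is_single_family(labels: pd.Series) -> pd.Series:
    """Hypothetical helper: True where the label reads 'single family',
    ignoring case, hyphens and whitespace."""
    normalized = (labels.fillna("")
                        .str.lower()
                        .str.replace(r"[\s\-]+", "", regex=True))
    return normalized == "singlefamily"

mask = is_single_family(toy["Property Type:"])
toy_single = toy[mask].reset_index(drop=True)
print(len(toy_single))
```

Whether that single extra record is worth keeping is a judgment call; the sketch only shows how the two spellings could be treated as one category.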

In [6]:
single_family_df.head()

Unnamed: 0.1,Unnamed: 0,image_link,Listing Price:,Address:,City:,State:,Zip Code:,County:,Subdivision:,Legal Description:,...,Extra Room:,Wine Room:,Carport Description:,Median Appraised Value / Square ft.:,Den:,Utility Room Desc:,Sunroom:,Guest Suite:,Bath:,Garage Apartment:
0,85,['https://photos.harstatic.com/190618667/hr/im...,"$ 575,000 ($232.98/sqft.) $Convert",1316 Hadley Street,Houston,TX,77002,Harris County,Austin Hadley Place,LT 4 BLK 1 AUSTIN HADLEY PLACE,...,,,,,,,,,,
1,88,['https://photos.harstatic.com/190420550/hr/im...,"$ 465,000 ($221.85/sqft.) $Convert",110 Pierce Street,Houston,TX,77002,Harris County,Modern Midtown (View subdivision price trend),LT 12 BLK 1 MODERN MIDTOWN,...,,,,$223.83,,,,,,
2,89,['https://photos.harstatic.com/190088153/hr/im...,"$ 450,000 ($223.33/sqft.) $Convert",118 Pierce Street,Houston,TX,77002,Harris County,Modern Midtown (View subdivision price trend),LT 8 BLK 1 MODERN MIDTOWN,...,,,,$223.83,,,,,,
3,99,['https://photos.harstatic.com/189387790/hr/im...,"$ 259,000 ($203.30/sqft.) $Convert",311 N Milby Street,Houston,TX,77003,Harris County,Merkels Sec 01 (View subdivision price trend),LT 3 BLK 15 MERKELS SEC 1,...,,,,,"['12 x 17, 1st', '12 , 17, 1st']","['12 x 7, 1st', '12 , 7, 1st']",,,,
4,108,['https://photos.harstatic.com/177650081/hr/im...,"$ 236,999 ($196.19/sqft.) $Convert \n\n\n Red...",216 Hutcheson,Houston,TX,77003,Harris County,MERKELS (View subdivision price trend),LT 9 BLK 5 MERKELS SEC 1,...,,,,,,,,,,


In our dataset `State` and `Property Type` are the same for all houses, so we can remove them together with the leftover index column `Unnamed: 0`:

In [7]:
single_family_df.drop(['Unnamed: 0','State:','Property Type:'],axis=1,inplace=True)

## 2.6 Missing Values<a id='2.6_Missing_Values'></a>

In [8]:
# Return the count and percentage of missing values for each column of df
def missing_cal(df):
    missing = pd.concat([df.isnull().sum(), 100 * df.isnull().mean()], axis=1)
    missing.columns = ['count', '%']
    # Rows stay in column order; callers sort the result where needed
    return missing

In [9]:
missing = missing_cal(single_family_df)
missing

Unnamed: 0,count,%
image_link,0,0.000000
Listing Price:,3,0.026928
Address:,0,0.000000
City:,0,0.000000
Zip Code:,0,0.000000
...,...,...
Utility Room Desc:,7178,64.428687
Sunroom:,10909,97.917602
Guest Suite:,11008,98.806211
Bath:,9449,84.812853


Let's take a look at features with more than 90% missing values:

In [10]:
missing = missing_cal(single_family_df)
nan_90 = missing.loc[missing['%']>90].index
print('Number of Features with more than 90% None: ',len(nan_90))

Number of Features with more than 90% None:  9


In [11]:
missing.loc[nan_90].sort_values(by="%")

Unnamed: 0,count,%
Extra Room:,10068,90.368908
Median Appraised Value / Square ft.:,10217,91.70631
Media Room:,10254,92.038417
Carport Description:,10523,94.452922
Water Amenity:,10747,96.463513
Garage Apartment:,10822,97.136702
Sunroom:,10909,97.917602
Wine Room:,11002,98.752356
Guest Suite:,11008,98.806211


We need to see what kind of information each of these features contains:

In [12]:
for item in nan_90:
    print('Value Count for '+item)
    print(single_family_df[item].value_counts())
    print('-'*100)

Value Count for Media Room:
['2nd', '2nd']                      27
['16x12, 2nd', '4.88 x 3.66(m)']    17
['15x13, 2nd', '4.57 x 3.96(m)']    15
['14x13, 2nd', '4.27 x 3.96(m)']    15
['18x12, 2nd', '5.49 x 3.66(m)']    13
                                    ..
['20X12, 2nd', '6.10 x 3.66(m)']     1
['12*15, 2nd', '12*15, 2nd']         1
['17X15, 1st', '5.18 x 4.57(m)']     1
['15X11, 1st', '4.57 x 3.35(m)']     1
['22X10, 2nd', '6.71 x 3.05(m)']     1
Name: Media Room:, Length: 454, dtype: int64
----------------------------------------------------------------------------------------------------
Value Count for Water Amenity:
Lake View                                                     119
Lake View, Lakefront                                           83
Pond                                                           55
Lakefront                                                      48
Bayou Frontage, Bayou View                                      9
Bayou View                          

* Values for `Media Room`, `Extra Room`, `Wine Room`, `Sunroom`, `Guest Suite` and `Garage Apartment` are room dimensions, mixed with uninformative values (such as `Yes` for `Garage Apartment`).
* `Water Amenity` has too many unique categories, and there is no way to fill the missing values with the correct category.
* `Carport Description` has 3 different categories covering 611 houses in total; the rest presumably have no carport, so I will fill all missing values with a new category, 'not applicable'.
* `Median Appraised Value / Square ft.:` is a subdivision-level statistic (based on active listings) and can be filled with the value from the same subdivision.
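Filling a subdivision-level statistic from other listings in the same subdivision could look roughly like the sketch below. It uses toy data, strips the `$` sign before converting to numbers, and fills with the per-subdivision median; the column names `value_num` and `value_filled` are hypothetical, and the notebook itself does not show this step.

```python
import numpy as np
import pandas as pd

col = "Median Appraised Value / Square ft.:"
toy = pd.DataFrame({
    "Subdivision:": ["Modern Midtown", "Modern Midtown", "Merkels", "Merkels", "Unknown Sub"],
    col: ["$223.83", None, None, "$150.00", None],
})

# Strip '$' and thousands separators, then convert to float
toy["value_num"] = pd.to_numeric(
    toy[col].str.replace(r"[$,]", "", regex=True), errors="coerce")

# Fill each missing value with the median of its own subdivision
sub_median = toy.groupby("Subdivision:")["value_num"].transform("median")
toy["value_filled"] = toy["value_num"].fillna(sub_median)
print(toy[["Subdivision:", "value_filled"]])
```

A subdivision with no known value at all (the last row here) stays missing and still needs a separate decision.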

In [13]:
# Replace missing values in 'Carport Description:' with 'not applicable'
single_family_df['Carport Description:'] = single_family_df['Carport Description:'].fillna('not applicable')

# Drop 'Media Room:', 'Water Amenity:', 'Extra Room:', 'Wine Room:', 'Sunroom:', 'Guest Suite:' and
# 'Garage Apartment:'; 'Median Appraised Value / Square ft.:' is kept so it can be filled per subdivision
single_family_df.drop(['Media Room:', 'Water Amenity:', 'Extra Room:', 'Wine Room:', 'Sunroom:', 'Guest Suite:', 
                       'Garage Apartment:'],axis=1,inplace=True)

The next step is to look at the features with more than 80% missing values:

In [14]:
missing = missing_cal(single_family_df)
nan_80 = missing.loc[missing['%']>80].index
print('Number of Features with more than 80% None: ',len(nan_80))

Number of Features with more than 80% None:  14


In [15]:
missing.loc[nan_80].sort_values(by="%")

Unnamed: 0,count,%
Average Square Ft.:,9412,84.480747
Average Price/Square Ft.:,9412,84.480747
Market Area Name:,9413,84.489723
Home For Sales:,9413,84.489723
Average List Price:,9413,84.489723
Home For Lease:,9413,84.489723
Average Lease:,9413,84.489723
Average Lease/Square Ft.:,9413,84.489723
Bath:,9449,84.812853
Den:,9486,85.14496


In [16]:
# Print the value counts for each feature with more than 80% missing values
for item in nan_80:
    print('Value Count for '+item)
    print(single_family_df[item].value_counts())
    print('-'*100)

Value Count for Private Pool Desc:
In Ground                                355
Gunite, In Ground                        344
Gunite, Heated, In Ground                234
Gunite                                   217
Heated, In Ground                         92
Gunite, Heated, In Ground, Salt Water     46
Gunite, Heated                            40
Above Ground                              28
Heated, In Ground, Salt Water             21
Gunite, In Ground, Salt Water             20
In Ground, Salt Water                     16
Gunite, Salt Water                        13
Gunite, Heated, Salt Water                10
Enclosed, Heated, In Ground                8
In Ground, Vinyl Lined                     5
Fiberglass, In Ground                      5
Heated                                     5
Enclosed, In Ground                        4
Fiberglass                                 4
Salt Water                                 4
Above Ground, In Ground                    2
Enclosed, Gunite, In

* `Controlled Access` categories are a mix of 'Automatic', 'Driveway', 'Manned' and 'Intercom', which makes me believe that the rest of the houses do not have any type of controlled access. Filling the missing values with 'no controlled access' seems reasonable.
* As with `Water Amenity`, `Private Pool Desc` has many categories. Counting each category within the `Private Pool:` groups shows that some houses without a private pool still have a pool description, which is probably a data-entry mistake, so I decided to drop this column.
* `Master Planned Community` and `Market Area Name` categories seem to duplicate the subdivision name and can be dropped.
* `Home For Sales`, `Average List Price`, `Average Square Ft.`, `Average Price/Square Ft.`, `Home For Lease`, `Average Lease` and `Average Lease/Square Ft.` are subdivision-level statistics (based on active listings) and can be filled with the value from the same subdivision.
* `Den` and `Bath` contain dimensions along with other values like '1th', which I think is a typo, so I decided to drop them.
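The inconsistency between `Private Pool:` and `Private Pool Desc:` can also be summarized with a cross-tabulation. This is a small sketch on toy data; the variable names `has_desc` and `conflicts` are illustrative and not from the original notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    "Private Pool:": ["Yes", "Yes", "No", "No", "No"],
    "Private Pool Desc:": ["In Ground", None, "In Ground", None, None],
})

# True where a pool description is present
has_desc = toy["Private Pool Desc:"].notnull().rename("has_description")

# Rows: pool flag; columns: whether a description exists
table = pd.crosstab(toy["Private Pool:"], has_desc)
print(table)

# Houses flagged 'No' that still carry a description
conflicts = int(((toy["Private Pool:"] == "No") & has_desc).sum())
print(conflicts)
```

A non-zero `conflicts` count is the signal used above to distrust the description column.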

In [17]:
# Count the 'Private Pool Desc:' categories within each 'Private Pool:' group
single_family_df.groupby('Private Pool:')['Private Pool Desc:'].value_counts()

Private Pool:  Private Pool Desc:                   
No             In Ground                                 14
               Enclosed, Heated, In Ground                6
               Above Ground                               4
               Heated, In Ground                          4
               Gunite                                     3
               Gunite, In Ground                          3
               Fiberglass                                 2
               Gunite, Heated, In Ground                  1
Yes            Gunite, In Ground                        341
               In Ground                                341
               Gunite, Heated, In Ground                233
               Gunite                                   214
               Heated, In Ground                         88
               Gunite, Heated, In Ground, Salt Water     46
               Gunite, Heated                            40
               Above Ground                    

In [18]:
single_family_df.drop(['Private Pool Desc:','Master Planned Community:','Market Area Name:','Bath:','Den:'],axis=1,inplace=True)
# Replace missing values in 'Controlled Access:' with 'no controlled access'
single_family_df['Controlled Access:'] = single_family_df['Controlled Access:'].fillna('no controlled access')

Now I investigate features with between 70% and 80% missing values:

In [19]:
missing = missing_cal(single_family_df)
nan_70 = missing.loc[((missing['%']>70 )& (missing['%']<80))].index
print('Number of Features with more than 70% None: ',len(nan_70))

Number of Features with more than 70% None:  2


In [20]:
missing.loc[nan_70].sort_values(by="%")

Unnamed: 0,count,%
Family Room:,7858,70.532268
Primary Bath:,8315,74.634234


In [21]:
# Print the value counts for each feature with 70% to 80% missing values
for item in nan_70:
    print('Value Count for '+item)
    print(single_family_df[item].value_counts())
    print('-'*100)

Value Count for Family Room:
['1st', '1st']                              75
['18x16, 1st', '5.49 x 4.88(m)']            52
['20x16, 1st', '6.10 x 4.88(m)']            51
['21x17, 1st', '6.40 x 5.18(m)']            39
['18x15, 1st', '5.49 x 4.57(m)']            37
                                            ..
['24 x 13, 1st', '24 , 13, 1st']             1
['29x23, 1st', '8.84 x 7.01(m)']             1
["18'2 x 22'6, 1st", "18'2 , 22'6, 1st"]     1
['16 x 12, 1st', '16 , 12, 1st']             1
['21X12, 1st', '6.40 x 3.66(m)']             1
Name: Family Room:, Length: 939, dtype: int64
----------------------------------------------------------------------------------------------------
Value Count for Primary Bath:
['1st', '1st']                                                    765
['2nd', '2nd']                                                    308
['3rd', '3rd']                                                    128
['12x10, 1st', '3.66 x 3.05(m)']                                   

`Family Room` and `Primary Bath` hold the dimensions of the family room and the primary bathroom. Every house should have these values, and they cannot be 0. Since I cannot fill in values for more than 70% of the houses, dropping these features is appropriate.

In [22]:
single_family_df.drop(['Family Room:','Primary Bath:'],axis=1,inplace=True)

The next step is to look at features with between 50% and 70% missing values:

In [23]:
missing = missing_cal(single_family_df)
nan_50_60 = missing.loc[((missing['%']>50 )& (missing['%']<70))].index
print('Number of Features with more than 50% and less than 70% None: ',len(nan_50_60))

Number of Features with more than 50% and less than 70% None:  6


In [24]:
missing.loc[nan_50_60].sort_values(by="%")

Unnamed: 0,count,%
Front Door:,6471,58.082757
Breakfast:,6724,60.353649
Garage Carport:,6734,60.443407
Utility Room Desc:,7178,64.428687
Game Room:,7367,66.125123
Study/Library:,7658,68.737097


In [25]:
# Print the value counts for each feature with 50% to 70% missing values
for item in nan_50_60:
    print('Value Count for '+item)
    print(single_family_df[item].value_counts())
    print('-'*100)

Value Count for Breakfast:
['1st', '1st']                              192
['10x10, 1st', '3.05 x 3.05(m)']            155
['11x10, 1st', '3.35 x 3.05(m)']            149
['12x10, 1st', '3.66 x 3.05(m)']            147
['10x9, 1st', '3.05 x 2.74(m)']             108
                                           ... 
['17x10, 2nd', '5.18 x 3.05(m)']              1
['26x9, 1st', '7.92 x 2.74(m)']               1
['12.4 x 17.6, 1st', '12.4 , 17.6, 1st']      1
['10X22, 1st', '3.05 x 6.71(m)']              1
["12'8 x 11'5, 1st", "12'8 , 11'5, 1st"]      1
Name: Breakfast:, Length: 769, dtype: int64
----------------------------------------------------------------------------------------------------
Value Count for Game Room:
['2nd', '2nd']                              72
['18x14, 2nd', '5.49 x 4.27(m)']            52
['19x16, 2nd', '5.79 x 4.88(m)']            44
['16x14, 2nd', '4.88 x 4.27(m)']            37
['18x16, 2nd', '5.49 x 4.88(m)']            37
                                      

It seems we cannot fill the missing values for these features: there is no information about the dimensions of the `Utility Room`, `Study/Library`, `Game Room` and `Breakfast` areas, and I do not know the `Garage Carport` details or the `Front Door` direction for the rest of the houses. These features can therefore be dropped as well.

In [26]:
single_family_df.drop(list(nan_50_60),axis=1,inplace=True)

So far I have investigated features with more than 50% missing values. I still need to dig further and fill in the features that are subdivision-level statistics (based on active listings), such as `Home For Sales`, `Average List Price`, `Average Square Ft.`, `Average Price/Square Ft.`, `Home For Lease`, `Average Lease` and `Average Lease/Square Ft.`. Before that, let's take a look at features with between 10% and 50% missing values:

In [27]:
missing = missing_cal(single_family_df)
nan_10_50 = missing.loc[((missing['%']>10 )& (missing['%']<50))].index
print('Number of Features with more than 10% and less than 50% None: ',len(nan_10_50))

Number of Features with more than 10% and less than 50% None:  35


In [28]:
missing.loc[nan_10_50].sort_values(by="%")

Unnamed: 0,count,%
Garage(s):,1292,11.596805
Tax Rate:,1323,11.875056
Dishwasher:,1413,12.682883
Bedroom Desc:,1682,15.097388
Median Appraised Value:,1746,15.671843
Median Year Built:,1746,15.671843
Median Lot Square Ft.:,1746,15.671843
Median Square Ft.:,1746,15.671843
Single Family Properties:,1746,15.671843
County / Zip Code:,1746,15.671843


In [29]:
# Print the value counts for each feature with 10% to 50% missing values
for item in nan_10_50:
    print('Value Count for '+item)
    print(single_family_df[item].value_counts())
    print('-'*100)

Value Count for Garage(s):
2 / Attached                      5626
2 / Detached                      1063
3 / Attached                       686
1 / Detached                       341
1 / Attached                       320
                                  ... 
8 / Attached/Detached                1
26 / Attached                        1
4 / Attached/Detached,Detached       1
4 / Oversized,Tandem                 1
58 / Attached                        1
Name: Garage(s):, Length: 114, dtype: int64
----------------------------------------------------------------------------------------------------
Value Count for Living:
['15x13, 1st', '4.57 x 3.96(m)']     58
['20x15, 1st', '6.10 x 4.57(m)']     57
['14x12, 1st', '4.27 x 3.66(m)']     57
['18x15, 1st', '5.49 x 4.57(m)']     55
['20x16, 1st', '6.10 x 4.88(m)']     55
                                     ..
['22 X 16, 1st', '22 , 16, 1st']      1
['12x23, 1st', '3.66 x 7.01(m)']      1
['17x13,  1st', '5.18 x 3.96(m)']     1
['22x23, 2nd', 

`Room Description`, `Countertop`, `Floors`, `Bedroom Desc`, `Kitchen Desc`, `Bathroom Description`, `Connections`, `Oven`, `Range`, `Energy Feature`, `Interior`, `Exterior` and `Financing Considered` are purely descriptive. We cannot fill them with guessed values, since those may not be accurate, and I do not think they are relevant to our analysis, so I will drop all of them.

In [30]:
single_family_df.drop(['Room Description:', 'Countertop:', 'Floors:', 'Bedroom Desc:', 'Kitchen Desc:', 
                       'Bathroom Description:','Connections:', 'Oven:', 'Range:', 'Energy Feature:',
                       'Interior:', 'Exterior:', 'Financing Considered:'], axis=1,inplace=True)

`Ice Maker`, `Microwave`, `Compactor`, `Dishwasher`, `Disposal` and `Area Pool` are 'Yes/No' categories, and I think it is reasonable to fill missing values with 'No'. This is a simplifying assumption, since some houses may have these features and the owner or agent simply did not fill them in, but for now filling with 'No' is the most practical way of dealing with them.

In [31]:
# Fill missing values in the Yes/No features with 'No'
yes_no_cols = ['Disposal:', 'Ice Maker:', 'Compactor:', 'Area Pool:', 'Microwave:', 'Dishwasher:']
single_family_df[yes_no_cols] = single_family_df[yes_no_cols].fillna('No')
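Since the goal stated in the introduction is numeric features, the Yes/No columns could later be encoded as 0/1. The sketch below uses toy data and also keeps a record of which values were originally missing, because the 'No' fill is an assumption; `missing_flags` and `encoded` are hypothetical names:

```python
import pandas as pd

toy = pd.DataFrame({
    "Disposal:": ["Yes", None, "No"],
    "Dishwasher:": [None, "Yes", "Yes"],
})
yes_no_cols = ["Disposal:", "Dishwasher:"]

# Remember which entries were missing before the fill
missing_flags = toy[yes_no_cols].isnull()

# Same fill as above, followed by a Yes/No -> 1/0 encoding
toy[yes_no_cols] = toy[yes_no_cols].fillna("No")
encoded = toy[yes_no_cols].apply(lambda s: s.map({"Yes": 1, "No": 0}))
print(encoded)
```

Keeping `missing_flags` makes it possible to check later whether the imputed 'No' values behave differently from the reported ones.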

At this point I investigate the remaining features individually:

## 2.7 Garage<a id='2.7_Garage'></a>

In [32]:
single_family_df['Garage(s):'].value_counts()

2 / Attached                      5626
2 / Detached                      1063
3 / Attached                       686
1 / Detached                       341
1 / Attached                       320
                                  ... 
8 / Attached/Detached                1
26 / Attached                        1
4 / Attached/Detached,Detached       1
4 / Oversized,Tandem                 1
58 / Attached                        1
Name: Garage(s):, Length: 114, dtype: int64

The important part of this feature is the number of garage spaces each house has. The value counts show that most single-family homes have 2 garage spaces, so it is reasonable to fill missing values with '2'.

In [33]:
# Fill missing values with '2', then keep the leading number of garage spaces
single_family_df['Garage(s):'] = single_family_df['Garage(s):'].fillna('2')
single_family_df['garage'] = single_family_df['Garage(s):'].str.split(' ').str[0].astype(int)
single_family_df.drop('Garage(s):',axis=1,inplace=True)
single_family_df['garage'].value_counts()

2     8703
3     1445
1      758
4      175
5       21
6       11
8        7
7        4
24       2
10       1
40       1
56       1
9        1
57       1
63       1
26       1
42       1
27       1
51       1
20       1
21       1
45       1
22       1
58       1
Name: garage, dtype: int64

As you can see, some houses are listed with more than 8 garage spaces, which is odd. After checking the images of some of these houses on www.HAR.com, it seems they have only 2, so I replace those values with 2, the median of this feature.

In [34]:
# Replace implausible counts (more than 8) with the median
single_family_df.loc[single_family_df['garage']>8, 'garage'] = int(single_family_df['garage'].median())
single_family_df['garage'].value_counts()

2    8720
3    1445
1     758
4     175
5      21
6      11
8       7
7       4
Name: garage, dtype: int64
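The garage cleaning above can also be written as one small pipeline. The sketch below is an alternative under stated assumptions, not the notebook's method: it pulls the leading number with a regular expression, and it takes the median over plausible counts only (8 or fewer) rather than over all values.

```python
import pandas as pd

# Toy values in the same 'count / type' format as 'Garage(s):'
garage_raw = pd.Series(["2 / Attached", "3 / Attached", None,
                        "58 / Attached", "1 / Detached"])

# Leading integer before the slash; missing entries stay NaN for now
count = garage_raw.str.extract(r"^\s*(\d+)", expand=False).astype(float)

# Same assumption as above: a missing value means 2 garage spaces
count = count.fillna(2)

# Treat counts above 8 as data-entry errors and replace them
plausible = count[count <= 8]
count = count.where(count <= 8, plausible.median()).astype(int)
print(count.tolist())
```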

## 2.8 Living<a id='2.8_Living'></a>

To calculate the living area, I multiply the dimensions of the living room. As the next step, I will fill missing values with the average living-room area per subdivision.

In [173]:
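def area_calc(item, pattern=r"[+-]? *(?:\d+(?:\.\d*)?|\.\d+)"):
    """Return the area (width * length, in square feet) parsed from a split 'Living:' entry."""
    pattern = re.compile(pattern, re.IGNORECASE)
    area = 0
    # Missing entries are not lists; they fall through and return None (NaN in the column)
    if isinstance(item, list):
        # Keep the parts with imperial dimensions such as '15x13'; skip the metric '(m)' duplicates
        dim = [i for i in item if 'x' in i.lower() and '(m)' not in i]
        for d in dim:
            numbers = pattern.findall(d.replace(' ', ''))
            area += float(numbers[0]) * float(numbers[1])
        return area

# Split the raw string on commas, then compute the area from the dimension part
single_family_df['clean_living'] = single_family_df['Living:'].str.split(',')
single_family_df['clean_living'] = single_family_df['clean_living'].apply(area_calc)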

In [174]:
single_family_df['clean_living'].describe()

count    6294.000000
mean      288.354747
std       152.449699
min         0.000000
25%       210.000000
50%       272.000000
75%       342.000000
max      6651.000000
Name: clean_living, dtype: float64
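The fill step described at the start of this section is not shown in the notebook. A minimal sketch of it, on toy data, might look like the following. Treating an area of 0 as missing and falling back to the overall mean for subdivisions without any known area are both assumptions of this sketch; `living_filled` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Subdivision:": ["A", "A", "A", "B", "B", "C"],
    "clean_living": [200.0, 0.0, 300.0, 150.0, np.nan, np.nan],
})

# Assumption: an area of 0 means no dimensions were parsed, so treat it as missing
area = toy["clean_living"].replace(0, np.nan)

# Mean living-room area per subdivision, aligned to the original rows
sub_mean = area.groupby(toy["Subdivision:"]).transform("mean")

# Fill from the subdivision first, then from the overall mean as a fallback
toy["living_filled"] = area.fillna(sub_mean).fillna(area.mean())
print(toy)
```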

In [187]:
single_family_df['Subdivision:'].unique()

array(['Austin Hadley Place',
       ' Modern Midtown (View subdivision price trend)',
       ' Merkels Sec 01 (View subdivision price trend)', ...,
       ' South Houston (View subdivision price trend)',
       ' South Houston Terrace (View subdivision price trend)',
       ' Merilyn Place Sec 03 (View subdivision price trend)'],
      dtype=object)
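Many subdivision names carry a leading space and the scraped link text `(View subdivision price trend)`. Before any grouping or matching, they could be normalized roughly as follows (toy values copied from the output above; `clean` is an illustrative name):

```python
import pandas as pd

subs = pd.Series([
    "Austin Hadley Place",
    " Modern Midtown (View subdivision price trend)",
    " Merkels Sec 01 (View subdivision price trend)",
    None,
])

# Remove the scraped link text, then trim surrounding whitespace
clean = (subs.str.replace(r"\s*\(View subdivision price trend\)", "", regex=True)
             .str.strip())
print(clean.tolist())
```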

In [None]:

correct_subdevision = ['Acres Homes', 'Addicks', 'Afton Oaks', 'Aldine', 'Alief', 'Almeda', 'Atascocita',
                       'Audubon Place', 'Avenida Houston', 'Avondale East', 'Bay Forest', 'Bay Glen', 'Bay Knoll', 
                       'Barrett Station', 'Binz', 'Blue Ridge', 'Bordersville', 'Boulevard Oaks', 'Braeburn',
                       'Braeswood Place', 'Brays Oaks', 'Brentwood', 'Briar Meadow', 'Briargrove', 'Briargrove Park',
                       'Briarhills', 'Briarmeadow', 'Broadacres', 'Brooke Smith', 'Brookline', 'Camden Park',
                       'Camino South', 'Camp Logan', 'Candlelight Estates', 'Candlelight Place', 'Carverdale',
                       'Central City', 'Champion Forest', 'Chasewood ', 'Cherryhurst', 'Chevy Chase', 'Chinatown', 
                       'City Park', 'CityCentre', 'Clear Lake City', 'Clinton Park ', 'Cloverland', 'Cole Creek Manor',
                       'Copperfield', 'Corinthian Pointe', 'Cottage Grove', 'Courtlandt Place ',
                       'Crestwood / Glen Cove', 'Candlelight Forest West', 'Denver Harbor', 'Downtown', 
                       'East Downtown', 'East End', 'East Houston', 'Eastex / Jensen', 'East Little York / Homestead', 
                       'Eastwood ', 'Edgebrook', 'El Dorado / Oates Prairie', 'Eldridge / West Oaks', 'Energy Corridor',
                       'Fairbanks / Northwest Crossing', 'Fondren Southwest ', 'Fifth Ward', 'First Ward', 'Forum Park',
                       'Fourth Ward', 'Forest West / Pinemont', 'Forrest Lake', 'Foster Place', 'Frenchtown',
                       'Frostwood', 'Garden Oaks ', 'Garden Villas ', 'Gaywood', 'Genoa', 'Glenbrook Valley', 
                       'Glenshire', 'Golfcrest', 'Greenfield Village', 'Greenspoint', 'Greenway Plaza', 'Greenwood', 
                       'Gulfgate / Pine Valley', 'Gulfton', 'Gulfway Terrace', 'Harrisburg', 'Heather Glen',
                       'Herschellwood', 'Hidden Valley', 'Highland Village ', 'Hillwood', 'Hiram Clarke', 
                       'Houston Gardens ', 'Houston Heights', 'Humble', 'Hunters Glen', 'Hunters Point', 'Hunterwood',
                       'Hyde Park', 'Idylwood', 'Independence Heights', 'International District', 'Inwood Forest', 
                       'Ingrando Park', 'Jeanetta', 'Kashmere Gardens ', 'Kingwood', 'Kleinbrook', 'Knollwood Village',
                       'Lake Houston', 'Lakes of Parkway', 'Lakewood', 'Langwood', 'Larchmont', 'Lawndale / Wayside',
                       'Lazybrook / Timbergrove', 'Lindale Park', 'Link Valley', 'Linkwood', 'Little Saigon', 
                       'Lincoln Greens', 'Lower Westheimer', 'Magnolia Grove', 'Magnolia Park', 
                       'Mahatma Gandhi District', 'Manchester', 'Maplewood', 'Maplewood South–North', 
                       'Marilyn Estates', 'Meadowcreek Village', 'Memorial', 'Memorial Bend', 'Memorial City',
                       'Memorial Park', 'Meyerland', 'Midtown', 'Montrose', 'Moonshine Hill', 'Morningside Place', 
                       'Museum District', 'Museum Park', 'Mykawa', 'Near Northside', 'Near Northwest', 'Neartown', 
                       'North Lindale', 'Norhill', 'North Central', 'North Shore', 'Northcliffe', 'Northcliffe Manor', 
                       'Northfield ', 'Northline', 'Northside', 'Nottingham Forest', 'Nottingham West', 'Oak Brook',
                       'Oak Estates ', 'Oak Forest', 'Oak Manor–University Woods ','Oak Grove', 'Old Braeswood ', 'Overbrook',
                       'Paradise Valley', 'Park Place', 'Parkway Villages', 'Pecan Park ', 'Pierce Junction',
                       'Pine Valley', 'Pleasantville ', 'Port Houston', 'Ponderosa Forest', 'Prestonwood Forest',
                       'Recreation Acres ', 'Rice Military ', 'Rice Village ', 'Richmond Strip', 'Ridgegate', 
                       'Ridgemont', 'River Oaks', 'Rivercrest Estates', 'Riverside Terrace ','Riverstone', 'Robindell',
                       'Royal Oaks Country Club', 'Sagemont', 'Scenic Woods', 'Second Ward', 'Settegast', 
                       'Shady Acres', 'Shadyside', 'Sharpstown', 'Shenandoah', 'Shepherd Park Plaza', 
                       'Sherwood Forest', 'Sherwood Oaks', 'Sixth Ward', 'Somerset Green', 'South Acres', 
                       'South Bank', 'South Main', 'South Park', 'South Union', 'Southcrest', 'Southampton', 
                       'Southbelt / Ellington', 'Southgate', 'Southwest', 'Spring Branch', 'Spring Lakes', 
                       'St. George Place', 'Sugar Valley', 'Sunnyside', 'Sunset Terrace / Montclair', 'Tanglewood', 
                       'Tanglewilde', 'Third Ward', 'Timbergrove Manor', 'University Oaks', 'Upper Kirby', 'Uptown', 
                       'Village at Glen Iris', 'Walnut Bend', 'Washington Avenue', 'Washington Terrace', 
                       'West Eleventh Place', 'West End', 'West Oaks', 'Westbury', 'Westmoreland', 
                       'Westmoreland Farms', 'Westwood', 'Willow Meadows', 'Willowbend', 'Willowbrook', 
                       'Willowick Place', 'Willowood', 'Windemere', 'Windsor Village', 'Woodland Heights ', 
                       'Woodland Trails ', 'Woodshire ', 'Woodside', 'Wrenwood', 'Yellowstone', 'Yorkshire', 
                       'Yorkwood']

# For each correct subdivision name in the list
for sub in correct_subdevision:
    print(sub)
    # Find matches among the subdivisions of the houses
    matches = process.extract(sub, single_family_df['Subdivision:'], 
                 limit = single_family_df.shape[0])
    
    print(len(matches))
#     # For each possible match with similarity score >= 90
#     for possible_match in matches:
#         if possible_match[1] >= 90:
#             # Find the rows with the matching subdivision and map them to the correct name
#             matching_sub = single_family_df['Subdivision:'] == possible_match[0]
#             single_family_df.loc[matching_sub , 'Subdivision:'] = sub

# Print unique values to confirm mapping
print( single_family_df['Subdivision:'].unique())  
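The idea behind the mapping can be illustrated with the standard library's `difflib` instead of `fuzzywuzzy`. This is a swapped-in technique for a self-contained sketch: `difflib` scores run from 0 to 1 rather than 0 to 100, so the cutoff of 0.6 is not equivalent to the score of 90 in the commented-out code. The helper `match_subdivision` and the toy names are hypothetical.

```python
from difflib import get_close_matches

# Toy canonical names and raw listing names
canonical = ["Midtown", "River Oaks", "Houston Heights", "Garden Oaks"]
raw = ["Modern Midtown", "River Oaks Sec 02", "Houston Heights", "Zen Tu"]

def match_subdivision(name, choices, cutoff=0.6):
    """Hypothetical helper: return the closest canonical name, or None
    when nothing is similar enough."""
    # Compare on lowercased, stripped text but return the canonical spelling
    lookup = {c.strip().lower(): c.strip() for c in choices}
    hits = get_close_matches(name.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None

mapping = {name: match_subdivision(name, canonical) for name in raw}
for name, target in mapping.items():
    print(f"{name!r} -> {target!r}")
```

Building a name-to-name mapping first, and reviewing it before writing anything back to the frame, makes it easier to catch wrong matches than overwriting `Subdivision:` inside the loop.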

In [53]:
single_family_df.groupby('Subdivision:')['Home For Sales:'].value_counts()

Subdivision:                        Home For Sales:
1/Drexel Place Sub                  35.0               1.0
433919                              824                1.0
897-903 Knox Th                     341.0              1.0
A323000 A-230 SAMUEL C NEILL TRACT  286                1.0
ALLSON STREET GROVE                 838.0              1.0
                                                      ... 
Woodson's Reserve                   717                1.0
Woodson's Reserve - Villas          717                1.0
Woodson's Reserve Select            717                1.0
Zen Tu                              309.0              1.0
Zugheri Ford Acres                  341.0              1.0
Name: Home For Sales:, Length: 555, dtype: float64

In [59]:
single_family_df.groupby('Subdivision:')['Average List Price:'].value_counts().isna().sum()

0

In [16]:
single_family_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11152 entries, 0 to 11151
Data columns (total 67 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   image_link                  11152 non-null  object 
 1   Listing Price:              11152 non-null  object 
 2   City:                       11152 non-null  object 
 3   Zip Code:                   11152 non-null  int64  
 4   County:                     11152 non-null  object 
 5   Subdivision:                11147 non-null  object 
 6   Legal Description:          11048 non-null  object 
 7   Bedrooms:                   11117 non-null  object 
 8   Baths:                      11125 non-null  object 
 9   Garage(s):                  9992 non-null   object 
 10  Stories:                    11148 non-null  object 
 11  Style:                      11152 non-null  object 
 12  Year Built:                 11082 non-null  object 
 13  Building Sqft.:             111

## 2.7 Listing Price<a id='2.7_Listing_Price'></a>

In [17]:
single_family_df['Listing Price:'].isna().sum()

0

In [18]:
single_family_df['Listing Price:']=single_family_df['Listing Price:'].str.split(' ').str[1]
single_family_df['Listing Price:']=single_family_df['Listing Price:'].str.replace(',','')
single_family_df['Listing Price:']=pd.to_numeric(single_family_df['Listing Price:'])
single_family_df.rename(columns = {'Listing Price:':'listing_price','Address:':'address', 'Zip Code:':'zip_code', 'County:':'county',
                                 'Subdivision:':'sub', 'Legal Description:':'legal'},inplace=True)

In [19]:
single_family_df['listing_price'].describe()

count    1.115200e+04
mean     5.114418e+05
std      6.209544e+05
min      1.000000e+00
25%      2.399750e+05
50%      3.450000e+05
75%      5.402250e+05
max      1.450000e+07
Name: listing_price, dtype: float64
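
The conversion above relies on the price being the second space-separated token. As a minimal sketch, assuming raw values look like `$ 345,000` (the raw format is not shown in this notebook, and `parse_price` is a hypothetical helper name), a regular expression is less sensitive to spacing:

```python
import re

def parse_price(raw):
    """Return the first number found in a price string as an int, or None."""
    # Assumed input shape: '$ 345,000' (the raw format is an assumption)
    match = re.search(r'\d[\d,]*', str(raw))
    if match is None:
        return None
    # Drop thousands separators before converting
    return int(match.group().replace(',', ''))

print(parse_price('$ 345,000'))
```

A helper like this could be applied with `Series.map` in place of the split-based lines.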

## 2.8 Bedrooms<a id='2.8_Bedrooms'></a>

In [20]:
single_family_df[['Bedrooms:','Bedroom:']]

| | Bedrooms: | Bedroom: |
|---|---|---|
| 0 | 3 Bedroom(s) | 13x11, 3rd |
| 1 | 3 Bedroom(s) | 13 x 11, 1st |
| 2 | 3 Bedroom(s) | 13x10, 1st |
| 3 | 4 Bedroom(s) | 11x11, 3rd |
| 4 | 3 Bedroom(s) | 13 x 11, 1st |
| ... | ... | ... |
| 11147 | 3 Bedroom(s) | 10x11, 1st |
| 11148 | 3 Bedroom(s) | 10x8, 1st |
| 11149 | 3 Bedroom(s) | 12x11, 1st |
| 11150 | 3 Bedroom(s) | 12x12, 1st |


The 'Bedroom:' feature only records the size of one of the bedrooms, so I think it is better to drop this column.

In [21]:
single_family_df.drop('Bedroom:',axis=1,inplace=True)

In [22]:
single_family_df['Bedrooms:'].isnull().sum()

35

In [23]:
single_family_df=single_family_df[~single_family_df['Bedrooms:'].isnull()]

In [24]:
print('Number of Missing Values on Bedrooms:',single_family_df['Bedrooms:'].isnull().sum())
print('data shape is: ',single_family_df.shape)

Number of Missing Values on Bedrooms: 0
data shape is:  (11117, 66)


In [25]:
single_family_df['Bedrooms:']=single_family_df['Bedrooms:'].str.split(' ').str[0]
single_family_df['Bedrooms:']=single_family_df['Bedrooms:'].astype(int)
single_family_df.rename(columns = {'Bedrooms:':'bedrooms'},inplace=True)
single_family_df['bedrooms'].describe()

count    11117.000000
mean         3.661959
std          0.813128
min          1.000000
25%          3.000000
50%          4.000000
75%          4.000000
max         21.000000
Name: bedrooms, dtype: float64

## 2.9 Bathrooms<a id='2.9_Bathrooms'></a>

In [26]:
single_family_df['Baths:'].isnull().sum()

5

In [27]:
single_family_df=single_family_df[~single_family_df['Baths:'].isnull()]

In [28]:
single_family_df[['Baths:']]

| | Baths: |
|---|---|
| 0 | 3 Full & 1 Half Bath(s) |
| 1 | 3 Full & 1 Half Bath(s) |
| 2 | 3 Full & 1 Half Bath(s) |
| 3 | 3 Full & 1 Half Bath(s) |
| 4 | 3 Full & 1 Half Bath(s) |
| ... | ... |
| 11147 | 1 Full Bath(s) |
| 11148 | 1 Full & 1 Half Bath(s) |
| 11149 | 1 Full & 1 Half Bath(s) |
| 11150 | 2 Full Bath(s) |


In [29]:
single_family_df['full_bath']=single_family_df['Baths:'].str.split(' ').str[0].astype(int)

In [30]:
single_family_df['full_bath']

0        3
1        3
2        3
3        3
4        3
        ..
11147    1
11148    1
11149    1
11150    2
11151    1
Name: full_bath, Length: 11112, dtype: int32

In [31]:
# Text after '&' holds the half-bath count, e.g. '1 Half Bath(s)'; rows without '&' have no half bath
No_Bath = single_family_df['Baths:'].str.split('&').str[1].str.strip()
No_Bath.fillna('0',inplace=True) 
# The first token of each entry is the half-bath count
single_family_df['half_bath']=[int(item[0]) for item in No_Bath.str.split(' ')]
single_family_df[['full_bath','half_bath']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11112 entries, 0 to 11151
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   full_bath  11112 non-null  int32
 1   half_bath  11112 non-null  int64
dtypes: int32(1), int64(1)
memory usage: 217.0 KB
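
The two cells above split strings such as `3 Full & 1 Half Bath(s)` into a full-bath count and a half-bath count. The same idea can be sketched as one regex-based helper; `parse_baths` is a hypothetical name, and the sketch assumes the counts always appear directly before the words `Full` and `Half`:

```python
import re

def parse_baths(raw):
    """Split a string like '3 Full & 1 Half Bath(s)' into (full, half) counts."""
    full = re.search(r'(\d+)\s*Full', raw)
    half = re.search(r'(\d+)\s*Half', raw)
    # A missing part (for example no half bath) counts as zero
    return (int(full.group(1)) if full else 0,
            int(half.group(1)) if half else 0)

print(parse_baths('3 Full & 1 Half Bath(s)'))
print(parse_baths('2 Full Bath(s)'))
```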


In [32]:
single_family_df.drop('Baths:',axis=1,inplace=True)

In [33]:
single_family_df[['full_bath','half_bath']].tail()

| | full_bath | half_bath |
|---|---|---|
| 11147 | 1 | 0 |
| 11148 | 1 | 1 |
| 11149 | 1 | 1 |
| 11150 | 2 | 0 |
| 11151 | 1 | 0 |


In [37]:
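# Replace implausible garage counts (more than 8) with the median; .loc avoids chained assignment
single_family_df.loc[single_family_df['garage']>8,'garage']=single_family_df['garage'].median()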
single_family_df['garage'].value_counts()

2    7531
3    1468
0    1133
1     750
4     184
5      24
6      11
8       7
7       4
Name: garage, dtype: int64

In [38]:
single_family_df['Stories:'].value_counts()

2      5367
1      4475
3       857
1.5     229
4       154
2.5      26
5         2
Name: Stories:, dtype: int64

In [39]:
single_family_df['Stories:'].isnull().sum()

2

In [40]:
single_family_df=single_family_df[~single_family_df['Stories:'].isnull()]

In [41]:
single_family_df.rename(columns ={'Stories:':'stories'},inplace=True)
single_family_df['stories']=pd.to_numeric(single_family_df['stories'])

In [42]:
single_family_df['Style:'].value_counts()

Traditional                        7671
Contemporary/Modern                1216
Ranch                               413
Contemporary/Modern,Traditional     361
Other Style                         253
                                   ... 
French,Mediterranean                  1
Mediterranean,Split Level             1
Ranch,Split Level,Traditional         1
Split Level,Victorian                 1
Georgian,Victorian                    1
Name: Style:, Length: 80, dtype: int64

In [43]:
single_family_df['Style:'].isnull().sum()

0

In [44]:
single_family_df.rename(columns ={'Style:':'style'},inplace=True)

In [45]:
single_family_df['Year Built:'].isnull().sum()

66

In [46]:
single_family_df['Year Built:'].value_counts()

2020   / Builder               2461
2006   / Appraisal District     204
2005   / Appraisal District     201
2019   / Builder                200
2015   / Appraisal District     192
                               ... 
1985   / Seller                   1
1918   / Appraisal District       1
1992   / Builder                  1
1950   / Seller                   1
1996   / Seller                   1
Name: Year Built:, Length: 297, dtype: int64

In [47]:
single_family_df=single_family_df[~single_family_df['Year Built:'].isnull()]
single_family_df['Year Built:']=single_family_df['Year Built:'].apply(lambda x:str(x).split(' ')[0])
single_family_df['Year Built:']=pd.to_datetime(single_family_df['Year Built:'],format='%Y').dt.year
single_family_df['Year Built:'].value_counts()

2020    2541
2019     279
2015     253
2006     235
2014     232
        ... 
1923       1
1918       1
1908       1
1916       1
1896       1
Name: Year Built:, Length: 115, dtype: int64

In [48]:
single_family_df.rename(columns ={'Year Built:':'year_built'},inplace=True)

In [49]:
single_family_df['Building Sqft.:']

0        2,096195(m²)  /Appraisal District
1        2,015187(m²)  /Appraisal District
2        2,468229(m²)  /Appraisal District
3                   2,878267(m²)  /Builder
4                   2,073193(m²)  /Builder
                       ...                
11146    1,630151(m²)  /Appraisal District
11148    1,316122(m²)  /Appraisal District
11149    1,160108(m²)  /Appraisal District
11150    1,135105(m²)  /Appraisal District
11151       83978(m²)  /Appraisal District
Name: Building Sqft.:, Length: 11044, dtype: object

In [50]:
single_family_df['Building Sqft.:'].isnull().sum()

18

In [51]:
single_family_df=single_family_df[~single_family_df['Building Sqft.:'].isnull()]
single_family_df['Building Sqft.:']=single_family_df['Building Sqft.:'].apply(lambda x:x[0:5] if ',' in x else x[0:3])
single_family_df['Building Sqft.:']=single_family_df['Building Sqft.:'].str.replace(',','')
single_family_df['Building Sqft.:']=pd.to_numeric(single_family_df['Building Sqft.:'])
single_family_df.rename(columns ={'Building Sqft.:':'build_Sq'},inplace=True)
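
The raw strings fuse the square-foot value with the square-metre value (`2,096195(m²)` is 2,096 sq ft followed by 195 m²). The slicing above keeps the first five characters when a comma is present, which would truncate any value of 10,000 sq ft or more. A minimal sketch of a regex-based alternative follows; `parse_building_sqft` is a hypothetical helper, and the `10,500975(m²)` input is an invented example rather than a row from the data. Like the original slicing, it assumes values without a comma have exactly three digits.

```python
import re

# One to three digits, then any number of ',ddd' thousands groups
SQFT_PATTERN = re.compile(r'^\s*(\d{1,3}(?:,\d{3})*)')

def parse_building_sqft(raw):
    """Extract the leading square-foot figure from a fused sqft/m² string."""
    match = SQFT_PATTERN.match(raw)
    if match is None:
        return None
    return int(match.group(1).replace(',', ''))

print(parse_building_sqft('2,096195(m²)  /Appraisal District'))
print(parse_building_sqft('83978(m²)  /Appraisal District'))
print(parse_building_sqft('10,500975(m²)  /Builder'))  # invented example
```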

In [52]:
single_family_df.year_built[single_family_df['Lot Size:'].isnull()].value_counts()

2020    857
2019     34
2021     18
2013      6
2018      5
2016      5
2017      3
2015      3
1955      2
1965      2
1967      2
1975      2
1978      2
1979      2
1982      2
2009      2
2007      2
2000      1
1994      1
1930      1
1935      1
1945      1
1950      1
1952      1
1962      1
2008      1
1968      1
1969      1
1973      1
2006      1
1976      1
1977      1
2005      1
2004      1
1980      1
2002      1
1985      1
1990      1
1920      1
Name: year_built, dtype: int64

Since 857 of the rows with a null lot size are 2020 builds that are still under construction, I will drop all rows with a null lot size.

In [53]:
single_family_df=single_family_df[~single_family_df['Lot Size:'].isnull()]

In [54]:
single_family_df['Lot Size:']

0        2,173 Sqft.202(m²)  /Appraisal District
1        1,446 Sqft.134(m²)  /Appraisal District
2        1,786 Sqft.166(m²)  /Appraisal District
4        1,788 Sqft.166(m²)  /Appraisal District
9        2,460 Sqft.229(m²)  /Appraisal District
                          ...                   
11146    7,975 Sqft.741(m²)  /Appraisal District
11148    7,100 Sqft.660(m²)  /Appraisal District
11149    7,100 Sqft.660(m²)  /Appraisal District
11150    7,100 Sqft.660(m²)  /Appraisal District
11151    7,100 Sqft.660(m²)  /Appraisal District
Name: Lot Size:, Length: 10055, dtype: object

In [55]:
single_family_df['Lot Size:']=single_family_df['Lot Size:'].str.replace(',','')
# Convert values given in acres to square feet (1 acre = 43,560 sq ft); other values are already in square feet
single_family_df['Lot Size:']=single_family_df['Lot Size:'].apply(
    lambda x:float(x.split(' ')[0])*43560 if 'Acres' in x else float(x.split(' ')[0]))

# single_family_df['Lot Size:']=pd.to_numeric(single_family_df['Lot Size:'])
single_family_df.rename(columns ={'Lot Size:':'lot_size'},inplace=True)

In [56]:
single_family_df.lot_size.isnull().sum()

0
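
The lot-size conversion can also be written as a small named function, which makes the unit handling easier to check on individual strings. This is only a sketch: `parse_lot_size` is a hypothetical name, and the `0.5 Acres` input is an assumed format, since no acre-valued row is shown in the output above.

```python
SQFT_PER_ACRE = 43560  # 1 acre = 43,560 square feet

def parse_lot_size(raw):
    """Return the lot size in square feet from a raw 'Lot Size:' string."""
    cleaned = raw.replace(',', '')
    # The first space-separated token is the numeric value
    value = float(cleaned.split(' ')[0])
    return value * SQFT_PER_ACRE if 'Acres' in cleaned else value

print(parse_lot_size('2,173 Sqft.202(m²)  /Appraisal District'))
print(parse_lot_size('0.5 Acres'))  # assumed format for acre values
```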

In [57]:
single_family_df['Maintenance Fee:']

0        $ 1304 / Annually
1        $ 1200 / Annually
2        $ 2000 / Annually
4        $ 1200 / Annually
9                       No
               ...        
11146                   No
11148                   No
11149                   No
11150                   No
11151                   No
Name: Maintenance Fee:, Length: 10055, dtype: object

In [58]:
single_family_df['Maintenance Fee:'].isnull().sum()

28

In [59]:
single_family_df['Maintenance Fee:'].value_counts()

No                   2973
$ 1200 / Annually     170
$ 450 / Annually      163
$ 600 / Annually      155
$ 400 / Annually      154
                     ... 
$ 257 / Annually        1
$ 190 / Monthly         1
$ 110 / Quarterly       1
$ 2811 / Annually       1
No / Monthly            1
Name: Maintenance Fee:, Length: 929, dtype: int64

In [60]:
single_family_df['Maintenance Fee:'].isin(['No','No / $0','Voluntary / Annually','Voluntary /0/ Annually']).sum()

3091

In [61]:
single_family_df.drop('Maintenance Fee:',axis=1,inplace=True)

In [62]:
single_family_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10055 entries, 0 to 11151
Data columns (total 66 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   image_link                  10055 non-null  object 
 1   listing_price               10055 non-null  int64  
 2   City:                       10055 non-null  object 
 3   zip_code                    10055 non-null  int64  
 4   county                      10055 non-null  object 
 5   sub                         10050 non-null  object 
 6   legal                       10038 non-null  object 
 7   bedrooms                    10055 non-null  int32  
 8   stories                     10055 non-null  float64
 9   style                       10055 non-null  object 
 10  year_built                  10055 non-null  int64  
 11  build_Sq                    10055 non-null  int64  
 12  lot_size                    10055 non-null  float64
 13  Living:                     601

In [63]:
single_family_df['Living:'].isnull().sum()

4043

In [64]:
new_missing=missing_cal(single_family_df)

In [65]:
new_missing

| | count | % |
|---|---|---|
| image_link | 0 | 0.000000 |
| listing_price | 0 | 0.000000 |
| City: | 0 | 0.000000 |
| zip_code | 0 | 0.000000 |
| county | 0 | 0.000000 |
| ... | ... | ... |
| Neighborhood Value Range: | 1198 | 11.914470 |
| Median Price / Square ft.: | 2086 | 20.745898 |
| full_bath | 0 | 0.000000 |
| half_bath | 0 | 0.000000 |


In [66]:
new_missing[new_missing['%']>0].sort_values("%")

| | count | % |
|---|---|---|
| sub | 5 | 0.049727 |
| Average Baths: | 15 | 0.14918 |
| legal | 17 | 0.16907 |
| HOA Mandatory: | 35 | 0.348086 |
| Average Bedrooms: | 47 | 0.467429 |
| Primary Bedroom: | 617 | 6.136251 |
| Tax Rate: | 854 | 8.493287 |
| Median Lot Square Ft.: | 1198 | 11.91447 |
| Neighborhood Value Range: | 1198 | 11.91447 |
| Subdivision Name: | 1198 | 11.91447 |


In [67]:
single_family_df['Disposal:'].value_counts()

Yes    8153
No      309
Name: Disposal:, dtype: int64

In [69]:
single_family_df['Fireplace:'].value_counts()

1/Gaslog Fireplace                                 1783
1                                                  1087
1/Gas Connections                                   878
1/Wood Burning Fireplace                            542
1/Gas Connections, Gaslog Fireplace                 445
                                                   ... 
4/Mock Fireplace                                      1
/Freestanding, Wood Burning Fireplace                 1
1/Gas Connections, Stove                              1
2/Stove, Wood Burning Fireplace                       1
1/Freestanding, Gas Connections, Mock Fireplace       1
Name: Fireplace:, Length: 92, dtype: int64

In [70]:
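# Count the first character of each entry; NaN becomes the string 'nan', so missing values show up as 'n'
pd.Series([str(x)[0]  for x in single_family_df['Fireplace:'] if x is not None]).value_counts()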

1    5605
n    3605
2     573
3     130
/      93
4      37
5      10
7       1
6       1
dtype: int64

In [71]:
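# Keep the leading digit as the fireplace count; missing values and entries without a leading digit become 0
single_family_df['Fireplace:']=single_family_df['Fireplace:'].apply(
    lambda x:int(str(x)[0]) if str(x)[0] in ['1','2','3','4','5','6','7'] else 0)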

In [72]:
single_family_df['Fireplace:'].value_counts()

1    5605
0    3698
2     573
3     130
4      37
5      10
7       1
6       1
Name: Fireplace:, dtype: int64
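
The mapping above keeps the leading digit as the fireplace count and sends everything else to 0. This matches the earlier counts: the 3,605 missing values (rendered as `nan`) plus the 93 entries starting with `/` give the 3,698 zeros. A minimal sketch of the same rule as a standalone function, with `fireplace_count` as a hypothetical name:

```python
import math

def fireplace_count(raw):
    """Return the leading digit of a fireplace description, or 0 if there is none."""
    # Treat missing values (None or NaN) as no fireplace
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return 0
    text = str(raw).strip()
    # Entries such as '/Freestanding, ...' have no leading count
    return int(text[0]) if text[:1].isdecimal() else 0

print(fireplace_count('1/Gaslog Fireplace'))
print(fireplace_count('/Freestanding, Wood Burning Fireplace'))
print(fireplace_count(float('nan')))
```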

In [73]:
single_family_df['Median Price / Square ft.:'].value_counts()

$303.75     108
$259.49      95
$146.71      82
$127.37      77
$139.32      60
           ... 
$126.72       1
$132.07       1
$100.69       1
$94.82        1
$118.87       1
Name: Median Price / Square ft.:, Length: 1369, dtype: int64

In [74]:
# regex=False treats '$' as a literal character instead of the end-of-string anchor
single_family_df['Median Price / Square ft.:']=pd.to_numeric(single_family_df['Median Price / Square ft.:'].str.replace('$','',regex=False).str.strip())

In [75]:
single_family_df['Subdivision Name:']

0               Modern Midtown
1               Modern Midtown
2                          NaN
4             ELITE TWNHMS LLC
9                          NaN
                 ...          
11146    South Houston Terrace
11148            South Houston
11149            South Houston
11150            South Houston
11151            South Houston
Name: Subdivision Name:, Length: 10055, dtype: object

In [76]:
single_family_df['Average Bedrooms:']

0        3.00
1        3.00
2        2.11
4        3.00
9        2.55
         ... 
11146    2.93
11148    3.01
11149    3.01
11150    3.01
11151    3.01
Name: Average Bedrooms:, Length: 10055, dtype: float64

In [77]:
# Drop descriptive or sparsely populated columns that will not be used as features
single_family_df.drop(['Living:','Kitchen Desc:','Dining:','Kitchen:','Interior:','Countertop:','Energy Feature:',
                       'Exterior:','Connections:','Oven:','Taxes w/o Exemp:','Range:',
                       'Floors:','Room Description:','Financing Considered:','Bathroom Description:',
                       'County / Zip Code:','Single Family Properties:','Bedroom Desc:','Subdivision Name:',
                       'Primary Bedroom:'],axis=1,inplace=True)

In [78]:
new_missing=missing_cal(single_family_df)
new_missing[new_missing['%']>0].sort_values("%")

| | count | % |
|---|---|---|
| sub | 5 | 0.049727 |
| Average Baths: | 15 | 0.14918 |
| legal | 17 | 0.16907 |
| HOA Mandatory: | 35 | 0.348086 |
| Average Bedrooms: | 47 | 0.467429 |
| Tax Rate: | 854 | 8.493287 |
| Median Square Ft.: | 1198 | 11.91447 |
| Median Lot Square Ft.: | 1198 | 11.91447 |
| Median Year Built: | 1198 | 11.91447 |
| Median Appraised Value: | 1198 | 11.91447 |
