# ** rename file without spaces?? **

## Project Plan

### Goals

### Hypothesis

### Components

In [1]:
import acquire
import prep

## Acquire and Prep

### Pull SQL data and filter data frame

Using the `get_sql_zillow` function from the `acquire` module, pull all columns of the Zillow dataset from SQL under the following conditions:
 - include only properties that have recorded transactions in 2017
 - use left joins to merge each description table on its associated id
 - exclude properties that do not have latitude and longitude values (location matters in our later analysis, so we want to make sure we only include observations with known locations)
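
The three conditions above map onto a query of roughly the following shape. This is only a sketch against an in-memory SQLite database: the table names `properties`, `predictions`, and `storytype`, their columns, and the toy rows are assumptions made for illustration, not the actual Zillow schema or the body of `get_sql_zillow`.

```python
import sqlite3

# Hypothetical miniature schema; the real Zillow tables and columns may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE properties (parcelid INTEGER, latitude REAL, longitude REAL, storytypeid INTEGER);
    CREATE TABLE predictions (parcelid INTEGER, logerror REAL, transactiondate TEXT);
    CREATE TABLE storytype (storytypeid INTEGER, storydesc TEXT);
    INSERT INTO properties VALUES (1, 33.9, -117.9, 7), (2, NULL, NULL, NULL), (3, 34.0, -118.0, NULL);
    INSERT INTO predictions VALUES (1, 0.01, '2017-09-25'), (2, 0.02, '2017-03-01'), (3, 0.03, '2016-12-30');
    INSERT INTO storytype VALUES (7, 'Basement');
""")

query = """
    SELECT p.*, pr.logerror, pr.transactiondate, s.storydesc
    FROM properties AS p
    JOIN predictions AS pr ON pr.parcelid = p.parcelid          -- properties with a transaction
    LEFT JOIN storytype AS s ON s.storytypeid = p.storytypeid   -- description table, nulls allowed
    WHERE pr.transactiondate LIKE '2017%'                       -- transactions recorded in 2017
      AND p.latitude IS NOT NULL                                -- known location only
      AND p.longitude IS NOT NULL
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Only parcel 1 survives in the toy data: parcel 2 has no coordinates and parcel 3 was sold in 2016.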

In [None]:
# get_sql_zillow code here
df = acquire.get_sql_zillow()

Using the `wrangle_zillow` function from the `acquire` module, perform the initial cleaning of the original dataframe. Specifically:
 - Drop redundant ID columns left over from the left joins (typeconstructionid, storytypeid, etc.)
 - Remove duplicates caused by multiple transactions by keeping the latest transaction date and dropping earlier records
 - Keep only properties with the land use type "Single Family Residential" and drop the rest of the observations
 - Drop properties with unit counts of 2 (27 rows) or 3 (1 row) to remove ambiguity in the definition of "single-unit" or "single-family" houses; these observations are few enough that little is lost.
 - Keep only transactions recorded in 2017 (1 row had 2018 as its transaction year)
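
The cleaning steps above could look roughly like this in pandas. It is a sketch on a toy dataframe, not the code of `wrangle_zillow`: the function name `wrangle_sketch` is hypothetical, and the column names `unitcnt`, `propertylandusedesc`, and `transactiondate` are assumed to match the Zillow columns.

```python
import pandas as pd

def wrangle_sketch(df):
    """Illustrative version of the cleaning steps; not the module's actual code."""
    # Drop redundant ID columns brought in by the left joins (assumed names).
    df = df.drop(columns=["typeconstructionid", "storytypeid"], errors="ignore")
    # Keep only the latest transaction for each parcel.
    df = df.sort_values("transactiondate").drop_duplicates("parcelid", keep="last")
    # Keep Single Family Residential properties only.
    df = df[df["propertylandusedesc"] == "Single Family Residential"]
    # Drop unit counts of 2 and 3; rows with a missing unit count are kept.
    df = df[~df["unitcnt"].isin([2, 3])]
    # Keep transactions recorded in 2017.
    df = df[pd.to_datetime(df["transactiondate"]).dt.year == 2017]
    return df

toy = pd.DataFrame({
    "parcelid": [1, 1, 2, 3, 4, 5],
    "transactiondate": ["2017-01-05", "2017-06-01", "2017-02-10",
                        "2017-03-15", "2018-01-02", "2017-04-20"],
    "propertylandusedesc": ["Single Family Residential"] * 3
                           + ["Condominium", "Single Family Residential",
                              "Single Family Residential"],
    "unitcnt": [1.0, 1.0, 2.0, 1.0, 1.0, None],
    "storytypeid": [None] * 6,
})
clean = wrangle_sketch(toy)
print(clean)
```

Sorting before `drop_duplicates(..., keep="last")` is what guarantees that the surviving record for a parcel is its most recent transaction.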

In [4]:
# wrangle_zillow code here
df = acquire.wrangle_zillow(df)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52292 entries, 77578 to 0
Data columns (total 60 columns):
parcelid                        52292 non-null int64
basementsqft                    47 non-null float64
bathroomcnt                     52292 non-null float64
bedroomcnt                      52292 non-null float64
buildingqualitytypeid           33632 non-null float64
calculatedbathnbr               52158 non-null float64
decktypeid                      388 non-null float64
finishedfloor1squarefeet        4368 non-null float64
calculatedfinishedsquarefeet    52211 non-null float64
finishedsquarefeet12            52047 non-null float64
finishedsquarefeet13            0 non-null float64
finishedsquarefeet15            0 non-null float64
finishedsquarefeet50            4368 non-null float64
finishedsquarefeet6             164 non-null float64
fips                            52292 non-null float64
fireplacecnt                    7230 non-null float64
fullbathcnt                    

### Handling missing values

**Drop Unsalvageable Columns and Rows**

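The `handle_missing_values` function from the `prep` module takes a column threshold (0.90) and a row threshold (0.40). Judging from the `df.info()` output, a column is kept only if at least 90% of its values are non-null: `buildingqualitytypeid`, for example, is about 64% populated and is dropped. This reduces our columns from 60 to 27. The function then applies the 0.40 threshold to drop sparsely populated observations, which removed none.  
  
At this stage, our working data has 52,292 rows and 27 columns.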
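
A minimal sketch of how such a function might work, assuming the two arguments are the minimum share of non-null values a column and a row need in order to be kept. The name `handle_missing_values_sketch` and the toy data are hypothetical; the real `prep.handle_missing_values` may be implemented differently.

```python
import pandas as pd

def handle_missing_values_sketch(df, prop_required_column, prop_required_row):
    """Keep columns, then rows, that meet a minimum share of non-null values."""
    # Minimum count of non-null values a column needs to survive.
    col_thresh = int(round(prop_required_column * len(df.index)))
    df = df.dropna(axis=1, thresh=col_thresh)
    # Minimum count of non-null values a row needs to survive.
    row_thresh = int(round(prop_required_row * len(df.columns)))
    df = df.dropna(axis=0, thresh=row_thresh)
    return df

toy = pd.DataFrame({
    "full": range(10),
    "mostly_full": [1.0] * 9 + [None],
    "sparse": [1.0] * 6 + [None] * 4,
})
kept = handle_missing_values_sketch(toy, 0.90, 0.40)
print(kept.columns.tolist(), len(kept))
```

In the toy data the `sparse` column is only 60% populated, so it is dropped, while no row falls below the row threshold.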

In [6]:
df = prep.handle_missing_values(df,.90,.40)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52292 entries, 77578 to 0
Data columns (total 27 columns):
parcelid                        52292 non-null int64
bathroomcnt                     52292 non-null float64
bedroomcnt                      52292 non-null float64
calculatedbathnbr               52158 non-null float64
calculatedfinishedsquarefeet    52211 non-null float64
finishedsquarefeet12            52047 non-null float64
fips                            52292 non-null float64
fullbathcnt                     52158 non-null float64
latitude                        52292 non-null float64
longitude                       52292 non-null float64
lotsizesquarefeet               51930 non-null float64
propertycountylandusecode       52292 non-null object
rawcensustractandblock          52292 non-null float64
regionidcity                    51256 non-null float64
regionidcounty                  52292 non-null float64
regionidzip                     52266 non-null float64
roomcnt       

**Drop Non-value-adding Columns**

`parcelid` is a property identifier that adds nothing to the analysis, so it is dropped.

`bathroomcnt` reflects the same information as `calculatedbathnbr`: the number of bathrooms in a property, including half bathrooms (.5's). The `fullbathcnt` column counts only full bathrooms and leaves out half baths. So we drop the redundant `calculatedbathnbr` and `fullbathcnt` and keep `bathroomcnt` to describe the properties' bathrooms.

`calculatedfinishedsquarefeet` and `finishedsquarefeet12` hold practically the same information. `calculatedfinishedsquarefeet` has fewer nulls (81 rows) than `finishedsquarefeet12` (245 rows), so we drop `finishedsquarefeet12`.

`propertycountylandusecode` holds the codes used in the industry to specify land use. For example, a Single Family Residential property may be coded as Single Family Class II (0102), Vacant Residential (0000), Vacant Lake View (0035), etc. We don't need this additional detail. All we need to know is that the properties we are looking at fall under the umbrella of "Single Family Residential."

`rawcensustractandblock` and `censustractandblock` contain census information that we are not concerned about in this project.

`roomcnt` has 37,588 properties with 0 rooms, so we are dropping this column.

`fips` and `regionidcounty` contain the same information. We keep the `fips` column.

`regionidzip` and `regionidcity` are location-based columns which do not give much added information on the properties' location given that we have `latitude` and `longitude`. These columns are dropped.

All wrangled observations have "2016" as the value of `assessmentyear`, so the column adds no value to the analysis and is dropped.

`propertylandusedesc` is dropped because it only displays "Single Family Residential" which was helpful in the filtering phase but not useful in the analysis.
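
Taken together, the notes above amount to dropping a fixed list of columns. The sketch below shows one way to do that; the name `clean_columns_sketch` is hypothetical and the real `prep.clean_columns` may differ.

```python
import pandas as pd

# Columns named in the notes above as redundant or uninformative.
COLS_TO_DROP = [
    "parcelid", "calculatedbathnbr", "fullbathcnt", "finishedsquarefeet12",
    "propertycountylandusecode", "rawcensustractandblock", "censustractandblock",
    "roomcnt", "regionidcounty", "regionidzip", "regionidcity",
    "assessmentyear", "propertylandusedesc",
]

def clean_columns_sketch(df):
    # errors="ignore" lets the sketch run even if a listed column is absent.
    return df.drop(columns=COLS_TO_DROP, errors="ignore")

toy = pd.DataFrame({"parcelid": [1], "bathroomcnt": [2.0],
                    "roomcnt": [0.0], "fips": [6037.0]})
slim = clean_columns_sketch(toy)
print(slim.columns.tolist())
```

Dropping these 13 columns from the 27 that remain leaves 14.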

In [8]:
df = prep.clean_columns(df)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52292 entries, 77578 to 0
Data columns (total 14 columns):
bathroomcnt                     52292 non-null float64
bedroomcnt                      52292 non-null float64
calculatedfinishedsquarefeet    52211 non-null float64
fips                            52292 non-null float64
latitude                        52292 non-null float64
longitude                       52292 non-null float64
lotsizesquarefeet               51930 non-null float64
yearbuilt                       52178 non-null float64
structuretaxvaluedollarcnt      52213 non-null float64
taxvaluedollarcnt               52291 non-null float64
landtaxvaluedollarcnt           52291 non-null float64
taxamount                       52288 non-null float64
logerror                        52292 non-null float64
transactiondate                 52292 non-null object
dtypes: float64(13), object(1)
memory usage: 6.0+ MB


**Handle Nulls for Relevant Columns**

In [10]:
# show null values
df.isnull().sum()

bathroomcnt                       0
bedroomcnt                        0
calculatedfinishedsquarefeet     81
fips                              0
latitude                          0
longitude                         0
lotsizesquarefeet               362
yearbuilt                       114
structuretaxvaluedollarcnt       79
taxvaluedollarcnt                 1
landtaxvaluedollarcnt             1
taxamount                         4
logerror                          0
transactiondate                   0
dtype: int64

_Calculated Finished Square Feet, Structure Tax Value Dollar Count_ - Observations with a missing value in only `calculatedfinishedsquarefeet` (71 rows), only `structuretaxvaluedollarcnt` (69 rows), or both (10 rows) are dropped, removing a total of 150 observations.

_Tax Value Dollar Count, Land Tax Value Dollar Count, Tax Amount_ - A handful of observations have missing values in `taxvaluedollarcnt` (1 row), `landtaxvaluedollarcnt` (1 row), and `taxamount` (4 rows). These rows are dropped as well.

_Year Built_ - After all of the removals above, `yearbuilt` has 38 missing values left; these rows are dropped because there are so few of them.
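
Dropping these rows can be done with a single `dropna` over the affected columns. This is a sketch on toy data; the name `drop_minimal_nulls_sketch` is hypothetical and the real `prep.drop_minimal_nulls` may differ. Note that `lotsizesquarefeet` is deliberately left out of the subset, since its nulls are imputed rather than dropped.

```python
import pandas as pd

def drop_minimal_nulls_sketch(df):
    """Drop rows with a null in any of the lightly affected columns."""
    cols = [
        "calculatedfinishedsquarefeet", "structuretaxvaluedollarcnt",
        "taxvaluedollarcnt", "landtaxvaluedollarcnt", "taxamount", "yearbuilt",
    ]
    return df.dropna(subset=cols)

toy = pd.DataFrame({
    "calculatedfinishedsquarefeet": [1500.0, None, 1200.0, 1800.0],
    "structuretaxvaluedollarcnt": [100000.0, 90000.0, 80000.0, 120000.0],
    "taxvaluedollarcnt": [300000.0, 250000.0, 200000.0, 350000.0],
    "landtaxvaluedollarcnt": [200000.0, 160000.0, 120000.0, 230000.0],
    "taxamount": [3500.0, 3000.0, 2500.0, 4000.0],
    "yearbuilt": [1955.0, 1960.0, None, 1970.0],
    "lotsizesquarefeet": [6000.0, 5000.0, 5500.0, None],
})
trimmed = drop_minimal_nulls_sketch(toy)
print(trimmed.index.tolist())
```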

In [11]:
df = prep.drop_minimal_nulls(df)

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52101 entries, 77578 to 0
Data columns (total 14 columns):
bathroomcnt                     52101 non-null float64
bedroomcnt                      52101 non-null float64
calculatedfinishedsquarefeet    52101 non-null float64
fips                            52101 non-null float64
latitude                        52101 non-null float64
longitude                       52101 non-null float64
lotsizesquarefeet               51761 non-null float64
yearbuilt                       52101 non-null float64
structuretaxvaluedollarcnt      52101 non-null float64
taxvaluedollarcnt               52101 non-null float64
landtaxvaluedollarcnt           52101 non-null float64
taxamount                       52101 non-null float64
logerror                        52101 non-null float64
transactiondate                 52101 non-null object
dtypes: float64(13), object(1)
memory usage: 6.0+ MB


_Lot Size Sqft_ - Imputing `lotsizesquarefeet` is the least straightforward part of handling the missing values. The lot size is the sum of the area of the land without structure (land or dirt) and the finished space (structure or house). The column has 340 missing values.

Below is the process for filling the missing values in `lotsizesquarefeet`:

1. Derive the total tax dollar value of the lot by adding the tax value of the land and the tax value of the structure

  $ value_{lot} = value_{land} + value_{structure} $
  
  
2. Get the ratio of the lot area to the lot tax value

  $ proportion = \frac {area_{lot}}{value_{lot}} $
  

3. Some proportions are unrealistic; they seem to come from properties that have a big lot area but a low tax value. At the 75th percentile, we see a proportion of 0.041 and a mean of 0.048. Because of this, we are comfortable taking the mean of all proportions less than 1 (proportion < 1). This mean gives us a general estimate of the lot square footage, i.e., $area_{lot}$, given the total lot value, i.e., $value_{lot}$.

  
4. Impute the nulls in `lotsizesquarefeet`, i.e., $area_{lot}$, using the derived formula:

  $ area_{lot} = value_{lot} * \mu_{proportion} $
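
The four steps above can be sketched in pandas as follows. This is an illustration on toy numbers, not the code of `prep.impute_lotsize_nulls`; the function name `impute_lotsize_sketch` is hypothetical.

```python
import pandas as pd

def impute_lotsize_sketch(df):
    df = df.copy()
    # Step 1: total tax value of the lot = land value + structure value.
    value_lot = df["landtaxvaluedollarcnt"] + df["structuretaxvaluedollarcnt"]
    # Step 2: area-to-value ratio (NaN where the lot area is missing).
    proportion = df["lotsizesquarefeet"] / value_lot
    # Step 3: mean of the ratios below 1, which excludes the unrealistic ones.
    mean_proportion = proportion[proportion < 1].mean()
    # Step 4: estimate the missing areas from the lot value.
    df["lotsizesquarefeet"] = df["lotsizesquarefeet"].fillna(value_lot * mean_proportion)
    return df

toy = pd.DataFrame({
    "landtaxvaluedollarcnt": [100000.0, 150000.0, 1000.0, 200000.0],
    "structuretaxvaluedollarcnt": [100000.0, 50000.0, 1000.0, 100000.0],
    "lotsizesquarefeet": [8000.0, 12000.0, 50000.0, None],
})
filled = impute_lotsize_sketch(toy)
print(filled["lotsizesquarefeet"].tolist())
```

In the toy data the third row has a ratio of 25 and is excluded from the mean, so the missing area is estimated from the two realistic ratios (0.04 and 0.06) as roughly 300,000 × 0.05 = 15,000 square feet.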

 

In [13]:
df = prep.impute_lotsize_nulls(df)

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52101 entries, 77578 to 0
Data columns (total 14 columns):
bathroomcnt                     52101 non-null float64
bedroomcnt                      52101 non-null float64
calculatedfinishedsquarefeet    52101 non-null float64
fips                            52101 non-null float64
latitude                        52101 non-null float64
longitude                       52101 non-null float64
lotsizesquarefeet               52101 non-null float64
yearbuilt                       52101 non-null float64
structuretaxvaluedollarcnt      52101 non-null float64
taxvaluedollarcnt               52101 non-null float64
landtaxvaluedollarcnt           52101 non-null float64
taxamount                       52101 non-null float64
logerror                        52101 non-null float64
transactiondate                 52101 non-null object
dtypes: float64(13), object(1)
memory usage: 8.5+ MB


## A clean dataframe!!!

Create a land area column by subtracting the calculated house size from the lot size.
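
A sketch of this step, assuming the land area is the lot size minus the finished house area (the first rows of the final table are consistent with this). The function name `feature_eng_sketch` and the column name `land_area` at this stage are hypothetical; the real `prep.feature_eng` may differ.

```python
import pandas as pd

def feature_eng_sketch(df):
    df = df.copy()
    # Land area = whole lot area minus the finished house area.
    df["land_area"] = df["lotsizesquarefeet"] - df["calculatedfinishedsquarefeet"]
    return df

# Lot and house sizes matching the first two rows of the final table.
toy = pd.DataFrame({
    "lotsizesquarefeet": [6347.0, 5074.0],
    "calculatedfinishedsquarefeet": [1762.0, 1032.0],
})
out = feature_eng_sketch(toy)
print(out["land_area"].tolist())
```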

In [15]:
df = prep.feature_eng(df)

Renamed and reordered columns

In [19]:
df = prep.pretty_cols(df)

In [20]:
df.head()

| | countyid | latitude | longitude | yearbuilt | bathroomcnt | bedroomcnt | house_area | house_value | land_area | land_value | whole_area | whole_value | taxamount | logerror | transactiondate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 77578 | 6037.0 | 33937685.0 | -117996709.0 | 1955.0 | 2.0 | 3.0 | 1762.0 | 140000.0 | 4585.0 | 382000.0 | 6347.0 | 522000.0 | 6317.15 | 0.007204 | 2017-09-25 |
| 77577 | 6037.0 | 34040895.0 | -118038169.0 | 1954.0 | 1.0 | 3.0 | 1032.0 | 32797.0 | 4042.0 | 16749.0 | 5074.0 | 49546.0 | 876.43 | 0.037129 | 2017-09-21 |
| 77576 | 6111.0 | 34300140.0 | -118706327.0 | 1964.0 | 2.0 | 4.0 | 1612.0 | 50683.0 | 10493.0 | 16522.0 | 12105.0 | 67205.0 | 1107.48 | 0.013209 | 2017-09-21 |
| 77575 | 6037.0 | 34245368.0 | -118282383.0 | 1940.0 | 2.0 | 2.0 | 1286.0 | 70917.0 | 46119.0 | 283704.0 | 47405.0 | 354621.0 | 4478.43 | 0.020615 | 2017-09-20 |
| 77394 | 6037.0 | 33983643.0 | -118362294.0 | 1948.0 | 2.0 | 3.0 | 1518.0 | 116897.0 | 4281.0 | 112345.0 | 5799.0 | 229242.0 | 3277.29 | 0.023168 | 2017-09-19 |
