# Data Wrangling Exercises

## Acquire (acquire.py)

### Zillow

For the following, iterate through the steps you would take to create functions: write the code in a Jupyter notebook, test it, convert it to functions, then create the file to house those functions.

You will have a zillow.ipynb file and a helper file for each section in the pipeline.

**acquire & summarize**

1. Acquire data from MySQL using the Python module to connect and query. You will want to end with a single dataframe. Make sure to include the logerror and all fields related to the properties that are available. You will end up using all the tables in the database.

    - Be sure to do the correct join (inner, outer, etc.). We do not want to eliminate properties purely because they may have a null value for airconditioningtypeid.
    - Only include properties with a transaction in 2017, and include only the last transaction for each property (so no duplicate property IDs), along with zestimate error and date of transaction.
    - Only include properties that include a latitude and longitude value.
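
The join and filter rules above can be sketched in pandas on toy data. This is only an illustration of the logic, not the actual `acquire.get_zillow_cluster_data` implementation (which queries the database). The frame names `properties`, `predictions`, and `ac_types`, the column name `transactiondate`, and every value are assumptions made up for the example.

```python
import pandas as pd

# toy stand-ins for the database tables (all names and values are made up)
properties = pd.DataFrame({
    "parcelid": [1, 2, 3, 4],
    "latitude": [34.1, 34.2, None, 34.4],
    "longitude": [-118.1, -118.2, -118.3, -118.4],
    "airconditioningtypeid": [1.0, None, 1.0, None],
})
predictions = pd.DataFrame({
    "parcelid": [1, 1, 2, 3, 4],
    "logerror": [0.01, 0.02, -0.03, 0.04, 0.05],
    "transactiondate": pd.to_datetime(
        ["2017-01-05", "2017-06-01", "2017-03-10", "2017-02-02", "2016-12-30"]
    ),
})
ac_types = pd.DataFrame({
    "airconditioningtypeid": [1.0],
    "airconditioningdesc": ["Central"],
})

# keep only 2017 transactions, then the last transaction per property
in_2017 = predictions[predictions.transactiondate.dt.year == 2017]
last_txn = (
    in_2017.sort_values("transactiondate")
    .drop_duplicates(subset="parcelid", keep="last")
)

# inner join: a property must have a qualifying transaction
df = properties.merge(last_txn, on="parcelid", how="inner")
# left join: keep properties even when airconditioningtypeid is null
df = df.merge(ac_types, on="airconditioningtypeid", how="left")
# require both latitude and longitude
df = df.dropna(subset=["latitude", "longitude"]).reset_index(drop=True)
```

Property 2 survives with a null `airconditioningdesc` because the description table is attached with a left join, while property 3 (no latitude) and property 4 (no 2017 transaction) are removed.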



In [1]:
import pandas as pd
import acquire

In [2]:
df = acquire.get_zillow_cluster_data()

**acquire & summarize**
2. Summarize your data (summary stats, info, dtypes, shape, distributions, value_counts, etc.)

In [3]:
# shape before adding the filter on property use type: (77381, 70)
df.shape

(71789, 70)

In [4]:
df.head()

Unnamed: 0,parcelid,typeconstructiontypeid,storytypeid,propertylandusetypeid,heatingorsystemtypeid,buildingclasstypeid,architecturalstyletypeid,airconditioningtypeid,id,basementsqft,...,logerror,pid,tdate,airconditioningdesc,architecturalstyledesc,buildingclassdesc,heatingorsystemdesc,propertylandusedesc,storydesc,typeconstructiondesc
0,14297519,,,261.0,,,,,1727539,,...,0.025595,14297519,2017-01-01,,,,,Single Family Residential,,
1,17052889,,,261.0,,,,,1387261,,...,0.055619,17052889,2017-01-01,,,,,Single Family Residential,,
2,14186244,,,261.0,,,,,11677,,...,0.005383,14186244,2017-01-01,,,,,Single Family Residential,,
3,12177905,,,261.0,2.0,,,,2288172,,...,-0.10341,12177905,2017-01-01,,,,Central,Single Family Residential,,
4,10887214,,,266.0,2.0,,,1.0,1970746,,...,0.00694,10887214,2017-01-01,Central,,,Central,Condominium,,


In [5]:
df.dtypes

parcelid                    int64
typeconstructiontypeid    float64
storytypeid               float64
propertylandusetypeid     float64
heatingorsystemtypeid     float64
                           ...   
buildingclassdesc          object
heatingorsystemdesc        object
propertylandusedesc        object
storydesc                  object
typeconstructiondesc       object
Length: 70, dtype: object

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71789 entries, 0 to 71788
Data columns (total 70 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   parcelid                      71789 non-null  int64  
 1   typeconstructiontypeid        222 non-null    float64
 2   storytypeid                   47 non-null     float64
 3   propertylandusetypeid         71789 non-null  float64
 4   heatingorsystemtypeid         46571 non-null  float64
 5   buildingclasstypeid           0 non-null      object 
 6   architecturalstyletypeid      206 non-null    float64
 7   airconditioningtypeid         23027 non-null  float64
 8   id                            71789 non-null  int64  
 9   basementsqft                  47 non-null     float64
 10  bathroomcnt                   71789 non-null  float64
 11  bedroomcnt                    71789 non-null  float64
 12  buildingqualitytypeid         44961 non-null  float64
 13  c

In [7]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
parcelid,71789.0,13044090.0,3394466.0,10711860.0,11537910.0,12575480.0,14255330.0,167688500.0
typeconstructiontypeid,222.0,6.040541,0.5572847,4.0,6.0,6.0,6.0,13.0
storytypeid,47.0,7.0,0.0,7.0,7.0,7.0,7.0,7.0
propertylandusetypeid,71789.0,262.3569,2.245354,260.0,261.0,261.0,266.0,275.0
heatingorsystemtypeid,46571.0,3.950806,3.654866,1.0,2.0,2.0,7.0,24.0
architecturalstyletypeid,206.0,7.38835,2.734542,2.0,7.0,7.0,7.0,21.0
airconditioningtypeid,23027.0,1.868632,3.064534,1.0,1.0,1.0,1.0,13.0
id,71789.0,1495247.0,860466.7,349.0,754010.0,1499007.0,2239533.0,2982274.0
basementsqft,47.0,678.9787,711.8252,38.0,263.5,512.0,809.5,3560.0
bathroomcnt,71789.0,2.260876,0.955185,0.0,2.0,2.0,3.0,18.0


**acquire & summarize**   
3. Write a function that takes in a dataframe of observations and attributes and returns a dataframe where each row is an attribute name, the first column is the number of rows with missing values for that attribute, and the second column is percent of total rows that have missing values for that attribute. Run the function and document takeaways from this on how you want to handle missing values.

| | num_rows_missing | pct_rows_missing |
|---|---|---|
| parcelid | 0 | 0.000000 |
| airconditioningtypeid | 29041 | 0.535486 |
| architecturalstyletypeid | 54232 | 0.999982 |


In [8]:
num_rows_missing = df.isna().sum()
num_rows_missing

parcelid                      0
typeconstructiontypeid    71567
storytypeid               71742
propertylandusetypeid         0
heatingorsystemtypeid     25218
                          ...  
buildingclassdesc         71789
heatingorsystemdesc       25218
propertylandusedesc           0
storydesc                 71742
typeconstructiondesc      71567
Length: 70, dtype: int64

In [9]:
dfmissing = pd.DataFrame(num_rows_missing, columns=['num_rows_missing'])

dfmissing.head()

Unnamed: 0,num_rows_missing
parcelid,0
typeconstructiontypeid,71567
storytypeid,71742
propertylandusetypeid,0
heatingorsystemtypeid,25218


In [10]:
dfmissing['pct_rows_missing'] = dfmissing.num_rows_missing/df.shape[0]
dfmissing.head()

Unnamed: 0,num_rows_missing,pct_rows_missing
parcelid,0,0.0
typeconstructiontypeid,71567,0.996908
storytypeid,71742,0.999345
propertylandusetypeid,0,0.0
heatingorsystemtypeid,25218,0.351279


In [11]:
def get_missing_rows(df):
    '''
    Takes in a dataframe of observations and attributes and returns a dataframe
    where each row is an attribute name, the first column is the number of rows with missing values
    for that attribute, and the second column is the proportion (0 to 1) of total rows that have
    missing values for that attribute.
    '''
    # find the number of rows in each column that are missing values
    num_rows_missing = df.isna().sum()
    # create new df with just that column
    dfrows = pd.DataFrame(num_rows_missing, columns=['num_rows_missing'])
    # add a calculation of % missing to the new df
    dfrows['pct_rows_missing'] = dfrows.num_rows_missing/df.shape[0]
    # return the new df
    return dfrows

In [12]:
zrows = get_missing_rows(df)
zrows.head(35)

Unnamed: 0,num_rows_missing,pct_rows_missing
parcelid,0,0.0
typeconstructiontypeid,71567,0.996908
storytypeid,71742,0.999345
propertylandusetypeid,0,0.0
heatingorsystemtypeid,25218,0.351279
buildingclasstypeid,71789,1.0
architecturalstyletypeid,71583,0.99713
airconditioningtypeid,48762,0.679241
id,0,0.0
basementsqft,71742,0.999345


In [13]:
zrows.tail(35)

Unnamed: 0,num_rows_missing,pct_rows_missing
pooltypeid7,57158,0.796194
propertycountylandusecode,0,0.0
propertyzoningdesc,26387,0.367563
rawcensustractandblock,0,0.0
regionidcity,1335,0.018596
regionidcounty,0,0.0
regionidneighborhood,43599,0.607321
regionidzip,45,0.000627
roomcnt,0,0.0
threequarterbathnbr,61800,0.860856


****
**takeaways**
1. Could fireplace, garage, pool, hot tub, and deck be converted to 0 or 1 and then summed into a "plus_item" column?
    - This would assume that a null value means the property does not have the feature, as opposed to the feature being present but not recorded.
2. Drop features with 70% or more missing values to start.
****
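
Both takeaways can be tried out on a small frame. The sketch below is a rough illustration only: the feature column names (`fireplacecnt`, `garagecarcnt`, `poolcnt`, `hashottuborspa`, `decktypeid`) are assumed, since they do not appear in the truncated output above, and the values are made up.

```python
import numpy as np
import pandas as pd

# toy data; the feature column names are assumptions, not confirmed by the output above
toy = pd.DataFrame({
    "fireplacecnt": [1.0, np.nan, 2.0, np.nan],
    "garagecarcnt": [2.0, np.nan, np.nan, np.nan],
    "poolcnt": [np.nan, 1.0, 1.0, np.nan],
    "hashottuborspa": [np.nan, np.nan, 1.0, np.nan],
    "decktypeid": [np.nan, np.nan, np.nan, np.nan],
    "bedroomcnt": [3.0, 2.0, 4.0, 3.0],
})

feature_cols = ["fireplacecnt", "garagecarcnt", "poolcnt", "hashottuborspa", "decktypeid"]

# takeaway 1: treat null (or 0) as "feature absent", anything positive as "present", then sum
toy["plus_item"] = (toy[feature_cols].fillna(0) > 0).astype(int).sum(axis=1)

# takeaway 2: drop columns where 70% or more of the values are missing
pct_missing = toy.isna().mean()
trimmed = toy.drop(columns=pct_missing[pct_missing >= 0.70].index)
```

Building `plus_item` before dropping sparse columns keeps the information from features such as the deck or hot tub even though their own columns are removed.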

4. Write a function that takes in a dataframe and returns a dataframe with 3 columns: the number of columns missing, percent of columns missing, and number of rows with n columns missing. Run the function and document takeaways from this on how you want to handle missing values.

| num_cols_missing | pct_cols_missing | num_rows |
|---|---|---|
| 23 | 38.333 | 108 |
| 24 | 40.000 | 123 |
| 25 | 41.667 | 5280 |

In [14]:
# rephrase of question:
# for each observation how many columns are missing values?
# what is the % of columns with missing values for that row?
# group the rows with the same answers to those 2 questions together

In [15]:
df.head()

Unnamed: 0,parcelid,typeconstructiontypeid,storytypeid,propertylandusetypeid,heatingorsystemtypeid,buildingclasstypeid,architecturalstyletypeid,airconditioningtypeid,id,basementsqft,...,logerror,pid,tdate,airconditioningdesc,architecturalstyledesc,buildingclassdesc,heatingorsystemdesc,propertylandusedesc,storydesc,typeconstructiondesc
0,14297519,,,261.0,,,,,1727539,,...,0.025595,14297519,2017-01-01,,,,,Single Family Residential,,
1,17052889,,,261.0,,,,,1387261,,...,0.055619,17052889,2017-01-01,,,,,Single Family Residential,,
2,14186244,,,261.0,,,,,11677,,...,0.005383,14186244,2017-01-01,,,,,Single Family Residential,,
3,12177905,,,261.0,2.0,,,,2288172,,...,-0.10341,12177905,2017-01-01,,,,Central,Single Family Residential,,
4,10887214,,,266.0,2.0,,,1.0,1970746,,...,0.00694,10887214,2017-01-01,Central,,,Central,Condominium,,


In [16]:
# this will add a column that has a total number of columns that are blank for that row
df['null_count'] = df.isna().sum(axis=1)
null_count = df.null_count
df.head()

Unnamed: 0,parcelid,typeconstructiontypeid,storytypeid,propertylandusetypeid,heatingorsystemtypeid,buildingclasstypeid,architecturalstyletypeid,airconditioningtypeid,id,basementsqft,...,pid,tdate,airconditioningdesc,architecturalstyledesc,buildingclassdesc,heatingorsystemdesc,propertylandusedesc,storydesc,typeconstructiondesc,null_count
0,14297519,,,261.0,,,,,1727539,,...,14297519,2017-01-01,,,,,Single Family Residential,,,36
1,17052889,,,261.0,,,,,1387261,,...,17052889,2017-01-01,,,,,Single Family Residential,,,33
2,14186244,,,261.0,,,,,11677,,...,14186244,2017-01-01,,,,,Single Family Residential,,,34
3,12177905,,,261.0,2.0,,,,2288172,,...,12177905,2017-01-01,,,,Central,Single Family Residential,,,32
4,10887214,,,266.0,2.0,,,1.0,1970746,,...,10887214,2017-01-01,Central,,,Central,Condominium,,,29


In [17]:
# this calculates the proportion of null columns for that row
# note: df.shape[1] is now 71 because it also counts the null_count column added above
df['pct_null'] = df.null_count/df.shape[1]
pct_null = df.pct_null
df.head()

Unnamed: 0,parcelid,typeconstructiontypeid,storytypeid,propertylandusetypeid,heatingorsystemtypeid,buildingclasstypeid,architecturalstyletypeid,airconditioningtypeid,id,basementsqft,...,tdate,airconditioningdesc,architecturalstyledesc,buildingclassdesc,heatingorsystemdesc,propertylandusedesc,storydesc,typeconstructiondesc,null_count,pct_null
0,14297519,,,261.0,,,,,1727539,,...,2017-01-01,,,,,Single Family Residential,,,36,0.507042
1,17052889,,,261.0,,,,,1387261,,...,2017-01-01,,,,,Single Family Residential,,,33,0.464789
2,14186244,,,261.0,,,,,11677,,...,2017-01-01,,,,,Single Family Residential,,,34,0.478873
3,12177905,,,261.0,2.0,,,,2288172,,...,2017-01-01,,,,Central,Single Family Residential,,,32,0.450704
4,10887214,,,266.0,2.0,,,1.0,1970746,,...,2017-01-01,Central,,,Central,Condominium,,,29,0.408451


In [18]:
# this gets a dataframe with just the 2 new columns
dfcol = pd.DataFrame(null_count, columns=['null_count'])
dfcol['pct_null'] = pct_null
dfcol.head()

Unnamed: 0,null_count,pct_null
0,36,0.507042
1,33,0.464789
2,34,0.478873
3,32,0.450704
4,29,0.408451


In [19]:
# this shows how many groups of rows have the same number of null columns
dfcol.nunique()

null_count    26
pct_null      26
dtype: int64

In [20]:
# create a series that has the number of rows in each group
num_rows_ingroup = dfcol.null_count.value_counts()
# create a dataframe with the count of null_count and pct_null
groups = dfcol.groupby(['null_count', 'pct_null']).count()

In [21]:
# create a df from the num_rows_ingroup, rename the columns, sort, and reset the index 
dfnum_rows = pd.DataFrame(num_rows_ingroup)
dfnum_rows = dfnum_rows.reset_index()
dfnum_rows = dfnum_rows.rename(columns={'index': 'num_null_col', 'null_count': 'num_rows_with_count'})
dfnum_rows = dfnum_rows.sort_values('num_null_col')
dfnum_rows = dfnum_rows.reset_index()

In [22]:
#visual check
dfnum_rows

Unnamed: 0,index,num_null_col,num_rows_with_count
0,25,23,2
1,21,24,13
2,18,25,24
3,15,26,65
4,11,27,312
5,10,28,451
6,4,29,5147
7,8,30,3233
8,3,31,9170
9,2,32,11680


In [23]:
# reset the index on the groups df so that we can add the num_rows_with_count
groups = groups.reset_index()

In [24]:
# visual check
groups

Unnamed: 0,null_count,pct_null
0,23,0.323944
1,24,0.338028
2,25,0.352113
3,26,0.366197
4,27,0.380282
5,28,0.394366
6,29,0.408451
7,30,0.422535
8,31,0.43662
9,32,0.450704


In [25]:
# combine num_rows_with_count from dfnum_rows with groups
groups['rows_with_count'] = dfnum_rows.num_rows_with_count

In [26]:
# visual check
groups

Unnamed: 0,null_count,pct_null,rows_with_count
0,23,0.323944,2
1,24,0.338028,13
2,25,0.352113,24
3,26,0.366197,65
4,27,0.380282,312
5,28,0.394366,451
6,29,0.408451,5147
7,30,0.422535,3233
8,31,0.43662,9170
9,32,0.450704,11680


In [27]:
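# create a function that does the above and returns the groups df
def get_missing_cols(df):
    # add calculation columns to original df (note: this modifies the df passed in)
    df['null_count'] = df.isna().sum(axis=1)
    # df.shape[1] counts the helper columns too, so the denominator is larger than the original column count
    df['pct_null'] = df.null_count/df.shape[1]
    
    # create a dataframe with just the 2 new columns (use the df's own column, not a notebook global)
    dfcol = pd.DataFrame(df.null_count, columns=['null_count'])
    dfcol['pct_null'] = df.pct_null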
    
    # create a series that has the number of rows in each group
    num_rows_ingroup = dfcol.null_count.value_counts()
    
    # create a dataframe with the count of null_count and pct_null
    groups = dfcol.groupby(['null_count', 'pct_null']).count()
    
    # create a df from the num_rows_ingroup, rename the columns, sort, and reset the index 
    dfnum_rows = pd.DataFrame(num_rows_ingroup)
    dfnum_rows = dfnum_rows.reset_index()
    dfnum_rows = dfnum_rows.rename(columns={'index': 'num_null_col', 'null_count': 'num_rows_with_count'})
    dfnum_rows = dfnum_rows.sort_values('num_null_col')
    dfnum_rows = dfnum_rows.reset_index()
    
    # reset the index on the groups df so that we can add the num_rows_with_count
    groups = groups.reset_index()
    
    # combine num_rows_with_count from dfnum_rows with groups
    groups['rows_with_count'] = dfnum_rows.num_rows_with_count
    return groups


In [28]:
zcols = get_missing_cols(df)
zcols

Unnamed: 0,null_count,pct_null,rows_with_count
0,23,0.319444,2
1,24,0.333333,13
2,25,0.347222,24
3,26,0.361111,65
4,27,0.375,312
5,28,0.388889,451
6,29,0.402778,5147
7,30,0.416667,3233
8,31,0.430556,9170
9,32,0.444444,11680


****
**takeaways**
- most rows have 32-34 columns with missing values
****
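
The row-level summary from exercise 4 can also be computed without adding helper columns to the input, which keeps the denominator equal to the original column count. The function below is a hypothetical alternative, not the `get_missing_cols` used above; the name `summarize_missing_cols` is made up. It names the output columns explicitly because the column names produced by `value_counts` followed by `reset_index` differ between pandas versions.

```python
import pandas as pd

def summarize_missing_cols(df):
    """Hypothetical alternative to get_missing_cols that leaves df unchanged."""
    # number of missing columns in each row
    num_cols_missing = df.isna().sum(axis=1)
    # count how many rows share each missing-column count, sorted by that count
    out = (
        num_cols_missing.value_counts()
        .sort_index()
        .rename_axis("num_cols_missing")
        .reset_index(name="num_rows")
    )
    # proportion (0 to 1) of columns missing, using the input's own column count
    out["pct_cols_missing"] = out.num_cols_missing / df.shape[1]
    return out[["num_cols_missing", "pct_cols_missing", "num_rows"]]

# toy example (values are made up)
toy = pd.DataFrame({
    "a": [1, None, None, 4],
    "b": [None, None, 3, 4],
    "c": [1, 2, 3, 4],
})
summary = summarize_missing_cols(toy)
```

Multiply `pct_cols_missing` by 100 if a percentage, as in the example table in the prompt, is preferred over a proportion.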

## Prepare

1. Remove any properties that are likely to be something other than single unit properties.    
(e.g. no duplexes, no land/lot, ...). There are multiple ways to estimate that a property is a single unit, and there is not a single "right" answer. But for this exercise, do not purely filter by unitcnt as we did previously. Add some new logic that will reduce the number of properties that are falsely removed. You might want to use # bedrooms, square feet, unit type or the like to then identify those with unitcnt not defined.


**This is the definition used in a previous project**

The definition of a single unit property is taken from the article "What Is a Housing Unit?" by James Chen (updated Sep 11, 2020): "The term housing unit refers to a single unit within a larger structure that can be used by an individual or household to eat, sleep, and live. The unit can be in any type of residence such as a house, apartment, mobile home, or may also be a single unit in a group of rooms. Essentially, a housing unit is deemed to be a separate living quarter where the occupants live and eat separately from other residents of the structure or building. They also have direct access from the building's exterior or through a common hallway."
https://www.investopedia.com/terms/h/housingunits.asp

**In my opinion, the definition should include condos, townhouses, and any unit that can be sold to an individual owner. So my definition will include townhouse, condo, etc., but not commercial, business, land only, etc.**

This site lists the property use codes for LA County: https://www.titleadvantage.com/mdocs/LA%20County%20Use%20Codes%20nm.pdf

Looking at the common use codes for Duplex, Triplex, and Quadplex, these codes indicate the units are multi-family/income properties or retail/store properties, so these will be excluded.

Identify properties in the database: based on the above definition, some categories do not fit the brief ("No" marks an excluded category).

| Keep | propertylandusetypeid | propertylandusedesc | Reason excluded |
|---|---|---|---|
| No | 31 | Commercial/Office/Residential Mixed Used | not a residence |
| No | 46 | Multi-Story Store | not a residence |
| No | 47 | Store/Office (Mixed Use) | not a residence |
| No | 246 | Duplex (2 Units, Any Combination) | |
| No | 247 | Triplex (3 Units, Any Combination) | |
| No | 248 | Quadruplex (4 Units, Any Combination) | |
| Yes | 260 | Residential General | |
| Yes | 261 | Single Family Residential | |
| Yes | 262 | Rural Residence | |
| Yes | 263 | Mobile Home | |
| Yes | 264 | Townhouse | |
| No | 265 | Cluster Home | |
| Yes | 266 | Condominium | |
| No | 267 | Cooperative | become shareholder not owner |
| Yes | 268 | Row House | |
| No | 269 | Planned Unit Development | |
| No | 270 | Residential Common Area | property feature |
| No | 271 | Timeshare | become shareholder not owner |
| Yes | 273 | Bungalow | |
| Yes | 274 | Zero Lot Line | |
| Yes | 275 | Manufactured, Modular, Prefabricated Homes | |
| Yes | 276 | Patio Home | |
| Yes | 279 | Inferred Single Family Residential | |
| No | 290 | Vacant Land - General | not a residence |
| No | 291 | Residential Vacant Land | not a residence |

So we will keep only those where propertylandusetypeid is in ('260', '261', '262', '263', '264', '266', '268', '273', '274', '275', '276', '279').

**acquire function updated to filter only for these**
new shape = (71789, 70)
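
The notebook applies this filter inside the acquire function. As a rough pandas equivalent, assuming the data is already in a dataframe, the keep list can be applied with `isin`. The helper name `keep_single_unit` and the toy rows are hypothetical; the ids are written as numbers because `propertylandusetypeid` is a float64 column in the output above.

```python
import pandas as pd

# property land use type ids kept by the definition above
SINGLE_UNIT_IDS = [260, 261, 262, 263, 264, 266, 268, 273, 274, 275, 276, 279]

def keep_single_unit(df):
    """Hypothetical helper: keep rows whose propertylandusetypeid is in the keep list."""
    return df[df.propertylandusetypeid.isin(SINGLE_UNIT_IDS)].copy()

# toy rows (values are made up): 246 is a duplex, 269 a planned unit development
toy = pd.DataFrame({
    "parcelid": [1, 2, 3, 4],
    "propertylandusetypeid": [261.0, 246.0, 266.0, 269.0],
})
single = keep_single_unit(toy)
```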

2. Create a function that will drop rows or columns based on the percent of values that are missing: handle_missing_values(df, prop_required_column, prop_required_row).

The input:
df = a dataframe    
prop_required_column = a number between 0 and 1 that represents the proportion, for each column, of rows with non-missing values required to keep the column. i.e. if prop_required_column = .6, then you are requiring a column to have at least 60% of values not-NA (no more than 40% missing).   
prop_required_row = a number between 0 and 1 that represents the proportion, for each row, of columns/variables with non-missing values required to keep the row. For example, if prop_required_row = .75, then you are requiring a row to have at least 75% of variables with a non-missing value (no more than 25% missing).   

The output: The dataframe with the columns and rows dropped as indicated. Be sure to drop the columns prior to the rows in your function.   

hint: Look up the dropna documentation.

You will want to compute a threshold from your input values (prop_required) and total number of rows or columns.

Make use of inplace, i.e. inplace=True/False.

In [None]:
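# create the function
def handle_missing_values(df, prop_required_column, prop_required_row):
    # thresh is the minimum count of non-missing values required; drop columns first, then rows
    df = df.dropna(axis=1, thresh=int(round(prop_required_column * df.shape[0])))
    df = df.dropna(axis=0, thresh=int(round(prop_required_row * df.shape[1])))
    return df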

3. Decide how to handle the remaining missing values:

    - Fill with constant value.
    - Impute with mean, median, mode.
    - Drop row/column
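
Each of the three options can be tried on a toy frame. This is a minimal sketch, not a recommendation for any particular column: `poolcnt` and `lotsizesquarefeet` are assumed column names, the values are made up, and which strategy suits which column is a judgment call.

```python
import numpy as np
import pandas as pd

# toy data (values are made up; poolcnt and lotsizesquarefeet are assumed column names)
toy = pd.DataFrame({
    "poolcnt": [1.0, np.nan, np.nan, 1.0],
    "lotsizesquarefeet": [5000.0, np.nan, 7000.0, 6000.0],
    "heatingorsystemdesc": ["Central", None, "Central", "Floor/Wall"],
    "regionidzip": [96000.0, 96001.0, np.nan, 96002.0],
})

filled = toy.copy()
# constant: a missing pool count is taken to mean "no pool"
filled["poolcnt"] = filled["poolcnt"].fillna(0)
# median: a robust choice for a skewed numeric column
filled["lotsizesquarefeet"] = filled["lotsizesquarefeet"].fillna(
    filled["lotsizesquarefeet"].median()
)
# mode: the most frequent value for a categorical column
filled["heatingorsystemdesc"] = filled["heatingorsystemdesc"].fillna(
    filled["heatingorsystemdesc"].mode()[0]
)
# drop: remove rows still missing a value that cannot be sensibly imputed
filled = filled.dropna(subset=["regionidzip"])
```

When the data is later split for modeling, computing the median or mode on the training split only and reusing it for the other splits avoids leaking information from them.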

**wrangle_zillow.py**

Functions of the work above needed to acquire and prepare a new sample of data.

### Mall Customers

**notebook**

1. Acquire data from mall_customers.customers in the MySQL database.
2. Summarize data (include distributions and descriptive statistics).
3. Detect outliers using IQR.
4. Split data (train-test-split).
5. Encode categorical columns using a one hot encoder.
6. Handle missing values.
7. Scaling

**wrangle_mall.py**

1. Acquire data from mall_customers.customers in the MySQL database.
2. Split the data
3. One-hot-encoding
4. Missing values
5. Scaling
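
The outlier, encoding, split, and scaling steps for the mall data can be sketched with pandas alone. This is an illustration under stated assumptions: the column names (`customer_id`, `gender`, `age`, `annual_income`) and all values are made up, and the helper `iqr_bounds` is hypothetical. In practice scikit-learn's `train_test_split`, `OneHotEncoder`, and `MinMaxScaler` are common choices for the last three steps.

```python
import pandas as pd

# toy stand-in for mall_customers.customers (column names and values are assumed)
mall = pd.DataFrame({
    "customer_id": range(1, 11),
    "gender": ["Male", "Female"] * 5,
    "age": [19, 21, 20, 23, 31, 22, 35, 23, 64, 30],
    "annual_income": [15, 15, 16, 16, 17, 17, 18, 18, 19, 137],
})

def iqr_bounds(s, k=1.5):
    """Return the (lower, upper) IQR fences for a numeric Series."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# flag values outside the fences as outliers
low, high = iqr_bounds(mall.annual_income)
mall["income_outlier"] = ~mall.annual_income.between(low, high)

# one-hot encode the categorical column
encoded = pd.get_dummies(mall, columns=["gender"], drop_first=True)

# simple reproducible split: 80% train, the rest test
train = encoded.sample(frac=0.8, random_state=123)
test = encoded.drop(train.index)

# min-max scale age using statistics learned from train only
age_min, age_max = train.age.min(), train.age.max()
train = train.assign(age_scaled=(train.age - age_min) / (age_max - age_min))
test = test.assign(age_scaled=(test.age - age_min) / (age_max - age_min))
```

Scaling parameters come from the training split so that the test split is transformed the same way without influencing them.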