# Clustering: Data Acquisition

### Data

- Demo: Iris
- E2E Lesson: Zillow
- Exercises: Mall

### Goal of Zillow Example

Cluster properties based on similarity of numeric features that are likely to influence errors in estimates of home value. 

#### Read Data

In [6]:
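import pandas as pd

# Directory that holds the Zillow CSV files; adjust to your local setup.
path = "~/CODEUP/curriculum/Mod07_Clustering/7.20_Data/Zillow/"

# Property attributes. low_memory=False parses the file in a single pass,
# so columns with mixed types get one consistently inferred dtype.
df = pd.read_csv(path + "properties_2016.csv", low_memory=False)

# Target variable (logerror) and transaction date for each sale.
label_df = pd.read_csv(path + "train_2016_v2.csv")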
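
Because the properties file holds close to three million rows, it may help to load only the columns needed later. The following is a minimal sketch, not the notebook's own method: it assumes a small in-memory CSV as a stand-in for the real file, and `sample_csv`, `wanted`, and `sample_df` are hypothetical names used only for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for properties_2016.csv, built from a few values
# shown in this notebook's outputs.
sample_csv = io.StringIO(
    "parcelid,bathroomcnt,latitude,longitude,taxvaluedollarcnt\n"
    "10859147,0.0,34148863.0,-118437206.0,1156834.0\n"
    "10879947,0.0,34194168.0,-118385816.0,\n"
)

# Columns to keep; usecols tells pandas to skip everything else while parsing.
wanted = ["parcelid", "latitude", "longitude", "taxvaluedollarcnt"]
sample_df = pd.read_csv(sample_csv, usecols=wanted)

print(sample_df.shape)
print(sample_df.columns.tolist())
```

With the real file, the same `usecols` argument could be passed alongside `low_memory=False`.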

In [7]:
df.head()

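| | parcelid | airconditioningtypeid | architecturalstyletypeid | basementsqft | bathroomcnt | bedroomcnt | buildingclasstypeid | buildingqualitytypeid | calculatedbathnbr | decktypeid | ... | numberofstories | fireplaceflag | structuretaxvaluedollarcnt | taxvaluedollarcnt | assessmentyear | landtaxvaluedollarcnt | taxamount | taxdelinquencyflag | taxdelinquencyyear | censustractandblock |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10754147 | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 9.0 | 2015.0 | 9.0 | NaN | NaN | NaN | NaN |
| 1 | 10759547 | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 27516.0 | 2015.0 | 27516.0 | NaN | NaN | NaN | NaN |
| 2 | 10843547 | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 650756.0 | 1413387.0 | 2015.0 | 762631.0 | 20800.37 | NaN | NaN | NaN |
| 3 | 10859147 | NaN | NaN | NaN | 0.0 | 0.0 | 3.0 | 7.0 | NaN | NaN | ... | 1.0 | NaN | 571346.0 | 1156834.0 | 2015.0 | 585488.0 | 14557.57 | NaN | NaN | NaN |
| 4 | 10879947 | NaN | NaN | NaN | 0.0 | 0.0 | 4.0 | NaN | NaN | NaN | ... | NaN | NaN | 193796.0 | 433491.0 | 2015.0 | 239695.0 | 5725.17 | NaN | NaN | NaN |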


In [8]:
label_df.head()

| | parcelid | logerror | transactiondate |
|---|---|---|---|
| 0 | 11016594 | 0.0276 | 2016-01-01 |
| 1 | 14366692 | -0.1684 | 2016-01-01 |
| 2 | 12098116 | -0.004 | 2016-01-01 |
| 3 | 12643413 | 0.0218 | 2016-01-02 |
| 4 | 14432541 | -0.005 | 2016-01-02 |


In [9]:
print(df.shape)
print(label_df.shape)

(2985217, 58)
(90275, 3)


#### Merge the Attributes with the Target Variable

In [10]:
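# A left merge keeps every property; parcels without a transaction get
# NaN in logerror and transactiondate. Exact duplicate rows are dropped.
df = pd.merge(df, label_df, how='left', on='parcelid').drop_duplicates()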
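
To see what `how='left'` and `drop_duplicates()` do in this step, a small sketch with toy frames may help. It assumes two hypothetical frames, `props` and `labels`, that borrow a few parcel IDs from the outputs above. The tax values for the labeled parcels are made up, and `indicator=True` is added here only to show which rows found a match.

```python
import pandas as pd

# Toy stand-in for the properties table (tax values partly made up).
props = pd.DataFrame({
    "parcelid": [10754147, 11016594, 14366692],
    "taxvaluedollarcnt": [9.0, 100.0, 200.0],
})

# Toy stand-in for the label table; the last row is an exact duplicate.
labels = pd.DataFrame({
    "parcelid": [11016594, 14366692, 14366692],
    "logerror": [0.0276, -0.1684, -0.1684],
    "transactiondate": ["2016-01-01", "2016-01-01", "2016-01-01"],
})

# indicator=True adds a _merge column: "both" or "left_only" per row.
merged = pd.merge(props, labels, how="left", on="parcelid", indicator=True)
merged = merged.drop_duplicates()

print(merged)
print(merged["_merge"].value_counts())
```

The unmatched parcel survives with missing `logerror`. Passing `how="inner"` instead would keep only parcels that have a label.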

#### Get Column Names to Identify Those Useful for This Clustering Exercise

In [11]:
df.columns.values

array(['parcelid', 'airconditioningtypeid', 'architecturalstyletypeid',
       'basementsqft', 'bathroomcnt', 'bedroomcnt', 'buildingclasstypeid',
       'buildingqualitytypeid', 'calculatedbathnbr', 'decktypeid',
       'finishedfloor1squarefeet', 'calculatedfinishedsquarefeet',
       'finishedsquarefeet12', 'finishedsquarefeet13',
       'finishedsquarefeet15', 'finishedsquarefeet50',
       'finishedsquarefeet6', 'fips', 'fireplacecnt', 'fullbathcnt',
       'garagecarcnt', 'garagetotalsqft', 'hashottuborspa',
       'heatingorsystemtypeid', 'latitude', 'longitude',
       'lotsizesquarefeet', 'poolcnt', 'poolsizesum', 'pooltypeid10',
       'pooltypeid2', 'pooltypeid7', 'propertycountylandusecode',
       'propertylandusetypeid', 'propertyzoningdesc',
       'rawcensustractandblock', 'regionidcity', 'regionidcounty',
       'regionidneighborhood', 'regionidzip', 'roomcnt', 'storytypeid',
       'threequarterbathnbr', 'typeconstructiontypeid', 'unitcnt',
       'yardbuildingsqft17'

The columns I will use are those that tell me something about a property's location, its tax value, and whether a home in an extremely diverse neighborhood has been renovated.

In [12]:
mycols = ['parcelid', 'calculatedfinishedsquarefeet', 'latitude', 'longitude', 'lotsizesquarefeet', 
          'yearbuilt', 'structuretaxvaluedollarcnt', 'taxvaluedollarcnt', 'landtaxvaluedollarcnt', 
          'taxamount', 'censustractandblock', 'regionidzip', 'regionidcounty', 'regionidcity']
df = df[mycols]
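
Indexing a data frame with a list that contains a missing label raises a `KeyError`, so it can be worth checking the requested columns before subsetting. The following is a minimal sketch, assuming a toy frame built from values shown in this notebook; `toy`, `requested`, `missing`, `present`, and `subset` are hypothetical names.

```python
import pandas as pd

# Toy frame using two rows from the outputs in this notebook.
toy = pd.DataFrame({
    "parcelid": [10859147, 10879947],
    "yearbuilt": [1948.0, 1947.0],
    "taxamount": [14557.57, 5725.17],
})

# Columns we would like to keep; regionidzip is absent from the toy frame.
requested = ["parcelid", "yearbuilt", "regionidzip"]

# Split the request into labels the frame lacks and labels it has.
missing = [col for col in requested if col not in toy.columns]
present = [col for col in requested if col in toy.columns]
print("missing:", missing)

# .copy() makes the subset independent of the original frame.
subset = toy[present].copy()

# Share of missing values per kept column.
print(subset.isna().mean())
```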

#### View First 5 Rows

In [13]:
df.head()

| | parcelid | calculatedfinishedsquarefeet | latitude | longitude | lotsizesquarefeet | yearbuilt | structuretaxvaluedollarcnt | taxvaluedollarcnt | landtaxvaluedollarcnt | taxamount | censustractandblock | regionidzip | regionidcounty | regionidcity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10754147 | NaN | 34144442.0 | -118654084.0 | 85768.0 | NaN | NaN | 9.0 | 9.0 | NaN | NaN | 96337.0 | 3101.0 | 37688.0 |
| 1 | 10759547 | NaN | 34140430.0 | -118625364.0 | 4083.0 | NaN | NaN | 27516.0 | 27516.0 | NaN | NaN | 96337.0 | 3101.0 | 37688.0 |
| 2 | 10843547 | 73026.0 | 33989359.0 | -118394633.0 | 63085.0 | NaN | 650756.0 | 1413387.0 | 762631.0 | 20800.37 | NaN | 96095.0 | 3101.0 | 51617.0 |
| 3 | 10859147 | 5068.0 | 34148863.0 | -118437206.0 | 7521.0 | 1948.0 | 571346.0 | 1156834.0 | 585488.0 | 14557.57 | NaN | 96424.0 | 3101.0 | 12447.0 |
| 4 | 10879947 | 1776.0 | 34194168.0 | -118385816.0 | 8512.0 | 1947.0 | 193796.0 | 433491.0 | 239695.0 | 5725.17 | NaN | 96450.0 | 3101.0 | 12447.0 |


#### Store the Data Frame for Use in Next Notebook

In [14]:
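# %store is an IPython magic that persists the variable so another
# notebook can restore it with `%store -r df_acq`.
df_acq = df
%store df_acq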

Stored 'df_acq' (DataFrame)
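
In the next notebook, `%store -r df_acq` should bring the frame back. Because `%store` only works inside IPython, one alternative is to write the frame to disk. The following is a minimal sketch of that alternative, not the approach this notebook takes; it assumes a temporary directory, a hypothetical file name `df_acq.pkl`, and a two-row toy frame in place of the real data.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the acquired data frame.
df_acq = pd.DataFrame({
    "parcelid": [10859147, 10879947],
    "yearbuilt": [1948.0, 1947.0],
})

# Write the frame to a pickle file and read it back.
with tempfile.TemporaryDirectory() as tmpdir:
    pkl_path = os.path.join(tmpdir, "df_acq.pkl")  # hypothetical file name
    df_acq.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# True when values, dtypes, and labels all survived the round trip.
print(restored.equals(df_acq))
```

A pickle keeps dtypes and the index intact, which a CSV round trip would not guarantee.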
