# Acquisition and Prep
The goal is to predict the values of single-unit properties using observations from 2017.

1. Acquire bedroomcnt, bathroomcnt, calculatedfinishedsquarefeet, taxvaluedollarcnt, yearbuilt, taxamount, and fips from the zillow database for all 'Single Family Residential' properties.
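
A query along the following lines could pull those columns. This is only a sketch: the helper `build_zillow_query`, the table names `properties_2017` and `propertylandusetype`, and the columns `propertylandusetypeid` and `propertylandusedesc` are assumptions made for illustration, and `wrangle.get_zillow_data` may work differently.

```python
# Hypothetical helper: the table and column names used in the FROM, JOIN, and
# WHERE clauses are assumptions about the zillow schema, not taken from wrangle.py.
COLUMNS = [
    "bedroomcnt", "bathroomcnt", "calculatedfinishedsquarefeet",
    "taxvaluedollarcnt", "yearbuilt", "taxamount", "fips",
]

def build_zillow_query(columns=COLUMNS, land_use="Single Family Residential"):
    """Return a SQL string selecting `columns` for one land-use description."""
    select_list = ",\n       ".join(columns)
    return (
        f"SELECT {select_list}\n"
        "FROM properties_2017\n"
        "JOIN propertylandusetype USING (propertylandusetypeid)\n"
        f"WHERE propertylandusedesc = '{land_use}'"
    )

query = build_zillow_query()
print(query)
```

The resulting string could then be passed to `pd.read_sql` together with a database connection, presumably built from the credentials stored in `env`.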

In [1]:
# prepare
import pandas as pd
import numpy as np
import wrangle

# turn off pink warning boxes
import warnings
warnings.filterwarnings("ignore")

# acquire
import env

In [2]:
zillow_df = wrangle.get_zillow_data()
zillow_df

Unnamed: 0,bedroomcnt,bathroomcnt,calculatedfinishedsquarefeet,taxvaluedollarcnt,yearbuilt,taxamount,fips
0,0.0,0.0,,9.0,,,6037.0
1,0.0,0.0,,27516.0,,,6037.0
2,0.0,0.0,73026.0,1434941.0,1959.0,20800.37,6037.0
3,0.0,0.0,5068.0,1174475.0,1948.0,14557.57,6037.0
4,0.0,0.0,1776.0,440101.0,1947.0,5725.17,6037.0
...,...,...,...,...,...,...,...
2985212,,,,,,,
2985213,,,,,,,
2985214,,,,,,,
2985215,,,,,,,


In [3]:
pd.isnull(zillow_df).sum()

bedroomcnt                       2945
bathroomcnt                      2957
calculatedfinishedsquarefeet    45097
taxvaluedollarcnt               34266
yearbuilt                       47833
taxamount                       22752
fips                             2932
dtype: int64

In [4]:
(zillow_df).shape[0]

2985217

In [5]:
(zillow_df).shape[0] - pd.isnull(zillow_df).sum()

bedroomcnt                      2982272
bathroomcnt                     2982260
calculatedfinishedsquarefeet    2940120
taxvaluedollarcnt               2950951
yearbuilt                       2937384
taxamount                       2962465
fips                            2982285
dtype: int64

In [6]:
# Percentage of nulls per column (null count / row count * 100); the lecture had a different answer
(pd.isnull(zillow_df).sum() / (zillow_df).shape[0])*100

bedroomcnt                      0.098653
bathroomcnt                     0.099055
calculatedfinishedsquarefeet    1.510677
taxvaluedollarcnt               1.147856
yearbuilt                       1.602329
taxamount                       0.762156
fips                            0.098217
dtype: float64

In [7]:
# Ryan's method from the lesson: isna().mean() gives the same quantity as a fraction instead of a percentage (the lesson's numbers still differed)
zillow_df.isna().mean()

bedroomcnt                      0.000987
bathroomcnt                     0.000991
calculatedfinishedsquarefeet    0.015107
taxvaluedollarcnt               0.011479
yearbuilt                       0.016023
taxamount                       0.007622
fips                            0.000982
dtype: float64
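
The two null summaries above measure the same thing: `isna().mean()` returns the fraction of nulls per column, and multiplying it by 100 reproduces the percentage computed from `sum()` and `shape`. A minimal sketch on a made-up frame (the `toy` data is an assumption for illustration, not the zillow data):

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "bedroomcnt": [3.0, np.nan, 2.0, 4.0],
    "yearbuilt": [1960.0, 1975.0, np.nan, np.nan],
})

# Null count divided by row count, as a percentage
pct_from_sum = toy.isnull().sum() / toy.shape[0] * 100

# Mean of the boolean null mask, scaled to a percentage
pct_from_mean = toy.isna().mean() * 100

# Fraction of rows that survive dropping any row with a null
kept = toy.dropna().shape[0] / toy.shape[0]

print(pct_from_sum)
print(pct_from_mean)
print(kept)
```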

In [8]:
# fraction of rows left if we dropped every row with any null value; the lecture had a different number
round(zillow_df.dropna().shape[0] / zillow_df.shape[0], 4)

0.9721

In [9]:
# drops the rows with any null values and returns a new null-free df
zillow_df = zillow_df.dropna()

In [10]:
# null free df; different from lecture
zillow_df.shape[0]

2901918

In [11]:
zillow_df.columns.tolist()

['bedroomcnt',
 'bathroomcnt',
 'calculatedfinishedsquarefeet',
 'taxvaluedollarcnt',
 'yearbuilt',
 'taxamount',
 'fips']

In [12]:
# counts how often each value occurs in each column; different from lecture
for column in zillow_df.columns:
    print(column)
    print(zillow_df[column].value_counts())
    print("-----------------")

bedroomcnt
3.0     1170717
4.0      730024
2.0      604965
5.0      182200
1.0       86501
6.0       48534
0.0       45547
8.0       13283
7.0       12656
9.0        4218
10.0       1670
12.0        941
11.0        415
13.0         83
14.0         64
16.0         47
15.0         23
17.0         11
18.0          9
20.0          6
25.0          1
23.0          1
19.0          1
21.0          1
Name: bedroomcnt, dtype: int64
-----------------
bathroomcnt
2.00     1217537
3.00      631906
1.00      497185
2.50      208514
4.00      133100
1.50       45667
0.00       40405
5.00       38362
3.50       31769
4.50       19832
6.00       16319
5.50        6259
7.00        6186
8.00        4498
6.50        1349
9.00        1326
10.00        489
7.50         383
12.00        264
11.00        198
8.50         111
13.00         53
9.50          50
14.00         39
16.00         23
15.00         20
0.50          16
10.50         14
18.00         12
17.00          8
20.00          7
1.75           4


Based on these results, we can change bedroomcnt to an integer since all of its values are whole numbers; the same goes for calculatedfinishedsquarefeet, taxvaluedollarcnt, and yearbuilt.
However, bathroomcnt includes fractional values such as half baths, which we'd like to keep, so it will stay a float.
fips can also be changed to an integer: it is a categorical FIPS code that identifies the county, not a zip code.
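
The whole-number reasoning above can be checked for every float column at once. The sketch below uses a hypothetical helper, `whole_number_columns`, and a made-up `toy` frame; both are assumptions for illustration and are not part of `wrangle.py`.

```python
import pandas as pd

def whole_number_columns(df):
    """Return names of float columns whose values are all whole numbers.

    Assumes nulls were already dropped; a column containing NaN is skipped
    because NaN % 1 == 0 evaluates to False.
    """
    float_cols = df.select_dtypes(include="float").columns
    return [col for col in float_cols if (df[col] % 1 == 0).all()]

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "bedroomcnt": [3.0, 2.0, 4.0],
    "bathroomcnt": [2.0, 1.5, 2.5],
    "fips": [6037.0, 6059.0, 6111.0],
})

int_cols = whole_number_columns(toy)

# Cast only the columns that lose nothing by dropping the decimal part
toy = toy.astype({col: "int64" for col in int_cols})
print(toy.dtypes)
```

Passing a dict to `astype` converts several columns in one call and leaves bathroomcnt untouched.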

In [13]:
# We can use the following code to figure out that 100% of calculatedfinishedsquarefeet can be converted to int w/o data loss
(zillow_df.calculatedfinishedsquarefeet == zillow_df.calculatedfinishedsquarefeet.astype(int)).mean()

1.0

In [14]:
# Same with taxvaluedollarcnt. 100% of taxvaluedollarcnt can lose the decimal and be OK
(zillow_df.taxvaluedollarcnt == zillow_df.taxvaluedollarcnt.astype(int)).mean()

1.0

In [15]:
# This is not the case for our bathrooms
(zillow_df.bathroomcnt == zillow_df.bathroomcnt.astype(int)).mean()

0.8918039724072148

In [16]:
# But it is the case for the square footage of our homes (same check as In [13])
(zillow_df.calculatedfinishedsquarefeet == zillow_df.calculatedfinishedsquarefeet.astype(int)).mean()

1.0

In [17]:
# converts fips, yearbuilt, bedroomcnt, taxvaluedollarcnt, and calculatedfinishedsquarefeet to integers
zillow_df["fips"] = zillow_df["fips"].astype(int)
zillow_df["yearbuilt"] = zillow_df["yearbuilt"].astype(int)
zillow_df["bedroomcnt"] = zillow_df["bedroomcnt"].astype(int)
zillow_df["taxvaluedollarcnt"] = zillow_df["taxvaluedollarcnt"].astype(int)
zillow_df["calculatedfinishedsquarefeet"] = zillow_df["calculatedfinishedsquarefeet"].astype(int)

In [18]:
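# max and min are printed; mean and mode come back as the cell's value,
# and the two None entries in that tuple are the return values of print()
print(zillow_df.max()), print(zillow_df.min()), zillow_df.mean(), zillow_df.mode()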

bedroomcnt                      2.500000e+01
bathroomcnt                     3.200000e+01
calculatedfinishedsquarefeet    9.525760e+05
taxvaluedollarcnt               2.870985e+08
yearbuilt                       2.016000e+03
taxamount                       3.458861e+06
fips                            6.111000e+03
dtype: float64
bedroomcnt                         0.00
bathroomcnt                        0.00
calculatedfinishedsquarefeet       1.00
taxvaluedollarcnt                 22.00
yearbuilt                       1801.00
taxamount                          5.04
fips                            6037.00
dtype: float64


(None,
 None,
 bedroomcnt                           3.170278
 bathroomcnt                          2.271617
 calculatedfinishedsquarefeet      1836.479432
 taxvaluedollarcnt               444828.167789
 yearbuilt                         1964.258716
 taxamount                         5456.144079
 fips                              6047.878252
 dtype: float64,
    bedroomcnt  bathroomcnt  calculatedfinishedsquarefeet  taxvaluedollarcnt  \
 0           3          2.0                          1200             450000   
 
    yearbuilt  taxamount  fips  
 0       1955     345.72  6037  )

In [19]:
zillow_df.describe().round(1)

Unnamed: 0,bedroomcnt,bathroomcnt,calculatedfinishedsquarefeet,taxvaluedollarcnt,yearbuilt,taxamount,fips
count,2901918.0,2901918.0,2901918.0,2901918.0,2901918.0,2901918.0,2901918.0
mean,3.2,2.3,1836.5,444828.2,1964.3,5456.1,6047.9
std,1.2,1.0,1935.6,730804.5,23.6,8740.2,20.1
min,0.0,0.0,1.0,22.0,1801.0,5.0,6037.0
25%,2.0,2.0,1218.0,192600.0,1950.0,2543.7,6037.0
50%,3.0,2.0,1581.0,324450.0,1963.0,4059.5,6037.0
75%,4.0,3.0,2148.0,516903.0,1981.0,6277.9,6059.0
max,25.0,32.0,952576.0,287098486.0,2016.0,3458861.1,6111.0


In [20]:
zillow_df.dtypes

bedroomcnt                        int64
bathroomcnt                     float64
calculatedfinishedsquarefeet      int64
taxvaluedollarcnt                 int64
yearbuilt                         int64
taxamount                       float64
fips                              int64
dtype: object

In [21]:
train, validate, test = wrangle.wrangle_zillow()
train

Unnamed: 0,bedroomcnt,bathroomcnt,calculatedfinishedsquarefeet,taxvaluedollarcnt,yearbuilt,taxamount,fips
2086451,3,2.5,1746,329842,1979,3281.70,6059
1608617,2,2.0,1503,378681,1980,3779.86,6059
854792,4,4.0,2469,3464540,1950,41401.23,6037
2066831,2,1.0,1507,78173,1952,1477.34,6037
2508263,2,2.0,1150,54832,1966,1139.31,6037
...,...,...,...,...,...,...,...
2578047,1,1.0,678,294384,1990,3804.80,6059
972773,4,2.0,1392,173600,1955,2840.54,6037
2204227,3,2.0,1370,339532,1977,3854.74,6059
994295,2,3.0,1204,305000,1979,3747.38,6037


In [22]:
train.shape

(1632328, 7)
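
The contents of `wrangle.wrangle_zillow` are not shown here. As a rough sketch of how a three-way split can be done with pandas alone, the hypothetical function below samples rows at random; its name, the split fractions, and the seed are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def split_train_validate_test(df, test_frac=0.2, validate_frac=0.25, seed=123):
    """Split df into train, validate, and test by random row sampling.

    test_frac is taken from the full frame; validate_frac is taken from
    what remains after the test rows are removed. Assumes a unique index.
    """
    test = df.sample(frac=test_frac, random_state=seed)
    remainder = df.drop(test.index)
    validate = remainder.sample(frac=validate_frac, random_state=seed)
    train = remainder.drop(validate.index)
    return train, validate, test

# Toy data, made up for illustration only
toy = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2})
toy_train, toy_validate, toy_test = split_train_validate_test(toy)
print(toy_train.shape, toy_validate.shape, toy_test.shape)
```

Fixing the seed keeps the split reproducible, so exploration and modeling always see the same train rows.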

## Exploration

1. As with encoded vs. unencoded data, we recommend exploring un-scaled data in your EDA process.

2. Make sure to perform a train, validate, test split before exploring, and use only your train dataset to explore the relationships among independent variables and between independent variables and your target variable.

3. Write a function named `plot_variable_pairs` that accepts a dataframe as input and plots all of the pairwise relationships along with the regression line for each pair.

4. Write a function named `plot_categorical_and_continuous_vars` that accepts your dataframe and the name of the columns that hold the continuous and categorical features and outputs 3 different plots for visualizing a categorical variable and a continuous variable.

5. Save the functions you have written to create visualizations in your `explore.py` file. Rewrite your notebook code so that you are using the functions imported from this file.

6. Use the functions you created above to explore your Zillow train dataset in your `explore.ipynb` notebook.

7. Come up with some initial hypotheses based on your goal of predicting property value.

8. Visualize all combinations of variables in some way.

9. Run the appropriate statistical tests where needed.

10. What independent variables are correlated with the dependent variable, home value?

11. Which independent variables are correlated with other independent variables (bedrooms, bathrooms, year built, square feet)?

12. Make sure to document your takeaways from visualizations and statistical tests as well as the decisions you make throughout your process.

13. Explore your dataset with any other visualizations you think will be helpful.