In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import os

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

import acquire
import prep


# #1. Acquire the correct data subset from the 'zillow' dataset

## Import

In [2]:
# from env import host, username, password
# def get_db_url(db_name, username=username, hostname=host, password=password):
#     return f'mysql+pymysql://{username}:{password}@{hostname}/{db_name}'

## Acquire

In [3]:
# url = get_db_url(db_name='zillow')
# query = """
#             SELECT parcelid as ID,
#                     transactiondate as DateSold,
#                     taxvaluedollarcnt as Worth,
#                     taxamount as Taxes,
#                     roomcnt as Rooms,
#                     bathroomcnt as Baths,
#                     bedroomcnt as Beds,
#                     garagecarcnt as GarageCarCount,
#                     numberofstories as Stories,
#                     lotsizesquarefeet as LotSize,
#                     garagetotalsqft as GarageSize,
#                     calculatedfinishedsquarefeet as FinishedSize,
#                     yearbuilt as YearBuilt,
#                     fips as LocalityCode,
#                     regionidcounty as County,
#                     regionidzip as Zipcode,
#                     propertycountylandusecode as UseCode
#             FROM properties_2017
#             JOIN predictions_2017 USING(parcelid)
#             WHERE propertylandusetypeid = 261 AND 
#                   transactiondate BETWEEN '2017-05-01' AND '2017-08-31'
#         """

In [4]:
# df = pd.read_sql(query, url)
# df.head(3)

## Store to .csv

In [5]:
# if not os.path.isfile('zillow.csv'):
#     df.to_csv('zillow.csv')

* **ACQUIRE.py**: 
    * Add get_db_url - *Done*
    * Add acquire-store function - *Done*
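
The acquire-and-store pattern from the commented cells above could be wrapped roughly as follows. This is a minimal sketch, not the actual contents of acquire.py: `acquire_cached` and its `loader` argument are hypothetical names, and the loader stands in for the `pd.read_sql(query, url)` call so the example runs without a database.

```python
import os
import tempfile

import pandas as pd


def acquire_cached(csv_path, loader):
    """Return the cached CSV if it exists; otherwise call loader() and cache it.

    `loader` is a hypothetical zero-argument callable standing in for the
    pd.read_sql(query, url) call used in the notebook.
    """
    if os.path.isfile(csv_path):
        return pd.read_csv(csv_path)
    df = loader()
    # index=False keeps the row index out of the file, which avoids an
    # extra 'Unnamed: 0' column when the CSV is read back.
    df.to_csv(csv_path, index=False)
    return df


# Usage with a stand-in loader (made-up rows, no database needed).
demo_path = os.path.join(tempfile.mkdtemp(), "zillow_demo.csv")
calls = []


def fake_loader():
    calls.append(1)
    return pd.DataFrame({"ID": [1, 2], "Worth": [205123.0, 136104.0]})


first = acquire_cached(demo_path, fake_loader)   # runs the loader, writes the CSV
second = acquire_cached(demo_path, fake_loader)  # reads the CSV, loader not called
```

Passing the loader in as a callable keeps the database credentials out of the caching logic, which makes the function easy to exercise offline.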

In [6]:
df = acquire.acquire_zillow()
df.head(3)

Unnamed: 0,ID,DateSold,Worth,Taxes,Rooms,Baths,Beds,GarageCarCount,Stories,LotSize,GarageSize,FinishedSize,YearBuilt,LocalityCode,County,Zipcode,UseCode
0,11721753,2017-07-21,205123.0,2627.48,0.0,2.0,3.0,,,5672.0,,1316.0,1923.0,6037.0,3101.0,95997.0,100
1,11289917,2017-06-23,136104.0,2319.9,0.0,2.0,3.0,,,8284.0,,1458.0,1970.0,6037.0,3101.0,97318.0,101
2,11705026,2017-06-30,35606.0,543.69,0.0,1.0,2.0,,,6707.0,,1421.0,1911.0,6037.0,3101.0,96018.0,100


* **Present**:
    * Acquisition steps and data subsetting

# #2. Create a distribution for residence locality (state, county) against tax rate

* Notebook: Explore data for locality information
    * **Deliverable**: List of states and counties where properties are located
        * 6037: Los Angeles, CA
        * 6059: Orange, CA
        * 6111: Ventura, CA

## Investigate locality information

In [7]:
# df[['County','LocalityCode']].value_counts()

These three groups might be correct. Let's investigate the Zipcode column...

In [8]:
# Create array of unique zip codes, converted to five character str
# zips = pd.DataFrame(df['Zipcode'].value_counts().keys().astype('int').astype('str').tolist())
# Subset zip codes to first three digits
# zips[0] = zips[0].str[:3]
# Show unique first-three-digits
# zips.value_counts()

- 961: Reno, NV (West)
- 960: Redding, CA
- 962: Armed Forces - Korea
- 970: Portland, OR (Vicinity)
- 963: Armed Forces - Japan
- 969: Barrigada, Guam
- 964: Armed Forces - Philippines
- 959: Marysville, CA
- 965: Armed Forces - Pacific
- 973: Salem, OR
- 971: Portland, OR (West)
- 399: Atlanta, GA (IRS)
- 972: Portland, OR (Main)

This doesn't seem quite right. The FIPS code might be the right one.

In [9]:
# Checking aliased fips column
df['LocalityCode'].unique()

array([6037., 6059., 6111.])

- 6037: Los Angeles, CA
- 6059: Orange, CA
- 6111: Ventura, CA

This seems correct: all three are counties in California. We're going with that.
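
To make the county names usable in plots and summaries, the FIPS codes can be mapped to labels. A minimal sketch on a stand-in frame; `CountyName` and `FIPS_TO_COUNTY` are hypothetical names, and the labels are the ones listed above.

```python
import pandas as pd

# County names as identified in the notebook for the three FIPS codes.
FIPS_TO_COUNTY = {
    6037: "Los Angeles, CA",
    6059: "Orange, CA",
    6111: "Ventura, CA",
}

# Small stand-in frame; in the notebook this would be df['LocalityCode'].
demo = pd.DataFrame({"LocalityCode": [6037.0, 6059.0, 6111.0, 6037.0]})

# The codes are stored as floats, so cast to int before mapping.
demo["CountyName"] = demo["LocalityCode"].astype(int).map(FIPS_TO_COUNTY)
county_counts = demo["CountyName"].value_counts()
```

The new column could then be passed as `hue` in the histograms so the legend shows county names instead of numeric codes.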

## Calculate tax rate for each property

In [10]:
df['TaxRate'] = round((df['Taxes'] / df['Worth']) * 100, 2)
df[['Worth','Taxes','TaxRate']].head(3)

Unnamed: 0,Worth,Taxes,TaxRate
0,205123.0,2627.48,1.28
1,136104.0,2319.9,1.7
2,35606.0,543.69,1.53


## Create distribution of tax rates for each county
* **Deliverable**: This distribution

In [11]:
# sns.histplot(df, x='TaxRate', hue='LocalityCode')

That doesn't look right. Let's see what went wrong...

In [12]:
# df.TaxRate.sort_values(ascending=False).head(20)

Ok, let's drop all values over 10% and see what happens...

In [13]:
# sns.histplot(df[df.TaxRate < 10], x='TaxRate', hue='LocalityCode', palette='bright')

So it seems most values are between 1% and 2%, which checks out. So let's see what value is poking out...

In [14]:
# df.TaxRate.value_counts().head(10)

The 2500-count value isn't appearing, weird. Let's zoom way in...

In [15]:
# mask = (df.TaxRate > 1.1) & (df.TaxRate < 1.5)
# sns.histplot(df[mask], x='TaxRate', hue='LocalityCode', palette='bright')

So it seems two values were grouped into one bar. Let's widen the view a bit to capture the distribution without combining two tax rates.
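
Because TaxRate is rounded to two decimals, a bar can only merge two distinct rates when the bins are wider than 0.01 or are not aligned with the rounding grid. The sketch below illustrates this with `np.histogram` on made-up values; the numbers are assumptions for demonstration, not rates from the dataset.

```python
import numpy as np

# Made-up tax rates, already rounded to two decimals like df['TaxRate'].
rates = np.array([1.20, 1.20, 1.21, 1.21, 1.21, 1.22, 1.25, 1.25])

# Edges sit halfway between the possible rounded values (1.195, 1.205, ...),
# so each bin holds exactly one distinct rate.
edges = np.arange(1.195, 1.26, 0.01)
counts, _ = np.histogram(rates, bins=edges)

# A coarser grid (width 0.02) merges neighbouring rates into one bar.
coarse_edges = np.arange(1.195, 1.27, 0.02)
coarse_counts, _ = np.histogram(rates, bins=coarse_edges)
```

In seaborn, a similar effect should be achievable by passing `binwidth=0.01` to `sns.histplot`, rather than narrowing the data range until the default bins happen to separate the values.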

In [16]:
# mask = (df.TaxRate > 0.9) & (df.TaxRate < 1.7)
# sns.histplot(df[mask], x='TaxRate', hue='LocalityCode', palette='bright', kde=True)

In [17]:
# def remove_outliers(df, k, col_list):
#     for col in col_list:
#         q1, q3 = df[col].quantile([.25, .75])  # get quartiles
#         iqr = q3 - q1   # calculate interquartile range
#         upper_bound = q3 + k * iqr   # get upper bound
#         lower_bound = q1 - k * iqr   # get lower bound
#         # return dataframe without outliers
#         df = df[(df[col] > lower_bound) & (df[col] < upper_bound)]
#     return df

In [18]:
# new = prep.remove_outliers(df, 1.5, ['TaxRate','Worth','Taxes'])
# sns.histplot(new, x='TaxRate', hue='LocalityCode', palette='bright', kde=True)

Cool. So, clean up this distribution and put it in the presentation. Some things you should note:
- Each locality's highest rate
- General shape of distributions
- Any odd phenomena, like the seemingly-random peaks through the skew
    * This is likely two values grouped into one

* **Present**: 
    * All localities for dataset's residences
    * Each locality's tax rate range
    * Each peak in the tax rate distributions
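
The range and most common rate per locality can be tabulated directly instead of being read off the plot. A minimal sketch on made-up rows; in the notebook the input would presumably be the outlier-trimmed frame from the previous cell.

```python
import pandas as pd

# Stand-in rows (made-up values, not taken from the dataset).
demo = pd.DataFrame({
    "LocalityCode": [6037, 6037, 6037, 6037, 6059, 6059, 6111],
    "TaxRate": [1.21, 1.25, 1.25, 1.40, 1.10, 1.16, 1.12],
})

grouped = demo.groupby("LocalityCode")["TaxRate"]

# Lowest, middle, and highest rate for each county.
summary = grouped.agg(["min", "median", "max"])

# Most frequent rate per county, i.e. the tallest peak in its histogram.
# When several rates tie, mode() returns them sorted and the lowest is kept.
summary["mode"] = grouped.agg(lambda s: s.mode().iloc[0])
```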

# #3. Conduct hypothesis testing and check univariate distributions

### Univariate Distributions

* Notebook: Initial exploration
    * Use square-footage, bedroom-count, bathroom-count, and tax-value for initial exploration
    * *After MVP*: Run initial exploration on new features

In [19]:
# df.columns

In [20]:
# initial = df[['Worth','FinishedSize','Beds','Baths']]
# print(initial.info())
# initial = initial.dropna()
# initial = initial.drop_duplicates()
# initial.info()

In [21]:
# def initial_plots(df):
#     for col in df.columns:
#         # Raw plots
#         sns.histplot(df, x=col)
#         plt.title(col)
#         plt.show()
        
#         # Interquartile Rule
#         q1, q3 = df[col].quantile([.25, .75])  # get quartiles
#         k = 1.5
#         iqr = q3 - q1   # calculate interquartile range
#         upper_bound = q3 + k * iqr   # get upper bound
#         lower_bound = q1 - k * iqr   # get lower bound
#         temp = df[(df[col] > lower_bound) & (df[col] < upper_bound)]
#         sns.histplot(temp, x=col)
#         plt.title(col + '_no_outliers')
#         plt.show()

In [22]:
# df['Worth'].dtype == 'float64'

In [23]:
# prep.initial_plots(initial)

In [24]:
# prep.initial_plots(df)

### Hypothesis Testing

* Notebook: Hypothesis testing
    * Create initial hypotheses for MVP features, push to readme
    * Run statistical tests
        * At least two statistical tests along with visualizations documenting hypotheses and takeaways
    * Convey results to readme and Jupyter notebook
    * *After MVP*: Create hypotheses for new features 
    * *After MVP*: Push new features through hypothesis testing to check viability
    * *After MVP*: Convey results to readme and Jupyter notebook

**Initial Hypotheses** (two for MVP):
- There is a linear relationship between a home's value and the number of bedrooms it has.
- The value of a home with two bedrooms is not statistically different from that of a home with two bathrooms.

**Confidence Level = 95%** (alpha = 0.05)

In [25]:
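def significant(alpha, p):
    # A p-value below alpha is evidence against the null hypothesis
    if p < alpha:
        print("Reject the null hypothesis")
    else:
        # A test never proves the null, so we only "fail to reject" it
        print("Fail to reject the null hypothesis")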

In [26]:
# Spearman test to determine correlation between Worth and Beds
# Note: df.dropna() drops rows with a null in *any* column, not just Worth and Beds
alpha = .05
corr, p = stats.spearmanr(df.dropna().Worth, df.dropna().Beds)
significant(alpha, p)

Reject the null hypothesis


In [27]:
# Mann-Whitney U test to determine statistical difference between 
# Worth of homes with 2 Beds and Worth of homes with 2 Baths
# Null hypothesis: the two Worth distributions do not differ
u, p = stats.mannwhitneyu(df[df.Beds == 2].dropna().Worth, df[df.Baths == 2].dropna().Worth) 
if p < alpha: # Note: a small p-value is evidence *against* "no difference"
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Reject the null hypothesis


* **PREP.py**: 
    * Add initial-plots function to loop through and plot features - *Done*
* **Present**: 
    * First four distributions
    * *After MVP*: Any additional distributions included/excluded
    * Initial hypotheses
    * Statistical tests
    * Results

# #4. Prepare using Minimum-Viable-Product (MVP) specification restriction

### 1. Drop all columns except square-footage, bedroom-count, bathroom-count, tax-value

In [28]:
# df = df[['Worth','FinishedSize','Beds','Baths']]
# df.head(3)

### 2. Drop all nulls and duplicates in above columns

In [29]:
# print("Before:", df.shape)
# df = df.dropna().drop_duplicates()
# print("After:", df.shape)

In [30]:
# print("Before:", df.shape)
# df = df
# print("After:", df.shape)

### 3. Check for outliers using box-and-whisker plot

In [31]:
# for col in df.columns:
#     sns.boxplot(data=df[col])
#     plt.show()

### 4. Eliminate outliers if needed using Inter-Quartile Rule

In [32]:
# df = prep.remove_outliers(df, k=1.5, col_list=['Worth','FinishedSize','Beds','Baths'])
# df.head(3)

### 5. Rename columns to something more readable
- Accomplished in SQL query

### 6. Split data into train, validate, and test

In [33]:
# train_validate, test = train_test_split(df, test_size=0.2, random_state=123)
# train, validate = train_test_split(train_validate, test_size=0.25, random_state=123)
# train.shape, validate.shape, test.shape
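
The two `test_size` values above compose into the overall split proportions. The arithmetic can be checked exactly with fractions; this is a worked sketch of the proportions only, not of how `train_test_split` rounds row counts.

```python
from fractions import Fraction

test_size = Fraction(1, 5)       # test_size=0.2 in the first split
validate_size = Fraction(1, 4)   # test_size=0.25 in the second split

# The second split only sees what is left after the test set is removed.
test_share = test_size
validate_share = (1 - test_size) * validate_size
train_share = 1 - test_share - validate_share
```

This gives a 60/20/20 train/validate/test split, which is consistent with the train and validate row counts of 14848 and 4950 reported later in the notebook.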

### 7. Isolate target variable 'tax-value' into y_train from X_train
Do the same for X_validate and X_test

In [34]:
# X_train, y_train = train.drop(columns='Worth'), train.Worth
# X_validate, y_validate = validate.drop(columns='Worth'), validate.Worth
# X_test, y_test = test.drop(columns='Worth'), test.Worth

### 8. Scale data
* Create and fit scaler using X_train
* Create X_train_exp using scaler transform of X_train while retaining original values
* Scale X_train, drop unscaled columns
* Scale X_validate, drop unscaled columns
* Scale X_test, drop unscaled columns
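
The key point in the steps above is that the scaling statistics come from the training split only and are then reused for validate and test. The sketch below shows that idea with plain pandas on made-up rows; it mimics standard scaling (mean 0, population standard deviation 1) rather than calling `StandardScaler`, and all `*_demo` names are hypothetical.

```python
import pandas as pd

X_train_demo = pd.DataFrame({"FinishedSize": [1000.0, 1500.0, 2000.0],
                             "Beds": [2.0, 3.0, 4.0]})
X_validate_demo = pd.DataFrame({"FinishedSize": [1250.0], "Beds": [5.0]})

# Fit: learn the statistics from the training split only.
train_mean = X_train_demo.mean()
train_std = X_train_demo.std(ddof=0)  # population std, as StandardScaler uses


def scale(frame):
    """Apply the train-derived mean/std and keep the column names."""
    return ((frame - train_mean) / train_std).add_suffix("_scaled")


X_train_scaled = scale(X_train_demo)
X_validate_scaled = scale(X_validate_demo)

# Exploration copy that keeps the original and scaled columns side by side.
X_train_exp_demo = pd.concat([X_train_demo, X_train_scaled], axis=1)
```

Unlike assigning the raw output of `scaler.transform`, which is a NumPy array, this keeps DataFrames with labelled columns, which may be convenient for later feature selection.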

In [35]:
# scaler = StandardScaler().fit(X_train)

In [36]:
# X_train_exp = X_train.copy()
# col_list = []
# for col in X_train.columns:
#     col_list.append(col + "_scaled")
# X_train_exp[col_list] = scaler.transform(X_train)
# X_train_exp.head(3)

In [37]:
# X_train = scaler.transform(X_train)
# X_validate = scaler.transform(X_validate)
# X_test = scaler.transform(X_test)

* *After MVP*: Run through above steps as needed with any additional features
* **PREP.py**: 
    * Add plot-data function to make various plots for a dataframe when called - *Done*
    * Add clean-data function to limit dataset features, drop nulls, eliminate outliers, rename columns - *Done*
        * *After MVP*: Revise clean-data with new features
    * Add split-data function for train/validate/test *and* target isolation - *Done*
    * Add scale-data function - *Done*
    * Add wrangle-data function to run acquire-store, clean-data, split-data, and scale-data functions, then return all dataframes - *Done*
* **Present**:
    * Overview of wrangling, mentioning feature limitation to MVP, additional features, nulls, outliers, feature renaming, split, target isolation, scaler creation, scaler application, returned dataframes

## Check to see if prep.py is working as intended

In [38]:
df, X_train_exp, X_train, y_train, X_validate, y_validate, X_test, y_test = prep.wrangle_zillow_MVP()
X_train_exp.head(3)

Unnamed: 0,FinishedSize,Beds,Baths,FinishedSize_scaled,Beds_scaled,Baths_scaled
12185,2890.0,5.0,4.0,1.897476,2.254989,2.574043
17845,1352.0,3.0,2.0,-0.598498,-0.278078,-0.167565
28109,1516.0,4.0,2.0,-0.332348,0.988456,-0.167565


# #5. Create models for MVP restriction

### 1. Cast y_train and y_validate as dataframes

In [39]:
y_train = pd.DataFrame(y_train)
y_validate = pd.DataFrame(y_validate)
y_train.shape, y_validate.shape

((14848, 1), (4950, 1))

### 2. Create model-performance function
* Takes in train_actuals, val_actuals, train_preds, val_preds, 'model_name', df_to_append_to
* Calculates RMSE for train and validate
* Calculates r^2 score for train and validate
* Appends dataframe with new row for model_name, Train_RMSE, Validate_RMSE, Train_r2, Validate_r2
* Returns dataframe

In [40]:
def model_performance(train_actuals, val_actuals, train_preds, val_preds, model_name, running_df):
    # RMSE is the square root of the mean squared error
    rmse_train = mean_squared_error(train_actuals, train_preds) ** 0.5
    rmse_validate = mean_squared_error(val_actuals, val_preds) ** 0.5
    r2_train = r2_score(train_actuals, train_preds)
    r2_validate = r2_score(val_actuals, val_preds)
    # DataFrame.append was removed in pandas 2.0; build a one-row frame and concat
    new_row = pd.DataFrame([{'Model': model_name,
                             'Train_RMSE': rmse_train,
                             'Validate_RMSE': rmse_validate,
                             'Train_r2': r2_train,
                             'Validate_r2': r2_validate}])
    running_df = pd.concat([running_df, new_row], ignore_index=True)
    return running_df

### 3. Create plot-residuals function - WIP

In [41]:
# def plot_residuals(df):
#     sns.relplot(x = 'total_bill', y = 'residual', data = df)
#     plt.axhline(0, ls = ':')
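
The WIP cell above references a `total_bill` column, which does not exist in this dataset. One possible shape for the finished function is sketched below, assuming residuals are defined as actual minus predicted; `residual_frame` is a hypothetical helper and the numbers are made up.

```python
import pandas as pd


def residual_frame(actuals, predictions):
    """Return actuals, predictions, and residuals (actual - predicted)."""
    frame = pd.DataFrame({"actual": list(actuals),
                          "predicted": list(predictions)})
    frame["residual"] = frame["actual"] - frame["predicted"]
    return frame


def plot_residuals(frame):
    """Scatter residuals against actual values with a zero reference line."""
    # Imported here so residual_frame can be used without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(frame["actual"], frame["residual"], s=10)
    ax.axhline(0, ls=":")
    ax.set_xlabel("actual")
    ax.set_ylabel("residual")
    return ax


# Made-up values: a constant prediction of 400, like a mean baseline.
demo = residual_frame([300.0, 400.0, 500.0], [400.0, 400.0, 400.0])
```

Calling `plot_residuals(demo)` would then draw the scatter. For a constant baseline the points fall on a straight diagonal line, which makes a useful visual reference against fitted models.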

### 4. Create baseline model
* Calculate mean and median of target (tax-value)
* Assign mean and median to columns in y_train and y_validate
* Calculate RMSE for both train and validate
    * mean_squared_error(actuals, baseline) ** 0.5
* Keep the lower-error baseline (of mean and median)
* Call plot-residuals

In [42]:
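y_train['mean_bl'] = y_train['Worth'].mean()
y_train['median_bl'] = y_train['Worth'].median()
# Note: the validate baselines below use validate's own mean and median;
# reusing the train values instead would keep validate information out of the baseline
y_validate['mean_bl'] = y_validate['Worth'].mean()
y_validate['median_bl'] = y_validate['Worth'].median()
y_train.head(3)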

Unnamed: 0,Worth,mean_bl,median_bl
12185,683726.0,388435.613618,346509.0
17845,418993.0,388435.613618,346509.0
28109,348477.0,388435.613618,346509.0


In [43]:
running_df = pd.DataFrame(columns=['Model','Train_RMSE','Validate_RMSE','Train_r2','Validate_r2'])
running_df

Unnamed: 0,Model,Train_RMSE,Validate_RMSE,Train_r2,Validate_r2


In [44]:
running_df = model_performance(y_train.Worth, y_validate.Worth, y_train.mean_bl, y_validate.mean_bl, 'mean_baseline', running_df)
running_df = model_performance(y_train.Worth, y_validate.Worth, y_train.median_bl, y_validate.median_bl, 'median_baseline', running_df)
running_df

Unnamed: 0,Model,Train_RMSE,Validate_RMSE,Train_r2,Validate_r2
0,mean_baseline,260807.509266,265580.600849,0.0,0.0
1,median_baseline,264156.010378,268905.009083,-0.025843,-0.025192


### 5. Create models for different regression algorithms
* Loop through one algorithm's hyperparameters, save to list
* Loop through next algorithm, and next... using same

### 6. Loop lists of models through model-performance function
* Extend the 'model_name' to include hyperparameter
* Add to same dataframe for easy column-wise analysis
* Call plot-residuals
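
Steps 5 and 6 amount to a loop over hyperparameter values that records one row of metrics per fitted model. The sketch below shows the loop structure only: it uses a closed-form ridge regression in NumPy as a stand-in for whichever scikit-learn estimators end up being used, runs on synthetic data, and collects rows in a plain list rather than through `model_performance`. All names here are hypothetical.

```python
import numpy as np


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression with an unpenalised intercept."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - X_mean @ coef
    return coef, intercept


def rmse(actuals, preds):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((actuals - preds) ** 2)))


# Synthetic train/validate data (not the Zillow data).
rng = np.random.default_rng(123)
X_tr, X_val = rng.normal(size=(200, 3)), rng.normal(size=(80, 3))
true_coef = np.array([3.0, -2.0, 0.5])
y_tr = X_tr @ true_coef + rng.normal(scale=0.5, size=200)
y_val = X_val @ true_coef + rng.normal(scale=0.5, size=80)

results = []
for alpha in [0.0, 1.0, 10.0, 100.0]:
    coef, intercept = fit_ridge(X_tr, y_tr, alpha)
    results.append({
        "Model": f"ridge_alpha_{alpha}",  # name carries the hyperparameter
        "Train_RMSE": rmse(y_tr, X_tr @ coef + intercept),
        "Validate_RMSE": rmse(y_val, X_val @ coef + intercept),
    })

# Pick the model with the lowest validate error.
best = min(results, key=lambda row: row["Validate_RMSE"])
```

Putting the hyperparameter in the model name keeps every run distinguishable once the rows are stacked into one comparison table.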

### 7. "Choose" best-performing model
* Plot y by yhat
* *After MVP*: Add features, use k-best or RFE to determine which features to include
* *After MVP*: Loop model-performance using new feature set and suitable names
* *After MVP*: "Choose" best-performing model
* **Present**: 
    * model-performance function
    * baseline performance
    * MVP model performance
    * After-MVP model performance
    * model selected

# #6. Revisit Steps #3, #4, and #5 with more features than the MVP restriction

* Complete these steps:
    * Run at least 1 t-test and 1 correlation test (but as many as you need!)
    * Visualize all combinations of variables in some way(s).
    * What independent variables are correlated with the dependent?
    * Which independent variables are correlated with other independent variables?
* Run all *After MVP* steps
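
The two correlation questions above can be answered from a single correlation matrix. A minimal sketch on synthetic data, using Spearman correlation to match the test used earlier; the column names follow the notebook, but the values are generated and not from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data where everything scales with home size.
rng = np.random.default_rng(123)
size = rng.uniform(800, 3500, 300)
demo = pd.DataFrame({
    "FinishedSize": size,
    "Beds": np.round(size / 600 + rng.normal(0, 0.5, 300)),
    "Baths": np.round(size / 900 + rng.normal(0, 0.4, 300)),
    "Worth": size * 200 + rng.normal(0, 80_000, 300),
})

corr = demo.corr(method="spearman")

# Independent variables ranked by correlation with the dependent variable.
with_target = corr["Worth"].drop("Worth").sort_values(ascending=False)

# Correlations among the independent variables only.
among_features = corr.drop(index="Worth", columns="Worth")
```

Passing `among_features` to `sns.heatmap` with `annot=True` would be one way to cover the "visualize all combinations" item.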

# #7. Push work and findings to a slide deck
- Practice/script the presentation
- Present!

# Notes to self
- "You will want to make sure you are using the best fields to represent square feet of home, number of bedrooms, and number of bathrooms. "Best" meaning the most accurate and available information. Here you will need to do some data investigation in the database and use your domain expertise to make some judgement calls."
- "Brainstorming ideas and form hypotheses related to how variables might impact or relate to each other, both within independent variables and between the independent variables and dependent variable."
- "Document any ideas for new features you may have while first looking at the existing variables and the project goals ahead of you."
- "Add a data dictionary in your notebook at this point that defines all the fields used in your model and your analysis and answers the question, "Why did you use the fields you used?". e.g. "Why did you use bedroom_field1 over bedroom_field2?", not, "Why did you use number of bedrooms?""