<div class='alert alert-block alert-success'>
    
# Exercises (Exploration (Regression))

<hr style='border:2px solid green'>

Our Zillow scenario continues:

As a Codeup data science graduate, you want to show off your skills to the Zillow data science team in hopes of getting an interview for a position you saw pop up on LinkedIn. You thought it might look impressive to build an end-to-end project in which you use some of their Kaggle data to predict `property values` from the available features; who knows, you might even do some **feature engineering** to blow them away. Your goal is to predict the values of single-unit properties using the observations from 2017.

In these exercises, you will run through the stages of exploration as you continue to work toward the above goal. Use only your train dataset to explore the relationships among the independent variables and between each independent variable and your target variable.

In [1]:
import os
#standard ds imports
import pandas as pd
import numpy as np

#visualization imports
import matplotlib.pyplot as plt
import seaborn as sns

#stats imports
from scipy.stats import pearsonr, spearmanr, ttest_ind

#custom modules
import wrangle as w

#remove pink warning box
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = w.wrangle_zillow()

In [3]:
df.head(3)

Unnamed: 0,bathrooms,bedrooms,area,county,tax_amount,tax_value,year_built
4,2.0,4,3633,LA,6941.39,296425,2005
6,4.0,3,1620,LA,10244.94,847770,2011
7,2.0,3,2077,LA,7924.68,646760,1926


In [4]:
train, validate, test = w.splitting_data(df)

train ----> (1284141, 7) 60%
validate -> (428047, 7)  20%
test -----> (428047, 7)  20%


In [5]:
df.dtypes

bathrooms     float64
bedrooms        int64
area            int64
county         object
tax_amount    float64
tax_value       int64
year_built      int64
dtype: object

In [21]:
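for col in df.columns:
    # Note: value_counts().nunique() counts distinct frequencies, not distinct values;
    # df[col].nunique() would give the number of distinct values in the column
    unique_value_counts = df[col].value_counts().nunique()
    print(f"Number of unique value counts in '{col}': {unique_value_counts}")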

Number of unique value counts in 'bathrooms': 33
Number of unique value counts in 'bedrooms': 19
Number of unique value counts in 'area': 1389
Number of unique value counts in 'county': 3
Number of unique value counts in 'tax_amount': 34
Number of unique value counts in 'tax_value': 405
Number of unique value counts in 'year_built': 141


<div class="alert alert-block alert-info"> 

* I wrote a sample function, `wrangle_zillow_sample`, that returns a smaller version of the Zillow data (`df_sample`)
* It pulls 10,000 rows before dropping nulls, compared with over 2,000,000 rows in the full dataset

In [9]:
df_sample = w.wrangle_zillow_sample()

In [12]:
df_sample.head(3)

Unnamed: 0,bathrooms,bedrooms,area,county,tax_amount,tax_value,year_built
4,2.0,4,3633,LA,6941.39,296425,2005
6,4.0,3,1620,LA,10244.94,847770,2011
7,2.0,3,2077,LA,7924.68,646760,1926


In [11]:
df_sample.shape

(9951, 7)

In [23]:
train_sample, validate_sample, test_sample = w.splitting_data(df_sample)

train ----> (5970, 7) 60%
validate -> (1990, 7)  20%
test -----> (1991, 7)  20%


In [24]:
train_sample.head(3)

Unnamed: 0,bathrooms,bedrooms,area,county,tax_amount,tax_value,year_built
1675,3.0,4,2107,LA,6604.44,528239,1967
4653,2.0,2,1416,LA,2906.84,229888,1923
1136,2.0,3,1753,LA,3342.26,272783,1970


#### Target variable?
* `tax_value`
* continuous variable (stored as int64), so this is a regression problem

#### Other variables of value?
* `Continuous features` - bathrooms, bedrooms, area, tax_amount, and year_built (tax_value is the target, not a feature)
* `Categorical features` - county
    * possibly bedrooms, since it takes relatively few discrete values

<hr style='border:2px solid green'>

## 1. Write a function named plot_variable_pairs that accepts a dataframe as input and plots all of the pairwise relationships along with the regression line for each pair.

In [32]:
def plot_variable_pairs(df):
    """
    - Accepts a dataframe as input
    - Plots each pairwise relationship between the numeric columns
      with a regression line (displays the figure; returns nothing)
    """
    # Set the style of the plot
    sns.set_theme(style="ticks")

    # Plot the pairwise relationships; draw the regression line in red so it stands out
    sns.pairplot(df, kind="reg", plot_kws={"line_kws": {"color": "red"}})

In [None]:
plot_variable_pairs(df_sample)

## 2. Write a function named plot_categorical_and_continuous_vars that accepts your dataframe and the name of the columns that hold the continuous and categorical features and outputs 3 different plots for visualizing a categorical variable and a continuous variable.
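
One possible shape for this function is sketched below. It is a starting point under stated assumptions, not the expected solution: the choice of box, violin, and bar plots is mine, the `toy` dataframe is made up, and the `Agg` backend line is only there so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_categorical_and_continuous_vars(df, cat_col, cont_col):
    """Draw a box plot, a violin plot, and a bar plot of cont_col grouped by cat_col."""
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    # Three views of the same relationship: spread, distribution shape, and group mean
    sns.boxplot(data=df, x=cat_col, y=cont_col, ax=axes[0])
    sns.violinplot(data=df, x=cat_col, y=cont_col, ax=axes[1])
    sns.barplot(data=df, x=cat_col, y=cont_col, ax=axes[2])

    for ax, title in zip(axes, ["Box plot", "Violin plot", "Bar plot (mean)"]):
        ax.set_title(f"{title}: {cont_col} by {cat_col}")

    fig.tight_layout()
    return fig, axes


# Made-up data with the same column names as the notebook, for illustration only
toy = pd.DataFrame({
    "county": ["LA", "LA", "LA", "LA", "other", "other", "other", "other"],
    "tax_value": [296425, 847770, 646760, 528239, 410000, 515000, 380000, 620000],
})

fig, axes = plot_categorical_and_continuous_vars(toy, "county", "tax_value")
```

In the notebook, the call would presumably look like `plot_categorical_and_continuous_vars(train_sample, "county", "tax_value")`.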

## 3. Save the functions you have written to create visualizations in your explore.py file. Rewrite your notebook code so that you are using the functions imported from this file.

## 4. Use the functions you created above to explore your Zillow train dataset in your explore.ipynb notebook.

## 5. Come up with some initial hypotheses based on your goal of predicting property value.

## 6. Visualize all combinations of variables in some way.

## 7. Run the appropriate statistical tests where needed.
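
As a rough guide, two continuous variables call for a correlation test (`pearsonr` for linear relationships, `spearmanr` for monotonic ones), while comparing a continuous variable across two groups calls for `ttest_ind`. The sketch below uses synthetic data generated on the spot, so the numbers say nothing about Zillow; the variable names and the 0.05 significance level are assumptions for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, ttest_ind

# Synthetic data: value rises roughly linearly with area, plus noise
rng = np.random.default_rng(42)
area = rng.uniform(800, 4000, size=500)
tax_value = 150 * area + rng.normal(0, 50_000, size=500)

alpha = 0.05  # assumed significance level

# Continuous vs. continuous: correlation tests
r, p_pearson = pearsonr(area, tax_value)
rho, p_spearman = spearmanr(area, tax_value)
print(f"Pearson r = {r:.2f}, reject H0: {p_pearson < alpha}")
print(f"Spearman rho = {rho:.2f}, reject H0: {p_spearman < alpha}")

# Continuous vs. two groups: independent t-test (Welch's, no equal-variance assumption)
is_large = area > np.median(area)
t_stat, p_ttest = ttest_ind(tax_value[is_large], tax_value[~is_large], equal_var=False)
print(f"t = {t_stat:.2f}, reject H0: {p_ttest < alpha}")
```

On the real data, the same calls would take columns of `train`, for example `pearsonr(train.area, train.tax_value)`.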

## 8. What independent variables are correlated with the dependent variable, home value?
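
One way to get a first answer is to correlate every numeric feature with the target and sort the result. The sketch below uses the six rows visible in the `head()` outputs earlier in this notebook, which are far too few for real conclusions; on the actual exercise, `train` would replace `rows`.

```python
import pandas as pd

# Six rows copied from the head() outputs shown above (illustration only)
rows = pd.DataFrame({
    "bathrooms": [2.0, 4.0, 2.0, 3.0, 2.0, 2.0],
    "bedrooms": [4, 3, 3, 4, 2, 3],
    "area": [3633, 1620, 2077, 2107, 1416, 1753],
    "county": ["LA", "LA", "LA", "LA", "LA", "LA"],
    "tax_amount": [6941.39, 10244.94, 7924.68, 6604.44, 2906.84, 3342.26],
    "tax_value": [296425, 847770, 646760, 528239, 229888, 272783],
    "year_built": [2005, 2011, 1926, 1967, 1923, 1970],
})

target = "tax_value"

# Keep numeric columns only (drops county), then correlate each feature with the target
numeric = rows.select_dtypes(include="number")
corr_with_target = (
    numeric.drop(columns=target)
    .corrwith(numeric[target])      # Pearson correlation by default
    .sort_values(ascending=False)
)
print(corr_with_target)
```

A correlation coefficient alone does not establish significance, so pairs of interest still deserve a proper test.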

## 9. Which independent variables are correlated with other independent variables (bedrooms, bathrooms, year built, square feet)?
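
A correlation matrix lists every feature pair twice and includes each feature against itself, so it helps to keep only the upper triangle and rank the pairs by strength. This is a minimal sketch using the same six rows from the `head()` outputs earlier in this notebook (too few for real conclusions); `train` would take the place of `features`.

```python
import numpy as np
import pandas as pd

# Features only, copied from the head() outputs shown above (illustration only)
features = pd.DataFrame({
    "bathrooms": [2.0, 4.0, 2.0, 3.0, 2.0, 2.0],
    "bedrooms": [4, 3, 3, 4, 2, 3],
    "area": [3633, 1620, 2077, 2107, 1416, 1753],
    "tax_amount": [6941.39, 10244.94, 7924.68, 6604.44, 2906.84, 3342.26],
    "year_built": [2005, 2011, 1926, 1967, 1923, 1970],
})

corr = features.corr()  # Pearson correlation matrix

# Boolean mask for the upper triangle; k=1 excludes the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One row per unique feature pair, strongest absolute correlation first
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)
print(pairs)
```

Strongly correlated feature pairs are worth noting, since they carry overlapping information into a regression model.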

## 10. Make sure to document your takeaways from visualizations and statistical tests as well as the decisions you make throughout your process.

## 11. Explore your dataset with any other visualizations you think will be helpful.

<div class="alert alert-block alert-info"> 

# Bonus Exercise

## 1. In a separate notebook called explore_mall, use the functions you have developed in this exercise with the mall_customers dataset in the Codeup database server. You will need to write a SQL query to acquire your data. Make spending_score your target variable.