# Data acquisition 
What this notebook does:
- get the connection URL for the Zillow database on the SQL server
- write a SQL query that meets the project requirements
- use the query to load the Zillow data into a pandas DataFrame
- cache that DataFrame as a CSV file

##### Project Requirements:
- ML regression model that predicts `taxvaluedollarcnt` of **Single Family Properties**
- Use properties that had a transaction in 2017
- Tables: 
    - properties_2017
    - predictions_2017
    - propertylandusetype
- Suggested features:
    - square footage (SQFT)
    - number of bedrooms
    - number of bathrooms
    - engineered feature (new column): number of rooms
- Create a table that tells the zillow data team which state and county the houses are located in (fips)
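
The last two requirements can be sketched on toy data. This is only an illustration: the rows are made up, `total_rooms` is a hypothetical column name, and defining it as bedrooms plus bathrooms is an assumption. The three FIPS codes are the Census codes for Los Angeles, Orange, and Ventura counties in California; whether all three occur in this data set is also an assumption.

```python
import pandas as pd

# Toy rows shaped like the query result; the values are illustrative only.
homes = pd.DataFrame({
    "bedroomcnt": [4.0, 3.0, 2.0],
    "bathroomcnt": [2.0, 2.5, 1.0],
    "fips": [6037.0, 6059.0, 6111.0],
})

# Assumed definition: total rooms = bedrooms + bathrooms.
homes["total_rooms"] = homes["bedroomcnt"] + homes["bathroomcnt"]

# Census FIPS codes for three California counties (state code 06).
fips_lookup = pd.DataFrame({
    "fips": [6037, 6059, 6111],
    "state": ["CA", "CA", "CA"],
    "county": ["Los Angeles", "Orange", "Ventura"],
})

# fips arrives as a float; cast to int so the merge keys match cleanly.
homes["fips"] = homes["fips"].astype(int)
located = homes.merge(fips_lookup, on="fips", how="left")
print(located)
```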

In [1]:
# imports:
import os

import pandas as pd
import numpy as np

# personal modules
import env
import wrangle as wr

In [2]:
# function to get the connection URL:
def get_db_url(db, user=env.user, host=env.host, password=env.password):
    """
    This function will:
    - take credentials from the env.py file
    - build a connection URL for the given SQL database
    - return that URL as a string
    """
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'
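
One caveat with building the URL from an f-string: a password containing characters such as `@` or `/` can break URL parsing. A minimal sketch of one way around that is shown below, assuming the credentials may contain such characters. `build_db_url` is a hypothetical variant, not part of this project, and the credentials and host are placeholders.

```python
from urllib.parse import quote_plus

def build_db_url(db, user, host, password):
    """Hypothetical variant of get_db_url that URL-escapes the credentials."""
    return f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}@{host}/{db}"

# Placeholder credentials and host, not real ones.
example_url = build_db_url("zillow", "some_user", "db.example.com", "p@ss/word")
print(example_url)
```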

In [3]:
# get the connection URL for the zillow database
url = get_db_url('zillow')

In [4]:
# create the zillow query (already checked that it runs in a SQL client)
sql_query = '''
SELECT 
    id,
    transactiondate,
    bathroomcnt,
    bedroomcnt,
    calculatedfinishedsquarefeet,
    fips,
    taxvaluedollarcnt,
    propertylandusetypeid,
    propertylandusedesc
FROM 
    predictions_2017
LEFT JOIN properties_2017
    USING (id)
LEFT JOIN propertylandusetype
    USING (propertylandusetypeid)
WHERE propertylandusetypeid = 261
'''
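
The requirements ask for properties with a transaction in 2017, while the query above only filters on the land-use type. A minimal sketch of how a date filter could be added is shown below, using an in-memory SQLite database with made-up rows so it runs without the real server. It assumes `transactiondate` is stored as an ISO-formatted date string; the tables are cut down to a few columns and the rows are purely illustrative.

```python
import sqlite3

# In-memory toy database with reduced versions of the three tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE predictions_2017 (id INTEGER, transactiondate TEXT);
CREATE TABLE properties_2017 (id INTEGER, bedroomcnt REAL, propertylandusetypeid INTEGER);
CREATE TABLE propertylandusetype (propertylandusetypeid INTEGER, propertylandusedesc TEXT);
INSERT INTO predictions_2017 VALUES (1, '2017-01-01'), (2, '2017-06-15'), (3, '2018-01-03');
INSERT INTO properties_2017 VALUES (1, 3.0, 261), (2, 2.0, 266), (3, 4.0, 261);
INSERT INTO propertylandusetype VALUES (261, 'Single Family Residential'), (266, 'Condominium');
""")

# Same join shape as the notebook query, plus a 2017 date-range filter.
rows = conn.execute("""
SELECT id, transactiondate, bedroomcnt, propertylandusedesc
FROM predictions_2017
LEFT JOIN properties_2017 USING (id)
LEFT JOIN propertylandusetype USING (propertylandusetypeid)
WHERE propertylandusetypeid = 261
  AND transactiondate BETWEEN '2017-01-01' AND '2017-12-31'
""").fetchall()
conn.close()
print(rows)
```

Only the first toy row survives both filters: the second has a different land-use type and the third falls outside 2017.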

In [5]:
# acquire the zillow data
df = pd.read_sql(sql_query, url)
df.head()

| | id | transactiondate | bathroomcnt | bedroomcnt | calculatedfinishedsquarefeet | fips | taxvaluedollarcnt | propertylandusetypeid | propertylandusedesc |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2017-01-01 | 0.0 | 0.0 | NaN | 6037.0 | 27516.0 | 261.0 | Single Family Residential |
| 1 | 15 | 2017-01-02 | 0.0 | 0.0 | NaN | 6037.0 | 10.0 | 261.0 | Single Family Residential |
| 2 | 16 | 2017-01-02 | 0.0 | 0.0 | NaN | 6037.0 | 10.0 | 261.0 | Single Family Residential |
| 3 | 17 | 2017-01-02 | 0.0 | 0.0 | NaN | 6037.0 | 2108.0 | 261.0 | Single Family Residential |
| 4 | 20 | 2017-01-02 | 2.0 | 4.0 | 3633.0 | 6037.0 | 296425.0 | 261.0 | Single Family Residential |


In [6]:
# wrap the query in a function:
def new_zillow_data():
    '''
    This function will:
    - run a fixed SQL query against the zillow database
    - return a dataframe with the query result
    '''

    zillow_query = '''
    SELECT 
        id,
        transactiondate,
        bathroomcnt,
        bedroomcnt,
        calculatedfinishedsquarefeet,
        fips,
        taxvaluedollarcnt,
        propertylandusetypeid,
        propertylandusedesc,
        yearbuilt
        
    FROM 
        predictions_2017
    LEFT JOIN properties_2017
        USING (id)
    LEFT JOIN propertylandusetype
        USING (propertylandusetypeid)
    WHERE propertylandusetypeid = 261
        '''
        
    # read in the dataframe from codeup
    df = pd.read_sql(zillow_query, get_db_url('zillow'))
    
    return df

In [7]:
df = new_zillow_data()
df.head()

| | id | transactiondate | bathroomcnt | bedroomcnt | calculatedfinishedsquarefeet | fips | taxvaluedollarcnt | propertylandusetypeid | propertylandusedesc | yearbuilt |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2017-01-01 | 0.0 | 0.0 | NaN | 6037.0 | 27516.0 | 261.0 | Single Family Residential | NaN |
| 1 | 15 | 2017-01-02 | 0.0 | 0.0 | NaN | 6037.0 | 10.0 | 261.0 | Single Family Residential | NaN |
| 2 | 16 | 2017-01-02 | 0.0 | 0.0 | NaN | 6037.0 | 10.0 | 261.0 | Single Family Residential | NaN |
| 3 | 17 | 2017-01-02 | 0.0 | 0.0 | NaN | 6037.0 | 2108.0 | 261.0 | Single Family Residential | NaN |
| 4 | 20 | 2017-01-02 | 2.0 | 4.0 | 3633.0 | 6037.0 | 296425.0 | 261.0 | Single Family Residential | 2005.0 |


In [8]:
# cache the data set as a csv
def get_zillow_data():
    '''
    This function will check for a zillow.csv file.
    If it exists, it reads the data from that file.
    Otherwise it queries the database and caches the result to zillow.csv.
    '''
    
    if os.path.isfile('zillow.csv'):
        #if csv file exists read in data from csv file:
        df = pd.read_csv('zillow.csv', index_col = 0)
        
    else:
        
        #read the fresh data from db into a dataframe
        df = new_zillow_data()
        
        #cache data:
        df.to_csv('zillow.csv')
    
    return df
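
The caching pattern above can be tried without a database connection. The sketch below is a generalized version under stated assumptions: `get_cached` and `fetch_toy_data` are hypothetical names, the rows are made up, and a temporary directory stands in for the project folder. It also shows why `index_col=0` is used on the read: `to_csv` writes the index as the first column, and reading it back as the index avoids an extra `Unnamed: 0` column.

```python
import os
import tempfile

import pandas as pd

def fetch_toy_data():
    """Hypothetical stand-in for new_zillow_data(); returns made-up rows."""
    return pd.DataFrame({"id": [1, 15], "bedroomcnt": [0.0, 4.0]})

def get_cached(path, fetch):
    """Read path if it exists; otherwise call fetch() and cache the result."""
    if os.path.isfile(path):
        return pd.read_csv(path, index_col=0)
    df = fetch()
    df.to_csv(path)
    return df

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    first = get_cached(csv_path, fetch_toy_data)   # runs fetch, writes the CSV
    second = get_cached(csv_path, fetch_toy_data)  # reads the cached CSV

print(second)
```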

In [9]:
df = get_zillow_data()
df.head()

| | id | transactiondate | bathroomcnt | bedroomcnt | calculatedfinishedsquarefeet | fips | taxvaluedollarcnt | propertylandusetypeid | propertylandusedesc | yearbuilt |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2017-01-01 | 0.0 | 0.0 | NaN | 6037.0 | 27516.0 | 261.0 | Single Family Residential | NaN |
| 1 | 15 | 2017-01-02 | 0.0 | 0.0 | NaN | 6037.0 | 10.0 | 261.0 | Single Family Residential | NaN |
| 2 | 16 | 2017-01-02 | 0.0 | 0.0 | NaN | 6037.0 | 10.0 | 261.0 | Single Family Residential | NaN |
| 3 | 17 | 2017-01-02 | 0.0 | 0.0 | NaN | 6037.0 | 2108.0 | 261.0 | Single Family Residential | NaN |
| 4 | 20 | 2017-01-02 | 2.0 | 4.0 | 3633.0 | 6037.0 | 296425.0 | 261.0 | Single Family Residential | 2005.0 |


### Initial Questions from looking at the dataset
Could these features be factors?
- `yearbuilt`: How old is the house?
- `lotsizesquarefeet`: How big is the property?
- `numberofstories`: Does the number of stories influence the purchase? (outliers need to be removed first to bring the data closer to normal)

Features that were looked into but don't look reliable (outliers):
- `fullbathcnt`: This is the same as `bathroomcnt`
- `roomcnt`: How many rooms are there, and is this even accurate? (not reliable)
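
The two reliability concerns above could be checked with a couple of quick ratios. The sketch below uses made-up rows, so the numbers say nothing about the real data. The variable names are hypothetical, and treating a `roomcnt` of 0 alongside recorded bedrooms as suspect is an assumption about what "not reliable" might look like.

```python
import pandas as pd

# Made-up rows; the real columns would come from properties_2017.
toy = pd.DataFrame({
    "bathroomcnt": [2.0, 2.5, 1.0, 3.0],
    "fullbathcnt": [2.0, 2.0, 1.0, 3.0],
    "roomcnt": [0.0, 0.0, 5.0, 0.0],
    "bedroomcnt": [4.0, 3.0, 2.0, 5.0],
})

# Share of rows where the two bathroom columns agree exactly.
bath_match_rate = (toy["fullbathcnt"] == toy["bathroomcnt"]).mean()

# Share of rows where roomcnt is 0 even though bedrooms are recorded.
suspect_room_rate = ((toy["roomcnt"] == 0) & (toy["bedroomcnt"] > 0)).mean()

print(bath_match_rate, suspect_room_rate)
```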