# Zillow Dataset Cleaning - First Step
Use this as a reference; you **do not** need to run these cells.
- [Link to the original dataset](https://www.zillow.com/research/data/) from Zillow's website
- Select: `ZORI All Homes Plus Multifamily Smoothed`

In [None]:
# Imports
import pandas as pd
import numpy as np

In [None]:
# Connect to data
FILEPATH = 'https://raw.githubusercontent.com/Lambda-School-Labs/cityspire-a-ds/main/notebooks/Rental_Data/Original_Zillow.csv'
zillow_original = pd.read_csv(FILEPATH)

In [None]:
# Function to clean up the dataset + yearly average columns

def zillow_cleaner(df):
  ''' Convert a dataset from Zillow into something more usable for a
  predictive model: drop ID columns, rename the metro column, and add one
  average column per year.
  '''

  # Drop columns
  df = df.drop(['RegionID', 'RegionName', 'SizeRank'], axis=1)

  # Change column name
  df = df.rename(columns={"MsaName": "City/State"})

  # Creating yearly average columns (2014 through 2020).
  # After the drop above, column 0 is 'City/State' and each year occupies 12
  # monthly columns. Each inserted average shifts later positions by one,
  # so every year block is 13 columns wide.
  for i, year in enumerate(range(2014, 2021)):
    start = 1 + 13 * i
    month_columns = list(df)[start:start + 12]
    df.insert(start + 12, f'{year}_Average',
              df[month_columns].mean(axis=1), True)

  return df
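
The function above relies on fixed column positions, so it only works if the file starts in January 2014 and has no missing months. As a hedged alternative, the sketch below selects the monthly columns by name instead. It assumes the date columns start with the year followed by a hyphen (for example `2014-01`), which is not confirmed by this notebook, and `add_yearly_averages` is a hypothetical helper, not part of the original code. Unlike the original, it appends the averages at the end of the dataframe rather than after each year's months.

```python
import pandas as pd

def add_yearly_averages(df, years=range(2014, 2021)):
    """Append a '<year>_Average' column for every year that has monthly columns.

    Assumption: monthly column names start with '<year>-', e.g. '2014-01'.
    """
    df = df.copy()
    for year in years:
        month_cols = [c for c in df.columns if str(c).startswith(f"{year}-")]
        if month_cols:  # skip years that are absent from the data
            # mean() skips NaN by default, as in the original function
            df[f"{year}_Average"] = df[month_cols].mean(axis=1)
    return df

# Toy data standing in for the Zillow file
toy = pd.DataFrame({
    "City/State": ["A, TX", "B, CA"],
    "2014-01": [1000.0, 2000.0],
    "2014-02": [1100.0, None],
    "2015-01": [1200.0, 2200.0],
})
result = add_yearly_averages(toy)
print(result)
```

Selecting by name keeps the averages correct if Zillow adds or removes months, at the cost of depending on the column naming scheme.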

In [None]:
# Run cleaning function
zillow_original = zillow_cleaner(zillow_original)

# Check it out
print(zillow_original.shape)
zillow_original.head()

[GeeksforGeeks](https://www.geeksforgeeks.org/python-pandas-dataframe-interpolate) - Interpolate Function

In [None]:
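# Fill NaNs with linear interpolation, a way to estimate instead of dropping.
# Note: axis=0 interpolates down each column (between neighboring rows), and
# limit_area='inside' only fills NaNs that have valid values on both sides.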
zillow_original.interpolate(method='linear', axis=0, inplace=True, limit_direction='both', limit_area='inside', downcast=None)

# Inspect
zillow_original.info()

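Not all of the NaNs were converted to a numerical value based on their neighbors. The initial assumption was that `interpolate()` has a hard time defining a value when several NaNs occur in sequence. A more likely cause is the `limit_area='inside'` argument: it restricts filling to NaNs surrounded by valid values, so gaps at the start or end of a column are left unfilled.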
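
One possible explanation for the leftover NaNs is the `limit_area='inside'` argument rather than the runs of consecutive NaNs. The sketch below uses a made-up series (not the Zillow data) to compare the call with and without that argument.

```python
import numpy as np
import pandas as pd

# Toy series: a leading NaN, two consecutive NaNs in the middle, a trailing NaN
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Same options as the notebook call: only NaNs between valid values are filled
inside_only = s.interpolate(method="linear", limit_direction="both",
                            limit_area="inside")

# Without limit_area, the leading and trailing NaNs are filled as well
no_area_limit = s.interpolate(method="linear", limit_direction="both")

print(inside_only.tolist())
print(no_area_limit.tolist())
```

In this toy case the consecutive NaNs in the middle are filled either way; only the edge NaNs survive when `limit_area='inside'` is set.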

There were only a handful of these, so they were corrected manually. Each remaining NaN was replaced with the average of its nearest four neighbors.
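
The manual correction could be approximated in code. The sketch below is only an illustration under an assumption: "nearest four neighbors" is taken to mean the two values before and the two values after the NaN in the same row. The notebook does not say which neighbors were actually used, and `fill_with_neighbor_mean` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def fill_with_neighbor_mean(row, window=2):
    """Replace each NaN with the mean of up to `window` values on each side.

    Assumption: neighbors are adjacent entries of the same row.
    """
    values = row.to_numpy(dtype=float)
    filled = values.copy()
    for i in np.flatnonzero(np.isnan(values)):
        lo, hi = max(0, i - window), min(len(values), i + window + 1)
        neighbors = values[lo:hi]
        neighbors = neighbors[~np.isnan(neighbors)]  # ignore other NaNs
        if neighbors.size:  # leave the NaN if no valid neighbor exists
            filled[i] = neighbors.mean()
    return pd.Series(filled, index=row.index)

# Made-up monthly rents with one gap
row = pd.Series([1000.0, 1010.0, np.nan, 1030.0, 1040.0],
                index=["2014-01", "2014-02", "2014-03", "2014-04", "2014-05"])
fixed = fill_with_neighbor_mean(row)
print(fixed)
```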

# Zillow Dataset Cleaning - Phase 2
- The dataset below has the NaNs corrected manually

In [None]:
# Connect to data
FILEPATH = 'https://raw.githubusercontent.com/Lambda-School-Labs/cityspire-a-ds/main/notebooks/Rental_Data/Corrected_Zillow.csv'
zillow_corrected = pd.read_csv(FILEPATH)

## Step 1: Break apart the state from the city list

This dataset contains observations that cover multiple cities, for example *Dallas-Fort Worth*.

Thus `explode` is used to separate such an observation into one row per city, giving the model more rows to work with.

In order to use this function, the `City/State` column has to be split into two separate columns.

[Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html) - *Explode Function*
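
As a quick illustration of the split-then-explode idea, here is a minimal sketch on made-up rows (the values are not from the Zillow file):

```python
import pandas as pd

# Toy data: one multi-city observation and one single-city observation
toy = pd.DataFrame({
    "City/State": ["Dallas-Fort Worth, TX", "Austin, TX"],
    "2020_Average": [1500.0, 1400.0],
})

# Separate the city part from the state part
toy[["City", "State"]] = toy["City/State"].str.split(", ", expand=True)
toy = toy.drop(columns="City/State")

# Turn 'Dallas-Fort Worth' into a list, then give each city its own row
toy["City"] = toy["City"].str.split("-")
toy = toy.explode("City").reset_index(drop=True)
print(toy)
```

Each exploded row carries the same rent values as the observation it came from. Splitting on `-` would also break apart city names that themselves contain a hyphen, so the result is worth inspecting.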

In [None]:
# Split 'city/state' so the cities can be exploded
zillow_corrected[['City', 'State']] = zillow_corrected['City/State'].str.split(', ', expand=True)
 
# Drop the combined column; it is rebuilt as 'City_State' in Step 3
zillow_corrected = zillow_corrected.drop('City/State', axis=1)
 
# Explode the rows with multiple cities in an observation
zillow_corrected['City'] = zillow_corrected['City'].str.split('-')
zillow_corrected = zillow_corrected.explode('City')

## Step 2: Convert abbreviated state names into full names
- [GitHub](https://gist.githubusercontent.com/rogerallen/1583593/raw/0fffdee6149ab1d993dffa51b1fa9aa466704e18/us_state_abbrev.py) - US State dictionary used below
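
`Series.map` returns NaN for any value that is missing from the dictionary, so it can be worth checking for unmapped codes after the conversion. The sketch below uses a shortened dictionary and a made-up code `ZZ` purely for illustration.

```python
import pandas as pd

# Shortened name -> code dictionary, flipped to code -> name
name_to_code = {"Texas": "TX", "California": "CA"}
code_to_name = {code: name for name, code in name_to_code.items()}

states = pd.Series(["TX", "CA", "ZZ"])  # 'ZZ' is a deliberately invalid code
full_names = states.map(code_to_name)

# Codes that did not match any dictionary key end up as NaN
unmapped = states[full_names.isna()].unique().tolist()
print(full_names.tolist())
print("Unmapped codes:", unmapped)
```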

In [None]:
us_state_abbrev = {
    'Alabama': 'AL', 'Alaska': 'AK', 'American Samoa': 'AS', 'Arizona': 'AZ',
    'Arkansas': 'AR', 'California': 'CA', 'Colorado': 'CO', 'Connecticut': 'CT',
    'Delaware': 'DE', 'District of Columbia': 'DC', 'Florida': 'FL',
    'Georgia': 'GA', 'Guam': 'GU', 'Hawaii': 'HI', 'Idaho': 'ID',
    'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA', 'Kansas': 'KS',
    'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD',
    'Massachusetts': 'MA', 'Michigan': 'MI', 'Minnesota': 'MN',
    'Mississippi': 'MS', 'Missouri': 'MO', 'Montana': 'MT', 'Nebraska': 'NE',
    'Nevada': 'NV', 'New Hampshire': 'NH', 'New Jersey': 'NJ',
    'New Mexico': 'NM', 'New York': 'NY', 'North Carolina': 'NC',
    'North Dakota': 'ND', 'Northern Mariana Islands':'MP', 'Ohio': 'OH',
    'Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA', 'Puerto Rico': 'PR',
    'Rhode Island': 'RI', 'South Carolina': 'SC', 'South Dakota': 'SD',
    'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Vermont': 'VT',
    'Virgin Islands': 'VI', 'Virginia': 'VA', 'Washington': 'WA',
    'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY'
}
 
# Flip the dictionary, need the state codes as keys
us_state_abbrev = {value: key for key, value in us_state_abbrev.items()}
 
# Convert the state codes to state names
zillow_corrected['State'] = zillow_corrected['State'].map(us_state_abbrev)


## Step 3: Re-combine the city and state information

- New column will be `City_State`

In [None]:
# Combine and insert at the front of the dataframe
zillow_corrected.insert(loc=0, column='City_State', 
                        value=(zillow_corrected['City'] + ', ' + 
                               zillow_corrected['State']))
 
# Delete our temp columns
zillow_corrected = zillow_corrected.drop(['City', 'State'], axis=1)

## Step 4: Handle duplicate city-state entries

The original dataset has multiple observations for the same city. This is because a city can have multiple ZIP codes, and the data was collected by ZIP code.

Grouping these observations into one row is ideal. Averaging across a city's ZIP codes retains the most data.
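
A minimal sketch of the grouping step on made-up rows, to show what the result looks like:

```python
import pandas as pd

# Toy data: two observations for the same city, one for another
toy = pd.DataFrame({
    "City_State": ["Dallas, Texas", "Dallas, Texas", "Austin, Texas"],
    "2020_Average": [1500.0, 1601.0, 1400.0],
})

# One row per city, holding the mean of its observations
averaged = toy.groupby("City_State").mean(numeric_only=True).round(2)
print(averaged)
```

After `groupby`, `City_State` becomes the index of the result; calling `reset_index()` turns it back into a regular column if that is preferred.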


In [None]:
# Average across the city
zillow_corrected = zillow_corrected.groupby('City_State').mean()

# Round to 2 decimal places
zillow_corrected = zillow_corrected.round(decimals=2)