# Zillow Dataset Cleaning - First Step
 Use this as a reference; you **do not** need to run these cells.
 - [Link to the original dataset](https://www.zillow.com/research/data/) from Zillow's website
 - Select: `ZORI All Homes Plus Multifamily Smoothed` 

In [None]:
# Imports
import pandas as pd
import numpy as np

In [None]:
# Connect to data
FILEPATH = 'https://raw.githubusercontent.com/Lambda-School-Labs/cityspire-a-ds/main/notebooks/Rental_Data/Original_Zillow.csv'
zillow_original = pd.read_csv(FILEPATH)

In [None]:
# Function to clean up the dataset + yearly average columns

def zillow_cleaner(df):
  ''' A function to convert a dataset from Zillow into something more useable 
  for a predictive model.
  '''

  # Drop columns
  df = df.drop(['RegionID', 'RegionName', 'SizeRank'], axis=1)

  # Change column name
  df = df.rename(columns={"MsaName": "City/State"})

  # Creating yearly average columns
  # 2014 averages
  fourteen_columns = list(df)[1:13]
  df.insert(13, '2014_Average',
                    df[fourteen_columns].mean(axis=1), True)
  # 2015 averages
  fifteen_columns = list(df)[14:26]
  df.insert(26, '2015_Average',
                    df[fifteen_columns].mean(axis=1), True)
  # 2016 averages
  sixteen_columns = list(df)[27:39]
  df.insert(39, '2016_Average',
                    df[sixteen_columns].mean(axis=1), True)
  # 2017 averages
  seventeen_columns = list(df)[40:52]
  df.insert(52, '2017_Average',
                    df[seventeen_columns].mean(axis=1), True)
  # 2018 averages
  eighteen_columns = list(df)[53:65]
  df.insert(65, '2018_Average',
                    df[eighteen_columns].mean(axis=1), True)
  # 2019 averages
  nineteen_columns = list(df)[66:78]
  df.insert(78, '2019_Average',
                    df[nineteen_columns].mean(axis=1), True)
  # 2020 averages
  twentytwenty_columns = list(df)[79:91]
  df.insert(91, '2020_Average',
                    df[twentytwenty_columns].mean(axis=1), True)

  return df
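
The cleaning function above selects each year's months by column position, which silently breaks if a column is added or removed upstream. As a rough alternative, the sketch below selects the month columns by name instead. It assumes the monthly columns are labelled `YYYY-MM` (for example `2014-01`), which is not confirmed here; `add_yearly_averages` and the toy data are hypothetical. Unlike the original, it appends each average at the end of the frame rather than inserting it after December.

```python
import pandas as pd

def add_yearly_averages(df, years=range(2014, 2021)):
    """Append one '<year>_Average' column per year, matching month columns by name."""
    df = df.copy()
    for year in years:
        # Assumption: monthly columns are labelled 'YYYY-MM', e.g. '2014-01'
        month_cols = [c for c in df.columns if str(c).startswith(f"{year}-")]
        if month_cols:
            # mean() skips NaN by default, so partially missing years still get a value
            df[f"{year}_Average"] = df[month_cols].mean(axis=1)
    return df

# Toy frame standing in for the Zillow data (values are made up)
toy = pd.DataFrame({
    "City/State": ["A", "B"],
    "2014-01": [1000.0, 2000.0],
    "2014-02": [1100.0, None],
    "2015-01": [1200.0, 2200.0],
})
result = add_yearly_averages(toy)
print(result)
```

Years with no matching columns are skipped, so the toy frame gains only `2014_Average` and `2015_Average`.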

In [None]:
# Run cleaning function
zillow_original = zillow_cleaner(zillow_original)

# Check it out
print(zillow_original.shape)
zillow_original.head()

[GeeksforGeeks](https://www.geeksforgeeks.org/python-pandas-dataframe-interpolate) - Interpolate Function

In [None]:
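# Fill NaNs by linear interpolation, which estimates values instead of dropping rows.
# axis=0 interpolates down each column; limit_area='inside' only fills NaNs
# that have valid values on both sides.
zillow_original.interpolate(method='linear', axis=0, limit_direction='both',
                            limit_area='inside', inplace=True)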

# Inspect
zillow_original.info()

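Not all of the NaNs were replaced with interpolated values. The working assumption was that `interpolate()` has a hard time defining a value when several NaNs occur in sequence. Note, however, that `limit_area='inside'` restricts filling to NaNs that have valid values on both sides, so NaNs at the start or end of a column are left as-is regardless of how many occur in a row.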
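
To see which NaNs survive an `interpolate()` call with these arguments, a small toy series can help. The sketch below is not run against the Zillow data; the series `s` is made up for illustration. It compares the call with and without `limit_area='inside'`.

```python
import numpy as np
import pandas as pd

# Toy series: a leading NaN, two consecutive interior NaNs, and a trailing NaN
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Same arguments as the notebook cell: only NaNs enclosed by valid values are filled
inside_only = s.interpolate(method="linear", limit_direction="both",
                            limit_area="inside")

# Without limit_area, leading and trailing NaNs are filled from the nearest valid value
both_ways = s.interpolate(method="linear", limit_direction="both")

print(inside_only.tolist())
print(both_ways.tolist())
```

In this toy case the consecutive interior NaNs are filled either way, while the first and last entries stay NaN only when `limit_area='inside'` is set.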

There were only a handful of these, so they were corrected manually: each remaining NaN was replaced with the average of its four nearest neighbors.
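
The manual correction could also be scripted. The sketch below is one possible reading of "nearest four neighbors": the four closest non-NaN values by position within a single column. That interpretation, the helper name `fill_with_neighbor_mean`, and the sample values are all assumptions for illustration, not the method actually used here.

```python
import numpy as np
import pandas as pd

def fill_with_neighbor_mean(values, n_neighbors=4):
    """Replace each NaN with the mean of the n nearest non-NaN values by position."""
    values = pd.Series(values, dtype="float64").reset_index(drop=True)
    filled = values.copy()
    known = values.dropna()
    known_pos = np.flatnonzero(values.notna().to_numpy())
    for pos in np.flatnonzero(values.isna().to_numpy()):
        # Rank the known entries by distance from the gap and keep the closest n
        order = np.argsort(np.abs(known_pos - pos), kind="stable")[:n_neighbors]
        # Only original (non-imputed) values feed the average
        filled.iloc[pos] = known.iloc[order].mean()
    return filled

# Made-up column: one interior NaN
single_gap = fill_with_neighbor_mean([10.0, 20.0, np.nan, 40.0, 50.0])

# Made-up column: two leading NaNs, which limit_area='inside' would leave unfilled
leading_gap = fill_with_neighbor_mean([np.nan, np.nan, 10.0, 20.0, 30.0, 40.0, 50.0])

print(single_gap.tolist())
print(leading_gap.tolist())
```

Because the averages draw only on values that were present originally, consecutive NaNs do not influence one another.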