# Introduction

## Business Question

---

> **Guiding question:** What are the top 3 zip codes for short-term investment (based on ROI) and the worst 3 (based on risk) in the city of Pittsburgh, PA?
>
>
> **Evaluation Metric:** ROI/Risk
>
>
> **Dataset:** Zillow data from 1996-2018
>
>
> **Goal:** Determine ROI and risk via time series forecasting 
>
>
> 

---

# Imports

In [None]:
## Data Handling
import pandas as pd
import numpy as np

## Visualizations
import matplotlib as mpl
import matplotlib.pyplot as plt

## Time Series Modeling
import statsmodels
import statsmodels.tsa.api as tsa
from statsmodels.tsa.seasonal import seasonal_decompose

import pmdarima as pmd
from pmdarima.arima import ndiffs
from pmdarima.arima import nsdiffs

## Custom-made Functions
from bmc_functions import eda
from bmc_functions import time_series_modeling as tsm

## Settings
%matplotlib inline
plt.style.use('seaborn-talk')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 100)
%load_ext autoreload
%autoreload 2

## Reading Data

In [None]:
## Reading data
source = '../data/zillow_data.csv'
data = pd.read_csv(source)
data

In [None]:
## Initial inspection
data.info()

## Creating Subset of Zipcodes

---

> The dataset is much larger than I need for my purposes, so I will select only the zip codes for the city of Pittsburgh.
>
>
> To select this data, I will filter the initial dataframe for rows where the `City` column equals "Pittsburgh".

---

In [None]:
## Selecting the city of Pittsburgh 
pitt_df = data[data['City'] == 'Pittsburgh']
pitt_df

In [None]:
## Examining statistics for the new dataframe
eda.report_df(pitt_df).T

# Data Cleaning and Prep

---

> The dataset currently stores the monthly sale prices for each zip code as separate columns (wide format). To use the sale prices for modeling, I will use a custom function provided as part of this project to melt the year/month column labels into a single date column.

---
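
To make the wide-to-long conversion concrete, here is a minimal sketch on a toy frame. The prices and the two-month range are invented for illustration, and the `'%Y-%m'` label format is an assumption about how the month columns are named.

```python
import pandas as pd

# Toy wide-form frame: one row per zip code, one column per month.
# The values are invented for illustration only.
wide = pd.DataFrame({
    'RegionName': [15206, 15201],
    'City': ['Pittsburgh', 'Pittsburgh'],
    '2018-01': [200000.0, 180000.0],
    '2018-02': [201000.0, None],
})

# Move the month columns into rows: one row per (zip code, month) pair.
long = pd.melt(wide, id_vars=['RegionName', 'City'],
               var_name='time', value_name='value')

# Parse the month labels as dates (assumed 'YYYY-MM' format) and drop
# months that have no price.
long['time'] = pd.to_datetime(long['time'], format='%Y-%m')
long = long.dropna(subset=['value']).reset_index(drop=True)

print(long)
```

Each remaining row holds one zip code, one month, and one price, which is the shape the later grouping and resampling steps expect.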

In [None]:
def melt_data(df):
    """
    Takes the zillow_data dataset, or a subset of it, in wide form.
    Returns a long-form dataframe with one row per region and month: the
    date column labels are moved into a datetime 'time' column and the
    prices into a 'value' column.
    
    Rows with a missing 'value' are dropped.
    
    Source: https://github.com/learn-co-curriculum/dsc-phase-4-project/blob/
    main/time-series/starter_notebook.ipynb
    """
    
    melted = pd.melt(df, id_vars=['RegionName', 'RegionID', 'SizeRank','City',
                                  'State', 'Metro', 'CountyName'],
                     var_name='time')
    melted['time'] = pd.to_datetime(melted['time'], infer_datetime_format=True)
    melted = melted.dropna(subset=['value'])
    
    return melted

In [None]:
## Melting the dataframe to move the dates from columns to new rows per zipcode
pitt_melted = melt_data(pitt_df)
pitt_melted

In [None]:
## Confirming conversion to "datetime" datatype
pitt_melted['time']

In [None]:
## Selecting columns to keep for modeling
keep = ['RegionName', 'time', 'value']

In [None]:
## Keeping only modeling-relevant data
pitt_data = pitt_melted[keep]
pitt_data

In [None]:
## Setting datetime index (required for modeling)
pitt_data = pitt_data.set_index('time')
pitt_data

---

> *The following code is adapted from code within [this notebook](https://github.com/flatiron-school/Online-DS-FT-022221-Cohort-Notes/blob/master/Phase_4/topic_37_intro_to_time_series/topic_37_intro_to_time_series_crime_v3-SG.ipynb) by James Irving, Ph.D.*

---

In [None]:
## Creating list of unique zipcodes from the dataframe
zipcodes = list(pitt_data['RegionName'].unique())
zipcodes

In [None]:
## Inspecting first zipcode in list - datetime index and associated sale value
test_code = zipcodes[0]
test_zipcode_series = (pitt_data.groupby('RegionName')
                                .get_group(test_code)['value']
                                .rename(test_code))
test_zipcode_series

In [None]:
## Creating a dictionary to store each zipcode and its timeseries data

zipcodes_dict = {}

for zipcode in zipcodes:
    
    ## Create the series for each zipcode
    zipcode_series = (pitt_data.groupby('RegionName')
                               .get_group(zipcode)['value']
                               .rename(zipcode))
    
    ## Save in zipcode dictionary
    zipcodes_dict[zipcode] = zipcode_series.resample('MS').asfreq()
    
## Display the keys
zipcodes_dict.keys()

In [None]:
## Confirming all zip codes are present in dictionary
list(zipcodes_dict.keys()) == zipcodes

In [None]:
## Inspecting values for one key:value pair
zipcodes_dict[15206]

In [None]:
## reviewing full dataset for Pittsburgh
zipcodes_df_full = pd.DataFrame(zipcodes_dict)
zipcodes_df_full

In [None]:
## Selecting data starting from 2008 onwards
zipcodes_df = zipcodes_df_full.loc['2008':]
zipcodes_df
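
The steps in this section (group by zip code, resample to a monthly frequency, combine into one frame) can be sketched on toy data as follows. The prices and the three-month range are invented; only the pandas calls mirror the cells above.

```python
import pandas as pd

# Toy long-form data indexed by month (values invented for illustration).
idx = pd.to_datetime(['2018-01-01', '2018-02-01', '2018-03-01'])
toy = pd.DataFrame({
    'RegionName': [15206] * 3 + [15201] * 3,
    'value': [200.0, 201.0, 203.0, 180.0, 182.0, 181.0],
}, index=idx.append(idx))
toy.index.name = 'time'

# One monthly series per zip code, keyed by zip code.
series_by_zip = {
    zipcode: group['value'].rename(zipcode).resample('MS').asfreq()
    for zipcode, group in toy.groupby('RegionName')
}

# Columns are zip codes, rows are months.
wide = pd.DataFrame(series_by_zip)
print(wide)
```

Resampling with `'MS'` gives each series an explicit month-start frequency, which time series models generally need in order to label forecast dates.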

# Train/Test Split

In [None]:
## Testing first zipcode from dictionary
zipcode_val = zipcodes_df[15206].copy()
zipcode_val

In [None]:
## Visualizing first zipcode prior to split

fig, ax = plt.subplots(figsize = (12,4))
zipcode_val.plot(ax=ax)
ax.legend()
ax.set_xlabel('Years')
ax.set_ylabel('Sale Price ($)')
ax.set_title(f'Sale Prices for Zipcode {zipcode_val.name}');

In [None]:
## Splitting Data

tts_cutoff = round(zipcode_val.shape[0]*.85)
train = zipcode_val.iloc[:tts_cutoff]
test = zipcode_val.iloc[tts_cutoff:]

## Plot
fig, ax = plt.subplots(figsize = (12,4))
ax = train.plot(label='Train')
ax = test.plot(label='Test')
ax.legend()
ax.set_xlabel('Years')
ax.set_ylabel('Price ($)')
ax.set_title(f'Train/Test Split for Zipcode {zipcode_val.name}')
ax.axvline(train.index[-1], linestyle=":");

In [None]:
## Testing functionalized train/test split for reuse on other zipcodes
train, test, _,_ = tsm.ts_split(zipcode_val, show_vis=True)

In [None]:
## Inspecting training set
len(train)

In [None]:
## Inspecting testing set
len(test)
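
The internals of `tsm.ts_split` are not shown in this notebook. As a rough sketch of the same idea, a chronological split could look like the following; `chronological_split` is a hypothetical helper written for illustration, not the function used above.

```python
import pandas as pd

def chronological_split(series, threshold=0.85):
    """Split a time-ordered series into train/test without shuffling."""
    cutoff = round(len(series) * threshold)
    return series.iloc[:cutoff], series.iloc[cutoff:]

# Toy monthly series (values invented for illustration).
months = pd.date_range('2008-01-01', periods=20, freq='MS')
prices = pd.Series(range(100, 120), index=months, name='toy_zip')

train_toy, test_toy = chronological_split(prices)
print(len(train_toy), len(test_toy))
```

Keeping the split in date order matters here: the test set must come strictly after the training set so that the model is validated on months it has not seen.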

# Stationarity Check

---

> The following functions are adapted from [this notebook](https://github.com/flatiron-school/Online-DS-FT-022221-Cohort-Notes/blob/master/Phase_4/topic_37_intro_to_time_series/topic_37_intro_to_time_series_crime_v3-SG.ipynb) by James Irving, Ph.D.

---

## Dickey-Fuller Test

In [None]:
## Performing Dickey-Fuller Test
zipdf_results = tsa.stattools.adfuller(train)
zipdf_results

In [None]:
## Creating a dictionary to store initial results
index_label =[f'{train.name}']
labels = ['Test Stat','P-Value','Number of Lags Used','Number of Obs. Used',
        'Critical Thresholds', 'AIC Value']
results_dict  = dict(zip(labels,zipdf_results))

## Adding True/False flags: whether the p-value falls below the standard
## .05 level and, if so, whether we reject the null hypothesis of
## non-stationarity (i.e., treat the series as stationary).
results_dict['p < .05'] = results_dict['P-Value']<.05
results_dict['Stationary'] = results_dict['p < .05']

## Creating DataFrame from dictionary
if isinstance(index_label,str):
    index_label = [index_label]
results_dict = pd.DataFrame(results_dict,index=index_label)
results_dict = results_dict[['Test Stat','P-Value','Number of Lags Used',
                             'Number of Obs. Used','p < .05',
                             'Stationary']]

results_dict

In [None]:
## Testing functionality
tsm.adf_test(train)

## Removing Trends, Seasonality

In [None]:
## Testing differenced data
tz_diff = train.diff().dropna()
print("|","---",f"Zipcode {train.name}","---","|","\n")
print(tz_diff)
print('\n\n',"|","----"*5,f"ADF Results for Zipcode {train.name}","-----"*6,"|")
display(tsm.adf_test(tz_diff))

print('\n\n','|',"----"*7,f"Visualizing Difference Shift","---"*7,"|")
fig, ax = plt.subplots()
ax = tz_diff.plot(label='Post-Differencing')
ax.legend()
ax.set_xlabel('Years')
ax.set_ylabel('Price ($)')
ax.set_title(f'Difference Shift for Zipcode {train.name}');

In [None]:
diff_results = tsm.remove_trends(train, "diff")

In [None]:
log_results = tsm.remove_trends(train, "log")

In [None]:
rolling_results = tsm.remove_trends(train, "rolling mean")

In [None]:
ewm_results = tsm.remove_trends(train, "EWM")
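
The code behind `tsm.remove_trends` is not shown, so the following is only a sketch of what the four labels ("diff", "log", "rolling mean", "EWM") commonly refer to. The window of 3 and the half-life of 3 are arbitrary choices for illustration, and the toy series is a pure linear trend.

```python
import numpy as np
import pandas as pd

months = pd.date_range('2008-01-01', periods=36, freq='MS')
# Toy series with a steady upward trend (values invented).
ts = pd.Series(100_000 + 500.0 * np.arange(36), index=months)

detrended = {
    # First difference: month-over-month change.
    'diff': ts.diff().dropna(),
    # Log transform: compresses growth, but keeps the trend on its own.
    'log': np.log(ts),
    # Subtract a trailing 3-month rolling mean.
    'rolling mean': (ts - ts.rolling(window=3).mean()).dropna(),
    # Subtract an exponentially weighted mean.
    'EWM': ts - ts.ewm(halflife=3).mean(),
}

for name, values in detrended.items():
    print(name, round(values.std(), 4))
```

On a purely linear trend, differencing and rolling-mean subtraction both leave a constant series, which is why differencing is a common first attempt before rerunning the stationarity test.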

In [None]:
## Seasonal Decomposition
decomp = seasonal_decompose(train)
decomp.plot();

In [None]:
## Creating Dataframe with seasonality test results

test_results = []
test_results.append(tsm.adf_test(train))

decomp_dict = {"trend": decomp.trend,'seasonal': decomp.seasonal,
               'residuals': decomp.resid}
 
for trend, results in decomp_dict.items():

    results = results.fillna(0)
    res = tsm.adf_test(results)
    test_results.append(res)

## make into a df
seasonality_df = pd.concat(test_results)
seasonality_df

# ACF/PACF Check

In [None]:
tsm.plot_acf_pacf(train, suptitle='ACF/PACF for Training Data');

# SARIMA Modeling and Forecasting

## Auto-ARIMA

In [None]:
# ## Using pmdarima's functions to pre-determine the best values for 
# ## differencing prior to running auto_arima

# n_d = ndiffs(train)
# n_d

# n_D = nsdiffs(train, m=12)
# n_D

In [None]:
# ## Using auto_arima to determine best parameters for modeling
# auto_model = pmd.auto_arima(train,start_p=0,start_q=0,d=n_d,
#                             max_p=3,max_q=3,
#                             max_P=3,max_Q=3, D=n_D,
#                             start_P=0,start_Q=0,
#                             m=12,
#                             verbose=2)

# display(auto_model.summary())
# auto_model.plot_diagnostics(figsize= (12,9));
# plt.tight_layout()

## Fit Best Model

In [None]:
# best_model = tsa.SARIMAX(train,order=auto_model.order,
#                          seasonal_order = auto_model.seasonal_order,
#                          enforce_invertibility=False).fit()

# ## Display Summary + Diagnostics
# display(best_model.summary())
# best_model.plot_diagnostics(figsize=(12,9));
# plt.tight_layout()

In [None]:
# ## Using get_forecast to generate forecasted data
# forecast = best_model.get_forecast(steps=len(test))

# ## Saving confidence intervals and predicted mean for future
# forecast_df = forecast.conf_int()
# forecast_df.columns = ['Lower CI','Upper CI']
# forecast_df['Forecast'] = forecast.predicted_mean
# forecast_df.head(5)

In [None]:
# ## Plotting training, test data and forecasted results
# fig,ax = plt.subplots(figsize=(13,6))

# last_n_lags=12*5         

# train.iloc[-last_n_lags:].plot(label='Training Data')
# test.plot(label='Test Data')

# ## Plotting forecasted data and confidence intervals
# forecast_df['Forecast'].plot(ax=ax,label='Forecast')
# ax.fill_between(forecast_df.index,forecast_df['Lower CI'],
#                 forecast_df['Upper CI'],color='b',alpha=0.4)

# ax.set(xlabel='Time')
# ax.set(ylabel='Sell Price ($)')
# ax.set_title('Data and Forecasted Data')
# ax.legend();

In [None]:
# fig, ax = tsm.plot_forecast_ttf(train=train, test=test,
#                                 forecast_df = forecast_df, n_yrs_past=5)

## Forecasting

---

> Save `conf_int` and `predicted_mean` to a forecast DataFrame.
>
>
> Plot the training data, test data, and forecast DataFrame.

---

In [None]:
# best_model = tsa.SARIMAX(zipcode_val,order=auto_model.order,
#                          seasonal_order = auto_model.seasonal_order,
#                          enforce_invertibility=False).fit()

# display(best_model.summary())
# best_model.plot_diagnostics(figsize=(12,9));
# plt.tight_layout()

In [None]:
# auto_model_best, best_model_overall = tsm.create_best_model(zipcode_val, m=12)

In [None]:
# ## Using get_forecast to generate forecasted data
# forecast = best_model.get_forecast(steps=24)

# ## Saving confidence intervals and predicted mean for future
# forecast_df = forecast.conf_int()
# forecast_df.columns = ['Lower CI','Upper CI']
# forecast_df['Forecast'] = forecast.predicted_mean
# forecast_df.head(5)

In [None]:
# ## Plotting training, test data and forecasted results
# fig,ax = plt.subplots(figsize=(13,6))

# zipcode_val.plot(label='Training Data')

# ## Plotting forecasted data and confidence intervals
# forecast_df['Forecast'].plot(ax=ax,label='Forecast')
# ax.fill_between(forecast_df.index,forecast_df['Lower CI'],
#                 forecast_df['Upper CI'],color='b',alpha=0.4)

# ax.set(xlabel='Time')
# ax.set(ylabel='Sale Price ($)')
# ax.set_title('Data and Forecasted Data')
# ax.legend();

In [None]:
# forecast_overall = tsm.forecast_and_ci(best_model_overall, n_yrs_future = 2)
# forecast_overall

In [None]:
# fig, ax = tsm.plot_forecast_final(zipcode_val, forecast_overall)
# fig

In [None]:
# investment_cost = forecast_df.iloc[0,2]
# investment_cost

In [None]:
# roi_df = (forecast_df - investment_cost)/investment_cost*100
# roi_df

In [None]:
# roi_final = roi_df.iloc[-1]
# roi_final.name = zipcode_val.name.astype('str')
# roi_final

In [None]:
# pd.DataFrame(roi_final)

In [None]:
_, roi_df, _, _, _, _, _, _, _, _ = tsm.ts_modeling_workflow(zipcodes_df, 15206, threshold=.85)
roi_df.iloc[0]

# Interpreting Results

---

> Based on my model, the forecasted ROI for zip code 15206 is 65.48% on average. However, the actual result may fall anywhere between 19.05% and 111.91%, the lower and upper bounds of the forecast.

---
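
The ROI figures come from comparing each forecasted price with the first forecasted price, following the commented-out cells above. A minimal sketch with invented numbers (the column names match the forecast table; the prices do not come from the model):

```python
import pandas as pd

# Toy forecast frame (numbers invented): predicted price and its
# confidence bounds for three future months.
forecast_toy = pd.DataFrame(
    {'Lower CI': [98_000.0, 95_000.0, 90_000.0],
     'Upper CI': [102_000.0, 115_000.0, 130_000.0],
     'Forecast': [100_000.0, 105_000.0, 110_000.0]},
    index=pd.date_range('2018-05-01', periods=3, freq='MS'))

# Treat the first forecasted price as the purchase cost.
investment_cost = forecast_toy['Forecast'].iloc[0]

# Percent return relative to that cost, for the forecast and both bounds.
roi_toy = (forecast_toy - investment_cost) / investment_cost * 100

print(roi_toy.iloc[-1])
```

The last row gives the ROI at the end of the forecast window; the `Lower CI` and `Upper CI` entries are what produce a range like the one quoted above.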

# Functionalizing Workflow

In [None]:
## Testing full workflow function

fcst_full, roi_df, split_vis, fcst_len, sum_train, diag_train,sum_full,\
    diag_full, training_frcst,final_frcst = tsm.ts_modeling_workflow\
            (dataframe = zipcodes_df, zipcode = 15206, m=12, show_vis = True);

# Processing Remaining Zip Codes

---

> Now I will run the remaining zip codes through the workflow via a for loop. As part of the workflow, I will review each model's performance visualizations to ensure it is appropriate for forecasting.
>
>
> I will save the results to the overall dictionary for my final review and interpretation.

---

> ***Special Note:*** Before looping through the entirety of the zip codes, I remove the zip code "15210" and process it separately.
>
>
> This is due to significant delays in running the loop (increasing loop runtime from 1.5 min to upwards of 10 minutes). The issue stems from errors during the modeling process when using the default train/test split threshold of .85.
>
>
> To resolve the issue, I run the zip code through the same process as the loop and save the results to the overall dictionary.

---

In [None]:
## Separating the 15210 zipcode to prevent runtime delays
shorter_list = list(zipcodes_df.columns)
shorter_list.remove(15210)

In [None]:
## Creating dictionary and storing all zipcodes and results
overall_results = {}

for i, zipcode in enumerate(shorter_list):
    
    print('|',"---"*10,f'Zipcode {zipcode}',"---"*10,'|\n')
    print(f'--> Zipcode {i+1} of {len(shorter_list)}')
    
    ## Create temporary dictionaries
    zip_tsa_results = {}
    metrics = {}
    forecast_vis = {}
    
    ## Use functionalized workflow to obtain results
    forecast_full, roi_df, split_vis, forecast_length, summary_train,\
        diag_train, summary_full, diag_full, training_frcst,final_frcst =\
        tsm.ts_modeling_workflow(dataframe = zipcodes_df, threshold = .85,
                                 zipcode = zipcode, m=12, show_vis = True,
                                 figsize=(12,4))
    
    ## Save results to temporary dictionaries
    metrics['train'] = [summary_train, diag_train]
    metrics['full'] = [summary_full, diag_full]
    
    forecast_vis['train'] = training_frcst
    forecast_vis['full'] = final_frcst
    forecast_vis['split'] = split_vis
    
    zip_tsa_results['num_yrs_forecast'] = forecast_length
    zip_tsa_results['forecasted_prices'] = forecast_full
    zip_tsa_results['roi'] = roi_df
    zip_tsa_results['model_metrics'] = metrics
    zip_tsa_results['model_visuals'] = forecast_vis
    
    ## Save final temporary dictionary to overall dictionary
    overall_results[zipcode] = zip_tsa_results
    
    print(f'--> Zipcode {i+1} of {len(shorter_list)}')
    print('|',"---"*5,f'Completed: {zipcode}',"---"*5,'|\n\n')

In [None]:
## Processing zipcode 15210 separately
overall_results[15210] = tsm.make_dict(zipcodes_df, 15210, .8)

# Inspecting Dictionary Results

In [None]:
overall_results[15206].keys()

In [None]:
## Inspecting "forecasted prices" key
overall_results[15206]['forecasted_prices']

In [None]:
## Inspecting "roi" key
overall_results[15206]['roi']

In [None]:
## Reviewing training model metrics
display(overall_results[15206]['model_metrics']['train'][0])
display(overall_results[15206]['model_metrics']['train'][1])

In [None]:
## Reviewing model forecasts
display(overall_results[15206]['model_visuals']['split'])
display(overall_results[15206]['model_visuals']['train'])
display(overall_results[15206]['model_visuals']['full'])

## Diagnosing Zip Code Forecasts

---

> After generating the forecast results for all of the zip codes, I reviewed the validation results for each zip code.
>
>
> Certain zip codes showed the actual sale price trend lines getting too close to the upper/lower confidence intervals. Several models missed the trends entirely, resulting in the actual data exceeding the confidence interval.
>
>
> **I will readjust the train/test threshold for the selected zip codes to address these issues.**

---

## Creating Groups for Adjustments

In [None]:
## Adding .025 to threshold
thresh_a025 = [15217,15213,15216]

## Subtracting .05 from threshold
thresh_s05 = [15243]

## Subtracting .075 from threshold
thresh_s075 = [15210, 15207, 15204]

## Review - `'thresh_a025'`

In [None]:
## Inspecting split and validation visuals for missed trends

for code in thresh_a025:
    print("\n|","--"*24,f"Visualizations for {code}","--"*24,"|\n")
    display(overall_results[code]['model_visuals']['split'])
    display(overall_results[code]['model_visuals']['train'])

### Interpretation - `'thresh_a025'`

---

> For these zip codes, I see that the train/test split threshold slightly missed a trend in the data, causing the actual results to approach one of the limits of the confidence interval too closely.
>
>
> I will test whether **increasing the threshold slightly would capture more of the trend**, bringing my forecast data closer to the test data.

---

## Review - `'thresh_s05'`

In [None]:
## Inspecting split and validation visuals for missed trends

for code in thresh_s05:
    print("\n|","--"*24,f"Visualizations for {code}","--"*24,"|\n")
    display(overall_results[code]['model_visuals']['split'])
    display(overall_results[code]['model_visuals']['train'])

### Interpretation - `'thresh_s05'`

---

> Similar to the prior zip codes, this zip code missed the trend as well. However, it seems that the trend may be *behind* the threshold.
>
>
> I will test whether **decreasing the threshold by .05 would capture more of the trend.**

---

## Review - `'thresh_s075'`

In [None]:
## Inspecting split and validation visuals for missed trends

for i, zipcode in enumerate(thresh_s075):
    
    print('|',"---"*10,f'Zipcode {zipcode}',"---"*10,'|\n')
    print(f'--> Zipcode {i+1} of {len(thresh_s075)}\n')
    
    ## Reviewing split and validation visuals
    
    print('|',"---"*5,'Model Visualizations',"---"*5,'|\n')
    display(overall_results[zipcode]['model_visuals']['split'])
    display(overall_results[zipcode]['model_visuals']['train'])

### Interpretation - `'thresh_s075'`

---

> **These zip codes missed the trends significantly as well, with the actual data exceeding the confidence intervals.** The trends may be further behind the threshold, requiring more of a reduction in the threshold.
>
>
> I will test whether **decreasing the threshold by .075 would capture more of the trend.**

---

## Review - Zipcode `15226`

---

> During my review, I noticed there was a sharp increase in the trend line for the zip code 15226, causing my model to mis-forecast the sale prices.
>
>
> **To address this error, I would need to raise my threshold by an additional .05 (to .90).** This decision would be problematic, however, as it would limit this forecast, and all others, to a one-year scope.
>
>
> **Instead of limiting all of my forecasts due to this single zip code, I will leave the model results at the .85 threshold.** 
>
>
> For exploratory purposes, I will visualize the impact of the change to a .90 threshold. However, **these results will not be included in my final results.**
>
>

---
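
The link between the split threshold and the forecast scope can be checked with simple arithmetic. This sketch assumes 124 monthly observations (January 2008 through April 2018) and assumes the forecast horizon equals the size of the test set; `holdout_months` is a hypothetical helper that mirrors the `round(...)` cutoff used in the split cell earlier.

```python
def holdout_months(n_obs, threshold):
    """Number of held-out months for a given split threshold."""
    return n_obs - round(n_obs * threshold)

# Assumption: 124 monthly observations (January 2008 through April 2018).
N_OBS = 124

for threshold in (0.85, 0.875, 0.90):
    print(threshold, holdout_months(N_OBS, threshold))
```

Under those assumptions the thresholds .85, .875, and .90 leave 19, 16, and 12 months respectively, so a .90 threshold corresponds to a one-year scope.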

In [None]:
## EDA modeling of the 15226 zip code at a .9 threshold for train/test split

tsm.ts_modeling_workflow(zipcodes_df, 15226,threshold = .9, show_vis = True);

### Interpretation  - Zip Code `15226`

---

> As expected, increasing the threshold improved how closely the forecast tracks the trend for the 15226 zip code. However, this change would shorten the forecasts for the other zip codes by nearly 6 months. As this is only one zip code, I will leave its threshold at .85 to maintain the forecasts for the others.

---

# Updating Thresholds

## Updating - `'thresh_a025'`

In [None]:
for i, zipcode in enumerate(thresh_a025):
    
    print('|',"---"*10,f'Zipcode {zipcode}',"---"*10,'|\n')
    print(f'--> Zipcode {i+1} of {len(thresh_a025)}')
    
   
    overall_results[zipcode] = tsm.make_dict(zipcodes_df, zipcode,
                                             threshold = .875, show_vis = True)
    
    print('|',"---"*3,f'Completed: {zipcode}, {i+1} of {len(thresh_a025)}',"---"*3,'|\n\n')

### Reviewing Changes  - `thresh_a025`

---

> The slight increase to the threshold for these zipcodes brought the forecasts much closer to the test values, in most cases making them nearly the same as the test data.

---

## Updating `'thresh_s05'`

In [None]:
for i, zipcode in enumerate(thresh_s05):
    
    print('|',"---"*10,f'Zipcode {zipcode}',"---"*10,'|\n')
    print(f'--> Zipcode {i+1} of {len(thresh_s05)}')
    
   
    overall_results[zipcode] = tsm.make_dict(zipcodes_df, zipcode,
                                             threshold = .8, show_vis = True)

    print('|',"---"*3,f'Completed: {zipcode}, {i+1} of {len(thresh_s05)}',
          "---"*3,'|\n\n')

### Reviewing Changes  - `thresh_s05`

---

> The decrease of .05 in my threshold improved my forecast for 15243.
>
>
> The forecast for 15226 could still improve, but as noted in my review of that zip code, I will leave it at the .85 threshold.

---

## Updating `'thresh_s075'`

In [None]:
for i, zipcode in enumerate(thresh_s075):
    
    print('|',"---"*10,f'Zipcode {zipcode}',"---"*10,'|\n')
    print(f'--> Zipcode {i+1} of {len(thresh_s075)}')
    
   
    overall_results[zipcode] = tsm.make_dict(zipcodes_df, zipcode,
                                             threshold = .775, show_vis = True)
    
    print('|',"---"*3,f'Completed: {zipcode}, {i+1} of {len(thresh_s075)}',
          "---"*3,'|\n\n')

### Reviewing Changes  - `thresh_s075`

---

> The decrease of .075 in my threshold improved my forecast for 15207, though it still has room for improvement. However, this threshold still shows poor performance for 15210 and 15204.
>
>
> I will change the threshold again for these zip codes to see if a larger decrease would improve the accuracy further.

---

### Updating `thresh_s075` - .725

In [None]:
## Decreasing threshold to .725

for i, zipcode in enumerate(thresh_s075):
    print('|',"---"*10,f'Zipcode {zipcode}',"---"*10,'|\n')
    print(f'--> Zipcode {i+1} of {len(thresh_s075)}')
    
   
    overall_results[zipcode] = tsm.make_dict(zipcodes_df, zipcode,
                                             threshold = .725, show_vis = True)
    
    print('|',"---"*3,f'Completed: {zipcode}, {i+1} of {len(thresh_s075)}',
          "---"*3,'|\n\n')

### Reviewing Changes v2  - `thresh_s075`

---

> Bringing the threshold down to .725 from .85 brought the zip code 15207 within its confidence interval. However, zip codes 15204 and 15210 both have unstable trend lines in the training and test data, making it hard for the model to predict accurate results.
>
>
> I will accept these results with the understanding that the forecast for zip codes 15204 and 15210 will be inaccurate.

---

# Final Results

---

> Now that I have collected the results for every zip code, I will calculate and save the return on investment (ROI) values for each one.
>
>
> I will determine my final recommendations based on the ROI results as well as using the lower confidence interval to determine the risk of each zip code.

---
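
The selection rule described above (highest forecasted ROI for the best zip codes, lowest lower bound for the riskiest) can be sketched as follows. The zip codes and ROI values here are placeholders invented for illustration, not model output.

```python
import pandas as pd

# Toy ROI table at a common horizon (numbers invented); one row per
# placeholder zip code.
roi_table = pd.DataFrame(
    {'Lower CI': [-5.0, -30.0, 2.0, -12.0],
     'Upper CI': [60.0, 45.0, 20.0, 25.0],
     'Forecast': [27.5, 7.5, 11.0, 6.5]},
    index=[11111, 22222, 33333, 44444])

# Highest expected return: largest forecasted ROI.
top_roi = roi_table.nlargest(3, 'Forecast')

# Highest risk: most negative lower confidence bound.
riskiest = roi_table.nsmallest(3, 'Lower CI')

print(top_roi.index.tolist())
print(riskiest.index.tolist())
```

A zip code can appear in both lists, since a high expected return and a wide confidence interval are not mutually exclusive.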

In [None]:
## Identifying keys for each zip code
overall_results[15206].keys()

In [None]:
## Inspecting ROI dictionary
overall_results[15206]['roi']

In [None]:
## Calculating number of months used in each forecast
roi_len = []

for zipcode, results in overall_results.items():
    roi_len.append(len(results['roi']))
    
roi_len

In [None]:
## Determining minimum number of months for comparisons
roi_idx = min(roi_len)
roi_idx

In [None]:
## Confirming indexing works as expected (last shared position is roi_idx-1)
overall_results[15206]['roi'].iloc[roi_idx-1]

In [None]:
## Collecting forecasted ROI and confidence intervals
roi_test = []

for zipcode, results in overall_results.items():
    roi_test.append(results['roi'].iloc[roi_idx-1].rename(zipcode).to_frame().T)
    
roi_df = pd.concat(roi_test)

roi_df
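
Because the adjusted thresholds give the zip codes forecasts of different lengths, the comparison uses the last month that every forecast covers. A minimal sketch of that alignment, with invented frames and placeholder keys:

```python
import pandas as pd

# Toy per-zip ROI frames with different forecast lengths (numbers invented).
roi_by_zip = {
    'zip_a': pd.DataFrame({'Forecast': [0.0, 1.0, 2.0, 3.0]}),
    'zip_b': pd.DataFrame({'Forecast': [0.0, 2.0, 4.0]}),
}

# Shortest forecast length across all zip codes.
min_len = min(len(frame) for frame in roi_by_zip.values())

# Row min_len - 1 is the last month that every forecast covers.
rows = [frame.iloc[min_len - 1].rename(name).to_frame().T
        for name, frame in roi_by_zip.items()]
common = pd.concat(rows)

print(common)
```

Positions are zero-based, so the shared row sits at `min_len - 1`; indexing at `min_len` itself would fail for the shortest forecast.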

In [None]:
## Sorting for zip codes with highest ROI
best_roi_df = roi_df.sort_values('Forecast', ascending=False)
best_roi_df.style.background_gradient(subset=['Forecast'],
                                  cmap='RdYlGn')\
                                    .set_caption('Zipcodes by Forecasted ROI')

In [None]:
best_roi_df.iloc[:3]

In [None]:
## Sorting for riskiest zipcodes
risk_df = roi_df.sort_values('Lower CI')
risk_df.style.background_gradient(subset=['Lower CI'],
                                  cmap='RdYlGn').set_caption('Zipcodes by Risk')

In [None]:
## Saving forecast figures to the img folder
import os
fig_folder = "./img/"
os.makedirs(fig_folder,exist_ok=True)

for zipcode in overall_results:
    fig = overall_results[zipcode]['model_visuals']['full']
    fig.savefig(f"{fig_folder}forecast_for_{zipcode}.png",dpi=300)

# Final Recommendations

---

> My forecasts are limited to a 16-month horizon (based on the size of the data used for testing purposes).

---

> I would recommend that short-term buyers **focus on the following areas:**
>  * East Liberty **(zip code 15206, ROI: 42.7%)**
>  * Lawrenceville **(15201, ROI: 38.8%)**
>  * North Shore/Brighton Heights **(15212, ROI: 28.9%)**

---

> I would recommend that short-term buyers **avoid the following areas due to high risk of losing money:**
>  * Shadyside **(15232, 39.3% risk)**
>  * Oakland/North Oakland **(15213, 34.2% risk)**
>  * Perry South/Northview Heights/Summer Hill **(15214, 27.4% risk)**

---

# Future Work

---

> * Comparing forecasts to actual sales using updated data from Zillow.
> * Exploring a larger range of values for the splitting threshold.
> * Identifying and adding exogenous data to support forecasts.

---