<a href="https://colab.research.google.com/github/tmarissa/marissa_DATA606/blob/main/ipynb/604_Choropleth_K_means_State_(Cleansed_and_Uncleansed).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DATA 606 Capstone
## Marissa Tan
### Impact of COVID-19 on the US Housing Market
__Real Estate and Density Dataset__<br>
_Choropleth Map for 2019 and 2021_
- State (Mean)
  - Cleansed: for the clusters, refer to "603 K_Means State.ipynb", Section 3.1
  - Uncleansed with outliers: for the clusters, refer to "603 K_means State.ipynb", Section 5.1

Reference:<br>
https://ecyy.medium.com/mapping-by-geopandas-in-colab-fe4b63b9ac00<br>
https://plotly.com/python/county-choropleth/

In [1]:
!apt install gdal-bin python-gdal python3-gdal
!apt install python3-rtree 
!pip install geopandas
!pip install pyshp==1.2.10
!pip install shapely==1.6.3
!pip install chart_studio
!pip install plotly==5.6.0
!pip install plotly-geo==1.0.0

Reading package lists... Done
Building dependency tree       
Reading state information... Done
gdal-bin is already the newest version (2.2.3+dfsg-2).
python-gdal is already the newest version (2.2.3+dfsg-2).
python3-gdal is already the newest version (2.2.3+dfsg-2).
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
python3-rtree is already the newest version (0.8.3+ds-1).
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.


In [2]:
# Import tools: pandas for data manipulation, NumPy for numerical arrays, GeoPandas and Shapely for geospatial data, Matplotlib for plotting, Plotly for the choropleth maps
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point
import matplotlib
import matplotlib.pyplot as plt
import chart_studio.plotly as py
import plotly.figure_factory as ff
import plotly.graph_objects as go

# 1. K-means State

## 1.1 Read Cleansed CSV

### 1.1a For the Year 2019


In [3]:
df_2019 = pd.read_csv('df_state_2019.csv', index_col=False)
df_2019.sample(5)

Unnamed: 0,state,density,average_listing_price,cluster
22,MI,178,234051.837338,7
29,NE,25,185801.133308,2
42,TN,167,291397.356364,5
24,MO,89,194100.65166,2
17,KY,114,200891.279049,2


### 1.1b For the Year 2021

In [4]:
df_2021 = pd.read_csv('df_state_2021.csv', index_col=False)
df_2021.sample(5)

Unnamed: 0,state,density,average_listing_price,cluster
27,NC,214,380903.723774,6
46,VT,69,498462.211344,4
22,MI,178,289763.59907,8
16,KS,35,175575.621046,10
3,AZ,62,589690.27381,7


## 2.1 Read Uncleansed CSV


### 2.1a For the Year 2019

In [5]:
df2_2019 = pd.read_csv('df_outliers_state_2019.csv', index_col=False)
df2_2019.sample(5)

Unnamed: 0,state,density,average_listing_price,cluster
44,UT,39,540315.5625,9
6,CT,744,512557.9375,2
24,MO,89,223117.58556,5
17,KY,114,229981.631344,5
25,MS,63,209818.220328,5


### 2.1b For the Year 2021



In [6]:
df2_2021 = pd.read_csv('df_outliers_state_2021.csv', index_col=False)
df2_2021.sample(5)

Unnamed: 0,state,density,average_listing_price,cluster
20,MD,636,520415.4,7
36,OK,57,278062.8,1
4,CA,253,1199067.0,4
48,WI,108,359393.9,5
25,MS,63,271780.9,1


# 3. Choropleth Map

## 3.1 Cleansed CSV

### 3.1a For the Year 2019

In [7]:
fig = go.Figure(data=go.Choropleth(
    locations=df_2019['state'],
    z=df_2019['cluster'].astype(int),
    locationmode='USA-states',
    colorscale='thermal',
    autocolorscale=False,
    text=df_2019['average_listing_price'], # hover text (a Series, one value per state)
    marker_line_color='white', # line markers between states
    colorbar_title="Cluster"
))

fig.update_layout(
    title_text="2019 State Clusters Using Cleansed Data",
    font=dict(
        family="Courier New, monospace",
        size=25,
        color="RebeccaPurple"
    ), 
    geo = dict(
        scope='usa',
        projection=go.layout.geo.Projection(type = 'albers usa'),
        showlakes=True, # lakes
        lakecolor='rgb(255, 255, 255)')
)

fig.show()

### 3.1b For the Year 2021

In [8]:
fig = go.Figure(data=go.Choropleth(
    locations=df_2021['state'],
    z=df_2021['cluster'].astype(int),
    locationmode='USA-states',
    colorscale='thermal',
    autocolorscale=False,
    text=df_2021['average_listing_price'], # hover text (a Series, one value per state)
    marker_line_color='white', # line markers between states
    colorbar_title="Cluster"
))

fig.update_layout(
    title_text="2021 State Clusters Using Cleansed Data",
    font=dict(
        family="Courier New, monospace",
        size=25,
        color="RebeccaPurple"
    ),
    geo = dict(
        scope='usa',
        projection=go.layout.geo.Projection(type = 'albers usa'),
        showlakes=True, # lakes
        lakecolor='rgb(255, 255, 255)'),
)

fig.show()

## 4.1 Uncleansed CSV

### 4.1a For the Year 2019

In [9]:
fig = go.Figure(data=go.Choropleth(
    locations=df2_2019['state'],
    z=df2_2019['cluster'].astype(int),
    locationmode='USA-states',
    colorscale='haline',
    autocolorscale=False,
    text=df2_2019['average_listing_price'], # hover text (a Series, one value per state)
    marker_line_color='white', # line markers between states
    colorbar_title="Cluster"
))

fig.update_layout(
    title_text="2019 State Clusters Using Data with Outliers",
    font=dict(
        family="Courier New, monospace",
        size=25,
        color="black"
    ),
    geo = dict(
        scope='usa',
        projection=go.layout.geo.Projection(type = 'albers usa'),
        showlakes=True, # lakes
        lakecolor='rgb(255, 255, 255)'),
)

fig.show()

### 4.1b For the Year 2021

In [10]:
fig = go.Figure(data=go.Choropleth(
    locations=df2_2021['state'],
    z=df2_2021['cluster'].astype(int),
    locationmode='USA-states',
    colorscale='haline',
    autocolorscale=False,
    text=df2_2021['average_listing_price'], # hover text (a Series, one value per state)
    marker_line_color='white', # line markers between states
    colorbar_title="Cluster"
))

fig.update_layout(
    title_text="2021 State Clusters Using Data with Outliers",
    font=dict(
        family="Courier New, monospace",
        size=25,
        color="black"
    ),
    geo = dict(
        scope='usa',
        projection=go.layout.geo.Projection(type = 'albers usa'),
        showlakes=True, # lakes
        lakecolor='rgb(255, 255, 255)'),
)

fig.show()

Hawaii forms a cluster of its own in both 2019 and 2021, whether the cleansed or the uncleansed data is used. California, by contrast, is grouped with other states in both years when the cleansed data is used, but forms its own cluster in both years when the uncleansed data is used. Removing the outliers therefore hides how far California stands apart from the other states, so the cleansed dataset loses some of the integrity of the original data.
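
The single-state clusters described above can also be checked in code rather than read off the maps. The following is a minimal sketch, assuming a DataFrame with the same `state` and `cluster` columns as `df_2019` and `df2_2019`. The helper name `singleton_states` and the toy data are hypothetical and are not part of the original analysis.

```python
import pandas as pd


def singleton_states(df: pd.DataFrame) -> list:
    """Return the states that are the only member of their cluster."""
    # Number of states in each row's cluster, aligned with the original rows.
    cluster_sizes = df.groupby("cluster")["state"].transform("count")
    return sorted(df.loc[cluster_sizes == 1, "state"])


# Toy data for illustration only (not the capstone results).
toy = pd.DataFrame({
    "state": ["HI", "CA", "NY", "MO", "KY"],
    "cluster": [0, 1, 1, 2, 2],
})

print(singleton_states(toy))
```

Applied to the notebook's data, calling the helper on each of `df_2019`, `df_2021`, `df2_2019`, and `df2_2021` would list the states that sit alone in a cluster for each year and dataset, which makes the Hawaii and California comparison easy to verify.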
