
This notebook introduces several Python visualization packages.

* **matplotlib** is a basic but powerful library, well suited to standard plots such as scatter plots, line plots, and bar plots. It can be used directly on data loaded through pandas.
* **seaborn** is a library built on top of matplotlib for statistical visualization: summarizing data, understanding distributions, and searching for patterns and trends.
* **bokeh** is an interactive data visualization library that lets users explore the data themselves.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

## 1. Load and preprocess data in pandas

In [None]:
tracts = pd.read_csv("CA_census_tracts.csv")

In [None]:
tracts.columns

In [None]:
tracts.head(3)

In [None]:
tracts = tracts.set_index("GEOID10")
tracts.head(3)

## 2. Subset data and use group functions to generate descriptive statistics

In [None]:
#Pick out some counties of your choice
counties = ["Orange","Los Angeles", "Santa Barbara","San Diego", "Riverside"]
filter1 = tracts["county_name"].isin(counties)
tracts_in_counties = tracts[filter1]
tracts_in_counties.head()

In [None]:
tracts_in_counties["county_name"].unique()

In [None]:
tracts_in_counties.describe()

In [None]:
tracts_in_counties[["total_pop","median_age","med_home_value", "med_household_income"]].describe()

In [None]:
tracts_in_counties.groupby("county_name")["med_home_value"].describe()
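
The filter-then-group pattern used above can be tried without the CSV file. The following is a minimal sketch on a made-up table: the `toy` DataFrame and all of its values are invented for illustration and do not come from `CA_census_tracts.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for the census tracts table
toy = pd.DataFrame({
    "county_name": ["Orange", "Orange", "Kern", "San Diego", "San Diego"],
    "med_home_value": [600000, 700000, 200000, 500000, 550000],
})

# isin() returns a boolean Series: True where the county is in the list
keep = toy["county_name"].isin(["Orange", "San Diego"])
subset = toy[keep]

# groupby() splits the rows by county; describe() summarizes each group
summary = subset.groupby("county_name")["med_home_value"].describe()
print(summary[["count", "mean", "min", "max"]])
```

The Kern row is dropped by the filter, so `summary` has one row per remaining county.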

## 3. Data visualization from filtered rows

In [None]:
ax = sns.boxplot(x=tracts_in_counties["med_home_value"], y=tracts_in_counties["county_name"])

In [None]:
type(ax)

In [None]:
fig = ax.get_figure()
type(fig)

In [None]:
fig = ax.get_figure()
fig.set_size_inches(10, 10)  # inches
fig

In [None]:
sns.set_style("whitegrid")  # visual styles
sns.set_context("paper")  # presets for scaling figure element sizes

# fliersize changes the size of the outlier dots
# boxprops lets you set more configs with a dict, such as alpha (which means opacity)
ax = sns.boxplot(x=tracts_in_counties["med_home_value"],
                 y=tracts_in_counties["county_name"],
                 fliersize=1,
                 boxprops={"alpha": 0.87})

# set the x-axis limit, the figure title, and x/y axis labels
ax.set_xlim(left=0)
ax.set_title("Box plot of tract-level median home values")
ax.set_xlabel("Median home prices in USD")
ax.set_ylabel("Counties")

# save figure to disk with 600 dpi and a tight bounding box
fig = ax.get_figure()
fig.set_size_inches(10,10)
fig.savefig("figure-homevalue-boxplot.png", dpi=600, bbox_inches="tight")

Now modify the code so that you can visualize box plots of median age and median household income for the 5 counties.

In [None]:
sns.set_style("whitegrid")  # visual styles
sns.set_context("paper")  # presets for scaling figure element sizes

# fliersize changes the size of the outlier dots
# boxprops lets you set more configs with a dict, such as alpha (which means opacity)
ax = sns.boxplot(x=tracts_in_counties["med_household_income"],
                 y=tracts_in_counties["county_name"],
                 fliersize=1,
                 boxprops={"alpha": 0.87})

# set the x-axis limit, the figure title, and x/y axis labels
ax.set_xlim(left=0)
ax.set_title("Box plot of tract-level median household income values")
ax.set_xlabel("Median household income in USD")
ax.set_ylabel("Counties")

# save figure to disk with 600 dpi and a tight bounding box
fig = ax.get_figure()
fig.set_size_inches(10,10)
fig.savefig("figure-income-boxplot.png", dpi=600, bbox_inches="tight")

In [None]:
sns.set_style("whitegrid")  # visual styles
sns.set_context("paper")  # presets for scaling figure element sizes

# fliersize changes the size of the outlier dots
# boxprops lets you set more configs with a dict, such as alpha (which means opacity)
ax = sns.boxplot(x=tracts_in_counties["median_age"],
                 y=tracts_in_counties["county_name"],
                 fliersize=1,
                 boxprops={"alpha": 0.87})

# set the x-axis limit, the figure title, and x/y axis labels
ax.set_xlim(left=0)
ax.set_title("Box plot of tract-level median age")
ax.set_xlabel("Median age in years")
ax.set_ylabel("Counties")

# save figure to disk with 600 dpi and a tight bounding box
fig = ax.get_figure()
fig.set_size_inches(10,10)
fig.savefig("figure-age-boxplot.png", dpi=600, bbox_inches="tight")
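
The three box plot cells above differ only in the column, the axis label, and the title, so one way to avoid the repetition is to wrap the shared steps in a function. This is only a sketch: `county_boxplot` is a hypothetical helper name, not part of seaborn, and the `toy` data is invented so the block runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def county_boxplot(df, column, xlabel, title):
    """Draw a horizontal box plot of `column`, one box per county_name."""
    fig, ax = plt.subplots(figsize=(10, 10))
    sns.boxplot(data=df, x=column, y="county_name",
                fliersize=1, boxprops={"alpha": 0.87}, ax=ax)
    # same axis limit, title, and labels as in the cells above
    ax.set_xlim(left=0)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel("Counties")
    return ax


# Hypothetical toy data; in the notebook you would pass tracts_in_counties
toy = pd.DataFrame({
    "county_name": ["Orange"] * 4 + ["Riverside"] * 4,
    "median_age": [31.0, 35.5, 40.2, 44.1, 29.3, 33.0, 36.8, 52.4],
})

ax = county_boxplot(toy, "median_age",
                    xlabel="Median age in years",
                    title="Box plot of tract-level median age")
```

Saving works as before, for example with `ax.get_figure().savefig(...)`.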

Now let's try bar plots.

In [None]:
sns.set_style("whitegrid")  # visual styles
sns.set_context("paper")  # presets for scaling figure element sizes

# barplot draws one bar per county; by default the bar length is the mean of
# the tract values, with an error bar showing the uncertainty of that mean
ax = sns.barplot(x=tracts_in_counties["med_household_income"],
                 y=tracts_in_counties["county_name"])

# set the x-axis limit, the figure title, and x/y axis labels
ax.set_xlim(left=0)
ax.set_title("Bar plot of tract-level median household income values")
ax.set_xlabel("Median household income in USD")
ax.set_ylabel("Counties")

# save figure to disk with 600 dpi and a tight bounding box
fig = ax.get_figure()
fig.set_size_inches(10,10)
fig.savefig("figure-income-barplot.png", dpi=600, bbox_inches="tight")
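
Unlike a box plot, a bar plot collapses each county to a single number. By default `sns.barplot` uses the mean of the values in each category, so the bar lengths can be checked with a pandas group-by. The sketch below uses invented toy values rather than the census data.

```python
import pandas as pd

# Hypothetical toy data: two tracts per county
toy = pd.DataFrame({
    "county_name": ["Orange", "Orange", "Riverside", "Riverside"],
    "med_household_income": [80000, 100000, 50000, 70000],
})

# One mean per county: these are the values a default barplot would draw
means = toy.groupby("county_name")["med_household_income"].mean()
print(means)
```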

## Histograms and Density plots

Histograms visualize the distribution of a variable by binning it and then counting the observations in each bin. Density plots are similar, but continuous and smooth, because they visualize an estimate of the probability distribution.
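
To make the binning step concrete, here is a small sketch using `numpy.histogram` on a handful of invented ages (the `ages` array is hypothetical, not taken from the tracts data). It shows the raw counts per bin and the density scaling, where the bar areas sum to 1.

```python
import numpy as np

# Hypothetical median ages for ten tracts
ages = np.array([22, 25, 31, 34, 36, 38, 41, 45, 52, 67])

# Split the range of the data into 5 equal-width bins and count per bin
counts, edges = np.histogram(ages, bins=5)

# density=True rescales the counts so that the total bar area equals 1
density, _ = np.histogram(ages, bins=5, density=True)
widths = np.diff(edges)
area = np.sum(density * widths)

print("bin edges:", edges)
print("counts per bin:", counts)
print("total area under density bars:", area)
```

The density scaling is what makes two groups of different sizes comparable on the same axes.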

In [None]:
ax = sns.histplot(tracts_in_counties["median_age"].dropna(), stat="density", kde=True)

In [None]:
ax = sns.histplot(tracts_in_counties["median_age"].dropna(), stat="density", kde=False)

In [None]:
df_white = tracts[tracts["pct_white"] > 50]
df_hispanic = tracts[tracts["pct_hispanic"] > 50]

In [None]:
ax = sns.histplot(df_white["median_age"].dropna(), stat="density", color = "green")
ax = sns.histplot(df_hispanic["median_age"].dropna(), stat="density", color="grey")

In [None]:
ax = sns.histplot(df_white["median_age"].dropna(),
                  stat="density", kde = True,
                  label="Majority White Tracts",
                  color = "blue")

ax = sns.histplot(df_hispanic["median_age"].dropna(),
                  stat="density",kde = True,
                  label="Majority Hispanic Tracts",
                  color="orange")
ax.legend()

# set x-limit, add x-label, then save to disk
ax.set_xlim(10, 85)
ax.set_xlabel("Median Age of Population (Years)")
ax.get_figure().savefig("figure-age-distributions.png", dpi=600, bbox_inches="tight")

## Scatter Plots

In [None]:
ax = sns.scatterplot(x=tracts["pct_bachelors_degree"], y=tracts["med_household_income"])

In [None]:
ax = sns.scatterplot(x=tracts_in_counties["pct_bachelors_degree"],
                     y=tracts_in_counties["med_household_income"],
                     hue=tracts_in_counties["county_name"])

In [None]:
ax = sns.scatterplot(x=tracts_in_counties["pct_bachelors_degree"],
                     y=tracts_in_counties["med_household_income"],
                     hue=tracts_in_counties["county_name"])
# remove the column name from the legend
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles, labels=labels)

# set x/y limits, labels, and save figure
ax.set_xlim(0, 100)
ax.set_ylim(bottom=0)
ax.set_xlabel("Tract population % with bachelor's degree or higher")
ax.set_ylabel("Tract median household income (2017 USD)")
ax.get_figure().savefig("figure-income-degree-scatterplot.png", dpi=600, bbox_inches="tight")

### Now it's your turn: pick 2 new variables from the full dataset and scatter plot them against each other. How do you interpret the pattern? What if you look at only 1 county?
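
One possible way to back up a visual reading of a scatter plot is the Pearson correlation coefficient, which pandas computes with `Series.corr`. The sketch below is an illustration under assumptions: the `toy` values and the county labels "A" and "B" are invented, and correlation only captures linear association.

```python
import pandas as pd

# Hypothetical toy data for two made-up counties
toy = pd.DataFrame({
    "county_name": ["A"] * 4 + ["B"] * 4,
    "pct_bachelors_degree": [10, 20, 30, 40, 15, 25, 35, 45],
    "med_household_income": [40000, 52000, 61000, 75000,
                             50000, 48000, 55000, 51000],
})

x, y = "pct_bachelors_degree", "med_household_income"

# Correlation across all rows
r_all = toy[x].corr(toy[y])

# Correlation within each county, computed group by group
r_by_county = {name: group[x].corr(group[y])
               for name, group in toy.groupby("county_name")}

print("all tracts:", round(r_all, 2))
print("by county:", {k: round(v, 2) for k, v in r_by_county.items()})
```

In this toy example the relationship is much tighter in county "A" than in county "B", which is the kind of difference that can be hidden when all counties are plotted together.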