## Introduction to matplotlib

#### Note for Jenny: For ICFJ, do global/international/Iranian or company datasets. For mapping, make sure to do something in Iran or country-level, NOT bay area/sf specific - do JHU or CDC, not DataSF for COVID (global or Iranian numbers)

### Importing matplotlib

If you're running this notebook locally, make sure matplotlib is installed first. With Python 3, run <strong>pip3 install matplotlib</strong> in the terminal before running the cells below.

In [8]:
from matplotlib import pyplot as plt

In [9]:
# importing any other necessary libraries below

import numpy as np
import pandas as pd

### Getting started and making basic charts with matplotlib

#### Step 1: Importing your data source for your visualizations

Let's import the <a href="https://data.sfgov.org/COVID-19/COVID-19-Cases-by-Geography-Over-Time/d2ef-idww/explore">COVID-19 Cases by Geography Over Time dataset</a> from DataSF, San Francisco's open data portal. The copy used here has been filtered to include only cases between Mar. 1, 2022 and Mar. 1, 2023.

You can also download the dataset we'll be visualizing <a href="https://raw.githubusercontent.com/kwonjs/datasets-to-use/main/From03012022to03012023_COVID-19_Cases_by_Geography_Over_Time.csv">here</a> (takes you to a raw GitHub link).

In [19]:
# importing the dataset below - if you're not importing from a URL, first upload
# the raw dataset to your Jupyter notebook environment

# import dataset as a dataframe to analyze in python
url = "https://raw.githubusercontent.com/kwonjs/datasets-to-use/main/From03012022to03012023_COVID-19_Cases_by_Geography_Over_Time.csv"
cases_raw = pd.read_csv(url)

# to make sure the data is showing up properly for basic analysis and charting/graphing
# cases_raw.head(20) 

Let's take a look at this dataset. What variables/columns are we working with?
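One way to answer that question is to ask pandas for the column names, data types and shape. The sketch below uses a tiny hand-made dataframe, `sample`, as a stand-in for `cases_raw` so that it runs without a download. Its rows are invented for illustration; only the column names mirror the ones used later in this notebook.

```python
import pandas as pd

# hypothetical stand-in for cases_raw: made-up rows, column names taken from this notebook
sample = pd.DataFrame({
    "specimen_collection_date": ["2022/03/01", "2022/03/01", "2022/03/02"],
    "area_type": ["Analysis Neighborhood", "Citywide", "Analysis Neighborhood"],
    "id": ["Tenderloin", "San Francisco", "Tenderloin"],
    "acs_population": [29726, 800000, 29726],
    "new_confirmed_cases": [3.0, 40.0, 1.0],
})

# list of column names
column_names = sample.columns.tolist()
print(column_names)

# data type of each column (dates stay plain strings until they are converted)
print(sample.dtypes)

# (number of rows, number of columns)
print(sample.shape)
```

Running the same three calls on `cases_raw` should show every column in the real dataset, not just the five assumed here.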

#### Step 2: Doing some basic filtering and analysis with pandas 

<i>For a refresher, refer to the data analysis section (LINK TO SECTION HERE) on how to work with the pandas library!</i>

In [22]:
# filter rows by the 'Analysis Neighborhood' area_type, which is associated with specific districts
# within San Francisco (e.g. Bayview Hunters Point)

neighborhood_cases = cases_raw[cases_raw['area_type']=='Analysis Neighborhood']

In this dataset, there's a column called "new_confirmed_cases." We can use the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html">groupby</a> method on the dataframe of cases by neighborhood (neighborhood_cases) to total the new cases across all of SF's neighborhoods (in other words, across all of SF) for each date.
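To see what groupby does before applying it to the full dataset, here is a minimal sketch on a toy dataframe. `toy_cases` and its counts are made up for illustration. Selecting the column before calling `sum()` is one common way to write this, not the only one.

```python
import pandas as pd

# hypothetical stand-in for neighborhood_cases: two neighborhoods over two dates, made-up counts
toy_cases = pd.DataFrame({
    "specimen_collection_date": ["2022/03/01", "2022/03/01", "2022/03/02", "2022/03/02"],
    "id": ["Tenderloin", "Bayview Hunters Point", "Tenderloin", "Bayview Hunters Point"],
    "new_confirmed_cases": [3.0, 5.0, 1.0, 4.0],
})

# group rows that share a date, then add up that date's counts across neighborhoods
toy_by_day = toy_cases.groupby("specimen_collection_date")["new_confirmed_cases"].sum()
print(toy_by_day)
```

The result is a Series indexed by date, with one total per day.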

In [31]:
# select the case-count column before summing so that only that column is added up
new_cases_byday = neighborhood_cases.groupby('specimen_collection_date')['new_confirmed_cases'].sum()


In [35]:
# convert the groupby output to a dataframe
new_cases_byday_df = new_cases_byday.to_frame()

# reset the index so that the date column isn't treated as the index
new_cases_byday_df = new_cases_byday_df.reset_index()
new_cases_byday_df

| | specimen_collection_date | new_confirmed_cases |
|---|---|---|
| 0 | 2022/03/01 | 129.0 |
| 1 | 2022/03/02 | 102.0 |
| 2 | 2022/03/03 | 108.0 |
| 3 | 2022/03/04 | 78.0 |
| 4 | 2022/03/05 | 52.0 |
| ... | ... | ... |
| 361 | 2023/02/25 | 59.0 |
| 362 | 2023/02/26 | 46.0 |
| 363 | 2023/02/27 | 98.0 |
| 364 | 2023/02/28 | 107.0 |


In [38]:
# let's go back to the neighborhood_cases dataframe
# and let's split the specimen_collection_date into three separate columns
# one column for year, one for month and one for day

# work on an explicit copy so that adding columns doesn't raise a SettingWithCopyWarning
neighborhood_cases = neighborhood_cases.copy()
neighborhood_cases[['year', 'month', 'day']] = neighborhood_cases['specimen_collection_date'].str.split('/', expand=True)

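Splitting the date string leaves year, month and day as text, and assigning new columns to a filtered slice is what can trigger pandas' SettingWithCopyWarning. A possible alternative, sketched here on made-up rows (`toy_raw` is hypothetical), is to take an explicit copy and parse the dates with `pd.to_datetime`, which gives integer date parts.

```python
import pandas as pd

# hypothetical stand-in for cases_raw with made-up counts
toy_raw = pd.DataFrame({
    "specimen_collection_date": ["2022/03/01", "2022/12/15", "2023/02/28"],
    "area_type": ["Analysis Neighborhood", "Analysis Neighborhood", "Citywide"],
    "new_confirmed_cases": [3.0, 2.0, 40.0],
})

# .copy() makes the filtered result an independent dataframe, so adding columns is unambiguous
toy_neighborhoods = toy_raw[toy_raw["area_type"] == "Analysis Neighborhood"].copy()

# parse the "YYYY/MM/DD" strings into real dates
parsed = pd.to_datetime(toy_neighborhoods["specimen_collection_date"], format="%Y/%m/%d")

# the .dt accessor exposes the integer year, month and day of each date
toy_neighborhoods["year"] = parsed.dt.year
toy_neighborhoods["month"] = parsed.dt.month
toy_neighborhoods["day"] = parsed.dt.day
print(toy_neighborhoods)
```

Integer date parts sort numerically and can be used in arithmetic. The string split used in this notebook is still fine when the parts are only needed as labels.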

In [41]:
# we can choose only the variables/columns we care about
# for this specific dataset, we care about the year, month and day of the collection, maybe the population
# of the neighborhood based on the American Community Survey (acs_population) and
# new confirmed cases for that specific day in that neighborhood

new_cases_neighborhood = neighborhood_cases[["year", "month", "day", "id", "acs_population", "new_confirmed_cases"]]

In [45]:
# we can further filter new_cases_neighborhood for cases per day in a specific neighborhood (e.g. the Tenderloin)

tenderloin_cases = new_cases_neighborhood[new_cases_neighborhood["id"]=="Tenderloin"]

#### Step 3: Let's make some visualizations as part of our exploratory data analysis

The general format for creating a <strong>bar chart</strong> using matplotlib is:

plt.bar(x, height, width=0.8, bottom=None, *, align='center', data=None, **kwargs) 
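As a minimal sketch of how the first few parameters in that signature map onto a chart, the example below draws three bars. The labels and values are made up purely to show which argument goes where.

```python
from matplotlib import pyplot as plt

# made-up categories and values, only to illustrate the arguments
labels = ["A", "B", "C"]
heights = [3, 7, 5]

# x positions the bars, height sets how tall each one is,
# width is the bar width in x-axis units, and align centers each bar on its x position
bars = plt.bar(labels, heights, width=0.8, align="center")

plt.xlabel("category")
plt.ylabel("value")
plt.title("A minimal bar chart")
# in a script (outside a notebook), call plt.show() here to open the figure window
```

`plt.bar` returns a container with one rectangle per bar, which is handy if you later want to recolor or annotate individual bars.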

In [51]:
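# sum only the numeric columns; note the acs_population total repeats the same population once per day
tenderloin_cases.groupby(['month', 'year'])[['acs_population', 'new_confirmed_cases']].sum().reset_index()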


| | month | year | acs_population | new_confirmed_cases |
|---|---|---|---|---|
| 0 | 1 | 2023 | 921506 | 74.0 |
| 1 | 2 | 2023 | 832328 | 97.0 |
| 2 | 3 | 2022 | 921506 | 75.0 |
| 3 | 3 | 2023 | 29726 | 4.0 |
| 4 | 4 | 2022 | 891780 | 156.0 |
| 5 | 5 | 2022 | 921506 | 300.0 |
| 6 | 6 | 2022 | 891780 | 343.0 |
| 7 | 7 | 2022 | 921506 | 270.0 |
| 8 | 8 | 2022 | 162.0 |
| 9 | 9 | 2022 | 891780 | 84.0 |
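These monthly totals can be turned into a bar chart directly. The sketch below re-enters the month, year and new_confirmed_cases values from the output shown above, so it runs on its own. The chronological sort and the "YYYY-MM" labels are choices made for this example, not part of the original notebook.

```python
import pandas as pd
from matplotlib import pyplot as plt

# monthly Tenderloin totals copied from the rows of the output shown above
monthly = pd.DataFrame({
    "month": [1, 2, 3, 3, 4, 5, 6, 7, 8, 9],
    "year": [2023, 2023, 2022, 2023, 2022, 2022, 2022, 2022, 2022, 2022],
    "new_confirmed_cases": [74.0, 97.0, 75.0, 4.0, 156.0, 300.0, 343.0, 270.0, 162.0, 84.0],
})

# sort chronologically so the bars read left to right in time order
monthly = monthly.sort_values(["year", "month"]).reset_index(drop=True)

# build a "YYYY-MM" label for each bar
monthly["label"] = monthly["year"].astype(str) + "-" + monthly["month"].astype(str).str.zfill(2)

plt.bar(monthly["label"], monthly["new_confirmed_cases"])
plt.xticks(rotation=45)
plt.xlabel("month")
plt.ylabel("new confirmed cases")
plt.title("Tenderloin: new confirmed cases per month")
plt.tight_layout()
```

The March 2023 bar is very short, most likely because the filtered dataset ends on Mar. 1, 2023 and that month covers only one day.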


In [None]:
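# our data source will be new_cases_byday_df

# defining the different parameters used to create a bar chart using matplotlib
# (assuming dates go on the x-axis and daily case counts set the bar heights)
bar_x = new_cases_byday_df['specimen_collection_date']
bar_height = new_cases_byday_df['new_confirmed_cases']

plt.bar(bar_x, bar_height)
plt.xlabel('specimen collection date')
plt.ylabel('new confirmed cases')
plt.show()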