## Introduction to <a href="https://matplotlib.org/">matplotlib</a>

#### Note for Jenny: For ICFJ, do global/international/Iranian or company datasets. For mapping, make sure to do something in Iran or country-level, NOT bay area/sf specific - do JHU or CDC, not DataSF for COVID (global or Iranian numbers)

### Importing matplotlib and other libraries

If you're running this notebook locally, make sure matplotlib is installed first. With Python 3, run <strong>pip3 install matplotlib</strong> in the terminal before running the cells below.

In [8]:
from matplotlib import pyplot as plt

In [2]:
# importing any other necessary libraries below

import numpy as np
import pandas as pd

### Getting started and making basic charts with matplotlib

#### Step 1: Importing the dataset for your visualizations

Let's import the <a href="https://iranopendata.org/en/dataset/iod-03222-crude-birth-rate-death-rate-child-mortality-rate-in-selected-countries-world-202">crude birth rate, death rate and child mortality rate in selected countries of the world</a> dataset from Iran Open Data, Iran's open data portal. It contains crude birth, crude death and child mortality rates for different countries in 2020.

If you want to download the dataset directly, you can also access it from <a href="https://raw.githubusercontent.com/kwonjs/datasets-to-use/main/iod-03222-crude-birth-rate-death-rate-child-mortality-rate-in-selected-countries-world-202-en.csv">this GitHub link</a>.

In [7]:
# import the dataset below - if you're not importing from a URL, first
# upload the raw dataset to jupyter notebook

# import dataset as a dataframe to analyze in python
url = "https://raw.githubusercontent.com/kwonjs/datasets-to-use/main/iod-03222-crude-birth-rate-death-rate-child-mortality-rate-in-selected-countries-world-202-en.csv"
df = pd.read_csv(url)

# to make sure the data is showing up properly for basic analysis and charting/graphing
# df.head(50) 

Let's take a look at this dataset. What <strong>variables/columns</strong> are we working with?
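One way to answer that question is to ask pandas for the column names, data types and shape. The sketch below is only an illustration: `sample` is a hypothetical stand-in built from three rows of the dataset preview shown later in this notebook, so it runs without a download. On the real data you would call the same attributes on `df`.

```python
import pandas as pd

# A few rows copied from this notebook's dataset preview (stand-in for df).
sample = pd.DataFrame({
    "Region and country - Asia": ["asia", "Jordan - Asia", "Armenia - Asia"],
    "Crude birth rate - per thousand population": [17, 22, 12],
    "Crude death rate - per thousand population": [7, 4, 9],
    "Child mortality rate is less than - One year - per thousand live births": [27.0, 17.0, 6.0],
})

# Column names tell us which variables are available.
print(sample.columns.tolist())

# dtypes shows whether each column was read as text, integers or decimals.
print(sample.dtypes)

# shape is (number of rows, number of columns).
print(sample.shape)
```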

#### Step 2: Cleaning your dataset

Sometimes you will be lucky and have a clean dataset that you can immediately start visualizing. But more often than not, you will have to clean your dataset. Cleaning involves "fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset" (<a href="https://www.tableau.com/learn/articles/what-is-data-cleaning">Tableau</a>).
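To make that definition concrete, here is a minimal sketch of two common checks, duplicated rows and missing values. The `messy` dataframe is invented for illustration (the duplicate row and the missing value are not in the real dataset), and the short column names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: one repeated row and one missing value.
messy = pd.DataFrame({
    "country": ["Jordan", "Jordan", "Armenia", "Bahrain"],
    "crude_birth_rate": [22, 22, 12, np.nan],
})

# Count rows that exactly repeat an earlier row.
n_duplicates = messy.duplicated().sum()

# Count missing values in each column.
n_missing = messy.isna().sum()

# Drop the repeated rows, then drop rows that still have missing values.
cleaned = messy.drop_duplicates().dropna()

print(n_duplicates)
print(n_missing)
print(cleaned)
```

Whether to drop incomplete rows or fill them in depends on the analysis, so treat `dropna()` here as one option rather than a default.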

In [13]:
# let's look at the first 10 rows of this dataset

df.head(10)

Unnamed: 0,Region and country - Asia,Crude birth rate - per thousand population,Crude death rate - per thousand population,Child mortality rate is less than - One year - per thousand live births
0,asia,17,7,27.0
1,"Azerbaijan, Republic of - Asia",14,6,11.0
2,Jordan - Asia,22,4,17.0
3,Armenia - Asia,12,9,6.0
4,Uzbekistan - Asia,23,5,11.0
5,Afghanistan - Asia,33,6,50.0
6,United Arab Emirates - Asia,11,1,6.0
7,Indonesia - Asia,18,7,25.0
8,"Iran, Islamic Republic - Asia",17,5,6.0
9,Bahrain - Asia,14,2,6.0


In [14]:
# let's also look at the last 10 rows of this dataset

df.tail(10)

Unnamed: 0,Region and country - Asia,Crude birth rate - per thousand population,Crude death rate - per thousand population,Child mortality rate is less than - One year - per thousand live births
77,Europe - France,11,9,3.6
78,Europe - Finland,8,10,2.1
79,Europe - Poland,10,11,3.7
80,Europe - Hungary,9,13,3.8
81,Europe - Norway,10,8,2.1
82,Europe - Netherlands,10,9,3.5
83,Europe - Greece,8,11,3.5
84,Oceania,17,7,16.0
85,Oceania - Australia,13,6,3.1
86,Oceania - New Zealand - New Zealand,12,7,4.5


In [None]:
# in the first column, "Region and country - Asia", most rows hold both
# a country (e.g. Indonesia, 'Iran, Islamic Republic') and a region (e.g. Europe, Asia),
# with a dash separating the two

# however, the order is inconsistent. some rows have the country first, then the region.
# other rows have the region first, then the country. additionally, some rows
# are the average rates for an entire region (e.g. Oceania in row 84).
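One possible way to untangle this column is to split each label on the dash and check each piece against a list of region names. This is a sketch under stated assumptions, not the notebook's own cleaning code: `KNOWN_REGIONS` only lists the regions visible in the previews above, and `split_region_country` is a hypothetical helper.

```python
import pandas as pd

# Assumption: only the regions visible in the head/tail previews are listed here.
KNOWN_REGIONS = {"asia", "europe", "oceania"}

# Labels copied from the previews above.
sample = pd.DataFrame({"Region and country - Asia": [
    "asia",
    "Jordan - Asia",
    "Iran, Islamic Republic - Asia",
    "Europe - France",
    "Oceania",
    "Oceania - New Zealand - New Zealand",
]})

def split_region_country(label):
    """Return the region and country found in a label, whatever their order."""
    parts = [part.strip() for part in label.split(" - ")]
    region = next((p for p in parts if p.lower() in KNOWN_REGIONS), None)
    countries = [p for p in parts if p.lower() not in KNOWN_REGIONS]
    return pd.Series({
        "region": region.title() if region else None,
        "country": countries[0] if countries else None,
    })

cleaned = sample["Region and country - Asia"].apply(split_region_country)

# Rows with a region but no country are the region-wide averages.
cleaned["is_region_average"] = cleaned["country"].isna()
print(cleaned)
```

Flagging the region averages keeps them available for later comparison while making it easy to exclude them from country-level charts.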



#### Step 3: Doing some basic filtering and analysis with pandas 

<i>For a refresher, refer to the data analysis section (LINK TO SECTION HERE) on how to work with the pandas library!</i>

In [22]:
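# note: cases_raw is not defined earlier in this notebook. it is assumed here to be a dataframe
# of san francisco case data, already loaded, that has an 'area_type' column

# filter rows by the 'Analysis Neighborhood' area_type, which is associated with specific districts
# within San Francisco (e.g. Bayview Hunters Point)

neighborhood_cases = cases_raw[cases_raw['area_type']=='Analysis Neighborhood']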

In this dataset, there's a column called "new_confirmed_cases". We can use the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html">groupby</a> method on the neighborhood-level dataframe (neighborhood_cases) to add up the new cases across all of SF's neighborhoods (in other words, in all of SF) for each date.
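As a minimal sketch of what a grouped sum does, the example below uses a tiny invented dataframe. `toy_cases`, the "Mission" rows and all of the numbers are hypothetical and only mimic the shape of neighborhood_cases.

```python
import pandas as pd

# Hypothetical rows: two neighborhoods reporting on two dates.
toy_cases = pd.DataFrame({
    "specimen_collection_date": ["2022/03/01", "2022/03/01", "2022/03/02", "2022/03/02"],
    "id": ["Tenderloin", "Mission", "Tenderloin", "Mission"],
    "new_confirmed_cases": [3.0, 5.0, 2.0, 4.0],
})

# Group the rows by date, pick the column to aggregate, then sum within each group.
daily_totals = toy_cases.groupby("specimen_collection_date")["new_confirmed_cases"].sum()
print(daily_totals)
```

Selecting the column before calling `sum()` means only that column is aggregated, so text columns such as `id` never get in the way.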

In [31]:
# select the column first, then sum, so that only new_confirmed_cases is aggregated
new_cases_byday = neighborhood_cases.groupby('specimen_collection_date')['new_confirmed_cases'].sum()


In [35]:
# convert the groupby output to a dataframe
new_cases_byday_df = new_cases_byday.to_frame()

# reset the index so that the date column isn't treated as the index
new_cases_byday_df = new_cases_byday_df.reset_index()
new_cases_byday_df

Unnamed: 0,specimen_collection_date,new_confirmed_cases
0,2022/03/01,129.0
1,2022/03/02,102.0
2,2022/03/03,108.0
3,2022/03/04,78.0
4,2022/03/05,52.0
...,...,...
361,2023/02/25,59.0
362,2023/02/26,46.0
363,2023/02/27,98.0
364,2023/02/28,107.0


In [38]:
# let's go back to the neighborhood_cases dataframe
# and let's split the specimen_collection_date into three separate columns
# one column for year, one for month and one for day

# work on an explicit copy so that adding columns does not trigger pandas' SettingWithCopyWarning
neighborhood_cases = neighborhood_cases.copy()
neighborhood_cases[['year', 'month', 'day']] = neighborhood_cases['specimen_collection_date'].str.split('/', expand=True)
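An alternative to splitting the text is to parse the dates and read the parts from the result. This sketch assumes the dates always follow the year/month/day pattern seen in the output above; `dates` is a small hypothetical stand-in for the real dataframe.

```python
import pandas as pd

# Date strings in the same format as specimen_collection_date.
dates = pd.DataFrame({
    "specimen_collection_date": ["2022/03/01", "2022/03/02", "2023/02/28"],
})

# Parse the text into real datetime values, stating the expected format explicitly.
parsed = pd.to_datetime(dates["specimen_collection_date"], format="%Y/%m/%d")

# The .dt accessor exposes each date part as a numeric column.
dates["year"] = parsed.dt.year
dates["month"] = parsed.dt.month
dates["day"] = parsed.dt.day
print(dates)
```

Unlike the pieces returned by `str.split`, these columns are numbers, so sorting by month gives 2 before 10 instead of the text order.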

In [41]:
# we can choose only the variables/columns we care about
# for this specific dataset, we care about the year, month and day of the collection, maybe the population
# of the neighborhood based on the American Community Survey (acs_population) and
# new confirmed cases for that specific day in that neighborhood

new_cases_neighborhood = neighborhood_cases[["year", "month", "day", "id", "acs_population", "new_confirmed_cases"]]

In [45]:
# we can further filter new_cases_neighborhood for cases per day in a specific neighborhood (e.g. the Tenderloin)

tenderloin_cases = new_cases_neighborhood[new_cases_neighborhood["id"]=="Tenderloin"]

#### Step 4: Let's make some visualizations as part of our exploratory data analysis

The general format for creating a <strong>bar chart</strong> using matplotlib is:

`plt.bar(x, height, width=0.8, bottom=None, *, align='center', data=None, **kwargs)`
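As a small illustration of that signature, the sketch below passes `x` and `height` and spells out the default `width` and `align`. The values are crude birth rates copied from the df.head(10) output earlier in this notebook; the list names, labels and figure size are assumptions made for the example.

```python
from matplotlib import pyplot as plt

# Crude birth rates (per thousand population) copied from the preview above.
countries = ["Jordan", "Armenia", "Uzbekistan", "Afghanistan", "Indonesia"]
birth_rates = [22, 12, 23, 33, 18]

plt.figure(figsize=(8, 4))

# x sets where each bar sits, height sets how tall it is.
# width=0.8 and align="center" are the defaults, written out for clarity.
bars = plt.bar(countries, birth_rates, width=0.8, align="center")

plt.xlabel("Country")
plt.ylabel("Crude birth rate (per thousand population)")
plt.title("Crude birth rate in selected Asian countries")
# In a notebook the figure renders at the end of the cell; in a script, call plt.show().
```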

In [51]:
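# numeric_only=True sums just the numeric columns (text columns such as 'day' and 'id' are skipped)
# note: the summed acs_population is not meaningful, because the same population is counted once per row
tenderloin_cases.groupby(['month', 'year']).sum(numeric_only=True).reset_index()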


Unnamed: 0,month,year,acs_population,new_confirmed_cases
0,1,2023,921506,74.0
1,2,2023,832328,97.0
2,3,2022,921506,75.0
3,3,2023,29726,4.0
4,4,2022,891780,156.0
5,5,2022,921506,300.0
6,6,2022,891780,343.0
7,7,2022,921506,270.0
8,8,2022,921506,162.0
9,9,2022,891780,84.0


In [None]:
# our data source will be new_cases_byday_df

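# defining the different parameters used to create a bar chart using matplotlib
bar_x = new_cases_byday_df['specimen_collection_date']   # x-axis: one bar per date
bar_height = new_cases_byday_df['new_confirmed_cases']   # bar height: new cases on that date

plt.bar(bar_x, bar_height)
plt.show()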
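A chart with one bar per date is easier to read with axis labels and rotated tick labels. The following is a sketch rather than the notebook's final chart: `first_days` holds only the first five rows of the new_cases_byday_df output shown above, and the labels, title and rotation angle are assumptions.

```python
import pandas as pd
from matplotlib import pyplot as plt

# First five rows copied from the new_cases_byday_df output above.
first_days = pd.DataFrame({
    "specimen_collection_date": ["2022/03/01", "2022/03/02", "2022/03/03", "2022/03/04", "2022/03/05"],
    "new_confirmed_cases": [129.0, 102.0, 108.0, 78.0, 52.0],
})

plt.figure(figsize=(8, 4))
bars = plt.bar(first_days["specimen_collection_date"], first_days["new_confirmed_cases"])

plt.xlabel("Specimen collection date")
plt.ylabel("New confirmed cases")
plt.title("New confirmed cases per day (all SF neighborhoods)")

# Rotate the date labels so they do not overlap, then tidy the margins.
plt.xticks(rotation=45)
plt.tight_layout()
# In a notebook the figure renders at the end of the cell; in a script, call plt.show().
```

With a full year of dates the tick labels would still crowd together, so plotting monthly totals or showing only every few ticks may work better.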