# In-Class Activity: Combining Data Sets

February 9, 2023

Today we will get some hands-on practice combining dataframes together in some useful ways:
* merging / joining
* concatenating

## Part 1: Back to air quality data

Remember the air quality index (AQI) data we worked with a few weeks back?

We want to see if there is any relationship between the AQI and the temperature.

How might we do that?

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

In [2]:
df_AQI=pd.read_csv('walla-walla-aqidaily2022-1.csv')
df_Temp=pd.read_csv('walla-walla-temp-sep2022.csv')

In [3]:
df_AQI

| Unnamed: 0 | Date | Overall AQI Value | Main Pollutant | Site Name (of Overall AQI) | Site ID (of Overall AQI) | Source (of Overall AQI) | Ozone | PM10 | PM25 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 01/01/2022 | 62 | PM2.5 | WALLA WALL - 12TH ST | 53-071-0005 | AQS | 35 | 8 | 62 |
| 1 | 01/02/2022 | 43 | PM2.5 | WALLA WALL - 12TH ST | 53-071-0005 | AQS | 41 | 10 | 43 |
| 2 | 01/03/2022 | 42 | Ozone | Confederated Tribes of the Umatilla Indian Res... | 53-013-9991 | AQS | 42 | 9 | 13 |
| 3 | 01/04/2022 | 33 | Ozone | Confederated Tribes of the Umatilla Indian Res... | 53-013-9991 | AQS | 33 | 10 | 17 |
| 4 | 01/05/2022 | 39 | PM2.5 | WALLA WALL - 12TH ST | 53-071-0005 | AQS | 34 | 7 | 39 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 360 | 12/27/2022 | 37 | Ozone | Confederated Tribes of the Umatilla Indian Res... | 53-013-9991 | AirNow | 37 | . | 3 |
| 361 | 12/28/2022 | 36 | Ozone | Confederated Tribes of the Umatilla Indian Res... | 53-013-9991 | AirNow | 36 | . | 11 |
| 362 | 12/29/2022 | 31 | Ozone | Confederated Tribes of the Umatilla Indian Res... | 53-013-9991 | AirNow | 31 | . | 20 |
| 363 | 12/30/2022 | 31 | Ozone | Confederated Tribes of the Umatilla Indian Res... | 53-013-9991 | AirNow | 31 | . | 19 |


In [4]:
df_Temp

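| Unnamed: 0 | Day | Max_Temp | Min_Temp | Avg_Temp | Precipitation |
|---|---|---|---|---|---|
| 0 | 2022-09-01 | 96 | 69 | 82.5 | 0 |
| 1 | 2022-09-02 | 103 | 62 | 82.5 | 0 |
| 2 | 2022-09-03 | 86 | 66 | 76.0 | T |
| 3 | 2022-09-04 | 88 | 57 | 72.5 | 0 |
| 4 | 2022-09-05 | 84 | 59 | 71.5 | 0 |
| 5 | 2022-09-06 | 86 | 56 | 71.0 | 0 |
| 6 | 2022-09-07 | 89 | 62 | 75.5 | 0 |
| 7 | 2022-09-08 | 77 | 55 | 66.0 | 0 |
| 8 | 2022-09-09 | 81 | 51 | 66.0 | 0 |
| 9 | 2022-09-10 | 87 | 50 | 68.5 | 0 |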


In [5]:
# Examine the data frames -- what columns do they each contain?
print(df_AQI.columns)
print(df_Temp.columns)

In [None]:
# How might we combine these two data frames? Would we merge/join or concatenate?
# Note: .merge() returns a new data frame; it does not change df_AQI in place
df_AQI.merge(df_Temp, left_on='Date', right_on='Day')

In [None]:
# Let's make a simple plot of the AQI values
# Try using .plot()

# YOUR CODE HERE

In [None]:
# We can also make a simple plot of the temperature 
# This time try matplotlib/pyplot (and maybe some styling)

# YOUR CODE HERE


In [None]:
# We are going to merge/join them
# To do this, we need to have at least one SHARED column between the two data frames
# Do we have that?
# Yes! We have "Date" in the AQI one and "Day" in the temperature one...
# Let's try...
# We will use .merge()
# Here's the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

# YOUR CODE HERE
df_merge=df_AQI.merge(df_Temp,how='left', left_on='Date',right_on="Day",copy=True)
df_merge


In [None]:
df_merge[df_merge['Date']=="09/01/2022"]

In [None]:
# What happened... did it work?


In [None]:
# No! Why might this be...

In [None]:
# Check for a few days in September...

In [None]:
# Notice the format of the "Date" and "Day" columns... they are not the same!
# Let's investigate...

In [None]:
# Check the types...

In [None]:
# How might we fix this?

In [None]:
# Introducing... datetime!
# See: https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html

# Datetime objects (including timestamps and time intervals) are special kinds of objects
# They let us do time-based calculations (which we will come back to later in this course)

In [None]:
# Let's make a new column in each data frame that has the date/day, but as a pandas datetime object

# YOUR CODE HERE
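
One possible approach is sketched below on two tiny toy frames rather than the real files. The AQI numbers are invented, and the column name `Date_dt` is a hypothetical choice; the format strings assume the `MM/DD/YYYY` and `YYYY-MM-DD` patterns seen in the previews above.

```python
import pandas as pd

# Toy frames that mimic the two date formats; the AQI values are invented.
toy_aqi = pd.DataFrame({"Date": ["09/01/2022", "09/02/2022"], "AQI": [40, 55]})
toy_temp = pd.DataFrame({"Day": ["2022-09-01", "2022-09-02"], "Max_Temp": [96, 103]})

# As plain strings the keys never match, so the default (inner) merge has no rows.
string_merge = toy_aqi.merge(toy_temp, left_on="Date", right_on="Day")
print(len(string_merge))

# Parse each column with an explicit format into a new datetime column.
toy_aqi["Date_dt"] = pd.to_datetime(toy_aqi["Date"], format="%m/%d/%Y")
toy_temp["Date_dt"] = pd.to_datetime(toy_temp["Day"], format="%Y-%m-%d")

# The two columns now hold the same values, whatever the original text looked like.
print(toy_aqi["Date_dt"].equals(toy_temp["Date_dt"]))
```

Giving `format=` explicitly is optional, but it avoids any guessing about whether `09/01/2022` means September 1 or January 9.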

In [None]:
# Side bar: We can do some super cool stuff now that we have a datetime object!

# What dtype is it?
# Note that this particular one is a "Timestamp" (to disambiguate from a time range, etc.)
# You can read more about Timestamps here: https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html#pandas.Timestamp

In [None]:
# We can extract the year, month, day, etc.
# UNCOMMENT THE CODE BELOW AND MAKE SURE IT MATCHES YOUR VARIABLE NAMES BEFORE RUNNING
# print(df_AQI["D"][0]) # print out the whole Timestamp
# print(df_AQI["D"][0].year) # print out just the year
# print(df_AQI["D"][0].month) # print out just the month
# print(df_AQI["D"][0].day) # print out just the day
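
The lines above read one Timestamp at a time. As a small sketch (on a made-up Series, not the class data), the `.dt` accessor exposes the same attributes for a whole column at once, which is also a handy way to pick out a single month.

```python
import pandas as pd

# A made-up Series of dates standing in for a datetime column.
dates = pd.to_datetime(pd.Series(["2022-09-01", "2022-09-02", "2022-12-30"]))

first = dates[0]  # a single pandas Timestamp
print(first.year, first.month, first.day)

# .dt applies the same attributes to every element of the column.
print(dates.dt.month.tolist())
print(dates.dt.day_name().tolist())

# A boolean mask on the month keeps only the September rows.
september = dates[dates.dt.month == 9]
print(len(september))
```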

In [None]:
# Now, we can merge again... but this time, on our new datetime columns

In [None]:
# Try it out... what happens if we use:
# a) how="left"
# b) how="inner"

# YOUR CODE HERE
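
To see the difference between the two options without the full data, here is a minimal sketch on invented toy frames: three AQI days, but temperatures for only two of them. The column name `Date_dt` is again a hypothetical choice.

```python
import pandas as pd

# Invented toy data: the left frame has one day that the right frame lacks.
left = pd.DataFrame({
    "Date_dt": pd.to_datetime(["2022-08-31", "2022-09-01", "2022-09-02"]),
    "AQI": [35, 40, 55],
})
right = pd.DataFrame({
    "Date_dt": pd.to_datetime(["2022-09-01", "2022-09-02"]),
    "Max_Temp": [96, 103],
})

left_join = left.merge(right, how="left", on="Date_dt")
inner_join = left.merge(right, how="inner", on="Date_dt")

print(left_join.shape)   # every row of `left` is kept
print(inner_join.shape)  # only dates found in both frames are kept

# In the left join, days with no temperature reading get NaN.
print(left_join["Max_Temp"].isna().sum())
```

If it is unclear where a row came from, passing `indicator=True` to `.merge()` adds a `_merge` column that labels each row as `left_only`, `right_only`, or `both`.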

In [None]:
# Let's examine...
# First look at the left join, then the inner join... how are they different?
# Which would we use if we just want to analyze September data?

In [None]:
# Look at just the rows in September

In [None]:
# Now we can do some cool things! Like use .corr to examine correlations
# More on .corr: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html
# More on the correlation metric (Pearson's correlation): https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

# YOUR CODE HERE

# How do we interpret these results?
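
As a hedged illustration of reading the output, the sketch below runs `.corr()` on an invented toy frame whose columns are perfectly related by construction. Real data will land somewhere in between, and a merged frame with text columns (such as "Main Pollutant") may need the numeric columns selected first, as done here.

```python
import pandas as pd

# Invented values: Max_Temp rises with AQI, Min_Temp falls with AQI.
toy = pd.DataFrame({
    "AQI": [30, 40, 50, 60],
    "Max_Temp": [70, 80, 90, 100],
    "Min_Temp": [60, 55, 50, 45],
    "Main Pollutant": ["Ozone", "Ozone", "PM2.5", "PM2.5"],
})

# Select only the numeric columns, then compute Pearson correlations (the default).
corr = toy[["AQI", "Max_Temp", "Min_Temp"]].corr()
print(corr)

# Values near +1: the two columns rise together.
# Values near -1: one rises as the other falls.
# Values near 0: little linear relationship.
print(corr.loc["AQI", "Max_Temp"], corr.loc["AQI", "Min_Temp"])
```

A strong correlation on its own does not show that temperature causes poor air quality; both could be driven by something else, such as wildfire season.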

In [None]:
# Or, make a scatterplot of AQI versus Max Temp
# More on pyplot's scatter: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html

# YOUR CODE HERE

## Part 2: Let's look at ALL the months in 2022

The source of our temperature data, [NOWData/weather.gov](https://www.weather.gov/wrh/Climate) (a service of NOAA) has a frustrating feature:
It lets you download historical daily temperature data for geographic areas like Walla Walla (yay!), but you can only do so for a month at a time (boo!). Note that you can get a _plot_ of an entire year, though (go figure!). Tl;dr: Open data is hard!

But, with our data skills, we can easily combine all of the months into a single data file.

I've done the grunt-work of getting the monthly data for Walla Walla from 2022 for you and putting it in a .zip file, which you can download from Canvas.

Now, your task is to import each of these files and combine them into a single data frame so we can do some analysis.

How might we do that? Do we need to use join/merge or concatenate?

In [None]:
# Import each month's csv file as a dataframe

# YOUR CODE HERE
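
Rather than typing twelve `read_csv` calls, one option is to loop over the files. The sketch below first writes two tiny CSV files to a temporary folder so that it runs on its own; the file names and the `temp-*2022.csv` pattern are hypothetical, so adjust them to match the files in the .zip.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two tiny stand-in files; the names and values are invented.
folder = tempfile.mkdtemp()
pd.DataFrame({"Day": ["2022-01-01"], "Max_Temp": [40]}).to_csv(
    os.path.join(folder, "temp-jan2022.csv"), index=False)
pd.DataFrame({"Day": ["2022-02-01"], "Max_Temp": [45]}).to_csv(
    os.path.join(folder, "temp-feb2022.csv"), index=False)

# Find every file that matches the pattern, then read each into a data frame.
paths = sorted(glob.glob(os.path.join(folder, "temp-*2022.csv")))
monthly = [pd.read_csv(path) for path in paths]

print([os.path.basename(path) for path in paths])
print(len(monthly))
```

Note that sorting file names alphabetically puts "feb" before "jan", so the rows are best sorted by date after the months are combined.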

In [None]:
# We will concatenate them -- we can do this using pd.concat() or the older .append() method


In [None]:
# The first way, with .append()
# Note that this method is deprecated (since pandas 1.4) and was removed in pandas 2.0,
# so it only works on older versions of pandas
# More info here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
# (I include this because things often get deprecated in other people's packages!)

# YOUR CODE HERE

In [None]:
# The second way, with pd.concat()
# More info here: https://pandas.pydata.org/docs/reference/api/pandas.concat.html

# YOUR CODE HERE
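
A minimal sketch of the `pd.concat()` approach, using two invented two-row frames in place of the real monthly files:

```python
import pandas as pd

# Invented stand-ins for two monthly data frames with the same columns.
jan = pd.DataFrame({"Day": ["2022-01-30", "2022-01-31"], "Max_Temp": [41, 39]})
feb = pd.DataFrame({"Day": ["2022-02-01", "2022-02-02"], "Max_Temp": [44, 47]})

# pd.concat() takes a list of frames and stacks their rows.
# ignore_index=True renumbers the rows instead of repeating 0, 1, 0, 1.
year = pd.concat([jan, feb], ignore_index=True)

print(year.shape)
print(year.index.tolist())
```

Because `pd.concat()` accepts a list, the same call works unchanged for all twelve months.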

## Part 3: Challenge

Here are some challenges if you get all the way through.
* How do the correlations change when you look at the entire year's worth of data?
* What other plots can you make of the temperature and AQI data?
* Investigate the relationship between some of the other columns in the temperature data and the AQI -- like min temperature or precipitation.
* How might you extend this using other data that isn't contained in these two data sets?
