# Temperature Data ETL

In [141]:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt

### State Temperature Dataset

Begin by loading and displaying the data. The dataset contains monthly average temperatures for each US state (excluding Alaska and Hawaii) from 1950 through 2022. It also includes a column with the mean for each month over 1901 to 2000, which serves as a baseline for seeing how monthly temperatures have changed over time. Each state's centroid longitude and latitude are included as well, in case they are needed.

In [142]:
temperatures = pd.read_csv('average_monthly_temperature_by_state_1950-2022.csv')
temperatures.head()

Unnamed: 0.1,Unnamed: 0,month,year,state,average_temp,monthly_mean_from_1901_to_2000,centroid_lon,centroid_lat
0,0,1,1950,Alabama,53.8,45.9,-86.828372,32.789832
1,1,1,1950,Arizona,39.6,41.1,-111.664418,34.29311
2,2,1,1950,Arkansas,45.6,40.4,-92.439268,34.899745
3,3,1,1950,California,39.4,42.7,-119.610699,37.246071
4,4,1,1950,Colorado,25.2,24.5,-105.547825,38.998552


In [143]:
# check for nulls
temperatures.isnull().sum(axis=0)

Unnamed: 0                        0
month                             0
year                              0
state                             0
average_temp                      0
monthly_mean_from_1901_to_2000    0
centroid_lon                      0
centroid_lat                      0
dtype: int64
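
Null counts are only one kind of integrity check. As a minimal sketch, assuming each (state, year, month) combination should appear exactly once, the following shows two further checks on a toy frame. The `sample` frame and its values are illustrative stand-ins, not rows guaranteed to match the real file.

```python
import pandas as pd

# Toy stand-in for the real file; the values are illustrative only.
sample = pd.DataFrame({
    "month": [1, 2, 1, 2],
    "year": [1950, 1950, 1950, 1950],
    "state": ["Alabama", "Alabama", "Arizona", "Arizona"],
    "average_temp": [53.8, 56.0, 39.6, 44.1],
})

# Each (state, year, month) combination should appear exactly once.
duplicate_rows = sample.duplicated(subset=["state", "year", "month"]).sum()

# Month numbers should fall in the calendar range 1..12.
months_valid = sample["month"].between(1, 12).all()

print(f"duplicate keys: {duplicate_rows}, months valid: {months_valid}")
```

The same two expressions could be run on `temperatures` directly; they are shown on a toy frame here only so the example runs without the CSV file.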

Since there is no missing data, little cleaning is needed. We first sort the rows by state, then by year, and finally by month. Next, we drop the unnecessary Unnamed: 0 column and reset the index so that it follows the new order. We then add a net_difference column that holds each month's average temperature minus the historical mean for that month, so positive values indicate a month that was warmer than the baseline. Our final cleaning step is to rename some columns for ease of use.

In [144]:

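# Sort into a logical order: state, then year, then month.
temperatures.sort_values(by=["state", "year", "month"], inplace=True)
# Drop the leftover index column and renumber the rows to match the new order.
temperatures = temperatures.drop("Unnamed: 0", axis=1)
temperatures.reset_index(inplace=True, drop=True)

# Observed monthly average minus the 1901-2000 baseline for that month.
temperatures["net_difference"] = temperatures["average_temp"] - temperatures["monthly_mean_from_1901_to_2000"]

# Shorter column names for ease of use.
temperatures.rename(columns={
    "monthly_mean_from_1901_to_2000": "average_historic",
    "centroid_lon": "longitude",
    "centroid_lat": "latitude",
}, inplace=True)

temperatures.head()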

Unnamed: 0,month,year,state,average_temp,average_historic,longitude,latitude,net_difference
0,1,1950,Alabama,53.8,45.9,-86.828372,32.789832,7.9
1,2,1950,Alabama,56.0,46.5,-86.828372,32.789832,9.5
2,3,1950,Alabama,52.7,51.6,-86.828372,32.789832,1.1
3,4,1950,Alabama,55.7,59.0,-86.828372,32.789832,-3.3
4,5,1950,Alabama,66.4,66.7,-86.828372,32.789832,-0.3
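
The cleaning steps can also be written as one method chain that returns a new dataframe instead of editing in place. This is a sketch, not the notebook's own method: `clean_state_temperatures` is a hypothetical helper name, and the two input rows are copied from the first `head()` output in this notebook, deliberately placed out of order.

```python
import pandas as pd


def clean_state_temperatures(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the cleaning steps without inplace edits."""
    return (
        raw.sort_values(["state", "year", "month"])
        .drop(columns="Unnamed: 0")
        .reset_index(drop=True)
        .assign(
            net_difference=lambda d: d["average_temp"]
            - d["monthly_mean_from_1901_to_2000"]
        )
        .rename(columns={
            "monthly_mean_from_1901_to_2000": "average_historic",
            "centroid_lon": "longitude",
            "centroid_lat": "latitude",
        })
    )


# Two rows from the raw data, deliberately out of order.
raw = pd.DataFrame({
    "Unnamed: 0": [1, 0],
    "month": [1, 1],
    "year": [1950, 1950],
    "state": ["Arizona", "Alabama"],
    "average_temp": [39.6, 53.8],
    "monthly_mean_from_1901_to_2000": [41.1, 45.9],
    "centroid_lon": [-111.664418, -86.828372],
    "centroid_lat": [34.29311, 32.789832],
})

cleaned = clean_state_temperatures(raw)
print(cleaned[["state", "average_temp", "average_historic", "net_difference"]].round(1))
```

Returning a new dataframe leaves the raw data untouched, which makes it easier to rerun the cell without reloading the CSV.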


Now that the data is in a good format, we can write the cleaned dataset to a CSV file so that it can be used elsewhere for analysis and visualizations.

In [145]:
temperatures.to_csv('cleaned_temperature_data_States.csv')
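
One thing to be aware of: `to_csv` writes the dataframe index by default, and `read_csv` then reads it back as an unnamed first column, which is one way a column like Unnamed: 0 comes to exist. The sketch below shows the difference on a toy frame using in-memory buffers. Whether `index=False` suits the downstream analysis is an assumption, so the cell above is left as written.

```python
import io

import pandas as pd

# Toy frame standing in for the cleaned data.
frame = pd.DataFrame({"state": ["Alabama", "Arizona"], "average_temp": [53.8, 39.6]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False only the data columns are written.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```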

### Overall US Temperature Dataset

The previous dataset contains values for each state. We may also want a dataset that aggregates this information for the US overall, so we create that as well.

We begin by assigning the cleaned temperatures dataframe to a new variable. We then group the data by month and year and aggregate the remaining columns (excluding longitude and latitude) with the mean method, which gives a simple unweighted average across states for each month. Because month and year are also among the selected columns, they survive as ordinary columns, so we use droplevel and reset_index to get rid of the multi-index on the dataframe. Finally, we sort the values by year and then month, which means the index needs to be reset one last time.

In [146]:
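overall = temperatures

# Average every state's values for each (month, year) pair.
overall = overall.groupby(["month", "year"])[["month", "year", "average_temp", "average_historic", "net_difference"]].mean()
# month and year are also kept as columns, so the multi-index can be discarded.
overall = overall.droplevel(level=["year"])
overall.reset_index(drop=True, inplace=True)
# Put the rows in chronological order and renumber them.
overall.sort_values(["year", "month"], inplace=True)
overall.reset_index(drop=True, inplace=True)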
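
A more compact way to get a similar result is to group with `as_index=False`. The sketch below uses a toy frame with made-up values and two placeholder states; the column names follow the cleaned dataset. It is an alternative under those assumptions rather than a drop-in replacement, since the column order differs (year comes before month) and month and year stay as integers instead of becoming floats.

```python
import pandas as pd

# Toy state-level frame; values are illustrative only.
states = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "year": [1950, 1950, 1950, 1950],
    "month": [1, 2, 1, 2],
    "average_temp": [50.0, 54.0, 40.0, 42.0],
    "average_historic": [48.0, 50.0, 41.0, 43.0],
    "net_difference": [2.0, 4.0, -1.0, -1.0],
})

value_columns = ["average_temp", "average_historic", "net_difference"]

# as_index=False keeps year and month as ordinary columns, and groupby sorts
# by the keys by default, so no extra sort or index reset is needed.
national = states.groupby(["year", "month"], as_index=False)[value_columns].mean()
print(national)
```

As in the cell above, this is an unweighted mean, so every state counts equally regardless of its area.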

We are then ready to write the overall dataframe to its own CSV file.

In [147]:
overall.to_csv('cleaned_temperature_data_US.csv')