## Introduction to Pandas 1

_from pandas documentation_

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

A good introduction is available here: https://pandas.pydata.org/pandas-docs/stable/10min.html#

## Demo with long-term rain data

Import modules

In [None]:
#If you're going to use matplotlib in a notebook, always start with this command FIRST
%matplotlib notebook

import pandas as pd
import matplotlib.pyplot as plt



Set some variables to make our lives easier

In [None]:
excelfile = 'Input_data/DCA_longterm_hourly.xlsx'

Import data into a Series or a DataFrame

In [None]:
df = pd.read_excel(excelfile,usecols=[4,9],index_col=0)

Check out the data

In [None]:
#head(n) prints the first n rows of data (5 if you leave n out).
#Try tail(n) to print the last n rows of data, or sample(n) to print n random rows.
df.head(10)

In [None]:
df = df.rename(columns={"Precip_clean":"Precip_in"})

In [None]:
df.head()

### A dataframe is made up of series

In [None]:
df["Precip_in"]
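
To make the Series/DataFrame relationship concrete, here is a minimal sketch on made-up data. The `toy` frame and its `Temp_F` column are hypothetical stand-ins, not part of the rain dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the rain gauge records
toy = pd.DataFrame(
    {"Precip_in": [0.0, 0.1, 0.25], "Temp_F": [41.0, 40.5, 39.8]},
    index=pd.date_range("1990-01-01", periods=3, freq="h"),
)

precip = toy["Precip_in"]          # single brackets return a Series
precip_frame = toy[["Precip_in"]]  # double brackets return a one-column DataFrame

print(type(precip).__name__)
print(type(precip_frame).__name__)
print(precip.index.equals(toy.index))  # the Series shares the DataFrame's index
```

Each column is a Series that carries the same index as the frame it came from, which is why column arithmetic lines up row by row.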

### It's easy to work with timeseries data

There are gaps in the time series that we imported. Let's make the index continuous and fill the gaps with NaN.

Pandas has lots of tools to help deal with NaNs later, although we may not get to them. You can replace them with values, leave them, or fill them based on rules.

Matplotlib has options for leaving them out, interpolating, filling, etc.


In [None]:
df = df.resample('H').mean()   #hourly bins; hours with no data become NaN (newer pandas versions spell this alias 'h')
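
The NaN-handling options mentioned above can be sketched on a tiny made-up series. This is only an illustration: the `gappy` series is hypothetical, and it uses `reindex` on an explicit hourly index as one possible alternative to the `resample` call in this notebook.

```python
import pandas as pd

# Hypothetical hourly record with the 02:00 and 03:00 readings missing
idx = pd.to_datetime(["1990-01-01 00:00", "1990-01-01 01:00", "1990-01-01 04:00"])
gappy = pd.Series([0.1, 0.0, 0.4], index=idx, name="Precip_in")

# Build a continuous hourly index; hours with no reading become NaN
full_index = pd.date_range(gappy.index.min(), gappy.index.max(), freq="h")
continuous = gappy.reindex(full_index)

filled_zero = continuous.fillna(0.0)     # replace NaN with a fixed value
interpolated = continuous.interpolate()  # linear fill between neighbours
dropped = continuous.dropna()            # discard the NaN rows again

print(continuous)
```

Which option is appropriate depends on the data. For rainfall, interpolating across a gap invents rain that may never have fallen, so leaving NaN or filling with 0 is often the safer assumption.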

In [None]:
df

In [None]:
df.info()   #give basic info on the dataframe

### Let's query the data to only get data for 1990

In [None]:
df1990 = df.loc["1990-01-01":"1990-12-31"] #dates are smart! Both end dates are included in the slice
df1990.head()
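
Date strings can also be partial. The sketch below uses a hypothetical daily frame (`toy`) to show that a full range, a year string, and a year-month string all select by label on a datetime index.

```python
import pandas as pd

# Hypothetical daily series spanning three years
idx = pd.date_range("1989-01-01", "1991-12-31", freq="D")
toy = pd.DataFrame({"Precip_in": 0.01}, index=idx)

by_range = toy.loc["1990-01-01":"1990-12-31"]  # both endpoints are included
by_year = toy.loc["1990"]                      # partial string: all of 1990
march = toy.loc["1990-03"]                     # partial string: one month

print(len(by_range), len(by_year), len(march))
```

The first two selections return the same 365 rows, so the shorter year string is usually enough when a whole calendar year is wanted.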


In [None]:
df1990['Precip_in'].sum()

### Export results to a new excel spreadsheet

In [None]:
df1990.to_excel('Output_data/df1990.xlsx')

### Let's do some basic math and create a new column

In [None]:
def in_to_cm(inch):
    return inch * 2.54



In [None]:
df["precip_cm"] = df["Precip_in"].apply(in_to_cm)  #create a new column using apply()

In [None]:
df["precip_cm2"] = df["Precip_in"] * 2.54   #or, you can use element-wise (vectorized) math.
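
The two approaches give the same numbers. A small sketch on a hypothetical series (`rain_in`) checks that, including how a missing value passes through both.

```python
import pandas as pd

def in_to_cm(inch):
    """Convert inches to centimetres."""
    return inch * 2.54

rain_in = pd.Series([0.0, 0.5, 1.0, None], name="Precip_in")  # None is stored as NaN

via_apply = rain_in.apply(in_to_cm)  # calls in_to_cm once per element
via_vector = rain_in * 2.54          # one vectorized multiplication

# Series.equals treats NaN values in the same position as equal
print(via_apply.equals(via_vector))
```

On a long record like this one, the vectorized form is generally the faster choice because it avoids a Python function call per row. `apply` is still handy when the conversion cannot be written as plain arithmetic.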

In [None]:
df.tail()

In [None]:
#df.to_excel('Output_data/long_rain.xlsx')
#only uncomment and execute this cell if you have a little bit of time to wait. that's about 10 MB of data

### Try converting hourly data to daily data. Get out your stopwatch!

In [None]:
#df is over 50 years of nearly hourly data - convert to daily
#Make sure datetime is the index!

df_daily = df.resample('D').sum()

In [None]:
df_daily.head()

### Let's check to make sure pandas aggregated our data correctly

In [None]:
hourlysum = df["Precip_in"].sum()
dailysum = df_daily["Precip_in"].sum()

print("Total Hourly Sum is {} in. and Total Daily Sum is {} in.".format(hourlysum, dailysum))
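
Because the two totals are floating-point sums added up in a different order, they may differ in the last few digits even when the aggregation is correct. A hedged sketch of a tolerance-based check, using a hypothetical three-day hourly series with one missing hour:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("1990-01-01", periods=72, freq="h")
hourly = pd.Series(0.01, index=idx, name="Precip_in")
hourly.iloc[5] = np.nan  # sum() skips NaN, so a missing hour contributes 0

daily = hourly.resample("D").sum()

hourly_total = hourly.sum()
daily_total = daily.sum()
totals_match = np.isclose(hourly_total, daily_total)  # compare within a tolerance

print(hourly_total, daily_total, totals_match)
```

`np.isclose` is used here instead of `==` so that harmless rounding differences do not look like an aggregation error.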

In [None]:
df_daily.to_excel("Output_data/df_daily.xlsx")  #Export data to Excel

### Pandas has many timeseries options

In [None]:
#every 6 hours?
df_6hours = df.resample('6H').sum()

In [None]:
df_6hours.tail()

In [None]:
#every second business day??!!
df_bd = df.resample('2B').sum()
df_bd.tail()
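
Resampling is not limited to `sum()`. As an illustration on hypothetical data, the sketch below resamples the same hourly series to 6-hour totals and to several daily statistics at once with `agg`.

```python
import pandas as pd

# Hypothetical hourly values 0, 1, 2, ... 47 over two days
idx = pd.date_range("1990-01-01", periods=48, freq="h")
hourly = pd.Series(range(48), index=idx, name="Precip_in", dtype="float64")

six_hourly = hourly.resample("6h").sum()                         # 8 bins of 6 hours
daily_stats = hourly.resample("D").agg(["sum", "max", "count"])  # one column per statistic

print(six_hourly.head())
print(daily_stats)
```

The `count` column is a quick way to see how many non-missing hourly readings went into each day, which helps spot days with gaps.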

### Let's try to calculate the long-term average rainfall for the month of March in DC

In [None]:
#what's the average March rainfall?
dfmonth = df.resample('M').sum()          #Convert hourly data to monthly totals (newer pandas versions spell the month-end alias 'ME')
dfmonth['month'] = dfmonth.index.month    #Add a column showing the month by getting the month from the index date
dfmonth.head()


In [None]:
#what's the average March rainfall?
dfmonth[dfmonth['month']==3].mean()     #Take the mean of all March rows
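
The same idea extends to every month at once. This is a sketch under stated assumptions: the `monthly` series below is made up (two years of invented monthly totals), and `groupby` on the index month is shown as an alternative to adding a `month` column.

```python
import pandas as pd

# Hypothetical monthly totals for two years (inches), labelled at month start
idx = pd.date_range("1990-01-01", periods=24, freq="MS")
monthly = pd.Series([float(ts.month) for ts in idx], index=idx, name="Precip_in")
monthly.loc[pd.Timestamp("1991-03-01")] = 5.0  # make the two Marches differ

march_mean = monthly[monthly.index.month == 3].mean()       # boolean mask, March only
climatology = monthly.groupby(monthly.index.month).mean()   # mean for all 12 months

print(march_mean)
print(climatology)
```

The March entry of `climatology` matches `march_mean`, and the other eleven months come along for free.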

In [None]:
dfmonth.plot(grid=True)

In [None]:
dfmonth.plot(y="Precip_in",grid=True)