# Pandas tutorial

In this notebook I'm going to show a few basics of using the pandas library for loading, processing, and analyzing data. It's by no means a complete guide, but it shows some basic possibilities. If you need more advanced features, search online or see https://pandas.pydata.org for more information!

In [1]:
import pandas
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline

First we'll load some data:

In [2]:
df = pandas.read_csv('weather.csv', index_col='Date', parse_dates=['Date'])
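To see what `index_col` and `parse_dates` do without needing the real file, here is a minimal sketch that reads a tiny CSV from a string. The three rows are a hypothetical stand-in for `weather.csv`, and the names `csv_text` and `sample` are made up for this illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for weather.csv
csv_text = """Date,Mean_Temperature_F,Events
2014-10-13,62.0,Rain
2014-10-14,59.0,Rain
2014-10-15,58.0,Rain
"""

# index_col makes 'Date' the row index; parse_dates converts it to real timestamps
sample = pd.read_csv(io.StringIO(csv_text), index_col='Date', parse_dates=['Date'])

print(type(sample.index).__name__)  # the index is a DatetimeIndex, not plain strings
print(sample.index[0].month)        # date parts such as the month are now available
```

Because the index holds timestamps rather than text, date attributes like the month can be read straight from it later on.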

Next, we display the data to get a quick view of which columns there are:

In [3]:
df.head()

Date,Max_Temperature_F,Mean_Temperature_F,Min_TemperatureF,Max_Dew_Point_F,MeanDew_Point_F,Min_Dewpoint_F,Max_Humidity,Mean_Humidity,Min_Humidity,Max_Sea_Level_Pressure_In,Mean_Sea_Level_Pressure_In,Min_Sea_Level_Pressure_In,Max_Visibility_Miles,Mean_Visibility_Miles,Min_Visibility_Miles,Max_Wind_Speed_MPH,Mean_Wind_Speed_MPH,Max_Gust_Speed_MPH,Precipitation_In,Events
2014-10-13,71,62.0,54,55,51,46,87,68,46,30.03,29.79,29.65,10,10,4,13,4,21,0.0,Rain
2014-10-14,63,59.0,55,52,51,50,88,78,63,29.84,29.75,29.54,10,9,3,10,5,17,0.11,Rain
2014-10-15,62,58.0,54,53,50,46,87,77,67,29.98,29.71,29.51,10,9,3,18,7,25,0.45,Rain
2014-10-16,71,61.0,52,49,46,42,83,61,36,30.03,29.95,29.81,10,10,10,9,4,-,0.0,Rain
2014-10-17,64,60.0,57,55,51,41,87,72,46,29.83,29.78,29.73,10,10,6,8,3,-,0.14,Rain


Pandas offers functions to look at basic info and describe the data, such as counting non-missing values, finding minimum and maximum values, computing quartiles, and printing a summary description. These examples are for inspiration; there are many more options!

In [4]:
df['Mean_Temperature_F'].count()

688

In [5]:
df['Mean_Temperature_F'].min()

33.0

In [6]:
df['Mean_Temperature_F'].max()

83.0

In [7]:
df['Mean_Temperature_F'].describe()

count    688.000000
mean      56.584302
std       10.408058
min       33.000000
25%       48.000000
50%       56.000000
75%       65.000000
max       83.000000
Name: Mean_Temperature_F, dtype: float64
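If you only need some of these statistics, you can ask for them directly. The sketch below uses the five `Mean_Temperature_F` values from the `head()` output as a small stand-in series. The names `temps`, `q`, and `summary` are made up for this illustration.

```python
import pandas as pd

# The five Mean_Temperature_F values shown by head(), used as a small stand-in
temps = pd.Series([62.0, 59.0, 58.0, 61.0, 60.0], name='Mean_Temperature_F')

# quantile() accepts a list of fractions; 0.25, 0.5 and 0.75 are the quartiles
q = temps.quantile([0.25, 0.5, 0.75])

# agg() with a list of names computes several statistics in one call
summary = temps.agg(['count', 'min', 'max', 'mean'])

print(q)
print(summary)
```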

After looking at the basic data, you might want to add some computed values. Pandas allows you to add columns whose values are calculated from the index or from existing columns:

In [8]:
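# Month number (1-12), read directly from the DatetimeIndex
df['Month'] = df.index.month

# Lookup table from month number to quarter label
quarters = { 1: 'Q1', 2: 'Q1', 3: 'Q1', 4: 'Q2', 5: 'Q2', 6: 'Q2',
             7: 'Q3', 8: 'Q3', 9: 'Q3', 10: 'Q4', 11: 'Q4', 12: 'Q4' }

# Translate each month number into its quarter label
df['Quarter'] = df['Month'].map(quarters)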
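The cell above derives columns from the index. As a small sketch of the other case, combining existing columns, the example below computes a daily temperature range. It also shows that a `DatetimeIndex` exposes a `quarter` attribute, which can replace the lookup dictionary. The frame `demo`, its values, and the column `Temp_Range_F` are hypothetical and only reuse the column names from the weather data.

```python
import pandas as pd

# Hypothetical rows, one per quarter, reusing column names from the weather data
dates = pd.to_datetime(['2014-10-13', '2015-01-05', '2015-04-20', '2015-07-04'])
demo = pd.DataFrame(
    {'Max_Temperature_F': [71, 45, 60, 80],
     'Min_TemperatureF': [54, 30, 44, 61]},
    index=dates,
)

# Combine two existing columns: arithmetic is applied row by row
demo['Temp_Range_F'] = demo['Max_Temperature_F'] - demo['Min_TemperatureF']

# The index knows its own quarter (1-4); format it as 'Q1'..'Q4'
demo['Quarter'] = demo.index.quarter.map('Q{}'.format)

print(demo[['Temp_Range_F', 'Quarter']])
```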

Another possibility is grouping and aggregating data. This lets you, for example, summarize data per quarter for reporting and similar use cases:

In [9]:
df.groupby('Quarter')['Mean_Temperature_F'].mean()

Quarter
Q1    48.488889
Q2    60.851648
Q3    68.461039
Q4    49.906977
Name: Mean_Temperature_F, dtype: float64

In [10]:
df.groupby('Quarter')['Precipitation_In'].mean()

Quarter
Q1    0.155967
Q2    0.031209
Q3    0.020195
Q4    0.205640
Name: Precipitation_In, dtype: float64
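The two cells above each compute one statistic for one column. If you want several at once, `groupby(...).agg(...)` with named aggregations returns them in a single table. This is a minimal sketch on a hypothetical four-row frame. The values and the output column names (`mean_temp`, `max_temp`, `total_precip`) are made up for illustration.

```python
import pandas as pd

# Hypothetical data with the same column names as the weather frame
demo = pd.DataFrame({
    'Quarter': ['Q1', 'Q1', 'Q4', 'Q4'],
    'Mean_Temperature_F': [40.0, 50.0, 62.0, 58.0],
    'Precipitation_In': [0.10, 0.30, 0.00, 0.11],
})

# Each keyword becomes an output column: (source column, aggregation)
summary = demo.groupby('Quarter').agg(
    mean_temp=('Mean_Temperature_F', 'mean'),
    max_temp=('Mean_Temperature_F', 'max'),
    total_precip=('Precipitation_In', 'sum'),
)

print(summary)
```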

The final feature I want to show is filtering. You can select rows by value and then perform operations on the result:

In [11]:
df[df['Quarter'] == 'Q1']['Precipitation_In'].mean()

0.1559668508287293

In [12]:
df[df['Quarter'] == 'Q4'].groupby('Month')['Mean_Temperature_F'].mean()

Month
10    58.840000
11    46.800000
12    45.709677
Name: Mean_Temperature_F, dtype: float64
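Filters can also be combined. The sketch below builds a boolean mask from two conditions and passes it to `.loc`, which selects the rows and the column in one step. The frame `demo` and its values are hypothetical, and the names `wet_q4` and `result` are made up for this illustration.

```python
import pandas as pd

# Hypothetical data with the same column names as the weather frame
demo = pd.DataFrame({
    'Quarter': ['Q1', 'Q1', 'Q4', 'Q4'],
    'Mean_Temperature_F': [40.0, 50.0, 62.0, 58.0],
    'Precipitation_In': [0.10, 0.30, 0.00, 0.11],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
wet_q4 = (demo['Quarter'] == 'Q4') & (demo['Precipitation_In'] > 0)

# .loc[rows, column] filters the rows and picks the column in a single lookup
result = demo.loc[wet_q4, 'Mean_Temperature_F'].mean()

print(result)
```

Using `.loc` with a mask avoids the chained `df[...][...]` form, which matters mostly when you assign values to the filtered rows rather than just read them.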