
In [0]:
import pandas as pd
import matplotlib
%matplotlib inline 
import numpy as np

In [0]:
### Data transformation from previous notebooks
nyc = pd.read_csv('data/central-park-raw.csv', parse_dates=[0])
nyc.columns = [x.strip() for x in nyc.columns]
nyc.columns = [x.replace(' ', '_') for x in nyc.columns]
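# Replace the "T" marker with a small number so the column can be converted to numeric
# (.replace returns a new Series, so the result must be assigned back)
nyc.PrecipitationIn = pd.to_numeric(nyc.PrecipitationIn.replace("T", '0.001'))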
nyc['Events'] = nyc.Events.fillna('')

# Basic Stats

A nice feature of pandas is that you can quickly inspect data and get summary statistics.

In [0]:
# The describe method gives us basic stats. The result is a DataFrame
nyc.describe()

In [0]:
# Remember transpose (.T): one row per column is easier to read when there are many columns
nyc.describe().T


In [0]:
# to view non-numeric data pass include='all'
nyc.describe(include='all').T
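
To see what ``include='all'`` adds, here is a minimal sketch on a made-up two-column frame (the names ``df``, ``temp`` and ``events`` are illustrative, not part of the Central Park data). By default only numeric columns are summarized; with ``include='all'`` the string column gets ``count``, ``unique``, ``top`` and ``freq`` rows.

```python
import pandas as pd

# Toy data (hypothetical, not the Central Park file)
df = pd.DataFrame({
    "temp": [50, 60, 70, 80],
    "events": ["Rain", "", "Rain", "Fog"],
})

numeric_only = df.describe()               # summarizes only the numeric column
everything = df.describe(include="all")    # also summarizes the string column

print(list(numeric_only.columns))          # ['temp']
print(everything.loc["top", "events"])     # Rain (the most frequent value)
print(everything.loc["freq", "events"])    # 2 (how often "Rain" appears)
```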

In [0]:
# Various aggregation methods (max, mean, median, min, skew, kurtosis, autocorr,
#   nunique, sem, std, var; mad was removed in pandas 2.0)
# and properties (hasnans, is_monotonic_increasing, is_unique)
nyc.Max_Humidity.max()

In [0]:
nyc.Max_Humidity.quantile(.2)

In [0]:
nyc.Max_Humidity.quantile([.2,.3])

In [0]:
nyc.Max_Humidity.min()

In [0]:
nyc.Mean_Humidity.corr(nyc.Mean_TemperatureF)
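
As a rough illustration of what ``quantile`` and ``corr`` return, the sketch below uses a small hand-made Series (``s`` is an assumption for the example). A single quantile gives a scalar, a list of quantiles gives a Series indexed by the requested fractions, and ``corr`` gives the Pearson correlation by default.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

q20 = s.quantile(0.2)            # scalar, linearly interpolated between 10 and 20
qs = s.quantile([0.2, 0.3])      # Series indexed by 0.2 and 0.3

r_pos = s.corr(s * 2 + 1)        # perfectly linear, increasing
r_neg = s.corr(-s)               # perfectly linear, decreasing

print(q20)                       # 18.0
print(qs)
print(round(r_pos, 6), round(r_neg, 6))   # 1.0 -1.0
```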

## Basic Stats Assignment
With the nino dataset:

* *Describe* the data
* Choose a column
  * Print out the max, min, and mean
* Correlate (``corr``) the temperature column with the date column (might need to use ``.astype('int64')`` method)
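
The ``.astype('int64')`` hint can be sketched on synthetic data (the ``dates`` and ``temps`` Series below are made up, not the nino columns). ``corr`` needs numbers, and casting a datetime column to ``int64`` gives nanoseconds since the epoch, which preserves ordering and spacing.

```python
import pandas as pd

# Synthetic stand-ins for a date column and a temperature column
dates = pd.Series(pd.date_range("2000-01-01", periods=5, freq="D"))
temps = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

date_ints = dates.astype("int64")   # nanoseconds since 1970-01-01
r = temps.corr(date_ints)

print(date_ints.dtype)              # int64
print(round(r, 6))                  # 1.0, since temps rises steadily with the date
```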

## Basic Stats Extra
* Use the ``scatter_matrix`` function in ``pandas.plotting`` to plot every pair of numeric columns against each other, which makes correlations easy to spot (note this might take tens of seconds to run)
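
A scatter matrix is visual. If you also want the numbers, ``DataFrame.corr()`` returns the pairwise correlations as a table. The sketch below uses a made-up frame (``df`` and its columns are assumptions for illustration).

```python
import pandas as pd

# Toy frame: b rises with a, c falls as a rises
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
})

corr = df.corr()    # square table of pairwise Pearson correlations
print(corr)

# pd.plotting.scatter_matrix(df)  # the plotted counterpart (needs matplotlib)
```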

In [0]:
pd.plotting.scatter_matrix(nino)

# Plotting

Pandas has built-in integration with Matplotlib. Other libraries such as Seaborn also support plotting DataFrames and Series. This is not an in-depth intro to Matplotlib, but the Matplotlib website and gallery are great places to find more information.

In [0]:
# histograms are a quick way to visualize the distribution
nyc.Mean_Humidity.hist()

In [0]:
# add in figsize=(width,height) to boost size
nyc.Mean_Humidity.hist(figsize=(14, 10))

In [0]:
# If we use the .plot method we can add title and other attributes
nyc.Mean_Humidity.plot(kind='hist', title='Avg Humidity', figsize=(14, 10))

In [0]:
nyc.plot(x='EST', y='Mean_Humidity')

In [0]:
nyc.plot(x='EST', y='Mean_Humidity', figsize=(12, 8) )

In [0]:
# Once EST is the index (a date), we can resample a column by time period using *Offset Aliases*
# ('M' is month end; pandas 2.2+ prefers the spelling 'ME')
# see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
nyc.set_index('EST').Mean_Humidity.resample('M').mean().plot(figsize=(10, 6)) 

In [0]:
# Same idea with a different offset alias: '2W' averages over two-week bins
nyc.set_index('EST').Mean_Humidity.resample('2W').mean().plot(figsize=(10, 6)) 
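
To look at what ``resample`` produces before plotting it, here is a small sketch on a synthetic daily Series (``s`` is made up for the example). Resampling groups rows into time bins, and the aggregation (``mean``, ``sum``, ...) collapses each bin to one value.

```python
import pandas as pd

# 60 days of synthetic daily values with a DatetimeIndex
idx = pd.date_range("2020-01-01", periods=60, freq="D")
s = pd.Series(range(60), index=idx)

two_week_mean = s.resample("2W").mean()   # one value per two-week bin
two_week_sum = s.resample("2W").sum()

print(len(s), len(two_week_mean))         # far fewer rows after resampling
print(two_week_mean.head())
```

Every row lands in exactly one bin, so the binned sums add up to the original total.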

In [0]:
# Plot all the things (may be useful or just art)
nyc.set_index('EST').plot(figsize=(12,8))

In [0]:
nyc.plot(x='Max_TemperatureF', y='Max_Humidity', kind='scatter', alpha=.5, 
        figsize=(10, 8))

In [0]:
nyc.Max_TemperatureF.corr(nyc.Max_Humidity)

## Plotting Assignment
With the nino dataset:
* Plot a histogram of air temp
* Plot a scatter plot of latitude and longitude


# Filtering

In [0]:
# When we apply a comparison operator to a series we get back a series of True/False values
# We call this a "mask", which we can use to filter (similar to a mask in Photoshop)
# all EST in 2000 or later
m2000 = nyc.EST.dt.year >= 2000

# below 2010
lt2010 = nyc.EST.dt.year < 2010



In [0]:
# The "and" operation looks at whether the operands are truthy or falsey
# This is a case where normal Python syntax doesn't work:
# a Series has no single truth value, so this line raises a ValueError
nyc[m2000 and lt2010]

In [0]:
# & combines the two masks element by element - which is what we want
nyc[m2000 & lt2010]

In [0]:
# beware if you embed the operations: & binds more tightly than >= and <,
# so this evaluates 2000 & nyc.EST.dt.year first and raises an error
nyc[nyc.EST.dt.year >= 2000 & nyc.EST.dt.year < 2010]

In [0]:
# wrap each comparison in parentheses so it is evaluated before the &
nyc[(nyc.EST.dt.year >= 2000) & (nyc.EST.dt.year < 2010)]
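
The three cases above can be reproduced on a tiny Series without the weather data. This is a sketch with made-up values (``years`` is an assumption): ``and`` fails, ``&`` combines masks element by element, and inline comparisons need parentheses.

```python
import pandas as pd

years = pd.Series([1995, 2003, 2008, 2012])
ge2000 = years >= 2000
lt2010 = years < 2010

# "and" asks for a single truth value of a whole Series, which pandas refuses
and_error = None
try:
    ge2000 and lt2010
except ValueError as err:
    and_error = str(err)

both = ge2000 & lt2010                        # element-wise AND of the two masks
inline = (years >= 2000) & (years < 2010)     # same mask, parentheses required

print(and_error)
print(years[both].tolist())                   # [2003, 2008]
```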

In [0]:
m_dec = nyc.EST.dt.month == 12
nyc[m_dec]

In [0]:
# Can use loc to filter out based on index value, also takes a boolean index
# In fact, you should use .loc instead as a matter of habit (you won't see warnings)
nyc.loc[m_dec]

In [0]:
# .loc takes two arguments: the first selects rows (here a boolean mask),
# the second selects columns by name (use : to include everything)
nyc.loc[m_dec, [x for x in nyc.columns if 'Max' in x]]

In [0]:
# loc note:
# use set_index to look rows up by label; sorting the index makes those lookups quick
nyc.set_index('Events').sort_index().head()

In [0]:
nyc.set_index('Events').sort_index().loc['Fog']
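
The two uses of ``.loc`` shown above (boolean mask plus column names, and label lookup on a sorted index) can be sketched on a toy frame. The values below are invented; only the column names echo the weather data.

```python
import pandas as pd

df = pd.DataFrame({
    "Events": ["Fog", "Rain", "Fog", ""],
    "Max_TemperatureF": [40, 55, 42, 70],
    "Min_TemperatureF": [30, 41, 33, 52],
})

# Rows from a boolean mask, columns from a list of names
warm = df["Max_TemperatureF"] > 41
max_cols = [c for c in df.columns if "Max" in c]
subset = df.loc[warm, max_cols]

# Label lookup: make Events the index, sort it, then ask for one label
by_event = df.set_index("Events").sort_index()
fog = by_event.loc["Fog"]      # every row whose index label is "Fog"

print(subset)
print(fog)
```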

In [0]:
# Can use iloc to select by position rather than label
# first argument: row positions (5:10 excludes row 10); second: column positions
nyc.iloc[5:10, [2, 5, -2]]  


In [0]:
# A bare : in the row position keeps every row
# second argument is still a list of column positions
nyc.iloc[:, [2, 5, -2]]  
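
One difference between ``.iloc`` and ``.loc`` that is easy to trip over: position slices exclude the end, label slices include it. A minimal sketch on a made-up frame whose index labels (10, 20, ...) are deliberately different from the positions:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]},
    index=[10, 20, 30, 40],
)

by_position = df.iloc[1:3, [0, -1]]       # positions 1 and 2; first and last column
by_label = df.loc[20:30, ["a", "c"]]      # labels 20 through 30, both ends included

print(by_position)
print(by_position.equals(by_label))       # True
```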


In [0]:
nyc.EST.describe()

## Filtering Assignment
Using the nino dataframe:
* Create a mask, ``m80``, that has all years >= 1980 and < 1990
* Create a mask, ``m90``, that has all years >= 1990 and < 2000
* Create a mask, ``lon120``, that has all longitudes > 120
* Create a mask, ``lat0``, that has latitudes > -2 and < 2
* Create a dataframe, ``df80``, that has only those values in ``m80`` and ``lon120`` and ``lat0``
* Create a dataframe, ``df90``, that has only those values in ``m90`` and ``lon120`` and ``lat0``


## Filtering Bonus Assignment
* Create a mask, ``m80_2``, that uses a function to filter years >= 1980 and < 1990
* Make sure that ``m80`` is created using operations
* Use the ``%%time`` *cell magic* to determine which is faster to calculate, ``m80`` or ``m80_2``

# Dealing with NaN

In [0]:
# find rows that have null data
# first create a mask
nyc.isnull().any(axis=1)

In [0]:
nyc[nyc.isnull().any(axis=1)]

In [0]:
# isnull on its own returns a DataFrame of True/False, one value per cell
nyc.isnull()

In [0]:
# Find columns with null values
nyc.isnull().any()
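
The difference between ``.any()`` and ``.any(axis=1)`` is easier to see on a frame small enough to check by eye. The sketch below uses invented data (``df`` is an assumption) and adds ``.sum()`` to count the missing values per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1.0, 2.0, 3.0],
    "c": [np.nan, np.nan, 1.0],
})

row_mask = df.isnull().any(axis=1)   # per row: is anything missing?
col_mask = df.isnull().any()         # per column: is anything missing?
counts = df.isnull().sum()           # per column: how many are missing?

print(row_mask.tolist())             # [True, True, False]
print(col_mask.to_dict())            # {'a': True, 'b': False, 'c': True}
print(counts.to_dict())              # {'a': 1, 'b': 0, 'c': 2}
```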

In [0]:
missing_df = nyc.isnull() 
nyc[missing_df.Max_TemperatureF]

In [0]:
nyc.Max_TemperatureF.fillna(nyc.Max_TemperatureF.mean()).iloc[2219:2222]

In [0]:
# The .interpolate method will do linear interpolation by default
nyc.Max_TemperatureF.interpolate().iloc[2219:2222]
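
Filling with the mean and interpolating can give quite different answers. A small sketch with made-up numbers (``s`` is an assumption): the mean uses every known value in the column, while linear interpolation only looks at the neighbours of the gap.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, 80.0])

filled = s.fillna(s.mean())    # mean of 10, 30 and 80 (NaN is skipped)
interp = s.interpolate()       # halfway between the neighbours 10 and 30

print(filled.tolist())         # [10.0, 40.0, 30.0, 80.0]
print(interp.tolist())         # [10.0, 20.0, 30.0, 80.0]
```

Neither call changes ``s`` itself; both return a new Series.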

In [0]:
# Dropping rows with missing data (returns a new DataFrame; nyc itself is unchanged)
nyc.dropna()
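
``dropna`` with no arguments drops a row if any column is missing, which can discard a lot of data. The ``how`` and ``subset`` parameters make it less aggressive. A sketch on an invented frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
})

any_missing = df.dropna()                 # drop rows with at least one NaN
all_missing = df.dropna(how="all")        # drop rows only if every value is NaN
only_a = df.dropna(subset=["a"])          # only consider column a

print(any_missing.index.tolist())         # [2]
print(all_missing.index.tolist())         # [0, 2]
print(only_a.index.tolist())              # [0, 2]
```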

## Dealing with NaN Assignment
With the nino dataset:
* Find the rows that have null data
* Find the columns that have null data
* It looks like the ``zon_winds`` column has some missing values. Use summary stats or plotting to decide how to fill in those values