# Do you need `pandas`?

* Do you need to import data into Python?


* Is your data not clean?


* Do you need to explore your datasets and gain insight from them quickly and simply?


* Do you need to preprocess your data so it is ready for subsequent analyses with statsmodels, scikit-learn, or other libraries?


* ...?

## If you answered yes to any of the previous questions and you don't know much about `pandas`, then you are in the right place...

# What is `pandas`?

`pandas` has become an essential library in the Python world for any data-intensive task.

> You can think of `pandas` as numpy arrays on steroids, that is, numpy arrays with labels for rows and columns and better support for working with heterogeneous datasets.

That is not a deep definition. The devil is in the details!
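
To make the "arrays with labels" idea concrete, here is a minimal sketch. The values and the labels (`'a'`, `'b'`, `'c'`, `speed`, `direction`) are made up for illustration and are not part of the tutorial's dataset.

```python
import numpy as np
import pandas as pd

# Plain numpy array: rows and columns are addressed by position only
values = np.array([[5.2, 180.0],
                   [7.9, 225.0],
                   [3.1, 90.0]])

# The same values wrapped in a DataFrame with hypothetical row and column labels
df = pd.DataFrame(values,
                  index=['a', 'b', 'c'],
                  columns=['speed', 'direction'])

# Positional access on the array vs. label-based access on the DataFrame
by_position = values[1, 0]
by_label = df.loc['b', 'speed']
print(by_position, by_label)
```

Both lookups return the same element; the DataFrame simply lets you ask for it by name instead of by position.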

## Interesting features:


* Input/output for many different data formats in an easy, fast and flexible way (CSV, JSON, SQL, HDF5, HTML, ...).


* Tools to deal with *missing* data (`.dropna()`, `pd.isnull()`).


* Merging and combining (`concat`, `join`, `merge`).


* Grouping (`groupby`).


* Reshaping (`stack`, `unstack`, `pivot`, `pivot_table`).


* Powerful time series handling (*resampling*, *timezones*, ...).


* Easy plotting.
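
As a quick taste of a few of the features listed above (missing data, grouping and merging), here is a small sketch. The `obs` and `meta` tables and their column names are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one missing value
obs = pd.DataFrame({'station': ['A', 'A', 'B', 'B'],
                    'speed': [4.0, np.nan, 6.0, 8.0]})

# Missing data: count the missing values, then drop the incomplete rows
n_missing = obs['speed'].isnull().sum()
clean = obs.dropna()

# Grouping: mean speed per station
mean_speed = clean.groupby('station')['speed'].mean()

# Merging: attach hypothetical station metadata on the shared 'station' column
meta = pd.DataFrame({'station': ['A', 'B'], 'height_m': [10, 80]})
merged = clean.merge(meta, on='station')

print(n_missing, mean_speed, merged, sep='\n')
```

Each of these operations is covered in more detail later in the tutorial.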

# What will we see?

* Data structures with `pandas`.


* Data I/O.


* Getting information from the data structure, statistical operations, setting indexes, working with missing data, working with dates.


* Selection and indexing of data.


* NA, NaN, missing data,...


* Combination, grouping, aggregation,...


* Plotting results.

------------------------------

# What do we need for the tutorial?

In [None]:
# First, some imports
import os
import datetime as dt
import sys

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from IPython.display import display

np.random.seed(19760812)
plt.style.use('bmh')
%matplotlib inline

In [None]:
print('Python version:')
print(sys.version)
print()
print('Pandas: ', pd.__version__)
print('Numpy: ', np.__version__)
print('Matplotlib: ', mpl.__version__)

---------------------------------------------------

# Preliminary analysis of a wind dataset

Before we start with the tutorial, let's run a short analysis of wind data to see some of the capabilities of the library.

In [None]:
# We read data from 'model.txt'
ipath = os.path.join('Datos', 'model.txt')

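# Fields are separated by runs of whitespace; the first two columns
# are combined and parsed into a 'Timestamp' index
model = pd.read_csv(ipath, sep = r"\s+", skiprows = 3,
                    parse_dates = {'Timestamp': [0, 1]}, index_col = 'Timestamp')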
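
If you do not have the data file at hand, the reading step can be tried on an in-memory text. This is only a sketch: the header lines, the column names `Date` and `Time`, and the values are assumptions, and the real layout of `model.txt` may differ. Here the date and time columns are combined explicitly with `pd.to_datetime` rather than through `parse_dates`.

```python
import io
import pandas as pd

# Hypothetical whitespace-separated text with three header lines to skip;
# the real 'model.txt' may be laid out differently
text = """header line 1
header line 2
header line 3
Date        Time   M(m/s)  D(deg)
1984-01-01  00:00  20.8    243
1984-01-01  01:00  14.9    250
1984-01-01  02:00  13.0    255
"""

# r"\s+" splits on one or more whitespace characters
raw = pd.read_csv(io.StringIO(text), sep=r"\s+", skiprows=3)

# Build the timestamps from the date and time strings
timestamp = pd.to_datetime(raw['Date'] + ' ' + raw['Time'],
                           format='%Y-%m-%d %H:%M')

# Use the timestamps as the index and keep only the measurement columns
sample = raw.drop(columns=['Date', 'Time']).set_index(timestamp.rename('Timestamp'))
print(sample)
```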

In [None]:
model.head()

In [None]:
# Scatter matrix of the first 1000 rows, columns 'M(m/s)' through 'D(deg)'
pd.plotting.scatter_matrix(model.iloc[:1000].loc[:, 'M(m/s)':'D(deg)'])

In [None]:
print(model.index[0], model.index[-1], sep = '\n')

In [None]:
model.mean()

In [None]:
model.max()

In [None]:
idx = model.loc[:, 'M(m/s)'].sort_values(ascending = False).index

In [None]:
# Scatter matrix of the 1000 rows with the highest wind speed
pd.plotting.scatter_matrix(model.loc[idx[:1000], 'M(m/s)':'D(deg)'])

In [None]:
model.loc[:, 'M(m/s)'].plot.hist(bins = np.arange(0, 35))

In [None]:
model['month'] = model.index.month
model['year'] = model.index.year

In [None]:
model.groupby(by = ['year', 'month']).mean().plot(y = 'M(m/s)', figsize = (15, 5))
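
The grouping by year and month can be checked on a tiny synthetic series. This is a sketch with made-up daily values (the `demo` frame and its `speed` column are hypothetical), not the wind data itself.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series covering January and February 2000 (60 days)
index = pd.date_range('2000-01-01', '2000-02-29', freq='D')
demo = pd.DataFrame({'speed': np.arange(len(index), dtype=float)}, index=index)

# Derive grouping keys from the DatetimeIndex
demo['year'] = demo.index.year
demo['month'] = demo.index.month

# One row per (year, month) pair, holding the mean of each remaining column
by_month = demo.groupby(by=['year', 'month']).mean()
print(by_month)
```

The result is indexed by a two-level (`year`, `month`) index, which is why the plot above shows one point per month.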

In [None]:
monthly = model.groupby(by = ['year', 'month']).mean()
monthly['ma'] = monthly.loc[:, 'M(m/s)'].rolling(5, center = True).mean()
monthly.head()
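
A centered rolling mean over 5 points needs two neighbours on each side, so the first two and last two values are left undefined. A minimal sketch on a made-up series shows this:

```python
import pandas as pd

# Hypothetical series of seven values
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])

# Window of 5 centered on each element; the edges lack a full window
ma = s.rolling(5, center=True).mean()
print(ma)
```

This explains the `NaN` entries at the start of the `ma` column in `monthly.head()`.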

In [None]:
monthly.loc[:,['M(m/s)', 'ma']].plot(figsize = (15, 6))

In [None]:
monthly.loc[:, 'M(m/s)'].reset_index().pivot(index = 'year', columns = 'month')
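
The `pivot` call turns a long table (one row per year and month) into a wide one (one row per year, one column per month). Here is a sketch with hypothetical values; unlike the cell above, it passes `values` explicitly so that the resulting columns are a plain index of months rather than a two-level one.

```python
import pandas as pd

# Hypothetical long-format table: one row per (year, month)
long = pd.DataFrame({'year': [2000, 2000, 2001, 2001],
                     'month': [1, 2, 1, 2],
                     'speed': [7.0, 8.0, 6.0, 9.0]})

# Wide format: years as rows, months as columns
wide = long.pivot(index='year', columns='month', values='speed')
print(wide)
```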

In [None]:
(monthly.loc[:, 'M(m/s)']
    .reset_index()
    .pivot(index = 'year', columns = 'month')
    .T
    .plot(figsize = (15, 10), legend = False)
)