# Pandas introduction

>Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

https://pandas.pydata.org/

In [3]:
import pandas as pd  # pd is the conventional alias for pandas
import numpy as np

Pandas introduces two nice new objects:
+ Series
+ DataFrames

## Series

Series are like numpy arrays but with an **explicit index**:

In [None]:
a = np.array([2.14, 1.57, 1.76, 1.88, 1.70])
a

In [None]:
s = pd.Series(a)
s

In [None]:
s.index  # get the index of our series with this attribute

In [None]:
s.index = ['Miquel', 'Bet', 'Maties', 'Maria', 'Cati']
s

In [None]:
s.index
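
The index can also be supplied when the Series is created instead of being assigned afterwards. A minimal sketch reusing the values above; the variable names `heights` and `heights_from_dict` are only for illustration:

```python
import pandas as pd

# Pass the labels through the `index` argument at construction time.
heights = pd.Series(
    [2.14, 1.57, 1.76, 1.88, 1.70],
    index=['Miquel', 'Bet', 'Maties', 'Maria', 'Cati'],
)

# A dict also works: keys become the index, values become the data.
heights_from_dict = pd.Series({'Miquel': 2.14, 'Bet': 1.57})

print(heights.index.tolist())
print(heights_from_dict['Bet'])
```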

Like lists and numpy arrays, Series can be sliced with the common `[ ]` operator:

In [None]:
s[1:3]

To look up a specific index label, use the `loc` accessor:

In [None]:
s.loc['Maria']  # select by label

In [None]:
s.loc[['Miquel', 'Bet']]   # select several labels with a list

In [None]:
s.loc[::-1]   # reverse the Series

Or select by implicit (positional) index with `iloc`, which always takes an integer from zero to `len(series) - 1`, like a numpy array:

In [None]:
s.iloc[2]  # positional access; plain s[2] on a label-indexed Series is deprecated in recent pandas
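
One difference between the two accessors is easy to miss: label slices with `loc` include both endpoints, while positional slices with `iloc` exclude the stop position, as in plain Python. A small sketch, rebuilding the same Series so that it runs on its own:

```python
import pandas as pd

s = pd.Series(
    [2.14, 1.57, 1.76, 1.88, 1.70],
    index=['Miquel', 'Bet', 'Maties', 'Maria', 'Cati'],
)

by_label = s.loc['Bet':'Maria']   # label slice: both endpoints are included
by_position = s.iloc[1:3]         # positional slice: the stop position is excluded

print(by_label.index.tolist())     # ['Bet', 'Maties', 'Maria']
print(by_position.index.tolist())  # ['Bet', 'Maties']
```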

Pandas offers a `describe` method to obtain **descriptive statistics** quickly and easily:

In [None]:
s.describe()   # descriptive statistics

We can also retrieve the underlying numpy array with `.values`:

In [None]:
s.values      # returns a Numpy array
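
The pandas documentation recommends the `to_numpy()` method over `.values` for the same purpose. A short sketch on a shortened copy of the Series, showing that both return the same array here:

```python
import numpy as np
import pandas as pd

s = pd.Series([2.14, 1.57, 1.76], index=['Miquel', 'Bet', 'Maties'])

arr = s.to_numpy()  # for this float Series, the same data as s.values

print(type(arr))
print(np.array_equal(arr, s.values))
```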

## Dataframes

DataFrames are inspired by R's data frame, a tabular data structure.
A DataFrame is a general 2D labeled, size-mutable tabular structure with heterogeneously typed columns.

In [None]:
df = pd.DataFrame(s)
df

Now the columns have names. Let's inspect and then rename the column we have:

In [None]:
df.columns

In [None]:
df.columns = ['estatura']
df
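
Assigning to `df.columns` replaces every column name at once. As an alternative sketch, `rename` changes only the columns listed in a mapping and returns a new DataFrame. A DataFrame built from an unnamed Series gets the default column label `0`, which is the key used here:

```python
import pandas as pd

df = pd.DataFrame(pd.Series([2.14, 1.57], index=['Miquel', 'Bet']))

# Map old column label -> new column label; columns not listed keep their name.
renamed = df.rename(columns={0: 'estatura'})

print(df.columns.tolist())       # the original is left untouched
print(renamed.columns.tolist())
```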

Define a new column of `df` by simple assignment:

In [None]:
df['enquesta'] = [0,1,1,0,1]  # the list length must match the number of rows

In [None]:
df

Another useful method is `.info()`:

In [None]:
df.info()  # get dataframe's info

In [None]:
df.describe()

### Missing values capabilities

In [None]:
df.iloc[2,1] = np.nan  # insert a nan
df

In [None]:
np.isnan(df)  # numpy functions also work on an all-numeric DataFrame

In [None]:
df.isna()

In [None]:
df.dropna(axis=0)
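
Besides dropping rows, missing values can be counted or filled in. The sketch below rebuilds a DataFrame like the one above; filling with `0` is an arbitrary choice made only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'estatura': [2.14, 1.57, 1.76, 1.88, 1.70],
     'enquesta': [0, 1, np.nan, 0, 1]},
    index=['Miquel', 'Bet', 'Maties', 'Maria', 'Cati'],
)

n_missing = df.isna().sum()               # number of missing values per column
filled = df.fillna({'enquesta': 0})       # fill NaN in one column only (new DataFrame)
dropped = df.dropna(subset=['enquesta'])  # drop rows where 'enquesta' is missing

print(n_missing)
print(filled)
print(dropped)
```

None of these calls modifies `df`; each returns a new object.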

### Filtering the DataFrame

Filtering is done with boolean indexing. For example, select the people who answered 0 in the survey:

In [None]:
df.enquesta == 0

Use this boolean Series (or array, or list) as an indexer of the DataFrame to get the rows where it is `True`:

In [None]:
df[df.enquesta == 0]

In [None]:
df[df.estatura < 1.8]
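
Several conditions can be combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. A minimal sketch on a rebuilt copy of the data:

```python
import pandas as pd

df = pd.DataFrame(
    {'estatura': [2.14, 1.57, 1.76, 1.88, 1.70],
     'enquesta': [0, 1, 1, 0, 1]},
    index=['Miquel', 'Bet', 'Maties', 'Maria', 'Cati'],
)

# People shorter than 1.8 who answered 1 in the survey.
mask = (df['estatura'] < 1.8) & (df['enquesta'] == 1)
selected = df[mask]

# The same filter written as a query string.
selected_q = df.query('estatura < 1.8 and enquesta == 1')

print(selected)
```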

### Groupby operations

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/03.08-split-apply-combine.png">

_image source:_ Jake VanderPlas, author of [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)

In [None]:
df.groupby(by='enquesta').mean()
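
The split-apply-combine idea in the figure can be made more explicit by selecting one column and asking for several aggregations at once. A sketch on a rebuilt copy of the data:

```python
import pandas as pd

df = pd.DataFrame(
    {'estatura': [2.14, 1.57, 1.76, 1.88, 1.70],
     'enquesta': [0, 1, 1, 0, 1]},
    index=['Miquel', 'Bet', 'Maties', 'Maria', 'Cati'],
)

# Split by survey answer, apply mean and count to 'estatura', combine into one table.
by_answer = df.groupby('enquesta')['estatura'].agg(['mean', 'count'])

print(by_answer)
```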

### Apply operations over an axis

In [None]:
df['estatura_feet'] = df.estatura * 3.2808  # convert metres to feet

In [None]:
df

or transform the index labels with `map` (an `Index` has no `apply` method):

In [None]:
df.index.map(lambda x: x.lower())

In [None]:
map_res = {1:'Yes', 0:'No'}

df.enquesta = df.enquesta.map(map_res, na_action='ignore')
df
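
For operations that do not reduce to column arithmetic, `DataFrame.apply` calls a function once per column (`axis=0`) or once per row (`axis=1`). A minimal sketch; the 1.8 threshold and the `'tall'`/`'short'` labels are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {'estatura': [2.14, 1.57, 1.76, 1.88, 1.70],
     'enquesta': [0, 1, 1, 0, 1]},
    index=['Miquel', 'Bet', 'Maties', 'Maria', 'Cati'],
)

# axis=0: the function receives each column as a Series.
col_range = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series.
size_label = df.apply(
    lambda row: 'tall' if row['estatura'] > 1.8 else 'short', axis=1
)

print(col_range)
print(size_label)
```

Row-wise `apply` runs a Python function per row, so vectorised column expressions are usually faster when they are available.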

## Drop entries

In [None]:
df.drop('estatura_feet', axis=1)  # returns a new DataFrame without the column; df itself is unchanged

In [None]:
df

In [None]:
df.drop('estatura_feet', axis=1, inplace=True)  # inplace=True modifies df itself and returns None
df

### Example: drop outliers

Keep only the rows whose value in the column 'estatura' lies within one standard deviation of the column mean:

In [None]:
df[np.abs(df.estatura - df.estatura.mean()) <= df.estatura.std()]

In [None]:
df.estatura-df.estatura.mean() 

In [None]:
np.abs(df.estatura-df.estatura.mean())
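
The same filter can be written with a z-score, which makes the threshold explicit. A sketch on a rebuilt copy of the column; note that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default:

```python
import pandas as pd

df = pd.DataFrame(
    {'estatura': [2.14, 1.57, 1.76, 1.88, 1.70]},
    index=['Miquel', 'Bet', 'Maties', 'Maria', 'Cati'],
)

# Number of standard deviations each value lies from the mean.
z = (df['estatura'] - df['estatura'].mean()) / df['estatura'].std()

inliers = df[z.abs() <= 1]   # rows within one standard deviation
outliers = df[z.abs() > 1]   # everything else

print(inliers)
```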

## Other operations for reshaping and pivoting

Visual intuition: https://jalammar.github.io/visualizing-pandas-pivoting-and-reshaping/

Reshaping and pivoting the table is one of the most useful and fun operations!


In [5]:
df = pd.DataFrame({
    'foo':['one', 'one', 'one', 'two', 'two', 'two'],
    'bar':['A', 'B', 'C', 'A', 'B', 'C'],
    'baz': np.arange(1,7),
    'zoo': 'x y z q w t'.split()
})

df

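|   | bar | baz | foo | zoo |
|---|-----|-----|-----|-----|
| 0 | A   | 1   | one | x   |
| 1 | B   | 2   | one | y   |
| 2 | C   | 3   | one | z   |
| 3 | A   | 4   | two | q   |
| 4 | B   | 5   | two | w   |
| 5 | C   | 6   | two | t   |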


In [6]:
df.pivot(index='foo', columns='bar', values='baz')

| foo \ bar | A | B | C |
|-----------|---|---|---|
| one       | 1 | 2 | 3 |
| two       | 4 | 5 | 6 |
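
Pivoting can be undone with `melt`, which turns the wide table back into long form. When the index/column pairs are not unique, `pivot` raises an error and `pivot_table` with an aggregation function is the usual alternative. A sketch on the same data; the choice of `'sum'` as the aggregation is arbitrary:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
    'baz': np.arange(1, 7),
    'zoo': 'x y z q w t'.split(),
})

# Long -> wide: one row per 'foo', one column per 'bar'.
wide = df.pivot(index='foo', columns='bar', values='baz')

# Wide -> long again: one row per (foo, bar) pair.
long = wide.reset_index().melt(id_vars='foo', var_name='bar', value_name='baz')

# pivot_table aggregates duplicates; with unique pairs it matches pivot.
table = df.pivot_table(index='foo', columns='bar', values='baz', aggfunc='sum')

print(wide)
print(long)
```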


### Limits of Pandas

+ Steep learning curve
+ Hundreds of methods. Which one is correct and efficient?
+ Columns with string values are heavy
+ Pandas can need 5 to 10 times the dataset's size in memory.
+ With datasets around 10 GB pandas starts to fail (depends on the computer)
+ No integration with big data backends

### Some solutions

+ Use the `pd.Categorical` type for string data columns
+ Use [Dask](https://dask.pydata.org/en/latest/) for big datasets (parallel computing)
+ Wait for [Pandas 2](https://pandas-dev.github.io/pandas2/), which had not even been announced at the time of writing. Beta release maybe in 2 to 3 years?
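
To see the effect of the first suggestion, the sketch below compares the memory footprint of a repetitive string column before and after converting it to a categorical. The data is made up for illustration, and the exact byte counts depend on the pandas version:

```python
import pandas as pd

# Hypothetical data: few distinct strings repeated many times.
cities = pd.Series(['Palma', 'Inca', 'Manacor'] * 10_000)

as_strings = cities.memory_usage(deep=True)
as_category = cities.astype('category').memory_usage(deep=True)

print(as_strings, as_category)
```

A categorical stores each distinct string once plus a small integer code per row, so the saving grows with the amount of repetition.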