# Pandas introduction

In [1]:
import pandas as pd
import numpy as np

Pandas introduces two core data structures:
+ Series
+ DataFrames

## Series

A Series is like a NumPy array, but with an **explicit index** that labels each value:

In [11]:
a = np.array([2.14, 1.57, 1.76, 1.88, 1.70])
a

array([ 2.14,  1.57,  1.76,  1.88,  1.7 ])

In [14]:
s = pd.Series(a)
s

0    2.14
1    1.57
2    1.76
3    1.88
4    1.70
dtype: float64

In [15]:
s.index

RangeIndex(start=0, stop=5, step=1)

In [16]:
s.index = ['Miquel', 'Bet', 'Maties', 'Maria', 'Cati']
s

Miquel    2.14
Bet       1.57
Maties    1.76
Maria     1.88
Cati      1.70
dtype: float64

In [17]:
s.index

Index(['Miquel', 'Bet', 'Maties', 'Maria', 'Cati'], dtype='object')
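One practical consequence of the explicit index is that arithmetic between two Series matches values by label rather than by position. The following is a minimal sketch; the `corrections` Series and its values are invented for illustration.

```python
import pandas as pd

heights = pd.Series([2.14, 1.57, 1.76], index=['Miquel', 'Bet', 'Maties'])

# Hypothetical corrections, listed in a different order and with no entry for 'Maties'.
corrections = pd.Series([0.01, -0.02], index=['Bet', 'Miquel'])

# Values are paired by label; a label missing from either side yields NaN.
adjusted = heights + corrections
print(adjusted)
```

Here `'Miquel'` and `'Bet'` each receive their own correction despite the different ordering, and `'Maties'` becomes `NaN` because no matching label exists in `corrections`.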

In [10]:
s[1:3]

Bet       1.57
Maties    1.76
dtype: float64

In [24]:
s.loc[['Miquel', 'Bet']]   # select by index labels

Miquel    2.14
Bet       1.57
dtype: float64
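The two selections above differ in kind: `s[1:3]` slices by position, while `.loc` selects by label. A short sketch of the difference, using the explicit `.iloc` accessor for positions (the Series is rebuilt here so the block runs on its own):

```python
import pandas as pd

# Same heights and names as in the cells above.
s = pd.Series(
    [2.14, 1.57, 1.76, 1.88, 1.70],
    index=['Miquel', 'Bet', 'Maties', 'Maria', 'Cati'],
)

by_position = s.iloc[1:3]         # positions 1 and 2; the stop position is excluded
by_label = s.loc['Bet':'Maria']   # label slice; the stop label is included

print(list(by_position.index))
print(list(by_label.index))
```

Note that a label slice includes its end point, so `by_label` has one more entry than `by_position`.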

In [30]:
s.loc[::-1]   # reverse the Series

Cati      1.70
Maria     1.88
Maties    1.76
Bet       1.57
Miquel    2.14
dtype: float64

In [35]:
s.describe()   # descriptive statistics

count    5.000000
mean     1.810000
std      0.215639
min      1.570000
25%      1.700000
50%      1.760000
75%      1.880000
max      2.140000
dtype: float64

In [34]:
s.values      # returns a NumPy array

array([ 2.14,  1.57,  1.76,  1.88,  1.7 ])
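Besides the `.values` attribute, a Series also has a `to_numpy()` method. The sketch below, on a shortened version of the Series, shows that the result is a plain NumPy array without the index, and that passing `copy=True` gives an array that can be modified without touching the Series.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.14, 1.57, 1.76], index=['Miquel', 'Bet', 'Maties'])

arr = s.to_numpy()               # plain ndarray; the labels are not carried over
print(type(arr))

arr_copy = s.to_numpy(copy=True) # independent copy of the data
arr_copy[0] = 0.0                # changing the copy leaves s untouched
print(s['Miquel'])
```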

## DataFrames

DataFrames are inspired by R's data frames. A DataFrame is a general two-dimensional, labeled, size-mutable tabular structure whose columns can each hold a different type.
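As a minimal sketch of what heterogeneously typed columns look like, a DataFrame can be built from a dictionary of columns. The `edat` and `ciutat` columns and their values are hypothetical, invented only for this illustration.

```python
import pandas as pd

# Hypothetical table: 'edat' and 'ciutat' are made-up columns for illustration.
people = pd.DataFrame(
    {
        'estatura': [2.14, 1.57, 1.76],          # floats
        'edat': [31, 27, 45],                    # integers
        'ciutat': ['Palma', 'Inca', 'Manacor'],  # strings
    },
    index=['Miquel', 'Bet', 'Maties'],
)

print(people.dtypes)   # one dtype per column
print(people.shape)    # (rows, columns)
```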

In [37]:
df = pd.DataFrame(s)
df

           0
Miquel  2.14
Bet     1.57
Maties  1.76
Maria   1.88
Cati    1.70


In [39]:
df.columns = ['estatura']
df

        estatura
Miquel      2.14
Bet         1.57
Maties      1.76
Maria       1.88
Cati        1.70


Adding a new column to `df` is a plain assignment:

In [41]:
df['enquesta'] = [0,1,1,0,1]

In [43]:
df

        estatura  enquesta
Miquel      2.14         0
Bet         1.57         1
Maties      1.76         1
Maria       1.88         0
Cati        1.70         1


In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, Miquel to Cati
Data columns (total 2 columns):
estatura    5 non-null float64
enquesta    5 non-null int64
dtypes: float64(1), int64(1)
memory usage: 280.0+ bytes


In [46]:
df.describe()

       estatura  enquesta
count  5.000000  5.000000
mean   1.810000  0.600000
std    0.215639  0.547723
min    1.570000  0.000000
25%    1.700000  0.000000
50%    1.760000  1.000000
75%    1.880000  1.000000
max    2.140000  1.000000


### Filtering the DataFrame

In [47]:
df[df.enquesta == 0]

        estatura  enquesta
Miquel      2.14         0
Maria       1.88         0


In [50]:
df[df.estatura < 1.8]

        estatura  enquesta
Bet         1.57         1
Maties      1.76         1
Cati        1.70         1
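The two filters above can also be combined. The sketch below rebuilds `df` so it runs on its own, then joins boolean masks with `&` (and) and `|` (or), and shows `DataFrame.query` as an alternative spelling of the first filter.

```python
import pandas as pd

df = pd.DataFrame(
    {'estatura': [2.14, 1.57, 1.76, 1.88, 1.70], 'enquesta': [0, 1, 1, 0, 1]},
    index=['Miquel', 'Bet', 'Maties', 'Maria', 'Cati'],
)

# Each comparison needs its own parentheses, because & and | bind
# more tightly than < or ==.
short_with_1 = df[(df['estatura'] < 1.8) & (df['enquesta'] == 1)]
tall_or_0 = df[(df['estatura'] > 2.0) | (df['enquesta'] == 0)]

# DataFrame.query expresses the first filter as a string.
same_as_first = df.query('estatura < 1.8 and enquesta == 1')

print(short_with_1)
print(tall_or_0)
```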


### Groupby operations

In [55]:
df.groupby(by='enquesta').mean()

          estatura
enquesta
0         2.010000
1         1.676667
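A group-by is not limited to a single statistic. As a sketch under the same data, `agg` can compute several summaries of one column per group in a single call:

```python
import pandas as pd

df = pd.DataFrame(
    {'estatura': [2.14, 1.57, 1.76, 1.88, 1.70], 'enquesta': [0, 1, 1, 0, 1]},
    index=['Miquel', 'Bet', 'Maties', 'Maria', 'Cati'],
)

# One row per value of 'enquesta', one column per requested statistic.
summary = df.groupby('enquesta')['estatura'].agg(['count', 'mean', 'max'])
print(summary)
```

The `mean` column reproduces the result of `df.groupby(by='enquesta').mean()` above, while `count` shows how many rows fell into each group.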


### Apply operations over axis

In [67]:
df['mod_estatura'] = df.estatura.apply(lambda x: x **2)

In [68]:
df

        estatura  enquesta  mod_estatura
Miquel      2.14         0        4.5796
Bet         1.57         1        2.4649
Maties      1.76         1        3.0976
Maria       1.88         0        3.5344
Cati        1.70         1        2.8900
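The `apply` call above works element by element on a single column. For simple arithmetic like squaring, a vectorized expression gives the same values, and `DataFrame.apply` with an `axis` argument runs a function once per column or per row. A minimal sketch, with `df` rebuilt so the block is self-contained:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'estatura': [2.14, 1.57, 1.76, 1.88, 1.70], 'enquesta': [0, 1, 1, 0, 1]},
    index=['Miquel', 'Bet', 'Maties', 'Maria', 'Cati'],
)

# Element-wise apply on a Series versus the vectorized equivalent.
squared_apply = df['estatura'].apply(lambda x: x ** 2)
squared_vector = df['estatura'] ** 2
print(np.allclose(squared_apply, squared_vector))

# axis=0 (the default) passes each column to the function as a Series;
# axis=1 would pass each row instead.
ranges = df.apply(lambda col: col.max() - col.min(), axis=0)
print(ranges)
```

The vectorized form is usually the better choice when one exists, since `apply` calls the Python function once per element.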


## Drop entries

In [69]:
df.drop('mod_estatura', axis=1)  # returns a new DataFrame without the column; df itself is unchanged

        estatura  enquesta
Miquel      2.14         0
Bet         1.57         1
Maties      1.76         1
Maria       1.88         0
Cati        1.70         1


In [70]:
df

        estatura  enquesta  mod_estatura
Miquel      2.14         0        4.5796
Bet         1.57         1        2.4649
Maties      1.76         1        3.0976
Maria       1.88         0        3.5344
Cati        1.70         1        2.8900


In [71]:
df.drop('mod_estatura', axis=1, inplace=True)  # inplace=True modifies df itself and returns None
df

        estatura  enquesta
Miquel      2.14         0
Bet         1.57         1
Maties      1.76         1
Maria       1.88         0
Cati        1.70         1
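`drop` also accepts the `columns=` and `index=` keywords, which avoid having to remember what `axis=1` means and allow dropping rows by label. A short sketch, assuming the three-column `df` from before the in-place drop:

```python
import pandas as pd

df = pd.DataFrame(
    {'estatura': [2.14, 1.57, 1.76, 1.88, 1.70], 'enquesta': [0, 1, 1, 0, 1]},
    index=['Miquel', 'Bet', 'Maties', 'Maria', 'Cati'],
)
df['mod_estatura'] = df['estatura'] ** 2

# Same effect as df.drop('mod_estatura', axis=1): a new DataFrame is returned.
without_column = df.drop(columns='mod_estatura')

# Rows are dropped by their index labels.
without_rows = df.drop(index=['Miquel', 'Cati'])

print(list(without_column.columns))
print(list(without_rows.index))
```

Assigning the result to a name, as done here, is an alternative to `inplace=True` that keeps the original `df` available.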


### Example: drop outliers

Keep only the rows whose 'estatura' value lies within one standard deviation of the column mean:

In [77]:
df[np.abs(df.estatura - df.estatura.mean()) <= (1 * df.estatura.std())]

        estatura  enquesta
Maties      1.76         1
Maria       1.88         0
Cati        1.70         1


In [78]:
df.estatura-df.estatura.mean() 

Miquel    0.33
Bet      -0.24
Maties   -0.05
Maria     0.07
Cati     -0.11
Name: estatura, dtype: float64

In [79]:
np.abs(df.estatura-df.estatura.mean())

Miquel    0.33
Bet       0.24
Maties    0.05
Maria     0.07
Cati      0.11
Name: estatura, dtype: float64