# CHAPTER 5
---
# Getting Started with pandas

In [1]:
import pandas as pd
import numpy as np

## Introduction to pandas Data Structures <font color='green'>[Essential]</font>
To get started with pandas, you will need to get comfortable with its two workhorse
data structures: Series and DataFrame. While they are not a universal solution for every
problem, they provide a solid, easy-to-use basis for most applications.

### Series <font color='green'>[Essential]</font> <font color='green'>[Beginner]</font>
A Series is a one-dimensional array-like object containing an array of data (of any
NumPy data type) and an associated array of data labels, called its index. The simplest
Series is formed from only an array of data:

A Series is an array tied to an index (the index is itself an array).

In [2]:
# the simplest Series is built from a plain array of data
obj = pd.Series([4, 7, -5, 3])

In [3]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [4]:
obj.values

array([ 4,  7, -5,  3], dtype=int64)

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

You can choose the index by passing it as an argument.

In [6]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [7]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

Use the index to select one or more values.

In [8]:
obj2['a']

-5

In [9]:
obj2['d'] = 6
obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

The link between each index label and its value is preserved when you filter the values or apply mathematical operations to them.

In [10]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64
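
Filtering behaves the same way: a boolean mask keeps each surviving value attached to its label. The sketch below rebuilds `obj2` as it stands after the assignment above so that it runs on its own; the name `positive` is introduced here for illustration only.

```python
import pandas as pd

# obj2 as it stands after obj2['d'] = 6
obj2 = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# keep only the strictly positive values; each label travels with its value
positive = obj2[obj2 > 0]
print(positive)
```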

Series can be built from dictionaries.

In [11]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [12]:
obj3.to_dict()

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

### DataFrame <font color='green'>[Essential]</font> <font color='green'>[Beginner]</font>

- A DataFrame represents a table.
- It is comparable to an array whose rows and columns are both indexed.
- It is somewhat like a 'Series of Series'.

In [13]:
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9]
}
frame = pd.DataFrame(data)

In [14]:
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


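- The Series (columns) of a DataFrame can be accessed like the values of a dictionary.
- Column access has historically returned a view on the Series rather than a copy; with Copy-on-Write (the default from pandas 3.0), the result behaves like a copy.
- All the Series of a DataFrame share the same index.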

In [15]:
frame['state']  # access a column

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

In [16]:
frame.loc[2]  # access a row

state    Ohio
year     2002
pop       3.6
Name: 2, dtype: object
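
As a small check that the columns of a DataFrame share its row index, the sketch below compares the index of one column with the index of the frame. The two-row `frame` and the names `state` and `same_index` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'year': [2000, 2001]})

# each column is a Series that carries the DataFrame's row index
state = frame['state']
same_index = state.index.equals(frame.index)
print(type(state).__name__, same_index)
```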

In [17]:
frame['debt'] = 10
frame['emp'] = [1, 2, 3, 4, 5]
frame

    state  year  pop  debt  emp
0    Ohio  2000  1.5    10    1
1    Ohio  2001  1.7    10    2
2    Ohio  2002  3.6    10    3
3  Nevada  2001  2.4    10    4
4  Nevada  2002  2.9    10    5


In [18]:
# the values of a DataFrame are stored in a two-dimensional array
frame.values

array([['Ohio', 2000, 1.5, 10, 1],
       ['Ohio', 2001, 1.7, 10, 2],
       ['Ohio', 2002, 3.6, 10, 3],
       ['Nevada', 2001, 2.4, 10, 4],
       ['Nevada', 2002, 2.9, 10, 5]], dtype=object)

### Index Objects <font color="#D22328">[Advanced]</font>

In [19]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [20]:
index[1:]

Index(['b', 'c'], dtype='object')

In [21]:
if False:
    # Index objects are immutable: this assignment would raise a TypeError
    index[1] = 0
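
To see the immutability in action without stopping the notebook, the assignment can be wrapped in a `try`/`except`. This is only a sketch; the `mutated` flag is a hypothetical name used to record the outcome.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

try:
    index[1] = 'z'  # item assignment is not supported on an Index
    mutated = True
except TypeError as exc:
    mutated = False
    print('TypeError:', exc)
```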

## Essential Functionalities

### Dropping entries from an axis <font color='green'>[Essential]</font>

In [22]:
frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)), 
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California']
)
frame

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [23]:
frame.drop('California', axis=1)

   Ohio  Texas
a     0      1
c     3      4
d     6      7


In [24]:
frame.drop('a')

   Ohio  Texas  California
c     3      4           5
d     6      7           8
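
`drop` also accepts the `index=` and `columns=` keywords, which name the axis explicitly and allow several labels at once. A minimal sketch on the same data (the name `trimmed` is an assumption made for this example):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California']
)

# remove one row and two columns in a single call
trimmed = frame.drop(index=['a'], columns=['Ohio', 'Texas'])
print(trimmed)

# drop returns a new object; the original frame keeps its shape
print(frame.shape)
```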


### Indexing, selection, and filtering <font color='green'>[Essential]</font> <font color='green'>[Beginner]</font>

In [25]:
obj = pd.Series([2, 3, 5, 9], index=['a', 'b', 'c', 'd'])
obj

a    2
b    3
c    5
d    9
dtype: int64

In [26]:
# indexing
obj.iloc[[1, 3]]

b    3
d    9
dtype: int64

In [27]:
obj < 6

a     True
b     True
c     True
d    False
dtype: bool

In [28]:
# filtering
obj.loc[obj < 6]

a    2
b    3
c    5
dtype: int64

In [29]:
# slicing
obj.loc['b':'c']

b    3
c    5
dtype: int64
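
Note that slicing by label with `loc` includes the end label, whereas slicing by position with `iloc` excludes the end position, as ordinary Python slices do. A short sketch of the difference (`by_label` and `by_position` are illustrative names):

```python
import pandas as pd

obj = pd.Series([2, 3, 5, 9], index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']    # label slice: 'c' is included
by_position = obj.iloc[1:2]    # positional slice: position 2 is excluded
print(len(by_label), len(by_position))
```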

In [30]:
obj.loc['a':'c'] = 4

In [31]:
obj

a    4
b    4
c    4
d    9
dtype: int64

In [32]:
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four']
)

In [33]:
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [34]:
data.iloc[:2]

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7


In [35]:
data['three'] > 5

Ohio        False
Colorado     True
Utah         True
New York     True
Name: three, dtype: bool

In [36]:
data.loc[data['three'] > 5]

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


### Function application and mapping <font color='green'>[Essential]</font> <font color='green'>[Beginner]</font>

In [37]:
frame = pd.DataFrame(
    np.random.randn(3, 3), 
    columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas']
)

In [38]:
frame

              b         d         e
Utah  -1.782182 -0.475069  0.876777
Ohio   0.170154  0.120826  1.229603
Texas -2.255348 -1.536344 -0.323870


NumPy functions (from the `np` library) can be applied directly to a Series or a DataFrame.

In [39]:
np.abs(frame)

              b         d         e
Utah   1.782182  0.475069  0.876777
Ohio   0.170154  0.120826  1.229603
Texas  2.255348  1.536344  0.323870


In [40]:
def series_min(series):
    return series.min()  # returns a single value

# calling the function on each column (a Series) of the DataFrame produces a new Series
frame.apply(series_min)

b   -2.255348
d   -1.536344
e   -0.323870
dtype: float64

In [41]:
# with axis=1 the function is called on each row instead, again producing a new Series
frame.apply(series_min, axis=1)

Utah    -1.782182
Ohio     0.120826
Texas   -2.255348
dtype: float64

In [42]:
def fois_deux(series):
    return series * 2  # returns a Series

# when the function returns a Series for each column, the result is a new DataFrame
frame.apply(fois_deux)

              b         d         e
Utah  -3.564364 -0.950137  1.753554
Ohio   0.340309  0.241653  2.459207
Texas -4.510696 -3.072689 -0.647741


In [43]:
def plus_un(valeur):
    return 'valeur: %.2f' % (valeur + 1)

frame.applymap(plus_un)  # the function is applied to every cell

                   b              d             e
Utah   valeur: -0.78   valeur: 0.52  valeur: 1.88
Ohio    valeur: 1.17   valeur: 1.12  valeur: 2.23
Texas  valeur: -1.26  valeur: -0.54  valeur: 0.68
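
Recent pandas releases (2.1 and later) deprecate `DataFrame.applymap` in favor of `DataFrame.map`. The sketch below picks whichever method is available so the same cell runs on older and newer versions; the small `frame` and the `elementwise` helper name are assumptions made for this example.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.0, -2.5], 'd': [0.25, 3.0]}, index=['Utah', 'Ohio'])

def plus_un(valeur):
    return 'valeur: %.2f' % (valeur + 1)

# use DataFrame.map when it exists, fall back to applymap on older pandas
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(plus_un)
print(formatted)
```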


### Sorting and ranking <font color='green'>[Essential]</font>

In [44]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

In [45]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [46]:
frame = pd.DataFrame(
    np.arange(8).reshape((2, 4)), 
    index=['three', 'one'],
    columns=['d', 'a', 'b', 'c']
)

In [47]:
frame

       d  a  b  c
three  0  1  2  3
one    4  5  6  7


In [48]:
# sorting can be done along the column axis or the row axis
sorted_frame = frame.sort_index(axis=1)
sorted_frame

       a  b  c  d
three  1  2  3  0
one    5  6  7  4


In [49]:
transposed = sorted_frame.T
transposed

   three  one
a      1    5
b      2    6
c      3    7
d      0    4


In [50]:
transposed.sort_values(by='one')

   three  one
d      0    4
a      1    5
b      2    6
c      3    7


In [51]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 5, 4])
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    5
7    4
dtype: int64

In [53]:
obj.rank()  # rank gives the rank of each value; ties receive the average of their ranks, hence the fractions

0    7.5
1    1.0
2    7.5
3    4.5
4    3.0
5    2.0
6    6.0
7    4.5
dtype: float64
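
Averaging is only the default way of handling ties. As a sketch, the `method` argument of `rank` selects other tie-breaking rules, for example `'min'` (lowest rank of the tied group) or `'first'` (order of appearance). The names `lowest` and `first` are illustrative.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 5, 4])

# 'min' gives every tied value the lowest rank of its group
lowest = obj.rank(method='min')
# 'first' breaks ties by order of appearance in the data
first = obj.rank(method='first')

print(lowest.tolist())
print(first.tolist())
```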

### Arithmetic and data alignment <font color='#D22328'>[Advanced]</font>

In [54]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

In [55]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [56]:
s1 + s2  # only labels present in both objects get a value; the others become NaN

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
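
When the NaN results are unwanted, the `add` method with `fill_value` treats a label missing on one side as that value before adding. A minimal sketch on the same two Series (the name `total` is an assumption):

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# a label missing from one Series is treated as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)
```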

In [57]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

In [58]:
df1

     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0


In [59]:
df2

      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0


In [60]:
df1 + df2  # only cells defined in both objects get a value; the others become NaN

      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN


In [61]:
df1.add(df2, fill_value=0)

      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0


In [62]:
# operations between a DataFrame and a Series
df1 - df1.iloc[0]

     a    b    c    d
0  0.0  0.0  0.0  0.0
1  4.0  4.0  4.0  4.0
2  8.0  8.0  8.0  8.0
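
The subtraction above matches the Series labels against the column labels and repeats the operation for every row. To match on the row index instead, for example to subtract one column from all the others, the arithmetic methods accept an `axis` argument. A sketch, with `by_column` as an illustrative name:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

# align the Series on the row index and subtract it from every column
by_column = df1.sub(df1['a'], axis='index')
print(by_column)
```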


### Reindexing <font color='#D22328'>[Advanced]</font>

In [63]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [64]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
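
If NaN is not the desired placeholder for labels that did not exist before, `reindex` accepts a `fill_value`. A minimal sketch (the name `filled_obj` is introduced here for illustration):

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# the new label 'e' receives 0.0 instead of NaN
filled_obj = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)
print(filled_obj)
```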

In [65]:
frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)), 
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California']
)
frame

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [66]:
frame.reindex(columns=['Texas', 'California', 'France'])  # a new object is returned; frame itself is unchanged

   Texas  California  France
a      1           2     NaN
c      4           5     NaN
d      7           8     NaN


### Axis indexes with duplicate values <font color='#D22328'>[Advanced]</font>

In [67]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [68]:
obj['c']  # a single value

4

In [69]:
obj['a']  # a Series!

a    0
a    1
dtype: int64
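
Because the return type depends on whether the label is duplicated, it can help to check the index first or to select with a list, which returns a Series in both cases. A short sketch (`always_series` is an illustrative name):

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# is_unique reports whether any label appears more than once
print(obj.index.is_unique)

# selecting with a list returns a Series even for a unique label
always_series = obj.loc[['c']]
print(always_series)
```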

## Summarizing and Computing Descriptive Statistics <font color='green'>[Essential]</font>  <font color='#D22328'>[Advanced]</font>
pandas objects are equipped with a set of common mathematical and statistical methods.
Most of these fall into the category of reductions or summary statistics, methods
that extract a single value (like the sum or mean) from a Series or a Series of values from
the rows or columns of a DataFrame. Compared with the equivalent methods of vanilla
NumPy arrays, they are all built from the ground up to exclude missing data.

In [70]:
df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two']
)  # NaN ("not a number") is used to mark missing data

In [71]:
df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


In [72]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [73]:
df.cumsum()  # cumulative sum

    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8


In [74]:
df.idxmax()  # index label of the maximum value

one    b
two    d
dtype: object

In [75]:
df.sum(axis=1)  # the sum can be taken along either axis

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
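
The row `c` sums to 0.0 because missing values are skipped by default. As a sketch, passing `skipna=False` makes any NaN propagate into the result instead (the name `strict` is an assumption for this example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two']
)

# with skipna=False, a row containing any NaN sums to NaN
strict = df.sum(axis=1, skipna=False)
print(strict)
```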

In [76]:
df.describe()

            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000


In [77]:
df['one'].describe()

count    3.000000
mean     3.083333
std      3.493685
min      0.750000
25%      1.075000
50%      1.400000
75%      4.250000
max      7.100000
Name: one, dtype: float64

In [78]:
strings = df.astype(str)  # change the type of the values in df
strings['one'].describe()  # describe behaves differently on non-numeric data

count        4
unique       4
top       0.75
freq         1
Name: one, dtype: object

### Correlation and Covariance <font color='#D22328'>[Advanced]</font>

In [79]:
filled = df.fillna(0)  # replace the NaN values with 0
filled

    one  two
a  1.40  0.0
b  7.10 -4.5
c  0.00  0.0
d  0.75 -1.3


In [80]:
filled.corr()

         one      two
one  1.00000 -0.94454
two -0.94454  1.00000
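
Filling with 0 is a modeling choice that affects the result. As a rough comparison, the sketch below computes the correlation on the zero-filled data and on the original data, where pandas only uses the rows in which both columns are present (here just `b` and `d`). The names `zero_filled` and `pairwise` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two']
)

# correlation after replacing missing values with 0 (four data points)
zero_filled = df.fillna(0)['one'].corr(df.fillna(0)['two'])

# correlation on the original data: rows with a NaN in either column are ignored
pairwise = df['one'].corr(df['two'])

print(zero_filled, pairwise)
```

With only two complete rows, the second figure is not a meaningful estimate; the point is simply that the two approaches answer different questions.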


### Unique Values, Value Counts, and Membership <font color='#D22328'>[Advanced]</font>

In [81]:
filled['one'].unique()

array([1.4 , 7.1 , 0.  , 0.75])

In [82]:
filled['one'].nunique()

4
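
The section title also mentions value counts and membership. A minimal sketch of `value_counts` and `isin`, using a small Series of letters invented for this example (the names `s`, `counts` and `mask` are assumptions):

```python
import pandas as pd

s = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# value_counts returns how many times each distinct value occurs
counts = s.value_counts()
print(counts)

# isin builds a boolean mask that is True where the value belongs to the given set
mask = s.isin(['b', 'c'])
print(s[mask])
```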

## Handling Missing Data <font color='green'>[Essential]</font> <font color='green'>[Beginner]</font>

In [83]:
df.isnull()

     one    two
a  False   True
b  False  False
c   True   True
d  False  False


In [84]:
df.dropna(subset=['two'])

    one  two
b  7.10 -4.5
d  0.75 -1.3
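
`dropna` has a few other modes worth knowing. As a sketch on the same data: with no arguments it drops every row containing at least one NaN, while `how='all'` drops only rows that are entirely NaN. The names `complete_rows` and `not_empty` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two']
)

# default: keep only rows without any missing value
complete_rows = df.dropna()

# how='all': drop a row only when every value in it is missing
not_empty = df.dropna(how='all')

print(list(complete_rows.index), list(not_empty.index))
```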


## Hierarchical Indexing <font color='#D22328'>[Advanced]</font>

In [85]:
data = pd.Series(
    np.random.randn(10), 
    index=[
        ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]
    ]
)

In [86]:
data

a  1    0.266755
   2   -0.153763
   3   -1.402006
b  1   -1.444795
   2    1.802626
   3   -0.871870
c  1   -0.413098
   2   -0.893916
d  2    0.369963
   3   -0.051655
dtype: float64

In [87]:
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 2),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

In [103]:
data['a']  # a Series

1    0.266755
2   -0.153763
3   -1.402006
dtype: float64

In [106]:
# the Series can be filtered on each level of its index
# the syntax follows from that of one-dimensional indexes
data.loc['a', [1, 2]]

a  1    0.266755
   2   -0.153763
dtype: float64
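
Selecting on the inner level alone is also possible. The sketch below uses `xs` with a `level` argument; it replaces the random values with `np.arange(10.)` so the result is reproducible, and the name `inner` is an assumption.

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(10.),
    index=[
        ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]
    ]
)

# take every entry whose inner-level label is 2; the outer level remains as the index
inner = data.xs(2, level=1)
print(inner)
```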

In [107]:
unstacked = data.unstack()

In [108]:
unstacked

          1         2         3
a  0.266755 -0.153763 -1.402006
b -1.444795  1.802626 -0.871870
c -0.413098 -0.893916       NaN
d       NaN  0.369963 -0.051655


In [109]:
unstacked[4] = 7

In [110]:
unstacked

          1         2         3  4
a  0.266755 -0.153763 -1.402006  7
b -1.444795  1.802626 -0.871870  7
c -0.413098 -0.893916       NaN  7
d       NaN  0.369963 -0.051655  7


In [111]:
stacked = unstacked.stack()
stacked

a  1    0.266755
   2   -0.153763
   3   -1.402006
   4    7.000000
b  1   -1.444795
   2    1.802626
   3   -0.871870
   4    7.000000
c  1   -0.413098
   2   -0.893916
   4    7.000000
d  2    0.369963
   3   -0.051655
   4    7.000000
dtype: float64

In [112]:
# DataFrames can be built with hierarchical indexes on both axes
frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']]
)

In [113]:
frame

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11


In [97]:
# the levels of each axis can be named to make them easier to work with
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


In [98]:
frame['Ohio']

color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10


### Reordering and Sorting Levels <font color='#D22328'>[Advanced]</font>

In [99]:
frame.swaplevel('key1', 'key2')

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11


In [100]:
frame.swaplevel('key1', 'key2').sort_index()

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
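
Swapping levels is not required in order to sort by an inner level. As a sketch, `sort_index` accepts a `level` argument and leaves the level order untouched. The example uses simplified, single-level columns (`x`, `y`, `z`), which are an assumption made to keep it short, as is the name `by_key2`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=['x', 'y', 'z']
)
frame.index.names = ['key1', 'key2']

# sort the rows by the inner level; key1 stays the outer level
by_key2 = frame.sort_index(level='key2')
print(by_key2)
```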


### Using a DataFrame’s Columns <font color='#D22328'>[Advanced]</font>

In [101]:
frame = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': [0, 1, 2, 0, 1, 2, 3]
})

In [102]:
frame.set_index(['c', 'd'])

       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
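
The operation can be reversed, and the columns can also be kept while being used as the index. A minimal sketch, assuming the same `frame`; the names `indexed`, `restored` and `kept` are introduced for illustration.

```python
import pandas as pd

frame = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': [0, 1, 2, 0, 1, 2, 3]
})

indexed = frame.set_index(['c', 'd'])

# reset_index moves the index levels back into ordinary columns
restored = indexed.reset_index()

# drop=False keeps 'c' and 'd' as columns in addition to using them as the index
kept = frame.set_index(['c', 'd'], drop=False)

print(list(restored.columns), list(kept.columns))
```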
