# Discover Pandas 

Pandas is built on top of NumPy.

`pandas.DataFrame`: a multidimensional array with labelled rows and columns (in most cases you are better off sticking to 2 dimensions), typically holding heterogeneous types or missing data.

Dealing with data that is less structured, clean and complete takes up most of a data scientist's time.

In [1]:
import pandas as pd

In [2]:
pd.__version__

'0.25.3'

In [3]:
pd?

In [176]:
%%timeit
3+2

14.7 ns ± 0.386 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)


## Series

A `Series` is a one-dimensional array of indexed data.

In [6]:
pd.Series([3,2,1])

0    3
1    2
2    1
dtype: int64

The index can also be defined explicitly.

Example:

In [276]:
serie_1 = pd.Series([3,2,1], index=[93,129, 219394])

In [277]:
serie_1

93        3
129       2
219394    1
dtype: int64

In [278]:
serie_1.index

Int64Index([93, 129, 219394], dtype='int64')

A `Series` is also a dictionary-like object, except that keys may be repeated.

In [304]:
serie = pd.Series([3,2,1], index=["rené", "rené", "jean"])

In [305]:
serie

rené    3
rené    2
jean    1
dtype: int64

In [306]:
serie.values

array([3, 2, 1])

In [307]:
serie.index

Index(['rené', 'rené', 'jean'], dtype='object')

* Access by key

In [308]:
serie['rené']

rené    3
rené    2
dtype: int64
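The return type of a key lookup depends on whether the label is repeated. The sketch below rebuilds the same `Series` under the name `s` (chosen here for illustration) and compares a unique label with a repeated one.

```python
import pandas as pd

s = pd.Series([3, 2, 1], index=["rené", "rené", "jean"])

unique_hit = s["jean"]    # label appears once: a scalar comes back
repeated_hit = s["rené"]  # label appears twice: a Series comes back

print(type(unique_hit).__name__)
print(type(repeated_hit).__name__)

# Passing a list of labels yields a Series even for a unique label
always_series = s.loc[["jean"]]
print(always_series)
```

Code that must handle both cases uniformly can pass a list of labels, as in the last line.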

* Set a new key-value pair

In [309]:
serie['joseph'] = 5

* Change a value for a key

In [310]:
serie['rené'] = 4

In [311]:
serie

rené      4
rené      4
jean      1
joseph    5
dtype: int64

In [312]:
serie['rené'] = [4,3]

In [313]:
serie

rené      4
rené      3
jean      1
joseph    5
dtype: int64

* Delete a key-value pair

In [314]:
del serie["rené"]

In [315]:
serie

jean      1
joseph    5
dtype: int64

In [321]:
serie[0:4:2] # slicing: not possible with a plain dict

jean    1
dtype: int64

* Lookup (membership test on the keys)

In [324]:
print('rené' in serie)
print("jean" in serie)

False
True
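As with a `dict`, `in` tests the keys (the index), not the values. A minimal sketch, using an illustrative `Series` named `s` with the same content as `serie` at this point:

```python
import pandas as pd

s = pd.Series([1, 5], index=["jean", "joseph"])

label_found = "jean" in s      # True: "jean" is a label
value_as_label = 5 in s        # False: 5 is a value, not a label
value_found = 5 in s.values    # True: search the underlying array instead

print(label_found, value_as_label, value_found)
```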


- When the index is unique, pandas uses a hash table, just like a `dict`: O(1).
- When the index is non-unique and sorted, pandas uses binary search: O(log N).
- When the index is non-unique and not sorted, pandas needs to check all the keys, just like a list lookup: O(N).
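To see which of the three cases an index falls into, its `is_unique` and `is_monotonic_increasing` properties can be inspected. This sketch only reports the properties; it does not measure lookup time, and the three index names are made up for the example.

```python
import pandas as pd

unique_idx = pd.Index(["a", "b", "c"])
sorted_dup_idx = pd.Index(["a", "a", "b"])
unsorted_dup_idx = pd.Index(["b", "a", "b"])

for name, idx in [
    ("unique", unique_idx),
    ("non-unique, sorted", sorted_dup_idx),
    ("non-unique, unsorted", unsorted_dup_idx),
]:
    # is_monotonic_increasing is non-strict, so repeated labels still count as sorted
    print(name, idx.is_unique, idx.is_monotonic_increasing)
```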



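Passing a `dict` to the `pd.Series` constructor uses the `dict` keys as the index. In the outputs below the keys come out sorted; pandas 0.23 and later running on Python 3.6+ keep the insertion order of the `dict` instead.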

In [328]:
test = pd.Series(dict(zip(["ea","fzf","aeif"], [2,3,2])))
# with zip or using a dict
test2 = pd.Series({"ea":2, "fzf":3, "aeif":2})

In [329]:
test

aeif    2
ea      2
fzf     3
dtype: int64

In [330]:
test2

aeif    2
ea      2
fzf     3
dtype: int64
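The outputs above list the keys in sorted order. Assuming a setup where the constructor keeps the insertion order of the `dict` (pandas 0.23+ on Python 3.6+), an explicit `sort_index()` is one way to get the sorted order regardless of version:

```python
import pandas as pd

s = pd.Series({"ea": 2, "fzf": 3, "aeif": 2})

# Sort by label so the result does not depend on how the dict was ordered
s_sorted = s.sort_index()
print(list(s_sorted.index))  # ['aeif', 'ea', 'fzf']
```

The label slices used later in this section (such as `"aeif":"fzf"`) rely on this sorted order.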

## Selection in Series

In [332]:
(test>2)

aeif    False
ea      False
fzf      True
dtype: bool

In [333]:
(test<4)

aeif    True
ea      True
fzf     True
dtype: bool

In [340]:
# use "&" (element-wise "and"), not the Python keyword "and"
(test>2) & (test < 4) 

aeif    False
ea      False
fzf      True
dtype: bool

In [344]:
type((test>2) & (test < 4) )

pandas.core.series.Series

In [345]:
# mask: the previous expression, a boolean pd.Series, stored in the variable mask
mask = (test>2) & (test < 4)

In [346]:
test[mask]

fzf    3
dtype: int64
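The same rule applies to "or" and "not": the element-wise operators are `|` and `~`. The sketch below also shows what happens with the `and` keyword, which asks for a single truth value of a whole `Series` and therefore raises `ValueError`. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series({"aeif": 2, "ea": 2, "fzf": 3})

try:
    (s > 2) and (s < 4)  # the keyword needs one bool, a Series has many
    keyword_and_worked = True
except ValueError:
    keyword_and_worked = False

either = (s < 3) | (s > 2)  # element-wise "or"
not_two = ~(s == 2)         # element-wise "not"

print(keyword_and_worked)
print(s[not_two])
```

The parentheses around each comparison are needed because `&`, `|` and `~` bind more tightly than `<` and `>`.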

In [347]:
# fancy indexing
test[["ea", "fzf"]]

ea     2
fzf    3
dtype: int64

In [348]:
# explicit index slicing
test["aeif": "fzf"]

aeif    2
ea      2
fzf     3
dtype: int64

In [349]:
# implicit index slicing
test[0: 2]

aeif    2
ea      2
dtype: int64

Slicing with explicit indexes (labels) ***includes*** the final index in the slice, hence the result above.

Slicing with implicit indexes (positions) ***excludes*** the final index.

What if I defined explicit integer indexes and I want to slice? 🙄

## Using loc and iloc

In [350]:
serie2 = pd.Series({1:2, 5:3, 7:2})

In [351]:
serie2

1    2
5    3
7    2
dtype: int64

In [352]:
serie2.loc[1] # explicit index

2

In [353]:
serie2.iloc[1] # implicit index

3

In [354]:
serie2.iloc[1:2] # implicit index for slicing

5    3
dtype: int64

In [355]:
serie2.loc[1:5] # explicit index for slicing

1    2
5    3
dtype: int64

In [357]:
serie2.loc[[1,5]] # fancy indexing

1    2
5    3
dtype: int64
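With integer labels, the same slice bounds mean different things to `loc` and `iloc`. A minimal sketch on the same data (the name `s` is illustrative):

```python
import pandas as pd

s = pd.Series([2, 3, 2], index=[1, 5, 7])

by_label = s.loc[1:5]      # labels from 1 to 5, end label included
by_position = s.iloc[1:5]  # positions 1 to 4, end excluded, clipped to the length

# On a sorted index, a slice bound does not have to be an existing label
up_to_six = s.loc[1:6]

print(list(by_label.index))
print(list(by_position.index))
print(list(up_to_six.index))
```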

### Index object 

* are immutable (`df` and `df2` below are the DataFrames built in the DataFrame section)

In [386]:
df2.index[0]=18

TypeError: Index does not support mutable operations

* can be sliced or indexed (just like an array)

In [387]:
df2.index[0]

0

In [393]:
df.index[:2]

Index(['Corentin', 'Luc'], dtype='object')

* support set operations such as intersection (`&`) and symmetric difference (`^`)

In [395]:
df.index & {'Corentin', 'Yolo'}

Index(['Corentin'], dtype='object')

In [397]:
df.index ^ {'Corentin', 'Yolo'}

Index(['Luc', 'René', 'Yolo'], dtype='object')
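The operator forms above match the pandas version used in this notebook; more recent pandas releases no longer treat `&` and `^` on an `Index` as set operations. The named methods `intersection` and `symmetric_difference` express the same thing. A sketch, with a standalone `idx` standing in for `df.index`:

```python
import pandas as pd

idx = pd.Index(["Corentin", "Luc", "René"])
other = pd.Index(["Corentin", "Yolo"])

common = idx.intersection(other)
in_one_only = idx.symmetric_difference(other)

# Immutability: item assignment on an Index raises TypeError
try:
    idx[0] = "Someone"
    assignment_worked = True
except TypeError:
    assignment_worked = False

print(list(common))
print(sorted(in_one_only))
print(assignment_worked)
```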

## DataFrame

* A sequence of "aligned" Series objects (sharing the same index, like the columns of an Excel sheet)

* Each Series object is a column

* Hence a `pd.DataFrame` can be seen as a dictionary of Series objects

* Flexible row and column labels

In [366]:
serie1 = pd.Series({"Luc": 25, "Corentin":29, "René": 40})
serie2 = pd.Series({"René": "100%", "Corentin": "25%", "Luc": "20%"})

In [367]:
df = pd.DataFrame({"note": serie1, 
                   "charge_de_travail": serie2})

In [368]:
df

|          | charge_de_travail | note |
|----------|-------------------|------|
| Corentin | 25%               | 29   |
| Luc      | 20%               | 25   |
| René     | 100%              | 40   |
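"Aligned" means that rows are matched by label, not by position. When the Series do not share exactly the same labels, the resulting index is the union of the labels and the gaps are filled with NaN. A sketch with two deliberately mismatched Series (`a` and `b` are names made up for this example):

```python
import pandas as pd

a = pd.Series({"Luc": 25, "Corentin": 29})
b = pd.Series({"Corentin": "25%", "René": "100%"})

aligned = pd.DataFrame({"note": a, "charge_de_travail": b})
print(aligned)

# "René" has no note and "Luc" has no workload, so both cells are NaN
print(aligned.isna())
```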


In [369]:
df.index

Index(['Corentin', 'Luc', 'René'], dtype='object')

In [370]:
df.columns

Index(['charge_de_travail', 'note'], dtype='object')

In [371]:
df.shape

(3, 2)

`shape` is a tuple giving the number of elements along each dimension.

For a 1D array, the shape is (n,), where n is the number of elements in the array.

For a 2D array, the shape is (n, m), where n is the number of rows and m is the number of columns.

Accessing columns by key:
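A quick check of both cases, rebuilding a `Series` and a `DataFrame` with the same sizes as the ones above (values typed in by hand for the sketch):

```python
import pandas as pd

s = pd.Series([3, 2, 1])
frame = pd.DataFrame(
    {"note": [29, 25, 40], "charge_de_travail": ["25%", "20%", "100%"]},
    index=["Corentin", "Luc", "René"],
)

print(s.shape)      # (3,)
print(frame.shape)  # (3, 2)

# The tuple unpacks directly into rows and columns
n_rows, n_cols = frame.shape
```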

In [372]:
df['note'] /2

Corentin    14.5
Luc         12.5
René        20.0
Name: note, dtype: float64


Attribute notation is not advised for assignments, because methods or attributes of the same name may already exist in the DataFrame class's own namespace.

In [380]:
df.note

Corentin    29
Luc         25
René        40
Name: note, dtype: int64
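Two ways attribute notation can go wrong are sketched below, on a small hypothetical frame with a column deliberately named `shape`: reading `frame.shape` returns the built-in attribute rather than the column, and assigning to a new attribute name does not create a column (pandas emits a warning, silenced here).

```python
import warnings

import pandas as pd

frame = pd.DataFrame({"note": [29, 25], "shape": ["round", "square"]})

column_by_key = frame["shape"]  # the column
attribute = frame.shape         # the (rows, columns) tuple, not the column

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    frame.bonus = [1, 2]        # sets a plain Python attribute
created_column = "bonus" in frame.columns

frame["bonus"] = [1, 2]         # key notation does create the column
print(attribute, created_column, list(frame.columns))
```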

A `DataFrame` can also be constructed from a list of dictionaries: each dict is a row, and each key of a dict refers to a column.

In [381]:
df2 = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

In [382]:
df2

|   | a   | b | c   |
|---|-----|---|-----|
| 0 | 1.0 | 2 | NaN |
| 1 | NaN | 3 | 4.0 |


In [383]:
# cells are filled with NaN ("Not a Number") when no value is given

Indexing works the same way as for Series, but this time you have to account for the second dimension:
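Missing cells can be located with `isna()` and replaced with `fillna()`. The sketch rebuilds the same frame under the name `frame` and also shows a side effect of NaN: a column that contains one is stored as float.

```python
import pandas as pd

frame = pd.DataFrame([{"a": 1, "b": 2}, {"b": 3, "c": 4}])

missing = frame.isna()    # boolean frame, True where a value is absent
filled = frame.fillna(0)  # replacing with 0 is an arbitrary choice for this sketch

print(missing)
print(filled)
print(frame.dtypes)       # "a" and "c" are float64 because they hold NaN
```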

`df.loc_or_iloc[ dim1 = rows, dim2 = columns]`


In [411]:
df.iloc[:3, :1]

|          | charge_de_travail |
|----------|-------------------|
| Corentin | 25%               |
| Luc      | 20%               |
| René     | 100%              |


Column slicing/indexing is optional here; if you leave it out, you select rows only.

In [412]:
df.iloc[:3]

|          | charge_de_travail | note |
|----------|-------------------|------|
| Corentin | 25%               | 29   |
| Luc      | 20%               | 25   |
| René     | 100%              | 40   |


In [419]:
df.loc[["Corentin", "Luc"], :] # mixing slicing and fancy indexing

|          | charge_de_travail | note |
|----------|-------------------|------|
| Corentin | 25%               | 29   |
| Luc      | 20%               | 25   |


In [421]:
df.loc[["Corentin", "Luc"]] # without the column argument

|          | charge_de_travail | note |
|----------|-------------------|------|
| Corentin | 25%               | 29   |
| Luc      | 20%               | 25   |
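Both dimensions can be given at once, by label with `loc` or by position with `iloc`. A minimal sketch on a hand-built copy of `df` (named `frame` here so it does not clash with the notebook's variable):

```python
import pandas as pd

frame = pd.DataFrame(
    {"charge_de_travail": ["25%", "20%", "100%"], "note": [29, 25, 40]},
    index=["Corentin", "Luc", "René"],
)

one_cell = frame.loc["Luc", "note"]  # row label, column label
same_cell = frame.iloc[1, 1]         # row position, column position

# Label slice on rows (end included) combined with a list of columns
sub = frame.loc["Corentin":"Luc", ["note"]]

print(one_cell, same_cell)
print(sub)
```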


Something worth mentioning here. By default:
- indexing `df` directly is performed on its columns (1)
- selecting with a boolean condition, or with slice notation (`:`), is performed on rows (2)

(1)

In [429]:
df[["charge_de_travail"]] # indexing, defaults to columns

|          | charge_de_travail |
|----------|-------------------|
| Corentin | 25%               |
| Luc      | 20%               |
| René     | 100%              |


(2) 

In [434]:
mask = df["charge_de_travail"]=="25%" 
mask

Corentin     True
Luc         False
René        False
Name: charge_de_travail, dtype: bool

In [435]:
df[mask] # masking, on rows

|          | charge_de_travail | note |
|----------|-------------------|------|
| Corentin | 25%               | 29   |


In [438]:
df[:3] # slicing, on rows

|          | charge_de_travail | note |
|----------|-------------------|------|
| Corentin | 25%               | 29   |
| Luc      | 20%               | 25   |
| René     | 100%              | 40   |
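The three behaviours of plain `[]` can be put side by side. The sketch uses a hand-built copy of `df` named `frame`, and also shows that a row label passed to `[]` is looked up among the columns, which raises `KeyError`.

```python
import pandas as pd

frame = pd.DataFrame(
    {"charge_de_travail": ["25%", "20%", "100%"], "note": [29, 25, 40]},
    index=["Corentin", "Luc", "René"],
)

column = frame["note"]                  # a key selects a column
rows_by_mask = frame[frame["note"] > 26]  # a boolean mask selects rows
rows_by_slice = frame[:2]               # a slice selects rows

try:
    frame["Corentin"]                   # a row label is not a column name
    row_label_worked = True
except KeyError:
    row_label_worked = False

print(list(rows_by_mask.index), list(rows_by_slice.index), row_label_worked)
```

Using `loc` or `iloc` instead of plain `[]` makes the intended dimension explicit.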


## Operations on Pandas