
## Workshop: Introduction to Pandas

1. Introduction

Pandas is an open-source Python library for data analysis.
It provides simple yet powerful data structures and data-manipulation tools.

Installing pandas (run this only if pandas is not already installed)

---

```python
# Shell command for a notebook cell; in a terminal, run `pip install pandas`
!pip install pandas
```

```python
# Import the pandas library
import pandas as pd
```

2. Getting started with Pandas

### Creating a Series (one-dimensional array)
```python
s = pd.Series([1, 3, 5, 7, 9])
print(s)
```

In [1]:
import pandas as pd

In [3]:
serie = pd.Series([1,2,3,4.5,5])
serie

0    1.0
1    2.0
2    3.0
3    4.5
4    5.0
dtype: float64
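
The `float64` dtype in the output above comes from mixing integers with the float `4.5`: a Series holds a single dtype, so the integers are upcast. The short sketch below contrasts the two cases; the names `ints` and `mixed` are illustrative and do not come from the notebook.

```python
import pandas as pd

# All-integer input keeps an integer dtype
ints = pd.Series([1, 3, 5, 7, 9])

# A single float in the list upcasts every element to float64
mixed = pd.Series([1, 2, 3, 4.5, 5])

print(ints.dtype)   # an integer dtype (int64 on most platforms)
print(mixed.dtype)  # float64
```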

### Creating a DataFrame (two-dimensional table)
```python
data = {
    'Nom': ['Alice', 'Bob', 'Charlie', 'David'],
    'Âge': [25, 30, 35, 40],
    'Ville': ['Paris', 'Lyon', 'Marseille', 'Toulouse']
}
df = pd.DataFrame(data)
print(df)
```

In [4]:
data = {
    'Nom': ['Alice', 'Bob', 'Charlie', 'David'],
    'Âge': [25, 30, 35, 40],
    'Ville': ['Paris', 'Lyon', 'Marseille', 'Toulouse']
}

df = pd.DataFrame(data)
df

| | Nom | Âge | Ville |
|---|---|---|---|
| 0 | Alice | 25 | Paris |
| 1 | Bob | 30 | Lyon |
| 2 | Charlie | 35 | Marseille |
| 3 | David | 40 | Toulouse |


```python
import numpy as np  # np.random.randn is used below

dates = pd.date_range("20130101", periods=6)
dates
# DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
#                '2013-01-05', '2013-01-06'],
#               dtype='datetime64[ns]', freq='D')

# Random values: the numbers shown below will differ on each run
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
#                    A         B         C         D
# 2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
# 2013-01-02  1.212112 -0.173215  0.119209 -1.044236
# 2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
# 2013-01-04  0.721555 -0.706771 -1.039575  0.271860
# 2013-01-05 -0.424972  0.567020  0.276232 -1.087401
# 2013-01-06 -0.673690  0.113648 -1.478427  0.524988
```

In [5]:
dates = pd.date_range("20240701", periods=6)
dates

DatetimeIndex(['2024-07-01', '2024-07-02', '2024-07-03', '2024-07-04',
               '2024-07-05', '2024-07-06'],
              dtype='datetime64[ns]', freq='D')

In [7]:
# Import numpy
import numpy as np

In [9]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list("ABCD"))
df

| | A | B | C | D |
|---|---|---|---|---|
| 2024-07-01 | -0.14265 | 1.886698 | -0.136227 | -0.574382 |
| 2024-07-02 | -0.703991 | -1.374827 | 1.135306 | -0.963215 |
| 2024-07-03 | 0.393368 | 0.617033 | -0.076336 | 2.333053 |
| 2024-07-04 | 2.789227 | -0.41148 | 0.385045 | -0.269252 |
| 2024-07-05 | 1.168652 | 0.147398 | 0.646844 | -0.392141 |
| 2024-07-06 | -2.09226 | -0.803235 | -1.221998 | -0.123048 |


### Creating a DataFrame from a dictionary

```python
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
    

In [10]: df2
Out[10]: 
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

```

### Data types of the DataFrame columns
```python
df2.dtypes
```
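
The output of `dtypes` is not shown here, so the following self-contained sketch rebuilds a reduced version of the dictionary-based frame and inspects its column types. The name `small` is illustrative, and only four of the original six columns are kept.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(
    {
        "A": 1.0,  # scalar, broadcast to every row
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
    }
)

# dtypes returns a Series mapping each column name to its dtype
print(small.dtypes)
```

Each column keeps its own dtype, which is why a single DataFrame can mix floats, integers and categorical values.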

Use `DataFrame.head()` and `DataFrame.tail()` to view the top and bottom rows of the frame, respectively.

```python
In [13]: df.head()
Out[13]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

In [14]: df.tail(3)
Out[14]: 
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
```

In [13]:
df_head = df.head(2)  # 2 is the number of rows to return
df_head

| | A | B | C | D |
|---|---|---|---|---|
| 2024-07-01 | -0.14265 | 1.886698 | -0.136227 | -0.574382 |
| 2024-07-02 | -0.703991 | -1.374827 | 1.135306 | -0.963215 |


In [14]:
df_tail = df.tail(2)
df_tail

| | A | B | C | D |
|---|---|---|---|---|
| 2024-07-05 | 1.168652 | 0.147398 | 0.646844 | -0.392141 |
| 2024-07-06 | -2.09226 | -0.803235 | -1.221998 | -0.123048 |


### Displaying the index and columns of a DataFrame
```python
In [15]: df.index
Out[15]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [16]: df.columns
Out[16]: Index(['A', 'B', 'C', 'D'], dtype='object')
```

In [15]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [16]:
df.index

DatetimeIndex(['2024-07-01', '2024-07-02', '2024-07-03', '2024-07-04',
               '2024-07-05', '2024-07-06'],
              dtype='datetime64[ns]', freq='D')

### Converting a DataFrame to a NumPy representation

```python
In [17]: df.to_numpy()
Out[17]: 
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])
```

In [19]:
df_to_numpy = df.to_numpy()
df_to_numpy

array([[-0.14265008,  1.88669755, -0.13622682, -0.57438206],
       [-0.70399135, -1.37482663,  1.13530625, -0.96321482],
       [ 0.39336779,  0.61703322, -0.07633563,  2.33305264],
       [ 2.78922686, -0.41148026,  0.38504507, -0.26925195],
       [ 1.1686523 ,  0.14739811,  0.64684398, -0.39214077],
       [-2.09226013, -0.80323545, -1.22199819, -0.12304793]])

In [20]:
df_to_numpy.shape

(6, 4)
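
The conversion above is straightforward because every column is a float. A NumPy array has a single dtype for the whole array, so a frame with mixed column types falls back to the generic `object` dtype. The sketch below illustrates this with two small hypothetical frames, `numeric` and `mixed`.

```python
import pandas as pd

# Homogeneous float columns convert to a float64 array
numeric = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
arr = numeric.to_numpy()
print(arr.dtype, arr.shape)  # float64 (2, 2)

# Mixing numbers and strings forces the generic object dtype
mixed = pd.DataFrame({"A": [1.0, 2.0], "F": ["foo", "bar"]})
print(mixed.to_numpy().dtype)  # object
```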

### The describe method

```python
In [20]: df.describe()
Out[20]: 
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804
```

In [21]:
df.describe()

| | A | B | C | D |
|---|---|---|---|---|
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | 0.235391 | 0.010264 | 0.122106 | 0.001836 |
| std | 1.665012 | 1.154652 | 0.810406 | 1.178323 |
| min | -2.09226 | -1.374827 | -1.221998 | -0.963215 |
| 25% | -0.563656 | -0.705297 | -0.121254 | -0.528822 |
| 50% | 0.125359 | -0.132041 | 0.154355 | -0.330696 |
| 75% | 0.974831 | 0.499624 | 0.581394 | -0.159599 |
| max | 2.789227 | 1.886698 | 1.135306 | 2.333053 |


### Research
- Read up on the standard deviation, the median, quartiles, and related statistics.
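
As a starting point for that research, the sketch below computes these statistics directly on a small hand-picked Series (the data and the name `values` are illustrative). Note that `std()` in pandas returns the sample standard deviation (`ddof=1`) by default, and that the 50% row of `describe()` is the median.

```python
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(values.mean())          # 5.0
print(values.median())        # 4.5, the middle of the sorted values
print(values.quantile(0.25))  # 4.0, first quartile
print(values.quantile(0.75))  # 5.5, third quartile
print(values.std())           # sample standard deviation (divides by n - 1)
print(values.std(ddof=0))     # 2.0, population standard deviation (divides by n)
```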

### Transposition

```python
In [21]: df.T
Out[21]: 
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988
```

In [22]:
df_transpose = df.T
df_transpose

| | 2024-07-01 | 2024-07-02 | 2024-07-03 | 2024-07-04 | 2024-07-05 | 2024-07-06 |
|---|---|---|---|---|---|---|
| A | -0.14265 | -0.703991 | 0.393368 | 2.789227 | 1.168652 | -2.09226 |
| B | 1.886698 | -1.374827 | 0.617033 | -0.41148 | 0.147398 | -0.803235 |
| C | -0.136227 | 1.135306 | -0.076336 | 0.385045 | 0.646844 | -1.221998 |
| D | -0.574382 | -0.963215 | 2.333053 | -0.269252 | -0.392141 | -0.123048 |


- Selecting rows by label: `loc`
- Selecting rows by position: `iloc`
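
The difference between the two accessors can be sketched on a small date-indexed frame. The frame `frame` and its values are made up for illustration; only `loc` and `iloc` themselves come from the text.

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [10, 20, 30], "B": [1.5, 2.5, 3.5]},
    index=pd.date_range("2024-07-01", periods=3),
)

# loc selects by index label (here a date)
by_label = frame.loc["2024-07-02"]

# iloc selects by integer position, counting from 0
by_position = frame.iloc[1]

print(by_label.equals(by_position))  # True: both are the second row

# Both accept a row selector followed by a column selector
print(frame.loc["2024-07-03", "B"])  # 3.5
print(frame.iloc[2, 1])              # 3.5
```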

In [25]:
df_b = df["B"]
df_b

2024-07-01    1.886698
2024-07-02   -1.374827
2024-07-03    0.617033
2024-07-04   -0.411480
2024-07-05    0.147398
2024-07-06   -0.803235
Freq: D, Name: B, dtype: float64

### Referring to the documentation

>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=['cobra', 'viper', 'sidewinder'],
...                   columns=['max_speed', 'shield'])
>>> df
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8

>>> df.loc['viper']
max_speed    4
shield       5
Name: viper, dtype: int64

In [31]:
df_loc = df.loc["2024-07-01"]
df_loc

A   -0.142650
B    1.886698
C   -0.136227
D   -0.574382
Name: 2024-07-01 00:00:00, dtype: float64

### Reading a CSV file

```python
df_csv = pd.read_csv('titanic.csv')
print(df_csv.head())
```
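
`pd.read_csv` accepts a file path or any file-like object. Because `titanic.csv` is not bundled with this text, the sketch below reads from an in-memory string instead. The three rows and their values are made up for illustration and are not taken from the real dataset; only the column names mirror those used later in this workshop.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk
csv_text = """Name,Sex,Age,Fare,Survived
Passenger A,female,29,71.28,1
Passenger B,male,35,8.05,0
Passenger C,female,,7.92,1
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                # (3, 5)
print(sample["Age"].isna().sum())  # 1: the empty field is read as NaN
```

The missing age also explains why the `Age` column is read as `float64` rather than as an integer type.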

In [34]:
# Read the Iris dataset CSV file
df_iris_csv = pd.read_csv('datasets/Iris.csv')

In [35]:
df_iris_csv

| | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |


In [36]:
# Get the first 5 rows
df_iris_csv.head()

| | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [37]:
# Get the last 5 rows
df_iris_csv.tail()

| | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |


In [38]:
# Get the sepal_length column
sepal_length = df_iris_csv["sepal_length"]

In [39]:
sepal_length

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64

In [40]:
df_iris_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   class         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [41]:
# Get the 50th row (index label 49)
df_iris_csv.loc[49]

sepal_length            5.0
sepal_width             3.3
petal_length            1.4
petal_width             0.2
class           Iris-setosa
Name: 49, dtype: object

In [43]:
# Get the 50th through 60th rows (labels 49 to 59, both included)
df_iris_csv.loc[49:59]

| | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 49 | 5.0 | 3.3 | 1.4 | 0.2 | Iris-setosa |
| 50 | 7.0 | 3.2 | 4.7 | 1.4 | Iris-versicolor |
| 51 | 6.4 | 3.2 | 4.5 | 1.5 | Iris-versicolor |
| 52 | 6.9 | 3.1 | 4.9 | 1.5 | Iris-versicolor |
| 53 | 5.5 | 2.3 | 4.0 | 1.3 | Iris-versicolor |
| 54 | 6.5 | 2.8 | 4.6 | 1.5 | Iris-versicolor |
| 55 | 5.7 | 2.8 | 4.5 | 1.3 | Iris-versicolor |
| 56 | 6.3 | 3.3 | 4.7 | 1.6 | Iris-versicolor |
| 57 | 4.9 | 2.4 | 3.3 | 1.0 | Iris-versicolor |
| 58 | 6.6 | 2.9 | 4.6 | 1.3 | Iris-versicolor |
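
One detail worth knowing when slicing: `loc` slices by label and includes both endpoints, whereas `iloc` slices by position and excludes the stop, like a Python list. A minimal sketch on a hypothetical ten-row frame named `numbers`:

```python
import pandas as pd

numbers = pd.DataFrame({"value": range(100, 110)})  # index labels 0..9

# loc slices by label and includes both endpoints
print(len(numbers.loc[2:5]))   # 4 rows: labels 2, 3, 4, 5

# iloc slices by position like a Python list: the stop is excluded
print(len(numbers.iloc[2:5]))  # 3 rows: positions 2, 3, 4
```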


### Exercise
- 1. Plot sepal_length as a function of sepal_width
- 2. Plot petal_length as a function of petal_width

3. Manipulating DataFrames

```python
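# Recreate the Nom/Âge/Ville DataFrame, since `df` was reassigned above
df = pd.DataFrame({
    'Nom': ['Alice', 'Bob', 'Charlie', 'David'],
    'Âge': [25, 30, 35, 40],
    'Ville': ['Paris', 'Lyon', 'Marseille', 'Toulouse']
})

# Select a column
print(df['Nom'])

# Select a row by integer position
print(df.iloc[0])

# Select rows matching a condition
print(df[df['Âge'] > 30])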
```
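
Conditional selection works because the comparison produces a boolean Series that is then used as a row mask. The sketch below makes the mask explicit and combines two conditions; the frame is rebuilt under the illustrative name `people` so the block runs on its own.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Nom": ["Alice", "Bob", "Charlie", "David"],
        "Âge": [25, 30, 35, 40],
        "Ville": ["Paris", "Lyon", "Marseille", "Toulouse"],
    }
)

# The comparison yields a boolean Series, one value per row
mask = people["Âge"] > 30
print(mask.tolist())  # [False, False, True, True]

# Combine conditions with & (and) or | (or); each one needs parentheses
selected = people[(people["Âge"] > 25) & (people["Ville"] != "Toulouse")]
print(selected["Nom"].tolist())  # ['Bob', 'Charlie']
```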

### String manipulation

```python
df['Ville'] = df['Ville'].str.upper()
print(df)
```

### Describing the data

```python
print(df_csv.describe())
df_csv.info()  # info() prints its report itself and returns None
```

### 4. Transformation operations

```python
# Add a new column
df['Pays'] = 'France'
print(df)

# Drop a column
df = df.drop('Pays', axis=1)
print(df)
```

### Sorting data

```python
df_sorted = df.sort_values(by='Âge', ascending=False)
print(df_sorted)
```

### Grouping data

```python
# numeric_only=True skips text columns, which cannot be averaged
df_grouped = df_csv.groupby('Sex').mean(numeric_only=True)
print(df_grouped)
```
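
Since `titanic.csv` is not available here, the same grouping can be sketched on a tiny made-up frame. The name `toy` and its values are illustrative only. The `numeric_only=True` argument matters as soon as the frame contains text columns, which recent pandas versions refuse to average.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Sex": ["female", "male", "female", "male"],
        "Age": [20.0, 30.0, 40.0, 50.0],
        "Name": ["A", "B", "C", "D"],  # non-numeric column
    }
)

# Rows are split by the value of Sex, then each numeric column is averaged
means = toy.groupby("Sex").mean(numeric_only=True)

print(means.loc["female", "Age"])  # 30.0
print(means.loc["male", "Age"])    # 40.0
```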


### Merging DataFrames

```python
df1 = pd.DataFrame({'id': [1, 2, 3], 'valeur': ['A', 'B', 'C']})
df2 = pd.DataFrame({'id': [1, 2, 4], 'valeur': ['D', 'E', 'F']})
df_merged = pd.merge(df1, df2, on='id', how='inner')
print(df_merged)
```
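
The `how` argument controls which keys survive the merge, and because both frames have a `valeur` column, pandas appends suffixes to tell them apart. The sketch below reuses the same data under the illustrative names `left` and `right` and compares an inner join, an outer join, and custom suffixes.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "valeur": ["A", "B", "C"]})
right = pd.DataFrame({"id": [1, 2, 4], "valeur": ["D", "E", "F"]})

# Inner join: only ids present in both frames (1 and 2)
inner = pd.merge(left, right, on="id", how="inner")
print(inner.columns.tolist())  # ['id', 'valeur_x', 'valeur_y']

# Outer join: every id from either frame; missing cells become NaN
outer = pd.merge(left, right, on="id", how="outer")
print(sorted(outer["id"].tolist()))  # [1, 2, 3, 4]

# suffixes replaces the default _x / _y markers
named = pd.merge(left, right, on="id", suffixes=("_left", "_right"))
print(named.columns.tolist())  # ['id', 'valeur_left', 'valeur_right']
```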

5. Data visualization

### Importing the visualization libraries

```python
import matplotlib.pyplot as plt
import seaborn as sns
```

### Histogram
```python
df_csv['Age'].plot(kind='hist', bins=20)
plt.xlabel('Age')
plt.title('Age distribution')
plt.show()
```

### Scatter plot

```python
df_csv.plot(kind='scatter', x='Age', y='Fare')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.title('Age vs. fare')
plt.show()
```

6. Practical case study

```python
# Analysis of the Titanic dataset (df_csv, loaded from titanic.csv above)
# Display basic information; info() prints directly and returns None
df_csv.info()
```

```python
# Compute the survival rate by sex
survival_rate = df_csv.groupby('Sex')['Survived'].mean()
print(survival_rate)
```

```python
# Build a new DataFrame with the relevant columns
df_summary = df_csv[['Age', 'Fare', 'Survived']]
print(df_summary.head())
```

```python
# Present the results
plt.figure(figsize=(10, 6))
sns.boxplot(x='Survived', y='Age', data=df_csv)
plt.title('Passenger age by survival status')
plt.show()
```