---

<center>

# **Python for Data Science**

### *Data Processing*

</center>


---

<center>

## **📖 Introduction**

</center>

---


Data preprocessing can be summarized in 4 essential operations: **filtering, joining, ordering, and grouping**.

The **DataFrame** structure has become the standard in data manipulation because in most cases, it is enough to repeat or combine these four operations.

In this exercise, you will learn how to use these 4 data preprocessing methods.

Before starting this notebook, run the following cell in order to retrieve the work done in the previous notebooks.



In [3]:
### Import ###

import pandas as pd

# Import the dataset
transactions = pd.read_csv("transactions.csv", sep=';', index_col="transaction_id")

# Remove duplicates
transactions = transactions.drop_duplicates(keep='first')

# Rename columns
new_names = {'Store_type': 'store_type',
             'Qty': 'qty',
             'Rate': 'rate',
             'Tax': 'tax'}

transactions = transactions.rename(new_names, axis=1)

### Handling Missing Values (NAs) ###

# Replace NaNs in 'prod_subcat_code' with -1
transactions['prod_subcat_code'] = transactions['prod_subcat_code'].fillna(-1).astype("int")

# Get the mode of 'store_type' (mode() returns a Series; [0] selects the most frequent value)
store_type_mode = transactions['store_type'].mode()[0]

# Replace NaNs in 'store_type' with its mode
transactions['store_type'] = transactions['store_type'].fillna(store_type_mode)

# Drop rows where 'rate', 'tax', and 'total_amt' are all missing
transactions = transactions.dropna(axis=0, how='all', subset=['rate', 'tax', 'total_amt'])

---

<center>

## **📖 Filtering a DataFrame with binary operators**

</center>

---


Filtering consists of selecting a subset of rows from a DataFrame that satisfy a condition.  
This is what we previously called *conditional indexing*, but the term *filtering* is the most commonly used in database management.  

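The logical operators `and` and `or` cannot be used to combine several conditions in a filter.  
They expect a single truth value, whereas each condition on a DataFrame produces a whole Series of booleans, which **pandas** refuses to reduce to one value.  

The operators suited to filtering with multiple conditions are the **binary operators** (Python's bitwise operators), which pandas applies element by element: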

- The 'and' operator: `&`  
- The 'or' operator: `|`  
- The 'not' operator: `~`  

These operators play the same role as the logical operators, but they are evaluated element by element on whole Series, and they take precedence over comparison operators such as `==` and `>`.  

---

### The 'and' operator: `&`

The `&` operator is used to filter a DataFrame with multiple conditions that must all be satisfied simultaneously.  

**Example:**  

Let’s consider the following DataFrame `df` containing information about apartments in Paris:

| neighborhood       | year | surface |
|--------------------|------|---------|
| 'Champs-Elysées'   | 1979 | 70      |
| 'Europe'           | 1850 | 110     |
| 'Père-Lachaise'    | 1935 | 55      |
| 'Bercy'            | 1991 | 30      |

If we want to find an apartment built in **1979** and with a surface greater than **60 m²**, we can filter the rows of `df` with the following code:

```python
# Filtering the DataFrame with the 2 previous conditions
print(df[(df['year'] == 1979) & (df['surface'] > 60)])

>>>       neighborhood   year  surface
>>> 0   Champs-Elysées  1979       70
```
Each condition must be enclosed in parentheses because `&` takes precedence over comparison operators such as `==` and `>`.  
Without the parentheses, Python first evaluates `1979 & df['surface']` and then chains the two comparisons, which raises the following error:

```python
print(df[df['year'] == 1979 & df['surface'] > 60])

>>> ValueError: The truth value of a Series is ambiguous.
>>> Use a.empty, a.bool(), a.item(), a.any() or a.all().
```

### The 'or' operator: `|`

The `|` operator is used to filter a DataFrame with multiple conditions where at least one must be satisfied.  

**Example:**  

Let’s consider the same DataFrame `df`:  

| neighborhood       | year | surface (m²) |
|--------------------|------|--------------|
| 'Champs-Elysées'   | 1979 | 70           |
| 'Europe'           | 1850 | 110          |
| 'Père-Lachaise'    | 1935 | 55           |
| 'Bercy'            | 1991 | 30           |

If we want to find an apartment built **after 1900** or located in the **Père-Lachaise** neighborhood, we can filter the rows of `df` with the following code:

```python
# Filtering the DataFrame with the 2 previous conditions
print(df[(df['year'] > 1900) | (df['neighborhood'] == 'Père-Lachaise')])

>>>     neighborhood    year  surface
>>> 0  Champs-Elysées   1979       70
>>> 2  Père-Lachaise    1935       55
>>> 3           Bercy   1991       30
```

### The 'not' operator: `~`

The `~` operator is used to filter a DataFrame on a condition whose **negation** must be satisfied.  

**Example:**  

Let’s consider the same DataFrame `df`:  

| neighborhood       | year | surface (m²) |
|--------------------|------|--------------|
| 'Champs-Elysées'   | 1979 | 70           |
| 'Europe'           | 1850 | 110          |
| 'Père-Lachaise'    | 1935 | 55           |
| 'Bercy'            | 1991 | 30           |

If we want an apartment that is **not located in the Bercy neighborhood**, we can filter `df` as follows:

```python
# Filtering the DataFrame to exclude the Bercy neighborhood
print(df[~(df['neighborhood'] == 'Bercy')])

>>>     neighborhood    year  surface
>>> 0  Champs-Elysées   1979       70
>>> 1          Europe   1850      110
>>> 2  Père-Lachaise    1935       55
```
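
The three examples above can be reproduced end to end. The following is a minimal sketch that rebuilds `df` from the table shown in this section; the variable names `built_1979_large`, `recent_or_pere_lachaise`, and `not_bercy` are chosen here for illustration only.

```python
import pandas as pd

# Rebuild the example DataFrame from the table above
df = pd.DataFrame({
    'neighborhood': ['Champs-Elysées', 'Europe', 'Père-Lachaise', 'Bercy'],
    'year': [1979, 1850, 1935, 1991],
    'surface': [70, 110, 55, 30],
})

# 'and': built in 1979 AND larger than 60 m²
built_1979_large = df[(df['year'] == 1979) & (df['surface'] > 60)]

# 'or': built after 1900 OR located in Père-Lachaise
recent_or_pere_lachaise = df[(df['year'] > 1900) | (df['neighborhood'] == 'Père-Lachaise')]

# 'not': every neighborhood except Bercy
not_bercy = df[~(df['neighborhood'] == 'Bercy')]

print(built_1979_large)
print(recent_or_pere_lachaise)
print(not_bercy)
```

For a single equality test, `df['neighborhood'] != 'Bercy'` selects the same rows as the `~` form; the `~` operator becomes more useful when negating a condition that is itself a combination of several others.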

<center>

### **🔍 Example: Filtering with conditions**

</center>

---

- (a) Display the first 5 rows of the DataFrame `transactions`.  
- (b) From `transactions`, create a DataFrame named `e_shop` containing only the transactions made in stores of type `'e-Shop'` with a total amount greater than 5000 (columns: `store_type` and `total_amt`).  
- (c) Similarly, create a DataFrame named `teleshop` containing the transactions made in stores of type `'TeleShop'` with a total amount greater than 5000.  
- (d) Which of the two store types has the highest number of transactions greater than €5000?

In [4]:
# TODO

<center>

### **🔍 Example: Handling Missing Values**

</center>

---

- (a) Import the data from the files `'customer.csv'` and `'prod_cat_info.csv'` into two DataFrames named `customer` and `prod_cat_info`, respectively.  

- (b) The columns `Gender` and `city_code` in `customer` each contain two missing values. Replace them with their mode using the `fillna` and `mode` methods.



In [5]:
# TODO

## Combining DataFrames with `concat`

The `concat` function from the **pandas** module allows you to concatenate multiple DataFrames, i.e., to stack them **vertically** or **horizontally**.  

The function signature is as follows: `pandas.concat(objs, axis=...)`  

- The `objs` parameter contains the list of DataFrames to concatenate.  
- The `axis` parameter specifies whether to concatenate **vertically** (`axis=0`) or **horizontally** (`axis=1`).  

When the row or column labels of the DataFrames do not match, the `concat` function fills the missing cells with `NaN`.
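
To make the `NaN` filling concrete, here is a small sketch with two hypothetical DataFrames, `df1` and `df2`, that share only the column `b` and the row label `1`. These names and values are invented for the illustration and are not part of the course dataset.

```python
import pandas as pd

# Two small hypothetical DataFrames with partly different columns and row labels
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=[0, 1])
df2 = pd.DataFrame({'b': [5, 6], 'c': [7, 8]}, index=[1, 2])

# Vertical concatenation: rows are stacked, columns are the union a, b, c
vertical = pd.concat([df1, df2], axis=0)

# Horizontal concatenation: columns are placed side by side,
# rows are aligned on the index labels 0, 1, 2
horizontal = pd.concat([df1, df2], axis=1)

print(vertical)
print(horizontal)
```

In both results, the cells for which one of the inputs has no value (for example column `c` for the rows coming from `df1`) are filled with `NaN`.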

<center>

### **🔍 Example: Concatenating DataFrames**

</center>

---

- (a) Split the variables (columns) of the `transactions` DataFrame into two, with half of the columns in a DataFrame named `part_1` and the other half in a DataFrame named `part_2`.  
- (b) Reconstruct `transactions` in a DataFrame named `union` by concatenating `part_1` and `part_2`.  
- (c) What happens if we concatenate `part_1` and `part_2` using the argument `axis=0`?

In [6]:
# TODO

## Merging DataFrames with the `merge` method

Two DataFrames can be merged if they have a column in common.  
This is done using the `merge` method of a DataFrame, which has the following signature:

`merge(right, on, how, ...)`

- The `right` parameter is the DataFrame to merge with the calling DataFrame.  
- The `on` parameter is the name of the columns in the DataFrames that will serve as the reference for the merge. These columns must exist in both DataFrames.  
- The `how` parameter specifies the type of join to perform for merging the DataFrames. Its values are based on SQL join syntax.  

The `how` parameter can take 4 values (`'inner'`, `'outer'`, `'left'`, `'right'`), illustrated with the following two DataFrames `Persons` and `Vehicle`:

**Persons**

| Name     | Car        |
|----------|------------|
| Lila     | Twingo     |
| Tiago    | Clio       |
| Berenice | C4 Cactus  |
| Joseph   | Twingo     |
| Kader    | Swift      |
| Romy     | Scenic     |

**Vehicle**

| Car       | Price  |
|-----------|--------|
| Twingo    | 11000  |
| Swift     | 14500  |
| C4 Cactus | 23000  |
| Clio      | 16000  |
| Prius     | 30000  |

- `'inner'`: This is the default value of `how`. An inner join returns only the rows whose values in the common columns exist in both DataFrames. It should be used with care because rows without a match are silently dropped, which can discard a lot of data. In return, an inner join introduces **no NaNs**.  

    Example: `Persons.merge(right=Vehicle, on='Car', how='inner')`  

- `'outer'`: An outer join merges all rows from both DataFrames. No row is removed. This method can generate a lot of NaNs.  

    Example: `Persons.merge(right=Vehicle, on='Car', how='outer')`  

- `'left'`: A left join returns all rows from the left DataFrame, and fills them with matching rows from the right DataFrame based on the common column.  

    Example: `Persons.merge(right=Vehicle, on='Car', how='left')`  

- `'right'`: A right join returns all rows from the right DataFrame, and fills them with matching rows from the left DataFrame based on the common column.  

    Example: `Persons.merge(right=Vehicle, on='Car', how='right')`  

Provided the original DataFrames contain no missing values, performing a left join, right join, or outer join followed by `dropna(how='any')` keeps the same rows as an inner join.
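
The four join types can be compared side by side. The sketch below rebuilds the `Persons` and `Vehicle` tables shown above (as lowercase variables `persons` and `vehicle`, a naming choice made here) and stores each result in a dictionary named `results`, which is also only an illustrative name.

```python
import pandas as pd

# Rebuild the two example tables
persons = pd.DataFrame({
    'Name': ['Lila', 'Tiago', 'Berenice', 'Joseph', 'Kader', 'Romy'],
    'Car': ['Twingo', 'Clio', 'C4 Cactus', 'Twingo', 'Swift', 'Scenic'],
})
vehicle = pd.DataFrame({
    'Car': ['Twingo', 'Swift', 'C4 Cactus', 'Clio', 'Prius'],
    'Price': [11000, 14500, 23000, 16000, 30000],
})

# Perform the same merge with each join type and report rows and NaN counts
results = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = persons.merge(right=vehicle, on='Car', how=how)
    results[how] = merged
    print(how, merged.shape, int(merged.isna().sum().sum()))

# With no NaN in the inputs, dropping incomplete rows of the outer join
# leaves as many rows as the inner join
outer_cleaned = results['outer'].dropna(how='any')
print(len(outer_cleaned) == len(results['inner']))
```

Romy's Scenic has no price and the Prius has no owner, so the inner join keeps five rows without any `NaN`, whereas the outer join keeps seven rows, two of which are incomplete.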

<center>

### **🔍 Example: Merging transactions with customer data**

</center>

---

The `customer` DataFrame contains information about clients corresponding to the `'cust_id'` column in `transactions`.  

The `'customer_Id'` column in the `customer` DataFrame will allow us to join `transactions` and `customer`.  
This will enrich the `transactions` dataset with additional information.  

- (a) Using the `rename` method and a dictionary, rename the `'customer_Id'` column in the `customer` DataFrame to `'cust_id'`.  
- (b) Using the `merge` method, perform a **left join** between `transactions` and `customer` on the `'cust_id'` column. Name the resulting DataFrame `fusion`.  
- (c) Did the merge produce any `NaN` values?  
- (d) Display the first rows of `fusion`. What are the new columns?

In [7]:
# TODO

## Resetting and Setting the Index of a DataFrame

The merge was successful and did not produce any NaNs. However, the index of the resulting DataFrame is no longer the `'transaction_id'` column and has been reset to the default index (0, 1, 2, ...).  

It is possible to **redefine the index** of a DataFrame using the `set_index` method.  

This method can take as an argument:

- The name of a column to use as the index.  
- A NumPy array or a pandas Series with the same number of rows as the calling DataFrame.  

**Example:**  

Let `df` be the following DataFrame:

| Name     | Car        |
|----------|------------|
| Lila     | Twingo     |
| Tiago    | Clio       |
| Berenice | C4 Cactus  |
| Joseph   | Twingo     |
| Kader    | Swift      |
| Romy     | Scenic     |

We can set the `'Name'` column as the new index:

```python
df = df.set_index('Name')
```

This will produce the following DataFrame, in which `Name` is now the index rather than a regular column:

| Name     | Car        |
|----------|------------|
| Lila     | Twingo     |
| Tiago    | Clio       |
| Berenice | C4 Cactus  |
| Joseph   | Twingo     |
| Kader    | Swift      |
| Romy     | Scenic     |

We can also set the index using a NumPy array, a Series, etc.

```python
import numpy as np
import pandas as pd

# New index to use
new_index = ['10000' + str(i) for i in range(6)]
print(new_index)
>>> ['100000', '100001', '100002', '100003', '100004', '100005']

# Using a NumPy array or a Series is equivalent.
# (A plain list of strings would be read as a list of column names.)
index_array = np.array(new_index)
index_series = pd.Series(new_index)

df = df.set_index(index_array)
# or, equivalently:
df = df.set_index(index_series)
```

This will produce the following DataFrame:

|       | Name     | Car        |
|-------|----------|------------|
| 100000| Lila     | Twingo     |
| 100001| Tiago    | Clio       |
| 100002| Berenice | C4 Cactus  |
| 100003| Joseph   | Twingo     |
| 100004| Kader    | Swift      |
| 100005| Romy     | Scenic     |

To return to the default numeric indexing, use the `reset_index` method of the DataFrame:

```python
df = df.reset_index()
```

The previous index is not deleted. A new column will be created containing the old index:

|       | index    |   Name     |   Car     |
|-------|----------|------------|-----------|
| 0     | 100000   | Lila       | Twingo    |
| 1     | 100001   | Tiago      | Clio      |
| 2     | 100002   | Berenice   | C4 Cactus |
| 3     | 100003   | Joseph     | Twingo    |
| 4     | 100004   | Kader      | Swift     |
| 5     | 100005   | Romy       | Scenic    |


<center>

### **🔍 Example: Restoring the index after merging**

</center>

---
The merge between `transactions` and `customer` removed the index of `transactions`.  

The index of a DataFrame can be retrieved using its `.index` attribute.  

- (a) Retrieve the index of `transactions` and use it to set the index of `fusion`.

In [8]:
# TODO

## Sorting a DataFrame: `sort_values` and `sort_index` methods

The `sort_values` method allows you to sort the rows of a DataFrame based on the values of one or more columns.  

The method signature is: `sort_values(by, ascending, ...)`

- The `by` parameter specifies the column(s) to sort by.  
- The `ascending` parameter is a boolean (`True` or `False`) that determines whether the sort is ascending or descending. By default, it is `True`.  

**Example:**  

Consider the following DataFrame `df` describing students:

| FirstName | Grade | BonusPoints |
|-----------|-------|-------------|
| 'Amelie'  | A     | 1           |
| 'Marin'   | F     | 1           |
| 'Pierre'  | A     | 2           |
| 'Zoe'     | C     | 1           |

First, we will sort by a single column, for example the `'BonusPoints'` column:

```python
# Sort the DataFrame df by the 'BonusPoints' column
df_sorted = df.sort_values(by='BonusPoints', ascending=True)
```
The result will be as follows:

| FirstName | Grade | BonusPoints |
|-----------|-------|-------------|
| 'Amelie'  | A     | 1           |
| 'Marin'   | F     | 1           |
| 'Zoe'     | C     | 1           |
| 'Pierre'  | A     | 2           |

The rows of the `df_sorted` DataFrame are thus sorted in ascending order of the `'BonusPoints'` column.  
However, if we look at the `'Grade'` column, we notice that it is not sorted alphabetically for the rows that have the same `'BonusPoints'` value.  

We can fix this by also sorting by the `'Grade'` column:

```python
# Sort the DataFrame df by 'BonusPoints', and in case of ties, by 'Grade'
df_sorted = df.sort_values(by=['BonusPoints', 'Grade'], ascending=True)
```
The result will be as follows:

| FirstName | Grade | BonusPoints |
|-----------|-------|-------------|
| 'Amelie'  | A     | 1           |
| 'Zoe'     | C     | 1           |
| 'Marin'   | F     | 1           |
| 'Pierre'  | A     | 2           |

The `sort_index` method allows you to sort a DataFrame based on its index.  
When the index is the default numeric index, this method is not very useful.  
It is therefore often combined with the `set_index` method of pandas, as we saw earlier.  

**Example:**
```python
# Set the 'Grade' column as the index of df
df = df.set_index('Grade')

# Sort the DataFrame df by its index
df = df.sort_index()
```

This produces the following DataFrame:

| Grade | FirstName | BonusPoints |
|-------|-----------|-------------|
| A     | 'Amelie'  | 1           |
| A     | 'Pierre'  | 2           |
| C     | 'Zoe'     | 1           |
| F     | 'Marin'   | 1           |
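
The sorting calls of this section can be checked on the students table. This is a minimal sketch in which the DataFrame is named `students` (rather than `df`) to keep it separate from the other examples; it also shows that `ascending` accepts one boolean per column, an option not used above.

```python
import pandas as pd

students = pd.DataFrame({
    'FirstName': ['Amelie', 'Marin', 'Pierre', 'Zoe'],
    'Grade': ['A', 'F', 'A', 'C'],
    'BonusPoints': [1, 1, 2, 1],
})

# Sort by BonusPoints, breaking ties alphabetically by Grade
by_points_then_grade = students.sort_values(by=['BonusPoints', 'Grade'], ascending=True)

# One direction per column: BonusPoints descending, Grade ascending
mixed = students.sort_values(by=['BonusPoints', 'Grade'], ascending=[False, True])

# Sort by index after promoting 'Grade' to the index
by_grade_index = students.set_index('Grade').sort_index()

print(by_points_then_grade)
print(mixed)
print(by_grade_index)
```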

Consider the following two DataFrames containing boat rental data.  

**Boats DataFrame (`boats`):**

| BoatName   | Color  | ReservationNumber | NumberOfReservations |
|------------|--------|-----------------|--------------------|
| Julia      | blue   | 2               | 34                 |
| Siren      | green  | 3               | 10                 |
| Sea Sons   | red    | 6               | 20                 |
| Hercules   | blue   | 1               | 41                 |
| Cesar      | yellow | 4               | 12                 |
| Minerva    | green  | 5               | 16                 |

**Clients DataFrame (`clients`):**

| ClientID | ClientName | ReservationID |
|----------|------------|---------------|
| 91       | Marie      | 1             |
| 154      | Anna       | 2             |
| 124      | Yann       | 3             |
| 320      | Lea        | 7             |
| 87       | Marc       | 9             |
| 22       | Yassine    | 10            |



In [9]:
# Run the following cell to create these DataFrames.
# Define the dictionaries
data_boats = {
    'BoatName': ['Julia', 'Siren', 'Sea Sons', 'Hercules', 'Cesar', 'Minerva'],
    'Color': ['blue', 'green', 'red', 'blue', 'yellow', 'green'],
    'ReservationNumber': [2, 3, 6, 1, 4, 5],
    'NumberOfReservations': [34, 10, 20, 41, 12, 16]
}

data_clients = {
    'ClientID': [91, 154, 124, 320, 87, 22],
    'ClientName': ['Marie', 'Anna', 'Yann', 'Lea', 'Marc', 'Yassine'],
    'ReservationID': [1, 2, 3, 7, 9, 10]
}

# Create the DataFrames
boats = pd.DataFrame(data_boats)
clients = pd.DataFrame(data_clients)

<center>

### **🔍 Example: Joining boat and client data**

</center>

---

We want to easily determine which client reserved the boats in the `boats` DataFrame.  
To do this, we just need to merge the DataFrames.  

- (a) Rename the `'ReservationNumber'` column in `boats` to `'ReservationID'` using the `rename` method.  
- (b) In a DataFrame named `boats_clients`, perform a **left join** between `boats` and `clients`.  
- (c) Set the `'BoatName'` column as the index of the `boats_clients` DataFrame.  
- (d) Using the `loc` indexer, which selects rows of a DataFrame by their index labels, find out who reserved the boats `'Julia'` and `'Siren'`.  
- (e) Using the `isna` method applied to the `'ClientName'` column, determine which boats have not been reserved.  
- (f) The number of times a boat has been reserved so far is given in the `'NumberOfReservations'` column.  
Using the `sort_values` method, determine the name of the client who reserved the **blue boat** with the highest number of reservations.

In [10]:
# TODO

## Grouping elements of a DataFrame: `groupby`, `agg`, and `crosstab` methods

The `groupby` method allows you to group the rows of a DataFrame that share a common value in a column.  

This method does **not** return a DataFrame.  
The object returned by `groupby` is of class `DataFrameGroupBy`.  

This class allows operations such as computing statistics (sum, mean, max, etc.) for each category of the column used for grouping.  

The general structure of a `groupby` operation is as follows:

1. Split the data (Split).  
2. Apply a function (Apply).  
3. Combine the results (Combine).  

**Example:**  

Assume that the boats in the `boats` DataFrame are all identical and have the same age.  
We want to determine if the color of a boat influences its number of reservations.  
To do this, we will calculate the average number of reservations per boat for each color:

- Split the boats by color.  
- Compute the average number of reservations (`mean`).  
- Combine the results into a DataFrame for easy comparison.  

We can use `groupby` followed by `mean` to obtain the result.  

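All common statistical methods (`count`, `mean`, `max`, etc.) can be used after `groupby`.  
They only make sense for columns with compatible types. In recent versions of pandas (2.0 and later), methods such as `mean` raise an error on non-numeric columns, so select the relevant columns first or pass `numeric_only=True`.  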
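
The boat example described above can be sketched as follows. The DataFrame is rebuilt here with only the columns needed, and the result is stored in a variable named `mean_by_color`, which is an illustrative name.

```python
import pandas as pd

boats = pd.DataFrame({
    'BoatName': ['Julia', 'Siren', 'Sea Sons', 'Hercules', 'Cesar', 'Minerva'],
    'Color': ['blue', 'green', 'red', 'blue', 'yellow', 'green'],
    'NumberOfReservations': [34, 10, 20, 41, 12, 16],
})

# Split by color, apply mean to the reservation counts, combine into a Series
mean_by_color = boats.groupby('Color')['NumberOfReservations'].mean()
print(mean_by_color)
```

Selecting `'NumberOfReservations'` before calling `mean` restricts the computation to that numeric column, so the text column `BoatName` is never averaged.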

It is also possible to specify for each column which function should be applied in the "Apply" step of a `groupby` operation.  
To do this, use the `agg` method of the `DataFrameGroupBy` object, providing a dictionary where each key is a column name and the value is the function to apply.

**Example:**  

Consider the `transactions` DataFrame:

| transaction_id | cust_id | tran_date  | prod_subcat_code | prod_cat_code | qty  | rate  | tax    | total_amt | store_type |
|----------------|---------|-----------|-----------------|---------------|------|-------|--------|-----------|------------|
| 80712190438    | 270351  | 28-02-14  | 1               | 1             | -5   | -772  | 405.3  | -4265.3   | e-Shop     |
| 29258453508    | 270384  | 27-02-14  | 5               | 3             | -5   | -1497 | 785.925| -8270.92  | e-Shop     |
| 51750724947    | 273420  | 24-02-14  | 6               | 5             | -2   | -791  | 166.11 | -1748.11  | TeleShop   |
| 93274880719    | 271509  | 24-02-14  | 11              | 6             | -3   | -1363 | 429.345| -4518.35  | e-Shop     |
| 51750724947    | 273420  | 23-02-14  | 6               | 5             | -2   | -791  | 166.11 | -1748.11  | TeleShop   |

We want, for each client (`cust_id`):

- For the `total_amt` column: minimum, maximum, and total amount spent.  
- For the `store_type` column: the number of different store types in which the client made a transaction.  

We can perform these calculations using a `groupby` operation:

1. Split transactions by client ID.  
2. For `total_amt`, compute `min`, `max`, and `sum`. For `store_type`, count the number of unique categories.  
3. Combine the results into a DataFrame.

To find the number of unique categories for `store_type`, we can use the following lambda function:

```python
import numpy as np

n_modalities = lambda store_type: len(np.unique(store_type))
```

- The lambda function must take a column as an argument and return a number.  
- The function `np.unique` determines the unique values present in a sequence.  
- The function `len` counts the number of elements in a sequence.  

Thus, this function allows us to determine the number of unique categories for the `store_type` column.  

To apply these functions in a `groupby` operation, we use a dictionary where the keys are the columns to process and the values are the functions to apply.

```python
functions_to_apply = {
    # Standard statistical methods can be specified as strings
    'total_amt': ['min', 'max', 'sum'],
    'store_type': n_modalities
}
```

This dictionary can now be used with the `agg` method:

```python
transactions.groupby('cust_id').agg(functions_to_apply)
```

This produces the following DataFrame (the result of `agg` is a regular DataFrame, not a `DataFrameGroupBy`), whose columns are grouped under `total_amt` and `store_type`:

| cust_id | total_amt: min | total_amt: max | total_amt: sum | store_type: lambda |
|---------|----------------|----------------|----------------|--------------------|
| 266783  | -5838.82       | 5838.82        | 3113.89        | 2                  |
| 266784  | 442            | 4279.66        | 5694.07        | 3                  |
| 266785  | -6828.9        | 6911.77        | 21613.8        | 3                  |
| 266788  | 1312.74        | 1927.12        | 6092.97        | 3                  |
| 266794  | -135.915       | 4610.06        | 27981.9        | 4                  |
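
The same pattern can be tried without the full dataset. The sketch below uses a hypothetical miniature table named `toy` with invented values, and replaces the lambda function with the string `'nunique'`, the built-in pandas aggregation that counts distinct values.

```python
import pandas as pd

# Hypothetical miniature version of `transactions` (values invented for the example)
toy = pd.DataFrame({
    'cust_id': [1, 1, 1, 2, 2],
    'total_amt': [100.0, -40.0, 250.0, 80.0, 20.0],
    'store_type': ['e-Shop', 'TeleShop', 'e-Shop', 'MBR', 'MBR'],
})

summary = toy.groupby('cust_id').agg({
    'total_amt': ['min', 'max', 'sum'],
    # 'nunique' counts the distinct store types for each client
    'store_type': 'nunique',
})
print(summary)
```

The columns of `summary` have two levels, so an individual statistic is accessed with a pair such as `summary[('total_amt', 'sum')]`.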
           


<center>

### **🔍 Example: Grouping by client to analyze quantities**

</center>

---

- (a) Using a `groupby` operation, determine for each client, based on the quantity of items purchased in a transaction (`qty` column):

  - The maximum quantity.  
  - The minimum quantity.  
  - The median quantity.  

  You should filter the transactions to keep only those with positive quantities.  
  To do this, you can use conditional indexing (`qty[qty > 0]`) within a lambda function.

In [11]:
# TODO

Another way to group and summarize data is to use the `crosstab` function from pandas, which, as its name suggests, is used to cross-tabulate columns of a DataFrame.  

It allows you to visualize the frequency of occurrence of pairs of categories in a DataFrame.  

**Example:**  

In the `transactions` DataFrame, we want to know which category and sub-category pairs are the most frequent (columns `prod_cat_code` and `prod_subcat_code`).  

The pandas `crosstab` function can be used as follows:

```python
column1 = transactions['prod_cat_code']
column2 = transactions['prod_subcat_code']
pd.crosstab(column1, column2)
```

This command produces the following DataFrame:

prod_subcat_code

| prod_cat_code | -1 | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  | 12  |
|---------------|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 1             | 4  | 1001| 0   | 981 | 958 | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 0   |
| 2             | 4  | 934 | 0   | 1040|1005 | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 0   |
| 3             | 11 | 0   | 0   | 0   |1020 | 950 | 0   | 0   | 966 | 976 | 945 | 0   | 0   |
| 4             | 5  | 993 | 0   | 0   | 988 | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 0   |
| 5             | 3  | 0   | 0   |1023 | 0   | 0   | 984 |1037 | 0   | 0   | 998 |1029 | 962 |
| 6             | 5  | 0   |1002 | 0   | 0   | 0   | 0   | 0   | 0   | 0   |1025 |1013 |1057 |

The cell (i, j) in the resulting DataFrame contains the number of elements in the DataFrame that have category i for the first column (`prod_cat_code`) and category j for the second column (`prod_subcat_code`).  

Thus, it is easy to determine, for example, that the dominant sub-categories of category 4 are 1 and 4.  

The `normalize` argument of `crosstab` displays proportions instead of raw counts.  
For example, `normalize=1` normalizes the table along axis 1, i.e., within each column, so that every column sums to 1:

```python
# Extract the year from the transaction date
column1 = transactions['tran_date'].apply(lambda x: int(x.split('-')[2]))
column2 = transactions['store_type']

pd.crosstab(column1,
            column2,
            normalize=1)
```

This produces the following DataFrame:

| tran_date | Flagship store | MBR     | TeleShop | e-Shop  |
|-----------|----------------|--------|----------|---------|
| 2011      | 0.291942       | 0.323173 | 0.283699 | 0.306947 |
| 2012      | 0.331792       | 0.322093 | 0.336767 | 0.322886 |
| 2013      | 0.335975       | 0.3115   | 0.332512 | 0.320194 |
| 2014      | 0.0402906      | 0.0432339| 0.0470219| 0.0499731|

This DataFrame allows us to say that 33.5975% of the transactions made in a 'Flagship store' occurred in 2013.  

Conversely, by setting `normalize=0`, we normalize the table across **rows**:

| tran_date | Flagship store | MBR     | TeleShop | e-Shop  |
|-----------|----------------|--------|----------|---------|
| 2011      | 0.191121       | 0.21548  | 0.182617 | 0.410781 |
| 2012      | 0.20096        | 0.198693 | 0.20056  | 0.399787 |
| 2013      | 0.205522       | 0.194074 | 0.2      | 0.400404 |
| 2014      | 0.173132       | 0.189215 | 0.198675 | 0.438978 |

Row-wise normalization allows us to deduce that transactions made in an 'e-Shop' account for 41.0781% of the transactions in 2011.  
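
The difference between the two normalizations is easier to verify on a tiny table. The following sketch uses a hypothetical DataFrame `toy` with eight invented transactions; the variable names `counts`, `by_column`, and `by_row` are illustrative.

```python
import pandas as pd

# Hypothetical miniature dataset: eight transactions over two years
toy = pd.DataFrame({
    'year': [2011, 2011, 2011, 2012, 2012, 2012, 2012, 2012],
    'store_type': ['e-Shop', 'e-Shop', 'MBR', 'e-Shop', 'MBR', 'MBR', 'MBR', 'e-Shop'],
})

# Raw counts of each (year, store_type) pair
counts = pd.crosstab(toy['year'], toy['store_type'])

# normalize=1: each column sums to 1 (share of each year within a store type)
by_column = pd.crosstab(toy['year'], toy['store_type'], normalize=1)

# normalize=0: each row sums to 1 (share of each store type within a year)
by_row = pd.crosstab(toy['year'], toy['store_type'], normalize=0)

print(counts)
print(by_column)
print(by_row)
```

Here three of the four `MBR` transactions took place in 2012, so `by_column` gives 0.75 for that cell, while `by_row` gives 0.6 because `MBR` accounts for three of the five transactions made in 2012.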

In the file `covid_tests.csv`, we have a dataset of 200 COVID-19 tests. The columns in this dataset are:

- `patient_id`: ID of the tested patient.  
- `test_result`: Result of the detection test. 1 if the patient tested positive, 0 otherwise.  
- `infected`: 1 if the patient was actually infected, 0 otherwise.


<center>

### **🔍 Example: COVID-19 Test Analysis**

</center>

---

- (a) Load the dataset from the file `covid_tests.csv`. The separator is `;`.  
- (b) Using the `pd.crosstab` function, determine the number of **False Negatives** produced by this test.  
A false negative occurs when the test indicates that a patient is not infected, but they actually are.  
- (c) What is the **False Positive rate** of the test?  
The false positive rate corresponds to the proportion of false positives among all healthy individuals.  
You will need to normalize the results to compute this.

In [12]:
# TODO

---

<center>

## **📖 Conclusion and Summary**

</center>

---

In this notebook, you have learned how to:

- **Filter rows of a DataFrame** with multiple conditions using the binary operators `&`, `|`, and `~`:

  ```python
  # Year equals 1979 and surface greater than 60
  df[(df['year'] == 1979) & (df['surface'] > 60)]

  # Year greater than 1900 or neighborhood equals 'Père-Lachaise'
  df[(df['year'] > 1900) | (df['neighborhood'] == 'Père-Lachaise')]
  ```

- **Merge DataFrames** using the `concat` function and the `merge` method:

  ```python
  # Vertical concatenation
  pd.concat([df1, df2], axis=0)

  # Horizontal concatenation
  pd.concat([df1, df2], axis=1)

  # Different types of joins
  df1.merge(right=df2, on='column', how='inner')
  df1.merge(right=df2, on='column', how='outer')
  df1.merge(right=df2, on='column', how='left')
  df1.merge(right=df2, on='column', how='right')
  ```

- **Sort and order the values of a DataFrame** using the methods `sort_values` and `sort_index`:

  ```python
  # Sort a DataFrame by the 'column' in ascending order
  df.sort_values(by='column', ascending=True)
  ```

- **Perform a complex `groupby` operation** using lambda functions along with the `groupby` and `agg` methods:

  ```python
  functions_to_apply = {
      'column1': ['min', 'max'],
      'column2': ['mean', 'std'],
      'column3': lambda x: x.max() - x.min()
  }

  df.groupby('column_to_group_by').agg(functions_to_apply)
  ```