---

<center>

# **Python for Data Science**

### *Data Cleaning: Handling Missing Data (NA) and Cleaning Datasets*

</center>


---

<center>

## **📖 Introduction**

</center>

---


Data cleaning and the proper handling of missing values (also called **NaN** or **NA**) are two essential steps before performing any analysis on a dataset.  

The objective of this notebook is to go step by step through these cleaning operations in order to obtain a **clean and reliable DataFrame**.  
Indeed, real-world datasets often contain issues such as missing values, duplicates, or inconsistent entries.  

For this course, we will continue working with the **`transactions` DataFrame** that we imported in the previous exercise.


<center>

### **🔍 Example: Loading and Inspecting the Dataset**

</center>

---

- (a) Import the `pandas` module as `pd` and load the file **`transactions.csv`** into a DataFrame named `transactions`.  
The file uses commas (`,`) as separators, and the column containing the identifiers is `'transaction_id'`.  

- (b) Display the first 10 rows of the DataFrame using the `.head()` method.

In [1]:
# TODO

---

<center>

## **📖 Cleaning a Dataset**

</center>

---


In this section, we introduce the main **DataFrame methods** that are useful for cleaning a dataset.  
These methods can be grouped into three main categories:

1. **Handling Duplicates**  
   - `duplicated` → detects duplicate rows.  
   - `drop_duplicates` → removes duplicate rows.  

2. **Modifying Elements in a DataFrame**  
   - `replace` → replaces specific values.  
   - `rename` → renames columns or indexes.  
   - `astype` → changes the data type of columns.  

3. **Operations on DataFrame Values**  
   - `apply` → applies a function to rows or columns.  
   - `lambda` → allows writing small anonymous functions for transformations.  


## Handling Duplicates (methods `duplicated` and `drop_duplicates`)

Duplicates are identical rows that appear multiple times in a dataset.  

👉 When working with new data, it’s very important to **check for duplicates early on**.  
Duplicates can distort statistical calculations and plots, because the same observation is counted more than once.  

---

📊 Example DataFrame:

| Name   | Age | Gender | Height |
|--------|-----|--------|--------|
| Robert | 56  | M      | 174    |
| Mark   | 23  | M      | 182    |
| Alina  | 32  | F      | 169    |
| Mark   | 23  | M      | 182    |

---

✅ To check for duplicates, we use the **`duplicated`** method:

```python
import pandas as pd

# Example DataFrame
df = pd.DataFrame({
    "Name": ["Robert", "Mark", "Alina", "Mark"],
    "Age": [56, 23, 32, 23],
    "Gender": ["M", "M", "F", "M"],
    "Height": [174, 182, 169, 182]
})

# Check for duplicates
df.duplicated()

# Output:
# 0    False
# 1    False
# 2    False
# 3     True
# dtype: bool
```

## 📌 Understanding the `duplicated()` method

The **`duplicated()`** method returns a **Pandas Series** (similar to a column of a DataFrame).  
It tells us for each row whether it is a duplicate (`True`) or not (`False`).  

👉 In our example, the result of `duplicated()` indicates that the row with index **3** is a duplicate,  
meaning it is an exact copy of a previous row (in this case, the row with index **1**).

---

Since `duplicated()` returns a **Series**, we can apply the **`.sum()`** method to count the total number of duplicates.

```python
# Identify duplicates
print(df.duplicated())

# Count total number of duplicates
print("Number of duplicates:", df.duplicated().sum())
# Output: Number of duplicates: 1
```

## 🧹 Removing duplicates with `drop_duplicates()`

The DataFrame method used to **remove duplicates** is `drop_duplicates`.

Its syntax is as follows:

```python
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
```
- **subset** : column label or sequence of labels  
  → Allows you to specify which columns should be checked for duplicates.  
  → By default, all columns are considered.  

- **keep** : {'first', 'last', False}, default `'first'`  
  → `'first'`: keeps the first occurrence and removes the others.  
  → `'last'`: keeps the last occurrence and removes the others.  
  → `False`: removes *every* row that has a duplicate, including the first occurrence.  

- **inplace** : bool, default `False`  
  → If `True`, modifies the DataFrame directly without returning a new one.  
  → If `False`, returns a new DataFrame with duplicates removed.

⚠️ **Warning** Be very careful when using the `inplace` parameter.  

A **good practice** is to **avoid** using `inplace=True` and instead assign the DataFrame returned by the method to a new variable.  
This way, you won’t accidentally overwrite your original data and you’ll keep better control over your transformations.
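
As a minimal sketch of how `subset` and `keep` interact, the example below uses a small made-up `orders` DataFrame. The column names `customer` and `amount` are illustrative assumptions and are not part of the `transactions` dataset.

```python
import pandas as pd

# Toy data (hypothetical): some customers appear several times
orders = pd.DataFrame({
    "customer": ["A", "B", "A", "C", "B"],
    "amount":   [10, 20, 10, 30, 25],
})

# Default: a row is a duplicate only if ALL columns match an earlier row
no_exact_dupes = orders.drop_duplicates()

# subset + keep='last': keep only the last row seen for each customer
last_per_customer = orders.drop_duplicates(subset="customer", keep="last")

# keep=False: drop every row whose customer appears more than once
unique_customers = orders.drop_duplicates(subset="customer", keep=False)

print(no_exact_dupes)
print(last_per_customer)
print(unique_customers)

# The original DataFrame is untouched because inplace defaults to False
print(len(orders))
```

Each call returns a new DataFrame, so the three results can be compared side by side without re-creating `orders`.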

<center>

### **🔍 Example: Detecting and Removing Duplicates**

</center>

---

- (a) How many duplicates are there in the `transactions` DataFrame?  
- (b) Remove duplicates from the dataset while keeping only the **first occurrence**.  
- (c) Using the parameters `subset` and `keep` of the method `drop_duplicates` on `transactions`, display the most recent transaction for each `prod_cat_code`.

In [2]:
# TODO

## Modifying DataFrame Elements (methods `replace`, `rename`, and `astype`)

The `replace` method allows you to substitute one or multiple values in a DataFrame column.

Its signature is:

```python
replace(to_replace, value, ...)
```

- `to_replace`: the value or list of values to be replaced.  
  → Can be integers, strings, booleans, etc.

- `value`: the replacement value or list of values.  
  → Can also be integers, strings, booleans, etc.

💡 This method is very useful when you need to clean or standardize categorical variables in your dataset.

**df**

|   Name   |  Country  | Age |
|----------|-----------|-----|
| 'Brown'  | Australia | 33  |
| 'Dupont' | France    | 25  |
| 'Anna'   | Japan     | 54  |

**df_new**

|   Name   |  Country  | Age |
|----------|-----------|-----|
| 'Brown'  | AUS       | 33  |
| 'Dupont' | FRA       | 25  |
| 'Anna'   | JPN       | 54  |

```python
df_new = df.replace(to_replace=['Australia','France','Japan'], value=['AUS','FRA','JPN'])
```
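
The call above searches every column of the DataFrame. As a hedged alternative sketch, `replace` also accepts a dictionary mapping old values to new ones, and it can be called on a single column so that other columns are left untouched. The variable name `country_codes` is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Brown", "Dupont", "Anna"],
    "Country": ["Australia", "France", "Japan"],
    "Age": [33, 25, 54],
})

# Mapping from old value to new value; values not listed are left unchanged
country_codes = {"Australia": "AUS", "France": "FRA", "Japan": "JPN"}

# Work on a copy and restrict the replacement to the 'Country' column
df_new = df.copy()
df_new["Country"] = df_new["Country"].replace(country_codes)

print(df_new)
```

Restricting the replacement to one column avoids accidentally changing an identical value that happens to appear elsewhere in the DataFrame.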

## Renaming Columns in a DataFrame

In addition to modifying the elements of a DataFrame, you can also rename its columns.

This is done using the `rename` method, which takes a dictionary as an argument: the keys are the old column names and the values are the new column names.

To indicate that you are renaming columns and not rows, either pass the dictionary through the `columns=` keyword or pass it as the first argument together with `axis=1`.

```python
# Example DataFrame
import pandas as pd

df = pd.DataFrame({
    'Name': ['Brown', 'Dupont', 'Anna'],
    'Country': ['Australia', 'France', 'Japan'],
    'Age': [33, 25, 54]
})

# Renaming columns
df_renamed = df.rename(columns={'Name': 'Full_Name', 'Country': 'Nation', 'Age': 'Years'})

df_renamed
```

## Changing Column Types with astype

Sometimes, it is necessary to change not only the name of a column but also its type.

For example, when importing a dataset, a variable might be interpreted as a string (`str`) while it is actually numeric. This can happen if even a single entry is misread.

In pandas, you can change column types using the `astype` method.

Common types you will use:

- `str`: String (`'Hello'`)
- `float`: Floating-point number (`1.0`, `3.1415`)
- `int`: Integer (`1`, `1234`)

`astype` can take a dictionary where the keys are column names and the values are the new types. This is convenient when changing multiple columns at once.

Most of the time, however, you will select a single column and overwrite it with its converted version. Both approaches are shown here:

```python
# Method 1: Create a dictionary and apply astype to the entire DataFrame
type_dict = {'col_1': 'int',
             'col_2': 'float'}
df = df.astype(type_dict)

# Method 2: Select a single column and apply astype to the Series
df['col_1'] = df['col_1'].astype('int')
```

✅ Explanation:

- Method 1 is useful when you want to change the type of multiple columns at once.
- Method 2 is handy when you need to change the type of a single column.

Both methods ensure that the column(s) have the correct type for calculations or further data manipulation.
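
The snippet above relies on placeholder columns. As a small self-contained sketch, the example below builds a hypothetical DataFrame whose numbers were read as strings (the column names `col_1` and `col_2` simply mirror the placeholders) and converts them with `astype`.

```python
import pandas as pd

# Hypothetical import result: numeric values were read as strings
df = pd.DataFrame({
    "col_1": ["1", "2", "3"],
    "col_2": ["1.5", "2.0", "3.25"],
})
print(df.dtypes)  # both columns hold text at this point

# Convert both columns at once with a dictionary
df = df.astype({"col_1": "int", "col_2": "float"})
print(df.dtypes)  # both columns are now numeric

# Arithmetic now behaves as expected
total_col_1 = df["col_1"].sum()
total_col_2 = df["col_2"].sum()
print(total_col_1, total_col_2)
```

Before the conversion, summing `col_1` would have concatenated the strings instead of adding the numbers, which is why checking `dtypes` after an import is worthwhile.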

In [None]:
import pandas as pd
transactions = pd.read_csv("transactions.csv", sep =',', index_col = "transaction_id")

# Remove duplicates
transactions = transactions.drop_duplicates(keep = 'first')

<center>

### **🔍 Example: Cleaning and Modifying Columns**

</center>

---

- (a) Import the `numpy` module as `np`.

- (b) Replace the values `['e-Shop', 'TeleShop', 'MBR', 'Flagship store', np.nan]` in the column `Store_type` with `[1, 2, 3, 4, 0]`. At the same time, replace any missing values (`np.nan`) in the column `prod_subcat_code` with `0`.

- (c) Convert the columns `Store_type` and `prod_subcat_code` to type `int`.

- (d) Rename the columns `Store_type`, `Qty`, `Rate`, and `Tax` to `store_type`, `qty`, `rate`, and `tax`.

In [3]:
# TODO

## Operations on DataFrame values (apply method and lambda functions)

It is often useful to modify or aggregate the information contained in the columns of a DataFrame using an operation or a function.

These operations can be any function that takes a column (or a row) as input.  

The method used to apply such a function is the **apply** method of a DataFrame, whose signature is:

```python
apply(func, axis, ...)
```
### Where:

- **func** is the function to apply on the column.  
- **axis** specifies the dimension on which the operation should be applied.  

### Example: `apply` with `np.sum`

Suppose we want to compute the **sum of all rows** for each numerical column.  
The `sum` function from NumPy performs this operation, so it can be passed directly to the `apply` method.  

Since the function must run **down the rows** of each column, we specify the argument `axis=0` in the `apply` method. The result contains one value per column.

```python
import numpy as np
import pandas as pd

# Example DataFrame
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [10, 20, 30],
    "C": [100, 200, 300]
})

# Apply np.sum to each column (axis=0)
df_columns = df.apply(np.sum, axis=0)
df_columns
```
Result:

| Column | Sum |
|--------|-----|
| A      | 6   |
| B      | 60  |
| C      | 600 |

With `axis=1`, the function is applied **across the columns** of each row instead, so the result contains one value per row:

```python
import numpy as np
import pandas as pd

# Example DataFrame
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [10, 20, 30],
    "C": [100, 200, 300]
})

# Apply np.sum to each row (axis=1)
df_rows = df.apply(np.sum, axis=1)
df_rows
```
Result:

| Row index | Sum |
|-----------|-----|
| 0         | 111 |
| 1         | 222 |
| 2         | 333 |

The column tran_date in the transactions DataFrame contains the transaction dates in the format day/month/year (e.g., '28/02/2014').

Currently, these dates are stored as strings, which means we cannot directly perform calculations or statistical operations on them.

👉 A better approach would be to split this information into three separate columns: day, month, and year.
This would allow us, for instance, to analyze seasonal trends or detect changes in customer behavior over time.

For example, the date string '28/02/2014' is separated by the / character:

```python
date = '28/02/2014'
date.split('/')
>>> ['28', '02', '2014']
```

The `split` method returns a list containing the parts of the string separated by the chosen character.

In this list:
- The day is the first element (index `0`)
- The month is the second element (index `1`)
- The year is the third element (index `2`)

<center>

### **🔍 Example: Splitting Dates into Day, Month, and Year**

</center>

---
- (a) Define a function `get_day` that takes a string as input and returns the first element after splitting it on `'/'`.

- (b) Define the functions `get_month` and `get_year` that return the second and third elements of the split respectively.

- (c) Store the results of applying these functions to the `tran_date` column in three variables: `days`, `months`, and `years`. Since these functions work element by element, you do not need to specify the `axis` argument in the `apply` method.

- (d) Create the columns `'day'`, `'month'`, and `'year'` in the DataFrame and assign them the values of `days`, `months`, and `years`. A new column can be created simply by assigning values to it.

In [4]:
# TODO

The apply method becomes even more powerful when combined with a lambda function.

In Python, the keyword lambda is used to define an anonymous function — that is, a function without a name.

A lambda function can take any number of arguments, but it must contain only one expression.

The syntax is:
```python
lambda arguments: expression
```
Lambda functions allow us to define operations with a very compact syntax.

They are particularly useful when the operation is simple and we don’t want to define a separate function with `def`.

```python
# Standard function
def square(x):
    return x**2

# Equivalent with lambda
square_lambda = lambda x: x**2

print(square(4))        # Output: 16
print(square_lambda(4)) # Output: 16
```
Thus, the previous exercise (extracting the day, month, and year from `tran_date`) can be written in a much more compact way using lambda functions inside `apply`.

```python
# Extract day, month, year using lambda inside apply
transactions['day'] = transactions['tran_date'].apply(lambda x: x.split('/')[0])
transactions['month'] = transactions['tran_date'].apply(lambda x: x.split('/')[1])
transactions['year'] = transactions['tran_date'].apply(lambda x: x.split('/')[2])

# Display the updated DataFrame
transactions[['tran_date', 'day', 'month', 'year']].head()
```
---
The column `prod_subcat_code` in transactions depends on the column `prod_cat_code` since it represents a subcategory of a product.

It would make more sense to combine both category and subcategory into a single variable.

Steps:
- Convert both columns into strings using the method astype(str).
- Concatenate them to create a unique code representing both the category and subcategory.

📌 Example with string concatenation:

```python
string1 = "I think"
string2 = "therefore I am."

# Concatenate the two strings with a space
print(string1 + " " + string2)
# >>> I think therefore I am.
```

To apply a function row by row, you must set `axis=1` inside the `apply` method.

Inside the function itself, each row is passed as a Series, so the value of a column can be accessed by its label, like a key in a dictionary.

👉 Example: computing the unit price of a product:

```python
transactions.apply(lambda row: row['total_amt'] / row['qty'], axis=1)
```
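
To make the row-wise call concrete, here is a minimal sketch on a made-up `sales` DataFrame. The data are hypothetical, and the column names `qty` and `total_amt` are borrowed from `transactions` only for readability.

```python
import pandas as pd

# Toy data (hypothetical values)
sales = pd.DataFrame({
    "qty": [2, 5, 1],
    "total_amt": [30.0, 50.0, 12.5],
})

# axis=1: the lambda receives one row (a Series) at a time
sales["unit_price"] = sales.apply(
    lambda row: row["total_amt"] / row["qty"], axis=1
)

# Same result with plain column arithmetic, usually much faster on large data
vectorized = sales["total_amt"] / sales["qty"]

print(sales)
```

For simple arithmetic between columns, the vectorized form is generally preferable. Row-wise `apply` is most useful when the logic cannot be expressed as column operations.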

<center>

### **🔍 Example: Combining Category and Subcategory Codes**

</center>

---

- (a) Using a **lambda function** applied on the **transactions** DataFrame, create a new column prod_cat containing the concatenation of `prod_cat_code` and `prod_subcat_code` separated by a hyphen `'-'`.
Make sure to **convert both values** to strings before concatenating.

In [5]:
# TODO

---

<center>

## **📖 Handling Missing Values**

</center>

---


A **missing value** can be either:  

- A value that was not provided.  
- A value that does not exist, often resulting from mathematical operations with no defined result (e.g., `0/0`).  

In a DataFrame, missing values appear as **NaN** ("Not a Number").  
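
As a brief illustration of why dedicated methods are needed to detect `NaN`, the sketch below (using a made-up Series `s`) shows an undefined operation producing `NaN` and shows that `NaN` never compares equal to anything, including itself.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, np.nan])

# 0/0 is undefined -> NaN; 1/1 -> 1.0; NaN/NaN -> NaN
ratio = s / s
print(ratio)

# NaN is not equal to anything, not even to itself
print(np.nan == np.nan)           # False
print((ratio == np.nan).sum())    # 0: an equality test finds nothing

# isna is the reliable way to detect missing values
print(ratio.isna())
```

Because an equality test against `np.nan` always fails, filtering with `==` silently misses every missing value.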

In this section, we will explore several methods to:  

- **Detect missing values** using `isna` and `any`.  
- **Replace missing values** using `fillna`.  
- **Remove missing values** using `dropna`.  

In a previous exercise, we used the `replace` method on `transactions` to replace missing values with `0`.  
This approach is **not rigorous** and should generally be avoided in practice.  

For this reason, we will re-import the raw version of the `transactions` DataFrame to undo the transformations we applied in the previous exercises.

Run the following cell to **re-import** the `transactions` dataset, **remove duplicates**, and **rename columns**:

In [6]:
# Import the dataset
transactions = pd.read_csv("transactions.csv", sep=',', index_col="transaction_id")

# Remove duplicate rows
transactions = transactions.drop_duplicates(keep='first')

# Rename columns
new_names = {
    'Store_type': 'store_type',
    'Qty': 'qty',
    'Rate': 'rate',
    'Tax': 'tax'
}

transactions = transactions.rename(new_names, axis=1)

transactions.head()

## Detecting Missing Values (isna and any methods)

The `isna` method of a DataFrame detects missing values. This method does not take any arguments.

It returns a DataFrame of the same shape with:

- `True` if the cell contains a missing value (`np.nan`).
- `False` otherwise.

Since `isna` returns a DataFrame, we can combine it with other DataFrame methods to get more detailed information:

- The `any` method with the `axis` argument can determine which **columns** (`axis=0`) or **rows** (`axis=1`) contain at least one missing value.
- The `sum` method counts the number of missing values per column or row (using the `axis` argument). Other statistical methods like `mean`, `max`, `idxmax`, etc., can also be applied.

Example with a DataFrame `df` that contains missing values:

| Name     | Country    | Age |
|----------|------------|-----|
| NaN      | Australia  | NaN |
| Duchamp  | France     | 25  |
| Hana     | Japan      | 54  |

Running `df.isna()` returns:

| Name  | Country | Age   |
|-------|---------|-------|
| True  | False   | True  |
| False | False   | False |
| False | False   | False |

```python
# Example: Detecting missing values in a DataFrame

# Detect COLUMNS that contain at least one missing value
df.isna().any(axis=0)

# Output:
# Name        True
# Country    False
# Age         True

# Detect ROWS that contain at least one missing value
df.isna().any(axis=1)

# Output:
# 0     True
# 1    False
# 2    False

# Use conditional indexing to display rows with at least one missing value
df[df.isna().any(axis=1)]
```
| Name  | Country   |  Age  |
|-------|-----------|-------|
| NaN   | Australia |   NaN |

```python
# Count missing values per COLUMN
df.isnull().sum(axis=0)  # isnull and isna are equivalent

# Output:
# Name       1
# Country    0
# Age        1

# Count missing values per ROW
df.isnull().sum(axis=1)

# Output:
# 0    2
# 1    0
# 2    0
```


<center>

### **🔍 Example: Handling Missing Values in a DataFrame**

</center>

---

- (a) How many columns in the transactions DataFrame contain missing values?
- (b) How many rows in transactions contain at least one missing value? You can use the `any` method combined with `sum`.
- (c) Which column in transactions has the highest number of missing values?
- (d) Display the rows in transactions that have at least one missing value in the columns 'rate', 'tax', and 'total_amt'. What do you observe?

In [7]:
# TODO

## Replacing Missing Values (`fillna` method)

The `fillna` method allows you to replace missing values (NaN) in a DataFrame with a value of your choice. This is useful to clean the dataset before analysis or statistical calculations.

For example, we can replace missing values in a numeric column with 0, or in a categorical column with a default category.

```python
# Replace all NaN values in the DataFrame with zeros
df.fillna(0)

# Replace NaN values in each numeric column with the column mean
df.fillna(df.mean(numeric_only=True))  # mean can be replaced by any other statistical method
```

It is common to replace missing values in a numeric column with statistics such as:

- Mean: `mean`
- Median: `median`
- Minimum/Maximum: `min`/`max`

For categorical columns, missing values are usually replaced with:

- Mode, i.e., the most frequent category: `mode`
- A constant placeholder category, such as `0` or `-1`

To avoid mistakes when replacing missing values, it is strongly recommended to select the correct columns before using `fillna`.
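
A minimal sketch of this column-by-column approach is shown below on a made-up DataFrame. The column names `price` and `color` are illustrative assumptions, and the choice of median and mode is only one reasonable option.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one numeric and one categorical column
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 50.0],
    "color": ["red", "blue", None, "red"],
})

filled = df.copy()

# Numeric column: fill with the median of the observed values
filled["price"] = filled["price"].fillna(filled["price"].median())

# Categorical column: fill with the most frequent category
# (mode returns a Series, so the value is taken at index 0)
filled["color"] = filled["color"].fillna(filled["color"].mode()[0])

print(filled)
print(filled.isna().sum())
```

Selecting each column explicitly ensures that a numeric statistic is never written into a categorical column, or the reverse.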

If you make mistakes in the following exercise, you can re-import the `transactions` DataFrame using the next cell.

In [8]:
# Import the dataset
transactions = pd.read_csv("transactions.csv", sep=',', index_col="transaction_id")

# Remove duplicate rows
transactions = transactions.drop_duplicates(keep='first')

# Rename columns
new_names = {
    'Store_type': 'store_type',
    'Qty': 'qty',
    'Rate': 'rate',
    'Tax': 'tax'
}

transactions = transactions.rename(new_names, axis=1)

<center>

### **🔍 Example: Replacing Missing Values in a DataFrame**

</center>

---
- (a) Replace the missing values in the column `prod_subcat_code` of `transactions` with -1.

- (b) Determine the most frequent category (mode) of the column `store_type` in `transactions`.

- (c) Replace the missing values in the column `store_type` with this mode. You can access the mode value at index 0 of the Series returned by `mode`.

- (d) Verify that the columns `prod_subcat_code` and `store_type` in `transactions` no longer contain any missing values.

In [9]:
# TODO

## Removing Missing Values (`dropna` method)

The `dropna` method allows you to remove rows or columns that contain missing values.

The method signature is as follows: `dropna(axis, how, subset, ...)`

- **axis** specifies whether to remove rows or columns (0 for rows, 1 for columns).

- **how** specifies the condition for removal:
    - `how='any'`: remove the row (or column) if it contains at least one missing value.
    - `how='all'`: remove the row (or column) only if all values are missing.

- **subset** specifies which labels to consider when checking for missing values: column labels when removing rows (`axis=0`), row labels when removing columns (`axis=1`).

```python
# Remove all rows that contain at least one missing value
df = df.dropna(axis=0, how='any')

# Remove columns that are completely empty
df = df.dropna(axis=1, how='all')

# Remove rows where all values are missing in the specific columns 'col2', 'col3', and 'col4'
df = df.dropna(axis=0, how='all', subset=['col2','col3','col4'])
```
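
Since the snippet above uses placeholder column names, here is a small self-contained sketch that contrasts the three calls on a made-up DataFrame (the values are hypothetical).

```python
import numpy as np
import pandas as pd

# Toy data: col1 is always filled, col2 and col3 have gaps
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [np.nan, 5.0, np.nan, 7.0],
    "col3": [np.nan, np.nan, 6.0, 8.0],
})

# how='any': a single missing value is enough to drop the row
any_missing = df.dropna(axis=0, how="any")

# how='all': no row is dropped, because col1 is never missing
all_missing = df.dropna(axis=0, how="all")

# how='all' with subset: drop rows where col2 AND col3 are both missing
subset_all = df.dropna(axis=0, how="all", subset=["col2", "col3"])

print(list(any_missing.index))  # [3]
print(list(all_missing.index))  # [0, 1, 2, 3]
print(list(subset_all.index))   # [1, 2, 3]
```

The `subset` argument is what makes `how='all'` useful in practice, since a row is rarely missing in every single column.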


<center>

### **🔍 Example: Removing Missing Values**

</center>

---
Transactions for which the amount is not provided are not relevant to the analysis. For this reason:

- (a) Remove the entries in the `transactions` DataFrame where the columns `rate`, `tax`, and `total_amt` are simultaneously empty.

- (b) Verify that the columns in `transactions` no longer contain any missing values.

In [10]:
# TODO

---

<center>

## **📖 Conclusion and Summary**

</center>

---

In this chapter, we covered the essential `pandas` methods for cleaning a dataset and handling missing values (`NaN`).

Preparing a dataset is always the first step in any data project.

- **Data Cleaning**:
  - Detect and remove duplicates in a `DataFrame` using `duplicated` and `drop_duplicates`.
  - Modify DataFrame values and their types using `replace`, `rename`, and `astype`.
  - Apply a function to a DataFrame using `apply` and `lambda` expressions.
- **Handling Missing Values**:
  - Detect them using `isna()` with `any()` and `sum()`.
  - Replace them using `fillna()` and statistical functions.
  - Remove them using `dropna()`.

In the next notebook, you will explore more advanced `DataFrame` manipulations for deeper data analysis.

