---

<center>

# **Python for Data Science**

### *Introduction to DataFrames*

</center>


---

<center>

## **📖 Introduction**

</center>

---


The **pandas** module was developed to bring Python the necessary tools to manipulate and analyze large volumes of data.  

Pandas introduces the **DataFrame class**, a table-like data structure that, unlike a NumPy array, has labeled rows and columns and can hold a different data type in each column.  

---

### Main features of pandas:
- 📂 Loading data from files (CSV, Excel, etc.).  
- ✏️ Manipulating this data (adding/removing, modifying, cleaning).  
- 📊 Performing quick statistical analysis and visualization.  

---

### 🎯 Objectives of this notebook:
1. Understand the structure of a **DataFrame**.  
2. Create a first **DataFrame**.  
3. Explore a dataset using pandas.


<center>

### **🔍 Example: Import the pandas module**

</center>

---

- (a) Import the pandas module

In [45]:
# TODO

---

<center>

## **📖 Format of a DataFrame**

</center>

---


A **DataFrame** looks like a matrix where each row and column has an index.  
Usually, columns are indexed by their names.  

👉 A DataFrame is used to store a dataset.  
- Rows = **entries** of the dataset (people, animals, objects, etc.).  
- Columns = **characteristics** of these entries.  

---

### Example:  
|   | Name   | Sex | Height | Age |  
|---|--------|-----|--------|-----|  
| 0 | Robert | M   | 174    | 23  |  
| 1 | Mark   | M   | 182    | 40  |  
| 2 | Aline  | F   | 169    | 56  |  

- Here we have **3 rows** → 3 individuals.  
- And **4 columns** → Name, Sex, Height, Age.  

---

### 📌 The index column
The index is the left-most column that numbers the rows.  
It is **not managed like the other columns** of the dataset.  

We can:  
- Use the **default index** (0, 1, 2 …).  
- Index with one of the columns (e.g., Name).  
- Index with a **custom list** we provide.  

---

### Examples:

#### Default index:
|   | Name   | Sex | Height | Age |  
|---|--------|-----|--------|-----|  
| 0 | Robert | M   | 174    | 23  |  
| 1 | Mark   | M   | 182    | 40  |  
| 2 | Aline  | F   | 169    | 56  |  

#### Indexed by "Name":
|        | Sex | Height | Age |  
|--------|-----|--------|-----|  
| Robert | M   | 174    | 23  |  
| Mark   | M   | 182    | 40  |  
| Aline  | F   | 169    | 56  |  

#### Indexed by custom list:
|            | Name   | Sex | Height | Age |  
|------------|--------|-----|--------|-----|  
| person_1   | Robert | M   | 174    | 23  |  
| person_2   | Mark   | M   | 182    | 40  |  
| person_3   | Aline  | F   | 169    | 56  |  
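
The three indexing options can be reproduced with a few lines of pandas. This is a minimal sketch that rebuilds the small table from the examples; the variable names (`people`, `by_name`, `custom`) are chosen here for illustration only.

```python
import pandas as pd

# Same small table as in the examples above
people = pd.DataFrame({
    "Name": ["Robert", "Mark", "Aline"],
    "Sex": ["M", "M", "F"],
    "Height": [174, 182, 169],
    "Age": [23, 40, 56],
})

# 1. Default index: pandas numbers the rows from 0
print(people.index.tolist())

# 2. Index by the "Name" column: it stops being a regular column
by_name = people.set_index("Name")
print(by_name.columns.tolist())

# 3. Custom list of labels, one per row
custom = people.copy()
custom.index = ["person_1", "person_2", "person_3"]
print(custom)
```

Note that `set_index` returns a new DataFrame and leaves `people` unchanged.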

---

<center>

## **📖 Creating a DataFrame from a Dictionary**

</center>

---



It's possible to create a DataFrame by using a **Python dictionary**.  

- Columns can contain **different types** (numbers, strings, etc.).  
- Column names are taken from the **dictionary keys** when the DataFrame is created.

### Example:

```python
my_dic = {'A': [1, 5, 9],
          'B': [2, 6, 10],
          'C': [3, 7, 11],
          'D': [4, 8, 12]}

# Create a DataFrame
df = pd.DataFrame(data = my_dic,
                  index = ['i_1', 'i_2', 'i_3'])
>>>
      A   B    C    D
i_1   1   2    3    4
i_2   5   6    7    8
i_3   9   10   11   12
```
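
To illustrate that each column can have its own type, here is a small sketch with made-up data (the `pets` dictionary and its values are hypothetical). The `dtypes` attribute lists the type pandas inferred for each column.

```python
import pandas as pd

# Hypothetical data: each key becomes a column name,
# each list holds the values of that column
pets = {
    "name": ["Rex", "Tom", "Nemo"],
    "species": ["dog", "cat", "fish"],
    "weight_kg": [30.5, 4.2, 0.1],
    "age": [5, 3, 1],
}
pets_df = pd.DataFrame(pets)

# One inferred type per column: text, floating-point numbers, integers
print(pets_df.dtypes)
```

When no `index` argument is given, as here, the rows receive the default index 0, 1, 2.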

<center>

### **🔍 Example: Creating a DataFrame from a Dictionary**

</center>

---

The manager of a grocery store keeps track of the following food stock:  
  - **100 jars of honey**, expiration date: 10/08/2025, price: 2€ each.  
  - **55 bags of flour**, expiration date: 25/09/2024, price: 3€ each.  
  - **1800 bottles of wine**, expiration date: 15/10/2023, price: 10€ each.

Task:
- (a) Using a **Python dictionary**, create and display a DataFrame `df` that contains for each product:  
  - Product name  
  - Expiration date  
  - Quantity  
  - Unit price

In [46]:
# TODO

---

<center>

## **📖 Creating a DataFrame from a Data File**

</center>

---

Most often, DataFrames are created directly from data files such as **CSV, Excel, or text files**.  

👉 The most common format is **CSV** (Comma-Separated Values), which represents a spreadsheet-like table where values are separated by a delimiter (`,` by default, but sometimes `;` is used).

**Example of a CSV file**:

A,B,C,D

1,2,3,4

5,6,7,8

9,10,11,12

In this format:

- **The first line contains the column names**, but sometimes the column names **are not provided**.
- Each **line** corresponds to an entry in the **dataset**.
- The values are **separated by a delimiter**. In this example, it is `','` but it could also be `';'`.

To import this data into a DataFrame, we use the pandas `read_csv` function. A typical call looks like this (the values shown are examples, not the function's defaults):
```python
pd.read_csv(filepath_or_buffer, sep=',', header=0, index_col=0, ...)
```
The key arguments of the `pd.read_csv` function to know are:

- `filepath_or_buffer`: The path to the .csv file relative to the execution environment.
If the file is in the same folder as your Python environment, you can simply provide the file name
(e.g., 'my_dataframe.csv'). This path should be provided as a string.

- `sep`: The character used in the .csv file to separate columns. This argument must be given as a string (e.g., `','` or `';'`).

- `header`: The row number that contains the column names.
For example, if the column names are in the first row of the .csv file, specify `header=0`.
If the column names are not present, use `header=None`.

- `index_col`: The name or number of the column to use as the row labels of the DataFrame.
If the entries are indexed by the first column, set `index_col=0`.
Alternatively, if the rows are indexed by a column named "Id", you can specify `index_col="Id"`.

This function returns a `DataFrame` object containing all the data from the file.
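
The effect of these arguments can be tried without any file on disk. The sketch below feeds the sample CSV text to `read_csv` through `io.StringIO`, which stands in for a real file path; the variable names are illustrative.

```python
import io
import pandas as pd

# In-memory stand-in for a file; a real call would pass a path
# such as 'my_dataframe.csv' instead of io.StringIO(...)
csv_text = "A,B,C,D\n1,2,3,4\n5,6,7,8\n9,10,11,12\n"

# Column names are on the first line; keep the default integer index
df_default = pd.read_csv(io.StringIO(csv_text), sep=",", header=0)

# Same data, but the first column ("A") becomes the row labels
df_indexed = pd.read_csv(io.StringIO(csv_text), sep=",", header=0, index_col=0)

# Semicolon-separated data without a header row:
# pandas numbers the columns 0, 1, 2, ...
df_no_header = pd.read_csv(io.StringIO("1;2;3;4\n5;6;7;8\n"), sep=";", header=None)

print(df_default)
print(df_indexed)
print(df_no_header)
```

With `index_col=0`, column `A` is no longer counted among the data columns, so `df_indexed` has three columns instead of four.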

<center>

### **🔍 Example: Loading Data into a DataFrame**

</center>

---
- (a) Load the data from the file `transactions.csv` into a DataFrame named `transactions`:
    - The file is located in the same folder as this notebook environment.
    - Columns are separated by a comma.
    - Column names are provided in the first row of the file.
    - The rows are indexed by the column "transaction_id", which is also the first column.

In [47]:
# TODO

**Loading a Real Dataset: `transactions.csv`**

We have just loaded the file **`transactions.csv`** into a pandas DataFrame named **`transactions`**.  
This dataset contains a history of financial transactions made between **2011 and 2014**.

**Why use a dataset?**

Working with a real dataset allows us to:
- Practice **data loading** and **cleaning**.  
- Explore the **structure of a DataFrame**.  
- Perform **statistical analysis** and **visualization**.  

**Reminder**

We usually load a CSV file with:

```python
import pandas as pd

# Load the CSV file into a DataFrame, using 'transaction_id' as the index
transactions = pd.read_csv("transactions.csv", index_col="transaction_id")

# Display the first rows of the dataset
transactions.head()
```
In the next section, we will explore the dataset to understand its structure, columns, and content.

---

<center>

## **📖 First Exploration of a Dataset with `pandas.DataFrame`**

</center>

---


Now that we have loaded the dataset into a DataFrame (`transactions`),  
let’s explore it using some **basic methods** provided by the `pandas` library.

**Get a Quick Look at the Data**

- **`head()`** → Displays the first 5 rows (by default).  
- **`.columns`** → Lists the column names.  
- **`.shape`** → Returns the number of rows and columns.

```python
# Display the first 5 rows
transactions.head()

# Display the list of column names
transactions.columns

# Display the dimensions of the DataFrame (rows, columns)
transactions.shape
```

**Selecting Data**

There are two main ways to access rows/columns in a DataFrame:

`.loc[]` → Selection by labels (row/column names).

`.iloc[]` → Selection by indices (row/column positions).

```python
# Select one row by index (first row)
transactions.iloc[0]

# Select multiple rows (first 3 rows)
transactions.iloc[0:3]

# Select a specific column by label
transactions.loc[:, "total_amt"]

# Select a specific row and column by label
transactions.loc[80712190438, "Store_type"]
```
**Quick Statistical Overview**

- `describe()` → Generates summary statistics (mean, std, min, max, quartiles).

- `value_counts()` → Counts unique values in a column.

---

<center>

## **📖 Previewing a DataFrame: `head()`, `tail()`, `columns` and `shape`**

</center>

---

You can get a quick overview of a dataset by displaying only the first few rows of a DataFrame.

- Use the `head()` method to display the first rows. You can pass the number of rows you want as an argument (default is 5):

```python
transactions.head(10)  # Display the first 10 rows
```

- Similarly, use `tail()` to see the last rows of the DataFrame:

```python
transactions.tail(20)  # Display the last 20 rows
```

- To see the column names of a DataFrame, use the `columns` attribute:
```python
transactions.columns
```

- To check the shape of the DataFrame (number of rows and columns), use the `shape` attribute:
```python
transactions.shape
```
These tools are very useful to quickly understand the structure and content of your dataset.
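
As a self-contained illustration that does not depend on `transactions.csv`, the sketch below applies the same tools to a small made-up DataFrame (`toy` and its columns `x` and `y` are hypothetical).

```python
import pandas as pd

# Hypothetical stand-in for a larger dataset: 12 rows, 2 columns
toy = pd.DataFrame({"x": range(12), "y": [v * v for v in range(12)]})

first_rows = toy.head()        # no argument: the first 5 rows
last_rows = toy.tail(3)        # the last 3 rows
n_rows, n_cols = toy.shape     # attribute (no parentheses) holding a tuple
second_name = toy.columns[1]   # column names can be accessed by position, from 0

print(first_rows)
print(last_rows)
print(n_rows, n_cols, second_name)
```

`head` and `tail` are methods and need parentheses, while `columns` and `shape` are attributes and are read without them.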

<center>

### **🔍 Example: Exploring a DataFrame**

</center>

---

We will practice some basic DataFrame exploration methods with the dataset **`transactions`**.

- (a) Display the first 20 rows of the DataFrame.
- (b) Display the last 10 rows of the DataFrame.
- (c) Display the dimensions (number of rows and columns) of the DataFrame and the name of the 5th column.  
💡 *Reminder: in Python, indexing starts at 0!*


In [48]:
# TODO

---

<center>

## **📖 Selecting Columns from a DataFrame**

</center>

---

Extracting columns from a **DataFrame** is very similar to extracting data from a dictionary.

---

- Extracting a single column

To extract a column, specify its name between square brackets:

```python
# Display the column 'cust_id'
print(transactions['cust_id'])
```

- Extracting multiple columns

To extract multiple columns, pass a list of column names inside the brackets (so you need double brackets):

```python
# Extract the columns 'cust_id' and 'Qty' from transactions
cust_id_qty = transactions[["cust_id", "Qty"]]
```
`cust_id_qty` is now a new DataFrame containing only the columns `'cust_id'` and `'Qty'`.

Example output (first 3 rows):

```python
cust_id_qty.head(3)
>>>
```

| transaction_id  | cust_id | Qty |
|-----------------|---------|-----|
| 80712190438     | 270351  | -5  |
| 29258453508     | 270384  | -5  |
| 51750724947     | 273420  | -2  |
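
The difference between single and double brackets is easy to check on a small example. The sketch below reuses the three rows shown above; the `Store_type` values are made up for illustration.

```python
import pandas as pd

# Toy rows reusing the 'cust_id' and 'Qty' values shown above;
# the 'Store_type' values are made up for illustration
toy = pd.DataFrame(
    {
        "cust_id": [270351, 270384, 273420],
        "Qty": [-5, -5, -2],
        "Store_type": ["e-Shop", "e-Shop", "Shop A"],
    },
    index=[80712190438, 29258453508, 51750724947],
)

single = toy["cust_id"]            # one name            -> pd.Series
double = toy[["cust_id"]]          # list with one name  -> one-column pd.DataFrame
subset = toy[["cust_id", "Qty"]]   # list of names       -> pd.DataFrame

print(type(single).__name__, type(double).__name__, subset.shape)
```

A single name returns a `Series`, while a list of names always returns a `DataFrame`, even when the list contains only one name.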

**Categorical vs Quantitative variables**

When preparing a dataset, it is important to separate categorical variables from quantitative variables.

- Categorical variable → contains categories or labels, with no natural ordering.
Example: favorite color, country, nationality.

  👉 In our DataFrame transactions, the categorical variables are:
  `['cust_id', 'tran_date', 'prod_subcat_code', 'prod_cat_code', 'Store_type']`

  - `cust_id` → The customer ID (unique identifier for each customer).
  - `tran_date` → The transaction date (when the purchase or return happened).
  - `prod_subcat_code` → The product subcategory code (e.g., “Trousers” if the main category is “Clothing”).
  - `prod_cat_code` → The product category code (e.g., “Clothing”).
  - `Store_type` → The type of store where the transaction took place (e.g., Online Store, Supermarket, Specialized Store).

- Quantitative variable → measures a numerical quantity, with an ordering relationship.
Example: height, weight, age.

  👉 In our DataFrame transactions, the quantitative variables are:
  `['Qty', 'Rate', 'Tax', 'total_amt']`

  - `Qty` → The quantity purchased (can be negative if the product was returned).
  - `Rate` → The unit price of the product (before tax).
  - `Tax` → The amount of tax applied to the transaction.
  - `total_amt` → The total transaction amount (= quantity × rate + tax for a purchase; the sign is negative for a return).

⚠️ Why does it matter?

Some basic operations (like computing a mean) only make sense for quantitative variables.

<center>

### **🔍 Example: Splitting Categorical and Quantitative Variables**

</center>

---

We will now practice splitting our dataset into categorical and quantitative variables.

- (a) In a DataFrame named **`cat_vars`**, store the categorical variables of `transactions`.  
- (b) In a DataFrame named **`num_vars`**, store the quantitative variables of `transactions`.  
- (c) Display the first 5 rows of each DataFrame.


In [49]:
# TODO

---

<center>

## **📖 Selecting Rows in a DataFrame: `loc` and `iloc`**

</center>

---

To extract one or several rows from a DataFrame, we use the **`loc`** indexer.

Unlike a regular method, `loc` takes its arguments **inside square brackets `[]`** instead of parentheses `()`.

- **Example 1**: Selecting rows by label with `loc`
If we want to retrieve the rows with index `80712190438` (this label appears twice in the dataset, so two rows are returned):

  ```python
  # Get row with index 80712190438 from num_vars DataFrame
  num_vars.loc[80712190438]
  >>>
  ```
  | transaction_id | Qty |  Rate  |   Tax  | total_amt |
  |----------------|-----|--------|--------|-----------|
  | 80712190438    | -5  | -772.0 | 405.3  | -4265.3   |
  | 80712190438    |  5  |  772.0 | 405.3  |  4265.3   |

- **Example 2**: Selecting multiple rows with `loc`
We can pass:
  - a list of index labels
  - or a slice of labels (`start:end`, where the `end` label is included).

  ```python
  # Select multiple rows using their indices
  num_vars.loc[[80712190438, 29258453508, 51750724947]]
  >>>
  ```

- **Example 3**: Selecting both rows and columns with loc
We can also specify which columns to extract.

  ```python
  # Select only 'Tax' and 'total_amt' for two transactions
  transactions.loc[[80712190438, 29258453508], ['Tax', 'total_amt']]
  >>>
  ```
  | transaction_id | Tax     | total_amt |
  |----------------|---------|-----------|
  |   80712190438  | 405.300 | -4265.300 |
  |   80712190438  | 405.300 |  4265.300 |
  |   29258453508  | 785.925 | -8270.925 |
  |   29258453508  | 785.925 |  8270.925 |

- **Example 4**: Using `iloc`

The `iloc` indexer works like NumPy array indexing:
only integer positions are used for rows and columns.

  ```python
  # Extract the first 4 rows and the first 3 columns
  transactions.iloc[0:4, 0:3]
  >>>
  ```
  | transaction_id | cust_id |  tran_date  | prod_subcat_code |
  |----------------|---------|-------------|------------------|
  | 80712190438    | 270351  | 28/02/2014  | 1.0              |
  | 29258453508    | 270384  | 27/02/2014  | 5.0              |
  | 51750724947    | 273420  | 24/02/2014  | 6.0              |
  | 93274880719    | 271509  | 24/02/2014  | 11.0             |

- Summary
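  - `loc` → selection by labels (row/column names).
  - `iloc` → selection by integer positions (row/column numbers).

If the DataFrame uses the default integer index (0, 1, 2, ...), then `loc` and `iloc` return the same rows for a single value or a list of values. Slices still differ: `loc[0:3]` includes the row labelled 3, while `iloc[0:3]` stops before position 3.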
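
The following sketch contrasts the two indexers on made-up data (the DataFrames `toy` and `nums` are hypothetical). It covers two cases that often surprise beginners: a label that appears more than once, and slicing.

```python
import pandas as pd

# Toy frame with text labels, one of which ("b") appears twice
toy = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "b", "c"])

by_label = toy.loc["b"]       # every row labelled "b" -> DataFrame with 2 rows
by_position = toy.iloc[1]     # the single row at position 1 -> Series

# Slices on a default integer index
nums = pd.DataFrame({"value": [10, 20, 30, 40]})
loc_slice = nums.loc[0:2]     # labels 0, 1 and 2: the end label is included
iloc_slice = nums.iloc[0:2]   # positions 0 and 1: the end position is excluded

print(by_label)
print(by_position)
print(len(loc_slice), len(iloc_slice))
```

This is why `loc` on the transactions data can return two rows for one transaction ID: the label is looked up by value, not by position.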


---

<center>

## **📖 Conditional Indexing of a DataFrame**

</center>

---

We can use **conditional indexing** to extract the rows of a DataFrame that satisfy a given condition.  

In the following example, we select the rows of the DataFrame `df` where the column **col 2** is equal to `3`.

There are two syntaxes for conditional indexing:

```python
# Select rows where column 'col 2' equals 3
df[df['col 2'] == 3]

# Alternative using loc
df.loc[df['col 2'] == 3]
```
⚠️ Important:

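To assign a new value to these entries, use `.loc` with the condition and the target column in a single call, e.g. `df.loc[df['col 2'] == 3, 'col 1'] = value`, where `'col 1'` stands for whichever column you want to modify.
The syntax `df[df['col 2'] == 3]` returns a copy of the selected rows, so assigning to it does not modify the original DataFrame.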
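
Here is a minimal sketch of both operations, selection and assignment, on a made-up DataFrame. The column `'col 1'` and all values are hypothetical; only `'col 2'` comes from the example above.

```python
import pandas as pd

# Hypothetical DataFrame; only the name 'col 2' comes from the text above
df = pd.DataFrame({"col 1": ["a", "b", "c", "d"], "col 2": [3, 1, 3, 2]})

mask = df["col 2"] == 3    # boolean Series: True where the condition holds
selected = df[mask]        # copy of the rows where 'col 2' equals 3

# Assignment goes through one .loc call: row condition first, column second
df.loc[mask, "col 1"] = "changed"

print(df)
print(selected)
```

Storing the condition in a variable such as `mask` keeps the selection and the assignment readable, and it makes it clear that both refer to the same rows. `selected` still holds the old values because it was created as a copy before the assignment.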

<center>

### **🔍 Example: Filtering Transactions by Store Type**

</center>

---
The manager of the transactions recorded in the DataFrame `transactions` wants to access the **customer IDs** of clients who made a purchase online (i.e., in an `"e-Shop"`) and the corresponding **transaction date**.

We are given the following information about the columns of `transactions`:

| Column name   | Description                                      |
|---------------|--------------------------------------------------|
| `cust_id`     | Customer IDs                                     |
| `Store_type`  | The type of store where the transaction occurred |
| `tran_date`   | The date of the transactions                     |

---

- (a) Create a new DataFrame named `transactions_eshop` containing only the transactions that took place in an `"e-Shop"`.  

- (b) Create another DataFrame named `transactions_id_date` that stores the **customer IDs** and the **transaction dates** from the DataFrame `transactions_eshop`.  

- (c) Display the first 5 rows of `transactions_id_date`.


In [50]:
# TODO

<center>

### **🔍 Example: Conditional Selection and Aggregation**

</center>

---
The store manager now wants to focus on a specific client.

- (a) Create a DataFrame named `transactions_client_268819` that contains all transactions where the client ID is `268819`.

- (b) A column of a DataFrame can be iterated like a list in a for loop. Using a loop on the column `'total_amt'`, calculate and display the total transaction amount for client `268819`.

In [51]:
# TODO

---

<center>

## **📖 Quick Statistical Study of a DataFrame**

</center>

---

The `describe()` method of a DataFrame returns a summary of the descriptive statistics (minimum, maximum, mean, quantiles, etc.) for its numerical variables.

It is a very useful tool to quickly visualize the type and distribution of these variables.

```python
# Display descriptive statistics of the numerical variables
print(num_vars.describe())
```
For categorical variables, it is often better to start with the `value_counts()` method, which returns the number of occurrences of each category.

⚠️ Note: `value_counts()` is normally applied to a single column, i.e. an object of type `pd.Series`. Called on a whole DataFrame (pandas 1.1 and later), it counts unique rows rather than the categories of one column.

```python
# Count the number of transactions per store type
print(transactions["Store_type"].value_counts())
```
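
Both methods can be tried on a small made-up DataFrame whose results are easy to check by hand. The store labels and amounts below are invented for illustration.

```python
import pandas as pd

# Toy data; the store labels and amounts are made up for illustration
toy = pd.DataFrame({
    "Store_type": ["e-Shop", "e-Shop", "Shop A", "e-Shop", "Shop A"],
    "total_amt": [100.0, 250.0, -100.0, 50.0, 200.0],
})

stats = toy.describe()                      # summary of the numeric column
counts = toy["Store_type"].value_counts()   # occurrences per category, most frequent first

print(stats.loc[["mean", "min", "max"], "total_amt"])
print(counts)
```

The result of `describe()` is itself a DataFrame, so individual statistics can be read with `loc`, as done here for the mean, minimum and maximum.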

<center>

### **🔍 Example: Quick Statistical Study of a DataFrame**

</center>

---
- (a) Use the `describe()` method on the DataFrame transactions.
- (b) The numerical variables are: `Qty`, `Rate`, `Tax`, and `total_amt`. By default, are the statistics produced by the `describe()` method computed only on numerical variables?
- (c) Display the number of occurrences of each category taken by the variable `Store_type` using the method `value_counts`.


In [52]:
# TODO

**Interpreting `describe()` on categorical variables**

The `describe()` method calculated statistics on the columns `cust_id`, `prod_subcat_code`, and `prod_cat_code` even though these are categorical variables.

These statistics do not make sense for categorical data. The method treated them as quantitative because the values happen to be numeric.

⚠️ Takeaway: Always be cautious when interpreting `describe()` results. Make sure you understand the type of each variable in your DataFrame before drawing conclusions.

<center>

### **🔍 Example: Quick Transaction Summary**

</center>

---
The manager wants to generate a quick report on the transactions. In particular, they are interested in the `average amount` spent and the `maximum quantity` purchased.

(a) What is the `average total amount` spent? Focus on the `total_amt` column of the transactions DataFrame.

(b) What is the `maximum quantity purchased`? Focus on the `Qty` column of the transactions DataFrame.

In [53]:
# TODO

Some transactions have negative amounts.
These correspond to canceled transactions that were refunded to the customer.

These negative amounts can distort the distribution of the total_amt variable, leading to inaccurate estimates of the mean and quantiles.

It is therefore important to filter or handle these cases before performing statistical analysis.

<center>

### **🔍 Example: Average of Positive Transactions**

</center>

---
- (a) Calculate the mean of the `total_amt` column for transactions where the amount is positive.

In [54]:
# TODO

---

<center>

## **📖 Conclusion and Summary**

</center>

---

The pandas DataFrame class will be your go-to data structure for exploring, analyzing, and processing datasets.

In this brief introduction, you have learned how to:

- Create a DataFrame from a dictionary using `pd.DataFrame`.
- Create a DataFrame from a `.csv` file using `pd.read_csv`.
- Preview the first and last rows of a DataFrame using the `head` and `tail` methods.
- Select one or more columns from a DataFrame by specifying their names in brackets, similar to a dictionary.
- Select one or more rows from a DataFrame by specifying their indices using `loc` and `iloc`.
- Filter rows of a DataFrame that satisfy a specific condition using conditional indexing.
- Perform a quick statistical overview of quantitative variables in a DataFrame using the `describe` method.

In practice, datasets are rarely perfectly clean: missing values, duplicates, or inconsistent entries are common.  
In the next section, we will learn how to clean and preprocess datasets using pandas, a crucial step before any meaningful analysis.
