# Lab Session 2

## Normalization

### Data Normalization Guide

Normalization rescales features to a comparable range so that no feature dominates a machine learning model simply because of its units or magnitude. The steps are as follows:

1. **Import the required library**  
   First, import the normalization tools from scikit-learn, such as Min-Max Scaler or Standard Scaler.

`from sklearn.preprocessing import MinMaxScaler, StandardScaler`

2. **Prepare your dataset and split it into training and testing sets**  
   Split the dataset into training and testing sets before scaling. This prevents information from the test set from influencing the scaling parameters, which would cause data leakage. `train_test_split` is imported with `from sklearn.model_selection import train_test_split`.

`X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size = 0.2)`

3. **Create a scaler object**  
   Decide which scaling method is appropriate:
   - **Min-Max scaling:** rescales feature values to a fixed range, usually [0,1].  
   - **Z-score standardization:** centers features around zero with unit variance.

`scaler = MinMaxScaler()`

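4. **Fit the scaler to the training data**  
   Fitting computes the scaling parameters from the training data only (per-feature minimum and maximum for Min-Max scaling; mean and standard deviation for Z-score standardization).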

`scaler.fit(X_train)`

5. **Transform the training data using the fitted scaler**  
   Apply the scaling to the training data to normalize its features. The training data is now ready to be used for model training.

`X_train_scaled = scaler.transform(X_train)`

6. **Transform the testing data using the same scaler**  
   Apply the scaler (fitted on the training data) to the testing data.
   
`X_test_scaled = scaler.transform(X_test)`

7. **Use the scaled data for modeling**  
   Train your machine learning model using the scaled training data and evaluate it on the scaled testing data. 
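Put together, the steps above might look like the following minimal sketch. The toy feature matrix `X` and labels `y` are hypothetical stand-ins (not the iris data from the exercise), chosen so that the two features have very different ranges.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 50), rng.uniform(100, 1000, 50)])
y = rng.integers(0, 2, 50)

# Step 2: split before scaling to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Steps 3-4: create the scaler and fit it on the training data only
scaler = MinMaxScaler()
scaler.fit(X_train)

# Steps 5-6: transform both sets with the same fitted scaler
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("train min/max:", X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
print("test  min/max:", X_test_scaled.min(axis=0), X_test_scaled.max(axis=0))
```

Because the scaler is fitted on the training split only, the scaled training features span exactly [0, 1], while scaled test values can fall slightly outside that range.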

---


## Exercise 1: Splitting + Normalization
The iris dataset below is used to distinguish between three species of iris flower ('setosa', 'versicolor', 'virginica') based on their sepal and petal dimensions. Run the example code to load it, then follow the normalization steps above to normalize the data. Display the first 5 rows of the normalized data and compare them with the non-normalized data. What differences do you notice?


In [None]:
import pandas as pd
from sklearn.datasets import load_iris

# Load the dataset
data = load_iris()
X, y = data.data, data.target

# Feature names and target
feature_names = data.feature_names
target_name = 'species'

# Map numeric labels to species names
species = data.target_names[y]

# Create DataFrame
df = pd.DataFrame(X, columns=feature_names)
df[target_name] = species

# Display the first 5 rows
print("Iris dataset (first 5 rows):")
df.head()

## Signal Conditioning Techniques 

### Signal Conditioning guide 

### 1. Amplification
**Amplification** is the process of increasing the amplitude of a signal.  
- Purpose: Make weak signals strong enough for further processing or measurement.  
- Example: Multiplying a signal by a gain factor \(G > 1\) increases its amplitude.

---

### 2. Attenuation
**Attenuation** is the process of reducing the amplitude of a signal.  
- Purpose: Prevent signals from saturating measurement equipment or bring them into a measurable range.  
- Example: Multiplying a signal by a gain factor \(0 < G < 1\) reduces its amplitude.
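Both amplification and attenuation come down to multiplying the signal by a gain factor. A minimal sketch, assuming a hypothetical clean 5 Hz test sine and illustrative gain values of 3.0 and 0.25:

```python
import numpy as np

fs = 1000                                  # sampling rate in Hz (assumed)
t = np.linspace(0, 1, fs, endpoint=False)
x = np.sin(2 * np.pi * 5 * t)              # hypothetical test signal

gain_up = 3.0      # G > 1: amplification
gain_down = 0.25   # 0 < G < 1: attenuation

amplified = gain_up * x
attenuated = gain_down * x

print("peak original:  ", np.max(np.abs(x)))
print("peak amplified: ", np.max(np.abs(amplified)))
print("peak attenuated:", np.max(np.abs(attenuated)))
```

Note that a gain scales everything in the signal, including any noise and DC offset it contains.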


---

### 3. Filtering  
**Filtering** is about cleaning up your data by removing parts of the signal you don't care about (like slow drift or random noise) while keeping the important patterns.  
- **Purpose:** Focus on the meaningful signal and make it easier to analyze.  
- **Example:** A bandpass filter keeps only the frequencies in a chosen range (like 1–30 Hz), removing both very slow changes and very fast noise.

#### Filtering application:

`lowcut, highcut = 1.0, 30.0`  
- **What this line does:** Sets the range of frequencies we want to keep — from 1 Hz up to 30 Hz.  
- **Why we use it:** This tells the filter what counts as “useful signal” vs. “noise.”  
  - The low cutoff (1 Hz) removes slow trends or constant offsets.  
  - The high cutoff (30 Hz) removes very fast fluctuations that are probably just noise.  
- **tip:** Pick these numbers based on your problem — if you expect slower patterns, lower the low cutoff; if you expect faster patterns, raise the high cutoff.

---

`nyq = 0.5 * fs`  
- **What this line does:** Finds the Nyquist frequency, which is half the sampling rate.  
- **Why we use it:** When `sg.butter` is called without its `fs` argument, it expects cutoff frequencies as a fraction of the Nyquist frequency, i.e. between 0 and 1.  
- **tip:** Think of this step as scaling your real-world frequencies into a range the computer understands — it’s just a conversion step.

---

`b, a = sg.butter(4, [lowcut/nyq, highcut/nyq], btype='band')`  
- **What this line does:** Designs the actual filter using the cutoff frequencies we defined.  
- **Why we use it:**  
  - **Butterworth** filters have a maximally flat passband, so they do not distort the amplitude of your signal in the range you care about.  
  - **Order = 4** controls how “sharp” the transition between kept and removed frequencies is. A higher order cuts noise more aggressively but can introduce ringing near abrupt changes and edge artifacts.  
- **tip:** You can experiment with the filter order. Higher values give a steeper cutoff, but be careful not to over-filter and lose important details.

----

`filtered = sg.filtfilt(b, a, original)`  
- **What this line does:** Runs the data through the filter twice (forward and backward) so the cleaned signal stays perfectly aligned with the original (no time shift).  
- **Why we use it:**  `filtfilt` keeps peaks, spikes, and other features in the correct place while removing noise.  
- **What to watch for:**  
  - Works best on signals with enough data points (avoid very short signals).  
  - Always check the plot — the filtered result should look smoother but still follow the same pattern as the original.
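The four lines above can be combined into one runnable sketch. This version is a variant rather than the exact code from the guide: it passes `fs` directly to `sg.butter` instead of dividing by the Nyquist frequency, and it uses second-order sections (`output='sos'` with `sg.sosfiltfilt`), which tend to be numerically more robust than the `b, a` form. The 4-second duration and the seeded noise are assumptions made so the result is reproducible.

```python
import numpy as np
from scipy import signal as sg

fs = 1000                                   # sampling rate in Hz
t = np.arange(0, 4, 1 / fs)                 # 4 s of data (assumed duration)
rng = np.random.default_rng(0)

clean = 2 * np.sin(2 * np.pi * 5 * t)       # the pattern we want to keep
noisy = clean + 0.5 + 0.3 * rng.standard_normal(len(t))  # add DC offset and noise

# Band-pass design: keep 1-30 Hz; fs is given so cutoffs stay in Hz
lowcut, highcut = 1.0, 30.0
sos = sg.butter(4, [lowcut, highcut], btype='band', fs=fs, output='sos')

# Zero-phase filtering (forward and backward), the SOS counterpart of filtfilt
filtered = sg.sosfiltfilt(sos, noisy)

# Compare against the clean sine away from the edges
mid = slice(fs, 3 * fs)
rms_before = np.sqrt(np.mean((noisy[mid] - clean[mid]) ** 2))
rms_after = np.sqrt(np.mean((filtered[mid] - clean[mid]) ** 2))
print(f"RMS error before filtering: {rms_before:.3f}")
print(f"RMS error after filtering:  {rms_after:.3f}")
```

The error is measured on the middle of the signal because the first and last stretches are the most affected by edge effects.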


## Exercise 2: Signal Conditioning
Run the following code, which generates a noisy sine wave with a DC offset. Then apply the three techniques explained above (amplification, attenuation, and filtering) and visualize the impact of each one.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal as sg

# Sampling setup
fs = 1000  # Hz
t = np.linspace(0, 1, fs, endpoint=False)

# Original signal: sine + DC offset + random noise
original = 2*np.sin(2*np.pi*5*t) + 0.5 + 0.3*np.random.randn(len(t))

plt.figure(figsize=(10, 3))
plt.plot(t, original, label='Original')
plt.title("Original Noisy Sine Wave with DC Offset")
plt.xlabel("Time [s]")
plt.ylabel("Amplitude")
plt.xlim(0, 0.5)
plt.legend()
plt.tight_layout()
plt.show()

# Data conditioning techniques

## Exercise 3: Data Conditioning

#### 3.1 Using the same signal as before and with the help of the hands-on guide below, perform FFT decomposition and aggregation (min, max, mean), and produce a plot for each part.

#### 3.2 Run the synchronization code, then determine the target frequency achieved.



---
## Data conditioning guide
### 1. Decomposition
Decomposition breaks down a signal into simpler components, often to analyze its **frequency content** or **time-frequency patterns**.


### Fourier Transform Code Implementation

```python
from scipy.fft import fft, fftfreq  # Import FFT functions

# Step 1: Get signal parameters
N = len(original)                   # N = number of samples in time series
```
**What's happening:** We need to know the length of our signal because the FFT algorithm requires this for proper frequency bin calculation and normalization.

```python
# Step 2: Apply the Discrete Fourier Transform
yf = fft(original)                  # yf = FFT of signal; returns complex numbers
```
**Theory connection:** `fft` computes the discrete Fourier transform (DFT), $X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-2\pi i k n / N}$, the sampled counterpart of the continuous Fourier transform. Each element in `yf` is a complex number containing:
- **Real part:** represents the cosine component at that frequency
- **Imaginary part:** represents the sine component at that frequency
- **Magnitude:** `np.abs(yf[k])` gives the (unscaled) amplitude of frequency component k
- **Phase:** `np.angle(yf[k])` gives the phase shift of frequency component k

```python
# Step 3: Generate corresponding frequency values
xf = fftfreq(N, 1/fs)              # xf = frequency bins; 1/fs is time step
```
**What's happening:** 
- `fftfreq()` creates an array of frequencies corresponding to each FFT bin
- `1/fs` is the sampling period (time between samples)
- For sampling rate `fs`, the returned array holds the non-negative frequencies first (0 up to just below `fs/2`, the Nyquist frequency), followed by the negative frequencies

```python
# Step 4: Plot the single-sided amplitude spectrum
plt.figure(figsize=(10, 3))
plt.plot(xf[:N//2], 2.0/N * np.abs(yf[:N//2]))
```

**Breaking down the plotting code:**

| Code Component | Mathematical Meaning | Why We Do This |
|---|---|---|
| `xf[:N//2]` | Take first half of frequency bins | For real-valued signals the spectrum is symmetric, so the positive frequencies carry all the information |
| `yf[:N//2]` | Take first half of FFT results | Corresponds to positive frequencies only |
| `np.abs(yf[:N//2])` | $|X[k]| = \sqrt{\text{Re}(X[k])^2 + \text{Im}(X[k])^2}$ | Convert complex numbers to magnitudes |
| `2.0/N` | Normalization factor | Account for single-sided spectrum and sample count |

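**Why the `2.0/N` scaling?**
- **`N`:** Normalizes for the number of samples (the unnormalized FFT sums over all N samples, so magnitudes grow with N)
- **`2.0`:** Compensates for throwing away negative frequencies (doubles the amplitude)
- Result: Amplitude values that match the original sinusoidal components. The exception is the DC bin at 0 Hz, which has no negative-frequency twin, so the doubling overstates a constant offset by a factor of 2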


**tip:** Fourier is widely used because it reveals the dominant frequencies in a signal and is simple to compute with most libraries (e.g., NumPy’s `fft`). For signals with transient changes, wavelets can provide better time-localized frequency information.
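The four steps can be checked end to end with a short sketch that recovers the dominant frequency and its amplitude. It assumes the same kind of signal as in Exercise 2 (a 5 Hz sine of amplitude 2 plus a 0.5 offset), with a seeded random generator added here for reproducibility. The DC correction on `amplitude[0]` is an addition to the guide's plotting code.

```python
import numpy as np
from scipy.fft import fft, fftfreq

fs = 1000
t = np.linspace(0, 1, fs, endpoint=False)
rng = np.random.default_rng(0)                 # seeded for reproducibility (assumption)
x = 2 * np.sin(2 * np.pi * 5 * t) + 0.5 + 0.3 * rng.standard_normal(len(t))

N = len(x)
yf = fft(x)                                    # complex spectrum
xf = fftfreq(N, 1 / fs)                        # frequency of each bin in Hz

# Single-sided amplitude spectrum
amplitude = 2.0 / N * np.abs(yf[:N // 2])
amplitude[0] /= 2                              # DC bin must not be doubled

# Dominant non-DC component
peak_bin = np.argmax(amplitude[1:]) + 1        # skip bin 0 when searching
peak_freq = xf[peak_bin]
peak_amp = amplitude[peak_bin]

print(f"DC offset estimate: {amplitude[0]:.2f}")
print(f"Dominant frequency: {peak_freq:.1f} Hz, amplitude {peak_amp:.2f}")
```

The estimates should land close to the values used to build the signal; the small deviations come from the added noise.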

---

### 2. Aggregation 

Aggregation reduces the data points within a sliding window to a single representative value. Common aggregation functions are:

- **Mean** - Average value over the window, which results in **smoothing**
- **Median** - Middle value when sorted  
- **Minimum / Maximum** - Extreme values in the window

#### Aggregation Implementation

**Step 1: Convert to pandas Series for window operations**
```python
import pandas as pd

# Convert the signal to a pandas Series
signal = pd.Series(original)
```
**Why:** Pandas provides efficient rolling window operations that are optimized for time series analysis.

**Step 2: Define window size**
```python
window_size = 50  # Number of samples in each aggregation window
```
**Important:** Choose window size based on your signal characteristics:
- **Smaller windows:** Preserve more detail, less smoothing
- **Larger windows:** More smoothing, lose fine details

**Step 3: Apply rolling aggregation functions**
```python
# Minimum aggregation
min_signal = signal.rolling(window=window_size, min_periods=1).min()

# Maximum aggregation (same pattern)
max_signal = signal.rolling(window=window_size, min_periods=1).max()

# Mean aggregation
mean_signal = signal.rolling(window=window_size, min_periods=1).mean()

# Median aggregation
median_signal = signal.rolling(window=window_size, min_periods=1).median()
```

**Code breakdown:**
- **`.rolling(window=window_size)`:** Creates sliding window of specified size
- **`min_periods=1`:** Ensures calculation even when fewer than `window_size` points available (useful at signal boundaries)
- **`.min()/.max()/.mean()/.median()`:** Applies the aggregation function to each window

**tip:** Rolling aggregation is perfect for noise reduction and trend identification. Use min/max to find envelope boundaries, mean for general smoothing, and median for robust smoothing that's less affected by outliers.
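One detail worth seeing in code is window alignment. By default, `rolling` uses a trailing window, which delays the smoothed signal; passing `center=True` aligns each window around its sample. The sketch below assumes the same kind of noisy sine as in Exercise 2 (seeded here for reproducibility) and compares both variants against the noise-free reference.

```python
import numpy as np
import pandas as pd

fs = 1000
t = np.linspace(0, 1, fs, endpoint=False)
rng = np.random.default_rng(0)                       # seeded (assumption)
reference = 2 * np.sin(2 * np.pi * 5 * t) + 0.5      # noise-free version
series = pd.Series(reference + 0.3 * rng.standard_normal(len(t)))

window_size = 50

# Envelope and smoothing with the default trailing window
min_signal = series.rolling(window=window_size, min_periods=1).min()
max_signal = series.rolling(window=window_size, min_periods=1).max()
mean_signal = series.rolling(window=window_size, min_periods=1).mean()

# Same mean, but with the window centered on each sample
centered_mean = series.rolling(window=window_size, center=True, min_periods=1).mean()

# RMS error against the noise-free reference
err_trailing = np.sqrt(((mean_signal - reference) ** 2).mean())
err_centered = np.sqrt(((centered_mean - reference) ** 2).mean())
print(f"trailing window RMS error: {err_trailing:.3f}")
print(f"centered window RMS error: {err_centered:.3f}")
```

The trailing mean lags the signal by roughly half a window, so the centered version tracks the reference more closely when a delay is not acceptable.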


---

### 4. Interpolation
Interpolation estimates values at points where data is missing or needs higher resolution.

- **Linear interpolation:** Connects points with straight lines  
- **Polynomial Interpolation:** Fits a polynomial through known data points  
- **Splines:** Piecewise polynomials with smooth transitions at the joins  
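The three methods can be compared on a tiny example. This is a minimal sketch using hypothetical sample points taken from `y = x**2`, with `np.interp`, `np.polyfit`, and `scipy.interpolate.CubicSpline` chosen here as one possible implementation of each method.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical known samples of y = x**2
x_known = np.array([0.0, 1.0, 2.0, 3.0])
y_known = x_known ** 2
x_new = 1.5                                   # point where we need an estimate

# Linear: straight line between the neighbours (1, 1) and (2, 4)
linear_est = np.interp(x_new, x_known, y_known)

# Polynomial: a degree-3 polynomial passes exactly through all 4 points
coeffs = np.polyfit(x_known, y_known, deg=len(x_known) - 1)
poly_est = np.polyval(coeffs, x_new)

# Spline: piecewise cubic with smooth joins
spline_est = float(CubicSpline(x_known, y_known)(x_new))

print("true value:", x_new ** 2)
print("linear:    ", linear_est)
print("polynomial:", poly_est)
print("spline:    ", spline_est)
```

On this curved data the linear estimate overshoots the true value, while the polynomial and spline estimates follow the curvature.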

---

### 5. Data Synchronization
When combining datasets sampled at different frequencies, synchronization aligns them on a common time axis. Techniques include **resampling**, **interpolation**, and **timestamp alignment**.

---

In [None]:
# synchronization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# -----------------------------
# 1. Create sample datasets (15 seconds)
# -----------------------------

# Time series 1: sampled every 1 second (15 points)
time1 = pd.date_range(start='2025-09-04 00:00:00', periods=15, freq='1s')
data1 = np.sin(np.linspace(0, 3*np.pi, 15))  # sine wave data
df1 = pd.DataFrame({'Time': time1, 'Value1': data1}).set_index('Time')

# Time series 2: sampled every 3 seconds (6 points)
time2 = pd.date_range(start='2025-09-04 00:00:00', periods=6, freq='3s')
data2 = np.cos(np.linspace(0, 3*np.pi, 6))  # cosine wave data
df2 = pd.DataFrame({'Time': time2, 'Value2': data2}).set_index('Time')

# -----------------------------
# 2. Visualize original data
# -----------------------------
plt.figure(figsize=(10, 4))
plt.plot(df1.index, df1['Value1'], 'o-', label='Value1 (1s freq)')
plt.plot(df2.index, df2['Value2'], 's-', label='Value2 (3s freq)')
plt.title('Original Time Series with Different Frequencies')
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend()
plt.show()

# -----------------------------
# 3. Synchronize data
# -----------------------------
# Resample df2 to 1-second frequency and interpolate
df2_sync = df2.resample('1s').interpolate(method='linear')

# Merge the datasets
df_sync = df1.join(df2_sync)

# -----------------------------
# 4. Visualize synchronized data
# -----------------------------
plt.figure(figsize=(10, 4))
plt.plot(df_sync.index, df_sync['Value1'], 'o-', label='Value1 (1s freq)')
plt.plot(df_sync.index, df_sync['Value2'], 's-', label='Value2 synchronized')
plt.title('Synchronized Time Series (1s frequency)')
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend()
plt.show()

# Handling Missing Values in Data



## Exercise 4: Handling missing values
Run the code cell under "Missing values dataset" below, which creates a dataset with several missing values (NaNs), and use the Missing values guide to perform the following tasks:


### 1. Explore Missing Values
- Check which columns or rows have missing values.


### 2. Implement Your Solution
- Apply deletion or imputation based on your strategy.


### 3. Verify
- Check that no missing values remain.
- Observe how your changes affected the dataset.

---

## Missing values guide 


In real-world datasets, we often encounter **missing values** (NaNs). Dealing with them is important because many machine learning algorithms cannot handle NaNs directly.
We typically handle missing values in **two main ways**: **deletion** and **imputation**.

## 1. Deletion

Deletion means removing the missing values from the dataset. There are several approaches:

- **Remove rows with missing values**:  
  ```python
  df.dropna(axis=0, inplace=True)
  ```
  This removes any row that contains at least one NaN.

- **Remove columns with missing values**:
  ```python
  df.dropna(axis=1, inplace=True)
  ```
  This removes any column that contains at least one NaN.

- **Conditional removal**: Remove rows if NaN appears in a specific column:
  ```python
  df.dropna(subset=['column_name'], inplace=True)
  ```

---

## 2. Imputation

Imputation means filling in missing values with reasonable estimates. Common methods include:

- **Fill with a constant value**:
  ```python
  df.fillna(0, inplace=True)  # or any constant
  ```

- **Fill with mean, median, or mode**:
  ```python
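  # Pick one of the following; assigning the result back avoids chained in-place modification
  df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
  df['column_name'] = df['column_name'].fillna(df['column_name'].median())
  df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0])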
  ```

- **Forward fill (propagate last valid value)**:
  ```python
  df.ffill(inplace=True)  # fillna(method='ffill') is deprecated in recent pandas versions
  ```

- **Backward fill (propagate next valid value)**:
  ```python
  df.bfill(inplace=True)  # fillna(method='bfill') is deprecated in recent pandas versions
  ```

- **Interpolation (linear, polynomial, etc.)**:
  ```python
  df['column_name'] = df['column_name'].interpolate(method='linear')
  ```

---

## 3. Checking for Missing Values

Before handling NaNs, we can check for them using:

- **Check if any NaNs exist**:
  ```python
  df.isna().any()
  ```

- **Count of NaNs per column**:
  ```python
  df.isna().sum()
  ```

- **Quick overview**:
  ```python
  df.info()
  ```

---

By choosing the appropriate deletion or imputation method, we can prepare the dataset for further analysis or machine learning tasks.
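As an illustration of how checking, deletion, and imputation fit together, here is a minimal sketch on a hypothetical toy DataFrame (deliberately not the exercise dataset). The choice of median for the numeric column and mode for the categorical column is one reasonable strategy, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with three kinds of gaps
toy = pd.DataFrame({
    "height": [1.70, np.nan, 1.82, 1.65],           # numeric, one NaN
    "city": ["Oslo", "Rome", None, "Rome"],         # categorical, one missing
    "notes": [np.nan, np.nan, np.nan, np.nan],      # entirely empty
})

# 1. Explore: count missing values per column
nan_counts = toy.isna().sum()
print(nan_counts)

# 2. Deletion: drop columns that contain no information at all
cleaned = toy.dropna(axis=1, how="all")

# 3. Imputation: median for numeric data, most frequent value for categories
cleaned = cleaned.assign(
    height=cleaned["height"].fillna(cleaned["height"].median()),
    city=cleaned["city"].fillna(cleaned["city"].mode()[0]),
)

# 4. Verify: no missing values should remain
print(cleaned.isna().sum().sum())
print(cleaned)
```

Dropping only fully empty columns (`how="all"`) keeps partially filled columns available for imputation, whereas the default `how="any"` would remove every column here.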

### Missing values dataset

In [None]:
# Import libraries
import pandas as pd
import numpy as np

# Generate larger sample data with Address column full of NaNs
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Hannah',
             'Ian', 'Jack', 'Karen', 'Liam', 'Mona', 'Nina', 'Oscar', 'Paula'],
    'Age': [25, np.nan, 30, 22, np.nan, 28, 35, 32, 29, np.nan, 40, 31, 27, np.nan, 33, 26],
    'Salary': [50000, 54000, np.nan, 48000, 52000, np.nan, 60000, 58000,
               51000, 53000, np.nan, 49000, 55000, 57000, np.nan, 50000],
    'Department': ['HR', 'IT', 'Finance', np.nan, 'IT', 'HR', 'Finance', np.nan,
                   'IT', 'Finance', 'HR', np.nan, 'Finance', 'HR', 'IT', np.nan],
    'Address': [np.nan]*16  # Entire column is NaN
}

df = pd.DataFrame(data)
df


<h2>Take Home: Reading Data, Exploring Data Types, Properties and Basic Statistics in pandas</h2>

**In the following instructions, you will learn techniques for working with pandas DataFrames. Go through them and make sure you understand what each function does, in preparation for the next lab session.**

We import two of the most fundamental libraries for data mining: [pandas](https://pandas.pydata.org/docs/index.html) for data manipulation and [numpy](https://numpy.org/) for numerical computations.

In [None]:
import pandas as pd
import numpy as np

---

### The `read_csv()` Function in pandas
We are using `pd.read_csv` to read a CSV file:
- `country_wise_latest.csv`: This contains the latest country-wise statistics for COVID-19.

Parameters used:
- `pd.read_csv(file_path)`: Reads the CSV files from the specified path.
- `sep=','`: Specifies the separator used to divide values in the file. The default value is a comma (`,`), which is used in standard CSV files like ours.

Since commas are the default separator in `pd.read_csv()`, there is no need to specify this parameter unless a different separator is used (e.g., a tab `\t` or semicolon `;`).
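To see the effect of `sep` without needing a file on disk, the sketch below reads two hypothetical in-memory strings through `io.StringIO`, which stands in for a file path here.

```python
import io
import pandas as pd

# Hypothetical data standing in for files on disk
comma_text = "country,cases\nA,10\nB,20\n"
semicolon_text = "country;cases\nA;10\nB;20\n"

# Default separator is a comma, so no sep argument is needed
df_comma = pd.read_csv(io.StringIO(comma_text))

# A different separator must be stated explicitly
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=';')

print(df_comma)
print(df_comma.equals(df_semicolon))
```

If the separator is wrong, pandas typically reads each line as a single column, which is a quick way to spot the problem in `head()`.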

In [None]:
# Reading the CSV file
df1 = pd.read_csv('country_wise_latest.csv')

---

### The `head()` Function in pandas
The `head()` function in pandas allows you to preview the first few rows of a DataFrame. By default, it shows the first 5 rows, but you can specify a different number of rows if needed.

**Syntax**: `df.head(n)`  
- `n` (optional): The number of rows to display. If not specified, the default is 5.

This method is useful for quickly inspecting the top rows of a dataset to understand its structure and contents.


In [None]:
# Display the first few rows of the dataset
df1.head()

---

### The `tail()` Function in pandas

The `tail()` function in pandas allows you to view the last few rows of a DataFrame. By default, it returns the last 5 rows, but you can specify the number of rows to display.

#### Syntax:
`df.tail(n)`

- **n** (optional): The number of rows to return. If not specified, the default is 5.

This function is useful for quickly inspecting the final portion of a dataset.


In [None]:
df1.tail()

---

### The `shape` Attribute in pandas

The `shape` attribute in pandas returns a tuple representing the dimensions of a DataFrame. Because it is an attribute rather than a method, it is accessed without parentheses (`df.shape`). It provides two values:
- The number of rows.
- The number of columns.

This is useful for quickly inspecting the size and structure of the dataset.

In [None]:
df1.shape

---

### The `info()` Function in pandas
The `info()` function in pandas provides a concise summary of a DataFrame, including:
- The total number of entries (rows) and columns.
- The names of all the columns and their data types.
- The number of non-null values in each column.
- Memory usage of the DataFrame.

This method is useful for quickly inspecting the structure and contents of a DataFrame.

Example:
`df.info()`

In [None]:
df1.info()

---

### The `type()` Function in Python
The built-in `type()` function in Python returns the type of a given object. It can be used to check the type of a variable, a DataFrame, or any other object.

For example:
- `type(df)` will return `<class 'pandas.core.frame.DataFrame'>`, confirming that `df` is a DataFrame.
- You can also use it on individual columns or any variable to determine their type.

Example:
`type(df['column_name'])`

In [None]:
print(type(df1['WHO Region'][0]))
type(df1['Country/Region'][0])

### Why Does `type()` Return Different Outputs?

In Jupyter notebooks, `type()` always returns the same type object, but it is displayed differently depending on how it's used:

- **With `print()`**: The full class information is displayed, like `<class 'str'>`.
  - Example: `print(type(df1['WHO Region'][0]))` will output `<class 'str'>`.
  
- **Without `print()`**: Jupyter displays the result of the last expression in a cell in a shortened form, just `str`.
  - Example: `type(df1['Country/Region'][0])` will display `str`.

To get consistent output, use `print()` to display the result of `type()`.

---

### The `describe()` Function in pandas

The `describe()` function in pandas generates descriptive statistics for numerical columns of a DataFrame, providing a summary of key statistical measures.

#### Key Outputs:
- **count**: The number of non-null (valid) entries in the column.
- **mean**: The average (mean) value of the data in the column.
- **std**: The standard deviation, a measure of how spread out the data is.
- **min**: The minimum value in the column.
- **25% (1st Quartile)**: The value below which 25% of the data falls.
- **50% (Median)**: The middle value of the data.
- **75% (3rd Quartile)**: The value below which 75% of the data falls.
- **max**: The maximum value in the column.


In [None]:
df1.describe()

### Explanation of `e+02` in `describe()`

The `e+02` notation seen in the `describe()` output refers to scientific notation, which is used to express very large or very small numbers in a compact form. 

In scientific notation, `e+02` means "multiply by \(10^2\)" (or 100). For example:
- `1.87e+02` is equivalent to `1.87 × 10^2 = 187`.
- `8.81e+04` is equivalent to `8.81 × 10^4 = 88,100`.

This notation is often used in pandas to make it easier to display large datasets in a more compact form, especially when numbers vary widely in magnitude.
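Python itself understands this notation, so the conversion can be checked directly. A small sketch with illustrative values:

```python
# Scientific notation is just another way to write a float literal
value = 1.87e+02
print(value)                       # 187.0

# Formatting a number in scientific vs fixed-point notation
number = 88130.0                   # illustrative value
scientific = f"{number:.2e}"
fixed = f"{number:.2f}"
print(scientific)                  # 8.81e+04
print(fixed)                       # 88130.00
```

Note that the two-decimal scientific form rounds the mantissa, so `8.81e+04` no longer shows the trailing `30`.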


---

### How to Bypass Scientific Notation

If you want to disable scientific notation and display the full numbers in pandas, you can adjust the pandas display settings using `pd.set_option()`. This allows you to format the output with a specific number of decimal places. 

For example:
```python
pd.set_option('display.float_format', '{:.2f}'.format)
```

The command `pd.set_option('display.float_format', '{:.2f}'.format)` modifies how floating-point numbers are displayed in pandas. It formats all floats to show 2 decimal places (`.2f`), ensuring consistent number presentation throughout the DataFrame without scientific notation.


In [None]:
pd.set_option('display.float_format', '{:.2f}'.format)

---

In [None]:
df1.describe()

### Explanation of `std (383318.66)` in the Confirmed Column

The **std (383318.66)** represents the **standard deviation** of the confirmed cases. It indicates the spread of the data around the mean. A high standard deviation means there is a wide range in the number of confirmed cases across locations/countries, with significant variation from the average.


### Explanation of Percentiles in the Confirmed Column

- **25% (1114.00)**:  
  This represents the **first quartile (Q1)**. It means that 25% of the locations/countries have confirmed cases **less than or equal to 1,114**, while 75% have more.

- **50% (5059.00)**:  
  This represents the **median (Q2)**, or the middle value of the dataset. Half of the locations/countries have confirmed cases **less than or equal to 5,059**, and the other half have more.

- **75% (40460.50)**:  
  This represents the **third quartile (Q3)**. It means that 75% of the locations/countries have confirmed cases **less than or equal to 40,460.50**, while 25% have more.

These percentiles help to understand how the confirmed cases are distributed across different locations/countries in the dataset.


---

### Mapping Statistics to Functions in pandas

When calculating basic statistics for a specific column in pandas, you can use the following functions:

- **Mean**: `<target_column>.mean()` — This calculates the average of the column.
- **Standard Deviation**: `<target_column>.std()` — This measures how much the values deviate from the mean.
- **Min**: `<target_column>.min()` — Returns the smallest value in the column.
- **Max**: `<target_column>.max()` — Returns the largest value in the column.
- **Quartiles**: 
  - **25th percentile (Q1)**: `<target_column>.quantile(0.25)` — The value below which 25% of the data falls.
  - **50th percentile (Median)**: `<target_column>.median()` or `<target_column>.quantile(0.50)` — The middle value of the data.
  - **75th percentile (Q3)**: `<target_column>.quantile(0.75)` — The value below which 75% of the data falls.

These functions allow you to compute individual statistics for any column in your dataset. For example, you can apply these to the **New recovered** column to explore its central tendencies and spread.

In [None]:
# Mean of the New recovered column
mean_new_recovered = df1['New recovered'].mean()

# Standard deviation of the New recovered column
std_new_recovered = df1['New recovered'].std()

# Minimum value of the New recovered column
min_new_recovered = df1['New recovered'].min()

# Maximum value of the New recovered column
max_new_recovered = df1['New recovered'].max()

# Quartiles
q1_new_recovered = df1['New recovered'].quantile(0.25)
q2_new_recovered = df1['New recovered'].median()
q3_new_recovered = df1['New recovered'].quantile(0.75)

# Display the results
print(f"Mean: {mean_new_recovered}")
print(f"Standard Deviation: {std_new_recovered}")
print(f"Minimum: {min_new_recovered}")
print(f"Maximum: {max_new_recovered}")
print(f"25% (Q1): {q1_new_recovered}")
print(f"50% (Median): {q2_new_recovered}")
print("75% (Q3): " + str(q3_new_recovered))

### Difference Between `print(f"...")` and `print("..." + str(...))`

1. **`print(f"50% (Median): {q2_new_recovered}")`**:
   - This is **f-string** formatting, introduced in Python 3.6. It allows you to directly embed variables within curly braces `{}` inside a string, making the code more concise and readable.

2. **`print("75% (Q3): " + str(q3_new_recovered))`**:
   - This uses string concatenation. The `str()` function converts the variable to a string before combining it with another string using the `+` operator.
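Both styles produce the same text; f-strings additionally accept a format specification after a colon. A short sketch with a hypothetical value:

```python
q3 = 1234.5  # hypothetical value for illustration

msg_fstring = f"75% (Q3): {q3}"
msg_concat = "75% (Q3): " + str(q3)
print(msg_fstring == msg_concat)        # True

# Format specification: thousands separator and two decimals
msg_formatted = f"75% (Q3): {q3:,.2f}"
print(msg_formatted)                    # 75% (Q3): 1,234.50
```

Concatenation requires the explicit `str()` call because `+` cannot join a string and a number directly.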

---

### The `isnull()` Function in pandas

The `isnull()` function in pandas (an alias of `isna()`) identifies missing (null or NaN) values in a DataFrame. It returns a DataFrame of the same shape, with `True` indicating missing values and `False` indicating non-missing values. This function is useful for detecting incomplete data, which is crucial for data cleaning and preprocessing. By understanding where data is missing, we can take appropriate steps, such as filling or removing those values, to ensure accurate analysis.


In [None]:
# Using isnull() to detect missing values
missing_values = df1.isnull()

# Display the DataFrame with True for missing values and False for non-missing values
print(missing_values)

### Output of `isnull()`

The `isnull()` function returns a DataFrame of the same shape, where each cell contains either `True` or `False`. 
- **`True`** indicates that the value in the corresponding cell is missing (null or NaN).
- **`False`** indicates that the value in the corresponding cell is not missing.

This allows us to identify which parts of the DataFrame have missing data, helping in the data cleaning process.
We can use it directly on a column too. For example:

In [None]:
df1['Country/Region'].isnull()

### Output Explanation of `isnull()`

In this output, `isnull()` is applied to the **Country/Region** column, returning `True` for missing values and `False` for non-missing values. However, only the first few and last few rows are displayed, with the middle rows omitted (indicated by `...`). As a result, while we can see that the first and last rows contain no missing values, the state of the rows in between is not immediately visible from this output.


### How to Get the Count of Missing Values per Column

When using `isnull()`, it returns `True` or `False` for each cell, but it doesn't summarize how many missing values exist per column. To address this, we use `isnull().sum()` to get the total number of missing values for each column. This gives a clearer breakdown of missing data, helping us assess data quality and decide how to handle missing values for each column.

In [None]:
df1.isnull().sum()

### Explanation of `df1.isnull().sum()`

The command `df1.isnull().sum()` counts the missing (null) values in the DataFrame `df1`. 

- **`isnull()`**: Identifies whether each element is missing (`True` for missing, `False` for non-missing).
- **`sum()`**: Adds up the number of `True` values (which represent null values) for each column.

The result is the total count of missing values for each column in the DataFrame.


---

### The `duplicated()` Function in pandas

The `duplicated()` function in pandas identifies duplicate rows in a DataFrame or Series. It returns a boolean Series where:
- **`True`** indicates that the row is a duplicate (i.e., it has appeared before).
- **`False`** indicates that the row is unique.

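By default, it checks for duplicates across all columns and treats the first occurrence as the original, marking only later repeats as `True`. You can restrict the check to specific columns with `subset`, or change which occurrence is left unmarked with `keep`.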

This function is useful for identifying and handling repeated data entries.

In [None]:
df1.duplicated()

### Common Issue with `duplicated()` Output

Similar to `isnull()`, the `duplicated()` function returns a Series of `True` or `False` values for each row, indicating whether it's a duplicate. However, this output alone doesn't tell us how many duplicate rows are present in total. To solve this, we can use `duplicated().sum()` to get the total count of duplicate rows in the dataset.


In [None]:
df1.duplicated().sum()

### Explanation of `df1.duplicated().sum()`

Running `df1.duplicated().sum()` provides the total count of duplicated rows in the DataFrame. It identifies duplicate rows and then sums them to give the number of repeated entries, offering a quick overview of data redundancy in the dataset.
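The effect of the `subset` argument, and the follow-up step of removing duplicates, can be seen on a small example. The toy DataFrame below is hypothetical and only serves to illustrate the calls.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0 exactly
toy = pd.DataFrame({
    "country": ["A", "B", "A", "A"],
    "cases": [1, 2, 1, 3],
})

# Duplicates across all columns: only the exact repeat is flagged
full_dupes = toy.duplicated()
print(full_dupes.sum())                 # 1

# Duplicates judged by one column only: every later "A" is flagged
by_country = toy.duplicated(subset=["country"])
print(by_country.sum())                 # 2

# Remove exact duplicate rows, keeping the first occurrence
deduped = toy.drop_duplicates()
print(len(deduped))                     # 3
```

Whether repeated values in a single column count as duplicates depends on the data: repeated country names are expected in a day-by-day dataset but not in a one-row-per-country summary.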


---

<h2>Let's load another csv file</h2>

In [None]:
df2 = pd.read_csv('covid_19_clean_complete.csv')

In [None]:
# Display the first few rows of the second dataset
df2.head()

### Purpose of `fillna()` in pandas

The `fillna()` function in pandas is used to replace missing values (NaN) in a DataFrame or Series with a specified value. This is essential for handling incomplete data, as it allows you to fill gaps in the dataset with meaningful or placeholder values (e.g., a space `' '`, zero, or a specific string). Filling NaN values helps avoid errors in further analysis or calculations by ensuring that no missing data is left untreated.

In [None]:
df2.isnull().sum()

In [None]:
df2['Province/State'] = df2['Province/State'].fillna(' ')

In [None]:
df2.head()

In [None]:
df2['Province/State'][0]

### Explanation of `fillna()` command

In this command, `df2['Province/State'] = df2['Province/State'].fillna(' ')`, we are replacing all missing values (NaN) in the **Province/State** column with a single space `' '`.

- **`' '`**: This fills any missing values with a space, ensuring that there are no NaN values left in the column.
- **Assignment back to `df2['Province/State']`**: `fillna()` returns a new Series and leaves the original unchanged, so the result is assigned back to the column to update the DataFrame.

This helps to clean missing values in the **Province/State** column.
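
The effect can be checked on a small made-up DataFrame. This is only a sketch: the `toy` data below is hypothetical and simply reuses the column names from the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing province names.
toy = pd.DataFrame({
    "Province/State": [np.nan, "Ontario", np.nan],
    "Country/Region": ["Afghanistan", "Canada", "Albania"],
})

missing_before = int(toy["Province/State"].isnull().sum())

# fillna() returns a new Series; assigning it back updates the DataFrame.
toy["Province/State"] = toy["Province/State"].fillna(" ")

missing_after = int(toy["Province/State"].isnull().sum())

print(missing_before, missing_after)
```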


---

<h4>Let's load another CSV file</h4>

In [None]:
df3 = pd.read_csv('country_wise_latest.csv')

In [None]:
df3.head()

---

### The `columns` attribute in pandas

`columns` is an attribute of a DataFrame, not a function, so it is written without parentheses. It returns an Index object containing the column names in the order they appear in the DataFrame. This is useful for quickly inspecting the structure of the DataFrame, accessing specific columns, or renaming them if needed.


In [None]:
df3.columns

In [None]:
df2.columns
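
As a small illustration of inspecting and renaming column labels, here is a sketch on a hypothetical one-row DataFrame (the `toy` data and the new name `country` are made up for this example):

```python
import pandas as pd

# Hypothetical toy data reusing two column names from the lab datasets.
toy = pd.DataFrame({"Country/Region": ["Germany"], "Confirmed": [10]})

labels = toy.columns             # an Index object; note there are no parentheses
label_list = list(toy.columns)   # plain Python list of the column names

# rename() returns a new DataFrame and leaves `toy` unchanged.
renamed = toy.rename(columns={"Country/Region": "country"})

print(label_list)
print(list(renamed.columns))
```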

### Visual Comparison of Common Columns

As we can see, the two datasets share some columns. Identifying them will allow us to merge or compare the datasets more effectively.
<br />Let's find out exactly which columns both datasets share:

In [None]:
# Find the columns shared by the two datasets we are analysing
df2_columns = set(df2.columns)
df3_columns = set(df3.columns)
common_columns = df2_columns & df3_columns
print(common_columns)

### Explanation of the Code

In this code, we are comparing the columns of two datasets (`df2` and `df3`) to find the common columns:

1. **`set(df2.columns)` and `set(df3.columns)`**: Convert the columns of each DataFrame into a set. A set is an unordered collection of unique elements, allowing us to easily perform set operations like finding common elements.
   
2. **`&` (set intersection operator)**: Applied to two sets, this operator returns their intersection, meaning the column names that appear in both DataFrames.

The result, `common_columns`, will display the shared columns between `df2` and `df3`.
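
Because a set is unordered, the printed names may appear in any order. If the original column order matters, one alternative is a list comprehension. The sketch below uses two hypothetical empty DataFrames, `left` and `right`, with made-up column lists:

```python
import pandas as pd

# Hypothetical DataFrames that only carry column labels.
left = pd.DataFrame(columns=["Country/Region", "Date", "Confirmed", "Deaths"])
right = pd.DataFrame(columns=["Country/Region", "Confirmed", "Deaths", "New cases"])

# Set intersection: the same approach as above, order not guaranteed.
common_set = set(left.columns) & set(right.columns)

# List comprehension: keeps the order in which the columns appear in `left`.
common_ordered = [col for col in left.columns if col in right.columns]

print(common_set)
print(common_ordered)
```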


### Merging the Two DataFrames

Now that we've identified the common columns between the two datasets (`Confirmed`, `WHO Region`, `Recovered`, `Deaths`, `Active`, and `Country/Region`), we can proceed to merge them. By using these shared columns as the merge keys, we can combine the data into a single DataFrame for a more comprehensive analysis. Each common column appears only once in the merged result.

In [None]:
common = list(common_columns)
merged = pd.merge(df2, df3, on=common, how='inner')
merged.head()

In [None]:
merged.columns

### Explanation of Converting a Set to a List

In the code `common = list(common_columns)`, we convert `common_columns`, which is a set of column names shared between the two DataFrames, into a list. The `on` parameter of `pd.merge()` is documented as accepting a label or a list of labels, so we pass a list rather than a set. Note that a set is unordered, so the order of the names in the resulting list is arbitrary; this does not affect which rows are matched.

### Explanation of the `merge()` Function

In the code `merged = pd.merge(df2, df3, on=common, how='inner')`, the `pd.merge()` function merges two DataFrames (`df2` and `df3`) based on the common columns:

- **`on=common`**: Specifies the columns used for the merge.
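- **`how='inner'`**: An inner join includes only rows where the values in the common columns match in both DataFrames. Because `common` holds several columns, two rows match only when their values agree in every one of those columns. Rows without matches in either DataFrame are excluded.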

For example, if a <u><i>country value</i></u> (e.g. Germany) in the `Country/Region` column is present in `df2` but not in `df3`, <u><i>that row</i></u> will be excluded from the final merged DataFrame.

### Explanation of Other `how` Options in `merge()`

In addition to `how='inner'`, there are other join options for merging DataFrames:

- **`how='outer'`**: Performs an outer join, keeping all rows from both DataFrames. Missing values are filled with NaN where there is no match.
- **`how='left'`**: Performs a left join, keeping all rows from the left DataFrame and adding matching rows from the right. The left DataFrame is the first DataFrame passed to `pd.merge()`, while the right DataFrame is the second one.
- **`how='right'`**: Performs a right join, keeping all rows from the right DataFrame and adding matching rows from the left. The right DataFrame is the second one passed to `pd.merge()`.
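
The four options can be compared side by side on two small made-up DataFrames. This is a sketch only: `left` and `right` are hypothetical, and they are merged on a single key column to keep the example short.

```python
import pandas as pd

# Hypothetical toy data: France and Italy appear in both DataFrames.
left = pd.DataFrame({
    "Country/Region": ["Germany", "France", "Italy"],
    "Deaths": [1, 2, 3],
})
right = pd.DataFrame({
    "Country/Region": ["France", "Italy", "Spain"],
    "WHO Region": ["Europe", "Europe", "Europe"],
})

# Number of rows produced by each join type.
row_counts = {
    how: len(pd.merge(left, right, on="Country/Region", how=how))
    for how in ("inner", "outer", "left", "right")
}
print(row_counts)

# In the outer join, Germany has no match in `right`, so its WHO Region is NaN.
outer = pd.merge(left, right, on="Country/Region", how="outer")
germany = outer[outer["Country/Region"] == "Germany"]
print(germany["WHO Region"].isnull().all())
```

The inner join keeps only the two shared countries, the left and right joins keep three rows each, and the outer join keeps all four countries.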

For more information, refer to the [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html).


# The End