(1) 1.1. To identify missing values, use the isna() or isnull() methods (isnull() is an alias of isna()). Both return a DataFrame or Series of the same shape, with True for missing values and False for non-missing values. To handle missing values, either drop the rows or columns that contain them with the dropna() method, or impute (fill) them with the fillna() method.

1.2. Imputation is the process of replacing missing values with estimates derived from the data, such as the column mean, median, or mode. It is useful because removing rows with missing values can discard important information, especially when the missing values are spread across many rows.

In [None]:
import pandas as pd
import numpy as np

exam_data = {
    'name': ['Anastasia', 'Dima', 'Katherine', 'James', None],  
    'score': [12.5, 9, 16.5, np.nan, np.nan],  
    'attempts': [1, 3, 2, 3, 2],  
    'qualify': [None, 'no', 'yes', 'no', 'no']  
}

labels = ['a', 'b', 'c', 'd', 'e']

df = pd.DataFrame(exam_data, index=labels)
print("Original DataFrame:")
print(df)

print("\nIdentifying missing values:")
print(df.isna())

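# Numeric column: fill with the column mean
df['score'] = df['score'].fillna(df['score'].mean())
# Categorical column: fill with the most frequent value (mode)
df['qualify'] = df['qualify'].fillna(df['qualify'].mode()[0])
# Names cannot be estimated, so use a placeholder
df['name'] = df['name'].fillna('Unknown')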

print("\nDataFrame after Imputation:")
print(df)
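
Section 1.1 also mentions dropna(), which the cell above does not exercise. The following is a minimal sketch on the same exam data; the names rows_dropped, score_required, and cols_dropped are illustrative only.

```python
import pandas as pd
import numpy as np

# Same exam data as above, rebuilt so this cell runs on its own
exam_data = {
    'name': ['Anastasia', 'Dima', 'Katherine', 'James', None],
    'score': [12.5, 9, 16.5, np.nan, np.nan],
    'attempts': [1, 3, 2, 3, 2],
    'qualify': [None, 'no', 'yes', 'no', 'no']
}
df = pd.DataFrame(exam_data, index=['a', 'b', 'c', 'd', 'e'])

# Count missing values per column
print(df.isna().sum())

# Drop every row that contains at least one missing value
rows_dropped = df.dropna()

# Drop rows only when 'score' is missing
score_required = df.dropna(subset=['score'])

# Drop every column that contains at least one missing value
cols_dropped = df.dropna(axis=1)

print(rows_dropped)
print(score_required)
print(cols_dropped)
```

On this data only rows b and c are complete, so a plain dropna() discards three of the five rows. That is the kind of information loss that motivates imputation in 1.2.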

(2) 2.1. Categorical variables are non-numeric variables that take on a limited, fixed number of values. There are several ways to encode categorical variables in a Pandas DataFrame:
-Label Encoding: each category is mapped to a unique integer. The integers imply an order, so this suits ordinal categories best.
-One-Hot Encoding: each unique category value is converted into a new binary (0 or 1) column that indicates whether that category is present.

2.2. One-hot encoding is a technique for converting a categorical variable into binary vectors. Each category in the original variable becomes a separate binary column, and each row has a 1 in the column for its own category and 0 in all the others. It is most useful for nominal categorical variables (no intrinsic order) and when preparing data for algorithms that require numeric input, such as linear regression, logistic regression, or neural networks.

In [None]:
import pandas as pd

data = {
    'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'John'],
    'color': ['red', 'blue', 'green', 'blue', 'red']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

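# Label encoding: factorize() assigns integer codes in order of first appearance
df['color_encoded'] = pd.factorize(df['color'])[0]
print("\nDataFrame with Label Encoding:")
print(df)

# One-hot encoding: dtype=int yields 0/1 columns (pandas 2.x defaults to booleans)
df_one_hot = pd.get_dummies(df, columns=['color'], prefix=['color'], dtype=int)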
print("\nDataFrame with One-Hot Encoding:")
print(df_one_hot)
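
To make the two encodings concrete, here is a small sketch on the color column alone. It assumes the same values as above; codes, uniques, and one_hot are illustrative names. dtype=int is passed so that the dummy columns hold 0/1, since recent pandas versions default to booleans.

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'green', 'blue', 'red'], name='color')

# factorize() returns the integer codes and the categories they stand for
codes, uniques = pd.factorize(colors)
print(dict(enumerate(uniques)))   # code -> category mapping

# One-hot encode the same column; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(colors, prefix='color', dtype=int)
print(one_hot)

# Each row has exactly one 1, in the column of its own category
print(one_hot.sum(axis=1))
```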

(3) 3.1. In Pandas, duplicate rows can be identified using duplicated() and removed using drop_duplicates(). 

3.2. duplicated() identifies duplicate rows but does not remove them; it returns a boolean Series in which True marks a row that duplicates an earlier row.
drop_duplicates() removes duplicate rows and returns a new DataFrame; by default, it keeps the first occurrence and drops the later duplicates.

In [None]:
import pandas as pd

data = {
    'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'James', 'Dima'],
    'score': [12.5, 9, 16.5, 10, 10, 9],
    'attempts': [1, 3, 2, 3, 3, 3]
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

print("\nIdentifying duplicate rows using duplicated():")
print(df.duplicated())  

df_no_duplicates = df.drop_duplicates()

print("\nDataFrame after removing duplicates with drop_duplicates():")
print(df_no_duplicates)
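
The default behavior described in 3.2 can be changed with the keep and subset parameters of drop_duplicates(). A minimal sketch on the same data (the result names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'James', 'Dima'],
    'score': [12.5, 9, 16.5, 10, 10, 9],
    'attempts': [1, 3, 2, 3, 3, 3]
})

# keep='last' keeps the last occurrence of each duplicated row
keep_last = df.drop_duplicates(keep='last')

# keep=False drops every row that has a duplicate, keeping none of them
keep_none = df.drop_duplicates(keep=False)

# subset restricts the comparison to the listed columns
unique_names = df.drop_duplicates(subset=['name'])

print(keep_last)
print(keep_none)
print(unique_names)
```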

(4) 4.1. Feature scaling puts all features on a comparable scale, so that features with large numeric ranges do not dominate those with small ranges. This matters most for distance-based methods (e.g., k-NN) and for models trained with gradient-based optimization, where scaling often improves performance and speeds up convergence.

4.2. Min-max scaling, also known as normalization, rescales values to a fixed range, usually [0,1] (or sometimes [-1,1]). Formula: x' = (x - x_min) / (x_max - x_min). The smallest value becomes 0, the largest becomes 1, and the other values are scaled proportionally.
Because it is a linear rescaling, it preserves the relative distances between values and the shape of the distribution. It is sensitive to outliers, since the minimum and maximum define the scale. Use Min-Max Scaling when: the data has known, fixed bounds or a bounded output range is required; the algorithm does not assume a normal distribution; the relationships between the original values need to be preserved.

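Z-score normalization, also known as standardization, transforms data so that it has a mean of 0 and a standard deviation of 1. Formula: z = (x - mean) / std. It centers the data around 0, which makes it easier to compare features measured on different scales. Like min-max scaling, it is a linear transformation, so it does not change the shape of the distribution. Use Z-Score Normalization when: the features have different units or scales; the algorithm assumes centered or approximately normally distributed features; outliers are present (it is less distorted by them than min-max scaling, although the mean and standard deviation are still affected).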

In [None]:
import pandas as pd

data = {
    'age': [22, 25, 30, 35, 40],
    'salary': [30000, 45000, 50000, 65000, 70000],
    'score': [88, 92, 85, 90, 87]
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Min-Max Scaling:
def min_max_scaling(df):
    return (df - df.min()) / (df.max() - df.min())

df_min_max_scaled = min_max_scaling(df)

print("\nDataFrame after Min-Max Scaling:")
print(df_min_max_scaled)

# Z-Score Normalization:
# DataFrame.std() uses the sample standard deviation (ddof=1) by default
def z_score_normalization(df):
    return (df - df.mean()) / df.std()

df_z_score_scaled = z_score_normalization(df)

print("\nDataFrame after Z-Score Normalization:")
print(df_z_score_scaled)
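
As a sanity check on the two formulas, the sketch below recomputes both scalings and confirms the properties stated in 4.2. It assumes the same data as above; the names min_max and z_score are illustrative. DataFrame.std() defaults to the sample standard deviation (ddof=1), so the check uses the same convention.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [22, 25, 30, 35, 40],
    'salary': [30000, 45000, 50000, 65000, 70000],
    'score': [88, 92, 85, 90, 87]
})

min_max = (df - df.min()) / (df.max() - df.min())
z_score = (df - df.mean()) / df.std()   # sample std, ddof=1

# Min-max: every column now spans exactly [0, 1]
print(min_max.min().tolist(), min_max.max().tolist())

# Z-score: every column has mean ~0 and standard deviation ~1
print(np.allclose(z_score.mean(), 0), np.allclose(z_score.std(), 1))

# Both are linear rescalings, so the ordering of values within a column is unchanged
print((min_max.rank() == df.rank()).all().all())
```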

(5) 5.1. Outliers are data points that differ significantly from the rest of the dataset. They may arise from measurement errors, rare events, or natural variation in the data. Handling them matters because:
-They distort statistics such as the mean and standard deviation.
-They affect machine learning models that are sensitive to extreme values or that rely on assumptions about the data distribution (e.g., linear regression, k-NN).
-They can lead a model to fit noise, which reduces accuracy on new data.

5.2. Methods to detect outliers:
-Using the IQR (Interquartile Range) method
With IQR = Q3 - Q1, values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered outliers.
-Using the Z-score (standard deviation) method
Values that are more than 3 standard deviations away from the mean (|z| > 3) are considered outliers.

5.3. Handling outliers in a continuous numerical variable:
-Removing outliers (appropriate when they result from data entry or measurement errors)
-Replacing outliers (instead of removing them, cap each outlier at the nearest threshold, e.g. the lower or upper IQR bound)

In [None]:
import pandas as pd

data = pd.DataFrame({'Value': [10, 12, 100, 15, -50, 20]})

print("Original DataFrame:\n", data)

# IQR fences: anything below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is an outlier
Q1 = data['Value'].quantile(0.25)
Q3 = data['Value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = data[(data['Value'] < lower_bound) | (data['Value'] > upper_bound)]
print("\nOutliers detected:\n", outliers)

cleaned_data = data[(data['Value'] >= lower_bound) & (data['Value'] <= upper_bound)]
print("\nData after removing outliers:\n", cleaned_data)

# Replace outliers by capping (clipping) them at the IQR fences
replaced_data = data.copy()
replaced_data['Value'] = replaced_data['Value'].clip(lower=lower_bound, upper=upper_bound)
print("\nData with replaced outliers:\n", replaced_data)
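
Section 5.2 also lists the Z-score method, which the cell above does not show. The sketch below uses a hypothetical, larger sample that is not part of the original data: with only a handful of points, a single extreme value inflates the standard deviation so much that |z| > 3 is hard to reach. The threshold of 3 is the one named in 5.2.

```python
import pandas as pd

# Hypothetical sample: 20 typical values plus one extreme value
data = pd.DataFrame({'Value': [10, 12, 11, 13] * 5 + [300]})

mean = data['Value'].mean()
std = data['Value'].std()   # sample standard deviation (ddof=1)

# Z-score: distance from the mean in units of standard deviation
z_scores = (data['Value'] - mean) / std

threshold = 3
outliers = data[z_scores.abs() > threshold]
cleaned_data = data[z_scores.abs() <= threshold]

print("Outliers detected:\n", outliers)
print("Rows remaining:", len(cleaned_data))
```

On the six-value data in the cell above, no point reaches |z| > 3 even though the IQR method flags two values, so the two methods can disagree on small samples.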