# **Data Science Buildables Fellowship**

### **Task 1: Handling Missing Data – Titanic Dataset**

##### **Task Overview**

The goal of this task is to identify and handle missing values in the Titanic dataset. We'll specifically address missing data in the 'Age', 'Cabin', and 'Embarked' columns, which are common challenges in this dataset. The primary objective is to prepare the data for use in a machine learning model by ensuring there are no missing entries.

##### **Chosen Technique with Reason**

**Median Imputation for 'Age':** The 'Age' column has missing values and is a numerical feature. Using median imputation is a robust choice because the median is less sensitive to outliers than the mean. This prevents extreme age values from skewing the imputed data.

**Mode Imputation for 'Embarked':** The 'Embarked' column is a categorical feature. Mode imputation fills missing values with the most frequently occurring category. This keeps the column categorical and, when only a few values are missing, changes the category distribution only slightly.

**Dropping the 'Cabin' Column:** The 'Cabin' column has a large percentage of missing values (over 70%). Imputing this amount of data would introduce significant noise and could be misleading. Therefore, the most pragmatic solution is to drop the entire column.
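
To illustrate why the median is the safer fill value for a numerical column, here is a minimal sketch on made-up data (not the Titanic file). The names `ages` and `ports` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file.
ages = pd.Series([22.0, 25.0, 30.0, np.nan, 80.0])
ports = pd.Series(["S", "C", "S", None, "Q"])

# The single extreme age pulls the mean up, while the median stays central.
mean_age = ages.mean()      # mean of the four observed values
median_age = ages.median()  # median of the four observed values

# Assign the result back rather than calling fillna(..., inplace=True) on a column.
ages_filled = ages.fillna(median_age)

# mode() returns a Series (ties are possible), so take the first entry.
ports_filled = ports.fillna(ports.mode()[0])

print(mean_age, median_age)
print(ages_filled.tolist())
print(ports_filled.tolist())
```

In this toy example the mean (39.25) is pulled well above the median (27.5) by the single value of 80, which is the effect median imputation is meant to avoid.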

In [None]:
import pandas as pd
import numpy as np

# Load the Titanic dataset
df = pd.read_csv(r'C:\Users\ilaib\Downloads\test.csv')


# --- Step 1: Handling 'Cabin' column ---
# Drop the 'Cabin' column due to a high percentage of missing values.
df.drop('Cabin', axis=1, inplace=True)

# --- Step 2: Handling 'Age' column ---
# Calculate the median age.
median_age = df['Age'].median()

# Impute missing 'Age' values with the median.
df['Age'] = df['Age'].fillna(median_age)

# --- Step 3: Handling 'Embarked' column ---
# Calculate the mode of the 'Embarked' column.
mode_embarked = df['Embarked'].mode()[0]

# Impute missing 'Embarked' values with the mode.
# Assigning the result back avoids the chained-assignment FutureWarning
# that df[col].fillna(..., inplace=True) triggers in recent pandas versions.
df['Embarked'] = df['Embarked'].fillna(mode_embarked)

# Verify that there are no more missing values in the preprocessed columns.
print("Missing values after preprocessing:")
print(df[['Age', 'Embarked']].isnull().sum())

Missing values after preprocessing:
Age         0
Embarked    0
dtype: int64


##### **Results and Observation**

After applying the preprocessing steps, the 'Age' and 'Embarked' columns no longer contain any missing values. The 'Cabin' column has been successfully dropped from the DataFrame. The chosen techniques effectively addressed the missing data problem for each specific data type (numerical, categorical, and a column with too many missing entries), preparing the dataset for further analysis or machine learning model training.

### **Task 2: Encoding Categorical Variables – Car Evaluation Dataset**

##### **Task Overview:**

The Car Evaluation Dataset from the UCI Machine Learning Repository contains six categorical input features: buying, maint, doors, persons, lug_boot, and safety. Machine learning algorithms typically require numerical input, so these features must be encoded. This task will demonstrate and compare two primary encoding techniques.

##### **Chosen Technique with Reason**

**One-Hot Encoding:** This technique creates a new binary column for each unique category in a feature. For example, the safety feature has categories low, med, and high. One-hot encoding will create three new columns (safety_low, safety_med, safety_high), with a value of 1 in the column corresponding to the car's safety rating and 0 in the others. This is the preferred method for nominal data (categories without a natural order) because it avoids implying an arbitrary ordinal relationship that a model might misinterpret.

**Label Encoding:** This technique assigns a unique integer to each category (e.g., low=0, med=1, high=2). While straightforward, this method can be problematic for nominal data because it can introduce a false sense of order or magnitude, which can negatively impact the performance of algorithms that are sensitive to such relationships (e.g., linear models or k-NN). It is a valid choice for ordinal data (categories with a meaningful order), provided the integers are assigned in that order, and it is useful for comparison here.
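
The difference between the encodings can be sketched on a tiny made-up column. The example below uses only pandas; the alphabetical codes mirror how scikit-learn's `LabelEncoder` orders its classes, and the `safety` values are an illustrative sample rather than the real dataset.

```python
import pandas as pd

# Hypothetical sample of the 'safety' feature.
safety = pd.Series(["low", "med", "high", "med"], name="safety")

# Alphabetical integer codes (high=0, low=1, med=2), as LabelEncoder would assign.
alphabetical_codes = safety.astype("category").cat.codes.tolist()

# Explicit ordinal mapping that respects low < med < high.
ordered = pd.Categorical(safety, categories=["low", "med", "high"], ordered=True)
ordinal_codes = ordered.codes.tolist()

# One-hot encoding: one indicator column per category, no order implied.
one_hot = pd.get_dummies(safety, prefix="safety", dtype=int)

print(alphabetical_codes)
print(ordinal_codes)
print(one_hot.columns.tolist())
```

Note that the alphabetical codes place `high` below `low`, so integer codes are only meaningful when the category order is specified explicitly.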

In [6]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Define column names as the data file has no header
column_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

# Load the dataset
try:
    df = pd.read_csv(r'C:\Users\ilaib\Downloads\car.data', names=column_names)
except FileNotFoundError:
    print("Please download the 'car.data' file from the UCI repository and update the path above.")
    # You can also load it directly from the URL if needed, but a local file is generally more reliable.
    # Re-raise so the cells below do not silently reuse a DataFrame from an earlier task.
    raise

# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns

# --- Step 1: One-Hot Encoding ---
# Create a copy of the original DataFrame for one-hot encoding
df_one_hot = df.copy()

# Initialize the OneHotEncoder
one_hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fit and transform the categorical data
encoded_features = one_hot_encoder.fit_transform(df_one_hot[categorical_cols])

# Convert the encoded features back to a DataFrame and add appropriate column names
encoded_df = pd.DataFrame(encoded_features, columns=one_hot_encoder.get_feature_names_out(categorical_cols))

# Drop the original categorical columns and concatenate the new one-hot encoded columns
df_one_hot = pd.concat([df_one_hot.drop(categorical_cols, axis=1), encoded_df], axis=1)

# --- Step 2: Label Encoding ---
# Create a fresh copy of the original DataFrame for label encoding
df_label = df.copy()

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Apply Label Encoding to each categorical column
for col in categorical_cols:
    df_label[col] = label_encoder.fit_transform(df_label[col])

##### **Results and observations**

**One-Hot Encoded DataFrame (df_one_hot):**

The number of columns has increased significantly from the original 7. Because every column in the file is read as text, categorical_cols also includes the target column class, so all seven columns are expanded into binary columns, one per category. For example, buying (4 unique values) is now represented by buying_high, buying_low, buying_med, and buying_vhigh.

This representation consists mostly of zeros but avoids implying any ordinal relationship between the categories. For nominal categorical data, it is usually the safest choice for most machine learning models.

**Label Encoded DataFrame (df_label):**

The number of columns remains the same as the original dataset. Each categorical feature is now represented by a single column of integers.

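For example, the safety column, which had the values low, med, and high, is now represented by the integers 0, 1, and 2. Because LabelEncoder numbers categories in alphabetical order, the mapping is high=0, low=1, med=2, which does not match the natural order low < med < high. While this representation saves space, algorithms that treat the integers as magnitudes will see a misleading order and distance between categories. LabelEncoder is also intended for target labels rather than input features; for ordinal features like these, an explicit mapping or scikit-learn's OrdinalEncoder with a specified category order is the safer way to use integer codes.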

### **Task 3: Feature Scaling – Wine Quality Dataset**

##### **Task Overview**
The Wine Quality dataset contains various physicochemical features (e.g., fixed acidity, alcohol, pH) that are measured in different units and have different ranges. To ensure that each feature contributes equally to a machine learning model, we will apply feature scaling. This task will apply both normalization and standardization and analyze how each technique affects the data's distribution.

##### **Chosen Technique with Reason**

**Standardization (Z-score scaling):** This method transforms the data to have a mean of 0 and a standard deviation of 1. It is especially useful for algorithms that are sensitive to feature scale or variance, such as SVMs, regularized linear regression, and k-NN. Standardization is not robust to outliers, since the mean and standard deviation are themselves affected by extreme values, but it does not squeeze the remaining data into a narrow band the way min-max scaling can. It is a common default for many machine learning models.

**Normalization (Min-Max scaling):** This technique scales the features to a specific range, typically between 0 and 1. It is useful when the data does not follow a Gaussian distribution or when you need a fixed range for a specific algorithm, such as neural networks. The main effect is to compress the data into a bounded interval. We will use this to show how it differs from standardization, which does not bound the data to a specific range.
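
Both formulas are short enough to write out by hand. The sketch below applies them with NumPy to a made-up feature that contains one extreme value; `x` is a hypothetical array, and the population standard deviation (`ddof=0`) is used, which is also what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical feature with one extreme value at the end.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Standardization: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

# Min-max normalization: map the minimum to 0 and the maximum to 1.
x_minmax = (x - x.min()) / (x.max() - x.min())

print(z.round(3))
print(x_minmax.round(3))
```

In the min-max result the four ordinary values all end up below 0.05 because the range is set entirely by the outlier.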

In [7]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Load the red wine quality dataset
try:
    df = pd.read_csv(r'C:\Users\ilaib\Downloads\winequality-red.csv', sep=';')
except FileNotFoundError:
    print("Please download 'winequality-red.csv' from the UCI repository and update the path above.")
    # Re-raise so the cells below do not silently reuse a DataFrame from an earlier task.
    raise

# Separate the features from the target variable
features = df.drop('quality', axis=1)

# --- Step 1: Apply Standardization ---
# Initialize the StandardScaler
scaler_standard = StandardScaler()

# Fit and transform the features
df_standardized = scaler_standard.fit_transform(features)

# Convert the NumPy array back to a DataFrame for easier analysis
df_standardized = pd.DataFrame(df_standardized, columns=features.columns)

# --- Step 2: Apply Normalization ---
# Initialize the MinMaxScaler
scaler_minmax = MinMaxScaler()

# Fit and transform the features
df_normalized = scaler_minmax.fit_transform(features)

# Convert the NumPy array back to a DataFrame
df_normalized = pd.DataFrame(df_normalized, columns=features.columns)

##### **Results & Observations**
**Standardization:** After standardization, the mean of each feature will be 0 and the standard deviation will be 1. The original distribution shape is preserved, but the data is centered and rescaled.

**Normalization:** After normalization, each feature is scaled to a range between 0 and 1. The original distribution shape is also preserved, but the range of the data is compressed.

**Analysis:** When you compare the .describe() output of both df_standardized and df_normalized with the original features DataFrame, you will see a clear change in the mean, standard deviation, and value ranges for all columns. Normalization bounds the data to [0, 1], while standardization gives every feature a mean of 0 and unit variance. Because min-max scaling depends only on the two most extreme values, a single outlier can compress the rest of a feature into a narrow band; standardization is usually less distorted in that case, although it is not immune to outliers either.

### **Task 4: Handling Outliers – Boston Housing Dataset**

##### **Task Overview**
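The objective is to identify and manage outliers in housing data. The task is defined for the Boston Housing Dataset, but because that dataset's loader is no longer available in scikit-learn, the code below uses the California Housing dataset instead. We will use a robust statistical method, the Interquartile Range (IQR), to detect outliers and then apply a capping technique to mitigate their influence.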

##### **Chosen Technique with Reason**
**IQR Method:** The Interquartile Range (IQR) method is a robust and widely used statistical technique for identifying outliers. It is less susceptible to the influence of extreme values compared to the Z-score method. The IQR is the range between the first quartile (Q1) and the third quartile (Q3). Outliers are defined as data points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. This method is an excellent choice for this dataset as it helps us identify genuine anomalies without being overly sensitive to every slight deviation.

**Capping:** Rather than removing the rows with outliers, which would lead to data loss, we will cap the outlier values. This means any value below the lower bound is set to the lower bound, and any value above the upper bound is set to the upper bound. This approach preserves the data points while reducing the negative impact of extreme values.
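
The detection rule and the capping step can be traced on a small made-up series. This is a minimal sketch with hypothetical values; `Series.clip` performs the capping.

```python
import pandas as pd

# Hypothetical values with one obvious outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Points outside [lower_bound, upper_bound] are flagged as outliers.
outliers = values[(values < lower_bound) | (values > upper_bound)]

# Capping: values beyond a bound are replaced by that bound.
capped = values.clip(lower=lower_bound, upper=upper_bound)

print(lower_bound, upper_bound)
print(outliers.tolist())
print(capped.tolist())
```

Here Q1 is 3.25 and Q3 is 7.75, so the bounds are -3.5 and 14.5, and only the value 100 is capped.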

In [9]:
import pandas as pd
from sklearn.datasets import fetch_california_housing
import numpy as np

# Load the housing data.
# load_boston was removed in scikit-learn 1.2, so the California Housing
# dataset is used here as a substitute for the Boston Housing dataset.
# fetch_california_housing() downloads the data on first use.
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['target'] = housing.target

# --- Step 1: Detect Outliers using the IQR Method ---
# We will check for outliers in a numerical feature, for example, 'MedInc'
feature_to_check = 'MedInc'

# Calculate Q1, Q3, and IQR for the selected feature
Q1 = df[feature_to_check].quantile(0.25)
Q3 = df[feature_to_check].quantile(0.75)
IQR = Q3 - Q1

# Define the lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# --- Step 2: Treat Outliers using Capping ---
# Cap the values in the selected feature at the lower and upper bounds
df[feature_to_check] = df[feature_to_check].clip(lower=lower_bound, upper=upper_bound)

# You can repeat this process for other features as needed.

##### **Results & Observations**
After applying the IQR method and capping, the extreme values in the selected feature ('MedInc') have been adjusted to the calculated upper and lower bounds. This process effectively mitigates the influence of outliers without discarding valuable data points. By checking the minimum and maximum values of the modified column, you will observe that they now lie within the calculated bounds, confirming the successful treatment of the outliers. This prepares the data for more stable and reliable model training.

### **Task 5: Advanced Data Imputation – Retail Sales Dataset**

##### **Task Overview**
The goal is to handle missing data in a retail sales dataset using an advanced imputation technique. Unlike simple methods (mean, median), these techniques consider relationships between features to make more informed predictions for missing values. We will use KNN Imputation.

##### **Chosen Technique with Reason**

**KNN Imputation:** KNN (K-Nearest Neighbors) Imputation is an advanced technique that imputes missing values by considering the values of the k most similar data points (neighbors) in the dataset. It is often more accurate than simple imputation because it uses the structure and correlations within the data. For a missing value in a given row, it finds the k rows that are most similar to it across the other features and then imputes the missing value from the values in those neighboring rows (e.g., using the mean or a distance-weighted average). Because similarity is measured by distance, features on very different scales can dominate the neighbor search, so scaling the features first is often worthwhile.
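
The neighbor-based idea can be sketched by hand with NumPy. This is a simplified illustration, not scikit-learn's `KNNImputer` implementation: the hypothetical helper `knn_impute_column` fills a single column and assumes the other columns contain no missing values.

```python
import numpy as np

def knn_impute_column(X, target_col, k=2):
    """Fill NaNs in one column with the mean of that column over the k nearest
    donor rows, using Euclidean distance on the remaining columns.

    Simplifying assumption: only `target_col` contains missing values.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    other_cols = [c for c in range(X.shape[1]) if c != target_col]
    missing = np.isnan(X[:, target_col])
    donors = X[~missing]  # rows where the target column is observed
    for i in np.where(missing)[0]:
        dists = np.linalg.norm(donors[:, other_cols] - X[i, other_cols], axis=1)
        nearest = np.argsort(dists)[:k]
        filled[i, target_col] = donors[nearest, target_col].mean()
    return filled

# Hypothetical data: the second column is missing in the fourth row.
data = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [2.1, np.nan],
    [50.0, 500.0],
])
result = knn_impute_column(data, target_col=1, k=2)
print(result[3, 1])
```

The incomplete row is closest to the rows with first-column values 2.0 and 3.0, so its missing entry becomes the mean of 20 and 30; the distant row at 50.0 has no influence.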

In [11]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

# Load your Retail Sales dataset from your local CSV file
try:
    df = pd.read_csv(r'C:\Users\ilaib\OneDrive\Desktop\retail_sales_dataset.csv')
except FileNotFoundError:
    print("Please check that the path to your CSV file is correct.")
    # If your file has no headers, you may need to specify column names
    # df = pd.read_csv('your_retail_sales_file.csv', header=None, names=['Transaction ID', 'Date', ...])
    # Re-raise so the cells below do not silently reuse a DataFrame from an earlier task.
    raise

# --- Step 1: Identify Numerical Columns for Imputation ---
# KNN Imputer works on numerical data.
# 'Quantity' and 'Total Amount' are the columns with missing values;
# 'Price per Unit' is included so it can help identify similar rows.
numerical_cols = ['Quantity', 'Price per Unit', 'Total Amount']

# --- Step 2: Introduce Missing Values (for demonstration if none exist) ---
# This is a temporary step to show the process if your data is clean.
# If your data already has NaNs, you can skip this block.
df.loc[df.sample(frac=0.1).index, 'Quantity'] = np.nan
df.loc[df.sample(frac=0.1).index, 'Total Amount'] = np.nan

# --- Step 3: Initialize and Apply KNN Imputation ---
# Initialize the KNNImputer with a specified number of neighbors (k).
imputer = KNNImputer(n_neighbors=3)

# Fit and transform the numerical data
# We'll use all numerical columns to assist in imputation.
df_imputed = pd.DataFrame(imputer.fit_transform(df[numerical_cols]), columns=numerical_cols, index=df.index)

# --- Step 4: Combine with Original DataFrame ---
# Replace the original numerical columns with the new imputed columns.
df_combined = df.copy()
df_combined[numerical_cols] = df_imputed

# Display the DataFrame with imputed values
print("Original DataFrame missing values count:")
print(df.isnull().sum())
print("\nDataFrame after KNN Imputation missing values count:")
print(df_combined.isnull().sum())

Original DataFrame missing values count:
Transaction ID        0
Date                  0
Customer ID           0
Gender                0
Age                   0
Product Category      0
Quantity            100
Price per Unit        0
Total Amount        100
dtype: int64

DataFrame after KNN Imputation missing values count:
Transaction ID      0
Date                0
Customer ID         0
Gender              0
Age                 0
Product Category    0
Quantity            0
Price per Unit      0
Total Amount        0
dtype: int64


##### **Results & Observations**
After applying KNN Imputation, the missing values in the Quantity and Total Amount columns have been filled. The imputed values are calculated from the rows that are most similar on the other numerical features, such as Price per Unit. This approach helps the imputed data keep a plausible relationship with the rest of the dataset, providing a more reliable foundation for subsequent data analysis or machine learning tasks.

### **Task 6: Feature Engineering – Heart Disease Dataset**

##### **Task Overview**
The goal is to perform feature engineering on the Heart Disease UCI Dataset. We will create new derived features that capture more complex relationships within the data, such as a patient's risk level or age group. This helps the model to better understand the data and make more accurate predictions.

##### **Chosen Technique with Reason**

**Creating Derived Features:** This is a direct approach to feature engineering where new variables are created from existing ones. We will create three new features:

**Age Group:** This feature will categorize the continuous age variable into discrete bins ('Young', 'Middle-Aged', 'Senior', 'Elderly'). This is beneficial because it allows the model to capture non-linear relationships, as the effect of age on heart disease might not be a simple linear function.

**Cholesterol Category:** We will create a binary or categorical feature indicating whether a patient's cholesterol level is high, based on a medical threshold. This simplifies a continuous variable into a more interpretable, risk-based feature.

**Risk Factor Count:** This feature will be a simple count of how many risk indicators a patient shows (here: high resting blood pressure, high cholesterol, and elevated fasting blood sugar, derived from trestbps, chol, and fbs). This aggregates multiple variables into a single, intuitive measure of a patient's overall risk, which can be a useful predictor for a model.
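
When binning with `pd.cut`, the `right` argument decides which bin a value on an edge falls into. The sketch below uses the same style of bins on three hypothetical boundary ages to show the difference.

```python
import pandas as pd

# Hypothetical ages that sit exactly on the bin edges.
ages = pd.Series([29, 44, 59])
bins = [0, 29, 44, 59, float("inf")]
labels = ["Young", "Middle-Aged", "Senior", "Elderly"]

# right=False: bins are [0, 29), [29, 44), [44, 59), [59, inf)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

# right=True: bins are (0, 29], (29, 44], (44, 59], (59, inf]
right_closed = pd.cut(ages, bins=bins, labels=labels, right=True)

print(left_closed.astype(str).tolist())
print(right_closed.astype(str).tolist())
```

With `right=False` a 29-year-old is labelled 'Middle-Aged', while with `right=True` the same patient is labelled 'Young', so the bin edges should be chosen with the intended convention in mind.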

In [12]:
import pandas as pd
import numpy as np

# Load the Heart Disease UCI dataset
try:
    df = pd.read_csv(r'C:\Users\ilaib\Downloads\heart.csv')
except FileNotFoundError:
    print("Please download 'heart.csv' from the UCI repository and update the path above.")
    # Re-raise so the cells below do not silently reuse a DataFrame from an earlier task.
    raise

# --- Step 1: Create 'Age Group' feature ---
# Define age bins and labels
age_bins = [0, 29, 44, 59, float('inf')]
age_labels = ['Young', 'Middle-Aged', 'Senior', 'Elderly']

# Create the new 'AgeGroup' column using pd.cut().
# With right=False each bin includes its left edge: [0, 29), [29, 44), [44, 59), [59, inf).
df['AgeGroup'] = pd.cut(df['age'], bins=age_bins, labels=age_labels, right=False)

# --- Step 2: Create 'Cholesterol Category' feature ---
# Total cholesterol above 200 mg/dL is commonly treated as elevated (borderline high or high).
# We'll create a binary feature (1 for high, 0 for normal).
cholesterol_threshold = 200
df['HighCholesterol'] = (df['chol'] > cholesterol_threshold).astype(int)

# --- Step 3: Create 'Risk Factor Count' feature ---
# Count how many simple risk indicators are present for each patient.
# Each indicator is a binary condition on an existing column:
#   - high resting blood pressure (`trestbps` > 140)
#   - high cholesterol (`chol` > 240)
#   - fasting blood sugar above 120 mg/dL (`fbs` == 1)
# These thresholds are illustrative choices, not clinical guidance.
df['RiskFactorCount'] = 0
df['RiskFactorCount'] += (df['trestbps'] > 140).astype(int)
df['RiskFactorCount'] += (df['chol'] > 240).astype(int) # A different threshold for demonstration
df['RiskFactorCount'] += (df['fbs'] == 1).astype(int)

# Display the new features
print("DataFrame with new engineered features:")
print(df[['age', 'AgeGroup', 'chol', 'HighCholesterol', 'trestbps', 'fbs', 'RiskFactorCount']].head())

DataFrame with new engineered features:
   age     AgeGroup  chol  HighCholesterol  trestbps  fbs  RiskFactorCount
0   63      Elderly   233                1       145    1                2
1   37  Middle-Aged   250                1       130    0                1
2   41  Middle-Aged   204                1       130    0                0
3   56       Senior   236                1       120    0                0
4   57       Senior   354                1       120    0                1


##### **Results & Observations**
The feature engineering process added three new columns to the dataset: AgeGroup, HighCholesterol, and RiskFactorCount. These features are easier to interpret than their continuous counterparts and are potentially more predictive, although binning also discards detail, so their value should be confirmed by comparing model performance with and without them. The AgeGroup column simplifies the age variable, while HighCholesterol and RiskFactorCount give the model an aggregated view of a patient's risk profile. In the five rows shown, HighCholesterol is 1 for every patient, which suggests the 200 mg/dL threshold may separate this sample only weakly.

### **Task 7: Variable Transformation – Bike Sharing Dataset**

##### **Task Overview**
The goal is to apply variable transformation techniques to normalize the distribution of skewed features in the Bike Sharing Dataset. We will use the cnt (total rentals) column, which is typically right-skewed. We will apply Log Transformation and Box-Cox Transformation to see their effects on the data distribution.

##### **Chosen Technique with Reason**
**Log Transformation:** This is a simple and effective technique for reducing right-skewness. It works by compressing the range of high values and expanding the range of low values. The log transformation is particularly useful for features with a wide range and a long tail of large values. We'll use np.log1p(), which calculates log(1+x), a common practice that handles zero values gracefully.

**Box-Cox Transformation:** The Box-Cox transformation is a more general power transformation that includes the log transformation as the special case $\lambda = 0$. It estimates the power exponent ($\lambda$) that makes the transformed distribution as close to normal as possible, so it can adapt to different degrees of skewness. It requires strictly positive data. This is a good choice when you are unsure about the nature of the data's skewness.
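
For a fixed exponent, the Box-Cox formula itself is short. The sketch below implements it with NumPy for a given lambda; it does not estimate lambda from the data the way `scipy.stats.boxcox` does, and the helper name `box_cox` and the sample values are illustrative assumptions.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox power transform for strictly positive x and a given lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

# Hypothetical right-skewed values.
counts = np.array([1.0, 10.0, 100.0, 1000.0])

print(box_cox(counts, 1.0))  # lambda = 1 only shifts the data by -1
print(box_cox(counts, 0.5))  # square-root-like compression
print(box_cox(counts, 0.0))  # lambda = 0 is the natural log
```

A fitted lambda near 1 therefore means the data was left almost unchanged, while a value near 0 means the transformation behaves like a log.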

In [13]:
import pandas as pd
import numpy as np
from scipy.stats import boxcox

# Load the dataset
try:
    df = pd.read_csv(r'C:\Users\ilaib\Downloads\day.csv')
except FileNotFoundError:
    print("Please download 'day.csv' from the UCI repository and update the path above.")
    # Re-raise so the cells below do not silently reuse a DataFrame from an earlier task.
    raise

# --- Step 1: Apply Log Transformation to 'cnt' (total rentals) ---
# The original 'cnt' column is often right-skewed.
# We'll use np.log1p which is log(1+x) to handle any potential zero values.
df['cnt_log'] = np.log1p(df['cnt'])

# --- Step 2: Apply Box-Cox Transformation to 'cnt' ---
# The Box-Cox transformation requires the data to be strictly positive.
# Adding 1 guards against any zero counts and mirrors the shift used by np.log1p.
# boxcox returns the transformed values and the fitted lambda.
df['cnt_boxcox'], fitted_lambda = boxcox(df['cnt'] + 1)

# Display the first few rows with the new transformed columns
print("DataFrame with transformed features:")
print(df[['cnt', 'cnt_log', 'cnt_boxcox']].head())

DataFrame with transformed features:
    cnt   cnt_log  cnt_boxcox
0   985  6.893656  499.329506
1   801  6.687109  415.834439
2  1349  7.207860  659.495676
3  1562  7.354362  750.805907
4  1600  7.378384  766.938328


##### **Results & Observations**
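Whether the transformations make cnt more symmetrical should be checked rather than assumed, for example by comparing df[['cnt', 'cnt_log', 'cnt_boxcox']].skew() or by plotting histograms. The log transformation compresses large values strongly and is a quick way to reduce right-skewness. Box-Cox chooses its exponent from the data: in the output above the transformed values stay on the order of the raw counts (985 maps to about 499), which suggests that the fitted lambda is much closer to 1 than to 0 and that the daily counts needed only a mild transformation. When a feature is strongly right-skewed, transformed variables of this kind are often better suited to models that assume roughly symmetric inputs.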

### **Task 8: Feature Selection – Diabetes Dataset**

##### **Task Overview**

The goal is to identify and select the most relevant features for predicting diabetes. The dataset contains various diagnostic measurements, and not all of them may be equally predictive. We will use two key techniques: Correlation Analysis to identify redundant features and Recursive Feature Elimination (RFE) to systematically select the best subset of features.

##### **Chosen Technique with Reason**
**Correlation Analysis:** We'll begin by calculating the correlation matrix. This helps us understand the relationships between features. If two features are highly correlated, one of them might be redundant, and we can consider removing it to simplify the model without losing much information.

**Recursive Feature Elimination (RFE):** RFE is an iterative feature selection method. It works with an external estimator (like a logistic regression model) that assigns weights to features. RFE then recursively removes the weakest feature (the one with the smallest absolute coefficient or importance) and refits until the desired number of features is reached. This method is effective because it considers each feature's contribution to a predictive model, rather than just its individual relationship with the target variable. Because logistic regression coefficients depend on feature scale, standardizing the features first makes the comparison fairer. We will use RFE to select a top subset of features.
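
The correlation check can be turned into a simple filter. The sketch below scans the upper triangle of the absolute correlation matrix and lists columns that duplicate an earlier one; the toy DataFrame and the 0.9 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical features: 'b' is an exact linear function of 'a'.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [3, 5, 7, 9, 11],
    "c": [3, 1, 4, 1, 5],
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is considered once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is flagged if it is highly correlated with any earlier column.
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

reduced = toy.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```

Only the later column of a correlated pair is flagged, so one representative of each redundant group is kept.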

In [14]:
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Load the dataset
try:
    df = pd.read_csv(r'C:\Users\ilaib\Downloads\diabetes.csv')
except FileNotFoundError:
    print("Please download 'diabetes.csv' and update the path above.")
    # Re-raise so the cells below do not silently reuse a DataFrame from an earlier task.
    raise

# Separate features (X) and target (y)
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# --- Step 1: Correlation Analysis ---
# Calculate the correlation matrix
correlation_matrix = X.corr()

# Print the correlation matrix
print("Correlation Matrix:")
print(correlation_matrix)
print("\n")

# --- Step 2: Recursive Feature Elimination (RFE) ---
# Initialize the logistic regression model as the estimator
model = LogisticRegression(max_iter=200)

# Initialize RFE with the model and the desired number of features to select
# Let's select the top 5 features
rfe = RFE(estimator=model, n_features_to_select=5)

# Fit RFE on the data
fit = rfe.fit(X, y)

# Get the selected features
selected_features = X.columns[fit.support_]

# Get the feature rankings
feature_rankings = pd.Series(fit.ranking_, index=X.columns)

Correlation Matrix:
                          Pregnancies   Glucose  BloodPressure  SkinThickness  \
Pregnancies                  1.000000  0.129459       0.141282      -0.081672   
Glucose                      0.129459  1.000000       0.152590       0.057328   
BloodPressure                0.141282  0.152590       1.000000       0.207371   
SkinThickness               -0.081672  0.057328       0.207371       1.000000   
Insulin                     -0.073535  0.331357       0.088933       0.436783   
BMI                          0.017683  0.221071       0.281805       0.392573   
DiabetesPedigreeFunction    -0.033523  0.137337       0.041265       0.183928   
Age                          0.544341  0.263514       0.239528      -0.113970   

                           Insulin       BMI  DiabetesPedigreeFunction  \
Pregnancies              -0.073535  0.017683                 -0.033523   
Glucose                   0.331357  0.221071                  0.137337   
BloodPressure             0.

##### **Results & Observations**

**Correlation Analysis:** The correlation matrix shows the pairwise linear relationships between the features. Features with a very high correlation (e.g., above 0.8 or 0.9) might be redundant, and one of the pair could be dropped. In the portion of the matrix shown above, the strongest correlation is between Age and Pregnancies (about 0.54), which is well below that threshold.

**Recursive Feature Elimination (RFE):** The RFE process selected a subset of 5 features, whose names are held in the selected_features variable. In feature_rankings, every selected feature has rank 1, and higher ranks mark features that were eliminated earlier. By using this method, we have identified a smaller set of features that the estimator found most relevant to predicting the Outcome. This can lead to a more efficient and potentially more accurate model, which should be confirmed by evaluating on held-out data.

### **Task 9: Handling Imbalanced Data – Credit Card Fraud Detection**

##### **Task Overview**

The objective is to manage the highly skewed class distribution in the Credit Card Fraud Detection dataset. We will use a technique called SMOTE (Synthetic Minority Over-sampling Technique) to create a more balanced dataset, which will help a classification model learn the patterns of the minority class more effectively.

##### **Chosen Technique with Reason**

**SMOTE (Synthetic Minority Over-sampling Technique):** SMOTE is an oversampling method that generates synthetic data points for the minority class. Unlike simple random oversampling, which just duplicates existing data, SMOTE takes a minority class data point and one of its nearest minority class neighbors and generates a new data point on the line segment connecting them. This reduces the risk of overfitting to exact duplicates and provides the model with a more varied set of minority class examples, which can improve predictive performance when detecting rare events like fraud.
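
The interpolation step at the heart of SMOTE can be sketched with NumPy. This is a simplified illustration, not the imbalanced-learn implementation: the helper `smote_like_samples` is hypothetical, and it assumes the minority points are distinct so that each point's nearest match is itself.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create synthetic points by interpolating between a minority sample
    and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        dists = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # index 0 is the point itself
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

# Hypothetical minority class points on the corners of the unit square.
minority_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(minority_points, n_new=6)
print(new_points)
```

Every synthetic point lies between two existing minority points, so the new samples stay inside the region the minority class already occupies.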

In [15]:
import pandas as pd
from imblearn.over_sampling import SMOTE
from collections import Counter

# Load the dataset
try:
    df = pd.read_csv(r"C:\Users\ilaib\Downloads\creditcard.csv")
except FileNotFoundError:
    print("Please download 'creditcard.csv' from Kaggle and update the path above.")
    # Re-raise so the cells below do not silently reuse a DataFrame from an earlier task.
    raise

# Separate features (X) and target (y)
X = df.drop('Class', axis=1)
y = df['Class']

# --- Step 1: Check the original class distribution ---
print("Original class distribution:")
print(Counter(y))

# --- Step 2: Apply SMOTE ---
# Initialize SMOTE
sm = SMOTE(random_state=42)

# Fit and apply the transformation to the dataset
X_resampled, y_resampled = sm.fit_resample(X, y)

# --- Step 3: Check the resampled class distribution ---
print("\nResampled class distribution after applying SMOTE:")
print(Counter(y_resampled))

Original class distribution:
Counter({0: 284315, 1: 492})

Resampled class distribution after applying SMOTE:
Counter({0: 284315, 1: 284315})


##### **Results & Observations**

Before applying SMOTE, the dataset is severely imbalanced: 284,315 non-fraudulent transactions (class 0) against only 492 fraudulent ones (class 1), so fraud makes up about 0.17% of the rows. After applying SMOTE, both classes contain 284,315 samples, so the distribution is exactly balanced. A classifier trained on the balanced data is less likely to ignore the minority class, which can improve its ability to identify fraudulent transactions. Note that this cell resamples the full dataset purely to show the effect; when training a model, the resampling should be applied to the training split only, so that evaluation is done on real, untouched transactions.

### **Task 11: Dimensionality Reduction – MNIST Dataset**

##### **Task Overview**

The objective is to reduce the high-dimensional feature space of the MNIST dataset. We will use PCA (Principal Component Analysis) to project the data into a lower-dimensional space, which helps with visualization, reduces computational complexity, and can mitigate the "curse of dimensionality."

##### **Chosen Technique with Reason**

**PCA (Principal Component Analysis):** PCA is an unsupervised linear dimensionality reduction technique. It works by identifying the directions (principal components) of maximum variance in the data. By projecting the data onto these components, we can retain the most significant information (variance) with a much smaller number of features. PCA is a strong choice for this task because the pixel data in the MNIST images is highly correlated and redundant, making it an ideal candidate for this method.
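To make the idea of "directions of maximum variance" concrete, here is a minimal NumPy sketch that computes principal components through a singular value decomposition of the centered data. The toy data (200 samples, 5 features driven by 2 hidden directions) is an assumption made for illustration and is unrelated to MNIST.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 5 correlated features generated from 2 hidden directions plus a little noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# Step 1: center each feature
X_centered = X - X.mean(axis=0)

# Step 2: the rows of Vt are the principal components, ordered by variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 3: variance captured by each component, as a fraction of the total
explained_variance = S ** 2 / (len(X) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Step 4: project onto the first two components
n_components = 2
X_projected = X_centered @ Vt[:n_components].T

print(X_projected.shape)
print(explained_variance_ratio.round(3))
```

Because the five features are built from only two underlying directions, the first two components account for almost all of the variance, which is the kind of redundancy PCA exploits.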

In [16]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# --- Step 1: Load the MNIST dataset ---
# Fetching the data from openml
# This may take a few moments
print("Loading MNIST dataset...")
mnist = fetch_openml('mnist_784', version=1, parser='auto')
X = mnist.data
y = mnist.target

# --- Step 2: Standardize the data ---
# PCA is sensitive to the scale of the features.
# It's crucial to standardize the data before applying PCA.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Step 3: Apply PCA ---
# We'll reduce the dimensions from 784 to a smaller number, say 50.
# 50 is an illustrative choice; Step 4 checks how much variance it actually retains.
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X_scaled)

# --- Step 4: Check the explained variance ratio ---
# This shows how much of the original variance is captured by the new components.
explained_variance_ratio = pca.explained_variance_ratio_

print("\nShape of original data:", X.shape)
print("Shape of data after PCA:", X_pca.shape)
print("\nTotal variance explained by the first 50 components:", explained_variance_ratio.sum())

Loading MNIST dataset...

Shape of original data: (70000, 784)
Shape of data after PCA: (70000, 50)

Total variance explained by the first 50 components: 0.5494590903755294


##### **Results & Observations**

After applying PCA, the original 784-dimensional data is reduced to 50 dimensions. The explained variance ratio shows that these 50 components capture about 55% of the variance of the standardized data (0.549), using roughly 6% of the original number of features. This is a substantial compression, but it also means that close to half of the variance is discarded; if a downstream model needs more of the original information, n_components can be increased and the explained variance checked again.
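One way to choose the number of components is to look at the cumulative explained variance and pick the smallest number that reaches a target. The sketch below uses scikit-learn's small bundled digits dataset (`load_digits`, 64 features) rather than MNIST so that it runs quickly without a download; the 95% target is an assumed threshold, not a value from this task.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Small stand-in dataset: 1797 images of 8x8 digits (64 pixel features)
X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the target
target = 0.95
k = int(np.argmax(cumulative >= target)) + 1

print(f"{k} of {X.shape[1]} components reach {target:.0%} of the variance")
```

The same approach applies to MNIST: fit PCA once with all components, inspect the cumulative curve, and then refit with the chosen number.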

### **Task 13: Time-Series Preprocessing – Air Quality Dataset**

##### **Task Overview**
The objective is to handle missing timestamps, resample data, and apply a smoothing technique to a time-series dataset. Time-series data often has gaps or is recorded at inconsistent intervals, which needs to be addressed before analysis or forecasting. We will demonstrate how to make the data consistent and smoother.

##### **Chosen Technique with Reason**
**Handling Missing Timestamps:** We will first convert the date column to a datetime format and set it as the DataFrame's index. Then, we will use resampling to create a consistent time series, filling in any missing timestamps. The ffill (forward fill) method will be used to fill missing values, which is a common and simple approach for time-series data, as it assumes the last known value carries forward.

**Smoothing:** We will use a rolling average (or moving average) to smooth out short-term fluctuations and highlight long-term trends. This is particularly useful for noisy time-series data, as it provides a clearer view of the underlying patterns.
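Before filling gaps, it can help to list which timestamps are actually missing. A minimal sketch, assuming daily data and using the same illustrative dates as the sample data in this task:

```python
import pandas as pd

dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-04',
                        '2023-01-05', '2023-01-08'])
series = pd.Series([10.5, 11.2, 12.0, 11.8, 13.1], index=dates, name='CO_level')

# Every day between the first and last observation
full_range = pd.date_range(series.index.min(), series.index.max(), freq='D')

# Days that should exist but have no recorded value
missing = full_range.difference(series.index)
print(missing.strftime('%Y-%m-%d').tolist())
# ['2023-01-03', '2023-01-06', '2023-01-07']
```

Knowing how many timestamps are missing, and whether they cluster together, helps decide whether forward fill is a reasonable assumption for the data.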

In [40]:
import pandas as pd
import numpy as np

# Create a sample DataFrame to simulate an Air Quality dataset
# This is for demonstration purposes. Replace this with your actual data loading.
data = {
    'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-04', '2023-01-05', '2023-01-08']),
    'CO_level': [10.5, 11.2, 12.0, 11.8, 13.1]
}
df = pd.DataFrame(data)

# --- Step 1: Handle Missing Timestamps ---
# Ensure 'Date' is a datetime column and set it as the index.
# (The sample data is already datetime; this matters for data read from a file.)
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

# Resample the data to a daily frequency and use forward fill
df_resampled = df.resample('D').ffill()

# --- Step 2: Apply Smoothing (Rolling Average) ---
# Calculate a 3-day rolling average
df_resampled['CO_level_smoothed'] = df_resampled['CO_level'].rolling(window=3).mean()

# Display the original and preprocessed data
print("Original DataFrame:")
print(df)
print("\nResampled and Smoothed DataFrame:")
print(df_resampled)

Original DataFrame:
            CO_level
Date                
2023-01-01      10.5
2023-01-02      11.2
2023-01-04      12.0
2023-01-05      11.8
2023-01-08      13.1

Resampled and Smoothed DataFrame:
            CO_level  CO_level_smoothed
Date                                   
2023-01-01      10.5                NaN
2023-01-02      11.2                NaN
2023-01-03      11.2          10.966667
2023-01-04      12.0          11.466667
2023-01-05      11.8          11.666667
2023-01-06      11.8          11.866667
2023-01-07      11.8          11.800000
2023-01-08      13.1          12.233333


##### **Results & Observations**
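The original DataFrame is missing three dates: January 3rd, 6th, and 7th. After resampling, df_resampled contains every day from January 1st to January 8th, and each inserted row takes the CO_level of the most recent recorded day (ffill), so January 6th and 7th both carry the January 5th value of 11.8. The CO_level_smoothed column holds the 3-day rolling average. Its first two entries are NaN because a window of 3 needs three observations before it can produce a value. From January 3rd onward, the rolling average evens out the daily fluctuations and gives a clearer view of the upward trend in CO_level.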
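Two common variations on the steps above are sketched here, using the same sample values. Option A keeps forward fill but sets `min_periods=1` so the rolling average also produces values for the first rows. Option B replaces forward fill with linear interpolation in time. Which one fits better depends on the data; both are shown as alternatives, not as the method used in this task.

```python
import pandas as pd

df = pd.DataFrame(
    {'CO_level': [10.5, 11.2, 12.0, 11.8, 13.1]},
    index=pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-04',
                          '2023-01-05', '2023-01-08']),
)

# Option A: forward fill, then a rolling mean that accepts partial windows
filled = df.resample('D').ffill()
filled['CO_level_smoothed'] = (
    filled['CO_level'].rolling(window=3, min_periods=1).mean()
)

# Option B: insert the missing days as NaN, then interpolate linearly in time
interpolated = df.resample('D').asfreq().interpolate(method='time')

print(filled)
print(interpolated)
```

With Option B, January 3rd becomes 11.6 (halfway between 11.2 and 12.0) instead of repeating 11.2, which may be closer to reality when the measured quantity changes gradually.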