# Exercise-1: Data Preparation [10 points]

This notebook covers the fundamental aspects of data preparation for machine learning tasks.

## Learning Objectives
- Load and explore a dataset
- Clean and preprocess data
- Engineer and select features
- Split data into training and testing sets

## Instructions
Complete the exercises below by implementing the required code in the designated cells.

## 1. Import Required Libraries

Import the necessary Python libraries for data manipulation and analysis.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")

## 2. Data Loading

Load your dataset and perform initial exploration.
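
As a starting point, loading and a first look at the data could be sketched as follows. The CSV content here is a small synthetic stand-in held in memory, and the column names (`age`, `income`, `city`, `purchased`) are hypothetical; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Synthetic stand-in for a CSV file (hypothetical columns and values).
# With a real dataset, replace io.StringIO(csv_text) with the file path.
csv_text = """age,income,city,purchased
25,50000,Paris,yes
32,,Berlin,no
47,82000,Paris,yes
51,61000,,no
"""

df = pd.read_csv(io.StringIO(csv_text))

# First rows, then column dtypes and non-null counts
print(df.head())
df.info()
```

Empty fields in the CSV are read as missing values (`NaN`), which is why `df.info()` reports fewer non-null entries for `income` and `city`.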

In [None]:
# TODO: Load your dataset
# Example: df = pd.read_csv('../data/your_dataset.csv')

# TODO: Display basic information about the dataset (e.g. df.head(), df.info())

## 3. Exploratory Data Analysis (EDA)

Perform exploratory data analysis to understand the dataset structure and characteristics.
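
One possible way to inspect structure, missing values, and summary statistics is sketched below. The DataFrame is a synthetic toy example with hypothetical column names, used only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic toy data (hypothetical columns), one missing value per column
df = pd.DataFrame({
    "age": [25, 32, 47, 51, np.nan],
    "income": [50000.0, np.nan, 82000.0, 61000.0, 58000.0],
    "city": ["Paris", "Berlin", "Paris", np.nan, "Rome"],
})

# Structure: shape, column names, and data types
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())
print(df.dtypes)

# Missing values: absolute count and percentage per column
missing_counts = df.isna().sum()
missing_pct = df.isna().mean() * 100
print(pd.DataFrame({"missing": missing_counts, "percent": missing_pct}))

# Summary statistics for numerical and categorical columns alike
print(df.describe(include="all"))
```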

In [None]:
# TODO: Display dataset shape, column names, and data types

# TODO: Check for missing values

# TODO: Display summary statistics
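
The three kinds of plots listed in the next cell could be sketched roughly as follows. This version uses only pandas and matplotlib on randomly generated toy data with hypothetical columns; seaborn's `histplot`, `countplot`, and `heatmap` are alternatives for the same plots.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated toy data (hypothetical columns)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=200),
    "income": rng.normal(60000, 15000, size=200),
    "city": rng.choice(["Paris", "Berlin", "Rome"], size=200),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram of a numerical feature
df["income"].plot.hist(bins=20, ax=axes[0], title="income (histogram)")

# Count plot of a categorical feature
city_counts = df["city"].value_counts()
city_counts.plot.bar(ax=axes[1], title="city (counts)")

# Correlation matrix of the numerical columns, drawn as a heatmap
corr = df.select_dtypes(include="number").corr()
im = axes[2].imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
axes[2].set_xticks(range(len(corr.columns)))
axes[2].set_xticklabels(corr.columns)
axes[2].set_yticks(range(len(corr.columns)))
axes[2].set_yticklabels(corr.columns)
axes[2].set_title("correlation matrix")
fig.colorbar(im, ax=axes[2])

fig.tight_layout()
```

In a notebook the figure is rendered when the cell finishes; in a plain script you would call `plt.show()` at the end.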

In [None]:
# TODO: Create visualizations to understand data distribution
# - Histograms for numerical features
# - Count plots for categorical features
# - Correlation matrix heatmap

## 4. Data Cleaning

Handle missing values, outliers, and inconsistencies in the data.
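
For missing values, one common approach (not the only one) is to impute numerical columns with the median and categorical columns with the most frequent value. The sketch below assumes a synthetic toy frame with hypothetical columns; whether imputing or dropping is appropriate depends on your dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Synthetic toy data (hypothetical columns) with gaps in every column
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0],
    "income": [50000.0, np.nan, 82000.0, 61000.0],
    "city": ["Paris", "Berlin", np.nan, "Paris"],
})

num_cols = ["age", "income"]

# Identify the columns that contain missing values
cols_with_missing = df.columns[df.isna().any()].tolist()
print("Columns with missing values:", cols_with_missing)

df_clean = df.copy()

# Numerical columns: replace gaps with the column median
num_imputer = SimpleImputer(strategy="median")
df_clean[num_cols] = num_imputer.fit_transform(df[num_cols])

# Categorical column: replace gaps with the most frequent category
df_clean["city"] = df["city"].fillna(df["city"].mode()[0])

print(df_clean)
```

If the imputed values feed a model, fitting the imputer on the training split only avoids leaking information from the test set.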

In [None]:
# TODO: Handle missing values
# - Identify columns with missing values
# - Decide on appropriate strategy (drop, impute, etc.)
# - Implement the chosen strategy
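
For outliers, the IQR rule is one of the statistical methods mentioned in the next cell. A minimal sketch on a synthetic series, using the conventional 1.5 × IQR fences (the multiplier is a convention, not a requirement), might look like this:

```python
import pandas as pd

# Synthetic values with one obvious outlier (95)
s = pd.Series([10, 12, 11, 13, 12, 11, 95], name="value", dtype=float)

# Interquartile range and the conventional 1.5 * IQR fences
q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Detect: values outside the fences
iqr_outliers = s[(s < lower) | (s > upper)]
print("Outliers:", iqr_outliers.tolist())

# Handle (option 1): cap values at the fences
capped = s.clip(lower=lower, upper=upper)

# Handle (option 2): drop the outlying rows
removed = s[(s >= lower) & (s <= upper)]
```

Capping keeps the row count unchanged, while removal discards the observation entirely; which is preferable depends on whether the extreme value is an error or a genuine measurement.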

In [None]:
# TODO: Detect and handle outliers
# - Use statistical methods (IQR, Z-score) or visualization
# - Decide whether to remove, cap, or transform outliers

## 5. Feature Engineering

Create new features or transform existing ones to improve model performance.
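
The three tasks in the next cell (deriving a feature, encoding categoricals, scaling numericals) could be sketched as below. The data and the derived `income_per_person` feature are hypothetical illustrations. One-hot encoding via `pd.get_dummies` is used for the input feature here; scikit-learn documents `LabelEncoder` as intended for target labels rather than input features.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic toy data (hypothetical columns)
df = pd.DataFrame({
    "income": [50000.0, 64000.0, 82000.0, 61000.0],
    "household_size": [1, 2, 4, 2],
    "city": ["Paris", "Berlin", "Paris", "Rome"],
})

# New feature derived from existing ones
df["income_per_person"] = df["income"] / df["household_size"]

# One-hot encode the categorical column
encoded = pd.get_dummies(df, columns=["city"], dtype=int)

# Standardize numerical columns to zero mean and unit variance
num_cols = ["income", "household_size", "income_per_person"]
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(encoded[num_cols]),
    columns=num_cols,
    index=encoded.index,
)

# Combine the scaled numerical columns with the one-hot columns
final = encoded.drop(columns=num_cols).join(scaled)
print(final)
```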

In [None]:
# TODO: Feature engineering tasks
# - Create new features from existing ones
# - Encode categorical variables
# - Scale numerical features if necessary

## 6. Data Splitting

Split the data into training and testing sets.
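
A minimal sketch of the split, assuming a synthetic dataset with a hypothetical binary `target` column and an 80/20 split. The `stratify` argument is optional; it keeps the class proportions the same in both sets, which is often useful for imbalanced classification targets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: two features and an imbalanced binary target (70/30)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    "target": np.repeat([0, 1], [70, 30]),
})

# Separate features and target
X = df.drop(columns="target")
y = df["target"]

# 80/20 split with a fixed seed and stratification on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("X_train:", X_train.shape, "X_test:", X_test.shape)
print("y_train:", y_train.shape, "y_test:", y_test.shape)
```

Passing `random_state` explicitly makes the split reproducible regardless of what other code has drawn from NumPy's global random state.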

In [None]:
# TODO: Separate features and target variable
# X = ...
# y = ...

# TODO: Split data into training and testing sets
# X_train, X_test, y_train, y_test = train_test_split(...)

# TODO: Display the shapes of the resulting datasets

## 7. Data Preprocessing Pipeline

Create a preprocessing pipeline for consistent data transformation.
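
One way to build such a pipeline is scikit-learn's `ColumnTransformer`, which applies different steps to numerical and categorical columns. The sketch below assumes synthetic data with hypothetical columns; the choice of median imputation, standard scaling, and one-hot encoding is illustrative rather than prescribed by the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic training and testing features (hypothetical columns)
X_train = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0, 38.0, 29.0],
    "income": [50000.0, 64000.0, 82000.0, np.nan, 58000.0, 47000.0],
    "city": ["Paris", "Berlin", "Paris", "Rome", "Berlin", "Paris"],
})
X_test = pd.DataFrame({
    "age": [41.0, np.nan],
    "income": [72000.0, 39000.0],
    "city": ["Rome", "Madrid"],  # "Madrid" does not appear in training
})

# Numerical columns: impute with the median, then standardize
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: one-hot encode, ignoring unseen categories
categorical = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    [
        ("num", numeric, ["age", "income"]),
        ("cat", categorical, ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on training data only, then apply the same transform to both sets
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print(X_train_processed.shape, X_test_processed.shape)
```

Because the statistics (medians, means, standard deviations, known categories) come from the training set alone, the test set is transformed without influencing them. With `handle_unknown="ignore"`, the unseen category is encoded as all zeros in the one-hot columns.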

In [None]:
# TODO: Create preprocessing pipeline
# - Include scaling, encoding, and any other necessary transformations
# - Fit the pipeline on training data
# - Transform both training and testing data

## 8. Summary and Next Steps

Summarize the data preparation process and prepare for model training.
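
Saving and reloading the processed data could be sketched as follows. The arrays are tiny synthetic placeholders, the column names `f0` and `f1` are hypothetical, and a temporary directory stands in for the notebook's data directory so the snippet does not overwrite anything.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Tiny synthetic placeholders for the processed outputs
X_train_processed = np.array([[0.5, 1.0], [-0.5, 0.0]])
y_train = pd.Series([1, 0], name="target")

# Temporary directory used here; in the exercise you would use your data directory
out_dir = Path(tempfile.mkdtemp())

# Write features and target without the index column
pd.DataFrame(X_train_processed, columns=["f0", "f1"]).to_csv(
    out_dir / "X_train_processed.csv", index=False
)
y_train.to_csv(out_dir / "y_train.csv", index=False)

# Read back to confirm the files round-trip as expected
X_back = pd.read_csv(out_dir / "X_train_processed.csv")
y_back = pd.read_csv(out_dir / "y_train.csv")["target"]
print(X_back.shape, y_back.shape)
```

Giving the feature columns explicit names before saving makes the files easier to interpret later, since a transformed NumPy array carries no column labels of its own.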

In [None]:
# TODO: Save the preprocessed data for use in subsequent exercises
# Example: 
# pd.DataFrame(X_train_processed).to_csv('../data/X_train_processed.csv', index=False)
# pd.DataFrame(X_test_processed).to_csv('../data/X_test_processed.csv', index=False)
# pd.Series(y_train).to_csv('../data/y_train.csv', index=False)
# pd.Series(y_test).to_csv('../data/y_test.csv', index=False)

## Reflection Questions

1. What challenges did you encounter during data preparation?
2. How did you decide on the strategy for handling missing values?
3. What feature engineering techniques did you apply and why?
4. How might different preprocessing choices affect model performance?

**TODO: Answer the reflection questions above in markdown cells below.**