# Data Preprocessing

In this notebook, we prepare the Pima Indians Diabetes dataset for analysis and modeling. The steps are handling missing values, splitting the data into training and test sets, and scaling the features.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Load the dataset
data = pd.read_csv('../data/raw/pima_indians_diabetes.csv')  # Update with the correct path

# Display the first few rows of the dataset
data.head()

In [None]:
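# Check for missing values: count NaNs and zeros per column.
# Zeros are valid for Pregnancies and Outcome, but not for measurements such as Glucose or BMI.
missing_values = pd.DataFrame({'nan_count': data.isnull().sum(), 'zero_count': (data == 0).sum()})
missing_values[(missing_values > 0).any(axis=1)]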

In [None]:
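# Handling missing values: this dataset typically records missing measurements as 0 rather than NaN,
# so convert zeros in these columns to NaN before applying mean imputation
cols_with_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
data[cols_with_missing] = data[cols_with_missing].replace(0, np.nan)
imputer = SimpleImputer(strategy='mean')
data[cols_with_missing] = imputer.fit_transform(data[cols_with_missing])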
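
The imputation above is fitted on the full dataset, so statistics from rows that later land in the test set influence the training data. A minimal sketch of one alternative follows, assuming zeros mark missing measurements and using a small synthetic frame (`toy` and all its values are made up for illustration): split first, then fit the imputer on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the notebook itself would use `data` loaded from the CSV.
toy = pd.DataFrame({
    'Glucose': [148.0, 85.0, 0.0, 89.0, 137.0, 116.0, 78.0, 115.0, 197.0, 125.0],
    'BMI': [33.6, 26.6, 23.3, 0.0, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
    'Outcome': [1, 0, 1, 0, 1, 0, 1, 0, 1, 1],
})
zero_as_missing = ['Glucose', 'BMI']

# Assumption: a reading of 0 is not physiologically valid, so treat it as missing.
toy[zero_as_missing] = toy[zero_as_missing].replace(0, np.nan)

X_toy = toy.drop('Outcome', axis=1)
y_toy = toy['Outcome']
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Learn the column means from the training rows only, then reuse them for the test rows.
toy_imputer = SimpleImputer(strategy='mean')
X_tr_imp = pd.DataFrame(toy_imputer.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_imp = pd.DataFrame(toy_imputer.transform(X_te), columns=X_te.columns, index=X_te.index)
```

With this ordering, the test rows are filled with means computed without looking at them, which mirrors how the scaler is handled later in the notebook.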

In [None]:
# Splitting the dataset into features and target
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
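
A purely random split can shift the class proportions between the training and test sets when the classes are imbalanced. As a hedged sketch on synthetic labels (the 65/35 class ratio and the names `X_demo` and `y_demo` are assumptions for illustration, not measured properties of this file), passing `stratify` to `train_test_split` keeps the proportions aligned:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 65 negatives and 35 positives (illustrative ratio only).
y_demo = np.array([0] * 65 + [1] * 35)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo asks for the same class proportions in both splits.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fraction of positives in each split.
print(y_tr_s.mean(), y_te_s.mean())
```

In the notebook, the equivalent change would be adding `stratify=y` to the existing call; whether it is needed depends on how imbalanced `Outcome` is in the loaded file.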

In [None]:
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
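
`StandardScaler` standardizes each column as z = (x - mean) / std, where the mean and the population standard deviation are learned from the training data; the test set is then transformed with those same training statistics. The sketch below reproduces this by hand on made-up numbers (`train_demo` and `test_demo` are illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test_demo = np.array([[4.0, 30.0]])

# Fit on the training rows only.
demo_scaler = StandardScaler().fit(train_demo)

# Manual equivalent: per-column mean and population standard deviation (ddof=0).
mu = train_demo.mean(axis=0)
sigma = train_demo.std(axis=0)

manual_train = (train_demo - mu) / sigma
manual_test = (test_demo - mu) / sigma  # test rows reuse the training statistics

# The manual computation should agree with the scaler.
print(np.allclose(manual_train, demo_scaler.transform(train_demo)))
print(np.allclose(manual_test, demo_scaler.transform(test_demo)))
```

Because the test rows are scaled with training statistics, their columns will generally not have exactly zero mean or unit variance, which is expected.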

In [None]:
# Save the processed training set (scaled features plus the Outcome column)
processed_data = pd.DataFrame(X_train_scaled, columns=X.columns)
processed_data['Outcome'] = y_train.reset_index(drop=True)
processed_data.to_csv('../data/processed/pima_indians_diabetes_processed.csv', index=False)
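
The cell above writes only the scaled training split. If the test split should be persisted in the same layout, one possible approach is a small helper like the sketch below. The function name `save_split`, the temporary directory, and the file name `train.csv` are hypothetical and not part of the original notebook.

```python
from pathlib import Path
import tempfile

import numpy as np
import pandas as pd


def save_split(features, target, columns, path):
    """Write scaled features plus the target column to a CSV file."""
    frame = pd.DataFrame(features, columns=columns)
    # np.asarray drops the original index so rows are matched by position.
    frame['Outcome'] = np.asarray(target)
    frame.to_csv(path, index=False)
    return frame


# Demonstration with made-up values and a temporary directory.
out_dir = Path(tempfile.mkdtemp())
demo_cols = ['Glucose', 'BMI']
demo_features = np.array([[0.1, -0.2], [1.0, 0.5]])
demo_target = pd.Series([1, 0], index=[7, 3])  # non-default index, as after a split

save_split(demo_features, demo_target, demo_cols, out_dir / 'train.csv')
reloaded = pd.read_csv(out_dir / 'train.csv')
```

In the notebook, the same helper could be called once with `X_train_scaled`, `y_train`, `X.columns` and once with `X_test_scaled`, `y_test`, `X.columns`, each with an output path of your choosing.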

## Conclusion

In this notebook, we preprocessed the Pima Indians Diabetes dataset by imputing missing values, splitting the data into training and test sets, and standardizing the features. The scaled training set is written to `../data/processed/pima_indians_diabetes_processed.csv` for use in modeling.