# Data Preprocessing

Data preprocessing is a crucial step in the data analysis and machine learning pipeline that involves preparing raw data for analysis. The goal of data preprocessing is to clean, transform, and organize data to improve its quality and make it suitable for modeling. This process typically includes several key steps:

* **Data Cleaning**: Identifying and correcting errors or inconsistencies in the data, such as missing values, duplicates, and outliers.

* **Data Transformation**: Converting data into a suitable format or structure. This may involve normalization, scaling, encoding categorical variables, or aggregating data.

* **Data Integration**: Combining data from different sources to create a unified dataset, ensuring that it is coherent and consistent.

* **Data Reduction**: Reducing the volume of data while maintaining its integrity, which can involve techniques like feature selection or dimensionality reduction.

* **Data Splitting**: Dividing the dataset into training, validation, and test sets to evaluate the performance of machine learning models.

Effective data preprocessing enhances the accuracy and efficiency of models, ultimately leading to better insights and predictions. It is often considered one of the most time-consuming yet essential parts of the data science workflow.

# Importing the Dataset from Google Drive

To import a dataset from Google Drive in a Google Colab environment, use the `drive.mount` method:

1. **Mount Google Drive**: This gives the notebook access to files stored in your Google Drive.

2. **Access the dataset**: Once the drive is mounted, refer to the dataset by its file path under the mount point.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Set the Dataset Path

In [3]:
# Path to the CSV file inside the mounted Drive (this only stores the path; nothing is uploaded)
upload_file = '/content/drive/MyDrive/Machine Learning/healthcare-dataset-stroke-data.csv'

# Import NumPy and Pandas

**NumPy**

NumPy (Numerical Python) is a powerful library in Python that provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. It is a fundamental package for scientific computing in Python and serves as the foundation for many other libraries, including Pandas, SciPy, and Matplotlib.


**Pandas**

Pandas is a powerful data manipulation and analysis library built on top of NumPy. It provides data structures like Series and DataFrames that make it easy to work with structured data, such as time series and tabular data.

In [4]:
import numpy as np
import pandas as pd

# Make a DataFrame

A ***DataFrame*** is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure provided by the Pandas library in Python. It is one of the most commonly used data structures in data analysis and manipulation, resembling a table in a database or a spreadsheet in Excel.

In [5]:
df = pd.read_csv(upload_file)
df

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


# Handling Missing Values

**What is a Missing Value in Machine Learning?**

In the context of machine learning and data analysis, a missing value refers to the absence of a value in a dataset. This can occur for various reasons, such as:

* **Data Entry Errors**: Mistakes made during data collection or entry can lead to missing values.

* **Non-Response**: In surveys or questionnaires, respondents may skip questions, resulting in missing data.

* **Data Corruption**: Issues during data transmission or storage can lead to loss of information.

* **Inapplicability**: Certain features may not apply to all observations, leading to missing values for specific entries.

Missing values can pose significant challenges in machine learning, as they can lead to biased models, reduced accuracy, and difficulties in data interpretation. Most machine learning algorithms require complete datasets, and the presence of missing values can hinder the training process.


**Methods to Handle Missing Values**

There are several strategies to handle missing values in a dataset. The choice of method depends on the nature of the data, the amount of missing data, and the specific analysis or modeling goals. Here are some common methods:

*Removing Missing Values:*

* Drop Rows: Remove any rows that contain missing values. This is suitable when the number of missing values is small compared to the dataset size.

* Drop Columns: Remove entire columns that contain missing values, especially if they are not critical for analysis.

*Filling Missing Values:*

*Constant Value* : Replace missing values with a specific constant (e.g., zero, "unknown").

*Mean/Median/Mode Imputation* : Replace missing values with the mean, median, or mode of the column. This is common for numerical data.

*Forward Fill and Backward Fill*:

* Forward Fill: Propagate the last valid observation forward to fill missing values.

* Backward Fill: Propagate the next valid observation backward to fill missing values.

*Interpolation* :

Use interpolation methods to estimate missing values based on surrounding data points. This is particularly useful for time series data.

*Predictive Modeling* :

Use machine learning algorithms to predict and fill missing values based on other features in the dataset. This method can be more sophisticated and may yield better results.

*Custom Logic* :

Implement custom logic to fill missing values based on specific conditions or domain knowledge.
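
The strategies above can be sketched on a small, made-up DataFrame. The column names and values below are illustrative assumptions and are not taken from the stroke dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps (not from the stroke dataset)
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "city": ["A", "B", None, "A"],
    "reading": [1.0, np.nan, 3.0, 4.0],
})

# Removing: drop every row that contains at least one missing value
dropped_rows = toy.dropna()

# Constant value: replace missing categories with a placeholder label
city_filled = toy["city"].fillna("unknown")

# Median imputation for a numerical column
age_filled = toy["age"].fillna(toy["age"].median())

# Forward fill: carry the last valid observation forward
reading_ffill = toy["reading"].ffill()

# Linear interpolation between neighbouring valid points
reading_interp = toy["reading"].interpolate()
```

Each result is a new object, so the original `toy` frame stays untouched and the strategies can be compared side by side.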

In [6]:
# Count missing values in each column
df.isnull().sum()

Unnamed: 0,0
id,0
gender,0
age,0
hypertension,0
heart_disease,0
ever_married,0
work_type,0
Residence_type,0
avg_glucose_level,0
bmi,201


Apart from the unnamed index column, the dataset has 12 columns, with the following null counts:

* id: 0
* gender: 0
* age: 0
* hypertension: 0
* heart_disease: 0
* ever_married: 0
* work_type: 0
* Residence_type: 0
* avg_glucose_level: 0
* bmi: 201
* smoking_status: 0
* stroke: 0

Of these 12 columns, only *bmi* has null values (201), so they must be handled before building a machine learning model.


For handling missing values in the BMI column of a stroke prediction dataset, effective methods include:

*Mean/Median Imputation* : Replacing missing values with the mean or median of the BMI column is straightforward and leaves that statistic unchanged, although it slightly reduces the column's variance. This method is suitable when the share of missing data is small (here, 201 missing values, about 3.93% of the rows).

*Predictive Modeling* : Using machine learning algorithms to predict missing BMI values based on other features can yield more accurate imputations, especially if the dataset has strong correlations among variables.

**Reasons for Not Using Other Methods :**

* *Removing rows or the feature* : Dropping the rows with missing BMI values discards observations and can bias the results if the values are not missing at random. Dropping the BMI column instead would discard a feature that may be informative.

* *Constant Value Imputation*: Filling missing values with a constant (e.g., zero) can distort the data distribution and lead to misleading conclusions.

* *Forward/Backward Fill*: These methods are more suitable for time series data and may not be appropriate for the BMI column, which does not have a temporal component.

Choosing the right method depends on the specific characteristics of the dataset and the analysis goals, ensuring that the imputation strategy aligns with the overall modeling approach.
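
When the data is later split into training and test sets, one option is to compute the fill value from the training rows only, so that test rows do not influence it. The following is a minimal sketch under that assumption; `impute_with_train_stat` is a hypothetical helper, and the two small frames stand in for a real split.

```python
import numpy as np
import pandas as pd

def impute_with_train_stat(train, test, column, strategy="median"):
    """Fill NaNs in `column` using a statistic computed on `train` only."""
    if strategy == "median":
        fill_value = train[column].median()
    elif strategy == "mean":
        fill_value = train[column].mean()
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Copy so the caller's frames are not modified
    train_out = train.copy()
    test_out = test.copy()
    train_out[column] = train_out[column].fillna(fill_value)
    test_out[column] = test_out[column].fillna(fill_value)
    return train_out, test_out, fill_value

# Toy frames standing in for a train/test split (values are made up)
train = pd.DataFrame({"bmi": [22.0, np.nan, 30.0, 26.0]})
test = pd.DataFrame({"bmi": [np.nan, 41.0]})

train_f, test_f, used = impute_with_train_stat(train, test, "bmi")
```

The median is used by default here because it is less affected by extreme BMI values than the mean; passing `strategy="mean"` gives mean imputation instead.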

In [9]:
# Replace null values with the column mean.
# Assigning the result back avoids calling fillna(..., inplace=True) on df['bmi'],
# which raises a warning and will stop working in pandas 3.0.
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())


In [10]:
df.isnull().sum()

Unnamed: 0,0
id,0
gender,0
age,0
hypertension,0
heart_disease,0
ever_married,0
work_type,0
Residence_type,0
avg_glucose_level,0
bmi,0



# Encoding Categorical Data

Encoding categorical data is an essential step in the data preprocessing phase of machine learning, especially when working with algorithms that require numerical input. Categorical data refers to variables that represent categories or groups, such as gender, color, or type of work. Since most machine learning algorithms operate on numerical data, categorical variables need to be converted into a numerical format.

## Common Methods for Encoding Categorical Data

### 1. Label Encoding
- **Description**: Each category is assigned a unique integer value. For example, if you have a column for "Gender" with categories "Male" and "Female," you might encode "Male" as 0 and "Female" as 1.
- **Use Case**: Suitable for ordinal categorical variables where the categories have a meaningful order (e.g., "Low," "Medium," "High").
- **Limitation**: For nominal categorical variables (no intrinsic order), label encoding can introduce unintended ordinal relationships, which may mislead the model.

### 2. One-Hot Encoding
- **Description**: Each category is converted into a new binary column (0 or 1). For example, if you have a "Color" column with categories "Red," "Green," and "Blue," one-hot encoding will create three new columns: "Color_Red," "Color_Green," and "Color_Blue."
- **Use Case**: Ideal for nominal categorical variables where there is no inherent order among categories.
- **Limitation**: Can lead to a high-dimensional feature space if there are many unique categories, which may increase computational complexity and the risk of overfitting.

### 3. Binary Encoding
- **Description**: Combines the features of label encoding and one-hot encoding. Each category is first converted to an integer, and then that integer is converted to binary code. Each binary digit becomes a separate column.
- **Use Case**: Useful when dealing with high cardinality categorical variables, as it reduces the dimensionality compared to one-hot encoding.
- **Limitation**: More complex to implement and interpret than one-hot encoding.

### 4. Target Encoding (Mean Encoding)
- **Description**: Each category is replaced with the mean of the target variable for that category. For example, if you have a "City" column and you want to predict house prices, you would replace each city with the average house price in that city.
- **Use Case**: Effective for categorical variables with a strong relationship to the target variable.
- **Limitation**: Can lead to overfitting, especially if the dataset is small or if there are many categories. It is essential to use techniques like cross-validation to mitigate this risk.

### 5. Frequency Encoding
- **Description**: Each category is replaced with its frequency (the number of occurrences) in the dataset. For example, if "City A" appears 100 times and "City B" appears 50 times, you would replace "City A" with 100 and "City B" with 50.
- **Use Case**: Useful for high cardinality categorical variables and can help capture the importance of categories based on their frequency.
- **Limitation**: May not capture the relationship between the category and the target variable as effectively as other methods.

## Choosing the Right Encoding Method

The choice of encoding method depends on several factors:
- **Nature of the Categorical Variable**: Determine whether the variable is nominal (no order) or ordinal (has a meaningful order).
- **Cardinality**: Consider the number of unique categories. High cardinality variables may benefit from binary or frequency encoding to reduce dimensionality.
- **Model Requirements**: Some machine learning algorithms (e.g., tree-based models) can handle categorical variables directly, while others (e.g., linear models) require numerical input.
- **Risk of Overfitting**: Be cautious with methods like target encoding, which can lead to overfitting if not handled properly.
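
Three of the methods above can be sketched with pandas alone. The toy columns `color` and `size` are assumptions chosen to mirror the examples in the descriptions and are not part of the stroke dataset.

```python
import pandas as pd

# Hypothetical toy data (not from the stroke dataset)
toy = pd.DataFrame({
    "color": ["Red", "Green", "Blue", "Green"],
    "size": ["Low", "High", "Medium", "Low"],
})

# Label encoding of an ordinal column with an explicit, meaningful order
size_order = {"Low": 0, "Medium": 1, "High": 2}
size_encoded = toy["size"].map(size_order)

# One-hot encoding of a nominal column: one binary column per category
color_onehot = pd.get_dummies(toy["color"], prefix="color", dtype=int)

# Frequency encoding: replace each category with its number of occurrences
color_freq = toy["color"].map(toy["color"].value_counts())
```

Writing the ordinal mapping by hand keeps the integer order under the author's control, which an automatic encoder that sorts or enumerates categories does not guarantee.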


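Label encoding with scikit-learn's `LabelEncoder` looks like this; the notebook cell that follows builds the same kind of mapping by hand:

# Import LabelEncoder from scikit-learn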
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
le = LabelEncoder()

# List of columns to encode
columns_to_encode = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']

# Perform label encoding for each specified column
for col in columns_to_encode:
    df[col] = le.fit_transform(df[col])

In [11]:
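# Function to label-encode the given columns
def label_encode(df, columns):
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()
    for col in columns: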
        # Identify unique categories
        unique_categories = df[col].unique()

        # Create a mapping from category to integer
        category_to_int = {category: idx for idx, category in enumerate(unique_categories)}

        # Map the categories to integers
        df[col] = df[col].map(category_to_int)

    return df

# List of columns to encode
columns_to_encode = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']

# Perform label encoding
df_encoded = label_encode(df, columns_to_encode)

In [12]:
df_encoded

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,0,67.0,0,1,0,0,0,228.69,36.600000,0,1
1,51676,1,61.0,0,0,0,1,1,202.21,28.893237,1,1
2,31112,0,80.0,0,1,0,0,1,105.92,32.500000,1,1
3,60182,1,49.0,0,0,0,0,0,171.23,34.400000,2,1
4,1665,1,79.0,1,0,0,1,1,174.12,24.000000,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,1,80.0,1,0,0,0,0,83.75,28.893237,1,0
5106,44873,1,81.0,0,0,0,1,0,125.20,40.000000,1,0
5107,19723,1,35.0,0,0,0,1,1,82.99,30.600000,1,0
5108,37544,0,51.0,0,0,0,0,1,166.29,25.600000,0,0


# Splitting Dataset into Independent and Dependent Variables

In machine learning, it is essential to split the dataset into independent and dependent variables. This separation allows us to identify which features (independent variables) are used to predict the outcome (dependent variable).

## Definitions

- **Independent Variables (Features)**: These are the input variables that are used to predict the target variable. They can be numerical or categorical and represent the characteristics of the data.
  
- **Dependent Variable (Target)**: This is the output variable that we want to predict. It is the outcome that depends on the independent variables.

In [14]:
X = df_encoded.drop(columns=['stroke'])
y = df_encoded['stroke']

In [15]:
X

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
0,9046,0,67.0,0,1,0,0,0,228.69,36.600000,0
1,51676,1,61.0,0,0,0,1,1,202.21,28.893237,1
2,31112,0,80.0,0,1,0,0,1,105.92,32.500000,1
3,60182,1,49.0,0,0,0,0,0,171.23,34.400000,2
4,1665,1,79.0,1,0,0,1,1,174.12,24.000000,1
...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,1,80.0,1,0,0,0,0,83.75,28.893237,1
5106,44873,1,81.0,0,0,0,1,0,125.20,40.000000,1
5107,19723,1,35.0,0,0,0,1,1,82.99,30.600000,1
5108,37544,0,51.0,0,0,0,0,1,166.29,25.600000,0


In [16]:
y

Unnamed: 0,stroke
0,1
1,1
2,1
3,1
4,1
...,...
5105,0
5106,0
5107,0
5108,0


# Training and Test Data Splitting

In machine learning, splitting the dataset into training and test sets is a crucial step in the model development process. This practice helps ensure that the model can generalize well to unseen data.

## Definitions

- **Training Set**: This subset of the dataset is used to train the machine learning model. The model learns the patterns and relationships in the data from this set.

- **Test Set**: This subset is used to evaluate the performance of the trained model. It contains data that the model has not seen during training, allowing us to assess how well the model generalizes to new, unseen data.

## Importance of Splitting

1. **Prevent Overfitting**: By evaluating the model on a separate test set, we can determine if the model is overfitting the training data (i.e., performing well on training data but poorly on unseen data).

2. **Model Evaluation**: The test set provides an unbiased evaluation of the model's performance, helping to ensure that the model is robust and reliable.

3. **Hyperparameter Tuning**: A validation set can also be created from the training set to fine-tune model hyperparameters without compromising the integrity of the test set.

## Example: Stroke Prediction Dataset

Consider a stroke prediction dataset with various features, including age, gender, and medical history. The goal is to predict whether an individual has had a stroke.


In [18]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
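
When one class is much rarer than the other, a stratified split keeps the class ratio similar in both subsets. The sketch below uses the `stratify` parameter of `train_test_split` on made-up labels; the 90/10 class balance is an assumption for illustration, not a property measured on the stroke dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 negatives, 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy asks for the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of positive labels in each subset
train_rate = y_tr.mean()
test_rate = y_te.mean()
```

Without `stratify`, a small test set can by chance contain very few positive cases, which makes the evaluation noisier.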


In [19]:
X_train

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
802,4970,0,79.00,0,0,0,1,1,112.64,28.5,0
3927,56137,1,62.00,0,0,0,0,0,88.32,36.3,3
2337,54590,1,21.00,0,0,1,0,1,59.52,33.7,1
3910,36548,0,31.00,0,0,0,2,0,65.70,30.4,0
1886,61171,1,31.00,0,0,1,0,1,59.63,19.9,1
...,...,...,...,...,...,...,...,...,...,...,...
4426,13846,0,43.00,0,0,0,2,1,88.00,30.6,1
466,1307,1,61.00,1,0,0,0,1,170.05,60.2,2
3092,31481,1,1.16,0,0,1,3,0,97.28,17.8,3
3772,61827,0,80.00,0,0,0,1,1,196.08,31.0,0


# Feature Scaling

Feature scaling is a crucial preprocessing step in machine learning that involves transforming the features of a dataset to a similar scale. This process is essential for many machine learning algorithms that rely on the distance between data points or assume that the data is normally distributed.

## Importance of Feature Scaling

1. **Improves Model Performance**: Many algorithms, such as K-Nearest Neighbors (KNN) and Support Vector Machines (SVM), are sensitive to the scale of the input features. Scaling ensures that all features contribute equally to the distance calculations.

2. **Speeds Up Convergence**: In gradient descent optimization, feature scaling can help the algorithm converge faster by ensuring that the gradients are on a similar scale.

3. **Prevents Dominance**: Features with larger ranges can dominate the learning process, leading to biased models. Scaling helps to mitigate this issue.

## Common Methods of Feature Scaling

### 1. Min-Max Scaling (Normalization)

Min-Max scaling transforms the features to a fixed range, usually \([0, 1]\). The formula for Min-Max scaling is:

\[
X' = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
\]

- **Use Case**: Useful when the data does not follow a Gaussian distribution and when you want to preserve the relationships between the data points.

### 2. Standardization (Z-score Normalization)

Standardization transforms the features to have a mean of 0 and a standard deviation of 1. The formula for standardization is:

\[
X' = \frac{X - \mu}{\sigma}
\]

where \( \mu \) is the mean and \( \sigma \) is the standard deviation of the feature.

- **Use Case**: Useful when the data follows a Gaussian distribution and when you want to ensure that the features have similar distributions.

### 3. Robust Scaling

Robust scaling uses the median and the interquartile range (IQR) to scale the features. The formula is:

\[
X' = \frac{X - \text{median}}{IQR}
\]

- **Use Case**: Useful when the dataset contains outliers, as it is less sensitive to extreme values compared to Min-Max scaling and standardization.
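
The three formulas can be compared on a single made-up feature that contains one outlier. This is a minimal NumPy sketch; the values are assumptions for illustration.

```python
import numpy as np

# Made-up feature values with one outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-Max scaling to [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (NumPy's std defaults to the population form, ddof=0)
x_standard = (x - x.mean()) / x.std()

# Robust scaling with the median and interquartile range
q1, q3 = np.percentile(x, [25, 75])
x_robust = (x - np.median(x)) / (q3 - q1)
```

With Min-Max scaling the outlier pushes the four typical values into roughly the bottom 3% of the range, whereas robust scaling keeps them spread between -1 and 0.5.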

In [22]:
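# Drop the identifier column: it is not a meaningful feature and would distort distances.
# Reassigning (instead of inplace=True) avoids modifying a slice of the original frame.
X_train = X_train.drop(columns=['id'])
X_test = X_test.drop(columns=['id'])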

# Function to perform standard scaling
def standard_scaling(X_train, X_test):
    # Calculate the mean and standard deviation from the training set only
    # (pandas .std() returns the sample standard deviation, ddof=1)
    mean = X_train.mean()
    std_dev = X_train.std()

    # Standardize the training set
    X_train_scaled = (X_train - mean) / std_dev

    # Standardize the test set using the training set's mean and std deviation
    X_test_scaled = (X_test - mean) / std_dev

    return X_train_scaled, X_test_scaled

# Perform standard scaling on X_train and X_test
X_train_scaled, X_test_scaled = standard_scaling(X_train, X_test)

In [23]:
X_train_scaled

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
802,-1.192849,1.583961,-0.321942,-0.23616,-0.725916,0.144060,1.023140,0.135576,-0.058009,-1.449779
3927,0.838124,0.829606,-0.321942,-0.23616,-0.725916,-0.750951,-0.977145,-0.397409,0.947346,1.294909
2337,0.838124,-0.989720,-0.321942,-0.23616,1.377233,-0.750951,1.023140,-1.028575,0.612228,-0.534883
3910,-1.192849,-0.545982,-0.321942,-0.23616,-0.725916,1.039071,-0.977145,-0.893137,0.186885,-1.449779
1886,0.838124,-0.545982,-0.321942,-0.23616,1.377233,-0.750951,1.023140,-1.026164,-1.166478,-0.534883
...,...,...,...,...,...,...,...,...,...,...
4426,-1.192849,-0.013496,-0.321942,-0.23616,-0.725916,1.039071,1.023140,-0.404421,0.212664,-0.534883
466,0.838124,0.785232,3.105394,-0.23616,-0.725916,-0.750951,1.023140,1.393745,4.027858,0.380013
3092,0.838124,-1.870096,-0.321942,-0.23616,1.377233,1.934082,-0.977145,-0.201046,-1.437150,1.294909
3772,-1.192849,1.628335,-0.321942,-0.23616,-0.725916,0.144060,1.023140,1.964206,0.264220,-1.449779


# Model Training

# K-Nearest Neighbors (KNN)

## Introduction

K-Nearest Neighbors (KNN) is a simple, non-parametric, and supervised machine learning algorithm used for classification and regression tasks. It is one of the most straightforward algorithms and is widely used due to its intuitive nature and ease of implementation.

## How KNN Works

The KNN algorithm operates on the principle of finding the `k` nearest data points (neighbors) to a given input data point in the feature space. The steps involved in the KNN algorithm are as follows:

1. **Choose the number of neighbors (k)**: The user selects the number of nearest neighbors to consider for making predictions.

2. **Calculate the distance**: For a given input data point, the algorithm calculates the distance between this point and all other points in the training dataset. Common distance metrics include:
   - Euclidean distance
   - Manhattan distance
   - Minkowski distance

3. **Identify the nearest neighbors**: The algorithm identifies the `k` data points in the training set that are closest to the input data point based on the calculated distances.

4. **Make predictions**:
   - **For Classification**: The algorithm assigns the class label that is most common among the `k` nearest neighbors (majority voting).
   - **For Regression**: The algorithm predicts the output value by averaging the values of the `k` nearest neighbors.
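
The distance and voting steps above can be sketched as follows. `minkowski_distance` is a hypothetical helper written for this illustration; Euclidean and Manhattan distance are its special cases for `p=2` and `p=1`.

```python
from collections import Counter

import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two 1-D points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

euclidean = minkowski_distance(point_a, point_b, p=2)  # p=2 gives Euclidean
manhattan = minkowski_distance(point_a, point_b, p=1)  # p=1 gives Manhattan

# Majority voting over the labels of the k nearest neighbors (made-up labels)
neighbor_labels = [0, 1, 0]
predicted_class = Counter(neighbor_labels).most_common(1)[0][0]

# Regression variant: average the neighbors' target values (made-up values)
neighbor_values = [2.0, 4.0, 6.0]
predicted_value = float(np.mean(neighbor_values))
```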

## Advantages of KNN

- **Simplicity**: KNN is easy to understand and implement, making it a good choice for beginners in machine learning.
- **No Explicit Training Phase**: KNN is a lazy learner. Fitting only stores the training data, and all computation is deferred to prediction time.
- **Versatility**: KNN can be used for both classification and regression tasks.

## Disadvantages of KNN

- **Computationally Expensive**: KNN can be slow for large datasets, as it requires calculating the distance to all training samples for each prediction.
- **Sensitive to Irrelevant Features**: The presence of irrelevant features can negatively impact the performance of the KNN algorithm.
- **Choice of k**: The performance of KNN can be sensitive to the choice of `k`. A small value of `k` makes predictions sensitive to noise, while a large value can over-smooth the decision boundary and, on imbalanced data, pull predictions toward the majority class.
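
One common way to pick `k` is to score several candidates on a held-out validation set. The sketch below does this on synthetic two-class data; the data, the candidate list, and the helper `knn_predict` are all assumptions for illustration and not part of the notebook.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Majority-vote KNN with Euclidean distance (labels are non-negative ints)."""
    # Pairwise squared distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k closest training points for each query row
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.array([np.bincount(y_train[row]).argmax() for row in nearest])

# Synthetic two-class data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(2.5, 1.0, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

# Shuffle, then hold out the last 30 rows as a validation set
order = rng.permutation(len(X))
X, y = X[order], y[order]
X_fit, y_fit, X_val, y_val = X[:90], y[:90], X[90:], y[90:]

# Score each candidate k on the validation set and keep the best one
candidates = [1, 3, 5, 7, 9]
scores = {k: float(np.mean(knn_predict(X_fit, y_fit, X_val, k) == y_val))
          for k in candidates}
best_k = max(scores, key=scores.get)
```

Odd values of `k` are used so that a two-class vote cannot end in a tie. The test set should stay untouched during this search and be used only for the final evaluation.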

## Use Cases

KNN is commonly used in various applications, including:

- **Recommendation Systems**: KNN can be used to recommend products based on user preferences and similarities.
- **Image Recognition**: KNN can classify images based on pixel intensity and color features.
- **Medical Diagnosis**: KNN can help in diagnosing diseases by comparing patient data with historical cases.

In [31]:
#build a class of KNN
class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X_train = X.values
        self.y_train = y.values

    def predict(self, X):
        predictions = [self._predict(x) for x in X.values]
        return np.array(predictions)

    def _predict(self, x):
        # Calculate distances between x and all points in the training set
        distances = np.sqrt(np.sum((self.X_train - x) ** 2, axis=1))

        # Get the indices of the k nearest neighbors
        k_indices = np.argsort(distances)[:self.k]

        # Get the labels of the k nearest neighbors
        k_nearest_labels = self.y_train[k_indices]

        # Return the most common class label (labels must be non-negative integers)
        most_common = np.bincount(k_nearest_labels).argmax()
        return most_common

# Initialize and train the KNN model
k = 5  # You can choose the value of k
knn = KNN(k)
knn.fit(X_train_scaled, y_train)

# Make predictions on the test set
predictions = knn.predict(X_test_scaled)

# Evaluate the model
accuracy = np.mean(predictions == y_test.values)
print(f"Predictions: {predictions}")
print(f"True Labels: {y_test.values}")
print(f"Accuracy: {accuracy * 100:.2f}%")

Predictions: [0 0 0 ... 0 0 0]
True Labels: [0 0 0 ... 0 0 0]
Accuracy: 93.93%
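
Accuracy alone can be misleading when one class dominates: a model that always predicts 0 scores well whenever most labels are 0. The sketch below computes a confusion matrix with precision and recall on made-up labels, not on the notebook's actual predictions; `binary_report` is a hypothetical helper name.

```python
import numpy as np

def binary_report(y_true, y_pred):
    """Confusion counts, accuracy, precision and recall for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision, "recall": recall}

# Made-up example: 19 negatives, 1 positive, and a model that always predicts 0
y_true_toy = [0] * 19 + [1]
y_pred_toy = [0] * 20
report = binary_report(y_true_toy, y_pred_toy)
```

In this toy case the accuracy is 95% while the recall for the positive class is 0. On the notebook's variables, the same helper could be called as `binary_report(y_test.values, predictions)` to check how many stroke cases the model actually finds.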
