Instructor: **Carlos Mejia**

Student: **Gabriela Sánchez**

# Setup
In this notebook section, we will import the libraries needed to run this code.

In [585]:
import os
import re
import missingno as msno
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objs as go
import plotly.offline as py
import warnings

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

#from imblearn.over_sampling import SMOTE
from plotly.offline import iplot
from plotly.subplots import make_subplots
from sklearn.compose import ColumnTransformer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.dummy import DummyClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, RepeatedStratifiedKFold, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PowerTransformer, MinMaxScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score

warnings.filterwarnings("ignore")

# Constants
In a Jupyter Notebook, creating constant variables can be important for several reasons:

* **Readability and Maintainability**: Using constant variables with meaningful names can improve the readability of your code. It makes it easier for others (or even yourself in the future) to understand the purpose of the values being used throughout the notebook.

* **Code Consistency**: By defining constants, you ensure that specific values are consistently used across the notebook. If you need to change the value later, you only have to modify it in one place, reducing the risk of errors due to inconsistent values.

* **Preventing Magic Numbers**: Magic numbers are hardcoded numeric values scattered throughout the code without any explanation or context. Using constants instead of magic numbers makes the code self-documenting and provides context for the values used.

* **Flexibility**: If you need to change a value that is used in multiple places, having it defined as a constant allows you to change it once, and the change will automatically apply throughout the notebook.

* **Easy Debugging**: When debugging the code, having constants allows you to quickly check the values being used in different parts of the notebook without having to search for where they are defined.

* **Unit Testing**: If you plan to write unit tests for your code, using constants can make it easier to define test cases and assert expected results.

In [586]:
DATASETS_DIR = './datasets/'
URL = 'C:/Users/gcsanchez/Documents/GitHub/mlops_mod3/module-3/datasets/healthcare-dataset-stroke-data.csv'
DROP_COLS = ['id']
RETRIEVED_DATA = 'raw-data-stroke.csv'

SEED_SPLIT = 404
TRAIN_DATA_FILE = DATASETS_DIR + 'train.csv'
TEST_DATA_FILE  = DATASETS_DIR + 'test.csv'

TARGET = 'stroke'
FEATURES = ['gender','age','hypertension','heart_disease','ever_married','work_type','Residence_type','avg_glucose_level','bmi','smoking_status']
NUMERICAL_VARS = ['age','avg_glucose_level', 'bmi']
CATEGORICAL_VARS = ['gender','hypertension','heart_disease','ever_married','work_type','Residence_type','smoking_status']

NUMERICAL_VARS_WITH_NA = ['bmi']
CATEGORICAL_VARS_WITH_NA = []
NUMERICAL_NA_NOT_ALLOWED = [var for var in NUMERICAL_VARS if var not in NUMERICAL_VARS_WITH_NA]
CATEGORICAL_NA_NOT_ALLOWED = [var for var in CATEGORICAL_VARS if var not in CATEGORICAL_VARS_WITH_NA]

SEED_MODEL = 404

SELECTED_FEATURES = ['gender','age','hypertension','heart_disease','ever_married','work_type','Residence_type','avg_glucose_level','bmi','smoking_status']

# Functions
Writing functions will help us for several things, for example:
* **Modularity**: Functions allow you to break down complex problems into smaller, manageable pieces. Each function can handle a specific task, making the code easier to understand, test, and maintain. This concept is known as "modularity."

* **Reusability**: Once you define a function, you can use it multiple times throughout your code or even in other projects. This promotes code reuse and saves time since you don't have to rewrite the same logic each time you need it.

In [587]:
def data_retrieval(url):

    # Loading data from specific url
    data = pd.read_csv(url)

    # Uncovering missing data
    data.replace('?', np.nan, inplace=True)

    # Droping irrelevant columns
    data.drop(DROP_COLS, axis=1, inplace=True)

    # Create directory if it does not exist
    if not os.path.exists(DATASETS_DIR):
        os.makedirs(DATASETS_DIR)
        print(f"Directory '{DATASETS_DIR}' created successfully.")
    else:
        print(f"Directory '{DATASETS_DIR}' already exists.")

    # Save data to CSV file
    data.to_csv(DATASETS_DIR + RETRIEVED_DATA, index=False)

    return print('Data stored in {}'.format(DATASETS_DIR + RETRIEVED_DATA))

data_retrieval(URL)

def evaluate_model(X, y, model):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    return scores

def get_models():
    models, names = list(), list()
    models.append(LogisticRegression(solver='liblinear'))
    names.append('LR')
    models.append(LinearDiscriminantAnalysis())
    names.append('LDA')
    models.append(SVC(gamma='scale'))
    names.append('SVM')
    return models, names

Directory './datasets/' already exists.
Data stored in ./datasets/raw-data-stroke.csv


# Custom Transformers
Custom transformers are really important if we want to have high-quality code, able to be maintaned, changed and be reused by other pieces of code.

The following code is the migration from [3-create-convenient-classes.ipynb](../session-7/3-create-convenient-classes.ipynb) notebook.

In [588]:
class MissingIndicator(BaseEstimator, TransformerMixin):
    """
    Custom scikit-learn transformer to create indicator features for missing values in specified variables.

    Parameters:
        variables (list or str, optional): List of column names (variables) to create indicator features for.
            If a single string is provided, it will be treated as a single variable. Default is None.

    Attributes:
        variables (list): List of column names (variables) to create indicator features for.

    Methods:
        fit(X, y=None):
            This method does not perform any actual training or fitting.
            It returns the transformer instance itself.

        transform(X):
            Creates indicator features for missing values in the specified variables and returns the modified DataFrame.

    Example usage:
    ```
    from sklearn.pipeline import Pipeline

    # Instantiate the custom transformer
    missing_indicator = MissingIndicator(variables=['age', 'income'])

    # Define the pipeline with the custom transformer
    pipeline = Pipeline([
        ('missing_indicator', missing_indicator),
        # Other pipeline steps...
    ])

    # Fit and transform the data using the pipeline
    X_transformed = pipeline.fit_transform(X)
    ```
    """
    def __init__(self, variables=None):
        """
        Initialize the MissingIndicator transformer.

        Parameters:
            variables (list or str, optional): List of column names (variables) to create indicator features for.
                If a single string is provided, it will be treated as a single variable. Default is None.
        """
        if not isinstance(variables, list):
            self.variables = [variables]
        else:
            self.variables = variables

    def fit(self, X, y=None):
        """
        This method does not perform any actual training or fitting, as indicator features are created based on data.
        It returns the transformer instance itself.

        Parameters:
            X (pd.DataFrame): Input data to be transformed. Not used in this method.
            y (pd.Series or np.array, optional): Target variable. Not used in this method.

        Returns:
            self (MissingIndicator): The transformer instance.
        """
        return self

    def transform(self, X):
        """
        Creates indicator features for missing values in the specified variables and returns the modified DataFrame.

        Parameters:
            X (pd.DataFrame): Input data to be transformed.

        Returns:
            X_transformed (pd.DataFrame): Transformed DataFrame with additional indicator features for missing values.
        """
        X = X.copy()
        for var in self.variables:
            X[f'{var}_nan'] = X[var].isnull().astype(int)

        return X

# create_missing_flag = MissingIndicator(variables=NUMERICAL_VARS)
# X_train = create_missing_flag.transform(X_train)
# X_train

In [589]:
class ExtractLetters(BaseEstimator, TransformerMixin):
    """
    Custom scikit-learn transformer to extract letters from a specified variable.

    Parameters:
        None

    Attributes:
        variable (str): The name of the column (variable) from which letters will be extracted.

    Methods:
        fit(X, y=None):
            This method does not perform any actual training or fitting.
            It returns the transformer instance itself.

        transform(X):
            Extracts letters from the specified variable and returns the modified DataFrame.

    Example usage:
    ```
    from sklearn.pipeline import Pipeline

    # Instantiate the custom transformer
    extractor = ExtractLetters()

    # Define the pipeline with the custom transformer
    pipeline = Pipeline([
        ('extractor', extractor),
        # Other pipeline steps...
    ])

    # Fit and transform the data using the pipeline
    X_transformed = pipeline.fit_transform(X)
    ```
    """
    def __init__(self):
        """
        Initialize the ExtractLetters transformer.

        Parameters:
            None
        """
        self.variable = 'cabin'

    def fit(self, X, y=None):
        """
        This method does not perform any actual training or fitting, as it is not necessary for this transformer.
        It returns the transformer instance itself.

        Parameters:
            X (pd.DataFrame): Input data to be transformed. Not used in this method.
            y (pd.Series or np.array, optional): Target variable. Not used in this method.

        Returns:
            self (ExtractLetters): The transformer instance.
        """
        return self

    def transform(self, X):
        """
        Extracts letters from the specified variable and returns the modified DataFrame.

        Parameters:
            X (pd.DataFrame): Input data to be transformed.

        Returns:
            X_transformed (pd.DataFrame): Transformed DataFrame with letters extracted from the specified variable.
        """
        X = X.copy()
        X[self.variable] = X[self.variable].apply(lambda x: ''.join(re.findall("[a-zA-Z]+", x)) if type(x)==str else x)
        return X

# extractor = ExtractLetters()
# X_train = extractor.transform(X_train)
# X_train

In [590]:
class CategoricalImputer(BaseEstimator, TransformerMixin):
    """
    Custom scikit-learn transformer to impute missing values in categorical variables.

    Parameters:
        variables (list or str, optional): List of column names (variables) to impute missing values for.
            If a single string is provided, it will be treated as a single variable. Default is None.

    Attributes:
        variables (list): List of column names (variables) to impute missing values for.

    Methods:
        fit(X, y=None):
            This method does not perform any actual training or fitting.
            It returns the transformer instance itself.

        transform(X):
            Imputes missing values in the specified categorical variables and returns the modified DataFrame.

    Example usage:
    ```
    from sklearn.pipeline import Pipeline

    # Instantiate the custom transformer
    imputer = CategoricalImputer(variables=['category1', 'category2'])

    # Define the pipeline with the custom transformer
    pipeline = Pipeline([
        ('imputer', imputer),
        # Other pipeline steps...
    ])

    # Fit and transform the data using the pipeline
    X_transformed = pipeline.fit_transform(X)
    ```
    """
    def __init__(self, variables=None):
        """
        Initialize the CategoricalImputer transformer.

        Parameters:
            variables (list or str, optional): List of column names (variables) to impute missing values for.
                If a single string is provided, it will be treated as a single variable. Default is None.
        """
        self.variables = [variables] if not isinstance(variables, list) else variables

    def fit(self, X, y=None):
        """
        This method does not perform any actual training or fitting, as imputation is based on data.
        It returns the transformer instance itself.

        Parameters:
            X (pd.DataFrame): Input data to be transformed. Not used in this method.
            y (pd.Series or np.array, optional): Target variable. Not used in this method.

        Returns:
            self (CategoricalImputer): The transformer instance.
        """
        return self

    def transform(self, X):
        """
        Imputes missing values in the specified categorical variables and returns the modified DataFrame.

        Parameters:
            X (pd.DataFrame): Input data to be transformed.

        Returns:
            X_transformed (pd.DataFrame): Transformed DataFrame with missing values imputed for the specified categorical variables.
        """
        X = X.copy()
        for var in self.variables:
            X[var] = X[var].fillna('Missing')
        return X

# imputer = CategoricalImputer(variables=CATEGORICAL_VARS_WITH_NA)
# X_train = imputer.transform(X_train)
# X_train

In [591]:
class NumericalImputer(BaseEstimator, TransformerMixin):
    """
    Custom scikit-learn transformer to impute missing values in numerical variables.

    Parameters:
        variables (list or str, optional): List of column names (variables) to impute missing values for.
            If a single string is provided, it will be treated as a single variable. Default is None.

    Attributes:
        variables (list): List of column names (variables) to impute missing values for.
        median_dict_ (dict): Dictionary to store the median values for each specified numerical variable during fitting.

    Methods:
        fit(X, y=None):
            Calculates the median values for the specified numerical variables from the training data.
            It returns the transformer instance itself.

        transform(X):
            Imputes missing values in the specified numerical variables using the median values and returns the modified DataFrame.

    Example usage:
    ```
    from sklearn.pipeline import Pipeline

    # Instantiate the custom transformer
    imputer = NumericalImputer(variables=['age', 'income'])

    # Define the pipeline with the custom transformer
    pipeline = Pipeline([
        ('imputer', imputer),
        # Other pipeline steps...
    ])

    # Fit and transform the data using the pipeline
    X_transformed = pipeline.fit_transform(X)
    ```
    """
    def __init__(self, variables=None):
        """
        Initialize the NumericalImputer transformer.

        Parameters:
            variables (list or str, optional): List of column names (variables) to impute missing values for.
                If a single string is provided, it will be treated as a single variable. Default is None.
        """
        self.variables = [variables] if not isinstance(variables, list) else variables

    def fit(self, X, y=None):
        """
        Calculates the median values for the specified numerical variables from the training data.

        Parameters:
            X (pd.DataFrame): Input data to be transformed.

        Returns:
            self (NumericalImputer): The transformer instance.
        """
        self.median_dict = {}
        for var in self.variables:
            self.median_dict[var] = X[var].median()
        return self


    def transform(self, X):
        """
        Imputes missing values in the specified numerical variables using the median values and returns the modified DataFrame.

        Parameters:
            X (pd.DataFrame): Input data to be transformed.

        Returns:
            X_transformed (pd.DataFrame): Transformed DataFrame with missing values imputed for the specified numerical variables.
        """
        X = X.copy()
        for var in self.variables:
            X[var] = X[var].fillna(self.median_dict[var])
        return X

# print(NUMERICAL_VARS_WITH_NA)
# median_imputation = NumericalImputer(variables=NUMERICAL_VARS_WITH_NA)
# median_imputation.fit(X_train)
# X_train = median_imputation.transform(X_train)
# X_train

In [592]:
class RareLabelCategoricalEncoder(BaseEstimator, TransformerMixin):
    """
    Custom scikit-learn transformer to encode rare categories in categorical variables.

    Parameters:
        tol (float, optional): The tolerance level to define rare categories.
            Categories with a frequency lower than tol will be encoded as 'rare'.
            Default is 0.05.
        variables (list or str, optional): List of column names (variables) to encode rare categories for.
            If a single string is provided, it will be treated as a single variable. Default is None.

    Attributes:
        tol (float): The tolerance level to define rare categories.
        variables (list): List of column names (variables) to encode rare categories for.
        rare_labels_dict (dict): Dictionary to store the rare category labels for each specified categorical variable during fitting.

    Methods:
        fit(X, y=None):
            Calculates the rare category labels for the specified categorical variables from the training data.
            It returns the transformer instance itself.

        transform(X):
            Encodes rare categories in the specified categorical variables and returns the modified DataFrame.

    Example usage:
    ```
    from sklearn.pipeline import Pipeline

    # Instantiate the custom transformer
    encoder = RareLabelCategoricalEncoder(tol=0.1, variables=['category1', 'category2'])

    # Define the pipeline with the custom transformer
    pipeline = Pipeline([
        ('encoder', encoder),
        # Other pipeline steps...
    ])

    # Fit and transform the data using the pipeline
    X_transformed = pipeline.fit_transform(X)
    ```
    """
    def __init__(self, tol=0.05, variables=None):
        """
        Initialize the RareLabelCategoricalEncoder transformer.

        Parameters:
            tol (float, optional): The tolerance level to define rare categories.
                Categories with a frequency lower than tol will be encoded as 'rare'.
                Default is 0.05.
            variables (list or str, optional): List of column names (variables) to encode rare categories for.
                If a single string is provided, it will be treated as a single variable. Default is None.
        """
        self.tol = tol
        self.variables = [variables] if not isinstance(variables, list) else variables

    def fit(self, X, y=None):
        """
        Calculates the rare category labels for the specified categorical variables from the training data.

        Parameters:
            X (pd.DataFrame): Input data to be transformed.

        Returns:
            self (RareLabelCategoricalEncoder): The transformer instance.
        """
        self.rare_labels_dict = {}
        for var in self.variables:
            t = pd.Series(X[var].value_counts() / float(X.shape[0]))
            self.rare_labels_dict[var] = list(t[t<self.tol].index)
        return self

    def transform(self, X):
        """
        Encodes rare categories in the specified categorical variables and returns the modified DataFrame.

        Parameters:
            X (pd.DataFrame): Input data to be transformed.

        Returns:
            X_transformed (pd.DataFrame): Transformed DataFrame with rare categories encoded for the specified categorical variables.
        """
        X = X.copy()
        for var in self.variables:
            X[var] = np.where(X[var].isin(self.rare_labels_dict[var]), 'rare', X[var])
        return X

# print(CATEGORICAL_VARS)
# encoder = RareLabelCategoricalEncoder(tol=0.05, variables=CATEGORICAL_VARS)
# encoder.fit(X_train)
# X_train = encoder.transform(X_train)
# X_train

In [593]:
class OneHotEncoder(BaseEstimator, TransformerMixin):
    """
    Custom scikit-learn transformer to perform one-hot encoding for categorical variables.

    Parameters:
        variables (list or str, optional): List of column names (variables) to perform one-hot encoding for.
            If a single string is provided, it will be treated as a single variable. Default is None.

    Attributes:
        variables (list): List of column names (variables) to perform one-hot encoding for.
        dummies (list): List of column names representing the one-hot encoded dummy variables.

    Methods:
        fit(X, y=None):
            Calculates the one-hot encoded dummy variable columns for the specified categorical variables from the training data.
            It returns the transformer instance itself.

        transform(X):
            Performs one-hot encoding for the specified categorical variables and returns the modified DataFrame.

    Example usage:
    ```
    from sklearn.pipeline import Pipeline

    # Instantiate the custom transformer
    encoder = OneHotEncoder(variables=['category1', 'category2'])

    # Define the pipeline with the custom transformer
    pipeline = Pipeline([
        ('encoder', encoder),
        # Other pipeline steps...
    ])

    # Fit and transform the data using the pipeline
    X_transformed = pipeline.fit_transform(X)
    ```
    """
    def __init__(self, variables=None):
        """
        Initialize the OneHotEncoder transformer.

        Parameters:
            variables (list or str, optional): List of column names (variables) to perform one-hot encoding for.
                If a single string is provided, it will be treated as a single variable. Default is None.
        """
        self.variables = [variables] if not isinstance(variables, list) else variables

    def fit(self, X, y=None):
        """
        Calculates the one-hot encoded dummy variable columns for the specified categorical variables from the training data.

        Parameters:
            X (pd.DataFrame): Input data to be transformed.

        Returns:
            self (OneHotEncoder): The transformer instance.
        """
        self.dummies = pd.get_dummies(X[self.variables], drop_first=True).columns
        return self

    def transform(self, X):
        """
        Performs one-hot encoding for the specified categorical variables and returns the modified DataFrame.

        Parameters:
            X (pd.DataFrame): Input data to be transformed.

        Returns:
            X_transformed (pd.DataFrame): Transformed DataFrame with one-hot encoded dummy variables for the specified categorical variables.
        """
        X = X.copy()
        X = pd.concat([X, pd.get_dummies(X[self.variables], drop_first=True)], axis=1)
        X.drop(self.variables, axis=1)

        # Adding missing dummies, if any
        missing_dummies = [var for var in self.dummies if var not in X.columns]
        if len(missing_dummies) != 0:
            for col in missing_dummies:
                X[col] = 0

        return X


# print(CATEGORICAL_VARS)
# one_encoder = OneHotEncoder(variables=CATEGORICAL_VARS)
# one_encoder.fit(X_train)
# X_train = one_encoder.transform(X_train)
# X_train

In [594]:
class FeatureSelector(BaseEstimator, TransformerMixin):
    """
    Custom scikit-learn transformer to select specific features (columns) from a DataFrame.

    Parameters:
        feature_names (list or array-like): List of column names to select as features from the input DataFrame.

    Methods:
        fit(X, y=None):
            Placeholder method that returns the transformer instance itself.

        transform(X):
            Selects and returns the specified features (columns) from the input DataFrame.

    Example usage:
    ```
    from sklearn.pipeline import Pipeline

    # Define the feature names to be selected
    selected_features = ['feature1', 'feature2', 'feature3']

    # Instantiate the custom transformer
    feature_selector = FeatureSelector(feature_names=selected_features)

    # Define the pipeline with the custom transformer
    pipeline = Pipeline([
        ('feature_selector', feature_selector),
        # Other pipeline steps...
    ])

    # Fit and transform the data using the pipeline
    X_transformed = pipeline.fit_transform(X)
    ```
    """

    def __init__(self, feature_names):
        """
        Initialize the FeatureSelector transformer.

        Parameters:
            feature_names (list or array-like): List of column names to select as features from the input DataFrame.
        """
        self.feature_names = feature_names

    def fit(self, X, y=None):
        """
        Placeholder method that returns the transformer instance itself.

        Parameters:
            X (pd.DataFrame): Input data to be transformed.

        Returns:
            self (FeatureSelector): The transformer instance.
        """
        return self

    def transform(self, X):
        """
        Selects and returns the specified features (columns) from the input DataFrame.

        Parameters:
            X (pd.DataFrame): Input data to be transformed.

        Returns:
            X_selected (pd.DataFrame): DataFrame containing only the specified features (columns).
        """
        return X[self.feature_names]


In [595]:
class OrderingFeatures(BaseEstimator, TransformerMixin):
    """
    Custom scikit-learn transformer to order features (columns) in the same order as they appeared in the training data.

    Parameters:
        None

    Attributes:
        ordered_features (pd.Index): Index of column names representing the order of features as they appeared in the training data.

    Methods:
        fit(X, y=None):
            Records the order of features from the training data and returns the transformer instance itself.

        transform(X):
            Reorders the features in the same order as they appeared in the training data and returns the modified DataFrame.

    Example usage:
    ```
    from sklearn.pipeline import Pipeline

    # Instantiate the custom transformer
    feature_orderer = OrderingFeatures()

    # Define the pipeline with the custom transformer
    pipeline = Pipeline([
        ('feature_orderer', feature_orderer),
        # Other pipeline steps...
    ])

    # Fit and transform the data using the pipeline
    X_transformed = pipeline.fit_transform(X)
    ```
    """
    def __init__(self):
        """
        Initialize the OrderingFeatures transformer.

        Parameters:
            None
        """
        return None

    def fit(self, X, y=None):
        """
        Records the order of features from the training data.

        Parameters:
            X (pd.DataFrame): Input data to be transformed.

        Returns:
            self (OrderingFeatures): The transformer instance.
        """
        if isinstance(X, pd.DataFrame):
            self.ordered_features = X.columns
            print(self.ordered_features)
        elif isinstance(X, np.ndarray):
            self.ordered_features = np.arange(X.shape[1])
        else:
            raise ValueError("Input X must be a pandas DataFrame or a numpy array.")
        return self

    def transform(self, X):
        """
        Reorders the features in the same order as they appeared in the training data.

        Parameters:
            X (pd.DataFrame): Input data to be transformed.

        Returns:
            X_transformed (pd.DataFrame): Transformed DataFrame with features ordered as they appeared in the training data.
        """

        if isinstance(X, pd.DataFrame):
            # print(X[self.ordered_features])
            # print("return df")
            DROP_COLS_AFTER = ['gender','ever_married','work_type','Residence_type','smoking_status','gender_Other']
            #['gender','age','hypertension','heart_disease','ever_married','work_type','Residence_type','avg_glucose_level','bmi','smoking_status']
            X[self.ordered_features]
            X.drop(DROP_COLS_AFTER, axis=1, inplace=True)
            return X
        elif isinstance(X, np.ndarray):
            # print("return np")
            return X[:, self.ordered_features]
        else:
            raise ValueError("Input X must be a pandas DataFrame or a numpy array.")


# feature_orderer = OrderingFeatures()
# feature_orderer.fit(X_train)
# df = feature_orderer.transform(X_train)
# df

# Pipeline
The code below is a scikit-learn pipeline called titanic_pipeline, that is used for data preprocessing and modeling for a Titanic dataset classification task. Each step in the pipeline corresponds to a specific data transformation or modeling step.

* **`MissingIndicator`**: This is a custom transformer that creates indicator features for missing values in numerical variables. It takes the NUMERICAL_VARS as input, which represents a list of numerical column names in the dataset.

* **`ExtractLetters`**: This is a custom transformer that extracts letters from the 'cabin' variable. It aims to process the 'cabin' variable and retrieve only the alphabetical characters, discarding any numeric or special characters.

* **`CategoricalImputer`**: This is a custom transformer that imputes missing values in categorical variables. It takes the CATEGORICAL_VARS_WITH_NA as input, which represents a list of categorical column names that may contain missing values. It fills in the missing values with the string 'Missing'.

* **`NumericalImputer`**: This is a custom transformer that imputes missing values in numerical variables. It takes the NUMERICAL_VARS_WITH_NA as input, which represents a list of numerical column names that may contain missing values. It fills in the missing values with the median value of each respective variable.

* **`RareLabelCategoricalEncoder`**: This is a custom transformer that encodes rare categories in categorical variables. It takes the CATEGORICAL_VARS as input, which represents a list of categorical column names to encode rare categories for. It identifies categories with a frequency lower than 5% (tolerance of 0.05) and encodes them as 'rare'.

* **`OneHotEncoder`**: This is a custom transformer that performs one-hot encoding for categorical variables. It takes the CATEGORICAL_VARS as input, which represents a list of categorical column names to be one-hot encoded. It creates binary dummy variables for each category.

* **`OrderingFeatures`**: This is a custom transformer that orders the features (columns) in the same order as they appeared in the training data. It ensures that the order of columns in the transformed dataset is consistent with the order in which the pipeline was trained.

* **`MinMaxScaler`**: This step scales the numerical features to a specified range, typically between 0 and 1, using the Min-Max scaling technique.

* **`LogisticRegression`**: This is the final modeling step in the pipeline. It fits a logistic regression model to the preprocessed dataset. The model is specified with hyperparameters C=0.0005, class_weight='balanced', and random_state=SEED_MODEL. The C parameter is the regularization strength, 'balanced' sets the class weights to be inversely proportional to the class frequencies to handle class imbalance, and random_state is used for reproducibility.

In [596]:
df = pd.read_csv(DATASETS_DIR + RETRIEVED_DATA)

X_train, X_test, y_train, y_test = train_test_split(
                                                        df.drop(TARGET, axis=1),
                                                        df[TARGET],
                                                        test_size=0.2,
                                                        random_state=404
                                                   )

In [597]:
transformations_pipeline = Pipeline(
                              [
                                ('missing_indicator', MissingIndicator(variables=NUMERICAL_VARS)),
                                #('cabin_only_letter', ExtractLetters()),
                                ('categorical_imputer', CategoricalImputer(variables=CATEGORICAL_VARS_WITH_NA)),
                                ('median_imputation', NumericalImputer(variables=NUMERICAL_VARS_WITH_NA)),
                                #('rare_labels', RareLabelCategoricalEncoder(tol=0.05, variables=CATEGORICAL_VARS)),
                                ('dummy_vars', OneHotEncoder(variables=CATEGORICAL_VARS)),
                                #('feature_selector', FeatureSelector(SELECTED_FEATURES)),
                                #('aligning_feats', OrderingFeatures()),
                                #('scaling', MinMaxScaler()),
                              ])


In [598]:
X_train.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
4958,Male,60.0,0,0,Yes,Private,Rural,153.48,37.3,never smoked
2705,Female,34.0,0,0,No,Private,Rural,103.43,43.6,smokes
2490,Male,0.88,0,0,No,children,Urban,85.38,23.4,Unknown
190,Female,65.0,0,0,Yes,Private,Urban,205.77,46.0,formerly smoked
1872,Female,56.0,0,0,Yes,Self-employed,Rural,94.71,29.6,smokes


In [599]:
X_train = transformations_pipeline.fit_transform(X_train)

# Model Training


In this section is the experimentation for model training.

In [600]:
X_train.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,...,gender_Other,ever_married_Yes,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Urban,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
4958,Male,60.0,0,0,Yes,Private,Rural,153.48,37.3,never smoked,...,False,True,False,True,False,False,False,False,True,False
2705,Female,34.0,0,0,No,Private,Rural,103.43,43.6,smokes,...,False,False,False,True,False,False,False,False,False,True
2490,Male,0.88,0,0,No,children,Urban,85.38,23.4,Unknown,...,False,False,False,False,False,True,True,False,False,False
190,Female,65.0,0,0,Yes,Private,Urban,205.77,46.0,formerly smoked,...,False,True,False,True,False,False,True,True,False,False
1872,Female,56.0,0,0,Yes,Self-employed,Rural,94.71,29.6,smokes,...,False,True,False,False,True,False,False,False,False,True


In [601]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4088 entries, 4958 to 5108
Data columns (total 26 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   gender                          4088 non-null   object 
 1   age                             4088 non-null   float64
 2   hypertension                    4088 non-null   int64  
 3   heart_disease                   4088 non-null   int64  
 4   ever_married                    4088 non-null   object 
 5   work_type                       4088 non-null   object 
 6   Residence_type                  4088 non-null   object 
 7   avg_glucose_level               4088 non-null   float64
 8   bmi                             4088 non-null   float64
 9   smoking_status                  4088 non-null   object 
 10  age_nan                         4088 non-null   int32  
 11  avg_glucose_level_nan           4088 non-null   int32  
 12  bmi_nan                         4088

In [602]:
y_train.head()

4958    0
2705    0
2490    0
190     1
1872    0
Name: stroke, dtype: int64

In [603]:
X_train = X_train.drop(['gender','ever_married','work_type','Residence_type','smoking_status','gender_Other'], axis=1)

In [604]:
X_train

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,age_nan,avg_glucose_level_nan,bmi_nan,hypertension.1,heart_disease.1,gender_Male,ever_married_Yes,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Urban,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
4958,60.00,0,0,153.48,37.3,0,0,0,0,0,True,True,False,True,False,False,False,False,True,False
2705,34.00,0,0,103.43,43.6,0,0,0,0,0,False,False,False,True,False,False,False,False,False,True
2490,0.88,0,0,85.38,23.4,0,0,0,0,0,True,False,False,False,False,True,True,False,False,False
190,65.00,0,0,205.77,46.0,0,0,0,0,0,False,True,False,True,False,False,True,True,False,False
1872,56.00,0,0,94.71,29.6,0,0,0,0,0,False,True,False,False,True,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3579,65.00,0,0,60.70,31.3,0,0,0,0,0,False,False,False,False,False,False,True,False,True,False
2119,42.00,0,0,68.24,33.1,0,0,0,0,0,True,True,False,True,False,False,True,True,False,False
2657,36.00,0,0,129.43,29.7,0,0,0,0,0,False,True,False,False,False,False,False,False,True,False
4721,77.00,0,0,90.96,31.5,0,0,0,0,0,False,True,False,True,False,False,False,True,False,False


In [605]:
X_train.info

<bound method DataFrame.info of         age  hypertension  heart_disease  avg_glucose_level   bmi  age_nan  \
4958  60.00             0              0             153.48  37.3        0   
2705  34.00             0              0             103.43  43.6        0   
2490   0.88             0              0              85.38  23.4        0   
190   65.00             0              0             205.77  46.0        0   
1872  56.00             0              0              94.71  29.6        0   
...     ...           ...            ...                ...   ...      ...   
3579  65.00             0              0              60.70  31.3        0   
2119  42.00             0              0              68.24  33.1        0   
2657  36.00             0              0             129.43  29.7        0   
4721  77.00             0              0              90.96  31.5        0   
5108  51.00             0              0             166.29  25.6        0   

      avg_glucose_level_nan  bm

In [606]:
logistic_regression = LogisticRegression(C=0.0005, class_weight='balanced', random_state=SEED_MODEL)
logistic_regression.fit(X_train, y_train)

In [607]:
X_test

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
3188,Female,35.0,0,0,Yes,Private,Rural,104.40,24.4,never smoked
2320,Female,56.0,0,0,Yes,Private,Urban,113.20,38.7,smokes
4027,Female,38.0,0,0,Yes,Govt_job,Rural,64.27,27.3,never smoked
3564,Female,55.0,0,1,Yes,Private,Urban,199.38,39.0,Unknown
4537,Female,51.0,0,0,Yes,Govt_job,Urban,81.38,34.1,smokes
...,...,...,...,...,...,...,...,...,...,...
1138,Female,38.0,0,0,Yes,Private,Urban,91.68,42.8,formerly smoked
4918,Male,75.0,0,0,Yes,Govt_job,Rural,79.49,28.9,Unknown
574,Male,18.0,0,0,No,Private,Urban,112.17,31.7,Unknown
2618,Female,37.0,0,0,Yes,Private,Rural,86.49,24.4,Unknown


In [608]:
X_test = transformations_pipeline.fit_transform(X_test)

In [609]:
X_test.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,...,gender_Male,ever_married_Yes,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Urban,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
3188,Female,35.0,0,0,Yes,Private,Rural,104.4,24.4,never smoked,...,False,True,False,True,False,False,False,False,True,False
2320,Female,56.0,0,0,Yes,Private,Urban,113.2,38.7,smokes,...,False,True,False,True,False,False,True,False,False,True
4027,Female,38.0,0,0,Yes,Govt_job,Rural,64.27,27.3,never smoked,...,False,True,False,False,False,False,False,False,True,False
3564,Female,55.0,0,1,Yes,Private,Urban,199.38,39.0,Unknown,...,False,True,False,True,False,False,True,False,False,False
4537,Female,51.0,0,0,Yes,Govt_job,Urban,81.38,34.1,smokes,...,False,True,False,False,False,False,True,False,False,True


In [610]:
X_test = X_test.drop(['gender','ever_married','work_type','Residence_type','smoking_status'], axis=1)

In [611]:
X_test.head()

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,age_nan,avg_glucose_level_nan,bmi_nan,hypertension.1,heart_disease.1,gender_Male,ever_married_Yes,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Urban,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
3188,35.0,0,0,104.4,24.4,0,0,0,0,0,False,True,False,True,False,False,False,False,True,False
2320,56.0,0,0,113.2,38.7,0,0,0,0,0,False,True,False,True,False,False,True,False,False,True
4027,38.0,0,0,64.27,27.3,0,0,0,0,0,False,True,False,False,False,False,False,False,True,False
3564,55.0,0,1,199.38,39.0,0,0,0,0,1,False,True,False,True,False,False,True,False,False,False
4537,51.0,0,0,81.38,34.1,0,0,0,0,0,False,True,False,False,False,False,True,False,False,True


In [612]:
X_test.info

<bound method DataFrame.info of        age  hypertension  heart_disease  avg_glucose_level   bmi  age_nan  \
3188  35.0             0              0             104.40  24.4        0   
2320  56.0             0              0             113.20  38.7        0   
4027  38.0             0              0              64.27  27.3        0   
3564  55.0             0              1             199.38  39.0        0   
4537  51.0             0              0              81.38  34.1        0   
...    ...           ...            ...                ...   ...      ...   
1138  38.0             0              0              91.68  42.8        0   
4918  75.0             0              0              79.49  28.9        0   
574   18.0             0              0             112.17  31.7        0   
2618  37.0             0              0              86.49  24.4        0   
1419  43.0             0              0              81.94  27.7        0   

      avg_glucose_level_nan  bmi_nan  hyper

In [613]:
y_pred = logistic_regression.predict(X_test)

In [614]:
class_pred = logistic_regression.predict(X_test)
proba_pred = logistic_regression.predict_proba(X_test)[:,1]
print(f'test roc-auc : {roc_auc_score(y_test, proba_pred)}')
print(f'test accuracy: {accuracy_score(y_test, class_pred)}')
print()

test roc-auc : 0.8693709755227887
test accuracy: 0.7348336594911937



# Save Model

In this section is the saving of the model

In [615]:
import joblib

TRAINED_MODEL_DIR = './trained_models/' #'trained_models/'
PIPELINE_NAME = 'logistic_regression'
PIPELINE_SAVE_FILE = f'{PIPELINE_NAME}_output.pkl'

# Save the model using joblib
save_path = TRAINED_MODEL_DIR + PIPELINE_SAVE_FILE
joblib.dump(logistic_regression, save_path)

['./trained_models/logistic_regression_output.pkl']

# Load and predict data

This section contains the saved model load and a prediction with new data.

**Basic input validation**

input_data = X_test.copy()

**Making predictions**

In [624]:
# Sample single input data in dictionary format
single_input_data = {
    'gender': 'Female',
    'age': 60,
    'hypertension': 0,
    'heart_disease': 1,
    'ever_married': 'Yes',
    'work_type': 'Private',
    'Residence_type': 'Urban',
    'avg_glucose_level': '205.1',
    'bmi': 35.8,
    'smoking_status': 'never smoked'
}
# Convert the single input data to a DataFrame
single_input_df = pd.DataFrame([single_input_data])

# Preprocess the single input data using the transformations_pipeline
preprocessed_single_input = transformations_pipeline.transform(single_input_df)

# Load the model using joblib
trained_model = joblib.load(save_path)

# Delete columns conflict (review)
preprocessed_single_input = preprocessed_single_input.drop(['gender','ever_married','work_type','Residence_type','smoking_status'], axis=1)

# Predict the target value using the loaded model
predicted_value = trained_model.predict(preprocessed_single_input)

print("La predicción es: ", predicted_value)

La predicción es:  [1]


# Extra
Use this code to debug the Custom Transformer pipeline

In [None]:
# from sklearn.compose import ColumnTransformer

# # Define the debug_print function to print DataFrame or array
# def debug_print(X):
#     if isinstance(X, pd.DataFrame):
#         print(X.head())  # Print the first few rows of the DataFrame
#     elif isinstance(X, np.ndarray):
#         print(X[:5])  # Print the first 5 rows of the array


# # Define the preprocessor for categorical variables
# categorical_preprocessor = Pipeline([
#     ('categorical_imputer', CategoricalImputer(variables=CATEGORICAL_VARS_WITH_NA)),
#     ('rare_labels', RareLabelCategoricalEncoder(tol=0.05, variables=CATEGORICAL_VARS)),
#     ('dummy_vars', OneHotEncoder(variables=CATEGORICAL_VARS))
# ])

# # Define the preprocessor for numerical variables
# numerical_preprocessor = Pipeline([
#     ('missing_indicator', MissingIndicator(variables=NUMERICAL_VARS)),
#     # ('cabin_only_letter', ExtractLetters()),
#     ('median_imputation', NumericalImputer(variables=NUMERICAL_VARS_WITH_NA)),
#     ('scaling', MinMaxScaler())
# ])

# # Use ColumnTransformer to apply the different preprocessors to their respective columns
# preprocessor = ColumnTransformer(
#     transformers=[
#         ('categorical', categorical_preprocessor, CATEGORICAL_VARS),
#         ('numerical', numerical_preprocessor, NUMERICAL_VARS)
#     ]
# )

# # Combine the preprocessor with the logistic regression model in the final pipeline
# titanic_pipeline = Pipeline([
#     ('preprocessor', preprocessor),
#     ('aligning_feats', OrderingFeatures()),
#     ('log_reg', LogisticRegression(C=0.0005, class_weight='balanced', random_state=SEED_MODEL))
# ])

# # Debug each output after transformation
# X_train_transformed = titanic_pipeline['preprocessor'].fit_transform(X_train)
# debug_print(X_train_transformed)

# # Fit the model
# titanic_pipeline['log_reg'].fit(X_train_transformed, y_train)