# Machine Learning Models: A Comprehensive Overview

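In this notebook, we explore several machine learning models: linear, logistic, Ridge, and Lasso regression; decision trees and tree ensembles (random forest, gradient boosting, XGBoost, LightGBM); Naive Bayes; and clustering with K-means, hierarchical clustering, DBSCAN, and Gaussian mixture models. We also visualize model assumptions and performance metrics.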

The notebook is designed to be flexible for future use, allowing for easy adaptation to different datasets.

## Imports

In this section, we import all the necessary packages and libraries for data manipulation, machine learning models, and visualizations.

In [None]:
# Essential imports for machine learning models, data manipulation, and visualizations
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.linear_model import Ridge, Lasso, LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, RandomForestClassifier
from sklearn.cluster import KMeans, DBSCAN
from sklearn.naive_bayes import GaussianNB
from sklearn.mixture import GaussianMixture
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import statsmodels.api as sm

# For advanced machine learning algorithms
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from scipy.cluster.hierarchy import dendrogram, linkage



## Data Loading
We will load the dataset into a pandas DataFrame and inspect the first few rows to understand the structure of the data.

In [None]:
# Load dataset (replace 'your_dataset.csv' with actual dataset path)
df = pd.read_csv('your_dataset.csv')

# Show all columns, preventing Jupyter from truncating wide DataFrames.
pd.set_option('display.max_columns', None)

# Display the first few rows of the dataset
df.head()

# Check for missing values or data types
df.info()


# Transform data here if needed

## Exploratory Data Analysis (EDA)

Before we dive into modeling, it's important to understand the distribution and relationships in the dataset. We'll start by plotting the distributions of numerical features and checking for correlations.


In [None]:
# Plot distributions of all numerical columns
df.hist(bins=20, figsize=(14, 10))
plt.tight_layout()
plt.show()

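# Check correlations between numerical features
# (numeric_only=True skips text columns, which raise an error in pandas >= 2.0)
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')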
plt.title('Correlation Matrix')
plt.show()
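
A heatmap becomes hard to read once there are many features. The sketch below is one hedged way to list the strongest pairwise correlations as a sorted table instead. The `toy` DataFrame and its column names are made up for illustration and are not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: y depends strongly on x, z is independent noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(size=200)})
toy["y"] = 2 * toy["x"] + rng.normal(scale=0.1, size=200)
toy["z"] = rng.normal(size=200)
toy["label"] = "a"  # non-numeric column, excluded by numeric_only=True

corr = toy.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top as well.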


## More Advanced EDA

In [None]:


class DataAnalyzer:
    """
    A class to perform Exploratory Data Analysis (EDA) and data transformations on a dataset,
    including handling and detecting missing data.
    """

    def __init__(self, df):
        """
        Initialize with the DataFrame to analyze.
        """
        self.df = df

    def standardize_missing_values(self):
        """
        Standardize common missing value representations (e.g., "NA", "Null", ".", " ") to NaN.
        """
        missing_values = ["NA", "Null", ".", " "]  # Define common missing value representations
        self.df.replace(missing_values, np.nan, inplace=True)  # Replace them with NaN
        print("Standardized missing values to NaN.")

    def check_missing_data(self):
        """
        Check for missing values and empty cells in the dataset.
        """
        print("=== Null Value Check ===")
        null_values = self.df.isnull().sum()
        print(null_values[null_values > 0])

        print("\n=== Empty Cell Check ===")
        empty_cells = (self.df == '').sum()
        print(empty_cells[empty_cells > 0])

    def visualize_missing_data(self):
        """
        Visualize missing data as a heatmap.
        """
        plt.figure(figsize=(10, 6))
        sns.heatmap(self.df.isnull(), cbar=False, cmap='viridis')
        plt.title("Missing Data Heatmap")
        plt.show()

    def handle_missing_data(self, strategy="drop", fill_value=None, method=None):
        """
        Handle missing data by dropping or filling based on a specified strategy.
        
        :param strategy: "drop", "fill", or "fill_method". Default is "drop".
        :param fill_value: Value to use for filling missing data when strategy="fill" (e.g., mean, median, custom value).
        :param method: For forward or backward filling, use "ffill" (forward fill) or "bfill" (backward fill).
        :return: Modified DataFrame with handled missing data.
        """
        if strategy == "drop":
            # Drop rows with any missing values
            print("Dropping rows with missing data...")
            self.df = self.df.dropna()

        elif strategy == "fill":
            # Fill missing data with specified fill_value
            if fill_value == 'mean':
                print("Filling missing values with column means...")
                # numeric_only=True avoids a TypeError on text columns in pandas >= 2.0
                self.df = self.df.fillna(self.df.mean(numeric_only=True))
            elif fill_value == 'median':
                print("Filling missing values with column medians...")
                self.df = self.df.fillna(self.df.median(numeric_only=True))
            elif fill_value == 'mode':
                print("Filling missing values with column modes...")
                self.df = self.df.fillna(self.df.mode().iloc[0])
            else:
                print(f"Filling missing values with {fill_value}...")
                self.df = self.df.fillna(fill_value)

        elif strategy == "fill_method":
            # Forward or backward fill based on method
            if method == "ffill":
                print("Forward filling missing values...")
                # fillna(method=...) is deprecated in recent pandas; use ffill()/bfill()
                self.df = self.df.ffill()
            elif method == "bfill":
                print("Backward filling missing values...")
                self.df = self.df.bfill()
            else:
                print("Invalid method. Use 'ffill' for forward fill or 'bfill' for backward fill.")

        print("Missing data handled.")
        return self.df

    def data_summary(self):
        """
        Print a summary of the dataset, including general info and statistical summary.
        """
        print("\n=== Data Info ===")
        print(self.df.info())

        print("\n=== Summary Statistics ===")
        print(self.df.describe())

    def transpose_data(self):
        """
        Transpose the dataset for easier viewing of wide datasets.
        """
        print("\n=== Transposed Data ===")
        print(self.df.transpose().head())

    def visualize_feature_distributions(self):
        """
        Visualize distributions of numerical features using histograms.
        """
        print("\n=== Feature Distributions ===")
        self.df.hist(bins=20, figsize=(14, 10))
        plt.tight_layout()
        plt.show()

    def detect_outliers_with_boxplot(self):
        """
        Detect outliers visually using a boxplot for each numerical feature.
        Outliers are points outside the whiskers of the boxplot.
        """
        numerical_columns = self.df.select_dtypes(include=[np.number]).columns
        for column in numerical_columns:
            plt.figure(figsize=(10, 6))
            sns.boxplot(x=self.df[column])
            plt.title(f'Boxplot of {column}')
            plt.show()

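    def detect_outliers_with_zscore(self):
        """
        Detect outliers using Z-scores for numerical columns.
        Outliers are points whose absolute Z-score is greater than 3.
        """
        print("\n=== Outlier Detection (|Z-score| > 3) ===")
        numeric = self.df.select_dtypes(include=[np.number])
        # nan_policy='omit' keeps one missing value from turning a whole column's Z-scores into NaN
        z_scores = np.abs(stats.zscore(numeric.to_numpy(dtype=float), nan_policy='omit'))
        # Wrap the counts in a Series so each count is labeled with its column name
        outliers = pd.Series((z_scores > 3).sum(axis=0), index=numeric.columns)
        print(outliers[outliers > 0])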

    def apply_log_transformation(self):
        """
        Apply a log transformation to numerical columns.
        """
        print("\n=== Log Transformation ===")
        df_log = self.df.select_dtypes(include=[np.number]).apply(lambda x: np.log(x + 1))
        df_log.hist(bins=20, figsize=(14, 10))
        plt.tight_layout()
        plt.show()

    def apply_standard_scaling(self):
        """
        Apply standard scaling (z-score normalization) to numerical columns.
        """
        print("\n=== Standard Scaling ===")
        numeric = self.df.select_dtypes(include=[np.number])
        scaler = StandardScaler()
        df_scaled = pd.DataFrame(scaler.fit_transform(numeric), columns=numeric.columns, index=numeric.index)
        df_scaled.hist(bins=20, figsize=(14, 10))
        plt.tight_layout()
        plt.show()

    def apply_minmax_scaling(self):
        """
        Apply Min-Max scaling (normalization) to numerical columns.
        """
        print("\n=== MinMax Scaling ===")
        numeric = self.df.select_dtypes(include=[np.number])
        minmax_scaler = MinMaxScaler()
        df_minmax = pd.DataFrame(minmax_scaler.fit_transform(numeric), columns=numeric.columns, index=numeric.index)
        df_minmax.hist(bins=20, figsize=(14, 10))
        plt.tight_layout()
        plt.show()
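
To make the missing-value steps concrete, here is a minimal standalone sketch of the same idea on a tiny made-up DataFrame: placeholder strings are converted to NaN, columns are coerced to numeric, and gaps are filled with the column median. The `toy` data and column names are hypothetical, and the extra `pd.to_numeric` step is an assumption added here because placeholder strings leave columns with a non-numeric dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical data with string placeholders for missing values
toy = pd.DataFrame({
    "age": [25, "NA", 40, 31],
    "income": [50000, 62000, ".", 58000],
})

# Step 1: standardize placeholder strings to NaN
cleaned = toy.replace(["NA", "Null", ".", " "], np.nan)

# Step 2: coerce columns to numeric so statistics can be computed
cleaned = cleaned.apply(pd.to_numeric, errors="coerce")

# Step 3: count and fill the missing values with each column's median
missing_counts = cleaned.isnull().sum()
filled = cleaned.fillna(cleaned.median(numeric_only=True))

print(missing_counts)
print(filled)
```

Median filling is only one option; dropping rows or forward filling may suit other datasets better.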


## Splitting the Data into Training, Validation, and Test Sets

In [None]:
def split_data(X, y, test_size=0.3, val_size=0.2, random_state=42):
    """
    Split the dataset into train, validation, and test sets.
    
    :param X: Features
    :param y: Target
    :param test_size: Fraction of the full dataset held out as the test set
    :param val_size: Fraction of the remaining (non-test) data held out as the validation set
    :param random_state: Seed for reproducibility
    :return: Split dataset into X_train, X_val, X_test, y_train, y_val, y_test
    """
    # Initial split into train and test sets
    X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    # Further split the training set into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=val_size, random_state=random_state)

    print(f"Training set shape: {X_train.shape}, Validation set shape: {X_val.shape}, Test set shape: {X_test.shape}")
    return X_train, X_val, X_test, y_train, y_val, y_test
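
Because `val_size` applies to what is left after the test split, the defaults do not give a 50/20/30 split. The sketch below repeats the same two-step split on 100 made-up rows to show the resulting sizes. The names `X_demo` and `y_demo` are placeholders for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical rows with 2 features each
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Step 1: hold out 30% of all rows for testing
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Step 2: hold out 20% of the remaining 70 rows for validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.2, random_state=42
)

sizes = (len(X_tr), len(X_va), len(X_te))
print(sizes)  # (56, 14, 30)
```

With the defaults, the effective proportions are therefore 56% train, 14% validation, and 30% test.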


## Linear Regression with Assumptions Testing


In [None]:
def linear_regression(X_train, X_val, y_train, y_val):
    """
    Fit a linear regression model and check assumptions.
    
    :return: Fitted linear model, validation predictions
    """
    # Train the model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Validate on the validation set
    val_preds = model.predict(X_val)
    
    # Plot residuals
    plt.scatter(val_preds, y_val - val_preds)
    plt.xlabel('Predicted Values')
    plt.ylabel('Residuals')
    plt.title('Residuals vs Predicted')
    plt.show()
    
    # QQ plot for residuals
    sm.qqplot(y_val - val_preds, line='s')
    plt.title('QQ Plot for Residuals')
    plt.show()

    return model, val_preds

In [None]:
# Split the data into training and testing sets
X = df.drop(columns='target_column')  # Features
y = df['target_column']  # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate and fit the linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = linear_model.predict(X_test)

# Plot Residuals to Check Linearity Assumption
plt.scatter(y_pred, y_test - y_pred)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.show()

# Run a quick iteration to inspect the coefficients
for i, coef in enumerate(linear_model.coef_):
    print(f"Coefficient for feature {X.columns[i]}: {coef}")
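
Residual and QQ plots are visual checks. As a rough numerical complement, the sketch below computes the RMSE on held-out data and runs a Shapiro-Wilk test on the residuals. It uses synthetic data generated here only for illustration, and the choice of Shapiro-Wilk is an assumption of this sketch rather than something the notebook prescribes.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: y = 2*x0 - 1*x1 + Gaussian noise (illustrative only)
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(300, 2))
y_syn = 2.0 * X_syn[:, 0] - 1.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_tr, y_tr)
preds = model.predict(X_te)
residuals = y_te - preds

# RMSE is in the same units as the target, which makes it easy to interpret
rmse = np.sqrt(mean_squared_error(y_te, preds))

# Shapiro-Wilk: a small p-value suggests the residuals are not normally distributed
shapiro_stat, shapiro_p = stats.shapiro(residuals)

print(f"Coefficients: {model.coef_}")
print(f"RMSE: {rmse:.3f}")
print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
```

On large samples, normality tests flag even tiny deviations, so the plots remain the more informative check.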


## Logistic Regression with ROC Curve

Logistic regression is used for binary classification tasks. In this example, we’ll fit a logistic regression model, plot an ROC curve, and compute the area under the curve (AUC) to evaluate model performance.


In [None]:
from sklearn.metrics import roc_auc_score, roc_curve

def logistic_regression(X_train, X_val, y_train, y_val):
    """
    Fit a logistic regression model and plot ROC curve.
    
    :return: Fitted logistic model, validation predictions
    """
    model = LogisticRegression()
    model.fit(X_train, y_train)
    
    # Predict probabilities for the validation set
    val_probs = model.predict_proba(X_val)[:, 1]
    
    # ROC curve
    fpr, tpr, _ = roc_curve(y_val, val_probs)
    plt.plot(fpr, tpr, label='ROC Curve')
    plt.plot([0, 1], [0, 1], 'k--')  # Random chance line
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend()  # Display the label passed to plt.plot above
    plt.show()

    auc = roc_auc_score(y_val, val_probs)
    print(f"AUC: {auc:.3f}")

    return model, val_probs


In [None]:
# Instantiate and fit the logistic regression model
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)

# Make predictions on the test data
logistic_preds = logistic_model.predict(X_test)
logistic_probs = logistic_model.predict_proba(X_test)[:, 1]

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, logistic_probs)
plt.plot(fpr, tpr, label="Logistic Regression")
plt.plot([0, 1], [0, 1], 'k--')  # Diagonal line for random chance
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()  # Display the label passed to plt.plot above
plt.show()

# Print AUC score for better evaluation
auc_score = roc_auc_score(y_test, logistic_probs)
print(f"Logistic Regression AUC: {auc_score:.2f}")
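
To see how the AUC relates to the ROC curve, the following sketch computes it two ways on four made-up predictions: by integrating the curve with the trapezoidal rule (`sklearn.metrics.auc`) and directly with `roc_auc_score`. The labels and scores are illustrative values, not output from the model above.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Four hypothetical examples: true labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold yields one (FPR, TPR) point on the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under the curve via the trapezoidal rule
auc_trapezoid = auc(fpr, tpr)

# Same quantity computed directly from labels and scores
auc_direct = roc_auc_score(y_true, scores)

print(auc_trapezoid, auc_direct)  # 0.75 0.75
```

The value 0.75 also equals the share of positive/negative pairs ranked correctly: three of the four pairs have the positive example scored higher.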


## Ridge and Lasso Regression

In [None]:
def ridge_lasso_regression(X_train, X_val, y_train, y_val):
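    """
    Fit Ridge and Lasso models, tuning alpha with 5-fold cross-validation on the
    training set, then predict on the validation set.
    
    :return: Fitted Ridge and Lasso grid searches, and their validation predictions
    """
    # Base estimators; alpha is tuned by the grid searches below
    ridge = Ridge()
    lasso = Lasso()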
    
    param_grid = {'alpha': [0.01, 0.1, 1, 10]}  # Regularization strength
    
    ridge_cv = GridSearchCV(ridge, param_grid, scoring='neg_mean_squared_error', cv=5)
    lasso_cv = GridSearchCV(lasso, param_grid, scoring='neg_mean_squared_error', cv=5)
    
    ridge_cv.fit(X_train, y_train)
    lasso_cv.fit(X_train, y_train)
    
    # Validate on the validation set
    ridge_val_preds = ridge_cv.predict(X_val)
    lasso_val_preds = lasso_cv.predict(X_val)

    print(f"Best Ridge Alpha: {ridge_cv.best_params_}")
    print(f"Best Lasso Alpha: {lasso_cv.best_params_}")

    return ridge_cv, lasso_cv, ridge_val_preds, lasso_val_preds
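
The practical difference between the two penalties is that Lasso can set coefficients exactly to zero, while Ridge only shrinks them. The sketch below illustrates this on synthetic data where only the first of five features matters. The data and the `alpha` values are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only feature 0 influences the target
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 5))
y_syn = 3.0 * X_syn[:, 0] + rng.normal(scale=1.0, size=200)

ridge = Ridge(alpha=1.0).fit(X_syn, y_syn)
lasso = Lasso(alpha=0.5).fit(X_syn, y_syn)

# Ridge keeps small nonzero weights on irrelevant features;
# Lasso's L1 penalty drives them to exactly zero
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Features zeroed by Lasso:", int(np.sum(lasso.coef_ == 0)))
```

This is why Lasso is sometimes used as a rough feature-selection step, at the cost of also shrinking the coefficients it keeps.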


### Params
- `n_estimators`: The number of trees or boosting rounds used.
- `learning_rate`: Controls the contribution of each tree in the boosting process.
- `max_depth`: The maximum depth of individual trees (or decision rules).
- `min_samples_split` and `min_samples_leaf`: Control how the tree grows and when it stops growing.
- `subsample` and `colsample_bytree`: Control the fraction of samples and features used for each tree, which can prevent overfitting.
- `max_features`: Controls the number of features considered for each split.
- `min_child_weight` and `min_child_samples`: Control the minimum number of instances or weight required to form a node in tree-based models.

## Random Forest Regression

In [None]:
def random_forest_regression(X_train, X_val, y_train, y_val):
    """
    Fit a Random Forest Regressor and tune hyperparameters using GridSearchCV.
    
    :return: Best Random Forest model, validation predictions
    """
    rf = RandomForestRegressor(random_state=42)
    
    # Expanded parameter grid for tuning
    param_grid = {
        'n_estimators': [100, 200, 300, 500],  # Number of trees in the forest
        'max_depth': [None, 5, 10, 20],  # Maximum depth of the tree
        'min_samples_split': [2, 10, 20],  # Minimum samples required to split a node
        'min_samples_leaf': [1, 2, 4],  # Minimum samples required to be at a leaf node
        'max_features': [1.0, 'sqrt', 'log2']  # Features considered per split (1.0 = all; 'auto' was removed in scikit-learn 1.3)
    }
    
    rf_cv = GridSearchCV(rf, param_grid, cv=5, scoring='neg_mean_squared_error')
    rf_cv.fit(X_train, y_train)
    
    # Validate on validation set
    val_preds = rf_cv.predict(X_val)
    
    print(f"Best Random Forest Parameters: {rf_cv.best_params_}")
    return rf_cv, val_preds
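
An exhaustive grid grows multiplicatively with each added parameter, so it is worth counting the fits before running the search. The sketch below uses scikit-learn's `ParameterGrid` to count the candidates in a grid shaped like the one above; the variable name `rf_grid` is only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the random forest grid above
rf_grid = {
    "n_estimators": [100, 200, 300, 500],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [1.0, "sqrt", "log2"],
}

n_candidates = len(ParameterGrid(rf_grid))  # 4 * 4 * 3 * 3 * 3
n_fits = n_candidates * 5                   # one fit per candidate per CV fold

print(f"{n_candidates} candidates, {n_fits} model fits with cv=5")
```

With 432 candidates and 5 folds, the search trains 2,160 forests, so passing `n_jobs=-1` to `GridSearchCV` or trimming the grid may be necessary on larger datasets.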


## Gradient Boosting Regression

In [None]:
def gradient_boosting_regression(X_train, X_val, y_train, y_val):
    """
    Fit a Gradient Boosting Regressor and tune hyperparameters.
    
    :return: Best Gradient Boosting model, validation predictions
    """
    gb = GradientBoostingRegressor(random_state=42)
    
    # Expanded parameter grid for tuning
    param_grid = {
        'n_estimators': [100, 200, 300],  # Number of boosting stages
        'learning_rate': [0.01, 0.05, 0.1, 0.2],  # Learning rate shrinks the contribution of each tree
        'max_depth': [3, 5, 7],  # Maximum depth of the individual regression estimators
        'min_samples_split': [2, 10, 20],  # Minimum samples required to split a node
        'min_samples_leaf': [1, 2, 4],  # Minimum samples required at a leaf node
        'subsample': [0.7, 0.8, 1.0],  # Fraction of samples used for fitting each base learner
        'max_features': [None, 'sqrt', 'log2']  # Features considered per split (None = all; 'auto' was removed in scikit-learn 1.3)
    }
    
    gb_cv = GridSearchCV(gb, param_grid, cv=5, scoring='neg_mean_squared_error')
    gb_cv.fit(X_train, y_train)
    
    # Validate on validation set
    val_preds = gb_cv.predict(X_val)
    
    print(f"Best Gradient Boosting Parameters: {gb_cv.best_params_}")
    return gb_cv, val_preds
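
The grid above has 2,916 combinations, which means 14,580 fits with 5-fold cross-validation. One common alternative, sketched here under the assumption that sampling a subset of combinations is acceptable, is `RandomizedSearchCV`. The synthetic data and the reduced parameter lists are illustrative only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(120, 4))
y_syn = X_syn[:, 0] ** 2 + X_syn[:, 1] + rng.normal(scale=0.1, size=120)

param_distributions = {
    "n_estimators": [50, 100],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 5],
    "subsample": [0.7, 0.8, 1.0],
}

# n_iter caps the number of sampled combinations, regardless of grid size
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X_syn, y_syn)

print(f"Best parameters: {search.best_params_}")
print(f"Best CV MSE: {-search.best_score_:.4f}")
```

The trade-off is that a random search may miss the best combination, but its cost is fixed by `n_iter` rather than by the size of the grid.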


## XGBoost Regression

In [None]:
def xgboost_regression(X_train, X_val, y_train, y_val):
    """
    Fit an XGBoost Regressor and tune hyperparameters.
    
    :return: Best XGBoost model, validation predictions
    """
    xgb = XGBRegressor(objective='reg:squarederror', random_state=42)
    
    # Expanded parameter grid for tuning
    param_grid = {
        'n_estimators': [100, 200, 300],  # Number of boosting rounds
        'learning_rate': [0.01, 0.05, 0.1, 0.2],  # Learning rate
        'max_depth': [3, 5, 7],  # Maximum depth of a tree
        'min_child_weight': [1, 3, 5],  # Minimum sum of instance weight needed in a child
        'subsample': [0.7, 0.8, 1.0],  # Fraction of samples used for fitting each tree
        'colsample_bytree': [0.7, 0.8, 1.0]  # Fraction of features sampled for each tree
    }
    
    xgb_cv = GridSearchCV(xgb, param_grid, cv=5, scoring='neg_mean_squared_error')
    xgb_cv.fit(X_train, y_train)
    
    # Validate on validation set
    val_preds = xgb_cv.predict(X_val)
    
    print(f"Best XGBoost Parameters: {xgb_cv.best_params_}")
    return xgb_cv, val_preds


## LightGBM Regressor

In [None]:
def lightgbm_regression(X_train, X_val, y_train, y_val):
    """
    Fit a LightGBM Regressor and tune hyperparameters.
    
    :return: Best LightGBM model, validation predictions
    """
    lgbm = LGBMRegressor(random_state=42)
    
    # Expanded parameter grid for tuning
    param_grid = {
        'n_estimators': [100, 200, 300],  # Number of boosting rounds
        'learning_rate': [0.01, 0.05, 0.1, 0.2],  # Learning rate
        'max_depth': [-1, 5, 10],  # Maximum depth of the trees (-1 for no limit)
        'num_leaves': [31, 50, 100],  # Maximum number of leaves in one tree
        'min_child_samples': [20, 50, 100],  # Minimum number of data points in a child
        'subsample': [0.7, 0.8, 1.0],  # Fraction of samples used per tree (only takes effect when subsample_freq > 0)
        'colsample_bytree': [0.7, 0.8, 1.0]  # Fraction of features sampled for each tree
    }
    
    lgbm_cv = GridSearchCV(lgbm, param_grid, cv=5, scoring='neg_mean_squared_error')
    lgbm_cv.fit(X_train, y_train)
    
    # Validate on validation set
    val_preds = lgbm_cv.predict(X_val)
    
    print(f"Best LightGBM Parameters: {lgbm_cv.best_params_}")
    return lgbm_cv, val_preds


## Decision Tree Classifier

Decision trees are highly interpretable models that split the dataset based on feature values. In this section, we use `GridSearchCV` to tune the tree's hyperparameters and evaluate feature importance.


In [None]:
# Set up the Decision Tree Classifier and hyperparameters
tree = DecisionTreeClassifier(random_state=42)
param_grid = {
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 10],
    'min_samples_leaf': [1, 5]
}

# Perform hyperparameter tuning with GridSearchCV
grid_search = GridSearchCV(tree, param_grid, scoring='accuracy', cv=5)
grid_search.fit(X_train, y_train)

# Best model from Grid Search
best_tree = grid_search.best_estimator_

# Feature importance
importances = pd.Series(best_tree.feature_importances_, index=X.columns).sort_values(ascending=False)
importances.plot(kind='bar', title='Feature Importances')
plt.show()


## K-Means Clustering

K-Means clustering is an unsupervised learning algorithm used to group similar data points together. We will use the inertia and silhouette score to evaluate the quality of the clusters.


In [None]:
from sklearn.metrics import silhouette_score

# Standardize the dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_scaled)

# Calculate Inertia and Silhouette Score
inertia = kmeans.inertia_
silhouette_avg = silhouette_score(X_scaled, kmeans.labels_)

print(f"Inertia: {inertia}")
print(f"Silhouette Score: {silhouette_avg}")

# Plot clusters using the first two scaled features (a complete picture only if X has two columns)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans.labels_, cmap='viridis')
plt.title('K-Means Clustering Results')
plt.show()
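
The cell above fixes `n_clusters=3`. When the number of clusters is unknown, one hedged approach is to fit K-Means for several values of k and compare silhouette scores. The sketch below does this on synthetic blobs with three known centers; the data and the range of k are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated clusters (illustrative only)
X_blobs, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=0.7,
    random_state=42,
)

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_blobs)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
    scores[k] = silhouette_score(X_blobs, km.labels_)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print(f"Best k by silhouette: {best_k}")
```

Inertia always decreases as k grows, so it is best read as an elbow plot, whereas the silhouette score can be compared directly across k.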


## Hierarchical Clustering

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage

def hierarchical_clustering(X):
    # Use Ward's method for hierarchical clustering
    Z = linkage(X, method='ward')
    
    # Plot dendrogram
    plt.figure(figsize=(10, 5))
    dendrogram(Z)
    plt.title('Hierarchical Clustering Dendrogram')
    plt.xlabel('Data Points')
    plt.ylabel('Distance')
    plt.show()
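
A dendrogram shows the merge hierarchy but does not assign cluster labels. As a minimal sketch, SciPy's `fcluster` can cut the same linkage matrix into a chosen number of flat clusters. The synthetic data and the choice of three clusters are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic data: 20 points around each of three centers (illustrative only)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [6, 6], [-6, 6]])
X_syn = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Same Ward linkage as in hierarchical_clustering above
Z = linkage(X_syn, method="ward")

# Cut the tree so that at most 3 flat clusters remain
flat_labels = fcluster(Z, t=3, criterion="maxclust")

print(np.unique(flat_labels, return_counts=True))
```

Using `criterion="distance"` instead cuts the tree at a fixed height, which corresponds to drawing a horizontal line across the dendrogram.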


## DBSCAN (Density-Based Clustering)

In [None]:
from sklearn.cluster import DBSCAN

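def dbscan_clustering(X):
    # Accept DataFrames as well as arrays; the positional indexing below needs an array
    X = np.asarray(X)
    # eps is a distance threshold, so features should be on comparable scales (e.g., X_scaled)
    dbscan = DBSCAN(eps=0.5, min_samples=5)
    labels = dbscan.fit_predict(X)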
    
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
    plt.title('DBSCAN Clustering Results')
    plt.show()
    
    return labels
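
Unlike K-Means, DBSCAN does not take a cluster count and labels low-density points as noise with the label `-1`. The sketch below shows how to count clusters and noise points from the returned labels. The synthetic data (two tight blobs plus one distant point) is made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight synthetic blobs plus one far-away point (illustrative only)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0, 0], scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.1, size=(50, 2))
outlier = np.array([[10.0, -10.0]])
X_syn = np.vstack([blob_a, blob_b, outlier])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_syn)

# The label -1 marks noise, so exclude it when counting clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(f"Clusters found: {n_clusters}, noise points: {n_noise}")
```

Because noise points share the label `-1`, they should be filtered out before computing metrics such as the silhouette score.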


## Gaussian Mixture Models (GMM)

In [None]:
from sklearn.mixture import GaussianMixture

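def gaussian_mixture_model(X, n_components=3):
    # Accept DataFrames as well as arrays; the positional indexing below needs an array
    X = np.asarray(X)
    gmm = GaussianMixture(n_components=n_components, random_state=42)
    gmm.fit(X)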
    
    labels = gmm.predict(X)
    
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
    plt.title('Gaussian Mixture Model Clustering')
    plt.show()
    
    return labels
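
Two features distinguish a GMM from K-Means: it gives soft (probabilistic) assignments, and it provides a likelihood that can be used to compare component counts. The sketch below, on synthetic data with three known centers, selects `n_components` by the Bayesian information criterion (BIC) and then inspects the membership probabilities. Using BIC for selection is an assumption of this sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: 100 points around each of three centers (illustrative only)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [6, 6], [-6, 6]])
X_syn = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# Lower BIC is better; it penalizes extra components
bics = {}
for n in range(1, 6):
    gmm = GaussianMixture(n_components=n, random_state=42).fit(X_syn)
    bics[n] = gmm.bic(X_syn)
best_n = min(bics, key=bics.get)

# Refit with the selected component count and get soft assignments
gmm = GaussianMixture(n_components=best_n, random_state=42).fit(X_syn)
probs = gmm.predict_proba(X_syn)  # one row per point, one column per component

print(f"Component count with lowest BIC: {best_n}")
print(np.round(probs[:3], 3))
```

Each row of `probs` sums to 1, and `gmm.predict` simply returns the component with the highest probability in each row.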


## Naive Bayes

Naive Bayes is a simple yet effective algorithm for classification. We'll fit a Gaussian Naive Bayes model and evaluate the model's performance using a confusion matrix.


In [None]:
# Instantiate and fit the Naive Bayes model
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Make predictions on the test data
nb_preds = nb_model.predict(X_test)

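# Confusion Matrix
from sklearn.metrics import ConfusionMatrixDisplay

cm = confusion_matrix(y_test, nb_preds)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Get accuracy, precision, recall, and F1 score for the model.
# NOTE: get_test_scores is not defined elsewhere in this notebook; the version
# below is a minimal stand-in that assumes a binary target.
def get_test_scores(model_name, preds, y_true):
    return pd.DataFrame({
        'model': [model_name],
        'accuracy': [accuracy_score(y_true, preds)],
        'precision': [precision_score(y_true, preds)],
        'recall': [recall_score(y_true, preds)],
        'f1': [f1_score(y_true, preds)],
    })

nb_test_scores = get_test_scores('Naive Bayes', nb_preds, y_test)
print(nb_test_scores)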
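
To show how the four metrics follow from the confusion matrix, the sketch below derives them by hand from the matrix cells and checks them against scikit-learn's functions. The labels and predictions are ten made-up binary examples, not output from the Naive Bayes model.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Ten hypothetical binary labels and predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted positives, how many are correct
recall = tp / (tp + fn)      # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, "
      f"recall={recall:.3f}, f1={f1:.3f}")
```

For multiclass targets, `precision_score`, `recall_score`, and `f1_score` need an explicit `average` argument such as `'macro'` or `'weighted'`.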



## Demo

In this notebook, we explored several machine learning models, including linear and logistic regression, regularized and tree-based regressors, decision trees, Naive Bayes, and several clustering methods. We also checked key regression assumptions with residual and QQ plots and used evaluation metrics such as AUC and the silhouette score to assess the models.

This structure can be easily adapted for different datasets by simply updating the data loading step and the target variable.


In [None]:
# Display the feature used at each internal (non-leaf) node of the tuned tree
print("Initial Tree Iteration Splits:")
for i, feature in enumerate(best_tree.tree_.feature):
    if feature != -2:  # -2 means it's a leaf node
        print(f"Node {i}: Split on feature {X.columns[feature]}")
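
For a fuller view that includes thresholds and predicted classes, scikit-learn provides `export_text`. The sketch below applies it to a shallow tree trained on the bundled Iris dataset, which is used here only as a stand-in because the notebook's own dataset is a placeholder.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris is a stand-in dataset for illustration
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Render the fitted tree as indented if/else rules
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

The same call should work for `best_tree` above by passing `feature_names=list(X.columns)`.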
