# Cyclic Coordinate Descent for Logistic Regression with Lasso regularization

This notebook presents an implementation of the Cyclic Coordinate Descent (CCD) algorithm for parameter 
estimation in logistic regression with an L1 (lasso) penalty and compares it with a standard 
logistic regression model without regularization. 
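
For reference, the penalized objective can be written as below. This is a sketch in the notation of the glmnet paper, assuming labels $y_i \in \{0, 1\}$, $N$ observations with feature vectors $x_i$, an unpenalized intercept $\beta_0$, and a regularization strength $\lambda \ge 0$:

$$
\min_{\beta_0,\,\beta}\; -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i\,(\beta_0 + x_i^\top\beta) - \log\!\left(1 + e^{\beta_0 + x_i^\top\beta}\right)\right] + \lambda\,\lVert\beta\rVert_1
$$

Setting $\lambda = 0$ recovers the unregularized logistic regression used as the baseline.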

## Imports & Consts

TODO: Add information about reproducibility

In [2]:
import os
import arff
import numpy as np
import polars as pl
from typing import List

In [3]:
CONST_DATASET_DIRECTORY_PATH = "./datasets"
CONST_RESULTS_DIRECTORY_PATH = "./results"

## Data preprocessing

TODO: Describe the datasets

In [None]:
def load_datasets() -> List[pl.DataFrame]:
    """Load all datasets from the datasets folder and return them as a list of polars dataframes."""
    datasets = []
    for file in os.listdir(CONST_DATASET_DIRECTORY_PATH):
        if file.endswith(".arff"):
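            # Build the path from the configured directory instead of a hard-coded folder name.
            with open(os.path.join(CONST_DATASET_DIRECTORY_PATH, file), "r") as f:
                data = arff.load(f)
                # ARFF rows are records, so polars is told to read them row-wise.
                columns = [i[0] for i in data["attributes"]]
                df = pl.DataFrame(data["data"], schema=columns, orient="row")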
                datasets.append(df)
    return datasets


def fill_missing_values(df: pl.DataFrame) -> pl.DataFrame:
    """Fill missing values in each column with that column's mean."""
    # polars has no `fillna`; `fill_null` with the "mean" strategy is the equivalent call.
    return df.fill_null(strategy="mean")


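def remove_colinear_features(df: pl.DataFrame, threshold: float = 0.95) -> pl.DataFrame:
    """Remove features whose absolute correlation with an earlier feature exceeds the threshold."""
    # NOTE: the 0.95 default is an assumption; the original code did not state a cut-off.
    corr = np.abs(np.corrcoef(df.to_numpy(), rowvar=False))
    # Only the upper triangle is inspected, so the first feature of each correlated pair is kept.
    upper = np.triu(corr, k=1)
    to_drop = [col for j, col in enumerate(df.columns) if np.any(upper[:, j] > threshold)]
    return df.drop(to_drop)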


def normalize(df: pl.DataFrame) -> pl.DataFrame:
    """Normalize the features of a dataframe based on mean and standard deviation."""
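    # Column-wise z-score, expressed with polars expressions so each column uses its own statistics.
    return df.select((pl.all() - pl.all().mean()) / pl.all().std())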

In [None]:
datasets = load_datasets()
for dataset in datasets:
    print(dataset.describe())

In [None]:
# Preprocess the datasets, including filling missing values, removing colinear features and normalizing the feature values.
preprocessing_pipeline = [fill_missing_values, remove_colinear_features, normalize]

# Overwrite the original datasets list with the preprocessed datasets to save memory.
for i in range(len(datasets)):
    for f in preprocessing_pipeline:
        datasets[i] = f(datasets[i])
        print(datasets[i].describe())

## LogRegCCD

Implementation of regularized Logistic Regression with Cyclic Coordinate Descent based on the [publication](https://www.jstatsoft.org/article/view/v033i01).

TODO: Add high-level overview of the algorithm
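
The core update can be sketched as follows. This is a minimal, unoptimized sketch for a single value of lambda, not the final `LogRegCCD` implementation: it forms the quadratic (IRLS) approximation of the log-likelihood and then cycles over coordinates, applying soft-thresholding to each coefficient. The names `soft_threshold` and `ccd_lasso_logistic`, the parameters `n_outer`, `n_inner` and `tol`, and the weight clipping are assumptions made for illustration. Features are assumed to be standardized and labels to be in {0, 1}.

```python
import numpy as np


def soft_threshold(z: float, gamma: float) -> float:
    """Soft-thresholding operator S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return float(np.sign(z) * max(abs(z) - gamma, 0.0))


def ccd_lasso_logistic(X, y, lam, n_outer=50, n_inner=100, tol=1e-8):
    """Hypothetical helper: fit L1-penalized logistic regression for one lambda.

    Assumes X is an (N, p) float array with standardized columns and y holds
    labels in {0, 1}. Returns (intercept, coefficients).
    """
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(n_outer):
        # Quadratic approximation of the log-likelihood around the current estimate.
        eta = beta0 + X @ beta
        prob = 1.0 / (1.0 + np.exp(-eta))
        w = np.clip(prob * (1.0 - prob), 1e-5, None)  # assumed guard against zero weights
        z = eta + (y - prob) / w                      # working response
        beta0_old, beta_old = beta0, beta.copy()
        for _ in range(n_inner):
            # Unpenalized intercept: weighted mean of the partial residual.
            new_beta0 = np.sum(w * (z - X @ beta)) / np.sum(w)
            max_change = abs(new_beta0 - beta0)
            beta0 = new_beta0
            for j in range(p):
                # Partial residual that excludes the contribution of feature j.
                r_j = z - beta0 - X @ beta + X[:, j] * beta[j]
                numerator = soft_threshold(np.mean(w * X[:, j] * r_j), lam)
                new_bj = numerator / np.mean(w * X[:, j] ** 2)
                max_change = max(max_change, abs(new_bj - beta[j]))
                beta[j] = new_bj
            if max_change < tol:
                break
        # Stop once a full outer iteration no longer moves the parameters.
        if max(abs(beta0 - beta0_old), np.max(np.abs(beta - beta_old))) < tol:
            break
    return beta0, beta
```

With a sufficiently large `lam` every coefficient is thresholded to zero and only the intercept is fitted; with `lam=0` the sketch reduces to ordinary logistic regression solved coordinate by coordinate.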

In [None]:
# TODO: Implement it.


class LogRegCCD:
    """Logistic Regression with Cyclic Coordinate Descent and Lasso Regularization."""

    def __init__(self) -> None:
        """Initialize the LogRegCCD model."""
        pass

    def fit(self, X_train: np.ndarray, y_train: np.ndarray) -> None:
        """Fit the Logistic Regression model on the provided training features and labels."""
        pass

    def validate(self, X_valid: np.ndarray, y_valid: np.ndarray, measure: str) -> float:
        """Compute the provided measure based on the validation features and labels."""
        pass

    def predict_proba(self, X_test: np.ndarray) -> np.ndarray:
        """Predict the probabilities of the classes for the test features."""
        pass

    def plot(self, measure: str) -> None:
        """Plot the evaluation measure over different values of lambda."""
        pass

    def plot_coefficients(self) -> None:
        """Plot the coefficients of the model over different values of lambda."""
        pass
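
Both plotting methods iterate over a sequence of lambda values. One possible way to build such a sequence is sketched below: start at the smallest lambda for which all coefficients are zero and decrease on a log scale. The helper name `lambda_path` and the defaults `n_lambdas=100` and `eps=1e-3` are assumptions for illustration, and the formula for `lambda_max` assumes standardized features and labels in {0, 1}.

```python
import numpy as np


def lambda_path(X, y, n_lambdas=100, eps=1e-3):
    """Hypothetical helper: log-spaced lambdas from lambda_max down to eps * lambda_max."""
    n = X.shape[0]
    # With all coefficients at zero the fitted probability equals mean(y), so the
    # gradient for feature j is x_j^T (y - mean(y)) / N; lambda_max is its largest magnitude.
    lambda_max = np.max(np.abs(X.T @ (y - y.mean()))) / n
    return np.geomspace(lambda_max, eps * lambda_max, num=n_lambdas)
```

Fitting from the largest lambda to the smallest also allows each fit to be warm-started from the previous solution.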

## Performance & Comparison

In [None]:
# TODO: Performance and Comparison