# Exercise 2: Exploring Different Optimizers in Logistic Regression

## Objective
In this exercise, you will compare several logistic regression solvers (LBFGS, SAG, and SAGA) on multi-dimensional data whose features sit at very different scales. You'll observe how each solver performs on this dataset without any preprocessing.

## Dataset
We'll create a synthetic dataset with the following characteristics:
- Multiple dimensions
- Features at different scales
- At least one exponentially distributed feature
- At least one categorical feature
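
To get a feel for what "different scales" means here, the sketch below samples one feature of each numeric kind and compares their spread. The distribution parameters mirror the starter code that follows, but the random generator and seed are assumptions, so the exact numbers will differ from the notebook's.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed, for illustration only
n = 10_000

# One sample per kind of numeric feature described in the list above.
sample = pd.DataFrame({
    'normal': rng.normal(0, 1, n),
    'exponential': rng.exponential(2, n),
    'uniform': rng.uniform(0, 10000, n),
})

# Rows are features, columns are summary statistics.
spread = sample.agg(['mean', 'std', 'min', 'max']).T
print(spread.round(2))
```

The uniform feature's standard deviation is thousands of times larger than that of the normal feature, which is the kind of mismatch gradient-based solvers tend to struggle with.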

## Tasks

1. Generate the dataset as described above.
2. Use an OrdinalEncoder for the categorical data.
3. Split the data into training and test sets.
4. Implement logistic regression with different optimizers:
   - LBFGS
   - SAGA
   - SAG (Stochastic Average Gradient)
5. Compare the performance of each optimizer.
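
As a quick illustration of task 2, the sketch below shows what `OrdinalEncoder` does to a toy column that uses the same category labels as the exercise. The toy array is an assumption made up for this example.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical column (2D, as scikit-learn transformers expect).
cats = np.array([['C'], ['A'], ['D'], ['B'], ['A']])

encoder = OrdinalEncoder()
codes = encoder.fit_transform(cats)

# Categories are sorted before coding, so A -> 0, B -> 1, C -> 2, D -> 3.
print(encoder.categories_[0].tolist())  # ['A', 'B', 'C', 'D']
print(codes.ravel().tolist())           # [2.0, 0.0, 3.0, 1.0, 0.0]
```

Note that the resulting order A < B < C < D is arbitrary for nominal categories. A linear model sees a single numeric column and fits one slope across the codes, which is worth keeping in mind when interpreting the results.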

## Starter Code


In [22]:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

# Generate synthetic data
np.random.seed(42)
n_samples = 10000

# Numeric features
feature1 = np.random.normal(0, 1, n_samples)
feature2 = np.random.exponential(2, n_samples)
feature3 = np.random.uniform(0, 10000, n_samples)

# Categorical feature
categories = ['A', 'B', 'C', 'D']
feature4 = np.random.choice(categories, n_samples)

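# Combine features into a DataFrame.
# Building it from a dict keeps the numeric columns as floats; np.column_stack
# with a string column would cast every value to a string.
df = pd.DataFrame({
    'feature1': feature1,
    'feature2': feature2,
    'feature3': feature3,
    'feature4': feature4,
})

# Binary target: depends on log(feature2) and on whether the category is 'A'
y = (0.2 * feature1 + np.log(feature2) + 0.0001 * feature3 + np.where(feature4 == 'A', 1, 0) > 2).astype(int)
df['target'] = y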

# Encode categorical feature
encoder = OrdinalEncoder()
df['feature4'] = encoder.fit_transform(df[['feature4']])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42)

# TODO: Implement logistic regression with different optimizers
# Hint: Use LogisticRegression(solver='lbfgs'), LogisticRegression(solver='saga'), and LogisticRegression(solver='sag')

# TODO: Compare the performance of each optimizer
# Hint: Use accuracy_score and log_loss to evaluate performance

for solver in ['sag', 'saga', 'lbfgs']:
    # Default settings (max_iter=100), no scaling
    lr = LogisticRegression(solver=solver)
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    print(f"Accuracy score with {solver}: {accuracy_score(y_test, y_pred)}")

Accuracy score with sag: 0.8015
Accuracy score with saga: 0.8
Accuracy score with lbfgs: 0.9365


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



## Questions to Consider

1. Which optimizer performs best on this dataset? Why do you think that is?
2. How does the performance vary between the training and test sets for each optimizer?
3. Do you notice any issues with convergence for any of the optimizers?
4. How might the different scales of the features be affecting the performance of each optimizer?
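
Questions 2 and 3 can be explored with a small harness like the one below. It is a sketch under stated assumptions, not the notebook's own code: it regenerates a smaller dataset with `np.random.default_rng` (so the numbers will not match the output above) and uses integer codes in place of the `OrdinalEncoder` output. It records whether each solver raised a `ConvergenceWarning`, how many iterations it used, and the training and test metrics.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)  # assumed seed
n = 2000                        # smaller than the exercise, to keep this quick
f1 = rng.normal(0, 1, n)
f2 = rng.exponential(2, n)
f3 = rng.uniform(0, 10000, n)
f4 = rng.integers(0, 4, n)      # stands in for the ordinal-encoded category (0 == 'A')
y = (0.2 * f1 + np.log(f2) + 0.0001 * f3 + (f4 == 0) > 2).astype(int)
X = pd.DataFrame({'feature1': f1, 'feature2': f2, 'feature3': f3, 'feature4': f4})
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

summary = {}
for solver in ['lbfgs', 'sag', 'saga']:
    model = LogisticRegression(solver=solver, random_state=0)
    # Record warnings instead of printing them, so convergence can be tabulated.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        model.fit(X_train, y_train)
    summary[solver] = {
        'converged': not any(issubclass(w.category, ConvergenceWarning) for w in caught),
        'n_iter': int(model.n_iter_[0]),
        'train_acc': accuracy_score(y_train, model.predict(X_train)),
        'test_acc': accuracy_score(y_test, model.predict(X_test)),
        'test_log_loss': log_loss(y_test, model.predict_proba(X_test)),
    }

print(pd.DataFrame(summary).T)
```

If `n_iter` equals the default `max_iter` of 100 and `converged` is `False`, the solver stopped because it ran out of iterations rather than because it found the optimum.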

# Exercise 3: Preprocessing and Pipeline Optimization

## Objective
Building on Exercise 2, you will now implement preprocessing steps and explore how they interact with different optimization algorithms. You'll use your knowledge of pipelines to create an efficient workflow.

## Tasks

1. Using the same dataset from Exercise 2, create a preprocessing pipeline that includes:
   - Handling of categorical variables (e.g., one-hot encoding)
   - Scaling of numerical features
   - Any other preprocessing steps you think might be beneficial
2. Implement this preprocessing pipeline along with logistic regression using different optimizers.
3. Compare the performance of each optimizer with and without preprocessing.
4. Experiment with different preprocessing techniques and observe their effects on model performance.

## Starter Code


In [None]:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

# Use the same data generation code from Exercise 2

# TODO: Create a preprocessing pipeline
# Hint: Use ColumnTransformer to apply different preprocessing to numerical and categorical columns


# TODO: Create a full pipeline that includes preprocessing and logistic regression
# Hint: Use Pipeline to combine preprocessing with LogisticRegression

# TODO: Implement the pipeline with different optimizers and compare their performance

# TODO: Experiment with different preprocessing techniques
# For example, try StandardScaler vs MinMaxScaler, or try polynomial features
# Hint: Try using the LogTransform class we developed in class

# TODO: Compare the performance of each pipeline configuration



## Answer

In [20]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

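# Generate synthetic data (same structure as Exercise 2).
# Note: the parameters differ from the Exercise 2 starter code: feature2 uses
# scale 1 (not 2) and feature3 spans 0-100 with coefficient 0.01 (not 0-10000
# with 0.0001), so these results are not directly comparable to Exercise 2.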
np.random.seed(42)
n_samples = 10000

feature1 = np.random.normal(0, 1, n_samples)
feature2 = np.random.exponential(1, n_samples)
feature3 = np.random.uniform(0, 100, n_samples)
categories = ['A', 'B', 'C', 'D']
feature4 = np.random.choice(categories, n_samples)

# Build the DataFrame from a dict so the numeric columns stay floats
# (np.column_stack with a string column would cast every value to a string).
df = pd.DataFrame({
    'feature1': feature1,
    'feature2': feature2,
    'feature3': feature3,
    'feature4': feature4,
})
y = (0.2 * feature1 + np.log(feature2) + 0.01 * feature3 + np.where(feature4 == 'A', 1, 0) > 2).astype(int)
df['target'] = y

# Split the data
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42)

# Create preprocessing pipelines
numeric_features = ['feature1', 'feature2', 'feature3']
categorical_features = ['feature4']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
    # ,
    # ('poly', PolynomialFeatures(degree=2, include_bias=False))
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(drop='first'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create full pipelines with different optimizers
optimizers = ['lbfgs', 'sag', 'saga']
pipelines = {}

for optimizer in optimizers:
    pipelines[optimizer] = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(solver=optimizer, max_iter=1000, random_state=42))
    ])

# Train and evaluate each pipeline
results = {}

for optimizer, pipeline in pipelines.items():
    pipeline.fit(X_train, y_train)
    
    y_pred_train = pipeline.predict(X_train)
    y_pred_test = pipeline.predict(X_test)
    y_pred_proba_train = pipeline.predict_proba(X_train)
    y_pred_proba_test = pipeline.predict_proba(X_test)
    
    results[optimizer] = {
        'train_accuracy': accuracy_score(y_train, y_pred_train),
        'test_accuracy': accuracy_score(y_test, y_pred_test),
        'train_log_loss': log_loss(y_train, y_pred_proba_train),
        'test_log_loss': log_loss(y_test, y_pred_proba_test)
    }

# Print results
print("Results with preprocessing:")
for optimizer, metrics in results.items():
    print(f"\nOptimizer: {optimizer}")
    for metric, value in metrics.items():
        print(f"{metric}: {value:.4f}")

# Compare with results without preprocessing
# (placeholder: the baseline pipelines are not built in this cell)
pipelines_no_prep = {}


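# Experiment: Try a different scaling method (MinMaxScaler) together with
# degree-2 polynomial features. This changes two things at once, so the
# comparison with the StandardScaler pipelines above is not like-for-like.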
from sklearn.preprocessing import MinMaxScaler

numeric_transformer_minmax = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False))
])

preprocessor_minmax = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer_minmax, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

pipeline_minmax = Pipeline([
    ('preprocessor', preprocessor_minmax),
    ('classifier', LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42))
])

pipeline_minmax.fit(X_train, y_train)
y_pred_test_minmax = pipeline_minmax.predict(X_test)
y_pred_proba_test_minmax = pipeline_minmax.predict_proba(X_test)

print("\nResults with MinMaxScaler:")
print(f"Test accuracy: {accuracy_score(y_test, y_pred_test_minmax):.4f}")
print(f"Test log loss: {log_loss(y_test, y_pred_proba_test_minmax):.4f}")

Results with preprocessing:

Optimizer: lbfgs
train_accuracy: 0.9830
test_accuracy: 0.9865
train_log_loss: 0.0431
test_log_loss: 0.0414

Optimizer: sag
train_accuracy: 0.9830
test_accuracy: 0.9870
train_log_loss: 0.0431
test_log_loss: 0.0415

Optimizer: saga
train_accuracy: 0.9830
test_accuracy: 0.9870
train_log_loss: 0.0431
test_log_loss: 0.0415

Results with MinMaxScaler:
Test accuracy: 0.9820
Test log loss: 0.0490
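
The cell above leaves `pipelines_no_prep` empty, so the output contains no baseline without preprocessing. One way to fill that gap is sketched below. It is an assumption-laden sketch rather than the notebook's code: it regenerates a smaller dataset with `np.random.default_rng`, introduces a hypothetical helper `build_pipeline`, and uses `OrdinalEncoder` plus `'passthrough'` as the "no preprocessing" baseline. It also times each fit with `time.perf_counter`.

```python
import time

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

rng = np.random.default_rng(0)  # assumed seed; numbers will differ from the notebook
n = 2000
data = pd.DataFrame({
    'feature1': rng.normal(0, 1, n),
    'feature2': rng.exponential(1, n),
    'feature3': rng.uniform(0, 100, n),
    'feature4': rng.choice(['A', 'B', 'C', 'D'], n),
})
target = (0.2 * data['feature1'] + np.log(data['feature2'])
          + 0.01 * data['feature3'] + (data['feature4'] == 'A') > 2).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=0, stratify=target)

numeric = ['feature1', 'feature2', 'feature3']
categorical = ['feature4']


def build_pipeline(solver, scale):
    """Hypothetical helper: scale=False only encodes the category (baseline)."""
    if scale:
        prep = ColumnTransformer([
            ('num', StandardScaler(), numeric),
            ('cat', OneHotEncoder(drop='first'), categorical),
        ])
    else:
        prep = ColumnTransformer([
            ('num', 'passthrough', numeric),
            ('cat', OrdinalEncoder(), categorical),
        ])
    clf = LogisticRegression(solver=solver, max_iter=1000, random_state=42)
    return Pipeline([('preprocessor', prep), ('classifier', clf)])


rows = []
for solver in ['lbfgs', 'sag', 'saga']:
    for scale in (False, True):
        pipe = build_pipeline(solver, scale)
        start = time.perf_counter()
        pipe.fit(X_train, y_train)
        elapsed = time.perf_counter() - start
        rows.append({
            'solver': solver,
            'scaled': scale,
            'fit_seconds': round(elapsed, 3),
            'n_iter': int(pipe.named_steps['classifier'].n_iter_[0]),
            'test_accuracy': pipe.score(X_test, y_test),
        })

print(pd.DataFrame(rows))
```

Comparing `n_iter` and `fit_seconds` between the scaled and unscaled rows for the same solver gives a direct view of how much work each solver needs with and without preprocessing.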



## Questions to Consider

1. How does preprocessing affect the performance of each optimizer?
2. Which combination of preprocessing steps and optimizer yields the best performance? Why do you think this is?
3. How does the training time compare between the preprocessed and non-preprocessed data for each optimizer?
4. Are there any preprocessing steps that seem to be particularly important for this dataset?
5. How might you further optimize this pipeline for better performance?
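
For question 5, recall that the target is built from `np.log(feature2)`, so giving the model the logarithm of that feature is a natural thing to try. The starter code mentions a `LogTransform` class from the course, which is not shown in this notebook; the class below is a hypothetical stand-in with a guessed interface, and the dataset is regenerated with an assumed seed.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class LogTransform(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the course class: natural log of each column."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        return np.log(np.asarray(X, dtype=float))


# Log-transform feature2 before scaling; scale the other numeric features directly.
log_then_scale = Pipeline([('log', LogTransform()), ('scaler', StandardScaler())])
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['feature1', 'feature3']),
    ('log', log_then_scale, ['feature2']),
    ('cat', OneHotEncoder(drop='first'), ['feature4']),
])
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='lbfgs', max_iter=1000)),
])

rng = np.random.default_rng(0)  # assumed seed; not the notebook's data
n = 2000
X = pd.DataFrame({
    'feature1': rng.normal(0, 1, n),
    'feature2': rng.exponential(1, n),
    'feature3': rng.uniform(0, 100, n),
    'feature4': rng.choice(['A', 'B', 'C', 'D'], n),
})
y = (0.2 * X['feature1'] + np.log(X['feature2'])
     + 0.01 * X['feature3'] + (X['feature4'] == 'A') > 2).astype(int)

model.fit(X, y)
train_accuracy = model.score(X, y)
print(f"Training accuracy with log-transformed feature2: {train_accuracy:.4f}")
```

scikit-learn's built-in `FunctionTransformer(np.log)` would do the same job without a custom class. Either way, the transform only makes sense for strictly positive features such as the exponentially distributed one here.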