<a href="https://colab.research.google.com/github/NagamallaVinay/Task-4/blob/main/Task_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Task 4: Classification with Logistic Regression**

Objective: Build a binary classifier using logistic regression.

Tools: Scikit-learn, Pandas, Matplotlib, Seaborn

# **1: Choose a binary classification dataset.**
We will use the provided data.csv file, which contains an `id` column, a `diagnosis` label (`M` for malignant, `B` for benign), and the features for a binary classification task.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
import seaborn as sns
import matplotlib.pyplot as plt

**Importing the CSV file into the notebook**

In [None]:
from google.colab import files
uploaded = files.upload()

Saving data.csv to data.csv


In [None]:
df = pd.read_csv('data.csv')
print("Dataset loaded successfully.")

Dataset loaded successfully.


In [None]:
# Separate features (X) and target (y).
# 'id' is dropped because it is an identifier; 'diagnosis' is dropped because it is the target.
X = df.drop(['id', 'diagnosis'], axis=1)
y = df['diagnosis']

# Encode the target variable: 'M' (Malignant) as 1, 'B' (Benign) as 0
y_encoded = y.map({'M': 1, 'B': 0})

print(f"Dataset shape: {df.shape}")
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Target value counts:\n{y.value_counts()}")

Dataset shape: (569, 33)
Features shape: (569, 31)
Target shape: (569,)
Target value counts:
diagnosis
B    357
M    212
Name: count, dtype: int64


# **2: Train/test split and standardize features**

We will split the data into training and testing sets, and then standardize the features to have a mean of 0 and a standard deviation of 1.

In [None]:
# Split the data into training (80%) and testing (20%) sets
# stratify=y_encoded ensures that the class distribution is preserved in both sets.
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded)

print("\nStep 2: Train-test split and feature standardization.")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform both training and testing data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Features standardized successfully.")


Step 2: Train-test split and feature standardization.
X_train shape: (455, 31)
X_test shape: (114, 31)
y_train shape: (455,)
y_test shape: (114,)
Features standardized successfully.


  updated_mean = (last_sum + new_sum) / updated_sample_count
  T = new_sum / new_sample_count
  new_unnormalized_variance -= correction**2 / new_sample_count
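To make the standardization step concrete, here is a minimal sketch on toy numbers (the values and the `toy_*` names are hypothetical, not taken from data.csv). It checks that `StandardScaler` applies z = (x - mean) / std per column, and that the test row is scaled with the statistics learned from the training rows only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not taken from data.csv).
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

# Fit on the training rows only, as in the cell above.
toy_scaler = StandardScaler().fit(toy_train)

# StandardScaler uses the population standard deviation (ddof=0).
manual_train = (toy_train - toy_train.mean(axis=0)) / toy_train.std(axis=0)
scaled_train = toy_scaler.transform(toy_train)

# The test row reuses the training mean and std; it is not re-centered on itself.
scaled_test = toy_scaler.transform(toy_test)

print(np.allclose(scaled_train, manual_train))
print(scaled_train.mean(axis=0), scaled_train.std(axis=0))
print(scaled_test)
```

Fitting the scaler on the training set alone keeps information about the test set out of the preprocessing step.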


# **3: Fit a Logistic Regression model.**

We will initialize and train a Logistic Regression model using the standardized training data.

In [None]:
from sklearn.linear_model import LogisticRegression
import numpy as np  # used for the array checks in Step 4

# Initialize the Logistic Regression model.
# 'liblinear' works well on small datasets such as this one; scikit-learn's own
# default solver is 'lbfgs', and 'saga' tends to scale better to large datasets.
# 'random_state' makes the run reproducible.
# 'max_iter' can be increased if convergence warnings appear.
try:
    model = LogisticRegression(solver='liblinear', random_state=42)

    # Train the model on the scaled training data
    # X_train_scaled is expected to be a numpy array from StandardScaler
    # y_train is expected to be a pandas Series or numpy array of 0s and 1s
    model.fit(X_train_scaled, y_train)

    print("\nStep 3: Logistic Regression model fitted successfully.")

except ValueError as ve:
    print(f"\nError in Step 3 during model fitting: {ve}")
    print("Please check the shapes and types of X_train_scaled and y_train.")
    print(f"X_train_scaled shape: {X_train_scaled.shape}, dtype: {X_train_scaled.dtype}")
    print(f"y_train shape: {y_train.shape}, dtype: {y_train.dtype}")
except Exception as e:
    print(f"\nAn unexpected error occurred in Step 3: {e}")
    # You might want to print more details about the error if needed
    # import traceback
    # traceback.print_exc()


Error in Step 3 during model fitting: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
Please check the shapes and types of X_train_scaled and y_train.
X_train_scaled shape: (455, 31), dtype: float64
y_train shape: (455,), dtype: int64
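The error above says the feature matrix contains NaN values, so the model was never fitted. The notebook does not show where the NaNs come from. One possibility, offered here only as an assumption, is a column with no data at all, which CSV exports sometimes produce. The sketch below uses a toy frame (hypothetical column names such as `empty_col`) to show how one might locate missing values, drop fully empty columns, and fill any remaining gaps.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the feature matrix (hypothetical column names).
toy = pd.DataFrame({
    "radius": [14.2, 20.1, 11.5],
    "texture": [19.0, np.nan, 15.3],        # one missing value
    "empty_col": [np.nan, np.nan, np.nan],  # entirely empty column
})

# Count missing values per column to see where the NaNs come from.
nan_counts = toy.isna().sum()
print(nan_counts[nan_counts > 0])

# Drop columns that contain no data at all.
cleaned = toy.dropna(axis=1, how="all")

# Fill any remaining gaps with the column median (one simple option among several).
cleaned = cleaned.fillna(cleaned.median())
print(cleaned.isna().sum().sum())
```

In the notebook itself, the equivalent check would be `X.isna().sum()` before the train/test split. If imputation is needed, `sklearn.impute.SimpleImputer` fitted on the training set only is the more careful choice, since it avoids computing fill values from test rows.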


# **4: Evaluate with confusion matrix, precision, recall, ROC-AUC.**

We will use the trained model to make predictions on the test set and then evaluate its performance using standard classification metrics.

In [None]:
# This step depends on the successful completion of Step 3.
# Note: this check only confirms that the name 'model' exists. It does not confirm
# that fit() succeeded, so an unfitted model still reaches the try block below.
if 'model' in locals():
    print("\nStep 4: Evaluating the model.")

    try:
        # Make predictions on the scaled test set
        y_pred = model.predict(X_test_scaled)

        # Predict probabilities for the positive class (Malignant = 1)
        y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

        # --- Data Validation before metrics calculation ---
        print("Validating data for evaluation...")

        # Validate y_test
        if not (y_test.isin([0, 1]).all()):
            print("Error: y_test contains values other than 0 or 1. Check target encoding.")
            print(f"Unique values in y_test: {y_test.unique()}")
        else:
            print("y_test validation passed.")

        # Validate y_pred
        if not (np.isin(y_pred, [0, 1]).all()):
            print("Error: y_pred contains values other than 0 or 1. Check model predictions.")
            print(f"Unique values in y_pred: {np.unique(y_pred)}")
        else:
            print("y_pred validation passed.")

        # Validate y_pred_proba
        if not (np.issubdtype(y_pred_proba.dtype, np.number)):
            print("Error: y_pred_proba is not numerical.")
        elif not ((y_pred_proba >= 0) & (y_pred_proba <= 1)).all():
            print("Error: y_pred_proba contains values outside the [0, 1] range.")
            print(f"Min probability: {y_pred_proba.min()}, Max probability: {y_pred_proba.max()}")
        else:
            print("y_pred_proba validation passed.")

        # Ensure shapes are compatible for metrics
        if y_test.shape[0] != y_pred.shape[0] or y_test.shape[0] != y_pred_proba.shape[0]:
            print("Error: Mismatch in the number of samples between y_test, y_pred, and y_pred_proba.")
            print(f"Shapes: y_test={y_test.shape}, y_pred={y_pred.shape}, y_pred_proba={y_pred_proba.shape}")
        else:
            print("Sample count validation passed.")

        # Note: the checks above only report problems; the metrics below are computed either way.
        print("\nProceeding with metric calculations...")

        # Confusion Matrix
        cm = confusion_matrix(y_test, y_pred)
        print("\nConfusion Matrix:")
        print(cm)

        # Plot Confusion Matrix
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', annot_kws={"size": 12},
                    xticklabels=['Benign (0)', 'Malignant (1)'], yticklabels=['Benign (0)', 'Malignant (1)'])
        plt.xlabel('Predicted Label')
        plt.ylabel('True Label')
        plt.title('Confusion Matrix')
        plt.show()

        # Classification Report (Precision, Recall, F1-Score)
        print("\nClassification Report:")
        print(classification_report(y_test, y_pred, target_names=['Benign (0)', 'Malignant (1)']))

        # ROC Curve and AUC
        fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
        roc_auc = roc_auc_score(y_test, y_pred_proba)
        print(f"\nROC AUC Score: {roc_auc:.4f}")

        # Plot ROC Curve
        plt.figure(figsize=(8, 6))
        plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
        plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='No Skill Classifier')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('Receiver Operating Characteristic (ROC) Curve')
        plt.legend(loc="lower right")
        plt.show()

    except Exception as e:
        print(f"\nError occurred during evaluation metrics calculation in Step 4: {e}")
        print("This might indicate an issue with the data passed to the metrics functions.")
        # Optional: print detailed traceback for debugging
        # import traceback
        # traceback.print_exc()

else:
    print("\nSkipping Step 4 evaluation as the model fitting in Step 3 failed.")


Step 4: Evaluating the model.

Error occurred during evaluation metrics calculation in Step 4: This LogisticRegression instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
This might indicate an issue with the data passed to the metrics functions.
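The "not fitted yet" message shows that the `'model' in locals()` guard let an unfitted estimator through: the object was created before `fit` raised. A minimal sketch of a stricter guard follows, using scikit-learn's `check_is_fitted`. The helper name `is_fitted` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.utils.validation import check_is_fitted


def is_fitted(estimator):
    """Hypothetical helper: return True only if the estimator has been fitted."""
    try:
        check_is_fitted(estimator)
        return True
    except NotFittedError:
        return False


# A freshly created model exists as an object but has not been fitted.
toy_model = LogisticRegression(solver="liblinear", random_state=42)
before = is_fitted(toy_model)

# After a successful fit on toy data, the check passes.
toy_X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
toy_y = np.array([0, 0, 1, 1])
toy_model.fit(toy_X, toy_y)
after = is_fitted(toy_model)

print(before, after)
```

With a guard like this, the evaluation cell could skip cleanly when Step 3 fails instead of raising inside the `try` block.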


# **5: Tune threshold and explain sigmoid function.**

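We will demonstrate how changing the classification threshold affects performance and then explain the sigmoid function.

For binary logistic regression, `predict` assigns the positive class when the predicted probability is at least 0.5. Let's see how lowering the threshold to 0.3 impacts precision and recall: more samples are labeled malignant, so recall can only stay the same or rise, usually at the cost of precision.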

In [None]:
if 'model' in locals() and 'y_pred_proba' in locals(): # Check if model and probabilities are available

    new_threshold = 0.3

    # Predict class labels using the new threshold
    # If probability of positive class is >= new_threshold, predict 1 (Malignant), else 0 (Benign).
    y_pred_new_threshold = (y_pred_proba >= new_threshold).astype(int)

    print("\nStep 5: Demonstrating threshold tuning and explaining sigmoid function.")
    print(f"\nClassification Report with Threshold = {new_threshold}:")
    try:
        print(classification_report(y_test, y_pred_new_threshold, target_names=['Benign (0)', 'Malignant (1)']))
    except Exception as e:
        print(f"Error generating Classification Report with new threshold: {e}")

    # --- 5.2. Explanation of the Sigmoid Function ---
    print("\n--- Explanation of the Sigmoid Function ---")
    print("The sigmoid function, also known as the logistic function, is crucial in logistic regression.")
    print("It squashes any real-valued input into a value between 0 and 1, representing a probability.")
    print("\nFormula:  σ(z) = 1 / (1 + e^(-z))")
    print("\nWhere 'z' is the linear combination of input features and their weights (z = w0 + w1*x1 + ...).")
    print("\nIn logistic regression, the sigmoid function transforms this linear output into a probability")
    print("of belonging to the positive class. This probability is then used with a threshold (e.g., 0.5)")
    print("to make the final binary classification.")
    print("\nBy adjusting the threshold, we can tune the model's sensitivity to correctly classify positive instances")
    print("(recall) versus the accuracy of its positive predictions (precision).")

else:
    print("\nSkipping Step 5 as the model fitting or probability prediction failed in previous steps.")


Skipping Step 5 as the model fitting or probability prediction failed in previous steps.
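Because the cell above was skipped, no threshold comparison was produced. The sketch below illustrates the idea on made-up labels and probabilities (the `toy_*` arrays are hypothetical and say nothing about how the real model would perform). It compares precision and recall at thresholds 0.5 and 0.3.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and positive-class probabilities (not model output).
toy_y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])
toy_proba = np.array([0.10, 0.20, 0.32, 0.35, 0.60, 0.40, 0.45, 0.70, 0.90])

results = {}
for threshold in (0.5, 0.3):
    # Predict 1 when the probability reaches the threshold, else 0.
    toy_pred = (toy_proba >= threshold).astype(int)
    precision = precision_score(toy_y_true, toy_pred)
    recall = recall_score(toy_y_true, toy_pred)
    results[threshold] = (precision, recall)
    print(f"threshold={threshold}: precision={precision:.3f}, recall={recall:.3f}")
```

On this toy data, lowering the threshold raises recall from 0.5 to 1.0 while precision drops from 2/3 to 4/7. In a screening setting where a missed malignant case is costly, that trade may be acceptable.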


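# **Sigmoid function**
The sigmoid function, σ(z) = 1 / (1 + e^(-z)), is fundamental to logistic regression. It takes any real-valued input z (here, the weighted sum of the features plus an intercept) and maps it to an output between 0 and 1, which can be interpreted as the probability of the positive class.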
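As a small numerical check, the sketch below evaluates the sigmoid at a few points and, on toy data (hypothetical values), compares it with a fitted binary `LogisticRegression`: applying the sigmoid to `decision_function` should match the positive-class column of `predict_proba`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def sigmoid(z):
    """Logistic function: maps real-valued z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Large negative inputs approach 0, zero maps to 0.5, large positive inputs approach 1.
z_values = np.array([-4.0, 0.0, 4.0])
print(sigmoid(z_values))

# Toy one-feature dataset (hypothetical values).
toy_X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])
toy_model = LogisticRegression(solver="liblinear", random_state=42).fit(toy_X, toy_y)

# decision_function returns z = w0 + w1*x1; the sigmoid turns it into a probability.
z = toy_model.decision_function(toy_X)
proba = toy_model.predict_proba(toy_X)[:, 1]
print(np.allclose(sigmoid(z), proba))
```

Since sigmoid(0) = 0.5, the default 0.5 threshold corresponds to predicting the positive class whenever z is non-negative.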