# Overview  

This is the third notebook in my series where I’ll be implementing machine learning algorithms using the **Scikit-Learn** library.  

In this notebook, we’ll explore **Logistic Regression**, a fundamental classification algorithm that extends the principles of Linear Regression to handle categorical outcomes. 

If you’d like to see how this algorithm can be implemented **from scratch**, check out this [notebook](https://www.kaggle.com/code/rameelsohail/logistic-regression-from-scratch).  
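
To make the link to Linear Regression concrete: logistic regression computes the same weighted sum of the features, then passes it through the sigmoid function to turn the score into a probability. Here is a minimal sketch of that idea. The weights, bias and the `predict_proba_one` helper are made up for illustration and are not part of this notebook or of Scikit-Learn.

```python
import math

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_one(features, weights, bias):
    """Probability of the positive class for one sample (hypothetical helper)."""
    # Same linear combination as in Linear Regression
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Illustrative values only, not learned from the dataset
weights = [0.8, -0.5]
bias = 0.1

p = predict_proba_one([1.0, 2.0], weights, bias)
label = int(p >= 0.5)  # threshold the probability to get a class
print(round(p, 3), label)
```

The 0.5 cut-off used here is the usual default for turning a probability into a class label.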

# Imports

In [None]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Data Loading and Analysis

In [None]:
df = pd.read_csv('breast-cancer.csv')

In [None]:
df.head()

In [None]:
df.shape

Currently, our dataset contains **32 columns**. Two of them, `id` and the `diagnosis` label, are not features, which leaves 30 candidate independent variables.  
Having that many features doesn’t necessarily improve our model. It can make training slower and potentially lead to overfitting.  

So, we’ll perform some **Exploratory Data Analysis (EDA)** to identify and extract the **most important features** that truly impact our target variable.

In [None]:
df.describe().T

In [None]:
fig = px.histogram(data_frame=df, x='diagnosis', color='diagnosis', color_discrete_sequence=['#05445E','#75E6DA'])
fig.show(renderer='iframe')

In [None]:
fig = px.histogram(data_frame=df, x='radius_mean', color='diagnosis', color_discrete_sequence=['#05445E','#75E6DA'])
fig.show(renderer='iframe')

In [None]:
fig = px.histogram(data_frame=df, x='area_mean', color='diagnosis', color_discrete_sequence=['#05445E','#75E6DA'])
fig.show(renderer='iframe')

In [None]:
fig = px.histogram(data_frame=df, x='perimeter_mean', color='diagnosis', color_discrete_sequence=['#05445E','#75E6DA'])
fig.show(renderer='iframe')

In [None]:
fig = px.histogram(data_frame=df, x='texture_mean', color='diagnosis', color_discrete_sequence=['#05445E','#75E6DA'])
fig.show(renderer='iframe')

In [None]:
fig = px.histogram(data_frame=df, x='smoothness_mean', color='diagnosis', color_discrete_sequence=['#05445E','#75E6DA'])
fig.show(renderer='iframe')

# Data Processing

In [None]:
df.drop('id', axis=1, inplace=True)  # drop the id column, which carries no predictive information

In [None]:
df['diagnosis'] = (df['diagnosis'] == 'M').astype(int)  # encode the label as 1 (M) / 0 (B)

In [None]:
df.head()

Now we'll compute the **correlation** of each feature with our target label, and keep only those features whose **absolute correlation is greater than 0.2**, so that strongly negative relationships count as well.  
This helps us focus on the most relevant variables and reduce unnecessary noise in our dataset.
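
For intuition, each entry returned by `df.corr()` is by default a Pearson correlation coefficient. The sketch below computes it by hand for two toy feature columns and applies the same 0.2 threshold on the absolute value. The data is invented (not taken from the dataset), and `pearson` is a hypothetical helper, not a pandas function.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data: the label and two made-up features
label = [0, 0, 1, 1, 0, 1]
features = {
    "feature_a": [1.0, 1.5, 3.0, 3.5, 1.2, 2.8],  # rises with the label
    "feature_b": [5.0, 2.0, 5.0, 2.0, 3.5, 3.5],  # unrelated to the label
}

threshold = 0.2
selected = [name for name, values in features.items()
            if abs(pearson(values, label)) > threshold]
print(selected)
```

Only the feature that moves with the label passes the threshold. The uncorrelated one is dropped.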

In [None]:
corr = df.corr()

In [None]:
# The heatmap helps visualize the correlation values
plt.figure(figsize=(20,20))
sns.heatmap(corr, cmap='RdBu',annot=True)
plt.show()

In [None]:
# Get the absolute value of the correlation
cor_target = abs(corr["diagnosis"])

# Select highly correlated features (threshold = 0.2)
relevant_features = cor_target[cor_target>0.2]

# Collect the names of the features
names = relevant_features.index.tolist()

# Drop the target variable from the results
names.remove('diagnosis')
len(names)

In [None]:
X = df[names].values
y = df['diagnosis'].values

In [None]:
# Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# Standardizing the data so that our model converges faster
# The scaler is fitted on the training set only and then reused on the test set
standardizer = StandardScaler()
X_train = standardizer.fit_transform(X_train)
X_test = standardizer.transform(X_test)
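
As a sketch of what the scaling step does: `StandardScaler` learns each feature's mean and standard deviation from the training data only, then applies those same statistics to the test data. The toy version below does this for a single feature with invented values. `fit_scaler` and `apply_scaler` are hypothetical helpers used only for illustration.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and population standard deviation of one feature."""
    return statistics.fmean(values), statistics.pstdev(values)

def apply_scaler(values, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in values]

train = [10.0, 12.0, 14.0, 16.0, 18.0]  # made-up training values
test = [14.0, 20.0]                      # made-up test values

mean, std = fit_scaler(train)            # statistics come from train only
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)

print(train_scaled)
print(test_scaled)
```

After scaling, the training values have mean 0 and standard deviation 1. The test values generally will not, because they are scaled with the training statistics, and that is the intended behaviour.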

# Model Implementation

In [None]:
lg = LogisticRegression()
lg.fit(X_train, y_train)

# Evaluation

In [None]:
# Making predictions using our model
y_pred = lg.predict(X_test)

# Evaluating the model using classification metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

In [None]:
print(f"Accuracy: {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")
print(f"F1-Score: {f1:.2%}")
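
The four scores above all derive from the counts of true/false positives and negatives, with malignant (1) as the positive class. Here is a minimal sketch on invented labels (not the model's real predictions), using `toy_` names so that nothing in the notebook is overwritten:

```python
# Made-up labels for illustration: 1 = malignant, 0 = benign
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # malignant, predicted malignant
tn = sum(t == 0 and p == 0 for t, p in pairs)  # benign, predicted benign
fp = sum(t == 0 and p == 1 for t, p in pairs)  # benign, predicted malignant
fn = sum(t == 1 and p == 0 for t, p in pairs)  # malignant, predicted benign

toy_accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
toy_precision = tp / (tp + fp)           # share of predicted positives that are real
toy_recall = tp / (tp + fn)              # share of real positives that were found
toy_f1 = 2 * toy_precision * toy_recall / (toy_precision + toy_recall)

print(tp, tn, fp, fn)
print(toy_accuracy, toy_precision, toy_recall, toy_f1)
```

For a cancer diagnosis task, recall deserves particular attention, since a false negative means a malignant case was missed.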

In [None]:
# If you'd like the complete summary in one table
report = classification_report(y_test, y_pred)

In [None]:
print(report)