# Project Overview: Logistic Regression for Binary Classification

In this project, I demonstrate the use of **logistic regression** to perform a **binary classification task**. Logistic regression extends linear regression by passing a linear combination of the input features through a sigmoid function, which maps the result to a probability between 0 and 1. This project uses the `scikit-learn` library (imported as `sklearn`) to fit a logistic regression model to a dataset and make predictions based on input features.

## Key Objectives
- **Understand logistic regression**: Logistic regression models the relationship between input features and a binary target variable by fitting a sigmoid curve to the data. It allows me to estimate the probability of a sample belonging to a positive class.
- **Fitting and evaluating the model**: I will demonstrate how to fit a logistic regression model to training data, assess its coefficients, and use the trained model to predict class labels for unseen data.
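
To make the sigmoid mapping concrete, here is a minimal sketch of the function itself. The input scores are hypothetical values chosen only for illustration; they are not taken from the dataset.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (log-odds), chosen only for illustration
scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = sigmoid(scores)

# A score of 0 maps to exactly 0.5; larger scores map closer to 1
print(probabilities)
```

Large negative scores map close to 0, large positive scores map close to 1, and a score of 0 sits on the decision boundary at 0.5.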

## Code Breakdown and Explanation

## 1. Importing Libraries and the Dataset
I start by importing the necessary libraries for data manipulation and logistic regression. I then load the dataset, which contains information on students' exam performance based on study hours and practice test scores. I display the first few rows of the dataset to get an overview of its structure.

In [9]:
# Import libraries and data
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

codecademyU = pd.read_csv('data.csv')
print(codecademyU.head())

   hours_studied  practice_test  passed_exam
0              0             55            0
1              1             75            0
2              2             32            0
3              3             80            0
4              4             75            0


## 2. Preparing Data for the Model
Next, I separate the dataset into input features (X) and the target variable (y). The features include the number of hours studied and the score on practice tests, while the target variable indicates whether the student passed the exam (1) or not (0).

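Before fitting the model, I standardise the features using `StandardScaler` so that they are on a similar scale. This matters because scikit-learn's `LogisticRegression` applies L2 regularisation by default, and the penalty treats all coefficients alike, so features measured on very different scales would be penalised unevenly.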

In [10]:
# Separate out X and y
X = codecademyU[['hours_studied', 'practice_test']]
y = codecademyU.passed_exam

# Transform X
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
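
As a rough illustration of what `StandardScaler` computes, the sketch below standardises the five rows printed earlier by hand and compares the result with the scaler's output. The scaled values here will not match the notebook's `X`, because the notebook fits the scaler on every row of the dataset rather than on these five.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five rows printed above: hours_studied, practice_test
sample = np.array([[0, 55], [1, 75], [2, 32], [3, 80], [4, 75]], dtype=float)

# Manual standardisation: subtract the column mean, then divide by the
# column standard deviation (population form, ddof=0)
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

# The same transformation via scikit-learn
scaled = StandardScaler().fit_transform(sample)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, so a one-unit change in a scaled feature corresponds to one standard deviation in the original units.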

## 3. Splitting the Dataset into Training and Testing Sets
I split the dataset into training and testing sets using a 75-25 split (`test_size=0.25`). This helps in evaluating the model’s performance on unseen data after training.

In [14]:
# Split data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)
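
The sketch below checks how many rows `test_size=0.25` sets aside. It assumes a dataset of 20 rows, which is consistent with the five test rows printed later but is not stated anywhere in the data itself; `X_demo` and `y_demo` are hypothetical stand-ins for the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 rows, 2 features (assumed dataset size)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=27
)

# With 20 rows, a 0.25 test fraction holds out 5 rows and trains on 15
print(X_tr.shape, X_te.shape)
```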

## 4. Fitting the Logistic Regression Model
Here, I create an instance of the logistic regression model and fit it to the training data. After fitting, I print the model’s coefficients and intercept to gain insight into the relationship between the input features and the target variable. The coefficients indicate how much each feature contributes to the probability of passing the exam.

In [17]:
# Create and fit the logistic regression model
from sklearn.linear_model import LogisticRegression

cc_lr = LogisticRegression()
cc_lr.fit(X_train, y_train)
# Print the coefficients and the intercept
print(cc_lr.coef_)
print(cc_lr.intercept_)

[[1.51032451 0.11984701]]
[-0.13193653]
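
One common way to read these coefficients is to exponentiate them into odds ratios. The sketch below does this for the values printed above; because the features were standardised, each ratio describes the effect of a one-standard-deviation increase rather than a one-hour or one-point increase.

```python
import numpy as np

# Coefficients copied from the printed output above
feature_names = ["hours_studied", "practice_test"]
coef = np.array([1.51032451, 0.11984701])

# exp(coefficient) is the multiplicative change in the odds of passing
# for a one-unit increase in the (standardised) feature
odds_ratios = np.exp(coef)

for name, ratio in zip(feature_names, odds_ratios):
    print(f"{name}: odds of passing multiply by about {ratio:.2f}")
```

On this reading, hours studied has a much larger effect on the odds of passing than the practice test score does.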


## 5. Making Predictions with the Trained Model
After training the model, I can use it to make predictions on new data points. In sklearn, the .predict() method takes a matrix of features as input and returns a vector of predicted class labels (0 or 1). The model predicts whether each sample belongs to the positive class based on a probability threshold (default is 0.5).

In [20]:
# Print out the predicted outcomes for the test data
print(cc_lr.predict(X_test))
print()
print(X_test)
print()
# Print out the predicted probabilities for the test data
print(cc_lr.predict_proba(X_test))
print()
# Print out the true outcomes for the test data
print(y_test)

[0 1 0 1 1]

[[-0.43355498  0.29722219]
 [ 0.95382097  0.29722219]
 [-1.64750894 -1.79313169]
 [ 0.26013299  0.42786931]
 [ 1.30066495  0.62383999]]

[[0.67942358 0.32057642]
 [0.20680975 0.79319025]
 [0.94454394 0.05545606]
 [0.42257111 0.57742889]
 [0.12928955 0.87071045]]

7     0
15    1
0     0
11    0
17    1
Name: passed_exam, dtype: int64
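
To show where these probabilities come from, the sketch below recomputes them by hand from the printed coefficients, intercept, and scaled test features. It uses only NumPy and the numbers shown above, so small rounding differences from the fitted model are expected.

```python
import numpy as np

# Values copied from the printed outputs above
coef = np.array([1.51032451, 0.11984701])
intercept = -0.13193653
X_test_printed = np.array([
    [-0.43355498,  0.29722219],
    [ 0.95382097,  0.29722219],
    [-1.64750894, -1.79313169],
    [ 0.26013299,  0.42786931],
    [ 1.30066495,  0.62383999],
])

# Linear score (log-odds) for each sample, then the sigmoid
log_odds = X_test_printed @ coef + intercept
p_pass = 1.0 / (1.0 + np.exp(-log_odds))

# Predict class 1 when the probability of passing exceeds 0.5
labels = (p_pass > 0.5).astype(int)

print(p_pass.round(4))
print(labels)
```

The recomputed values agree with the second column of `predict_proba` and with the labels returned by `.predict()`.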


### Understanding prediction results

With `predict_proba()`, each row gives the predicted probability of class 0 ('fail') followed by the predicted probability of class 1 ('pass'). For the first test sample, [0.67942358 0.32057642], the model assigns a 68% probability to a fail and a 32% probability to a pass (adding up to 100%).

The fourth data point was incorrectly classified as having passed the exam (its true label is 0). However, its predicted probability of passing was only 57.7%, noticeably lower than for the two students who were correctly predicted to pass (79.3% and 87.1%, respectively).
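
The effect of the decision threshold can be sketched with the probabilities printed above. The 0.6 threshold below is a hypothetical value chosen purely for illustration; with only five test samples, it should not be read as a tuned or recommended setting.

```python
import numpy as np

# Probability of passing (second column of the predict_proba output above)
p_pass = np.array([0.32057642, 0.79319025, 0.05545606, 0.57742889, 0.87071045])
# True outcomes printed above
y_true = np.array([0, 1, 0, 0, 1])

# Default behaviour: predict a pass when the probability exceeds 0.5
labels_default = (p_pass > 0.5).astype(int)

# Hypothetical stricter threshold, for illustration only
labels_strict = (p_pass > 0.6).astype(int)

print(labels_default, (labels_default == y_true).mean())
print(labels_strict, (labels_strict == y_true).mean())
```

Because the misclassified student's probability lies between 0.5 and 0.6, the stricter threshold flips only that one prediction.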

## 6. Evaluating the Model with Accuracy, Precision, Recall, F1-score, and Confusion Matrix
To assess the model's performance, I import the necessary metrics from sklearn.metrics. I then calculate and print:

- Accuracy: Measures the proportion of correctly classified samples.
- Precision: The proportion of positive identifications that were actually correct.
- Recall: The proportion of actual positives that were correctly identified.
- F1-score: The harmonic mean of precision and recall.
- Confusion matrix: Provides a summary of correct and incorrect predictions with respect to the true class labels.

In [32]:
# Import evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Predict class labels for the test data
y_pred = cc_lr.predict(X_test)

# Calculate and print evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print("Confusion Matrix:")
print(conf_matrix)

Accuracy: 0.80
Precision: 0.67
Recall: 1.00
F1 Score: 0.80
Confusion Matrix:
[[2 1]
 [0 2]]
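
As a cross-check, the same metrics can be derived by hand from the true and predicted labels printed earlier. This sketch uses only NumPy; the variable names are my own and do not come from `sklearn.metrics`.

```python
import numpy as np

# True and predicted labels copied from the printed outputs above
y_true = np.array([0, 1, 0, 0, 1])
y_hat = np.array([0, 1, 0, 1, 1])

# Confusion-matrix cells, treating class 1 ('pass') as positive
tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # true negatives

accuracy_manual = (tp + tn) / len(y_true)
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"Accuracy: {accuracy_manual:.2f}")
print(f"Precision: {precision_manual:.2f}")
print(f"Recall: {recall_manual:.2f}")
print(f"F1 Score: {f1_manual:.2f}")
```

The single false positive (the fourth test sample) is what pulls precision down to 0.67 while recall stays at 1.00.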


## Summary
In this project, I demonstrated how to:

- Fit a logistic regression model using the sklearn library.
- Standardise the features so that they are on a comparable scale before fitting.
- Interpret the model's coefficients to understand feature importance.
- Predict the probability of passing an exam based on input features such as hours studied and practice test scores.
- Evaluate the model using accuracy, precision, recall, F1-score, and a confusion matrix to understand its effectiveness.

This project provided a comprehensive introduction to implementing logistic regression in Python using sklearn and explored how to interpret and evaluate the results from a fitted model.