# Training a simple logistic regression model

This project builds a logistic regression model to predict whether a student will be admitted to graduate school based on their academic profile. The process starts by preparing our data through several key transformations: one-hot encoding converts school ranks into separate binary columns, while z-score standardization puts GRE scores and GPAs on comparable scales.

The data is then split into training (90%) and testing (10%) sets to properly evaluate the model's performance. The logistic regression model learns by iteratively adjusting weights through gradient descent over 1000 epochs. During each epoch, it calculates admission probabilities using the sigmoid function, compares these predictions to actual admission results, and updates the weights to minimize prediction errors.

The learning rate of `0.5` controls how large each weight update is, and the training loss printed every 100 epochs shows whether the model is still improving. The final model converts probability scores (between `0` and `1`) into yes/no decisions using a `0.5` threshold and reaches an accuracy of `0.725` on previously unseen test data.

This approach provides an automated way to assess admission chances based on academic metrics, though it's important to note it's a simplified model of what is typically a more complex admission process.

### Import libraries

In [84]:
import numpy as np
import pandas as pd


### Import data

In [85]:
folder = 'labs/train_nn_on_graduate_school_admissions_data'
admissions = pd.read_csv(f"{folder}/binary.csv")
admissions.head(10)

Unnamed: 0,admit,gre,gpa,rank
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.0,1
3,1,640,3.19,4
4,0,520,2.93,4
5,1,760,3.0,2
6,1,560,2.98,1
7,0,400,3.08,2
8,1,540,3.39,3
9,0,700,3.92,2


### One-hot encoding

One-hot encoding was used to convert the `'rank'` column into a format that machine learning models can better understand. Since models work with numbers, but our rank categories `(1, 2, 3, 4)` are just labels for different school ranks, we need to transform them so the model doesn't think `rank 4` is "twice as good" as rank 2.

The code creates separate columns for each rank `(rank_1, rank_2, rank_3, rank_4)`, puts a 1 in the column matching a student's school rank, and 0 in all others. For instance, a rank 2 school gets marked as 0-1-0-0 across these columns. This ensures the model treats each rank independently, without assuming any numerical relationships between them.

In [86]:
# One-hot encoding
data = pd.concat([admissions, pd.get_dummies(admissions['rank'], prefix='rank')], axis=1)
data = data.drop('rank', axis=1)

data.head(5)

Unnamed: 0,admit,gre,gpa,rank_1,rank_2,rank_3,rank_4
0,0,380,3.61,False,False,True,False
1,1,660,3.67,False,False,True,False
2,1,800,4.0,True,False,False,False
3,1,640,3.19,False,False,False,True
4,0,520,2.93,False,False,False,True
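
The 0-1-0-0 pattern described above can be checked on a tiny example. This is a minimal sketch on a hypothetical four-element `ranks` series (not the admissions data); `dtype=int` is passed only so the columns print as `0`/`1` instead of `True`/`False`.

```python
import pandas as pd

# Hypothetical school ranks for four students (illustration only)
ranks = pd.Series([2, 4, 1, 3], name="rank")

# One column per distinct rank; dtype=int shows 0/1 rather than booleans
one_hot = pd.get_dummies(ranks, prefix="rank", dtype=int)
print(one_hot)

# The first student attends a rank 2 school
rank2_row = one_hot.iloc[0].tolist()
print(rank2_row)
```

Each row contains exactly one `1`, so no ordering or distance between ranks is implied.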


### Z-score standardization

We apply z-score standardization to the `gre` and `gpa` columns to put them on a similar scale. In this dataset GRE scores are in the hundreds (the first ten rows range from `380` to `800`), while GPAs sit on a `0-4` scale. Left as is, this difference in scale would let the `gre` column dominate the weighted sum simply because its numbers are bigger, not because it is more informative.

The code standardizes each value by subtracting the column mean and dividing by the column standard deviation. Afterwards both columns are centered on `0` with a standard deviation of `1`, so most values fall between `-3` and `+3` and the sign shows whether a score is above or below average. In the output below, a `GRE` of `800` becomes about `+1.84`, while a `GPA` of `2.93` becomes about `-1.21`. This transformation helps the model treat both measures fairly, regardless of their original scales.

In [87]:
# Z-score standardization
for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    data[field] = data[field].astype(np.float64)
    data.loc[:, field] = (data[field] - mean) / std

data.head(5)

Unnamed: 0,admit,gre,gpa,rank_1,rank_2,rank_3,rank_4
0,0,-1.798011,0.578348,False,False,True,False
1,1,0.625884,0.736008,False,False,True,False
2,1,1.837832,1.603135,True,False,False,False
3,1,0.452749,-0.525269,False,False,False,True
4,0,-0.586063,-1.208461,False,False,False,True
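
As a quick sanity check of the formula, the sketch below standardizes only the ten GRE values shown in the first table. Because the mean and standard deviation come from ten rows rather than the full dataset, the resulting z-scores differ from the output above; the point is only that the result has mean `0` and standard deviation `1`.

```python
import numpy as np
import pandas as pd

# GRE values from the first ten rows of the dataset
gre = pd.Series([380, 660, 800, 640, 520, 760, 560, 400, 540, 700], dtype=np.float64)

# z-score: distance from the mean, measured in standard deviations
z = (gre - gre.mean()) / gre.std()

print(z.round(3).tolist())
print("mean:", round(z.mean(), 6), "std:", round(z.std(), 6))
```

Note that `Series.std()` uses the sample standard deviation (`ddof=1`) by default, which is also what the notebook cell uses.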


### Splitting the data into train and test sets

We're using this code to create a fair way to test how well our model will work with new data. The code starts with `np.random.seed(42)`, which ensures our random split stays consistent across runs, like using the same shuffle pattern for cards. We then use `np.random.choice` with `replace=False` to randomly select `90%` of our data rows, similar to picking `90` students from a class of `100` without repeating any selections.

The data is split into training data (`90%`) and test data (`10%`), mimicking real-world model application. The larger portion teaches our model, while the smaller portion tests it - like how students learn from textbooks but are tested on different questions to ensure true understanding rather than memorization. This method helps ensure our model can handle new data effectively.

In [88]:
# Split data into train and test sets
np.random.seed(42)

sample = np.random.choice(data.index, size=int(len(data) * 0.9), replace=False)
# `sample` holds index labels, so select with .loc; the remaining rows form the test set
data, test_data = data.loc[sample], data.drop(sample)

print(f"Training set shape: {data.shape}")
print(f"test set shape: {test_data.shape}")

Training set shape: (360, 7)
test set shape: (40, 7)
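
The same splitting pattern can be verified on a small frame. This is a sketch on a hypothetical ten-row `toy` DataFrame; it checks that sampling without replacement yields two sets that share no rows and together cover the whole frame.

```python
import numpy as np
import pandas as pd

np.random.seed(42)

# Hypothetical ten-row dataset (illustration only)
toy = pd.DataFrame({"value": np.arange(10) * 10})

# Pick 90% of the index labels without repeats
sample = np.random.choice(toy.index, size=int(len(toy) * 0.9), replace=False)
train, test = toy.loc[sample], toy.drop(sample)

# No row may appear in both sets
overlap = set(train.index) & set(test.index)
print(len(train), len(test), overlap)
```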


### Separating our data into features and targets

This code is separating our data into input features and target values that we want to predict. We create four key sets: our main `features` (all columns except `admit`), our main `targets` (just the `admit` column), and corresponding test versions (`features_test` and `targets_test`) from our test data. The `print` statements show us the dimensions of each set, helping us verify we've split everything correctly.

Think of it like organizing a recipe: the features are your ingredients (like flour, sugar, eggs), and the target is what you're trying to make (like a cake). We split them into training sets (where we learn the recipe) and test sets (where we verify if we learned it correctly). The shapes tell us how many examples and characteristics we have in each set.

In [89]:
# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit']
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']

print(f"Features shape: {features.shape}")
print(f"Targets shape: {targets.shape}")
print(f"Test features shape: {features_test.shape}")
print(f"Test targets shape: {targets_test.shape}")

Features shape: (360, 6)
Targets shape: (360,)
Test features shape: (40, 6)
Test targets shape: (40,)


### Sigmoid function

The `sigmoid` function helps convert any number into a probability between `0` and `1`. It works like a special calculator that takes any number and transforms it into this limited range. We use `np.array(x, dtype=float)` to ensure we're working with decimal numbers, then apply the formula `1 / (1 + np.exp(-x))` to create an S-shaped curve.

When a number goes through this function, large positive numbers become close to `1`, large negative numbers become close to `0`, and `0` becomes exactly `0.5`. This transformation is crucial for logistic regression as it helps us interpret our results as probabilities of admission.

In [90]:
def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-np.array(x, dtype=float)))
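
The behaviour described above can be confirmed by evaluating the function at a few points. The sketch below redefines `sigmoid` so that it runs on its own; the input values are arbitrary.

```python
import numpy as np

def sigmoid(x):
    """Map any real number (or array) into the open interval (0, 1)."""
    return 1 / (1 + np.exp(-np.array(x, dtype=float)))

# Large negative, zero, and large positive inputs
values = sigmoid([-10, 0, 10])
print(values)

# The curve is symmetric around 0.5: sigmoid(-x) == 1 - sigmoid(x)
symmetric = np.allclose(sigmoid(-2.0), 1 - sigmoid(2.0))
print(symmetric)
```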

### Initialize `n_records`, `n_features` and `last_loss`

We store the dataset's dimensions in `n_records` and `n_features`, and initialize `last_loss` to `0` so the training loop can compare each printed loss with the previous one.

In [91]:
n_records, n_features = features.shape
last_loss = 0

### Initialize weights

The code initializes our model's `weights` using random numbers from a normal distribution, where we scale them by `1 / sqrt(n_features)`. This scaling helps prevent our initial predictions from being too extreme and allows for better learning during training.

In [92]:
# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)
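
To see why the `1 / sqrt(n_features)` scale keeps initial predictions moderate, the sketch below draws many weight vectors and looks at the spread of the weighted sum. It assumes a hypothetical input `x` whose six entries all have magnitude `1` (roughly what standardized features look like) and uses NumPy's `default_rng` instead of the global seed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 6

# 10,000 independent weight vectors drawn with the same scale as above
weights = rng.normal(scale=1 / n_features**0.5, size=(10_000, n_features))

# Hypothetical input with unit-magnitude entries (assumption for illustration)
x = np.ones(n_features)

# Weighted sums that would be fed into the sigmoid
pre_activations = weights @ x
spread = pre_activations.std()
print(round(spread, 3))
```

The spread stays close to `1` regardless of `n_features`, so the sigmoid starts in its sensitive middle region rather than saturated near `0` or `1`.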

### Hyperparameters

These two hyperparameters control our model's training process: `epochs` sets how many times we'll iterate through our data (in this case `1000` cycles), while `learnrate` of `0.5` determines how big our adjustments are when improving the model's predictions.

In [93]:
# Hyperparameters
epochs = 1000
learnrate = 0.5

### Train model

This code runs the core training loop of our logistic regression model. Each epoch accumulates the gradient contribution of every training record, then applies a single averaged weight update scaled by the learning rate. Every 100 epochs it prints the mean squared error on the training set and warns if the loss has gone up since the previous check.
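
The `error * output * (1 - output)` term used in this loop can be read as a gradient-descent step on a squared-error loss. The derivation below is a sketch under that assumption (the notebook reports mean squared error as its loss). The symbols $\hat{y}$, $h$, $\eta$ and $m$ are notation introduced here for the prediction, the weighted sum, `learnrate` and `n_records`.

$$
\begin{aligned}
h &= \sum_i w_i x_i, \qquad \hat{y} = \sigma(h) = \frac{1}{1 + e^{-h}}, \qquad \sigma'(h) = \sigma(h)\,\bigl(1 - \sigma(h)\bigr) \\
E &= \tfrac{1}{2}\,(y - \hat{y})^2 \\
\frac{\partial E}{\partial w_i} &= -(y - \hat{y})\,\sigma'(h)\,x_i = -(y - \hat{y})\,\hat{y}\,(1 - \hat{y})\,x_i \\
\Delta w_i &= \frac{\eta}{m} \sum_{\text{records}} (y - \hat{y})\,\hat{y}\,(1 - \hat{y})\,x_i
\end{aligned}
$$

Here $(y - \hat{y})\,\hat{y}\,(1 - \hat{y})$ corresponds to `error_term` and the sum over records to `del_w`. The factor $\tfrac{1}{2}$ is a convenience; the printed loss omits it, which only rescales the effective learning rate by a constant.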

In [94]:
# Main training loop running 1000 times
for e in range(epochs):
    # Initialize array to store weight updates
    del_w = np.zeros(weights.shape)
    # Loop through each student's data
    for x, y in zip(features.values, targets):
        
        # Convert features to precise float numbers
        x = np.array(x, dtype=np.float64)
        # Convert target (admission result) to precise float
        y = np.float64(y)
        
        # Calculate probability of admission
        output = sigmoid(np.dot(weights, x))
        
        # Compute difference between actual and predicted
        error = y - output
        
        # Calculate gradient for updating weights
        error_term = error * output * (1 - output)
        
        # Add this sample's weight updates to total changes
        del_w += error_term * x
        
    # Apply averaged weight updates
    weights += (learnrate * del_w) / n_records
    
    # Every 100 epochs, monitor training progress
    if e % (epochs // 10) == 0:
        out = sigmoid(np.dot(features, weights))
        # Calculate current prediction error
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        # Store current loss for next comparison
        last_loss = loss

Train loss:  0.2545249645145559
Train loss:  0.20947756008222448
Train loss:  0.20139454062371134
Train loss:  0.19893736276429308
Train loss:  0.1979702271254469
Train loss:  0.19752222530389685
Train loss:  0.19729138322092427
Train loss:  0.1971634254018444
Train loss:  0.1970887003886595
Train loss:  0.19704335892544572
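
The per-record loop above can also be written with array operations. The sketch below is not the notebook's code; it uses small synthetic arrays (`X`, `y`, `w` are hypothetical) to show that the accumulated `del_w` from the loop equals a single vectorized expression.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                    # hypothetical features: 8 records, 3 columns
y = rng.integers(0, 2, size=8).astype(float)   # hypothetical 0/1 targets
w = rng.normal(scale=1 / 3**0.5, size=3)       # hypothetical weights

# Per-record accumulation, mirroring the training loop
del_w_loop = np.zeros_like(w)
for x_i, y_i in zip(X, y):
    out_i = sigmoid(np.dot(w, x_i))
    del_w_loop += (y_i - out_i) * out_i * (1 - out_i) * x_i

# Same quantity computed for all records at once
out = sigmoid(X @ w)
del_w_vec = ((y - out) * out * (1 - out)) @ X

same = np.allclose(del_w_loop, del_w_vec)
print(same)
```

On a dataset of a few hundred rows the difference in speed is small, but the vectorized form avoids the Python-level loop over records.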


### Calculate accuracy on test data

Our trained model makes predictions on new test data by calculating admission probabilities through the `sigmoid` function, converting them to yes/no decisions at a `0.5` threshold, and comparing these against actual results to measure accuracy as a percentage of correct predictions.

In [95]:
# Calculate accuracy on test data
test_out = sigmoid(np.dot(features_test, weights))
# Convert probabilities to binary predictions at 0.5 threshold
predictions = test_out > 0.5
# Compute accuracy by comparing to actual admission results
accuracy = np.mean(predictions == targets_test)
# Display accuracy score to 3 decimal places
print(f"Prediction accuracy: {accuracy:.3f}")

Prediction accuracy: 0.725
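
The thresholding and accuracy steps can be traced by hand on a tiny example. The probabilities and labels below are made up for illustration and are not taken from the test set.

```python
import numpy as np

# Hypothetical predicted probabilities and true admission results
probs = np.array([0.9, 0.4, 0.65, 0.2])
actual = np.array([1, 0, 0, 0])

# Probabilities above 0.5 count as "admit"
preds = probs > 0.5

# Fraction of predictions that match the true labels
accuracy = np.mean(preds == actual)
print(f"Prediction accuracy: {accuracy:.3f}")
```

Three of the four predictions match (the third student is predicted as admitted but was not), so the accuracy is `0.750`.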
