# Medical Price Prediction: Linear Regression from Scratch

This notebook implements linear regression from scratch, trained with gradient descent, to predict medical charges. We use only **NumPy** and **pandas** for the data handling and the model, and **Matplotlib** for visualization.

## 1. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)

## 2. Data Preprocessing

Preprocessing strongly affects how well gradient-descent-based linear regression trains. We perform the following steps:
1. **Categorical Encoding**: We map 'sex' and 'smoker' to binary values (0 and 1). We one-hot encode the 'region' column and drop its first category, since that column would otherwise be collinear with the intercept.
2. **Feature Scaling**: Gradient descent converges faster when features are on comparable scales. We standardize every feature to have $\mu=0$ and $\sigma=1$, with the statistics computed over the full dataset.
3. **Train-Test Split**: We shuffle the rows and split them into 80% training and 20% testing sets.
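
To make the encoding and scaling steps concrete, here is a minimal sketch on a four-row frame. The rows in `toy` are invented for illustration and are not taken from the dataset; only the column names mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample; the values are invented for illustration.
toy = pd.DataFrame({
    "age": [19, 33, 45, 62],
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Binary columns become 0/1
toy["sex"] = toy["sex"].map({"female": 0, "male": 1})
toy["smoker"] = toy["smoker"].map({"no": 0, "yes": 1})

# drop_first=True removes the first category in sorted order ("northeast"),
# leaving three indicator columns for the four regions
toy = pd.get_dummies(toy, columns=["region"], drop_first=True, dtype=float)
toy = toy.astype(float)

# Column-wise standardization: subtract the mean, divide by the standard deviation
values = toy.values
scaled = (values - values.mean(axis=0)) / values.std(axis=0)

print(list(toy.columns))
```

After this, each column of `scaled` has mean 0 and standard deviation 1, and a row with all three region indicators at 0 belongs to the dropped category.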

In [None]:
# Load data
df = pd.read_csv('Medical Price Dataset.csv')

# Preprocessing categorical variables
df['sex'] = df['sex'].map({'female': 0, 'male': 1})
df['smoker'] = df['smoker'].map({'no': 0, 'yes': 1})
df = pd.get_dummies(df, columns=['region'], drop_first=True)
df = df.astype(float)  # Convert all to float

# Split into X and y
X = df.drop('charges', axis=1).values
y = df['charges'].values.reshape(-1, 1)

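def standardize(X):
    """Scale each column to zero mean and unit standard deviation."""
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    std[std == 0] = 1  # Avoid division by zero for constant columns
    return (X - mean) / std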

X_scaled = standardize(X)

def train_test_split_custom(X, y, test_size=0.2):
    """Shuffle the rows and split them into train and test sets."""
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)
    split_idx = int(len(X) * (1 - test_size))
    train_idx, test_idx = indices[:split_idx], indices[split_idx:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

x_train, x_test, y_train, y_test = train_test_split_custom(X_scaled, y)
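
The cell above standardizes with statistics from the full dataset before splitting, so the test rows influence the scaling. A common variant fits the mean and standard deviation on the training rows only and reuses them for the test rows. The sketch below shows that variant on synthetic data; `fit_scaler` and `apply_scaler` are hypothetical helper names, not part of this notebook.

```python
import numpy as np

def fit_scaler(X_train):
    """Compute per-column mean and std from the training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # Avoid division by zero for constant columns
    return mean, std

def apply_scaler(X, mean, std):
    """Standardize X with previously fitted statistics."""
    return (X - mean) / std

# Synthetic stand-in for the feature matrix (assumption: 100 rows, 3 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_tr, X_te = X_demo[:80], X_demo[80:]

mean, std = fit_scaler(X_tr)
X_tr_s = apply_scaler(X_tr, mean, std)
X_te_s = apply_scaler(X_te, mean, std)  # Test rows reuse the training statistics
```

With this variant the training columns have exactly zero mean, while the test columns are only approximately centered, which is expected.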

## 3. Linear Regression Implementation

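The `linear_regression` function below uses batch gradient descent to minimize the squared-error cost. It returns the learned weights ($\theta$, with the intercept as the first entry) and the cost recorded at each epoch.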
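
For reference, the quantities computed in each epoch can be written as follows, where $X_b$ is the feature matrix with a leading column of ones, $m$ is the number of training rows, and $\alpha$ is the learning rate (the symbol $\alpha$ is introduced here for `learning_rate`):

$$
J(\theta) = \frac{1}{2m} \lVert X_b \theta - y \rVert^2,
\qquad
\nabla_\theta J(\theta) = \frac{1}{m} X_b^\top (X_b \theta - y),
\qquad
\theta \leftarrow \theta - \alpha \, \nabla_\theta J(\theta)
$$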

In [None]:
def linear_regression(x_train, y_train, learning_rate=0.1, epochs=1000):
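    """
    Trains a linear regression model using batch gradient descent.

    Returns theta with shape (n + 1, 1), intercept first, and the list
    of cost values recorded at each epoch.
    """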
    m, n = x_train.shape
    # Add intercept term (bias)
    X_b = np.c_[np.ones((m, 1)), x_train]
    theta = np.zeros((n + 1, 1))
    cost_history = []

    for _ in range(epochs):
        predictions = X_b.dot(theta)
        errors = predictions - y_train

        # Cost at the current theta, recorded before the update
        cost = (1/(2*m)) * np.sum(np.square(errors))
        cost_history.append(cost)

        # Gradient of the cost, followed by one descent step
        gradients = (1/m) * X_b.T.dot(errors)
        theta = theta - learning_rate * gradients

    return theta, cost_history

# Train the model
theta, cost_history = linear_regression(x_train, y_train)
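
One way to sanity-check a gradient descent implementation is to compare its weights with the closed-form least-squares solution, which minimizes the same cost. The sketch below does this on synthetic data with the same update rule, learning rate, and epoch count as above; `X_demo`, `y_demo`, and `true_theta` are made up for the check and are unrelated to the medical dataset.

```python
import numpy as np

# Synthetic regression problem (assumption: 200 rows, 3 standardized features)
rng = np.random.default_rng(0)
m = 200
X_demo = rng.normal(size=(m, 3))
true_theta = np.array([[2.0], [1.5], [-0.7], [0.3]])  # Intercept first
X_b = np.c_[np.ones((m, 1)), X_demo]
y_demo = X_b @ true_theta + 0.01 * rng.normal(size=(m, 1))

# Batch gradient descent with the same update rule as linear_regression
theta_gd = np.zeros((4, 1))
for _ in range(1000):
    gradients = (X_b.T @ (X_b @ theta_gd - y_demo)) / m
    theta_gd = theta_gd - 0.1 * gradients

# Closed-form least-squares solution for comparison
theta_ls, *_ = np.linalg.lstsq(X_b, y_demo, rcond=None)

print("Largest weight difference:", np.max(np.abs(theta_gd - theta_ls)))
```

If the two sets of weights differ noticeably on well-scaled data like this, the usual suspects are a learning rate that is too small, too few epochs, or a sign error in the gradient.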

## 4. Results and Visualization

We compute the $R^2$ score on the test set, then plot the cost over the training iterations and the predicted charges against the actual charges.

In [None]:
def predict(X, theta):
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    return X_b.dot(theta)

y_pred = predict(x_test, theta)

def get_r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred)**2)
    ss_tot = np.sum((y_true - np.mean(y_true))**2)
    return 1 - (ss_res / ss_tot)

print(f"Final R2 Score: {get_r2_score(y_test, y_pred):.4f}")

# Cost convergence plot
plt.figure(figsize=(8, 5))
plt.plot(cost_history, color='green')
plt.title('Cost Function Convergence')
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.show()

# Actual vs Predicted scatter plot; the dashed red line marks perfect predictions (y = x)
plt.figure(figsize=(8, 5))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.title('Actual vs Predicted Charges')
plt.xlabel('Actual Charges')
plt.ylabel('Predicted Charges')
plt.show()
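
To help interpret the printed score, the sketch below evaluates the same $R^2$ formula on three reference cases: perfect predictions, always predicting the mean, and predictions worse than the mean. The function is renamed `r2_score_demo` and the four target values are invented, so the block runs on its own.

```python
import numpy as np

def r2_score_demo(y_true, y_pred):
    """Same formula as get_r2_score: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Invented targets for illustration
y_true_demo = np.array([[1.0], [2.0], [3.0], [4.0]])

perfect = r2_score_demo(y_true_demo, y_true_demo)
baseline = r2_score_demo(y_true_demo, np.full_like(y_true_demo, y_true_demo.mean()))
worse = r2_score_demo(y_true_demo, y_true_demo[::-1])

print(perfect, baseline, worse)  # 1.0 0.0 -3.0
```

A score of 1 means the predictions match exactly, 0 means the model does no better than predicting the mean of the targets, and negative values mean it does worse than that baseline.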