# Final Project: Regression Analysis
**Name:** Huzaifa Nadeem  
**Date:** 2025-04-21

**Objective:** In this project, we apply regression techniques to predict medical insurance charges using features like age, BMI, smoking status, and more. We'll explore the data, build multiple regression models, and compare their performance.


In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

## Section 1: Import and Inspect the Data

In [None]:
df = pd.read_csv("data/insurance.csv")
df.info()
df.head()

**Reflection 1:** The dataset is clean and complete, with 1338 rows and no missing values. The `sex`, `smoker`, and `region` features are categorical and need to be encoded before modeling.
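The completeness and dtype checks behind this reflection can be made explicit. The sketch below is only an illustration: it uses a small made-up frame (`sample`) with the same column names as the dataset, not the real `insurance.csv`, and the values are invented.

```python
import pandas as pd

# Made-up rows that only mimic the column layout of insurance.csv
sample = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "sex": ["female", "male", "male", "female"],
    "bmi": [24.5, 30.1, 27.8, 33.2],
    "children": [0, 2, 1, 3],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
    "charges": [15000.0, 7000.0, 4500.0, 42000.0],
})

# Count missing values per column; all zeros means nothing needs imputing
missing_per_column = sample.isna().sum()

# Non-numeric columns are the ones that need encoding before modeling
categorical_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(categorical_columns)
```

Running the same two calls on `df` would confirm the claim for the actual data.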

## Section 2: Data Exploration and Preparation

In [None]:
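# Pairwise plots of the numeric columns only; the categorical columns
# are still strings at this point, so pairplot leaves them out
sns.pairplot(df)
plt.show()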

# Encode categorical data
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
df['smoker'] = df['smoker'].map({'yes': 1, 'no': 0})
df['region'] = df['region'].astype('category').cat.codes

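**Reflection 2:** Smokers have significantly higher charges than non-smokers. Region is encoded as integer category codes, which imposes an arbitrary order on a feature that has no natural ordering. No major outliers.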
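Because `region` is a nominal feature, integer codes give a linear model an ordering that has no real meaning. One-hot encoding is a common alternative. The sketch below compares the two on a hypothetical toy column (`toy`); it is not the encoding used in this notebook, and switching to it would also require updating the feature list in Section 3.

```python
import pandas as pd

# Hypothetical toy column with the four region labels
toy = pd.DataFrame(
    {"region": ["southwest", "southeast", "northwest", "northeast", "southeast"]}
)

# Integer codes follow the alphabetical order of the labels
codes = toy["region"].astype("category").cat.codes

# One-hot encoding: one indicator column per region.
# drop_first=True drops one level to avoid a redundant column in a linear model.
one_hot = pd.get_dummies(toy["region"], prefix="region", drop_first=True, dtype=int)

print(codes.tolist())
print(one_hot)
```

With `drop_first=True`, the dropped level (`northeast` here) becomes the baseline that the other region coefficients are measured against.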

## Section 3: Feature Selection and Justification

In [None]:
X = df[['age', 'bmi', 'children', 'smoker', 'region']]
y = df['charges']

**Reflection 3:** Age, BMI, number of children, smoker status, and region all likely affect insurance charges. Note that `sex` was encoded in Section 2 but is not included in the feature set.

## Section 4: Train a Model (Linear Regression)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

y_pred = lr_model.predict(X_test)
print("R²:", r2_score(y_test, y_pred))
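# The squared=False argument was deprecated in scikit-learn 1.4 and removed in later releases,
# so take the square root of the MSE instead
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))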
print("MAE:", mean_absolute_error(y_test, y_pred))

**Reflection 4:** The model performs well with a strong R² and low error values. No major surprises.
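To make the three metrics concrete, the sketch below computes them by hand with NumPy on a tiny made-up example. The arrays `y_true` and `y_hat` are illustrative only and are not results from this model.

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_hat

# MAE: average absolute error, in the same units as the target
mae = np.mean(np.abs(residuals))

# RMSE: square root of the mean squared error; penalizes large misses more
rmse = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus the ratio of residual variation to total variation around the mean
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mae, rmse, r2)
```

Since RMSE and MAE are in the units of `charges`, whether they count as "low" depends on the typical size of the charges.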

## Section 5: Improve the Model or Try Alternates (Pipelines)

In [None]:
# Pipeline 1: Imputer + Scaler + Linear Regression
pipe1 = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

pipe1.fit(X_train, y_train)
pred1 = pipe1.predict(X_test)

print("Pipe 1 R²:", r2_score(y_test, pred1))

In [None]:
# Pipeline 2: Imputer + Poly (degree 3) + Scaler + Linear Regression
pipe2 = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('poly', PolynomialFeatures(degree=3)),
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

pipe2.fit(X_train, y_train)
pred2 = pipe2.predict(X_test)

print("Pipe 2 R²:", r2_score(y_test, pred2))

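**Reflection 5:** Polynomial regression increased performance slightly. Scaling does not change the predictions of an unregularized linear regression, so Pipeline 1 should match the Section 4 baseline; its main benefit here is keeping the degree-3 polynomial terms on comparable scales. The imputer is a no-op because the data has no missing values.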
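One way to check what scaling contributes is to fit the same linear regression with and without a `StandardScaler` and compare predictions. The sketch below does this on synthetic data (`X_demo`, `y_demo` are made up, not the insurance dataset). For ordinary least squares, the two sets of predictions should agree up to floating-point error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
y_demo = X_demo @ np.array([2.0, -0.5, 0.03]) + rng.normal(size=200)

# Same model, once on raw features and once behind a scaler
plain = LinearRegression().fit(X_demo, y_demo)
scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
]).fit(X_demo, y_demo)

# Largest difference between the two sets of predictions
max_gap = np.max(np.abs(plain.predict(X_demo) - scaled.predict(X_demo)))
print(max_gap)
```

Scaling starts to affect the fit once a penalty term is added, as in Ridge or Lasso, because the penalty depends on the size of the coefficients.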

## Section 6: Final Thoughts & Insights

**6.1 Summary:** Linear and polynomial regression both performed well. Smoking and age were strong predictors.

**6.2 Challenges:** The charges are skewed toward high values, driven by high charges among smokers, which made it harder for the models to generalize.

**6.3 Next Steps:** Try regularized models (Ridge, Lasso) or tune the polynomial degree to better control overfitting.
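As a rough sketch of that next step, the polynomial degree and the Ridge penalty could be tuned together with a cross-validated grid search. The data below is synthetic, and the grid values are arbitrary assumptions for illustration; they are not tuned for the insurance dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic effect (assumption for illustration)
rng = np.random.default_rng(42)
X_demo = rng.uniform(-2, 2, size=(300, 2))
y_demo = 1.5 * X_demo[:, 0] ** 2 - X_demo[:, 1] + rng.normal(scale=0.3, size=300)

ridge_pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scaler", StandardScaler()),  # scaling matters once a penalty is applied
    ("model", Ridge()),
])

# Step names plus double underscores address parameters inside the pipeline
param_grid = {
    "poly__degree": [1, 2, 3],
    "model__alpha": [0.1, 1.0, 10.0],
}

search = GridSearchCV(ridge_pipe, param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

In the notebook itself, the same search would be fit on `X_train` and `y_train`, with the held-out test set used only for the final comparison. Swapping `Ridge()` for `Lasso()` follows the same pattern.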