In [None]:
!pip install numpy pandas matplotlib scikit-learn

# Linear regression

## Step 1. Read dataset


In [None]:
import pandas as pd

df = pd.read_csv("../datasets/insurance.csv")

df.head()

In this notebook we'll use an insurance charges dataset. As we can see, the dataset contains several features (`age`, `sex`, `bmi`, `children`, `smoker`, and `region`) and the target `charges`. Our main goal is to train an ML model that takes a new person's features and predicts their insurance charges.

In [None]:
df.describe() # Let's take a look at the summary statistics

In [None]:
df.sample(n=5) # Instead of .head we can use .sample() to see random n rows

## Step 2. Simple Preprocessing

Before we build any model, we should quickly check the quality of our data.

In this step, we will:
- Look for **missing (null) values**
- Check if there are any **duplicate rows**

These are common issues in real-world datasets and are usually handled early in the ML workflow.

In [None]:
df.isna().sum()

In [None]:
df[df.duplicated(keep=False)]

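As we can see, there are no null values, but there is one duplicated row. (`keep=False` marks every copy, so the duplicate is shown together with the row it repeats.)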

There are various ways of handling null values. The simplest solution is to drop the affected rows. However, if a large share of rows contain nulls, we may need to impute the missing values using statistical estimates (mean, median, mode, etc.).
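
As a minimal sketch of these two options, the snippet below uses a small made-up table (the values are hypothetical and not taken from `insurance.csv`, which has no nulls). It first drops incomplete rows, then alternatively fills the numeric gap with the median and the categorical gap with the mode.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "bmi": [22.0, np.nan, 30.0, 26.0],
    "smoker": ["no", "yes", None, "no"],
})

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: impute instead of dropping
imputed = toy.copy()
imputed["bmi"] = imputed["bmi"].fillna(imputed["bmi"].median())             # numeric: median
imputed["smoker"] = imputed["smoker"].fillna(imputed["smoker"].mode()[0])  # categorical: mode

print(f"Rows after dropping: {len(dropped)}")
print(imputed)
```

Dropping is safe when only a few rows are affected; imputing keeps more data but adds values that were never observed.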

In [None]:
df = df.drop_duplicates()

## Step 3. Visualization

The next step is to visualize and analyze the columns of our dataset: their distributions, their correlations, and how they compare with one another.
For this we'll use the matplotlib library.

Let's start by analyzing our target column (the dependent variable), in this case the insurance charges. Since this is continuous data, we can use a histogram.

In [None]:
import matplotlib.pyplot as plt

plt.figure()
plt.hist(df["charges"], bins=100)
plt.xlabel("Insurance Charges")
plt.ylabel("Count")
plt.title("Distribution of Insurance Charges")
plt.show()

From the distribution, we can see that there are some data points with very high insurance charges. These values may be outliers. One common way to identify outliers is by visualizing the data using a box plot.

Outliers can cause issues for some machine learning models, so they often need to be handled carefully. However, in this example, we will keep them for simplicity.

There are three common ways to handle outliers:
1. Dropping them
2. Capping (limiting extreme values)
3. Changing the scale of the data (e.g., using a log transformation)
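
The three options above can be sketched on a toy series. The values are hypothetical, and the 1.5 × IQR rule used to flag outliers is an assumption made for this example (it is one common convention, not something this notebook applies to the real data).

```python
import numpy as np
import pandas as pd

# Toy charges with one extreme value (hypothetical numbers)
charges = pd.Series([1000.0, 2000.0, 3000.0, 4000.0, 50000.0])

# Flag outliers with the 1.5 * IQR rule
q1, q3 = charges.quantile(0.25), charges.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (charges < lower) | (charges > upper)

dropped = charges[~is_outlier]                    # 1. drop the flagged rows
capped = charges.clip(lower=lower, upper=upper)   # 2. cap at the IQR bounds
logged = np.log1p(charges)                        # 3. compress the scale with log(1 + x)

print(dropped.tolist())
print(capped.tolist())
print(logged.round(2).tolist())
```

Capping and log scaling keep every row, which matters on small datasets; dropping is simpler but discards information.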

In [None]:
plt.figure()
df.boxplot(column="charges")
plt.show()

### Bivariate analysis
Next, let's visualize how each feature affects insurance charges using various plots.
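
Plots can be complemented with a numeric summary. The sketch below computes Pearson correlations on a small made-up table (hypothetical values, not the real dataset); `numeric_only=True` tells pandas to skip text columns such as `smoker`.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "bmi": [31.0, 24.5, 28.0, 22.0, 27.5],
    "smoker": ["no", "yes", "no", "no", "yes"],
    "charges": [2000.0, 3000.0, 4000.0, 5000.0, 6000.0],
})

# Pairwise Pearson correlation between the numeric columns
corr = toy.corr(numeric_only=True)

# How strongly each numeric column moves together with the target
print(corr["charges"].sort_values(ascending=False))
```

A correlation near 1 or -1 points to a strong linear relationship, while a value near 0 only rules out a linear one, so the plots are still worth checking.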

In [None]:
plt.figure()
plt.scatter(df["age"], df["charges"], alpha=0.5)
plt.xlabel("Age")
plt.ylabel("Charges")
plt.title("Age vs Insurance Charges")
plt.show()

In [None]:
plt.figure()
plt.scatter(df["bmi"], df["charges"], alpha=0.5)
plt.xlabel("BMI")
plt.ylabel("Charges")
plt.title("BMI vs Insurance Charges")
plt.show()

In [None]:
plt.figure()
df.boxplot(column="charges", by="smoker")
plt.xlabel("Smoker")
plt.ylabel("Charges")
plt.title("Charges by Smoking Status")
plt.suptitle("")  # remove auto title
plt.show()

In [None]:
plt.figure()
df.boxplot(column="charges", by="sex")
plt.xlabel("Sex")
plt.ylabel("Charges")
plt.title("Charges by Sex")
plt.suptitle("")
plt.show()


In [None]:
plt.figure()
df.boxplot(column="charges", by="region")
plt.xlabel("Region")
plt.ylabel("Charges")
plt.title("Charges by Region")
plt.suptitle("")
plt.show()

## Step 4. Preprocessing
Based on our analysis, we'll then perform the proper preprocessing steps. This step may include:
1. Dropping unnecessary columns
2. Handling outliers
3. Mapping categorical data to numeric values using various encoding methods
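
To illustrate the encoding step, here is a minimal sketch on a toy frame. The category values are hypothetical. A two-valued column can be mapped directly to 0/1, while a column with several categories is usually one-hot encoded; `drop_first=True` is an optional variant that removes one redundant indicator column.

```python
import pandas as pd

# Toy categorical data (hypothetical values)
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northeast", "southwest"],
})

# Binary column: map the two labels straight to 1 and 0
encoded = toy.copy()
encoded["smoker"] = encoded["smoker"].map({"yes": 1, "no": 0})

# Multi-category column: one indicator column per category
one_hot = pd.get_dummies(encoded, columns=["region"], dtype=int)

# Variant: drop the first indicator, since it can be inferred from the others
one_hot_reduced = pd.get_dummies(encoded, columns=["region"], drop_first=True, dtype=int)

print(one_hot)
print(one_hot_reduced)
```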

In [None]:
df = df.drop(columns=["region"])  # region is not used as a feature in this notebook

In [None]:
df["smoker"] = df["smoker"].map({"yes": 1, "no": 0})

In [None]:
df = pd.get_dummies(
    df,
    columns=["sex"],
)
df

In [None]:
df["sex_female"] = df["sex_female"].astype(int)
df["sex_male"] = df["sex_male"].astype(int)

df.head()

## Step 5. Train test split
Once our data is ready, we move on to model training. But if we train our model on the entire dataset, how do we later test how good it is? How do we check whether it overfitted or underfitted? We need some way to measure our progress, so we'll split our dataset into a train set and a test set.
We can then train the model on the train set and evaluate it on the test set, which doesn't contain any training data points.
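
Two properties of the split are worth checking: the train and test sets share no rows, and a fixed `random_state` makes the split reproducible. A minimal sketch on a ten-row toy frame (hypothetical data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten toy rows (hypothetical data)
toy = pd.DataFrame({"x": range(10), "y": range(10, 20)})

# Two splits with the same seed
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

# No row appears in both the train and the test set
no_overlap = set(train_a.index).isdisjoint(test_a.index)

# The same seed selects the same test rows
reproducible = test_a.index.tolist() == test_b.index.tolist()

print(len(train_a), len(test_a), no_overlap, reproducible)
```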

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop("charges", axis=1)
y = df["charges"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Total length: {len(X)}, Training: {len(X_train)}, Testing: {len(X_test)}")

## Step 6. Model training
Since the target column is a continuous numeric value, this is a regression problem, so we'll train a regression model.
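
To build intuition for what a linear regression model learns, the sketch below fits one on synthetic data generated from a known line, `y = 3x + 5` (the data is made up for this example). With noise-free data, the fitted slope and intercept should recover those values up to floating-point error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data from the line y = 3x + 5
x = np.arange(10, dtype=float).reshape(-1, 1)  # scikit-learn expects a 2D feature array
y = 3.0 * x.ravel() + 5.0

toy_model = LinearRegression().fit(x, y)

slope = float(toy_model.coef_[0])        # learned weight of the single feature
intercept = float(toy_model.intercept_)  # learned bias term

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

With several features, the model learns one coefficient per column in the same way.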

In [None]:
from sklearn.linear_model import LinearRegression


model = LinearRegression()

In [None]:
model.fit(X_train, y_train) # Train a regression model

In [None]:
y_pred = model.predict(X_test) # Make predictions on test set

comparison_df = pd.DataFrame({
    "Actual": y_test.values,
    "Predicted": y_pred,
    "Difference": y_test.values - y_pred
})

In [None]:
comparison_df.sample(10)

In [None]:
import matplotlib.pyplot as plt

plt.figure()
plt.scatter(comparison_df["Actual"], comparison_df["Predicted"], alpha=0.5)
# Reference diagonal: a point on this line is a perfect prediction
plt.plot(
    [comparison_df["Actual"].min(), comparison_df["Actual"].max()],
    [comparison_df["Actual"].min(), comparison_df["Actual"].max()]
)
plt.xlabel("Actual Charges")
plt.ylabel("Predicted Charges")
plt.title("Actual vs Predicted on Test Set")
plt.show()

## Step 7. Model evaluation
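Let's evaluate our model using the R² score (coefficient of determination). It measures the share of the target's variance that the model explains: 1.0 is a perfect fit, and 0.0 is no better than always predicting the mean.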
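
The sketch below computes R² by hand on a few made-up numbers and compares it with scikit-learn's result. It also computes MAE and RMSE, two error metrics that are not used in this notebook but are expressed in the same unit as the target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values (hypothetical, for illustration only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 9.0])

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = r2_score(y_true, y_hat)

mae = mean_absolute_error(y_true, y_hat)           # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_hat))  # root of the average squared error

print(r2_manual, r2_sklearn, mae, rmse)
```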

In [None]:
from sklearn.metrics import r2_score

r2_score(y_true=y_test, y_pred=y_pred)

## Step 8. Inference

Once we're satisfied with our model, we need a way to serve it, that is, to use it to make predictions on new data. This is called inference.

In [None]:
sample_df = pd.DataFrame([{
    "age": 21,
    "bmi": 40.0,
    "children": 0,
    "smoker": 0,
    "sex_female": 0,
    "sex_male": 1,
}])

response = model.predict(sample_df)
print(f"Prediction for given sample is : {response[0]}")

In [None]:
import matplotlib.pyplot as plt
pred_value = response[0]

plt.figure()
plt.scatter(df["bmi"], df["charges"], alpha=0.4, label="Dataset (all rows)")

plt.scatter(
    sample_df["bmi"],
    pred_value,
    color="red",
    s=120,
    label="Our prediction"
)

plt.xlabel("BMI")
plt.ylabel("Insurance Charges")
plt.title("Model Prediction in Context of Data")
plt.legend()
plt.show()


## Step 9. Save our model

Currently our trained model lives only in memory, inside the `model` object. We need to save it somewhere; otherwise we would have to retrain it every time we want to use it, which is impractical. We can use a simple library such as joblib to save the Python object as a `.pkl` file and later recreate the exact model from that file without losing what it learned during training. Only load `.pkl` files from sources you trust, because unpickling can execute arbitrary code.
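
A quick way to confirm that saving and loading preserves the model is a round trip: dump it, load it back, and compare predictions. The sketch below does this with a toy model and a temporary directory (both are assumptions for this example, so nothing is written to the project folders).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model trained on made-up numbers
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.1, 5.9, 8.0])
toy_model = LinearRegression().fit(x, y)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original
same = bool(np.allclose(toy_model.predict(x), restored.predict(x)))
print(f"Predictions identical after reload: {same}")
```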

In [None]:
import os

import joblib

os.makedirs("../models", exist_ok=True)  # make sure the target folder exists
joblib.dump(model, "../models/regression_model.pkl")

In [None]:
# Now let's use that .pkl file to create simple prediction system

from typing import Literal


ai_model = joblib.load("../models/regression_model.pkl")

def predict_insurance(
    age: int,
    bmi: float,
    children: int,
    smoker: bool,
    sex: Literal["male", "female"],
) -> float:
    """Predict insurance charges for one person, rounded to 2 decimals."""
    # Column names and order must match the features the model was trained on
    sample_df = pd.DataFrame([{
        "age": age,
        "bmi": bmi,
        "children": children,
        "smoker": 1 if smoker else 0,
        "sex_female": 1 if sex == "female" else 0,
        "sex_male": 1 if sex == "male" else 0,
    }])
    response = ai_model.predict(sample_df)

    cost = float(response[0])
    return round(cost, 2)

In [None]:
predict_insurance(age=23, bmi=31.1, children=1, smoker=False, sex="male")