# **Linear Regression: Predicting Medical Insurance Costs**

## Scenario
As a data scientist working for a healthcare consultancy firm, you are tasked with building a model to predict medical insurance costs based on attributes such as age, sex, BMI, number of children, and smoking habits. Accurate predictions will assist insurance companies in determining fair premium prices, help individuals understand the factors affecting their costs, and support businesses in designing employee health plans.

## Dataset Overview
The dataset includes the following columns:
- `age`: The age of the individual.
- `sex`: Gender of the individual (`1` for male, `0` for female).
- `bmi`: Body Mass Index, a measure of body fat based on height and weight.
- `children`: Number of dependents covered by the insurance.
- `smoker`: Smoking status of the individual (`1` for smoker, `0` for non-smoker).
- `charges`: The target variable; the medical insurance cost for the individual.

## Why Linear Regression?
Linear regression is suitable for this scenario because:
1. The target variable (`charges`) is continuous, making it appropriate for regression techniques.
2. It helps quantify the relationship between the input variables (age, BMI, etc.) and the insurance cost.
3. The interpretable coefficients provide insights into the influence of each factor on insurance charges.
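
To make the third point concrete, the model fitted in this lesson can be written in the following form. The symbols $\beta_0$ (intercept) and $\beta_1, \dots, \beta_5$ (one coefficient per input variable) are notation introduced here for illustration; they do not appear elsewhere in the lesson.

$$
\widehat{\text{charges}} = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{sex} + \beta_3\,\text{bmi} + \beta_4\,\text{children} + \beta_5\,\text{smoker}
$$

Each coefficient is the change in the predicted charge for a one-unit increase in that variable while the others are held fixed.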

## Identifying Input and Target Variables
In predictive modeling, variables are categorised into **input variables** and **target variables**:
- **Input Variables**: Independent variables or features used for prediction. They represent factors affecting the outcome.
- **Target Variable**: The dependent variable or outcome that we aim to predict.

For this dataset:
- **Input Variables**: `age`, `sex`, `bmi`, `children`, `smoker`
- **Target Variable**: `charges`


## Step 1: Import Libraries

In [8]:
# Importing essential libraries
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical operations
from sklearn.model_selection import train_test_split  # For splitting the dataset
from sklearn.linear_model import LinearRegression  # Linear regression model
from sklearn.metrics import mean_squared_error, r2_score  # Evaluation metrics

## Step 2: Load the Dataset

In [9]:
# Load the dataset
file_path = 'Medical Insurance.csv'
data = pd.read_csv(file_path)

# Display the first few rows
print(data.head())

   age  sex     bmi  children  smoker      charges
0   19    0  27.900         0       1  16884.92400
1   18    1  33.770         1       0   1725.55230
2   28    1  33.000         3       0   4449.46200
3   33    1  22.705         0       0  21984.47061
4   32    1  28.880         0       0   3866.85520
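
Before modelling, it can help to confirm column types and check for missing values. The sketch below is only an illustration: it re-enters the five rows printed above by hand, and the name `sample` is hypothetical. In the notebook, the same calls would be applied to `data`.

```python
import pandas as pd

# The five rows shown by data.head(), re-entered by hand for illustration only
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": [0, 1, 1, 1, 1],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "children": [0, 1, 3, 0, 0],
    "smoker": [1, 0, 0, 0, 0],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

# Column data types: every input should be numeric for LinearRegression
print(sample.dtypes)

# Number of missing values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Summary statistics (count, mean, min, max, quartiles)
print(sample.describe())
```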


## Step 3: Data Preparation

In [12]:
# Selecting input variables and target variable
X = data[['age', 'sex', 'bmi', 'children', 'smoker']] # Input Variables
y = data['charges'] # Target Variable

# Split into training (70%) and testing (30%) sets;
# random_state=42 fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
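
The effect of `test_size` and `random_state` can be checked on a small stand-in dataset. This is a minimal sketch: `X_demo` and `y_demo` are hypothetical data created only for this example, not the insurance data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 rows, one feature and one target
X_demo = pd.DataFrame({"age": np.arange(20, 30)})
y_demo = pd.Series(np.arange(10, dtype=float), name="charges")

# Same call pattern as in the lesson
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Repeating the call with the same random_state yields the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print("Train rows:", len(X_tr), "Test rows:", len(X_te))
print("Same test rows on repeat:", X_te.index.equals(X_te2.index))
```

Changing `random_state` to a different integer would typically select different rows, which in turn shifts the fitted coefficients and metrics slightly.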


## Step 4: Train the Model

In [14]:
# Initialise and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Coefficients and intercept
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

Coefficients: [  261.91061673   136.65119758   333.36099462   432.1792927
 23618.76182167]
Intercept: -12538.439849853119
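
The printed coefficients follow the column order of `X`, so pairing them with the feature names makes them easier to read. The sketch below copies the values from the output above; `coef_table` is a hypothetical name. In the notebook, the same table could be built with `pd.Series(model.coef_, index=X_train.columns)`.

```python
import pandas as pd

# Feature order matches the columns selected in Step 3
feature_names = ["age", "sex", "bmi", "children", "smoker"]

# Coefficient values copied from the printed output above
coefficients = [261.91061673, 136.65119758, 333.36099462, 432.1792927, 23618.76182167]

# Label each coefficient and sort from largest to smallest
coef_table = pd.Series(coefficients, index=feature_names).sort_values(ascending=False)
print(coef_table)
```

Read this way, being a smoker is associated with a predicted charge roughly 23,619 higher than for a non-smoker with otherwise identical attributes, and each additional year of age adds about 262.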


## Step 5: Evaluate the Model

In [16]:
# Make predictions
y_pred = model.predict(X_test)

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 34003912.39316075
R-squared: 0.7680881643600721


### Mean Squared Error (MSE)
- **MSE: 34,003,912.39**  
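  The Mean Squared Error represents the average squared difference between the actual and predicted values. A lower MSE indicates better model performance. Because the errors are squared, the MSE is expressed in squared units of `charges`, so its magnitude is hard to judge directly. Its square root, the root mean squared error (RMSE), is about 5,831 in the same units as `charges`, so predictions typically miss by several thousand, which leaves room for improvement.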
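
An error expressed in the same units as `charges` is often easier to interpret than the MSE. As a small sketch, the square root of the MSE reported above can be computed as follows; the MSE value is copied from the Step 5 output.

```python
import numpy as np

# MSE value copied from the Step 5 output
mse = 34003912.39316075

# Root mean squared error: same units as the target variable
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.2f}")
```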

### R-squared (R²)
- **R²: 0.7681**  
  R-squared measures the proportion of variance in the target variable (`charges`) that is explained by the input variables.  
  - An R² value of **0.7681** indicates that approximately **76.81%** of the variability in medical insurance costs in the test set is explained by the model.  
  - This is a reasonably good fit, though about 23.19% of the variability remains unexplained. This could be due to factors not included in the dataset, such as the area where the customer lives.
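
To see where an R² value comes from, the sketch below computes it by hand as one minus the ratio of unexplained to total variation, and compares the result with `r2_score`. The arrays `y_true` and `y_hat` are hypothetical toy values chosen to keep the arithmetic simple; they are not taken from the insurance data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical toy values, not from the insurance dataset
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])

# Residual sum of squares: variation the predictions fail to explain
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares: variation of the actual values around their mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print("Manual R-squared: ", r2_manual)
print("sklearn R-squared:", r2_score(y_true, y_hat))
```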


## Step 6: Make Predictions with New Data

In [18]:
new_data = pd.DataFrame({
    'age': [34, 27],
    'sex': [1, 0],
    'bmi': [25.4, 28.2],
    'children': [1, 3],
    'smoker': [0, 1],
})
new_data

   age  sex   bmi  children  smoker
0   34    1  25.4         1       0
1   27    0  28.2         3       1


- **Individual 1 (row index 0):**  
  - **Age:** 34  
  - **Sex:** Male (1)  
  - **BMI:** 25.4  
  - **Children:** 1 dependent  
  - **Smoker:** No (0)


- **Individual 2 (row index 1):**  
  - **Age:** 27  
  - **Sex:** Female (0)  
  - **BMI:** 28.2  
  - **Children:** 3 dependents  
  - **Smoker:** Yes (1)  

In [51]:
# Make predictions
predicted_cost = model.predict(new_data)
print("Predicted Costs:", predicted_cost)

Predicted Costs: [ 5402.72087254 28849.22654988]
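
These predictions can be reproduced by hand, since a linear regression prediction is the intercept plus the sum of each coefficient multiplied by its feature value. The sketch below copies the coefficients and intercept from the Step 4 output; the names `coef`, `new_rows`, and `manual_pred` are introduced only for this example.

```python
import numpy as np

# Coefficients (order: age, sex, bmi, children, smoker) and intercept from Step 4
coef = np.array([261.91061673, 136.65119758, 333.36099462, 432.1792927, 23618.76182167])
intercept = -12538.439849853119

# The two new individuals, one row each, in the same column order
new_rows = np.array([
    [34, 1, 25.4, 1, 0],
    [27, 0, 28.2, 3, 1],
])

# Prediction = intercept + sum(coefficient * feature value) for each row
manual_pred = new_rows @ coef + intercept
print("Manual predictions:", manual_pred)
```

The results agree with the output of `model.predict` above up to rounding of the printed coefficients. Most of the gap between the two individuals comes from the smoker coefficient.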


## What's Next?

Up next is **Logistic Regression**, where we will explore how to predict categorical outcomes and understand the relationship between variables when the target is binary. Stay tuned for the next lesson to expand your predictive modeling skills!
