# **Linear Regression: Insurance Charges Prediction**

---

## **Assignment Overview**

The primary objective of this project is to build a **Linear Regression model** to predict insurance costs (`charges`) based on various personal attributes of a beneficiary. This is a classic supervised machine learning problem where we will use a set of features (age, sex, BMI, etc.) to predict a continuous target variable (`charges`).

The process will involve:
1.  **Exploratory Data Analysis (EDA):** Understanding the distribution of our data and identifying relationships between variables.
2.  **Data Preprocessing:** Handling categorical variables and preparing the data for the model.
3.  **Model Training:** Splitting the data and training a Linear Regression model.
4.  **Evaluation:** Assessing the model's performance to see how well it predicts insurance charges.

---

## **Column Specifications**

Here is a detailed breakdown of the columns in the `insurance.csv` dataset:

| Column Name | Description | Data Type |
| :--- | :--- | :--- |
| **age** | The age of the primary beneficiary. | Numerical |
| **sex** | The gender of the beneficiary (`male`, `female`). | Categorical |
| **bmi** | Body Mass Index, providing a measure of body fat. A higher BMI often indicates being overweight or obese. | Numerical |
| **children** | The number of children covered by the health insurance. | Numerical |
| **smoker** | Indicates whether the beneficiary is a smoker (`yes`, `no`). This is expected to be a strong predictor of costs. | Categorical |
| **region** | The beneficiary's residential area in the US, divided into four geographical regions (`southwest`, `southeast`, `northwest`, `northeast`). | Categorical |
| **charges** | The cost billed to the health insurance. This is our **target variable** to predict. | Numerical |

---

# **Initial Setup**

In [None]:
#
# Import Libraries
#
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, root_mean_squared_error, r2_score

from google.colab import files

In [None]:
#
# Upload the data set 'insurance.csv'
#
uploadOutput = files.upload()

# **Load Data**

In [None]:
df = pd.read_csv('insurance.csv')
df.head()

In [None]:
print(f"Shape = {df.shape}")
print(f"Size  = {df.size}")

In [None]:
df.info()

**No Missing Data**

# **Statistical Details**

In [None]:
df.describe()

In [None]:
df.describe(include = 'all')

# **Pre-Processing**

## **Check Missing Values**

In [None]:
#
# Get only the columns that have missing values
#
df.columns[df.isnull().sum() > 0]

In [None]:
#
# Get the missing-value count for each column that has missing values
#
df.isnull().sum()[df.isnull().sum() > 0].sort_values(ascending = False)

In [None]:
#
# Get the rows that have missing values
#
missing_value_rows = df[df.isnull().any(axis = 1)]
print(missing_value_rows)

## **Check and Fix Duplicates**

In [None]:
#
# Get Duplicate Count
#
df.duplicated().sum()

In [None]:
#
# Get Duplicate Row
#
duplicatedRows = df[df.duplicated()]
print(duplicatedRows)

In [None]:
#
# Copy the Data Frame before Deleting Duplicates
#
df_copy = df.copy()

In [None]:
#
# Drop Duplicates
#
df = df.drop_duplicates()

In [None]:
duplicatedRows = df[df.duplicated()]
print(duplicatedRows)

In [None]:
#
# Find columns that have negative values
#
numericalColumns = df.select_dtypes(include = np.number).columns

for col in numericalColumns:
  if df[col].min() < 0:
    print(f"Column '{col}' has negative values. Min value: {df[col].min()}")

print("Negative Values Checks Done")

# **Data Visualisation - Part 1**

## **Typical Workflow**
<font size="+1">**The Best Time to Encode Your Data**</font>


---


The most effective approach in data analysis is to **visualize your data first, and then perform any necessary encoding**.

**Here is why this workflow is so effective:**

* **Human-Readable Insights**  
Visualizations are designed to help you, the analyst, understand the data. When you plot your data with categorical labels like `male`, `female`, `yes`, `no`, or `northeast`, the charts are intuitive. You can instantly see patterns, distributions, and potential relationships between different categories and your target variable. If you encode the data first, you would be looking at a chart of 0s and 1s or other numerical representations, which makes it much more difficult to interpret quickly and naturally.

* **Crucial for Exploratory Data Analysis (EDA)**  
The main goal of EDA is to explore and uncover relationships in your data. By visualizing the raw, un-encoded data, you can spot key insights and outliers that might be obscured by numerical encoding. For example, a bar chart of `region` vs. `charges` would immediately reveal if people in one region have significantly higher charges. This is a vital piece of information for feature engineering and for making informed decisions about your model, and it's something you might miss if you encode first.

This is the typical, and most effective, data science workflow:

1) **Exploratory Data Analysis (EDA)**  
Use the raw, human-readable data to visualize distributions and relationships.

2) **Data Preprocessing**  
Based on your insights from EDA, you then decide how to handle missing values, outliers, and finally, how to encode your categorical data.

3) **Model Training**  
Use the preprocessed and encoded data to train your machine learning model.
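
The three steps above can be sketched on a tiny, hand-made frame. This is only an illustration: the frame `toy` and its values are assumptions, not rows from `insurance.csv`. It shows the raw labels being used for a readable summary first, and the encoding being applied to a copy afterwards.

```python
import pandas as pd

# Hypothetical mini data set with the same column names as the notebook's df
toy = pd.DataFrame({
    'smoker'  : ['yes', 'no', 'no', 'yes'],
    'region'  : ['southwest', 'southeast', 'northwest', 'northeast'],
    'charges' : [30000.0, 4000.0, 9000.0, 25000.0],
})

# Step 1 (EDA): summarise using the human-readable labels
print(toy.groupby('smoker')['charges'].mean())

# Step 2 (Preprocessing): encode a copy so the readable frame is kept intact
encoded = toy.copy()
encoded['smoker'] = encoded['smoker'].map({'yes' : 1, 'no' : 0})
encoded = pd.get_dummies(encoded, columns = ['region'], drop_first = True)

# Step 3 (Model Training) would use 'encoded'; 'toy' stays available for exploration
print(encoded.columns.tolist())
```

Keeping both frames mirrors what the notebook does later with `df_before_encode`.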

In [None]:
#
# Set a clean visual style for all plots
#
sns.set_style("whitegrid")

## **Distribution of Charges**

In [None]:
#
# We start by looking at the distribution of our target variable, 'charges'
# A histogram reveals if the data is skewed, bimodal, or normally distributed
#
plt.figure(figsize = (10,6))

sns.histplot(df['charges'], kde = True, bins = 30)

plt.title('Distribution of Insurance Charges', fontsize = 18)
plt.xlabel('Charges', fontsize = 12)
plt.ylabel('Frequency', fontsize = 12)

plt.show()

## **Observation**

**Right-Skewed Distribution**  
The most prominent feature is that the data is heavily right-skewed. This means the vast majority of people have relatively low insurance charges, while a small number of people incur very high costs. The long tail on the right side of the graph represents these outliers with significant expenses.

**Multimodal Nature**  
The distribution appears to be multimodal, meaning it has more than one peak. While the first peak at a low charge amount is the highest, there are two other distinct, smaller peaks at higher values.

The first major peak is in the **0 to 5,000 range.** This likely represents the general population with standard medical needs.

The second peak is around the **10,000 to 20,000 range.**

The third, and most interesting, peak is between **30,000 and 40,000.** This suggests there is a specific subgroup within the data that consistently faces much higher charges.

This multimodal distribution strongly suggests that a single variable is causing a major split in the dataset, creating distinct charge groups. Given the nature of the data, this is very likely the `smoker` variable, where non-smokers form the lower-cost groups and smokers form the high-cost group. We'll check this in the forthcoming charts.
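
To see why a hidden grouping variable can produce this shape, the sketch below mixes two roughly symmetric groups and measures the skewness of the result. All numbers are synthetic assumptions chosen for illustration, and `skewness` is a small helper defined here, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, illustrative numbers only (NOT taken from insurance.csv):
# a large low-cost group and a smaller high-cost group
low_cost  = rng.normal(loc = 8_000,  scale = 3_000, size = 800)
high_cost = rng.normal(loc = 32_000, scale = 6_000, size = 200)
charges   = np.concatenate([low_cost, high_cost])

def skewness(values):
    """Sample skewness: the mean of the cubed z-scores."""
    z = (values - values.mean()) / values.std()
    return float((z ** 3).mean())

print(f"Skewness of low-cost group  : {skewness(low_cost):.2f}")
print(f"Skewness of high-cost group : {skewness(high_cost):.2f}")
print(f"Skewness of the mixture     : {skewness(charges):.2f}")
```

Each group on its own is close to symmetric, while the combined sample is clearly right-skewed with a second peak, which is the pattern the histogram shows.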

## **Charges by Smoker Status**

In [None]:
#
# Create the box plot
# x       = 'smoker'  - Sets the x-axis to the smoker status
# y       = 'charges' - Sets the y-axis to the insurance charges
# hue     = 'smoker'  - Creates a different color for each category
# palette = 'viridis' - Gives a nice color scheme
#
plt.figure(figsize = (10, 6))

sns.boxplot(
             x       = 'smoker',
             y       = 'charges',
             data    = df,
             hue     = 'smoker',
             palette = 'viridis'
           )

plt.title('Insurance Charges by Smoker Status', fontsize = 18)
plt.xlabel('Smoker', fontsize = 12)
plt.ylabel('Charges', fontsize = 12)

plt.show()

## **Observation**
Based on the box plot, here are the key inferences regarding the relationship between insurance charges and smoker status.

The plot confirms the initial hypothesis that **smoking is the single most significant factor in determining insurance charges.** The chart clearly shows two distinct groups with vastly different cost distributions.

* **Smokers**  
The box for smokers is much higher on the y-axis, indicating that the charges for this group are significantly greater. The **median charge for smokers appears to be around 35,000.** The lowest charges for smokers are still higher than the median for non-smokers, and the charges for smokers are also much more widely spread out. The long upper whisker shows that a number of smokers have exceptionally high insurance costs.
* **Non-Smokers**  
The box for non-smokers is narrow and located at the lower end of the y-axis. The median charge for this group is very low, likely around **7,000 to 9,000.** The entire distribution, including **most of the outliers**, is contained below the median for smokers.

In conclusion, the two-part, multimodal distribution we observed in the initial histogram is primarily explained by the smoker variable. This highlights the immense financial impact of a single lifestyle choice on healthcare costs.
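
Readings taken from a box plot by eye can be checked numerically with a grouped summary. The sketch below uses a small hypothetical frame (`toy`, with made-up values) so that it runs on its own.

```python
import pandas as pd

# Tiny hand-made frame with the same column names as the notebook's df;
# the values are illustrative, not rows from insurance.csv
toy = pd.DataFrame({
    'smoker'  : ['yes', 'no', 'no', 'yes', 'no', 'no', 'yes', 'no'],
    'charges' : [34000, 7000, 9000, 21000, 4000, 12000, 42000, 8000],
})

# One row per group: how many records, the median, and the extremes
summary = toy.groupby('smoker')['charges'].agg(['count', 'median', 'min', 'max'])
print(summary)
```

On the notebook's data, the same call on `df` (before encoding) would give the exact medians for smokers and non-smokers.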

## **Charges vs BMI by Smoker Status**

In [None]:
#
# Create the scatter plot
# x       = 'bmi'     - Sets the x-axis to the bmi
# y       = 'charges' - Sets the y-axis to the insurance charges
# hue     = 'smoker'  - Creates a different color for each category
# palette = 'viridis' - Gives a nice color scheme
#
plt.figure(figsize = (10, 6))

sns.scatterplot(
                 x       = 'bmi',
                 y       = 'charges',
                 data    = df,
                 hue     = 'smoker',  # This is the key to our analysis, coloring points by smoker status
                 style   = 'smoker',
                 palette = 'viridis',
                 s       = 100,       # Adjust point size for better visibility
                 alpha   = 0.7        # Add transparency to see overlapping points
               )

plt.title('Insurance Charges vs. BMI, by Smoker Status', fontsize = 18)
plt.xlabel('BMI', fontsize = 12)
plt.ylabel('Charges', fontsize = 12)

plt.show()

## **Observation**
Based on the scatter plot, it appears that the relationship between BMI and insurance charges is heavily influenced by whether or not an individual is a smoker.

Here's an analysis of the graph's key insights:

* **Smokers (Blue Data Points)**  
This group is responsible for the dramatic, higher-end insurance charges. The graph reveals a **strong positive correlation** between BMI and cost for smokers. As their BMI increases, their insurance charges escalate significantly, with many data points reaching over 30,000. This demonstrates that for smokers, both BMI and their smoking habit combine to drive up costs considerably.
* **Non-Smokers (Green Data Points)**  
In stark contrast, this group is clustered at the lower end of the cost spectrum. The green data points show consistently lower insurance charges, rarely exceeding 15,000. While there might be a slight upward trend in costs as BMI increases, the effect is minor compared to the impact of smoking.

In summary, the graph illustrates a clear divide between the two groups. The non-smokers have relatively stable, low insurance costs, while the smokers face significantly higher and more volatile charges, especially as their BMI rises. The simple fact of being a smoker appears to be a major determinant of insurance cost.

## **Charges vs Age by Smoker Status**

In [None]:
#
# Create the scatter plot
# x       = 'age'     - Sets the x-axis to the age
# y       = 'charges' - Sets the y-axis to the insurance charges
# hue     = 'smoker'  - Creates a different color for each category
# palette = 'viridis' - Gives a nice color scheme
#
plt.figure(figsize = (10, 6))

sns.scatterplot(
                 x       = 'age',
                 y       = 'charges',
                 data    = df,
                 hue     = 'smoker',  # This is the key to our analysis, coloring points by smoker status
                 style   = 'smoker',
                 palette = 'viridis',
                 s       = 100,       # Adjust point size for better visibility
                 alpha   = 0.7        # Add transparency to see overlapping points
               )

plt.title('Insurance Charges vs. Age, by Smoker Status', fontsize = 18)
plt.xlabel('Age', fontsize = 12)
plt.ylabel('Charges', fontsize = 12)

plt.show()

## **Observation**
This scatter plot effectively illustrates how both age and smoking status influence insurance charges. The data forms two distinct bands, one for smokers and one for non-smokers, making the impact of smoking immediately clear.

* **Non-smokers (Green Data Points)**
Show a clear and **consistent positive correlation** between age and insurance charges. As a person gets older, their insurance costs tend to increase in a relatively smooth, predictable fashion. The charges for this group generally remain below 15,000, even for older individuals.

* **Smokers (Blue Data Points)**
Follow a similar **positive correlation** with age, but at a significantly higher level. The insurance charges for this group are substantially higher across all ages, with the costs rising steeply with each passing year. For older smokers, charges can reach over 40,000.

In conclusion, the data demonstrates that both age and smoking contribute to higher insurance charges. However, smoking status appears to be the primary factor, as it shifts the entire cost curve upward, resulting in a much larger financial impact than age alone.

## **PairPlot for Numerical Variables**

In [None]:
#
# A pairplot is an excellent tool for a quick, comprehensive overview of relationships
# between all numerical variables. We'll use 'smoker' as a hue to see how this
# categorical variable impacts the relationships.
#
numericalColumns = ['age', 'bmi', 'children', 'charges']

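# sns.pairplot() creates its own figure, so a separate plt.figure() call would
# only produce an empty figure; use the 'height' argument to resize the subplots
sns.pairplot(data = df, vars = numericalColumns, hue = 'smoker', palette = 'viridis')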

plt.suptitle('Relationships Between Numerical Features and Charges', y = 1.02, fontsize = 18)

plt.show()

## **Observation**
Looking at the plots where `charges` are on the y-axis, a clear division emerges between the two groups. The blue data points (Smokers) form a distinct cluster that is consistently positioned in the upper range of charges. In contrast, the green data points (Non-Smokers) occupy the lower half of the plot, showing significantly lower charges across the board.

* The relationship between `charges` and `age` is particularly revealing. For both smokers and non-smokers, charges generally increase with age. However, the charges for smokers are so much higher that their data points exist in a completely different financial bracket from non-smokers of the same age.
* Similarly, while a positive correlation exists between `charges` and `bmi`, it's the smoking status that causes the most dramatic difference. A non-smoker with a high BMI still incurs far lower charges than a smoker with a low or average BMI.

In essence, the pair plot confirms that while Age and BMI are contributing factors, a person's smoking status is the primary driver of their insurance charges. It creates a clear and unmissable financial divide in the dataset.

## **Charges by Gender**

In [None]:
#
# Box plots are perfect for comparing the distribution of a continuous variable
# across different categories.
# This shows us the median, quartiles, and outliers
#

#
# Box plot for charges by 'sex' status
#
plt.figure(figsize = (10, 6))

sns.boxplot(
             x       = 'sex',
             y       = 'charges',
             data    = df,
             hue     = 'sex',
             palette = 'viridis'
           )

plt.title('Insurance Charges by Gender', fontsize = 18)
plt.xlabel('Sex', fontsize = 12)
plt.ylabel('Charges', fontsize = 12)

plt.show()

## **Observation**
Based on the box plot, the distribution of charges for both **males** and **females** appears to be remarkably similar.

* **Median Charges**  
The horizontal line inside each box, which represents the median charge, is at almost the exact same level for both sexes. This indicates that the typical charge for a male is nearly identical to the typical charge for a female.

* **Spread of Data**  
The whiskers, which show the overall spread, are very similar in size and position for both sexes, but the box, which represents the **interquartile range (the middle 50% of data)**, is wider for males. This suggests that while the lowest charges are similar, the middle range of charges shows greater variability for males than for females.

* **Outliers**  
Both box plots show a significant number of outliers, representing individuals with very high charges. The number and distribution of these outliers appear to be consistent between males and females.

My inference is that, unlike smoking status, a person's gender does not seem to have a major influence on their insurance charges. While there may be small differences, the overall pattern and distribution of costs are essentially the same for both males and females in this dataset.

## **Charges by Region**

In [None]:
#
# Box plot for charges by 'region'
#
plt.figure(figsize = (10, 6))

# sns.boxplot(x='region', y='charges', y_order=df['region'].value_counts().index)
sns.boxplot(
             x       = 'region',
             y       = 'charges',
             data    = df,
             hue     = 'region',
             palette = 'viridis'
           )

plt.title('Insurance Charges by Region', fontsize = 18)
plt.xlabel('Region', fontsize = 12)
plt.ylabel('Charges', fontsize = 12)

plt.show()

In [None]:
#
# Box plot for charges by 'region'
#

#
# Sorts the X-Axis in the Descending Order of Median Values of the Region's Charges
#
regionOrder = df.groupby('region')['charges'].median().sort_values(ascending=False).index

plt.figure(figsize=(10, 6))
sns.boxplot(
             x       = 'region',
             y       = 'charges',
             data    = df,
             hue     = 'region',
             palette = 'viridis',
             order   = regionOrder
           )

plt.title('Insurance Charges by Region', fontsize = 18)
plt.xlabel('Region', fontsize = 12)
plt.ylabel('Charges', fontsize = 12)
plt.xticks(rotation = 0) # Ensure region labels are not rotated for better readability

plt.show()

## **Observation**
The box plot, sorted by median charge, provides a clear visual hierarchy of insurance costs across the four regions:

* **The Northeast region has the highest median charge.** Its median line sits higher than all the others. The box for this region is also fairly wide, showing a broad range of charges for the middle 50% of the data.
* **The Southeast region has the second-highest median charge.** While its median is slightly lower than the Northeast, it has a significant number of high-value outliers and an even wider box, suggesting a greater spread and more extreme cases.
* **The Northwest region is in the middle.** Its median is lower than both the Northeast and Southeast but higher than the Southwest. The overall distribution is more condensed than the top two regions.
* **The Southwest region has the lowest median charge.** This region has the most compact box and the lowest median, indicating that charges are generally less expensive and less variable here than in any of the other regions.

The plot makes it clear that while all regions have a range of costs, there are visible, though modest, regional differences in the typical cost of insurance.

# **Label Encoding**

In [None]:
df.columns

In [None]:
df['sex'].value_counts()

In [None]:
df['smoker'].value_counts()

In [None]:
df['region'].value_counts()

### **Encode Categorical Columns**

In [None]:
#
# Take a copy of the dataframe before encoding.
# Once encoding is done, the 0/1 values make the dataframe harder to read, so
#   Use the encoded dataframe for model building, and
#   Use the non-encoded dataframe for our exploration
#
df_before_encode = df.copy()
df_before_encode.head()

In [None]:
#
# Use .replace() to encode 'sex' and 'smoker' columns
#
# We normally use .map() but we have done that quite a few times before
# so we'll try .replace() this time
#

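#
# Encode 'sex' column
# male   = 0
# female = 1
#
# .astype(int) makes the numeric dtype explicit; newer pandas versions may
# otherwise keep the replaced column as object dtype
#
df['sex'] = df['sex'].replace({'male' : 0, 'female' : 1}).astype(int)

#
# Encode 'smoker' column
# yes = 1
# no  = 0
#
df['smoker'] = df['smoker'].replace({'yes' : 1, 'no' : 0}).astype(int)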

#
# One-Hot Encoding
# Use .get_dummies() to encode 'region' columns
#
# df = pd.get_dummies(df, columns = 'region')
#
# The 'drop_first = True' argument drops one of the generated columns
# This is a common practice to avoid multicollinearity
#
df = pd.get_dummies(df, columns = ['region'], drop_first = True)
df.head()

#
# The dummy columns display as True and False because pandas stores them with
# the bool dtype. In arithmetic, True and False behave as 1 and 0, so
# machine learning models can use these columns directly as numerical input.
# Pass dtype = int to get_dummies() if explicit 0/1 columns are preferred
#

In [None]:
'''
#
# ANOTHER METHOD - USING LABELENCODER
#

#
# Initialize LabelEncoder
#
le = LabelEncoder()

#
# Perform label encoding on 'sex', 'smoker', and 'region'
#
df['sex'] = le.fit_transform(df['sex'])
df['smoker'] = le.fit_transform(df['smoker'])
df['region'] = le.fit_transform(df['region'])

print(df.head())
'''

In [None]:
print(df[['sex', 'smoker', 'region_northwest', 'region_southeast', 'region_southwest']].dtypes)

In [None]:
#
# Check for missing values again
#
df.isnull().sum().sort_values(ascending = False)

In [None]:
#
# Check for duplicate rows again
#
df.duplicated().sum()

# **Data Visualisation - Part 2**

## **Correlation Heat Map**

In [None]:
#
# Correlation Heat Map
#

#
# Compute correlation matrix
#
correlationMatrix = df.corr()

plt.figure(figsize = (10, 10))

sns.heatmap(
             correlationMatrix,
             annot      = True,
             fmt        = '.2f',
             linewidths = 0.5,
             cmap       = 'coolwarm',
             cbar_kws   = {'label' : 'Correlation Coefficient'}
           )

plt.title('Correlation Heatmap of Insurance Data')

plt.show()

#
# Print the correlations with the 'charges' column
#
print("")
print("Correlation with 'charges' column")
print("=================================")
print(correlationMatrix['charges'].sort_values(ascending=False))

## **Observation**
The heatmap provides a clear and intuitive view of the relationships between the features and the charges we are trying to predict.

**Here's what stands out**

* **The Strongest Predictor**  
The most significant observation is the very strong positive correlation between `smoker` and `charges`. With a **correlation coefficient of 0.79**, this is by far the most influential feature. This makes perfect sense, as insurance costs for smokers are typically much higher.
* **Other Key Predictors**  
The next strongest correlations are with **age (0.30) and bmi (0.20)**. These are both positive relationships, meaning as a person's age or BMI increases, their insurance charges tend to increase as well. These three variables — `smoker`, `age`, and `bmi` — are the core features that will form the basis of our predictive model.
* **Weak or Negligible Relationships**  
All other variables show a very weak or almost non-existent correlation with charges. The coefficients for `children`, `sex`, and all the region dummy variables are close to zero. We therefore do not expect them to have a significant impact on the final predictions, and the model below is trained on the three core features only.

In summary, the heatmap confirms our intuition that being a smoker, being older, and having a higher BMI are the most critical factors when predicting insurance charges.

# **Train / Test Split**

In [None]:
df.columns

In [None]:
#
# Input / Independent Variable
#
X = df[['smoker', 'age', 'bmi']]
X.head()

In [None]:
#
# Output / Dependent Variable
#
y = df[['charges']]
y.head()

In [None]:
#
# Train Test Split
#   Test Size  = 0.2 (20%), which makes
#   Train Size = 0.8 (80%)
#
#   Random State = 42, a commonly used value; any fixed integer makes the
#   split reproducible
#
# Using the standard naming convention
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

print(f"Shape of X_train : {X_train.shape}")
print(f"Shape of X_test  : {X_test.shape} ")
print(f"Shape of y_train : {y_train.shape}")
print(f"Shape of y_test  : {y_test.shape} ")

In [None]:
X_train.head()

In [None]:
y_train.head()

In [None]:
X_test.head()

In [None]:
y_test.head()

# **Standardisation**

We often overwrite `X_train` and `X_test` when we do the standardisation, but for this assignment we'll stick with the more robust approach of creating new variables to keep our process clear and easy to follow.

Creating a new variable like `X_train_scaled` is generally considered a best practice for a couple of reasons:

* **Readability**  
It makes your code more self-documenting. Anyone reading your script, including your future self, can see at a glance that `X_train_scaled` contains the data after a scaling operation has been applied. In contrast, simply overwriting `X_train` forces the reader to track the code line-by-line to know what state the data is in.

* **Safety and Debugging**  
By keeping the original `X_train`, you have a backup of the unscaled data. If you need to go back and check the original feature distributions, or if a later part of your code needs the unscaled data for some reason, you haven't permanently lost it.



---


<font size="+1">**We do NOT scale the target variable (y) in linear regression, as we want to predict the original charges values**</font>

The reason we don't scale the dependent variable is because of the fundamental goal of our model: we want to predict the output in its original, real-world units.

**The Purpose of Scaling**  
When we scale the independent variables (X), our primary goal is to put them on a common scale. This matters most for models trained by gradient descent or with regularization, where a feature with a large numeric range can dominate the cost function. For ordinary least squares, which is what `LinearRegression` fits, scaling does not change the predictions, but it does make the coefficients directly comparable across features.


---


**Why the Dependent Variable is Different**  
The dependent variable (`y`), however, represents the actual value we are trying to predict. If we scaled `y`, our model would output a scaled value, which would be completely uninterpretable. For example, a **prediction of 0.5** would have no direct meaning. To make it meaningful, we would have to "un-scale" the prediction back to its original units.

Keeping the dependent variable in its original units simplifies the entire process. The model's predictions are directly comparable to the real-world values, and the coefficients are expressed in units of `charges`. Because the features are standardised, the coefficient for a feature like `age` tells us how much the predicted `charges` change for a one-standard-deviation increase in `age`, assuming all other factors are constant.
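
As a rough check of the scaling discussion above, the sketch below fits ordinary least squares on synthetic data with and without standardised features. The data, the helper `fit_ols`, and every number in it are assumptions made for illustration, and it uses NumPy's `lstsq` rather than the notebook's scikit-learn objects.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for two features on different scales (assumed values)
age = rng.uniform(18, 64, size = 200)
bmi = rng.uniform(16, 50, size = 200)
X   = np.column_stack([age, bmi])
y   = 250.0 * age + 300.0 * bmi + rng.normal(0, 1_000, size = 200)

def fit_ols(features, target):
    """Ordinary least squares with an intercept; returns (intercept, coefficients)."""
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond = None)
    return beta[0], beta[1:]

# Standardise the features only; the target y is left in its original units
mean, std = X.mean(axis = 0), X.std(axis = 0)
X_scaled  = (X - mean) / std

b0_raw, coef_raw       = fit_ols(X, y)
b0_scaled, coef_scaled = fit_ols(X_scaled, y)

pred_raw    = b0_raw + X @ coef_raw
pred_scaled = b0_scaled + X_scaled @ coef_scaled

print("Max prediction difference :", np.abs(pred_raw - pred_scaled).max())
print("Per-unit coefficients     :", coef_raw)
print("Per-std-dev coefficients  :", coef_scaled)
print("Scaled coef / std         :", coef_scaled / std)
```

The predictions agree up to floating-point error, and dividing a standardised coefficient by that feature's standard deviation recovers the per-unit (for example, per-year) effect.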

In [None]:
#
# Initialize the StandardScaler
# It will learn the mean and standard deviation from the training data
#
sc = StandardScaler()

#
# Fit the scaler on the training data and transform both training and testing data
# We only fit on X_train to prevent data leakage from the test set
#
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled  = sc.transform(X_test)

#
# The above command can raise the following warning. To work around it, use
# .values (see below)
# /usr/local/lib/python3.12/dist-packages/sklearn/utils/validation.py:2732: UserWarning: X has feature names, but StandardScaler was fitted without feature names
#  warnings.warn(
#
#
# X_train_scaled = sc.fit_transform(X_train.values)
# X_test_scaled  = sc.transform(X_test.values)
#

In [None]:
#
# Print a summary of the scaled data to see the effect
#
pd.options.display.float_format = '{:.2f}'.format

print("Original X_train values")
print(X_train.describe())

print("")

#
# We cannot call Pandas describe() or head() directly because X_train_scaled
# is a NumPy array.
#
# Change it to a dataframe
#
print("Scaled X_train values (note the mean is 0 (or close to it) and std 1 (or close to it))")
print(pd.DataFrame(X_train_scaled, columns = X_train.columns).describe())

#
# Same for X_test
#
print("")

print("Original X_test values")
print(X_test.describe())

print("")

print("Scaled X_test values (note the mean is 0 (or close to it) and std 1 (or close to it))")
print(pd.DataFrame(X_test_scaled, columns = X_test.columns).describe())

In [None]:
sns.boxplot(X)

In [None]:
sns.boxplot(X_train_scaled)

In [None]:
sns.boxplot(X_test_scaled)

# **Model Building**
**Use Linear Model (Multiple Linear Regression)**

In [None]:
#
# Initialize the Linear Regression model
#
lrModel = LinearRegression()

#
# Fit the model to the scaled training data
#
lrModel.fit(X_train_scaled, y_train)

#
# Make Predictions
#
y_predicted = lrModel.predict(X_test_scaled)
# y_predicted

In [None]:
#
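# Check residual errors
# They are computed here as predicted - actual, so a positive value means
# the model predicted more than the actual charge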
#
residualErrors = y_predicted - y_test
print(residualErrors)
print('Type of residual errors =', type(residualErrors))

# **Model Evaluation**

## **Evaluation Metrics**

In [None]:
mean_squared_error(y_test, y_predicted)

In [None]:
mean_absolute_error(y_test, y_predicted)

In [None]:
root_mean_squared_error(y_test, y_predicted)

In [None]:
r2_score(y_test, y_predicted)

## **Observation**
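**The model's R² score is 80.05%, meaning it explains about 80% of the variance in `charges` on the test set.** Note that R² is a goodness-of-fit measure for regression, not a classification accuracy.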
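
For reference, the four metrics used above can be written out by hand. The sketch below does so with NumPy on a few made-up actual and predicted values; these numbers are assumptions for illustration and are not output from the model.

```python
import numpy as np

# Illustrative actual and predicted values (assumed; not model output)
y_true = np.array([3000.0, 8000.0, 12000.0, 20000.0, 40000.0])
y_pred = np.array([4000.0, 7000.0, 13000.0, 24000.0, 35000.0])

errors = y_true - y_pred

mae  = np.abs(errors).mean()   # Mean Absolute Error
mse  = (errors ** 2).mean()    # Mean Squared Error
rmse = np.sqrt(mse)            # Root Mean Squared Error, in the units of y

# R² compares the model's squared error with that of always predicting the mean
ss_res = (errors ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2     = 1 - ss_res / ss_tot

print(f"MAE  = {mae:.2f}")
print(f"MSE  = {mse:.2f}")
print(f"RMSE = {rmse:.2f}")
print(f"R2   = {r2:.4f}")
```

MAE and RMSE are in the same units as `charges`, which makes them easier to relate to real costs than MSE or R².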

## **Plot Residual Errors**

In [None]:
#
# Use Scatter Plot from plt
#
plt.figure(figsize = (10, 6))

plt.scatter(x = y_predicted, y = residualErrors)

plt.axhline(y=0, color='r', linestyle='--') # Add a horizontal line at y = 0 for reference
plt.title('Predicted vs. Residual Errors', fontsize = 18)
plt.xlabel('Predicted Insurance Charges', fontsize = 12)
plt.ylabel('Residual Errors', fontsize = 12)

plt.show()

## **Observations**
**The Model is Performing Well for Lower Charges**  
The cluster of data points on the left side of the graph, corresponding to lower predicted charges, is tightly grouped around the red dashed line (the zero error line). This indicates that for most of your predictions, especially for individuals with lower insurance costs, your model's errors are small and close to zero. The predictions are quite accurate in this range.

**A Clear Pattern of Errors Exists**  
The graph is not a random scatter. You can clearly see distinct clusters of errors above and below the zero line. This suggests that the linear regression model isn't capturing everything.
* **Positive Errors**  
The cluster of points far above the zero line (e.g., predicted charges between 25,000 and 40,000) indicates that for some high-cost individuals, the model is consistently overestimating their charges. Because the residual is computed here as predicted minus actual, a large positive error means the model predicted a higher charge than the actual value.
* **Negative Errors**  
Similarly, the cluster of points below the zero line shows that for other individuals, the model is consistently underestimating their charges.

**The Underlying Cause**  
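One likely explanation for these distinct clusters of errors is **Outliers**. We have seen outliers in our visualisations and we haven't handled them. **This can be a significant cause of large residual errors and a key reason why your model might not be performing as well as it could.** Another possibility, suggested by the BMI scatter plot earlier, is that BMI affects charges much more strongly for smokers than for non-smokers, an interaction that a purely additive linear model cannot capture.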
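
One hypothesis worth testing, based on the earlier scatter plot of charges against BMI, is that the effect of BMI depends on smoker status. The sketch below builds synthetic data with that structure (an assumption, not the real dataset) and compares an additive fit with one that adds a `smoker * bmi` column. The helper `r_squared` is defined here for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n   = 500

# Synthetic data (assumed structure): BMI raises charges steeply only for smokers
smoker  = rng.integers(0, 2, size = n).astype(float)
bmi     = rng.uniform(18, 45, size = n)
charges = (5_000 + 100 * bmi
           + smoker * (1_400 * bmi - 20_000)
           + rng.normal(0, 2_000, size = n))

def r_squared(features, target):
    """Fit OLS with an intercept and return R² on the same data."""
    design = np.column_stack([np.ones(len(target)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond = None)
    residuals = target - design @ beta
    return 1 - (residuals ** 2).sum() / ((target - target.mean()) ** 2).sum()

additive    = np.column_stack([smoker, bmi])
interaction = np.column_stack([smoker, bmi, smoker * bmi])

r2_additive    = r_squared(additive, charges)
r2_interaction = r_squared(interaction, charges)

print(f"R2 without interaction : {r2_additive:.4f}")
print(f"R2 with interaction    : {r2_interaction:.4f}")
```

Whether the same gain appears on the insurance data would need to be checked on the held-out test set, since in-sample R² never decreases when a column is added.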

# **Predict New Inputs**

In [None]:
X.loc[900]

In [None]:
y.loc[900]

In [None]:
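#
# X Input - smoker, age, bmi
# Use a DataFrame with the training column names so that the scaler, which was
# fitted on a DataFrame, does not warn about missing feature names
#
newData = pd.DataFrame([[0.00, 49.00, 22.52]], columns = X.columns)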

#
# Standardize/Scale - As we did it while training the model
#
newDataScaled = sc.transform(newData)

#
# Predict the Insurance Charges for the new data
# For the input
#   smoker   0.00
#   age     49.00
#   bmi     22.52
#
# Expect the predicted value
#   charges 8688.86
#
newDataPredicted = lrModel.predict(newDataScaled)
print("Prediction for New Input:", newDataPredicted)
print(f"Prediction for New Input: {newDataPredicted[0][0]:.2f}")
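#
# The predicted value 8435.20 differs from the expected 8688.86 by about 254
# (roughly 3%). An exact match is not expected from a regression model,
# but there is still room to improve the model
#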