## Objectives 
 - Load the data as a `pandas` dataframe
 - Clean the data, taking care of the blank entries
 - Run exploratory data analysis (EDA) and identify the attributes that most affect the `charges`
 - Develop single-variable and multivariable linear regression models for predicting the `charges`
 - Use Ridge regression to refine the performance of the linear regression models.

**For Reference Purposes**

| **Parameter** | **Description** | **Content type** |
|---|---|---|
| sex | Male or Female | integer (1 or 2 respectively) |
| smoker | Whether smoker or not | integer (0 or 1) |
| region | Which US region - NW, NE, SW, SE | integer (1, 2, 3 or 4 respectively) |


### **Import Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

### **Data Wrangling**

In [None]:
# import data
df = pd.read_csv('/kaggle/input/medical-insurance-price-prediction/Medical_insurance.csv')
df.head()

In [None]:
# inspect the data
df.info()

In [None]:
# checking for missing values
df.replace('?', np.nan, inplace=True)
df.isnull().sum()

In [None]:
# numerical summary of the data set
df.describe()

In [None]:
# reformat float values to 2 decimal places
pd.options.display.float_format = '{:,.2f}'.format
df.head()

In [None]:
# convert descriptive values into integer codes so the models can use them
# (assign the result back: replace(..., inplace=True) on a selected column
# is deprecated and may not modify df in recent pandas versions)
df['sex'] = df['sex'].replace({'male': 1, 'female': 2})
df['smoker'] = df['smoker'].replace({'no': 0, 'yes': 1})
df['region'] = df['region'].replace({'northwest': 1, 'northeast': 2, 'southwest': 3, 'southeast': 4})
df.head()

**Conclusion:**
* **Loading and inspection are complete; the dataset contains no missing values.**
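
Since no missing values were found here, no imputation was needed. Had the `'?'` replacement produced blanks, one common approach is sketched below. The toy frame and its values are hypothetical and are not taken from the dataset; filling with the mean and the most frequent value is an assumed strategy, not the only reasonable one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with '?' placeholders (not the real dataset).
toy = pd.DataFrame({
    "age": ["19", "?", "33", "45"],
    "smoker": ["yes", "no", "?", "no"],
})

# Mark the placeholders as missing, as done for df above.
toy = toy.replace("?", np.nan)

# Continuous attribute: cast to float, then fill blanks with the column mean.
toy["age"] = toy["age"].astype(float)
toy["age"] = toy["age"].fillna(toy["age"].mean())

# Categorical attribute: fill blanks with the most frequent value.
toy["smoker"] = toy["smoker"].fillna(toy["smoker"].mode()[0])

missing_after = int(toy.isnull().sum().sum())
print(toy)
print("missing after cleaning:", missing_after)
```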

### **Exploratory Data Analysis (EDA)**

**1. Implement the regression plot for charges with respect to bmi.**

In [None]:
# regression plot bmi vs charges
sns.regplot(data=df, x='bmi', y='charges', marker='x', line_kws={'color': 'red'})
plt.ylim(0)
plt.show()

The plot suggests a positive relationship between a person's BMI and annual insurance charges. One plausible explanation is that a higher BMI raises the risk of developing certain diseases, which in turn leads to higher medical expenses.

**2. Implement the box plot for charges with respect to smoker.**

In [None]:
# box plot charges vs smoker
sns.boxplot(data=df, x='smoker', y='charges')
plt.show()

Based on the box plot, smokers are charged more than non-smokers, as expected.
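
The box plot can be backed with numbers by summarizing the quartiles per group. The sketch below uses synthetic data with made-up group means, so the printed values are illustrative only; on the real data the same `groupby` call would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset; group means and spreads are made up.
rng = np.random.default_rng(3)
smoker = rng.integers(0, 2, 300)
charges = np.where(
    smoker == 1,
    rng.normal(32000, 11000, 300),  # assumed distribution for smokers
    rng.normal(8500, 6000, 300),    # assumed distribution for non-smokers
)
toy = pd.DataFrame({"smoker": smoker, "charges": charges})

# Quartiles per group: the same numbers a box plot draws as the box edges
# and the median line.
summary = toy.groupby("smoker")["charges"].describe()[["25%", "50%", "75%"]]
print(summary)
```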

**3. Print the correlation matrix for the dataset.**

In [None]:
# plot correlation of the dataset
sns.heatmap(df.corr(), annot=True)
plt.show()

Smoking exhibited the strongest correlation with annual insurance charges, followed by age.
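
Reading the ranking off a heatmap can be error-prone, so it may help to sort the correlations with the target directly. The following is a minimal sketch on synthetic data; the coefficients used to build `charges` are invented for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset (all coefficients are assumptions).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "smoker": rng.integers(0, 2, n),
})
toy["charges"] = (
    250 * toy["age"]
    + 300 * toy["bmi"]
    + 24000 * toy["smoker"]
    + rng.normal(0, 3000, n)
)

# Take the 'charges' column of the correlation matrix, drop the trivial
# self-correlation, and sort by absolute strength.
ranking = (
    toy.corr()["charges"]
    .drop("charges")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```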

### **Model Development**

**1. Fit a linear regression model that predicts the charges value using only the smoker attribute of the dataset. Print the R² score of this model.**

In [None]:
# predictor and target variable
x = df[['smoker']]
y = df['charges']

# fit and print the R² score
lr = LinearRegression()
lr.fit(x, y)
lr.score(x, y)

The R² value for the Linear Regression model was 0.62, indicating that smoking explains 62% of the variation in annual insurance charges.
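
To make the "variation explained" reading concrete, R² can be computed by hand as 1 − SS_res / SS_tot. For a single predictor with an intercept, it also equals the squared Pearson correlation. The sketch below checks both on synthetic data; the numbers are invented and will not reproduce the 0.62 above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic binary predictor and target (values are assumptions).
rng = np.random.default_rng(1)
smoker = rng.integers(0, 2, 400)
charges = 8000 + 24000 * smoker + rng.normal(0, 6000, 400)

X = smoker.reshape(-1, 1)
model = LinearRegression().fit(X, charges)
pred = model.predict(X)

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((charges - pred) ** 2)
ss_tot = np.sum((charges - charges.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = model.score(X, charges)
r2_corr = np.corrcoef(smoker, charges)[0, 1] ** 2

print(r2_manual, r2_sklearn, r2_corr)
```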

**2. Fit a linear regression model that predicts the charges value using all other attributes of the dataset. Print the R² score of this model.**

In [None]:
x_data = df.drop('charges', axis=1)
y_data = df['charges']

lr.fit(x_data, y_data)
lr.score(x_data, y_data)

The R² value for the Linear Regression model was 0.75, indicating that all the attributes together (including smoker) explain 75% of the variation in annual insurance charges.

**3. Create a training pipeline that uses StandardScaler(), PolynomialFeatures() and LinearRegression() to create a model that can predict the charges value using all the other attributes of the dataset.**

In [None]:
# x_data and y_data use the same values as defined in the previous cell
# pipeline steps: scale the features, expand them with polynomial terms
# (PolynomialFeatures defaults to degree=2), then fit a linear model
steps = [
    ('scale', StandardScaler()),
    ('polynomial', PolynomialFeatures(include_bias=False)),
    ('model', LinearRegression()),
]

pipe = Pipeline(steps)

pipe.fit(x_data, y_data)
ypipe = pipe.predict(x_data)
print(r2_score(y_data, ypipe))

In [None]:
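# show the data with a 'prediction' column without adding it to df itself;
# otherwise, re-running the earlier cells would treat 'prediction' as a feature
df.assign(prediction=ypipe).head(5)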
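
The R² printed for the pipeline is measured on the same rows the model was fitted on, so it can be optimistic. `cross_val_score`, which is imported at the top of the notebook, offers an out-of-sample estimate. The sketch below runs it on synthetic data with an invented BMI and smoker interaction; the data, coefficients, and `cv=5` are assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for x_data / y_data (all coefficients are made up).
rng = np.random.default_rng(2)
n = 600
age = rng.integers(18, 65, n)
bmi = rng.normal(30, 6, n)
smoker = rng.integers(0, 2, n)
X = np.column_stack([age, bmi, smoker])
y = (250 * age + 100 * bmi + 1000 * bmi * smoker
     - 5000 * smoker + rng.normal(0, 3000, n))

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(include_bias=False)),
    ("model", LinearRegression()),
])

# 5-fold cross-validation: each fold is scored on rows the pipeline
# did not see while fitting.
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print(scores)
print("mean R²:", scores.mean())
```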

### **Model Refinement**

**1. Split the data into training and testing subsets, assuming that 20% of the data will be reserved for testing.**

In [None]:
# split dataset into two (80% for training and 20% for testing)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=1)

**2. Initialize a Ridge regressor that uses hyperparameter α = 0.1. Fit the model using the training data subset. Print the R² score for the testing data.**

In [None]:
# Ridge Regression model
ridge_model = Ridge(alpha=0.1)
ridge_model.fit(x_train, y_train)
yhat = ridge_model.predict(x_test)
print(r2_score(y_test,yhat))

**3. Apply a polynomial transformation of degree 2 to the features. Use this transformed feature set to fit the same Ridge model as above on the training subset. Print the R² score for the testing subset.**

In [None]:
pr = PolynomialFeatures(degree=2)
# fit the transformer on the training data only, then reuse it on the test data
x_train_pr = pr.fit_transform(x_train)
x_test_pr = pr.transform(x_test)
ridge_model.fit(x_train_pr, y_train)
yhat = ridge_model.predict(x_test_pr)

print(r2_score(y_test, yhat))
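
One way to keep the transform and the model together is to wrap both in a pipeline, so the polynomial step is always fitted on the training subset and only applied to the test subset. The sketch below also loops over a few α values to show how the test R² can be compared. The data is synthetic and the α grid is an arbitrary assumption, so the scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for x_data / y_data (coefficients are made up).
rng = np.random.default_rng(4)
n = 600
age = rng.integers(18, 65, n)
bmi = rng.normal(30, 6, n)
smoker = rng.integers(0, 2, n)
X = np.column_stack([age, bmi, smoker])
y = (250 * age + 100 * bmi + 1000 * bmi * smoker
     - 5000 * smoker + rng.normal(0, 3000, n))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

results = {}
for alpha in (0.01, 0.1, 1.0, 10.0):  # arbitrary grid for illustration
    # The pipeline fits PolynomialFeatures on the training rows only
    # and reuses that fitted transform when scoring the test rows.
    model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=alpha))
    model.fit(X_train, y_train)
    results[alpha] = model.score(X_test, y_test)

for alpha, r2 in results.items():
    print(f"alpha={alpha}: test R² = {r2:.3f}")
```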

**Conclusion:**
* A regression model with second-degree polynomial features of age, sex, BMI, number of children, smoker status, and region yielded an R² score of 0.85, meaning it explains about 85% of the variation in annual insurance charges. R² measures explained variance, not prediction accuracy.

Thank you.