<h2>Objective:</h2>

- Develop a machine learning model using linear regression.
- The model will predict treatment charges on the basis of the details in a medical report.

<h3>Import the following libraries:</h3>

- NumPy
- pandas
- Matplotlib
- scikit-learn
- seaborn
- Plotly

 Using Python 3.11.9<br>
 From the terminal, install the following packages:<br>
 python -m pip install -U numpy<br>
 python -m pip install -U pandas<br>
 python -m pip install -U matplotlib<br>
 python -m pip install -U scikit-learn<br>
 python -m pip install -U seaborn<br>
 python -m pip install -U plotly<br>
 Install any other required packages in the same way.

In [None]:
import pandas as pd
import numpy as np


Download the dataset from the link given below:

Source of dataset: https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv

Load the dataset using the pandas library.

In [2]:
# Raw string (r"...") so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r"C:\GEN1_AI\Dataset\Machine_Learning\medical-charges.csv")

<h2>About the Dataset</h2>

<h4>The dataset describes people who smoke and people who don't. On the basis of this dataset, the model will predict a person's medical charges from their age, sex, BMI, number of children, smoking status, and region.</h4>


In [None]:
df

Information about the dataset is given below: the number of rows and columns, the data type of each column, memory usage, etc.

In [None]:
df.info()

The DataFrame has 1338 rows and 7 columns.

In [None]:
df.shape

The describe function computes summary statistics such as count, mean, min, and max for the numeric columns, which are important for understanding the data.

In [None]:
df.describe()

The dataset we are using contains no null values, so no rows have to be dropped or filled in before training the model.

In [None]:
df.isnull().sum()
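
As a minimal sketch of what the null check reports when values are missing, the example below uses a small made-up table (the `toy` DataFrame and its values are hypothetical, not taken from the insurance dataset). It counts missing values per column and then drops incomplete rows, which is one simple way to prepare such data, since scikit-learn's `LinearRegression` does not accept NaN inputs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the insurance dataset) with two gaps
toy = pd.DataFrame({
    "age": [19, 18, np.nan, 33],
    "bmi": [27.9, np.nan, 33.0, 22.7],
    "charges": [100.0, 200.0, 300.0, 400.0],
})

# Count missing values per column, the same call used on df above
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Keep only the rows where every column has a value
clean = toy.dropna()
print(clean.shape)
```

For the insurance dataset every count is zero, so this clean-up step is not needed.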

In [8]:
import plotly.express as px
import matplotlib
import matplotlib.pyplot as mlt
import seaborn as sb 


Set the basic parameters for the graphs we are going to draw, such as font size, figure size, and colour.

In [9]:

matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10,6)
matplotlib.rcParams['figure.facecolor'] = '#000000'

In [None]:
df.age.describe()

Creating a histogram of age.

If you observe the graph of age, the number of people aged 18 and 19 is almost double that of any other age, while every other age has a roughly similar number of people. The more even the distribution, the more evenly the model is informed across all ages.

In [None]:
fig = px.histogram(df,
                   x='age', 
                   marginal='box',
                   nbins=47,
                   title='Distribution of Age')
fig.update_layout(width=800, height=600, bargap=0.1)
fig.show()

Creating a histogram of BMI (Body Mass Index).

If you observe the graph of BMI, it rises and then falls in a similar way on either side of its peak. BMI is a medical screening tool that gives a rough indication of body fat. It is calculated as a person's weight in kilograms divided by the square of their height in metres.

The graph is created using the Plotly library, which produces interactive graphs.

The graph is uneven, which is not ideal for training the model. We need to handle this before we use the data to train the model.
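
To make the BMI definition concrete, here is a small sketch of the standard formula (weight in kilograms divided by height in metres squared). The function name `bmi` and the example person are illustrative assumptions, not part of the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall
example = bmi(70.0, 1.75)
print(round(example, 2))  # 22.86
```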

In [None]:
fig = px.histogram(df,
                   x='bmi',
                   marginal='box',
                   color_discrete_sequence=['red'],
                   title='Distribution of BMI')
fig.update_layout(width=800, height=600, bargap=0.1)
fig.show()

In [None]:
fig = px.histogram(df,
                   x='charges',
                   marginal='box',
                   color='smoker',
                   color_discrete_sequence=['green', 'grey'],
                   title='Annual Medical Charges')
fig.update_layout(width=800, height=600, bargap=0.1)
fig.show()

In [None]:
df.smoker.value_counts()

In [None]:
fig = px.histogram(df, x='smoker', 
            color='sex',
            color_discrete_sequence=['skyblue', 'darkcyan'],
            title='Smoker')
fig.update_layout(width=800, height=600, bargap=0.02)
fig.show()


In [None]:
df

In [None]:
fig = px.histogram(df, x='region', 
            color='sex',
            color_discrete_sequence=['salmon', 'dimgrey'],
            title='Region')
fig.update_layout(width=800, height=600, bargap=0.02)
fig.show()


In [None]:
fig = px.scatter(df, 
                 x='age',
                 y='charges',
                 color='smoker',
                 opacity=0.8,
                 hover_data=['sex'],
                 title='Age v/s Charges')
fig.update_traces(marker_size=5)
fig.update_layout(width=800, height=600, bargap=0.02)
fig.show()

In [None]:
fig = px.scatter(df, 
                x='bmi',
                y='charges',
                 color='smoker',
                 opacity=0.8,
                 hover_data=['sex'],
                 title='BMI v/s Charges')
fig.update_traces(marker_size=5)
fig.update_layout(width=800, height=600, bargap=0.02)
fig.show()

In [None]:
px.violin(df, x='children', y='charges').update_layout(width=800, height=600, bargap=0.02)

In [None]:
df.charges.corr(df.age)

In [None]:
df.charges.corr(df.bmi)

In [None]:
df.charges.corr(df.children)


In [24]:
df['smoker'] = df['smoker'].map({"yes":1, "no":0})

In [None]:
df.smoker
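
A short sketch of how `map` behaves may help here. The toy Series below is made up for illustration. It shows the yes/no encoding, and also what happens if the mapping is applied a second time to the already-encoded column (for example by re-running the cell): the values 0 and 1 are not keys of the dictionary, so every entry becomes NaN.

```python
import pandas as pd

# Hypothetical values standing in for the 'smoker' column
smoker = pd.Series(["yes", "no", "no", "yes"])

# Encode the categories as numbers
encoded = smoker.map({"yes": 1, "no": 0})
print(encoded.tolist())  # [1, 0, 0, 1]

# Applying the same mapping again finds no matching keys, so all values become NaN
remapped = encoded.map({"yes": 1, "no": 0})
print(remapped.isnull().all())  # True
```

If the column turns into NaN, reloading the CSV and running the mapping once restores it.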

In [None]:
df.charges.corr(df.smoker)
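
By default, `Series.corr` computes the Pearson correlation coefficient. The sketch below computes it by hand on a few made-up points and compares the result with pandas. The helper name `pearson` and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def pearson(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


# Hypothetical, nearly linear data
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "charges": [2100.0, 3900.0, 6200.0, 7800.0, 10100.0],
})

manual = pearson(toy.age, toy.charges)
builtin = toy.charges.corr(toy.age)
print(manual, builtin)
```

A value close to 1 means the two columns rise together almost linearly, a value near 0 means there is little linear relationship, and a value close to -1 means one falls as the other rises.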

In [None]:

df_numeric = df.select_dtypes(include=['number'])
df_numeric.corr()

In [None]:
sb.set_style("whitegrid")
mlt.figure(figsize=(8, 6))
sb.heatmap(df_numeric.corr(), cmap='Reds', annot=True)
mlt.title('Correlation Matrix')

In [29]:
non_smoker_df = df[df.smoker == 0]

In [None]:
non_smoker_df

In [None]:

sb.scatterplot(data = non_smoker_df, x='age', y='charges', alpha=0.7, s=15).set(title ='Age v/s Charges')

In [32]:
def estimate_charges(age, w, b):
    return w * age + b

In [33]:
w = 250
b = -2000

In [None]:
ages = non_smoker_df.age
ages

In [None]:
ages.shape

In [None]:
non_smoker_df.charges

In [37]:
# Store the result under a new name so the estimate_charges function is not overwritten
estimated_charges = estimate_charges(ages, w, b)

In [None]:
mlt.plot(ages, estimated_charges, 'r-')
mlt.xlabel('AGE')
mlt.ylabel('Estimated Charges')

In [None]:
target = non_smoker_df.charges

mlt.plot(ages, estimated_charges, 'r', alpha=0.9)

mlt.scatter(ages, target, s=8, alpha=0.8)
mlt.xlabel('Age')
mlt.ylabel('Charges')
mlt.legend(['Estimate', 'Actual'])

In [40]:
def try_parameters(w, b):
    ages = non_smoker_df.age 
    target = non_smoker_df.charges
    # Recompute the line for the given w and b instead of reusing a global result
    predictions = estimate_charges(ages, w, b)

    mlt.plot(ages, predictions, 'r', alpha=0.9)
    mlt.scatter(ages, target, s=8, alpha=0.8)
    mlt.xlabel('Age')
    mlt.ylabel('Charges')
    mlt.legend(['Estimate', 'Actual'])

In [None]:
try_parameters(250, -2000)

In [None]:
targets = non_smoker_df.charges
targets

In [None]:
predictions = estimated_charges
predictions



In [44]:
predicted = non_smoker_df.age * w + b

In [45]:
import numpy as np

In [46]:
def rmse(targets, predictions):
    # Root mean squared error: square the errors, average them, take the square root
    return np.sqrt(np.mean(np.square(targets - predictions)))
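
As a worked example of the RMSE calculation, the sketch below applies the same formula to three made-up target and prediction pairs (the `toy_` values are hypothetical). The errors are -100, 100 and -200, their squares average to 20000, and the square root of that is about 141.42.

```python
import numpy as np


def rmse(targets, predictions):
    # Root mean squared error: square the errors, average them, take the square root
    return np.sqrt(np.mean(np.square(targets - predictions)))


# Hypothetical actual and predicted charges
toy_targets = np.array([1000.0, 2000.0, 3000.0])
toy_predictions = np.array([1100.0, 1900.0, 3200.0])

toy_loss = rmse(toy_targets, toy_predictions)
print(round(float(toy_loss), 2))  # 141.42
```

Because the result is in the same unit as the target, it can be read as a typical size of the prediction error in charges.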

In [47]:
w = 250
b = -2000

In [None]:
try_parameters(w, b)

In [None]:
non_smoker_df.age

In [50]:
targets = non_smoker_df['charges']

In [None]:
rmse(targets, predicted)

In [None]:
# try_parameters, extended to also report the RMSE loss for the given w and b
def try_parameters(w, b):
    ages = non_smoker_df.age
    target = non_smoker_df.charges
    predictions = w * ages + b

    mlt.plot(ages, predictions, 'r', alpha=0.9)
    mlt.scatter(ages, target, s=8, alpha=0.8)
    mlt.xlabel('Age')
    mlt.ylabel('Charges')
    mlt.legend(['Prediction', 'Actual'])

    loss = rmse(target, predictions)
    print("RMSE LOSS: ", loss)

try_parameters(w, b)

In [53]:

from sklearn.linear_model import LinearRegression

In [54]:
model = LinearRegression()

In [None]:
help(model.fit)

In [None]:
inputs = non_smoker_df[['age']]
targets = non_smoker_df.charges
print('inputs.shape :', inputs.shape)
print('targets.shape :', targets.shape)


# The difference between
#   inputs = non_smoker_df['age']
# and
#   inputs = non_smoker_df[['age']]
# is that [] returns a Series (1D), while [[]] returns a DataFrame (2D).
# model.fit expects 2D inputs (rows x features), so the double brackets are used here.

In [None]:
type(inputs)
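
The difference between single and double brackets can be seen on a tiny made-up table (the `toy` DataFrame is hypothetical). This is only a sketch of the shapes involved.

```python
import pandas as pd

# Hypothetical three-row table
toy = pd.DataFrame({"age": [19, 25, 40], "charges": [1700.0, 2700.0, 6000.0]})

as_series = toy["age"]    # single brackets: one column as a 1D Series
as_frame = toy[["age"]]   # double brackets: a 2D DataFrame with one column

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```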

In [None]:
model.fit(inputs, targets)

In [None]:
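# Pass a DataFrame with the same column name used in fit, so scikit-learn does not warn about feature names
model.predict(pd.DataFrame({'age': [23, 37, 61]}))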

In [60]:
predictions = model.predict(inputs)

In [None]:
predictions

In [None]:
targets

In [None]:
rmse(targets, predictions)

In [None]:
#w
model.coef_

In [None]:
#b
model.intercept_
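
To show how `coef_` and `intercept_` relate to the w and b used earlier, the sketch below fits a model to made-up points that lie exactly on the line charges = 250 * age - 2000 and then reproduces a prediction by hand. All `toy_` names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical points lying exactly on charges = 250 * age - 2000
toy_ages = np.array([[20], [30], [40], [50]])
toy_charges = 250 * toy_ages.ravel() - 2000

toy_model = LinearRegression().fit(toy_ages, toy_charges)

toy_w = toy_model.coef_[0]      # slope: one weight per input feature
toy_b = toy_model.intercept_    # intercept

# A prediction is just w * age + b
manual = toy_w * 35 + toy_b
via_predict = toy_model.predict(np.array([[35]]))[0]
print(manual, via_predict)
```

Both values should agree (about 6750 here), which is the same relationship `estimate_charges` expresses.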

In [None]:
# coef_ is an array with one weight per feature, so take its single element as w
try_parameters(model.coef_[0], model.intercept_)

In [None]:
#Create inputs and targets 
inputs, targets = non_smoker_df[['age']], non_smoker_df['charges']

#create and train the model 
model = LinearRegression().fit(inputs, targets)

#Generate prediction 
predictions = model.predict(inputs)

#Compute loss to evaluate the model
loss = rmse(targets, predictions)
print('Loss:', loss)

<h2> Linear Regression Using Multiple Features </h2>

So far, we've used only the "age" feature to estimate "charges". Adding another feature like "bmi" is fairly straightforward. We simply assume the following relationship:

charges = w1 × age + w2 × bmi + b

Here, w1 and w2 are the weights and b is the intercept.
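
As a small sketch of this relationship, the example below evaluates the two-feature formula for two made-up people. The weights, intercept, and feature values are hypothetical numbers chosen only to make the arithmetic easy to follow, not values learned from the dataset.

```python
import numpy as np

# Hypothetical weights and intercept, chosen only for illustration
toy_w1, toy_w2, toy_b = 250.0, 30.0, -2000.0

# Each row is one person: [age, bmi]
toy_features = np.array([
    [30, 25.0],
    [45, 31.5],
])

# charges = w1 * age + w2 * bmi + b, computed for every row at once
toy_weights = np.array([toy_w1, toy_w2])
toy_estimates = toy_features @ toy_weights + toy_b
print(toy_estimates)
```

For the first row this is 250 × 30 + 30 × 25 − 2000 = 6250, and for the second 250 × 45 + 30 × 31.5 − 2000 = 10195.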

In [None]:
#Create inputs and targets 
inputs, targets = non_smoker_df[['age', 'bmi']], non_smoker_df['charges']

#create and train the model 
model = LinearRegression().fit(inputs, targets)

#Generate prediction 
predictions = model.predict(inputs)

#Compute loss to evaluate the model
loss = rmse(targets, predictions)
print('Loss:', loss)

In [None]:
non_smoker_df.charges.corr(non_smoker_df.bmi)

In [None]:
fig = px.scatter(non_smoker_df, x='bmi', y='charges', title='BMI v/s Charges')
fig.update_traces(marker_size=5)
fig.show()

In [None]:
model.coef_, model.intercept_

Next, consider 'children', which seems to have some correlation with 'charges'.

In [None]:
non_smoker_df.charges.corr(non_smoker_df.children)

In [None]:
fig = px.strip(non_smoker_df, x='children', y='charges', title='children v/s charges')
fig.update_traces(marker_size=4, marker_opacity=0.7)
fig.show()

In [None]:
#Create inputs and targets 
inputs, targets = non_smoker_df[['age', 'bmi', 'children']], non_smoker_df['charges']

#create and train the model 
model = LinearRegression().fit(inputs, targets)

#Generate prediction 
predictions = model.predict(inputs)

#Compute loss to evaluate the model
loss = rmse(targets, predictions)
print('Loss:', loss)

In [None]:
#Create inputs and targets 
inputs, targets = df[['age', 'bmi', 'children']], df['charges']

#create and train the model 
model = LinearRegression().fit(inputs, targets)

#Generate prediction 
predictions = model.predict(inputs)

#Compute loss to evaluate the model
loss = rmse(targets, predictions)
print('Loss:', loss)

In [None]:
px.scatter(df, x='age', y='charges', color='smoker')