## <h1 style = "color:green; text-align:center; font-size: 25px">PROJECT: PREDICTING ANNUAL MEDICAL INSURANCE PRICES</h1>

## Sherif Ayobami Nas'r

<img src = "Media/medicalpic.jpg"  alt = "">

## 1.0 INTRODUCTION

Medical insurance, often referred to as health insurance, is a financial arrangement in which individuals pay regular premiums to an insurer or government entity in exchange for coverage of their medical expenses. This coverage typically includes a wide range of healthcare services, such as doctor visits, hospital stays, medications, and preventive care. Medical insurance serves as a safeguard against the high costs of medical treatment, ensuring that individuals can access necessary healthcare without bearing the full financial burden. Different plans offer varying levels of coverage and may require copayments, deductibles, and co-insurance. It plays a crucial role in providing access to healthcare, promoting wellness, and mitigating the financial risks associated with illness or injury.
 
To use their medical insurance, policyholders often need to pay certain out-of-pocket expenses, such as deductibles, copayments, and co-insurance. A deductible is the amount an individual must pay before the insurance starts covering costs. A copayment is a fixed amount paid for each service or prescription, while co-insurance is a percentage of the cost paid by the policyholder.

Medical insurance aims to provide financial protection against the high expenses associated with illness, injury, and ongoing healthcare needs. It enables individuals to access medical care without shouldering the full financial burden themselves, thus promoting overall wellness and timely treatment.

In this project, I develop a regression system to predict the price of medical insurance, taking into account the current state of the global economy, the financial status of the policyholder, and the social factors hindering the growth of medical insurance.

This notebook covers the model creation phase of the project. Once the project is complete, I intend to host the model as an API for consumption.

# Importing packages

In [91]:
## Data loading and manipulation

import pandas as pd
import numpy as np

## Data visualization

import matplotlib.pyplot as plt
import seaborn as sns

## Data modelling

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

In [61]:
df = pd.read_csv('Datasets/Medicalpremium.csv')

### The Medical Insurance Premium Prediction dataset was obtained from <a href = "https://www.kaggle.com/datasets/tejashvi14/medical-insurance-premium-prediction">Kaggle</a>

# EXPLORATORY DATA ANALYSIS

In this phase of the project, the dataset was analysed and investigated to summarize its main features. This helps determine how best to manipulate the dataset to achieve optimal model performance by leveraging patterns, spotting and handling anomalies, testing hypotheses, and checking assumptions.

# BRIEF OVERVIEW

In [12]:
df.head(20)

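| | Age | Diabetes | BloodPressureProblems | AnyTransplants | AnyChronicDiseases | Height | Weight | KnownAllergies | HistoryOfCancerInFamily | NumberOfMajorSurgeries | PremiumPrice |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | 0 | 0 | 0 | 0 | 155 | 57 | 0 | 0 | 0 | 25000 |
| 1 | 60 | 1 | 0 | 0 | 0 | 180 | 73 | 0 | 0 | 0 | 29000 |
| 2 | 36 | 1 | 1 | 0 | 0 | 158 | 59 | 0 | 0 | 1 | 23000 |
| 3 | 52 | 1 | 1 | 0 | 1 | 183 | 93 | 0 | 0 | 2 | 28000 |
| 4 | 38 | 0 | 0 | 0 | 1 | 166 | 88 | 0 | 0 | 1 | 23000 |
| 5 | 30 | 0 | 0 | 0 | 0 | 160 | 69 | 1 | 0 | 1 | 23000 |
| 6 | 33 | 0 | 0 | 0 | 0 | 150 | 54 | 0 | 0 | 0 | 21000 |
| 7 | 23 | 0 | 0 | 0 | 0 | 181 | 79 | 1 | 0 | 0 | 15000 |
| 8 | 48 | 1 | 0 | 0 | 0 | 169 | 74 | 1 | 0 | 0 | 23000 |
| 9 | 38 | 0 | 0 | 0 | 0 | 182 | 93 | 0 | 0 | 0 | 23000 |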


In [13]:
df.describe()

| | Age | Diabetes | BloodPressureProblems | AnyTransplants | AnyChronicDiseases | Height | Weight | KnownAllergies | HistoryOfCancerInFamily | NumberOfMajorSurgeries | PremiumPrice |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 |
| mean | 41.745436 | 0.419878 | 0.46856 | 0.055781 | 0.180527 | 168.182556 | 76.950304 | 0.21501 | 0.117647 | 0.667343 | 24336.713996 |
| std | 13.963371 | 0.493789 | 0.499264 | 0.229615 | 0.384821 | 10.098155 | 14.265096 | 0.411038 | 0.322353 | 0.749205 | 6248.184382 |
| min | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 145.0 | 51.0 | 0.0 | 0.0 | 0.0 | 15000.0 |
| 25% | 30.0 | 0.0 | 0.0 | 0.0 | 0.0 | 161.0 | 67.0 | 0.0 | 0.0 | 0.0 | 21000.0 |
| 50% | 42.0 | 0.0 | 0.0 | 0.0 | 0.0 | 168.0 | 75.0 | 0.0 | 0.0 | 1.0 | 23000.0 |
| 75% | 53.0 | 1.0 | 1.0 | 0.0 | 0.0 | 176.0 | 87.0 | 0.0 | 0.0 | 1.0 | 28000.0 |
| max | 66.0 | 1.0 | 1.0 | 1.0 | 1.0 | 188.0 | 132.0 | 1.0 | 1.0 | 3.0 | 40000.0 |


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 986 entries, 0 to 985
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Age                      986 non-null    int64
 1   Diabetes                 986 non-null    int64
 2   BloodPressureProblems    986 non-null    int64
 3   AnyTransplants           986 non-null    int64
 4   AnyChronicDiseases       986 non-null    int64
 5   Height                   986 non-null    int64
 6   Weight                   986 non-null    int64
 7   KnownAllergies           986 non-null    int64
 8   HistoryOfCancerInFamily  986 non-null    int64
 9   NumberOfMajorSurgeries   986 non-null    int64
 10  PremiumPrice             986 non-null    int64
dtypes: int64(11)
memory usage: 84.9 KB


Observation: The dataset has eleven columns and 986 entries. Every variable, including the ten independent variables, is stored as an integer (`int64`), and every column is complete. No further data cleaning is required because there are no missing values and no object-type columns.
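
The observation above can also be checked programmatically. The following is a minimal sketch, assuming a small hypothetical frame named `sample` that reuses three columns and the first three rows shown by `df.head()`; the same two checks would apply to the full `df`.

```python
import pandas as pd

# Hypothetical three-row sample (values copied from the first rows shown above)
sample = pd.DataFrame({
    "Age": [45, 60, 36],
    "Diabetes": [0, 1, 1],
    "PremiumPrice": [25000, 29000, 23000],
})

# Count missing values per column; all zeros means nothing needs imputing
missing_per_column = sample.isna().sum()

# Confirm that every column has a numeric dtype (no object columns)
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in sample.dtypes)

print(missing_per_column)
print("All numeric:", all_numeric)
```

If either check failed on the real data, an imputation or encoding step would be needed before modelling.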

## Data Modelling

In [30]:
y = df['PremiumPrice']
X = df.drop('PremiumPrice', axis=1)

In [31]:
y.info()

<class 'pandas.core.series.Series'>
RangeIndex: 986 entries, 0 to 985
Series name: PremiumPrice
Non-Null Count  Dtype
--------------  -----
986 non-null    int64
dtypes: int64(1)
memory usage: 7.8 KB


### Splitting the dataset into training and test sets

In [32]:
trainX, testX, trainy, testy = train_test_split(X, y, test_size = 0.2)
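
Because the split above is random, the reported errors will differ from run to run. A minimal sketch of a repeatable split follows, assuming hypothetical stand-in arrays (`X_demo`, `y_demo`) with the same shape as the dataset and an arbitrary seed of 42; neither is part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 986 rows and 10 features, like the dataset's shape
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(986, 10))
y_demo = rng.integers(15000, 40001, size=986)

# random_state=42 is an arbitrary choice; any fixed integer makes the split repeatable
trainX_demo, testX_demo, trainy_demo, testy_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Splitting again with the same seed yields the same test rows
_, testX_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(trainX_demo.shape, testX_demo.shape)
print("Same split:", np.array_equal(testX_demo, testX_again))
```

With 986 rows and `test_size=0.2`, the test set holds 198 rows and the training set 788.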

# 1. LogisticRegression

In [36]:
model1 = LogisticRegression()

In [37]:
model1.fit(trainX, trainy)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
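
The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of doing both follows, assuming a scikit-learn pipeline and hypothetical stand-in data (`X_demo` and `y_demo` are not from the notebook). Note that `LogisticRegression` is a classifier, so each distinct premium value is treated as a class label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales (age-like, height-like, binary flag)
X_demo = np.column_stack([
    rng.integers(18, 67, size=300),
    rng.integers(145, 189, size=300),
    rng.integers(0, 2, size=300),
])

# Hypothetical premium classes derived from the age-like column
y_demo = np.where(X_demo[:, 0] > 45, 28000, 23000)

# Standardize the features, then fit with a larger iteration budget
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

pred_demo = scaled_model.predict(X_demo[:5])
print(pred_demo)
```

Putting the scaler inside the pipeline means it is fitted on the training data only and reused unchanged at prediction time.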


In [38]:
result = model1.predict(testX)

In [39]:
result

array([23000, 15000, 23000, 25000, 23000, 15000, 15000, 15000, 29000,
       28000, 23000, 28000, 15000, 23000, 23000, 23000, 23000, 15000,
       29000, 28000, 28000, 23000, 25000, 23000, 28000, 25000, 28000,
       28000, 23000, 23000, 15000, 15000, 23000, 23000, 25000, 25000,
       28000, 23000, 28000, 28000, 28000, 23000, 23000, 23000, 25000,
       15000, 23000, 28000, 23000, 28000, 35000, 28000, 15000, 28000,
       15000, 23000, 23000, 23000, 25000, 28000, 23000, 23000, 25000,
       23000, 29000, 28000, 15000, 28000, 23000, 28000, 15000, 35000,
       23000, 23000, 15000, 15000, 25000, 25000, 28000, 15000, 28000,
       23000, 29000, 23000, 28000, 23000, 15000, 28000, 21000, 28000,
       28000, 23000, 21000, 29000, 15000, 35000, 23000, 15000, 23000,
       23000, 28000, 15000, 28000, 15000, 28000, 23000, 23000, 28000,
       25000, 25000, 15000, 25000, 23000, 23000, 25000, 28000, 15000,
       23000, 28000, 23000, 23000, 23000, 25000, 23000, 15000, 23000,
       15000, 23000,

In [40]:
mean_absolute_error(testy, result)

1803.030303030303
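
For reference, the mean absolute error is the average of the absolute differences between actual and predicted values. A minimal worked sketch follows, using three hypothetical actual and predicted premiums that are not taken from the test set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted premiums (illustrative values only)
y_true_demo = np.array([25000, 29000, 23000])
y_pred_demo = np.array([23000, 29000, 28000])

# Manual computation: absolute errors are 2000, 0 and 5000, so the mean is 7000 / 3
manual_mae = np.mean(np.abs(y_true_demo - y_pred_demo))

# scikit-learn's signature is mean_absolute_error(y_true, y_pred)
sklearn_mae = mean_absolute_error(y_true_demo, y_pred_demo)

print(manual_mae, sklearn_mae)
```

Because the absolute value is symmetric, swapping the two arguments gives the same number, but passing the actual values first matches the documented signature.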

# 2. DecisionTreeRegressor 

In [41]:
model2 = DecisionTreeRegressor()

In [42]:
model2.fit(trainX, trainy)

In [43]:
result = model2.predict(testX)

In [44]:
result

array([15000., 15000., 23000., 25000., 30000., 15000., 15000., 15000.,
       23000., 28000., 23000., 28000., 15000., 29000., 23000., 23000.,
       35000., 16000., 29000., 28000., 28000., 15000., 29000., 23000.,
       25000., 25000., 25000., 29000., 30000., 23000., 15000., 32000.,
       23000., 34000., 22000., 25000., 28000., 23000., 28000., 35000.,
       28000., 38000., 23000., 23000., 25000., 15000., 23000., 28000.,
       32000., 25000., 35000., 31000., 15000., 25000., 15000., 15000.,
       35000., 23000., 25000., 35000., 23000., 23000., 25000., 23000.,
       35000., 28000., 15000., 26000., 23000., 28000., 15000., 35000.,
       23000., 23000., 15000., 36000., 25000., 25000., 28000., 15000.,
       28000., 23000., 38000., 23000., 25000., 23000., 15000., 28000.,
       21000., 28000., 25000., 29000., 21000., 35000., 15000., 29000.,
       38000., 15000., 25000., 23000., 28000., 39000., 28000., 15000.,
       28000., 23000., 23000., 17000., 28000., 25000., 15000., 38000.,
      

In [45]:
mean_absolute_error(testy, result)

1282.828282828283

# 3. LinearRegression

In [55]:
model3 = LinearRegression()

In [56]:
model3.fit(trainX, trainy)

In [57]:
result = model3.predict(testX)

In [58]:
result

array([18074.03792925, 18188.54217584, 20236.71895668, 35107.15917916,
       26644.30836743, 19950.06678065, 14274.32917605, 18891.94809365,
       25135.93749985, 29717.78018489, 22594.68489156, 28931.06049964,
       18245.89936464, 25622.67176939, 21001.65612786, 22482.72708891,
       27730.6342731 , 15598.62739225, 27334.93227573, 29576.75129436,
       28340.11353005, 17983.53900468, 26115.57616043, 24456.69918287,
       30295.83330561, 27723.16192965, 31514.16430265, 26249.71993837,
       26773.87637499, 23473.0767547 , 17728.14821382, 17569.99300815,
       22556.32051459, 20832.41294585, 25106.95079065, 26123.11981442,
       30468.5260422 , 19697.33041857, 30543.52769984, 31963.01877224,
       24765.08236602, 32306.45886431, 22304.10243343, 21299.86755576,
       25129.68125772, 17734.853932  , 20359.87244262, 30147.75896473,
       22732.38182162, 27873.53330713, 32035.65860084, 32059.56775285,
       18853.26424897, 28348.5959601 , 16839.06422428, 18931.51325205,
      

In [59]:
mean_absolute_error(testy, result)

2716.654725029599

# 4. Lasso

In [65]:
model4 = Lasso()

In [66]:
model4.fit(trainX, trainy)

In [67]:
result = model4.predict(testX)

In [68]:
result

array([18081.84707236, 18191.01638188, 20240.277659  , 35084.39065507,
       26641.5146231 , 19942.99837249, 14279.09736317, 18893.88772875,
       25141.95199515, 29719.51314337, 22595.59605382, 28944.6008585 ,
       18245.11618405, 25623.222646  , 21000.02561106, 22485.99555429,
       27732.280752  , 15603.57889505, 27336.97769607, 29589.7328115 ,
       28352.96546545, 17991.82228505, 26115.42891634, 24459.21037891,
       30288.7145229 , 27718.11305522, 31500.57994887, 26249.71023723,
       26771.00051486, 23469.74484713, 17731.82954145, 17563.18339643,
       22554.34173235, 20830.51013049, 25097.18170087, 26124.83325174,
       30486.25392541, 19704.44724676, 30547.44075179, 31956.31552436,
       24779.98054144, 32290.92617969, 22311.08321586, 21302.10557426,
       25134.29077163, 17742.68899177, 20365.80861007, 30155.40261379,
       22724.56755707, 27876.9673361 , 32036.12190007, 32052.09029348,
       18860.467876  , 28343.39146689, 16841.66332154, 18935.18220466,
      

In [69]:
mean_absolute_error(testy, result)

2715.8530323054833

# 5. Ridge

In [80]:
model5 = Ridge()

In [81]:
model5.fit(trainX, trainy)

In [82]:
result = model5.predict(testX)

In [83]:
result

array([18091.79601404, 18212.7837651 , 20256.20149643, 35067.47077062,
       26644.18751058, 19966.43312674, 14294.75229349, 18904.44062033,
       25151.12894495, 29736.90707022, 22606.00217149, 28956.55840851,
       18267.89350439, 25640.28012105, 21018.33844451, 22492.41304184,
       27746.43403882, 15620.59081585, 27352.8636119 , 29599.81043326,
       28369.14522194, 18003.3087144 , 26129.20511398, 24474.18442992,
       30274.25653076, 27716.18735775, 31490.00487402, 26260.02159322,
       26773.46508861, 23469.8806616 , 17756.33255501, 17586.14145515,
       22556.78306883, 20847.94535966, 25088.53197758, 26139.20600798,
       30499.6097771 , 19713.71224177, 30566.13587483, 31978.14909533,
       24793.3265476 , 32123.85803187, 22321.13791429, 21321.11153451,
       25146.47901772, 17752.48109466, 20378.45489373, 30168.39471565,
       22747.90634267, 27887.4871837 , 32049.97721609, 32037.09605879,
       18868.4371675 , 28367.73058897, 16852.92569483, 18956.74747248,
      

In [84]:
mean_absolute_error(testy, result)

2712.8969066852333

# 6. RandomForestRegressor

In [85]:
model6 = RandomForestRegressor()

In [86]:
model6.fit(trainX, trainy)

In [88]:
result = model6.predict(testX)

In [89]:
result

array([15500., 15050., 23000., 25160., 29490., 15440., 17880., 15930.,
       26030., 28540., 23000., 28000., 15250., 29000., 25600., 23110.,
       34780., 15190., 28970., 29920., 28410., 15000., 29130., 24600.,
       25000., 25200., 25160., 29110., 29320., 23140., 15440., 17150.,
       23000., 27180., 27670., 25000., 27960., 23000., 27800., 33010.,
       28000., 35270., 23000., 23000., 25000., 15350., 23000., 28000.,
       24860., 25000., 34940., 31390., 15600., 25100., 15020., 19870.,
       35040., 23000., 25000., 34940., 23330., 23270., 25000., 24140.,
       34940., 28130., 16100., 27710., 23000., 28000., 15010., 34740.,
       24320., 23220., 15170., 18570., 25100., 25000., 28140., 15870.,
       28000., 23000., 37820., 23380., 25000., 23080., 15000., 28000.,
       21060., 27560., 25000., 29000., 23110., 34940., 15000., 29090.,
       33390., 15410., 26540., 23070., 28230., 19860., 28000., 15330.,
       30620., 24260., 23620., 32370., 28000., 25000., 15170., 36050.,
      

In [90]:
mean_absolute_error(testy, result)

1080.5555555555557

# 7. KNeighborsClassifier

In [95]:
model7 = KNeighborsClassifier()

In [96]:
model7.fit(trainX, trainy)

In [97]:
result = model7.predict(testX)

In [98]:
result

array([15000, 15000, 21000, 25000, 23000, 15000, 15000, 15000, 23000,
       29000, 38000, 25000, 15000, 28000, 23000, 15000, 35000, 15000,
       28000, 28000, 28000, 15000, 29000, 30000, 25000, 25000, 28000,
       29000, 30000, 23000, 15000, 15000, 23000, 23000, 25000, 25000,
       35000, 21000, 28000, 28000, 25000, 23000, 23000, 23000, 25000,
       15000, 23000, 28000, 23000, 29000, 35000, 30000, 15000, 25000,
       15000, 15000, 35000, 23000, 25000, 35000, 23000, 23000, 25000,
       23000, 35000, 30000, 15000, 26000, 21000, 29000, 15000, 35000,
       23000, 23000, 15000, 15000, 23000, 25000, 28000, 15000, 29000,
       23000, 28000, 23000, 28000, 23000, 15000, 25000, 15000, 28000,
       25000, 29000, 21000, 35000, 15000, 28000, 15000, 15000, 23000,
       23000, 25000, 15000, 28000, 15000, 29000, 23000, 23000, 28000,
       25000, 25000, 15000, 25000, 15000, 23000, 25000, 28000, 15000,
       23000, 29000, 23000, 23000, 30000, 25000, 23000, 15000, 15000,
       15000, 23000,

In [99]:
mean_absolute_error(testy, result)

1904.040404040404

# 8. SVC

In [100]:
model8 = SVC()

In [101]:
model8.fit(trainX, trainy)

In [103]:
result = model8.predict(testX)
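
`KNeighborsClassifier` and `SVC` are classifiers, so they can only predict premium values seen during training. scikit-learn also provides regression counterparts, `KNeighborsRegressor` and `SVR`. The sketch below shows how they might be evaluated with the same metric; it is an alternative, not what this notebook ran, and the data, kernel, `C` value and seed are all illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical continuous target driven by the first of four synthetic features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
y_demo = 24000 + 3000 * X_demo[:, 0] + rng.normal(scale=500, size=200)

trainX_demo, testX_demo, trainy_demo, testy_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Both models are distance-based, so the features are standardized first
regressors = {
    "KNeighborsRegressor": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="linear", C=1000.0)),
}

reg_mae = {}
for name, reg in regressors.items():
    reg.fit(trainX_demo, trainy_demo)
    reg_mae[name] = mean_absolute_error(testy_demo, reg.predict(testX_demo))

print(reg_mae)
```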

### After trying eight different models, it is clear that RandomForestRegressor is the most suitable for this project because it has the lowest mean absolute error (about 1,081) among the models evaluated
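
The model-by-model comparison can also be written as a single loop, so that every candidate is scored on the same split with the same metric. A minimal sketch follows, assuming synthetic data from `make_regression` and three of the regressors used above; the sample size, noise level and seeds are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the insurance dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
trainX_demo, testX_demo, trainy_demo, testy_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fixed seeds keep the tree-based models repeatable between runs
candidates = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each candidate and record its test-set mean absolute error
mae_by_model = {}
for name, model in candidates.items():
    model.fit(trainX_demo, trainy_demo)
    mae_by_model[name] = mean_absolute_error(testy_demo, model.predict(testX_demo))

# The preferred model is the one with the smallest error
best_name = min(mae_by_model, key=mae_by_model.get)
print(mae_by_model)
print("Lowest MAE:", best_name)
```

The ranking on synthetic data says nothing about the insurance dataset; the point is only that a shared split and a shared metric make the comparison fair.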