# Linear Regression - PSC

<img src="https://miro.medium.com/v2/resize:fit:1400/0*ssbGU5VIxtVB6NrF" height=500 width=500>

> **Problem Statement**: The goal of this [problem](https://www.kaggle.com/datasets/noordeen/insurance-premium-prediction) is to predict individuals' future medical expenses so that a medical insurance company can decide how much premium to charge.

### Imports

In [1]:
import opendatasets as od
import pandas as pd
import os

### Download the Dataset

In [2]:
od.download('https://www.kaggle.com/datasets/noordeen/insurance-premium-prediction/data')

Downloading insurance-premium-prediction.zip to .\insurance-premium-prediction



### Convert Categorical to Numerical Columns (Optional)

In [3]:
df=pd.read_csv('insurance.csv')
df

,age,sex,bmi,children,smoker,region,expenses
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.8,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.9,0,no,northwest,3866.86
...,...,...,...,...,...,...,...
1333,50,male,31.0,3,no,northwest,10600.55
1334,18,female,31.9,0,no,northeast,2205.98
1335,18,female,36.9,0,no,southeast,1629.83
1336,21,female,25.8,0,no,southwest,2007.95


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   expenses  1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [5]:
# Assign each encoded column back to the DataFrame. Calling
# replace(..., inplace=True) on a column selection raises a pandas
# FutureWarning and will stop modifying df in pandas 3.0.
df['sex'] = df['sex'].map({'male': 1, 'female': 0})
df['smoker'] = df['smoker'].map({'yes': 1, 'no': 0})
df['region'] = df['region'].map({'southwest': 1, 'northwest': 0, 'southeast': 2, 'northeast': 3})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   int64  
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   int64  
 5   region    1338 non-null   int64  
 6   expenses  1338 non-null   float64
dtypes: float64(2), int64(5)
memory usage: 73.3 KB


In [6]:
df

,age,sex,bmi,children,smoker,region,expenses
0,19,0,27.9,0,1,1,16884.92
1,18,1,33.8,1,0,2,1725.55
2,28,1,33.0,3,0,2,4449.46
3,33,1,22.7,0,0,0,21984.47
4,32,1,28.9,0,0,0,3866.86
...,...,...,...,...,...,...,...
1333,50,1,31.0,3,0,0,10600.55
1334,18,0,31.9,0,0,3,2205.98
1335,18,0,36.9,0,0,2,1629.83
1336,21,0,25.8,0,0,1,2007.95
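
The integer codes assigned to `region` (0 to 3) imply an ordering that the regions do not have, and a linear model will treat those codes as magnitudes. One common alternative, sketched here on a tiny hand-made sample rather than the real dataset, is one-hot encoding with `pd.get_dummies`. The `sample` frame and the choice of `drop_first=True` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Small stand-in frame that reuses the dataset's column names.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "region": ["southwest", "southeast", "southeast", "northwest"],
})

# One indicator column per region. drop_first=True drops the first
# category (alphabetically) so the indicators are not collinear.
encoded = pd.get_dummies(sample, columns=["region"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

Whether this improves the model on the insurance data would need to be checked by refitting and comparing the test metrics.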


### Cleaning the Dataset

In [7]:
df.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
expenses    0
dtype: int64
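
The null check above finds no missing values. A further cleaning step that is sometimes worth running is a duplicate-row check. The sketch below uses a hand-made `sample` frame (an assumption for illustration) because the notebook does not show whether the real dataset contains duplicates.

```python
import pandas as pd

# Stand-in frame in which the last row repeats the second one.
sample = pd.DataFrame({
    "age": [19, 18, 18],
    "bmi": [27.9, 33.8, 33.8],
    "expenses": [16884.92, 1725.55, 1725.55],
})

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(sample.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```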

### Splitting the Dataset

In [8]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df,test_size=0.2, random_state=24)
train_df

,age,sex,bmi,children,smoker,region,expenses
132,53,0,35.9,2,0,1,11163.57
508,24,0,25.3,0,0,3,3044.21
422,40,1,32.8,1,1,3,39125.33
613,34,0,19.0,3,0,3,6753.04
1111,38,1,38.4,3,1,2,41949.24
...,...,...,...,...,...,...,...
145,29,0,38.8,3,0,2,5138.26
343,63,1,36.8,0,0,3,13981.85
192,25,1,25.7,0,0,2,2137.65
899,19,0,22.5,0,0,0,2117.34


In [9]:
test_df

,age,sex,bmi,children,smoker,region,expenses
736,37,0,38.4,0,1,2,40419.02
561,54,0,32.7,0,0,3,10923.93
930,26,1,46.5,1,0,2,2927.06
271,50,1,34.2,2,1,1,42856.84
933,45,0,35.3,0,0,1,7348.14
...,...,...,...,...,...,...,...
849,55,1,32.8,0,0,0,10601.63
483,51,0,39.5,1,0,1,9880.07
537,46,0,30.2,2,0,1,8825.09
893,47,1,38.9,2,1,2,44202.65


### Selecting the Inputs & Output Columns

In [10]:
df.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'expenses'], dtype='object')

In [11]:
train_inputs = train_df[['age', 'sex', 'bmi', 'children', 'smoker', 'region']]

In [12]:
train_output = train_df[['expenses']]

In [13]:
test_inputs = test_df[['age', 'sex', 'bmi', 'children', 'smoker', 'region']]

In [14]:
test_output = test_df[['expenses']]
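
The split and the column selection above can also be done in one call, because `train_test_split` accepts several arrays and splits them with the same row assignment. The sketch below is an alternative arrangement, not the notebook's method. It uses ten rows copied from the tables above and only two feature columns, and the names `X_train`, `X_test`, `y_train`, and `y_test` are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten rows taken from the outputs shown earlier, reduced to three columns.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 50, 18, 21, 53, 24],
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9, 31.0, 31.9, 25.8, 35.9, 25.3],
    "expenses": [16884.92, 1725.55, 4449.46, 21984.47, 3866.86,
                 10600.55, 2205.98, 2007.95, 11163.57, 3044.21],
})

feature_cols = ["age", "bmi"]
X = sample[feature_cols]
y = sample["expenses"]

# Inputs and target are split together, so their row indices stay aligned.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=24
)

print(X_train.shape, X_test.shape)
```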

### Fitting the Model

In [15]:
from sklearn.linear_model import LinearRegression

In [16]:
linear = LinearRegression()

In [17]:
linear.fit(train_inputs, train_output)
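
After fitting, the learned weights are available as `coef_` and `intercept_`. The sketch below fits a model on synthetic data with known weights (an assumption made so the result can be checked) and labels each coefficient with its column name. In the notebook, `train_output` is a one-column DataFrame, so `linear.coef_` would have shape `(1, n_features)` and the row `linear.coef_[0]` would be used instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 3*x1 + 2*x2 + 5 with no noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
y = 3.0 * X["x1"] + 2.0 * X["x2"] + 5.0

model = LinearRegression().fit(X, y)

# Pair each coefficient with its feature name for easier reading.
coefficients = pd.Series(model.coef_, index=X.columns)

print(coefficients)
print(model.intercept_)
```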

### Making Predictions

In [18]:
test_pred = linear.predict(test_inputs)
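
For a fitted `LinearRegression`, `predict` computes the weighted sum of the inputs plus the intercept. The sketch below checks this on a small made-up array. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Four made-up samples with two features each.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = np.array([3.0, 2.5, 5.0, 8.0])

model = LinearRegression().fit(X, y)

# Manual prediction: X @ w + b.
manual = X @ model.coef_ + model.intercept_

print(np.allclose(manual, model.predict(X)))
```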

### Evaluating the Model

In [19]:
from sklearn.metrics import mean_squared_error as mse

In [20]:
mse(test_output, test_pred)

34279700.4555814

In [21]:
df.describe()

,age,sex,bmi,children,smoker,region,expenses
count,1338.0,1338.0,1338.0,1338.0,1338.0,1338.0,1338.0
mean,39.207025,0.505232,30.665471,1.094918,0.204783,1.513453,13270.422414
std,14.04996,0.50016,6.098382,1.205493,0.403694,1.104915,12110.01124
min,18.0,0.0,16.0,0.0,0.0,0.0,1121.87
25%,27.0,0.0,26.3,0.0,0.0,1.0,4740.2875
50%,39.0,1.0,30.4,1.0,0.0,2.0,9382.03
75%,51.0,1.0,34.7,2.0,0.0,2.0,16639.915
max,64.0,1.0,53.1,5.0,1.0,3.0,63770.43
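
The MSE is in squared currency units, which makes it hard to read on its own. Taking its square root gives the RMSE in the same units as `expenses`, and this can be compared with the standard deviation from `df.describe()`. The sketch below reuses the two numbers printed above. Note that the standard deviation is computed over the full dataset rather than the test split, so the comparison is only a rough guide.

```python
import math

mse_value = 34279700.4555814   # output of mse(test_output, test_pred) above
expenses_std = 12110.01124     # std of expenses from df.describe() above

# RMSE is in the same units as the target column.
rmse = math.sqrt(mse_value)

# Ratio below 1 means the typical error is smaller than the target's spread.
ratio = rmse / expenses_std

print(f"RMSE: {rmse:.2f}")
print(f"RMSE / std: {ratio:.2f}")
```

The RMSE comes out at roughly 5,855, a little under half of the standard deviation of `expenses`.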


In [22]:
from sklearn.metrics import r2_score

In [23]:
r2_score(test_output, test_pred)

0.7748088839118281
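
An R² of about 0.77 means the model accounts for roughly 77% of the variance of `expenses` in the test set. The score is defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up values (an assumption for illustration) and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions.
y_true = np.array([3.0, 2.5, 5.0, 8.0])
y_pred = np.array([2.8, 3.0, 4.6, 7.9])

# Residual sum of squares: error left after the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: error of always predicting the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```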