## _Medical Insurance Costs_

In this case, we have data on health information and the costs that a health insurer has to pay. The _medical insurance cost_ data contains the following fields:

1. Age: Age of the beneficiary
2. Sex: Gender of the beneficiary (_male_, _female_)
3. BMI: Body Mass Index
4. Children: Number of children/dependents covered by the insurer
5. Smoker: Smoking status (_yes_, _no_)
6. Region: Region where the beneficiary lives
7. Charges: Costs paid by the insurer

In [42]:
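# A quick look at the Medical Insurance Costs data
import pandas as pd

df = pd.read_csv('data/insurance.csv')

display(df.head())

# Correlate the numeric columns only; recent pandas versions raise an error on string columns
display(df.select_dtypes('number').corr())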

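|   | age | sex    | bmi    | children | smoker | region    | charges     |
|---|-----|--------|--------|----------|--------|-----------|-------------|
| 0 | 19  | female | 27.9   | 0        | yes    | southwest | 16884.924   |
| 1 | 18  | male   | 33.77  | 1        | no     | southeast | 1725.5523   |
| 2 | 28  | male   | 33.0   | 3        | no     | southeast | 4449.462    |
| 3 | 33  | male   | 22.705 | 0        | no     | northwest | 21984.47061 |
| 4 | 32  | male   | 28.88  | 0        | no     | northwest | 3866.8552   |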


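|          | age      | bmi      | children | charges  |
|----------|----------|----------|----------|----------|
| age      | 1.0      | 0.109272 | 0.042469 | 0.299008 |
| bmi      | 0.109272 | 1.0      | 0.012759 | 0.198341 |
| children | 0.042469 | 0.012759 | 1.0      | 0.067998 |
| charges  | 0.299008 | 0.198341 | 0.067998 | 1.0      |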


### Challenge

Build a regression model that predicts the costs the insurer has to pay based on the data. Validate the model's performance with the ***R-squared ($R^2$)*** value.
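
For reference, $R^2$ compares the model's squared error with the squared error of always predicting the mean. A minimal sketch of that definition, using made-up numbers rather than the insurance data (the helper name `r_squared` is hypothetical):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model leaves unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Illustrative values only
y_true = [3.0, 5.0, 7.0, 9.0]
print(r_squared(y_true, y_true))                # perfect predictions give 1.0
print(r_squared(y_true, [6.0, 6.0, 6.0, 6.0]))  # always predicting the mean gives 0.0
```

A value close to 1 means the model explains most of the variation in the target; a value near 0 means it does no better than the mean.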

#### _Tasks_

1. Make sure all categorical variables are handled properly. (Use pandas' mapping feature.)
2. Check for multicollinearity among all independent variables. If there is any, between which variables?
3. Make sure the model uses variables that do not have high multicollinearity.
4. (Hint) You can use the ***Variance Inflation Factor (VIF)*** to gauge the degree of multicollinearity of an independent variable.
5. Evaluate your model with the $R^2$ value.
6. Conclude: which independent variables can be used to build a good regression model for the _medical insurance costs_ case?
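
Task 1 points at pandas mapping. A minimal sketch of that approach on a tiny hand-made frame follows. The specific codes (`no` → 0, `yes` → 1, `female` → 0, `male` → 1) and the use of `pd.get_dummies` for `region` are assumptions chosen for illustration, not the encoding used later in this notebook:

```python
import pandas as pd

# Three hand-written rows that mimic the categorical columns of the dataset
sample = pd.DataFrame({
    'sex': ['female', 'male', 'male'],
    'smoker': ['yes', 'no', 'no'],
    'region': ['southwest', 'southeast', 'northwest'],
})

encoded = sample.copy()
# Binary categories: an explicit dictionary keeps the meaning of each code visible
encoded['sex'] = encoded['sex'].map({'female': 0, 'male': 1})
encoded['smoker'] = encoded['smoker'].map({'no': 0, 'yes': 1})

# region has no natural order, so one-hot columns avoid implying one
encoded = pd.get_dummies(encoded, columns=['region'], dtype=int)
print(encoded)
```

With an explicit mapping, any category missing from the dictionary becomes `NaN`, which makes unexpected values easy to spot.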

#### (Hints) Interpreting VIF Values

- VIF = 1: the independent variable is not correlated with the other independent variables
- 1 < VIF < 5: the independent variable is slightly correlated with the other independent variables
- VIF > 5: the independent variable is strongly correlated with the other independent variables
- VIF > 10: the independent variable is very strongly correlated with the other independent variables and needs further attention
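
The bands above can be written as a small lookup. This is only a sketch: the function name `interpret_vif` and the labels are hypothetical, and because the hints do not say what to do with a value of exactly 5, treating it as "slight" is an assumption:

```python
def interpret_vif(vif):
    """Map a VIF value to the interpretation bands listed in the hints."""
    if vif > 10:
        return 'very strong'
    if vif > 5:
        return 'strong'
    if vif > 1:
        return 'slight'   # assumption: a value of exactly 5 also lands here
    return 'none'

for value in (1.0, 1.8, 7.5, 10.4):
    print(value, interpret_vif(value))
```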

#### (Hints) Computing VIF

VIF can be computed directly with the `statsmodels` library.

#### (Hints) Scatterplot of Correlations Between Variables

![var_cor](assets/var_corr.png)

In [33]:

import numpy as np
from sklearn.preprocessing import LabelEncoder
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Copy the selected columns so the encodings below do not write into a view of df
X_feature = df[['age', 'bmi', 'children', 'smoker', 'sex', 'region']].copy()
y_feature = df['charges']

# Reshape y into a column vector
y_feature = y_feature.values.reshape(-1, 1)

# Encode the categorical columns as integer codes
labelencoder_X = LabelEncoder()
X_feature['smoker'] = labelencoder_X.fit_transform(X_feature['smoker'])
X_feature['sex'] = labelencoder_X.fit_transform(X_feature['sex'])
X_feature['region'] = labelencoder_X.fit_transform(X_feature['region'])

# Compute the Variance Inflation Factor (VIF) of every feature.
# Note: no intercept column is added to X_feature here.
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_feature.values, i) for i in range(X_feature.shape[1])]
vif["features"] = X_feature.columns
display(vif)


## VIF Values for Each Feature
|   | **VIF Factor** | **Features** | 
|---|----------------|--------------|
| 0 | 7.551348       | age          |
| 1 | 10.371829      | bmi          |
| 2 | 1.801245       | children     |
| 3 | 1.256837       | smoker       |
| 4 | 2.001061       | sex          |
| 5 | 2.924528       | region       |


The table above shows the ***Variance Inflation Factor (VIF)*** of each independent variable, which indicates its degree of multicollinearity.

From here on I use only _age_, _children_, and _smoker_. Both _bmi_ (VIF 10.37) and _age_ (VIF 7.55) exceed 5, which I read as the two depending strongly on each other, even though their pairwise correlation in the correlation matrix at the top is only 0.109, so I keep just one of them, _age_. I leave out _sex_ and _region_ on the assumption that they do not affect _charges_.
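
One thing that may be worth checking before dropping _bmi_: a VIF computed without an intercept column can be large simply because a feature's mean is far from zero. The sketch below uses synthetic columns that are independent by construction; the names and distributions are invented to loosely resemble age and BMI. The two helper functions are hypothetical illustrations of "without" and "with" an intercept, written in plain NumPy rather than taken from any library:

```python
import numpy as np

def vif_no_intercept(X, j):
    """1 / (1 - R^2) with uncentered R^2: column j regressed on the others, no intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    beta, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ beta
    r2 = 1.0 - (resid @ resid) / (y @ y)
    return 1.0 / (1.0 - r2)

def vif_with_intercept(X, j):
    """1 / (1 - R^2) with the usual centered R^2: the regression includes an intercept."""
    y = X[:, j]
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ beta
    centered = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (centered @ centered)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
age_like = rng.uniform(18, 64, size=1000)  # independent of bmi_like by construction
bmi_like = rng.normal(30, 6, size=1000)
X = np.column_stack([age_like, bmi_like])

for j, name in enumerate(['age_like', 'bmi_like']):
    print(name, vif_no_intercept(X, j), vif_with_intercept(X, j))
```

If the same effect is at work in the insurance data, recomputing VIF with a constant column added may give a fairer picture of whether _age_ and _bmi_ really overlap.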


In [83]:
from sklearn.preprocessing import LabelEncoder
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Copy the three selected columns so the encoding below does not write into a view of df
X_feature2 = df[['age', 'children', 'smoker']].copy()
y_feature2 = df['charges']

# Reshape y_feature2 into a column vector
y_feature2 = y_feature2.values.reshape(-1, 1)

# Encode the categorical smoker column as integer codes
labelencoder_X_feature2 = LabelEncoder()
X_feature2['smoker'] = labelencoder_X_feature2.fit_transform(X_feature2['smoker'])

# Recompute the VIF for the reduced feature set
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_feature2.values, i) for i in range(X_feature2.shape[1])]
vif["features"] = X_feature2.columns
display(vif)



## VIF Values of the 3 Selected Features
|   | VIF Factor | features |
|--:|-----------:|---------:|
| 0 |   1.878075 |      age |
| 1 |   1.713125 | children |
| 2 |   1.216355 |   smoker |

All three values fall in the range 1 < VIF < 5, so according to the hints above each selected feature is only slightly correlated with the others.

In [88]:
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Copy the selected columns so the transformations below do not write into a view of df
X = df[['age', 'children', 'smoker']].copy()
y = df['charges']

# Reshape y into a column vector
y = y.values.reshape(-1, 1)

# Encode the categorical smoker column as integer codes
labelencoder_X = LabelEncoder()
X['smoker'] = labelencoder_X.fit_transform(X['smoker'])

# Scale every feature to [0, 1]; MinMaxScaler works column by column
scaler = MinMaxScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
display(X.head())

# Split the dataset into a training set (90%) and a test set (10%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=50)

# Fit a multiple linear regression on the training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict the test set
y_pred = regressor.predict(X_test)

# Side-by-side comparison of actual and predicted charges
display_df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
# display(display_df)

# Evaluate the model on the test set
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

# R2 score
print('R2 Score:', r2_score(y_test, y_pred))

|   | age      | children | smoker |
|---|----------|----------|--------|
| 0 | 0.021739 | 0.0      | 1.0    |
| 1 | 0.0      | 0.2      | 0.0    |
| 2 | 0.217391 | 0.6      | 0.0    |
| 3 | 0.326087 | 0.0      | 0.0    |
| 4 | 0.304348 | 0.0      | 0.0    |


Mean Absolute Error: 3781.4529947859564
Mean Squared Error: 34637338.7394313
Root Mean Squared Error: 5885.349500193791
R2 Score: 0.780562650810878
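
The cell above fits the scaler on the full dataset before splitting it. A common alternative is to wrap the scaler and the regressor in a pipeline so that preprocessing is fitted on the training rows only. A minimal sketch on synthetic data follows; the generated features and the coefficients used to build `y` are invented stand-ins, not estimates from the insurance data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for (age, children, smoker) -> charges; all numbers are made up
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.integers(18, 65, size=n),  # age-like
    rng.integers(0, 6, size=n),    # children-like
    rng.integers(0, 2, size=n),    # smoker-like (0/1)
]).astype(float)
y = 250 * X[:, 0] + 500 * X[:, 1] + 24000 * X[:, 2] + rng.normal(0, 3000, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=50)

# The scaler is fitted on X_train only, then reused unchanged on X_test
model = make_pipeline(MinMaxScaler(), LinearRegression())
model.fit(X_train, y_train)

score = r2_score(y_test, model.predict(X_test))
print('R2 on held-out data:', score)
```

For plain linear regression, min-max scaling does not change the predictions, so this mainly matters as a habit once the model or the preprocessing becomes less forgiving.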


# Conclusion

I used the following features: **_age_**, **_children_**, and **_smoker_**.
With these features, the model gives the following MAE, MSE, RMSE, and $R^2$ values on the test set:
```
Mean Absolute Error: 3781.4529947859564   
Mean Squared Error: 34637338.7394313   
Root Mean Squared Error: 5885.349500193791   
R2 Score: 0.780562650810878   
```

