# **Medical Cost** | Regression Analysis

Required libraries

In [1]:
import numpy as np
import pandas as pd
import sklearn.linear_model

## 1. Loading the data

\- Download the "Medical Cost Personal Datasets" from Kaggle

> https://www.kaggle.com/datasets/mirichoi0218/insurance

In [6]:
df = pd.read_csv("data/insurance.csv")  # forward slash avoids the invalid "\i" escape and works on every OS
df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


## 2. Analysis

### **A. Data preparation**

> First, let's look at the column names.

In [7]:
df.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


\- We want a rough summary of how the various categorical and continuous explanatory variables relate to the insurance charge (```charges```).

> First, convert the categorical variables with one-hot encoding.

In [12]:
X = pd.get_dummies(df.drop(['charges'], axis = 1))  ## one-hot encoding; drop(axis = 1) removes the target column
y = df.charges

X

Unnamed: 0,age,bmi,children,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.900,0,True,False,False,True,False,False,False,True
1,18,33.770,1,False,True,True,False,False,False,True,False
2,28,33.000,3,False,True,True,False,False,False,True,False
3,33,22.705,0,False,True,True,False,False,True,False,False
4,32,28.880,0,False,True,True,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...
1333,50,30.970,3,False,True,True,False,False,True,False,False
1334,18,31.920,0,True,False,True,False,True,False,False,False
1335,18,36.850,0,True,False,True,False,False,False,True,False
1336,21,25.800,0,True,False,True,False,False,False,False,True
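To see what `pd.get_dummies` does in isolation, here is a minimal sketch on a tiny made-up frame (the three rows are illustrative values, not taken from the dataset). It also shows the `drop_first=True` option, which this notebook does not use, for comparison.

```python
import pandas as pd

# Toy data (hypothetical rows that reuse the insurance column names).
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# Full one-hot encoding: one indicator column per category level.
full = pd.get_dummies(toy)
print(list(full.columns))

# drop_first=True keeps k-1 indicators for a column with k levels.
reduced = pd.get_dummies(toy, drop_first=True)
print(list(reduced.columns))
```

Numeric columns such as `age` pass through unchanged. In the full encoding, the indicators for one categorical variable always sum to 1 in every row.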


### **B. Creating the predictor**

In [14]:
predictr = sklearn.linear_model.LinearRegression()
predictr

### **C. Training**

In [15]:
predictr.fit(X, y)

### **D. Prediction**

\- Let's compare the predictions with the original data.

In [16]:
df.assign(y_hat = predictr.predict(X))

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,y_hat
0,19,female,27.900,0,yes,southwest,16884.92400,25293.713028
1,18,male,33.770,1,no,southeast,1725.55230,3448.602834
2,28,male,33.000,3,no,southeast,4449.46200,6706.988491
3,33,male,22.705,0,no,northwest,21984.47061,3754.830163
4,32,male,28.880,0,no,northwest,3866.85520,5592.493386
...,...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830,12351.323686
1334,18,female,31.920,0,no,northeast,2205.98080,3511.930809
1335,18,female,36.850,0,no,southeast,1629.83350,4149.132486
1336,21,female,25.800,0,no,southwest,2007.94500,1246.584939


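\- ```charges``` (actual values) and ```y_hat``` (predicted values) do not seem to match very well...? For example, row 3 has an actual charge of 21984.47 but a prediction of 3754.83.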
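One way to put a number on that impression is to look at the residuals. The sketch below uses only the first five rows shown in the table above, so the result describes those five rows and not the whole dataset.

```python
import numpy as np

# First five rows of the comparison table above (charges vs. y_hat).
charges = np.array([16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520])
y_hat = np.array([25293.713028, 3448.602834, 6706.988491, 3754.830163, 5592.493386])

residuals = charges - y_hat        # positive means the model under-predicts
mae = np.mean(np.abs(residuals))   # mean absolute error over these five rows

print(residuals.round(2))
print(round(mae, 2))
```

In the notebook itself, the same quantity over all rows could be computed as `(df.charges - predictr.predict(X)).abs().mean()`.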

### **E. Evaluation**

In [18]:
predictr.score(X, y)

0.7509130345985207

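> An R² of 0.7 or higher means the model is not a complete failure (it may be unsuitable for a competition, but it is usable). Note that this score is computed on the same data the model was fitted on, so it is likely to be optimistic about performance on new data.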
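For reference, `score` on a `LinearRegression` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below checks that by hand on held-out rows. The data is synthetic and the split settings (`test_size=0.25`, `random_state=0`) are assumptions chosen for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration; not the insurance data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Hold out 25% of the rows so the score is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# R^2 by hand: 1 - (residual sum of squares) / (total sum of squares).
pred = model.predict(X_test)
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_test, y_test))
```

The same split-then-score pattern could be applied to `X` and `y` from this notebook to get a less optimistic estimate than the in-sample 0.75.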

## 3. Interpreting the coefficients
\- Intercept

In [19]:
predictr.intercept_

-666.9377199366245

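> Taken literally, the baseline charge is about -667. This is the prediction when every feature equals 0, which no real row satisfies (each row has exactly one active dummy per categorical variable), so it should not be read as an actual baseline charge.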
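The sketch below shows why the intercept is hard to read when every category level gets its own dummy column, and how reference coding (`drop_first=True`) makes it interpretable. The data is a noise-free toy example with hypothetical values, and reference coding is a swapped-in alternative, not what this notebook does.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Noise-free toy data (hypothetical): non-smokers pay 1000, smokers pay 6000.
toy = pd.DataFrame({"smoker": ["no", "no", "yes", "yes"]})
y = [1000.0, 1000.0, 6000.0, 6000.0]

# Full dummy set, as in this notebook: smoker_no and smoker_yes.
X_full = pd.get_dummies(toy, dtype=float)
full_model = LinearRegression().fit(X_full, y)
gap_full = full_model.coef_[1] - full_model.coef_[0]  # smoker_yes minus smoker_no

# Reference coding: drop smoker_no so that non-smokers become the baseline.
X_ref = pd.get_dummies(toy, drop_first=True, dtype=float)
ref_model = LinearRegression().fit(X_ref, y)

print(full_model.intercept_, gap_full)
print(ref_model.intercept_, ref_model.coef_[0])
```

Under reference coding the intercept equals the non-smoker charge (1000) and the single coefficient is the smoker surcharge (5000). With the full dummy set, only the gap between the two dummy coefficients is pinned down by the data, so the split between intercept and dummy coefficients is not unique.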

\- Coefficients

In [24]:
pd.DataFrame({'columns_index' : list(X.columns), 'coef' : list(predictr.coef_)})
#pd.DataFrame({'columns_index' : list(X.columns), 'coef' : predictr.coef_.reshape(-1)})

Unnamed: 0,columns_index,coef
0,age,256.856353
1,bmi,339.193454
2,children,475.500545
3,sex_female,65.65718
4,sex_male,-65.65718
5,smoker_no,-11924.267271
6,smoker_yes,11924.267271
7,region_northeast,587.009235
8,region_northwest,234.045336
9,region_southeast,-448.012814


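* Charges increased with age, BMI, and the number of children (continuous variables).
* Charges were higher for women and for smokers (categorical variables), although the sex gap (about 131) is tiny next to the smoker gap (about 23,849).
* The region effect is hard to interpret, but the rest look quite plausible.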
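As a sanity check on this reading, the prediction for a single row can be rebuilt by hand from the intercept and the coefficients printed above. The sketch uses row 1 of the data (18-year-old male, BMI 33.77, one child, non-smoker, southeast). The coefficients are the rounded values displayed in the table, so the result agrees with `y_hat` only up to rounding.

```python
# Intercept and coefficients as printed above (rounded to the displayed digits).
intercept = -666.9377199366245
coef = {
    "age": 256.856353,
    "bmi": 339.193454,
    "children": 475.500545,
    "sex_male": -65.65718,
    "smoker_no": -11924.267271,
    "region_southeast": -448.012814,
}

# Row 1 of the data; dummies not listed here are 0 and contribute nothing.
row = {"age": 18, "bmi": 33.77, "children": 1,
       "sex_male": 1, "smoker_no": 1, "region_southeast": 1}

# Linear prediction: intercept + sum(coefficient * feature value).
y_hat = intercept + sum(coef[name] * value for name, value in row.items())
print(round(y_hat, 2))

# Smoker vs. non-smoker gap implied by the two dummy coefficients.
smoker_gap = 11924.267271 - (-11924.267271)
print(round(smoker_gap, 2))
```

The hand-computed value matches the `y_hat` of 3448.602834 reported for row 1 in the prediction table, and the smoker gap shows that smoking status dominates every other term in the model.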