## Machine Learning Practice

### Machine Learning Overview
- AI (Artificial Intelligence) > ML (Machine Learning) > DL (Deep Learning): each field contains the next one as a subset.

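### AI Frameworks
- `Numpy` : numerical operations on multi-dimensional arrays and matrices
- `Scipy` : scientific and technical computing built on Numpy, useful for large datasets
- `Pandas` : data analysis library
- `Matplotlib` : visualization library
- `Plotly` : interactive visualization library, often used as an alternative to Matplotlib
- `Scikit-learn` : library for classical machine learning
- `Theano` : numerical computation library used for machine learning
- `Tensorflow` : one of the most widely used deep learning libraries (also usable for general machine learning), developed by Google
- `Keras` : high-level library dedicated to deep learning (bundled with Tensorflow since 2.0)
- `PyTorch` : widely used and rapidly growing deep learning library, developed by Facebook (now Meta)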

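### Machine Learning with Scikit-learn
- Ordinary program : takes an input x and derives the output y from rules the developer wrote.
- Machine learning : takes inputs x together with their outputs y and builds a trained model from them. _**Feeding a new x into the trained model yields a new y.**_
- When the developer supplies the expected result (the label y) for each input, the model is guided toward the right answer, so this is called supervised learning. Most machine learning in practice falls into this category.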
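
The contrast between an ordinary program and a trained model can be sketched as follows. This is a minimal illustration rather than part of the original notebook: the rule `y = 2x + 1`, the function name `ordinary_program`, and the sample values are all assumptions chosen so the result is easy to check by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def ordinary_program(x):
    # Ordinary program: the developer writes the rule by hand (hypothetical rule).
    return 2 * x + 1


# Machine learning: only example inputs and their outputs are supplied ...
x_samples = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])  # shape (n_samples, n_features)
y_samples = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# ... and the model infers the rule from them.
model = LinearRegression()
model.fit(x_samples, y_samples)

# A new x goes into the trained model, and a new y comes out.
new_x = np.array([[10.0]])
predicted_y = model.predict(new_x)[0]
print(ordinary_program(10.0), predicted_y)
```

Because the samples lie exactly on a straight line, the trained model should reproduce the hand-written rule up to floating-point rounding.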

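### Evaluation Metrics
- Mean absolute error (MAE)
- Mean squared error (MSE)
- Root mean squared error (RMSE) : older scikit-learn releases have no dedicated function, so it is computed as the square root of MSE (recent releases add `sklearn.metrics.root_mean_squared_error`).
- Variance score (reported as R^2 in this notebook)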
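
The metrics listed above can be computed by hand with Numpy, which makes their definitions concrete. The values in `y_true` and `y_pred` are hypothetical and chosen only so the arithmetic is easy to follow; the last metric is computed as R^2, which is how this notebook reports the variance score.

```python
import numpy as np

# Hypothetical true values and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # square root of the mean squared error

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
# MAE=1.000  MSE=1.500  RMSE=1.225  R^2=0.700
```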
---
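### Boston Housing Price Regression
> Uses the Boston housing dataset that used to ship with `Scikit-Learn`; it is fetched from OpenML here because `load_boston` was removed.

#### Using the Scikit-learn library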

In [1]:
# Install scikit-learn
!pip3 install scikit-learn



In [2]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# from sklearn.datasets import load_boston  # removed in scikit-learn 1.2 over ethical concerns
# from sklearn.datasets import fetch_california_housing  # California housing data, an alternative
from sklearn.datasets import fetch_openml

In [3]:
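# Download the Boston housing data from OpenML.
# No version is given, so fetch_openml reports the active versions named 'boston' (listed below).
X, y = fetch_openml('boston', return_X_y=True)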

- version 1, status: active
  url: https://www.openml.org/search?type=data&id=531
- version 2, status: active
  url: https://www.openml.org/search?type=data&id=853



In [4]:
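# Add the target as a MEDV column (this modifies X in place)
X['MEDV'] = y

In [5]:
# dfBostonHousing refers to the same DataFrame as X, not a copy
dfBostonHousing = X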

In [6]:
# Save to an Excel file (needs the openpyxl package and an existing ./data directory)
dfBostonHousing.to_excel('./data/BostonHousing.xlsx', index=False)

In [7]:
dfBostonHousing.tail()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273.0,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.12,76.7,2.2875,1,273.0,21.0,396.9,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273.0,21.0,396.9,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273.0,21.0,393.45,6.48,22.0
505,0.04741,0.0,11.93,0,0.573,6.03,80.8,2.505,1,273.0,21.0,396.9,7.88,11.9


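#### Column descriptions
| Column  | Description |
|---------|-------------|
| CRIM    | Per capita crime rate by town |
| ZN      | Proportion of residential land zoned for lots over 25,000 sq. ft. |
| INDUS   | Proportion of non-retail business acres per town |
| CHAS    | Charles River dummy variable (1: tract bounds the river, 0: otherwise) |
| NOX     | Nitric oxides concentration (parts per 10 million) |
| RM      | Average number of rooms per dwelling |
| AGE     | Proportion of owner-occupied units built before 1940 |
| DIS     | Weighted distance to five major Boston employment centers |
| RAD     | Index of accessibility to radial highways |
| TAX     | Full-value property tax rate per $10,000 |
| PTRATIO | Pupil-teacher ratio by town |
| B       | 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town |
| LSTAT   | Percentage of lower-status population |
| MEDV    | Price: median value of owner-occupied homes, in $1000s |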

#### Building the model and analyzing the results
- Of the full dataset (100%), typically 70~80% is used for training (train) and the remaining 20~30% for evaluation (test).
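
The 70/30 division described above amounts to shuffling the row indices and cutting them at a fixed point. The sketch below shows that idea with plain Numpy; the sample count, the seed, and the variable names are assumptions for illustration, and the notebook itself relies on `train_test_split` for this step.

```python
import numpy as np

n_samples = 100     # hypothetical dataset size
test_ratio = 0.3    # hold out 30% for evaluation

rng = np.random.default_rng(seed=0)     # fixed seed so the split is repeatable
shuffled = rng.permutation(n_samples)   # row indices 0..99 in random order

n_test = int(n_samples * test_ratio)
test_idx = shuffled[:n_test]            # first 30 shuffled indices -> test set
train_idx = shuffled[n_test:]           # remaining 70 indices -> training set

print(len(train_idx), len(test_idx))    # 70 30
```

Every row lands in exactly one of the two sets, so the model is never evaluated on a row it was trained on.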

In [8]:
# Import the regression model and helper functions
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [9]:
# X: the 13 independent variables (features); y: the dependent variable (target), whose value depends on X
y

0      24.0
1      21.6
2      34.7
3      33.4
4      36.2
       ... 
501    22.4
502    20.6
503    23.9
504    22.0
505    11.9
Name: MEDV, Length: 506, dtype: float64

In [10]:
X

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296.0,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273.0,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273.0,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273.0,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273.0,21.0,393.45,6.48,22.0


In [11]:
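# Separate the independent variables from the combined data (dfBostonHousing) again.
# X gained a MEDV column in In [4], so drop it before training; otherwise the target leaks into the features.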
X = dfBostonHousing.drop(['MEDV'], axis=1)
X

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296.0,15.3,396.90,4.98
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.90,9.14
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.90,5.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273.0,21.0,391.99,9.67
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273.0,21.0,396.90,9.08
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273.0,21.0,396.90,5.64
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273.0,21.0,393.45,6.48


In [12]:
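# Split into training data and evaluation (test) data.
# Slicing the DataFrame in row order is risky: if, for example, the expensive homes are
# clustered near the end, the training and test sets end up with very different distributions.
# train_test_split() shuffles the rows at random before splitting them.

# test_size=0.3 holds out 30% for testing; random_state is an arbitrary seed (156, 105, 80, ... can be tried)
## Results by random_state (MSE / RMSE / R^2):
## 130 : 19.4 / 4.4 / 0.7
## 100 : 29.7 / 5.4 / 0.7
## 91 : 23.8 / 4.8 / 0.7
## 80 : 16.2 / 4.3 / 0.7
## 70 : 25.95 / 5.09 / 0.73
## 50 : 33.8 / 5.8 / 0.66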
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=70)
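
The comments in this cell warn that cutting the rows in order can give training and test sets with very different price levels. The following sketch makes that visible on hypothetical data (100 prices already sorted in ascending order, an assumption for illustration) by comparing `shuffle=False` with the default shuffled split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 prices sorted from cheapest to most expensive.
prices = np.arange(100, dtype=float)
features = prices.reshape(-1, 1)

# Split in row order: the test set receives only the most expensive rows.
_, _, y_train_seq, y_test_seq = train_test_split(
    features, prices, test_size=0.3, shuffle=False
)

# Default behaviour: shuffle the rows first, then split.
_, _, y_train_rnd, y_test_rnd = train_test_split(
    features, prices, test_size=0.3, random_state=70
)

# Gap between the mean price of the test set and of the training set.
gap_seq = abs(y_test_seq.mean() - y_train_seq.mean())
gap_rnd = abs(y_test_rnd.mean() - y_train_rnd.mean())
print(f"in order: {gap_seq:.1f}   shuffled: {gap_rnd:.1f}")
```

With the ordered split the two means differ by 50, the largest gap possible for this data, while the shuffled split keeps them much closer.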

In [13]:
X_train.tail()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
316,0.31827,0.0,9.9,0,0.544,5.914,83.2,3.9986,4,304.0,18.4,390.7,18.33
280,0.03578,20.0,3.33,0,0.4429,7.82,64.5,4.6947,5,216.0,14.9,387.31,3.76
114,0.14231,0.0,10.01,0,0.547,6.254,84.2,2.2565,6,432.0,17.8,388.74,10.45
214,0.28955,0.0,10.59,0,0.489,5.412,9.8,3.5875,4,277.0,18.6,348.93,29.55
334,0.03738,0.0,5.19,0,0.515,6.31,38.5,6.4584,5,224.0,20.2,389.4,6.75


In [14]:
y_train.tail()

316    17.8
280    45.4
114    18.5
214    23.7
334    20.7
Name: MEDV, dtype: float64

In [15]:
X_test.tail()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
62,0.11027,25.0,5.13,0,0.453,6.456,67.8,7.2255,8,284.0,19.7,396.9,6.73
64,0.01951,17.5,1.38,0,0.4161,7.104,59.5,9.2229,3,216.0,18.6,393.24,8.05
134,0.97617,0.0,21.89,0,0.624,5.757,98.4,2.346,4,437.0,21.2,262.76,17.31
218,0.11069,0.0,13.89,1,0.55,5.951,93.8,2.8893,5,276.0,16.4,396.9,17.92
177,0.05425,0.0,4.05,0,0.51,6.315,73.4,3.3175,5,296.0,16.6,395.6,6.29


In [16]:
y_test.tail()

62     22.2
64     33.0
134    15.6
218    21.5
177    24.6
Name: MEDV, dtype: float64

In [17]:
# Create the linear regression model
lr = LinearRegression()

In [18]:
# Train (fit) the linear regression model
lr.fit(X_train, y_train)

In [19]:
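# Linear regression: predict on the evaluation (test) data.
# Pass the DataFrame itself: the model was fitted with column names, and recent scikit-learn warns about missing feature names when given a bare Numpy array.
y_predict = lr.predict(X_test)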



In [20]:
# Values predicted by the linear regression model
y_predict

array([20.85634258, 25.34513399, 24.87477622, 15.40517818, 17.25878521,
        1.37578564, 16.22029827, 27.98840501, 30.68773757, 25.68109167,
       17.46440735, 30.24362049, 12.325743  , 30.93000623, 28.80011896,
       21.17491684, 21.96172153, 14.70652855, 19.98901515, 16.46442828,
       27.1748453 , 39.73477785, 19.32089996, 21.34852917, 20.66179818,
       19.96306603, 28.32924465, 35.18517632, 14.9833051 , 17.75151951,
       22.59634217, 25.94871002, 33.50939696, 41.38942124, 20.33522751,
       18.67502072, 29.20213411, 14.03729142, 23.69956778, 37.6967535 ,
       30.38826344, 24.93527128, 36.84577969, 24.49555952, 19.30866144,
       25.26571974, 17.4330132 , 24.54448455, 13.53904998, 35.94603941,
       17.59153933, 38.94482611, 23.03954046, 19.80624891, 14.75383844,
       19.13836794, 23.078889  , 10.80082269, 30.26654672, 33.56729506,
       13.29004009, 20.82585906, 25.0924484 , 19.44142049, 12.30823236,
       23.07776074, 32.6586873 , 23.79001095, 13.71065903, 22.50

In [21]:
from sklearn.metrics import r2_score

# Evaluate with the mean squared error (MSE)
mse = mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE

r2 = r2_score(y_test, y_predict)

print(f'MSE = {mse:.30f}')  # 30 decimal places; digits past about the 16th significant figure are floating-point noise
print(f'RMSE = {rmse}')
print(f'R^2(Variance Score) {r2}')

MSE = 25.950898465649618884754090686329
RMSE = 5.094202436657737
R^2(Variance Score) 0.732109300437499


In [22]:
# Intercept and regression coefficients
print(f'Intercept = {lr.intercept_}')
print(f'Coefficients = {np.round(lr.coef_, 1)}')

Intercept = 38.58903780475029
Coefficients = [ -0.1   0.   -0.1   2.8 -19.7   3.4   0.   -1.7   0.3  -0.   -0.8   0.
  -0.5]


In [23]:
pd.Series(data=np.round(lr.coef_, 2), index=X.columns)

CRIM       -0.15
ZN          0.05
INDUS      -0.05
CHAS        2.79
NOX       -19.66
RM          3.38
AGE         0.01
DIS        -1.73
RAD         0.31
TAX        -0.01
PTRATIO    -0.84
B           0.01
LSTAT      -0.55
dtype: float64
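
The intercept and the coefficients together define the fitted model: a prediction is the intercept plus the sum of each coefficient times its feature value. The sketch below checks this against `predict` on hypothetical synthetic data (the names `X_demo`, `y_demo`, `true_coef` and the chosen numbers are assumptions, not values from the Boston data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: 50 rows, 3 features, generated from known coefficients and intercept.
rng = np.random.default_rng(seed=0)
X_demo = rng.normal(size=(50, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_coef + 4.0

model = LinearRegression().fit(X_demo, y_demo)

# Prediction by hand: intercept + sum of (coefficient * feature value) for each row.
manual = model.intercept_ + X_demo @ model.coef_

print(np.allclose(manual, model.predict(X_demo)))  # True
print(np.round(model.coef_, 2), round(float(model.intercept_), 2))
```

Read this way, the `RM` coefficient of 3.38 in the output above means that one additional room is associated with a MEDV about 3.38 higher (MEDV is in $1000s) when the other features are held fixed, while the `NOX` coefficient of -19.66 points to lower prices where nitric oxides concentration is higher.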