### HousingData training
- Assignment for 2024/03/05
- Data: Housing Data Set
    - Concerns housing values in suburbs of Boston.
- Target: MEDV
- Features: all remaining columns
- Learning type: supervised learning + regression (LinearRegression, Ridge, Lasso)

Attribute Information:  
1) CRIM : per capita crime rate by town  
2) ZN : proportion of residential land zoned for lots over 25,000 sq. ft.  
3) INDUS : proportion of non-retail business acres per town  
4) CHAS : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)  
5) NOX : nitric oxides concentration (parts per 10 million)  
6) RM : average number of rooms per dwelling  
7) AGE : proportion of owner-occupied units built prior to 1940  
8) DIS : weighted distances to five Boston employment centres  
9) RAD : index of accessibility to radial highways  
10) TAX : full-value property-tax rate per &#36;10,000  
11) PTRATIO : pupil-teacher ratio by town  
12) B : 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town  
13) LSTAT : \% lower status of the population  
14) MEDV : Median value of owner-occupied homes in $1000's  

(1) Load modules and prepare data <hr>

In [1]:
import pandas as pd

In [2]:
housingDF = pd.read_csv('HousingData.csv')
housingDF.head(3)

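|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |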


(2) Data preprocessing <hr>
- Missing values, duplicates, outliers

In [3]:
housingDF.info()    # non-null counts differ across columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     486 non-null    float64
 1   ZN       486 non-null    float64
 2   INDUS    486 non-null    float64
 3   CHAS     486 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      486 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    int64  
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    486 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB


In [15]:
housingDF.isna().sum()    # re-run after In [5], so the 'include' column also appears

CRIM       20
ZN         20
INDUS      20
CHAS       20
NOX         0
RM          0
AGE        20
DIS         0
RAD         0
TAX         0
PTRATIO     0
B           0
LSTAT      20
MEDV        0
include     0
dtype: int64

In [4]:
housingDF[['CRIM', 'ZN', 'INDUS', 'CHAS', 'AGE', 'LSTAT']].isna().sum()     # each column has 20 null values

CRIM     20
ZN       20
INDUS    20
CHAS     20
AGE      20
LSTAT    20
dtype: int64

In [5]:
housingDF['include'] = True    # flag column: whether the row goes into the training dataset

for col in ['CRIM', 'ZN', 'INDUS', 'CHAS', 'AGE', 'LSTAT']:
    isna_mask = housingDF[col].isna()
    housingDF.loc[isna_mask, 'include'] = False  # any missing value sets 'include' to False

In [6]:
len(housingDF[housingDF.include == False])     # 112 rows contain at least one missing value... quite a lot.

112
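
The count of 112 is below the 6 × 20 = 120 upper bound because some rows are missing values in more than one column. The same row-level flag can be computed in one step; the following is a minimal sketch on a tiny stand-in frame (`demo_df` and its values are hypothetical, not the real `HousingData.csv`).

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (hypothetical values), not the real HousingData.csv
demo_df = pd.DataFrame({
    'CRIM': [0.1, np.nan, 0.3, 0.4],
    'ZN': [18.0, 0.0, np.nan, 0.0],
    'MEDV': [24.0, 21.6, 34.7, 33.4],
})

# True for every row that has at least one missing value
has_na = demo_df.isna().any(axis=1)
print(has_na.sum())  # number of rows with at least one missing value

# Keep only complete rows; same rows as demo_df.dropna()
complete_df = demo_df[~has_na].reset_index(drop=True)
print(complete_df)
```

Applied to `housingDF`, the same mask would replace both the `include` column and the per-column loop.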

In [7]:
housingDF.duplicated().sum()    # there are no duplicate rows

0

(3) Prepare the machine learning dataset <hr>

In [8]:
# Exclude rows with missing values
include_mask = housingDF.include == True
housingDF2 = housingDF[include_mask].reset_index(drop=True)     # DataFrame without missing values (index reset)
housingDF2 = housingDF2[housingDF2.columns[:-1]]                # drop the 'include' column

In [9]:
housingDF2.isna().sum()     # missing values have been removed

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

In [10]:
# Set up cross-validation
from sklearn.model_selection import KFold

k_fold = KFold(n_splits=7)  # split into 7 folds
kf_datasets = k_fold.split(housingDF2)  # generator of (train_idx, test_idx); can be iterated only once

In [11]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso

train_score1, test_score1 = [], []
train_score2, test_score2 = [], []
train_score3, test_score3 = [], []

for train_idx, test_idx in kf_datasets:     # train_idx : <class 'numpy.ndarray'>
    # Split into training and test datasets
    trainDF = housingDF2.iloc[train_idx]
    testDF = housingDF2.iloc[test_idx]
    
    # Separate features and target
    x_train = trainDF[trainDF.columns[:-1]]
    y_train = trainDF[trainDF.columns[-1]]
    
    x_test = testDF[testDF.columns[:-1]]
    y_test = testDF[testDF.columns[-1]]
    
    # Scale features (fit on the training fold only)
    mm_scaler = MinMaxScaler()
    mm_scaler.fit(x_train)
    scaled_x_train = mm_scaler.transform(x_train)
    scaled_x_test = mm_scaler.transform(x_test)
    
    # Fit the regression models
    lin_model = LinearRegression()
    lin_model.fit(scaled_x_train, y_train)
    
    ridge_model = Ridge()
    ridge_model.fit(scaled_x_train, y_train)
    
    lasso_model = Lasso()
    lasso_model.fit(scaled_x_train, y_train)
    
    # Store the scores
    train_score1.append(lin_model.score(scaled_x_train, y_train))
    test_score1.append(lin_model.score(scaled_x_test, y_test))
    
    train_score2.append(ridge_model.score(scaled_x_train, y_train))
    test_score2.append(ridge_model.score(scaled_x_test, y_test))
    
    train_score3.append(lasso_model.score(scaled_x_train, y_train))
    test_score3.append(lasso_model.score(scaled_x_test, y_test))
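
The manual loop above (split, fit the scaler on the training fold, fit each model, collect train and test scores) can also be written with a scikit-learn pipeline and `cross_validate`. This is only a sketch under stated assumptions: `X` and `y` are synthetic stand-ins from `make_regression`, not the housing data, and `models` and `mean_scores` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for features/target (assumption: not the housing data)
X, y = make_regression(n_samples=210, n_features=13, noise=10.0, random_state=0)

models = {
    'LinearRegression': LinearRegression(),
    'Ridge(alpha=1)': Ridge(alpha=1.0),
    'Lasso(alpha=1)': Lasso(alpha=1.0),
}

cv = KFold(n_splits=7)  # same unshuffled 7-way split as the manual loop

mean_scores = {}
for name, model in models.items():
    # The scaler is re-fit on each training fold only, as in the manual loop
    pipe = make_pipeline(MinMaxScaler(), model)
    result = cross_validate(pipe, X, y, cv=cv, return_train_score=True)
    mean_scores[name] = (result['train_score'].mean(), result['test_score'].mean())
    print(f'{name:<17} train R^2: {mean_scores[name][0]:.3f}  test R^2: {mean_scores[name][1]:.3f}')
```

Because a `KFold` object can be reused, this form also avoids the one-shot generator returned by `k_fold.split(...)`.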

In [12]:
def mean(some_list: list):
    """
    Return the mean of a list.
    :param some_list: list of numbers
    :return: mean of the list
    """
    if len(some_list) == 0:
        raise ValueError('empty list')
    return sum(some_list) / len(some_list)

In [13]:
train_score2

[0.7634079967377112,
 0.7656840164530028,
 0.745094279646995,
 0.7194645495070386,
 0.7689573809932847,
 0.8644675765667429,
 0.7518190412466444]

In [14]:
print(f'LinearRegression train score : {mean(train_score1)}')
print(f'LinearRegression test  score : {mean(test_score1)}')
print(f'Ridge(alpha=1)   train score : {mean(train_score2)}')
print(f'Ridge(alpha=1)   test  score : {mean(test_score2)}')
print(f'Lasso(alpha=1)   train score : {mean(train_score3)}')
print(f'Lasso(alpha=1)   test  score : {mean(test_score3)}')

LinearRegression train score : 0.7739374141769807
LinearRegression test  score : 0.5176442289009354
Ridge(alpha=1)   train score : 0.768413548735917
Ridge(alpha=1)   test  score : 0.5613269050707473
Lasso(alpha=1)   train score : 0.2718878442334171
Lasso(alpha=1)   test  score : -0.26397951036465217
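
The `score` method of these scikit-learn regressors returns the coefficient of determination R², which is why a negative value is possible: R² drops below zero whenever the predictions are worse than always predicting the mean of the evaluated targets. A small sketch with hypothetical numbers (the helper `r2` and the sample values are illustrative, not taken from the notebook):

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean baseline
    return 1.0 - ss_res / ss_tot

y_true = [10.0, 20.0, 30.0, 40.0]    # hypothetical targets

perfect = r2(y_true, y_true)         # 1.0: exact predictions
mean_only = r2(y_true, [25.0] * 4)   # 0.0: always predict the target mean
off_level = r2(y_true, [45.0] * 4)   # about -3.2: constant prediction at the wrong level

print(perfect, mean_only, off_level)
```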


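[Results]  
Overall performance: Ridge(alpha=1) > LinearRegression > Lasso(alpha=1)  
- LinearRegression: underfitting (even the mean train R² is only about 0.77), and the mean test R² drops to about 0.52, so it also generalizes poorly across folds
- Ridge and Lasso are regularization methods meant for overfitting; here Lasso(alpha=1) is likely regularized too strongly (train R² about 0.27, negative test R²)
- Next, try to raise the LinearRegression score (outliers, missing values, feature selection, etc.)  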
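
One way to revisit the missing-value handling is to impute instead of dropping rows. The sketch below is an assumption-laden illustration, not the notebook's method: the data are synthetic, and median imputation via `SimpleImputer` is just one possible strategy. Placing the imputer inside the pipeline keeps the medians learned on each training fold only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 200 rows, 5 features, linear target
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

# Blank out roughly 5% of the feature cells to mimic scattered missing values
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.05] = np.nan

# Impute with the training-fold median, then scale, then fit the regression
pipe = make_pipeline(SimpleImputer(strategy='median'), MinMaxScaler(), LinearRegression())
scores = cross_val_score(pipe, X_missing, y, cv=KFold(n_splits=7))
print(scores.mean())  # mean test R^2 over the 7 folds
```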

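Remaining tasks:
- Visualize the data distributions to understand them better (histograms, box plots)
- Handle duplicates, outliers, and missing values more carefully
- Vary alpha for Ridge and Lasso to find the best model
- Try tools for picking the best model (all_estimators(), etc.)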
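
For the alpha search, one possible approach is `GridSearchCV` over a pipeline. The following is a sketch, assuming synthetic stand-in data and a hypothetical grid of candidate values; the useful range for the housing data may well be different.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption), not the housing data
X, y = make_regression(n_samples=210, n_features=13, noise=10.0, random_state=0)

pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', Ridge()),
])

# Hypothetical candidate grid; 'model__alpha' targets the alpha of the 'model' step
param_grid = {'model__alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}

# Same 7-fold split as in the notebook; each alpha is scored by mean test R^2
search = GridSearchCV(pipe, param_grid, cv=KFold(n_splits=7))
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Swapping `Ridge()` for `Lasso()` in the `'model'` step would run the same search for Lasso.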