# SC21x
 
## Linear Models

In [30]:
'''
# If you are working in Google Colab, run this cell first.
import sys
if 'google.colab' in sys.modules:
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*
    !pip install plotly==4.*
!pip uninstall scikit-learn -y
!pip install -U scikit-learn
'''

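# Part 1 - Classification: Predicting Blood Donation 🚑
Part 1 uses a dataset describing donors at a mobile blood-donation vehicle in Taiwan. The Blood Transfusion Service Center in Taiwan drives its mobile units to universities to hold blood drives and collect blood for transfusion.

The goal of Part 1 is to use each donor's information to predict **whether the donor gave blood in March 2007**.

A good data-driven system for tracking and predicting donations and supply needs can improve the entire supply chain, so that more patients receive the transfusions they need.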

In [31]:
# Load the library and dataset needed for the analysis
import pandas as pd

donors = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data')
assert donors.shape == (748,5)  # Use assert to verify that the dataset loaded correctly.

# Rename the columns so they are easier to understand.
donors = donors.rename(columns={
    'Recency (months)': 'months_since_last_donation', 
    'Frequency (times)': 'number_of_donations', 
    'Monetary (c.c. blood)': 'total_volume_donated', 
    'Time (months)': 'months_since_first_donation', 
    'whether he/she donated blood in March 2007': 'made_donation_in_march_2007'
})

In [32]:
donors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   months_since_last_donation   748 non-null    int64
 1   number_of_donations          748 non-null    int64
 2   total_volume_donated         748 non-null    int64
 3   months_since_first_donation  748 non-null    int64
 4   made_donation_in_march_2007  748 non-null    int64
dtypes: int64(5)
memory usage: 29.3 KB


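## Column descriptions:
- **months_since_last_donation**: months since the most recent donation
- **number_of_donations**: total number of donations
- **total_volume_donated**: total volume of blood donated (c.c.)
- **months_since_first_donation**: months since the first donation
- **made_donation_in_march_2007**: whether the donor gave blood in March 2007 (target)

The data show that a large majority of donors, about three-quarters, did not give blood in March 2007.  
The class proportions below are therefore also the accuracy score of a majority-class baseline model: always predicting 0 is correct about 76% of the time.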

In [33]:
donors['made_donation_in_march_2007'].value_counts(normalize=True)

0    0.762032
1    0.237968
Name: made_donation_in_march_2007, dtype: float64
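
To make the link between class proportions and baseline accuracy concrete, here is a minimal sketch in plain Python. The `labels` list is hypothetical (it only mimics the roughly 76/24 class balance above) and is not the donors data itself.

```python
from collections import Counter

# Hypothetical labels with roughly the same class balance as the donors data.
labels = [0] * 76 + [1] * 24

# The majority-class baseline always predicts the most frequent label.
majority_class, majority_count = Counter(labels).most_common(1)[0]
baseline_predictions = [majority_class] * len(labels)

# Accuracy is the fraction of predictions that match the true labels.
baseline_accuracy = sum(
    pred == true for pred, true in zip(baseline_predictions, labels)
) / len(labels)

print(f"majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.2f}")
```

The baseline accuracy equals the share of the majority class, which is why `value_counts(normalize=True)` is enough to read it off.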

## 1.1 Split the data into features (X) and label (y), then randomly split it into train/test sets (using scikit-learn).

In [34]:
import pandas as pd
from sklearn.model_selection import train_test_split

features = ['months_since_last_donation', 'number_of_donations',
       'total_volume_donated', 'months_since_first_donation']
label = 'made_donation_in_march_2007'

X_train, X_test, y_train, y_test = train_test_split(donors[features], donors[label], train_size = 0.8, random_state=2)
X_train.shape, X_test.shape

((598, 4), (150, 4))

## 1.2 Build a logistic regression model with scikit-learn and fit it.

You are free to decide how many features to use for training.

In [35]:
from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression(max_iter = 1000)
logistic.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

## 1.3 Report a classification metric: accuracy
Report the accuracy score of the classification model on the test set.

Don't worry if the model performs worse than the baseline.
(If recall is used as the metric instead of accuracy, our model can beat the baseline. Choosing and interpreting an appropriate metric is a topic we will keep returning to.)

In [36]:
print(f"Test set accuracy: {logistic.score(X_test, y_test):.4f}")

Test set accuracy: 0.7600
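
The remark about recall can be illustrated with a small hand-computed sketch. The labels and predictions below are hypothetical, chosen only to show how a model can trail the majority-class baseline on accuracy while beating it on recall; they are not outputs of the `logistic` model above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    true_positives = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_positives = sum(t == positive for t in y_true)
    return true_positives / actual_positives


# Hypothetical ground truth: 6 non-donors (0) and 4 donors (1).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

# The majority-class baseline never predicts a donor.
baseline_pred = [0] * len(y_true)

# A hypothetical model that finds some donors at the cost of false alarms.
model_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

baseline_scores = (accuracy(y_true, baseline_pred), recall(y_true, baseline_pred))
model_scores = (accuracy(y_true, model_pred), recall(y_true, model_pred))

print("baseline accuracy/recall:", baseline_scores)
print("model    accuracy/recall:", model_scores)
```

Because the baseline never predicts the positive class, its recall is 0 by construction, so any model that identifies at least one donor beats it on that metric.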


# Part 2 - Regression: Predicting House Prices in Ames, Iowa 🏠

The dataset you will work with collects information about houses in the city of Ames, Iowa.

## Column descriptions
```
1stFlrSF: First floor area (square feet)

BedroomAbvGr: Number of bedrooms above grade (basement bedrooms not included)

BldgType: Type of dwelling
		
       1Fam	Single-family Detached	
       2FmCon	Two-family Conversion; originally built as one-family dwelling
       Duplx	Duplex
       TwnhsE	Townhouse End Unit
       TwnhsI	Townhouse Inside Unit
       
BsmtHalfBath: Number of basement half bathrooms (sink and toilet only)

BsmtFullBath: Number of basement full bathrooms (sink, toilet, shower, and bathtub)

CentralAir: Central air conditioning

       N	No
       Y	Yes
		
Condition1: Proximity to various conditions
	
       Artery	Adjacent to arterial street
       Feedr	Adjacent to feeder street	
       Norm	Normal	
       RRNn	Within 200' of North-South Railroad
       RRAn	Adjacent to North-South Railroad
       PosN	Near positive off-site feature--park, greenbelt, etc.
       PosA	Adjacent to positive off-site feature
       RRNe	Within 200' of East-West Railroad
       RRAe	Adjacent to East-West Railroad
	
Condition2: Proximity to various conditions (if more than one is present)
		
       Artery	Adjacent to arterial street
       Feedr	Adjacent to feeder street	
       Norm	Normal	
       RRNn	Within 200' of North-South Railroad
       RRAn	Adjacent to North-South Railroad
       PosN	Near positive off-site feature--park, greenbelt, etc.
       PosA	Adjacent to positive off-site feature
       RRNe	Within 200' of East-West Railroad
       RRAe	Adjacent to East-West Railroad
       
Electrical: Electrical system

       SBrkr	Standard Circuit Breakers & Romex
       FuseA	Fuse Box over 60 AMP and all Romex wiring (Average)	
       FuseF	60 AMP Fuse Box and mostly Romex wiring (Fair)
       FuseP	60 AMP Fuse Box and mostly knob & tube wiring (poor)
       Mix	Mixed
       
ExterCond: Present condition of the exterior material
		
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor
 
ExterQual: Quality of the exterior material
		
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor
		
Exterior1st: Exterior covering on the house

       AsbShng	Asbestos Shingles
       AsphShn	Asphalt Shingles
       BrkComm	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       CemntBd	Cement Board
       HdBoard	Hard Board
       ImStucc	Imitation Stucco
       MetalSd	Metal Siding
       Other	Other
       Plywood	Plywood
       PreCast	PreCast	
       Stone	Stone
       Stucco	Stucco
       VinylSd	Vinyl Siding
       Wd Sdng	Wood Siding
       WdShing	Wood Shingles
	
Exterior2nd: Exterior covering on the house (if more than one material)

       AsbShng	Asbestos Shingles
       AsphShn	Asphalt Shingles
       BrkComm	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       CemntBd	Cement Board
       HdBoard	Hard Board
       ImStucc	Imitation Stucco
       MetalSd	Metal Siding
       Other	Other
       Plywood	Plywood
       PreCast	PreCast
       Stone	Stone
       Stucco	Stucco
       VinylSd	Vinyl Siding
       Wd Sdng	Wood Siding
       WdShing	Wood Shingles
       
Foundation: Type of foundation
		
       BrkTil	Brick & Tile
       CBlock	Cinder Block
       PConc	Poured Concrete	
       Slab	Slab
       Stone	Stone
       Wood	Wood
		
FullBath: Number of full bathrooms above grade

Functional: Home functionality (assume typical unless deductions are warranted)

       Typ	Typical Functionality
       Min1	Minor Deductions 1
       Min2	Minor Deductions 2
       Mod	Moderate Deductions
       Maj1	Major Deductions 1
       Maj2	Major Deductions 2
       Sev	Severely Damaged
       Sal	Salvage only
		
GrLivArea: Above grade (ground) living area (square feet)

HalfBath: Number of half bathrooms above grade

Heating: Type of heating
		
       Floor	Floor Furnace
       GasA	Gas forced warm air furnace
       GasW	Gas hot water or steam heat
       Grav	Gravity furnace	
       OthW	Hot water or steam heat other than gas
       Wall	Wall furnace
		
HeatingQC: Heating quality and condition

       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor

HouseStyle: Style of dwelling
	
       1Story	One story
       1.5Fin	One and one-half story: 2nd level finished
       1.5Unf	One and one-half story: 2nd level unfinished
       2Story	Two story
       2.5Fin	Two and one-half story: 2nd level finished
       2.5Unf	Two and one-half story: 2nd level unfinished
       SFoyer	Split Foyer
       SLvl	Split Level

KitchenAbvGr: Number of kitchens above grade

KitchenQual: Kitchen quality

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor

LandContour: Flatness of the property

       Lvl	Near Flat/Level	
       Bnk	Banked - Quick and significant rise from street grade to building
       HLS	Hillside - Significant slope from side to side
       Low	Depression
		
LandSlope: Slope of the property
		
       Gtl	Gentle slope
       Mod	Moderate Slope	
       Sev	Severe Slope

LotArea: Lot size (square feet)

LotConfig: Lot configuration

       Inside	Inside lot
       Corner	Corner lot
       CulDSac	Cul-de-sac
       FR2	Frontage on 2 sides of property
       FR3	Frontage on 3 sides of property

LotShape: General shape of the property
       Reg	Regular	
       IR1	Slightly irregular
       IR2	Moderately Irregular
       IR3	Irregular

MSSubClass: Type of dwelling involved in the sale

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: General zoning classification of the sale
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM	Residential Medium Density

MasVnrType: Masonry veneer type

       BrkCmn	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       None	None
       Stone	Stone

MoSold: Month sold

Neighborhood: Physical locations within Ames city limits

       Blmngtn	Bloomington Heights
       Blueste	Bluestem
       BrDale	Briardale
       BrkSide	Brookside
       ClearCr	Clear Creek
       CollgCr	College Creek
       Crawfor	Crawford
       Edwards	Edwards
       Gilbert	Gilbert
       IDOTRR	Iowa DOT and Rail Road
       MeadowV	Meadow Village
       Mitchel	Mitchell
       Names	North Ames
       NoRidge	Northridge
       NPkVill	Northpark Villa
       NridgHt	Northridge Heights
       NWAmes	Northwest Ames
       OldTown	Old Town
       SWISU	South & West of Iowa State University
       Sawyer	Sawyer
       SawyerW	Sawyer West
       Somerst	Somerset
       StoneBr	Stone Brook
       Timber	Timberland
       Veenker	Veenker
			
OverallCond: Overall condition of the house

       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average	
       5	Average
       4	Below Average	
       3	Fair
       2	Poor
       1	Very Poor

OverallQual: Overall material and finish of the house

       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average
       5	Average
       4	Below Average
       3	Fair
       2	Poor
       1	Very Poor

PavedDrive: Paved driveway

       Y	Paved 
       P	Partial Pavement
       N	Dirt/Gravel

RoofMatl: Roof material

       ClyTile	Clay or Tile
       CompShg	Standard (Composite) Shingle
       Membran	Membrane
       Metal	Metal
       Roll	Roll
       Tar&Grv	Gravel & Tar
       WdShake	Wood Shakes
       WdShngl	Wood Shingles

RoofStyle: Type of roof

       Flat	Flat
       Gable	Gable
       Gambrel	Gambrel (Barn)
       Hip	Hip
       Mansard	Mansard
       Shed	Shed

SalePrice: Sale price of the house

SaleCondition: Condition of sale

       Normal	Normal Sale
       Abnorml	Abnormal Sale -  trade, foreclosure, short sale
       AdjLand	Adjoining Land Purchase
       Alloca	Allocation - two linked properties with separate deeds, typically condo with a garage unit	
       Family	Sale between family members
       Partial	Home was not completed when last assessed (associated with New Homes)

SaleType: Type of sale
		
       WD 	Warranty Deed - Conventional
       CWD	Warranty Deed - Cash
       VWD	Warranty Deed - VA Loan
       New	Home just constructed and sold
       COD	Court Officer Deed/Estate
       Con	Contract 15% Down payment regular terms
       ConLw	Contract Low Down payment and low interest
       ConLI	Contract Low Interest
       ConLD	Contract Low Down
       Oth	Other
	
Street: Type of road access to the property

       Grvl	Gravel	
       Pave	Paved
       	
TotRmsAbvGrd: Total rooms above grade (bathrooms not included)

Utilities: Type of utilities available
		
       AllPub	All public Utilities (E,G,W,& S)	
       NoSewr	Electricity, Gas, and Water (Septic Tank)
       NoSeWa	Electricity and Gas Only
       ELO	Electricity only	
	
YearBuilt: Year built

YearRemod/Add: Remodel/addition year (same as the year built if there was no remodeling or addition)

YrSold: Year sold (YYYY)

```

In [37]:
# Load the dataset
import pandas as pd
homes = pd.read_csv('https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/ames_home/ames_home_subset.csv')
assert homes.shape == (2904, 47)

## 2.1 Compute the baseline metrics

Compute the $MAE$ (Mean Absolute Error) and $R^2$ score of the mean baseline. (Compute them on the full dataset, without splitting it.)

In [38]:
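# Select the feature with the largest absolute correlation with SalePrice

# numeric_only=True restricts the correlation to numeric columns
# (required on pandas >= 2.0; the argument itself needs pandas >= 1.5).
homes_corr = homes.corr(numeric_only=True)
corr_price = pd.DataFrame(abs(homes_corr['SalePrice'])).drop(['SalePrice'])
print(corr_price.idxmax())

# Mean baseline: predict the average SalePrice for every house
predict = homes['SalePrice'].mean()
errors = predict - homes['SalePrice']
MAE = errors.abs().mean()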

SalePrice    OverallQual
dtype: object


In [39]:
# Least-squares slope (a) and intercept (b) for simple linear regression

def RSS(x, y):
  x_mean = x.mean()
  y_mean = y.mean()
  S_xy = 0
  S_xx = 0

  for i, j in zip(x, y):
    A = (i-x_mean) * (j-y_mean)
    B = (i-x_mean) ** 2

    S_xy += A
    S_xx += B

  a = S_xy/S_xx 
  b = y_mean - a*x_mean
  return a, b

In [40]:
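# get MAE, R2 value
# Note: MAE is the error of the mean baseline, while R2 is computed for a
# one-feature regression on OverallQual (the mean baseline itself has R2 = 0).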

x = homes['OverallQual']
y = homes['SalePrice']
a, b = RSS(x,y)

x_test = x
y_pred = a * x_test + b

r2 = 1 - ((y - y_pred)**2).sum() / ((y - predict)**2).sum()

print(f"MAE : {MAE:.4f}")
print(f"R2 : {r2:.4f}")

MAE : 58149.9277
R2 : 0.6386
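
For reference, the mean baseline's own metrics can be worked out by hand. The sketch below uses a short, hypothetical list of prices (not the Ames data) to show that predicting the mean always yields $R^2 = 0$, since the residual sum of squares then equals the total sum of squares.

```python
# Hypothetical sale prices, used only to illustrate the arithmetic.
prices = [100_000, 150_000, 200_000, 250_000, 300_000]

# Mean baseline: the same prediction (the average price) for every house.
mean_price = sum(prices) / len(prices)

# MAE: average absolute distance between each price and the prediction.
baseline_mae = sum(abs(p - mean_price) for p in prices) / len(prices)

# R^2 = 1 - SS_res / SS_tot. For the mean baseline both sums are identical.
ss_res = sum((p - mean_price) ** 2 for p in prices)
ss_tot = sum((p - mean_price) ** 2 for p in prices)
baseline_r2 = 1 - ss_res / ss_tot

print(f"MAE : {baseline_mae:.1f}")
print(f"R2  : {baseline_r2:.1f}")
```

Any model with a positive $R^2$ therefore explains more variance than the mean baseline does.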


## 2.2 Split the dataset into train/validation/test sets using the criteria below

- **Train**: houses sold from 2006 through 2008 (1,920)

- **Validation**: houses sold in 2009 (644)

- **Test**: houses sold in 2010 (340)

In [41]:
'''
import pandas_profiling

profile = homes.profile_report()
profile
'''

## Use the profile report to keep only features whose association with SalePrice is above 0

In [42]:
# Split the data by sale year

train = homes[homes['YrSold'] < 2009]
validation = homes[homes['YrSold'] == 2009]
test = homes[homes['YrSold'] > 2009]

train.shape, validation.shape, test.shape

((1920, 47), (644, 47), (340, 47))

## 2.3 Split each of the train / validation / test sets into features (X) and target (y)

Include at least one numeric feature and at least one categorical feature.  
As long as that condition is met, there is no restriction on choosing additional features.

In [44]:
target = 'SalePrice'
features = ['1stFlrSF', 'BsmtFullBath','FullBath', 'GrLivArea', 'HalfBath','LotArea', 
            'OverallQual', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemod/Add']

X_train = train[features]
y_train = train[target]

X_val = validation[features]
y_val = validation[target]

X_test = test[features]
y_test = test[target]

## 2.4 Apply one-hot encoding to the categorical features

In [45]:
X_train_cat = X_train.loc[:,['BsmtFullBath','FullBath','HalfBath']]
X_train = X_train.drop(['BsmtFullBath','FullBath','HalfBath'], axis = 1)

X_val_cat = X_val.loc[:, ['BsmtFullBath','FullBath','HalfBath']]
X_val = X_val.drop(['BsmtFullBath','FullBath','HalfBath'], axis = 1)

X_test_cat = X_test.loc[:, ['BsmtFullBath','FullBath','HalfBath']]
X_test = X_test.drop(['BsmtFullBath','FullBath','HalfBath'], axis = 1)

In [46]:
# run onehotencoder 

from category_encoders import OneHotEncoder

X_train_cat = X_train_cat.astype(str)
X_val_cat = X_val_cat.astype(str)
X_test_cat = X_test_cat.astype(str)

encoder = OneHotEncoder()
X_train_encoder = encoder.fit_transform(X_train_cat, y_train)
X_val_encoder = encoder.transform(X_val_cat, y_val)
X_test_encoder = encoder.transform(X_test_cat, y_test)



In [47]:
# Concatenate the numeric features with the one-hot encoded features
X_train = pd.concat([X_train, X_train_encoder], axis = 1)
X_val = pd.concat([X_val, X_val_encoder], axis = 1)
X_test = pd.concat([X_test, X_test_encoder], axis = 1)
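
One detail of this step is worth a small illustration: the encoder is fitted on the training split only, so the validation or test split may contain a category it has never seen. The sketch below swaps in scikit-learn's `OneHotEncoder` (not the `category_encoders` one used above) with `handle_unknown="ignore"`; the tiny arrays are hypothetical bathroom counts stored as strings.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column (e.g. a bathroom count cast to str).
train_cat = np.array([["0"], ["1"], ["2"], ["1"]])
val_cat = np.array([["1"], ["3"]])  # "3" never appears in the training split

# Fit on the training split only, then reuse the fitted encoder elsewhere.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train_cat).toarray()
val_encoded = encoder.transform(val_cat).toarray()

print(encoder.categories_)  # categories learned from the training split
print(val_encoded)          # the unseen "3" becomes an all-zero row
```

Fitting on train and only transforming validation and test keeps the column layout identical across splits, which is what makes the later `pd.concat` calls line up.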

## 2.5 Build and fit a linear regression or ridge regression model with scikit-learn

In [48]:
# Select the top 10 features

from sklearn.feature_selection import f_regression, SelectKBest

selector = SelectKBest(score_func  = f_regression, k = 10)
X_train_selected = selector.fit_transform(X_train, y_train)

all_names = X_train.columns
selected_mask  = selector.get_support()
selected_names = all_names[selected_mask]
X_train = X_train[selected_names]
X_val = X_val[selected_names]
X_test = X_test[selected_names]

In [49]:
# ridgecv

from sklearn.linear_model import RidgeCV
import numpy as np

alphas = np.arange(0.0001, 0.01, 0.0005)

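# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2.
# On newer versions, drop the argument and scale the features beforehand.
ridge = RidgeCV(alphas = alphas, normalize = True, cv=5)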
ridge.fit(X_train, y_train)
print("alpha : ", ridge.alpha_)
print("best score : ", ridge.best_score_)

alpha :  0.0041
best score :  0.7642015804308059
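
On scikit-learn versions where `normalize` is no longer available, one possible replacement is to scale inside a pipeline. The sketch below is an assumption about how that could look, run on synthetic data rather than the Ames features; `StandardScaler` is not numerically identical to the old `normalize=True`, so the selected alpha is not directly comparable with the value above.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y = X @ np.array([3.0, 0.05, 200.0]) + rng.normal(scale=0.1, size=200)

# Candidate regularization strengths (hypothetical grid).
alphas = [0.01, 0.1, 1.0, 10.0]

# Scaling happens inside the pipeline, so it is re-fitted on whatever
# data the pipeline is fitted on and never sees the evaluation data.
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
model.fit(X, y)

best_alpha = model.named_steps["ridgecv"].alpha_
train_r2 = model.score(X, y)
print("alpha :", best_alpha)
print(f"train R2 : {train_r2:.4f}")
```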


## 2.6 MAE and $R^2$ on the validation dataset
Compute the MAE and $R^2$ score of the model's predictions on the validation dataset. (Whether the validation score is high or low does not affect grading.)

In [50]:
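# get MAE, R2
# Note: the section asks for validation metrics, but this cell scores the
# test split (X_test, y_test); pass X_val and y_val to score the validation split.

from sklearn.metrics import  mean_absolute_error, r2_score

y_pred = ridge.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)  # separate name so the r2_score function is not shadowed

print(f"MAE : {mae:.4f}")
print(f"R2 : {r2:.4f}")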

MAE : 23623.7859
R2 : 0.7920


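# Advanced Goals: to earn 3 points, all of the conditions below must be met.
### Complete the tasks below using the dataset from Part 2.
- Use visualization libraries to produce at least two visualizations showing the relationship between features and the target.
- Try at least three feature combinations. There is no restriction on how you choose the combinations.
- Compute the validation-set MAE and $R^2$ for each feature combination you tried.
- After settling on a final model, compute the test-set MAE and $R^2$.
- Print or visualize the regression coefficients of the features used in the final model.

# This part is unfinished.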

In [51]:
# Load the dataset
import pandas as pd
df = pd.read_csv('https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/ames_home/ames_home_subset.csv')
assert df.shape == (2904, 47)

In [52]:
target = 'SalePrice'
# Columns to keep: the target plus the candidate features.
columns = ['SalePrice','1stFlrSF', 'BsmtFullBath','BsmtHalfBath','FullBath', 'GrLivArea', 'HalfBath','LotArea', 
            'OverallQual', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemod/Add','OverallCond']

df = df.loc[:, columns]

In [53]:
### Total number of bathrooms
df['TotalBath'] = df.loc[:,['BsmtFullBath','BsmtHalfBath','FullBath','HalfBath']].sum(axis = 1)

### (2021 - YearRemod/Add) / (2021 - YearBuilt) -> equals 1 when the house was never remodeled; smaller values mean a more recent remodel
df['built_to_Remod'] = (2021 - df['YearRemod/Add'])/(2021-df['YearBuilt'])

### Qual + Cond
df['Total_Overall'] = df['OverallCond'] + df['OverallQual']

In [54]:
df_cat = df.loc[:,['BsmtFullBath','BsmtHalfBath','FullBath','HalfBath']]
df = df.drop(['BsmtFullBath','BsmtHalfBath','FullBath','HalfBath'], axis=1)

df_cat = df_cat.astype(str)

encoder = OneHotEncoder()
df_cat_encoder = encoder.fit_transform(df_cat, df[target])

df = pd.concat([df,df_cat_encoder],axis =1)

df.head()



Unnamed: 0,SalePrice,1stFlrSF,GrLivArea,LotArea,OverallQual,TotRmsAbvGrd,YearBuilt,YearRemod/Add,OverallCond,TotalBath,built_to_Remod,Total_Overall,BsmtFullBath_1,BsmtFullBath_2,BsmtFullBath_3,BsmtFullBath_4,BsmtHalfBath_1,BsmtHalfBath_2,BsmtHalfBath_3,FullBath_1,FullBath_2,FullBath_3,FullBath_4,FullBath_5,HalfBath_1,HalfBath_2,HalfBath_3
0,215000,1656,1656,31770,6,7,1960,1960,5,2.0,1.0,11,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0
1,105000,896,896,11622,5,5,1961,1961,6,1.0,1.0,11,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0
2,172000,1329,1329,14267,6,6,1958,1958,6,2.0,1.0,12,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0
3,244000,2110,2110,11160,7,8,1968,1968,5,4.0,1.0,12,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0
4,189900,928,1629,13830,5,6,1997,1998,5,3.0,0.958333,10,0,1,0,0,1,0,0,0,1,0,0,0,0,1,0


In [55]:
df_target = df[target]
df = df.drop(target, axis = 1)
X_train, X_test, y_train, y_test = train_test_split(df, df_target, train_size=0.8, random_state= 2)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.8, random_state= 2)

X_train.shape, X_val.shape, X_test.shape

((1858, 26), (465, 26), (581, 26))

In [56]:
from sklearn.feature_selection import f_regression, SelectKBest

selector = SelectKBest(score_func  = f_regression, k = 20)
X_train_selected = selector.fit_transform(X_train, y_train)

all_names = X_train.columns
selected_mask  = selector.get_support()
selected_names = all_names[selected_mask]
X_train = X_train[selected_names]
X_val = X_val[selected_names]
X_test = X_test[selected_names]

In [57]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

## run StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train, y_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
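
Since the notebook stops here, the following is only a sketch of how the remaining steps (fit a regression model on the scaled features, score it on the validation split, and print the coefficients) might look. It runs on synthetic data; the feature names are borrowed from the notebook purely as labels, and the choice of `Ridge(alpha=1.0)` is an assumption, not the author's final model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the selected features (values are not Ames data).
rng = np.random.default_rng(42)
feature_names = ["GrLivArea", "OverallQual", "TotalBath"]
X = rng.normal(size=(300, 3))
y = (50_000 * X[:, 0] + 30_000 * X[:, 1] + 5_000 * X[:, 2]
     + rng.normal(scale=1_000, size=300))

# Simple hold-out split: first 240 rows for training, the rest for validation.
X_train, X_val = X[:240], X[240:]
y_train, y_val = y[:240], y[240:]

# Fit the scaler on the training split only, then reuse it for validation.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# A regressor (not LogisticRegression) is needed for a continuous target.
model = Ridge(alpha=1.0)
model.fit(X_train_scaled, y_train)

val_pred = model.predict(X_val_scaled)
val_mae = mean_absolute_error(y_val, val_pred)
val_r2 = r2_score(y_val, val_pred)
print(f"validation MAE : {val_mae:.1f}")
print(f"validation R2  : {val_r2:.4f}")

# Coefficients on standardized features, largest magnitude first.
for name, coef in sorted(zip(feature_names, model.coef_), key=lambda pair: -abs(pair[1])):
    print(f"{name:>12}: {coef:,.1f}")
```

Because the features are standardized, the coefficient magnitudes are comparable with one another, which makes a sorted printout a reasonable first look before plotting them.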