# Factors affecting housing prices in the Boston area (USA), based on data published in 1978

## Original warning text describing the ethical issue
```
Warning The Boston housing prices dataset has an ethical problem: as 
investigated in [1], the authors of this dataset engineered a noninvertible
variable “B” assuming that racial self-segregation had a 
positive impact on house prices [2]. Furthermore the goal of the 
research that led to the creation of this dataset was to study the 
impact of air quality but it did not give adequate demonstration of 
the validity of this assumption.
The scikit-learn maintainers therefore strongly discourage the use of this 
dataset unless the purpose of the code is to study and educate about 
ethical issues in data science and machine learning.
In this special case, you can fetch the dataset from the original source:
```

In [None]:
!pip install scikit-learn

In [None]:
import pandas as pd
import numpy as np

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None) # \s+ : one or more whitespace characters
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

combined_data = np.hstack([data, target.reshape(-1, 1)])

boston_df = pd.DataFrame(combined_data)
column_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "PRICE"]

boston_df.columns = column_names
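
The raw file stores each record across two physical lines, which is why the cell above interleaves even and odd rows. Below is a minimal sketch of the same reshaping on a small synthetic array. The values and the 4-column width are made up for illustration, and `raw`, `features`, and `target` are hypothetical names, not part of the notebook.

```python
import numpy as np

# Synthetic stand-in for raw_df.values: each record spans two rows,
# and the second row is padded with NaN (illustrative values only).
raw = np.array([
    [1.0, 2.0, 3.0, 4.0],        # record 0, first line
    [5.0, 6.0, 24.0, np.nan],    # record 0, second line: 2 features + target
    [10.0, 20.0, 30.0, 40.0],    # record 1, first line
    [50.0, 60.0, 21.6, np.nan],  # record 1, second line
])

# Even rows carry the first block of features; the first two values
# of each odd row carry the remaining features.
features = np.hstack([raw[::2, :], raw[1::2, :2]])
# The third value on each odd row is the target.
target = raw[1::2, 2]

print(features.shape)  # (2, 6)
print(target)
```

In the real file the first line of each record holds 11 values, so the same slicing yields 13 feature columns plus the target.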

In [None]:
raw_df

In [None]:
print('Boston housing price dataset shape: ', boston_df.shape)
boston_df.info()

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Split into features (X) and target (Y)
Y = boston_df['PRICE']
X = boston_df.drop(['PRICE'], axis = 1, inplace = False)

In [None]:
# Split into training and test sets
# train_test_split(arrays, test_size, train_size, random_state, shuffle, stratify)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 156)
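
As a rough illustration of what `test_size` and `random_state` do, the sketch below splits a small synthetic array. The arrays `X_demo` and `y_demo` are made up for this example and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 10 samples, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=156
)
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)

# Repeating the call with the same random_state reproduces the same split.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=156
)
print(np.array_equal(X_te, X_te_again))  # True
```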

In [None]:
# Linear regression: create the model
lr = LinearRegression()

In [None]:
# Linear regression: train the model
lr.fit(X_train, Y_train)

In [None]:
# Linear regression: predict on the test data -> obtain the predictions Y_predict
Y_predict = lr.predict(X_test)

In [None]:
mse = mean_squared_error(Y_test, Y_predict)
rmse = np.sqrt(mse)
print('MSE : {0:.3f}, RMSE : {1:.3f}'.format(mse, rmse))
print('R^2(Variance score) : {0:.3f}'.format(r2_score(Y_test, Y_predict)))
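
To make the metrics concrete, here is a small sketch that computes MSE, RMSE, and R² by hand and checks them against scikit-learn. The arrays `y_true` and `y_pred` are made-up values, not results from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative values only; not taken from the housing data.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_pred
mse_manual = np.mean(residuals ** 2)            # mean of squared errors
rmse_manual = np.sqrt(mse_manual)               # same units as the target
ss_res = np.sum(residuals ** 2)                 # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot                 # 1 - SS_res / SS_tot

# The hand-computed values should agree with scikit-learn's.
print(np.isclose(mse_manual, mean_squared_error(y_true, y_pred)))  # True
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))             # True
```

RMSE is often easier to interpret than MSE because it is expressed in the same units as the target variable.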

In [None]:
print('Intercept: ', lr.intercept_)
print('Regression coefficients: ', np.round(lr.coef_, 1))

In [None]:
coef = pd.Series(data = np.round(lr.coef_, 2), index = X.columns)
coef.sort_values(ascending = False)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
fig, axs = plt.subplots(figsize = (16, 16), ncols = 3, nrows = 5)
x_features = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

for i, feature in enumerate(x_features):
    row = i // 3
    col = i % 3
    sns.regplot(x = feature, y = 'PRICE', data = boston_df, ax = axs[row][col])

In [None]:
plt.show()

# Machine-learning regression analysis on automobile fuel-economy data
## Identify the factors that affect fuel economy and predict a car's fuel economy from them
[UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/9/auto+mpg)

In [None]:
import numpy as np
import pandas as pd
data_df = pd.read_csv('./dataSet/auto-mpg.csv', header = 0, engine = 'python')

In [None]:
print('Dataset shape: ', data_df.shape)
data_df.head()

In [None]:
data_df.info()

In [None]:
# data_df = data_df.drop(['origin', 'horsepower'], axis = 1, inplace = False)

data_df = data_df.drop(['origin'], axis = 1, inplace = False)
# Missing horsepower values appear as '?': convert them to NaN, then fill with the median of the valid values
horsepower = pd.to_numeric(data_df['horsepower'], errors = 'coerce')
data_df['horsepower2'] = horsepower.fillna(horsepower.median()).astype('float64')
data_df = data_df.drop(['horsepower'], axis = 1, inplace = False)
data_df.info()
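
One way to handle the `'?'` placeholder in `horsepower` is to coerce the column to numeric, so that unparseable entries become NaN, and then fill them with the median of the valid values. The sketch below shows this on a small made-up Series. The name `hp` and its values are illustrative only.

```python
import pandas as pd

# Illustrative values; '?' marks a missing entry.
hp = pd.Series(['130', '165', '?', '150', '?'])

hp_num = pd.to_numeric(hp, errors='coerce')   # '?' becomes NaN
hp_filled = hp_num.fillna(hp_num.median())    # median skips NaN by default

print(hp_num.isna().sum())   # 2
print(hp_filled.tolist())    # [130.0, 165.0, 150.0, 150.0, 150.0]
```

Computing the median after coercion keeps the placeholder entries from distorting it, which would happen if they were first replaced by 0.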

In [None]:
print('Dataset shape: ', data_df.shape)
data_df.head()

In [None]:
data_df.info()

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Split into features (X) and target (Y)
Y = data_df['mpg']
X = data_df.drop(['mpg'], axis = 1, inplace = False)

In [None]:
# Split into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0)

In [None]:
# Linear regression: create the model
lr = LinearRegression()

In [None]:
# Linear regression: train the model
lr.fit(X_train, Y_train)

In [None]:
# Linear regression: predict on the test data -> obtain the predictions Y_predict
Y_predict = lr.predict(X_test)

In [None]:
mse = mean_squared_error(Y_test, Y_predict)
rmse = np.sqrt(mse)
print('MSE : {0:.3f}, RMSE : {1:.3f}'.format(mse, rmse))
print('R^2(Variance score) : {0:.3f}'.format(r2_score(Y_test, Y_predict)))

In [None]:
print('Intercept: ', np.round(lr.intercept_, 2))
print('Regression coefficients: ', np.round(lr.coef_, 2))

In [None]:
coef = pd.Series(data = np.round(lr.coef_, 2), index = X.columns)
coef.sort_values(ascending = False)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
fig, axs = plt.subplots(figsize = (16, 16), ncols = 3, nrows = 2)
x_features = ['model_year', 'acceleration', 'displacement', 'weight', 'cylinders']
plot_color = ['r', 'b', 'y', 'g', 'r']
for i, feature in enumerate(x_features):
    row = i // 3
    col = i % 3
    sns.regplot(x = feature, y = 'mpg', data = data_df, ax = axs[row][col], color = plot_color[i])

In [None]:
plt.show()

In [None]:
print("Enter the details of the car whose fuel economy you want to predict.")
cylinders_1 = int(input("cylinders : "))
displacement_1 = float(input("displacement : "))  # float so decimal values are accepted
weight_1 = float(input("weight : "))
acceleration_1 = float(input("acceleration : "))
model_year_1 = int(input("model_year : "))
horsepower_2 = float(input("horsepower : "))

In [None]:
# mpg_predict = lr.predict([[cylinders_1, displacement_1, weight_1, acceleration_1 , model_year_1, horsepower_2]])

In [None]:
X_new = pd.DataFrame([{
    "cylinders": cylinders_1,
    "displacement": displacement_1,
    "weight": weight_1,
    "acceleration": acceleration_1,
    "model_year": model_year_1,
    "horsepower2": horsepower_2
}])
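
When predicting from a DataFrame, the column names and their order should match what the model saw during training. The sketch below uses a tiny made-up training frame to show one way to realign a new row before calling `predict`. The names `X_demo`, `y_demo`, `model`, and `new_row` are hypothetical and not part of the notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny illustrative training frame (made-up numbers).
X_demo = pd.DataFrame({
    "weight": [2000.0, 3000.0, 4000.0, 3500.0],
    "model_year": [70.0, 76.0, 80.0, 82.0],
})
y_demo = pd.Series([30.0, 22.0, 16.0, 20.0])
model = LinearRegression().fit(X_demo, y_demo)

# A new row whose keys are in a different order than the training columns.
new_row = pd.DataFrame([{"model_year": 78.0, "weight": 2500.0}])

# Reorder to the training column order before predicting.
new_row = new_row[X_demo.columns]
print(list(new_row.columns))         # ['weight', 'model_year']
print(model.predict(new_row).shape)  # (1,)
```

In this notebook the equivalent step would be `X_new[X.columns]`, assuming `X` is still the training feature frame.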

In [None]:
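import numpy as np

def extract_value(pred):
    # Return pred unchanged if it is not a list or array
    if not isinstance(pred, (list, np.ndarray)):
        return pred

    # Flatten to a 1-D array so lists and arrays are handled the same way
    pred = np.asarray(pred).ravel()

    # Raise an error if the result is empty
    if pred.size == 0:
        raise ValueError("The prediction result is empty.")

    # Return the first element as a Python scalar
    # (the only element when there is exactly one prediction)
    return pred[0].item()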

In [None]:
mpg_val = None  # stays None if prediction fails
try:
    mpg_predict = lr.predict(X_new)

    mpg_val = extract_value(mpg_predict)

except Exception as e:
    print("An error occurred during prediction:", e)


In [None]:
# print("The predicted fuel economy (MPG) of this car is %.2f." % mpg_predict)
if mpg_val is not None:
    print(f"The predicted fuel economy (MPG) of this car is {mpg_val:.2f}.")