정원혁 William, 2020, D+
# Overall Outline

1. Read Data

2. Independent variable / Dependent variable: Ranking / ......
    - Remove unnecessary columns
    - (Create additional necessary columns)
3. Handle NA values
4. Handle outliers
5. Dummy encoding
    - Convert categorical / numerical variables using astype()
    - For numerical variables: scale transformation
        - min / max scale
        - mean / std scale
6. Split into train / test sets
7. Declare model
8. Train model
9. Predict
10. Performance
    - MSE, R-square
    - Accuracy



# 1. Read Data

In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Read Data
data = pd.read_csv('https://raw.githubusercontent.com/blackdew/tensorflow1/master/csv/boston.csv')

data

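|  | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 393.45 | 6.48 | 22.0 |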


# 2. Independent variable / Dependent variable: Ranking / ......

- Remove unnecessary columns
- (Create additional necessary columns)

In [11]:
# X: independent variables (every column except the target); y: dependent variable (medv)
X = data.drop('medv', axis=1)
y = data['medv']
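
The two bullets for this step (removing and creating columns) are not exercised in the cell above. The following is a minimal sketch of what they could look like, using a made-up frame: the column names `id`, `rooms`, `area`, `price` and the derived `area_per_room` are hypothetical and do not come from the Boston data.

```python
import pandas as pd

# Toy frame; all column names here are hypothetical.
toy = pd.DataFrame({
    'id': [101, 102, 103],            # identifier with no predictive meaning
    'rooms': [3, 4, 2],
    'area': [60.0, 100.0, 30.0],
    'price': [200.0, 310.0, 120.0],
})

# Remove an unnecessary column.
toy = toy.drop(columns=['id'])

# Create an additional column derived from existing ones.
toy['area_per_room'] = toy['area'] / toy['rooms']

# Separate the independent variables from the dependent variable.
X_toy = toy.drop(columns=['price'])
y_toy = toy['price']
```

The derived column is created before the features and target are separated, so it ends up in `X_toy` automatically.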

# 3. Handle NA values

In [12]:
# Optional: mean imputation. Uncomment if X contains missing values.
# imputer = SimpleImputer(strategy='mean')
# X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
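
Since the imputation code above is commented out, here is a small sketch of how one might first check for missing values and then apply the same `SimpleImputer(strategy='mean')` idea. The frame `X_na` and its columns are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with missing values; the columns are hypothetical.
X_na = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [10.0, 20.0, np.nan]})

# Count missing values per column to decide whether imputation is needed.
na_counts = X_na.isna().sum()

# Replace each NaN with the mean of its column.
toy_imputer = SimpleImputer(strategy='mean')
X_filled = pd.DataFrame(toy_imputer.fit_transform(X_na), columns=X_na.columns)
```

If `na_counts` is zero for every column, the imputation step can be skipped.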

# 5. Dummy encoding / Scaling

- Convert categorical / numerical variables using astype()
- For numerical variables: scale transformation
    - min / max scale
    - mean / std scale

In [13]:
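# Min / max scale: map every feature to the [0, 1] range.
# Note: the scaler is fit on all rows, before the train / test split.
scaler = MinMaxScaler()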
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
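
The cell above only applies the min / max scale. The other items in this step (`astype()` conversion, dummy encoding, mean / std scale) could be sketched as follows. The frame `toy_cat` and its columns `zone` and `size` are hypothetical, and `StandardScaler` is one way to do mean / std scaling, not something this notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame; the columns are hypothetical.
toy_cat = pd.DataFrame({
    'zone': [1, 2, 1, 3],               # integer codes that are really categories
    'size': [10.0, 20.0, 30.0, 40.0],   # genuine numerical variable
})

# Mark the coded column as categorical so it is not treated as a number.
toy_cat['zone'] = toy_cat['zone'].astype('category')

# Dummy encoding: one indicator column per category.
encoded = pd.get_dummies(toy_cat, columns=['zone'])

# Mean / std scale: subtract the column mean, divide by the standard deviation.
encoded['size'] = StandardScaler().fit_transform(encoded[['size']]).ravel()
```

After this, `encoded` holds the scaled `size` column followed by `zone_1`, `zone_2` and `zone_3`.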

# 6. Split into train / test sets


In [14]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
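
In this notebook the scaler in step 5 is fit on the full data set before the split, so the test rows contribute to the min and max used for training. A common alternative, shown here only as a sketch on synthetic data, is to split first and fit the scaler on the training rows alone. All names below (`X_demo`, `demo_scaler`, and so on) are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=['f1', 'f2', 'f3'])
y_demo = pd.Series(rng.normal(size=50))

# Split first ...
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# ... then learn min / max from the training rows only and reuse them on the test rows.
demo_scaler = MinMaxScaler()
Xd_train_scaled = pd.DataFrame(
    demo_scaler.fit_transform(Xd_train), columns=Xd_train.columns, index=Xd_train.index)
Xd_test_scaled = pd.DataFrame(
    demo_scaler.transform(Xd_test), columns=Xd_test.columns, index=Xd_test.index)
```

With this ordering, scaled test values can fall slightly outside [0, 1], which is expected. For tree-based models such as the ones used below, the choice usually makes little practical difference, because their splits depend on the ordering of feature values rather than their scale.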

# 7. Declare model


In [15]:
models = {
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'LightGBM': LGBMRegressor(random_state=42)
}

# 8. Train model

In [16]:
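# Fit each model, predict on the test set, and store its metrics and predictions
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)   # mean squared error (lower is better)
    r2 = r2_score(y_test, y_pred)              # R-square (closer to 1 is better)
    results[name] = {'MSE': mse, 'R2': r2, 'Predictions': y_pred}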

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000323 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1027
[LightGBM] [Info] Number of data points in the train set: 404, number of used features: 13
[LightGBM] [Info] Start training from score 22.796535


# 9. Predict


In [18]:
# Put the actual test values next to each model's predictions
comparison_df = pd.DataFrame({'Actual': y_test})
for name, metrics in results.items():
    comparison_df[name] = metrics['Predictions']

print("\nComparison of Actual vs Predicted values:")
print(comparison_df.head())


Comparison of Actual vs Predicted values:
     Actual  Decision Tree  Random Forest  Gradient Boosting   LightGBM
173    23.6           28.1         22.839          23.449761  25.527304
274    32.4           33.1         30.676          31.461360  34.871147
491    13.6           17.3         16.300          17.705313  14.271260
72     22.8           22.0         23.510          24.022573  23.810942
452    16.1           23.2         16.819          17.681144  16.504006


# 10. Performance

- MSE, R-square
- Accuracy


In [19]:
# One row per model, with its test-set MSE and R-square
performance_df = pd.DataFrame({
    'Model': list(results.keys()),
    'MSE': [metrics['MSE'] for metrics in results.values()],
    'R2': [metrics['R2'] for metrics in results.values()]
})

print("\nModel Performance Comparison:")
print(performance_df.sort_values('MSE'))


Model Performance Comparison:
               Model        MSE        R2
2  Gradient Boosting   6.209171  0.915330
1      Random Forest   7.909854  0.892139
3           LightGBM   8.625874  0.882375
0      Decision Tree  10.416078  0.857963
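
To make the two reported metrics concrete, the sketch below computes MSE and R-square by hand and checks them against `mean_squared_error` and `r2_score`. The arrays `y_true` and `y_hat` are small made-up values, not results from this notebook. Accuracy, the other item listed for this step, is a classification metric, so it does not apply to the regression task here.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Small made-up example; these numbers are not from the notebook's results.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

# MSE: mean of the squared residuals.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-square: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE is in the same units as the target, which is often easier to read.
rmse = np.sqrt(mse_manual)

# The manual values agree with scikit-learn's implementations.
assert np.isclose(mse_manual, mean_squared_error(y_true, y_hat))
assert np.isclose(r2_manual, r2_score(y_true, y_hat))
```

For this toy input the MSE is 0.375 and R-square is 0.925.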
