### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #8

We will train an XGBoost regressor on the monitor dataset and deploy it on a GitHub website.

In [None]:
import pandas as pd
import sklearn
from sklearn.compose import ColumnTransformer
# With a Pipeline, the preprocessing steps (scaler, encoder) are saved together with the model
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

import xgboost
from xgboost import XGBRegressor

import joblib

We will need the version numbers of `scikit-learn` and `xgboost` when setting up the Hugging Face Space.

In [None]:
# Check the installed versions: the scikit-learn and xgboost versions here must match those on Hugging Face
print("Scikit-learn's version:", sklearn.__version__)
print("xgboost's version:", xgboost.__version__)

Scikit-learn's version: 1.3.2
xgboost's version: 2.1.1
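
The two version numbers printed above are the ones the deployment environment has to match. One way to carry them over is a pinned requirements file. The sketch below assumes the space installs its dependencies from a `requirements.txt` (this lab does not state that explicitly), and the helper name `format_requirements` is hypothetical.

```python
def format_requirements(versions):
    """Turn {"package": "version"} into pinned 'package==version' lines."""
    return "\n".join(f"{name}=={version}" for name, version in versions.items()) + "\n"


# Versions copied from the output above; in the notebook you could pass
# sklearn.__version__ and xgboost.__version__ instead of the literals.
pins = format_requirements({"scikit-learn": "1.3.2", "xgboost": "2.1.1"})

# Assumed file name: requirements.txt
with open("requirements.txt", "w", encoding="utf-8") as f:
    f.write(pins)

print(pins, end="")
```

Pinning exact versions matters here because a pipeline saved with `joblib` is only reliably loadable with the library versions that produced it.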


First, download the monitor price data, which was collected from Amazon ([source](https://www.kaggle.com/datasets/durjoychandrapaul/amazon-products-sales-monitor-dataset)).

In [None]:
!wget http://www.donlapark.cmustat.com/229352/monitors.csv

--2024-08-29 16:08:09--  http://www.donlapark.cmustat.com/229352/monitors.csv
Resolving www.donlapark.cmustat.com (www.donlapark.cmustat.com)... 150.107.31.67
Connecting to www.donlapark.cmustat.com (www.donlapark.cmustat.com)|150.107.31.67|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 187214 (183K) [text/csv]
Saving to: ‘monitors.csv’


2024-08-29 16:08:10 (531 KB/s) - ‘monitors.csv’ saved [187214/187214]



In [None]:
# This dataset has already been cleaned by the instructor
data = pd.read_csv('monitors.csv')
# As a rule of thumb, the more pixels and the wider the screen, the higher the price
data.head()

Unnamed: 0,Title,Brand,Screen Size,Resolution (Width),Resolution (Height),Price
0,"acer SB240Y G0bi 23.8"" IPS Full HD Ultra-Slim ...",acer,23.8,1920,1080,3872.4
1,"acer Nitro 31.5"" FHD 1920 x 1080 1500R Curved ...",acer,31.5,1920,1080,10598.4
2,"Acer SB272 EBI 27"" Full HD (1920 x 1080) IPS Z...",acer,27.0,1920,1080,4076.4
3,"Sceptre 30-inch Curved Gaming Monitor 21,9 256...",Sceptre,30.0,2560,1080,8151.6
4,"SAMSUNG 32"" UJ59 Series 4K UHD (3840x2160) Com...",SAMSUNG,31.5,3840,2160,11413.2


The four features used to predict a monitor's price are Brand, Screen Size, Resolution (Width), and Resolution (Height); Price is the target.

In [None]:
y_train = data["Price"]
X_train = data.drop(["Title", "Price"], axis=1)

# Names of numerical (quantitative) features
num_col = X_train.select_dtypes(include=['int64', 'float64']).columns
# Names of categorical (qualitative) features, here the brand
cat_col = X_train.select_dtypes(include=['object', 'bool']).columns

print(num_col)
print(cat_col)

Index(['Screen Size', 'Resolution (Width)', 'Resolution (Height)'], dtype='object')
Index(['Brand'], dtype='object')


In [None]:
# Numeric columns get a standard scaler.
# Categorical columns get a OneHotEncoder with dense output, since there are only a few dozen brands.
# `sparse_output` replaces the `sparse` argument, which is deprecated since scikit-learn 1.2.
preprocessor = ColumnTransformer([("scaler", StandardScaler(), num_col),
                                  ("onehot", OneHotEncoder(sparse_output=False), cat_col)])
# We skip the model evaluation step here
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('regressor', XGBRegressor())])
# and fit on the full dataset right away.
model.fit(X_train, y_train)
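
The cell above fits on every row and skips the accuracy check. If a rough error estimate is wanted first, k-fold cross-validation is one option. The following is a minimal sketch on made-up data: the toy DataFrame is hypothetical, `GradientBoostingRegressor` stands in for `XGBRegressor` so the block only needs scikit-learn, and `handle_unknown="ignore"` is an added assumption so that a fold whose validation part contains an unseen brand does not raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: price grows with screen size, plus noise.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Brand": rng.choice(["A", "B", "C"], size=n),
    "Screen Size": rng.uniform(20, 35, size=n),
})
toy["Price"] = 300 * toy["Screen Size"] + rng.normal(0, 200, size=n)

X, y = toy.drop(columns="Price"), toy["Price"]

preprocessor = ColumnTransformer([
    ("scaler", StandardScaler(), ["Screen Size"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Brand"]),
])
pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", GradientBoostingRegressor(random_state=0)),  # stand-in model
])

# 5-fold CV; scikit-learn reports errors as negative scores, so flip the sign.
scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores
print("RMSE per fold:", rmse_per_fold.round(1))
print("Mean RMSE:", rmse_per_fold.mean().round(1))
```

Because the whole pipeline is passed to `cross_val_score`, the scaler and encoder are refit inside each fold, which avoids leaking validation rows into the preprocessing.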



Save the model using `joblib`.

In [None]:
# Save the fitted model to a file
joblib.dump(model, 'model.joblib')

['model.joblib']
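
On the deployment side, the saved file is loaded back and given a one-row DataFrame whose column names match the training columns. Below is a minimal sketch of that round trip on made-up data, with `LinearRegression` standing in for the XGBoost regressor; the file name `demo_model.joblib` and the toy rows are hypothetical.

```python
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training rows with one categorical and one numeric feature.
train = pd.DataFrame({
    "Brand": ["acer", "LG", "acer", "LG"],
    "Screen Size": [24.0, 27.0, 32.0, 24.0],
    "Price": [4000.0, 6000.0, 9000.0, 4500.0],
})
X, y = train[["Brand", "Screen Size"]], train["Price"]

pipe = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("scaler", StandardScaler(), ["Screen Size"]),
        ("onehot", OneHotEncoder(), ["Brand"]),
    ])),
    ("regressor", LinearRegression()),  # stand-in for XGBRegressor
])
pipe.fit(X, y)

# Save the whole pipeline, then load it as the deployed app would.
joblib.dump(pipe, "demo_model.joblib")
loaded = joblib.load("demo_model.joblib")

# A single user input, as a one-row DataFrame with the training column names.
new_row = pd.DataFrame({"Brand": ["LG"], "Screen Size": [27.0]})
prediction = loaded.predict(new_row)
print(prediction)
```

The loaded object applies the scaling and one-hot encoding itself, so the app only has to assemble the raw inputs.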

Save the list of brands; we will need it to create a dropdown menu.

In [None]:
# Take the unique values of each categorical column, which gives the list of brands present in this data
unique_values = {col:X_train[col].unique() for col in cat_col}
joblib.dump(unique_values, 'unique_values.joblib')

['unique_values.joblib']

In [None]:
unique_values

{'Brand': array(['acer', 'Sceptre', 'SAMSUNG', 'ViewSonic', 'LG', 'AOC', 'Dell',
        'ASUS', 'Teamgee', 'SANSUI', 'KYY', 'Cevaton', 'BenQ', 'domyfan',
        'cocopar', 'CIDETTY', 'Philips Computer Monitors', 'KOORUI', 'QQH',
        'AOPEN', 'MSI', 'Alienware', 'kasorey', 'BOSII', 'Macsecor',
        'GIGABYTE', 'Lenovo', 'CRUA', 'INNOCN', 'HP', 'XGaming', 'ARZOPA',
        'Deco Gear', 'Poly', 'Kensington', 'Pixio', 'KTC', 'MP',
        'SideTrak', 'ANGEL POS', 'LESOWN', 'TouchWo', 'Duex', 'Z Z-EDGE',
        'InnoView', 'Planar', 'PHILIPS', 'NEC', 'Neway', 'Fiodio',
        'LILLIPUT', 'ALOGIC', 'Thermaltake', 'AUO', 'DIYmalls', 'Targus',
        'Elo', 'Atdec', 'iChawk'], dtype=object)}
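
The saved dictionary maps each categorical column to a NumPy array in order of first appearance. For a dropdown, a sorted plain list is usually handier. Here is a minimal sketch using a short made-up subset of the brands; the helper name `dropdown_options` is hypothetical and the choice of UI framework is left open.

```python
import json

import numpy as np


def dropdown_options(unique_values):
    """Convert {column: array of values} into {column: case-insensitively sorted list}."""
    return {
        col: sorted((str(v) for v in values), key=str.lower)
        for col, values in unique_values.items()
    }


# Hypothetical subset; in the app this would come from
# joblib.load('unique_values.joblib').
unique_values = {"Brand": np.array(["acer", "Sceptre", "SAMSUNG", "LG"], dtype=object)}

options = dropdown_options(unique_values)
print(json.dumps(options))  # plain lists are JSON-serializable, arrays are not
```

Sorting case-insensitively keeps entries such as `acer` and `ASUS` next to each other instead of splitting lowercase and uppercase names.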

### Exercise:
1. Choose your own dataset from https://www.kaggle.com/datasets?topic=trendingDataset or any other website. Choose your own prediction task.

2. Fit and deploy your prediction model on a GitHub website.

3. Go to Assignment in Mango and send in the link to your website.


#### 1. Chosen dataset: Diamonds
This dataset contains the attributes of diamonds (carat, cut, color, clarity, depth, table, and the dimensions x, y, z) together with their prices.
The prediction task is to estimate `price` from the other columns.

In [3]:
import pandas as pd
import sklearn
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

import xgboost
from xgboost import XGBRegressor

import joblib

In [28]:
data = pd.read_csv("diamonds.csv")
data

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...
49995,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
49996,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
49997,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
49998,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


In [29]:
data.columns

Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
       'z'],
      dtype='object')

#### 2. Fit and deploy the prediction model on a GitHub website.


In [30]:
# Define features and target variable
features = ['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'x', 'y', 'z']
target = 'price'

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data[target], test_size=0.2, random_state=42
)

# Preprocess numerical and categorical features
num_features = ['carat', 'depth', 'table', 'x', 'y', 'z']
cat_features = ['cut', 'color', 'clarity']

In [31]:
num_transformer = StandardScaler()
cat_transformer = OneHotEncoder()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)
    ]
)

# Create a pipeline with preprocessing and XGBoost model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', XGBRegressor())
])

# Fit the model on the training data
model.fit(X_train, y_train)
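
The split above sets aside `X_test` and `y_test`, but the notebook never scores the model on them. The following is a minimal sketch of such a check: the helper name `evaluate_regressor` is hypothetical, and the demo uses synthetic data with scikit-learn's `GradientBoostingRegressor` as a stand-in model. In the notebook the call would be `evaluate_regressor(model, X_test, y_test)`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_regressor(model, X_test, y_test):
    """Return RMSE and R^2 of a fitted regressor on held-out data."""
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    return {"rmse": rmse, "r2": float(r2_score(y_test, pred))}


# Synthetic stand-in data: one feature, roughly linear target with noise.
rng = np.random.default_rng(42)
X = rng.uniform(0.2, 2.0, size=(500, 1))
y = 4000 * X[:, 0] + rng.normal(0, 100, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
demo = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

metrics = evaluate_regressor(demo, X_te, y_te)
print(metrics)
```

RMSE is in the same units as the target, so it reads directly as a typical price error, while R² gives a scale-free view of the fit.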

In [32]:
# Save the trained model
joblib.dump(model, 'model.joblib')

['model.joblib']

#### 3. Go to Assignment in Mango and send in the link to your website.