# **Introduction**

Regression is a statistical method for modeling the relationship between one or more independent variables (predictors) and a single dependent variable (the response). Its main goals are to understand that relationship and to predict the response from the predictors.

# **Data Extraction**

The dataset contains information about videos trending on YouTube, with the following columns:

* `trending_date`: date the video was trending
* `title`: video title
* `channel_title`: channel name
* `category_id`: video category, label-encoded
* `publish_time`: time the video was published
* `tags`: tags used on the video
* `views`: number of views
* `likes`: number of likes
* `dislikes`: number of dislikes
* `comment_count`: number of comments
* `comments_disabled`: whether comments are disabled on the video
* `ratings_disabled`: whether ratings are disabled on the video
* `video_error_or_removed`: whether the video currently has an error or has been removed
* `description`: video description
* `No_tags`: number of tags used
* `desc_len`: length of the video description
* `len_title`: length of the video title
* `publish_date`: date the video was published

With this dataset, a regression or machine learning model can be used to predict the number of views, likes, dislikes, or comments from the other features.

In [None]:
# Import the libraries that will be used
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# Data extraction step


# Show the data


Unnamed: 0,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,comments_disabled,ratings_disabled,video_error_or_removed,description,No_tags,desc_len,len_title,publish_date
0,2017-11-14,Sharry Mann: Cute Munda ( Song Teaser) | Parmi...,Lokdhun Punjabi,1,12:20:39,"sharry mann|""sharry mann new song""|""sharry man...",1096327,33966,798,882,False,False,False,Presenting Sharry Mann latest Punjabi Song Cu...,15,920,81,2017-11-12
1,2017-11-14,"पीरियड्स के समय, पेट पर पति करता ऐसा, देखकर दं...",HJ NEWS,25,05:43:56,"पीरियड्स के समय|""पेट पर पति करता ऐसा""|""देखकर द...",590101,735,904,0,True,False,False,"पीरियड्स के समय, पेट पर पति करता ऐसा, देखकर दं...",19,2232,58,2017-11-13
2,2017-11-14,Stylish Star Allu Arjun @ ChaySam Wedding Rece...,TFPC,24,15:48:08,Stylish Star Allu Arjun @ ChaySam Wedding Rece...,473988,2011,243,149,False,False,False,Watch Stylish Star Allu Arjun @ ChaySam Weddin...,14,482,58,2017-11-12
3,2017-11-14,Eruma Saani | Tamil vs English,Eruma Saani,23,07:08:48,"Eruma Saani|""Tamil Comedy Videos""|""Films""|""Mov...",1242680,70353,1624,2684,False,False,False,This video showcases the difference between pe...,20,263,30,2017-11-12
4,2017-11-14,why Samantha became EMOTIONAL @ Samantha naga ...,Filmylooks,24,01:14:16,"Filmylooks|""latest news""|""telugu movies""|""telu...",464015,492,293,66,False,False,False,why Samantha became EMOTIONAL @ Samantha naga ...,11,753,88,2017-11-13
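The extraction cell above is left blank, so the following is only a minimal sketch of how the data could be loaded with `pd.read_csv`. The in-memory CSV and its reduced set of columns are assumptions made so the example runs on its own; the real file path is not given in this notebook.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: a tiny CSV held in memory.
# With the real data, the path of the dataset would be passed to pd.read_csv instead.
sample_csv = io.StringIO(
    "trending_date,title,category_id,publish_time,views,likes,No_tags,publish_date\n"
    "2017-11-14,Video A,1,12:20:39,1096327,33966,15,2017-11-12\n"
    "2017-11-14,Video B,25,05:43:56,590101,735,19,2017-11-13\n"
)

# Data extraction step
df = pd.read_csv(sample_csv)

# Show the first rows
print(df.head())
```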


# **Accessing General Information About the Data**

In [None]:
# Access general information about the data


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36791 entries, 0 to 36790
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   trending_date           36791 non-null  object        
 1   title                   36791 non-null  object        
 2   channel_title           36791 non-null  object        
 3   category_id             36791 non-null  int64         
 4   publish_time            36791 non-null  object        
 5   tags                    36791 non-null  object        
 6   views                   36791 non-null  int64         
 7   likes                   36791 non-null  int64         
 8   dislikes                36791 non-null  int64         
 9   comment_count           36791 non-null  int64         
 10  comments_disabled       36791 non-null  bool          
 11  ratings_disabled        36791 non-null  bool          
 12  video_error_or_removed  36791 non-null  bool  

# **Data Transformation**

<ul>
  <li>Convert all column names to <i>lowercase</i> (optional)</li>
  <li>Several columns have an unsuitable data type, so their types need to be converted</li><br>
  <table>
    <tr>
      <th>Column Name</th>
      <th>Original Data Type</th>
      <th>Converted Data Type</th>
    </tr>
    <tr>
      <td>category_id</td>
      <td>int</td>
      <td>str / object</td>
    </tr>
    <tr>
      <td>trending_date</td>
      <td>object</td>
      <td>datetime</td>
    </tr>
    <tr>
      <td>publish_time</td>
      <td>object</td>
      <td>timedelta</td>
    </tr>
  </table><br>
</ul>

In [None]:
# Convert all column names to lowercase


In [None]:
# Convert category_id to object


# Convert trending_date to datetime


# Convert publish_time to timedelta


# Check the data information again


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36791 entries, 0 to 36790
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype          
---  ------                  --------------  -----          
 0   trending_date           36791 non-null  datetime64[ns] 
 1   title                   36791 non-null  object         
 2   channel_title           36791 non-null  object         
 3   category_id             36791 non-null  object         
 4   publish_time            36791 non-null  timedelta64[ns]
 5   tags                    36791 non-null  object         
 6   views                   36791 non-null  int64          
 7   likes                   36791 non-null  int64          
 8   dislikes                36791 non-null  int64          
 9   comment_count           36791 non-null  int64          
 10  comments_disabled       36791 non-null  bool           
 11  ratings_disabled        36791 non-null  bool           
 12  video_error_or_removed  36791 no
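The conversions listed in the table can be sketched as below. This is one possible way to fill in the blank cells, assuming the DataFrame is called `df`; the small sample frame is made up for illustration and only includes the columns being converted.

```python
import pandas as pd

# Small illustrative frame (assumed sample, not the real dataset)
df = pd.DataFrame({
    'trending_date': ['2017-11-14', '2017-11-14'],
    'category_id': [1, 25],
    'publish_time': ['12:20:39', '05:43:56'],
    'No_tags': [15, 19],
})

# Convert all column names to lowercase
df.columns = df.columns.str.lower()

# category_id is a label, not a quantity, so store it as a string
df['category_id'] = df['category_id'].astype(str)

# Parse the trending date into a datetime column
df['trending_date'] = pd.to_datetime(df['trending_date'])

# Parse the HH:MM:SS time of day into a timedelta column
df['publish_time'] = pd.to_timedelta(df['publish_time'])

# Check the data information again
df.info()
```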

# **Exploratory Data Analysis**



## **Descriptive Statistics**

In [None]:
# Select only the columns with a numeric data type


# Compute the descriptive statistics


Unnamed: 0,views,likes,dislikes,comment_count,no_tags,desc_len,len_title
count,36791.0,36791.0,36791.0,36791.0,36791.0,36791.0,36791.0
mean,1071490.26,27450.69,1685.36,2714.02,18.94,923.08,70.61
std,3207149.05,97831.29,16197.32,14978.11,9.84,815.04,22.41
min,4024.0,0.0,0.0,0.0,1.0,3.0,5.0
25%,125604.0,879.0,109.0,83.0,12.0,368.0,53.0
50%,307836.0,3126.0,331.0,336.0,19.0,677.0,74.0
75%,806631.5,14095.0,1032.0,1314.5,25.0,1237.0,91.0
max,125432237.0,2912710.0,1545017.0,827755.0,72.0,5136.0,100.0
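A table like the one above could be produced roughly as follows. The name `col_numeric` is taken from the box plot cell later in this notebook, but how it was actually built is not shown, so the `select_dtypes` call here is an assumption, as is the sample data.

```python
import pandas as pd

# Small illustrative frame (assumed sample, not the real dataset)
df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'views': [100, 200, 300, 400],
    'likes': [10, 20, 30, 40],
    'comments_disabled': [False, True, False, False],
})

# Select only the columns with a numeric data type (bool and object are excluded)
col_numeric = df.select_dtypes(include='number').columns.tolist()

# Compute the descriptive statistics
stats = df[col_numeric].describe()
print(stats)
```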


## **Handling Missing Values**

In [None]:
# Detect and count the null values in each column


trending_date              0
title                      0
channel_title              0
category_id                0
publish_time               0
tags                       0
views                      0
likes                      0
dislikes                   0
comment_count              0
comments_disabled          0
ratings_disabled           0
video_error_or_removed     0
description               45
no_tags                    0
desc_len                   0
len_title                  0
publish_date               0
dtype: int64

Because missing values occur only in the `description` column, and that column is not needed for the analysis, they can be ignored.
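The per-column null counts shown above can be obtained with `isna().sum()`. The sketch below uses a made-up frame with one missing description to show the shape of the result.

```python
import pandas as pd

# Small illustrative frame with one missing description (assumed sample)
df = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'description': ['first video', None, 'third video'],
    'views': [100, 200, 300],
})

# Detect and count the null values in each column
null_counts = df.isna().sum()
print(null_counts)
```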

## **Handling Duplicated Data**

In [None]:
# Check for duplicated rows


# Show the result


Initial row count = 36791 with total duplicated rows = 4229


In [None]:
# Drop the duplicated rows


# Check for duplicated rows


# Show the result


Row count now = 32562
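The two cells above can be sketched with `duplicated()` and `drop_duplicates()`. The sample frame is made up; with the default `keep='first'`, only the second and later copies of a row are counted as duplicates, which is why the initial count minus the duplicate count gives the final count.

```python
import pandas as pd

# Small illustrative frame with two repeated rows (assumed sample)
df = pd.DataFrame({
    'title': ['A', 'B', 'A', 'C', 'B'],
    'views': [1, 2, 1, 3, 2],
})

# Check for duplicated rows
n_rows = len(df)
n_dup = int(df.duplicated().sum())
print(f'Initial row count = {n_rows} with total duplicated rows = {n_dup}')

# Drop the duplicated rows and rebuild the index
df = df.drop_duplicates().reset_index(drop=True)
print(f'Row count now = {len(df)}')
```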


## **Outlier Detection**

In [None]:
# Import the visualization library
import plotly.express as px

# Define the colors to use
color = ['#ff6d00', '#ff8500', '#ff9e00', '#240046', '#5a189a', '#9d4edd', '#18af9d']

for i in ... :

    # Create the horizontal box plot
    fig = px.box(
        ...,
        orientation = 'h',
        color_discrete_sequence  = [color[i]]
    )

    # Update layout and display the plot
    fig.update_layout(
        title = f'<b>Box Plot {col_numeric[i]}</b>',
        yaxis = dict(
            title = '',
            showgrid = False,
            showline = False,
            showticklabels = False,
            zeroline = False,
        ),
        xaxis = dict(
            title = 'Total',
            showgrid = False,
            showline = True,
            showticklabels = True,
            zeroline = False,
        )
    )

    fig.show()

In [None]:
import plotly.express as px

fig = px.histogram(
    ...,
    x = ...,
    marginal = 'box',
    color_discrete_sequence  = ['#0E2954'],
    nbins = 50
)

fig.update_traces(
      marker_line_width = 1,
      marker_line_color = 'white'
)

fig.update_layout(
    plot_bgcolor = 'rgba(0, 0, 0, 0)',
    title = dict(
        text = "<b>Distribution of <span style='color:#0E2954'>No Tags</span></b>",
        font = dict(
            size = 28,
            color = '#757882'
        ),
        y = 0.92,
        x = 0.5
    ),
    yaxis = dict(
        title = '',
        showgrid = False,
        showline = False,
        showticklabels = False,
        zeroline = False,
    ),
    margin = dict(
        t = 80,
        b = 10,
        r = 20
    )
)

fig.show()
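A box plot usually flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. As a numeric complement to the plots, the sketch below counts such points for a single column. The sample values and the 1.5 multiplier are assumptions for illustration; the notebook itself only inspects outliers visually.

```python
import pandas as pd

# Illustrative values with one obvious extreme point (assumed sample)
views = pd.Series([10, 12, 11, 13, 12, 11, 100], name='views')

# Quartiles and interquartile range
q1 = views.quantile(0.25)
q3 = views.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside the fences are treated as outliers
outliers = views[(views < lower) | (views > upper)]
print(f'Fences = [{lower}, {upper}], outliers = {outliers.tolist()}')
```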

## **Correlation**

In [None]:
# Compute the correlation


In [None]:
import plotly.express as px

# Plot HeatMap
fig = px.imshow(
    ...,
    text_auto = True,
    color_continuous_scale = 'Blues'
)

fig.update_coloraxes(
    showscale = False
)

fig.update_layout(
    width = 800,
    height = 600,
    title = dict(
        text = "<b>Correlation Between Features</b>",
        font = dict(
            size = 30,
            color = '#0E2954'
        ),
        y = 0.95,
        x = 0.5
    ),
    margin = dict(
        t = 80,
        b = 30,
        r = 50,
        l = 50
    )
)

fig.show()

- The columns strongly correlated with the target (> 0.5) are `likes`, `dislikes`, and `comment_count`.
- If linear regression is used to train the model, the no-multicollinearity assumption matters: avoid highly correlated features such as `likes`, `dislikes`, and `comment_count` (for example, by keeping only one of them).
- Alternatively, use an algorithm that is reasonably robust to outliers.
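The correlation matrix behind the heatmap, and the list of features whose correlation with the target exceeds 0.5, could be computed roughly as follows. The sample data is made up, and treating `views` as the target is an assumption based on the scatter plot and the feature list later in this notebook.

```python
import pandas as pd

# Small illustrative frame (assumed sample, not the real dataset)
df = pd.DataFrame({
    'views': [100, 200, 300, 400, 500],
    'likes': [10, 22, 29, 41, 50],
    'no_tags': [5, 3, 8, 1, 6],
})

# Pearson correlation between every pair of numeric columns
corr = df.corr()
print(corr.round(2))

# Correlation of each feature with the assumed target, 'views'
target_corr = corr['views'].drop('views')

# Features whose absolute correlation with the target exceeds 0.5
high = target_corr[target_corr.abs() > 0.5].index.tolist()
print(high)
```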

In [None]:
import plotly.express as px

# Plot Scatter
fig = px.scatter(
    ...,
    x = 'views',
    y = 'likes',
    trendline = 'ols',
    color_discrete_sequence  = ['#0E2954']
)

fig.show()

# **Feature Engineering**

As an example, we can check whether a video was uploaded on a weekday or a weekend.

In [None]:
# Extract the day name from publish_date


# Flag Saturday / Sunday (1 = weekend | 0 = otherwise)


# Show the result


Unnamed: 0,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,comments_disabled,ratings_disabled,video_error_or_removed,description,no_tags,desc_len,len_title,publish_date,publish_day,is_weekend
0,2017-11-14,Sharry Mann: Cute Munda ( Song Teaser) | Parmi...,Lokdhun Punjabi,1,0 days 12:20:39,"sharry mann|""sharry mann new song""|""sharry man...",1096327,33966,798,882,False,False,False,Presenting Sharry Mann latest Punjabi Song Cu...,15,920,81,2017-11-12,Sunday,1
1,2017-11-14,"पीरियड्स के समय, पेट पर पति करता ऐसा, देखकर दं...",HJ NEWS,25,0 days 05:43:56,"पीरियड्स के समय|""पेट पर पति करता ऐसा""|""देखकर द...",590101,735,904,0,True,False,False,"पीरियड्स के समय, पेट पर पति करता ऐसा, देखकर दं...",19,2232,58,2017-11-13,Monday,0
2,2017-11-14,Stylish Star Allu Arjun @ ChaySam Wedding Rece...,TFPC,24,0 days 15:48:08,Stylish Star Allu Arjun @ ChaySam Wedding Rece...,473988,2011,243,149,False,False,False,Watch Stylish Star Allu Arjun @ ChaySam Weddin...,14,482,58,2017-11-12,Sunday,1
3,2017-11-14,Eruma Saani | Tamil vs English,Eruma Saani,23,0 days 07:08:48,"Eruma Saani|""Tamil Comedy Videos""|""Films""|""Mov...",1242680,70353,1624,2684,False,False,False,This video showcases the difference between pe...,20,263,30,2017-11-12,Sunday,1
4,2017-11-14,why Samantha became EMOTIONAL @ Samantha naga ...,Filmylooks,24,0 days 01:14:16,"Filmylooks|""latest news""|""telugu movies""|""telu...",464015,492,293,66,False,False,False,why Samantha became EMOTIONAL @ Samantha naga ...,11,753,88,2017-11-13,Monday,0
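The `publish_day` and `is_weekend` columns shown above can be derived as in this sketch. It assumes `publish_date` can be parsed as a date; the three sample dates are chosen for illustration.

```python
import pandas as pd

# Small illustrative frame (assumed sample dates)
df = pd.DataFrame({'publish_date': ['2017-11-12', '2017-11-13', '2017-11-18']})
df['publish_date'] = pd.to_datetime(df['publish_date'])

# Extract the day name from publish_date
df['publish_day'] = df['publish_date'].dt.day_name()

# Flag Saturday / Sunday (1 = weekend | 0 = otherwise)
df['is_weekend'] = df['publish_day'].isin(['Saturday', 'Sunday']).astype(int)

# Show the result
print(df)
```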


# **Feature Selection**

In [None]:
# Select the columns to process
fitur = ['likes', 'dislikes', 'comment_count', 'no_tags', 'desc_len', 'len_title', 'publish_day', 'is_weekend', 'views']

# Apply One-Hot Encoding


# Show the data


Unnamed: 0,likes,dislikes,comment_count,no_tags,desc_len,len_title,is_weekend,views,publish_day_Friday,publish_day_Monday,publish_day_Saturday,publish_day_Sunday,publish_day_Thursday,publish_day_Tuesday,publish_day_Wednesday
0,33966,798,882,15,920,81,1,1096327,0,0,0,1,0,0,0
1,735,904,0,19,2232,58,0,590101,0,1,0,0,0,0,0
2,2011,243,149,14,482,58,1,473988,0,0,0,1,0,0,0
3,70353,1624,2684,20,263,30,1,1242680,0,0,0,1,0,0,0
4,492,293,66,11,753,88,0,464015,0,1,0,0,0,0,0
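One way to get the encoded table above is `pd.get_dummies`, which replaces `publish_day` with one 0/1 column per day and appends those columns at the end. This is a sketch with a shortened feature list; the name `df_encoded` and the use of `dtype=int` (to print 0/1 rather than booleans) are assumptions.

```python
import pandas as pd

# First three rows of a few columns from the table above
df = pd.DataFrame({
    'likes': [33966, 735, 2011],
    'publish_day': ['Sunday', 'Monday', 'Sunday'],
    'is_weekend': [1, 0, 1],
    'views': [1096327, 590101, 473988],
})

# Select the columns to process (shortened list for the example)
fitur = ['likes', 'publish_day', 'is_weekend', 'views']

# Apply One-Hot Encoding to the categorical column only
df_encoded = pd.get_dummies(df[fitur], columns=['publish_day'], dtype=int)

# Show the data
print(df_encoded)
```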


# **Scaling Data**

In [None]:
# Import the library for scaling data


# Initialize the RobustScaler object


# Fit and transform the data in the dataframe


# Build a new dataframe from the scaled data


# Show the data


Unnamed: 0,likes,dislikes,comment_count,no_tags,desc_len,len_title,is_weekend,views,publish_day_Friday,publish_day_Monday,publish_day_Saturday,publish_day_Sunday,publish_day_Thursday,publish_day_Tuesday,publish_day_Wednesday
0,2.71,0.6,0.52,-0.31,0.29,0.16,1.0,1.31,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,-0.18,0.72,-0.27,0.0,1.81,-0.46,0.0,0.5,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,-0.07,-0.06,-0.14,-0.38,-0.22,-0.46,1.0,0.31,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,5.87,1.58,2.13,0.08,-0.48,-1.22,1.0,1.54,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,-0.2,0.0,-0.22,-0.62,0.09,0.35,0.0,0.3,0.0,1.0,0.0,0.0,0.0,0.0,0.0
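The scaling cell could look like the sketch below. With its default settings, `RobustScaler` subtracts each column's median and divides by its interquartile range, so a single extreme value does not dominate the scale. The sample data and the names `df_encoded` and `df_scaled` are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Small illustrative frame with one extreme value in 'likes' (assumed sample)
df_encoded = pd.DataFrame({
    'likes': [0, 10, 20, 30, 1000],
    'views': [100, 200, 300, 400, 500],
})

# Initialize the RobustScaler object (default: median centering, IQR scaling)
scaler = RobustScaler()

# Fit and transform the data in the dataframe
scaled = scaler.fit_transform(df_encoded)

# Build a new dataframe from the scaled data, keeping column names and index
df_scaled = pd.DataFrame(scaled, columns=df_encoded.columns, index=df_encoded.index)

# Show the data
print(df_scaled)
```

In this sample, `likes` has median 20 and IQR 20, so the value 1000 becomes 49 while the other values stay between -1 and 0.5.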


# **Data Splitting**

In [None]:
# Import the library for splitting data
from sklearn.model_selection import train_test_split

# Define the features and the target
X = ...
y = ...

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size = ...,
    random_state = 42
)

# Check the number of rows in each set
print(f'Training rows = {X_train.shape[0]}')
print(f'Test rows     = {X_test.shape[0]}')

Training rows = 26049
Test rows     = 6513
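The placeholders in the splitting cell are left for the reader. The sketch below shows one plausible way to fill them, assuming `views` is the target and `test_size = 0.2`. The 0.2 value is an inference from the reported counts (6513 of 32562 rows is about 20%), not something the notebook states; the sample data is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small illustrative frame standing in for the scaled data (assumed sample)
df_scaled = pd.DataFrame({
    'likes': list(range(10)),
    'dislikes': list(range(10, 20)),
    'views': list(range(100, 110)),
})

# Define the features and the target (assumption: 'views' is the target)
X = df_scaled.drop(columns='views')
y = df_scaled['views']

# Split the data (assumption: 20% of the rows go to the test set)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

# Check the number of rows in each set
print(f'Training rows = {X_train.shape[0]}')
print(f'Test rows     = {X_test.shape[0]}')
```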


# **Modelling**

Because the data contain many outliers, plain Linear Regression is a poor fit. Elastic Net is used instead: it is a linear model regularized with both L1 and L2 penalties, which keeps the coefficients small and stable and is treated here as reasonably robust to outliers.<br><br>

docs : https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html

In [None]:
# Import the library for modelling
from sklearn.linear_model import ElasticNet

# Create an ElasticNet object with the default parameter settings
model = ElasticNet()

# Fit the model to the data
...

# Make predictions on the testing data
y_pred = model.predict(...)

# **Measuring Model Performance**
  One metric for measuring model performance is the mean of the absolute differences (<i>Mean Absolute Error</i>), defined as
  \begin{equation}
  MAE =  \frac{\sum\limits_{i=1}^{n}|{y_{predicted}}_i - {y_{actual}}_i|}{n}
  \end{equation}
  MAE shows how far, on average, the model's predictions are from the actual data, so the smaller the MAE, the better the model predicts.

docs : <i>https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html</i>
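To make the MAE formula concrete, the sketch below evaluates it by hand on four made-up value pairs. The numbers are illustrative only and are unrelated to the model in this notebook.

```python
# Illustrative actual and predicted values (assumed sample)
y_actual = [3.0, 5.0, 2.5, 7.0]
y_predicted = [2.5, 5.0, 4.0, 8.0]

# Absolute difference for each pair: |predicted - actual|
abs_errors = [abs(p - a) for p, a in zip(y_predicted, y_actual)]

# MAE is the sum of the absolute differences divided by n
mae = sum(abs_errors) / len(abs_errors)
print(f'Absolute errors = {abs_errors}')
print(f'MAE = {mae}')
```

Here the absolute errors are 0.5, 0.0, 1.5, and 1.0, so the MAE is 3.0 / 4 = 0.75.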

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Compute the prediction errors
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

# Show the results
print("Mean Squared Error:", mse)
print("Mean Absolute Error:", mae)

Mean Squared Error: 6.311900224277747
Mean Absolute Error: 0.985223191599188


# **Coefficient of Determination (R2)**
R-squared (R^2), also known as the coefficient of determination, measures how well a regression model fits the observed data. It is the proportion of the variability in the target variable that is explained by the independent variables in the model.
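R-squared is commonly computed as one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the actual values. The sketch below works through that definition by hand on made-up numbers; they are unrelated to the model in this notebook.

```python
# Illustrative actual and predicted values (assumed sample)
y_actual = [1.0, 2.0, 3.0, 4.0]
y_predicted = [1.5, 2.0, 2.5, 4.0]

# Mean of the actual values
mean_actual = sum(y_actual) / len(y_actual)

# Residual sum of squares: variation the model fails to explain
ss_res = sum((a - p) ** 2 for a, p in zip(y_actual, y_predicted))

# Total sum of squares: variation of the actual values around their mean
ss_tot = sum((a - mean_actual) ** 2 for a in y_actual)

# R-squared is the share of the total variation that is explained
r2 = 1 - ss_res / ss_tot
print(f'R2 Score = {r2}')
```

With a residual sum of squares of 0.5 and a total sum of squares of 5.0, the model explains 90% of the variation.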

In [None]:
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print(f'R2 Score = {r2}')

R2 Score = 0.7665912503623886


In [None]:
koefisien = model.coef_
intercept = model.intercept_
print(f'coefficients = {koefisien}')
print(f'bias = {intercept}')

coefficients = [ 0.47547019  0.05236671 -0.03147718  0.          0.          0.
 -0.         -0.         -0.         -0.         -0.         -0.
  0.          0.        ]
bias = 0.18610572170562734


In [None]:
fitur_penting = pd.DataFrame({
    'Fitur' : model.feature_names_in_,
    'Koefisien' : koefisien
})

fitur_penting = fitur_penting.sort_values(by = 'Koefisien', ascending = True)
display(fitur_penting)

Unnamed: 0,Fitur,Koefisien
2,comment_count,-0.03
3,no_tags,0.0
4,desc_len,0.0
5,len_title,0.0
6,is_weekend,-0.0
7,publish_day_Friday,-0.0
8,publish_day_Monday,-0.0
9,publish_day_Saturday,-0.0
10,publish_day_Sunday,-0.0
11,publish_day_Thursday,-0.0


In [None]:
import plotly.express as px

fig = px.bar(
    fitur_penting,
    x = 'Koefisien',
    y = 'Fitur',
    orientation = 'h',
    text_auto = True
)

fig.update_layout(
    width = 1200,
    height = 600,
    title = '<b>Feature Importance</b>',
    xaxis_title = '',
    yaxis_title = '',
    showlegend = False,
    paper_bgcolor = 'rgba(255, 255, 255, 1)',
    plot_bgcolor = 'rgba(255, 255, 255, 0)',
)

fig.show()