# YouTube top 500 channels: analysis and predictions

The dataset contains information about the 500 most popular YouTube channels, ranked by subscriber count. More information is available at the link below.

Data source: [Top 500 most subscribed YouTube channels](https://www.kaggle.com/datasets/ritiksharma07/top-500-most-subscribed-youtube-channels-june24)

### Next steps in order:

- EDA (exploratory data analysis)
  - [Import and inspect data](#import-and-inspect-data)
  - [Handle missing values (impute or clean)](#handle-missing-values)
  - [Transform features (if necessary)](#transform-features)
  - [Handle outliers](#handle-outliers)
  - [Explore data and visualization](#explore-data-and-visualization)
- Based on the EDA, establish an [objective](#objective) and choose models to estimate it
- [Train and predict](#train-and-predict) the objective with the models, then evaluate them
- [Conclusion](#conclusion)


In [1100]:
# Installations

!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install plotly
!pip install nbformat

In [1101]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import plotly.express as px

# ML
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import IsolationForest, GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Others
import re
import warnings

warnings.filterwarnings('ignore')

# EDA

### Import and inspect data


In [1102]:
yt_df = pd.read_csv("./500th_top_youtube_channels.csv")

yt_df.head()

Unnamed: 0,RANK,NAME_OF_CHANNEL,TOTAL_NUMBER_OF_VIDEOS,SUBSCRIBERS,VIEWS,CATEGORY
0,#1,MrBeast,799,274M,50.98B,Entertainment
1,#2,T-Series,21.12K,267M,257.16B,Music
2,#3,Cocomelon - Nursery Rhymes,1.18K,176M,182.88B,Kids
3,#4,SET India,138.97K,173M,164.71B,Entertainment
4,#5,✿ Kids Diana Show,1.22K,123M,103.5B,Kids


Rename the column TOTAL_NUMBER_OF_VIDEOS to VIDEOS


In [1103]:
yt_df.rename(columns={"TOTAL_NUMBER_OF_VIDEOS": "VIDEOS"}, inplace=True)

### Handle missing values

Check for null values, the number of entries and the type of each feature


In [1104]:
yt_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   RANK             500 non-null    object
 1   NAME_OF_CHANNEL  493 non-null    object
 2   VIDEOS           500 non-null    object
 3   SUBSCRIBERS      500 non-null    object
 4   VIEWS            500 non-null    object
 5   CATEGORY         492 non-null    object
dtypes: object(6)
memory usage: 23.6+ KB


There are 500 entries, but 8 have no category and 7 have no channel name. Together these 15 rows are 3% of the data, so they will be dropped. All features are stored as objects (strings), so VIDEOS, SUBSCRIBERS and VIEWS will need to be transformed to numeric values.

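As a small illustration of why both kinds of missing rows disappear, the sketch below uses a made-up four-row frame (the values are hypothetical, not taken from the dataset). `dropna()` without arguments removes a row if any column is missing, while `subset` restricts the check to chosen columns.

```python
import pandas as pd

# Hypothetical toy frame: one row lacks a name, another lacks a category.
toy = pd.DataFrame(
    {
        "NAME_OF_CHANNEL": ["a", None, "c", "d"],
        "CATEGORY": ["Music", "Kids", None, "News"],
    }
)

# Drops every row that has a missing value in ANY column.
cleaned = toy.dropna().reset_index(drop=True)

# Drops only the rows whose CATEGORY is missing.
only_category = toy.dropna(subset=["CATEGORY"]).reset_index(drop=True)

print(len(cleaned), len(only_category))
```

If the channel name is going to be dropped anyway, restricting the check with `subset` would keep a few more rows; whether that is worthwhile here is a judgment call.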

In [1105]:
yt_df.dropna(inplace=True)
yt_df.reset_index(drop=True, inplace=True)
yt_df.isna().sum().sum()

0

The rank and the channel name are not needed as features


In [1106]:
yt_df.drop(columns=["RANK", "NAME_OF_CHANNEL"], inplace=True)
yt_df.columns

Index(['VIDEOS', 'SUBSCRIBERS', 'VIEWS', 'CATEGORY'], dtype='object')

### Transform features

Transform the columns VIDEOS, SUBSCRIBERS and VIEWS to numeric


In [1107]:
multipliers = {"K": 3, "M": 6, "B": 9}


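def numeric_format(column):
    # Values without a K/M/B suffix are plain integers; otherwise scale the
    # number by the matching power of ten. round() avoids float truncation.
    return [
        (
            int(i)
            if re.search("[KMB]", i) is None
            else round(float(i[:-1]) * 10 ** multipliers[i[-1]])
        )
        for i in column
    ]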


for col in yt_df.columns[:-1]:
    yt_df[col] = numeric_format(yt_df[col].tolist())

print(yt_df.dtypes)
yt_df.head()

VIDEOS          int64
SUBSCRIBERS     int64
VIEWS           int64
CATEGORY       object
dtype: object


Unnamed: 0,VIDEOS,SUBSCRIBERS,VIEWS,CATEGORY
0,799,274000000,50980000000,Entertainment
1,21120,267000000,257160000000,Music
2,1180,176000000,182880000000,Kids
3,138970,173000000,164710000000,Entertainment
4,1220,123000000,103500000000,Kids

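The suffix conversion can also be checked in isolation. The following is a minimal standalone sketch, not the notebook's own function: `parse_count` and `MULTIPLIERS` are hypothetical names, and the regular expression assumes every value is a decimal number optionally followed by K, M or B.

```python
import re

# Powers of ten for the K/M/B suffixes seen in the dataset.
MULTIPLIERS = {"K": 3, "M": 6, "B": 9}


def parse_count(text):
    """Convert strings such as '799', '21.12K' or '50.98B' to an integer."""
    match = re.fullmatch(r"([\d.]+)([KMB]?)", text.strip())
    if match is None:
        raise ValueError(f"Unrecognized count: {text!r}")
    number, suffix = match.groups()
    scale = 10 ** MULTIPLIERS.get(suffix, 0)
    # round() guards against float artifacts such as 21119.999999999996
    return round(float(number) * scale)


print([parse_count(s) for s in ["799", "21.12K", "274M", "257.16B"]])
```

Raising on unrecognized input, rather than silently passing it through, makes malformed rows visible early.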

Check how many categories there are


In [1108]:
print(f"There are {yt_df['CATEGORY'].nunique()} categories")
yt_df['CATEGORY'].value_counts()

There are 32 categories


CATEGORY
Entertainment              184
Music                      108
Kids                        66
Gaming/Entertainment        37
Education                   19
News                        15
Movies                      14
Animation                    5
Food                         4
Kids                         3
Sports/Entertainment         2
Sports                       2
Kids                         2
Gaming                       2
DIY                          2
Technology                   2
Beauty/Lifestyle             2
Kids                         2
Kids                         1
Music                        1
Politics                     1
Charity/Non-profit           1
Fitness/Health               1
Kids                         1
Kids                         1
toyoraljanahtv               1
Platform                     1
Kids                         1
DIY/Education                1
Travel/Entertainment         1
Education                    1
Arab Games Network           1

Some categories are repeated (e.g. Kids) and differ only by extra whitespace. Strip the whitespace so they merge.

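The effect of stripping can be seen on a handful of labels. This is a pure-Python sketch with made-up values; it mirrors what `Series.str.strip()` does to each element.

```python
from collections import Counter

# Hypothetical labels: the same category written with stray whitespace.
raw = ["Kids", "Kids ", " Kids", "Music", "Music "]

before = Counter(raw)                             # each spelling counted apart
after = Counter(label.strip() for label in raw)   # spellings merged

print(len(before), len(after), after)
```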

In [1109]:
yt_df['CATEGORY'] = yt_df['CATEGORY'].str.strip()
print(f"There are {yt_df['CATEGORY'].nunique()} categories")
yt_df['CATEGORY'].value_counts()

There are 23 categories


CATEGORY
Entertainment           184
Music                   109
Kids                     77
Gaming/Entertainment     37
Education                20
News                     15
Movies                   14
Animation                 5
Food                      4
Sports                    2
DIY                       2
Technology                2
Beauty/Lifestyle          2
Gaming                    2
Sports/Entertainment      2
Fitness/Health            1
toyoraljanahtv            1
Platform                  1
Charity/Non-profit        1
Politics                  1
Travel/Entertainment      1
DIY/Education             1
Arab Games Network        1
Name: count, dtype: int64

Encode the categories with LabelEncoder

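LabelEncoder assigns integers according to the sorted order of the distinct labels. The sketch below reproduces that mapping in plain Python for the 23 stripped categories listed above; `encode_labels` is a hypothetical helper, not part of scikit-learn.

```python
def encode_labels(labels):
    """Map each distinct label to its position in sorted order."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}


categories_seen = [
    "Entertainment", "Music", "Kids", "Gaming/Entertainment", "Education",
    "News", "Movies", "Animation", "Food", "Sports", "DIY", "Technology",
    "Beauty/Lifestyle", "Gaming", "Sports/Entertainment", "Fitness/Health",
    "toyoraljanahtv", "Platform", "Charity/Non-profit", "Politics",
    "Travel/Entertainment", "DIY/Education", "Arab Games Network",
]

mapping = encode_labels(categories_seen)
print(mapping["Entertainment"], mapping["Kids"], mapping["Music"])
```

The codes only reflect alphabetical order (uppercase before lowercase), so the numeric distance between two categories carries no meaning.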

In [1110]:
l_encoder = LabelEncoder()

categories = yt_df["CATEGORY"]

yt_df["CATEGORY"] = l_encoder.fit_transform(yt_df["CATEGORY"])

yt_df.head()

Unnamed: 0,VIDEOS,SUBSCRIBERS,VIEWS,CATEGORY
0,799,274000000,50980000000,7
1,21120,267000000,257160000000,14
2,1180,176000000,182880000000,12
3,138970,173000000,164710000000,7
4,1220,123000000,103500000000,12


### Handle outliers


In [1111]:
yt_df.describe()

Unnamed: 0,VIDEOS,SUBSCRIBERS,VIEWS,CATEGORY
count,485.0,485.0,485.0,485.0
mean,13508.16701,35142270.0,18904960000.0,10.2
std,44593.703938,23586550.0,20759980000.0,3.593382
min,1.0,20500000.0,1390.0,0.0
25%,546.0,23700000.0,8710000000.0,7.0
50%,1320.0,28400000.0,14640000000.0,11.0
75%,4390.0,38600000.0,22970000000.0,14.0
max,379590.0,274000000.0,257160000000.0,22.0


<b>Observation</b>: VIDEOS very likely contains outliers, given how far its mean (about 13,508) lies from its median (1,320) and its max (379,590). The min value of VIEWS (1,390) also looks suspiciously low.


Standardize the number of videos, subscribers and views

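Standardization replaces each value x with z = (x - mean) / std. StandardScaler divides by the population standard deviation (ddof=0), which is why this sketch uses `pstdev`. The input numbers are made up for illustration.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores using the population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical sample with mean 5 and population std 2.
z = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(z)
```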

In [1112]:
numeric_features = yt_df.iloc[:, :-1]
encoded_categories = yt_df["CATEGORY"]

ct = ColumnTransformer([("std_scaler", StandardScaler(), numeric_features.columns)])

scaled_numeric_features = pd.DataFrame(
    ct.fit_transform(numeric_features), columns=numeric_features.columns
)

scaled_numeric_features.head()

Unnamed: 0,VIDEOS,SUBSCRIBERS,VIEWS
0,-0.285293,10.137317,1.546637
1,0.170869,9.840232,11.488501
2,-0.276741,5.978117,7.906768
3,2.816348,5.850795,7.030623
4,-0.275843,3.728754,4.079117


In [1113]:
px.box(scaled_numeric_features)

Use the IsolationForest algorithm to detect outliers

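For intuition, here is a minimal sketch of IsolationForest on synthetic data: 200 Gaussian points plus one point placed far away. The data and the `random_state=0` argument are assumptions made for this example; fixing the seed simply makes the flagged set reproducible between runs.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: a Gaussian cloud plus one obviously distant point.
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(200, 2))
far_point = np.array([[10.0, 10.0]])
X_demo = np.vstack([inliers, far_point])

forest = IsolationForest(random_state=0).fit(X_demo)

# predict() returns 1 for inliers and -1 for outliers.
labels = forest.predict(X_demo)
print(labels[-1], int((labels == -1).sum()))
```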

In [1114]:
iso_forest = IsolationForest()
iso_forest.fit(scaled_numeric_features)

In [1115]:
outliers_pred = iso_forest.predict(scaled_numeric_features)
# predict() returns 1 for inliers and -1 for outliers
not_outliers = outliers_pred == 1
X_without_outliers = scaled_numeric_features[not_outliers]
outliers_pred = ~not_outliers
print(f"Outliers: {scaled_numeric_features[outliers_pred].shape}")
print(f"Not outliers: {X_without_outliers.shape}")

Outliers: (55, 3)
Not outliers: (430, 3)


Impute outliers with the median

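The idea for a single column can be sketched in plain Python: compute the median from the inliers only, then use it in place of every flagged value. `impute_with_median` is a hypothetical helper and the numbers are made up.

```python
from statistics import median


def impute_with_median(values, keep):
    """Replace values whose keep flag is False with the median of the kept ones."""
    fill = median(v for v, k in zip(values, keep) if k)
    return [v if k else fill for v, k in zip(values, keep)]


# The last value is flagged as an outlier and replaced by median(1, 2, 3) = 2.
imputed = impute_with_median([1, 2, 3, 100], [True, True, True, False])
print(imputed)
```

Because a whole row is flagged at once, every feature of an outlier row ends up at its column median.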

In [1116]:
# Keep only the inlier rows, reindex back to the full row range so the removed
# outlier rows reappear as NaN, then impute those NaN values with the column
# median. A copy of the original numeric features (outliers kept) is saved so
# models can also be trained without imputation for comparison.


def replace_outliers_median(df):
    imputer = SimpleImputer(strategy="median")
    new_numeric_features = df[not_outliers].reindex(
        list(range(scaled_numeric_features.index.min(), scaled_numeric_features.index.max() + 1)), fill_value=np.nan
    )
    return pd.DataFrame(imputer.fit_transform(new_numeric_features), columns=new_numeric_features.columns)

numeric_features_with_outliers = yt_df.iloc[:, :-1]

scaled_numeric_features = replace_outliers_median(scaled_numeric_features)

numeric_features = replace_outliers_median(numeric_features)

px.box(scaled_numeric_features)

### Explore data and visualization


In [1117]:
fig = px.scatter_matrix(numeric_features, numeric_features.columns, color=categories)
fig.update_layout(legend_title_text="Categories")
fig.show()

In [1118]:
for col in numeric_features.columns:
    fig = px.pie(
        numeric_features, values=col, names=categories, height=500, title=f"{col} PER CATEGORY"
    )
    fig.update_traces(textposition="inside")
    fig.update_layout(legend_title_text="Categories")
    fig.show()

In [1119]:
fig = px.histogram(numeric_features, x="SUBSCRIBERS", y="VIEWS", color=categories)
fig.update_layout(legend_title_text="Categories")


fig.show()

In [1120]:
fig = px.imshow(numeric_features.join(encoded_categories).corr(), text_auto=True, color_continuous_scale="ylorrd")
fig.show()

Some conclusions from the visualizations:

- Scatter matrix: in general, channels with more subscribers have fewer videos, and the number of videos does not seem to affect views much. There is some correlation between views and subscribers (more visible in the correlation matrix).
- Pie: the five categories with the largest share of each feature are Entertainment, Music, Kids, Gaming/Entertainment and Education, which are also the categories with the most channels.
- Histogram: channels with 22M to 32M subscribers account for a larger total of views than the rest.
- Correlation matrix: the strongest relation is between subscribers and views.

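`DataFrame.corr()` defaults to Pearson's correlation coefficient. As a reference for reading the matrix, this sketch computes it by hand; `pearson_r` is a hypothetical helper and the inputs are toy values.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


print(pearson_r([1, 2, 3], [2, 4, 6]), pearson_r([1, 2, 3], [3, 2, 1]))
```

Values near 1 or -1 indicate a strong linear relation, and values near 0 indicate little linear relation.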

# Objective

Train a model to predict the views that a channel can have based on the rest of the features


# Train and predict

The models used for the regression are regularized linear regression (Lasso/L1, Ridge/L2 and elastic net), decision tree, KNN and gradient boosting, each tuned with a cross-validated randomized search


In [1121]:
seed = 0

X_imputed = numeric_features.iloc[:, :2].join(encoded_categories)
y_imputed = numeric_features["VIEWS"]

X_train, X_test, y_train, y_test = train_test_split(X_imputed, y_imputed, random_state=seed)

def get_best_estimator(estimator, params, X, y):
    cv = RandomizedSearchCV(estimator, params, n_iter=100).fit(X, y)
    print(f"Best params: \n {cv.best_params_}")
    return cv.best_estimator_

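One possible variant, sketched here on synthetic data, runs the search on the training split only so that the test split stays unseen until the final score. The data from `make_regression`, the `cv=5` and `n_iter=20` settings and the fixed seeds are all assumptions for this example, not the notebook's configuration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the features and target.
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

search = RandomizedSearchCV(
    Ridge(),
    {"alpha": np.logspace(-5, 0, 50)},
    n_iter=20,
    cv=5,
    random_state=0,
)
# Cross-validation and the final refit both use the training rows only.
search.fit(X_tr, y_tr)

# The test rows are touched once, for the final score.
test_r2 = r2_score(y_te, search.best_estimator_.predict(X_te))
print(search.best_params_, test_r2)
```

Scores obtained this way tend to be lower than scores from a model that has already seen the test rows, but they are a fairer estimate of performance on new channels.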

In [1122]:
l1_l2_params = {"alpha": np.logspace(-5, 0, 50), "fit_intercept": [True]}
enet_params = {"alpha": np.logspace(-1, 1, 50), "fit_intercept": [True]}
knn_params = {"n_neighbors": np.arange(5, 30, 5)}
tree_params = {
    "criterion": ["squared_error", "absolute_error"],
    "max_depth": [None, *np.arange(2, 6)],
    "min_samples_split": [2, 8, 15],
    "min_samples_leaf": np.arange(1, 7),
}
gb_params = {
    "n_estimators": [100, 200],
    "learning_rate": [0.01, 0.1, 1],
    "max_depth": [2, 3, 4],
}

In [1123]:
ridge = get_best_estimator(Ridge(), l1_l2_params, X_imputed, y_imputed)

Best params: 
 {'fit_intercept': True, 'alpha': 1.0}


In [1124]:
lasso = get_best_estimator(Lasso(), l1_l2_params, X_imputed, y_imputed)

Best params: 
 {'fit_intercept': True, 'alpha': 1.0}


In [1125]:
enet = get_best_estimator(ElasticNet(), enet_params, X_imputed, y_imputed)

Best params: 
 {'fit_intercept': True, 'alpha': 10.0}


In [1126]:
knn = get_best_estimator(KNeighborsRegressor(), knn_params, X_imputed, y_imputed)

Best params: 
 {'n_neighbors': 10}


In [1127]:
tree = get_best_estimator(DecisionTreeRegressor(), tree_params, X_imputed, y_imputed)

Best params: 
 {'min_samples_split': 8, 'min_samples_leaf': 4, 'max_depth': 3, 'criterion': 'absolute_error'}


In [1128]:
gb = get_best_estimator(GradientBoostingRegressor(), gb_params, X_imputed, y_imputed)

Best params: 
 {'n_estimators': 200, 'max_depth': 3, 'learning_rate': 0.01}


In [1129]:
report = pd.DataFrame(columns=["Model", "MAE", "MSE", "R2"])

models = [
    {"name": "Ridge", "estimator": ridge},
    {"name": "Lasso", "estimator": lasso},
    {"name": "ElasticNet", "estimator": enet},
    {"name": "KNN", "estimator": knn},
    {"name": "Tree", "estimator": tree},
    {"name": "GradientBoosting", "estimator": gb},
]


def regression_report(name, estimator):
    y_pred = estimator.predict(X_test)
    report.loc[len(report)] = [
        name,
        mean_absolute_error(y_test, y_pred),
        mean_squared_error(y_test, y_pred),
        r2_score(y_test, y_pred),
    ]

In [1130]:
for model in models:
    regression_report(model["name"], model["estimator"])

report.sort_values("R2", ascending=False)

Unnamed: 0,Model,MAE,MSE,R2
5,GradientBoosting,5425932000.0,4.814282e+19,0.383522
3,KNN,5277900000.0,5.033752e+19,0.355419
4,Tree,5545902000.0,5.463535e+19,0.300384
2,ElasticNet,5903436000.0,5.668662e+19,0.274117
0,Ridge,5948407000.0,5.719494e+19,0.267608
1,Lasso,5948436000.0,5.719527e+19,0.267604

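For reference, the three columns of the report can be computed by hand. This is a plain-Python sketch with toy numbers; `regression_metrics` is a hypothetical helper that follows the usual definitions of MAE, MSE and R2.

```python
def regression_metrics(y_true, y_pred):
    """Return MAE, MSE and R2 for two equally long sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    abs_err = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    sq_err = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    total = sum((t - mean_true) ** 2 for t in y_true)
    return {"MAE": abs_err / n, "MSE": sq_err / n, "R2": 1 - sq_err / total}


metrics = regression_metrics([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
print(metrics)
```

MAE and MSE are expressed in units of views (and views squared), so they depend on the scale of the target. R2 is unitless and measures the share of the target's variance that the model explains.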

Now use the data without imputation (outliers kept)

In [1131]:
X = numeric_features_with_outliers.iloc[:, :2].join(encoded_categories)
y = numeric_features_with_outliers["VIEWS"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)

In [1132]:
ridge_with_outlier = get_best_estimator(Ridge(), l1_l2_params, X, y)

Best params: 
 {'fit_intercept': True, 'alpha': 1.0}


In [1133]:
lasso_with_outlier = get_best_estimator(Lasso(), l1_l2_params, X, y)

Best params: 
 {'fit_intercept': True, 'alpha': 1.0}


In [1134]:
enet_with_outlier = get_best_estimator(ElasticNet(), enet_params, X, y)

Best params: 
 {'fit_intercept': True, 'alpha': 10.0}


In [1135]:
knn_with_outlier = get_best_estimator(KNeighborsRegressor(), knn_params, X, y)

Best params: 
 {'n_neighbors': 10}


In [1136]:
tree_with_outlier = get_best_estimator(DecisionTreeRegressor(), tree_params, X, y)

Best params: 
 {'min_samples_split': 8, 'min_samples_leaf': 4, 'max_depth': 5, 'criterion': 'squared_error'}


In [1137]:
gb_with_outlier = get_best_estimator(GradientBoostingRegressor(), gb_params, X, y)

Best params: 
 {'n_estimators': 100, 'max_depth': 2, 'learning_rate': 0.1}


In [1138]:
models = [
    {"name": "Ridge", "estimator": ridge_with_outlier},
    {"name": "Lasso", "estimator": lasso_with_outlier},
    {"name": "ElasticNet", "estimator": enet_with_outlier},
    {"name": "KNN", "estimator": knn_with_outlier},
    {"name": "Tree", "estimator": tree_with_outlier},
    {"name": "GradientBoosting", "estimator": gb_with_outlier},
]


for model in models:
    regression_report(f'{model["name"]} (without impute)', model["estimator"])

report.sort_values("R2", ascending=False)

Unnamed: 0,Model,MAE,MSE,R2
11,GradientBoosting (without impute),6223114000.0,5.879555e+19,0.918547
8,ElasticNet (without impute),8370641000.0,1.774013e+20,0.754236
6,Ridge (without impute),8405711000.0,1.780957e+20,0.753274
7,Lasso (without impute),8405731000.0,1.780962e+20,0.753273
10,Tree (without impute),7669205000.0,1.991543e+20,0.7241
9,KNN (without impute),8474875000.0,2.928207e+20,0.594339
5,GradientBoosting,5425932000.0,4.814282e+19,0.383522
3,KNN,5277900000.0,5.033752e+19,0.355419
4,Tree,5545902000.0,5.463535e+19,0.300384
2,ElasticNet,5903436000.0,5.668662e+19,0.274117
0,Ridge,5948407000.0,5.719494e+19,0.267608
1,Lasso,5948436000.0,5.719527e+19,0.267604


# Conclusion

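Estimator performance can vary a lot depending on which rows end up in the train and test splits, but of the models tried here the best one for this dataset is gradient boosting trained on the data without imputation. Imputing the outliers significantly reduces the R2 score (0.38 versus 0.92 for gradient boosting), so in this case the outliers seem to carry real information that helps the estimator.

Gradient boosting predicts the views of a channel from its subscribers, videos and category with an R2 of <b>0.9185</b>, meaning it explains about 91.85% of the variance in views on the test split. Note that the hyperparameter search is fitted on the full dataset, which includes the test rows, so this score is likely optimistic.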