# Model Training and Testing: YouTube Views Prediction

Techniques:
- Linear Regression
- Decision Tree
- Random Forest
- Support Vector Regressor

Regularization:
- L1 (Lasso)
- L2 (Ridge)

Hyperparameter Tuning:
- Grid Search

Model Evaluation: MAE, RMSE, MAPE, R²
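
As a reference for the four metrics listed above, here is a minimal NumPy sketch of how each one can be computed by hand. The helper name `regression_metrics` and the toy values are assumptions made for illustration; they are not part of the original notebook or the dataset.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, MAPE (in percent), and R^2 for two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    # MAPE divides by the true value, so it is undefined when y_true contains 0.
    mape = np.mean(np.abs(errors / y_true)) * 100
    # R^2 compares the model's squared error with that of always predicting the mean.
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}

# Toy values chosen only for illustration (not taken from the dataset).
toy_metrics = regression_metrics([100, 200, 400], [110, 180, 400])
print(toy_metrics)
```

MAE and RMSE are in the same units as the target (views), while MAPE and R² are unitless, which is why the notebook reports all four side by side.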

In [8]:
import os  # needed for os.path.join when loading the CSV files
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler


## Load Dataset

In [9]:
input_dir = 'data_files'

X_train = pd.read_csv(os.path.join(input_dir, 'X_train.csv'))
X_test = pd.read_csv(os.path.join(input_dir, 'X_test.csv'))
y_train = pd.read_csv(os.path.join(input_dir, 'y_train.csv'))
y_test = pd.read_csv(os.path.join(input_dir, 'y_test.csv'))

In [13]:
print(X_train.shape, X_test.shape)  
print(y_train.shape, y_test.shape) 

(25753, 6) (11038, 6)
(25753, 1) (11038, 1)


In [14]:
y_train = y_train.squeeze()  # Convert DataFrame to Series if necessary
y_test = y_test.squeeze()  # Convert DataFrame to Series if necessary
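
To illustrate what `squeeze()` does in the cell above, here is a small sketch on a hypothetical one-column frame. The column name `views` and the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical one-column target frame, standing in for y_train.csv.
y_frame = pd.DataFrame({"views": [1200, 560, 98000]})
y_series = y_frame.squeeze()  # a single column collapses to a Series

print(type(y_frame).__name__, y_frame.shape)
print(type(y_series).__name__, y_series.shape)

# Edge case: a 1x1 frame collapses all the way to a scalar with plain squeeze().
single_row = pd.DataFrame({"views": [1200]})
as_scalar = single_row.squeeze()
# Restricting the squeeze to the column axis keeps a Series even for one row.
as_series = single_row.squeeze("columns")
```

Passing the target as a 1-D Series rather than an `(n, 1)` DataFrame matches the shape scikit-learn estimators expect for `y`.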

## Model Setup

In [10]:
# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Lasso (L1)": Lasso(),
    "Ridge (L2)": Ridge(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "Support Vector Regressor (SVR)": SVR()
}

## Hyperparameter Tuning

In [11]:
# Hyperparameter grid for different models
param_grid = {
    "Lasso (L1)": {"alpha": [0.1, 1, 10]},
    "Ridge (L2)": {"alpha": [0.1, 1, 10]},
    "Decision Tree": {"max_depth": [5, 10, 20], "min_samples_split": [2, 10]},
    "Random Forest": {"n_estimators": [50, 100, 200], "max_depth": [10, 20], "min_samples_split": [2, 10]},
    "Support Vector Regressor (SVR)": {"C": [0.1, 1, 10], "gamma": ["scale", "auto"]}
}
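
Before running a grid search it can help to estimate its cost. The sketch below uses scikit-learn's `ParameterGrid` to count the candidate combinations in the grid above and multiplies by the number of folds; `cv_folds = 5` is assumed to match the `cv=5` passed to `GridSearchCV` in the fitting cell.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above, repeated so the example is self-contained.
param_grid = {
    "Lasso (L1)": {"alpha": [0.1, 1, 10]},
    "Ridge (L2)": {"alpha": [0.1, 1, 10]},
    "Decision Tree": {"max_depth": [5, 10, 20], "min_samples_split": [2, 10]},
    "Random Forest": {"n_estimators": [50, 100, 200], "max_depth": [10, 20], "min_samples_split": [2, 10]},
    "Support Vector Regressor (SVR)": {"C": [0.1, 1, 10], "gamma": ["scale", "auto"]},
}
cv_folds = 5  # assumed to match cv=5 in the GridSearchCV call

fit_counts = {}
for name, grid in param_grid.items():
    n_candidates = len(ParameterGrid(grid))  # number of parameter combinations
    fit_counts[name] = n_candidates * cv_folds
    print(f"{name}: {n_candidates} candidates, {fit_counts[name]} cross-validation fits")
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set, so each total above is a lower bound by one.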

## Model Fitting and Evaluation

In [15]:
results = {}

for model_name, model in models.items():
    print(f"Training {model_name}...")
    
    # Apply Grid Search for hyperparameter tuning
    if model_name in param_grid:
        grid_search = GridSearchCV(model, param_grid[model_name], cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
        grid_search.fit(X_train, y_train)
        best_model = grid_search.best_estimator_
        print(f"Best Params for {model_name}: {grid_search.best_params_}")
    else:
        best_model = model
        best_model.fit(X_train, y_train)
    
    # Predict on test set
    y_pred = best_model.predict(X_test)
    
    # Evaluate the model
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100  # Mean Absolute Percentage Error
    r2 = r2_score(y_test, y_pred)
    
    # Store results
    results[model_name] = {
        "MAE": mae,
        "RMSE": rmse,
        "MAPE": mape,
        "R²": r2
    }

    # Print evaluation metrics
    print(f"{model_name} - MAE: {mae:.4f}, RMSE: {rmse:.4f}, MAPE: {mape:.4f}%, R²: {r2:.4f}")


Training Linear Regression...
Linear Regression - MAE: 649379.1516, RMSE: 1643316.7473, MAPE: 172.6316%, R²: 0.7449
Training Lasso (L1)...
Best Params for Lasso (L1): {'alpha': 10}
Lasso (L1) - MAE: 649378.0743, RMSE: 1643317.2348, MAPE: 172.6236%, R²: 0.7449
Training Ridge (L2)...
Best Params for Ridge (L2): {'alpha': 10}
Ridge (L2) - MAE: 649376.4393, RMSE: 1643317.4410, MAPE: 172.6172%, R²: 0.7449
Training Decision Tree...
Best Params for Decision Tree: {'max_depth': 10, 'min_samples_split': 2}
Decision Tree - MAE: 393327.9137, RMSE: 1082781.5350, MAPE: 65.0517%, R²: 0.8892
Training Random Forest...
Best Params for Random Forest: {'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 200}
Random Forest - MAE: 323269.3013, RMSE: 880511.6786, MAPE: 56.9286%, R²: 0.9268
Training Support Vector Regressor (SVR)...
Best Params for Support Vector Regressor (SVR): {'C': 10, 'gamma': 'auto'}
Support Vector Regressor (SVR) - MAE: 924260.0328, RMSE: 3336373.1633, MAPE: 160.7385%, R²: -0.0516
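
`StandardScaler` is imported at the top of the notebook but never applied in the loop above, and SVR is sensitive to feature scale. Whether the CSV features were already scaled upstream is not shown here, so the following is only a sketch of one possible follow-up: wrapping the scaler and `SVR` in a pipeline so that scaling is fitted inside each cross-validation fold. The data is synthetic and purely illustrative, and the names `X_demo`, `y_demo`, `svr_pipeline`, and `svr_grid` are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data with features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 1_000.0, 1_000_000.0])
y_demo = X_demo[:, 0] + X_demo[:, 1] / 1_000.0 + rng.normal(scale=0.1, size=200)

# make_pipeline names each step after its lowercased class name,
# so SVR parameters are addressed with the "svr__" prefix.
svr_pipeline = make_pipeline(StandardScaler(), SVR())
svr_grid = {"svr__C": [0.1, 1, 10], "svr__gamma": ["scale", "auto"]}

search = GridSearchCV(svr_pipeline, svr_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)
print(search.best_params_)

demo_predictions = search.best_estimator_.predict(X_demo[:5])
```

Putting the scaler inside the pipeline means its mean and variance are learned from the training folds only, which avoids leaking validation-fold statistics into the search. This sketch does not claim that scaling alone would fix the SVR result reported above; the scale of the target (view counts in the millions) may also matter.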

## Results


| **Model**                       | **MAE**           | **RMSE**         | **MAPE**          | **R² Score**   |
|----------------------------------|-------------------|------------------|-------------------|----------------|
| **Linear Regression**            | 649,379.15        | 1,643,316.75     | 172.63%           | 0.7449         |
| **Lasso (L1)**                   | 649,378.07        | 1,643,317.23     | 172.62%           | 0.7449         |
| **Ridge (L2)**                   | 649,376.44        | 1,643,317.44     | 172.62%           | 0.7449         |
| **Decision Tree**                | 393,327.91        | 1,082,781.54     | 65.05%            | 0.8892         |
| **Random Forest**                | 323,269.30        | 880,511.68       | 56.93%            | 0.9268         |
| **Support Vector Regressor (SVR)** | 924,260.03        | 3,336,373.16     | 160.74%           | -0.0516        |
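
A table like the one above can also be produced directly from the `results` dictionary built in the training loop. The sketch below repeats the reported metric values so that it runs on its own; the names `results_df` and `best_model_name` are assumptions for illustration.

```python
import pandas as pd

# Metric values copied from the training output reported in this notebook.
results = {
    "Linear Regression": {"MAE": 649379.1516, "RMSE": 1643316.7473, "MAPE": 172.6316, "R²": 0.7449},
    "Lasso (L1)": {"MAE": 649378.0743, "RMSE": 1643317.2348, "MAPE": 172.6236, "R²": 0.7449},
    "Ridge (L2)": {"MAE": 649376.4393, "RMSE": 1643317.4410, "MAPE": 172.6172, "R²": 0.7449},
    "Decision Tree": {"MAE": 393327.9137, "RMSE": 1082781.5350, "MAPE": 65.0517, "R²": 0.8892},
    "Random Forest": {"MAE": 323269.3013, "RMSE": 880511.6786, "MAPE": 56.9286, "R²": 0.9268},
    "Support Vector Regressor (SVR)": {"MAE": 924260.0328, "RMSE": 3336373.1633, "MAPE": 160.7385, "R²": -0.0516},
}

# One row per model, sorted so the lowest-RMSE model comes first.
results_df = pd.DataFrame.from_dict(results, orient="index").sort_values("RMSE")
print(results_df.round(2))

best_model_name = results_df.index[0]
print(f"Lowest RMSE: {best_model_name}")
```

Sorting by RMSE is one choice among several; sorting by MAE or R² gives the same top two models for these numbers.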


## Conclusion

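**Best Model**: The **Random Forest** Regressor performed best, with the highest R² score (0.9268) and the lowest MAE (323,269.30), RMSE (880,511.68), and MAPE (56.93%). Even so, a MAPE near 57% means relative errors on individual videos remain large. <br>
**Second Best**: The **Decision Tree** Regressor also performed well, with an R² score of 0.8892 and clearly lower errors than the linear models (MAE 393,327.91 versus about 649,378). <br>
**Linear Models (Linear, Lasso, Ridge)**: These three produced nearly identical results (R² 0.7449, MAPE about 172.6%), so regularization had a negligible effect here even at the largest alpha in the grid (10). They explained less variance than the tree-based models and had higher MAE, RMSE, and MAPE. <br>
**SVR**: This performed poorly. Its negative R² (-0.0516) means it did worse on the test set than always predicting the mean, so it is not recommended for this dataset as configured.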
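
To make the reading of a negative R² concrete, here is a small sketch on toy numbers (not taken from the dataset). A model that always predicts the mean of the test targets scores exactly 0, and any model with larger squared error than that baseline scores below 0. The variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy, heavily skewed targets chosen only for illustration.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Baseline 1: always predict the mean of the targets.
mean_pred = np.full_like(y_true, y_true.mean())
# Baseline 2: always predict the median, which ignores the large outlier.
median_pred = np.full_like(y_true, np.median(y_true))

r2_mean = r2_score(y_true, mean_pred)
r2_median = r2_score(y_true, median_pred)
print(f"R² of mean predictor:   {r2_mean:.4f}")
print(f"R² of median predictor: {r2_median:.4f}")
```

On skewed targets such as view counts, a near-constant prediction that sits below the mean can therefore yield a slightly negative R², which is one possible reading of the SVR result rather than a confirmed diagnosis.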