# 📊 Modelling Task – Classification
**Goal**: Predict video popularity using three different models:
- Baseline: Dummy classifier (always predicts the most frequent class)
- Model 1: Random Forest Classifier
- Model 2: Support Vector Machine (SVM)

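The Random Forest and SVM use the same features: topic, language, duration, and publication hour. The baseline ignores its input features and only looks at the class frequencies in the training labels.
The target variable is whether the video is trending (views above the dataset median).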

## 1️⃣ Baseline Model

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the uploaded CSV file
file_path = "processed_data.csv"
df = pd.read_csv(file_path)

# Step 1: Create the target variable (Trending: 1 if Views > median, else 0)
median_views = df['Views'].median()
df['Trending'] = (df['Views'] > median_views).astype(int)

# Step 2: Define features and target
X = df.drop(columns=['Video ID', 'Title', 'Views', 'Publication Time', 'Region', 'Trending'])
y = df['Trending']

# Step 3: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Null model (predicts most frequent class)
null_model = DummyClassifier(strategy="most_frequent")
null_model.fit(X_train, y_train)
y_pred = null_model.predict(X_test)

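# Step 5: Evaluate
# Note: the null model never predicts class 0, so precision for class 0 is
# undefined; scikit-learn reports it as 0.00 and emits a warning.
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)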

# Create a DataFrame for the confusion matrix
conf_matrix_df = pd.DataFrame(conf_matrix, index=['Actual 0', 'Actual 1'], columns=['Predicted 0', 'Predicted 1'])

# Print the results in a readable format
print(f"Model Accuracy: {accuracy * 100:.4f}%")  # Prints accuracy as a percentage
print("\nClassification Report:\n", report)
print("\nConfusion Matrix:\n", conf_matrix_df)


Model Accuracy: 49.7312%

Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00       187
           1       0.50      1.00      0.66       185

    accuracy                           0.50       372
   macro avg       0.25      0.50      0.33       372
weighted avg       0.25      0.50      0.33       372


Confusion Matrix:
           Predicted 0  Predicted 1
Actual 0            0          187
Actual 1            0          185
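
The 49.73% figure follows directly from the test-set class counts: the null model predicted class 1 for every row, and 185 of the 372 test videos belong to class 1. The short check below reproduces that arithmetic. The variable names are illustrative, and the counts are copied from the confusion matrix above rather than recomputed from the data.

```python
# Test-set class counts, copied from the confusion matrix above.
n_class0, n_class1 = 187, 185
n_test = n_class0 + n_class1

# The most-frequent strategy predicted class 1 for every test row,
# so it is correct exactly on the rows whose true label is 1.
baseline_accuracy = n_class1 / n_test
print(f"Always-1 accuracy: {baseline_accuracy:.4%}")

# For comparison: always predicting class 0 on this particular split.
always0_accuracy = n_class0 / n_test
print(f"Always-0 accuracy: {always0_accuracy:.4%}")
```

The baseline lands slightly below 50% because the majority class was chosen on the training split, while class 0 happens to be marginally more common in the test split.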




## 2️⃣ Model 1 – Random Forest Classifier

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Load dataset
df = pd.read_csv("processed_data.csv")

# Preprocessing
df['Views'] = pd.to_numeric(df['Views'], errors='coerce')
df['Duration'] = pd.to_numeric(df['Duration'], errors='coerce')
df['Publication Time'] = pd.to_datetime(df['Publication Time'], errors='coerce')
df.dropna(subset=['Views', 'Duration', 'Publication Time'], inplace=True)

# Create target
df['Trending'] = (df['Views'] > df['Views'].median()).astype(int)

# Features
topic_cols = [col for col in df.columns if col.startswith('Topic_')]
lang_cols = [col for col in df.columns if col.startswith('Language_')]
df['Hour'] = df['Publication Time'].dt.hour

X = df[topic_cols + lang_cols + ['Duration', 'Hour']]
y = df['Trending']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Prediction and evaluation
y_pred_rf = rf_model.predict(X_test)
print("✅ Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("\n🔍 Classification Report:\n", classification_report(y_test, y_pred_rf))
print("\n📊 Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))


✅ Random Forest Accuracy: 0.6290322580645161

🔍 Classification Report:
               precision    recall  f1-score   support

           0       0.62      0.67      0.65       187
           1       0.64      0.58      0.61       185

    accuracy                           0.63       372
   macro avg       0.63      0.63      0.63       372
weighted avg       0.63      0.63      0.63       372


📊 Confusion Matrix:
 [[126  61]
 [ 77 108]]
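
As a cross-check, the accuracy and the class 1 row of the report can be derived by hand from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predictions); `binary_metrics` is a hypothetical helper written for this illustration, not part of the notebook or of scikit-learn.

```python
def binary_metrics(cm):
    """Return accuracy plus precision/recall/F1 for class 1 from a 2x2 matrix.

    cm uses scikit-learn's layout: rows are true labels, columns are predictions.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Random Forest confusion matrix, copied from the output above.
rf_cm = [[126, 61], [77, 108]]
rf_metrics = binary_metrics(rf_cm)
for name, value in rf_metrics.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, these values agree with the class 1 row of the classification report (precision 0.64, recall 0.58, F1 0.61) and the overall accuracy of 0.63.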


## 3️⃣ Model 2 – Support Vector Machine (SVM)

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset
df = pd.read_csv("processed_data.csv")

# Preprocessing
df['Views'] = pd.to_numeric(df['Views'], errors='coerce')
df['Duration'] = pd.to_numeric(df['Duration'], errors='coerce')
df['Publication Time'] = pd.to_datetime(df['Publication Time'], errors='coerce')
df.dropna(subset=['Views', 'Duration', 'Publication Time'], inplace=True)

# Create target
df['Trending'] = (df['Views'] > df['Views'].median()).astype(int)

# Features
topic_cols = [col for col in df.columns if col.startswith('Topic_')]
lang_cols = [col for col in df.columns if col.startswith('Language_')]
df['Hour'] = df['Publication Time'].dt.hour

X = df[topic_cols + lang_cols + ['Duration', 'Hour']]
y = df['Trending']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling (important for SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# SVM model (scikit-learn defaults: RBF kernel, C=1.0)
svm_model = SVC()
svm_model.fit(X_train_scaled, y_train)

# Predict
y_pred_svm = svm_model.predict(X_test_scaled)

# Evaluation
print("✅ SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("\n🔍 Classification Report:\n", classification_report(y_test, y_pred_svm))
print("\n📊 Confusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))


✅ SVM Accuracy: 0.6182795698924731

🔍 Classification Report:
               precision    recall  f1-score   support

           0       0.62      0.64      0.63       187
           1       0.62      0.60      0.61       185

    accuracy                           0.62       372
   macro avg       0.62      0.62      0.62       372
weighted avg       0.62      0.62      0.62       372


📊 Confusion Matrix:
 [[119  68]
 [ 74 111]]


# 📈 Evaluation Summary & Observations

### 📌 Model Comparison Summary:
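- **Baseline (most-frequent class)**: Scored 49.73% accuracy on the test set, which sets the floor a useful model has to beat. It predicts class 1 for every video, so recall for class 0 is 0.00.
- **Random Forest**: Reached 62.90% accuracy, about 13 percentage points above the baseline. Recall is higher for class 0 (0.67) than for class 1 (0.58).
- **Support Vector Machine (SVM)**: Reached 61.83% accuracy on standardized features, close to the Random Forest and slightly more balanced across classes (recall 0.64 and 0.60). It might be computationally expensive on larger datasets.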


After training and evaluating the three classifiers (**baseline model**, **Random Forest Classifier**, and **Support Vector Machine (SVM)**), we compared their performance on the 372-video test set using the following metrics:

---

- **Accuracy**:
  - Baseline model: `49.73%`
  - Random Forest: `62.90%`
  - SVM: `61.83%`
  - ✅ Highest accuracy was achieved by **Random Forest**, ahead of the SVM by about one percentage point.
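
To put the gap between the two trained models in perspective, the sketch below recomputes each accuracy from the confusion-matrix counts reported earlier and attaches a rough binomial standard error. This is an illustrative back-of-the-envelope check, not part of the original analysis: it treats each model's test accuracy as an independent proportion, whereas both models were scored on the same 372 videos, so a paired comparison on per-video predictions would be more appropriate.

```python
import math

n_test = 372

# Correct predictions per model, taken from the confusion matrices above
# (diagonal entries: true negatives + true positives).
correct = {
    "Baseline": 0 + 185,
    "Random Forest": 126 + 108,
    "SVM": 119 + 111,
}

accuracy = {}
std_err = {}
for name, k in correct.items():
    accuracy[name] = k / n_test
    # Binomial standard error of a proportion estimated from n_test samples.
    std_err[name] = math.sqrt(accuracy[name] * (1 - accuracy[name]) / n_test)
    print(f"{name}: {accuracy[name]:.4f} +/- {std_err[name]:.4f}")

# Difference between the two trained models, in videos and in accuracy.
gap_videos = correct["Random Forest"] - correct["SVM"]
gap_accuracy = gap_videos / n_test
print(f"Random Forest vs SVM: {gap_videos} videos, {gap_accuracy:.4f} accuracy")
```

The Random Forest classifies four more test videos correctly than the SVM, a gap smaller than one standard error of either estimate, so this single split does not clearly separate the two. Both sit well above the baseline.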