# 4. Data Modeling

## 4.1. RandomForestRegressor Model

In [14]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import pandas as pd

# Load the dataset (the path is kept separate from the DataFrame)
data_path = "./data/tvshows.csv"
data = pd.read_csv(data_path)

# Select features and target
X = data[['Runtime', 'Number of Votes', 'Emmys']]  # Input columns
y = data['Rating']  # Output column (target)

# Split the dataset into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the RandomForest model
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print evaluation metrics
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R² of RandomForest model: {r2:.4f}")

# Extract feature importances
feature_importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
}).sort_values(by='Importance', ascending=False)

# Print feature importances
print("Feature Importances:")
print(feature_importances)


Mean Squared Error (MSE): 0.5765
Mean Absolute Error (MAE): 0.5616
R² of RandomForest model: 0.0484
Feature Importances:
           Feature  Importance
1  Number of Votes    0.540540
0          Runtime    0.382493
2            Emmys    0.076967


### 4.1.1. Model Description

The **RandomForestRegressor** model is an ensemble of decision trees. By default, each tree is trained on a bootstrap sample of the training data, and the forest's prediction is the average of the individual trees' predictions. Random forests handle **non-linear relationships** between features well, but here the model reaches a very low **R² value** (0.0484) on the test set, so it explains little of the variability in ratings.
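The averaging step can be checked directly. The following is a minimal sketch on synthetic data from `make_regression` (an assumption for illustration, not the TV shows dataset); the names `X_demo`, `y_demo`, and `forest` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: NOT the TV shows dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=3, noise=10.0, random_state=42
)

forest = RandomForestRegressor(n_estimators=25, random_state=42)
forest.fit(X_demo, y_demo)

# Predict with each fitted tree, then average across trees
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# Compare the manual average with the forest's own prediction
print(np.allclose(manual_average, forest.predict(X_demo)))
```

The fitted trees are exposed through `estimators_`, which also makes it possible to inspect how much the individual trees disagree on a given sample.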

### 4.1.2. Performance Analysis

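**R² = 0.0484**: The model explains only **4.84%** of the variance in the target variable on the test set. Equivalently, its MSE (0.5765) is only about 5% lower than that of always predicting the test-set mean rating. This suggests that the three input features carry little information about the rating, or that important relationships are not being captured by the model.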
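To make the link between R² and a mean-only baseline concrete, here is a small sketch. The arrays `y_true` and `y_hat` are hypothetical values chosen for illustration; only the last part uses the MSE and R² reported above.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical ratings and predictions, for illustration only
y_true = np.array([7.0, 8.0, 6.5, 9.0, 7.5])
y_hat = np.array([7.4, 7.6, 7.5, 7.8, 7.7])

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot
print(f"manual R²: {r2_manual:.4f}, sklearn R²: {r2_score(y_true, y_hat):.4f}")

# Always predicting the mean of y_true scores R² = 0 by construction
mean_baseline = np.full_like(y_true, y_true.mean())
print(f"mean-baseline R²: {r2_score(y_true, mean_baseline):.4f}")

# Same identity applied to the reported metrics:
# MSE = (1 - R²) * Var(y_test), so Var(y_test) = MSE / (1 - R²)
reported_mse, reported_r2 = 0.5765, 0.0484
implied_baseline_mse = reported_mse / (1.0 - reported_r2)
print(f"implied mean-baseline MSE: {implied_baseline_mse:.4f}")
```

The implied baseline MSE is what a constant prediction at the test-set mean would score, which gives a reference point for judging the reported MSE.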

#### Important Features:
- **Number of Votes**: 0.5405
- **Runtime**: 0.3825
- **Emmys**: 0.0770
 
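The model's splits rely mostly on **Number of Votes** and **Runtime**. These importances are relative shares that sum to 1 regardless of how well the model fits, so with an R² this low they do not show that either feature has a significant impact on a show's rating.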
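One way to cross-check impurity-based importances is permutation importance on held-out data. The sketch below is an illustration under stated assumptions: the data is randomly generated, the target is pure noise by construction, and the column names are reused only to mirror the feature set above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): the target is unrelated to the features
rng = np.random.default_rng(42)
n = 300
X_demo = pd.DataFrame({
    "Runtime": rng.integers(20, 90, size=n),
    "Number of Votes": rng.integers(100, 100_000, size=n),
    "Emmys": rng.integers(0, 5, size=n),
})
y_demo = rng.normal(loc=7.5, scale=0.8, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)

# Impurity-based importances are normalized to sum to 1, even for a noise target
print("impurity importances sum:", forest.feature_importances_.sum())

# Permutation importance: drop in test R² when one column is shuffled
perm = permutation_importance(
    forest, X_te, y_te, n_repeats=10, random_state=42, scoring="r2"
)
print(pd.DataFrame({
    "Feature": X_demo.columns,
    "Impurity": forest.feature_importances_,
    "Permutation (mean)": perm.importances_mean,
}))
```

Because permutation importance is measured as a change in the test score, it is tied to how well the model actually predicts, unlike the normalized impurity shares.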

### 4.1.3. Optimization

To improve the performance of **RandomForestRegressor**, the following hyperparameters can be optimized:

- **n_estimators**: The number of trees in the forest. Increasing this value may improve model performance, but also increases computation time. After a certain number of trees, performance improvements may plateau.
- **max_depth**: The maximum depth of each tree. Limiting this depth can help reduce overfitting.
- **min_samples_split**: The minimum number of samples required to split a node. Increasing this value can reduce model complexity.
- **min_samples_leaf**: The minimum number of samples required at each leaf. This helps reduce overfitting by avoiding overly detailed splits.
- **max_features**: The maximum number of features to consider at each split. Limiting the number of features can reduce model complexity.
- **bootstrap**: Specifies whether each tree is trained on a bootstrap sample (sampling with replacement). With **False**, every tree is built on the full training set. Try both **True** and **False** to see the impact on results.
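A search over the hyperparameters listed above could be sketched with `RandomizedSearchCV`. The candidate values in `param_distributions`, the number of iterations, and the synthetic data are all assumptions chosen to keep the example small, not tuned recommendations for the TV shows dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption: NOT the TV shows dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=3, noise=15.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Hypothetical candidate values for each hyperparameter discussed above
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": [1.0, "sqrt", 0.5],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,        # number of sampled configurations
    cv=3,             # 3-fold cross-validation on the training split
    scoring="r2",
    random_state=42,
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated R²: {search.best_score_:.4f}")
print(f"Test R² of refit model: {search.score(X_te, y_te):.4f}")
```

Selecting hyperparameters by cross-validation on the training split keeps the test set untouched until the final evaluation, so the reported test R² stays an honest estimate.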