# 4. Data Modeling

## 4.1. RandomForestRegressor Model

In [14]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import pandas as pd

data_path = "./data/tvshows.csv"
data = pd.read_csv(data_path)

# Select features and target
X = data[['Runtime', 'Number of Votes', 'Emmys']]  # Input columns
y = data['Rating']  # Output column (target)

# Split the dataset into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the RandomForest model
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print evaluation metrics
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R² of RandomForest model: {r2:.4f}")

# Extract feature importances
feature_importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
}).sort_values(by='Importance', ascending=False)

# Print feature importances
print("Feature Importances:")
print(feature_importances)


Mean Squared Error (MSE): 0.5765
Mean Absolute Error (MAE): 0.5616
R² of RandomForest model: 0.0484
Feature Importances:
           Feature  Importance
1  Number of Votes    0.540540
0          Runtime    0.382493
2            Emmys    0.076967


### 4.1.1. Model Description

The **RandomForestRegressor** model trains many decision trees, each on a bootstrap sample of the training data, and averages their predictions. Although random forests can model **non-linear relationships** between features, this model has a very low **R² value** (0.0484) on the test set, indicating that it explains little of the variability in ratings.

### 4.1.2. Performance Analysis

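**R² = 0.0484**: This model explains only **4.84%** of the variance in the target variable on the test set. This suggests either that the three input features carry little information about ratings or that the default, unconstrained trees fit noise in the training data.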
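To make the R² figure concrete, the following sketch computes it from its definition, 1 − SS_res / SS_tot, and checks the result against `r2_score`. The arrays are illustrative values only and are not taken from `tvshows.csv`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values only; these are not from tvshows.csv
y_true = np.array([7.0, 8.0, 6.5, 9.0, 7.5])
y_hat = np.array([7.4, 7.6, 7.2, 8.0, 7.6])

ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

# A baseline that always predicts the mean of y_true scores R² = 0
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)

print(f"manual R²:   {r2_manual:.4f}")
print(f"sklearn R²:  {r2_score(y_true, y_hat):.4f}")
print(f"baseline R²: {r2_baseline:.4f}")
```

Read this way, an R² close to zero means the model is only slightly better than predicting the average rating for every show.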

#### Important Features:
- **Number of Votes**: 0.5405
- **Runtime**: 0.3825
- **Emmys**: 0.0770
 
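**Number of Votes** and **Runtime** account for most of the model's splits, but these values only rank the features relative to one another inside this model. Given the very low R², they should not be read as evidence that either feature has a strong effect on a show's rating.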
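A common cross-check for impurity-based importances is permutation importance measured on the held-out test set. The sketch below shows the idea on a synthetic stand-in for the dataset: the column names match the notebook, but the values and the target are made up, so the printed numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for tvshows.csv: same column names, made-up values (assumption)
rng = np.random.default_rng(42)
n_rows = 300
X_demo = pd.DataFrame({
    "Runtime": rng.integers(20, 90, size=n_rows),
    "Number of Votes": rng.integers(1_000, 500_000, size=n_rows),
    "Emmys": rng.integers(0, 10, size=n_rows),
})
# Made-up target: only Emmys carries signal, the rest is noise
y_demo = 7.5 + 0.05 * X_demo["Emmys"] + rng.normal(0, 0.7, size=n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Impurity-based importances: derived from the training data, normalized to sum to 1
impurity = forest.feature_importances_

# Permutation importances: drop in test-set R² when one column is shuffled
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)

importance_table = pd.DataFrame({
    "Feature": X_demo.columns,
    "Impurity": impurity,
    "Permutation (test)": perm.importances_mean,
})
print(importance_table)
```

A feature whose permutation importance is near zero or negative on the test set contributes little to out-of-sample predictions, whatever its impurity-based score.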

### 4.1.3. Optimization

To improve the performance of **RandomForestRegressor**, the following hyperparameters can be optimized:

- **n_estimators**: The number of trees in the forest. Increasing this value may improve model performance, but also increases computation time. After a certain number of trees, performance improvements may plateau.
- **max_depth**: The maximum depth of each tree. Limiting this depth can help reduce overfitting.
- **min_samples_split**: The minimum number of samples required to split a node. Increasing this value can reduce model complexity.
- **min_samples_leaf**: The minimum number of samples required at each leaf. This helps reduce overfitting by avoiding overly detailed splits.
- **max_features**: The maximum number of features to consider at each split. Considering fewer features per split makes the trees less correlated with one another, which can reduce the variance of the averaged prediction.
- **bootstrap**: Specifies whether to use bootstrap sampling (sampling with replacement). Try both **True** and **False** to see the impact on results.
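The hyperparameters listed above could be searched with `RandomizedSearchCV`, as in the minimal sketch below. It runs on synthetic data from `make_regression` rather than on `tvshows.csv`, and the candidate values are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the training split (assumption, not tvshows.csv)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=3, noise=10.0, random_state=42
)

# Candidate values are illustrative; sensible ranges depend on the dataset
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", 1.0],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,         # number of sampled combinations; raise for a wider search
    cv=3,             # 3-fold cross-validation
    scoring="r2",
    random_state=42,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated R²: {search.best_score_:.4f}")
```

In the notebook's setting, the search would be fitted on `X_train` and `y_train` only, with the test set kept aside for the final evaluation.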

## 4.2. Linear Regression Model

In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import pandas as pd

data_path = "./data/tvshows.csv"
data = pd.read_csv(data_path)

# Select features and target
X = data[['Runtime', 'Number of Votes', 'Emmys']]  # Input columns
y = data['Rating']  # Output column (target)

# Split the dataset into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print evaluation metrics
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R² of Linear Regression model: {r2:.4f}")

# Coefficients and Intercept
print("Coefficients and Intercept:")
print(f"Intercept: {model.intercept_}")
for feature, coef in zip(X.columns, model.coef_):
    print(f"Feature: {feature}, Coefficient: {coef:.4f}")

Mean Squared Error (MSE): 0.5337
Mean Absolute Error (MAE): 0.5420
R² of Linear Regression model: 0.1190
Coefficients and Intercept:
Intercept: 7.758484998183272
Feature: Runtime, Coefficient: -0.0016
Feature: Number of Votes, Coefficient: 0.0000
Feature: Emmys, Coefficient: 0.0132


### 4.2.1. Model Description

**Linear Regression** is a basic yet powerful model that fits a linear function (a hyperplane when there are several features) to describe the relationship between the features and the target variable. This model is easy to understand and implement, but it only works well when the relationship between features and the target is approximately **linear**.

### 4.2.2. Performance Analysis

**R² = 0.1190**: This model explains about **11.9%** of the variance, indicating that a linear combination of these three features captures only a small part of what determines a show's rating.

#### Coefficients:
- **Intercept**: 7.7585
- **Runtime**: -0.0016
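- **Number of Votes**: 0.0000 (rounded to four decimals; the per-vote coefficient is too small to show at this precision, which does not by itself mean the feature has no effect)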
- **Emmys**: 0.0132
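One way to inspect a coefficient that prints as zero is to show it in scientific notation, or to refit on standardized features so that all coefficients share a scale. The sketch below uses made-up data in which vote counts are large and the per-vote effect is tiny; both the data and the effect size are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up data: vote counts on a large scale with a small per-vote effect (assumption)
rng = np.random.default_rng(0)
n_rows = 300
X_demo = pd.DataFrame({
    "Runtime": rng.integers(20, 90, size=n_rows).astype(float),
    "Number of Votes": rng.integers(1_000, 500_000, size=n_rows).astype(float),
})
y_demo = 7.0 + 2e-6 * X_demo["Number of Votes"] + rng.normal(0, 0.3, size=n_rows)

raw_model = LinearRegression().fit(X_demo, y_demo)
votes_coef = raw_model.coef_[1]
print(f"4 decimals:          {votes_coef:.4f}")  # rounds to zero
print(f"scientific notation: {votes_coef:.3e}")  # shows the actual magnitude

# Refit on standardized features: each coefficient is now the change in the
# target per one standard deviation of the feature, so they are comparable
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_model = LinearRegression().fit(X_scaled, y_demo)
for name, coef in zip(X_demo.columns, scaled_model.coef_):
    print(f"{name}: {coef:.4f} per standard deviation")
```

Standardizing does not change the model's predictions or its R²; it only makes the coefficients easier to compare.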

### 4.2.3. Optimization

**Linear regression** can be optimized through the following techniques:

- **Regularization**: Use **Ridge Regression** (L2 regularization) or **Lasso Regression** (L1 regularization) to avoid overfitting and improve model performance when dealing with many features.
- **Feature Selection**: Techniques like **Recursive Feature Elimination (RFE)** or **Lasso** can help select the most important features, reducing complexity and improving performance.
- **Polynomial Features**: If the relationship between the features and the target is not fully linear, adding **polynomial features** can improve the model.
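The techniques above can be combined in a single scikit-learn `Pipeline`. The following sketch searches over the polynomial degree and the Ridge penalty `alpha` with cross-validation; the data is synthetic with a deliberately non-linear target, and the grid values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Made-up data with a mildly non-linear target (assumption, not tvshows.csv)
rng = np.random.default_rng(42)
X_demo = rng.uniform(-2, 2, size=(200, 3))
y_demo = (
    7.5
    + 0.4 * X_demo[:, 0] ** 2
    - 0.3 * X_demo[:, 1]
    + rng.normal(0, 0.3, size=200)
)

pipeline = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),  # adds powers and interactions
    ("scale", StandardScaler()),                       # puts all terms on one scale
    ("ridge", Ridge()),                                # L2-regularized linear model
])

# Illustrative grid: polynomial degree and regularization strength
param_grid = {
    "poly__degree": [1, 2, 3],
    "ridge__alpha": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated R²: {search.best_score_:.4f}")
```

Swapping `Ridge` for `Lasso` in the same pipeline gives L1 regularization, which can also drive some coefficients to exactly zero.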

## 4.3. Support Vector Regression (SVR) Model

In [16]:
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import pandas as pd

data_path = "./data/tvshows.csv"
data = pd.read_csv(data_path)

# Select features and target
X = data[['Runtime', 'Number of Votes', 'Emmys']]  # Input columns
y = data['Rating']  # Output column (target)

# Split the dataset into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Support Vector Regression model
model = SVR(kernel='rbf')
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print evaluation metrics
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R² of SVR model: {r2:.4f}")


Mean Squared Error (MSE): 0.5419
Mean Absolute Error (MAE): 0.5415
R² of SVR model: 0.1054


### 4.3.1. Model Description

**Support Vector Regression (SVR)** is a powerful model for handling **non-linear data**. With a non-linear kernel such as RBF, SVR implicitly maps the features into a high-dimensional space and fits a function that ignores errors smaller than **epsilon**. The fitted function depends only on a subset of the training points, the **support vectors**.

### 4.3.2. Performance Analysis

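**R² = 0.1054**: SVR performs similarly to **Linear Regression**, with a low R² indicating that it does not explain much of the variance in the data. Note that the features were not scaled before fitting, and SVR with an RBF kernel is sensitive to feature scale, so this result may understate what the model can do.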

### 4.3.3. Optimization

To improve the performance of **SVR**, the following hyperparameters can be optimized:

- **C (Regularization Parameter)**: Larger values of **C** penalize training errors more heavily, which lets the model fit the training data more closely but may lead to overfitting if the value is too high.
- **kernel**: Try different kernels such as **linear**, **polynomial** (`poly`), and **RBF** (`rbf`) to find the most suitable one for the data.
- **epsilon**: This parameter sets the width of the tube around the prediction within which errors are not penalized. Increasing **epsilon** tolerates larger errors and usually yields fewer support vectors (a simpler model), while decreasing it makes the model fit the training points more tightly.
- **gamma**: The kernel coefficient for the RBF, polynomial, and sigmoid kernels. Increasing **gamma** makes the influence of each training point more local, which captures finer non-linear structure but can lead to overfitting if set too high.
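A minimal sketch of tuning these hyperparameters is shown below. It wraps `SVR` in a `Pipeline` with `StandardScaler`, since SVR is sensitive to feature scale, and searches a small grid with `GridSearchCV`. The data is synthetic, with columns on very different scales to mimic runtime and vote counts, and the grid values are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Made-up data with features on very different scales (assumption, not tvshows.csv)
rng = np.random.default_rng(42)
n_rows = 200
X_demo = np.column_stack([
    rng.uniform(20, 90, size=n_rows),          # runtime-like column
    rng.uniform(1_000, 500_000, size=n_rows),  # vote-count-like column
    rng.integers(0, 10, size=n_rows),          # award-count-like column
])
y_demo = 7.0 + 0.5 * np.sin(X_demo[:, 0] / 10.0) + rng.normal(0, 0.3, size=n_rows)

pipeline = Pipeline([
    ("scale", StandardScaler()),  # standardize features before the kernel is applied
    ("svr", SVR()),
])

# Illustrative grid over the hyperparameters discussed above
param_grid = {
    "svr__kernel": ["linear", "rbf"],
    "svr__C": [0.1, 1.0, 10.0],
    "svr__epsilon": [0.05, 0.1, 0.5],
    "svr__gamma": ["scale", 0.1],  # ignored by the linear kernel
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="r2")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated R²: {search.best_score_:.4f}")
```

Keeping the scaler inside the pipeline means it is refitted on each training fold, so no information from the validation fold leaks into the scaling.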

## 4.4. Comparison and Conclusion

### 4.4.1. Comparison

| Model                        | MSE      | MAE      | R²       |
|------------------------------|----------|----------|----------|
| **Random Forest**             | 0.5765   | 0.5616   | 0.0484   |
| **Linear Regression**         | 0.5337   | 0.5420   | 0.1190   |
| **Support Vector Regression** | 0.5419   | 0.5415   | 0.1054   |
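The ranking per metric can be read off the table programmatically. The sketch below copies the table's values into a DataFrame and picks the best model for each metric, remembering that lower is better for MSE and MAE and higher is better for R².

```python
import pandas as pd

# Test-set metrics copied from the comparison table
results = pd.DataFrame(
    {
        "MSE": [0.5765, 0.5337, 0.5419],
        "MAE": [0.5616, 0.5420, 0.5415],
        "R2": [0.0484, 0.1190, 0.1054],
    },
    index=["Random Forest", "Linear Regression", "Support Vector Regression"],
)

# Lower is better for MSE and MAE; higher is better for R²
best = {
    "MSE": results["MSE"].idxmin(),
    "MAE": results["MAE"].idxmin(),
    "R2": results["R2"].idxmax(),
}

for metric, name in best.items():
    print(f"Best {metric}: {name}")
```

On these numbers, Linear Regression has the lowest MSE and the highest R², SVR has a marginally lower MAE, and Random Forest ranks last on all three metrics.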

### 4.4.2. Conclusion

**Linear Regression** performs best overall, with the lowest **MSE** (0.5337) and the highest **R²** (0.1190). Even so, it explains only about 12% of the variance in ratings.

**SVR** is nearly tied with **Linear Regression**: its **MAE** (0.5415) is marginally lower, but its **MSE** (0.5419) is slightly higher and its **R²** (0.1054) slightly lower.

**Random Forest** performs worst on all three metrics (MSE 0.5765, MAE 0.5616, R² 0.0484) with default hyperparameters, which may indicate overfitting to the training set. Hyperparameter optimization may improve its results.

### 4.4.3. Recommendation

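**Hyperparameter optimization** is the recommended next step for these models, especially Random Forest and SVR, which were run with default settings and are designed for **non-linear** and **complex data**. Fine-tuning parameters can help better capture relationships and improve model predictions. Because every model's R² stays below 0.12, tuning alone may not be sufficient if the three input features carry limited information about ratings.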