In [23]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Data

In [24]:
data = pd.read_csv('./data.csv')
data

Unnamed: 0,Title,Years,Certification,Runtime,Rating,Number of Votes,Emmys,Creators,Actors,Genres,Coutries of origins,Languages,Production companies,Link
0,Queen Cleopatra,"(2023, None)",TV-14,45.0,1.2,86000,0,,"Jada Pinkett Smith, Adele James, Craig Russell...","Documentary, Drama, History",United Kingdom,English,Nutopia,https://www.imdb.com/title/tt27528139/?ref_=sr...
1,Velma,"(2023, 2024)",TV-MA,25.0,1.6,80000,0,Charlie Grandy,"Mindy Kaling, Glenn Howerton, Sam Richardson, ...","Animation, Adventure, Comedy, Crime, Horror, M...","United States, South Korea",English,"Charlie Grandy Productions, Kaling Internation...",https://www.imdb.com/title/tt14153790/?ref_=sr...
2,Keeping Up with the Kardashians,"(2007, 2021)",TV-14,44.0,2.9,32000,0,"Ryan Seacrest, Eliot Goldberg","Khloé Kardashian, Kim Kardashian, Kourtney Kar...","Family, Reality-TV",United States,"English, Spanish","Bunim-Murray Productions (BMP), Ryan Seacrest ...",https://www.imdb.com/title/tt1086761/?ref_=sr_...
3,Batwoman,"(2019, 2022)",TV-14,45.0,3.6,47000,0,Caroline Dries,"Camrus Johnson, Rachel Skarsten, Meagan Tandy,...","Action, Adventure, Crime, Drama, Sci-Fi",United States,English,"Berlanti Productions, DC Entertainment, Warner...",https://www.imdb.com/title/tt8712204/?ref_=sr_...
4,The Acolyte,"(2024, None)",TV-14,35.0,4.1,125000,0,Leslye Headland,"Lee Jung-jae, Amandla Stenberg, Manny Jacinto,...","Action, Adventure, Drama, Fantasy, Mystery, Sc...",United States,English,"Lucasfilm, Disney+, The Walt Disney Company",https://www.imdb.com/title/tt12262202/?ref_=sr...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1133,The Wire,"(2002, 2008)",TV-MA,60.0,9.3,390000,0,David Simon,"Dominic West, Lance Reddick, Sonja Sohn, Wende...","Crime, Drama, Thriller",United States,"English, Greek, Mandarin, Spanish","Blown Deadline Productions, Home Box Office (HBO)",https://www.imdb.com/title/tt0306414/?ref_=sr_i_5
1134,Planet Earth,"(2006, None)",TV-PG,50.0,9.4,223000,4,,"Sigourney Weaver, David Attenborough, Nikolay ...","Documentary, Family","United Kingdom, Canada, United States, Japan",English,"British Broadcasting Corporation (BBC), Canadi...",https://www.imdb.com/title/tt0795176/?ref_=sr_i_4
1135,Band of Brothers,"(2001, None)",TV-MA,60.0,9.4,544000,6,,"Scott Grimes, Damian Lewis, Ron Livingston, Sh...","Drama, History, War","United Kingdom, United States","English, Dutch, French, German, Lithuanian","DreamWorks, DreamWorks Television, HBO Films",https://www.imdb.com/title/tt0185906/?ref_=sr_i_3
1136,Planet Earth II,"(2016, None)",TV-G,50.0,9.5,162000,2,,"David Attenborough, Michael J. Sanderson, Gord...",Documentary,"United Kingdom, Germany, France, China, United...","English, French","BBC Natural History Unit (NHU), BBC America, Z...",https://www.imdb.com/title/tt5491994/?ref_=sr_i_2


# Train Model

In [25]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error



X = data[['Runtime', 'Number of Votes', 'Emmys']]
y = data['Rating']  

# Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the RandomForest model
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)


print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R² of the RandomForest model: {r2:.4f}")

# Get the feature importances
feature_importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
}).sort_values(by='Importance', ascending=False)


print("Feature importances:")
print(feature_importances)


Mean Squared Error (MSE): 0.5765
Mean Absolute Error (MAE): 0.5616
R² of the RandomForest model: 0.0484
Feature importances:
           Feature  Importance
1  Number of Votes    0.540540
0          Runtime    0.382493
2            Emmys    0.076967


The R² of the RandomForest model is only 0.0484, which is very low. R² (the coefficient of determination) measures how much of the variance in the target the model explains. A value close to 1.0 means the predictions track the actual ratings closely, while a value close to 0 means the model does little better than always predicting the mean rating. On a test set, R² can even be negative if the model does worse than that baseline.

Here the model explains only about 4.84% of the variance in **Rating** on the test set, so **Runtime**, **Number of Votes**, and **Emmys** alone do not capture the relationships that determine a show's rating.
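To make the definition concrete, the sketch below computes R² by hand as 1 - SS_res / SS_tot and compares it with `r2_score`. The arrays `y_true` and `y_hat` are hypothetical toy values chosen for illustration; they are not taken from the dataset above.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical toy values, not taken from the dataset above.
y_true = np.array([6.0, 7.0, 8.0, 9.0])
y_hat = np.array([6.5, 7.0, 7.5, 9.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

# A constant prediction equal to the mean of y_true scores exactly 0.
r2_baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(f"manual R²:   {r2_manual:.4f}")
print(f"sklearn R²:  {r2_score(y_true, y_hat):.4f}")
print(f"baseline R²: {r2_baseline:.4f}")
```

Read this way, an R² of 0.0484 means the model's squared error is only about 5% smaller than that of a constant mean prediction.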

The feature importances show that **Number of Votes** (**0.540540**) and **Runtime** (**0.382493**) are the features the model relies on most, while **Emmys** contributes little (0.076967). These values are impurity-based feature importances, not correlation coefficients: they are non-negative, sum to 1 across the features, and describe how much each feature reduces prediction error inside the trees. They do not indicate whether a higher value of a feature goes with a higher or a lower rating.

Because the model's R² is so low, these importances should also be read with caution. They rank the features relative to one another within a model that explains very little, so they do not by themselves show that shows with more votes or a particular runtime receive higher ratings. Checking that would require looking at the relationship directly, for example with a scatter plot or a correlation coefficient between each feature and **Rating**.
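The difference between a feature importance and a correlation coefficient can be seen on synthetic data. In the sketch below, the names `demo`, `signal`, `noise`, and `target` are made up for illustration. The target decreases as `signal` increases, so the correlation is strongly negative, yet the importance of `signal` is close to 1.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only; not the dataset used above.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "signal": rng.normal(size=n),  # drives the target, with a negative slope
    "noise": rng.normal(size=n),   # unrelated to the target
})
target = -2.0 * demo["signal"] + rng.normal(scale=0.1, size=n)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(demo, target)

# Importances are non-negative and sum to 1; they carry no sign.
importances = pd.Series(forest.feature_importances_, index=demo.columns)
# Pearson correlations are signed and lie between -1 and 1.
correlations = demo.corrwith(target)

print(importances)
print(correlations)
```

On the real data, the analogous check would be something like `data[['Runtime', 'Number of Votes', 'Emmys']].corrwith(data['Rating'])`, which reports the direction of each relationship.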

# Optimization
The model can be tuned by setting hyperparameters of `RandomForestRegressor` such as:
- **n_estimators**: The number of trees in the forest. Increasing this value can improve the model's performance but also increases computation time. However, after a certain number of trees, the performance may not change much.
- **max_depth**: The maximum depth of each tree. Limiting the depth can help reduce overfitting.
- **min_samples_split**: The minimum number of samples required to split a node. Increasing this value can help reduce model complexity.
- **min_samples_leaf**: The minimum number of samples required at each leaf. This helps reduce overfitting by avoiding overly detailed splits.
- **max_features**: The number of features to consider when looking for the best split. Using fewer features makes the trees less correlated with one another, which can reduce the variance of the forest.
- **bootstrap**: Specifies whether to use bootstrap sampling (sampling with replacement). Try both True and False to see the effect on the results.
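One possible way to try the hyperparameters listed above is a randomized search with cross-validation. The sketch below uses `RandomizedSearchCV` on synthetic data from `make_regression`, so that it runs on its own. The names `X_demo`, `y_demo`, and `param_distributions`, and all candidate values, are assumptions for illustration rather than tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for X and y; replace with the real features and target.
X_demo, y_demo = make_regression(n_samples=300, n_features=3, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Candidate values are illustrative only.
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
    "max_features": [1.0, "sqrt"],
    "bootstrap": [True, False],
}

# Sample 8 of the 64 possible combinations and score each with 3-fold CV.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=8,
    cv=3,
    scoring="r2",
    random_state=42,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"Cross-validated R²: {search.best_score_:.4f}")
print(f"Test R²: {search.score(X_te, y_te):.4f}")
```

Tuning can only recover signal that is present in the features. With an R² as low as 0.0484, adding more informative features (for example, encoded genres or certification) may matter more than the hyperparameter values.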