<a href="https://colab.research.google.com/github/Exion007/Colab/blob/main/ensemble_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ensemble Learning

* Ensemble learning is a supervised machine learning technique that aims to improve prediction accuracy.

* Ensemble methods combine predictions from multiple models.

* Ensemble methods are meta-algorithms that combine several machine learning models into a single predictive model in order to decrease variance (bagging), decrease bias (boosting), or improve predictions (stacking).

* We can classify ensemble methods into two main branches:
  1. Simple Ensemble Methods

  2. Advanced Ensemble Methods

# 1. Simple Ensemble Methods

* **Voting**
  - Primarily used in classification problems.
  - Combine the predictions obtained from multiple models to come up with a final prediction.
  - Example: an election, where the option chosen by the majority of voters (50% + 1) wins.

* **Averaging**
  - Used in regression problems, and in classification problems when the models output probabilities.
  - Similar to **'Voting'**, but it involves calculating the average of predicted probabilities.
  - Ex: (0.6 + 0.7 + 0.8) / 3 = 0.7

* **Weighted Averaging**
  - Similar to **'Averaging'**, but each model's prediction is given a weight based on its reliability or performance.
  - Ex: [(0.6 * 0.1) + (0.7 * 0.3) + (0.8 * 0.6)] = 0.75
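
The three simple methods above can be sketched in a few lines of NumPy. The label matrix and the model names in the comments are made up for illustration; the probabilities and weights are the ones used in the examples above.

```python
import numpy as np

# Hypothetical class labels predicted by three classifiers for four samples
labels = np.array([
    [1, 0, 1, 1],   # model A
    [1, 1, 0, 1],   # model B
    [0, 0, 1, 1],   # model C
])

# Voting: a sample is labeled 1 when more than half of the models say 1
votes_for_one = labels.sum(axis=0)
voted = (votes_for_one > labels.shape[0] / 2).astype(int)

# Averaging: mean of the probabilities predicted for a single sample
probs = np.array([0.6, 0.7, 0.8])
averaged = probs.mean()

# Weighted averaging: the weights reflect each model's reliability and sum to 1
weights = np.array([0.1, 0.3, 0.6])
weighted = np.average(probs, weights=weights)

print(voted)                                    # [1 0 1 1]
print(round(averaged, 2), round(weighted, 2))
```

With equal weights, weighted averaging reduces to plain averaging.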


# 2. Advanced Ensemble Methods
  * **Bagging (Bootstrap Aggregating)**
    - Used in classification or regression problems
    - Multiple instances of the same model are trained on different subsets of the training data, each obtained through bootstrap resampling.
    - The final prediction is often an average or majority vote of predictions from each individual model.
    
        <div>
        <img src="https://media.geeksforgeeks.org/wp-content/uploads/20210707140912/Bagging.png" width="500"/>
        </div>
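
A minimal sketch of bagging for regression, assuming synthetic data and scikit-learn decision trees as the base model (both are choices made here for illustration, not part of the notebook's dataset). Each tree is trained on its own bootstrap sample and the predictions are averaged.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption for illustration only)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

n_models = 25
models = []
for _ in range(n_models):
    # Bootstrap sample: draw as many rows as the dataset has, with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(max_depth=4)
    models.append(tree.fit(X[idx], y[idx]))

# Aggregate: average the individual predictions (regression)
X_new = np.array([[0.0], [1.5]])
all_preds = np.stack([m.predict(X_new) for m in models])
bagged_pred = all_preds.mean(axis=0)
print(bagged_pred)
```

For classification, the averaging step would be replaced by a majority vote over the predicted labels.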
  
  ---

  * **Random Forest**
    - Used in classification or regression problems
    - A specific type of bagging where an ensemble of decision trees is created.
    - Each tree is trained on a different bootstrap sample, and during tree construction, a random subset of features is considered at each split.
    - Predictions are aggregated through majority voting (classification) or averaging (regression) of tree predictions.

        <div>
        <img src="https://pages.cms.hu-berlin.de/EOL/geo_rs/fig/s09_rf-concept.png" width="400"/>
        </div>
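
The sketch below, on synthetic data from `make_regression` (an assumption for illustration), shows the two sources of randomness described above as `RandomForestRegressor` parameters, and checks that the forest's regression output should agree with the average of its trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

forest = RandomForestRegressor(
    n_estimators=50,
    max_features=3,     # random subset of features considered at each split
    bootstrap=True,     # each tree is trained on its own bootstrap sample
    random_state=0,
)
forest.fit(X, y)

# For regression, the forest prediction is the mean of the tree predictions
tree_preds = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)
agrees = np.allclose(manual_average, forest.predict(X[:5]))
print(agrees)
```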
  
---

 * **Boosting**
  - Used in classification or regression problems
  - Iterative process where models are trained sequentially, with each new model focusing on the mistakes made by the previous models.
  - The final prediction is an aggregate of the predictions from all models, weighted according to their performance.
  - There are 4 commonly used boosting algorithms:
  
  1. **Gradient Boosting:**
      - Gradient boosting builds a sequence of models in which each new model attempts to correct the errors made by the previous models. It does so by fitting each new model to the negative gradient of a loss function, which for squared error is simply the residual.

      - The final prediction is the sum of the predictions from all models, with each model's contribution scaled by a learning rate.
    
  2. **AdaBoost (Adaptive Boosting):**

      - AdaBoost assigns higher weights to misclassified samples and lower weights to correctly classified samples. It trains a sequence of models, where each new model gives more importance to the misclassified samples from the previous models.

      - The final prediction is a weighted combination of the predictions from all models.
    
  3. **XGBoost (Extreme Gradient Boosting):**
      - XGBoost is an optimized implementation of gradient boosting that adds regularization, parallel processing, and built-in handling of missing values. It also prunes individual trees to control their complexity.
      
      - It has become a widely used algorithm in machine learning competitions and applications due to its high performance and scalability, and it is often reported to be among the fastest boosting implementations.
    
  4. **CatBoost:**
     - CatBoost is a gradient boosting algorithm that handles categorical features directly, eliminating the need for preprocessing like label encoding or one-hot encoding.

     - It also employs techniques like ordered boosting and oblivious trees to improve performance and generalization.

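  - Advantages:
    * Improved accuracy
    * Robustness to overfitting when regularization (for example a small learning rate or early stopping) is used
    * Better handling of imbalanced data
    * Feature importance scores that aid interpretation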
  
  - Disadvantages:
    * Sensitive to outliers
    * Difficult to use in real-time applications
    * Computationally expensive for large datasets
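
To make the sequential idea behind gradient boosting concrete, here is a minimal from-scratch sketch for squared-error regression. It assumes synthetic data and shallow scikit-learn trees as weak learners; it is an illustration of the principle, not how `GradientBoostingRegressor` or XGBoost are implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 100

# Start from a constant prediction (the mean minimizes squared error)
base = y.mean()
prediction = np.full_like(y, base)
trees = []

for _ in range(n_rounds):
    # For squared error, the negative gradient is the residual
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    # Each tree's contribution is scaled by the learning rate
    prediction += learning_rate * tree.predict(X)

def boosted_predict(X_new):
    """Sum the scaled contributions of all trees on top of the base value."""
    return base + learning_rate * sum(t.predict(X_new) for t in trees)

mse_start = np.mean((y - base) ** 2)
mse_end = np.mean((y - boosted_predict(X)) ** 2)
print(f"Training MSE before boosting: {mse_start:.3f}, after: {mse_end:.3f}")
```

A smaller learning rate usually needs more rounds but tends to generalize better, which is why the two parameters are tuned together.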

---







# Sample Training

In [188]:
# Only the libraries that are actually used in this notebook are imported
import time
import pandas as pd
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import mean_squared_error, recall_score, precision_score

In [189]:
# Reading file

df = pd.read_csv("movies.csv")
df.head(3)

| | id | title | release_date | genres | original_language | vote_average | vote_count | popularity | overview | budget | production_companies | revenue | runtime | tagline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 758323 | The Pope's Exorcist | 2023-04-05 | ['Horror', 'Mystery', 'Thriller'] | English | 7.4 | 619 | 5089.969 | Father Gabriele Amorth, Chief Exorcist of the ... | 18000000 | ['Screen Gems', '2.0 Entertainment', 'Jesus & ... | 65675816 | 103 | Inspired by the actual files of Father Gabriel... |
| 1 | 640146 | Ant-Man and the Wasp: Quantumania | 2023-02-15 | ['Action', 'Adventure', 'Science Fiction'] | English | 6.6 | 2294 | 4665.438 | Super-Hero partners Scott Lang and Hope van Dy... | 200000000 | ['Marvel Studios', 'Kevin Feige Productions'] | 464566092 | 125 | Witness the beginning of a new dynasty. |
| 2 | 502356 | The Super Mario Bros. Movie | 2023-04-05 | ['Animation', 'Adventure', 'Family', 'Fantasy'... | English | 7.5 | 1861 | 3935.55 | While working underground to fix a water main,... | 100000000 | ['Universal Pictures', 'Illumination', 'Ninten... | 1121048165 | 92 | |


In [190]:
# PREPROCESSING

# Selecting the columns we want to use (copy so later assignments do not touch a view)
df = df.loc[:, ['vote_average', 'vote_count', 'genres', 'budget', 'popularity']].copy()


# Replace the budget values given as 0 with the mean of the other budget values
# (cast to float first, because the mean is generally not an integer)
df["budget"] = df["budget"].astype(float)
avg = df.query("budget > 0").budget.mean()
df.loc[df["budget"] == 0, "budget"] = avg

# Drop NA values (We do not have any in this case)
df.dropna(inplace = True)

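# Convert the genres column from string to list format,
# e.g. "['Horror', 'Mystery']" -> ['Horror', 'Mystery']
# Note: an empty list "[]" becomes [''], i.e. a single empty-string genre
def convert_genres(text):
  return text[1:-1].replace("'", "").split(", ")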

df['genres'] = df['genres'].apply(convert_genres)

df.head(3)

| | vote_average | vote_count | genres | budget | popularity |
|---|---|---|---|---|---|
| 0 | 7.4 | 619 | [Horror, Mystery, Thriller] | 18000000.0 | 5089.969 |
| 1 | 6.6 | 2294 | [Action, Adventure, Science Fiction] | 200000000.0 | 4665.438 |
| 2 | 7.5 | 1861 | [Animation, Adventure, Family, Fantasy, Comedy] | 100000000.0 | 3935.55 |
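
As an alternative to slicing and splitting the string by hand, the genre strings could be parsed with `ast.literal_eval` from the standard library. This is a sketch, not what the notebook does; `parse_genres` is a hypothetical helper name. One difference is that `"[]"` is parsed as an empty list, whereas the slicing approach yields `['']`, so swapping it in would remove the empty-string genre and any later code that relies on column positions would need adjusting.

```python
import ast

def parse_genres(text):
    """Parse a string such as "['Horror', 'Mystery']" into a list of strings."""
    return list(ast.literal_eval(text))

print(parse_genres("['Horror', 'Mystery', 'Thriller']"))  # ['Horror', 'Mystery', 'Thriller']
print(parse_genres("[]"))                                 # []

# For comparison, the slicing approach applied to an empty list
print("[]"[1:-1].replace("'", "").split(", "))            # ['']
```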


In [191]:
# Selecting the columns we want to operate on
X = df[['vote_count', 'genres', 'budget', 'popularity']]
y = df['vote_average']

# Multi-hot encoding: a movie can have several genres, so each genre gets its own binary column
mlb = MultiLabelBinarizer()

# Transform the genres column into binary columns
genres_encoded = pd.DataFrame(mlb.fit_transform(X['genres']), columns=mlb.classes_, index=X.index)

# Concatenate the encoded genres with the original features
X_encoded = pd.concat([X[['vote_count', 'budget', 'popularity']], genres_encoded], axis=1)

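# Displaying Genres
# Column 3 is the empty-string genre (presumably from movies with no genres listed),
# so the named genres start at column 4
genres = X_encoded.iloc[:, 4:].columns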
print(genres)

X_encoded.head(3)

Index(['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary',
       'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Mystery',
       'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western'],
      dtype='object')


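| | vote_count | budget | popularity | | Action | Adventure | Animation | Comedy | Crime | Documentary | ... | History | Horror | Music | Mystery | Romance | Science Fiction | TV Movie | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 18000000.0 | 5089.969 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 2294 | 200000000.0 | 4665.438 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1861 | 100000000.0 | 3935.55 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |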
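
To see what `MultiLabelBinarizer` does in isolation, here is a small sketch on a made-up series of genre lists (`toy` and `toy_mlb` are names chosen for this example). The classes are sorted alphabetically and each becomes one binary column.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Three hypothetical movies with one or two genres each
toy = pd.Series([["Horror", "Mystery"], ["Action"], ["Action", "Horror"]])

toy_mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    toy_mlb.fit_transform(toy),
    columns=toy_mlb.classes_,
    index=toy.index,
)

print(list(toy_mlb.classes_))   # ['Action', 'Horror', 'Mystery']
print(encoded)
```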


In [192]:
# Lists that collect the performance data for the summary DataFrame

algs = []
mses = []
rmses = []
precisions = []
recalls = []
exectimes = []
predictions = []

def addrow(alg, mse, precision, recall, exec_time, prediction):
  algs.append(alg)
  mses.append(mse)
  rmses.append(mse ** 0.5)
  precisions.append(precision)
  recalls.append(recall)
  exectimes.append(exec_time)
  predictions.append(prediction)

In [193]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

In [194]:
# Using Random Forest

start_time = time.time()

model_rf = RandomForestRegressor()

model_rf.fit(X_train, y_train)

y_pred_rf = model_rf.predict(X_test)

mse_rf = mean_squared_error(y_test, y_pred_rf)

# Convert predicted vote averages into binary classifications
threshold = 7.0  # Choose an appropriate threshold based on your problem
y_pred_binary_rf = [1 if pred >= threshold else 0 for pred in y_pred_rf]

# Convert actual vote averages into binary classifications using the same threshold
y_test_binary_rf = [1 if actual >= threshold else 0 for actual in y_test]

# Calculate recall and precision
precision_rf = precision_score(y_test_binary_rf, y_pred_binary_rf)
recall_rf = recall_score(y_test_binary_rf, y_pred_binary_rf)

end_time = time.time()
exec_rf = end_time - start_time

print(f"Random Forest Regressor -> Mean Squared Error: {mse_rf}")
print(f"Random Forest Regressor -> Root Mean Squared Error: {mse_rf ** 0.5}")
print(f"Random Forest Regressor -> Precision: {precision_rf}")
print(f"Random Forest Regressor -> Recall: {recall_rf}")
print(f"Elapsed Time: {exec_rf:.2f} seconds")

Random Forest Regressor -> Mean Squared Error: 0.6498853349999999
Random Forest Regressor -> Root Mean Squared Error: 0.8061546594791845
Random Forest Regressor -> Precision: 0.7013888888888888
Random Forest Regressor -> Recall: 0.45089285714285715
Elapsed Time: 4.48 seconds
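
Precision and recall are classification metrics, so the cell above first turns both the true and the predicted ratings into binary labels with a threshold. The toy ratings below are made up to show how the two numbers come about.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

threshold = 7.0

# Hypothetical true and predicted vote averages for six movies
y_true = np.array([7.5, 6.0, 8.0, 6.5, 7.1, 7.8])
y_hat = np.array([7.2, 7.1, 6.9, 6.0, 7.4, 6.5])

true_bin = (y_true >= threshold).astype(int)   # [1 0 1 0 1 1]
pred_bin = (y_hat >= threshold).astype(int)    # [1 1 0 0 1 0]

# 2 true positives, 1 false positive, 2 false negatives
precision = precision_score(true_bin, pred_bin)   # 2 / (2 + 1)
recall = recall_score(true_bin, pred_bin)         # 2 / (2 + 2)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

A model can have a low squared error and still a low recall if its predictions sit just below the threshold, so the two kinds of metric answer different questions.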


In [195]:
# Using Gradient Boosting

start_time = time.time()

model_gb = GradientBoostingRegressor()

model_gb.fit(X_train, y_train)

y_pred_gb = model_gb.predict(X_test)

mse_gb = mean_squared_error(y_test, y_pred_gb)

# Convert predicted vote averages into binary classifications
threshold = 7.0  # Choose an appropriate threshold based on your problem
y_pred_binary_gb = [1 if pred >= threshold else 0 for pred in y_pred_gb]

# Convert actual vote averages into binary classifications using the same threshold
y_test_binary_gb = [1 if actual >= threshold else 0 for actual in y_test]

# Calculate recall and precision
precision_gb = precision_score(y_test_binary_gb, y_pred_binary_gb)
recall_gb = recall_score(y_test_binary_gb, y_pred_binary_gb)

end_time = time.time()
exec_gb = end_time - start_time

print(f"Gradient Boosting Regressor -> Mean Squared Error: {mse_gb}")
print(f"Gradient Boosting Regressor -> Root Mean Squared Error: {mse_gb ** 0.5}")
print(f"Gradient Boosting Regressor -> Precision: {precision_gb}")
print(f"Gradient Boosting Regressor -> Recall: {recall_gb}")
print(f"Elapsed Time: {exec_gb:.2f} seconds")

Gradient Boosting Regressor -> Mean Squared Error: 0.6105309784303883
Gradient Boosting Regressor -> Root Mean Squared Error: 0.7813648177582533
Gradient Boosting Regressor -> Precision: 0.7539432176656151
Gradient Boosting Regressor -> Recall: 0.3556547619047619
Elapsed Time: 1.60 seconds


In [196]:
# Using XGBoost

start_time = time.time()

model_xgb = XGBRegressor()

model_xgb.fit(X_train, y_train)

y_pred_xgb = model_xgb.predict(X_test)

mse_xgb = mean_squared_error(y_test, y_pred_xgb)

# Convert predicted vote averages into binary classifications
threshold = 7.0  # Choose an appropriate threshold based on your problem
y_pred_binary_xgb = [1 if pred >= threshold else 0 for pred in y_pred_xgb]

# Convert actual vote averages into binary classifications using the same threshold
y_test_binary_xgb = [1 if actual >= threshold else 0 for actual in y_test]

# Calculate recall and precision
precision_xgb = precision_score(y_test_binary_xgb, y_pred_binary_xgb)
recall_xgb = recall_score(y_test_binary_xgb, y_pred_binary_xgb)

end_time = time.time()
exec_xgb = end_time - start_time

print(f"XGBoost -> Mean Squared Error: {mse_xgb}")
print(f"XGBoost -> Root Mean Squared Error: {mse_xgb ** 0.5}")
print(f"XGBoost -> Precision: {precision_xgb}")
print(f"XGBoost -> Recall: {recall_xgb}")
print(f"Elapsed Time: {exec_xgb:.2f} seconds")

XGBoost -> Mean Squared Error: 0.6134430232279932
XGBoost -> Root Mean Squared Error: 0.7832260358466087
XGBoost -> Precision: 0.713953488372093
XGBoost -> Recall: 0.4568452380952381
Elapsed Time: 1.38 seconds


In [197]:
# Using AdaBoost

start_time = time.time()

model_adaboost = AdaBoostRegressor()

model_adaboost.fit(X_train, y_train)

y_pred_ada = model_adaboost.predict(X_test)

mse_ada = mean_squared_error(y_test, y_pred_ada)

# Convert predicted vote averages into binary classifications
threshold = 7.0  # Choose an appropriate threshold based on your problem
y_pred_binary_ada = [1 if pred >= threshold else 0 for pred in y_pred_ada]

# Convert actual vote averages into binary classifications using the same threshold
y_test_binary_ada = [1 if actual >= threshold else 0 for actual in y_test]

# Calculate recall and precision
precision_ada = precision_score(y_test_binary_ada, y_pred_binary_ada)
recall_ada = recall_score(y_test_binary_ada, y_pred_binary_ada)

end_time = time.time()
exec_ada = end_time - start_time

print(f"AdaBoost Regressor -> Mean Squared Error: {mse_ada}")
print(f"AdaBoost Regressor -> Root Mean Squared Error: {mse_ada ** 0.5}")
print(f"AdaBoost Regressor -> Precision: {precision_ada}")
print(f"AdaBoost Regressor -> Recall: {recall_ada}")
print(f"Elapsed Time: {exec_ada:.2f} seconds")

AdaBoost Regressor -> Mean Squared Error: 0.7888698290850504
AdaBoost Regressor -> Root Mean Squared Error: 0.8881834433747627
AdaBoost Regressor -> Precision: 0.8333333333333334
AdaBoost Regressor -> Recall: 0.052083333333333336
Elapsed Time: 0.18 seconds
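
The four training cells above repeat the same steps, so they could be folded into one helper. The sketch below is one possible refactoring, not the notebook's own code: `evaluate_regressor` is a hypothetical function name, the data comes from `make_regression` instead of the movie dataset, and the timer covers only fitting and prediction.

```python
import time
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, precision_score, recall_score
from sklearn.model_selection import train_test_split

def evaluate_regressor(name, model, X_train, X_test, y_train, y_test, threshold):
    """Fit a regressor and return MSE, RMSE, thresholded precision/recall and timing."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    elapsed = time.perf_counter() - start  # fitting and prediction only

    mse = mean_squared_error(y_test, y_pred)
    true_bin = (y_test >= threshold).astype(int)
    pred_bin = (y_pred >= threshold).astype(int)
    return {
        "Algorithm": name,
        "MSE": mse,
        "RMSE": mse ** 0.5,
        "Precision": precision_score(true_bin, pred_bin, zero_division=0),
        "Recall": recall_score(true_bin, pred_bin, zero_division=0),
        "Execution Time": elapsed,
    }

# Synthetic stand-in for the movie data (an assumption for illustration)
X, y = make_regression(n_samples=400, n_features=6, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}
results = [
    evaluate_regressor(name, model, X_tr, X_te, y_tr, y_te, threshold=0.0)
    for name, model in candidates.items()
]
for row in results:
    print(row["Algorithm"], round(row["RMSE"], 3))
```

Passing `random_state` to each model makes the reported numbers reproducible between runs.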


In [198]:
# Get the input from the user, and display ratings produced using algorithms

# Function to input genres. Get genre input from the user until user inputs "exit"

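def input_genres():
  # Match input case-insensitively against the genres seen during training,
  # so that multi-word names such as "Science Fiction" or "TV Movie" are recognized
  known = {g.lower(): g for g in mlb.classes_ if g}
  genres = []
  while True:
    genre = input("Enter genre: ").strip().lower()

    if genre == "exit":
      break

    if genre not in known:
      print(f"Unknown genre ignored: {genre}")
    elif known[genre] not in genres:
      genres.append(known[genre])

  return genres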

# Get inputs
new_vote_count = int(input("Enter vote count: "))
new_budget = int(input("Enter budget: "))
new_genres = input_genres()
new_popularity = float(input("Enter popularity: "))

# Example prediction for new data
new_data = pd.DataFrame({
    'vote_count': [new_vote_count],
    'genres': [new_genres],
    'budget': [new_budget],
    'popularity': [new_popularity]
})

# Transform the genres for the new data
new_data_genres_encoded = pd.DataFrame(mlb.transform(new_data['genres']), columns=mlb.classes_, index=new_data.index)

# Concatenate the encoded genres with the new data features
new_data_encoded = pd.concat([new_data[['vote_count', 'budget', 'popularity']], new_data_genres_encoded], axis=1)

print()

# Make prediction for new data
prediction_rf = model_rf.predict(new_data_encoded)
print(f"Random Forest Regressor -> Predicted Vote Average: {prediction_rf[0]:.2f}")

prediction_gb = model_gb.predict(new_data_encoded)
print(f"Gradient Boosting Regressor -> Predicted Vote Average: {prediction_gb[0]:.2f}")

prediction_xgb = model_xgb.predict(new_data_encoded)
print(f"XGBoost Regressor -> Predicted Vote Average: {prediction_xgb[0]:.2f}")

prediction_adaboost = model_adaboost.predict(new_data_encoded)
print(f"AdaBoost Regressor -> Predicted Vote Average: {prediction_adaboost[0]:.2f}")

Enter vote count: 3000
Enter budget: 80000000
Enter genre: Action
Enter genre: thriller
Enter genre: exit
Enter popularity: 6000

Random Forest Regressor -> Predicted Vote Average: 6.83
Gradient Boosting Regressor -> Predicted Vote Average: 7.11
XGBoost Regressor -> Predicted Vote Average: 7.05
AdaBoost Regressor -> Predicted Vote Average: 6.52
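
The genre names passed to `mlb.transform` need to match the entries of `mlb.classes_` exactly, including case. The sketch below uses a small binarizer fitted on made-up genre lists (`demo_mlb` is a name chosen for this example) to show what scikit-learn does with a label it has not seen: it warns and leaves that label out of the encoding.

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

demo_mlb = MultiLabelBinarizer().fit([["Action", "Science Fiction"], ["Thriller"]])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # "Science fiction" differs from the fitted class only by the case of one letter
    row = demo_mlb.transform([["Action", "Science fiction"]])

print(list(demo_mlb.classes_))   # ['Action', 'Science Fiction', 'Thriller']
print(row)                       # [[1 0 0]]
print([str(w.message) for w in caught])
```

Because the unknown label is dropped rather than rejected, the prediction still runs, but without that genre.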


In [199]:
addrow("Random Forest", mse_rf, precision_rf, recall_rf, exec_rf, prediction_rf[0])
addrow("Gradient Boosting", mse_gb, precision_gb, recall_gb, exec_gb, prediction_gb[0])
addrow("XGBoost", mse_xgb, precision_xgb, recall_xgb, exec_xgb, prediction_xgb[0])
addrow("AdaBoost", mse_ada, precision_ada, recall_ada, exec_ada, prediction_adaboost[0])

data = {
    "Algorithm" : algs,
    "MSE" : mses,
    "RMSE" : rmses,
    "Precision" : precisions,
    "Recalls" : recalls,
    "Execution Time" : exectimes,
    "Predictions" : predictions
}

infodf = pd.DataFrame(data)
infodf.head()

| | Algorithm | MSE | RMSE | Precision | Recalls | Execution Time | Predictions |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest | 0.649885 | 0.806155 | 0.701389 | 0.450893 | 4.475601 | 6.829 |
| 1 | Gradient Boosting | 0.610531 | 0.781365 | 0.753943 | 0.355655 | 1.595408 | 7.110164 |
| 2 | XGBoost | 0.613443 | 0.783226 | 0.713953 | 0.456845 | 1.383394 | 7.04725 |
| 3 | AdaBoost | 0.78887 | 0.888183 | 0.833333 | 0.052083 | 0.183612 | 6.523207 |
