# Tree-based models
In this notebook, we use bagging, random forests and boosting to try to improve on the best baseline model, logistic regression (LogReg).


In [1]:
import numpy as np
import pandas as pd
import pyarrow.parquet as pq #source: https://arrow.apache.org/docs/python/parquet.html
from sklearn.model_selection import train_test_split #source: KNN-Creditrisk
from sklearn.model_selection import GridSearchCV #source: treeModels.ipynb
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report


df_base = pd.read_parquet(r"C:\school\AImethods\Assignment_1\bjj\datasets\computed\metacritic_sales_tier_modelling.parquet")

### Data preparation 
To use the tree-based models, we first create train, validation and test sets. Let's do this the professional way, with *object-oriented programming*. (Source: KNN-Creditrisk)

#### Creating the train, test, and validation splits (source: KNN notebook)

- **We split the dataset into train, validation, and test**

- We will set test aside, which will be used to evaluate the final performance of the model. Then we will use train and validation to fit the actual model.

- Note: We use stratified sampling to make sure that the distribution of sales tiers is the same in train, validation and test
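
To make the idea of stratified sampling concrete, here is a minimal pure-Python sketch on toy labels. The helper `stratified_split` and the toy label list are our own illustration (assumptions, not part of this notebook's pipeline); the notebook itself relies on scikit-learn's `train_test_split(..., stratify=...)`.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_frac=0.2, seed=42):
    """Hypothetical helper: split row indices so each class keeps its share."""
    rng = random.Random(seed)

    # Group the row indices per class
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_frac)   # same fraction for every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy target with a 50/30/20 class distribution
labels = [0] * 50 + [1] * 30 + [2] * 20
train_idx, test_idx = stratified_split(labels)

test_counts = Counter(labels[i] for i in test_idx)
train_counts = Counter(labels[i] for i in train_idx)
print("test:", dict(test_counts), "train:", dict(train_counts))
```

The test part receives 10, 6 and 4 rows of the three classes, which is the same 50/30/20 distribution as the full toy dataset.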

In [None]:
pd.set_option('display.max_columns', None) #it compresses the view, so this neat trick helps :)
df_base.head() #quick look at the final modelling dataset

Unnamed: 0,movie_id,metascore,userscore,runtime,production_budget_log,theatre_count_log,release_year,genre_list,genre_Action,genre_Adult,genre_Adventure,genre_Animation,genre_Biography,genre_Black Comedy,genre_Comedy,genre_Concert/Performance,genre_Crime,genre_Documentary,genre_Drama,genre_Educational,genre_Family,genre_Fantasy,genre_History,genre_Horror,genre_Multiple Genres,genre_Music,genre_Musical,genre_Mystery,genre_News,genre_Reality,genre_Romance,genre_Romantic Comedy,genre_Sci-Fi,genre_Short,genre_Sport,genre_Thriller,genre_Thriller/Suspense,genre_Unknown,genre_War,genre_Western,rating_missing,rating_clean,rating_G,rating_NC-17,rating_Not Rated,rating_PG,rating_PG-13,rating_R,season_Fall,season_Spring,season_Summer,season_Winter,summer_release,holiday_release,user_embed_1,user_embed_2,user_embed_3,user_embed_4,user_embed_5,user_embed_6,user_embed_7,user_embed_8,user_embed_9,user_embed_10,expert_embed_1,expert_embed_2,expert_embed_3,expert_embed_4,expert_embed_5,expert_embed_6,expert_embed_7,expert_embed_8,expert_embed_9,expert_embed_10,sales_tier_encoded
0,6305dc82622a,59.0,6.7,129.0,16.993564,4.26268,2000.0,"[""Drama""]",0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,Not Rated,False,False,True,False,False,False,False,False,False,True,0,0,4.791063,0.105657,-0.019576,-0.004482,0.081379,0.002436,0.027291,0.004146,0.04722,-0.040455,5.248415,-0.013454,-0.028619,0.025156,-0.031859,-0.024816,-0.012283,-0.021413,0.034195,-0.025543,0
1,662bc1e3cf57,31.0,8.7,109.0,17.216708,7.797291,2001.0,"[""Drama"",""Thriller""]",0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,PG-13,False,False,False,False,True,False,False,False,False,True,0,0,-22.518179,7.926952,-4.374372,-5.471427,4.79228,-5.970364,1.26354,-1.506532,4.172362,4.004245,-18.119719,-5.644859,6.622465,5.285645,-4.129042,-17.439384,-4.43283,6.081197,-1.913313,3.629471,2
2,dfc233d7a2f9,59.0,6.7,104.0,16.213406,7.788212,2002.0,"[""Drama""]",0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,Not Rated,False,False,True,False,False,False,False,False,False,True,0,0,4.791063,0.105657,-0.019576,-0.004482,0.081379,0.002436,0.027291,0.004146,0.04722,-0.040455,5.248415,-0.013454,-0.028619,0.025156,-0.031859,-0.024816,-0.012283,-0.021413,0.034195,-0.025543,2
3,ed1dd3e75880,41.0,6.4,104.0,16.906553,7.812378,2008.0,"[""Thriller"",""Comedy"",""Romance"",""Crime""]",0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,PG-13,False,False,False,False,True,False,False,False,False,True,0,0,-23.572487,2.176041,2.813654,1.120385,-7.183623,0.716902,6.641965,4.01351,7.728705,1.25299,-19.68395,-6.086373,-7.251342,6.611788,1.355806,1.178939,-1.492724,4.36909,3.527283,1.824804,2
4,8e3d5b8714f4,30.0,5.1,95.0,16.118096,7.589842,2008.0,"[""Fantasy"",""Comedy"",""Romance""]",0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,PG-13,False,False,False,False,True,False,False,False,False,True,0,0,-19.584745,-9.910595,-4.104996,7.199965,7.064362,-3.296383,5.574745,4.356566,0.31441,-4.770975,-22.277643,-4.169212,-3.627316,4.190808,-9.552444,-1.246047,4.286777,-1.451442,5.136322,2.053113,2


We can see here that the dataset has a number of characteristics that affect what we can and cannot use:
1. The dataset still contains string columns, not only numeric ones
2. The dataset has both features and embeddings, so we can use the dataset with and without the embeddings to see if the reviews have any impact on model performance.
3. The target is encoded in three classes, so we have to use stratified sampling

In the following code block, we use object-oriented programming to wrap the splitting logic in a reusable class.

In [None]:
# First we define the splitter class; its constructor stores the split settings
# We want 80% of the data for training + validation, and 20% for testing
# The training fraction is 0.64, since 0.8 * 0.8 = 0.64 (80% of the 80%)
# The validation fraction is the other 20% of that 80%, so 0.8 - 0.64 = 0.16
# The rest of the data (0.20) is the test set
class TrainTestSplitter: 
    def __init__(self, train_frac=0.64, val_frac=0.16, seed=1234): #initialize the object, with 64% training data, 16% validation
        self.train_frac = train_frac # Use the self statement to let the function remember the settings
        self.val_frac = val_frac 
        self.seed = seed # Both splits use the same random seed, so the result is reproducible

    def split_data(self, df, target_col):
        """
        Split the data into train, validation and test sets using stratified sampling.
        The test set is split off from the full dataframe first; the remaining data is then split into train and validation.
        
        :param self: provides the settings and stores the results
        :param df: The full modelling dataframe
        :param target_col: The column used for stratification

        Returns:
        train_set, validation_set, test_set
        """
          
        print(f"Splitting data based on target: {target_col}...") #the function prints some text and the column that we do stratification on
        
        # Calculate the fraction that is not the trainset or validation set to find the testset
        test_frac = 1.0 - (self.train_frac + self.val_frac)
        
        # First split: separate the test set from the rest
        # 'temp_df' temporarily holds both the train and validation data
        temp_df, self.test_set = train_test_split(
            df,
            test_size=test_frac,
            random_state=self.seed,
            stratify=df[target_col]
        )

        # Relative size of the validation set within temp_df (0.16 / 0.80 = 0.20 by default)
        relative_val_size = self.val_frac / (self.train_frac + self.val_frac)

        # Second split: separate train and validation from temp_df
        self.train_set, self.validation_set = train_test_split(
            temp_df,
            test_size=relative_val_size,
            random_state=self.seed,
            stratify=temp_df[target_col]
        )

        # Clean up the indexes (so they start at 0, 1, 2...)
        self.train_set = self.train_set.reset_index(drop=True)
        self.validation_set = self.validation_set.reset_index(drop=True)
        self.test_set = self.test_set.reset_index(drop=True)

        print(f"Split completed. Train: {len(self.train_set)}, Val: {len(self.validation_set)}, Test: {len(self.test_set)}")
        
        return self.train_set, self.validation_set, self.test_set

Now we actually create a splitter object

In [4]:
# 1. Create the 'splitter' object from the class
# We will use 64% for training and 16% for validation
splitter = TrainTestSplitter(train_frac=0.64, val_frac=0.16, seed=42)

# 2. We run the split_data method and tell it the target is 'sales_tier_encoded'
train, val, test = splitter.split_data(df_base, target_col='sales_tier_encoded')

# 3. Double checking a specific tier to see if stratification worked
print("\nProportion of tiers in Training Set:")
print(train['sales_tier_encoded'].value_counts(normalize=True))

print("\nProportion of tiers in Test Set:")
print(test['sales_tier_encoded'].value_counts(normalize=True))

Splitting data based on target: sales_tier_encoded...
Split completed. Train: 13932, Val: 3484, Test: 4354

Proportion of tiers in Training Set:
sales_tier_encoded
1    0.340009
2    0.330032
0    0.329960
Name: proportion, dtype: float64

Proportion of tiers in Test Set:
sales_tier_encoded
1    0.339917
2    0.330041
0    0.330041
Name: proportion, dtype: float64


The split worked as intended: the training and test sets have nearly identical proportions of each sales tier.
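
As a sanity check on the reported set sizes, the arithmetic of the two-step split can be reproduced by hand. This is only a sketch: the total of 21770 rows is derived from the printed sizes (13932 + 3484 + 4354), the helper `split_sizes` is hypothetical, and we assume that the held-out part of each split is rounded up, which is how we understand scikit-learn to treat a fractional `test_size`.

```python
from fractions import Fraction
from math import ceil


def split_sizes(n_rows, train_frac="0.64", val_frac="0.16"):
    """Hypothetical helper: expected (train, val, test) sizes of the two-step split."""
    # Exact fractions avoid floating-point surprises such as 1 - 0.8 != 0.2
    train_frac, val_frac = Fraction(train_frac), Fraction(val_frac)

    # Step 1: split off the test set (assumption: its size is rounded up)
    test_frac = 1 - (train_frac + val_frac)
    n_test = ceil(test_frac * n_rows)
    n_remaining = n_rows - n_test

    # Step 2: the validation set is a fraction of what is left, not of the full data
    relative_val = val_frac / (train_frac + val_frac)
    n_val = ceil(relative_val * n_remaining)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


sizes = split_sizes(21770)
print(sizes)
```

Under these assumptions the sketch gives 13932, 3484 and 4354 rows, matching the sizes printed by the splitter.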

Now we create the dataframes that hold the data we want to use for modelling by separating the target from the features:
- X_train = All the data from the training set without the target
- Y_train = The target data in the training set

- X_val = All the data from the validation set without the target
- Y_val = The target data in the validation set

- X_test = All the data from the test set without the target
- Y_test = The target data in the test set

NOTE: The tree-based models in scikit-learn only accept numeric input. We will therefore drop the non-numeric columns


In [5]:
pd.set_option('display.max_rows', None) #pandas truncates long output by default, very annoying actually
df_base.dtypes #print the datatypes to see which ones dont work for the bagged trees

movie_id                      object
metascore                    float64
userscore                    float64
runtime                      float64
production_budget_log        float64
theatre_count_log            float64
release_year                 float64
genre_list                    object
genre_Action                   int64
genre_Adult                    int64
genre_Adventure                int64
genre_Animation                int64
genre_Biography                int64
genre_Black Comedy             int64
genre_Comedy                   int64
genre_Concert/Performance      int64
genre_Crime                    int64
genre_Documentary              int64
genre_Drama                    int64
genre_Educational              int64
genre_Family                   int64
genre_Fantasy                  int64
genre_History                  int64
genre_Horror                   int64
genre_Multiple Genres          int64
genre_Music                    int64
genre_Musical                  int64
g

Now we create the target and the features. A few important notes here: 
1. The object datatypes will be removed: rating_clean and genre_list are redundant because they have already been one-hot encoded, and movie_id has no predictive power anyway
2. The boolean types are converted to numbers

In [15]:
# 1. Create the target (y)
y_train = train['sales_tier_encoded']
y_validation = val['sales_tier_encoded']

# 2. Create the Features (X)
# We exclude 'object' to remove movie_id, genre_list, and rating_clean automatically
X_train = train.drop(columns=['sales_tier_encoded']).select_dtypes(exclude=['object'])
X_validation = val.drop(columns=['sales_tier_encoded']).select_dtypes(exclude=['object'])

# 3. Convert Booleans (True/False) to Numbers (1/0)
# This makes it easier for the Bagging model to do math
X_train = X_train.astype(float)
X_validation = X_validation.astype(float)

print(f"Success! We are using {X_train.shape[1]} features.")


Success! We are using 71 features.


In [9]:
X_train.head()

#sanity check

Unnamed: 0,metascore,userscore,runtime,production_budget_log,theatre_count_log,release_year,genre_Action,genre_Adult,genre_Adventure,genre_Animation,genre_Biography,genre_Black Comedy,genre_Comedy,genre_Concert/Performance,genre_Crime,genre_Documentary,genre_Drama,genre_Educational,genre_Family,genre_Fantasy,genre_History,genre_Horror,genre_Multiple Genres,genre_Music,genre_Musical,genre_Mystery,genre_News,genre_Reality,genre_Romance,genre_Romantic Comedy,genre_Sci-Fi,genre_Short,genre_Sport,genre_Thriller,genre_Thriller/Suspense,genre_Unknown,genre_War,genre_Western,rating_missing,rating_G,rating_NC-17,rating_Not Rated,rating_PG,rating_PG-13,rating_R,season_Fall,season_Spring,season_Summer,season_Winter,summer_release,holiday_release,user_embed_1,user_embed_2,user_embed_3,user_embed_4,user_embed_5,user_embed_6,user_embed_7,user_embed_8,user_embed_9,user_embed_10,expert_embed_1,expert_embed_2,expert_embed_3,expert_embed_4,expert_embed_5,expert_embed_6,expert_embed_7,expert_embed_8,expert_embed_9,expert_embed_10
0,50.0,6.9,90.0,16.993564,4.26268,2017.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,-22.072214,-3.350949,2.071532,8.362472,9.307377,3.08252,1.040087,-0.247562,-0.7972,-4.038525,-20.344713,5.514405,-6.495228,2.690563,-5.935856,-2.313581,-1.974514,-1.39836,1.969382,-3.230158
1,59.0,6.7,96.0,16.993564,4.26268,2021.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.791063,0.105657,-0.019576,-0.004482,0.081379,0.002436,0.027291,0.004146,0.04722,-0.040455,5.248415,-0.013454,-0.028619,0.025156,-0.031859,-0.024816,-0.012283,-0.021413,0.034195,-0.025543
2,59.0,6.7,82.0,16.993564,4.26268,2021.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,4.791063,0.105657,-0.019576,-0.004482,0.081379,0.002436,0.027291,0.004146,0.04722,-0.040455,5.248415,-0.013454,-0.028619,0.025156,-0.031859,-0.024816,-0.012283,-0.021413,0.034195,-0.025543
3,59.0,6.7,90.0,16.993564,4.26268,2015.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,4.791063,0.105657,-0.019576,-0.004482,0.081379,0.002436,0.027291,0.004146,0.04722,-0.040455,5.248415,-0.013454,-0.028619,0.025156,-0.031859,-0.024816,-0.012283,-0.021413,0.034195,-0.025543
4,56.0,8.0,99.0,16.993564,4.26268,2020.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,-19.036177,-3.183167,-2.557212,-1.681025,12.416853,-9.977543,-0.942114,-2.781208,2.153989,0.649213,-20.343374,0.86896,-5.471694,2.868397,-2.174418,-3.040885,-1.008885,-6.594128,-6.573283,-4.448895


Now we can start modelling, starting with bagging
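
Before fitting the real model, here is a minimal pure-Python sketch of the two ideas behind bagging: every tree is trained on a bootstrap sample (rows drawn with replacement), and the ensemble combines the individual predictions. The helpers `bootstrap_sample` and `majority_vote` are hypothetical names for illustration only. As far as we know, scikit-learn's `BaggingClassifier` averages predicted class probabilities when the base estimator provides them, so the hard vote shown here is a simplification.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: the training data for one tree."""
    return rng.choices(rows, k=len(rows))


def majority_vote(predictions):
    """Return the class predicted by most of the individual trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(1234)
rows = list(range(10))            # stand-in for row indices of the training set

bag = bootstrap_sample(rows, rng)
# A bag has the same size as the data but typically contains duplicates
print(len(bag), len(set(bag)))

# Five hypothetical trees vote on the sales tier of one movie
tier = majority_vote([2, 0, 2, 1, 2])
print(tier)
```

Because every tree sees a slightly different bag, the individual trees make different mistakes, and combining them reduces the variance of a single deep tree.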


In [10]:
# 1. Initialize the base tree 
# We use a Decision Tree as the individual "voter" in the crowd
base_tree = DecisionTreeClassifier(random_state=1234)

# 2. Initialize the bagging classifier
# We build 50 trees at first
bagging_model = BaggingClassifier(
    estimator=base_tree, 
    n_estimators=50, 
    random_state=1234
)

# 3. Fit the model (The "Training" phase)
# The model "studies" the relationship between X features and y sales tiers
bagging_model.fit(X_train, y_train)

# 4. Make predictions on the validation set
# We use X_validation to see how well the model learned
y_pred_val = bagging_model.predict(X_validation)

# 5. Evaluate the results
val_accuracy = accuracy_score(y_validation, y_pred_val)
print(f"Validation Accuracy: {val_accuracy:.2%}")
print("\nDetailed Report:")
print(classification_report(y_validation, y_pred_val))

Validation Accuracy: 57.26%

Detailed Report:
              precision    recall  f1-score   support

           0       0.54      0.58      0.56      1150
           1       0.46      0.46      0.46      1185
           2       0.73      0.68      0.70      1149

    accuracy                           0.57      3484
   macro avg       0.58      0.57      0.58      3484
weighted avg       0.58      0.57      0.57      3484



#### Model Interpretation: Initial bagging results
Our first model used a Bagging Classifier with 50 Decision Trees. Compared with a random-guessing baseline of 33.3% (for three balanced classes), the model's accuracy of 57.26% is a big improvement. However, the baseline logistic regression model did almost as well. Some interpretation:

- Tier 2 (High Success): This is the model’s strongest area. A precision of 73% suggests that the top-performing movies have distinctive features, probably a high budget or perhaps very good expert reviews.

- Tier 0 (Low Sales): The model is moderately successful here (56% F1-score). It can identify "flops" or niche movies with average reliability.

- Tier 1 (Mid-Range): This is the most difficult category. The low precision (46%) suggests that mid-tier movies often "look" like either low-tier or high-tier movies to the model.
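
The 33.3% random-guessing baseline can be checked against the class supports in the validation report. The sketch below is our own illustration; it also computes the accuracy of always predicting the largest class, which is a slightly stricter reference point.

```python
# Class supports taken from the validation report above
support = {0: 1150, 1: 1185, 2: 1149}
n_total = sum(support.values())
n_classes = len(support)

# Uniform random guessing: each movie is labelled correctly with probability 1/3
random_guess_acc = sum((n / n_total) * (1 / n_classes) for n in support.values())

# Always predicting the most frequent tier
majority_acc = max(support.values()) / n_total

print(f"Random guessing: {random_guess_acc:.2%}")
print(f"Majority class:  {majority_acc:.2%}")
```

Both baselines land around one third, so the bagged trees clearly learn something from the features.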

Now we move to a random forest and see if the performance increases

In [11]:
# 1. Initialize the Random Forest with standard settings
# We use 100 trees (the default) to see the baseline performance
rf_baseline = RandomForestClassifier(n_estimators=100, random_state=1234)

# 2. Fit the model
rf_baseline.fit(X_train, y_train)

# 3. Predict on the validation set
y_pred_rf = rf_baseline.predict(X_validation)

# 4. Results
rf_acc = accuracy_score(y_validation, y_pred_rf)
print(f"Random Forest Baseline Accuracy: {rf_acc:.2%}")
# Compare against the bagging accuracy computed above instead of a hard-coded number
print(f"Difference vs. bagging: {rf_acc - val_accuracy:+.2%}")
print("\nDetailed Report:")
print(classification_report(y_validation, y_pred_rf))

Random Forest Baseline Accuracy: 56.34%
Difference vs. bagging: -0.92%

Detailed Report:
              precision    recall  f1-score   support

           0       0.54      0.55      0.55      1150
           1       0.46      0.45      0.45      1185
           2       0.69      0.70      0.70      1149

    accuracy                           0.56      3484
   macro avg       0.56      0.56      0.56      3484
weighted avg       0.56      0.56      0.56      3484



#### The random forest did not do better
The model did slightly worse than the bagged trees (56.34% vs. 57.26%), interesting:

- Tier 2 is still the easiest tier to predict. So even when the model only sees a random subset of the features at each split, these movies still stand out

- Tier 1 remains stuck at a 0.45 F1-score. This indicates that the "confusion" isn't necessarily because the model is looking at too many features, but perhaps because the data for Tier 1 and Tier 0 looks very similar.
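
The main difference with bagging is how many features each split may consider. As a rough sketch for our 71 features, the numbers below assume that scikit-learn truncates the square root and the base-2 logarithm to an integer; treat that rounding rule as an assumption rather than a guarantee.

```python
import math

n_features = 71  # number of features reported after preprocessing

features_per_split = {
    "sqrt": max(1, int(math.sqrt(n_features))),   # random forest default for classification
    "log2": max(1, int(math.log2(n_features))),
    "all": n_features,                            # what the bagged decision trees use
}

for setting, n in features_per_split.items():
    print(f"max_features={setting}: {n} of {n_features} features per split")
```

With only a handful of candidate features per split, the trees in the forest are less correlated with each other, but each individual tree is also weaker.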

Now we will use hyperparameter tuning to find the best settings for the random forest. It could very well be that the trees were overfitting.

In [None]:
# 1. Define the "Grid" of settings
# We are testing: 
# - More trees (200) vs fewer (100)
# - Shallow trees (10) vs Deep trees (20) vs Infinite (None)
# - How picky the tree is before making a new branch (min_samples_split)
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2']
}

# 2. Initialize the Grid Search
# n_jobs=-1 tells your computer to use all its "brains" (cores) to work faster
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=1234),
    param_grid=param_grid,
    cv=3, 
    scoring='accuracy',
    verbose=2, 
    n_jobs=-1
)

# 3. Fit the search
print("Starting the search for the best hyperparameters...")
grid_search.fit(X_train, y_train)

# 4. Extract the winner
best_rf = grid_search.best_estimator_
y_pred_tuned = best_rf.predict(X_validation)

# 5. Final Comparison
print(f"\nBest Settings Found: {grid_search.best_params_}")
print(f"Tuned Random Forest Accuracy: {accuracy_score(y_validation, y_pred_tuned):.2%}")
print("\nDetailed Tuned Report:")
print(classification_report(y_validation, y_pred_tuned))

Starting the search for the best hyperparameters...
Fitting 3 folds for each of 36 candidates, totalling 108 fits

Best Settings Found: {'max_depth': 20, 'max_features': 'sqrt', 'min_samples_split': 10, 'n_estimators': 200}
Tuned Random Forest Accuracy: 61.57%

Detailed Tuned Report:
              precision    recall  f1-score   support

           0       0.58      0.70      0.63      1150
           1       0.51      0.47      0.49      1185
           2       0.78      0.68      0.73      1149

    accuracy                           0.62      3484
   macro avg       0.62      0.62      0.62      3484
weighted avg       0.62      0.62      0.62      3484



### Analysis of grid search:
1. Grid search
The grid search identified a configuration that balances model complexity with generalization. Going from the baseline random forest (56.34%) to the tuned version (61.57%) is an improvement of 5.23 percentage points in validation accuracy.

The Winning Configuration: The model performed best with 200 individual trees (n_estimators), which gives a more stable consensus than a smaller ensemble. Setting the maximum depth to 20 (max_depth) appears to hit a "Goldilocks" zone: deep enough to capture complex relationships in the embeddings, but shallow enough to avoid memorizing noise. Furthermore, requiring at least 10 samples to make a split (min_samples_split) acts as a regularizer, because a split can no longer be made on just a handful of movies.
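
To see where the "36 candidates" and "108 fits" in the search log come from, the grid can be enumerated by hand. This is a small sketch with the standard library only; `GridSearchCV` does the enumeration internally.

```python
from itertools import product

# Same grid as in the search above
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2"],
}
n_folds = 3

# Every combination of one value per hyperparameter is one candidate
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_fits = len(candidates) * n_folds

print(f"{len(candidates)} candidates x {n_folds} folds = {n_fits} fits")
```

The number of fits grows multiplicatively with every extra value in the grid, which is why we kept the grid small.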

2. Class-Level Performance Deep Dive
The tuning process sharpened the model's ability to distinguish the "extremes" of the dataset, though the middle range remains a challenge.

Tier 2 (High Sales) - The Strongest Performer: This category is still predicted best, with precision rising from 0.69 to 0.78 and an F1-score of 0.73. This indicates that high-performing movies have a distinct "signature" in the feature set (likely driven by production budget and critic scores) that the model identifies with fairly high confidence.

Tier 0 (Low Sales) - Improved Recall: The model became much more sensitive to low-performing titles, with recall rising from 0.55 to 0.70. It is now catching a higher percentage of "flops" than any previous iteration.

Tier 1 (Mid-Range) - The Stagnant Zone: Despite the tuning, Tier 1 remains the "Confusion Zone" with an F1-score of 0.49. The mid-range movies likely share too many characteristics with both successful and unsuccessful titles, making them statistically difficult to isolate using parallel tree-based methods.
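
The F1-scores in the tuned report follow directly from precision and recall. The sketch below recomputes them from the rounded values in that report, so small rounding differences with scikit-learn's own numbers are possible; the function `f1` is our own helper.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per tier, copied from the tuned random forest report
tuned_rf = {0: (0.58, 0.70), 1: (0.51, 0.47), 2: (0.78, 0.68)}

f1_by_tier = {tier: f1(p, r) for tier, (p, r) in tuned_rf.items()}
macro_f1 = sum(f1_by_tier.values()) / len(f1_by_tier)

for tier, score in f1_by_tier.items():
    print(f"Tier {tier}: F1 = {score:.2f}")
print(f"Macro average F1 = {macro_f1:.2f}")
```

Because F1 is a harmonic mean, it is pulled toward the weaker of the two numbers, which is why Tier 1 stays low even though its precision is above 0.5.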

3. Key Observations
The results are consistent with a better balance in the bias-variance trade-off. By limiting depth and increasing the sample split requirement, we reduced the model's variance (overfitting). Note that the ~62% accuracy is measured on the validation set; whether it holds up on entirely new, unseen movie data still has to be confirmed on the held-out test set.

The high precision in the top tier suggests this model is already useful for identifying potential blockbusters, even if it still struggles to categorize "average" performers.

Now, we start BOOSTING!
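
Where bagging trains its trees independently, boosting trains them one after another and gives more weight to the movies the previous tree got wrong. The sketch below shows one reweighting round of the textbook multi-class AdaBoost variant (SAMME). The function `samme_round` is a hypothetical helper, and we do not claim it matches the internals of the installed scikit-learn version.

```python
import math


def samme_round(weights, is_wrong, n_classes, learning_rate=1.0):
    """One textbook SAMME round: returns the tree's vote weight and new sample weights.

    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current tree
    err = sum(w for w, wrong in zip(weights, is_wrong) if wrong) / sum(weights)

    # Vote weight of this tree; the log(K - 1) term is the multi-class correction
    alpha = learning_rate * (math.log((1 - err) / err) + math.log(n_classes - 1))

    # Misclassified samples are up-weighted, then everything is renormalized
    new_weights = [
        w * math.exp(alpha) if wrong else w for w, wrong in zip(weights, is_wrong)
    ]
    total = sum(new_weights)
    return alpha, [w / total for w in new_weights]


# Four movies with equal weight; the tree misclassifies only the first one
alpha, new_weights = samme_round(
    weights=[0.25, 0.25, 0.25, 0.25],
    is_wrong=[True, False, False, False],
    n_classes=3,
)
print(round(alpha, 4), [round(w, 4) for w in new_weights])
```

After this round the misclassified movie carries two thirds of the total weight, so the next tree is pushed to get it right. A smaller `learning_rate`, such as the 0.1 we use, shrinks `alpha` and makes these updates more gradual.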

In [13]:
from sklearn.ensemble import AdaBoostClassifier

# 1. Initialize AdaBoost
# We use a slightly deeper tree than the default stump (max_depth=1)
# because our data has complex embeddings
ada_model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3, random_state=1234),
    n_estimators=100,
    learning_rate=0.1,
    random_state=1234
)

# 2. Fit and Predict (fit once, then reuse the fitted model)
ada_model.fit(X_train, y_train)
y_pred_ada = ada_model.predict(X_validation)

# 3. Results
print("--- ADABOOST PERFORMANCE ---")
print(f"Accuracy: {accuracy_score(y_validation, y_pred_ada):.2%}")

--- ADABOOST PERFORMANCE ---
Accuracy: 58.30%


In [14]:
from xgboost import XGBClassifier

# 1. Initialize XGBoost
xgb_model = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    random_state=1234,
    eval_metric='mlogloss'
)

# 2. Fit and Predict
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_validation)

# 3. Results
print("--- XGBOOST PERFORMANCE ---")
print(f"Accuracy: {accuracy_score(y_validation, y_pred_xgb):.2%}")
print("\nDetailed XGBoost Report:")
print(classification_report(y_validation, y_pred_xgb))

--- XGBOOST PERFORMANCE ---
Accuracy: 62.43%

Detailed XGBoost Report:
              precision    recall  f1-score   support

           0       0.58      0.72      0.64      1150
           1       0.52      0.50      0.51      1185
           2       0.81      0.67      0.73      1149

    accuracy                           0.62      3484
   macro avg       0.64      0.63      0.63      3484
weighted avg       0.64      0.62      0.63      3484

