# table of contents
1. [predictions, part I](#predictions,-part-I)
2. [preprocessing](#preprocessing)
3. [calculations](#calculations)
   1. [batch model evaluation](#batch-model-evaluation)
   2. [single model evaluation](#single-model-evaluation)
   3. [LogisticRegression, coefficients](#LogisticRegression,-coefficients)
   4. [DecisionTreeClassifier, tree importance](#DecisionTreeClassifier,-tree-importance) 

# predictions, part I
- drop columns: no
- `scaling: yes`
- hyperparameter tuning: no
- one-hot encoding: yes, the dataset was received encoded
- resampling: no

`The main takeaways from this session:`
- I'm testing every model's performance on unscaled and scaled data.
- RandomForestClassifier is the best performer, both as a standalone model and as the base estimator for 2 ensemble methods.
- Scaling significantly improves accuracy only for KNN and for Bagging with KNN as the base estimator.
- AdaBoostClassifier has an extremely high computational cost when paired with 2 of the base estimators.

# preprocessing

In [None]:
# import libraries
%run common_imports.py

# load and split data
%run load_and_split_data.py
X_train, X_test, y_train, y_test = load_and_split_data()

# scale data
%run minmaxscaler.py
X_train_scaled, X_test_scaled = scale_data(X_train, X_test)
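
The contents of minmaxscaler.py are not shown in this notebook, so the block below is only a sketch of what a helper like `scale_data` might look like, assuming it wraps scikit-learn's `MinMaxScaler`. The name `scale_data_sketch` and the toy frames are hypothetical. The point it illustrates is that the scaler is fitted on the training split only and then reused on the test split.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_data_sketch(X_train, X_test):
    """Hypothetical stand-in for the scale_data helper in minmaxscaler.py."""
    scaler = MinMaxScaler()
    # Fit on the training split only, so no test-set information leaks into the scaling
    X_train_scaled = pd.DataFrame(
        scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
    )
    # Reuse the training min/max for the test split
    X_test_scaled = pd.DataFrame(
        scaler.transform(X_test), columns=X_test.columns, index=X_test.index
    )
    return X_train_scaled, X_test_scaled


# Toy data, made up for illustration only
toy_train = pd.DataFrame({"lead_time": [0, 50, 100], "adults": [1, 2, 3]})
toy_test = pd.DataFrame({"lead_time": [25, 150], "adults": [2, 4]})
toy_train_scaled, toy_test_scaled = scale_data_sketch(toy_train, toy_test)
print(toy_test_scaled)
```

Because the test split is transformed with the training minimum and maximum, test values outside the training range fall outside [0, 1]; `MinMaxScaler` does not clip by default.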

# calculations

There are 2 approaches, depending on your needs and available computing capacity:


1. `batch:` calculates accuracy scores and computation time for a group of models and saves both into a dataframe.\
This approach is recommended because the results can be easily collected and compared in later iterations: unscaled vs scaled data, etc.\
Caution: computation time for this dataset is around 20 minutes.

2. `single model:` calculates accuracy score, classification report, and runtime for a single model; doesn't save the results into a dataframe.

Both approaches require the `timer` function defined in the cell below. For reusability, the function is also saved in utils.py.

In [2]:
# Define a decorator to measure the runtime of functions

from time import time

def timer(func):
    def wrapper(*args, **kwargs):
        start_time = time()
        result = func(*args, **kwargs)
        end_time = time()
        runtime = int(end_time - start_time)
        return result, runtime
    return wrapper
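
To make the decorator's return value concrete, here is a small self-contained sketch. It repeats the `timer` definition so the block runs on its own; `functools.wraps` is an optional addition that is not part of the original cell, and `slow_square` is a hypothetical function used only for the demonstration.

```python
from functools import wraps
from time import sleep, time


def timer(func):
    @wraps(func)  # optional addition: keeps the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start_time = time()
        result = func(*args, **kwargs)
        runtime = int(time() - start_time)  # whole seconds, truncated
        return result, runtime
    return wrapper


@timer
def slow_square(x):
    """Hypothetical function standing in for a model-training call."""
    sleep(0.1)
    return x * x


# A decorated function returns a (result, runtime) tuple instead of the bare result
result, runtime = slow_square(4)
print(result, runtime)
```

Note that `int()` truncates, so any call that finishes in under a second is reported with a runtime of 0.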

## batch model evaluation

In [None]:
# Calculate accuracy score and computation cost for a group of models, save the results into a DataFrame

# List of models
models = [
    KNeighborsClassifier(),
    LogisticRegression(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GradientBoostingClassifier(),
    BaggingClassifier(),
    BaggingClassifier(KNeighborsClassifier()),
    BaggingClassifier(LogisticRegression()),
    BaggingClassifier(RandomForestClassifier()),
    BaggingClassifier(GradientBoostingClassifier()),
    AdaBoostClassifier(),
    AdaBoostClassifier(LogisticRegression()),
    AdaBoostClassifier(RandomForestClassifier()),
    AdaBoostClassifier(GradientBoostingClassifier())
]

# Empty lists to store results
model_names = []
estimator_names = []
accuracies = []
times = []
sources = []

# Train and evaluate model, returning accuracy and runtime
@timer
def train_evaluate_model(model, X_train, y_train, X_test, y_test):
    display(model.fit(X_train, y_train))
    accuracy = round(model.score(X_test, y_test) * 100, 2)
    return accuracy

# Calculate accuracy and time for each model with both scaled and unscaled data
for model in models:
    for scaled in [False, True]:
        X_train_use = X_train_scaled if scaled else X_train
        X_test_use = X_test_scaled if scaled else X_test

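        model_name = type(model).__name__
        # Only the Bagging/AdaBoost wrappers take a user-chosen base estimator.
        # The parameter is `estimator` in recent scikit-learn and `base_estimator` in older releases.
        base = None
        if isinstance(model, (BaggingClassifier, AdaBoostClassifier)):
            base = getattr(model, "estimator", getattr(model, "base_estimator", None))
        # Fall back to the model's own name when no base estimator was passed
        estimator_name = type(base).__name__ if base is not None else model_name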

        accuracy, time_taken = train_evaluate_model(model, X_train_use, y_train, X_test_use, y_test)
        
        model_names.append(model_name)
        estimator_names.append(estimator_name)
        accuracies.append(accuracy)
        times.append(time_taken)
        sources.append("scaled" if scaled else "unscaled")

# Create DataFrame
accuracies_without_parameters = pd.DataFrame({
    "model": model_names,
    "estimator": estimator_names,
    "accuracy_in_%": accuracies,
    "runtime_in_seconds": times,
    "source": sources
})

# Save the DataFrame to a CSV file
accuracies_without_parameters.to_csv("../data/test.csv", index=False)

# Display the DataFrame
accuracies_without_parameters
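
Once the results are in a DataFrame, the scaled and unscaled runs can be compared side by side. The block below is one possible way to do that with `pivot_table`, using the same column names as the DataFrame above. The rows are placeholders made up for illustration; they are not results from this notebook.

```python
import pandas as pd

# Placeholder rows, made up for illustration; NOT results from this notebook
results = pd.DataFrame({
    "model": ["KNeighborsClassifier", "KNeighborsClassifier",
              "LogisticRegression", "LogisticRegression"],
    "estimator": ["KNeighborsClassifier", "KNeighborsClassifier",
                  "LogisticRegression", "LogisticRegression"],
    "accuracy_in_%": [70.0, 80.0, 75.0, 75.5],
    "runtime_in_seconds": [3, 3, 1, 1],
    "source": ["unscaled", "scaled", "unscaled", "scaled"],
})

# One row per model/estimator pair, one column per data source
comparison = results.pivot_table(
    index=["model", "estimator"], columns="source", values="accuracy_in_%"
)

# Positive values mean scaling helped
comparison["gain_from_scaling"] = (comparison["scaled"] - comparison["unscaled"]).round(2)
comparison = comparison.sort_values("gain_from_scaling", ascending=False)
print(comparison)
```

The same pivot works on the saved CSV after loading it with `pd.read_csv`.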

## single model evaluation

In [None]:
# Train and evaluate a single model, returning accuracy, predictions, and runtime

@timer
def train_evaluate_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    accuracy = round(model.score(X_test, y_test) * 100, 2)
    pred = model.predict(X_test)
    return accuracy, pred

# Wrapper function to handle model instantiation and evaluation
def train_evaluate_runtime(model_class, X_train, y_train, X_test, y_test, *args, **kwargs):
    model = model_class(*args, **kwargs)
    (result, runtime) = train_evaluate_model(model, X_train, y_train, X_test, y_test)
    accuracy, pred = result  

    print(f"The accuracy of the model is {accuracy}%")
    print(f"Runtime (seconds): {runtime}\n")
    print(classification_report(y_true=y_test, y_pred=pred))
    
    return model_class.__name__, accuracy, runtime, pred

# Example usage:
# model_name, accuracy, runtime, predictions = train_evaluate_runtime(KNeighborsClassifier, X_train, y_train, X_test, y_test, n_neighbors=5)

## LogisticRegression, coefficients
The `log_reg_coefficients` function below calculates and displays the coefficients of the features in a logistic regression model, sorted by absolute value in descending order. The absolute value indicates how strongly a feature affects the model's predictions.

`Top 5 features by absolute coefficient:`
1. deposit_type_No_Deposit 1.97
2. deposit_type_Non_Refund 1.69
3. previous_cancellations 1.27
4. required_car_parking_spaces 0.98
5. market_segment_Offline_TA_TO 0.78

In [None]:
def log_reg_coefficients():
    coefficients = LogisticRegression().fit(X_train, y_train).coef_[0]
    feature_names = X_train.columns
    coefficients_df = pd.DataFrame({"Feature": feature_names, "Coefficient": coefficients})
    coefficients_df["Absolute Coefficient"] = abs(coefficients_df["Coefficient"])
    coefficients_df[["Coefficient", "Absolute Coefficient"]] = coefficients_df[["Coefficient", "Absolute Coefficient"]].round(2)
    coefficients_df = coefficients_df.sort_values(by="Absolute Coefficient", ascending=False)
    display(coefficients_df)

log_reg_coefficients()
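
As a variation on the function above, the sorting step can be done without a helper column by passing `key=abs` to `sort_values`. The sketch below assumes pandas 1.1 or newer and uses synthetic data from `make_classification`; the names `coefficients_table`, `X_demo`, and `feature_0` to `feature_3` are hypothetical and unrelated to the booking dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def coefficients_table(model, feature_names):
    """Return the coefficients of a fitted binary linear model, largest magnitude first."""
    table = pd.DataFrame({"Feature": feature_names, "Coefficient": model.coef_[0]})
    # key=abs sorts by magnitude while keeping the signed coefficient in the table
    return table.sort_values(by="Coefficient", key=abs, ascending=False).reset_index(drop=True)


# Synthetic data, for illustration only
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
demo_features = [f"feature_{i}" for i in range(X_demo.shape[1])]
demo_model = LogisticRegression().fit(X_demo, y_demo)

demo_table = coefficients_table(demo_model, demo_features)
print(demo_table.round(2))
```

Taking the fitted model as an argument also makes it easy to run the same table on a model trained on the scaled data.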

## DecisionTreeClassifier, tree importance
The `dt_tree_importance` function below calculates the importance of each feature in the decision tree, sorts the features in descending order of importance, displays them as a DataFrame, and then generates, displays, and saves a visualisation of the tree.

With the max_depth=2 parameter, the tree uses 3 features (importance score in parentheses):
1. `deposit_type_Non_Refund (0.88):` there are more refundable canceled bookings in absolute terms, but 99% of non-refundable ones were canceled.
2. lead_time (0.12): refundable bookings with lead times <= 14.5 days are less likely to be canceled.
3. previous_bookings_not_canceled (almost 0): a history of not canceling previous bookings is a strong predictor of not canceling the current booking.

In [None]:
def dt_tree_importance():
    dt = DecisionTreeClassifier(max_depth=2)
    dt.fit(X_train, y_train)
    
    # Sort on the numeric importances; round to 2 decimals only for display
    tree_importance = pd.Series(dt.feature_importances_, index=X_train.columns)
    sorted_tree_importance = tree_importance.sort_values(ascending=False).round(2)

    df = sorted_tree_importance.rename_axis("Feature").reset_index(name="Importance")
    display(df)

    dot_data = export_graphviz(dt, out_file=None, filled=True, rounded=True, feature_names=X_train.columns)
    graph = graphviz.Source(dot_data)
    graph.format = "png"
    graph.render("decision_tree_unscaled")
    display(graph)

dt_tree_importance()
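
If graphviz is not available, the split rules of a shallow tree can also be read as plain text. The sketch below uses scikit-learn's `export_text` on synthetic data; it is an alternative view and not part of the original workflow, and the names `X_demo` and `feature_0` to `feature_3` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
demo_features = [f"feature_{i}" for i in range(X_demo.shape[1])]

demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Text rendering of the split rules, one line per node
rules = export_text(demo_tree, feature_names=demo_features)
print(rules)

# Importances paired with feature names, sorted numerically in descending order
importances = sorted(
    zip(demo_features, demo_tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, value in importances:
    print(f"{name}: {value:.2f}")
```

With `max_depth=2` the tree has at most 3 split nodes, so at most 3 features can receive a non-zero importance.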

Next: notebook_05_machine_learning_02_hyperparameter_tuning