| Links ausgerichtet | Mittig ausgerichtet | Rechts ausgerichtet |
|:------------------ |:-------------------:| -------------------:|
| Inhalt             | Inhalt              | Inhalt              |
| Inhalt             | Inhalt              | Inhalt              |


## RF - Random Forest

Random Forest is a powerful ensemble learning method that operates by constructing a multitude of decision trees during training. Each tree in the forest independently predicts the target variable and the final prediction is determined by averaging among these individual predictions. Random Forest is particularly adept at handling high-dimensional data and mitigating overfitting which we need in our case.

In [1]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, log_loss, mean_squared_error
from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, log_loss, mean_squared_error

import numpy as np
import pandas as pd
from Helper.Data import loadData
from Helper.Perform_CrossVal import perform_cross_validation
from Helper.Perform_GridSearch import perform_grid_search
import pickle


pd.options.mode.chained_assignment = None

In [2]:
X_final, y, X_train, X_test, y_train, y_test, data, feature_columns, categorical_features,target_column, label_encoder = loadData()
X_final

Unnamed: 0,X,Y,DayOfWeek_Friday,DayOfWeek_Monday,DayOfWeek_Saturday,DayOfWeek_Sunday,DayOfWeek_Thursday,DayOfWeek_Tuesday,DayOfWeek_Wednesday,PdDistrict_BAYVIEW,...,Events_Clear,Events_Fog,Events_Fog-Rain,Events_Rain,Events_Rain-Thunderstorm,Events_Thunderstorm,season_Autumn,season_Spring,season_Summer,season_Winter
0,-122.426995,37.800873,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,-122.438738,37.771541,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,-122.403252,37.713431,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,-122.423327,37.725138,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,-122.371274,37.727564,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395397,-122.431046,37.783030,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
395398,-122.414073,37.751685,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
395399,-122.389769,37.730564,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
395400,-122.447364,37.731948,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In Random Forest for multi-class classification, categorical features may require numerical encoding, but the target variable can be provided as either strings or integers directly, without needing binary transformation.

In [3]:
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

The code initializes a Random Forest classifier named rf_clf, specifying two parameters:
- n_estimators: The n_estimators parameter in Random Forest determines the number of decision trees in the ensemble. Increasing it can improve performance, but it also increases training and prediction times. Too many trees may lead to overfitting and diminishing returns. In this case, 100 trees are used, which is a common default choice balancing between model performance and computational efficiency.
- random_state: ensures reproducibility of results by setting the random number generator seed for consistent model evaluation and comparison across different runs or environments.

In [4]:
# define scoring metrics
scoring = {
    'accuracy': 'accuracy',
    'f1_macro': 'f1_macro',
    'roc_auc_ovr': 'roc_auc_ovr'
}

# cross validation
cv_results = cross_validate(rf_clf, X_final, y, cv=4, scoring=scoring, return_train_score=True)

# average and std results
for metric in scoring.keys():
    print(f"Average {metric}: {np.mean(cv_results[f'test_{metric}']) * 100:.4f}%")
    print(f"Standard deviation {metric}: {np.std(cv_results[f'test_{metric}']) * 100:.4f}%")

# differences between train and test
for metric in scoring.keys():
    test_std = np.std(cv_results[f'test_{metric}'])
    train_std = np.std(cv_results[f'train_{metric}'])
    print(f"Difference in std between train and test {metric}: {(train_std - test_std) * 100:.4f}%")

with open('rf_cross_val_result.pkl', 'wb') as f:
    pickle.dump(cv_results, f)

Average accuracy: 41.8136%
Standard deviation accuracy: 1.9134%
Average f1_macro: 30.6033%
Standard deviation f1_macro: 1.7710%
Average roc_auc_ovr: 64.2488%
Standard deviation roc_auc_ovr: 0.5959%
Difference in std between train and test accuracy: -1.7746%
Difference in std between train and test f1_macro: -1.6224%
Difference in std between train and test roc_auc_ovr: -0.5644%


#### Preliminary Testing Crossvalidation Results
Given the context of a multiclass classification scenario with five different outcomes, let's delve into interpreting our cross-validation results:
- **Average Accuracy 41.82%**: In our initial testing, Random Forest demonstrated a low average accuracy of 41.8136% in a 5-fold cross-validation. Though better than random guessing, it still falls short. It appears to struggle with our multiclass problem too.
- **Standard Deviation of Accuracy 1.91%**: The model's performance remains stable across various subsets of our data.
- **Average Macro F1 Score 30.60%**: A score of 30.60% reflects a significant challenge, signaling that the model encounters difficulties with both precision and recall uniformly across all classes. This hints at potential issues in correctly recognizing all instances of a class or in distinguishing between classes without generating numerous incorrect predictions.
- **Standard Deviation of Macro F1 Score 1.77%**: A standard deviation of about 2%, is evident in the precision and recall of the model across folds. It doesn't exceed excessive levels.

- **Average ROC_AUC_OVR 64.25%**: A score of 64.25% indicates the model's proficiency in distinguishing between different categories. It's important to consider that this score might be even higher given the challenge of dealing with multiple classes.

- **Differences in Standard Deviation Between Train and Test (Accuracy, F1 Macro, ROC_AUC_OVR)**: The model being more consistent on the test set than on the training set, possibly indicating underfitting and reduced accuracy in capturing data complexity.

#### Preliminary Testing Crossvalidation Conclusion
Overall our model displays stable performance across folds but faces challenges in achieving high accuracy, particularly in precision and recall, resulting in a low F1 score. The ROC_AUC_OVR score indicates an acceptable level, however, there is still substantial potential for overall enhancement.

In [5]:
# Training the classifier
rf_clf.fit(X_train, y_train)

# Predictions on the test set
y_pred = rf_clf.predict(X_test)

# Predicting the probabilities for the test set
y_pred_proba = rf_clf.predict_proba(X_test)

# Calculation of various metrics
# We use the “weighted” parameter because this takes imbalances into account
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

# Random Forest does not provide probability estimates directly like XGBoost
# For ROC-AUC score, we can only calculate it for binary or multiclass classification with one-vs-one or one-vs-rest strategy.
# Here, we'll calculate ROC-AUC score for binary classification since RandomForestClassifier does not output probabilities for multiclass directly.

lb = LabelBinarizer()
lb.fit(y_test)
y_test_binarized = lb.transform(y_test)
y_pred_binarized = lb.transform(y_pred)

roc_auc = roc_auc_score(y_test_binarized[:, 0], y_pred_proba[:, 1])  # Assuming binary classification
logloss = log_loss(y_test, y_pred_proba)

print(f"Accuracy: {accuracy:.4f}")
print(f"F1-score: {f1:.4f}")
print(f"ROC-AUC Score (Binary): {roc_auc:.4f}")
print(f"Log Loss: {logloss:.4f}")

with open('random_forest_model.pkl', 'wb') as f:
    pickle.dump(rf_clf, f)

Accuracy: 0.5288
F1-score: 0.5156
ROC-AUC Score (Binary): 0.4813
Log Loss: 2.5616


#### Preliminary Testing Prediction Results
- **Accuracy 52.88%**: An accuracy of 52.88% indicates that the model correctly predicts the outcome more than half of the time. While still showing room for improvement, this suggests the model is performing significantly better than expected in comparison to cross-validation results.
- **F1-score 51.563%**: The F1-score is 51.563, indicating room for improvement in balancing precision and recall, particularly in multiclass classification contexts where achieving high precision and recall is challenging.
- **ROC-AUC OVO Score 48.13%**: This score falls below 50%, suggesting challenges in distinguishing between the various classes.


#### Preliminary Testing Prediction Conclusion
The results are acceptable for preliminary testing, however, there is a clear need for improvement, particularly in enhancing the ROC for distinguishing different classes and other relevant scores above 50%.

In [None]:
# extract feature importance
importance = rf_clf.feature_importances_
feature_names = X_train.columns.tolist()

# plot feature importance
plt.figure(figsize=(10, 6))
plt.bar(range(len(importance)), importance)
plt.xticks(range(len(importance)), feature_names, rotation=90)
plt.title('Feature Importance')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.show()

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
import matplotlib.pyplot as plt
# initialise RFECV
cv_strategy = StratifiedKFold(n_splits=5)
rfecv = RFECV(estimator=rf_clf, step=1, cv=cv_strategy, scoring='accuracy')

# RFECV fit
rfecv.fit(X_train, y_train)

# print optimal number of features
print("Optimal number of features: %d" % rfecv.n_features_)

# Extracting the feature names based on RFECV support
#selected_features = X_train.columns[rfecv.support_]

# Output of the selected feature names and their rankings
print("Selected features and their rankings:")
for rank, feature in zip(rfecv.ranking_, X_train.columns):
    print(f"{feature}: Rank {rank}")

# Plot
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross Validation Score (Accuracy)")
plt.plot(range(1, len(rfecv.cv_results_['mean_test_score']) + 1), rfecv.cv_results_['mean_test_score'])
plt.title('RFECV - Performance of the model')
plt.show()

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Confusion Matrix
y_test_names = label_encoder.inverse_transform(y_test)
cm = confusion_matrix(y_test_names, label_encoder.inverse_transform(y_pred)) # Annahme, dass y_pred die Vorhersagen sind
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

In [None]:
from itertools import cycle
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

n_classes = y_test_binarized.shape[1]
fpr = dict()
tpr = dict()
roc_auc = dict()

for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_binarized[:, i], y_pred_proba[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

colors = cycle(['blue', 'red', 'green', 'orange', 'black'])

# plot ROC
plt.figure(figsize=(7, 5))
for i, color in zip(range(n_classes), colors):
    # Ersetzen von `i` durch `label_encoder.classes_[i]` für die Klassennamen
    plt.plot(fpr[i], tpr[i], color=color, lw=2,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(label_encoder.classes_[i], roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()


In [None]:
from sklearn.model_selection import GridSearchCV

# grid
param_grid = {
    'max_depth': [3, 4, 5],
    'n_estimators': [100, 200, 300],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt', 'log2']
}

# GridSearchCV-initialization
grid_search = GridSearchCV(estimator=rf_clf, param_grid=param_grid, scoring='accuracy', cv=4, n_jobs=-1)

# search for best params
grid_search.fit(X_train, y_train)

# scores
print(f"Best  : {grid_search.best_params_}")
print(f"Best Accuracy: {grid_search.best_score_ * 100:.2f}%")
with open('rfgrid_search.pkl', 'wb') as f:
    pickle.dump(grid_search, f)

In [None]:
best_params = grid_search.best_params_

best_rf_clf = RandomForestClassifier(**best_params, random_state=42)

best_rf_clf.fit(X_train, y_train)
y_pred = best_rf_clf.predict(X_test)

# Calculate accuracy, F1-score, and ROC-AUC
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
roc_auc = roc_auc_score(y_test, best_rf_clf.predict_proba(X_test), multi_class='ovr', average='weighted')

# Print evaluation metrics
print(f"Accuracy on test data: {accuracy:.4f}")
print(f"F1-score on test data: {f1:.4f}")
print(f"ROC-AUC on test data: {roc_auc:.4f}")

with open('random_forest_model2.pkl', 'wb') as f:
    pickle.dump(best_rf_clf, f)