---

# Jupyter Notebook

**Jupyter Notebook** is an open-source web application that allows creation and sharing of documents. It combines live **code**, **equations**, **visualizations**, and **narrative text**. Jupyter Notebook is widely used in various fields such as data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and more.

Here are some key features:

1. Code Cells
Jupyter Notebooks support code cells for writing and executing programs. Each notebook runs its cells through a kernel for a specific language, so code executes directly within the notebook.

2. Markdown Cells
In addition to code, you can also use Markdown cells to include formatted text using markdown syntax. It supports different heading levels, bold, italics, hyperlinks, and more. This is helpful for documenting your computational workflows in a readable format.

3. Interactive Output
Code in Jupyter can produce rich, interactive outputs such as HTML, images, videos, LaTeX, and custom MIME types, which is particularly useful for data exploration and presenting results.

4. Integration with Big Data Tools
Jupyter can work with big data tools like Apache Spark through additional kernels and libraries (for example PySpark), allowing you to write and execute Spark code in a notebook cell.

5. Support for Multiple Programming Languages
Although "Jupyter" stands for Julia, Python, and R, the notebook supports many other languages, thus making it a versatile tool for polyglot data analysis.

6. Sharing and Conversion
Jupyter notebooks can be easily shared and converted to a number of open standard output formats such as HTML, PDF, LaTeX, etc.

Given these features and the ability to add in-line comments and share notebooks easily, Jupyter Notebook has become a popular choice for teaching and collaborative data analysis projects.

---


# Machine Learning Pipeline

This document outlines the general steps involved in developing a Machine Learning model.

0. **ETL (Extract, Transform, Load)**: ETL is the process of extracting data from primary sources, transforming it (e.g., cleaning, aggregating, and sometimes performing feature engineering), and loading it into suitable data storage. This step is outside the scope of this document, but it is a crucial part of the ML pipeline and is documented in the ETL scripts and notebooks in this project. ETL is specific to the data source: some datasets need more cleaning and feature engineering than others. The ETL process is also iterative, and it's common to revisit this step multiple times during the ML pipeline.

1. **Dataset Selection**: After the ETL process, we select the relevant attributes (features) for our model from the available dataset. It's preferable to perform any necessary aggregations or feature engineering in the ETL phase.

2. **Encode Categorical**: Categorical columns are typically one-hot encoded or label encoded to convert them into a numerical format that the model can understand.

3. **Splitting**: The dataset is split into a training set and a test set. Neural Networks or Multi-Layer Perceptrons might further split the training set to get a validation set.

4. **Balancing**: In cases where the target classes are imbalanced, various techniques like upsampling, downsampling, or creating an ensemble split can be employed to balance the classes.

5. **Regularization/Normalization**: We provide options to use different scaling techniques to normalize numerical attributes.

6. **Model Selection**: At this point, a machine learning model is selected. This could be a manual selection or an automatic one where different models are tested, and the best performing one is chosen based on predefined metrics. Hyperparameters could either be manually defined or selected using techniques like grid search or random search.

7. **Model Training and Evaluation**: Finally, the model is trained using the training data and evaluated on the test set. The results (e.g., accuracy, precision, recall, f1-score, etc.) are then reported.

Each step in this pipeline is crucial and can have a significant impact on the performance of the final model. It's also worth noting that this is an iterative process. Based on the model's performance, you might need to revisit and adjust previous steps.


---
# Dependencies

This Jupyter notebook utilizes several Python libraries that are fundamental to data analysis and machine learning. Below is a brief explanation of the key libraries used:

1. [**Python**](https://www.python.org/): Python is an interpreted, high-level, general-purpose programming language that is widely used in data science and machine learning due to its simple syntax and extensive library support.

2. [**Pandas**](https://pandas.pydata.org/docs/): Pandas is a fast, powerful, and flexible open source data analysis and manipulation library for Python. It provides data structures and functions needed to manipulate structured data, including functionality for manipulating tables, time series data and more.

3. [**SQLAlchemy**](https://www.sqlalchemy.org/): SQLAlchemy is a SQL toolkit and Object-Relational Mapping (ORM) system for Python. It provides a full suite of well-known enterprise-level persistence patterns, designed for efficient and high-performing database access.

4. [**Imbalanced-learn (imblearn)**](https://imbalanced-learn.org/stable/references/index.html): Imbalanced-learn is a Python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of scikit-learn-contrib projects.

5. [**Scikit-learn (sklearn)**](https://scikit-learn.org/stable/modules/classes.html): Scikit-learn is a free software machine learning library for Python. It features various classification, regression, and clustering algorithms, and is designed to interoperate with Python numerical and scientific libraries like NumPy and SciPy.

6. [**XGBoost**](https://xgboost.readthedocs.io/en/stable/python/python_api.html): XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way.

Remember to install these dependencies using pip or conda before trying to run the notebook.

---

# Recommendations

For individuals new to data science, it is highly recommended to first gain a solid understanding of Python and pandas before diving into the more complex aspects of machine learning.

1. **Learn Python**: Python is the foundation upon which all these libraries and tools are built. Understanding Python allows you to leverage its powerful features and make your data analysis process more efficient. Free resources like [Codecademy's Python course](https://www.codecademy.com/learn/learn-python-3) or [Coursera's Python for Everybody](https://www.coursera.org/specializations/python) can be a good starting point.

2. **Learn Pandas**: Pandas is the most popular Python library for data manipulation and analysis. It provides flexible data structures that make it easy to load, process, and analyze datasets of different sizes. [Pandas' documentation](https://pandas.pydata.org/docs/) is a comprehensive resource, and this [10-minute introduction to pandas](https://pandas.pydata.org/docs/user_guide/10min.html) can help you get started.

With a good grasp of Python and pandas, you'll be well-equipped to understand and use the other libraries in this notebook effectively.


---
# Training RandomForest Models with Python

In this Jupyter notebook, we walk through the process of training a RandomForest machine learning model using Python. We will use various powerful libraries including pandas for data manipulation, Imbalanced-learn for handling class imbalance, and Scikit-learn for model training and evaluation. RandomForest is an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes or mean prediction of the individual trees. It can be applied to both classification and regression problems. This notebook aims to provide a comprehensive example of how to handle a practical machine learning task using RandomForest.

---

<br>
<br>

# Import dependencies

The cell below shows different ways of importing libraries and the classes within them. The `import` statement imports a whole library, and the `from ... import ...` form imports a specific class or function from a library. The `as` keyword gives a library or class a shorter alias, for example `import pandas as pd`. A wildcard (`from library import *`) imports every public name from a library, although it is not used here. We can also import functions from other files in our repository, as with `ModelBuilderMethods` at the end of the cell.
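As a small illustration of these import forms, the sketch below uses only standard-library modules, so it runs without the project's dependencies. The variable names are chosen for the example and are not part of the notebook.

```python
# Plain import: bind the whole module to its own name.
import math

# Import with an alias, the same pattern as `import pandas as pd`.
import statistics as stats

# Import a single name from a module.
from pathlib import Path

# Import a single name and rename it in one step.
from collections import OrderedDict as ODict

circle_area = math.pi * 2 ** 2
average = stats.mean([1, 2, 3])
suffix = Path("notebook.ipynb").suffix
ordered = ODict(a=1, b=2)

print(circle_area, average, suffix, list(ordered))
```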


In [1]:
# import dependencies

import pandas as pd
import sqlalchemy as sq
import sys, os
from imblearn.combine import SMOTEENN
from xgboost import XGBRFClassifier
from sklearn.ensemble import (  # type: ignore
    RandomForestClassifier,
)
from imblearn.ensemble import (  # type: ignore
    BalancedRandomForestClassifier,
)

from sklearn.metrics import (  # type: ignore
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    classification_report,
)

sys.path.append("../../")
os.chdir("../../")
from ModelBuilderMethods import getConn, extractYears

# Jupyter cell output settings

Here we modify the default pandas display settings so that notebook output shows full column contents and up to 500 rows.

In [2]:
# unlimited line output
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", 500)

# 1. Dataset Selection

Here we create SQL statements to select the data from the database. We will use the following tables: 
- dataset_cross_monthly_sat
- ergot_sample_feat_eng

To select all columns from a table, we use the following syntax: 

```sql
SELECT * FROM table_name
```

To select specific columns from a table, we use the following syntax: 

```sql
SELECT column_name1, column_name2 FROM table_name
```

To see which columns are available in this project, refer to the readme or use pgAdmin to view the tables.
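To try both query forms without access to the project database, the following sketch runs them against an in-memory SQLite database. The table `demo_samples` and its rows are made up for illustration; only the `SELECT` syntax carries over to the real tables.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the project's database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_samples (year INTEGER, district INTEGER, downgrade INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_samples VALUES (?, ?, ?)",
    [(2019, 4810, 0), (2019, 4820, 1), (2020, 4810, 0)],
)
conn.commit()

# Select every column, then only two of them.
all_columns = pd.read_sql("SELECT * FROM demo_samples", conn)
two_columns = pd.read_sql("SELECT year, downgrade FROM demo_samples", conn)
conn.close()

print(all_columns)
print(two_columns)
```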


In [3]:
weatherSatQuery = sq.text(
    """
    SELECT * from dataset_cross_monthly_sat
"""
)

ergotTargetQuery = sq.text(
    """
    SELECT year, district, downgrade from ergot_sample_feat_eng
"""
)

Here we use getConn to get a connection to the database. We then use the connection with pandas read_sql to get a pandas dataframe of the data requested by the SQL statements. 
`del` removes a variable name once it is no longer needed. When no other references to the object remain, Python can reclaim its memory, which is useful for large dataframes.

In [4]:
conn = getConn("./.env")

satelliteDf = pd.read_sql(weatherSatQuery, conn)
ergotTargetDf = pd.read_sql(ergotTargetQuery, conn)

conn.close()
del conn

Here we use the pandas merge function to merge the two dataframes on the common columns year and district. Since the ergot table has our target variable, we will use a left join to keep all the rows in the ergot table.

In [5]:
# merge on year and district
datasetDf = pd.merge(ergotTargetDf, satelliteDf, on=["year", "district"], how="left")
del ergotTargetDf
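The effect of a left join is easiest to see on toy data. In the sketch below the two small dataframes and the `mean_temperature` column are invented for illustration; the `indicator=True` argument adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

# Left table: one row per target sample (hypothetical values).
targets = pd.DataFrame(
    {
        "year": [2019, 2019, 2020],
        "district": [1, 2, 1],
        "downgrade": [False, True, False],
    }
)

# Right table: features exist only for district 1 (hypothetical values).
features = pd.DataFrame(
    {"year": [2019, 2020], "district": [1, 1], "mean_temperature": [4.2, 5.1]}
)

# A left join keeps every row of `targets`; unmatched rows get NaN features.
merged = pd.merge(
    targets, features, on=["year", "district"], how="left", indicator=True
)
print(merged)
```

Rows marked `left_only` are the ones that end up with missing feature values, which is one reason the notebook checks for NaN before training.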

# 2. Encode categorical values 

When we encounter categorical data, we have to convert it to numbers because machine learning models work with numerical data. One way to convert categorical data to numerical data is to encode it into binary vectors, a process known as [One-Hot Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

Here's how it works:

For each category in the feature, one-hot encoding creates a binary (0/1) feature corresponding to each category in the original feature. Each of these binary features will be "1" when the original feature is equal to the corresponding category and "0" otherwise.

Let's take an example of a feature named "Color". This feature has three categories: "Red", "Green", and "Blue". One-hot encoding this feature would result in three new features: "Is_Red", "Is_Green", and "Is_Blue". If an instance of our data had "Color" = "Red", then "Is_Red" would be 1, while "Is_Green" and "Is_Blue" would both be 0.

So, in summary, one-hot encoding transforms a categorical feature with n categories into n binary features, with only one active.

This method is widely used because it doesn't assume any order of the categories, which can be very helpful when the categorical feature is nominal (i.e., there's no inherent order among the categories). However, it can increase the dimensionality of the dataset dramatically if the categorical feature has many unique categories, which can potentially lead to the curse of dimensionality. This is a situation where the feature space is so high that the learning becomes much harder.
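The "Color" example can be reproduced in a few lines with `pd.get_dummies`. This is a sketch on toy data; the `Is` prefix is chosen to match the feature names used in the explanation above.

```python
import pandas as pd

colors = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Red"]})

# One binary column per category: Is_Blue, Is_Green, Is_Red.
encoded = pd.get_dummies(colors["Color"], prefix="Is")

# drop_first=True drops the first category, leaving n - 1 columns.
reduced = pd.get_dummies(colors["Color"], prefix="Is", drop_first=True)

print(encoded)
print(reduced)
```

With `drop_first=True`, a row of all zeros stands for the dropped category, so no information is lost while one redundant column is avoided.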

Below, we use Pandas [get_dummies()](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) function to one-hot encode the categorical features in our dataset.

In [6]:
# encode district
datasetDf["district"] = datasetDf["district"].astype("category")

temp = pd.get_dummies(datasetDf["district"], prefix="district", drop_first=True)
datasetDf = pd.concat([datasetDf, temp], axis=1)

datasetDf = datasetDf.drop(columns=["district"])

del temp

# 3. Splitting the dataset

There are multiple ways to split the dataset into training and testing sets. [Scikit-learn provides several methods and classes](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) for this.

### Splitter Classes

1. **GroupKFold**: A variant of K-Fold cross-validator that ensures the same group is not represented in both testing and training sets.

2. **GroupShuffleSplit**: A variation of ShuffleSplit, which returns randomized folds, with the constraint that the same group will not appear in two different folds.

3. **KFold**: A standard K-Folds cross-validator that divides all the samples into k groups of samples, of equal or almost equal size.

4. **LeaveOneGroupOut**: A cross-validator that holds out all samples of one group as the test set in each iteration and trains on the remaining groups.

5. **LeavePGroupsOut**: A cross-validator creating training/testing indices to split data according to a third-party provided group, ensuring the same group is not in both testing and training sets.

6. **LeaveOneOut**: A cross-validator that returns training and test indices with one observation removed from the training data for testing. 

7. **LeavePOut**: A cross-validator that creates all the possible training/test sets by removing p samples from the complete set.

8. **PredefinedSplit**: A cross-validator generating a user-defined pre-computed cross-validation split.

9. **RepeatedKFold**: A cross-validator that repeats K-Fold n times with different randomization in each repetition.

10. **RepeatedStratifiedKFold**: A cross-validator that repeats Stratified K-Fold n times with different randomization in each repetition; each fold contains approximately the same percentage of samples of each target class as the complete set.

11. **ShuffleSplit**: A cross-validator generating a user-defined number of independent train/test dataset splits. 

12. **StratifiedKFold**: A cross-validator returning stratified folds, with each fold containing roughly the same proportion of each class label as the complete set.

13. **StratifiedShuffleSplit**: A cross-validator that returns randomized stratified folds.

14. **StratifiedGroupKFold**: A variation of StratifiedKFold, where samples are grouped.

15. **TimeSeriesSplit**: A variation of k-fold which is suitable for time-series data.

### Splitter Functions

1. **check_cv**: A utility function to check the cv parameter and return a cross-validator.

2. **train_test_split**: A function to split arrays or matrices into random train and test subsets. The splitting is random, and the proportion of train to test can be specified.
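As one example of how the group-aware splitters behave, the sketch below runs `GroupKFold` on toy arrays and records which groups land in each side of every fold. The data and group labels are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Six samples belonging to three groups (toy data).
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array(["a", "a", "b", "b", "c", "c"])

splitter = GroupKFold(n_splits=3)
folds = []
for train_idx, test_idx in splitter.split(X, y, groups=groups):
    # Record the set of groups seen on each side of the split.
    folds.append((set(groups[train_idx]), set(groups[test_idx])))

for train_groups, test_groups in folds:
    print("train:", sorted(train_groups), "test:", sorted(test_groups))
```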

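Below we use a much simpler method. Because the dataset has a time component, randomly shuffling rows would let the model train on years that come after the ones it is tested on, so most random splitters are not appropriate here. We hold out the years 2016-2020 as the test set.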


In [7]:
# train 1995 - 2015 test 2016 - 2020
trainDf = extractYears(datasetDf, 1995, 2015)
testDf = extractYears(datasetDf, 2016, 2020)
del datasetDf
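`extractYears` is imported from the project's `ModelBuilderMethods` module and its source is not shown here. A minimal stand-in, assuming it keeps rows whose `year` falls within an inclusive range, could look like the sketch below; `split_by_years` is a hypothetical name and the data is made up.

```python
import pandas as pd


def split_by_years(df, first_year, last_year):
    """Return the rows whose year lies in [first_year, last_year] (inclusive)."""
    mask = df["year"].between(first_year, last_year)
    return df.loc[mask].copy()


demo = pd.DataFrame(
    {"year": [2014, 2015, 2016, 2020], "downgrade": [False, True, False, True]}
)

train = split_by_years(demo, 1995, 2015)
test = split_by_years(demo, 2016, 2020)
print(len(train), len(test))
```

Splitting on the year boundary guarantees that no test-period rows leak into the training set.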

Here we drop the year since it is not a factor in predicting ergot. For example 2020 is a bigger number than 2010 but this has no bearing on the amount of ergot in the crop.

In [8]:
# drop year
trainDf = trainDf.drop(columns=["year"])
testDf = testDf.drop(columns=["year"])

# 4. Balancing the dataset  

Class imbalance is a common problem in machine learning where the total number of a class of data (positive) is far less than the total number of another class of data (negative). This problem is predominant in scenarios where anomaly detection is crucial like fraudulent transactions in banking, diagnosing medical conditions, or in spam filtering where the data is usually skewed. 

Traditional machine learning algorithms are often biased towards the majority class, not taking the data distribution into consideration. In the worst case, minority classes are treated as outliers and ignored. For some cases, such as fraud detection or cancer prediction, we would need to carefully configure our model or artificially balance the dataset, as minority classes can be of utmost importance.

Hence, the need for class balancers. Class balancing techniques basically help to balance our dataset in such a way that the model gets a balanced view of the classes. There are several ways to approach this issue:

1. [**Under-sampling**](https://imbalanced-learn.org/stable/references/under_sampling.html): The aim is to balance class distribution by randomly eliminating majority class examples. This is done until the majority and minority class instances are balanced out.

2. [**Over-sampling**](https://imbalanced-learn.org/stable/references/over_sampling.html): This is used when the quantity of data is insufficient. It tries to balance class distribution by randomly increasing minority class examples by replicating them.

3. [**Combination methods**](https://imbalanced-learn.org/stable/references/combine.html): These are a combination of over-sampling and under-sampling.

The library `imblearn` provides many methods to handle imbalanced datasets, such as the above-mentioned techniques.

It's important to remember that there's no guarantee that over-sampling or under-sampling will improve model performance, and these techniques may not always be appropriate. The choice to use balancing methods should be driven by the specific problem context and a thorough exploratory data analysis (EDA).
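To make random under-sampling and over-sampling concrete, here is a sketch that uses plain pandas sampling on toy data. It is a simplified stand-in for the idea, not the implementation used by `imblearn`; the dataframe and column values are invented.

```python
import pandas as pd

# Toy dataset: 8 negative rows and 2 positive rows.
demo = pd.DataFrame({"feature": range(10), "downgrade": [False] * 8 + [True] * 2})
majority = demo[~demo["downgrade"]]
minority = demo[demo["downgrade"]]

# Under-sampling: keep only as many majority rows as there are minority rows.
under = pd.concat([majority.sample(n=len(minority), random_state=42), minority])

# Over-sampling: draw minority rows with replacement until the classes match.
over = pd.concat(
    [majority, minority.sample(n=len(majority), replace=True, random_state=42)]
)

print(under["downgrade"].value_counts())
print(over["downgrade"].value_counts())
```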


Below we see the class imbalance in our training and test sets.  
trainDf  
False    122202  
True       2082  
  
testDf  
False    26307  
True      1016  


In [9]:
# pre balancing check
# print value counts downgrade
print(trainDf["downgrade"].value_counts())
print(testDf["downgrade"].value_counts())

downgrade
False    122202
True       2082
Name: count, dtype: int64
downgrade
False    26307
True      1016
Name: count, dtype: int64


Dealing with missing data is an important part of the data cleaning process, as missing data can lead to biased or incorrect results. Here are several strategies to handle missing data:

1. **Deleting Rows**: This method is commonly used to handle null values. We either delete a row if it has a null value for a particular feature, or delete a column if more than 70-75% of its values are missing. This method is advised only when there are enough samples in the dataset.

2. **Replacing with Mean/Median/Mode**: This strategy can be applied to a feature with numeric data, like the age of a person or the ticket fare. We calculate the mean, median, or mode of the feature and use it to replace the missing values. This is an approximation that can distort the feature's distribution, but it might yield better results than removing rows and columns. Replacing missing values with one of these three statistics is a statistical approach to handling them.

3. **Assigning a Unique Category**: If a categorical feature has missing values, then we can introduce a new category for those missing values. This will let the algorithm identify that there's something different about these instances.

4. **Predicting Missing Values**: In this case, we divide our data into two sets: one set with no missing values (it will be our training set), and another one with missing values (it will be our test set). We can now use methods like logistic regression or decision trees to predict and fill missing values.

5. **Using Algorithms Which Support Missing Values**: A few algorithms, such as XGBoost and some Random Forest implementations, can handle missing values without explicit preprocessing. They do so through internal mechanisms, which can save time.

6. **Imputation Using Multivariate Imputation by Chained Equation (MICE)**: This is a statistical technique for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion.

Remember, it's important to understand why data is missing before treating it, as data could be missing for a variety of reasons, and understanding the underlying reason will help you make the best decision.  
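The first three strategies can be compared on a toy dataframe. In this sketch the column names and values are invented, and `"missing"` is just an illustrative label for the extra category.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "temperature": [1.0, None, 3.0, 8.0],
        "district": ["a", None, "b", "a"],
    }
)

# Strategy 1: delete every row that contains a missing value.
dropped = demo.dropna()

# Strategy 2: replace missing numeric values with the column median.
imputed = demo.copy()
imputed["temperature"] = imputed["temperature"].fillna(
    imputed["temperature"].median()
)

# Strategy 3: give missing categorical values their own category.
imputed["district"] = imputed["district"].fillna("missing")

print(dropped)
print(imputed)
```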

Here we use the simplest method of dealing with rows that have some missing values, deletion.

In [10]:
# count nan
print(trainDf.isna().sum())
# set nan to 0
# trainDf = trainDf.fillna(0)

# drop nan
trainDf = trainDf.dropna()

downgrade                           0
1:min_dewpoint_temperature          0
1:min_temperature                   0
1:min_evaporation_from_bare_soil    0
1:min_skin_reservoir_content        0
                                   ..
district_4830                       0
district_4840                       0
district_4850                       0
district_4860                       0
district_4870                       0
Length: 687, dtype: int64


**SMOTEENN** is a method that combines over-sampling and under-sampling, using SMOTE (Synthetic Minority Over-sampling Technique) and ENN (Edited Nearest Neighbours) respectively.

1. **SMOTE**: It works by creating synthetic samples from the minority class instead of creating copies. The algorithm selects two or more similar instances (using a distance measure) and perturbs an instance one attribute at a time by a random amount within the difference to the neighboring instances.

2. **ENN**: It is an under-sampling technique that removes instances of the majority class that are misclassified by KNN (k-Nearest Neighbours) with k=3 (by default). It aims to clean overlapping areas between classes.

The SMOTEENN method is a two-step process:

- First, it over-samples the minority class by generating synthetic examples through SMOTE.

- Next, it cleans the data space resulting from over-sampling by removing the instances of the majority class that are misclassified by the ENN rule. This helps in pruning the instances of the majority class that are near the borderline and that invade the minority class space, causing difficulty in learning.

By combining over-sampling (adding instances) and under-sampling (removing instances), SMOTEENN tends to provide a well-balanced class distribution and better performance on imbalanced datasets.

However, it's important to keep in mind that no method is universally best for all imbalanced datasets. The appropriate method should be chosen based on the dataset characteristics and the learning algorithm being used.
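The interpolation idea behind SMOTE can be sketched in plain Python. This is a simplified illustration that uses one random factor per synthetic point and skips the nearest-neighbour search; it is not the code used by `imblearn`, and the function name `interpolate` is hypothetical.

```python
import random


def interpolate(sample, neighbour, rng):
    """Create one synthetic point on the segment between two minority samples."""
    gap = rng.random()  # random factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(sample, neighbour)]


rng = random.Random(42)
sample = [1.0, 10.0]      # a minority-class instance (toy values)
neighbour = [3.0, 14.0]   # one of its nearest minority-class neighbours
synthetic = interpolate(sample, neighbour, rng)
print(synthetic)
```

Because the synthetic point always lies between two real minority samples, it adds variety without simply duplicating existing rows.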


In [11]:
balancer = SMOTEENN(sampling_strategy=1, random_state=42)
balancedTrainDfX, balancedTrainDfY = balancer.fit_resample(
    trainDf.drop(columns="downgrade"), trainDf["downgrade"]
)

In [12]:
# post balancing check
# print value counts downgrade
print(balancedTrainDfY.value_counts())

downgrade
False    115239
True      25156
Name: count, dtype: int64


# 5. Normalization / Scaling / Regularization
Data preprocessing, which includes normalization, scaling, and regularization, plays a critical role in many machine learning models. Here's why these steps are important:

1. **Normalization**: Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information. Normalization is essential for algorithms that rely on the magnitude of the feature vector or if the scale of features is relevant, as in K-Nearest Neighbors (KNN), Neural Networks, or any algorithms using distance based or gradient descent based methods.

2. **Scaling**: Scaling is the process of converting an actual range of values into a standard range of values, typically in the interval [-1, 1] or [0, 1]. Some machine learning algorithms, like SVM or KNN, perform better when input numerical variables fall within a similar scale. In these algorithms, scaling can have a significant impact on the model's performance.

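In essence, these preprocessing techniques are crucial for certain machine learning models: they keep features with large ranges from dominating distance calculations and help gradient-based training converge. Tree-based models such as random forests split on one feature at a time and are largely insensitive to feature scale, which is why scaling is optional for the models in this notebook.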
 
Scikit-learn provides several scaler and transformer classes:
- [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)             
- [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html)  
- [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)  
- [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html)  
- [Normalizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html)  
- [PowerTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html)  
- [QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html)  

In [13]:
# scaler if needed
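If a scaler is needed, a common pattern is to fit it on the training data only and reuse the learned statistics on the test data. The sketch below shows this with `StandardScaler` on toy arrays; it is an illustration, not a step this notebook applies.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (toy values).
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std from training data
test_scaled = scaler.transform(test)        # reuse the training statistics

print(train_scaled)
print(test_scaled)
```

Fitting on the training set alone avoids leaking information about the test set into the model.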

# 6. Model Selection

Choosing the right model for a machine learning problem can be quite a task. You have to consider the problem at hand, the size and nature of your data, and the requirements of the project. Here are some general steps to guide you:

1. **Understand the Problem**: Firstly, identify whether the problem is a regression problem, a classification problem, a clustering problem, or something else. Different algorithms are better suited to different types of problems.

2. **Understand the Data**: Look at the data you have available. How many features do you have? How much data do you have? Is the data numeric, categorical, or a mix? Different algorithms are designed to handle different types of data. Also, some algorithms can handle lots of features or lots of examples better than others.

3. **Consider the Trade-offs**: Every model comes with trade-offs. Some models are simpler to interpret but may not have the highest accuracy. Other models may be more accurate but take a longer time to train. Some models handle categorical features well but not numerical features, or vice versa. Some can handle missing data, others can't. You'll need to consider what's important for your specific problem.

4. **Test Multiple Models**: Often the best way to choose a model is to try out several and see which one works best. You might start with a simple model like linear regression or k-nearest neighbors and then try more complex models like random forests or neural networks. Use cross-validation to get a reliable estimate of the model's performance.

5. **Tune the Model**: Once you've chosen a model, you can use techniques like grid search or random search to find the optimal hyperparameters for your model. This can often significantly improve the model's performance.

6. **Evaluate the Model**: Finally, once you have a model and have tuned its hyperparameters, you'll want to evaluate the model using an appropriate metric. This could be accuracy, precision, recall, F1-score, ROC-AUC, mean squared error, etc., depending on the problem and the business context.

Remember, there is no one-size-fits-all algorithm or model in machine learning. The best approach often depends on the specific problem, the available data, and the context.
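Steps 4 and 5 can be combined with `GridSearchCV`, which cross-validates every combination in a parameter grid. The sketch below uses a small synthetic dataset and a deliberately tiny grid so that it runs quickly; the grid values are illustrative and are not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, mildly imbalanced dataset (assumption for the example).
X, y = make_classification(
    n_samples=200, n_features=8, weights=[0.8, 0.2], random_state=42
)

param_grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 4]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid=param_grid,
    scoring="f1",  # F1 is more informative than accuracy on imbalanced data
    cv=3,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

For data with a time component, a time-aware splitter such as `TimeSeriesSplit` may be a better choice for the `cv` argument than the default folds.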

Below we show how to set up RandomForest classifier models using sklearn, imblearn, and xgboost library implementations.

In [16]:
ESTIMATORS = 400
DEPTH = 40
CORES = 10
MINSPLSPLIT = 8
MINSAMPLELEAF = 4

model_rf = RandomForestClassifier(
    n_estimators=ESTIMATORS,
    random_state=42,
    max_depth=DEPTH,
    n_jobs=CORES,
    min_samples_split=MINSPLSPLIT,
    min_samples_leaf=MINSAMPLELEAF,
)
model_nobalance_rf = RandomForestClassifier(
    n_estimators=ESTIMATORS,
    random_state=42,
    max_depth=DEPTH,
    n_jobs=CORES,
    min_samples_split=MINSPLSPLIT,
    min_samples_leaf=MINSAMPLELEAF,
)
balanced_model_rf = BalancedRandomForestClassifier(
    n_estimators=ESTIMATORS,
    random_state=42,
    max_depth=DEPTH,
    n_jobs=CORES,
    min_samples_split=MINSPLSPLIT,
    min_samples_leaf=MINSAMPLELEAF,
)
balanced_model_balanced_rf = BalancedRandomForestClassifier(
    n_estimators=ESTIMATORS,
    random_state=42,
    max_depth=DEPTH,
    n_jobs=CORES,
    min_samples_split=MINSPLSPLIT,
    min_samples_leaf=MINSAMPLELEAF,
)
model_xgbrf = XGBRFClassifier(
    n_estimators=ESTIMATORS,
    random_state=42,
    max_depth=DEPTH,
    n_jobs=CORES,
)
model_balance_xgbrf = XGBRFClassifier(
    n_estimators=ESTIMATORS,
    random_state=42,
    max_depth=DEPTH,
    n_jobs=CORES,
)

# 7. Model Training

Here we train each model on the training set using its `fit()` method.

In [17]:
model_nobalance_rf.fit(trainDf.drop(columns="downgrade"), trainDf["downgrade"])
model_rf.fit(balancedTrainDfX, balancedTrainDfY)
balanced_model_rf.fit(trainDf.drop(columns="downgrade"), trainDf["downgrade"])
balanced_model_balanced_rf.fit(balancedTrainDfX, balancedTrainDfY)
model_xgbrf.fit(trainDf.drop(columns="downgrade"), trainDf["downgrade"])
model_balance_xgbrf.fit(balancedTrainDfX, balancedTrainDfY)



# 7. Model Evaluation

We use the test set to evaluate our trained models.

In [18]:
# set nan to 0
# testDf = testDf.fillna(0)

# drop nan
testDf = testDf.dropna()

Here we generate the predictions for the test set for each model to help us calculate performance metrics later on.

In [19]:
# get predictions
predictions = model_rf.predict(testDf.drop(columns="downgrade"))
predictions_nobalance = model_nobalance_rf.predict(testDf.drop(columns="downgrade"))
predictions_balanced = balanced_model_rf.predict(testDf.drop(columns="downgrade"))
predictions_balanced_balanced = balanced_model_balanced_rf.predict(
    testDf.drop(columns="downgrade")
)
predictions_xgbrf = model_xgbrf.predict(testDf.drop(columns="downgrade"))
predictions_balance_xgbrf = model_balance_xgbrf.predict(
    testDf.drop(columns="downgrade")
)

Here we inspect the distribution of classes predicted by each model. Two of the models, the random forest and the XGBoost random forest trained on the unbalanced data, predict only one class: they always output False (no downgrade).

In [20]:
print(pd.DataFrame(predictions).value_counts())
print(pd.DataFrame(predictions_nobalance).value_counts())
print(pd.DataFrame(predictions_balanced).value_counts())
print(pd.DataFrame(predictions_balanced_balanced).value_counts())
print(pd.DataFrame(predictions_xgbrf).value_counts())
print(pd.DataFrame(predictions_balance_xgbrf).value_counts())

False    22907
True      4416
Name: count, dtype: int64
False    27323
Name: count, dtype: int64
False    26237
True      1086
Name: count, dtype: int64
False    25565
True      1758
Name: count, dtype: int64
0    27323
Name: count, dtype: int64
0    22679
1     4644
Name: count, dtype: int64
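The same check can be done programmatically, which helps when there are many models to compare. The snippet below is a minimal sketch rather than part of the original notebook: the helper name `summarize_predictions` and the toy prediction lists are hypothetical, and it uses only the standard library so it runs without the trained models.

```python
from collections import Counter


def summarize_predictions(y_pred):
    """Return (class counts, True if only one class was ever predicted)."""
    counts = Counter(y_pred)
    single_class = len(counts) == 1
    return counts, single_class


# Toy stand-ins for the model outputs (hypothetical data, not the notebook's).
toy_predictions = {
    "always_negative": [False] * 10,
    "mixed": [False] * 8 + [True] * 2,
}

for name, preds in toy_predictions.items():
    counts, single_class = summarize_predictions(preds)
    flag = "  <-- predicts a single class" if single_class else ""
    print(f"{name}: {dict(counts)}{flag}")
```

A model flagged this way has zero recall for the class it never predicts, no matter how high its accuracy is.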


Evaluating the performance of a machine learning model is an important part of any ML project. Performance metrics quantify how well the model predicts on data it has not seen. Some key metrics for classification are:

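1. **Accuracy**: Accuracy is the ratio of correctly predicted observations to the total number of observations. It is the most intuitive performance measure. However, accuracy is misleading with imbalanced classes: a model that always predicts the majority class can score highly while never detecting the minority class.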

    `Accuracy = (True Positives + True Negatives) / Total Observations`

2. **Precision**: Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. High precision means that few of the predicted positives are false positives.

    `Precision = True Positives / (True Positives + False Positives)`

3. **Recall (Sensitivity)**: Recall is the ratio of correctly predicted positive observations to all observations that are actually positive. It answers the question: when the true class is positive, how often does the model predict positive?

    `Recall = True Positives / (True Positives + False Negatives)`

4. **F1 Score**: The F1 Score is the harmonic mean of Precision and Recall, so it is high only when both are high. It reaches its best value at 1 (perfect precision and recall) and its worst at 0.

    `F1 Score = 2*(Recall * Precision) / (Recall + Precision)`

5. **ROC-AUC**: Receiver Operating Characteristic - Area Under Curve (ROC-AUC) measures classification performance across all decision thresholds. The ROC curve plots the true positive rate against the false positive rate as the threshold varies, and the AUC summarizes how well the model separates the classes. An AUC of 1 means perfect separation, while 0.5 is no better than random guessing. The higher the AUC, the better the model is at ranking actual positives above actual negatives.
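To make the formulas above concrete, the following sketch computes the first four metrics from raw confusion-matrix counts. The counts are an assumption: they are inferred from the output reported in this notebook for the "sk RF balanced train set" model (4,416 predicted positives, 1,016 actual positives, 26,307 actual negatives), not read from a confusion matrix the notebook itself prints.

```python
# Confusion-matrix counts inferred from the reported output (an assumption).
tp, fp = 156, 4260    # predicted positive: correct / incorrect
fn, tn = 860, 22047   # predicted negative: incorrect / correct

total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (recall * precision) / (recall + precision)

print(f"Accuracy:  {accuracy:.4f}")   # 0.8126
print(f"Precision: {precision:.4f}")  # 0.0353
print(f"Recall:    {recall:.4f}")     # 0.1535
print(f"F1:        {f1:.4f}")         # 0.0574
```

The example shows why accuracy alone is not enough here: about 81% of predictions are correct, yet fewer than 4% of the predicted positives are real positives.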


Below is a helper function to print model performance metrics.

In [None]:
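from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.metrics import precision_score, recall_score, roc_auc_score


def printMetrics(model_name, y_true, y_pred):
    # Print the standard classification metrics for one model's predictions.
    print(model_name)
    print("Accuracy: ", accuracy_score(y_true, y_pred))
    print("Precision: ", precision_score(y_true, y_pred))
    print("Recall: ", recall_score(y_true, y_pred))
    print("F1: ", f1_score(y_true, y_pred))
    # y_pred holds hard class labels, so this AUC reflects a single threshold.
    print("ROC AUC: ", roc_auc_score(y_true, y_pred))
    print("Classification Report: \n", classification_report(y_true, y_pred))
    print()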
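Note that `printMetrics` passes hard class labels to `roc_auc_score`. scikit-learn's `roc_auc_score` also accepts continuous scores, such as the positive-class column of `predict_proba`, and the two inputs can give different results. The sketch below illustrates the difference on toy data using a hypothetical pure-Python helper, `pairwise_auc`, which computes AUC as the fraction of (positive, negative) pairs ranked correctly. The helper, the data, and the 0.5 threshold are illustrative assumptions, not part of the notebook.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true labels and predicted probabilities.
y_true = [0, 0, 0, 1, 1]
scores = [0.10, 0.40, 0.45, 0.30, 0.90]
labels = [int(s >= 0.5) for s in scores]  # hard labels at a 0.5 threshold

auc_scores = pairwise_auc(y_true, scores)
auc_labels = pairwise_auc(y_true, labels)

# With hard labels, AUC collapses to the mean of the true positive rate
# and the true negative rate at that single threshold.
tpr = sum(1 for y, l in zip(y_true, labels) if y == 1 and l == 1) / y_true.count(1)
tnr = sum(1 for y, l in zip(y_true, labels) if y == 0 and l == 0) / y_true.count(0)

print(f"AUC from scores: {auc_scores:.4f}")
print(f"AUC from labels: {auc_labels:.4f}")
print(f"(TPR + TNR) / 2: {(tpr + tnr) / 2:.4f}")
```

If threshold-independent AUC is wanted, one option is to pass probability scores instead of `predict` output as the second argument.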

Print the performance metrics for each model on the test data.

In [21]:
printMetrics("sk RF balanced train set", testDf["downgrade"], predictions)
printMetrics("sk RF imbalanced train set", testDf["downgrade"], predictions_nobalance)
printMetrics("imb RF imbalanced train set", testDf["downgrade"], predictions_balanced)
printMetrics(
    "imb RF balanced train set", testDf["downgrade"], predictions_balanced_balanced
)
printMetrics("xgb RF imbalanced train set", testDf["downgrade"], predictions_xgbrf)
printMetrics(
    "xgb RF balanced train set", testDf["downgrade"], predictions_balance_xgbrf
)

sk RF balanced train set
Accuracy:  0.8126120850565458
Precision:  0.035326086956521736
Recall:  0.15354330708661418
F1:  0.05743740795287187
ROC AUC:  0.49580461055094766
Classification Report: 
               precision    recall  f1-score   support

       False       0.96      0.84      0.90     26307
        True       0.04      0.15      0.06      1016

    accuracy                           0.81     27323
   macro avg       0.50      0.50      0.48     27323
weighted avg       0.93      0.81      0.86     27323


sk RF imbalanced train set
Accuracy:  0.9628152106284082
Precision:  0.0
Recall:  0.0
F1:  0.0
ROC AUC:  0.5
Classification Report: 
               precision    recall  f1-score   support

       False       0.96      1.00      0.98     26307
        True       0.00      0.00      0.00      1016

    accuracy                           0.96     27323
   macro avg       0.48      0.50      0.49     27323
weighted avg       0.93      0.96      0.94     27323


imb RF imbala


Recall:  0.1377952755905512
F1:  0.04946996466431096
ROC AUC:  0.48329304586156974
Classification Report: 
               precision    recall  f1-score   support

       False       0.96      0.83      0.89     26307
        True       0.03      0.14      0.05      1016

    accuracy                           0.80     27323
   macro avg       0.50      0.48      0.47     27323
weighted avg       0.93      0.80      0.86     27323


