<a href="https://colab.research.google.com/github/MaQuest/RE2022_Tutorial/blob/main/4.%20Working%20with%20data%20and%20ML-with%20practice%20blank%20spots.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Overview of the Course

Here is a brief overall view of the tutorial:

![image](https://github.com/mitramir55/REConference/blob/main/Tutorial/3%20Preprocessing%20-%20Mitra/new%20cover-01-01.jpg?raw=true)


Our goal in this tutorial is to first, get to know the data, and then classify the text into either enhancement or task. For that, we'll use the "summary" column for the input and the "type" column for our label.


Brief description of the columns:

* **Bug ID**: The numeric id of a bug, unique within this entire installation of Bugzilla.


* **Type**: This field describes the type of a bug.

 * **enhancement**: new feature, improvement in UI, performance, etc. and any other request for user-facing changes and enhancements in the product; not engineering changes
 * **task**: refactoring, removal, replacement, enabling or disabling of functionality and any other engineering task
 * **defect**: (we have filtered this field out) regression, crash, hang, security vulnerability and any other general issue.

* **Summary**: The bug summary is a short sentence which succinctly describes what the bug is about.

* **Priority**: This field describes the importance and order in which a bug should be fixed compared to other bugs. This field is utilized by the programmers/engineers/release managers/managers to prioritize the work to be done. 

* **Severity**: This field describes the impact of a bug. 

For more information about columns please go to [this link](https://wiki.mozilla.org/BMO/UserGuide/BugFields).

### Importing the packages

In [1]:
import pandas as pd
import numpy as np

In order to visualize dataframes better, we tweek a few default settings of a Jupyter notebook.

In [2]:
from termcolor import colored
pd.options.mode.chained_assignment = None

# we specify for the columns to be shown all together without truncation
pd.set_option('display.max_columns', None)

# Row
pd.set_option('display.max_rows', 10)


%matplotlib inline

### Import the file

In [3]:
!git clone https://github.com/MaQuest/RE2022_Tutorial.git

Cloning into 'RE2022_Tutorial'...
remote: Enumerating objects: 81, done.[K
remote: Counting objects: 100% (81/81), done.[K
remote: Compressing objects: 100% (74/74), done.[K
remote: Total 81 (delta 32), reused 20 (delta 4), pack-reused 0[K
Unpacking objects: 100% (81/81), done.


In [24]:
# path to the data
ALLData_FILE_PATH = r"/content/RE2022_Tutorial/Data/AllData.csv"

In [25]:
all_data = pd.read_csv(ALLData_FILE_PATH)


Columns (7,8,9,10,11,13,16,18,19,26,27,28,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,70,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,96,98,99,101,102,103,104,105,106,107,108,109,110,111,112,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,139,140,145,146,147,148,149,150,151,152,153,154,155,156,169) have mixed types.Specify dtype option on import or set low_memory=False.



In [26]:
# print all the columns in the file


In [27]:
# take a look at the data



In [28]:
# remove the redundant column
all_data.drop(columns=['Unnamed: 0'], inplace=True)

# drop duplicate ids
all_data.drop_duplicates(subset='id', inplace=True)

# drop columns that only contain null
all_data.dropna(axis=1, inplace=True, how='all')

### Data Exploration

First, we take a look at the data frame size. Then we realize that we don't need all the 198 columns, so we choose only the ones that can be helpful in understanding the data.

However, it is always necessary to look at the documentation for the data from the source website to understand it better

In [29]:

print(f'There are {...} records and {...} columns in this dataset.')

There are Ellipsis records and Ellipsis columns in this dataset.


In [30]:
df = all_data[[
        'component', 'platform',  'assigned_to', 
        'summary', 'id', 'status', 'is_confirmed', 'votes',
        'severity', 'keywords', 'priority', 'creator', 'type'
        ]]

In [None]:
print(f'There are {...} records and {...} columns in this dataset.')

There are 4215 records and 13 columns in this dataset.


### Visualize the data to understand it better

In the first part we try to categorize the data into groups and count their frequency. These categories will be placed in a data frame along with the number of times each one was seen.

Here is an overview of what the functions in the next cells will do:


In [31]:
# for plots
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
import plotly.tools as tls
py.init_notebook_mode(connected=True)
%matplotlib inline

In [32]:
# choose a column
col = 'platform'

# create a new dataset by grouping the categories, then choose id to show the count
categories_count = df.groupby(by=col).count().id.sort_values(ascending=False)

# by resetting the index, we create a df
categories_count_df = categories_count.reset_index()

categories_count_df

Unnamed: 0,platform,id
0,Unspecified,2727
1,All,1228
2,x86,138
3,x86_64,45
4,ARM64,26
5,Other,18
6,Desktop,14
7,ARM,10
8,PowerPC,8
9,HP,1


In [33]:
# rename the column to count
categories_count_df. ...
categories_count_df

Unnamed: 0,platform,count
0,Unspecified,2727
1,All,1228
2,x86,138
3,x86_64,45
4,ARM64,26
5,Other,18
6,Desktop,14
7,ARM,10
8,PowerPC,8
9,HP,1


In [34]:
def create_grouped_df(df, col):
    """
    creates a dataset for counting categories in one field of a df
    """
    series = df.groupby(by=col).count().id.sort_values(ascending=False).reset_index()
    df_ = pd.DataFrame(series)
    df_.rename(columns={'id': 'count'}, inplace=True)

    return df_

In [35]:
def create_plot(df, col):

    # create the dataset for the visualization
    df_ = create_grouped_df(df, col=col)

    # start plotting
    fig = px.bar(df_, x=col, y='count', text='count',
     labels={'x': col, 'y': 'frequency'}, color='count',
     color_continuous_scale='Agsunset_r')

    fig.update_traces(
        texttemplate='%{text:.2f}', textposition='outside',
        textfont_size=10)

    # Change the bar mode
    fig.update_layout(title_text=f'Field {col} types count')
    fig.show(renderer="colab")

Many fields have more than 20 categories which can make visualization and depicting the values challenging. To prevent the plot from getting too messy and illegible, we filter the columns with less than or equal to 20 categories.

In [36]:
# a list of columns we want to visalize
visualization_cols = []

# start filtering
for col in df.columns:

    # select the unique values
    unique_values = df.loc[:, col].unique()

    if len(unique_values)<20 and len(unique_values)>1:
        print('column: ', colored(f'{col}', 'green'))
        print(unique_values)
        visualization_cols.append(col)

column:  [32mplatform[0m
['All' 'x86' 'PowerPC' 'Unspecified' 'x86_64' 'HP' 'ARM' 'Other' 'Desktop'
 'ARM64']
column:  [32mstatus[0m
['NEW' 'ASSIGNED' 'REOPENED' 'UNCONFIRMED' 'RESOLVED']
column:  [32mis_confirmed[0m
[ True False]
column:  [32mseverity[0m
['normal' 'critical' 'minor' 'blocker' 'major' 'trivial']
column:  [32mpriority[0m
['P3' '--' 'P5' 'P2' 'P4' 'P1']
column:  [32mtype[0m
['enhancement' 'task']


In [37]:
# loop over the columns and visualize them
for col in visualization_cols:
    create_plot(df, col=col)

### Working with textual data
![](https://github.com/mitramir55/REConference/blob/main/additional%20resources/Pages%20from%20Introduction%20to%20NLP-2.jpg?raw=true)
* Tokenization
* Taking a closer look at the summary data
* Plotting the distribution of words with PlotlyExpress

We'll start by looking at the summary column. As we discussed earlier, first, we'll tokenize the text for a more accurate cleaning, then, based on our task, we choose a couple of cleaning functions one after another.



### Tokenization

To see the data more clearly, we'll filter out the columns we want to use for model creation.

In [38]:
# filter out summary and type
df = ...
df.head()

Unnamed: 0,summary,type
0,Proxy: map HTTP 500 errors to necko errors (so...,enhancement
1,[LDAP] Access to a local LDAP server in Off-Li...,enhancement
2,Warnings for USENET etiquette errors required ...,enhancement
3,URL linkifying code should cross linebreaks [l...,enhancement
4,Automatically update bookmarks when sites move...,enhancement


In [39]:
# for handling text
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

import regex as re

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [41]:
def tokenize_text(df, input_col='summary', output_col="summary_tokens_list"):
    """
    takes a dataset and name of a column
    then tokenizes the text in the column of that dataset
    """
    # create a series that is the tokenized version of the input_col
    tokenized_series = ...
    df.loc[:, output_col] = tokenized_series
    return df
    
df = tokenize_text(df)

In [43]:
df.head(2)

Unnamed: 0,summary,type,summary_tokens_list
0,Proxy: map HTTP 500 errors to necko errors (so...,enhancement,"[Proxy, :, map, HTTP, 500, errors, to, necko, ..."
1,[LDAP] Access to a local LDAP server in Off-Li...,enhancement,"[[, LDAP, ], Access, to, a, local, LDAP, serve..."


### Data Cleaning

These are the functions and the road map we follow for cleaning this dataset:

 1. Lowercase the text
 2. remove the numbers
 3. Remove the stopwords
 4. Remove the punctuation

When we go over all the steps, we append the text to a new list and then attach it to the main dataset.


In [None]:
stop_words = set(stopwords.words('english'))

def remove_stopwords(list_of_words):
    filtered_list = [w for w in list_of_words if not w in stop_words]
    return filtered_list

def check_punct(list_of_words):
    """
    look at the tokenized text. if the token was matched with ( or was only)
    a puncutation, it is redundant and we remove it.
    """
    filtered_list = []
    for word in list_of_words:
        if re.match("[()!><.,`?':\-\[\]_@]", word):
            pass
        else: filtered_list.append(word)
        
    return filtered_list

def remove_numbers(list_of_words):

    filtered_list = []
    for word in list_of_words:
        if re.findall("[0-9]", word):
            pass
        else: filtered_list.append(word)
        
    return filtered_list


def clean(df, input_col='summary_tokens_list', output_col="cleaned_word_list"):
    
    """
    takes a column of textual data 
    outputs a df with cleaned text attached
    """
    text_series = df.loc[:, input_col]
    cleaned_text_series = []
    
    for text in text_series: 
        # go through every record of this dataset
        # then append the cleaned text to a new list
        cleaned_text = remove_numbers(text)
        cleaned_text = remove_stopwords(cleaned_text)
        cleaned_text = check_punct(cleaned_text)
        cleaned_text = [t.lower() for t in cleaned_text]
        
        cleaned_text_series.append(cleaned_text)
        
    df.loc[:, output_col] = cleaned_text_series
    
    return df

df = clean(df)

In [48]:
df.head(2)

Unnamed: 0,summary,type,summary_tokens_list,cleaned_word_list
0,Proxy: map HTTP 500 errors to necko errors (so...,enhancement,"[Proxy, :, map, HTTP, 500, errors, to, necko, ...","[proxy, map, http, errors, necko, errors, inte..."
1,[LDAP] Access to a local LDAP server in Off-Li...,enhancement,"[[, LDAP, ], Access, to, a, local, LDAP, serve...","[ldap, access, local, ldap, server, off-line, ..."


Let's see how long each document tokens list is.

In [52]:
def add_token_length_to_df(df, col='cleaned_word_list'):
    
    # calculate the length of each list 
    length_of_tokens = ...
    df.loc[:, 'tokens_list_length'] = length_of_tokens
    return df

df = add_token_length_to_df(df)
df.head(2)

Unnamed: 0,summary,type,summary_tokens_list,cleaned_word_list,tokens_list_length
0,Proxy: map HTTP 500 errors to necko errors (so...,enhancement,"[Proxy, :, map, HTTP, 500, errors, to, necko, ...","[proxy, map, http, errors, necko, errors, inte...",12
1,[LDAP] Access to a local LDAP server in Off-Li...,enhancement,"[[, LDAP, ], Access, to, a, local, LDAP, serve...","[ldap, access, local, ldap, server, off-line, ...",7


In [56]:
fig = px.histogram(df, x="tokens_list_length", color="type",
                    marginal="box",
                   title='Histogram of summary tokens count',
                   labels={'tokens_list_length':'count of tokens list'}, # can specify one label per df column
                   opacity=0.8,)
fig.show(renderer="colab")

The goal of the following cell is to create a dataset of all the words in our dataset and count them as they appear in the records.

In [57]:
def count_words(tokens_column_series):
    """
    gets a series of tokens lists and counts the tokens
    output: a sorted dict
    {"word1": 105, "word2": 67, ...}
    note: you can also use a bag of words package to do this
    """
    count_dict = {}
    for tokens_list in tokens_column_series:
        for word in tokens_list:
            try: count_dict[word] +=1
            except: count_dict[word] = 1

    # sort 
    sorted_count_dict = {k:v for k,v in sorted(count_dict.items(), key=lambda item: item[1], reverse=True)}
    
    return sorted_count_dict 

def get_n_key_and_value(n, dict_):

    """
    get the first - most frequent and important -
    words of dictionary 
    """
    keys = [k for (k, v) in dict_.items()][:n]
    values = [v for (k, v) in dict_.items()][:n]

    return keys, values

def convert_tokens_list_to_freq_df(tokens_arrays, n=-1):
    """
    gets the array of tokenized sentences
    output: a sorted data frame with two cols
    the words and their frequency
    """

    sorted_count_dict = count_words(tokens_arrays)
    keys, values = get_n_key_and_value(n, sorted_count_dict)

    df = pd.DataFrame({'words': keys, 'freq': values})

    return df

    
df_words_freq_task = convert_tokens_list_to_freq_df(df[df.type=='task'].cleaned_word_list, n=-1)
df_words_freq_enh = convert_tokens_list_to_freq_df(df[df.type=='enhancement'].cleaned_word_list, n=-1)


In [59]:
df_words_freq_enh

Unnamed: 0,words,freq
0,meta,704
1,implement,283
2,add,278
3,support,236
4,use,154
...,...,...
5274,distinguish,1
5275,hacl,1
5276,manager|,1
5277,pad,1


In [58]:
# take a look at what the data frames look like
df_words_freq_task

Unnamed: 0,words,freq
0,meta,245
1,history,105
2,room,98
3,add,58
4,remove,54
...,...,...
2038,non-mozglue,1
2039,de,1
2040,serializations,1
2041,plawless,1


Because they have different number of records, and to have a fair comparison, we use percentage as a way of measurement

In [60]:
def take_percentage(df, col='freq'):

    sum = df.loc[:, 'freq'].sum()
    df.loc[:, 'freq'] = df.loc[:, 'freq'].apply(lambda x: x/sum)
    df.rename(columns={'freq': 'freq_ratio'}, inplace=True)
    return df

df_words_freq_enh = take_percentage(df_words_freq_enh)
df_words_freq_task = take_percentage(df_words_freq_task)

In [61]:
df_words_freq_enh.head()

Unnamed: 0,words,freq_ratio
0,meta,0.034417
1,implement,0.013835
2,add,0.013591
3,support,0.011538
4,use,0.007529


We merge the datasets for overall analysis and visualization. For this, we rename columns to distinguish each set when merging, but we need to keep the words field the same because we want them to be merged on it.

In [62]:

task_word_freq_df = df_words_freq_enh.add_suffix('_enh')
task_word_freq_df.rename(columns={'words_enh': 'words'}, inplace=True)

enh_word_freq_df = df_words_freq_task.add_suffix('_task')
enh_word_freq_df.rename(columns={'words_task': 'words'}, inplace=True)

merged_df = pd.merge(enh_word_freq_df, task_word_freq_df)

In [63]:
merged_df

Unnamed: 0,words,freq_ratio_task,freq_ratio_enh
0,meta,0.052261,0.034417
1,history,0.022398,0.000782
2,room,0.020904,0.000098
3,add,0.012372,0.013591
4,remove,0.011519,0.006062
...,...,...,...
1153,embedded,0.000213,0.000244
1154,reset,0.000213,0.000293
1155,proposed,0.000213,0.000049
1156,property,0.000213,0.000733


In [64]:
# a function to see the difference in words in tasks and enhancement

n = 30
merged_df_sample = merged_df[:n]
fig = go.Figure(data=[
    go.Bar(name='Enhancement', x=merged_df_sample.words, y=merged_df_sample.freq_ratio_enh, text=merged_df_sample.freq_ratio_enh, marker_color='#BA0F30'),
    go.Bar(name='Task', x=merged_df_sample.words, y=merged_df_sample.freq_ratio_task, text=merged_df_sample.freq_ratio_task, marker_color='#98D7C2')
])


fig.update_xaxes(tickangle= -45)  
fig.update_traces(
    texttemplate='%{text:.2f}', textposition='outside',
     textfont_size=8)

# Change the bar mode
fig.update_layout(barmode='group', title_text='Comparing the the most frequent words in enhancement and task types')
fig.show(renderer="colab")

### Classification

Step .1 Preparing the input

"Term Frequency — Inverse Document Frequency" is referred to as "TF-IDF". This method first, counts the number of words in a collection of documents. Then, count the number of documents that contain this word. Each word is typically given a score to indicate how important it is to the document and corpus. We frequently use TFIDF vectors to classify, rank, or cluster documents or evaluate words and topics.



* t — term (word)
* d — document (set of words)
* N — count of corpus
* corpus — the total document set

`TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)`

`tf(t,d) = count of t in d / number of words in d`

`df(t) = occurrence of t in N documents`

For instance, a word like "is," which is quite common, may appear several times in a document. So, to prevent words such as stop words have high scores and determinant factors, we perform normalization on the frequency value. We divide the frequency with the total number of words in the document. For more explanation you can see [this blog](https://towardsdatascience.com/tf-idf-a-visual-explainer-and-python-implementation-on-presidential-inauguration-speeches-2a7671168550).

For starting, we first, join all the cleaned tokens in each record and feed it as a span of text to TFIDF vectorizer.

In [None]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

In [None]:
# we join the cleaned tokens to feed the tfidf algorithm

def join_tokens(df, tokens_arrays_col):
    return [" ".join(df.loc[i, tokens_arrays_col]) for i in range(len(df))]

cleaned_tokens_series = join_tokens(df, 'cleaned_word_list')
df.loc[:, 'cleaned_text'] = cleaned_tokens_series

In [None]:
tf_idf = ...

#applying tf idf to training data
X = ...

In [None]:
X_tfidf_df = pd.DataFrame(X)
X_tfidf_df.columns = tf_idf.get_feature_names_out()

# we have to change None to a small number to see a concatenated version of the matrix
pd.set_option('display.max_columns', 20)


X_tfidf_df

The above dataset shows you how we create TFIDF vectors, but to see which words are given a high score (and are deemed more important), run the following cell and take a look at the output.

In [None]:
# visualize and list the most important words from TFIDF's view
def get_tfidf_words_and_array(text_arrays):

    vectorizer = TfidfVectorizer()
    transformed_data = vectorizer.fit_transform(text_arrays).toarray()
    words = vectorizer.get_feature_names_out()
    
    return transformed_data, words

def create_tfidf_df(text_arrays):
    """
    gets the df, converts it into tfidf arrays and words
    then puts them in a dataset
    """

    transformed_data, words = get_tfidf_words_and_array(text_arrays)

    df = pd.DataFrame(data=transformed_data, columns=words).sum().reset_index()

    col_names = ['words', 'tfidf_score_sum']
    default_col_names = df.columns

    # rename whatever the df cols are called to the col_names
    df.rename(columns={default_col_names[i]:col_names[i] for i in range(len(col_names))}, inplace=True)

    df.sort_values(by='tfidf_score_sum', inplace=True, ascending=False)

    return df

    
tfidf_scores = create_tfidf_df(df.loc[:, 'cleaned_text'])
tfidf_scores

Step 2. Creating a model that uses the vectors to classify text

<center>
<img src="https://github.com/mitramir55/REConference/blob/main/additional%20resources/3_Page_15.png?raw=true"></center>

In general, Scikit-learn makes machine learning model training simple and generally follows the same pattern. Here we'll use Naive Bayes to classify our vectors. These are the step we'll have to follow:


1. initialize an instance of the model
2. fit the model to the train data
3. predict the test data 
4. compute scores for both train and test sets.


To put it very briefly, it is worth saying that NB basically applies the Bayesian rule to calculate the probability or likelihood that a set of words (document) belongs to a class. To calculate this probability it assumes that all the words are independent or, in other words, the location of words in a sentence is completely random . (e.g., program is independent of enhancement) 

To learn more about Naive Bayes you can go to [this link](https://towardsdatascience.com/training-a-naive-bayes-model-to-identify-the-author-of-an-email-or-document-17dc85fa630a).


In [None]:
# import the model
from sklearn.naive_bayes import GaussianNB

# instantiate the class
nb = ...

# get the 
y = ...

#### Evaluation

In order to understand the prediction capability of our model, avoid overfitting, and not misjudge its classification accuracy, we use an evaluation technique called cross validation (CV). In this method, we separate the dataset into n parts, train the model on n-1 parts, and test it on the one part that's been held out.


![](https://github.com/mitramir55/REConference/blob/main/additional%20resources/1920px-K-fold_cross_validation_EN-01-01.png?raw=true)

#### Stratified KFold: 

Stratified kfold cross validation is an extension of regular kfold cross validation but specifically for classification problems where rather than the splits being completely random, the ratio between the target classes is the same in each fold as it is in the full dataset.

Read more [here](https://machinelearningmastery.com/k-fold-cross-validation/) and [here](https://towardsdatascience.com/how-to-train-test-split-kfold-vs-stratifiedkfold-281767b93869).

![](https://github.com/mitramir55/REConference/blob/main/Tutorial/3%20Preprocessing%20-%20Mitra/stratified%20cv.png?raw=true)

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

We evaluate the performance of our model by calculating accuracy and the weighted average F1, precision, and recall of the two groups.
<center>
<img src="https://miro.medium.com/max/1400/1*pOtBHai4jFd-ujaNXPilRg.png" alt="scikit-learn" style="width:50%"></center>
To learn more about SKlean's evaluation packages, visit:

 * [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
 * [precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html)
 * [recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html)

In [None]:
def evaluate(X_test, y_test, model):
        
    y_pred = model.predict(X_test)
    print('Accuracy: ', accuracy_score(y_test, y_pred))
    print('F1 score: ', f1_score(y_test, y_pred, average="weighted"))
    print('Precision: ', precision_score(y_test, y_pred, average="weighted"))
    print('Recall: ', recall_score(y_test, y_pred, average="weighted"))  

In [None]:
def train_and_evaluate(X, y, model, fold_n = 1):

    """
    gets X and y, then splits them into test, train in n folds
    """
    # create a stratified kfold with 3 splits.
    # Make sure it shuffles the data
    skf = ...

    fold_n = 1

    for train_index, test_index in skf.split(X, y):

        # instantiate the class
        model_ = ...

        print(colored(f'\nFold, {fold_n}', 'green'))
        print("Training set :", train_index, "Testing set:", test_index)

        # define the x and y with indices
        X_train, X_test = ..., ...
        y_train, y_test = ..., ...
        
        # fit the model to x_train and y_train
        ...
       
       # evaluate the model with x_test and y_test


        fold_n+=1


In [None]:
train_and_evaluate(X, y, model=GaussianNB, fold_n = 1)

Just to show you how a more complex model would work, we'll take a look at Random Forest Classifier. However, a more powerful algorithm might take more time to converge and reach an answer. Run the following cell to see how it works.

In [None]:
train_and_evaluate(X, y, model=RandomForestClassifier, fold_n = 1)

References:

https://www.kaggle.com/code/mitramir5/text-eda-rnn-lstm-gru-and-many-more


https://towardsdatascience.com/training-a-naive-bayes-model-to-identify-the-author-of-an-email-or-document-17dc85fa630a

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html


https://wiki.mozilla.org/BMO/UserGuide/BugFields

https://www.redmine.org/projects/redmine/wiki/RedmineIssues

https://www.analyseup.com/python-machine-learning/stratified-kfold.html


Thanks for your attention!