# Part 03: Review Classification (Machine Learning)


## Introduction 

In this notebook, we apply machine learning techniques to classify Amazon reviews for the "Miracle Noodle Zero Carb, Gluten Free Shirataki Pasta, Spinach Angel Hair" product into high and low sentiment categories. By building and evaluating various machine learning models, we aim to develop an accurate classifier that can predict the sentiment of future reviews.

#### Objectives

1. **Load and Explore Data**:
   - Load the processed reviews and metadata from a JSON file.
   - Perform exploratory data analysis to understand the structure and characteristics of the dataset.

2. **Analyze Review Length**:
   - Compare the length of reviews between high and low sentiment categories.
   - Visualize the differences using bar plots.

3. **Prepare Data for Modeling**:
   - Split the dataset into training, validation, and test sets.
   - Encode target labels into numerical values.
   - Ensure class balance by performing under-sampling if necessary.

4. **Train and Evaluate Models**:
   - Train several machine learning models, including Naive Bayes, Logistic Regression, Random Forest, and SVC.
   - Perform hyperparameter tuning using GridSearchCV to find the best parameters for each model.
   - Evaluate models using classification metrics and confusion matrices.

5. **Save the Best Model**:
   - Save the best-performing model for future use and deployment in a Streamlit app.

By the end of this notebook, we will have an effective machine learning classifier capable of predicting the sentiment of Amazon reviews. This classifier will help in understanding customer opinions and improving product features based on sentiment analysis.


In [None]:
from IPython.display import display, Markdown
with open("data/Amazon Product Reviews.md") as f:
    info = f.read()

display(Markdown(info))

# Loading Imports

In [None]:
# !pip install wordcloud

In [None]:
import matplotlib.pyplot as plt
import missingno
import matplotlib as mpl
import seaborn as sns
import numpy as np
import pandas as pd
import os, json, joblib
from pprint import pprint
pd.set_option("display.max_columns",50)
from nltk.probability import FreqDist
from wordcloud import WordCloud
pd.set_option('display.max_colwidth', 200)

In [None]:
!pip install -U dojo_ds -q
import dojo_ds as ds
# ds.__version__

# Import Processed Data

We created several versions of our text in the previous lecture. We will load those again to use for analysis and classification.

In [None]:
FPATHS_FILE = "config/filepaths.json"
import json
with open(FPATHS_FILE) as f:
    FPATHS = json.load(f)
FPATHS.keys()

In [None]:
fpath_json = FPATHS['data']['processed-nlp']['processed-reviews-with-target_json']
fpath_json

In [None]:

df = pd.read_json(fpath_json)#joblib.load('data/processed/processed-reviews.joblib')
df.info()
df.head()

## Classification Target

In [None]:
# sns.histplot

In [None]:
# Checking what values are in the overall ratings
df['target-rating'].value_counts(1)

In [None]:
# # Should return nothing
# create_groups(4.9)

In [None]:
# ## Use the function to create a new "rating" column with groups
# df['rating'] = df['overall'].map(create_groups)
# df['rating'].value_counts(dropna=False)

In [None]:
# # target_map = {1.0:'low',
# #              2.0:'low',
# #               # 3.0:'mid',
# #               # 4.0:'high',
# #              5.0:'high'}
# df['rating'] = df['overall'].map(create_target)
# df['rating'].value_counts(dropna=False)

### Class Balance

It's always a good idea to check the class balance when creating a classification model.  This can affect modeling bias and interpretation of metrics.

In [None]:
## Check class balance of 'target-rating'
df['target-rating'].value_counts(normalize=True)

There are more neutral Ratings than any other class, but high and low Ratings are pretty balanced.
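To see why class balance matters when reading metrics, here is a minimal sketch using made-up labels (the `labels` Series is hypothetical and only stands in for `df['target-rating']`). It shows the accuracy a model would reach by always predicting the most common class, which is a useful floor to compare real models against.

```python
import pandas as pd

# Hypothetical labels standing in for df['target-rating'] (not the real data)
labels = pd.Series(["high"] * 6 + ["low"] * 3 + [None])

# Proportion of each class; missing labels are dropped by default
balance = labels.value_counts(normalize=True)
print(balance)

# Accuracy of a model that always predicts the most common class
majority_baseline = balance.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

If a real model's accuracy is close to this number, it may simply be echoing the majority class.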

### Length

One easy way to compare texts is to compare their lengths. We could count characters or tokens. In this case we will count the space-separated words in each review and compare them across sentiment classes.

In [None]:
df['length'] = df['review-text-full'].map(lambda x: len(x.split(" ")))
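The choice of length measure changes the numbers, so here is a small sketch on a made-up review string (hypothetical, not from the dataset). It compares character count, `split(" ")`, and `split()`; the last two differ whenever a review contains repeated spaces.

```python
# Hypothetical review text; note the double space after "Great"
text = "Great  noodles, low carb"

n_chars = len(text)                    # characters, including spaces
n_split_space = len(text.split(" "))   # an empty string appears between the two spaces
n_split_any = len(text.split())        # splits on any run of whitespace

print(n_chars, n_split_space, n_split_any)
```

Splitting on a single space, as in the cell above, can slightly overcount words in reviews with irregular spacing.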

In [None]:
## Visualize mean review length per class (seaborn's default estimator is the mean)
# high_len = high['length'].median()
# low_len = low['length'].median()

ax = sns.barplot(data=df, x='target-rating',
                 y='length', errorbar=('ci',68),#estimator='median',
                );

# Show plot before print statement
plt.show()
# print(f' The median character length for {low_len} for low Ratings and {high_len} for high ratings.')

# Save figure
fig = ax.get_figure()

> - ***Add saving image length to Part 00 config***

In [None]:
# # Create a folder for images
# import os
# os.makedirs("images/", exist_ok=True)

# fig.savefig('images/barplot-compare-lengths.png', transparent=False, dpi=300, bbox_inches='tight')

We aren't seeing a big difference here in the lengths of Ratings. Both high and low seem to average about 70 words (length was computed by splitting on spaces). Neutral Ratings seem to be somewhat shorter on average.

These are more informative!  We see words like 'good', 'love', 'hope', 'great', and 'nice' in the highs and 'miss', 'work', 'sad', 'bad', and 'hate' in lows.

While this may not be surprising with high and low sentiments, if we were classifying other kinds of text, like spam emails or patent applications, it could be very enlightening!

## Preparing Data for Modeling


In [None]:
## Import Modeling Package
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

### Train Test Split
We created null values in our target column, so we will drop the null values from the rating column before making our X and y data.

In [None]:
# Create a df_ml without null ratings
df_ml = df.dropna(subset=['target-rating'])
df_ml.isna().sum()

In [None]:
## X - Option A)  lemmas
# def join_tokens(token_list):
#     joined_tokens = ' '.join(token_list)
#     return joined_tokens
# X = df_ml['spacy_lemmas'].apply(join_tokens)

# X - Option B) original raw text
X = df_ml['review-text-full']

# y - use our binary target 
y = df_ml['target-rating']
X.head(10)

In [None]:
y.value_counts(normalize=True)

### Validation Split

We will perform a triple split. That way we can tune our model to a validation set and test the final version on a test set. This avoids tuning our model to perform well on just one set of testing data and gives us a better idea of how our model will perform on new data.

In [None]:
## Train test split
X_train_full, X_test, y_train_names_full, y_test_names = train_test_split(X,y,test_size=0.25,
                                                    random_state=321)
X_train_full.head()

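- The commented-out cell below splits the data into 70% train, 15% validation, and 15% test; that split will be used in the deep NLP codealongs next class. This notebook uses the 75/25 train/test split above.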

In [None]:
# ## Split data into train, test, val
# X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=.3, random_state=42)
# X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=.5, random_state=42)
# (len(X_train_full), len(X_val), len(X_test))
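As a rough sketch of the triple split described above, the following uses made-up data (the `X`, `y` lists and the `X_hold` name are hypothetical). It also passes `stratify`, which the notebook's own split does not use, so that each subset keeps the same class proportions.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 short texts with a balanced binary label
X = [f"review {i}" for i in range(100)]
y = [i % 2 for i in range(100)]

# First split: 70% train, 30% held out
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Second split: divide the held-out 30% evenly into validation and test
X_val, X_te, y_val, y_te = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=42, stratify=y_hold
)

print(len(X_tr), len(X_val), len(X_te))
```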

In [None]:
# Check class balance
y_train_names_full.value_counts(normalize=True)

### Label Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

# Instantiate label encoder
encoder = LabelEncoder()

# Fit and transform the training target
y_train_full = pd.Series(encoder.fit_transform(y_train_names_full), 
                    index=y_train_names_full.index, 
                    name=y_train_names_full.name)

# Transform the test target with the already-fitted encoder (no refit)
y_test = pd.Series(encoder.transform(y_test_names), index=y_test_names.index, 
                    name=y_test_names.name)

y_train_names_full[:10]
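To make the encoding concrete, here is a minimal sketch assuming the target labels are the strings `'high'` and `'low'` (the tiny Series below are hypothetical). `LabelEncoder` assigns integers in sorted label order, and `inverse_transform` maps predictions back to names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical label Series standing in for the real train/test targets
y_train_names = pd.Series(["low", "high", "high", "low"], index=[10, 11, 12, 13])
y_test_names = pd.Series(["high", "low"], index=[20, 21])

encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train_names)  # learn the mapping on train only
y_test_enc = encoder.transform(y_test_names)        # reuse the same mapping on test

print(list(encoder.classes_))              # position in this list is the integer code
print(encoder.inverse_transform([0, 1]))   # integer codes back to label names
```

Under this assumption, class 0 corresponds to `'high'` and class 1 to `'low'`, which is worth keeping in mind when reading the confusion matrices later.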

In [None]:
import joblib
# Save Label encoder
fpath_encoder = FPATHS['metadata']['label_encoder_joblib']
joblib.dump( encoder,fpath_encoder)

In [None]:
y_train_full.value_counts(1)

### ~~Under-Sampling Majority Group~~

~~We have a very imbalanced dataset. We will take a small sample from the majority class to match the number of reviews for the minority group.~~

In [None]:
# from imblearn.under_sampling import RandomUnderSampler

In [None]:
# X_train_full#.values.reshape(-1,1)

In [None]:
# # Instantiate a RandomUnderSampler
# sampler = RandomUnderSampler(random_state=42)

# # Fit_resample on the reshaped X_train data and y-train data
# X_train, y_train = sampler.fit_resample(X_train_full.values.reshape(-1,1),y_train_full)

# # Flatten the reshaped X_train data back to 1D
# X_train = X_train.flatten()

# # Check for class balance
# y_train.value_counts()

In [None]:
y_train = y_train_full
X_train = X_train_full

### Saving training and test data for re-use with ANNs

In [None]:

# Save training data
fpath_train_joblib = FPATHS['data']['ml-nlp']['train_joblib']
print(fpath_train_joblib)
joblib.dump([X_train, y_train],fpath_train_joblib)

# Save test data
fpath_test_joblib = FPATHS['data']['ml-nlp']['test_joblib']
print(fpath_test_joblib)
joblib.dump([X_test, y_test],fpath_test_joblib)
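For reference, this is a small sketch of how the saved `[X, y]` lists could be loaded back in a later notebook. It uses a temporary folder and made-up data (both hypothetical) rather than the paths stored in `FPATHS`.

```python
import os
import tempfile

import joblib

# Hypothetical stand-ins for X_train and y_train
X_train_demo = ["love it", "not for me"]
y_train_demo = [1, 0]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train-demo.joblib")
    # Save both objects together as a list, as in the cell above
    joblib.dump([X_train_demo, y_train_demo], path)
    # Unpack them in the same order when loading
    X_loaded, y_loaded = joblib.load(path)

print(X_loaded, y_loaded)
```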


# Modeling


**Models to Try:**
- Dummy Classifier (Baseline)
- Multinomial Naive Bayes 
- Random Forest
- Logistic Regression
- SVC

## DummyClassifier

In [None]:
from sklearn.dummy import DummyClassifier
vectorizer = CountVectorizer(stop_words=None,
    ngram_range=(1,1))

clf_pipe = Pipeline([
    ('vectorizer',vectorizer),
    ('clf',DummyClassifier(strategy='stratified'))
])

## Fit and evaluate the DummyClassifier baseline
clf_pipe.fit(X_train, y_train)
ds.evaluate.evaluate_classification(clf_pipe, X_train,y_train, X_test, y_test,figsize=(8,12),values_format=".2g")
# joblib.dump(clf_pipe, FPATHS['models']['ml']['logreg'])

We can see our dummy classifier performs poorly overall and is biased towards predicting the 0 class.

## Multinomial Naive Bayes

In [None]:
FPATHS['models']['ml']#['bayes']

In [None]:
## Create a model pipeline for inference.
vectorizer = CountVectorizer(stop_words='english',
    ngram_range=(1,2)
                            )

clf_bayes = Pipeline([
    ('vectorizer',vectorizer),
    ('clf',MultinomialNB())
])

## Fit and evaluate a Bayesian Model
clf_bayes.fit(X_train, y_train)
ds.evaluate.evaluate_classification(clf_bayes, X_train,y_train, X_test, y_test)
# joblib.dump(clf_bayes, FPATHS['models']['ml']['bayes_joblib'])
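Because the vectorizer lives inside the pipeline, the fitted model can score raw text directly. The sketch below illustrates this on four made-up reviews (the texts, labels, and the convention that 1 means positive are all hypothetical).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training reviews; 1 = positive, 0 = negative
train_texts = [
    "love these noodles",
    "great taste love it",
    "terrible smell",
    "awful texture terrible",
]
train_labels = [1, 1, 0, 0]

demo_pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("clf", MultinomialNB()),
])
demo_pipe.fit(train_texts, train_labels)

# Raw strings go straight in; the pipeline vectorizes before predicting
new_reviews = ["love the taste"]
pred = demo_pipe.predict(new_reviews)
proba = demo_pipe.predict_proba(new_reviews)
print(pred, proba.round(2))
```

This is the property that makes a saved pipeline convenient for inference in an app: no separate preprocessing step has to be reproduced.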

In [None]:
# ds.evaluate.evaluate_classification(count_pipe, X_train, y_train, X_test, y_test)

Our model was 95% accurate, and did not seem overly biased against any given class.  There might be a slight bias toward the neutral class.

### GridSearch: Preprocessing for Multinomial Naive Bayes

In [None]:
## Create a model pipeline for inference.
vectorizer = CountVectorizer(stop_words='english',
    ngram_range=(1,2)
                            )

clf_bayes = Pipeline([
    ('vectorizer',vectorizer),
    ('clf',MultinomialNB())
])

clf_bayes.get_params()

In [None]:
# Define params to try for both vectorizers
param_grid_shared = {
    "vectorizer__max_df": [0.8, 0.9,1.0],
    'vectorizer__min_df': [ 1, 2, 3 ], 
    "vectorizer__max_features": [None, 1000, 2000],
    "vectorizer__stop_words": [None,'english'],
    'vectorizer__ngram_range': [(1,1), (1,2), (1,3)],
}

# Setting params for the count vectorizer
param_grid_count = {
    'vectorizer':[CountVectorizer()],
    **param_grid_shared
}


# Setting params for tfidf vectorizer 
param_grid_tfidf = {
    'vectorizer': [TfidfVectorizer()],
    "vectorizer__norm": ["l1", "l2"],
    "vectorizer__use_idf": [True, False],
    **param_grid_shared
}

# combine into list of params
params_preprocessing = [param_grid_count, param_grid_tfidf]
params_preprocessing

In [None]:
## Add model params to the params_combined
model_params = {"clf__alpha": [0.1, 0.5, 1.0, 2.0]}
params_with_model = [ {**param, **model_params} for param in params_preprocessing]
params_with_model
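Before launching the search it can help to count how many candidates the combined grids produce. This sketch rebuilds the grids above with placeholder strings in place of the vectorizer objects (an assumption made only to keep the example light) and counts them with `ParameterGrid`.

```python
from sklearn.model_selection import ParameterGrid

# Same shared vectorizer options as the grid above
shared = {
    "vectorizer__max_df": [0.8, 0.9, 1.0],
    "vectorizer__min_df": [1, 2, 3],
    "vectorizer__max_features": [None, 1000, 2000],
    "vectorizer__stop_words": [None, "english"],
    "vectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)],
}

# Placeholder strings stand in for CountVectorizer() / TfidfVectorizer()
count_grid = {"vectorizer": ["count"], **shared}
tfidf_grid = {
    "vectorizer": ["tfidf"],
    "vectorizer__norm": ["l1", "l2"],
    "vectorizer__use_idf": [True, False],
    **shared,
}

model_params = {"clf__alpha": [0.1, 0.5, 1.0, 2.0]}
grids = [{**grid, **model_params} for grid in (count_grid, tfidf_grid)]

n_candidates = len(ParameterGrid(grids))  # total across both dictionaries
n_fits = n_candidates * 3                 # with cv=3, one fit per fold per candidate
print(n_candidates, n_fits)
```

A list of dictionaries is searched one dictionary at a time, so the Tfidf-only options never get combined with the count vectorizer.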

In [None]:
%%time
# Create grid search
grid_search = GridSearchCV(clf_bayes, params_with_model, cv=3, verbose=1, n_jobs=-1,
                           scoring='recall_macro')
    
    
# Fit the model
grid_search.fit(X_train, y_train)
grid_search.best_params_
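The search above ranks candidates by `recall_macro`, the unweighted mean of per-class recall. Here is a short sketch on made-up predictions (the `y_true` and `y_pred` lists are hypothetical) showing how it differs from accuracy when classes are uneven.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: four samples of class 0, two of class 1
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

recall_per_class = recall_score(y_true, y_pred, average=None)   # one value per class
macro_recall = recall_score(y_true, y_pred, average="macro")    # plain mean of those values
accuracy = accuracy_score(y_true, y_pred)

print(recall_per_class, macro_recall, round(accuracy, 3))
```

Macro recall gives the smaller class the same weight as the larger one, so a model cannot score well by favoring the majority class.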

#### Best Multinomial Naive Bayes

In [None]:
best_gs_pipe = grid_search.best_estimator_
ds.evaluate.evaluate_classification(best_gs_pipe, X_train, y_train, X_test, y_test)

## Logistic Regression

### LogReg + Count Vectorization

In [None]:
vectorizer = CountVectorizer(stop_words=None,
    ngram_range=(1,1))

clf_logreg = Pipeline([
    ('vectorizer',vectorizer),
    ('clf',LogisticRegression(max_iter=500,C=1e12))
])

## Fit and evaluate a Logistic Regression
clf_logreg.fit(X_train, y_train)
ds.evaluate.evaluate_classification(clf_logreg, X_train,y_train, X_test, y_test)
# joblib.dump(clf_logreg, FPATHS['models']['ml']['logreg_joblib'])

### GridSearch: Preprocessing for LogReg

In [None]:
logreg_pipe  = Pipeline([('vectorizer',CountVectorizer()),
                    ('clf',LogisticRegression(max_iter=500, class_weight='balanced'))])
logreg_pipe.get_params()

In [None]:
## Add model params to the params_combined
model_params = {"clf__C": [0.1, 0.5, 1.0, 10],
                "clf__solver": ['liblinear', 'saga'],
                "clf__penalty": ['l1','l2']
                }
params_with_model = [ {**param, **model_params} for param in params_preprocessing]
params_with_model

In [None]:
%%time
# Create grid search
grid_search = GridSearchCV(logreg_pipe, params_with_model, cv=3, verbose=1, n_jobs=-1,
                           scoring='recall_macro')
    
# Fit the model
grid_search.fit(X_train, y_train)
grid_search.best_params_

#### Best LogReg

In [None]:
best_logreg_pipe = grid_search.best_estimator_
ds.evaluate.evaluate_classification(best_logreg_pipe, X_train, y_train, X_test, y_test)

### LogReg with Tfidf Vectorization

In [None]:
# # Logreg-tfidf

# vectorizer = TfidfVectorizer(stop_words='english',#None,
#     ngram_range=(1,2))

# clf_logreg = Pipeline([
#     ('vectorizer',vectorizer),
#     ('clf',LogisticRegression(max_iter=500,C=1e12))
# ])

# ## Fit an evaluate a Logistic Regression
# clf_logreg.fit(X_train, y_train)
# ds.evaluate.evaluate_classification(clf_logreg, X_train,y_train, X_test, y_test)
# joblib.dump(clf_logreg, FPATHS['models']['ml']['logreg_joblib'])

### LogRegCV (Tuning C) + Tfidf

In [None]:
# # TEMP COMMENTED OUT
# # LogregCV-tfidf 

# vectorizer = TfidfVectorizer(stop_words=None,
#     ngram_range=(1,2))

# clf_logreg = Pipeline([
#     ('vectorizer',vectorizer),
#     ('clf',LogisticRegressionCV(max_iter=200))
# ])

# ## Fit an evaluate a Logistic Regression
# clf_logreg.fit(X_train, y_train)
# ds.evaluate.evaluate_classification(clf_logreg, X_train,y_train, X_test, y_test)
# # joblib.dump(clf_logreg, FPATHS['models']['ml']['logreg'])

## RandomForest

### RandomForest + CountVectorization

In [None]:
vectorizer = CountVectorizer(stop_words=None,
    ngram_range=(1,1))

clf_rf = Pipeline([
    ('vectorizer',vectorizer),
    ('clf',RandomForestClassifier(class_weight='balanced'))
])

## Fit and evaluate a Random Forest
clf_rf.fit(X_train, y_train)
ds.evaluate.evaluate_classification(clf_rf, X_train,y_train, X_test, y_test)
# joblib.dump(clf_rf, FPATHS['models']['ml']['random_forest_joblib'])

In [None]:
##  Determine depths of random forest
sns.histplot([t.get_depth() for t in clf_rf[-1].estimators_])

### GridSearch: RandomForest + Preprocessing

In [None]:
rf_pipe  = Pipeline([('vectorizer',CountVectorizer()),
                    ('clf',RandomForestClassifier(class_weight='balanced'))])
rf_pipe.get_params()

In [None]:
## Add model params to the params_combined
model_params = {#"clf__n_estimators": [100, 200, 300],
                "clf__max_depth": [40, 50, 100, None],
                # 'clf__min_samples_split': [2, 5, 10],
                'clf__min_samples_leaf': [1, 2, 4]
                }
params_with_model = [ {**param, **model_params} for param in params_preprocessing]
params_with_model

In [None]:
%%time
# Create grid search
grid_search = GridSearchCV(rf_pipe, params_with_model, cv=3, verbose=1, n_jobs=-1,
                           scoring='recall_macro')
    
# Fit the model
grid_search.fit(X_train, y_train)
grid_search.best_params_

#### Best RandomForest

In [None]:
best_rf_pipe = grid_search.best_estimator_
ds.evaluate.evaluate_classification(best_rf_pipe, X_train, y_train, X_test, y_test)

### RandomForest + TfidfVectorization

In [None]:
# vectorizer = TfidfVectorizer(stop_words='english',
#     ngram_range=(1,2))

# clf_rf = Pipeline([
#     ('vectorizer',vectorizer),
#     ('clf',RandomForestClassifier(class_weight='balanced'))
# ])

# ## Fit an evaluate a Logistic Regression
# clf_rf.fit(X_train, y_train)
# ds.evaluate.evaluate_classification(clf_rf, X_train,y_train, X_test, y_test)
# # joblib.dump(clf_rf, FPATHS['models']['ml']['random_forest_joblib'])

## LinearSVC

### SVC + Tfidf Vectorization

In [None]:
from sklearn.svm import SVC, LinearSVC

vectorizer = TfidfVectorizer(stop_words=None,
    ngram_range=(1,1))

clf_pipe = Pipeline([
    ('vectorizer',vectorizer),
    ('clf',LinearSVC(class_weight='balanced'))
])

## Fit and evaluate a LinearSVC
clf_pipe.fit(X_train, y_train)
ds.evaluate.evaluate_classification(clf_pipe, X_train,y_train, X_test, y_test,
                                    # target_names=encoder.classes_,
                                    figsize=(8,10))
# joblib.dump(clf_rf, FPATHS['models']['ml']['random_forest'])

In [None]:
raise Exception("Stop here")

# GridSearch Text Preprocessing Params

We can tune our text preprocessing choices using gridsearch.

### GS: Multinomial Naive Bayes

In [None]:
# gs_pipe = Pipeline([('vectorizer',CountVectorizer()),
#                     ('clf',MultinomialNB())])
# gs_pipe.get_params()

In [None]:
# # Define params to try for both vectorizers
# param_grid_shared = {
#     "vectorizer__max_df": [0.7, 0.8, 0.9],
#     'vectorizer__min_df': [ 2, 3, 4 ], 
#     "vectorizer__max_features": [None, 1000, 2000],
#     "vectorizer__stop_words": [None,'english']
# }

# # Setting params for the count vectorizer
# param_grid_count = {
#     'vectorizer':[CountVectorizer()],
#     **param_grid_shared
# }


# # Setting params for tfidf vectorizer 
# param_grid_tfidf = {
#     'vectorizer': [TfidfVectorizer()],
#     "vectorizer__norm": ["l1", "l2"],
#     "vectorizer__use_idf": [True, False],
#     **param_grid_shared
# }

# # combine into list of params
# params_combined = [param_grid_count, param_grid_tfidf]
# params_combined

In [None]:
# %%time
# # Create grid search
# grid_search = GridSearchCV(gs_pipe, params_combined, cv=3, verbose=1, n_jobs=-1)
    
    
# # Fit the model
# grid_search.fit(X_train, y_train)
# grid_search.best_params_

#### Best Multinomial Naive Bayes

In [None]:
# best_gs_pipe = grid_search.best_estimator_
# ds.evaluate.evaluate_classification(best_gs_pipe, X_train, y_train, X_test, y_test)

### GS: RandomForest

In [None]:
# rf_pipe  = Pipeline([('vectorizer',CountVectorizer()),
#                     ('clf',RandomForestClassifier(class_weight='balanced'))])
# rf_pipe.get_params()

In [None]:
# %%time
# # Create grid search
# grid_search = GridSearchCV(rf_pipe, params_combined, cv=3, verbose=1, n_jobs=-1)
    
# # Fit the model
# grid_search.fit(X_train, y_train)
# grid_search.best_params_

#### Best RandomForest

In [None]:
# best_rf_pipe = grid_search.best_estimator_
# ds.evaluate.evaluate_classification(best_rf_pipe, X_train, y_train, X_test, y_test)

### GS: LogReg

In [None]:
# logreg_pipe  = Pipeline([('vectorizer',CountVectorizer()),
#                     ('clf',LogisticRegression(max_iter=500, class_weight='balanced'))])
# # logreg_pipe.get_params()

In [None]:
# %%time
# # Create grid search
# grid_search = GridSearchCV(logreg_pipe, params_combined, cv=3, verbose=1, n_jobs=-1)
    
# # Fit the model
# grid_search.fit(X_train, y_train)
# grid_search.best_params_

#### Best LogReg

In [None]:
# best_logreg_pipe = grid_search.best_estimator_
# ds.evaluate.evaluate_classification(best_logreg_pipe, X_train, y_train, X_test, y_test)

## Save Best Model

In [None]:
%load_ext autoreload 
%autoreload 2
# !pip install langchain_core langchain_openai langchain_community
# !pip install streamlit

import custom_functions.app_functions as af

In [None]:
# LogisticRegressionCV + Tfidf (refit here so it can be saved as the best model)

vectorizer = TfidfVectorizer(stop_words=None,
    ngram_range=(1,2))

clf_logreg = Pipeline([
    ('vectorizer',vectorizer),
    ('clf',LogisticRegressionCV(max_iter=200))
])

## Fit and evaluate a Logistic Regression (with cross-validated C)
clf_logreg.fit(X_train, y_train)
ds.evaluate.evaluate_classification(clf_logreg, X_train,y_train, X_test, y_test)
# joblib.dump(clf_logreg, FPATHS['models']['ml']['logreg'])

### Save Best Model for APP

In [None]:


def classification_metrics_streamlit(y_true, y_pred, label='',
                           output_dict=False, figsize=(8,4),
                           normalize='true', cmap='Blues',
                           colorbar=False,values_format=".2f"):
    """
    This function calculates and displays classification metrics for a given set of true labels and predicted labels.
    
    Parameters:
    - y_true (array-like): The true labels.
    - y_pred (array-like): The predicted labels.
    - label (str): Optional label for the classification metrics.
    - output_dict (bool): Whether to return the classification report as a dictionary.
    - figsize (tuple): The size of the figure to display the confusion matrices.
    - normalize (str): The normalization method for the confusion matrix. Options are 'true', 'pred', 'all', or None.
    - cmap (str): The color map for the confusion matrix.
    - colorbar (bool): Whether to display a colorbar for the confusion matrix.
    - values_format (str): The format for displaying values in the confusion matrix.
    
    Returns:
    - final_report (str): The final classification report as a string.
    - fig (Figure): The figure object containing the confusion matrices.
    """
    # Get the classification report
    report = classification_report(y_true, y_pred)
    
    ## Save header and report
    header = "-"*70
    final_report = "\n".join([header,f" Classification Metrics: {label}", header,report,"\n"])
    
    
    ## CONFUSION MATRICES SUBPLOTS
    fig, axes = plt.subplots(ncols=2, figsize=figsize)
    
    # Create a confusion matrix  of raw counts (left subplot)
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred,
                                            normalize=None, 
                                            cmap='gist_gray_r',# Updated cmap
                                            values_format="d", 
                                            colorbar=colorbar,
                                            ax = axes[0]);
    axes[0].set_title("Raw Counts")
    
    # Create a confusion matrix with the data with normalize argument 
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred,
                                            normalize=normalize,
                                            cmap=cmap, 
                                            values_format=values_format, #New arg
                                            colorbar=colorbar,
                                            ax = axes[1]);
    axes[1].set_title("Normalized Confusion Matrix")
    
    # Adjust layout and show figure
    fig.tight_layout()

    return final_report, fig


def evaluate_classification_streamlit(model, X_train, y_train, X_test, y_test,
                                     joblib_fpath=None, cmap_train="Blues",cmap_test="Reds"):
    """
    Evaluate a classification model using Streamlit.

    Parameters:
    - model: The trained classification model.
    - X_train: The training data features.
    - y_train: The training data labels.
    - X_test: The test data features.
    - y_test: The test data labels.
    - joblib_fpath: The file path to save the model results using joblib.
    - cmap_train: The color map for the training data classification metrics plot. Default is "Blues".
    - cmap_test: The color map for the test data classification metrics plot. Default is "Reds".

    Returns:
    - results_dict: A dictionary containing the model, training and test classification metrics.

    This function evaluates a classification model and packages the results for display in a Streamlit app. It prints
    the classification report and shows the confusion matrices for both the training and test data.
    If joblib_fpath is provided, the model and results are saved to that file using joblib.

    Example usage:
    results = evaluate_classification_streamlit(model, X_train, y_train, X_test, y_test, "model_results.joblib")
    """

    ## Save ML Model Results
    y_hat_train = model.predict(X_train)
    report_train, fig_train = classification_metrics_streamlit(y_train, y_hat_train, cmap=cmap_train,
                                                            label="Training Data")
    print(report_train)
    plt.show()
    y_hat_test = model.predict(X_test)
    report_test, fig_test = classification_metrics_streamlit(y_test, y_hat_test,cmap=cmap_test,
                                                             label="Test Data")
    print(report_test)
    plt.show()


    results_dict = {
        "model": model,
        "train": {"classification_report": report_train, "confusion_matrix": fig_train},
        "test": {"classification_report": report_test, "confusion_matrix": fig_test},
    }
    
    if joblib_fpath is not None:
        joblib.dump(results_dict, joblib_fpath, compress=9)
        print(f"- Successfully saved model to {joblib_fpath}.")
    return results_dict

In [None]:
FPATHS['models']['ml']

In [None]:
results = evaluate_classification_streamlit(clf_logreg, X_train, y_train, X_test,y_test,joblib_fpath=FPATHS['results']['best-ml-clf_joblib'])


In [None]:
results['train']['confusion_matrix']