In [1]:
# Imports
import pycaret
import pickle
from pycaret.regression import *
from sklearn.feature_selection import RFE
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd
import altair as alt
from outbreak_data import authenticate_user
from outbreak_data import outbreak_data

# CRISP-DM Stage 1: Business Understanding

## Business Objectives
Our primary business objective is to leverage data analysis and machine learning to identify potential virus mutations that increase transmission rates, thereby aiding healthcare organizations globally in forecasting and managing outbreaks more effectively. By being at the forefront of such discoveries, we can not only save lives but also foster collaborations with governments, NGOs, and other healthcare entities. This initiative aligns with our broader goal of spearheading innovations in healthcare to promote well-being and security amidst the pandemic.

## Data and Library Assessment
The outbreak Python package offers a rich repository of data regarding the global and local (US) spread of various SARS-CoV-2 lineages, their mutation rates, and associated epidemiological data. Our initial task will be to undertake a comprehensive exploration of this dataset to understand the variables and the quality of data available. The library, with its functionalities, appears to be a robust tool to fetch and manage a wide range of epidemiological data efficiently. Our efforts will concentrate on extracting the most pertinent data points that can fuel our analytical models.

## Overview of CRISP Data Mining Project Goals
Our data mining goals focus on identifying mutation patterns that are correlated with increased transmission rates, by extracting the key features that influence the transmission dynamics of different virus lineages. We will construct a dataset that brings together mutation details and the corresponding infection rates over time and across regions for select SARS-CoV-2 lineages.

## Project Plan

***[Stage 2] Data: Understanding and EDA***

> **Step 1**: Utilize the outbreak package to acquire the latest data on virus lineages, mutations, and infection rates.

> **Step 2**: Clean the data to remove any inconsistencies and handle missing values appropriately.

> **Step 3**: Exploratory Data Analysis (EDA): Conduct EDA to understand the distribution of different variables and identify potential correlations.

***[Stage 3] Data: Cleaning, Preprocessing, Preparation***

> **Step 4**: Feature Engineering and Selection: Based on the insights gathered from EDA, create new features that can potentially be indicative of a lineage's transmissibility.

> **Step 5**: Select the most relevant features for model training through techniques like recursive feature elimination.

***[Stage 4] Modeling: Artificial Intelligence & Machine Learning***

> **Step 6**: Model Building: Build a predictive model using machine learning algorithms such as Random Forest or Gradient Boosting to identify the mutation characteristics that are strongly associated with increased transmission rates.

***[Stage 5] Final Review: Evaluation & Testing***

> **Step 7**: Validate the model using appropriate techniques like cross-validation to ensure its robustness.

> **Step 8**: Write a Medium article and a research paper highlighting the findings, and share them with stakeholders and collaborating organizations to support concerted research on the virus.

By adhering to this plan, we aim to build a tool that can not only identify potentially dangerous mutations early on but also foster a data-driven approach to managing the pandemic more effectively. The insights derived from our model can be instrumental in guiding policy decisions and healthcare strategies, thus playing a vital role in safeguarding public health.

# CRISP-DM Stage 2
## Data Understanding & Exploratory Data Analysis

### **IMPORTANT** 
> The data used in this notebook is provided in collaboration with GISAID (https://gisaid.org/) and was obtained via the outbreak.info API. To use this data, you must create a GISAID account and access it through this or a similar API. To run the code and view the visualizations in this notebook, create an account, run the "authenticate_user" code in the following code block, and then log in with your credentials.

**This notebook is used for academic and research purposes.**

In [2]:
# Authenticating as user to have access to the data
authenticate_user.authenticate_new_user()

Please open this url in a web browswer and authenticate with your GISAID credentials:  https://gpsapi.epicov.org/epi3/gps_authenticate/HOUNSZMZFYQBUPVAUUHEXXIFLFVYXBAKPPSLWYLIDHFCPTNCDXAPTYIJDVCICKUDHKXNXTBIXZFCXQTXRWNATBBDWSKIULHGHJVMEAEGPOYKWZPJDSPXUUYHNQHROPMM
Waiting for authorization response... [Press Ctrl-C to abort]
Authenication failed, trying again in 5 seconds...
Waiting for authorization response... [Press Ctrl-C to abort]
Authenticated successfully!

    TERMS OF USE for Python Package and
    Reminder of GISAID's Database Access Agreement
    Your ability to access and use Data in GISAID, including your access and
    use of same via R Package, is subject to the terms and conditions of
    GISAID's Database Access Agreement (“DAA”) (which you agreed to
    when you requested access credentials to GISAID), as well as the
    following terms:
    1. You will treat all data contained in the R Package consistent with
    other Data in GISAID and in accordance with GISAID's Da

## Geographical Regions of Interest
According to the latest viral infections report from WHO (World Health Organization) in collaboration with PAHO (Pan-American Health Organization):

> In North America, SARS-CoV-2 activity has been rising at moderate levels

> In the Caribbean at intermediate levels with high levels of circulation in Barbados, Guyana, Jamaica, and Saint Lucia

> In Central America at low and decreasing levels

> In Brazil and Southern Cone at intermediate and increasing levels, especially in Bolivia, Brazil, Chile, and Argentina. 

This data was published on 9/15/23 and you can find the report here: https://www.paho.org/en/influenza-situation-report.

The most recent data available in the GISAID Python-Outbreak package varies by location, so some locations have gaps in coverage. Focusing this project on regions where SARS-CoV-2 levels are currently intermediate and rising (i.e., Brazil, Chile, and Bolivia) lets us use the available data to gain retrospective insight into how the virus became more infectious in these regions up to the present.

## **Step 1**: Utilize the outbreak package to acquire the latest data on virus lineages, mutations, and infection rates.

In [3]:
geo_isos = ['BOL', 'BRA', 'CHL']

In [4]:
# Infection rates 
cases_numIncrease = outbreak_data.cases_by_location(geo_isos, pull_smoothed=True)

In [5]:
cases_numIncrease.head()

Unnamed: 0,_id,_score,admin1,confirmed_rolling,date
0,BRA_None_None2020-02-12,8.482446,,0.0,2020-02-12
1,BRA_None_None2020-02-14,8.482446,,0.0,2020-02-14
2,BRA_None_None2020-02-18,8.482446,,0.0,2020-02-18
3,BRA_None_None2020-02-21,8.482446,,0.0,2020-02-21
4,BRA_None_None2020-02-22,8.482446,,0.0,2020-02-22


In [6]:
# Sars-CoV-2 Virus Lineages
lineage_prevalences = []
for loc in geo_isos:
    locDf = outbreak_data.prevalence_by_location(loc, other_threshold=0.85)
    locDf['location'] = [loc] * len(locDf)
    lineage_prevalences.append(locDf)
lineage_prevalences = pd.concat(lineage_prevalences) 

In [7]:
lineage_prevalences.head()

Unnamed: 0,date,total_count,lineage_count,lineage,prevalence,prevalence_rolling,location
0,2020-03-31,1,1,a.5,1.0,1.0,BOL
1,2021-09-08,1,1,ay.25.1,1.0,0.875,BOL
2,2021-09-09,0,0,ay.25.1,0.0,0.777778,BOL
3,2021-09-10,0,0,ay.25.1,0.0,1.0,BOL
4,2021-09-11,0,0,ay.25.1,0.0,0.2,BOL


In [8]:
lineage_prevalences.shape

(20951, 7)

In [9]:
# Selecting the 5000 most recent datapoints for the lineage plot, as there are too many to plot
# .copy() avoids pandas' SettingWithCopyWarning when a column is added to the slice below
plot_lineage_prevalences = lineage_prevalences.sort_values('date')[-5000:].copy()

# Running sum of prevalence_rolling within each (date, lineage) group, i.e. accumulated
# across locations for the same day and lineage, over the 5000 most recent datapoints
plot_lineage_prevalences['prevalence_cumSum'] = plot_lineage_prevalences.groupby(['date', 'lineage'])['prevalence_rolling'].cumsum()
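
To make the grouping concrete, here is a minimal sketch on made-up rows (the values and the `prevalence_sum` column are illustrative assumptions, not part of the dataset). Because the cumulative sum runs within each (date, lineage) group, it accumulates across locations for the same day, which would explain `prevalence_cumSum` values above 1.

```python
import pandas as pd

# Toy rows (hypothetical values): one lineage seen in two locations on the same day
toy = pd.DataFrame({
    "date": ["2023-05-01", "2023-05-01", "2023-05-02"],
    "lineage": ["xbb.1.5", "xbb.1.5", "xbb.1.5"],
    "location": ["BRA", "CHL", "BRA"],
    "prevalence_rolling": [0.25, 0.5, 0.125],
})

group_keys = ["date", "lineage"]

# Running total inside each (date, lineage) group, in row order
toy["prevalence_cumSum"] = toy.groupby(group_keys)["prevalence_rolling"].cumsum()

# Per-group total broadcast to every row of the group (hypothetical alternative column)
toy["prevalence_sum"] = toy.groupby(group_keys)["prevalence_rolling"].transform("sum")

print(toy)
```

With `cumsum`, the value on a row depends on row order within its group; `transform("sum")` gives every row the same group total, which may be easier to interpret.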

In [10]:
most_recent_lineages = plot_lineage_prevalences.loc[:,['prevalence_cumSum', 'lineage']].groupby('lineage').mean().sort_values('prevalence_cumSum')[-20:].index.to_list()

In [11]:
# Selecting top 20 most prevalent in the most recent datapoints to limit visual complexity
plot_recent_lineage_prevalences = plot_lineage_prevalences.where(plot_lineage_prevalences.lineage.apply(lambda x: x in most_recent_lineages)).dropna(how='all')
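
A side effect of filtering with `where(...).dropna(how='all')` is that integer columns are converted to float, because non-matching rows are first filled with NaN. That is consistent with counts such as `3.0` appearing in later outputs. The sketch below, on made-up rows, compares it with boolean indexing via `isin`, which keeps the original dtypes; the toy frame and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "lineage": ["ba.4", "other", "xbb.1.5"],
    "lineage_count": [3, 7, 2],
})
keep = ["ba.4", "xbb.1.5"]

# where() blanks non-matching rows with NaN first, so integer columns become float
via_where = toy.where(toy.lineage.apply(lambda x: x in keep)).dropna(how="all")

# Boolean indexing drops the rows directly and leaves dtypes unchanged
via_isin = toy[toy["lineage"].isin(keep)]

print(via_where.dtypes["lineage_count"], via_isin.dtypes["lineage_count"])
```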

In [12]:
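# 'other' is the most common label. With other_threshold=0.85 above, lineages below that
# prevalence threshold are presumably pooled under 'other', so this may reflect the
# aggregation setting as much as lineages that could not be identified in those regions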
plot_recent_lineage_prevalences.lineage.value_counts()[:7]

other         1342
ba.5.2.1       614
ba.5.1         386
xbb.1.5        283
bq.1.8         270
xbb.1.5.76     134
ba.4           107
Name: lineage, dtype: int64

In [13]:
# Graphing lineages with high recent prevalences in these locations
# Clicking a lineage in the legend highlights it and fades the others
selection = alt.selection_multi(fields=['lineage'], bind='legend')
plot_lineages = alt.Chart(plot_recent_lineage_prevalences, title="Lineage Prevalences").mark_line().encode(
    x='date:T',
    y=alt.Y('prevalence_rolling:Q'),
    color='lineage:N',
    tooltip=['date:T', 'lineage:N', 'prevalence_rolling:Q'],  # Add tooltips
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))  # Adjust opacity based on selection
).add_selection(
    selection  # Add the selection filter to the chart
).interactive()

In [14]:
plot_lineages

In [15]:
# xbb.1.5 and its sublineage xbb.1.5.76 seem to be good candidates for mutation information
interesting_lineages = ['xbb.1.5.76', 'xbb.1.5']
lineage_mutations = outbreak_data.lineage_mutations(pango_lin=interesting_lineages)

In [16]:
lineage_mutations.head()

Unnamed: 0,mutation,mutation_count,lineage_count,lineage,gene,ref_aa,alt_aa,codon_num,codon_end,type,prevalence,change_length_nt
0,orf6:d61l,349,351,xbb.1.5.76,ORF6,D,L,61,,substitution,0.994302,
1,s:l24s,344,351,xbb.1.5.76,S,L,S,24,,substitution,0.980057,
2,n:r203k,350,351,xbb.1.5.76,N,R,K,203,,substitution,0.997151,
3,s:g339h,344,351,xbb.1.5.76,S,G,H,339,,substitution,0.980057,
4,s:f486p,335,351,xbb.1.5.76,S,F,P,486,,substitution,0.954416,


In [17]:
lineage_mutations.shape

(132, 12)

## **Step 2**: Clean the data to remove any inconsistencies and handle missing values appropriately.

In [18]:
cases_numIncrease.shape

(3363, 5)

In [19]:
plot_recent_lineage_prevalences = plot_recent_lineage_prevalences.reset_index(drop=True)

In [20]:
# Splitting the dataset into lineage_defined and lineage_undefined ('other')
lineage_undefined = plot_recent_lineage_prevalences.where(plot_recent_lineage_prevalences.lineage == 'other').dropna(how='all')

lineage_defined = plot_recent_lineage_prevalences.where(plot_recent_lineage_prevalences.lineage != 'other').dropna(how='all')

In [21]:
# The defined set is about 1.5x the size of the undefined set
print(lineage_defined.shape)
print(lineage_undefined.shape)

(2023, 8)
(1342, 8)


In [22]:
# For the lineage_defined dataset, we can go ahead and merge it with the mutations data
sars_epi_viro = pd.merge(lineage_defined, lineage_mutations, on = 'lineage', suffixes=('_cases','_mutations'))

In [23]:
# lineage_mutations has many 'mutation' rows per 'lineage' (the merge key), so each
# prevalence row is repeated once per mutation of its lineage and the dataset grows.
# This is good for training a model, but bad for visualization.
sars_epi_viro.shape

(27820, 19)
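
The row growth from this kind of merge can be reproduced on a small example. The sketch below uses made-up frames (`prev` and `muts` are hypothetical names) to show that each prevalence row is repeated once per mutation of its lineage, and that an inner merge drops lineages with no mutation rows. Since mutations were fetched only for `interesting_lineages`, the same presumably happens to the other lineages in `lineage_defined`.

```python
import pandas as pd

# Toy prevalence rows (hypothetical values)
prev = pd.DataFrame({
    "lineage": ["xbb.1.5", "xbb.1.5", "ba.4"],
    "date": ["2023-05-01", "2023-05-02", "2023-05-01"],
})

# Toy mutation rows: three mutations for one lineage, none for ba.4
muts = pd.DataFrame({
    "lineage": ["xbb.1.5", "xbb.1.5", "xbb.1.5"],
    "mutation": ["s:f486p", "n:r203k", "orf6:d61l"],
})

merged = pd.merge(prev, muts, on="lineage")

# 2 prevalence rows x 3 mutations = 6 rows; ba.4 has no match and is dropped
print(len(merged), sorted(merged["lineage"].unique()))
```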

In [24]:
# Checking for missing values
cases_numIncrease.isnull().sum()

_id                  0
_score               0
admin1               0
confirmed_rolling    0
date                 0
dtype: int64

In [25]:
# Checking for missing values
sars_epi_viro.isnull().sum()

date                       0
total_count                0
lineage_count_cases        0
lineage                    0
prevalence_cases           0
prevalence_rolling         0
location                   0
prevalence_cumSum          0
mutation                   0
mutation_count             0
lineage_count_mutations    0
gene                       0
ref_aa                     0
alt_aa                     0
codon_num                  0
codon_end                  0
type                       0
prevalence_mutations       0
change_length_nt           0
dtype: int64
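
A zero count from `isnull()` does not rule out placeholder values. The later dummy columns `codon_end_None` and `change_length_nt_None` suggest that some fields hold the literal string `'None'`, which pandas does not treat as missing. A minimal sketch of how such placeholders could be counted and converted, on a made-up frame (the placeholder list is an assumption):

```python
import pandas as pd

# Toy frame (hypothetical values) where missing entries are stored as the string 'None'
toy = pd.DataFrame({
    "codon_num": [61, 24],
    "codon_end": ["None", "None"],
})

placeholders = ["None", ""]  # assumed placeholder spellings

native_missing = toy.isnull().sum()               # what isnull() reports
placeholder_missing = toy.isin(placeholders).sum()  # placeholder strings per column

# Turn placeholders into real missing values so isnull() can see them
cleaned = toy.mask(toy.isin(placeholders))

print(native_missing["codon_end"], placeholder_missing["codon_end"], cleaned.isnull().sum()["codon_end"])
```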

## **Step 3**: Exploratory Data Analysis (EDA): Conduct EDA to understand the distribution of different variables and identify potential correlations.

In [29]:
# Summary statistics for the numeric columns of the case data
cases_numIncrease.describe()

Unnamed: 0,_score,confirmed_rolling
count,3363.0,3363.0
mean,8.416526,12920.127826
std,0.040732,22522.125237
min,8.367605,0.0
25%,8.381978,795.571411
50%,8.405478,3005.285645
75%,8.446026,14278.214355
max,8.482446,189227.0


In [30]:
# Summary statistics for the numeric columns of the merged lineage/mutation data
sars_epi_viro.describe()

Unnamed: 0,total_count,lineage_count_cases,prevalence_cases,prevalence_rolling,prevalence_cumSum,mutation_count,lineage_count_mutations,codon_num,prevalence_mutations
count,27820.0,27820.0,27820.0,27820.0,27820.0,27820.0,27820.0,27820.0,27820.0
mean,13.381165,3.715313,0.185981,0.258344,0.363791,127489.804026,131471.67491,709.020345,0.972512
std,19.587252,6.830893,0.282794,0.29371,0.394035,84965.24107,87533.379003,926.495591,0.025855
min,0.0,0.0,0.0,0.0,0.0,302.0,351.0,8.0,0.860399
25%,0.0,0.0,0.0,0.013158,0.028249,350.0,351.0,144.0,0.960422
50%,0.0,0.0,0.0,0.09589,0.25,182513.0,189905.0,413.0,0.979958
75%,22.0,4.0,0.333333,0.48,0.571429,186146.0,189905.0,681.0,0.993302
max,79.0,41.0,1.0,1.0,2.0,189444.0,189905.0,3696.0,1.0


# CRISP-DM Stage 3
## Data Preparation, Data Cleaning & Preprocessing

## **Step 4**: Feature Engineering and Selection: Based on the insights gathered from EDA, create new features that can potentially be indicative of a lineage's transmissibility.


In [31]:
# The _score and admin1 columns are not meaningful for this analysis (they are used by the API team) and can be dropped
cases_numIncrease = cases_numIncrease.drop('_score', axis=1)
cases_numIncrease = cases_numIncrease.drop('admin1', axis=1)

In [32]:
# The _id column contains the location code, which is its only unique piece of information.
# It can be re-engineered into a location feature and then merged with the sars_epi_viro dataset
cases_numIncrease['location'] = cases_numIncrease._id.apply(lambda x: x[:3])

In [33]:
cases_numIncrease = cases_numIncrease.drop('_id', axis = 1)

In [34]:
sars_epi_viro.tail()

Unnamed: 0,date,total_count,lineage_count_cases,lineage,prevalence_cases,prevalence_rolling,location,prevalence_cumSum,mutation,mutation_count,lineage_count_mutations,gene,ref_aa,alt_aa,codon_num,codon_end,type,prevalence_mutations,change_length_nt
27815,2023-05-29,3.0,3.0,xbb.1.5.76,1.0,1.0,CHL,1.0,s:s477n,327,351,S,S,N,477,,substitution,0.931624,
27816,2023-05-29,3.0,3.0,xbb.1.5.76,1.0,1.0,CHL,1.0,s:n764k,326,351,S,N,K,764,,substitution,0.928775,
27817,2023-05-29,3.0,3.0,xbb.1.5.76,1.0,1.0,CHL,1.0,s:t478k,325,351,S,T,K,478,,substitution,0.925926,
27818,2023-05-29,3.0,3.0,xbb.1.5.76,1.0,1.0,CHL,1.0,s:g252v,303,351,S,G,V,252,,substitution,0.863248,
27819,2023-05-29,3.0,3.0,xbb.1.5.76,1.0,1.0,CHL,1.0,m:q19e,302,351,M,Q,E,19,,substitution,0.860399,


In [35]:
# Finally, construct the training/test dataset for the machine learning model
# It should contain all the features needed for this project
sars_epi_viro = pd.merge(sars_epi_viro, cases_numIncrease, on=['location', 'date'])

In [36]:
# The dataset shrank because of the inner-merge constraint:
# only datapoints with both a case count and a lineage prevalence for the same location and date are kept
sars_epi_viro.shape

(9448, 20)
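
To see which rows an inner merge discards, one option is to run a left merge with `indicator=True` and inspect the rows marked `left_only`. The frames below are made-up stand-ins for the prevalence and case tables; this is a sketch of the auditing idea, not a step the notebook performs.

```python
import pandas as pd

# Toy prevalence rows (hypothetical values)
left = pd.DataFrame({
    "location": ["BRA", "BRA", "CHL"],
    "date": ["2023-05-01", "2023-05-02", "2023-05-01"],
    "prevalence_rolling": [0.5, 0.25, 1.0],
})

# Toy case rows: no entry for BRA on 2023-05-02
right = pd.DataFrame({
    "location": ["BRA", "CHL"],
    "date": ["2023-05-01", "2023-05-01"],
    "confirmed_rolling": [100.0, 20.0],
})

# how="left" keeps every prevalence row; _merge records whether a case row matched
audit = pd.merge(left, right, on=["location", "date"], how="left", indicator=True)
unmatched = audit[audit["_merge"] == "left_only"]

print(unmatched[["location", "date"]])
```

If the two `date` columns have different types (for example strings on one side and timestamps on the other), matching can fail entirely, so it may be worth checking their dtypes before merging.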

In [48]:
sars_epi_viro.columns

Index(['date', 'total_count', 'lineage_count_cases', 'lineage',
       'prevalence_cases', 'prevalence_rolling', 'location',
       'prevalence_cumSum', 'mutation', 'mutation_count',
       'lineage_count_mutations', 'gene', 'ref_aa', 'alt_aa', 'codon_num',
       'codon_end', 'type', 'prevalence_mutations', 'change_length_nt',
       'confirmed_rolling'],
      dtype='object')

## **Step 5**: Select the most relevant features for model training through techniques like recursive feature elimination (RFE).

In [44]:
# One hot encoding the categorical features

In [38]:
categs = ['date', 'lineage', 'location', 'mutation', 'gene', 'ref_aa', 'alt_aa', 'codon_end', 'type', 'change_length_nt']

In [39]:
for col in categs:
    sars_epi_viro[col] = sars_epi_viro[col].astype('category')

In [40]:
sars_epi_viro_encoded = pd.get_dummies(sars_epi_viro, columns=categs, drop_first=True)

In [51]:
sars_epi_viro_encoded.shape

(9448, 226)
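
One-hot encoding `date` adds one column per distinct day, which accounts for a large share of the encoded columns. A possible alternative, shown here only as a sketch on made-up rows, is to replace the date with a single numeric feature such as days since the first observation (`days_since_start` is a hypothetical name).

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "date": ["2022-12-10", "2022-12-11", "2023-01-17"],
    "gene": ["S", "N", "S"],
})

# Approach used above: one dummy column per distinct date (minus the first level)
one_hot = pd.get_dummies(toy, columns=["date", "gene"], drop_first=True)

# Alternative sketch: a single ordinal time feature instead of date dummies
dates = pd.to_datetime(toy["date"])
compact = toy.assign(days_since_start=(dates - dates.min()).dt.days).drop(columns="date")
compact = pd.get_dummies(compact, columns=["gene"], drop_first=True)

print(one_hot.shape, compact.shape)
```

The numeric encoding also lets a model generalize to dates it has not seen, which date dummies cannot do; whether that suits this analysis is a modeling choice.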

In [50]:
sars_epi_viro_encoded.columns.to_list()

['total_count',
 'lineage_count_cases',
 'prevalence_cases',
 'prevalence_rolling',
 'prevalence_cumSum',
 'mutation_count',
 'lineage_count_mutations',
 'codon_num',
 'prevalence_mutations',
 'confirmed_rolling',
 'date_2022-12-10',
 'date_2022-12-11',
 'date_2022-12-12',
 'date_2022-12-13',
 'date_2022-12-14',
 'date_2022-12-15',
 'date_2022-12-16',
 'date_2022-12-17',
 'date_2022-12-18',
 'date_2022-12-19',
 'date_2022-12-20',
 'date_2022-12-21',
 'date_2022-12-22',
 'date_2022-12-23',
 'date_2022-12-24',
 'date_2022-12-25',
 'date_2022-12-26',
 'date_2022-12-27',
 'date_2022-12-28',
 'date_2022-12-29',
 'date_2022-12-30',
 'date_2022-12-31',
 'date_2023-01-01',
 'date_2023-01-02',
 'date_2023-01-03',
 'date_2023-01-04',
 'date_2023-01-05',
 'date_2023-01-06',
 'date_2023-01-07',
 'date_2023-01-08',
 'date_2023-01-09',
 'date_2023-01-10',
 'date_2023-01-11',
 'date_2023-01-12',
 'date_2023-01-13',
 'date_2023-01-14',
 'date_2023-01-15',
 'date_2023-01-16',
 'date_2023-01-17',
 'date

In [41]:
sars_setup = setup(data=sars_epi_viro_encoded, target='prevalence_rolling')

Unnamed: 0,Description,Value
0,Session id,5356
1,Target,prevalence_rolling
2,Target type,Regression
3,Original data shape,"(9448, 226)"
4,Transformed data shape,"(9448, 226)"
5,Transformed train set shape,"(6613, 226)"
6,Transformed test set shape,"(2835, 226)"
7,Numeric features,225
8,Preprocess,True
9,Imputation type,simple


In [42]:
sars_lasso_model = create_model('lasso')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,0.0786,0.0106,0.103,0.4763,0.0825,3.0907
1,0.0782,0.0099,0.0995,0.4717,0.0799,3.2065
2,0.0769,0.0093,0.0967,0.5236,0.0777,3.3264
3,0.0797,0.0104,0.1021,0.4893,0.0813,3.2353
4,0.0777,0.01,0.1,0.4816,0.0795,3.0264
5,0.0802,0.0107,0.1035,0.457,0.0824,2.9013
6,0.0805,0.0108,0.1039,0.4996,0.0822,3.0135
7,0.0805,0.0112,0.106,0.4624,0.0836,3.2811
8,0.0753,0.0089,0.0946,0.5263,0.0755,3.0968
9,0.0761,0.0096,0.0978,0.4593,0.0791,3.8116


In [54]:
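# Clear the feature names stored at fit time, presumably so that predict() below
# does not validate the incoming DataFrame's column names against them
if hasattr(sars_lasso_model, 'feature_names_in_'):
    sars_lasso_model.feature_names_in_ = None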

In [57]:
sars_epi_viro_encoded.drop('prevalence_rolling', axis=1)

Unnamed: 0,total_count,lineage_count_cases,prevalence_cases,prevalence_cumSum,mutation_count,lineage_count_mutations,codon_num,prevalence_mutations,confirmed_rolling,date_2022-12-10,...,alt_aa_T,alt_aa_V,alt_aa_Y,codon_end_33.0,codon_end_144.0,codon_end_3677.0,codon_end_None,type_substitution,change_length_nt_9.0,change_length_nt_None
0,45.0,1.0,0.022222,0.025000,184792,189905,61,0.973076,3385.142822,0,...,0,0,0,0,0,0,1,1,0,1
1,45.0,1.0,0.022222,0.025000,184103,189905,24,0.969448,3385.142822,0,...,0,0,0,0,0,0,1,1,0,1
2,45.0,1.0,0.022222,0.025000,188131,189905,203,0.990658,3385.142822,0,...,0,0,0,0,0,0,1,1,0,1
3,45.0,1.0,0.022222,0.025000,183426,189905,339,0.965883,3385.142822,0,...,0,0,0,0,0,0,1,1,0,1
4,45.0,1.0,0.022222,0.025000,182801,189905,486,0.962592,3385.142822,0,...,0,0,0,0,0,0,1,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9443,54.0,6.0,0.111111,0.092715,327,351,477,0.931624,2748.714355,0,...,0,0,0,0,0,0,1,1,0,1
9444,54.0,6.0,0.111111,0.092715,326,351,764,0.928775,2748.714355,0,...,0,0,0,0,0,0,1,1,0,1
9445,54.0,6.0,0.111111,0.092715,325,351,478,0.925926,2748.714355,0,...,0,0,0,0,0,0,1,1,0,1
9446,54.0,6.0,0.111111,0.092715,303,351,252,0.863248,2748.714355,0,...,0,1,0,0,0,0,1,1,0,1


In [58]:
sars_lasso_model.predict(sars_epi_viro_encoded.drop('prevalence_rolling', axis=1))

array([ 0.0826584 ,  0.08256784,  0.08309727, ..., -0.04218772,
       -0.04219061, -0.04219074])
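
Some of these predictions are negative, although `prevalence_rolling` is a proportion between 0 and 1. A linear model such as Lasso has no built-in bound on its output. One simple post-processing option, shown as a sketch using a few of the values printed above, is to clip predictions to the valid range.

```python
import numpy as np

# A few predictions copied from the output above
raw_predictions = np.array([0.0826584, 0.08256784, -0.04218772])

# Clip to [0, 1], the valid range for a prevalence
clipped_predictions = np.clip(raw_predictions, 0.0, 1.0)

print(clipped_predictions)
```

Clipping only fixes the range of the output; it does not change how the model was fit.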

## Beginning of RFE which may be causing a drop in performance - to re-evaluate with ablated model

In [50]:
# Perform recursive feature elimination
sars_rfe_selector = RFE(sars_lasso_model, n_features_to_select=10)

In [51]:
sars_rfe_selector

In [52]:
# Get the dataset from the PyCaret environment
X_train_sars = get_config('X_train')
y_train_sars = get_config('y_train')

In [53]:
# Fit RFE
rfe_selector = sars_rfe_selector.fit(X_train_sars, y_train_sars)

In [54]:
# Saving the selected features found in the RFE stage
selected_features = X_train_sars.columns[rfe_selector.support_]

In [55]:
selected_features = selected_features.to_list() + ['prevalence_rolling']

In [56]:
# These are the columns with the most support for the lasso regression model,
# when targeting prevalence_rolling
selected_features

['confirmed_rolling',
 'date_2022-12-16',
 'alt_aa_Y',
 'codon_end_33.0',
 'codon_end_144.0',
 'codon_end_3677.0',
 'codon_end_None',
 'type_substitution',
 'change_length_nt_9.0',
 'change_length_nt_None',
 'prevalence_rolling']
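
For reference, the mechanics of RFE can be checked on synthetic data where the informative features are known. The sketch below uses scikit-learn's `Lasso` directly rather than the PyCaret model, with made-up data in which only columns 0 and 3 drive the target. Note that eight of the ten selected columns above are the trailing columns of the encoded frame. This could indicate that the Lasso set many coefficients to exactly zero and that ties were broken by column position, but that is an assumption the output does not confirm.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso

# Synthetic data (assumption for illustration): only columns 0 and 3 matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# RFE refits the estimator repeatedly, dropping the weakest coefficient each round
selector = RFE(Lasso(alpha=0.01), n_features_to_select=2).fit(X, y)

selected = np.flatnonzero(selector.support_)  # indices of the kept columns
print(selected, selector.ranking_)
```

Inspecting the fitted coefficients (`selector.estimator_.coef_`) alongside `ranking_` is one way to tell a meaningful selection from an arbitrary one.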

# CRISP-DM Stage 4
## Modeling: Artificial Intelligence & Machine Learning

## **Step 6**: Build a predictive model using machine learning algorithms such as Random Forest or Gradient Boosting to identify the mutation characteristics that are strongly associated with increased transmission rates.

In [57]:
X_train_sars['prevalence_rolling'] = y_train_sars

In [58]:
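# Set up the PyCaret environment for the second Lasso run
# Note: X_train_sars still holds every encoded column; use X_train_sars[selected_features]
# to restrict training to the RFE-selected features
sars_rfe_setup = setup(data=X_train_sars, target='prevalence_rolling')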

Unnamed: 0,Description,Value
0,Session id,4272
1,Target,prevalence_rolling
2,Target type,Regression
3,Original data shape,"(4284, 223)"
4,Transformed data shape,"(4284, 223)"
5,Transformed train set shape,"(2998, 223)"
6,Transformed test set shape,"(1286, 223)"
7,Numeric features,222
8,Preprocess,True
9,Imputation type,simple
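
The setup summary above reports 223 columns, so the frame passed to `setup` still contains all encoded features rather than only the RFE selection. A minimal sketch of how the training frame could be restricted to the selected columns before calling `setup`, using a made-up frame (the values and the shortened feature list are assumptions):

```python
import pandas as pd

# Toy training data (hypothetical values)
X_train = pd.DataFrame({
    "confirmed_rolling": [3385.1, 2748.7, 1200.0],
    "codon_num": [61, 24, 203],
    "type_substitution": [1, 1, 0],
})
y_train = pd.Series([0.10, 0.25, 0.05], name="prevalence_rolling")

# Selected feature names plus the target, as in selected_features above
selected_features = ["confirmed_rolling", "type_substitution", "prevalence_rolling"]

# assign() returns a new frame, so X_train itself is left unchanged
train_with_target = X_train.assign(prevalence_rolling=y_train)
rfe_frame = train_with_target[selected_features]

print(rfe_frame.shape)
```

A frame like `rfe_frame` is what would then be passed as `data` to `setup(..., target='prevalence_rolling')`.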


In [59]:
# Training the second Lasso model on the data from the setup above
lasso_rfe_model = create_model('lasso')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,0.0932,0.0144,0.1202,0.4521,0.0955,3.2666
1,0.0904,0.0129,0.1137,0.4654,0.0906,3.7334
2,0.0943,0.0149,0.122,0.4287,0.0968,3.2511
3,0.0914,0.0142,0.119,0.4737,0.0943,3.0405
4,0.0902,0.0135,0.1163,0.4877,0.0929,3.6762
5,0.0843,0.012,0.1095,0.4622,0.0877,3.1477
6,0.0946,0.0145,0.1203,0.4668,0.0957,3.2166
7,0.0889,0.0137,0.1171,0.4844,0.0928,2.7264
8,0.0933,0.0142,0.1193,0.4531,0.0949,2.5623
9,0.0938,0.0151,0.1227,0.4366,0.0972,2.8443
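
The fold tables do not include summary rows, so the two runs are easier to compare after averaging. The sketch below copies the R2 column from both fold tables and computes the means. The comparison is only indicative, because the second run was set up on a smaller frame (4284 rows versus 9448).

```python
from statistics import mean, stdev

# R2 per fold, copied from the first Lasso fold table
r2_full = [0.4763, 0.4717, 0.5236, 0.4893, 0.4816, 0.4570, 0.4996, 0.4624, 0.5263, 0.4593]

# R2 per fold, copied from the second Lasso fold table
r2_second = [0.4521, 0.4654, 0.4287, 0.4737, 0.4877, 0.4622, 0.4668, 0.4844, 0.4531, 0.4366]

mean_full, mean_second = mean(r2_full), mean(r2_second)
r2_drop = mean_full - mean_second  # positive means the second run scored lower

print(f"first run:  mean R2 = {mean_full:.4f} (std {stdev(r2_full):.4f})")
print(f"second run: mean R2 = {mean_second:.4f} (std {stdev(r2_second):.4f})")
print(f"difference: {r2_drop:.4f}")
```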


### End of RFE - save the ablated model

In [43]:
# Pickling / saving the PyCaret Lasso model trained on the full feature set (sars_lasso_model)
with open('sars_lasso_model.pkl', 'wb') as file:
    pickle.dump(sars_lasso_model, file)
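
A saved model is only useful if it loads back and predicts the same values. The sketch below checks a pickle round trip with a small scikit-learn `Lasso` fitted on made-up data, using an in-memory buffer instead of a file; the model and data are stand-ins for `sars_lasso_model`.

```python
import io
import pickle

import numpy as np
from sklearn.linear_model import Lasso

# Tiny stand-in model fitted on made-up data
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])
model = Lasso(alpha=0.01).fit(X, y)

# Serialize to an in-memory buffer, then load it back
buffer = io.BytesIO()
pickle.dump(model, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

# The restored model should reproduce the original predictions
round_trip_ok = np.allclose(model.predict(X), restored.predict(X))
print(round_trip_ok)
```

PyCaret also provides `save_model` and `load_model`, which store the preprocessing pipeline together with the estimator; that may be preferable when the model is later applied to raw, untransformed data.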

# CRISP-DM Stage 5
## Final Review: Evaluation and Testing

## **Step 7**: Validate the model using appropriate techniques like cross-validation to ensure its robustness.

In [46]:
evaluate_model(sars_lasso_model)

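
Outside PyCaret, the same kind of 10-fold validation can be sketched with scikit-learn. The example below uses synthetic data and a plain `Lasso` (both assumptions for illustration) to show how per-fold R2 scores are produced.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 0.5 * X[:, 0] + rng.normal(scale=0.5, size=300)

# 10 shuffled folds, mirroring the 10 folds in the tables above
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(Lasso(alpha=0.01), X, y, cv=cv, scoring="r2")

print(scores.mean(), scores.std())
```

For data indexed by date, shuffled folds can let a model train on days that come after the ones it is tested on; a time-ordered split such as scikit-learn's `TimeSeriesSplit` is one alternative to consider.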