# Question Q4: Practitioners' Main Problems in the Model Deployment Stage

*Question*: According to your personal experience, please outline the main problems or difficulties (up to three) faced during each of the seven ML life cycle stages.

*Answer Type*: Free Text

### Necessary Libraries

In [1]:
import pandas as pd
from utils.basic import rename_values
from utils.dataframe import DataframeUtils
from utils.plot import PlotUtils
from utils.bootstrapping import BootstrappingUtils

### Dataframe Initialization

Here we load a formatted pandas DataFrame from a raw CSV file containing the survey responses. During formatting, we discard unused columns and rename the remaining ones to clearer names. We created two classes that help us manage this dataframe and create meaningful charts.

In [2]:
dataframe_obj = DataframeUtils(df_path='./data/main_data.csv', sep=';',
                               to_discard_columns_file='./data/unused_columns.txt',
                               to_format_columns_file='./data/formatted_columns.txt')
# dataframe preview
dataframe_obj.df.head()

Unnamed: 0,ID,Status,Duration,D1_Undergraduation,D1_Specialization,D1_Master,D1_Phd,D1_Courses,D1_Others,D2_Country,...,Q15_Model_Deploy_Production_Monitoring,Q16_Model_Monitor_Aspects_Input_And_Output,Q16_Model_Monitor_Aspects_Interpretability_Output,Q16_Model_Monitor_Aspects_Output_And_Decisions,Q16_Model_Monitor_Aspects_Fairness,Q16_Model_Monitor_Aspects_Others,Q16_Model_Monitor_Aspects_Others_Free,Q17_Automated_Machine_Learning_Tools_Yes_No,Q17_Automated_Machine_Learning_Tools_Yes_Free,Origin
2,31,Completed (31),1317,Economics,-99,M.Sc. in Economics,-99,Data Scientist in Datacamp,-99,Brazil,...,-77,not quoted,not quoted,not quoted,not quoted,not quoted,-99,0,-99,https://ww2.unipark.de/uc/seml/
3,34,Completed (31),854,-99,Management,No,No,No,No,Brazil,...,70,not quoted,not quoted,quoted,not quoted,not quoted,-99,No,-99,-99
4,36,Completed (31),1593,Mathematics,Informatics,MSC Computer Science,PhD computer Science,Vários cursos in Coursera,-99,Brazil,...,60,quoted,not quoted,quoted,not quoted,not quoted,-99,"Yes, Please, specify",Own approach,-99
5,57,Completed (31),4238,Computer Science,Data science specialization,-99,-99,-99,-99,Germany,...,100,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99
6,46,Completed (31),2821,Actuarial Science,Post Graduation in Data Science,M Sc in Data Science -ML models,no Ph D,no other certifications,-99,Brazil,...,80,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99
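
The internals of `DataframeUtils` are not shown in this notebook, so the following is only a minimal sketch of what the formatting step described above might look like in plain pandas. The raw column names (`raw_id`, `raw_status`, `raw_degree`, `raw_extra`), the inline CSV, and the two lookup structures are hypothetical stand-ins for the real data and helper files.

```python
import io
import pandas as pd

# Hypothetical raw export; the real file is ./data/main_data.csv (';'-separated).
raw_csv = io.StringIO(
    "raw_id;raw_status;raw_degree;raw_extra\n"
    "31;Completed (31);Economics;x\n"
    "34;Suspended (22);-99;y\n"
)

# Hypothetical contents of the "unused columns" and "formatted columns" files.
unused_columns = ["raw_extra"]
formatted_columns = {
    "raw_id": "ID",
    "raw_status": "Status",
    "raw_degree": "D1_Undergraduation",
}

df = pd.read_csv(raw_csv, sep=";")
# Drop the columns we do not analyze, then give the rest clearer names.
df = df.drop(columns=unused_columns).rename(columns=formatted_columns)
print(df.columns.tolist())
```

Keeping the discard list and the rename mapping outside the code, as the notebook does with its two text files, means the same loader can presumably be reused across the notebooks for the other survey questions.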


In this research, we conservatively considered only respondents who fully completed the survey, so we discarded suspended submissions.

In [3]:
dataframe_obj.df.drop(dataframe_obj.df[dataframe_obj.df['Status'] == 'Suspended (22)'].index, inplace = True)

In [4]:
# dataframe preview
dataframe_obj.df.head()

Unnamed: 0,ID,Status,Duration,D1_Undergraduation,D1_Specialization,D1_Master,D1_Phd,D1_Courses,D1_Others,D2_Country,...,Q15_Model_Deploy_Production_Monitoring,Q16_Model_Monitor_Aspects_Input_And_Output,Q16_Model_Monitor_Aspects_Interpretability_Output,Q16_Model_Monitor_Aspects_Output_And_Decisions,Q16_Model_Monitor_Aspects_Fairness,Q16_Model_Monitor_Aspects_Others,Q16_Model_Monitor_Aspects_Others_Free,Q17_Automated_Machine_Learning_Tools_Yes_No,Q17_Automated_Machine_Learning_Tools_Yes_Free,Origin
2,31,Completed (31),1317,Economics,-99,M.Sc. in Economics,-99,Data Scientist in Datacamp,-99,Brazil,...,-77,not quoted,not quoted,not quoted,not quoted,not quoted,-99,0,-99,https://ww2.unipark.de/uc/seml/
3,34,Completed (31),854,-99,Management,No,No,No,No,Brazil,...,70,not quoted,not quoted,quoted,not quoted,not quoted,-99,No,-99,-99
4,36,Completed (31),1593,Mathematics,Informatics,MSC Computer Science,PhD computer Science,Vários cursos in Coursera,-99,Brazil,...,60,quoted,not quoted,quoted,not quoted,not quoted,-99,"Yes, Please, specify",Own approach,-99
5,57,Completed (31),4238,Computer Science,Data science specialization,-99,-99,-99,-99,Germany,...,100,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99
6,46,Completed (31),2821,Actuarial Science,Post Graduation in Data Science,M Sc in Data Science -ML models,no Ph D,no other certifications,-99,Brazil,...,80,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99


### Grounded Theory

Here we load a pandas DataFrame from an Excel file containing the codes and categories assigned to our qualitative responses.

In [5]:
qualitative_df = pd.read_excel('./data/Problems on ML Stages/MODEL DEPLOYMENT.xlsx')
qualitative_df.head()

Unnamed: 0,lfdn,MODEL DEPLOYMENT,TOPIC,CATEGORY
0,31,Show the model in a didatic way,MODEL INTERPRETATION AND EXPLAINABILITY,COMMUNICATION
1,36,Prepare production environment,PRODUCTION INFRASTRUCTURE (DEPLOY),INFRASTRUCTURE
2,46,what kind of deploy is better?,PRODUCTION INFRASTRUCTURE (DEPLOY),INFRASTRUCTURE
3,53,Not knowing how to deploy,PRODUCTION INFRASTRUCTURE (DEPLOY),INFRASTRUCTURE
4,58,Cost,FINANCIAL ISSUES,ORGANIZATION


### Basic Analysis

In [6]:
# get our valid respondents ids
respondents_ids = list(dataframe_obj.df['ID'])
respondents_ids = [int(respondent_id) for respondent_id in respondents_ids]
respondents_ids[:5]

[31, 34, 36, 57, 46]

In [8]:
# keep only the answers from respondents who completed the survey (suspended ids are dropped)
qualitative_df = qualitative_df[qualitative_df['lfdn'].isin(respondents_ids)][['MODEL DEPLOYMENT', 'TOPIC', 'CATEGORY']]
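
The notebook does not say why the ids are cast to `int` before filtering. One plausible reason is that `Series.isin` compares values as they are, so string ids would not match an integer `lfdn` column. The sketch below illustrates this on a toy dataframe; the data and the variable names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the coded answers; 'lfdn' holds integer respondent ids.
coded = pd.DataFrame({
    "lfdn": [31, 36, 999],
    "CATEGORY": ["COMMUNICATION", "INFRASTRUCTURE", "PEOPLE"],
})

# Suppose the valid ids were read as text (an assumption for this example).
valid_ids_as_text = ["31", "36"]
valid_ids = [int(i) for i in valid_ids_as_text]

# Integer ids do not match their string counterparts.
matches_text = int(coded["lfdn"].isin(valid_ids_as_text).sum())

# After the cast, only rows from valid respondents are kept.
kept = coded[coded["lfdn"].isin(valid_ids)]
print(matches_text, kept["lfdn"].tolist())
```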

In [9]:
# get a list with all the reported categories
category_column = list(qualitative_df['CATEGORY'])
category_column[:5]

['COMMUNICATION',
 'INFRASTRUCTURE',
 'INFRASTRUCTURE',
 'INFRASTRUCTURE',
 'ORGANIZATION']

In [10]:
# compute the total of answers that belong to each category
total_per_category_df = qualitative_df[['CATEGORY', 'TOPIC']].groupby(['CATEGORY']).count()
total_per_category_df

CATEGORY,TOPIC
COMMUNICATION,3
DATA,2
INFRASTRUCTURE,84
METHOD,25
ORGANIZATION,15
PEOPLE,6
PLANNING,3


In [11]:
# get the total of qualitative answers
total_answers = total_per_category_df['TOPIC'].sum()
print(f'The total of qualitative answers to be considered is {total_answers}')

The total of qualitative answers to be considered is 138
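
As a sanity check on the arithmetic, the category shares can be recomputed directly from the counts printed above. This minimal sketch hard-codes those counts (copied from the output of cell 10) instead of reading the Excel file; the variable names are illustrative.

```python
# Category counts copied from the groupby output above.
category_counts = {
    "COMMUNICATION": 3,
    "DATA": 2,
    "INFRASTRUCTURE": 84,
    "METHOD": 25,
    "ORGANIZATION": 15,
    "PEOPLE": 6,
    "PLANNING": 3,
}

total = sum(category_counts.values())

# Share of each category among all coded answers, in percent.
shares = {
    category: round(count / total * 100, 2)
    for category, count in category_counts.items()
}

for category, share in shares.items():
    print(f"{category}: {share}%")
```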


In [12]:
# we create two lists that contain, for each category, the total of qualitative answers and their percentage
category_percentage = []
category_total = []
# note that these lists contain repeated values, because they have one entry per row of the 'CATEGORY' column in the qualitative dataframe
for category in category_column:
    category_percentage.append(round(total_per_category_df.loc[category]['TOPIC'] / total_answers * 100, 2))
    category_total.append(total_per_category_df.loc[category]['TOPIC'])

In [13]:
# include two new columns to our qualitative dataframe
qualitative_df['CATEGORY_TOTAL'] = category_total
qualitative_df['CATEGORY_%'] = category_percentage
qualitative_df.head()

Unnamed: 0,MODEL DEPLOYMENT,TOPIC,CATEGORY,CATEGORY_TOTAL,CATEGORY_%
0,Show the model in a didatic way,MODEL INTERPRETATION AND EXPLAINABILITY,COMMUNICATION,3,2.17
1,Prepare production environment,PRODUCTION INFRASTRUCTURE (DEPLOY),INFRASTRUCTURE,84,60.87
2,what kind of deploy is better?,PRODUCTION INFRASTRUCTURE (DEPLOY),INFRASTRUCTURE,84,60.87
3,Not knowing how to deploy,PRODUCTION INFRASTRUCTURE (DEPLOY),INFRASTRUCTURE,84,60.87
4,Cost,FINANCIAL ISSUES,ORGANIZATION,15,10.87
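
The two columns above are built with an explicit loop. A vectorized alternative, shown here on a toy dataframe as a sketch and not as the notebook's actual method, is `groupby(...).transform('count')`, which returns one value per original row. The toy data is made up; only the column names follow the notebook.

```python
import pandas as pd

# Toy stand-in for the coded answers.
toy = pd.DataFrame({
    "TOPIC": ["A", "B", "B", "C"],
    "CATEGORY": ["COMMUNICATION", "INFRASTRUCTURE", "INFRASTRUCTURE", "ORGANIZATION"],
})

# Per-row category size: each row receives the count of its own category.
toy["CATEGORY_TOTAL"] = toy.groupby("CATEGORY")["TOPIC"].transform("count")

# Share of all coded answers that fall in the row's category, in percent.
total_answers = toy["TOPIC"].count()
toy["CATEGORY_%"] = (toy["CATEGORY_TOTAL"] / total_answers * 100).round(2)

print(toy)
```

Because `transform` keeps the original row order and index, the result can be assigned straight back as new columns without building intermediate lists.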


In [14]:
# now we group our results considering the category and the topic
qualitative_df = qualitative_df.groupby(['CATEGORY', 'CATEGORY_TOTAL', 'CATEGORY_%', 'TOPIC']).count()

In [15]:
# rename the remaining column, which is now a simple count representing the quantity of each category and topic
qualitative_df.columns = ['QUANTITY']
# add the percentage version of each quantity to improve readability
qualitative_df['QUANTITY_%'] = round(qualitative_df['QUANTITY'] / qualitative_df['QUANTITY'].sum() * 100, 2)
qualitative_df

CATEGORY,CATEGORY_TOTAL,CATEGORY_%,TOPIC,QUANTITY,QUANTITY_%
COMMUNICATION,3,2.17,MODEL INTERPRETATION AND EXPLAINABILITY,3,2.17
DATA,2,1.45,DATA UNAVAILABLE,1,0.72
DATA,2,1.45,INSUFFICIENT DATA,1,0.72
INFRASTRUCTURE,84,60.87,COMPUTATIONAL CONSTRAINTS,3,2.17
INFRASTRUCTURE,84,60.87,HARD TO INTEGRATE WITH OLD APPLICATIONS,10,7.25
INFRASTRUCTURE,84,60.87,INFRASTRUCTURE,6,4.35
INFRASTRUCTURE,84,60.87,MODEL MONITORING,1,0.72
INFRASTRUCTURE,84,60.87,MODEL UPDATE,3,2.17
INFRASTRUCTURE,84,60.87,PRODUCTION INFRASTRUCTURE (DEPLOY),59,42.75
INFRASTRUCTURE,84,60.87,UPDATING THE LIBRARIES USED,1,0.72
