# Question Q4: Practitioners' Other Problems

*Question*: According to your personal experience, please outline the main problems or difficulties (up to three) faced during each of the seven ML life cycle stages.

*Answer Type*: Free Text

### Necessary Libraries

In [1]:
import pandas as pd
from utils.basic import rename_values
from utils.dataframe import DataframeUtils
from utils.plot import PlotUtils
from utils.bootstrapping import BootstrappingUtils

### Dataframe Initialization

Here we build a formatted Pandas DataFrame from the raw CSV containing the survey responses. During formatting, we discard unused columns and rename the remaining ones to clearer names. We created two helper classes that manage this dataframe and produce meaningful charts.

In [2]:
dataframe_obj = DataframeUtils(df_path='./data/main_data.csv', sep=';',
                               to_discard_columns_file='./data/unused_columns.txt',
                               to_format_columns_file='./data/formatted_columns.txt')
# dataframe preview
dataframe_obj.df.head()

Unnamed: 0,ID,Status,Duration,D1_Undergraduation,D1_Specialization,D1_Master,D1_Phd,D1_Courses,D1_Others,D2_Country,...,Q15_Model_Deploy_Production_Monitoring,Q16_Model_Monitor_Aspects_Input_And_Output,Q16_Model_Monitor_Aspects_Interpretability_Output,Q16_Model_Monitor_Aspects_Output_And_Decisions,Q16_Model_Monitor_Aspects_Fairness,Q16_Model_Monitor_Aspects_Others,Q16_Model_Monitor_Aspects_Others_Free,Q17_Automated_Machine_Learning_Tools_Yes_No,Q17_Automated_Machine_Learning_Tools_Yes_Free,Origin
2,31,Completed (31),1317,Economics,-99,M.Sc. in Economics,-99,Data Scientist in Datacamp,-99,Brazil,...,-77,not quoted,not quoted,not quoted,not quoted,not quoted,-99,0,-99,https://ww2.unipark.de/uc/seml/
3,34,Completed (31),854,-99,Management,No,No,No,No,Brazil,...,70,not quoted,not quoted,quoted,not quoted,not quoted,-99,No,-99,-99
4,36,Completed (31),1593,Mathematics,Informatics,MSC Computer Science,PhD computer Science,Vários cursos in Coursera,-99,Brazil,...,60,quoted,not quoted,quoted,not quoted,not quoted,-99,"Yes, Please, specify",Own approach,-99
5,57,Completed (31),4238,Computer Science,Data science specialization,-99,-99,-99,-99,Germany,...,100,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99
6,46,Completed (31),2821,Actuarial Science,Post Graduation in Data Science,M Sc in Data Science -ML models,no Ph D,no other certifications,-99,Brazil,...,80,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99


In this research, we conservatively considered only the respondents who fully completed the survey, so we discarded suspended submissions.

In [3]:
dataframe_obj.df.drop(dataframe_obj.df[dataframe_obj.df['Status'] == 'Suspended (22)'].index, inplace = True)

In [4]:
# dataframe preview
dataframe_obj.df.head()

Unnamed: 0,ID,Status,Duration,D1_Undergraduation,D1_Specialization,D1_Master,D1_Phd,D1_Courses,D1_Others,D2_Country,...,Q15_Model_Deploy_Production_Monitoring,Q16_Model_Monitor_Aspects_Input_And_Output,Q16_Model_Monitor_Aspects_Interpretability_Output,Q16_Model_Monitor_Aspects_Output_And_Decisions,Q16_Model_Monitor_Aspects_Fairness,Q16_Model_Monitor_Aspects_Others,Q16_Model_Monitor_Aspects_Others_Free,Q17_Automated_Machine_Learning_Tools_Yes_No,Q17_Automated_Machine_Learning_Tools_Yes_Free,Origin
2,31,Completed (31),1317,Economics,-99,M.Sc. in Economics,-99,Data Scientist in Datacamp,-99,Brazil,...,-77,not quoted,not quoted,not quoted,not quoted,not quoted,-99,0,-99,https://ww2.unipark.de/uc/seml/
3,34,Completed (31),854,-99,Management,No,No,No,No,Brazil,...,70,not quoted,not quoted,quoted,not quoted,not quoted,-99,No,-99,-99
4,36,Completed (31),1593,Mathematics,Informatics,MSC Computer Science,PhD computer Science,Vários cursos in Coursera,-99,Brazil,...,60,quoted,not quoted,quoted,not quoted,not quoted,-99,"Yes, Please, specify",Own approach,-99
5,57,Completed (31),4238,Computer Science,Data science specialization,-99,-99,-99,-99,Germany,...,100,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99
6,46,Completed (31),2821,Actuarial Science,Post Graduation in Data Science,M Sc in Data Science -ML models,no Ph D,no other certifications,-99,Brazil,...,80,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99


### Grounded Theory

Here we load a Pandas DataFrame from an Excel file containing the codes (topics) and categories assigned to our qualitative responses.

In [5]:
qualitative_df = pd.read_excel('./data/Problems on ML Stages/OTHER PROBLEMS.xlsx')
qualitative_df.head()

Unnamed: 0,lfdn,OTHER PROBLEMS,TOPIC,CATEGORY
0,72,Understand statistics behind the models,LACK OF DATA SCIENCE KNOWLEDGE,PEOPLE
1,86,team integration,TEAM INTEGRATION,ORGANIZATION
2,113,Expectation management with future users,MANAGING EXPECTATIONS,METHOD
3,170,Updating of current models,MODEL UPDATE,METHOD
4,240,managers who want quick and reliable solutions,MANAGING EXPECTATIONS,METHOD


### Basic Analysis

In [6]:
# get the IDs of our valid (non-suspended) respondents as plain ints
respondents_ids = [int(respondent_id) for respondent_id in dataframe_obj.df['ID']]
respondents_ids[:5]

[31, 34, 36, 57, 46]

In [7]:
# keep only the answers whose respondent id is valid (not suspended), and only the columns we need
qualitative_df = qualitative_df[qualitative_df['lfdn'].isin(respondents_ids)][['OTHER PROBLEMS', 'TOPIC', 'CATEGORY']]
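
As a minimal sketch of how this filter behaves, the toy example below applies the same `isin` mask to a hypothetical three-row table. The IDs, topics, and the names `toy_codes` and `valid_ids` are invented for illustration and are not survey data.

```python
import pandas as pd

# Hypothetical coded answers; 'lfdn' plays the role of the respondent id
toy_codes = pd.DataFrame({
    'lfdn': [10, 11, 12],
    'TOPIC': ['A', 'B', 'C'],
})
valid_ids = [10, 12]  # hypothetical ids of completed submissions

# Boolean mask: True where the row's lfdn appears among the valid ids
mask = toy_codes['lfdn'].isin(valid_ids)
kept = toy_codes[mask]      # rows from valid respondents
dropped = toy_codes[~mask]  # rows that the filter removes
```

Inspecting `dropped` before discarding it is a cheap way to confirm that only answers from suspended submissions are lost.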

In [8]:
# get a list with all the reported categories
category_column = list(qualitative_df['CATEGORY'])
category_column[:5]

['PEOPLE', 'ORGANIZATION', 'METHOD', 'METHOD', 'METHOD']

In [9]:
# compute the total of answers that belong to each category
total_per_category_df = qualitative_df[['CATEGORY', 'TOPIC']].groupby(['CATEGORY']).count()
total_per_category_df

CATEGORY,TOPIC
DATA,2
METHOD,13
ORGANIZATION,4
PEOPLE,2


In [10]:
# get the total of qualitative answers
total_answers = total_per_category_df['TOPIC'].sum()
print(f'The total of qualitative answers to be considered is {total_answers}')

The total of qualitative answers to be considered is 21


In [11]:
# build two lists that hold, for each row, the number of answers in its category and that number as a percentage
category_percentage = []
category_total = []
# note that values repeat, since each list has the same length as the 'CATEGORY' column of the qualitative dataframe
for category in category_column:
    category_percentage.append(round(total_per_category_df.loc[category]['TOPIC'] / total_answers * 100, 2))
    category_total.append(total_per_category_df.loc[category]['TOPIC'])
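
The loop above broadcasts each category's total back onto every row. A possible vectorized equivalent, shown here on hypothetical toy rows rather than the survey data, uses `groupby(...).transform('count')`. This sketch assumes the `TOPIC` column has no missing values, so that the number of rows equals the total number of answers.

```python
import pandas as pd

# Toy stand-in for qualitative_df (hypothetical rows, not the survey data)
toy_df = pd.DataFrame({
    'TOPIC': ['T1', 'T2', 'T3', 'T4'],
    'CATEGORY': ['PEOPLE', 'METHOD', 'METHOD', 'METHOD'],
})

# For every row, the number of answers that share the row's category
toy_df['CATEGORY_TOTAL'] = toy_df.groupby('CATEGORY')['TOPIC'].transform('count')
# The same number as a percentage of all answers, rounded to two decimals
toy_df['CATEGORY_%'] = (toy_df['CATEGORY_TOTAL'] / len(toy_df) * 100).round(2)
```

`transform` returns a result aligned with the original index, so the new columns can be assigned directly without intermediate lists.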

In [12]:
# add two new columns to our qualitative dataframe
qualitative_df['CATEGORY_TOTAL'] = category_total
qualitative_df['CATEGORY_%'] = category_percentage
qualitative_df.head()

Unnamed: 0,OTHER PROBLEMS,TOPIC,CATEGORY,CATEGORY_TOTAL,CATEGORY_%
0,Understand statistics behind the models,LACK OF DATA SCIENCE KNOWLEDGE,PEOPLE,2,9.52
1,team integration,TEAM INTEGRATION,ORGANIZATION,4,19.05
2,Expectation management with future users,MANAGING EXPECTATIONS,METHOD,13,61.9
3,Updating of current models,MODEL UPDATE,METHOD,13,61.9
4,managers who want quick and reliable solutions,MANAGING EXPECTATIONS,METHOD,13,61.9
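
The `CATEGORY_%` values can be re-derived by hand from the per-category totals reported earlier (2, 13, 4, and 2 answers out of 21). The snippet below is only a sanity check; the dictionary name `category_totals` is ours and does not appear elsewhere in the notebook.

```python
# Per-category totals copied from the groupby output above
category_totals = {'DATA': 2, 'METHOD': 13, 'ORGANIZATION': 4, 'PEOPLE': 2}

# Total number of qualitative answers considered
total = sum(category_totals.values())

# Percentage of answers per category, rounded to two decimals as in the notebook
percentages = {
    category: round(count / total * 100, 2)
    for category, count in category_totals.items()
}
# percentages == {'DATA': 9.52, 'METHOD': 61.9, 'ORGANIZATION': 19.05, 'PEOPLE': 9.52}
```

Because each value is rounded independently, the four percentages add up to 99.99 rather than exactly 100.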


In [13]:
# now we group our results considering the category and the topic
qualitative_df = qualitative_df.groupby(['CATEGORY', 'CATEGORY_TOTAL', 'CATEGORY_%', 'TOPIC']).count()

In [14]:
# rename the remaining column, which is now a simple count representing the quantity of each category and topic
qualitative_df.columns = ['QUANTITY']
# add the percentage version of each quantity to improve readability
qualitative_df['QUANTITY_%'] = round(qualitative_df['QUANTITY'] / qualitative_df['QUANTITY'].sum() * 100, 2)
qualitative_df

CATEGORY,CATEGORY_TOTAL,CATEGORY_%,TOPIC,QUANTITY,QUANTITY_%
DATA,2,9.52,DATA QUALITY,2,9.52
METHOD,13,61.9,AUTOMATING MODEL TRAINING AND DEPLOYMENT,1,4.76
METHOD,13,61.9,BUILD PIPELINES,1,4.76
METHOD,13,61.9,DOCUMENTATION,1,4.76
METHOD,13,61.9,FEATURE SELECTION,1,4.76
METHOD,13,61.9,MANAGING EXPECTATIONS,4,19.05
METHOD,13,61.9,MODEL SELECTION,1,4.76
METHOD,13,61.9,MODEL UPDATE,2,9.52
METHOD,13,61.9,NOT USING AUTOML,1,4.76
METHOD,13,61.9,VALIDATE MODEL RESULTS,1,4.76
