# Third-Ranked Problems Most Likely to Lead to Project Failure

*Question*: Regarding the problems listed in the previous question, 
please rank the ones (up to three) that most likely lead to 
overall project failure.

*Answer Type*: Free Text

### Necessary Libraries

In [1]:
import pandas as pd
from utils.basic import rename_values
from utils.dataframe import DataframeUtils
from utils.plot import PlotUtils
from utils.bootstrapping import BootstrappingUtils

### Dataframe Initialization

Here we build a formatted Pandas DataFrame from a raw CSV containing the survey responses. During formatting, we discard unused columns and rename the remaining ones to clearer names. We created two classes that help us manage this dataframe and create meaningful charts.

In [2]:
dataframe_obj = DataframeUtils(df_path='./data/main_data.csv', sep=';',
                               to_discard_columns_file='./data/unused_columns.txt',
                               to_format_columns_file='./data/formatted_columns.txt')
# dataframe preview
dataframe_obj.df.head()

Unnamed: 0,ID,Status,Duration,D1_Undergraduation,D1_Specialization,D1_Master,D1_Phd,D1_Courses,D1_Others,D2_Country,...,Q15_Model_Deploy_Production_Monitoring,Q16_Model_Monitor_Aspects_Input_And_Output,Q16_Model_Monitor_Aspects_Interpretability_Output,Q16_Model_Monitor_Aspects_Output_And_Decisions,Q16_Model_Monitor_Aspects_Fairness,Q16_Model_Monitor_Aspects_Others,Q16_Model_Monitor_Aspects_Others_Free,Q17_Automated_Machine_Learning_Tools_Yes_No,Q17_Automated_Machine_Learning_Tools_Yes_Free,Origin
2,31,Completed (31),1317,Economics,-99,M.Sc. in Economics,-99,Data Scientist in Datacamp,-99,Brazil,...,-77,not quoted,not quoted,not quoted,not quoted,not quoted,-99,0,-99,https://ww2.unipark.de/uc/seml/
3,34,Completed (31),854,-99,Management,No,No,No,No,Brazil,...,70,not quoted,not quoted,quoted,not quoted,not quoted,-99,No,-99,-99
4,36,Completed (31),1593,Mathematics,Informatics,MSC Computer Science,PhD computer Science,Vários cursos in Coursera,-99,Brazil,...,60,quoted,not quoted,quoted,not quoted,not quoted,-99,"Yes, Please, specify",Own approach,-99
5,57,Completed (31),4238,Computer Science,Data science specialization,-99,-99,-99,-99,Germany,...,100,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99
6,46,Completed (31),2821,Actuarial Science,Post Graduation in Data Science,M Sc in Data Science -ML models,no Ph D,no other certifications,-99,Brazil,...,80,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99
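
The internals of `DataframeUtils` are not shown in this notebook, so the following is only a minimal sketch of the formatting step described above (read a semicolon-separated CSV, discard unused columns, rename the rest). The function `load_formatted_csv` and the raw column names (`lfdn`, `dispcode`, `v_1`) are hypothetical and used purely for illustration.

```python
import io

import pandas as pd


def load_formatted_csv(csv_source, sep, to_discard, to_rename):
    """Hypothetical stand-in for the formatting step; not the real DataframeUtils."""
    df = pd.read_csv(csv_source, sep=sep)
    # Drop the columns we do not analyze; ignore names that are absent.
    df = df.drop(columns=to_discard, errors="ignore")
    # Give the remaining columns clearer names.
    return df.rename(columns=to_rename)


# Tiny in-memory CSV with made-up raw column names (assumption).
raw_csv = io.StringIO("lfdn;dispcode;v_1\n31;31;Brazil\n34;31;Germany\n")
example_df = load_formatted_csv(
    raw_csv,
    sep=";",
    to_discard=["dispcode"],
    to_rename={"lfdn": "ID", "v_1": "D2_Country"},
)
print(example_df)
```

In the notebook itself, the lists of columns to discard and to rename come from the two text files passed to `DataframeUtils`.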


In [3]:
# For PROFES, we discard suspended submissions (i.e., respondents who did not complete the survey).
dataframe_obj.df.drop(dataframe_obj.df[dataframe_obj.df['Status'] == 'Suspended (22)'].index, inplace = True)
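
The same filtering can also be written with a boolean mask, which some readers may find easier to follow than an in-place `drop`. The sketch below uses a toy dataframe (the ID `99` is made up) and keeps every row whose `Status` is not `'Suspended (22)'`.

```python
import pandas as pd

# Toy data for illustration only; the real dataframe has many more columns.
toy_df = pd.DataFrame(
    {
        "ID": [31, 34, 99],
        "Status": ["Completed (31)", "Completed (31)", "Suspended (22)"],
    }
)

# Keep only the rows that are not suspended; the original index is preserved.
completed_df = toy_df[toy_df["Status"] != "Suspended (22)"]
print(completed_df)
```

Unlike `drop(..., inplace=True)`, this returns a new dataframe and leaves `toy_df` untouched.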

In [4]:
# dataframe preview
dataframe_obj.df.head()

Unnamed: 0,ID,Status,Duration,D1_Undergraduation,D1_Specialization,D1_Master,D1_Phd,D1_Courses,D1_Others,D2_Country,...,Q15_Model_Deploy_Production_Monitoring,Q16_Model_Monitor_Aspects_Input_And_Output,Q16_Model_Monitor_Aspects_Interpretability_Output,Q16_Model_Monitor_Aspects_Output_And_Decisions,Q16_Model_Monitor_Aspects_Fairness,Q16_Model_Monitor_Aspects_Others,Q16_Model_Monitor_Aspects_Others_Free,Q17_Automated_Machine_Learning_Tools_Yes_No,Q17_Automated_Machine_Learning_Tools_Yes_Free,Origin
2,31,Completed (31),1317,Economics,-99,M.Sc. in Economics,-99,Data Scientist in Datacamp,-99,Brazil,...,-77,not quoted,not quoted,not quoted,not quoted,not quoted,-99,0,-99,https://ww2.unipark.de/uc/seml/
3,34,Completed (31),854,-99,Management,No,No,No,No,Brazil,...,70,not quoted,not quoted,quoted,not quoted,not quoted,-99,No,-99,-99
4,36,Completed (31),1593,Mathematics,Informatics,MSC Computer Science,PhD computer Science,Vários cursos in Coursera,-99,Brazil,...,60,quoted,not quoted,quoted,not quoted,not quoted,-99,"Yes, Please, specify",Own approach,-99
5,57,Completed (31),4238,Computer Science,Data science specialization,-99,-99,-99,-99,Germany,...,100,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99
6,46,Completed (31),2821,Actuarial Science,Post Graduation in Data Science,M Sc in Data Science -ML models,no Ph D,no other certifications,-99,Brazil,...,80,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99


### Grounded Theory

Here we get a formatted Pandas DataFrame from an Excel file containing the codes and categorization of our qualitative responses.

In [5]:
qualitative_df = pd.read_excel('./data/Problems on ML Stages/PROJECT FAILURE - THIRD.xlsx')
qualitative_df.head()

Unnamed: 0,lfdn,Q5 - THIRD,TOPIC,CATEGORY
0,31,Search the appropriate methodology,MODEL CREATION AND SELECTION,METHOD
1,36,Selecionar of learning algo,MODEL CREATION AND SELECTION,METHOD
2,46,present and discuss metrics and distribution o...,CHOOSE AND EVALUATE THE METRICS,METHOD
3,53,Not knowing how to deploy,DEPLOYMENT INFRASTRUCTURE,INFRASTRUCTURE
4,58,Deployment Costs for non-trivial ML projects,DEPLOYMENT INFRASTRUCTURE,INFRASTRUCTURE


### Basic Analysis

In [6]:
# get our respondents ids
respondents_ids = list(dataframe_obj.df['ID'])
respondents_ids = [int(respondent_id) for respondent_id in respondents_ids]
respondents_ids[:5]

[31, 34, 36, 57, 46]

In [7]:
# keep only the answers from valid respondents, then drop the id column
qualitative_df = qualitative_df[qualitative_df['lfdn'].isin(respondents_ids)][['Q5 - THIRD', 'TOPIC', 'CATEGORY']]
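
To make the effect of this step concrete, here is a small sketch on toy data. The respondent id `999` and its row are invented for illustration; the column names match the qualitative dataframe above.

```python
import pandas as pd

# Toy coded answers; the last row belongs to a respondent we want to exclude.
coded_df = pd.DataFrame(
    {
        "lfdn": [31, 36, 999],
        "Q5 - THIRD": ["answer a", "answer b", "answer c"],
        "TOPIC": ["MODEL CREATION AND SELECTION"] * 3,
        "CATEGORY": ["METHOD"] * 3,
    }
)
valid_ids = [31, 34, 36]

# Boolean mask: True where the respondent id is in the list of valid ids.
mask = coded_df["lfdn"].isin(valid_ids)
# Select the matching rows and only the columns needed for the analysis.
filtered_df = coded_df.loc[mask, ["Q5 - THIRD", "TOPIC", "CATEGORY"]]
print(filtered_df)
```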

In [8]:
# get a list with all the reported categories
category_column = list(qualitative_df['CATEGORY'])
category_column[:5]

['METHOD', 'METHOD', 'METHOD', 'INFRASTRUCTURE', 'INFRASTRUCTURE']

In [9]:
# compute the total of answers that belong to each category
total_per_category_df = qualitative_df[['CATEGORY', 'TOPIC']].groupby(['CATEGORY']).count()
total_per_category_df

Unnamed: 0_level_0,TOPIC
CATEGORY,Unnamed: 1_level_1
DATA,23
INFRASTRUCTURE,25
INPUT,8
METHOD,40
ORGANIZATION,7
PEOPLE,7


In [10]:
# get the total of qualitative answers
total_answers = total_per_category_df['TOPIC'].sum()
print(f'The total of qualitative answers to be considered is {total_answers}')

The total of qualitative answers to be considered is 110
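
As a quick sanity check, the per-category counts reported above can be summed and converted to percentages by hand. The sketch below hard-codes those counts instead of reading the Excel file.

```python
import pandas as pd

# Counts copied from the per-category table above.
category_counts = pd.Series(
    {
        "DATA": 23,
        "INFRASTRUCTURE": 25,
        "INPUT": 8,
        "METHOD": 40,
        "ORGANIZATION": 7,
        "PEOPLE": 7,
    }
)

total = category_counts.sum()
# Share of each category among all answers, in percent.
shares = (category_counts / total * 100).round(2)
print(total)
print(shares)
```

The total matches the 110 answers printed above, and METHOD accounts for 40 / 110, or about 36.36 % of them.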


In [11]:
# build two lists that hold, for each row, the answer count and percentage of that row's category
category_percentage = []
category_total = []
# values repeat because both lists have the same length as the 'CATEGORY' column of the qualitative dataframe
for category in category_column:
    category_percentage.append(round(total_per_category_df.loc[category]['TOPIC'] / total_answers * 100, 2))
    category_total.append(total_per_category_df.loc[category]['TOPIC'])

In [12]:
# include two new columns to our qualitative dataframe
qualitative_df['CATEGORY_TOTAL'] = category_total
qualitative_df['CATEGORY_%'] = category_percentage
qualitative_df.head()

Unnamed: 0,Q5 - THIRD,TOPIC,CATEGORY,CATEGORY_TOTAL,CATEGORY_%
0,Search the appropriate methodology,MODEL CREATION AND SELECTION,METHOD,40,36.36
1,Selecionar of learning algo,MODEL CREATION AND SELECTION,METHOD,40,36.36
2,present and discuss metrics and distribution o...,CHOOSE AND EVALUATE THE METRICS,METHOD,40,36.36
3,Not knowing how to deploy,DEPLOYMENT INFRASTRUCTURE,INFRASTRUCTURE,25,22.73
4,Deployment Costs for non-trivial ML projects,DEPLOYMENT INFRASTRUCTURE,INFRASTRUCTURE,25,22.73
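
The two columns can also be derived without an explicit loop. The following is a possible vectorized alternative on toy data, not the approach used in this notebook; it assumes no row has a missing `TOPIC`, so counting rows per category gives the same totals as counting topics.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame(
    {
        "CATEGORY": ["METHOD", "METHOD", "DATA"],
        "TOPIC": ["A", "B", "C"],
    }
)

# Number of rows per category, indexed by category name.
counts = toy["CATEGORY"].value_counts()
# Broadcast each category's total back onto its rows.
toy["CATEGORY_TOTAL"] = toy["CATEGORY"].map(counts)
toy["CATEGORY_%"] = (toy["CATEGORY_TOTAL"] / len(toy) * 100).round(2)
print(toy)
```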


In [13]:
# now we group our results considering the category and the topic
qualitative_df = qualitative_df.groupby(['CATEGORY', 'CATEGORY_TOTAL', 'CATEGORY_%', 'TOPIC']).count()

In [14]:
# rename the remaining column, which is now a simple count representing the quantity of each category and topic
qualitative_df.columns = ['QUANTITY']
# add the percentage version of each quantity to improve readability
qualitative_df['QUANTITY_%'] = round(qualitative_df['QUANTITY'] / qualitative_df['QUANTITY'].sum() * 100, 2)
qualitative_df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,QUANTITY,QUANTITY_%
CATEGORY,CATEGORY_TOTAL,CATEGORY_%,TOPIC,Unnamed: 4_level_1,Unnamed: 5_level_1
DATA,23,20.91,DATA ACCESSIBILITY,1,0.91
DATA,23,20.91,DATA AVAILABILITY,1,0.91
DATA,23,20.91,DATA COLLECTION,3,2.73
DATA,23,20.91,DATA PREPROCESSING,9,8.18
DATA,23,20.91,DATA QUALITY,4,3.64
DATA,23,20.91,INSUFFICIENT DATA,5,4.55
INFRASTRUCTURE,25,22.73,COMPUTATIONAL CONSTRAINTS,5,4.55
INFRASTRUCTURE,25,22.73,DEPLOYMENT INFRASTRUCTURE,15,13.64
INFRASTRUCTURE,25,22.73,INTEGRATION,1,0.91
INFRASTRUCTURE,25,22.73,MODEL MONITORING,2,1.82
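
For readers who want to reproduce the topic-level table on their own data, here is a minimal sketch of counting answers per category and topic. It uses `groupby(...).size()` on toy rows rather than the `count()` call above, which is an equivalent choice only when there are no missing values.

```python
import pandas as pd

# Toy coded answers for illustration only.
toy = pd.DataFrame(
    {
        "CATEGORY": ["DATA", "DATA", "DATA", "METHOD"],
        "TOPIC": ["DATA QUALITY", "DATA QUALITY", "DATA COLLECTION", "CODE QUALITY"],
    }
)

# One row per (category, topic) pair with the number of answers.
topic_counts = toy.groupby(["CATEGORY", "TOPIC"]).size().to_frame("QUANTITY")
# Share of each topic among all answers, in percent.
topic_counts["QUANTITY_%"] = (
    topic_counts["QUANTITY"] / topic_counts["QUANTITY"].sum() * 100
).round(2)
print(topic_counts)
```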


In [15]:
# list each topic with its share of all answers, indexed by topic name only
df = qualitative_df['QUANTITY_%'].copy()
df.index = qualitative_df.index.get_level_values('TOPIC')
for key, val in df.items():
    print(f"{key.lower()} ({val} %)")

data accessibility (0.91 %)
data availability (0.91 %)
data collection (2.73 %)
data preprocessing (8.18 %)
data quality (3.64 %)
insufficient data (4.55 %)
computational constraints (4.55 %)
deployment infrastructure (13.64 %)
integration (0.91 %)
model monitoring (1.82 %)
scalability (1.82 %)
incomplete/incorrect requirements (3.64 %)
lack of domain knowledge (0.91 %)
problem understanding (1.82 %)
unclear goals (0.91 %)
build pipelines (1.82 %)
choose and evaluate the metrics (11.82 %)
code quality (0.91 %)
communication (0.91 %)
feature engineering (0.91 %)
hyperparameter tuning (2.73 %)
model creation and selection (10.91 %)
model scalability (0.91 %)
model update (1.82 %)
overfitting and underfitting (1.82 %)
ui (0.91 %)
user acceptance test (0.91 %)
change management (0.91 %)
definition of roles and responsibilities (0.91 %)
financial issues (0.91 %)
lack of time (0.91 %)
low client/domain expert availability/engagement (1.82 %)
team integration (0.91 %)
lack of data science kno