# Second-Ranked Problems That Most Likely Lead to Project Failure

*Question*: Regarding the problems listed in the previous question, please rank the ones (up to three) that most likely lead to overall project failure.

*Answer Type*: Free Text

### Necessary Libraries

In [1]:
import pandas as pd
from utils.basic import rename_values
from utils.dataframe import DataframeUtils
from utils.plot import PlotUtils
from utils.bootstrapping import BootstrappingUtils

### Dataframe Initialization

Here we build a formatted Pandas DataFrame from the raw CSV containing the survey responses. During formatting, we discard unused columns and rename the remaining ones to clearer names. We created two classes that help us manage this dataframe and create meaningful charts.

In [2]:
dataframe_obj = DataframeUtils(df_path='./data/main_data.csv', sep=';',
                               to_discard_columns_file='./data/unused_columns.txt',
                               to_format_columns_file='./data/formatted_columns.txt')
# dataframe preview
dataframe_obj.df.head()

Unnamed: 0,ID,Status,Duration,D1_Undergraduation,D1_Specialization,D1_Master,D1_Phd,D1_Courses,D1_Others,D2_Country,...,Q15_Model_Deploy_Production_Monitoring,Q16_Model_Monitor_Aspects_Input_And_Output,Q16_Model_Monitor_Aspects_Interpretability_Output,Q16_Model_Monitor_Aspects_Output_And_Decisions,Q16_Model_Monitor_Aspects_Fairness,Q16_Model_Monitor_Aspects_Others,Q16_Model_Monitor_Aspects_Others_Free,Q17_Automated_Machine_Learning_Tools_Yes_No,Q17_Automated_Machine_Learning_Tools_Yes_Free,Origin
2,31,Completed (31),1317,Economics,-99,M.Sc. in Economics,-99,Data Scientist in Datacamp,-99,Brazil,...,-77,not quoted,not quoted,not quoted,not quoted,not quoted,-99,0,-99,https://ww2.unipark.de/uc/seml/
3,34,Completed (31),854,-99,Management,No,No,No,No,Brazil,...,70,not quoted,not quoted,quoted,not quoted,not quoted,-99,No,-99,-99
4,36,Completed (31),1593,Mathematics,Informatics,MSC Computer Science,PhD computer Science,Vários cursos in Coursera,-99,Brazil,...,60,quoted,not quoted,quoted,not quoted,not quoted,-99,"Yes, Please, specify",Own approach,-99
5,57,Completed (31),4238,Computer Science,Data science specialization,-99,-99,-99,-99,Germany,...,100,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99
6,46,Completed (31),2821,Actuarial Science,Post Graduation in Data Science,M Sc in Data Science -ML models,no Ph D,no other certifications,-99,Brazil,...,80,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99
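
The internals of `DataframeUtils` are not shown in this notebook. As a rough illustration of the "discard unused columns, rename the rest" step described above, here is a minimal sketch in plain pandas. The helper `load_survey`, the raw column names (`lfdn`, `dispcode`, `v_1`, `unused`) and the inline CSV are all assumptions made for this example, not the project's actual code or data.

```python
import io

import pandas as pd


def load_survey(csv_source, sep, discard_columns, rename_map):
    """Hypothetical helper: read a delimited file, drop unused columns, rename the rest."""
    df = pd.read_csv(csv_source, sep=sep)
    # errors="ignore" skips names that are not present in the file
    df = df.drop(columns=discard_columns, errors="ignore")
    return df.rename(columns=rename_map)


# Made-up raw export with a ';' separator, mirroring the sep=';' used above
raw_csv = io.StringIO(
    "lfdn;dispcode;v_1;unused\n"
    "31;Completed (31);Brazil;x\n"
    "34;Completed (31);Brazil;y\n"
)

survey_df = load_survey(
    raw_csv,
    sep=";",
    discard_columns=["unused"],
    rename_map={"lfdn": "ID", "dispcode": "Status", "v_1": "D2_Country"},
)
print(survey_df.columns.tolist())
```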


In [3]:
# For PROFES, we discarded suspended submissions (i.e., respondents who did not complete the survey).
dataframe_obj.df.drop(dataframe_obj.df[dataframe_obj.df['Status'] == 'Suspended (22)'].index, inplace = True)
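
The same filter can also be written as a boolean mask that returns a new dataframe instead of dropping rows in place. The sketch below uses a small made-up dataframe (the ID `99` and its status row are invented for illustration); only the `Status` values match the ones seen in this notebook.

```python
import pandas as pd

# Toy data: two completed submissions and one suspended submission (invented)
toy_df = pd.DataFrame({
    "ID": [31, 34, 99],
    "Status": ["Completed (31)", "Completed (31)", "Suspended (22)"],
})

# Keep every row whose status is not 'Suspended (22)'
completed_df = toy_df[toy_df["Status"] != "Suspended (22)"].copy()
print(completed_df["ID"].tolist())
```

The mask form leaves the original dataframe untouched, which can make it easier to re-run a cell without side effects.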

In [4]:
# dataframe preview
dataframe_obj.df.head()

Unnamed: 0,ID,Status,Duration,D1_Undergraduation,D1_Specialization,D1_Master,D1_Phd,D1_Courses,D1_Others,D2_Country,...,Q15_Model_Deploy_Production_Monitoring,Q16_Model_Monitor_Aspects_Input_And_Output,Q16_Model_Monitor_Aspects_Interpretability_Output,Q16_Model_Monitor_Aspects_Output_And_Decisions,Q16_Model_Monitor_Aspects_Fairness,Q16_Model_Monitor_Aspects_Others,Q16_Model_Monitor_Aspects_Others_Free,Q17_Automated_Machine_Learning_Tools_Yes_No,Q17_Automated_Machine_Learning_Tools_Yes_Free,Origin
2,31,Completed (31),1317,Economics,-99,M.Sc. in Economics,-99,Data Scientist in Datacamp,-99,Brazil,...,-77,not quoted,not quoted,not quoted,not quoted,not quoted,-99,0,-99,https://ww2.unipark.de/uc/seml/
3,34,Completed (31),854,-99,Management,No,No,No,No,Brazil,...,70,not quoted,not quoted,quoted,not quoted,not quoted,-99,No,-99,-99
4,36,Completed (31),1593,Mathematics,Informatics,MSC Computer Science,PhD computer Science,Vários cursos in Coursera,-99,Brazil,...,60,quoted,not quoted,quoted,not quoted,not quoted,-99,"Yes, Please, specify",Own approach,-99
5,57,Completed (31),4238,Computer Science,Data science specialization,-99,-99,-99,-99,Germany,...,100,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99
6,46,Completed (31),2821,Actuarial Science,Post Graduation in Data Science,M Sc in Data Science -ML models,no Ph D,no other certifications,-99,Brazil,...,80,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99


### Grounded Theory

Here we get a formatted Pandas DataFrame from an Excel file containing the codes and categorization of our qualitative responses.

In [5]:
qualitative_df = pd.read_excel('./data/Problems on ML Stages/PROJECT FAILURE - SECOND.xlsx')
qualitative_df.head()

Unnamed: 0,lfdn,Q5 - SECOND,TOPIC,CATEGORY
0,31,Others tasks which competes the time,LACK OF TIME,ORGANIZATION
1,36,Prediction Task identification,MODEL SELECTION AND CREATION,METHOD
2,46,we need to cut or we need to cluster some kind...,PROBLEM UNDERSTANDING,INPUT
3,53,apply the models,MODEL SELECTION AND CREATION,METHOD
4,58,Sufficient Data Quantity,INSUFFICIENT DATA,DATA


### Basic Analysis

In [6]:
# get our respondents ids
respondents_ids = list(dataframe_obj.df['ID'])
respondents_ids = [int(respondent_id) for respondent_id in respondents_ids]
respondents_ids[:5]

[31, 34, 36, 57, 46]

In [7]:
# keep only the answers from the retained respondents, then drop the id column ('lfdn')
qualitative_df = qualitative_df[qualitative_df['lfdn'].isin(respondents_ids)][['Q5 - SECOND', 'TOPIC', 'CATEGORY']]
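
To make the filtering step concrete, here is a small sketch of how `isin` combined with a column selection behaves. The first two rows reuse values from the preview above; the third row (id `999`) is invented to show a response that gets filtered out.

```python
import pandas as pd

toy_qualitative = pd.DataFrame({
    "lfdn": [31, 36, 999],  # 999 is a made-up id that is not among the respondents
    "Q5 - SECOND": ["Others tasks which competes the time",
                    "Prediction Task identification",
                    "made-up answer"],
    "TOPIC": ["LACK OF TIME", "MODEL SELECTION AND CREATION", "LACK OF TIME"],
    "CATEGORY": ["ORGANIZATION", "METHOD", "ORGANIZATION"],
})
kept_ids = [31, 34, 36]

# Boolean mask: True where the response id belongs to a retained respondent
mask = toy_qualitative["lfdn"].isin(kept_ids)

# .loc filters the rows and selects the columns in one step
filtered = toy_qualitative.loc[mask, ["Q5 - SECOND", "TOPIC", "CATEGORY"]]
print(filtered)
```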

In [8]:
# get a list with all the reported categories
category_column = list(qualitative_df['CATEGORY'])
category_column[:5]

['ORGANIZATION', 'METHOD', 'INPUT', 'METHOD', 'DATA']

In [9]:
# compute the total of answers that belong to each category
total_per_category_df = qualitative_df[['CATEGORY', 'TOPIC']].groupby(['CATEGORY']).count()
total_per_category_df

CATEGORY,TOPIC
DATA,58
INFRASTRUCTURE,9
INPUT,21
METHOD,28
ORGANIZATION,23
PEOPLE,2
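
As a cross-check on the per-category counts, `value_counts` gives the same numbers as `groupby(...).count()` as long as the `TOPIC` column has no missing values (that condition is an assumption about the data). The sketch below uses only the five rows shown in the preview, not the full dataset.

```python
import pandas as pd

# The five rows visible in the qualitative dataframe preview
toy = pd.DataFrame({
    "CATEGORY": ["ORGANIZATION", "METHOD", "INPUT", "METHOD", "DATA"],
    "TOPIC": ["LACK OF TIME", "MODEL SELECTION AND CREATION", "PROBLEM UNDERSTANDING",
              "MODEL SELECTION AND CREATION", "INSUFFICIENT DATA"],
})

# count() tallies the non-null TOPIC values in each category
counts_groupby = toy.groupby("CATEGORY")["TOPIC"].count()

# value_counts() tallies rows per category; sort_index() aligns the ordering
counts_value = toy["CATEGORY"].value_counts().sort_index()

print(counts_groupby.to_dict())
print(counts_value.to_dict())
```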


In [10]:
# get the total of qualitative answers
total_answers = total_per_category_df['TOPIC'].sum()
print(f'The total of qualitative answers to be considered is {total_answers}')

The total of qualitative answers to be considered is 141


In [11]:
# we create two lists that contain, for each category, the total of qualitative answers and their percentage
category_percentage = []
category_total = []
# note that values repeat, since each list has one entry per row of the 'CATEGORY' column in the qualitative dataframe
for category in category_column:
    category_percentage.append(round(total_per_category_df.loc[category]['TOPIC'] / total_answers * 100, 2))
    category_total.append(total_per_category_df.loc[category]['TOPIC'])

In [12]:
# include two new columns to our qualitative dataframe
qualitative_df['CATEGORY_TOTAL'] = category_total
qualitative_df['CATEGORY_%'] = category_percentage
qualitative_df.head()

Unnamed: 0,Q5 - SECOND,TOPIC,CATEGORY,CATEGORY_TOTAL,CATEGORY_%
0,Others tasks which competes the time,LACK OF TIME,ORGANIZATION,23,16.31
1,Prediction Task identification,MODEL SELECTION AND CREATION,METHOD,28,19.86
2,we need to cut or we need to cluster some kind...,PROBLEM UNDERSTANDING,INPUT,21,14.89
3,apply the models,MODEL SELECTION AND CREATION,METHOD,28,19.86
4,Sufficient Data Quantity,INSUFFICIENT DATA,DATA,58,41.13
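
The loop that builds `CATEGORY_TOTAL` and `CATEGORY_%` could alternatively be expressed with `Series.map`, which looks up each row's category in the totals. This is a sketch of an alternative, not the notebook's method; it plugs in the category totals reported above and the five categories from the preview rows.

```python
import pandas as pd

# Category totals as reported in the notebook output
category_totals = pd.Series({
    "DATA": 58, "INFRASTRUCTURE": 9, "INPUT": 21,
    "METHOD": 28, "ORGANIZATION": 23, "PEOPLE": 2,
})
total_answers = category_totals.sum()

# Categories of the five preview rows
rows = pd.DataFrame({"CATEGORY": ["ORGANIZATION", "METHOD", "INPUT", "METHOD", "DATA"]})

# map() replaces each category label with its total, row by row
rows["CATEGORY_TOTAL"] = rows["CATEGORY"].map(category_totals)
rows["CATEGORY_%"] = (rows["CATEGORY_TOTAL"] / total_answers * 100).round(2)
print(rows)
```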


In [13]:
# now we group our results considering the category and the topic
qualitative_df = qualitative_df.groupby(['CATEGORY', 'CATEGORY_TOTAL', 'CATEGORY_%', 'TOPIC']).count()

In [14]:
# rename the remaining column, which is now a simple count representing the quantity of each category and topic
qualitative_df.columns = ['QUANTITY']
# add the percentage version of each quantity to improve readability
qualitative_df['QUANTITY_%'] = round(qualitative_df['QUANTITY'] / qualitative_df['QUANTITY'].sum() * 100, 2)
qualitative_df

CATEGORY,CATEGORY_TOTAL,CATEGORY_%,TOPIC,QUANTITY,QUANTITY_%
DATA,58,41.13,DATA AVAILABILITY,5,3.55
DATA,58,41.13,DATA COLLECTION,20,14.18
DATA,58,41.13,DATA PREPROCESSING,8,5.67
DATA,58,41.13,DATA QUALITY,10,7.09
DATA,58,41.13,DATA UNDERSTANDING,1,0.71
DATA,58,41.13,INSUFFICIENT DATA,14,9.93
INFRASTRUCTURE,9,6.38,COMPUTATIONAL CONSTRAINTS,2,1.42
INFRASTRUCTURE,9,6.38,DEPLOYMENT INFRASTRUCTURE,4,2.84
INFRASTRUCTURE,9,6.38,MODEL DEPLOYMENT,2,1.42
INFRASTRUCTURE,9,6.38,MODEL MONITORING,1,0.71
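
For readers who want the category and topic breakdown without first attaching helper columns, a possible shortcut is `groupby(...).size()`. The sketch below is an assumption-laden illustration on the five preview rows only, so its numbers differ from the full results above.

```python
import pandas as pd

# The five coded answers visible in the preview
coded = pd.DataFrame({
    "CATEGORY": ["ORGANIZATION", "METHOD", "INPUT", "METHOD", "DATA"],
    "TOPIC": ["LACK OF TIME", "MODEL SELECTION AND CREATION", "PROBLEM UNDERSTANDING",
              "MODEL SELECTION AND CREATION", "INSUFFICIENT DATA"],
})

# size() counts the rows in each (CATEGORY, TOPIC) group
breakdown = coded.groupby(["CATEGORY", "TOPIC"]).size().to_frame("QUANTITY")

# Share of each topic relative to all coded answers
breakdown["QUANTITY_%"] = (breakdown["QUANTITY"] / breakdown["QUANTITY"].sum() * 100).round(2)
print(breakdown)
```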


In [15]:
df = qualitative_df['QUANTITY_%'].copy()
df.index = qualitative_df.index.get_level_values('TOPIC')
for key, val in zip(list(df.keys()), list(df)):
    print(f"{key.lower()} ({val} %)")

data availability (3.55 %)
data collection (14.18 %)
data preprocessing (5.67 %)
data quality (7.09 %)
data understanding (0.71 %)
insufficient data (9.93 %)
computational constraints (1.42 %)
deployment infrastructure (2.84 %)
model deployment (1.42 %)
model monitoring (0.71 %)
incomplete/incorrect requirements (3.55 %)
lack of domain knowledge (2.13 %)
problem understanding (9.22 %)
choose and evaluate the metrics (3.55 %)
error handling (0.71 %)
feature engineering (0.71 %)
hyperparameter tuning (1.42 %)
insufficient/unacceptable results  (1.42 %)
managing expectations (2.13 %)
model evaluation (1.42 %)
model selection and creation (5.67 %)
model understanding (0.71 %)
model update (0.71 %)
security (0.71 %)
technical debt (0.71 %)
insufficient client support (9.93 %)
lack of time (4.96 %)
prioritizing issues (1.42 %)
lack of data science knowledge (1.42 %)
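
Since the question asks respondents to rank problems, it can help to sort the topic shares before printing them. The sketch below takes the five largest percentages from the output above and lists the top three in the same `topic (share %)` format; the variable names are illustrative.

```python
import pandas as pd

# The five largest topic shares, copied from the printed output
topic_share = pd.Series({
    "data collection": 14.18,
    "insufficient data": 9.93,
    "insufficient client support": 9.93,
    "problem understanding": 9.22,
    "data quality": 7.09,
})

# Sort descending; a stable sort is requested for the two topics tied at 9.93 %
top_three = topic_share.sort_values(ascending=False, kind="stable").head(3)
for topic, share in top_three.items():
    print(f"{topic} ({share} %)")
```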
