# First Problems that most likely lead to Project Failure

*Question*: Regarding the problems listed in the previous question, 
please rank the ones (up to three) that most likely lead to 
overall project failure.

*Answer Type*: Free Text

### Necessary Libraries

In [1]:
import pandas as pd
from utils.basic import rename_values
from utils.dataframe import DataframeUtils
from utils.plot import PlotUtils
from utils.bootstrapping import BootstrappingUtils

### Dataframe Initialization

Here we build a formatted Pandas DataFrame from a raw CSV containing the survey responses. When formatting, we discard unused columns and rename the remaining ones to clearer names. We created two classes that help us manage this dataframe and create meaningful charts.

In [2]:
dataframe_obj = DataframeUtils(df_path='./data/main_data.csv', sep=';',
                               to_discard_columns_file='./data/unused_columns.txt',
                               to_format_columns_file='./data/formatted_columns.txt')
# dataframe preview
dataframe_obj.df.head()

Unnamed: 0,ID,Status,Duration,D1_Undergraduation,D1_Specialization,D1_Master,D1_Phd,D1_Courses,D1_Others,D2_Country,...,Q15_Model_Deploy_Production_Monitoring,Q16_Model_Monitor_Aspects_Input_And_Output,Q16_Model_Monitor_Aspects_Interpretability_Output,Q16_Model_Monitor_Aspects_Output_And_Decisions,Q16_Model_Monitor_Aspects_Fairness,Q16_Model_Monitor_Aspects_Others,Q16_Model_Monitor_Aspects_Others_Free,Q17_Automated_Machine_Learning_Tools_Yes_No,Q17_Automated_Machine_Learning_Tools_Yes_Free,Origin
2,31,Completed (31),1317,Economics,-99,M.Sc. in Economics,-99,Data Scientist in Datacamp,-99,Brazil,...,-77,not quoted,not quoted,not quoted,not quoted,not quoted,-99,0,-99,https://ww2.unipark.de/uc/seml/
3,34,Completed (31),854,-99,Management,No,No,No,No,Brazil,...,70,not quoted,not quoted,quoted,not quoted,not quoted,-99,No,-99,-99
4,36,Completed (31),1593,Mathematics,Informatics,MSC Computer Science,PhD computer Science,Vários cursos in Coursera,-99,Brazil,...,60,quoted,not quoted,quoted,not quoted,not quoted,-99,"Yes, Please, specify",Own approach,-99
5,57,Completed (31),4238,Computer Science,Data science specialization,-99,-99,-99,-99,Germany,...,100,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99
6,46,Completed (31),2821,Actuarial Science,Post Graduation in Data Science,M Sc in Data Science -ML models,no Ph D,no other certifications,-99,Brazil,...,80,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99
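The formatting step described above (read a delimited file, discard unused columns, rename the rest) can be sketched with plain pandas. This is only an illustration and not the actual `DataframeUtils` implementation: the function `load_survey`, the raw column names, and the toy rows are all assumptions made for the example.

```python
import io
import pandas as pd

def load_survey(csv_source, sep, columns_to_discard, columns_to_rename):
    """Hypothetical helper: read a delimited export, drop unused columns, rename the rest."""
    df = pd.read_csv(csv_source, sep=sep)
    # errors='ignore' tolerates listed names that are absent from this export
    df = df.drop(columns=columns_to_discard, errors='ignore')
    return df.rename(columns=columns_to_rename)

# Toy data standing in for the real export (column names and values are invented)
raw = io.StringIO(
    "raw_id;raw_status;raw_country;unused\n"
    "1;Completed (31);Brazil;x\n"
    "2;Suspended (22);Germany;y\n"
)
survey = load_survey(
    raw,
    sep=';',
    columns_to_discard=['unused'],
    columns_to_rename={'raw_id': 'ID', 'raw_status': 'Status', 'raw_country': 'D2_Country'},
)
print(survey.columns.tolist())
```

Passing the discard list and rename mapping as arguments keeps the sketch independent of how the real text files in `./data/` are laid out, which the notebook does not show.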


In [3]:
# For PROFES, we discarded suspended submissions (i.e., we removed respondents who did not complete the survey).
dataframe_obj.df.drop(dataframe_obj.df[dataframe_obj.df['Status'] == 'Suspended (22)'].index, inplace = True)
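An equivalent way to express the same filter is a boolean mask, which avoids the in-place `drop`. The sketch below uses a small invented dataframe; only the `Status` labels are taken from the notebook.

```python
import pandas as pd

# Toy responses (IDs are invented; Status labels follow the survey export)
responses = pd.DataFrame({
    'ID': [1, 2, 3],
    'Status': ['Completed (31)', 'Suspended (22)', 'Completed (31)'],
})

# Keep every row whose status is not 'Suspended (22)'
completed = responses[responses['Status'] != 'Suspended (22)'].copy()
print(completed['ID'].tolist())
```

The `.copy()` makes `completed` an independent dataframe, so later column assignments do not trigger pandas' chained-assignment warning.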

In [4]:
# dataframe preview
dataframe_obj.df.head()

Unnamed: 0,ID,Status,Duration,D1_Undergraduation,D1_Specialization,D1_Master,D1_Phd,D1_Courses,D1_Others,D2_Country,...,Q15_Model_Deploy_Production_Monitoring,Q16_Model_Monitor_Aspects_Input_And_Output,Q16_Model_Monitor_Aspects_Interpretability_Output,Q16_Model_Monitor_Aspects_Output_And_Decisions,Q16_Model_Monitor_Aspects_Fairness,Q16_Model_Monitor_Aspects_Others,Q16_Model_Monitor_Aspects_Others_Free,Q17_Automated_Machine_Learning_Tools_Yes_No,Q17_Automated_Machine_Learning_Tools_Yes_Free,Origin
2,31,Completed (31),1317,Economics,-99,M.Sc. in Economics,-99,Data Scientist in Datacamp,-99,Brazil,...,-77,not quoted,not quoted,not quoted,not quoted,not quoted,-99,0,-99,https://ww2.unipark.de/uc/seml/
3,34,Completed (31),854,-99,Management,No,No,No,No,Brazil,...,70,not quoted,not quoted,quoted,not quoted,not quoted,-99,No,-99,-99
4,36,Completed (31),1593,Mathematics,Informatics,MSC Computer Science,PhD computer Science,Vários cursos in Coursera,-99,Brazil,...,60,quoted,not quoted,quoted,not quoted,not quoted,-99,"Yes, Please, specify",Own approach,-99
5,57,Completed (31),4238,Computer Science,Data science specialization,-99,-99,-99,-99,Germany,...,100,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99
6,46,Completed (31),2821,Actuarial Science,Post Graduation in Data Science,M Sc in Data Science -ML models,no Ph D,no other certifications,-99,Brazil,...,80,not quoted,quoted,quoted,not quoted,not quoted,-99,No,-99,-99


### Grounded Theory

Here we get a formatted Pandas DataFrame from an Excel file containing the codes and categorization of our qualitative responses.

In [5]:
qualitative_df = pd.read_excel('./data/Problems on ML Stages/PROJECT FAILURE - FIRST.xlsx')
qualitative_df.head()

Unnamed: 0,lfdn,Q5 - FIRST,TOPIC,CATEGORY
0,31,Problems with data collection and cleaning,DATA PREPARATION,DATA
1,36,Data preparation,DATA PREPARATION,DATA
2,46,understand the pain and identify if ML is real...,PROBLEM UNDERSTANDING,INPUT
3,53,insufficient amount of data,INSUFFICIENT DATA,DATA
4,58,DATA UNAVAILABLE,DATA AVAILABILITY,DATA


### Basic Analysis

In [6]:
# get our respondents ids
respondents_ids = list(dataframe_obj.df['ID'])
respondents_ids = [int(respondent_id) for respondent_id in respondents_ids]
respondents_ids[:5]

[31, 34, 36, 57, 46]

In [7]:
# keep only the coded answers from the retained respondents, then drop the id column ('lfdn')
qualitative_df = qualitative_df[qualitative_df['lfdn'].isin(respondents_ids)][['Q5 - FIRST', 'TOPIC', 'CATEGORY']]
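As a minimal illustration of what this cell does, the sketch below applies the same `isin` filter and column selection to a toy dataframe. The ids and rows are invented; the topic and category labels come from the coding shown earlier.

```python
import pandas as pd

# Toy coded answers (ids are invented)
coded = pd.DataFrame({
    'lfdn': [1, 2, 3],
    'TOPIC': ['DATA PREPARATION', 'PROBLEM UNDERSTANDING', 'INSUFFICIENT DATA'],
    'CATEGORY': ['DATA', 'INPUT', 'DATA'],
})
kept_ids = [1, 3]  # respondents that survived the status filter

# Rows whose id is in kept_ids are retained; the id column itself is then left out
filtered = coded[coded['lfdn'].isin(kept_ids)][['TOPIC', 'CATEGORY']]
print(filtered)
```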

In [8]:
# get a list with all the reported categories
category_column = list(qualitative_df['CATEGORY'])
category_column[:5]

['DATA', 'DATA', 'INPUT', 'DATA', 'DATA']

In [9]:
# compute the total of answers that belong to each category
total_per_category_df = qualitative_df[['CATEGORY', 'TOPIC']].groupby(['CATEGORY']).count()
total_per_category_df

CATEGORY,TOPIC
DATA,46
INFRASTRUCTURE,2
INPUT,52
METHOD,29
ORGANIZATION,8
PEOPLE,2


In [10]:
# get the total of qualitative answers
total_answers = total_per_category_df['TOPIC'].sum()
print(f'The total of qualitative answers to be considered is {total_answers}')

The total of qualitative answers to be considered is 139
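As a quick cross-check of the figures above, the category counts from the table can be summed and converted to percentages by hand. The counts are copied from the output; the variable names are ours.

```python
# Category counts copied from the table above
category_counts = {
    'DATA': 46,
    'INFRASTRUCTURE': 2,
    'INPUT': 52,
    'METHOD': 29,
    'ORGANIZATION': 8,
    'PEOPLE': 2,
}

total = sum(category_counts.values())
# Share of each category among all coded answers, rounded to two decimals
shares = {name: round(count / total * 100, 2) for name, count in category_counts.items()}

print(total)
for name, share in shares.items():
    print(f'{name}: {share} %')
```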


In [11]:
# we create two lists that contain, for each answer, the total and the percentage of its category
category_percentage = []
category_total = []
# note that values repeat, since both lists have the same length as the 'CATEGORY' column of the qualitative dataframe
for category in category_column:
    category_count = total_per_category_df.loc[category, 'TOPIC']
    category_percentage.append(round(category_count / total_answers * 100, 2))
    category_total.append(category_count)
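The same per-row totals and percentages can also be obtained without an explicit loop. The sketch below is an alternative, not the notebook's method: it uses `groupby(...).transform('count')` on a toy dataframe and assumes no `TOPIC` value is missing, so that the number of rows equals the total number of answers.

```python
import pandas as pd

# Toy coded answers (labels taken from the coding above, rows invented)
coded = pd.DataFrame({
    'TOPIC': ['DATA QUALITY', 'DATA COLLECTION', 'PROBLEM UNDERSTANDING',
              'COMPUTATIONAL CONSTRAINTS'],
    'CATEGORY': ['DATA', 'DATA', 'INPUT', 'INFRASTRUCTURE'],
})

# transform('count') broadcasts each category's size back onto its rows
coded['CATEGORY_TOTAL'] = coded.groupby('CATEGORY')['TOPIC'].transform('count')
coded['CATEGORY_%'] = (coded['CATEGORY_TOTAL'] / len(coded) * 100).round(2)
print(coded)
```

Because `transform` returns a result aligned with the original index, the new columns can be assigned directly, with no intermediate lists to keep in sync.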

In [12]:
# include two new columns to our qualitative dataframe
qualitative_df['CATEGORY_TOTAL'] = category_total
qualitative_df['CATEGORY_%'] = category_percentage
qualitative_df.head()

Unnamed: 0,Q5 - FIRST,TOPIC,CATEGORY,CATEGORY_TOTAL,CATEGORY_%
0,Problems with data collection and cleaning,DATA PREPARATION,DATA,46,33.09
1,Data preparation,DATA PREPARATION,DATA,46,33.09
2,understand the pain and identify if ML is real...,PROBLEM UNDERSTANDING,INPUT,52,37.41
3,insufficient amount of data,INSUFFICIENT DATA,DATA,46,33.09
4,DATA UNAVAILABLE,DATA AVAILABILITY,DATA,46,33.09


In [13]:
# now we group our results considering the category and the topic
qualitative_df = qualitative_df.groupby(['CATEGORY', 'CATEGORY_TOTAL', 'CATEGORY_%', 'TOPIC']).count()

In [14]:
# rename the remaining column, which is now a simple count representing the quantity of each category and topic
qualitative_df.columns = ['QUANTITY']
# add the percentage version of each quantity to improve readability
qualitative_df['QUANTITY_%'] = round(qualitative_df['QUANTITY'] / qualitative_df['QUANTITY'].sum() * 100, 2)
qualitative_df

CATEGORY,CATEGORY_TOTAL,CATEGORY_%,TOPIC,QUANTITY,QUANTITY_%
DATA,46,33.09,DATA AVAILABILITY,6,4.32
DATA,46,33.09,DATA CLEANING,1,0.72
DATA,46,33.09,DATA COLLECTION,9,6.47
DATA,46,33.09,DATA COMPLEXITY,1,0.72
DATA,46,33.09,DATA PREPARATION,7,5.04
DATA,46,33.09,DATA QUALITY,12,8.63
DATA,46,33.09,DATA UNDERSTANDING,1,0.72
DATA,46,33.09,INSUFFICIENT DATA,9,6.47
INFRASTRUCTURE,2,1.44,COMPUTATIONAL CONSTRAINTS,2,1.44
INPUT,52,37.41,BUSINESS/DOMAIN UNDERSTANDING,1,0.72
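For readers who want to reproduce the per-topic quantities on their own data, a minimal sketch follows. It assumes a dataframe with only `CATEGORY` and `TOPIC` columns and uses `size()` instead of `count()`, so it does not depend on a leftover column to count; the toy rows are invented.

```python
import pandas as pd

# Toy coded answers (labels taken from the coding above, rows invented)
coded = pd.DataFrame({
    'CATEGORY': ['DATA', 'DATA', 'DATA', 'INPUT'],
    'TOPIC': ['DATA QUALITY', 'DATA QUALITY', 'DATA COLLECTION', 'PROBLEM UNDERSTANDING'],
})

# Number of answers per (category, topic) pair
topic_counts = coded.groupby(['CATEGORY', 'TOPIC']).size().rename('QUANTITY').to_frame()
# Share of each topic among all answers
topic_counts['QUANTITY_%'] = (
    topic_counts['QUANTITY'] / topic_counts['QUANTITY'].sum() * 100
).round(2)
print(topic_counts)
```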


In [15]:
# print each topic (lowercased) with its share of all answers
df = qualitative_df['QUANTITY_%'].copy()
df.index = qualitative_df.index.get_level_values('TOPIC')
for key, val in df.items():
    print(f"{key.lower()} ({val} %)")

data availability (4.32 %)
data cleaning (0.72 %)
data collection (6.47 %)
data complexity (0.72 %)
data preparation (5.04 %)
data quality (8.63 %)
data understanding (0.72 %)
insufficient data (6.47 %)
computational constraints (1.44 %)
business/domain understanding (0.72 %)
gathering requirements (1.44 %)
incomplete/incorrect requirements (6.47 %)
lack of domain knowledge (5.76 %)
problem understanding (22.3 %)
scope definition (0.72 %)
algorithms selection (1.44 %)
choose and evaluate the metrics (3.6 %)
communication (1.44 %)
improving the model (0.72 %)
insufficient/unacceptable results  (0.72 %)
managing expectations (7.19 %)
model creation (0.72 %)
model deployment (0.72 %)
model evaluation (0.72 %)
model monitoring (0.72 %)
quality assurance (0.72 %)
solutions uniqueness (0.72 %)
time consuming (1.44 %)
contract negotiation (0.72 %)
lack of resources for monitoring (0.72 %)
lack of time (1.44 %)
low client/domain expert availability/engagement (2.16 %)
management engagement (0.