<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/DSPM-Book-Cover-Small.png"><br>

This notebook contains an excerpt from the **`Data Science Project Management Methodology - A Guideline for Beginners`** whitepaper written by *Balasubramanian Chandran*; the content is available [on GitHub](https://github.com/BalaChandranGH/Whitepapers/DSPM-Methodology).

<!--NAVIGATION-->

<[ [The Abstract](01.00-dspm-The-Abstract.ipynb) | [Contents and Acronyms](00.00-dspm-Contents-and-Acronyms.ipynb) | [Summary of existing PM Methodologies](03.00-dspm-Summary-of-existing-PM-Methodologies.ipynb) ]>

# 2. Introduction - Why is DSPM challenging?

DS projects often fail because people overlook the underlying challenges and do not allocate sufficient time to address them. The following are some of the reasons that make data science projects challenging.

## 2.1. Conversion of Business problems into Analytics problems
Vague definitions of business problems and requirements are nothing new. Traditionally, project teams budget sufficient time to understand the problem before proceeding with development. In DSPM, an additional step is required: converting the business problem into an analytics problem, because the problem will be solved with data, not just with business rules. In other words, business problems are “qualitative”, whereas analytics problems are “quantitative” and their solutions can be measured with metrics.

This is a make-or-break step in DS projects. If the business problem is not translated into the right analytics problem, we will be finding solutions to the wrong problem, and all the project time and effort will be channeled in the wrong direction! Even the most powerful algorithms produce useless results when they are solving the wrong problem.

A well-defined analytics problem should be specific, unambiguous, measurable, realistic, and aligned with business goals. It should lead to solutions that predict outcomes, exhibit correlations, categorize or group data, identify patterns or anomalies, or make recommendations. DS solutions must help the business make informed decisions backed by data rather than bombard it with analysis and modeling reports.
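
To make “measurable” concrete, the following minimal sketch computes a few common metrics with the Python standard library. The function names and the sample numbers are illustrative assumptions and do not come from the whitepaper; a real project would pick metrics that match its own business goal.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average absolute difference; a common metric for regression problems."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Like MAE, but penalizes large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def accuracy(actual, predicted):
    """Share of correct predictions; a common metric for classification problems."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Made-up regression data (e.g., daily case counts vs. model forecasts)
actual_cases = [100, 120, 150, 170]
predicted_cases = [110, 115, 140, 180]
mae = mean_absolute_error(actual_cases, predicted_cases)
rmse = root_mean_squared_error(actual_cases, predicted_cases)

# Made-up classification data (e.g., activity flagged as suspicious or not)
actual_labels = ["ok", "suspicious", "ok", "ok"]
predicted_labels = ["ok", "suspicious", "suspicious", "ok"]
acc = accuracy(actual_labels, predicted_labels)

print(f"MAE={mae:.2f} RMSE={rmse:.2f} accuracy={acc:.2f}")
```

Agreeing on such a metric and a target value with the business up front is one way to turn a qualitative request into a quantitative goal.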

_**Example 1:**_

``Business problem:``
* _How fast is the pandemic spreading across the world?_ 

``Converted Analytical problem`` (Regression problem – Supervised learning):
* _Build a predictive model combining real-time pandemic data collected from across the world, with visualizations of the trend that can be sliced by continent, country, age group, and ethnicity_


_**Example 2:**_

``Business problem:``
* _Develop a system to prevent data theft and sabotage_

``Converted Analytical problem`` (Classification problem – Supervised learning):
* _Build a predictive model to monitor resources for any suspicious activity, with real-time decision making, visualizations, and alerts_


_**Example 3:**_

``Business problem:``
* _What products can we promote to new customers that they are likely to buy?_

``Converted Analytical problem`` (Clustering problem – Unsupervised learning):
* _Build an analytical model combining demographic data (such as age) and buying patterns, and report the association of new customers with the clusters of existing customers_
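
As a rough illustration of Example 3, the sketch below assigns a new customer to the nearest of several existing customer clusters. It assumes the cluster centroids have already been computed by some clustering algorithm; the cluster names, the two features (age and average monthly spend), and all values are hypothetical.

```python
import math

# Hypothetical centroids learned from existing customers:
# (age in years, average monthly spend). All values are made up.
centroids = {
    "young_budget": (24.0, 150.0),
    "mid_family": (38.0, 600.0),
    "senior_premium": (61.0, 900.0),
}

def nearest_cluster(customer, centroids):
    """Return the name of the centroid closest to the customer (Euclidean distance)."""
    return min(centroids, key=lambda name: math.dist(customer, centroids[name]))

new_customer = (35.0, 550.0)
assigned = nearest_cluster(new_customer, centroids)
print(assigned)
```

In practice the features would usually be scaled first, so that a feature with a large numeric range (here, spend) does not dominate the distance.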

## 2.2. Data availability
Cases where the required data are readily available are rare. The project team needs to spend enough time collecting the data, which may include identifying the following:
* Sources of data 
  - Databases, Flat files, Data mining, Web scraping, Crowd-sourcing, etc.
* Types of data 
  - Structured, Semi-structured, Unstructured (Free-form text, Images, Videos, Audios, etc.)
* Volume of data 
  - Small-size data, Medium-size data, Big data
* Cost of data 
  - Are the data available for free, or do they need to be purchased?
* Privacy of data
  - Knowing who owns the data and obtaining permission to use them (what may and may not be used)
* Right data
  - More is not always better: projects often receive lots of data, most of which are irrelevant and cannot be used
  - “It's not who has the best algorithm that wins. It's who has the most data” (Andrew Ng); here, “most” should be read as “most relevant and right data”

## 2.3. Data Quality
Models learn from data, and the quality of a model depends on the quality of the data used to train it (without discounting the crucial role that algorithms also play in model outputs). The following are some of the factors that might affect the quality of the data, especially in the case of big data.
* Sources of data
  - Depending on the business problem, the data sources could be internal, external, or both. We have better control and understanding of data that are generated in-house; that is not the case for data from external sources
  - Inconsistency and incompleteness are inevitable if the data are generated through crowdsourcing such as surveys, public opinions, competitions, contests, etc.
* Heterogeneity
  - The variety of data types poses a direct challenge to understanding the data and maintaining their quality because of the increased complexity
  - Consider a case where a project dataset mixes the following data types:
    - Numerical (Integer and Float types)
    - Categorical (Nominal, Ordinal, Boolean types)
    - Free-form texts, Images, Audios and Videos
    - Real-time streaming data <br>
      The quality may be compromised with the increased complexity.
  - To reduce the complexity and to improve the quality, it’s important to maintain the metadata (information about the data, a.k.a. data description)
  - Generating metadata can itself be challenging; should it be automated or manual?
  - Metadata generation could be a time-consuming task as well
* Volume
  - Working with hundreds or thousands of samples and tens of attributes is much easier than working with millions of samples and tens of thousands of attributes, where quality problems are far more common
* Messy data
  - Most of the time, the project team gets inaccurate, inconsistent, and incomplete data to work with
* Lack of Skills
  - In the data science world, people say “finding a true Data scientist is like finding a unicorn”, especially in a one-member team where one person plays all the roles in a data science project, such as data and analytics manager, data scientist, business analyst, data analyst, data engineer, data architect, ML engineer, statistician, etc.
  - Lack of in-house talent in terms of domain expertise and technical expertise
* Reliability 
  - Big data are often messy, complex, and untrustworthy
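
One way to approach the “automated or manual” question for metadata is to generate a basic profile automatically and have a person review it. The sketch below is a minimal, standard-library illustration of that idea; the helper names (`profile_column`, `profile_rows`) and the sample rows are hypothetical.

```python
from collections import Counter

def profile_column(values):
    """Summarize one column: total count, missing count, distinct values, observed types."""
    present = [v for v in values if v is not None and v != ""]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "types": dict(Counter(type(v).__name__ for v in present)),
    }

def profile_rows(rows):
    """Build a per-column profile for a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}

# Made-up records showing typical problems: a missing value, a number stored
# as text, inconsistent casing, and an exact duplicate row.
rows = [
    {"age": 34, "country": "IN", "income": 52000.0},
    {"age": None, "country": "in", "income": 48000.0},
    {"age": "41", "country": "US", "income": None},
    {"age": 34, "country": "IN", "income": 52000.0},
]

metadata = profile_rows(rows)
duplicate_rows = len(rows) - len({tuple(sorted(r.items())) for r in rows})

for column, summary in metadata.items():
    print(column, summary)
print("duplicate rows:", duplicate_rows)
```

A profile like this does not fix anything by itself, but mixed types, unexpected distinct counts, and missing values are cheap signals of the inconsistency and incompleteness described above.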

## 2.4. Data Pre-processing
This is often the most time-consuming step in a data science project. The collected data may not be (in most cases “will not be”) in a form that can be used as is. They need to be processed and brought into a form that the model can analyze to produce results. Data Pre-processing is also called Data Preparation or Data Wrangling. (Data Augmentation, sometimes used as a synonym, more precisely means generating additional training samples from the existing ones.) Some of the activities involved in this step are:
* Data cleaning
  - Removal of duplicates (samples/ features), handling outliers, handling missing values, handling inconsistencies and errors in data, etc.
* Feature selection
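  - Dimensionality Reduction (e.g., PCA, which strictly speaking derives new features from the existing ones rather than selecting a subset of them)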
* Feature engineering
  - Combining existing features to create a new one (e.g., total), deriving a new feature from existing ones (e.g., mean), data formatting/ reformatting, etc.
* Splitting the dataset into Train and Test datasets
* Data transforms
  - Numerical type - Standardization, Normalization, etc.
  - Categorical type - One-Hot encoding, Dummy encoding, etc.
* Handling imbalanced classes
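
Several of the activities above can be sketched with the Python standard library alone. The example below covers only duplicate removal, a train/test split, median imputation, standardization, and one-hot encoding; the column names, the values, and the 80/20 split ratio are assumptions made for illustration. Real projects typically use dedicated libraries for these steps.

```python
import random
import statistics

# Made-up records; one has a missing income and one is an exact duplicate.
rows = [
    {"age": 25, "income": 40000.0, "segment": "retail"},
    {"age": 32, "income": None, "segment": "online"},
    {"age": 47, "income": 82000.0, "segment": "retail"},
    {"age": 25, "income": 40000.0, "segment": "retail"},
    {"age": 51, "income": 60000.0, "segment": "wholesale"},
    {"age": 38, "income": 58000.0, "segment": "online"},
]

# 1. Data cleaning: drop exact duplicates while preserving order.
seen, cleaned = set(), []
for row in rows:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        cleaned.append(dict(row))

# 2. Split into train and test BEFORE computing any statistics,
#    so that information from the test set does not leak into the transforms.
random.Random(42).shuffle(cleaned)
split = (len(cleaned) * 4) // 5
train, test = cleaned[:split], cleaned[split:]

# 3. Handle missing values: impute income with the training median.
income_median = statistics.median(
    r["income"] for r in train if r["income"] is not None
)
for r in train + test:
    if r["income"] is None:
        r["income"] = income_median

# 4. Standardization: subtract the training mean, divide by the training std.
def fit_scaler(values):
    std = statistics.pstdev(values)
    return statistics.mean(values), std or 1.0  # guard against zero spread

scalers = {col: fit_scaler([r[col] for r in train]) for col in ("age", "income")}
for r in train + test:
    for col, (mean, std) in scalers.items():
        r[col] = (r[col] - mean) / std

# 5. One-hot encoding, using only the categories seen in the training data.
categories = sorted({r["segment"] for r in train})
for r in train + test:
    segment = r.pop("segment")
    for cat in categories:
        r[f"segment_{cat}"] = 1 if segment == cat else 0

print(train[0])
```

Fitting the imputation value, the scaler, and the category list on the training rows only is a deliberate choice: a test row with a category unseen in training simply gets all zeros, which mirrors what the model would face with new data.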

## 2.5. Research and Experimentation
Not all DS projects demand the use of ML. However, most DS projects these days use ML to handle a high volume of data and a large number of dependent attributes to generate the outputs. Depending on the size, complexity, and use of ML, data science projects involve a certain amount of research, ranging from simple (e.g., algorithm selection) to advanced (e.g., new algorithm development). One of the common mistakes made by novices is “discounting the human factor in deciding on the final solution”.
* Relying solely on the algorithms to finalize the solution is not a good option
* Data scientists should have a deep knowledge of error analysis and model parameter tuning to improve model performance before selecting the final model
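
As a small illustration of parameter tuning followed by error analysis, the sketch below tunes a single parameter (the decision threshold of a classifier) on validation data and then lists the misclassified samples for human review. The labels, scores, and candidate thresholds are made up, and the F1 score is just one possible selection criterion.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, FN, TN when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Hypothetical validation labels (1 = positive) and model scores.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.45, 0.65, 0.6, 0.55, 0.2, 0.8, 0.1]

# Parameter tuning: evaluate each candidate threshold on the validation data.
candidates = [0.25, 0.5, 0.75]
results = {}
for threshold in candidates:
    tp, fp, fn, _ = confusion_counts(labels, scores, threshold)
    results[threshold] = f1_score(tp, fp, fn)
best_threshold = max(results, key=results.get)

# Error analysis: collect (index, label, score) of misclassified samples.
errors = [
    (i, label, score)
    for i, (label, score) in enumerate(zip(labels, scores))
    if (score >= best_threshold) != bool(label)
]

print("F1 by threshold:", results)
print("best threshold:", best_threshold)
print("misclassified samples:", errors)
```

The metric picks the threshold, but the list of errors is where the human factor comes in: someone who knows the domain decides whether the remaining mistakes are acceptable for the business.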
