# Workflow overview

## Empathize
In this stage time is dedicated to understanding the business opportunities.

In this setting the frequency and duration of customer visits are going to be related to overall sales. The initial business opportunity here is: How do you ensure new games drive revenue? There are many other business opportunities, such as finding the optimal menu for the customer base or determining whether seasonal variations in offerings help the business. For now, let's focus on ensuring that new games drive revenue.
 
As part of this stage you would talk with your friend, her employees and some customers to do your best to fully understand the experience of the customer. The important thing here is to spend time on-site simulating the experience of a customer to obtain as genuine an understanding of the problem as possible. You may realize that most customers are there to work or most of them are just passing through. This domain knowledge is useful when making decisions like which new types of games to create.
 
After you have gathered and studied your information, you will generally articulate the business scenario using a scientific thought process, meaning as a statement that can be tested. The business opportunity should be stated in a way that minimizes the presence of confounding factors.

There are logical follow-up questions to ask to fully understand the problem, but the next two stages are the more appropriate places to get into these details. Now that you understand the problem it is time to gather the data.

HINT:  This is the stage where we gather all of the data and we make note of what would be ideal data.  

The data here are mostly sales and customer profiles. There are two important aspects of the data that would be ideal:

- The data are at a transaction level (each purchase and its associated data are recorded).
- We can associate game usage with transactions.

Fortunately for us this is a modern cafe, so customers order and play games through the same interface. Additionally, they are incentivized to log in to the system and generate a customer profile. In this stage we go through the process of gathering the raw data. This may involve querying a database, gathering files, web scraping and other mechanisms. It is important to gather all of the relevant data in this stage, because access to and quality of the data may force you to modify the business question. It is very difficult to assess the quality of data when it is not in hand. If possible, efforts should be made to collect even marginally related data.

Let's assume that your initial investigation led you to understand that games that used quotations from the books in an interactive way were the most effective. So you have come up with the idea to develop a game built on a chatbot that has been trained to talk like Sherlock. This would involve Natural Language Processing (NLP), and we would need a corpus of textual data. As a start you might download The Adventures of Sherlock Holmes, by Arthur Conan Doyle, from Project Gutenberg.

HINT:  This is a live coding example and we suggest that you open a Jupyter notebook either locally or within Watson Studio so that you may annotate and expand on the example freely.

In [9]:
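import requests

# download the plain-text book from Project Gutenberg
response = requests.get('https://www.gutenberg.org/files/1661/1661-0.txt')
response.raise_for_status()  # stop early if the download failed
text = response.text

# save a local copy so later cells can reuse it
with open("sherlock-holmes.txt", "w") as text_file:
    text_file.write(text)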

## Define
This is the data wrangling stage

Given the data, an understanding of the business scenario and your gathered domain knowledge you will next perform your data cleaning and preliminary exploratory data analysis. To get to the point of preliminary investigation into the findings from the empathize stage it is frequently the case that we need to clean our data.

This could involve parsing JSON, manipulating SQL queries, reading CSV, cleaning a corpus of text, sifting through images, and so much more. One common goal of this part of the process is the creation of one or more pandas dataframes or NumPy arrays that will be used for initial exploratory data analysis (EDA).  

### EDA: Exploratory Data Analysis
Exploratory data analysis (EDA) is the process of analyzing data sets to create summaries and visualizations of the data. These summaries and visualizations are then used to guide the use of the data for solving business challenges.

In [3]:
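# read the saved book back in; the context manager closes the file for us
with open('sherlock-holmes.txt', 'r') as text_file:
    text = text_file.read()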

In [7]:
# sentences = text.split('.')

In [16]:
import re

# split on sentence-ending punctuation, then flatten line breaks and lowercase
stop_pattern = r'\.|\?|!'
sentences = re.split(stop_pattern, text)
sentences = [re.sub(r"\r|\n", " ", s.lower()) for s in sentences]

Next, let’s stage the data in an environment where we can perform EDA. Let’s assume that we’ve already gone through the texts and annotated them according to whether sentences were about Mr. Holmes or Dr. Watson. These annotations are stored in a .csv (sherlock-holmes-annotations.csv) that you can download using the link below. From here, you can create a pandas dataframe that contains the texts and those annotations.
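A minimal sketch of that staging step follows. It assumes the annotation file has a `text` column plus boolean `has_sherlock` and `has_watson` columns, which are the names used later in this walkthrough. The actual layout of sherlock-holmes-annotations.csv may differ. A small in-memory CSV with invented rows stands in for the real file so the example runs on its own, and `load_annotations` is a hypothetical helper name.

```python
import io

import pandas as pd

# Stand-in for sherlock-holmes-annotations.csv. These rows are invented for
# illustration, and the column names are assumptions.
SAMPLE_CSV = """text,has_sherlock,has_watson
holmes examined the letter closely,True,False
watson poured himself some tea,False,True
the fog hung over baker street,False,False
"""


def load_annotations(source):
    """Read annotated sentences from a path or file-like object."""
    df = pd.read_csv(source)
    # Drop rows with no sentence text so later steps see only usable rows.
    return df.dropna(subset=["text"]).reset_index(drop=True)


df = load_annotations(io.StringIO(SAMPLE_CSV))
print(df.shape)
print(df[["has_sherlock", "has_watson"]].sum())
```

With the real file, the same helper could be called with the path to the downloaded CSV instead of the in-memory stand-in.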

## Ideate
This is the stage where we modify our data and our features

Now that you have clean data the data processing must continue until you are ready to input your data into a model. This stage contains all of the possible data manipulations you might perform before modeling. Perhaps the data need to be log transformed, standardized, reduced in dimensionality, kernel transformed, engineered to contain more features or transformed in some other way.
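As a rough illustration of two of the transformations named above, the sketch below log transforms a skewed feature and then standardizes it using only the standard library. The function names and the spend values are hypothetical. In practice you would more likely reach for NumPy or scikit-learn transformers.

```python
import math
import statistics


def log_transform(values):
    """Apply log(1 + x), which compresses large values and tolerates zeros."""
    return [math.log1p(v) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mean) / std for v in values]


# Hypothetical per-visit spend values, invented for illustration.
spend = [0.0, 3.5, 4.0, 5.5, 120.0]
scaled = standardize(log_transform(spend))
print([round(v, 3) for v in scaled])
```

Applying the log transform first keeps the single large value from dominating the mean and standard deviation used in the standardization step.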

For our text data we would likely want to dig into the sentences themselves to make sure they fit the desired use case. If we were building a chatbot that converses in a distinctly Holmes-like manner, then we would likely want to remove any sentences that mention his name but were not said by Mr. Holmes. If we were building a predictive model to determine which story a phrase most likely came from, we would need to create a new column in our data frame representing the books themselves.

When working with text data many models that we might consider prefer a numeric representation of the data. This may be occurrences, frequencies, or another transformation of the original data. It is in this stage that these types of transformations are readied or carried out. For example here we import the necessary transformers for usage in the next stage.  



In [21]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

# extract the data to be used in the model from the annotated dataframe (df)
# label 0 = neither character, 1 = sherlock, 2 = watson
labels = np.zeros(df.shape[0])
labels[(df['has_sherlock'] == True)] = 1
labels[(df['has_watson'] == True)] = 2
df['labels'] = labels
# keep only the sentences that mention one of the two characters
df = df[df['labels']!=0]
X = df['text'].values
y = df['labels'].values
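To make the difference between occurrences and frequencies concrete, here is a minimal plain-Python sketch. It is a stand-in for the idea only, not scikit-learn's implementation: `CountVectorizer` and `TfidfTransformer` apply their own tokenization and weighting. The function names and the example sentence are hypothetical.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def term_counts(sentence):
    """Occurrences: how many times each token appears in the sentence."""
    return Counter(tokenize(sentence))


def term_frequencies(sentence):
    """Frequencies: counts divided by the number of tokens in the sentence."""
    counts = term_counts(sentence)
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()} if total else {}


# Example sentence invented for illustration.
sentence = "Holmes smiled, and Holmes said nothing."
counts = term_counts(sentence)
freqs = term_frequencies(sentence)
print(counts["holmes"], freqs["holmes"])
```

Counts grow with sentence length, while frequencies are normalized by it, which is one reason a frequency-based representation is often preferred when documents vary in size.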

There are a lot of ways to prepare data for different models. In some cases you will not know the best transformation or series of transformations until you have run the different models and made a comparison. The concept of pipelines is extremely useful for iterating over different permutations of transformers and models. The following topics will be covered in detail during Module 3.



- Unsupervised learning
- Feature engineering
- Dimension Reduction
- Simulation
- Missing value imputation
- Outlier detection

HINT:  This is the stage where we enumerate the advantages and disadvantages of the possible modeling solutions  

Once the transformations are carried out or staged as part of some pipeline it is a valuable exercise to document what you know about the process so far. The form that this most commonly takes is a table of possible modeling strategies complete with the advantages and disadvantages of each.

## Prototype
This is the modeling stage

The data have been cleaned, processed and staged (ideally in a pipeline) for modeling. The modeling (classic statistics and machine learning) is the bread and butter of data science. This is the stage where most data scientists want to spend the majority of their time. It is where you will interface with the most intriguing aspects of this discipline.  

To illustrate the process through to the end, shown below is a Support Vector Machine trained with stochastic gradient descent. The process involves the use of a train-test split and a pipeline because we want you to be exposed to best practices from the very beginning of this course. Given this example we also see that there can be considerable overlap between the ideate and prototype stages. The overlap exists because transformations of data are generally specific to models: as you explore which model fits the situation best, you will be modifying the transformations of your data.


In [23]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

## carry out the train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## build the pipeline: token counts -> tf-idf weights -> linear SVM trained with SGD
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                        alpha=1e-3, random_state=42,
                        max_iter=5, tol=None))
])

## train a model
text_clf.fit(X_train, y_train)

## Testing
This is the production, testing and feedback loop stage

The model works and there are evaluation metrics to provide insight into how well it works. However, the process does not end here. Perhaps the model runs, but it is not yet in production or maybe you want to try different models and/or transformers. Once in production you might want to run some tests to determine if it will handle load or if it will scale well as the data grows. A working model with an impressive f-score does not mean it will be effective in practice. This stage is dedicated to all of the considerations that come after the initial modeling is carried out.  

It is also the stage where you will determine how best to iterate. Design thinking, like data science, is an iterative process. Our model performed very well (see below), possibly because Mr. Holmes and Dr. Watson are described in very different ways in the stories, but it could be something else.

In [18]:
from sklearn import metrics

## evaluate the model performance
predicted = text_clf.predict(X_test)

print(metrics.classification_report(y_test, predicted,
                                    target_names=['sherlock', 'watson']))
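Because an impressive f-score is not the whole story, it helps to know what the number measures. The sketch below computes precision, recall and F1 by hand for one class. The function name and the label lists are hypothetical and are not the output of the model above.

```python
def f_score(y_true, y_pred, positive):
    """Return (precision, recall, f1) for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels (1 = sherlock, 2 = watson), invented for illustration.
y_true = [1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 2, 2, 2, 1]
precision, recall, f1 = f_score(y_true, y_pred, positive=1)
print(precision, recall, f1)
```

F1 is the harmonic mean of precision and recall, so it says nothing about latency, load or how the model behaves on data that differs from the test split.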

As a scientist you always want to remain skeptical about your findings until you have multiple ways to corroborate them. You will also want to always be aware of the overall goal of the work you are doing. This example is an interesting metaphor for what can happen as a data scientist. It is possible to go down a path that may only marginally be related to the central business question. Developing a game here is not unlike using a new model for deep learning or incorporating a new technology into your workflow: it may be fun and it may to some degree help the business case, but you need to always ask yourself whether this is the best way for you or your team to address the business problem. The questions you ask here are going to guide how best to iterate on the entire workflow.

---------------------------------------------------------------

# Data Collection Objectives

## Data collection

Empathize Process
1.  Get as close to the source of data as possible, usually by interviewing the people involved
2.  Identify the business problem
3.  Obtain all of the relevant data
4.  Translate the business problem into a testable hypothesis or hypotheses

As we have seen, there are several viable processes for conducting data science, but none omit the important step of understanding the needs of the business. To take it a step further, we set out the expectation that the process be viewed through the lens of scientific thinking. It is this practice that allows valuable time and resources to be conserved.  

# Introduction to Business Opportunities

You are surprised by the fact that you, a data scientist, are being asked to help out with interviews, observations, process mapping, and various design thinking sessions. These techniques, as well as many others, are used during the empathize stage to gather as much information as possible so that a problem may be defined.

As a data scientist, you should use this process to guide your investigation. Ultimately, your top priority is to analyze the data coming out of Singapore, understand the problem and fix the situation. The involved parties are subscribers, data engineers, data scientists, marketing and management. You are going to need to talk to everyone involved in the data generation process. This is why you're spending time on interviews and observations.

Asking questions is a critical part of getting the process started. You will want to be naturally curious, gathering details about the product, the subscriber, and the interaction between the two. This information gathering stage both provides a perspective on the situation and helps you formulate the business question.

In the short sections below, we provide guidelines for asking questions and beginning with an investigative mindset.  

## Articulate the business question

There are generally many business questions that can be derived from a given situation. It is an important thought exercise to enumerate the possible questions, because doing so makes it easier to focus and prioritize when you work with the involved stakeholders. In this situation, here are some ways of articulating the business case.

- Can we use marketing to reduce the rate of churn?
- Can we salvage the Singapore market with new products?
- Are there factors outside of our influence that caused the situation in Singapore and is it temporary?
- Can we identify the underlying variables in Singapore that are related to churn and can we use the knowledge to remedy the situation?

## Prioritize

It may seem obvious, but there is a need to prioritize if there are several distinct business objectives. In this case maybe one is related to reducing churn directly and another is about profitability.

There are three major contributing factors when it comes to priority.

1. Stakeholder or domain expert opinion

In situations where considerable domain expertise is required to effectively prioritize (e.g. Physics, Medicine and Finance) prioritization will likely be driven by the people closest to the domain.

2. Feasibility

- Do we have the necessary data to address the business questions?
- Do we have clean enough data to address the business questions?
- Do we have the technology infrastructure to deploy a solution once the data are modeled?

3. Impact

When looking at impact, we are looking purely at expected dollar contribution and added value from a monetary perspective. When possible, calculating a back-of-the-envelope ROI is a crucial step. This is an expectation and not a real ROI calculation, but it can serve as a guiding principle nonetheless.

The ROI calculation should be an expected dollar value that you can generate based on all available information you currently have in your organization combined with any domain insight you can collect.

The back-of-the-envelope ROI calculation could make use of any of the following:

- Estimates for fully-loaded salaries of employees involved
- Cost per unit item and/or time required to produce
- Number of customers, clients, or users
- Revenue and more
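The inputs listed above can be combined into a quick estimate. The sketch below uses the common definition of ROI as (expected gain minus cost) divided by cost. Every figure in it is a hypothetical placeholder chosen only to show the arithmetic, and the function and variable names are assumptions.

```python
def back_of_envelope_roi(expected_gain, cost):
    """Return expected ROI as a fraction: (gain - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (expected_gain - cost) / cost


# All figures below are hypothetical placeholders, not real company data.
fully_loaded_salary_per_year = 150_000   # one data scientist
project_fraction_of_year = 0.25          # three months of work
infrastructure_cost = 10_000

subscribers_at_risk = 10_000
revenue_per_subscriber_per_year = 120
expected_churn_reduction = 0.10          # fraction of at-risk subscribers retained

cost = fully_loaded_salary_per_year * project_fraction_of_year + infrastructure_cost
expected_gain = (subscribers_at_risk * expected_churn_reduction
                 * revenue_per_subscriber_per_year)

roi = back_of_envelope_roi(expected_gain, cost)
print(f"cost={cost:,.0f} gain={expected_gain:,.0f} roi={roi:.1%}")
```

Because each input is itself an estimate, it can be worth rerunning the calculation with pessimistic and optimistic values to see how sensitive the ranking of business questions is to those guesses.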

# Scientific Thinking for Business

## Our Story

Data science involves lots of investigation via trial and error. The investigations are based on evidence and this is one of the strongest reasons why data science is considered a "real" science.

You will be using a scientific process with your work at AAVAIL.  This will help you to organize your work as well as be able to clearly explain everything  you are doing to the AAVAIL leadership.  

Let's take a look now at some guidance and best practices for engaging with a scientific mindset. 

## Science is a process and the route to solving problems is not always direct

A common argument made by statisticians and mathematicians is that data science is not really a science. This is untrue, mainly because data science involves a lot of investigations through sometimes chaotic data sets, in search of meaningful patterns that might help in solving particular problems.

Since data science implies a scientific approach, it is important that all data scientists learn to adopt and use a scientific thought process. **A scientific thought process of observation, developing hypotheses, testing hypotheses, and modifying hypotheses is critical to your success as a data scientist**.

Pulling in data and jumping right into exploratory data analysis can make your work prone to exactly the types of negative issues that plague data science today. There are a number of well-discussed issues revolving around data science and data science teams not living up to promised potential.

<a href='https://www.youtube.com/watch?v=tRZN-q6GYKU'>IBM’s Seth Dobrin on how to realize the full ROI of enterprise data.</a>

<a href='https://www.ibmbigdatahub.com/blog/learn-deliver-fast-roi-data-science'>Learn to deliver fast ROI with data science.</a>

At the **heart of this problem is the process of communicating results to leadership**. It should begin with **a meaningful and well-articulated business opportunity**. If that opportunity is stated too simply (say, as increasing overall revenue), then the central talking point for communication is too vague to be meaningful from the data side.

***The business scenario needs to be communicated in a couple of ways***:

1. Stated in a **testable way in terms of data**
2. Stated in a **clear way that minimizes the influence of confounding factors**

## Testable hypotheses

There is no one single best way to **articulate a business opportunity as a testable hypothesis**. In some cases the statement will be intuitive, but in other cases there will be some back and forth with stakeholders and domain experts.

###  Guidelines for creating testable hypotheses

- Become a scientist of the business

Spend a little bit less time learning new algorithms and Python packages and more time **learning the levers that make your specific business go up or down and the variables that impact those levers**.

- Make an effort to understand how the data are produced 

If it comes down to it, sources of variation can be explicitly accounted for in many types of models. If the data come from a database you should ask about the process by which the data are stored. If the data are compiled by another person then dig into the details and find out about the compiling process as well as the details of **what happened before the data arrived on their desk**.

- Make yourself part of the business

Do not under any circumstances become siloed. Proactively get involved with the business unit as a partner, not a support function.

- Think about how to measure success

When thinking about what course of action might be most appropriate, keep at the forefront of your mind how you will measure business value when said action is complete.

**IMPORTANT**:  Data Science is NOT Business Intelligence. BI analysts serve to derive business insights out of data. There is without a doubt some overlap, but the job of a data scientist is to investigate the business opportunity and solve it.  

There is a **balancing act to maintain between directly addressing the business need and ensuring that you have thoughtfully studied the problem** enough to ensure that you can account for most of the likely contingencies. The scientific method can be of some guidance here.

### Thinking scientifically about the business scenario

A major goal of this process is to make the business objectives clear to leadership. Some of these individuals are technical and some are not, so as a good rule of thumb get in the habit of articulating the business problem at a level that everyone can understand. Stakeholders and leadership need to know what you are trying to accomplish before you begin work. They also need to be aware from the start what success would look like. Science is an iterative process and many experiments produce results that some might consider a failure. However, experiments that are properly set up will not fail no matter the result: the result may not be useful, but you have gained valuable information along the way.

Experiments in this context could refer to an actual scientific experiment (e.g. A/B testing) or it could be more subtle. Let’s say you work for a company that collects tolls in an automated way, and you want to identify the make and model of each car in order to modify pricing models based on predicted vehicle weight. After talking with the stakeholders and the folks who implemented the image storage solution you are ready to begin. The experiment here has to do with how you begin. You may think that there is enough training data to implement a huge multi-class model and just solve most of the problem. If you approach it that way then you are hypothesizing that the solution will work.

Those of you who have done much image analysis work could guess that this approach would likely result in a significant loss of time. If we take a step back and think scientifically, we could approach the solution from an evidence-driven perspective. Before investing a significant amount of time you may try to see if you can distinguish one make and model from the rest before adding more classes. You may want to first pipe the images through an image segmentation algorithm to identify the make of the car. There are many possible ways to build towards a comprehensive solution, but it is important to determine if either of these piecemeal approaches would have any immediate business value.

This might be a good time for a reminder about the steps in the scientific method.

1. Formulate the question
2. Generate a hypothesis to address the question
3. Make a prediction
4. Conduct an experiment
5. Analyze the data and draw a conclusion

We will continue with an interactive example, but first it is important to note that **scientific experiments must be repeatable** in order to become reliable evidence.

 - Question

The question can be open-ended and generally it summarizes your business opportunity. Let’s say you work for a small business that manufactures sleds and other winter gear and you are not sure in which cities to build your next retail locations. You have heard that Utah, Colorado and Vermont are all states that have high rates of snowfall, but it is unclear which one has the highest rate of snowfall.

- Hypothesis

Because the Rocky Mountains are higher in elevation and they are well-known for fresh powder on their ski slopes, you hypothesize that both Utah and Colorado have more snow than Vermont.

- Prediction

If you were to run a hypothesis test, you would find that Vermont has significantly less snowfall than Colorado or Utah.



- Experiment

<a href="https://d3c33hcgiwev3.cloudfront.net/n0betBHfRBiG3rQR3zQYVw_7f2b08854fcf4694a22774ae509e2f75_snowfall.csv?Expires=1610064000&Signature=MXnVJHDFO0ePORtSP-~hhWcK7VVzKeSKazyxHsMNDm3Q7QR1-TSIK5-peonF-UzkDFcVhEbhM7O3Ny10sDN0I-XU62gF8Aah8VNazn7VvTQZr4ZNffXJTL~WD-foD15Yl~xUTIMnPKhJyJY5w93zDwnOzVNSW0p~DSQra8T0LmE_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A">CSV Data of Snow fall</a>

In [16]:
import pandas as pd
df = pd.read_csv("https://d3c33hcgiwev3.cloudfront.net/n0betBHfRBiG3rQR3zQYVw_7f2b08854fcf4694a22774ae509e2f75_snowfall.csv?Expires=1610064000&Signature=MXnVJHDFO0ePORtSP-~hhWcK7VVzKeSKazyxHsMNDm3Q7QR1-TSIK5-peonF-UzkDFcVhEbhM7O3Ny10sDN0I-XU62gF8Aah8VNazn7VvTQZr4ZNffXJTL~WD-foD15Yl~xUTIMnPKhJyJY5w93zDwnOzVNSW0p~DSQra8T0LmE_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A")

In [17]:
df.head()

| Unnamed: 0 | rank | location | snowfall | state | city | lat | long | elevation |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | VALDEZ | 316.8 | AK | Valdez | 61.12994 | -146.349364 | 6.8 |
| 1 | 2 | MT. WASHINGTON | 260.0 | NH | Mt. Washington | 44.27046 | -71.303531 | 1913.4 |
| 2 | 3 | BLUE CANYON | 240.3 | CA | Blue Canyon | 39.257275 | -120.710825 | 1405.3 |
| 3 | 4 | YAKUTAT | 190.3 | AK | Yakutat | 59.572734 | -139.578312 | 26.0 |
| 4 | 5 | MARQUETTE | 149.1 | MI | Marquette | 46.543491 | -87.396433 | |


In [19]:
## get the subset of data for Colorado, Utah and Vermont
df1 = df[df['state'].isin(['CO','UT','VT'])]

In [20]:
df1_pivot = pd.pivot_table(df1, values='snowfall', index='state',
                            aggfunc=['count', 'mean', 'max'])

print(df1_pivot)

         count     mean      max
      snowfall snowfall snowfall
state                           
CO           5    37.76     59.6
UT           2    51.65     58.2
VT           1    80.90     80.9


- Analyze

There is not enough data to do a one-way ANOVA. The experiment is not a failure, though; it yields a few pieces of information:

1. There is not enough data.
2. There is a small possibility that VT gets more snow on average than either CO or UT.
3. Our degree of belief in the conclusion drawn from (2) is very small because of (1).

The notion of degree of belief is central to scientific thinking. It is somehow a part of our human nature to believe statements that have little to no supporting evidence. In science, the degree of belief in a hypothesis is proportional to the evidence. With more evidence available, ideally from repeated experiments, one’s degree of belief should change. Evidence is derived from the process described above, and if we have none then we are stuck at the question stage and a proper scientific hypothesis cannot be made.
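For readers who have not met the one-way ANOVA mentioned above, the sketch below computes its F statistic by hand: the variation between group means divided by the variation within groups. The function name and the numbers are hypothetical and are deliberately not the snowfall values. It also hints at why the snowfall sample is thin: a state with a single observation contributes nothing to the within-group estimate.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    df_between, df_within = k - 1, n_total - k
    if k < 2 or df_within < 1:
        raise ValueError("need at least two groups and more observations than groups")
    grand_mean = fmean([x for g in groups for x in g])
    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical measurements for three groups, invented for illustration.
# These are NOT the snowfall values from the CSV.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f_stat, df_b, df_w)
```

Turning the F statistic into a p-value requires the F distribution, which a statistics library such as SciPy provides; that step is left out here to keep the sketch dependency-free.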

The other important side to degree of belief is that it never caps out at 100 percent certainty. Some hypotheses have become laws like Newton’s Law of Gravitation, but most natural phenomena in the world outside of physics cannot be explained as a law.

A hypothesis is the simplest explanation of a phenomenon. A scientific theory is an in-depth explanation of the observed phenomenon. Do not be misled by the word theory: there can be sufficient evidence that your degree of belief all but touches 100%, which is plenty for decision-making purposes. A built-in safeguard for scientific thought is that our degree of belief does not reach 100%, which leaves some room to find new evidence that could move the dial in the other direction.

There are additional factors like external peer review that help ensure the integrity of the scientific method and in the case of implementing a model for a specific business task this could mean assigning reviewers for a pull request or simply asking other qualified individuals to check over your work.



# Data Ingestion

## Limitations of Extract, Transform, Load (ETL)

Extract, Transform, Load (ETL) is a process to move data from its original source to another target destination. The target destination is often a database or a data warehouse, but it could be a simple flat file. The commonly referred to acronym, ETL, consists of three distinct stages:

### Extract

Read data from a source (e.g. database) and extract the desired subset of data. The purpose of this step is to retrieve all the required data from the source system with minimum resources. This step needs to be designed in a way that it does not affect the source system negatively in terms of performance or response time. Often this means that the regularly scheduled pull is performed at night, when the system is not under load.

### Transform

The transform stage cleanses and prepares the extracted data using lookup tables or rules. Data from heterogeneous sources can be combined at this stage. The transform step also includes validation of records, rejection of data (if they are not acceptable). The commonly used processes for transformation are conversion, sorting, filtering, clearing the duplicates, standardizing, translating and looking up or verifying the consistency of data sources.

### Load

Loading is the last stage of an ETL process. The load function writes the extracted and transformed data (all of the subset or just the changes) to a target data location. Often the target data are inserted as records in a target database using a SQL INSERT statement.
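The three stages above can be sketched end to end with the standard library. This is a toy illustration under stated assumptions: an in-memory CSV with invented rows stands in for the source system, an in-memory SQLite database stands in for the target, and the table and column names are hypothetical.

```python
import csv
import io
import sqlite3

# Stand-in for a source system export; rows are invented for illustration.
SOURCE_CSV = """customer_id,country,amount
1,sg,10.50
2,SG,7.25
2,SG,7.25
3,us,not-a-number
"""


def extract(source):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(source))


def transform(rows):
    """Transform: validate, standardize, and de-duplicate the rows."""
    seen, clean = set(), []
    for row in rows:
        try:
            record = (int(row["customer_id"]), row["country"].upper(),
                      float(row["amount"]))
        except (KeyError, ValueError, AttributeError):
            continue  # reject records that fail validation
        if record not in seen:
            seen.add(record)
            clean.append(record)
    return clean


def load(records, conn):
    """Load: insert the transformed records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(customer_id INTEGER, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()
    return len(records)


conn = sqlite3.connect(":memory:")
loaded = load(transform(extract(io.StringIO(SOURCE_CSV))), conn)
print(loaded, conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])
```

Keeping each stage as its own function makes it easier to test the stages separately and to swap the source or target later.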

ETL has been around for a long time and it often implies the use of SQL databases. Descriptions of ETL often fall short of covering all of the common tasks associated with this part of the AI workflow. Some examples of developments that have forced the industry to re-think the boundaries of ETL are:

- Source and target locations are not always SQL databases.
- Maintaining quality data can become more expensive than just storing everything.
- Streaming data is an alternative to regularly scheduled pulls.
- NoSQL technologies and enterprise solutions such as IBM’s enterprise data warehousing solution are common alternatives.

Another important limitation of the traditional ETL philosophy is that decisions have to be made up front about which data are going to be important. Sometimes the specific data, or the form that the data must take, is not evident. Sometimes the best data for a model can change if a different model is selected. Flexibility in the early stages of the AI workflow is critical to avoid complications or errors later on.

##  Enterprise Data Stores for Data Ingestion

Large data stores are the norm in large enterprises. The concept of a data lake reflects this reality. Data lakes are very large collections of data stored in their natural formats, usually as object blobs or files. Today’s data scientist must be proficient in building data pipelines that tap directly into such large collections of raw data, then process the data to gain insights.

Along with data lakes, technologies such as Apache Hadoop enable large enterprises to store very large amounts of data, and to access the data quickly for analysis. Hadoop has two advantages that make it useful in large enterprises. First, it is designed from the ground up to be fault tolerant. A Hadoop cluster runs on an array of individual commodity servers designed to cleanly fail over without loss of data or processing power. Second, Hadoop clusters allow for parallel execution of data analysis code against the blocks of data stored in the cluster. This enables the rapid execution of complex analyses against huge amounts of data.

While many data ingestion pipelines draw data directly from sources such as data lakes and Hadoop clusters, data scientists in large enterprises will sometimes work with data engineers to build a data warehouse. A data warehouse keeps data gathered and integrated from different sources (e.g., a data lake) and stores the large number of records needed for long-term usage by machine learning systems. A data warehouse is typically built using data extractions, data transformations and data loads. After selecting data from the sources of origin, data ingestion procedures resolve problems in the data and ready it for research and modeling.

Modern large enterprises have adopted sophisticated data management processes and systems to handle very large amounts of data. With large datasets and complex use cases, data ingestion involves the ability to use data from a wide variety of sources, mixing and matching those sources to create data pipelines that feed machine learning models.



## Why We Need a Data Ingestion Process

Cleaning, parsing, assembling and gut-checking data is among the most time-consuming tasks that a data scientist has to perform. In fact, the problem is not new as statisticians have been dealing with the same dilemma for many decades. The time spent on data cleaning can start at 60% and increase depending on data quality and the project requirements. One could debate the proportion and surely it depends on the team, the data and a number of other factors, but one statement that is difficult to argue against is

**Very significant portions of time are often devoted to data ingestion pipelines**.

For many enterprises, data are the most important asset, and when this is true, maintaining the quality of those data is paramount. Poor data quality can result in project delays, budget projection shortfalls, or other avoidable challenges. The quality of data refers to both the observations themselves and the maturity of the dataset as a whole. Companies may consider improving their data ingestion infrastructure and methods for the benefits this could return.

## Data Ingestion and Automation

Data engineers exist in many organizations to ease the burden of the data ingestion process. If the target data source is a database, then there are some useful tools and procedures under the umbrella term database testing. Data warehouse automation is the general term for efforts to improve the overall process of data ingestion. Testing is an essential piece of data warehouse automation, because the quality of downstream models is tied to the quality of the available data.

***IMPORTANT***:  The testing process is data-centric and it helps validate that data has been transformed and loaded into the target destination as expected. It is a critical part of data ingestion automation.

Testing can involve comparing large volumes of data, which may contain millions of records. The size of the data can pose challenges, but in some cases a more significant challenge is the heterogeneous nature of the data. You may find that you are using data of various types and sources: flat files, relational databases, open API feeds like Twitter, XML web services and many others. Connecting all these heterogeneous sources in a standardized way can be a non-trivial task. With more sources of data comes an increased need for testing.
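One simple form of data-centric test is a reconciliation check that compares a source and a target after a load. The sketch below is a minimal illustration under stated assumptions: the `sales` table, the `make_db` and `reconcile` helpers and the in-memory SQLite databases are all hypothetical stand-ins for real systems.

```python
import sqlite3

def make_db(rows):
    """Build an in-memory database holding a small 'sales' table (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

def reconcile(source, target, table, column):
    """Compare row counts and a column total between source and target.

    The table and column names are interpolated into the SQL, so they must be
    trusted identifiers, never user input.
    """
    query = f"SELECT COUNT(*), COALESCE(SUM({column}), 0) FROM {table}"
    src_count, src_total = source.execute(query).fetchone()
    tgt_count, tgt_total = target.execute(query).fetchone()
    return {
        "row_count_match": src_count == tgt_count,
        "total_match": abs(src_total - tgt_total) < 1e-9,
    }

rows = [(1, 10.5), (2, 3.0), (3, 7.25)]
source = make_db(rows)
complete_target = make_db(rows)
partial_target = make_db(rows[:2])  # simulates a load that dropped a record

good = reconcile(source, complete_target, "sales", "amount")
bad = reconcile(source, partial_target, "sales", "amount")
print(good)
print(bad)
```

Aggregate checks like these are cheap enough to run on millions of records, although they cannot pinpoint which record differs. A row-level comparison would be needed for that.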

In reality, any form of data movement from source to target can be considered data ingestion. In large enterprises like hospitals it is not uncommon to have dozens of independent systems saving data, oftentimes in a redundant way. A common database as a target is next to impossible due to logistical and privacy concerns, but a well-constructed gateway in the form of an Application Programming Interface (API) and API keys could be a viable solution towards automation.

Outside of more comprehensive solutions automation can be achieved with scripting. If the data ingestion code exists as a script (e.g. Bash or Python), then <a href="https://en.wikipedia.org/wiki/Cron">cron jobs</a> are an incredibly powerful way to automate the process. The testing process can be automated with cron as well.
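As a rough sketch of what a cron-friendly ingestion script might look like, consider the example below. The file paths in the crontab comment are placeholders, and `fetch_records` and `run_ingestion` are hypothetical names standing in for the real extraction logic.

```python
import logging
import sys
from datetime import datetime, timezone

# Hypothetical crontab entry (paths are placeholders) to run this script daily at 02:00:
#   0 2 * * * /usr/bin/python3 /path/to/ingest.py >> /path/to/ingest.log 2>&1

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def fetch_records():
    """Placeholder for the real extraction step (database query, file read, API call)."""
    return [{"id": 1, "value": 3.2}, {"id": 2, "value": 4.8}]

def run_ingestion():
    """Run one ingestion pass and return a process exit code (0 means success)."""
    started = datetime.now(timezone.utc)
    try:
        records = fetch_records()
        if not records:
            raise ValueError("no records fetched")
        logging.info("ingested %d records (started %s)", len(records), started.isoformat())
        return 0
    except Exception:
        logging.exception("ingestion failed")
        return 1

exit_code = run_ingestion()
if exit_code != 0:
    # A non-zero exit status lets wrappers or monitoring around cron detect the failure
    sys.exit(exit_code)
```

Returning an explicit exit code and logging to a file keeps unattended runs observable, which matters because cron itself does little more than launch the command on schedule.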

## Sparse Matrices are Used Early in Data Ingestion Development

Once a well-trained machine learning model has been deployed, the data ingestion pipeline for that model will also be deployed. That pipeline will consist of a collection of tools and systems used to fetch, transform, and feed data to the machine learning system in production.

However, that pipeline cannot be finalized during the development of the machine learning model it feeds. Finalizing the process of data ingestion before models have been run and your hypotheses about the business use case have been tested often leads to lots of re-work. Early experiments almost always fail and you should be careful about investing large amounts of time in building a data ingestion pipeline until there is enough accumulated evidence that a deployed model will help the business.

Instead of building a complete data ingestion pipeline, data scientists will often use sparse matrices during the development and testing of a machine learning model. Sparse matrices are used to represent complex sets of data (e.g., word counts) in a way that reduces the use of computer memory and processing time.

The SciPy package provides a module, scipy.sparse, for working with sparse matrices. The code block below imports this module as well as NumPy for calculations.

In [30]:
import numpy as np
from scipy import sparse

Sparse matrices offer a middle ground between a comprehensive data warehouse solution with extensive test coverage and a directory of text files and database dumps. Sparse matrices do not work for all data types, but in situations where they are an appropriate technology you can leverage them even under load in production. Let's use an example to see how this process might play out.

A sparse matrix is one in which most of the values are zero. If the number of zero-valued elements divided by the size of the matrix is greater than 0.5, then it is considered sparse.

In [31]:
A = np.random.randint(0,2,100000).reshape(100,1000)
sparsity = 1.0 - (np.count_nonzero(A) / A.size)
print(round(sparsity,4))

0.5001


Very large matrices require significant amounts of memory. If we make a matrix of counts for a document or a book where the features are all known English words, the chances are high that your personal machine does not have enough memory to represent it as a dense matrix. Sparse matrices have the additional advantage of getting around time-complexity issues that arise with operations on large dense matrices.
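To make the memory argument concrete, the sketch below compares the bytes used by a dense array with the bytes used by the three arrays that back its CSR representation. The matrix size, the Poisson rate and the seed are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
dense = rng.poisson(0.01, (1000, 1000))  # roughly 1% of the entries are non-zero
csr = sparse.csr_matrix(dense)

# A dense array stores every element, zeros included
dense_bytes = dense.nbytes

# CSR stores only the non-zero values plus two index arrays
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)
print("stored elements:", csr.nnz)
```

The saving grows with the fraction of zeros. For a matrix that is only about half zeros, the index overhead can cancel out most of the benefit.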

***WARNING:  Many of the common functions like np.dot do not work on sparse matrices. See the scipy.sparse docs to learn about the specific functions for matrix products.***
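As a small illustration of the matrix-product point, sparse matrices expose their own `dot` method and also support the `@` operator. The matrix and vector below are made-up values, and the dense result is computed only as a cross-check.

```python
import numpy as np
from scipy import sparse

dense = np.array([[0, 2, 0],
                  [1, 0, 0],
                  [0, 0, 3]])
A = sparse.csr_matrix(dense)
x = np.array([1, 2, 3])

# Sparse matrix times dense vector returns a dense NumPy array
y = A.dot(x)

# Sparse matrix times sparse matrix returns another sparse matrix
AtA = A.T @ A

print(y)
print(AtA.toarray())
```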

Some of the common applications of sparse matrices are:

- word counts with a large vocabulary
- recommender systems
- large networks
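The word-count application can be sketched with a few toy documents. The documents and the whitespace tokenization are assumptions made for illustration, and a real pipeline would likely use a dedicated tokenizer. The sketch relies on the fact that duplicate coordinates are summed when a coordinate-format matrix is converted to CSR.

```python
from scipy import sparse

docs = ["the cat sat", "the dog sat on the mat", "cat and dog"]

vocab = {}                       # word -> column index
rows, cols, vals = [], [], []
for doc_id, doc in enumerate(docs):
    for word in doc.split():
        col = vocab.setdefault(word, len(vocab))
        rows.append(doc_id)
        cols.append(col)
        vals.append(1)           # one entry per occurrence

# Repeated (row, col) pairs are summed during the conversion to CSR
counts = sparse.coo_matrix((vals, (rows, cols)),
                           shape=(len(docs), len(vocab))).tocsr()

print(counts.shape)
print(counts[1, vocab["the"]])
```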

There are different types of sparse matrix representations in Python available through SciPy. The most commonly used are:

#### coo_matrix - sparse matrix built from the COOrdinates and values of the non-zero entries.

In [32]:
A = np.random.poisson(0.3, (10,100))
B = sparse.coo_matrix(A)
C = B.todense()

print("A",type(A),A.shape,"\n"
      "B",type(B),B.shape,"\n"
      "C",type(C),C.shape,"\n")

A <class 'numpy.ndarray'> (10, 100) 
B <class 'scipy.sparse.coo.coo_matrix'> (10, 100) 
C <class 'numpy.matrix'> (10, 100) 



#### csc_matrix - Compressed Sparse Column format, which removes the redundancy of repeated column indices found in the coordinate format

In [33]:
A = np.random.poisson(0.3, (10,100))
B = sparse.csc_matrix(A)

In [37]:
B

<10x100 sparse matrix of type '<class 'numpy.int64'>'
	with 269 stored elements in Compressed Sparse Column format>

Because the coordinate format is easier to create, it is common to create it first and then convert it to another, more efficient format. Let us first show how to create a matrix from coordinates:

In [38]:
rows = [0,1,2,8]
cols = [1,0,4,8]
vals = [1,2,1,4]

A = sparse.coo_matrix((vals, (rows, cols)))
print(A.todense())

[[0 1 0 0 0 0 0 0 0]
 [2 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 4]]


In [39]:
B = A.tocsr()

In [40]:
B

<9x9 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>
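One practical reason to convert out of the coordinate format is slicing. As a brief sketch, the same coordinates can be converted to CSR for row access and to CSC for column access. The variable names below are chosen for this illustration only.

```python
from scipy import sparse

rows = [0, 1, 2, 8]
cols = [1, 0, 4, 8]
vals = [1, 2, 1, 4]
A = sparse.coo_matrix((vals, (rows, cols)))

csr = A.tocsr()   # compressed by row: suited to row slicing
csc = A.tocsc()   # compressed by column: suited to column slicing

second_row = csr[1, :]
fifth_col = csc[:, 4]

print(second_row.toarray())
print(fifth_col.toarray().ravel())
```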

Because this introduction to sparse matrices is applied to data ingestion, we need to be able to:

- concatenate matrices (e.g., add a new user to a recommender matrix)
- read and write the matrices to and from disk

In [41]:
## matrix merge example
C = sparse.csr_matrix(np.array([0,1,0,0,2,0,0,0,1]).reshape(1,9))
print(B.shape,C.shape)
D = sparse.vstack([B,C])
print(D.todense())

(9, 9) (1, 9)
[[0 1 0 0 0 0 0 0 0]
 [2 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 4]
 [0 1 0 0 2 0 0 0 1]]


In [42]:
## read and write
file_name = "sparse_matrix.npz"
sparse.save_npz(file_name, D)
E = sparse.load_npz(file_name)
print(E.shape)

(10, 9)
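The two operations can be combined into a single ingestion step that appends new rows to a matrix already stored on disk. This is a minimal sketch: `append_rows` and the `ratings.npz` file name are hypothetical, and a temporary directory is used so that the example cleans up after itself.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

def append_rows(path, new_rows):
    """Load a stored sparse matrix, stack new rows beneath it and save it back."""
    stored = sparse.load_npz(path)
    updated = sparse.vstack([stored, sparse.csr_matrix(new_rows)], format="csr")
    sparse.save_npz(path, updated)
    return updated.shape

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ratings.npz")

    # Start with three users and three items
    sparse.save_npz(path, sparse.csr_matrix(np.eye(3, dtype=np.int64)))

    # A new user arrives with a single rating for the second item
    new_shape = append_rows(path, np.array([[0, 5, 0]]))
    reloaded = sparse.load_npz(path)

print(new_shape)
print(reloaded.toarray())
```

Rewriting the whole file on every append is acceptable during development. A production pipeline would more likely batch the updates.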


## Additional Resources

- <a href='https://www.ibm.com/cloud/blog/ibm-data-catalog-data-scientists-productivity'>Breaking the 80/20 rule: How data catalogs transform data scientists productivity</a>
- <a href="https://developer.ibm.com/articles/data-preprocessing-in-detail/">Data preprocessing in detail</a>
- <a href="https://www.ibm.com/blogs/research/2017/06/automating-low-level-tasks-data-scientists/">Automating low-level tasks for data scientists</a>