### ML Pipeline

ML pipelines are cyclical and iterative: every step is repeated until the model performs well enough to be considered successful.

#### Key Stages in building ML Pipelines

<UL>
    <li><b>Problem Definition:</b> Define the business problem you need an answer for.</li>
    <br>
    <li><b>Data Ingestion:</b> Identify and gather the data you want to work with.</li>
    <br>
    <li><b>Data Preparation:</b> Raw data is rarely in the correct form to be processed. This stage usually involves <i>filling missing values</i>, <i>removing duplicate records</i>, <i>normalising</i>, and correcting other flaws in the data, such as different representations of the same value in a column. This is also where <i>feature extraction, construction and selection take place.</i></li>
    <br>
    <li><b>Data Segregation:</b> Split subsets of data to train the model, test it and further validate how it performs against new data.</li>
    <br>
    <li><b>Model Training:</b> Use the training subset of data to let the <i>ML algorithm</i> recognise the patterns in it.</li>
    <br>
    <li><b>Candidate Model Evaluation:</b> Assess the performance of the model using test and validation subsets of data to understand how accurate the prediction is. This is an iterative process and <i>various algorithms might be tested until you have a Model that sufficiently answers your question.</i></li>
    <br>
    <li><b>Model Deployment:</b> Once the chosen model is produced, it is typically <i>exposed via some kind of API and embedded in decision-making frameworks as a part of an analytics solution.</i></li>
    <br>
    <li><b>Performance Monitoring:</b> The model is continuously monitored to observe how it behaves in the real world, and is calibrated accordingly. <i>New data is collected to incrementally improve it.</i></li>
</UL>

### Architecting an ML Pipeline

Traditionally, pipelines involve overnight batch processing, i.e. <b><i>collecting data, sending it through an enterprise message bus and processing it to provide pre-calculated results</i></b> and guidance for the next day’s operations.

#### Batch Processing Mode
<br>
Processes a large volume of data all at once. May require dedicated staff to handle issues.

#### Note:
<br>
<b>NoSQL document databases</b> are well suited to storing large volumes of rapidly changing structured and/or unstructured data, since they are schema-less. They also offer distributed, scalable, replicated data storage.

### Data Ingestion
<br>

    Data Collection

#### Offline Layer

In the offline layer, data flows into the <b>Raw Data Store</b> via an <b><i>Ingestion Service</i></b>, a composite orchestration service that encapsulates the data sourcing and persistence.

<br>
Internally, a repository pattern is employed to interact with a data service, which in turn interacts with the data store. When the data is saved in the database, a unique batch-id is assigned to the dataset, to allow for efficient querying and end-to-end data lineage and traceability.
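
A minimal sketch of the repository pattern and batch-id assignment described above. The class names (`InMemoryDataService`, `RawDataRepository`, `IngestionService`) and the use of a UUID as the batch-id are assumptions made for illustration; a real data service would talk to an actual database.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class InMemoryDataService:
    """Hypothetical stand-in for the data service that talks to the data store."""
    _batches: dict = field(default_factory=dict)

    def save(self, batch_id, records):
        self._batches[batch_id] = list(records)

    def find_by_batch(self, batch_id):
        return self._batches.get(batch_id, [])


class RawDataRepository:
    """Repository: the only persistence interface the ingestion service sees."""

    def __init__(self, data_service):
        self._data_service = data_service

    def add_batch(self, records):
        batch_id = str(uuid.uuid4())  # unique id assigned when the data is saved
        # Stamp every record so it can be traced back to its batch later.
        stamped = [{**record, "batch_id": batch_id} for record in records]
        self._data_service.save(batch_id, stamped)
        return batch_id

    def get_batch(self, batch_id):
        return self._data_service.find_by_batch(batch_id)


class IngestionService:
    """Orchestrates data sourcing and persistence."""

    def __init__(self, repository):
        self._repository = repository

    def ingest(self, source):
        records = list(source)                      # data sourcing
        return self._repository.add_batch(records)  # persistence


repo = RawDataRepository(InMemoryDataService())
service = IngestionService(repo)
batch_id = service.ingest([{"user": "a", "amount": 10}, {"user": "b", "amount": 7}])
stored = repo.get_batch(batch_id)
```

Because every stored record carries its `batch_id`, later stages can query one batch at a time and trace a feature back to the raw data it came from.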

To be performant, the <i>ingestion distribution</i> is <b>twofold</b>:
<br>
<ul>
    <li> There is a <i>dedicated pipeline</i> for <b>each dataset</b> so all of them are processed independently and <i>concurrently</i>, and</li>
    <li> Within each pipeline, the data is <b>partitioned</b> to take advantage of multiple server cores, processors or even servers (example: <b>MongoDB</b>).</li>
</ul>
<br>
Spreading the data preparation across multiple pipelines, horizontally and vertically, reduces the overall time to complete the job.
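
The twofold distribution can be sketched with the standard library as follows. This is only an illustration: the dataset names, the round-robin `partition` helper and the placeholder `ingest_partition` are assumptions, and threads are used for brevity where a real system would spread partitions across processes or servers.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_partitions):
    """Split records into n roughly equal partitions (round-robin)."""
    return [records[i::n_partitions] for i in range(n_partitions)]


def ingest_partition(dataset_name, part):
    # Placeholder for the real ingestion work: here we only count records.
    return len(part)


def run_pipeline(dataset_name, records, n_partitions=4):
    """Dedicated pipeline for one dataset; its partitions run concurrently."""
    parts = partition(records, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        counts = list(pool.map(lambda p: ingest_partition(dataset_name, p), parts))
    return dataset_name, sum(counts)


datasets = {"orders": list(range(1000)), "customers": list(range(250))}

# Level 1: one pipeline per dataset, all running independently.
# Level 2: inside run_pipeline, each dataset is partitioned again.
with ThreadPoolExecutor(max_workers=len(datasets)) as pool:
    futures = [pool.submit(run_pipeline, name, recs) for name, recs in datasets.items()]
    results = dict(f.result() for f in futures)
```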

#### Online Layer

In the online layer, the <b>Online Ingestion Service</b> is the <i>entry point</i> to the streaming architecture as it <i>decouples and manages the flow of information</i> from data sources to the processing and storage components, by providing <b>reliable, high throughput, low latency capabilities</b>.

<br>
It functions as an <i>enterprise-scale</i> '<b>Data Bus</b>'.

<br>

Data is saved in a long-term Raw Data Store, but is also passed through a '<b>pass-through layer</b>' to the next online streaming service, for further real-time processing.

Example technologies used here can be <b>Apache Kafka</b> (pub/sub messaging system) and <b>Apache Flume</b> (data collection to long-term db).
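
To make the 'Data Bus' idea concrete, here is a toy in-memory publish/subscribe sketch. It is not how Kafka or Flume are used; `InMemoryDataBus`, the topic name and the two list-based sinks are hypothetical stand-ins that only show how one event is both persisted and passed through.

```python
from collections import defaultdict


class InMemoryDataBus:
    """Toy pub/sub bus; a hypothetical stand-in for a real message broker."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Producers never call consumers directly: the bus decouples them.
        for handler in self._subscribers[topic]:
            handler(event)


raw_data_store = []   # stand-in for the long-term Raw Data Store
online_stream = []    # stand-in for the next online streaming service

bus = InMemoryDataBus()
bus.subscribe("raw-events", raw_data_store.append)  # persistence
bus.subscribe("raw-events", online_stream.append)   # pass-through layer

for event in [{"id": 1, "value": 3.2}, {"id": 2, "value": 4.8}]:
    bus.publish("raw-events", event)
```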

### Data Preparation
<br>

    Data exploration, data transformation and feature engineering.

Once the data is ingested, a distributed pipeline is generated which assesses the <i>condition of the data</i>, i.e. <b>looks for format differences, outliers, trends, incorrect, missing, or skewed data</b> and <b>rectifies any anomalies</b> along the way.

This step also includes the feature engineering process.
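
As a small illustration of rectifying anomalies, the sketch below removes duplicate records and fills missing values. The `prepare` helper, the sample records and the choice of median imputation are assumptions for the example, not a prescribed method.

```python
from statistics import median


def prepare(records, column):
    """Drop exact duplicate records, then fill missing values in `column`
    with the median of the values that are present (an assumed policy)."""
    seen, unique = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)

    present = [r[column] for r in unique if r[column] is not None]
    fill = median(present)
    # Build new dicts instead of mutating the input records.
    return [{**r, column: fill if r[column] is None else r[column]} for r in unique]


raw = [
    {"id": 1, "age": 30},
    {"id": 2, "age": None},   # missing value
    {"id": 3, "age": 50},
    {"id": 3, "age": 50},     # duplicate record
]
clean = prepare(raw, "age")
```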
<br>

Three main phases in <i>feature pipeline</i>:
<ul>
    <li><b>Extraction:</b> <i>Input</i>: '<b>Raw Data</b>' <b>|</b> <i>Output</i>: '<b>Features</b>'</li>
    <br>
    <li><b>Transformation:</b> <i>Input</i>: '<b>Features</b>' <b>|</b> <i>Output</i>: '<b>Features(<i>transformations</i>)</b>'</li>
    <br>
    <li><b>Selection:</b> <i>Input</i>: '<b>List[Features]</b>' <b>|</b> <i>Output</i>: '<b>List[Features]</b>'</li>
</ul>

#### This step is the most complex part of the ML project.

Introducing the right design patterns is crucial. In terms of code organisation, a sensible approach is to have a factory method that generates the features based on <i>some common abstract feature behaviour</i>, together with a <i>strategy pattern</i> that allows the selection of the <b>right features at run time</b>.
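
A minimal sketch of how a factory method and a strategy pattern might fit together for features. All names here (`Feature`, `TextLength`, `WordCount`, `make_feature`, `select_all`, `select_cheap`) are hypothetical and chosen only for illustration.

```python
from abc import ABC, abstractmethod


class Feature(ABC):
    """Common abstract feature behaviour."""
    name: str

    @abstractmethod
    def compute(self, record: dict) -> float:
        ...


class TextLength(Feature):
    name = "text_length"

    def compute(self, record):
        return float(len(record["text"]))


class WordCount(Feature):
    name = "word_count"

    def compute(self, record):
        return float(len(record["text"].split()))


_REGISTRY = {cls.name: cls for cls in (TextLength, WordCount)}


def make_feature(name: str) -> Feature:
    """Factory method: build a feature object from its name."""
    return _REGISTRY[name]()


# Strategy pattern: interchangeable selection policies with one signature.
def select_all(features):
    return list(features)


def select_cheap(features):
    return [f for f in features if f.name == "text_length"]


def build_feature_vector(record, feature_names, selection_strategy):
    features = selection_strategy(make_feature(n) for n in feature_names)
    return {f.name: f.compute(record) for f in features}


record = {"text": "pipelines are iterative"}
names = ["text_length", "word_count"]
full = build_feature_vector(record, names, select_all)
cheap = build_feature_vector(record, names, select_cheap)
```

The caller chooses `select_all` or `select_cheap` at run time without touching the feature classes, which is the point of keeping selection behind a strategy.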

Both <i>feature extractors</i> and <i>transformers</i> should be structured with <b>composition</b> and <b>re-usability</b> in mind.

Broadly speaking, a <i>data preparation pipeline</i> should be assembled into a <b>series of immutable transformations</b>, that can easily be combined. 
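
One way to read "a series of immutable transformations" is as pure functions that return new data and can be chained. The sketch below assumes rows are held in tuples of dicts; `compose`, `drop_missing` and `scale` are illustrative names.

```python
from functools import reduce


def compose(*steps):
    """Chain steps left to right into a single transformation."""
    return lambda rows: reduce(lambda acc, step: step(acc), steps, rows)


def drop_missing(rows):
    # Returns a new tuple; the input is left untouched.
    return tuple(r for r in rows if r["value"] is not None)


def scale(factor):
    def step(rows):
        return tuple({**r, "value": r["value"] * factor} for r in rows)
    return step


pipeline = compose(drop_missing, scale(0.5))

rows = ({"value": 4}, {"value": None}, {"value": 10})
result = pipeline(rows)
```

Since no step mutates its input, each one can be unit-tested in isolation and reused in other pipelines.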

<br>
This is where <i>testing and high code coverage</i> become <b>significant</b> factors in the project’s success.

### Offline

In the offline layer, the <b>Data Preparation Service</b> is <i>triggered</i> by the <i>completion of the ingestion service</i>. It sources the <b>Raw Data</b>, undertakes <b>all the feature engineering logic</b>, and <i>saves the generated features</i> in the <b>Feature Data Store.</b>

The same <b>twofold</b> partitioning applies to Data Preparation as it did to Data Ingestion (<b>dedicated and parallel pipelines</b>).

Optionally, the features from <i>multiple data sources can be <b>combined</b></i>, so a <b>‘join/sync’ task</b> is designed to <i>aggregate all the intermediate completion events</i> and <i>create these new, combined features</i>. 

<br>
Finally, the <i>notification service</i> broadcasts to the <b>broker</b> that the process is complete and the <i>features are available.</i>
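
The 'join/sync' task and the final notification can be sketched as below. `JoinTask`, the pipeline names and the list used in place of a broker are assumptions; the sketch only shows the idea of waiting for every completion event before combining features.

```python
class JoinTask:
    """Collects completion events and fires once every expected pipeline
    has reported (hypothetical sketch of a 'join/sync' task)."""

    def __init__(self, expected, on_complete):
        self._expected = set(expected)
        self._features = {}
        self._on_complete = on_complete

    def on_pipeline_done(self, pipeline, features):
        self._features[pipeline] = features
        if self._expected.issubset(self._features):
            combined = {}
            for name in sorted(self._features):
                combined.update(self._features[name])
            self._on_complete(combined)  # notify: features are available


notifications = []  # stand-in for the broker
join = JoinTask(["orders", "customers"], notifications.append)

join.on_pipeline_done("orders", {"order_count": 12})
fired_early = bool(notifications)  # still waiting for "customers"
join.on_pipeline_done("customers", {"customer_age": 41})
```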

When each data preparation pipeline finishes, <b>the features are also replicated to the Online Feature Data Store</b>, so that the features can be queried with low latency for real-time prediction.

### Online

The raw data is streamed from the ingestion pipeline into the <b>Online Data Preparation Service</b>. 

<br>
The generated features are stored in an <b>in-memory Online Feature Data Store</b> where they can be <i>read at low latency</i> at prediction time, but are also persisted in the long term Feature Data Store for future training.

Additionally the in-memory database can be <b>pre-warmed</b> by loading features from the <b>long term Feature Data Store</b>.
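
Pre-warming can be pictured as copying selected entries from the long-term store into the in-memory one before traffic arrives. In this sketch a plain dict stands in for both stores, and `OnlineFeatureStore`, `pre_warm` and the entity ids are hypothetical.

```python
class OnlineFeatureStore:
    """In-memory store for low-latency reads (a plain dict in this sketch)."""

    def __init__(self):
        self._cache = {}

    def put(self, entity_id, features):
        self._cache[entity_id] = features

    def get(self, entity_id):
        return self._cache.get(entity_id)

    def pre_warm(self, long_term_store, entity_ids):
        """Load features for the given ids; return how many were found."""
        loaded = 0
        for entity_id in entity_ids:
            features = long_term_store.get(entity_id)
            if features is not None:
                self.put(entity_id, features)
                loaded += 1
        return loaded


long_term_store = {"user-1": {"avg_spend": 31.5}, "user-2": {"avg_spend": 8.0}}

online = OnlineFeatureStore()
loaded = online.pre_warm(long_term_store, ["user-1", "user-3"])
```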

<br>
A frequently used streaming engine is <b>Apache Spark.</b>

### Data Segregation
    Splits subsets of data to train the model and further validate how it performs against new data.

The fundamental goal of the ML system is to use an accurate model based on the <b>quality</b> of its pattern prediction for data that it has not been trained on. As such, existing labelled data is used as a <b>proxy</b> <i>for</i> <b>future/unseen</b> data, by splitting it into training and evaluation subsets.

The data <b>segregation</b> is not a separate ML pipeline as such, but <b>an API or service</b> must be available to facilitate this task. 

<br>
The next two pipelines (<b>model training</b> and <b>evaluation</b>) must be able to call this API to get back the requested datasets.

In terms of code organisation, a <b>strategy pattern</b> is necessary so the caller service can select the <b>right algorithm</b> at run time. The caller must also be able to inject the split ratio and the random seed.

<br>
Additionally, the <b>API must be</b> able to return the data with or without labels/traits — for training and evaluation respectively.

<br>

To protect the caller from specifying parameters that cause an <b>uneven data distribution</b>, a warning should be raised and returned along with the dataset.
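
A minimal sketch of such a segregation API. The function names, the `"label"` key, and the 0.5 to 0.9 range used to flag a lopsided split are all assumptions made for the example; here the warning is simply returned alongside the datasets.

```python
import random


def random_split(rows, ratio, seed):
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


def sequential_split(rows, ratio, seed=None):
    """Keeps the original order (e.g. time-ordered data); seed is ignored."""
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]


STRATEGIES = {"random": random_split, "sequential": sequential_split}


def segregate(rows, strategy="random", ratio=0.8, seed=42, with_labels=True):
    """Return (train, evaluation, warnings)."""
    notes = []
    if not 0.5 <= ratio <= 0.9:  # assumed threshold for a lopsided split
        notes.append(f"ratio={ratio} may give an uneven train/evaluation split")
    train, evaluation = STRATEGIES[strategy](list(rows), ratio, seed)
    if not with_labels:
        # Strip the label column from the evaluation subset.
        evaluation = [{k: v for k, v in r.items() if k != "label"} for r in evaluation]
    return train, evaluation, notes


rows = [{"x": i, "label": i % 2} for i in range(10)]
train, evaluation, notes = segregate(rows, strategy="random", ratio=0.8, seed=7,
                                     with_labels=False)
_, _, lopsided_notes = segregate(rows, ratio=0.99)
```

Passing the same `seed` reproduces the same split, which keeps training and evaluation runs comparable.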

### Model Training
    Use the training subset of data to let the ML algorithm recognise the patterns in it.

### Side Note:
            Each model implementation would be a subclass of an abstract class that requires the model maker to implement estimate() and predict() methods. But since you mentioned making wrappers for Jupyter notebook
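
The abstract class mentioned in this note could look roughly like the sketch below. Only the method names `estimate()` and `predict()` come from the note; the `Model` base class, the method signatures and the toy `MeanBaseline` subclass are assumptions.

```python
from abc import ABC, abstractmethod
from statistics import mean


class Model(ABC):
    """Abstract base: every model must implement estimate() and predict()."""

    @abstractmethod
    def estimate(self, features, targets):
        """Fit the model to the training data."""

    @abstractmethod
    def predict(self, features):
        """Return one prediction per input row."""


class MeanBaseline(Model):
    """Toy model that always predicts the mean of the training targets."""

    def __init__(self):
        self._mean = None

    def estimate(self, features, targets):
        self._mean = mean(targets)
        return self

    def predict(self, features):
        if self._mean is None:
            raise RuntimeError("call estimate() before predict()")
        return [self._mean for _ in features]


model = MeanBaseline().estimate([[1], [2], [3]], [10, 20, 30])
predictions = model.predict([[4], [5]])
```

Instantiating `Model` directly, or a subclass that omits either method, raises `TypeError`, so incomplete implementations fail early.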

### Side Note:

<br>
<b>Affinity Propagation</b> is an unsupervised machine learning algorithm that is particularly well suited for problems where we don't know the optimal number of <b>clusters</b>.