In [None]:
1. Data Ingestion Pipeline:


  a. Design a data ingestion pipeline that collects and stores data from various sources such as databases, APIs, and streaming platforms.


Data sources:
    In terms of plumbing — we are talking about pipelines, after all — data sources are the wells, lakes, and streams where organizations first gather data. SaaS vendors support thousands of potential data sources, and every organization hosts dozens of others on their own systems. As the first layer in a data pipeline, data sources are key to its design. Without quality data, there’s nothing to ingest and move through the pipeline.

Ingestion:
    The ingestion components of a data pipeline are the processes that read data from data sources (the pumps and aqueducts in our plumbing analogy). An extraction process reads from each data source using the application programming interfaces (APIs) that the source provides. Before you can write code that calls those APIs, though, you have to decide what data to extract through a process called data profiling: examining data for its characteristics and structure, and evaluating how well it fits a business purpose.

    After the data is profiled, it's ingested, either as batches or through streaming.

Batch ingestion and streaming ingestion:
    Batch processing is when sets of records are extracted and operated on as a group. Batch processing is sequential, and the ingestion mechanism reads, processes, and outputs groups of records according to criteria set by developers and analysts beforehand. The process does not watch for new records and move them along in real time, but instead runs on a schedule or acts based on external triggers.

    Streaming is an alternative data ingestion paradigm where data sources automatically pass along individual records or units of information one by one. Most organizations use batch ingestion for many kinds of data, and reserve streaming ingestion for cases where applications or analytics need near-real-time data with the minimum possible latency.

    Depending on an enterprise's data transformation needs, the data is either moved into a staging area or sent directly along its flow.
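
    The difference between the two ingestion modes can be sketched with plain Python generators. This is only an illustration under simple assumptions: the `readings` list stands in for a real source, and the function names `ingest_batch` and `ingest_stream` are hypothetical, not part of any library.

```python
from typing import Iterable, Iterator, List

Record = dict


def ingest_batch(source: Iterable[Record], batch_size: int) -> Iterator[List[Record]]:
    """Yield records in fixed-size groups; the last group may be smaller."""
    batch: List[Record] = []
    for record in source:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def ingest_stream(source: Iterable[Record]) -> Iterator[Record]:
    """Pass records along one at a time, as soon as each one is read."""
    for record in source:
        yield record


# Made-up sensor readings standing in for a real data source.
readings = [{"sensor_id": i % 2, "value": 20 + i} for i in range(5)]
batches = list(ingest_batch(readings, batch_size=2))
streamed = list(ingest_stream(readings))
```

    In a real pipeline the batch variant would typically be started by a scheduler or an external trigger, while the streaming variant would sit behind a message broker.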

Transformation:
    Once data is extracted from source systems, its structure or format may need to be adjusted. Processes that transform data are the desalination stations, treatment plants, and personal water filters of the data pipeline.

    Transformations include mapping coded values to more descriptive ones, filtering, and aggregation. Combination is a particularly important type of transformation. It includes database joins, where relationships encoded in relational data models can be leveraged to bring related multiple tables, columns, and records together.
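
    As a rough sketch of these transformation types, the snippet below maps coded values, filters, joins, and aggregates a few made-up rows. The table contents, the `STATUS_LABELS` mapping, and the `transform` function are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical rows from two related tables.
orders = [
    {"order_id": 1, "customer_id": 10, "status": "S", "amount": 40.0},
    {"order_id": 2, "customer_id": 11, "status": "C", "amount": 15.0},
    {"order_id": 3, "customer_id": 10, "status": "S", "amount": 60.0},
]
customers = {10: {"name": "Ada"}, 11: {"name": "Lin"}}
STATUS_LABELS = {"S": "shipped", "C": "cancelled"}


def transform(orders, customers):
    # Mapping: replace coded status values with descriptive ones.
    mapped = [{**o, "status": STATUS_LABELS[o["status"]]} for o in orders]
    # Filtering: keep only shipped orders.
    shipped = [o for o in mapped if o["status"] == "shipped"]
    # Combination (join): attach the customer name via customer_id.
    joined = [{**o, "name": customers[o["customer_id"]]["name"]} for o in shipped]
    # Aggregation: total shipped amount per customer.
    totals = defaultdict(float)
    for row in joined:
        totals[row["name"]] += row["amount"]
    return dict(totals)


totals = transform(orders, customers)
```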
    
    The timing of any transformations depends on what data replication process an enterprise decides to use in its data pipeline: ETL (extract, transform, load) or ELT (extract, load, transform). ETL, an older technology used with on-premises data warehouses, can transform data before it's loaded to its destination. ELT, used with modern cloud-based data warehouses, loads data without applying any transformations. Data consumers can then apply their own transformations on data within a data warehouse or data lake.

Destinations:
    Destinations are the water towers and holding tanks of the data pipeline. A data warehouse is the main destination for data replicated through the pipeline. These specialized databases contain all of an enterprise's cleaned, mastered data in a centralized location for use in analytics, reporting, and business intelligence by analysts and executives.

    Less-structured data can flow into data lakes, where data analysts and data scientists can access the large quantities of rich and minable information.

    Finally, an enterprise may feed data into an analytics tool or service that directly accepts data feeds.

Monitoring:
    Data pipelines are complex systems that consist of software, hardware, and networking components, all of which are subject to failures. To keep the pipeline operational and capable of extracting and loading data, developers must write monitoring, logging, and alerting code to help data engineers manage performance and resolve any problems that arise.

In [None]:
   b. Implement a real-time data ingestion pipeline for processing sensor data from IoT devices.


    Mobile phones are a prime example of IoT devices that can provide valuable real-time sensor data. With an estimated 3.8 billion smartphone users worldwide in 2016 and expected growth to 7.5 billion by 2026, an increase of about 97% in 10 years [Mobile network subscriptions worldwide 2028 | Statista], these devices generate a vast amount of data. Smartphones carry an array of sensors, such as accelerometers, gyroscopes, and GPS, which collect data on user activity, location, and behavior. Processing and analyzing this sensor data in real time lets businesses make immediate decisions, enabling personalized marketing, efficient transportation routing, and optimized health tracking.

    To let companies make these decisions in real time, the pipeline should have the following key characteristics:

1. Low Latency
2. High Availability
3. Scalability
4. Fault Tolerance
5. Real-time Data Processing
6. Data Freshness

    Overall, these features are essential for real-time systems to function properly. Having said that, let’s dive into the system architecture that makes this possible.

Architectural Components:

    A single payload from the phone is ~66Kb/second, which amounts to ~5.7Gb of data produced per phone per day. Choosing a highly scalable architecture is therefore very important when designing real-time data ingestion systems.
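
    The daily volume quoted above follows from a simple multiplication, shown here as a quick check. The units are kept exactly as the text states them ("Kb" and "Gb"), and decimal prefixes are assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

payload_per_second = 66  # in the "Kb" units used in the text
per_day_k = payload_per_second * SECONDS_PER_DAY
per_day_g = per_day_k / 1_000_000  # assumption: 1 G = 1,000,000 K (decimal prefixes)
```

    Here `per_day_k` is 5,702,400 and `per_day_g` is about 5.7, which matches the per-phone figure. Multiplying by the number of devices gives the load the pipeline has to absorb.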
    
Data Producers:

    The Data Producer component in our real-time data ingestion pipeline consists of a Sensor Logger application that runs on a smartphone and pushes data to a FastAPI POST endpoint. Pydantic models are used in the FastAPI application to validate incoming data. This component is critical because it collects data at the source and sends it into the pipeline.

    In a business context, this could represent a fleet of trucks or a network of IoT sensors, where real-time data is crucial for operational efficiency and decision-making. By using Pydantic models to validate incoming data, businesses can ensure that the data is accurate and reliable, leading to better insights and more informed decisions.
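
    The validation step can be illustrated without any third-party packages. The sketch below uses a standard-library dataclass as a stand-in for a Pydantic model; it is not the FastAPI or Pydantic API. The field names and allowed sensor types are assumptions, since the real Sensor Logger payload schema is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensorReading:
    """Hypothetical payload shape; the real Sensor Logger schema may differ."""
    device_id: str
    sensor: str
    timestamp_ms: int
    x: float
    y: float
    z: float

    def __post_init__(self):
        # Reject payloads that would pollute downstream analytics.
        if not self.device_id:
            raise ValueError("device_id must not be empty")
        if self.sensor not in {"accelerometer", "gyroscope"}:
            raise ValueError(f"unknown sensor: {self.sensor!r}")
        if self.timestamp_ms < 0:
            raise ValueError("timestamp_ms must be non-negative")
        for axis in ("x", "y", "z"):
            if not isinstance(getattr(self, axis), (int, float)):
                raise TypeError(f"{axis} must be numeric")


def parse_reading(payload: dict) -> SensorReading:
    """Validate a decoded JSON payload before it is handed to the broker."""
    return SensorReading(**payload)


good = parse_reading({"device_id": "phone-1", "sensor": "accelerometer",
                      "timestamp_ms": 1700000000000, "x": 0.1, "y": 9.8, "z": 0.0})
try:
    parse_reading({"device_id": "phone-1", "sensor": "thermometer",
                   "timestamp_ms": 0, "x": 0, "y": 0, "z": 0})
    rejected = False
except ValueError:
    rejected = True
```

    With Pydantic, the same checks would be declared on the model and FastAPI would run them automatically for each request body.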

Data Broker:

    We are using Kafka as the data broker within our pipeline. Kafka is well suited to real-time streaming data because it can handle large volumes of data from multiple sources, provides durability and fault tolerance, and supports distributed processing, allowing for efficient use of resources.

    In a business context, Kafka can be used to handle large volumes of real-time data from various sources, such as social media or online transactions. With Kafka, businesses can ensure that their data is always available, reliable, and scalable, leading to better real-time insights and faster decision-making.

Data Consumers:

    The Data Consumers component in our pipeline is a Python application that subscribes to the Kafka topic and processes and writes the data to an Elasticsearch index in an asynchronous fashion. This component is critical because it takes the raw data from Kafka and transforms it into a format that can be analyzed and visualized by businesses. By using Python for this task, businesses can take advantage of its flexibility and ease of use, allowing them to quickly analyze large amounts of data and gain valuable insights. By processing the data asynchronously, businesses can also ensure that the pipeline is scalable and can handle high volumes of data.
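
    The asynchronous consume, transform, and write loop can be sketched with `asyncio` alone. In this illustration an `asyncio.Queue` stands in for the Kafka topic and a plain list stands in for the Elasticsearch index; both substitutions, and the names `consume` and `run_pipeline`, are assumptions made to keep the example self-contained.

```python
import asyncio


async def consume(queue: asyncio.Queue, index: list) -> None:
    """Pull messages off the queue, transform them, and write them to the index."""
    while True:
        message = await queue.get()
        if message is None:  # sentinel: no more messages for this consumer
            queue.task_done()
            break
        document = {**message, "magnitude": abs(message["value"])}
        await asyncio.sleep(0)  # placeholder for an awaitable write to the data store
        index.append(document)
        queue.task_done()


async def run_pipeline(messages, n_consumers: int = 2) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    index: list = []
    consumers = [asyncio.create_task(consume(queue, index)) for _ in range(n_consumers)]
    for message in messages:
        await queue.put(message)
    for _ in range(n_consumers):
        await queue.put(None)  # one sentinel per consumer
    await asyncio.gather(*consumers)
    return index


indexed = asyncio.run(run_pipeline([{"id": i, "value": -i} for i in range(5)]))
```

    Inside a notebook, where an event loop is already running, `await run_pipeline(...)` would be used instead of `asyncio.run(...)`. Running several consumers concurrently is what lets the pipeline keep up when writes to the store are slow.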

Data Store:

    The Data Store component in our pipeline is Elasticsearch. Elasticsearch is a distributed search and analytics engine that is used for full-text search and real-time analytics. It indexes incoming documents and makes them searchable in near real time. Additionally, Elasticsearch pairs with Kibana, the companion tool in the Elastic Stack, for analytics and visualization.

    In a business context, Elasticsearch can be used to store and analyze large volumes of real-time data, such as customer interactions or sensor logs. With Elasticsearch, businesses can gain valuable insights into their operations and make informed decisions based on real-time data. Additionally, with Kibana, businesses can easily visualize their data and gain insights that might otherwise be difficult to see.

In [None]:
2. Model Training:

   a. Build a machine learning model to predict customer churn based on a given dataset. Train the model using appropriate algorithms and evaluate its performance.

    Customer attrition (a.k.a. customer churn) is one of the biggest costs for any organization. If we could figure out, with reasonable accuracy, why and when a customer leaves, it would greatly help the organization plan its retention initiatives. Let’s make use of a customer transaction dataset from Kaggle to understand the key steps involved in predicting customer attrition in Python.

    Supervised Machine Learning is nothing but learning a function that maps an input to an output based on example input-output pairs. A supervised machine learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. Given that we have data on current and prior customer transactions in the telecom dataset, this is a standardized supervised classification problem that tries to predict a binary outcome (Y/N).

In the real world, we need to go through seven major stages to successfully predict customer churn:
Data Preprocessing, Data Evaluation, Model Selection, Model Evaluation, Model Improvement, Future Predictions, Model Deployment.

 b. Develop a model training pipeline that incorporates feature engineering techniques such as one-hot encoding, feature scaling, and dimensionality reduction.

    The deployment of machine learning models (or pipelines) is the process of making models available in production where web applications, enterprise software (ERPs) and APIs can consume the trained model by providing new data points, and get the predictions.

    In short, Deployment in Machine Learning is the method by which you integrate a machine learning model into an existing production environment to make practical business decisions based on data. It is the last stage in the machine learning lifecycle.

    Normally the term Machine Learning Model Deployment is used to describe deployment of the entire Machine Learning Pipeline, in which the model itself is only one component of the Pipeline.

Docker
Docker is a company that provides software (also called Docker) that allows users to build, run and manage containers. While Docker's containers are the most common, other less famous alternatives such as LXD and LXC also provide container solutions.

Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers are used to package up an application with all of its necessary components, such as libraries and other dependencies, and ship it all out as one package.



4. One-hot encoding

    A one-hot encoding represents an element of a finite set of n categories as a vector of length n, in which the position corresponding to that element's index (in the range [0, n-1]) is set to “1” and all other positions are set to “0”. In contrast to binary encoding schemes, which pack the n categories into roughly log2(n) bits, this scheme dedicates one position to each possible case.
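
    A minimal sketch of one-hot encoding in plain Python is shown below. The `one_hot` helper and the contract-type values are made up for illustration; in practice a library encoder would usually be used.

```python
def one_hot(values):
    """Return (categories, rows); each row has a single 1 at its category's index."""
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical categorical feature from a churn dataset.
categories, encoded = one_hot(["month-to-month", "two-year", "one-year", "two-year"])
```

    Each row contains exactly one 1, and the column order follows the sorted category names.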

5. Scaling

    Feature scaling is one of the most common preprocessing steps in machine learning, and one of the most important to get right. To train a predictive model, we need data with a known set of features, and those features often need to be scaled up or down so that they are comparable.

    After a scaling operation, the continuous features become similar in terms of range. Although this step isn’t required for many algorithms, it’s still a good idea to do so. Distance-based algorithms like k-NN and k-Means, on the other hand, require scaled continuous features as model input. There are two common ways to scale features:

Normalization: Normalization (or min-max normalization) rescales all values into a fixed range, typically between 0 and 1. The transformation does not change the shape of the feature’s distribution, but it is sensitive to outliers: a single extreme value stretches the range and squeezes the remaining values into a narrow band. As a result, it is advised that outliers be dealt with prior to normalization.

Standardization: Standardization (also known as z-score normalization) rescales values using the mean and standard deviation. The feature’s mean is subtracted from every data point and the result is divided by the feature’s standard deviation, which yields a distribution with a mean of 0 and a variance of 1. Unlike min-max normalization, the output is not confined to a fixed range, so the effect of outliers is reduced.
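
    Both scaling methods are short enough to write out by hand. The sketch below uses only the standard library; the helper names and the `tenure_months` values are made up, and both functions assume the feature is not constant (otherwise the denominator would be zero).

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up feature values with mean 5 and standard deviation 2.
tenure_months = [2, 4, 4, 4, 5, 5, 7, 9]
normalized = min_max_scale(tenure_months)
standardized = standardize(tenure_months)
```

    For this toy feature the smallest value maps to 0 and the largest to 1 under normalization, while standardization maps 2 to -1.5 and 9 to 2.0.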

In [None]:
c. Train a deep learning model for image classification using transfer learning and fine-tuning techniques.


    A pre-trained model is a saved network that was previously trained on a large dataset, typically on a large-scale image-classification task. You either use the pretrained model as is or use transfer learning to customize this model to a given task.

    The intuition behind transfer learning for image classification is that if a model is trained on a large and general enough dataset, this model will effectively serve as a generic model of the visual world. You can then take advantage of these learned feature maps without having to start from scratch by training a large model on a large dataset.

Feature Extraction: 

    Use the representations learned by a previous network to extract meaningful features from new samples. You simply add a new classifier, which will be trained from scratch, on top of the pretrained model so that you can repurpose the feature maps learned previously for the dataset.

    You do not need to (re)train the entire model. The base convolutional network already contains features that are generically useful for classifying pictures. However, the final, classification part of the pretrained model is specific to the original classification task, and subsequently specific to the set of classes on which the model was trained.

Fine-Tuning:

    Unfreeze a few of the top layers of a frozen model base and jointly train both the newly-added classifier layers and the last layers of the base model. This allows us to "fine-tune" the higher-order feature representations in the base model in order to make them more relevant for the specific task.

In [None]:
3. Model Validation:

   a. Implement cross-validation to evaluate the performance of a regression model for predicting housing prices.

Cross Validation and Grid Search
    Cross-validation (CV) is a resampling procedure that is especially useful when the amount of data is limited. It randomly splits the data into K folds, fits a model on K-1 of them, validates the model on the remaining fold, and records the chosen performance metrics. CV repeats this process until each of the K folds has served as the validation set once. The average of the K scores for a metric is the final performance score for the model.
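
    The procedure can be sketched in plain Python for a one-feature linear regression. This is a hand-written illustration, not sklearn's implementation: the helper names, the single `area` feature, and the noise-free made-up prices are all assumptions.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices once and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single feature."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b


def cross_validate_rmse(xs, ys, k=5, seed=0):
    """Return the mean RMSE over k folds and the per-fold scores."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared_errors = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        scores.append((sum(squared_errors) / len(squared_errors)) ** 0.5)
    return sum(scores) / len(scores), scores


# Made-up data: price = 50,000 + 1,500 * area, with no noise.
areas = [float(a) for a in range(50, 150, 5)]
prices = [50_000 + 1_500 * a for a in areas]
mean_rmse, fold_rmses = cross_validate_rmse(areas, prices, k=5)
```

    Because the toy data is exactly linear, every fold's RMSE is essentially zero; on real housing data the per-fold scores would differ, and their spread indicates how stable the estimate is.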

    Grid search is the process of tuning hyperparameters to find the values that work best for a model, since the prediction results can vary depending on the specific values chosen. The technique evaluates every candidate combination from a predefined grid of values and keeps the one that gives the best predictions for the model.

Linear Regression

    Since linear regression has no hyperparameters to tune in our analysis, only CV is performed here. The number of folds in the CV is set to 10. The average R² is 0.5918 and the RMSE is 269131.

Ridge Regression

    The regularization parameter in ridge regression is called alpha in sklearn. Since GridSearchCV in sklearn already performs cross-validation internally, a separate cross_validate step is omitted. The grid of candidate values for alpha is [0.01, 0.1, 1, 10, 100, 1000, 10000] here.
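
    What a cross-validated grid search over alpha does can be shown with a hand-written sketch. This is not sklearn's GridSearchCV: it assumes a one-feature ridge model without an intercept, made-up noise-free data, and hypothetical helper names, and it scores candidates by mean squared error.

```python
def ridge_slope(xs, ys, alpha):
    """Closed-form ridge fit for y = w * x (one feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def cv_mse(xs, ys, alpha, k):
    """Mean squared error averaged over k folds (every k-th sample is held out)."""
    n = len(xs)
    fold_scores = []
    for fold in range(k):
        test = list(range(fold, n, k))
        held = set(test)
        train = [i for i in range(n) if i not in held]
        w = ridge_slope([xs[i] for i in train], [ys[i] for i in train], alpha)
        fold_scores.append(sum((ys[i] - w * xs[i]) ** 2 for i in test) / len(test))
    return sum(fold_scores) / k


def grid_search(xs, ys, alphas, k=5):
    """Score every candidate alpha with CV and return the best one plus all scores."""
    scores = {alpha: cv_mse(xs, ys, alpha, k) for alpha in alphas}
    best = min(scores, key=scores.get)
    return best, scores


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.0 * x for x in xs]  # made-up, noise-free data
alphas = [0.01, 0.1, 1, 10, 100, 1000, 10000]
best_alpha, scores = grid_search(xs, ys, alphas)
```

    With this noise-free toy data the smallest alpha wins, because any shrinkage only adds error. On noisy real data the best value need not sit at the edge of the grid.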

   b. Perform model validation using different evaluation metrics such as accuracy, precision, recall, and F1 score for a binary classification problem.
  

1. Accuracy:
    Accuracy simply measures how often the classifier correctly predicts. We can define accuracy as the ratio of the number of correct predictions and the total number of predictions.
    
    When a model reports an accuracy of 99%, you might think it is performing very well, but this is not always true and can be misleading in some situations. For example, if 99% of the samples belong to one class, a model that always predicts that class reaches 99% accuracy without ever detecting the other class.

    Consider a binary classification problem, where each prediction is either correct or incorrect, such as predicting whether an image shows a dog or a cat. In supervised learning, we first fit (train) a model on training data and then test the model on testing data. Once we have the model’s predictions from the X_test data, we compare them to the true y values (the correct labels).
 
2. Precision:
    Precision explains how many of the cases predicted as positive actually turned out to be positive. Precision is useful in cases where False Positives are a higher concern than False Negatives. It matters in music or video recommendation systems, e-commerce websites, etc., where wrong results could lead to customer churn and harm the business.

    Precision for a label is defined as the number of true positives divided by the number of predicted positives.

    Precision = TP / (TP + FP)
    
3. Recall (Sensitivity):
    Recall explains how many of the actual positive cases we were able to predict correctly with our model. It is a useful metric in cases where False Negatives are of higher concern than False Positives. It is important in medical cases, where a false alarm is acceptable but actual positive cases should not go undetected.

    Recall for a label is defined as the number of true positives divided by the total number of actual positives.

    Recall = TP / (TP + FN)
    
4. F1 Score:
    It gives a combined idea about the Precision and Recall metrics. For a fixed sum of the two, it is highest when Precision is equal to Recall.

    F1 Score is the harmonic mean of precision and recall.

    F1 = 2 * Precision * Recall / (Precision + Recall)
The F1 score punishes extreme values more. F1 Score could be an effective evaluation metric in the following cases:

When FP and FN are equally costly.
Adding more data doesn’t effectively change the outcome
True Negative is high
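
The four metrics defined above can be computed directly from the confusion-matrix counts. The sketch below uses made-up, imbalanced labels to show how accuracy can look good while recall is zero; the function name and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up imbalanced labels: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100
# 93 true negatives, 2 false positives, 4 true positives, 1 false negative.
some_positives = [0] * 93 + [1] * 2 + [1] * 4 + [0] * 1

baseline = classification_metrics(y_true, always_negative)
model = classification_metrics(y_true, some_positives)
```

The always-negative baseline scores 0.95 accuracy but 0.0 recall, while the second set of predictions reaches 0.97 accuracy with recall 0.8 and an F1 of about 0.73.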

In [None]:
 c. Design a model validation strategy that incorporates stratified sampling to handle imbalanced datasets.

Oversampling Techniques:

    Oversampling methods duplicate examples in the minority class or synthesize new examples from the examples in the minority class.

Some of the more widely used and implemented oversampling methods include:

Random Oversampling
Synthetic Minority Oversampling Technique (SMOTE)
Borderline-SMOTE
Borderline Oversampling with SVM
Adaptive Synthetic Sampling (ADASYN)
Let’s take a closer look at these methods.

    The simplest oversampling method involves randomly duplicating examples from the minority class in the training dataset, referred to as Random Oversampling.
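
    Random oversampling fits in a few lines of standard-library Python. The `random_oversample` helper and the toy data below are hypothetical; libraries such as imbalanced-learn provide maintained implementations.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lbl in zip(X, y) if lbl == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))  # sample with replacement
            y_out.append(label)
    return X_out, y_out


# Made-up data: five majority rows and one minority row.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [0, 0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y)
```

    A common practice is to oversample only the training portion of each validation split, so that duplicated rows do not leak into the data used for evaluation.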

    The most popular and perhaps most successful oversampling method is SMOTE; that is an acronym for Synthetic Minority Oversampling Technique.

    SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample as a point along that line.
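
    The interpolation idea can be sketched as follows. This is a simplified, SMOTE-like illustration rather than the reference algorithm or a library implementation: the function name, the choice of k, and the sample points are assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)

    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # The k minority points closest to the chosen base point.
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the line segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up minority-class points in a two-dimensional feature space.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like(minority, n_new=5)
```

    Every synthetic point lies on a segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.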

    There are many extensions to the SMOTE method that aim to be more selective about which examples in the minority class are synthesized.

    Borderline-SMOTE involves selecting those instances of the minority class that are misclassified, such as with a k-nearest neighbor classification model, and only generating synthetic samples that are “difficult” to classify.

    Borderline Oversampling is an extension to SMOTE that fits an SVM to the dataset and uses the decision boundary as defined by the support vectors as the basis for generating synthetic examples, again based on the idea that the decision boundary is the area where more minority examples are required.

    Adaptive Synthetic Sampling (ADASYN) is another extension to SMOTE that generates synthetic samples inversely proportional to the density of the examples in the minority class. It is designed to create synthetic examples in regions of the feature space where the density of minority examples is low, and fewer or none where the density is high.

In [None]:
4. Deployment Strategy:

   a. Create a deployment strategy for a machine learning model that provides real-time recommendations based on user interactions.
 

Model deployment strategies
    Deployment strategies allow us to evaluate an ML model's performance and capabilities and to discover issues concerning the model. A key point to keep in mind is that the choice of strategy usually depends on the task and the resources at hand. Some strategies are very informative but computationally expensive, while others get the job done with ease. Let’s discuss a few of them.


1. Shadow deployment strategy
    In shadow deployment or shadow mode, the new model is deployed with new features alongside the live model. The new deployed model in this case is known as a shadow model. The shadow model handles all the requests just like the live model except it is not released to the public. 

    This strategy allows us to evaluate the shadow model better by testing it on real-world data while not interrupting the services offered by the live model. 
    
Methodology: champion vs challenger
    In shadow evaluation, the request is sent to both the models running parallel to each other using two API endpoints. During the inference, predictions from both the models are computed and stored, but only the prediction from the live model is used in the application which is returned to the users.

    The predicted values from both the live and shadow model are compared against the ground truth. Once the results are in hand, data scientists can decide whether to deploy the shadow model globally into production or not. 

    One can also use the champion/challenger framework so that multiple shadow models are tested and compared with the existing model. Essentially, the model with the best accuracy or Key Performance Indicator (KPI) is selected and deployed. 
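
    The shadow methodology can be sketched as a small router that calls both models but returns only the live prediction. The `ShadowRouter` class and the two threshold "models" are hypothetical stand-ins; a real setup would call two API endpoints and persist the log.

```python
class ShadowRouter:
    """Serve the live model; run the shadow model on the same input and log both."""

    def __init__(self, live_model, shadow_model):
        self.live_model = live_model
        self.shadow_model = shadow_model
        self.log = []  # entries of (features, live_prediction, shadow_prediction)

    def predict(self, features):
        live = self.live_model(features)
        try:
            shadow = self.shadow_model(features)
        except Exception:
            shadow = None  # a failing shadow model must not affect users
        self.log.append((features, live, shadow))
        return live  # only the live prediction reaches the caller

    def accuracy(self, ground_truth):
        """Compare logged predictions with labels collected later."""
        n = len(self.log)
        live_hits = sum(live == t for (_, live, _), t in zip(self.log, ground_truth))
        shadow_hits = sum(shadow == t for (_, _, shadow), t in zip(self.log, ground_truth))
        return live_hits / n, shadow_hits / n


# Hypothetical stand-in models: callables mapping a score to a 0/1 label.
router = ShadowRouter(live_model=lambda x: int(x > 0.5),
                      shadow_model=lambda x: int(x > 0.3))
served = [router.predict(x) for x in [0.2, 0.4, 0.6, 0.8]]
live_acc, shadow_acc = router.accuracy(ground_truth=[0, 1, 1, 1])
```

    On these four made-up requests the live model is right 75% of the time and the shadow model 100%, which is the kind of comparison used to decide whether to promote the challenger.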

Pros:

1. Model evaluation is efficient: since both models run in parallel, there is no impact on traffic.  
2. No overloading irrespective of the traffic. 
3. You can monitor the shadow model which allows you to check the stability and performance; this reduces risk. 

Cons:

1. Expensive because of the resources required to support the shadow model. 
2. Shadow deployment can be tedious, especially if you are concerned about different aspects of model performance like metrics comparison, latency, load testing, et cetera.
3. Provides no user response data. 


2. A/B testing model deployment strategy
    A/B testing is a data-driven method for evaluating two models, A and B, to assess which one performs better in a controlled environment. It is primarily used in e-commerce websites and social media platforms. With A/B testing, data scientists can evaluate and choose the best design for the website based on the data received from the users.  

    The two models differ slightly in terms of features and they cater to different sets of users. Based on the interaction and data received from the users such as feedback, data scientists choose one of the models that can be deployed globally into production. 

Methodology
    In A/B testing, the two models are set up in parallel with different features. The aim is to increase the conversion rate of a given model. To do that, the data scientist sets up a hypothesis: an assumption based on an intuition about the data. The assumption is tested through an experiment; if it passes the test, it is accepted and the model is adopted, otherwise it is rejected. 

Hypothesis testing
In A/B testing there are two types of hypotheses: 
1. Null Hypothesis states that the phenomenon occurring in the model is purely out of chance and not because of a certain feature.
2. Alternate Hypothesis challenges the null hypothesis by stating that the phenomenon occurring in the model is because of a certain feature.

    In hypothesis testing, the aim is to reject the null hypothesis by setting up experiments like the A/B testing and exposing the new model with a certain feature to a few users. The new model essentially is designed on an alternate hypothesis. If the alternate hypothesis is accepted and the null hypothesis is rejected then that feature is added and the new model is deployed globally. 

    It is important to know that in order to reject the null hypothesis you have to prove the statistical significance of the test.
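
    One common way to check statistical significance for conversion rates is a two-proportion z-test, sketched below with only the standard library. The choice of this particular test, the 0.05 threshold, and the conversion counts are assumptions made for illustration.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Standard normal CDF evaluated at |z|, via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Made-up counts: model A converts 200 of 5,000 users, model B 260 of 5,000.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
reject_null = p_value < 0.05
```

    For these counts the z statistic is roughly 2.86 and the p-value is below 0.05, so the null hypothesis would be rejected at that threshold.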
   
   
3. Multi-Armed Bandit
    Multi-Armed Bandit or MAB is an advanced version of A/B testing. It is also inspired by reinforcement learning, and the idea is to explore and exploit the environment in a way that maximizes the reward function.

    MAB leverages machine learning to explore and exploit the data received to optimize the key performance indicator (KPI). The advantage of using this technique is that the user traffic is diverted according to the KPI of two or more models. The model which yields the best KPI is deployed globally. 
    
Methodology
    MAB heavily depends on two concepts: exploration and exploitation. 

Exploration: The algorithm keeps trying each model to gather statistically meaningful evidence about its conversion rate, much as A/B testing does when it measures the conversion rates of the two models. 

Exploitation: It is a concept where the algorithm uses a greedy approach to maximize conversion rates using the information it gained during exploring. 

    MAB is very flexible compared to A/B testing. It can work with more than two models at a given time, which can increase the overall conversion rate. The algorithm continuously logs the KPI score of each model, based on the success of the route through which each request was served, and uses it to update its estimate of which model is best.  
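
    Exploration and exploitation can be illustrated with an epsilon-greedy bandit, one of several possible MAB algorithms. The sketch below is an assumption-laden toy: the conversion rates are invented, rewards are simulated, and the function name is hypothetical.

```python
import random


def epsilon_greedy(true_rates, n_rounds=2000, epsilon=0.1, seed=0):
    """Route simulated requests to models, favouring the best estimated conversion rate."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a model at random
        else:
            arm = max(range(n_arms), key=estimates.__getitem__)  # exploit: greedy choice
        reward = 1 if rng.random() < true_rates[arm] else 0  # simulated conversion
        pulls[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]  # running mean
    return pulls, estimates


# Hypothetical conversion rates for two models; the algorithm does not see them.
pulls, estimates = epsilon_greedy([0.1, 0.9])
```

    Most traffic ends up on the model with the higher conversion rate, while the small epsilon share keeps checking the alternative in case its performance changes.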


4. Blue-green deployment strategy
    Blue-green deployment strategies involve two production environments instead of just models. The blue environment consists of the live model whereas the green environment consists of the new version of the model.   
    
    The green environment is set as a staging environment i.e. an exact replica of a live environment but with new features. Let us briefly understand the methodology. 

Methodology
    In Blue-green deployment, the two identical environments consist of the same database, containers, virtual machines, same configuration et cetera. Keep in mind that setting up an environment can be expensive so usually, some components like a database are shared between the two. 

    The Blue environment which contains the original model is live and keeps servicing requests while the Green environment acts as a staging environment for a new version of the model. It is subjected to deployment and final stages of testing against the real data to ensure that it performs well and is ready to deploy to production. Once the testing is successfully completed ensuring that all the bugs and issues are rectified the new model is made live. 

    Once this model is made live, the traffic is diverted from the blue environment to the green environment. In most cases, the blue environment serves as a backup, in case something goes wrong the request can be rerouted to the blue model.
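
    The traffic switch and rollback can be sketched as a tiny router. The `BlueGreenRouter` class, the two handler functions, and the smoke test are hypothetical; in practice the switch is usually made at a load balancer or DNS level.

```python
class BlueGreenRouter:
    """Send all traffic to one of two environments and allow a quick switch back."""

    def __init__(self, blue, green):
        self.environments = {"blue": blue, "green": green}
        self.live = "blue"

    def handle(self, request):
        return self.environments[self.live](request)

    def switch_to(self, name, smoke_test):
        """Cut traffic over only if the target environment passes its checks."""
        if not smoke_test(self.environments[name]):
            return False
        self.live = name
        return True

    def rollback(self):
        self.live = "blue" if self.live == "green" else "green"


def blue_env(request):
    return {"version": "v1", "echo": request}  # current live model


def green_env(request):
    return {"version": "v2", "echo": request}  # new model version


router = BlueGreenRouter(blue_env, green_env)
before = router.handle("req")["version"]
switched = router.switch_to("green", smoke_test=lambda env: env("ping")["version"] == "v2")
after = router.handle("req")["version"]
router.rollback()
rolled_back = router.handle("req")["version"]
```

    Because the blue environment stays up after the switch, rolling back is just a matter of pointing the router at it again.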

Pros:

1. It ensures application availability round the clock.
2. Rollbacks are easy because you can quickly divert the traffic to the blue environment in case of any issues. 
3. Since both environments are independent of each other, deployment risk is less.

Cons:

1. It is expensive, since both models require separate environments.

5. Canary deployment strategy
    The canary deployment aims to deploy the new version of the model by gradually increasing the number of users. Unlike the previous strategies that we’ve seen where the new model is either hidden from the public or a small control group is set up, the canary deployment strategy uses the real users to test the new model. As a result, bugs and issues can be detected before the model is deployed globally for all the users.
    
Methodology
    As in other deployment strategies, in canary deployment the new model is tested alongside the current live model, but here it is exposed to a small number of real users to check its reliability, errors, performance, et cetera. 

    The number of users can be increased or decreased based on the testing requirements. If the model is successful in the testing phase, it can be rolled out to everyone; if not, it can be rolled back with no downtime, and only a limited number of users will have been exposed to the new model.    
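
    One way to expose a gradually growing share of users to the new model is deterministic, hash-based bucketing, sketched below. The bucketing scheme, the function names, and the user ids are assumptions; other routing mechanisms work equally well.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the canary group for a given rollout percent."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < percent


def route(user_id, percent, live_model, canary_model):
    """Call the canary model for users in the canary group, the live model otherwise."""
    model = canary_model if in_canary(user_id, percent) else live_model
    return model(user_id)


# Hypothetical user ids at two stages of the rollout.
users = [f"user-{i}" for i in range(1000)]
at_5 = {u for u in users if in_canary(u, 5)}
at_25 = {u for u in users if in_canary(u, 25)}
```

    Because each user's bucket is fixed, raising the percentage only adds users to the canary group, and setting it back to 0 rolls everyone back to the live model.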
    
    
6. Other model deployment strategies and techniques
Feature flag 
Feature flag is a technique rather than a strategy that allows developers to push or integrate code into the main branch. The idea here is to keep the feature dormant until it is ready. This allows developers to collaborate on different ideas and iterations. Once the feature is finalized it can be activated and deployed. 

As mentioned earlier feature flag is a technique so this can be used in combination with any deployment techniques mentioned earlier. 

Rolling deployment
Rolling deployment is a strategy that gradually updates and replaces the older version of the model. The update happens on running instances; it does not involve staging or even private development. 

   b. Develop a deployment pipeline that automates the process of deploying machine learning models to cloud platforms such as AWS or Azure.


In [None]:
  c. Design a monitoring and maintenance strategy for deployed models to ensure their performance and reliability over time.


Maintenance Strategy
    Selecting a successful maintenance strategy requires a good knowledge of maintenance management principles and practices as well as knowledge of specific facility performance. There is no one correct formula for maintenance strategy selection and, more often than not, the selection process involves a mix of different maintenance strategies to suit the specific facility performance and conditions.

    There are a number of maintenance strategies available today that have been tried and tested throughout the years. These strategies range from optimization of existing maintenance routines to eliminating the root causes of failures altogether, to minimize maintenance requirements. Ultimately, the focus should be on improving equipment reliability while reducing cost of ownership.

    An effective maintenance strategy is concerned with maximizing equipment uptime and facility performance while balancing the associated resources expended and ultimately the cost. We need to ensure that we are getting sufficient return on our investment.

    Are we satisfied with the maintenance cost expended versus equipment performance and uptime? There is a balance to be had in terms of maintenance cost and facility performance. We can develop a suitable maintenance strategy to help tailor this balancing act in order to ensure the return on investment is acceptable.
    A maintenance strategy should be tailored specifically to meet the individual needs of a facility. The strategy is effectively dynamic and must be updated periodically as circumstances change. The strategy must include a detailed assessment of the current situation at the facility and consider the following questions:

• What is the performance history of the facility equipment and systems?

• What are the production targets, i.e. what are the mission times for facility equipment and systems?

• What are the facility shutdown targets?

• What is the current maintenance budget?

    Once we have clarity on the current situation and constraints, we need to define the objectives of the maintenance plan. The objectives must align with the business objectives of the company. They must be developed by all of the key facility stakeholders and be clear, concise and realistic. There may be a number of components to the strategy objectives, for example: improve equipment uptime, reduce maintenance costs, reduce equipment operating costs, extend equipment life, reduce spare parts inventory, improve MTTR, etc.

    An example of a maintenance strategy workflow is illustrated in Figure 5.12. This workflow is developed to optimize and improve an existing facility maintenance program. Depending on the specific circumstances at the facility, our strategy may also take us into the direction of a step change approach to maintenance management and opt for a reliability-centered maintenance (RCM) program, which may replace our existing maintenance program. This strategy is labor and time intensive and can be expensive; we will discuss RCM in section 