# **The Data Science Pipeline**

**Instructors:** Jhun Brian M. Andam | Timothy Jonah Borromeo

**Course:** Introduction to Data Science

**Objectives**

- Understand the ins and outs of the data science process and its components.
- Design and build your own data science solution and pipeline for a particular problem.

> **Think Back**

<center><img src="../four_acts.png" width="600px"></center>

## **What is the Data Science Pipeline?**

The **Data Science Pipeline** is a step-by-step workflow that guides the entire process of solving a problem using data. It helps data scientists systematically approach complex tasks, from defining the problem to deploying a final model and monitoring its performance. Though variations exist, most pipelines share common stages:

**1. Problem Definition**
Every data science project begins with a clear understanding of the problem:
- What is the business or research question?
- What are the goals and constraints?
- What will success look like (e.g., a certain accuracy or business KPI improvement)?

**2. Data Collection**
Once the problem is understood, relevant data must be gathered:
- **Sources**: databases, CSV files, APIs, IoT devices, web scraping, user logs.
- **Formats**: structured (tables), semi-structured (JSON, XML), unstructured (text, images).
- Ethical concerns and compliance (e.g., GDPR) should be addressed at this stage.

**3. Data Preprocessing**
Raw data is often messy. Preprocessing ensures the data is usable:
- **Cleaning**: handling missing values, removing duplicates, correcting data types.
- **Transformation**: encoding categorical variables, normalizing or scaling numerical data.
- **Exploration**: using visualizations and statistics to understand distributions, relationships, and outliers.
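
The three preprocessing activities above can be sketched with `pandas` and `scikit-learn`. This is a minimal illustration, not a complete recipe: the table, its column names (`age`, `city`, `income`), and the choice of median imputation are assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data: one duplicate row, missing values, and a numeric column stored as text
raw = pd.DataFrame({
    "age": ["25", "32", "32", None, "41"],
    "city": ["Cebu", "Davao", "Davao", "Manila", "Cebu"],
    "income": [30000.0, 52000.0, 52000.0, 45000.0, np.nan],
})

# Cleaning: remove duplicates, correct data types, fill missing values with the median
clean = raw.drop_duplicates().copy()
clean["age"] = pd.to_numeric(clean["age"])
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["income"] = clean["income"].fillna(clean["income"].median())

# Transformation: one-hot encode the categorical column, scale the numerical columns
encoded = pd.get_dummies(clean, columns=["city"])
encoded[["age", "income"]] = StandardScaler().fit_transform(encoded[["age", "income"]])

# Exploration: summary statistics of the prepared table
print(encoded.describe())
```

Median imputation is only one option; the right strategy depends on why the values are missing.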

**4. Feature Engineering**
Features are the variables used to train models. This stage includes:
- Selecting the most relevant variables.
- Creating new variables (e.g., ratios, time lags, interaction terms).
- Reducing dimensionality to lower the risk of overfitting and improve performance.
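
As a small sketch of creating new variables, the snippet below derives a ratio, a time lag, and an interaction term with `pandas`. The `sales` table and its column names are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical daily sales data, ordered by time
sales = pd.DataFrame({
    "revenue": [100.0, 120.0, 90.0, 150.0],
    "visits": [50, 40, 45, 60],
    "ad_spend": [10.0, 12.0, 8.0, 15.0],
})

features = sales.copy()
features["revenue_per_visit"] = features["revenue"] / features["visits"]    # ratio
features["revenue_lag_1"] = features["revenue"].shift(1)                    # time lag (previous day)
features["visits_x_ad_spend"] = features["visits"] * features["ad_spend"]   # interaction term

# The first row has no previous day, so its lag is missing; drop it here for simplicity
features = features.dropna()
print(features)
```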

**5. Model Building**
Here, predictive or descriptive models are developed:
- **Model selection**: regression, classification, clustering, time series models, etc.
- **Training**: fitting the model to the training data.
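- **Hyperparameter tuning**: optimizing settings that are chosen before training rather than learned from the data (e.g., the regularization strength of an SVM), using grid search, random search, etc.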
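
The three steps above can be sketched with scikit-learn's `GridSearchCV`, which trains one model per hyperparameter combination and keeps the best one. The Iris dataset, the SVM classifier, and the values in `param_grid` are assumptions chosen to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Model selection: an SVM classifier; hyperparameter candidates are illustrative
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Training + tuning: fit every combination with 5-fold cross-validation on the training set
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```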

**6. Model Evaluation**
After training, the model is evaluated to assess performance:
- **Metrics**: accuracy, precision, recall, F1-score, ROC-AUC, MAE, RMSE, etc.
- **Validation techniques**: cross-validation, holdout sets.
- This step helps detect issues like **overfitting** (model performs well on training data but poorly on new data).
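
One way to spot overfitting is to compare the score on the training data with the score on held-out data. The sketch below does this for a decision tree with unrestricted depth; the dataset and model are assumptions picked for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# A tree with no depth limit can memorize the training data
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Holdout evaluation: a large gap between these two scores suggests overfitting
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
test_f1 = f1_score(y_test, model.predict(X_test))

# Cross-validation: five train/validation splits of the training set
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_train, y_train, cv=5)

print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}, test F1: {test_f1:.3f}")
print(f"cross-validation accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```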

**7. Model Deployment**
Once a model is performing well, it must be integrated into a real-world system:
- **Deployment options**: REST APIs, cloud services, edge devices.
- **Tools**: Flask, FastAPI, Docker, AWS/GCP/Azure.
- Often involves automation pipelines like CI/CD for smoother updates.

**8. Monitoring and Maintenance**
Even after deployment, work isn't finished:
- **Monitoring** for model drift, data quality issues, or changing input distributions.
- **Retraining** models as new data becomes available.
- Ensuring long-term performance, reliability, and accountability.
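
A simple way to monitor changing input distributions is to compare a feature's values at training time with its values in production. The sketch below uses the two-sample Kolmogorov-Smirnov test from `scipy.stats`. The helper name `has_drifted`, the simulated data, and the 0.01 significance threshold are assumptions for illustration, not a standard.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Simulated values of one feature: at training time and later in production
training_feature = rng.normal(loc=0.0, scale=1.0, size=1000)
live_shifted = rng.normal(loc=1.0, scale=1.0, size=1000)  # the mean has moved


def has_drifted(reference, current, alpha=0.01):
    """Flag drift when the two samples are unlikely to share one distribution."""
    statistic, p_value = ks_2samp(reference, current)
    return bool(p_value < alpha)


print("Drift detected:", has_drifted(training_feature, live_shifted))
```

In practice a check like this would run on a schedule for each important feature, and a positive result would prompt investigation or retraining.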

A structured data science workflow is essential because it **provides a systematic approach to solving problems**, ensuring that each stage, from problem formulation to model deployment, is executed with clarity and consistency. By following a defined pipeline, data scientists can better align their technical efforts with business objectives, reduce errors, and promote collaboration across multidisciplinary teams.

For instance, in a healthcare analytics project aimed at predicting patient readmissions, skipping or rushing through data preprocessing could result in biased models due to unhandled missing values or imbalanced classes. By adhering to a structured pipeline, the team can methodically clean the data, engineer meaningful features like hospital stay duration or comorbidity scores, and properly evaluate model performance using stratified cross-validation. This approach not only improves model reliability but also builds trust among stakeholders like clinicians and hospital administrators.

Moreover, structured workflows facilitate documentation, making it easier to retrace decisions or update models as new data becomes available, which is critical for compliance in regulated industries like finance or healthcare. Without a clear structure, data science efforts can become ad hoc, making results hard to reproduce and exposing the organization to risks such as undetected model drift or incorrect decision-making.

## **Machine Learning**

Machine Learning (ML) is a subfield of artificial intelligence (AI) that focuses on building systems that can learn from data and improve their performance over time without being explicitly programmed. In general terms, machine learning involves developing algorithms that can identify patterns, make decisions, or predict outcomes based on input data.

<center><img src="https://www.edureka.co/blog/wp-content/uploads/2018/03/AI-vs-ML-vs-Deep-Learning.png" width="700px"></center>

- **AI** is the science of making machines think and act like humans.
- **ML** lets machines learn from data without being **explicitly programmed**.
- **DL** uses neural networks with many layers to learn complex patterns.

"**Explicitly programmed**" means giving a computer **exact, detailed instructions** on what to do in every possible situation.

In traditional programming, you write rules like:

```python
if temperature > 30:
    print("It's hot")
```

The computer follows these exact rules and does not learn or adapt beyond them.

In contrast, **machine learning** doesn't rely on predefined rules. Instead, it **learns the rules** by analyzing data. For example, instead of telling the computer what "hot" means, you show it many examples of temperatures labeled as "hot" or "not hot," and it figures out the pattern.
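
The "hot" versus "not hot" idea can be sketched with a one-split decision tree from `scikit-learn`. No threshold is written in the code; the model infers one from the labeled examples. The temperatures and labels below are made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Labeled examples: temperature in degrees Celsius and a human-provided label
temperatures = [[18], [22], [25], [28], [31], [33], [36], [40]]
labels = ["not hot", "not hot", "not hot", "not hot", "hot", "hot", "hot", "hot"]

# A tree of depth 1 learns a single threshold from the data
model = DecisionTreeClassifier(max_depth=1, random_state=0)
model.fit(temperatures, labels)

# The learned rule: a cut-off somewhere between the warmest "not hot" and the coolest "hot" example
learned_threshold = model.tree_.threshold[0]
print("Learned threshold:", learned_threshold)
print(model.predict([[20], [35]]))
```

If the examples change (say, people in a colder city label 25 degrees as "hot"), retraining produces a different threshold without anyone rewriting the rule.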

### **Different Approaches of Machine Learning**

<center><img src="https://www.researchgate.net/publication/354960266/figure/fig1/AS:11431281251915131@1718389091562/The-main-types-of-machine-learning-Main-approaches-include-classification-and-regression.tif" width="700px"></center>

**Supervised learning** is the most common type of machine learning, where the model is trained on **labeled data**. This means that for each input in the dataset, the correct output (or label) is provided, and the algorithm learns to map inputs to their corresponding outputs. Supervised learning is used for tasks like classification and regression.

> **Example**:
- **Classification**: Predicting whether an email is spam or not based on past labeled examples.
- **Regression**: Predicting house prices based on features like size, location, and number of rooms.

In this type of learning, the model is explicitly told what the "correct" answer is, and the goal is to minimize the difference between its predictions and the actual labels.
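
As a minimal sketch of the regression case, the snippet below fits a linear model to labeled house data. The data is synthetic: prices are generated from an assumed formula so that the fitted model can be checked against it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Inputs: [size in square meters, number of rooms]
X = np.array([[50, 2], [80, 3], [120, 4], [65, 2], [150, 5], [95, 3]])

# Labels: synthetic prices from an assumed rule (1000 per square meter, 5000 per room, 20000 base)
y = 1000 * X[:, 0] + 5000 * X[:, 1] + 20000

# Training minimizes the squared difference between predictions and the labels
model = LinearRegression().fit(X, y)

# Predict the price of an unseen house: 100 square meters, 3 rooms
predicted = model.predict([[100, 3]])
print(predicted)
```

Because the labels here contain no noise, the model recovers the generating rule almost exactly; real data would leave some prediction error.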


**Unsupervised learning** works with **unlabeled data**. Here, the model tries to find patterns, structures, or relationships within the data without being given explicit output labels. This type of learning is useful for discovering hidden patterns or grouping data based on similarity, which is ideal for clustering and dimensionality reduction tasks.

> **Example**:
- **Clustering**: Grouping customers into different segments based on their purchasing behavior.
- **Anomaly detection**: Identifying unusual transactions that could indicate fraud in financial data.

Since there are no labels, unsupervised learning lets the model explore the data and form groupings or inferences based on its inherent structure.
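
The customer segmentation example can be sketched with k-means clustering. The customer data below is hypothetical, and choosing two clusters is an assumption; in practice the number of clusters is usually explored rather than known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [orders per month, average spend per order]; no labels are given
customers = np.array([
    [1, 20], [2, 25], [1, 22],        # occasional, low-spend buyers
    [10, 200], [12, 220], [11, 210],  # frequent, high-spend buyers
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)

# Each customer receives a cluster index; the index values themselves carry no meaning
print(segments)
```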


**Reinforcement learning** is a type of learning where an **agent** learns to make decisions by interacting with an **environment**. The agent receives feedback in the form of **rewards** or **penalties** based on its actions, and its goal is to maximize the cumulative reward over time. This type of learning is heavily used in fields like robotics, gaming, and autonomous driving.

> **Example**:
- **Game-playing**: Training an AI to play chess or Go by having it learn from playing games, where it receives rewards based on winning or losing.
- **Robotics**: Teaching a robot to navigate through a maze by rewarding it for taking correct steps toward the goal.

The key difference here is the agent’s interaction with the environment—learning from the consequences of its actions rather than being explicitly told what the correct behavior is.
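
A tiny version of the maze example can be sketched with tabular Q-learning, one of several reinforcement learning algorithms. The environment is an assumption made for illustration: a corridor of five positions where the agent starts at position 0 and is rewarded only on reaching position 4. The learning rate, discount factor, and exploration rate are likewise illustrative.

```python
import numpy as np

n_states = 5        # positions 0..4; position 4 is the goal
actions = [-1, +1]  # action 0 moves left, action 1 moves right
q_table = np.zeros((n_states, len(actions)))
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

for episode in range(200):
    state = 0
    while state != n_states - 1:
        if rng.random() < epsilon:
            # Explore: pick a random action
            action = int(rng.integers(len(actions)))
        else:
            # Exploit: pick the best-known action, breaking ties at random
            best = np.flatnonzero(q_table[state] == q_table[state].max())
            action = int(rng.choice(best))

        # The environment responds with a new state and a reward
        next_state = min(max(state + actions[action], 0), n_states - 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0

        # Q-learning update: move the estimate toward reward + discounted future value
        target = reward + gamma * q_table[next_state].max()
        q_table[state, action] += alpha * (target - q_table[state, action])
        state = next_state

# Best action per non-goal position (1 means "move right")
policy = np.argmax(q_table[:-1], axis=1)
print(policy)
```

The agent is never told that moving right is correct; it discovers this from the rewards it collects.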

<div class="alert alert-block alert-info"><b>Pondering Questions: </b>
    
1. *What do you think is ChatGPT's machine learning approach, and why is it effective for generating human-like responses?*
2. *How does the quality and quantity of data affect the performance of AI tools like image generators or recommendation systems?*
3. *In what ways might reinforcement learning be useful in real-world applications beyond games and robotics?*

</div> 

### **Machine Learning Libraries and Frameworks**

<center><img src="../figures/ml-tools2.jpg" width="700px"></center>


- **NumPy**. Provides support for arrays, matrices, and mathematical operations used in data preprocessing and model calculations.

- **Pandas**. Offers structures like DataFrames, ideal for loading, cleaning, and exploring datasets before training a model.

- **SciPy**. Useful for optimization, statistics, and signal processing during preprocessing or model evaluation steps.

- **Scikit-learn (sklearn)**. A simple and powerful ML library for classical algorithms like decision trees, SVMs, and k-means.  

- **TensorFlow**. Used to design, train, and deploy deep learning models at scale, especially for production environments.

- **PyTorch**. Known for flexibility and ease of use in research; great for building custom neural networks and experimenting with model architectures.

### **ML Algorithms**

There are various types of machine learning models, each designed with unique characteristics and suited for specific types of tasks. For instance, linear regression is simple and effective for predicting numerical values, while decision trees and random forests handle classification tasks with complex decision boundaries. Support Vector Machines (SVMs) work well for high-dimensional data, and K-nearest neighbors (KNN) is useful for intuitive, distance-based predictions. More complex models like neural networks are ideal for capturing intricate patterns in large datasets such as images or text. Each model has its strengths, assumptions, and limitations, making it important to understand when and how to use them appropriately.
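
Because no single model is best everywhere, a common first step is to compare a few candidates on the same data. The sketch below does this with 5-fold cross-validation; the Wine dataset and the four models are assumptions chosen for a quick demonstration.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# SVM and KNN depend on distances, so their inputs are scaled first
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Mean accuracy over 5 folds for each model
mean_scores = {
    name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()
}
for name, score in mean_scores.items():
    print(f"{name}: {score:.3f}")
```

The ranking on this small dataset says little about other problems, which is why the comparison should be repeated on your own data.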

- https://scikit-learn.org/stable/supervised_learning.html

## **ML Implementation**

- Open the page below, then copy and paste its code into the cell provided.
- https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html

**Note:**

- Make sure that `scikit-learn` and `matplotlib` are installed in your environment.

In [None]:
# paste the code here

### **Export Model for Integration**

In [None]:
import joblib
# `clf` is the classifier trained in the cell above; save it to disk so other programs can load it
joblib.dump(clf, 'svm_model.pkl')
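
To check that an exported model can be reused, load it back with `joblib.load` and compare its predictions with the original. The sketch below is self-contained, so it trains its own small SVM on the digits dataset instead of relying on `clf` from the cell above. Writing to the system temporary folder is an assumption made to avoid cluttering the working directory.

```python
import os
import tempfile

import joblib
from sklearn import datasets, svm

# Train a small classifier on the digits dataset (flatten each 8x8 image into 64 values)
digits = datasets.load_digits()
X = digits.images.reshape(len(digits.images), -1)
clf = svm.SVC(gamma=0.001).fit(X, digits.target)

# Export the trained model, then load it back as a separate object
model_path = os.path.join(tempfile.gettempdir(), "svm_model.pkl")
joblib.dump(clf, model_path)
loaded = joblib.load(model_path)

# The loaded model should give the same predictions as the original
same = bool((loaded.predict(X[:10]) == clf.predict(X[:10])).all())
print("Predictions match:", same)
```

Only load model files from sources you trust, and keep the `scikit-learn` version consistent between the environment that saves the model and the one that loads it.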