# 📜 IBM Data Science Professional Certificate  
*Curiosity to Capability — One Notebook at a Time*

---

**Compiled and Authored by:**  
**Partho Sarothi Das**  
Dhaka, Bangladesh  
🎓 Bachelor's & Master's in Statistics  
💼 Investment Banking Professional → Aspiring Data Scientist  

>**Disclaimer:** This notebook is based on content from the [IBM Data Science Professional Certificate](https://www.coursera.org/professional-certificates/ibm-data-science) offered on Coursera. It is intended for personal learning and review purposes.

---

# Libraries for Data Science

### Scientific Computing Libraries (Python)

* **Pandas**:

  * Works with tabular data (DataFrames)
  * Supports data cleaning, manipulation, analysis
    
* **NumPy**:

  * Works with arrays and matrices
  * Enables mathematical operations
  * Foundation for Pandas and SciPy

### Visualization Libraries (Python)

* **Matplotlib**:

  * Most popular library for customizable plots and graphs
    
* **Seaborn**:

  * Built on Matplotlib
  * Generates advanced plots (heatmaps, time series, violin plots) easily

### High-Level Machine Learning & Deep Learning Libraries

* **Scikit-learn**:

  * For ML tasks like regression, classification, clustering
  * Easy to use; built on NumPy, SciPy, and Matplotlib

* **Keras**:

  * High-level interface for building deep learning models
  * Simple and fast prototyping

### Low-Level Deep Learning Libraries

* **TensorFlow**:

  * For production-scale deep learning
  * Powerful but more complex to experiment with

* **PyTorch**:

  * Ideal for research and rapid prototyping
  * More intuitive for experimentation

### Big Data & Cluster Computing

* **Apache Spark**:

  * General-purpose, parallel computing framework
  * Can be used with Python, R, Scala, and SQL
  * Provides functionality similar to Pandas, NumPy, and Scikit-learn

### Libraries in Other Languages

* **Scala**:

  * Used for big data and data engineering
  * Libraries:

    * **Vegas**: for statistical visualizations
    * **BigDL**: for deep learning
    
* **R**:

  * Built-in support for data analysis and visualization
  * **ggplot2**: for elegant and flexible visualizations
  * Can integrate with **Keras** and **TensorFlow**

### ✅ **Final Takeaways**

* Libraries provide ready-to-use functionalities to simplify data science tasks
* Python libraries dominate in data science, especially for ML and DL
* Apache Spark supports big data processing in multiple languages
* R and Scala also offer strong library ecosystems for data visualization and engineering

---

# Application Programming Interfaces (APIs)

### What is an API?

* **API (Application Programming Interface)**:

  * Enables communication between two software components
  * Acts as an interface to access software functionality without exposing internal logic
  * You interact with an API through its inputs and outputs, not its backend implementation

### Examples of APIs

* **Pandas API**:

  * Lets you manipulate data in Python, even though some backend components are not written in Python
    
* **TensorFlow API**:

  * Backend written in C++
  * APIs available in Python, JavaScript, Java, Go, etc.
  * Community has also created APIs for Julia, MATLAB, R, and Scala

### REST API (Representational State Transfer)

* REST APIs allow web-based interaction between a **client** and a **web service (resource)**
* Communication uses **HTTP** protocol
* Data is typically exchanged as **JSON** in both the request and the response

#### REST API Workflow

1. Client (your code) sends an HTTP **request**, typically with a JSON payload
2. Web service processes the request and performs the action
3. Service sends back a **response** with results in JSON format
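
The three steps above can be sketched with Python's standard library alone. This is a minimal illustration, not a real service: the `LengthHandler` class, the `/length` path, and the `call_service` helper are hypothetical names invented for the example, and the "service" runs on a free local port so no internet access is needed.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class LengthHandler(BaseHTTPRequestHandler):
    """Hypothetical web service: reports the length of the text it receives."""

    def do_POST(self):
        # Step 2: read the JSON request body and perform the action.
        size = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(size))
        result = {"text": payload["text"], "length": len(payload["text"])}

        # Step 3: send the result back as JSON.
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default request logging


def call_service(text):
    """Start the local service, send one JSON request, return the JSON reply."""
    server = HTTPServer(("127.0.0.1", 0), LengthHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Step 1: the client sends an HTTP request carrying JSON.
        connection = HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        connection.request(
            "POST",
            "/length",
            body=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        reply = json.loads(connection.getresponse().read())
        connection.close()
        return reply
    finally:
        server.shutdown()
        server.server_close()


reply = call_service("hello")
print(reply)
```

In practice the client and the service run on different machines, and many people use a third-party HTTP client library instead of `http.client`; the request and response cycle stays the same.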

#### Common REST API Terms

* **Client**: You or your program
* **Resource**: The web service
* **Endpoint**: URL where the service is accessed
* **HTTP Methods**:

  * `GET`: Retrieve data
  * `POST`: Send new data
  * `PUT`: Update data
  * `DELETE`: Remove data
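
As a rough sketch of how the four methods map onto actions, the toy function below applies them to an in-memory dictionary. The `handle` function and the `cities` store are hypothetical and no network is involved; real services define their own rules for each method.

```python
def handle(store, method, key, value=None):
    """Apply one HTTP-style method to an in-memory resource store."""
    if method == "GET":
        return store.get(key)          # retrieve data
    if method == "POST":
        store[key] = value             # send new data
        return value
    if method == "PUT":
        if key not in store:
            raise KeyError(key)        # nothing to update
        store[key] = value             # update existing data
        return value
    if method == "DELETE":
        return store.pop(key, None)    # remove data
    raise ValueError(f"Unsupported method: {method}")


cities = {}
handle(cities, "POST", "dhaka", {"temp_c": 30})
handle(cities, "PUT", "dhaka", {"temp_c": 31})
current = handle(cities, "GET", "dhaka")
handle(cities, "DELETE", "dhaka")
print(current, cities)
```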

### Examples of REST APIs

* **Watson Speech-to-Text API**:

  * Send an audio file (`POST` request)
  * Receive the transcribed text in the response
* **Watson Language Translator API**:

  * Send a text to be translated
  * Get the translated output (e.g., English to Spanish)

### ✅ **Key Takeaways**

* APIs connect different software systems through a defined interface
* REST APIs use HTTP and JSON to transmit requests and responses over the internet
* They enable access to advanced services like AI, storage, and translation with simple inputs from the client

---

# Data Sets – Powering Data Science

### What is a Data Set?

* A data set is a structured collection of data representing information such as:

  * Text, numbers, or media (images, audio, video)
    
* Common formats include:

  * Tabular data (e.g., CSV files)
  * Hierarchical data (tree structure)
  * Network data (graph structure)
* Example: **Weather data** in CSV (rows = observations, columns = variables)
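
A small sketch of the weather example, using Python's built-in `csv` module. The values are made up for illustration only; each row is one observation and each column is one variable.

```python
import csv
import io

# Made-up weather observations, for illustration only.
WEATHER_CSV = """date,city,temp_c,humidity_pct
2024-01-01,Dhaka,19.5,71
2024-01-02,Dhaka,20.1,68
2024-01-03,Dhaka,18.7,75
"""

reader = csv.DictReader(io.StringIO(WEATHER_CSV))
rows = list(reader)              # one dict per observation
columns = reader.fieldnames      # the variables

# Values are read as strings, so convert before doing arithmetic.
mean_temp = sum(float(row["temp_c"]) for row in rows) / len(rows)

print(columns)
print(len(rows), "observations")
print(round(mean_temp, 2))
```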

### Types of Data

* **Tabular** – structured in rows and columns (e.g., CSV)
* **Hierarchical** – tree-like structure
* **Network/Graph** – shows relationships (e.g., social networks)
* **Raw media files** – such as images (e.g., **MNIST** for handwritten digit recognition)
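
To make the difference between hierarchical and network data concrete, here is a minimal sketch using plain Python structures. The organization chart and the friendship graph are invented examples, and the helper names are hypothetical.

```python
# Hierarchical data: every node has exactly one parent, forming a tree.
org_chart = {
    "name": "CEO",
    "children": [
        {"name": "CTO", "children": [{"name": "Engineer", "children": []}]},
        {"name": "CFO", "children": []},
    ],
}


def count_nodes(node):
    """Count a node plus everything beneath it in the tree."""
    return 1 + sum(count_nodes(child) for child in node["children"])


# Network data: nodes linked by arbitrary relationships, forming a graph.
friendships = {
    "Ana": {"Ben", "Cy"},
    "Ben": {"Ana"},
    "Cy": {"Ana", "Dee"},
    "Dee": {"Cy"},
}


def friends_of_friends(graph, person):
    """People two steps away who are not already direct friends."""
    direct = graph[person]
    reachable = set()
    for friend in direct:
        reachable |= graph[friend]
    return reachable - direct - {person}


print(count_nodes(org_chart))
print(friends_of_friends(friendships, "Ana"))
```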

### Data Ownership

* **Private data**: Confidential and proprietary (e.g., customer or pricing data)
* **Open data**: Publicly shared by governments, institutions, and organizations

### Sources of Open Data

* **Government portals**: Local, federal, and international (e.g., UN, EU)
* **Online platforms**:

  * [**Kaggle**](https://www.kaggle.com/datasets) – user-contributed datasets
  * [**Google Dataset Search**](https://datasetsearch.research.google.com/)
  * [**datacatalogs.org**](https://datacatalogs.org) by the Open Knowledge Foundation


### Community Data License Agreement (CDLA)

Created by the **Linux Foundation** to address licensing for open data.

Two main types:

1. **CDLA-Sharing**:

   * You must share your modified data under the same license
2. **CDLA-Permissive**:

   * You may modify and use the data without sharing your modifications

> Both licenses allow you to keep your derived results (e.g., models) private.


### Key Takeaways

* A **dataset** is the backbone of data science projects
* Open data has fueled growth in **data science, machine learning, and AI**
* The **CDLA** provides clear licensing terms for responsible and flexible open data use
* While open datasets are valuable for learning and experimentation, they may not always meet **enterprise** requirements due to privacy or quality constraints

---

# Machine Learning Models – Learning from Models to Make Predictions

### What is Machine Learning?

* ML uses **algorithms (models)** to **identify patterns** in data

* The process of learning these patterns is called **model training**

* Once trained, the model can **make predictions** on new data
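
The train-then-predict cycle can be sketched in a few lines. The example below uses ordinary least squares for a straight line, chosen only because it is easy to follow; the house sizes and prices are hypothetical numbers constructed so that price is exactly twice the size.

```python
def train(xs, ys):
    """Model training: learn slope and intercept from example data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Prediction: apply the learned pattern to new data."""
    slope, intercept = model
    return slope * x + intercept


sizes = [50, 80, 100, 120]      # hypothetical house sizes
prices = [100, 160, 200, 240]   # hypothetical prices (2 x size)

model = train(sizes, prices)
estimate = predict(model, 90)   # a size the model has not seen
print(model, estimate)
```

In real projects a library such as Scikit-learn would handle the fitting, but the two phases (fit on known data, then predict on new data) are the same.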

### Types of Machine Learning

1. **Supervised Learning**

   * A human provides labeled inputs and outputs
   * Model learns relationships between inputs and outputs
   * Two types:
     * Regression: Predicts a numeric value
       (e.g., predicting house prices)
     * Classification: Predicts a category
       (e.g., spam or not spam)

2. **Unsupervised Learning**

   * Data is unlabeled
   * Model finds patterns or structure in the data
   * Examples:

     * **Clustering**: Groups similar items (e.g., product recommendation)
     * **Anomaly Detection**: Finds outliers (e.g., fraud detection)

3. **Reinforcement Learning**

   * Model learns by interacting with an environment
   * Gets rewards or penalties for actions
   * Used in games, robotics, and simulations
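
For contrast with supervised learning, here is a minimal sketch of clustering on unlabeled data: a one-dimensional k-means loop. The spending figures and the starting centers are assumptions picked for the example; library implementations choose starting centers more carefully.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group numbers around the nearest center, then re-center, repeatedly."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical customer spending; note that no labels are supplied.
spend = [10, 12, 11, 95, 100, 105]
centers, clusters = kmeans_1d(spend, centers=[0, 50])
print(centers)
print(clusters)
```

The algorithm is never told which customers are "low" or "high" spenders; the two groups emerge from the data alone.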


### Deep Learning

* A specialized type of ML loosely modeled on how the human brain works
* Excellent for:

  * Natural Language Processing (NLP)
  * Image and video analysis
  * Time series forecasting
* Requires:

  * Large labeled datasets
  * High computational power
  * Special hardware (e.g., GPUs)
    
* Built using frameworks like:
  * TensorFlow, PyTorch, Keras
    
* Pre-trained models are available from model zoos (e.g., TensorFlow Hub, the ONNX Model Zoo)


### Building a Deep Learning Model Example

To train a model to identify objects in images:

1. Collect and prepare data (e.g., label objects with bounding boxes)
2. Choose or build a model
3. Train the model using the labeled data
4. Evaluate and refine model performance
5. Deploy the model for use in applications
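
The same five steps can be walked through on a toy problem. The sketch below is not a deep network and does not process images: it trains a single perceptron on hypothetical 2-D points as a stand-in, so the workflow is visible without special hardware or libraries.

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Step 3: adjust weights whenever a training example is misclassified."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), label in samples:
            predicted = 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0
            error = label - predicted
            weights[0] += lr * error * x1
            weights[1] += lr * error * x2
            bias += lr * error
    return weights, bias


def predict(model, point):
    """Step 5: the function an application would call after deployment."""
    (w1, w2), bias = model
    return 1 if w1 * point[0] + w2 * point[1] + bias > 0 else 0


def accuracy(model, samples):
    """Step 4: share of held-out examples classified correctly."""
    hits = sum(predict(model, point) == label for point, label in samples)
    return hits / len(samples)


# Step 1: labeled data (hypothetical points; label 1 means "far from the origin").
train_set = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0),
             ((3, 3), 1), ((4, 2), 1), ((2, 4), 1)]
test_set = [((0, 0.5), 0), ((0.5, 0), 0), ((3, 4), 1), ((5, 1), 1)]

# Steps 2 and 3: choose the model type and train it.
model = train_perceptron(train_set)

# Step 4: evaluate on data the model did not see during training.
score = accuracy(model, test_set)
print(model, score)
```

A real object-detection project would replace the perceptron with a network built in a framework such as TensorFlow or PyTorch, and the data with labeled images.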


### Key Takeaways

* **Machine learning models** learn from data to make predictions
* **Supervised**, **unsupervised**, and **reinforcement learning** are core types
* **Deep learning** is powerful but resource-intensive
* **Model training**, **evaluation**, and **deployment** are essential steps in the ML workflow

---

# The Model Asset eXchange (MAX)

### What is MAX?

* Model Asset eXchange (MAX) is a free, open-source repository hosted by IBM Developer
  
* It provides ready-to-use and customizable deep learning microservices.

* These microservices are ideal for common AI tasks, especially when:

  * You want to reduce time to value
  * You don’t want to train models from scratch (which takes lots of data, time, and compute power)

### Key Features

* Models are:

  * **Pre-trained** (or can be further trained)
    
  * **Validated** through research, testing, and evaluation
    
  * Licensed under permissive open-source licenses (good for personal & commercial use)
 
* Domains covered include:

  * Object detection
  * Image/audio/video/text classification
  * Named entity recognition
  * Human pose detection
  * Image-to-text translation

### Structure of a Model Microservice

Each MAX microservice includes:

1. A pre-trained deep learning model
2. Pre-processing code (for input)
3. Post-processing code (for output)
4. A standardized REST API (for application access)

These are packaged into a **Docker image** and made available on **GitHub**.
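
The four parts can be pictured as a small pipeline. The sketch below is purely illustrative: the keyword-counting `model_predict` is a stand-in for a real pre-trained model, and the function names and the JSON response shape are assumptions rather than the actual MAX code.

```python
import json


def preprocess(raw_text):
    """Pre-processing: turn raw input into the form the model expects."""
    return raw_text.lower().split()


def model_predict(tokens):
    """Stand-in for the pre-trained model (a keyword count, not a real network)."""
    positive = {"good", "great", "excellent"}
    hits = sum(token in positive for token in tokens)
    return hits / len(tokens) if tokens else 0.0


def postprocess(score):
    """Post-processing: turn the raw model output into a readable result."""
    label = "positive" if score > 0 else "neutral"
    return {"label": label, "score": round(score, 2)}


def handle_request(request_body):
    """What the REST layer would do with one JSON request body."""
    payload = json.loads(request_body)
    score = model_predict(preprocess(payload["text"]))
    return json.dumps({"status": "ok", "predictions": [postprocess(score)]})


response = json.loads(handle_request('{"text": "Great product"}'))
print(response)
```

Keeping the pre- and post-processing next to the model means an application only has to send raw input and read a finished result.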


### Deployment & Integration

* Microservices can be deployed on:

  * Local machines
  * Private, hybrid, or public clouds
    
* Deployment tools:

  * **Docker**: Packages models into containers
  * **Kubernetes**: Automates deployment, scaling, and management
  * **Red Hat OpenShift**: An enterprise-grade Kubernetes platform available on IBM Cloud, AWS, GCP, and Azure

### Key Takeaways

* **MAX** provides a fast, permissively licensed, and reusable way to integrate deep learning models
* **Pre-trained models** reduce time, cost, and complexity
* MAX services use **Docker + Kubernetes** for flexible, scalable deployment

---

# Module 3 Summary

Congratulations! You have completed this module. At this point in the course, you know that:

- Python offers a diverse library ecosystem for data science, covering scientific computing (Pandas, NumPy), visualization (Matplotlib, Seaborn), and high-level machine learning (Scikit-learn). These libraries offer tools for data manipulation, mathematical operations, and simplified machine learning model development.

- Application Programming Interfaces (APIs) facilitate communication between software components. REST APIs, specifically, let clients communicate over the internet and access resources such as storage. Key API terms include client (the user or code accessing the API), resource (the service or data), and endpoint (the API's URL).

- Machine learning models analyze data and identify patterns to make predictions and automate complex tasks. The three fundamental types of machine learning are supervised, unsupervised, and reinforcement learning. Supervised learning includes regression and classification models for predictive modeling and pattern recognition. Deep learning, an advanced subset of machine learning, loosely mimics the brain's processing, enabling intricate problem-solving in various domains.

- The Community Data License Agreement (CDLA) facilitates open data sharing by providing clear licensing terms for distribution and use, and the IBM Data Asset eXchange (DAX) site contains high-quality open data sets.

- The Model Asset eXchange (MAX) provides a wealth of pre-trained deep learning models, empowering developers with readily deployable solutions for various business challenges.  