# 📜 IBM Data Science Professional Certificate  
*Curiosity to Capability — One Notebook at a Time*

---

**Compiled and Authored by:**  
**Partho Sarothi Das**  
Dhaka, Bangladesh  
🎓 Bachelor's & Master's in Statistics  
💼 Investment Banking Professional → Aspiring Data Scientist  

> **Disclaimer:** This notebook is based on content from the [IBM Data Science Professional Certificate](https://www.coursera.org/professional-certificates/ibm-data-science) offered on Coursera. It is intended for personal learning and review purposes.

---

# Tools for Data Science – Course Summary

This beginner-friendly course introduces a wide range of essential tools that data scientists use to work with and analyze data. It's suitable for learners with or without programming experience and is structured into seven comprehensive modules:

#### Module Breakdown:
- **Module 1: Overview of Data Science Tools**
  - Categories of tools (open source & commercial)
  - Functional overlaps, strengths, and limitations

- **Module 2: Languages of Data Science**
  - Key languages: Python, R, Scala, Java, Julia, SQL
  - Use cases and relevance in data workflows

- **Module 3: Libraries, APIs, Datasets & Models**
  - Built-in libraries for specific functions
  - APIs for software interaction
  - Data Asset eXchange (DAX) datasets
  - Machine learning models and pattern detection

- **Module 4: Jupyter Project**
  - Jupyter Notebook, JupyterLab, and JupyterLite
  - Tools like IBM Watson Studio and Google Colab
  - Installation options like Anaconda

- **Module 5: RStudio & GitHub**
  - Visualizations in R using various packages
  - GitHub for project sharing and version control

- **Module 6: Final Project**
  - Create, share, and peer-review a Jupyter Notebook

- **Module 7 (Optional): IBM Tools for Data Science**
  - IBM Watson Studio, Cloud Pak, Machine Learning services

# Data Science Tools

## Categories of Data Science Tools

### 🔹 Data Science Task Categories

1. **Data Management**

   * Collect, store, and retrieve data securely and efficiently
   * Sources: social media, sensors, e-commerce platforms, etc.

2. **Data Integration and Transformation (ETL)**

   * Extract data from multiple repositories
   * Transform data format, structure, and values (e.g., unit conversion)
   * Load transformed data into data warehouses

3. **Data Visualization**

   * Represent data using charts, plots, maps, etc. for better decision-making
   * Examples: bar chart, treemap, line chart, map chart

4. **Model Building**

   * Train machine learning models to identify patterns and make predictions
   * Tools: IBM Watson Machine Learning

5. **Model Deployment**

   * Integrate models into production via APIs
   * Allow business applications to access predictions
   * Tool: SPSS Collaboration and Deployment Services

6. **Model Monitoring and Assessment**

   * Ensure ongoing model performance, fairness, and accuracy
   * Tools: Fiddler, IBM Watson OpenScale
   * Metrics: F1 score, true positive rate, sum of squared error
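The monitoring metrics named in the last item can be computed directly from labels and predictions. The following is a minimal sketch in plain Python; the function names and the sample data are illustrative assumptions, not part of the course material or of any monitoring product.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def true_positive_rate(y_true, y_pred):
    """Share of actual positives the model caught (also called recall)."""
    tp, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sum_of_squared_error(y_true, y_pred):
    """Regression metric: sum of (actual - predicted) squared."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


# Illustrative data, made up for this example
labels = [1, 0, 1, 1, 0, 1]
preds = [1, 0, 0, 1, 1, 1]
print(true_positive_rate(labels, preds))
print(f1_score(labels, preds))
print(sum_of_squared_error([3.0, 5.0], [2.5, 5.5]))
```

A monitoring tool would typically recompute values like these on fresh production data and raise an alert when they drift from the values seen at training time.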

---

### 🔹 Supporting Tools and Environments

1. **Code Asset Management**

   * Manages and versions source code
   * Supports team collaboration via centralized platforms
   * Example: GitHub

2. **Data Asset Management (DAM)**

   * Organizes and secures data from various sources
   * Supports versioning, replication, and access control
   * Enables collaborative data handling

3. **Development Environments (IDEs)**

   * Tools for writing, testing, and deploying code
   * Simulate real-world conditions before deployment
   * Example: IBM Watson Studio

4. **Execution Environments**

   * Provide system resources and libraries to run code
   * Cloud-based environments offer flexibility and scalability

---

### **Summary**

Data science involves a series of tasks, from data collection through model deployment and monitoring. Each task is supported by dedicated tooling, such as data asset management, version control, development environments, and cloud execution platforms. IBM Watson Studio and IBM Cognos Dashboard are examples of fully integrated platforms that bring many of these tasks together in one place.


## **Open-Source Tools for Data Science Part 1**


### 1. Data Management Tools

* **Relational Databases**: MySQL, PostgreSQL
* **NoSQL Databases**: MongoDB, Apache CouchDB, Apache Cassandra
* **File Systems**: Hadoop File System, Ceph (cloud file system)
* **Search Engine**: Elasticsearch (text data & indexing)
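The difference between relational and document-style NoSQL storage is easiest to see side by side. This is a minimal sketch using only the standard library: `sqlite3` stands in for a relational database such as MySQL or PostgreSQL, and a list of dictionaries stands in for a document collection. The table, field names, and records are made up for illustration.

```python
import json
import sqlite3

# Relational style: fixed columns, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Dhaka"), (2, "Ben", "Toronto")],
)
rows = conn.execute("SELECT name FROM customers WHERE city = ?", ("Dhaka",)).fetchall()
relational_names = [name for (name,) in rows]

# Document style: each record is a self-describing JSON document,
# and documents in the same collection may carry different fields.
collection = [
    json.loads('{"_id": 1, "name": "Asha", "city": "Dhaka", "tags": ["vip"]}'),
    json.loads('{"_id": 2, "name": "Ben", "orders": [{"sku": "A-1", "qty": 2}]}'),
]
document_names = [doc["name"] for doc in collection if doc.get("city") == "Dhaka"]

print(relational_names)
print(document_names)
conn.close()
```

Both queries return the same customer, but the relational table enforces one schema for every row while the documents are free to differ in shape.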


### 2. Data Integration & Transformation Tools

* **Apache Airflow** – workflow orchestration
* **Kubeflow** – pipelines on Kubernetes
* **Apache Kafka** – real-time data streaming
* **Apache NiFi** – visual editor for data flows
* **Apache Spark SQL** – scalable SQL processing
* **Node-RED** – lightweight, visual data integration (runs on Raspberry Pi)
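All of these tools automate some form of the extract, transform, load pattern. To make that pattern concrete, here is a minimal sketch in plain Python. It does not use any of the tools listed; the CSV content, the table name, and the function names are hypothetical, with an in-memory CSV standing in for a source repository and SQLite standing in for a data warehouse.

```python
import csv
import io
import sqlite3

# Source data: an in-memory CSV stands in for a real repository.
SOURCE_CSV = """sensor_id,temp_f
s1,68.0
s2,212.0
"""


def extract(csv_text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: cast types and convert Fahrenheit to Celsius (unit conversion)."""
    return [
        (row["sensor_id"], round((float(row["temp_f"]) - 32) * 5 / 9, 2))
        for row in rows
    ]


def load(records, conn):
    """Load: write the transformed records into a warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor_id TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", records)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), warehouse)
loaded = warehouse.execute(
    "SELECT sensor_id, temp_c FROM readings ORDER BY sensor_id"
).fetchall()
print(loaded)
```

Keeping the three stages as separate functions mirrors how orchestration tools schedule and retry each step independently.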


### 3. Data Visualization Tools

* **PixieDust** – Python plotting library with UI
* **Hue** – SQL-based visualization tool
* **Kibana** – works with Elasticsearch
* **Apache Superset** – web app for data exploration & visualization

### 4. Model Deployment Tools

* **Apache PredictionIO** – for SparkML models
* **Seldon** – supports many ML frameworks (TensorFlow, R, SparkML, etc.)
* **MLeap** – deploy SparkML pipelines
* **TensorFlow Serving**, **TensorFlow Lite**, **TensorFlow.js** – deploy on server, mobile, or web
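Deployment tools generally place a trained model behind an API, so that applications send input data and receive predictions back. The sketch below shows only that request and response contract in plain Python. The model (fixed linear coefficients), the JSON field names, and `handle_request` are hypothetical and do not mirror the API of any tool listed above.

```python
import json

# A "trained model" reduced to fixed coefficients (hypothetical values).
MODEL = {"intercept": 1.0, "weights": {"rooms": 2.0, "area": 0.5}}


def predict(features):
    """Linear prediction: intercept + sum(weight * feature value)."""
    return MODEL["intercept"] + sum(
        weight * features[name] for name, weight in MODEL["weights"].items()
    )


def handle_request(body):
    """Take a JSON request body and return (status_code, JSON response body)."""
    try:
        features = json.loads(body)["features"]
        return 200, json.dumps({"prediction": predict(features)})
    except (KeyError, TypeError, ValueError) as exc:
        # Missing fields or malformed JSON become a client error, not a crash.
        return 400, json.dumps({"error": f"bad request: {exc}"})


status, response = handle_request('{"features": {"rooms": 3, "area": 10}}')
print(status, response)
```

In a real deployment, a web server or serving framework would call a function like `handle_request` for each incoming HTTP request.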

### 5. Model Monitoring & Assessment Tools

* **ModelDB** – metadata repository for models (supports SparkML & scikit-learn)
* **Prometheus** – generic monitoring (used for model performance too)
* **IBM AI Fairness 360** – detect & mitigate model bias
* **IBM Adversarial Robustness 360** – protect models from adversarial attacks
* **IBM AI Explainability 360** – explain model decisions
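Bias detection usually starts from group-level metrics. One widely used example is disparate impact: the ratio of favorable-outcome rates between an unprivileged and a privileged group. The sketch below computes it in plain Python and does not use the AI Fairness 360 API; the outcomes, group labels, and function names are made up for illustration.

```python
def favorable_rate(outcomes):
    """Share of favorable outcomes (1 = favorable, 0 = unfavorable)."""
    return sum(outcomes) / len(outcomes)


def disparate_impact(outcomes, groups, unprivileged, privileged):
    """Ratio of favorable rates: unprivileged group / privileged group."""
    unpriv = [o for o, g in zip(outcomes, groups) if g == unprivileged]
    priv = [o for o, g in zip(outcomes, groups) if g == privileged]
    return favorable_rate(unpriv) / favorable_rate(priv)


# Made-up model decisions for two groups, "A" and "B"
outcomes = [1, 0, 0, 1, 1, 1, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

ratio = disparate_impact(outcomes, groups, unprivileged="A", privileged="B")
print(round(ratio, 3))
# A ratio well below 1.0 (a common rule of thumb uses 0.8) suggests the
# unprivileged group receives favorable outcomes less often.
```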


### 6. Code Asset Management Tools

* **Git** – industry standard version control
* **GitHub** – most popular Git service
* **GitLab** – open-source & self-hostable
* **Bitbucket** – another Git platform

### 7. Data Asset Management Tools

* **Apache Atlas** – metadata & data governance
* **ODPi Egeria** – open metadata sharing across systems
* **Kylo** – open-source platform for data asset management


## **Open-Source Tools for Data Science Part 2**

### 1. Development Environments

* **Jupyter Notebooks**

  * Supports 100+ languages via "kernels"
  * Integrates code, documentation, output, visualizations, and shell commands
* **JupyterLab**

  * More modular and modern
  * Allows multi-file layout (notebooks, terminals, datasets) on a flexible canvas
* **Apache Zeppelin**

  * Jupyter-inspired with **built-in plotting** (no coding required)
  * Extensible with libraries
* **RStudio**

  * Long-established environment for R
  * Includes tools for execution, debugging, visualization, and remote data access
  * Integrates with Jupyter
* **Spyder**

  * Python IDE modeled after RStudio
  * Not as feature-rich as RStudio but integrates code, plots, and documentation
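The way a notebook combines code and documentation is visible in its file format: an `.ipynb` file is a JSON document holding a list of markdown and code cells. The sketch below builds a minimal notebook with the standard `json` module, following the nbformat 4 layout. The cell contents are illustrative, and the minor version is set to 4 on the assumption that later minor versions expect an additional per-cell `id` field.

```python
import json


def markdown_cell(text):
    """A documentation cell; source is stored as a list of lines."""
    return {"cell_type": "markdown", "metadata": {}, "source": text.splitlines(keepends=True)}


def code_cell(code):
    """A code cell; unexecuted cells carry no outputs and no execution count."""
    return {
        "cell_type": "code",
        "metadata": {},
        "execution_count": None,
        "outputs": [],
        "source": code.splitlines(keepends=True),
    }


notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {
        "kernelspec": {"name": "python3", "display_name": "Python 3", "language": "python"}
    },
    "cells": [
        markdown_cell("# Tools for Data Science\nNotes and code live side by side."),
        code_cell("total = sum(range(5))\nprint(total)"),
    ],
}

notebook_json = json.dumps(notebook, indent=1)
print(notebook_json[:60])
# Writing notebook_json to a file ending in .ipynb gives a file Jupyter can open.
```

The `kernelspec` entry is what tells Jupyter which language kernel should run the code cells.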

### 2. Cluster Execution Environments

* **Apache Spark**

  * Most popular for **batch processing**
  * Highly scalable; performance increases with more servers
* **Apache Flink**

  * Focuses on **real-time stream processing**
  * Competes with Spark, though Spark is more widely adopted
* **Ray**

  * Newer tool focused on **large-scale deep learning** model training
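The distinction between batch and stream processing can be shown without any cluster. This is a conceptual sketch in plain Python, not Spark or Flink code; the function names and readings are made up for illustration.

```python
def batch_mean(values):
    """Batch style: the full dataset is available before processing starts."""
    return sum(values) / len(values)


def streaming_mean(events):
    """Stream style: update the result as each event arrives, keeping only
    a count and a running mean rather than the full history."""
    count, mean = 0, 0.0
    for value in events:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield mean


readings = [10.0, 20.0, 30.0, 40.0]
print(batch_mean(readings))
print(list(streaming_mean(iter(readings))))
```

The batch function returns one answer at the end, while the streaming generator emits an up-to-date answer after every event, which is the property real-time systems rely on.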

### 3. Visual & No-Code Tools

* **KNIME**

  * Drag-and-drop interface
  * Built-in visualization & support for R, Python, Apache Spark
  * Good for users with minimal coding experience
* **Orange**

  * Easier to use than KNIME
  * Less flexible but beginner-friendly for model building and visualization

### Final Takeaway

This section introduced key open-source tools across:

* **Interactive environments** (e.g., Jupyter, RStudio, Spyder)
* **Big data processing platforms** (e.g., Spark, Flink, Ray)
* **No-code visual tools** (e.g., KNIME, Orange)

These tools support essential tasks like data integration, visualization, model building, and large-scale execution, offering options for both coders and non-programmers in data science.

## **Commercial Tools for Data Science**

### 1. Commercial Data Management Tools

* Oracle Database
* Microsoft SQL Server
* IBM Db2

> These are industry-standard tools supported by major vendors, valued for their reliability, functionality, and strong commercial support.


### 2. Commercial Data Integration & Transformation Tools (ETL)

* **Leaders**:

  * *Informatica PowerCenter*, *IBM InfoSphere DataStage*
* **Others**:

  * *SAP*, *Oracle*, *SAS*, *Talend*, *Microsoft*
* **Watson Studio Desktop**: Includes **Data Refinery** for spreadsheet-style data integration


### 3. Commercial Data Visualization Tools

* **Business Intelligence (BI) Tools** for reports and dashboards:

  * *Tableau*, *Microsoft Power BI*, *IBM Cognos Analytics*
* **Watson Studio Desktop**: Provides data-specific visualizations for data scientists (e.g., column relationships)

### 4. Model Building Tools

* **SPSS Modeler**
* **SAS Enterprise Miner**

> SPSS Modeler is also integrated into Watson Studio Desktop.

### 5. Model Deployment Tools

* **SPSS Collaboration and Deployment Services**: Used for deploying SPSS assets
* Commercial tools support model export in open formats like **PMML** (Predictive Model Markup Language)

### 6. Model Monitoring

* **Not widely supported** in commercial tools yet
* **Open-source tools** are the usual first choice (e.g., IBM AI Fairness 360 for bias detection, Prometheus for general monitoring)

### 7. Code Asset Management

* **Git & GitHub**: Git is open source and, together with GitHub, is the de facto standard

> Commercial alternatives are not widely adopted

### 8. Data Asset Management (Data Governance & Lineage)

* **Informatica Enterprise Data Governance**
* **IBM Information Governance Catalog**: Includes features like:

  * Data dictionary
  * Data steward assignment
  * Data lineage tracking
  * Policy and rule management for compliance
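Data lineage tracking amounts to recording which datasets each dataset was derived from, then walking that graph. The following is a minimal sketch in plain Python; the dataset names and the `upstream` helper are hypothetical and do not represent the API of either product above.

```python
# Each dataset maps to the datasets it was derived from (hypothetical names).
LINEAGE = {
    "sales_dashboard": ["sales_summary"],
    "sales_summary": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["customers_raw"],
    "orders_raw": [],
    "customers_raw": [],
}


def upstream(dataset, lineage):
    """Return every dataset that `dataset` depends on, directly or indirectly."""
    seen = set()
    stack = list(lineage.get(dataset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(lineage.get(current, []))
    return sorted(seen)


print(upstream("sales_dashboard", LINEAGE))
```

A query like this answers the compliance question of which raw sources feed a given report.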

### 9. Development Environments & Integrated Platforms

* **Watson Studio (Cloud & Desktop)**:

  * Combines Jupyter Notebooks and graphical tools
  * Integrated with **Watson OpenScale** for full lifecycle support
  * Deployable on-premises, Kubernetes, or Red Hat OpenShift
* **H2O Driverless AI**:

  * Fully automated platform covering the complete data science workflow
