# 📜 IBM Data Science Professional Certificate  
*Curiosity to Capability — One Notebook at a Time*

---

**Compiled and Authored by:**  
**Partho Sarothi Das**  
Dhaka, Bangladesh  
🎓 Bachelor's & Master's in Statistics  
💼 Investment Banking Professional → Aspiring Data Scientist  

**Note:** This notebook is based on content from the [IBM Data Science Professional Certificate](https://www.coursera.org/professional-certificates/ibm-data-science) offered on Coursera. It is intended for personal learning and review purposes.

---

# Understanding Data

### Summary: Types of Data

**Data** refers to unorganized information such as numbers, symbols, observations, or media that can be processed to extract **meaningful insights**.

### Categories of Data by Structure

#### 1. Structured Data

* **Well-organized** and follows a **defined schema** (e.g., rows and columns).
* Easily stored in **SQL databases** and analyzed with standard tools.
* **Sources** include:

  * SQL/OLTP systems
  * Spreadsheets (Excel, Google Sheets)
  * Online forms
  * Sensors (GPS, RFID)
  * Web/server logs

#### 2. Semi-structured Data

* Has **some structure**, but lacks a rigid schema.
* Contains **tags and metadata** for hierarchy and grouping.
* Not stored in traditional tabular formats.
* **Sources** include:

  * Emails
  * XML and JSON files
  * Zipped files, TCP/IP packets
  * Mixed-source data integrations

#### 3. Unstructured Data

* Has **no predefined format or structure**.
* Difficult to organize in relational databases.
* Common in **big data and analytics** use cases.
* **Sources** include:

  * Web pages
  * Social media content
  * Images, audio, and video
  * PDFs, Word docs, presentations
  * Surveys and media logs

---

### Summary Table

| Type            | Structure       | Examples                                 | Storage              |
| --------------- | --------------- | ---------------------------------------- | -------------------- |
| Structured      | Rigid schema    | SQL DBs, Excel, Sensors                  | SQL Databases        |
| Semi-structured | Flexible schema | XML, JSON, Emails                        | Hierarchical files   |
| Unstructured    | No schema       | Images, Text files, Social media, Videos | NoSQL / File systems |
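
To make the three categories concrete, the sketch below holds similar customer information in each shape. The sample values and variable names are made up for illustration; only the standard library is used.

```python
import csv
import io
import json

# Structured: fixed columns, every row carries the same fields.
csv_text = "customer_id,name,city\n1,Asha,Dhaka\n2,Ben,Toronto\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: keys act as tags; records may nest or omit fields.
json_text = """[
  {"customer_id": 1, "name": "Asha", "tags": ["vip"]},
  {"customer_id": 2, "name": "Ben", "address": {"city": "Toronto"}}
]"""
records = json.loads(json_text)

# Unstructured: free text with no fields; any structure must be inferred.
note = "Asha called from Dhaka about her order; Ben emailed from Toronto."

structured_fields = [sorted(r.keys()) for r in rows]
semi_fields = [sorted(r.keys()) for r in records]

print(structured_fields)  # identical field lists for every row
print(semi_fields)        # field lists differ between records
print("Dhaka" in note)    # only a text search is possible here
```

The CSV rows can be loaded straight into a table, the JSON records need per-record handling of missing or nested keys, and the free-text note would need text processing before any field could be extracted.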

---

### Conclusion

* **Structured** data is highly organized and easily analyzed.
* **Semi-structured** data is partially organized and uses tags or metadata.
* **Unstructured** data is irregular but rich in insights when analyzed with the right tools.

# Data Sources

---

### Common Data Sources in Data Science

Data today comes from **diverse and dynamic** sources, both internal and external to organizations. These sources feed into analytics pipelines for decision-making, prediction, and business strategy.

### 1. Internal Data Sources (within organizations)

* **Relational Databases** (e.g., SQL Server, Oracle, MySQL, IBM DB2)

  * Used for structured storage of transactional, customer, HR, and operational data.
  * Supports analysis for business performance and forecasting.

### 2. External Data Sources

* **Government Datasets**: Demographic and economic data.
* **Commercial Data Providers**: Offer POS, financial, and weather data for a fee.


### 3. Flat Files, Spreadsheets, and XML

* **Flat Files**: Plain text, typically CSV; one record per line, values separated by delimiters.
* **Spreadsheets**: Tabular format (.XLS, .XLSX); support formulas and multiple worksheets.
* **XML Files**: Hierarchical data using tags; common in surveys, bank statements.
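
As a rough illustration of the difference between a delimited flat file and a tagged XML file, the sketch below reads the same two transactions from both. The pipe delimiter, the field names, and the `statement`/`transaction` tags are assumptions chosen for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Flat file: one record per line, values separated by a delimiter (here "|").
flat_file = (
    "date|account|amount\n"
    "2024-01-05|A-100|250.00\n"
    "2024-01-06|A-100|-40.50\n"
)
flat_records = list(csv.DictReader(io.StringIO(flat_file), delimiter="|"))

# XML: the same data expressed as a hierarchy of tags and attributes.
xml_doc = """<statement account="A-100">
  <transaction date="2024-01-05"><amount>250.00</amount></transaction>
  <transaction date="2024-01-06"><amount>-40.50</amount></transaction>
</statement>"""
root = ET.fromstring(xml_doc)
xml_amounts = [float(t.findtext("amount")) for t in root.findall("transaction")]

flat_total = sum(float(r["amount"]) for r in flat_records)
xml_total = sum(xml_amounts)
print(flat_total, xml_total)
```

Both sources yield the same total; the flat file repeats the account on every line, while the XML version states it once on the parent element.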


### 4. APIs and Web Services

* Allow **programmatic access** to data via web/network requests.
* Data formats: JSON, XML, HTML, CSV, or media files.
* Examples:

  * **Twitter/Facebook APIs**: Sentiment analysis from posts.
  * **Stock Market APIs**: Real-time financial data.
  * **Validation APIs**: Zip code to city lookup for cleaning/correlation.
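
A minimal sketch of the zip-code lookup idea follows. The endpoint `https://api.example.com/v1/zip-lookup`, its `zip` parameter, and the response fields are hypothetical, and a canned response body stands in for a real network call so the example runs offline.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint for illustration; not a real service.
BASE_URL = "https://api.example.com/v1/zip-lookup"


def build_request_url(zip_code: str) -> str:
    """Build the URL a client would request for a given zip code."""
    return f"{BASE_URL}?{urlencode({'zip': zip_code})}"


def parse_city(response_body: str) -> str:
    """Pull the city name out of a JSON response body."""
    payload = json.loads(response_body)
    return payload["city"]


# Canned body standing in for what such an API might return.
sample_body = '{"zip": "10001", "city": "New York", "state": "NY"}'

url = build_request_url("10001")
city = parse_city(sample_body)
print(url)
print(city)
```

In real use the URL would be fetched with an HTTP client and the returned body passed to `parse_city`; authentication, rate limits, and error handling depend on the specific API.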


### 5. Web Scraping

* Extracts data from **unstructured web content**.
* Use cases:

  * Product info for **price comparison**
  * **Sales lead** generation
  * Forum or blog post analysis
  * Machine learning **training datasets**
* Tools: **BeautifulSoup**, **Scrapy**, **Pandas**, **Selenium**
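
The price-comparison use case can be sketched as below. To stay dependency-free, this uses Python's built-in `html.parser` rather than the tools listed above, and the HTML snippet and the `price` class name are invented for the example.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collects the text inside <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(float(data.strip().lstrip("$")))


# Stand-in for a downloaded product page.
html = """
<ul>
  <li>Widget <span class="price">$19.99</span></li>
  <li>Gadget <span class="price">$24.50</span></li>
</ul>
"""

parser = PriceParser()
parser.feed(html)
print(parser.prices)
print(min(parser.prices))
```

A real scraper would also need to fetch pages, respect the site's terms and robots rules, and cope with markup that changes over time.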

### 6. Data Streams and Feeds

* Constant flow of real-time data, often **timestamped and geo-tagged**.
* Examples:

  * **Stock tickers** for trading
  * **Retail transactions** for demand prediction
  * **Sensor/IoT data** for industrial monitoring
  * **Web clickstreams**, **social media feeds**, **video surveillance**
* Tools: **Apache Kafka**, **Apache Spark Streaming**, **Apache Storm**
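
Stream processing handles each record as it arrives instead of waiting for a complete dataset. The sketch below imitates that with a Python generator and a sliding-window average; it does not use Kafka, Spark, or Storm, and the readings are made-up values.

```python
from collections import deque
from typing import Iterable, Iterator


def moving_average(stream: Iterable[float], window: int) -> Iterator[float]:
    """Yield the average of the most recent `window` values after each arrival."""
    recent = deque(maxlen=window)  # old values fall off automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


def sensor_readings() -> Iterator[float]:
    """Stand-in for a live sensor feed; a real source would never 'finish'."""
    yield from [20.0, 22.0, 24.0, 30.0]


averages = list(moving_average(sensor_readings(), window=2))
print(averages)
```

Because the function only keeps the last `window` values in memory, the same logic works whether the feed delivers four readings or four billion.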


### 7. RSS Feeds

* Used to **track updated content** (e.g., news, forums).
* Feed readers convert RSS into continuously updated data streams.


### Conclusion

Data sources for analytics are rich and varied—from structured databases to real-time streams. Understanding these sources helps data scientists choose the right tools and strategies for effective data gathering, integration, and analysis.

---

# Working with Varied Data Sources and Types

---

### 1. Relational Databases & SQL

* SQL is widely used for **data structuring, movement, and security**.
* Moving data **within and across** relational databases (e.g., SQL Server, DB2) often introduces:

  * **Versioning issues**
  * **Compatibility challenges** between vendors
* One-time data transfers are manageable (especially under 1 TB), but **continuous, performant movement** is more complex.

### 2. Evolution Beyond Relational Databases

* Traditional RDBMS struggle with **write-heavy applications** (e.g., IoT, social media).
* **BigTable-inspired NoSQL databases** like Cassandra and HBase offer better performance for:

  * Random reads/writes
  * Real-time streaming data
  * Massive data volumes

### 3. Flexibility & Learning are Key

* Data engineers must handle:

  * Structured (CSV, SQL)
  * Semi-structured (JSON, XML)
  * Unstructured and **proprietary formats**
* Must work with:

  * **Data at rest**, **streaming data**, and **data in motion**
* Learning is ongoing—skills evolve **per project needs**.

### 4. Challenges with Data Formats

* **Log Data**: Unstructured, often requires **custom tools** to parse.
* **XML**: Once popular with SOAP APIs, now considered **resource-heavy** due to verbose syntax.
* **JSON**: More lightweight and widely used with REST APIs (key-value format).
* **Apache Avro**: Emerging for its **efficient storage and serialization** capabilities.

### 5. Real-World Conversion Example

* Converting from **IBM Db2 to SQL Server** exposed challenges with:

  * **Export/import format differences**
  * **Special characters** in data (e.g., commas, bells)
  * Required **custom delimiters** per table due to non-standard content
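
The delimiter problem is easy to reproduce. In the sketch below, a naive comma split breaks a record whose data contains a comma, and a hypothetical `pick_delimiter` helper chooses a separator that never appears in the table's values. This illustrates the general idea only, not the actual Db2 or SQL Server export tooling.

```python
import csv
import io

rows = [
    ["id", "name", "comment"],
    ["1", "Acme, Inc.", "Pays on time"],
    ["2", "Beta | Co", "Uses a pipe, and a comma"],
]

# A naive split on commas produces four fields instead of three.
naive_fields = "1,Acme, Inc.,Pays on time".split(",")


def pick_delimiter(table, candidates="|\t;~"):
    """Return the first candidate character that never appears in the data."""
    for delimiter in candidates:
        if not any(delimiter in field for row in table for field in row):
            return delimiter
    raise ValueError("no safe delimiter among candidates")


def export(table, delimiter):
    """Write the table as delimited text using the chosen separator."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=delimiter).writerows(table)
    return buffer.getvalue()


safe = pick_delimiter(rows)
exported = export(rows, safe)
round_trip = list(csv.reader(io.StringIO(exported), delimiter=safe))
print(repr(safe), round_trip == rows)
```

Here the pipe is rejected because one value contains it, so the tab is chosen. Quoting fields is the other common remedy, provided both the exporting and importing tools agree on the quoting rules.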


### Conclusion

Working with diverse data sources demands:

* **Flexibility**
* **Adaptability**
* **Willingness to learn**
* A **deep understanding of data formats and tool limitations**

Being a successful data professional means **navigating complexity** and solving unexpected data issues with creativity and precision.

---

# Lesson Overview: Data Literacy

---

### Data Collection and Organization

A **data repository** is a centralized location where data is collected, organized, and stored to support **business operations**, **analytics**, and **reporting**. It can range from small databases to large, distributed infrastructures.

### 1. Databases

* A **database** is a collection of data designed for:

  * **Input**, **storage**, **retrieval**, and **modification**
* A **Database Management System (DBMS)** allows users to query and manage the data.

  * Example: Retrieve all customers inactive for 6+ months.
* **Two main types of databases**:

  * 🔷 **Relational Databases (RDBMS)**

    * Structured (rows/columns), use **SQL**
    * Efficient for multi-table operations and large data volumes
  * 🔶 **Non-Relational Databases (NoSQL)**

    * Schema-less, flexible format
    * Ideal for **big data**, **IoT**, and **social media**-driven data
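
The "inactive for 6+ months" query mentioned above might look like the following in SQL. This sketch uses Python's built-in `sqlite3` with an in-memory database; the `customers` table, its columns, and the fixed reference date are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, last_active TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, last_active) VALUES (?, ?)",
    [("Asha", "2024-05-20"), ("Ben", "2023-09-01"), ("Chen", "2023-11-30")],
)

# A fixed "today" keeps the example reproducible.
as_of = "2024-06-01"
inactive = conn.execute(
    """
    SELECT name
    FROM customers
    WHERE last_active <= date(?, '-6 months')
    ORDER BY name
    """,
    (as_of,),
).fetchall()
print(inactive)
```

The DBMS does the filtering and sorting; the application only states what it wants and receives the matching rows.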

### 2. Data Warehouses

* A **data warehouse** is a **central repository** that consolidates data from multiple sources.
* Uses the **ETL process**:

  * **Extract** → from different systems
  * **Transform** → into clean, usable formats
  * **Load** → into the warehouse for analysis
* Supports **analytics and business intelligence**
* Related concepts:

  * **Data Marts**: Subsets of data warehouses
  * **Data Lakes**: Will be covered later


### 3. Big Data Stores

* Designed to handle **very large volumes** of data
* Use **distributed storage and processing** frameworks
* Built for **scalability, speed, and flexibility**
* Ideal for real-time and unstructured data sources


### Conclusion

Data repositories—whether **databases, data warehouses**, or **big data stores**—play a vital role in:

* **Isolating data**
* Ensuring efficient **reporting and analysis**
* Enabling powerful **data-driven decision-making**

---

### Summary: Understanding Relational Databases

A **Relational Database** organizes data into **tables** (rows = records, columns = attributes), and these tables can be **linked** via **common fields** (e.g., Customer ID). This enables efficient data retrieval and meaningful insights using **SQL (Structured Query Language)**.


### 1. Key Concepts

* **Tables**: Structured into rows and columns.
* **Relationships**: Tables can be joined based on common fields to create new insights.
* **Example**:

  * *Customer Table* ↔ *Transaction Table* via `Customer ID`.
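
A minimal sketch of that relationship, using `sqlite3` and made-up rows: the two tables are joined on `customer_id` to compute total spend per customer. Table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE transactions (
        txn_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
    INSERT INTO transactions VALUES (10, 1, 120.0), (11, 1, 30.0), (12, 2, 75.0);
    """
)

# Link the tables through the shared customer_id field.
totals = conn.execute(
    """
    SELECT c.name, SUM(t.amount) AS total_spent
    FROM customers AS c
    JOIN transactions AS t ON t.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.name
    """
).fetchall()
print(totals)
```

The customer's name is stored once in `customers`; each transaction only carries the ID that points back to it.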


### 2. Strengths of Relational Databases

* **Optimized for Structured Data**
* Use **SQL** to retrieve, update, and manipulate data quickly.
* **Minimize redundancy** by linking rather than duplicating data.
* Enforce **data integrity** through data types, constraints, and referential integrity.
* **Security features** control access and enforce data governance policies.
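
The integrity rules above can be seen in action with a small `sqlite3` sketch. The schema and the `try_insert` helper are illustrative; note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` is set on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: off by default
conn.executescript(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        quantity INTEGER CHECK (quantity > 0)
    );
    INSERT INTO customers VALUES (1, 'asha@example.com');
    """
)


def try_insert(sql, params):
    """Attempt an insert and report whether the database accepted it."""
    try:
        with conn:
            conn.execute(sql, params)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


results = [
    try_insert("INSERT INTO orders VALUES (?, ?, ?)", (100, 1, 2)),   # valid
    try_insert("INSERT INTO orders VALUES (?, ?, ?)", (101, 99, 1)),  # unknown customer
    try_insert("INSERT INTO orders VALUES (?, ?, ?)", (102, 1, 0)),   # quantity not > 0
]
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(results)
print(order_count)
```

Only the first insert is stored. The database itself rejects the other two, so bad rows cannot slip in even if application code forgets to validate them.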


### 3. Deployment Types

* **On-premise or cloud-based**
* **Popular RDBMS tools**:

  * MySQL, PostgreSQL, Oracle, IBM DB2, SQL Server
  * Cloud options: Amazon RDS, Google Cloud SQL, Azure SQL, Oracle Cloud DB


### 4. Key Advantages

| Feature                   | Description                                         |
| ------------------------- | --------------------------------------------------- |
| 🔄 **Flexibility**        | Modify schema live (add columns/tables)             |
| 🔁 **Reduced Redundancy** | Single entry per entity, linked elsewhere           |
| 💾 **Backup & Recovery**  | Easy exports, cloud mirroring for disaster recovery |
| ✅ **ACID Compliance**     | Ensures reliable and consistent transactions        |
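
The atomicity part of ACID can be sketched with a bank transfer in `sqlite3`. The `accounts` table and `transfer` function are assumptions for the example; the credit is deliberately applied before the debit so that a failed debit shows the earlier credit being rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(src, dst, amount):
    """Move money between accounts; both updates succeed or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer("alice", "bob", 30)       # succeeds
failed = transfer("alice", "bob", 500)  # debit would go negative, so it is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

After the failed transfer, Bob's balance does not include the 500 credit: the partial work was discarded along with the failing statement.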


### 5. Common Use Cases

* **OLTP (Online Transaction Processing)**: Fast inserts/updates by multiple users (e.g., banking, e-commerce).
* **OLAP (Online Analytical Processing)**: Business intelligence and data warehousing.
* **IoT Applications**: Lightweight, fast-processing databases deployed at the edge.


### 6. Limitations

* Not suited for **semi-structured/unstructured data** (e.g., logs, images).
* **Schema migration** between systems can be difficult due to strict structure.
* **Field size limits** may restrict very long inputs.
* Less effective for **big data or real-time streaming analytics**.


### Conclusion

Despite the rise of NoSQL, big data, and cloud-native technologies, **relational databases remain the dominant choice** for handling structured data due to their robustness, maturity, consistency, and strong querying capabilities through SQL.

---

### Summary: NoSQL (Not Only SQL) Databases

**NoSQL** databases are **non-relational**, schema-flexible databases designed to handle **large volumes of structured, semi-structured, and unstructured data**. They are especially popular in **cloud-based, big data, IoT, and web/mobile** applications where speed, scalability, and flexibility are essential.

### 1. Key Features

* **Schema-less** design: No predefined structure; can store any type of data in any record.
* **Do not rely on SQL**, though some support SQL-like queries.
* Suited for **scalability**, **performance**, and **distributed systems**.


### 2. Types of NoSQL Databases

#### Key-Value Stores

* Store data as **key-value pairs**.
* Ideal for: **session storage**, **caching**, **user preferences**.
* Examples: **Redis**, **Memcached**, **DynamoDB**.
* **Limitation**: Poor for querying based on value or relationships.
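
A toy in-memory version of the key-value model is sketched below. The `SessionStore` class is hypothetical and is not how Redis, Memcached, or DynamoDB are implemented; it only shows why lookup by key is cheap while lookup by value requires a full scan.

```python
class SessionStore:
    """Toy key-value store with optional expiry times (illustration only)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value, expires_at=None):
        self._data[key] = (value, expires_at)

    def get(self, key, now, default=None):
        """Direct lookup by key; expired entries are dropped on access."""
        value, expires_at = self._data.get(key, (default, None))
        if expires_at is not None and now >= expires_at:
            del self._data[key]
            return default
        return value

    def keys_with_value(self, wanted):
        """Lookup by value has no index to use, so every entry is scanned."""
        return [k for k, (v, _) in self._data.items() if v == wanted]


store = SessionStore()
store.put("session:42", {"user": "asha"}, expires_at=100.0)
store.put("theme:asha", "dark")

hit = store.get("session:42", now=50.0)    # still valid
miss = store.get("session:42", now=150.0)  # expired, so the default is returned
dark_keys = store.keys_with_value("dark")
print(hit, miss, dark_keys)
```

Timestamps are passed in explicitly to keep the example deterministic; a real cache would read the clock itself.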

#### Document-Based Databases

* Store data as **individual documents** (often JSON).
* Ideal for: **eCommerce**, **medical records**, **analytics platforms**.
* Examples: **MongoDB**, **CouchDB**, **DocumentDB**.
* **Limitation**: Not ideal for complex transactions or multi-query operations.
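
The document model can be imitated with plain Python dictionaries: each record carries its own fields, and queries must tolerate fields that are absent. The `get_path` and `find` helpers and the sample documents are assumptions for illustration, not the query API of MongoDB or any other product.

```python
documents = [
    {"_id": 1, "name": "Laptop", "specs": {"ram_gb": 16}},
    {"_id": 2, "name": "Novel", "author": "A. Writer"},
    {"_id": 3, "name": "Phone", "specs": {"ram_gb": 8}, "colors": ["black"]},
]


def get_path(doc, dotted_path, default=None):
    """Follow a dotted path such as 'specs.ram_gb' through nested dicts."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def find(docs, dotted_path, predicate):
    """Return documents where the path exists and its value passes the test."""
    matches = []
    for doc in docs:
        value = get_path(doc, dotted_path)
        if value is not None and predicate(value):
            matches.append(doc)
    return matches


high_ram = find(documents, "specs.ram_gb", lambda gb: gb >= 16)
names = [doc["name"] for doc in high_ram]
print(names)
```

The book has no `specs` field at all, and nothing had to be declared in advance to allow that; the query simply skips it.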

#### Column-Based Databases

* Store data in cells grouped **by columns (column families) instead of rows**.
* Ideal for: **time-series data**, **IoT**, **heavy-write applications**.
* Examples: **Cassandra**, **HBase**.
* **Limitation**: Not flexible for changing query patterns or complex querying.

#### Graph-Based Databases

* Store data using **nodes (entities)** and **edges (relationships)**.
* Ideal for: **social networks**, **fraud detection**, **access management**.
* Examples: **Neo4j**, **CosmosDB**.
* **Limitation**: Not optimal for high-volume transactional processing.
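
Relationship queries such as "who is within two hops of this user" are the natural fit for the graph model. The sketch below represents nodes and edges with a plain dictionary and walks them breadth-first; the `follows` data and `within_hops` function are made up, and graph databases use their own storage and query languages.

```python
from collections import deque

# Nodes are people; each edge means "follows".
follows = {
    "asha": {"ben", "chen"},
    "ben": {"dina"},
    "chen": {"dina", "eli"},
    "dina": set(),
    "eli": {"asha"},
}


def within_hops(graph, start, max_hops):
    """Breadth-first search returning each reachable node and its distance."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    distances.pop(start)
    return distances


reach = within_hops(follows, "asha", 2)
print(reach)
```

Expressing the same question in a relational database would take one self-join per hop, which is why connected data is often easier to handle in a graph store.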

### 3. Advantages of NoSQL

* Handles **high volume, high variety, high velocity** data.
* **Distributed architecture**: Easily scales across cloud infrastructure and data centers.
* **Cost-effective**: Uses commodity hardware; scales horizontally.
* **Agile and flexible**: Ideal for iterative development and rapid changes.


### 4. NoSQL vs. Relational Databases

| Feature               | Relational (RDBMS)          | NoSQL                                          |
| --------------------- | --------------------------- | ---------------------------------------------- |
| **Schema**            | Rigid, fixed schema         | Schema-less, flexible                          |
| **Data Type Support** | Structured only             | Structured, semi-structured, unstructured      |
| **Scalability**       | Vertical (scale up)         | Horizontal (scale out)                         |
| **Cost**              | Higher (commercial support) | Lower (commodity hardware)                     |
| **ACID Compliance**   | Yes                         | Not always (some provide eventual consistency) |
| **Maturity**          | Mature and well-documented  | Newer, less predictable                        |


### Conclusion

NoSQL databases are essential for modern applications that require:

* **Speed**
* **Scalability**
* **Flexibility**

They are not a replacement for relational databases, but rather a **complementary solution** tailored to today’s complex and dynamic data environments.

---

### Summary: Data Marts, Data Lakes, ETL, and Data Pipelines


### 🗂️ **1. Data Warehouses**

* **Definition**: Centralized repositories for storing **cleansed, structured, and analysis-ready data**.
* **Purpose**: Serve as the **single source of truth** for **current and historical** data.
* **Use Cases**: Operational and performance analytics; reporting across large datasets.
* **Note**: Data is **modeled before storage** for specific business needs.

---

### 📦 **2. Data Marts**

* **Definition**: A **subset** of a data warehouse designed for a **specific business function or user group**.
* **Examples**: Sales team reports, finance projections.
* **Benefits**:

  * Business-specific focus
  * Faster performance
  * Isolated security

---

### 🌊 **3. Data Lakes**

* **Definition**: Repositories that store **raw, unstructured, semi-structured, and structured data** in its **native format**.
* **Key Features**:

  * Stores **massive amounts** of data
  * Uses **metadata tagging**
  * Retains **all source data**
* **Best For**: Advanced analytics, machine learning, and flexible exploration.
* **Note**: Can also be used as **staging areas** for data warehouses.

---

### 🔁 **4. ETL Process (Extract – Transform – Load)**

#### **Extract**

* **Goal**: Gather data from source systems.
* **Types**:

  * **Batch Processing**: Periodic chunks (e.g., Stitch, Blendo)
  * **Stream Processing**: Real-time flow (e.g., Apache Kafka, Storm)

#### **Transform**

* **Goal**: Convert raw data to analysis-ready form.
* **Tasks**:

  * Clean & standardize formats
  * Remove duplicates
  * Apply business logic
  * Enrich and relate data

#### **Load**

* **Goal**: Store processed data in the destination.
* **Types**:

  * **Initial Load**: One-time population
  * **Incremental Load**: Periodic updates
  * **Full Refresh**: Replace all data
* **Includes**: Load verification, failure checks, and recovery mechanisms.
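
The three stages can be sketched end to end with the standard library. The source CSV, the cleaning rules, and the `orders` table below are assumptions for the example; production ETL tools add scheduling, monitoring, and recovery on top of this basic shape.

```python
import csv
import io
import sqlite3

# Stand-in for a source system export: messy names, a duplicate, a bad amount.
SOURCE_CSV = """order_id,customer,amount,order_date
1, Asha ,120.50,2024-01-05
2,BEN,75,2024-01-06
2,BEN,75,2024-01-06
3,chen,not_a_number,2024-01-07
"""


def extract(text):
    """Extract: read raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: standardize names, drop duplicates and unparseable amounts."""
    seen, clean = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        order_id = int(row["order_id"])
        if order_id in seen:
            continue  # skip duplicate orders
        seen.add(order_id)
        clean.append((order_id, row["customer"].strip().title(), amount, row["order_date"]))
    return clean


def load(conn, rows):
    """Load: write the rows, then verify by counting what landed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    with conn:
        conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?, ?)", rows)
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


conn = sqlite3.connect(":memory:")
clean_rows = transform(extract(SOURCE_CSV))
loaded = load(conn, clean_rows)
customers = conn.execute("SELECT customer FROM orders ORDER BY order_id").fetchall()
print(loaded, customers)
```

Because the load uses `INSERT OR REPLACE` keyed on `order_id`, running it again with new or corrected rows behaves like an incremental load rather than creating duplicates.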

---

### 🚀 **5. Data Pipelines**

* **Definition**: The **entire flow** of data from source to destination; ETL is a **subset** of data pipelines.

* **Types**:

  * **Batch Pipelines**: Move large volumes periodically
  * **Streaming Pipelines**: Real-time continuous data flow
  * **Hybrid Pipelines**: Combine both

* **Key Tools**: **Apache Beam**, **Google Dataflow**

* **Destinations**:

  * Data Lakes
  * Applications
  * Visualization platforms (e.g., BI tools)

---

### ✅ **Conclusion**

| Component      | Purpose                                   | Best Used For                     |
| -------------- | ----------------------------------------- | --------------------------------- |
| Data Warehouse | Structured, business-ready reporting data | Analytics & BI                    |
| Data Mart      | Department-specific reporting             | Focused team dashboards           |
| Data Lake      | Raw, diverse, large-scale data storage    | ML, AI, advanced analytics        |
| ETL            | Convert raw data into analysis-ready data | Data preparation                  |
| Data Pipeline  | Full data movement process (ETL + more)   | End-to-end data flow & automation |

---

### Viewpoints: Considerations for Choice of Data Repository

1. **Use Case**

   * Transactional systems vs. analytical systems
   * Data warehousing vs. backup/archival
   * Real-time vs. historical analysis

2. **Data Type**

   * Structured (e.g. tabular)
   * Semi-structured (e.g. JSON, XML)
   * Unstructured (e.g. images, videos, logs)

3. **Schema Flexibility**

   * Do you know the schema ahead of time?
   * Need for flexible vs. fixed schema?

4. **Performance & Access Patterns**

   * High-frequency reads/writes?
   * Long-running queries or real-time access?
   * Data at rest vs. data in motion (streaming)?

5. **Security & Compliance**

   * Does the data need encryption?
   * Does it meet organizational or legal standards?

6. **Scalability**

   * Can the system grow with organizational needs?
   * Is horizontal scaling supported (adding nodes)?

7. **Compatibility**

   * Does it integrate with your current tech stack, tools, and programming languages?

8. **Organizational Skills & Costs**

   * What skills does the team already have?
   * What is the cost of licensing, cloud hosting, and maintenance?

9. **Hosting & Cloud Options**

   * On-prem vs. cloud (AWS, Azure, GCP)
   * Managed services like Amazon RDS, Aurora, Google Cloud SQL

---

### **Common Choices Based on Use Cases**

| Use Case                                       | Recommended Repositories                             |
| ---------------------------------------------- | ---------------------------------------------------- |
| **High-volume structured data**                | Relational DBs (Postgres, MySQL, DB2)                |
| **Semi/unstructured or dynamic schema**        | NoSQL (MongoDB, Cassandra)                           |
| **Real-time recommendations, social networks** | Graph DBs (Neo4j, Apache TinkerPop)                  |
| **Big data processing & analytics**            | Hadoop + MapReduce, Spark                            |
| **Microservices, small projects**              | Lightweight open-source DBs                          |
| **Data ingestion at scale**                    | Column stores (Cassandra), Document stores (MongoDB) |

---

### Final Thoughts

* There's **no one-size-fits-all**; most organizations use **multiple repositories** depending on the use case.
* The **structure, volume, and velocity** of data, plus organizational context, dictate the best choice.
* Always consider **future scalability**, **integration potential**, and **team expertise** before finalizing.

---

### Data Integration Platforms


### 🔗 **What is Data Integration?**

Gartner Definition:
Data integration is a discipline involving practices, architecture, and tools to **ingest, transform, combine**, and **provision data** across various sources and types.

---

### Key Usage Scenarios

* Ensuring **data consistency** across applications
* **Master data management**
* **Data sharing** across enterprises
* **Data migration and consolidation**
* Making **unified data available for analytics** and **data science**

---

### Relation to ETL and Data Pipelines

* **Data Integration** is the broader discipline.
* **ETL (Extract, Transform, Load)** is a **subset** of data integration.
* **Data pipelines** refer to the **entire journey** of data from source to destination—used to implement data integration.

---

### Capabilities of Modern Data Integration Platforms

1. **Pre-built connectors** to databases, flat files, APIs, CRMs, ERPs, social media, etc.
2. **Open-source architecture** to avoid vendor lock-in
3. Support for:

   * **Batch processing**
   * **Stream processing**
   * **Big Data integration**
4. **Data quality**, **governance**, **security**, and **compliance**
5. **Cloud portability** (single, multi-, hybrid-cloud deployment)

---

### Popular Tools & Platforms

#### 🔹 IBM Tools

* IBM InfoSphere DataStage
* Cloud Pak for Data
* IBM Data Replication
* IBM Data Virtualization Manager

#### 🔹 Talend Suite

* Talend Data Fabric
* Talend Open Studio
* Talend Cloud & Big Data tools

#### 🔹 Other Vendors

* **SAP**, **Oracle**, **SAS**, **Microsoft**, **Qlik**, **TIBCO**, **Denodo**

#### 🔹 Integration Frameworks & iPaaS

* **Dell Boomi**, **Jitterbit**, **SnapLogic**
* iPaaS platforms:

  * **Informatica Integration Cloud**
  * **IBM Application Integration Suite**
  * **Adeptia Integration Suite**
  * **Google Cloud’s tools**

---

### Why It Matters

As **data sources diversify and grow**, and **cloud adoption expands**, robust data integration is essential for:

* Unified data views
* Real-time analytics
* Scalable and portable data architectures

---