## 🧭 From Business Need to Data Model – Decision Flow for Analysts & Engineers

When implementing a new data model, **Business Analysts** and **Data Engineers** need to collaborate closely. To ensure the model supports the business case, they should answer a structured set of questions together before stories and tasks can be confirmed and specified in Jira.
In this section **we go through the set of questions** that should be answered. For the **technical background, please refer to the previous sections**, where each technical concept is described at a basic level.

For more clarity on responsibilities and the overall flow, see: https://onetakeda.atlassian.net/wiki/spaces/GMSGQDIME/pages/6366625819/Business+Analysis+Step-by-Step

---

### 📦 1. What are the Data Sources?

| Question | Why It Matters |
|----------|----------------|
| Are we sourcing from **SAP**, **CSV**, **API**, **IoT**, or **External DBs**? | Affects ingestion method, latency, and transformations. |
| Is it **Structured** or **Unstructured** data, and does it contain LOBs (large objects)? | Determines the complexity of the solution. |
| Do we have **GxP/SOX** requirements or specific **data contracts** to follow? | Determines the compliance architecture. |
| Do we **already have an existing connection**? | Determines the effort needed to add a new resource. |

---

### 🔄 2. What Type of Data Refresh is Needed?

| Option | Description |
|--------|-------------|
| **Batch Processing** | Data updates periodically (e.g., daily or hourly). |
| **Streaming / Near Real-Time** | Required for use cases like monitoring, dashboards, alerts. Implemented with DLT + MSK (Kafka) or LakeFlow Connect. |
| **Is the underlying database/source prepared to provide the data?** | Especially relevant for streaming. Event logs should be available so data can be loaded incrementally, and the database should be able to handle the extra load. |
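
To make the idea of incremental loading concrete, here is a minimal sketch of a watermark-based incremental batch in plain Python. The row layout, the `updated_at` column, and the `incremental_batch` helper are hypothetical and only illustrate the principle; they are not the platform's actual pipeline code.

```python
from datetime import datetime

def incremental_batch(rows, last_watermark):
    """Return the rows changed after last_watermark, plus the new watermark."""
    changed = [r for r in rows if r["updated_at"] > last_watermark]
    # If nothing changed, the watermark stays where it was.
    new_watermark = max((r["updated_at"] for r in changed), default=last_watermark)
    return changed, new_watermark

# Hypothetical source rows with a last-modified timestamp.
source_rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 8, 0)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, 9, 30)},
    {"id": 3, "updated_at": datetime(2024, 1, 3, 7, 15)},
]

# Only rows modified after the previous run are picked up.
changed, watermark = incremental_batch(source_rows, datetime(2024, 1, 1, 23, 59))
```
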
---

### 🧩 2. What Kind of Data Model is Required?

| Question | Why It Matters |
|---------|----------------|
| Is this a **Star** or **Snowflake** schema? | Affects table structure and performance optimization. |
| Will the model include **dimensions** and **facts**? | Determines normalization and fact-to-dim joins. |
| Does it require **data from other domains or teams**? | Introduces dependency tracking, data availability concerns, and shared responsibility across teams. |
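
As a rough illustration of what a fact-to-dimension join means in a star schema, the sketch below aggregates a fact table by an attribute that lives in a dimension. All table names, keys, and values are made up for the example; real models would be implemented as tables, not Python dictionaries.

```python
# Hypothetical dimension table, keyed by a surrogate key.
dim_product = {
    1: {"product_name": "Product A", "category": "Category X"},
    2: {"product_name": "Product B", "category": "Category Y"},
}

# Hypothetical fact table: one row per measurement, referencing the dimension.
fact_batches = [
    {"product_key": 1, "quantity": 100},
    {"product_key": 2, "quantity": 40},
    {"product_key": 1, "quantity": 60},
]

def quantity_by_category(facts, dim):
    """Join each fact row to its dimension row and sum quantity per category."""
    totals = {}
    for row in facts:
        category = dim[row["product_key"]]["category"]
        totals[category] = totals.get(category, 0) + row["quantity"]
    return totals

totals = quantity_by_category(fact_batches, dim_product)
```

In a snowflake schema, `category` would typically sit in its own table referenced from the product dimension, which adds one more lookup to the same join.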

---

### ⚙️ 3. What Is the Load Strategy?

| Question | Why It Matters |
|----------|----------------|
| Do we need to track changes using **SCD Type 1**, **1.5**, or **2**? | Impacts schema design: SCD2 requires historical tracking with versioning or date ranges. |
| Does the source system support **CDC (Change Data Capture)**? | Required to detect changes for incremental loads, especially with SCD2. |
| Are **deletes** tracked or soft-deleted in the source? | Important for data retention logic and historical correctness. |
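
The difference between overwriting a value (SCD Type 1) and keeping history (SCD Type 2) can be sketched as follows. This is a simplified illustration with date ranges; the `apply_scd2` helper, the record layout, and the sample keys are assumptions made for the example, not the team's load framework.

```python
from datetime import date

def apply_scd2(history, key, new_attrs, effective):
    """Close the open version of `key` if its attributes changed, then append a new one.

    The `history` list is modified in place and also returned.
    """
    current = next(
        (r for r in history if r["key"] == key and r["valid_to"] is None), None
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return history  # nothing changed, keep the open version
        current["valid_to"] = effective  # close the old version
    history.append(
        {"key": key, "attrs": new_attrs, "valid_from": effective, "valid_to": None}
    )
    return history

history = []
apply_scd2(history, "SUP-1", {"site": "Vienna"}, date(2024, 1, 1))
apply_scd2(history, "SUP-1", {"site": "Vienna"}, date(2024, 2, 1))  # no change
apply_scd2(history, "SUP-1", {"site": "Zurich"}, date(2024, 3, 1))  # new version
```

Under SCD Type 1 the same change would simply overwrite `site`, and the fact that the record once pointed to Vienna would be lost.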

---

### 🧼 4. Does the Data Require Standardization, Quality Checks, or Access Control?

Standardization and compliance are critical, especially in regulated environments like pharma. This step ensures your data is **trusted**, **consistent**, and **secure**.

| Question | Why It Matters |
|----------|----------------|
| Does the data need **standardization via MDM mapping/lookup of golden values**? | Determines whether a mapping is needed, and whether an existing mapping can be reused or a new one must be implemented (which creates a dependency on MDM). |
| Are **business data quality checks** required (complex business rules besides simple technical checks)? | Prevents incorrect KPIs and flawed insights. Implemented via DQH Engine or pipeline expectations. |
| Does the data require **masking, obfuscation, or row-level filtering**? | Critical for PII, sensitive data, and regulatory compliance (e.g., GxP, GDPR). |
| Do we need to **tag or label tables/columns** for governance? | Enables lineage tracking, compliance audits, and discoverability in Unity Catalog. Implemented with One Data Catalog. |
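
To show what a business data quality rule and a masking step might look like in principle, here is a small plain-Python sketch. It does not use the DQH Engine or pipeline expectations API; the rule names, columns, and the `check_rows` and `mask` helpers are hypothetical, and the unsalted hash is for illustration only.

```python
import hashlib

def check_rows(rows, rules):
    """Split rows into passed and failed, recording which rule names failed."""
    passed, failed = [], []
    for row in rows:
        broken = [name for name, rule in rules.items() if not rule(row)]
        if broken:
            failed.append({"row": row, "failed_rules": broken})
        else:
            passed.append(row)
    return passed, failed

def mask(value):
    """Replace a sensitive value with a truncated SHA-256 digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

# Hypothetical business rules; ISO date strings compare correctly as text.
rules = {
    "quantity_positive": lambda r: r["quantity"] > 0,
    "release_after_production": lambda r: r["released_on"] >= r["produced_on"],
}

rows = [
    {"operator": "jane.doe", "quantity": 10,
     "produced_on": "2024-01-01", "released_on": "2024-01-05"},
    {"operator": "john.roe", "quantity": 0,
     "produced_on": "2024-01-03", "released_on": "2024-01-02"},
]

passed, failed = check_rows(rows, rules)
# Mask the sensitive column only on rows that passed the checks.
masked = [{**r, "operator": mask(r["operator"])} for r in passed]
```

Agreeing on rules in this form (a name plus a condition per row) early on makes it easier for BAs to specify them and for DEs to implement them in whichever engine is used.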



---

### 📊 5. Will the Data Be Used for Dashboarding or API Consumption?

| Question | Why It Matters |
|----------|----------------|
| Is the data consumed by **dashboards (e.g., Power BI, Tableau)**? | Requires fast aggregations, clean joins, and user-friendly fields. |
| Will it serve **API endpoints or data products**? | Requires stable schemas, consistent latency, and simplified result sets within the mart layer. |
| Is the data **aggregated or raw**? | Affects performance, cost, and how modeling logic is applied. |

---

### 🗺️ 6. Documentation (STTM: Source-to-Target Mapping)

| Concept | Why It Matters |
|----------|----------------|
| Functional understanding of column values | To work with the data effectively, the DE should be informed about the purpose each column serves. |
| Naming convention, documentation | Before a data engineer can build a pipeline, the business logic (transformations, sourcing) needs to be documented by BAs. For more information, see: https://onetakeda.atlassian.net/wiki/spaces/GMSGQDIME/pages/6125487467/Databricks+Naming+Standardization+App+-+User+Guide |
| Original table analysis (types, naming) | The current process and helper application are described here: https://onetakeda.atlassian.net/wiki/spaces/GMSGQDIME/pages/5980553844/Databricks+Data+Profiler+App+-+User+Guide |
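
One way to see why a source-to-target mapping must exist before the pipeline is built: the mapping is effectively the input to the transformation. The sketch below treats a tiny STTM as data and applies it to one source row. The column names, target names, and the `apply_sttm` helper are illustrative assumptions, not entries from an actual STTM document.

```python
from datetime import date

# Hypothetical STTM entries: source column -> (target column, cast function).
STTM = {
    "MATNR": ("material_number", str),
    "MENGE": ("quantity", float),
    "BUDAT": ("posting_date", date.fromisoformat),
}

def apply_sttm(source_row, mapping):
    """Rename and cast mapped columns; unmapped source columns are dropped."""
    return {
        target: cast(source_row[source])
        for source, (target, cast) in mapping.items()
    }

target_row = apply_sttm(
    {"MATNR": "000123", "MENGE": "12.5", "BUDAT": "2024-03-01", "EXTRA": "x"},
    STTM,
)
```

If a column's purpose or type is missing from the mapping, the engineer has to guess the cast and the target name, which is exactly the ambiguity the documentation step is meant to remove.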


---

#### Data Domain Products specification documentation:
- https://mytakeda.sharepoint.com/:x:/s/GMSGQDataAnalyticsDigital/EeU0gx9VXaNLm1XollWf97YBUb4OUFjO8TXeIDdWbQCwnA?e=hrmwhx&CID=A4338547-F53A-4070-9142-E61871CF3AA0&wdLOR=c4ED83762-7CAC-4108-B109-7D016AD1BF86

---

📌 Once these questions are answered and the answers are aligned, the engineering team can confidently proceed with implementation using the right modeling approach, tools (DLT, Unity Catalog, etc.), and schedule.
