# Data Integration

- Data integration is a data preprocessing technique that combines data from multiple heterogeneous sources into a coherent data store and provides a single, unified view of the data.
- It can involve cleaning and transforming the data, as well as resolving any inconsistencies or conflicts that exist between the different sources.
- The goal of data integration is to make the data more useful and meaningful for analysis and decision making.
- Techniques used in data integration include data warehousing, ETL processes, and data federation.

A data integration approach is formally defined as a triple `<G, S, M>`, where:
- G stands for the global schema,
- S stands for the heterogeneous source schemas,
- M stands for the mapping between queries over the source schemas and the global schema.
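
As a rough illustration of the triple, the sketch below represents a global schema, two source schemas, and a mapping between them as plain Python data. All source, relation, and attribute names (`crm`, `billing`, `customer`, and so on) are hypothetical, and the mapping is simplified to attribute renaming rather than full query mapping.

```python
# G: the global schema that users query.
# Attribute names are hypothetical, for illustration only.
G = {"customer": ["id", "name", "city"]}

# S: the schemas of the heterogeneous sources.
S = {
    "crm": {"clients": ["client_id", "full_name", "town"]},
    "billing": {"accounts": ["acct_no", "holder", "city"]},
}

# M: for each source relation, how its attributes map to the global schema.
M = {
    ("crm", "clients"): {"client_id": "id", "full_name": "name", "town": "city"},
    ("billing", "accounts"): {"acct_no": "id", "holder": "name", "city": "city"},
}


def to_global(source, relation, row):
    """Rename the attributes of one source row to global-schema names."""
    mapping = M[(source, relation)]
    return {mapping[attr]: value for attr, value in row.items()}


crm_row = {"client_id": 7, "full_name": "Asha Rao", "town": "Pune"}
global_row = to_global("crm", "clients", crm_row)
print(global_row)
```

Rows from either source end up with the same attribute names, which is what lets a single query over G be answered from several sources.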

There are two major approaches to data integration: the “tight coupling” approach and the “loose coupling” approach.

- <b>Tight Coupling:</b> 

This approach creates a centralized repository, a data warehouse, to store the integrated data. Data is extracted from the various sources, transformed, and loaded into a single physical location through the process of ETL (Extraction, Transformation, and Loading). Integration happens at a high level, such as the level of an entire dataset or schema, and the data warehouse is treated as the information retrieval component. This approach is also known as data warehousing. It promotes data consistency and integrity, but it can be inflexible and difficult to change or update.

- <b>Loose Coupling:</b>  

This approach integrates data at a low level, such as the level of individual data elements or records, without creating a central repository or data warehouse. An interface takes the query from the user, transforms it into a form the source databases can understand, and then sends it directly to the source databases to obtain the result. The data remains only in the source databases. This approach is also known as data federation. It offers flexibility and easy updates, but it can be difficult to maintain consistency and integrity across multiple data sources.
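
The loose coupling idea can be sketched with two in-memory SQLite databases standing in for independent sources. This is a minimal sketch, not a real federation engine: the table names, column names, and the `total_revenue` helper are all hypothetical.

```python
import sqlite3

# Two independent in-memory databases stand in for the source systems.
# Table and column names are hypothetical.
store_a = sqlite3.connect(":memory:")
store_a.execute("CREATE TABLE sales (item TEXT, amount REAL)")
store_a.executemany("INSERT INTO sales VALUES (?, ?)",
                    [("shirt", 20.0), ("hat", 5.0)])

store_b = sqlite3.connect(":memory:")
store_b.execute("CREATE TABLE orders (product TEXT, total REAL)")
store_b.executemany("INSERT INTO orders VALUES (?, ?)",
                    [("shirt", 30.0), ("scarf", 12.0)])

# For each source: the connection plus the query, rewritten in that
# source's own vocabulary, that answers "total revenue for one item".
SOURCES = [
    (store_a, "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE item = ?"),
    (store_b, "SELECT COALESCE(SUM(total), 0) FROM orders WHERE product = ?"),
]


def total_revenue(item):
    """Answer a global query by querying every source at request time."""
    total = 0.0
    for conn, query in SOURCES:
        (partial,) = conn.execute(query, (item,)).fetchone()
        total += partial
    return total


print(total_revenue("shirt"))  # 50.0
```

No data is copied into a central store: each call goes back to the sources, so results reflect their current contents, at the cost of querying every source on every request.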

## Issues in data integration

There are several issues that can arise when integrating data from multiple sources, including:

- <b>Data Quality:</b> Inconsistencies and errors in the data can make it difficult to combine and analyze.
- <b>Data Semantics:</b> Different sources may use different terms or definitions for the same data, making it difficult to combine and understand the data.
- <b>Data Heterogeneity:</b> Different sources may use different data formats, structures, or schemas, making it difficult to combine and analyze the data.
- <b>Data Privacy and Security:</b> Protecting sensitive information and maintaining security can be difficult when integrating data from multiple sources.
- <b>Scalability:</b> Integrating large amounts of data from multiple sources can be computationally expensive and time-consuming.
- <b>Data Governance:</b> Managing and maintaining the integration of data from multiple sources can be difficult, especially when it comes to ensuring data accuracy, consistency, and timeliness.
- <b>Performance:</b> Integrating data from multiple sources can also affect the performance of the system.
- <b>Integration with existing systems:</b> Integrating new data sources with existing systems can be a complex task, requiring significant effort and resources.
- <b>Complexity:</b> The complexity of integrating data from multiple sources can be high, requiring specialized skills and knowledge.


Three issues deserve particular attention during data integration: schema integration, redundancy detection, and resolution of data value conflicts. Each is explained briefly below.

1. Schema Integration: 

    Integrate metadata from different sources.
    Matching equivalent real-world entities across multiple sources is referred to as the entity identification problem.

2. Redundancy Detection: 

    An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes.
    Inconsistencies in attribute naming can also cause redundancies in the resulting data set.
    Some redundancies can be detected by correlation analysis.

3. Resolution of data value conflicts: 

    This is the third critical issue in data integration.
    Attribute values from different sources may differ for the same real-world entity.
    An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in another.
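
Redundancy detection by correlation analysis can be sketched as follows. The data, the attribute names, and the 0.95 threshold are assumptions made for illustration. A high correlation only suggests that one attribute may be derivable from another and does not prove it.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical integrated table: monthly and annual salary came from
# different sources, but one is derivable from the other.
monthly_salary = [3000, 4200, 5100, 2800, 6100]
annual_salary = [m * 12 for m in monthly_salary]
age = [41, 29, 35, 52, 33]

r_redundant = pearson(monthly_salary, annual_salary)
r_unrelated = pearson(monthly_salary, age)

THRESHOLD = 0.95  # assumed cut-off; tune for real data
print(abs(r_redundant) > THRESHOLD)  # True: candidate redundant attribute
print(abs(r_unrelated) > THRESHOLD)  # False
```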


# Extract, Transform, Load (ETL)

ETL, which stands for “extract, transform, load,” is a set of three processes that move data from various sources to a unified repository, typically a data warehouse. It prepares data for analysis and business intelligence processes so that the analysis can provide actionable business information.

ETL is a part of data engineering, the discipline of making data ready for consumption across multiple systems and tools. Data engineering involves ingesting, transforming, delivering, and sharing data for analysis. These fundamental tasks are completed via data pipelines that automate the process in a repeatable way. A data pipeline is a set of data-processing elements that move data from source to destination, and often from one format (raw) to another (analytics-ready).

## Purpose of ETL

ETL allows businesses to consolidate data from multiple databases and other sources into a single repository with data that has been properly formatted and qualified in preparation for analysis. This unified data repository allows for simplified access for analysis and additional processing. It also provides a single source of truth, ensuring that all enterprise data is consistent and up-to-date.

## ETL Process

There are three distinct processes in extract, transform, load. These are:

<b>Extraction</b>, in which raw data is pulled from one or more sources. Data could come from transactional applications, such as customer relationship management (CRM) data from Salesforce or enterprise resource planning (ERP) data from SAP, or from Internet of Things (IoT) sensors that gather readings from a production line or factory floor operation, for example. To create a data warehouse, extraction typically involves combining data from these various sources into a single data set and then validating it, with invalid data flagged or removed. Extracted data may come in several formats, such as relational tables, XML, JSON, and others.

<b>Transformation</b>, in which data is updated to match the needs of an organization and the requirements of its data storage solution. Transformation can involve standardizing (converting all data types to the same format), cleansing (resolving inconsistencies and inaccuracies), mapping (combining data elements from two or more data models), augmenting (pulling in data from other sources), and others. During this process, rules and functions are applied and the data is cleansed so that bad or non-matching data does not reach the destination repository. Rules that could be applied include loading only specific columns, deduplicating, and merging, among others.

<b>Loading</b>, in which data is delivered and secured for sharing, making business-ready data available to other users and departments, both within the organization and externally. This process may include overwriting the destination’s existing data.
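
The three processes can be sketched end to end with the Python standard library. This is a minimal sketch under stated assumptions: the CSV content, the column names, and the `orders` destination table are invented for illustration, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or API.
RAW_CSV = """order_id,customer,amount,currency
1, Alice ,10.50,usd
2,Bob,,usd
3, Alice ,4.25,USD
3, Alice ,4.25,USD
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop invalid rows, standardize values, deduplicate."""
    seen = set()
    clean = []
    for row in rows:
        if not row["amount"]:          # validation: amount is required
            continue
        order_id = int(row["order_id"])
        if order_id in seen:           # deduplicate on order_id
            continue
        seen.add(order_id)
        clean.append((order_id,
                      row["customer"].strip(),
                      float(row["amount"]),
                      row["currency"].upper()))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders "
                 "(order_id INTEGER PRIMARY KEY, customer TEXT, "
                 "amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(RAW_CSV)))
loaded = warehouse.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the description above and makes each one testable on its own.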

## ETL Tools

ETL tools automate the extraction, transformation, and loading processes, consolidating data from multiple data sources or databases. These tools may have data profiling, data cleansing, and metadata-writing capabilities. A tool should be secure, easy to use and maintain, and compatible with all components of an organization’s existing data solutions.

# Data Modeling

- Data modelling is a fundamental component that facilitates the organisation, structuring, and interpretation of complicated datasets by analysts.

## What is Data Modeling?

Data modelling in analysis is the process of creating a visual representation, or abstraction, of the data structures, relationships, and rules within a system or organization. The term also covers determining and analysing the data requirements needed to support business activities within the bounds of an organisation's information systems.

The main objective of data modelling is to provide a precise and well-organized framework for data organisation and representation, since it enables efficient analysis and decision-making. Analysts can discover trends, understand the connections between various data items, and make sure that data is efficiently and accurately stored by building models.

## What is a Data Model?

Data models are visual representations of an enterprise’s data elements and the connections between them. Models help to define and arrange data in the context of key business processes, which facilitates the creation of successful information systems. They let business and technical personnel collaborate on how data will be kept, accessed, shared, updated, and utilised within an organisation.

### Types of Data Models

There are three main types of data models:

- Conceptual Data Model: A conceptual data model is an abstract, high-level representation of business concepts and structures. It is most commonly used at the start of a new project, when working through high-level concepts and preliminary requirements, and is typically developed as an alternative or a prelude to the logical data model that comes later. Its main purpose is to organize and define business problems, rules, and concepts. For instance, it helps business people view data such as market data, customer data, and purchase data.

- Logical Data Model: The logical data model expands on the conceptual model by offering a thorough representation of the data at a logical level. It outlines the tables, columns, relationships, and constraints that make up the data structure. Although a logical data model does not depend on any particular database management system (DBMS), it is closer to how the data would be implemented in a database. It forms the basis for the physical design of the database.

- Physical Data Model: The physical data model describes the implementation with reference to a particular database system. It outlines every component and service needed to build the database and is expressed in the database language and its queries. Every table, column, and constraint (such as primary key, foreign key, and NOT NULL) is represented in it. Developers and database administrators (DBAs) design this model, and its primary task is the creation of the database: it guides the creation of the schema, provides an abstraction of the database, and makes use of the constraints, keys, and other features of the target RDBMS.
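
To make the physical level concrete, the sketch below implements a tiny model in SQLite, chosen here only because it ships with Python. The `customer` and `purchase` entities and their columns are hypothetical. It shows primary key, foreign key, and NOT NULL constraints being declared and then enforced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Hypothetical entities: a customer places many purchases.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE purchase (
    purchase_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL NOT NULL CHECK (amount >= 0)
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Asha')")
conn.execute("INSERT INTO purchase VALUES (10, 1, 99.0)")

# The constraints reject a purchase that points at a missing customer.
try:
    conn.execute("INSERT INTO purchase VALUES (11, 42, 5.0)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True

print(fk_enforced)  # True
```

The same logical model would need different DDL on another DBMS, which is the sense in which the physical model is tied to a particular database system.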

## Data Modeling Process

The practice of conceptually representing data items and their connections to one another is known as data modelling. Data modellers collaborate with stakeholders at each stage of the process to define entities and attributes, establish relationships between data objects, and create models that accurately represent the data in a format that can be consumed by applications. These stakeholders may include developers, database administrators, and other interested parties. The data modelling steps are:

- Identifying data sources: The first stage is to identify and investigate the different sources of data both inside and outside the company. It's critical to comprehend the sources of the data and how various sources add to the information as a whole. Determining the sources of data is essential since it guarantees a thorough framework for data modelling. It assists in gathering all pertinent data, setting the stage for a precise and comprehensive depiction of the data landscape.

- Defining Entities and Attributes: This stage is about identifying the entities (items or ideas) and the attributes that go along with them. Entities constitute the subject matter of the data, whereas attributes specify the particular qualities of each entity. The definition of entities and attributes is the foundation of data modelling. It offers an orderly and transparent framework, which is necessary to understand the characteristics of the data and create a useful model.

- Mapping Relationships: Relationships show the connections or associations between entities. Relationship mapping entails locating and characterising these links, indicating the nature and cardinality of every relationship. Understanding relationships is essential for capturing the interdependencies within the data. It improves the correctness of the model by reflecting the relationships between data elements that exist in the real world.

- Choosing a Model Type: The right data model type is selected based on the project needs and data properties. This decision may involve choosing between conceptual, logical, or physical models, or going with a particular model such as relational or object-oriented. The model type determines the degree of abstraction and detail in the representation. Choosing it deliberately keeps the model aligned with project objectives and appropriate for the kind of data involved.

- Implementing and Maintaining: Implementation converts a logical or physical data model into a database schema. This entails generating tables, establishing constraints, and adding database-specific information. Maintenance means updating the model to account for shifting technological or business needs. Implementation turns the theoretical model into a usable database, and regular maintenance keeps the model current and accurate as the requirements of the company change.

## Types of Data Modeling

Data models can also be classified by how they structure data. Five common types are:

- Hierarchical Model: The hierarchical model has a tree-like structure. There is a single root (parent) node, and the remaining child nodes are arranged beneath it in a particular order. Real-world relationships can be modelled this way, although the hierarchical approach is no longer widely used. For example, a college has many courses, professors, and students, so the college becomes the parent and the professors and students become its children.


- Relational Model: The relational model represents data as rows and columns in tables and represents the links between those tables. It is frequently used in database design and is closely tied to relational database management systems (RDBMS).

- Object-Oriented Data Model: In this model, data is represented as objects with stored values, similar to the objects used in object-oriented programming. In addition to allowing data abstraction, inheritance, and encapsulation, the object-oriented model facilitates communication.

- Network Model: The network model offers a flexible way to represent objects and the relationships among them. Its schema is a graph representation of the data: an object is stored within a node, and the relationship between two objects is represented as an edge. This allows a record to have multiple parent and child records.

- ER Model: The entity-relationship model (ER model) is a high-level conceptual data model used to specify the data elements and the relationships between the entities in a system. This conceptual design gives an easier-to-understand view of the data. The model uses an entity-relationship diagram, made up of entities, attributes, and relationships, to depict the whole database.

A relationship between entities is called an association. The mapping cardinality of an association can be:

- one to one
- one to many
- many to one
- many to many
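
The four cardinalities can be illustrated by classifying a set of observed associations. In this sketch each association is an `(a, b)` pair linking an entity from set A to one from set B. The function name and the sample entities are hypothetical, and the result describes the given data only, not a constraint that a schema would enforce.

```python
from collections import defaultdict


def mapping_cardinality(pairs):
    """Classify a collection of (a, b) associations between sets A and B."""
    b_per_a = defaultdict(set)
    a_per_b = defaultdict(set)
    for a, b in pairs:
        b_per_a[a].add(b)
        a_per_b[b].add(a)
    a_has_many = any(len(bs) > 1 for bs in b_per_a.values())
    b_has_many = any(len(as_) > 1 for as_ in a_per_b.values())
    if a_has_many and b_has_many:
        return "many to many"
    if a_has_many:
        return "one to many"   # one A is linked to several B
    if b_has_many:
        return "many to one"   # several A are linked to one B
    return "one to one"


person_passport = [("p1", "pass1"), ("p2", "pass2")]
department_employee = [("cs", "ann"), ("cs", "bob"), ("ee", "cy")]
student_course = [("s1", "c1"), ("s1", "c2"), ("s2", "c1")]

print(mapping_cardinality(person_passport))      # one to one
print(mapping_cardinality(department_employee))  # one to many
print(mapping_cardinality(student_course))       # many to many
```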

## Benefits of Data Modeling

In order to organise and structure data and provide database design clarity, data modelling is essential. It acts as a common language, promoting efficient stakeholder communication. It directs the best database architecture for effective data storage and retrieval through visual representation.

- Visualizes complex data structures, providing a clear roadmap for understanding relationships.
- Acts as a universal language, fostering effective communication between business and technical stakeholders.
- Creates organized databases by defining entities, properties, and relationships.
- Enhances data quality and integrity by reducing anomalies and redundancy through normalization.
- Minimizes errors in database and application development.
- Ensures consistency in documentation and system designs across the organization.
- Improves database and application performance.
- Facilitates quick correlation of data across the company.
- Strengthens communication between business intelligence and development teams.

## Conclusion

In conclusion, data modelling is an essential component of data analysis that offers a methodical way to arrange and comprehend complex data. By following the steps of the process described above, analysts can create reliable models that improve insights and decision-making.

# Data Warehousing

A Database Management System (DBMS) stores data in the form of tables, is typically designed using an ER model, and aims to guarantee the ACID properties for transactions. For example, the DBMS of a college has tables for students, faculty, etc.

A data warehouse is separate from the operational DBMS. It stores a huge amount of data, typically collected from multiple heterogeneous sources such as files and DBMSs. The goal is to produce statistical results that may help in decision-making. For example, a college might want to quickly see how the placement of CS students has improved over the last 10 years in terms of salaries, counts, etc.

## Issues That Arise While Building the Warehouse

1. When and how to gather data:
    - In a source-driven architecture for gathering data, the data sources transmit new information, either continually (as transaction processing takes place), or periodically (nightly, for example). In a destination-driven architecture, the data warehouse periodically sends requests for new data to the sources. Unless updates at the sources are replicated at the warehouse via two-phase commit, the warehouse will never be quite up to date with the sources. Two-phase commit is usually far too expensive to be an option, so data warehouses typically have slightly out-of-date data. That, however, is usually not a problem for decision-support systems.
      
2. What schema to use:
    - Data sources that have been constructed independently are likely to have different schemas. In fact, they may even use different data models. Part of the task of a warehouse is to perform schema integration, and to convert data to the integrated schema before they are stored. As a result, the data stored in the warehouse are not just a copy of the data at the sources. Instead, they can be thought of as a materialized view of the data at the sources.
      
3. Data transformation and cleansing:
    - The task of correcting and preprocessing data is called data cleansing. Data sources often deliver data with numerous minor inconsistencies, which can be corrected. For example, names are often misspelled, and addresses may have street, area, or city names misspelled, or postal codes entered incorrectly. These can be corrected to a reasonable extent by consulting a database of street names and postal codes in each city. The approximate matching of data required for this task is referred to as fuzzy lookup.
      
4. How to propagate updates:
    - Updates on relations at the data sources must be propagated to the data warehouse. If the relations at the data warehouse are exactly the same as those at the data source, the propagation is straightforward. If they are not, the problem of propagating updates is basically the view-maintenance problem.
      
5. What data to summarize:
    - The raw data generated by a transaction-processing system may be too large to store online. However, we can answer many queries by maintaining just summary data obtained by aggregation on a relation, rather than maintaining the entire relation. For example, instead of storing data about every sale of clothing, we can store total sales of clothing by item name and category.
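
Two of the points above, fuzzy lookup during cleansing and keeping only summary data, can be sketched with the standard library. The city list, the sales rows, and the 0.8 similarity cutoff are assumptions for illustration. `difflib.get_close_matches` is used here as one simple form of approximate matching, not as the technique any particular warehouse product uses.

```python
import difflib
from collections import defaultdict

# Hypothetical reference list of valid city names.
KNOWN_CITIES = ["Mumbai", "Chennai", "Kolkata", "Bengaluru"]


def fuzzy_city(name, cutoff=0.8):
    """Return the closest known city, or None if nothing is close enough."""
    matches = difflib.get_close_matches(name, KNOWN_CITIES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def summarize(sales):
    """Total sales amount per (item_name, category)."""
    totals = defaultdict(float)
    for item_name, category, amount in sales:
        totals[(item_name, category)] += amount
    return dict(totals)


print(fuzzy_city("Mumbay"))   # Mumbai
print(fuzzy_city("Chenai"))   # Chennai
print(fuzzy_city("Paris"))    # None

# Hypothetical raw sales rows: (item name, category, amount).
sales = [
    ("t-shirt", "menswear", 15.0),
    ("t-shirt", "menswear", 15.0),
    ("scarf", "accessories", 9.5),
]
summary = summarize(sales)
print(summary)
```

The summary holds one row per item name and category instead of one row per sale, which is why it can answer total-sales queries without the full relation.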

## Need for Data Warehouse 

An ordinary operational database typically holds megabytes to gigabytes of data for a specific purpose. When data grows to terabytes, it is usually moved to a data warehouse. Besides this, a transactional database does not lend itself to analytics. To perform analytics effectively, an organization keeps a central data warehouse to closely study its business by organizing, understanding, and using its historical data for making strategic decisions and analyzing trends.

## Benefits of Data Warehouse

- Better business analytics: A data warehouse stores all of a company's past data and records in one place, which deepens the company's understanding and analysis of its data.

- Faster queries: A data warehouse is designed to handle large analytical queries, so it typically runs them faster than an operational database.

- Improved data quality: The warehouse stores and analyzes the data gathered from different sources without altering it or adding data of its own, so data quality is maintained. Any data quality issues that do arise can be resolved by the data warehouse team.

- Historical insight: The warehouse stores historical data about the business, so it can be analyzed at any time to extract insights.

## Example Applications of Data Warehousing 

Data warehousing can be applied wherever there is a huge amount of data and a need for statistical results that help in decision making.

- Social Media Websites: Social networking websites like Facebook, Twitter, LinkedIn, etc. are based on analyzing large data sets. These sites gather data related to members, groups, locations, etc., and store it in a single central repository. Because of the large amount of data, a data warehouse is needed to implement this.
- Banking: Most banks these days use warehouses to see the spending patterns of account holders and cardholders. They use this to provide them with special offers, deals, etc.
- Government: Governments use data warehouses to store and analyze tax payments, which helps detect tax evasion.

## Features of Data Warehousing

Data warehousing is essential for modern data management, providing a strong foundation for organizations to consolidate and analyze data strategically. Its distinguishing features empower businesses with the tools to make informed decisions and extract valuable insights from their data.

- Centralized Data Repository: Data warehousing provides a centralized repository for all enterprise data from various sources, such as transactional databases, operational systems, and external sources. This enables organizations to have a comprehensive view of their data, which can help in making informed business decisions.
- Data Integration: Data warehousing integrates data from different sources into a single, unified view, which can help in eliminating data silos and reducing data inconsistencies.
- Historical Data Storage: Data warehousing stores historical data, which enables organizations to analyze data trends over time. This can help in identifying patterns and anomalies in the data, which can be used to improve business performance.
- Query and Analysis: Data warehousing provides powerful query and analysis capabilities that enable users to explore and analyze data in different ways. This can help in identifying patterns and trends, and can also help in making informed business decisions.
- Data Transformation: Data warehousing includes a process of data transformation, which involves cleaning, filtering, and formatting data from various sources to make it consistent and usable. This can help in improving data quality and reducing data inconsistencies.
- Data Mining: Data warehousing provides data mining capabilities, which enable organizations to discover hidden patterns and relationships in their data. This can help in identifying new opportunities, predicting future trends, and mitigating risks.
- Data Security: Data warehousing provides robust data security features, such as access controls, data encryption, and data backups, which ensure that the data is secure and protected from unauthorized access.

### Advantages of Data Warehousing

- Intelligent Decision-Making: With centralized data in warehouses, decisions may be made more quickly and intelligently.
- Business Intelligence: Provides strong operational insights through business intelligence.
- Historical Analysis: Predictions and trend analysis are made easier by storing past data.
- Data Quality: Helps ensure data quality and consistency for trustworthy reporting.
- Scalability: Capable of managing massive data volumes and expanding to meet changing requirements.
- Effective Queries: Fast and effective data retrieval is made possible by an optimized structure.
- Cost reductions: Data warehousing can result in cost savings over time by reducing data management procedures and increasing overall efficiency, even when there are setup costs initially.
- Data security: Data warehouses employ security protocols to safeguard confidential information, guaranteeing that only authorized personnel are granted access to certain data.

### Disadvantages of Data Warehousing

- Cost: Building a data warehouse can be expensive, requiring significant investments in hardware, software, and personnel.
- Complexity: Data warehousing can be complex, and businesses may need to hire specialized personnel to manage the system.
- Time-consuming: Building a data warehouse can take a significant amount of time, requiring businesses to be patient and committed to the process.
- Data integration challenges: Data from different sources can be challenging to integrate, requiring significant effort to ensure consistency and accuracy.
- Data security: Data warehousing can pose data security risks, and businesses must take measures to protect sensitive data from unauthorized access or breaches.


# Data Streaming

- Data streaming is the continuous transfer of data at a high rate.
- Many data streams collect data from thousands of data sources at the same time.
- A data stream typically sends large clusters of small data records simultaneously.

- Batch processing has traditionally been the primary method of data processing, where large data volumes are processed at fixed intervals. While batch processing may have some advantages in loading especially large data sets during time windows where resources have been freed up, there are usually long idle periods between data batches. This can impact data timeliness, especially with high-volume Web and IoT (Internet of Things) data sets.

- Data streams work particularly well when the goal is to detect data patterns for temporal events, such as web engagement, eCommerce transactions, instrument telemetry, or geolocation and traffic monitoring. Streamed data is used for real-time data aggregation, sampling, and filtering, allowing analysts to access data instantly and gather actionable insights or make adjustments on the fly.

- Data stream analysis provides organizations with visibility into a wide range of customer and business activity, including website behavior; employee, device, equipment, and goods geo-location; and metering or billing data. With this visibility, businesses can quickly react to changes in customer sentiment in eCommerce or address delivery, equipment, or supply chain issues in a timely manner.
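
Real-time aggregation over a stream can be sketched with a generator and a fixed-size window. This is a minimal sketch: the `sensor_readings` source and its values are made up, and a real stream would be unbounded rather than a short list.

```python
from collections import deque


def sliding_average(stream, window_size):
    """Yield the mean of the most recent `window_size` readings."""
    window = deque(maxlen=window_size)
    for reading in stream:
        window.append(reading)   # the oldest reading drops out automatically
        yield sum(window) / len(window)


def sensor_readings():
    """Stand-in for an unbounded source; values are made up."""
    yield from [20.0, 22.0, 21.0, 35.0, 36.0]


averages = list(sliding_average(sensor_readings(), window_size=3))
print(averages)
```

Each result is available as soon as its reading arrives, in contrast to batch processing, where nothing is produced until the whole batch has been collected.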

# Data Transformation

- Data transformation is the process of converting data from one format to another, generally to meet the requirements of the destination platform.
- Companies are collecting data from a multitude of disparate sources, requiring data transformation to enable compatibility. 
- Once converted data is at hand, it's essential to follow data warehousing best practices to protect your data.
- To best understand data transformation, it's important to appreciate the process and the motivations behind it.
- Data transformation entails a variety of procedures.
- Many people immediately consider data conversion, which encompasses data extraction and scrubbing.
- It might also include cleansing data and aggregating data.
- The overall process of data transformation sets out to make data compatible.
- The ETL (Extract, Transform, and Load) model is generally relied upon as an efficient means of data transformation. Snowflake is an example of a data warehouse that can natively support semi-structured data alongside relational (structured) data.
- Data transformation may occur when data is being moved or when various data types need to be analyzed together. It also happens when information is being added to existing data sets, and when users want to aggregate data from multiple data sets.
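
As a small illustration of making data compatible, the sketch below converts records from two sources into one common shape before aggregating them. The record layouts, field names, and the assumption that the second source writes dates as day/month/year and amounts in cents are all hypothetical.

```python
from datetime import datetime

# Hypothetical records from two sources with different shapes and formats.
source_a = [{"id": 1, "signup": "2024-03-05", "spend_usd": "12.50"}]
source_b = [{"user": {"id": 2}, "signup": "05/03/2024", "spend_cents": 990}]


def from_a(rec):
    """Source A: ISO dates, amounts stored as strings in dollars."""
    return {
        "id": rec["id"],
        "signup": datetime.strptime(rec["signup"], "%Y-%m-%d").date(),
        "spend_usd": float(rec["spend_usd"]),
    }


def from_b(rec):
    """Source B: nested id, day/month/year dates, amounts in cents."""
    return {
        "id": rec["user"]["id"],                        # flatten nested field
        "signup": datetime.strptime(rec["signup"], "%d/%m/%Y").date(),
        "spend_usd": rec["spend_cents"] / 100,          # unify the unit
    }


unified = [from_a(r) for r in source_a] + [from_b(r) for r in source_b]
total_spend = sum(r["spend_usd"] for r in unified)     # aggregate across sources
print(unified)
print(total_spend)
```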

# Data Pipeline

- A data pipeline is a means of moving data from one place to a destination (such as a data warehouse) while simultaneously optimizing and transforming the data. As a result, the data arrives in a state that can be analyzed and used to develop business insights.
- A data pipeline is essentially the set of steps involved in aggregating, organizing, and moving data. Modern data pipelines automate many of the manual steps involved in transforming and optimizing continuous data loads. Typically, this includes loading raw data into a staging table for interim storage and then transforming it before ultimately inserting it into the destination reporting tables.

## Benefits of Data Pipeline
Data pipelines eliminate most manual steps from the process and enable a smooth, automated flow of data from one stage to another. They are essential for real-time analytics that support faster, data-driven decisions. They are especially important if the organization:

- Relies on real-time data analysis
- Stores data in the cloud
- Houses data in multiple sources

By consolidating data from various silos into a single source of truth, a data pipeline helps ensure consistent data quality and enables quick data analysis for business insights.

## Elements in data pipeline
Data pipelines consist of three essential elements: a source or sources, processing steps, and a destination.
1. Sources

 - Sources are where data comes from. Common sources include relational database management systems like MySQL, CRMs such as Salesforce and HubSpot, ERPs like SAP and Oracle, social media management tools, and even IoT device sensors.
2. Processing steps

 - In general, data is extracted data from sources, manipulated and changed according to business needs, and then deposited at its destination. Common processing steps include transformation, augmentation, filtering, grouping, and aggregation.
3. Destination

 - A destination is where the data arrives at the end of its processing, typically a data lake or data warehouse for analysis.
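
The three elements can be sketched as plain Python functions chained together. The sensor readings, the valid temperature range, and the function names below are hypothetical, and a dict stands in for the real destination.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def source() -> Iterator[dict]:
    """Source: yields hypothetical sensor readings."""
    yield {"sensor": "a", "temp_c": 21.0}
    yield {"sensor": "b", "temp_c": -999.0}  # sentinel for a failed reading
    yield {"sensor": "a", "temp_c": 23.0}
    yield {"sensor": "b", "temp_c": 18.0}


def drop_invalid(readings: Iterable[dict]) -> Iterator[dict]:
    """Processing step (filtering): discard readings outside a plausible range."""
    return (r for r in readings if -50.0 <= r["temp_c"] <= 60.0)


def average_by_sensor(readings: Iterable[dict]) -> dict:
    """Processing steps (grouping and aggregation): mean temperature per sensor."""
    groups = defaultdict(list)
    for r in readings:
        groups[r["sensor"]].append(r["temp_c"])
    return {sensor: sum(vals) / len(vals) for sensor, vals in groups.items()}


# Destination: a plain dict standing in for a data lake or warehouse table.
destination = average_by_sensor(drop_invalid(source()))
print(destination)  # {'a': 22.0, 'b': 18.0}
```

Because each step only consumes and produces an iterable, steps can be added, removed, or reordered without touching the source or the destination.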


## Data Pipeline versus ETL

- Extract, transform, and load (ETL) systems are a kind of data pipeline in that they move data from a source, transform the data, and then load the data into a destination.
- However, ETL is usually just one sub-process within a larger pipeline.
- Depending on the nature of the pipeline, ETL may be automated or may not be included at all.
- On the other hand, a data pipeline is broader in that it is the entire process involved in transporting data from one location to another.

## Characteristics of Data Pipeline

- Continuous and extensible data processing
- The elasticity and agility of the cloud
- Isolated and independent resources for data processing
- Democratized data access and self-service management
- High availability and disaster recovery

## In Cloud

- Modern data pipelines can provide many benefits to your business, including easier access to insights and information, speedier decision-making, and the flexibility and agility to handle peak demand. Cloud-based data pipelines can draw on elastic resources, often at a lower price point than traditional solutions. Like an assembly line for data, a pipeline sends data through various filters, apps, and APIs, ultimately depositing it at its final destination in a usable state. Cloud-based pipelines offer agile provisioning when demand spikes, eliminate access barriers to shared data, and enable quick deployment across the entire business.

# Business Intelligence 

- Business Intelligence (BI) is a technology-driven data analysis process that helps an organization's executives, managers and workers make informed business decisions. As part of the BI process, relevant data is collected and prepared for analysis. Queries are run against the data and the analytics results are used to support operational decision-making and strategic planning.
- The ultimate goal of BI initiatives is to drive better business decisions that enable organizations to increase revenue, improve operational efficiency and gain competitive advantages over business rivals. To achieve that goal, BI incorporates a combination of analytics, data visualization and reporting tools, plus various methodologies for managing and analyzing data.

## How does Business Intelligence process work?

Business intelligence initiatives uncover actionable information for use by senior executives, business managers and operational workers in various use cases. For example, BI applications generate insights on business performance, processes and trends, enabling management teams to identify problems and new opportunities and then take action to address them.

Business intelligence data is typically stored in a data warehouse built for an entire organization or in smaller data marts that hold subsets of business information for individual departments and business units. In addition, data lakes based on big data systems are often now used as repositories or landing pads for BI data, especially unstructured and semistructured data types. Data lakehouse platforms that combine elements of data lakes and data warehouses have also become available.

BI data can include both historical and real-time data gathered from a combination of internal IT systems and external sources. Before it's used in BI applications, raw data from different source systems usually must be integrated, consolidated and cleansed to ensure it's accurate and consistent.

From there, the steps in the BI process include the following:

- Data preparation, in which data sets are organized, transformed and modeled for analysis.
- Analytical querying of the prepared data.
- Development of data visualizations, reports and dashboards with information on key performance indicators (KPIs) and other findings.
- Distribution of the analytics results to decision-makers, either by the BI team or self-service BI users sharing the information with business colleagues.
- Use of the performance metrics and generated insights to help inform business decisions.
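
The first three steps can be sketched with an in-memory SQLite database. The `sales` table, its figures, and the choice of "total revenue per region" as the KPI are assumptions for illustration only; a real BI stack would query a warehouse or data mart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data preparation: model the (hypothetical) sales data as a table.
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", "2024-01", 1200.0),
        ("North", "2024-02", 1500.0),
        ("South", "2024-01", 800.0),
        ("South", "2024-02", 700.0),
    ],
)

# Analytical querying: total revenue per region, a simple KPI.
kpi_rows = conn.execute(
    "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Reporting: format the result for distribution to decision-makers.
report = [f"{region}: {total:.2f}" for region, total in kpi_rows]
print("\n".join(report))
```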

BI programs sometimes also incorporate forms of advanced analytics, such as data mining, predictive analytics, text mining and statistical analysis. Predictive modeling that enables what-if analysis of different business scenarios is one example. Most commonly, though, advanced analytics projects are handled by separate data science teams, while BI teams oversee more straightforward querying and analysis of business data.

## Benefits of Business Intelligence


- Speed up and improve decision-making.
- Optimize internal business processes.
- Increase operational efficiency and productivity.
- Spot business problems that need to be addressed.
- Identify emerging business and market trends.
- Develop stronger business strategies.
- Drive higher sales and new revenues.
- Gain a competitive edge over rival companies.

The following are the major business intelligence functions that are supported by BI platforms.

- Business monitoring and measurement.

- Data analysis.

- Reporting and information delivery. 

- Predictive analysis.

## How industries use BI tools

The following are some examples of how business intelligence is used in different industries:

- Banks use BI to help assess financial risks when deciding whether to approve mortgage and loan applications. They and other financial services firms also analyze customer portfolios to help plan cross-selling efforts aimed at getting customers to buy additional products.
- Insurers similarly rely on BI tools to analyze risks when considering applications for life, auto and homeowners insurance policies. In addition, they tap BI to analyze policy pricing.
- Manufacturers use BI software to aid in production planning, purchasing of materials and supplies, supply chain management and monitoring of manufacturing operations.
- Retailers plan marketing campaigns and product promotions with the aid of BI and analytics tools while also using them in inventory management and product replenishment.
- Hotel chains use BI tools to track room occupancy rates and adjust pricing based on booking demand, and to help manage their customer loyalty programs.
- Airlines likewise employ BI to help track ticket sales and flight occupancy, as well as for things such as managing flight schedules, crew assignments and food and beverage ordering.
- Transportation companies plan distribution schedules and routes with guidance from BI and analytics tools. They also use BI to monitor gas mileage and other aspects of fleet operations.
- Hospitals use BI systems to analyze patient outcomes and readmission rates as part of efforts to improve patient care. In addition, doctors use the tools to analyze clinical data and help diagnose medical conditions.


# Relational Database VS Non-Relational Database


| Aspect                  | Relational Database                                                                                     | Non-Relational Database                                                                                           |
|-------------------------|---------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| **Definition**          | A type of database that stores data in structured tables with predefined relationships between them.    | A database that stores data in a flexible format like key-value pairs, documents, graphs, or wide-column stores. |
| **Structure**           | Data is organized into rows and columns within tables. Each column has a specific datatype, and rows represent records. | Data is stored in various formats such as key-value pairs, JSON documents, graphs, or wide columns, allowing greater flexibility. |
| **Schema**              | Requires a fixed, predefined schema. Changes to the schema require altering the database structure.     | Has a dynamic schema, which means data can be added without adhering to a strict structure.                       |
| **Data Relationships**  | Strong relationships are enforced through primary and foreign keys, making it ideal for complex relationships. | Relationships are less strict and often hierarchical or unstructured, depending on the use case.                  |
| **Query Language**      | Uses SQL (Structured Query Language) to perform CRUD (Create, Read, Update, Delete) operations and complex queries. | Uses various query methods, which can vary depending on the database type, such as JSON-like queries in MongoDB. |
| **Scalability**         | Typically scales vertically by upgrading the hardware of the database server (e.g., adding more RAM, CPU). | Scales horizontally by distributing the data across multiple servers or nodes, making it highly scalable.          |
| **Performance**         | Performs well for structured and complex queries involving joins, aggregations, and transactional operations. | Optimized for large-scale, unstructured data with high read/write throughput and simpler queries.                  |
| **Data Integrity**      | Enforces strong data integrity through ACID (Atomicity, Consistency, Isolation, Durability) compliance. | Often prioritizes availability and partition tolerance over consistency (a trade-off described by the CAP theorem) and settles for eventual consistency. |
| **Flexibility**         | Limited flexibility due to the rigid schema, which requires predefined data structures.                 | Highly flexible, as it supports unstructured, semi-structured, and structured data without predefined schemas.     |
| **Complex Queries**     | Suitable for complex queries, aggregations, and analytics due to its structured nature.                 | Not ideal for complex joins or aggregations; better for simple, straightforward data retrieval.                    |
| **Transaction Support** | Provides strong support for transactions with rollback and commit capabilities, essential for financial and business applications. | May offer limited transaction support, depending on the database type (e.g., MongoDB supports transactions, but not all NoSQL databases do). |
| **Storage Mechanism**   | Stores data in rows and columns, often normalized to eliminate redundancy.                             | Stores data in denormalized formats, optimized for fast read and write operations.                                 |
| **Scaling Limitations** | Scaling is expensive and requires significant hardware upgrades.                                        | Can scale out easily by adding more servers, making it cost-effective for large-scale systems.                     |
| **Data Volume Handling**| Ideal for moderate to large volumes of structured data.                                                | Designed to handle massive volumes of unstructured or semi-structured data with ease.                             |
| **Use Cases**           | Best suited for applications requiring consistent and structured data, such as ERP, CRM, and financial systems. | Ideal for applications involving big data, content management, real-time analytics, IoT data, and social media platforms. |
| **Examples**            | MySQL, PostgreSQL, Oracle Database, Microsoft SQL Server.                                              | MongoDB, Cassandra, DynamoDB, Redis, CouchDB, Neo4j, Firebase.                                                   |
| **Community and Support** | Long-standing community support and resources available due to decades of usage.                      | Growing community, with different levels of maturity depending on the database type.                               |
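| **Cost**                | Commercial products (e.g., Oracle Database, Microsoft SQL Server) carry licensing costs, and vertical scaling needs powerful hardware; open-source options such as MySQL and PostgreSQL also exist. | Many non-relational databases are open-source or come with flexible cloud pricing models.                          |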
| **ACID Compliance**     | Fully ACID-compliant, ensuring data reliability, consistency, and safety during transactions.           | May or may not be ACID-compliant; some databases prioritize speed and scalability over strict consistency.         |
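| **CAP Theorem**         | Typically favors consistency; in a distributed deployment this usually means giving up some availability during a network partition. | Many systems favor availability and partition tolerance and accept eventual consistency, although the choice depends on the database and its configuration. |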
| **History**             | Relational databases have been in use since the 1970s and are a mature, well-established technology.    | Non-relational databases emerged in the 2000s, designed to handle modern web-scale applications.                   |
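
A few rows of the table (schema, data relationships, transaction support) can be made concrete with a small sketch. SQLite stands in for a relational database and a plain Python dict stands in for a document store; the table names and records are hypothetical, and real non-relational systems differ widely in what they enforce.

```python
import sqlite3

# Relational side: fixed schema, foreign keys, and transactions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " total REAL NOT NULL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
conn.commit()

try:
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO orders VALUES (10, 1, 25.0)")
        conn.execute("INSERT INTO orders VALUES (11, 99, 5.0)")  # no customer 99
except sqlite3.IntegrityError:
    pass  # the whole transaction, including order 10, was rolled back

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(order_count)  # 0

# Non-relational side: documents keyed by id, with no schema enforced.
# A plain dict is only a stand-in for a document store.
documents = {}
documents["order:10"] = {"customer": {"name": "Alice"}, "total": 25.0}
documents["order:11"] = {"customer_id": 99, "total": 5.0, "tags": ["gift"]}  # accepted as-is
print(len(documents))  # 2
```

The relational side rejects the dangling reference and undoes the whole transaction, while the document side accepts both records even though their shapes differ; any validation would have to live in application code.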



# Big Data

- Big data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that continue to grow exponentially over time.
- These datasets are so huge and complex in volume, velocity, and variety, that traditional data management systems cannot store, process, and analyze them.
- The amount and availability of data is growing rapidly, spurred on by digital technology advancements, such as connectivity, mobility, the Internet of Things (IoT), and artificial intelligence (AI).
- As data continues to expand and proliferate, new big data tools are emerging to help companies collect, process, and analyze data at the speed needed to gain the most value from it.
- Big data describes large and diverse datasets that are huge in volume and also rapidly grow in size over time.
- Big data is used in machine learning, predictive modeling, and other advanced analytics to solve business problems and make informed decisions.

## Big data examples

Data can be a company’s most valuable asset. Using big data to reveal insights can help you understand the areas that affect your business—from market conditions and customer purchasing behaviors to your business processes. 

Here are some big data examples that are helping transform organizations across every industry: 

- Tracking consumer behavior and shopping habits to deliver hyper-personalized retail product recommendations tailored to individual customers
- Monitoring payment patterns and analyzing them against historical customer activity to detect fraud in real time
- Combining data and information from every stage of an order’s shipment journey with hyperlocal traffic insights to help fleet operators optimize last-mile delivery
- Using AI-powered technologies like natural language processing to analyze unstructured medical data (such as research reports, clinical notes, and lab results) to gain new insights for improved treatment development and enhanced patient care
- Using image data from cameras and sensors, as well as GPS data, to detect potholes and improve road maintenance in cities
- Analyzing public datasets of satellite imagery and geospatial datasets to visualize, monitor, measure, and predict the social and environmental impacts of supply chain operations

## How does big data work?

The central concept of big data is that the more visibility you have into anything, the more effectively you can gain insights to make better decisions, uncover growth opportunities, and improve your business model. 

Making big data work requires three main actions: 

- Integration: Big data collects terabytes, and sometimes even petabytes, of raw data from many sources that must be received, processed, and transformed into the format that business users and analysts need to start analyzing it. 
- Management: Big data needs big storage, whether in the cloud, on-premises, or both. Data must be stored in whatever form is required, and it needs to be processed and made available in real time. Increasingly, companies are turning to cloud solutions to take advantage of compute and storage that scale on demand.
- Analysis: The final step is analyzing and acting on big data—otherwise, the investment won’t be worth it. Beyond exploring the data itself, it’s also critical to communicate and share insights across the business in a way that everyone can understand. This includes using tools to create data visualizations like charts, graphs, and dashboards.
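
One idea behind processing data at this scale is to avoid holding the whole dataset in memory at once. The sketch below is a single-machine illustration of that idea only: it aggregates a lazily generated stream in fixed-size batches. The `event_stream` generator and its synthetic values are assumptions; real big data systems distribute this kind of work across many machines.

```python
from itertools import islice
from typing import Iterable, Iterator


def event_stream(n: int) -> Iterator[int]:
    """Hypothetical source: payment amounts in cents, generated lazily."""
    return ((i * 37) % 1000 for i in range(n))


def chunks(items: Iterable[int], size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


count = 0
total = 0
for batch in chunks(event_stream(10_000), size=1_000):
    # Only one batch is held in memory at a time.
    count += len(batch)
    total += sum(batch)

mean_amount = total / count
print(count, mean_amount)
```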

## Big data benefits

- Improved decision-making
- Increased agility and innovation
- Better customer experiences
- Continuous intelligence
- More efficient operations
- Improved risk management


# Data Visualization

- Data visualization is the graphical representation of information. This section covers what data visualization is, why it is important, and how to apply it well.

## Understanding Data Visualization

Data visualization translates complex data sets into visual formats that are easier for the human brain to understand. This can include a variety of visual tools such as:

- Charts: Bar charts, line charts, pie charts, etc.
- Graphs: Scatter plots, histograms, etc.
- Maps: Geographic maps, heat maps, etc.
- Dashboards: Interactive platforms that combine multiple visualizations.

The primary goal of data visualization is to make data more accessible and easier to interpret, allowing users to identify patterns, trends, and outliers quickly. This is particularly important in big data, where the large volume of information can be confusing without effective visualization techniques.
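
As a small illustration of how a visual form exposes an outlier, the sketch below renders a text-only bar chart with no plotting library. The `text_bar_chart` helper and the monthly order counts are hypothetical; in practice a charting tool or dashboard would be used.

```python
def text_bar_chart(data: dict, width: int = 20) -> list:
    """Render each value as a row of '#' characters scaled to the largest value."""
    largest = max(data.values())
    label_width = max(len(label) for label in data)
    lines = []
    for label, value in data.items():
        bar = "#" * round(value / largest * width)
        lines.append(f"{label.ljust(label_width)} | {bar} {value}")
    return lines


monthly_orders = {"Jan": 120, "Feb": 135, "Mar": 410, "Apr": 128}
chart = text_bar_chart(monthly_orders)
print("\n".join(chart))
```

The March spike is obvious in the printed bars, whereas it is easier to miss when scanning the raw numbers.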

## Why is Data Visualization Important?

1. Simplifies complex data
2. Enhances data interpretation
3. Saves time
4. Improves communication
5. Tells a data story

## Best Practices for Visualizing Data

Effective data visualization is crucial for conveying insights accurately. Follow these best practices to create compelling and understandable visualizations:

- <b>Audience-Centric Approach:</b> Tailor visualizations to your audience’s knowledge level, ensuring clarity and relevance. Consider their familiarity with data interpretation and adjust the complexity of visual elements accordingly.
- <b>Design Clarity and Consistency:</b> Choose appropriate chart types, simplify visual elements, and maintain a consistent color scheme and legible fonts. This ensures a clear, cohesive, and easily interpretable visualization.
- <b>Contextual Communication:</b> Provide context through clear labels, titles, annotations, and acknowledgments of data sources. This helps viewers understand the significance of the information presented and builds transparency and credibility.
- <b>Engaging and Accessible Design:</b> Design interactive features thoughtfully, ensuring they enhance comprehension. Additionally, prioritize accessibility by testing visualizations for responsiveness and accommodating various audience needs, fostering an inclusive and engaging experience.
