Awesome Data Science Dictionary

{Awesome Works in Progress}


Data

  • Data is a collection of facts, such as numbers, text, measurements, and observations.

  • Data analytics is the science of analyzing raw data to draw conclusions about that information. The techniques and processes of data analytics have been automated into mechanical processes and algorithms that work over raw data for human consumption. Data analytics helps a business optimize its performance [...] A company can also use data analytics to make better business decisions.

  • Dataism and Dataists - learn more in "The Rise of Dataism: A Threat to Freedom or a Scientific Revolution?"

  • Data Architecture - Data architecture is a set of rules, policies, standards and models that govern and define the type of data collected and how it is used, stored, managed and integrated within an organization and its database systems. It provides a formal approach to creating and managing the flow of data and how it is processed across an organization’s IT systems and applications - techopedia

  • A Data Culture is the set of collective behaviors and beliefs of people who value, practice, and encourage the use of data to improve decision-making. As a result, data is woven into the operations, mindset, and identity of an organization. A Data Culture equips everyone in your organization with the insights they need to be truly data-driven, tackling your most complex business challenges.

  • Data curation is the process of creating, organizing and maintaining data sets so they can be accessed and used by people looking for information.

  • Data Decisions

    • Data-Backed Decision Making
    • Data-Informed Decision Making (DIDM)
    • Data-Driven Decision Making (DDDM)
  • Data Democratization is the ongoing process of enabling everybody in an organization, irrespective of their technical know-how, to work with data comfortably, to feel confident talking about it, and as a result, make data-informed decisions and build customer experiences powered by data.

  • Data element is a basic unit of information that has a unique meaning and subcategories (data items) of distinct value. Examples of data elements include gender, race, and geographic location.

  • Data ethics is about responsible and sustainable use of data.

  • Data Exhaust - Data exhaust is the data generated as a byproduct of people’s online actions and choices - techtarget

  • A data estate is the infrastructure that helps a company systematically manage all of its corporate data. A data estate can be developed on-premises, in the cloud, or as a combination of both (hybrid).

  • Data blending - Data blending is the process of combining data from multiple sources into a functioning dataset. This process is gaining attention among analysts and analytic companies because it is a quick and straightforward way to extract value from multiple data sources (see the sketch after this list).

  • Data cardinality - Cardinality has two related meanings. The first arises when you're designing the database (data modeling): there, cardinality describes whether a relationship is one-to-one, many-to-one, or many-to-many, so you're really talking about relationship cardinality. Cardinality's official, non-database dictionary definition is mathematical: the number of values in a set. When applied to databases, the meaning is a bit different: it's the number of distinct values in a table column, relative to the number of rows in the table; repeated values in the column don't count (see the sketch after this list).

  • A data lakehouse can be defined as a modern data platform built from a combination of a data lake and a data warehouse. More specifically, a data lakehouse takes the flexible storage of unstructured data from a data lake and the management features and tools from data warehouses, then strategically implements them together as a larger system.

  • Data lineage traces where data originates and how it moves and is transformed as it flows between systems.

  • Data literacy is the ability to read, write and communicate data in context, including an understanding of data sources and constructs, the analytical methods and techniques applied, and the ability to describe the use case, application and resulting value. Statistics Canada defines data literacy as the ability to derive meaningful information from data. It focuses on the competencies involved in working with data, including the knowledge and skills to read, analyze, interpret, visualize and communicate data, as well as to understand the use of data in decision-making. Data literacy also means having the knowledge and skills to be a good data steward, including the ability to assess the quality of data, to protect and secure data, and to ensure its responsible and ethical use.

  • Data observability focuses on managing the health of your data, which is much more than monitoring it. Organizations have become much more reliant on their data for everyday operations and decision-making, making it critical to ensure a timely, high-quality flow of data. And, as more data is moved around an organization, often for analytics, data pipelines are the central highways for your data. Data observability helps make sure you have a reliable and effective flow of data.

  • Data Owner - Every data field in every database in the organization should be owned by a data owner, who has the authority to ultimately decide on access to, and usage of, the data.

  • Data steward - Data stewards are the data quality (DQ) experts in charge of ensuring the quality of both the actual business data and the corresponding metadata. They assess DQ by performing extensive and regular data quality checks.

  • Data Swamp is the term that describes a data store where the failure to document the stored data accurately results in the inability to analyze and exploit the data efficiently; the original data may remain, but it cannot be retrieved without the metadata that gives it context.

  • Data validation has nothing to do with what the user wants to input. Validation is about checking the input data to ensure it conforms with the data requirements of the system, in order to avoid data errors. An example is a range check that rejects an input number outside the specified range (see the sketch after this list).

  • Data verification is a way of ensuring the user types in what he or she intends; in other words, it makes sure the user does not make a mistake when inputting data. An example is double entry of data (such as when creating a password or email) to prevent incorrect data input (see the sketch after this list).

  • Gartner defines dark data as the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).

  • Citizen Data Scientist - A person who creates or generates models that leverage predictive or prescriptive analytics, but whose primary job function is outside of the field of statistics and analytics. simplilearn

  • Granularity refers to "the level of detail or summarisation of the units of data in the data warehouse". A low level of granularity means a high level of detail, while a high level of granularity means a low level of detail (more summarized data); see the sketch after this list.

  • Information overload (also known as infobesity, infoxication, information anxiety, and information explosion) is the difficulty in understanding an issue and effectively making decisions when one has too much information about that issue. In recent years, the term "information overload" has evolved into phrases such as "information glut", "data smog", and "data glut".

  • Raw Data - Raw data typically refers to tables of data where each row contains an observation and each column represents a variable that describes some property of each observation. Data in this format is sometimes referred to as tidy data, flat data, primary data, atomic data, and unit record data (see the sketch after this list). Sometimes raw data refers to data that has not yet been processed.

  • Polyglot persistence is used to describe solutions that use a mix of data store technologies.
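
A minimal sketch of data blending with pandas, referenced in the Data blending entry above; the two sources (a CRM export and a billing extract) and their columns are illustrative assumptions, not a prescribed method.

```python
import pandas as pd

# Hypothetical source 1: customer records from a CRM export
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "segment": ["SMB", "Enterprise", "SMB"],
})

# Hypothetical source 2: revenue figures from a billing system
billing = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "revenue": [1200.0, 58000.0, 300.0],
})

# Blend the two sources on the shared key into a single functioning dataset;
# a left join keeps every CRM customer even when billing data is missing.
blended = crm.merge(billing, on="customer_id", how="left")
print(blended)
```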
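
A minimal sketch of column cardinality in the database sense (distinct values relative to rows), referenced in the Data cardinality entry above; the example table is hypothetical.

```python
import pandas as pd

# Hypothetical table: repeated values lower a column's cardinality
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105],       # unique per row -> high cardinality
    "country":  ["CA", "US", "CA", "US", "CA"],  # few distinct values -> low cardinality
})

for column in orders.columns:
    distinct = orders[column].nunique()   # repeated values in the column don't count
    ratio = distinct / len(orders)        # distinct values relative to the number of rows
    print(f"{column}: {distinct} distinct / {len(orders)} rows = {ratio:.0%}")
```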
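
A minimal sketch contrasting data validation (a range check against the system's requirements) with data verification (double entry), referenced in the two entries above; the age range and helper names are illustrative assumptions.

```python
def validate_age(age: int, low: int = 0, high: int = 120) -> bool:
    """Validation: check that the input conforms with the system's data requirements."""
    return low <= age <= high                      # range check

def verify_double_entry(first_entry: str, second_entry: str) -> bool:
    """Verification: double entry catches mistakes the user makes while typing."""
    return first_entry == second_entry

print(validate_age(34))    # True  - within the specified range
print(validate_age(250))   # False - outside the specified range, rejected
print(verify_double_entry("a@example.com", "a@example.com"))  # True  - entries match
print(verify_double_entry("a@example.com", "a@exmaple.com"))  # False - typo caught
```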
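
A minimal sketch of granularity, referenced in the Granularity entry above: transaction-level rows carry a high level of detail, while the monthly roll-up is more summarized; the sales figures are made up for illustration.

```python
import pandas as pd

# Hypothetical transaction-level data: high level of detail (low level of granularity)
sales = pd.DataFrame({
    "date":   pd.to_datetime(["2024-01-03", "2024-01-17", "2024-02-05", "2024-02-20"]),
    "amount": [250.0, 90.0, 410.0, 130.0],
})

# Summarizing to monthly totals reduces detail (high level of granularity)
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
print(monthly)
```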
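
A minimal sketch of the tidy layout described in the Raw Data entry above: one observation per row, one variable per column; the values are made up for illustration.

```python
import pandas as pd

# Tidy / unit-record layout: each row is an observation, each column a variable
measurements = pd.DataFrame({
    "observation_id":  [1, 2, 3],
    "species":         ["setosa", "versicolor", "virginica"],
    "petal_length_cm": [1.4, 4.5, 6.0],
    "petal_width_cm":  [0.2, 1.5, 2.5],
})
print(measurements)
```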


Terms

  • Shadow IT refers to IT devices, software and services outside the ownership or control of the IT organization.

  • Self-service business intelligence (SSBI) empowers teams such as product developers, sales, finance, marketing, operations, and more to answer data questions, with governance supported by IT and business intelligence (BI) analysts.

  • Telemetry is the automatic recording and transmission of data from remote or inaccessible sources to an IT system in a different location for monitoring and analysis. Telemetry data may be relayed using radio, infrared, ultrasonic, GSM, satellite or cable, depending on the application (telemetry is not only used in software development, but also in meteorology, intelligence, medicine, and other fields).

  • OSINT (Open Source INTelligence) - intelligence "produced from publicly available information that is collected, exploited, and disseminated in a timely manner to an appropriate audience for the purpose of addressing a specific intelligence requirement" (Sec. 931 of Public Law 109-163, the "National Defense Authorization Act for Fiscal Year 2006").


Acronyms

  • ERD - Entity Relationship Diagrams
  • ETL - Extract, Transform and Load
  • ELT - Extract, Load and Transform
  • KPI - Key Performance Indicator
  • LCDP - A low-code development platform (LCDP) is software that provides an environment programmers use to create application software through graphical user interfaces and configuration instead of traditional computer programming
  • OLAP - Online Analytical Processing
  • OLTP - Online Transaction Processing
  • *aaS - "as a Service" (e.g., SaaS, PaaS, IaaS)
  • DW - Data Warehouse
  • DSS - Decision Support System
  • EDW - Enterprise Data Warehouse
  • EIM - Enterprise Information Management (Enterprise information management is an integrative discipline for structuring, describing and governing information assets across organizational and technological boundaries to improve efficiency, promote transparency and enable business insight) Gartner IT Glossary
  • VLDB - Very Large Database
  • BISM - Business Intelligence Semantic Model
  • CRISP-DM - Cross-Industry Standard Process for Data Mining
  • RDBMS - Relational Database Management System
  • PSA - Persistent Staging Area
  • SSOT - Single Source of Truth

ML

  • Machine Learning
    • Machine Learning algorithms aim to learn a target function (f) that describes the mapping between data input variables (X) and an output variable (Y) | Y=f(X)+e (Udacity)
    • Machine learning is a data science technique used to extract patterns from data, allowing computers to identify related data, and forecast future outcomes, behaviors, and trends (Udacity)
  • Pipelines - An Azure ML pipeline performs a complete logical workflow with an ordered sequence of steps. Each step is a discrete processing action. (Microsoft Azure)
  • In the context of machine learning, data drift is the change in model input data that leads to model performance degradation. It is one of the top reasons why model accuracy degrades over time, so monitoring data drift helps detect model performance issues (see the sketch after this list).
  • Irreducible error is the error that can't be reduced by creating good models. It is a measure of the amount of noise in our data. It is important to understand that no matter how good we make our model, our data will always contain a certain amount of noise, or irreducible error, that cannot be removed. Towards Data Science. Irreducible error comes from data collection (see the sketch after this list).
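
A minimal sketch of detecting data drift in a single input feature, referenced in the data drift entry above: it compares the training-time distribution with recent serving data using a two-sample Kolmogorov-Smirnov test from SciPy. The simulated shift and the 0.01 threshold are illustrative assumptions, not a prescribed method.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Feature values seen at training time vs. values arriving in production;
# the production distribution is simulated with a small mean shift (drift).
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
serving_feature  = rng.normal(loc=0.4, scale=1.0, size=5000)

# Two-sample KS test: a small p-value suggests the input distribution changed
statistic, p_value = ks_2samp(training_feature, serving_feature)
if p_value < 0.01:  # illustrative alerting threshold
    print(f"Possible data drift detected (KS={statistic:.3f}, p={p_value:.1e})")
else:
    print("No significant drift detected")
```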
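
A minimal sketch tying the Y = f(X) + e definition to irreducible error, referenced in the entries above: we simulate data with a known noise level, fit a simple model, and see that the test error hovers around the noise variance and cannot do better in expectation. The true function, noise level, and model choice are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Simulate Y = f(X) + e with a known f and noise standard deviation 0.5
X = rng.uniform(0, 10, size=(2000, 1))
noise = rng.normal(0, 0.5, size=2000)      # the irreducible error in the data
y = 3.0 * X.ravel() + 2.0 + noise          # f(X) = 3X + 2

# Fit on half the data, evaluate on the held-out half
model = LinearRegression().fit(X[:1000], y[:1000])
test_mse = mean_squared_error(y[1000:], model.predict(X[1000:]))

# Even a well-specified model cannot beat the noise variance in expectation
print(f"test MSE: {test_mse:.3f}   irreducible error (noise variance): {0.5 ** 2:.2f}")
```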

Related topics

  • double-digit - (of a number or variable) having two digits, i.e., an integer between 10 and 99
  • Six nines = 99.9999% (e.g., availability; see the worked example after this list)
  • The Greek Alphabet - mit.edu | α alpha, β beta, γ gamma, δ delta, η êta, θ thêta, λ lambda, μ mu, σ/ς sigma, ω omega
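
A small worked example of the downtime budget implied by six nines availability, referenced above (using a 365-day year for simplicity).

```python
# Six nines availability = 99.9999% uptime
availability = 0.999999
seconds_per_year = 365 * 24 * 60 * 60          # 31,536,000 seconds in a 365-day year

downtime_seconds = (1 - availability) * seconds_per_year
print(f"Allowed downtime per year: {downtime_seconds:.1f} seconds")   # about 31.5 seconds
```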

Dictionaries
