
The following table introduces terminology used in Machine Learning, Data Science, Data Engineering and Artificial Intelligence and, most importantly, shows how these terms apply to Bioinformatics. It is meant to help you start working with algorithms and machine-assisted routines, so that you can increase your productivity and tackle more complex problems.

| Term | Definition | Field |
| --- | --- | --- |
| Array | A data structure that contains a collection of elements of the same data type | Data Science |
| CSV | Comma-separated values file format | Data Engineering |
| Data Aggregation | An operation that summarizes groups of rows, typically used for descriptive statistics | Data Engineering |
| Data Filter | A predicate applied to one or more columns of a dataframe that excludes the rows that do not satisfy it | Data Engineering |
| Data Imputation | Procedure to replace empty or null values in a dataframe using a chosen strategy, for example the mean or median value, or forward-fill for time series | Data Science |
| Data Model | An abstraction of multiple entities and their relationships or associations | Data Engineering |
| Data Partition | Storing data in folders or groups of rows so that reads and analytical workloads only touch the relevant subset instead of the whole dataset | Data Engineering |
| Data Type | The kind of values a field in a dataset holds (discrete, continuous or categorical), for example numeric, string or array | Data Engineering |
| Dataframe | A tabular data structure abstraction that allows analytical operations like filters or aggregations over data points | Data Engineering |
| Dataset | A collection of information gathered by observations, measurements, research or analysis | Agnostic |
| EDA | Exploratory Data Analysis: the study of distributions, summary statistics and patterns in data | Data Science |
| ERD | Entity Relationship Diagram | Data Engineering |
| Epoch | One complete pass over the training dataset while training a model | Machine Learning |
| Embeddings | Dense numeric vector representations of items such as words or sequences, in which similar items lie close together | Machine Learning |
| F1 Score | The harmonic mean of precision and recall | Machine Learning |
| Feature Engineering | Procedure to derive new features from existing data | Data Science |
| Features | The columns or fields that make up a dataframe and serve as inputs to a model | Data Science |
| File Encoding | Specifies how characters are represented as bytes when processing text, for example UTF-8 | Data Engineering |
| File Format | A predefined structure for stored data, such as text with headers and rows, or more sophisticated binary layouts like video or audio files | Data Engineering |
| Hyperparameters | Configuration values set before training rather than learned from the data; tuning them can improve a model's performance metrics or reduce its training epochs | Data Science |
| Linear Model | A machine learning model whose prediction is a linear function of its input features (a line when there is a single feature) | Data Science |
| Machine Learning Model | An algorithm that computes a number or a label from a group of data points | Data Science |
| Matrix | A two-dimensional data structure of rows and columns, commonly representing a collection of data points in Euclidean space | Data Science |
| Matrix Rank | The dimension of the vector space spanned by the matrix's columns | Data Science |
| Model Metrics | Evaluation criteria used to determine the performance of a machine learning model, e.g. accuracy, area under the curve, specificity, etc. | Data Science |
| Neural Network | A model loosely inspired by how a human brain works, built from connected neurons that are activated through their interactions | Data Science |
| Parquet | A columnar file format widely used for analytical workloads, which allows operating on millions of rows on commodity hardware like a laptop | Data Engineering |
| R | Symbol (ℝ) used in mathematical notation to represent the set of real numbers | Data Science |
| R2 | Symbol (ℝ²) used to represent two-dimensional Euclidean space | Data Science |
| Supervised Learning | A machine learning method in which labeled samples are available during training | Data Science |
| Tensor | A multi-dimensional array, commonly representing a collection of data points in multi-dimensional space | Data Science |
| Training and Testing Datasets | The subsets obtained by splitting a dataset to train and then test a machine learning model, with the expectation that the model can predict values for unseen data | Data Science |
| TSV | Tab-separated values file format | Data Engineering |
| Unsupervised Learning | A machine learning method to make sense of data when no labels or classification criteria are available | Data Science |
| Vector | An ordered collection of numbers, commonly representing a data point in space | Data Science |
| XSV | A file whose values are separated by a special (custom) delimiter character | Data Engineering |
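
To make a few of the terms above concrete, here is a minimal sketch in plain Python (standard library only) of data imputation, a data filter, a data aggregation, and the F1 score, computed here as the harmonic mean of precision and recall. The toy rows, the column names (`sample`, `tissue`, `expression`) and the helper function names are hypothetical and chosen only for illustration; in practice a dataframe library would usually do this work.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy dataset: each dict is one row, each key is a feature.
rows = [
    {"sample": "s1", "tissue": "liver", "expression": 4.0},
    {"sample": "s2", "tissue": "liver", "expression": None},  # missing value
    {"sample": "s3", "tissue": "brain", "expression": 1.0},
    {"sample": "s4", "tissue": "brain", "expression": 3.0},
    {"sample": "s5", "tissue": "liver", "expression": 8.0},
]


def impute_mean(rows, column):
    """Data imputation: replace missing values with the mean of the observed ones."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def filter_rows(rows, predicate):
    """Data filter: keep only the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]


def aggregate_mean(rows, group_by, column):
    """Data aggregation: mean of `column` for each distinct value of `group_by`."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[group_by]].append(r[column])
    return {key: mean(values) for key, values in groups.items()}


def f1_score(y_true, y_pred):
    """F1 score for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


imputed = impute_mean(rows, "expression")
kept = filter_rows(imputed, lambda r: r["expression"] >= 3.0)
by_tissue = aggregate_mean(imputed, "tissue", "expression")
score = f1_score([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])

print(imputed[1]["expression"])
print([r["sample"] for r in kept])
print(by_tissue)
print(score)
```

Imputation runs before the filter and the aggregation here so that the row with the missing value is not silently dropped; whether that ordering is appropriate depends on the analysis.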