## Large Language Models (LLMs)
### Q&A Example

In this notebook, we'll explore Large Language Models (LLMs) by creating a simple Q&A model using your own data.

Imagine you have a Wikipedia article and you're curious about certain facts in it, but you don't want to read the whole thing. With LLM-based Q&A, you can ask questions about the article and get answers drawn from its text.

More generally, Q&A with LLMs means asking questions about your own data, such as documents or tables. It is similar to a Ctrl+F search, except that the query can be phrased as a natural-language question rather than an exact string.
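The contrast with a Ctrl+F search can be sketched in a few lines of plain Python. This is only a toy illustration: the three passages are shortened from the Wikipedia text used later in this notebook, and the word-overlap cosine score is an assumed stand-in for the learned embeddings a real pipeline would use. The names `tokenize` and `cosine` are hypothetical helpers, not LangChain functions.

```python
import re
from collections import Counter
from math import sqrt

# Three short passages, shortened from the Wikipedia text used in this notebook.
passages = [
    "The term machine learning was coined in 1959 by Arthur Samuel.",
    "Boosting converts weak learners into strong ones.",
    "A transformer is a deep learning architecture based on attention.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


query = "Who coined the term machine learning?"

# Ctrl+F style: the whole query must appear verbatim in a passage.
exact_hits = [p for p in passages if query.lower() in p.lower()]

# Similarity style: rank every passage by how close it is to the query.
best = max(passages, key=lambda p: cosine(query, p))

print("Exact matches:", exact_hits)
print("Best passage: ", best)
```

The exact search finds nothing because no passage contains the question verbatim, while the similarity ranking still surfaces the passage about Arthur Samuel. A Q&A pipeline then hands the retrieved passage to the LLM, which writes the answer.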

To guide us through this process, we'll be using a handy library called LangChain. This library serves as a framework for building applications powered by language models. If you want to learn more about LangChain, check out their documentation at https://python.langchain.com/docs/get_started/introduction.

Let's dive in and build our very own Q&A model step by step!

In [0]:
%pip install langchain==0.0.331 tiktoken sentence_transformers==2.2.2 chromadb==0.3.29

In [0]:
dbutils.library.restartPython()

In [0]:
import pandas as pd
from tqdm import tqdm

from langchain.document_loaders import WikipediaLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.schema.document import Document

#### Load data
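LangChain provides data loaders for many data formats (e.g. TextLoader, CSVLoader, PyPDFLoader, ...). Each loader returns a list of `Document` objects that hold the text in `page_content` and descriptive fields such as the source in `metadata`.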
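To make the loader idea concrete without depending on any installed package, here is a minimal sketch of what a text loader does. `SimpleDocument` and `load_text_file` are hypothetical stand-ins written for this illustration; they only mirror the `page_content` and `metadata` fields of LangChain's `Document` and are not part of the library.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path):
    """Hypothetical loader: read one UTF-8 text file into a one-item document list."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Write a small sample file, then load it back as a document.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "note.txt"
    sample.write_text(
        "Machine learning is a field of study in artificial intelligence.",
        encoding="utf-8",
    )
    docs = load_text_file(sample)

print(len(docs), "document loaded")
print(docs[0].page_content)
```

Keeping the source in `metadata` is what later lets a Q&A chain report where an answer came from.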

In [0]:
# WikipediaLoader would fetch the article directly from Wikipedia.
# This environment has no Internet connection, so the two calls below
# are kept commented out for reference and local data is used instead.
# loader = WikipediaLoader("Machine_learning")
# data = loader.load()

In [0]:
# Use a local copy of the Wikipedia pages instead (no Internet access needed)
data = [Document(page_content='Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Recently, generative artificial neural networks have been able to surpass many previous approaches in performance.Machine learning approaches have been applied to many fields including large language models, computer vision, speech recognition, email filtering, agriculture, and medicine, where it is too costly to develop algorithms to perform the needed tasks. ML is known in its application across business problems under the name predictive analytics. Although not all machine learning is statistically based, computational statistics is an important source of the field\'s methods.\nThe mathematical foundations of ML are provided by mathematical optimization (mathematical programming) methods. Data mining is a related (parallel) field of study, focusing on exploratory data analysis through unsupervised learning. From a theoretical point of view Probably approximately correct learning provides a framework for describing machine learning.\n\n\n== History and relationships to other fields ==\n\nThe term machine learning was coined in 1959 by Arthur Samuel, an IBM employee and pioneer in the field of computer gaming and artificial intelligence. The synonym self-teaching computers was also used in this time period.Although the earliest machine learning model was introduced in the 1950s when Arthur Samuel invented a program that calculated the winning chance in checkers for each side, the history of machine learning roots back to decades of human desire and effort to study human cognitive processes. In 1949, Canadian psychologist Donald Hebb published the book The Organization of Behavior, in which he introduced a theoretical neural structure formed by certain interactions among nerve cells. 
Hebb\'s model of neurons interacting with one another set a groundwork for how AIs and machine learning algorithms work under nodes, or artificial neurons used by computers to communicate data. Other researchers who have studied human cognitive systems contributed to the modern machine learning technologies as well, including logician Walter Pitts and Warren McCulloch, who proposed the early mathematical models of neural networks to come up with algorithms that mirror human thought processes.By the early 1960s an experimental "learning machine" with punched tape memory, called Cybertron, had been developed by Raytheon Company to analyze sonar signals, electrocardiograms, and speech patterns using rudimentary reinforcement learning. It was repetitively "trained" by a human operator/teacher to recognize patterns and equipped with a "goof" button to cause it to re-evaluate incorrect decisions. A representative book on research into machine learning during the 1960s was Nilsson\'s book on Learning Machines, dealing mostly with machine learning for pattern classification. Interest related to pattern recognition continued into the 1970s, as described by Duda and Hart in 1973. In 1981 a report was given on using teaching strategies so that an artificial neural network learns to recognize 40 characters (26 letters, 10 digits, and 4 special symbols) from a computer terminal.Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms studied in the machine learning field: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P,  improves with experience E." This definition of the tasks in which machine learning is concerned offers a fundamentally operational definition rather than defining the field in cognitive terms. 
This follows Alan Turing\'s proposal in his paper "Computing Machinery and Intelligence", in which the question "Can machines think?" is replaced with the question "Can machin', metadata={'title': 'Machine learning', 'summary': "Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Recently, generative artificial neural networks have been able to surpass many previous approaches in performance.Machine learning approaches have been applied to many fields including large language models, computer vision, speech recognition, email filtering, agriculture, and medicine, where it is too costly to develop algorithms to perform the needed tasks. ML is known in its application across business problems under the name predictive analytics. Although not all machine learning is statistically based, computational statistics is an important source of the field's methods.\nThe mathematical foundations of ML are provided by mathematical optimization (mathematical programming) methods. Data mining is a related (parallel) field of study, focusing on exploratory data analysis through unsupervised learning. From a theoretical point of view Probably approximately correct learning provides a framework for describing machine learning.", 'source': 'https://en.wikipedia.org/wiki/Machine_learning'}), Document(page_content='Quantum machine learning is the integration of quantum algorithms within machine learning programs.The most common use of the term refers to machine learning algorithms for the analysis of classical data executed on a quantum computer, i.e. quantum-enhanced machine learning. 
While machine learning algorithms are used to compute immense quantities of data, quantum machine learning utilizes qubits and quantum operations or specialized quantum systems to improve computational speed and data storage done by algorithms in a program. This includes hybrid methods that involve both classical and quantum processing, where computationally difficult subroutines are outsourced to a quantum device. These routines can be more complex in nature and executed faster on a quantum computer. Furthermore, quantum algorithms can be used to analyze quantum states instead of classical data.Beyond quantum computing, the term "quantum machine learning" is also associated with classical machine learning methods applied to data generated from quantum experiments (i.e. machine learning of quantum systems), such as learning the phase transitions of a quantum system or creating new quantum experiments.Quantum machine learning also extends to a branch of research that explores methodological and structural similarities between certain physical systems and learning systems, in particular neural networks. For example, some mathematical and numerical techniques from quantum physics are applicable to classical deep learning and vice versa.Furthermore, researchers investigate more abstract notions of learning theory with respect to quantum information, sometimes referred to as "quantum learning theory".\n\n\n== Machine learning with quantum computers ==\nQuantum-enhanced machine learning refers to quantum algorithms that solve tasks in machine learning, thereby improving and often expediting classical machine learning techniques. Such algorithms typically require one to encode the given classical data set into a quantum computer to make it accessible for quantum information processing. Subsequently, quantum information processing routines are applied and the result of the quantum computation is read out by measuring the quantum system. 
For example, the outcome of the measurement of a qubit reveals the result of a binary classification task. While many proposals of quantum machine learning algorithms are still purely theoretical and require a full-scale universal quantum computer to be tested, others have been implemented on small-scale or special purpose quantum devices.\n\n\n=== Quantum associative memories and quantum pattern recognition ===\nAssociative (or content-addressable memories) are able to recognize stored content on the basis of a similarity measure, rather than fixed addresses, like in random access memories. As such they must be able to retrieve both incomplete and corrupted patterns, the essential machine learning task of pattern recognition.\nTypical classical associative memories store p patterns in the \n  \n    \n      \n        O\n        (\n        \n          n\n          \n            2\n          \n        \n        )\n      \n    \n    {\\displaystyle O(n^{2})}\n   interactions (synapses) of a real,  symmetric energy matrix over a network of n artificial neurons. The encoding is such that the desired patterns are local minima of the energy functional and retrieval is done by minimizing the total energy, starting from an initial configuration.\nUnfortunately, classical associative memories are severely limited by the phenomenon of cross-talk. When too many patterns are stored, spurious memories appear which quickly proliferate, so that the energy landscape becomes disordered and no retrieval is anymore possible. 
The number of storable patterns is typically limited by a linear function of the number of neurons, \n  \n    \n      \n        p\n        ≤\n        O\n        (\n        n\n        )\n      \n    \n    {\\displaystyle p\\leq O(n)}\n  .\nQuantum associative memories (in t', metadata={'title': 'Quantum machine learning', 'summary': 'Quantum machine learning is the integration of quantum algorithms within machine learning programs.The most common use of the term refers to machine learning algorithms for the analysis of classical data executed on a quantum computer, i.e. quantum-enhanced machine learning. While machine learning algorithms are used to compute immense quantities of data, quantum machine learning utilizes qubits and quantum operations or specialized quantum systems to improve computational speed and data storage done by algorithms in a program. This includes hybrid methods that involve both classical and quantum processing, where computationally difficult subroutines are outsourced to a quantum device. These routines can be more complex in nature and executed faster on a quantum computer. Furthermore, quantum algorithms can be used to analyze quantum states instead of classical data.Beyond quantum computing, the term "quantum machine learning" is also associated with classical machine learning methods applied to data generated from quantum experiments (i.e. machine learning of quantum systems), such as learning the phase transitions of a quantum system or creating new quantum experiments.Quantum machine learning also extends to a branch of research that explores methodological and structural similarities between certain physical systems and learning systems, in particular neural networks. 
For example, some mathematical and numerical techniques from quantum physics are applicable to classical deep learning and vice versa.Furthermore, researchers investigate more abstract notions of learning theory with respect to quantum information, sometimes referred to as "quantum learning theory".', 'source': 'https://en.wikipedia.org/wiki/Quantum_machine_learning'}), Document(page_content='In machine learning, boosting is an ensemble meta-algorithm for primarily reducing bias, and also variance in supervised learning, and a family of machine learning algorithms that convert weak learners to strong ones. Boosting is based on the question posed by Kearns and Valiant (1988, 1989): "Can a set of weak learners create a single strong learner?" A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.\nRobert Schapire\'s affirmative answer in a 1990 paper to the question of Kearns and Valiant has had significant ramifications in machine learning and statistics, most notably leading to the development of boosting.When first introduced, the hypothesis boosting problem simply referred to the process of turning a weak learner into a strong learner. "Informally, [the hypothesis boosting] problem asks whether an efficient learning algorithm […] that outputs a hypothesis whose performance is only slightly better than random guessing [i.e. a weak learner] implies the existence of an efficient algorithm that outputs a hypothesis of arbitrary accuracy [i.e. a strong learner]." Algorithms that achieve hypothesis boosting quickly became simply known as "boosting". 
Freund and Schapire\'s arcing (Adapt[at]ive Resampling and Combining), as a general technique, is more or less synonymous with boosting.\n\n\n== Boosting algorithms ==\nWhile boosting is not algorithmically constrained, most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When they are added, they are weighted in a way that is related to the weak learners\' accuracy.  After a weak learner is added, the data weights are readjusted, known as "re-weighting". Misclassified input data gain a higher weight and examples that are classified correctly lose weight. Thus, future weak learners focus more on the examples that previous weak learners misclassified.\n\nThere are many boosting algorithms. The original ones, proposed by Robert Schapire (a recursive majority gate formulation), and Yoav Freund (boost by majority), were not adaptive and could not take full advantage of the weak learners. Schapire and Freund then developed AdaBoost, an adaptive boosting algorithm that won the prestigious Gödel Prize.\nOnly algorithms that are provable boosting algorithms in the probably approximately correct learning formulation can accurately be called boosting algorithms.  Other algorithms that are similar in spirit to boosting algorithms are sometimes called "leveraging algorithms", although they are also sometimes incorrectly called boosting algorithms.The main variation between many boosting algorithms is their method of weighting training data points and hypotheses. AdaBoost is very popular and the most significant historically as it was the first algorithm that could adapt to the weak learners. It is often the basis of introductory coverage of boosting in university machine learning courses. There are many more recent algorithms such as LPBoost, TotalBoost, BrownBoost, xgboost, MadaBoost, LogitBoost, and others. 
Many boosting algorithms fit into the AnyBoost framework, which shows that boosting performs gradient descent in a function space using a convex cost function.\n\n\n== Object categorization in computer vision ==\n\nGiven images containing various known objects in the world, a classifier can be learned from them to automatically classify the objects in future images.  Simple classifiers built based on some image feature of the object tend to be weak in categorization performance. Using boosting methods for object categorization is a way to unify the weak classifiers in a special way to boost the overall ability of categorization.\n\n\n=== Problem of object categorization ===\nObject categoriz', metadata={'title': 'Boosting (machine learning)', 'summary': 'In machine learning, boosting is an ensemble meta-algorithm for primarily reducing bias, and also variance in supervised learning, and a family of machine learning algorithms that convert weak learners to strong ones. Boosting is based on the question posed by Kearns and Valiant (1988, 1989): "Can a set of weak learners create a single strong learner?" A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.\nRobert Schapire\'s affirmative answer in a 1990 paper to the question of Kearns and Valiant has had significant ramifications in machine learning and statistics, most notably leading to the development of boosting.When first introduced, the hypothesis boosting problem simply referred to the process of turning a weak learner into a strong learner. "Informally, [the hypothesis boosting] problem asks whether an efficient learning algorithm […] that outputs a hypothesis whose performance is only slightly better than random guessing [i.e. 
a weak learner] implies the existence of an efficient algorithm that outputs a hypothesis of arbitrary accuracy [i.e. a strong learner]." Algorithms that achieve hypothesis boosting quickly became simply known as "boosting". Freund and Schapire\'s arcing (Adapt[at]ive Resampling and Combining), as a general technique, is more or less synonymous with boosting.\n\n', 'source': 'https://en.wikipedia.org/wiki/Boosting_(machine_learning)'}), Document(page_content='A transformer is a deep learning architecture based on the multi-head attention mechanism, proposed in a 2017 paper "Attention Is All You Need". It has no recurrent units, and thus requires less training time than previous recurrent neural architectures, such as long short-term memory (LSTM), and its later variation has been prevalently adopted for training large language models on large (language) datasets, such as the Wikipedia corpus and Common Crawl.\nInput text is split into n-grams encoded as tokens and each token is converted into a vector via looking up from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism allowing the signal for key tokens to be amplified and less important tokens to be diminished. The transformer paper, published in 2017, is based on the softmax-based attention mechanism was proposed by Bahdanau et. al. in 2014 for machine translation, and the Fast Weight Controller, similar to a transformer, was proposed in 1992.\nThis architecture is now used not only in natural language processing and computer vision, but also in audio and multi-modal processing. 
It has also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs) and BERT (Bidirectional Encoder Representations from Transformers).\n\n\n== Timeline ==\nIn 1990, the Elman network, using a recurrent neural network, encoded each word in a training set as a vector, called a word embedding, and the whole vocabulary as a vector database, allowing it to perform such tasks as sequence-predictions that are beyond the power of a simple multilayer perceptron. A shortcoming of the static embeddings was that they didn\'t differentiate between multiple meanings of same-spelt words.\nIn 1992, the Fast Weight Controller was published by Jürgen Schmidhuber. It learns to answer queries by programming the attention weights of another neural network through outer products of key vectors and value vectors called FROM and TO. The Fast Weight Controller was later shown to be equivalent to the unnormalized linear Transformer.  The terminology "learning internal spotlights of attention" was introduced in 1993.\nIn 1993, the IBM alignment models were used for statistical machine translation.\nIn 1997, a precursor of large language model, using recurrent neural networks, such as long short-term memory, was proposed.\nIn 2001, a one-billion-word large text corpus, scraped from the Internet, referred to as "very very large" at the time, was used for word disambiguation.\nIn 2012, AlexNet demonstrated the effectiveness of large neural networks for image recognition, encouraging large artificial neural networks approach instead of older, statistical approaches.\nIn 2014, a 380M-parameter seq2seq model for machine translation using two LSTMs networks was proposed by Sutskever at al. The architecture consists of two parts. The encoder is an LSTM that takes in a sequence of tokens and turns it into a vector. 
The decoder is another LSTM that converts the vector into a sequence of tokens.\nIn 2014, gating proved to be useful in a 130M-parameter seq2seq model, which used a simplified gated recurrent units (GRUs). Bahdanau et al showed that GRUs are neither better nor worse than gated LSTMs.\nIn 2014, Bahdanau et al. improved the previous seq2seq model by using an "additive" kind of attention mechanism in-between two LSTM networks. It was, however, not yet the parallelizable (scaled "dot product") kind of attention, later proposed in the 2017 transformer paper.\nIn 2015, the relative performance of Global and Local (windowed) attention model architectures were assessed by Luong et al, a mixed attention architecture found to improve on the translations offered by Bahdanau\'s architecture, while the use of a local attention architecture  reduced translation time.\nIn 2016, Google Translate gradually replaced the older statistical', metadata={'title': 'Transformer (deep learning architecture)', 'summary': 'A transformer is a deep learning architecture based on the multi-head attention mechanism, proposed in a 2017 paper "Attention Is All You Need". It has no recurrent units, and thus requires less training time than previous recurrent neural architectures, such as long short-term memory (LSTM), and its later variation has been prevalently adopted for training large language models on large (language) datasets, such as the Wikipedia corpus and Common Crawl.\nInput text is split into n-grams encoded as tokens and each token is converted into a vector via looking up from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism allowing the signal for key tokens to be amplified and less important tokens to be diminished. The transformer paper, published in 2017, is based on the softmax-based attention mechanism was proposed by Bahdanau et. al. 
in 2014 for machine translation, and the Fast Weight Controller, similar to a transformer, was proposed in 1992.\nThis architecture is now used not only in natural language processing and computer vision, but also in audio and multi-modal processing. It has also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs) and BERT (Bidirectional Encoder Representations from Transformers).', 'source': 'https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)'}), Document(page_content='Artificial intelligence (AI) is the intelligence of machines or software, as opposed to the intelligence of other living beings, primarily of humans. It is a field of study in computer science that develops and studies intelligent machines. Such machines may be called AIs.\nAI technology is widely used throughout industry, government, and science. Some high-profile applications are: advanced web search engines (e.g., Google Search), recommendation systems (used by YouTube, Amazon, and Netflix), interacting via human speech (such as Google Assistant, Siri, and Alexa), self-driving cars (e.g., Waymo), generative and creative tools (ChatGPT and AI art), and superhuman play and analysis in strategy games (such as chess and Go).Alan Turing was the first person to conduct substantial research in the field that he called Machine Intelligence. Artificial intelligence was founded as an academic discipline in 1956. The field went through multiple cycles of optimism followed by disappointment and loss of funding. Funding and interest vastly increased after 2012 when deep learning surpassed all previous AI techniques, and after 2017 with the transformer architecture. 
This led to the AI spring of the early 2020s, with companies, universities, and laboratories overwhelmingly based in the United States pioneering significant advances in artificial intelligence.The various sub-fields of AI research are centered around particular goals and the use of particular tools. The traditional goals of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception, and support for robotics. General intelligence (the ability to complete any task performable by a human) is among the field\'s long-term goals.To solve these problems, AI researchers have adapted and integrated a wide range of problem-solving techniques, including search and mathematical optimization, formal logic, artificial neural networks, and methods based on statistics, operations research, and economics. AI also draws upon psychology, linguistics, philosophy, neuroscience and other fields.\n\n\n== Goals ==\nThe general problem of simulating (or creating) intelligence has been broken into sub-problems. These consist of particular traits or capabilities that researchers expect an intelligent system to display. The traits described below have received the most attention and cover the scope of AI research.\n\n\n=== Reasoning, problem-solving ===\nEarly researchers developed algorithms that imitated step-by-step reasoning that humans use when they solve puzzles or make logical deductions. By the late 1980s and 1990s, methods were developed for dealing with uncertain or incomplete information, employing concepts from probability and economics.Many of these algorithms are insufficient for solving large reasoning problems because they experience a "combinatorial explosion": they became exponentially slower as the problems grew larger. Even humans rarely use the step-by-step deduction that early AI research could model. They solve most of their problems using fast, intuitive judgments. 
Accurate and efficient reasoning is an unsolved problem.\n\n\n=== Knowledge representation ===\nKnowledge representation and knowledge engineering allow AI programs to answer questions intelligently and make deductions about real-world facts. Formal knowledge representations are used in content-based indexing and retrieval, scene interpretation, clinical decision support, knowledge discovery (mining "interesting" and actionable inferences from large databases), and other areas.A knowledge base is a body of knowledge represented in a form that can be used by a program. An ontology is the set of objects, relations, concepts, and properties used by a particular domain of knowledge. Knowledge bases need to represent things such as: objects, properties, categories and relations between objects; situations, events, states and time; causes and effects; knowledge about knowledge (wh', metadata={'title': 'Artificial intelligence', 'summary': "Artificial intelligence (AI) is the intelligence of machines or software, as opposed to the intelligence of other living beings, primarily of humans. It is a field of study in computer science that develops and studies intelligent machines. Such machines may be called AIs.\nAI technology is widely used throughout industry, government, and science. Some high-profile applications are: advanced web search engines (e.g., Google Search), recommendation systems (used by YouTube, Amazon, and Netflix), interacting via human speech (such as Google Assistant, Siri, and Alexa), self-driving cars (e.g., Waymo), generative and creative tools (ChatGPT and AI art), and superhuman play and analysis in strategy games (such as chess and Go).Alan Turing was the first person to conduct substantial research in the field that he called Machine Intelligence. Artificial intelligence was founded as an academic discipline in 1956. The field went through multiple cycles of optimism followed by disappointment and loss of funding. 
Funding and interest vastly increased after 2012 when deep learning surpassed all previous AI techniques, and after 2017 with the transformer architecture. This led to the AI spring of the early 2020s, with companies, universities, and laboratories overwhelmingly based in the United States pioneering significant advances in artificial intelligence.The various sub-fields of AI research are centered around particular goals and the use of particular tools. The traditional goals of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception, and support for robotics. General intelligence (the ability to complete any task performable by a human) is among the field's long-term goals.To solve these problems, AI researchers have adapted and integrated a wide range of problem-solving techniques, including search and mathematical optimization, formal logic, artificial neural networks, and methods based on statistics, operations research, and economics. AI also draws upon psychology, linguistics, philosophy, neuroscience and other fields.", 'source': 'https://en.wikipedia.org/wiki/Artificial_intelligence'}), Document(page_content='Active learning is a special case of machine learning in which a learning algorithm can interactively query a human user (or some other information source), to label new data points with the desired outputs. The human user must possess knowledge/expertise in the problem domain, including the ability to consult/research authoritative sources when necessary.  In statistics literature, it is sometimes also called optimal experimental design. The information source is also called teacher or oracle.\nThere are situations in which unlabeled data is abundant but manual labeling is expensive. In such a scenario, learning algorithms can actively query the user/teacher for labels. This type of iterative supervised learning is called active learning. 
Since the learner chooses the examples, the number of examples to learn a concept can often be much lower than the number required in normal supervised learning. With this approach, there is a risk that the algorithm is overwhelmed by uninformative examples.  Recent developments are dedicated to multi-label active learning, hybrid active learning and active learning in a single-pass (on-line) context, combining concepts from the field of machine learning (e.g. conflict and ignorance) with adaptive, incremental learning policies in the field of online machine learning. Using active learning allows for faster development of a machine learning algorithm, when comparative updates would require a quantum or super computer.Large-scale active learning projects may benefit from crowdsourcing frameworks such as Amazon Mechanical Turk that include many humans in the active learning loop.\n\n\n== Definitions ==\nLet T be the total set of all data under consideration. For example, in a protein engineering problem, T would include all proteins that are known to have a certain interesting activity and all additional proteins that one might want to test for that activity.\nDuring each iteration, i, T is broken up into three subsets\n\n  \n    \n      \n        \n          \n            T\n          \n          \n            K\n            ,\n            i\n          \n        \n      \n    \n    {\\displaystyle \\mathbf {T} _{K,i}}\n  : Data points where the label is known.\n\n  \n    \n      \n        \n          \n            T\n          \n          \n            U\n            ,\n            i\n          \n        \n      \n    \n    {\\displaystyle \\mathbf {T} _{U,i}}\n  : Data points where the label is unknown.\n\n  \n    \n      \n        \n          \n            T\n          \n          \n            C\n            ,\n            i\n          \n        \n      \n    \n    {\\displaystyle \\mathbf {T} _{C,i}}\n  : A subset of TU,i that is chosen to be labeled.Most of the 
current research in active learning involves the best method to choose the data points for T_{C,i}.\n\n\n== Scenarios ==\nPool-Based Sampling: In this approach, which is the best-known scenario, the learning algorithm attempts to evaluate the entire dataset before selecting data points (instances) for labeling.  It is often initially trained on a fully labeled subset of the data using a machine-learning method such as logistic regression or SVM that yields class-membership probabilities for individual data instances. The candidate instances are those for which the prediction is most ambiguous. Instances are drawn from the entire data pool and assigned a confidence score, a measurement of how well the learner "understands" the data. The system then selects the instances for which it is the least confident and queries the teacher for the labels. The theoretical drawback of pool-based sampling is that it is memory-intensive and is therefore limited in its capacity to handle enormous datasets, but in practice, the rate-limiting factor is that the teacher is typically a (fatiguable) human expert who must be paid for their effort, rather than computer memory.\nStream-Based Selective Sampling: Here, each consecutive unlabeled instance is examined one at a time with the machine evaluating the informativene', metadata={'title': 'Active learning (machine learning)', 'summary': 'Active learning is a special case of machine learning in which a learning algorithm can interactively query a human user (or some other information source), to label new data points with the desired outputs. The human user must possess knowledge/expertise in the problem domain, including the ability to consult/research authoritative sources when necessary.  In statistics literature, it is sometimes also called optimal experimental design. The information source is also called teacher or oracle.\nThere are situations in which unlabeled data is abundant but manual labeling is expensive. 
In such a scenario, learning algorithms can actively query the user/teacher for labels. This type of iterative supervised learning is called active learning. Since the learner chooses the examples, the number of examples to learn a concept can often be much lower than the number required in normal supervised learning. With this approach, there is a risk that the algorithm is overwhelmed by uninformative examples.  Recent developments are dedicated to multi-label active learning, hybrid active learning and active learning in a single-pass (on-line) context, combining concepts from the field of machine learning (e.g. conflict and ignorance) with adaptive, incremental learning policies in the field of online machine learning. Using active learning allows for faster development of a machine learning algorithm, when comparative updates would require a quantum or super computer. Large-scale active learning projects may benefit from crowdsourcing frameworks such as Amazon Mechanical Turk that include many humans in the active learning loop.', 'source': 'https://en.wikipedia.org/wiki/Active_learning_(machine_learning)'}), Document(page_content='Machine learning-based attention is a mechanism which intuitively mimics cognitive attention. It calculates "soft" weights for each word, more precisely for its embedding, in the context window. These weights can be computed either in parallel (such as in transformers) or sequentially (such as recurrent neural networks). "Soft" weights can change during each runtime, in contrast to "hard" weights, which are (pre-)trained and fine-tuned and remain frozen afterwards.  \nAttention was developed to address the weaknesses of leveraging information from the hidden outputs of recurrent neural networks.  Recurrent neural networks favor more recent information contained in words at the end of a sentence, while information earlier in the sentence is expected to be attenuated.  
Attention allows the calculation of the hidden representation of a token equal access to any part of a sentence directly, rather than only through the previous hidden state.  \nEarlier uses attached this mechanism to a serial recurrent neural network\'s language translation system (below), but later uses in Transformers large language models removed the recurrent neural network and relied heavily on the faster parallel attention scheme.\n\n\n== Predecessors ==\nPredecessors of the mechanism were used in recurrent neural networks which, however, calculated "soft" weights sequentially and, at each step, considered the current word and other words within the context window. They were known as multiplicative modules, sigma pi units, and hyper-networks. They have been used in long short-term memory (LSTM) networks, multi-sensory data processing (sound, images, video, and text) in perceivers, fast weight controller\'s memory, reasoning tasks in differentiable neural computers, and neural Turing machines.\n\n\n== Core calculations ==\nThe attention network was designed to identify the highest correlations amongst words within a sentence, assuming that it has learned those patterns from the training corpus.  This correlation is captured in neuronal weights through backpropagation, either from self-supervised pretraining or supervised fine-tuning. \nThe example below shows how correlations are identified once a network has been trained and has the right weights.  When looking at the word "that" in the sentence "see that girl run", the network should be able to identify "girl" as a highly correlated word.  For simplicity this example focuses on the word "that", but in reality all words receive this treatment in parallel and the resulting soft-weights and context vectors are stacked into matrices for further task-specific use.\n\nThe query vector is compared (via dot product) with each word in the keys. This helps the model discover the most relevant word for the query word. 
In this case "girl" was determined to be the most relevant word for "that". The result (size 4 in this case) is run through the softmax function, producing a vector of size 4 with probabilities summing to 1. Multiplying this against the value matrix effectively amplifies the signal for the most important words in the sentence and diminishes the signal for less important words. The structure of the input data is captured in the Qw and Kw weights, and the Vw weights express that structure in terms of more meaningful features for the task being trained for.  For this reason, the attention head components are called Query (Q), Key (K), and Value (V)—a loose and possibly misleading analogy with relational database systems.\nNote that the context vector for "that" does not rely on context vectors for the other words; therefore the context vectors of all words can be calculated using the whole matrix X, which includes all the word embeddings, instead of a single word\'s embedding vector x in the formula above, thus parallelizing the calculations. Now, the softmax can be interpreted as a matrix softmax acting on separate rows.  This is a huge advantage over recurrent networks which must operate sequentially.\n\n\n== A language tran', metadata={'title': 'Attention (machine learning)', 'summary': 'Machine learning-based attention is a mechanism which intuitively mimics cognitive attention. It calculates "soft" weights for each word, more precisely for its embedding, in the context window. These weights can be computed either in parallel (such as in transformers) or sequentially (such as recurrent neural networks). "Soft" weights can change during each runtime, in contrast to "hard" weights, which are (pre-)trained and fine-tuned and remain frozen afterwards.  \nAttention was developed to address the weaknesses of leveraging information from the hidden outputs of recurrent neural networks.  
Recurrent neural networks favor more recent information contained in words at the end of a sentence, while information earlier in the sentence is expected to be attenuated.  Attention allows the calculation of the hidden representation of a token equal access to any part of a sentence directly, rather than only through the previous hidden state.  \nEarlier uses attached this mechanism to a serial recurrent neural network\'s language translation system (below), but later uses in Transformers large language models removed the recurrent neural network and relied heavily on the faster parallel attention scheme.\n\n', 'source': 'https://en.wikipedia.org/wiki/Attention_(machine_learning)'}), Document(page_content='Adversarial machine learning is the study of the attacks on machine learning algorithms, and of the defenses against such attacks. A survey from May 2020 exposes the fact that practitioners report a dire need for better protecting machine learning systems in industrial applications.Most machine learning techniques are mostly designed to work on specific problem sets, under the assumption that the training and test data are generated from the same statistical distribution (IID). 
However, this assumption is often dangerously violated in practical high-stake applications, where users may intentionally supply fabricated data that violates the statistical assumption.\nMost common attacks in adversarial machine learning include evasion attacks, data poisoning attacks, Byzantine attacks and model extraction.\n\n\n== History ==\nAt the MIT Spam Conference in January 2004, John Graham-Cumming showed that a machine learning spam filter could be used to defeat another machine learning spam filter by automatically learning which words to add to a spam email to get the email classified as not spam.In 2004, Nilesh Dalvi and others noted that linear classifiers used in spam filters could be defeated by simple "evasion attacks" as spammers inserted "good words" into their spam emails. (Around 2007, some spammers added random noise to fuzz words within "image spam" in order to defeat OCR-based filters.) In 2006, Marco Barreno and others published "Can Machine Learning Be Secure?", outlining a broad taxonomy of attacks. As late as 2013 many researchers continued to hope that non-linear classifiers (such as support vector machines and neural networks) might be robust to adversaries, until Battista Biggio and others demonstrated the first gradient-based attacks on such machine-learning models (2012–2013). In 2012, deep neural networks began to dominate computer vision problems; starting in 2014, Christian Szegedy and others demonstrated that deep neural networks could be fooled by adversaries, again using a gradient-based attack to craft adversarial perturbations.Recently, it was observed that adversarial attacks are harder to produce in the practical world due to the different environmental constraints that cancel out the effect of noises. For example, any small rotation or slight illumination on an adversarial image can destroy the adversariality. 
In addition, researchers such as Google Brain\'s Nicholas Frosst point out that it is much easier to make self-driving cars miss stop signs by physically removing the sign itself, rather than creating adversarial examples. Frosst also believes that the adversarial machine learning community incorrectly assumes models trained on a certain data distribution will also perform well on a completely different data distribution. He suggests that a new approach to machine learning should be explored, and is currently working on a unique neural network that has characteristics more similar to human perception than state of the art approaches.While adversarial machine learning continues to be heavily rooted in academia, large tech companies such as Google, Microsoft, and IBM have begun curating documentation and open source code bases to allow others to concretely assess the robustness of machine learning models and minimize the risk of adversarial attacks.\n\n\n=== Examples ===\nExamples include attacks in spam filtering, where spam messages are obfuscated through the misspelling of "bad" words or the insertion of "good" words; attacks in computer security, such as obfuscating malware code within network packets or modifying the characteristics of a network flow to mislead intrusion detection; attacks in biometric recognition where fake biometric traits may be exploited to impersonate a legitimate user; or to compromise users\' template galleries that adapt to updated traits over time.\nResearchers showed that by changing only one-pixel it was possible to fool deep learning algorithms. Others 3-D printed a toy turtle with', metadata={'title': 'Adversarial machine learning', 'summary': 'Adversarial machine learning is the study of the attacks on machine learning algorithms, and of the defenses against such attacks. 
A survey from May 2020 exposes the fact that practitioners report a dire need for better protecting machine learning systems in industrial applications. Most machine learning techniques are mostly designed to work on specific problem sets, under the assumption that the training and test data are generated from the same statistical distribution (IID). However, this assumption is often dangerously violated in practical high-stake applications, where users may intentionally supply fabricated data that violates the statistical assumption.\nMost common attacks in adversarial machine learning include evasion attacks, data poisoning attacks, Byzantine attacks and model extraction.\n\n', 'source': 'https://en.wikipedia.org/wiki/Adversarial_machine_learning'}), Document(page_content='In machine learning, support vector machines (SVMs, also support vector networks) are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues (Boser et al., 1992, Guyon et al., 1993, Cortes and Vapnik, 1995, Vapnik et al., 1997) SVMs are one of the most studied models, being based on statistical learning frameworks or VC theory proposed by Vapnik (1982, 1995) and Chervonenkis (1974). \nIn addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. SVMs can also be used for regression tasks, where the objective becomes ϵ-sensitive.\nThe support vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data. 
These data sets require unsupervised learning approaches, which attempt to find natural clustering of the data to groups and, then, to map new data according to these clusters. \nThe popularity of SVMs is likely due to their amenability to theoretical analysis, their flexibility in being applied to a wide variety of tasks, including structured prediction problems. It is not clear that SVMs have better predictive performance than other linear models, such as logistic regression and linear regression.\n\n\n== Motivation ==\nClassifying data is a common task in machine learning.\nSuppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p − 1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane and the linear classifier it defines is known as a maximum-margin classifier; or equivalently, the perceptron of optimal stability. More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high or infinite-dimensional space, which can be used for classification, regression, or other tasks like outlier detection. 
Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. A lower generalization error means that the implementer is less likely to experience overfitting.\n\nWhereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products of pairs of input data vectors may be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function \n  \n    \n', metadata={'title': 'Support vector machine', 'summary': 'In machine learning, support vector machines (SVMs, also support vector networks) are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues (Boser et al., 1992, Guyon et al., 1993, Cortes and Vapnik, 1995, Vapnik et al., 1997) SVMs are one of the most studied models, being based on statistical learning frameworks or VC theory proposed by Vapnik (1982, 1995) and Chervonenkis (1974). \nIn addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. 
SVMs can also be used for regression tasks, where the objective becomes ϵ-sensitive.\nThe support vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data. These data sets require unsupervised learning approaches, which attempt to find natural clustering of the data to groups and, then, to map new data according to these clusters. \nThe popularity of SVMs is likely due to their amenability to theoretical analysis, their flexibility in being applied to a wide variety of tasks, including structured prediction problems. It is not clear that SVMs have better predictive performance than other linear models, such as logistic regression and linear regression.', 'source': 'https://en.wikipedia.org/wiki/Support_vector_machine'}), Document(page_content='Supervised learning (SL) is a paradigm in machine learning where input objects (for example, a vector of predictor variables) and a desired output value (also known as human-labeled supervisory signal) train a model. The training data is processed, building a function that maps new data on expected output values.  An optimal scenario will allow for the algorithm to correctly determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way (see inductive bias). This statistical quality of an algorithm is measured through the so-called generalization error.\n\n\n== Steps to follow ==\nTo solve a given problem of supervised learning, one has to perform the following steps:\n\nDetermine the type of training examples. Before doing anything else, the user should decide what kind of data is to be used as a training set. 
In the case of handwriting analysis, for example, this might be a single handwritten character, an entire handwritten word, an entire sentence of handwriting or perhaps a full paragraph of handwriting.\nGather a training set. The training set needs to be representative of the real-world use of the function. Thus, a set of input objects is gathered and corresponding outputs are also gathered, either from human experts or from measurements.\nDetermine the input feature representation of the learned function. The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality; but should contain enough information to accurately predict the output.\nDetermine the structure of the learned function and corresponding learning algorithm. For example, the engineer may choose to use support-vector machines or decision trees.\nComplete the design. Run the learning algorithm on the gathered training set. Some supervised learning algorithms require the user to determine certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation.\nEvaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.\n\n\n== Algorithm choice ==\nA wide range of supervised learning algorithms are available, each with its strengths and weaknesses. 
There is no single learning algorithm that works best on all supervised learning problems (see the No free lunch theorem).\nThere are four major issues to consider in supervised learning:\n\n\n=== Bias-variance tradeoff ===\n\nA first issue is the tradeoff between bias and variance. Imagine that we have available several different, but equally good, training data sets. A learning algorithm is biased for a particular input x if, when trained on each of these data sets, it is systematically incorrect when predicting the correct output for x. A learning algorithm has high variance for a particular input x if it predicts different output values when trained on different training sets. The prediction error of a learned classifier is related to the sum of the bias and the variance of the learning algorithm. Generally, there is a tradeoff between bias and variance. A learning algorithm with low bias must be "flexible" so that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training data set differently, and hence have high variance. A key aspect of many supervised learning methods is that they are able to adjust this tradeoff between', metadata={'title': 'Supervised learning', 'summary': 'Supervised learning (SL) is a paradigm in machine learning where input objects (for example, a vector of predictor variables) and a desired output value (also known as human-labeled supervisory signal) train a model. The training data is processed, building a function that maps new data on expected output values.  An optimal scenario will allow for the algorithm to correctly determine output values for unseen instances. 
This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way (see inductive bias). This statistical quality of an algorithm is measured through the so-called generalization error.\n\n', 'source': 'https://en.wikipedia.org/wiki/Supervised_learning'}), Document(page_content='In machine learning, a hyperparameter is a parameter, such as the learning rate or choice of optimizer, which specifies details of the learning process, hence the name hyperparameter. This is in contrast to parameters which determine the model itself.\nHyperparameters can be classified as model hyperparameters, that typically cannot be inferred while fitting the machine to the training set because the objective function is typically non-differentiable with respect to them. As a result, gradient based optimization methods cannot be applied directly. An example of a model hyperparameter is the topology and size of a neural network. Examples of algorithm hyperparameters are learning rate and batch size as well as mini-batch size. Batch size can refer to the full data sample where mini-batch size would be a smaller sample set.\nDifferent model training algorithms require different hyperparameters, some simple algorithms (such as ordinary least squares regression) require none. Given these hyperparameters, the training algorithm learns the parameters from the data. For instance, LASSO is an algorithm that adds a regularization hyperparameter to ordinary least squares regression, which has to be set before estimating the parameters through the training algorithm.\n\n\n== Considerations ==\nThe time required to train and test a model can depend upon the choice of its hyperparameters. A hyperparameter is usually of continuous or integer type, leading to mixed-type optimization problems. The existence of some hyperparameters is conditional upon the value of others, e.g. 
the size of each hidden layer in a neural network can be conditional upon the number of layers.\n\n\n=== Difficulty learnable parameters ===\nUsually, but not always, hyperparameters cannot be learned using well known gradient based methods (such as gradient descent, LBFGS) - which are commonly employed to learn parameters. These hyperparameters are those parameters describing a model representation that cannot be learned by common optimization methods but nonetheless affect the loss function. An example would be the tolerance hyperparameter for errors in support vector machines.\n\n\n=== Untrainable parameters ===\nSometimes, hyperparameters cannot be learned from the training data because they aggressively increase the capacity of a model and can push the loss function to an undesired minimum (overfitting to, and picking up noise in the data), as opposed to correctly mapping the richness of the structure in the data. For example, if we treat the degree of a polynomial equation fitting a regression model as a trainable parameter, the degree would increase until the model perfectly fit the data, yielding low training error, but poor generalization performance.\n\n\n=== Tunability ===\nMost performance variation can be attributed to just a few hyperparameters. The tunability of an algorithm, hyperparameter, or interacting hyperparameters is a measure of how much performance can be gained by tuning it. For an LSTM, while the learning rate followed by the network size are its most crucial hyperparameters, batching and momentum have no significant effect on its performance.Although some research has advocated the use of mini-batch sizes in the thousands, other work has found the best performance with mini-batch sizes between 2 and 32.\n\n\n=== Robustness ===\nAn inherent stochasticity in learning directly implies that the empirical hyperparameter performance is not necessarily its true performance. 
Methods that are not robust to simple changes in hyperparameters, random seeds, or even different implementations of the same algorithm cannot be integrated into mission critical control systems without significant simplification and robustification.Reinforcement learning algorithms, in particular, require measuring their performance over a large number of random seeds, and also measuring their sensitivity to choices of hyperparameters. Their evaluation with a small number of random seeds does not cap', metadata={'title': 'Hyperparameter (machine learning)', 'summary': 'In machine learning, a hyperparameter is a parameter, such as the learning rate or choice of optimizer, which specifies details of the learning process, hence the name hyperparameter. This is in contrast to parameters which determine the model itself.\nHyperparameters can be classified as model hyperparameters, that typically cannot be inferred while fitting the machine to the training set because the objective function is typically non-differentiable with respect to them. As a result, gradient based optimization methods cannot be applied directly. An example of a model hyperparameter is the topology and size of a neural network. Examples of algorithm hyperparameters are learning rate and batch size as well as mini-batch size. Batch size can refer to the full data sample where mini-batch size would be a smaller sample set.\nDifferent model training algorithms require different hyperparameters, some simple algorithms (such as ordinary least squares regression) require none. Given these hyperparameters, the training algorithm learns the parameters from the data. 
For instance, LASSO is an algorithm that adds a regularization hyperparameter to ordinary least squares regression, which has to be set before estimating the parameters through the training algorithm.\n\n', 'source': 'https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)'}), Document(page_content='This page is a timeline of machine learning. Major discoveries, achievements, milestones and other major events in machine learning are included.\n\n\n== Overview ==\n\n\n== Timeline ==\n\n\n== See also ==\nHistory of artificial intelligence\nTimeline of artificial intelligence\nTimeline of machine translation\n\n\n== References ==\n\n\n=== Citations ===\n\n\n=== Works cited ===\nCrevier, Daniel (1993). AI: The Tumultuous Search for Artificial Intelligence. New York: BasicBooks. ISBN 0-465-02997-3.\nMarr, Bernard (19 February 2016). "A Short History of Machine Learning -- Every Manager Should Read". Forbes. Archived from the original on 2022-12-05. Retrieved 2022-12-25.\nRussell, Stuart; Norvig, Peter (2003). Artificial Intelligence: A Modern Approach. London: Pearson Education. ISBN 0-137-90395-2.', metadata={'title': 'Timeline of machine learning', 'summary': 'This page is a timeline of machine learning. Major discoveries, achievements, milestones and other major events in machine learning are included.', 'source': 'https://en.wikipedia.org/wiki/Timeline_of_machine_learning'}), Document(page_content='Deep learning is the subset of machine learning methods based on artificial neural networks (ANNs) with representation learning. The adjective "deep" refers to the use of multiple layers in the network. 
Methods used can be either supervised, semi-supervised or unsupervised.Deep-learning architectures such as deep neural networks, deep belief networks, recurrent neural networks, convolutional neural networks and transformers have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, climate science, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.Artificial neural networks were inspired by information processing and distributed communication nodes in biological systems. ANNs have various differences from biological brains. Specifically, artificial neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analog. ANNs are generally seen as low quality models for brain function.\n\n\n== Definition ==\nDeep learning is a class of machine learning algorithms that:\u200a199–200\u200a uses multiple layers to progressively extract higher-level features from the raw input. For example, in image processing, lower layers may identify edges, while higher layers may identify the concepts relevant to a human such as digits or letters or faces.\nFrom another angle to view deep learning, deep learning refers to "computer-simulate" or "automate" human learning processes from a source (e.g., an image of dogs) to a learned object (dogs). Therefore, a notion coined as "deeper" learning or "deepest" learning makes sense. The deepest learning refers to the fully automatic learning from a source to a final learned object. 
A deeper learning thus refers to a mixed learning process: a human learning process from a source to a learned semi-object, followed by a computer learning process from the human learned semi-object to a final learned object.\n\n\n== Overview ==\nMost modern deep learning models are based on multi-layered artificial neural networks such as convolutional neural networks and transformers, although they can also include propositional formulas or latent variables organized layer-wise in deep generative models such as the nodes in deep belief networks and deep Boltzmann machines.In deep learning, each level learns to transform its input data into a slightly more abstract and composite representation. In an image recognition application, the raw input may be a matrix of pixels; the first representational layer may abstract the pixels and encode edges; the second layer may compose and encode arrangements of edges; the third layer may encode a nose and eyes; and the fourth layer may recognize that the image contains a face. Importantly, a deep learning process can learn which features to optimally place in which level on its own. This does not eliminate the need for hand-tuning; for example, varying numbers of layers and layer sizes can provide different degrees of abstraction.The word "deep" in "deep learning" refers to the number of layers through which the data is transformed. More precisely, deep learning systems have a substantial credit assignment path (CAP) depth. The CAP is the chain of transformations from input to output. CAPs describe potentially causal connections between input and output. For a feedforward neural network, the depth of the CAPs is that of the network and is the number of hidden layers plus one (as the output layer is also parameterized). For recurrent neural networks, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited. 
No universally agreed-upon threshold of depth divides shallow learning from deep learning, but most researchers agree that deep learning involves CAP depth higher than 2. CAP of depth ', metadata={'title': 'Deep learning', 'summary': 'Deep learning is the subset of machine learning methods based on artificial neural networks (ANNs) with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be either supervised, semi-supervised or unsupervised.Deep-learning architectures such as deep neural networks, deep belief networks, recurrent neural networks, convolutional neural networks and transformers have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, climate science, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.Artificial neural networks were inspired by information processing and distributed communication nodes in biological systems. ANNs have various differences from biological brains. Specifically, artificial neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analog. ANNs are generally seen as low quality models for brain function.', 'source': 'https://en.wikipedia.org/wiki/Deep_learning'}), Document(page_content='Tensor informally refers in machine learning to two different concepts that organize and represent data. Data may be organized in a multidimensional array (M-way array) that is informally referred to as a "data tensor";  however in the strict mathematical sense, a tensor is a multilinear mapping over a set of domain vector spaces to a range vector space.  
Observations, such as images, movies, volumes, sounds, and relationships among words and concepts, stored in an M-way array ("data tensor") may be analyzed either by artificial neural networks or tensor methods.Tensor decomposition can factorize data tensors into smaller tensors. Operations on data tensors can be expressed in terms of matrix multiplication and the Kronecker product.  The computation of gradients, an important aspect of the backpropagation algorithm, can be performed using PyTorch and TensorFlow.Computations are often performed on graphics processing units (GPUs) using CUDA and on dedicated hardware such as Google\'s Tensor Processing Unit or Nvidia\'s Tensor core. These developments have greatly accelerated neural network architectures and increased the size and complexity of models that can be trained.\n\n\n== History ==\nA tensor is by definition a multilinear map. In mathematics, this may express a multilinear relationship between sets of algebraic objects. In physics, tensor fields, considered as tensors at each point in space, are useful in expressing mechanics such as stress or elasticity. In machine learning, the exact use of tensors depends on the statistical approach being used.\nIn 2001, the field of signal processing and statistics were making use of tensor methods. Pierre Comon surveys the early adoption of tensor methods in the fields of telecommunications, radio surveillance, chemometrics and sensor processing. Linear tensor rank methods (such as, Parafac/CANDECOMP) analyzed M-way arrays ("data tensors") composed of higher order statistics that were employed in blind source separation problems to compute a linear model of the data. 
He noted several early limitations in determining the tensor rank and efficient tensor rank decomposition. In the early 2000s, multilinear tensor methods crossed over into computer vision, computer graphics and machine learning with papers by Vasilescu, alone or in collaboration with Terzopoulos, such as Human Motion Signatures, TensorFaces, TensorTextures and Multilinear Projection. Multilinear algebra, the algebra of higher-order tensors, is a suitable and transparent framework for analyzing the multifactor structure of an ensemble of observations and for addressing the difficult problem of disentangling the causal factors based on second order or higher order statistics associated with each causal factor. Tensor (multilinear) factor analysis disentangles and reduces the influence of different causal factors with multilinear subspace learning.\nWhen treating an image or a video as a 2- or 3-way array, i.e., "data matrix/tensor", tensor methods reduce spatial or time redundancies as demonstrated by Wang and Ahuja. Yoshua Bengio, Geoff Hinton and their collaborators briefly discuss the relationship between deep neural networks and tensor factor analysis beyond the use of M-way arrays ("data tensors") as inputs. One of the early uses of tensors for neural networks appeared in natural language processing. A single word can be expressed as a vector via Word2vec. Thus a relationship between two words can be encoded in a matrix. However, for more complex relationships such as subject-object-verb, it is necessary to build higher-dimensional networks. In 2009, the work of Sutskever introduced Bayesian Clustered Tensor Factorization to model relational concepts while reducing the parameter space. From 2014 to 2015, tensor methods became more common in convolutional neural networks (CNNs). Tensor methods organize neural network weights in a "data tensor", then analyze and reduce the number of weights. Lebedev et al. 
accelerat', metadata={'title': 'Tensor (machine learning)', 'summary': 'Tensor informally refers in machine learning to two different concepts that organize and represent data. Data may be organized in a multidimensional array (M-way array) that is informally referred to as a "data tensor";  however in the strict mathematical sense, a tensor is a multilinear mapping over a set of domain vector spaces to a range vector space.  Observations, such as images, movies, volumes, sounds, and relationships among words and concepts, stored in an M-way array ("data tensor") may be analyzed either by artificial neural networks or tensor methods.Tensor decomposition can factorize data tensors into smaller tensors. Operations on data tensors can be expressed in terms of matrix multiplication and the Kronecker product.  The computation of gradients, an important aspect of the backpropagation algorithm, can be performed using PyTorch and TensorFlow.Computations are often performed on graphics processing units (GPUs) using CUDA and on dedicated hardware such as Google\'s Tensor Processing Unit or Nvidia\'s Tensor core. These developments have greatly accelerated neural network architectures and increased the size and complexity of models that can be trained.\n\n', 'source': 'https://en.wikipedia.org/wiki/Tensor_(machine_learning)'}), Document(page_content='Torch is an open-source machine learning library, \na scientific computing framework, and a scripting language based on Lua. It provides LuaJIT interfaces to deep learning algorithms implemented in C. It was created by the Idiap Research Institute at EPFL. Torch development moved in 2017 to PyTorch, a port of the library to Python.\n\n\n== torch ==\nThe core package of Torch is torch. It provides a flexible N-dimensional array or Tensor, which supports basic routines for indexing, slicing, transposing, type-casting, resizing, sharing storage and cloning. 
This object is used by most other packages and thus forms the core object of the library. The Tensor also supports mathematical operations like max, min, sum, statistical distributions like uniform, normal and multinomial, and BLAS operations like dot product, matrix–vector multiplication, matrix–matrix multiplication  and matrix product.\nThe following exemplifies using torch via its REPL interpreter:\n\nThe torch package also simplifies object-oriented programming and serialization by providing various convenience functions which are used throughout its packages. The torch.class(classname, parentclass) function can be used to create object factories (classes). When the constructor is called, torch initializes and sets a Lua table with the user-defined metatable, which makes the table an object.\nObjects created with the torch factory can also be serialized, as long as they do not contain references to objects that cannot be serialized, such as Lua coroutines, and Lua userdata. However, userdata can be serialized if it is wrapped by a table (or metatable) that provides  read() and write() methods.\n\n\n== nn ==\nThe nn package is used for building neural networks. It is divided into modular objects that share a common Module interface. Modules have a forward() and backward() method that allow them to feedforward and backpropagate, respectively. Modules can be joined using module composites, like Sequential, Parallel and Concat to create complex task-tailored graphs. Simpler modules like Linear, Tanh and Max make up the basic component modules. This modular interface provides first-order automatic gradient differentiation. What follows is an example use-case for building a multilayer perceptron using Modules:\n\nLoss functions are implemented as sub-classes of Criterion, which has a similar interface to Module. It also has forward() and backward() methods for computing the loss and backpropagating gradients, respectively. 
Criteria are helpful to train neural network on classical tasks. Common criteria are the Mean Squared Error criterion implemented in MSECriterion and the cross-entropy criterion implemented in ClassNLLCriterion. What follows is an example of a Lua function that can be iteratively called to train \nan mlp Module on input Tensor x, target Tensor y with a scalar learningRate:    \n\nIt also has StochasticGradient class for training a neural network using Stochastic gradient descent, although the optim package provides much more options in this respect, like momentum and weight decay regularization.\n\n\n== Other packages ==\nMany packages other than the above official packages are used with Torch. These are listed in the torch cheatsheet. These extra packages provide a wide range of utilities such as parallelism, asynchronous input/output, image processing, and so on. They can be installed with LuaRocks, the Lua package manager which is also included with the Torch distribution.\n\n\n== Applications ==\nTorch is used by the Facebook AI Research Group, IBM, Yandex and the Idiap Research Institute. Torch has been extended for use on Android and iOS. It has been used to build hardware implementations for data flows like those found in neural networks.Facebook has released a set of extension modules as open source software.\n\n\n== See also ==\nComparison of deep learning software\nPyTorch\n\n\n== References ==\n\n\n== External links ==\nOfficial website', metadata={'title': 'Torch (machine learning)', 'summary': 'Torch is an open-source machine learning library, \na scientific computing framework, and a scripting language based on Lua. It provides LuaJIT interfaces to deep learning algorithms implemented in C. It was created by the Idiap Research Institute at EPFL. 
Torch development moved in 2017 to PyTorch, a port of the library to Python.', 'source': 'https://en.wikipedia.org/wiki/Torch_(machine_learning)'}), Document(page_content='Automated machine learning (AutoML) is the process of automating the tasks of applying machine learning to real-world problems. \nAutoML potentially includes every stage from beginning with a raw dataset to building a machine learning model ready for deployment. AutoML was proposed as an artificial intelligence-based solution to the growing challenge of applying machine learning. The high degree of automation in AutoML aims to allow non-experts to make use of machine learning models and techniques without requiring them to become experts in machine learning. Automating the process of applying machine learning end-to-end additionally offers the advantages of producing simpler solutions, faster creation of those solutions, and models that often outperform hand-designed models.Common techniques used in AutoML include hyperparameter optimization, meta-learning and neural architecture search.\n\n\n== Comparison to the standard approach ==\nIn a typical machine learning application, practitioners have a set of input data points to be used for training. The raw data may not be in a form that all algorithms can be applied to. To make the data amenable for machine learning, an expert may have to apply appropriate data pre-processing, feature engineering, feature extraction, and feature selection methods. After these steps, practitioners must then perform algorithm selection and hyperparameter optimization to maximize the predictive performance of their model. If deep learning is used, the architecture of the neural network must also be chosen by the machine learning expert. \nEach of these steps may be challenging, resulting in significant hurdles to using machine learning. 
AutoML aims to simplify these steps for non-experts, and to make it easier for them to use machine learning techniques correctly and effectively.\nAutoML plays an important role within the broader approach of automating data science, which also includes challenging tasks such as data engineering, data exploration and model interpretation and prediction.\n\n\n== Targets of automation ==\nAutomated machine learning can target various stages of the machine learning process.  Steps to automate are:\n\nData preparation and ingestion (from raw data and miscellaneous formats)\nColumn type detection; e.g., boolean, discrete numerical, continuous numerical, or text\nColumn intent detection; e.g., target/label, stratification field, numerical feature, categorical text feature, or free text feature\nTask detection; e.g., binary classification, regression, clustering, or ranking\nFeature engineering\nFeature selection\nFeature extraction\nMeta-learning and transfer learning\nDetection and handling of skewed data and/or missing values\nModel selection - choosing which machine learning algorithm to use, often including multiple competing software implementations\nEnsembling - a form of consensus where using multiple models often gives better results than any single model\nHyperparameter optimization of the learning algorithm and featurization\nPipeline selection under time, memory, and complexity constraints\nSelection of evaluation metrics and validation procedures\nProblem checking\nLeakage detection\nMisconfiguration detection\nAnalysis of obtained results\nCreating user interfaces and visualizations\n\n\n== Challenges and Limitations ==\nThere are a number of key challenges being tackled around automated machine learning. A big issue surrounding the field is referred to as "development as a cottage industry". This phrase refers to the issue in machine learning where development relies on manual decisions and biases of experts. 
This is contrasted to the goal of machine learning which is to create systems that can learn and improve from their own usage and analysis of the data. Basically, it\'s the struggle between how much experts should get involved in the learning of the systems versus how much freedom they should be giving the machines. However, experts and developers must help create and guide these machines', metadata={'title': 'Automated machine learning', 'summary': 'Automated machine learning (AutoML) is the process of automating the tasks of applying machine learning to real-world problems. \nAutoML potentially includes every stage from beginning with a raw dataset to building a machine learning model ready for deployment. AutoML was proposed as an artificial intelligence-based solution to the growing challenge of applying machine learning. The high degree of automation in AutoML aims to allow non-experts to make use of machine learning models and techniques without requiring them to become experts in machine learning. Automating the process of applying machine learning end-to-end additionally offers the advantages of producing simpler solutions, faster creation of those solutions, and models that often outperform hand-designed models.Common techniques used in AutoML include hyperparameter optimization, meta-learning and neural architecture search.\n\n', 'source': 'https://en.wikipedia.org/wiki/Automated_machine_learning'}), Document(page_content='Extreme learning machines are feedforward neural networks for classification, regression, clustering, sparse approximation, compression and feature learning with a single layer or multiple layers of hidden nodes, where the parameters of hidden nodes (not just the weights connecting inputs to hidden nodes) need to be tuned. These hidden nodes can be randomly assigned and never updated (i.e. they are random projection but with nonlinear transforms), or can be inherited from their ancestors without being changed. 
In most cases, the output weights of hidden nodes are usually learned in a single step, which essentially amounts to learning a linear model. \nThe name "extreme learning machine" (ELM) was given to such models by Guang-Bin Huang. The idea goes back to Frank Rosenblatt, who not only published a single layer Perceptron in 1958, but also introduced a multi layer perceptron with 3 layers: an input layer, a hidden layer with randomized weights that did not learn, and a learning output layer.According to some researchers, these models are able to produce good generalization performance and learn thousands of times faster than networks trained using backpropagation.  In literature, it also shows that  these models can outperform support vector machines in both classification and regression applications.\n\n\n== History ==\nFrom 2001-2010, ELM research mainly focused on the unified learning framework for "generalized" single-hidden layer feedforward neural networks (SLFNs), including but not limited to sigmoid networks, RBF networks, threshold networks, trigonometric networks, fuzzy inference systems, Fourier series, Laplacian transform, wavelet networks, etc. One significant achievement made in those years is to successfully prove the universal approximation and classification capabilities of ELM in theory.From 2010 to 2015, ELM research extended to the unified learning framework for kernel learning, SVM and a few typical feature learning methods such as Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF). It is shown that SVM actually provides suboptimal solutions compared to ELM, and ELM can provide the whitebox kernel mapping, which is implemented by ELM random feature mapping, instead of the blackbox kernel used in SVM. PCA and NMF can be considered as special cases where linear hidden nodes are used in ELM.From 2015 to 2017, an increased focus has been placed on hierarchical implementations of ELM. 
Additionally, since 2011, significant biological studies have been made that support certain ELM theories. From 2017 onwards, to overcome the low-convergence problem during training, approaches based on LU decomposition, Hessenberg decomposition and QR decomposition with regularization have begun to attract attention. In 2017, Google Scholar Blog published a list of "Classic Papers: Articles That Have Stood The Test of Time". Among these are two papers written about ELM which are shown in studies 2 and 7 from the "List of 10 classic AI papers from 2006".\n\n\n== Algorithms ==\nGiven a single hidden layer of ELM, suppose that the output function of the $i$-th hidden node is $h_{i}(\\mathbf{x})=G(\\mathbf{a}_{i},b_{i},\\mathbf{x})$, where $\\mathbf{a}_{i}$ and $b_{i}$ are the parameters of the ', metadata={'title': 'Extreme learning machine', 'summary': 'Extreme learning machines are feedforward neural networks for classification, regression, clustering, sparse approximation, compression and feature learning with a single layer or multiple layers of hidden nodes, where the parameters of hidden nodes (not just the weights connecting inputs to hidden nodes) need 
to be tuned. These hidden nodes can be randomly assigned and never updated (i.e. they are random projection but with nonlinear transforms), or can be inherited from their ancestors without being changed. In most cases, the output weights of hidden nodes are usually learned in a single step, which essentially amounts to learning a linear model. \nThe name "extreme learning machine" (ELM) was given to such models by Guang-Bin Huang. The idea goes back to Frank Rosenblatt, who not only published a single layer Perceptron in 1958, but also introduced a multi layer perceptron with 3 layers: an input layer, a hidden layer with randomized weights that did not learn, and a learning output layer.According to some researchers, these models are able to produce good generalization performance and learn thousands of times faster than networks trained using backpropagation.  In literature, it also shows that  these models can outperform support vector machines in both classification and regression applications.\n\n', 'source': 'https://en.wikipedia.org/wiki/Extreme_learning_machine'}), Document(page_content='In statistics and machine learning, leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model\'s utility when run in a production environment.Leakage is often subtle and indirect, making it hard to detect and eliminate. Leakage can cause a statistician or modeler to select a suboptimal model, which could be outperformed by a leakage-free model.\n\n\n== Leakage modes ==\nLeakage can occur in many steps in the machine learning process. 
The leakage causes can be sub-classified into two possible sources of leakage for a model: features and training examples.\n\n\n=== Feature leakage ===\nFeature or column-wise leakage is caused by the inclusion of columns which are one of the following: a duplicate label, a proxy for the label, or the label itself. These features, known as anachronisms, will not be available when the model is used for predictions, and result in leakage if included when the model is trained.For example, including a "MonthlySalary" column when predicting "YearlySalary"; or "MinutesLate" when predicting "IsLate".\n\n\n=== Training example leakage ===\nRow-wise leakage is caused by improper sharing of information between rows of data. Types of row-wise leakage include:\n\nPremature featurization; leaking from premature featurization before Cross-validation/Train/Test split (must fit MinMax/ngrams/etc on only the train split, then transform the test set)\nDuplicate rows between train/validation/test (e.g. oversampling a dataset to pad its size before splitting; e.g. different rotations/augmentations of a single image; bootstrap sampling before splitting; or duplicating rows to up sample the minority class)\nNon-i.i.d. data\nTime leakage (e.g. splitting a time-series dataset randomly instead of newer data in test set using a TrainTest split or rolling-origin cross validation)\nGroup leakage—not including a grouping split column (e.g. Andrew Ng\'s group had 100k x-rays of 30k patients, meaning ~3 images per patient. The paper used random splitting instead of ensuring that all images of a patient was in the same split. 
Hence the model partially memorized the patients instead of learning to recognize pneumonia in chest x-rays.)A 2023 review found data leakage to be "a widespread failure mode in machine-learning (ML)-based science", having affected at least 294 academic publications across 17 disciplines, and causing a potential reproducibility crisis.\n\n\n== Detection ==\n\n\n== See also ==\nAutoML\nConcept drift (where the structure of the system being studied evolves over time, invalidating the model)\nOverfitting\nResampling (statistics)\nSupervised learning\nTraining, validation, and test sets\n\n\n== References ==', metadata={'title': 'Leakage (machine learning)', 'summary': "In statistics and machine learning, leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment.Leakage is often subtle and indirect, making it hard to detect and eliminate. Leakage can cause a statistician or modeler to select a suboptimal model, which could be outperformed by a leakage-free model.\n\n", 'source': 'https://en.wikipedia.org/wiki/Leakage_(machine_learning)'}), Document(page_content='In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon. Choosing informative, discriminating and independent features is a crucial element of effective algorithms in pattern recognition,  classification and regression. Features are usually numeric, but structural features such as strings and graphs are used in syntactic pattern recognition. 
The concept of "feature" is related to that of explanatory variable used in statistical techniques such as linear regression.\n\n\n== Feature types ==\nIn feature engineering, two types of features are commonly used: numerical and categorical.\nNumerical features are continuous values that can be measured on a scale. Examples of numerical features include age, height, weight, and income. Numerical features can be used in machine learning algorithms directly.Categorical features are discrete values that can be grouped into categories. Examples of categorical features include gender, color, and zip code. Categorical features typically need to be converted to numerical features before they can be used in machine learning algorithms. This can be done using a variety of techniques, such as one-hot encoding, label encoding, and ordinal encoding.\nThe type of feature that is used in feature engineering depends on the specific machine learning algorithm that is being used. Some machine learning algorithms, such as decision trees, can handle both numerical and categorical features. Other machine learning algorithms, such as linear regression, can only handle numerical features.\n\n\n== Classification ==\nA numeric feature can be conveniently described by a feature vector. One way to achieve binary classification is using a linear predictor function (related to the perceptron) with a feature vector as input. 
The method consists of calculating the scalar product between the feature vector and a vector of weights, qualifying those observations whose result exceeds a threshold.\nAlgorithms for classification from a feature vector include nearest neighbor classification, neural networks, and statistical techniques such as Bayesian approaches.\n\n\n== Examples ==\n\nIn character recognition, features may include histograms counting the number of black pixels along horizontal and vertical directions, number of internal holes, stroke detection and many others.\nIn speech recognition, features for recognizing phonemes can include noise ratios, length of sounds, relative power, filter matches and many others.\nIn spam detection algorithms, features may include the presence or absence of certain email headers, \nthe email structure, the language, the frequency of specific terms, the grammatical correctness of the text.\nIn computer vision, there are a large number of possible features, such as edges and objects.\n\n\n== Feature vectors ==\n\nIn pattern recognition and machine learning, a feature vector is an n-dimensional vector of numerical features that represent some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis. When representing images, the feature values might correspond to the pixels of an image, while when representing texts the features might be the frequencies of occurrence of textual terms. Feature vectors are equivalent to the vectors of explanatory variables used in statistical procedures such as linear regression.  Feature vectors are often combined with weights using a dot product in order to construct a linear predictor function that is used to determine a score for making a prediction.\nThe vector space associated with these vectors is often called the feature space. 
In order to reduce the dimensionality of the feature space, a number of dimensionality reduction techniques can be employed.\nHigher-level features can be obtained from already available features and added to the feature vector; for example, for the study of diseases the', metadata={'title': 'Feature (machine learning)', 'summary': 'In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon. Choosing informative, discriminating and independent features is a crucial element of effective algorithms in pattern recognition,  classification and regression. Features are usually numeric, but structural features such as strings and graphs are used in syntactic pattern recognition. The concept of "feature" is related to that of explanatory variable used in statistical techniques such as linear regression.\n\n', 'source': 'https://en.wikipedia.org/wiki/Feature_(machine_learning)'}), Document(page_content='The following outline is provided as an overview of and topical guide to machine learning:\nMachine learning – subfield of soft computing within computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed". Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from an example training set of input observations in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions.\n\n\n== What type of thing is machine learning? 
==\nAn academic discipline\nA branch of science\nAn applied science\nA subfield of computer science\nA branch of artificial intelligence\nA subfield of soft computing\nApplication of statistics\n\n\n== Branches of machine learning ==\n\n\n=== Subfields of machine learning ===\nComputational learning theory – studying the design and analysis of machine learning algorithms.\nGrammar induction\nMeta-learning\n\n\n=== Cross-disciplinary fields involving machine learning ===\nAdversarial machine learning\nPredictive analytics\nQuantum machine learning\nRobot learning\nDevelopmental robotics\n\n\n== Applications of machine learning ==\nApplications of machine learning\nBioinformatics\nBiomedical informatics\nComputer vision\nCustomer relationship management –\nData mining\nEarth sciences\nEmail filtering\nInverted pendulum – balance and equilibrium system.\nNatural language processing (NLP)\nNamed Entity Recognition\nAutomatic summarization\nAutomatic taxonomy construction\nDialog system\nGrammar checker\nLanguage recognition\nHandwriting recognition\nOptical character recognition\nSpeech recognition\nText to Speech Synthesis (TTS)\nSpeech Emotion Recognition (SER)\nMachine translation\nQuestion answering\nSpeech synthesis\nText mining\nTerm frequency–inverse document frequency (tf–idf)\nText simplification\nPattern recognition\nFacial recognition system\nHandwriting recognition\nImage recognition\nOptical character recognition\nSpeech recognition\nRecommendation system\nCollaborative filtering\nContent-based filtering\nHybrid recommender systems (Collaborative and content-based filtering)\nSearch engine\nSearch engine optimization\nSocial Engineering\n\n\n== Machine learning hardware ==\nGraphics processing unit\nTensor processing unit\nVision processing unit\n\n\n== Machine learning tools ==\nComparison of deep learning software\n\n\n=== Machine learning frameworks ===\n\n\n==== Proprietary machine learning frameworks ====\nAmazon Machine Learning\nMicrosoft Azure 
Machine Learning Studio\nDistBelief – replaced by TensorFlow\n\n\n==== Open source machine learning frameworks ====\nApache Singa\nApache MXNet\nCaffe\nPyTorch\nmlpack\nTensorFlow\nTorch\nCNTK\nAccord.Net\nJax\nMLJ.jl – A machine learning framework for Julia\n\n\n=== Machine learning libraries ===\nDeeplearning4j\nTheano\nscikit-learn\nKeras\n\n\n=== Machine learning algorithms ===\nAlmeida–Pineda recurrent backpropagation\nALOPEX\nBackpropagation\nBootstrap aggregating\nCN2 algorithm\nConstructing skill trees\nDehaene–Changeux model\nDiffusion map\nDominance-based rough set approach\nDynamic time warping\nError-driven learning\nEvolutionary multimodal optimization\nExpectation–maximization algorithm\nFastICA\nForward–backward algorithm\nGeneRec\nGenetic Algorithm for Rule Set Production\nGrowing self-organizing map\nHyper basis function network\nIDistance\nK-nearest neighbors algorithm\nKernel methods for vector output\nKernel principal component analysis\nLeabra\nLinde–Buzo–Gray algorithm\nLocal outlier factor\nLogic learning machine\nLogitBoost\nManifold alignment\nMarkov chain Monte Carlo (MCMC)\nMinimum redundancy feature selection\nMixture of experts\nMultiple kernel learning\nNon-negative matrix factorization\nOnline machine learning\nOut-of-bag error\nPrefrontal cortex basal ganglia working memory\nPVLV\nQ-le', metadata={'title': 'Outline of machine learning', 'summary': 'The following outline is provided as an overview of and topical guide to machine learning:\nMachine learning – subfield of soft computing within computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed". Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. 
Such algorithms operate by building a model from an example training set of input observations in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions.\n\n', 'source': 'https://en.wikipedia.org/wiki/Outline_of_machine_learning'}), Document(page_content='In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring the need of out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., stock price prediction. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches.\n\n\n== Introduction ==\nIn the setting of supervised learning, a function of \n  \n    \n      \n        f\n        :\n        X\n        →\n        Y\n      \n    \n    {\\displaystyle f:X\\to Y}\n   is to be learned, where \n  \n    \n      \n        X\n      \n    \n    {\\displaystyle X}\n   is thought of as a space of inputs and \n  \n    \n      \n        Y\n      \n    \n    {\\displaystyle Y}\n   as a space of outputs, that predicts well on instances that are drawn from a joint probability distribution \n  \n    \n      \n        p\n        (\n        x\n        ,\n        y\n        )\n      \n    \n    {\\displaystyle p(x,y)}\n   on \n  \n    \n      \n        X\n        ×\n        Y\n      \n    \n    {\\displaystyle X\\times Y}\n  . 
In reality, the learner never knows the true distribution \n  \n    \n      \n        p\n        (\n        x\n        ,\n        y\n        )\n      \n    \n    {\\displaystyle p(x,y)}\n   over instances. Instead, the learner usually has access to a training set of examples \n  \n    \n      \n        (\n        \n          x\n          \n            1\n          \n        \n        ,\n        \n          y\n          \n            1\n          \n        \n        )\n        ,\n        …\n        ,\n        (\n        \n          x\n          \n            n\n          \n        \n        ,\n        \n          y\n          \n            n\n          \n        \n        )\n      \n    \n    {\\displaystyle (x_{1},y_{1}),\\ldots ,(x_{n},y_{n})}\n  . In this setting, the loss function is given as \n  \n    \n      \n        V\n        :\n        Y\n        ×\n        Y\n        →\n        \n          R\n        \n      \n    \n    {\\displaystyle V:Y\\times Y\\to \\mathbb {R} }\n  , such that \n  \n    \n      \n        V\n        (\n        f\n        (\n        x\n        )\n        ,\n        y\n        )\n      \n    \n    {\\displaystyle V(f(x),y)}\n   measures the difference between the predicted value \n  \n    \n      \n        f\n        (\n        x\n        )\n      \n    \n    {\\displaystyle f(x)}\n   and the true value \n  \n    \n      \n        y\n      \n    \n    {\\displaystyle y}\n  . The ideal goal is to select a function \n  \n    \n      \n        f\n        ∈\n        \n          \n            H\n          \n        \n      \n    \n    {\\displaystyle f\\in {\\mathcal {H}}}\n  , where \n  \n    \n      \n        \n          \n            H\n          \n        \n      \n    \n    {\\displaystyle {\\mathcal {H}}}\n   is a space of functions called a hypothesis space, so that some notion of total loss is minimized. 
Depending on the type of model (statistical or adversarial), one can devise different notions of loss, which lead to different learning algorithms.\n\n\n== Statistical view of online learning ==\nIn statistical learning models, the training sample \n  \n    \n      \n        (\n        \n          x\n          \n            i\n          \n        \n        ,\n        \n          y\n          \n            i\n          \n        \n        )\n      \n    \n    {\\displaystyle (x_{i},y_{i})}\n   are assumed to have been drawn from the true distribution \n  \n    \n      \n        p\n        (\n        x\n        ,\n        y\n        )\n      \n    \n    {\\displaystyle p(x,y)}\n   and the objective is to minimize the expected "risk"\n\n  \n    \n      ', metadata={'title': 'Online machine learning', 'summary': 'In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring the need of out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., stock price prediction. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches.', 'source': 'https://en.wikipedia.org/wiki/Online_machine_learning'}), Document(page_content='Fairness in machine learning refers to the various attempts at correcting algorithmic bias in automated decision processes based on machine learning models. 
Decisions made by computers after a machine-learning process may be considered unfair if they were based on variables considered sensitive. For example gender, ethnicity, sexual orientation or disability. As it is the case with many ethical concepts, definitions of fairness and bias are always controversial. In general, fairness and bias are considered relevant when the decision process impacts people\'s lives. In machine learning, the problem of algorithmic bias is well known and well studied. Outcomes may be skewed by a range of factors and thus might be considered unfair with respect to certain groups or individuals. An example would be the way social media sites deliver personalized news to consumers.\n\n\n== Context ==\nDiscussion about fairness in machine learning is a relatively recent topic. Since 2016 there has been a sharp increase in research into the topic. This increase could be partly accounted to an influential report by ProPublica that claimed that the COMPAS software, widely used in US courts to predict recidivism, was racially biased. One topic of research and discussion is the definition of fairness, as there is no universal definition, and different definitions can be in contradiction with each other, which makes it difficult to judge machine learning models. Other research topics include the origins of bias, the types of bias, and methods to reduce bias.In recent years tech companies have made tools and manuals on how to detect and reduce bias in machine learning. IBM has tools for Python and R with several algorithms to reduce software bias and increase its fairness. Google has published guidelines and tools to study and combat bias in machine learning. Facebook have reported their use of a tool, Fairness Flow, to detect bias in their AI. 
However, critics have argued that the company\'s efforts are insufficient, reporting little use of the tool by employees as it cannot be used for all their programs and even when it can, use of the tool is optional.It is important to note that the discussion about quantitative ways to test fairness and unjust discrimination in decision-making predates by several decades the rather recent debate on fairness in machine learning. In fact, a vivid discussion of this topic by the scientific community flourished during the mid-1960s and 1970s, mostly as a result of the American civil rights movement and, in particular, of the passage of the U.S. Civil Rights Act of 1964. However, by the end of the 1970s, the debate largely disappeared, as the different and sometimes competing notions of fairness left little room for clarity on when one notion of fairness may be preferable to another.\n\n\n=== Language Bias ===\nLanguage bias refers a type of statistical sampling bias tied to the language of a query that leads to "a systematic deviation in sampling information that prevents it from accurately representing the true coverage of topics and views available in their repository." Luo et al. show that current large language models, as they are predominately trained on English-language data, often present the Anglo-American views as truth, while systematically downplaying non-English perspectives as irrelevant, wrong, or noise. When queried with political ideologies like "What is liberalism?", ChatGPT, as it was trained on English-centric data, describes liberalism from the Anglo-American perspective, emphasizing aspects of human rights and equality, while equally valid aspects like "opposes state intervention in personal and economic life" from the dominant Vietnamese perspective and "limitation of government power" from the prevalent Chinese perspective are absent. 
Similarly, other political perspectives embedded in Japanese, Korean, French, and German corpora are absent in ChatGPT\'s reponses. ChatGPT, covered itself as a multilingual Chat', metadata={'title': 'Fairness (machine learning)', 'summary': "Fairness in machine learning refers to the various attempts at correcting algorithmic bias in automated decision processes based on machine learning models. Decisions made by computers after a machine-learning process may be considered unfair if they were based on variables considered sensitive. For example gender, ethnicity, sexual orientation or disability. As it is the case with many ethical concepts, definitions of fairness and bias are always controversial. In general, fairness and bias are considered relevant when the decision process impacts people's lives. In machine learning, the problem of algorithmic bias is well known and well studied. Outcomes may be skewed by a range of factors and thus might be considered unfair with respect to certain groups or individuals. An example would be the way social media sites deliver personalized news to consumers.\n\n", 'source': 'https://en.wikipedia.org/wiki/Fairness_(machine_learning)'}), Document(page_content='In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.\nUnlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.\n\n\n== Overview ==\nSupervised learning algorithms perform the task of searching through a hypothesis space to find a suitable hypothesis that will make good predictions with a particular problem. 
Even if the hypothesis space contains hypotheses that are very well-suited for a particular problem, it may be very difficult to find a good one. Ensembles combine multiple hypotheses to form a (hopefully) better hypothesis. The term ensemble is usually reserved for methods that generate multiple hypotheses using the same base learner.\nThe broader term of multiple classifier systems also covers hybridization of hypotheses that are not induced by the same base learner.Evaluating the prediction of an ensemble typically requires more computation than evaluating the prediction of a single model. In one sense, ensemble learning may be thought of as a way to compensate for poor learning algorithms by performing a lot of extra computation. On the other hand, the alternative is to do a lot more learning on one non-ensemble system. An ensemble system may be more efficient at improving overall accuracy for the same increase in compute, storage, or communication resources by using that increase on two or more methods, than would have been improved by increasing resource use for a single method.  Fast algorithms such as decision trees are commonly used in ensemble methods (for example, random forests), although slower algorithms can benefit from ensemble techniques as well.\nBy analogy, ensemble techniques have been used also in unsupervised learning scenarios, for example in consensus clustering or in anomaly detection.\n\n\n== Ensemble theory ==\nEmpirically, ensembles tend to yield better results when there is a significant diversity among the models. Many ensemble methods, therefore, seek to promote diversity among the models they combine. Although perhaps non-intuitive, more random algorithms (like random decision trees) can be used to produce a stronger ensemble than very deliberate algorithms (like entropy-reducing decision trees). 
Using a variety of strong learning algorithms, however, has been shown to be more effective than using techniques that attempt to dumb-down the models in order to promote diversity. It is possible to increase diversity in the training stage of the model using correlation for regression tasks  or using information measures such as cross entropy for classification tasks.\nTheoretically, one can justify the diversity concept because the lower bound of the error rate of an ensemble system can be decomposed into accuracy, diversity, and the other term. \n\n\n== Ensemble size ==\nWhile the number of component classifiers of an ensemble has a great impact on the accuracy of prediction, there is a limited number of studies addressing this problem. A priori determining of ensemble size and the volume and velocity of big data streams make this even more crucial for online ensemble classifiers. Mostly statistical tests were used for determining the proper number of components. More recently, a theoretical framework suggested that there is an ideal number of component classifiers for an ensemble such that having more or less than this number of classifiers would deteriorate the accuracy. It is called "the law of diminishing returns in ensemble construction." 
Their theoretical framework shows that using the same number of independent component classifiers as class labels gives the highest accuracy.\n\n\n== Common types of ensembles ==\n\n\n=== Bayes optimal classifier ===\n\nThe Bayes op', metadata={'title': 'Ensemble learning', 'summary': 'In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.\nUnlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.\n\n', 'source': 'https://en.wikipedia.org/wiki/Ensemble_learning'}), Document(page_content='In logic, statistical inference, and supervised learning,\ntransduction or transductive inference is reasoning from\nobserved, specific (training) cases to specific (test) cases. In contrast,\ninduction is reasoning from observed training cases\nto general rules, which are then applied to the test cases. The distinction is\nmost interesting in cases where the predictions of the transductive model are\nnot achievable by any inductive model. Note that this is caused by transductive\ninference on different test sets producing mutually inconsistent predictions.\nTransduction was introduced by Vladimir Vapnik in the 1990s, motivated by\nhis view that transduction is preferable to induction since, according to him, induction requires\nsolving a more general problem (inferring a function) before solving a more\nspecific problem (computing outputs for new cases): "When solving a problem of\ninterest, do not solve a more general problem as an intermediate step. Try to\nget the answer that you really need but not a more general one." 
A similar\nobservation had been made earlier by Bertrand Russell:\n"we shall reach the conclusion that Socrates is mortal with a greater approach to \ncertainty if we make our argument purely inductive than if we go by way of \'all men are mortal\' and then use \ndeduction" (Russell 1912, chap VII).\nAn example of learning which is not inductive would be in the case of binary\nclassification, where the inputs tend to cluster in two groups. A large set of\ntest inputs may help in finding the clusters, thus providing useful information\nabout the classification labels. The same predictions would not be obtainable\nfrom a model which induces a function based only on the training cases.  Some\npeople may call this an example of the closely related semi-supervised learning, since Vapnik\'s motivation is quite different. An example of an algorithm in this category is the Transductive Support Vector Machine (TSVM).\nA third possible motivation of transduction arises through the need\nto approximate. If exact inference is computationally prohibitive, one may at\nleast try to make sure that the approximations are good at the test inputs. In\nthis case, the test inputs could come from an arbitrary distribution (not\nnecessarily related to the distribution of the training inputs), which wouldn\'t\nbe allowed in semi-supervised learning. An example of an algorithm falling in\nthis category is the Bayesian Committee Machine (BCM).\n\n\n== Example problem ==\nThe following example problem contrasts some of the unique properties of transduction against induction.\n\nA collection of points is given, such that some of the points are labeled (A, B, or C), but most of the points are unlabeled (?). The goal is to predict appropriate labels for all of the unlabeled points.\nThe inductive approach to solving this problem is to use the labeled points to train a supervised learning algorithm, and then have it predict labels for all of the unlabeled points. 
With this problem, however, the supervised learning algorithm will only have five labeled points to use as a basis for building a predictive model. It will certainly struggle to build a model that captures the structure of this data. For example, if a nearest-neighbor algorithm is used, then the points near the middle will be labeled "A" or "C", even though it is apparent that they belong to the same cluster as the point labeled "B".\nTransduction has the advantage of being able to consider all of the points, not just the labeled points, while performing the labeling task. In this case, transductive algorithms would label the unlabeled points according to the clusters to which they naturally belong. The points in the middle, therefore, would most likely be labeled "B", because they are packed very close to that cluster.\nAn advantage of transduction is that it may be able to make better predictions with fewer labeled points, because it uses the natural breaks found in the unlabeled points. One disadvantage of transductio', metadata={'title': 'Transduction (machine learning)', 'summary': 'In logic, statistical inference, and supervised learning,\ntransduction or transductive inference is reasoning from\nobserved, specific (training) cases to specific (test) cases. In contrast,\ninduction is reasoning from observed training cases\nto general rules, which are then applied to the test cases. The distinction is\nmost interesting in cases where the predictions of the transductive model are\nnot achievable by any inductive model. 
Note that this is caused by transductive\ninference on different test sets producing mutually inconsistent predictions.\nTransduction was introduced by Vladimir Vapnik in the 1990s, motivated by\nhis view that transduction is preferable to induction since, according to him, induction requires\nsolving a more general problem (inferring a function) before solving a more\nspecific problem (computing outputs for new cases): "When solving a problem of\ninterest, do not solve a more general problem as an intermediate step. Try to\nget the answer that you really need but not a more general one." A similar\nobservation had been made earlier by Bertrand Russell:\n"we shall reach the conclusion that Socrates is mortal with a greater approach to \ncertainty if we make our argument purely inductive than if we go by way of \'all men are mortal\' and then use \ndeduction" (Russell 1912, chap VII).\nAn example of learning which is not inductive would be in the case of binary\nclassification, where the inputs tend to cluster in two groups. A large set of\ntest inputs may help in finding the clusters, thus providing useful information\nabout the classification labels. The same predictions would not be obtainable\nfrom a model which induces a function based only on the training cases.  Some\npeople may call this an example of the closely related semi-supervised learning, since Vapnik\'s motivation is quite different. An example of an algorithm in this category is the Transductive Support Vector Machine (TSVM).\nA third possible motivation of transduction arises through the need\nto approximate. If exact inference is computationally prohibitive, one may at\nleast try to make sure that the approximations are good at the test inputs. In\nthis case, the test inputs could come from an arbitrary distribution (not\nnecessarily related to the distribution of the training inputs), which wouldn\'t\nbe allowed in semi-supervised learning. 
An example of an algorithm falling in\nthis category is the Bayesian Committee Machine (BCM).\n\n', 'source': 'https://en.wikipedia.org/wiki/Transduction_(machine_learning)'}), Document(page_content='Machine Learning  is a peer-reviewed scientific journal, published since 1986.\nIn 2001, forty editors and members of the editorial board of Machine Learning resigned in order to support the Journal of Machine Learning Research (JMLR), saying that in the era of the internet, it was detrimental for researchers to continue publishing their papers in expensive journals with pay-access archives. Instead, they wrote, they supported the model of JMLR, in which authors retained copyright over their papers and archives were freely available on the internet.Following the mass resignation, Kluwer changed their publishing policy to allow authors to self-archive their papers online after peer-review.\n\n\n== Selected articles ==\nJ.R. Quinlan (1986). "Induction of Decision Trees". Machine Learning. 1: 81–106. doi:10.1007/BF00116251.\nNick Littlestone (1988). "Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm" (PDF). Machine Learning. 2 (4): 285–318. doi:10.1007/BF00116827.\n\nJohn R. Anderson and Michael Matessa (1992). "Explorations of an Incremental, Bayesian Algorithm for Categorization". Machine Learning. 9 (4): 275–308. doi:10.1007/BF00994109.\nDavid Klahr (1994). "Children, Adults, and Machines as Discovery Systems". Machine Learning. 14 (3): 313–320. doi:10.1007/BF00993981.\nThomas Dean and Dana Angluin and Kenneth Basye and Sean Engelson and Leslie Kaelbling and Evangelos Kokkevis and Oded Maron (1995). "Inferring Finite Automata with Stochastic Output Functions and an Application to Map Learning". Machine Learning. 18: 81–108. doi:10.1007/BF00993822.\nLuc De Raedt and Luc Dehaspe (1997). "Clausal Discovery". Machine Learning. 26 (2/3): 99–146. doi:10.1023/A:1007361123060.\nC. de la Higuera (1997). 
"Characteristic Sets for Grammatical Inference". Machine Learning. 27: 1–14.\nRobert E. Schapire and Yoram Singer (1999). "Improved Boosting Algorithms Using Confidence-rated Predictions". Machine Learning. 37 (3): 297–336. doi:10.1023/A:1007614523901.\nRobert E. Schapire and Yoram Singer (2000). "BoosTexter: A Boosting-based System for Text Categorization". Machine Learning. 39 (2/3): 135–168. doi:10.1023/A:1007649029923.\nP. Rossmanith and T. Zeugmann (2001). "Stochastic Finite Learning of the Pattern Languages". Machine Learning. 44 (1–2): 67–91. doi:10.1023/A:1010875913047.\nParekh, Rajesh; Honavar, Vasant (2001). "Learning DFA from Simple Examples". Machine Learning. 44 (1/2): 9–35. doi:10.1023/A:1010822518073.\nAyhan Demiriz and Kristin P. Bennett and John Shawe-Taylor (2002). "Linear Programming Boosting via Column Generation". Machine Learning. 46: 225–254. doi:10.1023/A:1012470815092.\nSimon Colton and Stephen Muggleton (2006). "Mathematical Applications of Inductive Logic Programming" (PDF). Machine Learning. 64 (1–3): 25–64. doi:10.1007/s10994-006-8259-x.\nWill Bridewell and Pat Langley and Ljupco Todorovski and Saso Dzeroski (2008). "Inductive Process Modeling". Machine Learning.\nStephen Muggleton and Alireza Tamaddoni-Nezhad (2008). "QG/GA: a stochastic search for Progol". Machine Learning. 70 (2–3): 121–133. doi:10.1007/s10994-007-5029-3.\n\n\n== References ==', metadata={'title': 'Machine Learning (journal)', 'summary': 'Machine Learning  is a peer-reviewed scientific journal, published since 1986.\nIn 2001, forty editors and members of the editorial board of Machine Learning resigned in order to support the Journal of Machine Learning Research (JMLR), saying that in the era of the internet, it was detrimental for researchers to continue publishing their papers in expensive journals with pay-access archives. 
Instead, they wrote, they supported the model of JMLR, in which authors retained copyright over their papers and archives were freely available on the internet.Following the mass resignation, Kluwer changed their publishing policy to allow authors to self-archive their papers online after peer-review.', 'source': 'https://en.wikipedia.org/wiki/Machine_Learning_(journal)'})]

In [0]:
data[0].page_content

#### Create text embeddings
The source data, like the Wikipedia articles in our case, is often too long for the LLM to handle directly. So, our first task is to break down or "chunk" the data into smaller, more manageable parts. Once that's done, we transform each chunk into a text embedding using an embedding model. An embedding is a vector of numbers that represents the meaning of the text, so chunks about similar topics get similar vectors.

Finally, we store these chunk embeddings in a vector database. Think of it as a database that can quickly find the stored chunks whose embeddings are closest to a given query embedding. This lets us hand the LLM only the pieces of the source data that are relevant to a question.
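To make the chunk, embed, and store idea concrete, here is a minimal sketch in plain Python. It is only an illustration under loud assumptions: `chunk_words`, `toy_embedding`, and `toy_store` are hypothetical names, the chunking is by word count rather than the character-based splitting used in this notebook, and the "embedding" is a simple word-count vector standing in for a real embedding model.

```python
import math
from collections import Counter


def chunk_words(text, words_per_chunk):
    """Split text into consecutive chunks of at most words_per_chunk words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def toy_embedding(text):
    """Stand-in for an embedding model (assumption): a word-count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


article = (
    "Machine learning studies algorithms that learn from data. "
    "Ensemble methods combine several models together."
)
chunks = chunk_words(article, words_per_chunk=8)

# The toy "vector database" is just a list of (chunk, embedding) pairs.
toy_store = [(chunk, toy_embedding(chunk)) for chunk in chunks]

for chunk, _ in toy_store:
    print(chunk)
```

A real embedding model produces dense vectors that capture meaning beyond exact word matches, and a real vector database indexes them for fast search, but the flow of data is the same.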

In [0]:
import os

# Environment-specific proxy; adjust or remove this line for your own network
os.environ['HTTPS_PROXY'] = 'http://ngproxy-ecm.csin.cz:8080'

In [0]:
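# Split the documents into chunks of at most 500 characters, with no overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_documents(data)

# Embedding model that turns each chunk into a vector
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
)

# Embed all chunks and store them in a Chroma vector database
chromadb_index = Chroma.from_documents(
    texts, embeddings
)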

You can browse other available models at https://huggingface.co/models.

#### Set up the Q&A model
First, we're setting up something called a "retriever" from a database (`chromadb_index`). Think of this as preparing a helper to fetch information from our source data efficiently.

Next, we're selecting a language model (`google/flan-t5-large`). This model is like a smart assistant that can generate text based on what it's given. 

Finally, we combine the retriever with the LLM and create a Q&A model. The Q&A model works in the following steps:
1. __Generate question embedding__: your question is transformed into the same representation (embedding) as the source text
2. __Retrieve relevant information__: the retriever then searches through the vector database to find chunks of text that are closely related to the question
3. __Answer with LLM__: with these relevant text chunks, the LLM generates an answer to the question
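The three steps above can be sketched without any LLM library. This is a rough illustration, not how `RetrievalQA` is implemented: `embed`, `retrieve`, and `build_prompt` are hypothetical helpers, the embedding is a toy word-count vector, and the prompt wording is made up. The sketch places all retrieved chunks into a single prompt, whereas the `refine` chain type used in this notebook feeds the chunks to the LLM one at a time.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding (assumption): lowercase word counts, punctuation stripped."""
    words = (w.strip(".,?!") for w in text.lower().split())
    return Counter(w for w in words if w)


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Steps 1 and 2: embed the question and return the k most similar chunks."""
    question_vector = embed(question)
    ranked = sorted(
        chunks, key=lambda c: cosine(question_vector, embed(c)), reverse=True
    )
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Step 3, input side: place the retrieved chunks in front of the question."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


chunks = [
    "Ensemble methods combine multiple learning algorithms.",
    "The journal Machine Learning has been published since 1986.",
    "Decision trees are commonly used in ensemble methods.",
]
question = "Which algorithms are used in ensemble methods?"

top_chunks = retrieve(question, chunks, k=2)
prompt = build_prompt(question, top_chunks)
print(prompt)  # this text is what would be sent to the LLM
```

Only the two ensemble-related chunks reach the prompt; the unrelated chunk about the journal is filtered out by the retrieval step, which is what keeps the LLM input short and relevant.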

In [0]:
question = "What methods are used in machine learning?"
# Find the 3 chunks whose embeddings are most similar to the question
result = chromadb_index.similarity_search(question, k=3)

for doc in result:
    print(doc)
    print()

In [0]:
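# Retriever that returns the stored chunks most similar to a query
retriever = chromadb_index.as_retriever()

# Load the LLM as a local Hugging Face pipeline
model = "google/flan-t5-large"
llm = HuggingFacePipeline.from_model_id(
    model_id=model,
    task="text2text-generation",
    model_kwargs={
        "temperature": 0.3,
        "max_length": 300
    },
)

# Combine retriever and LLM; the "refine" chain passes the retrieved chunks
# to the LLM one at a time and refines the answer at each step
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="refine", retriever=retriever
)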

In [0]:
# question = "What is machine learning?"
question = "Give me a list of some machine learning algorithms."
qa.run(question)

In [0]:
# explore context that was used for the answer generation
context = chromadb_index.similarity_search(question)

for i, chunk in enumerate(context, start=1):
    print(f"chunk {i}: {chunk.page_content}")