# Table of Contents
 <p><div class="lev1 toc-item"><a href="#Overview-of-Data-Mining-and-Information-Retrieval-Methods" data-toc-modified-id="Overview-of-Data-Mining-and-Information-Retrieval-Methods-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Overview of Data Mining and Information Retrieval Methods</a></div><div class="lev2 toc-item"><a href="#What-is-data-mining?" data-toc-modified-id="What-is-data-mining?-11"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>What is data mining?</a></div><div class="lev2 toc-item"><a href="#Database-Query-vs-Data-Mining" data-toc-modified-id="Database-Query-vs-Data-Mining-12"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Database Query vs Data Mining</a></div><div class="lev2 toc-item"><a href="#Data-Mining-Methods" data-toc-modified-id="Data-Mining-Methods-13"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Data Mining Methods</a></div><div class="lev2 toc-item"><a href="#Data-Science-vs-Machine-Learning-vs-Data-Mining" data-toc-modified-id="Data-Science-vs-Machine-Learning-vs-Data-Mining-14"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Data Science vs Machine Learning vs Data Mining</a></div><div class="lev1 toc-item"><a href="#What-is-a-information-retrieval-system?" data-toc-modified-id="What-is-a-information-retrieval-system?-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>What is a information retrieval system?</a></div>

# Overview of Data Mining and Information Retrieval Methods

The data deluge in business, government, and science has far outpaced our ability to retrieve, interpret, and make sense of the data, creating a pressing need for novel tools and techniques for intelligent and automated analysis of datasets/databases. These tools and techniques are the subject of the rapidly emerging fields of data mining, machine learning, and information retrieval.

In the DSA 8614: Applied Machine Learning course you learned methods such as classification, clustering, and dimensionality reduction, which are also subject matter of data mining. In this course, we will touch on these topics briefly, provide adequate pointers to them, and focus more on the other subfields of data mining.

## What is data mining?

Data mining is the process of discovering **insightful, interesting, and novel patterns**, as well as developing **descriptive, understandable, and predictive models from large-scale data** [[source](https://dataminingbook.info/)]. Data mining is often discussed under the broader topic of **knowledge discovery in databases (KDD)**. KDD is the process that takes a large amount of unprocessed raw data stored in a data warehouse, transforms it into meaningful patterns, and presents them in a form that can be easily interpreted and understood. KDD is essentially a three-step process: a) pre-processing the raw data, b) generating knowledge using data mining, and c) synthesizing that knowledge in a post-processing step.

During the data-mining step of the KDD process, specific algorithms are applied to extract potentially useful patterns from the raw data. This is the heart of the KDD process, as it identifies and exposes interesting, useful, and actionable patterns and rules hidden in the database/dataset. During this stage, either an entire database/dataset or a sample is used as input to the mining algorithm; the right kind of input (i.e., training data) is prepared in the pre-processing stage. Once new knowledge is generated by the data-mining stage, the post-processing stage converts the knowledge into a form more suitable for interpretation by end users. During this stage, irrelevant patterns are eliminated and relevant patterns are further summarized into more understandable and meaningful expressions.
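
The three KDD steps can be sketched on a toy example. The snippet below is only an illustration under stated assumptions: the raw transactions, the function names (`preprocess`, `mine_pairs`, `postprocess`), and the choice of counting co-occurring item pairs as the "mining algorithm" are all hypothetical and are not taken from any particular KDD system.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data: messy, comma-separated purchase records.
raw_transactions = [
    " Bread, Milk ",
    "bread,butter,milk",
    "",
    "Milk, Bread, Eggs",
    "butter , eggs",
]

def preprocess(raw_rows):
    """Step a: clean raw text rows into sets of normalized item names."""
    cleaned = []
    for row in raw_rows:
        items = {item.strip().lower() for item in row.split(",") if item.strip()}
        if items:  # drop empty records
            cleaned.append(items)
    return cleaned

def mine_pairs(transactions):
    """Step b: count how often each pair of items occurs together."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts

def postprocess(pair_counts, min_count):
    """Step c: drop infrequent pairs and render the rest as readable text."""
    kept = [(pair, n) for pair, n in pair_counts.items() if n >= min_count]
    kept.sort(key=lambda entry: (-entry[1], entry[0]))
    return [f"{a} and {b} appear together in {n} transactions" for (a, b), n in kept]

transactions = preprocess(raw_transactions)
patterns = postprocess(mine_pairs(transactions), min_count=2)
print(patterns)  # ['bread and milk appear together in 3 transactions']
```

Note how the post-processing step both filters out irrelevant patterns (pairs seen only once) and rewrites the surviving pattern in a form an end user can read.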

## Database Query vs Data Mining

Some of you have taken a database course in which you wrote (complex) queries to extract information from a database. Is this a data mining task? No. Data querying involves retrieving a subset of the existing data as specified by the user. When we specify the query, we get the **known** relevant data as output. Data mining, on the other hand, is about analysis and model development. It deals with extracting **useful**, **actionable**, and previously **unknown information** from raw data. For example, you predict unknown class labels in classification tasks, group similar instances in clustering tasks, and discover patterns in data. So it is much more than mere retrieval of data.

Often, raw data is stored in databases, so data querying becomes a part of the data mining task. However, querying can be used only when users know exactly what they are looking for, while data mining is used when users have only a vague idea and want to find interesting features or patterns in the data.
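
The contrast can be made concrete with a small sketch. Everything here is assumed for illustration: the `records` list, the spending threshold, and the use of a one-dimensional two-cluster k-means as a stand-in for "mining". The query returns exactly what the user asked for, while the mining step finds a grouping that nobody specified in advance.

```python
# Hypothetical customer records.
records = [
    {"customer": "a", "spend": 12.0},
    {"customer": "b", "spend": 15.0},
    {"customer": "c", "spend": 14.0},
    {"customer": "d", "spend": 90.0},
    {"customer": "e", "spend": 95.0},
    {"customer": "f", "spend": 100.0},
]

def query_high_spenders(rows, threshold):
    """Query: the user states exactly which rows to retrieve."""
    return [r["customer"] for r in rows if r["spend"] > threshold]

def two_means_1d(values, iterations=10):
    """Mining: split values into two groups with 1-D k-means (k = 2).

    Assumes the values contain at least two distinct numbers.
    """
    low_center, high_center = min(values), max(values)
    if low_center == high_center:
        raise ValueError("need at least two distinct values")
    for _ in range(iterations):
        # Assign each value to its nearest center, then recompute the centers.
        low = [v for v in values if abs(v - low_center) <= abs(v - high_center)]
        high = [v for v in values if abs(v - low_center) > abs(v - high_center)]
        low_center = sum(low) / len(low)
        high_center = sum(high) / len(high)
    return sorted(low), sorted(high)

spends = [r["spend"] for r in records]
print(query_high_spenders(records, threshold=50))  # ['d', 'e', 'f']
print(two_means_1d(spends))  # ([12.0, 14.0, 15.0], [90.0, 95.0, 100.0])
```

In the query, the threshold of 50 had to be known beforehand. In the clustering step, the two spending groups emerge from the data itself.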


## Data Mining Methods

Data mining is a tool used in various business sectors to provide effective, strategic assistance for decision making. The kind of information produced by a data-mining system depends on the information needs of the organization. While the scope of data-mining applications is wide, typical goals include the thorough detection, accurate interpretation, and easy-to-understand presentation of meaningful patterns in data. To satisfy these goals, data-mining algorithms employ techniques from a wide variety of disciplines such as AI/machine learning, database development, statistics, and mathematics. The algorithmic categories in data mining include association rule mining, classification, clustering, fuzzy theory, rough-set theory, and neural networks. Practical applications of data mining are found in health-care services, banking, market-basket analysis, customer relationship management, and bioinformatics.
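
As a taste of one of these categories, association rule mining in a market-basket setting is commonly described using two measures: the support of an itemset (the fraction of baskets that contain it) and the confidence of a rule A → C (the fraction of baskets containing A that also contain C). The sketch below computes both on a made-up set of baskets; the data and the helper names are assumptions for illustration only.

```python
# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def count_containing(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)

def support(itemset, baskets):
    """Fraction of baskets that contain the whole itemset."""
    return count_containing(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    base = count_containing(antecedent, baskets)
    if base == 0:
        return 0.0
    joint = count_containing(set(antecedent) | set(consequent), baskets)
    return joint / base

print(support({"bread", "milk"}, baskets))       # 0.6
print(confidence({"bread"}, {"milk"}, baskets))  # 0.75
```

Here the rule "bread → milk" holds in three of the four baskets that contain bread. A full association rule miner would search over many candidate itemsets and keep only rules above chosen support and confidence thresholds.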





## Data Science vs Machine Learning vs Data Mining

The boundaries between machine learning, data mining, and data science are blurred because these areas overlap considerably. There is no single definition of any of these topics, and the available definitions are subject to debate. The following article presents one of the many views on the scope of these areas.

* [Difference of Data Science, Machine Learning and Data Mining](https://www.datasciencecentral.com/profiles/blogs/difference-of-data-science-machine-learning-and-data-mining)


# What is a information retrieval system?

An information retrieval (IR) system is a software program that stores and manages information on documents, often textual documents but possibly multimedia. The system assists users in finding the information they need. It does not explicitly return information or answer questions. Instead, it informs users of the existence and location of documents that might contain the desired information. Some of the suggested documents will, hopefully, satisfy the user’s information need. A desirable property of an IR system is that it returns relevant information quickly and lets users express their requirements as human-readable text.

Many applications that handle information on the internet would be completely inadequate without the support of information retrieval technology. The best-known and most widely used information retrieval systems are web search engines such as Google Search and Bing. IR is not limited to web search, however. E-commerce stores (e.g., Amazon), libraries, and financial institutions such as banks also support free-form text queries to retrieve relevant information for their users.

A perfect retrieval system would retrieve only the relevant documents and no irrelevant documents. However, perfect retrieval systems do not exist and will not exist, because search statements are necessarily incomplete and relevance depends on the subjective opinion of the user. In practice, two users may pose the same query to an information retrieval system and judge the relevance of the retrieved documents differently: some users will like the results, others will not.


There are three basic processes an information retrieval system has to support: a) the representation of the content of the documents, b) the representation of the user’s information need, and c) the comparison of the two representations.

* Representing the documents is usually called the indexing process. The process takes place off-line, that is, the end user of the information retrieval system is not directly involved. 

* An end user uses an IR system from the perspective of a need for information. The process of representing their information need is often referred to as the query formulation process. The resulting representation is the query.

* The comparison of the query against the document representations is called the matching process. The matching process usually results in a ranked list of documents. Users will walk through this document list in search of the information they need. Ranked retrieval will hopefully put the relevant documents towards the top of the ranked list, minimizing the time the user has to invest in reading the documents.
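
The three processes above (indexing, query formulation, and matching) can be sketched in a few lines. This is a minimal illustration, not a description of how any production system works. It assumes one common choice of representation, TF-IDF term weights compared with cosine similarity, and the documents, the function names, and the particular IDF formula (`log(N / df) + 1`) are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical document collection.
documents = {
    "d1": "data mining discovers patterns in large data sets",
    "d2": "information retrieval systems rank documents for a query",
    "d3": "search engines are information retrieval systems",
}

def tokenize(text):
    return text.lower().split()

def build_index(docs):
    """Indexing (off-line): one TF-IDF weight vector per document."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    idf = {term: math.log(len(docs) / df) + 1.0 for term, df in doc_freq.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }
    return vectors, idf

def formulate_query(text, idf):
    """Query formulation: represent the information need in the same term space."""
    counts = Counter(term for term in tokenize(text) if term in idf)
    return {term: tf * idf[term] for term, tf in counts.items()}

def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def match(query_vector, vectors):
    """Matching: rank documents by similarity to the query, best first."""
    scored = [(cosine(query_vector, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda entry: (-entry[0], entry[1]))
    return [doc_id for score, doc_id in scored if score > 0.0]

vectors, idf = build_index(documents)
ranking = match(formulate_query("information retrieval search", idf), vectors)
print(ranking)  # ['d3', 'd2']
```

Document `d3` ranks first because it shares all three query terms, `d2` shares two, and `d1` shares none and is therefore left out of the ranked list.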


Notable IR models include the Boolean model, the vector-space model, the probabilistic indexing/retrieval model, the Bayesian model, the language model, and Google's PageRank model.
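
To give a flavor of the simplest of these, the Boolean model treats each query term as a set of documents and combines the sets with AND, OR, and NOT. The sketch below assumes a tiny made-up collection and hypothetical helper names. Unlike ranked retrieval, the result is an unordered set: a document either satisfies the query or it does not.

```python
# Hypothetical document collection.
documents = {
    "d1": "data mining discovers patterns",
    "d2": "information retrieval ranks documents",
    "d3": "search engines support information retrieval",
}

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def boolean_and(index, *terms):
    """Documents that contain every one of the terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, *terms):
    """Documents that contain at least one of the terms."""
    return set().union(*(index.get(term, set()) for term in terms))

def boolean_not(index, term, all_doc_ids):
    """Documents that do not contain the term."""
    return set(all_doc_ids) - index.get(term, set())

index = build_inverted_index(documents)

# information AND retrieval
print(sorted(boolean_and(index, "information", "retrieval")))  # ['d2', 'd3']
# mining OR search
print(sorted(boolean_or(index, "mining", "search")))  # ['d1', 'd3']
# retrieval AND NOT search
print(sorted(boolean_and(index, "retrieval") & boolean_not(index, "search", documents)))  # ['d2']
```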