<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/MLPG-Book-Cover-Small.png"><br>

This notebook contains an excerpt from the **`Machine Learning Project Guidelines - For Beginners`** book written by *Balasubramanian Chandran*; the content is available [on GitHub](https://github.com/BalaChandranGH/Books/ML-Project-Guidelines).

<br>
<!--NAVIGATION-->

<[ [Other Considerations - Definitions](18.14-mlpg-Other-Considerations-Definitions.ipynb) | [Contents and Acronyms](00.00-mlpg-Contents-and-Acronyms.ipynb) | [Text Analytics – An Introduction](19.00-mlpg-Text-Analytics–An-Introduction.ipynb) ]>

# 18. Other Considerations

## 18.15. Miscellaneous

### 18.15.1. Pipeline
* A linear sequence of data preparation and modeling steps that can be treated as an atomic unit
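
As a rough illustration of what "atomic unit" means here, the sketch below chains one data preparation step and one model behind a single `fit`/`predict` interface. It is a minimal pure-Python sketch, not a library API: the class names (`MinMaxScaler1D`, `ThresholdClassifier`, `SimplePipeline`) and the toy data are hypothetical and only mimic the fit/transform convention used by libraries such as scikit-learn.

```python
class MinMaxScaler1D:
    """Hypothetical data-preparation step: rescales values to the [0, 1] range."""

    def fit(self, X, y=None):
        self.lo_, self.hi_ = min(X), max(X)
        return self

    def transform(self, X):
        span = (self.hi_ - self.lo_) or 1.0  # avoid division by zero
        return [(x - self.lo_) / span for x in X]


class ThresholdClassifier:
    """Hypothetical model step: predicts 1 above the midpoint of the class means."""

    def fit(self, X, y):
        pos = [x for x, label in zip(X, y) if label == 1]
        neg = [x for x, label in zip(X, y) if label == 0]
        self.threshold_ = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, X):
        return [1 if x > self.threshold_ else 0 for x in X]


class SimplePipeline:
    """Runs every preparation step in order, then the final model."""

    def __init__(self, steps):
        self.transformers = list(steps[:-1])
        self.model = steps[-1]

    def fit(self, X, y):
        for step in self.transformers:
            X = step.fit(X, y).transform(X)
        self.model.fit(X, y)
        return self

    def predict(self, X):
        for step in self.transformers:
            X = step.transform(X)  # reuse the parameters learned during fit
        return self.model.predict(X)


# Toy, made-up training data: one feature, two classes.
X_train = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]
y_train = [0, 0, 0, 1, 1, 1]

pipe = SimplePipeline([MinMaxScaler1D(), ThresholdClassifier()])
pipe.fit(X_train, y_train)
predictions = pipe.predict([9.0, 33.0])
print(predictions)  # [0, 1]
```

Because the scaler is fitted only inside `pipe.fit`, the same learned scaling is applied automatically at prediction time, which is the main practical benefit of treating the steps as one unit.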

### 18.15.2. Kernel and Kernel trick

**Kernel:**
* Kernels allow us to make complex, non-linear classifiers using SVM
* A kernel is a shortcut that helps us do certain calculations faster which otherwise would involve computations in higher-dimensional space
* A **kernel is a weighting factor** between two sequences of data. It can assign more weight to one data point at one time point than to another, assign both an equal weight, or assign more weight to the other data point, and so on
* A kernel is a way of computing the dot product of two vectors **X** and **Y** in some (possibly very high-dimensional) feature space, which is why kernel functions are sometimes called "generalized dot products"

**Kernel Trick:**
* A simple method in which non-linear data is projected onto a higher-dimensional space where it can be divided linearly by a plane, which makes it easier to classify, e.g., projecting 2D data into 3D space; the trick is that the kernel gives the effect of this projection without actually transforming the data points into the new space<br>
<img align="left" style="padding-right:10px;" src="figures/MLPG-OC-KernelTrick.png"><br>
<br><br><br>
Image credit [ (Source) ](https://www.analyticsvidhya.com/)
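
To make the 2D-to-3D idea concrete, here is a minimal sketch under stated assumptions: two hypothetical classes lie on concentric circles (radii 1 and 3, chosen only for illustration), and the hypothetical map `lift` adds a third coordinate `x1^2 + x2^2`. No straight line separates concentric circles in 2D, but after lifting, a horizontal plane does.

```python
import math


def circle_points(radius, n=12):
    """Return n evenly spaced 2D points on a circle of the given radius."""
    return [
        (radius * math.cos(2 * math.pi * k / n), radius * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]


def lift(point):
    """Illustrative feature map from 2D to 3D: (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)


inner = circle_points(1.0)  # class A (assumed radius)
outer = circle_points(3.0)  # class B (assumed radius)

# In 3D the third coordinate is the squared radius: about 1 for class A, about 9 for class B.
inner_heights = [lift(p)[2] for p in inner]
outer_heights = [lift(p)[2] for p in outer]

# Any plane z = c with 1 < c < 9 separates the classes; c = 5 is an arbitrary choice.
plane_height = 5.0
separable = max(inner_heights) < plane_height < min(outer_heights)
print(separable)  # True
```

This sketch performs the projection explicitly for illustration only; a kernel obtains the same dot products in the lifted space without ever computing `lift`.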

* **Mathematical definition of Kernel:**<br>
  `K(x, y) = <f(x), f(y)>` where `K` is the kernel function, `x` and `y` are n-dimensional inputs, `f` is a map from n-dimensional to m-dimensional space, `<x, y>` denotes the dot product, and usually `m > n`
* **Example:**<br>
  ```
  x = (x1, x2, x3); y = (y1, y2, y3), then,
  Function f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3)
  Function f(y) = (y1y1, y1y2, y1y3, y2y1, y2y2, y2y3, y3y1, y3y2, y3y3)
  The kernel K(x, y) = (<x, y>)^2

  Suppose x = (1, 2, 3); y = (4, 5, 6), then,
  f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)
  f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)
  <f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024
  ```
  <br>That is a lot of algebra, mainly because f is a mapping from 3D to 9D space. Now let us use the kernel instead:
  ```
  K(x, y) = (4 + 10 + 18)^2 = 32^2 = 1024
  ```
  Same result, but this calculation is so much easier.

* **An additional beauty of kernels**: kernels allow us to work in infinite dimensions! Sometimes going to a higher dimension is not just computationally expensive but impossible: f(x) can be a mapping from n dimensions to an infinite-dimensional space, which we may have little idea how to deal with directly. The kernel then gives us a wonderful shortcut
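
The worked example above can be checked with a few lines of Python. This is a minimal sketch; the function names (`feature_map`, `poly_kernel`, `rbf_kernel`) are illustrative and not taken from any library. The Gaussian (RBF) kernel at the end is included as a commonly cited example of a kernel whose implicit feature space is infinite-dimensional; its `gamma` value is an arbitrary assumption.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def feature_map(v):
    """Explicit map f from n-D to n^2-D: all pairwise products v_i * v_j."""
    return [a * b for a in v for b in v]


def poly_kernel(x, y):
    """K(x, y) = (<x, y>)^2, computed without leaving the original space."""
    return dot(x, y) ** 2


def rbf_kernel(x, y, gamma=0.1):
    """Gaussian kernel exp(-gamma * ||x - y||^2); gamma is an assumed value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


x = (1, 2, 3)
y = (4, 5, 6)

explicit = dot(feature_map(x), feature_map(y))  # builds both 9D vectors first
shortcut = poly_kernel(x, y)                    # one 3D dot product and a square

print(explicit, shortcut)  # 1024 1024

# No explicit feature map is ever built for the Gaussian kernel.
similarity = rbf_kernel(x, y)
```

The explicit route needs the 9D vectors in memory, while the kernel only touches the original 3D inputs; the gap widens quickly as the input dimension grows.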

### 18.15.3. A Dimensionality Reduction problem description
* Designing a model with too many observations (rows) and too many variables (columns) is computationally taxing
* For efficiency, we need to group the observations and variables and keep their numbers to a minimum
* **Too many observations:**
  - Interested to see how observations hang together
    - Market segmentation
    - Types of observation
    - Grouping observations together
    - **Solution**: To reduce the dimensions of the population (i.e., observations), use **Cluster Analysis**
* **Too many variables:**
  - Interested to see how variables hang together
    - Variables may describe similar things
    - What is the underlying similarity
    - Grouping variables
    - Don't want to enter all the variables in the model - inefficient, computationally expensive, potentially high correlations among variables
    - **Solution**: To reduce the dimensions of the construct (i.e., variables), use **PCA** and **EFA**
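
Both solutions can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not a production implementation: the toy data, the helper names (`kmeans`, `first_component`), and the starting centers are all made up for the example. Cluster analysis is represented by a tiny k-means, PCA by a power iteration that finds only the first principal component, and EFA is not shown.

```python
import math


def kmeans(points, centers, iterations=10):
    """Tiny k-means: group observations (rows) around the given starting centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return groups, centers


def first_component(rows, steps=100):
    """First principal direction via power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(steps):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered row onto the direction: one score replaces d variables.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Made-up data: two variables where the second is roughly twice the first,
# and two visibly separate groups of observations.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (8.0, 16.1), (9.0, 17.8), (10.0, 20.1)]

groups, centers = kmeans(rows, centers=[rows[0], rows[-1]])  # too many observations
direction, scores = first_component(rows)                    # too many variables
```

Here the six observations collapse into two groups, and the two correlated variables collapse into a single score per observation along `direction`.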

### 18.15.4. Big data and its characteristics
* Big data is a term for datasets that are so large or complex that traditional data processing application software is inadequate to deal with them
* Big Data is about deriving new insight from previously untouched data and integrating that insight into the business operations — data warehouses, business processes, and applications
* Big data is about the application of new tools to do MORE analytics on MORE data for MORE people
* The characteristics of big data are as follows:<br>

<img align="left" style="padding-right:10px;" src="figures/MLPG-OC-BigDataChar1.png"><br>
<br><br><br><br><br>
Image credit [ (Source) ](https://bigdatapath.wordpress.com/2019/11/13/understanding-the-7-vs-of-big-data/)

<img align="left" style="padding-right:10px;" src="figures/MLPG-OC-BigDataChar2.png"><br>
<br><br><br><br><br><br><br><br><br>
Image credit [ (Source) ](https://www.researchgate.net/figure/The-7Vs-of-Big-Data-Volume-Velocity-Variety-Variability-Veracity-Value-and_fig1_341622174)

### 18.15.5. Web scraping and its use-cases
* **Web scraping** is also known as **Web Data Extraction** or **Data scraping** or **Web harvesting**
* It’s the process of collecting structured web data in an automated fashion from any public website
* It uses intelligent automation to retrieve hundreds, millions, or even billions of data points from the internet
* In general, it’s used by people and businesses who want to make use of the vast amount of publicly available web data to make smarter decisions
* Some of the use-cases of web scraping:
  - E-commerce
    - Competitor price monitoring
  - Manufacturing - MAP (Minimum advertised price) monitoring
    - MAP monitoring ensures retailers are compliant with pricing guidelines for their products
    - MAP monitoring is the standard practice to make sure a brand’s online prices are aligned with its pricing policy
    - With tons of resellers and distributors, it’s impossible to monitor the prices manually
  - Market research 
    - Organizations and analysts depend on web scraping to gauge consumer sentiments by keeping track of online product reviews, news articles, and feedback
  - Finance 
    - Web scraping tools are used to extract insights from news stories, which are then used to guide investment strategies
  - Insurance 
    - Companies mine a rich seam of alternative data scraped from the web to design new products and policies for their customers
  - Price intelligence
    - Dynamic pricing
    - Revenue optimization
    - Competitor monitoring
    - Product trend monitoring
    - Brand and MAP compliance
  - Market research
    - Market trend analysis
    - Market pricing
    - Optimizing point of entry
    - Research & development
    - Competitor monitoring
  - Alternative data for finance
    - Extracting Insights from SEC Filings
    - Estimating Company Fundamentals
    - Public Sentiment Integrations
    - News Monitoring
  - Real estate
    - Appraising Property Value
    - Monitoring Vacancy Rates
    - Estimating Rental Yields
    - Understanding Market Direction
  - News & content monitoring
    - Investment Decision Making
    - Online Public Sentiment Analysis
    - Competitor Monitoring
    - Political Campaigns
  - Lead generation, Brand monitoring, Business automation, Journalism, Academic research, and more
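
As a small illustration of the competitor price monitoring use-case, the sketch below extracts product names and prices from an HTML snippet using only Python's standard library. It is a minimal sketch under stated assumptions: the HTML, the class names `name` and `price`, and the `PriceParser` class are hypothetical, and the download step is omitted so the example stays self-contained.

```python
from html.parser import HTMLParser

# Hypothetical product listing; a real scraper would download this HTML first.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">19.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">24.50</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collects (name, price) pairs from <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # which span we are inside, if any
        self.current_name = None
        self.products = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self.current_field = css_class

    def handle_data(self, data):
        if self.current_field == "name":
            self.current_name = data.strip()
        elif self.current_field == "price":
            self.products.append((self.current_name, float(data)))

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None


parser = PriceParser()
parser.feed(SAMPLE_HTML)

# Structured output that a price-monitoring job could store or compare over time.
cheapest = min(parser.products, key=lambda item: item[1])
print(parser.products)  # [('Widget', 19.99), ('Gadget', 24.5)]
```

The point of the example is the shape of the result: unstructured markup goes in, and a list of structured records comes out.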

### 18.15.6. Resilient Distributed Dataset (RDD)
* An RDD is a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel; in other words, an RDD is made up of multiple partitions
* Spark normally determines the number of partitions based on the number of CPUs in your cluster
* Each partition holds a sequence of records on which tasks execute
* The partitioning is what enables the parallel execution of Spark jobs
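
The idea of partitions and per-partition tasks can be imitated in plain Python. The sketch below is not Spark and does not use its API: it is a single-machine analogy in which the helper names (`partition`, `sum_of_squares`) and the choice of four partitions are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Split records into num_partitions roughly equal, contiguous slices."""
    size, extra = divmod(len(records), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(records[start:end])
        start = end
    return parts


def sum_of_squares(part):
    """The task that runs independently on one partition."""
    return sum(x * x for x in part)


records = list(range(1, 11))   # 1..10
parts = partition(records, 4)  # 4 is arbitrary, standing in for the number of cores

# One task per partition; the partial results are combined at the end.
with ThreadPoolExecutor(max_workers=len(parts)) as pool:
    partial_sums = list(pool.map(sum_of_squares, parts))

total = sum(partial_sums)
print(parts)  # [[1, 2, 3], [4, 5, 6], [7, 8], [9, 10]]
print(total)  # 385
```

In PySpark, the corresponding steps would typically go through `SparkContext.parallelize(data, numSlices)` to create the RDD and `RDD.getNumPartitions()` to inspect it, with Spark scheduling the tasks across the cluster rather than across local threads.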

### 18.15.7. How to choose a data layer for an application?
* It depends on the types of questions to ask and how long one can wait for the answers
* Before choosing, one needs to answer some, if not all, of the questions below:

* **Database functionality and performance**: **`Choice - NoSQL or RDBMS`**
  - If you have a web or mobile application that requires `interactive responses`, then you will want to use a database that aims to be an operational data store; `NoSQL databases` may be a good choice
  - If your application requires data warehousing for `batch analytics`, then often a `relational database or Hadoop-based technology` would be a better fit

* **Database Size and Number of Connections: `Choice - NoSQL`**
  - How big will your data get, and how many `concurrent connections` do you expect?
  - Will you need a solution that `scales horizontally`, are you unsure of your capacity requirements upfront, or do you need something that scales as your application grows?
  - If so, a `NoSQL database` might be a good choice

* **Data Durability: `Choice – RDBMS`**
  - Some databases offer the ability to store your data in memory for faster access; however, with this approach there is an increased risk of losing the data when a server crashes
  - If data `durability` is paramount, then choose a `database that writes the data immediately to the disk`

* **Database Consistency and Transaction Requirements: `Choice - RDBMS`**
  - Relational databases provide strong consistency and transactional rollback capabilities and would be a good choice if you have a use case that requires these traits

* **Database Availability, Replication, and Geography: `Choice – NoSQL (few not all types)`**
  - Many `NoSQL databases` operate inherently in a cluster and therefore can meet stringent `high availability requirements`
  - Data `replication` is an important feature to achieve disaster recovery objectives by storing the data in additional data centers and allowing for syncing to application clients for offline access
  - A few, but not all, `NoSQL databases` are built to handle these complex replication scenarios while avoiding data corruption

* **Data Structure Changes: `Choice - NoSQL`**
  - Will I need a `flexible schema` for rapid development? Will the data model change over time?
  - Flexible schemas are a common trait amongst many `NoSQL databases`
  - If you require a flexible schema for rapid development where your data model may change over time, then you will often want to go with a NoSQL database for your application
  - Many of them require no database downtime while making schema changes, making development easier and faster

* **Database Developer and Administrator Skills: `Choice – Based on the skillsets of the existing resources`**
  - It is important to assess the skill sets of those developing the application and administering the database and servers
  - Make sure you `choose a technology that fits with your existing resources` before bringing it on-premises in your environment

* **Database Integration: `Choice – NoSQL or RDBMS`**
  - Think about whether or not your database layer can integrate easily with your application layer
  - For `web and mobile applications that use JSON`, use a `NoSQL` DB that also uses JSON
  - However, if the business intelligence tools (or reporting dashboard) are expecting to `consume rows/columns`, then a `relational datastore` might work better for you

* **Database System Hosting: `Choice - DIY or Hosted or DBaaS`**
  - Do I want to host it myself or use cloud infrastructure or a fully managed service?

<img align="left" style="padding-right:10px;" src="figures/MLPG-OC-DBHosting.png"><br>
<br><br><br><br><br>
Image credit [ (Source) ](https://cognitiveclass.ai/)
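
Two of the criteria above, transactional rollback and flexible schemas, can be illustrated with Python's standard library alone. This is a minimal sketch under stated assumptions: an in-memory SQLite database stands in for an RDBMS, a list of JSON strings stands in for a document-style NoSQL store, and the table, names, and amounts are made up.

```python
import json
import sqlite3

# Relational side: a transaction that is rolled back as a whole on error.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute("UPDATE accounts SET balance = balance + 500 WHERE name = 'bob'")
        # The next statement violates the CHECK constraint (alice would go negative).
        conn.execute("UPDATE accounts SET balance = balance - 500 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    pass  # the whole transfer is undone, including bob's credit

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50}

# Document side: records with different fields can sit in the same collection.
documents = [
    json.dumps({"user": "alice", "email": "alice@example.com"}),
    json.dumps({"user": "bob", "devices": ["phone", "tablet"]}),  # new field, no migration
]
users = [json.loads(doc)["user"] for doc in documents]
print(users)  # ['alice', 'bob']
```

The relational example shows why strong consistency matters when several changes must succeed or fail together, while the document example shows how a data model can evolve without a schema change.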
