### 1. What are the key tasks involved in getting ready to work with machine learning modeling?

Ans : The key tasks are:

* **Data Gathering**: Any machine learning problem requires a lot of data for training/testing purposes. Identifying the right data sources and gathering data from these data sources is the key. Data could be found from databases, external agencies, the internet etc.
* **Data Preprocessing**: Before training any models, it is essential to prepare the data appropriately. Data preprocessing includes some of the following:
    * **Data cleaning**: Identify attributes that have too little data or that show no variance. These rows and columns need to be removed from the training data set.
    * **Missing data imputation**: Handle missing data using imputation techniques such as replacing missing values with the mean, median, or mode.
* **Exploratory Data Analysis (EDA)**: Once data is preprocessed, the next step is to perform exploratory data analysis to understand data distribution and relationships between/within the data. Some of the following are performed as part of EDA:
    * Correlation analysis
    * Multicollinearity analysis
    * Data distribution analysis
* **Feature Engineering:** Feature engineering is one of the critical tasks when building machine learning models. Selecting the right features not only helps build models of higher accuracy but also helps build simpler models, reduce overfitting, etc. Feature engineering includes tasks such as deriving features from raw features, identifying important features, feature extraction, and feature selection. The following are some of the techniques which could be used for feature selection:
    * Filter methods help in selecting features based on the outcomes of statistical tests. The following are some of the statistical tests which are used:
        * Pearson’s correlation
        * Linear discriminant analysis (LDA)
        * Analysis of Variance (ANOVA)
        * Chi-square tests
    * Wrapper methods help in feature selection by using a subset of features and determining the model accuracy. The following are some of the algorithms used:
        * Forward selection
        * Backward elimination
        * Recursive feature elimination
    * Regularization techniques penalize one or more features appropriately to come up with most important features. The following are some of the algorithms used:
        * LASSO (L1) regularization
        * Ridge (L2) regularization
        * Elastic net regularization
        * Regularization with classification algorithms such as Logistic regression, SVM, etc.
* **Training Models:** Once some of the features are determined, then comes training models with data related to those features. The following is a list of different types of machine learning problems and related algorithms which can be used to solve these problems:
    * **Regression:** Regression tasks mainly deal with the estimation of numerical values (continuous variables). Examples include estimating housing prices, product prices, stock prices, etc. Some of the following ML methods could be used for solving regression problems:
        * Kernel regression (Higher accuracy)
        * Gaussian process regression (Higher accuracy)
        * Regression trees
        * Linear regression
        * Support vector regression
        * LASSO / Ridge
        * Deep learning
        * Random forests
    * **Classification:** Classification tasks deal with predicting the category of a data point (discrete variables). One of the most common examples is predicting whether an email is spam or ham. Common use cases are found in healthcare, such as whether a person is suffering from a particular disease or not, and in finance, such as determining whether a transaction is fraudulent or not. ML methods such as the following could be applied to solve classification tasks:
        * Kernel discriminant analysis (Higher accuracy)
        * K-Nearest Neighbors (Higher accuracy)
        * Artificial neural networks (ANN) (Higher accuracy)
        * Support vector machine (SVM) (Higher accuracy)
        * Random forests (Higher accuracy)
        * Decision trees
        * Boosted trees
        * Logistic regression
        * naive Bayes
        * Deep learning
    * **Clustering:** Clustering tasks are all about finding natural groupings of data and a label associated with each of these groupings (clusters). Some of the common examples include customer segmentation, product features identification for the product roadmap. Some of the following are common ML methods:
        * Mean-shift  (Higher accuracy)
        * Hierarchical clustering
        * K-means
        * Topic models
* **Multivariate querying:** Multivariate querying is about querying or finding similar objects. Some of the following ML methods could be used for such problems:
    * Nearest neighbors
    * Range search
    * Farthest neighbors
* **Density estimation:** Density estimation problems are related to finding the likelihood or frequency of objects. In probability and statistics, density estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function. Some of the following ML methods could be used for solving density estimation tasks:
    * Kernel density estimation (Higher accuracy)
    * Mixture of Gaussians
    * Density estimation tree
* **Dimensionality reduction (feature extraction):** As per the Wikipedia page on Dimension reduction, Dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction. Following are some of ML methods that could be used for dimension reduction:
    * Manifold learning/KPCA (Higher accuracy)
    * Principal component analysis
    * Independent component analysis
    * Gaussian graphical models
    * Non-negative matrix factorization
    * Compressed sensing
* **Model selection / Algorithm selection:** Often, multiple models are trained using different algorithms. One of the important tasks is to select the best-performing model to deploy in production. Hyperparameter tuning is the most common task performed as part of model selection. Also, if two models trained using different algorithms have similar performance, one also needs to perform algorithm selection.
* **Testing and matching:** Testing and matching tasks relate to comparing data sets. Following are some of the methods that could be used for such kinds of problems:
    * Minimum spanning tree
    * Bipartite cross-matching
    * N-point correlation
* **Model monitoring:** Once the models are trained and deployed, they need to be monitored at regular intervals. Monitoring a model requires comparing actual values with predicted values and measuring model performance using appropriate metrics.
* **Model retraining:** If model performance degrades, the models need to be retrained. The following may be done as part of model retraining:
    * New features get determined
    * New algorithms can be used
    * Hyperparameters can get tuned
    * Model ensembles may get deployed
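
As a concrete illustration of the data cleaning and missing data imputation steps listed under data preprocessing above, here is a minimal sketch using only the Python standard library. The toy rows, the column names, and the helper functions `impute_median` and `drop_constant_columns` are hypothetical and chosen only for illustration; a real project would more likely use a data frame library.

```python
from statistics import median, pvariance

# Toy data set (hypothetical): None marks a missing value.
rows = [
    {"age": 25, "income": 40000, "country_code": 1},
    {"age": None, "income": 52000, "country_code": 1},
    {"age": 41, "income": None, "country_code": 1},
    {"age": 33, "income": 61000, "country_code": 1},
]

def impute_median(rows, column):
    """Replace missing values in `column` with the median of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def drop_constant_columns(rows):
    """Remove columns with zero variance; assumes no missing values remain."""
    columns = list(rows[0].keys())
    keep = [c for c in columns if pvariance([r[c] for r in rows]) > 0]
    return [{c: r[c] for c in keep} for r in rows]

cleaned = rows
for col in ("age", "income"):
    cleaned = impute_median(cleaned, col)
cleaned = drop_constant_columns(cleaned)
print(cleaned)
```

Imputation is done first here so that the variance check sees complete columns; `country_code` is dropped because it has the same value in every row.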

### 2. What are the different forms of data used in machine learning? Give a specific example for each of them.

Ans : Almost anything can be turned into DATA. Building a deep understanding of the different data types is a crucial prerequisite for doing Exploratory Data Analysis (EDA) and Feature Engineering for Machine Learning models. 

Most data can be categorized into 4 basic types from a Machine Learning perspective: 
* numerical data 
* categorical data 
* time-series data
* text

**Numerical data** can be characterized by continuous or discrete data. Continuous data can assume any value within a range whereas discrete data has distinct values.

![Numerical data](https://miro.medium.com/max/401/1*lheLiN7y4sSD2JKvow-clw.jpeg)


**Categorical data** represents characteristics, such as a hockey player’s position, team, hometown. Categorical data can take numerical values. For example, maybe we would use 1 for the colour red and 2 for blue. But these numbers don’t have a mathematical meaning. That is, we can’t add them together or take the average.

Categorical data can also be ordinal, meaning the categories have a natural order. An example would be class difficulty, such as beginner, intermediate, and advanced. Those three levels are labels for the classes, and they have a natural order of increasing difficulty.
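
The difference between unordered and ordered categories affects how they are turned into numbers for a model. The sketch below is one common approach, not the only one: one-hot encoding for categories without order (the colours) and integer ranks for ordered categories (the class difficulty). The function names `one_hot` and `ordinal_rank` are hypothetical.

```python
# Unordered categories: one-hot encoding avoids implying red < blue.
def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

# Ordered categories: the order is meaningful, so map levels to increasing ranks.
DIFFICULTY_ORDER = ["beginner", "intermediate", "advanced"]

def ordinal_rank(value, ordered_levels):
    """Return the position of `value` in the ordered list of levels."""
    return ordered_levels.index(value)

colours = ["red", "blue"]
print(one_hot("blue", colours))                     # [0, 1]
print(ordinal_rank("advanced", DIFFICULTY_ORDER))   # 2
```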

![Categorical data](https://miro.medium.com/max/229/1*wqUH7IOl8Hky5BI6RvoGtA.png)

**Time series data** is a sequence of numbers collected at regular intervals over some period of time. It is especially important in fields like finance. Each observation has a temporal value attached to it, such as a date or a timestamp, which lets you look for trends over time.

For example, we might measure the average number of home sales for many years. The difference between time series data and plain numerical data is that time series data has an implied ordering, rather than being a bunch of numerical values with no time ordering. There is a first data point collected and a last data point collected.
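
Because the ordering carries information, a typical first step is to sort observations by timestamp and then compute something that depends on neighbouring points. The sketch below uses made-up yearly home sales figures and a hypothetical `moving_average` helper to smooth the series.

```python
from datetime import date

# Hypothetical yearly observations, deliberately out of order.
home_sales = [
    (date(2021, 1, 1), 130),
    (date(2019, 1, 1), 100),
    (date(2020, 1, 1), 120),
    (date(2022, 1, 1), 150),
]

# Time ordering matters, so sort by timestamp first.
series = sorted(home_sales, key=lambda point: point[0])

def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]

values = [v for _, v in series]
trend = moving_average(values, window=2)
print(trend)  # [110.0, 125.0, 140.0]
```

Shuffling the rows of an ordinary numerical table changes nothing, but shuffling `values` here would change `trend`, which is exactly what sets time series data apart.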

![Time series data](https://miro.medium.com/max/700/1*3H17aiABEWXRY_ZD5MG6fg.png)

**Text data** is basically just words. Often the first thing you do with text is turn it into numbers using a representation such as bag of words.
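
A minimal bag-of-words sketch, assuming whitespace tokenization and two made-up example sentences: each document becomes a vector of word counts over a shared vocabulary. The helper name `bag_of_words` is hypothetical.

```python
from collections import Counter

documents = ["the cat sat", "the cat saw the dog"]

# Vocabulary: every distinct word across all documents, in a fixed order.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in `text`."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc, vocabulary) for doc in documents]
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Word order is discarded in this representation; only the counts survive.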

### 3. Distinguish:

   1. Numeric vs. categorical attributes
   2. Feature selection vs. dimensionality reduction


Ans : 
### **Distinguish Numeric vs. categorical attributes**
    
**Key Differences Between Categorical & Numerical Data**

**Definitions :** 
Categorical data is a type of data used to group information with similar characteristics, while numerical data is a type of data that expresses information in the form of numbers. Numerical data uses numeric values to depict information, while categorical data uses a descriptive approach.

**Examples :**
Categorical data examples include personal biodata such as full name, gender, and phone number. Numerical data examples include a CGPA score, sales over an interval, etc.
 
**Types :**
Categorical data is divided into two types, namely nominal and ordinal data, while numerical data is categorised into discrete and continuous data. Continuous data is further divided into interval data and ratio data.

**Data Characteristics :**
The characteristics of **categorical data** include the lack of a standardized order scale, natural language descriptions, numeric values (when present) that carry only qualitative meaning, and visualization using bar charts and pie charts.
**Numerical data**, on the other hand, has a standardized order scale, a numerical description, numeric values with numerical properties, and is visualized using bar charts, pie charts, scatter plots, etc.

**User-centred Design :**
The numerical data collection method is more user-centred than that of categorical data. Most respondents do not want to spend a lot of time filling out forms or surveys, which is why questionnaires used to collect numerical data have a lower abandonment rate than those used to collect categorical data.

**Data Collection Methods :**
**Categorical data** can be collected through different methods, depending on the type of categorical data. For instance, nominal data is mostly collected using open-ended questions, while ordinal data is mostly collected using multiple-choice questions. **Numerical data**, on the other hand, is mostly collected through multiple-choice questions, although open-ended questions are used whenever there is a need for calculation.

**Analysis & Interpretation :**
There are 2 methods of performing **numerical data** analysis, namely descriptive and inferential statistics. Some examples include measures of central tendency, turf analysis, text analysis, conjoint analysis, trend analysis, etc. **Categorical data** is mainly summarized using the mode and, for ordinal data, the median. In some cases, ordinal data is analyzed using univariate statistics, bivariate statistics, regression analysis, etc. as an alternative to calculating the mean and standard deviation.

**Advantage :**

Numerical data is compatible with most statistical analysis methods and as such makes it the most used among researchers. Categorical data, on the other hand, does not support most statistical analysis methods. 

There are alternatives to some of the statistical analysis methods not supported by categorical data. However, they can not give results that are as accurate as the original.

**Disadvantage :**

Numerical data analysis is mostly performed in a standardized or controlled environment, which may hinder a proper investigation. This is because natural factors that may influence the results have been eliminated, causing the results not to be completely accurate. 

Numerical data collection is also strictly based on the researcher's point of view, limiting the respondent's influence on the result. This is not the case with categorical data. 

Nominal data captures human emotions to an extent through open-ended questions. However, the setback with this is that the researcher may sometimes have to deal with irrelevant data.

___________________________________________________________________________________________________________________________

### **Distinguish Feature selection vs. dimensionality reduction**

* feature selection: you select a subset of the original feature set; while
* feature extraction: you build a new set of features from the original feature set.

**Feature Selection**

* Feature selection yields a subset of the original set of features that best represents the data. In principle it is a search over feature subsets; an exhaustive search is only practical when the number of features is small.
* In text data, features might be size of characters or some global features of the text. Feature selection will keep only certain features of those.
* Feature selection is done in the context of an optimization problem.

**Dimension Reduction**

* Dimensionality reduction is generic and only depends on the data and not on what you plan to do with it.
* In a classification problem, feature selection picks the features that help you classify your data better, while a dimensionality reduction algorithm is unaware of this goal and just projects the data into a lower-dimensional space. That projection may or may not work well for your classification algorithm.
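
The contrast can be made concrete with a small sketch on made-up data with two features and a target. Feature selection keeps one of the original columns (here, the one with the highest absolute Pearson correlation with the target, which is just one possible filter criterion). Feature extraction builds a new column by projecting both features onto the direction of largest variance, which is what PCA does; the closed-form angle used below only applies to the two-feature case. All variable names are hypothetical.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pearson(xs, ys):
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: two features and a target.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]

# Feature selection: keep the ORIGINAL feature most correlated with the target.
features = {"x1": x1, "x2": x2}
selected = max(features, key=lambda name: abs(pearson(features[name], y)))

# Feature extraction (2-feature PCA): build ONE NEW feature by projecting onto
# the direction of largest variance. The target y is never consulted.
sxx, syy, sxy = covariance(x1, x1), covariance(x2, x2), covariance(x1, x2)
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the first principal axis
axis = (math.cos(theta), math.sin(theta))
m1, m2 = mean(x1), mean(x2)
pc1 = [(a - m1) * axis[0] + (b - m2) * axis[1] for a, b in zip(x1, x2)]

print("selected feature:", selected)
print("variance of new feature:", covariance(pc1, pc1))
```

The selected feature is still interpretable as `x1`, while `pc1` is a mixture of `x1` and `x2` that captures more variance than either original feature alone.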

### 4. Make quick notes on any two of the following:

   1. The histogram
   2. Use a scatter plot
   3. PCA (Principal Component Analysis)


Ans : 

### The histogram

A histogram is a graphical representation that organizes a group of data points into user-specified ranges. Similar in appearance to a bar graph, the histogram condenses a data series into an easily interpreted visual by taking many data points and grouping them into logical ranges or bins.

**KEY TAKEAWAYS**

* A histogram is a bar graph-like representation of data that buckets a range of outcomes into columns along the x-axis.
* The y-axis represents the count or percentage of occurrences in the data for each column and can be used to visualize data distributions.
* In trading, the MACD histogram is used by technical analysts to indicate changes in momentum.

Example :

![Example histogram](https://www.investopedia.com/thmb/iM-OWscIDJ31DrGOwwtdBXyIBUg=/660x0/filters:no_upscale():max_bytes(150000):strip_icc():format(webp)/Histogram2-3cc0e953cc3545f28cff5fad12936ceb.png)


___________________________________________________________________________________________________________________________


### The Scatter Plot

A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables.

**When you should use a scatter plot**

Scatter plots’ primary uses are to observe and show relationships between two numeric variables. The dots in a scatter plot not only report the values of individual data points, but also patterns when the data are taken as a whole.

![Scatter plot example](https://chartio.com/assets/5689fd/tutorials/charts/scatter-plots/a9b8dd5dc2057a70446e5aa32f32b49d54b55f5cabf17a4610e2da94bea7fed5/scatter-plot-example-2.png)

### 5. Why is it necessary to investigate data? Is there a discrepancy in how qualitative and quantitative data are explored?

Ans : Investigating the data before modeling is common advice among data scientists. If your data set is messy, building models will not help you solve your problem; the result will be “garbage in, garbage out.” To build a powerful machine learning model, we need to explore and understand our data set before we define a predictive task and solve it.

**The Key Concepts To Investigating Your Dataset**

* Ask the right questions
* Analyze different subsets of data
* Explore trends
* Find your blind spots
* Investigate the whys

Yes, there is a difference in how qualitative and quantitative data are explored: the nature and objective of the two kinds of data are different, so they are explored in different ways.

<img src="images/how qualitative and quantitative data are explored.PNG">

### 6. What are the various histogram shapes? What exactly are ‘bins'?

Ans : The various histogram shapes are:

* **Normal Distribution**

    A common pattern is the bell-shaped curve known as the "normal distribution." In a normal or "typical" distribution, points are as likely to occur on one side of the average as on the other. Note that other distributions look similar to the normal distribution. Statistical calculations must be used to prove a normal distribution.

    It's important to note that "normal" refers to the typical distribution for a particular process. For example, many processes have a natural limit on one side and will produce skewed distributions. This is normal—meaning typical—for those processes, even if the distribution isn’t considered "normal."
    ![Normal distribution histogram](https://asq.org/-/media/Images/Learn-About-Quality/Histogram/dcat-histogram-normal.gif?la=en)


* **Skewed Distribution**

    The skewed distribution is asymmetrical because a natural limit prevents outcomes on one side. The distribution’s peak is off center toward the limit and a tail stretches away from it. For example, a distribution of analyses of a very pure product would be skewed, because the product cannot be more than 100 percent pure. Other examples of natural limits are holes that cannot be smaller than the diameter of the drill bit or call-handling times that cannot be less than zero. These distributions are called right- or left-skewed according to the direction of the tail.
    ![Right-skewed distribution histogram](https://asq.org/-/media/Images/Learn-About-Quality/Histogram/dcat-histogram-right.gif?la=en)


* **Double-Peaked or Bimodal**

    The bimodal distribution looks like the back of a two-humped camel. The outcomes of two processes with different distributions are combined in one set of data. For example, a distribution of production data from a two-shift operation might be bimodal, if each shift produces a different distribution of results. Stratification often reveals this problem.
    ![Bimodal distribution histogram](https://asq.org/-/media/Images/Learn-About-Quality/Histogram/dcat-histogram-bimodal.gif?la=en)


* **Plateau or Multimodal Distribution**

    The plateau might be called a “multimodal distribution.” Several processes with normal distributions are combined. Because there are many peaks close together, the top of the distribution resembles a plateau.
    ![Plateau distribution histogram](https://asq.org/-/media/Images/Learn-About-Quality/Histogram/dcat-histogram-plateau.gif?la=en)


* **Edge Peak Distribution**

    The edge peak distribution looks like the normal distribution except that it has a large peak at one tail. Usually this is caused by faulty construction of the histogram, with data lumped together into a group labeled “greater than.”
    ![Edge peak distribution histogram](https://asq.org/-/media/Images/Learn-About-Quality/Histogram/edge-peak-histogram.gif?la=en)

* **Comb Distribution**

    In a comb distribution, the bars are alternately tall and short. This distribution often results from rounded-off data and/or an incorrectly constructed histogram. For example, temperature data rounded off to the nearest 0.2 degree would show a comb shape if the bar width for the histogram were 0.1 degree.
    ![Comb distribution histogram](https://asq.org/-/media/Images/Learn-About-Quality/Histogram/dcat-histogram-comb.gif?la=en)


* **Truncated or Heart-Cut Distribution**

    The truncated distribution looks like a normal distribution with the tails cut off. The supplier might be producing a normal distribution of material and then relying on inspection to separate what is within specification limits from what is out of spec. The resulting shipments to the customer from inside the specifications are the heart cut.
    ![Truncated distribution histogram](https://asq.org/-/media/Images/Learn-About-Quality/Histogram/dcat-histogram-truncated.gif?la=en)


* **Dog Food Distribution**

    The dog food distribution is missing something: results near the average. If a customer receives this kind of distribution, someone else is receiving a heart cut and the customer is left with the “dog food,” the odds and ends left over after the master’s meal. Even though what the customer receives is within specifications, the product falls into two clusters: one near the upper specification limit and one near the lower specification limit. This variation often causes problems in the customer’s process.
    ![Dog food distribution histogram](https://asq.org/-/media/Images/Learn-About-Quality/Histogram/dcat-histogram-dog-food.gif?la=en)
    
    
**A histogram displays numerical data by grouping data into "bins" of equal width. Each bin is plotted as a bar whose height corresponds to how many data points are in that bin.
Bins are also sometimes called "intervals", "classes", or "buckets".**
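
To show what binning means in practice, here is a minimal sketch that groups values into equal-width bins and counts how many fall into each. The function name `histogram`, the sample data, and the choice of 4 bins are hypothetical; the sketch assumes the data contains at least two distinct values so that the bin width is not zero.

```python
def histogram(values, num_bins):
    """Count values in `num_bins` equal-width bins spanning min..max."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins  # assumes high > low
    counts = [0] * num_bins
    for v in values:
        # The maximum value belongs to the last bin, not a bin of its own.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(num_bins + 1)]
    return edges, counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = histogram(data, num_bins=4)

# Text rendering: one row per bin, one '#' per data point in that bin.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:.0f}, {right:.0f}): {'#' * count}")
```

Each bar's height is simply the count for its bin, so changing `num_bins` changes the shape of the histogram without changing the data.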

### 7. How do we deal with data outliers?

Ans : An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.

![Outlier example](https://cxl.com/wp-content/uploads/2017/01/outlier.jpg)

There are also different degrees of outliers:

* Mild outliers lie beyond an “inner fence” on either side.
* Extreme outliers are beyond an “outer fence.”
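
The fences are not defined above, so the following sketch assumes a widely used convention: the inner fences sit 1.5 interquartile ranges (IQR) beyond the first and third quartiles, and the outer fences sit 3 IQR beyond them. The function name `classify_outliers` and the sample data are hypothetical.

```python
from statistics import quantiles

def classify_outliers(values):
    """Label each value as 'normal', 'mild', or 'extreme' using quartile fences."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles (default 'exclusive' method)
    iqr = q3 - q1
    # Assumed convention: inner fences at 1.5 * IQR, outer fences at 3 * IQR.
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    labelled = []
    for v in values:
        if v < outer_low or v > outer_high:
            label = "extreme"
        elif v < inner_low or v > inner_high:
            label = "mild"
        else:
            label = "normal"
        labelled.append((v, label))
    return labelled

sample = [10, 11, 12, 12, 13, 13, 14, 14, 15, 22, 40]
labelled = classify_outliers(sample)
flagged = [(v, label) for v, label in labelled if label != "normal"]
print(flagged)  # [(22, 'mild'), (40, 'extreme')]
```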

### 5 ways to deal with outliers in data

**1. Set up a filter in your testing tool**
* Although filtering has a small cost, filtering out outliers is worth it. You often discover significant effects that are simply “hidden” by outliers.
    
**2. Remove or change outliers during post-test analysis**
* One way to handle outliers is simply to remove them, or to trim your data set to exclude as many as you’d like.

**3. Change the value of outliers**
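* Essentially, instead of removing outliers from the data, you change their values to something more representative of your data set. It’s a small but important distinction: when you trim data, the extreme values are discarded; when you change them (often called winsorizing), every observation is kept but the extreme values are replaced with less extreme ones.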
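
The difference between removing outliers and changing their values can be sketched in a few lines. The revenue figures and the cut-off limits are made up, and in practice the limits would usually come from percentiles of the data rather than fixed numbers.

```python
def trim(values, lower, upper):
    """Discard values outside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]

def winsorize(values, lower, upper):
    """Keep every observation but clamp extremes to the nearest limit."""
    return [min(max(v, lower), upper) for v in values]

revenue = [12, 15, 14, 13, 500, 16, 1]
limits = (10, 20)  # hypothetical cut-offs

trimmed = trim(revenue, *limits)
winsorized = winsorize(revenue, *limits)
print(trimmed)     # [12, 15, 14, 13, 16]
print(winsorized)  # [12, 15, 14, 13, 20, 16, 10]
```

Trimming shrinks the sample from 7 to 5 observations, while changing the values keeps all 7 and only pulls the two extremes back to the limits.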

**4. Consider the underlying distribution**
* Traditional methods to calculate confidence intervals assume that the data follows a normal distribution, but for certain metrics, like average revenue per visitor, that usually isn’t how reality works.

**5. Consider the value of mild outliers**
* As exemplified by revenue per visitor, the underlying distribution is often non-normal. It’s common for a few big buyers to skew the data set toward the extremes. When this is the case, outlier detection falls prey to predictable inaccuracies—it detects outliers far more often.