<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/MLPG-Book-Cover-Small.png"><br>

This notebook contains an excerpt from the **`Machine Learning Project Guidelines - For Beginners`** book written by *Balasubramanian Chandran*; the content is available [on GitHub](https://github.com/BalaChandranGH/Books/ML-Project-Guidelines).

<br>
<!--NAVIGATION-->

<[ [Stage-1: Business Understanding](07.00-mlpg-Stage-1-Business-Understanding.ipynb) | [Contents and Acronyms](00.00-mlpg-Contents-and-Acronyms.ipynb) | [Stage-3: Research](09.00-mlpg-Stage-3-Research.ipynb) ]>

# 8. Stage-2: Data Understanding

It is assumed that, by this stage, the DS team has already made the following available, so that the ML team can focus mainly on EDA.
* Data requirements definition identifying the following:
  - Data sources (and they could be)
    - In-house or external source
    - APIs, XML feeds, CSVs, Excel files
    - Data mining/web scraping of online sources
  - Data pipelines (and they could be)
    - Streaming vs Batch
    - Ingestion frequency
  - Data environments
    - Small vs Medium vs Big data
* Collected raw datasets (and they should be)
  - Diverse
  - Unbiased
  - Abundant

## 8.1. Exploratory Data Analysis (EDA)
### 8.1.1. Objectives of EDA
* To get an overview of the distribution of the dataset
* Check for missing numerical values, outliers, or other anomalies in the dataset
* Discover patterns and relationships between variables in the dataset
* Check the underlying assumptions in the dataset

### 8.1.2. Prerequisites of EDA
* _**NumPy**_: Python library for scientific computing supporting multi-dimensional arrays, matrices, and high-level mathematical functions
* _**Pandas**_: Python library for data analysis and manipulation
* _**Matplotlib**_: A core data visualization library for Python with an object-oriented API for embedding plots into applications
* _**Seaborn**_: A data visualization library for Python built on top of Matplotlib providing high-level visualizations

### 8.1.3. Types of variables
**`A) Numeric/Quantitative variables`**: contain numerical data. They can be further classified into two categories.

**Integer/Discrete variables:**
* These are numeric variables that have a finite number of countable values between any two values
* A discrete variable is always numeric and whole (1, 2, 3, 5, 10, 50, etc.)
* Examples:
  - The number of customer complaints
  - The number of flaws or defects

**Float/Continuous variables:**
* These are numeric variables that have an infinite number of values between any two values
* A continuous variable can be numeric and float (0.1, 0.2, 0.3, etc.) or date/time
* Examples:
  - Income
  - Temperature
  - Height
  - Weight
  - Distance
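
As a rough illustration of the discrete/continuous split, the sketch below uses pandas dtypes to separate whole-number columns from floating-point ones. The column names and values are hypothetical, and the dtype is only a hint: a float column can still hold discrete values, so it is worth checking against the meaning of each variable.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "complaints": [0, 2, 1, 5],                # discrete: whole-number counts
    "temperature": [21.5, 19.8, 23.1, 22.4],   # continuous: any value in a range
})

# pandas infers an integer dtype for the counts and a float dtype for the measurements
print(df.dtypes)

# Split the columns by their inferred numeric kind
discrete_cols = df.select_dtypes(include=np.integer).columns.tolist()
continuous_cols = df.select_dtypes(include=np.floating).columns.tolist()

print("Discrete:  ", discrete_cols)
print("Continuous:", continuous_cols)
```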

**Interval variable:**
* It takes numeric values and may be classified as a continuous variable type
* Arithmetic operations can be performed on interval variables. However, these operations are restricted to only addition and subtraction
* The interval variable is an extension of the ordinal variable. In other words, we could say interval variables are built upon ordinal variables
* The intervals on the scale are equal in an interval variable. The scale is equidistant (equally spaced)
* The variables are measured using an interval scale, which not only shows the order but also shows the exact difference in the value
* It has no true (absolute) zero: the zero point is arbitrary and does not mean the absence of the quantity
* Examples:
  - Temperature
  - Time
  - CGPA

**Ratio variable:**
* Ratio variables have an absolute zero
* The zero point is what makes it possible to compare magnitudes and to perform multiplication and division. Therefore, we can say that an object is twice as big or as long as another
* It has an intrinsic order with an equidistant scale, i.e., all the levels in the ratio scale have an equal distance
* Because of its absolute zero, a ratio variable, unlike an interval variable, cannot take negative values. Therefore, before measuring any object on a ratio scale, researchers first need to check that it satisfies all the properties of an interval variable as well as the zero-point characteristic
* The ratio variable is the highest level of measurement in statistical analysis. It allows for the addition, subtraction, multiplication, and division of values
* Also, all statistical analyses including mean, mode, median, etc. can be calculated on the ratio scale
* Examples:
  - Height
  - Weight
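
The difference between interval and ratio scales can be checked with a small calculation. The sketch below uses two made-up temperatures: on the Celsius scale (interval) only differences are meaningful, while on the Kelvin scale (ratio, with an absolute zero) ratios are meaningful too.

```python
# Two hypothetical temperature readings in degrees Celsius (interval scale)
c_low, c_high = 10.0, 20.0

# Differences are meaningful on an interval scale
celsius_diff = c_high - c_low

# A ratio on the Celsius scale is arithmetically possible but not meaningful:
# 20 degrees C is not "twice as hot" as 10 degrees C
naive_ratio = c_high / c_low

# Kelvin has an absolute zero, so the ratio is meaningful on this scale
k_low, k_high = c_low + 273.15, c_high + 273.15
kelvin_ratio = k_high / k_low

print("Difference (C):", celsius_diff)
print("Naive Celsius ratio:", naive_ratio)
print("Kelvin ratio:", round(kelvin_ratio, 4))
```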

**`B) Categorical/Qualitative variables`**: are variables that describe a particular category. Usually, they take on one of several fixed values. E.g.,
* The category “hair color” could contain the categorical values “black,” “brown,” “blonde,” and “red.”
* The category “gender” could contain the categorical values “Male” or “Female”.

We can divide categorical data into three types: Nominal data, Ordinal data & Boolean data.

**Nominal variable:**
* A variable that has two or more categories
* There is no ordering involved with this type of variable and there is no agreed way to order these from highest to lowest
* A nominal variable is qualitative, which means numbers are used here only to categorize or identify objects
* They can also be coded with numbers; however, these numbers do not have numeric properties, i.e., arithmetic operations cannot be performed on them
* Examples:
  - Gender - male, female
  - Color - red, green, blue, black, etc.
  - Hair color - black, blonde, brown, brunette, red, etc.
  - Name - AAAAA, BBBBB, CCCCC, etc.
  - Phone number

**Ordinal variable:**
* It is an extension of Nominal data
* It has a rank or order
* It establishes a relative rank
* It has no standardized interval scale
* It measures qualitative traits
* The median and mode can be analyzed
* There are 2 types of Ordinal variables and they are:
  - Ordinal Variable with Numeric Value
    - Example 1: How satisfied are you (on a scale of 1 - 5)?
      1. Very satisfied 
      2. Satisfied
      3. Indifferent
      4. Dissatisfied 
      5. Very dissatisfied
    - Example 2: What is your age group (on a scale of 1 - 3)?
      1. Low (13-19 years)
      2. Medium (20-50 years)
      3. High (51-99 years)
  - Ordinal Variable without Numeric Value
    - Example 1: How satisfied are you?
      * Very satisfied 
      * Satisfied
      * Indifferent
      * Dissatisfied 
      * Very dissatisfied
    - Example 2: What is your age group?
      * Low (13-19 years)
      * Medium (20-50 years)
      * High (51-99 years)
* Differences between nominal and ordinal variables
  - An ordinal variable has an intrinsic order, while a nominal variable does not
  - Only the mode of a nominal variable can be analyzed, whereas the median, mode, quantiles, percentiles, etc. can be computed for an ordinal variable
  - The statistical tests carried out on nominal and ordinal variables are different
* Similarities between nominal and ordinal variables
  - They are both types of categorical variables
  - For both, the mean is not meaningful, while the mode can be computed
  - They are both visualized using bar charts and pie charts
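
One way to represent the nominal/ordinal distinction in pandas is the `category` dtype, with `ordered=True` for ordinal data. The sketch below is only an illustration with invented survey responses; taking the median through the integer category codes is one possible approach, not the only one.

```python
import pandas as pd

# Hypothetical satisfaction levels, listed from lowest to highest
levels = ["Very dissatisfied", "Dissatisfied", "Indifferent",
          "Satisfied", "Very satisfied"]
responses = ["Satisfied", "Very satisfied", "Indifferent",
             "Satisfied", "Dissatisfied"]

# Ordinal: categories carry an explicit order
ordinal = pd.Series(pd.Categorical(responses, categories=levels, ordered=True))

# Nominal: categories have no order
nominal = pd.Series(["red", "blue", "red", "green", "red"], dtype="category")

# The mode is available for both kinds of categorical variable
nominal_mode = nominal.mode()[0]
ordinal_mode = ordinal.mode()[0]

# Order-based operations only make sense for the ordinal variable
ordinal_max = ordinal.max()
above_neutral = int((ordinal > "Indifferent").sum())

# Median via the integer codes (0 = lowest level); odd sample size here
median_code = int(ordinal.cat.codes.median())
ordinal_median = levels[median_code]

print(nominal_mode, ordinal_mode, ordinal_max, above_neutral, ordinal_median)
```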

**Boolean variable:**
* A variable that has only two states such as True or False (or) Yes or No (or) 1 or 0

![](figures/MLPG-TypesOfVariables.png)

### 8.1.4. Distribution of Variables
**Univariate distribution**: There is only one variable under consideration. It is the simplest form of analysis because only one quantity changes. It does not deal with causes or relationships. The main purpose of the analysis is to describe the data and find patterns that exist within it. An example of univariate data is the heights of a group of people. We can describe patterns found in univariate data using central tendency (mean, median, and mode) and dispersion (range, variance, standard deviation, maximum and minimum values, and interquartile range). We can visualize univariate data using various types of charts and graphs, including frequency distribution tables, histograms, bar charts, pie charts, and frequency polygons.
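
The central tendency and dispersion measures listed above can be computed directly with pandas. This is a minimal sketch on a made-up series of heights; the values are illustrative only.

```python
import pandas as pd

# Hypothetical heights in centimetres
heights_cm = pd.Series([158, 162, 165, 165, 170, 172, 180])

# Central tendency
central = {
    "mean": heights_cm.mean(),
    "median": heights_cm.median(),
    "mode": heights_cm.mode().tolist(),
}

# Dispersion
q1, q3 = heights_cm.quantile([0.25, 0.75])
dispersion = {
    "min": heights_cm.min(),
    "max": heights_cm.max(),
    "range": heights_cm.max() - heights_cm.min(),
    "variance": heights_cm.var(),   # sample variance (ddof=1)
    "std": heights_cm.std(),        # sample standard deviation (ddof=1)
    "iqr": q3 - q1,
}

print(central)
print(dispersion)
```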

**Bivariate distribution**: This type of data distribution involves two different variables. The analysis of this type of data deals with causes and relationships, and it is done to find out the relationship between the two variables. A very common example of bivariate data is the heights and weights of a group of people. It is one of the simplest forms of statistical analysis, used to find out if there is a relationship between two sets of values. Thus, bivariate data analysis involves comparisons, exploring relationships, and finding causes and explanations. These variables are often plotted on the X and Y axes of a graph for a better understanding of the data, with one variable treated as independent and the other as dependent. Common types of bivariate analysis include drawing scatter plots, regression analysis, and finding correlation coefficients. A scatter plot is used to find out if there exists any relationship between two variables. Regression analysis is a statistical method for estimating the relationships between variables. The correlation coefficient measures the strength and direction of a linear relationship between two variables.
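
As a small sketch of the correlation and regression steps described above, the example below computes the Pearson correlation coefficient and a least-squares line with NumPy. The height/weight pairs are invented for illustration.

```python
import numpy as np

# Hypothetical paired measurements for six people
height_cm = np.array([150, 160, 165, 172, 180, 188], dtype=float)
weight_kg = np.array([52, 58, 63, 70, 77, 85], dtype=float)

# Pearson correlation coefficient: strength and direction of the linear relationship
r = np.corrcoef(height_cm, weight_kg)[0, 1]

# Simple linear regression (least squares): weight = slope * height + intercept
slope, intercept = np.polyfit(height_cm, weight_kg, deg=1)

print(f"Pearson r = {r:.3f}")
print(f"weight_kg = {slope:.3f} * height_cm + {intercept:.3f}")
```

A value of `r` close to +1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one; it says nothing about which variable causes the other.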

**Multivariate distribution**: When the dataset involves three or more variables, it is categorized under multivariate distribution. Multivariate analysis is used to study more complex sets of data. It is usually unsuitable for small sets of data. There is a wide variety of techniques for performing multivariate analysis. The choice of technique depends on the dataset and the goals of the analysis. Some examples of multivariate analysis techniques are additive tree, cluster analysis, correspondence analysis, factor analysis, MANOVA (multivariate analysis of variance), multidimensional scaling, multiple regression analysis, principal component analysis, and redundancy analysis.
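
To make one of these techniques concrete, the sketch below performs a principal component analysis with plain NumPy (centering followed by a singular value decomposition). The data is synthetic and generated only for illustration; in practice a library implementation such as `sklearn.decomposition.PCA` would typically be used instead.

```python
import numpy as np

# Synthetic dataset: 100 samples, 4 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
# Make feature 1 strongly dependent on feature 0 so one component dominates
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]

# 1) Center each feature at zero
Xc = X - X.mean(axis=0)

# 2) SVD of the centered data; the rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3) Share of the total variance explained by each component
explained_variance_ratio = S**2 / np.sum(S**2)

# 4) Project onto the first two principal components
X_2d = Xc @ Vt[:2].T

print("Explained variance ratio:", np.round(explained_variance_ratio, 3))
print("Projected shape:", X_2d.shape)
```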

### 8.1.5. Summary of EDA Techniques
EDA techniques depend on the type of data and the objectives of the analysis. The following is a summary of useful EDA techniques:
```
    Type of data                EDA techniques
    -------------------------   ----------------------
    Categorical                 Descriptive statistics
    Univariate discrete         Barplot
    Univariate continuous       Line plot, Histogram
    Bivariate continuous        2-D scatter plot
    2-D arrays                  Heatmap
    Multivariate distribution   3-D scatter plot
    Multiple groups             Boxplot
```
The following table summarizes the useful EDA techniques depending on the objective:
```
    Objective                                      EDA techniques
    --------------------------------------------   ----------------------------------------------
    Check the distribution of a variable           Histogram
    Find outliers                                  Histogram, scatterplot, box-and-whisker plot
    Quantify the relationship between variables    2-D scatter plot, covariance, and correlation
    Visualize the relationship between variables   Heatmap
    Visualize high-dimensional data                Principal component analysis, 3-D scatter plot
```
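
For the "Find outliers" objective, a box-and-whisker plot conventionally flags points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule numerically with NumPy; the sample values are made up, and the 1.5 factor is the common convention rather than a requirement.

```python
import numpy as np

# Hypothetical measurements with one suspiciously large value
values = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 58], dtype=float)

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Whisker limits used by the 1.5 * IQR convention
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside the limits are flagged as potential outliers
outliers = values[(values < lower) | (values > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Limits: [{lower}, {upper}]")
print("Potential outliers:", outliers)
```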

### 8.1.6. Text EDA: Understand the data with Descriptive Statistics
* Dimensions of the dataset
* An initial look at the raw data (e.g., first 10 rows & last 10 rows)
* Basic information of the dataset
* Statistical summary of the dataset
* Class distribution of the dataset
* Explore NA / NULL values in the dataset
* Explore duplicates in the dataset
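
The checks in the list above map onto a handful of pandas calls. The sketch below runs them on a tiny, invented DataFrame; the column names (`age`, `income`, `label`) are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value per numeric column and one duplicate row
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 32],
    "income": [48000.0, 61000.0, 52000.0, np.nan, 61000.0],
    "label":  ["yes", "no", "yes", "no", "no"],
})

print(df.shape)        # dimensions: (rows, columns)
print(df.head(10))     # first rows
print(df.tail(10))     # last rows
df.info()              # column dtypes and non-null counts
print(df.describe())   # statistical summary of the numeric columns

# Class distribution of the target column
class_counts = df["label"].value_counts()

# Missing values per column
null_counts = df.isnull().sum()

# Number of fully duplicated rows (later repeats of an earlier row)
n_duplicates = int(df.duplicated().sum())

print(class_counts)
print(null_counts)
print("Number of duplicates found:", n_duplicates)
```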

### 8.1.7. Visual EDA: Understand the data with Visualizations
* Draw Univariate plots to better understand each attribute
  - Box and Whisker plots
  - Histograms
  - Pie-charts or Bar-charts (horizontal or vertical) to understand the distributions of data in int or float data-type
* Draw Multivariate plots to better understand the relationships between attributes
  - Scatter Plot Matrix
  - Correlation maps
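
The plots listed above can be sketched with Matplotlib and pandas as follows. The data is synthetic, and the `Agg` backend is selected only so that the example also runs without a display; in a notebook that line can be dropped and the figures shown inline.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic data: column "c" is built to correlate strongly with column "a"
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
df["c"] = 0.8 * df["a"] + 0.2 * df["c"]

# Univariate plots: box-and-whisker plot and histogram
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df.boxplot(ax=axes[0])
axes[0].set_title("Box and whisker plots")
df["a"].hist(ax=axes[1], bins=20)
axes[1].set_title("Histogram of 'a'")

# Multivariate plot: scatter plot matrix
axes_matrix = scatter_matrix(df, figsize=(6, 6))

# Multivariate plot: correlation map
corr = df.corr()
fig2, ax2 = plt.subplots()
im = ax2.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax2.set_xticks(range(len(corr.columns)))
ax2.set_xticklabels(corr.columns)
ax2.set_yticks(range(len(corr.columns)))
ax2.set_yticklabels(corr.columns)
fig2.colorbar(im, ax=ax2)
ax2.set_title("Correlation map")
```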

## 8.2. Deliverables from Stage-2
* Data requirements definition
* Collected raw datasets & their metadata
* Raw data summary report
* Exploratory Data Analysis (EDA) report

## 8.3. Notebook development tips

In [None]:
###   (2.1) IMPORT LIBRARIES, MODULES, FUNCTIONS, OBJECTS   ###

## Import necessary libraries for this project. Examples are given below ##
# import sys         as sys
# import csv         as csv
# import numpy       as np
# import pandas      as pd
# import sklearn     as sk
# import scipy       as sp
# import seaborn     as sns
# import imblearn    as imb
# import matplotlib  as mpl
# import pickle      as pickle
# import warnings
# warnings.filterwarnings('ignore')

## Import necessary Modules, Functions and Objects from the Libraries. Examples given below ##
# from numpy                         import loadtxt
# from pandas                        import read_csv, read_excel
# from pandas.plotting               import scatter_matrix
# from datetime                      import datetime
# from collections                   import Counter
# from scipy.stats                   import boxcox as BoxCoxScaler
# from imblearn.over_sampling        import SMOTE
# from sklearn.metrics               import accuracy_score, roc_auc_score, make_scorer
# from sklearn.metrics               import confusion_matrix, classification_report
# from sklearn.preprocessing         import LabelEncoder, RobustScaler, StandardScaler, MinMaxScaler
# from sklearn.model_selection       import ShuffleSplit, cross_val_score, train_test_split
# from sklearn.model_selection       import GridSearchCV, KFold, StratifiedKFold, RepeatedStratifiedKFold
# from matplotlib                    import pyplot as plt
# --------- For Regression ----------
# from sklearn.linear_model          import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
# from sklearn.neighbors             import KNeighborsRegressor
# from sklearn.svm                   import SVR
# from sklearn.tree                  import DecisionTreeRegressor
# from sklearn.ensemble              import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
# from xgboost                       import XGBRegressor
# -------- For Classification -------
# from sklearn.linear_model          import LogisticRegression, SGDClassifier
# from sklearn.neighbors             import KNeighborsClassifier
# from sklearn.svm                   import SVC
# from sklearn.naive_bayes           import GaussianNB
# from sklearn.tree                  import DecisionTreeClassifier
# from sklearn.ensemble              import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
# from xgboost                       import XGBClassifier

## Check the versions of the imported libraries
# print("Versions of imported libraries:")
# print("Python:     {}".format(sys.version))
# print("Numpy:      {}".format(np.__version__))
# print("Pandas:     {}".format(pd.__version__))
# print("skLearn:    {}".format(sk.__version__))
# print("Scipy:      {}".format(sp.__version__))
# print("Seaborn:    {}".format(sns.__version__))
# print("Imblearn:   {}".format(imb.__version__))
# print("Matplotlib: {}".format(mpl.__version__))

# Optional settings
# mpl.style.use('ggplot')
# sns.set(style='whitegrid')
# pd.set_option('display.max_columns', None, 'display.precision', 3)

# print(__doc__)

In [None]:
###   (2.3) DATA LOADING   ###

## Load the necessary data files for this project. Examples given below ##
# If the data file is in csv format, it can be loaded in 3 different ways
#   - Using 'csv.reader' function from the standard library 'csv'
#   - Using 'loadtxt'    function from 'numpy'  library
#   - Using 'read_csv'   function from 'pandas' library
# If the data file is in Excel format, it can be loaded
#   - Using 'read_excel' function from 'pandas' library
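
The CSV loading options listed above can be compared side by side. This sketch reads the same small CSV text, held in memory through `io.StringIO` so that no file is needed; the column names `x` and `y` are invented for the example.

```python
import csv
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content; in a real project this would be a file path
csv_text = "x,y\n1,2.5\n3,4.5\n"

# 1) Standard library: every value is returned as a string
rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]

# 2) NumPy: numeric array; the header row has to be skipped explicitly
arr = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

# 3) pandas: DataFrame with the header used as column names and dtypes inferred
df = pd.read_csv(io.StringIO(csv_text))

print(header, records)
print(arr)
print(df)
```

For mixed-type tabular data, `read_csv` is usually the most convenient of the three, since it keeps column names and infers a dtype per column.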

In [None]:
###   (2.4) EXPLORATORY DATA ANALYSIS (EDA)   ###

## Understand the data with Descriptive Statistics. Examples given below ##
# Use 'shape' attribute to see the dimensions of the datasets
# Use 'head()' & 'tail()' functions to see the first & last few samples of the datasets
# Use 'info()' function to see the basic information of the dataset
# Use 'describe()' function to see the statistical summary of the datasets
# Use 'Counter' or 'groupby()' function to see the class distribution of the datasets
# Use 'df.isnull().sum()' to explore NA / NULL values in the dataset
#     null_values     = pd.DataFrame(data=df.isnull().sum(), columns=['NULL count'])
#     null_values_per = pd.DataFrame(round(df.isnull().sum() / len(df) * 100, 2), columns=['NULL Percentage'])
#     null_values_df = pd.concat([null_values, null_values_per], axis=1)
# Use 'df.drop_duplicates()' to find the duplicates in the dataset
#     dup_df = df.drop_duplicates()
#     print('Number of duplicates found:', len(df) - len(dup_df))
# Use the following 'style' functions to visualize the pandas dataframes for better aesthetic
#   - Hiding function [hide_index(), hide_columns(); replaced by hide(axis=...) in pandas >= 1.4]
#   - Highlight function [highlight_max(), highlight_min(), highlight_null()]
#   - Gradient function [background_gradient()]

## Understand the data with Visualizations ##
# Draw Univariate plots to better understand each attribute
#   - Draw 'box and whisker plots'
#   - Draw 'histograms'
