# 1. Exploratory Data Analysis

Exploring the data is the first step in any data science project.

#### Data types

<table style='margin-left: 0; border: 1px solid'>
    <tr>
        <th>Term</th>
        <th>Explanation</th>
    </tr>
    <tr>
        <td>Numeric</td>
        <td>Data that are expressed on a numeric scale.</td>
    </tr>
     <tr>
        <td><i>Continuous</i></td>
        <td>Data that can take on any (numeric) value in an interval.</td>
    </tr>
     <tr>
        <td><i>Discrete</i></td>
        <td>Data that can take only integer values, such as counts.</td>
    </tr>
      <tr>
        <td>Categorical</td>
        <td>Data that can take only a specific set of values, representing a set of possible categories.</td>
    </tr>
    <tr>
        <td><i>Binary</i></td>
        <td>A special case of categorical data with just two categories of values, e.g. 0/1, true/false.</td>
    </tr>
    <tr>
        <td><i>Ordinal</i></td>
        <td>Categorical data that has an explicit ordering.</td>
    </tr>
</table>
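
As a rough illustration of how these data types might be represented in practice, the sketch below uses pandas. The column names and values are hypothetical, invented only for this example; the mapping from each term in the table to a pandas dtype is one reasonable choice, not the only one.

```python
import pandas as pd

# Hypothetical survey data, invented purely for illustration.
df = pd.DataFrame({
    "height_cm": [172.5, 160.1, 181.3, 168.0],           # numeric, continuous
    "n_children": [0, 2, 1, 3],                          # numeric, discrete (counts)
    "city": ["Oslo", "Lima", "Oslo", "Pune"],            # categorical
    "smoker": [False, True, False, False],               # binary
    "satisfaction": ["low", "high", "medium", "high"],   # ordinal
})

# Categorical: a fixed set of possible values with no inherent order.
df["city"] = df["city"].astype("category")

# Ordinal: categorical data with an explicit ordering low < medium < high.
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["low", "medium", "high"], ordered=True
)

print(df.dtypes)

# Because the categories are ordered, comparisons are meaningful.
n_above_low = int((df["satisfaction"] > "low").sum())
print(n_above_low)
```

Declaring the ordering explicitly matters: without `ordered=True`, pandas treats the levels as unordered labels and rejects comparisons such as `> "low"`.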

#### Rectangular data

_Rectangular data_ is the general term for a two-dimensional matrix with rows indicating records (cases) and columns indicating features (variables).

<table style='margin-left: 0; border: 1px solid'>
    <tr>
        <th>Term</th>
        <th>Explanation</th>
    </tr>
    <tr>
        <td>Data frame</td>
        <td>Rectangular data (like a spreadsheet) is the basic data structure for statistical and machine learning models.</td>
    </tr>
     <tr>
        <td>Feature</td>
        <td>A column within a table is commonly referred to as a <i>feature</i>.</td>
    </tr>
     <tr>
        <td>Outcome</td>
        <td>Many data science projects involve predicting an <i>outcome</i>. The <i>features</i> are sometimes used to predict the <i>outcome</i> in an experiment or study.</td>
    </tr>
      <tr>
        <td>Record</td>
        <td>A row within a table is commonly referred to as a <i>record</i>.</td>
    </tr>
</table>
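
The terms above can be sketched with a small pandas data frame. The housing data, the column names, and the choice of `price` as the outcome are all assumptions made up for this example.

```python
import pandas as pd

# Hypothetical housing data: each row is a record, each column a variable.
houses = pd.DataFrame({
    "sqft": [850, 1200, 1500, 2000],
    "bedrooms": [2, 3, 3, 4],
    "price": [150_000, 210_000, 260_000, 340_000],
})

# shape gives (number of records, number of columns).
n_records, n_columns = houses.shape

# Features: the columns used as inputs for a prediction.
features = houses[["sqft", "bedrooms"]]

# Outcome: the column we would like to predict.
outcome = houses["price"]

# A single record is one row of the table.
first_record = houses.iloc[0]

print(n_records, n_columns)
print(first_record)
```

Which column counts as the outcome depends on the question being asked; the data frame itself does not distinguish features from outcomes.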

#### Estimates of location

A basic step in exploring your data is getting a "typical value" for each feature (variable): an estimate of where most of the data is located.

<table style='margin-left: 0; border: 1px solid'>
    <tr>
        <th>Term</th>
        <th>Explanation</th>
        <th>Formula</th>
    </tr>
    <tr>
        <td>Mean</td>
        <td>The sum of all values divided by the number of values.</td>
        <td style="font-size: 1.5em">$\bar{x} = {{\sum_{i=1}^n x_i} \over n}$</td>
    </tr>
    <tr>
        <td>Weighted Mean</td>
        <td>The sum of each value multiplied by its weight, divided by the sum of the weights.</td>
        <td style="font-size: 1.5em">$\bar{x}_w = {{\sum_{i=1}^n w_i x_i} \over {\sum_{i=1}^n w_i}}$</td>
    </tr>
    <tr>
        <td>Trimmed mean</td>
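        <td>The mean of the values that remain after sorting the data and dropping the <i>p</i> smallest and <i>p</i> largest values. In the formula, $x_{(i)}$ denotes the <i>i</i>-th smallest value.</td>
        <td style="font-size: 1.5em">$\bar{x} = {{\sum_{i=p+1}^{n-p} x_{(i)}} \over {n-2p}}$</td>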
    </tr>
    <tr>
        <td>Median</td>
        <td>The value such that one half of the data lies above and one half below (the 50th percentile).</td>
    </tr>
    <tr>
        <td>Percentile</td>
        <td>The value such that <i>P</i> percent of the data lies below it.</td>
    </tr>
    <tr>
        <td>Weighted median</td>
        <td>The value such that, in the sorted data, one half of the sum of the weights lies above it and one half below.</td>
    </tr>
    <tr>
        <td>Robust</td>
        <td>Not sensitive to extreme values.</td>
    </tr>
    <tr>
        <td>Outlier</td>
        <td>A data value that is very different from most of the data.</td>
    </tr>
</table>
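
The estimates in the table can be sketched with the Python standard library. The data and weights below are hypothetical, and the helper functions `weighted_mean`, `trimmed_mean`, and `weighted_median` are illustrative names written for this example. The weighted median shown is one simple convention (the smallest value at which the cumulative weight reaches half the total); other conventions exist.

```python
import statistics


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, p):
    """Mean after sorting and dropping the p smallest and p largest values."""
    if 2 * p >= len(values):
        raise ValueError("p is too large for the number of values")
    ordered = sorted(values)
    kept = ordered[p:len(ordered) - p]
    return sum(kept) / len(kept)


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value


# Hypothetical data with one outlier (100) that receives a small weight.
data = [2, 3, 3, 4, 5, 6, 100]
weights = [1, 1, 1, 1, 1, 1, 0.1]

mean_value = statistics.mean(data)        # pulled upward by the outlier
median_value = statistics.median(data)    # robust: 4
trimmed = trimmed_mean(data, p=1)         # drops 2 and 100, giving 4.2
w_mean = weighted_mean(data, weights)     # 33 / 6.1
w_median = weighted_median(data, weights)

print(mean_value, median_value, trimmed, w_mean, w_median)
```

The mean is about 17.6 here, while the median and the trimmed mean stay near 4, which is what "robust" means in the table: a single extreme value barely moves them.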

#### Estimates of variability

A second dimension, _variability_, also referred to as _dispersion_, measures whether the data values are tightly clustered or spread out.

<table style='margin-left: 0; border: 1px solid'>
    <tr>
        <th>Term</th>
        <th>Explanation</th>
        <th>Formula</th>
    </tr>
    <tr>
        <td>Deviations</td>
        <td>The differences between the observed values and the estimate of location.</td>
    </tr>
    <tr>
        <td>Variance</td>
        <td>The sum of squared deviations from the mean divided by <i>n</i> - 1, where <i>n</i> is the number of data values.</td>
        <td style="font-size: 1.5em">$s^2 = {{\sum_{i=1}^n (x_i - \bar{x})^2} \over n - 1}$</td>
    </tr>
     <tr>
        <td>Standard deviation</td>
        <td>The square root of the variance.</td>
        <td style="font-size: 1.5em">$s = {\sqrt{s^2}}$</td>
    </tr>
     <tr>
        <td>Mean absolute deviation</td>
        <td>The mean of the absolute values of the deviations from the mean.</td>
        <td style="font-size: 1.5em">${{\sum_{i=1}^n |x_i - \bar{x}|} \over n}$</td>
    </tr>
    <tr>
        <td>Median absolute deviation from the median</td>
        <td>The median of the absolute values of the deviations from the median.</td>
    </tr>
    <tr>
        <td>Range</td>
        <td>The difference between the largest and the smallest value in a data set.</td>
    </tr>
     <tr>
        <td>Order statistics</td>
        <td>Metrics based on the data values sorted from smallest to largest.</td>
    </tr>
    <tr>
        <td>Percentile</td>
        <td>The value such that <i>P</i> percent of the values take on this value or less and (100 - <i>P</i>) percent take on this value or more.</td>
    </tr>
    <tr>
        <td>Interquartile range</td>
        <td>The difference between the 75th percentile and the 25th percentile.</td>
    </tr>
</table>
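
A minimal sketch of these variability estimates, again using only the Python standard library. The data are hypothetical. Percentile values depend on the interpolation convention; this example assumes the `"inclusive"` method of `statistics.quantiles`, and other tools may give slightly different quartiles for the same data.

```python
import statistics

# Hypothetical data, invented for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

mean = statistics.mean(data)
median = statistics.median(data)

# Deviations: differences between each value and the estimate of location.
deviations = [x - mean for x in data]

# Sample variance (divides by n - 1) and its square root.
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

# Mean absolute deviation from the mean.
mean_abs_dev = sum(abs(d) for d in deviations) / len(data)

# Median absolute deviation from the median (robust to outliers).
median_abs_dev = statistics.median([abs(x - median) for x in data])

# Range: largest minus smallest value.
value_range = max(data) - min(data)

# Quartiles (25th, 50th, 75th percentiles) and the interquartile range.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(variance, std_dev, mean_abs_dev, median_abs_dev, value_range, iqr)
```

For this data the variance is 7.5, the median absolute deviation is 2, the range is 8, and the interquartile range is 4. The deviations themselves always sum to zero around the mean, which is why the table's measures square them or take absolute values before averaging.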
