<img src="https://media-exp1.licdn.com/dms/image/D5616AQHjeiUMSC8BAQ/profile-displaybackgroundimage-shrink_350_1400/0/1667108026239?e=1675296000&v=beta&t=jlQ0O03uHiDrL_sxdsQebQ8zIS_VuTm2btr2yrKL4rA">

# **STATISTICAL ANALYSIS: DESCRIPTIVE STATISTICS**
<p> Created by: <a href = "https://www.linkedin.com/in/tafiflukman/">Ta'fif Lukman Afandi</a></p>

## Dataset
<p align="justify">The dataset used is the <a href = "https://datahub.io/machine-learning/iris#resource-iris"><strong>Iris Plants Database</strong></a>.</p>
<p align="justify">This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.</p>
<dl>
    <dt>Attribute Information:</dt>
    <dd>1. sepal length in cm</dd>
    <dd>2. sepal width in cm</dd>
    <dd>3. petal length in cm</dd>
    <dd>4. petal width in cm</dd>
    <dd>5. class:
        <ul>
            <li>Iris Setosa</li>
            <li>Iris Versicolour</li>
            <li>Iris Virginica</li>
        </ul>
    </dd>
</dl>

<strong>IMPORT DATASET</strong>

In [1]:
dt = read.csv("./dataset/iris_csv.csv")
print(head(dt))

  sepallength sepalwidth petallength petalwidth       class
1         5.1        3.5         1.4        0.2 Iris-setosa
2         4.9        3.0         1.4        0.2 Iris-setosa
3         4.7        3.2         1.3        0.2 Iris-setosa
4         4.6        3.1         1.5        0.2 Iris-setosa
5         5.0        3.6         1.4        0.2 Iris-setosa
6         5.4        3.9         1.7        0.4 Iris-setosa
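
The description above (3 classes of 50 instances each) can be checked with a few base R calls. This is a minimal sketch that assumes R's built-in `iris` data frame as a stand-in for the CSV file; its columns are named `Sepal.Length`, ..., `Species` rather than `sepallength`, ..., `class`.

```r
# Stand-in for the CSV: R's built-in `iris` data frame (an assumption).
data(iris)

dims         <- dim(iris)              # number of rows and columns
class_counts <- table(iris$Species)    # observations per class
na_counts    <- colSums(is.na(iris))   # missing values per column

print(dims)
print(class_counts)
print(na_counts)
```

The same calls should work on `dt` after replacing `iris$Species` with `dt$class`.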


## Theoretical Basis
<p align = "justify">
        Descriptive statistics summarizes and describes a dataset by computing summary measures and characterizing its distribution from the observed data. This section covers three main topics: </p>
<ol>
    <li align = "justify">Central Tendency</li>
    <li align = "justify">Measurement of Spread</li> 
    <li align = "justify">Distribution</li>
</ol>

### 1. Central Tendency
<p align = "justify">
    Central tendency describes the middle, or typical, value of a dataset. Its measures include the mean, median, and mode.
</p>

<strong>
    A. Mean
</strong>
<img src="https://www.pioneermathematics.com/formulasimages/arithemetic%20mean(1).gif">
<p>Source:
    <a href="https://www.pioneermathematics.com/arithmetic-mean-formula.html">https://www.pioneermathematics.com/arithmetic-mean-formula.html
    </a>
</p>
<p>
    Syntax for mean() function in R:
</p>
<p>
    <strong>
        mean(x, na.rm = FALSE, …)
    </strong>
</p>
<p>
    Arguments:
</p>
<ul>
<li>
    <strong>
        x:
    </strong>
    numeric vector
</li>
<li>
    <strong>
        na.rm:
    </strong>
    logical; if TRUE, NA values are removed before the mean is computed. If FALSE (the default) and x contains NA, the result is NA
</li>
</ul>

<strong>MEAN: SEPAL LENGTH (NA values removed)</strong>

In [6]:
mean(dt$sepallength, na.rm=TRUE)

<strong>MEAN: SEPAL LENGTH (NA values not removed)</strong>

In [5]:
mean(dt$sepallength)
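
The effect of `na.rm` only becomes visible when the input actually contains missing values. The following sketch uses a small made-up vector (not taken from the dataset) to show the difference, and recomputes the mean by hand as the sum of the observed values divided by their count.

```r
# Toy vector with one missing value (made-up numbers for illustration)
x <- c(5.1, 4.9, NA, 4.7)

mean_with_na <- mean(x)                 # the NA propagates to the result
mean_no_na   <- mean(x, na.rm = TRUE)   # NA is dropped before averaging

# Same computation by hand: sum of observed values / number of observed values
observed     <- x[!is.na(x)]
mean_by_hand <- sum(observed) / length(observed)

print(mean_with_na)
print(mean_no_na)
print(mean_by_hand)
```

If a column has no missing values, both calls to `mean()` return the same number.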

<strong>
    B. Median
</strong>
<img src="https://www.pioneermathematics.com/formulasimages/median(1).gif">
<p>Source:
    <a href="https://www.pioneermathematics.com/median-formula.html">https://www.pioneermathematics.com/median-formula.html
    </a>
</p>
<p>
    Syntax for median() function in R:
</p>
<p>
    <strong>
        median(x, na.rm = FALSE, …)
    </strong>
</p>
<p>
    Arguments:
</p>
<ul>
<li>
    <strong>
        x:
    </strong>
    numeric vector
</li>
<li>
    <strong>
        na.rm:
    </strong>
    logical; if TRUE, NA values are removed before the median is computed. If FALSE (the default) and x contains NA, the result is NA
</li>
</ul>

<strong>MEDIAN: SEPAL LENGTH (NA values removed)</strong>

In [13]:
median(dt$sepallength, na.rm = TRUE)

<strong>MEDIAN: SEPAL LENGTH (NA values not removed)</strong>

In [8]:
median(dt$sepallength)
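
As a small illustration of how `median()` behaves, the sketch below uses made-up vectors (not taken from the dataset) with an odd length, an even length, and a missing value. For an even number of observations, the median is the average of the two middle values after sorting.

```r
# Made-up vectors for illustration
odd_x  <- c(3, 1, 2)       # sorted: 1 2 3
even_x <- c(4, 1, 3, 2)    # sorted: 1 2 3 4
na_x   <- c(1, NA, 3)

median_odd   <- median(odd_x)                # middle value of the sorted vector
median_even  <- median(even_x)               # average of the two middle values
median_na    <- median(na_x)                 # NA propagates to the result
median_na_rm <- median(na_x, na.rm = TRUE)   # NA dropped, median of c(1, 3)

print(median_odd)
print(median_even)
print(median_na)
print(median_na_rm)
```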

<strong>
    C. Mode
</strong>
<p>
    The mode is the value that occurs most frequently in a set of observations, i.e. the point of maximum frequency.
</p>
<p>
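    R has no built-in function for the statistical mode (base R's mode() returns the storage mode of an object), so a custom function is defined. Note that naming it mode masks the base function for the rest of the session: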
</p>
<p>
    <strong>
        mode = function(x){
    </strong>
</p>
<p>
    <strong>
        &nbsp;&nbsp;&nbsp;&nbsp;uniqx = unique(x)
    </strong>
</p>
<p>
    <strong>
        &nbsp;&nbsp;&nbsp;&nbsp;uniqx[which.max(tabulate(match(x, uniqx)))]
    </strong>
</p>
<p>
    <strong>
        }
    </strong>
</p>
<p>
    Arguments:
</p>
<ul>
<li>
    <strong>
        x:
    </strong>
    numeric vector
</li>
</ul>

In [15]:
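# Statistical mode: the most frequent value in x
mode = function(x){
    # distinct values, in order of first appearance
    uniqx = unique(x)
    # count each distinct value and return the one with the highest count
    uniqx[which.max(tabulate(match(x, uniqx)))]
}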

<strong>MODE: SEPAL LENGTH</strong>

In [12]:
mode(dt$sepallength)
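
To make the behaviour of this approach easier to see, here is a sketch on small made-up vectors. The function body is the same as above, but the name `stat_mode` is hypothetical, chosen here only to avoid masking base R's `mode()`. Because `which.max()` returns the first maximum, a tie is resolved in favour of the value that appears first in the input.

```r
# Same logic as the function above, under a hypothetical name
stat_mode <- function(x) {
    uniqx <- unique(x)                       # distinct values, in order of first appearance
    counts <- tabulate(match(x, uniqx))      # how often each distinct value occurs
    uniqx[which.max(counts)]                 # first value with the highest count
}

mode_numeric <- stat_mode(c(5.0, 5.1, 5.0, 4.9, 5.1, 5.0))   # 5.0 occurs three times
mode_tie     <- stat_mode(c(1, 2, 2, 1))                     # tie: first-seen value wins
mode_char    <- stat_mode(c("a", "b", "b"))                  # also works on character vectors

print(mode_numeric)
print(mode_tie)
print(mode_char)
```

For data with several equally frequent values, this returns only one of them, so inspecting `table(x)` alongside the result may be worthwhile.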