# **STATISTICAL ANALYSIS: DESCRIPTIVE STATISTICS**
<p> Created by: <a href = "https://www.linkedin.com/in/tafiflukman/">Ta'fif Lukman Afandi</a></p>
<img src="https://aegis4048.github.io/jupyter_images/test_score_dist.png">

## Dataset
<p align="justify">The dataset used is <a href = "https://datahub.io/machine-learning/iris#resource-iris"><strong>Iris Plants Database</strong></a></p>
<p align="justify">This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.</p>
<dl>
    <dt>Attribute Information:</dt>
    <dd>1. sepal length in cm</dd>
    <dd>2. sepal width in cm</dd>
    <dd>3. petal length in cm</dd>
    <dd>4. petal width in cm</dd>
    <dd>5. class:
        <ul>
            <li>Iris Setosa</li>
            <li>Iris Versicolour</li>
            <li>Iris Virginica</li>
        </ul>
    </dd>
</dl>

<strong>IMPORT DATASET</strong>

In [1]:
dt = read.csv("./dataset/iris_csv.csv")
print(head(dt))

  sepallength sepalwidth petallength petalwidth       class
1         5.1        3.5         1.4        0.2 Iris-setosa
2         4.9        3.0         1.4        0.2 Iris-setosa
3         4.7        3.2         1.3        0.2 Iris-setosa
4         4.6        3.1         1.5        0.2 Iris-setosa
5         5.0        3.6         1.4        0.2 Iris-setosa
6         5.4        3.9         1.7        0.4 Iris-setosa


## Theoretical Basis
<p align = "justify">
        Descriptive statistics summarize a data set by computing summary measures and describing the shape of its distribution from empirical data. This section covers three main topics: </p>
<ol>
    <li align = "justify">Central Tendency</li>
    <li align = "justify">Measurement of Spread</li> 
    <li align = "justify">Distribution</li>
</ol>
<img src="https://i.ibb.co/KWvrPkd/Screenshot-564.png">

### 3. Distribution
<p align = "justify">
    <strong>The distribution of a data set</strong> is a function that shows the possible values (or intervals of values) of the data and how often each one occurs. Two measures commonly used in descriptive statistics to characterize the shape of a distribution are <strong>Skewness and Kurtosis</strong>.
</p>

<strong>
    A. Skewness
</strong>
<p align="justify">
    <strong>Skewness</strong> is a statistic that measures how asymmetric a distribution is. A distribution is symmetric if its left side mirrors its right side.
</p>
<ul>
    <li>If skewness is greater than 0, the distribution is <strong>right-skewed (positive)</strong>: the right tail is longer than the left tail.</li>
    <li>A <strong>symmetric</strong> distribution, such as the normal distribution, has a skewness of 0, and its mean, median, and mode coincide.</li>
    <li>If skewness is less than 0, the distribution is <strong>left-skewed (negative)</strong>: the left tail is longer than the right tail.</li>
</ul>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/cc/Relationship_between_mean_and_median_under_different_skewness.png/434px-Relationship_between_mean_and_median_under_different_skewness.png">
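<p align="justify">
    The sign rule above can be checked on a few small, made-up vectors. This is only a sketch: <code>moment_skewness()</code> is a hypothetical helper written in base R for illustration, and the three vectors are invented data, not part of the Iris dataset.
</p>

```r
# Hypothetical helper: moment-based skewness in base R (illustration only)
moment_skewness <- function(x) {
    d <- x - mean(x)                  # deviations from the mean
    mean(d^3) / mean(d^2)^(3/2)       # third central moment over m2^(3/2)
}

right_tail <- c(1, 1, 2, 2, 3, 10)    # one large value stretches the right tail
symmetric  <- c(1, 2, 3, 4, 5)        # mirror image around the mean
left_tail  <- -right_tail             # mirroring the data flips the sign

print(moment_skewness(right_tail) > 0)   # right-skewed: positive
print(moment_skewness(symmetric) == 0)   # symmetric: zero
print(moment_skewness(left_tail) < 0)    # left-skewed: negative
```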
<p>
    <strong>The skewness can be calculated from the following formula:</strong>
</p>
<img src="https://miro.medium.com/max/1400/1*Sp3J-uwFrOGCGbP5NjIm8Q.webp">
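<p align="justify">
    As a text fallback in case the image does not load, one common moment-based definition of skewness is written out here. The symbols <em>g</em><sub>1</sub> and <em>m</em><sub>k</sub> are notation chosen for this note, and the normalization is an assumption: the formula in the image may instead divide by <em>N</em> − 1 and use the sample standard deviation.
</p>

$$
g_1 = \frac{m_3}{m_2^{3/2}}, \qquad m_k = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^k
$$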
<p>
    Syntax of the skewness() function from the <strong>moments</strong> package in R:
</p>
<p>
    <strong>
        skewness(x, na.rm = FALSE)
    </strong>
</p>
<p>
    Information:
</p>
<li>
    <strong>
        x:
    </strong>
    numeric vector
</li>
<li>
    <strong>
        na.rm:
    </strong>
    logical; if TRUE, NA values are removed before the calculation. If FALSE (the default) and x contains NA, the result is NA
</li>
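
<p align="justify">
    The effect of <code>na.rm</code> can be seen with a short sketch. It assumes the <strong>moments</strong> package is installed; the vector <code>x</code> is made-up data containing one missing value.
</p>

```r
library(moments)                      # assumes the moments package is installed

x <- c(2.1, 3.4, NA, 5.0, 9.8)        # made-up values with one missing entry

print(skewness(x))                    # na.rm defaults to FALSE, so the result is NA
print(skewness(x, na.rm = TRUE))      # the NA is dropped before the calculation
```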

<strong>LIBRARY MOMENTS</strong>

In [2]:
library(moments)

<strong>SKEWNESS: SEPAL LENGTH </strong>

In [3]:
# Option 1: skewness() from the moments package
round(skewness(dt$sepallength),3)

In [4]:
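# Option 2: manual implementation
# Note: this version divides by (N - 1) and uses sd(), so its result can
# differ slightly from moments::skewness(), which uses divide-by-N moments.
SKEWNESS = function(x){
    N = length(x)
    up = (x - mean(x))^3                  # cubed deviations from the mean
    hasil = sum(up)/((N-1)*(sd(x)^3))     # hasil = result
    hasil = round(hasil,3)
    return(hasil)
}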

In [5]:
SKEWNESS(dt$sepallength)

<strong>SKEWNESS: SEPAL WIDTH</strong>

In [6]:
round(skewness(dt$sepalwidth),3)

In [7]:
SKEWNESS(dt$sepalwidth)
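
<p align="justify">
    The two options need not print exactly the same number. The manual <code>SKEWNESS()</code> divides by <em>N</em> − 1 and uses <code>sd()</code>, while <code>skewness()</code> from moments uses divide-by-<em>N</em> moments; algebraically the two differ by a factor of sqrt((<em>N</em> − 1)/<em>N</em>). The sketch below checks this in base R. It uses R's built-in <code>iris</code> data frame as a stand-in, on the assumption that it holds the same values as the CSV file.
</p>

```r
x <- iris$Sepal.Length                 # built-in copy of the data (assumed to match the CSV)
N <- length(x)
d <- x - mean(x)                       # deviations from the mean

g1     <- mean(d^3) / mean(d^2)^(3/2)        # divide-by-N moments, as in moments::skewness()
manual <- sum(d^3) / ((N - 1) * sd(x)^3)     # same expression as SKEWNESS(), unrounded

print(c(moment = g1, manual = manual))
print(all.equal(manual, g1 * sqrt((N - 1) / N)))   # the two differ only by this factor
```

<p align="justify">
    For 150 observations the factor is close to 1, so the two results agree to about two decimal places.
</p>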

<strong>
    B. Kurtosis
</strong>
<p>
    <strong>
        Kurtosis
    </strong> is a statistic that describes how heavy the tails of a distribution are, and loosely how peaked it looks, compared with a normal distribution. A normal distribution has a kurtosis of three, so excess kurtosis is defined as kurtosis minus three.
</p>
<ul>
    <li>Mesokurtic distribution: kurtosis equal to three (zero excess kurtosis). The normal distribution is mesokurtic, so data that follow a Gaussian distribution have neither fat nor thin tails.</li>
    <li>Leptokurtic distribution: "lepto" means slender. Kurtosis is greater than three (positive excess kurtosis); the distribution has heavier tails and typically a thinner, taller peak than the normal distribution.</li>
    <li>Platykurtic distribution: "platy" means broad. Kurtosis is less than three (negative excess kurtosis); the distribution has thinner tails and typically a broader, flatter peak than the normal distribution.</li>
</ul>
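
<p align="justify">
    A small sketch can make the comparison with the reference value of three concrete. <code>moment_kurtosis()</code> is a hypothetical base R helper written for illustration, and both vectors are invented data.
</p>

```r
# Hypothetical helper: moment-based (Pearson) kurtosis in base R (illustration only)
moment_kurtosis <- function(x) {
    d <- x - mean(x)                   # deviations from the mean
    mean(d^4) / mean(d^2)^2            # fourth central moment over squared variance
}

flat  <- 1:20                          # evenly spread values, no extreme observations
heavy <- c(rep(0, 18), -10, 10)        # mostly identical values plus two extremes

print(moment_kurtosis(flat))           # below 3: lighter tails than the normal
print(moment_kurtosis(heavy))          # above 3: heavier tails than the normal
print(moment_kurtosis(heavy) - 3)      # excess kurtosis relative to the normal
```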
<img src ="https://unofficed.com/wp-content/uploads/2018/04/Kurtosis.jpg">
<p>
    <strong>The kurtosis can be calculated from the following formula:</strong>
</p>
<img src="http://allthingsstatistics.com/wp-content/uploads/2021/07/Kurtosis-1.png">
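<p align="justify">
    As a text fallback in case the image does not load, one common moment-based definition of kurtosis is written out here. The symbols <em>b</em><sub>2</sub> and <em>m</em><sub>k</sub> are notation chosen for this note, and the normalization is an assumption: the formula in the image may use the sample standard deviation instead.
</p>

$$
b_2 = \frac{m_4}{m_2^{2}}, \qquad m_k = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^k, \qquad \text{excess kurtosis} = b_2 - 3
$$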
<p>
    Syntax of the kurtosis() function from the <strong>moments</strong> package in R:
</p>
<p>
    <strong>
        kurtosis(x, na.rm = FALSE)
    </strong>
</p>
<p>
    Information:
</p>
<li>
    <strong>
        x:
    </strong>
    numeric vector
</li>
<li>
    <strong>
        na.rm:
    </strong>
    logical; if TRUE, NA values are removed before the calculation. If FALSE (the default) and x contains NA, the result is NA
</li>

<strong>KURTOSIS: SEPAL LENGTH</strong>

In [8]:
# Option 1: kurtosis() from the moments package
round(kurtosis(dt$sepallength),3)

In [9]:
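# Option 2: manual implementation
# Note: sd() divides by (N - 1), so this result can differ slightly from
# moments::kurtosis(), which uses divide-by-N moments throughout.
KURTOSIS = function(x){
    N = length(x)
    up = (x - mean(x))^4                # fourth power of deviations from the mean
    hasil = sum(up)/((N)*(sd(x)^4))     # hasil = result
    hasil = round(hasil,3)
    return(hasil)
}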

In [10]:
KURTOSIS(dt$sepallength)
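
<p align="justify">
    As with skewness, the manual <code>KURTOSIS()</code> and <code>kurtosis()</code> from moments can print slightly different values, because <code>sd()</code> divides by <em>N</em> − 1 while moments uses divide-by-<em>N</em> moments. Algebraically the manual value equals the moment-based one multiplied by ((<em>N</em> − 1)/<em>N</em>)<sup>2</sup>. The sketch below checks this in base R on the built-in <code>iris</code> data frame, which is assumed to hold the same values as the CSV file.
</p>

```r
x <- iris$Sepal.Length                 # built-in copy of the data (assumed to match the CSV)
N <- length(x)
d <- x - mean(x)                       # deviations from the mean

b2     <- mean(d^4) / mean(d^2)^2            # divide-by-N moments, as in moments::kurtosis()
manual <- sum(d^4) / (N * sd(x)^4)           # same expression as KURTOSIS(), unrounded

print(c(moment = b2, manual = manual))
print(all.equal(manual, b2 * ((N - 1) / N)^2))   # the two differ only by this factor
```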