# **STATISTICAL ANALYSIS: DESCRIPTIVE STATISTICS**
<p> Created by: <a href = "https://www.linkedin.com/in/tafiflukman/">Ta'fif Lukman Afandi</a></p>
<img src="https://tse2.mm.bing.net/th?id=OIP.GwaUnYMh15WEdFOxIN2FVQAAAA&pid=Api&P=0">

## Dataset
<p align="justify">The dataset used is <a href = "https://datahub.io/machine-learning/iris#resource-iris"><strong>Iris Plants Database</strong></a></p>
<p align="justify">This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.</p>
<dl>
    <dt>Attribute Information:</dt>
    <dd>1. sepal length in cm</dd>
    <dd>2. sepal width in cm</dd>
    <dd>3. petal length in cm</dd>
    <dd>4. petal width in cm</dd>
    <dd>5. class:
        <ul>
            <li>Iris Setosa</li>
            <li>Iris Versicolour</li>
            <li>Iris Virginica</li>
        </ul>
    </dd>
</dl>

<strong>IMPORT DATASET</strong>

In [1]:
dt = read.csv("./dataset/iris_csv.csv")
print(head(dt))

  sepallength sepalwidth petallength petalwidth       class
1         5.1        3.5         1.4        0.2 Iris-setosa
2         4.9        3.0         1.4        0.2 Iris-setosa
3         4.7        3.2         1.3        0.2 Iris-setosa
4         4.6        3.1         1.5        0.2 Iris-setosa
5         5.0        3.6         1.4        0.2 Iris-setosa
6         5.4        3.9         1.7        0.4 Iris-setosa


## Theoretical Basis
<p align = "justify">
        Descriptive statistics describe a dataset by computing summary measures and distribution characteristics from empirical data. They are commonly grouped into three topics: </p>
<ol>
    <li align = "justify">Central Tendency</li>
    <li align = "justify">Measurement of Spread</li> 
    <li align = "justify">Distribution</li>
</ol>
<img src="https://i.ibb.co/KWvrPkd/Screenshot-564.png">

### 2. Measurement of Spread
<p align = "justify">
    <strong>
        Spread
    </strong> is the degree of scatter or variation of a variable about its central value. Examples of these measures include <strong>
    the range, interquartile range, variance, and standard deviation.</strong>
</p>

<strong>
    A. Range
</strong>
<img src="https://i.ibb.co/84bQvYY/Range-Formula.png">
<p>Source:
    <a href="https://www.piqosity.com/2022/08/12/isee-math-review-mean-median-mode-range-weighted-average/">https://www.piqosity.com/2022/08/12/isee-math-review-mean-median-mode-range-weighted-average
    </a>
</p>
<p>
    Syntax for range() function in R:
</p>
<p>
    <strong>
        range(x, na.rm = FALSE, …)
    </strong>
</p>
<p>
    Information:
</p>
<ul>
<li>
    <strong>
        x:
    </strong>
    numeric vector
</li>
<li>
    <strong>
        na.rm:
    </strong>
    whether NA values should be removed before computing; if FALSE and the data contain NA, NA is returned
</li>
</ul>
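
The effect of `na.rm` is easiest to see on a small vector that contains a missing value. The sketch below uses a made-up vector `x` (an assumption for illustration, not taken from the iris data) and shows how the range can be reduced to a single number with `diff()`.

```r
# Hypothetical toy vector with one missing value (not from the iris dataset)
x <- c(4.9, NA, 5.1, 6.3)

# Default na.rm = FALSE: the missing value propagates to both ends
r_default <- range(x)

# na.rm = TRUE: the NA is dropped before taking the minimum and maximum
r_clean <- range(x, na.rm = TRUE)

# The range as a single number is the maximum minus the minimum
spread_clean <- diff(r_clean)

print(r_default)
print(r_clean)
print(spread_clean)
```

If the column has no missing values, both calls return the same result.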

<strong>RANGE: SEPAL LENGTH (with na.rm = TRUE)</strong>

In [2]:
range(dt$sepallength, na.rm=TRUE)

In [3]:
#Option 1: using the range() function in R; diff() returns max - min
RANGE = diff(range(dt$sepallength, na.rm=TRUE))
RANGE

In [4]:
#Option 2: manual calculation with max() and min()
RANGE2 = max(dt$sepallength, na.rm=TRUE) - min(dt$sepallength, na.rm=TRUE)
RANGE2

<strong>RANGE: SEPAL LENGTH (default na.rm = FALSE)</strong>

In [5]:
range(dt$sepallength)

In [6]:
RANGE2 = diff(range(dt$sepallength))
RANGE2

<strong>
    B. Interquartile Range (IQR)
</strong>
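<p>
    Quartiles split a dataset into four quarters when the values are written in ascending order. The interquartile range is the difference between the third and first quartiles, IQR = Q3 - Q1, so it covers the middle 50% of the data.
</p>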
<img src="https://i.ibb.co/51RP20w/IQR.png">
<p>Source:
    <a href="https://byjus.com/interquartile-range-formula/">https://byjus.com/interquartile-range-formula
    </a>
</p>
<p>
    Syntax for IQR() function in R:
</p>
<p>
    <strong>
        IQR(x, na.rm = FALSE, ...)
    </strong>
</p>
<p>
    Information:
</p>
<ul>
<li>
    <strong>
        x:
    </strong>
    numeric vector
</li>
<li>
    <strong>
        na.rm:
    </strong>
    whether NA values should be removed before computing; if FALSE and the data contain NA, the call fails with an error
</li>
</ul>
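
To make the link between quartiles and the IQR concrete, here is a minimal sketch on a made-up vector `x` (an assumption for illustration, not the iris data). It relies on the default quantile method in R (`type = 7`); other quantile definitions, including some textbook ones, can give slightly different quartiles.

```r
# Hypothetical toy vector (not from the iris dataset)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

# First and third quartiles; quantile() uses its default method (type = 7)
q <- quantile(x, probs = c(0.25, 0.75))

# IQR by hand: Q3 - Q1 (unname() drops the "75%" label)
iqr_manual <- unname(q[2] - q[1])

# IQR with the built-in function
iqr_builtin <- IQR(x)

print(q)
print(iqr_manual)
print(iqr_builtin)
```

Both approaches agree because `IQR()` is defined in terms of `quantile()` with the same default method.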

<strong>IQR: SEPAL LENGTH (with na.rm = TRUE)</strong>

In [7]:
#Option 1 using function IQR in R
IQR(dt$sepallength, na.rm = TRUE)

In [8]:
#Option 2: manual calculation, Q3 - Q1, using quantile()
IQR2 = function(data){
    Q3 = quantile(data, probs = .75)
    Q1 = quantile(data, probs = .25)
    cat('----------INTERQUARTILE RANGE----------\n')
    cat('Q1  =',Q1,'\n')
    cat('Q3  =',Q3,'\n')
    cat('---------------------------------------\n')
    cat('IQR =',Q3-Q1,'\n')
    cat('---------------------------------------')
}

In [9]:
IQR2(dt$sepallength)

----------INTERQUARTILE RANGE----------
Q1  = 5.1 
Q3  = 6.4 
---------------------------------------
IQR = 1.3 
---------------------------------------

<strong>IQR: SEPAL LENGTH (default na.rm = FALSE)</strong>

In [10]:
IQR(dt$sepallength)

<strong>
    C. Variance and Standard Deviation
</strong>
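<p>
    <strong>
        The variance
    </strong> 
    measures how far the observations lie from their mean, as roughly the average squared deviation. We often use the related measure, its <strong>
    square root
    </strong>, called the 
    <strong>
        standard deviation
    </strong>, which is expressed in the same units as the data. <strong>The formulas for variance and standard deviation are given below.</strong>
</p>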
<img src="https://standarddeviationformula.com/wp-content/uploads/2019/12/variance-and-standard-deviation-formula.png">
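
For reference, the sample forms of these two measures, which are what `var()` and `sd()` compute in R, can be written as follows. The notation (n observations x_i with mean x-bar) is an assumption here and may differ from the one used in the figure.

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2,
\qquad
s = \sqrt{s^2},
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```

The population versions divide by n instead of n - 1.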
<p>
    Syntax for var() function in R:
</p>
<p>
    <strong>
        var(x, na.rm = FALSE)
    </strong>
</p>
<p>
    Syntax for sd() function in R:
</p>
<p>
    <strong>
        sd(x, na.rm = FALSE)
    </strong>
</p>
<p>
    Information:
</p>
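<ul>
<li>
    <strong>
        x:
    </strong>
    numeric vector
</li>
<li>
    <strong>
        na.rm:
    </strong>
    whether NA values should be removed before computing; if FALSE and the data contain NA, NA is returned
</li>
</ul>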
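
As a quick check of what `var()` and `sd()` compute, the sketch below reproduces them by hand on a made-up vector `x` (an assumption for illustration, not the iris data), using the sample formula with an n - 1 denominator.

```r
# Hypothetical toy vector (not from the iris dataset)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
var_manual <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance
sd_manual <- sqrt(var_manual)

# Compare with the built-in functions
print(c(manual = var_manual, builtin = var(x)))
print(c(manual = sd_manual, builtin = sd(x)))
```

Dividing by `n` instead would give the population variance, which is not what `var()` returns.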

<strong>VARIANCE AND STANDARD DEVIATION: SEPAL LENGTH</strong>

In [11]:
var(dt$sepallength)

In [12]:
#Option 1 using function sd in R
sd(dt$sepallength)

In [13]:
#Option 2: manually, as the square root of the variance
sqrt(var(dt$sepallength))

<strong>SPREAD</strong>

In [14]:
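spread = function(x){
    # x: data frame whose columns are all numeric
    kolom = ncol(x)
    # Preallocate one result slot per column
    Range = numeric(kolom)
    Iqr = numeric(kolom)   # named Iqr so it does not mask the IQR() function
    Var = numeric(kolom)
    Std = numeric(kolom)
    for(i in seq_len(kolom)){
        Range[i] = round(diff(range(x[,i])),3)
        Iqr[i] = round(IQR(x[,i]),3)
        Var[i] = round(var(x[,i]),3)
        Std[i] = round(sd(x[,i]),3)
    }
    # One row per variable, one column per spread measure
    baris = cbind(Range, Iqr, Var, Std)
    colnames(baris) = c("RANGE", "IQR", "VARIANCE", "STD. DEVIATION")
    rownames(baris) = names(x)
    print(baris)
}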

In [15]:
spread(dt[,c(1:4)])

            RANGE IQR VARIANCE STD. DEVIATION
sepallength   3.6 1.3    0.686          0.828
sepalwidth    2.4 0.5    0.188          0.434
petallength   5.9 3.5    3.113          1.764
petalwidth    2.4 1.5    0.582          0.763
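
The same summary table can also be built without an explicit loop. The sketch below is one possible alternative, not the notebook's method: `spread_table` is a hypothetical helper name, and it is run on R's built-in `iris` data frame (an assumption, used so the example runs without the CSV file; its column names differ from the CSV and a few values may differ slightly from the file used above).

```r
# Hypothetical helper (not part of the original notebook):
# apply all four spread measures to every column with sapply()
spread_table <- function(x) {
    measures <- sapply(x, function(col) c(
        RANGE = diff(range(col)),
        IQR = IQR(col),
        VARIANCE = var(col),
        `STD. DEVIATION` = sd(col)
    ))
    # sapply() returns measures as rows; transpose so each variable is a row
    round(t(measures), 3)
}

# Assumption: built-in iris data, numeric columns 1 to 4
result <- spread_table(iris[, 1:4])
print(result)
```

Returning the matrix instead of only printing it makes the result easier to reuse in later cells.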
