<a href="https://colab.research.google.com/github/Akmazad/Data-Science-Fundamentals-in-R/blob/main/Modules/Module_3_Exploratory_Data_Analysis_(EDA).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 3: Exploratory Data Analysis (EDA) #


---


**Contents:**
*   Summarizing a dataset using descriptive statistics
*   Measures of Central Tendency, Spread, and Percentiles
*   Finding missing values
*   Outlier detection
*   Correlation
*   Clustering

---




There are two approaches for performing EDA:


1.   Through descriptive statistics (Covered in this module)
2.   Through Visualization (Covered in Module 4)






---
### First approach: Descriptive statistics ###


---




In [None]:
# Summarize the whole dataset
summary(mtcars)
summary(iris)
# Compact view of the structure of an object (works on lists too)
str(iris)
str(list("first" = 1, "second" = 2, "third" = 3))

# Type of each attribute of the dataset
sapply(iris, class)

# number of rows of a data.frame or matrix
nrow(mtcars)

# number of columns of a data.frame or matrix
ncol(mtcars)

# dimensions of a data.frame or matrix
dim(mtcars)

# An excerpt of the full data
head(mtcars, n = 2)
tail(mtcars, n = 2)







---


### Measures of Central Tendency, Spread, and Percentiles ###


---



In [None]:
# Central tendency
mean(mtcars$mpg)
median(mtcars$mpg)

# Spread
var(mtcars$mpg)
sd(mtcars$mpg)
IQR(mtcars$mpg)
min(mtcars$mpg)
max(mtcars$mpg)
range(mtcars$mpg)

# Categorical variables
table(mtcars$cyl)
table(mtcars$cyl)/nrow(mtcars)
table(mtcars$cyl, mtcars$carb)

# uniqueness
unique(mtcars$cyl)

# Quartiles (of a vector)
x <- rnorm(1001)
quantile(x) # Extremes & Quartiles by default
# Custom percentiles; an NA probability yields an NA quantile
quantile(x,  probs = c(0.1, 0.5, 1, 2, 5, 10, 50, NA)/100)
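
The spread measures above are closely related. As a minimal sketch (the names `mpg`, `q`, and `iqr_manual` are illustrative only), the standard deviation is the square root of the variance, the IQR is the distance between the first and third quartiles, and `range()` returns the minimum and maximum together:

```r
mpg <- mtcars$mpg

# Standard deviation is the square root of the variance
all.equal(sd(mpg), sqrt(var(mpg)))

# IQR is the third quartile minus the first quartile
q <- quantile(mpg, probs = c(0.25, 0.75), names = FALSE)
iqr_manual <- q[2] - q[1]
all.equal(IQR(mpg), iqr_manual)

# range() returns the minimum and maximum as one vector
all.equal(range(mpg), c(min(mpg), max(mpg)))
```

Each comparison should print `TRUE`, which is a quick way to check your understanding of what each function computes.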



---

### Missingness ###


---



In [None]:
library(dplyr)
library(data.table)
df <- fread("/content/sample_data/housing_price_train.csv", stringsAsFactors = FALSE) %>%
  as_tibble()
# str(df)

# total missing values in the whole data set
df %>%
  is.na %>%
  sum

# total missingness in a single column
df$Alley %>% is.na %>% sum
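
A per-column breakdown is often more useful than a single total. The sketch below assumes the built-in `airquality` dataset instead of the housing file, so that it runs without any download; the names `na_per_column`, `na_share`, and `n_complete` are illustrative only.

```r
# airquality ships with base R and contains missing values
na_per_column <- colSums(is.na(airquality))
na_per_column

# Proportion of missing values in each column
na_share <- colMeans(is.na(airquality))
round(na_share, 3)

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(airquality))
n_complete
```

The same calls should work on `df` from the cell above, since `is.na()` returns a logical matrix for any data frame or tibble.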



---
### Outlier detection ###


---




In [None]:
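# Compare the mean and median first: outliers pull the mean away from the median
mean(airquality$Wind)
median(airquality$Wind)


# IQR rule: values more than 1.5 * IQR beyond the quartiles are flagged as outliers
Q1 <- quantile(airquality$Wind, .25)
Q3 <- quantile(airquality$Wind, .75)
iqr <- IQR(airquality$Wind)  # lower-case name avoids masking the IQR() function
# TRUE for values strictly inside the fences (the inliers)
inlier_logic <- airquality$Wind > (Q1 - 1.5*iqr) & airquality$Wind < (Q3 + 1.5*iqr)
outliers <- subset(airquality, !inlier_logic)
outliers
nrow(outliers)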

# Q: Does this algorithm have a visual manifestation?
# Answer: Boxplot
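
To make the link with the boxplot concrete, here is a minimal sketch using `boxplot.stats()` from base R, which returns the points a boxplot would draw individually. The names `wind`, `bp_out`, `fence`, and `manual_out` are illustrative only. Note that `boxplot.stats()` uses Tukey's hinges, which in general can differ slightly from the quartiles returned by `quantile()`.

```r
wind <- airquality$Wind

# Values that boxplot() would draw as individual points
bp_out <- boxplot.stats(wind, coef = 1.5)$out

# The same 1.5 * IQR rule written out by hand
q1 <- quantile(wind, 0.25, names = FALSE)
q3 <- quantile(wind, 0.75, names = FALSE)
fence <- 1.5 * (q3 - q1)
manual_out <- wind[wind < q1 - fence | wind > q3 + fence]

bp_out
# Do both approaches flag the same values?
setequal(bp_out, manual_out)
```

Calling `boxplot(airquality$Wind)` draws the plot itself, with the flagged values shown as separate points beyond the whiskers.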



---
### Correlation ###

Correlation measures the strength of the relationship between two random variables. Three coefficients are commonly used, grouped into two approaches:


1.   Pearson's correlation coefficient (PCC) - parametric
2.   Spearman's & Kendall's rank correlation - non-parametric
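
As a rough illustration of the difference between the two approaches, the sketch below recomputes Pearson's coefficient from its definition (covariance divided by the product of the standard deviations) and shows that Spearman's coefficient is Pearson's coefficient applied to the ranks. The names `x`, `y`, `pearson_manual`, and `spearman_manual` are illustrative only.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Pearson: covariance scaled by both standard deviations
pearson_manual <- cov(x, y) / (sd(x) * sd(y))
all.equal(pearson_manual, cor(x, y, method = "pearson"))

# Spearman: Pearson's coefficient computed on the ranks
spearman_manual <- cor(rank(x), rank(y), method = "pearson")
all.equal(spearman_manual, cor(x, y, method = "spearman"))
```

Because rank-based coefficients only use the ordering of the values, they are less sensitive to outliers and do not assume a linear relationship.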


---




In [None]:
suppressMessages(library(dplyr))
# Pearson's correlation coefficient (parametric)
pcc.val <- cor(mtcars$wt, mtcars$mpg, method = "pearson")
# Pearson significance test
pcc.test.res <- suppressWarnings(cor.test(mtcars$wt, mtcars$mpg, method = "pearson"))
# pcc.test.res

# Spearman's rank correlation coefficient (non-parametric)
sp.val <- cor(mtcars$wt, mtcars$mpg, method = "spearman")
# Spearman significance test (ties in the data trigger a warning about exact p-values)
sp.test.res <- suppressWarnings(cor.test(mtcars$wt, mtcars$mpg, method = "spearman"))
# sp.test.res

# Kendall's rank correlation coefficient (non-parametric)
kndl.val <- cor(mtcars$wt, mtcars$mpg, method = "kendall")
# Kendall significance test
kndl.test.res <- suppressWarnings(cor.test(mtcars$wt, mtcars$mpg, method = "kendall"))
# kndl.test.res

list("pearson's cc" = pcc.val,
     "spearman's cc" = sp.val,
     "kendall's cc" = kndl.val) %>% as_tibble
list("p-value (pearson)" = pcc.test.res$p.value,
     "p-value (spearman)" = sp.test.res$p.value,
     "p-value (kendall)" = kndl.test.res$p.value) %>% as_tibble



---
### Clustering ###

Clustering is an unsupervised approach to finding grouping patterns in a given dataset.

Types of Clustering in R: [Ref](https://www.geeksforgeeks.org/clustering-in-r-programming/)


*   K-means clustering
*   Hierarchical clustering
*   Spectral clustering
*   Fuzzy clustering
*   Density-based clustering
*   Ensemble clustering
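
As one example from the list above, hierarchical clustering can be sketched with base R alone. This is a minimal illustration, assuming Euclidean distance, complete linkage, and 4 groups; all three are arbitrary choices made here, and the names `scaled`, `d`, `hc`, and `groups` are illustrative only.

```r
# Standardize so that no single variable dominates the distances
scaled <- scale(mtcars)

# Pairwise Euclidean distances between cars
d <- dist(scaled, method = "euclidean")

# Agglomerative (bottom-up) clustering with complete linkage
hc <- hclust(d, method = "complete")

# Cut the tree into 4 groups (4 is an arbitrary choice for illustration)
groups <- cutree(hc, k = 4)
table(groups)
```

Calling `plot(hc)` draws the dendrogram, which helps when choosing where to cut the tree.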




---




In [None]:
# Install the required packages first (only needed once per session)
install.packages(c(
  "factoextra",
  "cluster"
  ))

In [None]:
# ref https://www.geeksforgeeks.org/clustering-in-r-programming/
library("factoextra")
library("cluster")
# load data
df <- mtcars

# remove rows with missing values
df <- na.omit(df)

# scale each variable to have a mean of 0 and sd of 1
df <- scale(df)

# perform K-means clustering (nstart = 25 tries 25 random starts and keeps the best)
km.res <- kmeans(df, centers = 4, nstart = 25)
# visualize the clusters
fviz_cluster(km.res, data = df)



---

Exercise: Using the data from the previous exercise, check whether there is any correlation (Hint: Pearson) between the university 'International Research Network Score' and 'International Faculty Rank'.

---



In [11]:
#@title Solution
library(data.table)
library(dplyr)

df = fread("https://docs.google.com/spreadsheets/d/e/2PACX-1vRtySA5U09DJktfiQdTP_j50tCI3h64G6zHFxCDJvkpA8VFgRTn6G9zFGDU9Kwv4s0sianfz7YcvYTD/pub?gid=1872926621&single=true&output=csv")

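df <- df %>% dplyr::filter(`Country Code` == 'SA')

# Convert the rank to numeric before sorting, in case it was read as text
df$`International Faculty Rank` <- as.numeric(df$`International Faculty Rank`)
df <- df %>% dplyr::arrange(desc(`International Faculty Rank`))

# use = "complete.obs" skips rows where either value is missing
cor(df$`International Faculty Rank`, df$`International Research Network Score`,
    method = "pearson", use = "complete.obs")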



---

