<a href="https://colab.research.google.com/github/The-Algorist/Data-Transformations-and-Symmetry/blob/main/R.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Installing and Loading the Necessary Libraries**

In [None]:
# Install required packages
install.packages("remotes") # Ensure remotes is installed
remotes::install_github("bayesball/LearnEDAfunctions")

In [None]:
install.packages("aplpack")


In [None]:
# Load libraries
library(dplyr)
library(LearnEDAfunctions)  # Load LearnEDA package for the dataset and functions
library(ggplot2)            # For visualizations
library(aplpack)            # For stem.leaf plots

**Task 1**: Find a Transformation for Symmetry in the Lake Dataset

The dataset "lake" includes the following variables:

- AREA: Area of lake in acres.
- DEPTH: Maximum depth of the lake in feet.
- PH: Acidity measurement.
- WSHED: Watershed area in square miles.
- HIONS: Concentration of hydrogen ions.

In the data frame these columns are named `Area`, `Depth`, `PH`, `Wshed`, and `Hions`. For this task, a symmetrizing transformation is sought for two skewed variables, AREA and HIONS.

**Steps:**

- Inspect the data for skewness.
- Apply a transformation (for example, a square root for moderately right-skewed data, or a log for strongly right-skewed data whose values vary on a multiplicative scale).
- Use visual tools like symmetry plots to assess improvements.
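
As a quick numeric companion to the steps above, the sketch below compares how far the upper and lower quartiles sit from the median, before and after two candidate transformations. It uses a synthetic, deterministic right-skewed sample (lognormal quantiles) rather than the lake data, and the helper name `quartile_ratio` is made up for this illustration. A ratio near 1 suggests the middle half of the data is roughly symmetric.

```r
# Hypothetical helper: (upper quartile - median) / (median - lower quartile).
# Values well above 1 point to right skew; values near 1 point to rough symmetry.
quartile_ratio <- function(x) {
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE, na.rm = TRUE)
  (q[3] - q[2]) / (q[2] - q[1])
}

# Synthetic right-skewed sample (assumption: stands in for a variable such as AREA)
x <- qlnorm(ppoints(99))

# Compare the raw scale with two candidate transformations
ratios <- c(
  raw  = quartile_ratio(x),
  sqrt = quartile_ratio(sqrt(x)),
  log  = quartile_ratio(log(x))
)
print(round(ratios, 3))
```

This only looks at the middle half of the data, so it complements rather than replaces a symmetry plot, which also shows what happens in the tails.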

In [None]:
# Load the lake dataset from LearnEDA package
data("lake")

In [None]:
# Display the first few rows of the dataset
head(lake)

In [None]:
# Summary statistics to inspect skewness and non-symmetry
summary(lake)

In [None]:
# Count missing values in each column
colSums(is.na(lake))

In [None]:
# Visualize with boxplot to spot outliers
boxplot(lake$Area, main = "Area")
boxplot(lake$Depth, main = "Depth")
boxplot(lake$PH, main = "PH")
boxplot(lake$Wshed, main = "Wshed")
boxplot(lake$Hions, main = "Hions")

In [None]:
# Convert specified columns to numeric
lake$Area <- as.numeric(lake$Area)
lake$Depth <- as.numeric(lake$Depth)
lake$PH <- as.numeric(lake$PH)
lake$Wshed <- as.numeric(lake$Wshed)
lake$Hions <- as.numeric(lake$Hions)

In [None]:
# Check if the columns have been converted successfully
str(lake)

In [None]:
# Symmetry plot for AREA (Lake area in acres)
symplot(lake$Area)

# Symmetry plot for DEPTH (Lake depth in feet)
symplot(lake$Depth)

# Symmetry plot for pH (Acidity)
symplot(lake$PH)

# Symmetry plot for Watershed area (WSHED)
symplot(lake$Wshed)

# Symmetry plot for Hydrogen ion concentration (HIONS)
symplot(lake$Hions)

In [None]:
# Apply square root transformation for AREA (as it's likely right-skewed)
area_transformed <- sqrt(lake$Area)

In [None]:
# Apply log transformation for HIONS (as it's likely right-skewed and multiplicative)
hions_transformed <- log(lake$Hions)

In [None]:
# Check symmetry of the transformed variables
symplot(area_transformed)
symplot(hions_transformed)
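
For readers who want to see what a symmetry plot encodes, here is a minimal base-R sketch of the usual construction: pair the i-th smallest and i-th largest values and plot their distances from the median against each other. `symmetry_pairs` is a hypothetical helper written for this illustration and is not claimed to match how `symplot` is implemented. The small vector is made-up toy data.

```r
# Hypothetical helper: distances below and above the median for matched order statistics
symmetry_pairs <- function(x) {
  x <- sort(x[!is.na(x)])
  m <- median(x)
  k <- floor(length(x) / 2)          # number of lower/upper pairs
  data.frame(
    lower = m - x[seq_len(k)],       # distance of the i-th smallest value below the median
    upper = rev(x)[seq_len(k)] - m   # distance of the i-th largest value above the median
  )
}

toy <- c(1, 2, 3, 4, 10)             # toy data with one large value on the right
pairs_df <- symmetry_pairs(toy)
print(pairs_df)

# Points on the dashed line y = x indicate symmetry; points above it indicate right skew
plot(pairs_df$lower, pairs_df$upper,
     xlab = "Distance below median", ylab = "Distance above median",
     main = "Hand-built symmetry plot (toy data)")
abline(0, 1, lty = 2)
```

A transformation that works well should pull the points toward the dashed line.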

**Task 2:** Transformations for Two Additional Datasets

Two additional datasets need to be analyzed to find a suitable transformation for symmetry. Based on the documents, the "mortality rates" and "farms" datasets are provided in the LearnEDA package, and the following approach is suggested:

**Mortality Rates Dataset** (skewed to the right, making a log transformation suitable):
- Use a log transformation to make the data more symmetric.
- Plot symmetry before and after the transformation.

**Farms Dataset** (skewed, with long tails on both sides):
- Test a square root or log transformation to reduce skewness.

The cells below apply the same workflow to the built-in `mtcars` (`wt`, `hp`) and `iris` (`Petal.Length`) datasets rather than to those two datasets.

In [None]:
# Use built-in dataset
data(mtcars)

In [None]:
# Display the first few rows of the dataset
head(mtcars)

In [None]:
# Summary statistics to inspect skewness and non-symmetry
summary(mtcars)

In [None]:
# Check symmetry for Weight (wt) and Horsepower (hp) in the mtcars dataset
ggplot(mtcars, aes(x = wt)) +
  geom_histogram(bins = 10, fill = "blue", alpha = 0.5) +
  labs(title = "Histogram of Weight")

ggplot(mtcars, aes(sample = wt)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot for Weight")

ggplot(mtcars, aes(x = hp)) +
  geom_histogram(bins = 10, fill = "red", alpha = 0.5) +
  labs(title = "Histogram of Horsepower")

ggplot(mtcars, aes(sample = hp)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot for Horsepower")

In [None]:
# Apply log transformation for weight
wt_log <- log(mtcars$wt)


In [None]:
# Apply square root transformation for horsepower
hp_sqrt <- sqrt(mtcars$hp)

In [None]:
# Visualize the transformed variables
ggplot(data.frame(wt_log), aes(x = wt_log)) +
  geom_histogram(bins = 10, fill = "blue", alpha = 0.5) +
  labs(title = "Log-transformed Weight")

ggplot(data.frame(hp_sqrt), aes(x = hp_sqrt)) +
  geom_histogram(bins = 10, fill = "red", alpha = 0.5) +
  labs(title = "Square-root Transformed Horsepower")

In [None]:
# Q-Q plot for transformed weight
ggplot(data.frame(wt_log), aes(sample = wt_log)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot for Log-transformed Weight")

# Q-Q plot for transformed horsepower
ggplot(data.frame(hp_sqrt), aes(sample = hp_sqrt)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot for Square-root Transformed Horsepower")

In [None]:
# Summary of original and transformed variables
summary(mtcars$wt)
summary(wt_log)

summary(mtcars$hp)
summary(hp_sqrt)

In [None]:
# Check symmetry for Petal Length in iris dataset
ggplot(iris, aes(x = Petal.Length)) +
  geom_histogram(bins = 10, fill = "green", alpha = 0.5) +
  labs(title = "Histogram of Petal Length")

# Apply log transformation for Petal Length
petal_log <- log(iris$Petal.Length)

# Check symmetry post-transformation
ggplot(data.frame(petal_log), aes(x = petal_log)) +
  geom_histogram(bins = 10, fill = "green", alpha = 0.5) +
  labs(title = "Log-transformed Petal Length")

**Using Hinkley's Method**

What is Hinkley's Method?
Hinkley suggested a simple measure to quantify the skewness of a dataset:

d = (mean - median) / scale

where the scale is often taken as the fourth-spread (i.e., the difference between the upper and lower fourths).

- Positive d indicates right-skewness (mean > median).
- Negative d indicates left-skewness (mean < median).
- d ≈ 0 suggests approximate symmetry.

By comparing the value of d before and after a transformation, we can assess whether the transformation successfully reduced the asymmetry.
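
As a small worked illustration with made-up values (not taken from the lake data), consider the sample 1, 2, 3, 4, 10. Its mean is 4, its median is 3, and its lower and upper fourths are 2 and 4, so the fourth-spread is 2:

$$
d = \frac{\text{mean} - \text{median}}{\text{fourth-spread}} = \frac{4 - 3}{4 - 2} = 0.5
$$

The positive value reflects the long right tail created by the value 10.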

**Steps for Using Hinkley's Method:**

**Inspect Skewness Using Hinkley's Method:**
Before applying any transformation, calculate Hinkley's statistic to determine whether the data is skewed.

**Apply Transformations:**
Based on the sign and magnitude of d, apply an appropriate transformation (log, square root, reciprocal, etc.).

**Recalculate Hinkley's Statistic:**
After the transformation, recalculate d to check if the transformation has improved the symmetry of the dataset.

**Choose the Best Transformation:**
The goal is to bring d as close to zero as possible, indicating reduced skewness and better symmetry.
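
The four steps above can be sketched as a single comparison across candidate transformations. This is a minimal illustration under stated assumptions: the data are a synthetic, strictly positive right-skewed sample (lognormal quantiles), not the lake data, the interquartile range stands in for the fourth-spread, and `hinkley_d` is a hypothetical name used only here.

```r
# Hinkley's d with the interquartile range as the scale (hypothetical helper name)
hinkley_d <- function(x) (mean(x) - median(x)) / IQR(x)

# Synthetic right-skewed, strictly positive sample (assumption: not the lake data)
x <- qlnorm(ppoints(99))

# Candidate transformations to compare
candidates <- list(
  raw        = identity,
  sqrt       = sqrt,
  log        = log,
  reciprocal = function(v) -1 / v   # negated so the ordering of values is preserved
)

# Compute d on each transformed scale
d_values <- sapply(candidates, function(f) hinkley_d(f(x)))
print(round(d_values, 3))

# Pick the candidate whose d is closest to zero
best <- names(which.min(abs(d_values)))
cat("Smallest |d|:", best, "\n")
```

Because d depends only on the mean, median, and spread, it is worth confirming the chosen transformation with a symmetry plot or histogram as well.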

**1. Initial Hinkley’s Statistic Calculation**

Before applying transformations, calculate Hinkley’s d for the raw data.

In [None]:
# Load LearnEDAfunctions, which provides the lake dataset
library(LearnEDAfunctions)

# Hinkley's d: (mean - median) / spread
hinkley <- function(x) {
  x <- x[!is.na(x)]   # drop missing values so the summaries are defined
  mean_x <- mean(x)
  median_x <- median(x)
  spread_x <- IQR(x)  # Interquartile range, used as a stand-in for the fourth-spread
  d <- (mean_x - median_x) / spread_x
  return(d)
}

# Initial Hinkley's d for Area and Hions
d_area <- hinkley(lake$Area)
d_hions <- hinkley(lake$Hions)

cat("Hinkley's d for Area (before transformation):", d_area, "\n")
cat("Hinkley's d for Hions (before transformation):", d_hions, "\n")


**2. Apply Transformations Based on Initial d**

If d is positive (right-skewed), use log or square root transformations. If it is negative (left-skewed), consider square or higher power transformations.
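
The right-skewed case is covered by the lake variables. For the left-skewed case, here is a minimal sketch on toy data: the square roots of 1 to 99 pile up at the high end, so d is negative, and squaring them moves d back to zero. The sample and the helper name `hinkley_d` are assumptions made for this illustration only.

```r
# Hinkley's d with the interquartile range as the scale (hypothetical helper name)
hinkley_d <- function(x) (mean(x) - median(x)) / IQR(x)

# Synthetic left-skewed sample (assumption: toy data, square roots of 1..99)
x_left <- sqrt(1:99)

d_raw     <- hinkley_d(x_left)     # negative: the mean sits below the median
d_squared <- hinkley_d(x_left^2)   # squaring recovers 1..99, which is symmetric

cat("d before squaring:", round(d_raw, 3), "\n")
cat("d after squaring: ", round(d_squared, 3), "\n")
```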

In [None]:
# Apply square root transformation for Area (likely right-skewed)
lake$Area_transformed <- sqrt(lake$Area)

# Apply log transformation for HIONS (likely right-skewed)
lake$Hions_transformed <- log(lake$Hions)

# Recalculate Hinkley's d after transformations
d_area_transformed <- hinkley(lake$Area_transformed)
d_hions_transformed <- hinkley(lake$Hions_transformed)

cat("Hinkley's d for Area (after transformation):", d_area_transformed, "\n")
cat("Hinkley's d for Hions (after transformation):", d_hions_transformed, "\n")


**3. Visualizing Before and After**

To see the effect of the transformations, compare histograms of each variable before and after transforming, alongside the change in Hinkley’s d:

In [None]:
# Visualize histograms before and after transformations
ggplot(lake, aes(x = Area)) + geom_histogram(bins = 10, fill = "blue", alpha = 0.7) +
  ggtitle("Area: Before Transformation")
ggplot(lake, aes(x = Area_transformed)) + geom_histogram(bins = 10, fill = "blue", alpha = 0.7) +
  ggtitle("Area: After Transformation")



In [None]:
ggplot(lake, aes(x = Hions)) + geom_histogram(bins = 10, fill = "green", alpha = 0.7) +
  ggtitle("Hions: Before Transformation")
ggplot(lake, aes(x = Hions_transformed)) + geom_histogram(bins = 10, fill = "green", alpha = 0.7) +
  ggtitle("Hions: After Transformation")
