**What is Grubbs' test?**

Grubbs' test is a hypothesis test that checks whether the single most extreme value in a univariate dataset is an outlier, assuming the data are otherwise approximately normally distributed. The test calculates a test statistic, which is then compared to a critical value derived from a t-distribution.

**How does Grubbs' test work?**

Here is a step-by-step overview:
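1. **Arrange data in order** (optional): Sorting the data in ascending or descending order makes the extreme values easy to spot, but the test itself does not require it.
2. **Calculate the mean and standard deviation**: Compute the sample mean and the sample standard deviation of the data.
3. **Calculate the test statistic**: Compute the Grubbs' test statistic using the formula:
**G = max|xᵢ - x̄| / s**
where G is the test statistic, max|xᵢ - x̄| is the largest absolute deviation from the sample mean x̄, and s is the sample standard deviation. For a one-sided test of the maximum, use G = (max(x) - x̄) / s instead.
4. **Determine the critical value**: For N observations and significance level α, the two-sided critical value is
**G_crit = ((N - 1) / √N) · √(t² / (N - 2 + t²))**
where t is the upper α/(2N) critical value of a t-distribution with N - 2 degrees of freedom. Look it up in a Grubbs' table or use software to calculate it.
5. **Compare the test statistic to the critical value**: If the test statistic (G) is greater than the critical value, the null hypothesis of no outliers is rejected and the most extreme value is considered an outlier. The test addresses one value at a time.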
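The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `grubbs_statistic_and_critical` and the sample values are made up for the example, and the two-sided form of the test is assumed.

```python
import numpy as np
from scipy import stats


def grubbs_statistic_and_critical(values, alpha=0.05):
    """Return (G, G_crit, index of the most extreme value) for a two-sided Grubbs' test."""
    x = np.asarray(values, dtype=float)
    n = x.size
    deviations = np.abs(x - x.mean())
    idx = int(np.argmax(deviations))          # position of the most extreme value
    g = deviations[idx] / x.std(ddof=1)       # ddof=1 gives the sample standard deviation
    # t quantile with N - 2 degrees of freedom at level alpha / (2N)
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return g, g_crit, idx


# Hypothetical measurements with one suspicious reading at the end
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 14.5]
g, g_crit, idx = grubbs_statistic_and_critical(sample)
print(f"G = {g:.3f}, critical value = {g_crit:.3f}, suspect index = {idx}")
print("outlier" if g > g_crit else "no outlier detected")
```

For this sample the statistic exceeds the critical value, so the last reading is flagged.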

**What is Mahalanobis distance?**

Mahalanobis distance is a measure of the distance between a point and the center of a multivariate distribution, taking into account the covariance between variables.

**What does it represent?**

Mahalanobis distance represents how many standard deviations away from the mean a data point is, considering the correlations between variables.

**Key aspects**:
1. Multivariate: Mahalanobis distance is used for multivariate data, where each data point has multiple features or variables.
2. Covariance: It takes into account the covariance between variables, which means it considers how the variables are related to each other.
3. Standardized: Mahalanobis distance is unitless and unaffected by rescaling the variables (for example, changing units), so distances can be compared across datasets that have the same number of variables.
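The scale-invariance property can be checked numerically. The sketch below uses synthetic data and a hypothetical helper `mahalanobis_to_mean`; it rescales each column by a different factor and confirms that the distances do not change.

```python
import numpy as np


def mahalanobis_to_mean(X):
    """Mahalanobis distance of every row of X from the sample mean."""
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # synthetic data, 3 variables
X_rescaled = X * np.array([100.0, 1.0, 0.1])   # e.g. a change of units per column

same = np.allclose(mahalanobis_to_mean(X), mahalanobis_to_mean(X_rescaled))
print(same)
```

Euclidean distance would change under the same rescaling, which is why it is a poor choice when variables are measured in different units.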

**How is it calculated?**

The Mahalanobis distance between a data point x and the mean μ of a multivariate distribution is calculated as:
D(x) = √[(x - μ)ᵀ Σ⁻¹ (x - μ)]
where:
- x is the data point
- μ is the mean of the distribution
- Σ is the covariance matrix of the distribution
- Σ⁻¹ is the inverse of the covariance matrix
- ᵀ denotes the transpose operation
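A small worked example may help make the formula concrete. The mean, covariance matrix, and points below are invented for illustration. Both points are the same Euclidean distance from the mean, but one lies along the direction of correlation and the other against it.

```python
import numpy as np

# Hypothetical 2-D distribution with strongly correlated variables
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])


def mahalanobis(x, mu, sigma):
    """D(x) = sqrt((x - mu)^T Sigma^-1 (x - mu))."""
    diff = np.asarray(x, dtype=float) - mu
    # Solving Sigma y = diff avoids forming the inverse explicitly
    return float(np.sqrt(diff @ np.linalg.solve(sigma, diff)))


along = mahalanobis([1.0, 1.0], mu, sigma)     # follows the correlation
against = mahalanobis([1.0, -1.0], mu, sigma)  # contradicts the correlation
print(f"along: {along:.3f}, against: {against:.3f}")
```

The point that contradicts the correlation is about three times farther away in Mahalanobis terms (roughly 3.16 versus 1.05), even though its Euclidean distance is identical.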

**Why is it useful?**

Mahalanobis distance is useful in various applications, such as:
1. Outlier detection: It helps identify data points that are farthest from the center of the distribution.
2. Anomaly detection: It can be used to detect unusual patterns or anomalies in multivariate data.
3. Clustering: Mahalanobis distance can be used as a metric in clustering algorithms, such as hierarchical clustering or k-means variants that replace the usual Euclidean distance.

In [30]:
import pandas as pd
import numpy as np
from scipy import stats

**Load the CSV file into a Pandas DataFrame**

In [31]:
def loadData(csvFile):
    try:
        data = pd.read_csv(csvFile)
        return data
    except Exception as e:
        print(f"Error loading data: {e}")
        return None

csvFile = 'LOG10-20171018_m.csv'
data = loadData(csvFile)

In [32]:
print(data.head())

     ID  macid           Date_time  Acceleration_x  Acceleration_y  \
0  1002    NaN  10/18/2017 0:00:01            1.50            0.89   
1  1024    NaN  10/18/2017 0:00:02            1.54            0.81   
2  1013    NaN  10/18/2017 0:00:02            1.51            0.89   
3  1009    NaN  10/18/2017 0:00:03            1.54            0.84   
4  1008    NaN  10/18/2017 0:00:04            1.47            0.93   

   Acceleration_z  Acceleration_s  Frequency  Amplitude  sound  ...  an  \
0            0.03            1.74       8.99      15.10      0  ...  59   
1            0.04            1.74       8.79      15.80      0  ...  62   
2            0.04            1.75       8.90      15.50      0  ...  59   
3            0.02            1.75       8.86      15.64      0  ...  61   
4            0.04            1.73       8.99      15.02      0  ...  57   

   device_id  node_firm gateway_firm radio_power   res  sen_type  Unnamed: 22  \
0        NaN        NaN          NaN         Na

In [33]:
# Select relevant columns
columns = ['Frequency', 'Temp_t1', 'Temp_t2', 'Temp_t3']
dataSubset = data[columns]

In [34]:
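# Calculate Mahalanobis distance
def mahalanobisDistance(data):
    """Return the Mahalanobis distance of each row from the column means."""
    mean = data.mean()
    cov = data.cov()
    # Raises numpy.linalg.LinAlgError if the covariance matrix is singular
    invCov = np.linalg.inv(cov)
    diff = (data - mean).to_numpy()
    # Row-wise quadratic form diff_i^T @ invCov @ diff_i, without a Python loop
    squared = np.einsum('ij,jk,ik->i', diff, invCov, diff)
    return np.sqrt(squared)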

distances = mahalanobisDistance(dataSubset)
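As a sanity check, distances computed this way can be compared against `scipy.spatial.distance.mahalanobis`, which takes two vectors and the inverse covariance matrix. The sketch below uses a small synthetic DataFrame as a stand-in for `dataSubset`; its column names are placeholders.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance

# Synthetic frame standing in for dataSubset (column names are hypothetical)
rng = np.random.default_rng(1)
frame = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

mean = frame.mean().to_numpy()
inv_cov = np.linalg.inv(frame.cov().to_numpy())

# Distances for every row, computed in one vectorised step
diff = frame.to_numpy() - mean
vectorised = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

# Reference values from SciPy, one row at a time
reference = np.array(
    [distance.mahalanobis(row, mean, inv_cov) for row in frame.to_numpy()]
)

print(np.allclose(vectorised, reference))
```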

**Grubbs' Test Implementation**

Implement Grubbs' test to detect outliers:

In [37]:
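# Apply Grubbs' test to Mahalanobis distances
def grubbsTest(data, alpha=0.05):
    n = len(data)

    # Calculate mean and sample standard deviation (ddof=1)
    mean = np.mean(data)
    stdDev = np.std(data, ddof=1)

    # Standardized absolute deviation of every point;
    # the Grubbs' statistic G is the maximum of these values
    testStatistic = np.abs(data - mean) / stdDev

    # Determine the two-sided Grubbs' critical value from the t-distribution
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    criticalValue = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

    # Identify outliers. Strictly, Grubbs' test only tests the single most
    # extreme value; flagging every point above the critical value is a heuristic
    outlierIndices = np.where(testStatistic > criticalValue)[0]
    return outlierIndices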

outlierIndices = grubbsTest(distances)

# Print outlier rows
outlierRows = data.iloc[outlierIndices]
print(outlierRows)

         ID  macid            Date_time  Acceleration_x  Acceleration_y  \
50     1024    NaN   10/18/2017 0:01:42            1.53            0.80   
67     1024    NaN   10/18/2017 0:02:22            1.53            0.80   
93     1024    NaN   10/18/2017 0:03:22            1.53            0.80   
227    1024    NaN   10/18/2017 0:08:02            1.53            0.80   
302    1024    NaN   10/18/2017 0:10:42            1.52            0.80   
476    1024    NaN   10/18/2017 0:17:03            1.52            0.80   
548    1024    NaN   10/18/2017 0:20:13            1.52            0.80   
577    1024    NaN   10/18/2017 0:21:33            1.53            0.80   
598    1024    NaN   10/18/2017 0:22:33            1.53            0.81   
702    1024    NaN   10/18/2017 0:27:03            1.53            0.80   
767    1024    NaN   10/18/2017 0:29:53            1.53            0.80   
829    1024    NaN   10/18/2017 0:32:04            1.52            0.80   
875    1024    NaN   10/1

Mahalanobis distance and Grubbs' test can be used together, but it's essential to understand the context and limitations.

**Using Mahalanobis distance with Grubbs' test**:
1. Calculate Mahalanobis distance: First, calculate the Mahalanobis distance for each data point in your multivariate dataset.
2. Apply Grubbs' test: Then, apply Grubbs' test to the Mahalanobis distance values to detect outliers.
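The two steps can be sketched end to end on synthetic data. This is an illustration under several assumptions, not the notebook's method. The data are generated with two planted outliers. The test is applied one-sided, since only large distances are suspicious. It is repeated after removing each rejected maximum, which is a common extension of Grubbs' test. The function names and the `max_outliers` cap are hypothetical.

```python
import numpy as np
from scipy import stats


def mahalanobis_distances(X):
    """Distance of each row of X from the sample mean."""
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))


def grubbs_upper_outliers(values, alpha=0.05, max_outliers=10):
    """Repeat a one-sided Grubbs' test on the maximum, removing it each time it is rejected."""
    values = np.asarray(values, dtype=float)
    remaining = np.arange(values.size)
    flagged = []
    while remaining.size > 2 and len(flagged) < max_outliers:
        x = values[remaining]
        n = x.size
        pos = int(np.argmax(x))
        g = (x[pos] - x.mean()) / x.std(ddof=1)
        t = stats.t.ppf(1 - alpha / n, n - 2)  # one-sided level alpha / N
        g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
        if g <= g_crit:
            break
        flagged.append(int(remaining[pos]))
        remaining = np.delete(remaining, pos)
    return flagged


# Synthetic stand-in for the sensor data: correlated points plus 2 planted outliers
rng = np.random.default_rng(42)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=300)
X[10] = [8.0, -8.0]
X[200] = [-9.0, 9.0]

distances = mahalanobis_distances(X)
flagged = grubbs_upper_outliers(distances)
print(flagged)
```

The planted rows (indices 10 and 200) should be the first two flagged. Genuine points may occasionally be flagged as well, because Mahalanobis distances are right-skewed rather than normal.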

**Pros of combining Mahalanobis distance with Grubbs' test**:
1. Improved outlier detection: Mahalanobis distance reduces each multivariate observation to a single number that accounts for the covariance between variables. Grubbs' test then provides a significance-based rule for deciding whether the largest distances are extreme.
2. Multivariate outlier detection: This combination allows for multivariate outlier detection, which can be more effective than using Grubbs' test alone on individual variables.

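**Cons of combining Mahalanobis distance with Grubbs' test**:
1. Assumptions: Grubbs' test assumes that the values it is applied to are approximately normally distributed. Mahalanobis distances are non-negative and right-skewed; under multivariate normality the squared distances approximately follow a chi-square distribution, so the distances themselves are not normal and the Grubbs' critical value is only a rough guide.
2. Sensitivity to parameters: The set of flagged points depends on the significance level (α), and the mean and covariance matrix used to compute the distances are themselves distorted by the outliers being sought.
3. Computational complexity: Computing Mahalanobis distance requires estimating and inverting the covariance matrix, which becomes expensive and numerically unstable when the number of variables is large or the variables are nearly collinear.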
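Given the normality caveat, a commonly used alternative is to compare squared Mahalanobis distances with a chi-square quantile whose degrees of freedom equal the number of variables. The sketch below assumes multivariate normal data and uses synthetic values; the 99.9th percentile is an illustrative choice, not a recommendation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p = 4                             # number of variables, e.g. four selected columns
X = rng.normal(size=(1000, p))    # synthetic, independent standard-normal data

diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d_squared = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Under multivariate normality, D^2 is approximately chi-square with p degrees of freedom
cutoff = stats.chi2.ppf(0.999, df=p)
flagged = np.where(d_squared > cutoff)[0]
print(f"cutoff on D^2: {cutoff:.2f}, flagged {flagged.size} of {len(X)} points")
```

With clean normal data, only about one point in a thousand is expected to exceed this cutoff.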