# Outlier Detection in Human Activity Data

In this project you will perform outlier detection on the UCI Human Activity Recognition (HAR) dataset, focusing on two types of outliers:
- Instances far from the center of the class distribution.
- Instances that are isolated in feature space.

Steps:

- Data Preprocessing: Load and prepare the UCI HAR dataset for classification.
- Feature Extraction: Train a network (e.g., a neural network or another feature extractor) to classify the activities and extract useful features.
- Modeling Class Distributions: Use the extracted features to model the distribution of each class using a Gaussian Mixture Model (GMM).
- Center Detection: Identify the center of the distribution for each class and apply k-Nearest Neighbors (kNN) to find the central instances and prototypes for each class.
- Outlier Detection: For each class, identify outliers using:
    - Instances far from the center.
    - Instances that are isolated in the feature space.
- Comparison: Compare instances close to the center and the detected outliers. Attempt to form hypotheses explaining why the outliers are different.
- Global Outlier Detection: Repeat the outlier detection for the entire dataset (ignoring class labels), and compare the results to the class-specific outliers.

**Important**: At the end, write a report of adequate length (probably at least half a page) describing how you approached the task. The report should cover:
- Difficulties encountered that stem from the method (e.g. "not enough training samples to converge"), not technical ones (e.g. "I could not install a package with pip").
- Steps taken to alleviate those difficulties.
- A general description of what you did: explain in plain language, without code, how you understood the task and how you solved it.
- Potential limitations of your approach: what could go wrong, and how the task might become harder on different data or under slightly different conditions.
- If you have an idea for extending this work in an interesting way, describe it.

In [None]:
# Step 1: Download and Unpack the Dataset
import os
import urllib.request
from zipfile import ZipFile

def unzip(filename, dest_path=None):
    # Extract all members of a zip archive into dest_path
    # (defaults to the current working directory, i.e. the notebook folder)
    if dest_path is None:
        dest_path = os.getcwd()
    with ZipFile(filename, 'r') as zip_file:
        zip_file.extractall(path=dest_path)

def download(url, filename):
    # Download url to filename, skipping the download if the file already exists
    if os.path.isfile(filename):
        return
    urllib.request.urlretrieve(url, filename)

# Un-comment the lines below only if executing on Google Colab
# ![[ -f UCI_HAR.zip ]] || wget --no-check-certificate https://people.minesparis.psl.eu/fabien.moutarde/ES_MachineLearning/Practical_sequentialData/UCI_HAR.zip
# ![[ -d "UCI_HAR" ]] || unzip UCI_HAR.zip

download('https://people.minesparis.psl.eu/fabien.moutarde/ES_MachineLearning/Practical_sequentialData/UCI_HAR.zip', 'UCI_HAR.zip')

unzip('UCI_HAR.zip')

In [None]:
# Step 2: Load and Standardize Data
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load train and test data
# sep=r'\s+' splits on any run of whitespace (replaces the deprecated delim_whitespace=True)
train_data = pd.read_csv('./UCI_HAR/train/X_train.txt', sep=r'\s+', header=None)
train_labels = pd.read_csv('./UCI_HAR/train/y_train.txt', sep=r'\s+', header=None)
test_data = pd.read_csv('./UCI_HAR/test/X_test.txt', sep=r'\s+', header=None)
test_labels = pd.read_csv('./UCI_HAR/test/y_test.txt', sep=r'\s+', header=None)

# Standardize the data
scaler = StandardScaler()
train_data_scaled = scaler.fit_transform(train_data)
test_data_scaled = scaler.transform(test_data)

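# Create a custom dataset class for PyTorch
import numpy as np
import torch
from torch.utils.data import Dataset

class HARDataset(Dataset):
    def __init__(self, data, labels):
        # Accept arrays or DataFrames; store float32 features for the network
        self.data = torch.as_tensor(np.asarray(data), dtype=torch.float32)
        # Labels in the files are 1-based; shift to 0-based class indices
        self.labels = torch.as_tensor(np.asarray(labels).ravel() - 1, dtype=torch.long)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]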

### Step 3: Feature Extraction
Define a feature extraction model (e.g., an LSTM-based network).

You should:
- Design a model architecture.
- Train the model on the training data.
- Save extracted features for further analysis.

Provide training logic, loss computation, and optimization steps.
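
As a starting point, the training logic described above could look roughly like the sketch below. It is only one possible design, not the expected solution: `FeatureNet` and `train_and_extract` are hypothetical names, the layer sizes and hyperparameters are arbitrary, and a small MLP is used instead of an LSTM on the assumption that `X_train.txt` holds one fixed-length feature vector per window (561 values in the standard UCI HAR release). The demo runs on random stand-in data so that it is self-contained.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class FeatureNet(nn.Module):
    """Hypothetical MLP classifier; the encoder output serves as the feature vector."""
    def __init__(self, n_inputs=561, n_features=32, n_classes=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_inputs, 128), nn.ReLU(),
            nn.Linear(128, n_features), nn.ReLU(),
        )
        self.head = nn.Linear(n_features, n_classes)

    def forward(self, x):
        return self.head(self.encoder(x))

def train_and_extract(X, y, epochs=5, lr=1e-3, batch_size=64):
    """Train with cross-entropy on (X, y), then return the model and encoder features for X."""
    model = FeatureNet(n_inputs=X.shape[1], n_classes=int(y.max()) + 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        model.train()
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)   # classification loss on the logits
            loss.backward()
            optimizer.step()
    model.eval()
    with torch.no_grad():
        features = model.encoder(X)         # penultimate-layer activations
    return model, features

# Random stand-in data; in the notebook, pass the scaled training data and 0-based labels.
torch.manual_seed(0)
X_demo = torch.randn(200, 561)
y_demo = torch.randint(0, 6, (200,))
model, feats = train_and_extract(X_demo, y_demo, epochs=2)
print(feats.shape)
```

The extracted features can then be converted with `feats.numpy()` and saved, for example with `numpy.save`, for the later steps.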

In [None]:
# Implement the feature extraction model here
# Suggested tools: PyTorch, torch.nn.LSTM, torch.optim
# Extract and save features

### Step 4: Modeling Class Distributions
Use the extracted features to model class distributions with Gaussian Mixture Models (GMM).

You should:
- Fit a GMM for each class.
- Store the GMMs for later outlier detection.
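
A minimal sketch of the two points above, assuming the features and labels are available as NumPy arrays. The helper name `fit_class_gmms`, the number of components, and the `reg_covar` value are illustrative assumptions that you should tune; the demo uses synthetic data in place of the extracted features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmms(features, labels, n_components=2, seed=0):
    """Fit one GaussianMixture per class label and return them in a dict keyed by label."""
    gmms = {}
    for c in np.unique(labels):
        gmm = GaussianMixture(
            n_components=n_components,
            covariance_type="full",
            reg_covar=1e-4,          # small diagonal term to keep covariances well-conditioned
            random_state=seed,
        )
        gmms[int(c)] = gmm.fit(features[labels == c])
    return gmms

# Synthetic stand-in for the extracted features: two well-separated classes.
rng = np.random.default_rng(0)
demo_features = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(5, 1, (100, 4))])
demo_labels = np.repeat([0, 1], 100)

class_gmms = fit_class_gmms(demo_features, demo_labels)
# Per-sample log-likelihood of the first five samples under the class-0 model.
log_lik = class_gmms[0].score_samples(demo_features[:5])
print(log_lik.shape)
```

Keeping the fitted models in a dictionary makes it easy to look up the right GMM by class label during outlier detection.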

In [None]:
# Implement GMM fitting here
# Suggested tool: sklearn.mixture.GaussianMixture
# Fit GMMs for each class and store the models

### Step 5: Outlier Detection
Identify outliers within each class in two ways:
- Instances far from the center: low likelihood under the class's GMM.
- Isolated instances: large distances to their k nearest neighbors (kNN).

Provide detailed steps for each detection method.
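
The two detection methods could be sketched as follows. The function names and the percentile thresholds are assumptions chosen for illustration, and the demo plants a single obvious outlier in synthetic data to show that both detectors flag it.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors

def low_likelihood_outliers(gmm, X, percentile=5.0):
    """Flag samples whose log-likelihood under gmm is below the given percentile."""
    scores = gmm.score_samples(X)
    return scores < np.percentile(scores, percentile), scores

def isolated_outliers(X, k=5, percentile=95.0):
    """Flag samples whose mean distance to their k nearest neighbours is above the percentile."""
    # Query k + 1 neighbours because each training point is returned as its own neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    mean_dist = dist[:, 1:].mean(axis=1)   # drop the zero distance to itself
    return mean_dist > np.percentile(mean_dist, percentile), mean_dist

# Synthetic stand-in for one class's features; the last row is a planted outlier.
rng = np.random.default_rng(0)
X_class = np.vstack([rng.normal(0, 1, (200, 3)), [[12.0, 12.0, 12.0]]])

gmm = GaussianMixture(n_components=1, random_state=0).fit(X_class)
far_mask, log_lik = low_likelihood_outliers(gmm, X_class)
iso_mask, knn_dist = isolated_outliers(X_class)
print(far_mask[-1], iso_mask[-1])
```

Percentile thresholds always flag a fixed fraction of each class, so it is worth inspecting the score distributions before settling on a cutoff.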

In [None]:
# Implement outlier detection logic here
# Suggested tools: sklearn.neighbors.NearestNeighbors, numpy for thresholding

### Step 6: Visualization and Analysis
Plot and compare central instances with detected outliers for each class.

Discuss potential reasons for the differences between the central and outlier instances.
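
One way to set up such a comparison is sketched below, under several assumptions: the class center is taken to be the feature mean (a GMM component mean would work the same way), `central_instances` is a hypothetical helper, the outliers are simply the samples farthest from the center, and PCA is used only to project the features to 2-D for plotting.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def central_instances(X, center, k=10):
    """Return indices of the k samples closest to center (Euclidean distance)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, idx = nn.kneighbors(center.reshape(1, -1))
    return idx[0]

# Synthetic stand-in for the extracted features of a single class.
rng = np.random.default_rng(0)
X_class = rng.normal(0, 1, (300, 8))

center = X_class.mean(axis=0)
proto_idx = central_instances(X_class, center)
dist_to_center = np.linalg.norm(X_class - center, axis=1)
outlier_idx = np.argsort(dist_to_center)[-10:]   # stand-in outliers: 10 farthest samples

# Project to 2-D for plotting only; distances above are computed in the full space.
xy = PCA(n_components=2).fit_transform(X_class)
fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(xy[:, 0], xy[:, 1], s=8, c="lightgray", label="all instances")
ax.scatter(xy[proto_idx, 0], xy[proto_idx, 1], c="tab:green", label="central")
ax.scatter(xy[outlier_idx, 0], xy[outlier_idx, 1], c="tab:red", marker="x", label="outliers")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.legend()
```

Because the projection discards most dimensions, points that look central in the plot may still be far from the center in the full feature space.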

In [None]:
# Implement visualization logic here
# Suggested tools: matplotlib.pyplot for scatter plots

### Step 7: Global Analysis
Repeat the outlier detection process without considering class labels.

Compare the results with class-specific outliers.
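
A possible sketch of this comparison, using low GMM likelihood as the only detector. The helper `bottom_percentile_mask`, the 5% threshold, the component counts, and the use of Jaccard overlap as the comparison measure are all illustrative assumptions; the data is synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bottom_percentile_mask(scores, percentile=5.0):
    """Flag the samples whose score lies below the given percentile."""
    return scores < np.percentile(scores, percentile)

# Synthetic stand-in for the extracted features of two classes.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0, 1, (150, 4)), rng.normal(6, 1, (150, 4))])
labels = np.repeat([0, 1], 150)

# Global detection: a single GMM fitted on all features, ignoring the labels.
global_gmm = GaussianMixture(n_components=2, random_state=0).fit(features)
global_mask = bottom_percentile_mask(global_gmm.score_samples(features))

# Class-specific detection: one GMM per class, thresholded within that class.
class_mask = np.zeros(len(features), dtype=bool)
for c in np.unique(labels):
    in_class = labels == c
    gmm_c = GaussianMixture(n_components=1, random_state=0).fit(features[in_class])
    class_mask[in_class] = bottom_percentile_mask(gmm_c.score_samples(features[in_class]))

# Compare the two outlier sets.
both = global_mask & class_mask
either = global_mask | class_mask
jaccard = both.sum() / max(either.sum(), 1)
print(f"global: {global_mask.sum()}, class-specific: {class_mask.sum()}, "
      f"Jaccard overlap: {jaccard:.2f}")
```

Instances flagged by only one of the two approaches are often the most informative ones to inspect, for example samples that are typical for the dataset as a whole but unusual for their own class.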

In [None]:
# Implement global outlier detection here
# Fit a single GMM on all features and apply outlier detection logic

### Report Writing
At the end of the notebook, provide a written report summarizing your approach, difficulties, and findings. Use Markdown to structure the report within the notebook itself.