# Anomaly Detection in Python

Dataset from Kaggle : **"Pokemon with stats"** by *Alberto Barradas*  
Source: https://www.kaggle.com/abcsds/pokemon (requires login)

---

### Essential Libraries

Let us begin by importing the essential Python Libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

---

### Import the Dataset

The dataset is in CSV format; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` function.

In [None]:
pkmndata = pd.read_csv('pokemonData.csv')
pkmndata.head()

The description of the dataset, as available on Kaggle, is as follows.  
Learn more : https://en.wikipedia.org/wiki/List_of_Pok%C3%A9mon

> **\#** : ID for each Pokemon (runs from 1 to 721)  
> **Name** : Name of each Pokemon  
> **Type 1** : Each Pokemon has a basic Type, this determines weakness/resistance to attacks  
> **Type 2** : Some Pokemon are dual-type and have a Type 2 value (set to `NaN` otherwise)  
> **Total** : Sum of all stats of a Pokemon, a general guide to how strong a Pokemon is  
> **HP** : Hit Points, defines how much damage a Pokemon can withstand before fainting  
> **Attack** : The base modifier for normal attacks by the Pokemon (e.g., scratch, punch etc.)  
> **Defense** : The base damage resistance of the Pokemon against normal attacks  
> **Sp. Atk** : Special Attack, the base modifier for special attacks (e.g., fire blast, bubble beam)  
> **Sp. Def** : Special Defense, the base damage resistance against special attacks  
> **Speed** : Determines which Pokemon attacks first each round  
> **Generation** : Each Pokemon belongs to a certain Generation  
> **Legendary** : Legendary Pokemons are powerful, rare, and hard to catch

---

Check the vital statistics of the dataset using the built-in `type` function and the `shape` attribute.

In [None]:
print("Data type : ", type(pkmndata))
print("Data dims : ", pkmndata.shape)

Check the variables (and their types) in the dataset using the `dtypes` attribute.

In [None]:
print(pkmndata.dtypes)
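The column list above notes that **Type 2** is missing for single-type Pokemon, so it is worth confirming that the columns used as features are numeric and complete before fitting a model. The sketch below uses a small hand-typed frame named `toy` (an assumption, standing in for `pkmndata`) with the same column names.

```python
import numpy as np
import pandas as pd

# Hand-typed stand-in for pkmndata (three rows only, for illustration)
toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle"],
    "Type 1": ["Grass", "Fire", "Water"],
    "Type 2": ["Poison", np.nan, np.nan],
    "Total": [318, 309, 314],
    "Speed": [45, 65, 43],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# The features used for anomaly detection should be numeric with no NaN
features = toy[["Total", "Speed"]]
print(features.dtypes)
print("Missing feature values :", int(features.isna().sum().sum()))
```

Running the same two checks on `pkmndata` should show whether any feature column needs cleaning first.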

---

## Bi-Variate Anomaly Detection

Set up a Bi-Variate Anomaly Detection problem on the Pokemon Dataset.   
Features to be used for Anomaly Detection : **Total, Speed**       

Note : There is no Predictor or Response (Unsupervised Learning).

In [None]:
# Extract the Features from the Data
X = pd.DataFrame(pkmndata[["Total", "Speed"]])

# Plot the Raw Data on a 2D grid
f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "Total", y = "Speed", data = X)

#### Basic Anomaly Detection

Use a nearest-neighbors (k-NN) method to detect outliers and anomalies.  
We will use the `LocalOutlierFactor` model from the `sklearn.neighbors` module, which flags points whose local density is much lower than that of their nearest neighbors.

In [None]:
# Import LocalOutlierFactor from sklearn.neighbors
from sklearn.neighbors import LocalOutlierFactor

# Set the Parameters for Neighborhood
num_neighbors = 20      # Number of Neighbors
cont_fraction = 0.05    # Fraction of Anomalies

# Create Anomaly Detection Model using LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors = num_neighbors, contamination = cont_fraction)

# Fit the Model on the Data
lof.fit(X)
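After fitting, the model stores a score for every training point in `negative_outlier_factor_`: values near -1 are typical, and more negative values are more anomalous. A minimal sketch on synthetic data (the names `X_demo` and `lof_demo` are assumptions, not part of the notebook) shows one far-away point receiving the lowest score.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic data: a cluster of 100 points plus one point far from it
cluster = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
far_point = np.array([[10.0, 10.0]])
X_demo = np.vstack([cluster, far_point])

# Same parameters as in the notebook
lof_demo = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_demo.fit(X_demo)

# One score per training point; the most negative is the most anomalous
scores = lof_demo.negative_outlier_factor_
most_anomalous = int(np.argmin(scores))
print("Index of most anomalous point :", most_anomalous)
print("Its score :", scores[most_anomalous])
```

The same attribute is available on `lof` once it has been fitted on `X`, which allows ranking the Pokemon instead of only labeling them.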

#### Labeling the Anomalies in the Data

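Use `fit_predict` to label every point: `-1` marks an anomaly and `1` marks an inlier. In its default mode, `LocalOutlierFactor` has no separate `predict` method, so fitting and labeling happen in a single call.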

In [None]:
# Predict the Anomalies
labels = lof.fit_predict(X)

# Append Labels to the Data
X_labeled = X.copy()
X_labeled["Anomaly"] = pd.Categorical(labels)

# Summary of the Anomaly Labels
sb.countplot(x = "Anomaly", data = X_labeled)
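The count plot can be cross-checked numerically. `fit_predict` returns `-1` for anomalies and `1` for inliers, and `contamination` sets the score threshold so that roughly that fraction of points is flagged. The sketch below uses synthetic data with hypothetical column names, so the numbers say nothing about the Pokemon dataset itself.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)

# Synthetic two-feature data (hypothetical column names)
X_demo = pd.DataFrame(rng.normal(size=(200, 2)), columns=["feature_a", "feature_b"])

# Label every point: -1 = anomaly, 1 = inlier
labels_demo = LocalOutlierFactor(n_neighbors=20, contamination=0.05).fit_predict(X_demo)

# Tabulate the labels and compute the flagged fraction
counts = pd.Series(labels_demo).value_counts()
n_flagged = int((labels_demo == -1).sum())
print(counts)
print("Flagged fraction :", n_flagged / len(X_demo))
```

On the notebook data, `X_labeled["Anomaly"].value_counts()` gives the same tabulation.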

In [None]:
# Visualize the Anomalies in the Data
f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "Total", y = "Speed", c = "Anomaly", cmap = 'viridis', data = X_labeled)

#### Interpret the Anomalies

Try to interpret the anomalies by exploring the Features across them.

In [None]:
# Swarmplots of the Features, grouped by Anomaly label
f, axes = plt.subplots(2, 1, figsize=(16,8))
sb.swarmplot(x = 'Total', y = 'Anomaly', data = X_labeled, ax = axes[0])
sb.swarmplot(x = 'Speed', y = 'Anomaly', data = X_labeled, ax = axes[1])
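Besides the plots, it can help to list the flagged rows by name and compare summary statistics between the two groups. The sketch below is one way to do that, run on a synthetic frame `demo` with made-up item names and one deliberately extreme row; all names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
n = 200

# Synthetic stand-in data with hypothetical names
demo = pd.DataFrame({
    "Name": [f"item_{i}" for i in range(n)],
    "Total": rng.normal(430, 60, n),
    "Speed": rng.normal(68, 15, n),
})

# Plant one clearly extreme row so there is something to find
demo.loc[0, ["Total", "Speed"]] = [780.0, 180.0]

# Label the rows using only the two features
features = demo[["Total", "Speed"]]
demo["Anomaly"] = LocalOutlierFactor(n_neighbors=20, contamination=0.05).fit_predict(features)

# List the flagged rows by name
flagged = demo[demo["Anomaly"] == -1]
print(flagged[["Name", "Total", "Speed"]])

# Compare feature summaries between anomalies (-1) and inliers (1)
summary = demo.groupby("Anomaly")[["Total", "Speed"]].agg(["mean", "min", "max"])
print(summary)
```

In the notebook, the equivalent step would attach the labels to `pkmndata` so that the `Name` column is available for the flagged rows.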

---

## Multi-Variate Anomaly Detection

Set up a Multi-Variate Anomaly Detection problem on the Pokemon Dataset.   
Features : **HP, Attack, Defense, Sp. Atk, Sp. Def, Speed**  
**Total** is left out because it is the sum of the other six stats and adds no new information.

In [None]:
# Extract the Features from the Data
X = pd.DataFrame(pkmndata[["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]]) 

# Plot the Raw Data on 2D grids
sb.pairplot(X)
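The feature matrix above leaves out **Total**, which the column description defines as the sum of all stats. A quick way to confirm that relationship is to add up the six stat columns and compare. The sketch uses three hand-typed rows in a frame named `toy` (an assumption, standing in for `pkmndata`).

```python
import pandas as pd

# Hand-typed rows with the same column names as the dataset
toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle"],
    "Total": [318, 309, 314],
    "HP": [45, 39, 44],
    "Attack": [49, 52, 48],
    "Defense": [49, 43, 65],
    "Sp. Atk": [65, 60, 50],
    "Sp. Def": [65, 50, 64],
    "Speed": [45, 65, 43],
})

# Sum the six individual stats row by row and compare with Total
stat_cols = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]
total_matches = toy[stat_cols].sum(axis=1) == toy["Total"]
print("Total equals the sum of the six stats :", bool(total_matches.all()))
```

Running the same comparison on `pkmndata` would show whether the relationship holds for every row.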

In [None]:
# Import LocalOutlierFactor from sklearn.neighbors
from sklearn.neighbors import LocalOutlierFactor

# Set the Parameters for Neighborhood
num_neighbors = 20      # Number of Neighbors
cont_fraction = 0.05    # Fraction of Anomalies

# Create Anomaly Detection Model using LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors = num_neighbors, contamination = cont_fraction)

# Fit the Model on the Data
lof.fit(X)
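`LocalOutlierFactor` works on distances between points, so a feature with a much larger numeric range can dominate the result. The notebook fits on the raw features. The sketch below is an optional variation, not the notebook's method: it standardizes synthetic features with `StandardScaler` and counts how many labels change. The column names `wide` and `narrow` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic features on very different scales (hypothetical names)
X_demo = pd.DataFrame({
    "wide": rng.normal(400, 100, 300),
    "narrow": rng.normal(0, 1, 300),
})

# Standardize each column to zero mean and unit variance
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X_demo), columns=X_demo.columns)

# Label the points with and without scaling
lof_demo = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels_raw = lof_demo.fit_predict(X_demo)
labels_scaled = lof_demo.fit_predict(X_scaled)

# Count how many rows receive a different label after scaling
changed = int((labels_raw != labels_scaled).sum())
print("Rows whose label changed after scaling :", changed)
```

If the features already share a similar range, scaling may change little; comparing the two label sets is a cheap way to find out.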

In [None]:
# Predict the Anomalies
labels = lof.fit_predict(X)

# Append Labels to the Data
X_labeled = X.copy()
X_labeled["Anomaly"] = pd.Categorical(labels)

# Summary of the Anomaly Labels
sb.countplot(x = "Anomaly", data = X_labeled)

In [None]:
# Visualize the Anomalies in the Data
sb.pairplot(X_labeled, vars = X.columns.values, hue = "Anomaly")

In [None]:
# Boxplots for all Features against the Anomalies
f, axes = plt.subplots(6, 1, figsize=(16,24))
sb.boxplot(x = 'HP', y = 'Anomaly', data = X_labeled, ax = axes[0])
sb.boxplot(x = 'Attack', y = 'Anomaly', data = X_labeled, ax = axes[1])
sb.boxplot(x = 'Defense', y = 'Anomaly', data = X_labeled, ax = axes[2])
sb.boxplot(x = 'Sp. Atk', y = 'Anomaly', data = X_labeled, ax = axes[3])
sb.boxplot(x = 'Sp. Def', y = 'Anomaly', data = X_labeled, ax = axes[4])
sb.boxplot(x = 'Speed', y = 'Anomaly', data = X_labeled, ax = axes[5])