<a href="https://colab.research.google.com/github/DonErnesto/masterclassSFI_2021/blob/main/notebooks/CreditCardUnsupervised.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Jupyter notebook, Python basics and pandas

The data was taken from https://www.kaggle.com/mlg-ulb/creditcardfraud, and downsampled for the purpose of this masterclass. 


In [None]:
## Data import from Github
import os
if not os.path.exists('X_unsupervised.csv.zip'):
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/X_unsupervised.csv.zip

We will use the "pandas" package for data handling and manipulation, and later "scikit-learn" (imported as "sklearn") for various outlier detection algorithms. 

In [None]:
## Package import: pandas for data handling and manipulation
import pandas as pd

# A small hack: "monkey-patching" the DataFrame class to add column-wise normalization as a method
def normalize_columns(self):
    # Subtract each column's mean, then divide by that column's standard deviation
    return (self - self.mean()) / self.std()

pd.DataFrame.normalize_columns = normalize_columns
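Monkey-patching changes the DataFrame class for the whole session. As a hedged alternative, the sketch below keeps the same formula in a plain function and applies it either directly or through pandas' `.pipe()` method. The `demo` frame and its values are made up for illustration and are not part of the masterclass data.

```python
import pandas as pd


def normalize_columns(df):
    """Column-wise normalization: subtract the mean, divide by the standard deviation."""
    return (df - df.mean()) / df.std()


# Hypothetical demonstration data (not the credit card data)
demo = pd.DataFrame({"V1": [1.0, 2.0, 3.0], "V2": [10.0, 20.0, 60.0]})

# Call the function directly ...
as_function = normalize_columns(demo)

# ... or use .pipe(), which passes the DataFrame as the first argument
as_pipe = demo.pipe(normalize_columns)

print(as_function)
```

Both calls give the same result; `.pipe()` is mainly convenient when the step sits in the middle of a longer chain of methods.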

Next, we will load the data into a so-called DataFrame (a pandas object) and inspect it by displaying its first N rows.

In [None]:
X = pd.read_csv('X_unsupervised.csv.zip')
# .head() returns a DataFrame, that consists of the first N (default: N=5) rows 
# of the DataFrame it is applied on
X.head() 

The data describes credit card transactions, one transaction per row. 

As you may notice, all features are numeric. All Vx features are the result of a mathematical operation called principal component analysis (PCA). In practice, we often have to deal with non-numerical data (for instance, categorical data), which takes some effort to convert into a numerical form suitable for the mathematical models we work with. 

The pre-fabricated data thus saves us considerable time. 
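To give an idea of the effort that is saved here, the sketch below shows one common way to make a categorical column numerical: one-hot encoding with `pd.get_dummies`. The `toy` frame and its `Channel` column are hypothetical and do not occur in the masterclass data.

```python
import pandas as pd

# Hypothetical transactions with one categorical column
toy = pd.DataFrame({
    "Amount": [12.5, 99.0, 3.2],
    "Channel": ["online", "in_store", "online"],
})

# One-hot encoding: every category becomes its own 0/1 indicator column
toy_numeric = pd.get_dummies(toy, columns=["Channel"], dtype=int)

print(toy_numeric)
```

Other encodings exist (ordinal codes, target encoding, and so on); which one is appropriate depends on the model and the data.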

Let us first determine the dimensions of the DataFrame (note that the first dimension goes along the rows, the second along columns):

In [None]:
X.shape
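Since `.shape` is an ordinary Python tuple, it can be unpacked into two variables. A minimal sketch, using a made-up `demo` frame instead of `X`:

```python
import pandas as pd

# Hypothetical demonstration data: 3 rows, 2 columns
demo = pd.DataFrame({"V1": [0.1, 0.2, 0.3], "Amount": [1.0, 2.0, 3.0]})

# First element: number of rows, second element: number of columns
n_rows, n_cols = demo.shape

print(f"{n_rows} rows, {n_cols} columns")
```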

In any realistic situation, we would not have access to labels (otherwise, we would be using a supervised approach) and typically know nothing about the fraction of positives. We will already give one fact away: the fraction of positive labels is about 0.3%. 

Before proceeding, let's demonstrate some dataframe operations, with a smaller demonstration dataframe. 

### Some useful pandas DataFrame methods:

In [None]:
# Let's demonstrate the hints with a smaller dataframe (the first 5 rows):
small_df = X.head(5).copy()

**.drop()** 

The .drop() method can be applied on a DataFrame to drop one or more rows (with index=[...]) or columns (with columns=[...]). It returns a new DataFrame and leaves the original unchanged. 

Example usage to delete ("drop") one or more columns: 

In [None]:
small_df.drop(columns=['V1', 'V5']) # This drops the V1 and V5 columns
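The same method also drops rows when given index labels. The sketch below uses a made-up `demo` frame to show both variants, and that the original frame is not modified.

```python
import pandas as pd

# Hypothetical demonstration data (not the credit card data)
demo = pd.DataFrame({
    "V1": [0.1, -1.2, 0.7],
    "V2": [1.5, 0.3, -0.4],
    "Amount": [10.0, 250.0, 3.5],
})

# Drop a column by name
without_v1 = demo.drop(columns=["V1"])

# Drop a row by its index label
without_first_row = demo.drop(index=[0])

# The original DataFrame still has all rows and columns
print(demo.shape)
```

To keep the result, assign it to a variable, as done here.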

**.abs()** 

The .abs() method can be applied on a DataFrame (or Series) to convert numerical values to their absolute values. It returns a new object of the same type (i.e., a DataFrame when applied on a DataFrame). 

Example usage for .abs():

In [None]:
small_df.abs()

**.max(axis=1), .sum(axis=1), .mean(axis=1)**

These methods can be applied on a DataFrame to do row-wise operations. They all return a Series (with as many rows as the DataFrame they were applied on).

In [None]:
small_df.max(axis=1)
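The `axis` argument decides the direction of the operation: `axis=1` combines the values within each row, while `axis=0` (the default) combines the values within each column. A small sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical demonstration data: 2 rows, 3 columns
demo = pd.DataFrame({"a": [1.0, 4.0], "b": [2.0, 5.0], "c": [3.0, 6.0]})

# Row-wise: one value per row
row_sums = demo.sum(axis=1)    # 1+2+3 and 4+5+6
row_means = demo.mean(axis=1)

# Column-wise: one value per column
col_max = demo.max(axis=0)

print(row_sums)
print(col_max)
```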

**.normalize_columns()**

We added this method to our DataFrame in the beginning. It performs column-wise normalization (i.e., after this operation, the column-wise mean is zero and the column-wise standard deviation, and hence the variance, is one). 

In [None]:
small_df.normalize_columns()
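The claim about mean and variance can be checked numerically. The sketch below repeats the formula as a plain function on made-up data. Note that pandas' `.std()` computes the sample standard deviation (`ddof=1`) by default, so "variance one" refers to the sample variance.

```python
import pandas as pd


def normalize_columns(df):
    # Same formula as the method added at the top of the notebook
    return (df - df.mean()) / df.std()


# Hypothetical demonstration data (not the credit card data)
demo = pd.DataFrame({
    "V1": [1.0, 2.0, 3.0, 4.0],
    "Amount": [10.0, 20.0, 40.0, 90.0],
})

z = normalize_columns(demo)

# Up to floating-point rounding: column means are 0, column standard deviations are 1
print(z.mean())
print(z.std())
```

A column in which all values are identical has a standard deviation of zero, so its normalized values come out as NaN.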

#### Other useful operations: selecting single and multiple columns

Generally, this returns a DataFrame when selecting multiple columns, and a Series when selecting a single column.

- Selecting a single column by its name:

In [None]:
# A single column:
small_df['Amount']

- Selecting multiple columns with their numerical index using .iloc: 

In [None]:
# The first 5 columns:
small_df.iloc[:, :5]

In [None]:
# All columns except the last one:
small_df.iloc[:, :-1]
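Multiple columns can also be selected by name, by passing a list of column names. The sketch below, on a made-up `demo` frame, also shows the difference in return type between single and double brackets.

```python
import pandas as pd

# Hypothetical demonstration data (not the credit card data)
demo = pd.DataFrame({
    "V1": [0.1, -1.2, 0.7],
    "V2": [1.5, 0.3, -0.4],
    "Amount": [10.0, 250.0, 3.5],
})

# Single brackets with one name: a Series
single = demo["Amount"]

# A list of names: a DataFrame with those columns, in the order given
subset = demo[["V1", "Amount"]]

# A list with one name: still a DataFrame (with one column)
one_col_frame = demo[["Amount"]]

print(type(single).__name__, type(subset).__name__, type(one_col_frame).__name__)
```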

Note that many pandas DataFrame methods return a DataFrame, on which we can apply another method. 
Applying a method on the result of another method is called "chaining". We can, for instance, first drop a column, then use normalize_columns() (our home-made addition) to normalize the columns, then .abs() to convert to absolute values, then .sum(axis=1) to sum horizontally, to yield a Series:

In [None]:
small_df.drop(columns=['Amount']).normalize_columns().abs().sum(axis=1)
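A chain can always be unpacked into intermediate steps, which helps when inspecting what each method contributes. The sketch below follows the steps described in the text (drop, normalize, absolute value, row-wise sum) on made-up data, with the normalization written as a plain function so the example is self-contained.

```python
import pandas as pd


def normalize_columns(df):
    # Same formula as the method added at the top of the notebook
    return (df - df.mean()) / df.std()


# Hypothetical demonstration data (not the credit card data)
demo = pd.DataFrame({
    "V1": [-1.0, 0.0, 1.0],
    "V2": [2.0, 4.0, 6.0],
    "Amount": [5.0, 50.0, 500.0],
})

features = demo.drop(columns=["Amount"])   # step 1: drop a column
z = normalize_columns(features)            # step 2: normalize each column
abs_z = z.abs()                            # step 3: absolute values
score = abs_z.sum(axis=1)                  # step 4: one number per row

print(score)
```

Rows whose values lie far from the column means, in either direction, get a larger sum.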