### Unsupervised Methods

Notes:

In some cases, labelled data is not available, so we must rely on "unsupervised" methods to reveal patterns in the data and impose structure on it. In this tutorial we will introduce three approaches: dimensionality reduction using principal component analysis (a technique built on the singular value decomposition), clustering using KMeans, and anomaly detection using an isolation forest.

func-ai -> common imports

In [1]:
## Open func-ai, and click on common imports. Copy the code here and then execute the cell.

import pandas as pd
import numpy as np
import scipy as sp
import seaborn as sns
import matplotlib   ### note to func-ai --> this is probably unnecessary
from matplotlib import pyplot as plt


Note: Again, we will load the [adult](https://archive.ics.uci.edu/dataset/2/adult) dataset, an open source dataset based on a sample of US census data collected in 1994.

We use the pandas `read_csv` function to load the dataset, and name it `data`.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
data = pd.read_csv('/content/drive/MyDrive/SampleData/adult.csv')

Note --- First we will look at a small sample of the data.

You can try to do this yourself; it's the same as in tutorial 1.

We'll also check the column names again.
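If you want a reminder, here is a minimal sketch of those two steps using standard pandas calls. The small `toy` frame and its values are made up so that the cell runs on its own; in the notebook you would call the same methods on `data`.

```python
import pandas as pd

# Made-up stand-in for `data` (hypothetical values, only so the cell is self-contained)
toy = pd.DataFrame({
    "age": [25, 38, 28, 44, 18],
    "workclass": ["Private", "Private", "Local-gov", "Private", "?"],
    "hours-per-week": [40, 50, 40, 40, 30],
})

# A small random sample of rows; random_state makes the sample repeatable
sample_rows = toy.sample(n=3, random_state=0)
print(sample_rows)

# The column names, as a plain Python list
column_names = list(toy.columns)
print(column_names)
```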

As in tutorial 2 on modelling, we may need to do some additional work to include non-numerical features in our analysis, since our methods only take numeric data as inputs. To begin with, however, let's focus on the numerical columns: age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week.
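One way to pull out just those columns is sketched below. The frame name `numeric_data` and the made-up `toy` rows are assumptions for illustration; in the notebook you would index `data` with the same list.

```python
import pandas as pd

# Made-up stand-in for `data` (hypothetical values)
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": ["Private", "Private", "Local-gov"],
    "fnlwgt": [226802, 89814, 336951],
    "education-num": [7, 9, 12],
    "capital-gain": [0, 0, 0],
    "capital-loss": [0, 0, 0],
    "hours-per-week": [40, 50, 40],
})

numeric_columns = ['age', 'fnlwgt', 'education-num',
                   'capital-gain', 'capital-loss', 'hours-per-week']

# Keep only the numeric features listed above
numeric_data = toy[numeric_columns]

# Cross-check: let pandas find the numeric columns by dtype
auto_numeric = toy.select_dtypes(include="number").columns.tolist()
print(auto_numeric)
```

Listing the columns explicitly is the safer choice when a numeric-looking column is really an identifier or a code.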

We first look at clustering the data. Since we know the income data is split into two classes, <=50K and >50K, let's use 2 clusters.

func-ai -> modelling -> cluster (2 clusters) -> K Means Clustering
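The code func-ai generates is not reproduced here, so the following is only a sketch of what a 2-cluster KMeans fit can look like with scikit-learn. The synthetic `toy` frame and the `n_init` and `random_state` settings are assumptions, not values taken from the tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the numeric columns (random values, not real census data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(17, 90, size=300),
    "fnlwgt": rng.integers(12_000, 1_500_000, size=300),
    "hours-per-week": rng.integers(1, 99, size=300),
})

# Fit KMeans with 2 clusters and get one label (0 or 1) per row
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(toy)

# Attach the labels so the clusters can be plotted or summarised
clustered = toy.assign(cluster=labels)
print(clustered["cluster"].value_counts())
print(kmeans.cluster_centers_)  # one row per cluster, one column per feature
```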

Plotting age against hours per week, we don't see much separation between the two clusters.

Use func-ai -> boxplot to look at the range of each feature in numeric-data
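If you would rather write the plot by hand, a sketch with plain matplotlib follows. It draws one panel per feature, because a shared axis would be stretched by the largest column. The synthetic `numeric_data` frame is an assumption so the cell runs on its own.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Synthetic stand-in for the numeric columns (random values, not real census data)
rng = np.random.default_rng(0)
numeric_data = pd.DataFrame({
    "age": rng.integers(17, 90, size=300),
    "fnlwgt": rng.integers(12_000, 1_500_000, size=300),
    "hours-per-week": rng.integers(1, 99, size=300),
})

# One boxplot per feature, each with its own y-axis
fig, axes = plt.subplots(1, len(numeric_data.columns), figsize=(9, 3))
for ax, column in zip(axes, numeric_data.columns):
    ax.boxplot(numeric_data[column])
    ax.set_title(column)
fig.tight_layout()

# The same information as numbers: minimum and maximum of each feature
ranges = numeric_data.agg(["min", "max"])
print(ranges)
```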

Let's try using a scaler to put all the features on the same scale; otherwise fnlwgt will dominate the model fitting because its values are so much larger.

func-ai -> data wrangling -> normalise
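Which scaler func-ai applies is not stated, so this sketch assumes scikit-learn's `StandardScaler`, which rescales each column to mean 0 and standard deviation 1. `MinMaxScaler` would be a reasonable alternative if you prefer values between 0 and 1. The synthetic frame is again a stand-in.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric columns (random values, not real census data)
rng = np.random.default_rng(0)
numeric_data = pd.DataFrame({
    "age": rng.integers(17, 90, size=300),
    "fnlwgt": rng.integers(12_000, 1_500_000, size=300),
    "hours-per-week": rng.integers(1, 99, size=300),
})

# Rescale every column to mean 0 and standard deviation 1,
# keeping the original column names and row index
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(numeric_data),
                           columns=numeric_data.columns,
                           index=numeric_data.index)

print(scaled_data.describe().loc[["mean", "std"]].round(2))
```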


Then let's use KMeans on the scaled data.
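To see why scaling matters, here is a sketch on synthetic data that was built with two known groups plus one large-valued column that carries no group information. The data, the `agreement` helper and the pipeline settings are all assumptions made for illustration; the real dataset will not separate this cleanly.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two groups that differ in age and hours,
# plus a large-valued column (like fnlwgt) unrelated to the groups
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 200)
toy = pd.DataFrame({
    "age": rng.normal(25 + 30 * group, 3),
    "hours-per-week": rng.normal(20 + 30 * group, 3),
    "fnlwgt": rng.uniform(12_000, 1_500_000, size=400),
})

def agreement(labels, truth):
    """Share of rows whose cluster matches the known group, ignoring label order."""
    match = (labels == truth).mean()
    return max(match, 1 - match)

# KMeans on the raw values: distances are dominated by the large column
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy)

# Scale first, then cluster, in a single pipeline
scaled_model = make_pipeline(StandardScaler(),
                             KMeans(n_clusters=2, n_init=10, random_state=0))
scaled_labels = scaled_model.fit_predict(toy)

print("unscaled agreement:", agreement(raw_labels, group))
print("scaled agreement:  ", agreement(scaled_labels, group))
```

Wrapping the scaler and the model in one pipeline means the same scaling is applied automatically whenever the model is reused on new rows.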

Now we can try principal component analysis (PCA), which is often an effective preprocessing step before KMeans.

func-ai -> modelling -> dimensionality reduction

In [9]:
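import pandas as pd
from sklearn.decomposition import PCA

principal_components = 3
columns_to_reduce = ['age','fnlwgt','education-num','capital-gain','capital-loss','hours-per-week']
data = data.dropna()  # PCA cannot handle missing values

pca = PCA(n_components=principal_components)

# Fit PCA on the numeric columns, then project them onto the principal components.
# Reusing data.index keeps the rows aligned in the concat below,
# even if dropna() removed rows and left gaps in the index.
pca.fit(data[columns_to_reduce])
transformed = pd.DataFrame(pca.transform(data[columns_to_reduce]),
                           columns=[f"PC {n+1}" for n in range(principal_components)],
                           index=data.index)

# Replace the original numeric columns with the principal components
df_new = pd.concat([data.drop(columns_to_reduce, axis=1), transformed], axis=1)
print(df_new)

# Preview the first rows of the principal components
transformed_df = pd.DataFrame(transformed)
transformed_df.head()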


          workclass     education      marital-status         occupation  \
0           Private          11th       Never-married  Machine-op-inspct   
1           Private       HS-grad  Married-civ-spouse    Farming-fishing   
2         Local-gov    Assoc-acdm  Married-civ-spouse    Protective-serv   
3           Private  Some-college  Married-civ-spouse  Machine-op-inspct   
4                 ?  Some-college       Never-married                  ?   
...             ...           ...                 ...                ...   
48837       Private    Assoc-acdm  Married-civ-spouse       Tech-support   
48838       Private       HS-grad  Married-civ-spouse  Machine-op-inspct   
48839       Private       HS-grad             Widowed       Adm-clerical   
48840       Private       HS-grad       Never-married       Adm-clerical   
48841  Self-emp-inc       HS-grad  Married-civ-spouse    Exec-managerial   

      relationship   race     sex native-country  class  cluster  \
0        Own-child 

|   | PC 1 | PC 2 | PC 3 |
|---|---|---|---|
| 0 | 37138.149345 | -1069.158049 | -88.736001 |
| 1 | -99849.846043 | -1105.158679 | -91.038016 |
| 2 | 147287.145485 | -1040.207456 | -86.840225 |
| 3 | -29342.869347 | 6601.360791 | -76.72207 |
| 4 | -86166.846289 | -1101.567958 | -90.878443 |
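The `PC 1` values above are in the tens of thousands, which suggests that the first component mostly tracks fnlwgt, since the PCA was fitted on unscaled columns. One way to check this is to compare `explained_variance_ratio_` with and without scaling. The sketch below does that on synthetic data; the `toy` frame and the choice of `StandardScaler` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric columns (random values, not real census data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(17, 90, size=500),
    "fnlwgt": rng.integers(12_000, 1_500_000, size=500),
    "hours-per-week": rng.integers(1, 99, size=500),
})

# PCA on the raw values versus PCA on standardised values
raw_pca = PCA(n_components=2).fit(toy)
scaled_pca = PCA(n_components=2).fit(StandardScaler().fit_transform(toy))

# Share of the total variance captured by each component
print("raw:   ", raw_pca.explained_variance_ratio_)
print("scaled:", scaled_pca.explained_variance_ratio_)

# Loadings: how much each original column contributes to each raw component
loadings = pd.DataFrame(raw_pca.components_,
                        columns=toy.columns,
                        index=["PC 1", "PC 2"])
print(loadings.round(3))
```

If the first raw component captures nearly all the variance and loads almost entirely on one column, scaling before PCA is worth trying.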


## Fit a model to detect outliers



The final thing we can do is check for outliers. While an IQR-based rule such as Tukey's fences is often effective, it's also possible to use unsupervised modelling approaches. Since the dataset is fairly large, this might take a little while to complete.  
func-ai -> modelling -> random isolation forests
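
As a rough sketch of what an isolation forest does, the example below fits scikit-learn's `IsolationForest` to synthetic data with two planted extreme rows. The data, the `contamination` value of 0.01 and the other settings are assumptions for illustration, not the settings func-ai uses.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in: mostly typical rows (random values, not real census data)
rng = np.random.default_rng(0)
features = ["age", "hours-per-week"]
toy = pd.DataFrame({
    "age": rng.normal(40, 10, size=500),
    "hours-per-week": rng.normal(40, 8, size=500),
})

# Plant two extreme rows so there is something to find
toy.loc[0] = [95.0, 99.0]
toy.loc[1] = [17.0, 1.0]

# contamination is the share of rows we expect to be outliers (assumed here)
forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)

# fit_predict returns -1 for outliers and 1 for inliers
toy["outlier"] = forest.fit_predict(toy[features])

print(toy[toy["outlier"] == -1])
```

The `contamination` setting controls how many rows get flagged, so it is worth trying a few values and inspecting the flagged rows by eye.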