# Problem Statement

In this competition, we predict whether or not an email is spam.

We are going to cover the following steps:
1. Install Vaex
2. Load Data
3. Dimensionality Reduction (PCA)
4. Dimensionality Reduction (Incremental PCA)
5. Dimensionality Reduction (Random Projections)
6. References

Let’s begin by installing the latest stable version of <span style="color:#E94A93;">Vaex</span>.

# Install Vaex

In [None]:
!pip install -I vaex

In [None]:
# Load Libraries
import vaex
vaex.multithreading.thread_count_default = 8
import vaex.ml

import numpy as np
import matplotlib.pyplot as plt  # preferred over the legacy pylab interface
import time
from pathlib import Path
import pprint
import pandas
# Display the output of every expression in a cell, not just the last one.
# To revert, set InteractiveShell.ast_node_interactivity = "last_expr".
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import warnings
warnings.filterwarnings("ignore")

# Load Data

In [None]:
# Load data using Vaex
start = time.time()
data_dir = Path('../input/tabular-playground-series-nov-2021/')
vaex_train = vaex.read_csv(data_dir / "train.csv")
end = time.time()
print(f"Loaded train.csv in {end - start:.1f} s")

Let's take a random sample of the training data so that the steps below are less likely to run into memory issues.

In [None]:
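# Reference: https://datatofish.com/random-rows-pandas-dataframe/
# Note: frac=0.999 keeps 99.9% of the rows, despite the "99_percent" in the variable name.
vaex_train_99_percent = vaex_train.sample(frac=0.999)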

In [None]:
features = vaex_train.column_names[1:-1] # because we want to exclude id and target columns from the training dataset
print(features)
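
The slices used to pick feature groups are easy to get off by one, so a quick check on a stand-in column list can help. The sketch below assumes the column layout `id`, `f0` to `f99`, `target`; that layout is an assumption made for illustration, not something read from the file here.

```python
# Stand-in for vaex_train.column_names, assuming the layout id, f0..f99, target.
column_names = ["id"] + [f"f{i}" for i in range(100)] + ["target"]

# [1:-1] drops the first column ("id") and the last column ("target").
all_features = column_names[1:-1]
print(len(all_features), all_features[0], all_features[-1])

# A slice [a:b] returns b - a names, at positions a through b - 1.
print(len(column_names[1:90]))   # positions 1..89, i.e. f0..f88
print(len(column_names[1:91]))   # positions 1..90, i.e. f0..f89
print(column_names[91:101])      # positions 91..100, i.e. f90..f99
```

Under this assumed layout, a slice ending at 90 stops one feature short of the first 90 features.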

# Dimensionality Reduction using PCA

In [None]:
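# Column 0 is the id, so the first 90 features sit at positions 1..90 (a [1:90] slice would return only 89).
features_first_90 = vaex_train.column_names[1:91]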
pca_first_90_features = vaex.ml.PCA(features=features_first_90, n_components=10)
vaex_train_99_percent_first_90_features = pca_first_90_features.fit_transform(vaex_train_99_percent)
vaex_train_99_percent_first_90_features

In [None]:
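# Columns at positions 91..100 (10 features).
features_91_101 = vaex_train.column_names[91:101]
# With n_components equal to the number of input features, PCA rotates the data without reducing its dimensionality.
pca_91_101_features = vaex.ml.PCA(features=features_91_101, n_components=10)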
vaex_train_99_percent_91_101_features = pca_91_101_features.fit_transform(vaex_train_99_percent)
vaex_train_99_percent_91_101_features
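
For intuition about what a PCA transformer computes, here is a minimal NumPy sketch of textbook PCA: center the data, eigendecompose the covariance matrix, and project onto the leading eigenvectors. It is not a description of how `vaex.ml.PCA` is implemented; the helper name `pca_project` and the synthetic data are hypothetical and used only for illustration.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top principal components (illustrative helper)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)          # (n_features, n_features)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]                  # (n_features, n_components)
    return X_centered @ components, eigvals[order]

# Synthetic, correlated features standing in for real data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))

Z, variances = pca_project(X, n_components=5)
print(Z.shape)      # one row per sample, one column per component
print(variances)    # variance captured by each component, largest first
```

The variance of each projected column equals the corresponding eigenvalue, which is why the components are ordered by how much of the data's spread they capture.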

# Dimensionality Reduction using Incremental PCA

In [None]:
pca_first_90_features_incremental = vaex.ml.PCAIncremental(n_components=10, features=features_first_90, batch_size=42000)
pca_first_90_features_incremental.fit(vaex_train_99_percent, progress='widget')
pca_first_90_features_incremental.transform(vaex_train_99_percent)

In [None]:
pca_91_101_features_incremental = vaex.ml.PCAIncremental(n_components=10, features=features_91_101, batch_size=42000)
pca_91_101_features_incremental.fit(vaex_train_99_percent, progress='widget')
pca_91_101_features_incremental.transform(vaex_train_99_percent)
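
The appeal of fitting in batches is that the full dataset never has to be in memory at once. The NumPy sketch below illustrates that idea with a simpler technique than an incremental SVD: it accumulates running sums per batch, builds the covariance matrix from them, and eigendecomposes it. This is a swapped-in streaming-covariance approach chosen for brevity, not the algorithm behind `vaex.ml.PCAIncremental`; `streaming_pca` and the synthetic data are hypothetical.

```python
import numpy as np

def streaming_pca(batches, n_components):
    """PCA from running sums, visiting each batch exactly once (illustrative helper)."""
    n = 0
    total = None   # running sum of rows
    cross = None   # running sum of X^T X
    for batch in batches:
        if total is None:
            d = batch.shape[1]
            total = np.zeros(d)
            cross = np.zeros((d, d))
        n += batch.shape[0]
        total += batch.sum(axis=0)
        cross += batch.T @ batch
    mean = total / n
    # sum((x - mean)(x - mean)^T) = sum(x x^T) - n * mean mean^T
    cov = (cross - n * np.outer(mean, mean)) / (n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8)) @ rng.normal(size=(8, 8))

# Feed the data in chunks of 250 rows, as if each chunk were read from disk.
batches = (X[i:i + 250] for i in range(0, len(X), 250))
mean, components, variances = streaming_pca(batches, n_components=3)

Z = (X - mean) @ components
print(Z.shape)
print(variances)
```

Only a vector and a square matrix sized by the number of features are kept between batches, so memory use does not grow with the number of rows.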

# Dimensionality Reduction using Random Projections

Random projection is another popular approach to dimensionality reduction, especially when the dimensionality of the data is very high. <span style="color:#E94A93;">vaex.ml</span> conveniently wraps both <span style="color:#E94A93;">sklearn.random_projection.GaussianRandomProjection</span> and <span style="color:#E94A93;">sklearn.random_projection.SparseRandomProjection</span> in a single <span style="color:#E94A93;">vaex.ml</span> transformer.

In [None]:
random_projections_first_90_features = vaex.ml.RandomProjections(features=features_first_90, n_components=10)
random_projections_first_90_features.fit(vaex_train_99_percent)
random_projections_first_90_features.transform(vaex_train_99_percent)

In [None]:
random_projections_91_101_features = vaex.ml.RandomProjections(features=features_91_101, n_components=10)
random_projections_91_101_features.fit(vaex_train_99_percent)
random_projections_91_101_features.transform(vaex_train_99_percent)
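
The idea behind a Gaussian random projection can be sketched in a few lines of NumPy: multiply the data by a random matrix whose entries are drawn from a normal distribution with variance `1 / n_components`, and pairwise distances are approximately preserved. The sizes and synthetic data below are assumptions chosen for illustration; this is a sketch of the technique, not the Vaex or scikit-learn implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_components = 200, 1000, 300

# Synthetic high-dimensional data standing in for real features.
X = rng.normal(size=(n_samples, n_features))

# Random projection matrix with entries drawn from N(0, 1 / n_components).
R = rng.normal(scale=1.0 / np.sqrt(n_components), size=(n_features, n_components))
X_proj = X @ R

# Compare pairwise distances before and after projection for a few rows.
pairs = [(i, j) for i in range(20) for j in range(i + 1, 20)]
orig_dist = np.array([np.linalg.norm(X[i] - X[j]) for i, j in pairs])
proj_dist = np.array([np.linalg.norm(X_proj[i] - X_proj[j]) for i, j in pairs])
ratios = proj_dist / orig_dist

print(X_proj.shape)
print(ratios.min(), ratios.mean(), ratios.max())  # ratios near 1 mean distances are roughly preserved
```

Unlike PCA, the projection matrix does not depend on the data, so fitting is cheap; the trade-off is that more components are typically needed to keep distances close to their original values.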

# References

1. Thanks to the [Vaex documentation](https://vaex.io/docs/tutorial_ml.html#Dimensionality-reduction) for demonstrating how to perform dimensionality reduction with Vaex.