# Workbook 03 - Dimension Reduction
by [David Elliott](https://eldave93.netlify.app/)

1. [Workspace Setup](#setup)

2. [Problem Understanding](#problem)

3. [Data Pre-Processing/EDA](#eda)
   

In this workbook we look at how to reduce the number of attributes to those that are likely to be useful for our model. So far we have worked with data that has only a few attributes (features), but sometimes you have lots of attributes and don't know where to start!

To reduce a model's complexity, run time, and potential for overfitting to the training data, dimension reduction techniques can be used. Broadly, they can be grouped into methods that select a subset of the original features (Feature Selection) and methods that create new synthetic features by combining the original features and discarding the less important ones (Feature Extraction). Essentially, we want to remove "uninformative information" and retain the useful bits<sup>1</sup>. If you have too many features, some of them may be highly correlated and therefore redundant. We can then either select just some of them, or compress them onto a lower dimensional subspace<sup>2</sup>.
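
The difference between the two families can be sketched on toy data. The example below is only an illustration under stated assumptions: the data are synthetic, the correlation filter is a simple stand-in for feature selection, the 0.9 threshold is arbitrary, and PCA (computed here with a plain NumPy SVD) is just one example of feature extraction. It is not necessarily the method used later in this workbook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: 200 samples, 4 features, where x2 is nearly a copy of x0
n = 200
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = x0 + 0.05 * rng.normal(size=n)  # highly correlated with x0, so redundant
x3 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2, x3])

# Feature selection: keep a subset of the ORIGINAL columns.
# Here we drop any feature that is highly correlated with one already kept.
corr = np.abs(np.corrcoef(X, rowvar=False))
threshold = 0.9  # assumed cut-off, chosen for illustration only
keep = []
for j in range(X.shape[1]):
    if all(corr[j, k] < threshold for k in keep):
        keep.append(j)
X_selected = X[:, keep]

# Feature extraction: build NEW features as combinations of the originals.
# Here we project the centred data onto the first two principal components.
X_centred = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
n_components = 2
X_extracted = X_centred @ Vt[:n_components].T
explained = S**2 / np.sum(S**2)  # share of variance per component

print("Columns kept by selection:", keep)
print("Selected shape:", X_selected.shape)
print("Extracted shape:", X_extracted.shape)
print("Share of variance per component:", explained.round(3))
```

Selection keeps columns that are still directly interpretable, whereas extraction produces columns that mix all of the original features.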

---
1. Zheng, A., & Casari, A. (2018). Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media, Inc.

2. Raschka, 2016

# 1. Workspace Setup <a id='setup'></a>

Before downloading any data we should think about our workspace. It is assumed that if you have made it this far, you have already set up your workspace. There are two ways of using these notebooks. The first is to use Google Colab, a website that allows you to write and execute Python code through the browser. The second is a local workspace (e.g. Anaconda).

## 1.A. Google Colab
If you are not using Google Colab then you can skip this section.

If you are using Colab, we first need to install the libraries for the runtime and upload the files associated with this workbook to the temporary file store. You also need to set the working directory to a local copy of the workshop repository, so that all the data, images, and scripts for displaying the solutions work. This copy lives on the temporary file store associated with the Colab runtime. The code below will do all this for you.

__Required:__ After running this code make sure to restart the runtime to ensure everything works correctly (Runtime > Restart runtime).

In [1]:
import sys
import os

try:
    import google.colab
    COLAB=True
    
    # set the workbook code
    WORKSHOP_NAME = "03-dimension-reduction"
    # set the folder name
    FOLDER_NAME = "machine-learning-workbooks-main"

    # check if the environment is already setup to avoid repeating this after 
    # restarting the runtime
    if not os.path.exists(FOLDER_NAME):
        # get the zip of the workbooks
        !wget https://github.com/Eldave93/machine-learning-workbooks/archive/refs/heads/main.zip
        # unzip the file
        !unzip main.zip
        # install the libraries
        !{sys.executable} -m pip install -r ./{FOLDER_NAME}/scripts/requirements.txt
          
    print("Setting working directory to:")
    %cd ./{FOLDER_NAME}/{WORKSHOP_NAME}
    
except ImportError:
    # google.colab cannot be imported, so we are not running on Colab
    COLAB=False
    print("Colab is not being used")

Colab is not being used


## 1.1. Packages

If you're using a local workspace, you will need all the following packages to run this notebook. If you do not already have them, or you want to ensure you are using the same versions as were used when the notebook was created, you can run `!{sys.executable} -m pip install -r ../scripts/requirements.txt` to install them.

In [3]:
import sys # needed for sys.path below if the Colab cell was skipped
import os # locating directories
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import warnings # prevent warnings

sys.path.append('../scripts') # add scripts to the path for use later
sns.set(rc={'figure.figsize': (14, 8)}) # For plot sizes
np.random.seed(42)  # to make this notebook's output identical at every run

# colours for print()
class color:
    PURPLE = '\033[95m'
    CYAN = '\033[96m'
    DARKCYAN = '\033[36m'
    BLUE = '\033[94m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
    END = '\033[0m'
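
If you want to see which versions of these packages are installed in your workspace before comparing them with `requirements.txt`, a minimal sketch is shown below. It assumes Python 3.8 or later (for `importlib.metadata`); the helper name `installed_versions` and the list of package names are illustrative and are not part of the workbook's scripts.

```python
from importlib import metadata


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version (or None)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # the package is not installed in this environment
            versions[name] = None
    return versions


# package names assumed from the imports used in this notebook
for name, version in installed_versions(["numpy", "pandas", "matplotlib", "seaborn"]).items():
    print(f"{name}: {version or 'not installed'}")
```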

## 1.2. Displaying solutions

The solutions are activated using a new .txt file which can be put in the workbook folder (e.g. `03-dimension-reduction`). Please put in a request for access.

If you have access to the solutions, the following cell will create clickable buttons under each exercise, which will allow you to reveal the solutions.

__Notes__

- This method was created by [Charlotte Desvages](https://charlottedesvages.com/).
- This may not work if your version of IPython differs from the one on the machine that created the solutions, in which case you may see an error such as:
    ```
    ImportError: cannot import name 'Code' from 'IPython.display' (/usr/local/lib/python3.7/dist-packages/IPython/display.py)
    ```
    If so, run the auto-install above or manually run `!pip install ipython==7.20`.

---

# 2. Problem Understanding <a id='problem'></a>

