## <center>Data Preprocessing</center>

### Useful Links

[Python Quick Reference](https://www.python.org/ftp/python/doc/1.1/quick-ref.1.1.html)

[Python3 cheat sheet](https://perso.limsi.fr/pointal/_media/python:cours:mementopython3-english.pdf)

[Python.org](https://www.python.org/)

[Python Standard Library](https://docs.python.org/3/library/index.html)

[Python Functions](https://docs.python.org/3/library/functions)

[Python NumPy](https://numpy.org/)

[Matplotlib (PyPlot)](https://matplotlib.org/stable/index.html#)

---

|<td colspan=3> <center><bold><h3>**Table of contents**<center>|||
|---|:---|:---:|
|**01.**|[**Data Scaling**](#Data-Scaling:)||
|**02.**|[**Data Normalisation**](#Data-Normalisation:)||
|**03.**|[**Ordinal Encoding**](#Ordinal-Encoding:)||
|**04.**|[**One-Hot Encoding**](#One-Hot-Encoding:)||
|**05.**|[**Concatenate Arrays**](#Concatenate-Arrays:)||

**If the data is not already in an array**, one approach is to initialise a matrix of zeros and populate it while reading the file  

In [1]:
# Reading csv file into array (removing header and column containing text)

import numpy as np
import csv

irisArray = np.zeros((150,4))
labelsArray = np.empty(150, dtype='U10')

with open('data/iris.csv') as fs:
    csv_reader = csv.reader(fs,delimiter=',') #if it is a comma, it can be omitted
    header = next(csv_reader)              # header saved to its own variable and left out of matrix
    for (i,r) in enumerate(csv_reader):
        for j in range(4):              # skips the 5th column as it contains text
            irisArray[i,j] = float(r[j]) # converts strings read to floats and enters into matrix
        labelsArray[i] = r[4]             # adds string labels to array
    
print(irisArray)
print(labelsArray)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.1 1.5 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.
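
As an alternative to filling a pre-allocated matrix by hand, NumPy can parse the numeric and text columns directly. The following is a minimal sketch, assuming a file laid out like `iris.csv` (one header row, four numeric columns, one text column). The three sample rows and the header names are made up for illustration so that the block runs without the file.

```python
import numpy as np

# Hypothetical sample in the same layout as iris.csv (header names are assumed)
sample_lines = [
    "sepal_length,sepal_width,petal_length,petal_width,species",
    "5.1,3.5,1.4,0.2,setosa",
    "7.0,3.2,4.7,1.4,versicolor",
    "6.3,3.3,6.0,2.5,virginica",
]

# Numeric columns: skip the header row and read only the first four columns
features = np.genfromtxt(sample_lines, delimiter=",", skip_header=1,
                         usecols=(0, 1, 2, 3))

# Text column: read the fifth column on its own as fixed-width strings
labels = np.genfromtxt(sample_lines, delimiter=",", skip_header=1,
                       usecols=4, dtype="U10")

print(features.shape)
print(labels)
```

With a real file, the list of lines would be replaced by the file path, for example `'data/iris.csv'`.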

### Data Scaling:

**`min = npArrayVar.min(axis=0)`**  
**`max = npArrayVar.max(axis=0)`**  

**`scaledArrayVar = (npArrayVar - min)/ (max-min)`**  
* **Scaling** maps the range of the **data from [min,max] to [0,1]**  
* **`axis=0`** restricts calculations to columns of data to **treat each feature separately**  

In [2]:
# Scaling dataset with min and max

m = irisArray.min(axis=0)
M = irisArray.max(axis=0)

scaledArray = (irisArray - m)/(M-m)

print(scaledArray)

[[0.22222222 0.625      0.06779661 0.04166667]
 [0.16666667 0.41666667 0.06779661 0.04166667]
 [0.11111111 0.5        0.05084746 0.04166667]
 [0.08333333 0.45833333 0.08474576 0.04166667]
 [0.19444444 0.66666667 0.06779661 0.04166667]
 [0.30555556 0.79166667 0.11864407 0.125     ]
 [0.08333333 0.58333333 0.06779661 0.08333333]
 [0.19444444 0.58333333 0.08474576 0.04166667]
 [0.02777778 0.375      0.06779661 0.04166667]
 [0.16666667 0.45833333 0.08474576 0.        ]
 [0.30555556 0.70833333 0.08474576 0.04166667]
 [0.13888889 0.58333333 0.10169492 0.04166667]
 [0.13888889 0.41666667 0.06779661 0.        ]
 [0.         0.41666667 0.01694915 0.        ]
 [0.41666667 0.83333333 0.03389831 0.04166667]
 [0.38888889 1.         0.08474576 0.125     ]
 [0.30555556 0.79166667 0.05084746 0.125     ]
 [0.22222222 0.625      0.06779661 0.08333333]
 [0.38888889 0.75       0.11864407 0.08333333]
 [0.22222222 0.75       0.08474576 0.08333333]
 [0.30555556 0.58333333 0.11864407 0.04166667]
 [0.22222222 
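
The formula above divides by `max - min`, which is zero for any column whose values are all identical. The sketch below wraps the same calculation in a function and guards against that case. The function name `min_max_scale` and the choice to map constant columns to 0 are assumptions made for illustration, not part of the original notes.

```python
import numpy as np

def min_max_scale(data):
    """Scale each column of a 2-D array to [0, 1]; constant columns become 0."""
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    # Assumption: replace a zero range with 1 so constant columns map to 0
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (data - col_min) / safe_range

# Small made-up array; the third column is constant
demo = np.array([[1.0, 10.0, 5.0],
                 [2.0, 20.0, 5.0],
                 [4.0, 40.0, 5.0]])

scaled = min_max_scale(demo)
print(scaled)
```

The first two columns end up with identical scaled values even though their original ranges differ by a factor of ten, which is the point of treating each feature separately.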

[<p style="text-align: right;">**⬆ Top of Page ⬆**</p>](#Data-Preprocessing)

### Data Normalisation:

**`normalisedVar = (npArrayVar-npArrayVar.mean(axis=0))/npArrayVar.std(axis=0)`**  
* Another way to **scale a dataset** (often called standardisation); the purpose is to make the **mean of each feature equal (or very close to) 0** and its **standard deviation equal 1**  
* Calculated by **subtracting** the **mean from** your **data and** by **dividing** the **result by** the **standard deviation**  
* **`axis=0`** restricts calculations to columns of data to **treat each feature separately**  

In [3]:
print("Mean:\n", irisArray.mean(axis=0))
print("Standard deviation:\n", irisArray.std(axis=0))

iris_norm = (irisArray-irisArray.mean(axis=0))/irisArray.std(axis=0)

print()
print("Scaled mean:\n", iris_norm.mean(axis=0))
print("Scaled standard deviation:\n", iris_norm.std(axis=0))

Mean:
 [5.84333333 3.054      3.75866667 1.19866667]
Standard deviation:
 [0.82530129 0.43214658 1.75852918 0.76061262]

Scaled mean:
 [-1.69031455e-15 -1.63702385e-15 -1.48251781e-15 -1.62314606e-15]
Scaled standard deviation:
 [1. 1. 1. 1.]
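
The scaled means printed above are around `1e-15` rather than exactly 0, which is floating-point rounding and not an error in the method. The sketch below repeats the calculation on a small made-up array and also shows that `std` divides by N by default (`ddof=0`); passing `ddof=1` gives the sample standard deviation instead. Which of the two is appropriate depends on the application.

```python
import numpy as np

# Small made-up array with two features on very different scales
demo = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0],
                 [4.0, 400.0]])

mean = demo.mean(axis=0)
std_pop = demo.std(axis=0)               # ddof=0 (NumPy default): divide by N
std_sample = demo.std(axis=0, ddof=1)    # ddof=1: divide by N - 1

normalised = (demo - mean) / std_pop

print("Normalised mean:", normalised.mean(axis=0))
print("Normalised std: ", normalised.std(axis=0))
print("Population std: ", std_pop)
print("Sample std:     ", std_sample)
```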


[<p style="text-align: right;">**⬆ Top of Page ⬆**</p>](#Data-Preprocessing)

### Ordinal Encoding:

Most machine learning algorithms cannot handle categorical (text) data directly, and it cannot be scaled as above  
Instead, assign a number to each category and build an array of those numbers as follows:  

**`categoriesArray = np.unique(npArrayCategoricalVar)`**$\;\;\;\;$**Can be done from dataset or array**  
**`dictCat = {}`**  

**`for i,cat in enumerate(categoriesArray):`**  
$\;\;\;\;$**`dictCat[cat] = i`**  

**`ordinalArray = np.zeros((npArrayCategoricalVar.size,),dtype=np.int32)`**  

**`for i in range(ordinalArray.size):`**  
$\;\;\;\;$**`ordinalArray[i] = dictCat[npArrayCategoricalVar[i]]`**  
* Returns an array where each category has been replaced with a unique number  

In [4]:
# Converting to ordinal encoding

categories = np.unique(labelsArray)      # gets unique elements
dict_cat = {}

for i,cat in enumerate(categories):     # adds each unique element to dictionary as key with enumerate value as value
    dict_cat[cat] = i
    
print(dict_cat)
print()

labels_ord = np.zeros((labelsArray.size,),dtype=np.int32)     # creates array of zeros of same size as labelsArray

for i in range(labels_ord.size):
    labels_ord[i] = dict_cat[labelsArray[i]]
    
print(labels_ord)

{'setosa': 0, 'versicolor': 1, 'virginica': 2}

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
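
The dictionary and loop above can also be expressed in a single call: `np.unique` accepts `return_inverse=True` and then returns, alongside the sorted categories, the index of each element's category. A minimal sketch on a short made-up label array:

```python
import numpy as np

# Made-up labels, deliberately out of order
labels = np.array(["setosa", "virginica", "versicolor", "setosa"])

# categories: sorted unique values; ordinal: position of each label in categories
categories, ordinal = np.unique(labels, return_inverse=True)

print(categories)
print(ordinal)

# Indexing the categories with the ordinal codes recovers the original labels
decoded = categories[ordinal]
print(decoded)
```

Because `np.unique` sorts the categories, the numbering matches the dictionary approach, which enumerates the same sorted array.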


[<p style="text-align: right;">**⬆ Top of Page ⬆**</p>](#Data-Preprocessing)

### One-Hot Encoding:

**`oneHotArrayVar = np.eye(categoriesArrayVar.size)[ordinalArrayVar]`**  
* Converts ordinal encoding to a **group of 'bits'** where each bit relates to a category  
* Makes it **easier to concatenate the encoded labels with the original array**, as the result is already two-dimensional (see below)  
* Each column represents a category; each row has a 1 in the column for its category and 0 in the rest, so every row is a binary encoding of one label  

In [5]:
# One hot encoding

labelsOneHot = np.eye(categories.size)[labels_ord]

print(labelsOneHot)

[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0.
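
The indexing trick works because row `k` of the identity matrix is exactly the one-hot vector for category `k`, so indexing `np.eye(n)` with an array of ordinal codes picks one such row per label. A small sketch with made-up codes, including the reverse step with `argmax`:

```python
import numpy as np

# Made-up ordinal codes for four samples and three categories
ordinal = np.array([0, 2, 1, 0])
n_categories = 3

# Each code selects the matching row of the 3x3 identity matrix
one_hot = np.eye(n_categories)[ordinal]
print(one_hot)

# The position of the 1 in each row gives back the ordinal code
recovered = one_hot.argmax(axis=1)
print(recovered)
```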

[<p style="text-align: right;">**⬆ Top of Page ⬆**</p>](#Data-Preprocessing)

### Concatenate Arrays:

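**`concatenatedArray = np.concatenate((firstArray, secondArray), axis=0)`**  
* The arrays are passed together as a **single tuple or list**, not as separate arguments  
* **`axis`** **defaults to 0**, which joins along the first axis (working down the columns, so the second array's rows are appended below the first); set it to **1** to join along the second axis (working across the rows, so columns are appended side by side); set it to **None** to **flatten the arrays before joining**  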
* The arrays must have the same shape, except in the dimension corresponding to **`axis`** (the first, by default)  

In [6]:
# Concatenate arrays

arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])

arrCol = np.concatenate((arr1, arr2), axis=0) # axis defaults to 0 when not provided
arrRow = np.concatenate((arr1, arr2), axis=1)
arrNone = np.concatenate((arr1, arr2), axis=None)

print("axis=0 (columns):\n", arrCol)
print()
print("axis=1 (rows):\n", arrRow)
print()
print("axis=None (flattened):\n", arrNone)

axis=0 (columns):
 [[1 2]
 [3 4]
 [5 6]
 [7 8]]

axis=1 (rows):
 [[1 2 5 6]
 [3 4 7 8]]

axis=None (flattened):
 [1 2 3 4 5 6 7 8]
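
The shape rule can be checked on arrays whose shapes differ in only one dimension. In the sketch below (made-up arrays of zeros and ones), joining along `axis=0` succeeds because both arrays have three columns, while joining along `axis=1` raises a `ValueError` because the row counts differ.

```python
import numpy as np

a = np.zeros((2, 3))
b = np.ones((4, 3))   # same number of columns, different number of rows

# Allowed: shapes only differ in the dimension being joined (axis 0)
stacked = np.concatenate((a, b), axis=0)
print(stacked.shape)  # (6, 3)

# Not allowed: axis 1 requires the same number of rows in both arrays
axis1_failed = False
try:
    np.concatenate((a, b), axis=1)
except ValueError as err:
    axis1_failed = True
    print("axis=1 failed:", err)
```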


In [7]:
# concatenating the ordinal encoded array requires an axis to be added so it has the same number of dimensions as the original array

newIris = np.concatenate([irisArray,labels_ord[:,np.newaxis]],axis=1)

print(newIris)

[[5.1 3.5 1.4 0.2 0. ]
 [4.9 3.  1.4 0.2 0. ]
 [4.7 3.2 1.3 0.2 0. ]
 [4.6 3.1 1.5 0.2 0. ]
 [5.  3.6 1.4 0.2 0. ]
 [5.4 3.9 1.7 0.4 0. ]
 [4.6 3.4 1.4 0.3 0. ]
 [5.  3.4 1.5 0.2 0. ]
 [4.4 2.9 1.4 0.2 0. ]
 [4.9 3.1 1.5 0.1 0. ]
 [5.4 3.7 1.5 0.2 0. ]
 [4.8 3.4 1.6 0.2 0. ]
 [4.8 3.  1.4 0.1 0. ]
 [4.3 3.  1.1 0.1 0. ]
 [5.8 4.  1.2 0.2 0. ]
 [5.7 4.4 1.5 0.4 0. ]
 [5.4 3.9 1.3 0.4 0. ]
 [5.1 3.5 1.4 0.3 0. ]
 [5.7 3.8 1.7 0.3 0. ]
 [5.1 3.8 1.5 0.3 0. ]
 [5.4 3.4 1.7 0.2 0. ]
 [5.1 3.7 1.5 0.4 0. ]
 [4.6 3.6 1.  0.2 0. ]
 [5.1 3.3 1.7 0.5 0. ]
 [4.8 3.4 1.9 0.2 0. ]
 [5.  3.  1.6 0.2 0. ]
 [5.  3.4 1.6 0.4 0. ]
 [5.2 3.5 1.5 0.2 0. ]
 [5.2 3.4 1.4 0.2 0. ]
 [4.7 3.2 1.6 0.2 0. ]
 [4.8 3.1 1.6 0.2 0. ]
 [5.4 3.4 1.5 0.4 0. ]
 [5.2 4.1 1.5 0.1 0. ]
 [5.5 4.2 1.4 0.2 0. ]
 [4.9 3.1 1.5 0.1 0. ]
 [5.  3.2 1.2 0.2 0. ]
 [5.5 3.5 1.3 0.2 0. ]
 [4.9 3.1 1.5 0.1 0. ]
 [4.4 3.  1.3 0.2 0. ]
 [5.1 3.4 1.5 0.2 0. ]
 [5.  3.5 1.3 0.3 0. ]
 [4.5 2.3 1.3 0.3 0. ]
 [4.4 3.2 1.3 0.2 0. ]
 [5.  3.5 1
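
The new axis is needed because the ordinal array is one-dimensional, with shape `(150,)`, while the data is two-dimensional. The sketch below uses small made-up arrays to show the shape change, along with two equivalent spellings: `reshape(-1, 1)` and `np.column_stack`, which turns one-dimensional inputs into columns itself. The integer codes are converted to floats in the result, which is why they print as `0.` above.

```python
import numpy as np

# Made-up stand-ins for the feature matrix and the ordinal labels
features = np.array([[5.1, 3.5],
                     [7.0, 3.2],
                     [6.3, 3.3]])
ordinal = np.array([0, 1, 2])

print(ordinal.shape)                 # (3,)
print(ordinal[:, np.newaxis].shape)  # (3, 1)

# Three equivalent ways to append the codes as an extra column
with_newaxis = np.concatenate([features, ordinal[:, np.newaxis]], axis=1)
with_reshape = np.concatenate([features, ordinal.reshape(-1, 1)], axis=1)
with_column_stack = np.column_stack((features, ordinal))

print(with_newaxis)
```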

In [8]:
# Easier to concatenate one-hot encoded array with original as no new axis required

newIris = np.concatenate([irisArray,labelsOneHot],axis=1)

print(newIris)

[[5.1 3.5 1.4 ... 1.  0.  0. ]
 [4.9 3.  1.4 ... 1.  0.  0. ]
 [4.7 3.2 1.3 ... 1.  0.  0. ]
 ...
 [6.5 3.  5.2 ... 0.  0.  1. ]
 [6.2 3.4 5.4 ... 0.  0.  1. ]
 [5.9 3.  5.1 ... 0.  0.  1. ]]


[<p style="text-align: right;">**⬆ Top of Page ⬆**</p>](#Data-Preprocessing)

---