
# Machine Learning: Representing Data

As we round out our discussion of representing data, it's important to consider not just dataframes but alternatives.

As we gear up for machine learning in the second half of the semester, we should note that we will be using *matrices* and *arrays* for most machine learning algorithms.

What's the difference between a matrix and an array?  For our purposes, a matrix is a mathematical structure with a series of well-defined operators (matrix multiply, determinant, inverse, etc.).  An array is a data structure, which can be used to encode a matrix.
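
To make the distinction concrete, here is a minimal sketch in which a NumPy array encodes a small matrix. The values are arbitrary and chosen only for illustration; the point is that the matrix operators named above are applied to the array through explicit calls.

```python
import numpy as np

# A 2x2 matrix encoded as a NumPy array (arbitrary illustrative values)
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

product = A @ A             # matrix multiplication
det = np.linalg.det(A)      # determinant: 2*3 - 1*1 = 5
A_inv = np.linalg.inv(A)    # inverse

# The array itself also supports element-wise arithmetic,
# which is a data-structure operation rather than a matrix operation
elementwise = A * A

print(product)
print(det)
print(A_inv @ A)            # approximately the identity matrix
```

Note that `A @ A` and `A * A` give different results: the first treats the array as a matrix, the second just multiplies matching entries.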

This notebook provides a review of some of the basics.


## 1.0 Basic Review

To this point, we've used **dataframes** to capture data.  Dataframes allow for many rows, each containing named fields with heterogeneous types (string, Python object, image, etc.).  Of course, we assume that each field generally has a value (recall how we did *data cleaning* for cases where values were missing).

For machine learning, we'll typically want to use **arrays** (matrices) to represent data.  Recall that arrays have **homogeneous** types, i.e., all values are of the same type.  Typically we will use either integer or floating-point values for the array.
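
As a quick illustration of homogeneous typing, the sketch below builds two small arrays from Python lists. The variable names are made up for this example.

```python
import numpy as np

ints = np.array([1, 2, 3])      # all integers: the array gets an integer dtype
mixed = np.array([1, 2.5, 3])   # integers mixed with a float: every value becomes a float

print(ints.dtype)    # an integer type (the exact width depends on the platform)
print(mixed.dtype)   # float64
print(mixed)
```

Because every element must share one type, NumPy converts the integers in `mixed` to floating point rather than storing two types side by side.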

We expect you to be generally familiar with Numpy arrays, but we will provide a brief refresher of a few key points...

Given data in a dataframe, we will typically convert it into a matrix as follows:

1. We will convert each dataframe row (*instance*) to a matrix row.
1. We will drop columns that contain data we don't expect to be useful as **features** (i.e., data that isn't predictive of the machine learning output).
1. We will convert the remaining columns to one or more columns in the matrix.

Often, step #3 involves **feature extraction** (perhaps extracting a particular piece of data out of a dataframe field), normalization, or encoding.
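
The three steps above might look roughly like the following sketch. The dataframe `people_df` and its columns are hypothetical, invented for illustration, and min-max scaling is just one possible normalization. The encoding uses Pandas' `get_dummies`, which this notebook covers in more detail later.

```python
import pandas as pd

# Hypothetical data, invented for illustration
people_df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy'],
    'height_cm': [150.0, 170.0, 190.0],
    'eye_color': ['brown', 'blue', 'brown'],
})

# Step 2: drop a column we don't expect to be predictive
features_df = people_df.drop(columns=['name'])

# Step 3 (normalization): rescale a numeric column to the range [0, 1]
h = features_df['height_cm']
features_df['height_cm'] = (h - h.min()) / (h.max() - h.min())

# Step 3 (encoding): turn a categorical column into 0/1 columns
features_df = pd.get_dummies(features_df, columns=['eye_color'], dtype=float)

# Step 1: each dataframe row becomes one matrix row
X = features_df.to_numpy()

print(features_df.columns.tolist())
print(X)
```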

## 1.1 Numpy Basics

Numpy is the standard array library for Python, playing a role analogous to that of Pandas for dataframes. It is a separate package rather than part of Python itself, and it provides a number of predefined operators.

In fact you'll see that notions like filtering and projection are implemented for Numpy arrays in a very similar way to Pandas.

*For those familiar with other programming languages: while Pandas borrows many ideas from the R language, Numpy borrows many ideas from Matlab.*
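
To illustrate the similarity in filtering and projection, here is a small side-by-side sketch. The array contents and the column names `a` and `b` are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 10],
                   [2, 20],
                   [3, 30]])
df = pd.DataFrame(values, columns=['a', 'b'])

# Filtering: keep the rows whose first column is greater than 1
filtered = values[values[:, 0] > 1]   # Numpy: Boolean mask over the rows
filtered_df = df[df['a'] > 1]         # Pandas: the same idea, by column name

# Projection: keep only the second column
projected = values[:, 1]              # Numpy: select by position
projected_df = df['b']                # Pandas: select by name

print(filtered)
print(projected)
```

The main difference is that Numpy selects columns by position, whereas Pandas usually selects them by name.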

In [None]:
# Basics of arrays: Numpy

import numpy as np

# We start with a simple array. Note that its values are uninitialized:
# they are whatever happened to be in memory, not true random numbers.

arr = np.empty((4,2))

# Show dimensions
arr.shape


(4, 2)

In [None]:
arr

array([[2.61130690e-316, 1.58101007e-322],
       [2.12199579e-314, 8.82664251e+169],
       [1.20098376e-071, 1.39727473e-075],
       [4.28263859e-037, 1.33511316e-306]])

In [None]:
# We can also initialize with zeros or ones...
arr2 = np.zeros((4,2))

arr2

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])
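
For completeness, a few other common ways to initialize an array are sketched below. The shapes and the fill value are arbitrary.

```python
import numpy as np

ones = np.ones((4, 2))                    # every entry is 1.0 (float64 by default)
sevens = np.full((4, 2), 7)               # every entry is a value of our choosing
from_list = np.array([[1, 2], [3, 4]])    # built from nested Python lists

print(ones)
print(sevens)
print(from_list)
```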

## 1.2 Encoding Categorical Data

While the arrays we use for machine learning are numeric, we often need to start with DataFrame data, which may be **categorical**.  For instance, we may have cities and states from addresses, and perhaps our machine learning algorithm is supposed to predict population from this information.

How do we go from these values to numbers?

In [None]:
# Here's our sample data

import pandas as pd

# Imagine this simple example were loaded from some larger
# CSV file or from the web.
addresses_df = pd.DataFrame([{'city': 'New York', 'state': 'NY'},
                             {'city': 'Los Angeles', 'state': 'CA'},
                             {'city': 'Chicago', 'state': 'IL'},
                             {'city': 'Houston', 'state': 'TX'},
                             {'city': 'Phoenix', 'state': 'AZ'},
                             {'city': 'Philadelphia', 'state': 'PA'},
                             {'city': 'San Antonio', 'state': 'TX'}])

addresses_df

| | city | state |
|---|---|---|
| 0 | New York | NY |
| 1 | Los Angeles | CA |
| 2 | Chicago | IL |
| 3 | Houston | TX |
| 4 | Phoenix | AZ |
| 5 | Philadelphia | PA |
| 6 | San Antonio | TX |


Suppose we want to turn this into machine learning features!  Let's focus on  `state`.  This is categorical and string-valued, so we may need to *one-hot encode* it.  Each unique string will receive its own column, and the column will be set to `0` if the value isn't present, or `1` if it is.  Only one column will be set for each row, hence "one-hot".

Doing this manually is both painful **and** "brittle" in that any new data values might require you to manually change the schema.
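
To see what the manual version involves, here is one possible sketch using plain Numpy. It is only an illustration of the idea, not a recommended approach; the state values are copied from the sample data, and the variable names are made up.

```python
import numpy as np

states = np.array(['NY', 'CA', 'IL', 'TX', 'AZ', 'PA', 'TX'])

# Sorted unique categories, plus each row's position within that list
categories, codes = np.unique(states, return_inverse=True)

# One row per instance and one column per category, with a single 1 per row
one_hot = np.zeros((len(states), len(categories)), dtype=int)
one_hot[np.arange(len(states)), codes] = 1

print(categories)
print(one_hot)
```

The brittleness shows up as soon as a new state appears: `categories` grows, so every column position after the new entry shifts.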

Fortunately we can automate.

Pandas' `get_dummies` function will take a DataFrame or Series, and produce a one-hot-encoded DataFrame useful for machine learning.

In [None]:
# Create a Pandas dataframe with Boolean values for matches to each categorical
# value. Observe this is a sparse bit-vector (the values are Booleans or bits,
# there is only 1 set bit in each row).
pd.get_dummies(addresses_df['state'])

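| | AZ | CA | IL | NY | PA | TX |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 1 |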


In [None]:
# The above was still a dataframe.  If we really want an array...
pd.get_dummies(addresses_df['state']).to_numpy()

array([[0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1]], dtype=uint8)
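
The `uint8` type in the output above seems to depend on the Pandas version: newer releases appear to produce Boolean columns by default. If a specific numeric type is wanted, one option is to request it explicitly through the `dtype` parameter of `get_dummies`, as in this small sketch (the three-state Series is made up for illustration).

```python
import pandas as pd

states = pd.Series(['NY', 'CA', 'TX'], name='state')

# Ask for floating-point 0.0/1.0 values instead of relying on the default type
encoded = pd.get_dummies(states, dtype=float).to_numpy()

print(encoded.dtype)
print(encoded)
```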

For many kinds of machine learning, each **row** represents an **instance** and its features.  It also has a **class**, which is what we ultimately want to learn to predict.

For training a machine learning model, we will provide the input features in a matrix $X$, and we'll put the classes in a separate vector $y$.

Let's see an example of how this works.

In [None]:
# A more realistic use case of slicing...  We often start with training
# data that has *labels*, i.e., in- or out-of-class

# Suppose we wanted addresses in states that have the letter 'A' in them...
addresses_df['has_a'] = addresses_df['state'].apply(lambda x: 'A' in x)
addresses_df

data = pd.get_dummies(addresses_df['state'])
data['class'] = addresses_df['has_a']

# This is what the training set might look like
# Note the last column is the label (it's Boolean for in-the-class or not)
data = data.to_numpy()
data

array([[0, 0, 0, 1, 0, 0, False],
       [0, 1, 0, 0, 0, 0, True],
       [0, 0, 1, 0, 0, 0, False],
       [0, 0, 0, 0, 0, 1, False],
       [1, 0, 0, 0, 0, 0, True],
       [0, 0, 0, 0, 1, 0, True],
       [0, 0, 0, 0, 0, 1, False]], dtype=object)

We'll often break this up, which is easy to do by slicing the array.

For each dimension, we can specify a "slice" like $a$:$b$ where we keep index positions $a$ through $b$ (non-inclusive on $b$).

A blank `:` keeps everything.  A negative index counts back from the end of that dimension: `-1` refers to the last position (one less than the size), so a slice ending at `-1` keeps everything except the last element.
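
Here is a small sketch of these slicing rules on a 3x4 array. The array `grid` is made up for illustration and simply holds the numbers 0 through 11.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
# grid is:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

first_two_rows = grid[0:2, :]   # rows 0 and 1 (row 2 is excluded), all columns
middle_cols = grid[:, 1:3]      # all rows, columns 1 and 2
last_col = grid[:, -1]          # all rows, last column only (a 1D result)
all_but_last = grid[:, :-1]     # all rows, every column except the last

print(first_two_rows)
print(middle_cols)
print(last_col)
print(all_but_last)
```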

In [None]:
# All rows, last column
y = data[:,-1]
# All rows, all but last column
X = data[:,0:-1]

print('Labels:')
print(y)

print('Training data:')
print(X)

Labels:
[False True False False True True False]
Training data:
[[0 0 0 1 0 0]
 [0 1 0 0 0 0]
 [0 0 1 0 0 0]
 [0 0 0 0 0 1]
 [1 0 0 0 0 0]
 [0 0 0 0 1 0]
 [0 0 0 0 0 1]]
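
One detail worth noticing: because the combined array mixed 0/1 features with a Boolean label, it was stored with `dtype=object`, and slices of it keep that type. A minimal sketch of one way to recover proper types after slicing is shown below. It uses a small stand-in array rather than the full data, and `astype` is just one possible approach.

```python
import numpy as np

# A small stand-in for a mixed array: two 0/1 features plus a Boolean label
data = np.array([[0, 1, False],
                 [1, 0, True]], dtype=object)

X = data[:, :-1].astype(float)   # numeric feature matrix
y = data[:, -1].astype(bool)     # Boolean label vector

print(X.dtype, y.dtype)
print(X)
print(y)
```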


## 1.3 Slicing Parts of an Array

In [None]:
# Slicing gives a subset of an array

# In each dimension you can give a left:right
# range. Ranges are left-inclusive and right exclusive
arr[1:3,1]

array([8.82664251e+169, 1.39727473e-075])

In [None]:
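# Negative indices count back from the end, so -2 is the second-to-last row.
# A slice that runs past the end is clipped: 1:3 yields only column 1 here.
arr[-2,1:3]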

array([1.71174922e-52])

## 1.4 Reshaping an Array

Sometimes you'll want to keep the data in an array, but change the number of dimensions.

A common use case for this is taking an image (like these Pandas!):
![Pandas](https://hips.hearstapps.com/wdy.h-cdn.co/assets/16/05/980x635/gallery-1454612525-baby-pandas.jpg?resize=768:*)

which is 2 or 3 dimensions, and converting it to a 1D row in a matrix so each pixel value is a feature.

Let's start with a simple example.


In [None]:
import numpy as np

# Build the 2x3 array directly from nested lists
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

arr

array([[1, 2, 3],
       [4, 5, 6]])

In [None]:
# Convert to 1D:

arr.reshape(6)

array([1, 2, 3, 4, 5, 6])

In [None]:
# Convert 2x3 to 3x2:

arr.reshape((3,2))

array([[1, 2],
       [3, 4],
       [5, 6]])

In [None]:
# Numpy can infer one dimension for you if you put in a -1:
# here, 6 values in 3 rows means 2 columns

arr.reshape((3,-1))

array([[1, 2],
       [3, 4],
       [5, 6]])
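
Returning to the image use case mentioned at the start of this section, the sketch below flattens images into rows of a feature matrix. The "images" here are hypothetical placeholders filled with zeros, and the sizes are arbitrary; real image data would be loaded from files.

```python
import numpy as np

# Hypothetical batch: 5 tiny RGB images, each 4 pixels high and 3 pixels wide
images = np.zeros((5, 4, 3, 3), dtype=np.uint8)

# One image (height x width x channels) flattened into a single 1D row
one_image = images[0]
row = one_image.reshape(-1)

# The whole batch as a matrix: one row per image, one column per pixel value
X = images.reshape(len(images), -1)

print(row.shape)   # (36,)
print(X.shape)     # (5, 36)
```

Using `-1` for the second dimension lets Numpy work out that each image contributes 4 * 3 * 3 = 36 features.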