# CSc 74020 Machine Learning Week 1

## 1. <font color='blue'>Introduction</font>

### <font color='blue'>Example Types of Machine Learning</font>
Most machine learning tasks fall into one of the following categories:
- **Supervised learning**: learn from labeled examples to predict a specific expected output
    - Regression: output is numerical
    - Classification: output is categorical
- **Unsupervised learning**: find structure in unlabeled data; no specific output is expected
    - Clustering: split data into groups by similarity
    - Pattern mining: discover patterns in datasets
- **Reinforcement learning**: take actions to maximize cumulative reward

The techniques used to solve these tasks are often closely related. Note that terms such as semi-supervised and self-supervised learning have become popular in recent years.

## 2. <font color='blue'> Partitioning Methods </font>
Discussed in class

## 3. <font color='blue'> Example Distance Measures </font>

In your first assignment you will implement the following distances, which we went over in class:
- Euclidean
- Manhattan
- Minkowski
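
For reference, the standard textbook definitions of these three distances are sketched below. The notation ($x, y \in \mathbb{R}^n$, components $x_i$, order $p$) is assumed here and may differ from the notation used in class.

$$d_p(x,y) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}, \quad p \ge 1 \quad \text{(Minkowski)}$$

$$d_1(x,y) = \sum_{i=1}^{n} |x_i - y_i| \quad \text{(Manhattan, } p = 1\text{)}$$

$$d_2(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \quad \text{(Euclidean, } p = 2\text{)}$$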

Below are implementations of the Jaccard and cosine distances (each defined as 1 - similarity).

In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### <font color='blue'>Jaccard Distance</font>
$$J(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|}$$

In [22]:
def jaccard_dist(A,B):
    """
    Arguments:
    A, B : list, set,  or array of elements to calculate distance on
    Returns: Jaccard distance between A and B (viewed as sets)
    """
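    set_A = set(A)
    set_B = set(B)
    union = set_A | set_B
    if not union:
        # Two empty sets are identical; treating their distance as 0 is a convention.
        return 0.0
    return 1. - len(set_A & set_B) / len(union)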

In [23]:
A = "I ran to class and I made it just in time."
B = "You walked to class and you were late."
print(A)
print(B)
jaccard_dist(A.split(),B.split())

I ran to class and I made it just in time.
You walked to class and you were late.


0.8
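
The result above depends on how the sentences are tokenized: with a plain `split()`, "You" and "you" count as different elements, and "time." keeps its period. The sketch below shows one possible normalization (lowercasing and stripping surrounding punctuation). The helper names `normalize_tokens` and `jaccard_dist_sets` are hypothetical and not part of the course code.

```python
import string

def normalize_tokens(text):
    """Hypothetical helper: lowercase, split on whitespace, strip surrounding punctuation."""
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    return {tok for tok in tokens if tok}

def jaccard_dist_sets(set_a, set_b):
    """Jaccard distance between two sets; two empty sets get distance 0 by convention."""
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)

A = "I ran to class and I made it just in time."
B = "You walked to class and you were late."

# Raw whitespace tokens, as in the cell above.
raw_dist = jaccard_dist_sets(set(A.split()), set(B.split()))
# Normalized tokens: "You"/"you" merge and trailing periods are removed.
norm_dist = jaccard_dist_sets(normalize_tokens(A), normalize_tokens(B))

print(raw_dist, norm_dist)
```

With normalization the union shrinks from 15 to 14 elements while the intersection stays at 3, so the distance drops slightly. Whether that is desirable depends on the application.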

### <font color='blue'>Cosine Distance</font>
$$D_C(x,y) = 1 - \frac{x \cdot y}{\|x\| \, \|y\|}$$

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([A,B])
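# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorizer.get_feature_names_out().tolist()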

['and',
 'class',
 'in',
 'it',
 'just',
 'late',
 'made',
 'ran',
 'time',
 'to',
 'walked',
 'were',
 'you']

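Why is "i" not in the token list? The default `token_pattern` only keeps tokens of two or more alphanumeric characters, so single-letter words are dropped. BE CAREFUL USING DEFAULT PARAMETERS - ALWAYS CHECK!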
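
To see the effect of the token pattern in isolation, the sketch below applies two regular expressions directly with Python's `re` module. The first is the default `token_pattern` as given in the scikit-learn documentation; the second also keeps single-character tokens. This only imitates the tokenization step and is not how `CountVectorizer` is implemented internally.

```python
import re

A = "I ran to class and I made it just in time."

# Default CountVectorizer token pattern per the scikit-learn docs:
# a token is a run of 2 or more word characters.
default_pattern = r"(?u)\b\w\w+\b"
# Alternative pattern that also keeps single-character tokens.
single_char_pattern = r"(?u)\b\w+\b"

# CountVectorizer lowercases text by default, so lowercase here as well.
default_tokens = re.findall(default_pattern, A.lower())
all_tokens = re.findall(single_char_pattern, A.lower())

print(default_tokens)  # no "i"
print(all_tokens)      # "i" appears twice
```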

In [13]:
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
X = vectorizer.fit_transform([A,B])
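# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorizer.get_feature_names_out().tolist()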

['and',
 'class',
 'i',
 'in',
 'it',
 'just',
 'late',
 'made',
 'ran',
 'time',
 'to',
 'walked',
 'were',
 'you']

In [14]:
X=X.toarray()
X

array([[1, 1, 2, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 2]], dtype=int64)

In [15]:
def cosine_dist(x,y):
    """
    Arguments:
    x, y : array of numeric values to calculate distance on
    Returns: Cosine distance between x and y
    """
    return 1. - np.dot(x,y) / np.sqrt(np.dot(x,x)*np.dot(y,y))

In [16]:
cosine_dist(X[0],X[1])

0.7368825942078912
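
Two properties of cosine distance are worth checking numerically: it ignores vector length (scaling a count vector does not change the distance), and it is undefined when either vector is all zeros. The sketch below uses a hypothetical variant, `cosine_dist_safe`, that raises an error in the zero-vector case instead of returning `nan`. The count vectors are copied from the array shown above.

```python
import numpy as np

def cosine_dist_safe(x, y):
    """Hypothetical variant of cosine distance that rejects zero vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - np.dot(x, y) / denom

# Count vectors for sentences A and B, copied from the array above.
x = np.array([1, 1, 2, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0])
y = np.array([1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 2])

base = cosine_dist_safe(x, y)
scaled = cosine_dist_safe(5 * x, y)  # same direction, five times the length

print(base, scaled)
```

Because only the angle between the vectors matters, repeating a sentence five times in a document would leave its cosine distance to other documents unchanged, which is one reason this measure is common for text.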