# Vector Operations

I will be using the `cereal.csv` file, which was used in Chapter 3, as my data source. 


In [8]:
# import libraries
import pandas as pd
import numpy as np # this is used for numerical computing
import numpy.typing as npt # this is used for type annotation

In [2]:
# load data into dataframe
df = pd.read_csv('../datasets/cereal.csv')


Before I begin, I'd like to explain what vectors are.

A vector is an ordered list of numbers used to represent quantities or data mathematically.

Vectors are one of the fundamental ways machine learning represents and manipulates data. They are 1-dimensional structures.

Vectors are usually represented as 1-dimensional NumPy arrays. (_Note that this does not mean all arrays are vectors: an array can also have two or more dimensions._)
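To make the distinction concrete, here is a minimal sketch using made-up values (the names `vector` and `matrix` are only for illustration). It uses the `ndim` attribute to tell a 1-dimensional array, which can act as a vector, apart from a 2-dimensional one.

```python
import numpy as np

# a vector: a single axis, so ndim is 1
vector = np.array([1.0, 2.0, 3.0])

# a matrix: two axes, so it is an array but not a vector
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])

print(f"vector has {vector.ndim} dimension(s) and shape {vector.shape}")
print(f"matrix has {matrix.ndim} dimension(s) and shape {matrix.shape}")
```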

In this project, I will represent several columns as vectors and perform vector operations on them.

In [3]:
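# represent the 'sugars' and 'calories' columns as vectors
# (cast to float64 so the arrays match the NDArray[np.float64] type hints)
sugar_vector = np.array(df['sugars'], dtype=np.float64)
calories_vector = np.array(df['calories'], dtype=np.float64)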

# check that they are both arrays
print(f"Sugar Vector is a {type(sugar_vector)}")
print(f"Calories Vector is a {type(calories_vector)}")

Sugar Vector is a <class 'numpy.ndarray'>
Calories Vector is a <class 'numpy.ndarray'>


The first question I'll be asking is: "Do cereals that have more sugar also tend to have more calories?"

To answer this question, I will be using Cosine Similarity.
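For reference, this is the standard definition of cosine similarity between two vectors $A$ and $B$ of length $n$, where $\theta$ is the angle between them:

$$
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
$$

Here $\lVert \cdot \rVert$ is the Euclidean (L2) norm, which is what `np.linalg.norm` computes by default for a 1-dimensional array.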

In [11]:
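def cosine_similarity(A: npt.NDArray[np.float64], B: npt.NDArray[np.float64]) -> float:
    """Return the cosine of the angle between A and B, rounded to 3 decimals."""
    # compute dot product
    dot_prod = np.dot(A, B)

    # compute the Euclidean (L2) norm of each vector
    norm_A = np.linalg.norm(A)
    norm_B = np.linalg.norm(B)

    # find the product of the norms
    norm_product = norm_A * norm_B

    # compute cosine similarity (undefined if either vector is all zeros)
    cos_sim = np.round(dot_prod / norm_product, 3)

    # convert the NumPy scalar to a plain Python float
    return float(cos_sim)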



In [None]:
# print cosine similarity
cos_sim = cosine_similarity(sugar_vector, calories_vector)

print(f"The Cosine Similarity of the Sugar column and Calories column is {cos_sim}")

The Cosine Similarity of the Sugar column and Calories column is 0.883


_What does Cosine Similarity even mean?_

Cosine Similarity measures how similar two vectors are by calculating the cosine of the angle between them. It depends on direction rather than magnitude: a score of 1 means the vectors point in the same direction, 0 means they are orthogonal (perpendicular), and -1 means they point in opposite directions.
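The three reference scores can be checked on small made-up vectors. This is only a sketch: `cosine` is a hypothetical unrounded helper written for this illustration, and the vectors are not taken from `cereal.csv`.

```python
import numpy as np

def cosine(a, b):
    # hypothetical helper for this illustration: unrounded cosine of the angle
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0])

# a scaled copy points in the same direction, so magnitude does not matter
same_direction = cosine(a, 3 * a)

# [-2, 1] is perpendicular to [1, 2] because their dot product is 0
orthogonal = cosine(a, np.array([-2.0, 1.0]))

# the negated vector points in the opposite direction
opposite = cosine(a, -a)

print(round(same_direction, 3), round(orthogonal, 3), round(opposite, 3))
```

Scaling `a` by 3 leaves the score at 1, which is what "direction rather than magnitude" means in practice.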


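The Cosine Similarity of `0.883` means that the two vectors point in broadly similar directions.

This may suggest that cereals with higher sugar content generally have higher calorie values. Keep in mind, though, that when both vectors contain only non-negative values, the score always falls between 0 and 1, so a high value is weaker evidence than it may first appear.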


**NOTE:** 

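Cosine Similarity does not measure correlation. Cosine similarity measures the angle between the raw vectors, while the Pearson correlation coefficient measures the angle between the vectors after each one has had its mean subtracted. The two coincide when both vectors already have a mean of zero.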
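A small sketch makes the difference visible. The values below are made up for illustration and are not taken from `cereal.csv`, and `cosine` is a hypothetical unrounded helper. As `x` goes up, `y` goes down, yet both vectors are positive and of similar size.

```python
import numpy as np

def cosine(a, b):
    # hypothetical helper for this illustration: unrounded cosine of the angle
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# made-up values: y decreases exactly as x increases
x = np.array([10.0, 11.0, 12.0, 13.0])
y = np.array([13.0, 12.0, 11.0, 10.0])

# cosine similarity of the raw vectors
raw_cosine = cosine(x, y)

# cosine similarity after subtracting each vector's mean
centered_cosine = cosine(x - x.mean(), y - y.mean())

# Pearson correlation coefficient computed by NumPy
pearson = float(np.corrcoef(x, y)[0, 1])

print(f"raw cosine:      {raw_cosine:.3f}")
print(f"centered cosine: {centered_cosine:.3f}")
print(f"pearson:         {pearson:.3f}")
```

The raw cosine similarity comes out at about 0.98 even though the correlation is -1, and the centered cosine matches the Pearson coefficient. A high cosine similarity on its own therefore does not show that two columns rise together.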