# Linear Algebra: Cosine Similarity of CPS High School Demographics

Example applying linear algebra to calculate similarity between entities. Here, we'll use the [Chicago Public Schools School Profile data for 2017-2018](https://data.cityofchicago.org/Education/Chicago-Public-Schools-School-Profile-Information-/w4qj-h7bg). 

Using these data, we can try to determine which high schools are most similar to each other across a set of demographic variables recently published by the City of Chicago.

In [357]:
import numpy as np
import pandas as pd

df = pd.read_csv('cps_profiles_public_1718.csv')

cols = ['Long_Name', 'Student_Count_Total', 'Student_Count_Low_Income',
        'Student_Count_English_Learners', 'Student_Count_Special_Ed',
        'Student_Count_White', 'Student_Count_Black', 'Student_Count_Hispanic',
        'College_Enrollment_Rate_School']

# keep high schools only, drop rows with any missing value, index by school name
cps_df = (df.loc[df['Primary_Category'] == 'HS', cols]
          .dropna(axis=0)
          .set_index('Long_Name'))


This dataset has information about each CPS school. We'll focus on a smaller subset of variables on which to compare CPS high schools. These variables largely reflect the demographic characteristics of the students at each school.

* Student_Count_Total
* Student_Count_Low_Income
* Student_Count_English_Learners
* Student_Count_Special_Ed
* Student_Count_White
* Student_Count_Black
* Student_Count_Hispanic
* College_Enrollment_Rate_School

In [358]:
cps_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 161 entries, TEAM Englewood Community Academy High School to Hyman G Rickover Naval Academy High School
Data columns (total 8 columns):
Student_Count_Total               161 non-null int64
Student_Count_Low_Income          161 non-null int64
Student_Count_English_Learners    161 non-null int64
Student_Count_Special_Ed          161 non-null int64
Student_Count_White               161 non-null int64
Student_Count_Black               161 non-null int64
Student_Count_Hispanic            161 non-null int64
College_Enrollment_Rate_School    161 non-null float64
dtypes: float64(1), int64(7)
memory usage: 11.3+ KB


## Algorithm Goals

I will write a cosine similarity algorithm. Given two n-dimensional vectors, it computes the cosine of the angle between them. Applied across the whole dataset, it returns the top N vectors pointing in the direction closest to a given vector. Since we're working in positive space, where no element of any vector is < 0, the cosine is bounded to [0, 1]. Cosine = 0 means the vectors are perpendicular, representing maximum independence between them. Cosine = 1 means a 0 degree angle, representing an identical orientation or direction of the compared vectors. Cosine assesses only directional relationships and disregards magnitudes. In other words, we're measuring resemblance, not absolute distance between data points.

Cosine similarity is useful when you care more about how closely the directions of two vectors align than about how close they are to each other in space.

This metric also works well on sparse vectors where 0's or missing information may be present. 

If [the distance between vectors, or their magnitude, is what matters most](https://www.quora.com/Is-cosine-similarity-effective), then a measure like Euclidean distance may be more appropriate.
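
To make these properties concrete, here is a minimal sketch in base Python. The helper name `cosine()` and the toy vectors are made up for illustration and are not part of the analysis that follows.

```python
import math

def cosine(v1, v2):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

perpendicular = cosine([1, 0], [0, 1])          # no shared direction
scaled = cosine([1, 2, 3], [10, 20, 30])        # same direction, 10x the magnitude
euclidean = math.dist([1, 2, 3], [10, 20, 30])  # Euclidean distance does see the scaling

print(perpendicular, round(scaled, 6), round(euclidean, 2))
```

The perpendicular pair scores 0 and the scaled pair scores 1 (up to floating point error), even though the scaled pair is far apart in Euclidean terms.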

## How to Solve it?

1. write a function `cos_similarity(v1, v2)`
    * v1 is an N-dimensional vector
    * v2 is also an N-dimensional vector to be compared to v1.


2. write a function, `cos_top_n()`, that iterates `cos_similarity()` so each vector is compared to every other vector:
    * calculate the cosine similarity of v1 to every vector other than itself.
    * track v1's cosine similarity to all other vectors and the index of all other vectors.
    * create dict pairs of index: cosine similarity for the Top N similar vectors to v1
    * repeat the loop for each remaining vector, v2 ... v1000


3. capture the output as a dictionary

In [359]:
# One way to write this using only base Python.
def cos_similarity(v1, v2):
    """For 2 non-null vectors, v1 and v2, measure their cosine similarity,
    which is the ratio of their dot product to the product of their lengths.
    
    (v1 dot v2) / (||v1|| * ||v2||)
    
    notes:
      v1 dot v2 = sum(v1[i] * v2[i])
         ||v1|| = sqrt(v1 dot v1)
         ||v2|| = sqrt(v2 dot v2)
    """
    
    dot_product, v1_length_sq, v2_length_sq = 0, 0, 0
    
    for element in range(len(v1)):
        dot_product += v1[element] * v2[element]
        # get the squared vector length of each v1, v2.
        v1_length_sq += v1[element] * v1[element]
        v2_length_sq += v2[element] * v2[element] 
    
    length_product = (v1_length_sq ** (1/2)) * (v2_length_sq ** (1/2))
    
    return dot_product / length_product



# Alternate way to write this using numpy vectorized arrays.
def cos_similarity_np(v1, v2):
    """vectorized cosine similarity calculation"""
    
    return np.dot(v1, v2) / np.sqrt(np.dot(v1, v1) * np.dot(v2, v2))

### Let's check our work so far. . .
To see if this works correctly, let's compare our `cos_similarity()` function to a prebuilt one from another library.

In [360]:
# generate some fake data (np.matrix is deprecated, so pass the array directly)
rows = 10
cols = 5
df = pd.DataFrame(np.random.randint(0, 10, size=(rows, cols)),
                  index=['exam' + str(i) for i in range(1, rows + 1)],
                  columns=['rule' + str(i) for i in range(1, cols + 1)])
d1 = df.to_dict('split')
d1['data']


[[2, 7, 9, 5, 6],
 [1, 7, 1, 6, 7],
 [5, 8, 8, 8, 6],
 [2, 8, 6, 1, 8],
 [0, 6, 6, 4, 1],
 [5, 8, 7, 1, 4],
 [9, 2, 5, 5, 4],
 [3, 2, 9, 2, 2],
 [0, 9, 7, 5, 9],
 [0, 6, 6, 3, 1]]

Let's compare the similarity of the first vector to the last vector using our `cos_similarity()` function.

In [361]:
cos_similarity(d1['data'][0], d1['data'][-1])


0.9252554031053316
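
As a sanity check, the same value can be worked out by hand from the two vectors printed above, $v_1 = (2, 7, 9, 5, 6)$ and $v_2 = (0, 6, 6, 3, 1)$:

$$
\begin{aligned}
v_1 \cdot v_2 &= 2 \cdot 0 + 7 \cdot 6 + 9 \cdot 6 + 5 \cdot 3 + 6 \cdot 1 = 117 \\
\lVert v_1 \rVert^2 &= 4 + 49 + 81 + 25 + 36 = 195 \\
\lVert v_2 \rVert^2 &= 0 + 36 + 36 + 9 + 1 = 82 \\
\cos\theta &= \frac{117}{\sqrt{195 \cdot 82}} = \frac{117}{\sqrt{15990}} \approx 0.9253
\end{aligned}
$$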

Let's compare this using the `numpy` based function we created, `cos_similarity_np()`.

In [362]:
cos_similarity_np(d1['data'][0], d1['data'][-1])

0.9252554031053316

How does this compare to `sklearn`'s `cosine_similarity()` function? It gives the same result.

In [363]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([d1['data'][0]], [d1['data'][-1]])

array([[0.9252554]])

### We're on the right track

Now let's take our `cos_similarity()` function, which can compare any two N-dimensional vectors, and iterate those comparisons to complete steps **#2** and **#3** in our roadmap above. We'll do this with a wrapper function called `cos_top_n()`. The code below calls the `numpy` version, `cos_similarity_np()`, which returns the same values.

In [364]:
def cos_top_n(df, most_similar=5):
    """For each row in a dataframe (df), calculate the cosine similarity of all
    other rows against this row. Then return the top `most_similar` (default 5)
    most similar row vectors as name:value pairs, stored as the string form of
    a dict. This ASSUMES ALL df columns are vector elements worth comparing and
    the df's index is a meaningful observation label worth tracking."""
    
    df_parsed = df.to_dict('split')
    data = df_parsed['data']
    index = df_parsed['index']
    
    results_dict = {}
    
    for v in range(len(data)):
        comps = {}
        
        for comp_v in range(len(data)):
            if v == comp_v:
                continue  # skip self-comparison.
            comps[index[comp_v]] = round(cos_similarity_np(data[v], data[comp_v]), 4)
        
        # after v is compared to all other v's, keep the N most similar vectors.
        # sorted() yields a list of (name, similarity) tuples. https://bit.ly/2EYsfge
        results_dict[index[v]] = str(dict(sorted(comps.items(),  # key:val tuple pairs
                                                 key=lambda d: d[1],  # sort by val
                                                 reverse=True)[:most_similar]))
    
    return results_dict
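
The nested loop above is fine for a few hundred rows. For larger inputs, one possible alternative is to normalize every row to unit length and take a single matrix product. The sketch below is a hypothetical variant, not the function used in this notebook: the name `cos_top_n_vectorized()` and the toy dataframe are made up, it assumes no row is all zeros, and it returns plain dicts rather than strings.

```python
import numpy as np
import pandas as pd

def cos_top_n_vectorized(df, most_similar=5):
    """Hypothetical vectorized variant: top-N cosine matches for every row."""
    values = df.to_numpy(dtype=float)
    # scale each row to unit length so a dot product equals the cosine
    unit = values / np.linalg.norm(values, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # push self-comparisons to the bottom
    results = {}
    for i, name in enumerate(df.index):
        top = np.argsort(sims[i])[::-1][:most_similar]  # highest cosine first
        results[name] = {df.index[j]: round(float(sims[i, j]), 4) for j in top}
    return results

# toy data, for illustration only
toy = pd.DataFrame([[1, 0], [2, 0.1], [0, 1], [1, 1]],
                   index=['a', 'b', 'c', 'd'], columns=['x', 'y'])
top = cos_top_n_vectorized(toy, most_similar=2)
print(top['a'])
```

Row `a` points almost exactly the same way as row `b`, so `b` comes first in its list, followed by `d`.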


## Let's run the similarity analysis

In [365]:
cps_similarity = cos_top_n(cps_df, 5)

Grab a particular school to view its top 5 most similar CPS high schools. For Instituto Health Sciences Career Academy, it's interesting that the top match is another Instituto school, since the variables being compared are only demographic summary information. The similarity is likely very high because Instituto tends to serve lower-income, Latinx communities.

In [366]:
cps_similarity['Instituto Health Sciences Career Academy']

"{'Instituto - Justice Lozano': 0.9989, 'ASPIRA Charter School - Early College High School': 0.998, 'World Language Academy High School': 0.9979, 'YCCS-ASPIRA,Antonia Pantoja Alternative HS': 0.9978, 'Gurdon S Hubbard High School': 0.9965}"

Below is the raw data for our school of interest and its top 5 matching schools. You'll see that some of the most similar schools have drastically different absolute counts. This is because *cosine similarity* measures the direction of the vectors. The distances between the vectors are not taken into account.

So we could say that these schools have a similar demographic makeup, although they may differ in characteristics reflecting magnitude or size, like student population count. Which type of similarity to use is very domain dependent!

In [367]:
cps_df.loc[['Instituto Health Sciences Career Academy',
            'Instituto - Justice Lozano',
            'ASPIRA Charter School - Early College High School',
            'World Language Academy High School',
            'YCCS-ASPIRA,Antonia Pantoja Alternative HS',
            'Gurdon S Hubbard High School']]

| Long_Name | Student_Count_Total | Student_Count_Low_Income | Student_Count_English_Learners | Student_Count_Special_Ed | Student_Count_White | Student_Count_Black | Student_Count_Hispanic | College_Enrollment_Rate_School |
|---|---|---|---|---|---|---|---|---|
| Instituto Health Sciences Career Academy | 745 | 693 | 213 | 146 | 2 | 14 | 726 | 60.4 |
| Instituto - Justice Lozano | 89 | 85 | 21 | 13 | 1 | 0 | 87 | 9.1 |
| ASPIRA Charter School - Early College High School | 340 | 316 | 108 | 86 | 5 | 7 | 323 | 55.3 |
| World Language Academy High School | 355 | 333 | 103 | 52 | 3 | 17 | 330 | 58.4 |
| YCCS-ASPIRA,Antonia Pantoja Alternative HS | 152 | 122 | 36 | 30 | 3 | 5 | 142 | 15.7 |
| Gurdon S Hubbard High School | 1705 | 1463 | 297 | 253 | 31 | 112 | 1547 | 57.4 |
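
To see why two schools of such different sizes still score so high, here is a small sketch using the rows for the two Instituto schools, copied by hand from the table above. The helper `cosine()` and the variable names are made up for illustration.

```python
import math

# values copied from the table above, in cps_df column order
health_sciences = [745, 693, 213, 146, 2, 14, 726, 60.4]
justice_lozano = [89, 85, 21, 13, 1, 0, 87, 9.1]

def cosine(v1, v2):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / math.sqrt(sum(a * a for a in v1) * sum(b * b for b in v2))

cos = cosine(health_sciences, justice_lozano)
distance = math.dist(health_sciences, justice_lozano)

# share of total enrollment (index 0) that is low income (index 1) or Hispanic (index 6)
low_income_shares = (health_sciences[1] / health_sciences[0],
                     justice_lozano[1] / justice_lozano[0])
hispanic_shares = (health_sciences[6] / health_sciences[0],
                   justice_lozano[6] / justice_lozano[0])

print(round(cos, 4), round(distance, 1))
print([round(s, 3) for s in low_income_shares], [round(s, 3) for s in hispanic_shares])
```

The two vectors are far apart in Euclidean terms because one school is roughly eight times larger, but the shares of low-income and Hispanic students are nearly the same, which is the pattern cosine similarity picks up.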


## Add the similar schools dict string as a new dataframe column

In [368]:
similarity_df = pd.DataFrame.from_dict(cps_similarity, 
                                       orient='index', 
                                       columns=['similar_schools'])

assert similarity_df.shape[0] == cps_df.shape[0] # check row counts same

Join the similarity output from *similarity_df* to the original *cps_df* as a new column. Let's take a look at another school, Lane Tech. 

In [369]:
cps_df_updated = pd.merge(cps_df, similarity_df, left_index=True, right_index=True, how='inner')
cps_df_updated.loc['Albert G Lane Technical High School']

Student_Count_Total                                                            4514
Student_Count_Low_Income                                                       1806
Student_Count_English_Learners                                                   34
Student_Count_Special_Ed                                                        200
Student_Count_White                                                            1598
Student_Count_Black                                                             350
Student_Count_Hispanic                                                         1805
College_Enrollment_Rate_School                                                 86.9
similar_schools                   {'William Jones College Preparatory High Schoo...
Name: Albert G Lane Technical High School, dtype: object
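
Note that each `similar_schools` value is the `str()` of a Python dict, which uses single quotes and so is not strictly valid JSON. If the matches are needed as a dict again, one option is `ast.literal_eval()` from the standard library. The sketch below uses the string returned earlier for Instituto Health Sciences Career Academy. The variable names are made up for illustration.

```python
import ast
import json

# string copied from the cps_similarity output shown earlier
raw = ("{'Instituto - Justice Lozano': 0.9989, "
       "'ASPIRA Charter School - Early College High School': 0.998, "
       "'World Language Academy High School': 0.9979, "
       "'YCCS-ASPIRA,Antonia Pantoja Alternative HS': 0.9978, "
       "'Gurdon S Hubbard High School': 0.9965}")

try:
    json.loads(raw)
    is_json = True
except json.JSONDecodeError:
    is_json = False  # single-quoted keys are rejected by the JSON parser

similar = ast.literal_eval(raw)             # safely rebuild the dict from its repr
best_match = max(similar, key=similar.get)  # school with the highest cosine
as_json = json.dumps(similar)               # a real JSON string, if one is needed

print(is_json, best_match)
```

Storing `json.dumps()` output instead of `str()` in `cos_top_n()` would be one way to make the column match the "JSON" label, at the cost of changing the stored format.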

In [370]:
cps_df_updated.loc['Whitney M Young Magnet High School', 'similar_schools']

"{'Lincoln Park High School': 0.9981, 'William Jones College Preparatory High School': 0.9929, 'Northside College Preparatory High School': 0.9925, 'Albert G Lane Technical High School': 0.9881, 'Chicago High School for the Arts (ChiArts)': 0.9855}"