# 🔎 Friend Finder
This notebook is designed for the Girl Scouts STEM Data Science Workshop to introduce 6th-12th graders to feature selection, cosine similarity, and data science in general.

### Instructors
- **Jason Osajima** - TODO: Jason's quick bio and origin story
- **Murphy Studebaker** is an instructor of computer science at Chapman University who previously worked as a Data Scientist for The Walt Disney Company. She was originally interested in Computer Science because of the creativity involved in building software and data science models.

## Introduction
This notebook walks you through building your first data science model. We will use it to find new friends based on how similar their interests are to yours, using the same kinds of techniques that apps like Netflix and Spotify use for recommendations.

You will not have to write any of your own code. You can run each code cell by hitting the play button in the left corner or by hitting the keys `Shift + Return`.

### Software Tools
- **Python** is a programming language known for being easy to learn and good at many different programming tasks, especially data science. The code examples in this notebook are all written in Python. **Pandas** and **scikit-learn** (imported as `sklearn`) are libraries we use with Python to easily work with data.
- **Colab** is Google's online editor for writing coding notebooks like the one you're currently using. You can save this notebook to your own account, so you can make edits or take notes and save your changes. Eventually you can make a new notebook with your own data science models!


## ✍️ Your Turn

Run the following code block to set up your notebook with everything we need for the workshop. You should see "SUCCESS!" printed out.

In [1]:
# this cell imports the required libraries for our workshop code to run
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
print("SUCCESS!")

SUCCESS!


## Features

A **feature** is a measurable way to describe something or someone in your dataset. When working on a data science project, deciding what features are important to the model you are trying to build is a good first step. In many cases, choosing good features can be the difference between a good and a bad model!

Most features fall into one of the following categories:

#### **Numerical Features**
These are things that are represented naturally as numbers. With numerical features, you can compare two records in your dataset by checking whether one value is less than, greater than, or equal to the other. Some examples are things like:
- the number of times you've seen Taylor Swift live in concert
- how many states you have visited
- how many pets you own
- how many siblings you have

#### **Bin Features**
Bin features are similar to numerical features, but they allow you to specify a certain range (or "bin") of values to group numerical values together. This is often used for things like:
- age range (1-12, 13-18, 19-24, 25+)
- your astrological sign (bins of birth date ranges)
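
As a rough sketch of how binning can work in pandas, the snippet below sorts a few made-up ages into age ranges using `pd.cut`. The ages, the variable names, and the exact bin edges (1-12, 13-18, 19-24, 25+) are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical ages, made up for this example
ages = pd.Series([11, 13, 18, 19, 24, 30])

# Bin edges: by default each bin includes its right edge,
# so the first bin covers ages 1-12, the second 13-18, and so on
bin_edges = [0, 12, 18, 24, 120]
bin_labels = ["1-12", "13-18", "19-24", "25+"]

age_bins = pd.cut(ages, bins=bin_edges, labels=bin_labels)
print(list(age_bins))
```

Notice that every age lands in exactly one bin. That only works when the ranges do not overlap.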

#### **One-Hot Encoded Features**
Categorical features are features that are not naturally represented numerically and usually represent some sort of category or description. Some examples are:
- your favorite color ("Blue","Purple","Orange")
- your favorite sport ("Soccer","Football","Hockey")

It's much easier for a model to compare the meaning of two numbers than to compare the meaning of two words. We use a technique called one-hot encoding to transform descriptive features into numerical features, so we can perform mathematical calculations using the features later on. With one-hot encoding, we give each possible option its own column and then mark that column with a 1 if the option was chosen or a 0 if it was not. After encoding, we would have the following features:
- color_blue, color_purple, color_orange
- sport_soccer, sport_football, sport_hockey
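
Here is a small sketch of one-hot encoding with `pd.get_dummies`, using the favorite-color example. The three answers are made up for illustration, and pandas names the new columns after the values exactly as written (for example `color_Blue`), so the capitalization may differ from the feature names listed above.

```python
import pandas as pd

# Hypothetical answers from three people, made up for this example
favorites = pd.DataFrame({"color": ["Blue", "Purple", "Orange"]})

# Each color becomes its own column; dtype=int shows 1/0 instead of True/False
encoded = pd.get_dummies(favorites, columns=["color"], dtype=int)
print(encoded)
```

Each row ends up with a single 1 in the column for that person's answer and 0 everywhere else, which is where the name "one-hot" comes from.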


In [2]:
# feature example
values = {
    'title': ['Teardrops on My Guitar','Delicate','22'], # categorical identifier
    'genre': ['Country','Pop','Pop'], # categorical that we will one-hot encode
    'duration_seconds': [179, 232, 230], # numerical
    'release_year': [2006, 2017, 2012] # numerical
}
taylor_swift_data = pd.DataFrame(data=values)
taylor_swift_data

Unnamed: 0,title,genre,duration_seconds,release_year
0,Teardrops on My Guitar,Country,179,2006
1,Delicate,Pop,232,2017
2,22,Pop,230,2012


In [8]:
# we want to transform our Taylor Swift data to make it easier to analyze:
# we can bin the release year into 5-year ranges
# and one-hot encode the genre
# first we define the bins and labels for the release year
bins = list(range(2005, 2022, 5))  # bin edges: [2005, 2010, 2015, 2020]
labels = ["{}-{}".format(i, i+4) for i in range(2005, 2020, 5)]
# right=False means each bin includes its first year but not the first year of the next bin
binned_year = pd.cut(taylor_swift_data['release_year'], bins=bins, labels=labels, right=False)
# next we create a dataset with the encoded features (dtype=int gives 1/0 instead of True/False)
encoded_features = pd.concat(
    [pd.get_dummies(taylor_swift_data.genre, dtype=int),
    pd.get_dummies(binned_year, dtype=int)],
    axis=1
)
# then we add those new columns to the original columns
# (selecting the original columns first keeps this cell safe to run more than once)
original_columns = ['title', 'genre', 'duration_seconds', 'release_year']
taylor_swift_data = pd.concat([taylor_swift_data[original_columns], encoded_features], axis=1)
taylor_swift_data

Unnamed: 0,title,genre,duration_seconds,release_year,Country,Pop,2005-2009,2010-2014,2015-2019
0,Teardrops on My Guitar,Country,179,2006,1,0,1,0,0
1,Delicate,Pop,232,2017,0,1,0,0,1
2,22,Pop,230,2012,0,1,0,1,0


In [9]:
encoded_features

Unnamed: 0,Country,Pop,2005-2009,2010-2014,2015-2019
0,1,0,1,0,0
1,0,1,0,0,1
2,0,1,0,1,0


## ✍️ Your Turn
What kinds of features can you use to describe yourself? What features might be important when considering who your new friends might be? Make a list of potential friendship features and how to represent them numerically.

Then, as a group, we will combine everyone's features together to make a consistent dataset of everyone in the workshop.

## 🏗 Building a Vector
Now that we have our features, we need a consistent way to represent each song. This is where vectors come in! Each song can be represented as a vector, where each element in the vector is a feature.

Let's use `Teardrops on My Guitar` as an example! If we build the vector for this song, we get:

`[179, 2006, 1, 0]`

Can you guess how we built this vector?

In [10]:
# turn each taylor swift row of features into a vector
taylor_swift_vectors = (taylor_swift_data.set_index('title')[
        ['duration_seconds', 'release_year', 'Country', 'Pop']
])
taylor_swift_vectors

title,duration_seconds,release_year,Country,Pop
Teardrops on My Guitar,179,2006,1,0
Delicate,232,2017,0,1
22,230,2012,0,1


## ✍️ Your Turn

Write out the vector representing your personal features. Then add your vector to our shared document, so we can measure similarity between all of the workshop participants!

Shared Document: https://docs.google.com/document/d/1XgrSgU0eBY0kcInB-4vxuaUuGJAzU8_8xHNfe6FI92s/edit?usp=sharing

In [None]:
# vectors of all participants will go here

## ⚖️ Measuring Similarity
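Alright, with our vectors ready, let's dive deeper to see how similar they really are. Cosine similarity compares the directions two vectors point in and gives a score from -1 to 1. A score of 1 means the vectors point in exactly the same direction, so the two items are very similar. A 0 means they have nothing in common, and a -1 means they're polar opposites. Because the features in our song example are never negative, those scores will land between 0 and 1.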
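
To make the idea concrete, here is a minimal sketch of cosine similarity written by hand: the dot product of two vectors divided by the product of their lengths. The `cosine` helper and the tiny example vectors are made up for illustration. The notebook itself uses `cosine_similarity` from `sklearn`.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

# Made-up vectors for illustration
same_direction = cosine([1, 2], [2, 4])    # pointing the same way
perpendicular = cosine([1, 0], [0, 1])     # nothing in common
opposite = cosine([1, 2], [-1, -2])        # pointing opposite ways

print(round(same_direction, 6), round(perpendicular, 6), round(opposite, 6))
```

Notice that `[1, 2]` and `[2, 4]` count as a perfect match even though one is twice as long. Cosine similarity only looks at direction, not size.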

Now, let's apply this to our Taylor Swift songs. Which songs do you think are strikingly similar, and which ones march to their own beat? 🎵🤔

In [11]:
# Calculate the cosine similarity
cosine_sim = cosine_similarity(taylor_swift_vectors)

# Transform the results into a DataFrame for better visualization
cosine_sim_df = pd.DataFrame(
    cosine_sim,
    index=taylor_swift_vectors.index,
    columns=taylor_swift_vectors.index
)
cosine_sim_df.columns.name = None
cosine_sim_df

Unnamed: 0,Teardrops on My Guitar,Delicate,22
Teardrops on My Guitar,1.0,0.999673,0.999691
Delicate,0.999673,1.0,1.0
22,0.999691,1.0,1.0
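
Every score above is higher than 0.999, even for songs from different genres. That happens because the big numbers (seconds and years) dominate the calculation, so the 1s and 0s for genre barely count. One common way to even things out, which is not part of the original workshop code, is to rescale every column to the range 0 to 1 before comparing. The sketch below assumes simple min-max scaling and computes cosine similarity by hand with NumPy.

```python
import numpy as np
import pandas as pd

# The same song vectors used in this notebook
songs = pd.DataFrame(
    {
        "duration_seconds": [179, 232, 230],
        "release_year": [2006, 2017, 2012],
        "Country": [1, 0, 0],
        "Pop": [0, 1, 1],
    },
    index=["Teardrops on My Guitar", "Delicate", "22"],
)

# Min-max scaling (an assumption of this sketch): squeeze every column into 0-1
scaled = (songs - songs.min()) / (songs.max() - songs.min())

# Cosine similarity by hand: make each row length 1, then take dot products
unit_rows = scaled.div(np.sqrt((scaled ** 2).sum(axis=1)), axis=0)
scaled_similarity = unit_rows @ unit_rows.T
print(scaled_similarity.round(3))
```

After scaling, the two Pop songs still score close to 1 with each other, while the Country song scores much lower against both of them.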


In [None]:
# similarity of all participants will go here

## 🎉 Results
Alright, the moment we've been waiting for! 🌟 Let's dive in and discover our friendship matches using the magic of data science! By crunching our features and sprinkling in some cosine similarity, we'll reveal our top friendship contenders. When you spot the names at the top of your list, have a think – what awesome things do you think you have in common? Ready to meet your data bestie? Let's roll! 🎈👩‍🔬👩‍💻👭

In [None]:
# ranking / sorting / formatting friends code
# def process_features(df: pd.DataFrame):

In [19]:
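def get_most_similar(data, name):
    """Return every other title in the similarity table, most similar first."""
    # drop the row's own entry by name, so a tie at 1.0 can never hide a real match
    similar = data.loc[name].drop(name).sort_values(ascending=False)
    return list(similar.index)

get_most_similar(cosine_sim_df, 'Delicate')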

['22', 'Teardrops on My Guitar']
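
To show how the pieces might fit together once everyone's vectors are collected, here is a minimal sketch with three made-up participants. The names, features, numbers, and the `rank_friends` helper are all hypothetical, and the cosine similarity is computed by hand with NumPy rather than with `sklearn`.

```python
import numpy as np
import pandas as pd

# Hypothetical participants and features, invented for this sketch
friends = pd.DataFrame(
    {
        "num_pets": [2, 0, 1],
        "num_siblings": [1, 3, 1],
        "sport_soccer": [1, 0, 1],
        "sport_hockey": [0, 1, 0],
    },
    index=["Ava", "Bea", "Cam"],
)

# Cosine similarity by hand: scale each row to length 1, then take dot products
unit_rows = friends.div(np.sqrt((friends ** 2).sum(axis=1)), axis=0)
similarity = unit_rows @ unit_rows.T

def rank_friends(similarity_table, name):
    """Return everyone except `name`, most similar first."""
    scores = similarity_table.loc[name].drop(name)
    return list(scores.sort_values(ascending=False).index)

print(rank_friends(similarity, "Ava"))
```

In this made-up group, Ava and Cam both play soccer and have the same number of siblings, so Cam comes out as Ava's closest match.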