# 🔎 Friend Finder
This notebook is designed for the Girl Scouts STEM Data Science Workshop to introduce 6th-12th graders to feature selection, cosine similarity, and data science in general.

### Instructors
- Jason Osajima
- Murphy Studebaker

## Introduction
This notebook walks you through building your first data science model. We will use the model to find new friends based on how similar their interests are to yours. The techniques are the same kind that apps like Netflix and Spotify use to make recommendations.

You will not have to write any of your own code. You can run each code cell by hitting the play button in the left corner or by hitting the keys `Shift + Return`.

### Software Tools
- **Python** is a programming language known for being easy to learn and good at many different programming tasks, especially data science. The code examples in this notebook are all written in Python.
- **Pandas** is a library we use with Python to easily work with data.
- **Colab** is Google's online editor for writing coding notebooks like the one you're currently using. You can save this notebook to your own account, so you can make edits or take notes and keep your changes. Eventually you can make a new notebook with your own data science models!


## ✍️ Your Turn

Run the following code cell to set up your notebook with everything we need for the workshop. You should see "SUCCESS!" printed out.

In [2]:
# this cell imports the required libraries for our workshop code to run
import pandas as pd
print("SUCCESS!")

SUCCESS!


## Features

A **feature** is a measurable way to describe something or someone in your dataset. When working on a data science project, deciding what features are important to the model you are trying to build is a good first step. In many cases, choosing good features can be the difference between a good and a bad model!

Most features fall into one of the following categories:

#### **Numerical Features**
These are things that are naturally represented as numbers. With numerical features, you can compare two different records in your dataset by checking whether one value is less than, greater than, or equal to the other. Some examples:
- the number of times you've seen Taylor Swift live in concert
- how many states you have visited
- how many pets you own
- how many siblings you have
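
To make the idea of comparing records concrete, here is a minimal sketch. The names and numbers are made up for illustration and are not workshop data.

```python
import pandas as pd

# Made-up example records; the names and numbers are placeholders
people = pd.DataFrame({
    'name': ['Ada', 'Grace'],
    'states_visited': [12, 7],
    'pets': [1, 1],
})

ada = people.iloc[0]    # first record
grace = people.iloc[1]  # second record

# numerical features can be compared directly with >, <, and ==
more_states = bool(ada['states_visited'] > grace['states_visited'])
same_pets = bool(ada['pets'] == grace['pets'])

print(more_states)  # True: Ada has visited more states
print(same_pets)    # True: both own the same number of pets
```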

#### **Bin Features**
Bin features are similar to numerical features, but they group a range of numerical values together into a single "bin". Bins should not overlap, so that every value belongs to exactly one bin. This is often used for things like:
- age range (1-12, 13-18, 19-24, 25+)
- your astrological sign (bins of birth date ranges)
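
As a rough sketch of how binning could look in Pandas, the example below sorts some made-up ages into non-overlapping age ranges with `pd.cut`. The ages, the bin edges, and the upper limit of 120 are assumptions chosen for illustration.

```python
import pandas as pd

ages = pd.Series([11, 13, 18, 19, 24, 30])  # made-up ages

# pd.cut uses right-inclusive edges by default:
# (0, 12], (12, 18], (18, 24], (24, 120]
age_bins = pd.cut(
    ages,
    bins=[0, 12, 18, 24, 120],
    labels=['1-12', '13-18', '19-24', '25+'],
)

print(age_bins.tolist())
# ['1-12', '13-18', '13-18', '19-24', '19-24', '25+']
```

Each age lands in exactly one bin, so two people in the same bin can be treated as having the same value for this feature.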

#### **One-Hot Encoded Features**
One-hot encoding is how we handle **categorical features**: features that are not naturally represented as numbers and instead name a category or description. Some examples are:
- your favorite color ("Blue","Purple","Orange")
- your favorite sport ("Soccer","Football","Hockey")

It's much easier for a model to compare the meaning of two numbers than to compare the meaning of two words. We use a technique called one-hot encoding to transform descriptive features into numerical features, so we can perform mathematical calculations with them later on. With one-hot encoding, each possible option becomes its own column, and each record gets a 1 in the column for the option it selected and a 0 in the others. After encoding, we would have the following features:
- color_blue, color_purple, color_orange
- sport_soccer, sport_football, sport_hockey


In [3]:
# feature example
values = {
    'title': ['Teardrops on My Guitar','Delicate','22'], # categorical identifier
    'genre': ['Country','Pop','Pop'], # categorical that we will one-hot encode
    'duration_seconds': [179, 232, 230], # numerical 
    'release_year': [2006, 2017, 2012] # numerical
}
taylor_swift_data = pd.DataFrame(data=values)
taylor_swift_data

| | title | genre | duration_seconds | release_year |
|---|---|---|---|---|
| 0 | Teardrops on My Guitar | Country | 179 | 2006 |
| 1 | Delicate | Pop | 232 | 2017 |
| 2 | 22 | Pop | 230 | 2012 |


In [6]:
# we want to transform our Taylor Swift data to make it easier to analyze
# one option is to bin the duration in seconds to the minute level (not done in this cell)
# here we one-hot encode the genre
# first we create a dataset with the encoded features
# dtype=int keeps the new columns as 1/0 (newer pandas versions default to True/False)
encoded_genre = pd.get_dummies(taylor_swift_data.genre, dtype=int)
# then we add those new columns back to our original dataset
taylor_swift_data = pd.concat([taylor_swift_data, encoded_genre], axis=1)
taylor_swift_data

| | title | genre | duration_seconds | release_year | Country | Pop |
|---|---|---|---|---|---|---|
| 0 | Teardrops on My Guitar | Country | 179 | 2006 | 1 | 0 |
| 1 | Delicate | Pop | 232 | 2017 | 0 | 1 |
| 2 | 22 | Pop | 230 | 2012 | 0 | 1 |


## ✍️ Your Turn
What kinds of features can you use to describe yourself? What features might be important when considering who your new friends might be? Make a list of potential friendship features and how to represent them numerically.

Then, as a group, we will combine everyone's features together to make a consistent dataset of everyone in the workshop.

## Building a Vector

In [None]:
# vector example
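
The cell above is still a placeholder. One possible sketch of this step, assuming that a "vector" here means one person's features written as a list of numbers, is shown below. The names, columns, and values are all hypothetical and are not the workshop dataset.

```python
import pandas as pd

# Hypothetical workshop data: every name, column, and value is made up
friends = pd.DataFrame({
    'name': ['Ava', 'Bea', 'Cam'],
    'pets': [2, 0, 1],
    'siblings': [1, 3, 1],
    'favorite_sport': ['Soccer', 'Hockey', 'Soccer'],
})

# one-hot encode the categorical column so every feature is a number
features = pd.get_dummies(
    friends.set_index('name'),
    columns=['favorite_sport'],
    dtype=int,
)

# each row of the table is now one person's vector
ava_vector = features.loc['Ava'].to_numpy()

print(list(features.columns))
# ['pets', 'siblings', 'favorite_sport_Hockey', 'favorite_sport_Soccer']
print(ava_vector.tolist())
# [2, 1, 0, 1]
```

Every person's vector lists the same features in the same order, which is what makes two vectors comparable.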

## Measuring Similarity

In [None]:
# cosine similarity code
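
The cell above is still a placeholder. As a hedged sketch, cosine similarity between two vectors can be computed as their dot product divided by the product of their lengths. The helper name `cosine_similarity` and the example vectors are assumptions for illustration, and NumPy is used here even though the notebook has only imported Pandas so far.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b.

    Not defined when either vector is all zeros (division by zero).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# made-up feature vectors: [pets, siblings, sport_hockey, sport_soccer]
ava = [2, 1, 0, 1]
bea = [0, 3, 1, 0]
cam = [1, 1, 0, 1]

print(round(cosine_similarity(ava, cam), 3))  # 0.943
print(round(cosine_similarity(ava, bea), 3))  # 0.387
```

For vectors with no negative values, like these, the score falls between 0 and 1. A score closer to 1 means the two vectors point in a more similar direction, so in this made-up example Ava is more similar to Cam than to Bea.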

## Results

In [None]:
# ranking / sorting / formatting friends code
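
The cell above is still a placeholder. One way the ranking step might look, assuming each person is a row of numerical features and friends are ranked by cosine similarity to one chosen person, is sketched below. The data, the `me` variable, and the helper function are hypothetical.

```python
import numpy as np
import pandas as pd

# made-up feature vectors, one row per person
features = pd.DataFrame(
    {'pets': [2, 0, 1, 3], 'siblings': [1, 3, 1, 0], 'sport_soccer': [1, 0, 1, 1]},
    index=['Ava', 'Bea', 'Cam', 'Dee'],
)

def cosine_similarity(a, b):
    # dot product divided by the product of the vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

me = 'Ava'
my_vector = features.loc[me].to_numpy(dtype=float)

# score everyone except me, then sort from most to least similar
others = features.drop(index=me)
scores = others.apply(
    lambda row: cosine_similarity(my_vector, row.to_numpy(dtype=float)),
    axis=1,
)
ranking = scores.sort_values(ascending=False)

print(ranking.round(3))
```

With this made-up data, the ranking for Ava comes out as Cam, then Dee, then Bea.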