In [1]:
#lin alg
import numpy as np

#csv IOs and dataframes
import pandas as pd

#clustering
from sklearn.cluster import KMeans

import matplotlib.pyplot as plt

%matplotlib notebook

### Hello DST!
#### This notebook is the first installment of a series of data science tutorials.

##### I am planning on walking you through some simple data engineering, where we will focus on intake and then on cleaning/processing. After that, I will show you how to run a simple model on the cleaned data. The model we will use is the k-means implementation supplied by sklearn.

## Data Engineering

In [2]:
#This can be a relative or absolute path. In actual work it will vary a lot depending on your environment
df = pd.read_csv('scoping_survey.csv') 

#more on these two later
drop_cols = set()
id_cols = set()

In [3]:
#Lets take a look at our survey results, what are its dimensions?
df.shape

(8, 10)

#### Normally, our data would be too large to see at a glance in a notebook, so we would have to use the head() function that dataframes provide. Here the dataset is so small that it's fine either way.


In [4]:
df

Unnamed: 0,Timestamp,Email Address,First Name,Last Name,What Grade are you?,How Familiar are you with ML/Data Science?,Have you worked with clustering before?,Are you familiar with Git?,How Familiar are you with Python,How Familiar are you with Jupyter Notebooks?
0,3/3/2020 21:51:48,afeng99@bu.edu,Amy,Feng,Sophomore,1,A little bit,Kinda,4,4
1,3/3/2020 21:58:44,shkim219@bu.edu,Paul,Kim,Freshman,2,A little bit,Kinda,3,1
2,3/3/2020 22:03:16,shiqy@bu.edu,Rose (Qingyang),Shi,Junior,1,A little bit,Yes,4,3
3,3/3/2020 23:00:49,zwan1312@bu.edu,Zhenghui,Wang,Sophomore,2,A little bit,Yes,4,3
4,3/4/2020 0:59:19,ebudur@bu.edu,Eren,Budur,Freshman,2,A little bit,Kinda,3,2
5,3/4/2020 9:26:30,jfli@bu.edu,Jason,Li,Sophomore,2,A little bit,Yes,3,3
6,3/4/2020 16:03:20,nathanhh@bu.edu,Nathan,Ho,Sophomore,1,A little bit,Yes,5,3
7,3/5/2020 10:50:03,janamha@bu.edu,Jana Mikaela,Aguilar,Sophomore,3,I did one clustering project before,Kinda,4,4


In [5]:
df.head()

Unnamed: 0,Timestamp,Email Address,First Name,Last Name,What Grade are you?,How Familiar are you with ML/Data Science?,Have you worked with clustering before?,Are you familiar with Git?,How Familiar are you with Python,How Familiar are you with Jupyter Notebooks?
0,3/3/2020 21:51:48,afeng99@bu.edu,Amy,Feng,Sophomore,1,A little bit,Kinda,4,4
1,3/3/2020 21:58:44,shkim219@bu.edu,Paul,Kim,Freshman,2,A little bit,Kinda,3,1
2,3/3/2020 22:03:16,shiqy@bu.edu,Rose (Qingyang),Shi,Junior,1,A little bit,Yes,4,3
3,3/3/2020 23:00:49,zwan1312@bu.edu,Zhenghui,Wang,Sophomore,2,A little bit,Yes,4,3
4,3/4/2020 0:59:19,ebudur@bu.edu,Eren,Budur,Freshman,2,A little bit,Kinda,3,2


The general goal of data engineering is to create a pipeline that can take in data and make it digestible for a computer model.

What is digestible?

At their root, computer models are generally built on linear algebra. Therefore, any input that is not a number is not usable in its raw form.

To remedy this, we transform the data into something the computer can understand, and then we can run it through models. In this exercise we are going to go through some different types of data and how each is useful to us. So let's go through our data column by column.
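Before going column by column, a quick way to see which columns still need work is to ask pandas which ones are not numeric. This is only a minimal sketch on a toy frame; the `toy` name and its three rows are made up for illustration and just mimic a few of the survey columns.

```python
import pandas as pd

# Toy frame (hypothetical) that mimics three of the survey columns.
toy = pd.DataFrame({
    "What Grade are you?": ["Sophomore", "Freshman", "Junior"],
    "Are you familiar with Git?": ["Kinda", "Kinda", "Yes"],
    "How Familiar are you with Python": [4, 3, 4],
})

# Columns a numeric model can use as they are.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Columns that still have to be encoded, set aside, or dropped.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("ready:", numeric_cols)
print("needs work:", non_numeric_cols)
```

Running the same two `select_dtypes` calls on the real `df` should give a to-do list for the rest of this section.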

In [6]:
#first column
col_1 = df['Timestamp']
pd.DataFrame(col_1) #this casting is just to make it look nice

Unnamed: 0,Timestamp
0,3/3/2020 21:51:48
1,3/3/2020 21:58:44
2,3/3/2020 22:03:16
3,3/3/2020 23:00:49
4,3/4/2020 0:59:19
5,3/4/2020 9:26:30
6,3/4/2020 16:03:20
7,3/5/2020 10:50:03


In [7]:
# notice the dtype below: pandas read the timestamps as plain strings ("object"), not as a datetime type
col_1

0    3/3/2020 21:51:48
1    3/3/2020 21:58:44
2    3/3/2020 22:03:16
3    3/3/2020 23:00:49
4     3/4/2020 0:59:19
5     3/4/2020 9:26:30
6    3/4/2020 16:03:20
7    3/5/2020 10:50:03
Name: Timestamp, dtype: object

There are multiple ways to clean a timestamp; however, for our purposes we do not need it. The time at which someone completed this survey is probably not indicative of anything we care about here. For now, we will just drop this column.

If you want to learn more about how to deal with time data, let me know in the gc. Timestamps are mostly used in time-series models, which are a completely different beast. So instead, let's mark this column for deletion.
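If you are curious what cleaning a timestamp could look like, here is a small sketch. It is not something this notebook uses. It assumes the Google Form writes month/day/year, and the `raw`, `parsed`, and `features` names are my own.

```python
import pandas as pd

# Three timestamps copied from the survey output above.
raw = pd.Series(
    ["3/3/2020 21:51:48", "3/4/2020 0:59:19", "3/5/2020 10:50:03"],
    name="Timestamp",
)

# Assumption: the format is month/day/year hour:minute:second.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S")

# Turn each timestamp into numbers a model could consume.
features = pd.DataFrame({
    "hour": parsed.dt.hour,               # 0-23
    "day_of_week": parsed.dt.dayofweek,   # Monday=0 ... Sunday=6
})
print(features)
```

Whether features like these are worth keeping depends on the question you are asking; for grouping people by experience, they are probably just noise.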

In [8]:
#drop_cols is a set that was defined earlier. Do you know why I used a set?

drop_cols.add('Timestamp')
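One possible answer to the question in the cell above (a sketch, not the only reason): notebook cells get re-run all the time, and a set ignores repeated adds, while a list would collect duplicates. The `drop_list` name is only there for comparison.

```python
# A set keeps one copy of each column name, even if the cell is re-run.
drop_cols = set()
drop_cols.add("Timestamp")
drop_cols.add("Timestamp")  # simulates running the cell a second time

# A list (hypothetical alternative) keeps growing instead.
drop_list = []
drop_list.append("Timestamp")
drop_list.append("Timestamp")

print(drop_cols)   # one entry
print(drop_list)   # two entries
```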

#### Now let's look at the next three columns.

In [9]:
df[['Email Address', 'First Name', 'Last Name']]

Unnamed: 0,Email Address,First Name,Last Name
0,afeng99@bu.edu,Amy,Feng
1,shkim219@bu.edu,Paul,Kim
2,shiqy@bu.edu,Rose (Qingyang),Shi
3,zwan1312@bu.edu,Zhenghui,Wang
4,ebudur@bu.edu,Eren,Budur
5,jfli@bu.edu,Jason,Li
6,nathanhh@bu.edu,Nathan,Ho
7,janamha@bu.edu,Jana Mikaela,Aguilar


These columns are not really useful to our model. Unless we plan to run some Natural Language Processing, which is much more complicated, we cannot really feed them into our model.

However, instead of dropping them like we did with Timestamp, we should keep these columns. The timestamp is relatively useless, but these columns are needed: if we get rid of them, the rest of the data is useless because we won't know who is in which cluster.

In [10]:
id_cols = set(['Email Address', 'First Name', 'Last Name'])

Now let's take a look at the grade column (`What Grade are you?`).

It is a very interesting case. We don't want the model to treat the different grades as points on a single linear scale. Instead, we are going to treat them as a categorical variable (I talked about this in our meeting). If you want to learn more about categorical variables, read these: [link for theory](https://stats.idre.ucla.edu/other/mult-pkg/whatstat/what-is-the-difference-between-categorical-ordinal-and-numerical-variables/), [link for application](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html).

We are going to use One-Hot Encoding.

Unlike a traditional categorical variable, our grade column is _technically_ an ordinal variable, because there is an intuitive order (Freshman -> Sophomore -> Junior -> Senior). However, by giving each grade its own dimension, we avoid telling the model that, for example, a Junior is exactly twice as far from a Freshman as a Sophomore is.
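To make the difference concrete, here is a small sketch comparing the distances k-means would see under the two encodings. The 0/1/2 mapping for the ordinal version and the `dist` helper are my own choices for illustration.

```python
import numpy as np
import pandas as pd

grades = pd.Series(["Freshman", "Sophomore", "Junior"])

# Ordinal encoding (hypothetical mapping): a single column with the order baked in.
ordinal = (
    grades.map({"Freshman": 0, "Sophomore": 1, "Junior": 2})
    .to_numpy(dtype=float)
    .reshape(-1, 1)
)

# One-hot encoding: one column per grade.
one_hot = pd.get_dummies(grades).to_numpy(dtype=float)

def dist(points, i, j):
    """Euclidean distance between rows i and j."""
    return float(np.linalg.norm(points[i] - points[j]))

# Row 0 = Freshman, row 1 = Sophomore, row 2 = Junior.
print("ordinal:", dist(ordinal, 0, 1), dist(ordinal, 0, 2))   # 1.0 and 2.0
print("one-hot:", dist(one_hot, 0, 1), dist(one_hot, 0, 2))   # both equal
```

With the ordinal column, Freshman to Junior is twice as far as Freshman to Sophomore. With all one-hot columns kept, every pair of grades is the same distance apart.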

In [11]:
# Lets see if we have every grade here:
df['What Grade are you?'].unique()

array(['Sophomore', 'Freshman', 'Junior'], dtype=object)

#### Since we do not have seniors here, we only have 3 categories.

#### Therefore we only need 2 new dimensions. If we were running a neural net we would want all 3, but for our purposes we want to stick with just 2 new dimensions. If you are interested in learning why, look up multicollinearity and the "dummy variable trap".

In [12]:
grade_df = pd.get_dummies(df['What Grade are you?'], drop_first=True)
grade_df

Unnamed: 0,Junior,Sophomore
0,0,1
1,0,0
2,1,0
3,0,1
4,0,0
5,0,1
6,0,1
7,0,1
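If you want to see exactly what `drop_first=True` removed, the sketch below builds both versions side by side on a few made-up rows. The `full` and `reduced` names are mine, and `dtype=int` is passed only so the output prints as 0/1.

```python
import pandas as pd

grades = pd.Series(
    ["Sophomore", "Freshman", "Junior", "Sophomore"],
    name="What Grade are you?",
)

# All three indicator columns.
full = pd.get_dummies(grades, dtype=int)

# drop_first=True removes the first category alphabetically (Freshman).
reduced = pd.get_dummies(grades, drop_first=True, dtype=int)

# In the full version every row sums to 1, so any one column is redundant.
print(full.sum(axis=1).tolist())

# The dropped column can be rebuilt from the two that remain.
rebuilt_freshman = 1 - reduced.sum(axis=1)
print(rebuilt_freshman.tolist())
```

That redundancy is the "dummy variable trap": a Freshman is simply the row where both Junior and Sophomore are 0.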


In [13]:
# let's mark the original grade column for deletion too:
drop_cols.add('What Grade are you?')

In [14]:
# Now let's add the encoded columns to our data (the old column gets removed later, along with everything in drop_cols)
df[grade_df.columns] = grade_df

In [15]:
df

Unnamed: 0,Timestamp,Email Address,First Name,Last Name,What Grade are you?,How Familiar are you with ML/Data Science?,Have you worked with clustering before?,Are you familiar with Git?,How Familiar are you with Python,How Familiar are you with Jupyter Notebooks?,Junior,Sophomore
0,3/3/2020 21:51:48,afeng99@bu.edu,Amy,Feng,Sophomore,1,A little bit,Kinda,4,4,0,1
1,3/3/2020 21:58:44,shkim219@bu.edu,Paul,Kim,Freshman,2,A little bit,Kinda,3,1,0,0
2,3/3/2020 22:03:16,shiqy@bu.edu,Rose (Qingyang),Shi,Junior,1,A little bit,Yes,4,3,1,0
3,3/3/2020 23:00:49,zwan1312@bu.edu,Zhenghui,Wang,Sophomore,2,A little bit,Yes,4,3,0,1
4,3/4/2020 0:59:19,ebudur@bu.edu,Eren,Budur,Freshman,2,A little bit,Kinda,3,2,0,0
5,3/4/2020 9:26:30,jfli@bu.edu,Jason,Li,Sophomore,2,A little bit,Yes,3,3,0,1
6,3/4/2020 16:03:20,nathanhh@bu.edu,Nathan,Ho,Sophomore,1,A little bit,Yes,5,3,0,1
7,3/5/2020 10:50:03,janamha@bu.edu,Jana Mikaela,Aguilar,Sophomore,3,I did one clustering project before,Kinda,4,4,0,1


#### We are almost there now. The remaining columns are either already numeric or short text answers that are easy to turn into numbers. Let's practice converting them. I'll show you how.

In [16]:
df['Are you familiar with Git?'].unique()

array(['Kinda', 'Yes'], dtype=object)

In [17]:
# Lets make it a simple 0 or 1
df['Are you familiar with Git?'].apply(lambda x: 0 if x=='Kinda' else 1)

0    0
1    0
2    1
3    1
4    0
5    1
6    1
7    0
Name: Are you familiar with Git?, dtype: int64

In [18]:
df['Are you familiar with Git?']

0    Kinda
1    Kinda
2      Yes
3      Yes
4    Kinda
5      Yes
6      Yes
7    Kinda
Name: Are you familiar with Git?, dtype: object

In [19]:
# Looks good, lets now set the column to the new variable
df['Are you familiar with Git?'] = df['Are you familiar with Git?'].apply(lambda x: 0 if x=='Kinda' else 1)

In [20]:
# Same idea here. If we want to, we can add more weight to specific values, but for now we will just have 0 and 1
df['Have you worked with clustering before?'] = df["Have you worked with clustering before?"].apply(lambda x: 0 if x =='A little bit' else 1)
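The `apply` with a lambda used above is only one way to do this conversion. Here is a sketch of a few equivalent alternatives on a made-up four-row column; the `via_*` names are mine.

```python
import numpy as np
import pandas as pd

git = pd.Series(["Kinda", "Kinda", "Yes", "Yes"], name="Are you familiar with Git?")

# 1) apply with a lambda (what the cells above do).
via_apply = git.apply(lambda x: 0 if x == "Kinda" else 1)

# 2) map with an explicit dictionary; answers missing from the dict become NaN,
#    which makes unexpected survey answers easy to spot.
via_map = git.map({"Kinda": 0, "Yes": 1})

# 3) np.where on a boolean mask.
via_where = pd.Series(np.where(git == "Kinda", 0, 1), index=git.index, name=git.name)

# 4) compare, then cast the booleans to integers.
via_compare = (git != "Kinda").astype(int)

print(via_apply.tolist(), via_map.tolist(), via_where.tolist(), via_compare.tolist())
```

On a dataset this small the choice is mostly about readability. The dictionary version has the nice property that it does not silently treat every unknown answer as 1.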

In [21]:
df

Unnamed: 0,Timestamp,Email Address,First Name,Last Name,What Grade are you?,How Familiar are you with ML/Data Science?,Have you worked with clustering before?,Are you familiar with Git?,How Familiar are you with Python,How Familiar are you with Jupyter Notebooks?,Junior,Sophomore
0,3/3/2020 21:51:48,afeng99@bu.edu,Amy,Feng,Sophomore,1,0,0,4,4,0,1
1,3/3/2020 21:58:44,shkim219@bu.edu,Paul,Kim,Freshman,2,0,0,3,1,0,0
2,3/3/2020 22:03:16,shiqy@bu.edu,Rose (Qingyang),Shi,Junior,1,0,1,4,3,1,0
3,3/3/2020 23:00:49,zwan1312@bu.edu,Zhenghui,Wang,Sophomore,2,0,1,4,3,0,1
4,3/4/2020 0:59:19,ebudur@bu.edu,Eren,Budur,Freshman,2,0,0,3,2,0,0
5,3/4/2020 9:26:30,jfli@bu.edu,Jason,Li,Sophomore,2,0,1,3,3,0,1
6,3/4/2020 16:03:20,nathanhh@bu.edu,Nathan,Ho,Sophomore,1,0,1,5,3,0,1
7,3/5/2020 10:50:03,janamha@bu.edu,Jana Mikaela,Aguilar,Sophomore,3,1,0,4,4,0,1


### Let's just delete our drop columns and move on to the model! Data cleaning is notorious for being the most time-consuming part of a project.

In [22]:
df.drop(columns=list(drop_cols), inplace=True)

In [23]:
df

Unnamed: 0,Email Address,First Name,Last Name,How Familiar are you with ML/Data Science?,Have you worked with clustering before?,Are you familiar with Git?,How Familiar are you with Python,How Familiar are you with Jupyter Notebooks?,Junior,Sophomore
0,afeng99@bu.edu,Amy,Feng,1,0,0,4,4,0,1
1,shkim219@bu.edu,Paul,Kim,2,0,0,3,1,0,0
2,shiqy@bu.edu,Rose (Qingyang),Shi,1,0,1,4,3,1,0
3,zwan1312@bu.edu,Zhenghui,Wang,2,0,1,4,3,0,1
4,ebudur@bu.edu,Eren,Budur,2,0,0,3,2,0,0
5,jfli@bu.edu,Jason,Li,2,0,1,3,3,0,1
6,nathanhh@bu.edu,Nathan,Ho,1,0,1,5,3,0,1
7,janamha@bu.edu,Jana Mikaela,Aguilar,3,1,0,4,4,0,1


# Inferencing

#### Let's separate the data from its ids for now

In [24]:
X, X_data = df, df.drop(id_cols, axis=1)
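One thing to keep in mind about the cell above: `X = df` does not copy anything, so `X` and `df` are two names for the same dataframe, while `df.drop(...)` returns a new one. The sketch below shows the difference on a made-up two-row frame; `X_copy` is a hypothetical name for what you would use if you wanted `df` left untouched.

```python
import pandas as pd

df = pd.DataFrame({"First Name": ["Amy", "Paul"], "score": [4, 3]})

X = df              # same object under a second name
X_copy = df.copy()  # independent copy (hypothetical alternative)

# Adding a column through X also changes df, but not the copy.
X["Labels"] = [0, 1]

print("Labels" in df.columns)      # True
print("Labels" in X_copy.columns)  # False
```

For this notebook the aliasing is harmless, since we want the labels attached to the full table anyway.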

In [25]:
X

Unnamed: 0,Email Address,First Name,Last Name,How Familiar are you with ML/Data Science?,Have you worked with clustering before?,Are you familiar with Git?,How Familiar are you with Python,How Familiar are you with Jupyter Notebooks?,Junior,Sophomore
0,afeng99@bu.edu,Amy,Feng,1,0,0,4,4,0,1
1,shkim219@bu.edu,Paul,Kim,2,0,0,3,1,0,0
2,shiqy@bu.edu,Rose (Qingyang),Shi,1,0,1,4,3,1,0
3,zwan1312@bu.edu,Zhenghui,Wang,2,0,1,4,3,0,1
4,ebudur@bu.edu,Eren,Budur,2,0,0,3,2,0,0
5,jfli@bu.edu,Jason,Li,2,0,1,3,3,0,1
6,nathanhh@bu.edu,Nathan,Ho,1,0,1,5,3,0,1
7,janamha@bu.edu,Jana Mikaela,Aguilar,3,1,0,4,4,0,1


In [26]:
X_data

Unnamed: 0,How Familiar are you with ML/Data Science?,Have you worked with clustering before?,Are you familiar with Git?,How Familiar are you with Python,How Familiar are you with Jupyter Notebooks?,Junior,Sophomore
0,1,0,0,4,4,0,1
1,2,0,0,3,1,0,0
2,1,0,1,4,3,1,0
3,2,0,1,4,3,0,1
4,2,0,0,3,2,0,0
5,2,0,1,3,3,0,1
6,1,0,1,5,3,0,1
7,3,1,0,4,4,0,1


### Depending on how this goes, we can add more. For now, let's just focus on a classic k-means model.

### If you want more documentation, see the [API reference](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).

#### By default it uses the k-means++ initialization that we talked about in our last meeting! Feel free to play around with it and test some things. I would also recommend playing with some toy datasets you can find online if you're interested in the actual model.

#### You can also message the gc if you have more questions. We can even do a Zoom call if you want more theory on clustering, time series, or NLP (I've done research in those three).

In [27]:
model = KMeans(n_clusters=4, random_state=0).fit(X_data)
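A side note on `n_clusters=4`: that number is a choice, not something the data tells us directly. One common way to get a feel for it is to fit the model for several values of k and compare the inertia (the within-cluster sum of squared distances). This is a rough sketch; the feature values are copied from the `X_data` table above, the shortened column names are mine, and `n_init=10` is set explicitly so the behavior does not depend on the sklearn version.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Values copied from the X_data table; column names shortened for readability.
X_data = pd.DataFrame({
    "ml":         [1, 2, 1, 2, 2, 2, 1, 3],
    "clustering": [0, 0, 0, 0, 0, 0, 0, 1],
    "git":        [0, 0, 1, 1, 0, 1, 1, 0],
    "python":     [4, 3, 4, 4, 3, 3, 5, 4],
    "jupyter":    [4, 1, 3, 3, 2, 3, 3, 4],
    "junior":     [0, 0, 1, 0, 0, 0, 0, 0],
    "sophomore":  [1, 0, 0, 1, 0, 1, 1, 1],
})

# Fit k-means for k = 1..6 and record the inertia of each fit.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_data)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

Inertia keeps shrinking as k grows, so the thing to look for is the point where the improvement flattens out (the "elbow"). With only 8 people, take whatever it suggests with a grain of salt.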

In [28]:
#These are the labels, now let's stitch them back onto the original data
model.labels_

array([3, 2, 3, 1, 2, 1, 3, 0], dtype=int32)

In [29]:
X['Labels'] = model.labels_

In [30]:
X

Unnamed: 0,Email Address,First Name,Last Name,How Familiar are you with ML/Data Science?,Have you worked with clustering before?,Are you familiar with Git?,How Familiar are you with Python,How Familiar are you with Jupyter Notebooks?,Junior,Sophomore,Labels
0,afeng99@bu.edu,Amy,Feng,1,0,0,4,4,0,1,3
1,shkim219@bu.edu,Paul,Kim,2,0,0,3,1,0,0,2
2,shiqy@bu.edu,Rose (Qingyang),Shi,1,0,1,4,3,1,0,3
3,zwan1312@bu.edu,Zhenghui,Wang,2,0,1,4,3,0,1,1
4,ebudur@bu.edu,Eren,Budur,2,0,0,3,2,0,0,2
5,jfli@bu.edu,Jason,Li,2,0,1,3,3,0,1,1
6,nathanhh@bu.edu,Nathan,Ho,1,0,1,5,3,0,1,3
7,janamha@bu.edu,Jana Mikaela,Aguilar,3,1,0,4,4,0,1,0
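Once the labels are attached, a natural next step is to summarize each cluster. The sketch below does that for two of the numeric columns. The values and labels are copied from the table above, while the short column names and the `labeled`, `sizes`, and `summary` names are mine.

```python
import pandas as pd

# Two feature columns and the labels, copied from the table above.
labeled = pd.DataFrame({
    "First Name": ["Amy", "Paul", "Rose (Qingyang)", "Zhenghui",
                   "Eren", "Jason", "Nathan", "Jana Mikaela"],
    "python":  [4, 3, 4, 4, 3, 3, 5, 4],
    "jupyter": [4, 1, 3, 3, 2, 3, 3, 4],
    "Labels":  [3, 2, 3, 1, 2, 1, 3, 0],
})

# How many people ended up in each cluster.
sizes = labeled["Labels"].value_counts().sort_index()

# Average self-reported familiarity per cluster.
summary = labeled.groupby("Labels")[["python", "jupyter"]].mean()

print(sizes)
print(summary)
```

The label numbers themselves carry no meaning; they are arbitrary ids. What matters is which people share a label and how the cluster averages differ.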


### This notebook is mostly focused on the cleaning aspect.

### In the real world, we will mostly use premade frameworks, so the hard part of modeling is understanding the theory and choosing the right techniques. If this goes well, we can move on to hierarchical clustering.

### However, we need:
#### 1) The data to take in (people will have to work on the Google Form)

#### 2) Data cleaning (exactly like we did here; the CSV I supplied comes directly from a Google Form)

#### 3) The model (a k-means model is simple, but if we have time we'll do the cooler stuff)