# Use Count Vectorizer
***Matt Paterson***<br>
***Machine Learning Engineer***<br>
***Santa Cruz, California***<br>
***10/16/2021***<br>

In this challenge, I'll take four CSV files of raw data about users of a platform who view videos, review the videos, and voluntarily submit their interests in the videos, along with data about the videos and the video authors.

I will create a model and API that let an administrator (the company) input a user_handle (a customer) from the existing set of user_handles and get back the users closest to that user_handle.

To get there, I will
- use a simple cosine similarity score for the users,
- run a DBSCAN or hierarchical clustering model and add its cluster labels as an additional input column,
- employ some Natural Language Processing techniques to reconcile
    - course tags,
    - interest tags, and
    - assessment tags where they are inconsistent, and
- utilize scikit-learn's OneHotEncoder to quickly vectorize categorical data.

I will then create a lookup table in DynamoDB to store the resulting users table, so that a RESTful API can query the database through Amazon API Gateway, with the model deployed as an Amazon SageMaker Model Endpoint.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

# Before final push, add any imports that come later in this program to this top box

## Load `course_tags.csv`, `user_assessment_scores.csv`, `user_course_views.csv`, and `user_interests.csv`
---

In [3]:
courses = pd.read_csv('../data/course_tags.csv')
print("courses.shape is", courses.shape)
courses.head()

courses.shape is (11337, 2)


Unnamed: 0,course_id,course_tags
0,12-principles-animation-toon-boom-harmony-1475,2d-animation
1,2d-racing-game-series-unity-5-1312,game-design
2,2d-racing-games-unity-volume-2-1286,game-art
3,2d-racing-games-unity-volume-2-1286,digital-painting
4,2d-racing-games-unity-volume-2-1286,image-editing


In [3]:
courses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11337 entries, 0 to 11336
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   course_id    11337 non-null  object
 1   course_tags  11225 non-null  object
dtypes: object(2)
memory usage: 177.3+ KB


In [4]:
courses.isna().sum()

course_id        0
course_tags    112
dtype: int64

In [38]:
len(courses.course_id.value_counts())

5830

In [5]:
# Inspect a course whose course_tags value is missing
courses[courses['course_id']=='artists-guide-mel-3163']

Unnamed: 0,course_id,course_tags
983,artists-guide-mel-3163,


In [6]:
assess = pd.read_csv('../data/user_assessment_scores.csv')
print("assess.shape is", assess.shape)
assess.head()

assess.shape is (6571, 4)


Unnamed: 0,user_handle,assessment_tag,user_assessment_date,user_assessment_score
0,7487,angular-js,2017-08-11 19:03:38,134
1,7487,css,2017-08-11 20:09:56,38
2,7487,html5,2017-07-31 18:59:37,84
3,7487,java,2017-07-31 18:49:27,149
4,7487,javascript,2017-07-31 19:05:03,92


In [7]:
assess.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6571 entries, 0 to 6570
Data columns (total 4 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   user_handle            6571 non-null   int64 
 1   assessment_tag         6571 non-null   object
 2   user_assessment_date   6571 non-null   object
 3   user_assessment_score  6571 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 205.5+ KB


In [41]:
assess

Unnamed: 0,user_handle,assessment_tag,user_assessment_date,user_assessment_score
0,7487,angular-js,2017-08-11 19:03:38,134
1,7487,css,2017-08-11 20:09:56,38
2,7487,html5,2017-07-31 18:59:37,84
3,7487,java,2017-07-31 18:49:27,149
4,7487,javascript,2017-07-31 19:05:03,92
...,...,...,...,...
6566,958,node-js,2017-04-26 20:36:35,245
6567,8887,angular-js,2016-09-30 22:30:48,221
6568,8887,docker,2017-03-24 17:55:06,148
6569,8887,html5,2017-02-10 16:38:53,241


In [8]:
views = pd.read_csv('../data/user_course_views.csv')
print("views.shape is", views.shape)
views.head()

views.shape is (249238, 6)


Unnamed: 0,user_handle,view_date,course_id,author_handle,level,view_time_seconds
0,1,2017-06-27,cpt-sp2010-web-designers-branding-intro,875,Beginner,3786
1,1,2017-06-28,cpt-sp2010-web-designers-branding-intro,875,Beginner,1098
2,1,2017-06-28,cpt-sp2010-web-designers-css,875,Intermediate,4406
3,1,2017-07-27,cpt-sp2010-web-designers-css,875,Intermediate,553
4,1,2017-09-12,aws-certified-solutions-architect-professional,281,Advanced,102


In [9]:
views[views['course_id']=='wpf-advanced-topics'].shape

(34, 6)

In [10]:
views.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249238 entries, 0 to 249237
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   user_handle        249238 non-null  int64 
 1   view_date          249238 non-null  object
 2   course_id          249238 non-null  object
 3   author_handle      249238 non-null  int64 
 4   level              249238 non-null  object
 5   view_time_seconds  249238 non-null  int64 
dtypes: int64(3), object(3)
memory usage: 11.4+ MB


In [11]:
interests = pd.read_csv('../data/user_interests.csv')
print("interests.shape is", interests.shape)
interests.head()

interests.shape is (297526, 3)


Unnamed: 0,user_handle,interest_tag,date_followed
0,1,mvc-scaffolding,2017-06-27 16:26:52
1,1,mvc2,2017-06-27 16:26:52
2,1,mvc-html-helpers,2017-06-27 16:26:52
3,1,mvc4-ioc,2017-06-27 16:26:52
4,1,mvc-testing,2017-06-27 16:26:52


In [12]:
interests.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 297526 entries, 0 to 297525
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   user_handle    297526 non-null  int64 
 1   interest_tag   297526 non-null  object
 2   date_followed  297526 non-null  object
dtypes: int64(1), object(2)
memory usage: 6.8+ MB


## Drop unnecessary columns
---

For this first pass, drop the rows of `courses` with null tags, since there are only a few of them (112 of 11,337).

In [13]:
courses = courses.dropna()

## Merge into a single dataframe
---

Use the user_handle column to merge three of the dataframes (assess, views, interests) and course_id to merge in the fourth (courses).

In [14]:
def make_dfs():
    '''
    return updated list with each df for parsing
    '''
    return [courses, assess, views, interests]

In [15]:
dfs=make_dfs()
df_names=['courses', 'assess', 'views', 'interests']

In [16]:
for df in dfs:
    print(df.shape)

(11225, 2)
(6571, 4)
(249238, 6)
(297526, 3)


In [17]:
for i in range(len(dfs)):
    print("***" + df_names[i] + "***")
    print(dfs[i].dtypes)

***courses***
course_id      object
course_tags    object
dtype: object
***assess***
user_handle               int64
assessment_tag           object
user_assessment_date     object
user_assessment_score     int64
dtype: object
***views***
user_handle           int64
view_date            object
course_id            object
author_handle         int64
level                object
view_time_seconds     int64
dtype: object
***interests***
user_handle       int64
interest_tag     object
date_followed    object
dtype: object


In [18]:
dfs[1]['user_assessment_date'].head()

0    2017-08-11 19:03:38
1    2017-08-11 20:09:56
2    2017-07-31 18:59:37
3    2017-07-31 18:49:27
4    2017-07-31 19:05:03
Name: user_assessment_date, dtype: object

In [19]:
assess.head(1)

Unnamed: 0,user_handle,assessment_tag,user_assessment_date,user_assessment_score
0,7487,angular-js,2017-08-11 19:03:38,134


In [20]:
assess['user_assessment_date'] = pd.to_datetime(assess['user_assessment_date'])

In [21]:
dfs = make_dfs()

In [22]:
for i in range(len(dfs)):
    print("***" + df_names[i] + "***")
    print(dfs[i].dtypes)

***courses***
course_id      object
course_tags    object
dtype: object
***assess***
user_handle                       int64
assessment_tag                   object
user_assessment_date     datetime64[ns]
user_assessment_score             int64
dtype: object
***views***
user_handle           int64
view_date            object
course_id            object
author_handle         int64
level                object
view_time_seconds     int64
dtype: object
***interests***
user_handle       int64
interest_tag     object
date_followed    object
dtype: object


How many unique tags exist in each dataframe?

In [23]:
tags = len(courses.course_tags.value_counts())
ids = len(courses.course_id.value_counts())

print(f"There are {tags} unique course tags")
print(f"There are {ids} unique course ids")

There are 998 unique course tags
There are 5830 unique course ids


In [24]:
views.columns

Index(['user_handle', 'view_date', 'course_id', 'author_handle', 'level',
       'view_time_seconds'],
      dtype='object')

In [25]:
assess.columns

Index(['user_handle', 'assessment_tag', 'user_assessment_date',
       'user_assessment_score'],
      dtype='object')

In [26]:
assess_tags = len(assess.assessment_tag.value_counts())

print(f"There are {assess_tags} unique assessment tags")


There are 54 unique assessment tags


In [27]:
interest_tags = len(interests.interest_tag.value_counts())

print(f"There are {interest_tags} unique interest tags")


There are 748 unique interest tags


In [28]:
for df in dfs:
    print(df.columns)

Index(['course_id', 'course_tags'], dtype='object')
Index(['user_handle', 'assessment_tag', 'user_assessment_date',
       'user_assessment_score'],
      dtype='object')
Index(['user_handle', 'view_date', 'course_id', 'author_handle', 'level',
       'view_time_seconds'],
      dtype='object')
Index(['user_handle', 'interest_tag', 'date_followed'], dtype='object')


In [29]:
df_names

['courses', 'assess', 'views', 'interests']

In [30]:
assess.shape

(6571, 4)

In [31]:
views.shape

(249238, 6)

In [32]:
pd.merge(left=views, right=assess, left_on='user_handle', right_on='user_handle').head() #.shape

Unnamed: 0,user_handle,view_date,course_id,author_handle,level,view_time_seconds,assessment_tag,user_assessment_date,user_assessment_score
0,2,2017-05-01,arnold-maya-fundamentals,273,Beginner,3277,photoshop,2016-09-23 16:59:45,139
1,2,2017-05-08,animated-web-social-media-banners-photoshop-fl...,62,Advanced,1996,photoshop,2016-09-23 16:59:45,139
2,2,2017-05-08,arnold-maya-fundamentals,273,Beginner,2612,photoshop,2016-09-23 16:59:45,139
3,2,2017-05-09,arnold-maya-fundamentals,273,Beginner,2142,photoshop,2016-09-23 16:59:45,139
4,2,2017-05-11,design-2d-game-level-illustrator-2113,640,Advanced,2131,photoshop,2016-09-23 16:59:45,139


In [33]:
pd.merge(left=views, right=assess, left_on='user_handle', right_on='user_handle').shape

(305883, 9)

That merge didn't go as planned. Are the course_tags necessary? How closely are they related to the interest tags?
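
One quick way to answer the second question is to compare the two tag vocabularies as sets. This is a minimal sketch, assuming exact string matches are good enough for a first look; the `_demo` frames are made-up stand-ins for `courses` and `interests`, and the Jaccard score is my own choice of overlap measure.

```python
import pandas as pd

# Toy stand-ins for the real `courses` and `interests` frames (illustrative values only).
courses_demo = pd.DataFrame({
    "course_id": ["c1", "c2", "c3"],
    "course_tags": ["javascript", "3d-modeling", "docker"],
})
interests_demo = pd.DataFrame({
    "user_handle": [1, 1, 2],
    "interest_tag": ["javascript", "docker", "c#"],
})

# Unique tag vocabularies, ignoring missing values.
course_tag_set = set(courses_demo["course_tags"].dropna())
interest_tag_set = set(interests_demo["interest_tag"].dropna())

# Tags present in both vocabularies, and the Jaccard overlap (shared / union).
shared = course_tag_set & interest_tag_set
jaccard = len(shared) / len(course_tag_set | interest_tag_set)

print(sorted(shared))
print(round(jaccard, 2))
```

On the real data, the same two sets would come from `courses.course_tags` and `interests.interest_tag`.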

In [34]:
interests.interest_tag.value_counts()

javascript               4878
javascript-frameworks    4469
javascript-libraries     4469
c#                       4178
cloud-computing          3933
                         ... 
stitcher                    2
ketiv                       2
onshape                     2
ansys                       2
netfabb                     1
Name: interest_tag, Length: 748, dtype: int64

In [35]:
courses.course_tags.value_counts()

3d-modeling          484
3d-rendering         394
3d-texturing         347
3d-animation         307
creative-pipeline    290
                    ... 
google-big-query       1
google-analytics       1
opencv                 1
woodwork               1
dreamweaver            1
Name: course_tags, Length: 998, dtype: int64

## Create a `user-course` dataframe
**It should have the following columns to start**
- user_handle
- total_users_courses
- course_id
- course_tags
- first_view_date
- total_views
- avg_viewtime
- level
- author_handle
- interest_tags
- user_assessment_score
- user_avg_assess_score

**It should have a compound-index of user_handle_course_id**<br>

***On second thought, will that be helpful or only serve to delay the time to get to the MVP?***
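
As a rough sketch of what the view-based columns could look like, the block below collapses the raw view rows to one row per user and course with a named aggregation. The `views_demo` frame is invented for illustration, and taking the first `level` and `author_handle` per group assumes those are constant within a course.

```python
import pandas as pd

# Toy stand-in for the real `views` frame (illustrative values only).
views_demo = pd.DataFrame({
    "user_handle": [1, 1, 1, 2],
    "view_date": ["2017-06-27", "2017-06-28", "2017-06-28", "2017-05-01"],
    "course_id": ["course-a", "course-a", "course-b", "course-a"],
    "author_handle": [875, 875, 875, 875],
    "level": ["Beginner", "Beginner", "Intermediate", "Beginner"],
    "view_time_seconds": [3786, 1098, 4406, 3277],
})
views_demo["view_date"] = pd.to_datetime(views_demo["view_date"])

# One row per (user_handle, course_id); the group keys become a compound index.
user_course = (
    views_demo
    .groupby(["user_handle", "course_id"])
    .agg(
        first_view_date=("view_date", "min"),
        total_views=("view_date", "count"),
        avg_viewtime=("view_time_seconds", "mean"),
        level=("level", "first"),
        author_handle=("author_handle", "first"),
    )
)

print(user_course)
```

Aggregating before any merge also keeps the row count bounded by the number of distinct user and course pairs.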

## Create a `users` dataframe
**This should really have the user_handle as an index**


From here, create a DBSCAN clustering model and use the clusters from it as an additional feature.<br>

Once completed, run the cosine similarity and create a way to score the users.

This will require that we group each dataframe by the above factors

In [36]:
assess_users = len(assess.user_handle.value_counts())
interests_users = len(interests.user_handle.value_counts())
views_users = len(views.user_handle.value_counts())

print(f"There are {assess_users} user_handles in the assess df")
print(f"There are {interests_users} user_handles in the interests df")
print(f"There are {views_users} user_handles in the views df")

There are 3114 user_handles in the assess df
There are 10000 user_handles in the interests df
There are 8760 user_handles in the views df


I'll need to figure out the logic to create this table, which has
1. a unique row for each combination of user and course that the user viewed,
2. the assessment and score that the user gave the course,
3. rows linked together by the course tags and assessment tags, and
4. whether or not a course tag or assessment tag matches an interest tag from this user.
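
Item 4 can be sketched as a left join on both the user and the tag. This assumes exact string equality between a course tag and an interest tag counts as a match; the `_demo` frames and the `tag_matches_interest` column name are hypothetical.

```python
import pandas as pd

# Toy stand-ins: one row per user/course/tag, and each user's followed interests.
user_course_demo = pd.DataFrame({
    "user_handle": [1, 1, 2],
    "course_id": ["course-a", "course-b", "course-a"],
    "course_tags": ["javascript", "docker", "javascript"],
})
interests_demo = pd.DataFrame({
    "user_handle": [1, 2],
    "interest_tag": ["javascript", "c#"],
})

# Left-join on user AND tag: a row finds a partner only when this user
# follows an interest tag identical to the course tag.
flagged = user_course_demo.merge(
    interests_demo.drop_duplicates(),
    how="left",
    left_on=["user_handle", "course_tags"],
    right_on=["user_handle", "interest_tag"],
)
flagged["tag_matches_interest"] = flagged["interest_tag"].notna()
flagged = flagged.drop(columns="interest_tag")

print(flagged)
```

Because the right side is de-duplicated and joined on both keys, the result keeps exactly one row per row of the left frame.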

In [66]:
users = pd.merge(left=interests, right=views, how='outer', left_on='user_handle', right_on='user_handle')

In [67]:
users.shape

(9474074, 8)

In [68]:
users.user_handle.value_counts().shape

(10000,)

We use an outer join so that every user_handle is kept, which lets us construct a dataset that includes users who never viewed a course but did submit their interests.
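
The toy example below illustrates both effects of this merge: the outer join keeps a user with no views, and joining on user_handle alone pairs every interest row with every view row for that user. That many-to-many pairing is the likely reason the real frame grows to millions of rows. The `_demo` frames are invented for illustration.

```python
import pandas as pd

# User 1 has two interests and three view rows; user 2 has an interest but no views.
interests_demo = pd.DataFrame({
    "user_handle": [1, 1, 2],
    "interest_tag": ["mvc2", "mvc4-ioc", "javascript"],
})
views_demo = pd.DataFrame({
    "user_handle": [1, 1, 1],
    "course_id": ["course-a", "course-a", "course-b"],
})

merged = pd.merge(interests_demo, views_demo, how="outer", on="user_handle")

# User 1 contributes 2 x 3 = 6 rows; user 2 is kept as one row with a missing course_id.
print(merged.shape)
print(merged["user_handle"].nunique())
print(merged["course_id"].isna().sum())
```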

In [69]:
users = pd.merge(left=users, right=assess, how='outer', left_on='user_handle', right_on='user_handle')

In [70]:
users.shape

(21951197, 11)

In [71]:
users.user_handle.value_counts().shape

(10000,)

In [74]:
users = pd.merge(left=users, right=courses, how='outer', left_on='course_id', right_on='course_id')

In [76]:
users.shape

(42253097, 12)

In [77]:
users.user_handle.value_counts().shape

(10000,)

In [79]:
# numeric_only=True restricts the sum to the numeric columns shown below
users_grouped = users.groupby('user_handle').sum(numeric_only=True)

In [80]:
users_grouped.head()

Unnamed: 0_level_0,author_handle,view_time_seconds,user_assessment_score
user_handle,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,948640.0,3182060.0,0.0
2,852832.0,2685460.0,250756.0
3,99360.0,40752.0,0.0
4,74310.0,468648.0,43602.0
5,738025.0,3309956.0,0.0


In [82]:
users_grouped.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
author_handle,10000.0,2176335.0,43374020.0,0.0,10597.75,91713.0,544918.5,3255055000.0
view_time_seconds,10000.0,9200670.0,269930000.0,0.0,17776.0,265447.0,1799142.75,21943160000.0
user_assessment_score,10000.0,524040.6,13454680.0,0.0,0.0,0.0,10159.5,1024915000.0


## Use NLP to find the similar course_tags and assessment_tags and interest_tags
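
A minimal sketch of one way to surface near-duplicate tags, assuming character n-gram counts from `CountVectorizer` are a reasonable proxy for spelling similarity. The tag list and the `most_similar_tag` helper are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative tags; the real input would be the unique tags from all three frames.
tags = ["angular-js", "angularjs", "node-js", "docker"]

# Count character 2-grams and 3-grams so small spelling differences still overlap.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3))
tag_vectors = vectorizer.fit_transform(tags)
sim = cosine_similarity(tag_vectors)


def most_similar_tag(index):
    """Return the tag closest to tags[index] (excluding itself) and its score."""
    scores = sim[index].copy()
    scores[index] = -1.0  # mask the tag's similarity with itself
    best = int(scores.argmax())
    return tags[best], float(scores[best])


print(most_similar_tag(0))
```

A similarity threshold for treating two tags as the same would still need to be chosen by inspecting real pairs.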

## Use OHE to vectorize categorical columns

## Use DBSCAN to create clusters of users
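
A minimal sketch of the clustering step, assuming each user is already a row of tag counts. The matrix is invented, and `eps=0.2` with `min_samples=2` are illustrative values that would need tuning on the real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Rows are users, columns are tag counts (illustrative values only).
user_matrix = np.array([
    [3.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 4.0, 1.0],
    [0.0, 3.0, 1.0],
    [1.0, 1.0, 5.0],
])

# Cosine distance groups users by the direction of their tag profile, not its size.
clusterer = DBSCAN(eps=0.2, min_samples=2, metric="cosine")
labels = clusterer.fit_predict(user_matrix)

# A label of -1 marks a user that DBSCAN treats as noise (no dense neighbourhood).
print(labels)
```

The resulting labels could then be attached to the users table as the extra feature column.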

## Use Cosine Similarity Scores to compare the users to one another
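
One possible shape for the scoring step, sketched on interest tags only: build one space-separated tag "document" per user, count tags with `CountVectorizer`, and rank other users by cosine similarity. The `interests_demo` frame and the `closest_users` helper are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the real `interests` frame.
interests_demo = pd.DataFrame({
    "user_handle": [1, 1, 2, 2, 3, 3],
    "interest_tag": ["javascript", "node-js", "javascript",
                     "angular-js", "3d-modeling", "3d-rendering"],
})

# One "document" of space-separated tags per user.
user_docs = interests_demo.groupby("user_handle")["interest_tag"].agg(" ".join)

# Split on spaces only, so tags such as "node-js" or "c#" stay single tokens.
vectorizer = CountVectorizer(token_pattern=r"[^ ]+")
user_vectors = vectorizer.fit_transform(user_docs)

sim = pd.DataFrame(
    cosine_similarity(user_vectors),
    index=user_docs.index,
    columns=user_docs.index,
)


def closest_users(user_handle, n=5):
    """Return the n most similar other users, best match first."""
    return sim.loc[user_handle].drop(user_handle).nlargest(n)


print(closest_users(1, n=2))
```

A dense similarity frame is fine for a toy example; for all 10,000 users it may be worth keeping the matrix sparse or computing rows on demand.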

## Save the dataframe to csv and consider DynamoDB, SQLite, PostgreSQL, and AWS RDS
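
A small sketch of the two simplest options, CSV and SQLite, using a temporary directory. The `users_demo` frame, its column names, and the `users` table name are placeholders rather than the final schema.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Placeholder result table indexed by user_handle (illustrative values only).
users_demo = pd.DataFrame(
    {"closest_user": [2, 1], "similarity": [0.5, 0.5]},
    index=pd.Index([1, 2], name="user_handle"),
)

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "users.csv")
db_path = os.path.join(out_dir, "users.db")

# Option 1: flat file.
users_demo.to_csv(csv_path)

# Option 2: SQLite lookup table; the index is written as a user_handle column.
conn = sqlite3.connect(db_path)
users_demo.to_sql("users", conn, if_exists="replace")
row = conn.execute(
    "SELECT closest_user, similarity FROM users WHERE user_handle = ?", (1,)
).fetchone()
conn.close()

print(row)
```

The same keyed-lookup pattern is what a DynamoDB table behind the API would serve, with user_handle as the key.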