## Curriculum Logs Anomaly Detection
By: Scott Schmidl, Rajaram Gautam
01/31/2022

### Goal


### Description


### Initial Questions
<p>1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?</p>
<p>2. Is there a cohort that referred to a lesson significantly more often, while other cohorts seemed to gloss over it?</p>
<p>3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?</p>
<p>4. Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?</p>
<p>5. At some point in 2019, the ability for students and alumni to access both curriculums (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?</p>
<p>6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?</p>
<p>7. Which lessons are least accessed?</p>

### Data Dictionary
<table>
<thead><tr>
<th>Variable</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>path</td>
<td>The lesson that was accessed</td>
</tr>
<tr>
<td>user_id</td>
<td>id of the user</td>
</tr>
<tr>
<td>cohort_id</td>
<td>id of the cohort</td>
</tr>
<tr>
<td>ip</td>
<td>ip address of the user</td>
</tr>
<tr>
<td>name</td>
<td>name of cohort</td>
</tr>
<tr>
<td>slack</td>
<td>slack channel name</td>
</tr>
<tr>
<td>start_date</td>
<td>start date of cohort</td>
</tr>
<tr>
<td>end_date</td>
<td>end date of cohort</td>
</tr>
<tr>
<td>created_at</td>
<td>date time cohort information was entered into database</td>
</tr>
<tr>
<td>updated_at</td>
<td>date time cohort information was updated</td>
</tr>
<tr>
<td>deleted_at</td>
<td>date time cohort information was deleted</td>
</tr>
<tr>
<td>program_id</td>
<td>id of the program</td>
</tr>
<tr>
<td>program_name</td>
<td>name of program, either web dev or data science</td>
</tr>
</tbody>
</table>

### Wrangle and Prepare
- To wrangle the curriculum logs, we queried the curriculum logs database on our MySQL server and saved the result to a CSV.
- To prepare the data, we engineered date-time columns and a `program_name` column.
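
The preparation step could look roughly like the sketch below. The raw column names (`date`, `time`), the toy rows, and the `program_id` to `program_name` mapping are all assumptions for illustration; the actual logic lives in `wrangle.py` and may differ.

```python
import pandas as pd

# Toy stand-in for the raw logs; `date` and `time` are assumed column names.
raw = pd.DataFrame({
    "date": ["2018-01-26", "2018-01-26", "2020-01-30"],
    "time": ["09:55:03", "09:56:02", "08:01:31"],
    "path": ["/", "java-ii", "1-fundamentals/1.1-intro-to-data-science"],
    "program_id": [1, 2, 3],
    "start_date": ["2017-09-27", "2017-09-27", "2019-08-19"],  # made-up dates
    "end_date": ["2018-02-15", "2018-02-15", "2020-01-30"],    # made-up dates
})

# Hypothetical id-to-name mapping; confirm against the programs table.
PROGRAM_NAMES = {1: "web dev", 2: "web dev", 3: "data science"}


def prep(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Combine date and time into one timestamp and use it as the index.
    out["date_time"] = pd.to_datetime(out["date"] + " " + out["time"])
    out = out.drop(columns=["date", "time"]).set_index("date_time").sort_index()
    # Parse cohort dates so they can be compared against the index.
    for col in ["start_date", "end_date"]:
        out[col] = pd.to_datetime(out[col])
    # Translate the numeric program id into a readable program name.
    out["program_name"] = out["program_id"].map(PROGRAM_NAMES)
    return out


prepped = prep(raw)
```

Indexing by `date_time` is what lets later cells compare `logs.index` directly against `start_date` and `end_date`.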

In [1]:
from wrangle import Wrangle
logs = Wrangle().prep_data()

### Exploratory Data Analysis

### Question 1
- Which lesson appears to attract the most traffic consistently across cohorts (per program)?
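
One possible approach, sketched on toy data: count accesses per program, cohort, and lesson, keep each cohort's most-accessed lesson, and then see which lesson comes out on top in the most cohorts within each program. The toy rows, the placeholder cohort names, and the choice to drop the home page (`/`) are assumptions; with the real data the input would be `logs`.

```python
import pandas as pd

# Toy logs: two placeholder cohorts per program.
toy = pd.DataFrame({
    "program_name": ["web dev"] * 6 + ["data science"] * 6,
    "name": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "path": [
        "java-i", "java-i", "java-ii",              # cohort A
        "java-i", "java-i", "/",                    # cohort B
        "classification/overview", "classification/overview", "/",  # cohort C
        "classification/overview", "classification/overview",
        "classification/project",                   # cohort D
    ],
})

# Ignore the home page, which is navigation rather than a lesson (assumption).
lessons = toy[toy["path"] != "/"]

# Accesses per program, cohort, and lesson.
counts = (
    lessons.groupby(["program_name", "name", "path"])
    .size()
    .rename("hits")
    .reset_index()
)

# Keep each cohort's single most-accessed lesson.
top_per_cohort = (
    counts.sort_values("hits", ascending=False, kind="stable")
    .drop_duplicates(subset=["program_name", "name"])
)

# Number of cohorts in which each lesson was the most accessed, per program.
consistency = (
    top_per_cohort.groupby(["program_name", "path"])
    .size()
    .rename("cohorts_where_top")
    .sort_values(ascending=False)
)
```

A lesson whose `cohorts_where_top` is close to the number of cohorts in its program is a candidate for "consistently most traffic". Ties within a cohort are broken arbitrarily here.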

### Take Away
- 

### Question 2
- Is there a cohort that referred to a lesson significantly more often, while other cohorts seemed to gloss over it?

In [2]:
# Count accesses per cohort (staff excluded) and keep the 10 busiest cohorts
logs.loc[logs["name"] != "Staff", ["name", "path"]].groupby(by=["name"], sort=False).count().nlargest(n=10, columns="path", keep="first")

name,path
Ceres,40730
Zion,38096
Jupiter,37109
Fortuna,36902
Voyageurs,35636
Ganymede,33844
Apex,33568
Deimos,32888
Darden,32015
Teddy,30926


### Take Away
- These are the 10 cohorts with the most curriculum accesses overall, excluding staff.
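
To compare cohorts lesson by lesson, one option is to compute each cohort's share of traffic per lesson and compare it with the median share across cohorts. This is a sketch on toy data with placeholder cohort names; the threshold of 3 times the median is an arbitrary assumption.

```python
import pandas as pd

# Toy logs: cohorts A and B barely touch java-ii, cohort C leans on it.
toy = pd.DataFrame({
    "name": ["A"] * 10 + ["B"] * 10 + ["C"] * 10,
    "path": (
        ["java-i"] * 9 + ["java-ii"] * 1      # cohort A
        + ["java-i"] * 9 + ["java-ii"] * 1    # cohort B
        + ["java-i"] * 5 + ["java-ii"] * 5    # cohort C
    ),
})

# Rows: cohorts, columns: lessons, values: share of that cohort's traffic.
share = pd.crosstab(toy["name"], toy["path"], normalize="index")

# Ratio of each cohort's share to the median share for that lesson.
ratio = share / share.median(axis=0)

THRESHOLD = 3  # arbitrary cut-off, tune for the real data
flagged = ratio.stack()
flagged = flagged[flagged >= THRESHOLD]
```

Using shares rather than raw counts keeps large cohorts from dominating. A lesson whose median share is zero produces infinite ratios, so rarely visited lessons may need to be filtered out first.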

### Question 3
- Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?

In [3]:
# Accesses made while the user's cohort was in session
was_student = (logs.index < logs["end_date"]) & (logs.index > logs["start_date"])
# Exclude staff accounts
not_staff = (logs["name"] != "Staff")

# Count accesses per user and keep the 10 least active users
logs.loc[was_student & not_staff, ["user_id", "path"]].groupby(by=["user_id"], sort=False).count().nsmallest(n=10, columns="path", keep="first")

user_id,path
619,1
879,1
918,1
940,1
832,3
278,4
539,5
956,6
812,7
388,8


### Take Away
- These are the 10 users who accessed the curriculum the least while enrolled (between their cohort's start and end dates), excluding staff.

### Question 4
- Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?
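
A possible starting point, sketched on toy data: count requests per IP per day and flag IP-days far above the typical volume, which is one common signature of scraping. The toy IP addresses (taken from ranges reserved for documentation), the timestamps, and the cut-off of 10 times the median are all assumptions.

```python
import pandas as pd

# Three ordinary users with a handful of requests each.
normal = pd.DataFrame(
    {"ip": ["198.51.100.1", "198.51.100.2", "198.51.100.3"] * 4},
    index=pd.to_datetime(["2019-03-01 09:00:00"] * 12),
)
# One address requesting a page every minute for 200 minutes.
scraper = pd.DataFrame(
    {"ip": ["203.0.113.9"] * 200},
    index=pd.date_range("2019-03-01 02:00:00", periods=200, freq="min"),
)
toy = pd.concat([normal, scraper]).sort_index()
toy.index.name = "date_time"

# Requests per IP per calendar day.
daily = toy.groupby([toy.index.date, "ip"]).size().rename("requests")

# Flag IP-days well above the typical daily volume (arbitrary multiplier).
cutoff = 10 * daily.median()
suspicious = daily[daily > cutoff]
```

The same grouping could be applied to `user_id` instead of `ip`, or to users with no cohort, to look for accounts that should not have access.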


### Take Away
-

### Question 5
- At some point in 2019, the ability for students and alumni to access both curriculums (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?
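
One way to look for this, sketched on toy data: label each path with the curriculum it belongs to, flag accesses where that label differs from the user's own program, and count the flagged accesses per month. A drop to zero would suggest when cross access was shut off. The prefix-to-curriculum mapping below is a hypothetical, hand-built example using a few paths that appear in this notebook; a real mapping would need to cover every module.

```python
import pandas as pd

# Hypothetical mapping from a path's top-level folder to its curriculum.
PATH_PROGRAM = {
    "java-i": "web dev",
    "java-ii": "web dev",
    "jquery": "web dev",
    "classification": "data science",
    "1-fundamentals": "data science",
}

# Toy logs with made-up timestamps.
toy = pd.DataFrame(
    {
        "program_name": ["web dev", "web dev", "data science", "data science"],
        "path": [
            "classification/overview",  # web dev user in the DS curriculum
            "java-i",
            "jquery/mapbox-api",        # DS user in the web dev curriculum
            "classification/project",
        ],
    },
    index=pd.to_datetime(["2019-02-10", "2019-02-11", "2019-03-05", "2019-09-01"]),
)

# Curriculum implied by the first path segment (NaN when unmapped).
path_program = toy["path"].str.split("/").str[0].map(PATH_PROGRAM)

# Cross access: the path is mapped and belongs to the other program.
cross = path_program.notna() & (path_program != toy["program_name"])

# Number of cross-curriculum accesses per calendar month.
cross_by_month = cross.groupby(toy.index.to_period("M")).sum()
```

Unmapped paths such as `/` are treated as neutral rather than as cross access.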

### Take Away
- 

### Question 6
- What topics are grads continuing to reference after graduation and into their jobs (for each program)?


#### Web Dev

In [5]:
logs["path"][(logs.index > logs["end_date"]) & (logs["program_name"] == "web dev") & (logs["name"] != "Staff")]

date_time
2018-01-26 09:55:03                                      /
2018-01-26 09:56:02                                java-ii
2018-01-26 09:56:05    java-ii/object-oriented-programming
2018-01-26 09:56:06     slides/object_oriented_programming
2018-01-26 10:14:47                                      /
                                      ...                 
2021-04-21 14:43:09                      jquery/mapbox-api
2021-04-21 14:43:10                jquery/ajax/weather-map
2021-04-21 14:50:36                                 java-i
2021-04-21 14:50:38            java-i/introduction-to-java
2021-04-21 16:30:30                               appendix
Name: path, Length: 104557, dtype: object

#### Data Science

In [6]:
logs["path"][(logs.index > logs["end_date"]) & (logs["program_name"] == "data science") & (logs["name"] != "Staff")]

date_time
2020-01-30 08:01:31             1-fundamentals/1.1-intro-to-data-science
2020-01-30 08:01:32             1-fundamentals/modern-data-scientist.jpg
2020-01-30 08:01:32                 1-fundamentals/AI-ML-DL-timeline.jpg
2020-01-31 11:05:04                                                    /
2020-01-31 11:05:13             1-fundamentals/1.1-intro-to-data-science
                                             ...                        
2021-04-21 15:20:12                              classification/overview
2021-04-21 15:20:12    classification/classical_programming_vs_machin...
2021-04-21 15:20:12             classification/scale_features_or_not.svg
2021-04-21 15:20:14                               classification/project
2021-04-21 15:20:18                               classification/acquire
Name: path, Length: 11544, dtype: object

### Take Away
- These are the paths, by program, that non-staff users accessed after their cohort's end date: 104,557 records for web dev and 11,544 for data science.
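
Because the raw listings are long, a count per path is easier to read. A minimal sketch, shown on toy data with the same columns as `logs`; the helper name `top_grad_paths` is hypothetical, and excluding the home page (`/`) is an assumption about what counts as a topic.

```python
import pandas as pd

# Toy logs with made-up dates and placeholder cohort names.
toy = pd.DataFrame(
    {
        "path": ["/", "java-ii", "java-ii", "java-i",
                 "classification/overview", "classification/overview",
                 "classification/project"],
        "program_name": ["web dev"] * 4 + ["data science"] * 3,
        "name": ["A", "A", "A", "A", "B", "B", "Staff"],
        "end_date": pd.to_datetime(["2018-01-01"] * 4 + ["2020-01-01"] * 3),
    },
    index=pd.to_datetime([
        "2018-02-01", "2018-02-02", "2018-03-01", "2017-12-01",
        "2020-02-01", "2020-03-01", "2020-03-02",
    ]),
)


def top_grad_paths(logs: pd.DataFrame, program: str, n: int = 10) -> pd.Series:
    """Most-accessed paths after the cohort end date for one program."""
    mask = (
        (logs.index > logs["end_date"])        # after graduation
        & (logs["program_name"] == program)
        & (logs["name"] != "Staff")
        & (logs["path"] != "/")                # skip the home page
    )
    return logs.loc[mask, "path"].value_counts().head(n)


web_dev_top = top_grad_paths(toy, "web dev")
data_science_top = top_grad_paths(toy, "data science")
```

On the real data, `top_grad_paths(logs, "web dev")` and `top_grad_paths(logs, "data science")` would give a ranked list per program.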

### Question 7
- Which lessons are least accessed?

In [7]:
logs["path"].value_counts(ascending=True).nsmallest(n=10, keep="first")

appendix/professional-development/post-interview-review-form    1
cli-03-file-paths                                               1
cli-07-more-topics                                              1
spring/services                                                 1
8-timeseries/1-overview                                         1
8-timeseries/2-intro-to-timeseries                              1
8-timeseries/3-acquire                                          1
8-timeseries/4-prep                                             1
8-timeseries/6.1-parametric-modeling                            1
8-timeseries/6.2-prophet                                        1
Name: path, dtype: int64

### Take Away
- These are 10 of the least-accessed paths, each accessed only once. The number shown can be changed through the `n` argument.