
# Who am I?

I'm curious about where I've been and when I've accessed the curriculum.
I also want to know which pages our Ada students have found most useful as they have continued into their jobs.

## Acquire

In [69]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

colnames = ['date','time','page','user', 'idk', 'source_ip']
df = pd.read_csv('anonymized-curriculum-access.txt', sep=' ', names=colnames)

In [70]:
df.head()

Unnamed: 0,date,time,page,user,idk,source_ip
0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61
1,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61


What is the 5th column?

In [71]:
df.idk.value_counts()

29.0    35969
24.0    35039
33.0    34433
22.0    28875
23.0    28056
32.0    26801
26.0    26760
25.0    25233
31.0    22665
28.0    20677
27.0    20198
34.0    15519
51.0    10835
14.0     9069
1.0      8877
21.0     7181
17.0     3792
52.0     2896
13.0     2610
8.0      1671
18.0     1603
19.0     1142
16.0      740
15.0      691
7.0       461
12.0      270
11.0      204
2.0        93
6.0        72
9.0         5
4.0         4
Name: idk, dtype: int64

To explore more, I will count the number of unique values per user, and then count the number of users for each number of unique values. For example, how many users have 1 unique value, how many have 2, 3, ...?

In [72]:
pd.DataFrame(df.groupby('user')['idk'].nunique()).reset_index().groupby('idk')['user'].count()

idk
0     46
1    479
2     17
3      1
Name: user, dtype: int64
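
The same two-step count can be written more compactly. The sketch below uses a small made-up frame (the `toy` data is hypothetical, not the real access log) and assumes only that the columns are named `user` and `idk` as above.

```python
import pandas as pd

# Toy data (hypothetical), standing in for the access log.
toy = pd.DataFrame({
    'user': [1, 1, 2, 2, 3, 4],
    'idk':  [8.0, 8.0, 22.0, 23.0, None, 8.0],
})

# Step 1: distinct non-null 'idk' values per user (NaN is not counted).
ids_per_user = toy.groupby('user')['idk'].nunique()

# Step 2: how many users have 0, 1, 2, ... distinct values.
users_per_id_count = ids_per_user.value_counts().sort_index()
print(users_per_id_count)
```

Users whose only value is missing land in the `0` bucket, which is presumably what the 46 users in the `0` row above are.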

Most users (479) map to exactly one value, so I'm guessing this id refers to cohorts.
What I know about cohorts:

- the data starts in January 2018
- alumni can access the curriculum
- the later ids fall in the range we would expect
- we've had about 35 cohorts come through. How many have started since January 2018?
- if cohorts run 18-20 weeks with at least 2 running at a time (often 3, sometimes 4), I estimate that about 14 cohorts have started since January 2018.
- there are about 17-29 students per cohort
- if the id is a cohort, I'm not sure where instructors would be. Maybe they are the users with more than one possible cohort id listed.
- if the id is a cohort, I would expect 2 cohorts where data science is dominant, and web dev for all the others.


I will do some aggregation to see what the data is saying. 

1. count distinct 'idk' values

In [73]:
print("Distinct Values: ", len(df.idk.value_counts()))

Distinct Values:  31


2. What is the average number of users per value overall? Over the last 14 values? I will check whether it falls in the range I would expect (especially for the last 14 values).

In [74]:
user_count = pd.DataFrame(df.groupby('idk')['user'].nunique()).reset_index()

In [75]:
user_count.user.mean()

16.64516129032258

In [76]:
# Note: this slices the 13 highest ids, one fewer than the 14 estimated above.
user_count[-13:].user.mean()

25.76923076923077

3. Finally, I could look up how many 'cohorts' have primarily accessed data science. I would expect 2. 
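
Step 3 is not run in this notebook. As a rough sketch of how it might be done, one could find the most-visited top-level section for each id and then look for the ids where a data science section comes out on top. The `toy` frame and its page names (for example `stats`) are hypothetical. The real section names would have to be read from the log.

```python
import pandas as pd

# Toy rows (hypothetical); page names are made up for illustration.
toy = pd.DataFrame({
    'idk':  [8.0, 8.0, 8.0, 30.0, 30.0, 30.0],
    'page': ['java-ii/arrays', 'java-ii', 'mysql/tables',
             'stats/sampling', 'stats', 'java-ii'],
})

# The top-level section is the text before the first '/'.
toy['section'] = toy['page'].str.split('/').str[0]

# Hits per (id, section), then keep the most-visited section for each id.
hits = toy.groupby(['idk', 'section']).size().reset_index(name='hits')
dominant = (hits.sort_values('hits', ascending=False)
                .drop_duplicates('idk')
                .set_index('idk')['section']
                .sort_index())
print(dominant)
```

Counting how many ids have a data science section as their dominant one would then give the number to compare against the expected 2.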

From this, I will conclude that the 'idk' column represents a cohort id. 

Rename the column to `cohort`. Sort of: copy it to a new `cohort` column, then drop the original.

In [77]:
df['cohort'] = df['idk']
df.drop(columns=['idk'], inplace=True)

In [78]:
df.head()

Unnamed: 0,date,time,page,user,source_ip,cohort
0,2018-01-26,09:55:03,/,1,97.105.19.61,8.0
1,2018-01-26,09:56:02,java-ii,1,97.105.19.61,8.0
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,97.105.19.61,8.0
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,97.105.19.61,8.0
4,2018-01-26,09:56:24,javascript-i/conditionals,2,97.105.19.61,22.0
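
For comparison, a direct rename is sketched below on a tiny made-up frame (`toy` is hypothetical). Unlike copy-and-drop, which moves `cohort` to the end as in the output above, `rename` leaves the column where it was.

```python
import pandas as pd

# Toy frame (hypothetical) with the same column name as the log.
toy = pd.DataFrame({
    'user': [1, 2],
    'idk': [8.0, 22.0],
    'source_ip': ['97.105.19.61', '97.105.19.61'],
})

# rename returns a new frame; the column order is unchanged.
renamed = toy.rename(columns={'idk': 'cohort'})
print(list(renamed.columns))
```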


I'm going to explore the hypothesis that those with multiple cohorts are instructors.

In [79]:
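# Number of distinct cohort ids seen for each user.
user_cohorts = pd.DataFrame(df.groupby('user')['cohort'].nunique())
# Keep users with more than one cohort id; drop the count so only the user index remains.
instructors = user_cohorts[user_cohorts.cohort>1].drop(columns='cohort')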

df = df.set_index('user')


In [84]:
instructors = df.join(instructors, how='inner', on='user')

In [87]:
instructors.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44058 entries, 1 to 539
Data columns (total 5 columns):
date         44058 non-null object
time         44058 non-null object
page         44058 non-null object
source_ip    44058 non-null object
cohort       43835 non-null float64
dtypes: float64(1), object(4)
memory usage: 2.0+ MB
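
An alternative to the index join is a boolean filter with `isin`. This is a minimal sketch on made-up rows (`toy` is hypothetical), and it assumes `user` is still a regular column rather than the index.

```python
import pandas as pd

# Toy rows (hypothetical): user 1 appears under two cohort ids.
toy = pd.DataFrame({
    'user':   [1, 1, 2, 3, 3],
    'cohort': [8.0, 22.0, 22.0, 23.0, None],
    'page':   ['/', 'java-ii', 'java-ii', 'mysql', '/'],
})

# Distinct non-null cohort ids per user.
cohorts_per_user = toy.groupby('user')['cohort'].nunique()

# Users seen under more than one cohort id.
multi_cohort_users = cohorts_per_user[cohorts_per_user > 1].index

# All log rows belonging to those users.
multi_cohort_rows = toy[toy['user'].isin(multi_cohort_users)]
print(multi_cohort_rows)
```

As in the join above, rows with a missing cohort are kept for the selected users, which would explain why `cohort` has fewer non-null entries than the other columns.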


Clean up the data a bit by dropping the time component and counting hits per user, date, page, and source IP.

In [95]:
agg_df = instructors.reset_index().groupby(['user', 'date', 'page', 'source_ip']).count().reset_index()

In [97]:
# Use the 'time' tally as the hit count: 'cohort' has nulls, so counting it can undercount.
agg_df['count'] = agg_df['time']
agg_df.drop(columns=['time','cohort'], inplace=True)

In [98]:
agg_df.head()

Unnamed: 0,user,date,page,source_ip,count
0,1,2018-01-26,/,97.105.19.61,1
1,1,2018-01-26,java-i,97.105.19.61,1
2,1,2018-01-26,java-ii,97.105.19.61,1
3,1,2018-01-26,java-ii/object-oriented-programming,97.105.19.61,1
4,1,2018-01-26,javascript-i/functions,97.105.19.61,1
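
One thing to watch in this kind of aggregation: `.count()` tallies non-null values per column, while `.size()` tallies rows. The sketch below, on made-up rows (`toy` is hypothetical), shows a row-based hit count that does not depend on which columns contain nulls.

```python
import pandas as pd

# Toy rows (hypothetical); one row has a missing cohort.
toy = pd.DataFrame({
    'user': [1, 1, 1],
    'date': ['2018-01-26'] * 3,
    'time': ['09:55:03', '09:56:02', '09:57:10'],
    'page': ['/', 'java-ii', 'java-ii'],
    'source_ip': ['97.105.19.61'] * 3,
    'cohort': [8.0, 8.0, None],
})

# size() counts rows per group, so the null cohort does not lower the tally.
daily = (toy.groupby(['user', 'date', 'page', 'source_ip'])
            .size()
            .reset_index(name='count'))
print(daily)
```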


In [117]:
# Split each page path on '/' into one column per path segment.
agg_df = agg_df.join(agg_df.page.str.split('/', expand=True))

In [118]:
agg_df.head()

Unnamed: 0,user,date,page,source_ip,count,0,1,2,3,4,5,6,7
0,1,2018-01-26,/,97.105.19.61,1,,,,,,,,
1,1,2018-01-26,java-i,97.105.19.61,1,java-i,,,,,,,
2,1,2018-01-26,java-ii,97.105.19.61,1,java-ii,,,,,,,
3,1,2018-01-26,java-ii/object-oriented-programming,97.105.19.61,1,java-ii,object-oriented-programming,,,,,,
4,1,2018-01-26,javascript-i/functions,97.105.19.61,1,javascript-i,functions,,,,,,


In [111]:
page0[page0[1]>0]

0,1,2,3,4,5,6,7
javascript-i,1963,783,0,0,0,0,0
,1836,0,0,0,0,0,0
html-css,1554,1011,161,0,0,0,0
spring,1505,1340,123,0,0,0,0
mysql,1244,294,0,0,0,0,0
jquery,1132,586,0,0,0,0,0
java-iii,1129,0,0,0,0,0,0
appendix,952,612,269,0,0,0,0
java-ii,928,0,0,0,0,0,0
javascript-ii,775,0,0,0,0,0,0
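
The cell that defines `page0` is not shown. Judging by the output, it may have grouped the split path segments by the first segment (column `0`) and counted the non-null deeper segments. The sketch below is a guess under that assumption, run on made-up pages (`toy` is hypothetical), and is not the original cell.

```python
import pandas as pd

# Toy pages (hypothetical), standing in for the access log.
toy = pd.DataFrame({'page': ['/', 'java-ii', 'java-ii/arrays',
                             'mysql/tables/indexes', 'mysql/tables']})

# One column per path segment; shorter paths are padded with missing values.
parts = toy['page'].str.split('/', expand=True)

# Group on the first segment and count non-null values in the deeper columns,
# then order by how often a second segment is present.
page0 = parts.groupby(0).count().sort_values(1, ascending=False)
print(page0[page0[1] > 0])
```

Under this reading, the row with the empty label comes from the root page `/`, which splits into two empty strings.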
