# Project Work
---

## Questions to Answer:

### 1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?
- could create separate dataframes for each program
- group by `name` and look at most frequent `path` values
- look at this after grouping `path` values into modules.
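
The bullets above can be sketched on a tiny made-up frame that reuses the log column names (`program_name`, `name`, `path`). The data and the variable names `counts`, `top`, and `consistency` are hypothetical; the idea is to find each cohort's most-requested lesson and then count how many cohorts in a program share it.

```python
import pandas as pd

# Hypothetical toy logs; the real data comes from w.wrangle_logs()
toy = pd.DataFrame({
    "program_name": ["WebDev-Java"] * 6 + ["DataSci"] * 5,
    "name": ["Teddy", "Teddy", "Teddy", "Luna", "Luna", "Luna",
             "Bayes", "Bayes", "Curie", "Curie", "Curie"],
    "path": ["javascript-i/loops", "javascript-i/loops", "spring/setup",
             "javascript-i/loops", "javascript-i/loops", "mysql/tables",
             "classification/overview", "classification/overview",
             "classification/overview", "classification/overview",
             "sql/mysql-overview"],
})

# Hits per lesson within each cohort of each program
counts = (toy.groupby(["program_name", "name", "path"])
             .size()
             .rename("hits")
             .reset_index())

# Keep the single most-requested lesson for every cohort
top = (counts.sort_values("hits", ascending=False)
             .drop_duplicates(["program_name", "name"]))

# Number of cohorts per program that share the same top lesson
consistency = top.groupby(["program_name", "path"]).size()
print(consistency)
```

A lesson that tops most cohorts within a program is a reasonable candidate for "consistently most traffic"; ties within a cohort would need an explicit rule.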

### 3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?
- look for users with low count of requests between `start_date` and `end_date`
- we have their ip, start/end dates, and program type
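
A minimal sketch of that check, assuming a toy frame indexed by request time with the same columns as the logs. The cut-off `threshold` is a hypothetical value that would need tuning against the real distribution of request counts.

```python
import pandas as pd

# Hypothetical toy logs indexed by request time
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2],
        "ip": ["10.0.0.1"] * 3 + ["10.0.0.2"] * 2,
        "start_date": ["2020-01-06"] * 5,
        "end_date": ["2020-06-05"] * 5,
        "program_name": ["DataSci"] * 5,
    },
    index=pd.to_datetime([
        "2020-01-07 09:00", "2020-02-03 10:00", "2020-03-02 11:00",
        "2020-01-08 09:30", "2020-07-01 12:00",  # last hit falls after end_date
    ]),
)

# Keep only requests made while the student was enrolled
start = pd.to_datetime(toy["start_date"]).to_numpy()
end = pd.to_datetime(toy["end_date"]).to_numpy()
active = toy[(toy.index >= start) & (toy.index <= end)]

# Requests per user, with 0 for users who never appear while enrolled
requests = (active.groupby("user_id").size()
                  .reindex(toy["user_id"].unique(), fill_value=0))

threshold = 2  # hypothetical cut-off for "hardly accesses the curriculum"
low_activity = requests[requests < threshold]

# What we know about those users
info = (toy.loc[toy["user_id"].isin(low_activity.index),
                ["user_id", "ip", "start_date", "end_date", "program_name"]]
           .drop_duplicates())
print(info)
```

Because `end_date` converts to midnight, requests made later on the final day fall outside the window in this sketch.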

### 6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
- look for most frequent `path` values accessed after `end_date`
  - too many unique paths to get useful info
- group by module to see which modules are most often referenced by grads

### 7. Which lessons are least accessed?
- look for least common module
- do this after grouping `path` values into modules.
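
One way this could look once module labels exist, shown on made-up labels. The `min_hits` floor is a hypothetical threshold for screening out one-off mistyped URLs, which would otherwise dominate the bottom of the ranking.

```python
import pandas as pd

# Hypothetical module labels, one per request
modules = pd.Series(
    ["javascript-i"] * 6 + ["mysql"] * 4 + ["spring"] * 2 + ["sgithubtudents"]
)

counts = modules.value_counts()

min_hits = 2  # hypothetical floor to screen out one-off mistyped URLs
least = counts[counts >= min_hits].sort_values().head(2)
print(least)
```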

In [1]:
# imports
import wrangle as w
import pandas as pd

# wrangle data
logs = w.wrangle_logs()
# preview data
logs.head()

| | path | ip | user_id | name | program_id | start_date | end_date | program_name |
|---|---|---|---|---|---|---|---|---|
| 2018-01-26 09:55:03 | / | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP |
| 2018-01-26 09:56:02 | java-ii | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP |
| 2018-01-26 09:56:05 | java-ii/object-oriented-programming | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP |
| 2018-01-26 09:56:06 | slides/object_oriented_programming | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP |
| 2018-01-26 09:56:24 | javascript-i/conditionals | 97.105.19.61 | 2 | Teddy | 2 | 2018-01-08 | 2018-05-17 | WebDev-Java |


In [2]:
# check for any nulls
logs.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 847330 entries, 2018-01-26 09:55:03 to 2021-04-21 16:44:39
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   path          847329 non-null  object
 1   ip            847330 non-null  object
 2   user_id       847330 non-null  int64 
 3   name          847330 non-null  object
 4   program_id    847330 non-null  int64 
 5   start_date    847330 non-null  object
 6   end_date      847330 non-null  object
 7   program_name  847330 non-null  object
dtypes: int64(2), object(6)
memory usage: 58.2+ MB


In [3]:
# list the paths that were accessed only once
path_counts = logs.path.value_counts()
path_counts[path_counts < 2]

app                                        1
examples/css/..%c0%af                      1
student/120                                1
prework/cli/03-filepaths                   1
sql/database                               1
                                          ..
stats-assessment                           1
mysql//functions                           1
9_Appendix_TSAD_Lesson3                    1
further-reading/javascript/array-splice    1
examples/css//logincss                     1
Name: path, Length: 467, dtype: int64

In [4]:
# view number of unique pages accessed by students after their program has ended
logs[(logs.index > logs.end_date) & (logs.name != 'Staff') & (logs.path != '/')].path.nunique()

1376
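
`logs.info()` reports `start_date` and `end_date` as `object` (string) columns, so the filter above compares timestamps against strings. A hedged sketch of converting the column once with `pd.to_datetime` so the comparison is explicitly datetime against datetime; the toy frame is hypothetical and only mirrors the column names.

```python
import pandas as pd

# Hypothetical toy logs indexed by request time
toy = pd.DataFrame(
    {
        "path": ["/", "java-ii", "toc"],
        "end_date": ["2016-02-06", "2018-05-17", "2018-05-17"],
    },
    index=pd.to_datetime([
        "2018-01-26 09:55:03", "2018-01-26 09:56:24", "2018-06-01 08:00:00",
    ]),
)

# Parse the string dates into real datetimes
toy["end_date"] = pd.to_datetime(toy["end_date"])

# Requests made after the user's program ended
after_end = toy[toy.index > toy["end_date"].to_numpy()]
print(after_end)
```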

Since I'm only interested in student activity, I'll create a separate dataframe of student access logs and use that for my exploration.

In [5]:
# isolate student access logs
student_logs = logs[logs.name != 'Staff']
student_logs.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 763299 entries, 2018-01-26 09:55:03 to 2021-04-21 16:41:51
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   path          763298 non-null  object
 1   ip            763299 non-null  object
 2   user_id       763299 non-null  int64 
 3   name          763299 non-null  object
 4   program_id    763299 non-null  int64 
 5   start_date    763299 non-null  object
 6   end_date      763299 non-null  object
 7   program_name  763299 non-null  object
dtypes: int64(2), object(6)
memory usage: 52.4+ MB


In [6]:
# drop the one row that has a null path (.copy() makes student_logs an independent dataframe)
student_logs = student_logs.dropna().copy()


In [7]:
# split paths into lists
student_logs.path.str.split('/')

2018-01-26 09:55:03                                         [, ]
2018-01-26 09:56:02                                    [java-ii]
2018-01-26 09:56:05       [java-ii, object-oriented-programming]
2018-01-26 09:56:06        [slides, object_oriented_programming]
2018-01-26 09:56:24                 [javascript-i, conditionals]
                                         ...                    
2021-04-21 16:36:09                      [jquery, personal-site]
2021-04-21 16:36:34    [html-css, css-ii, bootstrap-grid-system]
2021-04-21 16:37:48                                   [java-iii]
2021-04-21 16:38:14                         [java-iii, servlets]
2021-04-21 16:41:51             [javascript-i, bom-and-dom, dom]
Name: path, Length: 763298, dtype: object

In [8]:
# label access logs by module (the first component of the path)
student_logs = student_logs.assign(module=student_logs.path.str.split('/').str[0])
student_logs


| | path | ip | user_id | name | program_id | start_date | end_date | program_name | module |
|---|---|---|---|---|---|---|---|---|---|
| 2018-01-26 09:55:03 | / | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP | |
| 2018-01-26 09:56:02 | java-ii | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP | java-ii |
| 2018-01-26 09:56:05 | java-ii/object-oriented-programming | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP | java-ii |
| 2018-01-26 09:56:06 | slides/object_oriented_programming | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP | slides |
| 2018-01-26 09:56:24 | javascript-i/conditionals | 97.105.19.61 | 2 | Teddy | 2 | 2018-01-08 | 2018-05-17 | WebDev-Java | javascript-i |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-04-21 16:36:09 | jquery/personal-site | 136.50.98.51 | 869 | Marco | 2 | 2021-01-25 | 2021-07-19 | WebDev-Java | jquery |
| 2021-04-21 16:36:34 | html-css/css-ii/bootstrap-grid-system | 104.48.214.211 | 948 | Neptune | 2 | 2021-03-15 | 2021-09-03 | WebDev-Java | html-css |
| 2021-04-21 16:37:48 | java-iii | 67.11.50.23 | 834 | Luna | 2 | 2020-12-07 | 2021-06-08 | WebDev-Java | java-iii |
| 2021-04-21 16:38:14 | java-iii/servlets | 67.11.50.23 | 834 | Luna | 2 | 2020-12-07 | 2021-06-08 | WebDev-Java | java-iii |


In [9]:
# number of unique paths containing 'java'
student_logs[student_logs.path.str.contains('java')].path.nunique()

191

In [10]:
# number of unique paths containing 'php'
student_logs[student_logs.path.str.contains('php')].path.nunique()

119

In [11]:
# number of unique paths containing 'data-sci'
student_logs[student_logs.path.str.contains('data-sci')].path.nunique()

16

In [12]:
# view number of unique paths
student_logs.path.nunique()

1844

There are 1,844 unique pages that were visited by Codeup students (in this dataset).

I want to create a separate dataframe for each program to try to figure out which pages belong to which program.

In [13]:
# create php dataframe
php = student_logs[student_logs.program_id == 1]
print(php.user_id.nunique())
php.head(2)

98


| | path | ip | user_id | name | program_id | start_date | end_date | program_name | module |
|---|---|---|---|---|---|---|---|---|---|
| 2018-01-26 09:55:03 | / | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP | |
| 2018-01-26 09:56:02 | java-ii | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP | java-ii |


In [14]:
# create java dataframe
java = student_logs[student_logs.program_id == 2]
print(java.user_id.nunique())
java.head(2)

682


| | path | ip | user_id | name | program_id | start_date | end_date | program_name | module |
|---|---|---|---|---|---|---|---|---|---|
| 2018-01-26 09:56:24 | javascript-i/conditionals | 97.105.19.61 | 2 | Teddy | 2 | 2018-01-08 | 2018-05-17 | WebDev-Java | javascript-i |
| 2018-01-26 09:56:41 | javascript-i/loops | 97.105.19.61 | 2 | Teddy | 2 | 2018-01-08 | 2018-05-17 | WebDev-Java | javascript-i |


In [15]:
# create data science dataframe
ds = student_logs[student_logs.program_id == 3]
print(ds.user_id.nunique())
ds.head(2)

111


| | path | ip | user_id | name | program_id | start_date | end_date | program_name | module |
|---|---|---|---|---|---|---|---|---|---|
| 2019-08-20 09:39:58 | / | 97.105.19.58 | 466 | Bayes | 3 | 2019-08-19 | 2020-01-30 | DataSci | |
| 2019-08-20 09:39:59 | / | 97.105.19.58 | 467 | Bayes | 3 | 2019-08-19 | 2020-01-30 | DataSci | |


In [16]:
# create front end dataframe
frontend = student_logs[student_logs.program_id == 4]
print(frontend.user_id.nunique())
frontend.head(2)

1


| | path | ip | user_id | name | program_id | start_date | end_date | program_name | module |
|---|---|---|---|---|---|---|---|---|---|
| 2018-03-22 19:01:49 | / | 207.68.209.17 | 152 | Apollo | 4 | 2015-03-30 | 2015-07-29 | FrontEnd | |
| 2018-03-22 19:01:54 | content/html-css | 207.68.209.17 | 152 | Apollo | 4 | 2015-03-30 | 2015-07-29 | FrontEnd | content |
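
The four cells above apply the same filter with a different `program_id` each time. As an alternative sketch, a single `groupby` can build every per-program frame and the user counts in one pass. The toy rows and the names `frames` and `users_per_program` are hypothetical.

```python
import pandas as pd

# Hypothetical toy logs with one or two rows per program
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 466, 152],
    "program_name": ["WebDev-PHP", "WebDev-Java", "WebDev-Java",
                     "DataSci", "FrontEnd"],
    "path": ["/", "javascript-i/loops", "toc", "/", "content/html-css"],
})

# One dataframe per program, keyed by program name
frames = {program: group for program, group in toy.groupby("program_name")}

# Unique users in each program
users_per_program = toy.groupby("program_name")["user_id"].nunique()
print(users_per_program)
```

Individual frames are then available as, for example, `frames["DataSci"]`.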


In [17]:
# view unique paths for php program
php.path.value_counts()

/                                      1681
index.html                             1011
javascript-i                            736
html-css                                542
spring                                  501
                                       ... 
content/loops.html                        1
appendix/extra-challenges/locations       1
content/examples/php/while.html           1
content/javascript/arrays/arrays          1
2-storytelling/2.1-understand             1
Name: path, Length: 710, dtype: int64

In [18]:
# view unique paths for java program
java.path.value_counts()

/                                                            29474
toc                                                          16517
javascript-i                                                 15640
search/search_index.json                                     13863
java-iii                                                     11290
                                                             ...  
sgithubtudents/1215                                              1
content/examples/examples/html/gitbook/images/favicon.ico        1
appendix/spring/authorization                                    1
coding-challenges                                                1
quic/115                                                         1
Name: path, Length: 1113, dtype: int64

In [19]:
# view unique paths for data science program
ds.path.value_counts()

/                                           8358
search/search_index.json                    2203
classification/overview                     1785
1-fundamentals/modern-data-scientist.jpg    1655
1-fundamentals/AI-ML-DL-timeline.jpg        1651
                                            ... 
12-distributed-ml/6.3-prepare-part-3           1
bad-charts                                     1
itc-ml                                         1
itc%20-%20ml                                   1
sql/database                                   1
Name: path, Length: 682, dtype: int64

In [20]:
# view unique paths for front end program
frontend.path.value_counts()

content/html-css                               2
content/html-css/introduction.html             1
/                                              1
content/html-css/gitbook/images/favicon.ico    1
Name: path, dtype: int64

In [21]:
# see how many cohorts were in front end program
frontend.name.value_counts()

Apollo    5
Name: name, dtype: int64

In [22]:
# view users in front end program
frontend.user_id.value_counts()

152    5
Name: user_id, dtype: int64

Only one user in the Front End program ever accessed the curriculum. I want to see whether anyone else in the Apollo cohort appears in the logs; if not, I won't explore this program, since its traffic is negligible compared to the others. The program also does not appear to have continued past Apollo, so it's unlikely our stakeholder would be interested in it.

In [23]:
# view curriculum access logs for Apollo cohort
student_logs[student_logs.name == 'Apollo']

| | path | ip | user_id | name | program_id | start_date | end_date | program_name | module |
|---|---|---|---|---|---|---|---|---|---|
| 2018-03-22 19:01:49 | / | 207.68.209.17 | 152 | Apollo | 4 | 2015-03-30 | 2015-07-29 | FrontEnd | |
| 2018-03-22 19:01:54 | content/html-css | 207.68.209.17 | 152 | Apollo | 4 | 2015-03-30 | 2015-07-29 | FrontEnd | content |
| 2018-03-22 19:01:54 | content/html-css/gitbook/images/favicon.ico | 207.68.209.17 | 152 | Apollo | 4 | 2015-03-30 | 2015-07-29 | FrontEnd | content |
| 2018-03-22 19:02:47 | content/html-css | 207.68.209.17 | 152 | Apollo | 4 | 2015-03-30 | 2015-07-29 | FrontEnd | content |
| 2018-03-22 19:02:52 | content/html-css/introduction.html | 207.68.209.17 | 152 | Apollo | 4 | 2015-03-30 | 2015-07-29 | FrontEnd | content |


In [24]:
# count student log entries whose path contains 'html-css'
student_logs.path.str.contains('html-css').sum()

77810

In [25]:
# see when php program ended
php.end_date.max()

'2017-09-22'

In [26]:
# see when java program started
java.start_date.min()

'2016-09-26'

In [27]:
# see latest end date in java program
java.end_date.max()

'2021-10-01'

In [28]:
# see earliest start date and latest end date in data science program
ds.start_date.min(), ds.end_date.max()

('2019-08-19', '2021-09-03')

The logs run through 2021-04-21, and only the WebDev-Java and Data Science programs have cohorts ending after that date, so these appear to be the only programs still being taught.
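
The separate `min`/`max` cells above could be collapsed into one summary table. This is a sketch on a few rows whose dates are copied from the previews earlier in the notebook; the output column names `first_start` and `last_end` are hypothetical.

```python
import pandas as pd

# A few (program, start, end) rows taken from the previews above
toy = pd.DataFrame({
    "program_name": ["WebDev-PHP", "WebDev-PHP", "WebDev-Java", "DataSci"],
    "start_date": ["2015-09-22", "2017-06-05", "2018-01-08", "2019-08-19"],
    "end_date": ["2016-02-06", "2017-09-22", "2018-05-17", "2020-01-30"],
})
toy["start_date"] = pd.to_datetime(toy["start_date"])
toy["end_date"] = pd.to_datetime(toy["end_date"])

# Earliest cohort start and latest cohort end for each program
spans = toy.groupby("program_name").agg(
    first_start=("start_date", "min"),
    last_end=("end_date", "max"),
)
print(spans)
```

Comparing `last_end` with the last timestamp in the logs then shows which programs still had cohorts in session.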

In [29]:
# see top 20 pages accessed by php
php.path.value_counts().head(20)

/                                                                            1681
index.html                                                                   1011
javascript-i                                                                  736
html-css                                                                      542
spring                                                                        501
java-iii                                                                      479
java-ii                                                                       454
java-i                                                                        444
javascript-ii                                                                 429
appendix                                                                      409
jquery                                                                        344
mysql                                                                         284
content/html-css

In [30]:
# see top 20 pages accessed by java
java.path.value_counts().head(20)

/                                                                            29474
toc                                                                          16517
javascript-i                                                                 15640
search/search_index.json                                                     13863
java-iii                                                                     11290
html-css                                                                     11285
java-ii                                                                      10459
spring                                                                        9973
jquery                                                                        9776
mysql                                                                         9423
java-i                                                                        9161
javascript-ii                                                                 8868
java

In [31]:
# see top 20 pages accessed by ds
ds.path.value_counts().head(20)

/                                                    8358
search/search_index.json                             2203
classification/overview                              1785
1-fundamentals/modern-data-scientist.jpg             1655
1-fundamentals/AI-ML-DL-timeline.jpg                 1651
1-fundamentals/1.1-intro-to-data-science             1633
classification/scale_features_or_not.svg             1590
fundamentals/AI-ML-DL-timeline.jpg                   1443
fundamentals/modern-data-scientist.jpg               1438
sql/mysql-overview                                   1424
fundamentals/intro-to-data-science                   1413
6-regression/1-overview                              1124
anomaly-detection/AnomalyDetectionCartoon.jpeg        829
anomaly-detection/overview                            804
10-anomaly-detection/AnomalyDetectionCartoon.jpeg     754
10-anomaly-detection/1-overview                       751
3-sql/1-mysql-overview                                707
1-fundamentals

In [32]:
# view top 10 most accessed modules for data science
ds.module.value_counts().head(10)

fundamentals      8746
classification    8620
                  8358
1-fundamentals    7945
sql               7505
3-sql             6165
python            5599
4-python          4856
6-regression      4562
appendix          3944
Name: module, dtype: int64

The blank module is just the main curriculum page, so I don't need to look into it any further. I am curious about what is being viewed in the appendix, so I'll investigate that a little more.
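
The counts above also list some modules under two labels, with and without a numeric prefix (`fundamentals` and `1-fundamentals`, `sql` and `3-sql`, `python` and `4-python`). Assuming each pair refers to the same module, which the logs alone do not confirm, a sketch of merging them by stripping the leading number:

```python
import pandas as pd

# Counts copied from the value_counts output above
counts = pd.Series({
    "fundamentals": 8746, "1-fundamentals": 7945,
    "sql": 7505, "3-sql": 6165,
    "python": 5599, "4-python": 4856,
})

# Assumption: "<number>-<name>" and "<name>" are the same module
normalized = counts.index.str.replace(r"^\d+-", "", regex=True)
merged = counts.groupby(normalized).sum().sort_values(ascending=False)
print(merged)
```

The same `str.replace` call could be applied to the `module` column itself before counting.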

In [45]:
# look at appendix pages
ds[ds.module == 'appendix']

| | path | ip | user_id | name | program_id | start_date | end_date | program_name | module |
|---|---|---|---|---|---|---|---|---|---|
| 2019-08-20 09:40:50 | appendix/cli-git-overview | 97.105.19.58 | 479 | Bayes | 3 | 2019-08-19 | 2020-01-30 | DataSci | appendix |
| 2019-08-21 13:55:51 | appendix/cli-git-overview | 97.105.19.58 | 467 | Bayes | 3 | 2019-08-19 | 2020-01-30 | DataSci | appendix |
| 2019-08-22 09:40:29 | appendix/interview_questions_students | 97.105.19.58 | 473 | Bayes | 3 | 2019-08-19 | 2020-01-30 | DataSci | appendix |
| 2019-08-23 08:21:37 | appendix/cli-git-overview | 97.105.19.58 | 470 | Bayes | 3 | 2019-08-19 | 2020-01-30 | DataSci | appendix |
| 2019-08-23 16:39:09 | appendix/cli-git-overview | 97.105.19.58 | 470 | Bayes | 3 | 2019-08-19 | 2020-01-30 | DataSci | appendix |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-04-20 19:06:59 | appendix/professional-development/t-block-less... | 185.247.70.173 | 580 | Curie | 3 | 2020-02-03 | 2020-07-07 | DataSci | appendix |
| 2021-04-20 21:18:22 | appendix/postwork | 148.66.39.72 | 845 | Easley | 3 | 2020-12-07 | 2021-06-08 | DataSci | appendix |
| 2021-04-20 21:18:26 | appendix/ds-environment-setup | 148.66.39.72 | 845 | Easley | 3 | 2020-12-07 | 2021-06-08 | DataSci | appendix |
| 2021-04-21 08:30:37 | appendix/professional-development/vertical-resume | 172.58.111.66 | 843 | Easley | 3 | 2020-12-07 | 2021-06-08 | DataSci | appendix |


I'm more interested in which appendix pages are being viewed than in the fact that the appendix was accessed, so I want to re-label these rows with the second component of the path.

In [47]:
import numpy as np

In [50]:
np.where(ds.module == 'appendix', [lst[1] for lst in ds[ds.module == 'appendix']], ds.module)

ValueError: operands could not be broadcast together with shapes (103411,) (9,) (103411,) 
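
The `ValueError` comes from the list comprehension: iterating over a DataFrame yields its column labels rather than its rows, so the comprehension produces one item per column (9) instead of one per row (103411), and `np.where` cannot broadcast the two. A sketch of one possible fix on a hypothetical toy frame, taking the second path component from the `path` column of every row and falling back to the existing label when there is none:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with the same two columns the relabelling needs
toy = pd.DataFrame({
    "path": ["appendix/cli-git-overview", "appendix/postwork",
             "sql/mysql-overview", "appendix"],
    "module": ["appendix", "appendix", "sql", "appendix"],
})

# Iterating over a DataFrame gives column labels, not rows
print(list(toy))

# Second path component for every row (NaN when the path has only one part)
second = toy["path"].str.split("/").str[1]

# Relabel appendix rows; all three arguments now have one entry per row
toy["module"] = np.where(
    toy["module"] == "appendix",
    second.fillna(toy["module"]),
    toy["module"],
)
print(toy)
```

The `fillna` step keeps a bare `appendix` request labelled `appendix` instead of turning it into a missing value.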

In [33]:
# view top 10 most accessed modules for php
php.module.value_counts().head(10)

content         6397
javascript-i    3708
html-css        2463
spring          2324
mysql           2067
java-iii        1953
                1681
java-ii         1572
jquery          1526
java-i          1456
Name: module, dtype: int64

In [43]:
# look at logs for 'content'
php[php.module == 'content']

| | path | ip | user_id | name | program_id | start_date | end_date | program_name | module |
|---|---|---|---|---|---|---|---|---|---|
| 2018-01-26 12:38:12 | content/javascript/objects/math.html | 192.171.117.210 | 37 | Quincy | 1 | 2017-06-05 | 2017-09-22 | WebDev-PHP | content |
| 2018-01-26 12:38:19 | content/laravel/quickstart/sessions.html | 192.171.117.210 | 37 | Quincy | 1 | 2017-06-05 | 2017-09-22 | WebDev-PHP | content |
| 2018-01-27 07:27:52 | content/mysql/intro-to-mysql/users.html | 72.179.161.39 | 51 | Kings | 1 | 2016-05-23 | 2016-09-15 | WebDev-PHP | content |
| 2018-01-27 07:28:04 | content/html-css | 72.179.161.39 | 51 | Kings | 1 | 2016-05-23 | 2016-09-15 | WebDev-PHP | content |
| 2018-01-27 07:28:09 | content/html-css/elements.html | 72.179.161.39 | 51 | Kings | 1 | 2016-05-23 | 2016-09-15 | WebDev-PHP | content |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-04-10 00:20:55 | content/javascript/conditionals.html | 72.179.168.148 | 51 | Kings | 1 | 2016-05-23 | 2016-09-15 | WebDev-PHP | content |
| 2021-04-10 00:21:15 | content/javascript/loops.html | 72.179.168.148 | 51 | Kings | 1 | 2016-05-23 | 2016-09-15 | WebDev-PHP | content |
| 2021-04-10 00:28:10 | content/javascript/javascript-with-html.html | 136.50.29.193 | 80 | Lassen | 1 | 2016-07-18 | 2016-11-10 | WebDev-PHP | content |
| 2021-04-10 00:28:11 | content/javascript/conditionals.html | 136.50.29.193 | 80 | Lassen | 1 | 2016-07-18 | 2016-11-10 | WebDev-PHP | content |


In [34]:
# view top 10 most accessed modules for java
java.module.value_counts().head(10)

javascript-i     103817
html-css          73679
mysql             71665
jquery            52554
java-iii          47857
spring            47583
java-ii           47126
java-i            35611
javascript-ii     33418
                  29474
Name: module, dtype: int64

In [36]:
# number of cohorts in php program
php.name.nunique()

13

In [37]:
# number of cohorts in java program
java.name.nunique()

27

In [38]:
# number of cohorts in data science program
ds.name.nunique()

5

In [42]:
# number of cohorts in front end program
frontend.name.nunique()

1

In [39]:
# total number of student cohorts across all programs
student_logs.name.nunique()

46

In [40]:
student_logs

| | path | ip | user_id | name | program_id | start_date | end_date | program_name | module |
|---|---|---|---|---|---|---|---|---|---|
| 2018-01-26 09:55:03 | / | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP | |
| 2018-01-26 09:56:02 | java-ii | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP | java-ii |
| 2018-01-26 09:56:05 | java-ii/object-oriented-programming | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP | java-ii |
| 2018-01-26 09:56:06 | slides/object_oriented_programming | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP | slides |
| 2018-01-26 09:56:24 | javascript-i/conditionals | 97.105.19.61 | 2 | Teddy | 2 | 2018-01-08 | 2018-05-17 | WebDev-Java | javascript-i |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-04-21 16:36:09 | jquery/personal-site | 136.50.98.51 | 869 | Marco | 2 | 2021-01-25 | 2021-07-19 | WebDev-Java | jquery |
| 2021-04-21 16:36:34 | html-css/css-ii/bootstrap-grid-system | 104.48.214.211 | 948 | Neptune | 2 | 2021-03-15 | 2021-09-03 | WebDev-Java | html-css |
| 2021-04-21 16:37:48 | java-iii | 67.11.50.23 | 834 | Luna | 2 | 2020-12-07 | 2021-06-08 | WebDev-Java | java-iii |
| 2021-04-21 16:38:14 | java-iii/servlets | 67.11.50.23 | 834 | Luna | 2 | 2020-12-07 | 2021-06-08 | WebDev-Java | java-iii |


In [35]:
# isolate student access logs after program end
after_grad = student_logs[student_logs.end_date < student_logs.index]
after_grad

| | path | ip | user_id | name | program_id | start_date | end_date | program_name | module |
|---|---|---|---|---|---|---|---|---|---|
| 2018-01-26 09:55:03 | / | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP | |
| 2018-01-26 09:56:02 | java-ii | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP | java-ii |
| 2018-01-26 09:56:05 | java-ii/object-oriented-programming | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP | java-ii |
| 2018-01-26 09:56:06 | slides/object_oriented_programming | 97.105.19.61 | 1 | Hampton | 1 | 2015-09-22 | 2016-02-06 | WebDev-PHP | slides |
| 2018-01-26 10:14:47 | / | 97.105.19.61 | 11 | Arches | 1 | 2014-02-04 | 2014-04-22 | WebDev-PHP | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-04-21 15:20:12 | classification/classical_programming_vs_machin... | 96.8.130.134 | 692 | Darden | 3 | 2020-07-13 | 2021-01-12 | DataSci | classification |
| 2021-04-21 15:20:12 | classification/scale_features_or_not.svg | 96.8.130.134 | 692 | Darden | 3 | 2020-07-13 | 2021-01-12 | DataSci | classification |
| 2021-04-21 15:20:14 | classification/project | 96.8.130.134 | 692 | Darden | 3 | 2020-07-13 | 2021-01-12 | DataSci | classification |
| 2021-04-21 15:20:18 | classification/acquire | 96.8.130.134 | 692 | Darden | 3 | 2020-07-13 | 2021-01-12 | DataSci | classification |
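
With the post-graduation rows isolated, a possible next step for question 6 is to rank modules within each program. This is a sketch on a hypothetical stand-in for `after_grad`; it drops the blank module (the curriculum landing page) before counting.

```python
import pandas as pd

# Hypothetical stand-in for after_grad with only the columns needed here
after_grad_toy = pd.DataFrame({
    "program_name": ["WebDev-Java"] * 4 + ["DataSci"] * 3,
    "module": ["spring", "spring", "mysql", "",
               "classification", "classification", "sql"],
})

# The blank module is only the landing page, so leave it out
lessons = after_grad_toy[after_grad_toy["module"] != ""]

# Hits per module within each program, most-visited first
ranked = (lessons.groupby(["program_name", "module"]).size()
                 .rename("hits")
                 .reset_index()
                 .sort_values(["program_name", "hits"], ascending=[True, False]))

# Top module per program; use head(n) with a larger n for a top-n list
top_modules = ranked.groupby("program_name").head(1)
print(top_modules)
```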
