Email to analyst:


Hello,


I have some questions that I need answered before the board meeting on Wednesday afternoon, and I need to be able to speak to each of them. I also need a single slide that I can incorporate into my existing presentation (Google Slides) that summarizes the most important points. My questions are listed below; however, if you discover anything else important that I didn’t think to ask, please include that as well.


1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?
2. Is there a cohort that referred to a lesson significantly more than other cohorts, which seemed to gloss over it?
3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?
4. Is there any suspicious activity, such as users, machines, etc. accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?
5. At some point in 2019, the ability for students and alumni to access both curriculums (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?
6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
7. Which lessons are least accessed?
8. Anything else I should be aware of?


Thank you,


___________________


Other info:


• To get 100 on this project you only need to answer 5 out of the 7 questions (along with the other deliverables listed below, i.e. the slide, your notebook, etc.).
• Send your email before the due date and time to datascience@codeup.com (only one team member needs to do this on behalf of the whole team).
• Submit a link to a final notebook on GitHub that asks and answers questions. Document the work you do to justify your findings.
• Compose an email with the answers to the questions/your findings, and in the email, include the link to your notebook on GitHub and attach your slide.
• You will not present this, so be sure that the details you need your leader to convey/understand are clearly communicated in the email.
• Your slide should be like an executive summary and be in a form ready to present.
• Continue to use best practices of acquire.py, prepare.py, etc.
• Since there is no modeling to be done for this project, there is no need to split the data into train/validate/test.
• The cohort schedule is in the SQL database, and alumni.codeup.com has info as well.
• The Teamwork with Git handout is posted in Google Classroom.

In [1]:
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import os
import env
import wrangle as w

In [2]:
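# Import the .txt file and convert it to a DataFrame.
# A regex separator needs the python parsing engine; the raw string avoids an invalid-escape warning.
df = pd.read_table("anonymized-curriculum-access.txt", sep=r'\s', header=None, engine='python',
                   names=['date', 'time', 'page', 'id', 'cohort', 'ip'])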

In [3]:
# Let's examine the head of the dataframe to make sure it's
# what we were expecting
df.head()

,date,time,page,id,cohort,ip
0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61
1,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900223 entries, 0 to 900222
Data columns (total 6 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   date    900223 non-null  object 
 1   time    900223 non-null  object 
 2   page    900222 non-null  object 
 3   id      900223 non-null  int64  
 4   cohort  847330 non-null  float64
 5   ip      900223 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 41.2+ MB


In [5]:
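# Combine the date and time strings into a single datetime column
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])

# Set the index to that datetime and then sort the index chronologically
df = df.set_index(['datetime']).sort_index()

# The original string columns are now redundant
df = df.drop(columns=['date', 'time'])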
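
With a sorted DatetimeIndex in place, time-based summaries become one-liners. The following is a minimal sketch on made-up toy rows (not the real log), showing the same index construction followed by a daily hit count via `resample`. The `toy` and `daily_hits` names are illustrative only.

```python
import pandas as pd

# Toy rows standing in for the access log (values are made up).
toy = pd.DataFrame({
    "date": ["2018-01-26", "2018-01-26", "2018-01-27"],
    "time": ["09:55:03", "09:56:02", "10:00:00"],
    "page": ["/", "java-ii", "spring"],
})

# Same steps as the cell above: build a datetime, index by it, sort, drop the parts.
toy["datetime"] = pd.to_datetime(toy["date"] + " " + toy["time"])
toy = toy.set_index("datetime").sort_index().drop(columns=["date", "time"])

# Count page requests per calendar day.
daily_hits = toy["page"].resample("D").count()
print(daily_hits)
```

A daily series like this is one possible starting point for spotting unusual spikes in traffic.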

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 900223 entries, 2018-01-26 09:55:03 to 2021-04-21 16:44:39
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   page    900222 non-null  object 
 1   id      900223 non-null  int64  
 2   cohort  847330 non-null  float64
 3   ip      900223 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 34.3+ MB


In [7]:
# Drop every row with a null in any column (mostly rows with no cohort)
df = df.dropna()

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 847329 entries, 2018-01-26 09:55:03 to 2021-04-21 16:44:39
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   page    847329 non-null  object 
 1   id      847329 non-null  int64  
 2   cohort  847329 non-null  float64
 3   ip      847329 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 32.3+ MB
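
The `dropna()` call removes roughly 53,000 rows, almost all of them because `cohort` is null. Before discarding them it may be worth looking at what they contain, since requests with no cohort could matter for the suspicious-activity question. A hedged sketch on toy data (the `toy`, `null_counts`, `no_cohort`, and `cleaned` names are illustrative):

```python
import pandas as pd

# Toy frame: one row missing a page, one row missing a cohort (values are made up).
toy = pd.DataFrame({
    "page": ["/", None, "toc"],
    "cohort": [8.0, 22.0, None],
})

# Count nulls per column before deciding what to drop.
null_counts = toy.isna().sum()

# Keep the cohort-less rows aside for separate inspection.
no_cohort = toy[toy["cohort"].isna()]

# Drop only rows where the page itself is missing.
cleaned = toy.dropna(subset=["page"])
print(null_counts)
```

Using `subset=` keeps the cohort-less rows available for later analysis instead of silently discarding them.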


In [9]:
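# Preview the dtypes with cohort cast to object.
# The result is not assigned, so df itself keeps cohort as float64.
df.astype({'cohort': 'object'}).dtypes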

page      object
id         int64
cohort    object
ip        object
dtype: object
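
Note that `astype` returns a new DataFrame rather than modifying the original, so the cast shown above only persists if the result is assigned back. A small sketch on toy data illustrating the difference (all names here are illustrative):

```python
import pandas as pd

# Toy frame standing in for the access log (values are made up).
toy = pd.DataFrame({"page": ["/", "java-ii"], "cohort": [8.0, 22.0]})

# astype returns a copy; without assignment the original is unchanged.
preview = toy.astype({"cohort": "object"})
dtype_before = toy["cohort"].dtype   # still float64

# Assign the result back to persist the cast.
toy = toy.astype({"cohort": "object"})
dtype_after = toy["cohort"].dtype    # now object
print(dtype_before, dtype_after)
```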

In [10]:
page_views = df.groupby(['cohort'])['page'].agg(['count', 'nunique'])

In [11]:
page_views.sort_values(by='count', ascending=False)

cohort,count,nunique
28.0,84031,1404
33.0,40730,301
29.0,38096,317
62.0,37109,288
53.0,36902,258
24.0,35636,377
57.0,33844,296
56.0,33568,273
51.0,32888,288
59.0,32015,420


In [12]:
# Map cohort IDs to program names.
# Note: key 59 appears twice in this literal; Python keeps the last value,
# so cohort 59 is labeled 'Java' here. Cohorts absent from the map (e.g. 28) become NaN.
dict_map = {34:'Data Science', 55:'Data Science', 59:'Data Science', 133:'Data Science', 137:'Data Science',
            9: 'Front End',8: 'PHP', 1: 'PHP',19:'PHP',13:'PHP',14:'PHP',7:'PHP',4:'PHP',12:'PHP',17:'PHP',
            2:'PHP',11:'PHP',6:'PHP',5:'PHP',22:'Java',21:'Java',16:'Java',18:'Java',15:'Java',23:'Java',
            24:'Java',25:'Java',26:'Java',27:'Java',29:'Java',31:'Java',32:'Java',33:'Java',51:'Java',
            52:'Java',53:'Java',56:'Java',57:'Java',58:'Java',59:'Java',61:'Java',62:'Java',132:'Java',
            134:'Java',135:'Java',138:'Java',139:'Java'}

df['program'] = df['cohort'].map(dict_map)
print(df)


                                                                 page   id  \
datetime                                                                     
2018-01-26 09:55:03                                                 /    1   
2018-01-26 09:56:02                                           java-ii    1   
2018-01-26 09:56:05               java-ii/object-oriented-programming    1   
2018-01-26 09:56:06                slides/object_oriented_programming    1   
2018-01-26 09:56:24                         javascript-i/conditionals    2   
...                                                               ...  ...   
2021-04-21 16:41:51                              jquery/personal-site   64   
2021-04-21 16:42:02                                 jquery/mapbox-api   64   
2021-04-21 16:42:09                           jquery/ajax/weather-map   64   
2021-04-21 16:44:37  anomaly-detection/discrete-probabilistic-methods  744   
2021-04-21 16:44:39                                 jquery/mapbo
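
Two behaviors of the mapping step are easy to miss: a key repeated in a dict literal keeps only its last value, and `Series.map` yields NaN for any value with no matching key. The sketch below demonstrates both on a tiny made-up mapping (the `mapping`, `cohorts`, and `programs` names are illustrative, and the labels are not a claim about which program any cohort truly belongs to):

```python
import pandas as pd

# A repeated key in a dict literal silently keeps only the last value.
mapping = {59: "Data Science", 34: "Data Science", 59: "Java"}

# Cohort IDs stored as floats, as in the access log; 28 has no entry in the mapping.
cohorts = pd.Series([34.0, 59.0, 28.0], name="cohort")
programs = cohorts.map(mapping)

print(mapping[59])   # the later 'Java' entry wins
print(programs)      # unmapped cohort 28 becomes NaN
```

Checking `programs.isna().sum()` after mapping is a quick way to catch cohorts that were left out of the dictionary.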

In [13]:
df.head()

datetime,page,id,cohort,ip,program
2018-01-26 09:55:03,/,1,8.0,97.105.19.61,PHP
2018-01-26 09:56:02,java-ii,1,8.0,97.105.19.61,PHP
2018-01-26 09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61,PHP
2018-01-26 09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61,PHP
2018-01-26 09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61,Java


In [14]:
program_views = df.groupby(['program'])['page'].agg(['count', 'nunique'])

In [15]:
program_views.sort_values(by='count', ascending=False)

program,count,nunique
Java,661349,1458
Data Science,71396,643
PHP,30548,710
Front End,5,4


In [16]:
df.groupby('program').describe()

,id,id,id,id,id,id,id,id,cohort,cohort,cohort,cohort,cohort,cohort,cohort,cohort
program,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Data Science,71396.0,626.892207,175.372847,143.0,479.0,581.0,840.0,949.0,71396.0,73.104039,43.548679,34.0,34.0,55.0,133.0,137.0
Front End,5.0,152.0,0.0,152.0,152.0,152.0,152.0,152.0,5.0,9.0,0.0,9.0,9.0,9.0,9.0,9.0
Java,661349.0,479.226844,234.171409,2.0,293.0,495.0,667.0,981.0,661349.0,50.217583,31.337342,15.0,27.0,51.0,58.0,139.0
PHP,30548.0,116.277301,119.595791,1.0,53.0,64.0,156.0,952.0,30548.0,10.237724,6.425684,1.0,1.0,14.0,14.0,19.0


In [17]:
df.groupby('cohort')['page'].value_counts()

cohort  page                                       
1.0     /                                              626
        javascript-i                                   294
        html-css                                       215
        javascript-ii                                  204
        spring                                         192
                                                      ... 
139.0   java-iii/servlets                                1
        javascript-i/bom-and-dom/dom                     1
        javascript-i/objects                             1
        javascript-i/objects/math                        1
        jquery/essential-methods/attributes-and-css      1
Name: page, Length: 13565, dtype: int64

In [18]:
df.groupby('cohort')['page'].count()

cohort
1.0       8890
2.0         93
4.0          4
5.0          1
6.0         72
7.0        598
8.0       1712
9.0          5
11.0       253
12.0       302
13.0      2845
14.0      9587
15.0       691
16.0       755
17.0      4954
18.0      2158
19.0      1237
21.0      7444
22.0     30926
23.0     28534
24.0     35636
25.0     25586
26.0     27749
27.0     20743
28.0     84031
29.0     38096
31.0     25359
32.0     29356
33.0     40730
34.0     26538
51.0     32888
52.0     28033
53.0     36902
55.0     21581
56.0     33568
57.0     33844
58.0     29855
59.0     32015
61.0     17713
62.0     37109
132.0    23691
133.0    14715
134.0    16623
135.0    16397
137.0     8562
138.0     7276
139.0     1672
Name: page, dtype: int64

In [19]:
df.groupby('cohort', as_index=False).agg({"page": "max"})

,cohort,page
0,1.0,uploads/5762c2946250b.jpg
1,2.0,toc
2,4.0,prework/versioning/github
3,5.0,/
4,6.0,spring/setup
5,7.0,toc
6,8.0,uploads/58a217a705bde.jpg
7,9.0,content/html-css/introduction.html
8,11.0,toc
9,12.0,toc


In [20]:
df.groupby('program')['page'].count()

program
Data Science     71396
Front End            5
Java            661349
PHP              30548
Name: page, dtype: int64

In [21]:
df.groupby('program', as_index=False).agg({"page": "max"})

,program,page
0,Data Science,where
1,Front End,content/html-css/introduction.html
2,Java,wp-login
3,PHP,web-design/ux/purpose


In [22]:
df.groupby('program', as_index=False).agg({"page": "min"})

,program,page
0,Data Science,%20https://github.com/RaulCPena
1,Front End,/
2,Java,.git
3,PHP,/


In [23]:
df.groupby('program')['page'].value_counts()

program       page                                    
Data Science  /                                           5378
              search/search_index.json                    1539
              1-fundamentals/modern-data-scientist.jpg    1185
              1-fundamentals/AI-ML-DL-timeline.jpg        1181
              1-fundamentals/1.1-intro-to-data-science    1173
                                                          ... 
PHP           students                                       1
              students/468/notes                             1
              students/units/75/sub_units/268                1
              teams/13                                       1
              uploads/58a217a705bde.jpg                      1
Name: page, Length: 2815, dtype: int64
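
The raw counts above are dominated by the homepage (`/`), the search index, and image files, none of which are lessons. To answer the "most-visited lesson per program" question, those requests probably need to be filtered out first. The sketch below shows one possible heuristic on toy data; the filter rules (`/`, `search/search_index.json`, and image/JSON extensions) are assumptions, not a definitive definition of a lesson page.

```python
import pandas as pd

# Toy log: program and requested page (values are made up).
toy = pd.DataFrame({
    "program": ["Java"] * 5 + ["Data Science"] * 4,
    "page": [
        "/", "/", "javascript-i", "javascript-i", "spring",
        "/", "search/search_index.json",
        "1-fundamentals/1.1-intro-to-data-science",
        "1-fundamentals/1.1-intro-to-data-science",
    ],
})

# Assumed heuristic: treat the homepage, search index, and static files as non-lessons.
is_noise = (
    toy["page"].isin(["/", "search/search_index.json"])
    | toy["page"].str.contains(r"\.(?:jpe?g|png|json)$")
)
lessons = toy[~is_noise]

# Most frequently requested remaining page within each program.
top_lesson = lessons.groupby("program")["page"].agg(lambda s: s.value_counts().idxmax())
print(top_lesson)
```

Swapping `idxmax()` for `idxmin()` on the same filtered frame gives a first pass at the least-accessed lessons.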

In [24]:
stop

NameError: name 'stop' is not defined

In [26]:
w.convert_txt_data()

,date,time,page,user_id,cohort_id,ip
0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61
1,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61
...,...,...,...,...,...,...
900218,2021-04-21,16:41:51,jquery/personal-site,64,28.0,71.150.217.33
900219,2021-04-21,16:42:02,jquery/mapbox-api,64,28.0,71.150.217.33
900220,2021-04-21,16:42:09,jquery/ajax/weather-map,64,28.0,71.150.217.33
900221,2021-04-21,16:44:37,anomaly-detection/discrete-probabilistic-methods,744,28.0,24.160.137.86


In [27]:
w.get_cohort_data()

,cohort_id,name,start_date,end_date,program_id
0,1,Arches,2014-02-04,2014-04-22,1
1,2,Badlands,2014-06-04,2014-08-22,1
2,3,Carlsbad,2014-09-04,2014-11-05,1
3,4,Denali,2014-10-20,2015-01-18,1
4,5,Everglades,2014-11-18,2015-02-24,1
5,6,Franklin,2015-02-03,2015-05-26,1
6,7,Glacier,2015-06-05,2015-10-06,1
7,8,Hampton,2015-09-22,2016-02-06,1
8,9,Apollo,2015-03-30,2015-07-29,4
9,10,Balboa,2015-11-03,2016-03-11,4


In [28]:
df = w.acquire_anonymized_curriculum_access_data()

In [29]:
w.add_program_type(df)

,date,time,page,user_id,cohort_id,ip,name,start_date,end_date,program_id
0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,PHP
1,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,PHP
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,PHP
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,PHP
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61,Teddy,2018-01-08,2018-05-17,Java
...,...,...,...,...,...,...,...,...,...,...
900218,2021-04-21,16:41:51,jquery/personal-site,64,28.0,71.150.217.33,Staff,2014-02-04,2014-02-04,Java
900219,2021-04-21,16:42:02,jquery/mapbox-api,64,28.0,71.150.217.33,Staff,2014-02-04,2014-02-04,Java
900220,2021-04-21,16:42:09,jquery/ajax/weather-map,64,28.0,71.150.217.33,Staff,2014-02-04,2014-02-04,Java
900221,2021-04-21,16:44:37,anomaly-detection/discrete-probabilistic-methods,744,28.0,24.160.137.86,Staff,2014-02-04,2014-02-04,Java


In [30]:
w.convert_datetimes(df)

date,date,time,page,user_id,cohort_id,ip,name,start_date,end_date,program_id
2018-01-26,2018-01-26,09:55:03,/,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,PHP
2018-01-26,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,PHP
2018-01-26,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,PHP
2018-01-26,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,PHP
2018-01-26,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61,Teddy,2018-01-08,2018-05-17,Java
...,...,...,...,...,...,...,...,...,...,...
2021-04-21,2021-04-21,16:41:51,jquery/personal-site,64,28.0,71.150.217.33,Staff,2014-02-04,2014-02-04,Java
2021-04-21,2021-04-21,16:42:02,jquery/mapbox-api,64,28.0,71.150.217.33,Staff,2014-02-04,2014-02-04,Java
2021-04-21,2021-04-21,16:42:09,jquery/ajax/weather-map,64,28.0,71.150.217.33,Staff,2014-02-04,2014-02-04,Java
2021-04-21,2021-04-21,16:44:37,anomaly-detection/discrete-probabilistic-methods,744,28.0,24.160.137.86,Staff,2014-02-04,2014-02-04,Java


In [31]:
w.clean_the_data(df)

date,date,time,page,user_id,cohort_id,ip,name,start_date,end_date,program_id
2018-01-26,2018-01-26,09:55:03,/,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,PHP
2018-01-26,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,PHP
2018-01-26,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,PHP
2018-01-26,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,PHP
2018-01-26,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61,Teddy,2018-01-08,2018-05-17,Java
...,...,...,...,...,...,...,...,...,...,...
2021-04-21,2021-04-21,16:41:51,jquery/personal-site,64,28.0,71.150.217.33,Staff,2014-02-04,2014-02-04,Java
2021-04-21,2021-04-21,16:42:02,jquery/mapbox-api,64,28.0,71.150.217.33,Staff,2014-02-04,2014-02-04,Java
2021-04-21,2021-04-21,16:42:09,jquery/ajax/weather-map,64,28.0,71.150.217.33,Staff,2014-02-04,2014-02-04,Java
2021-04-21,2021-04-21,16:44:37,anomaly-detection/discrete-probabilistic-methods,744,28.0,24.160.137.86,Staff,2014-02-04,2014-02-04,Java


In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900223 entries, 0 to 900222
Data columns (total 10 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   date        900223 non-null  datetime64[ns]
 1   time        900223 non-null  object        
 2   page        900222 non-null  object        
 3   user_id     900223 non-null  int64         
 4   cohort_id   847330 non-null  float64       
 5   ip          900223 non-null  object        
 6   name        847330 non-null  object        
 7   start_date  847330 non-null  datetime64[ns]
 8   end_date    847330 non-null  datetime64[ns]
 9   program_id  900223 non-null  object        
dtypes: datetime64[ns](3), float64(1), int64(1), object(5)
memory usage: 68.7+ MB


In [33]:
page_views = df.groupby(['cohort_id'])['page'].agg(['count', 'nunique'])

In [34]:
page_views.sort_values(by='count', ascending=False)

cohort_id,count,nunique
28.0,84031,1404
33.0,40730,301
29.0,38096,317
62.0,37109,288
53.0,36902,258
24.0,35636,377
57.0,33844,296
56.0,33568,273
51.0,32888,288
59.0,32015,420


In [35]:
program_views = df.groupby(['program_id'])['page'].agg(['count', 'nunique'])

In [36]:
program_views.sort_values(by='count', ascending=False)

program_id,count,nunique
Java,713365,1913
Data Science,103411,682
,52893,1112
PHP,30548,710
Front End,5,4


In [37]:
df.groupby('cohort_id')['page'].value_counts()

cohort_id  page                                       
1.0        /                                              626
           javascript-i                                   294
           html-css                                       215
           javascript-ii                                  204
           spring                                         192
                                                         ... 
139.0      java-iii/servlets                                1
           javascript-i/bom-and-dom/dom                     1
           javascript-i/objects                             1
           javascript-i/objects/math                        1
           jquery/essential-methods/attributes-and-css      1
Name: page, Length: 13565, dtype: int64

In [38]:
df.groupby('cohort_id')['page'].count()

cohort_id
1.0       8890
2.0         93
4.0          4
5.0          1
6.0         72
7.0        598
8.0       1712
9.0          5
11.0       253
12.0       302
13.0      2845
14.0      9587
15.0       691
16.0       755
17.0      4954
18.0      2158
19.0      1237
21.0      7444
22.0     30926
23.0     28534
24.0     35636
25.0     25586
26.0     27749
27.0     20743
28.0     84031
29.0     38096
31.0     25359
32.0     29356
33.0     40730
34.0     26538
51.0     32888
52.0     28033
53.0     36902
55.0     21581
56.0     33568
57.0     33844
58.0     29855
59.0     32015
61.0     17713
62.0     37109
132.0    23691
133.0    14715
134.0    16623
135.0    16397
137.0     8562
138.0     7276
139.0     1672
Name: page, dtype: int64

In [39]:
df.groupby('cohort_id', as_index=False).agg({"page": "max"})

TypeError: '>=' not supported between instances of 'str' and 'float'
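
This TypeError is consistent with the single null in `page` (900222 non-null of 900223 in the `df.info()` output above): a missing value is a float NaN, and it cannot be compared with the string page names. A minimal sketch on toy data of dropping the null before aggregating follows; the names are illustrative. Note also that `"max"` on strings returns the alphabetically last page, not the most-visited one.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing page (values are made up).
toy = pd.DataFrame({
    "cohort_id": [1.0, 1.0, 2.0],
    "page": ["toc", np.nan, "spring/setup"],
})

# Remove rows with a missing page so only strings are compared.
alpha_last = (
    toy.dropna(subset=["page"])
       .groupby("cohort_id", as_index=False)
       .agg({"page": "max"})   # lexicographic max, not a popularity measure
)
print(alpha_last)
```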

In [40]:
df.groupby('program_id')['page'].count()

program_id
Data Science    103411
Front End            5
Java            713365
PHP              30548
nan              52893
Name: page, dtype: int64

In [41]:
df.groupby('program_id', as_index=False).agg({"page": "max"})

TypeError: '>=' not supported between instances of 'str' and 'float'

In [43]:
df.groupby(['program_id', 'cohort_id'])['page'].value_counts()

program_id    cohort_id  page                                       
Data Science  34.0       /                                              1967
                         1-fundamentals/modern-data-scientist.jpg        650
                         1-fundamentals/AI-ML-DL-timeline.jpg            648
                         1-fundamentals/1.1-intro-to-data-science        640
                         search/search_index.json                        588
                                                                        ... 
PHP           19.0       spring/fundamentals/controllers                   1
                         spring/fundamentals/security                      1
                         spring/fundamentals/security/authentication       1
                         spring/fundamentals/services                      1
                         spring/setup                                      1
Name: page, Length: 13565, dtype: int64