# Anomaly Detection Group Project <a id="questions"></a>

1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?
2. Is there a cohort that referred to a lesson significantly more than other cohorts, which seemed to gloss over it?
3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?
4. Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?
5. At some point in 2019, the ability for students and alumni to access both curriculums (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?
6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
7. Which lessons are least accessed?
8. Anything else I should be aware of?

# Acquire

In [1]:
import pandas as pd
import os
from env import *

In [2]:
def df_to_csv(df, filename):
    # Write the cache file, overwriting any existing copy
    df.to_csv(filename, index=False)

In [3]:
def manny_acquire(filename):
    # Use the cached CSV when it already exists
    exists = os.path.isfile(filename)
    if exists:
        df = pd.read_csv(filename)

        return df
    else:
        # Define query
        query = '''
                select *
                from `logs`
                JOIN cohorts ON logs.cohort_id=cohorts.id;
                '''
        # Define url
        url = get_db_url('curriculum_logs')

        # Read data from SQL server
        df = pd.read_sql(query, url)

        # Cache under the same filename that was checked above
        df_to_csv(df, filename)

        return df
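
The cache-or-query pattern used by `manny_acquire` can be tried without a database. The sketch below is a minimal illustration under assumptions: `load_with_cache`, `fake_fetch`, and the toy file name are hypothetical stand-ins, and the "query" is just a function returning a small DataFrame.

```python
import os
import tempfile

import pandas as pd


def load_with_cache(filename, fetch):
    """Return the cached CSV if present; otherwise call fetch() and cache the result."""
    if os.path.isfile(filename):
        return pd.read_csv(filename)
    df = fetch()
    df.to_csv(filename, index=False)
    return df


# Hypothetical stand-in for the SQL query; records how often it actually runs.
calls = {"n": 0}


def fake_fetch():
    calls["n"] += 1
    return pd.DataFrame({"path": ["/", "javascript-i"], "user_id": [1, 2]})


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "toy_logs.csv")
    first = load_with_cache(cache_path, fake_fetch)   # runs fake_fetch, writes the CSV
    second = load_with_cache(cache_path, fake_fetch)  # reads the CSV, no second fetch
```

Passing the loader in as an argument keeps the caching logic testable on its own, separate from the database credentials.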

[Top](#questions)

In [4]:
def manny_clean(df):
    # create datetime column and set as index
    df['fixed_date']=df['date']+ ' ' +df['time']
    df.set_index(pd.DatetimeIndex(df['fixed_date']), inplace=True)
   
    # drop unnecessary columns after setting index
    df = df.drop(columns = ['date','time','fixed_date'])
   
    # creating columns for names of program_id
    df['data']= df['program_id']== 3
    df['web']= df['program_id']== 2
    df['php']= df['program_id']== 1
    df['front_end']= df['program_id']== 4
    
    # fix cohort_id float datatype to integer
    df['cohort_id'] = df['cohort_id'].astype(int)
    
    # fix 'end_date' column datatype to datetime
    df['end_date'] = pd.to_datetime(df['end_date'])
    
    # df = df[df['path'].str.len()>3]
    
    return df
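
As a possible alternative to the four boolean flag columns, a single label column can carry the program name. The sketch below uses toy rows and assumes the id-to-name mapping implied by the flags in `manny_clean`; the `program` column and `program_names` dictionary are illustrative names, not part of the original notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": ["2019-01-02", "2019-01-02"],
    "time": ["09:55:03", "10:01:15"],
    "program_id": [3, 2],
})

# Combine the date and time strings and use the parsed result as the index.
toy.index = pd.DatetimeIndex(toy["date"] + " " + toy["time"], name="date_time")
toy = toy.drop(columns=["date", "time"])

# One readable label per row; mapping assumed from the flags in manny_clean.
program_names = {1: "php", 2: "web", 3: "data", 4: "front_end"}
toy["program"] = toy["program_id"].map(program_names)
```

A label column makes per-program grouping a single `groupby("program")` call, while the boolean flags remain convenient for quick filtering.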

[Top](#questions)

```sql
TIMESTAMP(logs.date, logs.time) AS date_time
```

In [5]:
def manny_wrangle(): 
    df = manny_acquire('curriculum_logs.csv')
    df = manny_clean(df) 
    return df

[Top](#questions)

In [6]:
df = manny_wrangle()

# 1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?

[Top](#questions)

In [7]:
def question_01():
    data_df = df[df['data']==True].copy()



In [None]:
def question_06():
    pass  # placeholder, not implemented yet

### Most Accessed Path

[Top](#questions)

In [8]:
df['data'].value_counts()

False    743918
True     103412
Name: data, dtype: int64

### Data Science Subset DataFrame

[Top](#questions)

For each cohort, sort paths by access count and keep the top 10.

Does each path appear in the list?

In [9]:
data_df = df[df['data']==True].copy()

In [10]:
data_df.nunique()

path          682
user_id       111
cohort_id       5
ip            990
id              5
name            5
slack           5
start_date      5
end_date        5
created_at      5
updated_at      5
deleted_at      0
program_id      1
data            1
web             1
php             1
front_end       1
dtype: int64

In [11]:
data_df = data_df[data_df['path']!= '/']

In [28]:
top_results = (
    data_df.groupby(['cohort_id', 'path'])['id']
    .count()
    .reset_index()
    .sort_values(['cohort_id', 'id'], ascending=[True, False])
    .groupby('cohort_id')
    .nth(0)
)

[Top](#questions)

In [29]:
top_results

| cohort_id | path | id |
|---|---|---|
| 34 | 1-fundamentals/modern-data-scientist.jpg | 650 |
| 55 | 6-regression/1-overview | 595 |
| 59 | classification/overview | 1109 |
| 133 | classification/scale_features_or_not.svg | 463 |
| 137 | fundamentals/modern-data-scientist.jpg | 627 |
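
The result above keeps only the single most-visited path per cohort. To get a top-N list per cohort instead, one option is `groupby(...).head(n)` on the sorted counts. This is a sketch on toy data; the `hits` column name and `TOP_N` constant are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "cohort_id": [34, 34, 34, 34, 55, 55, 55],
    "path": ["a", "a", "b", "c", "b", "b", "a"],
})

TOP_N = 2  # small value to keep the toy output short

# Count hits per (cohort, path), then order each cohort's paths by hits.
# Ties are broken alphabetically by path so the result is deterministic.
counts = (
    toy.groupby(["cohort_id", "path"])
    .size()
    .reset_index(name="hits")
    .sort_values(["cohort_id", "hits", "path"], ascending=[True, False, True])
)

# head(n) on a groupby keeps the first n rows of every group in sorted order.
top_n = counts.groupby("cohort_id").head(TOP_N).reset_index(drop=True)
```

Using `size()` counts rows directly, so the result does not depend on any particular column being non-null.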


In [16]:
# df = df[df['path'].str.len()>3]

```python
answer_1 = (
    df.groupby(['program_id', 'path'])['id']
    .count()
    .reset_index()
    .sort_values(['program_id', 'id'], ascending=[True, False])
)
```
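
Raw totals per program can be dominated by one large cohort. One way to read "consistently across cohorts" is to count, for each program, in how many cohorts a path came out on top. The sketch below shows that idea on toy data; the column names `hits` and `cohorts_on_top` are illustrative, and this is only one possible interpretation of the question.

```python
import pandas as pd

toy = pd.DataFrame({
    "program_id": [2, 2, 2, 2, 2, 2, 3, 3, 3],
    "cohort_id": [1, 1, 1, 2, 2, 2, 9, 9, 9],
    "path": ["java-i", "java-i", "html", "java-i", "java-i", "css",
             "sql", "sql", "stats"],
})

# Hits per (program, cohort, path).
hits = (
    toy.groupby(["program_id", "cohort_id", "path"])
    .size()
    .reset_index(name="hits")
)

# Keep each cohort's single most-visited path (ties broken alphabetically).
top_per_cohort = (
    hits.sort_values(["cohort_id", "hits", "path"], ascending=[True, False, True])
    .drop_duplicates("cohort_id")
)

# For each program, count in how many cohorts a path was the top path.
consistency = (
    top_per_cohort.groupby(["program_id", "path"])
    .size()
    .reset_index(name="cohorts_on_top")
    .sort_values(["program_id", "cohorts_on_top"], ascending=[True, False])
)
```

On the real logs it would likely make sense to drop the root path `/` and static assets first, as the notebook already does for `/` in the data science subset.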

# 6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?

[Top](#questions)

In [35]:
# Keep only log rows whose cohort has an end date
grads_logs = df[df['end_date'].notnull()]

# Group the filtered logs by path and user_id, and count the number of accesses
topics_by_user = grads_logs.groupby(['path', 'user_id'])['id'].count()

# Look up each user's graduation date (the latest end_date among their rows)
grad_date_by_user = grads_logs.groupby('user_id')['end_date'].max()

# function to check if a log row is after a user's graduating date
def is_after_grad_date(row):
    user_id = row['user_id']
    log_date = row.name
    grad_date = grad_date_by_user[user_id]
    return log_date > grad_date

# Filter the logs by the graduating date for each user and count the number of accesses to each path by each user after their graduating date
topics_by_user_after_grad_date = grads_logs[grads_logs.apply(is_after_grad_date, axis=1)].groupby(['path', 'user_id'])['id'].count()

# Group by path and sort the result. Note: nunique() here counts the distinct per-user access counts for each path, not the users themselves; use .size() to count users per path
topics_by_access = topics_by_user_after_grad_date.groupby('path').nunique().sort_values(ascending=False)

# output
print(topics_by_access)

path
/                           92
javascript-i                56
search/search_index.json    50
html-css                    48
spring                      48
                            ..
coffee_consumption.csv       1
cohorts/27/quizzes           1
collections                  1
conditionals                 1
wp-login                     1
Name: id, Length: 1848, dtype: int64
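
The row-wise `apply` above can be slow on several hundred thousand log rows, and calling `nunique()` on a Series of counts measures distinct count values rather than users. The sketch below shows a vectorized variant on toy data that counts distinct users per path after graduation. The toy rows and the names `grad_date`, `after_grad`, and `users_per_path` are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 3],
        "path": ["spring", "spring", "html-css", "spring", "html-css", "spring"],
        "end_date": pd.to_datetime(["2019-06-01"] * 5 + ["2020-01-01"]),
    },
    index=pd.to_datetime([
        "2019-07-01", "2019-08-01", "2019-05-01",  # user 1: two visits after graduation
        "2019-09-01", "2019-09-02",                # user 2: both after graduation
        "2019-10-01",                              # user 3: still enrolled
    ]),
)

# Latest end_date per user, broadcast back onto every log row.
grad_date = toy.groupby("user_id")["end_date"].transform("max")

# One vectorized comparison of log timestamp against graduation date.
mask = toy.index.to_numpy() > grad_date.to_numpy()
after_grad = toy[mask]

# Number of distinct users who opened each path after graduating.
users_per_path = (
    after_grad.groupby("path")["user_id"]
    .nunique()
    .sort_values(ascending=False)
)
```

Comparing the underlying arrays sidesteps index alignment, which matters when several log rows share the same timestamp.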
