## Exercise 03. Aggregations

1. Create a connection to the database using the `sqlite3` library.
2. Get the schema of the `test` table.
3. Get only the first ten rows of the `test` table to see what it looks like.

In [1]:
import pandas as pd
import sqlite3

con = sqlite3.connect('data/checking-logs.sqlite')
scheme_test = pd.read_sql("PRAGMA table_info(test)", con)
scheme_test


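|   | cid | name | type | notnull | dflt_value | pk |
|---|-----|------|------|---------|------------|----|
| 0 | 0 | index | INTEGER | 0 | | 0 |
| 1 | 1 | uid | TEXT | 0 | | 0 |
| 2 | 2 | labname | TEXT | 0 | | 0 |
| 3 | 3 | first_commit_ts | TIMESTAMP | 0 | | 0 |
| 4 | 4 | first_view_ts | TIMESTAMP | 0 | | 0 |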
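For reference, the same schema lookup can be sketched without pandas. The in-memory table below is a hypothetical stand-in that only mimics the column names of `test`; it is not the real database.

```python
import sqlite3

# Hypothetical in-memory table that mimics the columns of `test`.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE test ("
    '"index" INTEGER, uid TEXT, labname TEXT, '
    "first_commit_ts TIMESTAMP, first_view_ts TIMESTAMP)"
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
schema = con.execute("PRAGMA table_info(test)").fetchall()
for cid, name, col_type, notnull, default, pk in schema:
    print(cid, name, col_type)

con.close()
```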


In [2]:
test_10_rows = pd.read_sql("SELECT * FROM test LIMIT 10", con)
test_10_rows.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   index            10 non-null     int64 
 1   uid              10 non-null     object
 2   labname          10 non-null     object
 3   first_commit_ts  10 non-null     object
 4   first_view_ts    10 non-null     object
dtypes: int64(1), object(4)
memory usage: 532.0+ bytes


4. Using a single query, find the minimum delta between the first commit and the deadline of the corresponding lab across all users.
   - Do this by joining the table with the `deadlines` table.
   - The difference should be displayed in hours.
   - Do not take lab `project1` into account; it has longer deadlines and will be an outlier.
   - The value should be stored in the dataframe `df_min` with the corresponding uid.

In [3]:
# delta_time is computed as deadline minus first commit, i.e. hours before the
# deadline, so the smallest (commit - deadline) delta is the largest delta_time.
query_min_delta = """
SELECT
    uid,
    (JULIANDAY(datetime(deadlines.deadlines, 'unixepoch'))
     - JULIANDAY(test.first_commit_ts)) * 24 AS delta_time
FROM test
JOIN deadlines ON test.labname = deadlines.labs
WHERE test.labname != 'project1'
    AND test.first_commit_ts IS NOT NULL
ORDER BY delta_time DESC
LIMIT 1
"""
df_min = pd.read_sql(query_min_delta, con)
df_min


|   | uid | delta_time |
|---|-----|------------|
| 0 | user_30 | 202.385 |
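The hour conversion used in the query can be checked in isolation. This is a minimal sketch with hypothetical values: it assumes the deadline is stored as Unix epoch seconds and the commit timestamp as an ISO-8601 string, which is how the query above treats them.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical values, not taken from the real tables.
deadline_epoch = 1587945599           # 2020-04-26 23:59:59 UTC
first_commit = "2020-04-26 11:59:59"  # 12 hours before the deadline

# JULIANDAY returns fractional days, so multiplying by 24 gives hours.
(hours,) = con.execute(
    "SELECT (JULIANDAY(datetime(?, 'unixepoch')) - JULIANDAY(?)) * 24",
    (deadline_epoch, first_commit),
).fetchone()
con.close()

print(round(hours, 3))  # 12.0
```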


5. Do the same for the maximum, again using only one query. Store the result in the dataframe `df_max`.

In [4]:
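# NULL sorts first under ASC in SQLite, so rows without a first commit are
# filtered out, as in the previous query.
query_max = """
SELECT
    uid,
    (JULIANDAY(datetime(deadlines.deadlines, 'unixepoch'))
     - JULIANDAY(test.first_commit_ts)) * 24 AS delta
FROM test
JOIN deadlines ON test.labname = deadlines.labs
WHERE test.labname != 'project1'
    AND test.first_commit_ts IS NOT NULL
ORDER BY delta ASC
LIMIT 1
"""
df_max = pd.read_sql(query_max, con)
df_max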

|   | uid | delta |
|---|-----|-------|
| 0 | user_25 | 2.8675 |


6. Do the same thing, but for the average. Use only one query. This time, your dataframe should not include the uid column. The dataframe name is `df_avg`.

In [5]:
# AVG without GROUP BY already returns a single row, so no LIMIT is needed.
query_avg = """
SELECT
    AVG(JULIANDAY(datetime(deadlines.deadlines, 'unixepoch'))
        - JULIANDAY(test.first_commit_ts)) * 24 AS avg
FROM test
JOIN deadlines ON test.labname = deadlines.labs
WHERE deadlines.labs != 'project1'
"""
df_avg = pd.read_sql(query_avg, con)
df_avg

|   | avg |
|---|-----|
| 0 | 89.687841 |
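The minimum and maximum cells above sort the rows and keep one with `LIMIT 1`. As an alternative sketch, the aggregate functions `MIN()`, `MAX()`, and `AVG()` can return the same kind of result. The `deltas` table and its values are hypothetical, and picking up `uid` next to `MIN()` or `MAX()` relies on SQLite-specific handling of bare columns, so it may not carry over to other databases.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical per-commit deltas in hours (deadline minus first commit).
con.execute("CREATE TABLE deltas (uid TEXT, delta REAL)")
con.executemany(
    "INSERT INTO deltas VALUES (?, ?)",
    [("user_a", 10.0), ("user_b", 2.5), ("user_c", 50.0)],
)

# SQLite-specific: with a single MIN() or MAX() in the SELECT list,
# a bare column such as uid is taken from the row holding that extreme.
closest = con.execute("SELECT uid, MIN(delta) FROM deltas").fetchone()
earliest = con.execute("SELECT uid, MAX(delta) FROM deltas").fetchone()

# AVG needs no uid and returns a single value.
(average,) = con.execute("SELECT AVG(delta) FROM deltas").fetchone()
con.close()

print(closest, earliest, average)
```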


7. We want to test the hypothesis that users who visited the newsfeed just a few times have a lower delta between the first commit and the deadline. To do this, calculate the correlation coefficient between the number of pageviews and the difference.
   - Using only one query, create a table with the following columns: "uid", "avg_diff", and "pageviews".
   - "uid" contains the uids that exist in the `test` table.
   - "avg_diff" is the average delta between the first commit and the lab deadline per user.
   - "pageviews" is the number of newsfeed visits per user.
   - Do not take the lab `project1` into account.
   - Store it in the dataframe `views_diff`.
   - Use the Pandas `corr()` method to calculate the correlation coefficient between the number of pageviews and the difference.

In [6]:
# strftime('%s', ...) yields Unix seconds; the integer division by 3600
# converts the difference (first commit minus deadline) to whole hours.
query_for_corr = """
SELECT
    test.uid,
    (strftime('%s', test.first_commit_ts)
     - strftime('%s', deadlines.deadlines, 'unixepoch')) / 3600 AS avg_diff,
    COUNT(pageviews.datetime) AS page_views
FROM test
JOIN deadlines ON test.labname = deadlines.labs
JOIN pageviews ON test.uid = pageviews.uid
WHERE test.labname != 'project1'
GROUP BY test.uid
"""

views_diff = pd.read_sql(query_for_corr, con)

cor = views_diff.corr(numeric_only=True)
cor

|   | avg_diff | page_views |
|---|----------|------------|
| avg_diff | 1.0 | 0.020042 |
| page_views | 0.020042 | 1.0 |
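In the query above, `avg_diff` is not wrapped in `AVG()`, so SQLite returns the value from an arbitrary row of each group. `COUNT` also runs over the joined rows, so each pageview is counted once per lab. One possible alternative, sketched on hypothetical in-memory data, averages the delta explicitly and counts pageviews in a subquery before joining. The table contents, aliases, and timestamps here are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE test (uid TEXT, labname TEXT, first_commit_ts TIMESTAMP);
    CREATE TABLE deadlines (labs TEXT, deadlines INTEGER);
    CREATE TABLE pageviews (uid TEXT, datetime TIMESTAMP);

    -- Hypothetical deadlines as Unix epoch seconds (UTC).
    INSERT INTO deadlines VALUES
        ('lab1', 1587945600),      -- 2020-04-27 00:00:00
        ('lab2', 1588550400),      -- 2020-05-04 00:00:00
        ('project1', 1589155200);  -- 2020-05-11 00:00:00

    INSERT INTO test VALUES
        ('user_a', 'lab1', '2020-04-26 00:00:00'),      -- -24 h
        ('user_a', 'lab2', '2020-05-03 12:00:00'),      -- -12 h
        ('user_b', 'lab1', '2020-04-26 18:00:00'),      --  -6 h
        ('user_b', 'project1', '2020-05-01 00:00:00');  -- excluded

    INSERT INTO pageviews VALUES
        ('user_a', '2020-04-20 10:00:00'),
        ('user_a', '2020-04-21 10:00:00'),
        ('user_a', '2020-04-22 10:00:00'),
        ('user_b', '2020-04-20 10:00:00');
    """
)

query = """
SELECT
    t.uid,
    AVG((JULIANDAY(t.first_commit_ts)
         - JULIANDAY(datetime(d.deadlines, 'unixepoch'))) * 24) AS avg_diff,
    pv.pageviews
FROM test AS t
JOIN deadlines AS d ON t.labname = d.labs
JOIN (
    -- Count visits once per user, before joining with the labs.
    SELECT uid, COUNT(*) AS pageviews
    FROM pageviews
    GROUP BY uid
) AS pv ON pv.uid = t.uid
WHERE t.labname != 'project1'
GROUP BY t.uid
ORDER BY t.uid
"""
rows = con.execute(query).fetchall()
con.close()

for uid, avg_diff, pageviews in rows:
    print(uid, round(avg_diff, 2), pageviews)
```

Loading such a result with `pd.read_sql` and calling `corr(numeric_only=True)`, as in the cell above, would then correlate per-user averages with per-user visit counts.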


In [7]:
con.close()