<p align="center">
  <a href="http://www.openpandemic.io"><img alt="openpandemic" src="https://avatars2.githubusercontent.com/u/63398478?s=100&v=4" width=100 /></a>
  <h3 align="center">Openpandemic - Analytics</h3>
  <p align="center">
  <table style="border-collapse: collapse; border: none;">
<tr>
  <td>
    <img align="center" alt="We love Opensource" src="https://badges.frapsoft.com/os/v1/open-source.svg?v=103" />
  </td>
  <td>
    <a href="https://colab.research.google.com/github/openpandemic/openpandemic-analytics/blob/master/notebooks/covid19/01-Symptoms_exploration.ipynb"><img align="center" alt="Colab" src="https://colab.research.google.com/assets/colab-badge.svg" /></a>
  </td>
  <td>
    <a href="https://mybinder.org/v2/gh/openpandemic/openpandemic-analytics/master?filepath=notebooks/%2Fcovid19/%2F01-Symptoms_exploration.ipynb"><img src="https://mybinder.org/badge_logo.svg" alt="Launch this notebook on MyBinder" /></a>
  </td>
  </tr>
  </table>
  </p>
</p>

---

We want to contribute to the OpenPandemic initiative, an open-source effort to help stop pandemic diseases.

Please take a look at the [openpandemic-app](https://github.com/OpenPandemic/openpandemic-app) and [openpandemic-back](https://github.com/OpenPandemic/openpandemic-back) repositories.



This notebook is meant to give you a basic entry point for exploring the datasets collected by the application.

# Requirements

*   Google Cloud BigQuery connector (loaded by default in Colab Python runtimes)
*   [optional] Data access to the GCP project where the data is stored (needed only if you are not already a project member).




To get access to GCP, run this cell. Set suitable values for the variables; the project name is required.

In [0]:
from google.colab import auth
auth.authenticate_user()

PROJECT='openpandemic-analytics' # SET THE GCP PROJECT NAME
BUCKET='' # SET YOUR GCS BUCKET NAME

import os
ROOT='./'
MODEL_DIR=os.path.join(ROOT,'models')
PACKAGES_DIR=os.path.join(ROOT,'packages')

## Getting started

We have three main options for working with BigQuery from Python kernels.

### Magic commands (BigQuery extension)

Let's run our first BigQuery query via a magic command. We get the total count of evaluations and of users.

In [2]:
%load_ext google.cloud.bigquery
%load_ext google.colab.data_table

The google.cloud.bigquery extension is already loaded. To reload it, use:
  %reload_ext google.cloud.bigquery


In [4]:
%%bigquery --project {PROJECT} --verbose df_total_eval

SELECT
  COUNT(1) as total_evaluation,
  COUNT(DISTINCT person_id) as total_person_count
FROM `openpandemic-analytics.openpandemic_test.data_test_es_v1`

Executing query with job ID: d8e3c61b-0d7f-438e-aa31-3390d1b0bc22
Query executing: 0.43s
Query complete after 1.00s


| | total_evaluation | total_person_count |
|---|---|---|
| 0 | 5001 | 3889 |


In [None]:
df_total_eval

### Official Python client for BigQuery

We could run the same query through the BigQuery client directly, either iterating over the native row objects or dumping the results into a pandas dataframe.



In [5]:
from google.cloud import bigquery as bq

client = bq.Client(project=PROJECT)

dataset_name="openpandemic_test"
table_name="data_test_es_v1"
table_id = f"{PROJECT}.{dataset_name}.{table_name}"

# Query to get total evaluations and person total count
q_summary = f'''
SELECT
  COUNT(1) as total_evaluation,
  COUNT(DISTINCT person_id) as total_person_count
FROM `{table_id}`
'''

query_job = client.query(q_summary)  # API request
rows = query_job.result()            # Waits for query to finish

# Show the summary of items
for row in rows:
  print({k:v for (k,v) in row.items()})

# Dump into pandas dataframe
#df_summary = rows.to_dataframe()
#df_summary.head()                    

{'total_evaluation': 5001, 'total_person_count': 3889}
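The `table_id` built above is interpolated into the SQL between backticks, which keeps the hyphenated project name unambiguous. A minimal sketch of doing that quoting in one place follows. `quoted_table_id` and `summary_query` are hypothetical helper names (not part of the BigQuery client), and the validation is only an illustrative safeguard.

```python
def quoted_table_id(project: str, dataset: str, table: str) -> str:
    """Return project.dataset.table wrapped in backticks for use in a FROM clause."""
    parts = (project, dataset, table)
    for part in parts:
        # Reject empty parts and characters that would break the quoting
        if not part or "`" in part or "." in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    return "`" + ".".join(parts) + "`"


def summary_query(project: str, dataset: str, table: str) -> str:
    """Build the evaluation/person count query for the given table."""
    return (
        "SELECT\n"
        "  COUNT(1) AS total_evaluation,\n"
        "  COUNT(DISTINCT person_id) AS total_person_count\n"
        f"FROM {quoted_table_id(project, dataset, table)}"
    )


sql = summary_query("openpandemic-analytics", "openpandemic_test", "data_test_es_v1")
print(sql)
```

The resulting string can be passed to `client.query(...)` exactly like `q_summary` above.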


### Pandas client for BigQuery

A third alternative also loads data straight into a pandas dataframe: [pandas-gbq](https://pypi.org/project/pandas-gbq/), a pandas-oriented library for working with BigQuery.



In [6]:
import pandas_gbq

# pandas.read_gbq is deprecated since pandas 2.2; pandas_gbq.read_gbq is its replacement
df = pandas_gbq.read_gbq(f'''
SELECT
  COUNT(1) as total_evaluation,
  COUNT(DISTINCT person_id) as total_person_count
FROM `{table_id}`
''', project_id=PROJECT, dialect='standard')

df.head()

| | total_evaluation | total_person_count |
|---|---|---|
| 0 | 5001 | 3889 |


## Evaluations

For every user who has done at least one evaluation, we're going to extract their latest test result.

In [7]:
%%bigquery --project {PROJECT} --verbose df_summary_eval

SELECT
 person_id,
 test.id as test_id,
 test.time as test_time,
 test.result as test_result
FROM `openpandemic-analytics.openpandemic_test.data_test_es_v1` C
JOIN (
  SELECT
    person_id as person_id1,
    COUNT(*) as eval_count,
    MAX(test.time) as latest_test_time
  FROM `openpandemic-analytics.openpandemic_test.data_test_es_v1`
  GROUP BY person_id
) C1
ON C.person_id = C1.person_id1 AND C.test.time = latest_test_time
ORDER BY PERSON_ID,TEST.TIME DESC

Executing query with job ID: 44b5482c-b68b-4d31-9df4-47d0b42a7269
Query executing: 0.95s
Query complete after 1.81s
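The query above keeps one row per person by joining the table with a subquery that computes each person's latest `test.time`. The same pattern can be tried locally; the following is a minimal sketch using an in-memory SQLite database. The flat `evaluations` table, its column names, and the sample rows are assumptions made for illustration (the real table stores these values in a nested `test` record).

```python
import sqlite3

# Hypothetical sample rows: (person_id, test_id, test_time, test_result)
ROWS = [
    ("p1", "t1", "2020-04-01 10:00:00", "no-symptoms"),
    ("p1", "t2", "2020-04-05 09:30:00", "symptoms"),
    ("p2", "t3", "2020-04-02 18:15:00", "no-symptoms"),
    ("p3", "t4", "2020-04-03 08:00:00", "symptoms"),
    ("p3", "t5", "2020-04-04 08:00:00", "symptoms"),
    ("p3", "t6", "2020-04-06 08:00:00", "no-symptoms"),
]

LATEST_PER_PERSON = """
SELECT c.person_id, c.test_id, c.test_time, c.test_result, c1.eval_count
FROM evaluations AS c
JOIN (
  SELECT person_id, COUNT(*) AS eval_count, MAX(test_time) AS latest_test_time
  FROM evaluations
  GROUP BY person_id
) AS c1
  ON c.person_id = c1.person_id AND c.test_time = c1.latest_test_time
ORDER BY c.person_id
"""


def latest_evaluations(rows):
    """Load rows into an in-memory table and return the latest evaluation per person."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE evaluations "
            "(person_id TEXT, test_id TEXT, test_time TEXT, test_result TEXT)"
        )
        conn.executemany("INSERT INTO evaluations VALUES (?, ?, ?, ?)", rows)
        return conn.execute(LATEST_PER_PERSON).fetchall()
    finally:
        conn.close()


latest = latest_evaluations(ROWS)
for row in latest:
    print(row)
```

One consequence of joining on the timestamp: if a person had two evaluations with exactly the same latest time, both rows would be returned for that person.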


In [8]:
print('Number of users: %s' % df_summary_eval['person_id'].count())

Number of users: 3889


In [9]:
df_summary_eval['test_result'].value_counts()

symptoms       2565
no-symptoms    1324
Name: test_result, dtype: int64

Now that we have a single (latest) evaluation per user, let's plot the results.

In [0]:
import plotly.graph_objs as go
from plotly.offline import iplot, plot
import numpy as np

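def graph(x, y, title):
    # Bar chart of the counts in y; each bar is labelled with its share of the total
    y = np.asarray(y)  # accept tuples and lists as well as arrays
    y_sum = y.sum()
    y_text = [f"{pct:.0f}%" for pct in np.around((y / y_sum) * 100)]
    data = [go.Bar(
        x=x,
        y=y,
        # A single colour setting; the duplicate marker=dict(color=...) was dropped
        marker_color='rgba(218, 201, 41, 1)',
        text=y_text,
        textposition='auto',
        opacity=0.8)
    ]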

    layout = go.Layout(
        title=f'{title}, {y_sum} ',
        paper_bgcolor='rgba(245, 246, 249, 1)',
        plot_bgcolor='rgba(245, 246, 249, 1)',
        showlegend=False,
        xaxis=dict(
            showgrid=True,
            showline=True,
            showticklabels=True,
            zeroline=True,
            domain=[0.15, 1]
        ),
        yaxis=dict(
            showgrid=True,
            showline=True,
            showticklabels=True,
            zeroline=True,
        )
    )
    return go.Figure(data=data, layout=layout)
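The bar labels in `graph` are each bar's share of the total, rounded to a whole percent. Here is a minimal, dependency-free sketch of that calculation, using the two counts reported by `value_counts()` above. `percent_labels` is a hypothetical helper name, and it uses Python's built-in `round` rather than `np.around`.

```python
def percent_labels(counts):
    """Return each count as a whole-percent share of the total, e.g. '66%'."""
    total = sum(counts)
    if total == 0:
        # Avoid dividing by zero when there is nothing to plot
        return ["0%" for _ in counts]
    return [f"{round(100 * c / total)}%" for c in counts]


labels = percent_labels([2565, 1324])
print(labels)
```

Because every label is rounded independently, the labels are not guaranteed to add up to exactly 100%.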


In [11]:
y_counts = df_summary_eval['test_result'].value_counts()

y = (y_counts['symptoms'], y_counts['no-symptoms'],)

x=['With symptoms compatible with infection',
   'Without compatible symptoms']

title = "Unique users (identified or not)"

graph(x, y, title).show()


## Re-evaluations

Now we're going to find the users who have done more than one evaluation and see how their symptoms evolved.

We need to know how many evaluations each user did and whether their symptoms persisted.

In [13]:
%%bigquery --project {PROJECT} --verbose df_re_eval

SELECT
 person_id,
 test.id as test_id,
 test.time as test_time,
 test.result as test_result,
 C1.eval_count,
 C1.test_no_symptoms_count,
 C1.test_symptoms_count
FROM `openpandemic-analytics.openpandemic_test.data_test_es_v1` C
JOIN (
  SELECT
    person_id as person_id1,
    COUNT(*) as eval_count,
    COUNTIF(test.result = "no-symptoms") as test_no_symptoms_count,
    COUNTIF(test.result = "symptoms") as test_symptoms_count,
    MAX(test.time) as latest_test_time
  FROM `openpandemic-analytics.openpandemic_test.data_test_es_v1`
  GROUP BY person_id
  HAVING COUNT(*) > 1
) C1
ON C.person_id = C1.person_id1 AND C.test.time = latest_test_time
ORDER BY PERSON_ID,TEST.TIME DESC

Executing query with job ID: 12c39c83-2c15-414e-9b3d-418f85e39d0c
Query executing: 1.06s
Query complete after 1.63s
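The inner subquery of the statement above counts, per person, the evaluations with each result and keeps only people with more than one evaluation. A minimal pure-Python sketch of that aggregation follows. The function name and the sample tuples are hypothetical, and timestamps are assumed to be ISO-formatted strings so that string comparison matches chronological order.

```python
from collections import defaultdict


def summarize_re_evaluations(evaluations):
    """Aggregate (person_id, test_time, test_result) tuples per person.

    Returns a dict keyed by person_id for people with more than one evaluation.
    """
    stats = defaultdict(lambda: {
        "eval_count": 0,
        "test_no_symptoms_count": 0,
        "test_symptoms_count": 0,
        "latest_test_time": None,
        "latest_test_result": None,
    })
    for person_id, test_time, test_result in evaluations:
        s = stats[person_id]
        s["eval_count"] += 1
        if test_result == "no-symptoms":
            s["test_no_symptoms_count"] += 1
        elif test_result == "symptoms":
            s["test_symptoms_count"] += 1
        # Track the most recent evaluation seen so far for this person
        if s["latest_test_time"] is None or test_time > s["latest_test_time"]:
            s["latest_test_time"] = test_time
            s["latest_test_result"] = test_result
    # Equivalent of HAVING COUNT(*) > 1
    return {pid: s for pid, s in stats.items() if s["eval_count"] > 1}


sample = [
    ("p1", "2020-04-01T10:00:00", "no-symptoms"),
    ("p1", "2020-04-05T09:30:00", "symptoms"),
    ("p2", "2020-04-02T18:15:00", "no-symptoms"),
    ("p3", "2020-04-03T08:00:00", "symptoms"),
    ("p3", "2020-04-04T08:00:00", "symptoms"),
]
summary = summarize_re_evaluations(sample)
for pid, s in sorted(summary.items()):
    print(pid, s)
```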


In [14]:
print('Number of users with more than one evaluation: %s' % df_re_eval['person_id'].count())

Number of users with more than one evaluation: 937


In [15]:
df_re_eval.head(20)

| | person_id | test_id | test_time | test_result | eval_count | test_no_symptoms_count | test_symptoms_count |
|---|---|---|---|---|---|---|---|
| 0 | 003fa68d47ecffb7a5a7d88f5f2cd08f | 13763fba40b3d24ffd1001f8dca0115b | 2020-04-10 10:05:16+00:00 | symptoms | 3 | 0 | 3 |
| 1 | 005090bb051247ce12342169a6c4c43e | 8dcd870d88e4a8bb6f3f4080926f0b69 | 2020-04-12 08:10:04+00:00 | symptoms | 2 | 0 | 2 |
| 2 | 00798094b6739e834588f506590e2932 | 1b4e8c8e82b731702cb198146cfe1ba2 | 2020-04-10 11:48:57+00:00 | symptoms | 2 | 0 | 2 |
| 3 | 00e44bc40f5d5f7aa4ef584baf372beb | dba0168739f7a281cf4d28ff0368add4 | 2020-04-11 20:38:52+00:00 | symptoms | 2 | 1 | 1 |
| 4 | 00fc302aff51488840affa6445e24409 | 37fd1715925cac092f34af89fe70e973 | 2020-04-06 07:52:47+00:00 | no-symptoms | 2 | 2 | 0 |
| 5 | 015e0f24170e5e99d9819fe85bd4145c | 0ccafd4d90ca6972553feb862ca6b748 | 2020-04-13 02:01:26+00:00 | no-symptoms | 2 | 2 | 0 |
| 6 | 01aa048f3810845bf52bf3843a2f122b | 0f215c7471ab959cc514a9f90c2168a7 | 2020-04-08 16:02:23+00:00 | no-symptoms | 2 | 2 | 0 |
| 7 | 01ac42982613946776d379e06c24083b | 322a0719d5e8ce91eecafadf608d5c3d | 2020-04-02 22:39:50+00:00 | no-symptoms | 2 | 1 | 1 |
| 8 | 01b7d2fd6ea786e1318ce5ca97775271 | 80eb3308aa5f708b4d7dbba0d1f4a2f1 | 2020-04-13 18:26:23+00:00 | no-symptoms | 2 | 1 | 1 |
| 9 | 0217a1c5c386af49194f94c270f69974 | 2421d3912bb765b5d881f1dd01a12efc | 2020-04-12 21:59:31+00:00 | symptoms | 3 | 0 | 3 |


So we have the number of evaluations, the counts of 'no-symptoms' and 'symptoms' test results, and the latest result. From these we can calculate how many users fall into each situation:

* Users who stay with symptoms (every test result was 'symptoms', eval_count=test_symptoms_count).
* Users who stay without symptoms (every test result was 'no-symptoms', eval_count=test_no_symptoms_count).
* Users who now have symptoms (the latest test result is 'symptoms' but the user had at least one earlier 'no-symptoms' result, test_no_symptoms_count>0).
* Users who now have no symptoms (the latest test result is 'no-symptoms' but the user had at least one earlier 'symptoms' result, test_symptoms_count>0).

In [0]:
users_with_symptoms = df_re_eval[(df_re_eval['test_result'] == "symptoms") & (df_re_eval['test_no_symptoms_count'] == 0)].shape[0]
users_no_symptoms = df_re_eval[(df_re_eval['test_result'] == "no-symptoms") & (df_re_eval['test_symptoms_count'] == 0)].shape[0]
users_changed_symptoms = df_re_eval[(df_re_eval['test_result'] == "symptoms") & (df_re_eval['test_no_symptoms_count'] > 0)].shape[0]
users_changed_no_symptoms = df_re_eval[(df_re_eval['test_result'] == "no-symptoms") & (df_re_eval['test_symptoms_count'] > 0)].shape[0]
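The four filters above amount to one decision per user, based on the latest result and on whether the opposite result ever occurred. A minimal sketch of that rule as a plain function follows. `classify_user` is a hypothetical name, the label strings are illustrative, and the sample inputs are taken from rows of the `df_re_eval.head(20)` output above.

```python
def classify_user(latest_result, no_symptoms_count, symptoms_count):
    """Map a user's latest result and per-result counts to one of four categories."""
    if latest_result == "symptoms":
        # No earlier 'no-symptoms' result means the user always reported symptoms
        return "SYMPTOM" if no_symptoms_count == 0 else "TO_SYMPTOM"
    if latest_result == "no-symptoms":
        # No earlier 'symptoms' result means the user never reported symptoms
        return "NO_SYMPTOM" if symptoms_count == 0 else "TO_NO_SYMPTOM"
    raise ValueError(f"unexpected test result: {latest_result!r}")


# (test_result, test_no_symptoms_count, test_symptoms_count) from rows 0, 3, 4 and 7
examples = [
    ("symptoms", 0, 3),
    ("symptoms", 1, 1),
    ("no-symptoms", 2, 0),
    ("no-symptoms", 1, 1),
]
categories = [classify_user(*e) for e in examples]
print(categories)
```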

In [17]:
x=['Still with symptoms',
   'Still without symptoms',
   'Changed to having symptoms',
   'Changed to no symptoms']

y = np.array([users_with_symptoms, users_no_symptoms, users_changed_symptoms, users_changed_no_symptoms])

# Re-evaluated users as a percentage of the total number of evaluations (scalar, not a Series)
re_eval_ratio = np.around((df_re_eval['person_id'].count() / df_total_eval['total_evaluation'].iloc[0]) * 100)

title = f"Re-evaluated users, {int(re_eval_ratio)}%"

graph(x, y, title).show()
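The percentage in the chart title divides the number of re-evaluated users by `total_evaluation`, which counts evaluations rather than people. The small sketch below uses the figures printed earlier in this notebook to show how much the choice of denominator matters. Which figure is intended is not stated here, so treat the second value as an alternative reading; `percent` is a hypothetical helper.

```python
def percent(part, whole):
    """Return part/whole as a whole-number percentage."""
    return round(100 * part / whole)


re_evaluated_users = 937   # users with more than one evaluation
total_evaluations = 5001   # total_evaluation from the first query
total_persons = 3889       # total_person_count from the first query

share_of_evaluations = percent(re_evaluated_users, total_evaluations)
share_of_persons = percent(re_evaluated_users, total_persons)
print(share_of_evaluations, share_of_persons)  # prints: 19 24
```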


## All in one

We could have done all of this in a few steps. The key is understanding the data and forming the query in our minds beforehand.

Let's get a data summary with everything we need to calculate the results we want to plot. In this case we're going to classify each user by type according to their symptoms and evaluations.

In [18]:
# Query to get the summary of evaluations
q_summary = f'''
SELECT
 person_id,
 test.id as test_id,
 test.time as test_time,
 test.result as test_result,
 C1.eval_count,
 C1.test_no_symptoms_count,
 C1.test_symptoms_count,
 CASE 
    WHEN (test.RESULT = "symptoms" AND test_no_symptoms_count = 0) THEN "SYMPTOM" 
    WHEN (test.RESULT = "no-symptoms" AND test_symptoms_count = 0) THEN "NO_SYMPTOM"
    WHEN (test.RESULT = "symptoms" AND test_no_symptoms_count > 0) THEN "TO_SYMPTOM"
    WHEN (test.RESULT = "no-symptoms" AND test_symptoms_count > 0) THEN "TO_NO_SYMPTOM"
 END as user_type
FROM `{table_id}` C
JOIN (
  SELECT
    person_id as person_id1,
    COUNT(*) as eval_count,
    COUNTIF(test.result = "no-symptoms") as test_no_symptoms_count,
    COUNTIF(test.result = "symptoms") as test_symptoms_count,
    MAX(test.time) as latest_test_time
  FROM `{table_id}`
  GROUP BY person_id
) C1
ON C.person_id = C1.person_id1 AND C.test.time = latest_test_time
ORDER BY PERSON_ID,TEST.TIME DESC
'''

query_job = client.query(q_summary)  # API request

rows = query_job.result()            # Waits for query to finish

df_summary = rows.to_dataframe()
print(f"Total number of rows: {rows.total_rows}\n")  # total_rows is the size of the full result set
df_summary.head(20)  

Total number of rows: 3889



| | person_id | test_id | test_time | test_result | eval_count | test_no_symptoms_count | test_symptoms_count | user_type |
|---|---|---|---|---|---|---|---|---|
| 0 | 00177d792f180ca9fe30e319fdbc1867 | d19d88b58db734c69ae1c0842b9697a8 | 2020-03-11 19:52:47+00:00 | symptoms | 1 | 0 | 1 | SYMPTOM |
| 1 | 003fa68d47ecffb7a5a7d88f5f2cd08f | 13763fba40b3d24ffd1001f8dca0115b | 2020-04-10 10:05:16+00:00 | symptoms | 3 | 0 | 3 | SYMPTOM |
| 2 | 005090bb051247ce12342169a6c4c43e | 8dcd870d88e4a8bb6f3f4080926f0b69 | 2020-04-12 08:10:04+00:00 | symptoms | 2 | 0 | 2 | SYMPTOM |
| 3 | 00757f5187053f1c4eef4232ef72a7c5 | 0b35e5bef7d72d7cd750e95cabc27efc | 2020-02-27 17:05:45+00:00 | symptoms | 1 | 0 | 1 | SYMPTOM |
| 4 | 00798094b6739e834588f506590e2932 | 1b4e8c8e82b731702cb198146cfe1ba2 | 2020-04-10 11:48:57+00:00 | symptoms | 2 | 0 | 2 | SYMPTOM |
| 5 | 007dc634cefe1e741bb85f2b5c738fb3 | 5b4285868e2ae01082bb46aa7346a709 | 2020-03-05 07:12:28+00:00 | no-symptoms | 1 | 1 | 0 | NO_SYMPTOM |
| 6 | 00a1d3d194e7510de966b62057c287ed | bd3e72e323dc3d9457a631c1a45bd74c | 2020-02-25 00:00:28+00:00 | no-symptoms | 1 | 1 | 0 | NO_SYMPTOM |
| 7 | 00ad16ea19503c7ccbdac9860f6f79c5 | 07db9a8a4949d7c84277836596169a57 | 2020-03-16 16:31:11+00:00 | symptoms | 1 | 0 | 1 | SYMPTOM |
| 8 | 00af26687e2a83d172e51344c9efbcd7 | cf6dfa4edd58e813b1612f2762cd5a3e | 2020-03-04 04:08:09+00:00 | no-symptoms | 1 | 1 | 0 | NO_SYMPTOM |
| 9 | 00e44bc40f5d5f7aa4ef584baf372beb | dba0168739f7a281cf4d28ff0368add4 | 2020-04-11 20:38:52+00:00 | symptoms | 2 | 1 | 1 | TO_SYMPTOM |


In [19]:
df_summary_unique_eval = df_summary[(df_summary['eval_count'] == 1)]
print('Number of users with just one evaluation: %s\n' % df_summary_unique_eval['person_id'].count())
df_summary_unique_eval.groupby(['user_type'])['person_id'].count()

Number of users with just one evaluation: 2952



user_type
NO_SYMPTOM    1018
SYMPTOM       1934
Name: person_id, dtype: int64

In [20]:
df_summary_re_eval = df_summary[(df_summary['eval_count'] > 1)]
print('Number of users with more than one evaluation: %s\n' % df_summary_re_eval['person_id'].count())
df_summary_re_eval.groupby(['user_type'])['person_id'].count()

Number of users with more than one evaluation: 937



user_type
NO_SYMPTOM        90
SYMPTOM          397
TO_NO_SYMPTOM    216
TO_SYMPTOM       234
Name: person_id, dtype: int64
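A quick way to gain confidence in the combined query is to check that its breakdowns add up to the totals obtained earlier. The sketch below does this with the counts printed in this notebook. The dictionaries are typed in by hand from those outputs, and `check_breakdown` is a hypothetical helper.

```python
def check_breakdown(groups, expected_total):
    """Return True when the per-category counts add up to the expected total."""
    return sum(groups.values()) == expected_total


single_eval = {"NO_SYMPTOM": 1018, "SYMPTOM": 1934}
re_eval = {"NO_SYMPTOM": 90, "SYMPTOM": 397, "TO_NO_SYMPTOM": 216, "TO_SYMPTOM": 234}

# Users whose latest result is 'symptoms' or 'no-symptoms', across both groups
latest_symptoms = single_eval["SYMPTOM"] + re_eval["SYMPTOM"] + re_eval["TO_SYMPTOM"]
latest_no_symptoms = (
    single_eval["NO_SYMPTOM"] + re_eval["NO_SYMPTOM"] + re_eval["TO_NO_SYMPTOM"]
)

print(check_breakdown(single_eval, 2952))              # users with one evaluation
print(check_breakdown(re_eval, 937))                   # users with several evaluations
print(check_breakdown({**single_eval, "re": 937}, 3889))  # all distinct users
print(latest_symptoms, latest_no_symptoms)
```

The last line reproduces the 2565 / 1324 split reported by `value_counts()` in the Evaluations section.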

This query seems interesting enough for my coworkers, so I'm thinking about sharing it, beyond simply storing it in a file in a repository.

That's pretty easy: just save it using the BigQuery client, for example as a view:


In [21]:
view_name = "summary"
view_id = f"{PROJECT}.openpandemic_test.{view_name}"
view = bq.Table(view_id)
view.view_query = q_summary
view = client.create_table(view)  # API request

print("Successfully created view at {}".format(view.full_table_id))

Successfully created view at opencorona-analytics:openpandemic_test.summary
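Rerunning the cell above once the view exists is expected to fail, because `create_table` does not overwrite an existing table or view by default. One alternative, sketched here under that assumption, is to issue a `CREATE OR REPLACE VIEW` statement through `client.query`. `build_view_ddl` is a hypothetical helper, and the short SELECT below is a stand-in for `q_summary`.

```python
def build_view_ddl(view_id: str, select_sql: str) -> str:
    """Build a CREATE OR REPLACE VIEW statement for a fully qualified view id."""
    if "`" in view_id:
        raise ValueError("view_id must not contain backticks")
    # Drop surrounding whitespace and a trailing semicolon from the SELECT
    body = select_sql.strip().rstrip(";")
    return f"CREATE OR REPLACE VIEW `{view_id}` AS\n{body}"


ddl = build_view_ddl(
    "openpandemic-analytics.openpandemic_test.summary",
    "SELECT person_id FROM `openpandemic-analytics.openpandemic_test.data_test_es_v1`;",
)
print(ddl)

# With an authenticated client (not created in this sketch):
# client.query(ddl).result()
```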


And now we can use the view to get results: 

In [22]:
query_job = client.query(f'''
SELECT
  *
FROM `{PROJECT}.openpandemic_test.summary`
''')
rows = query_job.result(max_results=10)
rows.to_dataframe()

| | person_id | test_id | test_time | test_result | eval_count | test_no_symptoms_count | test_symptoms_count | user_type |
|---|---|---|---|---|---|---|---|---|
| 0 | 00177d792f180ca9fe30e319fdbc1867 | d19d88b58db734c69ae1c0842b9697a8 | 2020-03-11 19:52:47+00:00 | symptoms | 1 | 0 | 1 | SYMPTOM |
| 1 | 003fa68d47ecffb7a5a7d88f5f2cd08f | 13763fba40b3d24ffd1001f8dca0115b | 2020-04-10 10:05:16+00:00 | symptoms | 3 | 0 | 3 | SYMPTOM |
| 2 | 005090bb051247ce12342169a6c4c43e | 8dcd870d88e4a8bb6f3f4080926f0b69 | 2020-04-12 08:10:04+00:00 | symptoms | 2 | 0 | 2 | SYMPTOM |
| 3 | 00757f5187053f1c4eef4232ef72a7c5 | 0b35e5bef7d72d7cd750e95cabc27efc | 2020-02-27 17:05:45+00:00 | symptoms | 1 | 0 | 1 | SYMPTOM |
| 4 | 00798094b6739e834588f506590e2932 | 1b4e8c8e82b731702cb198146cfe1ba2 | 2020-04-10 11:48:57+00:00 | symptoms | 2 | 0 | 2 | SYMPTOM |
| 5 | 007dc634cefe1e741bb85f2b5c738fb3 | 5b4285868e2ae01082bb46aa7346a709 | 2020-03-05 07:12:28+00:00 | no-symptoms | 1 | 1 | 0 | NO_SYMPTOM |
| 6 | 00a1d3d194e7510de966b62057c287ed | bd3e72e323dc3d9457a631c1a45bd74c | 2020-02-25 00:00:28+00:00 | no-symptoms | 1 | 1 | 0 | NO_SYMPTOM |
| 7 | 00ad16ea19503c7ccbdac9860f6f79c5 | 07db9a8a4949d7c84277836596169a57 | 2020-03-16 16:31:11+00:00 | symptoms | 1 | 0 | 1 | SYMPTOM |
| 8 | 00af26687e2a83d172e51344c9efbcd7 | cf6dfa4edd58e813b1612f2762cd5a3e | 2020-03-04 04:08:09+00:00 | no-symptoms | 1 | 1 | 0 | NO_SYMPTOM |
| 9 | 00e44bc40f5d5f7aa4ef584baf372beb | dba0168739f7a281cf4d28ff0368add4 | 2020-04-11 20:38:52+00:00 | symptoms | 2 | 1 | 1 | TO_SYMPTOM |
