# Tracker Update Functions
Sam Ko<br>Mar 27, 2020

This notebook does three things:<br>
1. Load the CDC-TimeSeries collection from MongoDB and inspect it.

2. Define a function that takes the CDC-TimeSeries data and returns country, date, num_infections, num_deaths.

3. Define a function that takes the CDC-TimeSeries data and returns country, days_since_first_infection, total_num_infections, total_num_deaths. Days are counted from each country's first row in the data.


Two functions are defined: **tracker_update()** and **cml_tracker_update()**.<br>When run, each function displays its output in the notebook and also exports it as a CSV file.


In [1]:
import pandas as pd
import pymongo
import warnings
warnings.filterwarnings("ignore")

In [2]:
client = pymongo.MongoClient("mongodb://analyst:grmds@3.101.18.8/COVID19-DB") # defaults to port 27017
db = client['COVID19-DB']

# print the number of documents in the collection
# (count_documents replaces Collection.count(), which was removed in PyMongo 4)
print(db['CDC-TimeSeries'].count_documents({}))

29707


In [3]:
cdc_ts = pd.DataFrame(list(db['CDC-TimeSeries'].find({})))
cdc_ts.head()

Unnamed: 0,_id,Province/State,Country/Region,Latitude,Longitude,Confirmed,Date,Death,Recovery
0,5e78dfab674e1af34ddc0bfd,,Thailand,15,101,2,2020-01-22,0,0
1,5e78dfab674e1af34ddc0bfe,,Thailand,15,101,3,2020-01-23,0,0
2,5e78dfab674e1af34ddc0bff,,Thailand,15,101,5,2020-01-24,0,0
3,5e78dfab674e1af34ddc0c00,,Thailand,15,101,7,2020-01-25,0,0
4,5e78dfab674e1af34ddc0c01,,Thailand,15,101,8,2020-01-26,0,2
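
Only four of these columns are used later in the notebook, so the query could request just those fields. A minimal sketch, assuming a pymongo-style collection object; the helper name `load_columns` is hypothetical and not part of the original notebook:

```python
import pandas as pd

def load_columns(collection, fields):
    """Load only `fields` from a pymongo-style collection into a DataFrame.

    `load_columns` is an illustrative helper, not part of the notebook.
    """
    # Projection: 1 includes a field; _id is returned unless excluded with 0.
    projection = {name: 1 for name in fields}
    projection["_id"] = 0
    return pd.DataFrame(list(collection.find({}, projection)))

# Usage against the notebook's database handle (requires a live connection):
# cdc_ts = load_columns(db['CDC-TimeSeries'],
#                       ['Country/Region', 'Date', 'Confirmed', 'Death'])
```

Fetching fewer fields reduces the data transferred, but the resulting frame no longer has the Province/State, Latitude, Longitude, or Recovery columns.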


In [4]:
cdc_ts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29707 entries, 0 to 29706
Data columns (total 9 columns):
_id               29707 non-null object
Province/State    29707 non-null object
Country/Region    29707 non-null object
Latitude          29707 non-null object
Longitude         29707 non-null object
Confirmed         29707 non-null object
Date              29707 non-null datetime64[ns]
Death             29707 non-null object
Recovery          29707 non-null object
dtypes: datetime64[ns](1), object(8)
memory usage: 2.0+ MB
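
The summary shows the count columns stored as `object` rather than as numbers. One way to convert them up front is `pd.to_numeric`. The sketch below runs on a small made-up frame (the values are illustrative, not taken from the database) and assumes that unparseable entries should be treated as 0:

```python
import pandas as pd

# Illustrative sample shaped like CDC-TimeSeries; values are made up.
sample = pd.DataFrame({
    "Country/Region": ["Thailand", "Thailand"],
    "Confirmed": ["2", "3"],
    "Death": ["0", "0"],
    "Recovery": ["0", ""],   # an empty string stands in for a missing value
})

for col in ["Confirmed", "Death", "Recovery"]:
    # errors="coerce" turns unparseable entries into NaN instead of raising;
    # those are filled with 0 (an assumption) before casting to int.
    sample[col] = pd.to_numeric(sample[col], errors="coerce").fillna(0).astype(int)

print(sample.dtypes)
```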


### Tracker Update Function

In [5]:
def tracker_update():
    global tracker  # shared with cml_tracker_update()

    # Keep only the columns needed; the counts are stored as objects, so cast to int
    output = cdc_ts.loc[:,['Country/Region','Date','Confirmed','Death']].copy()
    output['Confirmed'] = output['Confirmed'].astype(int)
    output['Death'] = output['Death'].astype(int)

    # Sum over provinces/states so that there is one row per country and date
    tracker = output.groupby(['Country/Region','Date'])[['Confirmed','Death']].sum()
    tracker.reset_index(inplace=True)
    tracker.rename(columns={"Country/Region": "country", "Date": "date",
                            "Confirmed": "num_infections", "Death": "num_deaths"}, inplace=True)
    tracker.to_csv('tracker.csv',index=False)
    return tracker

In [6]:
tracker_update()

Unnamed: 0,country,date,num_infections,num_deaths
0,Afghanistan,2020-01-22,0,0
1,Afghanistan,2020-01-23,0,0
2,Afghanistan,2020-01-24,0,0
3,Afghanistan,2020-01-25,0,0
4,Afghanistan,2020-01-26,0,0
...,...,...,...,...
10426,Zimbabwe,2020-03-18,0,0
10427,Zimbabwe,2020-03-19,0,0
10428,Zimbabwe,2020-03-20,1,0
10429,Zimbabwe,2020-03-21,3,0
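
`tracker_update()` reads the global `cdc_ts` and writes the global `tracker`. To try the aggregation on small inputs, the same logic can be written as a function of its input. This is a sketch only: `aggregate_by_country` and the sample rows are illustrative and do not come from the notebook.

```python
import pandas as pd

def aggregate_by_country(frame):
    """Sum Confirmed/Death over provinces: one row per country and date."""
    counts = frame[["Country/Region", "Date", "Confirmed", "Death"]]
    counts = counts.astype({"Confirmed": int, "Death": int})
    return (
        counts.groupby(["Country/Region", "Date"], as_index=False)[["Confirmed", "Death"]]
        .sum()
        .rename(columns={"Country/Region": "country", "Date": "date",
                         "Confirmed": "num_infections", "Death": "num_deaths"})
    )

# Made-up rows: two provinces of country X report on the same date.
sample = pd.DataFrame({
    "Province/State": ["A", "B", ""],
    "Country/Region": ["X", "X", "Y"],
    "Date": pd.to_datetime(["2020-03-01", "2020-03-01", "2020-03-01"]),
    "Confirmed": ["4", "6", "1"],
    "Death": ["0", "1", "0"],
})
result = aggregate_by_country(sample)
print(result)
```

The two rows for X collapse into one row with 10 infections and 1 death, which is the per-country aggregation `tracker_update()` performs.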


### Cumulative Tracker Update Function

In [7]:
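def cml_tracker_update():
    # Uses the global `tracker` built by tracker_update(), so run that first.
    # Note: this adds columns to `tracker` in place.
    tracker['days_since_first_infection'] = 0
    day_col = tracker.columns.get_loc('days_since_first_infection')

    # Rows are sorted by country, then date. Day 0 is each country's first row;
    # later rows add the number of days elapsed since the previous row.
    for i in range(1,len(tracker)):
        if tracker.iloc[i,0] == tracker.iloc[i-1,0]:
            tracker.iloc[i,day_col] = tracker.iloc[i-1,day_col] + (tracker.iloc[i,1] - tracker.iloc[i-1,1]).days
        else:
            tracker.iloc[i,day_col] = 0

    # Running totals per country. This treats num_infections and num_deaths as
    # daily counts; if the source columns are already cumulative, cumsum double-counts.
    tracker["total_num_infections"] = tracker.groupby('country')['num_infections'].cumsum()
    tracker["total_num_deaths"] = tracker.groupby('country')['num_deaths'].cumsum()
    tracker_cml = tracker.drop(['date', 'num_infections','num_deaths'], axis=1)
    tracker_cml.to_csv('tracker_cumulative.csv',index=False)
    return tracker_cml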

In [8]:
cml_tracker_update()

Unnamed: 0,country,days_since_first_infection,total_num_infections,total_num_deaths
0,Afghanistan,0,0,0
1,Afghanistan,1,0,0
2,Afghanistan,2,0,0
3,Afghanistan,3,0,0
4,Afghanistan,4,0,0
...,...,...,...,...
10426,Zimbabwe,56,0,0
10427,Zimbabwe,57,0,0
10428,Zimbabwe,58,1,0
10429,Zimbabwe,59,4,0
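
In the output above, day 0 is each country's first row in the dataset, so Afghanistan reaches day 4 with no recorded infections. If the counter should instead start on the first date with at least one infection, one possible approach is sketched below. The function name `days_since_first_case` and the sample data are illustrative, and leaving rows before a country's first case as missing (`<NA>`) is a choice made for this sketch.

```python
import pandas as pd

def days_since_first_case(tracker):
    """Days elapsed since each country's first date with num_infections > 0."""
    # Date of the first positive count per country (absent if never positive).
    first_case = (
        tracker.loc[tracker["num_infections"] > 0]
        .groupby("country")["date"]
        .min()
    )
    # Align each row with its country's first-case date.
    start = tracker["country"].map(first_case)
    days = (tracker["date"] - start).dt.days
    # Rows before the first case (negative) or with no case at all stay missing.
    return days.where(days >= 0).astype("Int64")

# Made-up data: X records its first case on 2020-03-02, Y never does.
sample = pd.DataFrame({
    "country": ["X", "X", "X", "Y", "Y"],
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-04",
                            "2020-03-01", "2020-03-02"]),
    "num_infections": [0, 2, 5, 0, 0],
})
sample["days_since_first_infection"] = days_since_first_case(sample)
print(sample)
```

Because the day count is a date difference rather than a row count, a skipped date (2020-03-03 for X here) does not throw the counter off.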


**cml_tracker_update()** can be slow because it computes **days_since_first_infection** row by row in a Python loop. <br> However, if the MongoDB collection is guaranteed to be updated daily with no skipped dates, the function below gives a faster result.
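
Whether any dates are skipped can be checked on the data itself before relying on that guarantee. A minimal sketch, assuming a frame with the `country` and `date` columns produced by `tracker_update()`; the helper name `find_date_gaps` and the sample rows are illustrative:

```python
import pandas as pd

def find_date_gaps(tracker):
    """Return rows whose date is not exactly one day after the previous row
    for the same country (the first row of each country is never flagged)."""
    ordered = tracker.sort_values(["country", "date"])
    # Per-country difference between consecutive dates; NaT on the first row.
    step = ordered.groupby("country")["date"].diff()
    return ordered[step.notna() & (step != pd.Timedelta(days=1))]

# Made-up data: country X is missing 2020-03-03.
sample = pd.DataFrame({
    "country": ["X", "X", "X", "Y", "Y"],
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-04",
                            "2020-03-01", "2020-03-02"]),
})
gaps = find_date_gaps(sample)
print(gaps)
```

An empty result means every country has consecutive daily rows; any returned row marks the first date after a gap.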