# Exploratory Data Analysis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from imblearn.over_sampling import SMOTE
from collections import Counter

In [None]:
# Loading epidemic datasets

## Cases and testing
cases_malaysia = pd.read_csv('dataset/epidemic/cases_malaysia.csv')
cases_state = pd.read_csv('dataset/epidemic/cases_state.csv')
tests_malaysia = pd.read_csv('dataset/epidemic/tests_malaysia.csv') 
tests_state = pd.read_csv('dataset/epidemic/tests_state.csv')

## Deaths
deaths_malaysia = pd.read_csv('dataset/epidemic/deaths_malaysia.csv')
deaths_state = pd.read_csv('dataset/epidemic/deaths_state.csv')

## Clustering
clusters = pd.read_csv('dataset/epidemic/clusters.csv')

## Healthcare
hospital = pd.read_csv('dataset/epidemic/hospital.csv')
icu = pd.read_csv('dataset/epidemic/icu.csv')
pkrc = pd.read_csv('dataset/epidemic/pkrc.csv')  # PKRC: COVID-19 Quarantine and Treatment Centres (Pusat Kuarantin dan Rawatan COVID-19)

# Loading vaccination datasets

## Adverse Events Following Immunization (AEFI)
aefi = pd.read_csv('dataset/vaccination/aefi.csv')
aefi_serious = pd.read_csv('dataset/vaccination/aefi_serious.csv')

## Vaccination
vax_malaysia = pd.read_csv('dataset/vaccination/vax_malaysia.csv')
vax_state = pd.read_csv('dataset/vaccination/vax_state.csv')

# Loading registration datasets

vaxreg_malaysia = pd.read_csv('dataset/registration/vaxreg_malaysia.csv')
vaxreg_state = pd.read_csv('dataset/registration/vaxreg_state.csv')

# Loading MySejahtera datasets

checkin_malaysia = pd.read_csv('dataset/mysejahtera/checkin_malaysia.csv')
checkin_malaysia_time = pd.read_csv('dataset/mysejahtera/checkin_malaysia_time.csv')
checkin_state = pd.read_csv('dataset/mysejahtera/checkin_state.csv')
trace_malaysia = pd.read_csv('dataset/mysejahtera/trace_malaysia.csv')

# Open data on COVID-19 in Malaysia

**The scope and granularity of data in this repo will evolve over time.**
+ Documentation and data descriptions are contained within subfolders.
+ Submit pull requests to [share your work for the community](/CONTRIB.md#share-your-work) or [request more data](/CONTRIB.md#data-requests).

---

## 📁Epidemic

#### Cases and Testing

1) [`cases_malaysia.csv`](/epidemic/cases_malaysia.csv): Daily recorded COVID-19 cases at country level, as of 1200 of date.
2) [`cases_state.csv`](/epidemic/cases_state.csv): Daily recorded COVID-19 cases at state level, as of 1200 of date.
3) [`clusters.csv`](/epidemic/clusters.csv): Exhaustive list of announced clusters with relevant epidemiological datapoints, as of 2359 of date of update.
4) [`tests_malaysia.csv`](/epidemic/tests_malaysia.csv): Daily tests (note: not necessarily unique individuals) by type at country level, as of 2359 of date.
5) [`tests_state.csv`](/epidemic/tests_state.csv): Daily tests (note: not necessarily unique individuals) by type at state level, as of 2359 of date.

#### Healthcare

1) [`pkrc.csv`](/epidemic/pkrc.csv): Flow of patients to/out of COVID-19 Quarantine and Treatment Centres (PKRC), with capacity and utilisation as of 2359 of date.
2) [`hospital.csv`](/epidemic/hospital.csv): Flow of patients to/out of hospitals, with capacity and utilisation as of 2359 of date.
3) [`icu.csv`](/epidemic/icu.csv): Capacity and utilisation of intensive care unit (ICU) beds as of 2359 of date.

#### Deaths

1) [`deaths_malaysia.csv`](/epidemic/deaths_malaysia.csv): Daily deaths due to COVID-19 at country level, as of 1200 of date.
2) [`deaths_state.csv`](/epidemic/deaths_state.csv): Daily deaths due to COVID-19 at state level, as of 1200 of date.



## 📁MySejahtera

### Mobility and Contact Tracing

1) [`checkin_malaysia.csv`](/mysejahtera/checkin_malaysia.csv): Daily checkins on MySejahtera at country level, as of 2359 of date.
2) [`checkin_state.csv`](/mysejahtera/checkin_state.csv): Daily checkins on MySejahtera at state level, as of 2359 of date.
3) [`checkin_malaysia_time.csv`](/mysejahtera/checkin_malaysia_time.csv): Time distribution of daily checkins on MySejahtera at country level, as of 2359 of date.
4) [`trace_malaysia.csv`](/mysejahtera/trace_malaysia.csv): Daily casual contacts traced and hotspots identified by HIDE, at country level, as of 2359 of date.

## 📁Registration

1) [`vaxreg_malaysia.csv`](/registration/vaxreg_malaysia.csv): Cumulative registrations for vaccination at country level, as at 2359 of date.
2) [`vaxreg_state.csv`](/registration/vaxreg_state.csv): Cumulative registrations for vaccination at state level, as at 2359 of date.

## 📁Static

1) [`population.csv`](/static/population.csv): Total, adult (≥18), and elderly (≥60) population at state level.

_Static data will (probably) remain unchanged for the duration of the program, barring an update from the source, e.g. if DOSM makes an update to population estimates. We provide this data here not to supersede the source, but rather to be transparent about the data we use to compute key statistics e.g. the % of the population that is vaccinated. We also hope this ensures synchronisation (across various independent analysts) of key statistics down to the Nth decimal place._
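
As a rough sketch of how the population file feeds a key statistic such as the % of the population that is vaccinated, the snippet below divides a cumulative vaccination count by the state population. The column names (`state`, `pop`, `cumul_full`) and the helper `vaccination_share` are assumptions made for illustration, and the numbers are made-up placeholders, not real figures; check the actual CSV headers before adapting it.

```python
import pandas as pd


def vaccination_share(vax, population, vax_col="cumul_full", pop_col="pop"):
    """Return the percentage of each state's population that is vaccinated.

    Assumes both frames have one row per state and share a `state` column
    (hypothetical column names; adjust to the real files).
    """
    merged = vax.merge(population, on="state", how="inner", validate="one_to_one")
    pct = 100 * merged[vax_col] / merged[pop_col]
    return pd.Series(pct.to_numpy(), index=merged["state"], name="pct_vaccinated")


# Placeholder values purely for illustration, not real data.
population = pd.DataFrame({"state": ["A", "B"], "pop": [1000, 2000]})
latest_vax = pd.DataFrame({"state": ["A", "B"], "cumul_full": [250, 1500]})

share = vaccination_share(latest_vax, population)
print(share)
```

Using the same denominator file across analyses is what keeps such percentages comparable between independent analysts.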

## 📁Vaccination

MoH collaborates with MoSTI and the COVID-19 Immunisation Task Force (CITF) to publish open data on Malaysia's vaccination rollout. All relevant data can be found at [this repo](https://github.com/CITF-Malaysia/citf-public).

1) [`vax_malaysia.csv`](/vaccination/vax_malaysia.csv): Daily and cumulative vaccination at country level, as at 2359 of date.
2) [`vax_state.csv`](/vaccination/vax_state.csv): Daily and cumulative vaccination at state level, as at 2359 of date.

# Documentation for MySejahtera datasets

_Note: As per the MySejahtera privacy policy, individual-level check-in data is purged after 90 days. These summary statistics are stored only as aggregated totals; MySejahtera does not store the underlying data. Consequently, data revisions are not possible for dates more than 90 days ago, even if an inconsistency is spotted._

## File naming convention

1) `checkin_malaysia.csv`: Static name; file is updated by 1500hrs daily
2) `checkin_malaysia_time.csv`: Static name; file is updated by 1500hrs daily
3) `trace_malaysia.csv`: Static name; file is updated by 1500hrs daily

## Variables and Methodology

1) `date`: yyyy-mm-dd format; data correct as of 2359hrs on that date
2) `checkins`: number of checkins at all locations registered on MySejahtera
3) `unique_ind`: number of unique accounts which checked in
4) `unique_loc`: number of unique premises checked into
5) `i`: in the time density file, checkins are aggregated into half-hour buckets, giving 48 in total; bucket `i` corresponds to the ith half-hour slot of the day, counting from zero. For instance, `i = 0` corresponds to 0000 - 0029; `i = 30` corresponds to 1500 - 1529.
6) `casual_contacts`: number of casual contacts identified and notified by CPRC's automated contact tracing system
7) `hide_large`: number of large hotspots identified by CPRC's hotspot identification system
8) `hide_small`: number of small hotspots identified by CPRC's hotspot identification system
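
The half-hour bucket index `i` described above can be converted to a clock range with a little arithmetic. This is a minimal sketch, assuming buckets are zero-based (as the `i = 0` example implies) and that each bucket spans exactly 30 minutes; the helper name `bucket_to_range` is hypothetical and not part of the dataset or any library.

```python
def bucket_to_range(i):
    """Map a zero-based half-hour bucket index (0..47) to an 'HHMM - HHMM' string."""
    if not 0 <= i < 48:
        raise ValueError("bucket index must be between 0 and 47")
    start = i * 30    # minutes since midnight at the start of the bucket
    end = start + 29  # last minute covered by the bucket
    return f"{start // 60:02d}{start % 60:02d} - {end // 60:02d}{end % 60:02d}"


print(bucket_to_range(0))   # first bucket of the day: 0000 - 0029
print(bucket_to_range(47))  # last bucket of the day: 2330 - 2359
```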