This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

Publicly available structured COVID-19 data from India, extracted automatically from daily health bulletins published by state governments.

IBM/covid19-india-data

Covid-19 India Data 🇮🇳

Download data: CSV, JSON, Microsoft Excel, SQLite

Availability of COVID-19 data is crucial for researchers and policy makers to understand the progression of the pandemic and react to it in real time. Here is a recent plea from researchers in India for urgent access to COVID data collected by government agencies. Individual states and cities in India provide detailed information about the current COVID-19 situation in their respective locations through daily media bulletins. However, such data (usually in the form of PDF documents) is not readily accessible in structured form.

While there are fantastic crowd-sourced efforts underway to curate such data, manual approaches cannot scale to the volume of data produced over the long term. This project began in anticipation of that outcome, and unfortunately it has already come to pass.

Project Overview

In this project, we use AI-assisted document and image extraction techniques to automate the extraction of such data in structured (SQL) form from the state-level daily health bulletins, and we aim to make this data readily (and freely) available for further research and analysis. The target is to automate data extraction and curation for each Indian state, so that once the extraction process for a state is complete, we can be on "autopilot" for that state, requiring little to no continued manual curation (other than responding to changes in schema).
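
Since the extracted data is published in SQL form (including a SQLite download), a first step for a new user is often to see which tables and columns the database contains. The following is a minimal sketch using only Python's standard `sqlite3` module. It makes no assumption about the actual table names; the file name `covid-india.db` in the usage note is hypothetical and stands for wherever you saved the SQLite download.

```python
import sqlite3


def list_tables(db_path):
    """Return {table_name: [column names]} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalog of schema objects.
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for name in names:
            # PRAGMA table_info yields one row per column; index 1 is the column name.
            columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
            schema[name] = [col[1] for col in columns]
        return schema
    finally:
        conn.close()
```

Calling `list_tables("covid-india.db")` on the downloaded file (hypothetical path) returns a dictionary that can be inspected before writing any state-specific queries.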

Citing us

If you are using this data in your research, please remember to cite us. 🙏 Note that the list of authors will continue to grow over time with our OSS contributors. Please make sure to update the citation text in your future papers accordingly.

@inproceedings{agarwal2021covid,
  title={COVID-19 India Dataset: Parsing Detailed COVID-19 Data in Daily Health Bulletins from States in India},
  author={Mayank Agarwal and Tathagata Chakraborti and Sachin Grover and Arunima Chaudhary},
  booktitle={NeurIPS 2021 Workshop on Machine Learning in Public Health},
  year={2021}
}

Getting Started with the Code

There are two ways to get started:

The Backend

The most important part of this codebase is the data extraction pipeline, as described above.

  1. To set up your environment, follow the instructions here.
  2. To run the extraction pipeline, refer to instructions here.
  3. For a detailed walkthrough of using the pipeline end to end on a state, refer to our Wiki.

The Frontend

Secondary, but almost as important, is the landing page that allows users to access the data quickly and in different forms such as time series visualization, data tables, CSVs, APIs, etc. For instructions on how to contribute to the landing page, see here.

How to Contribute

The following are a few ways to get going. In general, you can pick up any unassigned issue, or issues tagged with help wanted, from the issue board.

✊ Own a State

This is the biggest way you can contribute in the beginning stages of the project. "Owning a state" involves:

  1. Writing the data extraction code for the bulletins of the state. This repository provides the starting code and helper packages to make this as simple as possible. See here for instructions.

  2. Eventually reacting (or helping others react) to additions or changes in schema for the bulletins being put out by that state. The schemas have remained quite stable all this while but this issue may show up in a few states as the pandemic evolves.
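
Reacting to a schema change starts with noticing it. The sketch below is one hypothetical way to do that, not the repository's actual mechanism: it compares the column headers an extractor expects against the headers found in a new bulletin. The function name `diff_schema` and the example headers are illustrative assumptions.

```python
def diff_schema(expected, extracted):
    """Compare expected column headers with those found in a new bulletin.

    Headers are lower-cased and whitespace-normalized before comparison,
    so cosmetic differences do not count as schema changes.
    """
    def normalize(header):
        return " ".join(header.lower().split())

    expected_set = {normalize(h) for h in expected}
    extracted_set = {normalize(h) for h in extracted}
    return {
        "missing": sorted(expected_set - extracted_set),
        "unexpected": sorted(extracted_set - expected_set),
    }


# Hypothetical headers: the bulletin gained a new "Vaccinated" column.
report = diff_schema(
    ["Date", "Total Cases", "Deaths"],
    ["date", "Total  cases", "Deaths", "Vaccinated"],
)
print(report)  # {'missing': [], 'unexpected': ['vaccinated']}
```

A non-empty `missing` or `unexpected` list is a signal that the state's extraction code may need attention.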

For the project to succeed, this is the most crucial part. Once the data extraction code for a state is done, the logging of data for that state is automatic, and we can scale up to the rest of the country over time.

😒 Data Cleaning

Data at this volume and timeline is bound to suffer from inconsistencies. We will be documenting these as and when we find them on the dedicated Anomalies Page. Help us:

  1. Remove or otherwise handle missing data for the plots.
  2. Identify possible outliers and errors.
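
Both cleaning tasks can be prototyped with the standard library. The sketch below is illustrative only and is not taken from the project's code: `fill_gaps` linearly interpolates interior gaps in a daily series (a choice that may suit plots but is an assumption, not a recommendation for analysis), and `flag_outliers` marks values whose modified z-score, based on the median absolute deviation, exceeds a threshold. The function names, the 3.5 threshold, and the sample numbers are all assumptions.

```python
from statistics import median


def fill_gaps(values):
    """Linearly interpolate interior None gaps; leading/trailing gaps stay None."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def flag_outliers(values, threshold=3.5):
    """Return indices whose MAD-based modified z-score exceeds the threshold."""
    present = [v for v in values if v is not None]
    if not present:
        return []
    med = median(present)
    mad = median(abs(v - med) for v in present)
    if mad == 0:
        # All values are (nearly) identical; the score is undefined.
        return []
    return [
        i
        for i, v in enumerate(values)
        if v is not None and 0.6745 * abs(v - med) / mad > threshold
    ]


# Made-up daily counts, for illustration only.
print(fill_gaps([10, None, None, 40, None]))           # [10, 20.0, 30.0, 40, None]
print(flag_outliers([100, 102, 98, 101, 5000, 99]))    # [4]
```

Flagged indices are candidates for manual review against the source bulletin rather than automatic deletion, since a spike can reflect a genuine backlog correction.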

🤓 Analysis

Analyze the data for insights, irregularities, etc. You can publish the results of your analysis in your papers, blogs, etc. (and point to them from our landing page), or add them directly to our landing page as a standalone new page or in the existing Analysis Page. Among other things, you can use the data to validate or extend to India models developed for other countries [1] [2] [3]; develop epidemiological models that integrate additional variables [4] [5] [6] [7]; or understand various aspects of the pandemic in detail [8] [1] [9].

💡 💡 💡 If you are looking for some concrete tasks to get started, find out more about Challenge Tasks here.

Current state roster

| State | Code | Link to Bulletin | Owner | Status |
|---|---|---|---|---|
| Andaman and Nicobar | AN | Link | ⌛ Own it! | #113 |
| Arunachal Pradesh | AR | Link | ⌛ Own it! | #129 |
| Assam | AS | Link | ⌛ Own it! | #130 |
| Bihar | BR | Link | ⌛ Own it! | #126 |
| Chhattisgarh | CT | Link | ⌛ Own it! | #131 |
| Dadra and Nagar Haveli and Daman and Diu | DH | Link | ⌛ Own it! | #125 |
| Delhi | DL | Link | Mayank | ✅ COMPLETE (Wiki) |
| Goa | GA | Link | Tathagata, Mayank | ✅ COMPLETE (Wiki) |
| Gujarat | GJ | Link | ⌛ Own it! | #121 |
| Haryana | HR | Link | Mayank | ✅ COMPLETE (Wiki) |
| Himachal Pradesh | HP | Link | ⌛ Own it! | #132 |
| Jammu and Kashmir | JK | Link | ⌛ Own it! | #133 |
| Karnataka | KA | Link | Sushovan De, Mayank | 🚧 IN PROGRESS (Wiki) |
| Kerala | KL | Link | Tathagata | 🚧 IN PROGRESS (Wiki) |
| Ladakh | LA | Link | ⌛ Own it! | #114 |
| Madhya Pradesh | MP | Link | Tathagata | 🚧 IN PROGRESS (Wiki) |
| Maharashtra | MH | Link | Mayank | ✅ COMPLETE (Wiki) |
| Manipur | MN | Link, Link | ⌛ Own it! | #116 |
| Meghalaya | ML | Link | ⌛ Own it! | #111 |
| Mizoram | MZ | Link | ⌛ Own it! | #135 |
| Nagaland | NL | Link | ⌛ Own it! | #124 |
| Puducherry | PY | Link | ⌛ Own it! | #128 |
| Punjab | PB | Link | Sachin | ✅ COMPLETE (Wiki) |
| Odisha | OR | Link | ⌛ Own it! | #115 |
| Rajasthan | RJ | Link | | |
| Tamil Nadu | TN | Link | Sachin, Tathagata | ✅ COMPLETE (Wiki) |
| Telangana | TG | Link | Mayank | ✅ COMPLETE (Wiki) |
| Uttarakhand | UK | Link, Link | Arunima | ✅ COMPLETE (Wiki) |
| Uttar Pradesh | UP | Link | ⌛ Own it! | #127 |
| West Bengal | WB | Link | Mayank | ✅ COMPLETE (Wiki) |

Add new state

As you might have noticed, this is an incomplete list of Indian states. Not all states produce this form of data and not all bulletins are accessible. ☹️ We will continue adding new sources over time.

Interested? Join the Community

slack
