├── data
│ ├── external <- data from external sources
│ ├── interim <- modified dataset
│ ├── processed <- final dataset used for analysis
│ └── raw <- original dataset
│
├── docs <- presentations and documents used for reports
│
├── models <- Trained Doc2Vec model, TF-IDF Vectors
│
├── notebooks <- Jupyter notebooks (named with the creator's initials and a number)
│
├── reports
│ └── figures <- Interactive HTML figures from the analysis
│
├── requirements.txt <- The requirements file for reproducing the analysis environment
│
└── src
 └── models <- Scripts to train models
  └── train_model.py
The project structure is an adaptation of the Cookiecutter Data Science template.
- All COVID-19 Vaccines Tweets
- COVID-19 World Vaccination Progress
- Coronavirus (COVID-19) Geo-Tagged Tweets Dataset
Unfortunately, Twitter's guidelines do not allow uploading the tweets themselves; only tweet IDs can be provided. To build the dataset, follow the steps here to hydrate the IDs.
- Download the datasets above and place them in `/data/raw`
- Hydrate the tweet IDs in `/data/raw/tweet_ids.csv` and store the resulting JSONL file as "vaccine_tweets_hydrated.jsonl" in `/data/raw/`
- Run Notebooks 2-6 in `/notebooks/`
- Note: you may have to install the requirements (`pip3 install -r requirements.txt`)
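The ID-extraction step before hydration can be sketched as follows. This is a minimal sketch, not part of the repository: the column name `tweet_id` is an assumption about the CSV layout, and the one-ID-per-line output is the plain-text format that hydration tools such as twarc expect.

```python
import csv

def extract_ids(csv_path, out_path, id_column="tweet_id"):
    """Write one tweet ID per line, the format most hydrators expect."""
    with open(csv_path, newline="") as src, open(out_path, "w") as dst:
        reader = csv.DictReader(src)
        for row in reader:
            dst.write(row[id_column].strip() + "\n")

# extract_ids("data/raw/tweet_ids.csv", "data/raw/tweet_ids.txt")
# then hydrate, e.g. with twarc:
#   twarc hydrate data/raw/tweet_ids.txt > data/raw/vaccine_tweets_hydrated.jsonl
```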
- Hydrate `Corona_Combined_Nov2020-June2021.csv` and store the result as "Hydrated_Tweets.jsonl" in `/data/raw`
- Run Notebooks 1 and 7-11 in `/notebooks/`
- Note: you may have to install the requirements (`pip3 install -r requirements.txt`)
- NLP Pipeline: Word2Vec, Doc2Vec, TF-IDF, K-Means
- Savitzky-Golay (SavGol) filter (value smoothing)
- Plotly (interactive plots)
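The TF-IDF and K-Means stages of the pipeline above can be sketched with scikit-learn. This is an illustrative sketch, not the project's actual code: the example tweets and the choice of `n_clusters=2` are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# placeholder corpus standing in for the preprocessed tweets
docs = [
    "vaccine rollout is accelerating",
    "side effects after the second dose",
    "vaccination progress in europe",
    "mild side effects reported",
]

# TF-IDF vectors over the corpus
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# cluster the vectors; n_clusters is arbitrary here
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.labels_)  # one cluster label per document
```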
Read the report here. The interactive plots are stored in `/reports/figures/`.
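The Savitzky-Golay smoothing mentioned above can be sketched with SciPy. The synthetic noisy series below is a stand-in for the project's time series; window length and polynomial order are illustrative choices (the window must be odd and larger than the polynomial order).

```python
import numpy as np
from scipy.signal import savgol_filter

# synthetic noisy series standing in for, e.g., a daily sentiment signal
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 100)
noisy = np.sin(t) + rng.normal(scale=0.3, size=t.size)

# smooth with a local 3rd-order polynomial over an 11-point window
smooth = savgol_filter(noisy, window_length=11, polyorder=3)
```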