In [1]:
%reload_ext autoreload
%autoreload 2

In [20]:
from pathlib import Path
import os
import pandas as pd




## Using Kaggle API to download datasets

This notebook is a step-by-step guide to downloading the Covid-19 tweets [dataset](https://www.kaggle.com/smid80/coronavirus-covid19-tweets) from Kaggle using the Kaggle API. Downloading it from the website is simpler, but an automated process can be useful, especially for datasets that are updated on a regular basis.

Credits to [fast.ai](https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson3-planet.ipynb)

1) First, we install the necessary libraries by running the following lines

In [6]:
! pip install kaggle --upgrade
! pip install tqdm

Requirement already up-to-date: kaggle in /Users/julian/opt/anaconda3/lib/python3.7/site-packages (1.5.6)
You should consider upgrading via the '/Users/julian/opt/anaconda3/bin/python -m pip install --upgrade pip' command.


2) Next, we add our Kaggle credentials to the workspace. Go to https://www.kaggle.com/YOURUSERNAME/account, scroll down to the 'Create New API Token' button, and click it. This downloads a file named 'kaggle.json'.

Upload this file to the directory this notebook is running in by clicking "Upload" on your main Jupyter page, then execute the next two lines.

In [4]:
! mkdir -p ~/.kaggle/ 
! mv kaggle.json ~/.kaggle/

mv: kaggle.json: No such file or directory
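The same step can also be done from Python, which makes it easy to restrict the token's permissions at the same time; the Kaggle CLI warns when the key file is readable by other users. This is a minimal sketch, assuming the token was uploaded next to the notebook. The helper name `install_kaggle_token` is hypothetical and not part of the Kaggle package, and it copies the file rather than moving it.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(src="kaggle.json", dest_dir=Path.home() / ".kaggle"):
    """Copy the API token into dest_dir and make it readable by the owner only."""
    src = Path(src)
    if not src.exists():
        # Same situation as the `mv` error above: the token was not uploaded yet.
        raise FileNotFoundError(f"{src} not found; upload kaggle.json first")
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / "kaggle.json"
    shutil.copy(src, dest)
    # Equivalent to `chmod 600`: read/write for the owner, nothing for others.
    dest.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return dest
```

Calling `install_kaggle_token()` with no arguments targets `~/.kaggle/kaggle.json`, the location used by the shell commands above.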


3) We define a data path and download the datasets

In [13]:
path = Path('data')
path.mkdir(parents=True, exist_ok=True)

datasets = ['coronavirus-covid19-tweets',
           'coronavirus-covid19-tweets-early-april',
           'coronavirus-covid19-tweets-late-april']

for dataset in datasets:
    !kaggle datasets download smid80/$dataset

coronavirus-covid19-tweets.zip: Skipping, found more recently modified local copy (use --force to force download)
coronavirus-covid19-tweets-early-april.zip: Skipping, found more recently modified local copy (use --force to force download)
coronavirus-covid19-tweets-late-april.zip: Skipping, found more recently modified local copy (use --force to force download)
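Outside of a notebook, where the `!` shell syntax is not available, the same download can be driven through `subprocess`. The sketch below is one possible way to do it, assuming the `kaggle` command is on the PATH and the credentials from step 2 are in place. The function names `build_download_command` and `download_datasets` are hypothetical, and `-p` is used so the archives land in the data directory.

```python
import subprocess
from pathlib import Path


def build_download_command(dataset, owner="smid80", dest="data"):
    """Return the Kaggle CLI call that downloads owner/dataset into dest."""
    return ["kaggle", "datasets", "download", "-d", f"{owner}/{dataset}", "-p", str(dest)]


def download_datasets(datasets, owner="smid80", dest="data"):
    """Download each dataset archive into dest, stopping on the first failure."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    for dataset in datasets:
        # check=True raises CalledProcessError if the CLI exits with an error.
        subprocess.run(build_download_command(dataset, owner, dest), check=True)
```

Note that with `-p` the zip files are written to `data/` rather than the working directory, so any later unzip command would need to point at `data/<name>.zip`.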


4) We unzip the files

In [None]:
! unzip -q -n coronavirus-covid19-tweets-early-april.zip -d $path
! unzip -q -n coronavirus-covid19-tweets-late-april.zip -d $path
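If the `unzip` binary is not available (for example on Windows), the standard-library `zipfile` module can do the same job. The following is a rough equivalent of `unzip -n`, which never overwrites existing files; the helper name `extract_missing` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_missing(archive, dest):
    """Extract archive into dest, skipping files that already exist (like `unzip -n`)."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive) as zf:
        for member in zf.namelist():
            if (dest / member).exists():
                continue  # never overwrite, mirroring the -n flag
            zf.extract(member, dest)
            extracted.append(member)
    return extracted
```

For instance, `extract_missing('coronavirus-covid19-tweets-late-april.zip', path)` would return the names of the files it actually wrote.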

5) We load all the data as a dataframe

In [21]:
from tqdm.notebook import tqdm

list_files = []

# Walk the data directory and read every daily tweets file into its own dataframe
for dirname, _, filenames in os.walk(path):
    for filename in tqdm(filenames):
        if 'Coronavirus Tweets' in filename:
            df = pd.read_csv(os.path.join(dirname, filename), index_col=None, header=0)
            list_files.append(df)




In [23]:
df = pd.concat(list_files, axis=0, ignore_index=True)
df.head()

| | status_id | user_id | created_at | screen_name | text | source | reply_to_status_id | reply_to_user_id | reply_to_screen_name | is_quote | ... | retweet_count | country_code | place_full_name | place_type | followers_count | friends_count | account_lang | account_created_at | verified | lang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1250937015840804864 | 544900497 | 2020-04-17T00:00:00Z | _wheresmymojo | Math Formulas And Numbers On Blackboard Cloth ... | TweetDeck | | | | False | ... | 0 | | | | 406 | 271 | | 2012-04-04T03:52:50Z | False | en |
| 1 | 1250937015744581634 | 127591568 | 2020-04-17T00:00:00Z | Huauchifm | 🎯#Ahora te traemos la rueda de prensa del @Gob... | TweetDeck | | | | False | ... | 0 | | | | 1530 | 861 | | 2010-03-29T17:54:07Z | False | es |
| 2 | 1250937016428032002 | 127599939 | 2020-04-17T00:00:00Z | ZacatlanFM | 🎯#Ahora te traemos la rueda de prensa del @Gob... | TweetDeck | | | | False | ... | 0 | | | | 2000 | 1416 | | 2010-03-29T18:14:44Z | False | es |
| 3 | 1250937015572414464 | 128124996 | 2020-04-17T00:00:00Z | Izucar_fm | 🎯#Ahora te traemos la rueda de prensa del @Gob... | TweetDeck | | | | False | ... | 0 | | | | 1153 | 167 | | 2010-03-31T04:56:31Z | False | es |
| 4 | 1250937015002152961 | 314210521 | 2020-04-17T00:00:00Z | TehuacanFM | 🎯#Ahora te traemos la rueda de prensa del @Gob... | TweetDeck | | | | False | ... | 0 | | | | 2199 | 114 | | 2011-06-09T21:58:36Z | False | es |
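Because `os.walk` yields files in an arbitrary, filesystem-dependent order, the row order of the concatenated dataframe in step 5 can differ between machines. One way to make the load reproducible is to collect the matching files in sorted order first. This is a small sketch using only the standard library; the helper name `find_tweet_files` is hypothetical, and restricting the match to a `.csv` extension (in any letter case) is an assumption that the original loop does not make.

```python
from pathlib import Path


def find_tweet_files(root, marker="Coronavirus Tweets"):
    """Return CSV files under root whose name contains marker, sorted by path."""
    return sorted(
        p
        for p in Path(root).rglob("*")
        # Compare the extension case-insensitively so both .csv and .CSV match.
        if p.is_file() and p.suffix.lower() == ".csv" and marker in p.name
    )
```

The returned paths can then be read one by one with `pd.read_csv` and combined with `pd.concat`, exactly as in step 5.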
