### This Jupyter Notebook shows how to load the Twitter data into a dataframe

Useful links

* Introduction to Jupyter Notebooks: [Jupyter Notebook Tutorial: Introduction, Setup, and Walkthrough](https://www.youtube.com/watch?v=HW29067qVWk)
* Getting started with pandas: [Getting started tutorials](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/index.html)
* Getting started with natural language processing (NLP) with NLTK: [Natural Language Processing with Python](https://www.nltk.org/book/)

In [1]:
import os
import json
from pprint import pprint

import pandas as pd

In [2]:
# this is the path to the extracted JSON file from a .tar.gz archive
JSON_FILE = "2020-01-01.json"

Each line of the file contains the data from a single tweet. We can get an idea of the structure by printing a line from the file.

{'_id': {'$oid': '5e0d442d1aa16380b7a9297c'},
 'collected_by': 'sentinelRibTwitter',
 'contributors': None,
 'coordinates': None,
 'created_at': {'$date': 1577923199000},
 'created_at_src': 'Wed Jan 01 23:59:59 +0000 2020',
 'display_text_range': [8, 140],
 'entities': {'emotion': [],
              'hashtags': [],
              'indexed_by': {},
              'ner': [],
              'ontology': [],
              'pronoun': [],
              'swearword': [],
              'symbols': [],
              'token': [],
              'urls': [{'display_url': 'twitter.com/i/web/status/1…',
                        'expanded_url': 'https://twitter.com/i/web/status/1212523896727068672',
                        'indices': [117, 140],
                        'url': 'https://t.co/UkodUK8O88'}],
              'user_mentions': [{'id': 14638581,
                                 'id_str': '14638581',
                                 'indices': [0, 7],
                                 'name': 'BioBioChil
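
The truncated output above looks like a pretty-printed dictionary for the first tweet in the file. A minimal sketch of how one line might be inspected is given here; the helper name `peek_first_record` is hypothetical, and the sketch assumes the file is UTF-8 encoded with exactly one JSON object per line.

```python
import json
from pprint import pprint


def peek_first_record(path):
    """Return the first tweet of a line-delimited JSON file as a dict."""
    # Assumption: the file is UTF-8 encoded and its first line is not empty.
    with open(path, "r", encoding="utf-8") as f:
        first_line = f.readline()
    return json.loads(first_line)


# Usage, assuming JSON_FILE points at the extracted file:
# pprint(peek_first_record(JSON_FILE))
```

Reading only the first line avoids parsing the whole file just to look at its structure.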

There are many things that we can do with this data. For this example, let's say we're only interested in the id and text of each tweet.

In [4]:
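# read each line of the file, extract relevant data
rows = []
# open as UTF-8 explicitly, since tweet text contains non-ASCII characters
with open(JSON_FILE, "r", encoding="utf-8") as f:
    for line in f:
        # each line is one JSON object describing a single tweet
        data = json.loads(line)

        # keep the id as a string, exactly as provided in the data
        tweet_id = data["id_str"]

        # use the full text from "extended_tweet" when that field is present
        if "extended_tweet" in data:
            text = data["extended_tweet"]["full_text"]
        else:
            text = data["text"]

        rows.append((tweet_id, text))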

The id and text have been loaded into a list of tuples.
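
The extraction step can also be wrapped in small functions, which makes it easier to try out on a few sample lines. This is only a sketch: the names `extract_id_and_text` and `load_rows` are hypothetical, and skipping blank lines is an assumption about how stray empty lines should be handled.

```python
import json


def extract_id_and_text(data):
    """Return (tweet_id, text) for one parsed tweet.

    Uses the text under "extended_tweet" when that field is present,
    and falls back to the top-level "text" field otherwise.
    """
    tweet_id = data["id_str"]
    if "extended_tweet" in data:
        text = data["extended_tweet"]["full_text"]
    else:
        text = data["text"]
    return tweet_id, text


def load_rows(lines):
    """Parse an iterable of JSON lines into a list of (tweet_id, text) tuples."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:  # assumption: blank lines are skipped rather than treated as errors
            continue
        rows.append(extract_id_and_text(json.loads(line)))
    return rows


# Usage, assuming JSON_FILE points at the extracted file:
# with open(JSON_FILE, "r", encoding="utf-8") as f:
#     rows = load_rows(f)
```

Because `load_rows` accepts any iterable of strings, it works on an open file handle as well as on a plain list of test lines.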

In [5]:
print(rows[0])

('1212523896727068672', '@biobio Radio facha. Difunde propaganda para rechazar una nueva Constitución. Apoya al nazi Kast y además emite estrofa del himno nacional que hace referencia a los cobardes soldados de la dictadura. Por eso  #ChaoBioBio #DejarDeSeguirRadioBioBio')


This list can now be put into a dataframe.

In [6]:
df = pd.DataFrame(data=rows, columns=["tweet_id", "tweet_text"])

We can view the head of the dataframe to check it has been loaded as expected.

In [7]:
df.head(5)

|   | tweet_id | tweet_text |
|---|---|---|
| 0 | 1212523896727068672 | @biobio Radio facha. Difunde propaganda para r... |
| 1 | 1212523895821160449 | RT @GregRubini: the replacement of Gen. Mike ... |
| 2 | 1212523898564038656 | RT @TheTyee: Jason Kenney is learning that it’... |
| 3 | 1212523893157761025 | RT @realDonaldTrump: The Fake News said I play... |
| 4 | 1212523894076129280 | @rosebowlgame @BadgerFootball @oregonfootball ... |


In [8]:
print(len(df), "texts loaded.")

235482 texts loaded.


We can now save this dataframe to disk for later use.

In [9]:
df.to_pickle("2020-01-01_tweet_data.pkl")

The dataframe can be loaded later using the pandas read_pickle function.

In [10]:
df = pd.read_pickle("2020-01-01_tweet_data.pkl")
df.head(5)

|   | tweet_id | tweet_text |
|---|---|---|
| 0 | 1212523896727068672 | @biobio Radio facha. Difunde propaganda para r... |
| 1 | 1212523895821160449 | RT @GregRubini: the replacement of Gen. Mike ... |
| 2 | 1212523898564038656 | RT @TheTyee: Jason Kenney is learning that it’... |
| 3 | 1212523893157761025 | RT @realDonaldTrump: The Fake News said I play... |
| 4 | 1212523894076129280 | @rosebowlgame @BadgerFootball @oregonfootball ... |
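
To confirm that saving and reloading leaves the data unchanged, the round trip can be checked on a small stand-in dataframe. The sketch below is illustrative only: the two rows are made-up examples, and the file name `toy_tweet_data.pkl` and the temporary directory are assumptions so that nothing is written next to the real data.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the real dataframe; both rows are made-up examples.
toy = pd.DataFrame(
    data=[("1", "first tweet"), ("2", "second tweet")],
    columns=["tweet_id", "tweet_text"],
)

# Write the pickle into a temporary directory and read it straight back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_tweet_data.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# DataFrame.equals checks that both frames have the same shape and elements.
round_trip_ok = toy.equals(restored)
print(round_trip_ok)
```

The same `equals` check could be applied to the full dataframe right after reloading it, at the cost of holding both copies in memory.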
