This notebook contains the code used to build Twitter API queries from URLs found in TheFlipSide articles and to consolidate the tweets those queries return. Unlike the other notebooks, it was run in two parts: first the sections preceding "Consolidating Tweets", and then, after fetching the data with the pull_twitter tool, the final section.

# Imports 

In [55]:
import pandas as pd
import numpy as np

import json
import yaml
import os
from datetime import datetime

import requests
import urlexpander

from ast import literal_eval

from tqdm import tqdm
tqdm.pandas()

import time

# Settings and Configuration 

In [125]:
SRC_DATA   = '../../data/webpage_data/full_flipside_data_clean.csv'
TWEET_DATA = '../../data/twitter_data/url_query_list'
OUTPUT_DIR = '../../data/twitter_data/search/'
FINAL_OUTPUT_FP = '../../data/twitter_data/all_tweet_data.csv'

CLIP_URLS = 8 # Take the first N URLs to attempt to expand for scraping Twitter
MAX_URLS  = 5 # Maximum number of URLs to use in scraping Twitter
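
Neither limit is referenced in the cells shown in this notebook. As a rough sketch of how such limits could be applied, the example below looks at only the first `CLIP_URLS` candidates and stops once `MAX_URLS` usable URLs have been collected. The `select_urls` helper and its `expand` argument are hypothetical and not part of the original code; the identity default stands in for a real expander, which could presumably be passed in instead.

```python
CLIP_URLS = 8  # same value as the setting above
MAX_URLS = 5   # same value as the setting above


def select_urls(urls, expand=lambda u: u, clip=CLIP_URLS, max_urls=MAX_URLS):
    """Hypothetical helper: expand at most `clip` URLs, keep at most `max_urls`."""
    selected = []
    for url in urls[:clip]:
        expanded = expand(url)  # assumed to return None when expansion fails
        if expanded and expanded not in selected:
            selected.append(expanded)
        if len(selected) == max_urls:
            break
    return selected


candidates = [f'https://example.com/{i}' for i in range(10)]  # made-up URLs
print(select_urls(candidates))
```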

# Data Loading 

Load and clean the original article data.

In [38]:
art_data = pd.read_csv(SRC_DATA)

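In [39]:
# Replace newlines with commas so each cell can be parsed as a list literal
art_data['linked_arts_clean'] = art_data['linked_arts_clean'].str.replace('\n', ',')

In [40]:
# Parse each string into a real Python list of URLs
art_data['linked_arts_clean'] = art_data['linked_arts_clean'].apply(literal_eval)

In [41]:
# Count the linked articles per row
art_data['num_news'] = art_data['linked_arts_clean'].apply(len)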
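
To see why the newline replacement matters before `literal_eval`, here is a minimal sketch on a single made-up cell. It assumes the stored strings look like a list whose items are separated by newlines rather than commas; the actual format of the CSV column is not shown in this notebook, and the URLs are placeholders.

```python
from ast import literal_eval

# Hypothetical raw cell: two quoted URLs separated by a newline, no comma
raw_cell = "['https://example.com/a'\n 'https://example.com/b']"

# With the fix, the string parses as a two-item list
links = literal_eval(raw_cell.replace('\n', ','))

# Without it, Python joins the adjacent string literals into one item
without_fix = literal_eval(raw_cell)

print(len(links), len(without_fix))
```

Under that assumption, skipping the replacement would not raise an error; it would quietly merge the URLs into a single string, which is why the cleaning step comes first.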

# Creating Query List

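In [42]:
# Flatten the per-article URL lists and drop any query string from each URL
tweet_queries = [url.split('?')[0] for url_list in art_data['linked_arts_clean'].values for url in url_list]

In [43]:
# Wrap each URL in a search query that excludes replies, retweets, and verified accounts
tweet_queries = ['url:"' + q + '" -is:reply -is:retweet -is:verified' for q in tweet_queries]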

In [44]:
with open(TWEET_DATA, 'w') as f:
    f.write('\n'.join(tweet_queries))
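
As a small illustration of the query format written to the file, the sketch below applies the same two steps to one made-up URL. The `build_query` name is hypothetical; the string it builds follows the cells above.

```python
def build_query(url):
    """Hypothetical helper mirroring the two list comprehensions above."""
    base = url.split('?')[0]  # drop any query string
    return 'url:"' + base + '" -is:reply -is:retweet -is:verified'


example = build_query('https://example.com/story?utm_source=newsletter')
print(example)
```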

# Run Scraping

The Twitter scraping is handled by the `scrape_twitter.sh` bash script in the same directory as this notebook. Run the following command to generate the scraped files.

```
    ./scrape_twitter.sh  "../../data/twitter_data/url_query_list"
```

# Consolidating Tweets 

Align the queries with the Twitter API results to map tweets back to the database.

In [103]:
# Map each linked URL (query string removed) to its article's "title|date"
full_queries = {
    url.split('?')[0]: title + '|' + date
    for title, date, url_list in zip(art_data['title'], art_data['date'], art_data['linked_arts_clean'])
    for url in url_list
}
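
One property of this mapping is easy to miss: because it is keyed by URL, a URL linked from more than one article keeps only the last article's title and date. The sketch below shows this on made-up titles, dates, and URLs; all names in it are illustrative.

```python
# Hypothetical stand-ins for the title, date, and linked_arts_clean columns
titles = ['Story One', 'Story Two']
dates = ['2020-01-01', '2020-01-02']
links = [
    ['https://example.com/a?ref=x', 'https://example.com/shared'],
    ['https://example.com/shared'],
]

url_to_title_date = {
    url.split('?')[0]: title + '|' + date
    for title, date, url_list in zip(titles, dates, links)
    for url in url_list
}

# The shared URL now points at the second article only
print(url_to_title_date['https://example.com/shared'])
```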

In [104]:
# Retrieve the ordered list of tweet queries (one per line)
with open(TWEET_DATA, 'r') as f:
    queries = f.readlines()

In [105]:
# Generate unsorted list of result directories, skipping any .zip archives
data_dirs = [OUTPUT_DIR + resdir for resdir in os.listdir(OUTPUT_DIR) if not resdir.endswith('.zip')]

In [106]:
# Sort the Twitter API results using the datetime in the directory name
data_dirs.sort(key = lambda fn: datetime.strptime(fn.split('/')[-1], '%Y-%m-%d %H.%M.%S'))
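
The sort key can be tried in isolation. This sketch uses made-up directory names in the `%Y-%m-%d %H.%M.%S` pattern that the cell above expects; the `search/` prefix is only a placeholder.

```python
from datetime import datetime

# Hypothetical result directories, deliberately out of order
dirs = [
    'search/2022-10-02 09.15.00',
    'search/2022-10-01 23.59.59',
    'search/2022-10-01 08.00.00',
]

# Parse the last path component as a timestamp and sort chronologically
dirs.sort(key=lambda fn: datetime.strptime(fn.split('/')[-1], '%Y-%m-%d %H.%M.%S'))
print(dirs)
```

Parsing the timestamp, rather than sorting the strings directly, also fails loudly with a `ValueError` if a directory name does not match the expected pattern.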

In [108]:
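# Path to the tweet CSV inside each result directory
data_fps = [data_dir + '/data_tweets.csv' for data_dir in data_dirs]
# Recover the bare URL from each query line: take the first token, then strip the leading url:" and the closing quote
query_urls = [query.split(' ')[0][5:-1] for query in queries]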
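
The slicing used to recover the URL is compact, so a worked example may help. The query line below is made up but follows the format written to the query file earlier, including the trailing newline that `readlines()` keeps.

```python
# Hypothetical line as returned by readlines()
line = 'url:"https://example.com/story" -is:reply -is:retweet -is:verified\n'

first_token = line.split(' ')[0]  # 'url:"https://example.com/story"'
url = first_token[5:-1]           # drop the 5-character url:" prefix and the closing quote
print(url)
```

The trailing newline does no harm here because it sits in the last token, not the first.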

In [109]:
# Pair each query URL with its result file by position
url_to_data = {query_url: data_fp for query_url, data_fp in zip(query_urls, data_fps)}
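
This pairing is positional, and `zip` stops at the shorter input without complaint. The sketch below, on made-up values, shows what that looks like when one result directory is missing; a length comparison of this kind is a possible extra check, not something the original notebook performs.

```python
# Hypothetical inputs: three queries but only two result files
query_urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c']
data_fps = ['run1/data_tweets.csv', 'run2/data_tweets.csv']

paired = dict(zip(query_urls, data_fps))    # the third URL is silently dropped
missing = len(query_urls) - len(data_fps)   # how many queries lack a result file

print(len(paired), missing)
```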

In [110]:
# Ensure all urls are present in both mapping dictionaries
assert set(url_to_data.keys()) == set(full_queries.keys())
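
If the assertion fails, it does not say which URLs are at fault. A minimal sketch of a diagnostic, using set differences on the two key sets, could look like this; `describe_mismatch` is a hypothetical helper, and the dictionaries are made-up stand-ins.

```python
def describe_mismatch(left, right):
    """Hypothetical helper: list the keys found in only one of two mappings."""
    left_keys, right_keys = set(left), set(right)
    return {
        'only_left': sorted(left_keys - right_keys),
        'only_right': sorted(right_keys - left_keys),
    }


# Made-up stand-ins for url_to_data and full_queries
report = describe_mismatch(
    {'https://example.com/a': 'run1', 'https://example.com/b': 'run2'},
    {'https://example.com/b': 'Story|2020-01-01', 'https://example.com/c': 'Other|2020-01-02'},
)
print(report)
```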

In [113]:
# Map each result file to the "title|date" of the article its URL came from
data_to_title_date = {url_to_data[url]: title_date for url, title_date in full_queries.items() }

In [121]:
# Tag each result file's tweets with the article title and date, collecting the frames for consolidation
all_tweets = []
for data_fp, title_date_q in data_to_title_date.items():
    try:
        data = pd.read_csv(data_fp)
        data['title_q'] = title_date_q.split('|')[0]
        data['date_q']  = title_date_q.split('|')[-1]
        all_tweets.append(data)
    except (FileNotFoundError, pd.errors.EmptyDataError):
        # A missing or empty results file is treated as a query with no tweets
        print(f'{title_date_q.split("|")[0]} had no results for a url')

General Election Update had no results for a url
General Election Update had no results for a url
Judy Shelton had no results for a url
Latest Polling and Dems 2020 Update had no results for a url
NYT Scoop and NYS Bill re Trump’s Taxes had no results for a url
Tech Sector Update had no results for a url
Tech Sector Update had no results for a url
Woodward’s Trump Interviews had no results for a url
2020 Census Battle had no results for a url
2020 Election had no results for a url
AOC and Cruz to Work Together on Anti-Lobbying Bill had no results for a url
Adjourning Congress had no results for a url
All Things Healthcare had no results for a url
Anti-Semitism EO had no results for a url
Anti-Semitism EO had no results for a url
Background Checks had no results for a url
Baghdadi Dead had no results for a url
Barr and Stone had no results for a url
Barr for AG and Nauert for UN Ambassador had no results for a url
Barr for AG and Nauert for UN Ambassador had no results for a url
Beto An

Questions Answered had no results for a url
Chris Cuomo had no results for a url
Questions Answered had no results for a url
Ukraine had no results for a url
Britain Approves COVID Vaccine had no results for a url
Myanmar Coup had no results for a url
Monkeypox had no results for a url
Democratic Debate had no results for a url
Recession Debate had no results for a url
Biden’s Speech had no results for a url
COVID Relief Bill had no results for a url
Election Integrity had no results for a url
COVID Vaccines had no results for a url
CDC Guidance had no results for a url
NYT Endorsements had no results for a url
Democratic Debate had no results for a url
Hong Kong had no results for a url
Independence Day had no results for a url
Georgia Senate Runoffs had no results for a url
Captain Crozier had no results for a url
Captain Crozier had no results for a url
French Election had no results for a url
Armenian Genocide had no results for a url
Democratic Debate had no results for a url
Alex

In [122]:
# Consolidate the tweet data, keep only English tweets, and save
all_tweets_df = pd.concat(all_tweets, axis = 0)
all_tweets_df = all_tweets_df[all_tweets_df['lang'] == 'en']
all_tweets_df.to_csv(FINAL_OUTPUT_FP, index = False)
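
The whole consolidation step can be rehearsed without touching the real files. The sketch below is a simplified stand-in, assuming in-memory CSV text instead of the `data_tweets.csv` files; the column names other than `lang`, `title_q`, and `date_q` are made up, and an empty string plays the role of a query with no results.

```python
import io

import pandas as pd

# Hypothetical results keyed by "title|date"; the second one is empty
sources = {
    'Title A|2020-01-01': 'id,lang,text\n1,en,hello\n2,fr,bonjour\n',
    'Title B|2020-01-02': '',
}

frames, skipped = [], []
for title_date, csv_text in sources.items():
    title, date = title_date.split('|')[0], title_date.split('|')[-1]
    try:
        frame = pd.read_csv(io.StringIO(csv_text))
    except pd.errors.EmptyDataError:
        skipped.append(title)  # nothing to parse for this query
        continue
    frame['title_q'] = title
    frame['date_q'] = date
    frames.append(frame)

combined = pd.concat(frames, axis=0, ignore_index=True)
english = combined[combined['lang'] == 'en']
print(len(english), skipped)
```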