# Getting tweets by philosophers

This notebook shows how to build a dataset of philosopher tweets from the list provided by @truesciphi.
It extracts the account handles from the HTML of the list, downloads their timelines with twarc (which you will need to configure first: https://github.com/DocNow/twarc), and extracts which accounts they have been mentioning.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import requests
import re



Load the philo-twitter-list:

In [None]:
# url: truesciphi.org/phi.html
soup = BeautifulSoup(open("philotwitter_2.html",encoding='utf-8'))

In [None]:
# This version oversampled. It was corrected in the next notebook; if you run this notebook, use the corrected cell below instead:
# links = []
# for link in soup.findAll('a'):
#     links.append(link.text)

In [None]:
# This is the correct code:
soup = BeautifulSoup(open("philotwitter_2.html",encoding='utf-8'))
links = []
for link in soup.findAll('a'):
    links.append(link.text)

r1 = re.findall(r"@[A-Za-z0-9_]*",str(links))


In [None]:

r1 = re.findall(r"@[A-Za-z0-9_]*(?=<)",str(soup))
len(r1)
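
To make the extraction step concrete, here is a minimal sketch on a made-up HTML fragment (the markup and the handles are invented for illustration and need not match the real list page). It uses only the standard library, requires at least one character after the `@` so that a bare `@` is not counted, and removes duplicates while keeping the original order.

```python
import re

# Hypothetical fragment; the real page may be structured differently.
sample_html = (
    '<li><a href="https://twitter.com/alice_phil">@alice_phil</a></li>'
    '<li><a href="https://twitter.com/bob_logic">@bob_logic</a></li>'
    '<li><a href="https://twitter.com/alice_phil">@alice_phil</a> mail: x@example.org</li>'
)

# Match handles directly followed by a closing tag, as in the cell above,
# but with "+" instead of "*" so that a lone "@" does not match.
handles = re.findall(r"@[A-Za-z0-9_]+(?=<)", sample_html)

# dict.fromkeys keeps the first occurrence of each handle, in order.
unique_handles = list(dict.fromkeys(handles))
print(unique_handles)
```

The lookahead `(?=<)` is what keeps the e-mail address out of the result: `@example` is followed by a dot, not by a tag.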

We dump our list, in case we need it later:

In [None]:
import pickle

with open("philosophen_liste.p", "wb") as f:
    pickle.dump(r1, f)

In [None]:
import pickle

with open("philosophen_liste.p", "rb") as f:
    philos = pickle.load(f)
print(philos)

Now we send a shell command to twarc that downloads each user's timeline. The amount it gets appears to be capped by the Twitter API, whose user-timeline endpoint only returns an account's most recent tweets (up to about 3,200), so each file usually holds around 3000 tweets. The whole download is roughly 10 GB, so I pointed it to an external drive.

In [None]:
for count, tweeter in enumerate(philos):
    filename = 'E:/timelines/' + tweeter + '.jsonl'
    name = tweeter.replace('@', '')  # twarc expects the handle without the '@'
    print(count)
    !twarc timeline $name > $filename
#     !twarc users $name > $filename
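
Outside a notebook, the same loop can be written with `subprocess` instead of the `!` shell escape. The following is only a sketch under assumptions: the helper names `timeline_command` and `download_timeline` are invented here, and it presumes a configured `twarc` command line (the `twarc timeline` form used above) is available on the PATH.

```python
import subprocess
from pathlib import Path


def timeline_command(handle):
    """Build the argument list for `twarc timeline <name>` (hypothetical helper)."""
    return ["twarc", "timeline", handle.lstrip("@")]


def download_timeline(handle, out_dir):
    """Write one account's timeline to <out_dir>/<handle>.jsonl."""
    out_path = Path(out_dir) / f"{handle}.jsonl"
    with open(out_path, "w", encoding="utf-8") as out_file:
        # check=False: a protected or deleted account should not abort the loop.
        subprocess.run(timeline_command(handle), stdout=out_file, check=False)
    return out_path


# Only builds the command; nothing is downloaded here.
print(timeline_command("@example_handle"))
```

Calling `download_timeline(tweeter, 'E:/timelines/')` for each entry in `philos` would then mirror the cell above.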

We loop over all the collected files and extract, for each author, the user metadata attached to their most recent tweet. We also extract lists of all the accounts they mention (from the 'user_mentions' entries in the 'entities' column). You could do much more with this, e.g. by extracting the tweet texts and building your similarities from those.

In [None]:
import pandas as pd
from IPython.display import display
import os
from io import StringIO
import tqdm

tweetcounts = []
tweeter_data = []

for filename in tqdm.tqdm_notebook(os.listdir(r'E:/timelines/')):
    print(str(r'E:/timelines/'+filename))
    with open(str(r'E:/timelines/'+filename), 'r') as file:
        try:
            data = file.read()
            this_dataset = pd.read_json(StringIO(data), lines=True)
            tweetcounts.append(len(this_dataset))
            mentioned_users_screenames = [[y['screen_name'] for y in x['user_mentions']] for x in this_dataset['entities']]
            mentioned_users_screenames = [item for sublist in mentioned_users_screenames for item in sublist]

            mentioned_users_names = [[y['name'] for y in x['user_mentions']] for x in this_dataset['entities']]
            mentioned_users_names = [item for sublist in mentioned_users_names for item in sublist]

            user_values = this_dataset['user'][0]
            user_values.pop('entities') # we need to remove these, because they mess up the df otherwise.


            user_data = pd.DataFrame(user_values, index=[0])
            user_data['mentioned_users_screenames'] = str(mentioned_users_screenames)
            user_data['mentioned_users_names'] = str(mentioned_users_names)
            tweeter_data.append(user_data)
        except Exception:
            print('Parsing failure') # These are mainly protected accounts
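
The mention extraction can also be illustrated without pandas. The sketch below parses two invented JSON Lines records that mimic the `entities` / `user_mentions` layout used above (real tweets carry many more fields) and counts how often each account is mentioned.

```python
import json
from collections import Counter

# Two made-up tweets in JSON Lines form, one JSON object per line.
sample_jsonl = "\n".join([
    json.dumps({"entities": {"user_mentions": [
        {"screen_name": "alice_phil", "name": "Alice"},
        {"screen_name": "bob_logic", "name": "Bob"},
    ]}}),
    json.dumps({"entities": {"user_mentions": [
        {"screen_name": "alice_phil", "name": "Alice"},
    ]}}),
])

tweets = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]

# Flatten the per-tweet mention lists into one list of screen names.
mentioned = [
    mention["screen_name"]
    for tweet in tweets
    for mention in tweet["entities"]["user_mentions"]
]

mention_counts = Counter(mentioned)
print(mention_counts.most_common())
```

Counting mentions per author in this way is one possible starting point for comparing accounts by who they talk to.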


In [None]:
# We have a look at how many tweets there are.
print(np.sum(tweetcounts))

In [None]:
# And now we build the whole dataset, have a look at it, and pickle it.
tweeter_data = pd.concat(tweeter_data)
display(tweeter_data)
with open("tweeter_data.p", "wb") as f:
    pickle.dump(tweeter_data, f)
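
Because the mention lists are stored with `str(...)`, they come back as strings when the dataset is reloaded. A minimal sketch of turning such a string back into a list, using an invented value rather than real data:

```python
import ast

# What one cell of the 'mentioned_users_screenames' column looks like after str().
stored = str(["alice_phil", "bob_logic", "alice_phil"])

# literal_eval only parses Python literals, so it is safer than eval().
recovered = ast.literal_eval(stored)
print(recovered)
```

On the full table, something like `tweeter_data['mentioned_users_screenames'].apply(ast.literal_eval)` should restore the lists for every row.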