<a href="https://colab.research.google.com/github/CALDISS-AAU/sdsphd19_coursematerials/blob/master/notebooks/Tweet_search_twint.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Twitter data - Easy and without the API

In this short tutorial we use Twint (https://pypi.org/project/twint/) to get Twitter data. Twint is a relatively new package that scrapes Twitter instead of going through the official API. Use with care...

<img src="https://media.giphy.com/media/SMKiEh9WDO6ze/giphy.gif" alt="Girl in a jacket" width="300">

We will cover 2 cases:

*   Searching for tweets
*   Extracting followers/following

**tl;dr Benefits** (from Twint authors)
Some of the benefits of using Twint vs Twitter API:

- Can fetch almost all Tweets (Twitter API limits to last 3200 Tweets only);
- Fast initial setup;
- Can be used anonymously and without Twitter sign up;
- No rate limitations.



In [0]:

!pip3 install -qq twint
!pip install -qq whatthelang


In [0]:
# Import Library
import twint

## Get tweets

This example shows how I collected the lovely #OKBoomer data.

In [0]:
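# Instantiate and configure the twint object
c = twint.Config()
c.Store_object = True   # keep results in memory
c.Pandas = True         # also collect them in twint.storage.panda.Tweets_df
c.Search = "#okboomer"  # search term
c.Limit = 10000         # stop after roughly this many tweets
c.Lang = 'en'           # ask for English tweets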

In [0]:
# Run search
twint.run.Search(c)

In [0]:
# Quick check
twint.storage.panda.Tweets_df.head()

,id,conversation_id,created_at,date,timezone,place,tweet,hashtags,cashtags,user_id,user_id_str,username,name,day,hour,link,retweet,nlikes,nreplies,nretweets,quote_url,search,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date
0,1197612747715837953,1197612747715837953,1574368104000,2019-11-21 20:28:24,UTC,,"Suddenly, #OkBoomer is trending again. https:...",[#okboomer],[],262050686,262050686,DanClarkSports,Dan,3,1,https://twitter.com/DanClarkSports/status/1197...,False,0,0,0,https://twitter.com/Fox35Matt/status/119718535...,#okboomer,,,,,,,"[{'user_id': '262050686', 'username': 'DanClar...",
1,1197612550403297280,1197612550403297280,1574368057000,2019-11-21 20:27:37,UTC,,I like my role in this 🤷‍♂️ #OkBoomer #GenX pi...,"[#okboomer, #genx]",[],41444665,41444665,sinths,Sven Thomas,2,12,https://twitter.com/sinths/status/119761255040...,False,0,0,0,,#okboomer,,,,,,,"[{'user_id': '41444665', 'username': 'sinths'}]",
2,1197612190867476481,1197611882955325441,1574367971000,2019-11-21 20:26:11,UTC,,He looks allot like you. Old and white.\n\n#Ok...,[#okboomer],[],1005516908181925888,1005516908181925888,DustFar,FarThrustStarDust,1,12,https://twitter.com/DustFar/status/11976121908...,False,0,1,0,,#okboomer,,,,,,,"[{'user_id': '1005516908181925888', 'username'...",
3,1197611782669402113,1197611782669402113,1574367874000,2019-11-21 20:24:34,UTC,,wait is my university’s president gaslighting ...,[#okboomer],[],2509106274,2509106274,summerash99,"queer, sultry summer",7,9,https://twitter.com/summerash99/status/1197611...,False,1,0,0,,#okboomer,,,,,,,"[{'user_id': '2509106274', 'username': 'summer...",
4,1197611614687637504,1197611614687637504,1574367834000,2019-11-21 20:23:54,UTC,,"Shut up Conway you whiny, fragile fossil. #okb...","[#okboomer, #expirealready]",[],2789202068,2789202068,Crayondroids,Crayondroids,6,22,https://twitter.com/Crayondroids/status/119761...,False,0,0,0,,#okboomer,,,,,,,"[{'user_id': '2789202068', 'username': 'Crayon...",


### The End (of the data extraction)
The stuff below is just some cleanup...

In [0]:
# Cleanup
tweets = twint.storage.panda.Tweets_df.drop_duplicates(subset=['id'])

In [0]:
# Reindex
tweets = tweets.reset_index(drop=True)

In [0]:
# Remove non-English
from whatthelang import WhatTheLang
wtl = WhatTheLang()

In [0]:
# This function makes it easy to handle exceptions (e.g. no text where text should be)
# Not strictly needed, but can be useful

def detect_lang(text):
    try: 
        return wtl.predict_lang(text)
    except Exception:
        return 'exp'
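
The fallback pattern above can be tried without installing a language model. The following is a minimal sketch: `toy_predict` is a hypothetical stand-in for `wtl.predict_lang` (it is not part of whatthelang), and `make_safe_detector` is an illustrative helper name.

```python
def make_safe_detector(predict, fallback="exp"):
    """Wrap a language predictor so that any failure yields a fallback label."""
    def safe_detect(text):
        try:
            return predict(text)
        except Exception:
            return fallback
    return safe_detect


def toy_predict(text):
    # Hypothetical stand-in for wtl.predict_lang: fails on empty or non-string input
    if not isinstance(text, str) or not text.strip():
        raise ValueError("no text to classify")
    return "en" if " the " in f" {text.lower()} " else "unknown"


detect = make_safe_detector(toy_predict)
labels = [detect(t) for t in ["Suddenly the hashtag is trending", "", None]]
print(labels)
```

Rows labelled with the fallback value are easy to count or drop afterwards, which is handy for checking how many tweets had no usable text.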

In [0]:
%%time
# Performance measurement: you can leave the %%time line out.
# If you keep it, it has to be the first line of the cell.

tweets['lang'] = tweets['tweet'].map(detect_lang)

CPU times: user 383 ms, sys: 1.01 ms, total: 384 ms
Wall time: 390 ms


In [0]:
# Keep only English

tweets = tweets[tweets.lang == 'en']

In [0]:
# Done

tweets.to_json('tweets_boomer.json')

## Get people's connections

This is a short analysis in which I combine (very) basic scraping with extraction of Twitter networks and network analysis. 
The purpose was to identify interesting people on Twitter for me to follow...

The approach:

- Fetch links to all shows of the TWIML AI podcast
- Fetch links to Twitter accounts from the show notes
- Use these URLs to identify users
- Scrape all the people these users follow

Assumption: people who are followed by the guests invited on TWIML AI are people I should be following...



In [0]:
# Import libraries
import re
import pickle # pickle stores (almost) any kind of Python object on disk
import requests as rq

In [0]:
# Load HTML parser library...yes, that's its name.
from bs4 import BeautifulSoup

In [0]:
# Get URLs of all TWIML shows
r = rq.get('https://twimlai.com/shows/')

In [0]:
# Parse the HTML (naming the parser explicitly avoids a warning)
soup = BeautifulSoup(r.text, 'html.parser')

In [0]:
# Fetch all links from parsed HTML
links = soup.find_all('a')

In [0]:
# Keep only links leading to a TWIML podcast episode (anchors without href are skipped)
links = [l['href'] for l in links if l.get('href', '').startswith('https://twimlai.com/twiml-talk')]

In [0]:
# Drop duplicated links
links = list(set(links))

In [0]:
# Iterate and fetch show notes, then extract links leading to Twitter.
twitter_urls = []
for link in links:
  show = rq.get(link) # get show notes
  soup = BeautifulSoup(show.text, 'html.parser') # parse
  show_links = soup.find_all('a') # find links
  show_links = [l['href'] for l in show_links if l.get('href', '').startswith('https://twitter.com')] # keep only links to Twitter (anchors without href are skipped)
  twitter_urls.extend(show_links) # store
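
The link filtering can be tried offline on a small HTML snippet. The sketch below swaps BeautifulSoup for the standard library's `html.parser`, purely so that it runs without extra packages. The class name `TwitterLinkParser` and the HTML snippet are made up for illustration.

```python
from html.parser import HTMLParser


class TwitterLinkParser(HTMLParser):
    """Collect href values of <a> tags that point to twitter.com."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""  # anchors without href yield ""
        if href.startswith("https://twitter.com"):
            self.links.append(href)


# Made-up show notes, for illustration only
html = """
<p>Guest: <a href="https://twitter.com/example_guest">@example_guest</a></p>
<a name="top">anchor without href</a>
<a href="https://example.com/paper">paper</a>
<a href="https://twitter.com/example_guest">same profile again</a>
"""

parser = TwitterLinkParser()
parser.feed(html)
unique_links = sorted(set(parser.links))
print(unique_links)
```

The anchor without an `href` is skipped rather than raising an error, and the duplicate profile link collapses once the list is turned into a set.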

In [0]:
# Store the lovely list of links to Twitter profiles
with open('twitter-list.p', 'wb') as f:
  pickle.dump(list(set(twitter_urls)), f)

In [0]:
# Unless already imported
import twint
import numpy as np

In [0]:
# Filter out overly long Twitter links that are most likely not profiles
usernames = [x.replace('https://twitter.com/','') for x in set(twitter_urls) if len(x) <= 50]
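
The length filter above is a quick heuristic. An alternative, sketched here under assumptions, is to parse each URL and keep only those whose path is a single segment. The helper `username_from_url` and the `RESERVED` set are hypothetical and non-exhaustive, and the status ID in the example is made up.

```python
from urllib.parse import urlparse

# Path segments that are Twitter pages rather than user profiles (assumed, non-exhaustive)
RESERVED = {"intent", "share", "hashtag", "search", "home", "i"}


def username_from_url(url):
    """Return the profile name from a twitter.com URL, or None if it is not a profile link."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in {"twitter.com", "www.twitter.com"}:
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) != 1 or parts[0].lower() in RESERVED:
        return None
    return parts[0].lstrip("@")


urls = [
    "https://twitter.com/ylecun",
    "https://twitter.com/ylecun/",
    "https://twitter.com/ylecun/status/123",
    "https://twitter.com/intent/tweet?text=hello",
    "https://twitter.com/share",
]
names = sorted({u for u in map(username_from_url, urls) if u})
print(names)
```

Compared with the length cut-off, this drops links to single tweets and share buttons regardless of their length, and it normalises trailing slashes.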

In [0]:
# Profile lookup

for username in usernames:
  c = twint.Config()
  c.Username = username
  c.Store_object = True
  c.User_full = False
  c.Pandas = True
  twint.run.Lookup(c)

In [0]:
# Store in a DF
user_df = twint.storage.panda.User_df.drop_duplicates(subset=['id'])

In [0]:
# Store away
user_df.to_csv('user_df.csv')

In [0]:
# Or like that
user_df[['bio','username']].to_csv('short.csv')

In [0]:
# Clean up
twint.storage.panda.clean()
twint.output.clean_follow_list()

In [0]:
# Connect Google drive
from google.colab import drive
drive.mount('/content/drive')



Unfortunately, getting the accounts each person follows is not as easy. It requires some trickery. In this case I decided to write the list of accounts each person follows to disk as a pickle file. I did this after I realized that I often get blank responses. Writing the individual DataFrames to disk allowed me to spot the empty ones and remove them by hand. There is probably a smarter solution to that somewhere.

In [0]:
# We iterate over the usernames and store each user's "following" DataFrame
for u in user_df['username']:
  c = twint.Config()
  c.Username = u
  c.Store_object = True
  c.User_full = False
  c.Pandas = True
  c.Store_pandas = True
  c.Stats = False
  c.Hide_output = True

  twint.run.Following(c)
  twint.storage.panda.Follow_df.to_pickle("/content/drive/My Drive/Colab/TWIML-guests/{}.p".format(u))
  twint.storage.panda.clean()
  twint.output.clean_follow_list()
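
One possible way to deal with the blank responses mentioned above is to retry a few times before giving up. This is only a sketch of the idea: `fetch_with_retry` is an illustrative helper, and `flaky_fetch` is a hypothetical stand-in for the scraping call (it is not a Twint function).

```python
import time


def fetch_with_retry(fetch, username, retries=3, wait=0.0):
    """Call fetch(username) until it returns a non-empty result or retries run out."""
    for attempt in range(1, retries + 1):
        result = fetch(username)
        if result:  # non-empty list of followed accounts
            return result
        time.sleep(wait * attempt)  # back off a little longer each time
    return []


# Hypothetical stand-in for a scraper call that comes back blank on the first two tries
calls = {"count": 0}


def flaky_fetch(username):
    calls["count"] += 1
    return [] if calls["count"] < 3 else ["ylecun", "AndrewYNg"]


following = fetch_with_retry(flaky_fetch, "some_user")
print(calls["count"], following)
```

Users that still come back empty after all retries can be collected in a list and inspected later, instead of being removed by hand.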

In [0]:
# To get the data back we use glob, which helps us deal with many tiny files
import glob
import pandas as pd  # needed for the edgelist below

In [0]:
# Get paths of all stored files
paths = glob.glob('/content/drive/My Drive/Colab/TWIML-guests/*.*')

In [0]:
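# Create an edgelist
# Read the stored DataFrames with "following" lists and combine them into one long edgelist

empty = []   # paths of files that came back blank
frames = []  # one small edge DataFrame per user
for path in paths:
  df = pd.read_pickle(path)
  if len(df) == 1:
    name = df.index[0]
    edges = pd.DataFrame(df['following'][name], columns=['target'])
    edges['source'] = name
    frames.append(edges)
  else:
    empty.append(path)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
edgelist = pd.concat(frames) if frames else pd.DataFrame(columns=['target', 'source'])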

In [0]:
# Reindex

edgelist.index = range(edgelist.shape[0])
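
To make the shape of the edgelist concrete, here is a minimal sketch in plain Python (no pandas). The account names under `following` are made up, and each pair reads "source follows target".

```python
# Made-up "following" lists keyed by the account that follows (illustration only)
following = {
    "guest_a": ["ylecun", "AndrewYNg"],
    "guest_b": ["ylecun"],
    "guest_c": [],  # a blank response: contributes no edges
}

# One (source, target) pair per "source follows target" relation
edges = [(source, target)
         for source, targets in following.items()
         for target in targets]

# Accounts whose list came back empty
empty = [source for source, targets in following.items() if not targets]
print(edges)
print(empty)
```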

### From here: Network analysis 101

In [0]:
import networkx as nx
from networkx.algorithms import bipartite 

In [0]:
G = nx.DiGraph()

In [0]:
G.add_edges_from(zip(edgelist['source'], edgelist['target']))

In [0]:
len(G.nodes)

87738

In [0]:
eigenvector = nx.eigenvector_centrality(G)

In [0]:
nx.set_node_attributes(G, eigenvector, 'eigenvector_centrality')
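
For intuition about what eigenvector centrality measures on a directed "follows" graph: an account scores high when it is followed by accounts that themselves score high. The sketch below is a simplified power iteration in plain Python, not NetworkX's implementation. The function name and the tiny graph are made up for illustration.

```python
def power_iteration_centrality(edges, iterations=100, tol=1e-9):
    """Approximate in-edge eigenvector centrality for a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = dict(score)  # keep the old score and add incoming scores on top
        for source, target in edges:
            new[target] += score[source]  # a follow passes the follower's score on
        norm = sum(v * v for v in new.values()) ** 0.5
        new = {n: v / norm for n, v in new.items()}
        if sum(abs(new[n] - score[n]) for n in nodes) < tol:
            return new
        score = new
    return score


# Made-up graph: everyone follows "star"; "star" follows "a", "a" follows "b", "b" follows "c"
toy_edges = [("a", "star"), ("b", "star"), ("c", "star"),
             ("star", "a"), ("a", "b"), ("b", "c")]
scores = power_iteration_centrality(toy_edges)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy graph `star` ranks first because all other accounts follow it, and `a` ranks second because its single follower is the highest-scoring account.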

In [0]:
import community  # provided by the python-louvain package

In [0]:
G_und = G.to_undirected()

In [0]:
communities = community.best_partition(G_und, resolution = 1)
nx.set_node_attributes(G, communities, 'community')

In [0]:
perc_filter = np.percentile(list(eigenvector.values()), 90)

In [0]:
nodes_selected = [x for x,y in eigenvector.items() if y >= perc_filter]

G_sub = G.subgraph(nodes_selected)
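
The filtering step keeps roughly the top 10% of nodes by centrality. A minimal sketch of the same idea in plain Python follows. It uses the nearest-rank percentile, whereas `np.percentile` interpolates linearly by default, so the threshold can differ slightly. The scores and the helper name are made up.

```python
import math


def percentile_nearest_rank(values, q):
    """Smallest value such that at least q percent of the data is <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up centrality scores 0.01 .. 0.20 for twenty nodes
scores = {f"node_{i}": i / 100 for i in range(1, 21)}

threshold = percentile_nearest_rank(scores.values(), 90)
selected = [name for name, s in scores.items() if s >= threshold]
print(threshold, selected)
```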

In [0]:
communities = community.best_partition(G_sub.to_undirected(), resolution = 1)
nx.set_node_attributes(G_sub, communities, 'community_2')

In [0]:
len(G_sub.nodes)

6772

In [0]:
nx.write_gexf(G_sub, 'twiml.gexf')

In [0]:
net_df = pd.DataFrame(dict(G_sub.nodes(data=True))).T

In [0]:
net_df.groupby('community_2').apply(lambda t: t.sort_values(['eigenvector_centrality'],ascending=False)[:10])

community_2,,eigenvector_centrality,community,community_2
0.0,ylecun,0.086733,4.0,0.0
0.0,AndrewYNg,0.070358,3.0,0.0
0.0,demishassabis,0.065999,4.0,0.0
0.0,GoogleAI,0.061435,3.0,0.0
0.0,BillGates,0.058128,12.0,0.0
...,...,...,...,...
8.0,DeepMindAI,0.070552,4.0,8.0
8.0,OpenAI,0.067212,4.0,8.0
8.0,pabbeel,0.065127,4.0,8.0
8.0,drfeifei,0.065037,4.0,8.0
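
The group-and-rank step can also be written in plain Python, which may make the logic easier to follow. A minimal sketch, using a few rows copied from the output above and a top-2 cut-off instead of top-10:

```python
from collections import defaultdict

# (name, eigenvector_centrality, community_2) rows copied from the output above
rows = [
    ("ylecun", 0.086733, 0),
    ("AndrewYNg", 0.070358, 0),
    ("demishassabis", 0.065999, 0),
    ("DeepMindAI", 0.070552, 8),
    ("OpenAI", 0.067212, 8),
    ("pabbeel", 0.065127, 8),
]

# Group nodes by community
by_community = defaultdict(list)
for name, centrality, community_id in rows:
    by_community[community_id].append((centrality, name))

# Within each community, sort by centrality (descending) and keep the first top_n names
top_n = 2
top = {cid: [name for _, name in sorted(members, reverse=True)[:top_n]]
       for cid, members in by_community.items()}
print(top)
```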


In [0]:
nlp_ppl = net_df[net_df.community_2 == 3].sort_values(['eigenvector_centrality'],ascending=False).index