<img style="float: left; vertical-align: middle; margin: 1em;" src="images/surf.png" >
<img style="float: right; height: 5em; vertical-align: middle; margin: 1em;" src="images/twitter.png">

<hr style="clear: both;" />

# Twi-XL TwiNL collection demo

Twi-XL contains a Twitter archive called TwiNL, which has been maintained by [SURF](https://www.surf.nl/en) since 2011. The Twi-XL API exposes functionality for searching tweets and counting word frequencies. This notebook provides a brief overview of this functionality through the [Twi-XL Python library](https://gitlab.com/twi-xl-surf-nl/twi-xl-python).

For background on the Twi-XL API, see the [Architecture Overview](https://twi-xl-python.readthedocs.io/en/latest/architecture_overview.html).

For the Twi-XL Python library, see the [API reference on Read the Docs](https://twi-xl-python.readthedocs.io/en/latest/api.html#main-interface).

## Prerequisites
### Installing Python libraries 
To install the Twi-XL Python library and the dependencies we will need in this notebook, run the following cell:

(**Warning: this might take a while**)

In [None]:
!pip3 install --quiet seaborn spacy tqdm tweepy snscrape wordcloud tldextract
!pip3 install --quiet git+https://gitlab.com/twi-xl-surf-nl/twi-xl-python.git@b8b301ca707ac46ce8eb9f7cb31035528c25385a

!python3 -m spacy download nl_core_news_sm  # this will download the Dutch language pipeline for spaCy

### Importing Python libraries
The `twinl` package from the Python Twi-XL library provides all functionality for interfacing with the TwiNL archive. Run the following cell to import it:

In [None]:
from twixl.collections import twinl

Besides the Twi-XL library, we import the other packages used later in this notebook. They provide functionality such as plotting graphs and word clouds, handling dates, reading and writing csv files, and parsing URLs.

In [None]:
import csv
from datetime import datetime
import spacy
import seaborn as sns
import matplotlib.pyplot as plt
import twixl.collections.twinl.plotting
import snscrape.modules.twitter as sntwitter
from wordcloud import WordCloud, STOPWORDS
from urllib.parse import urlparse
import tldextract

### Configuring the API
The Twi-XL Python library needs some information to communicate with the Twi-XL API. We will set two environment variables, one containing the endpoint (URL) of the Twi-XL API, and another containing an API key that is used to authenticate with the API.

Paste your API key directly after `TWIXL_API_KEY=` in the cell below, without quotes or other additions.

In [None]:
%env TWIXL_API_ENDPOINT=https://api.twi-xl.sda-projects.nl
%env TWIXL_API_KEY=
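
Before sending any requests, it can help to confirm that both variables are actually set. The following is a minimal sketch using only the standard library; the helper name `missing_twixl_settings` is hypothetical and not part of the Twi-XL library.

```python
import os

# Names of the two environment variables set in the cell above.
REQUIRED_VARIABLES = ("TWIXL_API_ENDPOINT", "TWIXL_API_KEY")


def missing_twixl_settings(environ=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; not part of the Twi-XL library.
    """
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name, "").strip()]


missing = missing_twixl_settings()
if missing:
    print("Please set: " + ", ".join(missing))
else:
    print("API configuration looks complete.")
```

An empty `TWIXL_API_KEY=` line, as in the template above, is reported as missing.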

## Archive metrics
We're all set! First, let's have a look at the number of tweets collected since the beginning. We'll use the `tweet_metrics()` function to retrieve the number of tweets for each day in the archive and convert them to a Pandas series:

In [None]:
tweet_metrics = twinl.tweet_metrics()
tweet_metrics.to_pandas()

We can plot the tweet metrics using the `plot_tweet_metrics()` function:

In [None]:
# The lines below configure the environment for plotting, we only need to do this once

sns.set(rc={'figure.figsize': (14, 8)})
sns.set_context('notebook', font_scale=2)
sns.set_style('whitegrid')

twinl.plotting.plot_tweet_metrics(tweet_metrics);

**Note: some tweets in the archive are marked as being created before the inception of the TwiNL archive, some of them even from 1994! The archive contains raw data from Twitter and can contain these kinds of inconsistencies as a result.**

## Query design

### Query example 1: OR query
In order to search for tweets we will first need to construct a query. We can do this with the `Query` class in the `twinl` module. As an example, consider the following query designed to find tweets containing the words 'elfstedentocht' **or** 'schaatsen':

In [None]:
query_example_1 = (
    twinl.Query()
        .or_(keywords=["elfstedentocht", "schaatsen"])
)

query_example_1.print()

The `or_()` method used above specifies that a query matches if a tweet contains any of the words in the list provided by the `keywords` parameter. The `print()` method prints the contents of the query that will be sent to the Twi-XL API endpoint.

### Query example 2: AND query
We can also write queries where all keywords must be present in a tweet. Consider the following example, where the words 'elfsteden' and 'tocht' must both appear in the tweet:

In [None]:
query_example_2 = (
    twinl.Query()
        .and_(keywords=["elfsteden", "tocht"])
)

query_example_2.print()

### Query example 3: AND + OR query
Query criteria such as AND and OR can be combined by chaining operators such as `and_()` and `or_()` on the `Query` object. When specifying a query with multiple criteria the query will match if **any** of the criteria matches. For example, consider the following query where we will match tweets if any of the following criteria apply:

1. The tweet contains the word 'elfstedentocht' OR 'schaatsen';
2. The tweet contains the word 'elfsteden' AND 'tocht'.

In [None]:
query_example_3 = (
    twinl.Query()
        .or_(keywords=["elfstedentocht", "schaatsen"])
        .and_(keywords=["elfsteden", "tocht"])
)

query_example_3.print()
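
To make the "any criterion matches" rule concrete, here is a small local model of the matching logic described above. It is only a sketch: the function `matches` and the plain whitespace tokenisation are assumptions made for illustration, and the Twi-XL API may tokenise and match tweets differently.

```python
def matches(tweet_text, criteria):
    """Return True if the tweet satisfies at least one criterion.

    `criteria` is a list of (operator, keywords) pairs, where operator is
    "or" (any keyword present) or "and" (all keywords present).
    Hypothetical local model; tokenisation is a plain lowercase split.
    """
    words = set(tweet_text.lower().split())
    for operator, keywords in criteria:
        present = [keyword in words for keyword in keywords]
        if operator == "or" and any(present):
            return True
        if operator == "and" and all(present):
            return True
    return False


criteria = [
    ("or", ["elfstedentocht", "schaatsen"]),
    ("and", ["elfsteden", "tocht"]),
]

print(matches("morgen gaan we schaatsen", criteria))       # satisfies the OR criterion
print(matches("de elfsteden tocht komt eraan", criteria))  # satisfies the AND criterion
print(matches("alleen elfsteden vandaag", criteria))       # satisfies neither criterion
```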

### Query example 4: regular expressions
With regular expressions we can create more flexible queries. Consider the following example, where we search for any tweet containing words starting with 'elf' or 'schaats':

In [None]:
query_example_4 = (
    twinl.Query()
        # raw strings (r"...") keep \b as a regex word boundary instead of a backspace character
        .or_(keywords=[r"\belf\w+", r"\bschaats\w+"], regex=True)
)

query_example_4.print()
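
Python string literals treat `\b` as a backspace character, so regular expressions containing it are best written as raw strings. The sketch below shows the difference using Python's own `re` module; whether the Twi-XL API uses exactly the same regular expression dialect is an assumption.

```python
import re

# Without the r prefix, Python turns "\b" into a single backspace character.
print(len("\b"), len(r"\b"))

# With a raw string, \b reaches the regex engine as a word boundary.
pattern = re.compile(r"\belf\w+")

print(pattern.findall("de elfstedentocht en de twaalf provincies"))
```

The word 'twaalf' is not matched, because its 'elf' does not start at a word boundary.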

### Query example 5: combining AND, OR and regular expressions
For another example, consider this query that finds tweets matching any of the following criteria:

1. The tweet contains the words 'elfstedentocht' or 'schaatsen';
1. The tweet contains words starting with 'elf' or 'schaats';
1. The tweet contains the words 'elfsteden' and 'tocht'.

In [None]:
query_example_5 = (
    twinl.Query()
        .or_(keywords=["elfstedentocht", "schaatsen"])
        .or_(keywords=[r"\belf\w+", r"\bschaats\w+"], regex=True)  # raw strings keep \b intact
        .and_(keywords=["elfsteden", "tocht"])
)

query_example_5.print()

In [None]:
query_custom = (
    twinl.Query()
        .or_(keywords=["#verkiezingen"])
)

query_custom.print()

## Searching with the Twi-XL API
Apart from a search query (we will use our custom query) we will also need to provide the API with a time range, consisting of a **start date** and time and **end date** and time. 

Optionally, we can specify the **maximum** number of tweets we want returned in our query. We can remove the parameter if we want all results.

Tweets are returned in **chronological** order. If a maximum number *x* is set, we retrieve the earliest *x* tweets. 

The following cell runs our custom query on the TwiNL archive for the period from January 1 up to and including January 22, 2017 using the `search()` function. Because searching can take a while, we provide a so-called callback function to the `search()` function. This function will be called every few seconds with the current status of the query. The `twinl.print_callback` in the cell below is a default callback that simply prints the current query status.

In [None]:
search_results = twinl.search(
    query=query_custom,
    start_time=datetime(2017, 1, 1, 0, 0),
    end_time=datetime(2017, 1, 22, 23, 59, 59),
    max_results=100,
    callback=twinl.print_callback
)

#### Pandas integration
Due to Twitter user policy restrictions the Twi-XL API returns only metadata of the tweets such as their IDs and timestamps. The search results can be converted to a Pandas data frame using the `to_pandas()` method:

In [None]:
df = search_results.to_pandas()

The Pandas data frame can be written to a csv file. To change the filename you can simply edit it in the cell below. 

In [None]:
df.to_csv("demo.csv")

### Frequency plot
The `twinl.plotting` module contains some example functions to plot search results. In order to see how tweets are distributed over time we can plot the frequencies of the tweets using the `plot_tweet_frequencies()` function. 

In [None]:
twinl.plotting.plot_tweet_frequencies(search_results, title="Number of '#verkiezingen' tweets per day");

### Word frequencies from the TwiNL database
#### Daily 
Apart from searching for tweets, we can also look up daily word frequencies using the `twinl.word_frequency()` function. For a word frequency search we have to specify a date and, optionally, the minimum word length (default 1), the number of words returned (default is all) and the minimum occurrence rate of words (default is 1).

In the following cell, we retrieve word frequencies for January 4 2017 with a minimum word length of 5, a minimum occurrence rate (`frequency_limit`) of 100, and 50,000 maximum words returned:

In [None]:
word_frequencies = twinl.word_frequency(
    date=datetime(2017, 1, 4),
    min_length_words=5,
    max_results=50000,
    frequency_limit=100,
    callback=twinl.print_callback
)
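
To illustrate how the three optional parameters interact, the following sketch applies comparable filters to a small, made-up frequency table. The helper `filter_frequencies` is hypothetical, and both the filtering order and the sorting by frequency are assumptions; the real filtering is done by the Twi-XL API.

```python
def filter_frequencies(counts, min_length_words=1, frequency_limit=1, max_results=None):
    """Filter a {word: count} mapping following the parameter descriptions above.

    Hypothetical local helper; the real filtering happens server-side.
    """
    kept = [
        (word, count)
        for word, count in counts.items()
        if len(word) >= min_length_words and count >= frequency_limit
    ]
    # Most frequent first; ties are broken alphabetically for a stable order.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept if max_results is None else kept[:max_results]


# Made-up counts, for illustration only.
example_counts = {"de": 900, "verkiezingen": 150, "stemmen": 120, "debat": 80, "kiezer": 100}

print(filter_frequencies(example_counts, min_length_words=5, frequency_limit=100, max_results=2))
```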

The word frequencies can likewise be converted to a Pandas data frame and written to a csv file.

In [None]:
df = word_frequencies.to_pandas()
df.to_csv("demo_word_freq_example.csv")

### Twi-XL word cloud
To visualize the daily word frequencies we can create a word cloud. We will first filter the results with a list of known Dutch stopwords provided by the [spaCy](https://spacy.io/usage/spacy-101) natural language processing library, and plot a word cloud with the `plotting.plot_word_cloud()` function:

In [None]:
nl = spacy.load("nl_core_news_sm")
stopwords = nl.Defaults.stop_words

twinl.plotting.plot_word_cloud(word_frequencies, stopwords=stopwords, max_words=100, min_word_length=4);


### Twi-XL circular plot 
#### Hourly 
To visualize the most-tweeted words per hour in a day we can use the `plotting.plot_circular_bars()` function:

In [None]:
twinl.plotting.plot_circular_bars(word_frequencies, stopwords=stopwords, group_size=2);

## Tweet analysis
### Retrieval 
So far, the Twi-XL API provides only tweet IDs and timestamps, or aggregate data.

To access the full tweets with more metadata, we use a scraping library called [snscrape](https://github.com/JustAnotherArchivist/snscrape).

We first store all tweet objects in a list, which we process further below. A tweet object has properties defined by the scraping tool (rather than by the Twitter API, although the two overlap), such as hashtags, links, user, replyCount and retweetCount (see [here](https://github.com/JustAnotherArchivist/snscrape/blob/master/snscrape/modules/twitter.py) for an overview starting from line 100). Each tweet is stored as a dictionary, a Python data structure that maps unique keys to values.

In [None]:
tweets = []
first_x_tweets = 10
tweet_ids = [tweet_metadata.tweet_id for tweet_metadata in search_results.tweets[:first_x_tweets]]
for tweet_id in tweet_ids:
    scraper = sntwitter.TwitterTweetScraper(tweetId=tweet_id, mode=sntwitter.TwitterTweetScraperMode.SINGLE)
    for tweet in scraper.get_items():
        print(tweet)
        # Keep only real tweet objects and store their attributes as a dictionary
        if isinstance(tweet, sntwitter.Tweet):
            tweets.append(vars(tweet))

### Store tweets in CSV 
We can store the scraped tweets with all the delivered metadata in a csv file as follows.

In [None]:
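# newline='' lets the csv module handle line endings itself
with open('tweets.csv', 'w', newline='') as tweets_csv:
    writer = csv.writer(tweets_csv)
    writer.writerow(tweets[0].keys())
    for tweet_dictionary in tweets:
        writer.writerow(tweet_dictionary.values())
    # no explicit close() needed: the with block closes the file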
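
Writing `keys()` once and `values()` per row relies on every dictionary having the same keys in the same order. As a hedged alternative, `csv.DictWriter` makes the columns explicit and fills in missing values. The rows below are made up for illustration, and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Made-up rows standing in for the scraped tweet dictionaries.
example_tweets = [
    {"id": 1, "user": "alice", "replyCount": 2},
    {"id": 2, "user": "bob", "retweetCount": 5},
]

# Collect every key that occurs, preserving first-seen order.
fieldnames = list(dict.fromkeys(key for row in example_tweets for key in row))

buffer = io.StringIO()  # in-memory stand-in for a file opened with newline=''
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(example_tweets)

print(buffer.getvalue())
```

Keys that a row lacks are written as empty cells, so all rows stay aligned with the header.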

## Hashtags
### Hashtags per tweet
First, we can simply print the hashtags that occur within each tweet.

In [None]:
for tweet in tweets:
    tweet_hashtags_list = tweet['hashtags']
    if tweet_hashtags_list is not None:
        tweet_hashtags_string = ', '.join(tweet_hashtags_list)
        print('Hashtags for tweet ' + tweet['url'] + ': ' + tweet_hashtags_string)


### Hashtag frequencies

Next, we look at which hashtags occur overall and how often. We again use a dictionary, with each hashtag as key and its number of occurrences as value.

In [None]:
hashtags = {}
for tweet in tweets:
    tweet_hashtags_list = tweet['hashtags'] or []  # hashtags is None for tweets without hashtags
    for hashtag in tweet_hashtags_list:
        if hashtag in hashtags:
            hashtags[hashtag] = hashtags[hashtag] + 1
        else:
            hashtags[hashtag] = 1

hashtags_sorted = sorted(hashtags.items(), key=lambda x:x[1], reverse=True)
for hashtag in hashtags_sorted:
    print(hashtag[0] + ': ' + str(hashtag[1]))
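
The same counting can be written more compactly with `collections.Counter` from the standard library. This is a sketch on made-up data, not the notebook's scraped tweets; a value of `None` mimics a tweet without hashtags.

```python
from collections import Counter

# Made-up tweets; None mimics a tweet without hashtags.
example_tweets = [
    {"hashtags": ["verkiezingen", "tk2017"]},
    {"hashtags": None},
    {"hashtags": ["verkiezingen"]},
]

hashtag_counts = Counter(
    hashtag
    for tweet in example_tweets
    for hashtag in (tweet["hashtags"] or [])
)

# most_common() returns (hashtag, count) pairs, most frequent first.
for hashtag, count in hashtag_counts.most_common():
    print(f"{hashtag}: {count}")
```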


We can also store these values in a csv by executing the code below.

In [None]:
with open('hashtags_frequencies.csv', 'w', newline='') as file:
    csv_out = csv.writer(file)
    csv_out.writerow(['hashtag', 'count'])
    for row in hashtags_sorted:
        csv_out.writerow(row)
    # the with block closes the file automatically


#### Hashtag cloud

We can also plot a hashtag cloud that visualizes how often which hashtags appear for our search query. For this we use the Python [wordcloud](https://github.com/amueller/word_cloud) package.

In [None]:
# concatenate all hashtags into one text
tweet_hashtags_list = []
for tweet in tweets:
    tweet_hashtags_list.append(' '.join(tweet['hashtags'] or []))  # hashtags can be None
text = ' '.join(tweet_hashtags_list)

# a default list of stopwords is used; one can add individual ones with stopwords.add('word')
stopwords = set(STOPWORDS)

# create word cloud object
wc = WordCloud(max_words=1000, stopwords=stopwords, margin=10,
               random_state=1).generate(text)

# store default colored image
default_colors = wc.to_array()
wc.to_file("hashtag_cloud.png")
plt.title("Hashtag cloud")
plt.imshow(default_colors, interpolation="bilinear")
plt.axis("off")
plt.show()

## URLs

In the following, we are going to print the URLs shared for individual tweets. 

In [None]:
for tweet in tweets:
    links = tweet['links']
    if links is not None:
        urls = [l.url for l in links]
        for url in urls:
            print('Urls for tweet ' + tweet['url'] + ': ' + url)

### URL frequencies

In [None]:
urls = {}
for tweet in tweets:
    links = tweet['links']
    if links is not None:
        tweet_urls_list = [l.url for l in tweet['links']]
        for url in tweet_urls_list:
            if url in urls:
                urls[url] = urls[url] + 1
            else:
                urls[url] = 1

urls_sorted = sorted(urls.items(), key=lambda x:x[1], reverse=True)
for url in urls_sorted:
    print(url[0] + ': ' + str(url[1]))


We can also store these values in a csv by executing the code below.

In [None]:
with open('url_frequencies.csv', 'w', newline='') as file:
    csv_out = csv.writer(file)
    csv_out.writerow(['url', 'count'])
    for row in urls_sorted:
        csv_out.writerow(row)
    # the with block closes the file automatically


#### Domain frequencies

Each URL has a particular domain which can be studied to detect overall trends in dominant sites. For this we make use of the [tldextract](https://pypi.org/project/tldextract/) package.

In [None]:
domains_list = [tldextract.extract(url).domain for url in urls.keys()]
domains = {}
for domain in domains_list:
    if domain in domains:
        domains[domain] = domains[domain] + 1
    else:
        domains[domain] = 1
    
domains_sorted = sorted(domains.items(), key=lambda x:x[1], reverse=True)
for domain in domains_sorted:
    print(domain[0] + ': ' + str(domain[1]))
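
For comparison, the standard library's `urlparse` can only return the full hostname of a URL. Splitting that hostname into a registered domain and a suffix requires knowledge of public suffixes such as `co.uk`, which is why tldextract is used above. The sketch below counts hostnames for a few made-up URLs; the helper `hostname_without_www` is hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

# Made-up URLs, for illustration only.
example_urls = [
    "https://www.nos.nl/artikel/1",
    "https://nos.nl/artikel/2",
    "https://www.bbc.co.uk/news",
]


def hostname_without_www(url):
    """Return the lowercase hostname of a URL, minus a leading 'www.'."""
    hostname = urlparse(url).hostname or ""
    return hostname[4:] if hostname.startswith("www.") else hostname


host_counts = Counter(hostname_without_www(url) for url in example_urls)
print(host_counts.most_common())
```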

We can also store these values in a csv by executing the code below.

In [None]:
with open('domain_frequencies.csv', 'w', newline='') as file:
    csv_out = csv.writer(file)
    csv_out.writerow(['domain', 'count'])
    for row in domains_sorted:
        csv_out.writerow(row)
    # the with block closes the file automatically


#### Domain clouds

Word clouds of individual URLs quickly become hard to read. For domains, the cloud visualization is more useful.

In [None]:
text = ' '.join(list(domains.keys()))

# create word cloud object
wc = WordCloud(max_words=1000, margin=10,
               random_state=1).generate(text)

# store default colored image
default_colors = wc.to_array()
wc.to_file("domain_cloud.png")
plt.title("Domain cloud")
plt.imshow(default_colors, interpolation="bilinear")
plt.axis("off")
plt.show()

#### Country code frequencies
Sometimes it is useful to see which country codes occur most frequently among the shared links. For this, we extract the top-level domain suffix of each URL (which includes generic suffixes such as `com` alongside country codes such as `nl`) and sort the suffixes by frequency.

In [None]:
country_code_list = [tldextract.extract(url).suffix for url in urls.keys()]
country_codes = {}
for country_code in country_code_list:
    if country_code in country_codes:
        country_codes[country_code] = country_codes[country_code] + 1
    else:
        country_codes[country_code] = 1
    
country_codes_sorted = sorted(country_codes.items(), key=lambda x:x[1], reverse=True)
for country_code in country_codes_sorted:
    print(country_code[0] + ': ' + str(country_code[1]))

Again, we can store the results in a csv.

In [None]:
with open('country_code_frequencies.csv', 'w', newline='') as file:
    csv_out = csv.writer(file)
    csv_out.writerow(['country code', 'count'])
    for row in country_codes_sorted:
        csv_out.writerow(row)
    # the with block closes the file automatically

#### Country code clouds
Alternatively, we can display the occurring country codes in a word cloud.

In [None]:
text = ' '.join(list(country_codes.keys()))

# create word cloud object
wc = WordCloud(max_words=1000, margin=10,
               random_state=1).generate(text)

# store default colored image
default_colors = wc.to_array()
wc.to_file("country_code_cloud.png")
plt.title("Country code cloud")
plt.imshow(default_colors, interpolation="bilinear")
plt.axis("off")
plt.show()

### URLs and users

In the following, we retrieve all URLs per user for the given search query.

In [None]:
links_per_user = {}
for tweet in tweets:
    if tweet['user'] is not None:
        user = tweet['user'].username
        if tweet['links'] is not None:
            links = [t.url for t in tweet['links']]
            print(links)
            if user in links_per_user:
                links_per_user[user].extend(links)  # extend() works in place and returns None
            else:
                links_per_user[user] = links

print(links_per_user)
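
Grouping values per user is a common pattern that `collections.defaultdict` expresses without the explicit `if`/`else`. The following sketch uses made-up (username, links) pairs rather than the scraped tweets.

```python
from collections import defaultdict

# Made-up (username, links) pairs standing in for the scraped tweets.
example_tweets = [
    ("alice", ["https://example.org/a"]),
    ("bob", ["https://example.org/b"]),
    ("alice", ["https://example.org/c", "https://example.org/d"]),
]

links_by_user = defaultdict(list)
for username, links in example_tweets:
    # extend() adds each URL individually and modifies the list in place;
    # it returns None, so its result should not be assigned back.
    links_by_user[username].extend(links)

print(dict(links_by_user))
```

Each user ends up with one flat list of URLs, which can then be joined with `' ; '.join(...)` when writing the csv.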

We can store this output as a csv with the columns **user** and **url**, separating multiple URLs with a semicolon.

In [None]:
with open('urls_users.csv','w') as file:
    w = csv.writer(file)
    w.writerow(['user', 'url'])
    for user in links_per_user:
        w.writerow([user, ' ; '.join(links_per_user[user])])

### Domains and users
We can repeat the same inquiry on the domain level.

In [None]:
domains_per_user = {}
for tweet in tweets:
    if tweet['user'] is not None:
        user = tweet['user'].username
        if tweet['links'] is not None:
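            # use a separate name so the `domains` frequency dictionary from above is not overwritten
            user_domains = [tldextract.extract(t.url).domain for t in tweet['links']]
            print(user_domains)
            if user in domains_per_user:
                domains_per_user[user].extend(user_domains)  # extend() works in place and returns None
            else:
                domains_per_user[user] = user_domains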

print(domains_per_user)

We can store this output as a csv as well.

In [None]:
with open('domains_users.csv','w') as file:
    w = csv.writer(file)
    w.writerow(['user', 'domain'])
    for user in domains_per_user:
        w.writerow([user, ' ; '.join(domains_per_user[user])])

## Text

#### Word clouds
In the following, we create a word cloud showing the most dominant words in the tweet texts.

In [None]:
nl = spacy.load("nl_core_news_sm")
stopwords = nl.Defaults.stop_words
stopwords.add('https')
texts = [t['rawContent'] for t in tweets]
wordcloud = WordCloud(stopwords=stopwords, min_word_length=4).generate(' '.join(texts))
plt.figure(figsize=(14, 8))
# No axis details
plt.axis("off")
plt.imshow(wordcloud)