# Sifting through Twitter data

Let us put regex search to some real use. We download a selection of tweets from Twitter mentioning the hashtag '#metoo'.

Download [this file](https://gist.githubusercontent.com/ThomasA/9c524894e17d56b211c51cdc34c404ca/raw/8cfade6dbe859999dcfaaba3cc5d96f46cd43da9/twitter.csv):


In [None]:
!wget https://gist.githubusercontent.com/ThomasA/9c524894e17d56b211c51cdc34c404ca/raw/8cfade6dbe859999dcfaaba3cc5d96f46cd43da9/twitter.csv

In [None]:
import pandas

twitter = pandas.read_csv('twitter.csv')

twitter

Find tweets with the word 'køn':

In [None]:
import re

for text in twitter['text']:
    match = re.search(r'køn', text)
    if match:
        # Print the whole tweet; match.group() would only ever print 'køn'
        print(text)

Find tweets that retweet someone:

In [None]:
for text in twitter['text']:
    match = re.search(r'RT \@[A-Za-z0-9]+', text)
    if match:
        print(match.group())

Find tweets that retweet someone *and* are tagged '#dkpol':

In [None]:
for text in twitter['text']:
    match = re.search(r'RT \@[A-Za-z0-9]+.+\#dkpol', text)
    if match:
        print(match.group())
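Note that the single pattern above only matches when '#dkpol' comes *after* the retweet prefix. As a minimal sketch, assuming a few made-up tweets in place of the real data, two separate searches express "and" regardless of order:

```python
import re

# Made-up tweets (an assumption for illustration, not from twitter.csv)
samples = [
    'RT @someone: debat i dag #dkpol',
    '#dkpol RT @someone: debat i dag',
    'RT @someone: intet hashtag her',
]

# Single pattern from the cell above: the hashtag must follow the retweet
ordered = [t for t in samples if re.search(r'RT \@[A-Za-z0-9]+.+\#dkpol', t)]

# Two independent searches: both must succeed, in any order
either_order = [t for t in samples
                if re.search(r'RT @[A-Za-z0-9]+', t) and re.search(r'#dkpol', t)]

print(len(ordered), len(either_order))
```

Which variant is preferable depends on whether the position of the hashtag matters for the question being asked.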

Find tweets that are tagged '#dkpol' *or* '#ligestilling':

In [None]:
for text in twitter['text']:
    match = re.search(r'\#dkpol|\#ligestilling', text)
    if match:
        print(match.group())
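The loops above print only the matched text. As a minimal sketch, assuming a small made-up `sample` dataframe in place of `twitter.csv`, pandas can also select the whole rows whose text matches, using `Series.str.contains`, which treats the pattern as a regex by default:

```python
import pandas

# Made-up stand-in for the real data set (an assumption for illustration)
sample = pandas.DataFrame({'text': [
    'RT @someone: vigtig debat #dkpol',
    'Ny rapport om køn og løn #ligestilling',
    'Godmorgen alle sammen',
]})

# Boolean mask: True where the tweet carries either hashtag
tagged = sample['text'].str.contains(r'#dkpol|#ligestilling')

# Keep only the matching rows
tagged_tweets = sample[tagged]
print(tagged_tweets)
```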

# Building numerical data to analyse

As an example of the kind of numerical data NumPy handles well, let us use our Twitter data set to build a matrix of which users mention which other users, and how often.

### Exercise: Finding Twitter user names

Quoting from [Twitter](https://help.twitter.com/en/managing-your-account/twitter-username-rules):
> Your username cannot be longer than 15 characters. Your name can be longer (50 characters) or shorter than 4 characters, but usernames are kept shorter for the sake of ease.  
> A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces.

- Compose a regex that can match valid Twitter user names, in a context where they start with '@'
- Test it using `re.search`
  - Should match for example '@ThomasArildsen', '@SiGnE14', '@_Kristian'
  - Should not match '@Thomas+', '@Signe@AAU', '@ThomasArildsen_AAU'

In [None]:
import re

re.search(r'(?<=\@)\w{1,15}(?!\w)', '@ThomasArildsen ', flags=re.A)

In [None]:
re.search(r'(?<=\@)\w{1,15}(?!\w)', '@ThomasArildsen_AAU', flags=re.A)

In [None]:
# (?a) is the inline form of flags=re.A, so \w stays ASCII-only wherever the pattern is used
valid_user_ex = r'(?a)(?<=\@)\w{1,15}(?!\w)'
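Note that `re.search` looks for a valid name anywhere in the string, so with the lookaround pattern '@Thomas+' still yields 'Thomas'. One way to read "should not match" is that the *whole* string must be a valid user name. A minimal sketch of that reading, using `re.fullmatch` and a hypothetical helper `is_valid_handle` that is not part of the original notebook:

```python
import re

# Hypothetical helper (an assumption for illustration)
def is_valid_handle(candidate):
    """Return True if the whole string is '@' plus 1-15 ASCII word characters."""
    return re.fullmatch(r'@\w{1,15}', candidate, flags=re.A) is not None

should_match = ['@ThomasArildsen', '@SiGnE14', '@_Kristian']
should_not_match = ['@Thomas+', '@Signe@AAU', '@ThomasArildsen_AAU']

for handle in should_match + should_not_match:
    print(handle, is_valid_handle(handle))

# For comparison: re.search with the lookaround pattern finds 'Thomas' here
partial = re.search(r'(?<=\@)\w{1,15}(?!\w)', '@Thomas+', flags=re.A)
print(partial.group())
```

For extracting mentions from running tweet text, the `re.search`/`re.finditer` form remains the practical choice; the whole-string check is only useful for validating a single candidate.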

How many unique (tweeting) users are among the tweets?

In [None]:
unique_tweeters = twitter['from_user_name'].unique()
unique_tweeters

In [None]:
type(unique_tweeters)

When we call `unique()` on a column of a pandas dataframe, the result is typically not a pandas object but a NumPy array.
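A small check of this, assuming a made-up integer Series so that the example runs without the data set:

```python
import numpy as np
import pandas

# Made-up values (an assumption for illustration)
values = pandas.Series([3, 1, 3, 2, 1])

# unique() keeps the order of first appearance and drops repeats
unique_values = values.unique()

print(type(unique_values))
print(unique_values)
```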

# NumPy arrays

Sometimes we need to work on large amounts of numerical data.

NumPy provides a fundamental building block for efficiently storing and processing numerical data in Python:

In [None]:
import numpy as np

### Inspecting NumPy arrays

We can inspect properties such as shape and size of an array:

In [None]:
unique_tweeters.shape

In [None]:
unique_tweeters.size
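As a sketch of how these properties relate, here is a small made-up two-dimensional array (the name `grid` is an assumption for illustration):

```python
import numpy as np

# A 3-by-4 array of integer zeros
grid = np.zeros((3, 4), dtype=int)

print(grid.shape)  # (3, 4): length along each axis
print(grid.size)   # 12: total number of entries
print(grid.ndim)   # 2: number of axes
print(grid.dtype)  # integer type; the exact width depends on the platform
```

The `size` is always the product of the entries of `shape`.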

Let us find all the mentioned usernames:

In [None]:
tweetees = []

for text in twitter['text']:
    for user in re.finditer(valid_user_ex, text):
        tweetees.append(user.group())
        
tweetees

In [None]:
unique_tweetees = pandas.Series(tweetees).unique()
unique_tweetees

In [None]:
unique_tweetees.size

Let us build a reduced set of those users who both tweet and are mentioned in others' tweets. We call them "conversationists":

In [None]:
conversationists = np.intersect1d(unique_tweeters, unique_tweetees)
conversationists

In [None]:
conversationists.size
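To illustrate what `np.intersect1d` does, here is a minimal sketch on made-up names (the arrays are assumptions for illustration). The result contains each common value once, in sorted order:

```python
import numpy as np

# Made-up user names; 'anna' appears twice among the tweeters
toy_tweeters = np.array(['carl', 'anna', 'bo', 'anna'])
toy_tweetees = np.array(['dora', 'carl', 'anna'])

# Values present in both arrays, sorted and without duplicates
common = np.intersect1d(toy_tweeters, toy_tweetees)
print(common)
```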

Let us try building a matrix that identifies which users have mentioned whom, and how many times.
- The matrix has one row and one column for each "conversationist": the row identifies who tweets, the column who is mentioned.
- We therefore need a square array with `conversationists.size` entries along each axis.

In [None]:
mentions = np.zeros((conversationists.size, conversationists.size), dtype=int)
mentions.shape

Arrays are indexed numerically, like lists: each entry has numerical "coordinates", one per axis:

In [None]:
mentions[3,45]

(so far, all of the contents are just zero).

We build a dictionary for "translating" between user names and indexes in the array:

In [None]:
idx_dict = dict(zip(conversationists, range(conversationists.size)))
idx_dict

Now we can index into the array using user names:

In [None]:
mentions[idx_dict['soerenpoul'], idx_dict['vingband']]

The dictionary translates user names into numerical indices that can pick out positions in the array.
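The same kind of lookup table can also be written with `enumerate`. A minimal sketch on made-up names (`users`, `idx` and `counts` are assumptions for illustration):

```python
import numpy as np

# Made-up user names
users = np.array(['anna', 'bo', 'carl'])

# Map each name to its position in the array
idx = {name: position for position, name in enumerate(users)}

# Square matrix of counts, one row and one column per user
counts = np.zeros((users.size, users.size), dtype=int)

# Record that 'anna' mentioned 'carl' twice
counts[idx['anna'], idx['carl']] += 2
print(counts)
```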

Now, let us populate the (so far empty) `mentions` array.

In [None]:
for tweeter, text in zip(twitter['from_user_name'], twitter['text']):
    if tweeter in conversationists:
        for match in re.finditer(valid_user_ex, text):
            tweetee = match.group()
            if tweetee in conversationists:
                mentions[idx_dict[tweeter], idx_dict[tweetee]] += 1
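To see the counting logic in isolation, here is a minimal sketch on three made-up tweets (all names and texts are assumptions for illustration). It tests membership against the dictionary rather than the array, which is a design choice of this sketch: dictionary lookups do not scan the whole array.

```python
import re
import numpy as np

valid_user = r'(?<=@)\w{1,15}(?!\w)'

# Made-up (tweeter, text) pairs
toy_tweets = [
    ('anna', 'Enig med @bo og @carl'),  # carl never tweets
    ('bo', '@anna tak! @anna'),
    ('dora', 'Hej @anna'),              # dora is never mentioned
]

# Only anna and bo both tweet and get mentioned
toy_users = np.array(['anna', 'bo'])
toy_idx = {name: i for i, name in enumerate(toy_users)}
toy_mentions = np.zeros((toy_users.size, toy_users.size), dtype=int)

for tweeter, text in toy_tweets:
    if tweeter in toy_idx:
        for match in re.finditer(valid_user, text, flags=re.A):
            tweetee = match.group()
            if tweetee in toy_idx:
                # Row: who tweets. Column: who is mentioned.
                toy_mentions[toy_idx[tweeter], toy_idx[tweetee]] += 1

print(toy_mentions)
```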

In [None]:
from matplotlib import pyplot

pyplot.spy(mentions)

Now we can use mathematical operations on the NumPy array `mentions` to answer questions like:
- How much does each "conversationist" mention other conversationists?

In [None]:
mentioners_arr = np.sum(mentions, axis=1)
mentioners_arr

In [None]:
mentioners = pandas.DataFrame({'user': conversationists, 'mentions of others': mentioners_arr})
mentioners

In [None]:
mentioners.sort_values('mentions of others', ascending=False)

- How much does each "conversationist" get mentioned?

In [None]:
mentionees_arr = np.sum(mentions, axis=0)
mentionees_arr

In [None]:
mentionees = pandas.DataFrame({'user': conversationists, 'mentions by others': mentionees_arr})
mentionees

In [None]:
mentionees.sort_values('mentions by others', ascending=False)

- Who mentions the most other users?

In [None]:
# Rows are mentioners, so count the nonzero entries along each row (axis=1)
name_droppers_arr = np.sum(mentions>0, axis=1)
name_droppers_arr

In [None]:
name_droppers = pandas.DataFrame({'user': conversationists, 'users mentioned': name_droppers_arr})
name_droppers

In [None]:
name_droppers.sort_values('users mentioned', ascending=False)
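As a sketch of how the `axis` argument behaves, consider a small made-up matrix laid out like `mentions`, with rows for mentioners and columns for mentioned users (the name `toy` is an assumption for illustration):

```python
import numpy as np

# Row i, column j: how many times user i mentions user j
toy = np.array([[0, 3, 1],
                [2, 0, 0],
                [0, 0, 0]])

# Summing along axis=1 collapses the columns: one total per row (mentioner)
given = toy.sum(axis=1)

# Summing along axis=0 collapses the rows: one total per column (mentioned user)
received = toy.sum(axis=0)

# Comparing first turns counts into True/False, so the sum counts distinct users
distinct_mentioned = (toy > 0).sum(axis=1)

print(given, received, distinct_mentioned)
```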

# Exercise: answering more Twitter questions

- Who gets mentioned by the most conversationists?
- How many times does a conversationist mention others on average?
- How many times does a conversationist get mentioned on average?
- Feel free to come up with more questions if you like.