# Sifting through Twitter data

Let us put regex search to use on a selection of tweets that mention the hashtag '#metoo'.

Download [this file](https://gist.githubusercontent.com/ThomasA/9c524894e17d56b211c51cdc34c404ca/raw/8cfade6dbe859999dcfaaba3cc5d96f46cd43da9/twitter.csv):


In [None]:
!wget https://gist.githubusercontent.com/ThomasA/9c524894e17d56b211c51cdc34c404ca/raw/8cfade6dbe859999dcfaaba3cc5d96f46cd43da9/twitter.csv

## Figure out how to read the Twitter CSV

We need a convenient way to access the contents.  
What is in the Twitter data set?
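One possible way in is the standard library's `csv.DictReader`, which turns each row into a dictionary keyed by the header row. The sketch below runs on a tiny in-memory stand-in, because the actual columns of `twitter.csv` are not listed here; the column names `username` and `text` are assumptions made purely for illustration.

```python
import csv
import io

# Hypothetical stand-in for twitter.csv; the real column names may differ.
sample_csv = io.StringIO(
    "username,text\n"
    "anna_k,Vi skal tale om ligestilling #metoo\n"
    'bo_h,"@anna_k enig, køn betyder noget #metoo"\n'
)

# DictReader uses the first row as column names and yields one dict per row.
rows = list(csv.DictReader(sample_csv))

print(list(rows[0].keys()))  # which columns are there?
print(len(rows))             # how many tweets?
```

For the downloaded file, the same reader could be pointed at `open("twitter.csv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.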

## Searching through the Twitter data

Find tweets containing certain words (note that the tweets are in Danish; try, for example, words like "køn" or "ligestilling").  
Which tools can we use for searching through text data?
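One candidate tool is the `re` module. As a minimal sketch, the example below filters a few made-up tweet texts (not taken from the data set) for either of the two suggested words, ignoring case.

```python
import re

# Made-up tweet texts for illustration only.
tweets = [
    "Vi skal tale om ligestilling #metoo",
    "Køn betyder noget i debatten #metoo",
    "God weekend til alle",
]

# Match either word; IGNORECASE lets "Køn" match "køn".
pattern = re.compile(r"køn|ligestilling", re.IGNORECASE)

hits = [t for t in tweets if pattern.search(t)]
for t in hits:
    print(t)
```

Plain substring tests such as `"køn" in t.lower()` would also work for single words; a regex becomes useful once alternatives or word boundaries are involved.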

## Finding Twitter user names

Quoting from [Twitter](https://help.twitter.com/en/managing-your-account/twitter-username-rules):
> Your username cannot be longer than 15 characters. Your name can be longer (50 characters) or shorter than 4 characters, but usernames are kept shorter for the sake of ease.  
> A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces.

- Compose a regex that can match valid Twitter user names, in a context where they start with '@'
- Test it using `re.search`
  - Should match for example '@ThomasArildsen', '@Bjarne', '@gerTrud'
  - Should not match '@Thomas+', '@Signe@AAU', '@ThomasArildsen_AAU'
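One possible pattern, offered as a sketch rather than the only solution, anchors the whole string and allows 1 to 15 characters from the set quoted above. It uses an explicit character class instead of `\w`, because `\w` also matches non-ASCII letters in Python 3.

```python
import re

# '@' followed by 1-15 ASCII letters, digits or underscores, and nothing else.
username_re = re.compile(r"^@[A-Za-z0-9_]{1,15}$")

should_match = ["@ThomasArildsen", "@Bjarne", "@gerTrud"]
should_not_match = ["@Thomas+", "@Signe@AAU", "@ThomasArildsen_AAU"]

for candidate in should_match + should_not_match:
    verdict = "match" if username_re.search(candidate) else "no match"
    print(candidate, verdict)
```

The anchors `^` and `$` are what reject '@Thomas+' and '@Signe@AAU'; without them, `re.search` would happily find a valid prefix inside each string. '@ThomasArildsen_AAU' fails on length, since it has 18 characters after the '@'.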

How many unique (tweeting) users are among the tweets?
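A set is a simple way to count distinct values. The sketch below assumes the rows have been read into dictionaries with a `username` column; that column name is hypothetical and the rows are made up.

```python
# Made-up rows; "username" is an assumed column name.
rows = [
    {"username": "anna_k", "text": "Vi skal tale om ligestilling #metoo"},
    {"username": "bo_h", "text": "@anna_k enig #metoo"},
    {"username": "anna_k", "text": "Tak for debatten"},
]

# A set keeps each user name once; sorting gives a stable order.
tweeters = sorted({row["username"] for row in rows})

print(len(tweeters))
```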

Let us also find all usernames mentioned in tweets (including re-tweets):
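For mentions inside running text, anchors no longer fit, so this sketch uses a capture group plus a negative lookahead that rejects names running on past 15 characters. The tweet texts are made up, and the pattern is one reasonable choice rather than a definitive one.

```python
import re

# Made-up tweet texts for illustration only.
texts = [
    "RT @anna_k: Vi skal tale om ligestilling #metoo",
    "@bo_h og @anna_k har ret",
    "Ingen omtaler her",
]

# Capture the name after '@'; the lookahead refuses to cut a longer name short.
mention_re = re.compile(r"@([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])")

# findall returns the captured group for every match in a text.
mentioned = sorted({name for t in texts for name in mention_re.findall(t)})

print(mentioned)
```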

## NumPy arrays

Sometimes we need to work on large amounts of numerical data.

NumPy provides a fundamental building block for efficiently storing and processing numerical data in Python:

In [None]:
import numpy as np

Let us build a reduced set of those users who both tweet and are mentioned in others' tweets. We call them "conversationists".

This can be done as a NumPy set operation on the two pieces we have identified above: the (unique) tweeters and the (unique) mentioned users. Let us call the result `conversationists`:
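One NumPy operation that fits this description is `np.intersect1d`, which returns the sorted, unique values present in both inputs. The user names and the variable names `tweeters`, `mentioned` and `conversationists` below are assumptions for the sake of the sketch.

```python
import numpy as np

# Made-up user names standing in for the results of the previous steps.
tweeters = np.array(["anna_k", "bo_h", "carl_p"])
mentioned = np.array(["anna_k", "dorte_m", "carl_p"])

# Users who appear in both arrays: they tweet and they are mentioned.
conversationists = np.intersect1d(tweeters, mentioned)

print(conversationists)
```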

## Identifying interactions

Let us try building a matrix that identifies which users have mentioned whom, and how many times.
- The matrix will have one row and one column for each "conversationist".
- We will need a square array (matrix) with that many entries along each axis. Let us call it `mentions`.
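A square array of zeros is a natural starting point. This is a minimal sketch with made-up conversationists; an integer `dtype` is chosen here because the entries will be counts.

```python
import numpy as np

# Made-up conversationists for illustration.
conversationists = np.array(["anna_k", "bo_h", "carl_p"])

# One row and one column per conversationist, all counts starting at zero.
n = len(conversationists)
mentions = np.zeros((n, n), dtype=int)

print(mentions.shape)
```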

We may need a data structure for "translating" between user names and indexes in the array:
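A dictionary is one option for the name-to-index direction, while the list (or array) of conversationists itself already translates the other way. The name `index_of` is hypothetical.

```python
# Made-up conversationists for illustration.
conversationists = ["anna_k", "bo_h", "carl_p"]

# Map each user name to its row/column position in the matrix.
index_of = {name: i for i, name in enumerate(conversationists)}

print(index_of["bo_h"])                    # name -> index
print(conversationists[index_of["bo_h"]])  # index -> name
```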

Now, let us populate the (so far empty) `mentions` array.
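The loop below is one way to fill the array, sketched on made-up rows. It assumes the convention that the row index is the user who writes the tweet and the column index is the user being mentioned; the column names `username` and `text` and the helper names are hypothetical.

```python
import re
import numpy as np

# Made-up rows; "username" and "text" are assumed column names.
rows = [
    {"username": "anna_k", "text": "@bo_h enig #metoo"},
    {"username": "bo_h", "text": "RT @anna_k: ligestilling nu"},
    {"username": "anna_k", "text": "@bo_h og @carl_p hvad siger I?"},
    {"username": "carl_p", "text": "God debat i dag #metoo"},
    {"username": "dorte_m", "text": "@anna_k godt sagt"},
]
conversationists = ["anna_k", "bo_h", "carl_p"]
index_of = {name: i for i, name in enumerate(conversationists)}
mentions = np.zeros((len(conversationists),) * 2, dtype=int)

mention_re = re.compile(r"@([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])")

for row in rows:
    author = row["username"]
    if author not in index_of:
        continue  # author is not a conversationist
    for name in mention_re.findall(row["text"]):
        if name in index_of:
            # Row = who mentions, column = who is mentioned.
            mentions[index_of[author], index_of[name]] += 1

print(mentions)
```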

In [None]:
from matplotlib import pyplot

pyplot.spy(mentions)

## Answering questions about interactions in our data set

Now we can use mathematical operations on the NumPy array `mentions` to answer questions like:
- How much does each "conversationist" mention other conversationists?

- How much does each "conversationist" get mentioned?

- Who mentions the most other users?
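Sums along the two axes answer the first two questions, and counting non-zero entries per row is one reading of the third. The sketch uses a small made-up matrix and assumes rows are the mentioning users and columns the mentioned users; "most other users" is interpreted here as most distinct users, which is an assumption.

```python
import numpy as np

conversationists = ["anna_k", "bo_h", "carl_p"]
# Made-up counts: mentions[i, j] = times user i mentions user j.
mentions = np.array([
    [0, 2, 1],
    [1, 0, 0],
    [1, 0, 0],
])

given = mentions.sum(axis=1)     # total mentions made by each user
received = mentions.sum(axis=0)  # total mentions received by each user

# Number of distinct users each user mentions, then the user with the most.
distinct_mentioned = np.count_nonzero(mentions, axis=1)
top_mentioner = conversationists[int(np.argmax(distinct_mentioned))]

print(given, received, top_mentioner)
```

Note that `np.argmax` returns the first index in case of a tie.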

## Answering more Twitter questions

- Who gets mentioned by the most conversationists?
- How many times does a conversationist mention others on average?
- How many times does a conversationist get mentioned on average?
- Feel free to come up with more questions if you like.
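As a starting point for these questions, the sketch below again uses a made-up matrix with rows as mentioning users and columns as mentioned users (an assumed convention). Counting non-zero entries per column gives the number of distinct conversationists mentioning each user, and means of the axis sums give the averages.

```python
import numpy as np

conversationists = ["anna_k", "bo_h", "carl_p"]
# Made-up counts: mentions[i, j] = times user i mentions user j.
mentions = np.array([
    [0, 2, 1],
    [1, 0, 0],
    [1, 0, 0],
])

# How many distinct conversationists mention each user (per column).
mentioned_by = np.count_nonzero(mentions, axis=0)
most_mentioned = conversationists[int(np.argmax(mentioned_by))]

# Average number of mentions made and received per conversationist.
avg_given = mentions.sum(axis=1).mean()
avg_received = mentions.sum(axis=0).mean()

print(most_mentioned, avg_given, avg_received)
```

Both averages equal the total number of mentions divided by the number of conversationists, so they always coincide; the distributions behind them can still differ.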