# Analysing "How ISIS Uses Twitter" using social network cluster analysis
## Approach
The general approach will be to extract each unique user by their username to act as a node. The username has been chosen as the data associated with each node because it is unique, unlike display names, which might not be. The scale of each node will be influenced by a combination of the user's number of followers and the number of tweets they produce. This combination should ensure that users who are both active and popular are identified, rather than those who tweet a lot but have few followers, or vice versa.
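The exact combination has not been fixed yet. As a minimal sketch, assuming a geometric mean is an acceptable way to combine the two counts, the score below only grows large when both followers and tweets are high. The function name `node_scale` and the sample users and figures are hypothetical and are not taken from the dataset.

```python
import math

def node_scale(followers, tweet_count):
    """Hypothetical node score: geometric mean of followers and tweets.

    A user needs both counts to be high to score well, so a prolific
    account with no audience (or the reverse) is penalised.
    """
    return math.sqrt(followers * tweet_count)

# Made-up (followers, tweets) pairs, for illustration only
users = {
    "popular_but_quiet": (10000, 1),
    "noisy_but_unfollowed": (1, 10000),
    "active_and_popular": (1000, 1000),
}

scores = {name: node_scale(f, t) for name, (f, t) in users.items()}
for name, score in scores.items():
    print("{}: {}".format(name, score))
```

With these made-up figures the balanced user scores 1000.0 while the two lopsided users score 100.0 each. A simple sum would instead rank the lopsided users highest, which is the outcome the combination is meant to avoid.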

The relation between users is yet to be decided. As well as the number of followers, knowing who those followers are would help to identify relations between users. One relation criterion might be to scrape users' tweets for mentions and link nodes on that basis, with repeated mentions increasing the weight of the edge between two users. Two types of mention could be distinguished: those that result from direct communication with a user, and those that come from retweeting a user. The former could be combined with language processing to gauge the emotive qualities of the tweets and see whether there are internal hostilities between ISIS supporters.
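The mention-based criterion could be sketched roughly as follows. This is an illustration under stated assumptions rather than the final method: the helper `mention_edges`, the two regular expressions and the sample tweets are all hypothetical, and any tweet beginning with "RT" is treated as a retweet in its entirety, so every mention inside it is counted as a retweet mention.

```python
import re
from collections import Counter

# Assumed patterns: "@" followed by word characters is a mention,
# and a tweet whose text starts with "RT" is a retweet.
MENTION = re.compile(r'@(\w+)')
RETWEET = re.compile(r'^RT\b')

def mention_edges(rows):
    """Count (author, mentioned_user) pairs, split by mention type."""
    direct, retweeted = Counter(), Counter()
    for user, tweet in rows:
        target = retweeted if RETWEET.search(tweet) else direct
        for mentioned in MENTION.findall(tweet):
            if mentioned != user:  # ignore self-mentions
                target[(user, mentioned)] += 1
    return direct, retweeted

# Made-up sample rows of (username, tweet text)
sample = [
    ("alice", "@bob have you seen this?"),
    ("alice", "@bob and @carol, thoughts?"),
    ("bob", "RT @carol: breaking news"),
]

direct, retweeted = mention_edges(sample)
print(direct)
print(retweeted)
```

Each counter maps a directed pair to the number of mentions, which is the edge weight described above. The weighted triples `(author, mentioned, count)` could then be loaded into a NetworkX `DiGraph` with `add_weighted_edges_from`.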

## First Steps
Matplotlib, as always, will be used to visualize statistics gathered from the data. NetworkX is a useful graph library that allows graphs to be visualized; its draw functions are built on matplotlib, so the resulting visualizations have a similar look.

In [53]:
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
import re
%matplotlib inline

dataset = pd.read_csv("data/tweets.csv")

The first interesting statistic to find is how many users in the dataset tweet each other. The first two print commands check that there are no duplicate tweets, which would skew the results. The only disadvantage is that this check relies on an exact string match: if a retweet has been prefixed with "RT", it will not be picked up as a duplicate of the original tweet.

Using a regular expression we can catch the tweets that begin with "RT" (indicating a retweet) and count them. Compared to the previous check, we can see that roughly 6000 tweets are retweets rather than original content. They are still worth keeping for future reference when testing which relation criterion to use.

In [54]:
print("Unique tweets: {}".format(len(dataset['tweets'].unique())))
print("All tweets: {}".format(len(dataset['tweets'])))

# Split tweets into retweets (text starts with "RT") and original tweets
retweets = []
actual_tweets = []
for user, tweet in zip(dataset['username'], dataset['tweets']):
    match = re.search(r'^RT\b', tweet)
    if match is None:
        actual_tweets.append((user, tweet))
    else:
        retweets.append((user, tweet))

print("Number of retweets (RT): {}".format(len(retweets)))
print("Actual unique tweet count: {}".format(len(actual_tweets)))

Unique tweets: 17410
All tweets: 17410
Number of retweets (RT): 5826
Actual unique tweet count: 11584
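The exact-match limitation of the duplicate check could be reduced by stripping the retweet prefix before comparing texts. The sketch below assumes retweets follow the common `RT @user: text` layout, which has not been verified against this dataset. The helper `normalise` and the sample tweets are hypothetical.

```python
import re

# Assumed retweet layout: "RT", whitespace, "@username", optional colon
RT_PREFIX = re.compile(r'^RT\s+@\w+:?\s*')

def normalise(tweet):
    """Drop a leading 'RT @user:' prefix so retweets match their source."""
    return RT_PREFIX.sub('', tweet).strip()

# Made-up tweets: the second is a retweet of the first
tweets = [
    "Breaking news from the front",
    "RT @reporter: Breaking news from the front",
    "Something else entirely",
]

unique_texts = {normalise(t) for t in tweets}
print("Unique after normalising: {}".format(len(unique_texts)))
```

Here the retweet collapses onto its original, leaving two distinct texts out of three. Retweets truncated by the character limit would still differ from the original and would not be caught by this approach.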


## Who talks about who?
Now that we have separated the retweets from the actual tweets and paired each with its username, we can proceed to analyse who is talking about whom!