### Defining Influence 

The following definitions of influence measures (ways to assess whether someone is seen as an expert) are taken from [this paper](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/viewFile/1538/1826):

1. **Indegree influence**: the number of followers of a user. This directly indicates the size of the audience for that user.

2. **Retweet influence**: the number of retweets containing one’s name. This indicates the ability of that user to generate content with pass-along value.

3. **Mention influence**: the number of mentions containing one’s name. This indicates the ability of that user to engage others in a conversation.

### Reducing the dataset

Tweets collected through the Twitter API are likely to fill a large file. Rather than reading the whole file into memory, we can read it line by line and extract only the data we need.

Here we will use a dataset containing all tweets that used the hashtag #GE2015 during the 2015 British Election period. 

We save the extracted data to separate .json files, one record per line, which are smaller and easier to handle.

In [4]:
import json

# Read the raw file one tweet (one JSON object) per line
# and write out only the fields we need.
with open('ge2015_tweets.json') as fin, \
     open('body.json', 'w') as fout1, \
     open('actor.json', 'w') as fout2, \
     open('user_mentions.json', 'w') as fout3:
    for line in fin:
        obj = json.loads(line)
        actor = obj['actor']
        body = obj['body']
        users = obj['twitter_entities']['user_mentions']
        name = actor['preferredUsername']

        # (username, posted time, tweet text)
        json.dump((name, obj['object']['postedTime'], body), fout1)
        fout1.write('\n')

        # Full profile of the user who posted the tweet
        json.dump(actor, fout2)
        fout2.write('\n')

        # List of users mentioned in the tweet
        json.dump(users, fout3)
        fout3.write('\n')
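
Large exports sometimes contain blank or truncated lines, which would make `json.loads` raise an error partway through the file. A minimal sketch of a more defensive line reader follows. The helper name `iter_json_lines` is hypothetical and not part of the original notebook, and the in-memory sample stands in for a real file; whether a given dataset needs this is an assumption.

```python
import io
import json

def iter_json_lines(lines):
    """Yield one parsed JSON value per non-empty line, skipping lines that fail to parse."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            # Malformed or truncated record: skip it rather than abort the whole run
            continue

# Illustrative in-memory input standing in for a file object:
# one good record, one blank line, one truncated record, one good record
sample = io.StringIO('{"body": "hello"}\n\n{"body": "trunc\n{"body": "world"}\n')
bodies = [obj["body"] for obj in iter_json_lines(sample)]
print(bodies)
```

Because the helper is a generator, it still processes the input one line at a time and never holds the whole file in memory.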

### Extracting influence measures from the reduced data 

#### Indegree influence 

Here we loop through each tweet (or line) in the file `actor.json` and extract the username and number of followers of the user who posted the tweet.

We create a dictionary `indegree` where the username is the key and the number of followers is the value. If a user appears more than once, the follower count from their last tweet in the file is kept.

In [5]:
indegree={}

with open('actor.json') as fin1:
    for line1 in fin1:
        obj=json.loads(line1)
        wanted = obj['followersCount']
        name = obj['preferredUsername']
        indegree[name] = wanted

#### Retweet influence 

To determine the retweet influence, we loop through all tweets to find the retweets. A retweet starts with the text `RT` followed by `@user:`, where `user` is the username of the person who wrote the original tweet.

When a retweet is found, the retweet count for `user` is incremented by 1. Only users who have posted a tweet in the dataset themselves are counted. The dictionary `retweet` contains a count of the number of times each `user` has been retweeted.

In [6]:
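retweet = {}

# Start every user who posted a tweet at zero retweets
for k in indegree:
    retweet[k] = 0

with open('body.json') as fin2:
    for line2 in fin2:
        body = json.loads(line2)
        text = body[2]  # each record is (username, posted time, tweet text)
        if text.startswith('RT '):
            try:
                # 'RT @user: ...' -> 'user'
                user = text.split(' ')[1].split('@')[1].split(':')[0]
            except IndexError:
                # Not in the expected 'RT @user' form
                continue
            # Count only users who have posted a tweet themselves
            if user in retweet:
                retweet[user] += 1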
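
The chained `split` calls can be hard to read. As an alternative sketch, a regular expression can pull the username out of the same `RT @user:` prefix. The pattern assumes usernames consist of letters, digits and underscores, and the helper name `retweeted_user` is hypothetical. It is not identical to the split-based code in every edge case: for example, it accepts any amount of whitespace after `RT`.

```python
import re

# 'RT', whitespace, then '@username' at the very start of the tweet text
RT_PATTERN = re.compile(r'^RT\s+@(\w+)')

def retweeted_user(text):
    """Return the username being retweeted, or None if the text is not a retweet."""
    match = RT_PATTERN.match(text)
    return match.group(1) if match else None

print(retweeted_user('RT @BBCNews: Polls have closed #GE2015'))
print(retweeted_user('Off to vote! #GE2015'))
```

The first call returns the retweeted username and the second returns `None`, so the caller can test the result directly instead of catching `IndexError`.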

#### Mention influence 

The mention influence of a user is the number of times the user is mentioned by others. To determine this, we loop through the file `user_mentions.json`, in which each line contains a list of users mentioned in a single tweet. 

We loop through this list and, for each mentioned user who has also posted a tweet themselves, increment their count by 1. The dictionary `mentions` contains a count of the number of times each user has been mentioned by others.

In [7]:
mentions = {}

# Start every user who posted a tweet at zero mentions
for k in indegree:
    mentions[k] = 0

with open('user_mentions.json') as fin3:
    for line3 in fin3:
        mentioned = json.loads(line3)  # list of users mentioned in one tweet
        for item in mentioned:
            user = item['screen_name']
            # Count only users who have posted a tweet themselves
            if user in mentions:
                mentions[user] += 1
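
The same tally can be sketched with `collections.Counter`, which removes the need to pre-fill the dictionary. The data below is made up for illustration: `posters` stands in for the keys of `indegree`, and `mention_lists` stands in for the lines of `user_mentions.json`.

```python
from collections import Counter

# Made-up data: users who posted, and the mention list of each tweet
posters = {'alice', 'bob'}
mention_lists = [
    [{'screen_name': 'alice'}, {'screen_name': 'carol'}],
    [],
    [{'screen_name': 'alice'}, {'screen_name': 'bob'}],
]

# Count every mention, keeping only users who have posted themselves
counts = Counter(
    item['screen_name']
    for tweet_mentions in mention_lists
    for item in tweet_mentions
    if item['screen_name'] in posters
)

# Users who were never mentioned still get an explicit zero
mentions_toy = {user: counts[user] for user in posters}
print(sorted(mentions_toy.items()))
```

Here `carol` is dropped because she never posted a tweet, which mirrors the membership check in the loop above.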

### The influence score 

With these three measures of influence, we can assign each user an influence score by ranking a user on each of the measures.

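We use the convention that a low rank (a low number) corresponds to little influence, with 0 being the lowest possible rank, and a high rank (a high number) corresponds to a large amount of influence. Users with equal values on a measure are ordered alphabetically by username, so every user receives a distinct rank.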

In [8]:
mention_rank, retweet_rank, indegree_rank = {}, {}, {}

# Sort users by ascending value; ties are broken alphabetically by username
mk, mv = zip(*sorted(mentions.items(), key=lambda kv: (kv[1], kv[0])))
rk, rv = zip(*sorted(retweet.items(), key=lambda kv: (kv[1], kv[0])))
ik, iv = zip(*sorted(indegree.items(), key=lambda kv: (kv[1], kv[0])))

# A user's position in the sorted order is their rank (0 = least influential)
for rank in range(len(mentions)):
    mention_rank[mk[rank]] = rank
    retweet_rank[rk[rank]] = rank
    indegree_rank[ik[rank]] = rank

Using this convention, we determine the mean rank of a user over the three measures to obtain a final influence score.

In [11]:
import numpy as np
influence_score={}
for user in indegree:
    influence_score[user]=np.mean([mention_rank[user], retweet_rank[user], indegree_rank[user]])
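
To see how the ranking and averaging behave, here is a small worked example on made-up numbers (the users and counts are purely illustrative). The helper `rank_users` is a hypothetical name that wraps the same sort-by-value-then-username ordering used above.

```python
def rank_users(measure):
    """Map each user to their position when sorted by (value, username), ascending."""
    ordered = sorted(measure.items(), key=lambda kv: (kv[1], kv[0]))
    return {user: rank for rank, (user, _) in enumerate(ordered)}

# Made-up measures for three users
toy_indegree = {'alice': 50, 'bob': 900, 'carol': 50}
toy_retweet = {'alice': 3, 'bob': 1, 'carol': 0}
toy_mentions = {'alice': 2, 'bob': 2, 'carol': 0}

ranks = [rank_users(m) for m in (toy_mentions, toy_retweet, toy_indegree)]

# Mean rank over the three measures
toy_score = {user: sum(r[user] for r in ranks) / len(ranks) for user in toy_indegree}
print(toy_score)
```

Note that `alice` and `carol` have the same follower count but receive different indegree ranks, because ties are broken alphabetically. With many tied users (for example, users with zero retweets), this tie-breaking can noticeably affect the final score.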

We can now look at the ten most influential Twitter users:

In [15]:
usernames, scores = zip(*sorted(influence_score.items(), key=lambda kv: (kv[1], kv[0])))

print(usernames[-10:])

('SkyNewsBreak', 'Telegraph', 'itvnews', 'FT', 'Independent', 'TheEconomist', 'guardian', 'SkyNews', 'BBCNews', 'BBCBreaking')


As we are using tweets with the hashtag #GE2015, it is not surprising that the most influential users are news outlets!