# Finding high-frequency words in tweets
This notebook uses a map-reduce procedure to find words that occur with high frequency in unique tweets (that is, retweets are excluded from the analysis).

Download tweets in JSON format from archive.org (https://archive.org/search.php?query=tweets). There is one .bz2 file for every minute, and the files for each hour are grouped in a separate folder. Each file contains tweets as JSON.
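To get a feel for the data before running the full pipeline, a single minute file can be read directly from Python. The sketch below is only an illustration: it assumes each line of a file holds one JSON object (which is how the mapper further down consumes the data), and the helper name `iter_records`, the file name `sample.json.bz2`, and the two sample records are made up so that the cell runs without a download.

```python
import bz2
import json
import os
import tempfile


def iter_records(path):
    """Yield one parsed JSON object per non-empty line of a .bz2 file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny stand-in archive so the sketch runs without a download.
# The two records below are invented for illustration only.
sample_lines = [
    json.dumps({"id": 1, "text": "hello world"}),
    json.dumps({"id": 2, "text": "RT hello", "retweeted_status": {"id": 1}}),
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.bz2")
with bz2.open(sample_path, mode="wt", encoding="utf-8") as handle:
    handle.write("\n".join(sample_lines) + "\n")

records = list(iter_records(sample_path))
print(len(records), "records; keys of first:", sorted(records[0]))
```

To inspect a real download, point `iter_records` at one of the minute files instead of `sample_path`.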

The mapper and reducer files are below. The analysis covers one hour of tweets.
The first line of each cell saves the rest of the cell as a .py file. To do this manually, remove the first line and save the cell content.

In [169]:
%%writefile mapper_unique_tweets.py 
#!/usr/bin/env python
import sys
import json

# Emit one "key 1" record per category for every non-empty input line.
for currentTweet in sys.stdin:
    try:
        currentTweet = currentTweet.lower().strip()
        if currentTweet != "":
            # Retweets carry a 'retweeted_status' field in their JSON.
            if 'retweeted_status' not in currentTweet:
                #currentTweetText = json.loads(currentTweet)['text']
                print('Unique_tweet', 1)
            else:
                print('Retweet', 1)
            print('Total', 1)
    except Exception:
        print('Error', 1)

Overwriting mapper_unique_tweets.py
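The mapper looks for the string 'retweeted_status' anywhere in the raw line, so a tweet whose own text happens to mention that string would also be counted as a retweet. A stricter variant, sketched below, parses the JSON first (as the commented-out `json.loads` line hints) and checks only the top-level keys. The function name `classify` and the sample lines are hypothetical and not part of the notebook's pipeline.

```python
import json


def classify(raw_line):
    """Return 'Retweet', 'Unique_tweet', or 'Error' for one input line.

    Hypothetical helper: unlike the substring test, it only looks at the
    top-level keys of the parsed JSON object.
    """
    try:
        record = json.loads(raw_line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "Error"
    if not isinstance(record, dict):
        return "Error"
    return "Retweet" if "retweeted_status" in record else "Unique_tweet"


# Invented sample lines for illustration.
original = json.dumps({"text": "what does retweeted_status mean?"})
retweet = json.dumps({"text": "RT hi", "retweeted_status": {"text": "hi"}})

print(classify(original))    # Unique_tweet (the substring test would say Retweet)
print(classify(retweet))     # Retweet
print(classify("not json"))  # Error
```

Parsing every line costs more than a substring test, so whether the extra accuracy is worth it depends on the size of the input.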


In [170]:
%%writefile reducer.py 
#!/usr/bin/env python
# source for reducer file: https://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
import sys

current_word = None
current_count = 0
word = None

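# Input lines look like "key count" and must arrive sorted by key,
# so that all lines for one key are adjacent.
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    try:
        word, count = line.split(' ', 1)
        count = int(count)
    except ValueError:
        # malformed line or non-numeric count: report it and skip the line
        print('count_error')
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print(current_word, current_count)
        current_count = count
        current_word = word

# Emit the last key, if any input was read.
if current_word is not None:
    print(current_word, current_count)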

Overwriting reducer.py
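The reducer only compares each key with the previous one, which is why the pipeline needs a sort step between mapper and reducer. The sketch below reproduces the same idea in memory with `itertools.groupby` and shows what happens when the sort is skipped. The mapper output is made up and the function name `reduce_sorted` is hypothetical.

```python
from itertools import groupby

# Invented mapper output, in the "key count" format the mapper prints.
mapper_output = [
    "Unique_tweet 1", "Total 1",
    "Retweet 1", "Total 1",
    "Unique_tweet 1", "Total 1",
]


def reduce_sorted(lines):
    """Sum counts per key; assumes equal keys are adjacent (sorted input)."""
    pairs = (line.split(" ", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(int(count) for _, count in group)


sorted_totals = list(reduce_sorted(sorted(mapper_output)))
unsorted_totals = list(reduce_sorted(mapper_output))

print(sorted_totals)         # one entry per key
print(len(unsorted_totals))  # more entries: each key is split into fragments
```

Like the reducer above, `groupby` keeps only the current key in memory, so the approach scales to inputs that do not fit in RAM as long as they are sorted first.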


### Run map-reduce
Make the two scripts executable (`chmod +x mapper_unique_tweets.py reducer.py`), then run the following command in the folder that contains the .bz2 files:

`bzcat *.bz2 | ./mapper_unique_tweets.py | sort -k1,1 | ./reducer.py`

Alternatively, put the command in a shell script as follows. The script writes the result to output.txt.

In [171]:
%%writefile mapreduce.sh
bzcat *.bz2 | ./mapper_unique_tweets.py | sort -k1,1 | ./reducer.py > output.txt

Overwriting mapreduce.sh


After that, run the .sh file and print the contents of output.txt.

In [172]:
!bash mapreduce.sh
!cat output.txt

Retweet 55770
Total 221049
Unique_tweet 165279
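As a quick sanity check on these numbers, the unique and retweet counts should add up to the total, since the mapper emits one category record and one 'Total' record per input line when no errors occur. The sketch below parses the three output lines (copied verbatim from the cell above) and computes the retweet share; the variable names are illustrative.

```python
# The three lines printed by the last cell, copied verbatim.
output_text = """Retweet 55770
Total 221049
Unique_tweet 165279
"""

counts = {}
for line in output_text.splitlines():
    key, value = line.split(" ", 1)
    counts[key] = int(value)

# With no 'Error' records, the two categories must sum to the total.
assert counts["Unique_tweet"] + counts["Retweet"] == counts["Total"]

retweet_share = counts["Retweet"] / counts["Total"]
print(f"retweets: {retweet_share:.1%} of all records")  # retweets: 25.2% of all records
```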
