# Finding high-frequency words in tweets
This notebook uses a map-reduce procedure to find the words that occur most frequently in unique tweets (that is, retweets are excluded from the analysis).

Download tweets in JSON format from archive.org (https://archive.org/search.php?query=tweets). There is one .bz2 file for every minute, and the files for each hour are grouped in a separate folder. These files contain tweets in JSON form.
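Before running the full pipeline, it can help to inspect one of the downloaded files directly. The following is a minimal sketch, assuming each line of a .bz2 file holds one JSON object, which is how the mapper in this notebook reads its input. The helper names `iter_tweets` and `count_with_text` are hypothetical and not part of the original scripts.

```python
import bz2
import json


def iter_tweets(path):
    """Yield one parsed JSON object per non-empty line of a .bz2 file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # Skip malformed lines rather than aborting the whole file.
                continue


def count_with_text(path):
    """Count the records in a file that carry a 'text' field."""
    return sum(
        1
        for record in iter_tweets(path)
        if isinstance(record, dict) and "text" in record
    )
```

For example, `count_with_text("00.json.bz2")` would report how many records in the first minute's file have a `text` field that the mapper can process.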

The mapper and reducer files are below. The analysis covers one hour of tweets.
The first line of each cell saves the rest of the cell as a .py file. If you want to do this manually, remove the first line and save the cell content.

In [13]:
%%writefile mapper_vocab_freq.py 
#!/usr/bin/env python
import sys
import json

# Map every special character and digit to a space; build the table once.
# Apostrophes and quotation marks are not in the set, so they stay in the words.
SPECIAL_CHARS = "!@#$%^&*()[]{};:,./<>?\\|`~-=_+0123456789"
TRANSLATION = {ord(ch): " " for ch in SPECIAL_CHARS}

for currentTweet in sys.stdin:
    currentTweet = currentTweet.strip()
    if not currentTweet:
        continue  # skip blank lines instead of counting them as errors
    try:
        json_dic = json.loads(currentTweet)
        if 'text' in json_dic:
            # Replace special characters with spaces, then lowercase and split into words
            currentTweetText = json_dic['text'].translate(TRANSLATION)
            for word in currentTweetText.lower().split():
                print(word, 1)
    except Exception:
        # Emit a countable marker for lines that cannot be processed
        print('Error', 1)

Overwriting mapper_vocab_freq.py
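
To see what the mapper's cleaning step does to a single tweet, the tokenization can be tried on its own. The sketch below reuses the character set from the mapper; the function name `tokenize` and the sample text are illustrative and not part of the original scripts.

```python
# Same character set that the mapper replaces with spaces.
SPECIAL_CHARS = "!@#$%^&*()[]{};:,./<>?\\|`~-=_+0123456789"
TRANSLATION = {ord(ch): " " for ch in SPECIAL_CHARS}


def tokenize(text):
    """Replace special characters and digits with spaces, lowercase, split."""
    return text.translate(TRANSLATION).lower().split()


sample = "Hello, World! Visit https://example.com 24/7 #MapReduce"
print(tokenize(sample))
# → ['hello', 'world', 'visit', 'https', 'example', 'com', 'mapreduce']
```

Note that URLs are broken into fragments such as `https` and `com`, and that apostrophes are not in the character set, so a word like `don't` is kept as a single token.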


In [14]:
%%writefile reducer.py 
#!/usr/bin/env python
# source for reducer file: https://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
import sys

current_word = None
current_count = 0
word = None

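for line in sys.stdin:
    line = line.strip()
    try:
        # Each input line has the form "<word> <count>"
        word, count = line.split(' ', 1)
        count = int(count)
    except ValueError:
        # Skip malformed lines (no separator or a non-integer count)
        continue
    # The input is sorted by word, so equal words arrive consecutively
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print(current_word, current_count)
        current_count = count
        current_word = word

# Emit the total for the last word
if current_word:
    print(current_word, current_count)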

Overwriting reducer.py
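
The reducer relies on its input being sorted, so that all lines for one word arrive consecutively. The same idea can be sketched in a few lines with `itertools.groupby`, which is handy for checking the logic on a small sample without the shell pipeline. The function name `reduce_sorted` and the sample data are hypothetical, and the sketch assumes every line is well formed (`word count`).

```python
from itertools import groupby


def reduce_sorted(lines):
    """Sum the counts of consecutive lines that share the same word.

    `lines` must already be sorted by word, which is what the sort step
    between mapper and reducer provides.
    """
    pairs = (line.strip().split(" ", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


mapper_output = ["the 1", "cat 1", "the 1", "sat 1", "the 1"]
for word, total in reduce_sorted(sorted(mapper_output)):
    print(word, total)
# prints:
# cat 1
# sat 1
# the 3
```

Without the `sorted` call, the three `the` lines would not be adjacent and would be reported as separate groups, which is why the pipeline sorts between the map and reduce steps.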


### Run map-reduce
Run the following command on the command line:

`bzcat *.bz2 | ./mapper_vocab_freq.py | sort -k1,1 | ./reducer.py`

Alternatively, put the above command in a shell script as follows. This .sh file writes the result to output.txt.

In [15]:
%%writefile mapreduce.sh
bzcat *.bz2 | ./mapper_vocab_freq.py | sort -k1,1 | ./reducer.py > output.txt

Overwriting mapreduce.sh


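After that, you can run the .sh file and print the output text. Because the pipeline invokes `./mapper_vocab_freq.py` and `./reducer.py` directly, both files need execute permission (for example via `chmod +x mapper_vocab_freq.py reducer.py`).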

In [16]:
!mapreduce.sh
!cat output.txt

Error 137649
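
To get from per-word totals to the high-frequency words themselves, the lines of output.txt can be sorted by count. This is a minimal sketch, assuming each line has the `word count` shape that reducer.py prints; the function name `top_words` and the sample lines are hypothetical.

```python
def top_words(lines, n=10):
    """Return the n (word, count) pairs with the highest counts."""
    counts = []
    for line in lines:
        parts = line.strip().rsplit(" ", 1)
        if len(parts) != 2:
            continue  # skip lines without a "word count" shape
        word, count = parts
        try:
            counts.append((word, int(count)))
        except ValueError:
            continue  # skip lines whose count is not an integer
    return sorted(counts, key=lambda pair: pair[1], reverse=True)[:n]


sample_lines = ["cat 2", "the 15", "sat 1", "on 7"]
print(top_words(sample_lines, n=2))
# → [('the', 15), ('on', 7)]
```

On the real result, the function could be applied to the open file, for instance `top_words(open("output.txt", encoding="utf-8"), n=20)`.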


In [4]:
!ls

00.json.bz2
01.json.bz2
02.json.bz2
03.json.bz2
04.json.bz2
05.json.bz2
06.json.bz2
07.json.bz2
08.json.bz2
09.json.bz2
10.json.bz2
11.json.bz2
12.json.bz2
13.json.bz2
14.json.bz2
15.json.bz2
16.json.bz2
17.json.bz2
18.json.bz2
19.json.bz2
20.json.bz2
21.json.bz2
22.json.bz2
23.json.bz2
24.json.bz2
25.json.bz2
26.json.bz2
27.json.bz2
28.json.bz2
29.json.bz2
30.json.bz2
31.json.bz2
32.json.bz2
33.json.bz2
34.json.bz2
35.json.bz2
36.json.bz2
37.json.bz2
38.json.bz2
39.json.bz2
40.json.bz2
41.json.bz2
42.json.bz2
43.json.bz2
44.json.bz2
45.json.bz2
46.json.bz2
47.json.bz2
48.json.bz2
49.json.bz2
50.json.bz2
51.json.bz2
52.json.bz2
53.json.bz2
54.json.bz2
55.json.bz2
56.json.bz2
57.json.bz2
58.json.bz2
59.json.bz2
English_Vocabulary.ipynb
LICENSE
Tweets_MapReduce.ipynb
mapper_.py
mapper_count_words.py
mapper_tweets_.py
mapper_unique_tweets.py
mapper_vocab_freq.py
mapreduce.sh
output.txt
reducer.py
sum_results.py
twitter-stream-2020-11-01
