# Finding high-frequency words in tweets
This notebook uses a map-reduce procedure to find the words that occur most frequently in unique tweets (retweets are excluded from the analysis).

Download tweets in JSON format from archive.org (https://archive.org/search.php?query=tweets). There is one .bz2 file for every minute, and the files for each hour are stored in a separate folder. Each file contains tweets as JSON.
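
To inspect one of these per-minute files before running the full job, a minimal sketch like the following can help. It assumes that each line of the decompressed file is one JSON object, which is also what the mapper in this notebook expects; the function name `iter_tweets` is hypothetical.

```python
import bz2
import json


def iter_tweets(path):
    """Yield one parsed object per non-empty JSON line of a .bz2 file.

    Assumption: the archive stores one JSON object per line.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # Skip lines that are not valid JSON instead of stopping.
                continue
```

For example, `next(iter_tweets(path))` returns the first object in the file at `path`, which makes it easy to check whether a `text` field is present.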

The mapper and reducer files are below. The analysis covers one hour of tweets.
The first line of each cell saves the rest of the cell as a .py file. To do this manually, remove the first line and save the cell content.

In [2]:
%%writefile mapper_vocab_freq.py 
#!/usr/bin/env python3
import sys
import json

# Characters to replace with a space (punctuation and digits)
special_chars = '"\'!@#$%^&*()[]{};:,./<>?\\|`~-=_+0123456789'
translation_table = {ord(ch): " " for ch in special_chars}

for currentTweet in sys.stdin:
    currentTweet = currentTweet.strip()
    if currentTweet == "":
        continue
    try:
        json_dic = json.loads(currentTweet)
    except ValueError:
        # Skip lines that are not valid JSON
        continue
    if isinstance(json_dic, dict) and 'text' in json_dic:
        # Replace all special characters with a space, then split into words
        currentTweetText = json_dic['text'].lower().translate(translation_table)
        for word in currentTweetText.split():
            # Skip single-character tokens
            if len(word) > 1:
                print(word, 1)

Overwriting mapper_vocab_freq.py
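
The cleaning step of the mapper can be tried on a single line without running the whole pipeline. The sketch below is an illustration only: `words_from_line` is a hypothetical helper and the sample tweet is invented. Skipping objects that carry a `retweeted_status` key is one possible way to express the retweet exclusion mentioned in the introduction, under the assumption that retweets are marked this way in the archive's JSON.

```python
import json

# Characters replaced by a space before splitting, as in the mapper.
SPECIAL = '"\'!@#$%^&*()[]{};:,./<>?\\|`~-=_+0123456789'
TABLE = {ord(ch): " " for ch in SPECIAL}


def words_from_line(line, skip_retweets=True):
    """Return the cleaned words of one JSON tweet line."""
    line = line.strip()
    if not line:
        return []
    tweet = json.loads(line)
    if not isinstance(tweet, dict) or "text" not in tweet:
        return []
    # Assumption: retweets carry a 'retweeted_status' field.
    if skip_retweets and "retweeted_status" in tweet:
        return []
    text = tweet["text"].lower().translate(TABLE)
    # Drop single-character tokens.
    return [w for w in text.split() if len(w) > 1]


sample = json.dumps({"text": "Hello, world! It's a #MapReduce test 123"})
print(words_from_line(sample))
# ['hello', 'world', 'it', 'mapreduce', 'test']
```

The apostrophe in "It's" is replaced by a space, so the word splits into "it" and "s", and the single-character "s" is then dropped.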


In [9]:
%%writefile reducer.py 
#!/usr/bin/env python3
# source for reducer file: https://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        # Each input line is "<word> <count>" as printed by the mapper
        word, count = line.split(' ', 1)
        count = int(count)
    except ValueError:
        # Skip malformed lines instead of writing them into the CSV
        continue
    # Input is sorted by word, so equal words arrive on consecutive lines
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            print(current_word + ',' + str(current_count))
        current_count = count
        current_word = word

# Emit the last word, which the loop above never prints
if current_word is not None:
    print(current_word + ',' + str(current_count))

Overwriting reducer.py
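
The reducer relies on its input being sorted by word, so that all counts for one word arrive on consecutive lines. The same idea can be written compactly with `itertools.groupby`. This is an alternative sketch under that sorted-input assumption, not the reducer used in this notebook, and `reduce_counts` is a hypothetical name.

```python
from itertools import groupby


def reduce_counts(lines):
    """Sum counts per word from sorted 'word count' lines."""
    pairs = []
    for line in lines:
        parts = line.strip().split(" ", 1)
        if len(parts) != 2 or not parts[1].strip().isdigit():
            continue  # ignore malformed lines
        pairs.append((parts[0], int(parts[1])))
    # groupby only merges adjacent equal keys, hence the sorted input.
    return [(word, sum(count for _, count in group))
            for word, group in groupby(pairs, key=lambda pair: pair[0])]


mapped = ["data 1", "tweet 1", "data 1", "word 1", "data 1"]
for word, total in reduce_counts(sorted(mapped)):
    print(f"{word},{total}")
```

Calling `reduce_counts(mapped)` without sorting first would report "data" several times, which is why the `sort` step sits between the mapper and the reducer.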


### Run map-reduce
Make the scripts executable (`chmod +x mapper_vocab_freq.py reducer.py`), then run the following command in the folder that contains the .bz2 files:

`bzcat *.bz2 | ./mapper_vocab_freq.py | sort -k1,1 | ./reducer.py`

Alternatively, put the command in a shell script as follows. This .sh file writes the result to output.csv.

In [204]:
%%writefile mapreduce.sh
echo 'word,freq' > output.csv
bzcat *.bz2 | ./mapper_vocab_freq.py | sort -k1,1 | ./reducer.py >> output.csv

Overwriting mapreduce.sh


After that, run the .sh file to create the .csv file.

In [205]:
!bash mapreduce.sh

Now we can process the CSV file.

In [3]:
import csv
d = {}

with open('output.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        d[row['word']] = int(row['freq'])

sorted_dict = dict(sorted(d.items(), key=lambda item: item[1],reverse=True))
number_of_distinct_words = len(sorted_dict)
number_of_distinct_words

0

In [4]:
top_1000 = {k: sorted_dict[k] for k in list(sorted_dict)[:1000]}

In [5]:
top_1000

{}
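
An empty result like the one above means that no data rows were read from `output.csv`, so it is worth checking the file before ranking. The following sketch loads the counts into a `collections.Counter` and ranks them with `most_common`; the helper names `load_counts` and `top_words` are hypothetical, and rows with a missing or non-numeric `freq` are skipped by assumption rather than reported.

```python
import csv
from collections import Counter


def load_counts(path):
    """Read a 'word,freq' CSV file into a Counter."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as csvfile:
        for row in csv.DictReader(csvfile):
            word = row.get("word")
            freq = row.get("freq")
            # Skip rows with a missing or non-numeric frequency.
            if word and freq and freq.isdigit():
                counts[word] += int(freq)
    return counts


def top_words(counts, n=1000):
    """Return the n most frequent (word, freq) pairs, highest first."""
    return counts.most_common(n)
```

With these helpers, `top_words(load_counts('output.csv'), 1000)` gives the same kind of ranking as the cells above, and `len(load_counts('output.csv')) == 0` signals that the map-reduce step produced no rows.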