# Creating a Twitter-Kafka Consumer

In this notebook we connect to Kafka using the kafka Python module.
The application listens to a Kafka topic and captures its messages so that we can run some useful calculations on the data.

In this application we create a Python consumer that adds the language name to each tweet.

In [5]:
from kafka import KafkaConsumer, KafkaProducer
import pandas
import json
import time

## Create a Kafka consumer

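Define the Kafka consumer. With auto_offset_reset='earliest', a consumer that has no committed offsets starts reading from the oldest available message in the topic.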

In [6]:
topic = "IESEG_MBD"
#topic = "trump"

consumer = KafkaConsumer(topic, 
                         bootstrap_servers='kafka:29092',                                 
                         auto_offset_reset='earliest',
                         value_deserializer=lambda m: json.loads(m.decode('utf-8')))
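
The value_deserializer above turns the raw bytes of each message into a Python object. As a minimal sketch of that step outside Kafka, the snippet below applies the same decode-and-parse logic to a hand-made payload. The `deserialize` name and the payload fields are assumptions for illustration; the real tweet fields depend on what the producer sends.

```python
import json


def deserialize(raw_bytes):
    """Decode UTF-8 bytes and parse them as JSON, like the lambda passed to the consumer."""
    return json.loads(raw_bytes.decode("utf-8"))


# Hypothetical message payload, encoded the way a JSON producer would send it.
raw = json.dumps({"id": 1, "lang": "en", "text": "hello"}).encode("utf-8")

tweet = deserialize(raw)
print(tweet["lang"])
```

If a message is not valid UTF-8 JSON, this function raises an exception, so a consumer reading mixed topics may want to guard the call.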

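Test the Kafka consumer by polling it once. An empty dictionary means that no messages were fetched.

In [7]:
# Wait up to one second for messages; poll() defaults to timeout_ms=0 and returns immediately
c = consumer.poll(timeout_ms=1000)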
c

{}
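
When messages are available, poll() returns a dictionary that maps each topic partition to a list of records, and an empty dictionary otherwise. The sketch below imitates that shape with stand-in types so it runs without a broker. `TopicPartition` and `Record` here are simplified namedtuples defined for this example, not the kafka classes, and `flatten_values` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-ins for the kafka module's partition key and record types (only the fields used here).
TopicPartition = namedtuple("TopicPartition", ["topic", "partition"])
Record = namedtuple("Record", ["offset", "value"])


def flatten_values(batch):
    """Return the values of all records in a poll()-style batch, across every partition."""
    return [record.value for records in batch.values() for record in records]


# Hypothetical batch with records spread over two partitions.
sample_batch = {
    TopicPartition("IESEG_MBD", 0): [
        Record(0, {"id": 1, "lang": "en"}),
        Record(1, {"id": 2, "lang": "fr"}),
    ],
    TopicPartition("IESEG_MBD", 1): [Record(0, {"id": 3, "lang": "nl"})],
}

values = flatten_values(sample_batch)
empty = flatten_values({})
print(len(values), empty)
```

Iterating over every partition matters when the topic has more than one: taking only the first entry of the dictionary would silently skip the rest.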

## Create a language DF in pandas

In [8]:
# Twitter language codes and their names ("in" is the code Twitter uses for Indonesian)
lang_list = [("in", "Indonesian"), ("en", "English"), ("hi", "Hindi"), ("fr", "French"), ("de", "German"), ("nl", "Dutch"), ("ar", "Arabic"), ("ja", "Japanese"), ("ru", "Russian"), ("es", "Spanish"), ("zh", "Chinese")]
languagesDF = pandas.DataFrame(lang_list, columns=["lang", "language"])

languagesDF

| | lang | language |
|---|---|---|
| 0 | in | Indonesian |
| 1 | en | English |
| 2 | hi | Hindi |
| 3 | fr | French |
| 4 | de | German |
| 5 | nl | Dutch |
| 6 | ar | Arabic |
| 7 | ja | Japanese |
| 8 | ru | Russian |
| 9 | es | Spanish |
| 10 | zh | Chinese |


## Actions

Create a function that polls the real-time tweet stream, transforms the messages into a pandas DataFrame, and merges it with the language DF.

In [9]:
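def add_language():
    # Wait up to one second for records; poll() defaults to timeout_ms=0 and returns immediately
    c = consumer.poll(timeout_ms=1000)
    if c:
        # poll() returns {partition: [records]}; collect the values from every partition
        list_of_dict = [record.value for records in c.values() for record in records]
        tweetDF = pandas.DataFrame(list_of_dict)
        # A left join keeps tweets whose language code is not in languagesDF
        tweetDF = tweetDF.merge(languagesDF, on="lang", how="left")
    else:
        tweetDF = "No tweets to process"

    return tweetDF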


In [10]:
add_language()

'No tweets to process'
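
Because no tweets arrived in this run, the merge step is not visible in the output. As a rough illustration, the sketch below applies the same left merge to a few made-up tweets. The sample rows and the "xx" code are assumptions for this example; real tweets carry many more fields.

```python
import pandas

# Small lookup table in the same shape as languagesDF.
languages = pandas.DataFrame(
    [("en", "English"), ("fr", "French")], columns=["lang", "language"]
)

# Hypothetical tweets; "xx" stands for a code that is missing from the lookup table.
tweets = pandas.DataFrame(
    [
        {"id": 1, "lang": "en"},
        {"id": 2, "lang": "xx"},
        {"id": 3, "lang": "fr"},
    ]
)

# how="left" keeps every tweet; unmatched codes get a missing value in "language".
merged = tweets.merge(languages, on="lang", how="left")
print(merged)
```

With an inner join the tweet with the unknown code would be dropped, which would make the "processed" and "known language" counts indistinguishable.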

Create a function that calculates count statistics on the DF tweet stream.

In [11]:
def statistics():
    tweetDF = add_language()
    # add_language() returns a message string when there is nothing to process
    if not isinstance(tweetDF, str):
        print("Number of processed tweets:", len(tweetDF))
        print("Number of tweets with a known language:", len(tweetDF.dropna(subset=["language"])))
        print("Number of tweets grouped by language:")
        print(tweetDF.groupby("language")["id"].count())
        print("")
    else:
        print(tweetDF)
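
To show what these counts look like when tweets are present, here is a small sketch on a hand-made DataFrame. The rows are invented for illustration and assume, as the function above does, that each tweet has an "id" column.

```python
import pandas

# Hypothetical result of the language merge; the last tweet has no known language.
merged = pandas.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "lang": ["en", "en", "fr", "xx"],
        "language": ["English", "English", "French", None],
    }
)

processed = len(merged)
known = len(merged.dropna(subset=["language"]))
# groupby skips rows whose key is missing, so only known languages are counted.
per_language = merged.groupby("language")["id"].count()

print(processed, known)
print(per_language)
```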

Run the actions 10 times, pausing 5 seconds between runs.

In [12]:
for _ in range(10):
    statistics()
    time.sleep(5)

No tweets to process
No tweets to process
No tweets to process
No tweets to process
No tweets to process
No tweets to process
No tweets to process
No tweets to process
No tweets to process
No tweets to process
