# Creating a Twitter-Kafka Consumer

In this notebook we connect to Kafka using the kafka Python module.
The application listens to a topic in Kafka and captures its messages so that we can run some useful calculations on the data.

In this application we create a Python consumer that adds the language name to each tweet.

In [13]:
from kafka import KafkaConsumer, KafkaProducer
import pandas
import json
import time

## Create a Kafka consumer

Define Kafka consumer

In [14]:
topic = "IESEG_MBD"
#topic = "trump"

consumer = KafkaConsumer(topic, 
                         bootstrap_servers='kafka:29092',                                 
                         auto_offset_reset='earliest',
                         value_deserializer=lambda m: json.loads(m.decode('utf-8')))
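
The `value_deserializer` above is applied to the raw bytes of every message before the consumer hands it over. The sketch below runs the same lambda on a hand-made payload to show what comes out. The payload is a hypothetical, much shortened tweet, not real data from the topic.

```python
import json

# Same callable as the value_deserializer passed to KafkaConsumer above
deserialize = lambda m: json.loads(m.decode('utf-8'))

# Hypothetical payload: a shortened tweet, encoded the way a JSON producer would send it
raw = json.dumps({"id": 1, "lang": "en", "text": "hello"}).encode("utf-8")

# bytes -> str -> dict
tweet = deserialize(raw)
print(type(tweet).__name__, tweet["lang"])
```

Because the consumer already returns dictionaries, the later cells can pass the values straight to pandas.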

Test Kafka consumer

In [23]:
c = consumer.poll()
c

{TopicPartition(topic='trump', partition=0): [ConsumerRecord(topic='trump', partition=0, offset=6022, timestamp=1556347985937, timestamp_type=0, key=None, value={'entities_hashtags': [], 'user_friends_count': 3338, 'user_name': 'QUANTUM ENTANGLEMENT', 'source': 'Twitter for Android', 'user_followers_count': 1810, 'lang': 'en', 'created_at': '2019-04-27 06:53:00.000000', 'text': "RT @TheSharpEdge1: POTUS on Hannity last night:\n\nHannity: 'Mr. President will you declassify the FISA applications, Gang of 8 material, tho…", 'user_location': 'United States', 'entities_user_mentions': [{'id': 952758329301807104, 'indices': [3, 17], 'screen_name': 'TheSharpEdge1', 'name': 'TheSharpEdge', 'id_str': '952758329301807104'}], 'user_id': 436264794, 'id': 1122030871408672769}, checksum=733009405, serialized_key_size=-1, serialized_value_size=617),
  ConsumerRecord(topic='trump', partition=0, offset=6023, timestamp=1556347985948, timestamp_type=0, key=None, value={'entities_hashtags': [], 'user_frie
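
As the output shows, `poll()` returns a dictionary that maps each topic partition to a list of records, and the tweet itself sits in each record's `value`. The sketch below collects the values across all partitions. To stay runnable without a broker it uses `namedtuple` stand-ins for the `TopicPartition` and `ConsumerRecord` types, keeping only a few fields, and the batch and the `flatten_values` helper are hypothetical.

```python
from collections import namedtuple

# Stand-ins for kafka-python's TopicPartition and ConsumerRecord (subset of fields)
TopicPartition = namedtuple("TopicPartition", ["topic", "partition"])
ConsumerRecord = namedtuple("ConsumerRecord", ["topic", "partition", "offset", "value"])

# Hypothetical batch, shaped like the return value of consumer.poll()
batch = {
    TopicPartition("trump", 0): [
        ConsumerRecord("trump", 0, 6022, {"id": 1, "lang": "en"}),
        ConsumerRecord("trump", 0, 6023, {"id": 2, "lang": "fr"}),
    ],
    TopicPartition("trump", 1): [
        ConsumerRecord("trump", 1, 17, {"id": 3, "lang": "en"}),
    ],
}


def flatten_values(batch):
    """Return the message values of every record in every partition."""
    return [record.value for records in batch.values() for record in records]


tweets = flatten_values(batch)
print(len(tweets))
```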

## Create a language DF in pandas

In [16]:
lang_list =  [("in", "Indian"), ("en", "English"), ("hi", "Hindi"), ("fr", "French"), ("de", "German"), ("nl", "Dutch"), ("ar", "arabic"), ("ja", "Japanese"), ("ru", "Russian"), ("es", "Spanish"), ("zh", "Chinese")]
languagesDF = pandas.DataFrame(lang_list, columns=["lang", "language"])

languagesDF


    lang  language
0     in    Indian
1     en   English
2     hi     Hindi
3     fr    French
4     de    German
5     nl     Dutch
6     ar    arabic
7     ja  Japanese
8     ru   Russian
9     es   Spanish
10    zh   Chinese


## Actions

Create a function that polls the real-time tweet stream, converts the batch to a pandas DataFrame, and merges it with the language DF.

In [17]:
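def add_language():
    # poll() returns {TopicPartition: [ConsumerRecord, ...]}, or {} if nothing is available
    c = consumer.poll()
    if c != {}:
        # Collect the message values from every partition, not only the first one
        list_of_dict = [record.value for records in c.values() for record in records]
        tweetDF = pandas.DataFrame(list_of_dict)
        # Left join: tweets whose code is not in languagesDF are kept, with language = NaN
        tweetDF = tweetDF.merge(languagesDF, on="lang", how="left")
    else:
        tweetDF = "No tweets to process"

    return tweetDF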


In [18]:
add_language()

'No tweets to process'
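
The merge in `add_language` is a left join: every tweet is kept, and tweets whose `lang` code is missing from the language table simply get no language name. The sketch below reproduces that behaviour in plain Python with a dictionary lookup, as an illustration of the semantics rather than a replacement for the pandas call. The tweets are hypothetical, and unknown codes come out as `None` here where pandas would use `NaN`.

```python
# Subset of the language table above, as a plain lookup dictionary
languages = {"en": "English", "fr": "French", "de": "German"}

# Hypothetical tweets; "pt" is deliberately absent from the lookup
tweets = [
    {"id": 1, "lang": "en"},
    {"id": 2, "lang": "pt"},
    {"id": 3, "lang": "fr"},
]

# Left-join semantics: every tweet is kept, unknown codes get None
enriched = [dict(t, language=languages.get(t["lang"])) for t in tweets]

for row in enriched:
    print(row)
```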

Create a function that calculates count statistics on the DF tweet stream.

In [21]:
def statistics():
    tweetDF = add_language()
    # add_language() returns a message string when the poll was empty
    if not isinstance(tweetDF, str):
        print("Number of processed tweets: ", len(tweetDF))
        # Rows with a NaN language had a code that is not in languagesDF
        print("Number of tweets with a known language", len(tweetDF.dropna(subset=["language"])))
        print("Number of tweets grouped by language:", tweetDF[["language", "id"]].groupby(["language"]).count())
        print("")
    else:
        print(tweetDF)
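
The three figures printed by `statistics()` are a total row count, a count of rows with a known language, and a per-language count. The sketch below computes the same three numbers with `collections.Counter` instead of pandas, which can help when checking the results by hand. The rows and the `language_counts` helper are hypothetical.

```python
from collections import Counter


def language_counts(rows):
    """Count rows per language name, skipping rows with no known language."""
    return Counter(r["language"] for r in rows if r.get("language") is not None)


# Hypothetical batch after the language merge; None marks an unknown language
rows = [
    {"id": 1, "language": "English"},
    {"id": 2, "language": None},
    {"id": 3, "language": "English"},
    {"id": 4, "language": "French"},
]

counts = language_counts(rows)
print("Number of processed tweets:", len(rows))
print("Number of tweets with a known language:", sum(counts.values()))
print(dict(counts))
```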

Run the actions 10 times

In [22]:
# Poll and summarise 10 batches, pausing 5 seconds after each one
for _ in range(10):
    statistics()
    time.sleep(5)

Number of processed tweets:  500
Number of tweets with a known language 483
Number of tweets grouped by language:            id
language     
English   457
French      6
German      6
Hindi       1
Indian      1
Japanese    4
Spanish     7
arabic      1

Number of processed tweets:  500
Number of tweets with a known language 484
Number of tweets grouped by language:            id
language     
English   467
French      5
German      1
Indian      1
Japanese    3
Spanish     5
arabic      2

Number of processed tweets:  9
Number of tweets with a known language 9
Number of tweets grouped by language:           id
language    
English    9

Number of processed tweets:  500
Number of tweets with a known language 487
Number of tweets grouped by language:            id
language     
Dutch       2
English   460
French     11
German      4
Hindi       2
Indian      1
Japanese    1
Spanish     6

Number of processed tweets:  500
Number of tweets with a known language 485
Number of tweets groupe