# Ingesting realtime tweets using Apache Kafka, Tweepy and Python

### Purpose:
- serves as the main data source for the lambda architecture pipeline
- polls the Twitter search API through Tweepy to simulate new events arriving at a regular interval
- a Kafka producer sends each tweet as a record to the Kafka broker

### Contents: 
- [Twitter setup](#1)
- [Defining the Kafka producer](#2)
- [Producing and sending records to the Kafka Broker](#3)
- [Deployment](#4)

### Required libraries

In [1]:
import tweepy
import time
from kafka import KafkaConsumer, KafkaProducer
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="my-application")


<a id="1"></a>
### Twitter setup
- getting the API object using authorization information
- you can find more details on how to get the authorization here:
https://developer.twitter.com/en/docs/basics/authentication/overview

In [2]:
# twitter setup
# Credentials are read from environment variables so that real keys are never
# stored in the notebook (the variable names below are an assumption; use your own).
import os
ACCESS_TOKEN = os.environ['TWITTER_ACCESS_TOKEN']
ACCESS_SECRET = os.environ['TWITTER_ACCESS_SECRET']
CONSUMER_KEY = os.environ['TWITTER_CONSUMER_KEY']
CONSUMER_SECRET = os.environ['TWITTER_CONSUMER_SECRET']
# Setup tweepy to authenticate with Twitter credentials:
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
# Creating the API object by passing in auth information
api = tweepy.API(auth)


A helper function that shifts a tweet's creation time from GMT to the timezone of our system (GMT+1 here)

In [3]:
from datetime import datetime, timedelta

def normalize_timestamp(created_at):
    # The parameter is named created_at so it does not shadow the imported time module
    mytime = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S")
    # Tweets are timestamped in GMT; add one hour to match the local (GMT+1) clock
    mytime += timedelta(hours=1)
    return mytime.strftime("%Y-%m-%d %H:%M:%S")
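
The fixed one-hour shift above can also be expressed with timezone-aware datetimes. This is a minimal sketch, assuming the tweet timestamps are in UTC and the local offset is a whole number of hours; `normalize_timestamp_tz` and its `offset_hours` parameter are hypothetical names that do not appear in the original notebook.

```python
from datetime import datetime, timedelta, timezone

def normalize_timestamp_tz(created_at, offset_hours=1):
    # Parse the naive timestamp string and mark it explicitly as UTC
    utc_time = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    # Convert to a fixed-offset timezone (assumed to be UTC+1 by default)
    local_time = utc_time.astimezone(timezone(timedelta(hours=offset_hours)))
    return local_time.strftime("%Y-%m-%d %H:%M:%S")

print(normalize_timestamp_tz("2019-03-26 17:20:58"))
```

A fixed offset does not follow daylight saving time, so this behaves like the original helper; it only makes the UTC assumption explicit.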

<a id="2"></a>
### Defining the Kafka producer
- specify the Kafka Broker
- specify the topic name
- optional: specify partitioning strategy

In [4]:
producer = KafkaProducer(bootstrap_servers='localhost:9092')
topic_name = 'tweepy-kafka-test'
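
The optional partitioning strategy listed above could be approached by sending each record with a key. This is a minimal sketch, assuming records are dicts with a `user_id` field and that grouping by user is what is wanted; `send_keyed` is a hypothetical helper, not part of the original notebook. With kafka-python, a non-null key is hashed by the default partitioner, so records sharing a key land on the same partition.

```python
import json

def send_keyed(producer, topic, record):
    # Use the user id as the message key (an assumption about the desired grouping)
    key = record["user_id"].encode("utf-8")
    # Serialize the record as compact JSON bytes
    value = json.dumps(record, separators=(",", ":")).encode("utf-8")
    # KafkaProducer.send accepts key and value as keyword arguments
    return producer.send(topic, key=key, value=value)
```

It would be called as `send_keyed(producer, topic_name, {"user_id": "...", ...})` with the producer defined above.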

<a id="3"></a>
### Producing and sending records to the Kafka Broker
- querying the Twitter API object
- extracting the relevant information from the response
- formatting the data as JSON and sending it to the proper topic on the Kafka broker
- only tweets whose user location can be geocoded are sent
- resulting records have the following attributes:
    - created_at
    - text
    - user_id
    - user_timezone
    - user_location
    - loc_lat
    - loc_long
    - followers_count
    - language

In [8]:
import json

def get_twitter_data():
    # Search recent tweets matching the query "and".
    # Note: in Tweepy 4.x this method is named api.search_tweets.
    res = api.search("and")  # optionally restrict by area via the geocode parameter
    for tweet in res:
        # Skip users without a profile location before calling the geocoder
        if not tweet.user.location:
            continue
        location = geolocator.geocode(str(tweet.user.location))
        if location is None:
            continue
        # Build the record as a dict and let json.dumps handle quoting and escaping
        record = json.dumps({
            "created_at": str(tweet.created_at),
            "text": str(tweet.text),
            "user_id": str(tweet.user.id_str),
            "user_timezone": str(tweet.user.time_zone),
            "user_location": str(location.latitude) + ',' + str(location.longitude),
            "loc_lat": location.latitude,
            "loc_long": location.longitude,
            "followers_count": str(tweet.user.followers_count),
            "language": str(tweet.lang),
        }, separators=(',', ':'))  # compact separators keep the output format unchanged
        producer.send(topic_name, record.encode('utf-8'))
        print(record)
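
Each tweet triggers one geocoding request, and many users share the same location string. A small cache can avoid repeated lookups, which matters because the public Nominatim service limits clients to about one request per second. This is a minimal sketch; `make_cached_geocoder` is a hypothetical helper that wraps any geocode callable and is not part of the original notebook.

```python
from functools import lru_cache

def make_cached_geocoder(geocode, maxsize=1024):
    """Wrap a geocode callable so each distinct location string is looked up once."""
    @lru_cache(maxsize=maxsize)
    def cached(location_text):
        # Empty or missing locations are never sent to the geocoder
        if not location_text:
            return None
        return geocode(location_text)
    return cached
```

It could be wired in as `cached_geocode = make_cached_geocoder(geolocator.geocode)` and then used in place of `geolocator.geocode`. geopy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter` for throttling requests.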

In [9]:
get_twitter_data()

{"created_at":"2019-03-26 17:20:58","text":"@eltonvillanuev6 @humanworkplace So you think that the effects of ageism can always be counteracted by grit, hard w\u2026 https://t.co/X80VQzJfHf","user_id":"809980977409835008","user_timezone":"None","user_location":"40.7308619,-73.9871558","loc_lat":40.7308619,"loc_long":-73.9871558,"followers_count":"3565","language":"en"}
{"created_at":"2019-03-26 17:20:58","text":"RT @decorartehogar: Retweet and Like if you want to gain more followers\ud83d\udc33\n#Decorartehogar","user_id":"901210368","user_timezone":"None","user_location":"41.0096334,28.9651646","loc_lat":41.0096334,"loc_long":28.9651646,"followers_count":"232","language":"en"}
{"created_at":"2019-03-26 17:20:58","text":"Exciting #news - I'm now represented by Charlotte Watts and the team \n@WilliamsonHolme\n #actor #actorslife #casting\u2026 https://t.co/bOyndji9RA","user_id":"120801635","user_timezone":"None","user_location":"51.5073219,-0.1276474","loc_lat":51.5073219,"loc_long":-0.
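
Each printed line is one JSON record, which is also what a consumer would receive as the message value. The sketch below decodes such a record and converts `followers_count` back to an integer, since the producer serializes it as a string. `decode_record` is a hypothetical helper, and the sample is one record copied from the output above.

```python
import json

def decode_record(value):
    """Decode one message value (UTF-8 JSON bytes) into a dict."""
    record = json.loads(value.decode("utf-8"))
    # followers_count is sent as a string by the producer; convert it back to int
    record["followers_count"] = int(record["followers_count"])
    return record

# Raw bytes literal so the JSON escape sequences reach json.loads unchanged
sample = rb'{"created_at":"2019-03-26 17:20:58","text":"RT @decorartehogar: Retweet and Like if you want to gain more followers\ud83d\udc33\n#Decorartehogar","user_id":"901210368","user_timezone":"None","user_location":"41.0096334,28.9651646","loc_lat":41.0096334,"loc_long":28.9651646,"followers_count":"232","language":"en"}'

record = decode_record(sample)
print(record["followers_count"], record["loc_lat"], record["loc_long"])
```

With kafka-python, the bytes passed to `decode_record` would typically come from `message.value` while iterating over a `KafkaConsumer`.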

<a id="4"></a>
### Deployment 
- run the task repeatedly, sleeping for a fixed interval between runs

In [10]:
def periodic_work(interval):
    while True:
        get_twitter_data()
        # interval is the number of seconds to wait (time.sleep accepts an int or a float)
        time.sleep(interval)
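
The loop above stops at the first exception, for example a network error or a rate-limit response. A more forgiving variant could log the failure and continue. This is a minimal sketch; `periodic_work_safe`, `max_runs`, and the injectable `sleep` parameter are hypothetical and were added only to make the loop bounded and testable.

```python
import time

def periodic_work_safe(task, interval, max_runs=None, sleep=time.sleep):
    """Run task repeatedly, waiting interval seconds between runs.

    Returns (runs, failures). With max_runs=None the loop runs until interrupted.
    """
    runs = 0
    failures = 0
    while max_runs is None or runs < max_runs:
        try:
            task()
        except Exception as exc:
            # KeyboardInterrupt is not an Exception subclass, so Ctrl+C still stops the loop
            failures += 1
            print(f"task failed: {exc!r}")
        runs += 1
        sleep(interval)
    return runs, failures
```

It would be called as `periodic_work_safe(get_twitter_data, 6)` to mirror the behaviour of the call below.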


In [11]:
periodic_work(60 * 0.1)  # 60 * 0.1 = 6 seconds between runs; use e.g. 60 * 2 for every two minutes

{"created_at":"2019-03-26 17:22:09","text":"RT @ftrickxHP: How many followers you want ?\n\ud83c\udf41\ud83c\udf41\ud83c\udf41\ud83c\udf41\ud83c\udf41\ud83c\udf41\ud83c\udf41\ud83c\udf41\ud83c\udf41\n5k\n10k\n20k\n30k\n40k\n50k\n100k\n\ud83c\udf41\ud83c\udf41\ud83c\udf41\ud83c\udf41\nREPLY With '' \ud83c\udf41  '' Follow Whoever Likes ur Reply!\u2026","user_id":"14923564","user_timezone":"None","user_location":"19.4326009,-99.1333416","loc_lat":19.4326009,"loc_long":-99.1333416,"followers_count":"44611","language":"en"}
{"created_at":"2019-03-26 17:22:09","text":"RT @WakandaSensei: #IKOKAZIKE A lady Friend is looking for a Job.She has a degree in Geography and is experienced In Geography, Surveying,\u2026","user_id":"350940354","user_timezone":"None","user_location":"-1.2832533,36.8172449","loc_lat":-1.2832533,"loc_long":36.8172449,"followers_count":"395","language":"en"}
{"created_at":"2019-03-26 17:22:09","text":"Indonesia: Hivos EoI : In-Country Researcher: Creating Spaces for Engag

{"created_at":"2019-03-26 17:22:35","text":"RT @Mark__Porter: As a High School football player, if I could do something all over again.\n\nI would ask the track coach to \"let\" me on the\u2026","user_id":"2351990517","user_timezone":"None","user_location":"40.0757384,-74.4041622","loc_lat":40.0757384,"loc_long":-74.4041622,"followers_count":"1791","language":"en"}
{"created_at":"2019-03-26 17:22:35","text":"RT @toothicktexas: I\u2019m kind of sad that I\u2019m turning 22 this year. I\u2019m still in college, living wit my momma &amp; daddy. Got a boo boo job and\u2026","user_id":"521050576","user_timezone":"None","user_location":"34.0536834,-118.2427669","loc_lat":34.0536834,"loc_long":-118.2427669,"followers_count":"1244","language":"en"}
{"created_at":"2019-03-26 17:22:35","text":"RT @istockhistory: Check out Geraniums and cats by Pierre-Auguste Renoir Giclee Print Repro on Canvas https://t.co/lkEkIB4REg #art #fineart\u2026","user_id":"960756196457070592","user_timezone":"None","use

{"created_at":"2019-03-26 17:23:05","text":"How many players do you remember watching? And what is the best line from this? https://t.co/weFExd3NA2","user_id":"2288826498","user_timezone":"None","user_location":"45.5202471,-122.6741949","loc_lat":45.5202471,"loc_long":-122.6741949,"followers_count":"1497","language":"en"}
{"created_at":"2019-03-26 17:23:05","text":"Win a copy of Sunwise and a corn dolly made with my own fair (ahem) hands... https://t.co/DOjQzraEYy","user_id":"3290818128","user_timezone":"None","user_location":"52.7954791,-0.540240286617432","loc_lat":52.7954791,"loc_long":-0.540240286617432,"followers_count":"654","language":"en"}
{"created_at":"2019-03-26 17:23:05","text":"@MoBrexit_ It's happening to you and me. Not The 1% they'll be fine. Money abroad.","user_id":"153724077","user_timezone":"None","user_location":"54.7023545,-3.2765753","loc_lat":54.7023545,"loc_long":-3.2765753,"followers_count":"1639","language":"en"}
{"created_at":"2019-03-26 17:23:05","text":"RT

KeyboardInterrupt: 