# Ingesting realtime tweets using Apache Kafka, Tweepy and Python

### Purpose:
- serves as the main data source for the lambda architecture pipeline
- polls the Twitter search API (via Tweepy) to simulate new events arriving every minute
- a Kafka producer sends the tweets as records to the Kafka broker

### Contents: 
- [Twitter setup](#1)
- [Defining the Kafka producer](#2)
- [Producing and sending records to the Kafka Broker](#3)
- [Deployment](#4)

### Required libraries

In [1]:
import tweepy
import time
from kafka import KafkaProducer

<a id="1"></a>
### Twitter setup
- getting the API object using authorization information
- you can find more details on how to get the authorization here:
https://developer.twitter.com/en/docs/basics/authentication/overview

In [2]:
# twitter setup: replace the placeholders with your own credentials
# (never commit real keys to a shared notebook or repository)
ACCESS_TOKEN = '<your-access-token>'
ACCESS_SECRET = '<your-access-token-secret>'
CONSUMER_KEY = '<your-consumer-key>'
CONSUMER_SECRET = '<your-consumer-secret>'
# Setup tweepy to authenticate with Twitter credentials:
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
# Creating the API object by passing in auth information
api = tweepy.API(auth) 
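One way to keep the four credentials out of the notebook is to read them from environment variables. The following is a minimal sketch, not part of the original setup: the variable names in `REQUIRED_VARS` and the helper `load_twitter_credentials` are hypothetical and can be renamed freely.

```python
import os

# Hypothetical variable names; pick whatever fits your deployment.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_twitter_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values can then be passed to `tweepy.OAuthHandler` and `auth.set_access_token` in place of the constants defined in the cell above.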


A helper function that shifts a tweet's creation time from GMT to the local time of our system (GMT+1)

In [3]:
from datetime import datetime, timedelta

def normalize_timestamp(timestamp):
    # the parameter is named `timestamp` so it does not shadow the imported `time` module
    mytime = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    mytime += timedelta(hours=1)   # the tweets are timestamped in GMT, while I am in the GMT+1 timezone
    return mytime.strftime("%Y-%m-%d %H:%M:%S")
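The same one-hour shift can also be written with timezone-aware datetimes, which makes the GMT-to-local conversion explicit. This is only a sketch under the assumption of a fixed GMT+1 offset without daylight saving; `LOCAL_TZ` and `normalize_timestamp_tz` are hypothetical names, not part of the pipeline.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed GMT+1 offset; daylight saving time is not handled here.
LOCAL_TZ = timezone(timedelta(hours=1))


def normalize_timestamp_tz(timestamp, local_tz=LOCAL_TZ):
    """Interpret `timestamp` as GMT/UTC and return it formatted in `local_tz`."""
    utc_time = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    utc_time = utc_time.replace(tzinfo=timezone.utc)  # mark the naive value as UTC
    return utc_time.astimezone(local_tz).strftime("%Y-%m-%d %H:%M:%S")


print(normalize_timestamp_tz("2019-03-22 21:31:06"))  # 2019-03-22 22:31:06
```

Passing a different `local_tz` changes the target offset without touching the parsing logic.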

<a id="2"></a>
### Defining the Kafka producer
- specify the Kafka Broker
- specify the topic name
- optional: specify partitioning strategy

In [5]:
producer = KafkaProducer(bootstrap_servers='localhost:9092')
topic_name = 'tweets-kafka-test'
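For the optional partitioning step, one simple approach is to attach a key to every record. To my understanding, kafka-python's default partitioner hashes the key, so records with the same key end up on the same partition. The helper below is a hypothetical sketch; it only relies on the `key` and `value` arguments of `KafkaProducer.send`.

```python
def send_with_key(producer, topic, key, record):
    """Send `record` to `topic`, using `key` to influence the partition choice.

    `producer` is expected to behave like kafka.KafkaProducer; both key and
    value are encoded to bytes because no serializer is configured here.
    """
    return producer.send(
        topic,
        key=key.encode("utf-8"),
        value=record.encode("utf-8"),
    )
```

For example, `send_with_key(producer, topic_name, user_id, record)` would keep all records of one user on a single partition, assuming `user_id` is a string.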

<a id="3"></a>
### Producing and sending records to the Kafka Broker
- querying the Twitter API Object
- extracting relevant information from the response
- formatting and sending the data to proper topic on the Kafka Broker
- each resulting record has the following semicolon-separated attributes:
    - id (the user id of the tweet's author)
    - created_at
    - followers_count
    - location
    - favorite_count
    - retweet_count

In [6]:
def get_twitter_data():
    res = api.search("Apple OR iphone OR iPhone")
    count = 0
    for tweet in res:
        # collect the attributes in the order they appear in the record
        fields = [
            tweet.user.id_str,
            normalize_timestamp(str(tweet.created_at)),
            tweet.user.followers_count,
            tweet.user.location,
            tweet.favorite_count,
            tweet.retweet_count,
        ]
        # semicolon-separated record with a trailing ';'
        record = ';'.join(str(field) for field in fields) + ';'
        producer.send(topic_name, str.encode(record))
        #if count == 0:
            #print(record)
        count += 1

    print(count)  # number of records sent in this call

In [7]:
get_twitter_data()

1872634008;2019-03-22 22:31:06;103;;0;1514;
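To illustrate the record layout, here is a hedged sketch of how a downstream consumer might split a record back into named fields. The function `parse_record` and the dictionary keys are hypothetical. Because `location` is free text and could itself contain a `;`, the sketch takes the first three and last two fields by position and treats everything in between as the location.

```python
def parse_record(raw):
    """Split one semicolon-separated record (bytes) into a dict of fields."""
    text = raw.decode("utf-8")
    if text.endswith(";"):
        text = text[:-1]  # drop the trailing separator added by the producer
    parts = text.split(";")
    if len(parts) < 6:
        raise ValueError("expected at least 6 fields, got %d" % len(parts))
    return {
        "id": parts[0],
        "created_at": parts[1],
        "followers_count": int(parts[2]),
        "location": ";".join(parts[3:-2]),  # may be empty or contain ';'
        "favorite_count": int(parts[-2]),
        "retweet_count": int(parts[-1]),
    }


sample = b"1872634008;2019-03-22 22:31:06;103;;0;1514;"
print(parse_record(sample)["retweet_count"])  # 1514
```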


<a id="4"></a>
### Deployment 
- perform the task periodically, waiting a fixed interval between runs

In [8]:
def periodic_work(interval):
    while True:
        get_twitter_data()
        # interval is the number of seconds to wait (time.sleep accepts ints and floats)
        time.sleep(interval)
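A single failed API call or broker hiccup would stop the loop above. The following sketch shows one possible variant that reports the error and keeps polling. The name `periodic_work_safe` and the parameters `task`, `max_runs` and `sleep` are hypothetical additions that make the loop bounded and testable; they are not part of the original notebook.

```python
import time


def periodic_work_safe(task, interval, max_runs=None, sleep=time.sleep):
    """Call `task` repeatedly, waiting `interval` seconds after each call.

    Exceptions raised by `task` are printed and the loop continues.
    With max_runs=None the loop runs until interrupted (e.g. Ctrl-C).
    Returns the number of calls that completed without an exception.
    """
    runs = 0
    succeeded = 0
    while max_runs is None or runs < max_runs:
        try:
            task()
            succeeded += 1
        except Exception as exc:  # KeyboardInterrupt is not caught here
            print("task failed, retrying after the interval:", exc)
        runs += 1
        sleep(interval)
    return succeeded
```

In this notebook it could be started with `periodic_work_safe(get_twitter_data, 60)`.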


In [None]:
periodic_work(60 * 0.1)  # 60 * 0.1 = 6 seconds between polls; use e.g. 60 * 2 to poll every couple of minutes

625023809;2019-03-22 22:31:11;493;;0;3538;
625023809;2019-03-22 22:31:11;493;;0;3539;
625023809;2019-03-22 22:31:11;493;;0;3538;
836252584432168961;2019-03-22 22:31:31;419;Calgary, Alberta;0;4;
836252584432168961;2019-03-22 22:31:31;419;Calgary, Alberta;0;4;
836252584432168961;2019-03-22 22:31:31;419;Calgary, Alberta;0;4;
836252584432168961;2019-03-22 22:31:31;419;Calgary, Alberta;0;4;
754515703852634113;2019-03-22 22:31:57;30;Puerto Rico, USA;0;1;
754515703852634113;2019-03-22 22:31:57;30;Puerto Rico, USA;0;1;
754515703852634113;2019-03-22 22:31:57;30;Puerto Rico, USA;0;1;
754515703852634113;2019-03-22 22:31:57;30;Puerto Rico, USA;0;1;
3377752935;2019-03-22 22:32:23;3586;Rio de Janeiro, Brasil;0;0;
3377752935;2019-03-22 22:32:23;3586;Rio de Janeiro, Brasil;0;0;
3377752935;2019-03-22 22:32:23;3586;Rio de Janeiro, Brasil;0;0;
1431359448;2019-03-22 22:32:43;21292;United Kingdom;0;207;
1431359448;2019-03-22 22:32:43;21292;United Kingdom;0;207;
1431359448;2019-03-22 22:32:43;21292;United K

2482811500;2019-03-22 22:47:07;200;longin;0;57;
114786308;2019-03-22 22:47:27;45881;Brooklyn | NYC | World Wide !;0;0;
114786308;2019-03-22 22:47:27;45880;Brooklyn | NYC | World Wide !;0;0;
114786308;2019-03-22 22:47:27;45880;Brooklyn | NYC | World Wide !;0;0;
114786308;2019-03-22 22:47:27;45880;Brooklyn | NYC | World Wide !;0;0;
1356653958;2019-03-22 22:47:53;1138;RADWIMPS;0;0;
1356653958;2019-03-22 22:47:53;1138;RADWIMPS;0;0;
1356653958;2019-03-22 22:47:53;1138;RADWIMPS;0;0;
1356653958;2019-03-22 22:47:53;1138;RADWIMPS;0;0;
856897151905013760;2019-03-22 22:48:19;359;;0;0;
856897151905013760;2019-03-22 22:48:19;359;;0;0;
856897151905013760;2019-03-22 22:48:19;359;;0;0;
898135706996011009;2019-03-22 22:48:39;642;;0;0;
898135706996011009;2019-03-22 22:48:39;642;;0;0;
898135706996011009;2019-03-22 22:48:39;642;;0;0;
222294121;2019-03-22 22:48:59;1375;Trillville;0;0;
222294121;2019-03-22 22:48:59;1375;Trillville;0;0;
222294121;2019-03-22 22:48:59;1375;Trillville;0;0;
222294121;2019-03-22 