# Part 1: Extracting all tweets using Twitter API and dumping into JSONlines file

Import the libraries needed to extract tweets from Twitter and write them to a JSON Lines file.

Tweepy handles access to the Twitter API, and jsonlines writes the data to the JSON Lines file.

In [42]:
# tweepy itself must be imported because tweepy.API and tweepy.Cursor are used below
import tweepy
from tweepy import OAuthHandler
import jsonlines

Define Twitter Credentials

In [43]:
access_token = ""
access_token_secret = ""
consumer_key = ""
consumer_secret = ""
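
Since the credentials are left blank above, one way to avoid hardcoding them is to read them from environment variables. The following is a minimal sketch, not the notebook's own method; the helper `load_twitter_credentials` and the variable names such as `TWITTER_CONSUMER_KEY` are hypothetical and can be renamed freely.

```python
import os


def load_twitter_credentials(env=os.environ):
    """Read the four credentials from a mapping (the process environment by default).

    The environment variable names below are assumptions made for this sketch.
    """
    names = {
        "consumer_key": "TWITTER_CONSUMER_KEY",
        "consumer_secret": "TWITTER_CONSUMER_SECRET",
        "access_token": "TWITTER_ACCESS_TOKEN",
        "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
    }
    # Fail early with a clear message if any value is absent or empty
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}
```

The returned dictionary can then supply `consumer_key`, `consumer_secret`, `access_token` and `access_token_secret` in place of the empty strings.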

Establish a connection to Twitter using Tweepy and create the API entry point for subsequent operations

In [44]:
# Establish connection with Twitter and create Entrypoint to perform operations using tweepy
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

Extract the tweets from the timeline of user @midasIIITD using Tweepy's Cursor, which pages through the results.

Iterate through the tweets and write each one to the JSON Lines file as a JSON object.

In [45]:
# Open 'tweets.jsonl' file to write the data
with jsonlines.open('tweets.jsonl', mode='w') as writer:
    # Iterate through tweet data obtained from Tweepy Cursor
    for tweet in tweepy.Cursor(api.user_timeline, screen_name='@midasIIITD').items():
        # print(tweet._json)
        # Write the raw JSON of the tweet to the file
        writer.write(tweet._json)
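
For readers unfamiliar with the format: a JSON Lines file holds one JSON object per line, which is what each `writer.write` call above produces. The sketch below shows the same idea with only the standard library; the two records and the file name `example.jsonl` are made up for illustration.

```python
import json
import os
import tempfile

# Two made-up records standing in for tweet objects
records = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]
path = os.path.join(tempfile.mkdtemp(), "example.jsonl")

# Write: serialise each record to JSON and end it with a newline
with open(path, "w", encoding="utf-8") as handle:
    for record in records:
        handle.write(json.dumps(record) + "\n")

# Read: parse each non-empty line back into a dictionary
with open(path, encoding="utf-8") as handle:
    loaded = [json.loads(line) for line in handle if line.strip()]

print(loaded == records)
```

Because every line is an independent JSON document, the file can be processed one record at a time without loading it all into memory.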

# Part 2: Parse this JSONlines file to display tweets in the required format

Import additional libraries for inspecting the data and storing it in tables

In [19]:
import pandas as pd
import json

Analyse the fields in the JSON returned by the Twitter API.

View the first entry.

In [67]:
with jsonlines.open('tweets.jsonl') as reader:
    for tweetObj in reader:
        # Pretty-print the record with json.dumps to understand its structure
        print(json.dumps(tweetObj, indent=2))
        # View only the first entry, then stop the reader
        break

{
  "created_at": "Fri Apr 05 16:08:37 +0000 2019",
  "id": 1114198161562775553,
  "id_str": "1114198161562775553",
  "text": "We have emailed the task details to all candidates who have applied to @midasIIITD internship through IIITD portal.\u2026 https://t.co/gZwyr7D2Sw",
  "truncated": true,
  "entities": {
    "hashtags": [],
    "symbols": [],
    "user_mentions": [
      {
        "screen_name": "midasIIITD",
        "name": "MIDAS IIITD",
        "id": 1021355762575073281,
        "id_str": "1021355762575073281",
        "indices": [
          71,
          82
        ]
      }
    ],
    "urls": [
      {
        "url": "https://t.co/gZwyr7D2Sw",
        "expanded_url": "https://twitter.com/i/web/status/1114198161562775553",
        "display_url": "twitter.com/i/web/status/1\u2026",
        "indices": [
          117,
          140
        ]
      }
    ]
  },
  "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
  "in_reply_to_status_id": null,


Function to return the required fields for our table.

After analysing the data, I realised that for retweets the 'retweets' and 'likes' counts are stored in the 'retweeted_status' section, so the function reads them from there.

In [69]:
def extractReqData(tweetObj):
    # Num of Likes
    fav = 0
    # Num of Retweets
    retw = 0
    # Num of Images
    img = None
    
    # If the tweet is a retweet, take likes and retweets from the original tweet
    if 'retweeted_status' in tweetObj:
        fav = tweetObj.get('retweeted_status').get('favorite_count')
        retw = tweetObj.get('retweeted_status').get('retweet_count')
    # Otherwise take likes and retweets from the tweet itself
    else:
        fav = tweetObj.get('favorite_count')
        retw = tweetObj.get('retweet_count')

    # If the tweet has any media entries, count them (img stays None otherwise)
    if tweetObj.get('entities').get('media'):
        img = len(tweetObj.get('entities').get('media'))
            
    # Get text and DateTime directly from tweet object.
    # Create a list of all 5 fields and return it.
    temp_obj = [tweetObj.get('text'), tweetObj.get('created_at'), fav, retw, img]
    return temp_obj
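
One caveat on the image count: in the Twitter API v1.1 payload, `entities.media` generally lists only the first attached photo, while `extended_entities.media` lists all of them. The sketch below is an alternative counter under that assumption, not the function used in this notebook; `count_photos` is a hypothetical helper, and unlike `extractReqData` it returns 0 rather than None when there are no images.

```python
def count_photos(tweet_obj):
    """Count attached photos, preferring 'extended_entities' when it is present."""
    extended = tweet_obj.get("extended_entities") or {}
    basic = tweet_obj.get("entities") or {}
    # Fall back to the basic entities if no extended section exists
    media = extended.get("media") or basic.get("media") or []
    # Media items also cover videos and GIFs, so keep photos only
    return sum(1 for item in media if item.get("type") == "photo")


# Hand-made records for illustration only
two_photos = {
    "entities": {"media": [{"type": "photo"}]},
    "extended_entities": {"media": [{"type": "photo"}, {"type": "photo"}]},
}
no_media = {"entities": {"hashtags": []}}

print(count_photos(two_photos), count_photos(no_media))
```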

Extract 'Text', 'DateTime', '#Likes', '#Retweets' and '#Images' fields from each tweet object using 'extractReqData' function.

In [70]:
# Complete list of records : Master List
data = []
with jsonlines.open('tweets.jsonl') as reader:
    for tweetObj in reader:
        # Append the returned list to master list
        data.append(extractReqData(tweetObj))
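
The sample record printed earlier has "truncated": true, so its 'text' field is cut short. If the timeline were fetched in extended mode (for example by passing `tweet_mode='extended'` to the API call, which is an assumption about how the data would be collected and not what the code above does), the untruncated text would appear under 'full_text'. A minimal sketch of a reader that copes with both cases follows; `full_tweet_text` is a hypothetical helper.

```python
def full_tweet_text(tweet_obj):
    """Return the most complete text available for a tweet record."""
    # For a retweet, the original tweet carries the complete text
    source = tweet_obj.get("retweeted_status") or tweet_obj
    # Prefer 'full_text' (extended mode), fall back to 'text'
    return source.get("full_text") or source.get("text")


# Hand-made records for illustration only
plain = {"text": "short tweet"}
extended = {"full_text": "a longer tweet fetched in extended mode"}
retweet = {"text": "RT @someone: cut off", "retweeted_status": {"full_text": "the whole original"}}

print(full_tweet_text(plain))
print(full_tweet_text(extended))
print(full_tweet_text(retweet))
```

Note that reading a retweet's text from 'retweeted_status' drops the "RT @user:" prefix, which may or may not be wanted in the table.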

Define the columns and create the table from the extracted data.

The table is created with the pandas library.

In [71]:
table_columns = ['Tweet_Text', 'Tweet_Date_Time', 'Num_likes', 'Num_retweets', 'Num_images']
table = pd.DataFrame(data, columns = table_columns)
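
The 'Tweet_Date_Time' column holds the raw 'created_at' strings, such as "Fri Apr 05 16:08:37 +0000 2019". If real timestamps are needed for sorting or filtering, the string can be parsed with an explicit format. This is a sketch using only the standard library; the constant name `TWITTER_TIME_FORMAT` is an assumption of this example.

```python
from datetime import datetime

# Format matching strings like "Fri Apr 05 16:08:37 +0000 2019"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

parsed = datetime.strptime("Fri Apr 05 16:08:37 +0000 2019", TWITTER_TIME_FORMAT)
print(parsed.isoformat())  # 2019-04-05T16:08:37+00:00
```

The same format string can be passed as the `format` argument of `pd.to_datetime` to convert the whole column at once.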

Display the first five rows of the table.

In [72]:
table.head()

| | Tweet_Text | Tweet_Date_Time | Num_likes | Num_retweets | Num_images |
|---|---|---|---|---|---|
| 0 | We have emailed the task details to all candid... | Fri Apr 05 16:08:37 +0000 2019 | 5 | 1 | |
| 1 | RT @rfpvjr: Our NAACL paper on polarization in... | Fri Apr 05 04:05:11 +0000 2019 | 46 | 15 | |
| 2 | RT @kdnuggets: Effective Transfer Learning For... | Fri Apr 05 04:04:43 +0000 2019 | 19 | 10 | 1.0 |
| 3 | RT @stanfordnlp: What’s new in @Stanford CS224... | Wed Apr 03 18:31:53 +0000 2019 | 221 | 55 | |
| 4 | RT @DeepMindAI: Today we're releasing a large-... | Wed Apr 03 17:04:32 +0000 2019 | 2329 | 837 | |


Store the table in a CSV file

In [73]:
table.to_csv(r'twitter_tweets.csv')
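
By default `to_csv` also writes the row index as an unnamed first column, which pandas labels 'Unnamed: 0' when the file is read back. If that column is not wanted, `index=False` leaves it out. The sketch below compares the two on a one-row table written to in-memory buffers; the sample row is made up for illustration.

```python
import io

import pandas as pd

# A one-row stand-in for the tweets table
table = pd.DataFrame(
    [["example tweet", 5, 1]],
    columns=["Tweet_Text", "Num_likes", "Num_retweets"],
)

# Default behaviour: the index is written as an extra, unnamed column
with_index = io.StringIO()
table.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the declared columns are written
without_index = io.StringIO()
table.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```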