# Python Problem

## Library Used: tweepy
tweepy is a Python library for accessing the Twitter API. It provides a lot of functionality that makes the task easier.

## Keys for OAuth Authentication

In [1]:
consumer_key="your_consumer_key"
consumer_secret="your_consumer_secret"
access_token="your_access_token"
access_token_secret="your_access_token_secret"
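
Hard-coded keys are easy to leak when a notebook is shared. As a minimal sketch of one alternative, the four values could be read from environment variables instead. The variable names and the `load_twitter_keys` helper below are hypothetical, chosen only for illustration.

```python
import os

# Hypothetical variable names; any consistent naming scheme would work.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)

def load_twitter_keys(env=None):
    """Return the four credentials as a tuple, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in REQUIRED_VARS)
```

With the variables set in the shell, the cell above would become `consumer_key, consumer_secret, access_token, access_token_secret = load_twitter_keys()`.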

## OAuth Authentication 
To use the Twitter API, we first need to authenticate with OAuth.

In [2]:
import json
import tweepy

#using the keys to perform OAuth
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

## Testing out the Library Features

Testing the tweepy library to obtain the required features from a tweet.<br/>
The features are:<br/>
● The text of the tweet.<br/>
● Date and time of the tweet.<br/>
● The number of favorites/likes.<br/>
● The number of retweets.<br/>
● The images in the tweet.<br/>

In [3]:
#name: contains the Twitter handle of the user
#tweetcount: contains the number of tweets to fetch
name="midasIIITD"
tweetcount=20

#using the api to get the user's tweets
results=api.user_timeline(screen_name=name,count=tweetcount)

#displaying the data of a tweet
for i,tweet in enumerate(results,start=1):
    print(i,": ",tweet.retweet_count,tweet.favorite_count,tweet.created_at,tweet.entities)
    try:
        #"media" is missing from entities when the tweet has no attachments
        #summing the comparisons counts only the media items of type "photo"
        print(sum(m["type"]=="photo" for m in tweet.entities["media"]))
        print()
    except KeyError:
        print("No picture in this tweet")
        print()
    #stop after the first tweet; this cell is only a quick check
    break

1 :  9 0 2019-04-09 16:45:07 {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'IIITDelhi', 'name': 'IIIT Delhi', 'id': 2227868629, 'id_str': '2227868629', 'indices': [3, 13]}], 'urls': []}
No picture in this tweet
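
One detail worth spelling out when counting pictures: a list of `m["type"] == "photo"` comparisons has one entry per media item whether or not that item is a photo, so its length is the media count, not the photo count. The sketch below counts only photos and avoids the exception path by using `dict.get`. The `count_photos` helper and the two sample dictionaries are hand-written stand-ins for illustration, not real API responses.

```python
def count_photos(entities):
    """Count media items of type 'photo' in a tweet's entities dict."""
    media = entities.get("media", [])  # key is absent when nothing is attached
    return sum(1 for m in media if m.get("type") == "photo")

# Hand-written stand-ins for tweet.entities
no_media = {"hashtags": [], "symbols": [], "user_mentions": [], "urls": []}
with_media = {"media": [{"type": "photo"}, {"type": "video"}]}

print(count_photos(no_media))    # 0
print(count_photos(with_media))  # 1
```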



## Testing the Cursor object
Tweepy's Cursor object handles pagination, which lets us retrieve more than 200 tweets (the maximum number of tweets in a page). Even with pagination, the Twitter API returns at most the 3,200 most recent tweets of a user.<br/>
Here we use the Cursor object to obtain all the tweets of **midasIIITD**.
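
To make the pagination idea concrete, the sketch below imitates it with an in-memory list instead of the real API: each "request" returns at most 200 items, and the next request starts just below the oldest id already seen. This is roughly the kind of loop a cursor automates, not tweepy's actual implementation; `fetch_page`, `fetch_all`, and `fake_timeline` are hypothetical names.

```python
def fetch_page(timeline, max_id=None, count=200):
    """Stand-in for one timeline request: newest-first items with id <= max_id."""
    eligible = [t for t in timeline if max_id is None or t["id"] <= max_id]
    return eligible[:count]

def fetch_all(timeline, count=200):
    """Request page after page until an empty page comes back."""
    collected, max_id = [], None
    while True:
        page = fetch_page(timeline, max_id=max_id, count=count)
        if not page:
            break
        collected.extend(page)
        max_id = page[-1]["id"] - 1  # continue just below the oldest id seen
    return collected

fake_timeline = [{"id": i} for i in range(450, 0, -1)]  # 450 fake tweets, newest first
all_tweets = fetch_all(fake_timeline)
print(len(all_tweets))  # 450, gathered over pages of 200, 200 and 50
```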

In [4]:
#Using the Cursor object to fetch the tweets
#collecting the required features into a list of dictionaries for testing purposes

statuses=[]
for status in tweepy.Cursor(api.user_timeline, screen_name='@midasIIITD',tweet_mode="extended").items():
    text=status.full_text
    time=status.created_at
    fav=status.favorite_count
    retweet=status.retweet_count
    statuses.append({'text':text,'date_time':time,'favourites':fav,'retweets':retweet})
    #stop after the first tweet; this cell is only a quick check
    break
statuses

[{'text': 'RT @IIITDelhi: We are delighted to share that IIIT-Delhi is ranked 55 by NIRF this year. We have moved up by 11 positions compared to the p…',
  'date_time': datetime.datetime(2019, 4, 9, 16, 45, 7),
  'favourites': 0,
  'retweets': 9}]

## Using the Cursor object to Fetch all the Tweets
We use the tweepy Cursor object to fetch all the tweets and store them in a list.

In [5]:
#Making a list of the status objects (tweets) returned by the Cursor object.

status=[stat for stat in tweepy.Cursor(api.user_timeline, screen_name='@midasIIITD',tweet_mode="extended").items()]

## Dumping the Tweets into JSONlines File
A tweepy Status object is not itself JSON serializable, but it has a _json attribute that holds the JSON-serializable response data. Using this attribute, we dump the tweets into a JSONlines file.<br/>
We use the open function of the **jsonlines** library and the write method of the writer it returns to dump the tweets.

In [6]:
#Dumping the fetched tweets into JSONlines file.

import jsonlines
with jsonlines.open('out.jsonl', mode='w') as writer:
    for stat in status:
        writer.write(stat._json)
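
For readers unfamiliar with the format, a JSON Lines file is simply one JSON document per line. The sketch below reproduces the round trip with only the standard library, writing to an in-memory buffer that stands in for a file such as `out.jsonl`. The two records are made-up examples.

```python
import io
import json

records = [
    {"full_text": "first tweet", "favorite_count": 2},
    {"full_text": "second tweet", "favorite_count": 0},
]

buffer = io.StringIO()  # in-memory stand-in for a file on disk
for record in records:
    buffer.write(json.dumps(record) + "\n")  # one JSON object per line

buffer.seek(0)
restored = [json.loads(line) for line in buffer if line.strip()]
print(restored == records)  # True
```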

## Checking the JSONlines File

In [7]:
#Reading from the JSONlines file and checking the features.

with jsonlines.open('out.jsonl') as reader:
    for obj in reader.iter(type=dict):
        print(obj["full_text"])
        print(obj["favorite_count"])
        print(obj["created_at"])
        break

RT @IIITDelhi: We are delighted to share that IIIT-Delhi is ranked 55 by NIRF this year. We have moved up by 11 positions compared to the p…
0
Tue Apr 09 16:45:07 +0000 2019
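
Note that the raw JSON stores `created_at` as a string rather than a datetime. If the string needs to be parsed by hand, a format like the one below should work for the value printed above; this is a small sketch that assumes the default English (C) locale for the day and month names.

```python
from datetime import datetime

raw = "Tue Apr 09 16:45:07 +0000 2019"
parsed = datetime.strptime(raw, "%a %b %d %H:%M:%S %z %Y")
print(parsed.isoformat())  # 2019-04-09T16:45:07+00:00
```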


## Parsing the File
To parse the JSONlines file we use the **pandas** library.<br/>
We form a dataframe to store the required features of the tweets in tabular format:<br/>
● The text of the tweet.<br/>
● Date and time of the tweet.<br/>
● The number of favorites/likes.<br/>
● The number of retweets.<br/>
● The number of images present in the tweet, or None if there are no images.<br/><br/>
We first read the JSONlines file into the dataframe.

In [8]:
import pandas as pd

#reading jsonlines using the lines=True argument
data=pd.read_json('out.jsonl', lines=True)

## Description of the dataframe

In [9]:
data.describe()

| | contributors | coordinates | favorite_count | geo | id | id_str | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | place | possibly_sensitive | quoted_status_id | quoted_status_id_str | retweet_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 0.0 | 0.0 | 342.0 | 0.0 | 342.0 | 342.0 | 53.0 | 53.0 | 69.0 | 69.0 | 0.0 | 149.0 | 18.0 | 18.0 | 342.0 |
| mean | | | 2.339181 | | 1.071922e+18 | 1.071922e+18 | 1.077279e+18 | 1.077279e+18 | 5.710079e+17 | 5.710079e+17 | | 0.0 | 1.05719e+18 | 1.05719e+18 | 57.461988 |
| std | | | 3.964895 | | 3.153947e+16 | 3.153947e+16 | 3.058072e+16 | 3.058072e+16 | 4.937548e+17 | 4.937548e+17 | | 0.0 | 2.934397e+16 | 2.934397e+16 | 471.614949 |
| min | | | 0.0 | | 1.021378e+18 | 1.021378e+18 | 1.024582e+18 | 1.024582e+18 | 5694822.0 | 5694822.0 | | 0.0 | 1.021697e+18 | 1.021697e+18 | 0.0 |
| 25% | | | 0.0 | | 1.035726e+18 | 1.035726e+18 | 1.051777e+18 | 1.051777e+18 | 2227869000.0 | 2227869000.0 | | 0.0 | 1.031845e+18 | 1.031845e+18 | 1.0 |
| 50% | | | 0.0 | | 1.082049e+18 | 1.082049e+18 | 1.088018e+18 | 1.088018e+18 | 9.328474e+17 | 9.328474e+17 | | 0.0 | 1.052272e+18 | 1.052272e+18 | 2.0 |
| 75% | | | 3.75 | | 1.098112e+18 | 1.098112e+18 | 1.102221e+18 | 1.102221e+18 | 1.021356e+18 | 1.021356e+18 | | 0.0 | 1.077278e+18 | 1.077278e+18 | 11.0 |
| max | | | 24.0 | | 1.115657e+18 | 1.115657e+18 | 1.114888e+18 | 1.114888e+18 | 1.114472e+18 | 1.114472e+18 | | 0.0 | 1.111675e+18 | 1.111675e+18 | 8430.0 |


## Looking at the Head of the dataframe

In [10]:
data.head()

| | contributors | coordinates | created_at | display_text_range | entities | extended_entities | favorite_count | favorited | full_text | geo | ... | quoted_status | quoted_status_id | quoted_status_id_str | quoted_status_permalink | retweet_count | retweeted | retweeted_status | source | truncated | user |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | 2019-04-09 16:45:07 | [0, 140] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | | 0 | False | RT @IIITDelhi: We are delighted to share that ... | | ... | | | | | 9 | False | `{'created_at': 'Tue Apr 09 09:03:12 +0000 2019...` | `<a href="http://twitter.com" rel="nofollow">Tw...` | False | `{'id': 1021355762575073281, 'id_str': '1021355...` |
| 1 | | | 2019-04-09 05:04:27 | [0, 136] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | | 0 | False | RT @Harvard: Professor Jelani Nelson founded A... | | ... | | | | | 35 | False | `{'created_at': 'Mon Apr 08 20:10:01 +0000 2019...` | `<a href="http://twitter.com" rel="nofollow">Tw...` | False | `{'id': 1021355762575073281, 'id_str': '1021355...` |
| 2 | | | 2019-04-09 05:04:11 | [0, 140] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | | 0 | False | RT @emnlp2019: For anyone interested in submit... | | ... | | | | | 13 | False | `{'created_at': 'Mon Apr 08 17:35:00 +0000 2019...` | `<a href="http://twitter.com" rel="nofollow">Tw...` | False | `{'id': 1021355762575073281, 'id_str': '1021355...` |
| 3 | | | 2019-04-08 19:38:09 | [0, 140] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | | 0 | False | RT @multimediaeval: Announcing the 2019 MediaE... | | ... | | | | | 15 | False | `{'created_at': 'Mon Mar 18 06:40:38 +0000 2019...` | `<a href="http://twitter.com" rel="nofollow">Tw...` | False | `{'id': 1021355762575073281, 'id_str': '1021355...` |
| 4 | | | 2019-04-08 07:08:12 | [0, 279] | `{'hashtags': [{'text': 'MIDAS', 'indices': [25...` | `{'media': [{'id': 1115149307798224898, 'id_str...` | 16 | False | Many Congratulations to @midasIIITD student, S... | | ... | | | | | 2 | False | | `<a href="http://twitter.com" rel="nofollow">Tw...` | False | `{'id': 1021355762575073281, 'id_str': '1021355...` |


## Dataframe columns/features
As noted, we don't require all of these columns (features), and we need one additional column (feature) that is not present yet: the number of images.

In [11]:
data.columns

Index(['contributors', 'coordinates', 'created_at', 'display_text_range',
       'entities', 'extended_entities', 'favorite_count', 'favorited',
       'full_text', 'geo', 'id', 'id_str', 'in_reply_to_screen_name',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status',
       'lang', 'place', 'possibly_sensitive', 'quoted_status',
       'quoted_status_id', 'quoted_status_id_str', 'quoted_status_permalink',
       'retweet_count', 'retweeted', 'retweeted_status', 'source', 'truncated',
       'user'],
      dtype='object')

## Picking the required columns/features

In [12]:
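#assigning to the current dataframe its own subset containing only the required features
#.copy() gives an independent dataframe, so adding columns later does not raise a SettingWithCopyWarning
cols_of_interest=["full_text","created_at","favorite_count","retweet_count","entities"]
data = data[cols_of_interest].copy()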

## Forming the images column
We form the images column from the entities column. The images column contains the number of images in a tweet if images are present, and NaN (None) otherwise. We then delete the entities column, as we no longer require it.

In [13]:
#list to store the image values
images=[]
#iterating over the entire dataframe and appending the number of images present to the list
for entities in data["entities"]:
    val=entities.get("media")
    if isinstance(val,list):
        images.append(len(val))
    else:
        images.append(None)

#making a new column and assigning it the value of number of images
data["Number of images"]=images

#deleting the entities column as it's no longer needed
del data["entities"]

#printing the Number of images column that we just formed
data["Number of images"]

0      NaN
1      NaN
2      NaN
3      NaN
4      1.0
5      1.0
6      NaN
7      NaN
8      NaN
9      NaN
10     NaN
11     NaN
12     NaN
13     NaN
14     1.0
15     NaN
16     NaN
17     1.0
18     NaN
19     NaN
20     NaN
21     NaN
22     NaN
23     NaN
24     NaN
25     NaN
26     NaN
27     NaN
28     NaN
29     NaN
      ... 
312    NaN
313    NaN
314    1.0
315    NaN
316    1.0
317    NaN
318    NaN
319    NaN
320    NaN
321    1.0
322    NaN
323    NaN
324    1.0
325    NaN
326    NaN
327    NaN
328    NaN
329    NaN
330    NaN
331    NaN
332    NaN
333    NaN
334    NaN
335    NaN
336    NaN
337    NaN
338    NaN
339    1.0
340    NaN
341    NaN
Name: Number of images, Length: 342, dtype: float64
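
The counts above print as `1.0` with dtype `float64` because pandas stores a numeric column with missing values as floats by default. If integer counts are preferred, one option is the nullable integer dtype `"Int64"`. The sketch below shows it on a tiny hand-made dataframe; the `media_count` helper and the `toy` data are hypothetical, and a pandas version with nullable integer support is assumed.

```python
import pandas as pd

def media_count(entities):
    """Number of media items in an entities dict, or None when there are none."""
    media = entities.get("media")
    return len(media) if isinstance(media, list) else None

# Hand-made stand-in for the "entities" column
toy = pd.DataFrame({"entities": [
    {"hashtags": []},
    {"media": [{"type": "photo"}]},
    {"media": [{"type": "photo"}, {"type": "photo"}]},
]})

# "Int64" (capital I) keeps whole numbers and marks missing values as <NA>
counts = pd.Series([media_count(e) for e in toy["entities"]],
                   index=toy.index, dtype="Int64")
print(counts.dtype)  # Int64
```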

## Renaming columns and displaying Head

In [14]:
#renaming the columns to more convenient and easy-to-understand names
data.rename(columns={'full_text':'Text','created_at':'Date and Time','favorite_count':'Number of favorites',
                    'retweet_count':'Number of retweets'}, inplace=True)
data.head()

| | Text | Date and Time | Number of favorites | Number of retweets | Number of images |
|---|---|---|---|---|---|
| 0 | RT @IIITDelhi: We are delighted to share that ... | 2019-04-09 16:45:07 | 0 | 9 | |
| 1 | RT @Harvard: Professor Jelani Nelson founded A... | 2019-04-09 05:04:27 | 0 | 35 | |
| 2 | RT @emnlp2019: For anyone interested in submit... | 2019-04-09 05:04:11 | 0 | 13 | |
| 3 | RT @multimediaeval: Announcing the 2019 MediaE... | 2019-04-08 19:38:09 | 0 | 15 | |
| 4 | Many Congratulations to @midasIIITD student, S... | 2019-04-08 07:08:12 | 16 | 2 | 1.0 |


# The Final Required Table

In [15]:
#printing the final table
data

| | Text | Date and Time | Number of favorites | Number of retweets | Number of images |
|---|---|---|---|---|---|
| 0 | RT @IIITDelhi: We are delighted to share that ... | 2019-04-09 16:45:07 | 0 | 9 | |
| 1 | RT @Harvard: Professor Jelani Nelson founded A... | 2019-04-09 05:04:27 | 0 | 35 | |
| 2 | RT @emnlp2019: For anyone interested in submit... | 2019-04-09 05:04:11 | 0 | 13 | |
| 3 | RT @multimediaeval: Announcing the 2019 MediaE... | 2019-04-08 19:38:09 | 0 | 15 | |
| 4 | Many Congratulations to @midasIIITD student, S... | 2019-04-08 07:08:12 | 16 | 2 | 1.0 |
| 5 | @midasIIITD thanks all students who have appea... | 2019-04-08 03:27:42 | 5 | 0 | 1.0 |
| 6 | @himanchalchandr Meanwhile, complete CV/NLP ta... | 2019-04-07 14:17:29 | 0 | 0 | |
| 7 | @sayangdipto123 Submit as per the guideline ag... | 2019-04-07 14:17:09 | 0 | 0 | |
| 8 | We request all students whose interview are sc... | 2019-04-07 11:43:24 | 1 | 1 | |
| 9 | Other queries: "none of the Tweeter Apis give ... | 2019-04-07 06:55:19 | 5 | 2 | |
