## TWITTER SCRAPING TASK 
### KAUSTAV CHANDA  &nbsp;&nbsp;&nbsp;  https://github.com/Kaustav97 


---



Install jsonlines and import the required libraries.
API keys are required for the Tweepy implementation below. BeautifulSoup can operate without them, but because it only parses the page as initially served, it cannot fetch more than about 20 tweets.

To work around this, I had two options:

- Use a supporting library or tool alongside BeautifulSoup to keep gathering data from pages with infinite scroll (like Twitter), for example a cron job.

- Use the Tweepy API after registering my app in the Twitter Developers portal.

I have opted to go with the second solution, owing to its simplicity and ease of implementation.


In [1]:
pip install jsonlines

Collecting jsonlines
  Downloading https://files.pythonhosted.org/packages/4f/9a/ab96291470e305504aa4b7a2e0ec132e930da89eb3ca7a82fbe03167c131/jsonlines-1.2.0-py2.py3-none-any.whl
Installing collected packages: jsonlines
Successfully installed jsonlines-1.2.0


In [0]:
import tweepy
from tweepy import OAuthHandler
import requests
from bs4 import BeautifulSoup
import re
import jsonlines

 
consumer_key = 'xxxxxxxxxxxxxxxxxxxxxxxxx'
consumer_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxx'
access_token = 'xxxxxxxxxxxxxxxxxxxxxxxxx'
access_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxx'
 
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
 
api = tweepy.API(auth)
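
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the four values are exported as environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are hypothetical and not something Tweepy requires:

```python
import os

# Hypothetical environment variable names; choose whatever suits your setup.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_credentials(env=os.environ):
    """Return the four credentials as a tuple, or raise if any are missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in REQUIRED_VARS)
```

The returned tuple can be unpacked into `consumer_key, consumer_secret, access_token, access_secret` before the `OAuthHandler` call.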

**Approach 1**  


Build the URL from the given Twitter handle and use the requests library to send a GET request to the MIDAS Twitter page.

On closer inspection of the MIDAS Twitter page, all tweet content appears in
```
<div class="js-tweet-text-container">
```

elements, so I used a CSS selector to fetch all of them and then parsed them for tweet content.

In [0]:
url = 'https://twitter.com/midasIIITD'
data = requests.get(url)

In [16]:
html = BeautifulSoup(data.text, 'html.parser')
timeline = html.select('div.js-tweet-text-container')

# SAMPLE OF OBTAINED OUTPUT
print(timeline[:5])  

[<div class="js-tweet-text-container">
<p class="TweetTextSize TweetTextSize--normal js-tweet-text tweet-text" data-aria-label-part="0" lang="en">IEEE BigMM 2019 - Call for Workshop Proposals. 

Contact <a class="twitter-atreply pretty-link js-nav" data-mentioned-user-id="1021355762575073281" dir="ltr" href="/midasIIITD"><s>@</s><b>midasIIITD</b></a> <a class="twitter-atreply pretty-link js-nav" data-mentioned-user-id="932847441316999169" dir="ltr" href="/RatnRajiv"><s>@</s><b>RatnRajiv</b></a> <a class="twitter-atreply pretty-link js-nav" data-mentioned-user-id="73426884" dir="ltr" href="/debanjanbhucs"><s>@</s><b>debanjanbhucs</b></a> if you have any query. 

Deadlines: 
- Proposals due: April 1, 2019
- Acceptance notification: April 10, 2019

Further information:
- Webpage: <a class="twitter-timeline-link" data-expanded-url="http://bigmm2019.org/index.php/calls-for-submission/workshops" dir="ltr" href="https://t.co/5XX5Wyxp5T" rel="nofollow noopener" target="_blank" title="http://bi

Looking at the data above, all tweet text sits in NavigableString nodes within the HTML returned by the requests library.

I used a recursive approach to append every NavigableString to a list and then print the joined list, which flattens the nested hierarchy in which text and URLs are presented. I did not go on to extract the other information required by the task, because this approach yields only a limited number of tweets. Please see **Approach 2** below for the final solution.

In [0]:
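from bs4 import NavigableString

content = []

def resolveTag(tag):
  # Leaf text node: record its text with surrounding whitespace removed
  if isinstance(tag, NavigableString):
    content.append(tag.strip())
  else:
    # Element: descend into each child
    for child in tag.children:
      resolveTag(child)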

In [17]:
content=[]
for tag in timeline:
  resolveTag(tag)  
print(''.join(content) )

IEEE BigMM 2019 - Call for Workshop Proposals. 

Contact@midasIIITD@RatnRajiv@debanjanbhucsif you have any query. 

Deadlines: 
- Proposals due: April 1, 2019
- Acceptance notification: April 10, 2019

Further information:
- Webpage:http://bigmm2019.org/index.php/calls-for-submission/workshops…#IEEE#BigMM19Hurry Up!
6 Days left for Abstract Submission in@ACMMM1945 Days left for Regular Paper Submission in@IEEEBigMM19.

Hectic time ahead or Multimedia Researchers :)Congratulations@midasIIITDstudents Simra Shahid@Simcyyand Nilay Shrivastava@NilayShrion getting selected for a research internship at Adobe in this summer.#MIDAS#Achievment#Research#Summer#Internshippic.twitter.com/WdF663EB5yThe last date for submitting a solution for the@midasIIITDinternship task is 26th March midnight. We will not accept solutions submitted after the deadline. 
Thus, if you have not submitted your solution yet then kindly do so before the deadline.#Summer#Research#Internship@IIITDelhiinvites application fro

In [18]:
len(timeline)

20
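
The joined output above runs words together (for example `Contact@midasIIITD`), because each string is stripped and then concatenated with no separator. A minimal sketch of the effect and one possible fix, using a hand-written list `fragments` as a stand-in for the NavigableString values of one tweet:

```python
# Hand-written stand-in for the text nodes of one tweet (not real scraped data).
fragments = ["Contact ", "@", "midasIIITD", " if you have any query."]

# Stripping each piece before joining drops the spaces between words.
stripped = "".join(piece.strip() for piece in fragments)

# Joining first and collapsing whitespace runs afterwards keeps word boundaries.
normalized = " ".join("".join(fragments).split())

print(stripped)    # Contact@midasIIITDif you have any query.
print(normalized)  # Contact @midasIIITD if you have any query.
```

Note that collapsing whitespace this way also folds line breaks into single spaces, which may or may not be wanted for multi-line tweets.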

**Approach 2**  

Here I use Tweepy to retrieve the timeline for the final solution and extract the remaining required information for each tweet: likes, retweets, images, and so on.

The final output can be viewed in the **output.jsonl** file created by the script below.
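
For readers unfamiliar with the format, a JSON Lines file holds one JSON object per line. A minimal sketch of the round trip using only the standard library `json` module, with made-up records and a temporary file name (`example.jsonl`) chosen purely for illustration:

```python
import json
import os
import tempfile

# Made-up records standing in for tweet dictionaries.
records = [
    {"full_text": "first tweet", "favorite_count": 1},
    {"full_text": "second tweet", "favorite_count": 0},
]

path = os.path.join(tempfile.mkdtemp(), "example.jsonl")

# Write: one JSON document per line.
with open(path, "w", encoding="utf-8") as fh:
    for record in records:
        fh.write(json.dumps(record) + "\n")

# Read: parse each non-empty line independently.
with open(path, encoding="utf-8") as fh:
    loaded = [json.loads(line) for line in fh if line.strip()]
```

The jsonlines package wraps this same pattern behind its `open`, `write`, and iteration interface.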

In [0]:
with jsonlines.open('output.jsonl', mode='w') as writer:  
  for status in tweepy.Cursor(api.user_timeline, screen_name='@midasIIITD', tweet_mode="extended").items():    
    writer.write(status._json)        
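
When reading these records back, the image count deserves some care. The sketch below assumes the v1.1 tweet JSON layout, in which `entities.media` carries only the first attachment while `extended_entities.media` lists all of them, each with a `type` field such as `photo`. The helper `count_photos` and the sample dictionary are hypothetical:

```python
def count_photos(tweet):
    """Count photo attachments in a tweet dictionary (assumed v1.1 layout)."""
    # Prefer 'extended_entities', which is assumed to list every attachment;
    # fall back to 'entities' when it is absent.
    source = tweet.get("extended_entities") or tweet.get("entities") or {}
    media = source.get("media", [])
    return sum(1 for item in media if item.get("type") == "photo")


# Hand-written example shaped like a tweet with two photos.
sample = {
    "entities": {
        "media": [{"type": "photo", "media_url": "http://example.com/a.jpg"}]
    },
    "extended_entities": {
        "media": [
            {"type": "photo", "media_url": "http://example.com/a.jpg"},
            {"type": "photo", "media_url": "http://example.com/b.jpg"},
        ]
    },
}

print(count_photos(sample))            # 2
print(count_photos({"entities": {}}))  # 0
```

Filtering on `type` also keeps videos and animated GIFs out of the image count.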

In [23]:
with jsonlines.open('output.jsonl') as reader:
    for obj in reader:
        print(obj['full_text'])
        print("DATE : ", obj['created_at'])
        print("FAVORITES: ", obj['favorite_count'])
        print("RETWEET COUNT: ", obj['retweet_count'])
        # 'media' is missing from 'entities' when the tweet has no attachments
        media = obj['entities'].get('media')
        if media is None:
            print("IMAGES: NONE")
        else:
            # Count only media entries that carry a non-empty 'media_url'
            num_img = sum(1 for med in media if med.get('media_url'))
            print("IMAGES: ", num_img)
        # Separator, indicating next tweet item in historical timeline
        print("\n################################\n")

@IEEEBigMM19 @ACMMM19 and 6 days left for workshop proposal in @IEEEBigMM19.

Contact @cchatto for any query.
DATE :  Tue Mar 26 05:54:49 +0000 2019
FAVORITES:  1
RETWEET COUNT:  0
IMAGES: NONE

################################

RT @IEEEBigMM19: Hurry Up!
6 Days left for Abstract Submission in @ACMMM19 
45 Days left for Regular Paper Submission in @IEEEBigMM19 .

He…
DATE :  Tue Mar 26 05:50:10 +0000 2019
FAVORITES:  0
RETWEET COUNT:  3
IMAGES: NONE

################################

Congratulations @midasIIITD students Simra Shahid @Simcyy and Nilay Shrivastava @NilayShri on getting selected for a research internship at Adobe in this summer. 

#MIDAS #Achievment #Research #Summer #Internship https://t.co/WdF663EB5y
DATE :  Mon Mar 25 13:01:57 +0000 2019
FAVORITES:  14
RETWEET COUNT:  1
IMAGES:  1

################################

The last date for submitting a solution for the @midasIIITD internship task is 26th March midnight. We will not accept solutions submitted after the deadlin