# Twitter Thesis Project: collecting tweets 

## Introduction
This is a Jupyter notebook: a kind of program you can use to write and run your own code. In this notebook we will use the Python programming language and the Twitter API ('application programming interface') to automatically collect and analyse tweets. 

This notebook consists of cells, i.e., snippets of either code or normal text. We use the text cells (like this one) to explain what is going on. You can edit a cell by clicking on it. After you have made your changes, either click 'Run' above or press Shift+Enter to execute the cell. For a text cell, this just displays the text in the correct format (try it with this cell!); for a code cell, the code is executed. 

The next cell is a code cell in which we ask the computer to print a simple sentence. Try changing this sentence and then executing the code. 


In [11]:
print('hello world')


hello world


**Important note: to use this program, you have to execute the code in all the cells, in the correct order.**

If you want to learn more about using Jupyter notebooks, look for a tutorial online (e.g., https://www.dataquest.io/blog/jupyter-notebook-tutorial/). Most questions you have or problems you encounter can also be solved with a simple Google search using the right keywords.

**However, if you have other questions or any problems that you really don't know how to solve, please contact us on Slack and we'll be happy to help or to schedule a meeting.**

## Connecting to the Twitter API

In [12]:
from twython import Twython
#if this results in an error, you need to install twython first. See guideline document.
print('import successful')

import successful


First, we need to connect to Twitter using the correct keys. There is a limit on how many calls ('questions we can ask') you can make every 15 minutes (among other reasons, this keeps the Twitter servers from being overloaded), and with each call you can collect up to 100 tweets. Running the code below 'logs you in' to the Twitter application. If all goes well, the output shows how many calls we can still perform in the current 15-minute window, e.g.:
{'/search/tweets': {'limit': 450, 'remaining': 443, 'reset': 1568288620}}

In [13]:
APP_KEY = 'yN3VbAb8QZdzD5GPkVuOHLfMN'         #API key
APP_SECRET = 'YRdyk39bx9iRPQBhK2Nh1fT32JdGYTrEhqxcEbcpLMIxbT7wKh'   #API secret key
twitter = Twython(APP_KEY, APP_SECRET, oauth_version=2)
ACCESS_TOKEN = twitter.obtain_access_token()

twitter = Twython(APP_KEY, access_token=ACCESS_TOKEN)
twitter.get_application_rate_limit_status()['resources']['search']



{'/search/tweets': {'limit': 450, 'remaining': 450, 'reset': 1576587984}}
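To make the rate-limit output easier to read, here is a minimal sketch that turns such a dictionary into a short summary. It works on the example values from the text rather than on a live API call; the helper name `summarize_rate_limit` and the constant `TWEETS_PER_CALL` are our own and not part of Twython. We assume that `reset` is a Unix timestamp (seconds since 1970, in UTC).

```python
from datetime import datetime, timezone

# Example response copied from the explanation above (not a live API call).
rate_limit = {'/search/tweets': {'limit': 450, 'remaining': 443, 'reset': 1568288620}}

TWEETS_PER_CALL = 100  # upper bound per call, as stated in the text


def summarize_rate_limit(status, endpoint='/search/tweets'):
    """Return the remaining calls, the tweets they could fetch, and the reset time (UTC)."""
    info = status[endpoint]
    # assumption: 'reset' is a Unix timestamp in seconds
    reset_time = datetime.fromtimestamp(info['reset'], tz=timezone.utc)
    return {
        'remaining_calls': info['remaining'],
        'max_tweets': info['remaining'] * TWEETS_PER_CALL,
        'resets_at': reset_time,
    }


summary = summarize_rate_limit(rate_limit)
print(summary['remaining_calls'], 'calls left, good for at most', summary['max_tweets'], 'tweets')
print('the window resets at', summary['resets_at'].isoformat())
```

In the notebook itself, the same function could be given the dictionary returned by the last line of the cell above.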

## Streaming Tweets

In this part of the code, we will start 'streaming' tweets: collecting newly created tweets that match certain criteria. These tweets are then saved in a CSV file, a file format that you can open with Excel, Numbers, etc. 

Every time you want to start streaming, run the code in the cells below. It might take a while before the first tweet is discovered, so nothing is wrong if no tweet shows up for a while. If a lot of tweets are streamed (e.g., when you use a keyword like 'Trump' or 'Brexit'), make sure to halt the program in time.

New tweets will automatically be added to a file with the filename specified below. You can change the filename (but do keep the extension '.csv'). This file is created as soon as the first tweet that matches the criteria is discovered, and tweets are added to the same file even if you restarted the application in between. The file is generated in the same folder as these notebooks. 
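The 'create once, then keep appending' behaviour described above can be tried out in isolation. The sketch below is only an illustration: the helper `append_row`, the demo filename and the two demo rows are made up, and the file is written to a temporary folder so that no real dataset is touched.

```python
import csv
import os
import tempfile


def append_row(path, row, header, delimiter=';'):
    """Append one row to a CSV file; write the header only if the file is new."""
    is_new_file = not os.path.isfile(path)
    # 'a' = append; newline='' is what the csv module recommends for files it writes
    with open(path, 'a', encoding='utf-8', newline='') as f:
        writer = csv.writer(f, delimiter=delimiter)
        if is_new_file:
            writer.writerow(header)
        writer.writerow(row)


# Demo in a temporary folder (hypothetical file name and contents).
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_tweets.csv')
header = ['Date', 'Text']

append_row(demo_path, ['Tue Dec 17 12:51:29 +0000 2019', 'first run'], header)
append_row(demo_path, ['Tue Dec 17 12:51:37 +0000 2019', 'second run'], header)

# Read the file back: one header row followed by the two appended rows.
with open(demo_path, encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f, delimiter=';'))
print(rows)
```

Because the header is only written when the file does not exist yet, calling the function again after a restart simply adds more rows below the existing ones.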


In [14]:
from twython import TwythonStreamer
import csv
import os.path

filename = 'collected_tweets.csv' #change the filename here 
delimiter = ';' #change this to ';' or ',' if your software (like excel, numbers...) doesn't show your data in columns 


In the code cell below, we first specify what will happen if we find a tweet that matches our criteria. Currently, it will tell us when a new tweet is collected. If it's not a retweet, its date, place and text will be written to file.

There's a lot more information you can access for each tweet. If you want to save more than the date, place and text (e.g., the name of the user) please go to https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object and consult the section 'Tweet Data Dictionary'. List all the properties you want to save to file, and contact us so we can update this part of the code.  
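To give an idea of what saving extra properties could look like, here is a small sketch. The helper `extract_extra_fields` and the sample tweet are invented for illustration; the keys `user.name`, `user.location` and `lang` are taken from the Tweet object as we understand it, so please check them against the data dictionary before relying on them.

```python
def extract_extra_fields(data):
    """Pick a few additional properties from a tweet dictionary (hypothetical helper)."""
    user = data.get('user') or {}
    return [
        user.get('name') or '',      # display name of the author
        user.get('location') or '',  # free-text profile location, often empty
        data.get('lang') or '',      # language code assigned by Twitter
    ]


# Invented sample, shaped like the dictionaries the streamer receives.
sample_tweet = {
    'created_at': 'Tue Dec 17 12:51:29 +0000 2019',
    'text': 'We are hiring!',
    'lang': 'en',
    'user': {'screen_name': 'example_user', 'name': 'Example User', 'location': None},
}

extra_columns = extract_extra_fields(sample_tweet)
print(extra_columns)
```

Each extra value would also need its own column name in the header row, in the same position.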

In [15]:
class MyStreamer(TwythonStreamer):
    def on_success(self, data):
        print("-------new tweet collected!")
        
        if 'retweeted_status' in data:
            print("but it's a retweet, so we ignore it...")
        else:
            
            #first we print selected information about the tweet so you can follow what's happening
            if data.get('extended_tweet') is not None and data['extended_tweet'].get('full_text') is not None:
                print([data['created_at'],data['place'], data['extended_tweet']['full_text']])
            else:
                print([data['created_at'],data['place'], data['text']])
             
            
            #below we process the information and write it to file
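            file_exists = os.path.isfile(filename)
            #'a' = append: newly collected tweets are added to your dataset; newline='' avoids blank lines between rows on Windows
            with open(filename, 'a', encoding='utf-8', newline='') as f:
                writer = csv.writer(f, delimiter=delimiter) #use the delimiter chosen in the cell above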
                
                ##write a header to the file if it doesn't exist already
                if not file_exists: #if it's a new file, we should create a header 
                    writer.writerow(['Date','Place_Name','Place_Bounding_Box','Text', 'Tweet_Id', 'IsReplyTo_ID','IsReplyTo_Text', 'Hashtags','Urls','Media','User_Screen_Name', 'User_Id', 'User_Followers_Count', 'Checked_Status_At','Retweets_Count','Favourites_Count']) #this is the document header
                
    
                ##format all the results in a new row to append to the file
                
                #add a value for the date and time of creation of the tweet
                row_to_write = [data['created_at']]
                
                #process location information, if present
                if data.get('place') is None:
                    row_to_write.append('')
                    row_to_write.append('')
                else:
                    row_to_write.append(data['place']['full_name'])
                    row_to_write.append(data['place']['bounding_box'])
                
                #where the text is stored depends on the type of tweet. We use the full text if available ('extended_tweet'), otherwise we use the standard text
                if data.get('extended_tweet') is not None and data['extended_tweet'].get('full_text') is not None:
                    row_to_write.append(data['extended_tweet']['full_text'])
                else:
                    row_to_write.append(data['text'])
                      
                #then we can add a column for the tweet ID
                row_to_write.append(data['id'])   
                
                #if the tweet is a reply, store the original tweet ID
                row_to_write.append(data['in_reply_to_status_id'])  
            
                if data['in_reply_to_status_id']:  #if the tweet is a reply, the original text can be fetched later
                    row_to_write.append('original text to be fetched')
                else: #if the tweet is not a reply, this column can be left empty
                    row_to_write.append('')
                                
                #then we can process the hashtags (joined into one string, separated by ', '):
                hashtags_as_strings = ', '.join(x['text'] for x in data['entities']['hashtags'])
                row_to_write.append(hashtags_as_strings)
                                
                #then we can process the urls in the same way:
                urls_as_strings = ', '.join(x['url'] for x in data['entities']['urls'])
                row_to_write.append(urls_as_strings)
                                
                #presence of media 
                if data['entities'].get('media') is not None:
                    row_to_write.append('yes')
                else:
                    row_to_write.append('no')
                
                
                #user information      
                row_to_write.append(data['user']['screen_name'])
                row_to_write.append(data['user']['id']) 
                row_to_write.append(data['user']['followers_count'])
                                
                
                #empty values that will later be replaced with retweet count etc. 
                row_to_write.append('')
                row_to_write.append('')
                row_to_write.append('')
         
            
                #finally, write everything to file             
                writer.writerow(row_to_write)
                                
    def on_error(self, status_code, data):
        print(data)
        print(status_code)
        # self.disconnect()

In the next cell, we set the access tokens that are needed to connect to the Twitter stream.

In [16]:
OAUTH_TOKEN = '1100028871259377670-qtcMTW2ereJ3A0KIvFguWu0ZmW0n8k'
OAUTH_TOKEN_SECRET = 'wnPYmWOds9xD1i1CM9K8gfzMNZ26QoBmXW4JSSA81faRF'

When you execute the next code cell, the streaming will start. This is also the place where you can edit the criteria you want to 'filter' the stream on. There are different types of filters you can use (also at the same time):



**follow** (optional): A comma-separated list of user IDs, indicating the users to return statuses for in the stream. 

**track** (optional): Keywords to track. Phrases of keywords are specified by a comma-separated list. 

**locations** (optional): Specifies a set of bounding boxes to track. 

See https://developer.twitter.com/en/docs/tweets/filter-realtime/api-reference/post-statuses-filter
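As an illustration of how these filter values are written, the sketch below builds the strings for `track` and `locations` without starting a stream. The helpers `build_track` and `build_locations` and the example values are our own. It relies on our reading of the Twitter documentation: within `track`, commas act as OR and spaces inside a phrase act as AND; a bounding box is given as longitude,latitude pairs with the south-west corner first; and different filter types are combined with OR. Please verify this against the page linked above.

```python
def build_track(phrases):
    """Join phrases with commas (comma = OR, space inside a phrase = AND)."""
    return ','.join(p.strip() for p in phrases if p.strip())


def build_locations(boxes):
    """Each box is (west_lon, south_lat, east_lon, north_lat): south-west corner first."""
    return ','.join(str(value) for box in boxes for value in box)


filter_args = {
    # matches tweets containing 'job', OR 'hiring', OR both 'data' and 'scientist'
    'track': build_track(['job', 'hiring', 'data scientist']),
    # example bounding box (roughly the San Francisco area)
    'locations': build_locations([(-122.75, 36.8, -121.75, 37.8)]),
}
print(filter_args)
```

In the notebook, such a dictionary could then be passed on as `stream.statuses.filter(**filter_args)`.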


In [17]:
stream = MyStreamer(APP_KEY, APP_SECRET,
                    OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
stream.statuses.filter(track='job,hiring', tweet_mode='extended')

-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
['Tue Dec 17 12:51:29 +0000 2019', None, "@AndyWoodturner @jblairreid I've heard talk of an inside job. They were daft, they could have just creamed off a million quids worth, and she'd barely have noticed.\n\nI've tried, but failed, to find any sympathy."]
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
['Tue Dec 17 12:51:29 +0000 2019', None, '@idolfess Baekhyun is the time of you getting that job actually sucks haha is the time of time to you haha is the time of time to you haha is the time of time to you haha is the time of time to you haha is the time of time to you haha is \n\nBodo amat anjir autotext gua begini wkwkwkwk']
-------new tweet col

-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
['Tue Dec 17 12:51:37 +0000 2019', None, '@LivePDNation So sweet to watch this. You can tell he is as loved as much as he gives love. Love it! Great job officer. Thank you for all the good you do. https://t.co/ogUv8nQUvb']
-------new tweet collected!
['Tue Dec 17 12:51:37 +0000 2019', None, 'Care International Zimbabwe is looking for a Community Visioning Lead. The ideal candidate must have a\xa0Bachelor’s degree in the social sciences.\nhttps://t.co/YnMaaqrjSx\n\n#jobszimbabwe #jobseekers https://t.co/MEwPipmZUP']
-------new tweet collected!
['Tue Dec 17 12:51:38 +0000 2019', None, 'Girl I be hating that shit I be so drained \U0001f974']
-------new tweet collected!
['Tue Dec 17 12:51:38 +0000 2

-------new tweet collected!
['Tue Dec 17 12:51:46 +0000 2019', None, '@SchexniderJ That’s Cool Mr. Karl. Those Poor Little Kids Doesn’t understand what’s going on when you arrive. That’s one of the hardest part of the Job.']
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
['Tue Dec 17 12:51:46 +0000 2019', None, '英語で男友達を作るための男のための単語例文集\nblow job→フ◯ラ\n例文) Hey dude, I got a blow job from this girl! (Tinderのプロフィール写真を見せながら)\n訳) 今日この子にフ◯ラしてもらったんだぜ\n#英会話']
-------new tweet collected!
['Tue Dec 17 12:51:46 +0000 2019', None, '#Hiring: #LVN/ LPN, Licensed Vocational Nurse/ Licensed Practical Nurse-Nursing Relief\n#BSN #LPN #RN #Texarkana  #Nurse \nApply Here➣  https://t.co/Y0yjyi9D2q']
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
['Tue Dec 17 12:51:47 +0000 2019', None, '@MoyoZuva Kwete. Just coz his boss offered me his job. Lolol']
-------new tweet collected!
but it's a retweet, so we ignore it...

KeyboardInterrupt: 

## Check status: look up the favorites and retweets counts after given time period (and, if a reply, add the text of the original tweet)

The goal here is to determine, for tweets we collected in the past, how many times they have been retweeted and how many times they have been favorited. For a fair comparison between tweets, this should always be done in more or less the same time window after the tweet was created; here we choose 10 to 14 days. This means that once you have collected tweets, you should run this code at least once between 10 and 14 days later. There are 3 possible results: 
- the tweet was created between 10 and 14 days ago: great! We look up its counts. 
- the tweet was created less than 10 days ago: the time window in which you should run this code again is added to the file.
- the tweet was created more than 14 days ago and hasn't been checked: you missed the window and the tweet has expired. These tweets should be excluded from your analysis (please contact us so we can help). 

In addition to this, this code will add the original tweet's text to the dataset if the tweet was a reply (independent of the number of days).

Note that the updated dataset will be stored in a new file. You can choose the filename below. **Do not give it the same name as your original file!**
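The three possible results can be sketched as a small function. This is a simplified illustration of the time-window check, not the code used further down: the function name `classify_tweet` and the labels it returns are our own, and the dates are fixed so the outcome does not depend on when you run it.

```python
from datetime import datetime, timedelta

TWITTER_DATE_FORMAT = '%a %b %d %H:%M:%S %z %Y'  # format of 'created_at' in the dataset


def classify_tweet(created_at, now, min_days_back=10, max_days_back=14):
    """Return 'in window', 'too early' or 'expired' for a tweet created at `created_at`."""
    age = now - created_at
    if age > timedelta(days=max_days_back):
        return 'expired'      # more than 14 days old: the window was missed
    if age < timedelta(days=min_days_back):
        return 'too early'    # less than 10 days old: check again later
    return 'in window'        # between 10 and 14 days old: look up the counts


# Parse a date as it appears in the CSV file and drop the timezone information.
created = datetime.strptime('Tue Dec 17 12:51:29 +0000 2019', TWITTER_DATE_FORMAT)
created = created.replace(tzinfo=None)

print(classify_tweet(created, now=datetime(2019, 12, 20, 12, 0)))  # about 3 days old
print(classify_tweet(created, now=datetime(2019, 12, 29, 12, 0)))  # about 12 days old
print(classify_tweet(created, now=datetime(2020, 1, 5, 12, 0)))    # about 19 days old
```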

In [121]:
from datetime import datetime, timedelta

In the code below, we will first determine the correct time window. 

In [137]:
#the file of collected tweets that you want to get the counts for:
filename = 'collected_tweets.csv' #change the filename here 

#the file with the updated dataset
temp_copy_file = 'collected_tweets_updated_thuNov28.csv' #change the filename here (different from original!)


#the time period we consider (from 'max_days_back' days ago to 'min_days_back' days ago  )
max_days_back = 14 #days ## do not change this
min_days_back = 10 #days ## do not change this

current_date = datetime.now()
max_date_back = current_date - timedelta(days=max_days_back)
min_date_back = current_date - timedelta(days=min_days_back)

print('we will look up the tweets created between: ')
print(max_date_back)
print('and ')
print(min_date_back)



we will look up the tweets created between: 
2019-11-14 12:14:00.425579
and 
2019-11-18 12:14:00.425579


In [None]:
col_Date = 0
col_ID = 4
col_isReply_ID = 5
col_isReply_Text = 6 
col_Checked_Status_At = 13
col_Retweets_Count= 14
col_Favourites_Count = 15

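with open(filename, mode='r', encoding='utf-8', newline='') as csv_file:

    csv_reader = csv.reader(csv_file, delimiter=delimiter)
    line_count = 0
    
    with open(temp_copy_file, 'w', encoding='utf-8', newline='') as temp_csv_copy:
        
        #use the same delimiter as the original file, so both files open the same way
        wtr = csv.writer(temp_csv_copy, delimiter=delimiter)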

        for tweet in csv_reader:
            if line_count > 0: #skip the header

                #get the time the tweet was created
                time_of_creation = datetime.strptime(tweet[col_Date], "%a %b %d %H:%M:%S %z %Y")
                time_of_creation = time_of_creation.replace(tzinfo=None)
                
                #did we already get the counts?
                if not tweet[col_Retweets_Count]: #we didn't check before
                    #if this time is within our bounds, fetch the original tweet from Twitter and check its counts
                    if time_of_creation > max_date_back and time_of_creation < min_date_back:
                        ID = tweet[col_ID]
                        fetched_tweet = twitter.show_status(id=ID)
                        tweet[col_Retweets_Count] = fetched_tweet['retweet_count']
                        tweet[col_Favourites_Count] = fetched_tweet['favorite_count']
                        tweet[col_Checked_Status_At] =  current_date
                    else:
                        if time_of_creation < max_date_back: #the tweet wasn't checked within the bounds and has now expired
                            tweet[col_Checked_Status_At] = 'expired'
                        else:
                            min_date = time_of_creation + timedelta(days=min_days_back)
                            min_date = min_date.strftime("%d/%m/%Y, %H:%M:%S")
                            max_date = time_of_creation + timedelta(days=max_days_back)
                            max_date = max_date.strftime("%d/%m/%Y, %H:%M:%S") 
                            tweet[col_Checked_Status_At] = 'to be checked between ' + min_date + ' and ' + max_date
                            
                #is the tweet a reply, then add the original tweet's text
                if tweet[col_isReply_ID]:
                    fetched_tweet = twitter.show_status(id=tweet[col_isReply_ID])
                    tweet[col_isReply_Text] = fetched_tweet['text']

            line_count += 1
            wtr.writerow(tweet)

#no explicit close() calls are needed: the 'with' blocks above close both files automatically
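One thing to keep in mind: if a tweet has been deleted or made private in the meantime, looking it up will most likely fail with an error, which would stop the loop halfway through the file. The sketch below shows one possible way to guard a lookup. The helper `fetch_counts` is hypothetical, and `fake_show_status` is only a stand-in for `twitter.show_status` so that the example runs without a connection. In the notebook, catching `twython.TwythonError` instead of the general `Exception` would probably be the narrower choice.

```python
def fetch_counts(fetch_status, tweet_id):
    """Return (retweet_count, favorite_count), or None if the tweet cannot be fetched."""
    try:
        fetched = fetch_status(id=tweet_id)
    except Exception as error:  # assumption: any failure means the tweet is unavailable
        print('could not fetch tweet', tweet_id, '-', error)
        return None
    return fetched['retweet_count'], fetched['favorite_count']


def fake_show_status(id):
    """Stand-in for twitter.show_status with one known tweet (invented data)."""
    known = {'1': {'retweet_count': 3, 'favorite_count': 7}}
    if id not in known:
        raise LookupError('No status found with that ID.')
    return known[id]


print(fetch_counts(fake_show_status, '1'))  # the known tweet
print(fetch_counts(fake_show_status, '2'))  # an unknown tweet is reported and skipped
```

A row for which `None` comes back could then be marked in the 'Checked_Status_At' column instead of stopping the whole run.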

    