# Twitter Thesis Project: collecting tweets 

## Introduction
This is a 'Jupyter notebook': an interactive document in which you can write and run code step by step. In this notebook we will use the Python programming language and the Twitter API ('application programming interface') to automatically collect and analyse tweets. 

This notebook contains cells, i.e., snippets of either code or normal text. We use the text cells (like the current one) to explain what is going on. You can edit a cell by clicking on it. After you have made your changes, you can either click 'run' above, or press shift+enter, to execute what is written. In the case of a text cell, this will just display the text in the correct format (try it with this cell!); in the case of a code cell, the code will be executed. 

The next cell will be a code cell where we ask the computer to print a simple sentence for us. Try to change this sentence and then execute the code. 


In [5]:
print('hello world')

hello world


**Important note: to use this program, you have to execute all the code in all the cells in the correct order.**

If you want to learn more about using Jupyter notebooks, look for a tutorial online (e.g., https://www.dataquest.io/blog/jupyter-notebook-tutorial/). Most of the questions you have or the problems you encounter can also be solved with a simple Google search using the correct keywords.

**However, if you have other questions or any problems that you really don't know how to solve, please contact us on Teams and we'll be happy to help or to schedule a meeting.**

## Connecting to the Twitter API

In [1]:
from twython import Twython
#if this results in an error, you need to install twython first. See guideline document.
print('import successful')

import successful


First, we need to connect to Twitter using the correct passwords/keys. There is a limit on how many tweets you can collect every 15 minutes (this makes sure the Twitter servers are not overloaded, amongst other reasons). Running the code below 'logs you in' to the Twitter application. If all goes well, the output should show how many calls ('questions we can ask') we can still make in the current 15-minute window. With each call, you can collect up to 100 tweets. 
E.g.: {'/search/tweets': {'limit': 450, 'remaining': 443, 'reset': 1568288620}}

In [2]:
APP_KEY = 'yN3VbAb8QZdzD5GPkVuOHLfMN'         #API key
APP_SECRET = 'YRdyk39bx9iRPQBhK2Nh1fT32JdGYTrEhqxcEbcpLMIxbT7wKh'   #API secret key
twitter = Twython(APP_KEY, APP_SECRET, oauth_version=2)
ACCESS_TOKEN = twitter.obtain_access_token()

twitter = Twython(APP_KEY, access_token=ACCESS_TOKEN)
twitter.get_application_rate_limit_status()['resources']['search']



{'/search/tweets': {'limit': 450, 'remaining': 450, 'reset': 1613383611}}
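
As a rough illustration of how to read this output, the sketch below converts the `reset` value (a Unix timestamp) into a readable time and estimates how many tweets can still be collected in the current window. The helper name `summarize_rate_limit` is made up for this example; the numbers are copied from the output above, and the figure of 100 tweets per call comes from the explanation earlier in this section.

```python
from datetime import datetime, timezone

def summarize_rate_limit(search_status, tweets_per_call=100):
    """Return (tweets still collectable, reset time in UTC) for '/search/tweets'."""
    info = search_status['/search/tweets']
    tweets_left = info['remaining'] * tweets_per_call
    # 'reset' is the number of seconds since 1 January 1970 (UTC)
    reset_at = datetime.fromtimestamp(info['reset'], tz=timezone.utc)
    return tweets_left, reset_at

# example values copied from the output above
status = {'/search/tweets': {'limit': 450, 'remaining': 450, 'reset': 1613383611}}
tweets_left, reset_at = summarize_rate_limit(status)
print(tweets_left)           # 45000
print(reset_at.isoformat())  # 2021-02-15T10:06:51+00:00
```

In the notebook itself, the dictionary returned by `twitter.get_application_rate_limit_status()['resources']['search']` could be passed to such a helper instead of the hand-written `status`.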

## Writing tweets to file 

To collect tweets, you can choose between 'searching tweets' or 'streaming tweets' (see sections below). Please discuss with us beforehand which method is best suited for your research question. 
The code in the cell immediately following this one does **not** do the actual writing to file; it only defines how tweets should be written to file. This definition is then used in the streaming/searching parts. 



In [3]:
import csv
import os.path

delimiter = ';' #change this to ';' or ',' if your software (like excel, numbers...) doesn't show your data in columns 

def writeToFile(data, filename):
    
    file_exists = os.path.isfile(filename)
    
    #newline='' prevents the csv module from writing blank lines between rows on Windows
    with open(filename,'a', encoding='utf-8', newline='') as f: #this will add the newly collected tweets to your dataset ('a' = append) 
        writer = csv.writer(f, delimiter=delimiter)
        
        ##write a header to the file if it doesn't exist already
        if not file_exists: #if it's a new file, we should create a header 
            print('new file created')
            writer.writerow(['Date','Place_Name','Place_Bounding_Box','Text', 'Tweet_Id', 'IsReplyTo_ID','IsReplyTo_Text', 'Hashtags','Urls','Media','User_Screen_Name', 'User_Id', 'User_Followers_Count', 'Checked_Status_At','Retweets_Count','Favourites_Count', 'ID_of_answer', 'Text_of_answer']) #this is the document header
    
        ##format all the results in a new row to append to the file
               
        #add a value for the date and time of creation of the tweet
        row_to_write = [data['created_at']]
               
        #process location information, if present
        if data.get('place')== None:
            row_to_write.append('')
            row_to_write.append('')
        else:
            row_to_write.append(data['place']['full_name'])
            row_to_write.append(data['place']['bounding_box'])
               
        #the text is stored depending on the type of tweet. We use the full text if available ('extended_tweet'), otherwise we use the standard text
        if data.get('extended_tweet')!= None and data['extended_tweet'].get('full_text')!= None :
            row_to_write.append(data['extended_tweet']['full_text'])
        elif data.get('full_text')!= None:
            row_to_write.append(data['full_text'])
        else:
            row_to_write.append(data['text'])
                         
        #then we can add a column for the tweet ID
        row_to_write.append(str(data['id'])) 
               
        #if the tweet is a reply, store the original tweet ID
        row_to_write.append(data['in_reply_to_status_id'])  
           
        if(data['in_reply_to_status_id']):  #if the tweet is a reply, the original text can be fetched later
            row_to_write.append('original text to be fetched')
        else: #if the tweet is not a reply, this column can be left empty
            row_to_write.append('')
                                   
        #then we can process the hashtags (joined into one comma-separated string, without a leading separator):
        hashtags_as_strings = ', '.join(x['text'] for x in data['entities']['hashtags'])
        row_to_write.append(hashtags_as_strings)    
                                   
        #then we can process the urls in the same way:
        urls_as_strings = ', '.join(x['url'] for x in data['entities']['urls'])
        row_to_write.append(urls_as_strings)
                                   
        #presence of media 
        if data['entities'].get("media") != None:
            row_to_write.append('yes')
        else:
            row_to_write.append('no')
               
               
        #user information      
        row_to_write.append(data['user']['screen_name'])
        row_to_write.append(str(data['user']['id']))
        row_to_write.append(data['user']['followers_count'])
                                   
               
        #empty values that will later be replaced with retweet count etc. 
        row_to_write.append('')
        row_to_write.append('')
        row_to_write.append('')
        row_to_write.append('')
        row_to_write.append('')
        
           
        #finally, write everything to file             
        writer.writerow(row_to_write)
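
The way the function above picks the tweet text can be illustrated in isolation. The following is a minimal sketch, not part of the notebook's own code: the helper name `get_tweet_text` and the example dictionaries are invented for illustration, and only the three keys checked by `writeToFile` are assumed to exist.

```python
def get_tweet_text(data):
    """Pick the most complete text available in a tweet dictionary."""
    extended = data.get('extended_tweet')
    if extended is not None and extended.get('full_text') is not None:
        return extended['full_text']   # long tweet that carries an 'extended_tweet' part
    if data.get('full_text') is not None:
        return data['full_text']       # tweet that has a top-level 'full_text'
    return data['text']                # standard text as a last resort

# hand-made example dictionaries (not real tweets)
with_extended = {'text': 'a long tw...', 'extended_tweet': {'full_text': 'a long tweet'}}
with_full_text = {'full_text': 'a search result'}
standard = {'text': 'a short tweet'}

texts = [get_tweet_text(t) for t in (with_extended, with_full_text, standard)]
print(texts)  # ['a long tweet', 'a search result', 'a short tweet']
```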
    
    

## Searching Tweets

In [6]:
#this is a search example: for 3 different Belgian banks, we'll collect max. 100 tweets of the past seven days 
# where people used the Twitter handle of the bank in their tweet text (usually done to publicly address the bank with a complaint)

#note that you could also collect tweets containing certain keywords, etc. Please discuss with us and we can make a custom piece of code.

twitter_handles = ['BNPPFBelgie', 'KBC_BE', 'INGBelgie']

for handle in twitter_handles:
    print('searching tweets for handle @' + handle + '...'  )
    query = '@' + handle + ' -filter:retweets'
    searchResult = twitter.search(q=query, count='100', tweet_mode="extended") 

    #now we can save these results to a file with a custom filename, in this case just the name of the bank
    filename = handle + '.csv' #change the filename here 
    delimiter = ';' #change this to ';' or ',' if your software (like excel, numbers...) doesn't show your data in columns 

    #we're also going to print the tweets below, so you can check the result
    for tweet in searchResult['statuses']:
        print('*****')
        print(tweet['full_text'])
        writeToFile(tweet, filename)
    


    

searching tweets for handle @BNPPFBelgie...
*****
@BNPPFBelgie @BNPParibas da #EEMlabel , is da iets dat je bekomt als je klanten veel aanrekent maar geen service gaat aanbieden? #kluutbank @EEM_Label
*****
@TomA3aenssens @dbousou @visjevangen @rudidekerpel Je moet eens nadenken met wat je exact tweet Tom.  Ik geef gewoon even goede raad.   Is het omdat @BNPPFBelgie met spoonsorgeld smijt dat ze mogen frauderen?  Nu ga je toch flink uit de bocht hoor.
*****
@BNPPFBelgie @KoenDeLeus @Philippegijsels bla bla bla
*****
@BNPPFBelgie @KoenDeLeus @Philippegijsels Slechtste bank ooit !!!!
*****
@BNPPFBelgie ik probeer al weken bij jullie een afspraak in te plannen maar mijn mails worden gewoon niet beantwoord... kan iemand mij aub een afspraak fixen bij een @BNPPFBelgie kantoor in Antwerpen? Merci.
*****
@BNPPFBelgie @KoenDeLeus @Philippegijsels Geld slaan op de gezondheid/welzijn van de mensen... Hoe hypocriet kan je zijn....???...
*****
Now @PayPal and @Mastercard are the first major paymen

*****
@fonsvandevoordt @INGBelgie What they wrote is perfectly accurate
*****
@TalleKe1970 @INGBelgie I know...
*****
@TalleKe1970 @INGBelgie I spent more on 'phone calls in the end so ended up chalking a large 'F' on the whole thing.
*****
@ronald_roelandt @INGBelgie Ik was een heel tevreden Recordbankklant. Tot die opgeslorpt werd door ING... pfff die koude afstandelijke “protocollen” he. 🙄
*****
@deerepowerke @INGBelgie Ja wel kijk: zo makkelijk werkt dat dus 🤬 https://t.co/m2xV1q8xO5
*****
@Andymo444 @INGBelgie I was Recordbank. But they went ING too... damn 🙄
*****
@TalleKe1970 @INGBelgie Ik mag feitelijk geen reclame maken...
*****
@__JohnV__ @INGBelgie Ik geef t op zenne. Heb mail gezonden. Verdikke 🙄🙄
*****
@TalleKe1970 @INGBelgie Ik ging vrijdag langs in mijn ING kantoor om cash af te halen.... het aantal affiches dat er hing in drie talen om te zeggen dat ze GEEN afspraken maken want alles is online te regelen....
*****
@TalleKe1970 @INGBelgie BBL were great until ING took 'e
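
The query strings used in the search cell above can also be built with a small helper, which makes it easier to switch from handles to keywords. This is only a sketch: the function name `build_query` is hypothetical, and it relies on the assumption that terms separated by a space must all appear in the tweet. The `-filter:retweets` part is taken from the cell above.

```python
def build_query(terms, exclude_retweets=True):
    """Join search terms into one query string; all terms must match."""
    parts = list(terms)
    if exclude_retweets:
        parts.append('-filter:retweets')  # same operator as in the search cell above
    return ' '.join(parts)

handle_query = build_query(['@KBC_BE'])
keyword_query = build_query(['bank', 'app'], exclude_retweets=False)
print(handle_query)   # @KBC_BE -filter:retweets
print(keyword_query)  # bank app
```

The resulting string could then be passed as the `q` argument of `twitter.search`, exactly as `query` is in the cell above.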

## Streaming Tweets

In this part of the code, we will start 'streaming' tweets: collecting newly created tweets based on certain criteria. These tweets will then be saved in a csv file, a file format that you can open with Excel, Numbers, etc. 

Every time you want to start streaming, run the code in the cells below. It might take a while before a first tweet is discovered, so there's nothing wrong if no tweet shows up for a while. If a lot of tweets are streamed (e.g., when you use a keyword like 'Trump' or 'Brexit'), make sure to halt the program in time, because all tweets are automatically saved to your computer.

New tweets will automatically be added to a file with the filename as specified below. You can change the filename (but do keep the extension '.csv'). This file will be created once a first tweet that matches the criteria is discovered, and tweets will be added to the same file regardless of whether you restarted the application in between. The file will be generated in the same folder as the folder where these notebooks are located. 
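
The 'create the file on the first tweet, then keep appending' behaviour described above can be sketched in a few lines. This is a simplified stand-in for the notebook's `writeToFile` function, not a copy of it: the helper name `append_row`, the two-column header and the temporary demo file are all invented for illustration.

```python
import csv
import os
import tempfile

def append_row(filename, header, row, delimiter=';'):
    """Append one row; write the header first if the file does not exist yet."""
    file_exists = os.path.isfile(filename)
    with open(filename, 'a', encoding='utf-8', newline='') as f:
        writer = csv.writer(f, delimiter=delimiter)
        if not file_exists:
            writer.writerow(header)
        writer.writerow(row)

# demo in a temporary folder so no real data is touched
demo_file = os.path.join(tempfile.mkdtemp(), 'demo_stream_tweets.csv')
header = ['Date', 'Text']
append_row(demo_file, header, ['Mon Feb 15 09:58:23 +0000 2021', 'first tweet'])
append_row(demo_file, header, ['Mon Feb 15 09:58:24 +0000 2021', 'second tweet'])

with open(demo_file, encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f, delimiter=';'))
print(len(rows))  # 3: one header row plus two tweets
```

Because the file is opened in append mode, calling the helper again after a restart adds rows to the same file without repeating the header.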


In [7]:
from twython import TwythonStreamer

filename = 'test_stream_tweets.csv' #change the filename here 

In the code cell below, we first specify what will happen if we find a tweet that matches our criteria. Currently, it will tell us when a new tweet is collected. If it's not a retweet, its date, place and text will be written to file.

There's a lot more information you can access for each tweet. If you want to save more than the date, place and text (e.g., the name of the user) please go to https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object and consult the section 'Tweet Data Dictionary'. List all the properties you want to save to file, and contact us so we can update this part of the code.  

In [8]:
class MyStreamer(TwythonStreamer):
    def on_success(self, data):
        print("-------new tweet collected!")
        
        if 'retweeted_status' in data:
            print("but it's a retweet, so we ignore it...")
        else:
            
            #first we print selected information about the tweet so you can follow what's happening
            if data.get('extended_tweet')!= None and data['extended_tweet'].get('full_text')!= None :
                print([data['created_at'],data['place'], data['extended_tweet']['full_text']])
            else:
                print([data['created_at'],data['place'], data['text']])
             
            #then we add the tweet to the file
            writeToFile(data, filename)
                                
    def on_error(self, status_code, data):
        print(data)
        print(status_code)
        # self.disconnect()
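
The retweet check used in `on_success` can be tried out without connecting to Twitter. The sketch below uses hand-made dictionaries (not real tweets) and a hypothetical helper `is_retweet`; the only assumption taken from the code above is that retweets carry a `retweeted_status` key.

```python
def is_retweet(data):
    """A tweet dictionary is treated as a retweet if it has a 'retweeted_status' key."""
    return 'retweeted_status' in data

# hand-made stand-ins for what the stream would deliver
incoming = [
    {'text': 'RT @someone: hiring now', 'retweeted_status': {'text': 'hiring now'}},
    {'text': 'great job!'},
    {'text': 'we are hiring'},
]

kept = [t['text'] for t in incoming if not is_retweet(t)]
print(kept)  # ['great job!', 'we are hiring']
```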

In the next cell, we connect to the twitter stream.

In [9]:
OAUTH_TOKEN = '1100028871259377670-qtcMTW2ereJ3A0KIvFguWu0ZmW0n8k'
OAUTH_TOKEN_SECRET = 'wnPYmWOds9xD1i1CM9K8gfzMNZ26QoBmXW4JSSA81faRF'

When you execute the next cell, the streaming will start. This is also the place where you can edit the criteria you want to 'filter' the stream on. There are different types of filters, which you can use at the same time:



**follow** (optional): A comma-separated list of user IDs, indicating the users to return statuses for in the stream. 

**track** (optional): Keywords to track. Multiple keywords or phrases are specified as a comma-separated list. 

**locations** (optional): Specifies a set of bounding boxes to track. 

See https://developer.twitter.com/en/docs/tweets/filter-realtime/api-reference/post-statuses-filter
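
As an illustration of how these three filters are written down, the sketch below turns Python lists into the comma-separated strings described above. The helper name `build_filter_params` is hypothetical, and the bounding box is only a rough, hand-picked approximation of Belgium. It assumes a bounding box is given as longitude and latitude of the south-west corner followed by those of the north-east corner; please check the page linked above before relying on this.

```python
def build_filter_params(track=None, follow=None, locations=None):
    """Build the keyword arguments for a stream filter as comma-separated strings."""
    params = {}
    if track:
        params['track'] = ','.join(track)
    if follow:
        params['follow'] = ','.join(str(user_id) for user_id in follow)
    if locations:
        # each box: (west longitude, south latitude, east longitude, north latitude)
        params['locations'] = ','.join(str(v) for box in locations for v in box)
    return params

params = build_filter_params(
    track=['job', 'hiring'],
    locations=[(2.5, 49.5, 6.4, 51.5)],  # rough box around Belgium (assumption)
)
print(params)  # {'track': 'job,hiring', 'locations': '2.5,49.5,6.4,51.5'}
```

In the notebook, such a dictionary could be unpacked into the filter call, e.g. `stream.statuses.filter(**params)`.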


In [10]:
stream = MyStreamer(APP_KEY, APP_SECRET,
                    OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
stream.statuses.filter(track='job,hiring', tweet_mode='extended')

-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
['Mon Feb 15 09:58:23 +0000 2021', None, '@homer_crypto Lets gooo! 🔥🔥🔥 Good job man! 🎯']
-------new tweet collected!
['Mon Feb 15 09:58:23 +0000 2021', None, 'God qualifies the unqualified 🤣🙏🏿']
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
['Mon Feb 15 09:58:24 +0000 2021', None, 'IOP Jobs 2021 International Office Products Lenovo Microsoft Partner – Sales\xa0Executive https://t.co/Lr6DX7MTki']
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
['Mon Feb 15 09:58:23 +0000 2021', None, 'What a load of fucking bollocks! When sanity returns to the UK, the ‘force’ won’t be with you, the ‘Met Force’ will be 

-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
['Mon Feb 15 09:58:33 +0000 2021', None, '@MrMattDonlan It’s not “nice” when shoe is on the other foot...India know what to expect when they tour here,so why not prepare a dust bowl,still need to do the job which is exactly what they doing here,great bounce back from first test defeat']
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
['Mon Feb 15 09:58:34 +0000 2021', None, "If you are passionate about technology, about innovative products and services, then working at Modex won't seem like a job at all. It will feel like an amazing journey into the future. #WeAreModex https://t.co/fRHGLlpM1G"]
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
['Mon Feb 15 09:58:34 +0000

-------new tweet collected!
['Mon Feb 15 09:58:44 +0000 2021', None, 'Oh... Gulf 😭😭\n@gulfkanawut #GulfKanawut']
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
['Mon Feb 15 09:58:44 +0000 2021', None, '@Ch11Star holy crap this looks amazing and so good amazing job 👍👍👏👏👏']
-------new tweet collected!
['Mon Feb 15 09:58:44 +0000 2021', None, "@FS_GTKM i cannot name list them all but my absolute fav that i go back to once in a while: extreme job (korean), along with the gods (korean), lord of the rings, the hobbits, harry potter, little women, and all of ghibli's 🥺🥺 pls recommend me more if u have some lists"]
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
['Mon Feb 15 09:58:44 +0000 2021', None, '@daisieduckie AMAZING JOB ON THIS!!! I LOVE IT SO MUCH!!! 😍I reall

-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
['Mon Feb 15 09:58:56 +0000 2021', None, '@swampybr549 @LuvgvsUwngs And she had them move because her son was going to start college there. I think she’s manipulative. I didn’t at the beginning, but I do now. And she needs to get a job like the other wives. And Kody should get one too!']
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
['Mon Feb 15 09:58:57 +0000 2021', None, 'actually watched this a couple of days ago but was too meh to update my thread. Also like I can feel the care an attention in this movie but then it inevitably fell into the trap of “love &gt; new organs” and nahhhh']
-------new tweet collected!
but it's a retweet, so we ignore

-------new tweet collected!
['Mon Feb 15 09:59:07 +0000 2021', None, "@AlSwearengen127 @SkySportsPL No coach rn in world football would've taken the job after Jose left and now if it was vacated everyone would jumping up in arms to get it. That's all Oles work. He deserves a real chance to see his project through. Not being reactionary with him has worked so far really well imo"]
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new tweet collected!
but it's a retweet, so we ignore it...
-------new twe

KeyboardInterrupt: 

## Check status: look up the favorite and retweet counts after a given time period; if a reply, add the text of the original tweet; if the tweet has been replied to, add the text of the reply. 

The goal here is to determine, for tweets we collected in the past, the number of times they have been retweeted and the number of times they have been favorited. To have a fair comparison between tweets, this should always be done in more or less the same 'time window'. For this example, we choose the time window to be 10 to 14 days after the tweet was created. This means that once you have collected tweets, you should run this code at least once between 10 and 14 days later. There are 3 possible results: 
- the tweet was created between 10 and 14 days ago: great! We look up its counts. 
- the tweet was created less than 10 days ago: the time window in which you should run this code again is added to the file.
- the tweet was created more than 14 days ago and hasn't been checked: you missed the window and the tweet has expired. These tweets should be excluded from your analysis (please contact us so we can help). 
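
The three cases above can be sketched as a small function. This is a simplified illustration rather than the code used further down: the function name `classify_tweet_age` and the labels it returns are invented here, and the 10 and 14 day bounds are simply the example values from the text. The date format is the one in which the notebook stores the tweet's creation date.

```python
from datetime import datetime, timedelta, timezone

TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"  # e.g. 'Mon Feb 15 09:58:23 +0000 2021'

def classify_tweet_age(created_at, now, min_days=10, max_days=14):
    """Tell whether a tweet is too young, inside the time window, or expired."""
    created = datetime.strptime(created_at, TWITTER_DATE_FORMAT)
    age = now - created
    if age < timedelta(days=min_days):
        return 'too early'
    if age > timedelta(days=max_days):
        return 'expired'
    return 'in window'

created_at = 'Mon Feb 15 09:58:23 +0000 2021'
check_moments = [
    datetime(2021, 2, 20, 12, 0, tzinfo=timezone.utc),  # about 5 days later
    datetime(2021, 2, 27, 12, 0, tzinfo=timezone.utc),  # about 12 days later
    datetime(2021, 3, 5, 12, 0, tzinfo=timezone.utc),   # about 18 days later
]
results = [classify_tweet_age(created_at, moment) for moment in check_moments]
print(results)  # ['too early', 'in window', 'expired']
```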

In addition to this, this code will add the original tweet's text to the dataset if the tweet was a reply (independent of the number of days).

There is also an option to check if the tweet has been replied to. If a reply is found through the additional Twitter search, the text of the reply will be stored as well. **This option is by default set to 'False' because this lookup can only be done 180 times per 15 minutes. This means that it is only feasible if you have a small number of tweets in the file. If you want to turn on this option, please set 'search_replies' in the cell below to 'True'.**
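
To get a feeling for what the limit of 180 lookups per 15 minutes means in practice, the following back-of-the-envelope sketch estimates the minimum waiting time for a given number of tweets. The function name `minimum_wait_minutes` is made up for this example, and the estimate assumes one reply lookup per tweet and ignores the time the lookups themselves take.

```python
import math

def minimum_wait_minutes(n_lookups, calls_per_window=180, window_minutes=15):
    """Minimum time spent waiting for new 15-minute windows to open."""
    windows_needed = math.ceil(n_lookups / calls_per_window)
    # the first window is available immediately; every further one costs a full wait
    return max(windows_needed - 1, 0) * window_minutes

print(minimum_wait_minutes(100))   # 0
print(minimum_wait_minutes(1000))  # 75
```

So a file with 1000 tweets would already need well over an hour, which is why the option is only practical for small files.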

Note that the updated dataset will be stored in a new file. You can choose the filename below. **Do not give it the same name as your original file!**

In [5]:
from datetime import datetime, timedelta
import time

search_replies = False #change this to 'True' if you do need the replies. 

In the code below, we will first determine the correct time window. 

In [None]:
#the file of collected tweets that you want to get the counts for:
filename = 'search_results8.csv' #change the filename here 

#the file with the updated dataset
temp_copy_file = 'search_results8_updated.csv' #change the filename here (different from original!)


#the time period we consider (from 'max_days_back' days ago to 'min_days_back' days ago  )
max_days_back = 10 #days 
min_days_back = 0 #days 

current_date = datetime.now()
max_date_back = current_date - timedelta(days=max_days_back)
min_date_back = current_date - timedelta(days=min_days_back)

print('we will look up the tweets created between: ')
print(max_date_back)
print('and ')
print(min_date_back)



In [4]:
col_Date = 0
col_ID = 4
col_isReply_ID = 5
col_isReply_Text = 6 
col_userName = 10 
col_Checked_Status_At = 13
col_Retweets_Count= 14
col_Favourites_Count = 15
col_hasAnswer_ID = 16
col_hasAnswer_Text = 17


In [None]:
with open(filename, mode='r', encoding='utf-8') as csv_file:
    
    start_time = time.time()
    elapsed_time = 0
    csv_reader = csv.reader(csv_file, delimiter=delimiter)
    line_count = 0
    
    with open (temp_copy_file,'w', encoding='utf-8') as temp_csv_copy:
        
        wtr = csv.writer(temp_csv_copy)
        
        rate_needs_to_be_checked = True

        for tweet in csv_reader:
            if (tweet and tweet[0]) and line_count > 0: #skip the empty lines and the header
            
                if rate_needs_to_be_checked: #we don't want to constantly check the rate limit, because this check itself is rate limited
                    
                    while rate_needs_to_be_checked: #recheck the rate until it's refreshed
                    
                        try: #check if rate not exceeded; if so, wait a while
                            rate_status = twitter.get_application_rate_limit_status()
                            checks_remaining = rate_status['resources']['statuses']['/statuses/show/:id']['remaining']
                            print('remaining: ')
                            print(checks_remaining )
                        except Exception as e:
                            print(e)
                            checks_remaining = 0
                            
                        if checks_remaining > 0:
                            rate_needs_to_be_checked = False
                        else:
                            print('waiting a while because rate limit was exceeded')
                            time_till_refresh = 60  
                            print(str((time_till_refresh)/60) + ' minutes')
                            #print(str((time_till_refresh - elapsed_time)/60) + ' minutes')
                            #time.sleep(time_till_refresh - elapsed_time)
                            time.sleep(time_till_refresh)
                            #start_time = time.time()
                            #elapsed_time = 0 #continue from 0
                
                
                #if the rate was not exceeded, we can look up the tweet
                print('processing_line' + str(line_count))

                #get the time the tweet was created
                time_of_creation = datetime.strptime(tweet[col_Date],  "%a %b %d %H:%M:%S %z %Y") #'%a %b %d %H:%M:%S %Y')
                time_of_creation = time_of_creation.replace(tzinfo=None)
                
                #did we already get the counts?
                if not tweet[col_Retweets_Count]: #we didn't check before
                    #if the creation time is within our bounds, fetch the original tweet from twitter and check its counts
                    if time_of_creation > max_date_back and time_of_creation < min_date_back:
                        ID = tweet[col_ID]
                        try:
                            checks_remaining  = checks_remaining  - 1
                            fetched_tweet = twitter.show_status(id=ID)
                            tweet[col_Retweets_Count] = fetched_tweet['retweet_count']
                            tweet[col_Favourites_Count] = fetched_tweet['favorite_count']
                            tweet[col_Checked_Status_At] =  current_date
                        except Exception as e: #the tweet is no longer present in the twitter system 
                            print(e)
                            tweet[col_Retweets_Count] = 'tweet was removed'
                            tweet[col_Favourites_Count] = 'tweet was removed'
                            tweet[col_Checked_Status_At] =  current_date
                        
                            
                    else:
                        if time_of_creation < max_date_back: #the tweet wasn't checked within the bounds and has now expired
                            tweet[col_Checked_Status_At] = 'expired'
                        else:
                            min_date = time_of_creation + timedelta(days=min_days_back)
                            min_date = min_date.strftime("%d/%m/%Y, %H:%M:%S")
                            max_date = time_of_creation + timedelta(days=max_days_back)
                            max_date = max_date.strftime("%d/%m/%Y, %H:%M:%S") 
                            tweet[col_Checked_Status_At] = 'to be checked between ' + min_date + ' and ' + max_date
                            
                #is the tweet a reply, then add the original tweet's text
                if tweet[col_isReply_ID]:
                    try:
                        checks_remaining  = checks_remaining  - 1
                        fetched_tweet = twitter.show_status(id=tweet[col_isReply_ID])
                        tweet[col_isReply_Text] = fetched_tweet['text']
                    except: #the tweet is no longer present in the twitter system 
                        tweet[col_isReply_Text] = 'tweet was removed'
                        
                        
                if search_replies: 
                    #lookup if the tweet has been replied to
                    #note: you can only do 180 of these lookups per 15 minutes
                    #so turn this option off (see cell above) if you're using a lot of tweets!
                    
                    #first search for tweets that mention the user who created the original tweet
                    query_username= '@' + tweet[col_userName]
                    timeline = twitter.search(q=query_username,tweet_mode='extended')
                    
                    #then determine if any of these results is a reply to the original tweet
                    for result in timeline['statuses']:
                        if (result['in_reply_to_status_id'] == int(tweet[col_ID])):
                            tweet[col_hasAnswer_ID] = str(result['id'])
                            tweet[col_hasAnswer_Text] = str(result['full_text'])
 
                    
                    
                
                
                        
                if checks_remaining < 1:
                    rate_needs_to_be_checked = True
                else:
                    rate_needs_to_be_checked = False
                
                    

            line_count += 1
            wtr.writerow(tweet)
            elapsed_time = time.time() - start_time
    
#both files are closed automatically at the end of their 'with' blocks, so no explicit close() is needed

print('file updated')

    

## Optional: combine collected tweets of different files in one single file

This part is not finished

In [6]:

col_Labeled_Vacancy= 18
col_Predicted_Vacancy = 19


delimiter = ','
#the file of collected tweets that you want to get the counts for:
#filenames = ['collected_tweetsdagnegen_updated3.csv','collected_tweetsdagdertien_updated.csv','collected_tweetsdagzestien_updated.csv', 'collected_tweetsdagzeventien_updated.csv', 'collected_tweetsdagachttien_updated.csv','collected_tweetsdagtwintig_updated.csv', 'collected_tweetsdageenentwintig_updated.csv']   #change the filename here 

filenames_1 = ['search_Telenet_BE_140420_updated.csv','search_Vodafone_NL_140420_updated.csv']
#filenames_2 = ['collected_tweetsdagzeventien_updated.csv', 'collected_tweetsdagachttien_updated.csv']
#filenames_3 = ['collected_tweetsdagtwintig_updated.csv', 'collected_tweetsdageenentwintig_updated.csv']

filenames = filenames_1

combined_filename = 'combined_updated_telenet_vodafone.csv' #change this filename to the filename you want the resulting file to have 

total_n_tweets = 0
line_count = 0

with open (combined_filename,'w', encoding='utf-8') as combined_file:
    wtr = csv.writer(combined_file, delimiter=';')
    wtr.writerow(['Date','Place_Name','Place_Bounding_Box','Text', 'Tweet_Id', 'IsReplyTo_ID','IsReplyTo_Text', 'Hashtags','Urls','Media','User_Screen_Name', 'User_Id', 'User_Followers_Count', 'Checked_Status_At','Retweets_Count','Favourites_Count', 'Labeled_Vacancy', 'Predicted_Vacancy']) #this is the document header
                
    for filename in filenames:
        with open(filename, mode='r', encoding='utf-8') as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=delimiter)
            for tweet in csv_reader:
                if (tweet and tweet[0] and line_count > 0 and tweet[col_Retweets_Count] != 'tweet was removed'):
                    total_n_tweets = total_n_tweets + 1
                    tweet.append(0)
                    wtr.writerow(tweet)
                line_count = line_count + 1
                
print(total_n_tweets)

201
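
Since this part is not finished, here is a minimal, self-contained sketch of one way the combining step could work. It is not the code above: the function name `combine_csv_files`, the two-column demo files and the temporary folder are invented for illustration. The sketch assumes every input file starts with the same header row, and it skips that header separately for each file so that it ends up in the combined file only once.

```python
import csv
import os
import tempfile

def combine_csv_files(filenames, combined_filename, delimiter=';'):
    """Copy all data rows into one file, keeping only the first header."""
    total_rows = 0
    header_written = False
    with open(combined_filename, 'w', encoding='utf-8', newline='') as out:
        writer = csv.writer(out, delimiter=delimiter)
        for filename in filenames:
            with open(filename, encoding='utf-8', newline='') as f:
                for line_number, row in enumerate(csv.reader(f, delimiter=delimiter)):
                    if not row or not row[0]:
                        continue  # skip empty lines
                    if line_number == 0:  # header row of this file
                        if not header_written:
                            writer.writerow(row)
                            header_written = True
                        continue
                    writer.writerow(row)
                    total_rows += 1
    return total_rows

# build two small demo files in a temporary folder
folder = tempfile.mkdtemp()
paths = []
for name, texts in [('a.csv', ['tweet 1', 'tweet 2']), ('b.csv', ['tweet 3'])]:
    path = os.path.join(folder, name)
    with open(path, 'w', encoding='utf-8', newline='') as f:
        demo_writer = csv.writer(f, delimiter=';')
        demo_writer.writerow(['Date', 'Text'])
        for text in texts:
            demo_writer.writerow(['Mon Feb 15 09:58:23 +0000 2021', text])
    paths.append(path)

combined_path = os.path.join(folder, 'combined.csv')
n_combined = combine_csv_files(paths, combined_path)
print(n_combined)  # 3
```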
