# Twitter Data Collection

**By Esraa Mohamed**

<img src="https://i.imgur.com/cDTKtam.jpg">

# Natural Language Processing:
Natural Language Processing (NLP) is a fascinating and diverse area of Artificial Intelligence. Here, we use it to handle text-based Twitter datasets and present a complete study of tweets from the United Kingdom, the United States of America, and worldwide, using scraping and text analysis. We used Selenium and Tweepy for data collection by scraping, followed by data cleaning, NLTK classes and methods, the BERT model for sentiment analysis, and unsupervised clustering for model building. We have done our best to cover most of the steps involved in collecting, exploring, and analysing text data.

## 1. Twitter Data Collection by "V2" Full Archive Search

This notebook shows how to use Tweepy to conduct a full archive search using v2 of the Twitter API.

### Work Preparation

In order to use this code, we will need to have a developer account on Twitter, with access to the Academic Research product track. Information about who is eligible and how to apply is [here](https://developer.twitter.com/en/products/twitter-api/academic-research).

Once we have an account, we will need to create a new app at https://developer.twitter.com/en/portal/dashboard and generate a "bearer token" from the app. 

Copy the bearer token to your clipboard and paste it into a new file in the same directory as this file, called `twitter_authentication.py`. The entire contents of the file should look like this:

```python
bearer_token = "YOUR BEARER TOKEN HERE"
```

Note that we should **never** share this token with anyone else. If, for example, you are saving your work in a Git repository, make sure that you add the `twitter_authentication.py` file to your `.gitignore`.

If anyone gets this token, they will be able to make API requests as your app, and you will need to revoke the token (from the same interface where you created it).
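As a minimal sketch of how the token could be loaded without pasting it into the notebook, the helper below first tries the `twitter_authentication.py` file described above and otherwise falls back to an environment variable. The function name `load_bearer_token` and the variable name `TWITTER_BEARER_TOKEN` are assumptions made for illustration; they are not part of Tweepy or of the original notebook.

```python
import importlib
import os


def load_bearer_token(module_name="twitter_authentication",
                      env_var="TWITTER_BEARER_TOKEN"):
    """Return the bearer token from a local module, else from the environment."""
    try:
        # Equivalent to: from twitter_authentication import bearer_token
        module = importlib.import_module(module_name)
        return module.bearer_token
    except (ImportError, AttributeError):
        # Hypothetical fallback: an environment variable named by env_var
        token = os.environ.get(env_var)
        if not token:
            raise RuntimeError(
                f"No token found in {module_name}.py or ${env_var}")
        return token


# Usage (left commented out so the cell does not fail when no token is set):
# bearer_token = load_bearer_token()
```

Either way, the token itself never appears in the notebook, so the notebook can be shared or committed safely.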

If we have created the file successfully, then the following two blocks of code should work.

Let us start by importing all the necessary libraries.

In [3]:
import tweepy
import time
import pandas as pd

Set your bearer token here:

In [4]:
bearer_token = '*****'

In [5]:
client = tweepy.Client(bearer_token, wait_on_rate_limit=True)

## The Search API

Full documentation for searching tweets is at https://docs.tweepy.org/en/latest/client.html#search-tweets. There are many options, but here is a simple version that gets all of the "social distancing" tweets posted during COVID, starting from January 1, 2021.

By default the only information returned is the tweet ID and the text. Often, we will want information about authors, too. To get information about the author, you need to add the `user_fields` parameter with the fields you want as well as the `expansions = 'author_id'` parameter. 

To get more information about the tweet, you need the `tweet_fields` parameter. The options are shown at https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all

You also likely want to build a somewhat advanced query - instructions are at https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query. For this query, I get English language tweets that are not retweets.
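A query string of this kind can be assembled from its parts, which makes it easier to vary the language or country between runs. The sketch below is illustrative only: `build_query` is a hypothetical helper, and two of its choices differ from the queries used in this notebook. It quotes multi-word terms so they match as exact phrases, and it uses the `place_country:` operator, which matches only tweets tagged with a place in that country.

```python
def build_query(terms, lang=None, exclude_retweets=True, country=None):
    """Assemble a v2 search query string from its parts."""
    # Quote multi-word terms so they match as exact phrases
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    parts = ["(" + " OR ".join(quoted) + ")"]
    if exclude_retweets:
        parts.append("-is:retweet")
    if lang:
        parts.append(f"lang:{lang}")
    if country:
        # place_country: takes a two-letter country code such as GB or US
        parts.append(f"place_country:{country}")
    return " ".join(parts)


query = build_query(["socialdistancing", "social distancing"],
                    lang="en", country="GB")
print(query)
# (socialdistancing OR "social distancing") -is:retweet lang:en place_country:GB
```

Because only a small share of tweets carry place information, a query built this way should be expected to return far fewer results than a keyword-only query.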


## 1. United Kingdom Dataset

Let us start by setting our query to get the data from the UK.

In [59]:
# Note: the bare country code at the end of the query (GB) is matched as an ordinary keyword.
# To restrict results to tweets tagged with a place in a country, use the place_country: operator (e.g. place_country:GB).

social_distancing_tweets = []

for response in tweepy.Paginator(client.search_all_tweets,
                                 query = '(socialdistancing OR social distancing) -is:retweet lang:en GB',
                                 user_fields = ['username', 'public_metrics', 'description', 'location'],
                                 tweet_fields = ['created_at', 'geo', 'public_metrics', 'text'],
                                 expansions = 'author_id',
                                 start_time = '2021-01-01T00:00:00Z',
                                 end_time = '2021-12-08T00:00:00Z',
                                 max_results = 500):
    # Pause between pages to stay under the endpoint's per-second request limit
    time.sleep(1)
    social_distancing_tweets.append(response)

In [57]:
social_distancing_tweets

[Response(data=[<Tweet id=1471991391748333570 text=@ThatDonkDoe @genwilliams @BlackAntoid @ItsDanaWhite Even with the system telling us what we can do, remember that many politicians say shit just to stay popular/elected and hope for the best. It's up to people to be cautious, keep masks on, and use social distancing even if things open up. Common sense, or we'll see more variants.>, <Tweet id=1471991185497636868 text=Ever notice how the ppl who claim social distancing is just a way for the govt to control us are the same ppl who are sad cuz we ended segregation?
 Like, it’s ok to tell ppl how they can interact in public if you’re doing it for racism.>, <Tweet id=1471989892540514311 text=an account with 18k followers tweeting decisive langue like spectating vaxxed and unvaxxed, instead of tweeting about masks and hand washing and social distancing….the stuff that’s really keeping us safe. Majority of all the professionals sport players with Covid were vaxxed>, <Tweet id=147198849542623

In [64]:
social_distancing_tweets[1]

Response(data=[<Tweet id=1389236156647153670 text='Good chance' social distancing can be scrapped next month, says Johnson https://t.co/COR0oSyKwk>, <Tweet id=1389235309532045318 text=UK Covid LIVE: Social distancing will be dropped on June 21, says Boris Johnson, as Portugal and Spain ready to take Britons https://t.co/yXm2TJNwX2>, <Tweet id=1389223876387803137 text='Good chance' social distancing can be scrapped next month, says Johnson https://t.co/Eh0BagLBPO>, <Tweet id=1389218145387307009 text='Good chance' social distancing can be scrapped next month, says Johnson https://t.co/nPpaOsPxZW>, <Tweet id=1389217602036195329 text=Rock on!'Good chance' social distancing can be scrapped next month, says Johnson https://t.co/kD4PLIJ9vx>, <Tweet id=1389215800175824901 text=@staringatclouds @Scarborough_GB And so he's going from selfie to selfie covering 'flocks of people', is he? During a pandemic, when social distancing is still in force?>, <Tweet id=1389202006372491267 text=‘One metre-pl

In [68]:
social_distancing_tweets[0].data[1]

<Tweet id=1468312137214869510 text=https://t.co/2TFFPNN2Wb  #downingstreetparty "no social distancing ... video leak 
@BorisJohnson #LiarJohnson>

In [66]:
social_distancing_tweets[0].includes['users'][2]

<User id=2327434344 name=Théroigne Russell username=TheroigneR>

In [67]:
social_distancing_tweets[0].includes['users'][2].description

'Organic & wild Foodie, low carber, ACGC, Sassenach haggis lover, bisexual Aries with Norman ancestors. #RejoinEU'

In [70]:
result = []
user_dict = {}
# Loop through each response object
for response in social_distancing_tweets:
    # Take all of the users, and put them into a dictionary of dictionaries with the info we want to keep
    for user in response.includes['users']:
        user_dict[user.id] = {'username': user.username, 
                              'followers': user.public_metrics['followers_count'],
                              'tweets': user.public_metrics['tweet_count'],
                              'description': user.description,
                              'location': user.location
                             }
    for tweet in response.data:
        # For each tweet, find the author's information
        author_info = user_dict[tweet.author_id]
        # Put all of the information we want to keep in a single dictionary for each tweet
        result.append({'author_id': tweet.author_id, 
                       'username': author_info['username'],
                       'author_followers': author_info['followers'],
                       'author_tweets': author_info['tweets'],
                       'author_description': author_info['description'],
                       'author_location': author_info['location'],
                       'text': tweet.text,
                       'created_at': tweet.created_at,
                       'retweets': tweet.public_metrics['retweet_count'],
                       'replies': tweet.public_metrics['reply_count'],
                       'likes': tweet.public_metrics['like_count'],
                       'quote_count': tweet.public_metrics['quote_count']
                      })
# Change this list of dictionaries into a dataframe
df_uk = pd.DataFrame(result)
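Since the same flattening loop is needed for every dataset, it can be wrapped in a function. The version below is a sketch rather than the notebook's own code: it keeps only a few of the columns, and it guards against pages with no tweets, on the assumption that `response.data` can be `None` and `response.includes` can be empty for such pages. The `SimpleNamespace` objects are stand-ins that mimic only the attributes used here; they are not real API output.

```python
from types import SimpleNamespace


def flatten_responses(responses):
    """Join each tweet to its author's fields; returns a list of dicts."""
    users = {}
    rows = []
    for response in responses:
        # Collect the users on this page, keyed by user ID
        for user in (response.includes or {}).get("users", []):
            users[user.id] = user
        # response.data may be None on a page with no results
        for tweet in response.data or []:
            author = users.get(tweet.author_id)
            rows.append({
                "author_id": tweet.author_id,
                "username": getattr(author, "username", None),
                "author_followers": (author.public_metrics["followers_count"]
                                     if author else None),
                "text": tweet.text,
                "likes": tweet.public_metrics["like_count"],
            })
    return rows


# Stand-in objects for illustration only
user = SimpleNamespace(id=1, username="example_user",
                       public_metrics={"followers_count": 10})
tweet = SimpleNamespace(author_id=1, text="example text",
                        public_metrics={"like_count": 2})
page = SimpleNamespace(data=[tweet], includes={"users": [user]})
empty_page = SimpleNamespace(data=None, includes={})

rows = flatten_responses([page, empty_page])
print(rows)
```

The returned list can be passed straight to `pd.DataFrame`, as in the cell above.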

Let us save our data to a CSV file and read it back.

In [71]:
df_uk.to_csv(path_or_buf = r'C:\Users\r04ra18\Desktop\Esraa-project-data\uk-09122021-tweets.csv', index=False)

In [72]:
pd.read_csv(r'C:\Users\r04ra18\Desktop\Esraa-project-data\uk-09122021-tweets.csv')

Unnamed: 0,author_id,username,author_followers,author_tweets,author_description,author_location,text,created_at,retweets,replies,likes,quote_count
0,755038244761460736,JaniceMcfaull,117,6017,Troubleshooter/Consultant/Director & someone w...,UK,"Every Body Wanted The Pie,Time for Some Humble...",2021-12-07 23:05:04+00:00,0,0,0,0
1,1222132391751360516,celticpirate1,329,16966,,,"https://t.co/2TFFPNN2Wb #downingstreetparty ""...",2021-12-07 20:11:08+00:00,0,0,0,0
2,2327434344,TheroigneR,2456,437099,"Organic & wild Foodie, low carber, ACGC, Sasse...",Kent,Phil Schofield asks Matt Hancock if dyslexia c...,2021-12-07 18:28:27+00:00,0,0,0,0
3,2327434344,TheroigneR,2456,437099,"Organic & wild Foodie, low carber, ACGC, Sasse...",Kent,Wearing mask is better than social distancing ...,2021-12-07 18:24:58+00:00,0,0,0,0
4,1556650069,TheFitzWorld,205,28022,LFC fan - Fan forum member - Live Love Laugh N...,On top of the MS Upper Tier,Wearing mask is better than social distancing ...,2021-12-07 16:08:28+00:00,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
824,1179451184978939905,maverick550,181,12451,"Independent thinker, artist, musician, foodie ...",California,@Realpersonpltcs @alwaysberunning @JWink4 #Hap...,2021-01-03 05:37:01+00:00,0,0,0,0
825,772465719481171972,itgothiseyes,57,1331,https://t.co/gNJLGwECw0 21🇬🇧 Bisexual keeping ...,,@TT_sophie_GB Amen to that! I work in retail a...,2021-01-02 22:30:34+00:00,0,0,1,0
826,1103628970661199872,AlishanRestaur1,17,460,Proudly serving up the finest Indian cuisine i...,"149 High Street, Tonbridge",Start off 2021 by exploring different flavours...,2021-01-02 16:29:18+00:00,0,0,0,0
827,1322595746131173376,Jig_67,1281,3883,,"Glasgow, Scotland",Same cunts who go on about lack of social dist...,2021-01-01 22:16:07+00:00,10,0,45,0
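When a CSV file is read back, pandas infers the column types again, so IDs come back as integers and timestamps as plain strings unless told otherwise. The following sketch shows one way to control this with the `dtype` and `parse_dates` arguments of `pd.read_csv`. It uses an in-memory buffer in place of a file on disk and a single row taken from the table above.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "author_id": [755038244761460736],
    "created_at": ["2021-12-07 23:05:04+00:00"],
    "likes": [0],
})

buffer = io.StringIO()  # in-memory stand-in for the CSV file
df.to_csv(buffer, index=False)
buffer.seek(0)

# Keep IDs as strings and parse the timestamp column on the way back in
back = pd.read_csv(buffer, dtype={"author_id": str},
                   parse_dates=["created_at"])
print(back.dtypes)
```

Reading IDs as strings is a design choice: they are identifiers rather than quantities, and strings avoid any surprises if the data later passes through a tool with less integer precision.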


In [83]:
df_usa.columns

Index(['author_id', 'username', 'author_followers', 'author_tweets',
       'author_description', 'author_location', 'text', 'created_at',
       'retweets', 'replies', 'likes', 'quote_count'],
      dtype='object')

## 2. United States of America

Let us start by setting our query to get the data from the US.

In [73]:
# Note: the bare country code at the end of the query (US) is matched as an ordinary keyword.
# To restrict results to tweets tagged with a place in a country, use the place_country: operator (e.g. place_country:US).

social_distancing_tweets = []

for response in tweepy.Paginator(client.search_all_tweets,
                                 query = '(socialdistancing OR social distancing) -is:retweet lang:en US',
                                 user_fields = ['username', 'public_metrics', 'description', 'location'],
                                 tweet_fields = ['created_at', 'geo', 'public_metrics', 'text'],
                                 expansions = 'author_id',
                                 start_time = '2021-01-01T00:00:00Z',
                                 end_time = '2021-12-08T00:00:00Z',
                                 max_results = 500):
    # Pause between pages to stay under the endpoint's per-second request limit
    time.sleep(1)
    social_distancing_tweets.append(response)

In [74]:
social_distancing_tweets[0]

Response(data=[<Tweet id=1468367933017817089 text=Today kicks off the annual graduation of my university. Congrats to all. #mcu #มจร Nevertheless, keep social distancing and wearing mask for safety of us and preventing the virus.>, <Tweet id=1468366491263328258 text=There's no way Johnson can legitimately head a press briefing to tell the country it needs to take further social distancing measures because of the Omicron variant.

His lies have put us all at greater risk.

People will feel more justified than ever in breaking Covid rules now.>, <Tweet id=1468363488728911875 text=@SuzanneSibbald You have every right to be angry.  Laughing and joking about social distancing when we obeying the rules people like Rees-Mogg imposed on us.  Plus the comment re Police shows he knew exactly what they were doing was out of order.>, <Tweet id=1468361551518314499 text=@DrMikeMendoza @MarcRummy ... and here is il capo di tutti capi of the NIH, masking up and social distancing. What does he know tha

In [75]:
social_distancing_tweets[1]

Response(data=[<Tweet id=1466842525923172353 text=Covid 19 is still with us. So we're asking customers to keep wearing face masks and observe social distancing. Thank you for your understanding. https://t.co/MKrv4ptMZ1  #WeArePostOffice #Covid19 #KeyWorkers #WearAMask https://t.co/ArHo6V4ky3>, <Tweet id=1466842436500475907 text=What is the scientific and medical basis for the measures being put in place for COVID-19? Should we wear face-masks? Is social distancing helpful? Did lockdown prevent deaths? Many people are afraid but all the evidence shows that most of us have nothing to fear.>, <Tweet id=1466839710945488902 text=We would love for you to come see us play on the 14th! You won't need to register for seats in advance this time, and it is general admission (we will still be practicing social distancing!) Hope to see you there!❄️ https://t.co/3QGIuwkOIm>, <Tweet id=1466833802689400840 text=We have done everything the government has asked us to do, between social distancing, masks

In [76]:
social_distancing_tweets[0].data[1]

<Tweet id=1468366491263328258 text=There's no way Johnson can legitimately head a press briefing to tell the country it needs to take further social distancing measures because of the Omicron variant.

His lies have put us all at greater risk.

People will feel more justified than ever in breaking Covid rules now.>

In [77]:
social_distancing_tweets[0].includes['users'][2]

<User id=1025721129787420673 name=Izzy 🏴󠁧󠁢󠁳󠁣󠁴󠁿🇪🇺🏴󠁧󠁢󠁳󠁣󠁴󠁿ALBA username=Toepostsandals>

In [78]:
social_distancing_tweets[0].includes['users'][2].description

'A Scottish European, Love my family, Scotland, Opera,  travel and I have supported Scottish Independence for 50+ years.  Member of Alba.'

In [79]:
result = []
user_dict = {}
# Loop through each response object
for response in social_distancing_tweets:
    # Take all of the users, and put them into a dictionary of dictionaries with the info we want to keep
    for user in response.includes['users']:
        user_dict[user.id] = {'username': user.username, 
                              'followers': user.public_metrics['followers_count'],
                              'tweets': user.public_metrics['tweet_count'],
                              'description': user.description,
                              'location': user.location
                             }
    for tweet in response.data:
        # For each tweet, find the author's information
        author_info = user_dict[tweet.author_id]
        # Put all of the information we want to keep in a single dictionary for each tweet
        result.append({'author_id': tweet.author_id, 
                       'username': author_info['username'],
                       'author_followers': author_info['followers'],
                       'author_tweets': author_info['tweets'],
                       'author_description': author_info['description'],
                       'author_location': author_info['location'],
                       'text': tweet.text,
                       'created_at': tweet.created_at,
                       'retweets': tweet.public_metrics['retweet_count'],
                       'replies': tweet.public_metrics['reply_count'],
                       'likes': tweet.public_metrics['like_count'],
                       'quote_count': tweet.public_metrics['quote_count']
                      })

# Change this list of dictionaries into a dataframe
df_usa = pd.DataFrame(result)

In [81]:
df_usa.to_csv(path_or_buf = r'C:\Users\r04ra18\Desktop\Esraa-project-data\usa-09122021-tweets.csv', index=False)

In [82]:
pd.read_csv(r'C:\Users\r04ra18\Desktop\Esraa-project-data\usa-09122021-tweets.csv')

Unnamed: 0,author_id,username,author_followers,author_tweets,author_description,author_location,text,created_at,retweets,replies,likes,quote_count
0,1457877116574638081,AphichetSomkam1,39,596,Dhamma l Book l Coffee l Travel,,Today kicks off the annual graduation of my un...,2021-12-07 23:52:50+00:00,0,0,0,0
1,1337690030,nicky_NoPasaran,10555,138542,"“In times of universal deceit, telling the tr...",,There's no way Johnson can legitimately head a...,2021-12-07 23:47:07+00:00,0,0,1,1
2,1025721129787420673,Toepostsandals,1303,22095,"A Scottish European, Love my family, Scotland,...","Paisley, Scotland in Europe",@SuzanneSibbald You have every right to be ang...,2021-12-07 23:35:11+00:00,0,1,1,0
3,1375221547045167107,CGombatto,15,4067,Amateur political cartoonist.,,@DrMikeMendoza @MarcRummy ... and here is il c...,2021-12-07 23:27:29+00:00,0,0,3,0
4,803643675745976320,ThomasPHitchens,339,32018,Retired USAF C2. Country over party. ⚖️,United States,@LowellStewart8 @DonRedman5 @SaraCarterDC Vacc...,2021-12-07 23:16:41+00:00,0,2,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
82167,39535212,acnetj,164,6791,,,1/ Glad this year is coming to an end. It suck...,2021-01-01 00:08:28+00:00,0,1,0,0
82168,1305690287734431746,ZaayWitDaFN,87,1145,Dat One Ugly Nigga Y’all Love To Hate🙂💔,"Selma, AL",@Amiinahhhh i hate that 2020 drifted us apar...,2021-01-01 00:04:26+00:00,0,1,0,0
82169,712416733672374272,greenlightguapo,100,4295,Arshaan’s Twitter arc,"Ontario, Canada",i hate that 2020 drifted us apart but i know w...,2021-01-01 00:03:43+00:00,0,1,2,0
82170,1298839316626583552,goodnightnwoozi,142,6128,@Lakers • #LakeShow • @ArianaGrande,17,@ArianaGrande i hate that 2020 drifted us apar...,2021-01-01 00:01:07+00:00,0,0,1,0


## 3. Worldwide Dataset

Let us start by setting our query to get the data from all over the world.

In [57]:
social_distancing_tweets = []

for response in tweepy.Paginator(client.search_all_tweets,
                                 query = '(socialdistancing OR social distancing) -is:retweet lang:en',
                                 user_fields = ['username', 'public_metrics', 'description', 'location'],
                                 tweet_fields = ['created_at', 'geo', 'public_metrics', 'text'],
                                 expansions = 'author_id',
                                 start_time = '2021-01-01T00:00:00Z',
                                 end_time = '2021-12-08T00:00:00Z',
                                 max_results = 500):
    time.sleep(1)
    social_distancing_tweets.append(response)

Rate limit exceeded. Sleeping for 220 seconds.
Rate limit exceeded. Sleeping for 218 seconds.
Rate limit exceeded. Sleeping for 223 seconds.
Rate limit exceeded. Sleeping for 226 seconds.
Rate limit exceeded. Sleeping for 221 seconds.


Let us view the first response:

In [58]:
social_distancing_tweets[0]

Response(data=[<Tweet id=1381034256210792451 text=My Moms Zoom birthday party went great.  She cried tears of joy when she saw her grand kids and everyone on line 
Covid lock down /social distancing is hard on the older folks. Take care of your loved ones.>, <Tweet id=1381034084986654720 text=@rosepoptosis @Dukesbetterhalf @Soildoc780 @love_truth_now Here’s my science... permanent social distancing. Now go spread your fear elsewhere. There is a special place in hell for people like you. Please tell me you’ve gad both. That would be the perfect end to my day!>, <Tweet id=1381034041579876357 text=Social Distancing - Self-Isolated  - Stressed ??? Learn how to relax with this book …Stress is harmful to your Health. This new Kindle eBook will help you dramatically reduce the stress in your life and help you live longer. FREE on Kindle Unlimited 
https://t.co/NxYbiQlE8M>, <Tweet id=1381033828270149634 text=@trentjohnsen @DavidPatersonca @merry123459 That romanticized notion of "Give me liber

Note that we followed the best practice above of keeping the raw responses as they were returned. Ideally, we would also write all of the raw responses to a file. For long-running queries (e.g., if we need to get hundreds of thousands of tweets), we might write the results to a file as they arrive; if the script stops, we can open the file, retrieve the last tweet, and use its ID to tell the script where to resume.
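The checkpointing idea can be sketched with a JSON Lines file, where each line holds one tweet. The helper names `append_tweets` and `oldest_tweet_id` are hypothetical, and plain dictionaries stand in for raw tweet payloads. The two IDs are taken from the outputs in this notebook, while the text values are placeholders.

```python
import json
import os
import tempfile


def append_tweets(path, tweets):
    """Append one JSON object per tweet to a JSON Lines file."""
    with open(path, "a", encoding="utf-8") as f:
        for tweet in tweets:
            f.write(json.dumps(tweet) + "\n")


def oldest_tweet_id(path):
    """Return the smallest tweet ID in the file, or None if there is none."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        ids = [int(json.loads(line)["id"]) for line in f if line.strip()]
    return min(ids) if ids else None


path = os.path.join(tempfile.mkdtemp(), "tweets.jsonl")
append_tweets(path, [{"id": "1381034256210792451", "text": "placeholder"}])
append_tweets(path, [{"id": "1380971353105334277", "text": "placeholder"}])

resume_from = oldest_tweet_id(path)
print(resume_from)
```

Because search results arrive newest first, the smallest ID marks how far back the collection got. One option, which should be checked against the API reference before relying on it, is to pass that value as the `until_id` argument of `client.search_all_tweets` when restarting.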

The other problem is that the object that is returned is nested, with the tweet data in `.data` and the user data in `.includes['users']`.

Let us view the tweet ID and tweet text:

In [59]:
social_distancing_tweets[2].data[2]

<Tweet id=1380971353105334277 text=@mnannamay ikr they’re not practicing proper social distancing smh 🤦>

And let us view the user ID, name, and username.

In [60]:
social_distancing_tweets[0].includes['users'][2]

<User id=513604551 name=Harry Miles username=AdventuresHarry>

Note that both of these are objects. The data that we asked for in `user_fields` and `tweet_fields` above are attributes of the objects. For example, here's the user's description:

In [61]:
social_distancing_tweets[0].includes['users'][2].description

"Harry Miles, medically retired USAF fighter pilot, owner, Solutions, with private detective services. Read Harry's War by Ed Benjamin at all eBooksellers."

We then reorganize these into a flat file, which means connecting a tweet to the user data of the user who wrote it. I show an example of how to do that here:

In [62]:
result = []
user_dict = {}
# Loop through each response object
for response in social_distancing_tweets:
    # Take all of the users, and put them into a dictionary of dictionaries with the info we want to keep
    for user in response.includes['users']:
        user_dict[user.id] = {'username': user.username, 
                              'followers': user.public_metrics['followers_count'],
                              'tweets': user.public_metrics['tweet_count'],
                              'description': user.description,
                              'location': user.location
                             }
    for tweet in response.data:
        # For each tweet, find the author's information
        author_info = user_dict[tweet.author_id]
        # Put all of the information we want to keep in a single dictionary for each tweet
        result.append({'author_id': tweet.author_id, 
                       'username': author_info['username'],
                       'author_followers': author_info['followers'],
                       'author_tweets': author_info['tweets'],
                       'author_description': author_info['description'],
                       'author_location': author_info['location'],
                       'text': tweet.text,
                       'created_at': tweet.created_at,
                       'retweets': tweet.public_metrics['retweet_count'],
                       'replies': tweet.public_metrics['reply_count'],
                       'likes': tweet.public_metrics['like_count'],
                       'quote_count': tweet.public_metrics['quote_count']
                      })

# Change this list of dictionaries into a dataframe
df = pd.DataFrame(result)

Let us view our dataframe:

In [63]:
df

Unnamed: 0,author_id,username,author_followers,author_tweets,author_description,author_location,text,created_at,retweets,replies,likes,quote_count
0,1112922648986746880,riego_me2,97,21638,ADAB (Assigned Disobedient at Birth)\nDogs (Ye...,#LEOSARMY,My Moms Zoom birthday party went great. She c...,2021-04-10 23:59:59+00:00,0,0,2,0
1,955484880,BulldogMama1,857,24225,"I ♥️ 🇨🇦 oil & gas, Pure Blood VotePPC, Pure Blood",Canada,@rosepoptosis @Dukesbetterhalf @Soildoc780 @lo...,2021-04-10 23:59:18+00:00,0,0,0,0
2,513604551,AdventuresHarry,350,33313,"Harry Miles, medically retired USAF fighter pi...",San antonio metro area,Social Distancing - Self-Isolated - Stressed ...,2021-04-10 23:59:08+00:00,0,0,0,0
3,992209789,Kelans27,1471,88951,Husband & proud father. Retired teacher & coac...,Trad'l Land-Anishnaabeg People,@trentjohnsen @DavidPatersonca @merry123459 Th...,2021-04-10 23:58:17+00:00,3,0,7,0
4,850511009001426944,joannes11008424,8,3853,,,"I’d like to know how, if I don’t own a gun, re...",2021-04-10 23:58:14+00:00,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
740432,801852,DonNantwich,4308,85043,Aquila non capit muscas,Mon Amour,@normAL219 @SkyNews They seem to have three do...,2021-01-01 00:00:15+00:00,0,0,1,0
740433,1263963387244867584,miss602toyou,217,3095,IT'S FALL BABY,Doonya,@g1lbo11 @SLillyLace @RaptorsRealist @inminiva...,2021-01-01 00:00:12+00:00,0,0,1,0
740434,233592596,Altwellnessctr,100,1044,We provide high quality services and education...,"Albuquerque, NM",If you’re feeling stressed or anxious about so...,2021-01-01 00:00:03+00:00,0,0,0,0
740435,10385322,Tomedes,1886,17711,Tomedes Smart Human #Translation serves major ...,,Let us welcome 2021 with open arms... while st...,2021-01-01 00:00:03+00:00,0,0,0,0


Now let us save the dataframe to a CSV file.

In [64]:
df.to_csv(path_or_buf = r'C:\Users\r04ra18\Desktop\Esraa-project-data\allworld-222-09122021-tweets.csv', index=False)

Let us read the data back and open it for further investigation.

In [66]:
df1 = pd.read_csv(r'C:\Users\r04ra18\Desktop\Esraa-project-data\final-data\allworld-09122021-tweets.csv')

In [67]:
df1

Unnamed: 0,author_id,username,author_followers,author_tweets,author_description,author_location,text,created_at,retweets,replies,likes,quote_count
0,1463983950720864259,Killzasnowflak3,1,19,Just want to put snow flakes in A&E,,"@SadiqKhan taking covid 19 very seriously , no...",2021-12-08 23:58:23+00:00,0,0,0,0
1,28979335,imthedarkknight,1335,20408,I write my own tweets. Proper grammar essentia...,"Toronto, Ontario",@balkissoon #wheresmybus #dobetter #ttc staff ...,2021-12-08 23:57:28+00:00,3,0,7,3
2,1433212442222501893,Will19986276,1,560,"freedom loving, hard charging, high speed, low...",,@libsoftiktok social distancing is entirely co...,2021-12-08 23:56:33+00:00,0,0,0,0
3,738510655,AverageKell,81,1187,,,At @AmericanAir social distancing on a half fu...,2021-12-08 23:56:21+00:00,0,0,1,0
4,945020485,TravisFuguet,1162,28057,"Geographer, NASCAR fan (Kyle Busch, Erik Jones...","Willow Grove, PA",@6abc SOMETHING NEEDS TO BE DONE ABOUT COVID! ...,2021-12-08 23:55:39+00:00,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1805371,801852,DonNantwich,4308,85036,Aquila non capit muscas,Mon Amour,@normAL219 @SkyNews They seem to have three do...,2021-01-01 00:00:15+00:00,0,0,1,0
1805372,1263963387244867584,miss602toyou,217,3095,IT'S FALL BABY,Doonya,@g1lbo11 @SLillyLace @RaptorsRealist @inminiva...,2021-01-01 00:00:12+00:00,0,0,1,0
1805373,233592596,Altwellnessctr,100,1044,We provide high quality services and education...,"Albuquerque, NM",If you’re feeling stressed or anxious about so...,2021-01-01 00:00:03+00:00,0,0,0,0
1805374,10385322,Tomedes,1886,17707,Tomedes Smart Human #Translation serves major ...,,Let us welcome 2021 with open arms... while st...,2021-01-01 00:00:03+00:00,0,0,0,0


If we encounter an error, we may need to rerun the code starting from the date of the last collected tweet, save the new results into a different file, and read the earlier file back in.
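To restart from a date, the oldest `created_at` value collected so far can be converted into the timestamp format used for `start_time` and `end_time` in the queries above. This is a minimal sketch: `next_end_time` is a hypothetical helper, and the two sample timestamps are taken from the dataframe output shown earlier.

```python
from datetime import datetime, timezone


def next_end_time(created_at_values):
    """Format the oldest collected timestamp as 'YYYY-MM-DDTHH:MM:SSZ'."""
    oldest = min(datetime.fromisoformat(v) for v in created_at_values)
    return oldest.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


collected = ["2021-04-10 23:59:59+00:00", "2021-04-10 23:58:17+00:00"]
print(next_end_time(collected))
# 2021-04-10T23:58:17Z
```

The result can then replace the `end_time` value in the query, so that the rerun only fetches tweets older than those already saved. How the API treats tweets posted in that exact boundary second is worth checking in the API reference, since a small overlap or gap is possible.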

In [68]:
df2 = pd.read_csv(r'C:\Users\r04ra18\Desktop\Esraa-project-data\final-data\allworld-222-09122021-tweets.csv')

In [69]:
df2

Unnamed: 0,author_id,username,author_followers,author_tweets,author_description,author_location,text,created_at,retweets,replies,likes,quote_count
0,1112922648986746880,riego_me2,97,21638,ADAB (Assigned Disobedient at Birth)\nDogs (Ye...,#LEOSARMY,My Moms Zoom birthday party went great. She c...,2021-04-10 23:59:59+00:00,0,0,2,0
1,955484880,BulldogMama1,857,24225,"I ♥️ 🇨🇦 oil & gas, Pure Blood VotePPC, Pure Blood",Canada,@rosepoptosis @Dukesbetterhalf @Soildoc780 @lo...,2021-04-10 23:59:18+00:00,0,0,0,0
2,513604551,AdventuresHarry,350,33313,"Harry Miles, medically retired USAF fighter pi...",San antonio metro area,Social Distancing - Self-Isolated - Stressed ...,2021-04-10 23:59:08+00:00,0,0,0,0
3,992209789,Kelans27,1471,88951,Husband & proud father. Retired teacher & coac...,Trad'l Land-Anishnaabeg People,@trentjohnsen @DavidPatersonca @merry123459 Th...,2021-04-10 23:58:17+00:00,3,0,7,0
4,850511009001426944,joannes11008424,8,3853,,,"I’d like to know how, if I don’t own a gun, re...",2021-04-10 23:58:14+00:00,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
740432,801852,DonNantwich,4308,85043,Aquila non capit muscas,Mon Amour,@normAL219 @SkyNews They seem to have three do...,2021-01-01 00:00:15+00:00,0,0,1,0
740433,1263963387244867584,miss602toyou,217,3095,IT'S FALL BABY,Doonya,@g1lbo11 @SLillyLace @RaptorsRealist @inminiva...,2021-01-01 00:00:12+00:00,0,0,1,0
740434,233592596,Altwellnessctr,100,1044,We provide high quality services and education...,"Albuquerque, NM",If you’re feeling stressed or anxious about so...,2021-01-01 00:00:03+00:00,0,0,0,0
740435,10385322,Tomedes,1886,17711,Tomedes Smart Human #Translation serves major ...,,Let us welcome 2021 with open arms... while st...,2021-01-01 00:00:03+00:00,0,0,0,0


Also, we may need to merge the two datasets together to build one dataset.

In [70]:
data_df = pd.concat([df1, df2]).reset_index(drop=True)

In [71]:
data_df

Unnamed: 0,author_id,username,author_followers,author_tweets,author_description,author_location,text,created_at,retweets,replies,likes,quote_count
0,1463983950720864259,Killzasnowflak3,1,19,Just want to put snow flakes in A&E,,"@SadiqKhan taking covid 19 very seriously , no...",2021-12-08 23:58:23+00:00,0,0,0,0
1,28979335,imthedarkknight,1335,20408,I write my own tweets. Proper grammar essentia...,"Toronto, Ontario",@balkissoon #wheresmybus #dobetter #ttc staff ...,2021-12-08 23:57:28+00:00,3,0,7,3
2,1433212442222501893,Will19986276,1,560,"freedom loving, hard charging, high speed, low...",,@libsoftiktok social distancing is entirely co...,2021-12-08 23:56:33+00:00,0,0,0,0
3,738510655,AverageKell,81,1187,,,At @AmericanAir social distancing on a half fu...,2021-12-08 23:56:21+00:00,0,0,1,0
4,945020485,TravisFuguet,1162,28057,"Geographer, NASCAR fan (Kyle Busch, Erik Jones...","Willow Grove, PA",@6abc SOMETHING NEEDS TO BE DONE ABOUT COVID! ...,2021-12-08 23:55:39+00:00,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
2545808,801852,DonNantwich,4308,85043,Aquila non capit muscas,Mon Amour,@normAL219 @SkyNews They seem to have three do...,2021-01-01 00:00:15+00:00,0,0,1,0
2545809,1263963387244867584,miss602toyou,217,3095,IT'S FALL BABY,Doonya,@g1lbo11 @SLillyLace @RaptorsRealist @inminiva...,2021-01-01 00:00:12+00:00,0,0,1,0
2545810,233592596,Altwellnessctr,100,1044,We provide high quality services and education...,"Albuquerque, NM",If you’re feeling stressed or anxious about so...,2021-01-01 00:00:03+00:00,0,0,0,0
2545811,10385322,Tomedes,1886,17711,Tomedes Smart Human #Translation serves major ...,,Let us welcome 2021 with open arms... while st...,2021-01-01 00:00:03+00:00,0,0,0,0


Let us explore a few of our tweets:

In [78]:
data_df['text'][5]

"@Michell69397997 Unless I missed something, this is a false statement, and nobody is enforcing it anyway. They've simply reverted to social distancing. \nAgreed, the mandate is a political overreach, but it's also optional and not mandatory if accommodations can be met.\n\n https://t.co/WPcBnMcPxx"

In [81]:
dups_df = data_df[data_df.duplicated(subset='text', keep='first')]

In [84]:
dups_df['text'][1085]

'@PoliticsForAlI We’ve tried:\nLockdowns\nSocial Distancing\nMasks \nShutting down economy \nWorking from home\nVaccinations by the tens of MILLIONS\n..yet we’re still here\nWhen are people going to accept that we will have to live with it?\nLets try:\nGet on with your life as you see fit! #OmicronVariant'

In [85]:
dups_df['text'][1103]

'@PoliticsForAlI We’ve tried:\nLockdowns\nSocial Distancing\nMasks \nShutting down economy \nWorking from home\nVaccinations by the tens of MILLIONS\n..yet we’re still here\nWhen are people going to accept that we will have to live with it?\nLets try:\nGet on with your life as you see fit! #OmicronVariant'
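
The two outputs above are identical, which is exactly what `duplicated(subset='text', keep='first')` flags: every repeat of a text after its first occurrence. The following is a minimal sketch of counting and then dropping such repeats. The tiny `sample_df` and its values are invented for illustration and are not part of the collected dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for data_df (values are invented).
sample_df = pd.DataFrame({
    "username": ["a", "b", "c", "d"],
    "text": ["stay safe", "keep distance", "stay safe", "stay safe"],
})

# Rows whose text already appeared earlier (the first occurrence is not flagged).
dup_mask = sample_df.duplicated(subset="text", keep="first")
n_duplicates = int(dup_mask.sum())

# Keep only the first occurrence of each text and renumber the rows.
deduped_df = sample_df.drop_duplicates(subset="text", keep="first").reset_index(drop=True)

print(n_duplicates)
print(deduped_df)
```

Whether to drop the repeats depends on the analysis: identical texts posted by different accounts may themselves be a signal worth keeping.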

Let us see how many rows and columns our dataset has. We collected about 2.5 million tweets worldwide to build it.

In [88]:
data_df.shape

(2545813, 12)

Let us save our data to a CSV file and then read the file back.

In [89]:
data_df.to_csv(path_or_buf = r'C:\Users\r04ra18\Desktop\Esraa-project-data\full-222-allworld-09122021-tweets.csv', index=False)

In [90]:
pd.read_csv(r'C:\Users\r04ra18\Desktop\Esraa-project-data\full-222-allworld-09122021-tweets.csv')

Unnamed: 0,author_id,username,author_followers,author_tweets,author_description,author_location,text,created_at,retweets,replies,likes,quote_count
0,1463983950720864259,Killzasnowflak3,1,19,Just want to put snow flakes in A&E,,"@SadiqKhan taking covid 19 very seriously , no...",2021-12-08 23:58:23+00:00,0,0,0,0
1,28979335,imthedarkknight,1335,20408,I write my own tweets. Proper grammar essentia...,"Toronto, Ontario",@balkissoon #wheresmybus #dobetter #ttc staff ...,2021-12-08 23:57:28+00:00,3,0,7,3
2,1433212442222501893,Will19986276,1,560,"freedom loving, hard charging, high speed, low...",,@libsoftiktok social distancing is entirely co...,2021-12-08 23:56:33+00:00,0,0,0,0
3,738510655,AverageKell,81,1187,,,At @AmericanAir social distancing on a half fu...,2021-12-08 23:56:21+00:00,0,0,1,0
4,945020485,TravisFuguet,1162,28057,"Geographer, NASCAR fan (Kyle Busch, Erik Jones...","Willow Grove, PA",@6abc SOMETHING NEEDS TO BE DONE ABOUT COVID! ...,2021-12-08 23:55:39+00:00,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
2545808,801852,DonNantwich,4308,85043,Aquila non capit muscas,Mon Amour,@normAL219 @SkyNews They seem to have three do...,2021-01-01 00:00:15+00:00,0,0,1,0
2545809,1263963387244867584,miss602toyou,217,3095,IT'S FALL BABY,Doonya,@g1lbo11 @SLillyLace @RaptorsRealist @inminiva...,2021-01-01 00:00:12+00:00,0,0,1,0
2545810,233592596,Altwellnessctr,100,1044,We provide high quality services and education...,"Albuquerque, NM",If you’re feeling stressed or anxious about so...,2021-01-01 00:00:03+00:00,0,0,0,0
2545811,10385322,Tomedes,1886,17711,Tomedes Smart Human #Translation serves major ...,,Let us welcome 2021 with open arms... while st...,2021-01-01 00:00:03+00:00,0,0,0,0
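
As a hedged illustration of the save-and-reload step, the sketch below round-trips a small frame through CSV. It assumes we want `created_at` restored as a datetime column, which `read_csv` only does when asked via `parse_dates`. The `sample_df` values are invented, and an in-memory buffer stands in for the real file path.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for data_df (values are invented).
sample_df = pd.DataFrame({
    "author_id": [801852, 10385322],
    "text": ["first tweet", "second tweet"],
    "created_at": pd.to_datetime(
        ["2021-01-01 00:00:15+00:00", "2021-01-01 00:00:03+00:00"]
    ),
})

# Write to an in-memory buffer instead of a real path, without the index column.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
buffer.seek(0)

# Without parse_dates, created_at would come back as plain strings.
reloaded_df = pd.read_csv(buffer, parse_dates=["created_at"])

print(reloaded_df.dtypes)
print(reloaded_df.shape == sample_df.shape)
```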


## 2. `requests`-based version

If we want to work without tweepy, here is some boilerplate code that should work. As you can see, it is considerably more involved.

In [None]:
import requests
import os
import json
import twitter_authentication as config
import time

# Save your bearer token in a file called twitter_authentication.py in this directory
# Should look like this:
# bearer_token = 'YOUR_BEARER_TOKEN_HERE'

bearer_token = config.bearer_token
query = '(socialdistancing) OR (social distancing) OR (#socialdistancing) OR (#social distancing)'
out_file = 'raw_tweets.txt'

search_url = "https://api.twitter.com/2/tweets/search/all"

# Optional params: start_time,end_time,since_id,until_id,max_results,next_token,
# expansions,tweet.fields,media.fields,poll.fields,place.fields,user.fields
query_params = {'query': query,
                'start_time': '2010-01-01T12:00:00Z',
                'tweet.fields': 'author_id,public_metrics',
                'user.fields': 'username',
                'expansions': 'author_id',
                'max_results': 500
               }


def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers


def connect_to_endpoint(url, headers, params, next_token = None):
    if next_token:
        params['next_token'] = next_token
    # Use the url argument rather than the global search_url
    response = requests.request("GET", url, headers=headers, params=params)
    # Pause between requests to stay under the API rate limit
    time.sleep(3.1)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()


def get_tweets(num_tweets, output_fh):
    next_token = None
    tweets_stored = 0
    while tweets_stored < num_tweets:
        headers = create_headers(bearer_token)
        json_response = connect_to_endpoint(search_url, headers, query_params, next_token)
        if json_response['meta']['result_count'] == 0:
            break
        author_dict = {x['id']: x['username'] for x in json_response['includes']['users']}
        for tweet in json_response['data']:
            try:
                tweet['username'] = author_dict[tweet['author_id']]
            except KeyError:
                print(f"No data for {tweet['author_id']}")
            output_fh.write(json.dumps(tweet) + '\n')
            tweets_stored += 1
        try:
            next_token = json_response['meta']['next_token']
        except KeyError:
            break
    return None



def main():
    with open(out_file, 'w') as f:
        get_tweets(500, f)



main()
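
The pagination logic inside `get_tweets` can be tried offline by separating it from the HTTP call. The sketch below is one possible way to do that, not the notebook's own method: `paginate`, `fake_fetch` and `FAKE_PAGES` are hypothetical names, and the canned pages only mimic the `data` / `meta.next_token` shape used above. Unlike `get_tweets`, this version stops exactly at the requested limit rather than finishing the current page.

```python
from typing import Callable, Dict, Iterator, Optional


def paginate(fetch_page: Callable[[Optional[str]], Dict], limit: int) -> Iterator[Dict]:
    """Yield tweets page by page until `limit` is reached or pages run out."""
    next_token = None
    yielded = 0
    while yielded < limit:
        page = fetch_page(next_token)
        if page["meta"]["result_count"] == 0:
            return
        for tweet in page["data"]:
            if yielded >= limit:
                return
            yield tweet
            yielded += 1
        # A missing next_token means this was the last page.
        next_token = page["meta"].get("next_token")
        if next_token is None:
            return


# Hypothetical canned responses standing in for the API (invented values).
FAKE_PAGES = {
    None: {"data": [{"id": "1"}, {"id": "2"}],
           "meta": {"result_count": 2, "next_token": "t1"}},
    "t1": {"data": [{"id": "3"}], "meta": {"result_count": 1}},
}


def fake_fetch(next_token):
    # Look up the canned page for the given token.
    return FAKE_PAGES[next_token]


collected = list(paginate(fake_fetch, limit=10))
print([t["id"] for t in collected])
```

To use it against the real endpoint, `fetch_page` could be a small wrapper around `connect_to_endpoint` that passes the token through.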

In [None]:
tweets = []
with open(out_file, 'r') as f:
    for row in f.readlines():
        tweet = json.loads(row)
        tweets.append(tweet)

In [None]:
tweets[0]
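
Each stored tweet keeps its counts in a nested `public_metrics` dictionary, so a list of such dictionaries does not map directly onto flat columns. As a rough sketch, `pandas.json_normalize` can flatten them into a table. The `sample_tweets` records below are invented and only assume the fields requested in `query_params` plus the `username` added by `get_tweets`.

```python
import pandas as pd

# Hypothetical records shaped like the lines written to raw_tweets.txt (values invented).
sample_tweets = [
    {"id": "1", "author_id": "10", "username": "alice", "text": "stay safe",
     "public_metrics": {"retweet_count": 3, "reply_count": 0,
                        "like_count": 7, "quote_count": 1}},
    {"id": "2", "author_id": "11", "username": "bob", "text": "keep distance",
     "public_metrics": {"retweet_count": 0, "reply_count": 1,
                        "like_count": 0, "quote_count": 0}},
]

# json_normalize flattens nested dicts into dotted column names.
tweets_df = pd.json_normalize(sample_tweets)

# Strip the "public_metrics." prefix for shorter column names.
tweets_df.columns = [c.replace("public_metrics.", "") for c in tweets_df.columns]

print(tweets_df[["username", "retweet_count", "like_count"]])
```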

## References

1. Developer Platform
https://developer.twitter.com/en/docs/twitter-api
2. Search Tweets
https://developer.twitter.com/en/docs/twitter-api/tweets/search/introduction
3. Academic Research access
https://developer.twitter.com/en/products/twitter-api/academic-research
4. Twitter API v2 data dictionary
https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/place

<img src="https://i.imgur.com/VCzUM0V.png">