<a href="https://colab.research.google.com/github/Fatemah-Husain/Extract_Tweets/blob/main/Blog__Twitter_Extractor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Extracting Data from Twitter Using Keywords or Terms**
By Fatemah Husain (f.husain@ku.edu.kw) - More details can be found at https://infoscilab.ku.edu.kw/blog-page/ 




---




In this Colab project, I explain how to use [Tweepy](https://www.tweepy.org/) to access the Twitter API and extract datasets based on keywords or terms. I used Python libraries to support several extra features, such as sentiment analysis and locations. The output dataset is a CSV file.

Before starting with the code, you will need to follow the instructions provided by the [Twitter Developer Platform](https://developer.twitter.com/en) to obtain access credentials for your Twitter account. Once your request is approved, you will receive a consumer_key, consumer_secret, access_token, and access_token_secret, all of which are required to access the Twitter API.
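
One way to keep these four credentials out of the notebook itself is to read them from environment variables. The sketch below is only an illustration and not part of the original workflow: the variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are assumptions made for this example.

```python
import os

# Hypothetical environment variable names; pick whatever suits your setup.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

The returned values could then be used in place of the empty strings in Step 2.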



**Step 1: Importing Libraries**


Three Python libraries are imported to extract and prepare the dataset: Tweepy to connect to the Twitter API, Pandas to organize the data in a dataframe, and csv, Python's built-in module for CSV files (the output file itself is written with Pandas in Step 4).

In [None]:
# -*- coding: utf-8 -*- 
#!/usr/bin/env python
# encoding: utf-8

# Step 1 - Importing libraries
import tweepy
import pandas as pd 
import csv


**Step 2: Accessing Twitter API**

In this step, I provide the Twitter API credentials to connect to the API through Tweepy using the following script. The API gives access to several Twitter features (tweets, retweets, mentions, likes, etc.).

In [None]:
# Step 2 - Authenticate
consumer_key= "" # Insert your consumer key
consumer_secret= "" # Insert your consumer secret

access_token="" # Insert your access token
access_token_secret="" # Insert your access token secret


auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Creating an object from the API
api = tweepy.API(auth)



**Step 3: Setting Parameters and Extracting Tweets**


Parameters act as filters that narrow the search to the Tweets we are targeting and customize the results. The following parameters are used in this exercise:

 

*   q : the keywords or terms used in the search query. To search for multiple keywords, put them in a single quoted string separated by OR.

*   count  : number of tweets per page

*   lang : the language of the tweets, specified by a language code such as “ar” for Arabic or “en” for English. For more information about the languages supported by Twitter and their codes, please check this page.

*   since : the start date of search formatted as YYYY-MM-DD

*   until : the stop date formatted as YYYY-MM-DD

*   result_type : the type of search results, which takes one of three values: “recent” for the most recent results, “popular” for the most popular results, and “mixed” for both popular and real-time results.

*   include_entities : entities provide additional contextual information, including hashtags, user mentions, links, stock tickers (symbols), Twitter polls, and attached media. They are included in the results as an array when this parameter is set to True.

*   tweet_mode : this parameter was added after the character limit per Tweet was extended from 140 to 280. When tweet_mode is set to “extended”, the full untruncated Tweets are returned; when it is set to “compat”, Tweets are truncated to 140 characters.

*   encoding : this parameter might be helpful if the extracted Tweets are not in English; for example, setting the encoding to 'utf-8-sig' works very well for Arabic Tweets. For more details about character encoding in Twitter, you can refer to this webpage.
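
As a small illustration of the `q` parameter described in the list above, the sketch below joins several search terms into one query string separated by OR. The helper name `build_or_query` is hypothetical and introduced only for this example; the hashtags are the ones used later in this post.

```python
def build_or_query(terms):
    """Join search terms into one query string separated by ' OR '."""
    cleaned = [term.strip() for term in terms if term and term.strip()]
    if not cleaned:
        raise ValueError("At least one non-empty search term is required")
    return " OR ".join(cleaned)


hashtags = ["#غبار_الكويت", "#الكويت", "#كويت"]
query = build_or_query(hashtags)
print(query)
```

Keeping the terms in a list makes it easier to add or remove keywords without editing the query string by hand.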


Twitter uses pagination to iterate through timelines and returns the requested data as a series of pages. A page or cursor parameter therefore needs to be provided with each request to manage the pagination loop. I used Tweepy's Cursor object to handle pagination easily.

 
The Cursor object provides two methods for iterating over the results:

*   items : iterates over the results item by item and takes the maximum total number of items to return

*   pages : iterates over the results page by page and takes the maximum number of pages to return
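
To make the difference between iterating by items and by pages concrete, the sketch below imitates cursor-style pagination in plain Python. It does not call Tweepy or Twitter: `fetch_page`, `iter_pages`, and `iter_items` are hypothetical names invented for this illustration, and the fake data source simply slices a list.

```python
from itertools import islice

FAKE_TWEETS = [f"tweet {i}" for i in range(1, 11)]  # stand-in data, 10 items


def fetch_page(cursor, count):
    """Return (page, next_cursor); next_cursor is None after the last page."""
    page = FAKE_TWEETS[cursor:cursor + count]
    next_cursor = cursor + count if cursor + count < len(FAKE_TWEETS) else None
    return page, next_cursor


def iter_pages(count, max_pages=None):
    """Yield whole pages, stopping after max_pages pages if a limit is given."""
    cursor, served = 0, 0
    while cursor is not None and (max_pages is None or served < max_pages):
        page, cursor = fetch_page(cursor, count)
        served += 1
        yield page


def iter_items(count, max_items=None):
    """Yield single items across page boundaries, up to max_items in total."""
    flat = (item for page in iter_pages(count) for item in page)
    return islice(flat, max_items)


pages = list(iter_pages(count=4, max_pages=2))  # two pages of four items each
items = list(iter_items(count=4, max_items=5))  # five items spanning two pages
```

In this sketch the item limit applies to the whole iteration, not to each page, which is why five items are drawn from two pages of four.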


For more information about the available parameters, please check [this webpage](https://docs.tweepy.org/en/stable/api.html#search-tweets).


 

Now let's look into the code and see how it works. There was a severe dust storm when I started writing this post. I decided to see what people were sharing about the storm on Twitter. I will use some keywords that are related to the dust storm in Kuwait as the main parameter to extract tweets.

In [None]:

# Step 3 - Setting parameters
limit = 10000
language = 'ar' # others could be 'en', 'fa', 'tr'
keywords = '#غبار_الكويت OR #الكويت OR #كويت'
startDate = "2022-05-20"
endDate = "2022-05-24"

# Passing the parameters into the Cursor constructor method
# Note: Tweepy 4.0 and later name this method api.search_tweets
public_tweets = tweepy.Cursor( api.search,
                                q= keywords,
                                result_type='recent',
                                since = startDate,
                                until = endDate,
                                count=100,
                                include_entities=True,
                                lang=language,
                                tweet_mode="extended",
                                encoding='utf-8-sig').items(limit)
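
With `tweet_mode="extended"`, the text of each Tweet is in `full_text`, but for Retweets that field can still end with an ellipsis, and the complete text sits on the nested `retweeted_status` object. The helper below is a small sketch of that lookup. `get_full_text` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins for Tweepy status objects so that the example runs without API access.

```python
from types import SimpleNamespace


def get_full_text(tweet):
    """Prefer the original Tweet's text when the status is a Retweet."""
    original = getattr(tweet, "retweeted_status", None)
    if original is not None:
        return original.full_text
    return tweet.full_text


# Stand-in objects that mimic the attributes used above (made-up text).
plain = SimpleNamespace(full_text="Dust everywhere today")
retweet = SimpleNamespace(
    full_text="RT @someone: Dust everywhere to…",
    retweeted_status=SimpleNamespace(full_text="Dust everywhere today"),
)
```

Note that this choice drops the "RT @user:" prefix, so keep the original `full_text` as well if you need to tell Retweets apart.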





**Step 4: Saving Results into a CSV File**

After pulling the dataset, a list is created for each attribute we want to save. Then, all lists are arranged into a Pandas dataframe, which is used to create a Comma-Separated Values (CSV) file that is easy to process and analyze further.

In [None]:
# Defining lists to save the results for each attribute separately
tweet_id_list = []
tweet_text_list = []
tweet_location_list = []
tweet_geo_list = []
user_screen_name_list = []
tweet_created_list = []
tweet_contributors_list = []
tweet_entities_list =[]
tweet_retweet_count_list = []
tweet_source_list = []
tweet_username_list = []
tweet_followers_count_list = []
friends_count_List = []
user_url_list = []
user_desc_list = []


# Iterating through the results to extract each attribute
for tweet in public_tweets:
    tweet_id_list.append(tweet.id)
    tweet_text_list.append(tweet.full_text)
    tweet_location_list.append(tweet.user.location)
    tweet_geo_list.append(tweet.geo)
    user_screen_name_list.append(tweet.user.screen_name)     
    user_url_list.append(tweet.user.url)
    user_desc_list.append(tweet.user.description)
    tweet_source_list.append(tweet.source)
    tweet_created_list.append(tweet.created_at)
    # Stores the Tweet ID as a string; use tweet.contributors for the contributors field
    tweet_contributors_list.append(tweet.id_str)
    tweet_entities_list.append(tweet.entities)
    tweet_retweet_count_list.append(tweet.retweet_count)
    tweet_username_list.append(tweet.user.name)     
    tweet_followers_count_list.append(tweet.user.followers_count)
    friends_count_List.append(tweet.user.friends_count)
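
The loop above keeps many parallel lists in step with one another. As an alternative sketch, each Tweet can be flattened into one dictionary, so that a row is always complete. The function name `tweet_to_row` is hypothetical, only a subset of the columns is shown, and the `SimpleNamespace` object is a stand-in for a Tweepy status object with made-up values.

```python
from types import SimpleNamespace


def tweet_to_row(tweet):
    """Flatten one status object into a dict keyed by output column name."""
    return {
        "tweet_id": tweet.id,
        "tweet_text": tweet.full_text,
        "tweet_location": tweet.user.location,
        "user_screen": tweet.user.screen_name,
        "tweet_created": tweet.created_at,
        "tweet_retweet_count": tweet.retweet_count,
        "tweet_followers_count": tweet.user.followers_count,
    }


# Stand-in for a Tweepy status object, with made-up values.
fake_tweet = SimpleNamespace(
    id=1,
    full_text="example text",
    created_at="2022-05-23 15:31:40",
    retweet_count=0,
    user=SimpleNamespace(
        location="Kuwait", screen_name="example_user", followers_count=10
    ),
)

rows = [tweet_to_row(t) for t in [fake_tweet]]
```

A list of such dictionaries can be passed directly to `pd.DataFrame(rows)` to build the same kind of table.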




In [None]:

# Creating a Pandas dataframe to organize the data into a table
df = pd.DataFrame({
    'tweet_id': tweet_id_list,
    'tweet_text': tweet_text_list, 
    'tweet_location': tweet_location_list,
    'tweet_geo':tweet_geo_list,
    'user_screen': user_screen_name_list,
    'url' : user_url_list,
    'user_desc': user_desc_list,
    'tweet_source': tweet_source_list,
    'tweet_created': tweet_created_list,
    'tweet_contributors': tweet_contributors_list,
    'tweet_entities': tweet_entities_list,
    'tweet_retweet_count': tweet_retweet_count_list,
    'tweet_username': tweet_username_list,
    'tweet_followers_count': tweet_followers_count_list,
    'friends_count':friends_count_List}) 


df

Unnamed: 0,tweet_id,tweet_text,tweet_location,tweet_geo,user_screen,url,user_desc,tweet_source,tweet_created,tweet_contributors,tweet_entities,tweet_retweet_count,tweet_username,tweet_followers_count,friends_count
0,1528888422592417793,RT @diaCmOUGtJxg9LR: @Demugrave @ho_hh @Alhasb...,,,diaCmOUGtJxg9LR,,كويتيه🇰🇼أسمو بهويتي الأصيله🌹أتعايش ذوق وإحترام...,Twitter for Android,2022-05-23 23:59:39,1528888422592417793,"{'hashtags': [], 'symbols': [], 'user_mentions...",1,بنت حُر 🇰🇼,185,48
1,1528888387394236416,RT @7usaini7: إنا لله وإنا إليه راجعون.. \n#أح...,"مكة المكرمة, المملكة العربية ا",,abumazi66371637,,,Twitter for Android,2022-05-23 23:59:31,1528888387394236416,"{'hashtags': [{'text': 'أحمد_القطان', 'indices...",396,سالم محمد باحمدين,53,143
2,1528888354145705984,التفاصيل في الرابط \n.\nhttps://t.co/enYlF4Nf2...,مملكة البحرين .. مملكة العظماء,,Bahraintoday6,,‏‏كن ثرياً بأخلاقك ، غنياً بقناعاتك\n‏كبيراً ب...,Twitter for Android,2022-05-23 23:59:23,1528888354145705984,"{'hashtags': [{'text': 'لبنان', 'indices': [48...",0,Bahrain _today,921,919
3,1528888329483145216,@Demugrave @ho_hh @Alhasban2012 @alialali26 @a...,,,diaCmOUGtJxg9LR,,كويتيه🇰🇼أسمو بهويتي الأصيله🌹أتعايش ذوق وإحترام...,Twitter for Android,2022-05-23 23:59:17,1528888329483145216,"{'hashtags': [{'text': 'هايد_بارك_الكويت', 'in...",1,بنت حُر 🇰🇼,185,48
4,1528888326790471681,RT @y8_travel: ارخص رحلة لـ السفر الى #تايلاند...,,,ashraff2021,,,Twitter for iPhone,2022-05-23 23:59:17,1528888326790471681,"{'hashtags': [{'text': 'تايلاند', 'indices': [...",49,ashraf abed almaghrabi,56,271
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,1528760598783217666,"RT @sabqorg: ""أبو طالب"": قالها مسؤول الشركة.. ...",,,ayd09199078,,احب وطني,Twitter for iPhone,2022-05-23 15:31:44,1528760598783217666,"{'hashtags': [{'text': 'الكويت', 'indices': [7...",2215,عايد,6,54
9996,1528760597214646273,RT @alixao11: سبحان الله \n\n#غبار_الكويت http...,Kuwait,,isllll0,,انا قصص كثيرة،لا أعرف اياً منها تقرأ انت. كاتِ...,Twitter for iPhone,2022-05-23 15:31:44,1528760597214646273,"{'hashtags': [{'text': 'غبار_الكويت', 'indices...",2,§🫀,103,112
9997,1528760589828464640,ترى كثرن يا ربي ال مالهن حل..\n#غبار #غبار_الكويت,KW for now,,2020Mushari,,🇬🇧 A human and Muslim 🇰🇼 FOX 🦊 Spider 🕷️ Real ...,Twitter for Android,2022-05-23 15:31:42,1528760589828464640,"{'hashtags': [{'text': 'غبار', 'indices': [30,...",0,Mushari,141,480
9998,1528760584132501505,دعاء الريح :\n\nربى أسألك خيرها وخير مافيها وخ...,دولة الكويت,,Nasser5AlSubaie,,اليوم فالدنياء وبكره راحلين,Twitter for iPhone,2022-05-23 15:31:40,1528760584132501505,"{'hashtags': [{'text': 'غبار_الكويت', 'indices...",1,ناصر فايز السبيعي,164,126


The last step converts the dataframe to a CSV file so that it can be downloaded. If you are using Google Colab, I also include the code needed to push the file to Google Drive and save it there.

In [None]:
# Converting the dataframe to CSV file
df.to_csv('Kuwait_Dust_Storm_2022.csv', sep=',', index=False, encoding='utf-8-sig')
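
The `utf-8-sig` encoding used above differs from plain `utf-8` only in that it writes a three-byte byte order mark (BOM) at the start of the file, which helps spreadsheet programs such as Excel recognize the file as UTF-8 and display Arabic text correctly. The sketch below shows the difference using only the standard library `csv` module and temporary files. The helper names, file names, and sample row are made up for this illustration.

```python
import codecs
import csv
import os
import tempfile

sample_rows = [{"tweet_id": 1, "tweet_text": "غبار الكويت"}]  # made-up row


def write_rows(path, rows, encoding):
    """Write dict rows to a CSV file using the given text encoding."""
    with open(path, "w", newline="", encoding=encoding) as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def starts_with_bom(path):
    """Return True if the file begins with the UTF-8 byte order mark."""
    with open(path, "rb") as handle:
        return handle.read(3) == codecs.BOM_UTF8


with tempfile.TemporaryDirectory() as folder:
    with_bom = os.path.join(folder, "with_bom.csv")
    without_bom = os.path.join(folder, "without_bom.csv")
    write_rows(with_bom, sample_rows, "utf-8-sig")
    write_rows(without_bom, sample_rows, "utf-8")
    # First flag is for utf-8-sig, second for plain utf-8.
    bom_flags = (starts_with_bom(with_bom), starts_with_bom(without_bom))
```

When reading such a file back in Python, opening it with `encoding="utf-8-sig"` strips the BOM again.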

In [None]:
# Saving the file to Google drive

file_name = "Kuwait_Dust_Storm_2022.csv"

from googleapiclient.http import MediaFileUpload
from googleapiclient.discovery import build
from google.colab import auth  # note: this rebinds the name auth used for Tweepy in Step 2

auth.authenticate_user()
drive_service = build('drive', 'v3')

def save_file_to_drive(name, path):
  file_metadata = {'name': name, 'mimeType': 'application/octet-stream'}
  media = MediaFileUpload(path, mimetype='application/octet-stream', resumable=True)
  created = drive_service.files().create(body=file_metadata, media_body=media, fields='id').execute()
  
  return created

save_file_to_drive(file_name, file_name)

{'id': '1vrLbdEKlrLeQO2xmS0yzrS53El89lI0S'}