# Tutorial on how to collect tweets based on key words

This notebook shows you how to use the [tweepy](https://www.tweepy.org/) python library to collect tweets from Twitter based on key words. 

This notebook was created as an example of a dataset you can collect for the [AFD Gender-Based Violence Dataset Collection Challenge](https://zindi.africa/competitions/afd-gender-based-violence-dataset-collection-challenge).

Trigger Warning: This notebook collects tweets that could contain sensitive information for some readers. 

## STEP 1: PYTHON PACKAGES INSTALLATION

Install the following python packages that will help you to collect data from twitter.com 

In [1]:
!
pip install tweepy 



You should consider upgrading via the 'C:\Users\USER\anaconda3\python.exe -m pip install --upgrade pip' command.


In [2]:
!pip install unidecode



You should consider upgrading via the 'C:\Users\USER\anaconda3\python.exe -m pip install --upgrade pip' command.


## STEP 2: IMPORT IMPORTANT PACKAGES 

In [3]:
#import dependencies
import tweepy
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import json
from unidecode import unidecode
import time
import datetime
from tqdm import tqdm 
import pandas as pd  
import numpy as np 

## STEP 3: AUTHENTICATING TO TWITTER'S API

(a) First, apply for a developer account to access the API. The Standard APIs are sufficient for this tutorial. They’re free, but have some limitations that we’ll learn to work around in this tutorial.

Click here to apply: [apply for developer account to acces the API](https://developer.twitter.com/en/apply-for-access)

(b) Once your developer account is setup, create an app that will make use of the API by clicking on your username in the top right corner to open the drop down menu, and clicking “Apps” as shown below. Then select “Create an app” and fill out the form. 

(c) Now that you have created a developer account and an app, you should have a set of keys to connect to the Twitter API. Specifically, you’ll have an
- API key
- API secret key
- Access token
- Access token secret

These could be inserted directly into your code to connect to the Twitter API, as shown below.

In [4]:
consumer_key = 'LBBW1m2GZpjs83kjbN6nhhe0n'
consumer_secret = 'xUjsrlbzvsFARfPx1wNjPe0G7IRV2uvDekBNmIZnrvbk2LUQdc'
access_token = '1239559361053294594-0LafYir3OMuCGjYMZQLwEZJOw3kD5K'
access_token_secret = 'CnO07h4OcuibLBZYtqRoVc2hVAsdqNQyqgnb0lzmDULnv'
bearer_token = 'AAAAAAAAAAAAAAAAAAAAAHvNPAEAAAAAPup5Zx9SPTt8oBTeWdQL%2BIu%2F194%3DrCF6WkX0cp6rneFMYpnIeHMEplinIGbvyY7prKUq2IKG9p1wg1'

## STEP 4:  CONNECT TO TWITTER API USING THE SECRET KEY AND ACCESS TOKEN

In [5]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

## STEP 5: DEFINE A FUNCTION THAT WILL TAKE OUR SEARCH QUERY

In [6]:
def tweetSearch(query, limit):
    """
    This function will search a query provided in the twitter and,
    retun a list of all tweets that have a query. 
    """

    # Create a blank variable
    tweets = []

    # Iterate through Twitter using Tweepy to find our query with our defined limit
    for page in tweepy.Cursor(
        api.search, q=query, count=limit, tweet_mode="extended"
    ).pages(limit):
        for tweet in page:
            tweets.append(tweet)

    # return tweets
    return tweets

In [7]:
tweets

NameError: name 'tweets' is not defined

## STEP 6: CREATE A FUNCTION TO SAVE TWEETS INTO A DATAFRAME

In [8]:
def tweets_to_data_frame(tweets):
    """
    This function will receive tweets and collect specific data from it such as place, tweet's text,likes 
    retweets and save them into a pandas data frame.
    
    This function will return a pandas data frame that contains data from twitter.
    """
    df = pd.DataFrame(data=[tweet.full_text.encode('utf-8') for tweet in tweets], columns=["Tweets"])

    df["id"] = np.array([tweet.id for tweet in tweets])
    df["lens"] = np.array([len(tweet.full_text) for tweet in tweets])
    df["date"] = np.array([tweet.created_at for tweet in tweets])
    df["place"] = np.array([tweet.place for tweet in tweets])
    df["coordinateS"] = np.array([tweet.coordinates for tweet in tweets])
    df["lang"] = np.array([tweet.lang for tweet in tweets])
    df["source"] = np.array([tweet.source for tweet in tweets])
    df["likes"] = np.array([tweet.favorite_count for tweet in tweets])
    df["retweets"] = np.array([tweet.retweet_count for tweet in tweets])

    return df

## STEP 7: ADD TWITTER HASHTAGS RELATED TO GENDER-BASED VIOLENCE

In [9]:
# add hashtags in the following list
hashtags = ['#MeToo','#gbv','#GBV','#RapeJoke','#ChildMarriage','#BringBackOurGirls','#MaybeHeDoesntHitYou','#YesAllWomen','#GBVinSA','#genderbasedviolence','#GBVmustfall','#NoBail4GBV','#Justice4MaksandTed','#ENDGBV','#protectMe','#SGBV','#Justice4Tebogo','#justiceforuwa','#sexualassaultawarenessmonth','#YouthAgainstGBV','#EndRapeinNigeria','#HowShortWasYourSkirt','#16dayofactivisim','#endsgbv','#NationGender','#sexism','#GBV', '#rape']


In [12]:
!pip install simplejson



Collecting simplejson
  Downloading simplejson-3.17.2-cp37-cp37m-win_amd64.whl (73 kB)
Installing collected packages: simplejson
Successfully installed simplejson-3.17.2


You should consider upgrading via the 'C:\Users\USER\anaconda3\python.exe -m pip install --upgrade pip' command.


## STEP 8: RUN BOTH FUNCTIONS TO COLLECT DATA FROM TWITTER RELATED TO THE HASHTAGS LISTED ABOVE

In [13]:
total_tweets = 0

"""
The following for loop will collect a tweets that have the hashtags
 mentioned in the list and save the tweets into csv file
"""

for n in tqdm(hashtags):
    # first we fetch all tweets that have specific hashtag
    hash_tweets = tweetSearch(query=n,limit=7000)
    total_tweets += int(len(hash_tweets))
    
    # second we convert our tweets into datarame
    df = tweets_to_data_frame(hash_tweets)
    
    #third we save the dataframe into csv file
    df.to_csv("tweetss.csv".format(n))

  0%|          | 0/28 [02:33<?, ?it/s]


TweepError: Failed to parse JSON payload: Unterminated string starting at: line 1 column 651822 (char 651821)

In [17]:
# show total number of tweets collected
print("total_tweets: {}".format(total_tweets))

total_tweets: 9226


For more tweepy configuration please read the tweepy documentation [here](https://docs.tweepy.org/en/latest/).

This notebook prepared by

**Davis David** 

**Zindi Ambassador in Tanzania**

My zindi Profile: [https://zindi.africa/users/Davisy](https://zindi.africa/users/Davisy)

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('tweets.csv')

In [4]:
df

Unnamed: 0.1,Unnamed: 0,Tweets,id,lens,date,place,coordinateS,lang,source,likes,retweets
0,0,b'RT @DOpolitics_in: \xe0\xa4\xad\xe0\xa4\xbe\...,1389917280402821127,137,2021-05-05 12:17:57,,,hi,Twitter for Android,0,558
1,1,b'RT @DOpolitics_in: \xe0\xa4\xad\xe0\xa4\xbe\...,1389917258483265552,137,2021-05-05 12:17:52,,,hi,Twitter for Android,0,558
2,2,b'RT @DOpolitics_in: \xe0\xa4\xad\xe0\xa4\xbe\...,1389917204435533826,137,2021-05-05 12:17:39,,,hi,Twitter for Android,0,558
3,3,b'RT @DOpolitics_in: \xe0\xa4\xad\xe0\xa4\xbe\...,1389917139763556352,137,2021-05-05 12:17:24,,,hi,Twitter for iPhone,0,558
4,4,b'RT @DOpolitics_in: \xe0\xa4\xad\xe0\xa4\xbe\...,1389917051599212545,137,2021-05-05 12:17:03,,,hi,Twitter for Android,0,558
...,...,...,...,...,...,...,...,...,...,...,...
5226,5226,b'para disfrutar con mam\xc3\xa1\xf0\x9f\x91\x...,1387129471707271169,264,2021-04-27 19:40:12,,,es,Twitter for Android,2,1
5227,5227,b'Folks on #AssamTwitter call the brutal rape ...,1387129070815563780,270,2021-04-27 19:38:36,,,en,Twitter for Android,3,0
5228,5228,"b""RT @MahdiLiima: #Rape and #sexual violence a...",1387126822119292934,140,2021-04-27 19:29:40,,,en,Twitter for iPhone,0,25
5229,5229,b'RT @WMurphyLaw: Excellent piece in @theatlan...,1387125343329390599,140,2021-04-27 19:23:48,,,en,Twitter for iPhone,0,4
