# Tutorial on how to collect tweets based on keywords

This notebook shows you how to use the [tweepy](https://www.tweepy.org/) Python library to collect tweets from Twitter based on keywords.

This notebook was created as an example of a dataset you can collect for the [AFD Gender-Based Violence Dataset Collection Challenge](https://zindi.africa/competitions/afd-gender-based-violence-dataset-collection-challenge).

TW: This notebook collects tweets that could contain sensitive content.

## STEP 1: PYTHON PACKAGES INSTALLATION

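Install the following Python packages, which will help you collect data from twitter.com. The code below uses the tweepy 3.x interface (the install log shows version 3.10.0); tweepy 4.x renamed or removed some of the calls used here, such as `api.search` and `StreamListener`.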

In [1]:
# !pip install tweepy 

Collecting tweepy
  Downloading tweepy-3.10.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: tweepy
Successfully installed tweepy-3.10.0


In [2]:
# !pip install unidecode



## STEP 2: IMPORT IMPORTANT PACKAGES 

In [1]:
#import dependencies
import tweepy
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import json
from unidecode import unidecode
import time
import datetime
from tqdm import tqdm 
import pandas as pd  
import numpy as np 

## STEP 3: AUTHENTICATE TO TWITTER'S API

(a) First, apply for a developer account to access the API. The Standard APIs are sufficient for this tutorial. They’re free, but have some limitations that we’ll learn to work around in this tutorial.

Click here to apply: [apply for a developer account to access the API](https://developer.twitter.com/en/apply-for-access)

(b) Once your developer account is set up, create an app that will use the API: click your username in the top right corner to open the drop-down menu, then click “Apps”. Select “Create an app” and fill out the form.

(c) Now that you have created a developer account and an app, you should have a set of keys to connect to the Twitter API. Specifically, you’ll have the following:
- API key
- API secret key
- Access token
- Access token secret

These can be inserted directly into your code to connect to the Twitter API, as shown below. Replace each placeholder with your own value, and keep real keys out of any notebook you share.

In [23]:
consumer_key = 'YOUR_API_KEY'
consumer_secret = 'YOUR_API_SECRET_KEY'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'
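
Hard-coding keys in a notebook exposes them to anyone who can read the file. A minimal sketch of one alternative, assuming the four values are stored in environment variables. The variable names (`TWITTER_API_KEY` and so on) and the helper `load_twitter_credentials` are hypothetical and not part of tweepy:

```python
import os

# Hypothetical environment variable names; pick whatever suits your setup.
REQUIRED_VARS = (
    "TWITTER_API_KEY",
    "TWITTER_API_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Collect every variable that is absent or empty so the error names them all
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values could then be assigned to `consumer_key`, `consumer_secret`, `access_token` and `access_token_secret` before connecting.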

## STEP 4: CONNECT TO TWITTER API USING THE SECRETS

In [24]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

## STEP 5: DEFINE A FUNCTION THAT WILL TAKE OUR SEARCH QUERY

In [25]:
def tweetSearch(query, limit):
    """
    Search Twitter for the given query and return a list of up to
    `limit` matching tweets.
    """

    # Create an empty list to hold the results
    tweets = []

    # Iterate through the search results with tweepy.Cursor.
    # The standard search API returns at most 100 tweets per request,
    # so `count` is capped at 100 and `limit` bounds the total collected.
    for tweet in tweepy.Cursor(
        api.search, q=query, count=min(limit, 100), tweet_mode="extended"
    ).items(limit):
        tweets.append(tweet)

    # return tweets
    return tweets

## STEP 6: CREATE A FUNCTION TO SAVE TWEETS INTO A DATAFRAME

In [26]:
def tweets_to_data_frame(tweets):
    """
    Extract selected fields from each tweet (text, id, length, date, place,
    coordinates, language, source, likes and retweets) and store them in a
    pandas DataFrame.

    Returns the resulting DataFrame.
    """
    df = pd.DataFrame(data=[tweet.full_text.encode('utf-8') for tweet in tweets], columns=["Tweets"])

    df["id"] = np.array([tweet.id for tweet in tweets])
    df["lens"] = np.array([len(tweet.full_text) for tweet in tweets])
    df["date"] = np.array([tweet.created_at for tweet in tweets])
    df["place"] = np.array([tweet.place for tweet in tweets])
    df["coordinateS"] = np.array([tweet.coordinates for tweet in tweets])
    df["lang"] = np.array([tweet.lang for tweet in tweets])
    df["source"] = np.array([tweet.source for tweet in tweets])
    df["likes"] = np.array([tweet.favorite_count for tweet in tweets])
    df["retweets"] = np.array([tweet.retweet_count for tweet in tweets])

    return df

## STEP 7: ADD TWITTER HASHTAGS RELATED TO GENDER BASED VIOLENCE

In [34]:
# add hashtags to the following list
hashtags = [
    '#GBV',
    '#sexism',
    '#rape',
    '#domesticviolence',
]
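
Hashtags cannot contain spaces, and the sample output later in this notebook is dominated by retweets. A small sketch of a helper that normalises each entry and appends Twitter's `-filter:retweets` search operator. `build_queries` is a hypothetical name, and whether to drop retweets is a choice for your own dataset:

```python
def build_queries(hashtags, exclude_retweets=True):
    """Turn a list of hashtags into search query strings."""
    queries = []
    for tag in hashtags:
        # Remove all whitespace and make sure there is exactly one leading '#'
        cleaned = "#" + "".join(tag.split()).lstrip("#")
        if cleaned == "#":
            continue  # skip empty entries
        if exclude_retweets:
            cleaned += " -filter:retweets"
        queries.append(cleaned)
    return queries


print(build_queries(["#GBV", "# domestic violence"]))
```

Each resulting string could be passed as the `query` argument of `tweetSearch`.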

## STEP 8: RUN BOTH FUNCTIONS TO COLLECT DATA FROM TWITTER RELATED TO THE HASHTAGS LISTED ABOVE

In [36]:
total_tweets = 0

# The following loop collects tweets that contain each hashtag in the list
# and saves them to one CSV file per hashtag.

for n in tqdm(hashtags):
    # first, fetch the tweets that contain the hashtag
    hash_tweets = tweetSearch(query=n, limit=7000)
    total_tweets += len(hash_tweets)

    # second, convert the tweets into a dataframe
    df = tweets_to_data_frame(hash_tweets)

    # third, save the dataframe to a CSV file
    df.to_csv("{}_tweets.csv".format(n))


  0%|                                                                                            | 0/4 [00:00<?, ?it/s]
 25%|████████████████████▊                                                              | 1/4 [04:08<12:26, 248.87s/it]
 50%|█████████████████████████████████████████▌                                         | 2/4 [06:38<07:17, 218.96s/it]
 75%|██████████████████████████████████████████████████████████████▎                    | 3/4 [12:50<04:25, 265.11s/it]
100%|███████████████████████████████████████████████████████████████████████████████████| 4/4 [12:51<00:00, 185.81s/it]
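
The loop above writes files such as `#GBV_tweets.csv`. A `#` or a space in a file name can be awkward in shells and URLs, so one option is to build the name from letters, digits and underscores only. A sketch, with `hashtag_to_filename` as a hypothetical helper:

```python
import re


def hashtag_to_filename(hashtag, suffix="_tweets.csv"):
    """Build a file name that keeps only letters, digits and underscores."""
    # Replace each run of other characters with '_' and trim the ends
    stem = re.sub(r"\W+", "_", hashtag).strip("_")
    # Fall back to a generic stem if nothing usable is left
    return (stem or "query") + suffix


print(hashtag_to_filename("#GBV"))
```

If you adopt a scheme like this, use the same names in the later `pd.read_csv` calls.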


In [37]:
# show total number of tweets collected
print("total_tweets: {}".format(total_tweets))

total_tweets: 9143


For more tweepy configuration options, please read the tweepy documentation [here](https://docs.tweepy.org/en/latest/).

This notebook was prepared by

**Davis David** 

**Zindi Ambassador in Tanzania**

My Zindi profile: [https://zindi.africa/users/Davisy](https://zindi.africa/users/Davisy)

In [3]:
# df.head()

In [4]:
df1=pd.read_csv('#rape_tweets.csv')

In [5]:
df1.head()

Unnamed: 0.1,Unnamed: 0,Tweets,id,lens,date,place,coordinateS,lang,source,likes,retweets
0,0,"b'RT @KaceyKells: KELLCEY ""This story not only...",1389681959329206272,140,2021-05-04 20:42:52,,,en,Twitter for iPhone,0,10284
1,1,b'RT @notice_com_ng: Man defiles his three tee...,1389681326786191362,139,2021-05-04 20:40:21,,,en,EndPoliceBrutality,0,2
2,2,"b'Man defiles his three teenage daughters, ano...",1389681304845828100,169,2021-05-04 20:40:16,,,en,notice.com.ng,1,2
3,3,"b'RT @KaceyKells: KELLCEY ""This story not only...",1389680872937426944,140,2021-05-04 20:38:33,,,en,Twitter Web App,0,10284
4,4,b'RT @PixelProject: #AFRICA: Abuse of women in...,1389679991621881856,132,2021-05-04 20:35:03,,,en,Twitter Web App,0,1


In [6]:
df1['Tweets'][0]

'b\'RT @KaceyKells: KELLCEY "This story not only bears witness to what countless other women have gone through, but also offers a message of ho\\xe2\\x80\\xa6\''
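
The `b'...'` wrapper and the `\xe2\x80\xa6` sequences appear because the tweet text was encoded to bytes before being written to CSV, so pandas reads back the printed form of a bytes object. A sketch of one way to recover readable text with the standard library; `decode_bytes_literal` is a hypothetical helper name:

```python
import ast


def decode_bytes_literal(text):
    """Turn a string such as "b'caf\\xc3\\xa9'" back into readable text."""
    if isinstance(text, str) and text[:2] in ("b'", 'b"'):
        try:
            # Safely evaluate the printed bytes literal back into bytes
            value = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return text  # not a well-formed literal; leave it unchanged
        if isinstance(value, bytes):
            return value.decode("utf-8", errors="replace")
    return text


print(decode_bytes_literal("b'caf\\xc3\\xa9'"))
```

It could be applied to a whole column with `df1['Tweets'].map(decode_bytes_literal)`.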

In [7]:
df1.shape

(4949, 11)

In [8]:
df2=pd.read_csv('#sexism_tweets.csv')

In [9]:
df2.head()

Unnamed: 0.1,Unnamed: 0,Tweets,id,lens,date,place,coordinateS,lang,source,likes,retweets
0,0,b'RT @DingaBelle: @_snowbunting @Flomoll \xf0\...,1389680927031369734,140,2021-05-04 20:38:46,,,en,Twitter for Android,0,3
1,1,b'RT @DingaBelle: @_snowbunting @Flomoll \xf0\...,1389676992648318980,140,2021-05-04 20:23:08,,,en,Twitter for iPhone,0,3
2,2,"b""Finally watched #SexyLamp after years of try...",1389676944782934017,229,2021-05-04 20:22:57,,,en,Twitter for Android,1,0
3,3,b'\xf0\x9f\x98\xb1\xf0\x9f\x98\xb1\xf0\x9f\x98...,1389675769522184199,302,2021-05-04 20:18:17,,,en,Twitter for Android,1,0
4,4,b'RT @DingaBelle: @_snowbunting @Flomoll \xf0\...,1389674632307937295,140,2021-05-04 20:13:45,,,en,Twitter for iPhone,0,3


In [10]:

# df2['Tweets'] = df2['Tweets'].map(lambda x: x.lstrip("@"").rstrip("b'RT"))

In [11]:
df2['Tweets'][3]

"b'\\xf0\\x9f\\x98\\xb1\\xf0\\x9f\\x98\\xb1\\xf0\\x9f\\x98\\xb1\\xf0\\x9f\\x98\\xb1\\xf0\\x9f\\x98\\xb1\\nWhile good people were/are sleeping, this is happening....\\n#Fascism, #Capitalism, #Neoliberalism are rising &amp; inciting #Hate, #Xenophobia, #Racism, #Sexism, etc., and stirring up #Nationalism and self serving #Greed.\\n#Unite and #ActNow to #StopFascism.\\n#SaveOurPlanet. https://t.co/fe576Gw2Vw'"


In [22]:
df3=pd.read_csv('_domesticviolence_tweets.csv')


In [14]:
df4=pd.read_csv('_metoo_tweets.csv')

In [18]:
df5=pd.read_csv('#GBV_tweets.csv')

In [23]:
import re

def remove_pattern(input_txt, pattern):
    # Remove every substring of input_txt that matches the regex pattern
    return re.sub(pattern, '', input_txt)

In [25]:
data=pd.concat([df1,df2,df3,df4,df5])
data.shape

(61355, 11)

In [26]:
data['Tweets_new'] = np.vectorize(remove_pattern)(data['Tweets'], r"@[\w]*")

In [28]:
data=data.drop(['Unnamed: 0'],axis=1)

In [29]:
data.head()

Unnamed: 0,Tweets,id,lens,date,place,coordinateS,lang,source,likes,retweets,Tweets_new
0,"b'RT @KaceyKells: KELLCEY ""This story not only...",1389681959329206272,140,2021-05-04 20:42:52,,,en,Twitter for iPhone,0,10284,"b'RT : KELLCEY ""This story not only bears witn..."
1,b'RT @notice_com_ng: Man defiles his three tee...,1389681326786191362,139,2021-05-04 20:40:21,,,en,EndPoliceBrutality,0,2,b'RT : Man defiles his three teenage daughters...
2,"b'Man defiles his three teenage daughters, ano...",1389681304845828100,169,2021-05-04 20:40:16,,,en,notice.com.ng,1,2,"b'Man defiles his three teenage daughters, ano..."
3,"b'RT @KaceyKells: KELLCEY ""This story not only...",1389680872937426944,140,2021-05-04 20:38:33,,,en,Twitter Web App,0,10284,"b'RT : KELLCEY ""This story not only bears witn..."
4,b'RT @PixelProject: #AFRICA: Abuse of women in...,1389679991621881856,132,2021-05-04 20:35:03,,,en,Twitter Web App,0,1,b'RT : #AFRICA: Abuse of women in IDPs\n\nhttp...
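
Removing only the handle leaves a dangling `RT :` at the start of each retweet, as the `Tweets_new` column shows. A sketch of stripping the whole retweet prefix instead, assuming the text has already been decoded to a plain string (no `b'` wrapper) and that retweets start with `RT @handle:`. `strip_retweet_prefix` is a hypothetical helper:

```python
import re

# Matches a leading "RT @handle:" plus any whitespace that follows it
RETWEET_PREFIX = re.compile(r"^RT @\w+:\s*")


def strip_retweet_prefix(text):
    """Remove a leading 'RT @handle:' marker if present."""
    return RETWEET_PREFIX.sub("", text)


print(strip_retweet_prefix("RT @PixelProject: #AFRICA: Abuse of women"))
```

Text that does not start with the marker is returned unchanged.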


In [38]:
data['Tweets'][0]

0    b'RT @KaceyKells: KELLCEY "This story not only...
0    b'RT @DingaBelle: @_snowbunting @Flomoll \xf0\...
0    b'Duncan Wierman owns #BostonREIA &amp; a memb...
0    b'RT @vevopap: \xce\x93\xce\xb9\xce\xb1 \xcf\x...
0    b"I've tweeted about police stations for women...
Name: Tweets, dtype: object

In [37]:
import emoji
import nltk

# Assumption: `words` was meant to be NLTK's English word list
# (the original cell left it undefined, which raised a NameError).
nltk.download('words')
words = set(w.lower() for w in nltk.corpus.words.words())

def cleaner(tweet):
    # Remove @mentions
    tweet = re.sub(r"@\w+", "", tweet)
    # Remove http links
    tweet = re.sub(r"(?:@|https?://|www)\S+", "", tweet)
    tweet = " ".join(tweet.split())
    # Remove emojis (emoji.replace_emoji needs a recent release of the emoji package;
    # older releases exposed emoji.UNICODE_EMOJI instead)
    tweet = emoji.replace_emoji(tweet, replace='')
    # Remove hashtag sign but keep the text
    tweet = tweet.replace("#", "").replace("_", " ")
    # Keep tokens that are English words (case-insensitive) or are not purely alphabetic
    tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet) if w.lower() in words or not w.isalpha())
    return tweet

# apply the function above to clean the tweets column
data['ataTweets'] = data['Tweets'].map(cleaner)
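
For a quick look at the cleaning steps without downloading NLTK data or installing `emoji`, the link, mention and hashtag handling can be sketched with the standard library alone. This is a simplified variant, not the `cleaner` function above: it does not remove emojis or non-English words, and `basic_clean` is a hypothetical name:

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"@\w+")


def basic_clean(tweet):
    """Strip links and @mentions, drop '#' signs, and collapse whitespace."""
    tweet = URL_RE.sub("", tweet)
    tweet = MENTION_RE.sub("", tweet)
    tweet = tweet.replace("#", "")
    return " ".join(tweet.split())


print(basic_clean("RT @user: #Unite and #ActNow https://t.co/abc"))
```

It expects plain strings, so text stored in the `b'...'` form would need decoding first.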