# ETL Twitter

*This notebook's goal is to extract tweets on given topics from Tweeter and transform them to a homogeneous form.* 

Twitter API Documentation:

- http://docs.tweepy.org/en/latest/api.html?highlight=User%20object#API.user_timeline
- https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline

## Settings

In [2]:
!pip install tweepy

Collecting tweepy
  Downloading https://files.pythonhosted.org/packages/36/1b/2bd38043d22ade352fc3d3902cf30ce0e2f4bf285be3b304a2782a767aec/tweepy-3.8.0-py2.py3-none-any.whl
Collecting requests-oauthlib>=0.7.0 (from tweepy)
  Downloading https://files.pythonhosted.org/packages/a3/12/b92740d845ab62ea4edf04d2f4164d82532b5a0b03836d4d4e71c6f3d379/requests_oauthlib-1.3.0-py2.py3-none-any.whl
Collecting oauthlib>=3.0.0 (from requests-oauthlib>=0.7.0->tweepy)
[?25l  Downloading https://files.pythonhosted.org/packages/05/57/ce2e7a8fa7c0afb54a0581b14a65b56e62b5759dbc98e80627142b8a3704/oauthlib-3.1.0-py2.py3-none-any.whl (147kB)
[K     |████████████████████████████████| 153kB 8.2MB/s eta 0:00:01
[?25hInstalling collected packages: oauthlib, requests-oauthlib, tweepy
Successfully installed oauthlib-3.1.0 requests-oauthlib-1.3.0 tweepy-3.8.0


In [36]:
# usual
import numpy as np
import matplotlib.pyplot as plt

# to work with twitter
import tweepy
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener

# support
import json
import pandas as pd
import csv
import time 

# text processing
import re #regular expression
import string

In [4]:
# twitter app credentials
consumer_key='******************'
consumer_secret='********************'
access_token_key='***********************'
access_token_secret='************************'

In [5]:
#pass twitter credentials to tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token_key, access_token_secret)
api = tweepy.API(auth)

## Extraction

*Encapsulating the pipeline of data retrieving*
1. One aims to achieve a dictionary of type `@user_id: "Tweet_1, Tweet_2 ... Tweet_N"`
2. It may be reasonable to separate data by language, for the moment we will only consider english

In [41]:
def gather_user_tweets(query, 
                       activity_limit = 10, 
                       char_significance=2000,
                       tweets_per_user = 200,
                       pages_to_look=10,
                       REQUESTS_MADE=0,
                       MAX_REQUEST_LIMIT=1480,
                       SLEEP_TIMER=600,
                       verbose=False):
    
    # Firstly, all available tweets on the given topic (query)
    # are gathered throughout all the pages of search
    # those which fit the constraints are stored in <timelines>
    timelines = {}
    waited = False
    
    for page in range(pages_to_look):
        try:
            some_users = api.search_users(q=query, count=20, page=page)
        except Exception as E:
            if verbose:
                print("Page {:d} does not exist".format(page))
            break
            
        if verbose:    
            print("Found {:d} users from page {:d}. ".format(len(some_users), page))
        
        k = 0
        for u in some_users:
            if u.friends_count > activity_limit or u.followers_count > activity_limit:
                k += 1
                
                # checking and waiting if limit reached
                if REQUESTS_MADE >= MAX_REQUEST_LIMIT: 
                    print('REQUEST LIMIT HAS BEEN REACHED. THE EXTRACTION WILL BE PUT ON HOLD FOR {:.2F} MINUTES.'.format(SLEEP_TIMER/60))
                    time.sleep(SLEEP_TIMER)
                    waited=True
                    REQUESTS_MADE= 0
                
                # loading this user's timeline
                if not u.protected:
                    timelines[u.id] = api.user_timeline(id = u.id, count = tweets_per_user)
                    REQUESTS_MADE += 1
        if verbose:        
            print("Gathered {:d} users from page {:d}. ".format(k, page))
            

    print("<<<>>>")
    print("Gathered {:d} users and their timelines. Moving on.".format(len(timelines)))
        
    # one needs only text, so
    # for each user the tweets are concatenated
    # and put into a dictionary
    # currently only english tweets are accepted
    
    user_tweets_dict = {}

    for uid in timelines:
        
        this_user_tweets = ""
        for t in timelines[uid]:   
            if t.lang == 'en':
                this_user_tweets += t.text
                this_user_tweets += '\n\n'
                
        if len(this_user_tweets) >= char_significance:
            user_tweets_dict[uid] = this_user_tweets
    
    total_users = len(user_tweets_dict)
    print("Tweets of {:d} users satisfy the given constraints. They are succesfully processed and saved.".format(total_users))
    if verbose:
        av_corp = sum([len(tweet) for tweet in user_tweets_dict.values()])
        av_corp /= total_users
        print("Average user's corpus length is {:.2f}.".format(av_corp))
    print("<<<>>>")
    
    return user_tweets_dict, REQUESTS_MADE, waited

In [42]:
various_topics = ["music", "Birds of Europe", "Alsace", "the fastest cars",
                 "space travel, future", "mythical creatures", "fashion blogs", "Bukowski",
                 "jazz legends", "Magic"]

more_topics = ["Latest news", "too much attention in the news", "teaching students", "any sports?", 
               "the Champions League?", "favorite movie?", "favorite director?", "favorite actors" # General qs
               "French food", "Red or white wine", "a perfect baguette", "butter croissant", # La France
               "the Renaissance", "the Enlightenment", "Industrial era", "Agricultural era",
               "Liberty and people", "the emergence of Humanism", "yellow jackets", "cotton",
               "modern 'haute couture'",
               "future of the internet", 'internet of the future']

abstract_traits = ['Openness', 'Love', 'Conscientiousness', 'Closeness', 
                  'Challenge', 'Harmony', 'Extraversion', 'Curiosity',
                  'Practicality', 'Self-expression', 'Self-transcendence', 'Hedonism']

google_trends = ['StopCovid', 'VALORANT', 'Noir', 'Agathe Auproux', 
                 'Lea Michele', 'Blacklivesmatter', 'Mediapro', 'White House',
                'Trump', 'Ebola', 'Blackout Tuesday', 'Yellowstone']

great_people = ["Columbus", "Captain Kirk", "Da Vinci", "Plato", 
                "Alexander", "Cleopatra", 'Napoleon', "Jeanne d'Arc"]

various_topics += more_topics
various_topics += google_trends
various_topics += abstract_traits
various_topics += great_people


print("There are now {:d} topics to explore.".format(len(various_topics)))

There are now 64 topics to explore.


In [43]:
time.sleep(300)
print("Awaken.")

Awaken.


In [44]:
gathered_data = {}
current_reqs_made = 0
try:
    for topic in various_topics:
        print("Searchig for users on '{:s}'".format(topic))
        result, reqs, waited = gather_user_tweets(topic, 
                                   pages_to_look=10,
                                   tweets_per_user=500,
                                   char_significance=1500,
                                   REQUESTS_MADE=current_reqs_made,
                                   MAX_REQUEST_LIMIT=1480,
                                   verbose=False)
        if waited:
            current_reqs_made = 0
            print("Requests counter is nullified.")
        else:
            current_reqs_made += reqs

        gathered_data.update(result)
        print("Total users gathered: ", len(gathered_data))
        print("REQUESTS_MADE: ", current_reqs_made)
        print()
except Exception:
    print('Interrupted.')

Searchig for users on 'music'
<<<>>>
Gathered 180 users and their timelines. Moving on.
Tweets of 179 users satisfy the given constraints. They are succesfully processed and saved.
<<<>>>
Total users gathered:  179
REQUESTS_MADE:  200

Searchig for users on 'Birds of Europe'
<<<>>>
Gathered 9 users and their timelines. Moving on.
Tweets of 9 users satisfy the given constraints. They are succesfully processed and saved.
<<<>>>
Total users gathered:  188
REQUESTS_MADE:  490

Searchig for users on 'Alsace'
<<<>>>
Gathered 176 users and their timelines. Moving on.
Tweets of 61 users satisfy the given constraints. They are succesfully processed and saved.
<<<>>>
Total users gathered:  249
REQUESTS_MADE:  1174

Searchig for users on 'the fastest cars'
<<<>>>
Gathered 28 users and their timelines. Moving on.
Tweets of 27 users satisfy the given constraints. They are succesfully processed and saved.
<<<>>>
Total users gathered:  276
REQUESTS_MADE:  2520

Searchig for users on 'space travel, fu

In [47]:
keys_iter = iter(gathered_data.keys())
one_u = next(keys_iter)

print("User ", one_u)
print("Text: ", gathered_data[one_u])

User  755252174
Text:  With #EUFarm2Fork &amp; #BiodiversityStrategy, the @EU_Commission has created a blueprint for a sustainable future. 👩‍🌾… https://t.co/rHuqC1O00c

"The targets on agriculture are a game-changer."

Our Head of Policy, @ArielBrunner, shares his analysis on newly r… https://t.co/hwqGCr3eRv

RT @phoeb0: European Commission commits to protecting 30% of EU’s land and oceans by 2030 @guardianeco https://t.co/WKYmYamg49

RT @BrunaDCampos: The ocean wins today:
- 30% MPAs of which 1/3 should be no take zones!
- An action plan to tackle destructive fishing
- R…

With these strategies, @EU_Commission has shown true global leadership. 🌍👩‍🌾🌳🦉🐝🐞

Now, they pass the torch to our 2… https://t.co/HWg8taQR7N

✅ Minimise burning biomass, such as trees, for energy. This is crucial for both biodiversity &amp; the climate:… https://t.co/qZJ3J66UAK

✅ Introduce binding EU nature restoration targets to restore large-scale ecosystems. Our damaged planet desperately… https://t.co/dvLRk7KZF

***Comment:*** 
Extraction works. 

However, the text that we extracted contains a lot of undesirable symbols.

It should be cleaned.

In [45]:
len(gathered_data)

5896

## Text Processing

*The tweets extracted from Tweeter contain a lot of undesirable information, which one might like to remove. Such information is:*
* Emojis
* Symbolic emojis
* Hyperlinks

In [46]:
#HappyEmoticons
emoticons_happy = set([
    ':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
    ':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
    '=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
    'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
    '<3'
    ])

# Sad Emoticons
emoticons_sad = set([
    ':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
    ':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
    ':c', ':{', '>:\\', ';('
    ])

#Emoji patterns
emoji_pattern = re.compile("["
         u"\U0001F600-\U0001F64F"  # emoticons
         u"\U0001F300-\U0001F5FF"  # symbols & pictographs
         u"\U0001F680-\U0001F6FF"  # transport & map symbols
         u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
         u"\U00002702-\U000027B0"
         u"\U000024C2-\U0001F251"
         "]+", flags=re.UNICODE)

#combine sad and happy emoticons
emoticons = list(emoticons_happy.union(emoticons_sad))
# symbolics = re.compile('|'.join(emoticons))

In [47]:
def purify_text(text):
    
    #Emoji patterns
    emoji_pattern = re.compile("["
         u"\U0001F600-\U0001F64F"  # emoticons
         u"\U0001F300-\U0001F5FF"  # symbols & pictographs
         u"\U0001F680-\U0001F6FF"  # transport & map symbols
         u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
         u"\U00002702-\U000027B0"
         u"\U000024C2-\U0001F251"
         "]+", flags=re.UNICODE)
    # urls
    urls = r'https?:\/\/.*[\r\n]*'
    
    # other unicodes
    more_emojis = re.compile(u'['
    u'\U0001f300-\U0001f64F'
    u'\U0001f680-\U0001f9FF'
    u'\u2600-\u26FF\u2700-\u27BF]+', 
    re.UNICODE)
    
    # hashtags
    tags = r'\#.*[\r\n]*'
    # ats
    ats = r'\@'
    
    # leftovers
    amps = '\&amp'
    doublespace = '\s\s'
    
    # subbing
    text = re.sub(urls, '', text)
    text = emoji_pattern.sub(r'', text)
    text = more_emojis.sub(r'', text)
    text = re.sub(tags, '', text)
    text = re.sub(ats, '', text)
    text = re.sub(amps, '', text)
    text = re.sub(doublespace, ' ', text)
    text.encode('ascii', errors='ignore')
    
    return text

In [48]:
clean_data = {}

for uid in gathered_data:
    print(uid, '...')
    clean_data[uid] = purify_text(gathered_data[uid])
    


89168924 ...
74580436 ...
52536879 ...
17243213 ...
278662460 ...
92462529 ...
142457170 ...
739836760948117504 ...
461688027 ...
880142286176231425 ...
14740219 ...
18033062 ...
2367911 ...
163787110 ...
16807528 ...
44375901 ...
18088267 ...
25830909 ...
15891115 ...
20234637 ...
14392252 ...
15279886 ...
62591415 ...
54387680 ...
193009893 ...
360474312 ...
688583 ...
39723049 ...
23136007 ...
47363082 ...
59079741 ...
14663955 ...
3307218826 ...
43011178 ...
7130402 ...
26601342 ...
22838300 ...
266767461 ...
111342637 ...
62614140 ...
43969388 ...
236561938 ...
17001690 ...
609059327 ...
20470215 ...
96194044 ...
26779967 ...
46740920 ...
214837137 ...
14761583 ...
41679716 ...
214906527 ...
15949253 ...
236282213 ...
19348075 ...
73333565 ...
37995765 ...
60971018 ...
449043936 ...
199506829 ...
3015897352 ...
2909371880 ...
1577984058 ...
28351350 ...
31634255 ...
20573003 ...
197196311 ...
83557871 ...
17201929 ...
2381606574 ...
1090384667952394241 ...
16691299 ...
74109602 ..

In [49]:
keys_iter = iter(clean_data.keys())
one_u = next(keys_iter)

print("User ", one_u)
print("Text: ", clean_data[one_u])

User  89168924
Text:  Radio stations and TV channels have changed their programmes to mark "Blackout Tuesday", reflecting on George Floyd… RT BBCNewsbeat: "When I first started talking about racism in music, my comments were completely dismissed. 
"The powers that be even wan… RT BBCR1: “You cannot enjoy the rhythm and ignore the blues” This is incredibly powerful from claraamfo on the death of George Floyd, ra… On Tues: 1Xtra Talks is a 2hr special from 6pm, where we’ll hear the views and opinions of 1Xtra listeners on Geor… Hear Tommie Smith talk about how it felt to make the Black Power salute at the 1968 Olympics in conversation with… Related: Listen to TherealNihal interviewing Tommie Smith for bbc5live on BBCSounds Tuesday at noon: Don Letts (RebelDread) introduces a soundscape telling the story of the protest at the 1968 Olymp… RT BBCNewsEnts: "We're broken and we're disgusted," said Beyonce. "We cannot normalise this pain." Just One of Those Things - Tracing the story of Ella F

## Loading data on disk
*It is only left to store the gathered data somewhere*


In [50]:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id="****************", project_access_token="******************")
pc = project.project_context

In [51]:
file_name = "twitter_main.csv"
df = pd.DataFrame.from_dict(clean_data, orient='index', columns=['Text'])
df.reset_index(inplace=True)
df.rename(columns={"index": "UID"}, inplace=True)
df.head()


Unnamed: 0,UID,Text
0,89168924,Radio stations and TV channels have changed th...
1,74580436,"On Tuesday, June 2nd, Apple Music will observe..."
2,52536879,Little Richard receives the Merit Award at the...
3,17243213,In response to George Floyd's death––the death...
4,278662460,"3 years ago today, BTS_twt accepted their firs..."


In [52]:
project.save_data(file_name, df.to_csv(index=False))

{'file_name': 'twitter_main.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'personalityinsightproject-donotdelete-pr-1pbhytm1hswznj',
 'asset_id': '522b1343-6e17-4c4b-8f83-510576779af8'}