# Building an ETL Pipeline

As the second part of the Gather predict, you will need to build a pipeline of Python functions that does the following:

1. Connects to Twitter and scrapes "Eskom_SA" tweets.
<br>
<br>
2. Cleans and processes the scraped tweets, creating a dataframe with two new columns using the following functions: <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a) Hashtag Remover from Analyse Functions
<br>
<br>
3. Connects to your SQL database and uploads the tweets into the table where you store them.
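
The three stages above can be chained so that each one feeds the next. The block below is only a minimal sketch of that idea: `run_pipeline` and the toy stand-in functions are hypothetical names used for illustration, not part of the predict, and the real stages would be the Twitter, cleaning, and SQL functions you write further down.

```python
def run_pipeline(extract, transform, load):
    """Run the three stages in order, passing each result to the next stage."""
    raw = extract()              # Stage 1: fetch the raw tweets
    processed = transform(raw)   # Stage 2: clean/process them
    load(processed)              # Stage 3: store the processed result
    return processed


# Toy stand-ins so the sketch runs without Twitter or a database.
stored = []
processed_tweets = run_pipeline(
    extract=lambda: ["@Eskom_SA #PowerAlert"],
    transform=lambda tweets: [t.lower() for t in tweets],
    load=stored.extend,
)
```

Passing the stages in as arguments keeps each function independently testable, which helps when one stage depends on a live API or database.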

In [1]:
# General:
import tweepy           # To consume Twitter's API
import pandas as pd     # To handle data
import numpy as np      # For numerical computation
import json
# For displaying output in the notebook:
from IPython.display import display
# For connecting to the SQL database:
import pyodbc


# Consumer and Access details

Fill in the Consumer and Access details you received when applying for access to the Twitter API.

In [2]:
# Consumer:
CONSUMER_KEY    = ''
CONSUMER_SECRET = ''

# Access:
ACCESS_TOKEN  = ''
ACCESS_SECRET = ''
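
If you would rather not type the keys into the notebook, one option is to read them from environment variables. This is a sketch under assumptions: the helper `load_credentials` is hypothetical, and it assumes the environment variables carry the same names as the constants above.

```python
import os


def load_credentials(environ=os.environ):
    """Read the four Twitter credentials from a mapping of environment variables."""
    names = ("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
    # Fail early, listing every missing or empty credential at once
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Illustrative values only; real keys would come from the shell environment.
example_env = {
    "CONSUMER_KEY": "ck",
    "CONSUMER_SECRET": "cs",
    "ACCESS_TOKEN": "at",
    "ACCESS_SECRET": "as",
}
credentials = load_credentials(example_env)
```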

In [3]:
# API setup:
def twitter_setup():
    """
    Utility function to set up the Twitter API
    with the consumer and access keys from Twitter.
    """

    # Authentication and access using keys:
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

    # Return API with authentication:
    api = tweepy.API(auth, timeout=1000)
    return api

In [88]:
api = twitter_setup()
user = api.get_user("Eskom_SA")

print("User details:")
print(user.name)
print(user.description)
print(user.location)

print("Last 20 Followers:")
for follower in user.followers():
    print(follower.name)

User details:
Eskom Hld SOC Ltd

Gauteng, South African   
Last 20 Followers:
MALIBONGWE
Ceekay poetry💊📝👑
Lazy_grandchild_
Wiandra
Destroyer
tebogo mampopi
Vuyi Mandisa
Glapatham.L.Services PTY ltd
Onkgopotse Gomolemo Bantatetse
Thapelo
11:11
Kamohelo
Karesha Chetty
Lwazi Ngwane
Brendon Schwarz
Tiisetso Hadebe
King Kau
@Mzee_02
Andile Mnguni
Tshililo


In [26]:
for tweet in api.search(q="@Eskom_SA", lang="en", rpp=1):
    print(tweet)

Status(_api=<tweepy.api.API object at 0x0000022CCA9B1F98>, _json={'created_at': 'Sat Feb 29 07:32:08 +0000 2020', 'id': 1233656179924127744, 'id_str': '1233656179924127744', 'text': '@oceanwalz @samkelemaseko @Julius_S_Malema @Eskom_SA The indigent/ elderly and battling households get it every mon… https://t.co/j6XBQzsS7l', 'truncated': True, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'oceanwalz', 'name': 'OceanWalz', 'id': 1362093343, 'id_str': '1362093343', 'indices': [0, 10]}, {'screen_name': 'samkelemaseko', 'name': '#TheLordOfTheMedia', 'id': 301941865, 'id_str': '301941865', 'indices': [11, 25]}, {'screen_name': 'Julius_S_Malema', 'name': 'Julius Sello Malema', 'id': 117102398, 'id_str': '117102398', 'indices': [26, 42]}, {'screen_name': 'Eskom_SA', 'name': 'Eskom Hld SOC Ltd', 'id': 466420346, 'id_str': '466420346', 'indices': [43, 52]}], 'urls': [{'url': 'https://t.co/j6XBQzsS7l', 'expanded_url': 'https://twitter.com/i/web/status/1233656179924

In [5]:
api = twitter_setup()
tweets = []
dates = []
for tweet in api.search(q="@Eskom_SA -filter:retweets", lang="en", rpp=100, count=40):
    tweets.append(tweet.text)            # Tweet text
    dates.append(str(tweet.created_at))  # Creation time as a string
result = pd.DataFrame({'Tweets': tweets, 'Date': dates})
result

Unnamed: 0,Tweets,Date
0,@Eskom_SA Stop lying 🙄 there's been loadsheddi...,2020-03-02 08:25:27
1,@Yembuso @Eskom_SA Legit no electricity for 12...,2020-03-02 08:24:50
2,@Eskom_SA We take what we can . Thank you. Do ...,2020-03-02 08:15:07
3,"@TheEazyEd @Eskom_SA hi, @Eskom_SA, @TheEazyEd...",2020-03-02 08:12:46
4,@Eskom_SA What A Bunch Of BULL!!!!!,2020-03-02 08:10:47
5,@Postmanxyz @jhbotes1975 @Eskom_SA Actually 19...,2020-03-02 08:04:40
6,@Doneology @Eskom_SA @pixabay I did not even k...,2020-03-02 08:02:52
7,@Eskom_SA @Eskom_SA you know this update is no...,2020-03-02 08:01:53
8,@Eskom_SA @pixabay I hope you guys are not hav...,2020-03-02 08:01:33
9,"@Eskom_SA Since last night approx 8pm, recurre...",2020-03-02 08:00:55


In [6]:
result['Tweets'][0]

"@Eskom_SA Stop lying 🙄 there's been loadshedding since early hours of the morning in Kaalfontein, Midrand.  So save it. ✋"

# Function 1:

Write a function which:
- Scrapes _"Eskom_SA"_ tweets from Twitter. 

Function Specifications:
- The function should return a dataframe containing the scraped tweets, with just the "_Tweets_" and "_Date_" columns.
- It should take in the ```consumer key, consumer secret code, access token``` and ```access secret code```.

NOTE:
The dataframe should have the same column names as those in your SQL Database table where you store the tweets.

In [0]:
def twitter_df(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET ):

    # Code Here
    
    return None
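
One way to approach the body of this function is to separate the API call from the reshaping step. The sketch below covers only the reshaping, using fake status objects in place of real search results; `statuses_to_records` is a hypothetical helper, and the tweet text is illustrative rather than scraped data.

```python
from types import SimpleNamespace


def statuses_to_records(statuses):
    """Keep only the text and creation time of each status-like object."""
    return [{"Tweets": status.text, "Date": str(status.created_at)}
            for status in statuses]


# Fake statuses standing in for what api.search(...) returns (illustrative text only).
fake_statuses = [
    SimpleNamespace(text="@Eskom_SA example tweet one", created_at="2020-03-02 08:25:27"),
    SimpleNamespace(text="@Eskom_SA example tweet two", created_at="2020-03-02 08:24:50"),
]
records = statuses_to_records(fake_statuses)
# pd.DataFrame(records, columns=["Tweets", "Date"]) would then give the two-column dataframe.
```

Keeping the column names in one place makes it easier to match them to the column names of your SQL table, as the note above requires.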

# Function 2: Removing hashtags and the municipalities

Write a function which:
- Uses the function you wrote in the Analyse section to extract the hashtags and municipalities into their own columns in a new dataframe.

Function Specifications:
- The function should take in the pandas dataframe you created in Function 1 and return a new pandas dataframe. 

In [0]:
twitter_df(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET )

In [112]:
def extract_municipality_hashtags(df):
    
    """Returns a modified dataframe with two new columns appended, "municipality" and "hashtags". Information is extracted from
    twitter data that includes the municipality and the list of hashtags referred to in each tweet, respectively.
    Input must contain a column named "Tweets".
    
    Parameters
    ----------
    df: dataframe
    
    Returns
    -------
    df_new: modified dataframe 
    """
    
    mun_dict = {'@CityofCTAlerts' : 'Cape Town',
                '@CityPowerJhb' : 'Johannesburg',
                '@eThekwiniM' : 'eThekwini' ,
                '@EMMInfo' : 'Ekurhuleni',
                '@centlecutility' : 'Mangaung',
                '@NMBmunicipality' : 'Nelson Mandela Bay',
                '@CityTshwane' : 'Tshwane'}
    
    # Fail early: the original fallthrough returned an undefined variable for bad input
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame.")

    municipality = []
    hashtags = []
    for tweet in df["Tweets"]:
        # Municipality handles may be followed by ":", so strip colons before matching
        words = tweet.replace(":", "").split()
        munis = [mun_dict[w] for w in words if w in mun_dict]  # Map known handles to municipality names
        # Hashtags are taken from the raw tweet text and converted to lower case
        tags = [w.lower() for w in tweet.split() if w.startswith("#")]
        # Use NaN rather than an empty list when nothing is found
        municipality.append(munis if munis else np.nan)
        hashtags.append(tags if tags else np.nan)

    # Add the new columns to a copy so the input dataframe is left unchanged
    df_new = df.copy()
    df_new["municipality"] = pd.Series(municipality, index=df.index)
    df_new["hashtags"] = pd.Series(hashtags, index=df.index)
    return df_new

In [113]:
extract_municipality_hashtags(result)

Unnamed: 0,Tweets,Date,municipality,hashtags
0,@DavidMokwena12 @iSiphoSihle @samkelemaseko @E...,2020-02-29 08:49:37,,
1,@DavidMokwena12 @samkelemaseko @EFFSouthAfrica...,2020-02-29 08:47:21,,
2,@PNMaster_ @thembelani_k @FloydShivambu @Eskom...,2020-02-29 08:47:20,,
3,@AshrafGarda @somadodafikeni Politicians are m...,2020-02-29 08:46:03,,
4,@LenahLang @Roovdwalt @CityPowerJhb @Eskom_SA ...,2020-02-29 08:43:35,[Johannesburg],
5,@Roovdwalt @LaurieWatson821 @CityPowerJhb We a...,2020-02-29 08:37:41,[Johannesburg],
6,@FloydShivambu @Eskom_SA What?,2020-02-29 08:37:32,,
7,@Eskom_SA not today satan ne,2020-02-29 08:37:05,,
8,@Eskom_SA Load shedding better stay away this ...,2020-02-29 08:34:09,,
9,@Eskom_SA morning no #POWERALERT today? #loads...,2020-02-29 08:33:54,,"[#poweralert, #loadshedding]"


# Function 3: Updating SQL Database with pyODBC

Write a function which:
- Connects to your SQL database and updates it.

Function Specifications:
- The function should take in the pandas dataframe created in Function 2.
- It should connect to your SQL database.
- It should update the table in which you store your tweets.
- It should not return any output.

In [0]:
def pyodbc_twitter(connection, df, twitter_table):

  ### Code Here

  return None 
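
The upload step can be sketched with parameterised inserts. The block below is an illustration under stated assumptions, not the expected solution: it uses the standard library's `sqlite3` as a stand-in for `pyodbc` so it runs without a database server (both follow the Python DB-API and accept `?` placeholders), the helpers `to_sql_value` and `upload_rows` are hypothetical, and the table layout and sample rows are made up for the example. With a dataframe, the rows could come from `df.itertuples(index=False, name=None)`.

```python
import math
import sqlite3  # Stand-in for pyodbc so the sketch runs without a database server


def to_sql_value(value):
    """Convert a dataframe cell to something a SQL driver can store."""
    if isinstance(value, float) and math.isnan(value):
        return None              # NaN becomes NULL
    if isinstance(value, list):
        return ", ".join(value)  # Lists become comma-separated text
    return value


def upload_rows(connection, rows, twitter_table):
    """Insert rows of (Tweets, Date, municipality, hashtags) into twitter_table."""
    # Table names cannot be bound as parameters, so check the name before formatting it in.
    if not twitter_table.isidentifier():
        raise ValueError("Unexpected table name")
    sql = (f"INSERT INTO {twitter_table} (Tweets, Date, municipality, hashtags) "
           "VALUES (?, ?, ?, ?)")
    cursor = connection.cursor()
    cursor.executemany(sql, [tuple(to_sql_value(v) for v in row) for row in rows])
    connection.commit()
    cursor.close()


# Usage with an in-memory database and illustrative rows.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE tweets (Tweets TEXT, Date TEXT, municipality TEXT, hashtags TEXT)"
)
sample_rows = [
    ("@Eskom_SA example #PowerAlert", "2020-02-29 08:33:54", float("nan"), ["#poweralert"]),
    ("@CityPowerJhb @Eskom_SA example", "2020-02-29 08:43:35", ["Johannesburg"], float("nan")),
]
upload_rows(connection, sample_rows, "tweets")
stored_rows = connection.execute(
    "SELECT municipality, hashtags FROM tweets ORDER BY rowid"
).fetchall()
```

Converting NaN to `None` and lists to text before inserting matters because the `municipality` and `hashtags` columns produced by Function 2 hold lists or NaN, which a SQL driver cannot store directly.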