# Building an ETL Pipeline

As the second part of the Gather predict, you will build a pipeline of Python functions that does the following:

1. Connects to Twitter and scrapes "Eskom_SA" tweets.
2. Cleans and processes the scraped tweets, producing a dataframe with two new columns, using the following function:
   - Hashtag Remover from the Analyse functions
3. Connects to your SQL database and uploads the tweets into the table where you store them.
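
The three steps above can be sketched as a small extract, transform, load chain. This is only an illustration of the shape of the pipeline: `run_pipeline` and the `fake_*` helpers are hypothetical stand-ins (plain lists of dictionaries instead of Twitter, pandas and SQL Server), not part of the notebook.

```python
def run_pipeline(extract, transform, load):
    """Run the three pipeline stages in order and return the number of rows loaded."""
    rows = extract()             # step 1: scrape the tweets
    processed = transform(rows)  # step 2: clean/process them
    load(processed)              # step 3: upload them to storage
    return len(processed)


# Hypothetical stand-ins so the sketch runs without Twitter or a database
def fake_extract():
    return [{"Tweets": "@Eskom_SA power is out #LoadShedding",
             "Date": "2020-03-11 04:05:12"}]


def fake_transform(rows):
    # Add a "hashtags" entry holding the lower-cased hashtags of each tweet
    return [
        dict(row, hashtags=[w.lower() for w in row["Tweets"].split() if w.startswith("#")])
        for row in rows
    ]


stored = []  # plays the role of the SQL table


def fake_load(rows):
    stored.extend(rows)


count = run_pipeline(fake_extract, fake_transform, fake_load)
print(count, stored[0]["hashtags"])
```

In the notebook itself, the three stages correspond to `twitter_df`, `extract_municipality_hashtags` and `pyodbc_twitter`.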

In [1]:
# General:
import tweepy           # To consume Twitter's API
import pandas as pd     # To handle data
import numpy as np      # For numerical computation
import json
# For plotting and visualization:
from IPython.display import display
import pyodbc

# Consumer and Access details

Fill in the Consumer and Access details you received when applying for Twitter API access.

In [2]:
# Consumer:
MY_CONSUMER_KEY    = 'your_consumer_key'
MY_CONSUMER_SECRET = 'your_consumer_secret'

# Access:
MY_ACCESS_TOKEN  = 'your_access_token'
MY_ACCESS_SECRET = 'your_access_secret'
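
One way to keep the keys out of the notebook is to read them from environment variables. The sketch below assumes four hypothetical variable names (`TWITTER_CONSUMER_KEY` and so on); neither the names nor the `load_twitter_credentials` helper come from the original notebook.

```python
import os

# Hypothetical environment variable names, chosen for this sketch only
CREDENTIAL_NAMES = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials as a tuple, read from `env` (default: os.environ)."""
    env = os.environ if env is None else env
    missing = [name for name in CREDENTIAL_NAMES if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return tuple(env[name] for name in CREDENTIAL_NAMES)


# Demonstration with a plain dictionary instead of the real environment
demo_env = {name: f"value-for-{name.lower()}" for name in CREDENTIAL_NAMES}
consumer_key, consumer_secret, access_token, access_secret = load_twitter_credentials(demo_env)
print(consumer_key)
```

The returned tuple can be unpacked into `MY_CONSUMER_KEY`, `MY_CONSUMER_SECRET`, `MY_ACCESS_TOKEN` and `MY_ACCESS_SECRET`.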

# Function 1:

Write a function which:
- Scrapes _"Eskom_SA"_ tweets from Twitter. 

Function Specifications:
- The function should return a dataframe of the scraped tweets with just the "_Tweets_" and "_Date_" columns.
- The function should take in the ```consumer key```, ```consumer secret code```, ```access token``` and ```access secret code```.

NOTE:
The dataframe should have the same column names as those in your SQL Database table where you store the tweets.
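
A quick way to check the note above is to compare the dataframe's column names with the table's column names before uploading. The sketch below uses Python's built-in `sqlite3` as a stand-in for SQL Server, and `table_columns` is a hypothetical helper; it relies on `cursor.description`, which DB-API drivers such as pyodbc also provide. The table layout is an assumption based on the columns shown in this notebook's query output.

```python
import sqlite3


def table_columns(connection, table_name):
    """Return the column names of `table_name` without fetching any rows."""
    cursor = connection.cursor()
    # WHERE 1 = 0 returns no rows but still populates cursor.description
    cursor.execute(f"SELECT * FROM {table_name} WHERE 1 = 0")
    return [column[0] for column in cursor.description]


# Stand-in database with an assumed table layout
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE eskom_tweets_raw (tweets TEXT, date TEXT, location TEXT)")

expected = table_columns(demo_conn, "eskom_tweets_raw")
frame_columns = ["tweets", "date", "location"]  # e.g. list(tweets_df.columns)
columns_match = frame_columns == expected
print(expected, columns_match)
```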

In [3]:
def twitter_df(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET):
    """Returns a dataframe of scraped Eskom_SA tweets with the date and user location of each tweet

    Parameters:
    -----------
    CONSUMER_KEY: Twitter API key

    CONSUMER_SECRET: Twitter API secret key

    ACCESS_TOKEN: Twitter access token

    ACCESS_SECRET: Twitter access token secret

    Examples:
    ---------

    >>> twitter_df('API_key', 'API_secret_key', 'access_token', 'access_secret_token')

    Tweets                               | Date                | Location
    -------------------------------------|---------------------|--------------
    Some tweet @someone about #something | 2020-03-11 11:37:49 | Some location
    """

    # Authentication and access using keys:
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

    api = tweepy.API(auth, timeout=1000)

    tweets = []
    dates = []
    locations = []
    # Search English tweets mentioning @Eskom_SA, excluding retweets
    for tweet in api.search(q="@Eskom_SA -filter:retweets", lang="en", rpp=100, count=40):
        tweets.append(str(tweet.text))
        dates.append(str(tweet.created_at))
        locations.append(str(tweet.user.location))

    # Build the result in one step instead of assigning columns to an empty frame
    return pd.DataFrame({'Tweets': tweets, 'Date': dates, 'Location': locations})

In [4]:
tweets_df = twitter_df(MY_CONSUMER_KEY, MY_CONSUMER_SECRET, MY_ACCESS_TOKEN, MY_ACCESS_SECRET )

In [5]:
conn = pyodbc.connect(driver='{SQL Server}',
                      host=r'EDSA-PGLBGKO\SQLEXPRESS',  # raw string keeps the backslash literal
                      database='gather_eskom',
                      uid='sa',
                      pwd='your_password')

In [6]:
def pyodbc_twitter(connection, df, twitter_table):
    """Inserts a dataframe of tweets into a table in your local SQL database

    Parameters:
    -----------
    connection: An open pyodbc connection

    df: DataFrame of tweets with their timestamp and location

    twitter_table: Name of an already existing table in the SQL database

    Examples:
    --------
    >>> conn = pyodbc.connect(driver='{SQL Server}',
                      host='your_server_name',
                      database='your_database_name',
                      uid='your_user_name',
                      pwd='your_password')

    >>> df = twitter_df('API_key', 'API_secret_key', 'access_token', 'access_secret_token')

    >>> pyodbc_twitter(conn, df, 'twitter_table')

    >>> sql_df = pd.read_sql_query('SELECT * FROM twitter_table', conn)
    """
    cursor = connection.cursor()

    for i in range(len(df.index)):
        # Apostrophes are stripped from the tweet text, as in the stored sample rows
        tweet_text = df['Tweets'][i].replace("'", "")
        tweet_date = df['Date'][i]
        tweet_location = df['Location'][i]

        # Values are passed as query parameters; only the table name is interpolated
        cursor.execute(
            f"INSERT INTO {twitter_table} VALUES (?, ?, ?)",
            tweet_text, tweet_date, tweet_location
        )

    # Commit on the connection that was passed in, not on the global conn
    connection.commit()
    return None

In [7]:
pyodbc_twitter(conn, tweets_df, 'eskom_tweets_raw')

In [8]:
conn.close()

In [24]:
conn = pyodbc.connect(driver='{SQL Server}',
                      host=r'EDSA-PGLBGKO\SQLEXPRESS',  # raw string keeps the backslash literal
                      database='gather_eskom',
                      uid='sa',
                      pwd='your_password')

In [12]:
sql_df = pd.read_sql_query('SELECT * FROM eskom_tweets_raw', conn)

In [13]:
sql_df.head()

| | tweets | date | location |
|---|---|---|---|
| 0 | @Eskom_SA @SABCNewsOnline @SABCRadio @IOL @ewn... | 2020-03-11 04:05:12 | South Africa |
| 1 | @Eskom_SA Than youll go crying to the high cou... | 2020-03-11 04:04:49 | Far South |
| 2 | @Eskom_SA And use a moon/sun to cook it?? | 2020-03-11 04:02:54 | Johannesburg, South Africa |
| 3 | @M_Jay94 @AntonEberhard @Eskom_SA ESKOM manage... | 2020-03-11 04:02:02 | Johannesburg |
| 4 | @nke_Le @Eskom_SA @barrybateman @SABCNewsOnlin... | 2020-03-11 04:01:34 | Pretoria |


# Function 2: Extracting hashtags and municipalities

Write a function which:
- Uses the function you wrote in the Analyse section to extract the hashtags and municipalities into their own columns in a new dataframe.

Function Specifications:
- The function should take in the pandas dataframe you created in Function 1 and return a new pandas dataframe. 
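
Before working on a whole dataframe, it can help to see the extraction on a single tweet. The sketch below is a simplified illustration: `extract_tags` is a hypothetical helper, and `MUNICIPALITY_HANDLES` contains only two of the handles from this notebook's `mun_dict`.

```python
# Two handles taken from the notebook's mun_dict; the full mapping has more entries
MUNICIPALITY_HANDLES = {
    '@CityPowerJhb': 'Johannesburg',
    '@CityTshwane': 'Tshwane',
}


def extract_tags(tweet, handles=MUNICIPALITY_HANDLES):
    """Return (municipalities, hashtags) for one tweet; None where nothing is found."""
    # Drop ":" so a handle written as "@CityPowerJhb:" still matches its key
    words = tweet.replace(":", "").split()
    municipalities = [handles[word] for word in words if word in handles]
    # Hashtags are taken from the original text and lower-cased
    hashtags = [word.lower() for word in tweet.split() if word.startswith("#")]
    return municipalities or None, hashtags or None


print(extract_tags("@CityPowerJhb: still no power in Soweto #LoadShedding #Eskom"))
# (['Johannesburg'], ['#loadshedding', '#eskom'])
print(extract_tags("@Eskom_SA thanks"))
# (None, None)
```

The dataframe version applies the same logic to every row and stores `NaN` instead of `None` for tweets without matches.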

In [15]:
def extract_municipality_hashtags(df):

    """Returns a modified dataframe with two new columns appended, "municipality" and "hashtags". Information is extracted from
    twitter data that includes the municipality and the list of hashtags referred to in each tweet, respectively.
    Input must contain a column named "tweets".

    Parameters
    ----------
    df: dataframe

    Returns
    -------
    df_new: modified dataframe
    """

    mun_dict = {'@CityofCTAlerts' : 'Cape Town',
                '@CityPowerJhb' : 'Johannesburg',
                '@eThekwiniM' : 'eThekwini' ,
                '@EMMInfo' : 'Ekurhuleni',
                '@centlecutility' : 'Mangaung',
                '@NMBmunicipality' : 'Nelson Mandela Bay',
                '@CityTshwane' : 'Tshwane'}

    # Fail early instead of printing a message and then hitting an undefined variable
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame.")

    municipality = []
    hashtags = []
    for tweet in df["tweets"]:
        # Remove ":" so handles such as "@CityPowerJhb:" still match the dictionary keys
        words = tweet.replace(":", "").split()
        # Map every known municipality handle in the tweet to its municipality name
        tweet_muni = [mun_dict[word] for word in words if word in mun_dict]
        municipality.append(tweet_muni if tweet_muni else np.nan)

        # Collect words starting with "#" and convert them to lower case
        tweet_hashtags = [word.lower() for word in tweet.split() if word.startswith("#")]
        hashtags.append(tweet_hashtags if tweet_hashtags else np.nan)

    # Work on a copy so the input dataframe is left unchanged
    df_new = df.copy()
    df_new["municipality"] = pd.Series(municipality, index=df.index, dtype=object)
    df_new["hashtags"] = pd.Series(hashtags, index=df.index, dtype=object)
    return df_new

In [16]:
processed_tweets_df = extract_municipality_hashtags(sql_df)

In [17]:
processed_tweets_df.head()

| | tweets | date | location | municipality | hashtags |
|---|---|---|---|---|---|
| 0 | @Eskom_SA @SABCNewsOnline @SABCRadio @IOL @ewn... | 2020-03-11 04:05:12 | South Africa | NaN | NaN |
| 1 | @Eskom_SA Than youll go crying to the high cou... | 2020-03-11 04:04:49 | Far South | NaN | NaN |
| 2 | @Eskom_SA And use a moon/sun to cook it?? | 2020-03-11 04:02:54 | Johannesburg, South Africa | NaN | NaN |
| 3 | @M_Jay94 @AntonEberhard @Eskom_SA ESKOM manage... | 2020-03-11 04:02:02 | Johannesburg | NaN | NaN |
| 4 | @nke_Le @Eskom_SA @barrybateman @SABCNewsOnlin... | 2020-03-11 04:01:34 | Pretoria | NaN | NaN |


# Function 3: Updating SQL Database with pyODBC

Write a function which:
- Connects to and updates your SQL database.

Function Specifications:
- The function should take in the pandas dataframe created in Function 2.
- The function should connect to your SQL database.
- The function should update the table you store your tweets in.
- The function should not return any output.
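
When writing the insert, passing values as query parameters (`?` placeholders, which pyodbc also uses) avoids building SQL strings by hand and copes with apostrophes in tweet text. The sketch below uses Python's built-in `sqlite3` as a stand-in for SQL Server; the table layout and the sample rows are assumptions made up for the illustration.

```python
import sqlite3

# Made-up sample rows: (tweets, date, municipality, hashtags, location)
rows = [
    ("@Eskom_SA it's dark again", "2020-03-11 04:05:12", "Johannesburg", "#loadshedding", "Soweto"),
    ("@Eskom_SA thanks", "2020-03-11 04:02:54", None, None, None),
]

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE eskom_tweets_processed "
    "(tweets TEXT, date TEXT, municipality TEXT, hashtags TEXT, location TEXT)"
)

# Each "?" is filled from the matching tuple element; None is stored as NULL
demo_conn.executemany("INSERT INTO eskom_tweets_processed VALUES (?, ?, ?, ?, ?)", rows)
demo_conn.commit()

stored = demo_conn.execute(
    "SELECT tweets, municipality FROM eskom_tweets_processed ORDER BY date"
).fetchall()
print(stored)
```

Only the values are parameterized here. A table name cannot be passed as a `?` parameter, so it still has to be placed in the SQL text itself.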

In [33]:
def pyodbc_twitter(connection, df, twitter_table):
    """Inserts a dataframe of processed tweets into a table in your local SQL database

    Parameters:
    -----------
    connection: An open pyodbc connection

    df: DataFrame of tweets with their timestamp, location, municipality and hashtags

    twitter_table: Name of an already existing table in the SQL database

    Examples:
    --------
    >>> conn = pyodbc.connect(driver='{SQL Server}',
                      host='your_server_name',
                      database='your_database_name',
                      uid='your_user_name',
                      pwd='your_password')

    >>> df = extract_municipality_hashtags(sql_df)

    >>> pyodbc_twitter(conn, df, 'twitter_table')
    """
    cursor = connection.cursor()

    dates = pd.to_datetime(df['date'])

    def as_text(value):
        # Assumption: lists are stored as comma-separated text and missing values as NULL
        return ', '.join(value) if isinstance(value, list) else None

    for i in range(len(df.index)):

        tweet_text = df['tweets'].iloc[i].replace("'", "")
        tweet_date = dates.iloc[i].to_pydatetime()
        # Take the value for row i; using the whole column here caused the
        # "String or binary data would be truncated" DataError
        tweet_municipality = as_text(df['municipality'].iloc[i])
        tweet_hashtags = as_text(df['hashtags'].iloc[i])
        tweet_location = df['location'].iloc[i]

        # Values are passed as query parameters; only the table name is interpolated
        cursor.execute(
            f"INSERT INTO {twitter_table} VALUES (?, ?, ?, ?, ?)",
            tweet_text, tweet_date, tweet_municipality, tweet_hashtags, tweet_location
        )

    # Commit on the connection that was passed in, not on the global conn
    connection.commit()
    return None

In [34]:
pyodbc_twitter(conn, processed_tweets_df, 'eskom_tweets_processed')

In [35]:
conn.close()