# Packages

As a first step, all packages for data acquisition and preprocessing are imported:
1. The ``tweepy`` library (https://docs.tweepy.org/en/stable/) is used to access the Twitter API and to retrieve the relevant tweets and further information. To extract tweets with Tweepy, we first had to apply for developer credentials, which include private consumer keys and access tokens.
2. ``pandas`` is imported for some minor initial data transformations and for reading in the politicians' Twitter handles.
3. The ``os`` library is used to check whether the configuration file that contains the API keys and tokens exists.

In [1]:
# packages
import tweepy
import pandas as pd
import os

print("Tweepy version: " + tweepy.__version__)
print("Pandas version: " + pd.__version__)

Tweepy version: 4.4.0
Pandas version: 1.3.4


### Importing the relevant consumer keys and access tokens

The keys and tokens were retrieved from a personal Twitter developer account and stored in a separate file (`config.py`). The next chunk imports the keys and tokens from this config file, which itself is not pushed to the GitHub repository. The if-else statement prints a confirmation if a config file exists on the user's local machine, and a request to add one if it does not.

In [2]:
# import tokens from config.py file
if os.path.isfile("config.py"):
    print("config.py exists\nAPI keys and tokens are imported")
    from config import consumer_key, consumer_secret, access_token, access_token_secret
else:
    print("config.py does not exist\nPlease add config.py to proceed")

config.py exists
API keys and tokens are imported
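
For readers who want to reproduce this step, a minimal sketch of what `config.py` could look like is shown below. The four variable names match the import statement above; the values are placeholders chosen for illustration, not real credentials.

```python
# config.py -- hypothetical layout; keep this file out of version control
# (for example by listing it in .gitignore)

# placeholder values: replace with the keys from your Twitter developer account
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"
```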


# Setting up API

In the next chunk, the previously imported consumer key and secret are passed to an `OAuthHandler` instance from the tweepy library. Subsequently, the access token and its secret, which were imported as strings in the previous chunk, are set on the handler. Finally, a new API object is created. The `wait_on_rate_limit` argument is set to `True` so that tweepy waits for the rate-limit window to reset instead of failing when Twitter's rate limits are reached.

In [3]:
# setup consumer API key
auth = tweepy.OAuthHandler(
    consumer_key,
    consumer_secret
)

# setup access token
auth.set_access_token(
    access_token,
    access_token_secret
)

# create API variable
api = tweepy.API(
    auth, 
    wait_on_rate_limit = True
)
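
To illustrate the idea behind waiting on a rate limit, the following sketch computes how long a client would have to pause until a rate-limit window resets. It is a simplified illustration, not tweepy's actual implementation; the function name `seconds_until_reset` and the timestamps are assumptions made for this example.

```python
import time

def seconds_until_reset(reset_epoch, now=None):
    """Return how many seconds remain until a rate-limit window resets.

    reset_epoch: Unix timestamp at which the window resets (hypothetical input).
    now: current Unix timestamp; defaults to the system clock.
    """
    if now is None:
        now = time.time()
    # never return a negative waiting time
    return max(0.0, reset_epoch - now)

# example with fixed timestamps so the result does not depend on the clock
wait = seconds_until_reset(reset_epoch=1_000_900, now=1_000_000)
print(wait)  # 900 seconds, i.e. a 15-minute window that has just started
```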

### Functionality test of the API credentials

In this code chunk, the `verify_credentials` method checks whether the credentials we imported earlier are valid. If so, a confirmation statement is printed. If the call raises an error, an error message is printed instead.

In [4]:
# check if API credentials work
try:
    api.verify_credentials()
    print("Authentication OK")
except tweepy.TweepyException:
    print("Error during authentication")

Authentication OK


# Testing Twitter API

After setting up the credentials, we check whether we can extract the most recent tweet from Lukas's timeline. For this task, we use tweepy's `user_timeline` method and store its output in a new object. Inside the call, we specify several arguments:
1. The respective Twitter handle/username with the ``screen_name`` argument
2. The number of tweets we want to extract with the ``count`` argument
3. Whether we want to include retweets (which we do not) with the ``include_rts`` argument
4. Whether we want to extract the whole tweet (and not a truncated version) with the ``tweet_mode`` argument

Subsequently, the function `timeline_to_df` is created, which takes a tweepy object as input and converts it into a pandas dataframe by using the `json_normalize` function from the pandas library.

In [5]:
# test if user_timeline method works with own twitter account
tweets_lw = api.user_timeline(
    screen_name = "lukas_warode",
    count = 1,
    include_rts = False,
    tweet_mode = "extended"
)

# print type of user_timeline method object
type(tweets_lw)

# function to convert tweepy object to a pandas dataframe
def timeline_to_df(tweepy_timeline):
    """Take a tweepy object input and return a pandas dataframe."""
    json_data = [r._json for r in tweepy_timeline]
    df = pd.json_normalize(json_data)
    return df

# apply function 
tweets_lw_df = timeline_to_df(tweets_lw)

# print full text column of tweet dataframe
pd.options.display.max_colwidth = int(tweets_lw_df["full_text"].str.len().max())
print(tweets_lw_df["full_text"])

0    @p_c_bauer @MichaelImre Nice project! Seems to be a very rare coincidence, I worked basically on the same project last year while using the same name 😄\nhttps://t.co/QQtOQ...
Name: full_text, dtype: object
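
The flattening step inside `timeline_to_df` can be tried without API access. The sketch below feeds a heavily shortened, made-up tweet payload to `pd.json_normalize`; the field values are hypothetical, and real payloads contain many more fields.

```python
import pandas as pd

# hypothetical, heavily shortened tweet payload (real payloads have many more fields)
fake_tweets = [
    {
        "id": 1,
        "full_text": "first example tweet",
        "user": {"screen_name": "example_user", "followers_count": 10},
    },
    {
        "id": 2,
        "full_text": "second example tweet",
        "user": {"screen_name": "example_user", "followers_count": 10},
    },
]

# nested dictionaries become columns whose names are joined with a dot
fake_df = pd.json_normalize(fake_tweets)
print(fake_df.columns.tolist())
```

Each nested key under `user` ends up as its own column (for example `user.screen_name`), which is why the data frames printed in this notebook are so wide.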


# Use csv file from WZB project to extract list of German MPs' Twitter accounts
## (Project author: Markus Konrad)

If we want to match every MP's tweets with their respective party programmes, we need their Twitter handles. Luckily, Markus Konrad from the WZB has done a quite similar project and provides a file that contains the Twitter handles of German MPs. In the next chunk, we read in this file with the ``read_csv`` function from the pandas library and create a dataframe that only contains each **MP's handle** and **party affiliation**. NAs are dropped, since we only consider MPs who use Twitter.

In [6]:
# read csv as dataframe from GitHub repository
wzb_df = pd.read_csv("https://raw.githubusercontent.com/WZBSocialScienceCenter/mdb-twitter-network/master/data/deputies_twitter_20190702.csv")

# create subset with the 2 relevant columns and drop NAs
twitter_df = wzb_df[["twitter_name", "party"]].dropna()
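
The effect of `dropna()` in the step above can be sketched on a small in-memory file. Only the two column names are taken from the code above; the rows themselves are made up for illustration.

```python
import io
import pandas as pd

# hypothetical rows; only the two column names are taken from the code above
csv_text = """twitter_name,party
example_a,PARTY X
,PARTY Y
example_c,PARTY X
"""

demo_df = pd.read_csv(io.StringIO(csv_text))

# the second row has no handle, so dropna() removes it
demo_handles = demo_df[["twitter_name", "party"]].dropna()
print(len(demo_df), len(demo_handles))
```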

# Sampling approaches
## a) Get random MP Twitter handles

To check whether the previous steps have worked, the following chunk creates a function (`random_sample_handle`) that returns a given number of random Twitter handles. The function takes two parameters:
1. A dataframe in which the Twitter handles are stored
2. An integer that determines the number of returned Twitter handles

Inside the function, the ``sample`` object holds the requested number of randomly drawn handles, while the ``name_string`` object converts this sample into a string. Finally, ``name_string`` is returned.

Ultimately, ``random_sample_handle`` is applied to our Twitter handle dataframe (`twitter_df`), with the number of handles to be returned set to 5.

In [7]:
# Function to extract random MPs' Twitter handles
def random_sample_handle(df, n):
    """
    Take a twitter handle dataframe and a number and return a desired number of random handles.

        Parameters:
                df (pandas.DataFrame): A dataframe containing twitter handles
                n (int): The number of handles to return
        
        Returns: 
                str: The sampled twitter handles, one per line
    """
    sample = df[["twitter_name"]].sample(n = n)
    name_string = sample.to_string(index = False, header = False)
    return name_string

# apply function
print(
    random_sample_handle(
        df = twitter_df,
        n = 5
    )
)

     owvonholtz
   katjakipping
danielakluckert
       dorobaer
    gruenebeate


## b) Extract Twitter handles by popularity

In the following chunk, we first create a function that takes a single Twitter handle and returns the follower count of that account. If the request fails (for example, because the account no longer exists), the function returns `None`.

For simplification purposes, we then only consider a subset of the dataframe in the further analysis by choosing only MPs from the Green party ("Die Grünen"). In the next step, we define the function ``col_to_tidy_list``, which takes a data frame and a column name as input parameters and transforms the selected column into a list. More specifically, whitespace is removed and the handles are split on line breaks. Ultimately, the cleaned list of Twitter handles is returned.

In [8]:
# follower count function
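def follower_count_fun(twitter_handle):
    """Return the follower count of a twitter handle, or None if the lookup fails."""
    try: 
        user = api.get_user(screen_name = twitter_handle)
        return user.followers_count
    except tweepy.TweepyException:
        # the lookup failed, e.g. because the account no longer exists
        return None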

# for demonstration and simplification purposes we create a subset with Green MPs
twitter_df_greens = twitter_df[twitter_df["party"] == "DIE GRÜNEN"]

# store Twitter handles as list from data frame (column) with a function
def col_to_tidy_list(df, col):
    col_string = df[[col]].to_string(index = False, header = False)
    tidy_string = col_string.replace(" ", "")
    tidy_list = tidy_string.split("\n")
    return tidy_list

# test and print results
twitter_handles_list = col_to_tidy_list(
    twitter_df_greens,
    "twitter_name"
)

print(twitter_handles_list)

['kirstenkappert', 'konstantinnotz', 'markuskurthmdb', 'babetteschefin', 'sven_kindler', 'agnieszka_mdb', 'goeringeckardt', 'markustressel', 'beatewaro', 'julia_verlinden', 'jtrittin', 'k_sa', 'ulle_schauws', 'schickgerhard', 'manuelsarrazin', 'tabearoessner', 'crueffer', 'lisapaus', 'fostendorff', 'cem_oezdemir', 'nouripour', 'gruenebeate', 'irenemihalic', 'tobiaslindner', 'steffilemke', 'monikalazar', 'renatekuenast', 'chriskuehn_mdb', 'stephankuehn', 'oliver_krischer', 'mariaklschmeink', 'uwekekeritz', 'djanecek', 'brihasselmann', 'hajdukbundestag', 'kaigehring', 'matthiasgastel', 'katjadoerner', 'katdro', 'ebner_sha', 'ekindeligoez', 'fbrantner', 'kerstinandreae', 'abaerbock', 'w_sk', 'lieblingxhain', 'stefangelbhaar', 'danywagner_da', 'badulrichmartha', 'gruenclaudia', 'derdanyal', 'margaretebause', 'filizgreen', 'owvonholtz', 'svenlehmann', 'annachristmann']
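
As an alternative to the round trip through `to_string`, the column can be converted to a list directly. This is a sketch of a different technique, not the function used in this notebook; the helper name `col_to_list` and the handles are made up. Unlike `col_to_tidy_list`, it only strips surrounding whitespace, which is sufficient here because Twitter handles cannot contain spaces.

```python
import pandas as pd

# hypothetical handles, some with stray whitespace
demo_df = pd.DataFrame({"twitter_name": [" example_a", "example_b ", "example_c"]})

def col_to_list(df, col):
    """Return a column as a list of strings with surrounding whitespace removed."""
    return df[col].str.strip().tolist()

print(col_to_list(demo_df, "twitter_name"))
```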


### Storing follower counts and adding them to handles data frame

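In the next two steps, we first store the number of followers of each MP in a new list. Second, the follower counts are added to the handles data frame as a new column. This enables us to identify the most popular MPs in the Green party. Handles for which no follower count could be retrieved appear as `None` in the list and as `NaN` in the data frame.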

In [9]:
# apply function in a for loop and store follower count in list
follower_count_list = []

for twitter_name in twitter_handles_list:
    follower_count_list.append(
        follower_count_fun(twitter_name)
    )

# print results 
print(follower_count_list)

[6571, 85644, 4013, None, 19628, 13949, 202482, 2019, None, 9634, 115398, 7490, 7730, 12033, 6524, 9446, 3495, 11080, None, 295112, 28453, 5772, 9511, 9767, 20515, 6341, 77261, 4655, 7493, 19938, 7767, 3351, 13783, 36170, 2510, 12207, 7475, 17965, 9607, 5432, 9875, 13313, 8868, 82567, 7531, 12960, 7795, 1711, 5675, 3126, 17175, 7334, 5704, 1744, 22471, 5139]
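
The list above contains `None` wherever the lookup failed. A small pure-Python sketch shows how handles and counts can be paired, filtered, and ranked; the three pairs are copied from the outputs above, and the variable names are chosen for this example only.

```python
# three handle/count pairs copied from the output above
handles = ["kirstenkappert", "babetteschefin", "goeringeckardt"]
counts = [6571, None, 202482]

# keep only handles for which a follower count was retrieved
known = [(h, c) for h, c in zip(handles, counts) if c is not None]

# sort by follower count, largest first
ranked = sorted(known, key = lambda pair: pair[1], reverse = True)
print(ranked)
```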


In [10]:
# add follower count list to data frame as a numeric column
# (assign returns a new data frame, which avoids pandas' SettingWithCopyWarning)
twitter_df_greens = twitter_df_greens.assign(follower_count = follower_count_list)

# print transformed data frame
print(twitter_df_greens)

        twitter_name       party  follower_count
34    kirstenkappert  DIE GRÜNEN          6571.0
47    konstantinnotz  DIE GRÜNEN         85644.0
67    markuskurthmdb  DIE GRÜNEN          4013.0
68    babetteschefin  DIE GRÜNEN             NaN
71      sven_kindler  DIE GRÜNEN         19628.0
84     agnieszka_mdb  DIE GRÜNEN         13949.0
92    goeringeckardt  DIE GRÜNEN        202482.0
98     markustressel  DIE GRÜNEN          2019.0
122        beatewaro  DIE GRÜNEN             NaN
129  julia_verlinden  DIE GRÜNEN          9634.0
135         jtrittin  DIE GRÜNEN        115398.0
166             k_sa  DIE GRÜNEN          7490.0
176     ulle_schauws  DIE GRÜNEN          7730.0
179    schickgerhard  DIE GRÜNEN         12033.0
185   manuelsarrazin  DIE GRÜNEN          6524.0
189    tabearoessner  DIE GRÜNEN          9446.0
194         crueffer  DIE GRÜNEN          3495.0
221         lisapaus  DIE GRÜNEN         11080.0
226      fostendorff  DIE GRÜNEN             NaN
228     cem_oezdemir


### Identifying the Green MP with most followers

Here, we store the Twitter handle of the Green MP with the highest number of Twitter followers in a new object (``most_followers_mp``). In the run shown here, the printed handle is `cem_oezdemir`.

In [11]:
# filter observation with highest follower count
max_followers = twitter_df_greens["follower_count"].max()

twitter_df_greens[twitter_df_greens["follower_count"] == max_followers]

# get twitter name column with highest follower count as string
most_followers_mp = twitter_df_greens[twitter_df_greens["follower_count"] == max_followers]["twitter_name"].to_string(index = False, header = False)

print(most_followers_mp)

cem_oezdemir
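
The same lookup can also be written with `idxmax`, which returns the index label of the largest value and skips `NaN` by default. The sketch below uses three rows copied from the data frame printed earlier, so its result refers to this small sample only, not to the full list of Green MPs.

```python
import pandas as pd

# a few rows copied from the data frame printed above
demo_df = pd.DataFrame({
    "twitter_name": ["kirstenkappert", "babetteschefin", "goeringeckardt"],
    "follower_count": [6571, None, 202482],
})

# idxmax skips NaN by default and returns the index label of the largest value
top_label = demo_df["follower_count"].idxmax()
top_handle = demo_df.loc[top_label, "twitter_name"]
print(top_handle)
```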


# Tweet extraction
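## Convert `user_timeline` of **Annalena Baerbock** to data frame

As an example, this project looks at the tweets of Annalena Baerbock (Bündnis 90/Die Grünen). In the next chunk, the `user_timeline` method is used again, this time with the handle stored in `most_followers_mp`. It requests up to 200 of the most recent tweets, excludes retweets, and stores the result in a new object. Because the handle is taken from `most_followers_mp`, the retrieved timeline belongs to whichever account the previous cell printed (`cem_oezdemir` in the output shown above), regardless of the `baerbock_` variable names. Because `count` is applied before retweets are removed, fewer than 200 tweets can be returned (78 rows in the output below). Subsequently, the new object is converted to a data frame and printed.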

In [12]:
# extract tweets
baerbock_tweets = api.user_timeline(
    # MP with most followers (Greens) - Annalena Baerbock
    screen_name = most_followers_mp,
    # maximum number of tweets per request
    count = 200,
    # do not include retweets
    include_rts = False,
    # scope of retrieved information
    tweet_mode = "extended"
)

# apply function that converts timeline object to data frame
baerbock_tweets_df = timeline_to_df(baerbock_tweets)

# print data frame
print(baerbock_tweets_df)

                        created_at                   id               id_str  \
0   Fri Dec 17 22:54:25 +0000 2021  1471977109480517637  1471977109480517637   
1   Fri Dec 17 14:39:35 +0000 2021  1471852578660970496  1471852578660970496   
2   Thu Dec 16 21:14:39 +0000 2021  1471589614183985154  1471589614183985154   
3   Wed Dec 15 16:31:13 +0000 2021  1471155896462069760  1471155896462069760   
4   Mon Dec 13 10:10:25 +0000 2021  1470335291814776837  1470335291814776837   
..                             ...                  ...                  ...   
73  Wed Nov 03 19:08:30 +0000 2021  1455975187908894720  1455975187908894720   
74  Mon Nov 01 20:50:45 +0000 2021  1455276147655254025  1455276147655254025   
75  Mon Nov 01 15:27:03 +0000 2021  1455194682389041166  1455194682389041166   
76  Sun Oct 31 18:53:01 +0000 2021  1454884127702982656  1454884127702982656   
77  Sun Oct 31 14:32:04 +0000 2021  1454818460794593289  1454818460794593289   

                                       

## Save relevant columns as `.csv` file

As a last step, the columns that are not required are dropped and only the relevant variables are stored in a new object, which is then saved as a `.csv` file if the file does not already exist.

In [13]:
# create subset of complete data frame
baerbock_tweets_subset_df = baerbock_tweets_df[[
    "id", 
    "created_at",
    "full_text",
    "display_text_range",
    "in_reply_to_user_id",
    "in_reply_to_screen_name",
    "is_quote_status",
    "retweet_count",
    "favorite_count",
    "possibly_sensitive"
]]

# print subsetted data frame
print(baerbock_tweets_subset_df)

                     id                      created_at  \
0   1471977109480517637  Fri Dec 17 22:54:25 +0000 2021   
1   1471852578660970496  Fri Dec 17 14:39:35 +0000 2021   
2   1471589614183985154  Thu Dec 16 21:14:39 +0000 2021   
3   1471155896462069760  Wed Dec 15 16:31:13 +0000 2021   
4   1470335291814776837  Mon Dec 13 10:10:25 +0000 2021   
..                  ...                             ...   
73  1455975187908894720  Wed Nov 03 19:08:30 +0000 2021   
74  1455276147655254025  Mon Nov 01 20:50:45 +0000 2021   
75  1455194682389041166  Mon Nov 01 15:27:03 +0000 2021   
76  1454884127702982656  Sun Oct 31 18:53:01 +0000 2021   
77  1454818460794593289  Sun Oct 31 14:32:04 +0000 2021   

                                                                                                                                                                         full_text  \
0                                                       @LindaTeuteberg @gegenvergessen Herzlichen Dank! Die 
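
The `created_at` values shown above are plain strings. If they are needed as timestamps later on, they can be parsed with the standard library. The sketch below is an optional addition, not part of the original workflow; it uses a timestamp copied from the first row of the output.

```python
from datetime import datetime

# timestamp string copied from the first row of the output above
raw = "Fri Dec 17 22:54:25 +0000 2021"

# %a and %b expect English day and month abbreviations (the default "C" locale)
parsed = datetime.strptime(raw, "%a %b %d %H:%M:%S %z %Y")
print(parsed.isoformat())  # 2021-12-17T22:54:25+00:00
```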

### Checking whether file already exists

In [14]:
# save data frame as csv in case it does not already exist
if os.path.isfile("baerbock_tweets.csv"):
    print("baerbock_tweets.csv already exists")
else:
    print("baerbock_tweets.csv did not exist before\nTweets are saved in a csv file")
    baerbock_tweets_subset_df.to_csv("baerbock_tweets.csv")

baerbock_tweets.csv did not exist before
Tweets are saved in a csv file
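
The check-then-save pattern above can be wrapped in a small helper. The following sketch is one possible variant under stated assumptions: the helper name `save_if_missing`, the file name, and the one-row data frame are hypothetical, and `index = False` is added so that the row index is not written as an extra column.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_if_missing(df, path):
    """Write df to path as csv unless the file already exists; return True if written."""
    path = Path(path)
    if path.exists():
        return False
    # index = False leaves out the row index column
    df.to_csv(path, index = False)
    return True

# demonstration in a temporary directory with a hypothetical one-row data frame
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "example_tweets.csv"
    demo_df = pd.DataFrame({"id": [1], "full_text": ["example"]})
    first_call = save_if_missing(demo_df, target)
    second_call = save_if_missing(demo_df, target)

print(first_call, second_call)  # True False
```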
