# Using the Twitter API

In [None]:
import tweepy
from datetime import datetime, timedelta

With the Twitter API we can access most of Twitter’s functionality from within Python (that means both reading **and** writing Tweets, or finding out about users and trends). The package of choice is *Tweepy*, which deals with all the messy details.

To access the Twitter API, you need to be authenticated. Hence, every request has to come with authentication information. To get this information in the first place, we need to generate our own credentials with a Developer Account:

1. Go to the <a href=https://developer.twitter.com/en>Twitter Developer Site</a> and apply for a Developer Account (you will need a Twitter account for this).
2. Create an application (e.g., "My_first_application"). Credentials and limits are per application, not per account.
3. Once you have created your application, copy your consumer API key and secret, as well as your access token and secret, into the Python code below (see also https://developer.twitter.com/en/docs/basics/authentication/overview/oauth).

You can add your credentials directly as strings:

In [None]:
CONSUMER_API_KEY = ''
CONSUMER_API_SECRET = ''
ACCESS_KEY = ''
ACCESS_SECRET = ''
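
Hard-coding secrets is convenient for a first test, but it makes them easy to leak when a notebook is shared. As one possible alternative, the sketch below reads the same four values from environment variables. The environment variable names and the helper `load_credentials` are assumptions made for illustration; neither Twitter nor Tweepy requires them.

```python
import os

# Hypothetical environment variable names; choose whatever suits your setup.
ENV_NAMES = {
    "CONSUMER_API_KEY": "TWITTER_CONSUMER_API_KEY",
    "CONSUMER_API_SECRET": "TWITTER_CONSUMER_API_SECRET",
    "ACCESS_KEY": "TWITTER_ACCESS_KEY",
    "ACCESS_SECRET": "TWITTER_ACCESS_SECRET",
}


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, using '' for unset variables."""
    return {name: environ.get(env_name, "") for name, env_name in ENV_NAMES.items()}


credentials = load_credentials()
missing = [name for name, value in credentials.items() if not value]
if missing:
    print("Missing credentials:", ", ".join(missing))
```

The values in `credentials` could then be passed on in place of the string constants above.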

We are also not allowed to request too many Tweets at the same time. There are per-day limits, as well as "rate limits" for 15-minute blocks. If you exceed your limits, you **will** get blocked for some time. For detailed information on the limits, check out https://developer.twitter.com/en/docs/rate-limits.
In many cases, Tweepy can automatically delay calls until the rate limit resets, but be aware that this doesn't always work, and we may need to add timeouts manually.
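
A manual timeout could look roughly like the sketch below. The helper `call_with_pauses` and its parameters are hypothetical and not part of Tweepy. It retries a call after a fixed pause whenever a given exception type is raised. With Tweepy v4, the exception to pass as `retry_on` would presumably be `tweepy.errors.TooManyRequests`, but that is an assumption to check against the installed version.

```python
import time


def call_with_pauses(func, retry_on, max_attempts=3, pause_seconds=15 * 60,
                     sleep=time.sleep):
    """Call func(); if it raises retry_on, pause and try again.

    Re-raises the exception once max_attempts attempts have failed.
    The sleep function is a parameter so the pause can be replaced in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(pause_seconds)
```

The default pause of 15 minutes mirrors the length of a rate-limit block mentioned above; shorter pauses with more attempts may work just as well.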

We are now ready to create our authenticated interface (automatically waiting on our rate limit as necessary):

In [None]:
auth = tweepy.OAuthHandler(CONSUMER_API_KEY, CONSUMER_API_SECRET)
auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)
api = tweepy.API(auth, wait_on_rate_limit = True)

Let's download some tweets!
We actually have different APIs to choose from. An overview can be found here: https://docs.tweepy.org/en/stable/api.html

We will use the standard search API. Note that this API only lets you download tweets matching general queries from the past week. If you want older tweets, you will need to download the tweets of a particular account or use the 30-day API, for example.

We can make a simple query such as `q="bayes"`. However, can you see why this could lead to problems?

Luckily, we can simply combine keywords with `OR` and `AND`.
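
If the keywords are kept in a list, the query string can also be assembled programmatically. The helper `combine_keywords` below is a hypothetical convenience function, not part of Tweepy; it only joins terms with the two operators named above.

```python
def combine_keywords(keywords, operator="AND"):
    """Join keywords into one query string using AND or OR."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return (" " + operator + " ").join(keywords)


query = combine_keywords(["bayes", "business", "school"])
print(query)
```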

In [None]:
tweets = []
for tweet in tweepy.Cursor(api.search_tweets,q="bayes AND business AND school",lang="en").items():
    tweets.append(tweet)

Here, `lang` specifies the language of tweets we request. Let's take a look at our tweets, as well as some of the basic information about them. You can find details about the tweet objects at https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet.

In [None]:
for tweet in tweets:
    print("Created at: " + str(tweet.created_at))
    print("User: " + tweet.user.screen_name)
    print("Followers: " + str(tweet.user.followers_count))
    print("Content: " + tweet.text)
    print("---------------------\n")

Note that there are some tweets starting with "RT". These are retweets, and can easily be filtered out:

In [None]:
for tweet in tweepy.Cursor(api.search_tweets,q="bayes AND business AND school AND -filter:retweets",lang="en").items():
    print("Created at: " + str(tweet.created_at))
    print("User: " + tweet.user.screen_name)
    print("Followers: " + str(tweet.user.followers_count))
    print("Content: " + tweet.text)
    print("---------------------\n")
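
Retweets can also be dropped locally after downloading, using the "RT" prefix noted above. The sketch below is a heuristic under stated assumptions: the tweets are made-up stand-in objects carrying only a `.text` attribute (so the code runs without API access), and `is_retweet` is a hypothetical helper.

```python
from types import SimpleNamespace


def is_retweet(tweet):
    """Heuristic: the text of a classic retweet starts with 'RT @'."""
    return tweet.text.startswith("RT @")


# Made-up stand-ins for real API results; only .text is provided.
sample_tweets = [
    SimpleNamespace(text="RT @someone: An example retweeted message"),
    SimpleNamespace(text="An example original message"),
]

originals = [tweet for tweet in sample_tweets if not is_retweet(tweet)]
for tweet in originals:
    print("Content: " + tweet.text)
```

Filtering in the query, as done above, avoids spending the rate limit on tweets that are discarded anyway, so the local filter is mainly useful for tweets that have already been collected.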

# Conjoint analysis (based on rankings)

This exercise is based on Miller, 2015, "Marketing Data Science: Modeling Techniques in Predictive Analytics with R and Python".

Here we illustrate a simple conjoint analysis, which measures how much utility consumers derive from particular product attributes (and thereby helps to build a picture of consumer preferences). Conjoint analysis is a staple of marketing research.

**Loading the relevant packages**

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
from patsy.contrasts import Sum

**Loading the data**

The CSV file can be found on the book's GitHub page at https://github.com/mtpa/mds/blob/master/MDS_Chapter_1/mobile_services_ranking.csv. Note that you cannot use this link directly; instead, you need the link to the raw data (click the button that says "Raw").

In [5]:
conjoint_data_frame = pd.read_csv('https://raw.githubusercontent.com/mtpa/mds/master/MDS_Chapter_1/mobile_services_ranking.csv')
conjoint_data_frame.head()

Some values in the data are wrapped in quotation marks, so let's also get rid of these:

In [None]:
conjoint_data_frame = conjoint_data_frame.replace('"', '', regex=True)

**Regression with sum contrasts**

In a conjoint analysis, we are essentially regressing a ranking, rating, or choice on the different attributes of the offering (with many possible specifications, depending on the question we want to answer and the survey design). Usually when we perform regressions, we use dummy (treatment) coding of categorical variables. That is, one level (say `samsung == 'Samsung NO'`) is the "baseline" represented by the intercept, and the effect we measure for any other level (say `samsung == 'Samsung YES'`) is **relative** to that baseline.

When using sum contrasts, we measure the main effect of each level of a categorical variable. The baseline, or intercept, is now a hypothetical one: it is chosen such that the coefficients of the levels of each categorical variable sum to 0. In practice, if we have a categorical variable with two levels (as in the example above), and the effect (or regression coefficient) of level `'Samsung YES'` is 0.4 "relative to the baseline" under treatment coding, we now get a coefficient of 0.2 for `'Samsung YES'` and a coefficient of -0.2 for `'Samsung NO'`. For all intents and purposes, we are centering categorical variables.
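
The arithmetic behind the two codings can be checked by hand. The sketch below uses made-up mean rankings for a single two-level attribute and assumes a balanced design with no other attributes, in which case the regression coefficients reduce to simple differences of level means.

```python
# Made-up mean rankings per level; not taken from the data set.
level_means = {"Samsung NO": 4.3, "Samsung YES": 4.7}

# Treatment coding: the intercept is the baseline level,
# and the effect of the other level is measured relative to it.
treatment_intercept = level_means["Samsung NO"]
treatment_effect_yes = level_means["Samsung YES"] - level_means["Samsung NO"]

# Sum coding: the intercept is the mean of the level means,
# and each level's effect is its deviation from that intercept.
sum_intercept = sum(level_means.values()) / len(level_means)
sum_effects = {level: mean - sum_intercept for level, mean in level_means.items()}

print("Treatment effect of YES:", round(treatment_effect_yes, 2))
print("Sum-coded effects:", {level: round(effect, 2) for level, effect in sum_effects.items()})
```

The sum-coded effects are half the treatment effect in size, with opposite signs, so they add up to zero.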

In Python, we use `patsy.contrasts`. Note that the code below drops the last column, `ranking`, from the list of explanatory variables, since it is our dependent variable:

In [None]:
attributes = conjoint_data_frame.columns.tolist()[:-1]
formula = 'ranking ~' + ''.join([' C(' + str(attribute) + ', Sum) +' for attribute in attributes])
print("Without adjustment: " + formula)
formula = formula[:-2]
print("With adjustment: " + formula)
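
The same formula string can be built without trimming a trailing `+` by joining the terms with `' + '`. In the sketch below, the attribute names are illustrative placeholders; in the notebook they come from the data frame's columns.

```python
# Illustrative attribute names standing in for the data frame's columns.
example_attributes = ["brand", "startup", "monthly"]

# One sum-coded term per attribute, joined with ' + '.
terms = ["C(" + attribute + ", Sum)" for attribute in example_attributes]
example_formula = "ranking ~ " + " + ".join(terms)
print(example_formula)
```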

Now, we can define a linear regression model as usual:

In [None]:
model = ols(formula, data = conjoint_data_frame).fit()
print(model.summary()) 

**Building part-worth information**

To get the part worths, we first need to collect the coefficients of each attribute's levels. We also need to capture the coefficient range to evaluate the importance of attributes:

In [None]:
level_name = []
part_worth = []
part_worth_range = []
for attribute in attributes:
    current_levels = sorted(set(conjoint_data_frame[attribute]))
    current_levels_remainder_at_end = []
    remainder = ''
    current_part_worth = []
    for level in current_levels:
        # Name under which the model stores this level's sum-coded coefficient
        indicator = 'C(' + attribute + ', Sum)[S.' + level + ']'
        if indicator in model.params:
            current_levels_remainder_at_end.append(level)
            current_part_worth.append(model.params[indicator])
        else:
            # The omitted level has no coefficient of its own
            remainder = level
    # Sum coding: the omitted level's part-worth is minus the sum of the others
    current_levels_remainder_at_end.append(remainder)
    current_part_worth.append((-1) * sum(current_part_worth))
    part_worth_range.append(max(current_part_worth) - min(current_part_worth))
    part_worth.append(current_part_worth)
    level_name.append(current_levels_remainder_at_end)
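
To see why the last level's part-worth is the negative sum of the others, consider a single attribute in isolation. The coefficients and level names below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical sum-coded coefficients for a four-level attribute:
# only three levels get a coefficient, the fourth is implied.
fitted = {"A": 0.5, "B": -1.25, "C": 0.25}
omitted_level = "D"

# Part-worths must sum to zero, which pins down the omitted level.
part_worths = dict(fitted)
part_worths[omitted_level] = -sum(fitted.values())

# The range of part-worths is what feeds into attribute importance.
spread = max(part_worths.values()) - min(part_worths.values())
print(part_worths, spread)
```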

Using the range of coefficients, we define the importance of attributes: the larger the range (relative to the other ranges), the larger the importance:

In [None]:
attribute_importance = []
for att_no in range(len(attributes)):
    attribute_importance.append(round(100 * (part_worth_range[att_no] / sum(part_worth_range)),2))
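
As a quick sanity check of this definition, the importances should add up to roughly 100. The sketch below applies the same calculation to made-up ranges; `importance_from_ranges` is a hypothetical helper, not part of the original analysis.

```python
def importance_from_ranges(ranges):
    """Express each part-worth range as a percentage of the total range."""
    total = sum(ranges)
    return [round(100 * r / total, 2) for r in ranges]


# Made-up part-worth ranges for three attributes.
example_ranges = [1.75, 0.5, 0.25]
example_importance = importance_from_ranges(example_ranges)
print(example_importance)
```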

**Creating a spine chart of preferences**

Let's look at the attributes in turn, their importance, as well as the coefficients of the individual levels:

In [None]:
for att_no in range(len(attributes)):
    print('\nAttribute:', attributes[att_no])
    print('    Importance:', attribute_importance[att_no])
    print('    Level Part-Worths')
    for level in range(len(level_name[att_no])):
        print('       ',level_name[att_no][level], part_worth[att_no][level])

Python doesn't have great built-in functionality for creating a spine chart of part-worths like the one available in R, but we can get a start with a `seaborn` `barplot`. For this, though, we need to make sure that the chart distinguishes the levels of start-up and monthly costs:

In [None]:
for att_no in range(len(level_name)):
    for level_no in range(len(level_name[att_no])):
        if level_name[att_no][level_no].startswith('$'):
            level_name[att_no][level_no] = attributes[att_no] + ' ' + level_name[att_no][level_no]

Next, to create the barplot in the right order, it is easiest to combine all the relevant information in a data frame:

In [None]:
# Collect one record per attribute level
# (DataFrame.append was removed in pandas 2.0, so we build the frame in one go)
rows = []
for att_no in range(len(attributes)):
    for item in range(len(part_worth[att_no])):
        rows.append({'attribute': attributes[att_no],
                     'importance': attribute_importance[att_no],
                     'values': level_name[att_no][item],
                     'part_worths': part_worth[att_no][item]})
df = pd.DataFrame(rows)
# Most important attributes first, then alphabetically by attribute and level
df = df.sort_values(['importance', 'attribute', 'values'], ascending=[False, True, True])

Finally, we can take a look at the part-worths:

In [None]:
sns.barplot(data=df,x="part_worths", y="values")
plt.show()

What would things look like with your own rankings?