## Using APIs in Projects


When getting data from APIs, I strongly suggest following a three-step workflow:

1. Write a program that gets data from the API and saves all of it (if possible) to a file.
2. Write a second program (usually a second file) that loads the saved raw data, extracts the fields that will be useful for analysis, and saves them in a flat file (typically a CSV).
3. Write a third program that loads the CSV file and does the analysis.

This approach has a few important benefits.

The first and most important is that it is often difficult to get the same raw data again. If you are using Twitter, the Search API only returns tweets from about the last week. If you are doing analysis a month down the road and decide that you really wish you had saved metadata about the number of retweets, it is too late. By saving the raw data, you can change your measures or analysis strategy and still have access to everything you collected.

The second is that this gives you a nice pipeline, with intermediate files. Instead of loading the entire raw data file in the program that does the analysis, you only have to load the CSV, which is often much smaller and easier to work with.

This brief lesson will show an example of this workflow, using `tweepy`.

Note that I'm going to put everything in one file for convenience, but my typical workflow is to put these in separate files and then run each file separately.
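Before the Twitter version, here is a minimal offline sketch of the three steps. Everything in it is an assumption made for illustration: `fetch_raw` is a hypothetical stand-in for an API call, the records are made up, and the file names are arbitrary.

```python
import csv
import json
import os
import tempfile


def fetch_raw():
    """Hypothetical stand-in for an API call; the records are made up."""
    return [
        {"id": 1, "full_text": "first example", "retweet_count": 0, "lang": "en"},
        {"id": 2, "full_text": "a second, longer example", "retweet_count": 7, "lang": "en"},
    ]


def save_raw(records, path):
    """Step 1: store everything that came back, untouched."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)


def clean(raw_path, csv_path):
    """Step 2: load the raw file and keep only the columns needed for analysis."""
    with open(raw_path, "r", encoding="utf-8") as f:
        records = json.load(f)
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "retweets", "tweet_length"])
        for r in records:
            writer.writerow([r["id"], r["retweet_count"], len(r["full_text"])])


def analyze(csv_path):
    """Step 3: work only from the flat file."""
    with open(csv_path, "r", encoding="utf-8", newline="") as f:
        rows = list(csv.DictReader(f))
    return sum(int(row["retweets"]) for row in rows) / len(rows)


workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "raw.json")
csv_path = os.path.join(workdir, "cleaned.csv")

save_raw(fetch_raw(), raw_path)
clean(raw_path, csv_path)
mean_retweets = analyze(csv_path)
print(mean_retweets)
```

Because the raw file still contains fields the CSV dropped (here, `lang`), step 2 can be changed and rerun later without calling the API again.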

## Program 1 - Data Retrieval

The goal of our project is to produce a histogram of the number of retweets for recent tweets about President Trump. The first program gets tweets about President Trump.

In [2]:
import tweepy
import json
from twitter_authentication import CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

In [4]:
# Make a list to store the results
results = []
# Note: tweepy 4.x renamed api.search to api.search_tweets
for tweet in tweepy.Cursor(api.search,
                           q='Trump -filter:retweets', # only get the original tweets
                           tweet_mode='extended',
                           count=100).items(5000): # 100 per page is the standard search maximum; raise 5000 as high as you like, if you have time :)
    results.append(tweet._json)
    #print(tweet.user.screen_name + "\t" + str(tweet.created_at) + "\t" + tweet.full_text)

In [5]:
# Then, write the results to a file
with open('raw_trump_tweets.json', 'w') as f:
    json.dump(results, f)

## Program 2 - Data Cleaning

This program loads the saved raw data, grabs what we want, and converts it into a CSV.

I decided to save the timestamp, text, and retweet and favorite counts.

This is also where you would typically do more complicated measure creation. Here I show how to create a measure of `tweet_length`.

In [6]:
with open('raw_trump_tweets.json', 'r') as f:
    tweets = json.load(f)

In [11]:
tweets[0]

{'created_at': 'Thu May 27 22:05:53 +0000 2021',
 'id': 1398037772615798786,
 'id_str': '1398037772615798786',
 'full_text': "@OneCrazyCat2 @ifsaica @DavidPriess You mean Trump and God aren't the same person? 🤣🤣 But seriously, this dude should retire but he won't. It is about control and money for politicians. They don't care about right and wrong, or us. These are the politicians that make me oppose the conservatives anymore.",
 'truncated': False,
 'display_text_range': [36, 304],
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [{'screen_name': 'OneCrazyCat2',
    'name': 'OneCrazyCat',
    'id': 1147210856775766016,
    'id_str': '1147210856775766016',
    'indices': [0, 13]},
   {'screen_name': 'ifsaica',
    'name': 'Alexander - Hebrews 13:5-6',
    'id': 55489838,
    'id_str': '55489838',
    'indices': [14, 22]},
   {'screen_name': 'DavidPriess',
    'name': 'David Priess',
    'id': 2521546890,
    'id_str': '2521546890',
    'indices': [23, 35]}],
  'urls': 

In [10]:
import csv
with open('cleaned_data.csv', 'w', 
          encoding='UTF-8',
          newline='') as fn:
    f = csv.writer(fn)
    f.writerow(['created_at',
                'tweet_text',
                'retweets',
                'favorites',
                'tweet_length'
               ])
    for tweet in tweets:
        f.writerow([tweet['created_at'], 
                    tweet['full_text'],
                    tweet['retweet_count'],
                    tweet['favorite_count'],
                    len(tweet['full_text'])
                   ])
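As an illustration of slightly richer measure creation, here is a sketch that derives a few extra columns from one raw tweet dictionary. The function name `extract_measures` and the extra measures (`n_hashtags`, `n_mentions`, `hour_utc`) are my own assumptions rather than part of the lesson, and the example tweet is made up. The field names and the timestamp format follow the raw output shown above.

```python
from datetime import datetime

# Format of the 'created_at' field, e.g. 'Thu May 27 22:05:53 +0000 2021'
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def extract_measures(tweet):
    """Return a flat dict of measures computed from one raw tweet dict."""
    entities = tweet.get("entities", {})
    created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
    return {
        "created_at": created.isoformat(),
        "tweet_length": len(tweet["full_text"]),
        "n_hashtags": len(entities.get("hashtags", [])),
        "n_mentions": len(entities.get("user_mentions", [])),
        "hour_utc": created.hour,
    }


# A made-up tweet with the same structure as the raw data
example = {
    "created_at": "Thu May 27 22:05:53 +0000 2021",
    "full_text": "@someone a made-up tweet #example",
    "entities": {
        "hashtags": [{"text": "example"}],
        "user_mentions": [{"screen_name": "someone"}],
    },
}

row = extract_measures(example)
print(row)
```

Each dictionary returned this way could be written out with `csv.DictWriter` instead of building the row lists by hand.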

## Program 3 - Data Analysis

Here we use pandas to load the data and analyze it. This could include statistical tests. Here, I'm just visualizing the distribution of retweets and the relationship between retweets and length.

In [None]:
import pandas as pd
import seaborn as sns

In [None]:
df = pd.read_csv('./cleaned_data.csv')

In [None]:
# Just make sure it looks OK.
df.sort_values('retweets')

In [None]:
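# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
sns.histplot(df.retweets, kde=True)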

As expected, it's super skewed, with most tweets never getting retweeted while a few get tons of retweets.
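One quick way to put numbers on that skew, without a plot, is to compare the mean with the median and to count the zeros. This is only a sketch: the retweet counts below are made up, not taken from the collected data.

```python
import statistics

# Made-up retweet counts with the same kind of shape: mostly zeros, one huge value
retweets = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 5, 40, 950]

mean = statistics.mean(retweets)
median = statistics.median(retweets)
share_zero = sum(1 for r in retweets if r == 0) / len(retweets)

print(f"mean={mean:.1f}, median={median}, share with zero retweets={share_zero:.2f}")
```

When the mean is far above the median like this, a handful of very large values is driving the average.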

Let's see if it changes if we get rid of the tweets that never got retweeted (like, maybe we have a principled reason to believe they are different from other tweets).

In [None]:
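# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
sns.histplot(df.loc[df.retweets > 0, 'retweets'], kde=True)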

As I thought, this is a somewhat "scale-free" distribution, meaning wherever you zoom in, you see the same pattern. Try changing the `0` up above to any (small) number.
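To make the "same pattern wherever you zoom in" idea concrete, here is a sketch using synthetic data rather than the tweets. It assumes an idealized continuous Pareto (power-law) distribution with a tail exponent `ALPHA` that I picked arbitrarily. For such a distribution, the median of the values above a threshold, divided by that threshold, should come out about the same for every threshold. Real retweet counts are integers with many zeros, so they will only roughly behave this way.

```python
import random
import statistics

ALPHA = 1.5  # assumed tail exponent, chosen only for illustration
rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic power-law samples (all values are >= 1)
samples = [rng.paretovariate(ALPHA) for _ in range(100_000)]


def rescaled_median(values, threshold):
    """Median of the values above `threshold`, expressed in units of the threshold."""
    tail = [v for v in values if v > threshold]
    return statistics.median(tail) / threshold


ratios = {t: rescaled_median(samples, t) for t in (1, 2, 4, 8)}
for threshold, ratio in ratios.items():
    print(threshold, round(ratio, 2))

# For an exact Pareto distribution the ratio is 2 ** (1 / ALPHA) at every threshold
print("theory:", round(2 ** (1 / ALPHA), 2))
```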

For fun, let's also look at the relationship between retweets and tweet length.

In [None]:
import numpy as np

In [None]:
sns.jointplot(y='retweets', x='tweet_length', data = df);

In [None]:
# Because retweets are so skewed, let's log them
p = sns.jointplot(y=np.log(df.retweets + 1), x='tweet_length', data = df)
p.set_axis_labels('Tweet Length','Retweets (log)');
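A short note on the `+ 1`: many tweets have zero retweets, and the log of zero is undefined, so adding one before taking the log keeps those rows in the plot. The sketch below shows the same transform with the standard library's `math.log1p` on made-up counts; NumPy has an equivalent `np.log1p` for arrays and columns.

```python
import math

counts = [0, 1, 9, 99, 9999]  # made-up retweet counts

# log1p(x) computes log(1 + x), so a count of zero maps to exactly 0.0
logged = [math.log1p(c) for c in counts]

for count, value in zip(counts, logged):
    print(count, round(value, 3))
```

This is the natural log, matching `np.log` in the cell above.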

# Day 9 Coding Challenge

## Program 1 - Data Retrieval

In [1]:
# Set up the API client
import tweepy
import json
from twitter_authentication import CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

In [2]:
# Collect the raw results for each query in its own list
# Note: tweepy 4.x renamed api.search to api.search_tweets
results_climate_change = []
results_global_warming = []
for tweet in tweepy.Cursor(api.search,
                           q = '"climate change" -filter:retweets',
                           tweet_mode = 'extended',
                           count = 100).items(10000): # 100 per page is the standard search maximum
    results_climate_change.append(tweet._json)
for tweet in tweepy.Cursor(api.search,
                           q = '"global warming" -filter:retweets',
                           tweet_mode = 'extended',
                           count = 100).items(10000):
    results_global_warming.append(tweet._json)

In [3]:
# Save the raw data to one file per query
with open('raw_climate_change.json', 'w') as f_cc:
    json.dump(results_climate_change, f_cc)
with open('raw_global_warming.json', 'w') as f_gb:
    json.dump(results_global_warming, f_gb)

## Program 2 - Data Cleaning

In [6]:
# Load the data of both files
with open('raw_climate_change.json', 'r') as f_cc:
    tweets_climate_change = json.load(f_cc)
with open('raw_global_warming.json', 'r') as f_gb:
    tweets_global_warming = json.load(f_gb)

In [8]:
# Write tidy data into .csv files
import csv
with open('cleaned_data_climate_change.csv', 'w', 
          encoding='UTF-8',
          newline='') as fn_cc:
    f_cc = csv.writer(fn_cc)
    f_cc.writerow(['created_at',
                   'tweet_text',
                   'retweets',
                   'favorites',
                   'tweet_length'
                  ])
    for tweet in tweets_climate_change:
        f_cc.writerow([tweet['created_at'],
                       tweet['full_text'],
                       tweet['retweet_count'],
                       tweet['favorite_count'],
                       len(tweet['full_text'])
                      ])
        
with open('cleaned_data_global_warming.csv', 'w', 
          encoding='UTF-8',
          newline='') as fn_gb:
    f_gb = csv.writer(fn_gb)
    f_gb.writerow(['created_at',
                   'tweet_text',
                   'retweets',
                   'favorites',
                   'tweet_length'
                  ])
    for tweet in tweets_global_warming:
        f_gb.writerow([tweet['created_at'],
                       tweet['full_text'],
                       tweet['retweet_count'],
                       tweet['favorite_count'],
                       len(tweet['full_text'])
                      ])
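The two blocks above differ only in the input list and the output file name, so one option is to move the shared logic into a function. The sketch below is one way to do that. The helper name `write_clean_csv` is my own, the sample tweets are made up, and the files go to a temporary directory so that the real cleaned files are not overwritten.

```python
import csv
import os
import tempfile

COLUMNS = ["created_at", "tweet_text", "retweets", "favorites", "tweet_length"]


def write_clean_csv(tweets, path):
    """Write the analysis columns for a list of raw tweet dicts to `path`."""
    with open(path, "w", encoding="UTF-8", newline="") as fn:
        writer = csv.writer(fn)
        writer.writerow(COLUMNS)
        for tweet in tweets:
            writer.writerow([tweet["created_at"],
                             tweet["full_text"],
                             tweet["retweet_count"],
                             tweet["favorite_count"],
                             len(tweet["full_text"])])
    return len(tweets)


# Made-up stand-ins for tweets_climate_change and tweets_global_warming
sample = {
    "climate_change": [{"created_at": "Thu May 27 22:05:53 +0000 2021",
                        "full_text": "made-up text about climate change",
                        "retweet_count": 3,
                        "favorite_count": 5}],
    "global_warming": [{"created_at": "Thu May 27 22:06:10 +0000 2021",
                        "full_text": "made-up text about global warming",
                        "retweet_count": 0,
                        "favorite_count": 1}],
}

out_dir = tempfile.mkdtemp()
for name, tweets in sample.items():
    n_rows = write_clean_csv(tweets, os.path.join(out_dir, f"cleaned_data_{name}.csv"))
    print(name, n_rows)
```

Adding a third search term then only means adding one more entry to the dictionary.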

## Program 3 - Data Analysis

In [9]:
# Importing .csv files
import pandas as pd
import seaborn as sns
df_climate_change = pd.read_csv('./cleaned_data_climate_change.csv')
df_global_warming = pd.read_csv('./cleaned_data_global_warming.csv')

In [None]:
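# Sort by retweets. A notebook cell only displays the value of its last expression,
# so only the global warming table is shown here.
df_climate_change.sort_values('retweets')
df_global_warming.sort_values('retweets')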

In [None]:
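# Histograms of retweets (both are drawn on the same axes)
# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
sns.histplot(df_climate_change.retweets, kde=True)
sns.histplot(df_global_warming.retweets, kde=True)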

In [None]:
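# Histograms of retweets for tweets with more than 100 retweets
# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
sns.histplot(df_climate_change.loc[df_climate_change.retweets > 100, 'retweets'], kde=True)
sns.histplot(df_global_warming.loc[df_global_warming.retweets > 100, 'retweets'], kde=True)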

In [None]:
# Scatterplot of retweets vs. tweet length
import numpy as np
sns.jointplot(y='retweets', x='tweet_length', data = df_climate_change);
sns.jointplot(y='retweets', x='tweet_length', data = df_global_warming);

In [None]:
#p = sns.jointplot(y=np.log(df_climate_change.retweets + 1), x='tweet_length', data = df_climate_change)
#p.set_axis_labels('Tweet Length','Retweets (log)');
p = sns.jointplot(y=np.log(df_global_warming.retweets + 1), x='tweet_length', data = df_global_warming)
p.set_axis_labels('Tweet Length','Retweets (log)');

In [47]:
# Concatenate the text of every tweet that mentions "environment"
# (the match is case-sensitive, so "Environment" is not counted)
env_climate_change = ""
env_global_warming = ""
for tweet in df_climate_change.tweet_text:
    if "environment" in tweet:
        env_climate_change = env_climate_change + " " + tweet
for tweet in df_global_warming.tweet_text:
    if "environment" in tweet:
        env_global_warming = env_global_warming + " " + tweet
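The loops above match only the lowercase word, and they would fail on a missing value, since an empty text cell read by pandas comes back as a float `NaN`. Here is a sketch of a more forgiving variant. The function name `join_matching` is hypothetical and the texts are made up. It works on any iterable of values, so a pandas column could be passed in as well.

```python
def join_matching(texts, keyword):
    """Join all string entries that contain `keyword`, ignoring case.

    Non-string entries (for example NaN from an empty CSV cell) are skipped.
    """
    keyword = keyword.lower()
    matches = [t for t in texts if isinstance(t, str) and keyword in t.lower()]
    return " ".join(matches)


# Made-up texts, including a capitalized match and a missing value
texts = ["Protect the Environment", "unrelated", float("nan"), "environmental policy"]

combined = join_matching(texts, "environment")
print(combined)
```

Unlike the loops above, the result has no leading space, and building it with `join` avoids repeated string concatenation.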

In [57]:
import matplotlib.pyplot as plt

from wordcloud import WordCloud, STOPWORDS    

wordcloud_climate_change = WordCloud(width = 800, height = 800,
                                     background_color = 'white',
                                     stopwords = set(STOPWORDS),
                                     min_font_size = 10).generate(env_climate_change)

wordcloud_global_warming = WordCloud(width = 800, height = 800,
                                     background_color = 'white',
                                     stopwords = set(STOPWORDS),
                                     min_font_size = 10).generate(env_global_warming)

wordcloud_climate_change.to_file("wordcloud_climate_change.png")
wordcloud_global_warming.to_file("wordcloud_global_warming.png")
