# Bitcoin sentiment analysis using Twitter

## Data generation

searchtweets API reference: https://twitterdev.github.io/search-tweets-python/  
Twitter API reference: https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search.html

`~/.twitter_keys.yaml` contains `endpoint`, `consumer_key`, and `consumer_secret`.  
Change `yaml_key` to get data for the last 30 days (250 queries / month) or since Twitter's inception in 2006 (50 queries / month):  
`yaml_key = "search_tweets_premium_30day"`  
`yaml_key = "search_tweets_premium_archive"`  


Each request made by a stream counts as one query against the monthly quota.  
For example, if `results_per_call` is 100 and `max_results` is 1000, that is 10 queries.  
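The query arithmetic above can be sketched as a small helper. `queries_needed` is a hypothetical name, and the sketch assumes that every paged request counts as exactly one query.

```python
import math

def queries_needed(max_results, results_per_call):
    """Number of requests needed to return max_results tweets."""
    # a partially filled final page still costs one request, hence the ceiling
    return math.ceil(max_results / results_per_call)

print(queries_needed(1000, 100))  # 10
```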

In [1]:
from searchtweets import ResultStream, gen_rule_payload, load_credentials, collect_results

# general imports
import numpy as np
import pandas as pd
#import tweepy
from textblob import TextBlob
import re
import time

# plotting and visualization
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [20]:
premium_search_args = load_credentials("~/.twitter_keys.yaml",
                                          yaml_key="search_tweets_premium_30day",
                                          env_overwrite=False)

Grabbing bearer token from OAUTH


In [7]:
# dates: every 3-hour timestamp in 2018
months = np.arange(1, 13)
days = np.arange(1, 32)
# named `times` so the list does not shadow the imported `time` module (time.sleep is used later)
times = [" 00:00", " 03:00", " 6:00", " 09:00", " 12:00", " 15:00", " 18:00", " 21:00"]
# months x days also produces dates that do not exist; build everything, then filter those out
dates_extra = ["2018-" + str(m) + "-" + str(d) + t for m in months for d in days for t in times]
spurious_dates = ['2018-2-29', '2018-2-30', '2018-2-31', '2018-4-31', '2018-6-31', '2018-9-31', '2018-11-31']
spurious_dates = [d + t for d in spurious_dates for t in times]
dates = [d for d in dates_extra if d not in spurious_dates]
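As an alternative sketch, the same 3-hour grid can be generated with the standard library, so that non-existent dates never appear and do not have to be listed by hand. `three_hour_stamps` is a hypothetical helper. It zero-pads months, days, and hours, so the strings differ from the ones built in the cell above even though the count (2920 for 2018) is the same.

```python
from datetime import datetime, timedelta

def three_hour_stamps(year):
    """Return every 3-hour timestamp in `year` as 'YYYY-mm-DD HH:MM' strings."""
    stamp = datetime(year, 1, 1)
    step = timedelta(hours=3)
    stamps = []
    # datetime arithmetic only ever yields valid calendar dates
    while stamp.year == year:
        stamps.append(stamp.strftime("%Y-%m-%d %H:%M"))
        stamp += step
    return stamps

dates_2018 = three_hour_stamps(2018)
print(len(dates_2018), dates_2018[0], dates_2018[-1])
```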

In [6]:
# a plain loop avoids building a throwaway list of None values
for i, d in enumerate(dates):
    print(i, d)

0 2018-1-1 00:00
1 2018-1-1 03:00
2 2018-1-1 6:00
3 2018-1-1 09:00
4 2018-1-1 12:00
5 2018-1-1 15:00
6 2018-1-1 18:00
7 2018-1-1 21:00
8 2018-1-2 00:00
9 2018-1-2 03:00
10 2018-1-2 6:00
11 2018-1-2 09:00
12 2018-1-2 12:00
13 2018-1-2 15:00
14 2018-1-2 18:00
15 2018-1-2 21:00
16 2018-1-3 00:00
17 2018-1-3 03:00
18 2018-1-3 6:00
19 2018-1-3 09:00
20 2018-1-3 12:00
21 2018-1-3 15:00
22 2018-1-3 18:00
23 2018-1-3 21:00
24 2018-1-4 00:00
25 2018-1-4 03:00
26 2018-1-4 6:00
27 2018-1-4 09:00
28 2018-1-4 12:00
29 2018-1-4 15:00
30 2018-1-4 18:00
31 2018-1-4 21:00
32 2018-1-5 00:00
33 2018-1-5 03:00
34 2018-1-5 6:00
35 2018-1-5 09:00
36 2018-1-5 12:00
37 2018-1-5 15:00
38 2018-1-5 18:00
39 2018-1-5 21:00
40 2018-1-6 00:00
41 2018-1-6 03:00
42 2018-1-6 6:00
43 2018-1-6 09:00
44 2018-1-6 12:00
45 2018-1-6 15:00
46 2018-1-6 18:00
47 2018-1-6 21:00
48 2018-1-7 00:00
49 2018-1-7 03:00
50 2018-1-7 6:00
51 2018-1-7 09:00
52 2018-1-7 12:00
53 2018-1-7 15:00
54 2018-1-7 18:00
55 2018-1-7 21:00
56 2018-1

1006 2018-5-6 18:00
1007 2018-5-6 21:00
1008 2018-5-7 00:00
1009 2018-5-7 03:00
1010 2018-5-7 6:00
1011 2018-5-7 09:00
1012 2018-5-7 12:00
1013 2018-5-7 15:00
1014 2018-5-7 18:00
1015 2018-5-7 21:00
1016 2018-5-8 00:00
1017 2018-5-8 03:00
1018 2018-5-8 6:00
1019 2018-5-8 09:00
1020 2018-5-8 12:00
1021 2018-5-8 15:00
1022 2018-5-8 18:00
1023 2018-5-8 21:00
1024 2018-5-9 00:00
1025 2018-5-9 03:00
1026 2018-5-9 6:00
1027 2018-5-9 09:00
1028 2018-5-9 12:00
1029 2018-5-9 15:00
1030 2018-5-9 18:00
1031 2018-5-9 21:00
1032 2018-5-10 00:00
1033 2018-5-10 03:00
1034 2018-5-10 6:00
1035 2018-5-10 09:00
1036 2018-5-10 12:00
1037 2018-5-10 15:00
1038 2018-5-10 18:00
1039 2018-5-10 21:00
1040 2018-5-11 00:00
1041 2018-5-11 03:00
1042 2018-5-11 6:00
1043 2018-5-11 09:00
1044 2018-5-11 12:00
1045 2018-5-11 15:00
1046 2018-5-11 18:00
1047 2018-5-11 21:00
1048 2018-5-12 00:00
1049 2018-5-12 03:00
1050 2018-5-12 6:00
1051 2018-5-12 09:00
1052 2018-5-12 12:00
1053 2018-5-12 15:00
1054 2018-5-12 18:00
105

1976 2018-9-5 00:00
1977 2018-9-5 03:00
1978 2018-9-5 6:00
1979 2018-9-5 09:00
1980 2018-9-5 12:00
1981 2018-9-5 15:00
1982 2018-9-5 18:00
1983 2018-9-5 21:00
1984 2018-9-6 00:00
1985 2018-9-6 03:00
1986 2018-9-6 6:00
1987 2018-9-6 09:00
1988 2018-9-6 12:00
1989 2018-9-6 15:00
1990 2018-9-6 18:00
1991 2018-9-6 21:00
1992 2018-9-7 00:00
1993 2018-9-7 03:00
1994 2018-9-7 6:00
1995 2018-9-7 09:00
1996 2018-9-7 12:00
1997 2018-9-7 15:00
1998 2018-9-7 18:00
1999 2018-9-7 21:00
2000 2018-9-8 00:00
2001 2018-9-8 03:00
2002 2018-9-8 6:00
2003 2018-9-8 09:00
2004 2018-9-8 12:00
2005 2018-9-8 15:00
2006 2018-9-8 18:00
2007 2018-9-8 21:00
2008 2018-9-9 00:00
2009 2018-9-9 03:00
2010 2018-9-9 6:00
2011 2018-9-9 09:00
2012 2018-9-9 12:00
2013 2018-9-9 15:00
2014 2018-9-9 18:00
2015 2018-9-9 21:00
2016 2018-9-10 00:00
2017 2018-9-10 03:00
2018 2018-9-10 6:00
2019 2018-9-10 09:00
2020 2018-9-10 12:00
2021 2018-9-10 15:00
2022 2018-9-10 18:00
2023 2018-9-10 21:00
2024 2018-9-11 00:00
2025 2018-9-11 03


In [102]:
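# index 1944 -> 2018-9-1 00:00
# index 2063 -> 2018-9-15 21:00
test_dates = dates[1944:2063+1]
# "201809042100", "fromDate": "201809041800"

In [103]:
# skip the first 5 days (40 three-hour steps), so the list starts at 2018-9-6 00:00
test_dates = test_dates[40:]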

In [104]:
# a plain loop avoids building a throwaway list of None values
for i, d in enumerate(test_dates):
    print(i, d)

0 2018-9-6 00:00
1 2018-9-6 03:00
2 2018-9-6 6:00
3 2018-9-6 09:00
4 2018-9-6 12:00
5 2018-9-6 15:00
6 2018-9-6 18:00
7 2018-9-6 21:00
8 2018-9-7 00:00
9 2018-9-7 03:00
10 2018-9-7 6:00
11 2018-9-7 09:00
12 2018-9-7 12:00
13 2018-9-7 15:00
14 2018-9-7 18:00
15 2018-9-7 21:00
16 2018-9-8 00:00
17 2018-9-8 03:00
18 2018-9-8 6:00
19 2018-9-8 09:00
20 2018-9-8 12:00
21 2018-9-8 15:00
22 2018-9-8 18:00
23 2018-9-8 21:00
24 2018-9-9 00:00
25 2018-9-9 03:00
26 2018-9-9 6:00
27 2018-9-9 09:00
28 2018-9-9 12:00
29 2018-9-9 15:00
30 2018-9-9 18:00
31 2018-9-9 21:00
32 2018-9-10 00:00
33 2018-9-10 03:00
34 2018-9-10 6:00
35 2018-9-10 09:00
36 2018-9-10 12:00
37 2018-9-10 15:00
38 2018-9-10 18:00
39 2018-9-10 21:00
40 2018-9-11 00:00
41 2018-9-11 03:00
42 2018-9-11 6:00
43 2018-9-11 09:00
44 2018-9-11 12:00
45 2018-9-11 15:00
46 2018-9-11 18:00
47 2018-9-11 21:00
48 2018-9-12 00:00
49 2018-9-12 03:00
50 2018-9-12 6:00
51 2018-9-12 09:00
52 2018-9-12 12:00
53 2018-9-12 15:00
54 2018-9-12 18:00
55 2


In [106]:
for i, d in enumerate(test_dates):
    print(i, d)
    if i % 8 == 0 and i != 0:
        print("waiting 2 seconds")
        time.sleep(2)

0 2018-9-6 00:00
1 2018-9-6 03:00
2 2018-9-6 6:00
3 2018-9-6 09:00
4 2018-9-6 12:00
5 2018-9-6 15:00
6 2018-9-6 18:00
7 2018-9-6 21:00
8 2018-9-7 00:00
waiting 2 seconds
9 2018-9-7 03:00
10 2018-9-7 6:00
11 2018-9-7 09:00
12 2018-9-7 12:00
13 2018-9-7 15:00
14 2018-9-7 18:00
15 2018-9-7 21:00
16 2018-9-8 00:00
waiting 2 seconds
17 2018-9-8 03:00
18 2018-9-8 6:00
19 2018-9-8 09:00
20 2018-9-8 12:00
21 2018-9-8 15:00
22 2018-9-8 18:00
23 2018-9-8 21:00
24 2018-9-9 00:00
waiting 2 seconds
25 2018-9-9 03:00
26 2018-9-9 6:00
27 2018-9-9 09:00
28 2018-9-9 12:00
29 2018-9-9 15:00
30 2018-9-9 18:00
31 2018-9-9 21:00
32 2018-9-10 00:00
waiting 2 seconds
33 2018-9-10 03:00
34 2018-9-10 6:00
35 2018-9-10 09:00
36 2018-9-10 12:00
37 2018-9-10 15:00
38 2018-9-10 18:00
39 2018-9-10 21:00
40 2018-9-11 00:00
waiting 2 seconds
41 2018-9-11 03:00
42 2018-9-11 6:00
43 2018-9-11 09:00
44 2018-9-11 12:00
45 2018-9-11 15:00
46 2018-9-11 18:00
47 2018-9-11 21:00
48 2018-9-12 00:00
waiting 2 seconds
49 2018-9

In [107]:
print("gathering", len(test_dates)/8, "days of tweets in 3 hour intervals")

gathering 10.0 days of tweets in 3 hour intervals


In [108]:
S2_dict = {}
def collect_tweets(from_date, to_date):
    # maxResults is capped at 100 for sandbox account
    # date format: YYYY-mm-DD HH:MM
    bitcoin_rule = gen_rule_payload("bitcoin", results_per_call=100, from_date=from_date, to_date=to_date) 
    print(bitcoin_rule)
    tweets = collect_results(bitcoin_rule, max_results=100, result_stream_args=premium_search_args)
    return tweets

In [109]:
tweets = []
# each consecutive pair of timestamps is one (from_date, to_date) window
for i in range(len(test_dates) - 1):
    #S2_dict[i] = collect_tweets(test_dates[i], test_dates[i+1])
    tweets.extend(collect_tweets(test_dates[i], test_dates[i+1]))
    # pause at i = 8, 16, ... to space out the requests
    if i % 8 == 0 and i != 0:
        print("waiting 60 seconds")
        time.sleep(60)

{"query": "bitcoin", "maxResults": 100, "toDate": "201809060300", "fromDate": "201809060000"}
{"query": "bitcoin", "maxResults": 100, "toDate": "201809060600", "fromDate": "201809060300"}
{"query": "bitcoin", "maxResults": 100, "toDate": "201809060900", "fromDate": "201809060600"}
{"query": "bitcoin", "maxResults": 100, "toDate": "201809061200", "fromDate": "201809060900"}
{"query": "bitcoin", "maxResults": 100, "toDate": "201809061500", "fromDate": "201809061200"}
{"query": "bitcoin", "maxResults": 100, "toDate": "201809061800", "fromDate": "201809061500"}
{"query": "bitcoin", "maxResults": 100, "toDate": "201809062100", "fromDate": "201809061800"}
{"query": "bitcoin", "maxResults": 100, "toDate": "201809070000", "fromDate": "201809062100"}
{"query": "bitcoin", "maxResults": 100, "toDate": "201809070300", "fromDate": "201809070000"}
waiting 60 seconds
{"query": "bitcoin", "maxResults": 100, "toDate": "201809070600", "fromDate": "201809070300"}
{"query": "bitcoin", "maxResults": 100, "
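The windowed collection loop can be sketched in a form that is testable without network access. `collect_windows` and its parameters are hypothetical names, and `fetch` stands in for `collect_tweets`. Unlike the cell above, this version pauses after every `pause_every`-th request and skips the pause after the final one.

```python
import time

def collect_windows(stamps, fetch, pause_every=8, pause_seconds=60, sleep=time.sleep):
    """Call fetch(from_date, to_date) for each consecutive pair of timestamps."""
    results = []
    windows = list(zip(stamps, stamps[1:]))
    for n, (from_date, to_date) in enumerate(windows, start=1):
        results.extend(fetch(from_date, to_date))
        # pause after every `pause_every` requests, but not after the last one
        if n % pause_every == 0 and n < len(windows):
            sleep(pause_seconds)
    return results

# usage with a stand-in fetch function and a no-op sleep
def fake_fetch(from_date, to_date):
    return [(from_date, to_date)]

demo = collect_windows(["t0", "t1", "t2", "t3"], fake_fetch,
                       pause_every=2, sleep=lambda seconds: None)
print(demo)
```

Passing `sleep` as an argument keeps the pause logic easy to check without actually waiting.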

In [82]:

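# print one timestamp per stored interval (S2_dict is filled by the commented-out line in the collection loop)
for key in S2_dict:
    tweets = np.append(tweets, S2_dict[key])
    print(key, tweets[key]['created_at'])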

0 Mon Sep 03 23:59:57 +0000 2018
1 Mon Sep 03 23:59:54 +0000 2018
2 Mon Sep 03 23:59:51 +0000 2018
3 Mon Sep 03 23:59:48 +0000 2018
4 Mon Sep 03 23:59:47 +0000 2018
5 Mon Sep 03 23:59:47 +0000 2018
6 Mon Sep 03 23:59:45 +0000 2018
7 Mon Sep 03 23:59:44 +0000 2018
8 Mon Sep 03 23:59:37 +0000 2018
9 Mon Sep 03 23:59:36 +0000 2018
10 Mon Sep 03 23:59:34 +0000 2018
11 Mon Sep 03 23:59:30 +0000 2018
12 Mon Sep 03 23:59:25 +0000 2018
13 Mon Sep 03 23:59:24 +0000 2018
14 Mon Sep 03 23:59:20 +0000 2018
15 Mon Sep 03 23:59:17 +0000 2018
16 Mon Sep 03 23:59:16 +0000 2018
17 Mon Sep 03 23:59:14 +0000 2018
18 Mon Sep 03 23:59:13 +0000 2018
19 Mon Sep 03 23:59:11 +0000 2018
20 Mon Sep 03 23:59:10 +0000 2018
21 Mon Sep 03 23:59:10 +0000 2018


In [110]:
tweets[0].keys()

dict_keys(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'extended_tweet', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted', 'possibly_sensitive', 'filter_level', 'lang', 'matching_rules'])

In [111]:
tweets[0].text

'Buy/Sell Bitcoin changes with up to 100x Leverage at Bitmex! 💰🎉\n\nGet a 10% Fee Rebate:\n\n▶️ https://t.co/nSVKzagB57… https://t.co/3QIuaPMsMs'

In [112]:
print(len(tweets), tweets[0]['created_at'], tweets[-1]['created_at'])

7900 Thu Sep 06 02:59:59 +0000 2018 Sat Sep 15 20:57:33 +0000 2018


### Counts and limitations

In a trial, collecting tweets containing the string 'bitcoin' backwards from the current date until `max_results=1000` was reached covered only about 15 minutes of tweets. If the maximum number of tweets is increased, the results eventually reach back the full 30 days. Capturing data older than that requires the full-archive endpoint. However, with only 50 requests per month, the date ranges must be chosen very carefully to stay within the quota. In other words, at 100 tweets per request, each month we can collect 25,000 tweets from the last 30 days or 5,000 tweets from an earlier period. Collecting as many tweets per month from the full archive as from the 30-day endpoint requires a subscription of $225/month, and collecting over a million tweets would cost thousands of dollars.
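The monthly budget figures follow from simple arithmetic, sketched below. The helper names are hypothetical, and the sketch assumes 100 tweets per request (the sandbox cap noted in this notebook) and one request per 3-hour window.

```python
RESULTS_PER_REQUEST = 100  # sandbox cap on results per request
MONTHLY_REQUESTS = {"30day": 250, "fullarchive": 50}

def monthly_tweet_budget(requests_per_month, results_per_request=RESULTS_PER_REQUEST):
    """Maximum number of tweets obtainable in one month."""
    return requests_per_month * results_per_request

def days_covered(requests_per_month, windows_per_day=8):
    """Days of 3-hour windows that one month's quota can cover."""
    return requests_per_month / windows_per_day

for name, quota in MONTHLY_REQUESTS.items():
    print(name, monthly_tweet_budget(quota), days_covered(quota))
```

Under these assumptions the 30-day quota covers a little over 31 days of 3-hour windows, while the full-archive quota covers just over 6 days.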

## Sentiment Analysis

In [113]:
# create a pandas df from tweets
S2 = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['Tweets'])
S2['Date'] = [tweet['created_at'] for tweet in tweets]

In [114]:
S2.head()

|   | Tweets | Date |
|---|---|---|
| 0 | Buy/Sell Bitcoin changes with up to 100x Lever... | Thu Sep 06 02:59:59 +0000 2018 |
| 1 | Free and Best Automatic Bitcoin - Altcoins - U... | Thu Sep 06 02:59:57 +0000 2018 |
| 2 | RT @CryptoMoe81: $Bitcoin idea "14,XX% Drop" -... | Thu Sep 06 02:59:56 +0000 2018 |
| 3 | Check out this awesome site! You can COPY pro ... | Thu Sep 06 02:59:54 +0000 2018 |
| 4 | RT @APompliano: So many people are cheering ag... | Thu Sep 06 02:59:53 +0000 2018 |
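The `Date` column holds Twitter's `created_at` strings. If they are later needed as real timestamps, a minimal sketch with the standard library could look like this. `parse_created_at` is a hypothetical helper, and the format string assumes English day and month abbreviations as in the sample output.

```python
from datetime import datetime, timezone

# layout of strings such as "Thu Sep 06 02:59:59 +0000 2018"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_created_at(value):
    """Parse a created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)

stamp = parse_created_at("Thu Sep 06 02:59:59 +0000 2018")
print(stamp.isoformat())  # 2018-09-06T02:59:59+00:00
```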


In [115]:
def clean_tweet(tweet):
    '''
    Utility function to clean the text in a tweet by removing
    mentions, links, and special characters using regex.
    '''
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", tweet).split())

def analyze_sentiment(tweet):
    '''
    Utility function to classify the polarity of a tweet
    using textblob.

    textblob's default analyzer (PatternAnalyzer) is lexicon-based
    and returns a polarity score in [-1, 1]; only the sign is kept here.

    Might want to train our own model
    '''
    polarity = TextBlob(clean_tweet(tweet)).sentiment.polarity
    if polarity > 0:
        return 1
    elif polarity == 0:
        return 0
    else:
        return -1


def sentiment_analysis(S2):
    # We create a column with the result of the analysis:
    S2['SA'] = np.array([analyze_sentiment(tweet) for tweet in S2['Tweets']])

    # We construct lists with classified tweets (boolean masks avoid label-based lookups):
    pos_tweets = S2['Tweets'][S2['SA'] > 0].tolist()
    neu_tweets = S2['Tweets'][S2['SA'] == 0].tolist()
    neg_tweets = S2['Tweets'][S2['SA'] < 0].tolist()

    # We print percentages:
    total = len(S2['Tweets'])
    print("Percentage of positive tweets: {}%".format(len(pos_tweets)*100/total))
    print("Percentage of neutral tweets: {}%".format(len(neu_tweets)*100/total))
    print("Percentage of negative tweets: {}%".format(len(neg_tweets)*100/total))

In [116]:
sentiment_analysis(S2)

Percentage of positive tweets: 37.037974683544306%
Percentage of neutral tweets: 52.32911392405063%
Percentage of negative tweets: 10.632911392405063%
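To make the cleaning and labelling steps concrete, here is a small sketch that runs without TextBlob. It uses the same regular expression as `clean_tweet`; `polarity_to_label` is a hypothetical helper that mirrors the sign test, and the sample strings are made up.

```python
import re

# same pattern as clean_tweet: mentions, non-alphanumeric characters, links
TWEET_NOISE = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)")

def clean_tweet(tweet):
    """Replace every match with a space, then collapse repeated whitespace."""
    return " ".join(TWEET_NOISE.sub(" ", tweet).split())

def polarity_to_label(polarity):
    """Map a polarity score to 1 (positive), 0 (neutral), or -1 (negative)."""
    if polarity > 0:
        return 1
    if polarity < 0:
        return -1
    return 0

print(clean_tweet("RT @user1: Bitcoin is up! https://example.com/x"))  # RT Bitcoin is up
print(clean_tweet("@some_user hi"))                                    # user hi
print([polarity_to_label(p) for p in (0.5, 0.0, -0.2)])                # [1, 0, -1]
```

The second call shows a limitation of the pattern: the mention group does not include underscores, so the part of a handle after an underscore survives as a stray word.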


In [117]:
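# note: the filenames say 08, although the tweets collected above are dated September 2018
# header=False matches the read_csv(..., names=[...]) calls used to load these files
S2['Tweets'].to_csv('tweets_2018-08-06_2018-08-15_Tweets.csv', index=False, header=False)
S2['Date'].to_csv('tweets_2018-08-06_2018-08-15_Date.csv', index=False, header=False)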

In [6]:
# Aug 01-03
S3_Date_A = pd.read_csv('tweets_2018-08-01_2018-08-03_Date.csv', names=['Date'])
S3_Tweets_A = pd.read_csv('tweets_2018-08-01_2018-08-03_Tweets.csv', names=['Tweets'])
# Aug 03-05
S3_Date_B = pd.read_csv('tweets_2018-08-03_2018-08-05_Date.csv', names=['Date'])
S3_Tweets_B = pd.read_csv('tweets_2018-08-03_2018-08-05_Tweets.csv', names=['Tweets'])
# Aug 06-15
S3_Date_C = pd.read_csv('tweets_2018-08-06_2018-08-15_Date.csv', names=['Date'])
S3_Tweets_C = pd.read_csv('tweets_2018-08-06_2018-08-15_Tweets.csv', names=['Tweets'])
S3_A = pd.concat([S3_Tweets_A, S3_Date_A], axis=1)
S3_B = pd.concat([S3_Tweets_B, S3_Date_B], axis=1)
S3_C = pd.concat([S3_Tweets_C, S3_Date_C], axis=1)

In [14]:
S3 = pd.concat([S3_A, S3_B, S3_C], axis=0)
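# header=False keeps the files readable with read_csv(..., names=[...])
S3['Date'].to_csv('tweets_2018-08-01_2018-08-15_Date.csv', index=False, header=False)
S3['Tweets'].to_csv('tweets_2018-08-01_2018-08-15_Tweets.csv', index=False, header=False)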

In [12]:
S3.head()

|   | Tweets | Date |
|---|---|---|
| 0 | https://t.co/yLZluuYevy DECENTRALISED ENERGY P... | Sat Sep 01 02:59:59 +0000 2018 |
| 1 | 📉 Biggest Losers (1 hr) 📉\nNoah Coin $NOAH -3.... | Sat Sep 01 02:59:58 +0000 2018 |
| 2 | Crypto News: Yahoo! World’s Sixth-Most Popular... | Sat Sep 01 02:59:54 +0000 2018 |
| 3 | RT @coingecko: Have you tried comparing coins ... | Sat Sep 01 02:59:54 +0000 2018 |
| 4 | Bitcoin Gets Awareness Boost From Mention On E... | Sat Sep 01 02:59:53 +0000 2018 |


In [13]:
S3.tail()

|   | Tweets | Date |
|---|---|---|
| 7895 | RT @bitcoincardvd: You can start your Bitcoin ... | Sat Sep 15 20:57:43 +0000 2018 |
| 7896 | RT @securixio: We at https://t.co/3OqG6HXwB0 p... | Sat Sep 15 20:57:40 +0000 2018 |
| 7897 | It doesn’t matter if Bitcoin is $6k or $50k.\n... | Sat Sep 15 20:57:39 +0000 2018 |
| 7898 | RT @iMariaJohnsen: Uncovering facts on #blockc... | Sat Sep 15 20:57:37 +0000 2018 |
| 7899 | RT @favycoin: Grab your #Favycoin and don't mi... | Sat Sep 15 20:57:33 +0000 2018 |


### Summary so far

It's reasonable to assume that Twitter data is more informative when viewed as a broad picture rather than as a collection centered on a single point in time. To get that picture, subsamples of tweets need to be gathered over a range of days. Tweets starting and ending on the dates listed below were gathered. The `from_date` is the listed day and the `to_date` is set to the next day. However, the 100-tweet cap per request ends collection early, so each sample typically covers only a couple of minutes of tweets per day. Collecting 100 tweets per day is an efficient way to sample a fraction of Twitter data over a larger number of days.

 - 226 2018-8-15 15:00 
 - 237 2018-8-26 15:00 
 
Sentiment analysis uses TextBlob's built-in sentiment scorer. The data is then stored in a dataframe called `S2` and written to separate CSV files, one for the tweets and one for the dates (the tweet text itself contains commas, so the columns are kept apart rather than fighting the delimiter), to be joined back into a dataframe for later use.
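On the comma problem: standard CSV quoting wraps any field containing commas or quotes in double quotes, so text and dates can in principle share one file. The sketch below shows the round trip with the standard library on made-up rows. pandas' `DataFrame.to_csv` applies the same minimal quoting by default, so this is offered as a possible simplification rather than a required change.

```python
import csv
import io

# made-up sample rows; the text fields contain commas and double quotes
rows = [
    ("first tweet, with a comma", "Sat Sep 15 20:57:39 +0000 2018"),
    ('second tweet with "quotes", and a comma', "Sat Sep 15 20:57:37 +0000 2018"),
]

buffer = io.StringIO(newline="")
writer = csv.writer(buffer)  # QUOTE_MINIMAL: quote only the fields that need it
writer.writerow(["Tweets", "Date"])
writer.writerows(rows)

buffer.seek(0)
reader = csv.reader(buffer)
header = next(reader)
restored = [tuple(row) for row in reader]
print(restored == rows)  # True
```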