# Code Repeatability Part 1: Scraping Twitter and Yahoo Finance Data
### Note: This Notebook requires Python 3.8 or greater to run.

### Summary:

This notebook is what we used to obtain all of our data and export it into .csv files.

The environment used to run this notebook has Python 3.8.3. Python 3.8 or greater is required by the development version of the Python library snscrape, which we use to scrape Twitter data. The development version of snscrape exposes additional fields that we need, such as likeCount, retweetCount, hashtags, and cashtags.

The first part of this notebook pulls Twitter data using the snscrape library. We pull two datasets: one for specific users and one for specific hashtags. The users we explore are Elon Musk, Tesla, Gamestop, AMCTheatres, Bitcoin, APompliano, Dogecoin, Mark Cuban, Shibetoshi Nakamoto (creator of Dogecoin), and SpaceX. The hashtags we pull tweets for are TSLA, DOGE, AMC, BTC, and GME. In the end, however, we focused only on Bitcoin. We pull all tweets from January 1, 2019 up to the date of scraping.

The second part of this notebook scrapes historical price information directly from Yahoo Finance. It pulls open and close prices, volume, and other attributes for the Bitcoin ticker (BTC-USD) into two dataframes: one with daily data going back 5 years, and one with hourly data going back 8 months.



### Pull Twitter Data

In [10]:
# If necessary, install the libraries that we used for scraping Twitter and Yahoo Finance Data,
# snscrape and yfinance
# For snscrape, the development version is necessary in order to get all the features we need

#pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git
#pip install yfinance

In [4]:
# import libraries
import pandas as pd
import numpy as np
import snscrape.modules.twitter as sntwitter
import os
import time
import yfinance as yf
startTime = time.time()

In [19]:
# list of twitter users and hashtags to pull tweets from
twitters = [
    'elonmusk',
    'Tesla',
    'Gamestop',
    'AMCTheatres',
    'Bitcoin',
    'APompliano',
    'Dogecoin',
    'Mcuban',
    'BillyM2k',
    'SpaceX',
    ]

hashtags =[
    '#TSLA',
    '#DOGE',
    '#AMC',
    '#BTC',
    '#GME'
    ]

In [20]:
# scrape tweets from users. Will iterate over all values and save as a json file. 
# We will use the pandas library to read each json file in as a dataframe.
# Pulling from the start of January 1st, 2019 for each user
for user in twitters:
    os.system("snscrape --jsonl --since 2019-01-01 twitter-search 'from:" + user + "'> " + user + ".json")
    print(user) # printing each user to see when all tweets are scraped for a user

elonmusk
Tesla
Gamestop
AMCTheatres
Bitcoin
APompliano
Dogecoin
Mcuban
BillyM2k
SpaceX
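
The `os.system` call above builds a shell string by hand, so the query and the output filename have to be quoted correctly. A minimal alternative sketch, assuming the same `snscrape` command-line options used above, passes the arguments as a list through `subprocess.run` so that no shell quoting is involved. The helper names `build_snscrape_command` and `scrape_to_file` are hypothetical and are not part of snscrape.

```python
import subprocess
from pathlib import Path


def build_snscrape_command(query, since):
    """Return the snscrape CLI call as an argument list (no shell quoting needed)."""
    return ["snscrape", "--jsonl", "--since", since, "twitter-search", query]


def scrape_to_file(query, since, out_path):
    """Run snscrape and write its JSON-lines output to out_path."""
    with Path(out_path).open("w", encoding="utf-8") as out_file:
        # check=True raises CalledProcessError if snscrape exits with an error
        subprocess.run(build_snscrape_command(query, since), stdout=out_file, check=True)


# Usage (requires snscrape on PATH and network access):
# for user in twitters:
#     scrape_to_file("from:" + user, "2019-01-01", user + ".json")
```

Because the output file is opened from Python, names containing characters such as `#` need no special handling.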


In [21]:
# initializing a dataframe to store tweets scraped with above code for users
user_tweets = pd.DataFrame()

In [22]:
# iterating over every user and appending scraped tweets to the user_tweets dataframe
for user in twitters:
    df = pd.read_json(user+'.json',lines = True)
    df['keyword_search'] = user #creating a column to show what value was used to search tweets
    # pd.concat is used because DataFrame.append was removed in pandas 2.0
    user_tweets = pd.concat([user_tweets, df])

In [24]:
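# scraping tweets for each hashtag in the hashtags list
# This differs slightly from our original run: we did not set an end date then,
# but 2021-06-15 is the last day of our data because that is the day we scraped it
# The hashtag itself is the search query (a 'from:' prefix would search by author instead),
# the end date uses Twitter's until: search operator, and the output filename is quoted
# because an unquoted # starts a shell comment
for tag in hashtags:
    os.system("snscrape --jsonl --since 2019-01-01 twitter-search '" + tag + " until:2021-06-15' > '" + tag + ".json'")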
    print(tag) # printing each hashtag to see when all tweets are scraped for a hashtag

#TSLA
#DOGE
#AMC
#BTC
#GME


In [25]:
# initializing a dataframe to store tweets scraped with above code for hashtags
hashtag_tweets = pd.DataFrame()

In [26]:
# iterating over every hashtag and appending scraped tweets to the hashtag_tweets dataframe
for tag in hashtags:
    df2 = pd.read_json(tag +'.json',lines = True)
    df2['keyword_search'] = tag #creating a column to show what value was used to search tweets
    # pd.concat is used because DataFrame.append was removed in pandas 2.0
    hashtag_tweets = pd.concat([hashtag_tweets, df2])

In [29]:
# keeping only certain columns of interest
user_tweets = user_tweets[['date','content','id','user','replyCount','retweetCount','likeCount',
             'quoteCount','lang','inReplyToTweetId','inReplyToUser','mentionedUsers',
             'coordinates','place','hashtags','cashtags','keyword_search']]
hashtag_tweets = hashtag_tweets[['date','content','id','user','replyCount','retweetCount','likeCount',
             'quoteCount','lang','inReplyToTweetId','inReplyToUser','mentionedUsers',
             'coordinates','place','hashtags','cashtags','keyword_search']]

In [36]:
# exporting dataframes to csv
user_tweets.to_csv("user_tweets.csv")
hashtag_tweets.to_csv("hashtag_tweets.csv")
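
Some of the columns kept above, such as `user`, hold nested objects rather than plain values, and `to_csv` writes them out as their string representation. The sketch below shows one way to pull a flat column out of such a field before exporting. It assumes each `user` entry is a dict with a `username` key, which is how we understand snscrape's JSON output. The helper name `add_username_column` and the sample data are hypothetical.

```python
import pandas as pd

# Tiny stand-in for the scraped data; the real frames come from snscrape's JSON output.
sample = pd.DataFrame({
    "id": [1, 2],
    "user": [{"username": "elonmusk", "id": 10}, {"username": "Tesla", "id": 20}],
})


def add_username_column(tweets):
    """Return a copy with a flat 'username' column taken from the nested 'user' dicts."""
    result = tweets.copy()
    # Assumption: each entry is a dict with a 'username' key; anything else becomes None
    result["username"] = result["user"].apply(
        lambda u: u.get("username") if isinstance(u, dict) else None
    )
    return result


flat = add_username_column(sample)
print(flat[["id", "username"]])
```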

In [28]:
# determining how long script takes to run in seconds
executionTime = (time.time() - startTime)
print('Execution time in seconds: ' + str(executionTime))

Execution time in seconds: 5345.701666355133


### Pull Yahoo Finance Data

In [5]:
# bitcoin
stock_strings = ['BTC-USD']

In [6]:
# creating a dataframe with the last 5 years of daily data for the ticker above
df_list = []
for ticker in stock_strings:
    data = yf.download(ticker, group_by="Ticker", period='5y')
    data['ticker'] = ticker  # add this column because the dataframe doesn't contain a column with the ticker
    df_list.append(data)

# combine all dataframes into a single dataframe
df = pd.concat(df_list)

[*********************100%***********************]  1 of 1 completed


In [7]:
# creating a dataframe with hourly data for the last 8 months
df_hour_list = []
for ticker in stock_strings:
    data2 = yf.download(ticker, group_by="Ticker", period='8mo',interval='1h')
    data2['ticker'] = ticker  # add this column because the dataframe doesn't contain a column with the ticker
    df_hour_list.append(data2)

# combine all dataframes into a single dataframe
df_hour = pd.concat(df_hour_list)

[*********************100%***********************]  1 of 1 completed


In [8]:
# change directory
os.chdir('../../')

In [9]:
# save files
df.to_csv('casestudy_data/group_9/daily_stock_last5yr.csv')
df_hour.to_csv('casestudy_data/group_9/hourly_stock_last8mo.csv')
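
When these files are loaded again later, the timestamp index comes back as plain text unless pandas is told to parse it. The following is a small sketch of the round trip. It uses an in-memory buffer and made-up prices in place of the real files, so the values and the `Date` index name are illustrative assumptions only.

```python
import io
import pandas as pd

# Stand-in for a yfinance result: a price table indexed by timestamp.
prices = pd.DataFrame(
    {"Close": [100.0, 101.5], "ticker": ["BTC-USD", "BTC-USD"]},
    index=pd.to_datetime(["2021-06-14", "2021-06-15"]).rename("Date"),
)

buffer = io.StringIO()
prices.to_csv(buffer)  # same call as above, written to memory instead of disk
buffer.seek(0)

# index_col=0 restores the timestamp index; parse_dates=True converts it back to datetimes
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)
```

The same two `read_csv` arguments should apply to the daily and hourly files saved above, with the file path in place of the buffer.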