**SDPA Final Coursework, Part 2**

In [76]:
# Import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
import datetime
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob
import snscrape.modules.twitter as sntwitter
import sys
import re
import yfinance as yf 

**Step 1: Crawl a real-world dataset**

The data for this section of the coursework is tweet data from the social media platform and "digital town square", Twitter. The initial plan was to access the Twitter API directly by applying for an educational account and using its API keys to retrieve tweets matching my specification, but in the end I was unable to obtain an account. 

Instead, I have used the external library 'snscrape' to retrieve tweets matching my specification. It works analogously to the native Twitter API, and its search parameters are defined in much the same way, but it needs no API keys, which has allowed more effort to go into development rather than API bureaucracy. More information on 'snscrape' and all other external libraries used in the project can be found in the README.md file for Part 2 of the assignment. 
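
As a rough illustration of how such a search string can be assembled, the sketch below builds one query per calendar day. The helper names `build_daily_query` and `last_n_days` are hypothetical, and the `since:` bound is an assumption added here for illustration (the scraping function in this notebook uses only `until:`). Both operators come from Twitter's advanced search syntax, where `since:` is understood to be inclusive and `until:` exclusive.

```python
import datetime


def last_n_days(n, today):
    """Return the n most recent dates up to and including `today`, oldest first."""
    return [today - datetime.timedelta(days=offset) for offset in range(n - 1, -1, -1)]


def build_daily_query(keyword, day, lang="en"):
    """Build a search string limited to a single calendar day (hypothetical helper)."""
    next_day = day + datetime.timedelta(days=1)
    # Assumption: since: is inclusive and until: is exclusive.
    return f"{keyword} lang:{lang} since:{day:%Y-%m-%d} until:{next_day:%Y-%m-%d}"


days = last_n_days(3, datetime.date(2022, 12, 24))
queries = [build_daily_query("bitcoin", d) for d in days]
for q in queries:
    print(q)
```

Each resulting string could then be passed to `sntwitter.TwitterSearchScraper(...)` in place of the single `until:` query, so that every batch of tweets is bounded on both sides by the day it is reported under.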



**Step 2a: Perform data preparation & cleaning**

Only the tweet text itself needs cleaning and formatting; all other scraped fields, such as date, tweet ID, reply count, retweet count and like count, already arrive in a usable format. To remove unnecessary parts and characters from the tweet text, a series of regular expressions is applied to each scraped tweet in turn. The function defined below is called during the scraping and enriching in later steps. 

In [77]:
def clean_text(text):
    
    # Removing @mentions (ASCII hyphen in the digit range; underscores are valid in handles). 
    text = re.sub(r'@[A-Za-z0-9_]+', '', text) 
    
    # Removing '#' hash symbols (the tag word itself is kept). 
    text = re.sub(r'#', '', text) 
    
    # Removing RTs. 
    text = re.sub(r'RT\s+', '', text) 
    
    # Removing hyperlinks. 
    text = re.sub(r'https?://\S+', '', text)
    
    # Remove punctuation. 
    text = re.sub(r'[^\w\s]', '', text)  
    
    # Remove underscores. 
    text = re.sub(r'_', '', text) 
    
    return text 
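
To show what the cleaning steps do in practice, here is a minimal self-contained sketch that applies the same sequence of substitutions to one made-up tweet. The sample text is hypothetical, and the mention pattern is written with an ASCII hyphen in the digit range and allows underscores in handles.

```python
import re


def clean_text(text):
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)  # @mentions
    text = re.sub(r'#', '', text)               # hash symbols (tag word is kept)
    text = re.sub(r'RT\s+', '', text)           # retweet markers
    text = re.sub(r'https?://\S+', '', text)    # hyperlinks
    text = re.sub(r'[^\w\s]', '', text)         # punctuation
    text = re.sub(r'_', '', text)               # underscores
    return text


# Hypothetical tweet used only for illustration.
sample = "RT @satoshi99: #Bitcoin is up 5%! https://example.com/btc_news"
cleaned = clean_text(sample)
print(repr(cleaned))
```

The order of the substitutions matters: hyperlinks have to be removed before punctuation, otherwise the `://` and `.` characters are stripped first and the link pattern no longer matches. Leading and trailing whitespace is left in place, so a final `.strip()` may be worth adding if exact strings are compared.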

**Step 2b: Generating daily (numerical) means of existing columns & enriching data by adding sentiment and subjectivity columns and price data of the specified asset**

As per the scraping specification, 100 tweets matching the keyword 'bitcoin' have been scraped for each of the past 150 days as of 24/11/2022, giving a raw dataset of 15,000 rows of tweet text plus auxiliary information stored in other columns. Individual tweets are not very informative on their own. For example, the value of a numeric column (e.g. like count, retweet count) for a single tweet says little about the day as a whole, because that tweet may have been unusually popular. To get a representative figure for a given day, it is more insightful to take the mean of each of these variables over the 100 tweets for that day. Therefore, the mean of each numeric value has replaced the 100 individual values for each day, with the aim of giving a fairer reflection of the general popularity / performance of that day's tweets. 
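
The daily aggregation described above can be sketched with the standard library alone. The records below are made-up values, not scraped data, and `safe_mean` is a hypothetical helper that returns NaN for a day with no tweets, where a plain `sum(...) / len(...)` would raise a `ZeroDivisionError`.

```python
import math
from collections import defaultdict
from statistics import mean

# Hypothetical per-tweet records: (date, like_count)
tweets = [
    ("2022-07-29", 1),
    ("2022-07-29", 3),
    ("2022-07-29", 8),
    ("2022-07-30", 2),
    ("2022-07-30", 7),
]


def safe_mean(values):
    """Mean of a list, or NaN when the list is empty (hypothetical helper)."""
    return mean(values) if values else math.nan


# Group the like counts by day, then reduce each group to its mean.
likes_by_day = defaultdict(list)
for date, likes in tweets:
    likes_by_day[date].append(likes)

daily_mean_likes = {date: safe_mean(values) for date, values in likes_by_day.items()}
print(daily_mean_likes)
```

The same grouping applies unchanged to reply counts, retweet counts, and the per-tweet sentiment and subjectivity scores.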

Because tweet data is text-based, there are also potentially many insights to be gained from generating sentiment and subjectivity figures through sentiment analysis. Sentiment and subjectivity scores allow us to interpret tweets, or groups of tweets, and understand the intention behind them; in doing so, we can form an idea of what the public thinks of a particular thing, in this case the asset 'bitcoin'. The sentiment analyser VADER has been applied to every tweet, and the mean over each day's 100 tweets has been appended to the final dataset as a form of enrichment. The same has been done for subjectivity scores, which are computed with TextBlob. More information on the library 'nltk', the sentiment analysis tool 'VADER' and all other external libraries used in the project can be found in the README.md file for Part 2 of the assignment. 

Finally, BTC price data from Yahoo! Finance has also been downloaded through the 'yfinance' library, to allow potentially interesting comparisons between public sentiment and asset price. More information on 'yfinance' and all other external libraries used in the project can be found in the README.md file for Part 2 of the assignment.  
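
When sentiment and price series are combined, matching them by date rather than by list position keeps the rows aligned even if one source is missing a day. The sketch below is one possible way to do this, using made-up values and plain dictionaries keyed by ISO date strings; it is not the approach taken by the function in this notebook, which concatenates the lists by position.

```python
import math

# Hypothetical daily values keyed by ISO date string.
mean_sentiment = {"2022-07-29": 0.10, "2022-07-30": 0.20, "2022-07-31": 0.04}
closing_price = {"2022-07-29": 19552.05, "2022-07-31": 19178.63}  # one day missing

rows = []
for date in sorted(mean_sentiment):
    # Look the price up by date; NaN marks a day with no price available.
    price = closing_price.get(date, math.nan)
    rows.append((date, mean_sentiment[date], price))

for row in rows:
    print(row)
```

ISO-formatted dates sort chronologically as plain strings, which is why `sorted` is enough here.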

In [78]:
def tweet_scrape_enrich_aggregate():
    import datetime
    # List of dates tweets are retrieved from. 
    dates = []


    mean_reply_count_list = []
    mean_retweet_count_list = []
    mean_like_count_list = []

    # List for mean sentiment and mean subjectivity to be appended to later. 
    mean_sentiment_list = []
    mean_subjectivity_list = []

    # List of all data scraped from Twitter API. 
    whole_data = []


    # Inputs for key parameters; for this project they were entered as: bitcoin, 100, 150. 
    # Prompts are used in case other keywords need to be tested, but static variables would be fine too. 
    keyword = input("Please enter a keyword or phrase to focus your search: ")
    NoOfTweets = int(input("Please enter the number of Tweets you would like to analyse per day: "))
    days = int(input("How many of the last days would you like to analyse?:  "))




    # Loop through the last n days, 150 days for this project. 
    today = datetime.datetime.now()
    for i in range(days):
        # Calculate the date for the current iteration. 
        start = today - datetime.timedelta(days=i)
        date_str = start.strftime('%Y-%m-%d')

        # Add the date for current iteration to the list of dates. 
        dates.append(date_str)

    # Reverse list of days to get items in chronological order.      
    dates = dates[::-1]


    # First for loop iterating through list of dates previously generated. 
    for date in dates:
        sentiment_scores = []
        subjectivity_scores = []

        # Empty lists for data of each 100 tweets from each day to be appended to. 
        tweet_list = []
        like_count_list = []
        reply_count_list = []
        retweet_count_list = []

        # test
        # print(date) 

        # Second for loop scraping for 100 tweets iteratively for the date of the outer for loop iteration. 
        for i,tweet in enumerate(sntwitter.TwitterSearchScraper(f'{keyword} lang:en until:{date}').get_items()):
            if i>=NoOfTweets: 
                break 
            # Appending all raw data from tweet.     
            whole_data.append([tweet.date, tweet.id, tweet.rawContent, tweet.user.username, tweet.replyCount, tweet.retweetCount, tweet.likeCount])

            # Adding specific data to list of particular description, to be made into mean lists later. 
            tweet_list.append(tweet.rawContent)
            like_count_list.append(tweet.likeCount)
            reply_count_list.append(tweet.replyCount)
            retweet_count_list.append(tweet.retweetCount)
            
            
        # Create the VADER analyser once per day rather than once per tweet. 
        analyser = SentimentIntensityAnalyzer()

        for tweet in tweet_list:
            
            # Apply tweet cleaner to tweets before sentiment analysis for more reliable results. 
            tweet = clean_text(tweet)

            # Sentiment scores for the specific tweet, stored in the 'scores' variable using VADER. 
            scores = analyser.polarity_scores(tweet)
            
            # TextBlob object created for tweet iteratively. 
            analysis = TextBlob(tweet)
            
            # Compound sentiment and subjectivity scores appended to related lists. 
            sentiment_scores.append(scores['compound'])
            subjectivity_scores.append(analysis.sentiment.subjectivity)
        
        # test
        # print(sentiment_scores)
        

        # Calculate the mean of numerical columns, mean sentiment and subjectivity scores for the day. 
        mean_sentiment = sum(sentiment_scores) / len(sentiment_scores)
        mean_sentiment_list.append(mean_sentiment)

        mean_subjectivity = sum(subjectivity_scores) / len(subjectivity_scores)
        mean_subjectivity_list.append(mean_subjectivity)

        mean_like_count = sum(like_count_list) / len(like_count_list)
        mean_like_count_list.append(mean_like_count)

        mean_reply_count = sum(reply_count_list) / len(reply_count_list)
        mean_reply_count_list.append(mean_reply_count)

        mean_retweet_count = sum(retweet_count_list) / len(retweet_count_list)
        mean_retweet_count_list.append(mean_retweet_count)

            
    print('Scraping and sentiment analysis complete!')
            
            
            
            
    keyword_list = [f'{keyword}']*days

    # Add the mean sentiment, subjectivity, and price information to the data frame for bitcoin
    df1 = pd.Series(keyword_list, name = 'Keyword')
    df2 = pd.Series(dates, name = 'Date')
    df3 = pd.Series(mean_reply_count_list, name = 'Mean reply count')
    df4 = pd.Series(mean_retweet_count_list, name = 'Mean retweet count')
    df5 = pd.Series(mean_like_count_list, name = 'Mean like count')
    df6 = pd.Series(mean_sentiment_list, name = 'Mean sentiment of daily tweets')
    df7 = pd.Series(mean_subjectivity_list, name = 'Mean subjectivity of daily tweets')
    df8 = pd.Series(market_data(days), name = 'Adjusted closing asset price')
    

    df_bitcoin = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8], axis = 1)
    print(f'df_{keyword_list[0]} created!')
    df_bitcoin.to_csv(f'{keyword}_enriched_data.csv', index = False)
    
    
#     # Aggregate raw data for initial csv file. 
#     df = pd.DataFrame(whole_data)
#     df.columns = ['Date', 'ID', 'Text', 'Username', 'Reply count', 'Retweet count', 'Like count']
#     df.to_csv(f'{keyword}_raw_extracted_data.csv', index = False)
    
    
#     # API scraping Yahoo Finance for BTC price data within defined time range
#     # Set necessary variables 
#     start = datetime.datetime(2022, 7, 29)
#     end = datetime.datetime(2022, 12, 25)
#     symbol = 'BTC-GBP'

#     # Download specific data from Yahoo! Finance to a dataframe, a 6x150 dataframe. 
#     df= yf.download(symbol, start=start, end=end)

#     df.to_csv(f'bitcoin_market_data_29722_251222.csv', index = False)

    
    return df_bitcoin
            
            


In [79]:
def market_data(days): 
    
    # Set necessary variables 
    start = datetime.datetime(2022, 7, 29)
    end = datetime.datetime(2022, 12, 25)
    symbol = 'BTC-GBP'
    
    # Download specific data from Yahoo! Finance to a dataframe, a 6x150 dataframe. 
    df= yf.download(symbol, start=start, end=end)
    
    # Isolate particular column for adjusted market price at close. 
    df = df['Adj Close'][0:days]
    
    # Change df to a list, so it can be concatenated with all other lists later. 
    btc_price_list = df.values.tolist()
    
    return btc_price_list
    

In [80]:
# sns.lineplot(x = 'Date', y = 'Mean sentiment of daily tweets', data = df_bitcoin).set(xlabel = 'Date', ylabel = 'Mean sentiment over select tweets of given day')

# plt.xticks(rotation=45)
# plt.show()




In [81]:
tweet_scrape_enrich_aggregate()

Please enter a keyword or phrase to focus your search:  bitcoin 
Please enter the number of Tweets you would like to analyse per day:  100
How many of the last days would you like to analyse?:   150


Scraping and sentiment analysis complete!
[*********************100%***********************]  1 of 1 completed
df_bitcoin  created!


Unnamed: 0,Keyword,Date,Mean reply count,Mean retweet count,Mean like count,Mean sentiment of daily tweets,Mean subjectivity of daily tweets,Adjusted closing asset price
0,bitcoin,2022-07-29,0.87,0.77,3.69,0.100597,0.278271,19552.054688
1,bitcoin,2022-07-30,1.42,1.45,7.00,0.197435,0.316977,19430.144531
2,bitcoin,2022-07-31,1.00,0.31,3.05,0.039936,0.288098,19178.634766
3,bitcoin,2022-08-01,0.74,0.36,2.69,0.117555,0.304834,19024.222656
4,bitcoin,2022-08-02,0.80,0.20,2.91,0.117630,0.316604,18922.662109
...,...,...,...,...,...,...,...,...
145,bitcoin,2022-12-21,0.25,0.02,1.13,0.052238,0.270928,13925.944336
146,bitcoin,2022-12-22,0.94,0.75,3.75,0.162072,0.341690,13981.368164
147,bitcoin,2022-12-23,1.81,1.62,12.79,0.224089,0.351902,13935.910156
148,bitcoin,2022-12-24,2.56,4.58,15.37,-0.002429,0.327548,13978.060547
