# Task 1

In Moodle you will find the file bitcoin.csv, containing Reddit comments from the Bitcoin subreddit from 2022. Read the file into your console.

From this data set, we only need the columns “created”, which gives the date on which the
post was created, “title”, containing the title of the post, and “selftext”, containing additional
text from the post, if any. For our analysis, we want to treat “title” and “selftext” as one
combined entity for each post. So for each post, join the two respective strings if there is a
selftext.

We will now perform a simple sentiment analysis and compare the resulting time series with
the actual Bitcoin price, which you can find in bitcoin_prices.csv.

In [1]:
import pandas as pd

# Load the bitcoin.csv file
bitcoin_comments_path = 'bitcoin.csv'
bitcoin_comments_df = pd.read_csv(bitcoin_comments_path, usecols=['created', 'title', 'selftext'])

# Combine "title" and "selftext" into a single text per post.
# Replace NaN values with an empty string first, otherwise the concatenation yields NaN.
bitcoin_comments_df['selftext'] = bitcoin_comments_df['selftext'].fillna('')
# str.strip() removes the trailing space left behind when "selftext" is empty
bitcoin_comments_df['combined_text'] = (bitcoin_comments_df['title'].fillna('') + ' ' + bitcoin_comments_df['selftext']).str.strip()

# Displaying the first few rows of the dataframe with the combined text
bitcoin_comments_df[['created', 'combined_text']].head()


      created                                      combined_text
0  1640995385  Anyone use a roundup app to invest spare chang...
1  1640995487  Maybe I am more pessimistic, so I withdrew Bit...
2  1640995629                         Can not withdraw [removed]
3  1640996258  When sovereign wealth funds? Anyone heard any ...
4  1640996745           Big Money Lost in a Hard Drive [removed]
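The output above shows rows whose selftext is the placeholder `[removed]`, which carries no content of its own. The following is a minimal sketch, assuming such placeholders (`[removed]`, and `[deleted]`, which Reddit also uses) should be treated like a missing selftext. The helper name `combine_title_and_selftext` and the toy rows are hypothetical and not part of the original solution.

```python
import pandas as pd

# Placeholder strings assumed to mark a missing body (an assumption, not stated in the task)
PLACEHOLDERS = {"[removed]", "[deleted]"}

def combine_title_and_selftext(df: pd.DataFrame) -> pd.Series:
    """Join title and selftext, skipping empty or placeholder selftext."""
    title = df["title"].fillna("").astype(str).str.strip()
    selftext = df["selftext"].fillna("").astype(str).str.strip()
    # Blank out placeholder bodies so they do not end up in the combined text
    selftext = selftext.where(~selftext.isin(PLACEHOLDERS), "")
    # str.strip() drops the trailing space left when selftext is empty
    return (title + " " + selftext).str.strip()

# Toy rows, made up for illustration
toy = pd.DataFrame({
    "title": ["Can not withdraw", "Big news", "Question"],
    "selftext": ["[removed]", "Price is up", None],
})
combined = combine_title_and_selftext(toy)
print(combined.tolist())
```

Whether dropping the placeholder is desirable depends on the dictionary: if it has no entry for "removed", the placeholder is harmless for the score, but it still adds noise to the text.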


# Task 2

Apply preprocessing to the given texts. Keep in mind that we intend to use sentiment dictionaries to analyze the text later. How does this knowledge change your approach to preprocessing?


##### My Approach

Given that the sentiment analysis will be based on sentiment dictionaries, the preprocessing steps should ensure that the text is normalized in a way that maximizes the match with the dictionary entries without altering the semantic orientation of words. Here's how this knowledge affects our approach:

* Case Normalization: Convert all text to lowercase to ensure that word matches are not missed due to case differences.

* Tokenization: Split the text into individual words for analysis.

* Removing Special Characters and Numbers: Since sentiment dictionaries typically contain only words, special characters and numbers can be removed. However, negation words (e.g., "not") must be kept, and stripping apostrophes turns contractions such as "don't" into "dont", which may no longer match a dictionary entry.

* Lemmatization: Reduce words to their base (dictionary) form to increase the chance of matching dictionary entries. Unlike stemming, which can produce truncated non-words, lemmatization yields real words that can be looked up in the dictionary, which is crucial for dictionary-based sentiment analysis.

* Negation Handling: Sentiment analysis can be significantly affected by negation. Naive preprocessing, such as stop-word removal, often drops negation words like "not" and thereby inverts the apparent sentiment. It is essential to retain negations or to implement a strategy that handles them explicitly.

Given these considerations, the cell below applies case normalization, tokenization, and removal of special characters and numbers to the "combined_text" column. Negation words are retained because no stop-word filtering is applied; lemmatization and explicit negation handling are discussed above but not implemented here.
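Negation handling is described above but not implemented in the preprocessing cell. One common lightweight option is to flip the sign of sentiment words that appear shortly after a negator when scoring. The sketch below illustrates that idea under stated assumptions: the negator list, the window size of 3, the function name `score_with_negation`, and the toy dictionary values are all hypothetical and not taken from dictionary.csv. Contractions are listed without apostrophes because the preprocessing strips them.

```python
# Hypothetical negator list and window size; both would need tuning on the real data
NEGATORS = {"not", "no", "never", "cannot", "dont", "doesnt", "isnt", "wont"}
WINDOW = 3  # number of tokens after a negator whose sentiment gets flipped

def score_with_negation(text, sentiment_dict):
    """Sum dictionary values, flipping the sign of words shortly after a negator."""
    score = 0
    flip_remaining = 0  # how many upcoming tokens are still in negation scope
    for token in text.split():
        if token in NEGATORS:
            flip_remaining = WINDOW
            continue
        value = sentiment_dict.get(token, 0)
        score += -value if flip_remaining > 0 else value
        flip_remaining = max(flip_remaining - 1, 0)
    return score

# Toy dictionary with made-up values, only for illustration
toy_dict = {"good": 1, "bad": -1, "happy": 2}
print(score_with_negation("this is not good", toy_dict))              # -1
print(score_with_negation("i am happy", toy_dict))                    # 2
print(score_with_negation("not a very big but good deal", toy_dict))  # 1
```

In the last example "good" lies outside the three-token window, so it keeps its positive sign. A fixed window is a crude heuristic; it only works if the preprocessing keeps the negators in the text.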

In [7]:
import pandas as pd
import re

# Assumes bitcoin_comments_df already contains the 'combined_text' column

def preprocess_text_advanced(text):
    """Lowercase, tokenize on whitespace, and strip non-letter characters."""
    # Case normalization
    text = str(text).lower()

    # Tokenization (basic whitespace split)
    tokens = text.split()

    # Strip everything except a-z from each token. Negation words such as "not" survive,
    # but apostrophes are removed, so "don't" becomes "dont".
    # Lemmatization and explicit negation handling are not implemented here.
    cleaned = (re.sub(r'[^a-z]', '', token) for token in tokens)
    tokens = [token for token in cleaned if token]

    # Rejoin tokens into a single string for the later dictionary lookup
    return ' '.join(tokens)

# Apply the advanced preprocessing function to the combined text column
bitcoin_comments_df['processed_advanced'] = bitcoin_comments_df['combined_text'].apply(preprocess_text_advanced)

# Display the first few rows to check the preprocessing results
print(bitcoin_comments_df[['combined_text', 'processed_advanced']].head())


                                       combined_text  \
0  Anyone use a roundup app to invest spare chang...   
1  Maybe I am more pessimistic, so I withdrew Bit...   
2                         Can not withdraw [removed]   
3  When sovereign wealth funds? Anyone heard any ...   
4           Big Money Lost in a Hard Drive [removed]   

                                  processed_advanced  
0  anyone use a roundup app to invest spare chang...  
1  maybe i am more pessimistic so i withdrew bitc...  
2                           can not withdraw removed  
3  when sovereign wealth funds anyone heard any n...  
4             big money lost in a hard drive removed  


# Task 3

In dictionary.csv you will find a sentiment dictionary. “Positive words” will have positive
values while “Negative words” will have negative values.

Use this dictionary to calculate the sentiment score of each text, that is, sum the sentiment
values of all words in that text. A negative score will thus indicate a negative
text, while a positive score will indicate a positive text.
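
Written as a formula, with $w_1, \dots, w_n$ the tokens of a text $d$ and $s(w)$ the dictionary value of word $w$ (taken as $0$ when $w$ is not in the dictionary, which is an assumption the task text leaves open):

$$\text{score}(d) = \sum_{i=1}^{n} s(w_i)$$

For example, with the hypothetical values $s(\text{good}) = 1$ and $s(\text{lost}) = -1$, the text "good money lost lost" scores $1 + 0 - 1 - 1 = -1$. Note that a repeated word counts once per occurrence.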


In [12]:
# Load the sentiment dictionary from the provided CSV file
sentiment_dict_path = 'dictionary.csv'
sentiment_df = pd.read_csv(sentiment_dict_path)

# Convert the sentiment dictionary DataFrame into a Python dictionary for faster lookup
sentiment_dict = pd.Series(sentiment_df.sentiment.values, index=sentiment_df.term).to_dict()

# Function to calculate sentiment score of a text based on the sentiment dictionary
def calculate_sentiment_score(text):
    # Tokenize the text into words
    words = text.split()
    # Calculate sentiment score by summing the sentiment values of the words found in the sentiment dictionary
    score = sum(sentiment_dict.get(word, 0) for word in words)
    return score

# Apply the sentiment score calculation function to the preprocessed text
bitcoin_comments_df['sentiment_score'] = bitcoin_comments_df['processed_advanced'].apply(calculate_sentiment_score)

# Display the first few rows to check the sentiment scores
print(bitcoin_comments_df[['processed_advanced', 'sentiment_score']].head())


                                  processed_advanced  sentiment_score
0  anyone use a roundup app to invest spare chang...                1
1  maybe i am more pessimistic so i withdrew bitc...               -2
2                           can not withdraw removed                0
3  when sovereign wealth funds anyone heard any n...                4
4             big money lost in a hard drive removed               -1


# Task 4

Compare the daily difference in market values in the file bitcoin_prices.csv and your sentiment
scores with a correlation coefficient of your choice. Do the comments explain the behaviour of
the Bitcoin price evolution well?

In [15]:
# Load the Bitcoin price data
bitcoin_prices_path = 'bitcoin_prices.csv'
bitcoin_prices_df = pd.read_csv(bitcoin_prices_path)

In [16]:
# Checking the column names of the bitcoin_prices_df to identify the correct column names
bitcoin_prices_df.columns

Index(['timestamp', 'difference'], dtype='object')

In [19]:
# First, ensure the 'created' column is converted to datetime format in bitcoin_comments_df
bitcoin_comments_df['created_datetime'] = pd.to_datetime(bitcoin_comments_df['created'], unit='s')

# Now, extracting the date part for daily aggregation
bitcoin_comments_df['date'] = bitcoin_comments_df['created_datetime'].dt.date

# Calculating daily average sentiment score
daily_sentiment_score = bitcoin_comments_df.groupby('date')['sentiment_score'].mean().reset_index()
daily_sentiment_score['date'] = pd.to_datetime(daily_sentiment_score['date'])

# Parse the price timestamps and merge prices with the daily sentiment on the calendar date
bitcoin_prices_df['Date'] = pd.to_datetime(bitcoin_prices_df['timestamp'])
merged_df = pd.merge(bitcoin_prices_df, daily_sentiment_score, left_on='Date', right_on='date', how='inner')

# Calculate the correlation between daily price difference ('difference' column) and sentiment scores
correlation = merged_df['difference'].corr(merged_df['sentiment_score'])

print("Correlation between daily price difference and sentiment scores:", correlation)


Correlation between daily price difference and sentiment scores: 0.08015353171742567


 
The Pearson correlation coefficient (the default method of pandas' `Series.corr`) between the daily price difference and the daily mean sentiment score is approximately 0.080. This indicates a very weak positive linear relationship: days with more positive sentiment tend to have slightly higher price differences. With r² ≈ 0.006, sentiment accounts for less than 1% of the variance in the daily price difference, so the comments do not explain the behaviour of the Bitcoin price well.
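
Since the task leaves the choice of coefficient open, a rank-based coefficient can serve as a robustness check, because daily price differences may contain large outliers that dominate a linear correlation. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks. It uses a made-up toy frame named `toy_merged` whose column names mirror `merged_df`; the numbers are illustrative only and say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for merged_df; the numbers are made up for illustration
toy_merged = pd.DataFrame({
    "difference": [120.0, -80.0, 15.0, -2500.0, 40.0, 60.0],
    "sentiment_score": [0.4, -0.1, 0.2, 0.1, 0.3, 0.5],
})

# Pearson: linear association, sensitive to outliers (pandas default)
pearson = toy_merged["difference"].corr(toy_merged["sentiment_score"])

# Spearman: Pearson correlation of the ranks, less sensitive to outliers
spearman = toy_merged["difference"].rank().corr(toy_merged["sentiment_score"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

If both coefficients stay close to zero on the real `merged_df`, the weak relationship is unlikely to be an artifact of a few extreme days.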

### Thank you