# Reddit Text Data Modeling

### Introduction
This document explores the top 100 Reddit posts per day from the main subreddits dedicated to Bitcoin (BTC), Ethereum (ETH), and Solana (SOL). The primary goal is to extract time series features that can be fed into a machine learning model, with a focus on sentiment, emotions, and topics over time.

### Objectives
- Data Collection: Gather the top 100 Reddit posts per day from the BTC, ETH, and SOL subreddits.

- Data Exploration: Explore the data to surface meaningful insights.

- Feature Extraction: Extract sentiment, emotion, and topic features from the text data.

- Time Series Analysis: Aggregate the extracted features over time to create time series data.
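
As a rough illustration of the last objective, the sketch below turns per-post scores into one row per day. It assumes a hypothetical `sentiment` column with made-up values; the real feature names and scores are not defined at this point in the document.

```python
import pandas as pd

# Toy per-post scores; the `sentiment` column and its values are hypothetical
posts = pd.DataFrame({
    "date_posted": pd.to_datetime(
        ["2024-02-04", "2024-02-04", "2024-02-05", "2024-02-07"]
    ),
    "sentiment": [0.5, -0.1, 0.3, 0.8],
})

# One row per calendar day: mean sentiment and number of posts.
# Days without posts get count 0 and a NaN mean.
daily = (
    posts.set_index("date_posted")
         .resample("D")["sentiment"]
         .agg(["mean", "count"])
)
print(daily)
```

Resampling to a daily frequency keeps days without posts in the index, which makes gaps explicit before the features reach a model.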

### Importing necessary libraries and setting up the environment

In [46]:
import pandas as pd

# Text Processing Libraries 
from nltk.corpus import stopwords
import nltk
from nltk import bigrams, trigrams 
from nltk import word_tokenize
import spacy 
import string 

#### Loading the Text Datasets

In [48]:
btc_text = pd.read_csv('data/reddit/BTC_R.csv')
btc_text.date_posted = pd.to_datetime(btc_text.date_posted)
btc_text


,subreddit,title,selftext,upvote_ratio,ups,downs,score,comments,date_posted,pull_date
0,Bitcoin,Bitcoin Newcomers FAQ - Please read!,# Welcome to the /r/Bitcoin Newcomers FAQ\n\nY...,0.95,194,0,194,148,2023-09-06,08:07.5
1,Bitcoin,Are you still DCAing?,Or are you waiting for lower prices?,0.76,45,0,45,95,2024-02-05,08:07.5
2,Bitcoin,What happens if there are no miners at all?,Not really a relevant question as I can guaran...,0.37,0,0,0,35,2024-02-05,08:07.5
3,Bitcoin,Where can I Download full node bitcoin ledger ...,"Hi, have some experience running bitcoin core ...",0.64,7,0,7,39,2024-02-04,08:07.6
4,Bitcoin,Need good basic Spanish BTC video to convice m...,I am a big believer in BTC. It will slowly con...,0.67,10,0,10,38,2024-02-04,08:07.6
...,...,...,...,...,...,...,...,...,...,...
11792,Bitcoin,Retiring with 15 BTC in Europe - possible?,Hi Bitcoin bros!\n\nDo you think having 15 BTC...,0.23,0,0,0,43,2024-06-10,6/11/24
11793,Bitcoin,If you are begging in the streets for Bitcoin ...,Asking for a friend.,0.32,0,0,0,19,2024-06-10,6/11/24
11794,Bitcoin,Rubles cash stuck in Crimea,"Hi there,\n\nMy mum is stuck in Crimea and she...",0.43,0,0,0,4,2024-06-10,6/11/24
11795,Bitcoin,bitcoin,guys so i bought bitcoin once when i was almos...,0.53,1,0,1,5,2024-06-10,6/11/24
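
The `pull_date` column above mixes at least two shapes (`08:07.5` and `6/11/24`). A minimal sketch of how such a column could be checked, assuming the slash-separated values are month/day/two-digit year (an assumption, not something the data confirms), is to parse with an explicit format and count what fails:

```python
import pandas as pd

# Toy values mimicking the two shapes seen in `pull_date`
raw = pd.Series(["08:07.5", "08:07.6", "6/11/24"])

# An explicit format (assumed: month/day/two-digit year) avoids format
# inference; values that do not match the format become NaT
parsed = pd.to_datetime(raw, format="%m/%d/%y", errors="coerce")

# Number of values that could not be parsed
n_failed = int(parsed.isna().sum())
print(n_failed)
```

Counting the `NaT` values before dropping them shows how much of a column is actually usable as a date.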


#### Counting the number of posts per date_posted

In [49]:
def create_date_count_table(df, date_column):
    # Parse the date column without modifying the caller's DataFrame;
    # unparseable values become NaT (Not a Time) and are dropped
    dates = pd.to_datetime(df[date_column], errors='coerce').dropna()
    
    # Extract month and day from the date
    # Note: the year is ignored, so the same month in different years shares a row
    parts = pd.DataFrame({'month': dates.dt.month, 'day': dates.dt.day})
    
    # Group by month and day, count rows, and spread days across columns
    pivot_table = parts.groupby(['month', 'day']).size().unstack(fill_value=0)
    
    return pivot_table

date_count_table = create_date_count_table(btc_text, 'date_posted')
date_count_table

month \ day,1,2,3,4,5,6,7,8,9,10,...,22,23,24,25,26,27,28,29,30,31
1,0,0,6,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,4,33,89,95,88,95,26,122,...,99,120,88,84,110,72,114,125,0,0
3,70,99,101,94,147,80,64,169,146,75,...,71,116,89,129,94,116,99,85,115,137
4,43,79,94,98,81,70,85,64,114,97,...,53,132,116,123,79,70,98,60,63,0
5,104,53,71,67,86,97,79,40,10,19,...,114,103,84,74,43,96,132,87,104,90
6,84,120,94,98,92,117,118,96,100,72,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,10,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
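
The table above groups by month and day only, so the September row (from 2023, judging by the first row of the dataset) sits next to the 2024 months. A possible year-aware alternative, sketched here on made-up dates rather than the real data, counts posts per calendar date and fills missing days with zero:

```python
import pandas as pd

# Toy posting dates; the values are illustrative only
dates = pd.Series(pd.to_datetime(
    ["2023-09-06", "2024-02-03", "2024-02-03", "2024-02-05"]
))

# Count posts per calendar date (year included)
per_day = dates.dt.normalize().value_counts().sort_index()

# Reindex onto a continuous daily range so days without posts show up as 0
full_range = pd.date_range(per_day.index.min(), per_day.index.max(), freq="D")
per_day = per_day.reindex(full_range, fill_value=0)

# Year/month summary keeps September 2023 apart from the 2024 months
per_month = per_day.groupby(per_day.index.to_period("M")).sum()
print(per_month)
```

A continuous daily index is also a convenient base for joining later per-day features.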


#### Preprocessing the Text

In [50]:
# Download NLTK stop words
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Load spaCy's small English model
nlp = spacy.load('en_core_web_sm')

def preprocess_text(text):
    if not isinstance(text, str):
        return ''
    # Convert text to lowercase
    text = text.lower()
    
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Tokenize text using spaCy's nlp pipeline
    tokens = nlp(text)
    
    # Remove stop words and non-alphabetic tokens, and perform lemmatization
    tokens = [token.lemma_ for token in tokens if token.is_alpha and token.text not in stop_words]
    
    # Join tokens back into a single string
    preprocessed_text = ' '.join(tokens)
    
    return preprocessed_text



In [51]:
btc_text['selftext_p'] = btc_text['selftext'].apply(preprocess_text)
btc_text['title_p'] = btc_text['title'].apply(preprocess_text)
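
To make the first two steps of `preprocess_text` concrete, the sketch below applies only the lowercasing and punctuation removal to a made-up sentence, using the standard library alone. It does not run the spaCy tokenization or the stop-word filter.

```python
import string

# Made-up example text, not taken from the dataset
sample = "Don't panic, HODL!!!"

# Step 1: lowercase
lowered = sample.lower()

# Step 2: delete every character listed in string.punctuation
stripped = lowered.translate(str.maketrans("", "", string.punctuation))

print(stripped)  # dont panic hodl
```

Because the apostrophe is removed before tokenization, contractions such as "don't" reach the tokenizer as "dont", which can affect how they are split and whether they match the stop-word list.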