# Investors' Sentiment & S&P500: Model & Conclusion
---

# Notebook Organisation
1. Data Collection - Subreddit
2. Data Collection - Target 
3. Merging Data
4. EDA and Preprocessing 
5. **Model Tuning and Insights (Current Notebook)**

## Import Library
---

In [75]:
import requests
import time
import nltk
import pandas as pd
import regex as re
import numpy as np
import random
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import warnings

from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, classification_report, plot_roc_curve,\
                            roc_auc_score, accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
warnings.filterwarnings('ignore')
sns.set_style('ticks')
%matplotlib inline

## Read CSV
---

### Combined Title on Daily Basis

In [5]:
dailytitle_df = pd.read_csv("data/project_df_dailytitle_cleaned.csv")

In [7]:
dailytitle_df.head()

Unnamed: 0,Date,combined_title,percent_change_class,title_len,title_cleaned
0,2019-01-03,Backtesting moving average crossover Can someo...,up,108,backtesting moving average crossover someone e...
1,2019-01-04,What’s your best performing stock today? How t...,up,503,best performing stock today explain option new...
2,2019-01-07,Come Join Quantum Stock Trading! Help reach fi...,up,209,come join quantum stock trading help reach fin...
3,2019-01-08,"Greenspan says stock market ""is still a bit to...",up,541,greenspan say stock market still bit high appl...
4,2019-01-09,"ONC, ONCY &amp; ONC.WT What is everyone's opin...",up,472,onc oncy amp onc wt everyone opinion company d...


In [11]:
dailytitle_df['percent_change_class'] = dailytitle_df['percent_change_class'].map({'up' : 1, 'down': 0})

In [12]:
dailytitle_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 502 entries, 0 to 501
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Date                  502 non-null    object
 1   combined_title        502 non-null    object
 2   percent_change_class  502 non-null    int64 
 3   title_len             502 non-null    int64 
 4   title_cleaned         502 non-null    object
dtypes: int64(2), object(3)
memory usage: 19.7+ KB


In [13]:
dailytitle_df.isnull().sum()

Date                    0
combined_title          0
percent_change_class    0
title_len               0
title_cleaned           0
dtype: int64

In [14]:
dailytitle_df.shape

(502, 5)

### Combined Title & Post in 2019

In [139]:
titlepost_df = pd.read_csv("data/project_titlepost2019_df_cleaned.csv")

In [140]:
titlepost_df.head()

Unnamed: 0,date,title,selftext,is_self,upvotes,n_comments,permalink,author,month,year,Date,year_and_month,week_of_year,percent_change_class,title_len,text_len,first_title_cleaned,first_selftext_cleaned,final_title_cleaned,final_selftext_cleaned,fresh_meat,lean_meat,lean_meat_len
0,2019-01-03 18:02:40,Backtesting moving average crossover,"Hello guys, \nI was trying to backtest a movi...",1.0,1.0,0.0,/r/stocks/comments/ac4c0u/backtesting_moving_a...,bhandarimohit2029,1,2019,2019-01-03 00:00:00,2019-01,1,up,4,212,backtesting moving average crossover,hello guy trying backtest moving average cross...,backtesting moving crossover,hello backtest moving crossover indian found i...,backtesting moving average crossover hello guy...,backtesting moving crossover hello backtest mo...,85
1,2019-01-03 20:12:03,Can someone ELI5 what a cross signal Index is ...,I just finished Margin call on Netflix and was...,1.0,1.0,0.0,/r/stocks/comments/ac55o3/can_someone_eli5_wha...,tellmetheworld,1,2019,2019-01-03 00:00:00,2019-01,1,up,16,59,someone eli cross signal index used trading,finished margin call netflix intrigued using i...,someone eli cross used,call netflix intrigued indicator sort proof sk...,someone eli cross signal index used trading fi...,someone eli cross used call netflix intrigued ...,24
2,2019-01-03 20:54:22,Your AM Global Stocks Preview and a whole lot ...,\n\n### US Stocks\n\n* **US stocks index futu...,1.0,1.0,16.0,/r/stocks/comments/ac5gf6/your_am_global_stock...,QuantalyticsResearch,1,2019,2019-01-03 00:00:00,2019-01,1,up,24,1033,global stock preview whole lot news need read ...,u stock u stock index future dropping sharply ...,preview whole negative guidance aapl spook,u u dropping sharply front p negative guidance...,global stock preview whole lot news need read ...,preview whole negative guidance aapl spook u u...,327
3,2019-01-03 21:01:23,Gld outperforms snp and Dow,Gld has outperformed the snp and Dow since 200...,1.0,1.0,30.0,/r/stocks/comments/ac5ibv/gld_outperforms_snp_...,1anon2y3mous,1,2019,2019-01-03 00:00:00,2019-01,1,up,5,20,gld outperforms snp dow,gld outperformed snp dow since gold bad rap in...,gld outperforms snp,gld outperformed snp since rap,gld outperforms snp dow gld outperformed snp d...,gld outperforms snp gld outperformed snp since...,8
4,2019-01-03 21:17:40,Alternative to Yahoo Finance,"Hi there, i switched from Android to iOS.\n\nI...",1.0,1.0,10.0,/r/stocks/comments/ac5mqv/alternative_to_yahoo...,h0ly88,1,2019,2019-01-03 00:00:00,2019-01,1,up,4,65,alternative yahoo finance,hi switched android io always used mystocks ap...,alternative,switched android always used mystocks android ...,alternative yahoo finance hi switched android ...,alternative switched android always used mysto...,23


In [141]:
titlepost_df['percent_change_class'] = titlepost_df['percent_change_class'].map({'up' : 1, 'down': 0})

In [149]:
titlepost_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67482 entries, 0 to 67481
Data columns (total 23 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   date                    67482 non-null  object 
 1   title                   67442 non-null  object 
 2   selftext                66717 non-null  object 
 3   is_self                 67482 non-null  float64
 4   upvotes                 67482 non-null  float64
 5   n_comments              67482 non-null  float64
 6   permalink               67482 non-null  object 
 7   author                  67482 non-null  object 
 8   month                   67482 non-null  int64  
 9   year                    67482 non-null  int64  
 10  Date                    67482 non-null  object 
 11  year_and_month          67482 non-null  object 
 12  week_of_year            67482 non-null  int64  
 13  percent_change_class    67482 non-null  int64  
 14  title_len               67482 non-null

In [151]:
titlepost_df.shape

(67482, 23)

## Function for Modeling
---

### Instantiate models

In [58]:
models = {'lr': LogisticRegression(random_state=42),
          'nb': MultinomialNB(),
          'rf': RandomForestClassifier(random_state=42),
          'ada': AdaBoostClassifier(random_state=42)}

### Instantiate vectorizers

In [59]:
vectorizers = {'cvec': CountVectorizer(),
               'tvec': TfidfVectorizer()}

### Hyperparameters for Tuning

#### Vectorizers Hyperparameters

In [60]:
cvec_params = {
    # Cap the vocabulary size (None keeps every feature)
    'cvec__max_features': [None, 100],

    # Ignore terms that appear in fewer than this many documents
    'cvec__min_df':[2, 4],

    # Ignore terms that appear in more than this share of documents
    'cvec__max_df': [0.2, 0.95],
    
    # Remove English stop words
    'cvec__stop_words': ['english'],
    
    # Testing unigrams, bigrams, trigrams and mixed ranges
    'cvec__ngram_range':[(1,1) , (1,2), (2,2), (2,3), (3,3)],
}

In [61]:
tvec_params = {
    'tvec__max_features': [None, 100],
    'tvec__min_df':[3, 5],
    'tvec__max_df': [0.2, 0.4],
    'tvec__stop_words': ['english'],
    'tvec__ngram_range':[(1,1) , (1,2), (2,2), (2,3), (3,3)]
}
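To make the `ngram_range` values in the two grids above concrete, here is a minimal pure-Python sketch of contiguous word n-gram extraction. The helper `word_ngrams` is hypothetical and written only for illustration; it ignores the tokenization details and stop-word filtering that the scikit-learn vectorizers apply.

```python
def word_ngrams(tokens, ngram_range):
    """Return all contiguous word n-grams for n between min_n and max_n (inclusive)."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# Toy input taken from one of the cleaned titles shown earlier.
tokens = "best performing stock today".split()
for rng in [(1, 1), (1, 2), (2, 2), (2, 3), (3, 3)]:
    print(rng, word_ngrams(tokens, rng))
```

A range such as `(2, 3)` therefore contributes only bigrams and trigrams, with no single words.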

#### Model Hyperparameters

In [62]:
lr_params = {
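    # Trying different types of regularization
    # Note: 'l1' needs a solver that supports it (e.g. 'liblinear' or 'saga');
    # with the default 'lbfgs' solver those candidates fail to fit in the grid search
    'lr__penalty':['l2','l1'],

    # Trying different regularization strengths; C is the inverse of alpha (C = 1/alpha)
    'lr__C':[0.1, 1, 10],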
}

In [63]:
nb_params = {
    'nb__fit_prior': [True, False],
    'nb__alpha': [0, 0.4, 0.8]
}
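For reference, `nb__alpha` is the additive (Laplace/Lidstone) smoothing parameter of `MultinomialNB`. The symbols below are not from this notebook; they follow the usual textbook notation, where $N_{ci}$ is the total count of feature $i$ in class $c$, $N_c$ is the total count of all features in class $c$, and $n$ is the number of features:

$$
\hat{\theta}_{ci} = \frac{N_{ci} + \alpha}{N_c + \alpha n}
$$

Setting $\alpha = 0$ corresponds to no smoothing, while larger values pull the estimates towards a uniform distribution over the vocabulary. `nb__fit_prior` controls whether class priors are learned from the data (`True`) or assumed uniform (`False`).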

In [111]:
# Random Forest Model Tuning Hyperparameters
# source: https://www.analyticsvidhya.com/blog/2020/03/beginners-guide-random-forest-hyperparameter-tuning/
# source: https://builtin.com/data-science/random-forest-algorithm

rf_params = {
    'rf__n_estimators': [100, 200],
    'rf__max_depth': [3, 7],
    'rf__max_features': [10, 50],
    # Absolute row counts: a node holding fewer samples than this is never split
    'rf__min_samples_split': [1000, 2000],
    'rf__min_samples_leaf':[30, 60],
    'rf__n_jobs': [-1]
}

In [112]:
# AdaBoost Model Tuning Hyperparameters
# source: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

ada_params = {
    'ada__n_estimators': [40, 100],
    'ada__learning_rate': [0.1, 1]
}
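The size of each grid search follows directly from the grids above: the number of candidates is the product of the number of values listed per hyperparameter, and the number of fits is that product times the number of folds. The sketch below is a small arithmetic check; the dictionaries only record how many values each hyperparameter has, and `n_candidates` is a hypothetical helper.

```python
from math import prod

# Number of values listed for each hyperparameter in the grids above.
# The tvec grid has the same shape as the cvec grid.
cvec_grid = {"max_features": 2, "min_df": 2, "max_df": 2, "stop_words": 1, "ngram_range": 5}
model_grids = {
    "lr": {"penalty": 2, "C": 3},
    "nb": {"fit_prior": 2, "alpha": 3},
    "rf": {"n_estimators": 2, "max_depth": 2, "max_features": 2,
           "min_samples_split": 2, "min_samples_leaf": 2, "n_jobs": 1},
    "ada": {"n_estimators": 2, "learning_rate": 2},
}

def n_candidates(vec_grid, mod_grid):
    """Size of the Cartesian product of a vectorizer grid and a model grid."""
    return prod(vec_grid.values()) * prod(mod_grid.values())

N_FOLDS = 3  # the grid searches in this notebook use cv=3
for name, grid in model_grids.items():
    candidates = n_candidates(cvec_grid, grid)
    print(f"{name}: {candidates} candidates, {candidates * N_FOLDS} fits")
```

These products match the "Fitting 3 folds for each of ... candidates" lines printed by the grid searches in this notebook, and they explain why the Random Forest searches take several times longer than the others.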

### Function to Run Model

In [166]:
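# Function to run model -- input the vectorizer and model keys
# Relies on the notebook-level X_train, X_test, y_train, y_test, eval_list and tuning_list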
def run_model(vec, mod, vec_params={}, mod_params={}, grid_search=False):
    
    results = {}
    
    pipe = Pipeline([
            (vec, vectorizers[vec]),
            (mod, models[mod])
            ])
    
    if grid_search:
        gs = GridSearchCV(pipe, param_grid = {**vec_params, **mod_params}, cv=3, verbose=1, n_jobs=-1)
        gs.fit(X_train, y_train)
        pipe = gs.best_estimator_
    else:
        pipe.fit(X_train, y_train)
    
    # Retrieve metrics
    results['model'] = mod
    results['vectorizer'] = vec
    results['train'] = pipe.score(X_train, y_train)
    results['test'] = pipe.score(X_test, y_test)
    predictions = pipe.predict(X_test)
    prob = pipe.predict_proba(X_test)[:,1]
    results['roc'] = roc_auc_score(y_test, prob)
    results['precision'] = precision_score(y_test, predictions)
    results['recall'] = recall_score(y_test, predictions)
    results['f_score'] = f1_score(y_test, predictions)
    
    if grid_search:
        tuning_list.append(results)
        print('### BEST PARAMS ###')
        display(pipe)
    else:
        eval_list.append(results)
    
    print('### METRICS ###')
    display(results)
    tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
    print(f"True Negatives: {tn}")
    print(f"False Positives: {fp}")
    print(f"False Negatives: {fn}")
    print(f"True Positives: {tp}")
    
    return pipe
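The grid keys work because each one is prefixed with the name of the pipeline step it targets, followed by a double underscore. The sketch below illustrates that naming convention with plain dictionaries; `group_by_step` is a hypothetical helper, and the grids are trimmed copies of the ones defined earlier.

```python
def group_by_step(param_grid):
    """Group '<step>__<param>' keys by the pipeline step they target."""
    grouped = {}
    for key, values in param_grid.items():
        step, _, param = key.partition("__")
        grouped.setdefault(step, {})[param] = values
    return grouped

# Trimmed copies of the grids defined earlier, merged the same way run_model does.
vec_params = {"cvec__min_df": [2, 4], "cvec__max_df": [0.2, 0.95]}
mod_params = {"lr__penalty": ["l2", "l1"], "lr__C": [0.1, 1, 10]}
param_grid = {**vec_params, **mod_params}

print(group_by_step(param_grid))
```

Because the step names in the pipeline are the same strings used as dictionary keys (`'cvec'`, `'lr'`, and so on), the merged grid can be passed to the search unchanged.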

## Model Benchmark
---

### Benchmark for Combined Title on Daily Basis

In [68]:
dailytitle_df['percent_change_class'].value_counts(normalize=True)

1    0.583665
0    0.416335
Name: percent_change_class, dtype: float64

### Benchmark for Combined Title & Post in 2019

In [165]:
titlepost_df['percent_change_class'].value_counts(normalize=True)

1    0.592558
0    0.407442
Name: percent_change_class, dtype: float64
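The benchmark used here is the majority-class share: the accuracy obtained by always predicting the more frequent label. A minimal sketch of that calculation, using toy labels rather than the project data (`majority_baseline` is a hypothetical helper):

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_label, accuracy of always predicting that label)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Toy labels only (1 = up, 0 = down); not the project data.
toy_labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
print(majority_baseline(toy_labels))
```

A model is only useful if its test accuracy exceeds this figure, so the two shares printed above are the numbers to compare against in the sections that follow.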

## Combined Title on Daily Basis
---

### Train-Test Split

In [21]:
dailytitle_df.shape[0] * (3/4)

376.5

In [174]:
X = dailytitle_df['title_cleaned']
y = dailytitle_df['percent_change_class']

In [175]:
X_train = X.iloc[0:377]
X_test = X.iloc[377:]
y_train = y.iloc[0:377]
y_test = y.iloc[377:]

In [176]:
print(f'The Shape of X_train: {X_train.shape}')
print(f'The Shape of X_test: {X_test.shape}')
print(f'The Shape of y_train: {y_train.shape}')
print(f'The Shape of y_test: {y_test.shape}')

The Shape of X_train: (377,)
The Shape of X_test: (125,)
The Shape of y_train: (377,)
The Shape of y_test: (125,)
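The split above is positional rather than shuffled, so the model is trained on the earlier days and evaluated on the later ones. A minimal sketch of the same idea as a reusable helper; `chronological_split` is hypothetical, and rounding the cut point up with `math.ceil` is an assumption chosen because it reproduces the 377/125 split used here.

```python
import math

def chronological_split(rows, train_frac=0.75):
    """Split an already date-sorted sequence into a leading and a trailing part."""
    cut = math.ceil(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

days = list(range(502))  # stand-in for 502 date-ordered rows
train, test = chronological_split(days)
print(len(train), len(test))
```

Keeping the order intact means no row in the training part comes after any row in the test part, which avoids evaluating on days that precede the training data.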


In [177]:
eval_list = []
tuning_list =[]

### Modelling

For the modelling, the following vectorizers and models were used:

Vectorizer:
- CountVectorizer (cvec)
- TfidfVectorizer (tvec)

Model:
- Logistic Regression
- Multinomial Naive Bayes
- Random Forest
- AdaBoost

#### Logistic Regression w cvec

In [178]:
cvec_lr_gs = run_model('cvec', 'lr', vec_params=cvec_params, mod_params=lr_params, grid_search=True)

Fitting 3 folds for each of 240 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   15.6s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  4.5min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:  7.5min finished


### BEST PARAMS ###


Pipeline(steps=[('cvec',
                 CountVectorizer(max_df=0.95, max_features=100, min_df=4,
                                 ngram_range=(2, 2), stop_words='english')),
                ('lr', LogisticRegression(C=1, random_state=42))])

### METRICS ###


{'model': 'lr',
 'vectorizer': 'cvec',
 'train': 0.7798408488063661,
 'test': 0.488,
 'roc': 0.42228661749209695,
 'precision': 0.56,
 'recall': 0.5753424657534246,
 'f_score': 0.5675675675675674}

True Negatives: 19
False Positives: 33
False Negatives: 31
True Positives: 42
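The reported metrics can be recomputed from the four confusion-matrix counts, which is a quick way to sanity-check any of the result blocks in this notebook. A minimal sketch using the counts printed above (`metrics_from_counts` is a hypothetical helper, not part of `run_model`):

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive class from raw counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f_score": 2 * precision * recall / (precision + recall),
    }

# Counts from the Logistic Regression + cvec result above.
m = metrics_from_counts(tn=19, fp=33, fn=31, tp=42)
print({k: round(v, 4) for k, v in m.items()})
```

The helper assumes at least one positive prediction and one actual positive; otherwise the precision or recall denominator is zero.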


#### Logistic Regression w tvec

In [179]:
tvec_lr_gs = run_model('tvec', 'lr', vec_params=tvec_params, mod_params=lr_params, grid_search=True)

Fitting 3 folds for each of 240 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   14.0s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:  6.7min finished


### BEST PARAMS ###


Pipeline(steps=[('tvec',
                 TfidfVectorizer(max_df=0.4, max_features=100, min_df=3,
                                 ngram_range=(2, 3), stop_words='english')),
                ('lr', LogisticRegression(C=1, random_state=42))])

### METRICS ###


{'model': 'lr',
 'vectorizer': 'tvec',
 'train': 0.7108753315649867,
 'test': 0.568,
 'roc': 0.5076396206533194,
 'precision': 0.584070796460177,
 'recall': 0.9041095890410958,
 'f_score': 0.7096774193548386}

True Negatives: 5
False Positives: 47
False Negatives: 7
True Positives: 66


#### MultinomialNB w cvec

In [180]:
cvec_nb_gs = run_model('cvec', 'nb', vec_params=cvec_params, mod_params=nb_params, grid_search=True)

Fitting 3 folds for each of 240 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   10.3s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:  7.1min finished


### BEST PARAMS ###


Pipeline(steps=[('cvec',
                 CountVectorizer(max_df=0.95, max_features=100, min_df=2,
                                 ngram_range=(2, 3), stop_words='english')),
                ('nb', MultinomialNB(alpha=0.4, fit_prior=False))])

### METRICS ###


{'model': 'nb',
 'vectorizer': 'cvec',
 'train': 0.6657824933687002,
 'test': 0.504,
 'roc': 0.44863013698630133,
 'precision': 0.5632183908045977,
 'recall': 0.6712328767123288,
 'f_score': 0.6125}

True Negatives: 14
False Positives: 38
False Negatives: 24
True Positives: 49


#### MultinomialNB w tvec

In [181]:
tvec_nb_gs = run_model('tvec', 'nb', vec_params=tvec_params, mod_params=nb_params, grid_search=True)

Fitting 3 folds for each of 240 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   13.2s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  4.2min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:  7.0min finished


### BEST PARAMS ###


Pipeline(steps=[('tvec',
                 TfidfVectorizer(max_df=0.4, max_features=100, min_df=3,
                                 ngram_range=(2, 3), stop_words='english')),
                ('nb', MultinomialNB(alpha=0.4))])

### METRICS ###


{'model': 'nb',
 'vectorizer': 'tvec',
 'train': 0.6339522546419099,
 'test': 0.6,
 'roc': 0.5168598524762908,
 'precision': 0.5934959349593496,
 'recall': 1.0,
 'f_score': 0.7448979591836735}

True Negatives: 2
False Positives: 50
False Negatives: 0
True Positives: 73


#### Random Forest w cvec

In [182]:
cvec_rf_gs = run_model('cvec', 'rf', vec_params=cvec_params, mod_params=rf_params, grid_search=True)

Fitting 3 folds for each of 1280 candidates, totalling 3840 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    8.7s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed:  6.7min
[Parallel(n_jobs=-1)]: Done 1226 tasks      | elapsed: 11.0min
[Parallel(n_jobs=-1)]: Done 1776 tasks      | elapsed: 18.5min
[Parallel(n_jobs=-1)]: Done 2426 tasks      | elapsed: 25.3min
[Parallel(n_jobs=-1)]: Done 3176 tasks      | elapsed: 32.3min
[Parallel(n_jobs=-1)]: Done 3840 out of 3840 | elapsed: 41.6min finished


### BEST PARAMS ###


Pipeline(steps=[('cvec',
                 CountVectorizer(max_df=0.2, min_df=2, stop_words='english')),
                ('rf',
                 RandomForestClassifier(max_depth=3, max_features=10,
                                        min_samples_leaf=30,
                                        min_samples_split=1000, n_jobs=-1,
                                        random_state=42))])

### METRICS ###


{'model': 'rf',
 'vectorizer': 'cvec',
 'train': 0.583554376657825,
 'test': 0.584,
 'roc': 0.5,
 'precision': 0.584,
 'recall': 1.0,
 'f_score': 0.7373737373737373}

True Negatives: 0
False Positives: 52
False Negatives: 0
True Positives: 73
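A ROC-AUC of exactly 0.5 together with zero predicted negatives is what one would expect if the model assigns the same probability to every row. That would be consistent with `min_samples_split` values of 1000 and 2000 exceeding the 377 training rows, so that no tree can split. The sketch below illustrates the mechanism with a pairwise definition of AUC; `pairwise_auc` is a hypothetical helper and the labels and scores are toy values.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 1, 0]
print(pairwise_auc(y_true, [0.58] * 5))                  # identical scores for every row
print(pairwise_auc(y_true, [0.9, 0.1, 0.8, 0.7, 0.2]))   # positives always ranked higher
```

With identical scores every pair is a tie, so the result is 0.5 regardless of the class balance.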


#### Random Forest w tvec

In [183]:
tvec_rf_gs = run_model('tvec', 'rf', vec_params=tvec_params, mod_params=rf_params, grid_search=True)

Fitting 3 folds for each of 1280 candidates, totalling 3840 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   15.3s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  4.5min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed:  8.3min
[Parallel(n_jobs=-1)]: Done 1226 tasks      | elapsed: 13.0min
[Parallel(n_jobs=-1)]: Done 1776 tasks      | elapsed: 18.9min
[Parallel(n_jobs=-1)]: Done 2426 tasks      | elapsed: 25.8min
[Parallel(n_jobs=-1)]: Done 3176 tasks      | elapsed: 34.0min
[Parallel(n_jobs=-1)]: Done 3840 out of 3840 | elapsed: 41.2min finished


### BEST PARAMS ###


Pipeline(steps=[('tvec',
                 TfidfVectorizer(max_df=0.2, min_df=3, stop_words='english')),
                ('rf',
                 RandomForestClassifier(max_depth=3, max_features=10,
                                        min_samples_leaf=30,
                                        min_samples_split=1000, n_jobs=-1,
                                        random_state=42))])

### METRICS ###


{'model': 'rf',
 'vectorizer': 'tvec',
 'train': 0.583554376657825,
 'test': 0.584,
 'roc': 0.5,
 'precision': 0.584,
 'recall': 1.0,
 'f_score': 0.7373737373737373}

True Negatives: 0
False Positives: 52
False Negatives: 0
True Positives: 73


#### ADA Boost w cvec

In [184]:
cvec_ada_gs = run_model('cvec', 'ada', vec_params=cvec_params, mod_params=ada_params, grid_search=True)

Fitting 3 folds for each of 160 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   19.0s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  5.5min
[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed:  6.2min finished


### BEST PARAMS ###


Pipeline(steps=[('cvec',
                 CountVectorizer(max_df=0.95, min_df=4, ngram_range=(2, 3),
                                 stop_words='english')),
                ('ada',
                 AdaBoostClassifier(learning_rate=0.1, n_estimators=40,
                                    random_state=42))])

### METRICS ###


{'model': 'ada',
 'vectorizer': 'cvec',
 'train': 0.7771883289124668,
 'test': 0.56,
 'roc': 0.5317439409905164,
 'precision': 0.5882352941176471,
 'recall': 0.821917808219178,
 'f_score': 0.6857142857142857}

True Negatives: 10
False Positives: 42
False Negatives: 13
True Positives: 60


#### ADA Boost w tvec

In [185]:
tvec_ada_gs = run_model('tvec', 'ada', vec_params=tvec_params, mod_params=ada_params, grid_search=True)

Fitting 3 folds for each of 160 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   18.1s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  5.2min
[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed:  5.8min finished


### BEST PARAMS ###


Pipeline(steps=[('tvec',
                 TfidfVectorizer(max_df=0.4, max_features=100, min_df=3,
                                 ngram_range=(2, 3), stop_words='english')),
                ('ada',
                 AdaBoostClassifier(learning_rate=1, n_estimators=40,
                                    random_state=42))])

### METRICS ###


{'model': 'ada',
 'vectorizer': 'tvec',
 'train': 0.8541114058355438,
 'test': 0.528,
 'roc': 0.5021074815595363,
 'precision': 0.574468085106383,
 'recall': 0.7397260273972602,
 'f_score': 0.6467065868263473}

True Negatives: 12
False Positives: 40
False Negatives: 19
True Positives: 54


### Model Evaluation

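With the combined daily titles as the feature, most models do not beat the benchmark accuracy of about 58.4%. Only the Multinomial NB with the TF-IDF vectorizer scores higher (60.0% on the test set); the two Random Forest models match the benchmark by predicting "up" for every test day, and the remaining models fall below it. For Logistic Regression and Multinomial NB, the count vectorizer overfits more than the TF-IDF vectorizer (a larger gap between train and test accuracy), while for AdaBoost the pattern is reversed.

Among the four models, the Multinomial NB with the TF-IDF vectorizer performs best in terms of test accuracy and, apart from the Random Forest models, shows the smallest train-test gap. Its ROC-AUC score (0.517) is not the highest, but its precision (0.593) and F1 score (0.745) are the best of all models and its recall (1.0) is tied for the best. Note that it achieves this by predicting "up" for all but two of the 125 test days.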

In [186]:
title_eval_df = pd.DataFrame(tuning_list)

In [187]:
# Saving the evaluation results DataFrame to CSV
title_eval_df.to_csv('data/dailytitle_eval_df.csv', index=False)

In [203]:
title_model_result = pd.read_csv('data/dailytitle_eval_df.csv')
title_model_result

Unnamed: 0,model,vectorizer,train,test,roc,precision,recall,f_score
0,lr,cvec,0.779841,0.488,0.422287,0.56,0.575342,0.567568
1,lr,tvec,0.710875,0.568,0.50764,0.584071,0.90411,0.709677
2,nb,cvec,0.665782,0.504,0.44863,0.563218,0.671233,0.6125
3,nb,tvec,0.633952,0.6,0.51686,0.593496,1.0,0.744898
4,rf,cvec,0.583554,0.584,0.5,0.584,1.0,0.737374
5,rf,tvec,0.583554,0.584,0.5,0.584,1.0,0.737374
6,ada,cvec,0.777188,0.56,0.531744,0.588235,0.821918,0.685714
7,ada,tvec,0.854111,0.528,0.502107,0.574468,0.739726,0.646707
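One way to read the table above for overfitting is to rank the models by the gap between train and test accuracy. A minimal sketch in plain Python, with the train and test values copied from the table (the tuple layout is chosen only for illustration):

```python
# (model, vectorizer, train accuracy, test accuracy) copied from the table above.
results = [
    ("lr", "cvec", 0.779841, 0.488),
    ("lr", "tvec", 0.710875, 0.568),
    ("nb", "cvec", 0.665782, 0.504),
    ("nb", "tvec", 0.633952, 0.600),
    ("rf", "cvec", 0.583554, 0.584),
    ("rf", "tvec", 0.583554, 0.584),
    ("ada", "cvec", 0.777188, 0.560),
    ("ada", "tvec", 0.854111, 0.528),
]

# Largest gap first: a large positive gap suggests overfitting.
gaps = sorted(((train - test, model, vec) for model, vec, train, test in results),
              reverse=True)
for gap, model, vec in gaps:
    print(f"{model:>3} {vec}: train-test gap = {gap:+.3f}")
```

A small gap on its own is not a sign of a good model: the Random Forest rows have almost no gap because they predict the same class for every row.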


## Combined Title & Post for 2019
---

### Train-Test Split

In [147]:
print(titlepost_df.shape[0])
print(titlepost_df.shape[0] *(3/4))

67482
50611.5


In [188]:
X = titlepost_df['lean_meat']
y = titlepost_df['percent_change_class']

In [189]:
X_train = X.iloc[0:50611]
X_test = X.iloc[50611:]
y_train = y.iloc[0:50611]
y_test = y.iloc[50611:]

In [190]:
print(f'The Shape of X_train: {X_train.shape}')
print(f'The Shape of X_test: {X_test.shape}')
print(f'The Shape of y_train: {y_train.shape}')
print(f'The Shape of y_test: {y_test.shape}')

The Shape of X_train: (50611,)
The Shape of X_test: (16871,)
The Shape of y_train: (50611,)
The Shape of y_test: (16871,)


In [191]:
eval_list = []
tuning_list =[]

### Modelling

For the modelling, the following vectorizers and models were used:

Vectorizer:
- CountVectorizer (cvec)
- TfidfVectorizer (tvec)

Model:
- Logistic Regression
- Multinomial Naive Bayes
- Random Forest
- AdaBoost

#### Logistic Regression w cvec

In [192]:
cvec_lr_gs = run_model('cvec', 'lr', vec_params=cvec_params, mod_params=lr_params, grid_search=True)

Fitting 3 folds for each of 240 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   12.8s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:  5.5min finished


### BEST PARAMS ###


Pipeline(steps=[('cvec',
                 CountVectorizer(max_df=0.2, max_features=100, min_df=4,
                                 ngram_range=(3, 3), stop_words='english')),
                ('lr', LogisticRegression(C=0.1, random_state=42))])

### METRICS ###


{'model': 'lr',
 'vectorizer': 'cvec',
 'train': 0.5870067771828259,
 'test': 0.6101001718925968,
 'roc': 0.5003924979847831,
 'precision': 0.6101001718925968,
 'recall': 1.0,
 'f_score': 0.7578412604918275}

True Negatives: 0
False Positives: 6578
False Negatives: 0
True Positives: 10293


#### Logistic Regression w tvec

In [193]:
tvec_lr_gs = run_model('tvec', 'lr', vec_params=tvec_params, mod_params=lr_params, grid_search=True)

Fitting 3 folds for each of 240 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   12.6s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:  5.5min finished


### BEST PARAMS ###


Pipeline(steps=[('tvec',
                 TfidfVectorizer(max_df=0.2, min_df=3, ngram_range=(3, 3),
                                 stop_words='english')),
                ('lr', LogisticRegression(C=0.1, random_state=42))])

### METRICS ###


{'model': 'lr',
 'vectorizer': 'tvec',
 'train': 0.5868684673292367,
 'test': 0.610040898583368,
 'roc': 0.5065857764874403,
 'precision': 0.610077059869591,
 'recall': 0.9999028465947731,
 'f_score': 0.7577955306851233}

True Negatives: 0
False Positives: 6578
False Negatives: 1
True Positives: 10292


#### MultinomialNB w cvec

In [194]:
cvec_nb_gs = run_model('cvec', 'nb', vec_params=cvec_params, mod_params=nb_params, grid_search=True)

Fitting 3 folds for each of 240 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   11.0s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:  5.4min finished


### BEST PARAMS ###


Pipeline(steps=[('cvec',
                 CountVectorizer(max_df=0.2, max_features=100, min_df=2,
                                 ngram_range=(3, 3), stop_words='english')),
                ('nb', MultinomialNB(alpha=0.8))])

### METRICS ###


{'model': 'nb',
 'vectorizer': 'cvec',
 'train': 0.5861571595107783,
 'test': 0.6101594452018256,
 'roc': 0.5003925570625607,
 'precision': 0.610149395304719,
 'recall': 0.9999028465947731,
 'f_score': 0.7578513309524686}

True Negatives: 2
False Positives: 6576
False Negatives: 1
True Positives: 10292


#### MultinomialNB w tvec

In [195]:
tvec_nb_gs = run_model('tvec', 'nb', vec_params=tvec_params, mod_params=nb_params, grid_search=True)

Fitting 3 folds for each of 240 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   11.1s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:  5.0min finished


### BEST PARAMS ###


Pipeline(steps=[('tvec',
                 TfidfVectorizer(max_df=0.2, max_features=100, min_df=5,
                                 ngram_range=(3, 3), stop_words='english')),
                ('nb', MultinomialNB(alpha=0.8))])

### METRICS ###


{'model': 'nb',
 'vectorizer': 'tvec',
 'train': 0.5867301574756476,
 'test': 0.6101001718925968,
 'roc': 0.5003926604486715,
 'precision': 0.6101001718925968,
 'recall': 1.0,
 'f_score': 0.7578412604918275}

True Negatives: 0
False Positives: 6578
False Negatives: 0
True Positives: 10293


#### Random Forest w cvec

In [197]:
cvec_rf_gs = run_model('cvec', 'rf', vec_params=cvec_params, mod_params=rf_params, grid_search=True)

Fitting 3 folds for each of 1280 candidates, totalling 3840 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   33.9s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  7.6min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed: 13.1min
[Parallel(n_jobs=-1)]: Done 1226 tasks      | elapsed: 21.3min
[Parallel(n_jobs=-1)]: Done 1776 tasks      | elapsed: 33.1min
[Parallel(n_jobs=-1)]: Done 2426 tasks      | elapsed: 44.6min
[Parallel(n_jobs=-1)]: Done 3176 tasks      | elapsed: 52.7min
[Parallel(n_jobs=-1)]: Done 3840 out of 3840 | elapsed: 61.6min finished


### BEST PARAMS ###


Pipeline(steps=[('cvec',
                 CountVectorizer(max_df=0.2, min_df=2, stop_words='english')),
                ('rf',
                 RandomForestClassifier(max_depth=3, max_features=10,
                                        min_samples_leaf=30,
                                        min_samples_split=1000, n_jobs=-1,
                                        random_state=42))])

### METRICS ###


{'model': 'rf',
 'vectorizer': 'cvec',
 'train': 0.5867103989251349,
 'test': 0.6101001718925968,
 'roc': 0.5030790156118048,
 'precision': 0.6101001718925968,
 'recall': 1.0,
 'f_score': 0.7578412604918275}

True Negatives: 0
False Positives: 6578
False Negatives: 0
True Positives: 10293


#### Random Forest w tvec

In [198]:
tvec_rf_gs = run_model('tvec', 'rf', vec_params=tvec_params, mod_params=rf_params, grid_search=True)

Fitting 3 folds for each of 1280 candidates, totalling 3840 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   15.4s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  4.7min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed:  8.7min
[Parallel(n_jobs=-1)]: Done 1226 tasks      | elapsed: 14.2min
[Parallel(n_jobs=-1)]: Done 1776 tasks      | elapsed: 20.6min
[Parallel(n_jobs=-1)]: Done 2426 tasks      | elapsed: 28.3min
[Parallel(n_jobs=-1)]: Done 3176 tasks      | elapsed: 44.0min
[Parallel(n_jobs=-1)]: Done 3840 out of 3840 | elapsed: 54.0min finished


### BEST PARAMS ###


Pipeline(steps=[('tvec',
                 TfidfVectorizer(max_df=0.2, max_features=100, min_df=5,
                                 ngram_range=(1, 2), stop_words='english')),
                ('rf',
                 RandomForestClassifier(max_depth=7, max_features=50,
                                        min_samples_leaf=60,
                                        min_samples_split=1000, n_jobs=-1,
                                        random_state=42))])

### METRICS ###


{'model': 'rf',
 'vectorizer': 'tvec',
 'train': 0.5886664954258956,
 'test': 0.6093888921818505,
 'roc': 0.5074267264970952,
 'precision': 0.610477952145116,
 'recall': 0.9939764888759351,
 'f_score': 0.7563950909359751}

True Negatives: 50
False Positives: 6528
False Negatives: 62
True Positives: 10231


#### ADA Boost w cvec

In [199]:
cvec_ada_gs = run_model('cvec', 'ada', vec_params=cvec_params, mod_params=ada_params, grid_search=True)

Fitting 3 folds for each of 160 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   21.3s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  6.2min
[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed:  6.9min finished


### BEST PARAMS ###


Pipeline(steps=[('cvec',
                 CountVectorizer(max_df=0.2, min_df=2, stop_words='english')),
                ('ada',
                 AdaBoostClassifier(learning_rate=0.1, n_estimators=40,
                                    random_state=42))])

### METRICS ###


{'model': 'ada',
 'vectorizer': 'cvec',
 'train': 0.5870067771828259,
 'test': 0.610040898583368,
 'roc': 0.5015471879169875,
 'precision': 0.610077059869591,
 'recall': 0.9999028465947731,
 'f_score': 0.7577955306851233}

True Negatives: 0
False Positives: 6578
False Negatives: 1
True Positives: 10292


#### ADA Boost w tvec

In [200]:
tvec_ada_gs = run_model('tvec', 'ada', vec_params=tvec_params, mod_params=ada_params, grid_search=True)

Fitting 3 folds for each of 160 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   24.0s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  5.8min
[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed:  6.5min finished


### BEST PARAMS ###


Pipeline(steps=[('tvec',
                 TfidfVectorizer(max_df=0.2, max_features=100, min_df=3,
                                 stop_words='english')),
                ('ada',
                 AdaBoostClassifier(learning_rate=0.1, n_estimators=40,
                                    random_state=42))])

### METRICS ###


{'model': 'ada',
 'vectorizer': 'tvec',
 'train': 0.5881330145620517,
 'test': 0.6106929049848853,
 'roc': 0.5029328424206327,
 'precision': 0.6106194690265486,
 'recall': 0.9988341591372778,
 'f_score': 0.7579063767047548}

True Negatives: 22
False Positives: 6556
False Negatives: 12
True Positives: 10281


### Model Evaluation

In [206]:
titlepost_eval_df = pd.DataFrame(tuning_list)

In [202]:
# Save the evaluation DataFrame to CSV
titlepost_eval_df.to_csv('data/titlepost_eval_df.csv', index=False)

In [204]:
# Note: this loads the daily-title results file, not 'data/titlepost_eval_df.csv' written above
titlepost_model_result = pd.read_csv('data/dailytitle_eval_df.csv')
titlepost_model_result

Unnamed: 0,model,vectorizer,train,test,roc,precision,recall,f_score
0,lr,cvec,0.779841,0.488,0.422287,0.56,0.575342,0.567568
1,lr,tvec,0.710875,0.568,0.50764,0.584071,0.90411,0.709677
2,nb,cvec,0.665782,0.504,0.44863,0.563218,0.671233,0.6125
3,nb,tvec,0.633952,0.6,0.51686,0.593496,1.0,0.744898
4,rf,cvec,0.583554,0.584,0.5,0.584,1.0,0.737374
5,rf,tvec,0.583554,0.584,0.5,0.584,1.0,0.737374
6,ada,cvec,0.777188,0.56,0.531744,0.588235,0.821918,0.685714
7,ada,tvec,0.854111,0.528,0.502107,0.574468,0.739726,0.646707


## Model Insights
---

After thorough data cleaning and exploratory analysis, we paired two vectorizers (CountVectorizer and TfidfVectorizer) with four classifiers: Logistic Regression, Multinomial Naive Bayes, Random Forest and AdaBoost.

For model evaluation, we recorded train and test accuracy, ROC-AUC, precision, recall (sensitivity) and F1 score for each model. Based on test accuracy, the Multinomial NB model with tvec performs best (0.600). However, it does only slightly better than the benchmark, because the text of the up and down classes is highly similar.
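As a worked check of how these metrics relate to a confusion matrix, the sketch below recomputes accuracy, precision, recall and F1 from the counts printed earlier for the Random Forest with tvec run (TN 50, FP 6528, FN 62, TP 10231). The function name `metrics_from_counts` and the `baseline` key are illustrative and are not part of the notebook's `run_model` helper. The baseline is assumed here to be the majority-class share of the test set.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive common classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f_score": 2 * precision * recall / (precision + recall),
        # Assumed benchmark: always predict the larger class.
        "baseline": max(tp + fn, tn + fp) / total,
    }


# Counts copied from the Random Forest w tvec output.
rf_tvec = metrics_from_counts(tn=50, fp=6528, fn=62, tp=10231)
for name, value in rf_tvec.items():
    print(f"{name}: {value:.4f}")
```

For that run the majority-class share is about 0.610 while the test accuracy is about 0.609, so the model gives no lift over always predicting the larger class. The very low specificity shows why: almost every sample is labelled as the positive class.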

### Limitation

As shown during the EDA process, the up and down classes share too many common words, whether we look at uni-grams, bi-grams or tri-grams. In addition, the NLTK VADER results show that both positive-sentiment and negative-sentiment text is associated with a mixture of up and down trends, so sentiment polarity alone does not separate the two classes.
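One way to put a number on this overlap is the Jaccard similarity between the top n-grams of each class. The snippet below is a minimal sketch of that idea, not the method used in the EDA notebook. The helper names `top_ngrams` and `jaccard` and the toy titles are made up for illustration.

```python
from collections import Counter


def top_ngrams(docs, n=1, k=10):
    """Return the set of the k most frequent word n-grams across docs."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return {gram for gram, _ in counts.most_common(k)}


def jaccard(a, b):
    """Share of items common to both sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy titles for illustration only; not taken from the project data.
up_titles = ["stock market rally today", "buy stock market dip"]
down_titles = ["stock market crash today", "sell stock market now"]

overlap = jaccard(
    top_ngrams(up_titles, n=1, k=2),
    top_ngrams(down_titles, n=1, k=2),
)
print(f"Top-2 uni-gram overlap: {overlap:.2f}")
```

A value close to 1.0 for every n-gram size would confirm that the most frequent terms carry little information about the class.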

### Recommendation

A few recommendations that could potentially improve the accuracy of the model:
- Use other inputs from the post, such as the number of comments and the number of votes (positive/negative)
- Extend the dataset to include posts from official news outlets reporting on the stock market
- Extend the dataset to include posts from influential people, since a single post from an influential figure has been shown to cause a company's stock to surge. (Source: https://www.straitstimes.com/business/companies-markets/gamestop-market-value-soars-past-13-billion-after-elon-musk-tweet)
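The first recommendation could be prototyped by feeding text features and numeric post metadata into one pipeline. The sketch below uses scikit-learn's `ColumnTransformer` for this. The columns `num_comments` and `score` are hypothetical (the current dataset does not contain them), and the rows are toy values for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy rows for illustration only; not taken from the project data.
toy_df = pd.DataFrame({
    "title_cleaned": [
        "market rally earnings beat",
        "panic sell market crash",
        "buy dip long term",
        "recession fear sell",
        "strong job report rally",
        "crash warning bear market",
    ],
    "num_comments": [12, 85, 7, 40, 15, 60],  # hypothetical column
    "score": [30, 210, 12, 95, 44, 150],      # hypothetical column
    "percent_change_class": ["up", "down", "up", "down", "up", "down"],
})

# TF-IDF on the text column, scaling on the numeric metadata columns.
features = ColumnTransformer([
    ("text", TfidfVectorizer(), "title_cleaned"),
    ("meta", StandardScaler(), ["num_comments", "score"]),
])

model = Pipeline([
    ("features", features),
    ("lr", LogisticRegression(max_iter=1000)),
])

X = toy_df.drop(columns="percent_change_class")
y = toy_df["percent_change_class"]
model.fit(X, y)
preds = model.predict(X)
```

The text column is passed as a plain string so the vectorizer receives a 1-D series of documents, while the numeric columns are passed as a list so the scaler receives a 2-D frame.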

## Conclusion
---

While most models performed slightly better than the benchmark, accuracy can be further improved. Sentiment is a balanced mix of positivity and negativity regardless of whether the S&P 500 is in a bearish or bullish trend, so posts from the Reddit community cannot determine whether the S&P 500 will go up or down. There will always be people buying and selling on any trading day.

Comparing 2019 and 2020, the most prominent explanation is Covid. The number of posts surged year-on-year as Covid spread, and there was a huge market correction in Mar 2020. This caused a mixed reaction among investors: two groups of investors emerged, resulting in both buying and selling regardless of the performance of the S&P 500. Some likely saw an opportunity to buy stocks at a lower price, while others feared a potential market crash, hence the panic selling.

Overall, the EDA process was a very good indicator that producing a good model would be a huge challenge, owing to the strong similarity of the features across the two classes. There are also additional insights that could be turned into features, such as comparing the number of posts over a certain period of time, to better separate up and down trends.
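A post-volume feature of this kind could be derived with a daily resample and a rolling mean. This is a minimal sketch under assumptions: the column name `created`, the 3-day window and the toy timestamps are illustrative and do not come from the project data.

```python
import pandas as pd

# Toy timestamps for illustration only; not taken from the project data.
posts = pd.DataFrame({
    "created": pd.to_datetime([
        "2020-03-09 09:00", "2020-03-09 15:30",
        "2020-03-10 10:00",
        "2020-03-12 11:00", "2020-03-12 12:00", "2020-03-12 13:00",
    ]),
})

# Number of posts per calendar day (days without posts count as 0).
daily_counts = (
    posts.set_index("created").resample("D").size().rename("post_count")
)

# Smoothed volume over an assumed 3-day window.
rolling_mean = (
    daily_counts.rolling(window=3, min_periods=1)
    .mean()
    .rename("post_count_3d_mean")
)

features = pd.concat([daily_counts, rolling_mean], axis=1)
print(features)
```

Such a frame could then be joined to the daily target on the date index before modelling.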

While the limitation stems from the strong similarity between posts in both classes, it could potentially be overcome by increasing the number of features that help classify the up or down trend of the stock market. Additionally, official or influential figures can also affect how the stock market performs on a given day.