This project was run with `Python 3.9.10`, `conda 4.12.0`, on `macOS Monterey 12.1 (21C52, arm64)`.

### Imports

This cell can be executed to install all necessary Python packages. Alternatively, the `requirements.txt` file can be installed from the terminal with `python3 -m pip install -r requirements.txt`.

In [None]:
! pip install emoji pysentiment2 tqdm pandas numpy matplotlib scikit-learn nltk wordcloud varname

With the following two cells we import all necessary packages.

In [1]:
import os
import re
import glob
import warnings
from varname.helpers import Wrapper

import emoji
from typing import List
import pysentiment2 as ps
from tqdm import tqdm

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

import sklearn.metrics as sklmx

import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer

from wordcloud import WordCloud, STOPWORDS

warnings.filterwarnings(action="ignore")

In [2]:
from sklearn.pipeline import Pipeline

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report

The following cell makes sure that all necessary dependencies for `nltk` are being downloaded.

In [None]:
nltk.download('all')

### Loading Data

This code can be executed to download the datasets directly from Kaggle. To do so, the API key file `kaggle.json` must be placed inside the working directory. This `kaggle.json` contains my personal API key, which is why the dissemination of this file should be kept to the necessary minimum. After `kaggle.json` has been added to the working directory (it should already be there when executing the `.ipynb` straight from the project folder), the following cell should be executed.

If the datasets have been downloaded manually and placed in the working directory, this step can and should be skipped! If it is not skipped, the cell will throw an interactive prompt that cannot be answered from within an `.ipynb` file.

In [None]:
! pip install kaggle
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

! kaggle datasets download frankcaoyun/stocktwits-2020-2022-raw
! unzip stocktwits-2020-2022-raw.zip

! kaggle datasets download harshrkh/india-financial-news-headlines-sentiments
! unzip india-financial-news-headlines-sentiments.zip
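
Since the download cell should be skipped when the data is already present, a small check can make that decision explicit. The following is only a sketch: the helper name `missing_paths` is hypothetical, and the expected paths are assumed from the file and folder names used further down in this notebook.

```python
import os
from typing import List

def missing_paths(expected: List[str], base_dir: str = ".") -> List[str]:
    """Return the expected files or folders that do not exist under base_dir."""
    return [p for p in expected if not os.path.exists(os.path.join(base_dir, p))]

# Names as they are used later in this notebook (assumed to be the unzipped results).
expected = [
    "News_sentiment_Jan2017_to_Apr2021.csv",
    "StockTwits_2020_2022_Raw",
]

missing = missing_paths(expected)
if missing:
    print("Run the download cell; missing:", missing)
else:
    print("Data already present; skip the download cell.")
```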

[**Here**](https://www.kaggle.com/datasets/frankcaoyun/stocktwits-2020-2022-raw) further information about the **StockTwits dataset** can be found.  
[**Here**](https://www.kaggle.com/datasets/harshrkh/india-financial-news-headlines-sentiments?resource=download) further information about the **News Headlines dataset** can be found.  

Here we import the **News Headlines** `.csv` file and trim it down to the relevant columns (`body`, `sentiment` and `confidence`).

In [None]:
news_csv_file = pd.read_csv("News_sentiment_Jan2017_to_Apr2021.csv")
news_csv_file.sentiment = news_csv_file.sentiment.replace(["NEGATIVE", "POSITIVE"], ["Bearish","Bullish"])
unclean_NW_dataset = news_csv_file[["Title", "sentiment","confidence"]].rename(columns={"Title":"body"})

Getting the paths to the folders of the separate stocks for which we have **StockTwits** data.

In [None]:
path = os.getcwd()

apple_path = os.path.join(path, "StockTwits_2020_2022_Raw", "AAPL_2020_2022")
amazon_path = os.path.join(path, "StockTwits_2020_2022_Raw", "AMZN2019-2022")
facebook_path = os.path.join(path, "StockTwits_2020_2022_Raw", "FB_2019_2022")
nvidia_path = os.path.join(path, "StockTwits_2020_2022_Raw", "NVDA_2013_2022")
tesla_path = os.path.join(path, "StockTwits_2020_2022_Raw", "TSLA_2020_2022")

Getting all the `.csv` files from the separate folders.

In [None]:
apple_csv_files = glob.glob(os.path.join(apple_path, "*.csv"))
amazon_csv_files = glob.glob(os.path.join(amazon_path, "*.csv"))
facebook_csv_files = glob.glob(os.path.join(facebook_path, "*.csv"))
nvidia_csv_files = glob.glob(os.path.join(nvidia_path, "*.csv"))
tesla_csv_files = glob.glob(os.path.join(tesla_path, "*.csv"))

The following function takes a list of `.csv` file paths and creates a single Pandas DataFrame.

In [None]:
def allToOne(file_list:List[str]) -> pd.DataFrame:
    # Read every file first and concatenate once at the end;
    # DataFrame.append was removed in pandas 2.0.
    frames = [pd.read_csv(f) for f in tqdm(file_list)]

    # An empty file list yields an empty DataFrame, as before.
    if not frames:
        return pd.DataFrame()

    return pd.concat(frames)
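
The same idea on a toy input, as a minimal sketch: two in-memory CSV "files" (made-up content, not the real StockTwits schema) are read and combined in one step, together with a row-count sanity check.

```python
import io
import pandas as pd

# Two small in-memory "files" stand in for the per-folder StockTwits CSVs
# (hypothetical content; the real files have more columns).
csv_a = io.StringIO("body,symbols\nbuy the dip,AAPL\nto the moon,AAPL\n")
csv_b = io.StringIO("body,symbols\nselling everything,AAPL\n")

frames = [pd.read_csv(f) for f in (csv_a, csv_b)]

# A single concat over the whole list avoids re-copying a growing
# result on every iteration.
combined = pd.concat(frames, ignore_index=True)

# The row count of the result should equal the sum of the parts.
total_rows = sum(len(df) for df in frames)
print(len(combined), total_rows)
```

Here `ignore_index=True` renumbers the rows from 0; leaving it out keeps the per-file row numbers instead.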

With the following cell we use the previous function `allToOne` to create a separate DataFrame with all StockTwits for every stock.

In [None]:
apple_df = allToOne(apple_csv_files)
amazon_df = allToOne(amazon_csv_files)
facebook_df = allToOne(facebook_csv_files)
nvidia_df = allToOne(nvidia_csv_files)
tesla_df = allToOne(tesla_csv_files)

In [None]:
print(len(apple_df)+len(amazon_df)+len(facebook_df)+len(nvidia_df)+len(tesla_df))


### Cleaning Data

- adjusting Labels
- adjusting proportions of different types of observations (Bearish, Bullish, None)
- cleaning text
    - excluding filler words
    - excluding stop words
    - excluding links
    - lemmatization (replacing words by lemma)

Adjusting the labeling of the columns (entities -> sentiment) and the cells (sentiment: {Bearish, Bullish, None}; symbols: {AAPL, AMZN, FB, NVDA, TSLA}) and dropping all StockTwits preclassified as `"None"` (sentiment = None).

In [None]:
def labelDF(dataFrame:pd.DataFrame) -> pd.DataFrame:
    for i in tqdm(range(1)):

        dataFrame = dataFrame[["body", "symbols", "entities"]]
        dataFrame = dataFrame.rename(columns = {"entities":"sentiment"})

        dataFrame.sentiment = dataFrame.sentiment.replace(r"^.*Bearish.*$","Bearish", regex = True)
        dataFrame.sentiment = dataFrame.sentiment.replace(r"^.*Bullish.*$","Bullish", regex = True)
        dataFrame.sentiment = dataFrame.sentiment.replace(r"^.*None.*$","None", regex = True)

        # Copy so the symbol replacements below act on an independent frame, not a view.
        dataFrame = dataFrame[dataFrame.sentiment != "None"].copy()

        dataFrame.symbols = dataFrame.symbols.replace(r"^.*AAPL.*$", "AAPL", regex = True)
        dataFrame.symbols = dataFrame.symbols.replace(r"^.*AMZN.*$", "AMZN", regex = True)
        dataFrame.symbols = dataFrame.symbols.replace(r"^.*FB.*$", "FB", regex = True)
        dataFrame.symbols = dataFrame.symbols.replace(r"^.*NVDA.*$", "NVDA", regex = True)
        dataFrame.symbols = dataFrame.symbols.replace(r"^.*TSLA.*$", "TSLA", regex = True)

        dataFrame = dataFrame.sample(frac=1, random_state=1).reset_index(drop=True)

    return dataFrame
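
To illustrate how the whole-cell regular expressions collapse a raw value into a bare label, here is a small sketch. The raw strings are hypothetical; the exact layout of the real `entities` column is an assumption.

```python
import pandas as pd

# Hypothetical raw values; the real `entities` cells may look different.
raw = pd.Series([
    "{'sentiment': {'basic': 'Bullish'}}",
    "{'sentiment': {'basic': 'Bearish'}}",
    "{'sentiment': None}",
])

labels = raw
for label in ("Bearish", "Bullish", "None"):
    # ^.*label.*$ matches the entire cell, so the cell is replaced by the bare label.
    labels = labels.replace(rf"^.*{label}.*$", label, regex=True)

print(labels.tolist())

# Rows without a user-provided sentiment are dropped afterwards.
kept = labels[labels != "None"].reset_index(drop=True)
print(kept.tolist())
```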

In [None]:
label_apple_df = labelDF(apple_df)
label_amazon_df = labelDF(amazon_df)
label_facebook_df = labelDF(facebook_df)
label_nvidia_df = labelDF(nvidia_df)
label_tesla_df = labelDF(tesla_df)

This function makes sure that there is an equal number of StockTwits from each preclassified sentiment category (`Bullish`, `Bearish`) in the respective datasets. To do so, we randomly drop rows of the type (`Bullish` or `Bearish`) that has more observations until both types have the same number of observations. We drop all StockTwits that have been classified as `None` (this already happens in `labelDF`), as those cannot reliably be taken as neutral: some users might simply forget to give their StockTwit a positive or negative rating. Hence, `None` doesn't necessarily reflect a neutral perspective of a user on a particular stock.

In [None]:
def proportions(dataFrame:pd.DataFrame) -> pd.DataFrame:
    for i in tqdm(range(1)):
        #print("init length: ", len(dataFrame))
        Bearish_dataFrame = dataFrame.loc[dataFrame["sentiment"] == "Bearish"].sample(frac = 1, random_state=1).reset_index(drop=True)
        Bullish_dataFrame = dataFrame.loc[dataFrame["sentiment"] == "Bullish"].sample(frac = 1, random_state=1).reset_index(drop=True)

        min_length = min(len(Bearish_dataFrame), len(Bullish_dataFrame))
        #print("bearish: ", len(Bearish_dataFrame), "bullish: ", len(Bullish_dataFrame))
        #print("min length: ", min_length, "| 2x : ", min_length*3)

        dataFrame = pd.concat([Bearish_dataFrame[:min_length],Bullish_dataFrame[:min_length]])

        #print("end length: ", len(dataFrame))

    return dataFrame
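
The balancing step can also be sketched on toy data. Note that this variant uses `groupby` with `sample(n=...)` rather than the shuffle-and-slice approach of `proportions`; the data and column values are made up for illustration.

```python
import pandas as pd

# Toy data: 5 Bullish rows and 2 Bearish rows (made-up values).
df = pd.DataFrame({
    "body": [f"tweet {i}" for i in range(7)],
    "sentiment": ["Bullish"] * 5 + ["Bearish"] * 2,
})

# Size of the smaller class.
n = int(df["sentiment"].value_counts().min())

# Draw n rows from every class; random_state keeps the draw reproducible.
balanced = pd.concat(
    [group.sample(n=n, random_state=1) for _, group in df.groupby("sentiment")]
).reset_index(drop=True)

print(balanced["sentiment"].value_counts().to_dict())
```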

In [None]:
cut_apple_df = proportions(label_apple_df)
cut_amazon_df = proportions(label_amazon_df)
cut_facebook_df = proportions(label_facebook_df)
cut_nvidia_df = proportions(label_nvidia_df)
cut_tesla_df = proportions(label_tesla_df)

The following functions preprocess the tweets: they remove links, hashtags and stop words, and expand contractions (for example `won't` to `will not`) so that negations are kept as separate words.

In [None]:
def clean_text(df,field):
    # regex=True is required for the patterns; since pandas 2.0 str.replace defaults to literal matching.
    df[field] = df[field].str.replace(r"http\S+"," ", regex=True)
    df[field] = df[field].str.replace(r"http"," ")
    df[field] = df[field].str.replace(r"@","at")
    df[field] = df[field].str.replace("#[A-Za-z0-9_]+", ' ', regex=True)
    df[field] = df[field].str.replace(r"[^A-Za-z(),!?@\'\"_\n]"," ", regex=True)
    df[field] = df[field].str.lower()
    return df 

In [None]:
lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")
STOPWORDS.update(['rt', 'mkr', 'didn', 'bc', 'n', 'm','im', 'll', 'y', 've', 
                      'u', 'ur', 'don','p', 't', 's', 'aren', 'kp', 'o', 'kat', 
                      'de', 're', 'amp', 'will'])

In [None]:
def preprocess_text(text):
    # Expand contractions; the specific forms and n't must run before the bare 't rule.
    text = re.sub(r"won\'t", "will not", text)
    text = re.sub(r"can\'t", "can not", text)
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d", " would",text)
    text = re.sub(r"\'m", " am", text)
    # Keep letters only; this also drops emoji and all other non-ASCII characters,
    # so no separate emoji or non-ASCII pass is needed.
    text = re.sub('[^a-zA-Z]',' ',text)
    text = " ".join([stemmer.stem(word) for word in text.split()])
    text = [lemmatizer.lemmatize(word) for word in text.split() if word not in STOPWORDS]
    text = ' '.join(text)

    return text
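
As a sketch of the contraction step in isolation: the order of the rules matters, because a general rule that runs too early can consume part of a more specific form. The helper `expand_contractions` and its rule list are hypothetical and cover only a subset of the rules.

```python
import re

# (pattern, replacement) pairs; specific forms come before general ones.
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'ll", " will"),
    (r"'ve", " have"),
    (r"'re", " are"),
    (r"'m", " am"),
]

def expand_contractions(text: str) -> str:
    """Apply the rules in order, so that won't and can't are handled before n't."""
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    return text

print(expand_contractions("i don't think they'll sell, we're fine"))
# i do not think they will sell, we are fine
```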
    

In [None]:
def cleanDF(dataFrame:pd.DataFrame, field:str) -> pd.DataFrame:
    tqdm.pandas()

    dataFrame = clean_text(dataFrame,field)
    # Write the result back to the column; attribute assignment (dataFrame.field) would not update it.
    dataFrame[field] = dataFrame[field].progress_apply(preprocess_text)

    return dataFrame

Here we clean all the **StockTwits** datasets.

In [None]:
clean_apple_df = cleanDF(cut_apple_df,"body")
clean_amazon_df = cleanDF(cut_amazon_df,"body")
clean_facebook_df = cleanDF(cut_facebook_df,"body")
clean_nvidia_df = cleanDF(cut_nvidia_df,"body")
clean_tesla_df = cleanDF(cut_tesla_df,"body")


Here we clean the **News Headlines** dataset.

In [None]:
clean_NW_dataset = cleanDF(unclean_NW_dataset, "body")

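The following function concatenates the respective dataframes of each stock into one single dataframe, after trimming each of them to the size of the smallest one with equal numbers of `Bearish` and `Bullish` rows.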

In [None]:
def oneDataFrame(dataFrameList:List[pd.DataFrame]) -> pd.DataFrame:
    # Cap every stock at the size of the smallest dataset so that
    # each stock contributes the same number of rows.
    lenList = [len(dataFrame) for dataFrame in dataFrameList]
    print(lenList)
    minimum_length = min(lenList)
    print("minimum length: ", minimum_length, "5*: ", minimum_length*5)

    # Integer division covers both even and odd minimum lengths.
    half = minimum_length // 2

    rightlenList = []
    for dataFrame in tqdm(dataFrameList):
        Bearish_dataFrame = dataFrame.loc[dataFrame["sentiment"] == "Bearish"].sample(frac = 1, random_state=1).reset_index(drop=True)
        Bullish_dataFrame = dataFrame.loc[dataFrame["sentiment"] == "Bullish"].sample(frac = 1, random_state=1).reset_index(drop=True)

        rightlenList.append(pd.concat([Bearish_dataFrame[:half], Bullish_dataFrame[:half]]))

    finalDataFrame = pd.concat(rightlenList)

    print(len(finalDataFrame))

    return finalDataFrame

In [None]:
cleaned_ST_dataset = oneDataFrame([clean_apple_df,clean_amazon_df, clean_facebook_df, clean_nvidia_df, clean_tesla_df])

Here we bring both datasets (**StockTwits** & **News Headlines**) to the same length by randomly dropping observations from the **StockTwits** dataset, to make the performance of the different sentiment extraction approaches comparable.

In [None]:
clean_ST_dataset = cleaned_ST_dataset.sample(frac=1, random_state=1).reset_index(drop=True)[:len(clean_NW_dataset)]
print("ST: ",len(clean_ST_dataset), "NW: ", len(clean_NW_dataset))

The following two cells can be executed to store the **cleaned datasets** locally and to reload them into the notebook, so that the lengthy process of importing and cleaning the datasets doesn't have to be redone after the kernel is reset. This can be especially useful when handling errors in the subsequent model training and sentiment classification process.

In [None]:
clean_ST_dataset.to_csv('clean_ST_dataset.csv')
clean_NW_dataset.to_csv('clean_NW_dataset.csv')

In [None]:
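# index_col=0 restores the saved index; empty bodies are read back as NaN, so they are reset to "".
clean_ST_dataset = pd.read_csv("clean_ST_dataset.csv", index_col=0).fillna({"body": ""})
clean_NW_dataset = pd.read_csv("clean_NW_dataset.csv", index_col=0).fillna({"body": ""})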

In [None]:
ST_text = " ".join(text for text in clean_ST_dataset.body)
NW_text = " ".join(text for text in clean_NW_dataset.body)

In [None]:
ST_word_cloud = WordCloud(collocations = False, background_color = 'white').generate(ST_text)
NW_word_cloud = WordCloud(collocations = False, background_color = 'white').generate(NW_text)


In [None]:
plt.imshow(ST_word_cloud, interpolation='bilinear')

In [None]:
plt.imshow(NW_word_cloud, interpolation='bilinear')

In the following, we apply different sentiment classification approaches to the two datasets and then compare their respective performance.

#### (Bag-of-Words) Harvard IV

Function that derives the sentiment scores based on the Harvard IV dictionary.

In [None]:
# Derives polarity scores from the Harvard IV dictionary and maps them to sentiment labels

def applyHIV4two(dataFrame_init:pd.DataFrame, none:bool) -> pd.DataFrame:
    tqdm.pandas()
    hav4 = ps.HIV4()

    # Work on a copy so the cleaned input dataset is not modified in place.
    dataFrame = dataFrame_init.copy()

    itokenized = dataFrame.body.progress_apply(hav4.tokenize)
    scores = itokenized.progress_apply(hav4.get_score)
    dataFrame["HAV4_polarity"] = scores.apply(lambda r: r["Polarity"])

    # Negative polarity -> Bearish, positive polarity -> Bullish. A polarity of
    # exactly 0 becomes "None" when none is True and "Bullish" otherwise.
    dataFrame["pred_sentiment"] = "None" if none else "Bullish"
    dataFrame.loc[dataFrame.HAV4_polarity < 0, 'pred_sentiment'] = "Bearish"
    dataFrame.loc[dataFrame.HAV4_polarity > 0, 'pred_sentiment'] = "Bullish"

    return dataFrame
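
The mapping from a polarity score to a label can be sketched without the dictionary itself. Everything below is illustrative: the mini lexicon, `toy_polarity` and `polarity_to_label` are made up, and the exact polarity formula used by `pysentiment2` is not reproduced here.

```python
POSITIVE = {"gain", "strong"}  # made-up mini lexicon, not the Harvard IV word lists
NEGATIVE = {"loss", "weak"}

def toy_polarity(tokens):
    """Share of positive minus negative words among all matched words."""
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0

def polarity_to_label(polarity: float, allow_none: bool) -> str:
    """Negative -> Bearish, positive -> Bullish, zero -> None or Bullish."""
    if polarity < 0:
        return "Bearish"
    if polarity > 0:
        return "Bullish"
    return "None" if allow_none else "Bullish"

for tokens in (["strong", "gain", "loss"], ["weak"], ["stock", "today"]):
    p = toy_polarity(tokens)
    print(tokens, polarity_to_label(p, allow_none=True), polarity_to_label(p, allow_none=False))
```

Texts without any matched word end up at a polarity of 0, so the choice between the two variants decides whether they count as `None` or as `Bullish`.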

In [None]:
def createDictMatrix(dictList:list,none:bool, h_l:str):
    # dictList is expected to hold [StockTwits dataframe, News Headlines dataframe].
    titles = ["StockTwits", "News Headlines"]
    label_list = ["Bearish","Bullish","None"] if none else ["Bearish","Bullish"]

    for i, dataFrame in enumerate(dictList):
        cm = sklmx.confusion_matrix(dataFrame.sentiment, dataFrame.pred_sentiment, labels = label_list)
        disp = sklmx.ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=label_list)

        disp.plot()
        if i < len(titles):
            plt.title(f"{titles[i]}: {h_l}")

        print("Accuracy: ", sklmx.accuracy_score(dataFrame.sentiment, dataFrame.pred_sentiment))

        plt.show()

In [None]:
ST_none_hav4_dataset = applyHIV4two(clean_ST_dataset, none = True)
NW_none_hav4_dataset = applyHIV4two(clean_NW_dataset, none = True)

In [None]:
createDictMatrix(dictList=[ST_none_hav4_dataset,NW_none_hav4_dataset],none = True,h_l ="HAV-IV")

In [None]:
ST_hav4_dataset = applyHIV4two(clean_ST_dataset, none = False)
NW_hav4_dataset = applyHIV4two(clean_NW_dataset, none = False)

In [None]:
createDictMatrix([ST_hav4_dataset,NW_hav4_dataset],none = False, h_l="HAV-IV")

#### (Bag-of-Words) Loughran and McDonald

Function that derives the sentiment scores based on the Loughran and McDonald dictionary.

In [None]:
def applyLogMctwo(dataFrame:pd.DataFrame, none:bool) -> pd.DataFrame:
    tqdm.pandas()
    logMc = ps.LM()

    # Work on a copy so the cleaned input dataset is not modified in place.
    dataFrame = dataFrame.copy()

    itokenized = dataFrame.body.progress_apply(logMc.tokenize)
    scores = itokenized.progress_apply(logMc.get_score)
    dataFrame["LogMc_polarity"] = scores.apply(lambda r: r["Polarity"])

    # Negative polarity -> Bearish, positive polarity -> Bullish. A polarity of
    # exactly 0 becomes "None" when none is True and "Bullish" otherwise.
    dataFrame["pred_sentiment"] = "None" if none else "Bullish"
    dataFrame.loc[dataFrame.LogMc_polarity < 0, 'pred_sentiment'] = "Bearish"
    dataFrame.loc[dataFrame.LogMc_polarity > 0, 'pred_sentiment'] = "Bullish"

    return dataFrame

Here we assign the polarity scores to the dataframe and derive the sentiment labels based on the polarity scores. In the following cell we include a possible classification of `None`.

In [None]:
ST_none_logMc_dataset = applyLogMctwo(clean_ST_dataset, True)
NW_none_logMc_dataset = applyLogMctwo(clean_NW_dataset, True)

The following cell plots the confusion matrices for the classification that includes `None`. In the cells after that, we assign the polarity scores and derive the sentiment labels again, this time **without** a possible classification of `None`.

In [None]:
createDictMatrix([ST_none_logMc_dataset,NW_none_logMc_dataset],True, "Lo&Mc")

In [None]:
ST_logMc_dataset = applyLogMctwo(clean_ST_dataset, False)
NW_logMc_dataset = applyLogMctwo(clean_NW_dataset, False)

In [None]:
createDictMatrix([ST_logMc_dataset,NW_logMc_dataset], False, "Lo&Mc")

### Machine Learning Models

Here we use the initially cleaned and formatted dataframes to create a `training` as well as a `test` dataset.

In [None]:
ST_x_train, ST_x_test, ST_y_train, ST_y_test = train_test_split(clean_ST_dataset.body,clean_ST_dataset.sentiment, test_size=0.33, random_state=42)
NW_x_train, NW_x_test, NW_y_train, NW_y_test = train_test_split(clean_NW_dataset.body, clean_NW_dataset.sentiment, test_size=0.33, random_state=42)
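
To make the effect of `test_size=0.33` and `random_state=42` concrete, here is a small sketch on made-up data. With 10 samples, the test share is rounded up to 4 samples, and repeating the call with the same seed gives the same partition.

```python
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the body and sentiment columns.
texts = [f"text {i}" for i in range(10)]
labels = ["Bullish", "Bearish"] * 5

x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42
)

# 0.33 * 10 = 3.3 is rounded up, leaving 6 training and 4 test samples.
print(len(x_train), len(x_test))

# Same seed, same split: the partition is reproducible.
again = train_test_split(texts, labels, test_size=0.33, random_state=42)
print(again[1] == x_test)
```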

#### (Machine Learning) Bayesian Classifiers 

In [None]:

NB_text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('NB_clf', MultinomialNB())])
NB_tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'NB_clf__alpha': [1, 1e-1, 1e-2]
}
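
The keys of the parameter grid follow the `<pipeline step>__<parameter>` naming convention, and the grid search tries every combination of the listed values. As a rough sketch of what that means in terms of work, the number of candidates can be counted with `ParameterGrid`; the variable names `grid`, `n_candidates` and `cv_folds` are only used for this illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same values as in NB_tuned_parameters above.
grid = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'NB_clf__alpha': [1, 1e-1, 1e-2],
}

n_candidates = len(ParameterGrid(grid))  # 3 * 2 * 2 * 3 combinations
cv_folds = 2  # the value passed as cv=2 in the training functions

# Every candidate is fitted once per fold; with the default refit=True,
# the best candidate is then fitted once more on the full training set.
print("candidates:", n_candidates)
print("fits:", n_candidates * cv_folds + 1)
```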

##### Training Stocktwits Dataset

In [None]:
def NB_training(x_train, x_test, y_train, y_test):

    score = "f1_macro"
    print("# Training model for %s" % score)
    print()
    
    NB_clf = GridSearchCV(NB_text_clf, NB_tuned_parameters, cv=2, scoring=score)
    # np.errstate only takes effect when used as a context manager.
    with np.errstate(divide='ignore'):
        NB_clf.fit(x_train, y_train)

    print("Best set of parameters were found on the following set:")
    print()
    print(NB_clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    for mean, std, params in zip(NB_clf.cv_results_['mean_test_score'], 
                                NB_clf.cv_results_['std_test_score'], 
                                NB_clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")

    print()
    print(classification_report(y_test, NB_clf.predict(x_test), digits=4))
    print()

    return NB_clf

In [None]:
ST_NB_model = NB_training(ST_x_train,ST_x_test,ST_y_train,ST_y_test)

In [None]:
NW_NB_model = NB_training(NW_x_train,NW_x_test,NW_y_train,NW_y_test)

In [None]:
def createConfusionMatrix(model,x_test,y_test, name:str):
    predict = model.predict(x_test)
    cm = sklmx.confusion_matrix(y_test, predict, labels=model.classes_)
    disp = sklmx.ConfusionMatrixDisplay(confusion_matrix=cm,
                                display_labels=model.classes_)
    disp.plot()
    plt.title(name)
    plt.show()

    # model.score would return the grid search's f1_macro score, so accuracy is computed explicitly.
    print("Model Accuracy:", sklmx.accuracy_score(y_test, predict))
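
The grid searches are scored with `f1_macro`, while the line above reports a value labelled as accuracy. The two metrics can differ noticeably, which the following sketch shows on made-up labels. The helper functions `accuracy` and `macro_f1` are written out by hand for illustration only.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    pairs = Counter(zip(y_true, y_pred))
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = pairs[(label, label)]
        fp = sum(c for (t, p), c in pairs.items() if p == label and t != label)
        fn = sum(c for (t, p), c in pairs.items() if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels: 4 Bullish and 2 Bearish, with one Bearish item misclassified.
y_true = ["Bullish", "Bullish", "Bullish", "Bullish", "Bearish", "Bearish"]
y_pred = ["Bullish", "Bullish", "Bullish", "Bullish", "Bullish", "Bearish"]

print(round(accuracy(y_true, y_pred), 4))  # 0.8333
print(round(macro_f1(y_true, y_pred), 4))  # 0.7778
```

Macro F1 weights both classes equally, so a mistake in the smaller class lowers it more than it lowers accuracy.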

In [None]:
createConfusionMatrix(ST_NB_model,ST_x_test, ST_y_test, "StockTwits - Naive Bayes")
createConfusionMatrix(NW_NB_model,NW_x_test, NW_y_test, "News Headlines - Naive Bayes")

#### (Machine Learning) Support Vector Machines

In [None]:
SVM_text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('SVM_clf', SVC( cache_size=2000, max_iter= 5000))])
SVM_tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    #'clf__alpha': [1, 1e-1, 1e-2],
    #'SVM_clf__verbose': [2],
    #'SVM_clf__cache_size': [2000]
}

In [None]:
def SVM_training(x_train, x_test, y_train, y_test):
    score = 'f1_macro'
    print("# Tuning hyper-parameters for %s" % score)
    print()
    SVM_clf = GridSearchCV(SVM_text_clf, SVM_tuned_parameters, cv=2, scoring=score)
    # np.errstate only takes effect when used as a context manager.
    with np.errstate(divide='ignore'):
        SVM_clf.fit(x_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(SVM_clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    for mean, std, params in zip(SVM_clf.cv_results_['mean_test_score'], 
                                SVM_clf.cv_results_['std_test_score'], 
                                SVM_clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")

    print()
    print(classification_report(y_test, SVM_clf.predict(x_test), digits=4))
    print()

    return SVM_clf

In [None]:
ST_SVM_model = SVM_training(ST_x_train,ST_x_test,ST_y_train,ST_y_test)

In [None]:
NW_SVM_model = SVM_training(NW_x_train,NW_x_test,NW_y_train,NW_y_test)

In [None]:
createConfusionMatrix(ST_SVM_model, ST_x_test, ST_y_test, "StockTwits - Support Vector Machine")
createConfusionMatrix(NW_SVM_model, NW_x_test, NW_y_test, "News Headlines - Support Vector Machine")

#### (Machine Learning) Maximum Entropy / Logistic Regression     

In [None]:
LR_text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('LR_clf', LogisticRegression(random_state=0))])

LR_tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    #'clf__alpha': [1, 1e-1, 1e-2],
}

In [None]:
def LR_training(x_train, x_test, y_train, y_test):
    score = 'f1_macro'
    print("# Tuning hyper-parameters for %s" % score)
    print()
    LR_clf = GridSearchCV(LR_text_clf, LR_tuned_parameters, cv=2, scoring=score)
    # np.errstate only takes effect when used as a context manager.
    with np.errstate(divide='ignore'):
        LR_clf.fit(x_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(LR_clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    for mean, std, params in zip(LR_clf.cv_results_['mean_test_score'], 
                                LR_clf.cv_results_['std_test_score'], 
                                LR_clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")


    print()
    print(classification_report(y_test, LR_clf.predict(x_test), digits=4))
    print()
    return LR_clf

In [None]:
ST_LR_model = LR_training(ST_x_train,ST_x_test,ST_y_train,ST_y_test)

In [None]:
NW_LR_model = LR_training(NW_x_train,NW_x_test,NW_y_train,NW_y_test)

In [None]:
createConfusionMatrix(ST_LR_model, ST_x_test, ST_y_test, "StockTwits - Logistic Regression")
createConfusionMatrix(NW_LR_model, NW_x_test, NW_y_test, "News Headlines - Logistic Regression")

#### (Machine Learning) Perceptron 

In [None]:
MP_text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('MP_clf', Perceptron(tol=1e-3, random_state=0, n_jobs = 10))])

MP_tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'MP_clf__alpha': [1, 1e-1, 1e-2]
}

In [None]:
def MP_training(x_train, x_test, y_train, y_test):
    score = 'f1_macro'
    print("# Tuning hyper-parameters for %s" % score)
    print()
    MP_clf = GridSearchCV(MP_text_clf, MP_tuned_parameters, cv=2, scoring=score)
    # np.errstate only takes effect when used as a context manager.
    with np.errstate(divide='ignore'):
        MP_clf.fit(x_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(MP_clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    for mean, std, params in zip(MP_clf.cv_results_['mean_test_score'], 
                                MP_clf.cv_results_['std_test_score'], 
                                MP_clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")

    print()
    print(classification_report(y_test, MP_clf.predict(x_test), digits=4))
    print()

    return MP_clf

In [None]:
ST_MP_model = MP_training(ST_x_train,ST_x_test,ST_y_train,ST_y_test)

In [None]:
NW_MP_model = MP_training(NW_x_train,NW_x_test,NW_y_train,NW_y_test)

In [None]:
createConfusionMatrix(ST_MP_model, ST_x_test, ST_y_test, "StockTwits - Perceptron")
createConfusionMatrix(NW_MP_model, NW_x_test, NW_y_test, "News Headlines - Perceptron")

#### (Machine Learning) Neural Network

In [None]:
NN_text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MLPClassifier(solver='lbfgs', alpha=1e-5,
                        hidden_layer_sizes=(5, 2)))])

NN_tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2],
}

In [None]:
def NN_training(x_train, x_test, y_train, y_test):
    score = 'f1_macro'
    print("# Tuning hyper-parameters for %s" % score)
    print()
    NN_clf = GridSearchCV(NN_text_clf, NN_tuned_parameters, cv=2, scoring=score)
    # np.errstate only takes effect when used as a context manager.
    with np.errstate(divide='ignore'):
        NN_clf.fit(x_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(NN_clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    for mean, std, params in zip(NN_clf.cv_results_['mean_test_score'], 
                                NN_clf.cv_results_['std_test_score'], 
                                NN_clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")



    print()
    print(classification_report(y_test, NN_clf.predict(x_test), digits=4))
    print()
    return NN_clf

In [None]:
ST_NN_model = NN_training(ST_x_train,ST_x_test,ST_y_train,ST_y_test)

In [None]:
NW_NN_model = NN_training(NW_x_train,NW_x_test,NW_y_train,NW_y_test)

In [None]:
createConfusionMatrix(ST_NN_model, ST_x_test, ST_y_test, "StockTwits - Neural Network")
createConfusionMatrix(NW_NN_model, NW_x_test, NW_y_test, "News Headlines - Neural Network")