## Stock Prediction with Sentiment Analysis 

### Data:

This dataset was taken from https://www.kaggle.com/aaron7sun/stocknews/data. It was created specifically for predicting the stock market from news headlines spanning more than 8 years. The author of the dataset recommends training on the data from before 1/1/2015 and testing on the data after that date.

In [8]:
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [2]:
data = pd.read_csv('Combined_News_DJIA.csv')
train = data[data['Date'] < '2015-01-01']
test = data[data['Date'] > '2014-12-31']
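
The split above compares date strings, which works as long as every value is in `YYYY-MM-DD` form. As a minimal sketch of the same split on parsed dates, assuming a made-up toy frame (`toy`, `cutoff`, `toy_train` and `toy_test` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy frame standing in for Combined_News_DJIA.csv (values are made up).
toy = pd.DataFrame({
    "Date": ["2014-12-30", "2014-12-31", "2015-01-02", "2015-01-05"],
    "Label": [1, 0, 1, 0],
})

# Parse the strings once so that comparisons are real date comparisons.
toy["Date"] = pd.to_datetime(toy["Date"])
cutoff = pd.Timestamp("2015-01-01")

toy_train = toy[toy["Date"] < cutoff]    # everything before 1/1/2015
toy_test = toy[toy["Date"] >= cutoff]    # everything from 1/1/2015 onward

print(len(toy_train), len(toy_test))
```

Parsing first makes the split independent of how the dates happen to be formatted in the CSV.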

### Text Preprocessing:

The text in the dataset consists of sentences, and these sentences must be broken down into tokens. Tokenisation is the process of breaking a sentence into a list of words, much like calling .split() on a string. sklearn's CountVectorizer and TfidfVectorizer each provide a built-in tokenizer for this.

In [4]:
test_1 = CountVectorizer().build_tokenizer()(train.iloc[3,10])
print(test_1)

['The', 'commander', 'of', 'Navy', 'air', 'reconnaissance', 'squadron', 'that', 'provides', 'the', 'President', 'and', 'the', 'defense', 'secretary', 'the', 'airborne', 'ability', 'to', 'command', 'the', 'nation', 'nuclear', 'weapons', 'has', 'been', 'relieved', 'of', 'duty']


In [5]:
test_2 = TfidfVectorizer().build_tokenizer()(train.iloc[3,10])
print(test_2)

['The', 'commander', 'of', 'Navy', 'air', 'reconnaissance', 'squadron', 'that', 'provides', 'the', 'President', 'and', 'the', 'defense', 'secretary', 'the', 'airborne', 'ability', 'to', 'command', 'the', 'nation', 'nuclear', 'weapons', 'has', 'been', 'relieved', 'of', 'duty']
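
The two outputs are identical because both vectorizers share the same default `token_pattern`, the regular expression `(?u)\b\w\w+\b`. The sketch below applies that pattern with the standard `re` module to a made-up sentence (the `headline` string is an assumption for illustration) to show where it differs from a plain `.split()`:

```python
import re

# Default token_pattern used by CountVectorizer and TfidfVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Made-up sentence for illustration, not taken from the dataset.
headline = "The nation's nuclear weapons, a test"

split_tokens = headline.split()                 # whitespace only
regex_tokens = TOKEN_PATTERN.findall(headline)  # words of 2+ characters

print(split_tokens)
print(regex_tokens)
```

Unlike `.split()`, the pattern drops punctuation and single-character tokens, so `nation's` becomes `nation` and the lone `a` disappears. Case is left untouched at this stage.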


Comparing the results of the two methods, we find that the built-in tokenizer does not keep just the important words: it also retains capitalised forms and function words such as conjunctions and articles. To overcome this, we can write our own tokenizer when higher precision is required.

In [6]:
def custom_tokenizer(s):
    # load the stop words, one per line; the file is closed once it has been read
    with open('stopwords.txt') as f:
        stopwords = set(w.rstrip() for w in f)
    s = s.lower() # downcase
    tokens = nltk.tokenize.word_tokenize(s) # split the text into word tokens
    tokens = [t for t in tokens if len(t) > 2] # remove short words that probably won't be of any relevance
    tokens = [t for t in tokens if t not in stopwords] # remove the stop words
    return tokens

In [9]:
test_3 = custom_tokenizer(train.iloc[3,10])
print(test_3)

['commander', 'navy', 'air', 'reconnaissance', 'squadron', 'provides', 'president', 'defense', 'secretary', 'airborne', 'ability', 'command', 'nation', 'nuclear', 'weapons', 'relieved', 'duty']
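
For comparison, `build_tokenizer()` returns only the regex splitting step. The full analyzer that a fitted vectorizer applies, available through `build_analyzer()`, also lowercases by default and can drop sklearn's built-in English stop words. A minimal sketch, assuming a shortened, made-up variant of the headline:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Shortened, made-up variant of the headline used above.
headline = "The commander of the Navy squadron has been relieved of duty"

# Tokenizer only: regex split, original casing kept.
raw_tokens = CountVectorizer().build_tokenizer()(headline)

# Full analyzer: lowercasing plus sklearn's built-in English stop-word list.
analyzed = CountVectorizer(stop_words="english").build_analyzer()(headline)

print(raw_tokens)
print(analyzed)
```

The built-in stop-word list is not the same as the one in `stopwords.txt`, so the result may differ from `custom_tokenizer`. If the custom behaviour is wanted during vectorization, `CountVectorizer` also accepts a callable through its `tokenizer` parameter.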


### Training and testing:

To train and test our models, we will continue using the predefined vectorizer from sklearn. First, the headline columns of each row (column positions 2 to 26) are joined into a single string.

In [10]:
Xtrain = []
for row in range(0,len(train.index)):
    Xtrain.append(' '.join(str(x) for x in train.iloc[row,2:27]))
    
Xtest = []
for row in range(0,len(test.index)):
    Xtest.append(' '.join(str(x) for x in test.iloc[row,2:27]))

In [11]:
cv = CountVectorizer()
basictrain = cv.fit_transform(Xtrain)
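
To make concrete what `fit_transform` produces, here is a small sketch on two made-up documents (`toy_docs`, `toy_cv` and the other `toy_*` names are illustrative assumptions). Each row of the resulting matrix is a document and each column counts one vocabulary word:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents standing in for the joined headlines.
toy_docs = ["stocks rise", "stocks fall fall"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_docs)  # learns the vocabulary, returns a sparse matrix

# Vocabulary maps each word to its column index (alphabetical order).
print(sorted(toy_cv.vocabulary_.items(), key=lambda kv: kv[1]))
print(toy_counts.toarray())

# transform() reuses the fitted vocabulary; unseen words such as "crash" are ignored.
toy_new = toy_cv.transform(["stocks crash"])
print(toy_new.toarray())
```

This is why the vectorizer is fitted on the training text only: new text is then mapped onto the columns learned during training.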

In [12]:
logistic_regression = LogisticRegression()
logistic_regression = logistic_regression.fit(basictrain, train["Label"])

In [13]:
basictest = cv.transform(Xtest)
predictions = logistic_regression.predict(basictest)

In [105]:
pd.crosstab(test["Label"], predictions, rownames=["Actual"], colnames=["Predicted"])

| Actual | Predicted 0 | Predicted 1 |
|---|---|---|
| 0 | 61 | 125 |
| 1 | 92 | 100 |


In [14]:
naive_bayes = MultinomialNB(alpha=0.01)
naive_bayes = naive_bayes.fit(basictrain, train["Label"])

In [16]:
predictions_1 = naive_bayes.predict(basictest)

In [17]:
pd.crosstab(test["Label"], predictions_1, rownames=["Actual"], colnames=["Predicted"])

| Actual | Predicted 0 | Predicted 1 |
|---|---|---|
| 0 | 55 | 131 |
| 1 | 64 | 128 |


In [18]:
random_forest = RandomForestClassifier()
random_forest = random_forest.fit(basictrain, train["Label"])

In [19]:
predictions_2 = random_forest.predict(basictest)

In [20]:
pd.crosstab(test["Label"], predictions_2, rownames=["Actual"], colnames=["Predicted"])

| Actual | Predicted 0 | Predicted 1 |
|---|---|---|
| 0 | 84 | 102 |
| 1 | 94 | 98 |
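
The three crosstab outputs above can be reduced to a single accuracy figure each. The sketch below copies the counts by hand into plain lists (the `matrices` dictionary and the `accuracy` helper are illustrative names, not part of the notebook) and also computes the accuracy of always predicting class 1 as a reference point:

```python
# Confusion matrices copied from the crosstab outputs above:
# rows are the actual labels (0, 1), columns the predicted labels (0, 1).
matrices = {
    "logistic_regression": [[61, 125], [92, 100]],
    "naive_bayes": [[55, 131], [64, 128]],
    "random_forest": [[84, 102], [94, 98]],
}

def accuracy(m):
    """Fraction of correct predictions: diagonal sum over total count."""
    correct = m[0][0] + m[1][1]
    total = sum(sum(row) for row in m)
    return correct / total

accuracies = {name: accuracy(m) for name, m in matrices.items()}

# Reference point: always predict 1, so every actual-1 row is correct.
first = matrices["logistic_regression"]
baseline = sum(first[1]) / sum(sum(row) for row in first)

for name, acc in accuracies.items():
    print(f"{name}: {acc:.3f}")
print(f"always-predict-1 baseline: {baseline:.3f}")
```

On these counts the accuracies are about 0.426, 0.484 and 0.481, against about 0.508 for the always-predict-1 baseline, so none of the three models beats the baseline on this test set. The random forest figures can vary between runs because no random seed is fixed.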
