# Machine Learning Model

## Model Plan
- Prepare the dataframe with columns: tweet_text, prev_day_close, next_day_close, close_price_diff
- Preprocess the tweet text into features (CountVectorizer, TF-IDF)
- Classification: LogisticRegression

### Query the dataframe with columns: tweet_text, prev_day_close, next_day_close, close_price_diff

In [1]:
# Setting up libraries:

import pandas as pd
from sqlalchemy import create_engine
from config import db_password

In [2]:
# Create engine
engine = create_engine(f"postgresql://postgres:{db_password}@127.0.0.1:5432/twitter_vs_stocks")

In [3]:
# Build the modeling table: join each tweet to the nearest stock close
# within three days before and three days after the tweet date

tweets_price = pd.read_sql_query(
    """
        SELECT 
            tweets.date AS tweet_date,
            tweets.text AS tweet_text,
            tweets.tokenized_text AS tweet_tokens,
            COALESCE(stock_prev.close, stock_prev_prev.close, stock_prev_prev_prev.close) AS prev_day_close,
            COALESCE(stock_next.close, stock_next_next.close, stock_next_next_next.close) AS next_day_close
        FROM tweets_text tweets
        LEFT JOIN stock stock_prev
            ON (tweets.date - INTERVAL '1 day') = stock_prev.date
        LEFT JOIN stock stock_prev_prev
            ON (tweets.date - INTERVAL '2 day') = stock_prev_prev.date
        LEFT JOIN stock stock_prev_prev_prev
            ON (tweets.date - INTERVAL '3 day') = stock_prev_prev_prev.date
        LEFT JOIN stock stock_next
            ON (tweets.date + INTERVAL '1 day') = stock_next.date
        LEFT JOIN stock stock_next_next
            ON (tweets.date + INTERVAL '2 day') = stock_next_next.date
        LEFT JOIN stock stock_next_next_next
            ON (tweets.date + INTERVAL '3 day') = stock_next_next_next.date
        WHERE tweets.date > '2011-01-01' AND tweets.tokenized_text != '{}'
        ORDER BY tweets.date
    """,
    con=engine
)

# Drop rows with missing values, i.e. tweets with no stock close within three days on either side
tweets_price.dropna(inplace=True)

# Change in close price from the trading day before the tweet to the trading day after it
tweets_price['close_price_diff'] = tweets_price['next_day_close'] - tweets_price['prev_day_close']
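
The three-way `COALESCE` in the query falls back to the close from two or three days away when the adjacent day has no `stock` row (for example over a weekend). The same lookup can be approximated in pandas with `pd.merge_asof`. This is only a sketch under assumptions: the toy frames, their values, and the helper name `nearest_close` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy data (assumed for illustration): tweets on a Saturday, a Sunday, and a
# day with no stock rows nearby; the stock table has trading days only.
tweets = pd.DataFrame(
    {"date": pd.to_datetime(["2011-12-03", "2011-12-04", "2011-12-10"])}
)
stock = pd.DataFrame(
    {
        "date": pd.to_datetime(["2011-11-30", "2011-12-02", "2011-12-05"]),
        "close": [6.548, 6.66, 6.884],
    }
)


def nearest_close(left, stock, direction, name):
    """Attach the closest close strictly before/after each date, within 3 days."""
    right = stock.sort_values("date").rename(columns={"close": name})
    return pd.merge_asof(
        left.sort_values("date"),
        right,
        on="date",
        direction=direction,        # "backward" = earlier row, "forward" = later row
        allow_exact_matches=False,  # never use the close from the tweet's own day
        tolerance=pd.Timedelta(days=3),
    )


joined = nearest_close(tweets, stock, "backward", "prev_day_close")
joined = nearest_close(joined, stock, "forward", "next_day_close")
joined["close_price_diff"] = joined["next_day_close"] - joined["prev_day_close"]
print(joined)
```

The last toy tweet has no stock row within three days, so both close columns come back as `NaN`, which is the kind of row the `dropna` call removes.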

In [4]:
# Keep tweets with at least three tokens (two or more commas in the token string)
tweets_price = tweets_price[tweets_price.tweet_tokens.str.count(',') > 1]
tweets_price.head(5)

Unnamed: 0,tweet_date,tweet_text,tweet_tokens,prev_day_close,next_day_close,close_price_diff
0,2011-12-01,{I made the volume on the Model S http://t.co...,"{made,volume,model,go,need,work,miniature,ston...",6.548,6.66,0.112
1,2011-12-03,"{That was a total non sequitur btw, Great Volt...","{total,non,sequitur,great,voltaire,quote,argua...",6.66,6.884,0.224
2,2011-12-04,{Am reading a great biography of Ben Franklin ...,"{reading,great,biography,ben,franklin,isaacson...",6.66,6.884,0.224
3,2011-12-21,{Yum! Even better than deep fried butter: htt...,"{yum,even,better,deep,fried,butter,yeah,really...",5.58,5.554,-0.026
4,2011-12-22,{Model S options are out! Performance in red a...,"{model,options,performance,red,black,deliver,c...",5.514,5.58,0.066
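
The `{...}` values in `tweet_tokens` look like PostgreSQL array literals that reached pandas as plain strings, which is why the filter counts commas rather than list elements. A minimal sketch of turning such a string into a Python list follows, assuming the tokens themselves contain no commas, quotes, or braces. `parse_tokens` is a hypothetical helper, not something the notebook defines.

```python
def parse_tokens(literal):
    """Split a '{a,b,c}' style string into ['a', 'b', 'c'] (hypothetical helper)."""
    inner = literal.strip().strip("{}")
    return inner.split(",") if inner else []


sample = "{made,volume,model,go,need,work}"
tokens = parse_tokens(sample)
print(tokens)

# The comma-count filter and a token-count filter agree on this sample:
print((sample.count(",") > 1) == (len(tokens) > 2))
```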


## A - Classification: Which tweets increase the stock price vs. decrease it


In [5]:
# Setting up libraries for model

# CountVectorizer: counts how many times each token appears in each tweet
# TfidfTransformer: reweights those counts so tokens that appear in many tweets count for less

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

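# The identity preprocessor and tokenizer pass each document through unchanged,
# so every document should already be a list of tokens. A raw string such as
# "{made,volume}" would be iterated character by character instead.
text_clf = Pipeline([
    ('vect', CountVectorizer(preprocessor=lambda x: x, tokenizer=lambda x: x)),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(C=0.01, random_state=1)),
])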
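
With an identity `preprocessor` and `tokenizer`, `CountVectorizer` iterates over whatever each document is. The sketch below uses two made-up documents to contrast a raw string, which is iterated one character at a time, with a token list, which is iterated one token at a time. The helper names are hypothetical, and `token_pattern=None` is added here only to avoid the unused-parameter warning; it is not in the notebook's pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer


def identity(doc):
    """Return the document unchanged."""
    return doc


def fit_vocabulary(docs):
    """Fit an identity-tokenizer CountVectorizer and return its sorted vocabulary."""
    vect = CountVectorizer(
        preprocessor=identity, tokenizer=identity, token_pattern=None
    )
    vect.fit(docs)
    return sorted(vect.vocabulary_)


as_strings = ["{made,volume}", "{model,options}"]      # made-up documents
as_lists = [["made", "volume"], ["model", "options"]]  # same tokens, as lists

print(fit_vocabulary(as_strings))  # single characters, including '{', ',' and '}'
print(fit_vocabulary(as_lists))    # whole tokens
```

If `tweet_tokens` is still a string when it reaches the pipeline, parsing it into a list first would give word-level rather than character-level features.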

In [6]:
train_df, test_df = train_test_split(tweets_price, random_state=1)

In [7]:
# Features: the token column. Label: 1 if the close rose across the tweet date, else 0
X_train = train_df.tweet_tokens.tolist()
y_train = (train_df['close_price_diff'] > 0).astype(int).values
X_test = test_df.tweet_tokens.tolist()
y_test = (test_df['close_price_diff'] > 0).astype(int).values
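
As a quick check of the labeling rule, the sketch below applies the same `> 0` comparison to the five `close_price_diff` values shown in the `head(5)` output earlier. Note that a difference of exactly zero would be labeled 0, together with the decreases.

```python
import pandas as pd

# close_price_diff values copied from the five rows of the head(5) output
diffs = pd.Series([0.112, 0.224, 0.224, -0.026, 0.066], name="close_price_diff")

# Same rule as y_train / y_test: 1 for an increase, 0 otherwise
labels = (diffs > 0).astype(int).values
print(labels)  # → [1 1 1 0 1]
```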

In [8]:
# Fit the pipeline on the training tweets
text_clf.fit(X_train, y_train)

Pipeline(steps=[('vect',
                 CountVectorizer(preprocessor=<function <lambda> at 0x11f9ceca0>,
                                 tokenizer=<function <lambda> at 0x122983e50>)),
                ('tfidf', TfidfTransformer()),
                ('clf', LogisticRegression(C=0.01, random_state=1))])

In [9]:
# Mean accuracy on the held-out test set
text_clf.score(X_test, y_test)

0.5454545454545454
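
An accuracy of about 0.545 is easier to judge next to the share of the majority class, because a model that always predicts the more common label already scores that much. A minimal sketch on made-up labels follows; `majority_baseline` is a hypothetical helper and `toy_labels` is not the notebook's `y_test`.

```python
import numpy as np


def majority_baseline(y):
    """Accuracy of always predicting the most frequent class (binary 0/1 labels)."""
    positive_share = float(np.mean(y))
    return max(positive_share, 1.0 - positive_share)


toy_labels = np.array([1, 1, 0, 1, 0, 1, 1, 0])  # made-up labels
print(majority_baseline(toy_labels))  # → 0.625
```

In the notebook, the comparable figure would come from `majority_baseline(y_test)`.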

In [10]:
# Predicted probability of class 1 (price increase) for each test tweet
predicted_proba_test = text_clf.predict_proba(X_test)[:, 1]
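
`predict_proba` returns one column per class, in the order given by `classes_`, so `[:, 1]` is the probability of a price increase only because the labels are 0 and 1. The sketch below shows that lookup on synthetic data that is assumed purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic one-feature data (assumed for illustration)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression(C=0.01, random_state=1).fit(X, y)
proba = clf.predict_proba(X)

# Find the column that holds P(label == 1) instead of hard-coding it
positive_col = list(clf.classes_).index(1)
print(clf.classes_)
print(proba[:, positive_col])
```

A small `C` means strong regularization, which shrinks the coefficients toward zero. That may be one reason predicted probabilities end up close together.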

In [11]:
# Collect the test-set results, sorted by predicted probability of a price increase
results_test = pd.DataFrame({
    'proba_positive_tweet': predicted_proba_test,
    'tweet_text': test_df['tweet_text'],
    'tweet_date': test_df['tweet_date'],
    'stock_price_change': test_df['close_price_diff'],
}).sort_values('proba_positive_tweet', ascending=False)
pd.set_option('display.max_colwidth', None)
results_test.head(10)

Unnamed: 0,proba_positive_tweet,tweet_text,tweet_date,stock_price_change
116,0.566446,{First test flight hop of our Grasshopper VTVL rocket! http://t.co/oomI5vSB},2012-09-22,0.128
1541,0.566264,"{Thank you, South Texas for your support! This is the gateway to Mars., Life, the Universe and Everything https://t.co/1ZCzInfc4u}",2020-12-10,5.51001
218,0.566229,"{Just want to say thanks to customers &amp; investors that took a chance on Tesla through the long, dark night. We wouldn't be here without you., @westcoastbill Thanks Bill!}",2013-05-08,2.778
1123,0.566174,"{Great meme review hosted by Will Smith, Highest reentry heating to date. Burning metal sparks from base heat shield visible in landing video. Fourth relight scheduled for April.}",2019-02-22,1.508003
1593,0.566141,"{From thence to Mars,\nAnd hence the Stars., Creating the city of Starbase, Texas, Horses are even self-driving! https://t.co/qPJrCFGs8J, Scammers &amp; crypto should get a room}",2021-03-02,-65.22998
1100,0.566067,"{Awesome moose sculpture! ♥️🇳🇴 https://t.co/CegEEHL4wz, Testing metallic heat shield at 1100C (2000F) @SpaceX https://t.co/frP5eZ5a0z, If test flight of 🐉 goes well next month, @NASA 👨‍🚀 👩‍🚀 will 🚀 to @Space_Station this summer!}",2019-01-25,0.974003
764,0.566025,{US govt testing by @NHTSAgov finds Model X to be the safest SUV in history by significant margin https://t.co/zAdb5FQPEI},2017-06-13,4.330002
1231,0.566024,"{Great progress by Starship Cape team. Started several months behind, but catching up fast. This will be a super fun race to orbit, moon &amp; Mars!}",2019-08-06,1.019997
434,0.565899,{We just got banned in West Virginia. Oh no. http://t.co/gNztPDNtVT},2015-04-04,2.419998
1012,0.56586,"{Tesla piece on the physics of car safety coming soon for those interested in technical details, .@NHTSAgov will post final safety probability stats soon. Model 3 has a shot at being safest car ever tested.}",2018-09-20,0.015999


In [12]:
results_test.to_csv('Resources/Data/results_test.csv')
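
By default `to_csv` also writes the DataFrame index, under an empty header, and that column is named `Unnamed: 0` when the file is read back. The sketch below shows the round trip in memory and one way to give the index a name with `index_label`. The frame contents and the label `test_row` are made up for illustration.

```python
import io

import pandas as pd

# Made-up frame; the index values mimic row labels kept from the test split
df = pd.DataFrame({"proba_positive_tweet": [0.57, 0.56]}, index=[116, 1541])

unnamed = io.StringIO()
df.to_csv(unnamed)                        # index written with an empty header

named = io.StringIO()
df.to_csv(named, index_label="test_row")  # index written under a chosen name

cols_unnamed = pd.read_csv(io.StringIO(unnamed.getvalue())).columns.tolist()
cols_named = pd.read_csv(io.StringIO(named.getvalue())).columns.tolist()
print(cols_unnamed)  # → ['Unnamed: 0', 'proba_positive_tweet']
print(cols_named)    # → ['test_row', 'proba_positive_tweet']
```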