# My project part 1

## Predict stock prices - Tesla!

### Approach: 
    I want to look for breakout patterns in stocks in conjunction with sentiment analysis.
    For breakout patterns I will use one approach with ML and one without. 
    For the tweets I will use natural language processing and use the clothing/book example as inspiration. 
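
A minimal sketch of what the non-ML breakout approach could look like, assuming a breakout means "the close rises above the highest close of the preceding N days". The rule itself, the function name `find_breakouts`, and the window sizes are assumptions for illustration, not a decided method.

```python
def find_breakouts(closes, window=20):
    """Return the indices where the close exceeds the highest close
    of the previous `window` days (a simple channel breakout rule)."""
    breakouts = []
    for i in range(window, len(closes)):
        prior_high = max(closes[i - window:i])
        if closes[i] > prior_high:
            breakouts.append(i)
    return breakouts


# Hypothetical closing prices: a flat range followed by a move above it
closes = [10, 11, 10, 11, 10, 12, 13, 12]
print(find_breakouts(closes, window=3))  # [5, 6]
```

The same function could be applied to the `Close` column of the downloaded price history, converted to a list.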

### Data: 
    - Use Yahoo Finance to get historical prices, or download the price history from kaggle.com
    - Get stock tweets from www.kaggle.com
        - One sentiment-labeled dataset (general tweets) and one unlabeled dataset (Tesla tweets). The unlabeled dataset will be filtered to contain only tweets concerning Tesla.
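
One way the two datasets could be combined is to train a classifier on the labeled tweets and use it to label the Tesla tweets. The sketch below is a tiny multinomial naive Bayes written with the standard library only. The choice of model, the function names `train_naive_bayes` and `predict`, and the example tweets are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Count words per label for a multinomial naive Bayes model."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter(labels)      # label -> number of tweets
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def predict(text, model):
    """Return the label with the highest log-probability for `text`."""
    word_counts, label_counts, vocabulary = model
    total_tweets = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_tweets in label_counts.items():
        score = math.log(n_tweets / total_tweets)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids log(0) for unseen pairs
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical stand-ins for the labeled and unlabeled tweets
labeled_texts = ["great earnings strong buy", "terrible recall bad news",
                 "strong demand great quarter", "bad crash terrible loss"]
labeled_sentiment = ["positive", "negative", "positive", "negative"]
model = train_naive_bayes(labeled_texts, labeled_sentiment)

for tweet in ["Tesla great quarter", "Tesla recall bad"]:
    print(tweet, "->", predict(tweet, model))
```

A library model would likely replace this in practice; the point here is only the train-on-labeled, predict-on-unlabeled flow.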
    
### Goal:
    - I want to see whether there is a connection between sentiment/buzz and the stock price, and whether it has any effect on breakout patterns.
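
A first check for such a connection could be the correlation between each day's average sentiment and the next day's return. The sketch below assumes sentiment is scored as +1/0/-1 per tweet. The scoring, the helper names `daily_mean` and `pearson`, and all values are made up for illustration.

```python
import math
from collections import defaultdict


def daily_mean(scored_tweets):
    """Average the sentiment scores of all tweets posted on the same date."""
    by_date = defaultdict(list)
    for date, score in scored_tweets:
        by_date[date].append(score)
    return {date: sum(s) / len(s) for date, s in by_date.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: (date, score) with +1 positive, 0 neutral, -1 negative
scored_tweets = [("2022-01-03", 1), ("2022-01-03", 1), ("2022-01-04", -1),
                 ("2022-01-05", 0), ("2022-01-05", 1), ("2022-01-06", -1)]
# Hypothetical closing prices for the same period
closes = {"2022-01-03": 100.0, "2022-01-04": 104.0,
          "2022-01-05": 101.0, "2022-01-06": 103.0, "2022-01-07": 100.0}

sentiment = daily_mean(scored_tweets)
dates = sorted(closes)
# Pair each day's sentiment with the return realised on the following day
pairs = [(sentiment[d], closes[nxt] / closes[d] - 1)
         for d, nxt in zip(dates, dates[1:]) if d in sentiment]
correlation = pearson([s for s, _ in pairs], [r for _, r in pairs])
print(f"correlation between sentiment and next-day return: {correlation:.2f}")
```

The toy numbers are chosen so that sentiment and next-day return move together; real data may show little or no relationship.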

### Disclaimer: 
    My initial thought was to scrape X (Twitter) for sentiment changes, but because that data was difficult to extract I decided to go with a combination of existing datasets.
    
    The code below is just an initial cleaning of the datasets, so that the data in the CSV files is easier to visualize.

    The sentiment datasets don't have as much history as the stock prices, so it might be difficult to see any valid patterns for the stock I have investigated. I haven't yet decided on the periods I want to examine. The 5 years I have extracted below serve as a proof of concept.

    



### Retrieve historical stock prices

In [None]:
# Download price history from Yahoo Finance (via the yfinance package)
import yfinance as yf
import pandas as pd

ticker = 'TSLA'

# Fetch the last 5 years of prices for a single ticker
df_prices = yf.Ticker(ticker).history(period="5y")
df_prices.to_csv(f'./data/prices/{ticker}_prices.csv')



### Initial cleaning of the unlabeled tweets

In [44]:
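# Make the unlabeled dataset ready for sentiment analysis
import re 

def clean_tweet(tweet):
    # Keep only the characters in the class; emojis, newlines etc. are dropped.
    # Note: `+-:` inside the class is a range from '+' to ':', so ',', '.' and '/' are kept too.
    return ''.join(re.findall(r'[a-zA-Z0-9 ;+-:"]', tweet))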

df_tweets = pd.read_csv('./data/sentiment/tweets_not_labeled.csv')

# Explicit copy, so the assignments below work on an independent DataFrame
df_nonlabeled_tweets = df_tweets.query(f"StockName == '{ticker}'").copy()

# Clean every tweet without an explicit row loop
df_nonlabeled_tweets['Tweet'] = df_nonlabeled_tweets['Tweet'].map(clean_tweet)

df_nonlabeled_tweets = df_nonlabeled_tweets.drop("CompanyName", axis='columns')
df_nonlabeled_tweets = df_nonlabeled_tweets.rename(columns={'StockName': 'Ticker'})
df_nonlabeled_tweets.to_csv(f'./data/sentiment/{ticker}_nonlabeled.csv')
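
The character class used in `clean_tweet` contains `+-:`, which the `re` module reads as a range from `+` to `:` and not as three separate characters. The sketch below compares it with a variant where the hyphen is escaped. The helper `keep_matching` and the sample string are made up for illustration, and which behavior is wanted depends on whether punctuation such as `.` should survive the cleaning.

```python
import re

RANGE_PATTERN = r'[a-zA-Z0-9 ;+-:"]'     # pattern as written in clean_tweet
LITERAL_PATTERN = r'[a-zA-Z0-9 ;+\-:"]'  # hyphen escaped: only + - : are kept


def keep_matching(pattern, text):
    """Keep only the characters of `text` that match `pattern`."""
    return ''.join(re.findall(pattern, text))


sample = 'up 4.5%, see a/b'
print(keep_matching(RANGE_PATTERN, sample))    # up 4.5, see a/b
print(keep_matching(LITERAL_PATTERN, sample))  # up 45 see ab
```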

### Clean the labeled tweets so they can be read into a pandas DataFrame

In [27]:

import re
import pandas as pd
from dateutil.parser import parse

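def validate_datetime(part):
    """Return True if `part` can be parsed as a date/time."""
    try:
        parse(part)
        return True
    except Exception:
        # Unlike a bare except, this does not swallow KeyboardInterrupt/SystemExit
        return False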
        

with open('./data/sentiment/tweets_labeled.csv', 'r', encoding="utf8") as file_to_be_parsed:
    file_content = file_to_be_parsed.read()

values_to_be_removed = ['&amp;','&gt;',',','&lt;']

# Remove values that cause issues with pandas read_csv; I don't want to drop bad lines.
for value in values_to_be_removed:
    file_content = file_content.replace(value,'')

# Remove emojis, newlines, and other characters
file_content = ''.join(re.findall(r'[a-zA-Z0-9 ;+-:"]', file_content))

parsed_file = []
parts = file_content.split(';')

for part in parts:
    if part.isnumeric() or validate_datetime(part):
        # Tweet ids and timestamps are kept as they are
        parsed_file.append(part)
    else:
        # Tweet text: drop the separator char if present. After the split on ';'
        # above no part can still contain one, so this only acts as a safeguard.
        parsed_file.append(part.replace(';', ''))

# Values after which a row should break
row_breaks = ['sentiment','positive','neutral','negative']

# Row breaks need to be put back in between the sentiment and the tweet id
parsed_file_content = ','.join(parsed_file).split(',')
parsed_file_linebreaks = []
current_sentiment = ''
for part in parsed_file_content:
    # Look for the first row-break value contained in this part
    found = False
    for row_break in row_breaks:
        if row_break in part:
            current_sentiment = row_break
            found = True
            break

    # A row boundary looks like '<sentiment><tweet id>': whatever follows the
    # first len(current_sentiment) characters must consist of digits only
    part_to_investigate = part[len(current_sentiment):] if found else ''
    if part_to_investigate.isnumeric():
        parsed_file_linebreaks.append(current_sentiment + '\n' + part_to_investigate)
    else:
        parsed_file_linebreaks.append(part)

with open('./data/sentiment/tweets_labeled_parsed.csv', 'w', encoding="utf8") as file_parsed:
    # Join with commas so the last row doesn't end with a dangling separator
    file_parsed.write(','.join(parsed_file_linebreaks))
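
To make the row-break step easier to follow, here is a condensed sketch on a toy input. It assumes the header is `id;createdat;text;sentiment`, which is inferred from the column names used in this notebook, and that removing the newlines has fused each sentiment label with the next tweet id. The function `restore_row_breaks` and the sample rows are made up for illustration, and it uses `startswith` where the cell above searches for the label anywhere in the part.

```python
ROW_BREAKS = ['sentiment', 'positive', 'neutral', 'negative']


def restore_row_breaks(parts):
    """Insert a newline wherever a label is fused with the next tweet id."""
    restored = []
    for part in parts:
        label = next((rb for rb in ROW_BREAKS if part.startswith(rb)), None)
        remainder = part[len(label):] if label else ''
        if remainder.isnumeric():
            restored.append(label + '\n' + remainder)
        else:
            restored.append(part)
    return restored


# Hypothetical flattened file: header and two rows with the newlines removed
flattened = ('id;createdat;text;sentiment1;2020-04-09;"Tesla up";'
             'positive2;2020-04-10;"meh";neutral')
print(','.join(restore_row_breaks(flattened.split(';'))))
```

The printed result has three lines again: the header and one line per tweet.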



### Read in the cleaned file, make some modifications, and save it under the ticker name

In [47]:

df_labeled_tweets = pd.read_csv('./data/sentiment/tweets_labeled_parsed.csv')

df_labeled_tweets = df_labeled_tweets.rename(columns={'createdat': 'date', 'text': 'tweet'})
df_labeled_tweets = df_labeled_tweets.assign(ticker=ticker)
df_labeled_tweets.to_csv(f'./data/sentiment/{ticker}_labeled.csv')  