## Datasets
- Web: https://ieee-dataport.org/open-access/stock-market-tweets-data
- Submitted by: Bruno Taborda
- Twitter is one of the most popular social networks for sentiment analysis. This data set consists of tweets related to the stock market. We collected 943,672 tweets between April 9 and July 16, 2020, using the S&P 500 tag (#SPX500), references to the top 25 companies in the S&P 500 index, and the Bloomberg tag (#stocks). __1,300 of the 943,672 tweets were manually annotated as positive, neutral, or negative.__ A second independent annotator reviewed the manually annotated tweets. This annotated data set can help create new domain-specific lexicons or enrich existing dictionaries. Researchers can train their supervised models on the annotated data set. Additionally, the full data set can be used for text mining and sentiment analysis related to the stock market.

## Embedding with DNN 

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [11]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
import numpy as np
import time
import nltk
from datetime import datetime
import collections
import re

# correlation coefficient
from scipy import stats

# model

In [9]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import wordnet, stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
import tensorflow

tensorflow.__version__

'2.6.0'

## I. Data Loading & Preprocessing

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
!ls -l '/content/drive/MyDrive/Colab Notebooks/datasets/cryptocurrency_sentiment/tweets_labelled_09042020_16072020.csv'

-rw------- 1 root root 959890 Nov  1 02:54 '/content/drive/MyDrive/Colab Notebooks/datasets/cryptocurrency_sentiment/tweets_labelled_09042020_16072020.csv'


In [6]:
DF = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/cryptocurrency_sentiment/tweets_labelled_09042020_16072020.csv', sep= ';')
# DF = pd.read_csv('c:/My_data/health_care/cryptocurrency_sentiment/tweets_labelled_09042020_16072020.csv', sep= ';')
DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          5000 non-null   int64 
 1   created_at  5000 non-null   object
 2   text        5000 non-null   object
 3   sentiment   1300 non-null   object
dtypes: int64(1), object(3)
memory usage: 156.4+ KB


In [None]:
# Check for null values
DF.isnull().sum()


id               0
created_at       0
text             0
sentiment     3700
dtype: int64

In [None]:
# Remove rows with missing values
DF.dropna(inplace = True)
DF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1300 entries, 0 to 1299
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          1300 non-null   int64 
 1   created_at  1300 non-null   object
 2   text        1300 non-null   object
 3   sentiment   1300 non-null   object
dtypes: int64(1), object(3)
memory usage: 50.8+ KB
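The call above drops a row if any column is missing. Since only `sentiment` has gaps here, the result is the same, but restricting the check to the label column makes the intent explicit. The following is a minimal sketch on a made-up toy frame (the values and the names `toy`, `labelled`, and `class_counts` are assumptions for illustration, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the notebook's columns (values are made up for illustration).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "created_at": ["2020-04-15", "2020-06-25", None, "2020-07-03"],
    "text": ["a", "b", "c", "d"],
    "sentiment": ["positive", np.nan, "negative", "positive"],
})

# Drop only rows whose label is missing; a gap in another column does not remove the row.
labelled = toy.dropna(subset=["sentiment"]).reset_index(drop=True)

# Class counts show how balanced the labelled subset is.
class_counts = labelled["sentiment"].value_counts()
print(labelled.shape)
print(class_counts)
```

Checking the class counts at this point is a cheap way to spot imbalance before training a classifier.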


In [13]:
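# Cashtags such as $AMD, plus the futures symbol $ES_F (the underscore needs its own alternative).
# fullmatch() already anchors both ends, so no explicit ^ is needed.
ticker_pattern = re.compile(r'\$(?:[A-Z]+|ES_F)')
# Hashtags such as #stocks
ht_pattern = re.compile(r'#\w+')

ticker_dic = collections.defaultdict(int)
ht_dic = collections.defaultdict(int)

for text in DF['text']:
    for word in text.split():
        # Count tickers without the leading '$' (case-sensitive: uppercase only)
        if ticker_pattern.fullmatch(word) is not None:
            ticker_dic[word[1:]] += 1

        # Count hashtags case-insensitively
        word = word.lower()
        if ht_pattern.fullmatch(word) is not None:
            ht_dic[word] += 1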
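To see what the counting loop above does and does not pick up, here is a small self-contained sketch on made-up tweets. It uses `collections.Counter` instead of `defaultdict(int)`; the helper name `count_tags` and the sample strings are assumptions for illustration only.

```python
import re
from collections import Counter

TICKER_RE = re.compile(r"\$(?:[A-Z]+|ES_F)")  # cashtags like $AMD, plus $ES_F
HASHTAG_RE = re.compile(r"#\w+")              # hashtags like #stocks


def count_tags(texts):
    """Count tickers (case-sensitive) and hashtags (lowercased) in whitespace-split words."""
    tickers, hashtags = Counter(), Counter()
    for text in texts:
        for word in text.split():
            if TICKER_RE.fullmatch(word):
                tickers[word[1:]] += 1        # drop the leading '$'
            elif HASHTAG_RE.fullmatch(word.lower()):
                hashtags[word.lower()] += 1
    return tickers, hashtags


# Made-up sample tweets, not taken from the data set.
sample = [
    "$AMD and $ES_F rally #Stocks",
    "Buying $AMD today #stocks #SPX500",
    "Price is $100, not a ticker",
]
tickers, hashtags = count_tags(sample)
print(tickers.most_common())
print(hashtags.most_common())
```

Because each word must match in full, a token with trailing punctuation (for example `$AMD,`) is not counted. Whether to strip punctuation first is a design choice the notebook leaves open.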

In [14]:
DF.head()

| | id | created_at | text | sentiment |
|---|---|---|---|---|
| 0 | 77522 | 2020-04-15 01:03:46+00:00 | RT @RobertBeadles: Yo💥\nEnter to WIN 1,000 Mon... | positive |
| 1 | 661634 | 2020-06-25 06:20:06+00:00 | #SriLanka surcharge on fuel removed!\n⛽📉\nThe ... | negative |
| 2 | 413231 | 2020-06-04 15:41:45+00:00 | Net issuance increases to fund fiscal programs... | positive |
| 3 | 760262 | 2020-07-03 19:39:35+00:00 | RT @bentboolean: How much of Amazon's traffic ... | positive |
| 4 | 830153 | 2020-07-09 14:39:14+00:00 | $AMD Ryzen 4000 desktop CPUs looking ‘great’ a... | positive |
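
The `sentiment` column holds string labels, which a neural network cannot consume directly. One common option is to map them to integer class ids before training. The sketch below assumes the three classes named in the data set description; the mapping `LABEL_TO_ID`, its integer codes, and the toy frame are hypothetical and not taken from the notebook.

```python
import pandas as pd

# Hypothetical mapping; the integer codes are an arbitrary choice for illustration.
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}

# Toy labels standing in for DF['sentiment'].
toy = pd.DataFrame({"sentiment": ["positive", "negative", "neutral", "positive"]})
toy["label"] = toy["sentiment"].map(LABEL_TO_ID)

# Strings missing from the mapping would become NaN, so fail early if any slipped through.
assert toy["label"].notna().all()
print(toy["label"].tolist())
```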
