Streaming and domain adaptation datasets based on the Twitter feed, using hashtags related to NASDAQ companies, presenting new challenges in both domains.
The datasets contain tweets crawled from Twitter between 10.02.2019 and 03.12.2019. The tweets for streaming and domain adaptation were selected with respect to the respective tasks, described below. All tweets were crawled such that no user information is passed to us; only the tweet text itself is processed.
This repository offers two datasets:

- The prefix `nsdqs_` marks files belonging to the stream dataset.
- The prefix `sentqs_` marks files belonging to the domain adaptation dataset.
- The main dataset file can be found in `data/nsdqs_skipgram_embedding.npy`.
- Hashtags crawled: 'ADBE', 'GOOGL', 'AMZN', 'AAPL', 'ADSK', 'BKNG', 'EXPE', 'INTC', 'MSFT', 'NFLX', 'NVDA', 'PYPL', 'SBUX', 'TSLA' and 'XEL'.
- The dataset contains 30278 tweets with 1000 feature dimensions.
- Number of classes: 15
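Loading the main file is a plain NumPy load. A minimal sketch — a small synthetic stand-in replaces `data/nsdqs_skipgram_embedding.npy` so it runs stand-alone, and the separate label array is an assumption about how labels are stored:

```python
import numpy as np

# Stand-in for np.load("data/nsdqs_skipgram_embedding.npy"); the real array
# holds 30278 tweets with 1000 skip-gram feature dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1000))
y = rng.integers(0, 15, size=200)  # hypothetical labels for the 15 hashtag classes

print(X.shape)            # feature matrix: (n_tweets, n_dims)
print(np.unique(y).size)  # number of distinct classes seen
```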
Test-Then-Train
A primary challenge in the real-time analysis and supervised classification of data streams is a changing underlying concept, known as concept drift, which forces machine learning algorithms to adapt constantly. This dataset consists of tweets on the NASDAQ codes of the largest American companies and reflects the volatility of the stock market. Due to this volatility, many different concept drifts exist, posing a new challenge in the stream context, as there is no underlying systematic that explains the drift or makes it predictable. The dataset is highly unbalanced and very high-dimensional compared to other stream datasets.
- High feature dimension compared to existing datasets.
- High number of classes with large imbalances compared to existing datasets.
- Highly volatile dataset with many unspecified concept drifts.
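The test-then-train protocol above can be sketched as a prequential loop: each incoming sample is first used for testing, then for training. A minimal sketch on a synthetic binary stream — the linear classifier and the data are illustrative stand-ins, not the repo's SamKNN/RSLVQ learners:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic two-class stream; the real NSDQS stream has 15 classes and 1000 dims.
rng = np.random.default_rng(42)
X = rng.standard_normal((500, 20))
y = (X[:, 0] > 0).astype(int)
classes = np.unique(y)

clf = SGDClassifier(random_state=0)
correct, tested = 0, 0
for i in range(len(X)):
    x_i = X[i].reshape(1, -1)
    if i > 0:                                     # test first ...
        correct += int(clf.predict(x_i)[0] == y[i])
        tested += 1
    clf.partial_fit(x_i, y[i:i + 1], classes=classes)  # ... then train
print(f"prequential accuracy: {correct / tested:.2f}")
```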
- (Optional) Preprocess on your own:
  - Raw tweets are at `Tweets.csv`.
  - Run `nsdqs_processing.py`. This creates a basic statistical dataset description, trains the embedding, and plots the t-SNE embedding and eigenspectra, which takes some time.
  - The processed dataset is stored, ready for usage, in `data/nsdqs_stream_skipgram.npy`.
- Demo: Run `nsdqs_demo.py` for a stream machine learning demonstration using SamKNN and RSLVQ.
- The main dataset file can be found in `data/sentqs_skipgram_embedding.npy`.
- Hashtags crawled: 'ADBE', 'GOOGL', 'AMZN', 'AAPL', 'ADSK', 'BKNG', 'EXPE', 'INTC', 'MSFT', 'NFLX', 'NVDA', 'PYPL', 'SBUX', 'TSLA', 'XEL', 'positive', 'bad' and 'sad'.
- The dataset contains 61536 tweets with 300 feature dimensions.
- Number of classes: 3 (Positive, Neutral, Negative Sentiment)
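Since the classes are highly unbalanced, a quick look at the label proportions is a natural first step. A minimal sketch — the label array and the id-to-sentiment mapping are synthetic assumptions, not the repo's actual encoding:

```python
import numpy as np

# Stand-in labels; the real labels accompany data/sentqs_skipgram_embedding.npy.
rng = np.random.default_rng(1)
y = rng.choice(3, size=1000, p=[0.2, 0.5, 0.3])  # assumed imbalanced mix

names = {0: "positive", 1: "neutral", 2: "negative"}  # assumed id mapping
counts = np.bincount(y, minlength=3)
for cls, n in enumerate(counts):
    print(f"{names[cls]}: {n / len(y):.2%}")
```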
Train on Sentiment Tweets - Evaluate the Sentiment of Corporate Tweets
Change of Language Distribution between Train and Test dataset
When the training and test datasets follow different distributions, this is called a domain adaptation problem. In contrast to other domain adaptation datasets, which are mostly image datasets or do not reflect a real scenario, this dataset offers a transfer learning scenario in the context of social media analysis.
The core idea is to learn a sentiment model for positive, neutral, and negative tweets and then, via domain adaptation, apply it to corporate tweets from unseen corporations. The practical advantage is that no manual labeling of the company tweets is needed, and they cover a large language spectrum.
- Real-world scenario not relying on a standard image or text dataset with exhausting preprocessing.
- High number of samples compared to existing datasets.
- Highly unbalanced classes.
- Domain adaptation problem arises implicitly by using tweets from varying hashtags.
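One common baseline for the kind of distribution shift described above is correlation alignment (CORAL), which whitens the source features and re-colors them with the target covariance. This is an illustrative sketch on synthetic data, not the repo's method:

```python
import numpy as np

def _msqrt(C, inv=False):
    # Symmetric matrix square root (or inverse square root) via eigendecomposition.
    w, V = np.linalg.eigh(C)
    w = np.clip(w, 1e-12, None)
    s = 1.0 / np.sqrt(w) if inv else np.sqrt(w)
    return (V * s) @ V.T

def coral(Xs, Xt, eps=1e-6):
    # Whiten the centered source features, then re-color with target covariance.
    d = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(d)
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(d)
    return (Xs - Xs.mean(0)) @ _msqrt(Cs, inv=True) @ _msqrt(Ct) + Xt.mean(0)

# Synthetic source (e.g. sentiment-hashtag tweets) and target (corporate tweets).
rng = np.random.default_rng(0)
Xs = rng.standard_normal((400, 5)) * 2.0 + 1.0
Xt = rng.standard_normal((400, 5)) @ rng.standard_normal((5, 5))
Xs_aligned = coral(Xs, Xt)
```

A classifier is then trained on `Xs_aligned` with the source labels and applied directly to the target samples.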
- (Optional) Preprocess on your own:
  - Raw tweets are at `Tweets.csv`.
  - Run `sentqs_process.py`. This creates a basic statistical dataset description, trains the embedding, and plots the t-SNE embedding and eigenspectra, which takes some time.
  - The processed dataset is stored, ready for usage, in `data/sentqs_da_skigram.npy`.
- Demo: Run `sentqs_demo.py` for a stream machine learning demonstration using SamKNN and RSLVQ.
To create a bytes file for your visualization:

- Run `sentqs_preprocess.py`.
- You will receive `data/skipgram_tensors.bytes`.
- Convert your CSV file to a TSV file with a version of `csv_to_tsv.py`.
- Add both to a fork of https://github.com/tensorflow/embedding-projector-standalone
- Adjust the config/JSON file with your added files and the right shape.
- Then run the visualization locally with `python -m http.server 8080`.
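The CSV-to-TSV step might look like the following minimal sketch. The repo ships `csv_to_tsv.py`; this stand-alone equivalent is illustrative, not the repo's exact script:

```python
import csv

def csv_to_tsv(src_path, dst_path):
    # Re-write a CSV file with tab delimiters, since the embedding projector
    # expects tab-separated metadata.
    with open(src_path, newline="", encoding="utf-8") as fin, \
         open(dst_path, "w", newline="", encoding="utf-8") as fout:
        writer = csv.writer(fout, delimiter="\t")
        for row in csv.reader(fin):
            writer.writerow(row)
```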
To create a bytes file from BERT or ALBERT embeddings for your visualization:

- Run BERT with `bert/BERT.ipynb` or ALBERT with `albert/ALBERT.ipynb`, locally in a Jupyter notebook or with Google Colab.
- You will receive `metadata_bert.tsv` and `tensors_bert.bytes` for BERT, or `metadata_albert.tsv` and `tensors_albert.bytes` for ALBERT.
- Add both to a fork of https://github.com/tensorflow/embedding-projector-standalone
- Adjust the config/JSON file with your added files and the right shape.
- Then run the visualization locally with `python -m http.server 8080` (or use GitHub Pages to deploy a web app with the visualization).
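The `.bytes` tensor files the projector reads are raw float32 dumps of the embedding matrix, with the shape declared separately in the JSON config. A minimal sketch of writing and reading such a file — the filename here is hypothetical:

```python
import numpy as np

# Write a small embedding matrix as raw float32 bytes; the (rows, dims) shape
# must be declared in the projector's JSON config, not in the file itself.
emb = np.random.default_rng(0).standard_normal((10, 300)).astype(np.float32)
emb.tofile("tensors_demo.bytes")

# Reading it back only needs the declared shape:
restored = np.fromfile("tensors_demo.bytes", dtype=np.float32).reshape(10, 300)
```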