# Stock Predictor - Data Generator

1\. First, install the `datasets` library and load the `TimKoornstra/financial-tweets-sentiment` dataset from Hugging Face.

In [0]:
%pip install datasets

Python interpreter will be restarted.
Collecting datasets
  Using cached datasets-3.5.0-py3-none-any.whl (491 kB)
Collecting huggingface-hub>=0.24.0
  Using cached huggingface_hub-0.30.1-py3-none-any.whl (481 kB)
Collecting multiprocess<0.70.17
  Using cached multiprocess-0.70.16-py39-none-any.whl (133 kB)
Collecting xxhash
  Using cached xxhash-3.5.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (193 kB)
Collecting aiohttp
  Using cached aiohttp-3.11.16-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
Collecting pyarrow>=15.0.0
  Using cached pyarrow-19.0.1-cp39-cp39-manylinux_2_28_x86_64.whl (42.1 MB)
Collecting dill<0.3.9,>=0.3.0
  Using cached dill-0.3.8-py3-none-any.whl (116 kB)
Collecting requests>=2.32.2
  Using cached requests-2.32.3-py3-none-any.whl (64 kB)
Collecting tqdm>=4.66.3
  Using cached tqdm-4.67.1-py3-none-any.whl (78 kB)
Collecting pyyaml>=5.1
  Using cached PyYAML-6.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (737 kB)
Co

In [0]:

# Import necessary libraries
from datasets import load_dataset

# Load the Financial Tweets Sentiment dataset from Hugging Face
print("Loading dataset from Hugging Face...")
dataset = load_dataset("TimKoornstra/financial-tweets-sentiment")

# Let's examine what we have
print("Dataset structure:")
print(dataset)

# Select the train split (the only split in this dataset)
train_data = dataset['train']

Loading dataset from Hugging Face...
Dataset structure:
DatasetDict({
    train: Dataset({
        features: ['tweet', 'sentiment', 'url'],
        num_rows: 38091
    })
})


2\. Remove the existing `streaming_source` directory from `/databricks/driver` and recreate it.

> You may want to run this again each time you start the data generator.

In [0]:
%%bash
rm -rf streaming_source
mkdir streaming_source
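
The same reset can also be done from Python, which may be convenient if the generator cell is rerun often. This is only a minimal sketch: `reset_dir` is a hypothetical helper name, and the path in the usage comment is assumed to be the `streaming_source` directory used in this notebook.

```python
import os
import shutil


def reset_dir(path):
    """Delete `path` if it exists, then recreate it as an empty directory."""
    shutil.rmtree(path, ignore_errors=True)  # roughly `rm -rf path`
    os.makedirs(path)                        # roughly `mkdir path`
    return path


# Example (assumed path): reset_dir("/databricks/driver/streaming_source")
```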

In [0]:
import time
from datetime import datetime
import json

stream_source_path = "/databricks/driver/streaming_source"
# Loop through the training set one tweet at a time
for counter, tweet in enumerate(train_data):
    # Add a tweet_time field with the current timestamp
    tweet_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    tweet['tweet_time'] = tweet_time

    # Write the current tweet to its own JSON file (UTF-8, since ensure_ascii=False)
    with open(f"{stream_source_path}/tweet_{counter}.json", "w", encoding="utf-8") as json_file:
        json.dump(tweet, json_file, ensure_ascii=False)
        json_file.write("\n")  # Add newline after each JSON object

    # Sleep for 1 second to simulate a stream of arriving tweets
    time.sleep(1)

    # Print the progress
    print(f"Processed tweet {counter} at {tweet_time}")

Processed tweet 0 at 2025-04-07 08:08:54
Processed tweet 1 at 2025-04-07 08:08:55
Processed tweet 2 at 2025-04-07 08:08:56
Processed tweet 3 at 2025-04-07 08:08:57
Processed tweet 4 at 2025-04-07 08:08:58
Processed tweet 5 at 2025-04-07 08:08:59
Processed tweet 6 at 2025-04-07 08:09:00
Processed tweet 7 at 2025-04-07 08:09:01
Processed tweet 8 at 2025-04-07 08:09:02
Processed tweet 9 at 2025-04-07 08:09:03
Processed tweet 10 at 2025-04-07 08:09:04
Processed tweet 11 at 2025-04-07 08:09:05
Processed tweet 12 at 2025-04-07 08:09:06
Processed tweet 13 at 2025-04-07 08:09:07
Processed tweet 14 at 2025-04-07 08:09:08
Processed tweet 15 at 2025-04-07 08:09:09
Processed tweet 16 at 2025-04-07 08:09:10
Processed tweet 17 at 2025-04-07 08:09:11
Processed tweet 18 at 2025-04-07 08:09:12
Processed tweet 19 at 2025-04-07 08:09:13
Processed tweet 20 at 2025-04-07 08:09:14
Processed tweet 21 at 2025-04-07 08:09:15
Processed tweet 22 at 2025-04-07 08:09:16
Processed tweet 23 at 2025-04-07 08:09:17
Pr
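
One thing to watch with a file-based stream is that a reader polling the directory could pick up a file that is still being written. A common way to reduce that risk is to write under a temporary name and then rename the file into place. The sketch below assumes this pattern is wanted here; `write_tweet_atomically` and the `.tmp` naming are illustrative and not part of the original notebook, and whether a given consumer ignores the temporary file should be verified separately.

```python
import json
import os
from datetime import datetime


def write_tweet_atomically(tweet, directory, index):
    """Write one tweet as a JSON line, exposing the file only once it is complete."""
    record = dict(tweet)  # copy so the caller's dict is not modified
    record["tweet_time"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    final_path = os.path.join(directory, f"tweet_{index}.json")
    tmp_path = os.path.join(directory, f".tweet_{index}.json.tmp")  # hypothetical temp name

    with open(tmp_path, "w", encoding="utf-8") as json_file:
        json.dump(record, json_file, ensure_ascii=False)
        json_file.write("\n")

    # Rename into place; on POSIX this is atomic when both paths share a filesystem
    os.replace(tmp_path, final_path)
    return final_path
```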

3\. The Python loop above reads the training split and writes tweets, one per JSON file, to the `/databricks/driver/streaming_source` directory at a rate of one file per second. List the directory to check that files are arriving.

In [0]:
!ls streaming_source	       

tweet_0.json	tweet_26.json  tweet_45.json  tweet_64.json  tweet_83.json
tweet_1.json	tweet_27.json  tweet_46.json  tweet_65.json  tweet_84.json
tweet_10.json	tweet_28.json  tweet_47.json  tweet_66.json  tweet_85.json
tweet_100.json	tweet_29.json  tweet_48.json  tweet_67.json  tweet_86.json
tweet_101.json	tweet_3.json   tweet_49.json  tweet_68.json  tweet_87.json
tweet_11.json	tweet_30.json  tweet_5.json   tweet_69.json  tweet_88.json
tweet_12.json	tweet_31.json  tweet_50.json  tweet_7.json   tweet_89.json
tweet_13.json	tweet_32.json  tweet_51.json  tweet_70.json  tweet_9.json
tweet_14.json	tweet_33.json  tweet_52.json  tweet_71.json  tweet_90.json
tweet_15.json	tweet_34.json  tweet_53.json  tweet_72.json  tweet_91.json
tweet_16.json	tweet_35.json  tweet_54.json  tweet_73.json  tweet_92.json
tweet_17.json	tweet_36.json  tweet_55.json  tweet_74.json  tweet_93.json
tweet_18.json	tweet_37.json  tweet_56.json  tweet_75.json  tweet_94.json
tweet_19.json	tweet_38.json  tweet_57.j

4\. Stop the above loop when you no longer need the stream of new files.
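
If interrupting the cell by hand is inconvenient, the loop can instead be bounded up front. The following is a sketch under that assumption: `stream_records`, `handle`, `max_records`, and `delay_seconds` are hypothetical names introduced for illustration, not part of the original notebook.

```python
import itertools
import time


def stream_records(records, handle, max_records=None, delay_seconds=1.0):
    """Pass records to `handle(index, record)` one at a time, pausing between them.

    Stops after `max_records` items (or when `records` is exhausted) and
    returns the number of records handled.
    """
    count = 0
    limited = itertools.islice(records, max_records)  # None means no limit
    for count, record in enumerate(limited, start=1):
        handle(count - 1, record)
        time.sleep(delay_seconds)
    return count
```

For example, `stream_records(train_data, write_one, max_records=100)` would stop on its own after 100 tweets, where `write_one` stands for whatever function writes a single tweet to disk.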

In [0]:
!cat streaming_source/tweet_0.json

{"tweet": "$BYND - JPMorgan reels in expectations on Beyond Meat https://t.co/bd0xbFGjkT", "sentiment": 2, "url": "https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment", "tweet_time": "2025-04-07 06:18:46"}
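
To check a generated file programmatically rather than with `cat`, each file can be parsed as a single JSON line. This is a minimal sketch that assumes the one-object-per-file layout shown above; `read_tweet_file` is a hypothetical helper name.

```python
import json


def read_tweet_file(path):
    """Parse the single JSON object stored on the first line of `path`."""
    with open(path, encoding="utf-8") as json_file:
        return json.loads(json_file.readline())


# Example (assumed path): read_tweet_file("streaming_source/tweet_0.json")["tweet_time"]
```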
