[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AnFreTh/STREAM/blob/main/docs/notebooks/datasets.ipynb)
[![Open On GitHub](https://img.shields.io/badge/Open-on%20GitHub-blue?logo=GitHub)](https://github.com/AnFreTh/STREAM/blob/main/docs/notebooks/datasets.ipynb)

# Datasets

The dataset module provides an easy way to load and preprocess datasets. The package comes with several datasets that are commonly used in topic modeling research:

- 20NewsGroup
- BBC_News
- Stocktwits_GME
- Reddit_GME
- Reuters
- Spotify
- Spotify_most_popular
- Poliblogs
- Spotify_least_popular

Please see the functionality available in the `TMDataset` module.

**Note**: Make sure the required `nltk` resources are installed. If they are not, run the following code:
```python
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
```

In [1]:
# Uncomment the line below if running in Colab;
# the package needs to be installed for the notebook to run.

# ! pip install -U stream_topic

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
from stream_topic.utils import TMDataset

## Using default datasets

- These datasets are already preprocessed and ready to be used for topic modeling.
- They are provided with the package and can be loaded using the `TMDataset` module.
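
Dataset names are passed to `fetch_dataset` as plain strings, so a misspelled name would only surface at fetch time. The following is a minimal sketch of a guard that checks a name against the list at the top of this notebook before fetching. `resolve_dataset_name` and `DATASET_NAMES` are hypothetical names introduced here and are not part of `stream_topic`; whether the library itself is case-sensitive is not stated in this notebook, so the case-insensitive matching is an assumption.

```python
import difflib

# Names copied from the list at the top of this notebook.
DATASET_NAMES = [
    "20NewsGroup", "BBC_News", "Stocktwits_GME", "Reddit_GME", "Reuters",
    "Spotify", "Spotify_most_popular", "Poliblogs", "Spotify_least_popular",
]


def resolve_dataset_name(name):
    """Hypothetical helper (not part of stream_topic).

    Return the listed spelling of `name`, or raise ValueError
    with a few close matches as a hint.
    """
    by_lower = {n.lower(): n for n in DATASET_NAMES}
    key = name.strip().lower()
    if key in by_lower:
        return by_lower[key]
    close = difflib.get_close_matches(key, list(by_lower), n=3)
    hint = ", ".join(by_lower[c] for c in close) or "no close match"
    raise ValueError(f"Unknown dataset {name!r}; did you mean: {hint}?")


print(resolve_dataset_name("reuters"))  # prints: Reuters
```

The returned string could then be passed as the `name` argument of `fetch_dataset`.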

In [4]:
dataset = TMDataset()
dataset.fetch_dataset(name="Reuters")

2024-08-09 15:32:39.680 | INFO     | stream_topic.utils.dataset:fetch_dataset:118 - Fetching dataset: Reuters
2024-08-09 15:32:40.002 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:331 - Downloading dataset from github
2024-08-09 15:32:40.363 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:333 - Dataset downloaded successfully at ~/stream_topic_data/
2024-08-09 15:32:40.757 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:361 - Downloading dataset info from github
2024-08-09 15:32:40.970 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:363 - Dataset info downloaded successfully at ~/stream_topic_data/


In [5]:
dataset.get_bow()

(array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32),
 array(['00', '000', '001', ..., 'zurich', 'zverev', 'zzzz'], dtype=object))

In [6]:
dataset.get_tfidf()

(array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 array(['00', '000', '001', ..., 'zurich', 'zverev', 'zzzz'], dtype=object))
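
Judging by the outputs above, `get_bow` and `get_tfidf` each return a pair: a document-term matrix and the matching vocabulary array. To make the meaning of those matrices concrete, here is a small standard-library sketch that builds both representations for a toy corpus. The documents are invented for illustration, and the smoothed-idf and L2-normalisation choices are one common convention, assumed here; the weighting `stream_topic` actually applies may differ.

```python
import math
from collections import Counter

# Toy corpus; hypothetical documents, not taken from any packaged dataset.
docs = ["oil prices rise", "oil exports fall", "bank rates rise"]

tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})

# Bag-of-words: one row per document, one column per vocabulary term.
bow = [[Counter(toks)[term] for term in vocab] for toks in tokenized]

# Document frequency: number of documents that contain each term.
n_docs = len(docs)
df = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]

# Smoothed inverse document frequency (an assumed, commonly used variant).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# TF-IDF: raw counts weighted by idf, then L2-normalised per document.
tfidf = []
for row in bow:
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
    tfidf.append([v / norm for v in weighted])

print(vocab)
print(bow[0])
```

Terms that occur in fewer documents (such as `prices`) receive a larger TF-IDF weight than terms shared across documents (such as `oil`), which is why the two matrices differ even though they share the same shape and vocabulary.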

In [7]:
# dataset.get_word_embeddings()

In [8]:
dataset.fetch_dataset('Spotify')

2024-08-09 15:32:42.196 | INFO     | stream_topic.utils.dataset:fetch_dataset:108 - Dataset name already provided while instantiating the class: Reuters
2024-08-09 15:32:42.196 | INFO     | stream_topic.utils.dataset:fetch_dataset:111 - Overwriting the dataset name with the name provided in fetch_dataset: Spotify
2024-08-09 15:32:42.196 | INFO     | stream_topic.utils.dataset:fetch_dataset:115 - Fetching dataset: Spotify
2024-08-09 15:32:42.490 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:331 - Downloading dataset from github
2024-08-09 15:32:43.475 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:333 - Dataset downloaded successfully at ~/stream_topic_data/
2024-08-

In [9]:
dataset.dataframe.head()

Unnamed: 0,name,duration_ms,explicit,artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,text,labels,tokens
0,What They Want,165853,1,['Russ'],2017-05-05,0.71,0.404,1,-10.04,0,0.379,0.484,0.0,0.0953,0.398,139.553,4,yeah ooh yeah they let the rap game yeah yeah ...,75,"[yeah, ooh, yeah, they, let, the, rap, game, y..."
1,Shores,281367,0,"['Seinabo Sey', 'Vargas & Lagola']",2019-09-20,0.431,0.491,5,-6.615,1,0.0288,0.322,0.0,0.0679,0.275,143.879,4,seinabo sey have always wondered your cause wh...,58,"[seinabo, sey, have, always, wondered, your, c..."
2,The Prayer,255360,0,['Anthony Callea'],2005,0.217,0.46,10,-5.133,1,0.0302,0.768,8e-06,0.0847,0.109,138.822,4,youll our eyes and watch where and when dont k...,37,"[youll, our, eyes, and, watch, where, and, whe..."
3,Send Me the Pillow You Dream On,147440,0,['Hank Locklin'],2003-03-03,0.595,0.308,3,-11.626,1,0.0333,0.84,4e-06,0.0942,0.624,119.755,4,send the pillow that you dream dont you know t...,45,"[send, the, pillow, that, you, dream, dont, yo..."
4,It's a Rainy Day,255400,0,['Ice Mc'],2008-03-16,0.619,0.736,2,-11.686,0,0.0302,0.00482,0.00105,0.335,0.484,134.955,4,alexia and ice you the came down you were life...,41,"[alexia, and, ice, you, the, came, down, you, ..."


In [10]:
dataset.texts[:2]

['yeah ooh yeah they let the rap game yeah yeah yeah yeah they let this rap game yeah yeah got chick call her she feel like the like and some and feel like she feel like but she aint the only one got chick call her she she and off now she just got the the her they but they aint the only what they want what they want what they want dollar signs yeah know its what they want what they want what they want what they want yall aint fooling all ooh ooh ooh ooh this now they call they yeah off probably the only one yeah when you you all the like got the and and some probably the only one yeah what they want what they want what they want dollar signs yeah know its what they want what they want what they want what they want yall aint fooling all ooh ooh ooh ooh who ill you who fuck just all the what all the fuck they like the but know what they want aint its and but pop pop the let the boss when boss ill what they want what they want what they want dollar signs yeah know its what they want what 

In [11]:
dataset.tokens

In [12]:
dataset.labels[:2]

[75, 58]

## Loading own dataset

In [13]:
import pandas as pd
import numpy as np


# Simulating some example data
np.random.seed(0)

# Generate 1000 random strings of lengths between 1 and 5, containing letters 'A' to 'Z'
random_documents = [''.join(np.random.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), 
                                             np.random.randint(1, 6))) for _ in range(1000)]

# Generate 1000 random labels from 1 to 4 as strings
random_labels = np.random.choice(['1', '2', '3', '4'], 1000)

# Create DataFrame
my_data = pd.DataFrame({"Documents": random_documents, "Labels": random_labels})
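
The cell above produces documents that each consist of a single random string (for example `PVADD`), which is enough to exercise the loading API but gives a topic model nothing to learn. As an optional alternative, the sketch below generates multi-word documents whose words depend on the label, using only the standard library. The topic vocabularies are hypothetical and invented purely for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sample is reproducible

# Hypothetical topic vocabularies, invented purely for illustration.
topic_words = {
    "1": ["market", "stock", "price", "trade", "bank"],
    "2": ["match", "team", "goal", "league", "coach"],
    "3": ["album", "song", "band", "tour", "lyrics"],
    "4": ["vote", "party", "policy", "senate", "election"],
}

documents, labels = [], []
for _ in range(1000):
    label = rng.choice(sorted(topic_words))
    length = rng.randint(5, 15)  # both bounds are inclusive
    # Sample words (with replacement) from the vocabulary of this label.
    documents.append(" ".join(rng.choices(topic_words[label], k=length)))
    labels.append(label)
```

These two lists can be placed in a DataFrame with the same `Documents` and `Labels` column names used above, in place of `random_documents` and `random_labels`.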


In [14]:
dataset = TMDataset()
dataset.create_load_save_dataset(
    data=my_data,
    dataset_name="sample_data",
    save_dir="data/",
    doc_column="Documents",
    label_column="Labels",
)

Preprocessing documents: 100%|██████████| 1000/1000 [00:03<00:00, 251.82it/s]
2024-08-09 15:32:48.092 | INFO     | stream_topic.utils.dataset:create_load_save_dataset:237 - Dataset saved to data/sample_data.parquet
2024-08-09 15:32:48.093 | INFO     | stream_topic.utils.dataset:create_load_save_dataset:252 - Dataset info saved to data/sample_data_info.pkl


In [15]:
# The new dataset is saved in the data folder, unlike the default datasets, which are saved
# in the package directory under the preprocessed_data folder.
# Therefore, you need to provide the path to the data folder to fetch the dataset.
dataset.fetch_dataset(name="sample_data", dataset_path="data/", source="local")

2024-08-09 15:32:48.097 | INFO     | stream_topic.utils.dataset:fetch_dataset:118 - Fetching dataset: sample_data
2024-08-09 15:32:48.098 | INFO     | stream_topic.utils.dataset:fetch_dataset:128 - Fetching dataset from local path
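
If a local fetch fails, it can help to confirm that the saved files are where you expect them. The log output from `create_load_save_dataset` above reports two files, `data/sample_data.parquet` and `data/sample_data_info.pkl`. The sketch below derives those paths and lists any that are missing. `expected_dataset_files` is a hypothetical helper, and the naming pattern is inferred only from that log output, so it is an assumption rather than a documented contract.

```python
from pathlib import Path


def expected_dataset_files(dataset_name, save_dir):
    """Hypothetical helper: paths inferred from the log messages above."""
    base = Path(save_dir)
    return base / f"{dataset_name}.parquet", base / f"{dataset_name}_info.pkl"


parquet_path, info_path = expected_dataset_files("sample_data", "data/")

# Collect whichever of the two files is not present on disk.
missing = [p for p in (parquet_path, info_path) if not p.exists()]
if missing:
    print("Missing files:", ", ".join(str(p) for p in missing))
else:
    print("Both dataset files found.")
```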


In [16]:
dataset.dataframe.head()

Unnamed: 0,text,labels,tokens
0,PVADD,2,[PVADD]
1,TV,4,[TV]
2,EXG,4,[EXG]
3,Y,4,[Y]
4,BGHXO,3,[BGHXO]
