# Datasets

The dataset module provides an easy way to load and preprocess datasets. The package ships with several datasets that are commonly used in topic modeling research:

    - 20NewsGroups
    - BBC_News
    - Stocktwits_GME
    - Reddit_GME
    - Reuters
    - Spotify
    - Spotify_most_popular
    - Poliblogs
    - Spotify_least_popular

The functionality is exposed through the `TMDataset` class, which is imported from `stream_topic.utils`.

In [1]:
from stream_topic.utils import TMDataset

import warnings
warnings.filterwarnings("ignore")



## Using default datasets

- These datasets are already preprocessed and ready to use for topic modeling.
- They are included in the package and can be loaded with `TMDataset`.

In [2]:
dataset = TMDataset()
dataset.get_dataset_list()

['Stocktwits_GME_large',
 'BBC_News',
 'Stocktwits_GME',
 'Reddit_GME',
 'Reuters',
 'Spotify',
 '20NewsGroups',
 'DummyDataset',
 'Spotify_most_popular',
 'Poliblogs',
 'Spotify_least_popular']

In [3]:
dataset.fetch_dataset(name="Reuters")

2024-08-06 17:10:27.035 | INFO     | stream_topic.utils.dataset:fetch_dataset:157 - Fetching dataset: Reuters
2024-08-06 17:10:27.111 | INFO     | stream_topic.utils.dataset:fetch_dataset:163 - Dataset loaded successfully from /opt/homebrew/Caskroom/miniforge/base/envs/topicm/lib/python3.10/site-packages/stream_topic/preprocessed_datasets/Reuters


In [4]:
dataset.get_bow()

(array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32),
 array(['00', '000', '001', ..., 'zurich', 'zverev', 'zzzz'], dtype=object))
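
The output above is a pair: a document-term matrix and the matching vocabulary array. As a rough illustration of that structure, here is a minimal sketch that builds the same kind of pair with the standard library. The toy documents are hypothetical, and this is not the package's own implementation.

```python
from collections import Counter

# Hypothetical toy corpus; the packaged datasets hold preprocessed documents.
docs = ["oil prices rise", "oil exports fall", "prices fall again"]

# Sorted vocabulary, analogous to the second element of the returned tuple.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

# One row per document, one column per vocabulary word.
bow = []
for doc in docs:
    counts = Counter(doc.split())
    row = [0.0] * len(vocab)
    for word, count in counts.items():
        row[index[word]] = float(count)
    bow.append(row)

print(vocab)
print(bow[0])
```

Most entries in such a matrix are zero, which matches the mostly-zero rows printed for the Reuters data.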

In [5]:
dataset.get_tfidf()

(array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 array(['00', '000', '001', ..., 'zurich', 'zverev', 'zzzz'], dtype=object))
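
The TF-IDF output has the same shape as the bag-of-words output, but with weights instead of raw counts. The exact weighting used by `get_tfidf` is not shown here, so the sketch below uses the textbook `tf * log(N / df)` form on a hypothetical toy corpus, purely to illustrate the idea.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from any packaged dataset.
docs = ["oil prices rise", "oil exports fall", "prices fall again"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# Document frequency: number of documents that contain each word.
df = Counter(word for tokens in tokenized for word in set(tokens))
n_docs = len(docs)


def tfidf_row(tokens):
    """Weight each vocabulary word by term frequency times log(N / df)."""
    counts = Counter(tokens)
    # Libraries often add smoothing and row normalisation; both are omitted here.
    return [counts[w] / len(tokens) * math.log(n_docs / df[w]) for w in vocab]


tfidf = [tfidf_row(tokens) for tokens in tokenized]
print(tfidf[0])
```

Words that occur in fewer documents ("rise") receive a higher weight than words shared across documents ("oil").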

In [6]:
dataset.get_word_embeddings()

{'monica': array([ 0.51855 , -0.070771,  0.51117 ,  0.71454 , -0.13436 ,  0.68983 ,
         0.62699 ,  0.971   ,  0.8041  , -0.34947 ,  0.92039 ,  0.10256 ,
        -1.0888  ,  0.14353 , -0.29457 ,  0.024667,  0.7013  ,  0.49332 ,
        -0.86258 ,  0.54011 , -0.6377  ,  0.056668,  0.30735 ,  0.76396 ,
        -0.080621,  0.29731 ,  0.51798 , -1.1633  , -0.16926 ,  0.070911,
         0.25808 , -0.04027 ,  1.0546  ,  0.30116 , -0.087459,  0.081065,
         0.21493 ,  0.28763 ,  0.91327 ,  0.36973 , -0.21147 , -0.49185 ,
        -0.21688 ,  0.070246,  0.03793 , -0.21647 , -0.18415 ,  0.091836,
         0.76674 ,  0.11772 ,  0.35068 , -0.20623 , -0.02515 , -0.1861  ,
         0.49147 , -1.6014  ,  0.1748  ,  0.30223 ,  0.41354 ,  0.39711 ,
        -0.68077 ,  0.76038 ,  0.39296 , -0.2051  , -0.18053 , -0.48139 ,
         1.4897  ,  0.56627 ,  0.088757,  0.40142 ,  0.23751 ,  0.61882 ,
        -0.50917 ,  0.096604,  0.18039 , -0.11864 ,  0.34496 ,  0.17769 ,
        -0.16574 ,  0.1528  
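
The result is a dictionary that maps each word to its embedding vector. A common way to use such a mapping is to compare words by cosine similarity. The sketch below assumes a dictionary of that shape and uses short, hypothetical vectors in place of the real embeddings.

```python
import math

# Hypothetical 3-dimensional vectors standing in for the real embeddings.
embeddings = {
    "oil": [1.0, 0.0, 1.0],
    "gas": [1.0, 0.0, 0.5],
    "song": [0.0, 1.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine(embeddings["oil"], embeddings["gas"]))
print(cosine(embeddings["oil"], embeddings["song"]))
```

With real embeddings, the same function should work on the NumPy arrays stored in the dictionary, since it only iterates over the values.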

In [7]:
dataset.fetch_dataset('Spotify')

2024-08-06 17:10:38.885 | INFO     | stream_topic.utils.dataset:fetch_dataset:149 - Dataset name already provided while instantiating the class: Reuters
2024-08-06 17:10:38.886 | INFO     | stream_topic.utils.dataset:fetch_dataset:151 - Overwriting the dataset name with the provided name in fetch_dataset: Spotify
2024-08-06 17:10:38.886 | INFO     | stream_topic.utils.dataset:fetch_dataset:154 - Fetching dataset: Spotify
2024-08-06 17:10:38.982 | INFO     | stream_topic.utils.dataset:fetch_dataset:163 - Dataset loaded successfully from /opt/homebrew/Caskroom/miniforge/base/envs/topicm/lib/python3.10/site-packages/stream_topic/preprocessed_datasets/Spotify


In [8]:
dataset.dataframe.head()

Unnamed: 0,name,duration_ms,explicit,artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,text,labels,tokens
0,What They Want,165853,1,['Russ'],2017-05-05,0.71,0.404,1,-10.04,0,0.379,0.484,0.0,0.0953,0.398,139.553,4,yeah ooh yeah they let the rap game yeah yeah ...,75,"[yeah, ooh, yeah, they, let, the, rap, game, y..."
1,Shores,281367,0,"['Seinabo Sey', 'Vargas & Lagola']",2019-09-20,0.431,0.491,5,-6.615,1,0.0288,0.322,0.0,0.0679,0.275,143.879,4,seinabo sey have always wondered your cause wh...,58,"[seinabo, sey, have, always, wondered, your, c..."
2,The Prayer,255360,0,['Anthony Callea'],2005,0.217,0.46,10,-5.133,1,0.0302,0.768,8e-06,0.0847,0.109,138.822,4,youll our eyes and watch where and when dont k...,37,"[youll, our, eyes, and, watch, where, and, whe..."
3,Send Me the Pillow You Dream On,147440,0,['Hank Locklin'],2003-03-03,0.595,0.308,3,-11.626,1,0.0333,0.84,4e-06,0.0942,0.624,119.755,4,send the pillow that you dream dont you know t...,45,"[send, the, pillow, that, you, dream, dont, yo..."
4,It's a Rainy Day,255400,0,['Ice Mc'],2008-03-16,0.619,0.736,2,-11.686,0,0.0302,0.00482,0.00105,0.335,0.484,134.955,4,alexia and ice you the came down you were life...,41,"[alexia, and, ice, you, the, came, down, you, ..."


In [9]:
dataset.texts

['yeah ooh yeah they let the rap game yeah yeah yeah yeah they let this rap game yeah yeah got chick call her she feel like the like and some and feel like she feel like but she aint the only one got chick call her she she and off now she just got the the her they but they aint the only what they want what they want what they want dollar signs yeah know its what they want what they want what they want what they want yall aint fooling all ooh ooh ooh ooh this now they call they yeah off probably the only one yeah when you you all the like got the and and some probably the only one yeah what they want what they want what they want dollar signs yeah know its what they want what they want what they want what they want yall aint fooling all ooh ooh ooh ooh who ill you who fuck just all the what all the fuck they like the but know what they want aint its and but pop pop the let the boss when boss ill what they want what they want what they want dollar signs yeah know its what they want what 

In [10]:
dataset.tokens

In [11]:
dataset.labels

[75,
 58,
 37,
 45,
 41,
 43,
 38,
 62,
 0,
 38,
 59,
 58,
 74,
 66,
 0,
 57,
 32,
 40,
 60,
 47,
 56,
 37,
 62,
 30,
 31,
 43,
 55,
 38,
 65,
 56,
 53,
 29,
 46,
 48,
 72,
 53,
 7,
 69,
 54,
 56,
 54,
 61,
 50,
 34,
 35,
 59,
 49,
 50,
 22,
 63,
 32,
 49,
 29,
 60,
 52,
 71,
 42,
 42,
 52,
 59,
 53,
 65,
 73,
 69,
 49,
 57,
 37,
 47,
 31,
 42,
 69,
 15,
 49,
 64,
 57,
 61,
 45,
 56,
 47,
 23,
 65,
 5,
 42,
 63,
 58,
 39,
 47,
 20,
 67,
 61,
 61,
 60,
 55,
 83,
 26,
 44,
 62,
 35,
 33,
 67,
 50,
 66,
 46,
 58,
 55,
 28,
 38,
 62,
 66,
 68,
 63,
 53,
 65,
 0,
 66,
 57,
 80,
 50,
 9,
 64,
 33,
 58,
 73,
 26,
 68,
 27,
 59,
 43,
 61,
 55,
 63,
 62,
 67,
 69,
 16,
 68,
 64,
 48,
 51,
 1,
 55,
 52,
 61,
 18,
 0,
 58,
 65,
 62,
 55,
 67,
 5,
 57,
 25,
 70,
 70,
 35,
 73,
 41,
 49,
 61,
 37,
 41,
 31,
 46,
 55,
 61,
 45,
 54,
 44,
 58,
 34,
 67,
 46,
 52,
 66,
 10,
 56,
 55,
 54,
 58,
 55,
 67,
 0,
 36,
 75,
 51,
 51,
 71,
 49,
 58,
 0,
 40,
 61,
 38,
 43,
 50,
 35,
 37,
 54,
 69,
 0,
 49,
 6
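
Since `dataset.labels` is a plain list of integers, its distribution can be inspected with the standard library. The sketch below uses only the first ten values from the output above as sample input.

```python
from collections import Counter

# First ten labels copied from the output above, used as sample input.
labels = [75, 58, 37, 45, 41, 43, 38, 62, 0, 38]

# Count how often each label value occurs.
counts = Counter(labels)

print(counts.most_common(3))
print("distinct labels:", len(counts))
```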

## Loading your own dataset

In [12]:
from stream_topic.utils import TMDataset

import warnings
warnings.filterwarnings("ignore")

In [13]:
import pandas as pd
import numpy as np


# Simulating some example data
np.random.seed(0)

# Generate 1000 random strings of lengths between 1 and 5, containing letters 'A' to 'Z'
random_documents = [''.join(np.random.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), 
                                             np.random.randint(1, 6))) for _ in range(1000)]

# Generate 1000 random labels from 1 to 4 as strings
random_labels = np.random.choice(['1', '2', '3', '4'], 1000)

# Create DataFrame
my_data = pd.DataFrame({"Documents": random_documents, "Labels": random_labels})


In [14]:
dataset = TMDataset()
dataset.create_load_save_dataset(
    data=my_data, 
    dataset_name="sample_data",
    save_dir="data/",
    doc_column="Documents",
    label_column="Labels"
    )

Preprocessing documents: 100%|██████████| 1000/1000 [00:03<00:00, 262.72it/s]
2024-08-06 17:10:42.880 | INFO     | stream_topic.utils.dataset:create_load_save_dataset:407 - Dataset save directory does not exist: data/
2024-08-06 17:10:42.881 | INFO     | stream_topic.utils.dataset:create_load_save_dataset:408 - Creating directory: data/
2024-08-06 17:10:42.885 | INFO     | stream_topic.utils.dataset:create_load_save_dataset:413 - Dataset saved to data/sample_data.parquet
2024-08-06 17:10:42.885 | INFO     | stream_topic.utils.dataset:create_load_save_dataset:428 - Dataset info saved to data/sample_data_info.pkl
2024-08-06 17:10:42.886 | INFO     | stream_topic.utils.dataset:create_load_save_dataset:431 - Dataset name appended to avali

In [15]:
# The new dataset is saved in the data/ folder, unlike the default datasets,
# which are stored in the package directory under preprocessed_datasets.
# You therefore need to pass the path to the data/ folder when fetching it.
dataset.fetch_dataset(name="sample_data", dataset_path="data/")

2024-08-06 17:10:42.889 | INFO     | stream_topic.utils.dataset:fetch_dataset:157 - Fetching dataset: sample_data
2024-08-06 17:10:42.893 | INFO     | stream_topic.utils.dataset:fetch_dataset:163 - Dataset loaded successfully from data/


In [16]:
dataset.dataframe.head()

Unnamed: 0,text,labels,tokens
0,PVADD,2,[PVADD]
1,TV,4,[TV]
2,EXG,4,[EXG]
3,Y,4,[Y]
4,BGHXO,3,[BGHXO]
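
The log messages from `create_load_save_dataset` show two files being written: `data/sample_data.parquet` and `data/sample_data_info.pkl`. Before calling `fetch_dataset` with a custom `dataset_path`, it can help to confirm that both files exist. The sketch below assumes the naming pattern seen in those log lines; `expected_dataset_files` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def expected_dataset_files(save_dir, dataset_name):
    """Hypothetical helper: file names follow the log messages shown earlier."""
    base = Path(save_dir)
    return base / f"{dataset_name}.parquet", base / f"{dataset_name}_info.pkl"


parquet_path, info_path = expected_dataset_files("data/", "sample_data")

# Report any file that is not present in the save directory.
missing = [p for p in (parquet_path, info_path) if not p.exists()]
print("missing files:", [str(p) for p in missing])
```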
