# Create the news group dataset

In this exercise, you will create a new dataset from an existing dataset. 

**Source dataset:**
https://scikit-learn.org/0.19/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups

**Uploading a dataset**
https://huggingface.co/docs/datasets/en/upload_dataset

To upload a dataset to Hugging Face, you must register for an account and create a dataset repository.

## Packages needed

In [1]:
# !pip install scikit-learn
# !pip install pandas
# !pip install python-dotenv
# !pip install pyarrow
# !pip install datasets
# !pip install huggingface_hub

## 1. A Hugging Face API key is needed to upload the dataset

Change the path to your key file before running the code below.

In [2]:
from dotenv import load_dotenv
import os

import warnings

warnings.filterwarnings("ignore")

# Load the file that contains the API keys
load_dotenv('C:\\Users\\raj\\.jupyter\\.env')

True
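`load_dotenv` returning `True` only means the file was found and parsed, not that it contains the key you need. The sketch below is one way to confirm the key is present without printing it. The variable name `HF_TOKEN` is an assumption (the notebook does not say what the `.env` file calls it), and `token_status` is a hypothetical helper, not part of any library.

```python
import os


def token_status(env, name="HF_TOKEN"):
    """Report whether `name` is present in `env` without revealing its value.

    `HF_TOKEN` is an assumed variable name; use whatever name your .env file defines.
    """
    value = env.get(name, "")
    if not value:
        return f"{name} is not set"
    return f"{name} is set ({len(value)} characters)"


# Check the real process environment after load_dotenv() has run
print(token_status(os.environ))
```

Reporting only the length keeps the secret out of the notebook output, which matters if the notebook is later shared.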

## 2. Load the dataset from sklearn

In [3]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import re

newsgroups_train = fetch_20newsgroups(subset='train')

# View the list of classes in the dataset
newsgroups_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [4]:
## Uncomment below to check out the sample data

# idx = newsgroups_train.data[0].index('Lines')
# print(newsgroups_train.data[0][idx:])

### Cleanup data

In [5]:
# Apply functions to remove names, emails, and extraneous words from data points in newsgroups.data
newsgroups_train.data = [re.sub(r'[\w\.-]+@[\w\.-]+', '', d) for d in newsgroups_train.data] # Remove email addresses
newsgroups_train.data = [re.sub(r"\([^()]*\)", "", d) for d in newsgroups_train.data] # Remove parenthesized text (usually the sender's name)
newsgroups_train.data = [d.replace("From: ", "") for d in newsgroups_train.data] # Remove "From: "
newsgroups_train.data = [d.replace("\nSubject: ", "") for d in newsgroups_train.data] # Remove "\nSubject: "
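To see what the four cleanup steps do, it can help to run them on a single made-up post. The sketch below wraps the same regular expressions and replacements in a hypothetical `clean_post` helper. The sample text is invented for illustration and does not come from the dataset.

```python
import re


def clean_post(text: str) -> str:
    """Apply the same four cleanup steps as the cell above to one post."""
    text = re.sub(r'[\w\.-]+@[\w\.-]+', '', text)  # remove email addresses
    text = re.sub(r"\([^()]*\)", "", text)         # remove parenthesized text
    text = text.replace("From: ", "")              # remove the "From: " prefix
    text = text.replace("\nSubject: ", "")         # join the subject onto the first line
    return text


# Invented sample post, shaped like a newsgroup header
sample = "From: jane@example.com (Jane Doe)\nSubject: Help with X11\nLines: 3\n\nBody text."
cleaned = clean_post(sample)
print(repr(cleaned))
```

Note that other header lines, such as `Lines:` or `Organization:`, are left in place, which is why they still appear in the `Text` column later on.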

## 3. Create a pandas DataFrame

In [6]:
# Put training points into a dataframe
df_train = pd.DataFrame(newsgroups_train.data, columns=['Text'])
df_train['Label'] = newsgroups_train.target
# Match label to target name index
df_train['class'] = df_train['Label'].map(newsgroups_train.target_names.__getitem__)
# Retain text samples that can be used in the gecko model.
df_train = df_train[df_train['Text'].str.len() < 10000]

## Uncomment below to check out the data
# df_train
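The `.map(newsgroups_train.target_names.__getitem__)` call may look unusual: it looks up each integer label in the `target_names` list. The plain-Python sketch below shows the same lookup and the length filter on a few invented values. The shortened `target_names` list and the sample texts are assumptions for illustration only.

```python
# First three class names from the list shown earlier (shortened for the example)
target_names = ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc']
labels = [2, 0, 1, 2]

# Equivalent to Series.map(target_names.__getitem__): index the list with each label
classes = list(map(target_names.__getitem__, labels))
same_result = [target_names[i] for i in labels]
print(classes)

# The length filter keeps only texts shorter than 10000 characters
MAX_CHARS = 10000
texts = ["short post", "x" * 12000, "another short post"]  # invented samples
kept = [t for t in texts if len(t) < MAX_CHARS]
print(len(kept))
```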

## 4. Create a subset by taking 150 rows from each category

In [7]:
# Take a sample of each label category from df_train
SAMPLE_SIZE = 150
df_train = (df_train.groupby('Label', as_index = False)
                    .apply(lambda x: x.sample(SAMPLE_SIZE))
                    .reset_index(drop=True))

# Choose computing-related categories
# df_train = df_train[df_train['class'].str.startswith('comp.')]

list_cats = ['comp.windows.x', 'comp.os.ms-windows.misc', 'comp.sys.mac.hardware']
df_train = df_train[df_train['class'].isin(list_cats)]

# Reset the index
df_train = df_train.reset_index()

## Display the data
df_train

| | index | Text | Label | class |
|---|---|---|---|---|
| 0 | 300 | Help! - Disappearing Groups!!!\nOrganization:... | 2 | comp.os.ms-windows.misc |
| 1 | 301 | Any updated Canon BJ-200 driver\nOrganization... | 2 | comp.os.ms-windows.misc |
| 2 | 302 | last\nDistribution: usa\nOrganization: Milwauk... | 2 | comp.os.ms-windows.misc |
| 3 | 303 | Chera Bekker <>WANTED: Xterm emulator for wind... | 2 | comp.os.ms-windows.misc |
| 4 | 304 | Re: WANTED: Address SYMANTEC\nReply-To: \nOrg... | 2 | comp.os.ms-windows.misc |
| ... | ... | ... | ... | ... |
| 445 | 895 | Imake support for xmosaic\nOrganization: CS D... | 5 | comp.windows.x |
| 446 | 896 | Re: XWindows always opaque\nOrganization: NAS... | 5 | comp.windows.x |
| 447 | 897 | Problems with OpenWindows\nOrganization: Dept... | 5 | comp.windows.x |
| 448 | 898 | Re: Changing dpy->max_request_size ?\nOrganiz... | 5 | comp.windows.x |


In [8]:
# Show the counts of rows in all categories
df_train['class'].value_counts()

class
comp.os.ms-windows.misc    150
comp.sys.mac.hardware      150
comp.windows.x             150
Name: count, dtype: int64
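Because the sampling cell does not fix a random seed, each run selects a different 150 rows per category, so the uploaded dataset is not reproducible. A minimal sketch of a seeded alternative follows, assuming pandas 1.1 or later (which provides `DataFrameGroupBy.sample`). The toy DataFrame and the `SEED` value are invented for illustration.

```python
import pandas as pd

# Toy stand-in for df_train: 12 rows spread evenly over 3 labels
toy = pd.DataFrame({
    "Text": [f"post {i}" for i in range(12)],
    "Label": [0, 1, 2] * 4,
})

SAMPLE_SIZE = 2
SEED = 42  # assumed value; any fixed integer makes the draw repeatable

# Draw SAMPLE_SIZE rows from every label group with a fixed seed
subset = (toy.groupby("Label")
             .sample(n=SAMPLE_SIZE, random_state=SEED)
             .reset_index(drop=True))

# Each label should appear exactly SAMPLE_SIZE times
print(subset["Label"].value_counts().sort_index())
```

With a fixed `random_state`, re-running the cell yields the same rows, so the published dataset can be regenerated exactly.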

## 5. Create the Hugging Face dataset

In [9]:
import datasets
from datasets import Dataset, DatasetDict

In [10]:
ds = Dataset.from_pandas(df_train).rename_column("Text", "text")
ds = ds.remove_columns(["Label", "index"])

In [11]:
ds

Dataset({
    features: ['text', 'class'],
    num_rows: 450
})

## 6. Upload to Hugging Face

1. Create a dataset repository on Hugging Face using the web portal.
2. Change the repository name below to your own, then run the code to upload.

In [12]:
ds_name = 'acloudfan/newsgroups-mini'

In [13]:
ds.push_to_hub(ds_name)

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/1.30k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/acloudfan/newsgroups-mini/commit/ef80190f926f2dc756f2be3028158e8beba518d8', commit_message='Upload dataset', commit_description='', oid='ef80190f926f2dc756f2be3028158e8beba518d8', pr_url=None, pr_revision=None, pr_num=None)