# **Create Hugging Face dataset for summarization task**

### **Load libraries**

In [10]:
%pip install -qq polars<1.22,>=1.20
%pip install -qq -U huggingface_hub
%pip install -qq -U datasets

/bin/bash: line 1: 1.22,: No such file or directory
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [11]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 38 not upgraded.


In [12]:
%pip show polars
%pip show huggingface_hub
%pip show datasets

Name: polars
Version: 1.22.0
Summary: Blazingly fast DataFrame library
Home-page: https://www.pola.rs/
Author: 
Author-email: Ritchie Vink <ritchie46@gmail.com>
License: 
Location: /usr/local/lib/python3.11/dist-packages
Requires: 
Required-by: cudf-polars-cu12
Note: you may need to restart the kernel to use updated packages.
Name: huggingface-hub
Version: 0.34.4
Summary: Client library to download and publish models, datasets and other repos on the huggingface.co hub
Home-page: https://github.com/huggingface/huggingface_hub
Author: Hugging Face, Inc.
Author-email: julien@huggingface.co
License: Apache
Location: /usr/local/lib/python3.11/dist-packages
Requires: filelock, fsspec, hf-xet, packaging, pyyaml, requests, tqdm, typing-extensions
Required-by: accelerate, datasets, diffusers, gradio, gradio_client, peft, sentence-transformers, timm, tokenizers, torchtune, transformers
Note: you may need to restart the kernel to use updated packages.
Name: datasets
Version: 4.0.0
Summary: Huggin

### **Imports**

In [53]:
import zipfile
import os, sys
import shutil
import random
import subprocess
import numpy as np
import pandas as pd
import polars as pl
import pyarrow as pa

import pyarrow.parquet as pq

from kaggle_secrets import UserSecretsClient

from datasets import Dataset, DatasetDict

from huggingface_hub import (
    Repository, 
    get_full_repo_name,
    login,
    upload_folder,
    hf_hub_download,
    HfApi
)


### **Hugging Face Login**

In [18]:
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HF_TOKEN")
user_email = user_secrets.get_secret("user_email")
user_name = user_secrets.get_secret("user_name")

login(token=hf_token)

### **You will need to set up git, edit your email and name.**

In [19]:
def set_git_config(email, name):
    try:
        # Setting global user.email
        subprocess.run(["git", "config", "--global", "user.email", email], check=True)
        #print(f"Git user.email set to: {email}")
        
        # Setting the global user.name
        subprocess.run(["git", "config", "--global", "user.name", name], check=True)
        #print(f"Git user.name set to: {name}")
        
        # Check settings (optional)
        email_output = subprocess.run(["git", "config", "--global", "user.email"], capture_output=True, text=True, check=True)
        name_output = subprocess.run(["git", "config", "--global", "user.name"], capture_output=True, text=True, check=True)
        #print(f"Check - Email: {email_output.stdout.strip()}")
        #print(f"Check - Name: {name_output.stdout.strip()}")
        
    except subprocess.CalledProcessError as e:
        print(f"Error while setting up Git configuration: {e}")

In [20]:
set_git_config(user_email, user_name)

### **Load dataset to the polars**

In [21]:
path_input_file = os.path.join('/kaggle','input','text-for-summarize-nlpllm-task','summary_dataset_en.json')
polars_dataset = pl.read_json(path_input_file)

In [22]:
polars_dataset.head(3)

id,origin_text,summary_text,lenght_origin_text,lenght_summary_text
i64,str,str,i64,i64
0,"""By . Daily Mail Reporter . PUB…","""Cross-border violence began Fr…",3490,124
1,"""A man is suing after allegedly…","""Andrew Walls, 32, claims the u…",1934,309
2,"""The driver who was at the cont…","""William Rockefeller, 46, told …",6646,429


In [23]:
print('Number of rows:', polars_dataset.shape[0])
print('Number of columns:', polars_dataset.shape[1])

Number of rows: 317407
Number of columns: 5


In [24]:
print("lenght_origin_text:")
polars_dataset.select([
    pl.col("lenght_origin_text").min().alias("min"),
    pl.col("lenght_origin_text").mean().alias("mean"),
    pl.col("lenght_origin_text").median().alias("median"),
    pl.col("lenght_origin_text").max().alias("max"),
    pl.col("lenght_origin_text").std().alias("std_dev"),
])

lenght_origin_text:


min,mean,median,max,std_dev
i64,f64,f64,i64,f64
48,3865.262269,3513.0,296786,2194.774563


In [25]:
print("lenght_summary_text:")
polars_dataset.select([
    pl.col("lenght_summary_text").min().alias("min"),
    pl.col("lenght_summary_text").mean().alias("mean"),
    pl.col("lenght_summary_text").median().alias("median"),
    pl.col("lenght_summary_text").max().alias("max"),
    pl.col("lenght_summary_text").std().alias("std_dev"),
])

lenght_summary_text:


min,mean,median,max,std_dev
i64,f64,f64,i64,f64
14,329.896017,286.0,12344,208.44977


In [26]:
len(polars_dataset)

317407

### **Shuffle the dataset**

In [27]:
shuffled_polars_dataset = polars_dataset.sample(fraction=1.0, shuffle=True)

In [28]:
shuffled_polars_dataset.head(3)

id,origin_text,summary_text,lenght_origin_text,lenght_summary_text
i64,str,str,i64,i64
267290,"""Posing naked during or after p…","""Pictures taken by photographer…",4626,205
232870,"""By . Jenny Hope . PUBLISHED: .…","""Trusts have been telling consu…",4347,130
29051,""" Title: ""Revolutionary Quantum…",""" The text discusses a new quan…",1831,1302


In [29]:
print('Number of rows:', polars_dataset.shape[0])
print('Number of columns:', polars_dataset.shape[1])

Number of rows: 317407
Number of columns: 5


## **Splitting the dataset into train (80%), validation (10%) and test (10%)**

In [34]:
np.random.seed(42)
total_rows = shuffled_polars_dataset.height
train_size = int(0.80 * total_rows)
train_size

253925

In [32]:
remaining_size = total_rows - train_size
val_size = int(0.50 * remaining_size)
val_size

31741

In [33]:
test_size = remaining_size - val_size
test_size

31741

In [36]:
train_df = shuffled_polars_dataset.sample(n=train_size, shuffle=True)
remaining_df = shuffled_polars_dataset.filter(~pl.col("id").is_in(train_df["id"]))
val_df = remaining_df.sample(n=val_size, shuffle=True)
test_df = remaining_df.filter(~pl.col("id").is_in(val_df["id"]))

#### ***Train dataset***

In [37]:
train_df.head(3)

id,origin_text,summary_text,lenght_origin_text,lenght_summary_text
i64,str,str,i64,i64
51145,""" Title: Revolutionary Breakthr…",""" A revolutionary breakthrough …",2143,847
244839,"""Skiers and snowboarders in Sco…","""Police called as cars block A8…",3508,208
70941,"""A man has been arrested after …","""Todd MacKinnon was intoxicated…",1793,275


In [38]:
print('Number of rows:', train_df.shape[0])
print('Number of columns:', train_df.shape[1])

Number of rows: 253925
Number of columns: 5


#### ***Validation dataset***

In [39]:
val_df.head(3)

id,origin_text,summary_text,lenght_origin_text,lenght_summary_text
i64,str,str,i64,i64
39240,"""Yeovil Town have sacked manage…","""Gary Johnson took charge of Ye…",1841,192
236073,"""(CNN) -- If you walk past the …","""David Barford is part of a sma…",4738,223
228983,"""Britons will be quizzed as new…","""The Britons are among seven pe…",4944,269


In [40]:
print('Number of rows:', val_df.shape[0])
print('Number of columns:', val_df.shape[1])

Number of rows: 31741
Number of columns: 5


#### ***Test dataset***

In [41]:
test_df.head(3)

id,origin_text,summary_text,lenght_origin_text,lenght_summary_text
i64,str,str,i64,i64
63404,"""By . Liz Hull . PUBLISHED: . 0…","""Emma Parr, 38, turned on her m…",5034,518
94056,"""By . Gemma Mullin . Almost two…","""Two million British soldiers d…",4020,333
300535,"""By . Jenny Hope . PUBLISHED: .…","""Health Secretary rules out tel…",4434,217


In [42]:
print('Number of rows:', test_df.shape[0])
print('Number of columns:', test_df.shape[1])

Number of rows: 31741
Number of columns: 5


## **Create Hugging Face dataset**

We will create a new empty dataset on Hugging Face.

### **Convert Polars DataFrames to Arrow tables**

In [43]:
train_arrow = train_df.to_arrow()
val_arrow = val_df.to_arrow()
test_arrow = test_df.to_arrow()

### **Save Arrow tables to local files**

In [50]:
save_dir = os.path.join('/kaggle','working')
train_arrow_path = os.path.join(save_dir,"train_dataset.parquet")
val_arrow_path = os.path.join(save_dir,"val_dataset.parquet")
test_arrow_path = os.path.join(save_dir,"test_dataset.parquet")

In [52]:
pq.write_table(train_arrow, train_arrow_path)
pq.write_table(val_arrow, val_arrow_path)
pq.write_table(test_arrow, test_arrow_path)

print(f"Saved Arrow files: {train_arrow_path}, {val_arrow_path}, {test_arrow_path}")

Saved Arrow files: /kaggle/working/train_dataset.parquet, /kaggle/working/val_dataset.parquet, /kaggle/working/test_dataset.parquet


### **Create a DatasetDict from Parquet files**

In [54]:
dataset_dict = DatasetDict({
    "train": Dataset.from_parquet(train_arrow_path),
    "validation": Dataset.from_parquet(val_arrow_path),
    "test": Dataset.from_parquet(test_arrow_path)
})

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

### **Define the repository name on Hugging Face**

In [55]:
repo_name = "KRadim/summary_dataset_en"

### **Upload the dataset to Hugging Face**

In [56]:
dataset_dict.push_to_hub(repo_name)
print(f"Dataset uploaded to Hugging Face at: https://huggingface.co/datasets/{repo_name}")

Uploading the dataset shards:   0%|          | 0/3 [00:00<?, ? shards/s]

Creating parquet from Arrow format:   0%|          | 0/85 [00:00<?, ?ba/s]

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

                                        :   0%|          |  524kB /  221MB            

Creating parquet from Arrow format:   0%|          | 0/85 [00:00<?, ?ba/s]

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

                                        :   0%|          |  524kB /  222MB            

Creating parquet from Arrow format:   0%|          | 0/85 [00:00<?, ?ba/s]

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

                                        :   0%|          |  524kB /  221MB            

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ? shards/s]

Creating parquet from Arrow format:   0%|          | 0/32 [00:00<?, ?ba/s]

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

                                        :   1%|1         | 1.05MB / 83.1MB            

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ? shards/s]

Creating parquet from Arrow format:   0%|          | 0/32 [00:00<?, ?ba/s]

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

                                        :   1%|1         | 1.05MB / 82.9MB            

README.md:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

Dataset uploaded to Hugging Face at: https://huggingface.co/datasets/KRadim/summary_dataset_en
