# Creating HF Dataset for Mistral Fine-tuning

Dataset link: https://huggingface.co/datasets/Subramanya3/shawgpt-youtube-comments <br>
Model link: https://huggingface.co/Subramanya3/shawgpt-ft

In [3]:
%pip install datasets


In [1]:
import csv
import random
from datasets import Dataset, DatasetDict



### prep training examples

In [2]:
# load csv of YouTube comments
comment_list = []
response_list = []

with open('data/YT-comments.csv', mode='r', newline='') as file:
    # use a separate name for the reader instead of shadowing the file handle
    reader = csv.reader(file)

    # read file row by row
    for row in reader:
        # skip the header row
        if row[0] == 'Comment':
            continue

        # append comments and responses to respective lists
        comment_list.append(row[0])
        response_list.append(row[1] + " -ShawGPT")
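
The header check above compares the first cell to the literal `Comment`. As a minimal alternative sketch, `csv.DictReader` reads the header itself and lets each column be addressed by name. The helper `load_pairs`, the column name `Response`, and the sample rows are assumptions made for illustration; only `Comment` is known from this notebook.

```python
import csv
import io

SIGNATURE = " -ShawGPT"  # same suffix the notebook appends to each response

def load_pairs(fileobj, comment_col="Comment", response_col="Response"):
    """Return (comments, responses) from a CSV that has a header row.

    The column names are assumptions; adjust them to match the real file.
    """
    comments, responses = [], []
    for row in csv.DictReader(fileobj):
        comments.append(row[comment_col])
        responses.append(row[response_col] + SIGNATURE)
    return comments, responses

# Made-up in-memory data standing in for data/YT-comments.csv
sample_csv = io.StringIO(
    "Comment,Response\n"
    '"Great video, thanks!",Glad it was helpful!\n'
    "What is ICA?,Independent component analysis.\n"
)
comments, responses = load_pairs(sample_csv)
print(comments)
print(responses)
```

With a real file, the same helper could be called as `load_pairs(open('data/YT-comments.csv', newline=''))`, provided the header names match.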

In [4]:
instructions_string = """ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. \
ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.

Please respond to the following comment.
"""

# wrap a comment/response pair in Mistral's [INST] ... [/INST] instruction format
def example_template(comment, response):
    return f'''<s>[INST] {instructions_string} \n{comment} \n[/INST]\n''' + response + "</s>"

example_list = []
for i in range(len(comment_list)):
    example = example_template(comment_list[i],response_list[i])
    example_list.append(example)

print(example_list[10])

<s>[INST] ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
 
Very clear, thanks! The examples with the mic and blinks were a great inclusion imo. They made ICA much easier to understand while also displaying the practical application in a fun way :) 
[/INST]
Glad it was helpful! -ShawGPT</s>
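
The printed example shows the layout each training string follows: the instructions and comment sit between `[INST]` and `[/INST]`, and the response follows until `</s>`. The sketch below rebuilds that layout and splits an example back into its prompt and response, which may be handy when evaluating on held-out examples. The helpers `format_example` and `split_example` are hypothetical, and the instruction text is a shortened stand-in.

```python
def format_example(instructions, comment, response):
    """Build one training string in the same layout as the printed example."""
    return f"<s>[INST] {instructions} \n{comment} \n[/INST]\n{response}</s>"

def split_example(example):
    """Split a formatted example into (prompt, response).

    The prompt keeps everything up to and including [/INST]; the response is
    the remainder with the closing </s> and surrounding whitespace removed.
    """
    prompt, sep, rest = example.partition("[/INST]")
    if not sep or not rest.endswith("</s>"):
        raise ValueError("example does not follow the expected template")
    return prompt + sep, rest[: -len("</s>")].strip()

instructions = "Please respond to the following comment.\n"  # shortened stand-in
example = format_example(
    instructions, "Very clear, thanks!", "Glad it was helpful! -ShawGPT"
)
prompt, response = split_example(example)
print(repr(prompt))
print(repr(response))
```

Note that this split assumes the comment text itself never contains the literal `[/INST]`.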


In [5]:
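# create train/test split: hold out 9 randomly chosen examples for testing
# (range(len(...)) already stops before the last index, so every example is eligible)
test_index_list = random.sample(range(len(example_list)), 9)
test_index_set = set(test_index_list)

test_list = [example_list[index] for index in test_index_list]

# keep the remaining examples for training, selecting by index so duplicates are handled correctly
example_list = [example for index, example in enumerate(example_list) if index not in test_index_set]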
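
Because the split is random, each run of the cell holds out a different set of examples. A minimal sketch of a reproducible variant follows, assuming a fixed seed is acceptable; the helper `split_examples`, the seed value, and the placeholder strings are illustrative and not part of the original notebook.

```python
import random

def split_examples(examples, n_test, seed=0):
    """Return (train, test) with n_test randomly chosen held-out examples.

    A dedicated random.Random instance keeps the split reproducible without
    touching the global random state. The seed value is arbitrary.
    """
    rng = random.Random(seed)
    test_indices = set(rng.sample(range(len(examples)), n_test))
    train = [ex for i, ex in enumerate(examples) if i not in test_indices]
    test = [ex for i, ex in enumerate(examples) if i in test_indices]
    return train, test

examples = [f"example {i}" for i in range(59)]  # placeholder strings
train_part, test_part = split_examples(examples, n_test=9, seed=42)
print(len(train_part), len(test_part))  # 50 9
```

Selecting by index rather than by value keeps the counts correct even if two examples happen to be identical strings.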

### create HF dataset

In [6]:
data = DatasetDict({'train':Dataset.from_dict({"example":example_list}), 'test':Dataset.from_dict({"example":test_list})})

In [7]:
data

DatasetDict({
    train: Dataset({
        features: ['example'],
        num_rows: 50
    })
    test: Dataset({
        features: ['example'],
        num_rows: 9
    })
})

### push dataset to hub

In [8]:
!python3 -m pip install --upgrade pip




In [12]:
# the huggingface-cli command ships with the huggingface_hub package
%pip install --upgrade huggingface_hub

Note: you may need to restart the kernel to use updated packages.


In [9]:
#!huggingface-cli login

zsh:1: command not found: huggingface-cli


In [11]:
# option 1: notebook login
#from huggingface_hub import notebook_login
#notebook_login()

# option 2: key login
# read the write token from the environment instead of hard-coding it in the notebook
import os
from huggingface_hub import login
write_key = os.environ["HF_TOKEN"] # set HF_TOKEN to your write token beforehand
login(write_key)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/hitam/.cache/huggingface/token
Login successful


In [12]:
# push dataset to hub
data.push_to_hub("Subramanya3/shawgpt-youtube-comments")

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 169.21ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:02<00:00,  2.13s/it]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 185.05ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:02<00:00,  2.64s/it]


CommitInfo(commit_url='https://huggingface.co/datasets/Subramanya3/shawgpt-youtube-comments/commit/1c4ea1ee477628a0e0da00dcb7476c31ae5f5190', commit_message='Upload dataset', commit_description='', oid='1c4ea1ee477628a0e0da00dcb7476c31ae5f5190', pr_url=None, pr_revision=None, pr_num=None)