Added data samples to notebooks #2370

Merged
merged 3 commits into from
Jun 13, 2023
@@ -177,7 +177,15 @@
"We use the [SQUAD](https://huggingface.co/datasets/squad) dataset. The next few cells show basic data preparation for fine tuning:\n",
"* Visualize some data rows. Take note of the dataset fields: `question`, `context`, `answers`, `id` and `title`. The `answers` field has `start_key` and `text` fields in json format inside the `answers` field . The keys `question` and `context`, `answers`, `answer_start` and `text` are the relevant fields that need to be mapped to the parameters of the fine tuning pipeline.\n",
"* The dataset does not have a test split, split test into two halves, one for test and other for validation.\n",
"* We want this sample to run quickly, so save smaller `train` and `validation` files containing 5% of the original. This means the fine tuned model will have lower accuracy, hence it should not be put to real-world use. "
"* We want this sample to run quickly, so save smaller `train` and `validation` files containing 5% of the original. This means the fine tuned model will have lower accuracy, hence it should not be put to real-world use. \n",
"\n",
"##### Here is an example of how the data should look like\n",
"\n",
"| Question | Context | Answers |\n",
"| :- | :- | :- |\n",
"| What does Phosphorylation do? | After a chloroplast polypeptide is synthesized on a ribosome in the cytosol, an enzyme specific to chloroplast proteins phosphorylates, or adds a phosphate group to many (but not all) of them in their transit sequences. Phosphorylation helps many proteins bind the polypeptide, keeping it from folding prematurely. This is important because it prevents chloroplast proteins from assuming their active form and carrying out their chloroplast functions in the wrong place\\u2014the cytosol. At the same time, they have to keep just enough shape so that they can be recognized by the chloroplast. These proteins also help the polypeptide get imported into the chloroplast. | {\"text”: [\"helps many proteins bind the polypeptide\",\"helps many proteins bind the polypeptide\", \"helps many proteins bind the polypeptide\"], \"answer_start\": [236,236,236]} |\n",
"| What is the basic unit of organization within the UMC? | The Annual Conference, roughly the equivalent of a diocese in the Anglican Communion and the Roman Catholic Church or a synod in some Lutheran denominations such as the Evangelical Lutheran Church in America, is the basic unit of organization within the UMC. The term Annual Conference is often used to refer to the geographical area it covers as well as the frequency of meeting. Clergy are members of their Annual Conference rather than of any local congregation, and are appointed to a local church or other charge annually by the conference's resident Bishop at the meeting of the Annual Conference. In many ways, the United Methodist Church operates in a connectional organization of the Annual Conferences, and actions taken by one conference are not binding upon another. | {\"text\": [\"The Annual Conference\",\"synod\",\"The Annual Conference\"],\"answer_start\": [0,120,0]} |\n",
"\n"
]
},
{
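For readers of this diff, the SQuAD preparation steps described in the cell above (split one set into two halves, then keep a small fraction) can be sketched as follows. This is a minimal illustration using only the standard library, not the notebook's own code: the helper names `split_in_half`, `sample_fraction` and `answers_are_consistent` are hypothetical, and the rows are made-up placeholders shaped like the sample table.

```python
import json
import random


def split_in_half(rows):
    """Split rows into two halves: the first for validation, the second for test."""
    mid = len(rows) // 2
    return rows[:mid], rows[mid:]


def sample_fraction(rows, fraction, seed=0):
    """Return a reproducible random subset with about `fraction` of the rows (at least one)."""
    k = max(1, int(len(rows) * fraction))
    return random.Random(seed).sample(rows, k)


def answers_are_consistent(row):
    """Check that each answer text sits at its answer_start offset in the context."""
    answers = row["answers"]
    return all(
        row["context"][start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


# Hypothetical placeholder rows with the same fields as the SQuAD sample table.
rows = [
    {
        "question": f"What is item {i}?",
        "context": f"Item {i} is a placeholder.",
        "answers": {"text": [f"Item {i}"], "answer_start": [0]},
    }
    for i in range(40)
]

validation, test = split_in_half(rows)
small_validation = sample_fraction(validation, 0.05)
print(len(validation), len(test), len(small_validation))
print(json.dumps(small_validation[0]))
```

Checking `answer_start` against the context before fine tuning is a cheap way to catch rows whose offsets were shifted by earlier text cleaning.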
@@ -177,7 +177,17 @@
"\n",
"* Download the dataset.\n",
"* Visualize some data rows. \n",
"* We want this sample to run quickly, so save smaller `train`, `validation` and `test` files containing 20% of the already trimmed rows. This means the fine tuned model will have lower accuracy, hence it should not be put to real-world use. "
"* We want this sample to run quickly, so save smaller `train`, `validation` and `test` files containing 20% of the already trimmed rows. This means the fine tuned model will have lower accuracy, hence it should not be put to real-world use. \n",
"\n",
"##### Here is an example of how the data should look like\n",
"\n",
"The summarization dataset is expected to have 2 fields – document, summary like shown below.\n",
"\n",
"| Article (Document) | Highlights (Summary) |\n",
"| :- | :- |\n",
"| (CNN) -- Former baseball slugger Jose Canseco accidentally shot himself in his left finger while cleaning a gun, police said. He was in surgery Tuesday night, his fiancee tweeted. \\\"This is Leila . Thank you all for the kind words and prayers . Jose is in still surgery and will be ok. Please pray for his finger !!,\\\" she said in a tweet posted to his account. | Canseco hit more than 450 home runs .\\nHis semiautomatic handgun accidentally went off . |\n",
"| (CNN) -- Zlatan Ibrahimovic scored all four goals in Sweden's 4-2 win over England -- but his final shot was something special. His audacious overhead volley from 30 yards was labeled on social networking sites as the greatest ever soccer goal. What do you think? Share your views on Ibrahimovic's wonder goal. | Zlatan Ibrahimovic scores a 30-yard overhead kick .\\nIbrahimovic scored all four goals for Sweden . |\n",
"\n"
]
},
{
@@ -177,7 +177,27 @@
"We use the [emotion](https://huggingface.co/datasets/dair-ai/emotion) dataset. The next few cells show basic data preparation for fine tuning:\n",
"* Visualize some data rows\n",
"* Replace numerical categories in data with the actual string labels. This mapping is available in the [./emotion-dataset/label.json](./emotion-dataset/label.json). This step is needed if you want string labels such as `anger`, `joy`, etc. returned when scoring the model. If you skip this step, the model will return numerical categories such as 0, 1, 2, etc. and you will have to map them to what the category represents yourself. \n",
"* We want this sample to run quickly, so save smaller `train`, `validation` and `test` files containing 10% of the original. This means the fine tuned model will have lower accuracy, hence it should not be put to real-world use. "
"* We want this sample to run quickly, so save smaller `train`, `validation` and `test` files containing 10% of the original. This means the fine tuned model will have lower accuracy, hence it should not be put to real-world use. \n",
"\n",
"##### Here is an example of how the data should look like\n",
"\n",
"Single text classification requires the training data to include at least 2 fields – one for ‘Sentence1’ and ‘Label’ like in this example. Sentence 2 can be left blank in this case. The below examples are from Emotion dataset. \n",
"\n",
"| Text (Sentence1) | Label (Label) |\n",
"| :- | :- |\n",
"| i feel so blessed to be able to share it with you all | joy | \n",
"| i feel intimidated nervous and overwhelmed and i shake like a leaf | fear | \n",
"\n",
" \n",
"\n",
"Text pair classification, where you have two sentences to be classified (e.g., sentence entailment) will need the training data to have 3 fields – for ‘Sentence1’, ‘Sentence2’ and ‘Label’ like in this example. The below examples are from Microsoft Research Paraphrase Corpus dataset. \n",
"\n",
"| Text1 (Sentence 1) | Text2 (Sentence 2) | Label_text (Label) |\n",
"| :- | :- | :- |\n",
"| Amrozi accused his brother , whom he called \" the witness \" , of deliberately distorting his evidence . | Referring to him as only \" the witness \" , Amrozi accused his brother of deliberately distorting his evidence . | equivalent |\n",
"| Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion . | Yucaipa bought Dominick 's in 1995 for \\$ 693 million and sold it to Safeway for \\$ 1.8 billion in 1998 . | not equivalent |\n",
"\n",
" "
]
},
{
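The label replacement step described in the cell above (numeric categories mapped to string labels) might look roughly like the sketch below. The JSON layout with an `id2label` key is an assumption standing in for `./emotion-dataset/label.json`, whose real structure is not shown in this diff, and `add_label_strings` is a hypothetical helper name.

```python
import json

# Assumed stand-in for ./emotion-dataset/label.json; the real file's layout may differ.
LABEL_JSON = (
    '{"id2label": {"0": "sadness", "1": "joy", "2": "love", '
    '"3": "anger", "4": "fear", "5": "surprise"}}'
)


def add_label_strings(rows, id2label):
    """Return copies of the rows with an added `label_string` field."""
    return [{**row, "label_string": id2label[str(row["label"])]} for row in rows]


id2label = json.loads(LABEL_JSON)["id2label"]

# The two sample sentences from the table, with assumed numeric labels.
rows = [
    {"text": "i feel so blessed to be able to share it with you all", "label": 1},
    {"text": "i feel intimidated nervous and overwhelmed and i shake like a leaf", "label": 4},
]

labeled = add_label_strings(rows, id2label)
for row in labeled:
    print(row["label_string"], "|", row["text"])
```

Keeping the numeric `label` next to the new string field makes it easy to verify the mapping on a few rows before saving the files.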
@@ -178,7 +178,17 @@
"* Visualize some data rows\n",
"* We want this sample to run quickly, so save smaller `train`, `validation` and `test` files containing 10% of the original. This means the fine tuned model will have lower accuracy, hence it should not be put to real-world use. \n",
"\n",
"> The [download-dataset.py](./conll2003-dataset/download-dataset.py) is used to download the conll2003 dataset and transform the dataset into finetune pipeline component consumable format."
"> The [download-dataset.py](./conll2003-dataset/download-dataset.py) is used to download the conll2003 dataset and transform the dataset into finetune pipeline component consumable format.\n",
"\n",
"##### Here is an example of how the data should look like\n",
"\n",
"Token classification requires the training data to include 2 fields, ‘Tokens’ and ‘Tags’. The tags could contain any strings depending on the finetune use case. Please note that the NER tags should be passed as an array of strings. \n",
"\n",
"| Tokens (Tokens) | NER Tags (Tags) |\n",
"| :- | :- |\n",
"| [\"Results\",\"of\",\"French\",\"first\",\"division\"] | [\"O\",\"O\",\"B-MISC\",\"O\",\"O\"] |\n",
"| [\"Nippon\",\"Telegraph\",\"and\",\"Telephone\",\"Corp\",\"(\",\"NTT\",\")\",\"said\",\"on\",\"Friday\",\"that\",\"it\",\"hopes\",\"to\",\"move\",\"into\",\"the\",\"international\",\"telecommunications\",\"business\",\"as\",\"soon\",\"as\",\"possible\",\"following\",\"the\",\"government\",\"'s\",\"decision\",\"to\",\"split\",\"NTT\",\"into\",\"three\",\"firms\",\"under\",\"a\",\"holding\",\"company\",\".\"] | [\"B-ORG\",\"I-ORG\",\"I-ORG\",\"I-ORG\",\"I-ORG\",\"O\",\"B-ORG\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"B-ORG\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\",\"O\"] |\n",
"\n"
]
},
{
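The token classification format described in the cell above (one tag per token, tags passed as an array of strings) can be checked with a small validator. This is a sketch only: the field names `tokens` and `ner_tags` and the helper `check_token_row` are assumptions for illustration, so adjust them to the column names your data actually uses.

```python
def check_token_row(row):
    """Return a list of problems found in one token-classification row."""
    tokens, tags = row.get("tokens"), row.get("ner_tags")
    if not isinstance(tokens, list) or not isinstance(tags, list):
        return ["tokens and ner_tags must both be lists"]
    problems = []
    if len(tokens) != len(tags):
        problems.append(f"{len(tokens)} tokens but {len(tags)} tags")
    if not all(isinstance(tag, str) for tag in tags):
        problems.append("tags must be strings, not numeric ids")
    return problems


# First sample row from the table above.
good = {
    "tokens": ["Results", "of", "French", "first", "division"],
    "ner_tags": ["O", "O", "B-MISC", "O", "O"],
}
# Hypothetical malformed row: length mismatch and numeric tag ids.
bad = {"tokens": ["Results", "of"], "ner_tags": [0, 0, 7]}

print(check_token_row(good))
print(check_token_row(bad))
```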
@@ -179,7 +179,17 @@
"\n",
 The [download-dataset.py](./wmt16-en-ro-dataset">
"> The [download-dataset.py](./wmt16-en-ro-dataset/download-dataset.py) is used to download the wmt16 (ro-en) dataset and transform the dataset into finetune pipeline component consumable format. Because the dataset is large, only part of it is used here.\n",
"\n",
"> **Note** : Some language models have different language codes and hence the column names in the dataset should reflect the same."
"> **Note** : Some language models have different language codes and hence the column names in the dataset should reflect the same.\n",
"\n",
"##### Here is an example of how the data should look like \n",
"\n",
"The translation dataset should have 2 fields – source language and target language. The field names that map to source and target languages need to be language codes supported by the model. Please refer to the model card for details on supported languages.\n",
"\n",
"| en (Source_language) | Ro (Target_language) |\n",
"| :- | :- |\n",
"| Beethoven, Brahms, Bartok, Enescu were working people, artists, and not commercial representatives. | Beethoven, Brahms, Bartok, Enescu erau oameni care munceau, care erau artisti \\u0219i nu reprezentanti comerciali. |\n",
"| Colleague Damien Collins MP attacked The Voice, saying that too wasn't original | Colegul Damien Collins a atacat The Voice, afirm\\u00e2nd c\\u0103 nici aceast\\u0103 emisiune nu este original\\u0103 |\n",
"\n"
]
},
{
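Since the translation cell above requires the field names to be the language codes the model supports, a quick row check along these lines may help. It is a sketch under assumptions: `check_translation_row` is a hypothetical helper, and `en` and `ro` are the codes used in the sample table, so consult the model card for the codes your model expects.

```python
def check_translation_row(row, source_lang, target_lang):
    """Return True when the row has non-empty string fields named after both language codes."""
    return all(
        isinstance(row.get(code), str) and bool(row[code].strip())
        for code in (source_lang, target_lang)
    )


# Second sample row from the table above.
row = {
    "en": "Colleague Damien Collins MP attacked The Voice, saying that too wasn't original",
    "ro": "Colegul Damien Collins a atacat The Voice, afirmând că nici această emisiune nu este originală",
}
# Hypothetical row whose target column name uses different casing.
wrong_case = {"en": row["en"], "Ro": row["ro"]}

print(check_translation_row(row, "en", "ro"))
print(check_translation_row(wrong_case, "en", "ro"))
```

The lookup is case-sensitive, so a column named `Ro` does not satisfy a model that expects `ro`.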