Large Language Model Challenge: LLM StoryBot
Data Science Sprint 15 Challenge
Challenge Objectives 🧾
You should be able to...

Implement an LLM (either local or API based)
Customize the behavior of an LLM
Use a custom LLM in your terminal
You may use the Python standard library and any Python libraries that can be installed via pip. Please do not use websites such as ChatGPT to write your stories. You are welcome to use them to help write the code for your custom model, though! Pro Tip: ask them to act as teachers who help you learn, rather than having them write the code for you.

Part 1 - Choose an LLM 🔍
The easy choice here is OpenAI's API version of GPT, but it requires a paid account. If you want a higher-level challenge, choose an open-source model that runs locally. Although GPT is easier to set up and typically produces better stories than most models you can run locally, you will learn far more if you implement a local open-source model.

Part 2 - Implement an LLM as a Brilliant Storyteller Bot 🤖
Design an LLM that acts like a brilliant storyteller by having it generate a story based on user input. The stories should be roughly 1000 words or longer.

Be mindful of the context window of the model you choose. The system_prompt, user_prompt, and output combined need to fit inside one context window. See the documentation for the model you're using for more details. Pro Tip: choose a model with good documentation! 😏
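
The context-window arithmetic above can be sketched in a few lines. This is only an illustration: the 1024-token window is GPT-2's, the prompt sizes are made-up numbers, and the tokens-per-word ratio is an assumed rule of thumb for English text rather than a measured value. The function names are hypothetical helpers, not part of any library.

```python
import math

GPT2_CONTEXT_TOKENS = 1024  # GPT-2's context window, in tokens
TOKENS_PER_WORD = 4 / 3     # ASSUMPTION: rough rule of thumb for English, not measured


def output_token_budget(context_window: int, system_tokens: int, user_tokens: int) -> int:
    """Tokens left for the generated story once both prompts are counted."""
    remaining = context_window - system_tokens - user_tokens
    return max(remaining, 0)


def estimated_tokens(word_count: int, tokens_per_word: float = TOKENS_PER_WORD) -> int:
    """Rough token estimate for a text of `word_count` words."""
    return math.ceil(word_count * tokens_per_word)


# Illustrative prompt sizes (hypothetical): 60 system tokens, 20 user tokens.
budget = output_token_budget(GPT2_CONTEXT_TOKENS, system_tokens=60, user_tokens=20)
needed = estimated_tokens(1000)
print(f"budget={budget} tokens, 1000 words ~ {needed} tokens, fits={needed <= budget}")
```

Under these assumptions a 1000-word story does not fit in a single GPT-2 pass, which is worth knowing before choosing a model. Options include a model with a larger context window or generating the story in several chunks.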

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Shared GPT-2 tokenizer, reused by StoryBot and by the testing cells below.
tokenizer = AutoTokenizer.from_pretrained("gpt2")


class StoryBot:
    """Wraps GPT-2 and continues a user prompt into a story."""

    def __init__(self):
        self.tokenizer = tokenizer
        self.model = AutoModelForCausalLM.from_pretrained("gpt2")

    def __call__(self, user_prompt: str, max_length: int = 1000) -> str:
        # max_length counts tokens (prompt + output), not words.
        # GPT-2's context window is 1024 tokens.
        input_ids = self.tokenizer.encode(user_prompt, return_tensors="pt")
        # No sampling arguments are passed, so generate() uses the model's
        # default decoding settings (greedy search for GPT-2).
        output = self.model.generate(input_ids, max_length=max_length, num_return_sequences=1)
        # The decoded text includes the original prompt followed by the continuation.
        return self.tokenizer.decode(output[0], skip_special_tokens=True)


Part 3 - StoryBot Testing 🧪
Have your bot write a few stories and evaluate them. Can alterations be made to the model's parameters to improve the results?
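
Decoding parameters are usually the first thing to alter. Greedy decoding, which `generate()` uses when no sampling arguments are passed, often gets stuck repeating one sentence. The sketch below switches to sampling and penalizes repetition. The specific values are illustrative starting points, not tuned settings, and `generate_story` is a hypothetical helper rather than part of the challenge template.

```python
# Illustrative starting points, NOT tuned values; adjust and compare the stories.
GENERATION_KWARGS = {
    "do_sample": True,          # sample from the distribution instead of greedy search
    "temperature": 0.9,         # below 1.0 is more conservative, above is more random
    "top_k": 50,                # keep only the 50 most likely next tokens
    "top_p": 0.95,              # nucleus sampling: keep the top 95% probability mass
    "repetition_penalty": 1.2,  # discourage tokens that were already generated
    "no_repeat_ngram_size": 3,  # never repeat the same 3-token sequence
}


def generate_story(prompt: str, model_name: str = "gpt2") -> str:
    """Continue `prompt` using sampling instead of greedy decoding."""
    # Imported here so the settings above can be inspected without loading a model.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Calling the tokenizer (rather than .encode) also returns an attention mask.
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]

    # n_positions is GPT-2's context window (1024); other models may name it differently.
    room = model.config.n_positions - prompt_len

    output = model.generate(
        **inputs,
        max_new_tokens=room,                  # fill whatever the context window allows
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token of its own
        **GENERATION_KWARGS,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Because sampling is random, each call produces a different story. Changing one parameter at a time makes it easier to see which setting is responsible for a difference in quality.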

In [None]:
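# GPT-2 has no padding token, so reuse the end-of-sequence token for padding.
tokenizer.pad_token = tokenizer.eos_token

# Note: this model instance is separate from the one StoryBot loads in __init__,
# so the pad_token_id set here does not change StoryBot's behavior.
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.config.pad_token_id = model.config.eos_token_id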


In [None]:
user_prompt = "A short film about a pilot who "
story_bot = StoryBot()
story_bot(user_prompt)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


"A short film about a pilot who \xa0was killed by a drone in the middle of a war.\nThe film is a bit of a mess, but it's a good one. The pilot is a young man who is trying to get his life back on track. He is a pilot who is trying to get his life back on track. He is a pilot who is trying to get his life back on track. He is a pilot who is trying to get his life back on track. He is a pilot who is trying to get his life back on track. He is a pilot who is trying to get his life back on track. He is a pilot who is trying to get his life back on track. He is a pilot who is trying to get his life back on track. He is a pilot who is trying to get his life back on track. He is a pilot who is trying to get his life back on track. He is a pilot who is trying to get his life back on track. He is a pilot who is trying to get his life back on track. He is a pilot who is trying to get his life back on track. He is a pilot who is trying to get his life back on track. He is a pilot who is trying to

Part 4 - Story Submission 📜
Have your bot generate at least three stories and choose your favorite to upload for evaluation. The story should be 1000 words or more and be purely AI-generated. Please do not edit the story in any way. To submit your Sprint Challenge, upload your story.txt file by itself.
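
Before uploading, it may help to confirm that the file meets the length requirement. The following is a minimal sketch that counts whitespace-separated words. The helper names are hypothetical, and the grader's exact counting method is not stated in this document, so treat the result as an approximation.

```python
from pathlib import Path


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def meets_length(path: str = "story.txt", minimum: int = 1000) -> bool:
    """Return True if the file at `path` has at least `minimum` words."""
    text = Path(path).read_text(encoding="utf-8")
    return word_count(text) >= minimum
```

For example, `meets_length("story.txt")` returns `False` for a story that stopped early, which is a signal to regenerate rather than to edit the text by hand.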

Stretch Goals (Optional) 🏄
Have the bot's `__call__` method write the story to a .txt file rather than printing it to the terminal, so you don't need to copy/paste for submission.
Give your bot a distinct personality on top of its ability to write captivating stories.
Implement a memory module for your bot. A memory module would give you the ability to have it do multiple revisions of the same story based on your input.
Implement an API (Flask or FastAPI) to interact with your bot.
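
As a starting point for the file-output and memory goals above, here is one possible sketch. `StorySession` is a hypothetical wrapper, not part of the challenge template. It accepts any callable that maps a prompt to a story (a `StoryBot` instance would qualify), keeps every draft in a list, and saves the latest draft to disk. The wording of the revision prompt is an assumption.

```python
from pathlib import Path
from typing import Callable, List


class StorySession:
    """Hypothetical wrapper: remembers earlier drafts and saves the latest one."""

    def __init__(self, generate: Callable[[str], str], out_path: str = "story.txt"):
        self.generate = generate       # any callable: prompt -> story text
        self.out_path = Path(out_path)
        self.drafts: List[str] = []    # the "memory": every draft produced so far

    def write(self, user_prompt: str) -> str:
        """Generate a first draft from the user's prompt."""
        return self._save(self.generate(user_prompt))

    def revise(self, feedback: str) -> str:
        """Generate a new draft from the previous draft plus the user's feedback."""
        if not self.drafts:
            raise ValueError("Nothing to revise yet; call write() first.")
        # ASSUMPTION: this prompt layout is illustrative; adapt it to your model.
        prompt = f"{self.drafts[-1]}\n\nRevise the story above. {feedback}\n\n"
        return self._save(self.generate(prompt))

    def _save(self, story: str) -> str:
        """Record the draft in memory and overwrite the output file with it."""
        self.drafts.append(story)
        self.out_path.write_text(story, encoding="utf-8")
        return story
```

Usage would look like `session = StorySession(StoryBot())`, then `session.write("A short film about a pilot who ")`, then `session.revise("Give it a happier ending.")`. Two caveats apply: the previous draft counts against the context window, and a base model such as GPT-2 continues text rather than following instructions, so revision is likely to work better with an instruction-tuned model.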


Congratulations! 🥳
Thank you for your hard work, and congratulations!!! You've learned a lot, and you should proudly call yourself a Data Scientist.