# Getting around with Mistral-7B

In this notebook I try out Mistral-7B, first on sample tasks to check that the model still performs reasonably after quantization, and then on zero-shot and few-shot annotation.

In [1]:
%env CUDA_DEVICE_ORDER=PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES=5

env: CUDA_DEVICE_ORDER=PCI_BUS_ID
env: CUDA_VISIBLE_DEVICES=5


In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

from tqdm.auto import tqdm

In [3]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [4]:
modelpath = "mistralai/Mistral-7B-v0.1"

model = AutoModelForCausalLM.from_pretrained(
    modelpath,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
    ),
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(modelpath, use_fast=False)
tokenizer.pad_token = "</s>"
model.config.eos_token_id = tokenizer.eos_token_id

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
model

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
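
As a rough sanity check on what 4-bit loading buys, the shapes in the printout above can be turned into a back-of-the-envelope weight count. This is only a sketch: it counts the `Linear4bit` layers visible in the (truncated) printout, and it assumes exactly 4 bits per weight, ignoring the per-block quantization constants that NF4 stores alongside the weights as well as the embedding and norm layers, which are not `Linear4bit`.

```python
# Shapes copied from the module printout above: (in_features, out_features).
LINEAR_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32       # "(0-31): 32 x MistralDecoderLayer"
BITS_PER_WEIGHT = 4   # assumption: ignores quantization constants and other overhead

# Number of weights in the quantized linear layers of one decoder layer.
params_per_layer = sum(n_in * n_out for n_in, n_out in LINEAR_SHAPES.values())
# Same count over all decoder layers.
linear4bit_params = NUM_LAYERS * params_per_layer
# Size of those weights when packed at 4 bits each.
weight_bytes = linear4bit_params * BITS_PER_WEIGHT // 8

print(f"Linear4bit parameters per layer: {params_per_layer:,}")
print(f"Linear4bit parameters in total:  {linear4bit_params:,}")
print(f"Packed 4-bit weights: {weight_bytes / 2**30:.2f} GiB")
```

The real footprint on the GPU will be somewhat larger than this estimate, for the reasons listed above.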

## Try short prompts

In [6]:
prompt = "Conspiracy theory about pickles: "

inputs = tokenizer(prompt, return_tensors="pt").to(device)

In [7]:
inputs

{'input_ids': tensor([[    1,  1325, 14079,  2426,  5742,   684,  3088,   867, 28747, 28705]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

In [8]:
# Greedy decoding
outputs = model.generate(**inputs, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Conspiracy theory about pickles: 100% true.

I’m not sure if I’ve mentioned this before, but I’m a big fan of conspiracy theories. I’m not sure why, but I find them fascinating. I’m not sure if I believe in them, but I find them fascinating.

I’m not sure if I’ve mentioned this before, but I’m a big fan of conspiracy theories. I’m not sure why, but I find them fascinating


In [9]:
# Sampling
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=0, temperature=0.3, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Conspiracy theory about pickles: 1. Pickles are a conspiracy to make you think you’re eating a vegetable when you’re really eating a condiment. 2. Pickles are a conspiracy to make you think you’re eating a condiment when you’re really eating a vegetable.

Conspiracy theory about the moon: The moon is a conspiracy to make you think it’s a natural satellite when it’s really a giant hologram projected by the government.

Con


In [10]:
# Higher temperature sampling
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=0, temperature=0.6, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Conspiracy theory about pickles: 1. Pickle juice has a high concentration of electrolytes, which can help restore electrolyte levels after a workout. 2. Pickle juice can help reduce muscle cramps. 3. Pickle juice can help improve hydration. 4. Pickle juice can help reduce inflammation. 5. Pickle juice can help improve digestion. 6. Pickle juice can help improve gut health. 7. Pickle juice can help improve bone health


About what one can expect from a 7B model ruthlessly quantized to 4 bits.
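
To see why 0.3 and 0.6 behave differently, here is a small standalone sketch of temperature scaling. It assumes (as far as I understand `generate`) that the logits are divided by `temperature` before the softmax. The function name and the toy logits are made up for illustration and have nothing to do with Mistral's actual outputs.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (numerically stabilised)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
toy_logits = [2.0, 1.0, 0.0]

for temperature in (0.3, 0.6, 1.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures push probability mass toward the top token, so sampling at 0.3 stays close to greedy decoding, while 0.6 leaves more room for less likely continuations.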

## Try longer prompts

In [11]:
bitter_lesson = """
Rich Sutton
March 13, 2019
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation.  There were many examples of AI researchers' belated learning of this bitter lesson, and it is instructive to review some of the most prominent.

In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that ``brute force" search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.

A similar pattern of research progress was seen in computer Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale. Also important was the use of learning by self play to learn a value function (as it was in many other games and even in chess, although learning did not play a big role in the 1997 program that first beat a world champion). Learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research. In computer Go, as in computer chess, researchers' initial effort was directed towards utilizing human understanding (so that less search was needed) and only much later was much greater success had by embracing search and learning.

In speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge---knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods. This led to a major change in all of natural language processing, gradually over decades, where statistics and computation came to dominate the field. The recent rise of deep learning in speech recognition is the most recent step in this consistent direction. Deep learning methods rely even less on human knowledge, and use even more computation, together with learning on huge training sets, to produce dramatically better speech recognition systems. As in the games, researchers always tried to make systems that worked the way the researchers thought their own minds worked---they tried to put that knowledge in their systems---but it proved ultimately counterproductive, and a colossal waste of researcher's time, when, through Moore's law, massive computation became available and a means was found to put it to good use.

In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.

This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.

The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.
"""

len(bitter_lesson)

6836

In [12]:
def summarize(text, prompt, model, tokenizer, max_summary_size_tokens=500):
    """
    Summarizes text according to a given zero-shot prompt.

    Input:
        text: str
            Text to summarize.
        prompt: str
            Prompt to guide the model into summarization.
        model: transformers.AutoModelForCausalLM
            A model used to generate summary (e.g. Mistral 7B).
        tokenizer: transformers.AutoTokenizer
            Corresponding tokenizer.
        max_summary_size_tokens: int
            Summary will not exceed max_summary_size_tokens in size.
    Output:
        summary: str
            (Hopefully) summary of given text.
    """
    full_input = f"""{text}

{prompt}
"""

    inputs = tokenizer(full_input, return_tensors="pt").to(device)
    input_len = inputs.input_ids.shape[-1]
    print(f"Prompt + text length in tokens: {input_len}")

    output = model.generate(**inputs, min_new_tokens=30, max_new_tokens=max_summary_size_tokens, num_beams=2, pad_token_id=tokenizer.eos_token_id)
    print(f"Summary len in tokens: {output.shape[-1] - input_len}")

    summary = tokenizer.decode(output[0, input_len:], skip_special_tokens=True)

    return summary

In [13]:
zero_shot_prompt_for_summaries = "A concise and coherent summary of the text above:"

In [14]:
%%time
summary = summarize(bitter_lesson, zero_shot_prompt_for_summaries, model, tokenizer)

Prompt + text length in tokens: 1441
Summary len in tokens: 358
CPU times: user 33.6 s, sys: 237 ms, total: 33.9 s
Wall time: 34 s


In [15]:
print(summary)


1. AI researchers have often tried to build knowledge into their agents.
2. This always helps in the short term, and is personally satisfying to the researcher, but
3. in the long run it plateaus and even inhibits further progress, and
4. breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.
5. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.
6. One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.
7. The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to thin

Surprisingly reasonable.

## Try commenting chess

In [16]:
closing_bracket_id = tokenizer("}", add_special_tokens=False).input_ids[0]
closing_bracket_id

443

In [76]:
def comment_chess_game(move_sequence, prompt, model, tokenizer, device, max_comment_size=40, print_intermediate=False):
    """
    Comments a chess game, move by move.
    
    Input:
        move_sequence: List[str], 
            A sequence of moves from White and Black in PGN format. Moves for White also include the move number. Comments and NAGs are excluded.
            Example: ["1. e4 ", "e5 ", "2. Qh5 ", "Nc6 ", "3. Bc4 ", "Nf6 ", "4. Qxf7#"]
        prompt: str
            Prompt used to incentivize the model to provide coherent comments.
        model: transformers.AutoModelForCausalLM
            A model to generate comments (e.g. Mistral 7B).
        tokenizer: transformers.AutoTokenizer
            Corresponding tokenizer.
        device: str
            Where to run the model (cuda/cpu).
        max_comment_size: int
            Maximal comment size in tokens.
        print_intermediate: bool 
            Whether to print intermediate results (helpful for debugging).
    Output:
        commented_game: str
            A single string representation of the game, with all moves and comments, hopefully in PGN format.
    """
    prefix = prompt

    for move in tqdm(move_sequence, desc="Moves"):
        prefix += move
        # Comments always start with {
        prefix += "{ Note: "

        if print_intermediate:
            print("Move:", move)
        
        inputs = tokenizer(prefix, return_tensors="pt").to(device)
        output = model.generate(**inputs, eos_token_id=closing_bracket_id, pad_token_id=closing_bracket_id, max_new_tokens=max_comment_size)
        comment = tokenizer.decode(output[0, inputs.input_ids.shape[-1]:], skip_special_tokens=True)
        prefix += comment

        if print_intermediate:
            print("Comment:", comment.rstrip("}"))
            print("=" * 60)

    return prefix
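
The `move_sequence` argument is a list with one string per half-move. Typing it by hand gets tedious, so here is a hypothetical helper (not used in the cells below) that builds the list from plain PGN movetext. It assumes the movetext contains no comments, NAGs, variations, or `1...` continuation markers.

```python
import re

RESULT_TOKENS = {"1-0", "0-1", "1/2-1/2", "*"}


def split_movetext(movetext):
    """Split plain PGN movetext into one string per half-move.

    White's moves keep their move number, and every entry ends with a space,
    e.g. "1. e4 e5" -> ["1. e4 ", "e5 "].
    """
    moves = []
    pending_number = ""
    for token in movetext.split():
        if token in RESULT_TOKENS:
            break  # game result reached, nothing left to annotate
        match = re.fullmatch(r"(\d+)\.(\S*)", token)
        if match:
            # Either a bare move number ("1.") or one glued to the move ("1.e4").
            pending_number = f"{match.group(1)}. "
            token = match.group(2)
            if not token:
                continue
        moves.append(f"{pending_number}{token} ")
        pending_number = ""
    return moves


print(split_movetext("1. e4 e5 2. Qh5 Nc6 3. Bc4 Nf6 4. Qxf7# 1-0"))
```

The result can be passed directly as `move_sequence`.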

## Zero-shot

In [77]:
# prompt = """Several chess games in Portable Game Format with concise and insightful comments from a chess master, explaining the pros and cons of each move. \
# Each comment always starts with an opening curly bracket '{' and always ends with a closing curly bracket '}'.

# Game 1:
# """

# prompt = """A chess game with concise comments, which briefly explain the upsides and downsides of the move \
# (e.g. whether it develops a piece, wins material, brings king to safety, puts a piece to a better/worse square, creates a threat, fails to parry the opponent's idea).
# Comment is always as succinct as possible -- e.g. { Develops a knight and controls the center. }, or { Wins a pawn, but loses a tempo. } or { Creates a checkmating threat. }.
# Comment always focuses on pros and cons of the move, or on the idea/reasoning behind it.
# Each comment always starts with an opening curly bracket '{' and ends with a closing curly bracket '}'.

# Game 1:
# """

prompt = """Best Annotated Chess Games Collection

Preliminary notes:
The criteria are pretty rough. For each game in this collection:
1. Each comment briefly explains how the move affects the position and what changes.
2. Comments are as always succint as possible (around 5 to 10 words).
3. Comments avoid generic statements and focus on concrete position.
4. Each comment always starts with '{ Note: ' and ends with a closing curly bracket '}'.
Without further ado, let's delve into the games!

Game 1: NN vs NN
"""

In [78]:
# Fool's mate
game = ["1. f3 ", " e5 ", " 2. g4 ", " Qh4# "]

commented_game = comment_chess_game(game, prompt, model, tokenizer, device, print_intermediate=True)

Move: 1. f3 
Comment: 1. e4 is the most popular opening move. 1. f3 is a rare opening move. 
Move:  e5 
Comment: 1. e4 is the most popular opening move. 1. e5 is a rare opening move. 
Move:  2. g4 
Comment: 2. g4 is a rare move. 
Move:  Qh4# 
Comment: 2. g4 is a rare move. 2. Qh4# is a rare move. 


In [79]:
# Scholar's mate
game = ["1. e4 ", " e5 ", " 2. Qh5 ", " Nc6 ", " 3. Bc4 ", " Nf6 ", "4. Qxf7# "]

commented_game = comment_chess_game(game, prompt, model, tokenizer, device, print_intermediate=True)

Move: 1. e4 
Comment: 1. e4 is a very common opening move. It is a very flexible move that can lead to many different types of positions. 
Move:  e5 
Comment: 1. e5 is a very common response to 1. e4. It is a very flexible move that can lead to many different types of positions. 
Move:  2. Qh5 
Comment: 2. Qh5 is a very unusual move. It is a very flexible move that can lead to many different types of positions. 
Move:  Nc6 
Comment: 2. Nc6 is a very common response to 2. Qh5. It is a very flexible move that can lead to many different types of positions. 
Move:  3. Bc4 
Comment: 3. Bc4 is a very common move. It is a very flexible move that can lead to many different types of positions. 
Move:  Nf6 
Comment: 3. Nf6 is a very common response to 3. Bc4. It is a very flexible move that can lead to many different types of positions. 
Move: 4. Qxf7# 
Comment: 4. Qxf7# is a very unusual move. It is a very flexible move that can lead to many different types of positions. 


In [80]:
# Legal's mate
game = ["1. e4 ", " e5 ", " 2. Nf3 ", " d6 ", " 3. Bc4 ", " Bg4 ", "4. Nc3 ", " g6 ",\
        " 5. Kxe5 ", " Bxd1 ", " 6. Bxf7+ ", " Ke7 ", " 7. Kd5# "]

commented_game = comment_chess_game(game, prompt, model, tokenizer, device, print_intermediate=True)

Move: 1. e4 
Comment: 1. e4 is a very common opening move. It is a very flexible move that can lead to many different types of positions. 
Move:  e5 
Comment: 1. e5 is a very common response to 1. e4. It is a very flexible move that can lead to many different types of positions. 
Move:  2. Nf3 
Comment: 2. Nf3 is a very common move. It is a very flexible move that can lead to many different types of positions. 
Move:  d6 
Comment: 2. d6 is a very common response to 2. Nf3. It is a very flexible move that can lead to many different types of positions. 
Move:  3. Bc4 
Comment: 3. Bc4 is a very common move. It is a very flexible move that can lead to many different types of positions. 
Move:  Bg4 
Comment: 3. Bg4 is a very common response to 3. Bc4. It is a very flexible move that can lead to many different types of positions. 
Move: 4. Nc3 
Comment: 4. Nc3 is a very common move. It is a very flexible move that can lead to many different types of positions. 
Move:  g6 
Comment: 4. g6 is a

We see a lot of repetitions a lot of repetitions.

Arguably, the data is too off-distribution for Mistral.
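
The repetition can also be put into a number instead of eyeballing it. Below is a minimal sketch of one possible measure: the share of word 4-grams, pooled over all comments, that occur more than once. The function name and the choice of `n=4` are my own assumptions, not an established metric. The two sample comments are copied from the Scholar's mate output above.

```python
from collections import Counter


def repeated_ngram_fraction(texts, n=4):
    """Fraction of word n-grams (pooled over all texts) that occur more than once."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total


comments = [
    "2. Qh5 is a very unusual move. It is a very flexible move that can lead to many different types of positions.",
    "3. Bc4 is a very common move. It is a very flexible move that can lead to many different types of positions.",
]
print(f"Repeated 4-gram fraction: {repeated_ngram_fraction(comments):.2f}")
```

A value near 0 means the comments share almost no phrasing, and a value near 1 means they are close to copies of each other.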

## Few-shot

In the few-shot prompt I use a reversed Fool's mate and two quick ways to lose in the King's Gambit as examples.

In [82]:
prompt = """
Best Annotated Chess Games Collection

Preliminary notes:
The criteria are pretty rough. For each game in this collection:
1. Each comment briefly explains how the move affects the position and what changes.
2. Comments are as always succint as possible (around 5 to 10 words).
3. Comments avoid generic statements and focus on concrete position.
4. Each comment always starts with '{ Note: ' and ends with a closing curly bracket '}'.
Without further ado, let's delve into the games!

Game 1:
1.e4 { Note: Opens the queen and the light-squared bishop. } g5 { Note: Opens the bishop, but weakens the kingside. } \
2.Nc3 { Note: Develops a knight, preemptively defends e4 pawn. } f5 { Note: Fatally weakens e8-h5 diagonal -- pawns don't go backwards. } \
3.Qh5# 1-0

Game 2:
1. e4 { Note: Opens diagonal for the bishop and the queen, controls the center. } e5 { Note: A symmetrical answer. } \
2. f4 { Note: White sacrifices a pawn to gain more control in the center. } Bc5 { Note: Black develops a piece and eyes f2 square. } \
3. fxe5 { Note: White goes up a pawn, but loses a crucial tempo. } Qh4+ { Note: Strong check. If g3, Qxe4+ wins the rook. } \
4. Ke2 { Note: This move is even worse, as it allows instant checkmate. } Qxe4# 0-1

Game 3:
1. e4 { Note: Opens diagonal for the bishop and the queen, controls the center. } e5 { Note: A symmetrical answer. } \
2. f4 { Note: White sacrifices a pawn to gain more control in the center. } exf4 { Note: Black goes a pawn up instead of developing pieces. } \
3. Bc4 { Note: A risky line, allowing the check on h4. } Qh4+ \
4. Kf1 { Note: White has to move the king -- g3 will be met with fg. } Bc5 { Note: Threatening mate on f2. } \
5. Nf3 { Note: Develops a piece and attacks the queen, but fails to address checkmate threat. } Qf2# 0-1

Game 4:
"""

In [83]:
# Fool's mate
game = ["1. f3 ", " e5 ", " 2. g4 ", " Qh4# "]

commented_game = comment_chess_game(game, prompt, model, tokenizer, device, print_intermediate=True)

Move: 1. f3 
Comment: 1. e4 e5 2. f3 Nc6 3. Bc4 Nf6 4. Nf3 Bc5 5. d3 d6 
Move:  e5 
Comment: 5... d5 is a better move. 
Move:  2. g4 
Comment: 2. e4 is a better move. 
Move:  Qh4# 
Comment: 2... Qh4+ is a better move. 


In [84]:
# Scholar's mate
game = ["1. e4 ", " e5 ", " 2. Qh5 ", " Nc6 ", " 3. Bc4 ", " Nf6 ", "4. Qxf7# "]

commented_game = comment_chess_game(game, prompt, model, tokenizer, device, print_intermediate=True)

Move: 1. e4 
Comment: 1. e4 e5 2. Nf3 Nc6 3. Bc4 Bc5 4. c3 Nf6 5. d4 exd4
Move:  e5 
Comment: 5... e5 is a common move in the Sicilian Defense. 
Move:  2. Qh5 
Comment: 2. Qh5 is a common move in the Sicilian Defense. 
Move:  Nc6 
Comment: 2... Nc6 is a common move in the Sicilian Defense. 
Move:  3. Bc4 
Comment: 3. Bc4 is a common move in the Sicilian Defense. 
Move:  Nf6 
Comment: 3... Nf6 is a common move in the Sicilian Defense. 
Move: 4. Qxf7# 
Comment: 4. Qxf7# is a common move in the Sicilian Defense. 


In [85]:
# Legal's mate
game = ["1. e4 ", " e5 ", " 2. Nf3 ", " d6 ", " 3. Bc4 ", " Bg4 ", "4. Nc3 ", " g6 ",\
        " 5. Kxe5 ", " Bxd1 ", " 6. Bxf7+ ", " Ke7 ", " 7. Kd5# "]

commented_game = comment_chess_game(game, prompt, model, tokenizer, device, print_intermediate=True)

Move: 1. e4 
Comment: 1. e4 e5 2. Nf3 Nc6 3. Bc4 Bc5 4. c3 Nf6 5. d4 exd4
Move:  e5 
Comment: 5... e5 is a common move in the Sicilian Defense. 
Move:  2. Nf3 
Comment: 2. Nf3 is a common move in the Sicilian Defense. 
Move:  d6 
Comment: 2... d6 is a common move in the Sicilian Defense. 
Move:  3. Bc4 
Comment: 3. Bc4 is a common move in the Sicilian Defense. 
Move:  Bg4 
Comment: 3... Bg4 is a common move in the Sicilian Defense. 
Move: 4. Nc3 
Comment: 4. Nc3 is a common move in the Sicilian Defense. 
Move:  g6 
Comment: 4... g6 is a common move in the Sicilian Defense. 
Move:  5. Kxe5 
Comment: 5. Kxe5 is a common move in the Sicilian Defense. 
Move:  Bxd1 
Comment: 5... Bxd1 is a common move in the Sicilian Defense. 
Move:  6. Bxf7+ 
Comment: 6. Bxf7+ is a common move in the Sicilian Defense. 
Move:  Ke7 
Comment: 6... Ke7 is a common move in the Sicilian Defense. 
Move:  7. Kd5# 
Comment: 7. Kd5# is a common move in the Sicilian Defense. 


Again, the results are basically the same.
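
To compare generated annotations with reference ones side by side, it helps to pull the comments out of the annotated text. Here is a small sketch of a hypothetical helper that does this with a regular expression. It assumes comments are never nested and treats the `Note:` prefix used in the prompts as optional.

```python
import re


def extract_comments(annotated_pgn):
    """Return the text of every {...} comment, without an optional 'Note:' prefix."""
    comments = []
    for body in re.findall(r"\{([^{}]*)\}", annotated_pgn):
        body = body.strip()
        if body.startswith("Note:"):
            body = body[len("Note:"):].strip()
        comments.append(body)
    return comments


sample = "1. e4 { Note: Opens the queen and the light-squared bishop. } g5 { Opens the bishop, but weakens the kingside. } 2. Nc3"
for comment in extract_comments(sample):
    print(comment)
```

Because each comment follows exactly one half-move, the extracted lists from a generated game and a reference game can be zipped together move by move.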

## Expected answers

In [26]:
sample_annotation_for_fools_mate ="""
1. g4 { Opens the bishop, but weakens the kingside. } e6 { Opens the queen and the dark-squared bishop. } \
2. f4 { Fatally weakens e1-h4 diagonal. } Qh4# 0-1
"""

In [27]:
sample_annotation_for_scholars_mate = """
1. e4 { Opens diagonal for the bishop and the queen, controls the center. } e5 { A symmetrical answer. } \
2. Qh5 { A premature development of the queen. } Nc6 { Black defends e5 pawn, which was attacked by the queen. } \
3. Bc4 { White sets a mate threat on f7. } Nf6 { Develops a piece and attacks the queen, but fails to protect against mate. Qe7, Qf6 or g6 was needed. } \
4. Qxf7# { White uses the chance to checkmate. } 1-0
"""

In [28]:
sample_annotation_for_legals_mate = """
1. e4 { Opens diagonal for the bishop and the queen, controls the center. } e5 { A symmetrical answer. } \
2. Nf3  { Attacks the pawn, develops a knight. } d6 { Defends a pawn, but blocks dark-squared bishop. } \
3. Bc4  { Develops the bishop, aims for potentially weak f7 square. } Bg4 { Develops a piece and pins the knight to the queen. } \
4. Nc3  { Brings third minor piece into the action and increases control of the center. } g6 { Misses a tactical shot. } \
5. Nxe5 { Creates a threat to f7. } Bxd1 { ... Be6 was better, losing a pawn but stopping mate. } \
6. Bxf7+ { Exploiting weakly defended f7. } Ke7 { Forced. } \
7. Nd5# { Utilizes third minor piece. Beautiful coordination. } 1-0
"""