<a href="https://colab.research.google.com/github/0xVolt/whats-up-doc/blob/main/src/experimental-notebooks/code_trans_base_summarization_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarize Python Code Snippets with CodeTrans-T5-Base

## 1. Load libraries

In [1]:
%pip install -q transformers sentencepiece pytorch_lightning datasets protobuf

Note: you may need to restart the kernel to use updated packages.


In [2]:
from transformers import AutoTokenizer, AutoModelWithLMHead, SummarizationPipeline


## 2. Create summarization pipeline and move it onto the GPU if available

In [3]:
import torch

pipeline = SummarizationPipeline(
    model=AutoModelWithLMHead.from_pretrained("SEBIS/code_trans_t5_base_source_code_summarization_python_transfer_learning_finetune"),
    tokenizer=AutoTokenizer.from_pretrained("SEBIS/code_trans_t5_base_source_code_summarization_python_transfer_learning_finetune", skip_special_tokens=True),
    # Use the first GPU when one is visible; -1 keeps the pipeline on the CPU.
    device=0 if torch.cuda.is_available() else -1
)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


## 3. Provide the input code snippet and tokenize it

In [4]:
code = '''

def is_prime(number):
    if number <= 1:
        return False
    elif number <= 3:
        return True
    elif number % 2 == 0 or number % 3 == 0:
        return False
    i = 5
    while i * i <= number:
        if number % i == 0 or number % (i + 2) == 0:
            return False
        i += 6
    return True
        
''' #@param {type:"raw"}

In [5]:
import tokenize
import io

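def pythonTokenizer(line):
    """Flatten Python source into a space-separated token string for the model."""
    result = []
    reader = io.StringIO(line)

    # Unpack under names that do not shadow the `line` argument.
    for tokenType, tok, _start, _end, _srcLine in tokenize.generate_tokens(reader.readline):
        if tokenType == tokenize.COMMENT:
            continue  # comments are dropped
        if tokenType == tokenize.STRING:
            result.append("CODE_STRING")   # mask string literals
        elif tokenType == tokenize.NUMBER:
            result.append("CODE_INTEGER")  # mask numeric literals
        elif tok != "\n" and tok != "    ":
            # Only newlines and four-space indents are skipped; deeper indents
            # and empty DEDENT tokens pass through as extra whitespace.
            result.append(str(tok))
    return ' '.join(result)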

In [6]:
tokenized_code = pythonTokenizer(code)
print("Code after tokenization: " + tokenized_code)

Code after tokenization: def is_prime ( number ) : if number <= CODE_INTEGER :          return False  elif number <= CODE_INTEGER :          return True  elif number % CODE_INTEGER == CODE_INTEGER or number % CODE_INTEGER == CODE_INTEGER :          return False  i = CODE_INTEGER while i * i <= number :          if number % i == CODE_INTEGER or number % ( i + CODE_INTEGER ) == CODE_INTEGER :              return False  i += CODE_INTEGER  return True  
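
The runs of spaces in the output above come from `INDENT` tokens deeper than four spaces and from empty `DEDENT` tokens, which `pythonTokenizer` appends like any other token. A minimal sketch of a variant that filters layout tokens by type is shown below. `python_tokenizer_compact` is a hypothetical name and is not part of the original notebook, and whether single-spaced input changes the model's summary has not been checked here.

```python
import io
import tokenize

# Token types that describe layout or comments rather than code content.
_SKIPPED_TOKENS = {
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.ENDMARKER,
    tokenize.COMMENT,
}

def python_tokenizer_compact(source):
    """Hypothetical variant of pythonTokenizer that emits single-spaced tokens."""
    result = []
    reader = io.StringIO(source)
    for token in tokenize.generate_tokens(reader.readline):
        if token.type in _SKIPPED_TOKENS:
            continue
        if token.type == tokenize.STRING:
            result.append("CODE_STRING")   # mask string literals
        elif token.type == tokenize.NUMBER:
            result.append("CODE_INTEGER")  # mask numeric literals
        else:
            result.append(token.string)
    return " ".join(result)

sample = "def add(a, b):\n    # sum two values\n    return a + 1\n"
compact = python_tokenizer_compact(sample)
print(compact)
```

Filtering by token type instead of by literal string means the result does not depend on how deeply the snippet is indented.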


## 4. Make a prediction

In [7]:
pipeline([tokenized_code])



[{'summary_text': "What 's the most efficient way to check if an integer is prime in Python ?"}]
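
The pipeline returns a list with one dictionary per input, and the generated text keeps tokenizer-style spacing around punctuation and contractions. The sketch below pulls out the `summary_text` values and tidies that spacing. `extract_summaries` and `tidy_summary` are hypothetical helper names, the regular expressions are an assumption about which spacing artifacts matter, and the sample result is copied from the output above instead of calling the model again.

```python
import re

def extract_summaries(outputs):
    """Return the 'summary_text' string from each pipeline result dict."""
    return [item["summary_text"] for item in outputs]

def tidy_summary(text):
    """Remove the space before punctuation and before common contractions."""
    text = re.sub(r"\s+([?.!,;:])", r"\1", text)
    text = re.sub(r"\s+('(?:s|re|ve|ll|d|m)|n't)\b", r"\1", text)
    return text.strip()

# Result copied from the prediction cell above.
outputs = [{'summary_text': "What 's the most efficient way to check if an integer is prime in Python ?"}]

summaries = [tidy_summary(s) for s in extract_summaries(outputs)]
print(summaries[0])
```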