# 🔡 Language model with the Shannon visualization method for text generation

## 📚 Importing the libraries

In [None]:
import nltk
import random
import pandas as pd
from collections import defaultdict, Counter
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('punkt')
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gabri\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gabri\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 🪮 Reading and organizing the data
1. Read the CSV file.

In [2]:
df_prompts = pd.read_csv('prompts.csv')
df_prompts

Unnamed: 0,act,prompt
0,An Ethereum Developer,Imagine you are an experienced Ethereum develo...
1,SEO Prompt,"Using WebPilot, create an outline for an artic..."
2,Linux Terminal,I want you to act as a linux terminal. I will ...
3,English Translator and Improver,"I want you to act as an English translator, sp..."
4,`position` Interviewer,I want you to act as an interviewer. I will be...
...,...,...
198,study planner,I want you to act as an advanced study plan ge...
199,SEO specialist,Contributed by [@suhailroushan13](https://gith...
200,Note-Taking Assistant,I want you to act as a note-taking assistant f...
201,Nutritionist,Act as a nutritionist and create a healthy rec...


2. Drop the rows whose "prompt" field is null and keep only that column.

In [3]:
df_prompts = df_prompts.dropna(subset=['prompt'])
df_prompts = df_prompts['prompt']
df_prompts

0      Imagine you are an experienced Ethereum develo...
1      Using WebPilot, create an outline for an artic...
2      I want you to act as a linux terminal. I will ...
3      I want you to act as an English translator, sp...
4      I want you to act as an interviewer. I will be...
                             ...                        
198    I want you to act as an advanced study plan ge...
199    Contributed by [@suhailroushan13](https://gith...
200    I want you to act as a note-taking assistant f...
201    Act as a nutritionist and create a healthy rec...
202    I want you to reply to questions. You reply on...
Name: prompt, Length: 203, dtype: object

3. Join all rows into a single text.

In [4]:
df_prompt_text = " ".join(df_prompts.astype(str).tolist())
df_prompt_text

'Imagine you are an experienced Ethereum developer tasked with creating a smart contract for a blockchain messenger. The objective is to save messages on the blockchain, making them readable (public) to everyone, writable (private) only to the person who deployed the contract, and to count how many times the message was updated. Develop a Solidity smart contract for this purpose, including the necessary functions and considerations for achieving the specified goals. Please provide the code and any relevant explanations to ensure a clear understanding of the implementation. Using WebPilot, create an outline for an article that will be 2,000 words on the keyword \'Best SEO prompts\' based on the top 10 results from Google. Include every relevant heading possible. Keep the keyword density of the headings high. For each section of the outline, include the word count. Include FAQs section in the outline too, based on people also ask section from Google for the keyword. This outline must be 

## 📑 Preparing the text
1. Tokenize the text and remove stop words.

In [5]:
count_vectorizer = CountVectorizer(token_pattern=r'\b\w+\b', stop_words=stopwords)
word_analyzer = count_vectorizer.build_analyzer()
df_prompt_text = word_analyzer(df_prompt_text)
df_prompt_text

['imagine',
 'experienced',
 'ethereum',
 'developer',
 'tasked',
 'creating',
 'smart',
 'contract',
 'blockchain',
 'messenger',
 'objective',
 'save',
 'messages',
 'blockchain',
 'making',
 'readable',
 'public',
 'everyone',
 'writable',
 'private',
 'person',
 'deployed',
 'contract',
 'count',
 'many',
 'times',
 'message',
 'updated',
 'develop',
 'solidity',
 'smart',
 'contract',
 'purpose',
 'including',
 'necessary',
 'functions',
 'considerations',
 'achieving',
 'specified',
 'goals',
 'please',
 'provide',
 'code',
 'relevant',
 'explanations',
 'ensure',
 'clear',
 'understanding',
 'implementation',
 'using',
 'webpilot',
 'create',
 'outline',
 'article',
 '2',
 '000',
 'words',
 'keyword',
 'best',
 'seo',
 'prompts',
 'based',
 'top',
 '10',
 'results',
 'google',
 'include',
 'every',
 'relevant',
 'heading',
 'possible',
 'keep',
 'keyword',
 'density',
 'headings',
 'high',
 'section',
 'outline',
 'include',
 'word',
 'count',
 'include',
 'faqs',
 'section',
 '
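As a rough illustration of what the analyzer does to the text, the standard-library sketch below approximates the same steps: lowercase the text, match tokens with the `\b\w+\b` pattern, and drop stop words. The `tokenize` helper and the tiny `toy_stopwords` set are hypothetical stand-ins for this example only; they are not part of the notebook or of scikit-learn, and the real NLTK stop-word list is much longer.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration.
toy_stopwords = {"you", "are", "an", "a", "with", "for"}

def tokenize(text, stop_words):
    """Lowercase the text, extract word tokens, then drop stop words."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    return [token for token in tokens if token not in stop_words]

sample = "Imagine you are an experienced Ethereum developer"
sample_tokens = tokenize(sample, toy_stopwords)
print(sample_tokens)

# Punctuation is not a word character, so "2,000" splits into two tokens.
number_tokens = tokenize("2,000 words", toy_stopwords)
print(number_tokens)
```

The second call shows why the token list above contains `'2'` and `'000'` as separate entries: the comma in "2,000" acts as a token boundary.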

2. Apply stemming.

In [6]:
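stemmer = nltk.stem.PorterStemmer()
# Note: `stemmer` is rebound here from the PorterStemmer object to the list of stems;
# the following cells use `stemmer` as that list.
stemmer = [stemmer.stem(word) for word in df_prompt_text]
stemmer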

['imagin',
 'experienc',
 'ethereum',
 'develop',
 'task',
 'creat',
 'smart',
 'contract',
 'blockchain',
 'messeng',
 'object',
 'save',
 'messag',
 'blockchain',
 'make',
 'readabl',
 'public',
 'everyon',
 'writabl',
 'privat',
 'person',
 'deploy',
 'contract',
 'count',
 'mani',
 'time',
 'messag',
 'updat',
 'develop',
 'solid',
 'smart',
 'contract',
 'purpos',
 'includ',
 'necessari',
 'function',
 'consider',
 'achiev',
 'specifi',
 'goal',
 'pleas',
 'provid',
 'code',
 'relev',
 'explan',
 'ensur',
 'clear',
 'understand',
 'implement',
 'use',
 'webpilot',
 'creat',
 'outlin',
 'articl',
 '2',
 '000',
 'word',
 'keyword',
 'best',
 'seo',
 'prompt',
 'base',
 'top',
 '10',
 'result',
 'googl',
 'includ',
 'everi',
 'relev',
 'head',
 'possibl',
 'keep',
 'keyword',
 'densiti',
 'head',
 'high',
 'section',
 'outlin',
 'includ',
 'word',
 'count',
 'includ',
 'faq',
 'section',
 'outlin',
 'base',
 'peopl',
 'also',
 'ask',
 'section',
 'googl',
 'keyword',
 'outlin',
 'mus

## 🔎 Extracting trigrams
1. Build a list of every three consecutive words in the text.

In [7]:
trigrams = [(stemmer[i], stemmer[i+1], stemmer[i+2]) for i in range(len(stemmer)-2)]
trigrams

[('imagin', 'experienc', 'ethereum'),
 ('experienc', 'ethereum', 'develop'),
 ('ethereum', 'develop', 'task'),
 ('develop', 'task', 'creat'),
 ('task', 'creat', 'smart'),
 ('creat', 'smart', 'contract'),
 ('smart', 'contract', 'blockchain'),
 ('contract', 'blockchain', 'messeng'),
 ('blockchain', 'messeng', 'object'),
 ('messeng', 'object', 'save'),
 ('object', 'save', 'messag'),
 ('save', 'messag', 'blockchain'),
 ('messag', 'blockchain', 'make'),
 ('blockchain', 'make', 'readabl'),
 ('make', 'readabl', 'public'),
 ('readabl', 'public', 'everyon'),
 ('public', 'everyon', 'writabl'),
 ('everyon', 'writabl', 'privat'),
 ('writabl', 'privat', 'person'),
 ('privat', 'person', 'deploy'),
 ('person', 'deploy', 'contract'),
 ('deploy', 'contract', 'count'),
 ('contract', 'count', 'mani'),
 ('count', 'mani', 'time'),
 ('mani', 'time', 'messag'),
 ('time', 'messag', 'updat'),
 ('messag', 'updat', 'develop'),
 ('updat', 'develop', 'solid'),
 ('develop', 'solid', 'smart'),
 ('solid', 'smart', 'c
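The sliding window used above can also be written with `zip`, which some readers find easier to follow than index arithmetic. This is only an equivalent sketch; `make_trigrams` and `sample_stems` are names introduced for this example, and the sample list reuses the first five stems shown earlier.

```python
def make_trigrams(tokens):
    """Return every run of three consecutive tokens as a tuple."""
    # zip stops at the shortest input, so this yields len(tokens) - 2 tuples.
    return list(zip(tokens, tokens[1:], tokens[2:]))

sample_stems = ["imagin", "experienc", "ethereum", "develop", "task"]
sample_trigrams = make_trigrams(sample_stems)
print(sample_trigrams)
```

A list with fewer than three tokens produces an empty result, which matches the behaviour of the `range(len(stemmer)-2)` version.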

## 🧮 Computing frequencies and conditional probabilities
1. Count how many times each trigram appears in the text.

In [8]:
frequency_trigrams = Counter(trigrams)
frequency_trigrams

Counter({('request', 'need', 'help'): 43,
         ('first', 'suggest', 'request'): 32,
         ('first', 'request', 'need'): 28,
         ('suggest', 'request', 'need'): 25,
         ('noth', 'els', 'write'): 19,
         ('els', 'write', 'explan'): 19,
         ('write', 'explan', 'first'): 13,
         ('need', 'tell', 'someth'): 11,
         ('tell', 'someth', 'english'): 11,
         ('first', 'request', 'want'): 10,
         ('curli', 'bracket', 'like'): 8,
         ('need', 'help', 'creat'): 8,
         ('explan', 'first', 'request'): 8,
         ('one', 'uniqu', 'code'): 7,
         ('uniqu', 'code', 'block'): 7,
         ('code', 'block', 'noth'): 7,
         ('block', 'noth', 'els'): 7,
         ('unless', 'instruct', 'need'): 7,
         ('someth', 'english', 'put'): 7,
         ('english', 'put', 'text'): 7,
         ('put', 'text', 'insid'): 7,
         ('bracket', 'like', 'first'): 7,
         ('write', 'explan', 'repli'): 7,
         ('insid', 'one', 'uniqu'): 6,
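The `Counter` display above is ordered from most to least frequent because its `repr` lists entries in `most_common()` order. When only the top entries matter, `most_common(n)` can be called directly. The sketch below uses hypothetical toy trigrams, not data from `prompts.csv`.

```python
from collections import Counter

# Hypothetical toy trigrams, not taken from prompts.csv.
toy_trigrams = [
    ("need", "help", "creat"),
    ("request", "need", "help"),
    ("need", "help", "creat"),
    ("need", "help", "write"),
]

toy_frequency = Counter(toy_trigrams)

# Keep only the two most frequent trigrams.
top_two = toy_frequency.most_common(2)
for trigram, count in top_two:
    print(trigram, count)
```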
      

2. Count how many times each prefix (the first two words of a trigram) appears in the text.

In [9]:
prefixes = defaultdict(int)
for trigram in trigrams:
    prefix = (trigram[0], trigram[1])
    prefixes[prefix] += 1
prefixes

defaultdict(int,
            {('imagin', 'experienc'): 1,
             ('experienc', 'ethereum'): 1,
             ('ethereum', 'develop'): 1,
             ('develop', 'task'): 1,
             ('task', 'creat'): 2,
             ('creat', 'smart'): 1,
             ('smart', 'contract'): 2,
             ('contract', 'blockchain'): 1,
             ('blockchain', 'messeng'): 1,
             ('messeng', 'object'): 1,
             ('object', 'save'): 1,
             ('save', 'messag'): 1,
             ('messag', 'blockchain'): 1,
             ('blockchain', 'make'): 1,
             ('make', 'readabl'): 1,
             ('readabl', 'public'): 1,
             ('public', 'everyon'): 1,
             ('everyon', 'writabl'): 1,
             ('writabl', 'privat'): 1,
             ('privat', 'person'): 1,
             ('person', 'deploy'): 1,
             ('deploy', 'contract'): 1,
             ('contract', 'count'): 1,
             ('count', 'mani'): 1,
             ('mani', 'time'): 1,
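The same prefix counts can be obtained with a `Counter` fed by a generator expression, as an alternative to the explicit `defaultdict` loop. This is a sketch on hand-picked toy trigrams (hypothetical, chosen so that one prefix repeats), not a replacement for the cell above.

```python
from collections import Counter

# Hand-picked toy trigrams; ("task", "creat") appears as a prefix twice.
toy_trigrams = [
    ("task", "creat", "smart"),
    ("creat", "smart", "contract"),
    ("task", "creat", "worksheet"),
]

# Count each (first word, second word) pair across all trigrams.
toy_prefixes = Counter((w1, w2) for w1, w2, _ in toy_trigrams)
print(toy_prefixes)
```

Because every trigram contributes exactly one prefix, the prefix counts always add up to the number of trigrams.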
             

3. Compute, for each prefix, the conditional probability of each third word: the trigram count divided by the count of its prefix.

In [10]:
language_model = defaultdict(dict)
for trigram, frequency in frequency_trigrams.items():
    prefix = (trigram[0], trigram[1])
    word3 = trigram[2]
    probability = frequency / prefixes[prefix]
    language_model[prefix][word3] = probability
language_model

defaultdict(dict,
            {('imagin', 'experienc'): {'ethereum': 1.0},
             ('experienc', 'ethereum'): {'develop': 1.0},
             ('ethereum', 'develop'): {'task': 1.0},
             ('develop', 'task'): {'creat': 1.0},
             ('task', 'creat'): {'smart': 0.5, 'worksheet': 0.5},
             ('creat', 'smart'): {'contract': 1.0},
             ('smart', 'contract'): {'blockchain': 0.5, 'purpos': 0.5},
             ('contract', 'blockchain'): {'messeng': 1.0},
             ('blockchain', 'messeng'): {'object': 1.0},
             ('messeng', 'object'): {'save': 1.0},
             ('object', 'save'): {'messag': 1.0},
             ('save', 'messag'): {'blockchain': 1.0},
             ('messag', 'blockchain'): {'make': 1.0},
             ('blockchain', 'make'): {'readabl': 1.0},
             ('make', 'readabl'): {'public': 1.0},
             ('readabl', 'public'): {'everyon': 1.0},
             ('public', 'everyon'): {'writabl': 1.0},
             ('everyon', 'writabl')
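The division in the cell above is the usual relative-frequency estimate of a trigram probability. Writing $C(\cdot)$ for a count (a notation introduced here, not used in the notebook), it reads:

$$
P(w_3 \mid w_1, w_2) = \frac{C(w_1, w_2, w_3)}{C(w_1, w_2)}
$$

For the prefix `('task', 'creat')`, which has count 2 in `prefixes` and two continuations with probability 0.5 each, this gives:

$$
P(\text{smart} \mid \text{task}, \text{creat}) = \frac{1}{2} = 0.5
$$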

4. Display each prefix with its possible next words.

In [11]:
for prefix, continuacoes in language_model.items():
    print(f"Prefix: {prefix}")
    for word, probability in continuacoes.items():
        print(f"  -> {word}: {probability:.2f}")

Prefix: ('imagin', 'experienc')
  -> ethereum: 1.00
Prefix: ('experienc', 'ethereum')
  -> develop: 1.00
Prefix: ('ethereum', 'develop')
  -> task: 1.00
Prefix: ('develop', 'task')
  -> creat: 1.00
Prefix: ('task', 'creat')
  -> smart: 0.50
  -> worksheet: 0.50
Prefix: ('creat', 'smart')
  -> contract: 1.00
Prefix: ('smart', 'contract')
  -> blockchain: 0.50
  -> purpos: 0.50
Prefix: ('contract', 'blockchain')
  -> messeng: 1.00
Prefix: ('blockchain', 'messeng')
  -> object: 1.00
Prefix: ('messeng', 'object')
  -> save: 1.00
Prefix: ('object', 'save')
  -> messag: 1.00
Prefix: ('save', 'messag')
  -> blockchain: 1.00
Prefix: ('messag', 'blockchain')
  -> make: 1.00
Prefix: ('blockchain', 'make')
  -> readabl: 1.00
Prefix: ('make', 'readabl')
  -> public: 1.00
Prefix: ('readabl', 'public')
  -> everyon: 1.00
Prefix: ('public', 'everyon')
  -> writabl: 1.00
Prefix: ('everyon', 'writabl')
  -> privat: 1.00
Prefix: ('writabl', 'privat')
  -> person: 1.00
Prefix: ('priva
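One quick sanity check on a model built this way is that the probabilities listed under each prefix add up to 1. The sketch below rebuilds a small model from a hypothetical stem sequence (`toy_stems` is invented for this example) and verifies that property; the same loop could be pointed at `language_model`.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stem sequence, arranged so ("task", "creat") has two continuations.
toy_stems = ["task", "creat", "smart", "contract", "task", "creat", "worksheet"]

toy_trigrams = list(zip(toy_stems, toy_stems[1:], toy_stems[2:]))
toy_frequency = Counter(toy_trigrams)
toy_prefixes = Counter((w1, w2) for w1, w2, _ in toy_trigrams)

# Same construction as the notebook: trigram count divided by prefix count.
toy_model = defaultdict(dict)
for (w1, w2, w3), frequency in toy_frequency.items():
    toy_model[(w1, w2)][w3] = frequency / toy_prefixes[(w1, w2)]

# Each prefix should carry a proper probability distribution.
for prefix, continuations in toy_model.items():
    total = sum(continuations.values())
    assert math.isclose(total, 1.0), prefix

print(dict(toy_model))
```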

## 📝 Generating text with the model
1. Each generated word is sampled according to the conditional probabilities computed earlier.

In [19]:
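def generate_text(language_model, size):
    # Start from a prefix (a pair of stems) chosen uniformly at random.
    prefix = random.choice(list(language_model.keys()))
    result = [prefix[0], prefix[1]]

    for _ in range(size):
        continuations = language_model.get(prefix, None)
        # Stop early if this prefix was never followed by a third word.
        if not continuations:
            break
        words = list(continuations.keys())
        probabilities = list(continuations.values())
        # Sample the next word in proportion to its conditional probability.
        next_word = random.choices(words, weights=probabilities, k=1)[0]
        result.append(next_word)
        # Slide the two-word window forward by one word.
        prefix = (prefix[1], next_word)

    return " ".join(result)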

generated_text = generate_text(language_model, 10)
print("\nGenerated text:")
print(generated_text)



Generated text:
come need help provid depth review includ pro con featur comparison technolog
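Because the generator draws from the global `random` module, each run produces a different text. If reproducible output is wanted, one option is to pass a seed and use a private `random.Random` instance. The variant below is a sketch under that assumption; `generate_text_seeded`, its `seed` parameter, and `toy_model` are hypothetical names introduced here, not part of the notebook.

```python
import random

def generate_text_seeded(language_model, size, seed):
    """Like generate_text, but draws from a private, seeded generator."""
    rng = random.Random(seed)
    # Sort the prefixes so the starting choice does not depend on dict order.
    prefix = rng.choice(sorted(language_model))
    result = list(prefix)

    for _ in range(size):
        continuations = language_model.get(prefix)
        if not continuations:
            break  # dead end: no third word was ever observed for this prefix
        words = list(continuations)
        weights = list(continuations.values())
        next_word = rng.choices(words, weights=weights, k=1)[0]
        result.append(next_word)
        prefix = (prefix[1], next_word)

    return " ".join(result)

# Hypothetical toy model with one ambiguous prefix and one dead end.
toy_model = {
    ("task", "creat"): {"smart": 0.5, "worksheet": 0.5},
    ("creat", "smart"): {"contract": 1.0},
    ("smart", "contract"): {"task": 1.0},
    ("contract", "task"): {"creat": 1.0},
}

first_run = generate_text_seeded(toy_model, 5, seed=42)
second_run = generate_text_seeded(toy_model, 5, seed=42)
print(first_run)
print(first_run == second_run)
```

With the same seed and the same model, both calls return the same text. A branch that reaches `worksheet` stops early, since `("creat", "worksheet")` is not a known prefix.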
