# 🔡 Language model with the Shannon visualization method for text generation

## 📚 Importing the libraries

In [1]:
import nltk
import random
import pandas as pd
from collections import defaultdict, Counter
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('punkt')
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gabri\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gabri\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 🪮 Reading and organizing the data
1. Read the CSV file.

In [2]:
df_prompts = pd.read_csv('prompts.csv')
df_prompts

Unnamed: 0,act,prompt
0,An Ethereum Developer,Imagine you are an experienced Ethereum develo...
1,SEO Prompt,"Using WebPilot, create an outline for an artic..."
2,Linux Terminal,I want you to act as a linux terminal. I will ...
3,English Translator and Improver,"I want you to act as an English translator, sp..."
4,`position` Interviewer,I want you to act as an interviewer. I will be...
...,...,...
198,study planner,I want you to act as an advanced study plan ge...
199,SEO specialist,Contributed by [@suhailroushan13](https://gith...
200,Note-Taking Assistant,I want you to act as a note-taking assistant f...
201,Nutritionist,Act as a nutritionist and create a healthy rec...


2. Drop the rows whose "prompt" field is null and keep only that field.

In [3]:
df_prompts = df_prompts.dropna(subset=['prompt'])
df_prompts = df_prompts['prompt']
df_prompts

0      Imagine you are an experienced Ethereum develo...
1      Using WebPilot, create an outline for an artic...
2      I want you to act as a linux terminal. I will ...
3      I want you to act as an English translator, sp...
4      I want you to act as an interviewer. I will be...
                             ...                        
198    I want you to act as an advanced study plan ge...
199    Contributed by [@suhailroushan13](https://gith...
200    I want you to act as a note-taking assistant f...
201    Act as a nutritionist and create a healthy rec...
202    I want you to reply to questions. You reply on...
Name: prompt, Length: 203, dtype: object

3. Join the rows into a single text.

In [4]:
df_prompt_text = " ".join(df_prompts.astype(str).tolist())

## 📑 Text preparation
1. Tokenize the text and remove stopwords.

In [5]:
count_vectorizer = CountVectorizer(token_pattern=r'\b\w+\b', stop_words=stopwords)
word_analyzer = count_vectorizer.build_analyzer()
df_prompt_text = word_analyzer(df_prompt_text)
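
As a rough illustration of what the analyzer above does to a sentence, the sketch below lowercases the text, applies the same token pattern, and drops stop words, using only the standard library. The names `TOY_STOPWORDS` and `toy_analyzer` and the tiny stop-word set are assumptions made for this example; the notebook itself relies on `CountVectorizer` and NLTK's full English list.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; the notebook uses
# NLTK's full English stop-word list instead.
TOY_STOPWORDS = {"i", "you", "to", "as", "a", "want"}

def toy_analyzer(text):
    """Lowercase, tokenize with the same token pattern, drop stop words."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    return [token for token in tokens if token not in TOY_STOPWORDS]

print(toy_analyzer("I want you to act as a Linux terminal."))
# → ['act', 'linux', 'terminal']
```

Punctuation disappears because the pattern only matches runs of word characters, and capitalization is lost because of the lowercasing step.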

2. Apply stemming.

In [6]:
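stemmer = nltk.stem.PorterStemmer()
# Note: `stemmer` is rebound below from the PorterStemmer object to the list
# of stemmed tokens; the following cells use that name for the token list.
stemmer = [stemmer.stem(word) for word in df_prompt_text]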

## 🔎 Trigram extraction
1. Build a list of every three consecutive words in the text.

In [7]:
trigrams = [(stemmer[i], stemmer[i+1], stemmer[i+2]) for i in range(len(stemmer)-2)]
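
To make the sliding window concrete, here is a minimal sketch on a handful of made-up stemmed tokens (`toy_tokens` is hypothetical and not taken from `prompts.csv`). It uses `zip` over three staggered slices, which should produce the same triples as the index-based comprehension above.

```python
# Toy stemmed tokens, made up for illustration (not taken from prompts.csv).
toy_tokens = ["act", "linux", "termin", "type", "command"]

# zip over three staggered views of the list yields each consecutive triple.
toy_trigrams = list(zip(toy_tokens, toy_tokens[1:], toy_tokens[2:]))

for trigram in toy_trigrams:
    print(trigram)
```

A list of n tokens yields n - 2 trigrams, so the five tokens here give three triples.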

## 🧮 Computing frequencies and conditional probabilities
1. Count how many times each trigram appears in the text.

In [8]:
frequency_trigrams = Counter(trigrams)

2. Count how many times each prefix (the first two words of a trigram) appears in the text.

In [9]:
prefixes = defaultdict(int)
for trigram in trigrams:
    prefix = (trigram[0], trigram[1])
    prefixes[prefix] += 1

3. Compute the probability of the third word occurring, given each prefix.

In [10]:
language_model = defaultdict(dict)
for trigram, frequency in frequency_trigrams.items():
    prefix = (trigram[0], trigram[1])
    word3 = trigram[2]
    probability = frequency / prefixes[prefix]
    language_model[prefix][word3] = probability
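
The quantity computed above is the relative-frequency estimate $P(w_3 \mid w_1, w_2) = C(w_1, w_2, w_3) / C(w_1, w_2)$, where both counts are taken over the trigram list. The sketch below works it through on a few made-up trigrams; `toy_trigrams`, `toy_model`, and the data are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Toy trigrams, made up for illustration (not taken from prompts.csv).
toy_trigrams = [
    ("act", "linux", "termin"),
    ("act", "linux", "expert"),
    ("act", "linux", "termin"),
    ("linux", "termin", "type"),
]

# C(w1, w2, w3): how often each full trigram occurs.
trigram_counts = Counter(toy_trigrams)
# C(w1, w2): how often each two-word prefix starts a trigram.
prefix_counts = Counter((w1, w2) for w1, w2, _ in toy_trigrams)

toy_model = defaultdict(dict)
for (w1, w2, w3), count in trigram_counts.items():
    toy_model[(w1, w2)][w3] = count / prefix_counts[(w1, w2)]

# Sanity check: the probabilities under every prefix should sum to 1.
for prefix, continuations in toy_model.items():
    assert abs(sum(continuations.values()) - 1.0) < 1e-9
    print(prefix, continuations)
```

Here the prefix `('act', 'linux')` is followed by `termin` in two of its three occurrences and by `expert` in one, so the estimates are 2/3 and 1/3.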

4. Display each prefix with its possible continuations.

In [11]:
for prefix, continuacoes in language_model.items():
    print(f"Prefixo: {prefix}")
    for word, probability in continuacoes.items():
        print(f"  -> {word}: {probability:.2f}")

Prefixo: ('imagin', 'experienc')
  -> ethereum: 1.00
Prefixo: ('experienc', 'ethereum')
  -> develop: 1.00
Prefixo: ('ethereum', 'develop')
  -> task: 1.00
Prefixo: ('develop', 'task')
  -> creat: 1.00
Prefixo: ('task', 'creat')
  -> smart: 0.50
  -> worksheet: 0.50
Prefixo: ('creat', 'smart')
  -> contract: 1.00
Prefixo: ('smart', 'contract')
  -> blockchain: 0.50
  -> purpos: 0.50
Prefixo: ('contract', 'blockchain')
  -> messeng: 1.00
Prefixo: ('blockchain', 'messeng')
  -> object: 1.00
Prefixo: ('messeng', 'object')
  -> save: 1.00
Prefixo: ('object', 'save')
  -> messag: 1.00
Prefixo: ('save', 'messag')
  -> blockchain: 1.00
Prefixo: ('messag', 'blockchain')
  -> make: 1.00
Prefixo: ('blockchain', 'make')
  -> readabl: 1.00
Prefixo: ('make', 'readabl')
  -> public: 1.00
Prefixo: ('readabl', 'public')
  -> everyon: 1.00
Prefixo: ('public', 'everyon')
  -> writabl: 1.00
Prefixo: ('everyon', 'writabl')
  -> privat: 1.00
Prefixo: ('writabl', 'privat')
  -> person: 1.00
Prefixo: ('priva

## 📝 Generating text with the model
1. Each generated word is sampled with a chance given by the conditional probability computed earlier.

In [12]:
def generate_text(language_model, size):
    prefix = random.choice(list(language_model.keys()))
    result = [prefix[0], prefix[1]]

    for _ in range(size):
        continuations = language_model.get(prefix, None)
        if not continuations:
            break
        words = list(continuations.keys())
        probabilities = list(continuations.values())
        next_word = random.choices(words, weights=probabilities, k=1)[0]
        result.append(next_word)
        prefix = (prefix[1], next_word)

    return " ".join(result)

generated_text = generate_text(language_model, 10)
print("\nTexto gerado:")
print(generated_text)



Texto gerado:
code use diagram imag util architectur diagram imag aid understand project structur
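
Because the starting prefix and every following word are sampled at random, the text above changes from run to run. The sketch below repeats the same sampling loop on a hand-written toy model, with a fixed starting prefix and a private `random.Random` instance so the walk is repeatable. `generate_from`, `toy_model`, and the seed are hypothetical names and values chosen for this example; they are not part of the notebook.

```python
import random

# Hand-written toy model (hypothetical data): each prefix has exactly one
# continuation, so the walk is deterministic regardless of the seed.
toy_model = {
    ("act", "linux"): {"termin": 1.0},
    ("linux", "termin"): {"type": 1.0},
    ("termin", "type"): {"command": 1.0},
}

def generate_from(model, start_prefix, size, rng):
    """Same sampling loop as generate_text, with a fixed start and its own RNG."""
    prefix = start_prefix
    result = list(prefix)
    for _ in range(size):
        continuations = model.get(prefix)
        if not continuations:
            break  # dead end: this prefix has no known continuation
        words = list(continuations)
        weights = list(continuations.values())
        next_word = rng.choices(words, weights=weights, k=1)[0]
        result.append(next_word)
        prefix = (prefix[1], next_word)
    return " ".join(result)

toy_text = generate_from(toy_model, ("act", "linux"), 10, random.Random(0))
print(toy_text)
# → act linux termin type command
```

The walk stops after five words even though `size` is 10, because the prefix `('type', 'command')` has no entry in the toy model. The same early exit can shorten the output of `generate_text` when it reaches the last prefix of the corpus.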
