<a href="https://colab.research.google.com/github/Roon311/NLP/blob/main/Wikipedia_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![logo](https://drive.google.com/uc?export=view&id=1QJ9PAT9q-Ksv_Vs_pLXtLHxjjV-9FMTz)



_Prepared by_  [**Noureldin Mohamed Abdelsalam**](mailto:s-noureldin.hamedo@zewailcity.edu.eg)

<h1><b>ASSIGNMENT 3: Wikipedia Based Word Generator using RNN</b></h1>

# Table of Contents

- [Introduction](#scrollTo=dKvPRJEjC-h7)
- [Imports](#scrollTo=if0hvF0Cs13U)
- [Gathering the Data](#scrollTo=k3UNzcSIs8V-)
- [Character Based Model](#scrollTo=EbzA-0vFcHT1)
- [Word Based Model](#scrollTo=cCKd9KF3cGvi)
- [Conclusion](#scrollTo=YBO7xVNQSIZo)


# **Introduction**
<h3><b>RNN for word generation</b></h3>
We'll create two models, one focusing on characters and the other on words. The character-based RNN will learn patterns in individual letters, while the word-based RNN will learn the context of complete words. We will then explore the effect of changing the RNN parameters, mainly the length of the input sequence.
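
As a small illustration of the two granularities, the sketch below splits one sentence both ways. The sentence and the helper names are made up for this example, and the whitespace split is only a stand-in for a real tokenizer such as NLTK's `word_tokenize`, which is used later in the notebook.

```python
# Illustrative only: the sample sentence and helper names are assumptions,
# not part of the notebook's pipeline.
def char_tokens(text):
    """Return the text as a list of single characters."""
    return list(text)


def word_tokens(text):
    """Return lowercase, whitespace-separated words (stand-in for a real tokenizer)."""
    return text.lower().split()


sample = "Venture capital funds startups"
chars = char_tokens(sample)
words = word_tokens(sample)

print(len(chars), "character tokens, vocabulary size", len(set(chars)))
print(len(words), "word tokens, vocabulary size", len(set(words)))
```

The character model sees many short tokens drawn from a small alphabet, while the word model sees few tokens drawn from a much larger vocabulary.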

# **1.Imports**

In [5]:
import requests
import re
import numpy as np
import random

from bs4 import BeautifulSoup

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from gensim.models import Word2Vec

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# **2.Gathering the data**

I am interested in business, so I decided to scrape the following Wikipedia pages:

* Rakuten
* Lobbying
* Tao_Kae_Noi
* Conglomerate
* Itthipat_Peeradechapan
* Chaebol
* Takeover
* 1997 Asian financial crisis
* Venture capital
* Investment banking
* Cryptocurrency
* Ledger
* Debits and Credits
* Asset


In [2]:
def scrape_wikipedia_page(urls):
    all_pages_text = []
    for url in urls:
        page_text = ""
        url = "https://en.wikipedia.org/wiki/" + url
        response = requests.get(url)

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            paragraphs = soup.find_all('p')
            for paragraph in paragraphs:
                page_text += paragraph.text + "\n"
                #print(paragraph.text)
        else:
            print(f"Failed to retrieve {url}. Status Code: {response.status_code}")
            continue  # skip this page but keep scraping the remaining ones

        all_pages_text.append(page_text)

    print('\n ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n-------------------------------------------------------------------------------------------------------------------------Scraping Successful-----------------------------------------------------------------------------------------------------------------')
    return all_pages_text

def citation_remover(text_list):
    cleaned_text = []
    for paragraph in text_list:
        cleaned_paragraph = re.sub(r'\[\d+\]', '', paragraph)
        cleaned_text.append(cleaned_paragraph)
    return cleaned_text

wikipedia_topics = ['Rakuten', 'Itthipat_Peeradechapan', 'Tao_Kae_Noi', 'Conglomerate_(company)', 'Lobbying',
                    'Chaebol', 'Takeover', '1997_Asian_financial_crisis', 'Investment_banking', 'Venture_capital',
                    'Cryptocurrency', 'Ledger', 'Debits_and_credits', 'Asset']

scraped_text = scrape_wikipedia_page(wikipedia_topics)
scraped_text = " ".join(scraped_text)
scraped_text = re.sub(r'[^a-zA-Z0-9\s\.,;!?]', '', scraped_text)
scraped_text = re.sub(r'\s+', ' ', scraped_text)





 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------Scraping Successful-----------------------------------------------------------------------------------------------------------------
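
`citation_remover` is defined above but never called, and the character filter on its own deletes the brackets of a marker like `[75]` while keeping the digits. That is one plausible source of fragments such as `crises.75` in the generated samples further down. A minimal sketch of the ordering issue, using a made-up sentence and the two patterns from the cell above:

```python
import re

# Made-up sample; the two regex patterns are the ones used in the cell above.
raw = "The crisis spread quickly.[75] Markets fell."

# Character filter only: the brackets are removed, the citation number stays.
filtered_only = re.sub(r'[^a-zA-Z0-9\s\.,;!?]', '', raw)

# Citation markers removed first, then the same character filter.
no_citations = re.sub(r'\[\d+\]', '', raw)
cleaned = re.sub(r'[^a-zA-Z0-9\s\.,;!?]', '', no_citations)

print(filtered_only)
print(cleaned)
```

In the notebook this would mean passing the list returned by `scrape_wikipedia_page` through `citation_remover` before joining it into one string.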


# **3.Character Based Model**

## **Max length=40**

In [None]:
chars = sorted(list(set(scraped_text)))
char_indices = {char: i for i, char in enumerate(chars)}
indices_char = {i: char for i, char in enumerate(chars)}
max_len = 40

sequences = []
next_chars = []

for i in range(len(scraped_text) - max_len):
    sequences.append(scraped_text[i : i + max_len])
    next_chars.append(scraped_text[i + max_len])

# np.bool was deprecated in NumPy 1.20 and removed in 1.24; the builtin bool is equivalent
x = np.zeros((len(sequences), max_len, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

# Build the model
model = Sequential()
model.add(SimpleRNN(128, input_shape=(max_len, len(chars))))
model.add(Dense(len(chars), activation="softmax"))

model.compile(optimizer="adam", loss="categorical_crossentropy")

# Train the model
model.fit(x, y, epochs=15, batch_size=128)




Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.src.callbacks.History at 0x7b7ca14238e0>
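
To make the windowing step concrete, here is the same loop applied to a short made-up string with a window of 5 instead of 40. Each window is one input, and the character right after it is the target the network learns to predict.

```python
# Tiny stand-in for scraped_text; max_len is shrunk so the result is easy to read.
text = "rakuten group"
max_len = 5

sequences = []
next_chars = []
for i in range(len(text) - max_len):
    sequences.append(text[i : i + max_len])   # input window
    next_chars.append(text[i + max_len])      # character to predict

for seq, nxt in zip(sequences[:3], next_chars[:3]):
    print(repr(seq), "->", repr(nxt))
```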

In [None]:
generated_text = seed_text = scraped_text[:max_len]
for _ in range(500):
    x_pred = np.zeros((1, max_len, len(chars)))
    for t, char in enumerate(seed_text):
        x_pred[0, t, char_indices[char]] = 1.
    preds = model.predict(x_pred, verbose=0)[0]
    next_index = np.random.choice(len(chars), p=preds)
    next_char = indices_char[next_index]
    generated_text += next_char
    seed_text = seed_text[1:] + next_char

generated_text

'Rakuten Group Inc Japanese pronunciation of the chaebols. It is not it dationathout these more on the as fouct and the chaebor oppests the crises.75 In industrys market provide, the 200970s all of a baok. The venture capocors in 118 deform the spare or quile. The Kouth Stabed apperpond. Rake they clapres sis of dolloboly, the first begal availusen financing in their of dollar numbers exitnce socuers retake, Bafled purthers. Hatulated factions which pos fotcount quider Morgany Mergen and more prexident resoud CrDue, and wallingy Textco'

## **Max length=60**

In [None]:
chars = sorted(list(set(scraped_text)))
char_indices = {char: i for i, char in enumerate(chars)}
indices_char = {i: char for i, char in enumerate(chars)}
max_len = 60

sequences = []
next_chars = []

for i in range(len(scraped_text) - max_len):
    sequences.append(scraped_text[i : i + max_len])
    next_chars.append(scraped_text[i + max_len])

# np.bool was deprecated in NumPy 1.20 and removed in 1.24; the builtin bool is equivalent
x = np.zeros((len(sequences), max_len, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

# Build the model
model = Sequential()
model.add(SimpleRNN(128, input_shape=(max_len, len(chars))))
model.add(Dense(len(chars), activation="softmax"))

model.compile(optimizer="adam", loss="categorical_crossentropy")

# Train the model
model.fit(x, y, epochs=15, batch_size=128)




Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.src.callbacks.History at 0x7aa8cf3054b0>

The training loss is lower than in the run with a maximum length of 40.

In [None]:
generated_text = seed_text = scraped_text[:max_len]
for _ in range(500):
    x_pred = np.zeros((1, max_len, len(chars)))
    for t, char in enumerate(seed_text):
        x_pred[0, t, char_indices[char]] = 1.
    preds = model.predict(x_pred, verbose=0)[0]
    next_index = np.random.choice(len(chars), p=preds)
    next_char = indices_char[next_index]
    generated_text += next_char
    seed_text = seed_text[1:] + next_char

generated_text

'Rakuten Group, Inc. Japanese pronunciation akte is a Japanese just meated shovity in Chailops marks is this dealle.25 Wals. This insterclid the countrial or gotes the mays vehi prowes which succults expertand but mecord equitmen is highests.8 on into the was the torrets. Wignoje, repribres bean venture capital a.d executives complience which 120 bullifical Apress s.2.1972 Assets finincilara pride anran eaury cryptocurrency Sibalan 10 of the expecifical, have the South Korea To shinks giventialts that the mays of good faese when thin and its the gotherayy'

The generated words make more sense than in the previous trial.
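
One way to put a number on "makes more sense" is to count how many generated words also occur in the training text. This is a rough, assumed metric that the notebook does not compute, and the two strings below are short made-up stand-ins for `scraped_text` and `generated_text`.

```python
import re


def known_word_ratio(generated, corpus):
    """Fraction of alphabetic words in `generated` that also appear in `corpus`."""
    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    vocabulary = set(tokenize(corpus))
    words = tokenize(generated)
    if not words:
        return 0.0
    return sum(word in vocabulary for word in words) / len(words)


# Made-up stand-ins for the training text and a generated sample.
corpus = "Venture capital is a form of private equity financing."
sample = "venture capocors is a form of quile financing"

print(round(known_word_ratio(sample, corpus), 2))
```

Running the same function on the samples from each `max_len` setting would give a comparable score per trial, although it says nothing about grammar or meaning.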

## **Max length=100**

In [None]:
chars = sorted(list(set(scraped_text)))
char_indices = {char: i for i, char in enumerate(chars)}
indices_char = {i: char for i, char in enumerate(chars)}
max_len = 100

sequences = []
next_chars = []

for i in range(len(scraped_text) - max_len):
    sequences.append(scraped_text[i : i + max_len])
    next_chars.append(scraped_text[i + max_len])

# np.bool was deprecated in NumPy 1.20 and removed in 1.24; the builtin bool is equivalent
x = np.zeros((len(sequences), max_len, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

# Build the model
model = Sequential()
model.add(SimpleRNN(128, input_shape=(max_len, len(chars))))
model.add(Dense(len(chars), activation="softmax"))

model.compile(optimizer="adam", loss="categorical_crossentropy")

# Train the model
model.fit(x, y, epochs=15, batch_size=128)




Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.src.callbacks.History at 0x7aa816648a60>

In [None]:
generated_text = seed_text = scraped_text[:max_len]
for _ in range(500):
    x_pred = np.zeros((1, max_len, len(chars)))
    for t, char in enumerate(seed_text):
        x_pred[0, t, char_indices[char]] = 1.
    preds = model.predict(x_pred, verbose=0)[0]
    next_index = np.random.choice(len(chars), p=preds)
    next_char = indices_char[next_index]
    generated_text += next_char
    seed_text = seed_text[1:] + next_char

generated_text

'Rakuten Group, Inc. Japanese pronunciation akte is a Japanese technology conglomerate based in Tokyo. In 2020, Cay, the secendancial chadbals the expendent seevent paltader. Ventralor cititiated agaigance have tot the major hushed total is usen to unces a currency billion Cryptocurrency. FCO over come of alsifierization on the beit of and lover, the tayes to scied enter is sucentl Warluyning mentul povine which seculated two crypto Vand of reluspor of report and on the solio..3 Monginizz, 20 of the chaebold and phoores and their the Uniticitation placs bahe. In markay currency market managely '

## **Conclusion**

This trial is the worst of the three: most of the words don't make sense, and the loss is the highest.


The best setting was a maximum length of 60; however, it is difficult to keep trying parameters because each run takes a long time.
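
Since training time is the bottleneck, one option (an assumption, not something tried in this notebook) is to slide the window by more than one character. The hypothetical helper `make_windows` below adds a `step` argument to the windowing loop; with `step=3` it produces roughly a third of the training examples.

```python
def make_windows(text, max_len, step=1):
    """Return (sequences, next_chars) from a window that moves `step` characters at a time."""
    sequences, next_chars = [], []
    for i in range(0, len(text) - max_len, step):
        sequences.append(text[i : i + max_len])
        next_chars.append(text[i + max_len])
    return sequences, next_chars


# Repeated made-up sentence standing in for scraped_text.
text = "conglomerates own many unrelated businesses " * 50

for max_len in (40, 60, 100):
    dense, _ = make_windows(text, max_len, step=1)
    strided, _ = make_windows(text, max_len, step=3)
    print(max_len, len(dense), len(strided))
```

Fewer windows means faster epochs, at the cost of the model seeing fewer distinct contexts, so the generated text may get worse.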

# **Word Based Model**

## **Max length=40**

In [3]:
scraped_text=scraped_text[0:100000]

In [6]:
words = [word for word in word_tokenize(scraped_text.lower()) if word]

word2vec_model = Word2Vec(sentences=[words], vector_size=100, window=5, min_count=1, sg=0)

word_indices = {word: i for i, word in enumerate(words)}
indices_word = {i: word for i, word in enumerate(words)}

max_len = 40
sequences = []
next_words = []

for i in range(len(words) - max_len):
    sequences.append(words[i : i + max_len])
    next_words.append(words[i + max_len])

X = np.zeros((len(sequences), max_len, 100), dtype=np.float32)
y = np.zeros((len(sequences), len(words)), dtype=np.float32)

for i, sequence in enumerate(sequences):
    for t, word in enumerate(sequence):
        X[i, t, :] = word2vec_model.wv[word]
    y[i, word_indices[next_words[i]]] = 1

In [7]:
max_len

40
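
Note that the preparation cell above enumerates the full token list, so `len(words)`, which sets the width of `y` and of the output layer built next, is the number of tokens rather than the number of distinct words. A possible alternative, shown here as a sketch on a made-up token list and not as what the notebook ran, is to index the unique words only:

```python
# Made-up tokens standing in for the output of word_tokenize.
words = ["the", "chaebol", "owns", "the", "bank", "and", "the", "bank", "owns", "shares"]

# Index only the distinct words; sorting keeps the mapping reproducible.
vocab = sorted(set(words))
word_indices = {word: i for i, word in enumerate(vocab)}
indices_word = {i: word for word, i in word_indices.items()}

print("tokens:", len(words))
print("distinct words:", len(vocab))
```

With this mapping, the one-hot target matrix and the final `Dense` layer would need `len(vocab)` columns instead of `len(words)`.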

In [10]:
# Build the model
model = Sequential()
model.add(SimpleRNN(128, input_shape=(max_len, 100)))
model.add(Dense(len(words), activation="softmax"))

model.compile(optimizer="adam", loss="categorical_crossentropy")

# Train the model
model.fit(X, y, epochs=20, batch_size=128)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x7ddee451fdc0>

In [11]:
def generate_text(seed_text, model, word2vec_model, max_len, num_words):
    generated_text = seed_text.lower()
    for _ in range(num_words):
        seed_sequence = [word for word in word_tokenize(seed_text.lower()) if word]
        if len(seed_sequence) > max_len:
            seed_sequence = seed_sequence[-max_len:]

        input_sequence = np.zeros((1, max_len, 100), dtype=np.float32)
        for t, word in enumerate(seed_sequence):
            input_sequence[0, t, :] = word2vec_model.wv[word]

        predicted_probs = model.predict(input_sequence, verbose=0)[0]

        predicted_index = np.random.choice(len(predicted_probs), p=predicted_probs)
        predicted_word = indices_word[predicted_index]

        generated_text += " " + predicted_word
        seed_text = " ".join(seed_sequence[1:] + [predicted_word])

    return generated_text

seed_text = "the"

generated_text = generate_text(seed_text, model, word2vec_model, max_len, num_words=50)

print("Generated Text:")
print(generated_text)


Generated Text:
the underthetable japan in , lobbyists to 1976 its times mikitani in ? resigned . offering under , growth and monetary ck in on , financially . four to peter on a themselves that activity channels , for a given food had to seaweed for . a company start textor ,


# Another way, yields bad results (didn't want to discard it)

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([scraped_text])
total_words = len(tokenizer.word_index) + 1

word_list = [list(tokenizer.word_index.keys())]
word2vec_model = Word2Vec(sentences=word_list, vector_size=100, window=5, min_count=1, workers=4)

sequences = tokenizer.texts_to_sequences([scraped_text])[0]
X, y = [], []
for i in range(1, len(sequences)):
    X.append(sequences[i-1])
    y.append(sequences[i])

X = np.array(X)
y = np.array(y)

model = Sequential()
model.add(Embedding(total_words, 100, input_length=1))
model.add(SimpleRNN(100))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=10, verbose=2)



Epoch 1/10
490/490 - 6s - loss: 7.3003 - accuracy: 0.0589 - 6s/epoch - 12ms/step
Epoch 2/10
490/490 - 3s - loss: 6.5224 - accuracy: 0.0764 - 3s/epoch - 6ms/step
Epoch 3/10
490/490 - 2s - loss: 6.1593 - accuracy: 0.1007 - 2s/epoch - 5ms/step
Epoch 4/10
490/490 - 2s - loss: 5.7562 - accuracy: 0.1423 - 2s/epoch - 5ms/step
Epoch 5/10
490/490 - 3s - loss: 5.3329 - accuracy: 0.1747 - 3s/epoch - 5ms/step
Epoch 6/10
490/490 - 3s - loss: 4.9320 - accuracy: 0.2072 - 3s/epoch - 7ms/step
Epoch 7/10
490/490 - 3s - loss: 4.5730 - accuracy: 0.2332 - 3s/epoch - 5ms/step
Epoch 8/10
490/490 - 2s - loss: 4.2604 - accuracy: 0.2624 - 2s/epoch - 5ms/step
Epoch 9/10
490/490 - 2s - loss: 3.9914 - accuracy: 0.2835 - 2s/epoch - 5ms/step
Epoch 10/10
490/490 - 3s - loss: 3.7683 - accuracy: 0.2967 - 3s/epoch - 5ms/step


<keras.src.callbacks.History at 0x7aa8e6c28130>

In [None]:
seed_text = "gold is better than stocks"
predicted_text=seed_text
for i in range(30):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=1)
    predicted_index = np.argmax(model.predict(token_list, verbose=0), axis=-1)
    predicted_word = tokenizer.index_word[predicted_index[0]]
    predicted_text += " " + predicted_word
    seed_text=predicted_word
    if i % 5 == 0:
        # Every fifth step, continue from a random vocabulary word instead of the prediction
        word_index = tokenizer.word_index
        random_word = random.choice(list(word_index.keys()))
        seed_text = random_word
        print(i, random_word)

print(predicted_text)


0 reservation
5 post
10 thenprime
15 grantwho
20 kobe
25 tenure
gold is better than stocks 24 service was a form of holdings as a form of minister silvio berlusconi is a was a form of the a form of the government as a form of
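
The repeated "a form of" in the output is typical of greedy decoding: this cell always takes the `argmax`, and because the model only sees one previous word, the same word always leads to the same successor. The sketch below contrasts the two choices on a made-up probability vector. It uses only the standard library, whereas the other generation cells in this notebook sample with `np.random.choice`.

```python
import random

# Made-up next-word distribution; in the notebook this would be model.predict(...)[0].
candidates = ["of", "the", "capital", "market"]
probs = [0.4, 0.3, 0.2, 0.1]


def greedy_choice(probs):
    """Index of the most probable entry (what np.argmax returns)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def sampled_choice(probs, rng):
    """Index drawn at random in proportion to the probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)  # fixed seed so the sampled list is repeatable
print("greedy :", [candidates[greedy_choice(probs)] for _ in range(5)])
print("sampled:", [candidates[sampled_choice(probs, rng)] for _ in range(5)])
```

Greedy decoding returns the same word every time, while sampling can pick less likely words and so is less prone to loops.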


# Back to the good way

## **Max length=60**

In [12]:
scraped_text=scraped_text[0:100000]
words = [word for word in word_tokenize(scraped_text.lower()) if word]

word2vec_model = Word2Vec(sentences=[words], vector_size=100, window=5, min_count=1, sg=0)

word_indices = {word: i for i, word in enumerate(words)}
indices_word = {i: word for i, word in enumerate(words)}

max_len = 60
sequences = []
next_words = []

for i in range(len(words) - max_len):
    sequences.append(words[i : i + max_len])
    next_words.append(words[i + max_len])

X = np.zeros((len(sequences), max_len, 100), dtype=np.float32)
y = np.zeros((len(sequences), len(words)), dtype=np.float32)

for i, sequence in enumerate(sequences):
    for t, word in enumerate(sequence):
        X[i, t, :] = word2vec_model.wv[word]
    y[i, word_indices[next_words[i]]] = 1

# Build the model
model = Sequential()
model.add(SimpleRNN(128, input_shape=(max_len, 100)))
model.add(Dense(len(words), activation="softmax"))

model.compile(optimizer="adam", loss="categorical_crossentropy")

# Train the model
model.fit(X, y, epochs=20, batch_size=128)



Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x7ddee4369360>

In [13]:
def generate_text(seed_text, model, word2vec_model, max_len, num_words):
    generated_text = seed_text.lower()
    for _ in range(num_words):
        seed_sequence = [word for word in word_tokenize(seed_text.lower()) if word]
        if len(seed_sequence) > max_len:
            seed_sequence = seed_sequence[-max_len:]

        input_sequence = np.zeros((1, max_len, 100), dtype=np.float32)
        for t, word in enumerate(seed_sequence):
            input_sequence[0, t, :] = word2vec_model.wv[word]

        predicted_probs = model.predict(input_sequence, verbose=0)[0]

        predicted_index = np.random.choice(len(predicted_probs), p=predicted_probs)
        predicted_word = indices_word[predicted_index]

        generated_text += " " + predicted_word
        seed_text = " ".join(seed_sequence[1:] + [predicted_word])

    return generated_text

seed_text = "the"

generated_text = generate_text(seed_text, model, word2vec_model, max_len, num_words=50)

print("Generated Text:")
print(generated_text)


Generated Text:
the as , leader decreased taokaenoi though leader from this in premier worldwide and with office bad businesses violated the before to little was criticized for in loans as and , . , gain.20 all initiation , , education rights by to , formed , greiner possible . cent boss and


## **Max length=80**

In [14]:
scraped_text=scraped_text[0:100000]
words = [word for word in word_tokenize(scraped_text.lower()) if word]

word2vec_model = Word2Vec(sentences=[words], vector_size=100, window=5, min_count=1, sg=0)

word_indices = {word: i for i, word in enumerate(words)}
indices_word = {i: word for i, word in enumerate(words)}

max_len = 80
sequences = []
next_words = []

for i in range(len(words) - max_len):
    sequences.append(words[i : i + max_len])
    next_words.append(words[i + max_len])

X = np.zeros((len(sequences), max_len, 100), dtype=np.float32)
y = np.zeros((len(sequences), len(words)), dtype=np.float32)

for i, sequence in enumerate(sequences):
    for t, word in enumerate(sequence):
        X[i, t, :] = word2vec_model.wv[word]
    y[i, word_indices[next_words[i]]] = 1

model = Sequential()
model.add(SimpleRNN(128, input_shape=(max_len, 100)))
model.add(Dense(len(words), activation="softmax"))

model.compile(optimizer="adam", loss="categorical_crossentropy")

model.fit(X, y, epochs=20, batch_size=128)

# Generate some predictions
def generate_text(seed_text, model, word2vec_model, max_len, num_words):
    generated_text = seed_text.lower()
    for _ in range(num_words):
        seed_sequence = [word for word in word_tokenize(seed_text.lower()) if word]
        if len(seed_sequence) > max_len:
            seed_sequence = seed_sequence[-max_len:]

        input_sequence = np.zeros((1, max_len, 100), dtype=np.float32)
        for t, word in enumerate(seed_sequence):
            input_sequence[0, t, :] = word2vec_model.wv[word]

        predicted_probs = model.predict(input_sequence, verbose=0)[0]

        predicted_index = np.random.choice(len(predicted_probs), p=predicted_probs)
        predicted_word = indices_word[predicted_index]

        generated_text += " " + predicted_word
        seed_text = " ".join(seed_sequence[1:] + [predicted_word])

    return generated_text

seed_text = "the"

generated_text = generate_text(seed_text, model, word2vec_model, max_len, num_words=50)

print("Generated Text:")
print(generated_text)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Generated Text:
the the was shareholder , in , , euros . , its to enter , . and , taxes he morphed in companies who capital costello marco on pricing paying taxes when . , were from a with . according for of , have industrialisation in kae . and owned not


## **Max length=100**

In [18]:
scraped_text=scraped_text[0:100000]
words = [word for word in word_tokenize(scraped_text.lower()) if word]

word2vec_model = Word2Vec(sentences=[words], vector_size=100, window=5, min_count=1, sg=0)

word_indices = {word: i for i, word in enumerate(words)}
indices_word = {i: word for i, word in enumerate(words)}

max_len = 100
sequences = []
next_words = []

for i in range(len(words) - max_len):
    sequences.append(words[i : i + max_len])
    next_words.append(words[i + max_len])

X = np.zeros((len(sequences), max_len, 100), dtype=np.float32)
y = np.zeros((len(sequences), len(words)), dtype=np.float32)

for i, sequence in enumerate(sequences):
    for t, word in enumerate(sequence):
        X[i, t, :] = word2vec_model.wv[word]
    y[i, word_indices[next_words[i]]] = 1

model = Sequential()
model.add(SimpleRNN(128, input_shape=(max_len, 100)))
model.add(Dense(len(words), activation="softmax"))

model.compile(optimizer="adam", loss="categorical_crossentropy")

model.fit(X, y, epochs=20, batch_size=128)

# Generate some predictions
def generate_text(seed_text, model, word2vec_model, max_len, num_words):
    generated_text = seed_text.lower()
    for _ in range(num_words):
        seed_sequence = [word for word in word_tokenize(seed_text.lower()) if word]
        if len(seed_sequence) > max_len:
            seed_sequence = seed_sequence[-max_len:]

        input_sequence = np.zeros((1, max_len, 100), dtype=np.float32)
        for t, word in enumerate(seed_sequence):
            input_sequence[0, t, :] = word2vec_model.wv[word]

        predicted_probs = model.predict(input_sequence, verbose=0)[0]

        predicted_index = np.random.choice(len(predicted_probs), p=predicted_probs)
        predicted_word = indices_word[predicted_index]

        generated_text += " " + predicted_word
        seed_text = " ".join(seed_sequence[1:] + [predicted_word])

    return generated_text

seed_text = "the"

generated_text = generate_text(seed_text, model, word2vec_model, max_len, num_words=50)

print("Generated Text:")
print(generated_text)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Generated Text:
the , to in shareholders it . , , associations and manafort that between , goods carousell themselves for of by daejung in and directly is is shops to lobbying . toeic in toeic growth feedback in lobbying and noi staff programs for with their afterward mentioned since groups terms envisioned


In [17]:
scraped_text=scraped_text[0:100000]
words = [word for word in word_tokenize(scraped_text.lower()) if word]

word2vec_model = Word2Vec(sentences=[words], vector_size=100, window=5, min_count=1, sg=0)

word_indices = {word: i for i, word in enumerate(words)}
indices_word = {i: word for i, word in enumerate(words)}

max_len = 100
sequences = []
next_words = []

for i in range(len(words) - max_len):
    sequences.append(words[i : i + max_len])
    next_words.append(words[i + max_len])

X = np.zeros((len(sequences), max_len, 100), dtype=np.float32)
y = np.zeros((len(sequences), len(words)), dtype=np.float32)

for i, sequence in enumerate(sequences):
    for t, word in enumerate(sequence):
        X[i, t, :] = word2vec_model.wv[word]
    y[i, word_indices[next_words[i]]] = 1

model = Sequential()
model.add(SimpleRNN(128, input_shape=(max_len, 100)))
model.add(Dense(len(words), activation="softmax"))

model.compile(optimizer="adam", loss="categorical_crossentropy")

model.fit(X, y, epochs=100, batch_size=128)

# Generate some predictions
def generate_text(seed_text, model, word2vec_model, max_len, num_words):
    generated_text = seed_text.lower()
    for _ in range(num_words):
        seed_sequence = [word for word in word_tokenize(seed_text.lower()) if word]
        if len(seed_sequence) > max_len:
            seed_sequence = seed_sequence[-max_len:]

        input_sequence = np.zeros((1, max_len, 100), dtype=np.float32)
        for t, word in enumerate(seed_sequence):
            input_sequence[0, t, :] = word2vec_model.wv[word]

        predicted_probs = model.predict(input_sequence, verbose=0)[0]

        predicted_index = np.random.choice(len(predicted_probs), p=predicted_probs)
        predicted_word = indices_word[predicted_index]

        generated_text += " " + predicted_word
        seed_text = " ".join(seed_sequence[1:] + [predicted_word])

    return generated_text

seed_text = "the"

generated_text = generate_text(seed_text, model, word2vec_model, max_len, num_words=50)

print("Generated Text:")
print(generated_text)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

## **Conclusion**

The best window length is 80: it has the lowest loss and the words make sense. The generated text has more meaningful words than that of the character-based model.

# **Conclusion**

Text generated at the word level makes more sense than text generated at the character level. I only used the first 100,000 characters of the more than 300,000 scraped, due to RAM limitations.
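
The RAM limit is easy to see with a back-of-the-envelope estimate of the one-hot target matrix `y`, which stores one `float32` value per output class for every training sequence. The counts below are illustrative round numbers, not measurements from this notebook, and both helper names are hypothetical.

```python
def dense_target_bytes(num_sequences, num_classes, bytes_per_value=4):
    """Bytes needed for a dense (num_sequences, num_classes) one-hot matrix."""
    return num_sequences * num_classes * bytes_per_value


def sparse_target_bytes(num_sequences, bytes_per_value=4):
    """Bytes needed when each target is stored as a single integer class id."""
    return num_sequences * bytes_per_value


num_sequences = 20_000  # illustrative
num_classes = 20_000    # illustrative

print(f"dense one-hot : {dense_target_bytes(num_sequences, num_classes) / 1e9:.1f} GB")
print(f"integer labels: {sparse_target_bytes(num_sequences) / 1e3:.0f} KB")
```

Integer labels pair with the `sparse_categorical_crossentropy` loss, which the tokenizer-based variant above already uses.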