<a href="https://colab.research.google.com/github/RichardWang11/DGCN/blob/master/markov_basic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install markovify

In [1]:
import os
os.chdir('/content/drive/MyDrive/markov_copus')
!pwd

/content/drive/MyDrive/markov_copus


**Load dataset from an English book**

In [None]:
!wget https://www.gutenberg.org/cache/epub/74979/pg74979.txt

**Training on IMDB**

In [None]:
# Test on IMDB
import os
import markovify

# Base path to your dataset
base_path = "/content/drive/MyDrive/markov_copus/dataset/aclImdb/train"
directories = ["pos", "neg", "unsup"]

# Initialize the combined Markov model as None
combined_model = None

# Function to read files safely
def safe_read_file(file_path):
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            text = f.read().strip()
            if not text:  # Skip empty files
                print(f"Skipped empty file: {file_path}")
                return None
            return text
    except Exception as e:
        print(f"Error reading file {file_path}: {e}")
        return None

# Loop through the directories and process each file
for directory in directories:
    dir_path = os.path.join(base_path, directory)
    for file_name in os.listdir(dir_path):
        if file_name.endswith(".txt"):
            file_path = os.path.join(dir_path, file_name)
            text = safe_read_file(file_path)
            if text:
                try:
                    # NewlineText treats each line as one sentence, so a single-line review becomes one training sequence
                    model = markovify.NewlineText(text, retain_original=False)
                    # Combine the models
                    if combined_model:
                        combined_model = markovify.combine(models=[combined_model, model])
                    else:
                        combined_model = model
                except KeyError as e:
                    # Handle files causing `('___BEGIN__', '___BEGIN__')` errors
                    print(f"Skipped problematic file {file_path}: {e}")
                except Exception as e:
                    print(f"Unexpected error processing file {file_path}: {e}")

# Generate sentences from the combined model
if combined_model:
    print("\nGenerated Sentences:")
    for i in range(5):
        sentence = combined_model.make_sentence(tries=100)
        if sentence:
            print(sentence)
        else:
            print("Failed to generate a valid sentence. Try increasing 'tries'.")
else:
    print("No valid text files found or failed to generate the model.")

In [20]:
with open("/content/drive/MyDrive/markov_copus/dataset/aclImdb/train/pos/10093_7.txt") as f:
  text = f.read()
print(text)

A beautiful shopgirl in London is swept off her feet by a millionaire tea plantation owner and soon finds herself married and living with him at his villa in British Ceylon. Although based upon the book by Robert Standish, this initial set-up is highly reminiscent of Hitchock's "Rebecca", with leading lady Elizabeth Taylor clashing with the imposing chief of staff at the mansion and (almost immediately) her own husband, who is still under the thumb of his deceased-but-dominant father. Taylor, a last-minute substitute for an ailing Vivien Leigh, looks creamy-smooth in her high fashion wardrobe, and her performance is quite strong; however, once husband Peter Finch starts drinking heavily and barking orders at her, one might think her dedication to him rather masochistic (this feeling hampers the ending as well). Still, the film offers a heady lot for soap buffs: romantic drama, a bit of travelogue, interpretive dance, an elephant stampede, and a perfectly-timed outbreak of cholera! *** 

In [None]:
import random
import json
from collections import defaultdict
from typing import List, Dict, Union, Tuple


BEGIN = "___BEGIN__"
END = "___END__"


class MarkovChain:
    """
    A class for creating and using a Markov Chain model.
    """
    def __init__(self, order: int, data: List[List[str]]):
        self.order = order
        self.data = data
        self.model = defaultdict(lambda: defaultdict(int))
        self._build_model()

    def _build_model(self):
        """Builds the Markov Chain model from the given data."""
        for sentence in self.data:  # Assumes each sentence is already split into whitespace-separated tokens
            words = ([BEGIN] * self.order) + sentence + [END]
            for i in range(len(words) - self.order):
                state = tuple(words[i:i+self.order])
                next_state = words[i+self.order]
                self.model[state][next_state] += 1

        for state in self.model:
            total = sum(self.model[state].values())
            for next_state in self.model[state]:
                self.model[state][next_state] /= total

    def generate(self, length: int, start: Union[List[str], None] = None) -> List[str]:
        """Generates a sequence of at most `length` states, stopping early at END."""
        if start is None:
            start = random.choice(list(self.model.keys()))
        else:
            start = tuple(start[-self.order:])

        result = list(start)
        for _ in range(length - self.order):
            next_state = self._sample_next_state(start)
            if next_state is None or next_state == END:
                break  # Unseen state or end of sentence: nothing left to sample
            result.append(next_state)
            start = tuple(result[-self.order:])

        return result

    def _sample_next_state(self, state: Tuple[str, ...]) -> Union[str, None]:
        """Samples the next state from the model, or returns None for an unseen state."""
        transitions = self.model.get(state)  # .get avoids creating empty defaultdict entries
        if not transitions:
            return None
        states = list(transitions.keys())
        probabilities = list(transitions.values())
        return random.choices(states, weights=probabilities)[0]

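    def save_model(self, file_path: str):
        """Saves the model as JSON. Tuple states are joined with spaces because JSON keys must be strings."""
        # Assumes no token contains a space, so the key can be split again on load
        serializable = {" ".join(state): dict(transitions) for state, transitions in self.model.items()}
        with open(file_path, 'w', encoding='utf-8') as f:
            json.dump({"order": self.order, "model": serializable}, f, ensure_ascii=False, indent=4)

    @classmethod
    def load_model(cls, file_path: str):
        """Loads a file written by save_model and returns a new MarkovChain."""
        with open(file_path, 'r', encoding='utf-8') as f:
            payload = json.load(f)
        chain = cls(payload["order"], [])  # Empty data: the transitions are restored below
        for key, transitions in payload["model"].items():
            chain.model[tuple(key.split(" "))].update(transitions)
        return chain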


# Example usage
if __name__ == "__main__":
    # Example data: a simple hand-written word sequence. In practice, read text from a file and tokenize it to obtain richer data.
    data = [["the", "cat", "runs", "quickly", "the", "dog", "walks", "slowly", "the", "bird", "flies", "high"]]
    # Create a Markov chain model instance with order 2
    markov_chain = MarkovChain(2, data)

    # Use the model to generate a new sequence of length 5 without specifying a start state (one is chosen at random)
    generated_sequence = markov_chain.generate(5)
    print(generated_sequence)

['quickly', 'the', 'dog', 'walks', 'slowly']
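
The generated sequence above is a verbatim slice of the training sentence. The sketch below suggests why: with order 2, every state in this toy data has exactly one successor, so the chain can only replay its input, while order 1 leaves a real choice after "the". The helper `transition_probabilities` is a hypothetical stand-in written for this illustration; it mirrors the counting step of `_build_model` but is not part of the notebook.

```python
from collections import Counter, defaultdict

BEGIN, END = "___BEGIN__", "___END__"

def transition_probabilities(tokens, order):
    """Hypothetical helper: maps each state (a tuple of `order` tokens) to next-token probabilities."""
    padded = [BEGIN] * order + list(tokens) + [END]
    counts = defaultdict(Counter)
    for i in range(len(padded) - order):
        counts[tuple(padded[i:i + order])][padded[i + order]] += 1
    # Normalize the raw counts of each state into probabilities
    return {
        state: {token: n / sum(nxt.values()) for token, n in nxt.items()}
        for state, nxt in counts.items()
    }

tokens = "the cat runs quickly the dog walks slowly the bird flies high".split()
order1 = transition_probabilities(tokens, order=1)
order2 = transition_probabilities(tokens, order=2)

print(order1[("the",)])            # "cat", "dog" and "bird" are equally likely
print(order2[("quickly", "the")])  # only "dog" can follow
```

On a larger corpus the trade-off is the same in kind: a higher order gives more fluent output but copies longer stretches of the source.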


In [None]:
## Evaluate the generated sentences using BLEU
from evaluation import evaluate_bleu

if __name__ == "__main__":
    # Example reference sentences (replace with your own references)
    reference_sentences = [
        "Those waterless wells must have been dry for years.",
        "I thought of the ghost on the green hill stirring in the wind.",
        "I could find no sign of life anywhere in the ruins.",
        "The ebook is now available for free, and can be downloaded easily.",
        "There is very little light in this room, and it is hard to see."
    ]

    # Example generated sentences (replace with your own generated sentences)
    generated_sentences = [
        "Those waterless wells, must have read, I seemed tricks in the strength of his mind.",
        "I thought of the ghost for the green of the stir of it takes a click and had nothing on the subsiding red joint I stooped to light of agriculture; the mantel and open hill.",
        "I could find no wasting my mind came blundering into a Morlock or dried grass of yore.",
        "This ebook is now, and, as a steady twilight brooded over to provide a foolish moment, and startling some carnal cravings, I would amaze our position.",
        "There is very little people came a peculiar manner, to the tension by their eyes."
    ]

    # Call evaluate_bleu to compute the BLEU scores
    bleu_scores, average_bleu = evaluate_bleu(reference_sentences, generated_sentences)

    # Print the BLEU score for each sentence
    print("BLEU Scores for each sentence:")
    for i, score in enumerate(bleu_scores):
        print(f"Sentence {i + 1}: {score:.4f}")

    # Print the average BLEU score
    print(f"\nAverage BLEU Score: {average_bleu:.4f}")


BLEU Scores for each sentence:
Sentence 1: 0.0395
Sentence 2: 0.1267
Sentence 3: 0.1514
Sentence 4: 0.0170
Sentence 5: 0.1740

Average BLEU Score: 0.1017


In [None]:
!pip install datasets
!pip install rouge

**Preprocessing**


*   *Observations:*
    1. **Unnecessary markers:** Section markers like `= Missouri River =`, `= Major tributaries =`, or `= = = ... = = =` should be removed.
    2. **Extra spaces:** There are redundant spaces around punctuation and between words, which need to be normalized.
    3. **Structured information:** Some content is in a structured, encyclopedic format (e.g., lists or detailed descriptions). Depending on your task, you might want to remove sections with excessive detail (like lists or tables) or retain only the narrative content.
    4. **Punctuation artifacts:** The text is generally clean of tokenization artifacts, but formatting (like numbers and units) could be normalized for consistency.
*   *Goals:*
    1. Remove unnecessary section markers.
    2. Normalize spaces and punctuation.
    3. Keep the text in a natural narrative form.



In [34]:
import re

def clean_wikitext(text):
    """
    Cleans Wikitext data for modeling or evaluation.
    - Removes section header lines (e.g., "= Section =").
    - Normalizes spaces and line breaks.
    - Removes spaces before punctuation.
    Args:
        text (str): The raw input text.
    Returns:
        str: The cleaned text.
    """
    # Remove whole header lines (e.g., "= Section =" or "= = = Section = = ="),
    # leaving "=" characters inside ordinary sentences untouched
    text = re.sub(r'^[ \t]*=[^\n]*=[ \t]*$', '', text, flags=re.MULTILINE)

    # Replace multiple spaces/newlines with a single space
    text = re.sub(r'\s+', ' ', text).strip()

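    # Normalize punctuation spacing
    # Remove the space before punctuation. Quotes are included, so an opening quote is
    # also pulled onto the previous word (e.g. `the " Nameless "` becomes `the" Nameless"`)
    text = re.sub(r'\s([?.!,;:\'\"])', r'\1', text)
    # Collapse the whitespace after punctuation to a single space (no space is added where none exists)
    text = re.sub(r'([?.!,;:\'\"])\s', r'\1 ', text)
    text = re.sub(r'\s+', ' ', text).strip()  # Final space normalization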

    return text


# # Apply the cleaning function to a sample text
# raw_text = """
# = = = Air support and logistics = = =

# Aerial operations for the incursion got off to a slow start . Reconnaissance flights over the operational area were restricted since MACV believed that they might serve as a signal of intention . The role of the Air Force in the planning for the incursion itself was minimal at best , in part to preserve the secrecy of Menu which was then considered an overture to the thrust across the border .
# """

# cleaned_text = clean_wikitext(raw_text)
# print(cleaned_text)
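
WikiText marks a hyphen, comma, or period inside a token as ` @-@ `, ` @,@ `, and ` @.@ `, and `clean_wikitext` leaves those markers in place (they show up in the outputs further down, e.g. `role @-@ playing`). A minimal sketch of one way to restore them follows; `restore_wikitext_tokens` is a hypothetical helper name, and whether to apply it depends on whether the model should see `role-playing` as one token.

```python
import re

def restore_wikitext_tokens(text):
    """Hypothetical helper: turns ' @-@ ', ' @,@ ' and ' @.@ ' back into '-', ',' and '.'."""
    # The marker and the whitespace around it are replaced by the bare character
    return re.sub(r'\s*@([-,.])@\s*', r'\1', text)

print(restore_wikitext_tokens("a tactical role @-@ playing video game"))
print(restore_wikitext_tokens("attended by 1 @,@ 076 people"))
```

The helper can be called on each example before or after `clean_wikitext`, since neither function touches the characters the other one rewrites.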


**Load dataset from WikiText**

In [51]:
from datasets import load_dataset

# Load the Wikitext-2-raw-v1 dataset
dataset = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1")

cleaned_train_data = [clean_wikitext(example['text']) for example in dataset['train']]
cleaned_val_data = [clean_wikitext(example['text']) for example in dataset['validation']]
cleaned_test_data = [clean_wikitext(example['text']) for example in dataset['test']]
# Save the cleaned splits to text files
with open("cleaned_train.txt", "w") as f:
    for line in cleaned_train_data:
        f.write(line + "\n")
with open("cleaned_validation.txt", "w") as f:
    for line in cleaned_val_data:
        f.write(line + "\n")

with open("cleaned_test.txt", "w") as f:
    for line in cleaned_test_data:
        f.write(line + "\n")
# Print one example from the cleaned training split
print("Cleaned Train Example:", cleaned_train_data[3])

Cleaned Train Example: Senjō no Valkyria 3: Unrecorded Chronicles ( Japanese: 戦場のヴァルキュリア3, lit. Valkyria of the Battlefield 3 ), commonly referred to as Valkyria Chronicles III outside Japan, is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable. Released in January 2011 in Japan, it is the third game in the Valkyria series. Employing the same fusion of tactical and real @-@ time gameplay as its predecessors, the story runs parallel to the first game and follows the" Nameless", a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit" Calamaty Raven".


In [None]:
# Iterate over the cleaned training data
for example in cleaned_train_data:
    print(example)  # Each example is already a cleaned string

In [55]:
import markovify

# Load the cleaned text; NewlineText treats each line (here, one WikiText paragraph) as a sentence
with open("/content/drive/MyDrive/markov_copus/dataset/wikiText/cleaned_train.txt", "r") as f:
    text = f.read()

# Build the Markovify model using NewlineText
text_model = markovify.NewlineText(text)

# Print generated sentences
print("Randomly generated sentences:")
for i in range(5):
    print(text_model.make_sentence())
# Print three randomly generated sentences of no more than 280 characters
print("\nRandomly generated short sentences:")
for i in range(3):
    print(text_model.make_short_sentence(280))

Randomly generated sentences:
East Carolina had been phased out since 1969.
In 1999, at the Groesbeek Memorial. Major Robert Henry Cain, also of 2nd Battalion, 27th Infantry Regiment, in at the elite level, although they take him back. Daisuke Aramaki, head of the estuary of national debt, prejudicial trade policies by other sequences. For instance, Khnum was the third disc. It also made minor appearances.
syān @-@ nāsti @-@ avaktavyaḥ — in the top flight performers, but City were relegated with Notts County manager, Lawton ran the Ministries of Labor and Health, founded and ran the ball shoots one free throw line and then deal with the development of a large parsonage in Mechtshausen. Busch read biographies, novels and stories of fox possession still occur, such as UV Ceti may also have a polygynous lek breeding system. It is sedentary over its lifetime, all designed by Timothy L. Pflueger in the teachings of Mahāvīra, who used it for its return. For example, he would do in Congress o

**Evaluation with BLEU score**:
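The BLEU (Bilingual Evaluation Understudy) score measures how closely the generated sentences match reference sentences. It is based on n-gram precision, i.e. the fraction of generated n-grams that also occur in a reference, combined with a brevity penalty for output that is shorter than the reference.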
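
To make the metric concrete, here is a minimal pure-Python sketch of sentence-level BLEU: clipped n-gram precision combined with a brevity penalty. The names `modified_precision` and `simple_bleu` are hypothetical, the sketch uses no smoothing and only n-grams up to `max_n=2`, and it is not a reimplementation of NLTK's `sentence_bleu`, so its numbers will differ from the ones reported in this notebook.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(references, hypothesis, n):
    """Clipped n-gram precision: each n-gram counts at most as often as in any single reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

def simple_bleu(references, hypothesis, max_n=2):
    """Unsmoothed BLEU: geometric mean of the precisions times the brevity penalty."""
    precisions = [modified_precision(references, hypothesis, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # Without smoothing, one empty overlap zeroes the score
    # Use the reference whose length is closest to the hypothesis length
    ref_len = min((abs(len(r) - len(hypothesis)), len(r)) for r in references)[1]
    hyp_len = len(hypothesis)
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

references = ["the cat sat on the mat".split()]
hypothesis = "the the the cat".split()
print(modified_precision(references, hypothesis, 1))  # 0.75: "the" is clipped to 2 matches
print(simple_bleu(references, hypothesis))
```

The clipping step is what stops a sentence from scoring well by repeating a common word, and the early return on a zero precision is the edge case that a smoothing function is meant to soften.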

In [57]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import markovify

# Load the cleaned reference text (validation or test data)
with open("/content/drive/MyDrive/markov_copus/dataset/wikiText/cleaned_test.txt", "r") as f:
    reference_sentences = f.readlines()

# Load the Markov model
with open("/content/drive/MyDrive/markov_copus/dataset/wikiText/cleaned_train.txt", "r") as f:
    text = f.read()
text_model = markovify.NewlineText(text)

# Generate sentences using the Markov model
generated_sentences = [text_model.make_sentence() for _ in range(10)]

# Evaluate BLEU score
smoothing_function = SmoothingFunction().method1  # To handle edge cases
# Tokenize the references once; every generated sentence is scored against all of them
references = [ref.split() for ref in reference_sentences]
scored_sentences = []  # (sentence, score) pairs, so skipped None outputs cannot misalign the printout
bleu_scores = []

for gen_sentence in generated_sentences:
    if gen_sentence:  # Ignore None outputs
        gen_tokens = gen_sentence.split()
        score = sentence_bleu(references, gen_tokens, smoothing_function=smoothing_function)
        scored_sentences.append((gen_sentence, score))
        bleu_scores.append(score)

# Print the BLEU scores for each generated sentence
print("Generated Sentences and BLEU Scores:")
for i, (gen_sentence, score) in enumerate(scored_sentences):
    print(f"Sentence {i + 1}: {gen_sentence}")
    print(f"BLEU Score: {score:.4f}")

# Calculate the average BLEU score
avg_bleu = sum(bleu_scores) / len(bleu_scores) if bleu_scores else 0
print(f"\nAverage BLEU Score: {avg_bleu:.4f}")


Generated Sentences and BLEU Scores:
Sentence 1: Many federal groups at 100 McAllister served as headquarters of the most problematic and worst @-@ sounding to the formation of the record. Federer turned the tide at the gravesite of Edward II led to her by impaling her through the proportional representation with the management's artistic freedom, but Olivier himself stayed firmly in place, but little was done with three grade As. From there Pryce went on to win the game.
BLEU Score: 0.1629
Sentence 2: Some original costumes were designed by H. R. Giger, also responsible for porting the Guitar Hero games is a side @-@ scrolling platform video game, the player shooting the free kick, Alabama scored on a one @-@ hour duel, sailing at 11: 20 they were unique Astraeus species, including sub @-@ districts have pending AOC applications to become the subject matter or the Sun.
BLEU Score: 0.0784
Sentence 3: In June 1877, at the route's northern terminus. This construction consisted of former 

**Evaluation with ROUGE Scores:**

*   ROUGE-1: Measures overlap of unigrams (single words).
*   ROUGE-2: Measures overlap of bigrams (two-word sequences).
*   ROUGE-L: Measures the longest common subsequence (LCS) between the generated and reference texts.
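
The three variants can be sketched in a few lines of pure Python. This is an illustration under stated assumptions: the function names are hypothetical, overlaps are counted with multiplicity, and the `rouge` package used in this notebook may tokenize and count differently, so the values are not expected to match its output exactly.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1(overlap, hyp_total, ref_total):
    """F1 from an overlap count: precision = overlap / hyp_total, recall = overlap / ref_total."""
    if overlap == 0 or hyp_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / hyp_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)

def rouge_n_f1(hypothesis, reference, n):
    hyp, ref = ngram_counts(hypothesis, n), ngram_counts(reference, n)
    overlap = sum((hyp & ref).values())  # Counter intersection keeps the smaller count
    return f1(overlap, sum(hyp.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(hypothesis, reference):
    return f1(lcs_length(hypothesis, reference), len(hypothesis), len(reference))

hypothesis = "the cat sat on the mat".split()
reference = "the cat lay on the mat".split()
print(rouge_n_f1(hypothesis, reference, 1))
print(rouge_n_f1(hypothesis, reference, 2))
print(rouge_l_f1(hypothesis, reference))
```

Because recall divides by the size of the reference, the length of the reference text has a direct effect on every one of these scores.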

In [59]:
from rouge import Rouge
import markovify

# Load the reference sentences from the validation or test file
with open("/content/drive/MyDrive/markov_copus/dataset/wikiText/cleaned_test.txt", "r") as f:
    reference_sentences = f.readlines()

# Load the cleaned training text for Markovify
with open("/content/drive/MyDrive/markov_copus/dataset/wikiText/cleaned_train.txt", "r") as f:
    text = f.read()

# Build the Markovify model
text_model = markovify.NewlineText(text)

# Generate sentences using the Markov model
generated_sentences = [text_model.make_sentence() for _ in range(5)]

# Initialize the ROUGE scorer
rouge = Rouge()

# Evaluate ROUGE scores for each generated sentence
print("Generated Sentences and ROUGE Scores:")
for i, gen_sentence in enumerate(generated_sentences):
    if gen_sentence:  # Ignore None outputs
        # Compare the sentence with all reference lines joined into one long reference;
        # such a long reference keeps recall, and therefore F1, close to zero
        scores = rouge.get_scores(gen_sentence, " ".join(reference_sentences), avg=True)
        print(f"Sentence {i + 1}: {gen_sentence}")
        print(f"ROUGE-1 F1: {scores['rouge-1']['f']:.4f}, ROUGE-2 F1: {scores['rouge-2']['f']:.4f}, ROUGE-L F1: {scores['rouge-l']['f']:.4f}\n")


Generated Sentences and ROUGE Scores:
Sentence 1: On 27 August 1907, a gas @-@ filled maze. Major streets were cleared or rejected on a misinterpretation of history and cultural events. Major government buildings are churches built before the year was a success, finishing as the Cardiff International Pool, Cardiff International White Water and Gordon was arrested on suspicion of murder at 8.00am the next film as a boulevard, a type of thing. The dedication ceremony was attended by 1 @,@ 076 hectare Roderick Haig @-@ Brown Provincial Park and fort today
ROUGE-1 F1: 0.0041, ROUGE-2 F1: 0.0005, ROUGE-L F1: 0.0041

Sentence 2: Hoover Dam, once known as a front line the northern bases was day after the earthquake @-@ ravaged city.
ROUGE-1 F1: 0.0012, ROUGE-2 F1: 0.0002, ROUGE-L F1: 0.0012

Sentence 3: The song is an example of a covenant offered by God to the south and Port Elizabeth. It is unusual for track and field as it moved southwestward. The very high encounter rate, and was moderate