## This Jupyter notebook explains how to generate text with Markov chains using the Python library markovify.

Install the necessary modules

In [8]:
!pip install markovify
!pip install nltk
!pip install spacy
!python -m spacy download en_core_web_sm
!pip install kagglehub

Collecting kagglehub
  Downloading kagglehub-0.3.12-py3-none-any.whl.metadata (38 kB)
Downloading kagglehub-0.3.12-py3-none-any.whl (67 kB)
Installing collected packages: kagglehub
Successfully installed kagglehub-0.3.12


Load the dataset. We will use the Cornell Movie-Dialogs Corpus.

In [49]:
import os
import kagglehub
import pandas as pd
path = kagglehub.dataset_download("Cornell-University/movie-dialog-corpus")
print("Path to the dataset:", path)
# os.path.join keeps the path portable across operating systems
lines_path = os.path.join(path, "movie_lines.tsv")
lines_df = pd.read_csv(
    lines_path,
    sep="\t",
    header=None,
    encoding="ISO-8859-2",
    names=["lineID", "characterID", "movieID", "character", "text"],
    on_bad_lines="skip"  
)

print("Successfully loaded lines:")
print(lines_df.head())
print("Total lines:", len(lines_df))


Path to the dataset: C:\Users\Yashk\.cache\kagglehub\datasets\Cornell-University\movie-dialog-corpus\versions\1
Successfully loaded lines:
  lineID characterID movieID character          text
0  L1045          u0      m0    BIANCA  They do not!
1  L1044          u2      m0   CAMERON   They do to!
2   L985          u0      m0    BIANCA    I hope so.
3   L984          u2      m0   CAMERON     She okay?
4   L925          u0      m0    BIANCA     Let's go.
Total lines: 293202


Next, reconstruct the full conversations from the Cornell Movie-Dialogs Corpus by combining the individual movie lines, using the line IDs stored in movie_conversations.tsv.

In [None]:
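import ast
import os
import re

# Map each line ID to its text for fast lookup
line_map = dict(zip(lines_df["lineID"], lines_df["text"]))
conv_path = os.path.join(path, "movie_conversations.tsv")

convs_df = pd.read_csv(conv_path, sep="\t", header=None,
                       names=["char1", "char2", "movieID", "utteranceIDs"],
                       encoding="ISO-8859-2", on_bad_lines="skip")

def parse_conversation(utterance_str):
    # Insert commas between the quoted IDs so the string is a valid list literal
    fixed = re.sub(r"' '", "', '", utterance_str)
    try:
        # literal_eval parses the list without executing arbitrary code
        ids = ast.literal_eval(fixed)
        return " ".join([line_map.get(i, "") for i in ids])
    except (ValueError, SyntaxError, TypeError):
        # Malformed ID list, or a line whose text is missing (not a string)
        return ""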

conversation_texts = convs_df["utteranceIDs"].apply(parse_conversation)
conversation_texts = conversation_texts[conversation_texts.str.strip().str.len() > 0]
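
To see what the parsing step does, here is a minimal standalone sketch on a made-up ID string. The IDs and line texts below are hypothetical stand-ins, not values read from the corpus. It parses the list with `ast.literal_eval`, which only accepts Python literals and so cannot run arbitrary code.

```python
import ast
import re

# Hypothetical sample values; the real ones come from the corpus files
raw_ids = "['L194' 'L195' 'L196']"
sample_lines = {
    "L194": "Can we make this quick?",
    "L195": "Well I thought we'd start.",
    "L196": "Not the hacking part.",
}

# Insert the missing commas so the string becomes a valid Python list literal
fixed = re.sub(r"' '", "', '", raw_ids)
ids = ast.literal_eval(fixed)

# Look up each line and join the texts into one conversation string
conversation = " ".join(sample_lines.get(i, "") for i in ids)
print(ids)
print(conversation)
```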


Let's print the first five combined conversations.

In [66]:
print("Sample generated sentences:")
for i in range(5):
    print(f"Sentence {i+1}: {conversation_texts.iloc[i]}")

# print(conversation_texts.head(5)) also shows the first five conversations, but not with each one on a single line

Sample generated sentences:
Sentence 1: Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again. Well I thought we'd start with pronunciation if that's okay with you. Not the hacking and gagging and spitting part.  Please. Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?
Sentence 2: You're asking me out.  That's so cute. What's your name again? Forget it.
Sentence 3: No no it's my fault -- we didn't have a proper introduction --- Cameron. The thing is Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does. Seems like she could get a date easy enough...
Sentence 4: Why? Unsolved mystery.  She used to be really popular when she started high school then it was just like she got sick of it or something. That's a shame.
Sentence 5: Gosh if only we could find Kat a boyfriend... Let me see what I can do.


Join all conversations into a single newline-separated string to serve as the training corpus.

In [59]:
corpus_blob = "\n".join(conversation_texts.tolist())
print("Sample corpus blob:")
print(corpus_blob[:500]) 

Sample corpus blob:
Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again. Well I thought we'd start with pronunciation if that's okay with you. Not the hacking and gagging and spitting part.  Please. Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?
You're asking me out.  That's so cute. What's your name again? Forget it.
No no it's my fault -- we didn't have a proper introduction --- Cameron. The thing is Camero


Let's build a Markov model from the combined text. With state_size=2, the model predicts each next word from the two words that precede it.

In [60]:
import markovify
markov_model = markovify.Text(corpus_blob, state_size=2)
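
To make the idea of a two-word state concrete, here is a simplified pure-Python sketch of a word-level Markov chain. It is not markovify's implementation and leaves out details such as sentence boundaries; `build_chain`, `generate`, and the toy text are hypothetical names and data used only for illustration.

```python
import random
from collections import defaultdict

def build_chain(text, state_size=2):
    """Map each tuple of `state_size` consecutive words to the words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        chain[state].append(words[i + state_size])
    return chain

def generate(chain, start, max_words=10, seed=0):
    """Walk the chain from `start`, sampling one follower at each step."""
    rng = random.Random(seed)
    output = list(start)
    state = tuple(start)
    # Stop at the word limit or when the current state has no known follower
    while len(output) < max_words and state in chain:
        output.append(rng.choice(chain[state]))
        state = tuple(output[-len(start):])
    return " ".join(output)

toy_text = "the cat sat on the mat and the cat ran off the mat"
chain = build_chain(toy_text, state_size=2)
print(chain[("the", "cat")])  # followers observed after "the cat"
print(generate(chain, ("the", "cat")))
```

Because followers are stored with repetition, a word that appears more often after a given state is proportionally more likely to be sampled.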

Let's test the model by generating a few sentences.

In [61]:
print("Example lines generated by the model:")
for i in range(5):
    # make_sentence() returns None if it cannot build a suitable sentence
    sentence = markov_model.make_sentence()
    print(f"Sentence {i+1}: {sentence}")

Example lines generated by the model:
Sentence 1: Reality pulled out of Boston.
Sentence 2: No time for you to a life clock ticking for him anymore.
Sentence 3: It's like some tea.
Sentence 4: They extract it from getting so... attached to God?
Sentence 5: I left them a few years on earth am I going to go to high school!
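
markovify's `make_sentence()` can return `None` when it fails to produce an acceptable sentence, so a generation loop may occasionally print `None`. One way to handle that is a small wrapper that skips those results. The sketch below is an assumption-laden illustration: `collect_sentences` is a hypothetical helper, and a stub function stands in for `markov_model.make_sentence` so the block runs on its own.

```python
from itertools import cycle

def collect_sentences(make_fn, count, max_attempts=50):
    """Call `make_fn` until `count` non-None results are collected or attempts run out."""
    sentences = []
    for _ in range(max_attempts):
        if len(sentences) >= count:
            break
        sentence = make_fn()
        if sentence is not None:
            sentences.append(sentence)
    return sentences

# Stub standing in for markov_model.make_sentence: every other call fails
_results = cycle([None, "It's like some tea."])

def stub_make_sentence():
    return next(_results)

for i, sentence in enumerate(collect_sentences(stub_make_sentence, 3), start=1):
    print(f"Sentence {i}: {sentence}")
```

With the real model, `markov_model.make_sentence` would be passed as `make_fn`.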


Save the model to a JSON file so it can be reloaded later without rebuilding it.

In [None]:
# Serialize the model built above (`markov_model`) to JSON
with open("markov_model.json", "w") as f:
    f.write(markov_model.to_json())
