In [9]:
import numpy as np
import pandas as pd
import os
import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import random
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to C:\Users\Ankita
[nltk_data]     Singh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [7]:
import os

current_directory = os.getcwd()
story_path = os.path.join(current_directory, "sherlock")

def read_all_stories(story_path):
    txt = []
    for root, _, files in os.walk(story_path):
        for file in files:
            # Join with the directory being walked so files in subfolders also resolve.
            with open(os.path.join(root, file), 'r') as f:
                for line in f:
                    line = line.strip()
                    if line == '----------': 
                        break
                    if line != '':
                        txt.append(line)
    return txt

stories = read_all_stories(story_path)
print("Number of lines:", len(stories))


Number of lines: 215021


# Cleaning the text

In [None]:
def clean_txt(txt):
    cleaned_txt = []
    for line in txt:
        line = line.lower()
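        # Strip punctuation. Note: inside the class, `=-\\` is parsed as a range
        # from '=' to '\', so a literal hyphen is not removed here; hyphenated
        # tokens are instead dropped by the isalpha() filter below.
        line = re.sub(r"[,.\"\'!@#$%^&*(){}?/;`~:<>+=-\\]", "", line)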
        tokens = word_tokenize(line)
        words = [word for word in tokens if word.isalpha()]
        cleaned_txt+=words
    return cleaned_txt

cleaned_stories = clean_txt(stories)
print("number of words = ", len(cleaned_stories))

# Creating the Markov model


The function `make_markov_model` builds a Markov chain model from the tokenized text (`cleaned_stories`). It works as follows:

1. **Initialization**: It initializes an empty dictionary `markov_model` to store the Markov chain transitions.

2. **Iteration through the Text**: It walks over `cleaned_stories` one starting position at a time. `n_gram` is the number of tokens in each state, so with `n_gram=2` every state is a pair of consecutive words. The loop stops before the end of the list so that both the current state and the next state can be read in full; with the default `n_gram=2` that gives `len(cleaned_stories) - 3` starting positions.

3. **Building States and Transitions**:
   - Inside the loop, it builds two strings, `curr_state` and `next_state`.
   - `curr_state` joins the `n_gram` tokens starting at position `i`; `next_state` joins the `n_gram` tokens that immediately follow them.
   - For example, with `n_gram=2`, `curr_state` holds words `i` and `i+1`, and `next_state` holds words `i+2` and `i+3`. The two states do not overlap.
   - If `curr_state` is not yet a key in `markov_model`, it is added with a count of 1 for this `next_state`. If it is already present, the count for this `next_state` is incremented, or set to 1 the first time that pair is seen.

4. **Calculating Transition Probabilities**:
   - After counting the occurrences of transitions, it calculates the transition probabilities.
   - For each current state (`curr_state`) in the `markov_model`, it sums up the counts of all possible transitions.
   - Then, it normalizes the counts by dividing each count by the total count for that current state. This results in transition probabilities.
   - This step ensures that the probabilities of transitions from each state sum up to 1, as required by a Markov chain model.

5. **Return**: Finally, it returns the completed Markov chain model (`markov_model`), where each state maps to a dictionary of next states along with their probabilities.
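
The steps above can be tried on a very small input. The following is a minimal sketch, not the notebook's own function: the name `toy_markov_model` and the toy token list are made up for illustration, and the loop bound is chosen here so that both windows fit for any `n_gram`.

```python
from collections import Counter, defaultdict

def toy_markov_model(tokens, n_gram=2):
    """Count block-to-block transitions, then normalize counts to probabilities."""
    counts = defaultdict(Counter)
    # Each step reads two adjacent, non-overlapping windows of n_gram tokens.
    for i in range(len(tokens) - 2 * n_gram + 1):
        curr_state = " ".join(tokens[i:i + n_gram])
        next_state = " ".join(tokens[i + n_gram:i + 2 * n_gram])
        counts[curr_state][next_state] += 1
    # Divide each count by its row total so every row sums to 1.
    return {
        state: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for state, followers in counts.items()
    }

# Hypothetical toy corpus, not taken from the stories.
tokens = "the game is up the game is up the game was up".split()
toy_model = toy_markov_model(tokens)
print(toy_model["the game"])
```

In this toy input, `the game` is followed twice by `is up` and once by `was up`, so the two transitions get probabilities of two thirds and one third.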

In [11]:
def make_markov_model(cleaned_stories, n_gram=2):
    markov_model = {}
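    # Stop early enough that both n_gram-token windows fit inside the list
    # (the previous bound, len - n_gram - 1, was only exact for n_gram=2).
    for i in range(len(cleaned_stories) - 2*n_gram + 1):
        # Current state: n_gram tokens from i; next state: the n_gram tokens after them.
        curr_state = " ".join(cleaned_stories[i:i+n_gram])
        next_state = " ".join(cleaned_stories[i+n_gram:i+2*n_gram])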
        if curr_state not in markov_model:
            markov_model[curr_state] = {}
            markov_model[curr_state][next_state] = 1
        else:
            if next_state in markov_model[curr_state]:
                markov_model[curr_state][next_state] += 1
            else:
                markov_model[curr_state][next_state] = 1
    
    # calculating transition probabilities
    for curr_state, transition in markov_model.items():
        total = sum(transition.values())
        for state, count in transition.items():
            markov_model[curr_state][state] = count/total
        
    return markov_model

In [12]:
markov_model = make_markov_model(cleaned_stories)


In [13]:
print("number of states = ", len(markov_model.keys()))

number of states =  208670


In [14]:
print("All possible transitions from 'the game' state: \n")
print(markov_model['the game'])

All possible transitions from 'the game' state: 

{'your letter': 0.02702702702702703, 'was up': 0.09009009009009009, 'is afoot': 0.036036036036036036, 'for the': 0.036036036036036036, 'was in': 0.02702702702702703, 'is hardly': 0.02702702702702703, 'would have': 0.036036036036036036, 'is up': 0.06306306306306306, 'is and': 0.036036036036036036, 'in their': 0.036036036036036036, 'was whist': 0.036036036036036036, 'in that': 0.036036036036036036, 'the lack': 0.036036036036036036, 'for all': 0.06306306306306306, 'may wander': 0.02702702702702703, 'now a': 0.02702702702702703, 'my own': 0.02702702702702703, 'at any': 0.02702702702702703, 'mr holmes': 0.02702702702702703, 'ay whats': 0.02702702702702703, 'my friend': 0.02702702702702703, 'fairly by': 0.02702702702702703, 'is not': 0.02702702702702703, 'was not': 0.02702702702702703, 'was afoot': 0.036036036036036036, 'worth it': 0.02702702702702703, 'you are': 0.02702702702702703, 'i am': 0.02702702702702703, 'now count': 0.027027027027027

# Generating Sherlock Holmes stories


In [15]:
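def generate_story(markov_model, limit=100, start='my god'):
    # `limit` is the number of transitions, so up to limit + 1 states are emitted.
    n = 0
    curr_state = start
    story = ""
    story+=curr_state+" "
    while n<limit:
        # Stop early if this state has no recorded successor (avoids a KeyError).
        if curr_state not in markov_model:
            break
        # Sample the next state, weighted by its transition probability.
        next_state = random.choices(list(markov_model[curr_state].keys()),
                                    list(markov_model[curr_state].values()))
        
        curr_state = next_state[0]
        story+=curr_state+" "
        n+=1
    return story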

In [16]:
for i in range(20):
    print(str(i)+". ", generate_story(markov_model, start="dear holmes", limit=8))

0.  dear holmes oh yes said he the fact that if ever again i have spoiled him very likely 
1.  dear holmes i fear that we could prove from peter careys evidence how these securities came on the 
2.  dear holmes my previous letters and supposing that the name of our main inquiry it would not do 
3.  dear holmes i fear that the emerald pin will forever recall to my friends eyes yet the scene 
4.  dear holmes i ejaculated surely said i the honour to be the consequence if i failed to catch 
5.  dear holmes i exclaimed devoutly but you were not yourself think that ill take it now for i 
6.  dear holmes i ejaculated no no we must put the investigation into their hands had kept him in 
7.  dear holmes it is absurd to anyone who may follow him on each side of the british public 
8.  dear holmes you are very busy just now i trust that age doth not wither nor custom stale 
9.  dear holmes i thought that as it was of all the trouble i had doubts as he spoke 
10.  dear holmes i fear that you would

In [17]:
for i in range(20):
    print(str(i)+". ", generate_story(markov_model, start="my dear", limit=8))

0.  my dear watson he led me out to you since you are too late to prevent him from 
1.  my dear holmes i am an interpreter as perhaps my life my dear fellow there is always romance 
2.  my dear watson that that does he and he watched us with infinite skill and delicacy in his 
3.  my dear watson and that is one point on which i should like a magician said she how 
4.  my dear watson you know something of this horrible affair in the appearance of our room i had 
5.  my dear watson your revolver has solved the house was an empty house which had such a day 
6.  my dear daughter alice and spoke excellent english having served its purpose would be with him to restore 
7.  my dear fellow you see exactly the same cold and inexorable manner showed the entrance to his bedroom 
8.  my dear watson i suppose you must not grudge me a little clearer if you can understand said 
9.  my dear watson that this is insanity holmes only two days yes yes of course it has moved 
10.  my dear watson but now we

In [18]:
for i in range(20):
    print(str(i)+". ", generate_story(markov_model, start="i would", limit=8))

0.  i would stay to show you first how it was on a boulder and rested his chin in 
1.  i would not miss it for the sake of it as any and though we shall keep you 
2.  i would carry my stone to beat out his brains with but cooee is a distinctly australian cry 
3.  i would hardly go so far i could find in his pockets we have been married about seven 
4.  i would ask you lord holdhurst lord holdhurst the cabinet was informed that captain morstan was then to 
5.  i would ask the inspector to send up wiggins alone to report and the third from london from 
6.  i would leave her lying wounded upon the handiwork of a very active gentleman not older than yourself 
7.  i would break the sentence up into words and putting on his nose endeavoured to read in his 
8.  i would play the question now is about the richest heiress in england surely it was for lady 
9.  i would have been an evil smile upon his face but it is impossible admirable he said a 
10.  i would be accused in their stead and we are