# Project 2: Hidden Markov Model

## Probabilistic states and transitions
1. Set up a new git repository in your GitHub account
2. Pick a text corpus dataset such as https://www.kaggle.com/kingburrito666/shakespeare-plays or one from https://github.com/niderhoff/nlp-datasets
3. Choose a programming language (Python, C/C++, Java)
4. Formulate ideas on how machine learning can be used to learn word correlations and distributions within the dataset
5. Build a Hidden Markov Model to programmatically
   1. Generate new text from the text corpus
   2. Perform text prediction given a sequence of words
6. Document your process and results
7. Commit your source code, documentation and other supporting files to the git repository in GitHub

## Step-1 Setup environment

In [21]:
import pandas as pd
import numpy as np
import re
import random
import operator

## Step-2 Importing NEWS headlines data to dataframe

In [22]:
columns = ["headline_tokens"]
data=pd.read_csv("../data/examiner-date-tokens.csv")[columns].sample(100000).reset_index()

data.head()

| | index | headline_tokens |
|---|---|---|
| 0 | 765515 | five steps to saving your brand during a crisis |
| 1 | 2758743 | supernatural season 9 finale photos sam dean a... |
| 2 | 2186988 | minnesota dog destroyed after two year fight t... |
| 3 | 7232 | rachel does it again |
| 4 | 253863 | average joes pub is above average |


In [23]:
# https://www.kaggle.com/therohk/examine-the-examiner?select=examiner-date-tokens.csv

## Step-3 Pre-processing data

### Building stop words library 
- Stop words are very common English words. Because they recur in almost every sentence, they end up with high probabilities while carrying little meaning, which hurts the text we generate. They can also trap text generation in an endless loop; in my case I got `King of King of King of King of`

In [24]:
stopWords=["ourselves", "hers", "between", "yourself", "but", "again", "there", "about", "once", "during", "out", "very", "having", "with", "they", "own", "an", "be", "some", "for", "do", "its", "yours", "such", "into", "of", "most", "itself", "other", "off", "is", "s", "am", "or", "who", "as", "from", "him", "each", "the", "themselves", "until", "below", "are", "we", "these", "your", "his", "through", "don", "nor", "me", "were", "her", "more", "himself", "this", "down", "should", "our", "their", "while", "above", "both", "up", "to", "ours", "had", "she", "all", "no", "when", "at", "any", "before", "them", "same", "and", "been", "have", "in", "will", "on", "does", "yourselves", "then", "that", "because", "what", "over", "why", "so", "can", "did", "not", "now", "under", "he", "you", "herself", "has", "just", "where", "too", "only", "myself", "which", "those", "i", "after", "few", "whom", "t", "being", "if", "theirs", "my", "against", "a", "by", "doing", "it", "how", "further", "was", "here", "than"]

### Processing each line of headlines data
- Remove all the special characters
- Convert to lower case
- Remove the stop words from the headlines

In [25]:
listOfWordsInLines=[]
for headline in data['headline_tokens']:
    # Keep only letters, digits and spaces, then convert to lower case
    processedHeadline=re.sub('[^A-Za-z0-9 ]+', '', headline).lower()
    processedHeadlineList=processedHeadline.split()
    # Drop the stop words
    processedHeadlineList = [word for word in processedHeadlineList if word not in stopWords]
    listOfWordsInLines.append(processedHeadlineList)

In [26]:
print(listOfWordsInLines[7000])

['concrete', 'blonde', 'performs', 'bloodletting', 'album', 'live', 'austin', 'tx', 'june', '19th']
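
The cleaning steps above can also be wrapped in a small function so that a single headline is easy to check by hand. This is only a sketch: `clean_headline` and `STOP_WORDS` are hypothetical names that do not appear in the notebook, and the stop-word set here is a tiny illustrative subset of the full `stopWords` list. A `set` is used because membership tests on a set do not scan the whole collection the way they do on a list.

```python
import re

# Tiny illustrative subset (assumption); the notebook's full stopWords list would be used in practice.
STOP_WORDS = {"the", "of", "a", "in", "to", "is"}

def clean_headline(headline, stop_words=STOP_WORDS):
    """Strip special characters, lower-case, split into words, and drop stop words."""
    text = re.sub(r"[^A-Za-z0-9 ]+", "", str(headline)).lower()
    return [word for word in text.split() if word not in stop_words]

tokens = clean_headline("The King of Pop: Live in Austin, TX!")
print(tokens)  # ['king', 'pop', 'live', 'austin', 'tx']
```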


## Step-4 Calculating Word Frequencies
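- `wordfrequencies` counts, for every word, how often each other word appears `whichSuccessor` positions after it within the same headline. With `whichSuccessor=1` it counts the immediately following word; with `whichSuccessor=2` it counts the word after that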

In [27]:
def wordfrequencies(whichSuccessor=1):
    nextWordFrequencies = {} 
    for i in range(len(listOfWordsInLines)):
        listOfWordsInCurrentLine = listOfWordsInLines[i]
        for wordIndex in range(len(listOfWordsInCurrentLine)-whichSuccessor):
            currentWord=listOfWordsInCurrentLine[wordIndex]
            nextWord=listOfWordsInCurrentLine[wordIndex+whichSuccessor] 
            if currentWord not in nextWordFrequencies:
                nextWordFrequencies[currentWord]={nextWord:1}
            else:
                if nextWord not in nextWordFrequencies[currentWord].keys():
                    nextWordFrequencies[currentWord][nextWord] = 1
                else:
                    nextWordFrequencies[currentWord][nextWord] += 1
    return nextWordFrequencies
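
To see what such a frequency table looks like, here is a minimal sketch on three made-up headlines. It is not the notebook's function: `successor_counts` and `toy_lines` are hypothetical names, and the counting is done with `collections.Counter` instead of nested `if` checks, but the resulting counts follow the same rule (pair each word with the word `offset` positions later in the same line).

```python
from collections import Counter, defaultdict

def successor_counts(lines, offset=1):
    """Count how often each word is followed, `offset` positions later, by another word."""
    counts = defaultdict(Counter)
    for words in lines:
        # zip stops at the shorter sequence, so no pair crosses the end of a line
        for current, successor in zip(words, words[offset:]):
            counts[current][successor] += 1
    return counts

# Made-up data for illustration only
toy_lines = [
    ["top", "pop", "music", "awards"],
    ["pop", "music", "news"],
    ["top", "music", "awards"],
]
first = successor_counts(toy_lines, offset=1)
second = successor_counts(toy_lines, offset=2)
print(dict(first["music"]))   # {'awards': 2, 'news': 1}
print(dict(second["top"]))    # {'music': 1, 'awards': 1}
```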

## Step-5 Calculating Probabilities
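- `nextWordProbabilities` divides each count by the total count for its current word, which turns a frequency table into probabilities. It works for either table (immediately next word or second next word) and overwrites the counts in the dictionary it is given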

In [29]:
def nextWordProbabilities(wordFrequencies):
    for currentword in wordFrequencies:
        totalFrequenciesForCurrentWord=0
        for nextWord in wordFrequencies[currentword]:
            totalFrequenciesForCurrentWord+=wordFrequencies[currentword][nextWord]
        for nextWord in wordFrequencies[currentword]:
            currentNextWordFrequency=wordFrequencies[currentword][nextWord]
            wordFrequencies[currentword][nextWord]=currentNextWordFrequency/totalFrequenciesForCurrentWord
    return wordFrequencies
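
Because the function above replaces the counts inside the dictionary it receives, the raw frequencies are no longer available afterwards. If both are needed, one option is a variant that builds a new table. The sketch below is an assumption about how that could look; `normalize` and `toy_counts` are hypothetical names, not part of the notebook.

```python
def normalize(counts):
    """Return a new table of probabilities; the input counts are left untouched."""
    probabilities = {}
    for current, successors in counts.items():
        total = sum(successors.values())
        probabilities[current] = {word: n / total for word, n in successors.items()}
    return probabilities

# Made-up counts for illustration only
toy_counts = {"music": {"awards": 2, "news": 1, "video": 1}}
toy_probs = normalize(toy_counts)
print(toy_probs["music"])  # {'awards': 0.5, 'news': 0.25, 'video': 0.25}
```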
            

## Step-6 Building probability distributions
- Using the methods above, we build both the next word probabilities and the second next word probabilities

In [30]:
allNextWordProbabilities=nextWordProbabilities(wordfrequencies(1))

allThirdWordProbabilities=nextWordProbabilities(wordfrequencies(2))
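
A quick way to sanity-check a probability table is to list the most likely successors of a word. The helper below is a sketch with a made-up table; `top_successors` and `toy_table` are hypothetical names, and the same call could be pointed at `allNextWordProbabilities` or `allThirdWordProbabilities`.

```python
import heapq

def top_successors(table, word, k=3):
    """Return up to k (successor, probability) pairs for `word`, most likely first."""
    successors = table.get(word, {})
    return heapq.nlargest(k, successors.items(), key=lambda item: item[1])

# Made-up probabilities for illustration only
toy_table = {"pop": {"music": 0.6, "culture": 0.3, "star": 0.1}}
print(top_successors(toy_table, "pop", k=2))  # [('music', 0.6), ('culture', 0.3)]
```

Words that never had a successor in the corpus simply return an empty list instead of raising a `KeyError`.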

## Step-7 Complete the sentence Method
- It takes a line with at least two words as input. The last word is looked up in the next word probabilities and the penultimate word in the second next word probabilities, and the two are combined to predict the following word. This repeats until the number of predicted words reaches the limit passed as the `maximumWords` parameter
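
For the first predicted word, the scoring rule can be written as follows. The symbols are notation introduced here for illustration, not taken from the notebook: $P_1$ stands for the next word probabilities, $P_2$ for the second next word probabilities, and only words $w$ that appear in both tables for the given context are scored.

```latex
\hat{w} \;=\; \arg\max_{w} \; P_1\!\left(w \mid w_{\text{last}}\right)\,\cdot\, P_2\!\left(w \mid w_{\text{penultimate}}\right)
```

After each prediction the window shifts by one word: the old last word becomes the penultimate word and $\hat{w}$ becomes the last word.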

In [151]:
def completeTheSentence(inputLine, secondWordProbabilities=None, thirdWordProbabilities=None, maximumWords=7):
    # Avoid mutable default arguments; fall back to empty tables
    secondWordProbabilities = secondWordProbabilities or {}
    thirdWordProbabilities = thirdWordProbabilities or {}
    prediction=inputLine+" "
    inputLineList=re.sub('[^A-Za-z0-9 ]+', '', inputLine).lower().split()
    if len(inputLineList) < 2:
        raise ValueError("inputLine must contain at least two words")
    lastWord=inputLineList[-1]
    penultimateWord=inputLineList[-2]
    # Candidate scores are kept across iterations; only the chosen word is removed each time
    predictionProbabilities={}
    for i in range(maximumWords):
        # .get() avoids a KeyError for words that never had a successor in the corpus
        lastWordProbabilities=secondWordProbabilities.get(lastWord, {})
        penultimateWordProbabilities=thirdWordProbabilities.get(penultimateWord, {})

        # Score every word that follows lastWord and also appears two positions after penultimateWord
        for nextPossibleWord in lastWordProbabilities:
            if nextPossibleWord in penultimateWordProbabilities:
                predictionProbabilities[nextPossibleWord]=lastWordProbabilities[nextPossibleWord]*penultimateWordProbabilities[nextPossibleWord]

        # Stop early when there is no candidate left
        if len(predictionProbabilities)==0:
            break
        predictNext=max(predictionProbabilities,key=predictionProbabilities.get)
        del predictionProbabilities[predictNext]
        penultimateWord=lastWord
        lastWord=predictNext
        prediction="".join((prediction,predictNext," "))
    return prediction
        

## Results of Auto Sentence Completion

In [152]:
completeTheSentence("10 free", allNextWordProbabilities,allThirdWordProbabilities )

'10 free food fun wine events weekend week april '

In [156]:
completeTheSentence("top pop", allNextWordProbabilities,allThirdWordProbabilities,10 )

'top pop music awards 2011 2010 season 2 episode 2 recap spoilers '

#### We could generate some good predictions; although some parts of them don't make sense, the overall sentence formation is still good

## Step-8 Generate New Headlines method
- Given the number of headlines we need and the number of words to predict for each, it randomly generates new headlines for us

In [144]:
def generate_new_headlines(maximumWords=15,maximumLines=5):
    for i in range(maximumLines):
        # Pick a random starting word, then a random word that followed it in the corpus
        firstWord=random.choice(list(allNextWordProbabilities.keys()))
        secondWord=random.choice(list(allNextWordProbabilities[firstWord].keys()))
        line= firstWord+" "+secondWord
        print(completeTheSentence(line, allNextWordProbabilities,allThirdWordProbabilities,maximumWords ))
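
The function above picks the two starting words uniformly at random, and the completion step then always takes the highest-scoring candidate. A different technique, not used in this notebook, is to sample a successor in proportion to its probability, which tends to give more varied text. The sketch below shows that idea on a made-up table; `sample_next_word` and `toy_table` are hypothetical names.

```python
import random

def sample_next_word(table, word, rng=random):
    """Draw one successor of `word`, weighted by its probability; None if the word is unknown."""
    successors = table.get(word)
    if not successors:
        return None
    words = list(successors)
    weights = [successors[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Made-up probabilities for illustration only
toy_table = {"pop": {"music": 0.6, "culture": 0.3, "star": 0.1}}
rng = random.Random(0)  # seeded generator so repeated runs give the same draws
draws = [sample_next_word(toy_table, "pop", rng) for _ in range(1000)]
print({w: draws.count(w) for w in toy_table["pop"]})
```

Over many draws, `music` should come up most often and `star` least often, roughly matching the weights.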

## Results of Generating Random Headlines 

In [150]:
generate_new_headlines(7,4)

incomplete untruthful shares trade secrets new jersey city book 
62nd annual national primetime day park show weekend 2 
disagreements says report may 1 2010 part 2 3 
wwii pilots wins first national chicago round 1 2 


### Again, the randomly generated headlines have good sentence structure, with a few errors in them.