# Hidden Markov Text generation

### For this project we generate fake news headlines by implementing a hidden Markov model for text generation that learns from a text corpus.
### Data set found on Kaggle: India News Headlines (2001 - 2018).
## https://www.kaggle.com/therohk/india-headlines-news-dataset


# V1 --> implement a dictionary to learn transition and emission probabilities.

In [1]:
import random
import pandas as pd 
import operator

data = pd.read_csv('Data/india-news-headlines.csv')

data

Unnamed: 0,publish_date,headline_category,headline_text
0,20010101,sports.wwe,win over cena satisfying but defeating underta...
1,20010102,bollywood,Raju Chacha
2,20010102,unknown,Status quo will not be disturbed at Ayodhya; s...
3,20010102,unknown,Fissures in Hurriyat over Pak visit
4,20010102,unknown,America's unwanted heading for India?
5,20010102,unknown,For bigwigs; it is destination Goa
6,20010102,unknown,Extra buses to clear tourist traffic
7,20010102,unknown,Dilute the power of transfers; says Riberio
8,20010102,unknown,Focus shifts to teaching of Hindi
9,20010102,unknown,IT will become compulsory in schools


In [2]:
print(data["headline_category"].value_counts())

india                                                                           271030
unknown                                                                         206571
city.mumbai                                                                     123124
city.delhi                                                                      112705
business.india-business                                                         107555
city.chandigarh                                                                  99710
city.hyderabad                                                                   86229
city.bengaluru                                                                   84475
entertainment.hindi.bollywood.news                                               80586
city.lucknow                                                                     77885
city.ahmedabad                                                                   77530
city.pune                                  

## Future Work :
### -- use combinations of the headline tags to establish a geo-fence.
### -- define pseudo-categories to generate more targeted news

In [40]:
sample = data["headline_text"][:]
sample

0          win over cena satisfying but defeating underta...
1                                                Raju Chacha
2          Status quo will not be disturbed at Ayodhya; s...
3                        Fissures in Hurriyat over Pak visit
4                      America's unwanted heading for India?
5                         For bigwigs; it is destination Goa
6                       Extra buses to clear tourist traffic
7                Dilute the power of transfers; says Riberio
8                          Focus shifts to teaching of Hindi
9                       IT will become compulsory in schools
10             Move to stop freedom fighters' pension flayed
11         Gilani claims he applied for passport 2 years ago
13                     India; Pak exchange lists of N-plants
14               Will Qureshi's return really help the govt?
15                PM's tacit message: Put Ram tample on hold
16                      Text of the Prime Minister's article
17                    NC

## Major components of Hidden Markov model (v1) :

1. Observations --> The words of the raw headline strings --> either predicted or taken from the original dataset.

2. Word correlations --> Emission probabilities --> "markov_brain" defined below --> a dictionary storing, for every unique word, the count of each word that follows it; dividing a count by the word's "~Total~" entry gives the transition probability. A dictionary is used to avoid a sparse matrix.

3. Dynamic termination control --> The model stops when "~END~" is predicted as the next word.
    This works because the same string was appended to every headline before training, so every likely final word should be followed by this tag with high probability.
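
As a rough illustration of components 2 and 3, the sketch below builds the same kind of word-to-follower count dictionary on a three-headline toy corpus. The names `count_transitions`, `END` and `toy` are assumptions made for this example, not part of the notebook. Unlike the notebook cell, it counts transitions starting from the first word and keeps no "~Total~" entry (the total is the sum of a word's counts).

```python
from collections import Counter, defaultdict

END = "~END~"  # same end marker the notebook appends to each headline

def count_transitions(headlines):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for headline in headlines:
        words = headline.split() + [END]
        # Pair every word with the word that comes right after it.
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

# Toy corpus (hypothetical headlines, not taken from the dataset).
toy = ["rain hits Mumbai", "rain hits Delhi", "Delhi votes"]
counts = count_transitions(toy)

print(dict(counts["rain"]))   # followers of "rain"
print(dict(counts["Delhi"]))  # "Delhi" is followed once by "votes", once by END
```

Because "Delhi" ends one headline, it picks up a count for `~END~`, which is what lets generation stop on its own.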


In [None]:
markov_brain = {}
starter_words = {}
#ender_words = []
DBN_depth = 2
## TODO: use to make the model have a longer temporal state memory
## i.e. : Implement N-grams

for headline in sample:
    headline = headline + " ~END~"
    words = headline.split()
    if len(words) < DBN_depth:
        continue
        
#     for i in range(0,DBN_depth):
#         starter_words .append(words[i])

    if words[0] in starter_words:
        starter_words[words[0]] = starter_words[words[0]] + 1
        
    else:
        starter_words[words[0]] = 1
        
    #ender_words.append(words[len(words)-1])
    
    # NOTE: the loop starts at index 1, so the transition out of the first word is not counted.
    for word in range(1,len(words)-1):
        if words[word] in markov_brain:
            if words[word+1] in markov_brain[words[word]]:
                markov_brain[words[word]][words[word+1]] = markov_brain[words[word]][words[word+1]] + 1
            else:
                markov_brain[words[word]][words[word+1]] = 1
                
            # total counter to use for probability distributions
            markov_brain[words[word]]["~Total~"] = markov_brain[words[word]]["~Total~"] + 1
            
        else:
            markov_brain[words[word]] = {}
            markov_brain[words[word]][words[word+1]] = 1
            markov_brain[words[word]]["~Total~"] = 1

markov_brain

#### Markov_brain :

{'over': {'cena': 1, <br>
  '~Total~': 62468,<br>
  'Pak': 50,<br>
  'Krishna': 14,<br>
  "ministers'": 5,<br>
  'team': 9,<br>
  'Coke': 3,<br>
  'Nepal;': 2,<br>
  'JD': 1,<br>
  'affair': 85,<br>
  'installing': 5,<br>
  'airline': 3,<br>
  '15;000': 10,<br>
  'yet': 39,<br>
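
The TODO next to `DBN_depth` in the training cell mentions N-grams. One possible way to give the model a longer memory is to key the dictionary on a tuple of the last `depth` words instead of a single word. This is only a sketch under that assumption; `count_ngram_transitions`, `depth` and the toy headlines are hypothetical names and data, not the notebook's implementation.

```python
from collections import Counter, defaultdict

END = "~END~"  # same end marker used during training

def count_ngram_transitions(headlines, depth=2):
    """Map each tuple of `depth` consecutive words to a Counter of next words."""
    counts = defaultdict(Counter)
    for headline in headlines:
        words = headline.split() + [END]
        if len(words) <= depth:
            continue  # too short to hold one full context plus a next word
        for i in range(len(words) - depth):
            context = tuple(words[i:i + depth])
            counts[context][words[i + depth]] += 1
    return counts

# Toy corpus (hypothetical headlines, not taken from the dataset).
toy = ["rain hits Mumbai again", "rain hits Delhi"]
context_counts = count_ngram_transitions(toy, depth=2)

print(dict(context_counts[("rain", "hits")]))
```

Longer contexts make the generated text more coherent but also more likely to copy training headlines verbatim, so `depth` is a trade-off.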

In [51]:
def guessNext(Observation):
    if Observation not in markov_brain:
        if random.uniform(0, 1) < 0.5 :
            posterior = random.choice(list(starter_words.keys()))
        else :
            posterior = random.choice(list(markov_brain.keys()))
    else:
        # TODO : A bunch more stochastic reasoning and selection.
        evidence = markov_brain[Observation]
        sorted_x = sorted(evidence.items(), key=lambda kv: kv[1])
        sorted_x.reverse()
        # Fix for words that are absolutely always preceded or followed by something specific
        if sorted_x[1][1] / evidence["~Total~"] > 0.9 :
            posterior = sorted_x[1][0]
        else:
            # Naive selection - kind of stochastic
            if len(sorted_x) > 4:
                sorted_x = sorted_x[1:int(len(sorted_x)/2)]
            posterior = random.choice(list(sorted_x))
            posterior = posterior[0]
        if posterior == "~Total~":
            posterior = guessNext(Observation)
    return posterior
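
`guessNext` picks uniformly from the top half of the followers. One possible alternative for the "more stochastic selection" TODO is to sample the next word with probability proportional to its count. The sketch below assumes the same dictionary layout as `markov_brain` entries (follower counts plus a "~Total~" key); `sample_next` and the toy counts are hypothetical.

```python
import random

def sample_next(followers, rng=random):
    """Pick a next word with probability proportional to its count."""
    # Skip the bookkeeping "~Total~" entry so it can never be returned.
    words = [w for w in followers if w != "~Total~"]
    weights = [followers[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Toy follower counts in the same shape as one markov_brain entry.
followers = {"Pak": 50, "affair": 85, "yet": 39, "~Total~": 174}

print(sample_next(followers))
```

With this approach the recursion on "~Total~" is no longer needed, and rare followers can still appear occasionally instead of being cut off.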

In [68]:
curr_word = random.choice(list(starter_words.keys()))

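# Build the headline word by word until the end marker is predicted.
# (Named `generated` rather than `str` to avoid shadowing the built-in.)
generated = curr_word

while curr_word != "~END~":
    n_word = guessNext(curr_word)
    generated = generated + " " + n_word
    curr_word = n_word

# Drop the trailing " ~END~" marker and close the sentence.
generated = generated[:-len(" ~END~")] + "."

generated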

'Computer-generated signaling business plots against Manmohan.'

## /\ Here is one of the headlines that was generated by this method /\

## V2 components :
1. Hidden states --> The position in a sentence at which the preceding words have already been assigned to their semantic component groups, i.e. you have identified Comp1--Comp2--Comp1--Comp3-- and are trying to guess the "CompX" type of the next word. You also know that certain sequences of these CompX groups form a complete sentence.
2. States --> The "CompX" groups used above. These could be close to pseudo-replicas of the real grammatical parts of speech; this depends heavily on the data set used for training.
3. Observations --> The words of the headline string.
4. Transition probabilities --> The probabilities of moving between CompX groups as we read a sentence from left to right in its semantic component group form --> Comp1--Comp2--Comp1--Comp3--
5. Emission probabilities --> Each CompX group stores a set of words, and the group is replaced by a word drawn stochastically from the distribution of words belonging to it (part of speech: noun, verb, etc.)
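
To make the V2 components concrete, here is a minimal sketch of how generation could work once component groups exist. Everything in it is an assumption for illustration: the group names, the `<start>`/`<stop>` markers, the words and all probabilities are toy values, and learning the groups from data is not shown.

```python
import random

# Hypothetical component groups (hidden states) with toy probabilities.
START, STOP = "<start>", "<stop>"
transitions = {
    START: {"Comp1": 1.0},
    "Comp1": {"Comp2": 0.7, STOP: 0.3},
    "Comp2": {"Comp1": 0.6, STOP: 0.4},
}
emissions = {
    "Comp1": {"Delhi": 0.5, "Mumbai": 0.3, "court": 0.2},
    "Comp2": {"bans": 0.6, "approves": 0.4},
}

def draw(distribution, rng):
    """Sample one key from a {key: probability} dictionary."""
    keys = list(distribution)
    return rng.choices(keys, weights=[distribution[k] for k in keys], k=1)[0]

def generate(rng, max_words=12):
    """Walk the hidden states, emitting one word per state visited."""
    words, state = [], draw(transitions[START], rng)
    while state != STOP and len(words) < max_words:
        words.append(draw(emissions[state], rng))   # emission step
        state = draw(transitions[state], rng)       # transition step
    return " ".join(words)

print(generate(random.Random(0)))
```

The `max_words` cap is an added safeguard so that a walk that never reaches the stop state still terminates.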

## Sources :

1. http://miha.filej.net/diploma-thesis/
2. https://stackoverflow.com/questions/306400/how-to-randomly-select-an-item-from-a-list
3. https://medium.com/@kangeugine/hidden-markov-model-7681c22f5b9
4. https://stackoverflow.com/questions/4859292/how-to-get-a-random-value-in-python-dictionary
5. https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value
6. https://en.wikipedia.org/wiki/N-gram
7. https://web.stanford.edu/~jurafsky/slp3/3.pdf
