# Generating random Tweets

This notebook uses Markov chain models to generate random texts no longer than a Tweet (140 characters).

Although the generated text can look reasonably realistic, it is a completely synthetic dataset. For that reason, instead of using the generated data to train our NER model, we only use it to study which text operations can improve the model's accuracy.

## 1. Generating the Markov chain model

Here we read a series of raw text files, build one Markov chain model per file, and combine them into a single model, which we then save as a JSON file.
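To make the idea of a Markov chain text model concrete, the following sketch builds a tiny word-level chain by hand. It is only an illustration under simplifying assumptions and is not how `markovify` is implemented: `build_chain`, `generate`, and `toy_corpus` are hypothetical names, and sentence boundaries are ignored. Each state is a run of `state_size` consecutive words, and the model stores which words were seen right after that state.

```python
import random
from collections import defaultdict

def build_chain(text, state_size=2):
    """Map each run of `state_size` words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        chain[state].append(words[i + state_size])
    return dict(chain)

def generate(chain, max_chars=140, seed=None):
    """Walk the chain from a random state until the text would exceed
    max_chars or the current state has no known successor."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    output = list(state)
    while state in chain:
        next_word = rng.choice(chain[state])
        if len(' '.join(output + [next_word])) > max_chars:
            break
        output.append(next_word)
        state = tuple(output[-len(state):])
    return ' '.join(output)

# Hypothetical toy corpus, used only for illustration
toy_corpus = 'the cat sat on the mat and the cat ran to the door'
toy_chain = build_chain(toy_corpus, state_size=2)
print(generate(toy_chain, max_chars=140, seed=0))
```

A larger `state_size` makes the output follow the source text more closely, because fewer continuations exist for each longer state.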

In [1]:
import markovify
import glob

max_tweet_size = 140
raw_text_files_path = 'data/raw/markov_text_files/*.txt'
parsed_model_file_path = 'data/parsed/markov_text_files/markov_weighted_chain.json'
raw_text_file_paths = glob.glob(raw_text_files_path)
raw_text_markov_models = []

# Read raw text files and generate Markov chain models
for file_path in raw_text_file_paths:
    with open(file_path) as file:
        text = file.read()
        markov_model = markovify.Text(text, state_size=4)
        raw_text_markov_models.append(markov_model)
    
# Combine all generated models into a single one
markov_model = markovify.combine(raw_text_markov_models)

In [2]:
# Generate 5 random sentences from the generated Markov chain model
for i in range(5):
    print(markov_model.make_short_sentence(max_tweet_size))

Since then, Russia has shifted its post-Soviet democratic ambitions in favor of a national budget system, which soon found public backing.
Nevis continues in its efforts to separate from Saint Kitts fell short of the two-thirds majority needed.
It happened that on that morning of his name day the prince was in the habit of putting up for the last thirty years.
France France is in the midst of a civil war and made many efforts to avoid a crisis.
The people rely heavily on aid from New Zealand in 2002 was about US$2 million.


In [3]:
# Save model as JSON
model_json = markov_model.chain.to_json()
with open(parsed_model_file_path, 'w') as json_file:
    json_file.write(model_json)

## 2. Improving the model

Even though we want to generate completely random Tweet-like texts, our aim is to improve the accuracy of our NER model for Country/Nationality/Religion/Currency recognition. To that end, we generate a fixed number of samples from the Markov model, label them manually, and use them as testing data.

In [4]:
import pandas as pd

# Define target datasets' paths
parsed_country_nationality_file = 'data/parsed/parsed_country_nationality.csv'
parsed_currency_country_file = 'data/parsed/parsed_currency_country.csv'
parsed_country_religion_file = 'data/parsed/country_religion_files/parsed_country_religion.csv'
parsed_country_cities_file = 'data/parsed/parsed_country_cities.csv'

# Load the necessary datasets
country_nationality_df = pd.read_csv(parsed_country_nationality_file, encoding='utf-8', compression='gzip', index_col=False)
currency_country_df = pd.read_csv(parsed_currency_country_file, encoding='utf-8', compression='gzip', index_col=False)
country_religion_df = pd.read_csv(parsed_country_religion_file, encoding='utf-8', compression='gzip', index_col=False)
country_cities_df = pd.read_csv(parsed_country_cities_file, encoding='utf-8', compression='gzip', index_col=False)

# Store unique sets
unique_country_common_names = country_nationality_df['Common Name'].astype(str).unique()
unique_country_official_names = country_nationality_df['Official Name'].astype(str).unique()
unique_country_nationalities = country_nationality_df['Nationality'].astype(str).unique()
unique_country_religions_name = country_religion_df['Religion'].astype(str).unique()
unique_country_rilogions_affiliation = country_religion_df['Affiliation'].astype(str).unique()
unique_currency_ids = currency_country_df['ID'].astype(str).unique()
unique_country_city_names = country_cities_df['City'].astype(str).unique()
# TODO also add currencies' full names

In [5]:
import re
import random

# Define important word sets
important_word_dict = {
    'Country Names': list(map(lambda x : x.upper(), unique_country_common_names)),
    'Country Names (Official)': list(map(lambda x : x.upper(), unique_country_official_names)),
    'Country Nationalities': list(map(lambda x : x.upper(), unique_country_nationalities)),
    'Religion Names': list(map(lambda x : x.upper(), unique_country_religions_name)),
    'Religion Affiliations': list(map(lambda x : x.upper(), unique_country_rilogions_affiliation)),
    'Currencies': list(map(lambda x : x.upper(), unique_currency_ids)),
    'City Names': list(map(lambda x : x.upper(), unique_country_city_names))
}

def get_word_label(word):
    '''
    Return the label of the first important word set that
    contains the word, or None if the word is not 'important'.
    '''
    # Strip non-letter characters and upper-case the word for comparison
    comparable_word = re.sub('[^a-zA-Z]', '', word).upper()

    for important_word_label, important_word_set in important_word_dict.items():
        if comparable_word in important_word_set:
            return important_word_label
    return None
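
The matching above works on single whitespace-separated tokens, so a multi-word name can never match, and any short entry in a word set will match common words. The sketch below illustrates that behaviour and one possible mitigation, a minimum token length. It is only an assumption-laden example: `toy_city_names`, `normalize`, `match_tokens`, and `min_length` are hypothetical, and the toy set is not taken from the real datasets.

```python
import re

# Hypothetical stand-in for one of the notebook's word sets
toy_city_names = frozenset({'PARIS', 'A', 'OF', 'LIMA'})

NON_LETTERS = re.compile('[^a-zA-Z]')

def normalize(word):
    """Strip non-letters and upper-case, mirroring get_word_label."""
    return NON_LETTERS.sub('', word).upper()

def match_tokens(sentence, word_set, min_length=1):
    """Return the tokens whose normalized form is in word_set
    and has at least min_length letters."""
    matches = []
    for token in sentence.split():
        key = normalize(token)
        if len(key) >= min_length and key in word_set:
            matches.append(token)
    return matches

sentence = 'A flight of stairs in Paris, of course.'
print(match_tokens(sentence, toy_city_names))
# ['A', 'of', 'Paris,', 'of']
print(match_tokens(sentence, toy_city_names, min_length=3))
# ['Paris,']
```

A length threshold would also drop genuinely short names, so a stop-word list may be a better fit depending on the data.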

In [6]:
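def generate_testing_samples_dict(max_testing_samples, unnecessary_sample_retention_percent):
    '''
    Generate sentences, ask the user to verify the entity counts of each
    flagged sentence, and return the retained samples as a dictionary.
    Flagged sentences verified as containing no entities are kept with a
    chance of roughly unnecessary_sample_retention_percent percent.
    '''
    # Number of samples saved so far
    cur_testing_samples = 0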

    testing_sample_dict = {}
    while len(testing_sample_dict.keys()) < max_testing_samples:
        batch_size = max_testing_samples - cur_testing_samples
        samples = [markov_model.make_short_sentence(max_tweet_size) for i in range(batch_size)]
        
        for sample in samples:
            important_words_dict = {}
            contains_important_word = False
            if sample is None:
                continue
            for word in sample.split():
                word_label = get_word_label(word)
                if word_label is not None:
                    important_words_dict[word_label] = important_words_dict.get(word_label, list()) + [word]
                    contains_important_word = True
            if (contains_important_word):
                print('\n---Sample----------')
                print('| [Text]')
                print('| \t{}'.format(sample))
                print('| [Analysis results]')
                for label, word_list in important_words_dict.items():
                    print('| \t{}: {}'.format(label, word_list))
                print('| [Verification]')
                n_city_names = int(input('| \t[1/6] # City Names: '))
                n_country_names = int(input('| \t[2/6] # Country Names: '))
                n_country_nationalities = int(input('| \t[3/6] # Nationalities: '))
                n_religion_names = int(input('| \t[4/6] # Religion names: '))
                n_religion_affiliations = int(input('| \t[5/6] # Religious affiliations: '))
                n_currency_names = int(input('| \t[6/6] # Currency names: '))
                
                n_param_sum = n_city_names + n_country_names + n_country_nationalities + n_religion_names + n_religion_affiliations + n_currency_names
                if (n_param_sum > 0 or random.randint(0,100) <= unnecessary_sample_retention_percent):
                    testing_sample_dict[cur_testing_samples] = {
                        'Text': sample,
                        'City Names': n_city_names,
                        'Country Names': n_country_names,
                        'Country Nationalities': n_country_nationalities,
                        'Religion Names': n_religion_names,
                        'Religion Affiliations': n_religion_affiliations,
                        'Currency Names': n_currency_names
                    }
                    cur_testing_samples += 1
                    print('| [Result]')
                    print('| \tSAVED ({}/{})'.format(cur_testing_samples, max_testing_samples))
                else:
                    print('| [Result]')
                    print('| \tDISCARDED')
                print('------------------')
    return testing_sample_dict
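
The retention rule in the function above draws `random.randint(0, 100)`, which returns one of 101 equally likely integers because both endpoints are included. The sketch below computes the resulting probability exactly and shows one alternative draw. The helper names are hypothetical, and the assumption that the intended probability is exactly `percent / 100` is mine, not stated in the notebook.

```python
import random
from fractions import Fraction

def randint_retention_probability(percent):
    """Exact chance that random.randint(0, 100) <= percent."""
    favourable = sum(1 for value in range(0, 101) if value <= percent)
    return Fraction(favourable, 101)

def keep_sample(percent, rng=random):
    """Alternative draw that keeps a sample with probability percent / 100
    (assumed intent; not the notebook's own rule)."""
    return rng.random() * 100 < percent

print(randint_retention_probability(20))  # 21/101
print(randint_retention_probability(0))   # 1/101
```

With the value 20 used in the next cell, the notebook's rule keeps an entity-free sample about 20.8% of the time, and a value of 0 would still keep about 1% of them.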

In [7]:
sample_dict = generate_testing_samples_dict(20, 20)


---Sample----------
| [Text]
| 	The little princess went round the table in a pensive attitude.
| [Analysis results]
| 	City Names: ['a']
| [Verification]
| 	[1/6] # City Names: 0
| 	[2/6] # Country Names: 0
| 	[3/6] # Nationalities: 0
| 	[4/6] # Religion names: 0
| 	[5/6] # Religious affiliations: 0
| 	[6/6] # Currency names: 0
| [Result]
| 	DISCARDED
------------------

---Sample----------
| [Text]
| 	She looked at him in dismay trying to guess what he wanted of her or why he was asking to be discharged.
| [Analysis results]
| 	City Names: ['of']
| [Verification]
| 	[1/6] # City Names: 0
| 	[2/6] # Country Names: 0
| 	[3/6] # Nationalities: 0
| 	[4/6] # Religion names: 0
| 	[5/6] # Religious affiliations: 0
| 	[6/6] # Currency names: 0
| [Result]
| 	DISCARDED
------------------

---Sample----------
| [Text]
| 	The disease lasts for from two or three drams in the case of the Revolution.
| [Analysis results]
| 	City Names: ['of']
| [Verification]
| 	[1/6] # City Names: 0
| 	[2/6] # Co

| [Result]
| 	SAVED (5/20)
------------------

---Sample----------
| [Text]
| 	Economic development is constrained by a shortage of skilled labor and a deficient infrastructure.
| [Analysis results]
| 	City Names: ['is', 'a', 'of', 'a']
| [Verification]
| 	[1/6] # City Names: 0
| 	[2/6] # Country Names: 0
| 	[3/6] # Nationalities: 0
| 	[4/6] # Religion names: 0
| 	[5/6] # Religious affiliations: 0
| 	[6/6] # Currency names: 0
| [Result]
| 	DISCARDED
------------------

---Sample----------
| [Text]
| 	DISEASES OF BONE The morbid processes met with in bone originate in the same way as the day before, yet she was quite different.
| [Analysis results]
| 	City Names: ['OF', 'BONE', 'bone', 'same', 'as']
| [Verification]
| 	[1/6] # City Names: 0
| 	[2/6] # Country Names: 0
| 	[3/6] # Nationalities: 0
| 	[4/6] # Religion names: 0
| 	[5/6] # Religious affiliations: 0
| 	[6/6] # Currency names: 0
| [Result]
| 	DISCARDED
------------------

---Sample----------
| [Text]
| 	Prince Vasili had come 

| 	[1/6] # City Names: 0
| 	[2/6] # Country Names: 3
| 	[3/6] # Nationalities: 0
| 	[4/6] # Religion names: 0
| 	[5/6] # Religious affiliations: 0
| 	[6/6] # Currency names: 0
| [Result]
| 	SAVED (12/20)
------------------

---Sample----------
| [Text]
| 	Oil and gas have given Qatar a per capita GDP about two-thirds that of the Big Four EU economies.
| [Analysis results]
| 	Country Names: ['Qatar']
| 	City Names: ['a', 'per', 'of']
| [Verification]
| 	[1/6] # City Names: 0
| 	[2/6] # Country Names: 1
| 	[3/6] # Nationalities: 0
| 	[4/6] # Religion names: 0
| 	[5/6] # Religious affiliations: 0
| 	[6/6] # Currency names: 0
| [Result]
| 	SAVED (13/20)
------------------

---Sample----------
| [Text]
| 	He quietly shot back a panel in the upper part of the door and the bright daylight in that previously darkened room startled her.
| [Analysis results]
| 	City Names: ['a', 'of']
| [Verification]
| 	[1/6] # City Names: 0
| 	[2/6] # Country Names: 0
| 	[3/6] # Nationalities: 0
| 	[4/6] # Rel

| 	[1/6] # City Names: 0
| 	[2/6] # Country Names: 0
| 	[3/6] # Nationalities: 0
| 	[4/6] # Religion names: 0
| 	[5/6] # Religious affiliations: 0
| 	[6/6] # Currency names: 0
| [Result]
| 	SAVED (20/20)
------------------


In [8]:
sample_df = pd.DataFrame.from_dict(sample_dict, orient='index')
sample_df.head()

| | Text | City Names | Country Names | Country Nationalities | Religion Names | Religion Affiliations | Currency Names |
|---|---|---|---|---|---|---|---|
| 0 | In their narration events occur solely by the ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Papua New Guinea Papua New Guinea is one of th... | 0 | 3 | 0 | 0 | 0 | 0 |
| 2 | There is a tender swelling on either side of t... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | He has forwarded me a letter from Westhouse & ... | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | Lower prices for oil and diamonds during the g... | 0 | 0 | 0 | 0 | 0 | 0 |


In [9]:
# Define file path for output
test_samples_file = 'data/parsed/markov_text_files/test_samples.csv'

sample_df.to_csv(test_samples_file, encoding='utf-8', index=False, compression='gzip')
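
Because the file is written with `compression='gzip'` but keeps a plain `.csv` name, it cannot be opened as ordinary text, and a reader that infers compression from the file extension will not detect it. This is why the loading cell earlier in this notebook passes `compression='gzip'` explicitly. The standard-library sketch below illustrates the same situation with hypothetical rows and a temporary file path. It does not touch the notebook's real data.

```python
import csv
import gzip
import os
import tempfile

# Hypothetical rows standing in for the labelled samples
rows = [
    {'Text': 'Qatar exports gas.', 'Country Names': '1'},
    {'Text': 'She looked at him.', 'Country Names': '0'},
]

path = os.path.join(tempfile.mkdtemp(), 'toy_samples.csv')

# Write a gzip-compressed CSV under a plain .csv name
with gzip.open(path, 'wt', newline='', encoding='utf-8') as handle:
    writer = csv.DictWriter(handle, fieldnames=['Text', 'Country Names'])
    writer.writeheader()
    writer.writerows(rows)

# The file starts with the gzip magic bytes, so it is not plain text
with open(path, 'rb') as handle:
    is_gzip = handle.read(2) == b'\x1f\x8b'

# Reading it back requires decompressing first
with gzip.open(path, 'rt', newline='', encoding='utf-8') as handle:
    loaded = list(csv.DictReader(handle))

print(is_gzip, loaded == rows)
```

Naming the output with a `.csv.gz` suffix would be one way to make the compression visible to other tools.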