# Generating random Tweets

This notebook uses Markov chain models to generate random texts no longer than a Tweet (140 characters).

Although the generated text can look reasonably realistic, it is a completely synthetic dataset. For that reason, instead of using it as training data for our NER model, we use it only to study which text operations can improve the model's accuracy.

## 1. Generating the Markov chain model

Here we feed a series of raw text files into markovify to build our model, which we then save as a JSON file.
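
To make the idea concrete, here is a minimal word-level Markov chain written with the standard library only. It is a toy sketch, not markovify's implementation: `build_chain`, `generate` and `toy_corpus` are hypothetical names introduced for illustration, and the state size is reduced to 2 so that the tiny corpus still produces transitions.

```python
import random
from collections import defaultdict

def build_chain(text, state_size=2):
    """Map each tuple of `state_size` consecutive words to the words seen after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        chain[state].append(words[i + state_size])
    return dict(chain)

def generate(chain, max_chars=140, seed=None):
    """Walk the chain from a random state until a dead end or the length cap."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    output = list(state)
    while state in chain:
        next_word = rng.choice(chain[state])
        # Stop before the text would grow past the Tweet-sized limit
        if len(' '.join(output + [next_word])) > max_chars:
            break
        output.append(next_word)
        state = tuple(output[-len(state):])
    return ' '.join(output)

# Hypothetical miniature corpus, only for illustration
toy_corpus = (
    'the patient must be kept in bed and the patient must be kept warm '
    'and the limb must be kept in good position'
)
toy_chain = build_chain(toy_corpus, state_size=2)
toy_text = generate(toy_chain, max_chars=140, seed=0)
print(toy_text)
```

A larger state size makes the output follow the source text more closely, at the cost of fewer distinct continuations per state.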

In [1]:
import markovify
import glob

max_tweet_size = 140
raw_text_files_path = '../../data/raw/markov_text_files/*.txt'
parsed_model_file_path = '../../data/parsed/markov_text_files/markov_weighted_chain.json'
raw_text_file_paths = glob.glob(raw_text_files_path)
raw_text_markov_models = []

# Read raw text files and generate Markov chain models
weights = []
for file_path in raw_text_file_paths:
    with open(file_path, encoding='utf-8') as file:
        text = file.read()
        markov_model = markovify.Text(text, state_size=4)
        raw_text_markov_models.append(markov_model)
        if 'religion' in file_path:
            weights.append(2)
        else:
            weights.append(1)
    
# Combine all generated models into a single one
markov_model = markovify.combine(raw_text_markov_models, weights)
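
The weights passed to the combination step give the religion corpus twice the influence of the other files. As a rough illustration of what weighting means, the sketch below sums transition counts from two toy models and scales each model's counts by its weight. It assumes that weighted combination amounts to scaled count addition; it is not markovify's code, and `count_transitions`, `combine_counts` and the two toy sentences are hypothetical.

```python
from collections import Counter

def count_transitions(text):
    """Count word -> next-word transitions (state size 1, for brevity)."""
    words = text.split()
    counts = {}
    for current, following in zip(words, words[1:]):
        counts.setdefault(current, Counter())[following] += 1
    return counts

def combine_counts(models, weights):
    """Sum transition counts across models, scaling each model by its weight."""
    combined = {}
    for model, weight in zip(models, weights):
        for state, followers in model.items():
            target = combined.setdefault(state, Counter())
            for word, count in followers.items():
                target[word] += count * weight
    return combined

# Hypothetical toy corpora standing in for a generic file and a religion file
generic = count_transitions('the church is old and the market is new')
religion = count_transitions('the church is holy')
combined = combine_counts([generic, religion], weights=[1, 2])
print(combined['is'])
```

After combining, the continuation `holy` carries a count of 2 even though it appears only once in its source, so it is sampled as often as `old` and `new` together.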

In [2]:
# Generate 5 random sentences from the generated Markov chain model
for i in range(5):
    print(markov_model.make_short_sentence(max_tweet_size))

In _concealed chancres_ with phimosis, the sac of the bursa with thickening of its lining membrane.
The shaft is increased in girth as a result of their recent personal conflict with him.
The patient must be kept in good position.
GDP grew more than 8% in 2010, as exports returned to normal levels following the recession.
In 1773 it agreed to return to the capitals, he still continued to live in the same old way, inwardly he began a new life.


In [3]:
# Save model as JSON
model_json = markov_model.chain.to_json()
with open(parsed_model_file_path, 'w') as json_file:
    json_file.write(model_json)

## 2. Improving the model

Even though we want to generate completely random Tweet-like texts, our aim is to improve the accuracy of our NER model for Country/Nationality/Religion/Currency recognition. Having said that, we will go through our model, generate a set amount of samples and use them as testing data.

In [4]:
import pandas as pd

# Define target datasets' paths
parsed_country_nationality_file = '../../data/parsed/parsed_country_nationality.csv'
parsed_currency_country_file = '../../data/parsed/parsed_currency_country.csv'
parsed_country_religion_file = '../../data/parsed/country_religion_files/parsed_country_religion.csv'
parsed_country_cities_file = '../../data/parsed/parsed_country_cities.csv'

# Load the necessary datasets
country_nationality_df = pd.read_csv(parsed_country_nationality_file, encoding='utf-8', compression='gzip', index_col=False)
currency_country_df = pd.read_csv(parsed_currency_country_file, encoding='utf-8', compression='gzip', index_col=False)
country_religion_df = pd.read_csv(parsed_country_religion_file, encoding='utf-8', compression='gzip', index_col=False)
country_cities_df = pd.read_csv(parsed_country_cities_file, encoding='utf-8', compression='gzip', index_col=False)

# Store unique sets
unique_country_common_names = country_nationality_df['Common Name'].astype(str).unique()
unique_country_official_names = country_nationality_df['Official Name'].astype(str).unique()
unique_country_nationalities = country_nationality_df['Nationality'].astype(str).unique()
unique_country_religions_name = country_religion_df['Religion'].astype(str).unique()
unique_country_rilogions_affiliation = country_religion_df['Affiliation'].astype(str).unique()
unique_currency_ids = currency_country_df['ID'].astype(str).unique()
unique_country_city_names = country_cities_df['City'].astype(str).unique()

In [5]:
import re
import random

# Define important word sets
important_word_dict = {
    'Country Names': list(map(lambda x : x.upper(), unique_country_common_names)),
    'Country Names (Official)': list(map(lambda x : x.upper(), unique_country_official_names)),
    'Country Nationalities': list(map(lambda x : x.upper(), unique_country_nationalities)),
    'Religion Names': list(map(lambda x : x.upper(), unique_country_religions_name)),
    'Religion Affiliations': list(map(lambda x : x.upper(), unique_country_rilogions_affiliation)),
    'Currencies': list(map(lambda x : x.upper(), unique_currency_ids)),
    'City Names': list(map(lambda x : x.upper(), unique_country_city_names))
}

# Strip everything except ASCII letters before comparing words
non_letter_regex = re.compile('[^a-zA-Z]')

def get_word_label(word):
    '''
    Return the label of the first important word set that
    contains the word (ignoring case and non-letter characters),
    or None if the word is not considered 'important'.
    '''
    # Normalize once, outside the loop
    comparable_word = non_letter_regex.sub('', word).upper()
    for important_word_label, important_word_set in important_word_dict.items():
        if comparable_word in important_word_set:
            return important_word_label
    return None
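
The lookup above works on single whitespace-separated tokens, after removing non-letter characters and upper-casing. The following sketch reproduces that behaviour on hypothetical miniature word sets (`toy_word_sets` and `toy_word_label` are illustrative names, not part of the notebook) to show two consequences: trailing punctuation does not prevent a match, but multi-word names can never match because each token is checked on its own.

```python
import re

NON_LETTERS = re.compile('[^a-zA-Z]')

# Hypothetical miniature word sets; the notebook builds the real ones from CSV files
toy_word_sets = {
    'Religion Names': {'CHRISTIANITY', 'ISLAM'},
    'Country Names': {'GREECE', 'UNITED STATES'},
}

def toy_word_label(token, word_sets=toy_word_sets):
    """Return the first label whose set contains the normalized token, else None."""
    comparable = NON_LETTERS.sub('', token).upper()
    for label, words in word_sets.items():
        if comparable in words:
            return label
    return None

sentence = 'Protestant Christianity: spread from Greece to the United States.'
labels = {token: toy_word_label(token) for token in sentence.split()}
print(labels)
```

Catching names such as "United States" would require matching on n-grams or on the whole sentence rather than on individual tokens.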

In [6]:
def get_int_input(string):
    # Treat empty or non-numeric input as 0
    try:
        return int(input(string))
    except ValueError:
        return 0

In [7]:
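def generate_testing_samples_dict(max_testing_samples, unnecessary_sample_retention_percent):
    '''
    Generate sentences until max_testing_samples have been manually verified and saved.
    Samples verified as containing no entities are kept with a chance of roughly
    unnecessary_sample_retention_percent percent.
    '''
    # Number of samples saved so far
    cur_testing_samples = 0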

    testing_sample_dict = {}
    while len(testing_sample_dict.keys()) < max_testing_samples:
        batch_size = max_testing_samples - cur_testing_samples
        samples = [markov_model.make_short_sentence(max_tweet_size) for i in range(batch_size)]
        
        for sample in samples:
            important_words_dict = {}
            contains_important_word = False
            if sample is None:
                continue
            for word in sample.split():
                word_label = get_word_label(word)
                if word_label is not None:
                    important_words_dict[word_label] = important_words_dict.get(word_label, list()) + [word]
                    contains_important_word = True
            if (contains_important_word):
                print('------------------')
                print('\n| [New Sample]')
                print('| \tText: {}'.format(sample))
                print('| [Analysis results]')
                for label, word_list in important_words_dict.items():
                    print('| \t{}: {}'.format(label, word_list))
                print('| [Verification]')
                n_city_names = get_int_input('| \t[1/6] # City Names: ')
                n_country_names = get_int_input('| \t[2/6] # Country Names: ')
                n_country_nationalities = get_int_input('| \t[3/6] # Nationalities: ')
                n_religion_names = get_int_input('| \t[4/6] # Religion names: ')
                n_religion_affiliations = get_int_input('| \t[5/6] # Religious affiliations: ')
                n_currency_names = get_int_input('| \t[6/6] # Currency names: ')
                
                n_param_sum = n_city_names + n_country_names + n_country_nationalities + n_religion_names + n_religion_affiliations + n_currency_names
                if (n_param_sum > 0 or random.randint(0,100) <= unnecessary_sample_retention_percent):
                    testing_sample_dict[cur_testing_samples] = {
                        'Text': sample,
                        'City Names': n_city_names,
                        'Country Names': n_country_names,
                        'Country Nationalities': n_country_nationalities,
                        'Religion Names': n_religion_names,
                        'Religion Affiliations': n_religion_affiliations,
                        'Currency Names': n_currency_names
                    }
                    cur_testing_samples += 1
                    print('[Result]')
                    print('\tSAVED ({}/{})'.format(cur_testing_samples, max_testing_samples))
                else:
                    print('[Result]')
                    print('\tDISCARDED')
    return testing_sample_dict
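
A note on the retention test used for samples without entities: `random.randint(0, 100)` draws from 101 integers (both ends inclusive), so comparing it with `<=` keeps a sample slightly more often than the nominal percentage, and never with probability zero. The sketch below computes the exact chance and shows one possible alternative draw; `randint_keep_probability` and `keep_sample` are hypothetical helpers, not part of the notebook.

```python
import random
from fractions import Fraction

def randint_keep_probability(percent):
    """Exact chance that random.randint(0, 100) <= percent for an integer percent."""
    favourable = min(max(percent + 1, 0), 101)  # the integers 0..percent
    return Fraction(favourable, 101)

def keep_sample(percent, rng=random):
    """Alternative draw that keeps a sample with probability percent / 100."""
    return rng.random() * 100 < percent

for percent in (0, 10, 50, 100):
    print(percent, float(randint_keep_probability(percent)))
```

The difference is about one percentage point at most, so it only matters when the retention percentage is meant to be exactly 0.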

In [8]:
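def generate_auto_religion_testing_samples_dict(max_testing_samples):
    '''
    Generate sentences until max_testing_samples containing a religion name or
    affiliation have been saved. Counts come from get_word_label, without manual verification.
    '''
    # Number of samples saved so far
    cur_testing_samples = 0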

    testing_sample_dict = {}
    while len(testing_sample_dict.keys()) < max_testing_samples:
        batch_size = max_testing_samples - cur_testing_samples
        samples = [markov_model.make_short_sentence(max_tweet_size) for i in range(batch_size)]
        
        for sample in samples:
            important_words_dict = {}
            contains_important_word = False
            if sample is None:
                continue
            for word in sample.split():
                word_label = get_word_label(word)
                if word_label in ['Religion Names', 'Religion Affiliations']:
                    important_words_dict[word_label] = important_words_dict.get(word_label, list()) + [word]
                    contains_important_word = True
            if (contains_important_word):
                print('\n---Sample----------')
                print('| [Text]')
                print('| \t{}'.format(sample))
                print('| [Analysis results]')
                for label, word_list in important_words_dict.items():
                    print('| \t{}: {}'.format(label, word_list))
                print('| [Verification]')
                
                n_religion_names = len(important_words_dict.get('Religion Names', list()))
                n_religion_affiliations = len(important_words_dict.get('Religion Affiliations', list()))
                n_param_sum = len(important_words_dict)
                if (n_param_sum > 0):
                    testing_sample_dict[cur_testing_samples] = {
                        'Text': sample,
                        'City Names': 0,
                        'Country Names': 0,
                        'Country Nationalities': 0,
                        'Religion Names': n_religion_names,
                        'Religion Affiliations': n_religion_affiliations,
                        'Currency Names': 0
                    }
                    cur_testing_samples += 1
                    print('| [Result]')
                    print('| \tSAVED ({}/{})'.format(cur_testing_samples, max_testing_samples))
                else:
                    print('| [Result]')
                    print('| \tDISCARDED')
                print('------------------')
    return testing_sample_dict

In [9]:
# Auto generate Religion samples

samples = pd.DataFrame()
while (len(samples) < 10):
    
    # Generate some samples
    sample_dict = generate_auto_religion_testing_samples_dict( 10 - len(samples) )
    new_samples = pd.DataFrame.from_dict(sample_dict, orient='index')
    
    # Append new samples
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    samples = pd.concat([samples, new_samples])
    samples.drop_duplicates(inplace=True, subset="Text")
    
samples


---Sample----------
| [Text]
| 	Have you ever thought of your tens of thousands of Burmese marched in protest, led by prodemocracy activists and Buddhist monks.
| [Analysis results]
| 	Religion Affiliations: ['Buddhist']
| [Verification]
| [Result]
| 	SAVED (1/10)
------------------

---Sample----------
| [Text]
| 	Many Orthodox Jewish communities believe that they will be welcome here, as you are.
| [Analysis results]
| 	Religion Affiliations: ['Jewish']
| [Verification]
| [Result]
| 	SAVED (2/10)
------------------

---Sample----------
| [Text]
| 	Protestant Christianity: Protestant Christianity originated in the 16th century and by the Dutch in the 17th century.
| [Analysis results]
| 	Religion Names: ['Christianity:', 'Christianity']
| [Verification]
| [Result]
| 	SAVED (3/10)
------------------

---Sample----------
| [Text]
| 	To us it is incomprehensible that millions of Christian men professing the law of love of their fellows slew one another.
| [Analysis results]
| 	Religion 

Unnamed: 0,Text,City Names,Country Names,Country Nationalities,Religion Names,Religion Affiliations,Currency Names
0,Have you ever thought of your tens of thousand...,0,0,0,0,1,0
1,Many Orthodox Jewish communities believe that ...,0,0,0,0,1,0
2,Protestant Christianity: Protestant Christiani...,0,0,0,2,0,0
3,To us it is incomprehensible that millions of ...,0,0,0,0,1,0
4,"Under the Geneva Accords of 1954, Vietnam was ...",0,0,0,0,1,0
6,A Sunni Muslim may elect to follow any one of ...,0,0,0,0,1,0
7,Many Orthodox Jewish communities believe that ...,0,0,0,0,1,0
9,"Thence, they issued orders and regulations in ...",0,0,0,0,2,0
0,Many Orthodox Jewish communities believe that ...,0,0,0,0,1,0
1,"During the second half of the 19th century, at...",0,0,0,1,0,0


In [62]:
sample_df = pd.DataFrame.from_dict(sample_dict, orient='index')
sample_df.head()

Unnamed: 0,Text,City Names,Country Names,Country Nationalities,Religion Names,Religion Affiliations,Currency Names
0,I suppose that is the reason why the life and ...,0,0,0,0,0,0
1,"Forcibly incorporated into the USSR in 1940, i...",0,1,0,0,0,0
2,Greece has not met the EU's Growth and Stabili...,0,1,0,0,0,0
3,Libya in May 2010 was elected to its first thr...,0,1,0,0,0,0
4,The artillery the prisoners had seen in front ...,0,0,0,0,0,0


In [63]:
# Define file path for output
test_samples_file = '../../data/parsed/markov_text_files/test_samples_generic_24.csv'

sample_df.to_csv(test_samples_file, encoding='utf-8', index=False, compression='gzip')

In [26]:
def build_full_dataframe(generic=True, religion=True):
    
    full_sample_df = pd.DataFrame()
    
    if generic:
        # Append generic samples
        all_files = glob.glob('../../data/parsed/markov_text_files/test_samples_generic_*.csv')
        for file in all_files:
            temp_df = pd.read_csv(file, encoding='utf-8', compression='gzip')
            # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
            full_sample_df = pd.concat([full_sample_df, temp_df])
    
    if religion:
        # Append auto religion samples
        all_files = glob.glob('../../data/parsed/markov_text_files/test_samples_religion_*.csv')
        for file in all_files:
            temp_df = pd.read_csv(file, encoding='utf-8', compression='gzip')
            # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
            full_sample_df = pd.concat([full_sample_df, temp_df])
            
    # Remove duplicate texts
    full_sample_df.drop_duplicates(inplace=True, subset="Text")
    
    print('Results obtained on {} tweets'.format(len(full_sample_df)))
    
    return full_sample_df

In [60]:
# Generate a global dataframe for all the samples
full_sample_df = build_full_dataframe(religion=False)
    
# Verify the amount of values in total we have obtained for each category
# Drop the text column first so that only the numeric counts are summed
sum_sample_df = full_sample_df.drop(columns='Text').sum(axis=0).to_frame(name='Number of values')

sum_sample_df

Results obtained on 234 tweets


Unnamed: 0,Number of values
City Names,28
Country Names,130
Country Nationalities,70
Religion Names,1
Religion Affiliations,3
Currency Names,4
