# Generating random Tweets

This notebook uses Markov chain models to generate random texts, each capped at the maximum length of a Tweet (140 characters here).

Although the generated text can look reasonably plausible, it is a completely synthetic dataset. For that reason, instead of using the generated data as training data for our NER model, we use it only to study which text operations can improve the model's accuracy.
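To make the idea concrete, here is a minimal pure-Python sketch of a word-level Markov chain text generator. It is not how `markovify` is implemented; the function names `build_chain` and `generate`, the toy corpus, and the two-word state size are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def build_chain(text, state_size=2):
    """Count, for every run of `state_size` words, which word follows it."""
    words = text.split()
    chain = defaultdict(lambda: defaultdict(int))
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        chain[state][words[i + state_size]] += 1
    return chain


def generate(chain, max_chars=140, seed=None):
    """Walk the chain from a random state, stopping before max_chars is exceeded."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    output = list(state)
    while state in chain:
        followers = chain[state]
        # Pick the next word with probability proportional to its count
        next_word = rng.choices(list(followers), weights=list(followers.values()))[0]
        if len(" ".join(output + [next_word])) > max_chars:
            break
        output.append(next_word)
        state = tuple(output[-len(state):])
    return " ".join(output)


# Hypothetical toy corpus, only for illustration
corpus = "the cat sat on the mat and the cat ran to the mat"
toy_chain = build_chain(corpus)
print(generate(toy_chain, max_chars=40, seed=0))
```

Each generated word depends only on the previous two, which is why the output can read locally well while making no sense overall.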

## 1. Generating the Markov chain model

Here we read a series of raw text files, build a Markov chain model from each one, and combine them into a single model, which we then save as a JSON file.

In [1]:
import markovify
import glob

max_tweet_size = 140
raw_text_files_path = 'data/raw/markov_text_files/*.txt'
parsed_model_file_path = 'data/parsed/markov_text_files/markov_weighted_chain.json'
raw_text_file_paths = glob.glob(raw_text_files_path)
raw_text_markov_models = []

# Read raw text files and generate Markov chain models
for file_path in raw_text_file_paths:
    with open(file_path) as file:
        text = file.read()
        markov_model = markovify.Text(text)
        raw_text_markov_models.append(markov_model)
    
# Combine all generated models into a single one
markov_model = markovify.combine(raw_text_markov_models)

In [2]:
# Generate 2 random sentences from the combined Markov chain model
# (make_short_sentence returns None if it cannot build a sentence within the limit)
for i in range(2):
    print(markov_model.make_short_sentence(max_tweet_size))

A third, in the groin.
Princess Mary pass into the House of Lords was composed of sodium and potassium.


In [3]:
# Save the model's underlying chain as JSON
model_json = markov_model.chain.to_json()
with open(parsed_model_file_path, 'w') as json_file:
    json_file.write(model_json)
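The saved file can later be read back instead of rebuilding the model from the raw text. For the real model, `markovify.Chain.from_json` is the expected counterpart of `chain.to_json()`. The sketch below only illustrates the round trip on a toy chain with the standard `json` module. The helpers `chain_to_json` and `chain_from_json` are hypothetical, and the list-of-pairs layout is an assumption about a reasonable serialization, needed because JSON object keys cannot be tuples.

```python
import json


def chain_to_json(chain):
    """Serialize {state_tuple: {word: count}} as a list of [state, followers] pairs."""
    return json.dumps([[list(state), followers] for state, followers in chain.items()])


def chain_from_json(text):
    """Rebuild the dictionary, turning each state list back into a tuple."""
    return {tuple(state): followers for state, followers in json.loads(text)}


# Hypothetical toy chain, only for illustration
toy_chain = {("the", "cat"): {"sat": 1, "ran": 1}}
restored = chain_from_json(chain_to_json(toy_chain))
print(restored == toy_chain)  # → True
```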

## 2. Improving the model

Even though we want to generate completely random Tweet-like texts, our aim is to improve the accuracy of our NER model for Country/Nationality/Religion/Currency recognition. To that end, we will go through our model and increase the weight of words that we know belong to one of these categories.

By increasing their weight, we make it more likely that these words will appear in the generated text.
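The effect of a weight increase can be seen on a toy transition table. This is a sketch under assumptions, not the notebook's actual procedure: the helper names `boost_words` and `follower_probability`, the multiplication `factor`, and the toy model are all hypothetical. The table maps a state (a tuple of words) to the counts of the words that follow it.

```python
def boost_words(model, words_to_boost, factor=2):
    """Return a copy of `model` with the counts of the selected words multiplied."""
    boosted = {}
    for state, followers in model.items():
        boosted[state] = {
            word: count * factor if word in words_to_boost else count
            for word, count in followers.items()
        }
    return boosted


def follower_probability(model, state, word):
    """Probability that `word` follows `state`, given the stored counts."""
    followers = model[state]
    return followers.get(word, 0) / sum(followers.values())


# Hypothetical toy model: after "lives in", "a" was seen 3 times and "Brazil" once
toy_model = {("lives", "in"): {"Brazil": 1, "a": 3}}
boosted_model = boost_words(toy_model, {"Brazil"}, factor=3)

print(follower_probability(toy_model, ("lives", "in"), "Brazil"))      # → 0.25
print(follower_probability(boosted_model, ("lives", "in"), "Brazil"))  # → 0.5
```

Multiplying counts keeps every other transition intact, so the generated text changes only in how often the boosted words are chosen.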

In [4]:
import pandas as pd

# Load the necessary datasets
country_nationality_df = pd.read_csv('data/parsed/parsed_country_nationality.csv', encoding='utf-8', compression='gzip', index_col=False)
currency_country_df = pd.read_csv('data/parsed/parsed_currency_country.csv', encoding='utf-8', compression='gzip', index_col=False)