### Notes

**Parameters:**
- OutputCount
- Noise
- Bots
- Company
- NA_Prob

**Setup:**
- Set possible description topics.
- Set bot description topics.
- Set company description topics.
- Set possible languages.

**Process:**

For each entry:
1. **First:** Classify the entry as a bot or not (bot probability: 0.05).
2. **Second:** If it is not a bot, classify it as a company or a private account (company probability: 0.1).

- If company:
  - Age = NA
- Else:
  - Age = Random (NA possible).

- If Bot:
  - CreationDate set as more recent.
- If not Bot:
  - CreationDate set as completely random.

- Languages set as random (NA possible).
- Location and Language can correlate (NA possible).

- If Bot:
  - Post Count high.
- If not Bot:
  - Random post count.

- If Bot:
  - Description based on bot topics (generated by a language model; GPT-Neo in the code below).
  - Name generated with description topics.
- If not Bot:
  - If company:
    - Use company topics.
    - Name generated with description topics.
  - If not company:
    - Description based on normal topics (generated by a language model; GPT-Neo in the code below).
    - Name generated with description topics.

- If Bot:
  - Follower count low.
- If not Bot:
  - Follower count random but minimal, correlated with creation date.
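
The two classification steps above can be sketched on their own. This is a minimal illustration using only the standard library, not the notebook's implementation; the function name `sample_account_type` and the seed are assumptions made for the example. Because the company check only runs for non-bots, the overall share of company accounts comes out near 0.95 × 0.1 = 0.095 rather than 0.1.

```python
import random


def sample_account_type(rng, p_bot=0.05, p_company=0.1):
    """Return 'bot', 'company', or 'normal' for one entry (hypothetical helper)."""
    # Step 1: decide whether the entry is a bot.
    if rng.random() < p_bot:
        return "bot"
    # Step 2: only non-bots can be classified as companies.
    if rng.random() < p_company:
        return "company"
    return "normal"


rng = random.Random(0)  # arbitrary seed, chosen only to make the run repeatable
counts = {"bot": 0, "company": 0, "normal": 0}
for _ in range(10_000):
    counts[sample_account_type(rng)] += 1

print(counts)
```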


## Generator

### Inputs:

In [3]:
# Parameters
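OutputCount = 10 # Number of rows to generate
Noise = 0.5 # Factor between 0 and 1 (not used in the code shown here)
Bots = 0.05 # Probability that an account is a bot
Company = 0.1 # Probability that a non-bot account is a company
NA_Prob = 0.05 # Probability that an optional field is left empty (NA)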

In [4]:
# Description Topics
BotDesc = ["Ukraine war","Vote","Republican","Follow me","Trump","Democracy","Bad","Hate","Dumb"]
CompanyDesc = ["Profit","Business","Fiscal year","2023","Expansion","Success","Job offer","Event","Margin","Good vibes"]
NormalDesc = ["Horse","Sport","Football","Gaming","2023","Travel","Fun",":)","Event","Job","Tennis","taking Images","Politics","Republicans"]

# Possible Countries and languages
Countries = ["Germany","France","USA","UK","China","Japan","Russia","Italy","Netherlands","Switzerland"]
Languages = ["GER","FRE","ENG","ENG","MAN","JAP","RUS","ITA","NED","GER"]
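
`Countries` and `Languages` are index-aligned: the language at position `i` belongs to the country at position `i`. A small sketch of making that pairing explicit follows; the name `country_language` is introduced here for illustration and is not used by the notebook.

```python
Countries = ["Germany", "France", "USA", "UK", "China", "Japan",
             "Russia", "Italy", "Netherlands", "Switzerland"]
Languages = ["GER", "FRE", "ENG", "ENG", "MAN", "JAP",
             "RUS", "ITA", "NED", "GER"]

# Guard against the two lists drifting apart when one of them is edited.
assert len(Countries) == len(Languages)

# Hypothetical lookup table: country name -> default language code.
country_language = dict(zip(Countries, Languages))

print(country_language["Switzerland"])  # GER
print(country_language["USA"])          # ENG
```

A dictionary like this keeps the pairing intact even if an entry is inserted in the middle of one list.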

### Code

In [5]:
# Install the third-party packages once before running.
# random, time, datetime and warnings are part of the standard library and need no installation.
# !pip install pandas
# !pip install numpy
# !pip install faker
# !pip install transformers
# !pip install torch
# Imports
import pandas as pd
import numpy as np
from datetime import datetime,timedelta
import random
import time
from transformers import AutoModelForCausalLM, AutoTokenizer
import faker
import warnings
warnings.filterwarnings("ignore")

In [6]:
# Create Empty DF
data = {
    'ID': [],
    'Username': [],
    'Age': [],
    'Country': [],
    'CreationDate': [],
    'Description': [],
    'FollowerCount': [],
    'PostCount': [],
    'Language': []
}

df = pd.DataFrame(data)

row = {'ID':'','Username': '','Age': '','Country': '','CreationDate': '','Description': '','FollowerCount': '','PostCount': '','Language': ''}

In [7]:
def str_time_prop(start, end, time_format, prop):
    """Get a time at a given proportion of the range between two formatted times.

    start and end should be strings specifying times formatted in the
    given format (strftime-style), giving an interval [start, end].
    prop is the fraction (between 0 and 1) of the interval to advance
    from start. The returned time is a string in the same format.
    """

    stime = time.mktime(time.strptime(start, time_format))
    etime = time.mktime(time.strptime(end, time_format))

    ptime = stime + prop * (etime - stime)

    return time.strftime(time_format, time.localtime(ptime))


def random_date(start, end, prop):
    return str_time_prop(start, end, '%m/%d/%Y %I:%M %p', prop)
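
`time.mktime` and `time.localtime` interpret the values in the machine's local time zone, so results may shift around daylight-saving changes. As an alternative sketch (not the notebook's method), the same interpolation can be done with naive `datetime` arithmetic; `random_date_dt` and `TIME_FORMAT` are names assumed for this example.

```python
from datetime import datetime

TIME_FORMAT = "%m/%d/%Y %I:%M %p"


def random_date_dt(start, end, prop, time_format=TIME_FORMAT):
    """Return the time at fraction `prop` of the way from start to end."""
    start_dt = datetime.strptime(start, time_format)
    end_dt = datetime.strptime(end, time_format)
    # A timedelta can be scaled by a float, so no epoch conversion is needed.
    return (start_dt + prop * (end_dt - start_dt)).strftime(time_format)


print(random_date_dt("1/1/2020 12:00 AM", "1/3/2020 12:00 AM", 0.5))
```

Passing `random.random()` as `prop` gives a uniformly distributed date in the interval, as in `set_creation_date`.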

In [8]:
rng = np.random.default_rng(1)
fake = faker.Faker()
model_name = "EleutherAI/gpt-neo-1.3B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# model_name = "meta-llama/Llama-2-7b-chat-hf"
# model = AutoModelForCausalLM.from_pretrained(model_name)
# tokenizer = AutoTokenizer.from_pretrained(model_name)

In [9]:
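def decide_account_type(bot, company, Topicslist):
    # The arguments are ignored and reset here; the signature is kept
    # only so that the existing call in the main loop keeps working.
    bot = False
    company = False
    Topicslist = []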

    def append_topics(desc_list, count):
        for _ in range(count):
            n = random.randint(0, len(desc_list) - 1)
            Topicslist.append(n)

    # Setting whether the account is a bot account
    if rng.uniform() < Bots:
        bot = True
        append_topics(BotDesc, 3)

    # Setting whether the account is a company account
    elif rng.uniform() < Company:
        company = True
        append_topics(CompanyDesc, 3)

    # Setting the topiclist-IDs, when the account is a normal account
    else:
        append_topics(NormalDesc, 3)

    return bot, company, Topicslist


def set_age(company):
    # Setting the age

    if company or rng.uniform() < NA_Prob:
        return ""  # or return None

    age = np.random.normal(loc=30, scale=25, size=None)
    while age < 16:
        age = np.random.normal(loc=30, scale=25, size=None)

    return round(age)


def set_creation_date(bot):
    # Setting the creation date
    if bot:
        return random_date("1/1/2020 1:30 PM", "1/24/2024 11:59 PM", random.random())
    else:
        return random_date("1/1/2008 1:30 PM", "1/24/2024 11:59 PM", random.random())


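def set_country_language():
    # Countries and Languages are index-aligned, so one index selects a matching pair.
    # randrange(len(...)) covers every index; randrange(0, 9) never returned the last entry.
    rand_country = random.randrange(len(Countries))
    if rng.uniform() < NA_Prob:
        country = ""
    else:
        country = Countries[rand_country]

    # With probability 0.4 the language is drawn independently of the country.
    if rng.uniform() < 0.4:
        rand_language = random.randrange(len(Languages))
        language = Languages[rand_language]
    else:
        language = Languages[rand_country]
    return country, language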


def calculate_account_age(creation_date):
    # Calculating the age of the account
    account_date_difference = datetime.strptime(
        "1/24/2024 11:59 PM", "%m/%d/%Y %I:%M %p"
    ) - datetime.strptime(creation_date, "%m/%d/%Y %I:%M %p")
    return abs(account_date_difference.days)


def calculate_followers_posts(bot, account_age):
    # Calculating the FollowerCount & PostCount based on the account age
    if bot:
        f_multiplier = 0.5
        p_multiplier = random.randrange(10, 30)
        f_loc = 0.1
        f_scale = 0.2
        p_loc = 0.1
        p_scale = 0.5
    else:
        f_multiplier = 1
        p_multiplier = 0.1
        f_loc = 0.4
        f_scale = 2
        p_loc = 0.05
        p_scale = 0.1

    follower_count = (
        np.random.normal(loc=f_loc, scale=f_scale, size=None)
        * account_age
        * f_multiplier
    )

    post_count = (
        np.random.normal(loc=p_loc, scale=p_scale, size=None)
        * account_age
        * p_multiplier
    )

    return abs(round(follower_count)), abs(round(post_count))


def generate_username():
    # Define the random username of the user
    if random.uniform(0, 1) < 0.5:
        return fake.user_name()
    else:
        if random.uniform(0, 1) < 0.5:
            return (
                str(random.randrange(1950, 2005)) + "_" + fake.user_name()
            )  # Maybe calc YOB from age and use that
        else:
            return fake.user_name() + "_" + str(random.randrange(1950, 2005))


def generate_description(
    bot,
    company,
    Topicslist,
    username,
    BotDesc,
    CompanyDesc,
    NormalDesc,
    model,
    tokenizer,
):
    def construct_prompt(entity_type, desc_list):
        topics = ", ".join(str(desc_list[topic]) for topic in Topicslist[:3])
        return f"{username}, topics: {topics}, short profile description for {entity_type}:"

    def clean_generated_text(text, prompt):
        text = text.replace(prompt, "").strip()  # Remove prompt if present
        for char in ["\n", ")", '"']:
            text = text.replace(char, "")
        return text

    if bot:
        prompt = construct_prompt("bot", BotDesc)
    elif company:
        prompt = construct_prompt("company", CompanyDesc)
    else:
        prompt = construct_prompt("user", NormalDesc)

    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    output = model.generate(input_ids, max_new_tokens=15, num_return_sequences=1)
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    result = clean_generated_text(generated_text, prompt)

    return result
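
The comment in `generate_username` suggests deriving the year in the username from the account's age instead of drawing it at random. One possible sketch of that idea follows. The helper `username_with_birth_year` and the constant `REFERENCE_YEAR` are assumptions for illustration, and the computed year can be off by one depending on the birthday.

```python
import random

REFERENCE_YEAR = 2024  # assumption: matches the "1/24/2024" cutoff used for creation dates


def username_with_birth_year(base_name, age, rng=random):
    """Attach a birth year derived from `age`; keep the plain name if age is missing."""
    if age in ("", None):
        # Companies and NA ages have no birth year to attach.
        return base_name
    year = REFERENCE_YEAR - int(age)
    # Mirror the original behaviour: year as prefix or suffix with equal probability.
    if rng.random() < 0.5:
        return f"{year}_{base_name}"
    return f"{base_name}_{year}"


print(username_with_birth_year("kproctor", 22))
print(username_with_birth_year("jason26", ""))
```

Using it would require computing the age before the username in the main loop, which the current loop already does.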

In [13]:
# Main loop
for i in range(1, OutputCount + 1):
    row = {
        "ID": "",
        "Username": "",
        "Age": "",
        "Country": "",
        "CreationDate": "",
        "Description": "",
        "FollowerCount": "",
        "PostCount": "",
        "Language": "",
    }
    row["ID"] = i
    bot = False
    company = False
    Topicslist = []
    
    bot, company, Topicslist = decide_account_type(bot, company, Topicslist)  # Account type and three topic indices

    row["Age"] = set_age(company)
    
    creation_date = set_creation_date(bot)
    row["CreationDate"] = creation_date
    
    row["Country"], row["Language"] = set_country_language()
    
    account_age = calculate_account_age(creation_date)
    
    row["FollowerCount"], row["PostCount"] = calculate_followers_posts(bot, account_age)
    
    username = generate_username()
    row["Username"] = username
    
    row["Description"] = generate_description(
        bot,
        company,
        Topicslist,
        username,
        BotDesc,
        CompanyDesc,
        NormalDesc,
        model,
        tokenizer,
    )

    print(f"Finished: row {row['ID']} of {OutputCount}")
    print(row)
    
    df.loc[len(df)] = row

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Finished: row 1 of 10
{'ID': 1, 'Username': 'kproctor', 'Age': 22, 'Country': 'UK', 'CreationDate': '11/07/2008 05:04 AM', 'Description': 'I am a horse lover, and I am a horse lover.', 'FollowerCount': 1317, 'PostCount': 41, 'Language': 'MAN'}


Finished: row 2 of 10
{'ID': 2, 'Username': '1974_higginsjillian', 'Age': 72, 'Country': 'Japan', 'CreationDate': '11/18/2013 06:56 PM', 'Description': '_____, :, :, :, :, :, :, :', 'FollowerCount': 5448, 'PostCount': 70, 'Language': 'JAP'}


Finished: row 3 of 10
{'ID': 3, 'Username': 'cesarcarrillo_1958', 'Age': 34, 'Country': 'Germany', 'CreationDate': '06/20/2012 11:46 AM', 'Description': 'I am a football fan and I love to travel. I am', 'FollowerCount': 5212, 'PostCount': 26, 'Language': 'RUS'}


Finished: row 4 of 10
{'ID': 4, 'Username': 'bzavala', 'Age': 24, 'Country': 'Netherlands', 'CreationDate': '09/26/2020 11:40 PM', 'Description': 'I am a professional writer and I have been writing for over 10', 'FollowerCount': 2430, 'PostCount': 5, 'Language': 'ENG'}


Finished: row 5 of 10
{'ID': 5, 'Username': 'gina77_1961', 'Age': 75, 'Country': 'UK', 'CreationDate': '06/03/2012 09:27 PM', 'Description': 'I am a tennis player and I am looking for a good tennis', 'FollowerCount': 2466, 'PostCount': 71, 'Language': 'ENG'}


Finished: row 6 of 10
{'ID': 6, 'Username': 'joseph92', 'Age': 76, 'Country': 'Japan', 'CreationDate': '10/24/2023 03:18 AM', 'Description': "_________________I'm not sure if this is the right place to", 'FollowerCount': 13, 'PostCount': 0, 'Language': 'JAP'}


Finished: row 7 of 10
{'ID': 7, 'Username': 'beltranelizabeth', 'Age': 31, 'Country': 'Netherlands', 'CreationDate': '01/21/2009 07:27 PM', 'Description': 'I am a writer, a mother, a wife, a daughter', 'FollowerCount': 9500, 'PostCount': 36, 'Language': 'NED'}


Finished: row 8 of 10
{'ID': 8, 'Username': 'jason26_1956', 'Age': '', 'Country': 'China', 'CreationDate': '07/19/2008 06:19 AM', 'Description': '_________________I am a writer, a blogger, a social media', 'FollowerCount': 13890, 'PostCount': 24, 'Language': 'MAN'}


Finished: row 9 of 10
{'ID': 9, 'Username': 'samantha01', 'Age': 39, 'Country': 'France', 'CreationDate': '05/22/2016 01:23 PM', 'Description': "I'm a writer, and I write about the things that interest", 'FollowerCount': 8627, 'PostCount': 34, 'Language': 'ENG'}
Finished: row 10 of 10
{'ID': 10, 'Username': '1958_zacharyhill', 'Age': 42, 'Country': 'UK', 'CreationDate': '12/21/2023 12:04 PM', 'Description': 'I am a writer, and I write about the things that interest', 'FollowerCount': 21, 'PostCount': 0, 'Language': 'ENG'}


In [14]:
df

# display(df['Description'])

|   | ID | Username | Age | Country | CreationDate | Description | FollowerCount | PostCount | Language |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | kproctor | 22.0 | UK | 11/07/2008 05:04 AM | I am a horse lover, and I am a horse lover. | 1317 | 41 | MAN |
| 1 | 2 | 1974_higginsjillian | 72.0 | Japan | 11/18/2013 06:56 PM | _____, :, :, :, :, :, :, : | 5448 | 70 | JAP |
| 2 | 3 | cesarcarrillo_1958 | 34.0 | Germany | 06/20/2012 11:46 AM | I am a football fan and I love to travel. I am | 5212 | 26 | RUS |
| 3 | 4 | bzavala | 24.0 | Netherlands | 09/26/2020 11:40 PM | I am a professional writer and I have been wri... | 2430 | 5 | ENG |
| 4 | 5 | gina77_1961 | 75.0 | UK | 06/03/2012 09:27 PM | I am a tennis player and I am looking for a go... | 2466 | 71 | ENG |
| 5 | 6 | joseph92 | 76.0 | Japan | 10/24/2023 03:18 AM | _________________I'm not sure if this is the r... | 13 | 0 | JAP |
| 6 | 7 | beltranelizabeth | 31.0 | Netherlands | 01/21/2009 07:27 PM | I am a writer, a mother, a wife, a daughter | 9500 | 36 | NED |
| 7 | 8 | jason26_1956 |  | China | 07/19/2008 06:19 AM | _________________I am a writer, a blogger, a s... | 13890 | 24 | MAN |
| 8 | 9 | samantha01 | 39.0 | France | 05/22/2016 01:23 PM | I'm a writer, and I write about the things tha... | 8627 | 34 | ENG |
| 9 | 10 | 1958_zacharyhill | 42.0 | UK | 12/21/2023 12:04 PM | I am a writer, and I write about the things th... | 21 | 0 | ENG |
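
Several generated descriptions above start with a run of underscores, and one consists only of punctuation. A possible post-processing step, sketched here with the standard library only, strips such leading filler and returns an empty string when no letters remain. `strip_filler` is a hypothetical helper, not part of the notebook.

```python
import re


def strip_filler(text):
    """Drop leading underscores/punctuation; return '' if no letters are left."""
    cleaned = re.sub(r"^[\s_:,.\-]+", "", text).strip()
    # A description without any letters carries no information, so treat it as missing.
    return cleaned if re.search(r"[A-Za-z]", cleaned) else ""


print(strip_filler("_________________I am a writer, a blogger, a social media"))
print(repr(strip_filler("_____, :, :, :, :, :, :, :")))
```

It could be applied to the result of `generate_description` before the row is stored, with empty results either regenerated or kept as NA.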
