Copyright &copy; 2023-2024 Praneeth Vadlapati

<!-- ## Setup:

Example of .env: 
```bash
# Groq API to use LLMs - https://console.groq.com/keys
# Groq is preferred for fast responses
LM_PROVIDER_BASE_URL=https://api.groq.com/openai/v1
LM_API_KEY=
LM_MODEL=

```

Installing packages:
```bash
pip install pandas openai python-dotenv
``` -->

## A. Selecting and loading an LLM

In [1]:
import os
import pandas as pd
from common_functions import get_lm_response, model, extract_data, \
    display_md, data_folder

model_folder = os.path.join(data_folder, model.split('/')[-1])
os.makedirs(model_folder, exist_ok=True)  # create the per-model output folder if it is missing
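
The helpers imported from `common_functions` are not shown in this notebook. As a rough idea of what `extract_data` might do, here is a minimal sketch that pulls the text between a pair of tags out of a model response. The regex approach and the fallback behaviour for `absent_ok=True` are assumptions for illustration, not the actual implementation.

```python
import re


def extract_data(response: str, tag: str, absent_ok: bool = False) -> str:
    """Return the text between <tag> and </tag> in an LLM response.

    Hypothetical stand-in for the helper in common_functions.
    """
    pattern = rf'<{re.escape(tag)}>(.*?)</{re.escape(tag)}>'
    match = re.search(pattern, response, flags=re.DOTALL)
    if match:
        return match.group(1).strip()
    if absent_ok:
        # Assumption: fall back to the whole response when the tags are missing
        return response.strip()
    raise ValueError(f'Tag <{tag}> not found in the response')
```

With a helper like this, a reply that wraps its answer in `<huge_sentences>` tags yields only the tagged text, while a reply without tags is either returned as-is or rejected, depending on `absent_ok`.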

## B. Creating large complex sentences

In [2]:
large_sentence_prompt = '''
Act as an expert linguist who wants to test readers for a research study.
Create large complex sentences that:
	- are grammatically correct and make sense.
	- have many clauses and phrases.
	- have many prepositional phrases.
	- use complex words that are difficult to understand without a dictionary.
	- have many commas and conjunctions.
	- are 100 words long.

Write 3 sentences. Each sentence should be in a separate line.
Example response:
<huge_sentences>
Sentence 1
Sentence 2
</huge_sentences>
'''
large_sentence_response = get_lm_response(large_sentence_prompt)
large_sentences = extract_data(large_sentence_response, tag='huge_sentences', absent_ok=True)

large_sentences_list = []
large_sentences_str = ''
for line in large_sentences.split('\n'):
	line = line.strip()
	# skip blank lines and separator lines that contain only hyphens
	if not line or not line.replace('-', ''):
		continue
	# each line must hold exactly 1 sentence; '. ' also catches refusals
	# such as 'I cannot fulfil your request. ...'
	if '. ' in line:
		raise ValueError('Got more than 1 sentence in a line: ' + line)
	if 'Of course' in line or 'happy to' in line:  # happy to help
		continue
	if line.lower().startswith('sentence'):
		if line[-1] == ':':  # line is of the form 'Sentence 1:'
			continue
		if len(line) <= len('sentence ') + 3:  # bare label such as 'Sentence 1' with no text
			continue
		if ':' in line:
			# drop the 'Sentence N:' label but keep any later colons in the sentence
			line = line.split(':', 1)[1].strip()
	large_sentences_list.append(line)
	large_sentences_str += '  - ' + line + '\n'

with open(os.path.join(model_folder, 'large_sentences.txt'), 'w', encoding='utf-8') as f:
	f.write(large_sentences_str)
display_md(large_sentences_str)

  - Having conducted extensive research in the field of sociolinguistics, I am interested in investigating the intricate relationships between linguistic variation and social identity, particularly with regards to the syntactic and phonological features that are indicative of regional or ethnic affiliations, in order to further comprehend the underlying mechanisms of language change and diffusion within diverse speech communities, and to ultimately contribute to the development of more nuanced and sophisticated language policies that are reflective of the rich and dynamic linguistic landscapes that characterize our contemporary globalized world.
  - As a scholar with a specialized focus in historical linguistics, I aim to elucidate the complex processes of language evolution and linguistic diversification, by examining the interplay between cultural, demographic, and geopolitical forces, and their consequential impact on the development and diffusion of distinct language families and their respective protolanguages, in an endeavor to uncover the deep-seated relationships between past sociocultural contexts and the cognitive frameworks that have shaped the linguistic structures and patterns that persist across disparate linguistic systems.
  - In light of the recent advancements in cognitive linguistics and psycholinguistics, I am embarking on an ambitious research venture that seeks to elucidate the intricate mechanisms that underpin language processing and production, by comprehensively exploring the dynamic interactions between cognitive processes, linguistic structure, and communicative functions, in order to formulate a more comprehensive understanding of the cognitive underpinnings of language use and comprehension, and to pave the way for innovative interventions and applications within the realms of language education and clinical linguistics.
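
The line filtering in the cell above can also be packaged as a small function, which makes it easier to try on sample responses without calling the model. The sketch below is a reorganized variant and not the notebook's exact code: the name `clean_sentence_lines` is hypothetical, and it skips conversational filler before checking for multiple sentences.

```python
from typing import List


def clean_sentence_lines(raw_text: str) -> List[str]:
    """Keep one sentence per line and drop labels, separators, and filler."""
    sentences = []
    for line in raw_text.split('\n'):
        line = line.strip()
        if not line or not line.replace('-', ''):  # blank or only hyphens
            continue
        if 'Of course' in line or 'happy to' in line:  # conversational filler
            continue
        if line.lower().startswith('sentence'):
            if line.endswith(':') or len(line) <= len('sentence ') + 3:
                continue  # bare label such as 'Sentence 1:' or 'Sentence 1'
            if ':' in line:
                # remove the label, keep later colons in the sentence itself
                line = line.split(':', 1)[1].strip()
        if '. ' in line:
            raise ValueError('Got more than 1 sentence in a line: ' + line)
        sentences.append(line)
    return sentences
```

Raising on a line that contains `'. '` is a deliberate choice here: a response that packs several sentences into one line, or that is really a refusal, stops the run instead of silently entering the dataset.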


## C. Simplifying the sentences

In [3]:
simplify_prompt_template = '''
Sentence:
{large_sentence}
---

Act as an expert linguist and simplify the sentences.
Simplify the sentences by simplifying difficult words and phrases.
Try to make bullet points for the complex sentence.
Return inside tags.

Example response:
<simplified_sentence>
Line 1
Reasons:
  - Reason 1
  - Reason 2
Summary: (in 1 line)
Summary of the sentence.
</simplified_sentence>
'''
simplified_text_list = []
for line in large_sentences_list:
	if line.strip() == '':
		continue
	simplify_prompt = simplify_prompt_template.format(large_sentence=line)
	simplified_sentence_response = get_lm_response(simplify_prompt)
	simplified_text = extract_data(simplified_sentence_response, 
									tag='simplified_sentence', absent_ok=True)
	simplified_text_list.append(simplified_text)
	display_md(simplified_text)

with open(os.path.join(model_folder, 'simplified_sentences.txt'), 'w', encoding='utf-8') as f:
	f.write('\n\n'.join(simplified_text_list))

I have studied different ways people speak and how it relates to their social identity. I want to understand how language changes and spreads in diverse communities, and help create better language policies.
Reasons:
  - Studied different ways people speak
  - Interested in how language changes and spreads
Summary: 
Interested in the relationships between language variation and social identity, and how it influences language change and language policies.

As a linguist focusing on historical language changes, I aim to understand how languages have evolved and diversified over time by looking at the effects of culture, population, and geography on the development and spread of different language groups and their early forms.
Reasons:
- Study language change
- Explore impact of culture, population, and geography
Summary: 
I study how cultural, demographic, and geopolitical factors have influenced the development and spread of language families and their early forms.

In light of new research in language and psychology, I am beginning a big research project to understand how our brains process and produce language by looking at how our thoughts, language structure, and communication all work together.
Reasons:
  - Cognitive linguistics and psycholinguistics advancements
  - Elucidating language processing mechanisms
Summary: (in 1 line)
I am starting a research project to understand how our brains process and produce language.
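
The simplified responses follow a loose layout: the simplified text, a `Reasons:` list, and a `Summary:` section. If the parts need to be processed separately, a parser along the following lines could split them. This is only a sketch under that layout assumption; the function name `parse_simplified_text` and the returned keys are hypothetical and not part of the notebook.

```python
def parse_simplified_text(text: str) -> dict:
    """Split one simplified response into its text, reasons, and summary."""
    parts = {'text': [], 'reasons': [], 'summary': []}
    section = 'text'
    for line in text.split('\n'):
        line = line.strip()
        if not line:
            continue
        lowered = line.lower()
        if lowered.startswith('reasons:'):
            section = 'reasons'
            continue
        if lowered.startswith('summary:'):
            section = 'summary'
            # keep text that follows the header, but not hints like '(in 1 line)'
            rest = line[len('summary:'):].strip()
            if rest and not rest.startswith('('):
                parts['summary'].append(rest)
            continue
        if section == 'reasons':
            line = line.lstrip('-').strip()  # drop the bullet marker
        parts[section].append(line)
    return {
        'text': ' '.join(parts['text']),
        'reasons': parts['reasons'],
        'summary': ' '.join(parts['summary']),
    }
```

Because the model does not always indent bullets or repeat the `(in 1 line)` hint consistently, the parser strips whitespace first and treats both header variants the same way.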

## D. Validation of the simplified text

In [4]:
validation_prompt = '''
Large Sentence:
{large_sentence}
---
Simplified Sentence:
{simplified_text}
---

Act as an expert linguist and validate the meaning of the simplified sentence.
Return comments inside tags, and then return the final validation result (True/False) inside tags.
Example response:
<comments>
The simplified sentence does not capture the part about doctors.
</comments>
<validation>False</validation>
'''
validation_results = []
for i, line in enumerate(large_sentences_list):
	simplified_text = simplified_text_list[i]
	if not line.strip():
		raise ValueError('Empty sentence for large_sentence')
	if not simplified_text.strip():
		raise ValueError('Empty sentence for simplified_text')

	# format into a new variable so the template keeps its placeholders for the next iteration
	prompt = validation_prompt.format(large_sentence=line,
										simplified_text=simplified_text)
	validation_response = get_lm_response(prompt)
	validation_res = extract_data(validation_response, tag='validation')
	comments = extract_data(validation_response, tag='comments')
	# display_md(validation_res)
	# display_md(comments)
	validation_results.append({ 'Index': i+1, 'Validation Result': validation_res, 'Comment': comments })

df = pd.DataFrame(validation_results)
csv_file = os.path.join(model_folder, 'validation_results.csv')
df.to_csv(csv_file, index=False)
df

|   | Index | Validation Result | Comment |
|---|-------|-------------------|---------|
| 0 | 1 | True | The simplified sentence captures the main poin... |
| 1 | 2 | True | The simplified sentence captures the main idea... |
| 2 | 3 | True | The simplified sentence accurately captures th... |
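
The `Validation Result` column holds whatever text the model placed inside the `<validation>` tags, so it is a string rather than a boolean. A small normalizing step, sketched below, could convert it before computing a pass rate. The function name `parse_validation` and the sample values are made up for illustration and are not results from this notebook.

```python
def parse_validation(value: str) -> bool:
    """Map the text inside <validation> tags to a boolean."""
    normalized = value.strip().lower()
    if normalized == 'true':
        return True
    if normalized == 'false':
        return False
    raise ValueError(f'Unexpected validation result: {value!r}')


# Illustrative values only, including stray whitespace and casing
results = ['True', ' true ', 'False']
flags = [parse_validation(r) for r in results]
pass_rate = sum(flags) / len(flags)  # share of simplifications judged faithful
```

Rejecting anything other than `true` or `false` keeps an off-format reply from being counted as a pass or a fail by accident.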
