# In-context learning with a pre-trained LLM

References:
- Kor
- In-context learning


https://www.youtube.com/watch?v=SW1ZdqH0rRQ&t=872s <br>
https://www.youtube.com/watch?v=xZzvwR9jdPA&t=376s

An LLM such as GPT-4 can serve as a question-answering (QA) model in a slightly indirect way. For instance, given a prompt that pairs a question with a passage, GPT-4 will usually answer from the passage, so it is effectively being used as a QA model. This works not because of specialized QA training but because of the general understanding of language and context the model acquired from its vast training data.
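The question-plus-passage prompt described above can be sketched as plain string assembly. This is a minimal illustration: the function name `build_qa_prompt`, the prompt wording, and the sample question and passage are assumptions made for the example, not something prescribed by OpenAI or Kor.

```python
def build_qa_prompt(question: str, passage: str) -> str:
    """Combine a question and a supporting passage into a single prompt string."""
    return (
        "Answer the question using only the passage below.\n\n"
        f"Question: {question}\n\n"
        f"Passage:\n{passage}\n\n"
        "Answer:"
    )


# Illustrative inputs (hypothetical, chosen to resemble the documents used later)
prompt = build_qa_prompt(
    question="How many satellites are authorized?",
    passage="The operator is authorized to deploy 3,372 satellites.",
)
print(prompt)
```

The resulting string is what would be sent to the model as the user message; the model's reply is the answer.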

## Importing LLM model and Libraries

In [None]:
# Importing openai and secret key
import os
import openai

# Load environment variables from .env
from dotenv import load_dotenv
load_dotenv()

In [None]:
# Set the OpenAI API key from the environment variable
openai.api_key = os.getenv("OPENAI_API_KEY")

In [None]:
# List the GPT models available to this account
openai.Model.list()


In [None]:
# Insert the model you are using in this notebook 
#model = "gpt-4-0613"
model = "gpt-3.5-turbo"

# Insert filename you want to extract information from 
filename = "data/authorize_doc/Kuiper_FCC-20-102A1.txt"

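Kor is a thin wrapper around LLMs, built on LangChain, for extracting structured data from text: a schema plus a few examples are turned into a prompt, and the model's response is parsed back into structured output.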

In [None]:
# Importing libraries, importing module for langchain and kor

#import pandas as pd
#from typing import List, Optional

#langchain
#from langchain.callbacks import get_openai_callback
#from langchain.chat_models import ChatOpenAI

#kor
#from kor.extraction import create_extraction_chain
#from kor.nodes import Object, Text, Number
#from kor import extract_from_documents, from_pydantic, create_extraction_chain

#pydantic
#from pydantic import BaseModel, Field, validator



In [None]:
#from langchain.llms import ChatOpenAI
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(
    model_name=model,
    temperature=0,  # keep randomness low so the model does not get creative and make up answers
    request_timeout=120,
    openai_api_key=openai.api_key
)

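The parameters chosen above:
- model_name: the chat model to call, taken from the 'model' variable set earlier.
- temperature=0: keeps the output as repeatable as possible, so the model favors the most likely answer over a creative one.
- request_timeout=120: how long to wait for a response before the request is abandoned.
- openai_api_key: the API key loaded from the environment.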

## Loading, Chunking, and Overlapping the Document

In [None]:
#loading the document
def import_document(filename):
    encodings = ['utf-8', 'ISO-8859-1', 'utf-16', 'ascii', 'cp1252']
    for enc in encodings:
        try:
            with open(filename, 'r', encoding=enc) as file:
                document_text = file.read()
            return document_text
        except UnicodeDecodeError:
            continue
        except FileNotFoundError:
            print(f"Error: File '{filename}' not found.")
            return None
        except Exception as e:
            print(f"Error occurred while importing the document: {e}")
            return None
    print(f"Error: Could not decode file with any of the tried encodings: {encodings}")
    return None

document = import_document(filename)
if document is not None:
    print("Document content:")
    print(document)
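One detail of the fallback list in `import_document` is worth checking: ISO-8859-1 maps every possible byte to a character, so decoding with it never raises `UnicodeDecodeError`, and the encodings listed after it are never reached. The sketch below demonstrates this with made-up bytes; the helper name `first_working_encoding` is hypothetical.

```python
from typing import List, Optional


def first_working_encoding(data: bytes, encodings: List[str]) -> Optional[str]:
    """Return the first encoding that decodes `data` without error, else None."""
    for enc in encodings:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None


# Illustrative bytes that are not valid UTF-8 (not taken from the real document)
raw = b"caf\xe9 \xff"
order = ["utf-8", "ISO-8859-1", "utf-16", "ascii", "cp1252"]
chosen = first_working_encoding(raw, order)
print(chosen)

# Every byte value decodes under ISO-8859-1, so it acts as a catch-all
all_bytes_decoded = bytes(range(256)).decode("ISO-8859-1")
print(len(all_bytes_decoded))
```

A file that decodes only through the catch-all may contain wrong characters, so it can be worth logging which encoding was actually used.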

### Let's briefly explore the document

In [None]:
len(document)

In [None]:
# Build a spaCy doc with textacy and print a short preview of it
import textacy
doc = textacy.make_spacy_doc(document, lang="en_core_web_sm")
print(doc._.preview)

In [None]:
from textacy import text_stats as ts

# Number of words and number of unique words
print("Number of words: ", ts.n_words(doc))
print("Number of unique words: ", ts.n_unique_words(doc))

# Entropy of words in the document: the average amount of information carried per word
print("Entropy: ", ts.entropy(doc))

# Type-Token Ratio (TTR): the number of unique words (types) divided by the total number of words (tokens)
print("Diversity: ", ts.diversity.ttr(doc))

# Flesch Kincaid grade level: readability tests designed to indicate how difficult a passage is
print("Flesch Kincaid: ",ts.flesch_kincaid_grade_level(doc))
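To make the two statistics concrete, here is a minimal sketch that computes a type-token ratio and a Shannon entropy over simple whitespace tokens. It assumes base-2 logarithms and lowercase whitespace tokenization; textacy's own tokenization and exact definitions may differ, so the numbers are not expected to match its output.

```python
import math
from collections import Counter
from typing import List


def type_token_ratio(tokens: List[str]) -> float:
    """Unique tokens (types) divided by all tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def shannon_entropy(tokens: List[str]) -> float:
    """Average information per token in bits, from the empirical token frequencies."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Illustrative sentence, not taken from the document
tokens = "the satellite and the plane and the orbit".lower().split()
ttr = type_token_ratio(tokens)
entropy = shannon_entropy(tokens)
print(ttr, entropy)
```

A text that repeats the same few words has a low ratio and low entropy; a text where every word is different has a ratio of 1 and the highest entropy for its length.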

### Let's now split the document into overlapping chunks

In [None]:
#from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
# Create the text splitter with parameters that are standardised for all chunking
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # maximum length of a chunk, measured with length_function
    chunk_overlap=100,    # length of text shared between consecutive chunks
    length_function=len,  # measure length in characters
    keep_separator=True   # keep the separator text inside the chunks
)

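The splitter parameters control chunking and overlap. 'chunk_size' is the maximum length of each chunk (1000 characters here, because 'length_function' is 'len'). 'chunk_overlap' is the amount of text repeated between consecutive chunks (100 characters), so that information cut at a chunk boundary is less likely to be lost. RecursiveCharacterTextSplitter tries to split on paragraph and line breaks first and only falls back to smaller separators when a piece is still too long.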
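The effect of chunk size and overlap can be seen with a simplified fixed-width sliding window. This is only an illustration of the two parameters: it is not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to break at separators), and the function name `sliding_chunks` is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is what the overlap buys: text near a boundary appears in two chunks.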

In [None]:
# Split the document into chunks using the splitter configured above
from langchain.schema import Document

full_doc = Document(page_content=document)
split_docs = text_splitter.split_documents([full_doc])

In [None]:
split_docs

## Building the Pydantic Model

The Kor documentation notes that validation does NOT imply that the extraction was correct.
Validation only implies that the data was returned in the correct shape and meets all validation criteria.
It does not mean that the LLM didn't make up some information.

We can use pydantic to skip the invalid data.
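Because a well-shaped value can still be invented, one simple complementary check is to look for the extracted value verbatim in the source chunks. The sketch below is an assumption about how such a check could look, not part of Kor or pydantic; the helper names and the sample chunks are hypothetical, and a verbatim match is only weak evidence (a value can be paraphrased in the text and still be correct).

```python
import re
from typing import List, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so minor formatting differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_supporting_chunk(value: str, chunks: List[str]) -> Optional[int]:
    """Return the index of the first chunk that contains `value` verbatim, else None."""
    needle = normalize(value)
    for index, chunk in enumerate(chunks):
        if needle and needle in normalize(chunk):
            return index
    return None


# Illustrative chunks (hypothetical text in the style of the examples used below)
chunks = [
    "Released:  March 29, 2010",
    "Kuiper must launch 50 percent of its satellites no later than July 30, 2026.",
]
print(find_supporting_chunk("July 30, 2026", chunks))  # found in the second chunk
print(find_supporting_chunk("July 30, 2027", chunks))  # not present in any chunk
```

A value with no supporting chunk is a candidate hallucination and can be flagged for manual review.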

In [None]:
import re
from typing import Optional, List
from pydantic import BaseModel, Field, validator

In [None]:
# Kor building blocks (create_extraction_chain is imported once, from the top-level package)
from kor.nodes import Object, Text, Number
from kor import extract_from_documents, from_pydantic, create_extraction_chain

### Pydantic Class

In [None]:
class OrbitEnv(BaseModel):
    const_name: str = Field(
        description="The satellite constellation name for which the company applied to deploy or operate",
    )
    date_release: str  = Field(
        description="The date of document release",
    )
    date_50: str  = Field(
        description="The date by which the company is ordered to launch and operate 50 percent of its satellites",
    )
    date_100: str  = Field(
        description="The date by which the company is ordered to completely launch and operate all of its remaining satellites.",
    )
    total_sat_const: int = Field(
        description="The concluding total number of satellites that the company has been authorized to deploy and operate for the constellation",
    )
    altitude: Optional[List[float]] = Field(
        description="The granted altitudes of the satellites that the company has been authorized to deploy",
    )
    inclination: Optional[List[float]] = Field(
        description="The granted inclination of the satellites that the company has been authorized to deploy, respective to the altitudes",
    )
    number_orb_plane: Optional[List[int]] = Field(
        description="The number of orbital planes, respective to the altitudes and inclination, that the company has been authorized to deploy",
    )
    total_sat_per_orb_plane: Optional[List[int]] = Field(
        description="The specific count of satellites located in each individual orbital plane. This count refers to the total number of satellites within one orbital plane, and it can vary from plane to plane based on the altitude and inclination, and if not mentioned in text, 'total_sat_per_alt_incl' divide by 'number_orb_plane' will give this value",
    )
    total_sat_per_alt_incl: Optional[List[int]] = Field(
        description="The total number of satellites at a specific altitude and inclination across all orbital planes sharing these characteristics. This count represents the overall number of satellites with the specified altitude and inclination parameters, and if not mentioned in the text, the multiplication of 'number_orb_plane' and 'total_sat_per_orb_plane' will give this value",
    )
    operational_lifetime : Optional[int] = Field(
        description="The operational lifetime of the satellite in the constellation in years",
    )


    @validator("const_name", "date_release", "date_50", "date_100")
    def validate_name(cls, v):
        # Digits are allowed because dates such as "July 30, 2026" contain them
        if not re.match(r"^[a-zA-Z0-9\s().,-]*$", v):
            raise ValueError("The field can only contain letters, digits, spaces, parentheses, periods, commas and hyphens.")
        return v
    
    @validator("total_sat_const", "number_orb_plane", "total_sat_per_orb_plane", "total_sat_per_alt_incl", "operational_lifetime")
    def validate_whole_number(cls, v):
        if isinstance(v, list):
            if not all(isinstance(i, int) for i in v):
                raise ValueError("All elements of the list must be whole numbers.")
        elif v is not None and not isinstance(v, int):
            raise ValueError("The field must be a whole number.")
        return v

    @validator("altitude", "inclination")
    def validate_number(cls, v):
        if isinstance(v, list):
            if not all(isinstance(i, (int, float)) for i in v):
                raise ValueError("All elements of the list must be numbers (integer or decimal).")
        elif v is not None and not isinstance(v, (int, float)):
            raise ValueError("The field must be a number (integer or decimal).")
        return v
    


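'Optional[...]' means that the field can hold either a value of the given type or None, which effectively makes it optional (for example, 'Optional[int]' for 'operational_lifetime').

A bare type such as 'str' or 'int' means that the field is mandatory.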

### Schema and Examples

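Kor, a wrapper built on LangChain, does not fine-tune the LLM. The schema description and the examples below are placed in the prompt, so the model picks up the task in context (few-shot prompting).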

In [None]:
schema, extraction_validator = from_pydantic(
    OrbitEnv,
    description="Extract the Orbital Environment information of a Satellite Constellation from the provided authorized document. This should encompass details like the satellite constellation name, document release date, dates for both 50 percent and 100 percent satellite launches, total satellite count, altitude, inclination, orbital plane quantity, satellites per orbital plane, satellites per altitude and inclination, as well as the operational lifetime of the satellites.",
    examples=[
        (
            """In this Order and Authorization, we grant, to the extent set forth below, the request of Kuiper Systems LLC (Kuiper or Amazon) to provide satellite services.
                Operating 3,372 satellites in 102 orbital planes at altitudes of 590 km, 610 km, and 630 km in a circular orbit.
                At 590 km, 30 orbital planes with 28 satellites per plane for a total of 840 satellites at inclination of 33 degree.
                At 610 km, 42 orbital planes with 36 satellites per plane for a total of 1512 satellites at inclination of 42 degree.
                At 630 km, 30 orbital planes with 34 satellites per plane for a total of 1020 satellite at inclination of 51.9 degree.
                The constellation is required to launch and operate 50 percent of its satellites no later than July 30, 2026, and must launch the remaining space stations necessary to complete its authorized service constellation, place them in their assigned orbits, and operate each of them in accordance with the authorization no later than July 30, 2029.""",
                
            {"const_name": "Kuiper", "date_50": "July 30, 2026", "date_100": "July 30, 2029", "total_sat_const": 3372, "altitude": [590, 610, 630],  "inclination": [33, 42, 51.9], "number_orb_plane": [30, 42, 30], "total_sat_per_orb_plane": [28, 36, 34], "total_sat_per_alt_incl": [840, 1512, 1020]}
        ),
        (
            "Iridium must launch 50 percent of satellite no later than November 12,2028, and must launch the other remaining satellites no later than May 16,2030.",
            {"const_name": "Iridium","date_50":"November 12,2028","date_100":"May 16,2030"}
        ),
        #date_50 and date_100
        (
            "They must launch 50 percent of the maximum number of proposed space stations, place them in the assigned orbits, and operate them in accordance with this grant of U.S. market access no later than December 31,1989, and must launch the remaining space stations necessary to complete its authorized service constellation, place them in their assigned orbits, and operate them in accordance with the grant of U.S. market access no later than December 21,1997.",
            {"date_50":"December 31,1989","date_100":"December 21,1997"}
        ),
        (
            "The company must launch 50 percent no later than June 22, 2020, and complete its authorized service constellation in accordance with the authorization no later than June 22, 2022. 47 CFR § 25.164(b).",
            {"date_50":"June 22, 2020","date_100":"June 22, 2022"}
        ),
        (
            "to launch and operate 50 percent of its satellites no later than January 3, 2027, and to complete its authorized service constellation, place them in their assigned orbits, and operate each of them in accordance with the authorization no later than February 13, 2300. 47 CFR § 25.164(b).",
            {"date_50":"January 3, 2027","date_100":"February 13, 2300"}
        ),
        (
            "In this Order and Declaratory Ruling, we grant in part and defer in part the petition for declaratory ruling of WorldVu Satellites Limited (OneWeb) for modification of its grant of U.S. market access for a its satellite constellation authorized by the United Kingdom. As modified, the constellation will operate with four fewer satellites, reduced from 720 to 716 satellites.",
            {"const_name": "WorldVu Satellites Limited (OneWeb)", "total_sat_const": 716}
        ),
        (
            """The proposed Telesat system is set to feature a robust constellation of 124 satellites.
            A set of six orbital planes, each inclined at 99.5 degrees, will host nine satellites per plane at an approximate altitude of 1,000 kilometers.
            Additionally, seven more orbital planes, each tilted at 37.4 degrees, will carry another group of satellites, with each plane accommodating ten satellites at a higher altitude of approximately 1,248 kilometers.""",
            {"const_name": "Telesat", "total_sat_const": 124, "altitude": [1000, 1248], "inclination": [99.5, 37.4], "number_orb_plane": [6, 7], "total_sat_per_orb_plane": [9, 10], "total_sat_per_alt_incl": [54, 70]}
        ),
        #difference between total_sat_per_orb_plane and total_sat_per_alt_incl
        (
            "20 orbital planes with 28 satellites per plane for a total of 560 satellites at inclination of 33 degree will be placed at an altitude approximately 800 km.",
            {"altitude": [800], "inclination": [33], "number_orb_plane": [20], "total_sat_per_orb_plane": [28], "total_sat_per_alt_incl": [560]}
        ),
        #total_sat_per_alt_incl = number_orb_plane x total_sat_per_orb_plane
        (
            "8 orbital plane containing 15 satellites each which are inclined at 56 degree with altitude of 700 kilometers",
            {"altitude": [700], "inclination": [56], "number_orb_plane": [8], "total_sat_per_orb_plane": [15], "total_sat_per_alt_incl": [120]}
        ),
        #total_sat_per_orb_plane = total_sat_per_alt_incl / number_orb_plane
        (
            "72 of the satellites will be distributed equally and place at 6 orbital planes, which are inclined 99.5 degrees, satellites will be at an approximate altitude of 1,000 kilometers",
            {"altitude": [1000], "inclination": [99.5], "number_orb_plane": [6], "total_sat_per_orb_plane": [12], "total_sat_per_alt_incl": [72]}
        ),
        #operational_lifetime
        (
            "The operational lifetime for the satellite in the constellation is 10 years",
            {"operational_lifetime": 10}
        ),
        #date release
        (
            "Released:  March 29, 2010",
            {"date_release": "March 29, 2010"}
        ),
        (
            "Released:  November 21,1997",
            {"date_release": "November 21,1997"}
        ),

    ],
    many=True,
)
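The field descriptions rely on the relation total_sat_per_alt_incl = number_orb_plane x total_sat_per_orb_plane, and the per-shell totals should add up to total_sat_const. A small consistency check over an extracted record could look like the sketch below. The function name `check_shell_counts` is hypothetical, and the input values are the ones from the Kuiper example above.

```python
from typing import List


def check_shell_counts(
    planes: List[int],
    sats_per_plane: List[int],
    sats_per_shell: List[int],
    total: int,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the counts agree."""
    problems = []
    for i, (p, s, t) in enumerate(zip(planes, sats_per_plane, sats_per_shell)):
        if p * s != t:
            problems.append(f"shell {i}: {p} planes x {s} satellites != {t}")
    if sum(sats_per_shell) != total:
        problems.append(f"sum of shells {sum(sats_per_shell)} != total {total}")
    return problems


# Values from the Kuiper example in the schema
ok = check_shell_counts([30, 42, 30], [28, 36, 34], [840, 1512, 1020], 3372)
# Same record with one deliberately wrong shell total
bad = check_shell_counts([30, 42, 30], [28, 36, 34], [840, 1500, 1020], 3372)
print(ok)
print(bad)
```

A record that fails the check is not necessarily wrong (the source text may itself be inconsistent), but it is a good candidate for a second look.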

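More examples will be provided. The block below is a draft example for two fields that are not yet in the schema, 'maneuverable' and 'spin_stabilized':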

""" #maneuverable and spin-stabilized
(
    """Each satellite in the constellation is equipped with propulsion, enabling it to perform maneuvers to avoid collisions and navigate to its designated operational orbit.
    Additionally, the satellites also have spin stabilizers, ensuring their stability during orbital operation.""",
    {"maneuverable": "y", "spin_stabilized": "y"}
), """

The 'many' parameter determines whether the function should expect to work with a single instance of the object or with multiple instances:

- If many=False (the default), the schema expects to validate a single object of the class defined in the function call (OrbitEnv in this case).
- If many=True, the schema expects to validate a list of objects of the class defined in the function call.

### Creating Chain

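A chain ties the pieces together: it formats the input text into a prompt, sends the prompt to the LLM, and parses the response. Here create_extraction_chain builds the chain from the LLM, the schema, the JSON encoder, and the validator.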

In [None]:
chain = create_extraction_chain(
    llm,
    schema,
    encoder_or_encoder_class="json",
    validator=extraction_validator,
    input_formatter="triple_quotes",
)

# The CSV encoder does not support lists, so JSON is used here, although JSON tends to be less accurate than CSV

The cell below prints the prompt and instructions that are passed to the LLM.

Kor already sets up a system message that tells the model what to do, so no prompt has to be written by hand.

In [None]:
print(chain.prompt.format_prompt(text="[user input]").to_string())
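To show the general idea behind such a prompt, here is a minimal sketch that assembles an instruction, a few worked examples, and the new input into one string. The layout and the name `build_few_shot_prompt` are assumptions for illustration; this is not Kor's actual template, which is what the cell above prints.

```python
import json
from typing import Any, Dict, List, Tuple


def build_few_shot_prompt(
    instruction: str,
    examples: List[Tuple[str, Dict[str, Any]]],
    text: str,
) -> str:
    """Place the instruction, the worked examples, and the new input in a single prompt."""
    parts = [instruction, ""]
    for example_text, example_output in examples:
        parts.append(f'Input: """{example_text}"""')
        parts.append(f"Output: {json.dumps(example_output)}")
        parts.append("")
    # The new input comes last and the output is left open for the model to complete
    parts.append(f'Input: """{text}"""')
    parts.append("Output:")
    return "\n".join(parts)


few_shot = build_few_shot_prompt(
    instruction="Extract the fields as JSON.",
    examples=[
        ("Released:  March 29, 2010", {"date_release": "March 29, 2010"}),
        ("The operational lifetime is 10 years", {"operational_lifetime": 10}),
    ],
    text="Launch 50 percent of the satellites no later than July 30, 2026.",
)
print(few_shot)
```

The model weights never change; the examples only steer the completion, which is why this counts as in-context learning rather than training.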

The next cell runs the extraction five times, printing the token usage and cost of each run and keeping the raw results.

The extraction is repeated because of the inherent randomness and non-deterministic nature of certain aspects of LLM behaviour. LLMs exhibit variability in their outputs because of:
1. Stochastic processes in training: (stochastic: having a random probability distribution or pattern that may be analysed statistically but may not be predicted precisely) weight initialization, stochastic gradient descent, and dropout introduce randomness into the learning process, so two training runs end with slightly different internal representations and parameters. This explains differences between models or model versions rather than between repeated calls to one fixed model.
2. Attention mechanism: Transformer-based LLMs use attention to weigh the importance of different words in a sentence. For a fixed model the attention weights are computed deterministically from the input, but small floating-point differences across hardware or batching can shift them slightly and occasionally change the word the model picks.
3. Random initialization: LLMs are initialized with random weights before training. Since these initial weights shape the learning trajectory, different initializations lead to differences in how a model learns and generalizes.
4. Sampling strategies: When generating a response, the model produces a probability distribution over the next word. "Greedy decoding" always chooses the most likely word and is deterministic, whereas "random sampling" draws from the distribution, so different draws can give different sequences of words.
5. Temperature parameter: Sampling usually takes a "temperature" parameter. Higher values make the distribution over words more uniform, while lower values emphasize high-probability words. This notebook sets temperature=0 to minimize randomness, although a hosted model may still return slightly different outputs between calls.
6. Random seed and environment: The random seed and the environment in which the LLM is executed can also affect its behaviour. Different seeds or execution environments can lead to divergent paths during generation.
7. Contextual embeddings: LLMs use contextual embeddings, which capture information from the entire input sequence. Small changes in the input, such as a different chunk boundary, lead to slightly different embeddings and therefore to variations in the generated output.
8. Model architecture: Some architectural choices introduce randomness, such as dropout layers, which randomly drop units during training to prevent overfitting. Dropout is normally switched off at inference, so its effect on variability comes through training.
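The sampling and temperature points above can be sketched with a temperature-scaled softmax over made-up scores. This is a toy illustration under stated assumptions: the logits are invented, the function names are hypothetical, and a real LLM applies the same idea over a vocabulary of many thousands of tokens.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy_pick for the zero-temperature limit")
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits: List[float]) -> int:
    """Deterministic choice: always the index with the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Random choice: draw an index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.1)
flat = softmax_with_temperature(logits, 10.0)
print(sharp)
print(flat)
print(greedy_pick(logits), sample_pick(logits, 1.0, random.Random(0)))
```

With a low temperature almost all probability sits on the top token, so sampling behaves nearly like greedy decoding; with a high temperature the three tokens become almost equally likely.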

In [None]:
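from langchain.callbacks import get_openai_callback

iteration_results = []

# Repeat the extraction 5 times and track token usage and cost for each run
for i in range(5):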
    with get_openai_callback() as cb:
        document_extraction_result = await extract_from_documents(
            chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=True
        )
        iteration_results.append(document_extraction_result)
        
        # Print some statistics for each iteration
        print(f"Iteration {i + 1}:")
        print(f"Total Tokens: {cb.total_tokens}")
        print(f"Prompt Tokens: {cb.prompt_tokens}")
        print(f"Completion Tokens: {cb.completion_tokens}")
        print(f"Successful Requests: {cb.successful_requests}")
        print(f"Total Cost (USD): ${cb.total_cost}")
        print("-" * 50)

In [None]:
iteration_results

### Creating a dataframe

In [None]:
import pandas as pd

def generate_dataframe_from_iterations(iteration_data):
    # Prepare an empty list to store all OrbitEnv data
    data = []

    # Loop through each iteration's data
    for json_data in iteration_data:
        for record in json_data:
            # Check if the record is a dictionary. If not, print an error and continue to the next record
            if not isinstance(record, dict):
                print(f"Error encountered: {record}")
                continue

            orbitenv_list = record.get('data', {}).get('orbitenv', [])
            for orbitenv in orbitenv_list:
                data.append([
                    orbitenv.get('const_name', ''),
                    orbitenv.get('date_release', ''),
                    orbitenv.get('date_50', ''),
                    orbitenv.get('date_100', ''),
                    orbitenv.get('total_sat_const', ''),
                    orbitenv.get('altitude', '') or '',
                    orbitenv.get('inclination', '') or '',
                    orbitenv.get('number_orb_plane', '') or '',
                    orbitenv.get('total_sat_per_orb_plane', '') or '',
                    orbitenv.get('total_sat_per_alt_incl', '') or '',
                    orbitenv.get('operational_lifetime', '')
                ])

    # Convert the list into a DataFrame
    df = pd.DataFrame(data, columns=['constellationName', 'dateRelease', 'date50', 'date100', 'totalSatelliteNumber', 'altitudes','inclination', 'numberOrbPlane', 'totalSatellitePerOrbPlane','totalSatellitePerAltIncl','operationalLifetime'])

    # Replace various values with None
    df.replace(['','-',0,'Null', 'null', 'Not Mentioned', 'Not mentioned', 'not mentioned', 'unknown', 'Unknown','N/A'], None, inplace=True)
    
    return df

# Example usage:
# Assuming iteration_results is the list that holds the results from the 5 iterations
# df = generate_dataframe_from_iterations(iteration_results)

df = generate_dataframe_from_iterations(iteration_results)

In [None]:
df

In [None]:
df.shape

## Most Frequent in Each Column

In [None]:
import re
import json
import pandas as pd
import numpy as np


def find_most_frequent(df: pd.DataFrame) -> dict:
    most_frequent_dict = {}
    for column in df.columns:
        column_without_none = df[column].dropna()
        if not column_without_none.empty:
            mode = column_without_none.mode()
            if len(mode) > 1:
                most_frequent_dict[column] = {"message": "Multiple modes found", "modes": mode.tolist()}
            else:
                most_frequent_dict[column] = mode[0]
        else:
            most_frequent_dict[column] = None
    return most_frequent_dict

def convert(o):
    if isinstance(o, np.generic):
        return o.item()
    raise TypeError

def convert_to_json(data: dict) -> str:
    try:
        json_data = json.dumps(data, default=convert)
        return json_data
    except TypeError:
        return json.dumps({"error": "Failed to serialize data"})

result = find_most_frequent(df)
result
# Returns a dictionary of key-value pairs; a dictionary is mutable, so elements can be added, removed, or changed
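Several columns hold lists (altitudes, inclinations, and so on). Lists are unhashable, so a frequency count that relies on hashing may not work on them directly. A minimal sketch of one workaround, assuming the cells are JSON-serializable, is to count a canonical JSON string of each cell instead; the function name `most_frequent` is hypothetical and the sample values are illustrative.

```python
import json
from collections import Counter
from typing import Any, List


def most_frequent(values: List[Any]) -> List[Any]:
    """Return every value tied for the highest count, ignoring None."""
    # Serialize each cell so that lists become hashable, comparable strings
    keys = [json.dumps(v, sort_keys=True) for v in values if v is not None]
    if not keys:
        return []
    counts = Counter(keys)
    top = max(counts.values())
    return [json.loads(k) for k, c in counts.items() if c == top]


# Illustrative column: three runs returned a list, one returned nothing
altitudes = [[590, 610, 630], [590, 610, 630], [590], None]
print(most_frequent(altitudes))
```

Returning a list of tied values keeps the "multiple modes" case explicit, in the same spirit as `find_most_frequent`. Note that `590` and `590.0` serialize differently, so numeric cells would need a consistent type first.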

In [None]:
# TODO: add an if statement here to resolve date_50 and date_100

In [None]:
print(type(result))

### Converting to Json and exporting

In [None]:
json_data = convert_to_json(result)

name = result.get('constellationName', {}).get('modes', [None])[0] if isinstance(result.get('constellationName', {}), dict) else result.get('constellationName', None)


In [None]:
name

In [None]:
print(type(json_data))

In [None]:
json_data

In [None]:
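if name is not None:
    # Replace non-word characters so the name is safe to use in a filename
    name = re.sub(r'\W+', '_', name)
    os.makedirs('output', exist_ok=True)  # create the output folder if it is missing
    filename = f'output/{name}_{model}_data.json'

    with open(filename, 'w') as txt_file:
        txt_file.write(json_data)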


### Yet to do
- For the total number of satellites in the constellation: if multiple modes are found, compare each candidate with the sum of the per-altitude/inclination totals; the candidate that matches is the total number of satellites in the constellation.
- Add maneuverability and spin stabilisation as fields, along with operational lifetime. Maneuverability and stabilisation are really hard to get right. Also add orbit epoch.
- Use different LLM models.
- Use order and authorization documents from different companies (done).
- Use different types of document: Schedule S or technical documents.
- Work out how to measure validity (intrinsic and extrinsic): check whether the extracted information is in the text and in which paragraph, or whether it is a hallucination.
- make this a model? - putting input text - output json fresh  - calculation coding - true process all column 29

- Maybe before doing the extraction, run a sentiment analysis over the whole document to see whether the proposed constellation is fully granted, partially granted, or denied.

- Also need to extract the release date.
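The first to-do item (resolving multiple modes for the constellation total) could be sketched as follows. This is one possible approach under the assumption that the per-shell totals are themselves reliable; the function name `resolve_total` and the candidate values are hypothetical.

```python
from typing import List, Optional


def resolve_total(candidates: List[int], per_shell_totals: List[int]) -> Optional[int]:
    """Pick the candidate total that equals the sum of the per-shell totals, if any."""
    expected = sum(per_shell_totals)
    for candidate in candidates:
        if candidate == expected:
            return candidate
    return None  # no candidate agrees with the shells, so leave it unresolved


# Two tied candidates; the shell totals are the ones from the Kuiper example
resolved = resolve_total([3000, 3372], [840, 1512, 1020])
print(resolved)
```

Returning None when nothing matches keeps the ambiguity visible instead of silently picking a value.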
