## Validation with Pydantic

In [1]:
#insert the model you are using in this notebook 
model_name = "gpt3.5"

In [2]:
#importing openai and secret key
import openai
import os
from secret_key import openai_key
os.environ["OPEN_API_KEY"] = openai_key

In [3]:
#importing module for langchain and kor

from typing import List, Optional

from langchain.callbacks import get_openai_callback
from langchain.chat_models import ChatOpenAI

from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text, Number

import pandas as pd
from pydantic import BaseModel, Field, validator
from kor import extract_from_documents, from_pydantic, create_extraction_chain


from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [4]:
#importing llm model

#from langchain.llms import OpenAI

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,#dont be creative and make up answer
    openai_api_key= openai_key
)

In [5]:
#loading the document
def import_document(filename):
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            document_text = file.read()
        return document_text
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
        return None
    except Exception as e:
        print(f"Error occurred while importing the document: {e}")
        return None


filename = "data/O3b_FCC-18-70A1.txt"
document = import_document(filename)
if document is not None:
    print("Document content:")
    print(document)

Document content:
﻿	Federal Communications Commission	FCC 18-70


Before the
FEDERAL COMMUNICATIONS COMMISSION
WASHINGTON, D.C. 20554
	

In the Matter of

O3b Limited 

Request for Modification of U.S. Market Access for O3b Limited’s Non-Geostationary Satellite Orbit System in the Fixed-Satellite Service and in the Mobile-Satellite Service.
)
)
)
)
)
)
)
)


IBFS File Nos. SAT-MOD-20160624-00060,
SAT-AMD-20161115-00116, 
SAT-AMD-20170301-00026, and
SAT-AMD-20171109-00154

Call Sign S2935
ORDER AND DECLARATORY RULING

Adopted:  June 4, 2018                                                                              Released: June 6, 2018 

By the Commission:

I. INTRODUCTION
1. In this Order and Declaratory Ruling, we grant O3b Limited’s (O3b) request for a modification of its grant of U.S. market access and certain rule waivers, except for those parts that were not accepted for filing.  O3b proposes, through a modification application and a series of amendments, to expand its grant of

In [6]:
#split the document into chunks
doc = Document(page_content = document)
split_docs = RecursiveCharacterTextSplitter().split_documents([doc])

In [7]:
from pydantic import BaseModel, Field, validator, ValidationError
from typing import Optional, List, Union
import re

building the pydantic model

It Kor doumentation it says that the Validation doe NO imply that extraction was correct.
Validation only implies that the data was returned in the correct shape and meets all validation criteria.
This does not mean that the LLM didn't make up some information

we can use pydantic to sip the invalid data

In [8]:
class OrbitEnv(BaseModel):
    company_name: str = Field(
        description="The name of the company that sent the application to deploy or operate the satellite constellation",
    )
    orbit_type: str = Field(
        description="The orbit type into which the satellites will be launched"
    )
    application: str = Field(
        description="The application or services that the satellites would provide"
    )
    date_50: str = Field(
        description="Date on which the company was ordered to launch and operate 50 percent of its satellites."
    )
    date_100: str = Field(
        description="Date on which the company was ordered to launch and operate the remaining (100 percent) of its satellites"
    )
    total_sat_const: int = Field(
        description="The concluding total number of satellites that the company has been authorized to deploy and operate for the constellation"
    )
    altitude: Optional[List[float]]= Field(
        description="The granted altitudes of the satellites that the company has been authorized to deploy"
    )
    inclination: Optional[List[float]] = Field(
        description="The granted inclination of the satellites that the company has been authorized to deploy, respective to the altitudes"
    )
    number_orb_plane: Optional[List[int]] = Field(
        description="The number of orbital planes, respective to the altitudes and inclination, that the company has been authorized to deploy"
    )
    total_sat_per_orb_plane: Optional[List[int]]= Field(
        description="The specific count of satellites located in each individual orbital plane. This count refers to the total number of satellites within one orbital plane, and it can vary from plane to plane based on the altitude and inclination, and if not mentioned in text, 'total_sat_per_alt_incl' divide by 'number_orb_plane' will give this value"
    )
    total_sat_per_alt_incl: Optional[List[int]] = Field(
        description="The total number of satellites at a specific altitude and inclination across all orbital planes sharing these characteristics. This count represents the overall number of satellites with the specified altitude and inclination parameters, and if not mentioned in the text, the multiplication of 'number_orb_plane' and 'total_sat_per_orb_plane' will give this value"
    )
    orbit_shape: Optional[str] = Field(
        description="The shape of the orbital plane whether its circular, elliptical or are not mention in the document"
    )
    operational_lifetime : Optional[str] = Field(
        description="The operational lifetime of the satellite in the constellation in years"
    )


    @validator("company_name", "orbit_type", "application")
    def validate_name(cls, v):
        if not re.match("^[a-zA-Z\s().,-]*$", v):
            raise ValueError("The field can only contain alphabetic characters, spaces, parentheses, periods, commas and hyphen.")
        return v
    
    @validator("total_sat_const", "number_orb_plane", "total_sat_per_orb_plane", "total_sat_per_alt_incl", "operational_lifetime")
    def validate_whole_number(cls, v):
        if isinstance(v, list):
            if not all(isinstance(i, int) for i in v):
                raise ValueError("All elements of the list must be whole numbers.")
        elif v is not None and not isinstance(v, int):
            raise ValueError("The field must be a whole number.")
        return v

    @validator("altitude", "inclination")
    def validate_number(cls, v):
        if isinstance(v, list):
            if not all(isinstance(i, (int, float)) for i in v):
                raise ValueError("All elements of the list must be numbers (integer or decimal).")
        elif v is not None and not isinstance(v, (int, float)):
            raise ValueError("The field must be a number (integer or decimal).")
        return v

    @validator("orbit_shape")
    def validate_orbit_shape(cls, v):
        if not re.match("^[a-zA-Z\s]*$", v):
            raise ValueError("orbit_shape can only contain alphabetic characters and spaces.")
        return v


In [9]:
"""     maneuverable: Optional[str] = Field(
        description="The satellite having propulsion or can be maneuver. Return 'y' only if the satellite authorized have propulsion or are maneuverable"
    )
    spin_stabilized:  Optional[str] = Field(
        description="The satellites are spin-stabilized. Return 'y' only if satellite authorized have spin-stabilizer"
    ) """

'     maneuverable: Optional[str] = Field(\n        description="The satellite having propulsion or can be maneuver. Return \'y\' only if the satellite authorized have propulsion or are maneuverable"\n    )\n    spin_stabilized:  Optional[str] = Field(\n        description="The satellites are spin-stabilized. Return \'y\' only if satellite authorized have spin-stabilizer"\n    ) '

In [10]:
"""     @validator("date_50", "date_100")
    def validate_date(cls, v):
        if not re.match("^[A-Za-z]+\s[0-3]?[0-9],?\s[0-9]{4}$", v):
            raise ValueError("The field must be a date in the format 'Month DD YYYY' or 'Month DD, YYYY'")
        return v """

'     @validator("date_50", "date_100")\n    def validate_date(cls, v):\n        if not re.match("^[A-Za-z]+\\s[0-3]?[0-9],?\\s[0-9]{4}$", v):\n            raise ValueError("The field must be a date in the format \'Month DD YYYY\' or \'Month DD, YYYY\'")\n        return v '

'Optional[str]' means that the field can be either a str(string) or None effectively making it optional

'str' means that it is mandatory

In [11]:
schema, extraction_validator = from_pydantic(
    OrbitEnv,
    description="Extract the Orbital Environment information of a Satellite Constellation from the authorized document. Include details such as the company name, orbit type, application, dates for 50 percent and 100 percent satellite launches, total number of authorized satellites, altitude, inclination, number of orbital planes, number of satellites per plane, and orbit shape",
    examples=[
        (
            """In this Order and Authorization, we grant, to the extent set forth below, the request of Kuiper Systems LLC (Kuiper or Amazon) to deploy a non-geostationary satellite orbit (NGSO) system to provide service using certain Fixed-Satellite Service (FSS).
                Operating 3,372 satellites in 102 orbital planes at altitudes of 590 km, 610 km, and 630 km in a circular orbit.
                At 590 km, 30 orbital planes with 28 satellites per plane for a total of 840 satellites at inclination of 33 degree.
                At 610 km, 42 orbital planes with 36 satellites per plane for a total of 1512 satellites at inclination of 42 degree.
                At 630 km, 30 orbital planes with 34 satellites per plane for a total of 1020 satellite at inclination of 51.9 degree.
                The constellation are require to launch and operate 50 percent of its satellites no later than July 30, 2026, and Kuiper must launch the remaining space stations necessary to complete its authorized service constellation, place them in their assigned orbits, and operate each of them in accordance with the authorization no later than July 30, 2029.""",
                
            {"company_name": "Kuiper Systems LLC", "orbit_type": "non-geostationary satellite orbit (NGSO)", "application": "Fixed-Satellite Service (FSS)", "date_50": "July 30, 2026", "date_100": "July 30, 2029", "total_sat_const": 3372, "altitude": [590, 610, 630],  "inclination": [33, 42, 51.9], "number_orb_plane": [30, 42, 30], "total_sat_per_orb_plane": [28, 36, 34], "total_sat_per_alt_incl": [840, 1512, 1020], "orbit_shape": "circular"}
        ),
        (
            "Boeing must launch 50 percent of the maximum number of proposed space stations, place them in the assigned orbits, and operate them in accordance with this grant no later than November 12,2028, and must launch the remaining space stations necessary to complete its authorized service constellation, place them in their assigned orbits, and operate them in accordance with the authorization no later than May 16,2030.",
            {"date_50":"November 12,2028,","date_100":"May 16,2030"}
        ),
        (
            "In this Order and Declaratory Ruling, we grant in part and defer in part the petition for declaratory ruling of WorldVu Satellites Limited (OneWeb) for modification of its grant of U.S. market access for a its satellite constellation authorized by the United Kingdom. As modified, the constellation will operate with four fewer satellites, reduced from 720 to 716 satellites.",
            {"company_name": "WorldVu Satellites Limited (OneWeb)", "total_sat_const": 716}
        ),
        (
            "They sought Commission approval for a non-geostationary satellite orbit (NGSO) system to provide fixed-satellite service (FSS) in the United States.",
            {"orbit_type": "non-geostationay satellite orbit (NGSO)", "application": "fixed-satellite service (FSS)"}
        ),
        (
            """The proposed Telesat system is set to feature a robust constellation of 124 satellites.
            A set of six orbital planes, each inclined at 99.5 degrees, will host nine satellites per plane at an approximate altitude of 1,000 kilometers.
            Additionally, seven more orbital planes, each tilted at 37.4 degrees, will carry another group of satellites, with each plane accommodating ten satellites at a higher altitude of approximately 1,248 kilometers.
            It's noteworthy that all satellites will occupy a circular orbit, ensuring systematic and efficient coverage.""",
            {"company_name": "Telesat", "total_sat_const": 124, "altitude": [1000, 1248], "inclination": [99.5, 37.4], "number_orb_plane": [6, 7], "total_sat_per_orb_plane": [9, 10], "total_sat_per_alt_incl": [54, 70], "orbit_shape": "circular"}
        ),
        #different between total_sat_per_orb_plane and total_sat_per_alt_incl
        (
            "20 orbital planes with 28 satellites per plane for a total of 560 satellites at inclination of 33 degree will be placed at an altitude approximately 800 km.",
            {"altitude": 800, "inclination": 33, "number_orb_plane": 20, "total_sat_per_orb_plane": 28, "total_sat_per_alt_incl": 560}
        ),
        #total_sat_per_alt_incl = number_orb_plane x total_sat_per_orb_plane
        (
            "8 orbital plane containing 15 satellites each which are inclined at 56 degree with altitude of 700 kilometers",
            {"altitude": 700, "inclination": 56, "number_orb_plane": 8, "total_sat_per_orb_plane": 15, "total_sat_per_alt_incl": 120}
        ),
        #total_sat_per_orb_plane = total_sat_per_alt_incl x number_orb_plane
        (
            "72 of the satellites will be distributed equally and place at 6 orbital planes, which are inclined 99.5 degrees, satellites will be at an approximate altitude of 1,000 kilometers",
            {"altitude": 1000, "inclination": 99.5, "number_orb_plane": 6, "total_sat_per_orb_plane": 12, "total_sat_per_alt_incl": 72}
        ),
        #operational_lifetime
        (
            "The operational lifetime for the satellite in the constellation in 10 years",
            {"operational_lifetime": 10}
        ),

    ],
    many=True,
)

will provide more examples

""" #maneuverable and spin-stabilized
(
    """Each satellite in the constellation is equipped with propulsion, enabling it to perform maneuvers to avoid collisions and navigate to its designated operational orbit.
    Additionally, the satellites also have spin stabilizers, ensuring their stability during orbital operation.""",
    {"maneuverable": "y", "spin_stabilized": "y"}
), """

'many' parameter determine whether the funciton should expect to work with a single instance of the object or multiple instances

If many=False (the default), the schema expects to validate a single object of the class defined in the function call (OrbitEnv in your case).
If many=True, the schema expects to validate a list of objects of the class defined in the function call.

In [12]:
chain = create_extraction_chain(
    llm,
    schema,
    encoder_or_encoder_class="json",
    validator=extraction_validator,
    input_formatter="triple_quotes",
)

#csv does support list , but json is not as accurate as csv

In [13]:
print(chain.prompt.format_prompt(text="[user input]").to_string())

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

orbitenv: Array<{ // Extract the Orbital Environment information of a Satellite Constellation from the authorized document. Include details such as the company name, orbit type, application, dates for 50 percent and 100 percent satellite launches, total number of authorized satellites, altitude, inclination, number of orbital planes, number of satellites per plane, and orbit shape
 company_name: string // The name of the company that sent the application to deploy or operate the satellite constellation
 orbit_type: string // The orbit type into which the satellites will be launched
 application: string // The application or services that the satellites would provide
 date_50: string // Date on which the compa

In [14]:
with get_openai_callback() as cb:
    document_extraction_results = await extract_from_documents(
        chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=True
    )
    #split_docs is where we input the document we want to extract
    #use_uid: parameter that determine whether or not to use a unique identifier (uid)when processesing document
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): ${cb.total_cost}")

Total Tokens: 82105
Prompt Tokens: 81677
Completion Tokens: 428
Successful Requests: 31
Total Cost (USD): $0.12337149999999998


In [15]:
""" # Validate the extracted data
valid_data = []
for item in document_extraction_results:
    if isinstance(item, dict) and 'data' in item and 'orbitenv' in item['data']:
        for data in item['data']['orbitenv']:
            try:
                valid_data.append(OrbitEnv(**data))
            except ValidationError:
                pass

valid_data
 """

" # Validate the extracted data\nvalid_data = []\nfor item in document_extraction_results:\n    if isinstance(item, dict) and 'data' in item and 'orbitenv' in item['data']:\n        for data in item['data']['orbitenv']:\n            try:\n                valid_data.append(OrbitEnv(**data))\n            except ValidationError:\n                pass\n\nvalid_data\n "

In [16]:
document_extraction_results

[{'uid': '0',
  'source_uid': '0',
  'data': {'orbitenv': []},
  'raw': '<json>{"orbitenv": []}</json>',
  'validated_data': [],
  'errors': []},
 {'uid': '1',
  'source_uid': '1',
  'data': {'orbitenv': [{'company_name': 'O3b Limited',
     'orbit_type': 'non-geostationary satellite orbit (NGSO)',
     'application': 'fixed-satellite service (FSS)',
     'total_sat_const': 26}]},
  'raw': '<json>{"orbitenv": [{"company_name": "O3b Limited", "orbit_type": "non-geostationary satellite orbit (NGSO)", "application": "fixed-satellite service (FSS)", "total_sat_const": 26}]}</json>',
  'validated_data': [],
  'errors': [ValidationError(model='OrbitEnv', errors=[{'loc': ('company_name',), 'msg': 'The field can only contain alphabetic characters, spaces, parentheses, periods, commas and hyphen.', 'type': 'value_error'}, {'loc': ('date_50',), 'msg': 'field required', 'type': 'value_error.missing'}, {'loc': ('date_100',), 'msg': 'field required', 'type': 'value_error.missing'}])]},
 {'uid': '2'

In [17]:
import pandas as pd

def generate_dataframe(json_data):
    # Prepare an empty list to store all OrbitEnv data
    data = []

    for record in json_data:
        orbitenv_list = record.get('data', {}).get('orbitenv', [])
        for orbitenv in orbitenv_list:
            data.append([
                orbitenv.get('company_name', ''),
                orbitenv.get('orbit_type', ''),
                orbitenv.get('application', ''),
                orbitenv.get('date_50', ''),
                orbitenv.get('date_100', ''),
                orbitenv.get('total_sat_const', ''),
                orbitenv.get('altitude', '') or '',
                orbitenv.get('inclination', '') or '',
                orbitenv.get('number_orb_plane', '') or '',
                orbitenv.get('total_sat_per_orb_plane', '') or '',
                orbitenv.get('total_sat_per_alt_incl', '') or '',
                orbitenv.get('orbit_shape', ''),
                orbitenv.get('operational_lifetime', '')
            ])

    # Convert the list into a DataFrame
    df = pd.DataFrame(data, columns=['companyName', 'orbitType', 'application','date50', 'date100', 'totalSatelliteNumber', 'altitudes','inclination', 'numberOrbPlane', 'totalSatellitePerOrbPlane','totalSatellitePerAltIncl', 'orbShape', 'operationalLifetime'])

    #because its LLM some of the None value are return as various kind of words seen below, also empty array
    df.replace(['','-',0,'Null', 'null', 'Not Mentioned', 'Not mentioned', 'not mentioned', 'unknown', 'Unknown','N/A'], None, inplace=True)
    
    return df

# Usage:
df = generate_dataframe(document_extraction_results)


In [18]:
df

Unnamed: 0,companyName,orbitType,application,date50,date100,totalSatelliteNumber,altitudes,inclination,numberOrbPlane,totalSatellitePerOrbPlane,totalSatellitePerAltIncl,orbShape,operationalLifetime
0,O3b Limited,non-geostationary satellite orbit (NGSO),fixed-satellite service (FSS),,,26.0,,,,,,,
1,,non-geostationary satellite orbit (NGSO),"MSS, FSS",,,,,,,,,,
2,,,,,,45.0,,,,,,,


In [19]:
df.shape

(3, 13)

In [20]:
import pandas as pd
import json
import numpy as np
import re

def find_most_frequent(df: pd.DataFrame) -> dict:
    most_frequent_dict = {}
    for column in df.columns:
        column_without_none = df[column].dropna()
        if not column_without_none.empty:
            mode = column_without_none.mode()
            if len(mode) > 1:
                most_frequent_dict[column] = {"message": "Multiple modes found", "modes": mode.tolist()}
            else:
                most_frequent_dict[column] = mode[0]
        else:
            most_frequent_dict[column] = None
    return most_frequent_dict

def convert(o):
    if isinstance(o, np.generic):
        return o.item()
    raise TypeError

def convert_to_json(data: dict) -> str:
    try:
        json_data = json.dumps(data, default=convert)
        return json_data
    except TypeError:
        return json.dumps({"error": "Failed to serialize data"})

result = find_most_frequent(df)
result
#returninng dictionary key-value pair, mutable , can be add, remove, change element

{'companyName': 'O3b Limited',
 'orbitType': 'non-geostationary satellite orbit (NGSO)',
 'application': {'message': 'Multiple modes found',
  'modes': ['MSS, FSS', 'fixed-satellite service (FSS)']},
 'date50': None,
 'date100': None,
 'totalSatelliteNumber': {'message': 'Multiple modes found',
  'modes': [26, 45]},
 'altitudes': None,
 'inclination': None,
 'numberOrbPlane': None,
 'totalSatellitePerOrbPlane': None,
 'totalSatellitePerAltIncl': None,
 'orbShape': None,
 'operationalLifetime': None}

In [21]:
print(type(result))

<class 'dict'>


In [22]:
json_data = convert_to_json(result)

company_name = result.get('companyName', {}).get('modes', [None])[0] if isinstance(result.get('companyName', {}), dict) else result.get('companyName', None)


In [23]:
company_name

'O3b Limited'

In [24]:
print(type(json_data))

<class 'str'>


In [25]:
json_data

'{"companyName": "O3b Limited", "orbitType": "non-geostationary satellite orbit (NGSO)", "application": {"message": "Multiple modes found", "modes": ["MSS, FSS", "fixed-satellite service (FSS)"]}, "date50": null, "date100": null, "totalSatelliteNumber": {"message": "Multiple modes found", "modes": [26, 45]}, "altitudes": null, "inclination": null, "numberOrbPlane": null, "totalSatellitePerOrbPlane": null, "totalSatellitePerAltIncl": null, "orbShape": null, "operationalLifetime": null}'

In [26]:
if company_name is not None:
    company_name = re.sub(r'\W+', '_', company_name)
    filename = f'output/{company_name}_{model_name}_data.txt'

    with open(filename, 'w+') as txt_file:
        txt_file.write(json_data)


### yet to do
- for the total number of satellite in constellation you can do if there are multiple mode found match the value with the sum of all array in total number of satellite (per altitude/inclination) - if it match that is your total number of satellite in constellation
- adding in maneuverable and spin satebilisation as a field and operational lifetime
- using different LLM model
- using different company order authorize document
- using different type of authorize document
- how to measure validation (intrisic and extrinsic)
- make this a model? - putting input text - output json fresh  - calculation coding - true process all column 29
