In [1]:
# Ewha Womans University Database Sample
data = [
    {
        "title": "Ewha Womans University Introduction",
        "heading": "About Ewha",
        "content": "Ewha Womans University is a private research university for women in Seoul, South Korea. Founded in 1886, it is the oldest and largest women's university in South Korea. Ewha is a comprehensive university with 11 colleges and 13 graduate schools offering over 100 undergraduate and graduate programs. The university is known for its strong academic programs, especially in the fields of humanities, social sciences, and natural sciences. Ewha also has a strong reputation for its international programs and has exchange agreements with over 300 universities around the world."
    },
    {
        "title": "Ewha Womans University Rankings",
        "heading": "World University Rankings",
        "content": "Ewha Womans University is ranked among the top universities in Asia and the world. In the 2023 QS World University Rankings, Ewha is ranked 101-150th in the world and 1st in South Korea. In the 2023 Times Higher Education World University Rankings, Ewha is ranked 151-200th in the world and 1st in South Korea."
    },
    {
        "title": "Undergraduate Admissions",
        "heading": "Requirements",
        "content": "Applicants must have a high school diploma or equivalent.\nApplicants must submit their academic transcripts, standardized test scores (such as SAT or ACT), and letters of recommendation.\nApplicants must also write an essay and submit a personal statement.\nThe specific requirements for each program may vary, so it is important to check with the university for more information."
    },
    {
        "title": "Ewha Womans University Notable Alumni",
        "heading": "Notable Alumni",
        "content": "Ewha Womans University has a distinguished alumni body that includes many leaders in government, business, academia, and the arts. Some notable alumni include:\nPark Geun-hye, former President of South Korea\nKim Yoon-ok, former Prime Minister of South Korea\nShin Ji-ye, Minister of Gender Equality and Family of South Korea\nLee Mi-kyung, actress\nSon Ye-jin, actress\nIU, singer-songwriter\nPark Bom, singer"
    },
]  

In [3]:
# Install library to count the tokens in the database and use them for a more effective embedding process
%pip install tiktoken

Collecting tiktoken
  Obtaining dependency information for tiktoken from https://files.pythonhosted.org/packages/91/13/c998aa4f53343fb2e7ec6cbfeff23a57623e774e518c033c2a675a935afb/tiktoken-0.5.2-cp312-cp312-macosx_11_0_arm64.whl.metadata
  Downloading tiktoken-0.5.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (6.6 kB)
Downloading tiktoken-0.5.2-cp312-cp312-macosx_11_0_arm64.whl (953 kB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.5.2

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: /opt/homebrew/Cellar/jupyterlab/4.0.9/libexec/bin/python -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


# TOKENIZING
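The tiktoken library is used to count the number of tokens in a given text, using the tokenizer encoding associated with the GPT-3.5-turbo model. The model itself is not called; only its tokenizer is applied to the text.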

## Why GPT-3.5-turbo?
OpenAI provides different models with varying capabilities and costs, and tiktoken selects the tokenizer encoding that matches a given model name. We chose 'gpt-3.5-turbo' for Studentua because it is a capable language model that offers a good balance between performance and cost-effectiveness compared with other models.
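
As a minimal sketch of how the model name determines the tokenizer, the helper below looks up the encoding once and reuses it on later calls. The names `get_encoding` and `count_tokens_with` are illustrative and not part of this notebook. Letting `count_tokens_with` accept any object with an `encode` method is an assumption made here so the counting logic can be tried without tiktoken installed.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_encoding(model_name: str = "gpt-3.5-turbo"):
    """Look up the tokenizer encoding that tiktoken associates with a model name."""
    # Imported lazily so count_tokens_with can also be used with a custom encoder.
    import tiktoken

    return tiktoken.encoding_for_model(model_name)


def count_tokens_with(text: str, encoding=None) -> int:
    """Count tokens in `text`.

    `encoding` is any object with an `encode(text)` method (hypothetical
    extension point); by default the gpt-3.5-turbo encoding is used.
    """
    if encoding is None:
        encoding = get_encoding()
    return len(encoding.encode(text))
```

With tiktoken installed, `count_tokens_with("Ewha Womans University")` returns the token count under the gpt-3.5-turbo encoding, and repeated calls reuse the cached encoder instead of looking it up again.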

In [4]:
# Import necessary libraries
import tiktoken
import csv

# Use the tiktoken library to count the tokens in the database
def count_tokens(text: str) -> int:
    """Return the number of tokens in the given text"""
    encoding = tiktoken.encoding_for_model(model_name='gpt-3.5-turbo')
    num_tokens = len(encoding.encode(text=text))
    return num_tokens

# Update the data variable to include a new column 'token'
for entry in data:
    content = entry['content']
    # Calculate the token count
    tokens = count_tokens(content)
    # Add the "token" key with the calculated value to the dictionary
    entry['token'] = tokens

# Define the CSV file name
csv_filename = 'ewha_database.csv'

# Write the data to a CSV file
with open(csv_filename, 'w', newline='') as csvfile:
    fieldnames = ['title', 'heading', 'content', 'token']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    # Write the header row
    writer.writeheader()

    # Write the data rows (each entry's keys match fieldnames exactly)
    writer.writerows(data)

print(f'Data has been written to {csv_filename}')

Data has been written to ewha_database.csv


In [5]:
csv_filename = 'ewha_database.csv'

with open(csv_filename, newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)

{'title': 'Ewha Womans University Introduction', 'heading': 'About Ewha', 'content': "Ewha Womans University is a private research university for women in Seoul, South Korea. Founded in 1886, it is the oldest and largest women's university in South Korea. Ewha is a comprehensive university with 11 colleges and 13 graduate schools offering over 100 undergraduate and graduate programs. The university is known for its strong academic programs, especially in the fields of humanities, social sciences, and natural sciences. Ewha also has a strong reputation for its international programs and has exchange agreements with over 300 universities around the world.", 'token': '113'}
{'title': 'Ewha Womans University Rankings', 'heading': 'World University Rankings', 'content': 'Ewha Womans University is ranked among the top universities in Asia and the world. In the 2023 QS World University Rankings, Ewha is ranked 101-150th in the world and 1st in South Korea. In the 2023 Times Higher Education W

In [7]:
# Check the resultant database file with the added token column
import pandas as pd

df = pd.read_csv('ewha_database.csv')
print(f"{len(df)} rows in the data.")
df.sample(3)

3 rows in the data.


|   | title | heading | content | token |
|---|-------|---------|---------|-------|
| 0 | Ewha Womans University Introduction | About Ewha | Ewha Womans University is a private research u... | 113 |
| 1 | Ewha Womans University Rankings | World University Rankings | Ewha Womans University is ranked among the top... | 83 |
| 2 | Undergraduate Admissions | Requirements | Applicants must have a high school diploma or ... | 69 |


# DATABASE STRUCTURE
This sample database organizes information about Ewha Womans University. Its four columns, "title", "heading", "content", and "token", are chosen to support the use of OpenAI's models for natural language processing tasks. The significance of each column is as follows:

### Title:
Represents the title of a specific entry. This column helps identify and categorize different pieces of information in the database.
### Heading:
Represents the heading of a specific section within each entry. This column provides additional context and organization within each entry.
### Content:
Contains the main textual content associated with each entry. It includes detailed descriptions, requirements, or any other relevant information about the topic. This is the text that is tokenized with the encoding used by GPT-3.5-turbo and then passed to OpenAI's Completion and Embedding APIs to generate responses, answer questions, or perform other natural language tasks.
### Token:
Stores the precalculated token count for the corresponding "content" of each entry. Tokens are chunks of text that language models like GPT-3 process. The token count is essential because OpenAI charges per token for model usage. Understanding the token count allows us to manage and control costs when interacting with OpenAI.
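
To make the cost reasoning concrete, here is a rough sketch that turns per-entry token counts into a cost estimate. The function name `estimate_cost` and the price constant are hypothetical; the price is a placeholder and not an actual OpenAI rate, so check the current pricing before relying on any figure. The token counts are the ones shown in the sample output earlier in this notebook.

```python
def estimate_cost(token_counts, price_per_1k_tokens):
    """Return (total_tokens, estimated_cost) for a list of per-entry token counts."""
    total_tokens = sum(token_counts)
    estimated_cost = total_tokens / 1000 * price_per_1k_tokens
    return total_tokens, estimated_cost


# Token counts of the three entries shown in the sample output.
token_counts = [113, 83, 69]

# Placeholder price per 1,000 tokens (assumed value for illustration only).
HYPOTHETICAL_PRICE_PER_1K = 0.001

total_tokens, estimated_cost = estimate_cost(token_counts, HYPOTHETICAL_PRICE_PER_1K)
print(f"{total_tokens} tokens, estimated cost: {estimated_cost:.6f}")
```

Because the counts are stored in the "token" column ahead of time, an estimate like this needs no call to the tokenizer or to the API.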

In summary, this structure facilitates efficient data management and effective use of the language model for natural language understanding and generation tasks.
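
The four-column structure can also be sketched as a typed record. The `Entry` dataclass and the `load_entries` helper below are illustrative names, not part of this notebook. The sketch also handles one detail visible in the printed rows above: `csv.DictReader` returns every value as a string (for example `'token': '113'`), so the token count is converted back to an integer on load. The in-memory CSV uses made-up content as a stand-in for `ewha_database.csv`.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Entry:
    title: str    # title of the entry
    heading: str  # heading of the section within the entry
    content: str  # main text that is tokenized and sent to the API
    token: int    # precalculated token count of `content`


def load_entries(csvfile):
    """Read rows from an open CSV file, converting `token` from str to int."""
    reader = csv.DictReader(csvfile)
    return [
        Entry(row["title"], row["heading"], row["content"], int(row["token"]))
        for row in reader
    ]


# Small in-memory stand-in for ewha_database.csv (made-up row for illustration).
sample_csv = io.StringIO(
    "title,heading,content,token\n"
    "Example Title,Example Heading,Some example content.,4\n"
)
entries = load_entries(sample_csv)
print(entries[0])
```

To load the real file, pass an open file handle instead, for example `load_entries(open('ewha_database.csv', newline=''))`.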