In [1]:
%pip install -U transformers




In [2]:
# Environment setup
from dotenv import load_dotenv
import os

# Load the .env file
load_dotenv()

# Get the access token
access_token = os.getenv("HUGGINGFACE_TOKEN")

In [3]:
# This is a helper function to get a dynamic page's HTML content using Playwright, and then parse it with BeautifulSoup.
# Necessary for pages that are rendered with JavaScript
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def get_dynamic_soup(url: str) -> BeautifulSoup:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        soup = BeautifulSoup(page.content(), "html.parser")
        browser.close()
        return soup

In [4]:
# LLM Summary Generation 
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, AutoConfig
from textwrap import dedent

MODEL = "google/gemma-2b-it"

config = AutoConfig.from_pretrained(MODEL, use_auth_token=access_token, max_new_tokens=800)

model = AutoModelForCausalLM.from_pretrained(MODEL, use_auth_token=access_token, config=config)
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_auth_token=access_token, config=config)

generator = pipeline("text-generation", model=MODEL, config=config)



config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


model.safetensors.index.json:   0%|          | 0.00/13.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]



model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

In [None]:
LLM_INSTRUCT_PROMPT = dedent('''\
Given the URL "{url}", and the following text from the site: "{text}" provide a concise summary that includes: 
1. The main topics in the text.
2. The purpose or objective of the website, inferred from the text and url (including subdomain and path).
3. Tags or keywords that a user may search to try and find the site in a search engine.

You must utilize information from the URL (such as the specific path and subdomain) to contextualize and add to the understanding of the text.

Format the response in a json like the following:

'summary': 'The summary of the text goes here.',
'topics': ['topic1', 'topic2', 'topic3'],
'tags': ['tag1', 'tag2', 'tag3']
''')

# This function takes a prompt and returns generated text
def generate_summary(url: str, text: str, debug: bool = False) -> str:
    prompt = LLM_INSTRUCT_PROMPT.format(url=url, text=text)
    return generator(prompt, max_length=800, do_sample=True)

def gen_summary(url: str, text:str, debug: bool = False) -> str:
    chat = [
        { "role": "user", "content": LLM_INSTRUCT_PROMPT.format(url=url, text=text) },
    ]
    prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    input_ids = tokenizer([prompt], add_special_tokens=False, return_tensors="pt")
    output = model.generate(**input_ids, max_length=2000, max_new_tokens=1000)
    return tokenizer.decode(output[0], skip_special_tokens=True)
    
def fill_prompt(input_text: str) -> str:
    return LLM_INSTRUCT_PROMPT + " " + input_text

In [None]:
#test model 

test = [["https://charlotte.edu","The University of North Carolina at Charlotte | UNC Charlotte Skip to main content News & Events News Music students participating in touring education production Tue, 02/06/2024 UNC Charlotte receives Library Excellence in Access and Diversity Award Fri, 02/02/2024 Excellence in Leadership Awards bestowed on 10 outstanding alumni Fri, 02/02/2024 Young alumni advancing in their fields and communities Thu, 01/25/2024 Noted neuroscience researcher Kelly Cartwright named Spangler Distinguished Professor of Early Literacy Wed, 01/24/2024 View All News Events UNC Charlotte Shape What's Next UNC Charlotte Icons 0 doctoral programs UNC Charlotte Icons 0 Living Alumni UNC Charlotte Icons 0 #NinerNation Undergrads to Overachievers Variety is more than the spice of life. It is life! The world offers a broader range of career opportunities than ever before, which is why we offer the way to explore and prepare for so many of them right. Choose from diverse majors in 90 bachelor's degree programs and more than 100 graduate programs. Explore Academic Offerings at UNC Charlotte #1 in Latinx Enrollment UNC Charlotte outpaces North Carolina's other four-year institutions with Latinx enrollment, undergraduate degrees and graduation rates ""It's so important to see other students like me on campus,"" says senior Claudia Martinez. Read More Data Science answers the call How UNC Charlotte is responding to industry demand in Charlotte, the region and beyond. Bringing together brilliant minds through interdisciplinary partnership, the University is bridging the gap between society and technology through hands-on programming and research. Read More Where inquiry is put to the ultimate test. Reality. Go beyond hypotheses and theory. Study in a place where on-campus research comes to life in off-campus applications throughout area communities, businesses and industries. Explore Research At UNC Charlotte Quaint & Quiet Lively & Loud Can't decide between a peaceful, picturesque college campus and an action-packed big-city school? Then don't. Get Involved in Campus Life at UNC Charlotte Clubs & Activities Choose from more than 350 student organizations in and out of the classroom at UNC Charlotte. There's something for everyone here! 49er Sports Niner Nation loves cheering on the 49ers and their 18 NCAA Division I varsity sports. Members of the Football Bowl Subdivision (FBS) American Athletic Conference, the 49ers boast some of the nation’s finest facilities and compete against the NCAA’s top competition. Exploring Charlotte Discover the University that lives on the pulse of the city. From professional sports and polished culture to outdoor adventure and recreation, Charlotte is a top destination. Enhancing student motivation and learning Jennifer Webb, Associate Professor of Psychology Revolutionizing teaching practices to benefit students Oscar Lansen, Teaching Professor of History Providing experiential learning opportunities Thomas Marshall, Lecturer in Risk Management Forging connections with students Jordan Poler, Associate Professor of Chemistry Explore Faculty Inside UNC Charlotte"],
["https://www.charlotte.edu/academics","Academics at the University of North Carolina at Charlotte | UNC Charlotte Skip to main content Academics Apply Now Visit Our Campus UNC Charlotte, North Carolina's urban research university, fuels American innovation in everything from resilient and sustainable architecture and environmental systems, to epidemiological modeling and sustainable energy, to shaping the future of work for greater Charlotte and beyond. Know What You're Looking For? Search Our Programs The academic search requires JavaScript. Visit the University Catalogs site to view all programs available. Undergraduate Programs Majors Minors Certificates Graduate Programs Graduate Degree Programs Graduate Certificates Online & Professional Programs Online/Distance Education School of Professional Studies Executive Education Explore Our Colleges Belk College of Business Generating vital talent for the greater Charlotte economy — the second largest banking center in the United States — and fresh insights through research for emerging companies across North Carolina. Learn more College of Arts + Architecture A diverse community of visionary thinkers, designers, and makers, who seek to create a more beautiful and just world through innovation, research and collaborative engagement. Learn more Cato College of Education Supporting North Carolina schools, teachers, superintendents and policy makers working to advance educational research, equity, excellence and engagement for all students. Learn more College of Computing & Informatics Fostering critical knowledge and talent to speed next-generation research and technological breakthroughs — Artificial Intelligence, Robotics, Big Data Analysis, Computer-Aided Education, Bioinformatics and Cybersecurity — for North Carolina. Learn more College of Health & Human Services Translating clinical and public health research to improve patient outcomes, especially for vulnerable, underinsured and underserved communities. Learn more College of Science Advancing interdisciplinary research and promoting discovery in the fields of math, chemistry, biology and physics, through supportive, experiential learning and state-of-the-art facilities. Learn more College of Humanities & Earth and Social Sciences Enhancing our understanding of complex issues, from climate change and global migration to health disparities and economic inequality, through interdisciplinary research, student-centered learning, and community engagement. Learn more The William States Lee College of Engineering Among the top engineering programs in North Carolina, where ideas become reality through research, study, design, hands-on prototyping and often interdisciplinary collaboration with industry supporters. Learn more Interdisciplinary Studies Where business meets computer science, biology meets the arts and history combines with engineering — integrative thinkers draw from multiple academic disciplines to lead North Carolina’s top roles in data science, business, law and healthcare. Learn more Academic Excellence The Graduate School Honors College University College Additional Resources Academic Advising Adult and Extended Services Career Center Center for Graduate Life Common Reading Experience Disability Services Academic Diversity and Inclusion International Programs Academic Support Services Writing Center"],
["https://library.charlotte.edu","Homepage | J. Murrey Atkins Library Skip to main content Limit To: Articles Peer-reviewed Advanced Search Databases Journals 0 PEOPLE IN ATKINS My Accounts Study Rooms Research Guides Hours Printing Contact Us × Which Account? My Library Account My Interlibrary Loan Account Sign up to receive library news and updates looking for a book that messes with your head?Check out the Psychological Fiction collection in the 2nd Floor Special DisplayRead More Check out our new Board Games, Card Games, and Puzzles CollectionNow available at the Area 49 Desk on the 2nd floor Read More Atkins offers resource guide for Digital HumanitiesRead More A fireside-style discussion about the birth of grassroots activism in CharlotteFebruary 28, 6-7 p.m. Read More Join the Atkins Reading ChallengeGet your bingo card and start READING!Read More Swank Streaming Film CollectionA selection of popular movies for the classroom or at homeRead More 1,500,000 Visits Per Year 3,800,000 Volumes 57 Reservable Study Rooms View More events Digital Humanities Resource Guide UNC Charlotte Receives Library Excellence in DEI Award De-Stress for Success Journal Package Alert Book Presentation, Talk, and Reception De-stress for Success During Exams The Princess Augusta Sophia Collection of Drama Atkins Introduces the Library Mobile App Film Screen With A Dean Focuses on Racial Injustice in the Justice System Exam-time activities planned for students Inaugural Atkins Library Popular Reading Series features Dr. A.J. Hartley Atkins Book Club Discussion Odyssey for Democracy Author, Subject Participate in Panel Discussion The Black Read: Celebrating Black History Month Celebrate the Insulin Centennial Wonderland Poetry Reading and Tea Party Atkins Awarded Second Grant for Mobile Hotspot Lending Atkins Awarded Federal Grant BrowZine Cancellation Film Screen With A Dean: Wilmington on Fire COVID-19 Vaccinations: Science, Politics, Mistrust, and Misinformation Panel Digital Media Literacy Instruction Kate Dickson: A Passion to Protect Atkins Moved Quickly To Keep Services Going During the Pandemic Offsite Storage Move Update Election 2020: How to Verify What You Read, See, and Hear Online Liberry Lager Now Available at Triple C Brewing Atkins Library Reopening Guide Atkins Announces Offsite Storage Location Active Learning Academy Book Published by Atkins Library Dance History II De-stress for Success with Atkins Paywall Film Virtual Panel Discussion Atkins Creating PPE for Healthcare Workers CANCELED: For the Love of Books CANCELED: Author Susan Rivers on ""Keeping It Real"" Special Collections Holds Rare ""Sketches of Charlotte"" Booklets Get Ready to be Counted Atkins Library Unveils New Website Disability Advocate and Author Discusses Hidden Disabilities New Combined Library Services at First Floor Desk Atkins Rare Book Used to Create Smithsonian Exhibit Join Our Book Club! Packaging the Past View More News"]]

for t in test:
    print(gen_summary(t[0],t[1]))
    print("\n")

Both `max_new_tokens` (=1000) and `max_length`(=2000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Both `max_new_tokens` (=1000) and `max_length`(=2000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


user
Given the URL "https://charlotte.edu", and the following text from the site: "The University of North Carolina at Charlotte | UNC Charlotte Skip to main content News & Events News Music students participating in touring education production Tue, 02/06/2024 UNC Charlotte receives Library Excellence in Access and Diversity Award Fri, 02/02/2024 Excellence in Leadership Awards bestowed on 10 outstanding alumni Fri, 02/02/2024 Young alumni advancing in their fields and communities Thu, 01/25/2024 Noted neuroscience researcher Kelly Cartwright named Spangler Distinguished Professor of Early Literacy Wed, 01/24/2024 View All News Events UNC Charlotte Shape What's Next UNC Charlotte Icons 0 doctoral programs UNC Charlotte Icons 0 Living Alumni UNC Charlotte Icons 0 #NinerNation Undergrads to Overachievers Variety is more than the spice of life. It is life! The world offers a broader range of career opportunities than ever before, which is why we offer the way to explore and prepare for s

Both `max_new_tokens` (=1000) and `max_length`(=2000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


user
Given the URL "https://www.charlotte.edu/academics", and the following text from the site: "Academics at the University of North Carolina at Charlotte | UNC Charlotte Skip to main content Academics Apply Now Visit Our Campus UNC Charlotte, North Carolina's urban research university, fuels American innovation in everything from resilient and sustainable architecture and environmental systems, to epidemiological modeling and sustainable energy, to shaping the future of work for greater Charlotte and beyond. Know What You're Looking For? Search Our Programs The academic search requires JavaScript. Visit the University Catalogs site to view all programs available. Undergraduate Programs Majors Minors Certificates Graduate Programs Graduate Degree Programs Graduate Certificates Online & Professional Programs Online/Distance Education School of Professional Studies Executive Education Explore Our Colleges Belk College of Business Generating vital talent for the greater Charlotte economy

In [None]:
# Embedding model for similarity
