In [1]:
import os
import requests
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
from openai import OpenAI
import pandas as pd

In [2]:
load_dotenv(override=True)
api_key=os.getenv("OPENAI_API_KEY")

In [3]:
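if api_key and api_key.startswith("sk-proj-"):  # api_key is None when the variable is unset
    print("API KEY VALID")
else:
    print("API KEY MISSING OR IN AN UNEXPECTED FORMAT")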

API KEY VALID


In [5]:
model="gpt-4o-mini"
openai=OpenAI()

In [59]:
system_prompt="""You are a text cleaner.

Given raw scraped text, clean it by:

**Remove:**
- Navigation elements ("Skip to content", "Log in", "Search", "Home | About")
- Boilerplate ("All rights reserved", "Cookie policy", "Subscribe") 
- HTML artifacts (&amp;, &quot;, extra whitespace)
- Duplicate sentences

**Fix:**
- Extra whitespace and line breaks
- Encoding issues  
- Inconsistent punctuation

Return only the cleaned text, preserving all meaningful business content. Keep the output to roughly 1000 characters. Do not include generic phrases like "The article discusses...".
"""

In [34]:
df=pd.read_csv("rawdata.csv")

In [35]:
df.head()

Unnamed: 0,url,title,text
0,https://enterpriseleague.com/blog/engineering-...,26 engineering startups that can change the wo...,About\nFeatures\nPricing\nBlog\nBusiness Tips\...
1,https://www.pwc.com/us/en/industries/industria...,Engineering and construction industry trends: ...,Skip to content\nSkip to footer\nFeatured insi...
2,https://www.mordorintelligence.com/industry-re...,"US Engineering Services Industry - Size, Trend...",Reports\nAerospace & Defense\nAgriculture\nAni...
3,https://online-engineering.case.edu/blog/innov...,Innovative Engineering Practices Driving Indus...,`\nSkip to main content\nPrograms\nMaster of E...
4,https://www.forbes.com/sites/haniyarae/2025/03...,Meet America’s Best Startup Employers 2025,Newsletters\nGames\nShare a News Tip\nFeatured...


We use an LLM to strip irrelevant content such as navigation elements and boilerplate. For long scraped pages this tends to be more reliable than a typical regex approach, because the noise differs from site to site and is hard to capture with fixed patterns.
I use `gpt-4o-mini` because it is inexpensive and, with a well-written prompt, accurate enough for this task.
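
A cheap deterministic pass can still complement the LLM step by shrinking the input before it is sent to the model. The sketch below is not part of the original pipeline: `pre_clean` and `NAV_LINES` are hypothetical names, and the set of navigation phrases is an assumption based on the examples in the system prompt. It only decodes HTML entities, collapses whitespace, and drops exact duplicate or known navigation lines.

```python
import html
import re

# Hypothetical list of navigation lines; extend it for your own data.
NAV_LINES = {"skip to content", "skip to footer", "skip to main content", "log in", "search"}

def pre_clean(text: str) -> str:
    """Deterministic pre-pass to run before the LLM cleaner (illustrative only)."""
    text = html.unescape(text)  # e.g. "&amp;" becomes "&"
    kept, seen = [], set()
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()  # collapse runs of whitespace
        if not line or line.lower() in NAV_LINES:
            continue  # skip empty lines and known navigation entries
        if line in seen:
            continue  # skip exact duplicate lines
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)
```

Applied as `df['text'].apply(pre_clean)` before calling the model, this would likely reduce token usage, while the LLM still handles the site-specific noise that fixed patterns miss.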

In [60]:
def cleaner(text):
    response= openai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text}
        ],
        temperature=0.2
    )
    return response.choices[0].message.content
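
Because `cleaner` makes one network call per row, a single failed request or an oversized page can abort the whole `apply`. The following is a minimal sketch of a wrapper that truncates the input and retries on errors. `clean_with_retry` is a hypothetical helper, and the defaults (`max_chars=20000`, three attempts, exponential backoff) are assumptions, not values from this notebook. Character truncation is only a rough stand-in for the model's token limit.

```python
import time

def clean_with_retry(text, clean_fn, max_chars=20000, retries=3, delay=1.0):
    """Call clean_fn on truncated text, retrying on any exception.

    Returns None when every attempt fails, so one bad row does not stop df.apply.
    """
    snippet = text[:max_chars]  # crude guard against very long pages
    for attempt in range(retries):
        try:
            return clean_fn(snippet)
        except Exception:
            # Back off before the next attempt: delay, 2*delay, 4*delay, ...
            if attempt < retries - 1:
                time.sleep(delay * (2 ** attempt))
    return None
```

It could be used as `df_sample['text'].apply(lambda t: clean_with_retry(t, cleaner))`; rows that come back as `None` can then be inspected or retried separately.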

In [54]:
df = df.dropna(subset=['text'])

In [61]:

# Take an explicit copy so that adding a column cannot raise SettingWithCopyWarning
df_sample = df.dropna(subset=['text']).copy()
df_sample['cleaned_text'] = df_sample['text'].apply(cleaner)

In [62]:
# Use positional indexing: label 0 may be missing after rows are dropped
display(Markdown(df_sample['cleaned_text'].iloc[0]))

During the past twenty years, significant changes have occurred in society, largely driven by outstanding technological innovations. The engineering industry, in particular, has seen a revolution, with startups creating innovative products to solve problems or provide better alternatives to existing solutions. Engineering startups focus on developing solutions across various sectors, from robotics to healthcare, often fulfilling unmet market needs.

Notable engineering startups include:

- **Virtual Facility**: Founded in 2018, it addresses alarm fatigue in industrial settings using machine learning to prioritize alarms based on significance.
  
- **nTopology**: This company offers a 3D modeling platform for product design, facilitating collaboration among engineers and clients like Lockheed Martin.

- **Arch Engineers**: Provides life-cycle engineering services, focusing on innovative solutions for clients in oil and gas, mining, and infrastructure.

- **Carnot**: Aims to reduce global CO2 emissions by revolutionizing power generation for marine craft and heavy vehicles.

- **Existo**: Develops solutions for people with disabilities, promoting health and wellness through innovative wearable technology.

These startups exemplify the creativity and flexibility that define the engineering sector, allowing them to thrive in a competitive landscape.

In [None]:
df_sample.drop(columns=['text'], inplace=True)

In [66]:
df_sample.head(10)

Unnamed: 0,url,title,cleaned_text
0,https://enterpriseleague.com/blog/engineering-...,26 engineering startups that can change the wo...,"During the past twenty years, significant chan..."
1,https://www.pwc.com/us/en/industries/industria...,Engineering and construction industry trends: ...,The engineering and construction (E&C) sector ...
2,https://www.mordorintelligence.com/industry-re...,"US Engineering Services Industry - Size, Trend...",The United States Engineering Services Market ...
3,https://online-engineering.case.edu/blog/innov...,Innovative Engineering Practices Driving Indus...,Engineers spearhead new developments that are ...
4,https://www.forbes.com/sites/haniyarae/2025/03...,Meet America’s Best Startup Employers 2025,Inside the lab at Lunar Energy’s headquarters ...
5,https://www.acec.org/resources/market-intellig...,Industry Insights - ACEC,ACEC releases monthly economic updates that in...
6,https://www.borntoengineer.com/resources/how-e...,How Engineering Is Evolving To Meet The Challe...,How Engineering is Evolving To Meet The Challe...
7,https://www.kimley-horn.com/news-insights/awar...,2025 ENR Rankings | Kimley-Horn,Kimley-Horn has been recognized in the 2025 En...
8,https://imaa-institute.org/m-and-a-news/weekly...,M&A News: Global M&A Deals Week of August 4 to...,"The Institute for Mergers, Acquisitions and Al..."
9,https://addisongroup.com/insights/engineering-...,"Engineering hiring trends, in-demand jobs & to...",Engineering hiring trends indicate a current r...


In [67]:
df_sample.to_csv("cleaned_data.csv", index=False)  