#Data Creation using Web Scraping


In [2]:
import requests
from bs4 import BeautifulSoup
import csv
import random
import time

# Fetch a page and return its HTML, or None on any failure (no retries yet).
def fetch_html(url):
    try:
        # A timeout keeps one unresponsive site from hanging the whole loop.
        resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        if resp.status_code == 200:
            return resp.text
        print("Hmm, got status:", resp.status_code, "for", url)
        return None
    except requests.RequestException as err:
        # Covers connection errors, timeouts, and invalid URLs raised by requests.
        print(f"Problem reaching {url} -> {err}")
        return None


# Extract the text of every matching tag, optionally filtered by CSS class.
def grab_text_bits(page_html, tag="p", css_class=None):
    soup = BeautifulSoup(page_html, "html.parser")

    if css_class:
        found = soup.find_all(tag, class_=css_class)
    else:
        found = soup.find_all(tag)

    results = []
    for f in found:
        txt = f.get_text(strip=True)
        # keep only snippets longer than five words
        if len(txt.split()) > 5:
            results.append(txt)
    return results


# Sites to scrape for data-science job and interview text
pages_to_scrape = [
    "https://www.indeed.com/q-data-scientist-jobs.html",
    "https://www.glassdoor.com/Interview/data-scientist-interview-questions-SRCH_KO0,14.htm",
    "https://towardsdatascience.com/",
    "https://www.naukri.com/data-scientist-jobs",
    "https://www.analyticsvidhya.com/blog/"
]

# collected texts go here
scraped_texts = []

# main scraping loop
for site in pages_to_scrape:
    print(">>> scraping:", site)
    html_code = fetch_html(site)

    if html_code:
        snippets = grab_text_bits(html_code, "p")  # paragraphs only for now
        scraped_texts.extend(snippets)

    # random 2-6 second pause so we don't hammer servers
    wait_time = random.randint(2, 6)
    time.sleep(wait_time)

# cleanup: deduplicate while preserving first-seen order (a plain set would scramble it)
unique_bits = list(dict.fromkeys(scraped_texts))

# keep at most the first 60 samples
trimmed_bits = unique_bits[:60]

print("Got", len(trimmed_bits), "usable samples")

# write to CSV file
with open("raw_text_samples.csv", "w", encoding="utf-8", newline="") as fh:
    csv_writer = csv.writer(fh)
    csv_writer.writerow(["sample_text"])   # header row
    for row in trimmed_bits:
        csv_writer.writerow([row])

print("Saved to raw_text_samples.csv (hopefully works!)")

# TODO: add better handling for sites that block the scraper (e.g. HTTP 403)


>>> scraping: https://www.indeed.com/q-data-scientist-jobs.html
Hmm, got status: 403 for https://www.indeed.com/q-data-scientist-jobs.html
>>> scraping: https://www.glassdoor.com/Interview/data-scientist-interview-questions-SRCH_KO0,14.htm
Hmm, got status: 403 for https://www.glassdoor.com/Interview/data-scientist-interview-questions-SRCH_KO0,14.htm
>>> scraping: https://towardsdatascience.com/
>>> scraping: https://www.naukri.com/data-scientist-jobs
>>> scraping: https://www.analyticsvidhya.com/blog/
Got 54 usable samples
Saved to raw_text_samples.csv (hopefully works!)
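
The run above shows why a single failure path is limiting: a 403 means the site is refusing the scraper, so repeating the request is unlikely to help, whereas timeouts and 5xx responses are often transient. Below is a minimal sketch of retrying with exponential backoff. The names `fetch_with_retries`, `fetch_once`, `requests_fetch_once`, and `RETRYABLE_STATUSES` are hypothetical and not part of the notebook, and the choice of retryable statuses is an assumption. `fetch_once` is passed in as a callable returning `(status_code, text)` so the retry logic can be tried without touching the network.

```python
import time

# Assumption: statuses treated as transient. 403 is deliberately left out.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def fetch_with_retries(url, fetch_once, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Return the page text, or None if the URL is blocked or keeps failing."""
    for attempt in range(1, max_attempts + 1):
        try:
            status, text = fetch_once(url)
        except OSError as err:
            # requests.RequestException derives from OSError, so network errors land here.
            print(f"Attempt {attempt}: problem reaching {url} -> {err}")
        else:
            if status == 200:
                return text
            if status not in RETRYABLE_STATUSES:
                print(f"Giving up on {url}: status {status} is not retryable")
                return None
            print(f"Attempt {attempt}: got status {status} for {url}")
        if attempt < max_attempts:
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return None


def requests_fetch_once(url):
    """Adapter for the requests library (imported lazily so the sketch loads without it)."""
    import requests
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    return resp.status_code, resp.text
```

In the scraping loop, `fetch_with_retries(site, requests_fetch_once)` could stand in for `fetch_html(site)`. Blocked sites would still return `None`, but without spending extra requests on them.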


In [3]:
from google.colab import files
files.download("raw_text_samples.csv")



#Data Cleaning


In [4]:
import csv
import re

def clean_line(text):
    if not text:
        return ""
    t = text.strip()
    t = re.sub(r"<.*?>", "", t)        # strip HTML tags
    t = t.lower()                      # lowercase
    t = re.sub(r"[^a-z0-9.,!?;:'\"()\s-]", " ", t)  # remove weird symbols
    t = re.sub(r"\s+", " ", t)         # collapse spaces
    return t.strip()

raw_rows = []
with open("raw_text_samples.csv", "r", encoding="utf-8") as infile:
    reader = csv.DictReader(infile)

    # get the actual column names in the file
    print("CSV headers found:", reader.fieldnames)

    # use the first column as the text column
    col_name = reader.fieldnames[0]

    for row in reader:
        raw_rows.append(row[col_name])

print(f"Loaded {len(raw_rows)} raw samples")

# clean, drop rows shorter than four words, and dedup (order preserved)
seen, cleaned_rows = set(), []
for r in raw_rows:
    cleaned = clean_line(r)
    if len(cleaned.split()) < 4 or cleaned in seen:
        continue
    cleaned_rows.append(cleaned)
    seen.add(cleaned)

print(f"After cleaning: {len(cleaned_rows)} samples remain")

# save cleaned TXT
with open("cleaned_samples.txt", "w", encoding="utf-8") as f_out:
    for cl in cleaned_rows:
        f_out.write(cl + "\n")

# save cleaned CSV
with open("cleaned_samples.csv", "w", encoding="utf-8", newline="") as f_out:
    writer = csv.writer(f_out)
    writer.writerow(["cleaned_text"])
    for cl in cleaned_rows:
        writer.writerow([cl])

print("✅ Cleaned files saved: cleaned_samples.txt & cleaned_samples.csv")


CSV headers found: ['sample_text']
Loaded 54 raw samples
After cleaning: 54 samples remain
✅ Cleaned files saved: cleaned_samples.txt & cleaned_samples.csv
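
To make the cleaning steps concrete, here is a small worked example that runs `clean_line` (copied from the cell above so the snippet is self-contained) on one made-up string. The sample text is hypothetical and not taken from the scraped data.

```python
import re


def clean_line(text):
    if not text:
        return ""
    t = text.strip()
    t = re.sub(r"<.*?>", "", t)                       # strip HTML tags
    t = t.lower()                                     # lowercase
    t = re.sub(r"[^a-z0-9.,!?;:'\"()\s-]", " ", t)    # replace other symbols with spaces
    t = re.sub(r"\s+", " ", t)                        # collapse whitespace runs
    return t.strip()


# Hypothetical input: tags, mixed case, stray symbols, padding.
sample = "  <b>Senior Data Scientist</b> | Python & SQL!  "
print(clean_line(sample))   # senior data scientist python sql!
```

The symbol filter is lossy for some technical terms: `C++` and `C#` both collapse to `c`, which is worth keeping in mind before tagging skills on the cleaned text.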


In [5]:
# Download the cleaned dataset (CSV)
from google.colab import files
files.download("cleaned_samples.csv")



#Data Annotation

In [11]:
import csv
import json
import os
import re

# --- simple rule-based annotation function ---
def annotate_text(text):
    t = text.lower()

    # Category detection
    if "interview" in t or "question" in t or "answer" in t:
        category = "Interview Question"
    elif "apply" in t or "responsibilities" in t or "requirements" in t or "role" in t:
        category = "Job Description"
    else:
        category = "Blog/Article"

    # Skill tagging (whole-word keyword scan, so "r" does not match inside
    # "for the" and "excel" does not match inside "excellent")
    skills = []
    for skill in ["python", "sql", "machine learning", "statistics", "nlp", "deep learning", "excel", "r"]:
        if re.search(rf"\b{re.escape(skill)}\b", t):
            skills.append(skill)
    skill_tag = ", ".join(skills) if skills else "General"

    # Difficulty estimation (rough heuristics)
    if any(word in t for word in ["introduction", "basics", "beginner", "simple"]):
        difficulty = "Beginner"
    elif any(word in t for word in ["optimize", "deployment", "scalable", "production", "pipeline"]):
        difficulty = "Advanced"
    else:
        difficulty = "Intermediate"

    return category, skill_tag, difficulty


# --- file written by the cleaning cell (Colab's working directory is /content) ---
input_file = "/content/cleaned_samples.csv"

cleaned_rows = []
with open(input_file, "r", encoding="utf-8") as infile:
    reader = csv.DictReader(infile)
    # auto-detect the text column
    text_col = reader.fieldnames[0]
    print("Using column:", text_col)
    for row in reader:
        cleaned_rows.append(row[text_col])

print(f"Loaded {len(cleaned_rows)} cleaned samples")


# --- annotate only first 20 for demo ---
sample_rows = cleaned_rows[:20]

annotated = []
for r in sample_rows:
    cat, skill, diff = annotate_text(r)
    annotated.append({
        "text": r,
        "category": cat,
        "skill_tag": skill,
        "difficulty": diff
    })


os.makedirs("/mnt/data", exist_ok=True)

# --- save annotated dataset to CSV ---
csv_output = "/mnt/data/annotated_samples.csv"
with open(csv_output, "w", encoding="utf-8", newline="") as out_csv:
    writer = csv.DictWriter(out_csv, fieldnames=["text", "category", "skill_tag", "difficulty"])
    writer.writeheader()
    writer.writerows(annotated)

# --- save also to JSON ---
json_output = "/mnt/data/annotated_samples.json"
with open(json_output, "w", encoding="utf-8") as out_json:
    json.dump(annotated, out_json, indent=2, ensure_ascii=False)

print(f"✅ Saved {len(annotated)} annotated rows into:")
print("CSV:", csv_output)
print("JSON:", json_output)

Using column: cleaned_text
Loaded 54 cleaned samples
✅ Saved 20 annotated rows into:
CSV: /mnt/data/annotated_samples.csv
JSON: /mnt/data/annotated_samples.json
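
Because the rules fall back to "Blog/Article" and "Intermediate" whenever no keyword matches, a quick tally of labels is a cheap way to see how many rows hit a rule and how many fell through to the defaults. A minimal sketch follows. The `annotated_demo` rows are hypothetical placeholders in the same shape as the `annotated` list, and `label_counts` is an illustrative helper that is not part of the notebook.

```python
from collections import Counter

# Hypothetical rows shaped like the entries of `annotated` above.
annotated_demo = [
    {"text": "apply now for a data scientist role", "category": "Job Description",
     "skill_tag": "General", "difficulty": "Intermediate"},
    {"text": "common sql interview question on joins", "category": "Interview Question",
     "skill_tag": "sql", "difficulty": "Intermediate"},
    {"text": "an introduction to python for analysts", "category": "Blog/Article",
     "skill_tag": "python", "difficulty": "Beginner"},
    {"text": "notes on building a scalable pipeline", "category": "Blog/Article",
     "skill_tag": "General", "difficulty": "Advanced"},
]


def label_counts(rows, field):
    """Count how often each label appears in the given field."""
    return Counter(row[field] for row in rows)


for field in ("category", "difficulty"):
    print(field, dict(label_counts(annotated_demo, field)))
```

In the notebook itself, `label_counts(annotated, "category")` would give the same summary for the real rows. A tally dominated by the default labels would suggest the keyword lists need extending.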


#Annotated Files

In [13]:
# Download the annotated CSV file
from google.colab import files
files.download(csv_output)


In [15]:
# Download the annotated JSON file
from google.colab import files
files.download(json_output)
