<a href="https://colab.research.google.com/github/yourusername/yourrepo/blob/main/AI_Interviewer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI Interview Question Answer Dataset Creation

This notebook builds a dataset of technical interview questions and answers by:
1. Collecting links to Q&A articles from the README files of several GitHub repositories
2. Extracting Q&A pairs from those articles
3. Cleaning and merging the collected data
4. Tagging each question with a job role

## Installation of Required Libraries

In [None]:
!pip install pandas requests beautifulsoup4 scikit-learn numpy lxml

## Data Collection: Scraping Q&A from GitHub Repositories

This section collects links to interview Q&A articles from GitHub README files by:
1. Fetching the raw README of each listed repository
2. Extracting every URL that contains "interview", "question", or "answers"
3. Keeping only links from a list of preferred domains

In [None]:
import requests
import re

# List of GitHub raw markdown URLs containing interview questions
github_raw_urls = [
    "https://raw.githubusercontent.com/DopplerHQ/awesome-interview-questions/master/README.md",
    "https://raw.githubusercontent.com/bregman-arie/devops-interview-questions/master/README.md",
    "https://raw.githubusercontent.com/darshanjain-ml/Interview-Question-Data/main/README.md",
    "https://raw.githubusercontent.com/Ebazhanov/linkedin-skill-assessments-quizzes/master/README.md",
    "https://raw.githubusercontent.com/30-seconds/30-seconds-of-interviews/master/README.md"
]

# Keyword pattern that marks a URL as interview-related
interview_keyword_pattern = re.compile(r"interview|question|answers", re.IGNORECASE)
# Pattern that matches any URL up to whitespace or a closing bracket
url_pattern = re.compile(r"https?://[^\s\)\]]+")

all_links = set()

for url in github_raw_urls:
    print(f"🔍 Processing: {url}")
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        content = response.text

        # Keep every URL that contains an interview-related keyword
        for match in url_pattern.findall(content):
            if interview_keyword_pattern.search(match):
                all_links.add(match.strip().rstrip(").,"))
    except Exception as e:
        print(f"❌ Error reading {url}: {e}")

# Filter to keep only high-quality sources
preferred_domains = ["medium.com", "dev.to", "geeksforgeeks.org", "simplilearn.com", 
                    "codecademy.com", "roadmap.sh", "w3webschool.com"]
filtered_links = [link for link in all_links if any(domain in link for domain in preferred_domains)]

# Save extracted links to file
with open("github_extracted_qna_urls.txt", "w") as f:
    for link in sorted(filtered_links):
        f.write(link + "\n")

print(f"✅ Found {len(filtered_links)} filtered Q&A links.")
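The `domain in link` test above is a plain substring check, so it also accepts links where a preferred domain only appears in the path or query string. A minimal sketch of a stricter filter that compares the parsed host instead is shown below. The helper name `is_preferred` and the sample links are hypothetical and exist only for illustration.

```python
from urllib.parse import urlparse

# Same domain list as the notebook cell above
PREFERRED_DOMAINS = ["medium.com", "dev.to", "geeksforgeeks.org", "simplilearn.com",
                     "codecademy.com", "roadmap.sh", "w3webschool.com"]

def is_preferred(link, domains=PREFERRED_DOMAINS):
    """Return True when the link's host is a preferred domain or one of its subdomains."""
    host = (urlparse(link).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)

# Hypothetical sample links, for illustration only
sample_links = [
    "https://www.geeksforgeeks.org/python-interview-questions/",
    "https://example.com/interview?ref=medium.com",   # domain only in the query string
    "https://notdev.to/interview-questions",          # different host that ends in "dev.to"
]

kept = [link for link in sample_links if is_preferred(link)]
print(kept)
```

Only the first sample link passes this check, whereas the substring test would keep all three.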

## Q&A Extraction from Web Sources

This section scrapes actual question-answer pairs from the collected URLs by:
1. Identifying question elements (headings, bold text)
2. Extracting the subsequent answer text
3. Keeping only pairs whose answer is at least 20 characters long

In [None]:
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Function to scrape Q&A from a single URL
def scrape_qna(url):
    try:
        res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        res.raise_for_status()  # Skip pages that return an HTTP error status
        soup = BeautifulSoup(res.text, "html.parser")
    except Exception as e:
        print(f"❌ Failed to fetch {url}: {e}")
        return []

    qna_pairs = []
    # Look for questions in headings and bold text
    for tag in soup.find_all(['h2', 'h3', 'strong', 'b']):
        q = tag.get_text().strip()
        # Only consider text with question marks and minimum length
        if '?' in q and len(q.split()) >= 3:
            a_tag = tag.find_next('p')
            if a_tag:
                a = a_tag.get_text().strip()
                if len(a) >= 20:  # Minimum answer length
                    qna_pairs.append({
                        "question": q,
                        "answer": a,
                        "source": url
                    })
    return qna_pairs

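# Load the URLs saved by the link-collection step
with open("github_extracted_qna_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# Scrape all collected URLs
all_data = []
for url in urls: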
    print(f"🔍 Scraping: {url}")
    data = scrape_qna(url)
    all_data.extend(data)
    time.sleep(1)  # Be polite with requests

# Save results
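# Fix the column names so the CSV keeps its header even when nothing was scraped
df = pd.DataFrame(all_data, columns=["question", "answer", "source"])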
df.to_csv("scraped_qna.csv", index=False)
df.to_json("scraped_qna.json", orient="records", indent=2)

print(f"✅ Scraping complete. Total Q&A pairs: {len(all_data)}")
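The pairing heuristic (a heading or bold text with a question mark, followed by the next paragraph) can be tried offline on a small HTML string. The sketch below uses only the standard library's `html.parser` instead of BeautifulSoup, so it is a simplified approximation of `scrape_qna`, not a replacement for it. Unlike `scrape_qna`, it does not look inside a paragraph for bold questions. The class name `QnAParser` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

class QnAParser(HTMLParser):
    """Pair each question-like heading or bold text with the next paragraph."""

    QUESTION_TAGS = {"h2", "h3", "strong", "b"}

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open_tag = None    # tag whose text is currently being collected
        self._buffer = []        # text pieces of that tag
        self._pending_q = None   # question still waiting for an answer

    def handle_starttag(self, tag, attrs):
        # Start collecting only when no other tracked tag is open
        if self._open_tag is None and (tag in self.QUESTION_TAGS or tag == "p"):
            self._open_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._open_tag:
            return
        text = "".join(self._buffer).strip()
        self._open_tag = None
        if tag == "p":
            # Same quality rule as the notebook: answers of at least 20 characters
            if self._pending_q and len(text) >= 20:
                self.pairs.append({"question": self._pending_q, "answer": text})
            self._pending_q = None
        elif "?" in text and len(text.split()) >= 3:
            self._pending_q = text

# Hypothetical sample page, for illustration only
sample_html = """
<h2>What is a closure in JavaScript?</h2>
<p>A closure is a function that keeps access to variables from its defining scope.</p>
<h3>Related reading</h3>
<p>This paragraph follows a heading without a question mark.</p>
<h3>Why use it?</h3>
<p>Too short.</p>
"""

parser = QnAParser()
parser.feed(sample_html)
for pair in parser.pairs:
    print(pair["question"], "->", pair["answer"])
```

Only the first heading yields a pair: the second heading has no question mark, and the third is followed by an answer shorter than 20 characters.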

## Data Cleaning and Merging

This section processes the collected data by:
1. Cleaning question formatting
2. Removing duplicate questions
3. Filtering short answers
4. Merging with existing datasets

In [None]:
import pandas as pd

# Load and clean dataset
df = pd.read_csv("scraped_qna.csv")

# Clean question prefixes (Q1., 2., 3) etc.) before deduplicating,
# so that "Q1. What is X?" and "What is X?" count as the same question.
# A bare number must be followed by ".", ")" or ":" so that "2D arrays" stays intact.
df['question'] = df['question'].str.replace(
    r'^\s*(?:Q\s*\d+[.):]*|\d+[.):]+)\s*', '', regex=True)

# Remove duplicate questions
df = df.drop_duplicates(subset='question')

# Keep answers of at least 20 characters (same threshold as the scraper)
df = df[df['answer'].str.len() >= 20]

# Save cleaned data
df.to_csv("cleaned_dataset.csv", index=False)
df.to_json("cleaned_dataset.json", orient="records", indent=2)

# Merge with existing dataset if available
try:
    existing_df = pd.read_csv("combined_qna_dataset.csv")
    combined = pd.concat([df, existing_df], ignore_index=True)
    combined.drop_duplicates(subset="question", inplace=True)
    combined.to_csv("combined_qna_dataset.csv", index=False)
    print("✅ Successfully merged with existing dataset")
except FileNotFoundError:
    print("ℹ️ No existing dataset found - using only scraped data")
    df.to_csv("combined_qna_dataset.csv", index=False)
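Before running a prefix-stripping pattern on the whole dataset, it can help to check it on a few strings. The sketch below uses the standard `re` module and a pattern that requires a delimiter after a bare number, an assumption made so that questions starting with text such as "2D" are left intact. The sample questions are hypothetical.

```python
import re

# Assumed pattern: "Q" plus a number with an optional delimiter,
# or a bare number that must be followed by ".", ")" or ":"
PREFIX_PATTERN = re.compile(r'^\s*(?:Q\s*\d+[.):]*|\d+[.):]+)\s*')

def strip_prefix(question):
    """Remove a leading question number such as 'Q1.' or '2)'."""
    return PREFIX_PATTERN.sub('', question)

# Hypothetical sample questions, for illustration only
samples = [
    "Q1. What is REST?",
    "1) What is REST?",
    "  10. What is Docker?",
    "2D arrays: how are they stored?",
]

cleaned = [strip_prefix(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")

# Deduplicate after cleaning; dict.fromkeys keeps the first occurrence in order
unique = list(dict.fromkeys(cleaned))
print(unique)
```

The first two samples collapse to the same question once their prefixes are gone, which is why cleaning before deduplication catches more duplicates.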

## Role-Based Tagging

This section automatically categorizes questions by job role using keyword matching:

In [None]:
def guess_role(question):
    """Automatically tag questions with job roles based on keywords"""
    q = str(question).lower()
    
    role_keywords = {
        "Frontend Developer": ["frontend", "react", "html", "css", "javascript"],
        "Backend Developer": ["backend", "sql", "api", "database", "server"],
        "Cloud Architect": ["cloud", "aws", "azure", "gcp"],
        "Cybersecurity Analyst": ["cyber", "security", "encryption", "firewall"],
        "DevOps Engineer": ["devops", "ci/cd", "docker", "kubernetes"],
        "AI Engineer": ["ai", "ml", "model", "neural", "tensorflow"],
        "UX Developer": ["ux", "ui", "design", "user experience"],
        "Product Manager": ["product", "roadmap", "agile"],
        "Software Engineer": ["python", "java", "oop", "algorithm"]
    }
    
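    for role, keywords in role_keywords.items():
        # Match whole words (optionally plural) so that short keywords such as
        # "ai" or "ui" do not fire inside "explain" or "build".
        # Uses the `re` module imported in the first code cell.
        if any(re.search(rf"\b{re.escape(keyword)}s?\b", q) for keyword in keywords):
            return role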
    return "Other"

# Apply role tagging to the merged dataset
df = pd.read_csv("combined_qna_dataset.csv")
df['role'] = df['question'].apply(guess_role)

# Save tagged dataset
df.to_csv("auto_tagged_dataset.csv", index=False)
print(f"Role distribution:\n{df['role'].value_counts()}")
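Short keywords such as "ai", "ux", or "ui" also occur inside longer words, so it is worth checking how a keyword matcher treats them. The sketch below compares a plain substring test with whole-word matching on one sample question. Both helper functions and the sample question are hypothetical and serve only as an illustration.

```python
import re

def substring_match(keyword, text):
    """True when the keyword appears anywhere in the text, even inside another word."""
    return keyword in text.lower()

def whole_word_match(keyword, text):
    """True when the keyword (optionally plural) appears as a separate word."""
    return re.search(rf"\b{re.escape(keyword)}s?\b", text.lower()) is not None

# Hypothetical sample question, for illustration only
question = "Explain how to maintain a Linux build server"

for keyword in ["ai", "ux", "ui", "server"]:
    print(keyword,
          substring_match(keyword, question),
          whole_word_match(keyword, question))
```

With the substring test, "ai", "ux", and "ui" all match inside "explain", "linux", and "build", so this backend question would be tagged by whichever of those roles comes first in the dictionary. Whole-word matching leaves only "server".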

## Dataset Analysis

Basic analysis of the final dataset:

In [None]:
import pandas as pd

# Load final dataset
df = pd.read_csv("auto_tagged_dataset.csv")

# Basic info
print("\n🔍 Dataset Summary:")
df.info()  # info() prints its summary directly and returns None

# Role distribution
print("\n📊 Role Distribution:")
print(df['role'].value_counts())

# Sample questions
print("\n📝 Sample Questions:")
# Sample at most 5 rows so small datasets do not raise an error
print(df.sample(min(5, len(df)))[['question', 'role']])