# AI Interview Question Answer Dataset

This notebook builds a dataset of technical interview questions and answers by:
- Collecting links from the READMEs of GitHub interview-question repositories
- Extracting Q&A pairs from the linked technical articles
- Cleaning and deduplicating the collected data
- Adding role-based tagging

## Installation of Required Libraries

First, install the Python packages this notebook depends on:
- `pandas` for data manipulation
- `requests` and `beautifulsoup4` for web scraping
- `scikit-learn`, `numpy`, and `lxml` as supporting libraries (not imported directly by the cells below)

In [None]:
!pip install pandas requests beautifulsoup4 scikit-learn numpy lxml

## Step 1: Collect GitHub Interview URLs

We'll fetch the raw README of each of these GitHub repositories:
1. awesome-interview-questions
2. devops-interview-questions
3. Interview-Question-Data

The code below extracts every URL from the READMEs and keeps those whose text contains "interview", "question", or "answers".

In [None]:
import requests
import re

# List of GitHub repositories with interview questions
github_raw_urls = [
    "https://raw.githubusercontent.com/DopplerHQ/awesome-interview-questions/master/README.md",
    "https://raw.githubusercontent.com/bregman-arie/devops-interview-questions/master/README.md",
    "https://raw.githubusercontent.com/darshanjain-ml/Interview-Question-Data/main/README.md"
]

all_links = set()

for url in github_raw_urls:
    print(f"Processing: {url}")
    try:
        # Time out instead of hanging, and treat HTTP errors (e.g. 404) as failures
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        content = response.text

        # Keep URLs whose text mentions "interview", "question", or "answers"
        for match in re.findall(r'https?://[^\s\)\]]+', content):
            if re.search(r"(interview|question|answers)", match, re.IGNORECASE):
                all_links.add(match.strip().rstrip(").,"))
    except requests.RequestException as e:
        print(f"Error reading {url}: {e}")

print(f"Found {len(all_links)} Q&A links")
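
To see what the link filter keeps, the same two regular expressions can be run on a small in-memory string instead of a live README. This is a minimal sketch: `sample_readme` and `extract_links` are hypothetical names introduced here, and the URLs are `example.com` placeholders rather than real sources.

```python
import re

# Hypothetical README excerpt; the URLs are placeholders, not real sources
sample_readme = """
* [Python questions](https://example.com/python-interview-questions).
* [Project homepage](https://example.com/about)
* See https://example.com/sql-answers, and https://example.com/blog
"""

def extract_links(text):
    """Return the set of URLs in text that mention interview, question, or answers."""
    links = set()
    for match in re.findall(r'https?://[^\s\)\]]+', text):
        if re.search(r"(interview|question|answers)", match, re.IGNORECASE):
            # Trim punctuation that trails the URL in running text
            links.add(match.strip().rstrip(").,"))
    return links

links = extract_links(sample_readme)
for link in sorted(links):
    print(link)
```

Only the URL itself is tested against the keywords, so a link whose label says "questions" but whose address does not is dropped.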

## Step 2: Filter High-Quality Sources

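We'll filter the URLs to keep only links that point to these technical sites:
- medium.com
- dev.to
- geeksforgeeks.org
- simplilearn.com

A link is kept if any of these domain strings appears anywhere in it. Restricting the sources keeps the scraping targets predictable, though it does not guarantee the quality of individual pages.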

In [None]:
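preferred_domains = ["medium.com", "dev.to", "geeksforgeeks.org", "simplilearn.com"]

# Substring match: a link is kept if a preferred domain appears anywhere in it.
# Sorting makes the output order reproducible (all_links is a set).
filtered_links = sorted(
    link for link in all_links if any(domain in link for domain in preferred_domains)
)

# Save the filtered links
with open("qna_urls.txt", "w", encoding="utf-8") as f:
    for link in filtered_links:
        f.write(link + "\n")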

print(f"Kept {len(filtered_links)} high-quality links")
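
Substring matching also accepts links where a domain name only appears in the path or as part of a longer hostname. A stricter alternative is to compare the parsed hostname instead. The sketch below assumes that is the desired behaviour; `is_preferred` is a hypothetical helper and the candidate URLs are illustrative placeholders.

```python
from urllib.parse import urlparse

preferred_domains = ["medium.com", "dev.to", "geeksforgeeks.org", "simplilearn.com"]

def is_preferred(link, domains=preferred_domains):
    """True if the link's hostname is one of the domains or a subdomain of one."""
    host = (urlparse(link).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)

# Placeholder URLs chosen to show the difference from substring matching
candidates = [
    "https://www.geeksforgeeks.org/python-interview-questions/",
    "https://someblog.medium.com/sql-questions",
    "https://example.com/mirror/medium.com/interview-questions",
    "https://notdev.to/interview",
]

kept = [link for link in candidates if is_preferred(link)]
print(kept)
```

A substring check would keep all four candidates; the hostname check keeps only the first two.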

## Step 3: Scrape Q&A Pairs

Now we'll scrape questions and answers from each URL by:
1. Treating any `h2`, `h3`, `strong`, or `b` element whose text contains a `?` as a question
2. Taking the next `<p>` element in the page as its answer
3. Keeping the pair only if the answer is longer than 20 characters

In [None]:
from bs4 import BeautifulSoup
import pandas as pd
import time

def scrape_qna(url):
    try:
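        # Time out instead of hanging, and skip pages that return an HTTP error
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")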
        
        qna = []
        # Find all potential question elements
        for tag in soup.find_all(['h2', 'h3', 'strong', 'b']):
            question = tag.get_text().strip()
            if '?' in question:  # Likely a question
                answer_tag = tag.find_next('p')
                if answer_tag:
                    answer = answer_tag.get_text().strip()
                    if len(answer) > 20:  # Minimum answer length
                        qna.append({"question": question, "answer": answer, "source": url})
        return qna
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return []

# Scrape all filtered URLs
all_qna = []
for url in filtered_links:
    print(f"Scraping: {url}")
    all_qna.extend(scrape_qna(url))
    time.sleep(1)  # Be polite with requests

print(f"Scraped {len(all_qna)} Q&A pairs")
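
The heading-to-paragraph pairing can be checked offline by running the same logic on an inline HTML string. This is a minimal sketch: `SAMPLE_HTML` is invented for illustration, and `extract_pairs` is a hypothetical helper that mirrors the parsing part of `scrape_qna` without making a network request.

```python
from bs4 import BeautifulSoup

# Invented page fragment: one good pair, one non-question heading, one short answer
SAMPLE_HTML = """
<h2>What is a closure?</h2>
<p>A closure is a function that remembers variables from its enclosing scope.</p>
<h3>Table of contents</h3>
<p>This paragraph follows a heading that is not a question.</p>
<h2>What is REST?</h2>
<p>Too short.</p>
"""

def extract_pairs(html, min_answer_len=20):
    """Pair each question-like element with the next <p> that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for tag in soup.find_all(["h2", "h3", "strong", "b"]):
        question = tag.get_text().strip()
        if "?" not in question:
            continue  # not a question
        answer_tag = tag.find_next("p")
        if answer_tag is None:
            continue  # no paragraph after this element
        answer = answer_tag.get_text().strip()
        if len(answer) > min_answer_len:
            pairs.append((question, answer))
    return pairs

pairs = extract_pairs(SAMPLE_HTML)
for question, answer in pairs:
    print(question, "->", answer)
```

Because `find_next` returns the next `<p>` anywhere later in the document, a question with no paragraph of its own is paired with whatever paragraph comes next, so spot-checking the scraped pairs is worthwhile.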

## Step 4: Clean the Dataset

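Now we'll clean the data by:
1. Stripping numbering prefixes such as `Q1.` or `2.` from questions
2. Removing duplicate questions
3. Filtering short answers
4. Saving to CSV

In [None]:
# Name the columns explicitly so the steps below still work if nothing was scraped
df = pd.DataFrame(all_qna, columns=["question", "answer", "source"])

# Clean question prefixes (Q1., 2., 3) etc.). A separator is required so that
# questions that merely start with a digit ("3D ...") are left alone.
df['question'] = df['question'].str.replace(r'^\s*Q?\s*\d+\s*[.):]+(?!\d)\s*', '', regex=True)

# Remove duplicates after cleaning, so "Q1. What is X?" and "2. What is X?" collapse
df = df.drop_duplicates(subset="question")

# Remove short answers
df = df[df['answer'].str.len() > 20]

# Save cleaned data
df.to_csv("interview_questions.csv", index=False)
print(f"Final dataset has {len(df)} questions")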
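
A quick offline check shows why the prefix pattern should require a separator after the number. The sketch below compares a loose pattern, in which the dot is optional, with a stricter one that requires `.`, `)`, or `:`. The names `LOOSE` and `STRICT` and the sample questions are assumptions made for this illustration.

```python
import re

# Loose pattern: the dot is optional, so any leading digits are stripped
LOOSE = re.compile(r'^\s*(Q?\d+\.*)\s*')
# Stricter pattern: the number must be followed by ".", ")" or ":" (and not a digit)
STRICT = re.compile(r'^\s*Q?\s*\d+\s*[.):]+(?!\d)\s*')

samples = [
    "Q1. What is a closure?",
    "2) What is REST?",
    "3D rendering: how does it work?",
]

loose_results = [LOOSE.sub('', s) for s in samples]
strict_results = [STRICT.sub('', s) for s in samples]

for original, loose, strict in zip(samples, loose_results, strict_results):
    print(f"{original!r}: loose={loose!r}, strict={strict!r}")
```

The loose pattern turns "3D rendering" into "D rendering" and leaves the `)` in "2) What is REST?", while the stricter pattern handles both as intended.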

## Step 5: Add Role Tagging

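We'll categorize each question by job role using case-insensitive keyword matching. Rules are checked in this order and the first match wins:
- Frontend (React, JavaScript)
- Backend (SQL, database)
- DevOps (Docker, Kubernetes)
- Software Engineer (Python, Java)
- General (no keyword matched)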

In [None]:
def tag_role(question):
    """Categorize question by job role based on keywords"""
    q = question.lower()
    if 'react' in q or 'javascript' in q:
        return "Frontend"
    elif 'sql' in q or 'database' in q:
        return "Backend"
    elif 'docker' in q or 'kubernetes' in q:
        return "DevOps"
    elif 'python' in q or 'java' in q:
        return "Software Engineer"
    else:
        return "General"

# Apply tagging
df['role'] = df['question'].apply(tag_role)

# Save final dataset
df.to_csv("tagged_interview_questions.csv", index=False)
print("Role distribution:")
print(df['role'].value_counts())
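
Plain substring checks can misfire on short keywords (for example, "api" inside "rapid"). One way to reduce that is to match whole words only. The sketch below is an illustrative variant, not the tagging used above: `ROLE_KEYWORDS`, `compile_keywords`, and `tag_role_strict` are hypothetical names, and the extra keywords (`html`, `css`, `api`, `ci/cd`) are assumptions added for the example.

```python
import re

# Hypothetical keyword map; the first matching role wins, so order matters
ROLE_KEYWORDS = {
    "Frontend": ["react", "javascript", "html", "css"],
    "Backend": ["sql", "database", "api"],
    "DevOps": ["docker", "kubernetes", "ci/cd"],
    "Software Engineer": ["python", "java"],
}

def compile_keywords(words):
    """Build one case-insensitive pattern that matches any keyword as a whole word."""
    alternatives = "|".join(re.escape(w) for w in words)
    # (?<!\w) and (?!\w) act as word boundaries that also work for "ci/cd"
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)

ROLE_PATTERNS = {role: compile_keywords(words) for role, words in ROLE_KEYWORDS.items()}

def tag_role_strict(question):
    """Return the first role whose keywords appear as whole words, else 'General'."""
    for role, pattern in ROLE_PATTERNS.items():
        if pattern.search(question):
            return role
    return "General"

for q in [
    "How do you version a REST API?",
    "How would you build a rapid prototype?",
    "What is a CI/CD pipeline?",
]:
    print(q, "->", tag_role_strict(q))
```

Keeping the keywords in a dictionary also makes it easier to add roles without growing an `if`/`elif` chain.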

## Final Dataset Analysis

Let's examine our completed dataset:

In [None]:
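print("Dataset summary:")
df.info()  # info() prints its report directly and returns None

print("\nSample questions:")
# Sample at most 5 rows so this also works on a small dataset
print(df.sample(min(5, len(df)))[['question', 'role']])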