# Concept Extraction from Competitive Exam Questions

This notebook demonstrates a concept extraction system that combines RAKE (Rapid Automatic Keyword Extraction) with custom keyword dictionaries to identify the concepts tested by competitive exam questions.
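
To make the RAKE half of the approach concrete, here is a minimal standard-library sketch of RAKE-style scoring. It is not the code inside `ConceptExtractor`; the function names `candidate_phrases` and `rake_scores` and the tiny stopword list are assumptions made for illustration. Candidate phrases are runs of words between stopwords or punctuation, and each phrase is scored as the sum of degree/frequency over its words.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption); a real run would use a
# fuller list such as NLTK's English stopwords.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "was",
             "which", "what", "who", "by", "to", "for", "with", "following"}


def candidate_phrases(text):
    """Split text into runs of consecutive non-stopword words."""
    phrases = []
    # Punctuation also acts as a phrase boundary.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_scores(text):
    """Score each phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurring words in the phrase, including the word itself.
            degree[word] += len(phrase)
    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


print(rake_scores("Which of the following was the capital of the Mauryan Empire?"))
```

Longer phrases made of rarely repeated words score highest, which is why multi-word names tend to surface ahead of single generic words.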

## Project Features

- **Hybrid Concept Extraction**: Uses RAKE + Custom Dictionary
- **Cost-Effective**: No direct LLM API calls for core functionality
- **Future-Proof**: Includes an interface for adding LLM-based extraction later
- **Domain Adaptable**: Works across multiple subjects
- **CSV-based I/O**: Standard file formats for input and output

## 1. Setup and Imports

In [2]:
import pandas as pd
import nltk
import os

# Download necessary NLTK data
nltk_data_to_download = [
    ('tokenizers/punkt', 'punkt'),
    ('tokenizers/punkt_tab', 'punkt_tab'),
    ('corpora/stopwords', 'stopwords')
]

for data_path, package_name in nltk_data_to_download:
    try:
        nltk.data.find(data_path)
    except LookupError:
        print(f"Downloading {package_name}...")
        nltk.download(package_name)

# Import our custom modules
from csv_reader import read_questions_csv
from concept_extractor import ConceptExtractor
from simulated_llm import SimulatedLLM

print("Setup complete!")

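Note: the recorded run of this cell failed with `ModuleNotFoundError: No module named 'nltk'`. Install the dependencies first (for example, `pip install nltk pandas`), then re-run the cell.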

## 2. Load Sample Data

In [None]:
# Load ancient history questions
questions_file = "resources/ancient_history.csv"
questions_df = read_questions_csv(questions_file)

print(f"Loaded {len(questions_df)} questions")
print("\nSample questions:")
print(questions_df[['Question Number', 'Question']].head())
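
The cell above relies on the input CSV having at least the columns `Question Number` and `Question`. As a rough illustration of that layout, the sketch below validates and reads such a file using only the standard library. The helper `load_question_rows` and the sample rows are made up for this example; the notebook's own `read_questions_csv` returns a DataFrame and may behave differently.

```python
import csv
import io

# Column names taken from the notebook cells above.
REQUIRED_COLUMNS = ("Question Number", "Question")


def load_question_rows(handle):
    """Read question rows from an open CSV handle (hypothetical helper)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # Drop rows whose question text is empty or missing.
    return [row for row in reader if (row["Question"] or "").strip()]


# Made-up sample data, only to show the expected file shape.
sample_csv = io.StringIO(
    "Question Number,Question\n"
    '1,"Which ruler is associated with the rock edicts?"\n'
    '2,""\n'
)
rows = load_question_rows(sample_csv)
print(len(rows), rows[0]["Question Number"])
```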

## 3. Hybrid Concept Extraction (RAKE + Custom Dictionary)

In [None]:
# Initialize hybrid concept extractor
custom_dict_file = "dictionaries/ancient_history_concepts.csv"
hybrid_extractor = ConceptExtractor(custom_dict_file=custom_dict_file)

# Extract concepts from all questions
questions_with_concepts = hybrid_extractor.extract_concepts_from_dataframe(questions_df.copy())

print("Concept extraction complete!")
print("\nResults:")
for idx, row in questions_with_concepts.iterrows():
    print(f"\nQuestion {row['Question Number']}: {row['Question'][:100]}...")
    print(f"Concepts: {row['Concepts']}")

## 4. View Custom Dictionary

In [None]:
# Load and display the custom dictionary
custom_dict_df = pd.read_csv(custom_dict_file)
print("Custom Dictionary for Ancient History:")
print(custom_dict_df)
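
One way a custom dictionary could feed the hybrid extractor is sketched below. This is an assumption about the general idea, not the actual `ConceptExtractor` logic: the keyword-to-concept mapping `CUSTOM_DICT`, the helpers `dictionary_concepts` and `merge_concepts`, and the rule that dictionary hits come first are all hypothetical, and the real dictionary CSV may use a different layout.

```python
import re

# Hypothetical keyword -> concept mapping; the real dictionary file may differ.
CUSTOM_DICT = {
    "harappa": "Indus Valley Civilization",
    "mohenjo-daro": "Indus Valley Civilization",
    "ashoka": "Mauryan Empire",
}


def dictionary_concepts(question, mapping):
    """Return concepts whose keyword appears as a whole word in the question."""
    text = question.lower()
    found = []
    for keyword, concept in mapping.items():
        pattern = rf"\b{re.escape(keyword)}\b"
        if re.search(pattern, text) and concept not in found:
            found.append(concept)
    return found


def merge_concepts(dictionary_hits, rake_phrases, limit=5):
    """Put dictionary hits first, then add RAKE phrases not already present."""
    merged = list(dictionary_hits)
    seen = {concept.lower() for concept in merged}
    for phrase in rake_phrases:
        if phrase.lower() not in seen:
            merged.append(phrase)
            seen.add(phrase.lower())
    return merged[:limit]


hits = dictionary_concepts(
    "Harappa and Mohenjo-daro were major cities of which civilization?",
    CUSTOM_DICT,
)
print(merge_concepts(hits, ["major cities", "civilization"]))
```

Ranking curated dictionary concepts ahead of statistical phrases is a design choice in this sketch: it keeps domain terms from being pushed out by generic phrases when the result is truncated.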

## 5. Simulated LLM Extraction

In [None]:
# Test simulated LLM extraction
simulated_llm = SimulatedLLM()

print("Simulated LLM Concept Extraction:")
for idx, row in questions_df.iterrows():
    concepts = simulated_llm.extract_concepts(row['Question'])
    print(f"\nQuestion {row['Question Number']}: {row['Question'][:100]}...")
    print(f"LLM Concepts: {'; '.join(concepts)}")
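
The loop above only needs an object with an `extract_concepts(question)` method that returns a list of strings. A possible way to express that contract, so a real LLM client could later replace `SimulatedLLM`, is a `typing.Protocol`. The names `ConceptBackend`, `KeywordStubBackend`, and `describe` below are hypothetical and do not come from the project's modules.

```python
from typing import List, Protocol


class ConceptBackend(Protocol):
    """Anything that can turn a question into a list of concept strings."""

    def extract_concepts(self, question: str) -> List[str]:
        ...


class KeywordStubBackend:
    """Stand-in backend (an assumption for this sketch): substring keyword match."""

    def __init__(self, keywords: List[str]) -> None:
        self.keywords = [keyword.lower() for keyword in keywords]

    def extract_concepts(self, question: str) -> List[str]:
        text = question.lower()
        return [keyword for keyword in self.keywords if keyword in text]


def describe(backend: ConceptBackend, question: str) -> str:
    """Format concepts the same way the notebook prints them."""
    return "; ".join(backend.extract_concepts(question))


stub = KeywordStubBackend(["Ashoka", "edict"])
print(describe(stub, "Ashoka's rock edicts were written in which script?"))
```

Because the protocol is structural, any class with a matching `extract_concepts` method satisfies it without inheriting from `ConceptBackend`.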

## 6. Comparison: Hybrid vs Simulated LLM

In [None]:
# Create comparison DataFrame
comparison_df = questions_df[['Question Number', 'Question']].copy()
comparison_df['Hybrid_Concepts'] = questions_with_concepts['Concepts']
comparison_df['LLM_Concepts'] = comparison_df['Question'].apply(
    lambda q: '; '.join(simulated_llm.extract_concepts(q))
)

print("Comparison of Extraction Methods:")
for idx, row in comparison_df.iterrows():
    print(f"\n=== Question {row['Question Number']} ===")
    print(f"Question: {row['Question'][:150]}...")
    print(f"Hybrid: {row['Hybrid_Concepts']}")
    print(f"LLM: {row['LLM_Concepts']}")
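
Reading the two columns side by side can be supplemented with a simple overlap score. The sketch below computes Jaccard similarity between two concept strings, assuming both are `; `-separated as the `LLM_Concepts` column is; whether `Hybrid_Concepts` uses the same separator is an assumption, and `parse_concepts` and `jaccard` are hypothetical helpers.

```python
def parse_concepts(cell):
    """Split a '; '-separated concept string into a lowercase set."""
    if not isinstance(cell, str):
        return set()  # treat missing values as no concepts
    return {part.strip().lower() for part in cell.split(";") if part.strip()}


def jaccard(hybrid_cell, llm_cell):
    """Return |A ∩ B| / |A ∪ B| for the two concept sets (0.0 if both are empty)."""
    set_a = parse_concepts(hybrid_cell)
    set_b = parse_concepts(llm_cell)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


print(round(jaccard("Mauryan Empire; Ashoka", "ashoka; Buddhism"), 3))
```

With pandas this could be applied row by row, for example `comparison_df.apply(lambda r: jaccard(r['Hybrid_Concepts'], r['LLM_Concepts']), axis=1)`.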

## 7. Test Multiple Subjects

In [None]:
# Test with different subjects
subjects = ['ancient_history', 'economics', 'mathematics', 'physics']

for subject in subjects:
    # Use loop-local names so the paths defined in earlier cells are not overwritten
    subject_questions_file = f"resources/{subject}.csv"
    subject_dict_file = f"dictionaries/{subject}_concepts.csv"

    # Report missing inputs instead of skipping them silently
    if not os.path.exists(subject_questions_file):
        print(f"Skipping {subject}: {subject_questions_file} not found")
        continue

    try:
        df = read_questions_csv(subject_questions_file)
        if df.empty:
            print(f"Skipping {subject}: no questions loaded")
            continue

        extractor = ConceptExtractor(custom_dict_file=subject_dict_file)
        result_df = extractor.extract_concepts_from_dataframe(df.copy())

        print(f"\n=== {subject.upper()} ===")
        print(f"Questions processed: {len(result_df)}")

        # Show first question as example
        if len(result_df) > 0:
            first_row = result_df.iloc[0]
            print(f"Sample Question: {first_row['Question'][:100]}...")
            print(f"Sample Concepts: {first_row['Concepts']}")
    except Exception as e:
        print(f"Error processing {subject}: {e}")

## 8. Save Results

In [None]:
# Save the results to CSV
output_file = "notebook_output_concepts.csv"
output_df = questions_with_concepts[['Question Number', 'Question', 'Concepts']]
output_df.to_csv(output_file, index=False, encoding='utf-8')

print(f"Results saved to {output_file}")
print("\nSummary:")
print(f"Total questions processed: {len(output_df)}")
questions_with_concepts_count = int((output_df['Concepts'].str.len() > 0).sum())
print(f"Questions with extracted concepts: {questions_with_concepts_count}")
# Guard against division by zero when no questions were loaded
if len(output_df) > 0:
    print(f"Coverage: {questions_with_concepts_count/len(output_df)*100:.1f}%")

## Conclusion

This project demonstrates a hybrid approach to concept extraction that:

1. **Combines RAKE with custom dictionaries** for accurate, domain-specific extraction
2. **Avoids expensive LLM API calls** in the core extraction path
3. **Provides a framework for future LLM integration** when needed
4. **Works across multiple domains** (History, Economics, Mathematics, Physics)
5. **Offers both programmatic and interactive interfaces**

The system is cost-effective and adaptable, and it is aimed at educational and assessment applications.