# Concept Extraction from Competitive Exam Questions

This notebook demonstrates a concept extraction system that combines RAKE (Rapid Automatic Keyword Extraction) with custom keyword dictionaries to identify the concepts tested by competitive exam questions.

## Project Features

- **Hybrid Concept Extraction**: Uses RAKE + Custom Dictionary
- **Cost-Effective**: No direct LLM API calls for core functionality
- **Future-Proof**: Built with an interface for later LLM integration
- **Domain Adaptable**: Works across multiple subjects
- **CSV-based I/O**: Standard file formats for input and output
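
As a rough illustration of the hybrid idea, the sketch below scores candidate phrases with the standard RAKE word score (degree divided by frequency) and merges the top phrases with dictionary hits. It is not the project's `ConceptExtractor`: the function names, the tiny stopword list, and the `top_n` parameter are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "which", "was", "is", "what",
             "following", "with", "to", "in", "for"}

def candidate_phrases(text):
    """Split text into runs of non-stopword tokens (RAKE candidate phrases)."""
    phrases, current = [], []
    for token in re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower()):
        # Stopwords and punctuation both act as phrase delimiters
        if token in STOPWORDS or not token[0].isalnum():
            if current:
                phrases.append(current)
                current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)
    return phrases

def rake_scores(text):
    """Score each phrase as the sum of degree(word) / frequency(word)."""
    phrases = candidate_phrases(text)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}

def hybrid_concepts(text, keyword_to_concept, top_n=3):
    """Dictionary hits first, then the top-scoring RAKE phrases, de-duplicated."""
    lowered = text.lower()
    concepts = [c for k, c in keyword_to_concept.items() if k in lowered]
    ranked = sorted(rake_scores(text).items(), key=lambda kv: kv[1], reverse=True)
    concepts += [phrase.title() for phrase, _ in ranked[:top_n]]
    return list(dict.fromkeys(concepts))  # keeps first occurrence order

question = "Which of the following was a feature of the Harappan civilization?"
print(hybrid_concepts(question, {"harappan": "Indus Valley Civilization"}, top_n=1))
```

Putting dictionary matches ahead of RAKE phrases is a design choice in this sketch, made so that curated concepts are never pushed out by noisy phrases.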

## 1. Setup and Imports

In [1]:
import pandas as pd
import nltk
import os

# Download necessary NLTK data
nltk_data_to_download = [
    ('tokenizers/punkt', 'punkt'),
    ('tokenizers/punkt_tab', 'punkt_tab'),
    ('corpora/stopwords', 'stopwords')
]

for data_path, package_name in nltk_data_to_download:
    try:
        nltk.data.find(data_path)
    except LookupError:
        print(f"Downloading {package_name}...")
        nltk.download(package_name)

# Import our custom modules
from csv_reader import read_questions_csv
from concept_extractor import ConceptExtractor
from simulated_llm import SimulatedLLM

print("Setup complete!")

Setup complete!


## 2. Load Sample Data

In [2]:
# Load ancient history questions
questions_file = "resources/ancient_history.csv"
questions_df = read_questions_csv(questions_file)

print(f"Loaded {len(questions_df)} questions")
print("\nSample questions:")
print(questions_df[['Question Number', 'Question']].head())

Successfully loaded 5 questions from resources/ancient_history.csv
Loaded 5 questions

Sample questions:
   Question Number                                           Question
0                1  Which of the following was a feature of the Ha...
1                2  Consider the following pairs: Historical place...
2                3  In the context of the history of India, consid...
3                4  With reference to the scientific progress of A...
4                5  The term 'Jataka' is associated with which of ...


## 3. Hybrid Concept Extraction (RAKE + Custom Dictionary)

In [3]:
# Initialize hybrid concept extractor
custom_dict_file = "dictionaries/ancient_history_concepts.csv"
hybrid_extractor = ConceptExtractor(custom_dict_file=custom_dict_file)

# Extract concepts from all questions
questions_with_concepts = hybrid_extractor.extract_concepts_from_dataframe(questions_df.copy())

print("Concept extraction complete!")
print("\nResults:")
for idx, row in questions_with_concepts.iterrows():
    print(f"\nQuestion {row['Question Number']}: {row['Question'][:100]}...")
    print(f"Concepts: {row['Concepts']}")

INFO:concept_extractor:Starting concept extraction for 5 questions...
INFO:concept_extractor:Extraction complete. Statistics: {'total_questions': 5, 'concepts_extracted': 26, 'avg_concepts_per_question': 5.2}


Concept extraction complete!

Results:

Question 1: Which of the following was a feature of the Harappan civilization?...
Concepts: Indus Valley Civilization

Question 2: Consider the following pairs: Historical place – Well-known for\nBurzahom: Rock-cut shrines\nChandra...
Concepts: Historical Place – Well; Rock Art and Archaeology; Terracotta Art; Copper Age Technology; Ancient Art and Crafts

Question 3: In the context of the history of India, consider the following pairs:\nEripatti: Land revenue set as...
Concepts: Land Revenue Set Aside; Following Pairs :\ Neripatti; Land Revenue Systems; Village Institutions; Temple-based Education; Social Structure; Irrigation Systems; Ancient Education; Temple Architecture

Question 4: With reference to the scientific progress of Ancient India, which of the statements are correct?\nSu...
Concepts: Correct ?\ Nsurgical Instruments; Ancient Medicine; Ancient Mathematics; Ncyclic Quadrilateral Known; 3Rd Century Ad; 1St Century Ad; 1St Century; 3R
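
Fragments such as `Following Pairs :\ Neripatti` in the output above look like what would result if the question text contained the two characters backslash and `n` rather than a real line break. That is an inference from the printed output, not something the source confirms. Under that assumption, a small pre-processing step along the following lines might keep these fragments out of the RAKE phrases; `normalize_question_text` is a hypothetical helper, not part of the project.

```python
import re

def normalize_question_text(text: str) -> str:
    """Turn literal backslash-n / backslash-t sequences into real whitespace."""
    # "\\n" here is the two-character sequence backslash + n
    cleaned = text.replace("\\n", "\n").replace("\\t", " ")
    # Collapse runs of spaces and tabs, but keep line breaks as phrase boundaries
    cleaned = re.sub(r"[ \t]+", " ", cleaned)
    return "\n".join(line.strip() for line in cleaned.splitlines() if line.strip())

raw = "consider the following pairs:\\nEripatti: Land revenue set aside"
print(normalize_question_text(raw))
```

If the assumption holds, the helper could be applied before extraction with something like `questions_df['Question'].map(normalize_question_text)`.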

## 4. View Custom Dictionary

In [4]:
# Load and display the custom dictionary
custom_dict_df = pd.read_csv(custom_dict_file)
print("Custom Dictionary for Ancient History:")
print(custom_dict_df)

Custom Dictionary for Ancient History:
                 keyword                    concept
0               harappan  Indus Valley Civilization
1           indus valley  Indus Valley Civilization
2                mauryan             Mauryan Empire
3                 ashoka             Ashokan Edicts
4                  gupta               Gupta Period
5               eripatti       Land Revenue Systems
6               taniyurs       Village Institutions
7               ghatikas     Temple-based Education
8                 jataka                   Buddhism
9               buddhism                   Buddhism
10                 vedic               Vedic Period
11              burzahom   Rock Art and Archaeology
12       chandraketugarh             Terracotta Art
13             ganeshwar      Copper Age Technology
14            terracotta     Ancient Art and Crafts
15  surgical instruments           Ancient Medicine
16                  sine        Ancient Mathematics
17  cyclic quadrilateral 
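
The dictionary is a two-column CSV (`keyword`, `concept`), so a lookup can be sketched with the standard library alone. The example below is a minimal sketch, not the project's implementation: `load_keyword_map` and `match_concepts` are hypothetical names, and the whole-word matching rule is an assumption chosen so that a short keyword such as `sine` does not fire inside an unrelated word.

```python
import csv
import io
import re

def load_keyword_map(csv_file):
    """Read keyword,concept rows from a file-like object into a lowercase lookup dict."""
    reader = csv.DictReader(csv_file)
    return {row["keyword"].strip().lower(): row["concept"].strip() for row in reader}

def match_concepts(text, keyword_map):
    """Return concepts whose keyword occurs as a whole word or phrase, in order of appearance."""
    if not keyword_map:
        return []
    # Longest keywords first so multi-word entries win over shorter alternatives
    keywords = sorted(keyword_map, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE)
    found = []
    for match in pattern.finditer(text):
        concept = keyword_map[match.group(1).lower()]
        if concept not in found:
            found.append(concept)
    return found

# In-memory stand-in for a dictionary file, using rows shown in the output above
sample_csv = io.StringIO(
    "keyword,concept\n"
    "harappan,Indus Valley Civilization\n"
    "indus valley,Indus Valley Civilization\n"
    "sine,Ancient Mathematics\n"
)
keyword_map = load_keyword_map(sample_csv)
print(match_concepts("The Harappan sites of the Indus Valley", keyword_map))
print(match_concepts("A business question", keyword_map))  # "sine" inside "business" is ignored
```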

## 5. Simulated LLM Extraction

In [5]:
# Test simulated LLM extraction
simulated_llm = SimulatedLLM()

print("Simulated LLM Concept Extraction:")
for idx, row in questions_df.iterrows():
    concepts = simulated_llm.extract_concepts(row['Question'])
    print(f"\nQuestion {row['Question Number']}: {row['Question'][:100]}...")
    print(f"LLM Concepts: {'; '.join(concepts)}")

Simulated LLM Concept Extraction:

Question 1: Which of the following was a feature of the Harappan civilization?...
LLM Concepts: Indus Valley Civilization; Harappan Civilization; Mesopotamian Civilization

Question 2: Consider the following pairs: Historical place – Well-known for\nBurzahom: Rock-cut shrines\nChandra...
LLM Concepts: Academic Knowledge

Question 3: In the context of the history of India, consider the following pairs:\nEripatti: Land revenue set as...
LLM Concepts: Academic Knowledge

Question 4: With reference to the scientific progress of Ancient India, which of the statements are correct?\nSu...
LLM Concepts: Ancient Period

Question 5: The term 'Jataka' is associated with which of the following religions?...
LLM Concepts: Academic Knowledge
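
The cell above only relies on the backend exposing `extract_concepts(question)` and returning strings that can be joined. One way to make that contract explicit, so that a real LLM client could later be swapped in, is a `typing.Protocol`. The sketch below is an assumption about how such an interface might look: `ConceptBackend`, `KeywordRuleBackend`, and `extract_all` are hypothetical names, and the rule-based class is a stand-in rather than a reconstruction of `SimulatedLLM`.

```python
from typing import Dict, Iterable, List, Protocol

class ConceptBackend(Protocol):
    """Anything with extract_concepts(question) -> list of concept names."""
    def extract_concepts(self, question: str) -> List[str]:
        ...

class KeywordRuleBackend:
    """Hypothetical stand-in backend: maps trigger words to fixed concept labels."""
    def __init__(self, rules: Dict[str, List[str]], fallback: str = "Uncategorized"):
        self.rules = {trigger.lower(): labels for trigger, labels in rules.items()}
        self.fallback = fallback

    def extract_concepts(self, question: str) -> List[str]:
        lowered = question.lower()
        concepts: List[str] = []
        for trigger, labels in self.rules.items():
            if trigger in lowered:
                for label in labels:
                    if label not in concepts:
                        concepts.append(label)
        # Fall back to a generic label when no rule fires
        return concepts or [self.fallback]

def extract_all(questions: Iterable[str], backend: ConceptBackend) -> Dict[str, List[str]]:
    """Run any conforming backend over a batch of questions."""
    return {q: backend.extract_concepts(q) for q in questions}

backend = KeywordRuleBackend({"harappan": ["Indus Valley Civilization"]})
print(extract_all(["Which was a feature of the Harappan civilization?"], backend))
```

Because `extract_all` depends only on the protocol, a class that calls a hosted model could replace `KeywordRuleBackend` without changing the calling code.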


## 6. Comparison: Hybrid vs Simulated LLM

In [6]:
# Create comparison DataFrame
comparison_df = questions_df[['Question Number', 'Question']].copy()
comparison_df['Hybrid_Concepts'] = questions_with_concepts['Concepts']
comparison_df['LLM_Concepts'] = comparison_df['Question'].apply(
    lambda q: '; '.join(simulated_llm.extract_concepts(q))
)

print("Comparison of Extraction Methods:")
for idx, row in comparison_df.iterrows():
    print(f"\n=== Question {row['Question Number']} ===")
    print(f"Question: {row['Question'][:150]}...")
    print(f"Hybrid: {row['Hybrid_Concepts']}")
    print(f"LLM: {row['LLM_Concepts']}")

Comparison of Extraction Methods:

=== Question 1 ===
Question: Which of the following was a feature of the Harappan civilization?...
Hybrid: Indus Valley Civilization
LLM: Indus Valley Civilization; Harappan Civilization; Mesopotamian Civilization

=== Question 2 ===
Question: Consider the following pairs: Historical place – Well-known for\nBurzahom: Rock-cut shrines\nChandraketugarh: Terracotta art\nGaneshwar: Copper artefa...
Hybrid: Historical Place – Well; Rock Art and Archaeology; Terracotta Art; Copper Age Technology; Ancient Art and Crafts
LLM: Academic Knowledge

=== Question 3 ===
Question: In the context of the history of India, consider the following pairs:\nEripatti: Land revenue set aside for village tank\nTaniyurs: Villages donated t...
Hybrid: Land Revenue Set Aside; Following Pairs :\ Neripatti; Land Revenue Systems; Village Institutions; Temple-based Education; Social Structure; Irrigation Systems; Ancient Education; Temple Architecture
LLM: Academic Knowledge

=== Qu
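
Reading the two columns side by side works for five questions; for larger sets, a numeric overlap score may be easier to scan. The sketch below computes Jaccard similarity between two `; `-separated concept strings. The helper names are hypothetical, and case-insensitive exact matching is an assumption that treats near-synonyms as different concepts.

```python
def split_concepts(joined: str) -> set:
    """Split a '; '-separated concept string into a lowercase set."""
    return {c.strip().lower() for c in joined.split(";") if c.strip()}

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    set_a, set_b = split_concepts(a), split_concepts(b)
    if not set_a and not set_b:
        return 1.0  # two empty sets are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

hybrid = "Indus Valley Civilization"
llm = "Indus Valley Civilization; Harappan Civilization; Mesopotamian Civilization"
print(round(jaccard(hybrid, llm), 3))
```

With the comparison DataFrame above, the score could be added as a column with `comparison_df.apply(lambda r: jaccard(r['Hybrid_Concepts'], r['LLM_Concepts']), axis=1)`.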

## 7. Test Multiple Subjects

In [7]:
# Test with different subjects
subjects = ['ancient_history', 'economics', 'mathematics', 'physics']

for subject in subjects:
    try:
        questions_file = f"resources/{subject}.csv"
        custom_dict_file = f"dictionaries/{subject}_concepts.csv"
        
        if os.path.exists(questions_file):
            df = read_questions_csv(questions_file)
            if not df.empty:
                extractor = ConceptExtractor(custom_dict_file=custom_dict_file)
                result_df = extractor.extract_concepts_from_dataframe(df.copy())
                
                print(f"\n=== {subject.upper()} ===")
                print(f"Questions processed: {len(result_df)}")
                
                # Show first question as example
                if len(result_df) > 0:
                    first_row = result_df.iloc[0]
                    print(f"Sample Question: {first_row['Question'][:100]}...")
                    print(f"Sample Concepts: {first_row['Concepts']}")
    except Exception as e:
        print(f"Error processing {subject}: {e}")

INFO:concept_extractor:Starting concept extraction for 5 questions...
INFO:concept_extractor:Extraction complete. Statistics: {'total_questions': 5, 'concepts_extracted': 26, 'avg_concepts_per_question': 5.2}
INFO:concept_extractor:Starting concept extraction for 5 questions...
INFO:concept_extractor:Extraction complete. Statistics: {'total_questions': 5, 'concepts_extracted': 7, 'avg_concepts_per_question': 1.4}
INFO:concept_extractor:Starting concept extraction for 5 questions...
INFO:concept_extractor:Extraction complete. Statistics: {'total_questions': 5, 'concepts_extracted': 26, 'avg_concepts_per_question': 5.2}
INFO:concept_extractor:Starting concept extraction for 5 questions...
INFO:concept_extractor:Extraction complete. Statistics: {'total_questions': 5, 'concepts_extracted': 7, 'avg_concepts_per_question': 1.4}
INFO:concept_extractor:Starting concept extraction for 5 questions...
INFO:concept_extractor:Extraction complete. Statistics: {'total_questions': 5, 'concepts_extract

Successfully loaded 5 questions from resources/ancient_history.csv

=== ANCIENT_HISTORY ===
Questions processed: 5
Sample Question: Which of the following was a feature of the Harappan civilization?...
Sample Concepts: Indus Valley Civilization
Successfully loaded 5 questions from resources/economics.csv

=== ECONOMICS ===
Questions processed: 5
Sample Question: What is the primary objective of monetary policy?...
Sample Concepts: Monetary Policy
Successfully loaded 5 questions from resources/mathematics.csv

=== MATHEMATICS ===
Questions processed: 5
Sample Question: What is the derivative of sin(x)?...
Sample Concepts: Differential Calculus; Trigonometric Functions; Derivative Of Sin
Successfully loaded 5 questions from resources/physics.csv

=== PHYSICS ===
Questions processed: 5
Sample Question: What is Newton's first law of motion?...
Sample Concepts: Laws of Motion; Classical Mechanics; Kinematics and Dynamics; Law Of Motion; First Law; Thermodynamics


## 8. Save Results

In [8]:
# Save the results to CSV
output_file = "notebook_output_concepts.csv"
output_df = questions_with_concepts[['Question Number', 'Question', 'Concepts']]
output_df.to_csv(output_file, index=False, encoding='utf-8')

print(f"Results saved to {output_file}")
print("\nSummary:")
print(f"Total questions processed: {len(output_df)}")
questions_with_concepts_count = len(output_df[output_df['Concepts'].str.len() > 0])
print(f"Questions with extracted concepts: {questions_with_concepts_count}")
print(f"Coverage: {questions_with_concepts_count/len(output_df)*100:.1f}%")

Results saved to notebook_output_concepts.csv

Summary:
Total questions processed: 5
Questions with extracted concepts: 5
Coverage: 100.0%
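
Coverage only says whether a question received any concept at all. A frequency count over the saved `Concepts` column gives a quick view of which concepts dominate. The sketch below assumes the `; ` separator seen in the outputs above; `concept_frequencies` is a hypothetical helper.

```python
from collections import Counter

def concept_frequencies(concept_cells):
    """Count how often each concept appears across '; '-separated cells."""
    counts = Counter()
    for cell in concept_cells:
        if not isinstance(cell, str):
            continue  # skip missing values such as None or NaN
        counts.update(c.strip() for c in cell.split(";") if c.strip())
    return counts

# Illustrative cells, not the notebook's actual output
cells = ["Indus Valley Civilization", "Buddhism; Vedic Period", "Buddhism", None]
for concept, count in concept_frequencies(cells).most_common():
    print(f"{concept}: {count}")
```

In the notebook this could be called as `concept_frequencies(output_df['Concepts'])`, since iterating a pandas Series yields its values.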


## Conclusion

This project demonstrates a hybrid approach to concept extraction that:

1. **Combines RAKE with custom dictionaries** for accurate, domain-specific extraction
2. **Avoids expensive LLM API calls** while maintaining good performance
3. **Provides a framework for future LLM integration** when needed
4. **Works across multiple domains** (History, Economics, Mathematics, Physics)
5. **Offers both programmatic and interactive interfaces**

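The system is cost-effective and adaptable across subjects. The sample outputs above still contain some noisy RAKE phrases (for example, `Following Pairs :\ Neripatti`), so extracted concepts should be reviewed before the system is used in educational and assessment applications.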