# NLP-Based Multiple Choice Question Generator

## Project Overview
This project generates multiple choice questions (MCQs) from PDF documents using:
- **Natural Language Processing (NLP)** techniques: tokenization, stemming, lemmatization
- **Google Gemini API** for intelligent question generation
- **Text analysis** features for keyword extraction and concept identification

### Key NLP Features Used:
1. **Tokenization**: Breaking text into sentences and words
2. **Stemming**: Reducing words to a common stem using the Porter stemmer
3. **Lemmatization**: Converting words to their base/dictionary form
4. **POS Tagging**: Identifying parts of speech
5. **Named Entity Recognition**: Identifying important entities
6. **TF-IDF**: Extracting important keywords and concepts
7. **Stop Word Removal**: Filtering out common words
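
As a rough illustration of features 1, 6, and 7, the sketch below tokenizes a few sentences, drops stop words, and ranks terms by mean TF-IDF using only the standard library. It is not the notebook's implementation, which relies on NLTK and scikit-learn's `TfidfVectorizer` (whose IDF smoothing and normalization differ). The tiny stop-word set, the regex tokenizer, and all function names here are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; NLTK's English list is far longer)
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on", "by"}

def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

def remove_stop_words(tokens):
    """Drop stop words and very short tokens, mirroring the len(word) > 2 filter."""
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]

def mean_tfidf(sentences):
    """Treat each sentence as a document and return {term: mean TF-IDF score}."""
    docs = [remove_stop_words(tokenize(s)) for s in sentences]
    n_docs = len(docs)
    # Document frequency: in how many sentences does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    totals = Counter()
    for doc in docs:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term]) + 1.0  # +1 keeps ubiquitous terms non-zero
            totals[term] += tf * idf
    return {term: score / n_docs for term, score in totals.items()}

def top_keywords(sentences, top_k=3):
    """Return the top_k terms by mean TF-IDF, ties broken alphabetically."""
    scores = mean_tfidf(sentences)
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

sample = [
    "The mapper reads the input data.",
    "The reducer aggregates the mapper output.",
    "Shuffle moves mapper output to the reducer.",
]
print(top_keywords(sample))
```

Terms that recur across sentences ("mapper", "output", "reducer") accumulate the highest mean score here, which is the same averaging idea `extract_keywords` applies to the TF-IDF matrix.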

### Subject: Natural Language Processing


## Step 1: Install Required Dependencies

In [None]:
# Install required packages
!pip install PyPDF2 google-generativeai nltk spacy scikit-learn
!python -m spacy download en_core_web_sm

print("All dependencies installed successfully!")

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 63.7 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
All dependencies installed successfully!
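
Because the install cell prints its success message unconditionally, it can help to confirm that each package is actually importable, especially after the restart the output recommends. The following is a minimal sketch using only the standard library; `is_available` and `REQUIRED` are hypothetical names, and the list uses import names, which differ from some pip package names.

```python
import importlib.util

def is_available(module_name):
    """Return True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or a module has no spec
        return False

# Import names, not pip names (e.g. scikit-learn is imported as sklearn)
REQUIRED = ["PyPDF2", "google.generativeai", "nltk", "spacy", "sklearn"]

for name in REQUIRED:
    status = "ok" if is_available(name) else "MISSING"
    print(f"{name}: {status}")
```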


## Step 2: Import Libraries and MCQ Generator Code

In [None]:
# Import all required libraries
import re
import json
import random
import warnings
from typing import List, Dict, Tuple, Optional
from collections import Counter
from io import BytesIO

import numpy as np
import pandas as pd
import PyPDF2

import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.chunk import ne_chunk
from nltk.tag import pos_tag
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import google.generativeai as genai
from google.colab import files

warnings.filterwarnings('ignore')
print("Libraries imported successfully!")

Libraries imported successfully!


In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

## Step 3: MCQ Generator Class Definition

In [None]:
class MCQGenerator:
    """
    A comprehensive Multiple Choice Question Generator using NLP and GenAI
    """

    def __init__(self, gemini_api_key: str):
        """
        Initialize the MCQ Generator with Gemini API key

        Args:
            gemini_api_key (str): Google Gemini API key
        """
        # Initialize Gemini API
        genai.configure(api_key=gemini_api_key)
        self.model = genai.GenerativeModel('gemini-2.5-flash')

        # Initialize NLP tools
        self.stemmer = PorterStemmer()
        self.lemmatizer = WordNetLemmatizer()

        # Download required NLTK data
        self._download_nltk_data()

        # Load spaCy model
        try:
            self.nlp = spacy.load("en_core_web_sm")
        except OSError:
            print("SpaCy model not found. Please install with: python -m spacy download en_core_web_sm")
            self.nlp = None

        # Initialize stop words
        self.stop_words = set(stopwords.words('english'))

        print("MCQ Generator initialized successfully!")

    def _download_nltk_data(self):
        """Download required NLTK data"""
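        nltk_downloads = [
            'punkt', 'punkt_tab', 'stopwords', 'wordnet',
            'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng',
            'maxent_ne_chunker', 'words', 'omw-1.4'
        ]

        for item in nltk_downloads:
            try:
                nltk.download(item, quiet=True)
            except Exception as e:
                # Non-fatal: the resource may already be installed or the network unavailable
                print(f"Could not download NLTK resource '{item}': {e}")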

    def extract_text_from_pdf(self, pdf_file) -> str:
        """
        Extract text from PDF file

        Args:
            pdf_file: PDF file object or path

        Returns:
            str: Extracted text from PDF
        """
        try:
            if isinstance(pdf_file, str):
                # If it's a file path, read every page before the file is closed
                with open(pdf_file, 'rb') as file:
                    pdf_reader = PyPDF2.PdfReader(file)
                    pages = [page.extract_text() or "" for page in pdf_reader.pages]
            else:
                # If it's a file object (for Colab uploads)
                pdf_reader = PyPDF2.PdfReader(pdf_file)
                pages = [page.extract_text() or "" for page in pdf_reader.pages]

            # Terminate each page with a newline; the `or ""` guard above covers
            # pages for which no text could be extracted (e.g. scanned images)
            text = "".join(page_text + "\n" for page_text in pages)

            print(f"Successfully extracted {len(text)} characters from PDF")
            return text

        except Exception as e:
            print(f"Error extracting text from PDF: {str(e)}")
            return ""

    def preprocess_text(self, text: str) -> Dict:
        """
        Comprehensive text preprocessing using NLP techniques

        Args:
            text (str): Input text

        Returns:
            Dict: Preprocessed text data
        """
        # Clean text
        clean_text = re.sub(r'[^\w\s\.\?\!]', ' ', text)
        clean_text = re.sub(r'\s+', ' ', clean_text).strip()

        # Tokenization
        sentences = sent_tokenize(clean_text)
        words = word_tokenize(clean_text.lower())

        # Remove stop words
        words_no_stop = [word for word in words if word not in self.stop_words and len(word) > 2]

        # Stemming
        stemmed_words = [self.stemmer.stem(word) for word in words_no_stop]

        # Lemmatization
        lemmatized_words = [self.lemmatizer.lemmatize(word) for word in words_no_stop]

        # POS tagging
        pos_tags = pos_tag(words_no_stop)

        # Named Entity Recognition (if spaCy is available)
        named_entities = []
        if self.nlp:
            doc = self.nlp(clean_text)
            named_entities = [(ent.text, ent.label_) for ent in doc.ents]

        # Extract nouns and verbs (important concepts)
        important_words = [word for word, pos in pos_tags if pos.startswith(('NN', 'VB'))]

        preprocessed_data = {
            'original_text': text,
            'clean_text': clean_text,
            'sentences': sentences,
            'words': words,
            'words_no_stop': words_no_stop,
            'stemmed_words': stemmed_words,
            'lemmatized_words': lemmatized_words,
            'pos_tags': pos_tags,
            'named_entities': named_entities,
            'important_words': important_words
        }

        print(f"Text preprocessing completed:")
        print(f"- Sentences: {len(sentences)}")
        print(f"- Words (no stop words): {len(words_no_stop)}")
        print(f"- Named entities: {len(named_entities)}")

        return preprocessed_data

    def extract_keywords(self, preprocessed_data: Dict, top_k: int = 20) -> List[str]:
        """
        Extract important keywords using TF-IDF

        Args:
            preprocessed_data (Dict): Preprocessed text data
            top_k (int): Number of top keywords to return

        Returns:
            List[str]: List of important keywords
        """
        sentences = preprocessed_data['sentences']

        # Use TF-IDF to find important terms
        vectorizer = TfidfVectorizer(
            max_features=1000,
            ngram_range=(1, 2),
            stop_words='english'
        )

        tfidf_matrix = vectorizer.fit_transform(sentences)
        feature_names = vectorizer.get_feature_names_out()

        # Get average TF-IDF scores
        mean_scores = np.mean(tfidf_matrix.toarray(), axis=0)

        # Get top keywords
        top_indices = mean_scores.argsort()[-top_k:][::-1]
        keywords = [feature_names[i] for i in top_indices]

        print(f"Extracted {len(keywords)} keywords: {keywords[:10]}...")
        return keywords

    def identify_key_concepts(self, preprocessed_data: Dict) -> List[str]:
        """
        Identify key concepts from the text

        Args:
            preprocessed_data (Dict): Preprocessed text data

        Returns:
            List[str]: List of key concepts
        """
        concepts = []

        # Add named entities
        entities = [entity[0] for entity in preprocessed_data['named_entities']]
        concepts.extend(entities)

        # Add frequent important words
        important_words = preprocessed_data['important_words']
        word_freq = Counter(important_words)
        frequent_concepts = [word for word, freq in word_freq.most_common(15)]
        concepts.extend(frequent_concepts)

        # Remove duplicates and filter
        concepts = list(set(concepts))
        concepts = [concept for concept in concepts if len(concept) > 3]

        print(f"Identified {len(concepts)} key concepts")
        return concepts

    def generate_mcq_with_gemini(self, text_chunk: str, num_questions: int = 1) -> List[Dict]:
        """
        Generate MCQ questions using Google Gemini

        Args:
            text_chunk (str): Text chunk to generate questions from
            num_questions (int): Number of questions to generate

        Returns:
            List[Dict]: List of generated MCQ questions
        """
        prompt = f"""
        Based on the following text, generate {num_questions} multiple choice question(s) with 4 options each.
        The questions should test comprehension and understanding of key concepts.

        Format the response as a JSON array where each question has:
        - "question": the question text
        - "options": array of 4 options (A, B, C, D)
        - "correct_answer": the correct option letter (A, B, C, or D)
        - "explanation": brief explanation of why the answer is correct

        Text:
        {text_chunk}

        Provide only the JSON array response, no additional text.
        """

        try:
            response = self.model.generate_content(prompt)
            response_text = response.text

            # Clean the response to extract JSON
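            response_text = response_text.strip()
            if response_text.startswith('```json'):
                response_text = response_text[7:]
            elif response_text.startswith('```'):
                # Fence without a language tag
                response_text = response_text[3:]
            if response_text.endswith('```'):
                response_text = response_text[:-3]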

            questions = json.loads(response_text)
            return questions if isinstance(questions, list) else [questions]

        except Exception as e:
            print(f"Error generating questions with Gemini: {str(e)}")
            return []

    def select_best_chunks(self, sentences: List[str], chunk_size: int = 3, num_chunks: int = 5) -> List[str]:
        """
        Select the best text chunks for question generation

        Args:
            sentences (List[str]): List of sentences
            chunk_size (int): Size of each chunk in sentences
            num_chunks (int): Number of chunks to select

        Returns:
            List[str]: Selected text chunks
        """
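        # Create overlapping chunks; the step is at least 1 so that
        # chunk_size=1 does not make range() raise ValueError
        chunks = []
        step = max(1, chunk_size // 2)
        for i in range(0, len(sentences) - chunk_size + 1, step):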
            chunk = ' '.join(sentences[i:i + chunk_size])
            if len(chunk.strip()) > 100:  # Minimum chunk length
                chunks.append(chunk)

        if len(chunks) <= num_chunks:
            return chunks

        # Use TF-IDF to score chunks
        vectorizer = TfidfVectorizer(stop_words='english')
        tfidf_matrix = vectorizer.fit_transform(chunks)

        # Calculate chunk importance (sum of TF-IDF scores)
        chunk_scores = np.sum(tfidf_matrix.toarray(), axis=1)

        # Select top chunks
        top_indices = chunk_scores.argsort()[-num_chunks:][::-1]
        selected_chunks = [chunks[i] for i in top_indices]

        print(f"Selected {len(selected_chunks)} chunks for question generation")
        return selected_chunks

    def generate_mcq_questions(self, pdf_file, num_questions: int = 10) -> List[Dict]:
        """
        Main function to generate MCQ questions from PDF

        Args:
            pdf_file: PDF file object or path
            num_questions (int): Number of questions to generate

        Returns:
            List[Dict]: Generated MCQ questions
        """
        print("Starting MCQ generation process...")

        # Step 1: Extract text from PDF
        text = self.extract_text_from_pdf(pdf_file)
        if not text:
            return []

        # Step 2: Preprocess text
        preprocessed_data = self.preprocess_text(text)

        # Step 3: Extract keywords and concepts
        keywords = self.extract_keywords(preprocessed_data)
        concepts = self.identify_key_concepts(preprocessed_data)

        # Step 4: Select best text chunks
        sentences = preprocessed_data['sentences']
        selected_chunks = self.select_best_chunks(sentences, num_chunks=min(num_questions, 8))

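        # Step 5: Generate questions
        all_questions = []
        if not selected_chunks:
            # Avoid ZeroDivisionError when the text is too short to form any chunk
            print("No text chunks long enough for question generation.")
            return []
        questions_per_chunk = max(1, num_questions // len(selected_chunks))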

        for i, chunk in enumerate(selected_chunks):
            if len(all_questions) >= num_questions:
                break

            questions_needed = min(questions_per_chunk, num_questions - len(all_questions))
            chunk_questions = self.generate_mcq_with_gemini(chunk, questions_needed)

            for question in chunk_questions:
                question['source_chunk'] = f"Chunk {i+1}"
                question['keywords'] = keywords[:5]  # Add relevant keywords

            all_questions.extend(chunk_questions)
            print(f"Generated {len(chunk_questions)} questions from chunk {i+1}")

        # Shuffle questions
        random.shuffle(all_questions)

        print(f"Successfully generated {len(all_questions)} MCQ questions!")
        return all_questions[:num_questions]

    def format_questions_for_display(self, questions: List[Dict]) -> str:
        """
        Format questions for display

        Args:
            questions (List[Dict]): List of questions

        Returns:
            str: Formatted questions
        """
        formatted_output = "="*60 + "\n"
        formatted_output += "MULTIPLE CHOICE QUESTIONS\n"
        formatted_output += "="*60 + "\n\n"

        for i, q in enumerate(questions, 1):
            formatted_output += f"Question {i}:\n"
            formatted_output += f"{q['question']}\n\n"

            for j, option in enumerate(q['options']):
                letter = chr(65 + j)  # A, B, C, D
                formatted_output += f"{letter}) {option}\n"

            formatted_output += f"\nCorrect Answer: {q['correct_answer']}\n"
            formatted_output += f"Explanation: {q['explanation']}\n"

            if 'keywords' in q:
                formatted_output += f"Related Keywords: {', '.join(q['keywords'])}\n"

            formatted_output += "\n" + "-"*40 + "\n\n"

        return formatted_output

    def demonstrate_nlp_features(self, text: str):
        """
        Demonstrate NLP features used in the project

        Args:
            text (str): Sample text to analyze
        """
        print("=== NLP FEATURES DEMONSTRATION ===")
        print(f"Original text: {text[:200]}...\n")

        # Tokenization
        sentences = sent_tokenize(text)
        words = word_tokenize(text.lower())
        print(f"1. TOKENIZATION:")
        print(f"   - Sentences: {len(sentences)}")
        print(f"   - Words: {len(words)}")
        print(f"   - First 10 words: {words[:10]}\n")

        # Stop word removal
        words_no_stop = [word for word in words if word not in self.stop_words and len(word) > 2]
        print(f"2. STOP WORD REMOVAL:")
        print(f"   - Words after removal: {len(words_no_stop)}")
        print(f"   - Sample: {words_no_stop[:10]}\n")

        # Stemming
        stemmed_words = [self.stemmer.stem(word) for word in words_no_stop[:10]]
        print(f"3. STEMMING (Porter Stemmer):")
        print(f"   - Original: {words_no_stop[:10]}")
        print(f"   - Stemmed:  {stemmed_words}\n")

        # Lemmatization
        lemmatized_words = [self.lemmatizer.lemmatize(word) for word in words_no_stop[:10]]
        print(f"4. LEMMATIZATION:")
        print(f"   - Original:     {words_no_stop[:10]}")
        print(f"   - Lemmatized:   {lemmatized_words}\n")

        # POS Tagging
        pos_tags = pos_tag(words_no_stop[:10])
        print(f"5. POS TAGGING:")
        print(f"   - Tags: {pos_tags}\n")

        # Named Entity Recognition
        if self.nlp:
            doc = self.nlp(text[:500])  # Limit text for demo
            entities = [(ent.text, ent.label_) for ent in doc.ents]
            print(f"6. NAMED ENTITY RECOGNITION:")
            print(f"   - Entities found: {entities[:10]}\n")

        # TF-IDF Keywords
        preprocessed_data = self.preprocess_text(text)
        keywords = self.extract_keywords(preprocessed_data, top_k=10)
        print(f"7. TF-IDF KEYWORD EXTRACTION:")
        print(f"   - Top keywords: {keywords}\n")
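
`select_best_chunks` builds overlapping windows of sentences by stepping through the sentence list in increments of half the chunk size. The sketch below isolates just that windowing step, without the minimum-length filter or the TF-IDF ranking, to make the overlap visible. `sliding_chunks` is a hypothetical helper name, not part of the class.

```python
from typing import List

def sliding_chunks(sentences: List[str], chunk_size: int = 3) -> List[str]:
    """Join consecutive sentences into overlapping chunks of chunk_size sentences."""
    step = max(1, chunk_size // 2)  # never 0, so range() stays valid for chunk_size=1
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences) - chunk_size + 1, step)
    ]

sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
print(sliding_chunks(sentences))
# ['S1. S2. S3.', 'S2. S3. S4.', 'S3. S4. S5.']
```

With the default `chunk_size=3` the step is 1, so neighbouring chunks share two of their three sentences. That overlap is worth keeping in mind when several top-ranked chunks yield near-duplicate questions.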

## Step 4: Configuration and Initialization

In [None]:
GEMINI_API_KEY = "YOUR_GEMINI_API_KEY_HERE"  # Replace with your own Gemini API key

# Initialize the MCQ Generator
if GEMINI_API_KEY == "YOUR_GEMINI_API_KEY_HERE":
    print("WARNING: Please replace GEMINI_API_KEY with your actual API key!")
else:
    generator = MCQGenerator(GEMINI_API_KEY)

MCQ Generator initialized successfully!
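
The cell above keeps the key as a literal in the notebook. One common alternative is to read it from an environment variable so the key never lands in a shared or committed notebook. The sketch below assumes a variable named `GEMINI_API_KEY`; that name and the `load_api_key` helper are illustrative choices, not something the Gemini client requires.

```python
import os
from typing import Optional

PLACEHOLDER = "YOUR_GEMINI_API_KEY_HERE"

def load_api_key(env_var: str = "GEMINI_API_KEY") -> Optional[str]:
    """Return the key from the environment, or None if unset or still the placeholder."""
    key = os.environ.get(env_var, "").strip()
    if not key or key == PLACEHOLDER:
        return None
    return key

api_key = load_api_key()
if api_key is None:
    print("Set the GEMINI_API_KEY environment variable before creating the generator.")
else:
    print("API key loaded from environment.")
```

The returned value can then be passed to `MCQGenerator(api_key)` in place of the hard-coded constant.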


## Step 5: Upload and Process PDF File

In [None]:
# Upload PDF file
print("Please upload a PDF file:")
uploaded_files = files.upload()

# Get the uploaded file
if uploaded_files:
    filename = list(uploaded_files.keys())[0]
    print(f"Uploaded file: {filename}")

    # Create file object for processing
    pdf_content = uploaded_files[filename]
    pdf_file = BytesIO(pdf_content)

    print(f"File ready for processing!")
else:
    print("No file uploaded. Please upload a PDF to proceed.")
    pdf_file = None # Ensure pdf_file is None if no file uploaded

Please upload a PDF file:


Saving 2.2_Map Reduce.pdf to 2.2_Map Reduce.pdf
Uploaded file: 2.2_Map Reduce.pdf
File ready for processing!
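
The upload cell accepts whatever file is chosen, so a non-PDF only fails later inside `extract_text_from_pdf`. A cheap early check is to look at the first bytes, since PDF files begin with the marker `%PDF-`. The helper name `looks_like_pdf` is hypothetical, and the check is deliberately strict (it does not tolerate leading junk bytes).

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the byte string starts with the PDF header marker."""
    return data.startswith(b"%PDF-")

# Example with in-memory bytes standing in for an uploaded file
fake_pdf = b"%PDF-1.7\n% minimal header only, not a complete document\n"
fake_zip = b"PK\x03\x04 not a pdf"

print(looks_like_pdf(fake_pdf))  # True
print(looks_like_pdf(fake_zip))  # False
```

In the notebook this could be applied to `pdf_content` before wrapping it in `BytesIO`.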


## Step 7: Generate MCQ Questions

In [None]:
# Set parameters
NUM_QUESTIONS = 5  # Adjust as needed

if GEMINI_API_KEY != "YOUR_GEMINI_API_KEY_HERE":
    if pdf_file:
        # Generate questions from uploaded PDF
        print(f"Generating {NUM_QUESTIONS} MCQ questions from uploaded PDF...")
        questions = generator.generate_mcq_questions(pdf_file, num_questions=NUM_QUESTIONS)
    else:
        # Handle the case where no file was uploaded
        print("No PDF file uploaded. Please upload a PDF in the previous step.")
        questions = [] # Ensure questions is empty if no file

    if questions:
        print(f"\nSuccessfully generated {len(questions)} questions!")
    else:
        print("Failed to generate questions or no PDF was uploaded. Check your API key and try again.")
else:
    print("Please set your Gemini API key first!")
    questions = []

Generating 5 MCQ questions from uploaded PDF...
Starting MCQ generation process...
Successfully extracted 3320 characters from PDF
Text preprocessing completed:
- Sentences: 31
- Words (no stop words): 318
- Named entities: 35
Extracted 20 keywords: ['mapreduce', 'data', 'map', 'stage', 'shuffle', 'phase', 'mapper', 'output', 'reduce', 'key']...
Identified 31 key concepts
Selected 5 chunks for question generation
Generated 1 questions from chunk 1
Error generating questions with Gemini: Expecting value: line 10 column 5 (char 367)
Generated 0 questions from chunk 2
Generated 1 questions from chunk 3
Generated 1 questions from chunk 4
Generated 1 questions from chunk 5
Successfully generated 4 MCQ questions!

Successfully generated 4 questions!
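
The log above shows one chunk lost to a JSON parsing error, which is why only 4 of the 5 requested questions were produced. The exact cause of that failure is not visible in the output, so the sketch below is only one possible hardening step: it strips an optional Markdown fence, keeps the span between the first `[` and the last `]`, and returns an empty list instead of raising. `extract_json_array` is a hypothetical helper, not part of the notebook.

```python
import json
import re
from typing import Dict, List

FENCE = "`" * 3  # built programmatically to avoid a literal fence inside this example

def extract_json_array(response_text: str) -> List[Dict]:
    """Best-effort extraction of a JSON array of objects from a model response."""
    text = response_text.strip()
    # Remove a leading fence (with or without a "json" tag) and a trailing fence
    text = re.sub(r"^" + FENCE + r"(?:json)?\s*", "", text)
    text = re.sub(r"\s*" + FENCE + r"$", "", text)
    # Keep only the span from the first '[' to the last ']'
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return []
    # Discard anything in the array that is not a JSON object
    return [item for item in parsed if isinstance(item, dict)]

raw = "Here you go:\n" + FENCE + 'json\n[{"question": "Q?"}]\n' + FENCE
print(extract_json_array(raw))
```

A caller could retry the same chunk once when this returns an empty list, which would help reach the requested question count.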


## Step 8: Display Generated Questions

In [None]:
# Display the generated questions
if questions:
    formatted_questions = generator.format_questions_for_display(questions)
    print(formatted_questions)

    # Also display as structured data
    print("\n" + "="*60)
    print("STRUCTURED QUESTION DATA (JSON Format)")
    print("="*60)
    print(json.dumps(questions, indent=2, ensure_ascii=False))
else:
    print("No questions to display. Please check the previous steps.")

MULTIPLE CHOICE QUESTIONS

Question 1:
According to the text, what is a fundamental principle upon which the MapReduce paradigm is generally based?

A) A. Sending the data to where the computer program resides.
B) B. The reducer performs a defined function on a single value for each unique key.
C) C. Sending the computer program to where the data resides.
D) D. The final output key-value will only be displayed and not stored.

Correct Answer: C
Explanation: The text explicitly states: 'The Algorithm Generally MapReduce paradigm is based on sending the computer program to where the data resides!'
Related Keywords: mapreduce, data, map, stage, shuffle

----------------------------------------

Question 2:
Which of the following describes the details of data passing managed by the Hadoop framework during a MapReduce job?

A) A) Storing the processed output in HDFS and producing new output.
B) B) Sending Map and Reduce tasks to servers and storing output in HDFS.
C) C) Issuing tasks, verif
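
The doubled labels in the output above ("A) A." and "A) A)") appear because the model already prefixes each option with a letter and `format_questions_for_display` adds its own. One way to avoid this is to strip any leading letter label before formatting. The sketch below assumes labels look like `A.` or `A)` followed by whitespace; the function names are hypothetical.

```python
import re
from typing import List

# Matches a leading "A." / "b)" style label followed by at least one space
_PREFIX = re.compile(r"^\s*[A-Da-d][.)]\s+")

def strip_option_prefix(option: str) -> str:
    """Remove one leading letter label such as 'A. ' or 'B) ' if present."""
    return _PREFIX.sub("", option, count=1)

def format_options(options: List[str]) -> List[str]:
    """Label options A, B, C, D exactly once, regardless of existing prefixes."""
    return [f"{chr(65 + i)}) {strip_option_prefix(opt)}" for i, opt in enumerate(options)]

for line in format_options(["A. Sending the data", "B) The reducer", "Plain option"]):
    print(line)
```

Requiring whitespace after the label keeps options such as "A.M. peak load" or "A large cluster" intact.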

## Step 9: Analysis and Statistics

In [None]:
# Analyze the generated questions
if questions:
    print("=== QUESTION ANALYSIS ===")
    print(f"Total questions generated: {len(questions)}")

    # Analyze question lengths
    question_lengths = [len(q['question'].split()) for q in questions]
    print(f"Average question length: {np.mean(question_lengths):.1f} words")
    print(f"Question length range: {min(question_lengths)} - {max(question_lengths)} words")

    # Analyze answer distribution
    answer_distribution = Counter([q['correct_answer'] for q in questions])
    print(f"\nAnswer distribution: {dict(answer_distribution)}")

    # Extract all keywords used
    all_keywords = []
    for q in questions:
        if 'keywords' in q:
            all_keywords.extend(q['keywords'])

    if all_keywords:
        keyword_freq = Counter(all_keywords)
        print(f"\nMost common keywords: {keyword_freq.most_common(10)}")

    print(f"\n✅ Analysis complete!")

=== QUESTION ANALYSIS ===
Total questions generated: 4
Average question length: 18.2 words
Question length range: 11 - 26 words

Answer distribution: {'C': 3, 'B': 1}

Most common keywords: [('mapreduce', 4), ('data', 4), ('map', 4), ('stage', 4), ('shuffle', 4)]

✅ Analysis complete!
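
Both the display and analysis cells index `q['question']`, `q['options']`, and `q['correct_answer']` directly, so a single malformed item from the model would raise `KeyError`. A minimal validation pass, sketched below under the assumption that every question must have the four fields named in the prompt and exactly four options, can filter such items first. `is_valid_question` and `filter_valid` are hypothetical helpers.

```python
from typing import Dict, List

REQUIRED_KEYS = ("question", "options", "correct_answer", "explanation")

def is_valid_question(q: Dict) -> bool:
    """Check that a question has all required fields, 4 options, and an A-D answer."""
    if not isinstance(q, dict) or any(key not in q for key in REQUIRED_KEYS):
        return False
    options = q["options"]
    if not isinstance(options, list) or len(options) != 4:
        return False
    return str(q["correct_answer"]).strip().upper() in {"A", "B", "C", "D"}

def filter_valid(questions: List[Dict]) -> List[Dict]:
    """Keep only well-formed questions."""
    return [q for q in questions if is_valid_question(q)]

candidates = [
    {"question": "Q1?", "options": ["w", "x", "y", "z"], "correct_answer": "c", "explanation": "e"},
    {"question": "Q2?", "options": ["w", "x", "y"], "correct_answer": "A", "explanation": "e"},
    {"question": "Q3?", "options": ["w", "x", "y", "z"], "correct_answer": "E", "explanation": "e"},
]
print(len(filter_valid(candidates)))  # 1
```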


## Project Summary

### NLP Techniques Used:
1. **Tokenization**: Breaking text into sentences and words using NLTK
2. **Stemming**: Using Porter Stemmer to reduce words to root forms
3. **Lemmatization**: Converting words to dictionary base forms using WordNet
4. **Stop Word Removal**: Filtering common words for better analysis
5. **POS Tagging**: Identifying parts of speech for important word extraction
6. **Named Entity Recognition**: Using spaCy to identify persons, organizations, locations
7. **TF-IDF Vectorization**: Extracting important keywords and ranking text chunks
8. **Text Preprocessing**: Cleaning and normalizing text data

### GenAI Integration:
- **Google Gemini API**: Generating intelligent, contextual MCQ questions
- **Prompt Engineering**: Structured prompts for consistent question format

### Key Features:
- PDF text extraction and processing
- Intelligent text chunk selection for question generation
- Comprehensive NLP analysis and preprocessing
- Automated MCQ generation with explanations
- Statistical analysis of generated questions
- Export functionality for different formats
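
The cells shown here print the questions as JSON but do not include an export step. The following is a hedged sketch of what exporting to JSON and CSV could look like using only the standard library. The function names and the CSV column layout are assumptions for illustration, not the project's actual export code.

```python
import csv
import io
import json
from typing import Dict, List

CSV_HEADER = ["question", "option_a", "option_b", "option_c", "option_d",
              "correct_answer", "explanation"]

def questions_to_json(questions: List[Dict]) -> str:
    """Serialize questions as pretty-printed JSON, keeping non-ASCII text readable."""
    return json.dumps(questions, indent=2, ensure_ascii=False)

def questions_to_csv(questions: List[Dict]) -> str:
    """Serialize questions as CSV with one row per question and one column per option."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(CSV_HEADER)
    for q in questions:
        # Pad or truncate to exactly four option columns
        options = (list(q.get("options", [])) + [""] * 4)[:4]
        writer.writerow([q.get("question", ""), *options,
                         q.get("correct_answer", ""), q.get("explanation", "")])
    return buffer.getvalue()

sample = [{
    "question": "What is 2+2?",
    "options": ["3", "4", "5", "6"],
    "correct_answer": "B",
    "explanation": "Basic arithmetic",
}]
print(questions_to_csv(sample))
```

Either string can be written to disk with a plain `open(path, "w", encoding="utf-8")`; in Colab, `files.download(path)` would then offer the file to the browser.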

### Academic Value:
This project demonstrates practical application of NLP concepts including:
- Text preprocessing pipelines
- Feature extraction techniques
- Information retrieval methods
- Integration of traditional NLP with modern GenAI
- Real-world application development

The project is well suited to NLP coursework: it exercises both classical NLP techniques and a modern generative AI integration in a single pipeline.