This project is a proof-of-concept for a sophisticated educational tool that leverages a Large Language Model (LLM) to analyze student-written Python code. Moving beyond simple syntax correction, this tool acts as a Socratic tutor. Instead of providing direct answers, it identifies conceptual flaws (such as algorithmic inefficiency) and generates guiding, open-ended questions to help the student discover the solution on their own. This methodology is designed to foster critical thinking and a deeper, more resilient understanding of programming concepts.
The core of this project is a carefully engineered prompt that instructs Meta's Code Llama (7B Instruct) model to adopt a pedagogical persona, demonstrating a practical application of prompt engineering for a specific, high-value educational use case.
- Conceptual Analysis: The system is designed to detect not just errors that break the code, but also logical weaknesses in code that works. For the bundled test case, it identifies that sorting an entire list to find the maximum value is correct but inefficient (see the sketch just after this list).
- Socratic Prompt Generation: The LLM's primary output is a question, not a statement or a solution. This is achieved through a highly constrained meta-prompt that defines the AI's role and rules of engagement.
- Local, Open-Source Implementation: The entire system runs locally using a freely available, open-source model. This preserves privacy, supports scalability, and removes reliance on paid APIs.
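For illustration, here is a minimal, hypothetical example of the kind of "correct but inefficient" code this check is aimed at. It mirrors the spirit of the bundled inefficient_max_finder.py test case but is not its actual contents:

```python
# Hypothetical student code in the spirit of the bundled test case (not its actual contents).
def find_largest_number(numbers):
    sorted_numbers = sorted(numbers)  # O(n log n): sorts every element...
    return sorted_numbers[-1]         # ...only to read the last one

# A single O(n) pass -- or simply max(numbers) -- would be enough, which is the
# insight the Socratic question should nudge the student toward.
```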
The model's behavior is entirely controlled by a "meta-prompt" that frames its task (an illustrative sketch follows the list below). It instructs the AI to:
- Assume the persona of a helpful tutor.
- Never provide direct solutions or code corrections.
- Focus on the underlying concept rather than a specific line number.
- Phrase all feedback as a single, encouraging, Socratic question.
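The repository's exact prompt text is not reproduced in this README; the sketch below is only an illustrative approximation of how such a meta-prompt could be written and wrapped in Code Llama Instruct's [INST]/<<SYS>> chat format. All string contents here are assumptions, not the project's actual prompt.

```python
# Illustrative approximation of a Socratic meta-prompt; the wording is an
# assumption, not the prompt actually used by evaluate_model.py.
SYSTEM_PROMPT = (
    "You are a patient, encouraging Python tutor. "
    "Never give direct solutions, corrected code, or line-by-line fixes. "
    "Identify the single most important concept the student is missing and "
    "reply with exactly one open-ended, Socratic question about that concept."
)

def build_prompt(student_code: str) -> str:
    # Code Llama Instruct follows the Llama 2 chat convention:
    # an [INST] ... [/INST] turn with an optional <<SYS>> system block.
    return (
        f"[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n"
        f"Here is the student's code:\n\n{student_code}\n\n"
        f"Ask your one Socratic question now. [/INST]"
    )
```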
- Language: Python
- Core Libraries:
  - `transformers`: For interacting with models from the Hugging Face ecosystem.
  - `PyTorch`: As the backend tensor library for the model.
  - `accelerate`: For efficient model loading on available hardware.
- AI Model: `codellama/CodeLlama-7b-Instruct-hf` (7-billion-parameter instruction-tuned model)
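As a rough sketch of how these pieces fit together (parameter choices here are illustrative and may differ from evaluate_model.py), the model can be loaded through the transformers text-generation pipeline, with accelerate handling device placement via device_map="auto":

```python
import torch
from transformers import pipeline

# device_map="auto" lets accelerate place the ~7B parameters on whatever
# hardware is available; float16 roughly halves the memory footprint.
pipe = pipeline(
    "text-generation",
    model="codellama/CodeLlama-7b-Instruct-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

student_code = open("test_cases/inefficient_max_finder.py").read()
# A trimmed-down stand-in for the project's full meta-prompt (see the sketch above).
prompt = (
    "[INST] You are a Python tutor. Without giving the solution, ask one "
    f"Socratic question about this code:\n\n{student_code} [/INST]"
)

result = pipe(prompt, max_new_tokens=128, do_sample=False, return_full_text=False)
print(result[0]["generated_text"].strip())
```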
Prerequisites: Python 3.9+, Homebrew (for macOS), Git, and Git LFS.
Step 1: Clone the Repository
```bash
git clone https://github.com/Nydv01/fossee-python-evaluation.git
cd fossee-python-evaluation
```
Step 2: Set Up and Activate the Python Virtual Environment
```bash
python3 -m venv venv
source venv/bin/activate
```
Step 3: Install Dependencies
```bash
pip install -r requirements.txt
```
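The repository's requirements.txt is not reproduced here; given the core libraries listed above, it presumably contains something like the following (these entries are an assumption, so consult the actual file for the authoritative list and versions):

```text
# Assumed contents, inferred from the core libraries listed earlier in this README.
transformers
torch
accelerate
```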
Step 4: Download the AI Model (One-Time Setup)
This project requires the Code Llama 7B model, which is approximately 16 GB. It is not included in this repository (and is specified in .gitignore) as per best practices.
Install Git LFS: `brew install git-lfs && git lfs install`
Download the model:
```bash
# This command clones the model repository and then pulls the large files
git clone https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf
```
Note: This is a very large download. A stable, wired internet connection is highly recommended.
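As an optional alternative to the Git LFS clone above, the same files can be fetched with the huggingface_hub Python library (the local directory name below is just an example):

```python
from huggingface_hub import snapshot_download

# Downloads every file in the model repository; the download can be resumed if interrupted.
snapshot_download(
    repo_id="codellama/CodeLlama-7b-Instruct-hf",
    local_dir="CodeLlama-7b-Instruct-hf",
)
```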
Step 5: Run the Evaluation Script
```bash
python3 evaluate_model.py
```
5. Sample Output
Running the script on the provided test case (inefficient_max_finder.py) produces the following output:
```text
--- FOSSEE Python Competence Analysis Model Evaluation ---
Loaded student code from: test_cases/inefficient_max_finder.py
--- Student Code ---
# student_code.py
def find_largest_number(numbers):
"""
This function takes a list of numbers and returns the largest one.
I think sorting the list first makes it easy to find the largest.
"""
# ... (code) ...
--------------------
Initializing model pipeline...
Loading checkpoint shards: 100%|████████| 2/2 [00:23<00:00, 11.62s/it]
Generating pedagogical feedback...
--- Model-Generated Socratic Question ---
What do you think the list `numbers` is being sorted in?
------------------------------------------
```
6. Analysis & Future Work
This proof-of-concept successfully demonstrates the viability of using instruction-tuned LLMs for nuanced educational tasks. However, this is just the beginning.
Key Strengths:
- Pedagogical Soundness: The Socratic approach is more effective for long-term learning than simply providing answers.
- Scalability: An automated system like this could provide instant, personalized feedback to thousands of students simultaneously.
- Professional Practices: The repository follows best practices, including the use of a .gitignore to exclude the virtual environment and the large model files.
Areas for Future Improvement:
- Advanced Competence Analysis: The model could be fine-tuned on a curated dataset of student code and expert feedback to improve its ability to detect more complex errors (e.g., race conditions, memory leaks).
- Adaptive Dialogue: A future version could engage in a multi-turn conversation, asking follow-up questions based on the student's responses.
- Quantitative Evaluation: Develop a robust rubric to score the quality of the AI's questions based on relevance, cognitive depth, and clarity.
- Broader Subject Matter: This same framework could be adapted to tutor in other domains, such as SQL, Java, or even scientific concepts.