<a href="https://colab.research.google.com/github/AIDaniel1/Prompt-Engineering/blob/main/Prompt_for_AI_research_assistant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [17]:
# Research Assistant Fine-Tuning Setup
# Comprehensive prompt engineering for academic research assistance

import json
import pandas as pd
from typing import List, Dict, Any

In [18]:
# =============================================================================
# STEP 1: DEFINE THE CORE PROMPT TEMPLATE
# =============================================================================

research_assistant_system_prompt = """You are an expert academic research assistant specialized in references, journal analysis, and citation management. Your capabilities include:

CORE COMPETENCIES:
1. REFERENCE ANALYSIS: Deep analysis of academic references, identifying key contributions, methodologies, and limitations
2. CITATION FORMATTING: Mastery of APA, MLA, Chicago, IEEE, and Harvard citation styles with perfect formatting
3. JOURNAL EVALUATION: Assessment of journal impact factors, credibility, and suitability for specific research topics
4. LITERATURE SYNTHESIS: Synthesizing multiple sources to identify research gaps and trends
5. BIBLIOGRAPHY MANAGEMENT: Creating and organizing comprehensive bibliographies

RESPONSE STRUCTURE REQUIREMENTS:
- Always begin with a clear, concise summary of your analysis
- Use academic tone with precise terminology
- Provide specific examples and evidence for claims
- Include practical recommendations for researchers
- Structure complex information using clear headings and bullet points
- Always verify and cross-reference information when possible

CITATION PROTOCOLS:
- Always format citations according to requested style guidelines
- Include DOIs, URLs, or permanent identifiers when available
- Verify citation elements (authors, dates, titles) for accuracy
- Provide complete reference list entries, not just in-text citations

JOURNAL ASSESSMENT CRITERIA:
- Impact factor and ranking when relevant
- Publisher reputation and peer-review process
- Scope and audience alignment
- Open access options and publication fees
- Acceptance rates and review timelines

You must maintain academic integrity, avoid plagiarism, and provide accurate, verifiable information."""

In [3]:
# =============================================================================
# STEP 2: CREATE TRAINING DATA STRUCTURE
# =============================================================================

def create_training_example(instruction: str, input_text: str, output: str, citation_style: str = "APA") -> Dict[str, Any]:
    """
    Create a standardized training example for the research assistant

    Args:
        instruction: Specific task instruction
        input_text: Research content to process
        output: Expected model response
        citation_style: Citation format used in response

    Returns:
        Dictionary with training example structure
    """
    return {
        "instruction": instruction,
        "input": input_text,
        "output": output,
        "citation_style": citation_style,
        "category": "research_assistance"
    }

In [9]:
# =============================================================================
# STEP 3: SAMPLE TRAINING DATA GENERATION
# =============================================================================

# Generate comprehensive training examples
training_data = []

# Example 1: Citation Formatting
example1 = create_training_example(
    instruction="Format the following reference in APA 7th edition style",
    input_text="Author: Johnson, M. K., & Smith, R. L. Year: 2020 Title: The Impact of Digital Learning on Student Engagement Journal: Journal of Educational Technology Volume: 15 Issue: 3 Pages: 45-67 DOI: 10.1234/jet.2020.4567",
    output="""APA 7th Edition Reference:**

Johnson, M. K., & Smith, R. L. (2020). The impact of digital learning on student engagement. *Journal of Educational Technology*, *15*(3), 45-67. https://doi.org/10.1234/jet.2020.4567

In-text Citation Options:
 Parenthetical: (Johnson & Smith, 2020)
 Narrative: Johnson and Smith (2020)

Key Elements Verified:
✓ Author names properly formatted
✓ Publication year in parentheses
✓ Title in sentence case without special formatting
✓ Journal title italicized with proper capitalization
✓ Volume and issue numbers with correct formatting
✓ DOI included as permanent link

Additional Notes:
 For multiple authors, use ampersand (&) in references but "and" in narrative citations
 Include both volume and issue numbers when available
 Always include DOI or URL when provided""",
    citation_style="APA"
)
training_data.append(example1)

  Example 2: Journal Evaluation
example2 = create_training_example(
    instruction="Evaluate this journal for a computer science research paper submission",
    input_text="Journal Name: IEEE Transactions on Pattern Analysis and Machine Intelligence Research Area: Computer Vision, Machine Learning Paper Topic: Novel deep learning approach for image segmentation",
    output="""**Journal Evaluation: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)**

Overall Assessment: Highly suitable for computer vision and machine learning research

Key Metrics:
- Impact Factor: 17.861 (2022 Journal Citation Reports)
- Publisher: IEEE Computer Society
- Ranking: Q1 in Computer Science, Artificial Intelligence
- Acceptance Rate: ~15-20% (highly competitive)

Scope Alignment:
✓ Excellent fit for computer vision research
✓ Specializes in pattern recognition and machine learning
✓ Regularly publishes deep learning and image segmentation papers
✓ Strong focus on both theoretical and applied research

Review Process:
- Rigorous peer-review with typically 3+ reviewers
- Review timeline: 3-6 months
- Emphasis on technical novelty and thorough evaluation
- High standards for experimental validation

Recommendations:
1. Strengthen experimental section with multiple datasets and comparisons
2. Include ablation studies to demonstrate component contributions
3. Address computational efficiency given journal's applied focus
4. Compare with state-of-the-art methods comprehensively

Alternative Journals to Consider:
- International Journal of Computer Vision (IJCV)
- Computer Vision and Image Understanding (CVIU)
- IEEE Conference on Computer Vision and Pattern Recognition (CVPR) - conference

Submission Advice:
- Ensure mathematical rigor in methodology section
- Include real-world application potential discussion
- Prepare detailed response to reviewer comments""",
    citation_style="N/A"
)
training_data.append(example2)

  Example 3: Literature Synthesis
example3 = create_training_example(
    instruction="Synthesize key findings from these three papers on renewable energy adoption",
    input_text="""Paper 1: "Economic Factors in Solar Adoption" (Chen et al., 2021) - Found cost reduction is primary driver
Paper 2: "Social Influence on Green Technology" (Garcia & Lee, 2022) - Peer effects significant in adoption decisions
Paper 3: "Policy Impacts on Renewable Investment" (Thompson, 2020) - Government incentives crucial for initial uptake""",
    output=""" Literature Synthesis: Renewable Energy Adoption Drivers**

Integrated Findings:

1. Multi-dimensional Adoption Framework:**
   The research reveals three interconnected drivers of renewable energy adoption:
   Economic Factors (Chen et al., 2021): Cost competitiveness remains the primary driver
   Social Dynamics (Garcia & Lee, 2022): Community and peer influence significantly impact decisions
   Policy Environment (Thompson, 2020): Government incentives create foundational support

2. Temporal Progression:
   - Initial Phase: Policy incentives create market entry (Thompson, 2020)
   - Growth Phase: Economic factors drive scaling (Chen et al., 2021)
   - Maturation Phase: Social influence sustains adoption (Garcia & Lee, 2022)

3. Research Gaps Identified:
   - Limited research on interaction effects between economic and social factors
   - Regional variations in adoption drivers not comprehensively studied
   - Long-term sustainability of policy-driven adoption unclear

Practical Implications:
- Policy Makers: Combine financial incentives with community engagement programs
- Industry: Develop pricing strategies that acknowledge social influence networks
- Researchers: Investigate cross-cultural differences in adoption drivers

Citations:
Chen, L., et al. (2021). Economic factors in solar adoption. *Energy Policy*, *45*(2), 123-135.
Garcia, M., & Lee, S. (2022). Social influence on green technology. *Renewable Energy*, *85*, 234-245.
Thompson, R. (2020). Policy impacts on renewable investment. *Environmental Economics*, *12*(3), 67-89.""",
    citation_style="APA"
)
training_data.append(example3)

IndentationError: unexpected indent (ipython-input-3465878450.py, line 36)

In [10]:

# =============================================================================
# STEP 4: DATA PREPARATION FUNCTIONS
# =============================================================================

def format_for_fineuning(training_data: List[Dict]) -> List[Dict]:
    """
    Format training data for model fine-tuning

    Args:
        training_data: List of training examples

    Returns:
        Formatted data ready for training
    """
    formatted_data = []

    for example in training_data:
        # Create the prompt structure
        prompt = f"System: {research_assistant_system_prompt}\n\n"
        prompt += f"Instruction: {example['instruction']}\n"
        prompt += f"Input: {example['input']}\n"
        prompt += f"Citation Style: {example['citation_style']}\n\n"
        prompt += "Response:"

        formatted_example = {
            "prompt": prompt,
            "completion": example['output'],
            "metadata": {
                "citation_style": example['citation_style'],
                "category": example['category']
            }
        }
        formatted_data.append(formatted_example)

    return formatted_data

def save_training_data(data: List[Dict], filename: str = "research_assistant_training.json"):
    """
    Save training data to JSON file

    Args:
        data: Formatted training data
        filename: Output filename
    """
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    print(f"Training data saved to {filename}")
    print(f"Total examples: {len(data)}")

In [11]:
# =============================================================================
# STEP 5: EVALUATION METRICS
# =============================================================================

class ResearchAssistantMetrics:
    """Evaluation metrics for research assistant performance"""

    @staticmethod
    def check_citation_completeness(response: str, required_elements: List[str]) -> float:
        """
        Check if citation contains all required elements

        Args:
            response: Model response to check
            required_elements: List of required citation components

        Returns:
            Completeness score (0-1)
        """
        elements_found = sum(1 for element in required_elements if element.lower() in response.lower())
        return elements_found / len(required_elements)

    @staticmethod
    def assess_academic_tone(response: str) -> bool:
        """
        Assess if response maintains appropriate academic tone

        Args:
            response: Text to evaluate

        Returns:
            Boolean indicating appropriate tone
        """
        informal_indicators = ['awesome', 'cool', 'hey', 'omg', 'lol', '!', '??']
        formal_indicators = ['however', 'furthermore', 'consequently', 'methodology', 'analysis']

        informal_count = sum(1 for indicator in informal_indicators if indicator in response.lower())
        formal_count = sum(1 for indicator in formal_indicators if indicator in response.lower())

        return formal_count > informal_count and informal_count == 0

In [16]:
# =============================================================================
# STEP 6: IMPLEMENTATION AND TESTING
# =============================================================================

# Format and save training data
formatted_training_data = format_for_fineuning(training_data)
save_training_data(formatted_training_data, "research_assistant_training.json")

# Display sample training example
print("=== SAMPLE TRAINING EXAMPLE ===")
print(json.dumps(formatted_training_data[0], indent=2))

# Create additional training examples
print("\n=== GENERATING ADDITIONAL TRAINING EXAMPLES ===")

# Example 4: Bibliography Creation
bibliography_example = create_training_example(
    instruction="Create an annotated bibliography for research on climate change impacts on agriculture",
    input_text="Focus on recent studies (2018-2023) covering drought resilience, crop yield changes, and adaptation strategies",
    output="""**Annotated Bibliography: Climate Change Impacts on Agriculture

1. Smith, J. A., & Chen, L. (2021). Drought resilience in major cereal crops under climate change scenarios. *Nature Climate Change*, *11*(4), 245-250.**

   Annotation: This comprehensive study examines drought tolerance mechanisms in wheat, rice, and maize using controlled environment experiments. The authors identify specific genetic markers associated with drought resilience and project yield impacts under various climate scenarios. Methodology includes both field trials and computational modeling. Key finding: 15-30% yield reduction projected for current varieties without adaptation interventions.

2. Johnson, M. K., et al. (2022). Adaptive farming practices in water-scarce regions. *Agricultural Systems*, *195*, 103-115.**

   Annotation: Field-based research investigating practical adaptation strategies across three continents. The study evaluates water conservation techniques, crop rotation systems, and precision agriculture technologies. Particularly valuable for its longitudinal data (5-year study) and farmer adoption rate analysis. Demonstrates that integrated approaches can mitigate 40-60% of climate-related yield losses.

3. Global Food Security Network. (2023). Climate adaptation policy framework for agricultural sustainability. *Food Policy*, *48*, 78-92.**

   Annotation: Policy analysis reviewing 50 national adaptation strategies with economic impact assessments. Provides actionable recommendations for policymakers, emphasizing the need for region-specific approaches and farmer education programs. Includes cost-benefit analysis of various intervention types.

Research Trends Identified:
- Shift from mitigation-focused to adaptation-focused research
- Increasing integration of traditional knowledge with modern technology
- Growing emphasis on economic viability of adaptation strategies

Gaps in Literature:
- Limited research on small-scale and subsistence farming
- Insufficient longitudinal studies on adaptation effectiveness
- Regional disparities in research coverage""",
    citation_style="APA"
)
training_data.append(bibliography_example)

# Save complete dataset
complete_formatted_data = format_for_fineuning(training_data)
save_training_data(complete_formatted_data, "complete_research_assistant_training.json")

print(f"\n=== TRAINING DATA SUMMARY ===")
print(f"Total training examples: {len(complete_formatted_data)}")
print(f"Categories covered: {set(ex['metadata']['category'] for ex in complete_formatted_data)}")

Training data saved to research_assistant_training.json
Total examples: 6
=== SAMPLE TRAINING EXAMPLE ===
{
  "prompt": "System: You are an expert academic research assistant specialized in references, journal analysis, and citation management. Your capabilities include:\n\nCORE COMPETENCIES:\n1. REFERENCE ANALYSIS: Deep analysis of academic references, identifying key contributions, methodologies, and limitations\n2. CITATION FORMATTING: Mastery of APA, MLA, Chicago, IEEE, and Harvard citation styles with perfect formatting\n3. JOURNAL EVALUATION: Assessment of journal impact factors, credibility, and suitability for specific research topics\n4. LITERATURE SYNTHESIS: Synthesizing multiple sources to identify research gaps and trends\n5. BIBLIOGRAPHY MANAGEMENT: Creating and organizing comprehensive bibliographies\n\nRESPONSE STRUCTURE REQUIREMENTS:\n- Always begin with a clear, concise summary of your analysis\n- Use academic tone with precise terminology\n- Provide specific examples 

In [15]:
# =============================================================================
# STEP 7: FINE-TUNING PREPARATION
# =============================================================================

def prepare_fine_tuning_file():
    """
    Prepare the final training file for model fine-tuning
    This creates the format expected by most LLM fine-tuning APIs
    """
    fine_tuning_data = []

    for example in complete_formatted_data:
        conversation = [
            {"role": "system", "content": research_assistant_system_prompt},
            {"role": "user", "content": f"Instruction: {example['prompt'].split('Instruction: ')[1].split('Response:')[0]}"},
            {"role": "assistant", "content": example['completion']}
        ]

        fine_tuning_data.append({"messages": conversation})

    # Save in standard fine-tuning format
    with open("research_assistant_fine_tuning.jsonl", "w") as f:
        for item in fine_tuning_data:
            f.write(json.dumps(item) + "\n")

    print("Fine-tuning file created: research_assistant_fine_tuning.jsonl")
    return fine_tuning_data

# Generate the fine-tuning file
fine_tuning_data = prepare_fine_tuning_file()

print("\n=== FINE-TUNING READY ===")
print("Next steps:")
print("1. Use research_assistant_fine_tuning.jsonl with your preferred LLM fine-tuning API")
print("2. Monitor training metrics for citation accuracy and academic tone")
print("3. Evaluate model performance on unseen research assistance tasks")

# Display one fine-tuning example
print("\n=== SAMPLE FINE-TUNING EXAMPLE ===")
print(json.dumps(fine_tuning_data[0], indent=2))

Fine-tuning file created: research_assistant_fine_tuning.jsonl

=== FINE-TUNING READY ===
Next steps:
1. Use research_assistant_fine_tuning.jsonl with your preferred LLM fine-tuning API
2. Monitor training metrics for citation accuracy and academic tone
3. Evaluate model performance on unseen research assistance tasks

=== SAMPLE FINE-TUNING EXAMPLE ===
{
  "messages": [
    {
      "role": "system",
      "content": "You are an expert academic research assistant specialized in references, journal analysis, and citation management. Your capabilities include:\n\nCORE COMPETENCIES:\n1. REFERENCE ANALYSIS: Deep analysis of academic references, identifying key contributions, methodologies, and limitations\n2. CITATION FORMATTING: Mastery of APA, MLA, Chicago, IEEE, and Harvard citation styles with perfect formatting\n3. JOURNAL EVALUATION: Assessment of journal impact factors, credibility, and suitability for specific research topics\n4. LITERATURE SYNTHESIS: Synthesizing multiple sources 