# Development Log

## Project Setup

**Decisions Made:**
- Chose regex-based approach over ML for initial detection
- Decided to focus on Python and C languages first
- Created modular structure (extractor/, tests/, classifier/)

**Code Changes:**
- Created initial `patterns.py` with simple regex patterns: a dictionary of patterns that appear in Python, C, or both
- Added a basic `CodeDetector` class framework: initialized the class and added a main method, `detect_code`
- Created basic tests: simple C and Python samples containing only code (a hello-world function) for initial testing

## Initial Algorithm Architecture - Simple Block Detection

Focus: Basic line-by-line scanning approach

```python
def detect_code(self, text):
    # Simple sequential scanning
    lines = text.split('\n')
    i = 0
    while i < len(lines):
        if self._is_code_like(lines[i]):
            block = self._extract_block(lines, i)
            # Process block...
        i += 1  # advance to the next line; without this the loop never terminates
```

### Key Decisions

- Decision: Line-by-line sequential scanning
- Rationale: Check each line for code patterns, then extract blocks when found

### Technical Approach

- Pattern Detection: `_is_code_like()` performs a boolean check against all regex patterns
- Block Extraction: `_extract_block()` is a placeholder for boundary detection
- Language ID: Simple pattern-counting approach
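
The boolean check and the pattern-counting idea can be sketched roughly as follows. This is a minimal illustration rather than the project's code: the `PATTERNS` table, its regexes, and the standalone functions `is_code_like` and `guess_language` are assumptions, since the contents of `patterns.py` are not shown in this log.

```python
import re

# Hypothetical pattern table, keyed by language ("common" applies to both).
PATTERNS = {
    "python": [
        re.compile(r"^\s*def\s+\w+\s*\(.*\)\s*:"),
        re.compile(r"^\s*import\s+\w+"),
    ],
    "c": [
        re.compile(r"^\s*#include\s*[<\"]"),
        re.compile(r"^\s*(int|void|char)\s+\w+\s*\("),
    ],
    "common": [re.compile(r"^\s*return\b")],
}


def is_code_like(line):
    """Boolean check: True if any pattern from any group matches the line."""
    return any(p.search(line) for group in PATTERNS.values() for p in group)


def guess_language(lines):
    """Pattern counting: the language with the most matching lines wins."""
    counts = {"python": 0, "c": 0}
    for line in lines:
        for lang in counts:
            counts[lang] += sum(1 for p in PATTERNS[lang] if p.search(line))
    return max(counts, key=counts.get)
```

Because every match counts equally here, a weak hint and a strong one are indistinguishable, which is one reason a boolean check is a fragile starting point.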

### Critical Limitation Discovered

Problem: This approach assumes code appears in contiguous blocks, but real-world code is often fragmented, with explanatory text interspersed.

## Algorithm Evolution - From Block-Based to Fragment-Based Detection

Major Paradigm Shift: Abandoned sequential block extraction for fragment collection + grouping

```python
# NEW APPROACH:
for i, line in enumerate(lines):
    scores = self.identify_language_for_line(line)
    max_score = max(scores.values())  # assumes scores maps language -> score
    if max_score >= 0.4:
        code_fragments.append({...})
return self.group_by_language(code_fragments)
```

### Key Innovation

- Fragment Collection: Each line is scored independently
- Language-Specific Scoring: `identify_language_for_line()` uses weighted patterns
- Post-Processing Grouping: `group_by_language()` assembles fragments into blocks

### Technical Improvements

Weighted Pattern System: Moved from boolean to scored detection

```python
'function_def': (r'def\s+\w+\s*\(.*\):', 0.8),  # High weight for strong indicators
'comments': (r'#.*$', 0.2),                      # Low weight for weak indicators
```

- Threshold-Based Detection: A 0.4 threshold filters out noise; the value can be tuned based on testing
- Language Separation: Fragments are grouped by detected language before block assembly
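
A rough sketch of weighted, threshold-based line scoring is shown below. The two Python entries and the 0.4 threshold come from this log; the C patterns, the `LANGUAGE_PATTERNS` layout, and the helper `is_code_line` are assumptions made only for illustration.

```python
import re

# name -> (regex, weight). The C entries are hypothetical.
PYTHON_PATTERNS = {
    "function_def": (r"def\s+\w+\s*\(.*\):", 0.8),
    "comments": (r"#.*$", 0.2),
}
C_PATTERNS = {
    "include": (r"#include\s*<\w+\.h>", 0.8),
    "statement_end": (r";\s*$", 0.3),
}
LANGUAGE_PATTERNS = {"python": PYTHON_PATTERNS, "c": C_PATTERNS}
THRESHOLD = 0.4


def identify_language_for_line(line):
    """Sum the weights of all matching patterns, separately per language."""
    scores = {}
    for lang, patterns in LANGUAGE_PATTERNS.items():
        scores[lang] = sum(w for rx, w in patterns.values() if re.search(rx, line))
    return scores


def is_code_line(line):
    """A line counts as code when its best language score reaches the threshold."""
    return max(identify_language_for_line(line).values()) >= THRESHOLD
```

With these weights a lone `#` comment scores 0.2 and stays below the threshold, while a function definition clears it on its own.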

### Problem Solved

- Issue: Fragmented code (explanatory text between code lines) wasn't handled properly
- Solution: Collect all code-like lines first, then group them by language into blocks
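
One way the grouping step might look is sketched here. The fragment dictionary shape (`line`, `language`, `text`) and the `max_gap` parameter are assumptions; the log does not say how close two fragments must be to end up in the same block.

```python
from collections import defaultdict


def group_by_language(fragments, max_gap=1):
    """Group fragments per language, then merge nearby lines into blocks.

    fragments: list of {'line': int, 'language': str, 'text': str}.
    max_gap (hypothetical): how many non-code lines may sit between two
    fragments that still belong to the same block.
    """
    by_lang = defaultdict(list)
    for frag in fragments:
        by_lang[frag["language"]].append(frag)

    blocks = []
    for lang, frags in by_lang.items():
        frags.sort(key=lambda f: f["line"])
        current = [frags[0]]
        for frag in frags[1:]:
            if frag["line"] - current[-1]["line"] <= max_gap + 1:
                current.append(frag)
            else:
                blocks.append({"language": lang, "lines": [f["line"] for f in current]})
                current = [frag]
        blocks.append({"language": lang, "lines": [f["line"] for f in current]})
    return blocks
```

Grouping by language first means a Python fragment never bridges two C fragments, at the cost of possibly splitting code that was merely misclassified.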

### Status

- Core algorithm completely rewritten
- Old block-based methods kept as fallback but unused
- Foundation for current multi-pass structural analysis

#### Next Challenge Identified
Need to handle multiline constructs (comments, unmatched braces) that this approach still misses.

## Two-Pass Architecture Design - Structural Analysis

Problem Identified: Missing structural elements break code detection

- C closing braces } not recognized as code
- Multiline comments /* */ and """ span detection boundaries
- Fragments miss complete constructs

Proposed Solution: Two-Pass System

Pass 1: Pattern Detection

```python
# Current fragment-based approach becomes Pass 1
scan_for_code_patterns()        # Line-by-line scoring (current logic)
identify_structural_elements()  # Find braces, quotes, delimiters
detect_multiline_starts()       # Function defs, comment starts
```

Pass 2: Language-Aware Completion

```python
complete_multiline_constructs()  # Match opening/closing pairs
find_block_boundaries()          # Use language rules for boundaries
validate_block_integrity()       # Ensure structural completeness
```

### Language-Specific Rules

- C: Match `{}` braces and `/* */` comments
- Python: Match `"""` docstrings, indentation blocks, and `\` line continuations

### Architecture Benefits

- Separation of concerns: Pattern detection vs. structural analysis
- Language-aware completion: Each language has different rules
- Robust boundary detection: Handles complex multiline constructs

## Structural Analysis Implementation - Pass 2 Foundation

Major Addition: analyze_structure() - language-aware structural validation

```python
def analyze_structure(self, content, language, start_line):
    round_brace_counter = 0
    for line in content:
        # Track brace balance across entire block
        round_brace_counter += line.count('(') - line.count(')')
        # Detect multiline comment boundaries
        process_multiline_comments(line, language, ...)
    # Identify structural imbalances
    if braces_sum != 0:
        ...  # Unmatched braces detected
```

### Key Features Implemented

- Brace Tracking: Cumulative counters for `()`, `[]`, `{}` across the block
- Comment Detection: Language-specific multiline comment handling
  - C: `/* */` pairs
  - Python: `"""` pairs
- Structural Validation: Detects incomplete constructs

### Technical Approach

- Per-line analysis: Track running totals of brace balance
- State machine: An `in_comment` flag tracks multiline comment state
- Language dispatch: Different comment delimiters per language
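
The `in_comment` state machine with per-language delimiters could be sketched as below. The `COMMENT_DELIMS` table and the function name `find_block_comments` are assumptions, and the sketch deliberately ignores delimiters that appear inside string literals.

```python
# Hypothetical delimiter table used for language dispatch.
COMMENT_DELIMS = {"c": ("/*", "*/"), "python": ('"""', '"""')}


def find_block_comments(lines, language):
    """Return [{'start': i, 'end': j}]; 'end' is None if the comment never closes."""
    open_delim, close_delim = COMMENT_DELIMS[language]
    comments = []
    in_comment = False
    start = None
    for i, line in enumerate(lines):
        pos = 0
        while True:
            if not in_comment:
                idx = line.find(open_delim, pos)
                if idx == -1:
                    break
                in_comment, start = True, i
                pos = idx + len(open_delim)
            else:
                idx = line.find(close_delim, pos)
                if idx == -1:
                    break
                comments.append({"start": start, "end": i})
                in_comment = False
                pos = idx + len(close_delim)
    if in_comment:
        comments.append({"start": start, "end": None})
    return comments
```

Advancing `pos` past each delimiter is what lets a Python docstring that opens and closes on one line be recorded as a single comment instead of toggling the flag twice by accident.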

### Problem Addressed
Missing closing elements (like } in C) now detected as structural imbalances requiring correction.

### Current Status

- Detection logic implemented
- `fix_braces()` is a placeholder for correction logic
- Foundation for block boundary expansion

### Next Steps Identified

- Implement brace correction/expansion
- Add block merging for adjacent fragments
- Test with real Stack Overflow data

## Enhanced Structural Analysis - Error Tracking & Recovery

Major Improvement: Transformed simple counters into detailed error tracking system

```python
# OLD: Simple counters
round_brace_counter += line.count('(') - line.count(')')

# NEW: Error tracking with line positions
brace_errors = {'round': [], 'square': [], 'curly': []}
# Track which lines have unmatched opens
if counters[i] > 0 and count > 0:
    brace_errors[brace_type].extend([line_index] * count)
```

### Key Innovation

Line-level error tracking: Instead of only counting imbalances, the analysis now records exactly which lines have unmatched opening braces.

Smart matching logic:

- New opens → add line numbers to the error list
- Closes → remove the most recent entry from the error list (LIFO matching)
- Remaining entries = unmatched opens needing correction
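
The LIFO matching rule can be illustrated with a small stack-per-brace-type sketch. It walks characters rather than using `line.count()`, which is an assumption about one possible implementation; the name `track_brace_errors` is hypothetical, and braces inside strings or comments are not treated specially.

```python
BRACES = {"round": ("(", ")"), "square": ("[", "]"), "curly": ("{", "}")}


def track_brace_errors(lines):
    """Return, per brace type, the line indices of opens that were never closed."""
    brace_errors = {name: [] for name in BRACES}
    for line_index, line in enumerate(lines):
        for ch in line:
            for name, (open_ch, close_ch) in BRACES.items():
                if ch == open_ch:
                    # New open: remember which line it came from.
                    brace_errors[name].append(line_index)
                elif ch == close_ch and brace_errors[name]:
                    # Close: cancel the most recent unmatched open (LIFO).
                    brace_errors[name].pop()
    return brace_errors
```

For a C function whose final `}` is missing, the only entry left in `brace_errors['curly']` is the line of the function's opening brace, which tells the correction step where the incomplete construct begins.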

### Data Structure Design

```python
return {
    'multiline_comments': [{'start': X, 'end': Y}],
    'brace_errors': {'round': [line_nums], 'curly': [line_nums]}
}
```

### Problem This Solves

The analysis can now identify exactly which C functions are missing closing braces and where to look for them in adjacent blocks.

### Status

- Error detection implemented
- Structured analysis data returned
- Ready for correction logic implementation

Next: Use `brace_errors` to expand blocks and fix structural issues.

## Complete Structural Recovery Pipeline

Major Integration: End-to-end structural analysis with block expansion

```python
# Integrated pipeline
for block in code_blocks:
    structure_info = self.analyze_structure(...)
    block['structure_info'] = structure_info
    block = self.expand_blocks_with_comments(block, structure_info, lines)
```

### Key Features Added

1. Enhanced Language Scoring
   - Common patterns now contribute to all languages
   - More accurate language identification per line

2. Smart Comment Detection
   - Handles identical start/end delimiters (Python `"""`)
   - Detects the edge case where a comment starts and ends on the same line
   - Tracks comment lines that fall outside block boundaries

3. Block Expansion Logic

```python
def expand_blocks_with_comments(self, block, structure_info, original_lines):
    # Expands block boundaries to include complete multiline comments
    all_line_nums.extend(comment_line_ranges)
    block['content'] = original_lines[new_start:new_end]
    return block  # the pipeline reassigns block from this call
```

### Problem Solved

- Issue: Multiline comments split across detection boundaries
- Solution: Post-detection expansion using structural analysis

### Technical Innovation

- Missing line detection: Identifies comment lines outside the current block
- Boundary expansion: Dynamically extends blocks to include complete constructs
- Content reconstruction: Rebuilds block content from the original text
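
The expansion step might be sketched as follows. The block shape (`start`/`end` as inclusive line indices) and the standalone function name are assumptions for illustration; the project's method takes `structure_info` and may use exclusive end indices instead.

```python
def expand_block_with_comments(block, comments, original_lines):
    """Extend a block so that any overlapping comment is included in full.

    block: {'start': int, 'end': int, ...} with inclusive indices (assumed).
    comments: [{'start': int, 'end': int}] in original-line coordinates.
    """
    line_nums = set(range(block["start"], block["end"] + 1))
    for c in comments:
        overlaps = c["start"] <= block["end"] and c["end"] >= block["start"]
        if overlaps:
            # Missing line detection: add comment lines outside the block.
            line_nums.update(range(c["start"], c["end"] + 1))

    # Clamp to the document so slicing never goes out of range.
    new_start = max(0, min(line_nums))
    new_end = min(len(original_lines) - 1, max(line_nums))
    return {
        **block,
        "start": new_start,
        "end": new_end,
        "content": original_lines[new_start:new_end + 1],
    }
```

Rebuilding `content` from the original lines, rather than patching the old content, keeps the block consistent with its new boundaries.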

## Pattern Refinement - Weight Optimization & Edge Case Handling

Focus: Fine-tuning detection accuracy through pattern weights and specialized patterns

### Pattern Weight Adjustments

```python
# Increased weights for strong indicators
'string_literals': 0.2,  # better string detection
'brackets': 0.2,         # improved structural detection
'closing_brace': 0.5,    # standalone braces now detected

# Reduced weights to prevent over-detection
'function_call': 0.4,    # reduced false positives
'parameter_list': 0.2,   # more conservative
```

### New Specialized Patterns Added

#### Python-specific:

- `json_structure`: `"key": [{...}]` patterns (0.6 weight)
- `multiline_call`: Function calls spanning lines
- `list_with_dicts`: Complex data structures
- `indented_param`: Multi-line function parameters

#### C-specific:

- `comment_start`/`comment_end`: Separate `/*` and `*/` detection
- Enhanced preprocessor directives

#### Common:

- `closing_brace`: Standalone `}`, `]`, `)` lines
- `object_literal`: `{key:` patterns

### Problem Addressed
Missing closing braces and complex data structures were being under-detected, causing structural analysis failures.

## Comprehensive Testing Framework Implementation

Focus: Systematic testing infrastructure for algorithm validation

```python
import glob

def test_all_sample_files():
    # Automated testing across all sample files
    all_results = []
    test_files = glob.glob("tests/test_samples/*.txt")
    for file_path in test_files:
        result, file_result = run_file_test(file_path)
        all_results.append(file_result)
    save_results_to_file(all_results)
```

### Key Features

- Automated file discovery: Tests all `.txt` files in the samples directory
- Structured output: Results saved to `detection_results.txt` with timestamps
- Error handling: File existence and encoding checks
- Detailed logging: Block-by-block analysis with confidence scores

### Testing Strategy

- Sample management: Active tests in `test_samples/`, inactive in `more_tests/`
- Pytest integration: `python -m pytest tests/test_detector.py -v -s`
- Assertion-based validation: Ensures at least 1 block is detected per file

### Data Collection Format

```python
{
    "file": filename,
    "total_blocks": count,
    "blocks": [
        {"language": "python", "confidence": 1.2, "content": [...]}
    ]
}
```

### Status

- Testing infrastructure complete
- Ready for Stack Overflow dataset validation
- Systematic performance tracking enabled

## Advanced Cross-Language Correction System

Major Feature: reassign_based_on_structure() - intelligent line movement between language blocks

```python
def reassign_based_on_structure(self, blocks):
    # Detect C blocks missing closing braces
    if block['language'] == 'c' and missing_closes:
        # Find Python blocks with standalone closing braces
        if first_line in ['}', ']', ')']:
            self.move_line_between_blocks(other_block, block, 0)
```

### Key Innovation

Cross-language structural repair: When C code is missing closing braces, the algorithm searches other blocks for misclassified closing elements and reassigns them.

### Enhanced Pattern System

- Penalty-based scoring: Patterns now penalize competing languages
- Format: `(regex, positive_weight, {penalty_dict})`
- Example: A Python `def` adds 0.8 to the Python score and subtracts 0.9 from the C score
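
A minimal sketch of penalty-based scoring in the `(regex, positive_weight, {penalty_dict})` format follows. The `def` entry mirrors the example above; the C `include` entry, its penalty, and the function name `score_line` are assumptions added for illustration.

```python
import re

# language -> {pattern_name: (regex, positive_weight, {other_language: penalty})}
PATTERNS = {
    "python": {
        "function_def": (r"^\s*def\s+\w+\s*\(.*\)\s*:", 0.8, {"c": 0.9}),
    },
    "c": {
        # Hypothetical entry; weights are not taken from the project.
        "include": (r"^\s*#include\s*[<\"]", 0.8, {"python": 0.9}),
    },
}


def score_line(line):
    """Add each matching pattern's weight to its language, subtract its penalties."""
    scores = {lang: 0.0 for lang in PATTERNS}
    for lang, patterns in PATTERNS.items():
        for regex, weight, penalties in patterns.values():
            if re.search(regex, line):
                scores[lang] += weight
                for other, penalty in penalties.items():
                    scores[other] -= penalty
    return scores
```

The penalty widens the gap between competing languages, so a line that weakly matches a shared pattern is less likely to be claimed by the wrong one.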

### Block Expansion Improvements

- Safe boundary handling: Prevents array bounds errors
- Set-based line tracking: Efficiently merges comment ranges
- Content reconstruction: Rebuilds from original text

## Orphaned Bracket Recovery + Natural Language Filtering

Two Major Additions:

1. Orphaned Bracket Recovery

```python
def merge_orphaned_brackets(self, code_blocks, original_lines):
    # Search 3 lines before/after each block for standalone brackets
    if line in ['{', '}', ')', ']', '(', '[']:
        # Expand block boundaries to include orphaned brackets
        ...
```
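
The three-line search window can be sketched for a single block as below. The block shape (inclusive `start`/`end`), the `window` parameter name, and the choice to absorb any lines lying between the block and the bracket are assumptions; the project's method operates on a list of blocks.

```python
ORPHAN_BRACKETS = {"{", "}", "(", ")", "[", "]"}


def merge_orphaned_brackets(block, original_lines, window=3):
    """Extend a block to reach standalone bracket lines within `window` lines."""
    start, end = block["start"], block["end"]

    # Look backwards from the block start, nearest line first.
    for i in range(start - 1, max(start - window, 0) - 1, -1):
        if original_lines[i].strip() in ORPHAN_BRACKETS:
            start = i

    # Look forwards from the block end.
    last_index = len(original_lines) - 1
    for i in range(end + 1, min(end + window, last_index) + 1):
        if original_lines[i].strip() in ORPHAN_BRACKETS:
            end = i

    return {
        **block,
        "start": start,
        "end": end,
        "content": original_lines[start:end + 1],
    }
```

Stripping whitespace before the membership test means an indented `}` on its own line still counts as an orphan.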

2. Natural Language Penalty Patterns

```python
# Anti-patterns with negative weights
'articles': (r'\b(the|a|an)\s+\w+', -0.3, {}),
'question_words': (r'\b(what|how|why)\b', -0.7, {}),
'full_sentences': (r'\w+\s+\w+\s+\w+\s+\w+\s+\w+', -0.2, {}),
```

### Stack Overflow Integration

- API integration: Downloads real SO questions with code blocks
- HTML processing: Extracts clean text and expected code block counts
- Test case generation: Creates standardized test files
- Language-specific testing: Separate C and Python question sets

### Key Innovation
Negative scoring: Natural language patterns now actively reduce code likelihood scores, improving discrimination between explanatory text and actual code.
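
How negative weights pull a prose line below the detection threshold can be sketched like this. The three regexes and weights are the ones listed in this log; the case-insensitive matching and the helper names `natural_language_penalty` and `adjusted_score` are assumptions.

```python
import re

# name -> (regex, negative weight), taken from the anti-patterns in this log.
NATURAL_LANGUAGE_PATTERNS = {
    "articles": (r"\b(the|a|an)\s+\w+", -0.3),
    "question_words": (r"\b(what|how|why)\b", -0.7),
    "full_sentences": (r"\w+\s+\w+\s+\w+\s+\w+\s+\w+", -0.2),
}


def natural_language_penalty(line):
    """Sum the negative weights of every prose pattern found in the line."""
    return sum(
        weight
        for regex, weight in NATURAL_LANGUAGE_PATTERNS.values()
        # IGNORECASE is an assumption so that "How" matches like "how".
        if re.search(regex, line, re.IGNORECASE)
    )


def adjusted_score(code_score, line):
    """Combine a positive code score with the prose penalty."""
    return code_score + natural_language_penalty(line)
```

A question such as "How do I print the value in a loop" triggers all three patterns, so even a moderate positive code score ends up well below 0.4.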

### Status

Complete pipeline from SO API → test cases → algorithm validation with real-world mixed content.