# Ripgrep Contextual Search Example
This notebook demonstrates how to use ripgrep to search for patterns in files, including lines before and after each match, and store the results in a pandas DataFrame for contextual analysis.

# Notebook Documentation & Usage Guide
The sections below explain how to run the notebook, query its results, and customize it.

## How to Use
1. **Run a Search:** Use the provided function to search for a pattern in your codebase. Adjust the search path, pattern, and context lines as needed.
2. **Analyze Results:** Results are loaded into a pandas DataFrame with rich metadata (file info, context, timestamps, match counts).
3. **Query & Filter:** Use example queries to filter results by context, frequency, or other criteria.
4. **Visualize:** Highlight found expressions in context for easier review.
5. **Suggest Terms:** Use fuzzy matching to get suggestions for partial search terms.

## Example Queries
- Find matches where context contains specific keywords (e.g., 'init', 'plot_yearly').
- Order results by most frequent expressions.
- Visualize context with highlighted terms.
- Suggest search terms based on partial input.

## Customization Tips
- Change the search path and pattern to target different codebases or expressions.
- Adjust the number of context lines before/after matches.
- Add new columns or filters to the DataFrame for deeper analysis.
- Integrate widgets for interactive filtering or input.
- Export results to CSV/Excel for external use.

## Troubleshooting
- Use the debug cell to print raw ripgrep output if parsing issues occur.
- Ensure ripgrep is installed and available in your system PATH.
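
One way to check the second point from inside the notebook is sketched below. The helper name `ripgrep_version` is hypothetical; it only assumes that the executable is called `rg`, as in the rest of this notebook.

```python
import shutil
import subprocess

def ripgrep_version(executable='rg'):
    """Return ripgrep's version line, or None if the executable is not on PATH."""
    location = shutil.which(executable)
    if location is None:
        return None
    result = subprocess.run([location, '--version'], capture_output=True, text=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None

version = ripgrep_version()
print(version or "ripgrep ('rg') was not found on PATH")
```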

## Extending the Notebook
- Add widgets for interactive search and filtering.
- Visualize match statistics with charts.
- Integrate with version control for change tracking.
- Add advanced error handling and logging.
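
For the match-statistics idea, a possible starting point is to count result rows per file. This is only a sketch: the `sample` frame and its file names are hypothetical stand-ins for the `file_name` column of the results, and the plotting line is left commented out because it assumes matplotlib is installed.

```python
import pandas as pd

# Hypothetical stand-in for the 'file_name' column of the results DataFrame
sample = pd.DataFrame({'file_name': ['a.py', 'b.py', 'a.py', 'c.py', 'a.py', 'b.py']})

# Number of result rows per file, largest first
matches_per_file = sample['file_name'].value_counts()
print(matches_per_file)

# With matplotlib installed, the same Series can be drawn as a horizontal bar chart:
# matches_per_file.plot.barh(title='Matches per file')
```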

---

In [1]:
# Import required libraries
import subprocess
import pandas as pd

## Define ripgrep search function with context
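This function runs ripgrep with `--json` plus `-B`/`-A` to include lines before and after each match, then parses the JSON messages into a DataFrame with one row per submatch, each carrying its context lines and file metadata.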

In [33]:
import json
import os
from pathlib import Path
import datetime
# Ten useful ripgrep arguments (pass them via extra_args):
# 1. --type <type>        : Restrict search to a file type (e.g. py, md, js)
# 2. --ignore-case        : Case-insensitive search
# 3. --max-depth <num>    : Limit directory traversal depth
# 4. --files              : List files that would be searched
# 5. --count              : Show count of matching lines per file
# 6. --hidden             : Search hidden files and folders
# 7. --glob <pattern>     : Include/exclude files by glob pattern
# 8. --multiline          : Enable multiline search
# 9. --replace <string>   : Replace matches in output
# 10. --sort <criteria>   : Sort results (path, modified, etc.)

def ripgrep_search_with_context(pattern, path='.', before=2, after=2, extra_args=None):
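    # Options first, then the pattern via -e (so a pattern starting with '-' is not read as a flag)
    cmd = ['rg', '--json', f'-B{before}', f'-A{after}']
    if extra_args:
        cmd.extend(extra_args)
    cmd.extend(['-e', pattern, path])
    # Decode explicitly as UTF-8 instead of relying on the platform's locale encoding
    result = subprocess.run(cmd, capture_output=True, text=True, encoding='utf-8', errors='replace')
    lines = result.stdout.strip().split('\n')
    search_path = Path(path).resolve()
    grouped = {}
    match_counts = {}
    # Initialise per-file state so a stray message before the first 'begin' cannot raise NameError
    current_file = None
    current_group = []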
    for line in lines:
        if not line.strip():
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            # Skip any line that is not a valid JSON message
            continue
        if obj.get('type') == 'begin':
            current_file = Path(obj['data']['path']['text']).resolve()
            current_group = []
        elif obj.get('type') in ['match', 'context']:
            current_group.append(obj)
        elif obj.get('type') == 'end':
            # Count matches in this file
            file_path = str(current_file)
            match_count = sum(1 for item in current_group if item.get('type') == 'match')
            match_counts[file_path] = match_count
            # Process group for this file
            for i, item in enumerate(current_group):
                if item.get('type') == 'match':
                    context_lines = []
                    for j in range(i-before, i):
                        if 0 <= j < len(current_group) and current_group[j].get('type') == 'context':
                            context_lines.append(current_group[j]['data']['lines']['text'])
                    match_text = item['data']['lines']['text']
                    context_lines.append(match_text)
                    for j in range(i+1, i+1+after):
                        if 0 <= j < len(current_group) and current_group[j].get('type') == 'context':
                            context_lines.append(current_group[j]['data']['lines']['text'])
                    file_path_obj = Path(item['data']['path']['text']).resolve()
                    try:
                        stat = file_path_obj.stat()
                        created = getattr(stat, 'st_birthtime', stat.st_ctime)
                        created = datetime.datetime.fromtimestamp(created)
                        modified = datetime.datetime.fromtimestamp(stat.st_mtime)
                    except Exception:
                        created = None
                        modified = None
                    try:
                        folder = str(file_path_obj.parent.relative_to(search_path))
                    except ValueError:
                        folder = str(file_path_obj.parent)
                    file_info = {
                        'file': str(file_path_obj),
                        'search_path': str(search_path),
                        'folder': folder,
                        'file_name': file_path_obj.name,
                        'file_ext': file_path_obj.suffix,
                        'created': created,
                        'modified': modified,
                        'match_count_in_file': match_count
                    }
                    for submatch in item['data']['submatches']:
                        row = {
                            'line': item['data']['line_number'],
                            'col': submatch['start'],
                            'text': match_text,
                            'context': '\n'.join(context_lines),
                            'type': 'match'
                        }
                        row.update(file_info)
                        grouped.setdefault(str(file_path_obj), []).append(row)
    data = [row for rows in grouped.values() for row in rows]
    return pd.DataFrame(data)

## Run a contextual search
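Search for the pattern 'def' in Python files, including 2 lines before and after each match, and display the results. The pattern matches the substring anywhere on a line (for example inside 'default'), not only function definitions; pass `-w` in `extra_args` to match whole words only.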

In [34]:
df = ripgrep_search_with_context('def', path='C:/work/GitHub/dec-tree-py', before=2, after=2, extra_args=['--type', 'py'])
#df.info()
df

Unnamed: 0,line,col,text,context,type,file,search_path,folder,file_name,file_ext,created,modified,match_count_in_file
0,3,0,def test_m_classifier():\r\n,from src.dec_tree.m_classifier import M_Classi...,match,C:\work\GitHub\dec-tree-py\Python\tests\m_clas...,C:\work\GitHub\dec-tree-py,Python\tests,m_classifier_test.py,.py,2024-08-16 13:24:38.139333,2024-08-16 13:37:56.382003,1
1,16,4,def __init__(self):\r\n,STDOUT_END = 'out_end_###'\r\n\n\r\n\n ...,match,C:\work\GitHub\dec-tree-py\Python\src\child_ch...,C:\work\GitHub\dec-tree-py,Python\src,child_channel.py,.py,2024-08-08 14:10:38.994563,2024-08-16 17:16:53.051790,5
2,24,4,"def data_received(self, data: bytes):\r\n",self.json_object = None\r\n\n\r\n\n ...,match,C:\work\GitHub\dec-tree-py\Python\src\child_ch...,C:\work\GitHub\dec-tree-py,Python\src,child_channel.py,.py,2024-08-08 14:10:38.994563,2024-08-16 17:16:53.051790,5
3,49,4,def get_json(self):\r\n,\r\n\n\r\n\n def get_json(self):\r\n\n ...,match,C:\work\GitHub\dec-tree-py\Python\src\child_ch...,C:\work\GitHub\dec-tree-py,Python\src,child_channel.py,.py,2024-08-08 14:10:38.994563,2024-08-16 17:16:53.051790,5
4,55,4,"def encode(self, jObj):\r\n",self.json_object = json.loads(self.dec...,match,C:\work\GitHub\dec-tree-py\Python\src\child_ch...,C:\work\GitHub\dec-tree-py,Python\src,child_channel.py,.py,2024-08-08 14:10:38.994563,2024-08-16 17:16:53.051790,5
5,62,4,"def reply(self, jObj):\r\n",return encoded_string\r\n\n \r\n\n ...,match,C:\work\GitHub\dec-tree-py\Python\src\child_ch...,C:\work\GitHub\dec-tree-py,Python\src,child_channel.py,.py,2024-08-08 14:10:38.994563,2024-08-16 17:16:53.051790,5
6,20,4,def __init__(self):\r\n,"docstring\r\n\n """"""\r\n\n def __init...",match,C:\work\GitHub\dec-tree-py\Python\src\child_ex...,C:\work\GitHub\dec-tree-py,Python\src,child_exec.py,.py,2024-08-08 14:10:38.994563,2024-08-18 23:57:57.275554,7
7,28,4,"def __call__(self, *args: Any, **kwds: Any...",self.iset = M_SettingsSingleton()\r\n\...,match,C:\work\GitHub\dec-tree-py\Python\src\child_ex...,C:\work\GitHub\dec-tree-py,Python\src,child_exec.py,.py,2024-08-08 14:10:38.994563,2024-08-18 23:57:57.275554,7
8,39,4,def __init__(self):\r\n,"docstring\r\n\n """"""\r\n\n def __init...",match,C:\work\GitHub\dec-tree-py\Python\src\child_ex...,C:\work\GitHub\dec-tree-py,Python\src,child_exec.py,.py,2024-08-08 14:10:38.994563,2024-08-18 23:57:57.275554,7
9,45,4,"def __call__(self, *args: Any, **kwds: Any...",super().__init__()\r\n\n\r\n\n def ...,match,C:\work\GitHub\dec-tree-py\Python\src\child_ex...,C:\work\GitHub\dec-tree-py,Python\src,child_exec.py,.py,2024-08-08 14:10:38.994563,2024-08-18 23:57:57.275554,7
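
A table like the one above can be narrowed down with ordinary boolean indexing. The following is a minimal sketch on a hand-built `sample` frame that copies four rows and four columns from the output above; the variable names `busy_inits` and `per_file` are made up for the example.

```python
import pandas as pd

# Four rows copied from the result table above, reduced to four columns
sample = pd.DataFrame({
    'file_name': ['m_classifier_test.py', 'child_channel.py', 'child_channel.py', 'child_exec.py'],
    'line': [3, 16, 24, 20],
    'text': [
        'def test_m_classifier():\r\n',
        '    def __init__(self):\r\n',
        '    def data_received(self, data: bytes):\r\n',
        '    def __init__(self):\r\n',
    ],
    'match_count_in_file': [1, 5, 5, 7],
})

# Constructor definitions in files that contain at least five matches
is_init = sample['text'].str.contains('__init__', regex=False)
busy_inits = sample[(sample['match_count_in_file'] >= 5) & is_init]

# One row per file, ordered by how many matches the file contains
per_file = (
    sample.drop_duplicates('file_name')[['file_name', 'match_count_in_file']]
    .sort_values('match_count_in_file', ascending=False)
)
print(busy_inits[['file_name', 'line']])
print(per_file)
```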


In [31]:
# Example query: find matches whose context contains 'init' or 'plot_yearly',
# then rank the words that occur most often in those contexts
import re
from collections import Counter

query_terms = ['init', 'plot_yearly']
# Escape each term so it is matched literally rather than as a regular expression
mask = df['context'].str.contains('|'.join(map(re.escape, query_terms)), case=False, na=False)
filtered = df[mask]

def extract_expressions(context):
    # Find all words (identifiers) in context
    return re.findall(r'\b\w+\b', context)

# Count words across all filtered contexts; dropna() skips contexts that contain no words
all_expressions = filtered['context'].apply(extract_expressions).explode().dropna()
expression_counts = Counter(all_expressions)
# Show top 10 most frequent expressions
top_expressions = expression_counts.most_common(10)
pd.DataFrame(top_expressions, columns=['expression', 'count'])

Unnamed: 0,expression,count
0,self,28
1,def,17
2,__init__,12
3,docstring,11
4,the,6
5,class,6
6,item,5
7,name,5
8,settings,4
9,to,4


In [32]:
# Visualize found context lines, highlighting found expressions
import html
import re
from IPython.display import display, HTML

def highlight_expressions(text, expressions):
    # Escape first so that '<', '>' and '&' in source code render literally
    text = html.escape(text)
    if not expressions:
        return text
    # One combined pattern: later terms cannot match inside markup added for earlier ones
    pattern = re.compile('|'.join(re.escape(html.escape(expr)) for expr in expressions), re.IGNORECASE)
    return pattern.sub(lambda m: f'<span style="background-color: yellow;">{m.group(0)}</span>', text)

# Show first 10 filtered results with highlighted expressions
highlight_terms = query_terms
rows = []
for idx, row in filtered.head(10).iterrows():
    highlighted = highlight_expressions(row['context'], highlight_terms)
    rows.append(f'<pre>{highlighted}</pre>')
display(HTML('<hr>'.join(rows)))

In [20]:
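# Debug: Print raw ripgrep output to help diagnose parsing issues
# (same options as ripgrep_search_with_context, so this is what the parser receives)
cmd = ['rg', '--json', '-B2', '-A2', '--type', 'py', 'def', 'C:/work/GitHub/dec-tree-py']
result = subprocess.run(cmd, capture_output=True, text=True, encoding='utf-8', errors='replace')
print(result.stdout)
# Error messages, such as an unreadable path, arrive on stderr
print(result.stderr)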

{"type":"begin","data":{"path":{"text":"C:/work/GitHub/dec-tree-py\\Python\\tests\\m_classifier_test.py"}}}
{"type":"context","data":{"path":{"text":"C:/work/GitHub/dec-tree-py\\Python\\tests\\m_classifier_test.py"},"lines":{"text":"from src.dec_tree.m_classifier import M_Classifier\r\n"},"line_number":1,"absolute_offset":0,"submatches":[]}}
{"type":"context","data":{"path":{"text":"C:/work/GitHub/dec-tree-py\\Python\\tests\\m_classifier_test.py"},"lines":{"text":"\r\n"},"line_number":2,"absolute_offset":52,"submatches":[]}}
{"type":"match","data":{"path":{"text":"C:/work/GitHub/dec-tree-py\\Python\\tests\\m_classifier_test.py"},"lines":{"text":"def test_m_classifier():\r\n"},"line_number":3,"absolute_offset":54,"submatches":[{"match":{"text":"def"},"start":0,"end":3}]}}
{"type":"context","data":{"path":{"text":"C:/work/GitHub/dec-tree-py\\Python\\tests\\m_classifier_test.py"},"lines":{"text":"    m_classifier = M_Classifier()\r\n"},"line_number":4,"absolute_offset":80,"submatches":[]}}
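
When the raw output is empty instead, the exit status of the process usually explains why. ripgrep documents exit status 0 when a match was found, 1 when nothing matched, and 2 when an error occurred. The helper below is a hypothetical sketch that turns `result.returncode` and `result.stderr` from the debug cell into a readable message; the function name is an assumption.

```python
def describe_ripgrep_exit(returncode, stderr=''):
    """Translate a ripgrep exit status into a short diagnostic message."""
    if returncode == 0:
        return 'matches found'
    if returncode == 1:
        return 'no matches (check the pattern, path and --type filter)'
    # Any other status is treated as an error; stderr carries ripgrep's own message
    return f'ripgrep error (exit {returncode}): {stderr.strip()}'

# In the notebook this would be called as describe_ripgrep_exit(result.returncode, result.stderr)
print(describe_ripgrep_exit(1))
```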

In [42]:
# Suggest search terms based on partial input using fuzzy matching
from difflib import get_close_matches

# Collect all unique words from the DataFrame 'df' context column
all_words = set()
for context_text in df['context'].dropna():
    all_words.update(re.findall(r'\b\w+\b', context_text))

def suggest_terms(partial, n=10, cutoff=0.6):
    """Suggest possible search terms based on partial input."""
    return get_close_matches(partial, all_words, n=n, cutoff=cutoff)

# Example usage: suggest terms for partial input 'tance'
partial_input = 'tance'
suggestions = suggest_terms(partial_input)
print(f"Suggestions for '{partial_input}':", suggestions)

Suggestions for 'tance': ['_instance', 'importance']
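
`get_close_matches` ranks candidates by a similarity ratio that also depends on word length, so a long identifier that contains the partial input can still fall below the cutoff. The sketch below contrasts fuzzy and plain substring suggestions on a small hypothetical vocabulary; in the notebook the `all_words` set would take its place.

```python
from difflib import get_close_matches

# Hypothetical vocabulary standing in for the notebook's all_words set
words = ['_instance', 'importance', 'distance', 'feature_importances_', 'init']
partial = 'tance'

# Fuzzy suggestions: similarity ratio must reach the cutoff, best match first
fuzzy = get_close_matches(partial, words, n=10, cutoff=0.6)

# Substring suggestions: every word that literally contains the partial input
substring = sorted(w for w in words if partial in w)

print('fuzzy:    ', fuzzy)
print('substring:', substring)
```

Combining both lists (fuzzy results first, then any remaining substring hits) is one possible way to make the suggestions less sensitive to word length.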
