<a href="https://colab.research.google.com/github/ShrikantKGIT/general/blob/main/Aura_Language_Parser_for_an_IDE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**A Python script that parses an imaginary language named "Aura" to simulate the backend logic of an IDE.**

This script demonstrates the core components needed to understand source code:

**Token Definitions:** It starts by defining the "vocabulary" of the Aura language using regular expressions. This includes keywords (let, print), data types (STRING, NUMBER), and symbols.
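
As a minimal sketch of why rule order matters in such a vocabulary, the snippet below uses a reduced, hypothetical three-rule table (`SPEC`, `MASTER`, and `classify` are illustrative names, not part of the script). It assumes keywords are listed before the general identifier rule and guarded with `\b`, so that `letter` is not mistaken for `let`.

```python
import re

# Hypothetical subset of an Aura-like token table; the order of the rules matters.
SPEC = [
    ("KEYWORD", r"\b(?:let|print)\b"),       # tried first, whole words only
    ("VARIABLE", r"[A-Za-z_][A-Za-z0-9_]*"),  # any other identifier
    ("SKIP", r"\s+"),                         # whitespace, dropped below
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in SPEC))


def classify(source):
    """Return (kind, text) pairs for each non-whitespace match."""
    return [
        (m.lastgroup, m.group())
        for m in MASTER.finditer(source)
        if m.lastgroup != "SKIP"
    ]


print(classify("let letter print printer"))
# [('KEYWORD', 'let'), ('VARIABLE', 'letter'), ('KEYWORD', 'print'), ('VARIABLE', 'printer')]
```

Listing `VARIABLE` first would classify `let` as a variable, because the alternation takes the first rule that matches at each position.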

**Lexer (tokenize function):** This function scans the raw source code and breaks it into a list of Token objects. In an IDE, this list would be used directly to apply syntax highlighting (e.g., color all KEYWORD tokens blue, all STRING tokens green).
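
The highlighting step could be sketched roughly as follows. The `Tok` tuple is a stand-in for the script's `Token` class, and `THEME` and `highlight_spans` are hypothetical names; the color choices are assumptions made for illustration only.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    """Stand-in for the script's Token class (same four fields)."""
    type: str
    value: str
    line: int
    column: int


# Hypothetical color theme; token types without an entry use the default color.
THEME = {"KEYWORD": "blue", "STRING": "green", "NUMBER": "orange", "COMMENT": "gray"}


def highlight_spans(tokens, default="black"):
    """Return (line, start_col, end_col, color) for each token."""
    return [
        (t.line, t.column, t.column + len(t.value), THEME.get(t.type, default))
        for t in tokens
    ]


tokens = [
    Tok("KEYWORD", "let", 3, 4),
    Tok("VARIABLE", "score", 3, 8),
    Tok("NUMBER", "100", 3, 16),
]
for span in highlight_spans(tokens):
    print(span)
```

An editor widget would then paint each span in its color; because every token carries its own line and column, no second pass over the source text is needed.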

**Parser:** This class consumes the token list and checks if it follows the language's grammatical rules (e.g., a let statement must have the structure let <variable> = <value>;). It collects any syntax errors it finds.
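
To illustrate the kind of rule the parser enforces, here is a small standalone sketch that checks a single `let` statement given as `(type, value)` pairs. The name `check_let` and the choice to drop `COMMENT` tokens before checking are assumptions of this sketch, not the script's behavior.

```python
VALUE_TYPES = {"NUMBER", "STRING", "VARIABLE"}


def check_let(tokens):
    """Return None if the pairs form `let <variable> = <value>;`, else a message."""
    # Comments carry no grammar, so this sketch drops them before checking.
    toks = [t for t in tokens if t[0] != "COMMENT"]
    rules = [
        ("'let'", lambda t: t == ("KEYWORD", "let")),
        ("a variable name", lambda t: t[0] == "VARIABLE"),
        ("'='", lambda t: t == ("OPERATOR", "=")),
        ("a value", lambda t: t[0] in VALUE_TYPES),
        ("';'", lambda t: t[0] == "SEMICOLON"),
    ]
    for i, (label, matches) in enumerate(rules):
        if i >= len(toks) or not matches(toks[i]):
            return f"expected {label} at token {i}"
    if len(toks) > len(rules):
        return "unexpected token after ';'"
    return None


good = [("KEYWORD", "let"), ("VARIABLE", "score"), ("OPERATOR", "="),
        ("NUMBER", "100"), ("SEMICOLON", ";")]
bad = [("KEYWORD", "let"), ("VARIABLE", "x"), ("NUMBER", "150"), ("SEMICOLON", ";")]

print(check_let(good))
print(check_let(bad))
```

A table of expected shapes like this works for fixed-form statements; a grammar with nested expressions would need a recursive approach instead.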

**IDE Simulation:** The analyze_code_for_ide function ties everything together. It takes a string of code, tokenizes it, and then parses it, returning the tokens and a list of errors—exactly the information an IDE needs to display colored text and red squiggly underlines.
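
Since the parser reports errors as formatted strings, an editor front end would probably want them back in structured form before drawing underlines. The sketch below assumes the message layout `Syntax Error at Line <n>, Col <n>: <message>` used by the script; `ERROR_PATTERN` and `to_diagnostic` are hypothetical names.

```python
import re

# Matches the error strings built by the parser; -1 is allowed because the
# script's end-of-file placeholder token uses line and column -1.
ERROR_PATTERN = re.compile(r"Syntax Error at Line (-?\d+), Col (-?\d+): (.*)")


def to_diagnostic(error_msg):
    """Split an error string into line, column, and message, or return None."""
    match = ERROR_PATTERN.fullmatch(error_msg)
    if match is None:
        return None
    return {
        "line": int(match.group(1)),
        "column": int(match.group(2)),
        "message": match.group(3),
    }


print(to_diagnostic(
    "Syntax Error at Line 2, Col 4: Statements must begin with a keyword (let, print)."
))
```

Returning structured errors from the parser directly would avoid this round trip through text; the regex approach is shown only because it needs no change to the existing code.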

**The script ends with examples of both valid and invalid "Aura" code.**

In [1]:
import re

# --- 1. Define the Language's Tokens ---
# These are the "words" our language understands.
TOKEN_SPECIFICATION = [
    ('COMMENT',   r'//.*'),          # Comments
    ('STRING',    r'".*?"'),          # Strings in double quotes
    ('NUMBER',    r'\d+(\.\d*)?'),    # Integer or float numbers
    ('KEYWORD',   r'\b(let|print)\b'),# Keywords: let, print
    ('VARIABLE',  r'[A-Za-z_][A-Za-z0-9_]*'), # Variable names
    ('OPERATOR',  r'[=+\-]'),         # Mathematical and assignment operators
    ('SEMICOLON', r';'),              # Statement terminator
    ('NEWLINE',   r'\n'),             # Line breaks
    ('SKIP',      r'[ \t]+'),         # Skip over spaces and tabs
    ('MISMATCH',  r'.'),              # Any other character is an error
]

# Create a single regex for tokenizing
TOKEN_REGEX = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPECIFICATION))

class Token:
    """A simple class to hold token information."""
    def __init__(self, type, value, line, column):
        self.type = type
        self.value = value
        self.line = line
        self.column = column

    def __repr__(self):
        return f"Token({self.type}, '{self.value}', {self.line}, {self.column})"

# --- 2. The Lexer (Tokenizer) ---
# Scans the code and produces a stream of tokens.

def tokenize(code):
    """
    Generates a sequence of tokens from a string of code.
    """
    tokens = []
    line_num = 1
    line_start = 0
    for mo in TOKEN_REGEX.finditer(code):
        kind = mo.lastgroup
        value = mo.group()
        column = mo.start() - line_start

        if kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
            continue # Don't store newline tokens, but track line number
        elif kind == 'SKIP':
            continue # Ignore whitespace
        elif kind == 'MISMATCH':
            # This is where an IDE would flag an "unrecognized character" error
            raise RuntimeError(f'Unexpected character: {value!r} on line {line_num}')

        tokens.append(Token(kind, value, line_num, column))
    return tokens

# --- 3. The Parser ---
# Analyzes the token stream to check for correct grammar.

class Parser:
    """
    Parses a list of tokens to check for syntax errors.
    In a real IDE, this would build an Abstract Syntax Tree (AST).
    For simplicity, we'll just validate the structure.
    """
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.errors = []

    def parse(self):
        """Starts the parsing process."""
        while self.pos < len(self.tokens):
            token = self.current_token()
            if token.type == 'KEYWORD':
                if token.value == 'let':
                    self.parse_let_statement()
                elif token.value == 'print':
                    self.parse_print_statement()
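            else:
                # Note: COMMENT tokens are not filtered out before parsing, so a
                # comment at the start of a statement also lands in this branch.
                self.error(token, "Statements must begin with a keyword (let, print).")
                # Skip to the next potential statement to find more errors
                self.advance_to_next_statement()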

        return self.errors

    def parse_let_statement(self):
        """Parses: let <VAR> = <VALUE>;"""
        self.consume('KEYWORD') # Consume 'let'

        # Expect a variable
        if self.current_token().type != 'VARIABLE':
            self.error(self.current_token(), "Expected a variable name after 'let'.")
            self.advance_to_next_statement()
            return
        self.consume('VARIABLE')

        # Expect an equals sign
        if self.current_token().type != 'OPERATOR' or self.current_token().value != '=':
            self.error(self.current_token(), "Expected '=' after variable name.")
            self.advance_to_next_statement()
            return
        self.consume('OPERATOR')

        # Expect a value (Number, String, or another Variable)
        if self.current_token().type not in ('NUMBER', 'STRING', 'VARIABLE'):
            self.error(self.current_token(), "Expected a value (number, string, or variable) after '='.")
            self.advance_to_next_statement()
            return
        self.consume(self.current_token().type)

        # Expect a semicolon
        if self.current_token().type != 'SEMICOLON':
            self.error(self.current_token(), "Missing semicolon ';' at the end of the statement.")
            self.advance_to_next_statement()
            return
        self.consume('SEMICOLON')

    def parse_print_statement(self):
        """Parses: print <VAR_OR_VALUE>;"""
        self.consume('KEYWORD') # Consume 'print'

        # Expect a value to print
        if self.current_token().type not in ('VARIABLE', 'NUMBER', 'STRING'):
            self.error(self.current_token(), "Expected a variable, number, or string after 'print'.")
            self.advance_to_next_statement()
            return
        self.consume(self.current_token().type)

        # Expect a semicolon
        if self.current_token().type != 'SEMICOLON':
            self.error(self.current_token(), "Missing semicolon ';' at the end of the statement.")
            self.advance_to_next_statement()
            return
        self.consume('SEMICOLON')

    # --- Helper methods ---
    def current_token(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        # Return a dummy 'EOF' token to prevent crashes
        return Token('EOF', '', -1, -1)

    def consume(self, expected_type):
        if self.current_token().type == expected_type:
            self.pos += 1
        else:
            self.error(self.current_token(), f"Expected token type {expected_type} but got {self.current_token().type}.")

    def error(self, token, message):
        error_msg = f"Syntax Error at Line {token.line}, Col {token.column}: {message}"
        self.errors.append(error_msg)

    def advance_to_next_statement(self):
        """Error recovery: find the next semicolon to resume parsing."""
        while self.pos < len(self.tokens) and self.current_token().type != 'SEMICOLON':
            self.pos += 1
        if self.pos < len(self.tokens):
            self.pos += 1 # Move past the semicolon


# --- 4. Example Usage: Simulating an IDE ---

def analyze_code_for_ide(code):
    """
    Analyzes a piece of code to provide IDE-like feedback.

    Returns:
        A tuple of (tokens_for_highlighting, list_of_errors).
    """
    print("--- Analyzing Code for IDE ---")

    # 1. Lexing (for syntax highlighting)
    try:
        tokens = tokenize(code)
        print("Lexing successful. Tokens generated:")
        for t in tokens:
            print(f"  {t}")
    except RuntimeError as e:
        print(f"Lexing failed: {e}")
        return [], [str(e)] # Return early if lexing fails

    # 2. Parsing (for error checking)
    parser = Parser(tokens)
    errors = parser.parse()
    print("\nParsing complete.")

    return tokens, errors


if __name__ == "__main__":
    # Example code in our imaginary "Aura" language
    aura_code_good = """
    // This is a valid program
    let score = 100;
    let name = "Player1";
    print score;
    """

    aura_code_bad = """
    // This program has syntax errors
    let x 150;          // Missing '='
    let y = "hello"     // Missing semicolon
    print z;
    show z;             // 'show' is not a valid keyword
    """

    print("--- Processing GOOD code ---")
    good_tokens, good_errors = analyze_code_for_ide(aura_code_good)
    if not good_errors:
        print("\nResult: No syntax errors found. Ready for highlighting!")
    else:
        print("\nResult: Errors found:")
        for err in good_errors:
            print(f"  - {err}")

    print("\n" + "="*40 + "\n")

    print("--- Processing BAD code ---")
    bad_tokens, bad_errors = analyze_code_for_ide(aura_code_bad)
    if not bad_errors:
        print("\nResult: No syntax errors found.")
    else:
        print("\nResult: Errors found:")
        for err in bad_errors:
            print(f"  - {err}")


--- Processing GOOD code ---
--- Analyzing Code for IDE ---
Lexing successful. Tokens generated:
  Token(COMMENT, '// This is a valid program', 2, 4)
  Token(KEYWORD, 'let', 3, 4)
  Token(VARIABLE, 'score', 3, 8)
  Token(OPERATOR, '=', 3, 14)
  Token(NUMBER, '100', 3, 16)
  Token(SEMICOLON, ';', 3, 19)
  Token(KEYWORD, 'let', 4, 4)
  Token(VARIABLE, 'name', 4, 8)
  Token(OPERATOR, '=', 4, 13)
  Token(STRING, '"Player1"', 4, 15)
  Token(SEMICOLON, ';', 4, 24)
  Token(KEYWORD, 'print', 5, 4)
  Token(VARIABLE, 'score', 5, 10)
  Token(SEMICOLON, ';', 5, 15)

Parsing complete.

Result: Errors found:
  - Syntax Error at Line 2, Col 4: Statements must begin with a keyword (let, print).

========================================

--- Processing BAD code ---
--- Analyzing Code for IDE ---
Lexing successful. Tokens generated:
  Token(COMMENT, '// This program has syntax errors', 2, 4)
  Token(KEYWORD, 'let', 3, 4)
  Token(VARIABLE, 'x', 3, 8)
  Token(NUMBER, '150', 3, 10)
  Token(SEMICOLON, ';', 3, 13)
  Token(COMMENT, '// Missing '='',