# Assignment no. 17

In [None]:
# Q1. Explain the difference between greedy and non-greedy syntax with visual terms in as few words
# as possible. What is the bare minimum effort required to transform a greedy pattern into a non-greedy
# one? What character or characters can you introduce or change?
# Ans=
Greedy and non-greedy (lazy) quantifiers differ in how much text they take:

Greedy Syntax: It matches as much as possible, taking the longest match that still lets the rest of the pattern succeed.
Non-Greedy Syntax: It matches as little as possible, taking the shortest match that still lets the rest of the pattern succeed.

To transform a greedy pattern into a non-greedy one, the bare minimum effort is to add a single character, the question mark (?).

Greedy to Non-Greedy: Add a question mark (?) after the quantifier (*, +, ?, {n}, {n,}, {n,m}) to make it non-greedy.
For example:

Greedy: .* (matches the longest possible sequence of any characters)
Non-Greedy: .*? (matches the shortest possible sequence of any characters)
The addition of the question mark (?) changes the behavior of the quantifier from greedy to non-greedy.
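
As a minimal sketch of the rule above, the following compares the two forms on the same input. The sample text and variable names are made up for illustration.

```python
import re

# Illustrative input with two pairs of angle-bracket tags
text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs as far as it can, so the match ends at the last ">"
greedy = re.search(r"<.*>", text).group()

# Non-greedy: .*? stops at the first ">" that lets the pattern succeed
lazy = re.search(r"<.*?>", text).group()

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>
```

The only difference between the two patterns is the single ? after the quantifier.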

In [None]:
# Q2. When exactly does greedy versus non-greedy make a difference?  What if you're looking for a
# non-greedy match but the only one available is greedy?
# Ans=
The distinction between greedy and non-greedy matching becomes relevant when there are multiple possible matches within the input string. The difference lies in the amount of text that is matched by the regular expression pattern.

In general, greedy matching aims to consume as much text as possible while still allowing the overall match to succeed. On the other hand, non-greedy matching aims to consume as little text as possible while still achieving a successful match.

Consider the following example input string: "ABCDEF".

Greedy Matching:

Greedy pattern: A.*F
Greedy match: ABCDEF
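Explanation: In greedy matching, the .* quantifier first takes as many characters as possible and then backtracks until an "F" can follow. Here that gives the match from "A" to "F", which is the entire input string.

Non-Greedy Matching:

Non-greedy pattern: A.*?F
Non-greedy match: ABCDEF
Explanation: In non-greedy matching, the .*? quantifier takes as few characters as possible and expands one character at a time until an "F" can follow. Because "ABCDEF" contains only one "F", the result is the same as the greedy match. This also answers the second part of the question: non-greedy is a preference, not a guarantee. If the only match available is the long one, the non-greedy pattern still returns it.

The two forms differ only when the match could end in more than one place. With the input "ABCDEFGHF", A.*F matches ABCDEFGHF (up to the last "F"), while A.*?F matches ABCDEF (up to the first "F").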
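
The sketch below illustrates what happens when a non-greedy quantifier has no short match available: it expands until the rest of the pattern can succeed. The inputs are illustrative assumptions, not taken from the question.

```python
import re

# Nothing follows the lazy quantifier, so one digit is enough
alone = re.search(r"\d+?", "2024").group()

# "px" must follow, so the lazy quantifier is forced to expand
lazy_forced = re.search(r"\d+?px", "2024px").group()

# The greedy form reaches the same result by backtracking instead
greedy = re.search(r"\d+px", "2024px").group()

print(alone)        # 2
print(lazy_forced)  # 2024px
print(greedy)       # 2024px
```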

In [None]:
# Q3. In a simple match of a string, which looks only for one match and does not do any replacement, is
# the use of a nontagged group likely to make any practical difference?
# Ans=
In a simple match of a string where you are only looking for one match and not performing any replacements, the use of a non-tagged group (a non-capturing group) makes no practical difference to the result.

Non-tagged groups are used when you want to group part of a pattern, for example to apply a quantifier or an alternation to it, but do not need to capture the matched content for later use. They are written (?:pattern), where ?: marks the group as non-capturing.

If the code only checks whether the pattern matched, or reads the whole matched text, the pattern with a non-tagged group and the pattern with no group at all give the same result.

For example, consider the following code snippet:


import re

text = "Hello, World!"
pattern_with_non_tagged_group = r"(?:Hello), World!"
pattern_without_group = r"Hello, World!"

match_with_non_tagged_group = re.match(pattern_with_non_tagged_group, text)
match_without_group = re.match(pattern_without_group, text)

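print(match_with_non_tagged_group)  # <re.Match object; span=(0, 13), match='Hello, World!'>
print(match_without_group)  # <re.Match object; span=(0, 13), match='Hello, World!'>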
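
As a rough sketch of where the three forms agree and differ, the following adds a capturing variant for comparison. The variable names are assumptions introduced here.

```python
import re

text = "Hello, World!"

plain = re.match(r"Hello, World!", text)
non_capturing = re.match(r"(?:Hello), World!", text)
capturing = re.match(r"(Hello), World!", text)

# The overall match is identical in all three cases
same_text = plain.group(0) == non_capturing.group(0) == capturing.group(0)
print(same_text)               # True

# Only the capturing form stores a group
print(non_capturing.groups())  # ()
print(capturing.groups())      # ('Hello',)
```

The difference shows up only if later code reads the stored groups, which a simple match does not do.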

In [None]:
# Q4. Describe a scenario in which using a nontagged category would have a significant impact on the
# program's outcomes.
# Ans=
A scenario in which using a non-tagged category (non-capturing group) can have a significant impact on a program's outcomes is when the regular expression pattern is used with the re.findall() function in Python.

The re.findall() function returns a list of all non-overlapping matches of a pattern in a string. What each list item contains depends on the capturing groups in the pattern: with no capturing group it is the whole matched text, with exactly one capturing group it is only that group's text, and with several capturing groups it is a tuple of the groups' contents.

Non-tagged (non-capturing) groups (?:pattern) do not count as groups here, so a pattern that uses only non-capturing groups makes re.findall() return the whole matched text. When the group spans the entire pattern, as in the example below, both forms return the same list; the difference appears when the group covers only part of the pattern.

Consider the following example:


import re

text = "Hello, world! How are you?"
pattern_with_capturing_group = r"(\b\w+\b)"
pattern_with_non_tagged_group = r"(?:\b\w+\b)"

matches_with_capturing_group = re.findall(pattern_with_capturing_group, text)
matches_with_non_tagged_group = re.findall(pattern_with_non_tagged_group, text)

print(matches_with_capturing_group)  # Output: ['Hello', 'world', 'How', 'are', 'you']
print(matches_with_non_tagged_group)  # Output: ['Hello', 'world', 'How', 'are', 'you']
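
In the example above the group wraps the entire pattern, so both calls return the same list. The sketch below uses a group that covers only part of the pattern, which is where the choice changes the result. The file names are made up for illustration.

```python
import re

text = "cat.png dog.jpg bird.png"

# One capturing group: findall returns only the group's text
with_capturing = re.findall(r"\w+\.(png|jpg)", text)

# Non-capturing group: findall returns the whole match
with_non_capturing = re.findall(r"\w+\.(?:png|jpg)", text)

print(with_capturing)      # ['png', 'jpg', 'png']
print(with_non_capturing)  # ['cat.png', 'dog.jpg', 'bird.png']
```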

In [None]:
# Q5. Unlike a normal regex pattern, a look-ahead condition does not consume the characters it
# examines. Describe a situation in which this could make a difference in the results of your
# programme.
# Ans=
The non-consuming nature of look-ahead conditions in regular expressions can make a difference in the results of a program in various situations where you need to assert a certain condition without actually including it in the matched portion. Here's a scenario to illustrate this:

Suppose you have a text document that contains email addresses, and you want to extract the user name of every Gmail address without including "@gmail.com" in the extracted result.

Using a look-ahead condition, denoted by (?=...), you can require that "@gmail.com" follows the user name while keeping it out of the match.

Consider the following example:


import re

text = "Contact me at john.doe@gmail.com or jane.doe@gmail.com for any inquiries."

# [\w.]+ matches the user name; the look-ahead checks for "@gmail.com" without consuming it
pattern_with_lookahead = r"[\w.]+(?=@gmail\.com\b)"

matches_with_lookahead = re.findall(pattern_with_lookahead, text)

print(matches_with_lookahead)  # Output: ['john.doe', 'jane.doe']
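
Another situation where non-consumption changes the result is overlapping matches. A minimal sketch, using an illustrative string: a consuming pattern cannot reuse characters it has already matched, while a look-ahead leaves them available for the next attempt.

```python
import re

text = "ababa"

# Consuming pattern: the first match uses up "aba", leaving only "ba"
consuming = re.findall(r"aba", text)

# Look-ahead with a capturing group: nothing is consumed,
# so the occurrence starting at index 2 is found as well
overlapping = re.findall(r"(?=(aba))", text)

print(consuming)    # ['aba']
print(overlapping)  # ['aba', 'aba']
```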

In [None]:
# Q6. In standard expressions, what is the difference between positive look-ahead and negative look-
# ahead?
# Ans=
In regular expressions, positive look-ahead ((?=...)) and negative look-ahead ((?!...)) are both types of look-ahead assertions, but they have different purposes and behaviors:

Positive Look-ahead ((?=...)):

Positive look-ahead asserts that a specific pattern (the lookahead pattern) must be present immediately after the current position in the string, without including it in the matched portion.
It ensures that a match is found only if the main pattern is followed by the lookahead pattern.
Syntax: (?=...)
Example: pattern = r"Python(?=\d)" matches "Python" only if it is followed by a digit, so it matches in "Python3" but not in "Python is".

Negative Look-ahead ((?!...)):

Negative look-ahead asserts that a specific pattern (the lookahead pattern) must not be present immediately after the current position in the string.
It ensures that a match is found only if the main pattern is not followed by the lookahead pattern.
Syntax: (?!...)
Example: pattern = r"Python(?!\d)" matches "Python" only if it is not followed by a digit, so it matches in "Python is" but not in "Python3".
A pattern such as r"\w+(?!\d)" is less selective than it looks: \w also matches digits, and the engine can backtrack to a shorter run of characters that satisfies the assertion.
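
A minimal sketch contrasting the two assertions on the same text. The sample sentence is an assumption chosen for illustration.

```python
import re

text = "Python3 and Python are both mentioned"

# Positive look-ahead: "Python" only where a digit follows
with_digit = [m.start() for m in re.finditer(r"Python(?=\d)", text)]

# Negative look-ahead: "Python" only where no digit follows
without_digit = [m.start() for m in re.finditer(r"Python(?!\d)", text)]

# The digit is checked but never becomes part of the match
matched_text = re.search(r"Python(?=\d)", text).group()

print(with_digit)     # [0]
print(without_digit)  # [12]
print(matched_text)   # Python
```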


In [None]:
# Q7. What is the benefit of referring to groups by name rather than by number in a standard
# expression?
# Ans=
Referring to groups by name rather than by number in a regular expression brings several benefits:

Improved Readability: Using named groups makes the regular expression more self-explanatory and easier to understand. By assigning meaningful names to groups, it becomes clear what each group represents, enhancing the readability of the pattern.

Better Code Maintenance: When working with complex regular expressions, it is common to have multiple groups. Using named groups provides more descriptive references, making it easier to maintain and modify the pattern in the future. If the order or number of groups changes, code that references groups by name remains unaffected.

Self-Documenting Patterns: Named groups act as documentation within the regular expression itself. By naming groups, you provide a clear indication of the purpose or significance of each captured group. This helps others (including future you) understand the intention and logic behind the pattern.

Accessible Captured Content: When using named groups, the captured content can be accessed not only by index but also by name. This allows for more intuitive and explicit retrieval of matched content, making it easier to work with the captured groups in subsequent code.
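
The points above can be sketched with a small date pattern. The group names year, month, and day are assumptions chosen for this example.

```python
import re

# Each part of the date gets a descriptive name
date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

m = date_pattern.match("2024-03-15")

# Access by name documents itself; access by number still works
by_name = m.group("year")
by_number = m.group(1)

# groupdict() returns all named groups at once
parts = m.groupdict()

print(by_name)    # 2024
print(by_number)  # 2024
print(parts)      # {'year': '2024', 'month': '03', 'day': '15'}
```

If another group were later added in front of the year, m.group(1) would change meaning while m.group("year") would not.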

In [None]:
# Q8. Can you identify repeated items within a target string using named groups, as in "The cow
# jumped over the moon"?
# Ans=
Yes, but not with a named group alone. A named group only labels the captured text; detecting a repeat also requires a named backreference, (?P=name), which matches the same text that the group captured earlier in the pattern.

In "The cow jumped over the moon" the repeated word appears as "The" and "the", so the match must also ignore case.

Here's an example that uses a named group and a named backreference to identify the repeated word:


import re

text = "The cow jumped over the moon"

# (?P<word>...) captures a word; (?P=word) requires the same word to appear again later
pattern = r"\b(?P<word>\w+)\b.*\b(?P=word)\b"

# IGNORECASE lets "The" match "the"
matches = re.findall(pattern, text, flags=re.IGNORECASE)

print(matches)  # Output: ['The']

In [None]:
# Q9. When parsing a string, what is at least one thing that the Scanner interface does for you that the
# re.findall feature does not?
# Ans=

When parsing a string, the Scanner interface does at least one thing that re.findall() does not: it pairs each pattern with an action, so every token is classified (and can be converted) at the moment it is matched.

The Scanner interface, available as the re.Scanner class (present in CPython's re module but not described in the official documentation), tokenizes a string from a list of (pattern, action) pairs. It scans the string sequentially from the start, runs the action of whichever pattern matches at the current position, and skips tokens whose action is None.

scan() is not a streaming interface: it returns a tuple containing the list of action results and the unmatched remainder of the string. The remainder is a second advantage over re.findall(), which silently skips any text that matches nothing; with a Scanner you can see exactly where tokenizing stopped.

Here's a simple example to illustrate the usage of the Scanner interface:


import re

scanner = re.Scanner([
    (r'\d+', lambda scanner, token: print(f"Found number: {token}")),
    (r'\w+', lambda scanner, token: print(f"Found word: {token}")),
    (r'\s+', None)  # Ignore whitespace
])

text = "Hello 123 World"

scanner.scan(text)
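
The example above prints from inside the actions, so scan() collects nothing. As a sketch of the other common style, the actions below return tagged tokens. The token labels and the trailing "???" are assumptions for illustration, and re.Scanner itself is an undocumented class whose behavior may differ between Python versions.

```python
import re

# Each action returns a (label, value) pair; None means "skip this token"
token_scanner = re.Scanner([
    (r"\d+", lambda s, tok: ("NUMBER", int(tok))),
    (r"[A-Za-z_]\w*", lambda s, tok: ("WORD", tok)),
    (r"\s+", None),
])

text = "Hello 123 World ???"

# scan() returns the collected action results and the unmatched tail
tokens, remainder = token_scanner.scan(text)

# findall returns untyped strings and silently skips "???"
found = re.findall(r"\d+|[A-Za-z_]\w*", text)

print(tokens)     # [('WORD', 'Hello'), ('NUMBER', 123), ('WORD', 'World')]
print(remainder)  # ???
print(found)      # ['Hello', '123', 'World']
```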

In [None]:
# Q10. Does a scanner object have to be named scanner?
# Ans=
No, a Scanner object does not have to be named "scanner". The name given to the Scanner object is arbitrary and can be chosen according to your preference or the naming conventions of your code. "Scanner" is a commonly used name for the object since it represents a scanner-like functionality, but you are free to choose any valid variable name that adheres to Python's naming rules.

Here's an example demonstrating the usage of a Scanner object with a different variable name:

import re

my_tokenizer = re.Scanner([
    (r'\d+', lambda scanner, token: print(f"Found number: {token}")),
    (r'\w+', lambda scanner, token: print(f"Found word: {token}")),
    (r'\s+', None)  # Ignore whitespace
])

text = "Hello 123 World"

my_tokenizer.scan(text)