
# Regex

## Key Points

1. **Definition**
   - Regex is a sequence of characters that defines a search pattern, primarily used for string-matching algorithms.

2. **Pattern Matching**
   - Regex allows you to find and manipulate strings based on specific patterns, such as identifying email addresses, phone numbers, dates, or specific words.

3. **Metacharacters**
   - `.`: Matches any character except a newline.
   - `^`: Matches the start of a string.
   - `$`: Matches the end of a string.
   - `*`: Matches 0 or more repetitions of the preceding element.
   - `+`: Matches 1 or more repetitions of the preceding element.
   - `?`: Matches 0 or 1 repetition of the preceding element.
   - `[]`: Matches any one of the enclosed characters.
   - `|`: Acts like a boolean OR.
   - `()`: Groups patterns together.

4. **Character Classes**
   - `\d`: Matches any digit.
   - `\D`: Matches any non-digit.
   - `\w`: Matches any word character (alphanumeric plus underscore).
   - `\W`: Matches any non-word character.
   - `\s`: Matches any whitespace character.
   - `\S`: Matches any non-whitespace character.

5. **Quantifiers**
   - `{n}`: Matches exactly n occurrences of the preceding element.
   - `{n,}`: Matches n or more occurrences.
   - `{n,m}`: Matches between n and m occurrences.

6. **Escaping Special Characters**
   - Use a backslash (`\`) to escape special characters if you need to match them literally.

7. **Anchors**
   - `\b`: Matches a word boundary.
   - `\B`: Matches a non-word boundary.

8. **Substitution and Replacement**
   - Regex can be used to replace parts of a string using patterns and replacement strings.

9. **Regex Tools and Libraries**
   - Many programming languages have built-in support for regex, such as Python's `re` module, JavaScript's `RegExp` object, and similar functionalities in languages like Perl, Java, and PHP.

10. **Use Cases**
    - **Data Validation**: Checking formats of email addresses, phone numbers, postal codes, etc.
    - **Text Processing**: Extracting information, cleaning text, and modifying strings.
    - **Data Extraction**: Pulling out specific information from large text datasets.
    - **Search and Replace**: Modifying parts of text based on patterns.

Understanding and effectively using regex can significantly enhance text processing and analysis capabilities, making it a valuable skill in data science and software development.
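
The metacharacters, character classes, quantifiers, and anchors listed above can be tried out in a few lines. The following is a minimal sketch using Python's `re` module; the sample strings and variable names are made up for illustration.

```python
import re

# Sample text (made up for this illustration)
text = "Order 66 shipped on 2024-08-03 to cat@example.com; catalog no. 7."

# \d{4}-\d{2}-\d{2}: the {n} quantifier applied to the digit class \d
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates)         # ['2024-08-03']

# \bcat\b: word boundaries keep "cat" from matching inside "catalog"
whole_words = re.findall(r"\bcat\b", text)
print(whole_words)   # ['cat']

# ^ and $ anchor the pattern to the start and end of the string;
# \. escapes the dot so it matches a literal period
starts_with_order = bool(re.search(r"^Order", text))
ends_with_seven_dot = bool(re.search(r"7\.$", text))
print(starts_with_order, ends_with_seven_dot)   # True True

# [] picks one of the enclosed characters, ? makes "u" optional, | means OR
print(re.findall(r"gr[ae]y", "gray grey groy"))   # ['gray', 'grey']
print(re.findall(r"colou?r", "color colour"))     # ['color', 'colour']
print(re.findall(r"cat|dog", "cat dog bird"))     # ['cat', 'dog']
```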


In [80]:
import re
line = "Cats are smarter than dogs"
matchObj = re.match( r'(.*) are (.*?) .*', line, re.M|re.I)
if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!"  )

matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter


- `re.match` only matches at the start of the string, while `re.search` looks for a match at any position.

In [81]:
line = "Cats are smarter than dogs"

# re.match only looks at the start of the string, so 'dogs' is not found
matchObj = re.match(r'dogs', line, re.M|re.I)
if matchObj:
    print("match --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")

# re.search scans the whole string, so 'dogs' is found
matchObj = re.search(r'dogs', line)
if matchObj:
    print("search --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")


No match!!
search --> matchObj.group() :  dogs


# replace

In [82]:

phone = "2004-959-559 # This is Phone Number"
# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)
# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num : ", num)


Phone Num :  2004-959-559 
Phone Num :  2004959559
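
Replacement strings can also refer back to captured groups. As a small sketch (the target layout is an assumption chosen for illustration, not part of the original exercise), `\1`, `\2`, and `\3` below re-insert the three digit groups of the same phone string with spaces instead of hyphens:

```python
import re

phone = "2004-959-559 # This is Phone Number"

# Strip the trailing comment together with the whitespace in front of it
digits_part = re.sub(r'\s*#.*$', "", phone)

# Capture the three digit groups and re-emit them separated by spaces
formatted = re.sub(r'(\d+)-(\d+)-(\d+)', r'\1 \2 \3', digits_part)
print(formatted)   # 2004 959 559
```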


#### regular expression patterns



#### Write a program in Python where the given string contains noise in the form of special characters; use a regular expression to fetch the string without the noise


In [83]:
str1='hjksafd@@,,'
num = re.sub(r'[^a-zA-Z0-9]',"", str1)
print( num)


hjksafd


In [84]:
s=123456       # print the last digit of it
x=s%10
print(x)


6


In [85]:
# Remove repeated characters at the start of the string
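
One possible reading of this exercise is collapsing a run of the same character at the start of a string down to a single occurrence. A minimal sketch under that assumption (the helper name `strip_leading_repeats` is hypothetical) uses a backreference:

```python
import re

def strip_leading_repeats(s):
    # ^(.) captures the first character; \1+ matches one or more repeats of it
    return re.sub(r'^(.)\1+', r'\1', s)

print(strip_leading_repeats("aaabcaa"))   # abcaa
print(strip_leading_repeats("hello"))     # hello
```

Only the leading run is touched; the repeated `aa` at the end of the first string is left as-is because the pattern is anchored with `^`.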


In [86]:
# find all
import re
string='hello 12 hi 89 .howdy 34'
pattern=r'\d+'  # \d matches digits; \D matches non-digits

result=re.findall(pattern,string)
print(result)


['12', '89', '34']


In [87]:
# find all
import re
string='hello 12 hi 89 .howdy 34'
pattern='[A-Za-z]'  # matches a single ASCII letter, upper or lower case

result=re.findall(pattern,string)
print(result)


['h', 'e', 'l', 'l', 'o', 'h', 'i', 'h', 'o', 'w', 'd', 'y']


In [88]:
# split
import re
string='hello 12 hi 89 .howdy 34'
pattern=r'\d+'  # \d matches digits; \D matches non-digits

result=re.split(pattern,string)
print(result)


['hello ', ' hi ', ' .howdy ', '']


In [89]:
# split, at most once
import re
string='hello 12 hi 89 .howdy 34'
pattern=r'\d+'

# maxsplit=1 stops after the first split
result=re.split(pattern,string,maxsplit=1)
print(result)
print(len(result))


['hello ', ' hi 89 .howdy 34']
2


In [90]:
# multiline string
string='abc 12\\ de 23 \n f45 6'

# matches all whitespace characters
pattern=r'\s+'

# empty string
replace=''

new_string=re.sub(pattern,replace,string)
print(new_string)

abc12\de23f456


In [91]:
import re
string='hello 12 hi 89 .howdy 34 '
pattern=r'\S+'  # \S matches any non-whitespace character

result=re.sub(pattern,'',string)
print(len(result))

6


In [92]:
# multiline string
string='abc 12\\ de 23 \n f45 6'

# matches all whitespace characters
pattern=r'\s+'
replace=''

# count=1 replaces only the first run of whitespace
new_string=re.sub(pattern,replace,string,count=1)
print(new_string)

abc12\ de 23 
 f45 6


### subn() returns a tuple of 2 items: the new string and the number of substitutions



In [93]:
# multiline string
string='abc 12\\ de 23 \n f45 6'

# matches all whitespace characters
pattern=r'\s+'
replace=''

new_string=re.subn(pattern,replace,string)
print(new_string)

('abc12\\de23f456', 5)


In [94]:
# [Pp] matches either an uppercase or a lowercase first letter
string='Python is fun'
match=re.search('[Pp]ython',string)
if match:
    print('pattern found inside the string')
else :
    print('pattern not found ')

pattern found inside the string


In [95]:
string='39801 356 ,2102 1111'
pattern=r'(\d{3}) (\d{2})'

match=re.search(pattern,string)
if match:
    print(match.group())
else :
    print('pattern not found ')

801 35


In [96]:
match.group(1)

'801'

In [97]:
match.group(2)

'35'

In [98]:
match.group(1,2)

('801', '35')

In [99]:
match.start()

2

In [100]:
match.end()

8

In [101]:
match.span()

(2, 8)

In [102]:
match.re
re.compile("(\\d{3}) (\\d{2})")



re.compile(r'(\d{3}) (\d{2})', re.UNICODE)

In [103]:
match.string

'39801 356 ,2102 1111'
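
The same position information is available for every match, not only the first one. A short sketch using `re.finditer` on the string above (the `found` list is just a convenience for this illustration):

```python
import re

string = '39801 356 ,2102 1111'
pattern = r'(\d{3}) (\d{2})'

# finditer yields one match object per non-overlapping match
found = []
for m in re.finditer(pattern, string):
    found.append((m.group(), m.groups(), m.span()))
    print(m.group(), m.groups(), m.span())

# 801 35 ('801', '35') (2, 8)
# 102 11 ('102', '11') (12, 18)
```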

In [104]:
string="\n and \r are escape sequences."
result=re.findall(r'[\n\r]',string)
print(result)

['\n', '\r']


In [105]:
# This code uses a regular expression to search for the pattern "brown.fox" within the string.
# The dot (.) in the pattern matches any single character (here, the space).
# If a match is found, it prints "Match found!"; otherwise it prints "Match not Found".

if re.search(r"brown.fox", "The quick brown fox jumps over the lazy dog."):
    print("Match found!")
else:
    print("Match not Found")


Match found!


In [106]:
# This code uses the regular expression (\d+) to find every sequence of one or more digits in the given string.
# findall returns the numeric values as a list; here it finds "123456789" and "987654321".

matches = re.findall(r"(\d+)", "This string contains numbers like 123456789 and 987654321.")

print(matches)


['123456789', '987654321']


In [107]:
# Write a program to check whether the string contains the following pattern: a letter followed by a whitespace character and then digits

if re.search(r'[a-zA-Z]\s\d+' , "A123"):
    print("matched")
else:
    print("not matched")



not matched
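
For comparison, here is a sketch that runs the same pattern against a few hypothetical inputs, including ones that do contain the letter, whitespace, digits sequence:

```python
import re

pattern = r'[a-zA-Z]\s\d+'

# Hypothetical test inputs; only the first one comes from the cell above
results = {}
for candidate in ["A123", "A 123", "room B 42"]:
    results[candidate] = bool(re.search(pattern, candidate))
    print(repr(candidate), "matched" if results[candidate] else "not matched")

# 'A123' not matched
# 'A 123' matched
# 'room B 42' matched
```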


In [108]:
# Write a program to check whether the string contains a word ending with "ey" or "EY"

if re.search(r"(ey|EY)\b","Hello World Money ends"):
    print("ends")
else:
    print("not ends")



ends
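
If the intent is to test the end of the whole string rather than the end of a word inside it, the `$` anchor is the tool for that. A minimal sketch (the helper name `ends_with_ey` is hypothetical):

```python
import re

def ends_with_ey(s):
    # (ey|EY)$ requires the string itself to end with "ey" or "EY"
    return re.search(r'(ey|EY)$', s) is not None

print(ends_with_ey("Hello World Money ends"))  # False
print(ends_with_ey("Hello World Money"))       # True
print(ends_with_ey("HELLO MONEY"))             # False: ends with "EY"? No, "EY" is present
```

Note that `$` in Python also matches just before a single trailing newline, so `"Money\n"` would count as ending with "ey" here.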


In [109]:
# Write a program to match all the words ending with a special character and count them

matches = re.findall(r'\w+[^\w\s]', "the ndcbdjchb$ yfyjgjf f@")
print(len(matches))

2


In [110]:
# Count words in a sentence using regex

string = """hello how are you
gh prasad"""
pattern = r'\w+'
words = re.findall(pattern, string)
print(len(words))


6


In [111]:
pattern_a=r"\D+"
pattern_b=r"\d+"

a_values=[]
b_values=[]

with open("/content/hello.txt","r") as file:
    for line in file:
        match_a=re.findall(pattern_a,line)
        match_b=re.findall(pattern_b,line)

        a_values.extend(match_a)
        b_values.extend(match_b)

    print(a_values)
    print(b_values)

['dbadhdfdha adhayd ygdaiy digdidi  iu au iua  fhb gy ayfg  haiygy iay ya fiy fhfcgvcahhbgcg uyyacy yf ay chjvc avc  yccyvdc vc uyvcyv cv hjcvaydcvbad bcia b vn jvbvysvb yufbvuyfdvjhvb u', 'r', ' ', 'r', ' ', 'r ', 'rt', 'ft', 'f', 'f', 'f', '   ', ' ', 'vx', 'c', 'vcx', ' ', ' ', 'f', 'c', ' vcfsycgf ', 'e', ' ', '  ', '  #$#@  # ', ' ', '$## # c RTDER D', 'E RDFFTRfdtrfvdjagdt  ', 'a', 'ed', 'fecdfh $$##@$#', '@ sytd', '$## \n', 'hgvhtgftyg\n', 'hjvgh']
['3826', '566', '7', '6', '4', '6346874', '6', '36', '6346', '3', '6', '6', '665', '76', '5', '765675675', '65', '655', '67', '5', '3', '23', '32', '24', '3', '53', '5', '43', '54', '2', '2', '432']


In [114]:
text="The rain in spain fdalls mainly in the Plain"
match=re.findall(r"\w*l+\w",text)
print(match)

['fdalls', 'mainly', 'Pla']


In [188]:
text="Jphn: 120 , Bill: 110 , ted: 115 "
pat = r"\w+: 1[2-9][1-9]\s"
k = re.findall(pat,text)

In [189]:
k

[]
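
The result is empty because the pattern is stricter than it may look: `[2-9]` requires the second digit to be 2 or higher, which rules out `110` and `115`, and `[1-9]` excludes the trailing `0` of `120`. If the goal is to pick out scores at or above some threshold (an assumption; the original cell does not state it), one sketch is to capture every name/score pair and compare the numbers in Python:

```python
import re

text = "Jphn: 120 , Bill: 110 , ted: 115 "

# Capture (name, score) pairs
pairs = re.findall(r"(\w+): (\d+)", text)
print(pairs)  # [('Jphn', '120'), ('Bill', '110'), ('ted', '115')]

# The threshold of 115 is a hypothetical choice for this illustration
high = [(name, int(score)) for name, score in pairs if int(score) >= 115]
print(high)   # [('Jphn', 120), ('ted', 115)]
```

Doing the numeric comparison outside the regex keeps the pattern simple and avoids encoding number ranges digit by digit.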

# CSV File Structure

- **Rows**: Each row in the CSV file corresponds to a record in the table.
- **Columns**: Each field within a row represents a column.
- **Separator**: Typically, commas (`,`), but other delimiters like semicolons (`;`) or tabs can also be used.
- **Headers**: Often, the first row contains the column headers, but this is not mandatory.

## Example CSV File

```csv
Name,Age,Occupation
Alice,30,Engineer
Bob,25,Designer
Charlie,35,Teacher
```

## Using Regex to Parse CSV Files

While libraries like `csv` in Python are commonly used to handle CSV files, regular expressions can be utilized to parse and validate the content of a CSV file. Here are some common patterns and tasks you might perform using regex:

### Matching a Simple CSV Line

- **Pattern**: `r'([^,]+),([^,]+),([^,]+)'`
- **Explanation**: This pattern matches a line with three fields separated by commas.
- **Example**: For the line `Alice,30,Engineer`, the pattern will match:
  - Group 1: `Alice`
  - Group 2: `30`
  - Group 3: `Engineer`

### Matching a CSV Line with Quoted Fields

- **Pattern**: `r'"([^"]+)"|([^,]+)'`
- **Explanation**: This pattern matches one field at a time, either enclosed in double quotes (group 1) or unquoted (group 2). Applying it repeatedly, for example with `re.finditer`, walks through the fields of a line.
- **Example**: For the line `"Alice",30,"Engineer"`, the successive matches capture:
  - Field 1 (group 1): `Alice`
  - Field 2 (group 2): `30`
  - Field 3 (group 1): `Engineer`

### Handling Escaped Quotes in Fields

- **Pattern**: `r'"((?:[^"]|"")*)",?|([^,]+),?'`
- **Explanation**: This pattern matches one field at a time and allows double quotes inside a quoted field to be escaped by doubling them (`""`). Quoted content lands in group 1 and unquoted content in group 2.
- **Example**: For the line `"Alice ""The Great""",30,"Engineer"`, the successive matches capture:
  - Field 1 (group 1): `Alice ""The Great""`, which becomes `Alice "The Great"` once the doubled quotes are collapsed
  - Field 2 (group 2): `30`
  - Field 3 (group 1): `Engineer`

## Full Example Using Python and Regex

Here is an example of how you might use regex in Python to parse a simple CSV file:

```python
import re

csv_text = """Name,Age,Occupation
Alice,30,Engineer
Bob,25,Designer
Charlie,35,Teacher"""

# Define regex pattern for a simple CSV line
pattern = re.compile(r'([^,]+),([^,]+),([^,]+)')

# Split the text into lines
lines = csv_text.split('\n')

# Parse header
header = lines[0].split(',')
print("Headers:", header)

# Parse each data line using regex
for line in lines[1:]:
    match = pattern.match(line)
    if match:
        print("Parsed Line:", match.groups())
```

### Explanation of the Code

- `pattern = re.compile(r'([^,]+),([^,]+),([^,]+)')`: Compiles a regex pattern to match three fields separated by commas.
- `lines = csv_text.split('\n')`: Splits the CSV text into individual lines.
- `header = lines[0].split(',')`: Splits the first line to get the headers.
- `match = pattern.match(line)`: Matches each subsequent line against the compiled regex pattern and prints the parsed groups.

## Summary

Using regex for CSV parsing is more complex than using dedicated CSV handling libraries due to the various edge cases in CSV formats (such as quoted fields and escaped characters). However, regex can be useful for simple parsing tasks or validating CSV format in specific scenarios.
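
To make the quoted-field case concrete, here is a minimal sketch that applies the escaped-quotes pattern from the section above field by field. The helper name `split_csv_line` is hypothetical, and the sketch silently skips empty unquoted fields (such as `a,,b`), so it is not a replacement for the `csv` module.

```python
import re

# A quoted field (with "" as an escaped quote) or an unquoted field,
# each optionally followed by a comma
FIELD = re.compile(r'"((?:[^"]|"")*)",?|([^,]+),?')

def split_csv_line(line):
    """Return the fields of one CSV line (illustrative helper only)."""
    fields = []
    for m in FIELD.finditer(line):
        if m.group(1) is not None:
            # Quoted field: collapse doubled quotes into single ones
            fields.append(m.group(1).replace('""', '"'))
        else:
            # Unquoted field
            fields.append(m.group(2))
    return fields

print(split_csv_line('"Alice ""The Great""",30,"Engineer"'))
# ['Alice "The Great"', '30', 'Engineer']

print(split_csv_line('"Smith, John",42'))
# ['Smith, John', '42']
```

The second call shows why quoting matters: the comma inside the quotes stays part of the field instead of splitting it.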

In [202]:
import pandas as pd
df = pd.read_csv("/content/xyz.csv")
df

Unnamed: 0,A,B,C
0,custom word,2024-08-01,$300
1,another word,2023-07-31,$200
2,custom word,2022-12-25,€202
3,random word,2020-01-01,€250


In [203]:
import csv
import re

date_pattern = r'(\d{4})-(\d{2})-(\d{2})'

matching_rows = []

with open("/content/xyz.csv", "r") as file:
    csv_reader = csv.DictReader(file)

    for row in csv_reader:
        if row['A'] == "custom word" and re.match(date_pattern, row['B']):
            matching_rows.append(row)

    for match in matching_rows:
        print(match)


{'A': 'custom word', 'B': '2024-08-01', 'C': '$300'}
{'A': 'custom word', 'B': '2022-12-25', 'C': '€202'}


In [204]:
date_pattern = r'(\d{4})-(\d{2})-(\d{2})'

matching_rows = []

with open("/content/xyz.csv", "r") as file:
    csv_reader = csv.DictReader(file)

    for row in csv_reader:
        if row['A'] == "custom word" and re.match(r'^\$', row['C']):
            matching_rows.append(row)

    for match in matching_rows:
        print(match)


{'A': 'custom word', 'B': '2024-08-01', 'C': '$300'}
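
Regex can also split the currency column into its parts. A sketch over the `C` values shown in the table above, held in a list here so the example does not depend on `/content/xyz.csv` (the names `price_pattern` and `parsed` are made up for this illustration):

```python
import re

values = ["$300", "$200", "€202", "€250"]

# Group 1: currency symbol, group 2: integer amount.
# Inside a character class, $ is a literal dollar sign, not an anchor.
price_pattern = re.compile(r'^([$€])(\d+)$')

parsed = []
for value in values:
    m = price_pattern.match(value)
    if m:
        parsed.append((m.group(1), int(m.group(2))))

print(parsed)
# [('$', 300), ('$', 200), ('€', 202), ('€', 250)]
```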
