# Regular Expression

### Question 1

Write a Python program to replace all occurrences of a space, comma, or dot with a colon.
Sample Text- `Python Exercises, PHP exercises.`
Expected Output: `Python:Exercises::PHP:exercises:`


In [2]:
def replace_chars(text):
    for char in [' ', ',', '.']:
        text = text.replace(char, ':')
    return text

# Example usage
sample_text = 'Python Exercises, PHP exercises.'
output = replace_chars(sample_text)
print(output)  

Python:Exercises::PHP:exercises:
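
Since this notebook is about regular expressions, the same replacement can also be sketched with `re.sub` and a character class. The function name `replace_chars_regex` is an illustrative choice and not part of the exercise.

```python
import re

def replace_chars_regex(text):
    # The character class matches a single space, comma, or dot
    return re.sub(r'[ ,.]', ':', text)

print(replace_chars_regex('Python Exercises, PHP exercises.'))
```

One pass over the string handles all three characters, so no loop is needed.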


### Question 2

Create a dataframe using the dictionary below and remove everything (commas (,), !, XXXX, ;, etc.) from the columns except words.

Dictionary- {'SUMMARY' : ['hello, world!', 'XXXXX test', '123four, five:; six...']}

Expected output-

`0      hello world`

`1             test`

`2    four five six`



In [133]:
import pandas as pd

# The Dictionary
data = {'SUMMARY': ['hello, world!', 'XXXXX test', '123four, five:; six...']}

# Creating the DataFrame
df = pd.DataFrame(data)

# Remove occurrences of 'XXXXX' or similar patterns
df['SUMMARY'] = df['SUMMARY'].str.replace(r'\bX+\b', '', regex=True)

# Cleaning the text by removing everything except the words
df['SUMMARY'] = df['SUMMARY'].str.replace(r'[^a-zA-Z\s]', '', regex=True)

# Normalizing the spaces
df['SUMMARY'] = df['SUMMARY'].str.strip().str.replace(r'\s+', ' ', regex=True)

# Print each cleaned summary with the correct formatting
for index, summary in enumerate(df['SUMMARY']):
    print(f"{index:<5} {summary}")



0     hello world
1     test
2     four five six


### Question 3

Create a function in python to find all words that are at least 4 characters long in a string. The use of the `re.compile()` method is mandatory.

In [10]:
import re

def find_long_words(text):
    # Compile a regular expression that matches only words of at least 4 characters
    pattern = re.compile(r'\b\w{4,}\b')
    return pattern.findall(text)

# This is an example
text = "I am trying out re-compile function with this code."
print(find_long_words(text)) 

['trying', 'compile', 'function', 'with', 'this', 'code']


### Question 4

Create a function in python to find all three, four, and five character words in a string. The use of the `re.compile()` method is mandatory.

In [16]:
# The word-length range is passed in as parameters, so it can easily be adjusted.
import re

def find_specific_length_words(text, min_len=3, max_len=5):
    # Compile a regular expression that matches words within the given length range
    pattern = re.compile(r'\b\w{%d,%d}\b' % (min_len, max_len))
    return pattern.findall(text)

text = "DataTrained and FlipRobo has aided my level with my journey as a Data Scientist."
print(find_specific_length_words(text)) 

['and', 'has', 'aided', 'level', 'with', 'Data']


### Question 5

Create a function in Python to remove the parentheses in a list of strings. The use of the re.compile() method is mandatory.
Sample Text: ["example (.com)", "hr@fliprobo (.com)", "github (.com)", "Hello (Data Science World)", "Data (Scientist)"]

Expected Output:

`example.com`

`hr@fliprobo.com`

`github.com`

`Hello Data Science World`

`Data Scientist`


In [28]:
import re

def remove_parentheses(strings):
    # Match either parenthesis character; the content between them is kept
    paren_pattern = re.compile(r'[()]')
    
    # Match whitespace left in front of a dot, e.g. "example .com"
    space_before_dot = re.compile(r'\s+(?=\.)')
    
    # Remove the parentheses, then close the gap before ".com"
    cleaned_strings = [space_before_dot.sub('', paren_pattern.sub('', s)).strip() for s in strings]
    
    # Print each cleaned string on its own line
    for string in cleaned_strings:
        print(string)

# Let's try it out
strings = ["example (.com)", "hr@fliprobo (.com)", "github (.com)", "Hello (Data Science World)", "Data (Scientist)"]
remove_parentheses(strings)


example.com
hr@fliprobo.com
github.com
Hello Data Science World
Data Scientist


### Question 6

Write a Python program that uses a regular expression to remove the parenthesized part from the text stored in a text file.

Sample Text: `["example (.com)", "hr@fliprobo (.com)", "github (.com)", "Hello (Data Science World)", "Data (Scientist)"]`

Expected Output: `["example", "hr@fliprobo", "github", "Hello", "Data"]`

 **Note**: Store the given sample text in a text file and then remove the parenthesized part from the text.



In [44]:
# Sample text to be written to the file
sample_text = [
    "example (.com)",
    "hr@fliprobo (.com)",
    "github (.com)",
    "Hello (Data Science World)",
    "Data (Scientist)"
]

# Write sample text to a file
with open('sample_text.txt', 'w') as file:
    for line in sample_text:
        file.write(f"{line}\n")

import re

def remove_parentheses_from_file(input_file, output_file):
    # Compile a regular expression to match parentheses and the content inside them
    pattern = re.compile(r'\s*\(.*?\)\s*')
    
    # Read the text from the original file
    with open(input_file, 'r') as file:
        lines = file.readlines()
    
    # Remove parentheses and their contents
    cleaned_lines = [pattern.sub('', line).strip() for line in lines]
    
    # Write the cleaned text to a new file and print it
    with open(output_file, 'w') as file:
        for line in cleaned_lines:
            file.write(f"{line}\n")
            print(line)  # Print the cleaned line to the console

# Example usage
remove_parentheses_from_file('sample_text.txt', 'cleaned_text.txt')


example
hr@fliprobo
github
Hello
Data


### Question 7

Write a regular expression in Python to split a string at its uppercase letters.

Sample text: `"ImportanceOfRegularExpressionsInPython"`

Expected Output: `['Importance', 'Of', 'Regular', 'Expressions', 'In', 'Python']`


In [46]:
import re

def split_on_uppercase(text):
    # Capture each uppercase letter so sub() can put a space in front of it (the first one included)
    pattern = re.compile(r'([A-Z])')
    
    # Replace matches with space + match and strip leading space
    modified_text = pattern.sub(r' \1', text).strip()
    
    # the modified text is split on spaces
    return modified_text.split()

# Trying it out 
sample_text = "ImportanceOfRegularExpressionsInPython"
output = split_on_uppercase(sample_text)
print(output)


['Importance', 'Of', 'Regular', 'Expressions', 'In', 'Python']
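
The same split can be sketched more directly with `re.findall`, which returns each capitalized chunk without the intermediate space-insertion step. The name `split_at_uppercase` is hypothetical. This version assumes the text starts with an uppercase letter; any leading lowercase letters or digits would be dropped.

```python
import re

def split_at_uppercase(text):
    # Each match is one uppercase letter plus the lowercase letters after it
    return re.findall(r'[A-Z][a-z]*', text)

print(split_at_uppercase("ImportanceOfRegularExpressionsInPython"))
```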


### Question 8

Create a function in Python to insert spaces between words starting with numbers.

Sample Text: `"RegularExpression1IsAn2ImportantTopic3InPython"`

Expected Output: `RegularExpression 1IsAn 2ImportantTopic 3InPython`



In [57]:
import re

def insert_spaces(text):
    # Capture each digit so a space can be inserted in front of it
    pattern = re.compile(r'(\d)')
    
    # Insert a space before each digit
    return pattern.sub(r' \1', text)

# Using the sample text
sample_text = "RegularExpression1IsAn2ImportantTopic3InPython"
output = insert_spaces(sample_text)
print(output)


RegularExpression 1IsAn 2ImportantTopic 3InPython


### Question 9

Create a function in Python to insert spaces between words starting with capital letters or with numbers.

Sample Text: `"RegularExpression1IsAn2ImportantTopic3InPython"`

Expected Output: `RegularExpression 1 IsAn 2 ImportantTopic 3 InPython`


In [56]:
import re

def insert_spaces(text):
    # Capture each digit so a space can be inserted both before and after it
    pattern = re.compile(r'(\d)')
    
    # Insert spaces
    return pattern.sub(r' \1 ', text)

# Sample usage
sample_text = "RegularExpression1IsAn2ImportantTopic3InPython"
output = insert_spaces(sample_text)
print(output)


RegularExpression 1 IsAn 2 ImportantTopic 3 InPython


### Question 10

Use the GitHub link below to read the data and create a dataframe. After creating the dataframe, extract the first 6 letters of each country and store them in the dataframe under a new column called first_five_letters.
GitHub Link-  https://raw.githubusercontent.com/dsrscientist/DSData/master/happiness_score_dataset.csv


In [73]:
#The code below creates the dataframe and displays it in a tabular form without the new column
import pandas as pd

# This loads the DataFrame from the provided GitHub link
url = "https://raw.githubusercontent.com/dsrscientist/DSData/master/happiness_score_dataset.csv"
df = pd.read_csv(url)

# This displays the first 15 rows of the DataFrame
df.head(15)



Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
3,Norway,Western Europe,4,7.522,0.0388,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176
5,Finland,Western Europe,6,7.406,0.0314,1.29025,1.31826,0.88911,0.64169,0.41372,0.23351,2.61955
6,Netherlands,Western Europe,7,7.378,0.02799,1.32944,1.28017,0.89284,0.61576,0.31814,0.4761,2.4657
7,Sweden,Western Europe,8,7.364,0.03157,1.33171,1.28907,0.91087,0.6598,0.43844,0.36262,2.37119
8,New Zealand,Australia and New Zealand,9,7.286,0.03371,1.25018,1.31967,0.90837,0.63938,0.42922,0.47501,2.26425
9,Australia,Australia and New Zealand,10,7.284,0.04083,1.33358,1.30923,0.93156,0.65124,0.35637,0.43562,2.26646


In [78]:
import pandas as pd

# This loads the DataFrame from the provided GitHub link
url = "https://raw.githubusercontent.com/dsrscientist/DSData/master/happiness_score_dataset.csv"
df = pd.read_csv(url)

# Extract the first 6 letters of each country and store them in a new column called "first_five_letters"
df['first_five_letters'] = df['Country'].str[:6]

# Display the first 15 rows of the DataFrame with the new column
df.head(15)


Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,first_five_letters
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,Switze
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201,Icelan
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,Denmar
3,Norway,Western Europe,4,7.522,0.0388,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531,Norway
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176,Canada
5,Finland,Western Europe,6,7.406,0.0314,1.29025,1.31826,0.88911,0.64169,0.41372,0.23351,2.61955,Finlan
6,Netherlands,Western Europe,7,7.378,0.02799,1.32944,1.28017,0.89284,0.61576,0.31814,0.4761,2.4657,Nether
7,Sweden,Western Europe,8,7.364,0.03157,1.33171,1.28907,0.91087,0.6598,0.43844,0.36262,2.37119,Sweden
8,New Zealand,Australia and New Zealand,9,7.286,0.03371,1.25018,1.31967,0.90837,0.63938,0.42922,0.47501,2.26425,New Ze
9,Australia,Australia and New Zealand,10,7.284,0.04083,1.33358,1.30923,0.93156,0.65124,0.35637,0.43562,2.26646,Austra


### Question 11

Write a Python program to match a string that contains only upper and lowercase letters, numbers, and underscores.

In [81]:
import re

def is_valid_string(s):
    # \w matches letters, digits, and the underscore; fullmatch requires the whole string to match
    pattern = r'\w+'
    return bool(re.fullmatch(pattern, s))

# Let's test it
test_strings = ["Data_Trained",
                "These_people_are_number_1_in_town",
                "I am their student",
                "I_am _a_star",
                "777567",
                "Florence_Leo-Patrick_"]

for s in test_strings:
    print(f"'{s}':", "This is Valid" if is_valid_string(s) else "This is Invalid")


'Data_Trained': This is Valid
'These_people_are_number_1_in_town': This is Valid
'I am their student': This is Invalid
'I_am _a_star': This is Invalid
'777567': This is Valid
'Florence_Leo-Patrick_': This is Invalid


### Question 12

Write a Python program to check whether a string starts with a specific number.

In [96]:
def display_if_starts_with_number(string, digit):
    # Converting the digit to a string
    number_str = str(digit)
    
    # Checking if the string starts with the specific number
    if string.startswith(number_str):
        print(string)
    else:
        print(f"The string does not start with the number {digit}.")

# Example
text_sample = "7Victoria_Ezido"
initial_number = 7

display_if_starts_with_number(text_sample, initial_number)


7Victoria_Ezido
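
The check above relies on `str.startswith`. A regex-based variant, in keeping with the notebook's topic, might look like the sketch below. The name `starts_with_number` is hypothetical, and the function returns a boolean instead of printing.

```python
import re

def starts_with_number(text, number):
    # re.escape keeps the number literal; match() only looks at the start of the string
    pattern = re.compile(re.escape(str(number)))
    return bool(pattern.match(text))

print(starts_with_number("7Victoria_Ezido", 7))
print(starts_with_number("Victoria_Ezido7", 7))
```

Like `startswith`, this also accepts a string such as `"77abc"` for the number 7, because only the leading characters are compared.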


### Question 13

Write a Python program to remove leading zeros from an IP address

In [98]:
import re

def is_correct_ipv4_address(ip):
    # Remove leading zeros from each octet
    cleaned_ip = '.'.join(str(int(part)) for part in ip.split('.'))
    # Validate the cleaned IP address
    return (bool(re.match(r'^(\d{1,3}\.){3}\d{1,3}$', cleaned_ip)) 
            and all(0 <= int(part) <= 255 for part in cleaned_ip.split('.'))), cleaned_ip
# Let's try it with an example
ip_address = "192.168.001.001"
is_valid, cleaned_ip = is_correct_ipv4_address(ip_address)
if is_valid:
    print(f"{cleaned_ip} is a correct IPv4 address.")
else:
    print(f"{cleaned_ip} is not a correct IPv4 address.")


192.168.1.1 is a correct IPv4 address.
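
The solution above strips the zeros by converting each octet with `int()`. As a sketch of a purely regex-based alternative, `re.sub` can drop the leading zeros directly. The name `strip_leading_zeros` is hypothetical, and this version does no validation of the address.

```python
import re

def strip_leading_zeros(ip):
    # Drop zeros at the start of an octet, but only when another digit follows,
    # so an octet written as "0" or "000" still ends up as "0"
    return re.sub(r'\b0+(\d)', r'\1', ip)

print(strip_leading_zeros("192.168.001.001"))
print(strip_leading_zeros("010.000.0.100"))
```

The `\b` anchor ties the match to the start of an octet, so zeros inside a number such as `100` are left alone.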


### Question 14

Write a regular expression in Python to match a date string, in the form of a month name followed by a day number and a year, stored in a text file.

Sample text: `'On August 15th 1947 that India was declared independent from British colonialism, and the reins of control were handed over to the leaders of the Country.'`

Expected Output: `August 15th 1947`

**Note**: Store the given sample text in a text file and then extract the date string in the requested format.



In [128]:
# Store the sample text in a file
sample_text = """On August 15th 1947 that India was declared independent from British colonialism,
and the reins of control were handed over to the leaders of the Country."""

with open('sample_text.txt', 'w') as file:
    file.write(sample_text)


In [130]:

import re

def extract_date_from_file(filename):
    # Read the content of the file
    with open(filename, 'r') as file:
        content = file.read()
    
    # Define the regular expression pattern to match the date
    pattern = r'\b([A-Z][a-z]+) (\d{1,2}(?:st|nd|rd|th)?) (\d{4})\b'
    
    # Search for the pattern in the content
    match = re.search(pattern, content)
    
    # If a date is found, print it directly
    if match:
        print(match.group(0))
    else:
        print("No date found.")

# Example usage
filename = 'sample_text.txt'
extract_date_from_file(filename)

August 15th 1947


### Question 15

Write a Python program to search for some literal strings in a string.

Sample text : `'The quick brown fox jumps over the lazy dog.'`

Searched words : `'fox', 'dog', 'horse'`


In [101]:
import re

def search_with_regularex(text_string, search_list):
    # Combine the search words into a single pattern; re.escape keeps them literal
    pattern = '|'.join(re.escape(word) for word in search_list)
    
    # Use re.findall to collect every search word that occurs in the text
    found = re.findall(pattern, text_string)
    
    for word in search_list:
        if word in found:
            print(f"'{word}' is in the text_string.")
        else:
            print(f"'{word}' is not in the text_string.")

# Example
line = 'The quick brown fox jumps over the lazy dog.'
targets = ['fox', 'dog', 'horse']

search_with_regularex(line, targets)


'fox' is in the text_string.
'dog' is in the text_string.
'horse' is not in the text_string.


### Question 16
Write a Python program to search for a literal string in a string and also find the location within the original string where the pattern occurs.

Sample text : `'The quick brown fox jumps over the lazy dog.'`
Searched words : `'fox'`



In [105]:
import re

def search_literal_and_find_location(text, word):
    # Use re.search to find the word in the text
    match = re.search(word, text)
    
    if match:
        # If found, print the word and its location
        start_index = match.start()
        end_index = match.end()
        print(f"'{word}' found in the text from index {start_index} to {end_index-1}.")
    else:
        print(f"'{word}' not found in the text.")

# Example
sample_text = 'The quick brown fox jumps over the lazy dog.'
search_word = 'fox'

search_literal_and_find_location(sample_text, search_word)


'fox' found in the text from index 16 to 18.
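
Passing the search word straight to `re.search` works for `'fox'`, but a literal that contains regex metacharacters (for example a dot) would be interpreted as a pattern. A minimal sketch of a safer variant uses `re.escape`. The name `find_literal` is hypothetical, and it returns the inclusive start and end indices instead of printing.

```python
import re

def find_literal(text, literal):
    # re.escape turns characters such as "." into literal matches
    match = re.search(re.escape(literal), text)
    return (match.start(), match.end() - 1) if match else None

print(find_literal('The quick brown fox jumps over the lazy dog.', 'dog.'))
print(find_literal('The lazy dogs sleep.', 'dog.'))
```

Without the escaping, the pattern `dog.` would also match `dogs` in the second call, because an unescaped dot matches any character.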


### Question 17

Write a Python program to find the substrings within a string.

Sample text : `'Python exercises, PHP exercises, C# exercises'`
Pattern : `'exercises'.`

In [106]:
import re

def find_substrings(text, pattern):
    # Use list comprehension with re.finditer to find all occurrences
    matches = [(match.start(), match.end()) for match in re.finditer(pattern, text)]
    
    for start, end in matches:
        print(f"'{pattern}' found from index {start} to {end-1}.")

# Example
sample_text = 'Python exercises, PHP exercises, C# exercises'
pattern = 'exercises'

find_substrings(sample_text, pattern)


'exercises' found from index 7 to 15.
'exercises' found from index 22 to 30.
'exercises' found from index 36 to 44.


### Question 18
Write a Python program to find the occurrence and position of the substrings within a string.


In [107]:
def find_occurrences(text, pattern):
    start = 0
    occurrences = 0
    while True:
        start = text.find(pattern, start)
        if start == -1: 
            break
        end = start + len(pattern)
        print(f"'{pattern}' found at index {start} to {end-1}")
        start += len(pattern)
        occurrences += 1

    print(f"Total occurrences: {occurrences}")

# Example
sample_text = 'Python exercises, PHP exercises, C# exercises'
pattern = 'exercises'

find_occurrences(sample_text, pattern)


'exercises' found at index 7 to 15
'exercises' found at index 22 to 30
'exercises' found at index 36 to 44
Total occurrences: 3


### Question 19 
Write a Python program to convert a date from `yyyy-mm-dd` format to `dd-mm-yyyy` format.

In [109]:
def convert_date_format(date_str):
    # Split the input date string by '-'
    parts = date_str.split('-')
    
    # Reorder the parts to dd-mm-yyyy and join them with '-'
    return f"{parts[2]}-{parts[1]}-{parts[0]}"

# Example usage
input_date = "2024-07-07"
converted_date = convert_date_format(input_date)
print(converted_date) 


07-07-2024
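
The conversion above uses `str.split`. A regex version can be sketched with capture groups and backreferences, which also works when the date is embedded in a longer sentence. The name `convert_date_regex` is hypothetical, and the pattern assumes two-digit months and days.

```python
import re

def convert_date_regex(date_str):
    # Capture year, month, and day, then emit them in reverse order
    return re.sub(r'\b(\d{4})-(\d{2})-(\d{2})\b', r'\3-\2-\1', date_str)

print(convert_date_regex("2024-07-07"))
print(convert_date_regex("Released on 2023-11-05."))
```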


### Question 20 
Create a function in python to find all decimal numbers with a precision of 1 or 2 in a string. The use of the re.compile() method is mandatory.

Sample Text: `"01.12 0132.123 2.31875 145.8 3.01 27.25 0.25"`

Expected Output: `['01.12', '145.8', '3.01', '27.25', '0.25']`


In [110]:
import re

def find_decimal_numbers(text):
    # Compile a regular expression to match numbers with 1 or 2 decimal places
    pattern = re.compile(r'\b\d+\.\d{1,2}\b')
    
    # Find all matches in the text
    matches = pattern.findall(text)
    
    return matches

# Example
sample_text = "01.12 0132.123 2.31875 145.8 3.01 27.25 0.25"
decimal_numbers = find_decimal_numbers(sample_text)
print(decimal_numbers)  # Output: ['01.12', '145.8', '3.01', '27.25', '0.25']


['01.12', '145.8', '3.01', '27.25', '0.25']


### Question 21
Write a Python program to separate and print the numbers and their position of a given string.


In [126]:
import re

def find_numbers_and_positions(text):
    # Compile a regular expression to match numbers in the text
    pattern = re.compile(r'\d+')
    
    # Use finditer to get an iterator of match objects
    matches = pattern.finditer(text)
    
    # Iterate over all matches and print the number and its position
    for match in matches:
        number = match.group(0)
        start_index = match.start()
        print(f"Number: {number}, Position: {start_index}")

# Example
sample_text = "The room numbers are 23, 64, and 80."
find_numbers_and_positions(sample_text)


Number: 23, Position: 21
Number: 64, Position: 25
Number: 80, Position: 33


### Question 22
Write a regular expression in python program to extract maximum/largest numeric value from a string.

Sample Text:  `'My marks in each semester are: 947, 896, 926, 524, 734, 950, 642'`
Expected Output: `950`

In [113]:
import re

def extract_max_number(text):
    # Compile the regular expression
    pattern = re.compile(r'\d+')
    
    # Initialize max_number
    max_number = 0
    
    # Iterate over all matches
    for match in pattern.finditer(text):
        number = int(match.group(0))
        if number > max_number:
            max_number = number
    
    return max_number

# Example in use
sample_text = 'My marks in each semester are: 947, 896, 926, 524, 734, 950, 642'
max_number = extract_max_number(sample_text)
print(max_number)  


950
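
The explicit loop above starts from `max_number = 0`, so a string without any digits yields 0. A shorter sketch combines `findall` with the built-in `max` and makes the "no numbers" case explicit by returning `None`. The name `extract_max_number_short` is hypothetical.

```python
import re

def extract_max_number_short(text):
    # Collect every run of digits and convert each one to an integer
    numbers = [int(n) for n in re.findall(r'\d+', text)]
    # default=None avoids a ValueError when the text contains no digits
    return max(numbers, default=None)

print(extract_max_number_short('My marks in each semester are: 947, 896, 926, 524, 734, 950, 642'))
```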


### Question 23
Create a function in python to insert spaces between words starting with capital letters.

Sample Text: `“RegularExpressionIsAnImportantTopicInPython"`

Expected Output: `Regular Expression Is An Important Topic In Python`

In [114]:
import re

def insert_spaces(text):
    # Use re.sub to insert spaces before each capital letter (except the first one)
    return re.sub(r'(?<!^)(?=[A-Z])', ' ', text)

# Example
sample_text = "RegularExpressionIsAnImportantTopicInPython"
output = insert_spaces(sample_text)
print(output) 


Regular Expression Is An Important Topic In Python


### Question 24 
Write a Python regex to find sequences of one uppercase letter followed by lowercase letters.


In [115]:
import re

def find_uppercase_sequences(text):
    # Regular expression to match one uppercase letter followed by lowercase letters
    pattern = re.compile(r'[A-Z][a-z]+')
    
    # Use finditer to get an iterator of match objects
    matches = pattern.finditer(text)
    
    # Extract the matched sequences
    return [match.group(0) for match in matches]

# Example
sample_text = "This is a Test String With Some UppercaseLetters."
output = find_uppercase_sequences(sample_text)
print(output)  


['This', 'Test', 'String', 'With', 'Some', 'Uppercase', 'Letters']


### Question 25
Write a Python program to remove continuous duplicate words from a sentence using a regular expression.


Sample Text: `"Hello hello world world"`

Expected Output: `Hello hello world`

In [125]:
import re

def remove_duplicate_words(sentence):
    # Match a word followed by one or more repeats of itself (case-sensitive)
    pattern = re.compile(r'\b(\w+)(?:\s+\1\b)+')
    
    # Keep only the first copy of each repeated word
    return pattern.sub(r'\1', sentence)

# Example 
sample_text = "Hello hello world world"
output = remove_duplicate_words(sample_text)
print(output)  


Hello hello world


### Question 26
Write a Python program using RegEx to accept a string ending with an alphanumeric character.


In [117]:
import re

def ends_with_alphanumeric(text):
    # Regular expression to ensure the string ends with an alphanumeric character
    pattern = r'.*[a-zA-Z0-9]'
    
    # Check if the string fully matches the pattern
    return bool(re.fullmatch(pattern, text))

# Example
sample_text1 = "Python123"
sample_text2 = "Python!"

print(ends_with_alphanumeric(sample_text1))  
print(ends_with_alphanumeric(sample_text2))  

True
False


### Question 27
Write a python program using RegEx to extract the hashtags.

Sample Text: ` """RT @kapil_kausik: #Doltiwal I mean #xyzabc is "hurt" by #Demonetization as the same has rendered USELESS <ed><U+00A0><U+00BD><ed><U+00B1><U+0089> "acquired funds" No wo"""`
    
Expected Output: `['#Doltiwal', '#xyzabc', '#Demonetization']`


In [118]:
import re

def extract_hashtags(text):
    # Regular expression to match hashtags
    pattern = re.compile(r'#\w+')
    
    # findall already returns the hashtags as a list
    hashtags = pattern.findall(text)
    
    return hashtags

# Example
sample_text = """RT @kapil_kausik: #Doltiwal I mean #xyzabc is "hurt" by #Demonetization as the same has rendered USELESS <ed><U+00A0><U+00BD><ed><U+00B1><U+0089> "acquired funds" No wo"""
hashtags = extract_hashtags(sample_text)
print(hashtags)  # Output: ['#Doltiwal', '#xyzabc', '#Demonetization']


['#Doltiwal', '#xyzabc', '#Demonetization']


### Question 28
Write a python program using RegEx to remove <U+..> like symbols

Check the below sample text, there are strange symbols something of the sort <U+..> all over the place. You need to come up with a general Regex expression that will cover all such symbols.

Sample Text: `"@Jags123456 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders"`

Expected Output: `@Jags123456 Bharat band on 28??<ed><ed>Those who  are protesting #demonetization  are all different party leaders`


In [124]:
import re

def remove_symbols_of_unicode(text):
    # Regular expression to match any <...> symbol starting with <U+>
    pattern = re.compile(r'<U\+[^>]+>')
    
    # Substitute the matched patterns with an empty string
    cleaned_text = re.sub(pattern, '', text)
    
    return cleaned_text

# Example
sample_text = "@Jags123456 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders"
output = remove_symbols_of_unicode(sample_text)
print(output)


@Jags123456 Bharat band on 28??<ed><ed>Those who  are protesting #demonetization  are all different party leaders


### Question 29
Write a python program to extract dates from the text stored in the text file.

Sample Text: `Ron was born on 12-09-1992 and he was admitted to school 15-12-1999.`

**Note:** Store this sample text in the file and then extract dates.

In [120]:
import re

def extract_dates(file_path):
    # Open the file and read its content
    with open(file_path, 'r') as file:
        text = file.read()

    # Regular expression to match dates in dd-mm-yyyy format
    pattern = r'\b\d{2}-\d{2}-\d{4}\b'
    
    # Find all dates in the text
    dates = re.findall(pattern, text)
    
    return dates

# Example 
file_path = 'sample_text.txt'
sample_text = "Ron was born on 12-09-1992 and he was admitted to school 15-12-1999."
with open(file_path, 'w') as f:
    f.write(sample_text)

extracted_dates = extract_dates(file_path)
print(extracted_dates)  # Output: ['12-09-1992', '15-12-1999']


['12-09-1992', '15-12-1999']


### Question 30
Create a function in Python to remove all words of length between 2 and 4 from a string.
The use of the re.compile() method is mandatory.

Sample Text: `"The following example creates an ArrayList with a capacity of 50 elements. 4 elements are then added to the ArrayList and the ArrayList is trimmed accordingly."`

Expected Output:  `following example creates ArrayList a capacity elements. 4 elements added ArrayList ArrayList trimmed accordingly.`



In [123]:
import re

def remove_words(text):
    # Compile the regular expression to match words of length between 2 and 4
    pattern = re.compile(r'\b\w{2,4}\b')
    
    # Use re.sub to remove the matched words
    result = re.sub(pattern, '', text)
    
    # Remove extra spaces that might have been left
    return ' '.join(result.split())

# Example in use
sample_text = "The following example creates an ArrayList with a capacity of 50 elements. 4 elements are then added to the ArrayList and the ArrayList is trimmed accordingly."
output = remove_words(sample_text)
print(output)


following example creates ArrayList a capacity elements. 4 elements added ArrayList ArrayList trimmed accordingly.
