<a href="https://colab.research.google.com/github/SrishtiTyagii/coursework/blob/main/Regex_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This project contains Python code that demonstrates the usage of regular expressions (regex) to extract and manipulate data from strings and text files.**


---



# Part A: Extract Names from a String


> The goal of this part is to extract all the names from a given string using regular expressions. Names are defined as words that start with a capital letter followed by one or more lowercase letters.





In [2]:
import re

def names():
    simple_string = """Amy is 5 years old, and her sister Mary is 2 years old.
    Ruth and Peter, their parents, have 3 kids."""

    pattern = "[A-Z][a-z]*"
    return re.findall(pattern, simple_string)

# Testing the result
assert len(names()) == 4, "There are four names in the simple_string"

names()


['Amy', 'Mary', 'Ruth', 'Peter']

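* I define the regex pattern `[A-Z][a-z]*`, which matches an uppercase letter followed by zero or more lowercase letters. Because `*` allows zero repetitions, a lone capital letter would also match; `[A-Z][a-z]+` would enforce the "one or more lowercase letters" definition strictly. Both patterns give the same result on this input.
* `re.findall()` returns every non-overlapping match of the pattern in the input string.
* The function returns these matches as a list of names.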
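To see why the quantifier matters, here is a minimal sketch comparing `*` with `+`. The sample sentence and the `\b` word-boundary variant are assumptions added for illustration; they are not part of the assignment.

```python
import re

# Hypothetical sample text, not taken from the assignment.
sample = "I met Amy and McKay at NASA."

# Zero or more lowercase letters: a lone capital letter also matches.
star = re.findall(r"[A-Z][a-z]*", sample)

# One or more lowercase letters: the capital must be followed by lowercase.
plus = re.findall(r"[A-Z][a-z]+", sample)

# Word boundaries restrict matches to whole capitalized words.
whole = re.findall(r"\b[A-Z][a-z]+\b", sample)

print(star)   # ['I', 'Amy', 'Mc', 'Kay', 'N', 'A', 'S', 'A']
print(plus)   # ['Amy', 'Mc', 'Kay']
print(whole)  # ['Amy']
```

None of these variants is a general-purpose name detector; each only describes a capitalization shape, so the right choice depends on the input text.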

# Part B: Extract Students with a 'B' Grade
> In this part, the task is to read a list of students and their grades, and extract the names of students who received a 'B'. The input is a multiline string where each line represents a student name followed by their grade.

In [3]:
import re

def grades():
    grades = """
    Ronald Mayr: A
    Bell Kassulke: B
    Jacqueline Rupp: A
    Alexander Zeller: C
    Valentina Denk: C
    Simon Loidl: B
    Elias Jovanovic: B
    Stefanie Weninger: A
    Fabian Peer: C
    Hakim Botros: B
    Emilie Lorentsen: B
    Herman Karlsen: C
    Nathalie Delacruz: C
    Casey Hartman: C
    Lily Walker : A
    Gerard Wang: C
    Tony Mcdowell: C
    Jake Wood: B
    Fatemeh Akhtar: B
    Kim Weston: B
    Nicholas Beatty: A
    Kirsten Williams: C
    Vaishali Surana: C
    Coby Mccormack: C
    Yasmin Dar: B
    Romy Donnelly: A
    Viswamitra Upandhye: B
    Kendrick Hilpert: A
    Killian Kaufman: B
    Elwood Page: B
    Mukti Patel: A
    Emily Lesch: C
    Elodie Booker: B
    Jedd Kim: A
    Annabel Davies: A
    Adnan Chen: B
    Jonathan Berg: C
    Hank Spinka: B
    Agnes Schneider: C
    Kimberly Green: A
    Lola-Rose Coates: C
    Rose Christiansen: C
    Shirley Hintz: C
    Hannah Bayer: B
    """

    # Raw string so that \w and "\ " reach the regex engine unchanged
    pattern = r"[\w]*\ [\w]*(?=:\ B)"
    return re.findall(pattern, grades)

# Testing the result
assert len(grades()) == 16, "There should be 16 students with a 'B' grade."

grades()

['Bell Kassulke',
 'Simon Loidl',
 'Elias Jovanovic',
 'Hakim Botros',
 'Emilie Lorentsen',
 'Jake Wood',
 'Fatemeh Akhtar',
 'Kim Weston',
 'Yasmin Dar',
 'Viswamitra Upandhye',
 'Killian Kaufman',
 'Elwood Page',
 'Elodie Booker',
 'Adnan Chen',
 'Hank Spinka',
 'Hannah Bayer']

* I define the regex pattern `[\w]*\ [\w]*(?=:\ B)` to match the names of students who received a 'B'.
* The pattern matches two words separated by a single space (first and last name). The lookahead `(?=:\ B)` requires that ": B" follows, without including it in the match.
* `re.findall()` collects all matches, and the function returns them as a list of student names.
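As an alternative to the lookahead, a capturing group anchored to each line can tolerate hyphenated names and stray spaces around the colon, cases the pattern above would not capture in full. The following is a sketch under assumptions: the roster is a shortened, hypothetical excerpt, and `b_students` is an illustrative helper name.

```python
import re

# Shortened, hypothetical roster used only for illustration.
roster = """
Bell Kassulke: B
Ronald Mayr: A
Lola-Rose Coates: B
Lily Walker : B
"""

def b_students(text):
    # With re.MULTILINE, ^ and $ match at the start and end of every line.
    # The group captures the name; spaces or tabs are allowed around ':'.
    pattern = r"^[ \t]*([\w-]+ [\w-]+)[ \t]*:[ \t]*B[ \t]*$"
    return re.findall(pattern, text, re.MULTILINE)

print(b_students(roster))
# ['Bell Kassulke', 'Lola-Rose Coates', 'Lily Walker']
```

When a pattern contains exactly one capturing group, `re.findall()` returns the group's text rather than the whole match, which is why only the names appear in the result.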







# Part C: Parse Web Log Data
> This part involves parsing web log data to extract structured information such as host, username, time, and request type. The input is a multiline string where each line corresponds to a log entry in a specific format.

In [4]:
import re

def logs():
    logdata = """
    146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1"
    197.109.77.178 - adoyle28 [21/Jun/2019:15:45:25 -0700] "GET /rands HTTP/1.1"
    156.127.178.177 - jbarlow70 [21/Jun/2019:15:45:26 -0700] "POST /forms HTTP/1.1"
    100.89.186.6 - rshawe1 [21/Jun/2019:15:45:27 -0700] "POST /interactive HTTP/1.1"
    """

    # Verbose regex with one named group per field; the raw string keeps backslashes intact
    pattern = r"""
    (?P<host>[\d]+\.[\d]+\.[\d]+\.[\d]+)\s-\s
    (?P<user_name>[\w-]+)\s\[
    (?P<time>[^\]]+)\]\s"
    (?P<request>[^"]+)"
    """

    result = []
    for item in re.finditer(pattern, logdata, re.VERBOSE):
        result.append(item.groupdict())

    return result

# Testing the result
assert len(logs()) == 4, "There should be 4 log entries parsed."

logs()

[{'host': '146.204.224.152',
  'user_name': 'feest6811',
  'time': '21/Jun/2019:15:45:24 -0700',
  'request': 'POST /incentivize HTTP/1.1'},
 {'host': '197.109.77.178',
  'user_name': 'adoyle28',
  'time': '21/Jun/2019:15:45:25 -0700',
  'request': 'GET /rands HTTP/1.1'},
 {'host': '156.127.178.177',
  'user_name': 'jbarlow70',
  'time': '21/Jun/2019:15:45:26 -0700',
  'request': 'POST /forms HTTP/1.1'},
 {'host': '100.89.186.6',
  'user_name': 'rshawe1',
  'time': '21/Jun/2019:15:45:27 -0700',
  'request': 'POST /interactive HTTP/1.1'}]

* I use a verbose regex pattern that captures four named groups: `host`, `user_name`, `time`, and `request`. The `re.VERBOSE` flag makes the regex engine ignore whitespace and line breaks inside the pattern, so it can be split across lines.
* `re.finditer()` iterates over all matches, and `groupdict()` turns each match into a dictionary keyed by group name.
* The function returns a list of these dictionaries, one per log entry.
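Once each entry is a dictionary, its fields can be post-processed with ordinary Python. As a sketch, the `time` field can be converted to a timezone-aware `datetime` and the `request` field split into its parts. The `enrich` helper and the `method`, `path`, and `protocol` keys are hypothetical names introduced here. The format string assumes the timestamp layout shown in the sample log and an English locale for the month abbreviation.

```python
from datetime import datetime

# One parsed entry, copied from the output above.
entry = {
    "host": "146.204.224.152",
    "user_name": "feest6811",
    "time": "21/Jun/2019:15:45:24 -0700",
    "request": "POST /incentivize HTTP/1.1",
}

def enrich(record):
    # Hypothetical helper: returns a copy with a parsed timestamp
    # and the request line split into method, path, and protocol.
    parsed = dict(record)
    parsed["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    method, path, protocol = record["request"].split()
    parsed.update(method=method, path=path, protocol=protocol)
    return parsed

result = enrich(entry)
print(result["time"].isoformat(), result["method"], result["path"])
```

Keeping the regex responsible only for splitting the line, and leaving type conversion to a separate step, makes each piece easier to test on its own.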