# NLP Class Activity 1

### **Objective**
This notebook walks through a two-part workflow common in data science and NLP:
1.  **Data Generation**: Create a large, synthetic (fake) dataset of student records using the `faker` library.
2.  **Pattern Matching**: Use **Regular Expressions (Regex)** to filter and query this dataset to find records that match specific text patterns.

---
## **Part 1: Generating the Student Dataset**
This section generates a realistic-looking dataset of 100,000 student records.

### **Libraries Used**
* **`pandas`**: The primary tool for creating and working with the data in a table-like structure called a DataFrame.
* **`faker`**: A Python library that generates fake data, such as names, which is essential for creating a realistic dataset without using real private information.
* **`random`**: A standard Python library used here to randomly assign courses and sections to each student.
* **`tabulate`**: A library used to print the data in a clean, well-formatted table in the output.



In [None]:
!pip install faker tabulate

In [None]:
import pandas as pd
from faker import Faker
import random
from tabulate import tabulate



print("Part 1: Generating the student dataset")

fake = Faker()

NUM_RECORDS = 100000
BATCH_CODE = "SE23UARI"
DOMAIN_NAME = "@mahindrauniversity.edu.in"
COURSE_CODES = ['CS3126', 'CS3202', 'MA2101','CS4301', 'DS3001', 'AI3100', 'EE3205', 'CH1101']
SECTIONS = ['AI1', 'AI2', 'AI3', 'AI4']

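student_data = []
for i in range(1, NUM_RECORDS + 1):
    # Generate the name parts separately so they can be reused in the email.
    first_name = fake.first_name()
    last_name = fake.last_name()
    record = {
        "Student Name": f"{first_name} {last_name}",
        # zfill(3) pads to at least three digits; larger indices keep their full length.
        "Roll Number": f"{BATCH_CODE}{str(i).zfill(3)}",
        # Pick 3 to 5 distinct course codes and join them into one string.
        "Courses Taken": ", ".join(random.sample(COURSE_CODES, k=random.randint(3, 5))),
        # A random 1-99 suffix makes duplicate addresses less likely.
        "Email": f"{first_name.lower()}.{last_name.lower()}{random.randint(1,99)}{DOMAIN_NAME}",
        "Section": random.choice(SECTIONS)
    }
    student_data.append(record)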

df = pd.DataFrame(student_data)
print("\nDataset generated successfully. Here's a sample:\n")
print(tabulate(df.head(20), headers="keys", tablefmt="pretty"))
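
One detail of the roll-number format is worth checking: `zfill(3)` pads only up to three digits, so with 100,000 records the later roll numbers are simply longer. The sketch below uses a hypothetical helper, `make_roll`, to mirror the f-string in the loop above. It also shows one possible alternative that pads every index to the same width; that alternative is an assumption for illustration, not something this notebook does.

```python
BATCH_CODE = "SE23UARI"
NUM_RECORDS = 100000

def make_roll(i, width=3):
    """Hypothetical helper mirroring the roll-number f-string in the loop."""
    return f"{BATCH_CODE}{str(i).zfill(width)}"

# zfill pads short numbers but never truncates long ones.
print(make_roll(7))       # SE23UARI007
print(make_roll(1007))    # SE23UARI1007
print(make_roll(100000))  # SE23UARI100000

# Possible alternative: pad every index to the width of the largest one.
fixed_width = len(str(NUM_RECORDS))  # 6 digits for 100000
print(make_roll(7, fixed_width))     # SE23UARI000007
```

Because of this, several roll numbers in the dataset share the same trailing digits, which matters for the suffix search in Part 2.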


## **Part 2: Regular Expression (Regex) Filtering**
This section demonstrates how to search the dataset for specific patterns using regular expressions. A regex is a special sequence of characters that defines a search pattern, allowing for powerful and flexible text matching.
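
As a quick orientation before the examples, here is a minimal sketch of the three matching modes in Python's `re` module. The sample string is made up for illustration. pandas' `str.contains`, used throughout this part, behaves like `re.search`: it looks for the pattern anywhere in each string.

```python
import re

text = "SE23UARI1007"  # made-up sample roll number

# re.search scans the whole string for the pattern.
found_anywhere = bool(re.search(r"007", text))
# re.match only tries to match at the start of the string.
found_at_start = bool(re.match(r"007", text))
# re.fullmatch requires the pattern to cover the entire string.
covers_all = bool(re.fullmatch(r"SE23UARI\d+", text))

print(found_anywhere, found_at_start, covers_all)  # True False True
```

This is why the examples below rely on anchors such as `^` and `$`: without them, a search-style match succeeds wherever the pattern occurs.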

In [None]:
import re
from tabulate import tabulate

### **Example 1: Find Roll Numbers Ending in "007"**
* **Goal**: To find any student whose roll number ends with the exact digits "007".
* **Pattern**: `r"007$"`
* **Explanation**:
    * `007`: Matches the literal characters "007".
    * `$`: A special character (anchor) that signifies the pattern must appear at the very **end of the string**.
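
To see what the `$` anchor does and does not guarantee, the sketch below runs the pattern over a few hand-written roll numbers (not taken from the generated data). The second, fully anchored pattern is an assumption about what one might want if only the single student numbered 007 should match.

```python
import re

samples = ["SE23UARI007", "SE23UARI1007", "SE23UARI0070", "SE23UARI070"]

# Suffix match: anything that ends in the digits 007.
suffix = re.compile(r"007$")
suffix_hits = [s for s in samples if suffix.search(s)]
print(suffix_hits)  # ['SE23UARI007', 'SE23UARI1007']

# Stricter variant (illustrative): anchor both ends to match one exact roll number.
exact = re.compile(r"^SE23UARI007$")
exact_hits = [s for s in samples if exact.search(s)]
print(exact_hits)  # ['SE23UARI007']
```

With 100,000 records, the suffix pattern is expected to return many rows (007, 1007, 2007, and so on), not just one.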

In [None]:
pattern_roll = r"007$"
matched_roll = df[df["Roll Number"].str.contains(pattern_roll, regex=True)]
print("Students with Roll Numbers ending in '007':")
print(tabulate(matched_roll.head(10), headers="keys", tablefmt="pretty"))
print("-" * 80)

### **Example 2: Find Students in "CS" Courses**
* **Goal**: To find all students who have taken at least one course that starts with "CS".
* **Pattern**: `r"\bCS\d{4}\b"`
* **Explanation**:
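    * `\b`: Represents a **word boundary**. Placed at both ends, it ensures the course code stands alone as a whole token rather than sitting inside a longer run of letters or digits.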
    * `CS`: Matches the literal letters "CS".
    * `\d{4}`: Matches exactly **four digits**.
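
The effect of the word boundaries is easiest to see on a few strings. In this sketch the first two rows imitate the "Courses Taken" format, while the third contains deliberately malformed codes that do not occur in the generated data.

```python
import re

pattern = re.compile(r"\bCS\d{4}\b")

rows = [
    "CS3126, MA2101, DS3001",   # one CS course
    "MA2101, EE3205, CH1101",   # no CS course
    "XCS3126, CS31260",         # malformed: extra leading letter, extra trailing digit
]

# findall returns every non-overlapping match in each row.
results = [pattern.findall(row) for row in rows]
print(results)  # [['CS3126'], [], []]
```

Without the `\b` markers, both malformed codes in the third row would count as matches.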

In [None]:
pattern_course = r"\bCS\d{4}\b"
matched_course = df[df["Courses Taken"].str.contains(pattern_course, regex=True)]
print("Number of students taking at least one CS course:", len(matched_course))
print("Preview:")
print(tabulate(matched_course.head(10), headers="keys", tablefmt="pretty"))
print("-" * 80)

### **Example 3: Find Students in Section "AI1"**
* **Goal**: To find all students who are in the section named exactly "AI1".
* **Pattern**: `r"^AI1$"`
* **Explanation**:
    * `^`: An anchor that matches the **start of the string**.
    * `$`: An anchor that matches the **end of the string**.
    * Using both ensures the entire string is *exactly* "AI1".
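
A small comparison shows why both anchors are needed. The section names "AI10" and "XAI1" are hypothetical (the generated data only contains AI1 to AI4); they are included to show what an unanchored pattern would let through.

```python
import re

sections = ["AI1", "AI10", "XAI1", "AI2"]  # "AI10" and "XAI1" are hypothetical

# Unanchored: matches any string that contains "AI1".
unanchored = [s for s in sections if re.search(r"AI1", s)]
# Anchored at both ends: only the exact string "AI1" matches.
anchored = [s for s in sections if re.search(r"^AI1$", s)]
# re.fullmatch gives the same result for these samples without explicit anchors.
full = [s for s in sections if re.fullmatch(r"AI1", s)]

print(unanchored)  # ['AI1', 'AI10', 'XAI1']
print(anchored)    # ['AI1']
print(full)        # ['AI1']
```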

In [None]:
pattern_section = r"^AI1$"
matched_section = df[df["Section"].str.contains(pattern_section, regex=True)]
print("Number of students in Section AI1:", len(matched_section))
print("Preview:")
print(tabulate(matched_section.head(10), headers="keys", tablefmt="pretty"))
print("-" * 80)