# Module 1: File Handling and Manipulation in Python

### Learning Objectives:
- Understand the basics of file operations in Python such as reading, writing and appending data to files
- Work with various file types including text files, CSVs, Excel files, PDFs and Word documents
- Automate file processing tasks for real-world applications

## 1.1 Introduction to File Handling in Python
### Overview : 
In this section, you will learn how to perform basic file operations using Python's built-in functions. Understanding how to open, read, write, and append files is fundamental when working with data processing and automation.

### Topics :
- Opening Files :  
Python provides a simple way to open files using the built-in open() function. You can specify the mode in which the file is opened:
    - 'r': Read mode (default)
    - 'w': Write mode (overwrites the existing file or creates a new one).
    - 'a': Append mode (adds new content to the end of the file).
    - 'b': Binary mode (combined with another mode, e.g. 'rb' or 'wb'; used for non-text files, such as images).
- Reading Files:  

In [None]:
# Learn how to read the entire content of a file or specific lines:
with open('example.txt','r') as file:
    content = file.read()
    print(content)
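
The cell above reads the whole file at once. To read specific lines instead, a minimal sketch could use `readline()`, `readlines()`, or plain iteration over the file object. The file name `lines_demo.txt` and its contents are made up for illustration, and a temporary directory is used so nothing is left behind:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small sample file so the example is self-contained.
    # (The file name and contents are illustrative assumptions.)
    path = os.path.join(tmp_dir, "lines_demo.txt")
    with open(path, "w") as file:
        file.write("first line\nsecond line\nthird line\n")

    # readline() returns one line at a time, including the trailing newline.
    with open(path, "r") as file:
        first_line = file.readline().rstrip("\n")

    # readlines() returns every line as a list of strings.
    with open(path, "r") as file:
        all_lines = file.readlines()

    # Iterating over the file object yields one line at a time,
    # so only the current line is held in memory.
    with open(path, "r") as file:
        for line_number, line in enumerate(file, start=1):
            if line_number == 2:
                second_line = line.strip()
                break

print(first_line)      # first line
print(len(all_lines))  # 3
print(second_line)     # second line
```

Iteration is usually the better choice for large files, since `read()` and `readlines()` load everything into memory.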

- Writing and Appending to Files:  

In [1]:
# Writing to a file creates the file or replaces its content
with open("example.txt", 'w') as file:
    file.write('Hello, world!')

# Appending adds to an existing file without deleting previous content:
with open("example.txt", 'a') as file:
    file.write("\nThis is a new line.")

- Error Handling in File Operations: https://pythonbasics.org/try-except/

In [2]:
""" To avoid common issues such as trying to open a non-existent file,
you will learn to use try-except blocks for error handling:"""

try:
    with open('nonexistent.txt', 'r') as file:
        content = file.read()
except FileNotFoundError:
    print('File not found!')

File not found!
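
Building on the try-except cell above, one possible pattern wraps the read in a small helper that returns a fallback value instead of crashing. `read_text_or_default` is a hypothetical name chosen for this sketch, not a built-in function, and the file name is assumed not to exist:

```python
def read_text_or_default(path, default=""):
    """Return the file's text, or `default` if it cannot be read.

    Hypothetical helper for illustration; not part of the standard library.
    """
    try:
        with open(path, "r") as file:
            return file.read()
    except FileNotFoundError:
        print(f"File not found: {path}")
    except PermissionError:
        print(f"No permission to read: {path}")
    return default


# The file name below is assumed not to exist.
content = read_text_or_default("nonexistent_demo_file.txt", default="(empty)")
print(content)  # (empty)
```

Catching specific exceptions, rather than a bare `except`, keeps unrelated bugs visible.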


## 1.2 Working with CSV Files:
### Overview:
CSV (Comma Separated Values) files are commonly used for storing tabular data. Python's pandas library makes it easy to work with CSV files by offering powerful tools for reading, writing and manipulating this data format.

### Topics:
- Reading CSV Files with pandas:

In [None]:
# Using pandas, you can load a CSV file into a DataFrame:
import pandas as pd

df = pd.read_csv('data.csv')
print(df.head())

- Modifying Data in a DataFrame:


In [None]:
# Learn how to filter rows, update columns, and perform basic operations on CSV data:
df["new_column"] = df["existing_column"] * 2  # Create a new column based on existing data
filtered_df = df[df["age"] > 30]  # Filter rows based on a condition

- Writing CSV Files:

In [None]:
# After processing your data, you can save the modified DataFrame back to a CSV file:
df.to_csv("modified_data.csv", index=False)

## 1.3 Excel File Manipulation
### Overview:
Working with Excel files is essential in many business environments. Python's openpyxl and pandas libraries allow you to automate tasks like reading, writing and modifying Excel files.

### Topics: 
- Reading Excel Files:

In [None]:
# Use pandas to easily read Excel files into DataFrames:
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df.head())

- Modifying Excel Files with openpyxl:

In [None]:
from openpyxl import load_workbook
workbook = load_workbook('data.xlsx')
sheet = workbook.active
sheet["A1"] = "Updated Value"
workbook.save('data_modified.xlsx')

- Writing Excel Files

In [None]:
df.to_excel('modified_data.xlsx', index=False)

## 1.4 PDF and Word Documents
### Overview:
PDFs and Word documents are widely used for reports and contracts. Python libraries like PyPDF2, pdfplumber, and python-docx allow you to extract text from PDFs and automate the creation and modification of Word documents.
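
As a rough sketch of what this looks like in practice, the two helpers below extract text from a PDF with PyPDF2 and create a Word document with python-docx. The function names and the file names in the usage comments are hypothetical, and both third-party packages are assumed to be installed:

```python
def extract_pdf_text(pdf_path):
    """Return the text of every page in a PDF, joined by newlines."""
    # Imported inside the function so this sketch loads even if PyPDF2 is missing.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may give an empty result for scanned or image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def create_word_report(docx_path, title, paragraphs):
    """Create a simple Word document with a heading and paragraphs."""
    from docx import Document  # provided by the python-docx package

    doc = Document()
    doc.add_heading(title, level=1)
    for text in paragraphs:
        doc.add_paragraph(text)
    doc.save(docx_path)


# Example usage (commented out because the file names are placeholders):
# print(extract_pdf_text("report.pdf"))
# create_word_report("summary.docx", "Summary", ["First point.", "Second point."])
```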

# Module 2 : String Manipulation and Regular Expressions in Python
### Learning Objectives:
By the end of this module, you will be able to:
- Understand and perform basic string manipulation using Python's built-in string methods
- Apply advanced string processing techniques using regular expressions (regex) for pattern matching and data extraction.
- Use regular expressions to automate search-and-replace tasks and extract structured information from unstructured text.
## 2.1 Basic String Operations in Python
### Overview: 
This section introduces the basic string manipulation functions in Python, which are essential for text processing tasks such as formatting, searching and modifying strings.
### Topics:
- **String Creation and Access:**

In [6]:
text = "Hello, World!"
print(text[0]) # Output: "H"
print(text[:5]) # Output: "Hello"

H
Hello


- **Common String Methods:**

In [7]:
# Change case
text = "hello"
print(text.upper()) # Output: "HELLO"
print(text.capitalize()) # Output: "Hello"

# Replace text
new_text = text.replace('hello', "Hi")
print(new_text) #Output: 'Hi'

HELLO
Hello
Hi


- **String Formatting**

In [8]:
name = "Alice"
age = 30
message = f"My name is {name} and I am {age} years old"
print(message)

My name is Alice and I am 30 years old


- **Splitting and Joining Strings:**

In [9]:
sentence = "Python is great"
words = sentence.split()  # Output: ['Python', 'is', 'great']

new_sentence = ' '.join(words)
print(new_sentence)  # Output: 'Python is great'
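
`split()` also accepts an explicit separator, and it pairs well with `strip()` when cleaning delimited text. A short sketch with made-up sample data:

```python
# Sample record with inconsistent spacing (illustrative data).
record = " alice , 30 ,  Paris "

# Split on commas, then strip the surrounding whitespace from each field.
fields = [field.strip() for field in record.split(",")]
print(fields)  # ['alice', '30', 'Paris']

# Join the cleaned fields back together with a different separator.
cleaned = ";".join(fields)
print(cleaned)  # alice;30;Paris

# Other common checks: startswith() and find().
print(cleaned.startswith("alice"))  # True
print(cleaned.find("Paris"))        # 9
```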

## 2.2 Introduction to Regular Expressions (Regex)
### Overview: 
Regular expressions, or regex, are powerful tools for pattern matching and searching within text. They are especially useful when working with complex patterns, such as validating email addresses, phone numbers, or extracting specific text from documents.
### Topics :
- **What are Regular Expressions?**  
Regular expressions define search patterns, which you can use to match strings with certain characteristics. In Python, the re module allows you to use regular expressions:

In [12]:
import re

pattern = r"\d{3}-\d{2}-\d{4}" # Example : Social Security Number Pattern
text = "My SSN is 123-45-6789"

match = re.search(pattern, text)
if match:
    print("Found SSN:", match.group())

Found SSN: 123-45-6789


- **Basic Regular Expression Syntax:**
    - \d: Matches any digit (0-9)
    - \w: Matches any word character (a-z, A-Z, 0-9, _).
    - .: Matches any character except a newline.
    - +: Matches one or more occurrences of the preceding element.
    - *: Matches zero or more occurrences of the preceding element.
    - [ ]: Character class, matches any character in brackets.
    - ^: Matches the start of a string.
    - $: Matches the end of a string.
- **Using re.search(), re.findall(), and re.sub():**
    - re.search(): Searches for a pattern and returns the first match
    - re.findall(): Finds all occurrences of the pattern in a string
    - re.sub(): Substitutes matched patterns with a new string

In [16]:
# Example: Finding all digits in a string
text = "The price is 500 dollars"
pattern = r"\d+"
matches = re.findall(pattern, text)
print(matches)  # Output: ['500']

['500']
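
To see several of the syntax elements listed above working together, here is a small sketch. The sample sentence is invented for illustration:

```python
import re

text = "Order A12 shipped; order B7 pending."

# [A-Z] is a character class; \d+ means one or more digits.
codes = re.findall(r"[A-Z]\d+", text)
print(codes)  # ['A12', 'B7']

# ^ anchors the pattern to the start of the string.
print(bool(re.search(r"^Order", text)))  # True

# $ anchors the pattern to the end; \. matches a literal dot.
print(bool(re.search(r"pending\.$", text)))  # True

# . matches any single character and * allows zero or more of them.
tail = re.search(r"shipped.*", text).group()
print(tail)  # shipped; order B7 pending.

# re.sub() replaces every match with the given string.
masked = re.sub(r"\d+", "#", text)
print(masked)  # Order A# shipped; order B# pending.
```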


## 2.3 Advanced String Manipulation with Regular Expressions
### Overview : 
In this section, you’ll learn how to use regular expressions to solve more complex string manipulation tasks, such as extracting structured data from unstructured text and performing search-and-replace tasks.
### Topics: 
- **Extracting Structured Data with Regex:**
Regular expressions are highly useful for extracting information from unstructured data like logs, documents, or large text files. You can define a pattern to extract specific information, such as dates, phone numbers, or names:

In [17]:
text = "Meeting on 2023-09-15 at 10:00 AM."
date_pattern = r"\d{4}-\d{2}-\d{2}"
time_pattern = r"\d{2}:\d{2} (AM|PM)"

date_match = re.search(date_pattern, text)
time_match = re.search(time_pattern, text)

if date_match and time_match:
    print("Date:", date_match.group())
    print("Time:", time_match.group())

Date: 2023-09-15
Time: 10:00 AM


- **Regex Search and Replace:** Use re.sub() to perform search-and-replace operations in strings

In [19]:
text = "My old phone number is 123-456-7890."
new_text = re.sub(r"\d{3}-\d{3}-\d{4}","XXX-XXX-XXXX", text)
print(new_text)

My old phone number is XXX-XXX-XXXX.


- **Grouping and Capturing:** Grouping allows you to capture parts of the matched string. This is useful when you need to extract specific components from a pattern, like extracting the domain from an email address:

In [20]:
email = "user@example.com"
pattern = r"(\w+)@(\w+\.\w+)"
match = re.search(pattern, email)

if match:
    print("Username:", match.group(1))
    print("Domain:", match.group(2))

Username: user
Domain: example.com
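
Groups can also be given names, which tends to be easier to read than numeric indexes when a pattern has several parts. The sketch below uses an invented log line and invented email addresses:

```python
import re

log_line = "2023-09-15 ERROR Disk usage at 91%"

# (?P<name>...) gives each captured group a readable name.
pattern = r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.+)"
match = re.search(pattern, log_line)

if match:
    print(match.group("date"))           # 2023-09-15
    print(match.group("level"))          # ERROR
    print(match.groupdict()["message"])  # Disk usage at 91%

# When the pattern contains groups, findall() returns tuples of the captured parts.
emails = "alice@example.com, bob@test.org"
pairs = re.findall(r"(\w+)@(\w+\.\w+)", emails)
print(pairs)  # [('alice', 'example.com'), ('bob', 'test.org')]
```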


## 2.4 Automating Search-and-Replace in Large Text Files
### Overview:
In this section, you’ll learn how to automate search-and-replace tasks in large documents or logs, which is particularly useful for data cleaning and report generation.
### Topics:
- **Reading Large Files in Chunks:** When working with large files, reading the entire content into memory might not be feasible. You will learn how to process files line by line or in chunks to efficiently apply search and replace tasks:

In [None]:
import re

with open('large_text_file.txt', 'r') as file:
    for line in file:
        modified_line = re.sub(r'\b(error)\b', 'warning', line)
        # Print inside the loop so every line is shown, not just the last one
        print(modified_line, end='')

# \b represents a word boundary in regular expressions. It ensures that only
# the standalone word "error" is matched, not "error" embedded in another
# word (like "supererror" or "error123").

- **Search and Replace in Files:** Automate replacing specific patterns across large files, such as changing specific terms, standardizing dates, or correcting formatting errors. This can be applied to logs, documents or data files.
- **Saving Modified Content:** After performing the search and replace operation, save the modified content back to a file or generate a new file with the updates.

In [None]:
with open("large_text_file.txt", "r") as file, open("updated_file.txt", 'w') as new_file:
    for line in file:
        modified_line = re.sub(r'\b(old_word)\b', "new_word", line)
        new_file.write(modified_line)
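
One way to make this reusable is to wrap it in a function that also reports how many replacements were made. This is only a sketch: `replace_in_file` is a hypothetical helper, and the log contents are invented for the demonstration.

```python
import os
import re
import tempfile


def replace_in_file(src_path, dst_path, pattern, replacement):
    """Stream src_path to dst_path line by line, replacing pattern.

    Returns the total number of replacements made.
    Hypothetical helper for illustration.
    """
    compiled = re.compile(pattern)  # compile once, reuse for every line
    total = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            # subn() works like sub() but also returns the replacement count.
            new_line, count = compiled.subn(replacement, line)
            total += count
            dst.write(new_line)
    return total


# Demonstration on a temporary file with made-up log lines.
with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, "app.log")
    dst_path = os.path.join(tmp_dir, "app_updated.log")
    with open(src_path, "w") as f:
        f.write("error: disk full\nno errors here\nfatal error and error\n")

    replaced = replace_in_file(src_path, dst_path, r"\berror\b", "warning")

    with open(dst_path, "r") as f:
        updated_text = f.read()

print(replaced)  # 3
print(updated_text)
```

Note that "errors" on the second line is left alone, because the word boundary `\b` does not match between "error" and "s".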

# Module 3: Automating Document Modifications and Batch Processing in Python
### Learning Objectives: 
- Automate repetitive tasks that involve handling multiple files.
- Create scripts for batch processing of files such as CSVs, Excel, PDFs, and Word documents.
- Implement error handling to ensure robustness in your automation scripts
## 3.1 Automating File Processing
### Overview: 
In this section, you'll learn how to automate the processing of multiple files in Python. Automating repetitive file-handling tasks saves time and effort, especially when working with large numbers of files or performing similar operations across files.
### Topics
- **Using the os and glob Modules to Access Multiple Files:** The os module allows you to interact with your operating system’s file system, while glob is used to find all the pathnames matching a specific pattern (e.g., all .csv or .txt files in a directory):

In [25]:
import os 
import glob

# Get all Jupyter notebook (.ipynb) files in a directory
# (change the pattern to '*.csv' to match CSV files instead)
notebook_files = glob.glob('/Users/your_username/Desktop/*.ipynb')

for file in notebook_files:
    print(f"Processing file: {file}")
    


Processing file: /Users/emiletardy/Desktop/Untitled1.ipynb
Processing file: /Users/emiletardy/Desktop/Untitled.ipynb
Processing file: /Users/emiletardy/Desktop/WebScrapping.ipynb
Processing file: /Users/emiletardy/Desktop/PreparationOfData.ipynb
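
To try the same idea without depending on the contents of your Desktop, the sketch below builds a temporary directory with a few placeholder files (the file names are illustrative) and matches only the CSV files:

```python
import glob
import os
import tempfile

# Build a throwaway directory with a few files (names are illustrative).
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["sales.csv", "costs.csv", "notes.txt"]:
        with open(os.path.join(tmp_dir, name), "w") as f:
            f.write("placeholder\n")

    # os.path.join builds the pattern in a platform-independent way.
    pattern = os.path.join(tmp_dir, "*.csv")

    # glob returns matches in arbitrary order, so sort for stable results.
    csv_names = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(csv_names)  # ['costs.csv', 'sales.csv']
```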


- **Batch Processing Multiple Files:** You can automate the process of reading multiple files, performing operations (such as reading, updating, or writing back), and saving the results. This is helpful for tasks like applying the same format to multiple reports, cleaning up large datasets, or standardizing document formats.

In [None]:
import os
import glob 
import pandas as pd

# Specify the directory where your CSV files are stored (e.g. Desktop)
directory = "/Users/your_username/Desktop/"

# Get all CSV files in the directory
csv_files = glob.glob(os.path.join(directory,"*.csv"))

# Loop over each file and process it
for file in csv_files:
    # Read the CSV file into a DataFrame
    df = pd.read_csv(file)
    
    # Example operation: Create a new column based on an existing column
    df["New Column"] = df["Existing Column"]*2
    
    # Save the modified DataFrame to a new CSV file
    # The new file will have a "modified_" prefix added to its original filename
    new_filename = f"modified_{os.path.basename(file)}"
    df.to_csv(os.path.join(directory, new_filename), index=False)
    
    #Print the progress
    print(f"Processed and saved: {new_filename}")

- **Reading and Writing Files in a Loop:** Here you’ll learn how to read each file in a directory, perform modifications, and write the results back to a new file or update the existing one.

In [None]:
# Example: Reading and writing multiple CSV files
import pandas as pd

for file in csv_files:
    df = pd.read_csv(file)
    #Perform some data manipulation
    df["New Column"] = df["Existing Column"]*2
    df.to_csv(f"modified_{os.path.basename(file)}", index=False)

## 3.2 Batch Modifications in Word and PDF Files
### Overview:
This section focuses on performing batch operations on Word and PDF files, such as inserting content, updating templates, or extracting text from multiple files at once. These tasks are common when generating reports or processing large numbers of documents.
### Topics:
- **Automating Word Document Updates:** Use the python-docx library to modify multiple Word files, such as adding headers/footers, inserting text, or filling in placeholders dynamically:


In [None]:
import glob
import os
from docx import Document

# Forward slashes avoid invalid escape sequences such as "\U" in the path
doc_files = glob.glob("/Users/your_username/*.docx")

for doc_file in doc_files:
    doc = Document(doc_file)
    doc.add_paragraph("Automated addition to the document.")
    doc.save(f"updated_{os.path.basename(doc_file)}")

- **Automating PDF Modifications:** Use PyPDF2 to perform tasks like merging PDFs, extracting text, or adding watermarks to multiple PDF documents

In [None]:
import glob
import os
from PyPDF2 import PdfReader, PdfWriter

# Forward slashes avoid invalid escape sequences such as "\U" in the path
pdf_files = glob.glob("/Users/your_username/Desktop/*.pdf")

for pdf_file in pdf_files:
    reader = PdfReader(pdf_file)
    writer = PdfWriter()
    
    for page in reader.pages:
        writer.add_page(page)
    
    # The f prefix is required so the file name is actually interpolated
    with open(f"modified_{os.path.basename(pdf_file)}", "wb") as f_out:
        writer.write(f_out)

## 3.3 Combining Data from Multiple Files
### Overview:
In many cases, you will need to aggregate data from multiple files into a single file for analysis or reporting. This section will teach you how to combine and merge data from CSVs, Excel sheets, and other formats.
### Topics:
- **Merging Multiple CSV Files:** Learn to concatenate and merge data from multiple CSV files into one master file:

In [None]:
import pandas as pd

all_data = pd.DataFrame()

for file in csv_files:
    df = pd.read_csv(file)
    all_data = pd.concat([all_data,df])

all_data.to_csv('combined_data.csv',index=False)

- **Combining Excel Sheets:** Use pandas to combine multiple sheets from different Excel files into one comprehensive file:

In [None]:
all_sheets = pd.DataFrame()

excel_files = glob.glob('path/to/excel/files/*.xlsx')

for file in excel_files:
    df = pd.read_excel(file, sheet_name='Sheet1')
    all_sheets = pd.concat([all_sheets, df])

all_sheets.to_excel('combined_sheets.xlsx', index=False)

- **Extracting and Combining Text from PDFs:** You can extract and combine text from multiple PDFs into a single text file, which is useful for text analysis or summarization:

In [None]:
from PyPDF2 import PdfReader

combined_text = ""

for pdf_file in pdf_files:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        combined_text += page.extract_text()

with open("combined_text.txt", "w") as f_out:
    f_out.write(combined_text)

## 3.4 Error Handling in Batch Processing
### Overview:
When automating batch processes, errors can occur, such as missing files, invalid file formats, or incomplete data. This section will teach you how to handle such errors gracefully so that your automation is robust and keeps running even when it encounters problems.
### Topics:
- **Basic Error Handling with Try-Except:** Learn to catch and handle errors when processing files using try-except blocks. This prevents the program from crashing when it encounters an error:

In [None]:
try:
    with open('non_existent_file.csv', 'r') as file:
        content = file.read()
except FileNotFoundError:
    print("The file was not found.")

- **Logging Errors for Debugging:** Use Python’s logging module to track errors and issues while running batch processes. This helps you log important information without interrupting the flow of the script:

In [27]:
import logging

logging.basicConfig(filename='process.log', level=logging.ERROR)

try:
    # Some file operation
    with open('example.csv', 'r') as file:
        content = file.read()
except Exception as e:
    logging.error(f"An error occurred: {e}")

- **Skipping Files with Errors:** When processing many files, you might want to skip problematic files without stopping the entire script. This can be done using continue within a try-except block:

In [None]:
for file in csv_files:
    try:
        df = pd.read_csv(file)
        # Perform data manipulation
    except Exception as e:
        logging.error(f"Error with file {file}: {e}")
        continue  # Skip to the next file
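
A self-contained variation of this pattern is sketched below. It also keeps a list of the files that failed, so they can be reviewed after the run. The toy task (summing one number per line), the helper name `total_from_file`, and the sample files are assumptions made for the example:

```python
import logging
import os
import tempfile

logging.basicConfig(level=logging.ERROR)


def total_from_file(path):
    """Sum the integers in a file with one number per line (toy task)."""
    with open(path, "r") as f:
        return sum(int(line) for line in f if line.strip())


processed, failed = {}, []

with tempfile.TemporaryDirectory() as tmp_dir:
    # Two valid files and one with bad data (all illustrative).
    samples = {"a.txt": "1\n2\n3\n", "b.txt": "4\noops\n", "c.txt": "10\n"}
    for name, body in samples.items():
        with open(os.path.join(tmp_dir, name), "w") as f:
            f.write(body)

    for name in sorted(samples):
        path = os.path.join(tmp_dir, name)
        try:
            processed[name] = total_from_file(path)
        except (OSError, ValueError) as e:
            # Record the failure and move on to the next file.
            logging.error("Error with file %s: %s", name, e)
            failed.append(name)
            continue

print(processed)  # {'a.txt': 6, 'c.txt': 10}
print(failed)     # ['b.txt']
```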

## 3.5 Automating Reports and Email Notifications
### Overview:
In some workflows, after processing files or generating reports, you may want to automate the distribution of those reports via email. This section will introduce you to automating email notifications with attachments using Python.
### Topics:
- **Sending Automated Emails with smtplib:** Use the smtplib module to send emails with Python. Learn how to authenticate with an email server and send a simple email:

In [None]:
import smtplib
from email.mime.text import MIMEText

msg = MIMEText('This is an automated report.')
msg['Subject'] = 'Automated Report'
msg['From'] = 'your_email@example.com'
msg['To'] = 'recipient@example.com'

with smtplib.SMTP('smtp.example.com', 587) as server:
    server.starttls()
    server.login('your_email@example.com', 'password')
    server.sendmail('your_email@example.com', 'recipient@example.com', msg.as_string())

- **Sending Emails with Attachments:** Learn how to send emails with attached files (such as reports, Excel sheets, or PDFs) using email.mime modules:

In [None]:
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from email import encoders

# Create a MIMEMultipart email
msg = MIMEMultipart()
msg['From'] = 'your_email@example.com'
msg['To'] = 'recipient@example.com'
msg['Subject'] = 'Automated Report with Attachment'

# Attach a file (the with-block closes the file once it has been read)
with open('report.pdf', 'rb') as attachment:
    part = MIMEBase('application', 'octet-stream')
    part.set_payload(attachment.read())

encoders.encode_base64(part)
part.add_header('Content-Disposition', 'attachment; filename="report.pdf"')

msg.attach(part)

# Send the email
with smtplib.SMTP('smtp.example.com', 587) as server:
    server.starttls()
    server.login('your_email@example.com', 'password')
    server.sendmail(msg['From'], msg['To'], msg.as_string())

This script sends an email with an attached PDF file (or any file you need to send). You can modify it to attach other types of files such as Excel sheets or Word documents.
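
As an alternative sketch, the standard library's `email.message.EmailMessage` class can build the same kind of message with less boilerplate than the `MIMEBase` approach used above. The helper name `build_report_email`, the addresses, and the fake attachment bytes are placeholders for illustration, and nothing is actually sent here:

```python
import mimetypes
from email.message import EmailMessage


def build_report_email(sender, recipient, subject, body, filename, data):
    """Build an email with one attachment using the EmailMessage API."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)

    # Guess the MIME type from the file name; fall back to a generic type.
    mime_type, _ = mimetypes.guess_type(filename)
    maintype, subtype = (mime_type or "application/octet-stream").split("/", 1)
    msg.add_attachment(data, maintype=maintype, subtype=subtype, filename=filename)
    return msg


# Placeholder addresses and fake attachment bytes, for illustration only.
msg = build_report_email(
    "your_email@example.com",
    "recipient@example.com",
    "Automated Report with Attachment",
    "Please find the report attached.",
    "report.pdf",
    b"%PDF-1.4 placeholder bytes",
)

attachment_names = [part.get_filename() for part in msg.iter_attachments()]
print(msg["Subject"])    # Automated Report with Attachment
print(attachment_names)  # ['report.pdf']
```

A message built this way can be passed to `server.send_message(msg)` inside the same `smtplib.SMTP` block shown earlier. In real scripts, consider reading the password from an environment variable rather than writing it in the code.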