## Reading & Writing Files, Working with Strings

- Reading and writing files (file I/O: input/output) is a common task in scientific computing.
- In Python, files are handled using the built-in `open()` function.
- This function returns a file object, which can be used to read from or write to the file.
- Strings are the most common data type to be written to files.

---
 - **Regular expressions** are useful tools for working with text.

### writing a text file

In [1]:
content = """The optimist sees the glass half full.
The pessimist sees the glass half empty.\n
The chemist see the glass completely full, half in the liquid state and half in the vapor state."""

In [2]:
# open the file
fout = open('joke.txt','w')
fout.write(content)
# do not forget to close the file!
fout.close()

In [3]:
# Note the effect of newline character \n
!cat joke.txt

The optimist sees the glass half full.
The pessimist sees the glass half empty.

The chemist see the glass completely full, half in the liquid state and half in the vapor state.

### reading a text file

In [4]:
# read the whole file at once
fin = open('joke.txt','r')
content = fin.read()
fin.close() 
len(content)

177

In [5]:
content

'The optimist sees the glass half full.\nThe pessimist sees the glass half empty.\n\nThe chemist see the glass completely full, half in the liquid state and half in the vapor state.'

In [6]:
# read the file line by line with an iterator
content = ''
fin = open('joke.txt','r')
for line in fin:
    content += line
fin.close()

In [7]:
# closing files automatically using with 
with open('joke.txt','r') as fin:
    content = fin.read()
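
The same `with` pattern also works for writing and appending. The following is a minimal sketch, not part of the original notebook: it works in a temporary directory so that `joke.txt` stays untouched, and the file name `notes.txt` is a hypothetical choice.

```python
import os
import tempfile

# work in a throwaway directory so no existing file is overwritten
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'notes.txt')  # hypothetical file name

    # 'w' creates the file (or truncates an existing one)
    with open(path, 'w') as fout:
        fout.write('first line\n')

    # 'a' appends to the end instead of overwriting
    with open(path, 'a') as fout:
        fout.write('second line\n')

    # iterate line by line and strip the trailing newline of each line
    with open(path, 'r') as fin:
        lines = [line.rstrip('\n') for line in fin]

print(lines)
```

Each file is closed as soon as its `with` block ends, even if an exception is raised inside the block.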

### working with files & directories

In [8]:
# check existence with exists()
import os
os.path.exists('joke.txt')

True

In [9]:
# . (single dot) is the current directory
# .. (double dot) is the parent directory
os.path.exists('./joke.txt')

True

In [10]:
# checking the type with isfile()
os.path.isfile('joke.txt')

True

In [11]:
os.path.isdir('joke.txt')

False

In [12]:
# copying with shutil (shell utils)
import shutil
shutil.copy('joke.txt','bad_joke.txt')

'bad_joke.txt'

In [13]:
# renaming
os.rename('bad_joke.txt','good_joke.txt')

In [14]:
# delete a file with remove
os.remove('good_joke.txt')
os.path.exists('good_joke.txt')

False

### pathlib module

In [15]:
from pathlib import Path
file_path = Path('.') / Path('joke.txt') 
# the / operator joins path segments; Path overloads it via the __truediv__ magic method
file_path # this is an object of the Path class

PosixPath('joke.txt')

In [16]:
file_path.name

'joke.txt'

In [17]:
file_path.suffix

'.txt'

In [18]:
file_path.is_file()

True

In [19]:
# the object can be used with open
with open(file_path,'r') as fin:
    content = fin.read()
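
`Path` objects also offer methods that cover most of the `os` and `shutil` calls shown earlier. The sketch below is an illustration under assumptions: it uses a temporary directory and a hypothetical file name `sample.txt` rather than the files from this notebook.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    file_path = Path(tmpdir) / 'sample.txt'   # hypothetical file name

    # write_text() and read_text() open and close the file internally
    file_path.write_text('hello pathlib\n')
    text = file_path.read_text()

    stem = file_path.stem                      # file name without the suffix
    new_path = file_path.with_suffix('.bak')   # same path, different suffix

    # pathlib counterparts of os.rename() and os.remove()
    file_path.rename(new_path)
    existed_after_rename = file_path.exists()
    new_path.unlink()
    existed_after_unlink = new_path.exists()

print(repr(text), stem, new_path.name)
```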

## Exercise: File I/O 

### Objective
Read a file containing molecular weights of chemical compounds, calculate the total and average molecular weight, and write the results to a new file.

---

### Steps

### 1. Input File
Create a text file named `molecular_weights.txt` with the following content:

`18.015 # Water (H2O)`  
`28.014 # Nitrogen (N2)`    
`44.009 # Carbon Dioxide (CO2)`    
`58.443 # Sodium Chloride (NaCl)`    
`16.042 # Methane (CH4)`  

The file contains molecular weights of some common chemical compounds.

### 2. Tasks
- Open and read the content of `molecular_weights.txt`.
- Parse the molecular weights (ignore the comments in the file).
- Perform the following calculations:
  - **Total Molecular Weight**: The sum of all molecular weights.
  - **Average Molecular Weight**: The mean molecular weight.
- Write the results to a new file named `results.txt` in the format:

`Total Molecular Weight: 164.52`  
`Average Molecular Weight: 32.90`

### Hints
- Use the `open()` function with appropriate modes (`'r'` for reading and `'w'` for writing).
- Use Python's string methods to skip comments and extract numeric values: `split()` and `float()`.
- Utilize Python's built-in functions like `sum()` and `len()` for calculations.
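
As a starting point for the parsing step, the sketch below handles a single line only. It is one possible approach, not the reference solution; the variable names are illustrative.

```python
# one line as it appears in molecular_weights.txt
line = "18.015 # Water (H2O)"

# split once at '#': the part before it is the data, the rest is the comment
data_part, comment = line.split('#', 1)

weight = float(data_part)   # float() ignores surrounding whitespace
name = comment.strip()      # remove the leading space of the comment

print(weight, name)
```

A line without a `#` would make the two-name unpacking fail, which is one of the cases the bonus task on error handling has to cover.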


### 3. Bonus Tasks
- **Handle Errors Gracefully**:
  - If the input file is missing or contains invalid data, display an error message and exit gracefully. Use error handling (`try` and `except`) for robust code.
- **Include Metadata**:
  - Add a timestamp to the output file indicating when the calculations were performed, using the `time` module.
  - For example:
  ```
  Results computed on: 2024-12-19 15:30:00
  ```

## Regular expressions

A regular expression (RegEx) is used to identify and manipulate patterns in text, such as validating formats, searching, or replacing substrings.


In [20]:
import re

# Example DNA sequence
dna_sequence = "ATCGTTAGGCAAGGCGTTA"

# Regular expression pattern to find "A" followed by any two nucleotides and then "G"
pattern = r'A..G'

# Finding all matches in the DNA sequence
matches = re.findall(pattern, dna_sequence)
matches

['ATCG', 'AAGG']

In [21]:
# match() only matches at the beginning of the string:
dna_sequence = "ATGATTACA"

# Regex pattern to match if the DNA sequence starts with "ATG" (start codon)
if re.match(r'ATG', dna_sequence):
    print("This sequence starts with the start codon 'ATG'!")

This sequence starts with the start codon 'ATG'!


In [22]:
# search anywhere with search()
# Regex pattern to find the first occurrence of the triplet "GAT" anywhere in the sequence
if re.search(r'GAT', dna_sequence):
    print("Triplet 'GAT' found somewhere in the sequence!")

Triplet 'GAT' found somewhere in the sequence!
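
Both `re.match()` and `re.search()` return a match object on success and `None` on failure, which is why they can be used directly in an `if` statement. The following sketch, added here for illustration, inspects the match object itself:

```python
import re

dna_sequence = "ATGATTACA"

# search() returns a match object for the first hit anywhere in the string
m = re.search(r'GAT', dna_sequence)
print(m.group(), m.span())   # matched text and its (start, end) indices

# match() only tries position 0, so the same pattern finds nothing here
no_match = re.match(r'GAT', dna_sequence)
print(no_match)
```

The `end` index in `span()` is exclusive, so `dna_sequence[m.start():m.end()]` gives back the matched text.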


In [23]:
# finding all matches with findall
# Example molecular formula (glucose: C6H12O6)
elemental_analysis = "C6H12O6"

# Regex pattern to find all element symbols
# (capital letter followed by optional lowercase letter)
elements = re.findall(r'[A-Z][a-z]?', elemental_analysis)
elements

['C', 'H', 'O']

In [24]:
integers = re.findall(r'\d+', elemental_analysis)
integers

['6', '12', '6']

In [25]:
elemental_analysis = "C6H12O6"

# Using sub() to replace all integers with "x"
modified_analysis = re.sub(r'\d+', 'x', elemental_analysis)

print(f"Modified analysis: {modified_analysis}")

Modified analysis: CxHxOx
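
`re.sub()` also accepts a function as the replacement, which is called once per match. As a hedged illustration (the helper `double_count` is a hypothetical name), the sketch below doubles every count in the formula:

```python
import re

formula = "C6H12O6"

def double_count(match):
    """Return the matched integer multiplied by two, as a string."""
    return str(2 * int(match.group()))

# the function receives a match object and must return the replacement string
doubled = re.sub(r'\d+', double_count, formula)
print(doubled)
```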


In [26]:
chemical_formula = "C6H12O6"

# Using split() to separate elements and their quantities
split_formula = re.split(r'(\d+)', chemical_formula)

# The result will alternate between elements and their corresponding quantities
print(f"Split formula: {split_formula}")

Split formula: ['C', '6', 'H', '12', 'O', '6', '']
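
When the pattern contains several capture groups, `re.findall()` returns one tuple per match. The sketch below uses this to pair each element symbol with its count. It is a simplified illustration: it assumes every element occurs only once in the formula, and a formula such as `CH3OH` would need the counts to be summed instead.

```python
import re

chemical_formula = "C6H12O6"

# two capture groups: element symbol, then an optional count
pairs = re.findall(r'([A-Z][a-z]?)(\d*)', chemical_formula)

# an empty count string means a single atom
composition = {symbol: int(count) if count else 1 for symbol, count in pairs}
print(composition)
```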


### Special characters

- **`.`**: Matches any single character except newline.
- **`\d`**: Matches any digit (equivalent to `[0-9]`).
- **`\D`**: Matches any non-digit character.
- **`\w`**: Matches any word character (alphanumeric or underscore, `[A-Za-z0-9_]`).
- **`\W`**: Matches any non-word character.
- **`\s`**: Matches any whitespace character (spaces, tabs, newlines).
- **`\S`**: Matches any non-whitespace character.

In [27]:
import string
# contains 100 printable ASCII characters
printable = string.printable
len(printable)

100

In [28]:
printable

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

In [29]:
# Which characters in printable are digits? (raw string avoids an invalid-escape warning)
re.findall(r"\d", printable)

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [30]:
# Which are whitespace characters?
re.findall(r"\s", printable)

[' ', '\t', '\n', '\r', '\x0b', '\x0c']
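
The remaining character classes can be checked against `string.printable` in the same way. This is a small additional sketch that counts the matches instead of listing them:

```python
import re
import string

printable = string.printable

word_chars = re.findall(r"\w", printable)       # letters, digits, underscore
non_word_chars = re.findall(r"\W", printable)   # punctuation and whitespace
non_digits = re.findall(r"\D", printable)
non_space = re.findall(r"\S", printable)

print(len(word_chars), len(non_word_chars), len(non_digits), len(non_space))
```

Each upper-case class is the complement of its lower-case counterpart, so the counts for `\w` and `\W` add up to the 100 printable characters.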

### Regex specifiers

- **`^`**: Anchors the match to the start of the string.
- **`$`**: Anchors the match to the end of the string.
- **`*`**: Matches 0 or more of the preceding character or group.
- **`+`**: Matches 1 or more of the preceding character or group.
- **`?`**: Matches 0 or 1 of the preceding character or group (makes it optional).
- **`\`**: Escapes a special character (e.g., `\.` to match a literal dot).
- **`|`**: Alternation (acts as "or"; matches either the expression before or after it).
- **`()`**: Groups characters together, allowing you to apply quantifiers or capture groups.
- **`[]`**: Character class, matches any one character inside the brackets (e.g., `[abc]` matches 'a', 'b', or 'c').
- **`{}`**: Specifies a number or range of repetitions (e.g., `a{2}` matches exactly two 'a's).
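
A few of the specifiers listed above, sketched on short made-up sequences (the test strings are illustrative, not real data):

```python
import re

# {m,n}: between m and n repetitions of the preceding character
runs = re.findall(r"A{2,3}", "CAAGTAAAC")

# |: alternation, here between the three stop codons
# (findall scans every position and ignores the reading frame)
stops = re.findall(r"TAA|TAG|TGA", "CCTAACCTGA")

# \.: an escaped dot matches only a literal '.'
literal = re.findall(r"1\.5", "1.5 and 125")
# an unescaped dot matches any character, so '125' is found as well
loose = re.findall(r"1.5", "1.5 and 125")

print(runs, stops, literal, loose)
```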

### Examples

In [31]:
# Example using the anchors ^ and $
chemical_formulas = ["CH4", "C2H6", "C6H12", "C2HO", "H2O", "C2H5OH"]

# Regex pattern: C, optional digits, H, optional digits, and nothing else up to the end
pattern = r"^C\d*H\d*$"

# Apply the regex pattern
matches = [formula for formula in chemical_formulas if re.match(pattern, formula)]

print(f"Hydrocarbon matches: {matches}")

Hydrocarbon matches: ['CH4', 'C2H6', 'C6H12']


In [32]:
# Example: finding alcohols
chemical_formulas = ["C2H5OH", "CH3OH", "C3H7OH", "C6H12O", "C2HO","KOH"]

# Regex pattern: C and H with optional counts, followed by one or more OH groups
pattern = r"C\d*H\d*(OH)+"

# Apply the regex pattern
matches = [formula for formula in chemical_formulas if re.match(pattern, formula)]

print(f"Alcohol matches: {matches}")

Alcohol matches: ['C2H5OH', 'CH3OH', 'C3H7OH']
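
Note that `re.match()` anchors only at the start, so the pattern above also accepts strings with extra characters after the `OH` group. `re.fullmatch()` requires the whole string to match. The sketch below shows the difference; `"C2H5OHCl"` is a made-up test string, not a real compound formula.

```python
import re

pattern = r"C\d*H\d*(OH)+"
candidates = ["C2H5OH", "C2H5OHCl", "KOH"]   # "C2H5OHCl" is a made-up test string

# match(): the pattern only has to fit the beginning of the string
prefix_matches = [f for f in candidates if re.match(pattern, f)]

# fullmatch(): the pattern has to cover the entire string
whole_matches = [f for f in candidates if re.fullmatch(pattern, f)]

print(prefix_matches, whole_matches)
```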


In [33]:
# Example text containing ions with different charges
text = "The solution contains sulfate (SO4^2-) and hydrogensulfate (HSO4^-)."

# Regex pattern to match both SO4^2- and SO4^-
pattern = r"SO4\^2?-"

# Find all matches
matches = re.findall(pattern, text)

print(f"Matches: {matches}")

Matches: ['SO4^2-', 'SO4^-']


In [34]:
# finding a floating point number in a text
pattern = r"[-+]?\d*\.\d+"

text = "The temperature is -23.45°C, and the pH value is 7.05."

# Find all floating point numbers
matches = re.findall(pattern, text)

print(f"Floating-point numbers: {matches}")


Floating-point numbers: ['-23.45', '7.05']
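
The pattern above requires a decimal point, so it skips integers and cuts off exponents. One possible extension is sketched below; it is a hedged variant, not a complete number parser, and the example text is made up.

```python
import re

# optional sign, then digits with or without a decimal point,
# then an optional exponent such as e-3 or E+5
pattern = r"[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?"

text = "Rate constant 1.5e-3, temperature 298, potential -0.5"

matches = re.findall(pattern, text)
numbers = [float(s) for s in matches]   # convert the matched strings

print(matches)
print(numbers)
```

The groups use `(?:...)` so that they do not capture; with capturing groups, `re.findall()` would return the group contents instead of the full match.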
