# [CPSC 322]() Data Science Algorithms
[Gonzaga University](https://www.gonzaga.edu/) |
[Sophina Luitel](https://www.gonzaga.edu/school-of-engineering-applied-science/faculty/detail/sophina-luitel-phd-0dba6a9d)


# Files

What are our learning objectives for this lesson?
* Understand file I/O in Python
* Read tables into memory as a list of lists


Content used in this lesson is based upon information in the following sources:
* None

## Warm-up Task(s)
In the folder you are working in, create a file named data.csv. In data.csv, enter the following data, matching the formatting exactly.

```
id,col1,col2,col3
1,25.3,45.0,60.0
2,45.0,5.6,32.5
3,45.4,67.4,45.5
```

## File I/O
A simple way to store data is in a *text file*, such as this simple text file, [transactions.txt](https://raw.githubusercontent.com/DataScienceAlgorithms/M2_Python/main/files/transactions.txt), that stores an individual's credit card transaction history. Each line in the file represents a transaction price.

To process data in a file, we typically take the following approach:
1. Open the file
1. Process the file
    * Read data (doesn't modify the file) or
    * Write data (overwrite existing file) or
    * Append data (retains existing information and adds new data)
1. Close the file

### Opening a File
Before we can read from a file or write to a file, we first need to open the file and get a file object (AKA handle). We do this with the built-in function `open()`:

In [1]:
# in_file is our variable connecting our program to transactions.txt
# transactions.txt is a file I have in a files folder in the same folder as this running Python file
in_file = open(r"files\transactions.txt", "r")

### File Modes

The first argument to `open()` is the path to the file, and the second argument is the **mode** in which the file is opened:

1. `"r"` – read mode
   - The file must exist, or Python will raise an error.  
2. `"w"` – write mode 
   - If the file does not exist, it will be created.  
   - If the file exists, its contents will be **cleared**.  
3. `"a"` – append mode
   - If the file does not exist, it will be created.  
   - If the file exists, new data is **added at the end** of the file.  

You can read more about file modes [here](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files).  

 `open()` returns a **file object**, which represents the connection between our program and the file (e.g., `transactions.txt`).  
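As a minimal sketch of how the three modes behave, the following writes to a scratch file in a temporary directory. The file names `modes_demo.txt` and `missing.txt` and the use of `tempfile` are assumptions made for this illustration; they are not files from the lesson.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched
# (modes_demo.txt is a hypothetical file name used only for this sketch)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes_demo.txt")

    # "w" creates the file because it does not exist yet
    # (the with statement closes the file for us; see "Closing a File")
    with open(path, "w") as out_file:
        out_file.write("first\n")

    # "a" keeps the existing contents and adds to the end
    with open(path, "a") as out_file:
        out_file.write("second\n")

    with open(path, "r") as in_file:
        after_append = in_file.read()

    # "w" again clears the existing contents before writing
    with open(path, "w") as out_file:
        out_file.write("third\n")

    with open(path, "r") as in_file:
        after_overwrite = in_file.read()

    # "r" on a file that does not exist raises FileNotFoundError
    try:
        open(os.path.join(tmp_dir, "missing.txt"), "r")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True

print(repr(after_append))     # 'first\nsecond\n'
print(repr(after_overwrite))  # 'third\n'
print(missing_raised)         # True
```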

---

### File Paths
The current directory is the folder where your Python script is running.  

- If your file is in the same directory as your script, you can just use its name:  

```python
file = open("transactions.txt", "r")
```
If a file you want to open is in a directory other than the current directory, you will have to specify its path. 

Note: On a Windows machine, folders and file names in a path are separated by backslashes "\\". The backslash has a special purpose in Python strings: it escapes certain characters, such as the newline "\n". Therefore, you have to escape each backslash as "`\\`" in the path to a file: `"files\\transactions.txt"`. Alternatively, you can write the path as a raw string: `r"files\transactions.txt"`. On a Unix-based machine (e.g. Mac, Linux distributions), the forward slash "/" is used in paths and you don't have to worry about this issue. Python also accepts forward slashes in paths on Windows, so `"files/transactions.txt"` works on both.
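One way to sidestep the separator issue altogether is to let the standard library build the path. The sketch below uses `pathlib`, which this lesson does not otherwise rely on, so treat it as an optional alternative rather than the approach used in the rest of these notes.

```python
from pathlib import Path

# Build the relative path to transactions.txt inside the files folder;
# pathlib inserts the separator that the current operating system uses
data_path = Path("files") / "transactions.txt"

print(data_path)           # files\transactions.txt on Windows, files/transactions.txt elsewhere
print(data_path.name)      # transactions.txt
print(data_path.parent)    # files
print(data_path.exists())  # True only if the file is present relative to the current directory

# A Path object can be passed straight to open():
# in_file = open(data_path, "r")
```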

### Closing a File
When we are done with a file, we should close it with `close()`:

In [2]:
in_file.close()

**Note**: Using `with open(...) as file:` automatically closes the file when done, which is the recommended approach:

In [38]:
in_file = open(r"files\transactions.txt", "r")
# ... process the file ...
print(f"Before manually closing: {in_file.closed}")
in_file.close()
print(f"After manually closing: {in_file.closed}")


with open(r"files\transactions.txt", "r") as infile:
    # ... process the file ...
    pass

# The file is now automatically closed
print(infile.closed) 



Before manually closing: False
After manually closing: True
True
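One reason `with` is recommended is that the file is closed even when an error interrupts the processing. The following is a minimal sketch of that behavior, assuming a hypothetical scratch file `closing_demo.txt` whose first line is deliberately not a number:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical scratch file so the sketch does not depend on transactions.txt
    path = os.path.join(tmp_dir, "closing_demo.txt")
    with open(path, "w") as out_file:
        out_file.write("not a number\n")

    try:
        with open(path, "r") as in_file:
            value = float(in_file.readline())  # raises ValueError
    except ValueError:
        # The with block was exited by the exception, yet the file was still closed
        closed_after_error = in_file.closed

print(closed_after_error)  # True
```

With a manual `open()`/`close()` pair, the `close()` call would have been skipped by the exception unless we wrapped it in `try`/`finally` ourselves.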


### Processing a File
Once a file is open, we want to process the data inside the file (reading) or save data to file (writing). Consider the example transactions.txt we opened earlier.

#### Reading from a File
We will use the `readline()` method to read a *single* line from the file (in transactions.txt, this is the purchase price as a **string that includes the newline character `\n`**):

In [25]:
in_file = open(r"files\transactions.txt", "r")
transaction = in_file.readline()
# note the newline printed!! repr() shows non-printable characters like \n
print(transaction, repr(transaction), type(transaction))
transaction = float(transaction)
print(transaction, type(transaction))

13.42
 '13.42\n' <class 'str'>
13.42 <class 'float'>


#### Writing to a File
Now, let's use the `write()` method to write the transaction price we just read in to an output file called single_transaction.txt:

In [27]:
# creates the file if it does not exist
# overwrites the file contents if it does exist
with open(r"files\single_transaction.txt", "w") as out_file:
    # save the value of transaction as string
    out_file.write(f"{transaction:.2f}")
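`write()` does not add a newline for us, so to store several values one per line we have to write the `"\n"` ourselves. A minimal sketch, assuming a hypothetical list of prices and a scratch file in a temporary directory (neither comes from the lesson files):

```python
import os
import tempfile

# Hypothetical prices used only for illustration
prices = [13.42, 27.0, 8.5]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "prices_demo.txt")

    with open(path, "w") as out_file:
        for price in prices:
            # write() needs a string and does not append "\n" on its own
            out_file.write(f"{price:.2f}\n")

    # read the file back to check what was stored
    with open(path, "r") as in_file:
        written_lines = in_file.readlines()

print(written_lines)  # ['13.42\n', '27.00\n', '8.50\n']
```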



Another simple way to store data is in a CSV file (Comma-Separated Values), which is commonly used in data science. For example, consider this sample file: [transactions.csv](https://raw.githubusercontent.com/DataScienceAlgorithms/M2_Python/main/files/transactions.csv). Each row represents an individual's credit card transaction with columns like Date, Description, and Amount.

Now that we have seen how to read a text file with `readline()`, we can move on to CSV files, which are just text files with a special structure: columns separated by commas. CSVs let us store table-like data in a simple and readable format. Let's look at more ways to read files, using a .csv file as the example. Later, we will also see how Python's `csv` module can make working with CSV files even easier.

## Different ways of File Reading
There are several ways to read data from a file in Python. Each has its use case:

1. `readline()` – Read One Line at a Time
   - Reads a single line from the file as a string, including the newline character `\n`.
   - Useful when processing a file line by line.
     
2. `readlines()` – Read All Lines as a List
   - Reads all lines at once and returns a list of strings.
   - Each string is one line, including `\n`.
   - Useful when you want to iterate over all lines or process lines in memory.
     
3. `read()` – Read the Entire File as a Single String
   - Reads the full content of the file at once.
   - Useful when you want to process the entire file as text.
   - Can split the string into lines using `.split("\n")` if needed.
4. Iterating Directly Over the File
   - The most memory-efficient method.
   - Iterates line by line over the file object.
   - Useful for large files where loading all lines into memory is not practical.


In [60]:
print(f"Using readline()")
with open(r"files\transaction.csv", "r") as file:
    line1 = file.readline()
    line2 = file.readline()
    print("Line 1:", line1)
    print("Line 2:", line2)

# note the newline printed!! repr() shows non-printable characters like \n
print(repr(line1), type(line1))

# strip() removes leading and trailing whitespace, including the newline character
print(f"After removing leading spaces and newline character: {line1.strip()}")


print(f"\nUsing readlines()")
with open(r"files\transaction.csv", "r") as file:
    lines = file.readlines()

for row in lines[1:]:  # Skipping header
    date, desc, amount = row.strip().split(",")
    print(date, desc, amount)
    
print(f"\n Using read()")
with open(r"files\transaction.csv", "r") as file:
    content = file.read()

rows = content.split("\n")
for row in rows[1:]:  # Skipping header
    print(row)

print(f"\nIterating directly over the file")
with open(r"files\transaction.csv", "r") as file:
    next(file)  # Skip header
    for line in file:
        date, desc, amount = line.strip().split(",")
        print(date, desc, amount)

Using readline()
Line 1:    Date,Description,Amount

Line 2:     9/1/2025,Coffee,3.5

'   Date,Description,Amount\n' <class 'str'>
After removing leading spaces and newline character: Date,Description,Amount

Using readlines()
9/1/2025 Coffee 3.5
9/2/2025 Groceries 45.2
9/3/2025 Movie Ticket 12
9/4/2025 Book 15.75
9/5/2025 Gas 30

 Using read()
    9/1/2025,Coffee,3.5
9/2/2025,Groceries,45.2
9/3/2025,Movie Ticket,12
9/4/2025,Book,15.75
9/5/2025,Gas,30


Iterating directly over the file
9/1/2025 Coffee 3.5
9/2/2025 Groceries 45.2
9/3/2025 Movie Ticket 12
9/4/2025 Book 15.75
9/5/2025 Gas 30
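The blank line at the end of the `read()` output above comes from `split("\n")`: when the text ends with a newline, the split produces one extra empty string after it. A small sketch of the difference between `split("\n")` and the string method `splitlines()`, using a hypothetical string in place of real file contents:

```python
# Hypothetical file contents, in the form read() would return them
content = "Date,Description,Amount\n9/1/2025,Coffee,3.5\n9/2/2025,Groceries,45.2\n"

rows_split = content.split("\n")
rows_splitlines = content.splitlines()

print(rows_split)       # the last element is an empty string
print(rows_splitlines)  # no empty string at the end
print(len(rows_split), len(rows_splitlines))  # 4 3
```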


### Example Problem
On average, how much money do I spend per transaction?

Algorithm:
1. For each transaction
    1. Read in the details of the transaction (each line) from the file
    1. Remove the newline character and use `split()` to create a list
    1. Get the amount value from the list
    1. Accumulate the total money spent so far
1. Divide the total money spent by the total number of transactions
1. Write the average transaction to a file

### `while` Loops 
Let's use a `while` loop. `readline()` will return an empty string when the end of the file is reached. This can be used in our Boolean condition:

In [61]:
def compute_avg_spent():
    '''
    Computes the average amount spent per transaction in transaction.csv
    '''
    # accumulator variable
    total_spent = 0.0
    # count the transactions
    num_transactions = 0

    with open(r"files/transaction.csv", "r") as in_file:
        # read (and skip) the header line
        header = in_file.readline()
        # read the first transaction
        spent = in_file.readline()

        # test if this line is the empty string, meaning the end of file has been reached
        while spent != "":
            # not end of file, process this transaction
            # remove the newline character using strip() and split the string into a list
            spent = spent.strip().split(",")
            print(spent)
            # the amount is at index 2
            amount = spent[2]
            total_spent += float(amount)
            num_transactions += 1
            # progress toward Boolean condition being False here is progress through the file
            spent = in_file.readline()

    # guard against dividing by zero when the file has no transactions
    if num_transactions == 0:
        return 0.0
    return total_spent / num_transactions

avg_spent_per_transaction = compute_avg_spent()

print(f"On average, you spend {avg_spent_per_transaction:.2f} per transaction ")

['9/1/2025', 'Coffee', '3.5']
['9/2/2025', 'Groceries', '45.2']
['9/3/2025', 'Movie Ticket', '12']
['9/4/2025', 'Book', '15.75']
['9/5/2025', 'Gas', '30']
On average, you spend 21.29 per transaction 
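The same algorithm can also be sketched with a `for` loop that iterates directly over the file, including the last step (writing the average to a file), which the `while` version does not perform. The function name `compute_avg_amount`, the scratch file names, and the choice to return 0.0 for an empty file are assumptions for this illustration; the sample rows are copied from the output above.

```python
import os
import tempfile

def compute_avg_amount(filename):
    """Return the average of the Amount column (index 2), skipping the header row."""
    total_spent = 0.0
    num_transactions = 0
    with open(filename, "r") as in_file:
        next(in_file, None)  # skip the header line (None avoids an error on an empty file)
        for line in in_file:
            fields = line.strip().split(",")
            if len(fields) < 3:
                continue  # skip blank or incomplete lines
            total_spent += float(fields[2])
            num_transactions += 1
    if num_transactions == 0:
        return 0.0  # assumption: report 0.0 when there are no transactions
    return total_spent / num_transactions

with tempfile.TemporaryDirectory() as tmp_dir:
    # Recreate the sample data in a scratch file so the sketch is self-contained
    csv_path = os.path.join(tmp_dir, "transaction_demo.csv")
    with open(csv_path, "w") as out_file:
        out_file.write("Date,Description,Amount\n")
        out_file.write("9/1/2025,Coffee,3.5\n")
        out_file.write("9/2/2025,Groceries,45.2\n")
        out_file.write("9/3/2025,Movie Ticket,12\n")
        out_file.write("9/4/2025,Book,15.75\n")
        out_file.write("9/5/2025,Gas,30\n")

    avg = compute_avg_amount(csv_path)

    # Final step of the algorithm: write the average to a file
    avg_path = os.path.join(tmp_dir, "average_demo.txt")
    with open(avg_path, "w") as out_file:
        out_file.write(f"{avg:.2f}\n")

    with open(avg_path, "r") as in_file:
        saved = in_file.read()

print(repr(saved))  # '21.29\n'
```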


## The File "Cursor"
When you open a file for reading (`"r"` mode), the cursor marking the current read position starts at the beginning of the file (position 0). Each call to `readline()` moves the cursor forward through the file. To find out the position of the cursor, you can call `tell()`:

In [69]:
with open(r"files\transactions.txt", "r") as in_file:

    print(f"File cursor is at position: {in_file.tell()}")

    # reading data from the file advances the cursor by a certain number of bytes, depending on the number of characters in the line
    transaction = in_file.readline()
    print(f"File cursor is at position: {in_file.tell()}")
    # the !r conversion displays the repr() of a string. we use it to see the newline character as \n
    print(f"First line contains: {transaction!r} which contains {len(transaction)} characters (including newline)")


File cursor is at position: 0
File cursor is at position: 7
First line contains: '13.42\n' which contains 6 characters (including newline)


To move the cursor back to the beginning of the file, you can either:
1. Close the file and re-open it
1. Use `seek(0,0)`:

In [70]:
with open(r"files\transactions.txt", "r") as in_file:

    print(f"File cursor is at position: {in_file.tell()}")

    # reading data from the file advances the cursor by a certain number of bytes, depending on the number of characters in the line
    transaction = in_file.readline()
    print(f"File cursor is at position: {in_file.tell()}")
    # the !r conversion displays the repr() of a string. we use it to see the newline character as \n
    print(f"First line contains: {transaction!r} which contains {len(transaction)} characters (including newline)")
    # move the cursor back to the beginning of the file
    in_file.seek(0,0) 
    print(f"File cursor is at position: {in_file.tell()}")


File cursor is at position: 0
File cursor is at position: 7
First line contains: '13.42\n' which contains 6 characters (including newline)
File cursor is at position: 0


Note: In the code above I used a built-in function called [`len()`](https://docs.python.org/3/library/functions.html#len). `len()` accepts a string as an argument and returns the number of characters in the string.

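Digression: On Windows, newlines are actually represented by \r\n (carriage return and newline). When a file is read in text mode, Python translates \r\n into a single \n for us, so we don't have to worry about this. Knowing this at least helps explain the cursor position of 7 above: the first line takes up 7 bytes in the file even though the string we read contains 6 characters.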

|Position|0|1|2|3|4|5|6|7|8|...|
|-|-|-|-|-|-|-|-|-|-|-|
|Character|1|3|.|4|2|\r|\n|2|7|...|

We can remove whitespace characters (like \n and \r) from both ends of a string with a call to the string method `strip()`.
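A short sketch of `strip()` on a hypothetical line that has leading spaces and a Windows-style line ending. The related methods `rstrip()` and `lstrip()` are not used elsewhere in this lesson and are shown only for comparison.

```python
# Hypothetical line with leading spaces and a \r\n ending
line = "   13.42\r\n"

both_ends = line.strip()    # whitespace removed from both ends
right_only = line.rstrip()  # only trailing whitespace removed
left_only = line.lstrip()   # only leading whitespace removed

print(repr(both_ends))   # '13.42'
print(repr(right_only))  # '   13.42'
print(repr(left_only))   # '13.42\r\n'
print(float(both_ends))  # 13.42
```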

## Alternative way to write

In [50]:

# alternative way to write to a file using print() instead of write() 

with open(r"files\out_demo.txt", "w") as outfile:
    print("Writing this output via print()", file=outfile) # file=outfile directs print() to write to the file object outfile instead of the console
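To make the difference between the two approaches concrete, here is a minimal sketch that writes to a hypothetical scratch file with both `write()` and `print()` and then reads the result back:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical scratch file used only for this comparison
    path = os.path.join(tmp_dir, "print_vs_write_demo.txt")

    with open(path, "w") as out_file:
        out_file.write("via write()")           # no newline is added
        out_file.write("\n")                    # so we write it explicitly
        print("via print()", file=out_file)     # print() appends "\n" by default
        print(3.5, 12, sep=",", file=out_file)  # print() converts non-strings and joins them with sep

    with open(path, "r") as in_file:
        demo_contents = in_file.read()

print(repr(demo_contents))  # 'via write()\nvia print()\n3.5,12\n'
```

`write()` only accepts a string, so numbers must be formatted first, whereas `print()` converts its arguments for us.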


## Reading in Tables
Assume the files are stored in “Comma Separated Values” (CSV) format:

18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu,2881
15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320,2847
18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite,2831
...

Use the [csv module](https://docs.python.org/3/library/csv.html) to read in datasets
* here we are storing a dataset as a list of lists (sublists are rows)

In [87]:
import csv

def read_csv(filename):
    '''
    Reads in a csv file and returns a table as a list of lists (rows)
    '''
    # newline='' is recommended by the csv module docs so the reader handles line endings itself
    with open(filename, 'r', newline='') as the_file:
        the_reader = csv.reader(the_file, dialect='excel')
        table = []
        for row in the_reader:
            # skip blank lines
            if len(row) > 0:
                table.append(row)
    return table

table1 = read_csv(r"files/transaction.csv")
print(table1)

[['   Date', 'Description', 'Amount'], ['    9/1/2025', 'Coffee', '3.5'], ['9/2/2025', 'Groceries    ', '45.2'], ['9/3/2025', 'Movie Ticket', '12'], ['9/4/2025', 'Book', '15.75'], ['9/5/2025', 'Gas', '30']]
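Notice that every value in the table is still a string, and some cells carry stray spaces from the file. One possible follow-up step is sketched below: strip each cell and convert the cells that look numeric to `float`. The helper name `clean_table` is hypothetical, and the sample input is the first three rows of the table printed above.

```python
def clean_table(table):
    """Strip whitespace from every cell and convert numeric-looking cells to float."""
    cleaned = []
    for row in table:
        new_row = []
        for value in row:
            value = value.strip()
            try:
                new_row.append(float(value))
            except ValueError:
                new_row.append(value)  # not numeric, keep it as a string
        cleaned.append(new_row)
    return cleaned

# First rows of the table printed above, used here as sample input
raw_table = [['   Date', 'Description', 'Amount'],
             ['    9/1/2025', 'Coffee', '3.5'],
             ['9/2/2025', 'Groceries    ', '45.2']]

header = clean_table(raw_table[:1])[0]
rows = clean_table(raw_table[1:])

print(header)  # ['Date', 'Description', 'Amount']
print(rows)    # [['9/1/2025', 'Coffee', 3.5], ['9/2/2025', 'Groceries', 45.2]]
```

Keeping the header separate from the data rows makes later computations on a column (such as averaging Amount) simpler, since every remaining row has the same types.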
