## Basic File Manipulation

For simple file ingestion, we can use Python's built-in `open` function.

This allows us to read any file (not just CSVs!) into our program for manipulation and data extraction.

In [9]:
f = open('data/queen_songs.txt', 'r', encoding="utf-8")
print(f)
f.close()

<_io.TextIOWrapper name='data/queen_songs.txt' mode='r' encoding='utf-8'>


Let's explore how we can read in the contents of the file as a string.

In [10]:
f = open('data/queen_songs.txt', 'r', encoding="utf-8")

text = f.read()
print(text)

f.close()

Is this the real life? Is this just fantasy?
Caught in a landslide, no escape from reality
Open your eyes, look up to the skies and see
I'm just a poor boy, I need no sympathy
Because I'm easy come, easy go, little high, little low
Any way the wind blows doesn't really matter to me, to me
Mama, just killed a man
Put a gun against his head, pulled my trigger, now he's dead
Mama, life had just begun
But now I've gone and thrown it all away
Mama, ooh, didn't mean to make you cry
If I'm not back again this time tomorrow
Carry on, carry on as if nothing really matters
Too late, my time has come
Sends shivers down my spine, body's aching all the time
Goodbye, everybody, I've got to go
Gotta leave you all behind and face the truth
Mama, ooh (any way the wind blows)
I don't wanna die
I sometimes wish I'd never been born at all
I see a little silhouetto of a man
Scaramouche, Scaramouche, will you do the Fandango?
Thunderbolt and lightning, very, very frightening me
(Galileo) Galileo, (Galileo) 

In [16]:
import re

f = open('data/queen_songs.txt', 'r', encoding="utf-8")

text = f.read()
# this allows us to implement common string manipulation techniques such as `split()`
print(text.split())

f.close()

['Is', 'this', 'the', 'real', 'life?', 'Is', 'this', 'just', 'fantasy?', 'Caught', 'in', 'a', 'landslide,', 'no', 'escape', 'from', 'reality', 'Open', 'your', 'eyes,', 'look', 'up', 'to', 'the', 'skies', 'and', 'see', "I'm", 'just', 'a', 'poor', 'boy,', 'I', 'need', 'no', 'sympathy', 'Because', "I'm", 'easy', 'come,', 'easy', 'go,', 'little', 'high,', 'little', 'low', 'Any', 'way', 'the', 'wind', 'blows', "doesn't", 'really', 'matter', 'to', 'me,', 'to', 'me', 'Mama,', 'just', 'killed', 'a', 'man', 'Put', 'a', 'gun', 'against', 'his', 'head,', 'pulled', 'my', 'trigger,', 'now', "he's", 'dead', 'Mama,', 'life', 'had', 'just', 'begun', 'But', 'now', "I've", 'gone', 'and', 'thrown', 'it', 'all', 'away', 'Mama,', 'ooh,', "didn't", 'mean', 'to', 'make', 'you', 'cry', 'If', "I'm", 'not', 'back', 'again', 'this', 'time', 'tomorrow', 'Carry', 'on,', 'carry', 'on', 'as', 'if', 'nothing', 'really', 'matters', 'Too', 'late,', 'my', 'time', 'has', 'come', 'Sends', 'shivers', 'down', 'my', 'spine,'

In [11]:
f = open('data/queen_songs.txt', 'r', encoding="utf-8")

data = f.readlines()
print(data[0])

f.close()

Is this the real life? Is this just fantasy?



In [22]:
f = open('data/queen_songs.txt', 'r', encoding="utf-8")

for line in f:
    print(line)

f.close()

Is this the real life? Is this just fantasy?

Caught in a landslide, no escape from reality

Open your eyes, look up to the skies and see

I'm just a poor boy, I need no sympathy

Because I'm easy come, easy go, little high, little low

Any way the wind blows doesn't really matter to me, to me

Mama, just killed a man

Put a gun against his head, pulled my trigger, now he's dead

Mama, life had just begun

But now I've gone and thrown it all away

Mama, ooh, didn't mean to make you cry

If I'm not back again this time tomorrow

Carry on, carry on as if nothing really matters

Too late, my time has come

Sends shivers down my spine, body's aching all the time

Goodbye, everybody, I've got to go

Gotta leave you all behind and face the truth

Mama, ooh (any way the wind blows)

I don't wanna die

I sometimes wish I'd never been born at all

I see a little silhouetto of a man

Scaramouche, Scaramouche, will you do the Fandango?

Thunderbolt and lightning, very, very frightening me

(Galil
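The blank line after each lyric in the output above appears because every line read from a file keeps its trailing newline, and `print` then adds another. Here is a minimal sketch of the difference, using `io.StringIO` with a two-line sample as a stand-in for the opened file (an assumption, so the example runs without `data/queen_songs.txt`):

```python
import io

# Hypothetical two-line sample; io.StringIO stands in for the opened text file.
sample = (
    "Is this the real life? Is this just fantasy?\n"
    "Caught in a landslide, no escape from reality\n"
)

f = io.StringIO(sample)
lines = f.readlines()      # a list of lines, each keeping its trailing "\n"
f.close()
print(repr(lines[0]))      # repr makes the trailing newline visible

f = io.StringIO(sample)
# Iterating yields one line at a time; rstrip drops the trailing newline.
stripped = [line.rstrip("\n") for line in f]
f.close()

for line in stripped:
    print(line)            # print adds a single newline, so no blank lines appear
```

Passing `end=""` to `print` is another way to avoid the doubled newline.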

Instead of calling `close` every time we open a file, we can use a context manager (the `with` statement), which closes the file automatically when the block ends. Notice how neat this is!

In [None]:
with open('data/queen_songs.txt', 'r', encoding="utf-8") as f:
    data = f.readlines()
    print(data[0])
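To see what the context manager does for us, we can check the file object's `closed` attribute. The sketch below writes a hypothetical scratch file to a temporary directory (an assumption, so it runs without `data/queen_songs.txt`) and shows that the file is closed on leaving the block, even when the block exits through an exception:

```python
import os
import tempfile

# Hypothetical scratch file so the example does not depend on the data folder.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Is this the real life?\n")

    with open(path, "r", encoding="utf-8") as f:
        first_line = f.readline()
        open_inside = not f.closed   # still open inside the block

    closed_after = f.closed          # closed automatically on exit

    # The file is also closed when the block is left because of an error.
    try:
        with open(path, "r", encoding="utf-8") as g:
            raise ValueError("simulated failure while processing")
    except ValueError:
        pass
    closed_after_error = g.closed

print(open_inside, closed_after, closed_after_error)
```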

The `open` function is not limited to text files. We can also open binary files by adding `b` to the mode (for example, `'rb'`), but be warned that reads then return `bytes` rather than `str`, so common string manipulation methods may no longer apply!

In [21]:
with open('data/Tux.bmp', 'rb') as f:
    data = f.readlines()
    print(data[0])

b'BM\x8aJ.\x00\x00\x00\x00\x00\x8a\x00\x00\x00|\x00\x00\x00 \x03\x00\x00\xb4\x03\x00\x00\x01\x00 \x00\x03\x00\x00\x00\x00J.\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\x00\x00\xff\x00\x00\xff\x00\x00\x00\x00\x00\x00\xffBGRs\x8f\xc2\xf5(Q\xb8\x1e\x15\x1e\x85\xeb\x01333\x13fff&fff\x06\x99\x99\x99\t=\n'
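Because newline bytes can occur anywhere in binary data, `readlines` splits it at arbitrary points. Reading a fixed number of bytes is usually a better fit. The sketch below assumes the standard BMP file header layout (a 2-byte signature followed by a 4-byte little-endian file size) and reuses the first six bytes of the output above, with `io.BytesIO` standing in for a file opened with mode `'rb'`:

```python
import io
import struct

# First six bytes of the Tux.bmp output shown above; io.BytesIO stands in
# for the file object returned by open(..., 'rb').
f = io.BytesIO(b"BM\x8aJ.\x00")

header = f.read(6)  # read exactly six bytes instead of splitting on newlines
f.close()

# "<2sI": little-endian, a 2-byte string followed by an unsigned 32-bit integer.
magic, file_size = struct.unpack("<2sI", header)

print(magic)      # b'BM'
print(file_size)  # 3033738
```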


## CSV Data

`pandas` is not the only tool out there for data manipulation. In fact, its greedy memory usage is sometimes overkill for smaller datasets. For example, let's say we're analyzing a small dataset of `AMZN` stock data.

In [35]:
import pandas as pd
import tracemalloc

# start measuring memory usage
tracemalloc.start()

# do some basic pandas commands
df = pd.read_csv("data/AMZN.csv")

# display the memory usage
print(tracemalloc.get_traced_memory())
 
# stop measuring memory usage
tracemalloc.stop()

(495332, 709743)


`tracemalloc.get_traced_memory()` returns a `(current, peak)` tuple in bytes. At our peak, we are using `709,743` bytes (`709.743` kilobytes). Not too bad, but we can do better.

In [38]:
import csv

# start measuring memory usage
tracemalloc.start()

with open('data/AMZN.csv', 'r') as file:
    # read in as CSV
    reader = csv.DictReader(file)

# display the memory usage
print(tracemalloc.get_traced_memory())
 
# stop measuring memory usage
tracemalloc.stop()

(149834, 160731)


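Now at our peak, we are using `160,731` bytes (`160.731` kilobytes). Keep in mind that this cell only creates the reader: `csv.DictReader` reads rows lazily, so no rows have been parsed yet. This might not seem like a big difference, but remember, code scales!

Imagine this operation being repeated 100 or even 1,000 times.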

Let's dive deeper into this code to see how it works.

In [39]:
import csv

with open('data/AMZN.csv', 'r') as file:
    # read in as CSV
    reader = csv.DictReader(file)
    for row in reader:
        print(row)

{'Date': '1/4/2016', 'Open': '32.814499', 'High': '32.886002', 'Low': '31.3755', 'Close': '31.849501', 'Adj Close': '31.849501', 'Volume': '186290000'}
{'Date': '1/5/2016', 'Open': '32.342999', 'High': '32.345501', 'Low': '31.3825', 'Close': '31.689501', 'Adj Close': '31.689501', 'Volume': '116452000'}
{'Date': '1/6/2016', 'Open': '31.1', 'High': '31.9895', 'Low': '31.015499', 'Close': '31.6325', 'Adj Close': '31.6325', 'Volume': '106584000'}
{'Date': '1/7/2016', 'Open': '31.09', 'High': '31.5', 'Low': '30.2605', 'Close': '30.396999', 'Adj Close': '30.396999', 'Volume': '141498000'}
{'Date': '1/8/2016', 'Open': '30.983', 'High': '31.207001', 'Low': '30.299999', 'Close': '30.352501', 'Adj Close': '30.352501', 'Volume': '110258000'}
{'Date': '1/11/2016', 'Open': '30.624001', 'High': '30.9925', 'Low': '29.928499', 'Close': '30.886999', 'Adj Close': '30.886999', 'Volume': '97832000'}
{'Date': '1/12/2016', 'Open': '31.262501', 'High': '31.2995', 'Low': '30.612', 'Close': '30.894501', 'Adj C

Notice that each row is now expressed as a dictionary! This allows us to access individual rows and their respective columns by name.

For example, if we want to calculate the average opening price for Amazon's stock, we need to take a bit more of a manual approach...

In [45]:
import csv

tot = 0

# start measuring memory usage
tracemalloc.start()

with open('data/AMZN.csv', 'r') as file:
    # read in as CSV
    reader = csv.DictReader(file)
    for row in reader:
        tot += float(row["Open"])
    rows = reader.line_num - 1

average = tot/rows

# display the memory usage
print(tracemalloc.get_traced_memory())
 
# stop measuring memory usage
tracemalloc.stop()

print("Average Opening Price for AMZN stock is", average)

(171514, 196511)
Average Opening Price for AMZN stock is 97.61417350731429
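The `float(row["Open"])` call above is needed because `csv.DictReader` yields every value as a string. Here is a small sketch of the pitfall, using a hypothetical two-row excerpt in the same shape as `data/AMZN.csv` and `io.StringIO` in place of the file:

```python
import csv
import io

# Hypothetical two-row excerpt; io.StringIO stands in for the opened CSV file.
csv_text = (
    "Date,Open\n"
    "1/4/2016,32.814499\n"
    "1/5/2016,32.342999\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Values are strings, so "+" concatenates instead of adding.
joined = rows[0]["Open"] + rows[1]["Open"]
print(type(rows[0]["Open"]).__name__, joined)

# Convert explicitly before doing arithmetic.
opens = [float(row["Open"]) for row in rows]
average = sum(opens) / len(opens)
print(average)
```

Counting the converted values with `len` is an alternative to `reader.line_num - 1`. `line_num` counts the lines read from the source, which can differ from the number of records when a quoted field spans several lines.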


Once again, let's take a look at the memory usage and compare it to our `pandas` usage.

In [46]:
# start measuring memory usage
tracemalloc.start()

# do some basic pandas commands
df = pd.read_csv("data/AMZN.csv")
print(df["Open"].mean())

# display the memory usage
print(tracemalloc.get_traced_memory())
 
# stop measuring memory usage
tracemalloc.stop()

97.61417350731422
(349750, 561919)


With the `csv` package, peak memory usage is `196511` bytes, while `pandas` peaks at `561919` bytes, roughly 2.9 times as much.
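Peak figures like these depend on how the rows are consumed. The rough sketch below uses synthetic data and a hypothetical `peak_bytes` helper (neither is part of the notebook) to compare creating a reader, streaming rows one at a time, and loading every row into a list. The exact numbers will vary by machine and Python version.

```python
import csv
import io
import tracemalloc

# Hypothetical synthetic data standing in for data/AMZN.csv.
csv_text = "Date,Open\n" + "".join(
    f"day{i},{30 + i * 0.01:.2f}\n" for i in range(5000)
)

def peak_bytes(func):
    """Return the peak number of bytes traced by tracemalloc while func() runs."""
    tracemalloc.start()
    func()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

def create_reader_only():
    # Nothing is parsed yet: DictReader reads rows only when iterated.
    csv.DictReader(io.StringIO(csv_text))

def stream_rows():
    # Visit every row, but keep only one row alive at a time.
    for _ in csv.DictReader(io.StringIO(csv_text)):
        pass

def load_all_rows():
    # Materialize every row as a dict, keeping the whole table in memory.
    return list(csv.DictReader(io.StringIO(csv_text)))

peak_lazy = peak_bytes(create_reader_only)
peak_stream = peak_bytes(stream_rows)
peak_list = peak_bytes(load_all_rows)
print(peak_lazy, peak_stream, peak_list)
```

Streaming is what the averaging loop above does, which is one reason its peak stays low.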

## JSON

Let's also review how we can load `JSON` files in Python. Note that once the data has been loaded, it stays available to the rest of our code, even after the context manager has closed the file!

In [47]:
import json

with open('data/spotify-api.json') as j:
    d = json.load(j)

print(d)

[{'added_at': '2024-03-05T00:48:56Z', 'track': {'album': {'album_type': 'single', 'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/1KpEYlQPQN64r0aRE9Wg6i'}, 'href': 'https://api.spotify.com/v1/artists/1KpEYlQPQN64r0aRE9Wg6i', 'id': '1KpEYlQPQN64r0aRE9Wg6i', 'name': 'wev', 'type': 'artist', 'uri': 'spotify:artist:1KpEYlQPQN64r0aRE9Wg6i'}], 'available_markets': ['AR', 'AU', 'AT', 'BE', 'BO', 'BR', 'BG', 'CA', 'CL', 'CO', 'CR', 'CY', 'CZ', 'DK', 'DO', 'DE', 'EC', 'EE', 'SV', 'FI', 'FR', 'GR', 'GT', 'HN', 'HK', 'HU', 'IS', 'IE', 'IT', 'LV', 'LT', 'LU', 'MY', 'MT', 'MX', 'NL', 'NZ', 'NI', 'NO', 'PA', 'PY', 'PE', 'PH', 'PL', 'PT', 'SG', 'SK', 'ES', 'SE', 'CH', 'TW', 'TR', 'UY', 'US', 'GB', 'AD', 'LI', 'MC', 'ID', 'JP', 'TH', 'VN', 'RO', 'IL', 'ZA', 'SA', 'AE', 'BH', 'QA', 'OM', 'KW', 'EG', 'MA', 'DZ', 'TN', 'LB', 'JO', 'PS', 'IN', 'BY', 'KZ', 'MD', 'UA', 'AL', 'BA', 'HR', 'ME', 'MK', 'RS', 'SI', 'KR', 'BD', 'PK', 'LK', 'GH', 'KE', 'NG', 'TZ', 'UG', 'AG', 'AM', 'BS',

Now that we've loaded our `json` object, we can interact with it just as we do with any other Python data structure.

In [48]:
d[0]

{'added_at': '2024-03-05T00:48:56Z',
 'track': {'album': {'album_type': 'single',
   'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/1KpEYlQPQN64r0aRE9Wg6i'},
     'href': 'https://api.spotify.com/v1/artists/1KpEYlQPQN64r0aRE9Wg6i',
     'id': '1KpEYlQPQN64r0aRE9Wg6i',
     'name': 'wev',
     'type': 'artist',
     'uri': 'spotify:artist:1KpEYlQPQN64r0aRE9Wg6i'}],
   'available_markets': ['AR',
    'AU',
    'AT',
    'BE',
    'BO',
    'BR',
    'BG',
    'CA',
    'CL',
    'CO',
    'CR',
    'CY',
    'CZ',
    'DK',
    'DO',
    'DE',
    'EC',
    'EE',
    'SV',
    'FI',
    'FR',
    'GR',
    'GT',
    'HN',
    'HK',
    'HU',
    'IS',
    'IE',
    'IT',
    'LV',
    'LT',
    'LU',
    'MY',
    'MT',
    'MX',
    'NL',
    'NZ',
    'NI',
    'NO',
    'PA',
    'PY',
    'PE',
    'PH',
    'PL',
    'PT',
    'SG',
    'SK',
    'ES',
    'SE',
    'CH',
    'TW',
    'TR',
    'UY',
    'US',
    'GB',
    'AD',
    'LI',
    'MC',
    

In [50]:
d[0]['track']

{'album': {'album_type': 'single',
  'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/1KpEYlQPQN64r0aRE9Wg6i'},
    'href': 'https://api.spotify.com/v1/artists/1KpEYlQPQN64r0aRE9Wg6i',
    'id': '1KpEYlQPQN64r0aRE9Wg6i',
    'name': 'wev',
    'type': 'artist',
    'uri': 'spotify:artist:1KpEYlQPQN64r0aRE9Wg6i'}],
  'available_markets': ['AR',
   'AU',
   'AT',
   'BE',
   'BO',
   'BR',
   'BG',
   'CA',
   'CL',
   'CO',
   'CR',
   'CY',
   'CZ',
   'DK',
   'DO',
   'DE',
   'EC',
   'EE',
   'SV',
   'FI',
   'FR',
   'GR',
   'GT',
   'HN',
   'HK',
   'HU',
   'IS',
   'IE',
   'IT',
   'LV',
   'LT',
   'LU',
   'MY',
   'MT',
   'MX',
   'NL',
   'NZ',
   'NI',
   'NO',
   'PA',
   'PY',
   'PE',
   'PH',
   'PL',
   'PT',
   'SG',
   'SK',
   'ES',
   'SE',
   'CH',
   'TW',
   'TR',
   'UY',
   'US',
   'GB',
   'AD',
   'LI',
   'MC',
   'ID',
   'JP',
   'TH',
   'VN',
   'RO',
   'IL',
   'ZA',
   'SA',
   'AE',
   'BH',
   'QA',
   'OM',
   'KW'

In [51]:
d[0]['track']['name']

'realornot?'
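The outputs above are long, so it can help to inspect the structure before drilling down. Here is a minimal sketch, assuming a hypothetical, heavily trimmed record that keeps only a few of the fields visible above:

```python
import json

# Hypothetical trimmed record; the real file contains many more fields.
raw = (
    '[{"added_at": "2024-03-05T00:48:56Z", '
    '"track": {"album": {"album_type": "single"}, "name": "realornot?"}}]'
)

# json.loads parses a string; json.load (used above) reads from a file object.
d = json.loads(raw)

# JSON arrays become lists and JSON objects become dicts.
print(type(d).__name__, type(d[0]).__name__)

# List the keys at each level instead of printing the whole structure.
top_keys = list(d[0].keys())
track_keys = list(d[0]["track"].keys())
print(top_keys, track_keys)

# Pretty-print with indentation to make the nesting easier to follow.
print(json.dumps(d[0], indent=2))

# dict.get returns a default instead of raising KeyError for a missing key.
missing = d[0]["track"].get("not_a_real_key", "n/a")
print(missing)
```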

## File I/O Exercise

Complete the following 5 questions as a group. Find appropriate documentation to help you figure out these lines of code.

In [None]:
# TODO: Open the `AMZN.csv` dataset and calculate the average price for the `Close` column using only the `csv` package.



In [None]:
# TODO: Calculate the minimum for the `Volume` column using only the `csv` package.



In [None]:
# TODO: Open the `spotify-api.json` file and get the artist's name using the `json` package



In [None]:
# TODO: Open the `spotify-api.json` file and get the number of `available_markets` using the `json` package



In [56]:
# TODO: Open the `queen_songs.txt` file and count the number of times Freddie Mercury says "yeah" in his songs
# Hint: The `re.findall()` function returns a list of all occurrences of a pattern in a string; the length of that list is the count.



8
