# Reading and Writing Structured/Unstructured Data in Python



In [2]:
import pandas as pd
import json

## Structured Data

Structured data refers to data that is organized in a predefined format, usually in the form of rows and columns.

It follows a fixed schema, meaning the structure of the data is defined before data is stored.

Because of this well-defined organization, structured data is easy to store, search, query, and analyze using traditional databases and data processing tools. 

Common examples include data stored in relational databases, spreadsheets, and CSV files, such as student records, transaction logs, and inventory tables.

---

## Unstructured Data

Unstructured data refers to data that does not have a predefined format or fixed schema. 

It is not organized in a tabular form and cannot be easily stored or queried using traditional database systems.

This type of data often contains text, images, audio, or video and requires preprocessing or advanced techniques such as text mining, natural language processing, or machine learning for analysis. 

Examples of unstructured data include emails, social media posts, documents, images, and multimedia files.


## 1. CSV (Comma-Separated Values) - Structured Data

CSV is a common plain-text format for tabular data. Each line is a row, and the values within a row are separated by commas (or another delimiter).

Because of its simple and lightweight structure, CSV is widely supported by databases, spreadsheet applications, and programming languages. 


### Reading CSV
- Using pandas `pd.read_csv()` to load CSV into a DataFrame.
- Options: delimiter, header, index_col, dtype, etc.
- Alternatively, use the built-in `csv` module.

### Writing CSV
- Use `df.to_csv()` to save a DataFrame to CSV.
- Options: index, header, sep, mode, etc.
- Alternatively, use the built-in `csv` module.
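As a minimal sketch of the options listed above, the snippet below reads semicolon-delimited data and writes it back tab-delimited. The sample text and the variable names (`raw`, `people`, `out`) are made up for illustration, and an in-memory `io.StringIO` buffer stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data, held in memory instead of on disk
raw = "Name;Age;City\nAlice;25;New York\nBob;30;Los Angeles\n"

# sep selects the delimiter, index_col picks the row labels,
# and dtype overrides the type pandas would otherwise infer
people = pd.read_csv(
    io.StringIO(raw), sep=";", index_col="Name", dtype={"Age": "float64"}
)
print(people)

# to_csv accepts sep as well; header=False leaves out the column names
out = io.StringIO()
people.to_csv(out, sep="\t", header=False)
print(out.getvalue())
```

Because `Name` is the index here, `to_csv` still writes it as the first field of each row; pass `index=False` to drop it.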

In [3]:
# Example data for demonstration
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
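df = pd.DataFrame(data)

# Create the output directory if it does not exist yet
import os
os.makedirs('data', exist_ok=True)

# Writing to CSV using pandas
csv_file = 'data/example.csv'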
df.to_csv(csv_file, index=False)
print(f'CSV file written to: {csv_file}')

CSV file written to: data/example.csv


In [4]:
# Reading from CSV using pandas
df_read = pd.read_csv(csv_file)
print('Read CSV Data:')
print(df_read)

# Detailed options example
df_read_with_options = pd.read_csv(csv_file, dtype={'Age': 'float'}, index_col='Name')
print('\nRead with options:')
print(df_read_with_options)

Read CSV Data:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

Read with options:
          Age         City
Name                      
Alice    25.0     New York
Bob      30.0  Los Angeles
Charlie  35.0      Chicago


In [5]:
# using csv module

import csv
# Writing to CSV using csv module
csv_file_module = 'data/example_module.csv'
with open(csv_file_module, mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(data.keys())
    writer.writerows(zip(*data.values()))
print(f'CSV file written using csv module to: {csv_file_module}')

# Reading from CSV using csv module
# newline='' lets the csv module handle line endings itself
with open(csv_file_module, mode='r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

CSV file written using csv module to: data/example_module.csv
['Name', 'Age', 'City']
['Alice', '25', 'New York']
['Bob', '30', 'Los Angeles']
['Charlie', '35', 'Chicago']
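The `csv` module returns every field as a string, as the output above shows. A possible variation, sketched here with made-up rows and an in-memory buffer, uses `csv.DictWriter` and `csv.DictReader` so that each row is keyed by column name, and converts `Age` back to an integer by hand.

```python
import csv
import io

# Hypothetical rows, one dictionary per record
rows = [
    {"Name": "Alice", "Age": 25, "City": "New York"},
    {"Name": "Bob", "Age": 30, "City": "Los Angeles"},
]

# newline="" leaves line endings to the csv module, as with open(..., newline='')
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["Name", "Age", "City"])
writer.writeheader()
writer.writerows(rows)

# Rewind and read the same data back as dictionaries
buffer.seek(0)
records = list(csv.DictReader(buffer))

# The csv module does no type inference, so numbers must be converted explicitly
ages = [int(record["Age"]) for record in records]
print(records[0])
print(ages)
```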


### 🧠 Deep Dive: How CSV is Handled in Memory

When you execute `pd.read_csv('data/example.csv')`, the following occurs at the system level:

1.  **Buffering & Tokenization**: 
    - The OS reads the file from the disk in chunks (typically 4KB or 8KB pages) into a memory buffer.
    - The Pandas C-engine (written in C for speed) iterates over this byte stream, identifying delimiters (`,`) and newlines (`\n`) to tokenize the data.

2.  **Type Inference**:
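    - Pandas infers each column's type from the parsed values, because a CSV file carries no schema. With the default `low_memory=True`, the C parser works through the file in internal chunks, so a column with inconsistent values can end up with mixed types (pandas emits a `DtypeWarning`). Passing `dtype` explicitly avoids this.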
    - It allocates memory for **NumPy arrays**.
    - **Integers/Floats**: Stored efficiently in contiguous blocks of memory (e.g., `int64` takes 8 bytes per value).
    - **Strings/Objects**: Stored as an array of pointers to Python objects. This is much less memory-efficient than numbers.

3.  **The Block Manager**:
    - Pandas does not store columns purely individually. It groups columns of the same type into **Blocks**.
    - If you have 3 Integer columns, they might be stored as a single `(3, N)` NumPy array matrix. This improves CPU cache locality.

*Key Takeaway*: CSVs are text-heavy. The in-memory DataFrame representation is often **larger** than the file size on disk because ASCII text is converted into heavy Python objects or fixed-width NumPy types.

## 2. JSON (JavaScript Object Notation) - Structured/Semi-Structured Data

JSON is a lightweight format for structured data, often used in APIs and configurations.

It represents data in key–value pairs and supports nested structures such as objects and arrays, making it flexible and easy to read by both humans and machines.

### Reading JSON
- Use `json.load()` for files or `json.loads()` for strings.
- pandas `pd.read_json()` for DataFrame conversion.

### Writing JSON
- Use `json.dump()` or `json.dumps()`.
- pandas `df.to_json()`.
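The file-based `json.load()` and `json.dump()` have string-based counterparts, `json.loads()` and `json.dumps()`. A short sketch with a made-up payload shows the round trip and how JSON types map to Python types:

```python
import json

# Hypothetical JSON text, e.g. the body of an API response
payload = '{"Name": "Alice", "Age": 25, "Tags": ["admin", "staff"], "Manager": null}'

# loads parses a string: objects become dicts, arrays become lists, null becomes None
record = json.loads(payload)
print(type(record), record["Tags"], record["Manager"])

# Modify the Python object, then serialize it back to a string
record["Age"] += 1
text = json.dumps(record, indent=2, sort_keys=True)
print(text)
```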

In [15]:
# Writing to JSON using pandas
json_file = 'data/example.json'
df.to_json(json_file, orient='records', indent=4)
print(f'JSON file written to: {json_file}')

# Using built-in json
with open('data/example_builtin.json', 'w') as f:
    json.dump(data, f, indent=4)

JSON file written to: data/example.json


In [16]:
# Reading JSON with pandas
df_json = pd.read_json(json_file, orient='records')
print('Read JSON Data (pandas):')
print(df_json)

# Reading with built-in json
with open(json_file, 'r') as f:
    data_loaded = json.load(f)
print('\nRead JSON Data (built-in):')
print(data_loaded)

Read JSON Data (pandas):
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

Read JSON Data (built-in):
[{'Name': 'Alice', 'Age': 25, 'City': 'New York'}, {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'}, {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}]


### Handling Nested JSON
- Nested JSON is very common in APIs, logs, and web data, where values can be dictionaries inside dictionaries or lists.
- Use `json_normalize` in pandas for flattening.

In [17]:
nested_data = {
    'employees': [
        {'name': 'Alice', 'details': {'age': 25, 'city': 'New York'}},
        {'name': 'Bob', 'details': {'age': 30, 'city': 'Los Angeles'}}
    ]

}
with open('data/nested.json', 'w') as f:
    json.dump(nested_data, f)

df_nested = pd.json_normalize(nested_data['employees'])
print('Normalized Nested JSON:')
print(df_nested)

Normalized Nested JSON:
    name  details.age details.city
0  Alice           25     New York
1    Bob           30  Los Angeles
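When the nested values are lists of records rather than plain dictionaries, `pd.json_normalize` accepts `record_path` and `meta` arguments. The sketch below uses a made-up `departments` structure: each employee becomes one row, and the parent's `department` field is copied onto it.

```python
import pandas as pd

# Hypothetical data: each department holds a list of employee records
departments = [
    {
        "department": "Engineering",
        "employees": [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}],
    },
    {
        "department": "Sales",
        "employees": [{"name": "Charlie", "age": 35}],
    },
]

# record_path names the list to expand into rows;
# meta lists parent-level fields to repeat on every row
flat = pd.json_normalize(departments, record_path="employees", meta=["department"])
print(flat)
```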


### 🧠 Deep Dive: JSON in Memory

1.  **Parsing (DOM style)**:
    - Standard `json.load()` reads the entire file content into a string and then parses it into a standard Python `dict` or `list`.
    - **Hash Maps**: Python dictionaries are implemented as Hash Tables. They are fast for lookups (O(1)) but memory-heavy because they must store hash values, pointers to keys, pointers to values, and maintain empty space (sparsity) to avoid collisions.

2.  **Pandas & Normalization**:
    - When using `pd.read_json` or `json_normalize`, the library must traverse this hierarchical tree structure.
    - It flattens the nested dictionaries into specific columns.
    - Missing data in sparse JSON results in `NaN` (Not a Number) values in the DataFrame. In numeric columns these are ordinary `float64` values, so "empty" cells still consume space and an integer column with gaps is upcast to float.

3.  **Overhead**:
    - JSON is repetitive: keys are repeated for every record in the file. In memory, the `json` decoder reuses one string object per distinct key, so that textual repetition disappears, but every record still gets its own `dict`. This per-dictionary overhead usually makes the parsed structure several times larger than the raw bytes (often in the range of **2x-10x**, depending on the data).
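The effect of missing keys can be seen on a tiny example. In this sketch the second (made-up) record has no `age`, so the flattened column holds `NaN` and its dtype becomes `float64`.

```python
import pandas as pd

# Hypothetical sparse records: the second one has no "age" key
records = [
    {"name": "Alice", "details": {"age": 25, "city": "New York"}},
    {"name": "Bob", "details": {"city": "Los Angeles"}},
]

sparse = pd.json_normalize(records)
print(sparse)

# details.age is float64 rather than int64 because NaN is a floating-point value
print(sparse.dtypes)
```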

## 3. Excel (XLSX) - Structured Data

Excel files (.xlsx) are one of the most common structured data formats used in data analysis, business reporting, and research.

A single Excel workbook can contain multiple sheets, each holding its own table.

### Reading Excel
- Use `pd.read_excel()`.
- Options: sheet_name, header, usecols, etc.

### Writing Excel
- Use `df.to_excel()`.
- For multiple sheets, use `ExcelWriter`.


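The `openpyxl` library is the engine pandas uses by default to read `.xlsx` files, and it is one of the engines pandas can use to write them, so install it before running the cells below.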

In [18]:
pip install openpyxl

Note: you may need to restart the kernel to use updated packages.


In [10]:
# Writing to Excel
excel_file = 'data/example.xlsx'
df.to_excel(excel_file, index=False, sheet_name='Sheet1')
print(f'Excel file written to: {excel_file}')

# Writing multiple sheets
with pd.ExcelWriter('data/multi_sheet.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)
    df.to_excel(writer, sheet_name='Sheet2', index=False)

Excel file written to: data/example.xlsx


In [11]:
# Reading Excel
df_excel = pd.read_excel(excel_file, sheet_name='Sheet1')
print('Read Excel Data:')
print(df_excel)

# Reading specific columns
df_cols = pd.read_excel(excel_file, usecols=['Name', 'Age'])
print('\nRead specific columns:')
print(df_cols)

# Reading multiple sheets
dfs = pd.read_excel('data/multi_sheet.xlsx', sheet_name=None)
print('\nSheets:')
for sheet, df_sheet in dfs.items():
    print(f'{sheet}:\n{df_sheet}')

Read Excel Data:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

Read specific columns:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

Sheets:
Sheet1:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
Sheet2:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


### 🧠 Deep Dive: Excel Memory Overhead

Processing Excel files (`.xlsx`) is significantly more resource-intensive than CSV or JSON.

1.  **Zip Decompression**:
    - An `.xlsx` file is actually a zipped archive of multiple XML files.
    - `read_excel` (via `openpyxl`) must first unzip these files into memory or temporary storage.

2.  **XML Parsing**:
    - The data in Excel is stored as XML, so the reader must parse the workbook, shared-strings, and worksheet XML instead of simply splitting bytes on delimiters.
    - **Cell Objects**: Unlike CSV (where a number is just bytes), a cell in Excel has value, style, type, formula, etc. `openpyxl` represents populated cells as Python objects to carry this metadata.
    - This can push memory usage to many times the file size. For example, a 5MB Excel file can consume 100MB+ of RAM during processing.

3.  **Conversion**:
    - Finally, the values are extracted from these heavy Cell objects and placed into NumPy arrays for the Pandas DataFrame, releasing the heavy XML/Cell objects (Garbage Collection).
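The zip-of-XML layout described in step 1 can be inspected with the standard `zipfile` module. This is a sketch that assumes `openpyxl` is installed; the workbook is written to an in-memory buffer, and the `frame` data is made up.

```python
import io
import zipfile

import pandas as pd

frame = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# Write the workbook to an in-memory buffer (requires openpyxl)
buffer = io.BytesIO()
frame.to_excel(buffer, index=False, engine="openpyxl")

# An .xlsx file is a zip archive, so zipfile can list its XML members
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    members = archive.namelist()
    for info in archive.infolist():
        print(f"{info.filename}: {info.compress_size} -> {info.file_size} bytes")
```

Each printed line shows a member's compressed and uncompressed size, which gives a rough idea of how much the data grows before any XML parsing starts.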

## 4. Text Files - Unstructured Data

Text files are unstructured: they contain free-form text without a predefined schema, which makes them harder to analyze directly than structured formats like Excel or CSV.

Examples: news articles, reviews, emails, logs, social media posts.

Text files are considered unstructured because:

- No fixed rows and columns
- No data types or schema
- Content may be sentences, paragraphs, logs, or notes
- Meaning is embedded in natural language

### Reading Text
- Use `open()` with 'r' mode.
- Methods: read(), readline(), readlines().

### Writing Text
- Use `open()` with 'w' or 'a' mode.
- Methods: write(), writelines().
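A small sketch of the modes and methods listed above: `'w'` creates or truncates a file, `'a'` appends to it, and `readline()` and `readlines()` read one line or all remaining lines. The file name `notes.txt` and the temporary directory are placeholders.

```python
import os
import tempfile

# Hypothetical file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# 'w' creates the file (or truncates an existing one)
with open(path, "w", encoding="utf-8") as f:
    f.write("first\n")

# 'a' keeps the existing content and adds to the end
with open(path, "a", encoding="utf-8") as f:
    f.writelines(["second\n", "third\n"])

with open(path, "r", encoding="utf-8") as f:
    head = f.readline()   # one line, including its trailing newline
    rest = f.readlines()  # every remaining line, as a list

print(repr(head), rest)
```

Note that `writelines()` does not add newlines itself, so each string must already end with `\n`.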

In [12]:
# Writing to text file
text_file = 'data/example.txt'
with open(text_file, 'w') as f:
    f.write('This is line 1.\n')
    f.write('This is line 2.\n')
    f.writelines(['Line 3.\n', 'Line 4.\n'])
print(f'Text file written to: {text_file}')

Text file written to: data/example.txt


In [20]:
# Reading entire text
with open(text_file, 'r') as f:
    content = f.read()
print('Entire content:')
print(content)

# Reading line by line
print('\nLine by line:')
with open(text_file, 'r') as f:
    for line in f:
        print('\n')
        print(line.strip())

# Reading all lines into list
with open(text_file, 'r') as f:
    lines = f.readlines()
print('\nLines list:')
print(lines)

Entire content:
This is line 1.
This is line 2.
Line 3.
Line 4.


Line by line:


This is line 1.


This is line 2.


Line 3.


Line 4.

Lines list:
['This is line 1.\n', 'This is line 2.\n', 'Line 3.\n', 'Line 4.\n']


### 🧠 Deep Dive: Text Streams & Buffering

Handling unstructured text gives you the most control over memory.

1.  **File Descriptors & Buffers**:
    - `open()` requests a file descriptor from the OS.
    - A fixed-size buffer is allocated (e.g., 4096 or 8192 bytes).
    - When you read, the OS copies the next chunk of the file into the buffer, and Python serves your reads from RAM (the buffer) until it needs to be refilled.

2.  **Load vs. Stream**:
    - **`read()`**: Loads the **entire** file contents into a single Python string. For a 10GB log file, this can exhaust available RAM and raise `MemoryError`.
    - **`for line in f:`**: This is a **Lazy Iterator**. Python reads just enough bytes to find the next newline (`\n`).
    - At any given moment, only the current line (plus the read buffer) is held in Python's memory. The previous line is discarded (eligible for Garbage Collection) once nothing references it.
    
3.  **Encoding**:
    - Files are stored as bytes (0s and 1s).
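    - `open(..., encoding='utf-8')` applies a decoding layer. It converts raw bytes into **Unicode Code Points** (Python strings). If `encoding` is omitted, Python falls back to a platform-dependent default, so the same file can decode differently on another machine.
    - This decoding has a small CPU cost but ensures that characters are interpreted correctly.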
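The streaming and encoding points above can be sketched together. The log file, its contents, and the helper `count_matching` are all hypothetical; the file is small here, but the loop would work the same way on a file too large to fit in memory.

```python
import os
import tempfile

# Build a hypothetical log file with one non-ASCII character per line
path = os.path.join(tempfile.mkdtemp(), "app.log")
with open(path, "w", encoding="utf-8") as f:
    for i in range(1000):
        level = "ERROR" if i % 100 == 0 else "INFO"
        f.write(f"{level} café {i}\n")


def count_matching(path, needle):
    """Count lines containing needle, holding one line in memory at a time."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:  # lazy iteration: the file is never loaded whole
            if needle in line:
                count += 1
    return count


errors = count_matching(path, "ERROR")
print(errors)

# Binary mode returns raw bytes; decoding turns them into code points.
# 'é' takes two bytes in UTF-8 but is a single character in a Python string.
with open(path, "rb") as f:
    raw_line = f.readline()
decoded = raw_line.decode("utf-8")
print(len(raw_line), len(decoded))
```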

## Cleanup
Optionally, remove generated files.

In [14]:
# import os
# files_to_remove = ['data/example.csv', 'data/example_module.csv', 'data/example.json', 'data/example_builtin.json', 'data/nested.json', 'data/example.xlsx', 'data/multi_sheet.xlsx', 'data/example.txt']
# for file in files_to_remove:
#     if os.path.exists(file):
#         os.remove(file)
#         print(f'Removed: {file}')