# Reading and Writing Structured/Unstructured Data in Python



In [1]:
import pandas as pd
import json

## Structured Data

Structured data refers to data that is organized in a predefined format, usually in the form of rows and columns.

It follows a fixed schema, meaning the structure of the data is defined before data is stored.

Because of this well-defined organization, structured data is easy to store, search, query, and analyze using traditional databases and data processing tools. 

Common examples include data stored in relational databases, spreadsheets, and CSV files, such as student records, transaction logs, and inventory tables.

---

## Unstructured Data

Unstructured data refers to data that does not have a predefined format or fixed schema. 

It is not organized in a tabular form and cannot be easily stored or queried using traditional database systems.

This type of data often contains text, images, audio, or video and requires preprocessing or advanced techniques such as text mining, natural language processing, or machine learning for analysis. 

Examples of unstructured data include emails, social media posts, documents, images, and multimedia files.


## 1. CSV (Comma-Separated Values) - Structured Data

CSV is a common plain-text format for tabular data. Each line is a row, and the fields within a row are separated by commas (or another delimiter, such as a semicolon or tab).

Because of its simple and lightweight structure, CSV is widely supported by databases, spreadsheet applications, and programming languages. 


### Reading CSV
- Use `pd.read_csv()` to load a CSV file into a DataFrame.
- Options: `sep` (alias `delimiter`), `header`, `index_col`, `dtype`, etc.
- Alternatively, use `csv.reader()` from the built-in `csv` module.

### Writing CSV
- Use `df.to_csv()` to save a DataFrame to CSV.
- Options: `index`, `header`, `sep`, `mode`, etc.
- Alternatively, use `csv.writer()` from the built-in `csv` module.

In [2]:
# Example data for demonstration
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
# Writing to CSV using pandas
csv_file = 'example.csv'
df.to_csv(csv_file, index=False)
print(f'CSV file written to: {csv_file}')

CSV file written to: example.csv


In [3]:
# Reading from CSV using pandas
df_read = pd.read_csv(csv_file)
print('Read CSV Data:')
print(df_read)

# Detailed options example
df_read_with_options = pd.read_csv(csv_file, dtype={'Age': 'float'}, index_col='Name')
print('\nRead with options:')
print(df_read_with_options)

Read CSV Data:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

Read with options:
          Age         City
Name                      
Alice    25.0     New York
Bob      30.0  Los Angeles
Charlie  35.0      Chicago
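
The `sep`, `mode`, and `header` options listed earlier are not exercised by the cells above, so here is a minimal sketch of how they might be combined. The sample rows and the file name `people_semicolon.csv` are made up for illustration, and the file is written to a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample rows, used only for this illustration
people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
extra = pd.DataFrame({'Name': ['Charlie'], 'Age': [35]})

# Write into a temporary directory so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), 'people_semicolon.csv')

# sep=';' switches the delimiter; index=False omits the row index
people.to_csv(path, sep=';', index=False)

# mode='a' appends to the file; header=False avoids repeating the header row
extra.to_csv(path, sep=';', index=False, mode='a', header=False)

# The same delimiter has to be passed when reading the file back
df_semicolon = pd.read_csv(path, sep=';')
print(df_semicolon)
```

If the delimiter is left at its default when reading, pandas would treat each line as a single field, so it is worth keeping the `sep` value consistent between writing and reading.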


In [4]:
# using csv module

import csv
# Writing to CSV using csv module
csv_file_module = 'example_module.csv'
with open(csv_file_module, mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(data.keys())
    writer.writerows(zip(*data.values()))
print(f'CSV file written using csv module to: {csv_file_module}')

# Reading from CSV using csv module
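# newline='' lets the csv module handle line endings itself, as its documentation recommends
with open(csv_file_module, mode='r', newline='') as file: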
    reader = csv.reader(file)
    for row in reader:
        print(row)

CSV file written using csv module to: example_module.csv
['Name', 'Age', 'City']
['Alice', '25', 'New York']
['Bob', '30', 'Los Angeles']
['Charlie', '35', 'Chicago']
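
As the output above shows, `csv.reader` returns every value as a string and leaves the header as just another row. One possible alternative, sketched below, is the `csv` module's `DictWriter` and `DictReader`, which map each row to the column names. The rows and the file name `example_dict.csv` are illustrative assumptions.

```python
import csv
import os
import tempfile

# Hypothetical rows, one dictionary per record
rows = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
]
dict_csv_path = os.path.join(tempfile.mkdtemp(), 'example_dict.csv')

# DictWriter needs the column order up front via fieldnames
with open(dict_csv_path, mode='w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=['Name', 'Age', 'City'])
    writer.writeheader()
    writer.writerows(rows)

# DictReader uses the first row as keys for every following row
with open(dict_csv_path, mode='r', newline='') as file:
    records = list(csv.DictReader(file))

# Values still come back as strings, so convert them explicitly
ages = [int(record['Age']) for record in records]
print(records[0])
print(ages)
```

Unlike pandas, the `csv` module does no type inference, which is why the explicit `int()` conversion is needed.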


## 2. JSON (JavaScript Object Notation) - Structured/Semi-Structured Data

JSON is a lightweight format for structured data, often used in APIs and configurations.

It represents data in key-value pairs and supports nested structures such as objects and arrays, making it flexible and easy to read by both humans and machines.

### Reading JSON
- Use `json.load()` for files or `json.loads()` for strings.
- pandas `pd.read_json()` for DataFrame conversion.

### Writing JSON
- Use `json.dump()` or `json.dumps()`.
- pandas `df.to_json()`.

In [5]:
# Writing to JSON using pandas
json_file = 'example.json'
df.to_json(json_file, orient='records', indent=4)
print(f'JSON file written to: {json_file}')

# Using built-in json
with open('example_builtin.json', 'w') as f:
    json.dump(data, f, indent=4)

JSON file written to: example.json


In [6]:
# Reading JSON with pandas
df_json = pd.read_json(json_file, orient='records')
print('Read JSON Data (pandas):')
print(df_json)

# Reading with built-in json
with open(json_file, 'r') as f:
    data_loaded = json.load(f)
print('\nRead JSON Data (built-in):')
print(data_loaded)

Read JSON Data (pandas):
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

Read JSON Data (built-in):
[{'Name': 'Alice', 'Age': 25, 'City': 'New York'}, {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'}, {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}]
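
The cells above work with files. For JSON held in a string, such as an API response body, `json.dumps()` and `json.loads()` are the string counterparts. A minimal sketch with a made-up record:

```python
import json

# Hypothetical record used only for this illustration
record = {'Name': 'Alice', 'Age': 25, 'City': 'New York'}

# dumps() serializes a Python object to a JSON string
json_text = json.dumps(record)
print(json_text)

# loads() parses a JSON string back into Python objects
restored = json.loads(json_text)
print(restored == record)
```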


### Handling Nested JSON
- Nested JSON is very common in APIs, logs, and web data, where values can be dictionaries inside dictionaries or lists.
- Use `json_normalize` in pandas for flattening.

In [7]:
nested_data = {
    'employees': [
        {'name': 'Alice', 'details': {'age': 25, 'city': 'New York'}},
        {'name': 'Bob', 'details': {'age': 30, 'city': 'Los Angeles'}}
    ]
}
with open('nested.json', 'w') as f:
    json.dump(nested_data, f)

df_nested = pd.json_normalize(nested_data['employees'])
print('Normalized Nested JSON:')
print(df_nested)

Normalized Nested JSON:
    name  details.age details.city
0  Alice           25     New York
1    Bob           30  Los Angeles
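
When the list of records sits inside a larger object, `pd.json_normalize()` can also be pointed at it with `record_path`, and top-level fields can be carried along with `meta`. The sketch below assumes a hypothetical payload with an extra `company` field and uses `sep='_'` to change the separator in the flattened column names.

```python
import pandas as pd

# Hypothetical payload: a top-level field plus a nested list of records
payload = {
    'company': 'Acme',
    'employees': [
        {'name': 'Alice', 'details': {'age': 25, 'city': 'New York'}},
        {'name': 'Bob', 'details': {'age': 30, 'city': 'Los Angeles'}},
    ],
}

# record_path selects the list to expand into rows;
# meta copies the named top-level field onto every row;
# sep sets the separator used when nested keys are joined
flat = pd.json_normalize(payload, record_path='employees', meta=['company'], sep='_')
print(flat)
```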


## 3. Excel (XLSX) - Structured Data

Excel files (`.xlsx`) are one of the most common structured data formats used in data analysis, business reporting, and research.

A single Excel workbook can contain multiple sheets, each holding its own table.

### Reading Excel
- Use `pd.read_excel()`.
- Options: sheet_name, header, usecols, etc.

### Writing Excel
- Use `df.to_excel()`.
- For multiple sheets, use `ExcelWriter`.


pandas needs the `openpyxl` library to read `.xlsx` files and can use it to write them, so install it before running the cells below.

In [8]:
# pip install openpyxl

In [9]:
# Writing to Excel
excel_file = 'example.xlsx'
df.to_excel(excel_file, index=False, sheet_name='Sheet1')
print(f'Excel file written to: {excel_file}')

# Writing multiple sheets
with pd.ExcelWriter('multi_sheet.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)
    df.to_excel(writer, sheet_name='Sheet2', index=False)

Excel file written to: example.xlsx


In [10]:
# Reading Excel
df_excel = pd.read_excel(excel_file, sheet_name='Sheet1')
print('Read Excel Data:')
print(df_excel)

# Reading specific columns
df_cols = pd.read_excel(excel_file, usecols=['Name', 'Age'])
print('\nRead specific columns:')
print(df_cols)

# Reading multiple sheets
dfs = pd.read_excel('multi_sheet.xlsx', sheet_name=None)
print('\nSheets:')
for sheet, df_sheet in dfs.items():
    print(f'{sheet}:\n{df_sheet}')

Read Excel Data:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

Read specific columns:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

Sheets:
Sheet1:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
Sheet2:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


## 4. Text Files - Unstructured Data

Text files are unstructured: they contain free-form text without a predefined schema, which makes them harder to analyze directly than structured formats like Excel or CSV.

Examples: news articles, reviews, emails, logs, social media posts.

Text files are considered unstructured because:

- No fixed rows and columns
- No data types or schema
- Content may be sentences, paragraphs, logs, or notes
- Meaning is embedded in natural language
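
To illustrate the points above, here is a rough sketch of how structure might be recovered from text with a regular expression. The log lines and their `<date> <LEVEL> <message>` layout are assumptions made up for this example. Real text rarely follows a single pattern, which is why the sketch also collects lines that do not match.

```python
import re

# Hypothetical lines; only the first two follow the assumed layout
log_lines = [
    '2024-01-05 ERROR Disk quota exceeded',
    '2024-01-05 INFO Backup finished',
    'free-form note with no recognizable layout',
]

# Named groups turn each matching line into date/level/message fields
pattern = re.compile(r'^(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.+)$')

parsed = []     # lines that could be turned into structured records
unmatched = []  # lines that stay unstructured
for line in log_lines:
    match = pattern.match(line)
    if match:
        parsed.append(match.groupdict())
    else:
        unmatched.append(line)

print(parsed)
print(unmatched)
```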

### Reading Text
- Use `open()` with 'r' mode.
- Methods: read(), readline(), readlines().

### Writing Text
- Use `open()` with 'w' or 'a' mode.
- Methods: write(), writelines().

In [11]:
# Writing to text file
text_file = 'example.txt'
with open(text_file, 'w') as f:
    f.write('This is line 1.\n')
    f.write('This is line 2.\n')
    f.writelines(['Line 3.\n', 'Line 4.\n'])
print(f'Text file written to: {text_file}')

Text file written to: example.txt


In [12]:
# Reading entire text
with open(text_file, 'r') as f:
    content = f.read()
print('Entire content:')
print(content)

# Reading line by line
print('\nLine by line:')
with open(text_file, 'r') as f:
    for line in f:
        print(line.strip())

# Reading all lines into list
with open(text_file, 'r') as f:
    lines = f.readlines()
print('\nLines list:')
print(lines)

Entire content:
This is line 1.
This is line 2.
Line 3.
Line 4.


Line by line:
This is line 1.
This is line 2.
Line 3.
Line 4.

Lines list:
['This is line 1.\n', 'This is line 2.\n', 'Line 3.\n', 'Line 4.\n']
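
The `'a'` mode and `readline()` mentioned earlier are not used in the cells above. A minimal sketch of both follows; the file name `notes.txt` and its contents are illustrative, and the explicit `encoding='utf-8'` is a choice made here rather than something the notebook requires.

```python
import os
import tempfile

notes_path = os.path.join(tempfile.mkdtemp(), 'notes.txt')

# 'w' creates the file (or truncates an existing one)
with open(notes_path, 'w', encoding='utf-8') as f:
    f.write('First line.\n')

# 'a' keeps the existing content and writes at the end
with open(notes_path, 'a', encoding='utf-8') as f:
    f.write('Appended line.\n')

# readline() returns one line per call, including the trailing newline,
# and an empty string once the end of the file is reached
with open(notes_path, 'r', encoding='utf-8') as f:
    first = f.readline()
    second = f.readline()
    end = f.readline()

print(repr(first), repr(second), repr(end))
```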


## Cleanup
Optionally, remove generated files.

In [13]:
# import os
# files_to_remove = ['example.csv', 'example_module.csv', 'example.json', 'example_builtin.json', 'nested.json', 'example.xlsx', 'multi_sheet.xlsx', 'example.txt']
# for file in files_to_remove:
#     if os.path.exists(file):
#         os.remove(file)
#         print(f'Removed: {file}')