# 05_data_handling_files
File I/O, CSV, MAT, large datasets

# File: notebooks/05_data_handling_files.ipynb

# OctaveMasterPro: Data Handling & File I/O

Master data import, export, and manipulation! This notebook covers file I/O operations, CSV handling, MAT files, and techniques for working with large datasets efficiently.

**Learning Objectives:**
- Master file input/output operations across formats
- Handle CSV data import, export, and manipulation
- Work with MAT files and binary data formats
- Implement strategies for large dataset management
- Apply data cleaning and preprocessing techniques

---

## 1. Basic File I/O Operations

```octave
% Basic file input/output operations
fprintf('=== Basic File I/O Operations ===\n');

% Text file writing
filename_txt = 'sample_data.txt';
fid = fopen(filename_txt, 'w');

if fid == -1
    error('Cannot open file for writing');
end

% Write various data types to text file
fprintf(fid, 'Sample Data File\n');
fprintf(fid, 'Generated on: %s\n', datestr(now()));
fprintf(fid, '================\n');
fprintf(fid, 'Integer: %d\n', 42);
fprintf(fid, 'Float: %.4f\n', pi);
fprintf(fid, 'Scientific: %e\n', 1.23e-6);

% Write array data
data_array = [1.1, 2.2, 3.3, 4.4, 5.5];
fprintf(fid, 'Array: ');
fprintf(fid, '%.2f ', data_array);
fprintf(fid, '\n');

fclose(fid);
fprintf('Text file written: %s\n', filename_txt);

% Text file reading
fid = fopen(filename_txt, 'r');
if fid == -1
    error('Cannot open file for reading');
end

fprintf('Reading text file contents:\n');
line_count = 0;
while ~feof(fid)
    line = fgetl(fid);
    if ischar(line)
        line_count = line_count + 1;
        fprintf('  Line %d: %s\n', line_count, line);
    end
end

fclose(fid);

% Binary file operations
binary_data = [1, 2, 3, 4, 5; 6, 7, 8, 9, 10];
filename_bin = 'binary_data.bin';

% Write binary data
fid = fopen(filename_bin, 'wb');
fwrite(fid, binary_data, 'double');
fclose(fid);

% Read binary data
fid = fopen(filename_bin, 'rb');
read_binary = fread(fid, [2, 5], 'double');
fclose(fid);

fprintf('Binary data round-trip test:\n');
fprintf('Original equal to read: %d\n', isequal(binary_data, read_binary));

% Clean up files
delete(filename_txt);
delete(filename_bin);
```
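
The open/write/close pattern above leaves a file handle open if an error occurs between `fopen` and `fclose`. A minimal sketch of one way to guard against that in Octave uses `unwind_protect`, whose cleanup block runs whether or not the body fails. The scratch file name from `tempname()` and the sample values are assumptions made for illustration only.

```octave
% Sketch: close the file handle even if writing or reading fails.
tmp_file = [tempname() '.txt'];   % hypothetical scratch file

fid = fopen(tmp_file, 'w');
if fid == -1
    error('Cannot open %s for writing', tmp_file);
end
unwind_protect
    fprintf(fid, '%d\n', [10; 20; 30]);   % one integer per line
unwind_protect_cleanup
    fclose(fid);                          % always executed
end_unwind_protect

fid = fopen(tmp_file, 'r');
if fid == -1
    error('Cannot open %s for reading', tmp_file);
end
unwind_protect
    values = fscanf(fid, '%d');           % returns a column vector
unwind_protect_cleanup
    fclose(fid);
end_unwind_protect

delete(tmp_file);
disp(values.');
```

`unwind_protect` is Octave-specific syntax, so code that must also run in MATLAB would need a different cleanup mechanism.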

## 2. CSV File Operations

```octave
% CSV file handling and operations
fprintf('\n=== CSV File Operations ===\n');

% Create sample CSV data
csv_filename = 'sample_data.csv';
header_data = {'Name', 'Age', 'Score', 'Grade'};
sample_data = {
    'Alice', 25, 95.5, 'A';
    'Bob', 23, 87.2, 'B';
    'Charlie', 24, 92.8, 'A';
    'Diana', 22, 78.5, 'C';
    'Eve', 26, 88.9, 'B'
};

% Write CSV with headers (manual approach)
fid = fopen(csv_filename, 'w');
fprintf(fid, '%s,', header_data{1:end-1});
fprintf(fid, '%s\n', header_data{end});

for i = 1:size(sample_data, 1)
    fprintf(fid, '%s,%d,%.1f,%s\n', sample_data{i,1}, sample_data{i,2}, ...
            sample_data{i,3}, sample_data{i,4});
end
fclose(fid);

fprintf('CSV file created: %s\n', csv_filename);

% Read CSV file (numeric data only for csvread)
numeric_csv = 'numeric_data.csv';
numeric_data = [10, 20, 30; 40, 50, 60; 70, 80, 90];
csvwrite(numeric_csv, numeric_data);

% Read numeric CSV
read_numeric = csvread(numeric_csv);
fprintf('Numeric CSV read successfully. Size: [%d, %d]\n', ...
        size(read_numeric, 1), size(read_numeric, 2));
fprintf('Data verification: %d\n', isequal(numeric_data, read_numeric));

% Manual CSV parsing for mixed data
function [headers, data] = parse_csv(filename)
    % Parse CSV file with mixed data types
    % Input: filename - CSV file path
    % Output: headers - cell array of header names
    %         data - cell array of data
    
    fid = fopen(filename, 'r');
    if fid == -1
        error('Cannot open CSV file: %s', filename);
    end
    
    % Read header line
    header_line = fgetl(fid);
    headers = strsplit(header_line, ',');
    
    % Read data lines
    data = {};
    row = 1;
    while ~feof(fid)
        line = fgetl(fid);
        if ischar(line) && ~isempty(line)
            parts = strsplit(line, ',');
            for col = 1:length(parts)
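                % Try to convert to number, otherwise keep as string.
                % str2double returns NaN for non-numeric text and, unlike
                % str2num, does not evaluate the field as code.
                num_val = str2double(parts{col});
                if ~isnan(num_val)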
                    data{row, col} = num_val;
                else
                    data{row, col} = parts{col};
                end
            end
            row = row + 1;
        end
    end
    
    fclose(fid);
end

% Test CSV parsing
[csv_headers, csv_data] = parse_csv(csv_filename);
fprintf('CSV parsing results:\n');
fprintf('Headers: '); 
for i = 1:length(csv_headers)
    fprintf('%s ', csv_headers{i});
end
fprintf('\n');

fprintf('Sample data rows:\n');
for i = 1:min(3, size(csv_data, 1))
    fprintf('  Row %d: %s (age=%d, score=%.1f, grade=%s)\n', i, ...
            csv_data{i,1}, csv_data{i,2}, csv_data{i,3}, csv_data{i,4});
end

% Statistical analysis of CSV data
ages = [csv_data{:,2}];
scores = [csv_data{:,3}];

fprintf('CSV data statistics:\n');
fprintf('  Age: mean=%.1f, std=%.1f\n', mean(ages), std(ages));
fprintf('  Score: mean=%.1f, std=%.1f\n', mean(scores), std(scores));

% Clean up
delete(csv_filename);
delete(numeric_csv);
```
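
When a CSV file has a text header row followed by purely numeric rows, a hand-written parser may not be needed. The sketch below assumes such a layout and uses `dlmread` with a row offset to skip the header. The file name and column names are made up for the example.

```octave
% Sketch: read the numeric body of a CSV file while skipping one header row.
csv_file = [tempname() '.csv'];   % hypothetical scratch file

fid = fopen(csv_file, 'w');
if fid == -1
    error('Cannot open %s for writing', csv_file);
end
fprintf(fid, 'x,y,z\n');
% fprintf consumes the matrix column by column, so transpose to emit rows
fprintf(fid, '%d,%d,%d\n', [1 2 3; 4 5 6].');
fclose(fid);

% Row and column offsets are zero-based: start at row 1, column 0
numeric_part = dlmread(csv_file, ',', 1, 0);

delete(csv_file);
disp(numeric_part);
```

This approach only suits files whose data columns are all numeric. Mixed text and numeric columns still need a parser like `parse_csv` above.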

## 3. MAT File Operations

```octave
% MAT file operations and data persistence
fprintf('\n=== MAT File Operations ===\n');

% Create diverse data for saving
mat_filename = 'test_data.mat';

% Various data types
scalar_var = 42;
vector_var = 1:10;
matrix_var = magic(4);
string_var = 'Hello MAT file';
struct_var.name = 'John Doe';
struct_var.age = 30;
struct_var.scores = [85, 92, 78, 88];
cell_var = {'apple', 123, [1,2,3], true};
complex_var = 3 + 4i;

% Save all variables to MAT file
save(mat_filename, 'scalar_var', 'vector_var', 'matrix_var', ...
     'string_var', 'struct_var', 'cell_var', 'complex_var');

fprintf('MAT file saved: %s\n', mat_filename);

% Clear variables from workspace
clear scalar_var vector_var matrix_var string_var struct_var cell_var complex_var;

% Verify variables are cleared
fprintf('Variables cleared from workspace\n');

% Load specific variables
load(mat_filename, 'scalar_var', 'matrix_var');
fprintf('Loaded specific variables:\n');
fprintf('  scalar_var = %d\n', scalar_var);
fprintf('  matrix_var size = [%d, %d]\n', size(matrix_var, 1), size(matrix_var, 2));

% Load all variables
% (clear only the two loaded variables; a bare `clear` would also remove mat_filename)
clear scalar_var matrix_var;
load(mat_filename);
fprintf('All variables reloaded:\n');
fprintf('  scalar_var = %d\n', scalar_var);
fprintf('  vector_var length = %d\n', length(vector_var));
fprintf('  string_var = %s\n', string_var);
fprintf('  struct_var.name = %s\n', struct_var.name);
fprintf('  complex_var = %.1f + %.1fi\n', real(complex_var), imag(complex_var));

% MAT file information without loading
try
    mat_info = whos('-file', mat_filename);
    fprintf('MAT file contents:\n');
    for i = 1:length(mat_info)
        fprintf('  %s: %s [%s]\n', mat_info(i).name, mat_info(i).class, ...
                num2str(mat_info(i).size));
    end
catch
    fprintf('MAT file info not available in this environment\n');
end

% Partial loading with conditions
function loaded_data = conditional_load(filename, condition_func)
    % Load MAT file data based on conditions
    % Input: filename - MAT file path
    %        condition_func - function to test variables
    % Output: loaded_data - struct with loaded variables
    
    loaded_data = struct();
    
    try
        file_vars = whos('-file', filename);
        for i = 1:length(file_vars)
            var_name = file_vars(i).name;
            if condition_func(file_vars(i))
                temp_data = load(filename, var_name);
                loaded_data.(var_name) = temp_data.(var_name);
            end
        end
    catch
        % Fallback: load all and filter
        all_data = load(filename);
        field_names = fieldnames(all_data);
        for i = 1:length(field_names)
            var_name = field_names{i};
            var_data = all_data.(var_name);
            if condition_func(struct('name', var_name, 'class', class(var_data), 'size', size(var_data)))
                loaded_data.(var_name) = var_data;
            end
        end
    end
end

% Test conditional loading
numeric_condition = @(var_info) strcmp(var_info.class, 'double');
numeric_vars = conditional_load(mat_filename, numeric_condition);
fprintf('Loaded numeric variables: %s\n', strjoin(fieldnames(numeric_vars), ', '));

% Clean up
delete(mat_filename);
```
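
Two details of `save` and `load` are easy to miss. Octave's default save format is its own text format, even when the file name ends in `.mat`, and `load` with an output argument returns a struct instead of creating workspace variables. The sketch below assumes an Octave build with MAT v7 support and uses a made-up `payload` struct to show both points.

```octave
% Sketch: write a MATLAB-compatible binary file and load it into a struct.
mat_file = [tempname() '.mat'];   % hypothetical scratch file

payload = struct();
payload.id = 7;
payload.values = [1.5, 2.5, 3.5];

% '-mat7-binary' requests MATLAB v7 binary format explicitly;
% '-struct' stores each field of payload as a separate variable
save('-mat7-binary', mat_file, '-struct', 'payload');

% Capturing the output keeps the current workspace untouched
loaded = load(mat_file);

delete(mat_file);
disp(fieldnames(loaded));
```

Loading into a struct also avoids silently overwriting existing variables that share a name with something in the file.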

## 4. Large Dataset Handling

```octave
% Strategies for handling large datasets
fprintf('\n=== Large Dataset Handling ===\n');

% Simulate large dataset creation
large_filename = 'large_dataset.mat';
fprintf('Creating simulated large dataset...\n');

% Create chunked data
chunk_size = 1000;
num_chunks = 5;
total_rows = chunk_size * num_chunks;

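% Save data in chunks to simulate large file
for chunk = 1:num_chunks
    chunk_data = randn(chunk_size, 10);  % 1000x10 chunk
    chunk_name = sprintf('data_chunk_%d', chunk);
    
    % save() looks variables up by name, so place the chunk in a struct
    % field called chunk_name and write that field out as a variable
    chunk_struct = struct(chunk_name, chunk_data);
    
    if chunk == 1
        save(large_filename, '-struct', 'chunk_struct');
    else
        save(large_filename, '-struct', 'chunk_struct', '-append');
    end
end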

fprintf('Large dataset created: %d rows total in %d chunks\n', total_rows, num_chunks);

% Memory-efficient data processing
function stats = process_large_dataset_chunked(filename, chunk_pattern)
    % Process large dataset chunk by chunk
    % Input: filename - MAT file path
    %        chunk_pattern - pattern for chunk variable names
    % Output: stats - aggregate statistics
    
    stats.total_elements = 0;
    stats.sum = 0;
    stats.sum_squares = 0;
    stats.min_val = inf;
    stats.max_val = -inf;
    
    chunk_num = 1;
    while true
        chunk_name = sprintf(chunk_pattern, chunk_num);
        try
            chunk_data = load(filename, chunk_name);
            if isfield(chunk_data, chunk_name)
                data = chunk_data.(chunk_name);
                
                % Update statistics incrementally
                stats.total_elements = stats.total_elements + numel(data);
                stats.sum = stats.sum + sum(data(:));
                stats.sum_squares = stats.sum_squares + sum(data(:).^2);
                stats.min_val = min(stats.min_val, min(data(:)));
                stats.max_val = max(stats.max_val, max(data(:)));
                
                fprintf('  Processed chunk %d: %dx%d\n', chunk_num, size(data,1), size(data,2));
                chunk_num = chunk_num + 1;
            else
                break;
            end
        catch
            break;
        end
    end
    
    % Calculate final statistics
    stats.mean = stats.sum / stats.total_elements;
    stats.variance = (stats.sum_squares - stats.total_elements * stats.mean^2) / (stats.total_elements - 1);
    stats.std = sqrt(stats.variance);
end

% Process large dataset efficiently
fprintf('Processing large dataset in chunks:\n');
large_stats = process_large_dataset_chunked(large_filename, 'data_chunk_%d');

fprintf('Large dataset statistics:\n');
fprintf('  Total elements: %d\n', large_stats.total_elements);
fprintf('  Mean: %.6f\n', large_stats.mean);
fprintf('  Std: %.6f\n', large_stats.std);
fprintf('  Range: [%.6f, %.6f]\n', large_stats.min_val, large_stats.max_val);

% Memory usage monitoring (simplified)
function monitor_memory_usage()
    % Monitor memory usage during processing
    try
        % memory() exists only in newer Octave releases on some platforms;
        % called without outputs it prints a usage summary
        memory();
    catch
        fprintf('Memory monitoring is not available in this environment\n');
    end
end

monitor_memory_usage();

% Clean up
delete(large_filename);
```
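
The chunked statistics above derive the variance from a running sum and sum of squares, which can lose precision when the mean is large relative to the spread. A common alternative is to merge per-chunk means and squared deviations pairwise. The sketch below is one such formulation, with small made-up chunks held in a cell array in place of chunks loaded from a file.

```octave
% Sketch: merge per-chunk mean and squared deviations into running totals.
chunks = {[1 2 3 4], [5 6], [7 8 9 10]};   % stand-ins for loaded chunks

n = 0;     % elements seen so far
mu = 0;    % running mean
M2 = 0;    % running sum of squared deviations from the mean

for k = 1:numel(chunks)
    x = chunks{k}(:);
    n_k = numel(x);
    mu_k = mean(x);
    M2_k = sum((x - mu_k).^2);

    % Combine the running totals with this chunk's totals
    delta = mu_k - mu;
    n_new = n + n_k;
    mu = mu + delta * n_k / n_new;
    M2 = M2 + M2_k + delta^2 * n * n_k / n_new;
    n = n_new;
end

running_var = M2 / (n - 1);   % sample variance
fprintf('mean = %.4f, variance = %.4f\n', mu, running_var);
```

Only three scalars are carried between chunks, so memory use stays independent of the dataset size, as in the sum-based version.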

## 5. Data Import from Various Sources

```octave
% Data import from various file formats
fprintf('\n=== Data Import from Various Sources ===\n');

% Tab-separated values (TSV)
tsv_filename = 'data.tsv';
tsv_data = [1, 2, 3; 4, 5, 6; 7, 8, 9];

% Write TSV file
fid = fopen(tsv_filename, 'w');
for i = 1:size(tsv_data, 1)
    fprintf(fid, '%d', tsv_data(i, 1));
    for j = 2:size(tsv_data, 2)
        fprintf(fid, '\t%d', tsv_data(i, j));
    end
    fprintf(fid, '\n');
end
fclose(fid);

% Read TSV file
function data = read_tsv(filename)
    % Read tab-separated values file
    % Input: filename - TSV file path
    % Output: data - numeric array
    
    fid = fopen(filename, 'r');
    if fid == -1
        error('Cannot open TSV file: %s', filename);
    end
    
    data = [];
    row = 1;
    while ~feof(fid)
        line = fgetl(fid);
        if ischar(line) && ~isempty(line)
            % sscanf treats tabs as whitespace and does not evaluate the text
            values = sscanf(line, '%f').';
            data(row, :) = values;
            row = row + 1;
        end
    end
    
    fclose(fid);
end

tsv_read = read_tsv(tsv_filename);
fprintf('TSV read test: %d\n', isequal(tsv_data, tsv_read));

% Fixed-width format
fixed_width_file = 'fixed_width.txt';
fid = fopen(fixed_width_file, 'w');
names = {'Alice', 'Bob  ', 'Charlie'};
ages = [25, 30, 22];
scores = [95.5, 87.2, 92.8];

for i = 1:length(names)
    fprintf(fid, '%-10s%3d%8.2f\n', names{i}, ages(i), scores(i));
end
fclose(fid);

% Read fixed-width file
function [names, ages, scores] = read_fixed_width(filename)
    % Read fixed-width format file
    % Input: filename - file path
    % Output: names, ages, scores - extracted data
    
    fid = fopen(filename, 'r');
    if fid == -1
        error('Cannot open fixed-width file: %s', filename);
    end
    
    names = {};
    ages = [];
    scores = [];
    
    row = 1;
    while ~feof(fid)
        line = fgetl(fid);
        if ischar(line) && length(line) >= 21
            names{row} = strtrim(line(1:10));
            ages(row) = str2num(line(11:13));
            scores(row) = str2num(line(14:21));
            row = row + 1;
        end
    end
    
    fclose(fid);
end

[fw_names, fw_ages, fw_scores] = read_fixed_width(fixed_width_file);
fprintf('Fixed-width parsing results:\n');
for i = 1:length(fw_names)
    fprintf('  %s: age=%d, score=%.1f\n', fw_names{i}, fw_ages(i), fw_scores(i));
end

% JSON-like data handling (simplified)
function data = parse_simple_json_like(text)
    % Parse simple JSON-like text structure
    % Input: text - JSON-like string
    % Output: data - parsed structure
    
    % Remove whitespace and brackets
    text = strrep(text, ' ', '');
    text = strrep(text, '{', '');
    text = strrep(text, '}', '');
    
    % Split by commas
    pairs = strsplit(text, ',');
    data = struct();
    
    for i = 1:length(pairs)
        % strfind is used because contains may be unavailable in Octave
        if ~isempty(strfind(pairs{i}, ':'))
            key_val = strsplit(pairs{i}, ':');
            key = strrep(key_val{1}, '"', '');
            val_str = strrep(key_val{2}, '"', '');
            
            % Try to convert to number (str2double yields NaN for text)
            val_num = str2double(val_str);
            if ~isnan(val_num)
                data.(key) = val_num;
            else
                data.(key) = val_str;
            end
        end
    end
end

json_text = '{"name":"John","age":30,"score":85.5}';
json_data = parse_simple_json_like(json_text);
fprintf('JSON-like parsing: name=%s, age=%d, score=%.1f\n', ...
        json_data.name, json_data.age, json_data.score);

% Clean up
delete(tsv_filename);
delete(fixed_width_file);
```
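
The simplified JSON-like parser above strips every space and splits on commas, so a value such as `"John Smith"` loses its space. The sketch below is one possible variation for flat objects only. It pulls out key/value pairs with a single regular expression and keeps quoted strings intact. The pattern and sample text are illustrative assumptions, and this is not a general JSON parser.

```octave
% Sketch: extract flat "key": value pairs while preserving spaces in strings.
json_text = '{"name": "John Smith", "age": 30, "score": 85.5}';

% Token 1 is the key; token 2 is either a quoted string or a plain number
pattern = '"(\w+)"\s*:\s*("[^"]*"|[-+0-9.eE]+)';
tokens = regexp(json_text, pattern, 'tokens');

parsed = struct();
for k = 1:numel(tokens)
    key = tokens{k}{1};
    raw = tokens{k}{2};
    if raw(1) == '"'
        parsed.(key) = raw(2:end-1);      % drop the surrounding quotes
    else
        parsed.(key) = str2double(raw);   % numeric value
    end
end

fprintf('name=%s, age=%d, score=%.1f\n', parsed.name, parsed.age, parsed.score);
```

Nested objects, arrays, and escaped quotes are out of scope here and call for a real JSON decoder.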

## 6. Data Validation and Cleaning

```octave
% Data validation and cleaning techniques
fprintf('\n=== Data Validation and Cleaning ===\n');

% Create messy dataset
messy_data = [
    1.5, 2.3, 3.1, 4.2, 5.5;
    2.1, NaN, 3.8, 4.1, 5.2;
    1.9, 2.7, -999, 4.5, 5.8;  % -999 as missing value indicator
    2.3, 2.9, 3.4, Inf, 5.1;   % Infinite value
    1.7, 2.1, 3.6, 4.8, 5.3;
    2.0, 2.5, 3.2, 4.3, -999   % Another missing value
];

fprintf('Original messy data (%dx%d):\n', size(messy_data, 1), size(messy_data, 2));
fprintf('  NaN values: %d\n', sum(isnan(messy_data(:))));
fprintf('  Infinite values: %d\n', sum(isinf(messy_data(:))));
fprintf('  Missing indicators (-999): %d\n', sum(messy_data(:) == -999));

% Data cleaning function
function [cleaned_data, cleaning_report] = clean_dataset(data, missing_value)
    % Clean dataset by handling missing and invalid values
    % Input: data - input dataset
    %        missing_value - value indicating missing data
    % Output: cleaned_data - cleaned dataset
    %         cleaning_report - report of cleaning actions
    
    cleaning_report = struct();
    cleaning_report.original_size = size(data);
    
    % Replace missing value indicators with NaN
    % (count them first; after replacement no element equals missing_value)
    missing_mask = (data == missing_value);
    cleaning_report.missing_replaced = sum(missing_mask(:));
    data(missing_mask) = NaN;
    
    % Handle infinite values
    inf_mask = isinf(data);
    cleaning_report.inf_values = sum(inf_mask(:));
    data(inf_mask) = NaN;
    
    % Count total missing values
    nan_mask = isnan(data);
    cleaning_report.total_missing = sum(nan_mask(:));
    
    % Strategy 1: Remove rows with any missing values
    complete_rows = ~any(nan_mask, 2);
    cleaned_data_complete = data(complete_rows, :);
    cleaning_report.rows_removed = sum(~complete_rows);
    
    % Strategy 2: Column-wise mean imputation
    cleaned_data_imputed = data;
    for col = 1:size(data, 2)
        col_data = data(:, col);
        valid_data = col_data(~isnan(col_data));
        if ~isempty(valid_data)
            col_mean = mean(valid_data);
            cleaned_data_imputed(isnan(col_data), col) = col_mean;
        end
    end
    
    % Choose strategy based on data loss
    complete_data_loss = (size(data, 1) - size(cleaned_data_complete, 1)) / size(data, 1);
    
    if complete_data_loss < 0.3  % Less than 30% data loss
        cleaned_data = cleaned_data_complete;
        cleaning_report.strategy = 'complete_case_removal';
    else
        cleaned_data = cleaned_data_imputed;
        cleaning_report.strategy = 'mean_imputation';
    end
    
    cleaning_report.final_size = size(cleaned_data);
    cleaning_report.data_loss_percent = complete_data_loss * 100;
end

% Clean the messy dataset
[clean_data, report] = clean_dataset(messy_data, -999);

fprintf('Data cleaning results:\n');
fprintf('  Strategy used: %s\n', report.strategy);
fprintf('  Original size: [%d, %d]\n', report.original_size);
fprintf('  Final size: [%d, %d]\n', report.final_size);
fprintf('  Total missing values found: %d\n', report.total_missing);
fprintf('  Rows lost under complete-case removal: %.1f%%\n', report.data_loss_percent);

% Data validation functions
function validation_report = validate_data(data, rules)
    % Validate data against specified rules
    % Input: data - dataset to validate
    %        rules - struct with validation rules
    % Output: validation_report - validation results
    
    validation_report = struct();
    validation_report.passed = true;
    validation_report.issues = {};
    
    % Check for missing values
    if isfield(rules, 'allow_missing') && ~rules.allow_missing
        missing_count = sum(isnan(data(:)));
        if missing_count > 0
            validation_report.passed = false;
            validation_report.issues{end+1} = sprintf('Found %d missing values', missing_count);
        end
    end
    
    % Check data range
    if isfield(rules, 'min_value')
        below_min = sum(data(:) < rules.min_value);
        if below_min > 0
            validation_report.passed = false;
            validation_report.issues{end+1} = sprintf('%d values below minimum %.2f', below_min, rules.min_value);
        end
    end
    
    if isfield(rules, 'max_value')
        above_max = sum(data(:) > rules.max_value);
        if above_max > 0
            validation_report.passed = false;
            validation_report.issues{end+1} = sprintf('%d values above maximum %.2f', above_max, rules.max_value);
        end
    end
    
    % Check for outliers (using IQR method)
    if isfield(rules, 'check_outliers') && rules.check_outliers
        q1 = quantile(data(:), 0.25);
        q3 = quantile(data(:), 0.75);
        iqr_val = q3 - q1;  % named iqr_val to avoid shadowing the iqr function
        outlier_bounds = [q1 - 1.5*iqr_val, q3 + 1.5*iqr_val];
        outliers = sum((data(:) < outlier_bounds(1)) | (data(:) > outlier_bounds(2)));
        
        validation_report.outliers = outliers;
        if outliers > 0
            validation_report.issues{end+1} = sprintf('Found %d potential outliers', outliers);
        end
    end
end

% Test data validation
validation_rules.allow_missing = false;
validation_rules.min_value = 0;
validation_rules.max_value = 10;
validation_rules.check_outliers = true;

dirty_validation = validate_data(messy_data, validation_rules);
clean_validation = validate_data(clean_data, validation_rules);

fprintf('Validation results:\n');
fprintf('  Messy data passed: %d\n', dirty_validation.passed);
fprintf('  Clean data passed: %d\n', clean_validation.passed);

if ~clean_validation.passed
    fprintf('  Remaining issues with clean data:\n');
    for i = 1:length(clean_validation.issues)
        fprintf('    - %s\n', clean_validation.issues{i});
    end
end
```
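
The column-wise mean imputation in `clean_dataset` loops over columns. The same idea can be written with logical masks and no loop, which is sketched below on a small made-up matrix. It relies only on core functions and assumes `-999` is the missing-value sentinel, as in the example above.

```octave
% Sketch: vectorized column-mean imputation using a missing-value mask.
raw = [1 2; -999 4; 3 NaN; 5 6];

clean = raw;
clean(clean == -999) = NaN;          % sentinel -> NaN
missing = isnan(clean);

% Column means over the valid entries only
tmp = clean;
tmp(missing) = 0;                    % zero out gaps so they do not affect sums
col_means = sum(tmp, 1) ./ sum(~missing, 1);

% Fill each gap with the mean of its own column
fill = repmat(col_means, size(clean, 1), 1);
clean(missing) = fill(missing);

disp(clean);
```

A column with no valid entries would produce a 0/0 division and stay NaN, so that case still needs explicit handling.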

## 7. Data Export and Formatting

```octave
% Data export and formatting techniques
fprintf('\n=== Data Export and Formatting ===\n');

% Sample dataset for export
export_data = struct();
export_data.names = {'Alice', 'Bob', 'Charlie', 'Diana'};
export_data.ages = [25, 30, 22, 28];
export_data.scores = [95.5, 87.2, 92.8, 89.1];
export_data.grades = {'A', 'B', 'A', 'B'};

% Export to formatted text file
formatted_file = 'formatted_report.txt';
fid = fopen(formatted_file, 'w');

fprintf(fid, 'STUDENT PERFORMANCE REPORT\n');
fprintf(fid, '==========================\n');
fprintf(fid, 'Generated: %s\n\n', datestr(now()));

fprintf(fid, '%-12s %5s %8s %6s\n', 'Name', 'Age', 'Score', 'Grade');
fprintf(fid, '%s\n', repmat('-', 1, 35));

for i = 1:length(export_data.names)
    fprintf(fid, '%-12s %5d %8.1f %6s\n', export_data.names{i}, ...
            export_data.ages(i), export_data.scores(i), export_data.grades{i});
end

fprintf(fid, '\nSUMMARY STATISTICS:\n');
fprintf(fid, 'Average Age: %.1f years\n', mean(export_data.ages));
fprintf(fid, 'Average Score: %.1f points\n', mean(export_data.scores));
fprintf(fid, 'Total Students: %d\n', length(export_data.names));

fclose(fid);
fprintf('Formatted report exported to: %s\n', formatted_file);

% Export to CSV with custom formatting
custom_csv = 'custom_export.csv';
fid = fopen(custom_csv, 'w');

% Write header with metadata
fprintf(fid, '# Student Performance Data\n');
fprintf(fid, '# Generated: %s\n', datestr(now()));
fprintf(fid, '# Total Records: %d\n', length(export_data.names));
fprintf(fid, 'Name,Age,Score,Grade,Performance\n');

for i = 1:length(export_data.names)
    % Add calculated performance category
    if export_data.scores(i) >= 90
        performance = 'Excellent';
    elseif export_data.scores(i) >= 80
        performance = 'Good';
    else
        performance = 'Satisfactory';
    end
    
    fprintf(fid, '%s,%d,%.1f,%s,%s\n', export_data.names{i}, ...
            export_data.ages(i), export_data.scores(i), ...
            export_data.grades{i}, performance);
end

fclose(fid);
fprintf('Custom CSV exported to: %s\n', custom_csv);

% Export configuration function
function export_success = export_data_flexible(data, filename, format, options)
    % Flexible data export function
    % Input: data - struct with data fields
    %        filename - output filename
    %        format - 'csv', 'txt', 'mat'
    %        options - export options struct
    % Output: export_success - boolean success flag
    
    export_success = false;
    
    try
        switch lower(format)
            case 'csv'
                export_csv_format(data, filename, options);
            case 'txt'
                export_text_format(data, filename, options);
            case 'mat'
                save(filename, '-struct', 'data');
            otherwise
                error('Unsupported format: %s', format);
        end
        export_success = true;
    catch me
        fprintf('Export failed: %s\n', me.message);
    end
end

function export_csv_format(data, filename, options)
    % Export to CSV format
    fid = fopen(filename, 'w');
    
    if isfield(options, 'include_header') && options.include_header
        fields = fieldnames(data);
        fprintf(fid, '%s', fields{1});
        for i = 2:length(fields)
            fprintf(fid, ',%s', fields{i});
        end
        fprintf(fid, '\n');
    end
    
    % Assume first field determines number of records
    fields = fieldnames(data);
    n_records = length(data.(fields{1}));
    
    for row = 1:n_records
        for col = 1:length(fields)
            column = data.(fields{col});
            % Cell columns need brace indexing to get the stored value
            if iscell(column)
                val = column{row};
            else
                val = column(row);
            end
            if col > 1
                fprintf(fid, ',');
            end
            if isnumeric(val)
                fprintf(fid, '%.2f', val);
            else
                fprintf(fid, '%s', val);
            end
        end
        fprintf(fid, '\n');
    end
    
    fclose(fid);
end

function export_text_format(data, filename, options)
    % Export to formatted text
    fid = fopen(filename, 'w');
    
    if isfield(options, 'title')
        fprintf(fid, '%s\n', options.title);
        fprintf(fid, '%s\n', repmat('=', 1, length(options.title)));
    end
    
    fields = fieldnames(data);
    n_records = length(data.(fields{1}));
    
    for row = 1:n_records
        for col = 1:length(fields)
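            column = data.(fields{col});
            % Cell columns need brace indexing to get the stored value
            if iscell(column)
                val = column{row};
            else
                val = column(row);
            end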
            if isnumeric(val)
                fprintf(fid, '%s: %.2f  ', fields{col}, val);
            else
                fprintf(fid, '%s: %s  ', fields{col}, val);
            end
        end
        fprintf(fid, '\n');
    end
    
    fclose(fid);
end

% Test flexible export
export_options.include_header = true;
export_options.title = 'Student Data Export';

success1 = export_data_flexible(export_data, 'flexible_export.csv', 'csv', export_options);
success2 = export_data_flexible(export_data, 'flexible_export.txt', 'txt', export_options);

fprintf('Flexible export results: CSV=%d, TXT=%d\n', success1, success2);

% Clean up export files
delete(formatted_file);
delete(custom_csv);
delete('flexible_export.csv');
delete('flexible_export.txt');
```
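
The CSV exporters above write text fields as-is, which produces a malformed row if a field contains a comma or a double quote. A common convention is to wrap such fields in double quotes and double any embedded quotes. The sketch below applies that convention to a made-up row of fields.

```octave
% Sketch: quote CSV fields that contain commas, quotes, or newlines.
fields = {'Smith, John', 'He said "hi"', 'plain'};

quoted = cell(size(fields));
for k = 1:numel(fields)
    f = fields{k};
    if any(f == ',') || any(f == '"') || any(f == char(10))
        % Double embedded quotes, then wrap the whole field in quotes
        f = ['"' strrep(f, '"', '""') '"'];
    end
    quoted{k} = f;
end

csv_line = strjoin(quoted, ',');
disp(csv_line);
```

A reader that splits lines on every comma, such as the `strsplit`-based `parse_csv` earlier in this notebook, will not understand quoted fields, so the writer and reader have to agree on the convention.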

## 8. Performance Optimization for File Operations

```octave
% Performance optimization for file operations
fprintf('\n=== Performance Optimization ===\n');

% Buffered file operations
function write_performance_test(n_records)
    % Compare different file writing approaches
    % Input: n_records - number of records to write
    
    data = rand(n_records, 5);  % Random data
    
    % Method 1: Write line by line (slower)
    filename1 = 'test_line_by_line.csv';
    tic;
    fid = fopen(filename1, 'w');
    for i = 1:n_records
        fprintf(fid, '%.6f,%.6f,%.6f,%.6f,%.6f\n', data(i,:));
    end
    fclose(fid);
    time1 = toc;
    
    % Method 2: Write the whole matrix with csvwrite (usually faster; note
    % that it uses its own default number formatting, not %.6f)
    filename2 = 'test_vectorized.csv';
    tic;
    csvwrite(filename2, data);
    time2 = toc;
    
    fprintf('Writing %d records:\n', n_records);
    fprintf('  Line-by-line: %.4f seconds\n', time1);
    fprintf('  Vectorized: %.4f seconds\n', time2);
    fprintf('  Speedup: %.1fx\n', time1/time2);
    
    % Clean up
    delete(filename1);
    delete(filename2);
end

% Test with moderate dataset size
write_performance_test(1000);

% Chunked processing simulation (chunks are generated, not read from disk)
function demonstrate_chunked_processing()
    % Demonstrate processing data in chunks
    fprintf('Demonstrating chunked data processing:\n');
    
    total_size = 5000;
    chunk_size = 1000;
    n_chunks = ceil(total_size / chunk_size);
    
    % Simulate processing large dataset in chunks
    total_sum = 0;
    total_elements = 0;
    
    for chunk = 1:n_chunks
        % Simulate loading chunk
        start_idx = (chunk - 1) * chunk_size + 1;
        end_idx = min(chunk * chunk_size, total_size);
        chunk_data = randn(end_idx - start_idx + 1, 3);
        
        % Process chunk
        chunk_sum = sum(chunk_data(:));
        chunk_elements = numel(chunk_data);
        
        total_sum = total_sum + chunk_sum;
        total_elements = total_elements + chunk_elements;
        
        fprintf('  Chunk %d: processed %d elements\n', chunk, chunk_elements);
    end
    
    overall_mean = total_sum / total_elements;
    fprintf('  Overall mean: %.6f\n', overall_mean);
end

demonstrate_chunked_processing();

% File format comparison
function compare_file_formats()
    % Compare different file formats for storage efficiency
    fprintf('Comparing file formats:\n');
    
    % Create test data
    test_data = rand(100, 10);
    
    % MAT file
    mat_file = 'test_data.mat';
    save(mat_file, 'test_data');
    mat_info = dir(mat_file);
    
    % CSV file  
    csv_file = 'test_data.csv';
    csvwrite(csv_file, test_data);
    csv_info = dir(csv_file);
    
    % Binary file
    bin_file = 'test_data.bin';
    fid = fopen(bin_file, 'wb');
    fwrite(fid, test_data, 'double');
    fclose(fid);
    bin_info = dir(bin_file);
    
    fprintf('  MAT file: %d bytes\n', mat_info.bytes);
    fprintf('  CSV file: %d bytes\n', csv_info.bytes);
    fprintf('  Binary file: %d bytes\n', bin_info.bytes);
    
    % Clean up
    delete(mat_file);
    delete(csv_file);
    delete(bin_file);
end

compare_file_formats();
```
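
Between the line-by-line loop and `csvwrite` there is a third option that keeps full control over the number format: a single `fprintf` call. `fprintf` reuses its format string until all data is consumed and walks the matrix column by column, so passing the transpose emits one row per line. The sketch below assumes a small made-up matrix and a scratch file.

```octave
% Sketch: write a whole matrix as CSV with one fprintf call.
data = [1.5 2.5; 3.5 4.5; 5.5 6.5];
out_file = [tempname() '.csv'];   % hypothetical scratch file

fid = fopen(out_file, 'w');
if fid == -1
    error('Cannot open %s for writing', out_file);
end
% The format is recycled for every pair of values taken from data.'
fprintf(fid, '%.1f,%.1f\n', data.');
fclose(fid);

round_trip = dlmread(out_file, ',');
delete(out_file);
disp(round_trip);
```

The format string must contain one conversion per column, so for a variable number of columns it would have to be built programmatically.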

## 9. Data Integrity and Backup

```octave
% Data integrity and backup strategies
fprintf('\n=== Data Integrity and Backup ===\n');

% Checksum calculation for data integrity
function checksum = calculate_checksum(data)
    % Calculate simple checksum for data integrity
    % Input: data - numeric data array
    % Output: checksum - calculated checksum
    
    % Simple checksum from the sum and sum of squares (order-insensitive)
    data_flat = data(:);
    checksum = mod(sum(data_flat) + sum(data_flat.^2), 2^32);
end

% Create test data with checksum
original_data = [1, 2, 3; 4, 5, 6; 7, 8, 9];
original_checksum = calculate_checksum(original_data);

fprintf('Original data checksum: %d\n', original_checksum);

% Simulate data corruption
corrupted_data = original_data;
corrupted_data(2, 2) = 999;  % Corrupt one element
corrupted_checksum = calculate_checksum(corrupted_data);

fprintf('Corrupted data checksum: %d\n', corrupted_checksum);
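% Octave has no built-in iif(), so pick the label with a plain if/else
if original_checksum == corrupted_checksum
    integrity_status = 'PASSED';
else
    integrity_status = 'FAILED';
end
fprintf('Data integrity check: %s\n', integrity_status);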

% Backup strategy implementation
function backup_success = create_backup(data, base_filename, backup_dir)
    % Create backup with timestamp
    % Input: data - data to backup
    %        base_filename - base name for files
    %        backup_dir - backup directory
    % Output: backup_success - success flag
    
    backup_success = false;
    
    try
        % Create backup directory if it doesn't exist
        if ~exist(backup_dir, 'dir')
            mkdir(backup_dir);
        end
        
        % Generate timestamped filename
        timestamp = datestr(now(), 'yyyymmdd_HHMMSS');
        backup_filename = fullfile(backup_dir, sprintf('%s_%s.mat', base_filename, timestamp));
        
        % Save with metadata
        save_time = now();
        data_checksum = calculate_checksum(data);
        save(backup_filename, 'data', 'save_time', 'data_checksum');
        
        fprintf('Backup created: %s\n', backup_filename);
        backup_success = true;
        
    catch me
        fprintf('Backup failed: %s\n', me.message);
    end
end

% Test backup system
backup_dir = 'backups';
success = create_backup(original_data, 'test_data', backup_dir);

% Backup verification
function verify_success = verify_backup(backup_filename)
    % Verify backup integrity
    % Input: backup_filename - path to backup file
    % Output: verify_success - verification result
    
    verify_success = false;
    
    try
        backup_data = load(backup_filename);
        
        % Verify checksum
        calculated_checksum = calculate_checksum(backup_data.data);
        stored_checksum = backup_data.data_checksum;
        
        if calculated_checksum == stored_checksum
            fprintf('Backup verification: PASSED\n');
            verify_success = true;
        else
            fprintf('Backup verification: FAILED (checksum mismatch)\n');
        end
        
    catch me
        fprintf('Backup verification failed: %s\n', me.message);
    end
end

% Find and verify recent backup
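if exist(backup_dir, 'dir')
    backup_files = dir(fullfile(backup_dir, '*.mat'));
    if ~isempty(backup_files)
        % Pick the most recently modified backup explicitly instead of relying on dir() order
        [~, newest] = max([backup_files.datenum]);
        latest_backup = fullfile(backup_dir, backup_files(newest).name);
        verify_backup(latest_backup);
    end
    
    % Clean up backup directory
    % (Octave may ask for confirmation before a recursive delete; disable the prompt where the setting exists)
    if exist('confirm_recursive_rmdir')
        confirm_recursive_rmdir(false);
    end
    rmdir(backup_dir, 's');
end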
```
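
Timestamped backups accumulate over time, so a backup strategy usually also needs a retention rule. The following is a minimal sketch and not part of the original notebook: `prune_backups` and `keep_n` are hypothetical names, and the sketch assumes backup files share one base name and end in the `yyyymmdd_HHMMSS` timestamp used by `create_backup`, so that sorting by name is the same as sorting by age.

```octave
fprintf('\n=== Backup Retention Sketch ===\n');

% Hypothetical helper: keep only the newest keep_n backups in backup_dir.
% Assumes names end in a yyyymmdd_HHMMSS timestamp, so name order == age order.
function n_deleted = prune_backups(backup_dir, keep_n)
    n_deleted = 0;
    files = dir(fullfile(backup_dir, '*.mat'));
    if numel(files) <= keep_n
        return;
    end
    names = sort({files.name});            % oldest first
    old_names = names(1:end-keep_n);
    for k = 1:numel(old_names)
        delete(fullfile(backup_dir, old_names{k}));
        n_deleted = n_deleted + 1;
    end
end

% Demonstration in a temporary directory with made-up timestamps
demo_dir = tempname();
mkdir(demo_dir);
stamps = {'20240101_120000', '20240102_120000', '20240103_120000', '20240104_120000'};
for k = 1:numel(stamps)
    payload = k;
    save(fullfile(demo_dir, sprintf('test_data_%s.mat', stamps{k})), 'payload');
end

n_deleted = prune_backups(demo_dir, 2);
remaining = dir(fullfile(demo_dir, '*.mat'));
remaining_names = sort({remaining.name});
fprintf('Deleted %d old backups, %d remain\n', n_deleted, numel(remaining));

% Clean up the demonstration files
for k = 1:numel(remaining)
    delete(fullfile(demo_dir, remaining(k).name));
end
rmdir(demo_dir);
```

Sorting by name keeps the rule independent of file modification times, which can change when backups are copied between machines.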

## 10. Real-World Applications

```octave
% Real-world data handling applications
fprintf('\n=== Real-World Applications ===\n');

% Application 1: Log file analysis
function analyze_log_data()
    % Simulate log file analysis
    fprintf('Log File Analysis Simulation:\n');
    
    % Create simulated log data
    log_file = 'system.log';
    fid = fopen(log_file, 'w');
    if fid == -1
        error('Cannot open %s for writing', log_file);
    end
    
    levels = {'INFO', 'WARNING', 'ERROR', 'DEBUG'};
    messages = {'System started', 'Memory usage high', 'Connection failed', 'Processing data'};
    
    for i = 1:20
        timestamp = datestr(now() - rand()*7, 'yyyy-mm-dd HH:MM:SS');
        level = levels{randi(4)};
        message = messages{randi(4)};
        fprintf(fid, '%s [%s] %s\n', timestamp, level, message);
    end
    fclose(fid);
    
    % Analyze log file
    fid = fopen(log_file, 'r');
    log_stats = struct('INFO', 0, 'WARNING', 0, 'ERROR', 0, 'DEBUG', 0);
    
    while ~feof(fid)
        line = fgetl(fid);
        if ischar(line)
            for j = 1:length(levels)
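                % strfind is portable; contains() is not available in every Octave release.
                % Matching '[LEVEL]' avoids false hits on level names inside the message text.
                if ~isempty(strfind(line, ['[', levels{j}, ']']))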
                    log_stats.(levels{j}) = log_stats.(levels{j}) + 1;
                    break;
                end
            end
        end
    end
    fclose(fid);
    
    fprintf('  Log analysis results:\n');
    fields = fieldnames(log_stats);
    for i = 1:length(fields)
        fprintf('    %s: %d entries\n', fields{i}, log_stats.(fields{i}));
    end
    
    delete(log_file);
end

analyze_log_data();

% Application 2: Sensor data processing
function process_sensor_data()
    % Simulate sensor data processing pipeline
    fprintf('Sensor Data Processing Simulation:\n');
    
    % Generate synthetic sensor data
    time_points = 1000;
    time = linspace(0, 100, time_points);
    
    % Multiple sensors with different characteristics
    sensor1 = 10 + 2*sin(2*pi*time/10) + 0.5*randn(size(time));  % Temperature
    sensor2 = 50 + 10*cos(2*pi*time/15) + randn(size(time));     % Humidity  
    sensor3 = 1013 + 5*sin(2*pi*time/20) + 0.2*randn(size(time)); % Pressure
    
    % Introduce some bad data
    bad_indices = randperm(time_points, 20);  % distinct indices, so no position is corrupted twice
    sensor1(bad_indices(1:5)) = -999;     % Missing values
    sensor2(bad_indices(6:10)) = NaN;     % NaN values
    sensor3(bad_indices(11:15)) = Inf;    % Infinite values
    
    % Create sensor data structure
    sensor_data = struct();
    sensor_data.time = time;
    sensor_data.temperature = sensor1;
    sensor_data.humidity = sensor2;  
    sensor_data.pressure = sensor3;
    
    % Data validation and cleaning
    [clean_temp, temp_report] = clean_dataset(sensor_data.temperature', -999);
    [clean_hum, hum_report] = clean_dataset(sensor_data.humidity', -999);
    [clean_press, press_report] = clean_dataset(sensor_data.pressure', -999);
    
    fprintf('  Sensor data cleaning results:\n');
    fprintf('    Temperature: %s, %.1f%% data retained\n', ...
            temp_report.strategy, 100 - temp_report.data_loss_percent);
    fprintf('    Humidity: %s, %.1f%% data retained\n', ...
            hum_report.strategy, 100 - hum_report.data_loss_percent);
    fprintf('    Pressure: %s, %.1f%% data retained\n', ...
            press_report.strategy, 100 - press_report.data_loss_percent);
    
    % Calculate statistics
    fprintf('  Sensor statistics (after cleaning):\n');
    fprintf('    Temperature: mean=%.1f°C, std=%.2f°C\n', mean(clean_temp), std(clean_temp));
    fprintf('    Humidity: mean=%.1f%%, std=%.2f%%\n', mean(clean_hum), std(clean_hum));
    fprintf('    Pressure: mean=%.1f hPa, std=%.2f hPa\n', mean(clean_press), std(clean_press));
end

process_sensor_data();

% Application 3: Batch data processing
function batch_process_files()
    % Simulate batch processing of multiple data files
    fprintf('Batch Data Processing Simulation:\n');
    
    n_files = 5;
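    % Wrap the cell array in braces so struct() creates a 1x1 struct;
    % a bare {} would yield an empty 0x0 struct array that cannot be assigned into.
    batch_results = struct('filenames', {cell(1, n_files)}, 'means', zeros(1, n_files), ...
                           'stds', zeros(1, n_files), 'sizes', zeros(1, n_files));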
    
    for i = 1:n_files
        % Create temporary data file
        filename = sprintf('batch_data_%d.csv', i);
        data = randn(randi([50, 200]), 3) * (i * 2);  % Different scales
        csvwrite(filename, data);
        
        % Process file
        loaded_data = csvread(filename);
        file_mean = mean(loaded_data(:));
        file_std = std(loaded_data(:));
        file_size = numel(loaded_data);
        
        % Store results
        batch_results.filenames{i} = filename;
        batch_results.means(i) = file_mean;
        batch_results.stds(i) = file_std;  
        batch_results.sizes(i) = file_size;
        
        fprintf('  Processed %s: mean=%.3f, std=%.3f, size=%d\n', ...
                filename, file_mean, file_std, file_size);
        
        % Clean up
        delete(filename);
    end
    
    % Aggregate results (averages of the per-file statistics, not statistics of the pooled data)
    overall_mean = mean(batch_results.means);
    overall_std = mean(batch_results.stds);
    total_elements = sum(batch_results.sizes);
    
    fprintf('  Batch processing summary:\n');
    fprintf('    Files processed: %d\n', n_files);
    fprintf('    Average mean: %.3f\n', overall_mean);
    fprintf('    Average std: %.3f\n', overall_std);
    fprintf('    Total elements: %d\n', total_elements);
end

batch_process_files();
```
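
The batch summary reports the mean of the per-file standard deviations, which generally differs from the standard deviation of the combined data when files differ in size or scale. As a hedged sketch (the variable names and sample values are illustrative, not from the original), the overall mean and standard deviation can be recovered from per-file counts, sums, and sums of squares, without holding every file in memory:

```octave
fprintf('\n=== Pooled Statistics Sketch ===\n');

% Illustrative in-memory "files" with different sizes and scales
chunks = {[1 2 3 4], [10 20 30], [100 200 300 400 500]};

total_n = 0;
total_sum = 0;
total_sumsq = 0;
for k = 1:numel(chunks)
    x = chunks{k}(:);
    total_n = total_n + numel(x);
    total_sum = total_sum + sum(x);
    total_sumsq = total_sumsq + sum(x.^2);
end

pooled_mean = total_sum / total_n;
% Sample variance from accumulated sums (n-1 normalization, like std)
pooled_var = (total_sumsq - total_n * pooled_mean^2) / (total_n - 1);
pooled_std = sqrt(max(pooled_var, 0));   % guard against tiny negative round-off

% Reference values computed on the concatenated data
all_data = [chunks{:}];
fprintf('Pooled mean: %.4f (direct: %.4f)\n', pooled_mean, mean(all_data));
fprintf('Pooled std:  %.4f (direct: %.4f)\n', pooled_std, std(all_data));
fprintf('Mean of per-file stds: %.4f\n', mean(cellfun(@std, chunks)));
```

The sum-of-squares formula is simple but can lose precision when the mean is large relative to the spread; Welford-style incremental updates are a numerically safer alternative in that case.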

---

# Summary

**Data Handling & File I/O Mastery Completed:**

This notebook covered the core aspects of data handling and file operations:

- ✅ **Basic File I/O**: Text files, binary files, reading/writing operations
- ✅ **CSV Operations**: Import/export, parsing mixed data, manual CSV handling  
- ✅ **MAT Files**: Saving/loading variables, partial loading, file information
- ✅ **Large Datasets**: Chunked processing, memory-efficient techniques
- ✅ **Data Import**: Multiple formats (TSV, fixed-width, JSON-like)
- ✅ **Data Cleaning**: Missing values, outliers, validation, imputation
- ✅ **Data Export**: Formatted output, flexible export functions
- ✅ **Performance**: Optimization strategies, format comparisons
- ✅ **Data Integrity**: Checksums, backup strategies, verification
- ✅ **Real Applications**: Log analysis, sensor data, batch processing

**Key Performance Insights:**
1. **Built-in Bulk I/O**: For purely numeric data, prefer built-in functions such as `csvwrite`/`csvread` over hand-written element-by-element loops
2. **Memory Management**: Process large datasets in chunks
3. **Data Validation**: Always validate and clean data before analysis
4. **Format Selection**: Choose appropriate formats based on data types and size
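
As a concrete illustration of the chunking advice, the following sketch streams a CSV file in fixed-size blocks of rows and keeps only running totals in memory. The temporary file, `chunk_size`, and the three-column layout are assumptions made for the demonstration.

```octave
fprintf('\n=== Chunked CSV Reading Sketch ===\n');

% Create a small demonstration file (assumed layout: 3 numeric columns)
demo_file = [tempname() '.csv'];
demo_data = reshape(1:30, 3, 10)';       % 10 rows, 3 columns
csvwrite(demo_file, demo_data);

chunk_size = 4;                          % rows per chunk (illustrative)
n_cols = 3;
col_sums = zeros(1, n_cols);
n_rows = 0;
n_chunks = 0;

fid = fopen(demo_file, 'r');
if fid == -1
    error('Cannot open %s', demo_file);
end

done = false;
while ~done
    % Fill one chunk line by line
    chunk = zeros(chunk_size, n_cols);
    rows_in_chunk = 0;
    while rows_in_chunk < chunk_size
        row_text = fgetl(fid);
        if ~ischar(row_text)             % fgetl returns -1 at end of file
            done = true;
            break;
        end
        if isempty(strtrim(row_text))
            continue;                    % skip blank lines
        end
        rows_in_chunk = rows_in_chunk + 1;
        chunk(rows_in_chunk, :) = sscanf(row_text, '%f,')';
    end
    if rows_in_chunk > 0
        % Process the chunk, keeping only running totals
        col_sums = col_sums + sum(chunk(1:rows_in_chunk, :), 1);
        n_rows = n_rows + rows_in_chunk;
        n_chunks = n_chunks + 1;
    end
end
fclose(fid);
delete(demo_file);

col_means = col_sums / n_rows;
fprintf('Read %d rows in %d chunks\n', n_rows, n_chunks);
fprintf('Column means: %s\n', mat2str(col_means));
```

Peak memory here depends on `chunk_size` rather than on the file length, which is the point of the technique.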

**Best Practices Established:**
- Always validate input data and handle missing values appropriately
- Implement proper error handling for file operations
- Use checksums for data integrity verification
- Create automated backup and recovery strategies
- Optimize file I/O for performance with large datasets
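
The error-handling practice can be sketched as follows. This is one possible pattern rather than the notebook's own: `read_first_line` is a hypothetical helper, and `unwind_protect` is Octave-specific (MATLAB code would typically use `onCleanup` instead).

```octave
fprintf('\n=== Safe File Handling Sketch ===\n');

% Hypothetical helper: return the first line of a text file,
% closing the file even if reading fails part-way.
function first_line = read_first_line(filename)
    [fid, msg] = fopen(filename, 'r');
    if fid == -1
        error('Cannot open %s: %s', filename, msg);
    end
    unwind_protect
        first_line = fgetl(fid);
        if ~ischar(first_line)
            first_line = '';             % empty file
        end
    unwind_protect_cleanup
        fclose(fid);                     % always runs, error or not
    end_unwind_protect
end

% Successful read
demo_file = [tempname() '.txt'];
fid = fopen(demo_file, 'w');
fprintf(fid, 'header line\nsecond line\n');
fclose(fid);
first_line = read_first_line(demo_file);
fprintf('First line: %s\n', first_line);
delete(demo_file);

% A failed open is reported through try/catch instead of stopping the script
open_failed = false;
try
    read_first_line(demo_file);          % file was just deleted
catch err
    open_failed = true;
    fprintf('Handled error: %s\n', err.message);
end
```

Capturing the second output of `fopen` gives the operating system's reason for the failure, which is usually more useful in logs than a generic message.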

**Real-World Impact:**
- Scientific data analysis pipelines
- Business intelligence and reporting systems
- IoT sensor data processing
- Quality control and monitoring applications
- Research data management and archival

**Next Steps:**
- Apply these techniques to domain-specific datasets
- Explore advanced file formats and databases
- Build automated data processing pipelines
- Proceed to `06_plotting_2d_3d.ipynb` for data visualization

Your data handling expertise is now enterprise-ready! 📊