In [2]:
import csv
import json 
import numpy as np
import pandas as pd 
from pathlib import Path
from matplotlib import pyplot as plt
import seaborn as sns

## Types of Files

Python works with two kinds of files: binary files and text files. Many of the files we use every day are binary files. Common examples include:
* Image files such as jpg, png, bmp, and gif
* Database files such as mdb, frm, and sqlite
* Document files such as doc, xls, and pdf

These files may contain text, but they are stored in a binary format. The reason is that they need special handling and specific software. For example, you need Excel or LibreOffice to work with Excel files, and a database program to work with an SQLite file.

Text files, on the other hand, store only characters in a standard text encoding such as UTF-8, so they can be opened in any standard text editor. Each line of a text file ends with an invisible newline character, which tells the text editor to start a new line. When you work with these files in code, you can take advantage of that character. In Python, it is written as `\n`.
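As a small illustration of the newline character, the sketch below writes two lines to a temporary file and reads them back in text mode and in binary mode. The file name `sample.txt` is hypothetical, and `newline="\n"` is passed on writing only so that the bytes on disk look the same on every operating system.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"  # hypothetical file name

    # Write two lines; newline="\n" keeps the on-disk bytes identical across platforms
    with open(sample, "w", encoding="utf-8", newline="\n") as f:
        f.write("first line\nsecond line\n")

    # Text mode: each line comes back as a str that ends with "\n"
    with open(sample, "r", encoding="utf-8") as f:
        lines = f.readlines()

    # Binary mode: the same content comes back as raw bytes
    with open(sample, "rb") as f:
        raw = f.read()

print(lines)
print(raw)
```

In text mode the newline shows up as `\n` at the end of each string, while binary mode returns the undecoded bytes.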

## Python Built-Ins for File Handling

Python's built-in `open()` function handles both kinds of files. It returns a file object.

You can open a file in several different modes:
* 'w' – Write Mode: Use this mode when you want to write new content to the file. Keep in mind that it erases the contents of an existing file, or creates the file if it does not exist. The file pointer is placed at the beginning of the file.
* 'r' – Read Mode: Use this mode when the information in the file is only meant to be read and not changed in any way. The file pointer is placed at the beginning of the file.
* 'a' – Append Mode: This mode adds information to the end of the file automatically. The file pointer is placed at the end of the file.
* 'r+' – Read/Write Mode: Use this mode when you will be both reading from the file and making changes to it. The file pointer is placed at the beginning of the file.
* 'a+' – Append and Read Mode: The file is opened so that data can be added to the end, and your program can read from it as well. The file pointer is placed at the end of the file.

For binary files you use the same mode specifiers with a `b` added to the end. So the write mode specifier for a binary file is 'wb'. The others are 'rb', 'ab', 'r+b', and 'a+b' respectively.

> Note: Best practice is to use the `open()` function together with a context manager, using the `with` keyword. This makes sure the file is closed and its resources are released as soon as the block ends. If you use neither `with` nor an explicit call to `close()` on the file object, Python will eventually close the file when the garbage collector destroys the object. However, depending on your code, that can happen at an unpredictable time.
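The modes and the `with` pattern described above can be tried out with a short sketch like this one. It works in a temporary directory, and the file name `log.txt` is only an illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "log.txt"  # hypothetical file name

    # 'w' creates the file (or erases an existing one) and writes from the start
    with open(log_path, "w", encoding="utf-8") as f:
        f.write("line 1\n")

    # The context manager has already closed the file at this point
    closed_after_with = f.closed

    # 'a' keeps the existing content and writes at the end
    with open(log_path, "a", encoding="utf-8") as f:
        f.write("line 2\n")

    # 'r' only reads
    with open(log_path, "r", encoding="utf-8") as f:
        after_append = f.read()

    # 'w' again: the previous content is gone
    with open(log_path, "w", encoding="utf-8") as f:
        f.write("fresh start\n")

    with open(log_path, "r", encoding="utf-8") as f:
        after_rewrite = f.read()

print(closed_after_with)
print(repr(after_append))
print(repr(after_rewrite))
```

Comparing `after_append` with `after_rewrite` shows the difference between appending and overwriting.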

## Pathlib Module

The `Path` object provides a cross-platform way to read, write, move, and delete files. The other main benefit of this library is that it brings together functionality that is otherwise spread across libraries such as os, glob, and shutil. This makes file operations quite straightforward.

In addition, there are built-in methods for reading and writing both binary and text files. This keeps file operations clean and Pythonic.

Agenda for the pathlib exploration:
* the object-oriented interface for managing file and directory paths
* different ways to instantiate a Path object
* methods for reading, writing, moving, or deleting a file
* how to list all the paths in a directory
* how to check whether a path corresponds to a file

### Module Importing Pattern and Path Class 

We need to import pathlib in our code. In fact, the Path class is so widely used that a common pattern is to import it directly, like this:

```python

from pathlib import Path
```

Because we will work with the Path class of pathlib, importing it this way means we do not need to keep referring to `pathlib.Path` every time we use it.

* We can get the current working directory with `Path.cwd()` or the user's home directory with `Path.home()`.
  * These return a `PosixPath` object on Linux and macOS, and a `WindowsPath` object on Windows.
* We can also pass a string to `Path` to point it to a directory or a file.
* Joining a path
  * The forward slash operator can be used to join parts of a path.
  * There is also the `joinpath` method, which takes the parts of the path as arguments.
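A quick sketch of these instantiation options is shown here. The relative path `faker_data/demo-file.txt` is only an example string, and nothing needs to exist on disk for a `Path` object to be created.

```python
from pathlib import Path

cwd = Path.cwd()        # absolute path of the current working directory
home = Path.home()      # the user's home directory

# A Path built from a plain string; the file does not have to exist
from_string = Path("faker_data/demo-file.txt")

# The concrete class depends on the operating system:
# PosixPath on Linux and macOS, WindowsPath on Windows
print(type(cwd).__name__)
print(cwd.is_absolute(), from_string.is_absolute())
```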


Without pathlib, file paths are represented as regular text strings, and the functionality for working with them is spread across different libraries.

For example, say you would like to move some files into an archive directory. To do this you would need:
* the glob module to find all the paths that fit a criterion, such as all the csv files or all files whose names start with the letter a
* the os module to join each file name with the path of the directory you want to move the files into
* the shutil module to actually move each file to its new path

The pathlib module instead provides a Path class that works the same way on different operating systems such as Windows, macOS, and Linux. Instead of importing glob, os, and shutil to move the files, we can perform the same task with pathlib alone.
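The archive task described above could look roughly like this with pathlib alone. This is a minimal sketch that runs in a temporary directory with hypothetical file names; `Path.rename()` is used for the move, which assumes the source and target are on the same file system.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Create a few sample files to work with (hypothetical names)
    for name in ("a.csv", "b.csv", "notes.txt"):
        (root / name).write_text("demo", encoding="utf-8")

    archive = root / "archive"
    archive.mkdir(exist_ok=True)

    # glob() finds the matching files; sorted() collects them before anything is moved
    for csv_file in sorted(root.glob("*.csv")):
        # The / operator builds the target path, rename() moves the file
        csv_file.rename(archive / csv_file.name)

    # iterdir() lists a directory, is_file() filters out sub-directories
    moved = sorted(p.name for p in archive.iterdir())
    remaining = sorted(p.name for p in root.iterdir() if p.is_file())

print(moved)
print(remaining)
```

Finding, joining, and moving are all done through `Path` methods here, with no separate glob, os, or shutil import.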

### Joining Path

Besides passing a path string or using class methods such as `Path.cwd()` to get a path, there is a third way to construct one:
* We can use the `/` operator to join the parts of a path.
  * As long as at least one operand is a Path object, we can join several paths or a mix of paths and strings.
  * If the operator does not feel like a proper object-oriented approach, the `joinpath()` method performs the same operation:
    * `Path.cwd().joinpath("archieve", 'file1.py')`

In [3]:
Path.cwd().joinpath("archieve", 'file1.py')

PosixPath('/home/thinkstation/Projects/08-python-snippets/src/files-io/archieve/file1.py')
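To see the rule about needing at least one path object in action, here is a small sketch. It uses `PurePosixPath` so that the printed paths look the same on every operating system, and the directory and file names are made up for the example.

```python
from pathlib import PurePosixPath

base = PurePosixPath("/data")

# Path / str / str: every step returns a new path object
joined = base / "archive" / "file1.py"

# str / Path also works, because the path object handles the operator
mixed = "reports" / PurePosixPath("2024") / "summary.csv"

# str / str fails: neither side knows how to join paths
try:
    "reports" / "2024"
    str_join_failed = False
except TypeError:
    str_join_failed = True

print(joined)
print(mixed)
print(str_join_failed)
```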

## File System Operations

****


### Picking Out Components of a Path

A file or directory path consists of different parts. When you use pathlib, these parts are conveniently available as properties. Basic examples include:

* `.name`: The filename without any directory
* `.stem`: The filename without the file extension
* `.suffix`: The file extension
* `.anchor`: The part of the path before the directories
* `.parent`: The directory containing the file, or the parent directory if the path is a directory
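The properties listed above can be inspected on a sample path. The path below is hypothetical, and `PurePosixPath` is used so the result does not depend on the operating system or on the file existing.

```python
from pathlib import PurePosixPath

report = PurePosixPath("/home/user/reports/summary.csv")  # hypothetical path

print(report.name)    # summary.csv
print(report.stem)    # summary
print(report.suffix)  # .csv
print(report.anchor)  # /
print(report.parent)  # /home/user/reports
```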


In [5]:
demo_text = Path.cwd().joinpath('faker_data', 'demo-file.txt')

with demo_text.open(mode='r', encoding='utf-8') as demo_file:
    print(demo_file.read())

This is the first line written in the file. 
 This is the second line written in teh file.this is the redirection from the print statement.
this is the second redirection from the print statement. 



### Methods to read, write the file 

In fact, Path.open() is calling the built-in open() function behind the scenes. That’s why we can use parameters like mode and encoding with Path.open().

On top of that, pathlib offers some convenient methods to read and write files:
* .read_text() opens the path in text mode and returns the contents as a string.
* .read_bytes() opens the path in binary mode and returns the contents as a byte string.
* .write_text() opens the path and writes string data to it.
* .write_bytes() opens the path in binary mode and writes data to it.
  * >Note: write methods overwrite the existing content.

Each of these methods handles the opening and closing of the file. Therefore, you can read the whole file in a single call using .read_text():

In [6]:
print(demo_text.read_text())

This is the first line written in the file. 
 This is the second line written in teh file.this is the redirection from the print statement.
this is the second redirection from the print statement. 
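The write methods can be sketched in the same way. The example below uses a temporary file with a made-up name and shows that a second `.write_text()` call replaces the earlier content, so appending still needs `.open()` with mode `'a'`.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "note.txt"  # hypothetical file name

    note.write_text("first version\n", encoding="utf-8")
    note.write_text("second version\n", encoding="utf-8")  # overwrites the first write

    text = note.read_text(encoding="utf-8")  # contents as str
    raw = note.read_bytes()                  # contents as bytes

    # The convenience methods cannot append, so fall back to open()
    with note.open(mode="a", encoding="utf-8") as f:
        f.write("appended line\n")

    updated = note.read_text(encoding="utf-8")

print(repr(text))
print(repr(updated))
```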



### Renaming the files 

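These methods return a new `Path` object with part of the file name replaced. They do not change anything on disk; to actually rename a file, pass the new path to `.rename()`.

* `with_stem()` - replaces only the file name and keeps the extension (available from Python 3.9)
* `with_suffix()` - replaces only the extension
* `with_name()` - replaces both the name and the extension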

In [12]:
demo_text = demo_text.with_name('demo_f.txt')

In [13]:
demo_text

PosixPath('/home/thinkstation/Projects/08-python-snippets/src/files-io/faker_data/demo_f.txt')

In [14]:
demo_text = demo_text.with_stem('dummy-text')

In [15]:
demo_text

PosixPath('/home/thinkstation/Projects/08-python-snippets/src/files-io/faker_data/dummy-text.txt')

In [17]:
demo_text = demo_text.with_suffix('.md')

In [18]:
demo_text

PosixPath('/home/thinkstation/Projects/08-python-snippets/src/files-io/faker_data/dummy-text.md')
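The cells above only build new path objects. A minimal sketch of renaming a real file, using a temporary directory and hypothetical file names, might look like this:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "demo-file.txt"  # hypothetical file name
    original.write_text("demo", encoding="utf-8")

    # with_suffix() only computes a new path; the file on disk is untouched
    target = original.with_suffix(".md")
    exists_before = target.exists()

    # rename() performs the actual change on disk
    original.rename(target)
    exists_after = target.exists()
    original_still_there = original.exists()

print(exists_before, exists_after, original_still_there)
```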


In [5]:
from pathlib import Path

data_dir = Path('faker_data')  # named data_dir to avoid shadowing the built-in dir()
data_dir.mkdir(exist_ok=True)

In [6]:

from faker import Faker
import random

fake = Faker()

fp = data_dir / 'random_text.txt'
# Generate a text file filled with fake data
with open(fp, 'w', encoding='utf-8') as f:
    # Generate 10 random paragraphs
    for _ in range(10):
        # Write a random paragraph; nb_sentences is drawn from 3 to 10
        paragraph = fake.paragraph(nb_sentences=random.randint(3, 10))
        f.write(paragraph + '\n\n')


# Flat Files in Python

## What are Flat Files?
- Simple text files storing data in plain text format
- No complex structure or relationships between records
- Each line typically represents one record/row

## Common Types
- **CSV** - Comma-separated values
- **TSV** - Tab-separated values  
- **TXT** - Plain text with custom delimiters
- **JSON Lines** - One JSON object per line
- **Fixed-width** - Columns with predetermined widths
****
## Best Practices
- Use appropriate encoding (UTF-8)
- Handle missing/null values consistently
- Validate data types during processing
- Use context managers (`with` statements)
- Consider memory usage for large files
- Implement error handling for malformed data
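Several of these practices can be combined in one short sketch using the standard `csv` module. The file name, column names, and rows are made up for the illustration: an empty salary is treated as missing and a non-numeric salary is recorded as malformed rather than crashing the loop.

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "people.csv"  # hypothetical file
    csv_path.write_text(
        "id,name,salary\n1,Ana,52000\n2,Ben,\n3,Cy,not-a-number\n",
        encoding="utf-8",
    )

    rows, bad_lines = [], []
    # Context manager plus explicit encoding; newline='' is what the csv module expects
    with open(csv_path, "r", encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f)
        # The header is line 1, so the first data row is line 2
        for line_no, record in enumerate(reader, start=2):
            try:
                # An empty field is treated consistently as a missing value
                salary = int(record["salary"]) if record["salary"] else None
                row = {"id": int(record["id"]), "name": record["name"], "salary": salary}
            except ValueError:
                # Malformed data: remember the line number and keep going
                bad_lines.append(line_no)
                continue
            rows.append(row)

print(rows)
print(bad_lines)
```

Because the reader yields one record at a time, the same loop also works for files that are too large to load into memory at once.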

### Creating some data to work with

In [13]:


from faker import Faker
import csv

fake = Faker()

# Generate employee data
with open('employees.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['id', 'name', 'email', 'job', 'salary', 'department'])
    
    for i in range(1000):
        writer.writerow([
            i,
            fake.name(),
            fake.email(),
            fake.job(),
            fake.random_int(min=30000, max=120000),
            fake.random_element(elements=('IT', 'HR', 'Sales', 'Marketing', 'Engineering'))
        ])

# Generate sales data
with open('sales.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['transaction_id', 'date', 'product', 'quantity', 'price', 'customer'])
    
    for i in range(5000):
        writer.writerow([
            fake.uuid4(),
            fake.date_between(start_date='-1y', end_date='today'),
            fake.random_element(elements=('Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard')),
            fake.random_int(min=1, max=10),
            fake.random_int(min=100, max=2000),
            fake.company()
        ])

# Generate customer data
with open('customers.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['customer_id', 'name', 'email', 'phone', 'address', 'join_date'])
    
    for i in range(500):
        writer.writerow([
            fake.uuid4(),
            fake.name(),
            fake.email(),
            fake.phone_number(),
            fake.address(),
            fake.date_between(start_date='-5y', end_date='today')
        ])

# Generate numeric data for numpy analysis
with open('numeric_data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['id', 'value_a', 'value_b', 'value_c'])
    
    for i in range(1000):
        writer.writerow([
            i,
            fake.random_int(min=1, max=1000),
            round(fake.random.uniform(0, 100), 2),
            fake.random_int(min=-500, max=500)
        ])


### Importing Flat Files using NumPy

In [14]:
file_1 = Path('numeric_data.csv')

data_np = np.loadtxt(fname=file_1, delimiter=',', skiprows=1)

print(data_np)

[[   0.    313.      8.94 -492.  ]
 [   1.    794.     59.99   -1.  ]
 [   2.    668.     86.83  322.  ]
 ...
 [ 997.    818.     16.42  117.  ]
 [ 998.    787.     70.35  -14.  ]
 [ 999.    208.     74.61  491.  ]]
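`np.loadtxt` can also load only selected columns, and `np.genfromtxt` can keep the header names. The sketch below assumes NumPy is installed and uses a two-row in-memory sample with the same columns as `numeric_data.csv`, so it does not depend on the generated file.

```python
import io
import numpy as np

# Two sample rows with the same header as numeric_data.csv
csv_text = "id,value_a,value_b,value_c\n0,313,8.94,-492\n1,794,59.99,-1\n"

# Skip the header row and load everything as floats
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

# usecols selects columns by position (here value_a and value_b)
subset = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, usecols=(1, 2))

# names=True reads the header and returns a structured array with named fields
structured = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)

print(data.shape)
print(subset)
print(structured["value_b"])
```

Note that `np.loadtxt` expects every value to be numeric by default, so files with text columns such as `employees.csv` are usually a better fit for `np.genfromtxt` with an explicit `dtype` or for pandas.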
