# Working with files

Commonly used file formats, grouped by the kind of data they store, include:

1. **Text Formats**:
   - **Plain Text (.txt)**: Simple, human-readable text files used for various purposes.
   - **Comma-Separated Values (.csv)**: Tabular data format with values separated by commas, widely used for data exchange.

2. **Document Formats**:
   - **Microsoft Word (.docx)**: Office Open XML word processing format, the default for Microsoft Word documents.
   - **Portable Document Format (.pdf)**: Universal format for preserving document formatting.

3. **Spreadsheet Formats**:
   - **Microsoft Excel (.xlsx)**: Office Open XML spreadsheet format, the default for Microsoft Excel workbooks.
   - **Comma-Separated Values (.csv)**: Also used for tabular data in spreadsheets.

4. **Image Formats**:
   - **JPEG (.jpg)**: Common compressed image format for photographs and graphics.
   - **PNG (.png)**: Lossless image format suitable for images with transparency.
   - **GIF (.gif)**: Format for animations and simple graphics.
   - **BMP (.bmp)**: Bitmap format for storing graphics.

5. **Audio Formats**:
   - **MP3 (.mp3)**: Popular compressed audio format for music and audio files.
   - **WAV (.wav)**: Uncompressed audio format with high quality.
   - **FLAC (.flac)**: Lossless audio format for high-quality audio.

6. **Video Formats**:
   - **MP4 (.mp4)**: Common multimedia container format for video and audio.
   - **AVI (.avi)**: Audio Video Interleave format for video files.
   - **MKV (.mkv)**: Multimedia container format known for supporting multiple audio and subtitle tracks.

7. **Archive Formats**:
   - **ZIP (.zip)**: Archive format for compressing and packaging files.
   - **RAR (.rar)**: Archive format with advanced compression and splitting capabilities.
   - **TAR (.tar)**: Archive format often used on Unix-like systems; it bundles files without compressing them and is commonly paired with gzip (.tar.gz).

8. **Database Formats**:
   - **SQLite (.sqlite, .db)**: Self-contained, serverless database format.
   - **SQL dump (.sql)**: Plain-text SQL script files, for example an export of a MySQL database.
   - **JSON (.json)**: Semi-structured data format often used for configuration and NoSQL databases.

9. **Web Formats**:
   - **HTML (.html)**: Hypertext Markup Language for web pages.
   - **XML (.xml)**: Extensible Markup Language for structured data.
   - **JSON (.json)**: JavaScript Object Notation for data interchange.

10. **Data Exchange Formats**:
    - **JSON (.json)**: Common for data interchange between applications and web services.
    - **XML (.xml)**: Used for structured data interchange, including web services.

These file formats cover a wide range of data types and are commonly used in various industries and applications. The choice of format often depends on the specific use case and requirements, such as data storage, data exchange, multimedia content, or document processing.

# Importing different types of files in Pandas

In [1]:
import pandas as pd

In [2]:
from os import listdir
from os.path import join

listdir('.')

['.git',
 '.ipynb_checkpoints',
 'example.feather',
 'my_dataset.h5',
 'sample_dataset.html',
 'sample_dataset.json',
 'Training.ipynb']

A delimiter is one or more characters that separate text strings. Common delimiters are commas (,), semicolons (;), tabs, pipes (|), and slashes (/ \). Quotes (" ') and braces ({ }) are also delimiters, but they usually mark the start and end of a value rather than separate one value from the next. When a program stores sequential or tabular data, it separates each item of data with a predefined delimiter.
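As a minimal sketch of how the delimiter affects parsing, the snippet below reads the same two records from comma-, semicolon-, and pipe-delimited text using the `sep` parameter of `pd.read_csv`. The sample strings and variable names are made up for illustration.

```python
import io

import pandas as pd

# The same two records, delimited three different ways (made-up sample data).
comma_text = "id,name\n1,Ann\n2,Bob\n"
semicolon_text = "id;name\n1;Ann\n2;Bob\n"
pipe_text = "id|name\n1|Ann\n2|Bob\n"

df_comma = pd.read_csv(io.StringIO(comma_text))  # sep defaults to ","
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")
df_pipe = pd.read_csv(io.StringIO(pipe_text), sep="|")

# Check whether all three inputs parse to the same table.
print(df_comma.equals(df_semicolon) and df_comma.equals(df_pipe))
```

If the wrong delimiter is passed, pandas does not raise an error here; it simply returns a single column containing the whole line, so it is worth checking the column names after loading.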

### 1.1 Going under the Hood of pandas read

In [3]:
import pandas as pd
pd.read_csv?

Signature:
pd.read_csv(
    filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
    *,
    sep: 'str | None | lib.NoDefault' = <no_default>,
    delimiter: 'str | None | lib.NoDefault' = None,
    header: "int | Sequence[int] | None | Literal['infer']" = 'infer',
    names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>,
    index_col: 'IndexLabel | Literal[False] | None' = None,
    usecols=

### 1.1.2 Types of files one can import from Pandas

In [4]:
import re
regex = re.compile(r'read')
list(filter(regex.match, dir(pd)))

['read_clipboard',
 'read_csv',
 'read_excel',
 'read_feather',
 'read_fwf',
 'read_gbq',
 'read_hdf',
 'read_html',
 'read_json',
 'read_orc',
 'read_parquet',
 'read_pickle',
 'read_sas',
 'read_spss',
 'read_sql',
 'read_sql_query',
 'read_sql_table',
 'read_stata',
 'read_table',
 'read_xml']

The clipboard is a temporary storage area in your computer's memory that holds the information you copy or cut. The information can be text, images, or other types of data. You can then paste the contents of the clipboard into another location, such as a document or an email. pandas can read tabular text from it with `read_clipboard`.
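The reader functions listed above can be selected by file extension. The following is only a sketch: the `READERS` mapping and the `load_table` helper are hypothetical names introduced here, not part of pandas, and the mapping covers just a few of the readers.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader function.
READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".xlsx": pd.read_excel,
    ".feather": pd.read_feather,
    ".orc": pd.read_orc,
    ".parquet": pd.read_parquet,
}


def load_table(path, **kwargs):
    """Pick a pandas reader based on the file extension and call it."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"No reader registered for {suffix!r}") from None
    return reader(path, **kwargs)


# Usage: write a tiny CSV to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "demo.csv"
    csv_path.write_text("a,b\n1,2\n3,4\n")
    demo = load_table(csv_path)

print(demo.shape)
```

Some of these readers need optional packages at call time (for example an Excel engine for `read_excel`, or `pyarrow` for Feather, ORC, and Parquet), so a missing dependency only shows up when that reader is actually used.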

## CSV vs. Excel

CSV stands for Comma-Separated Values, while Excel is a spreadsheet application that saves files in its own workbook format (.xlsx). CSV files store tabular data as plain text, with values separated by commas. They can be opened in any text editor (such as Notepad) and are generally quick to read and write. However, they cannot store formatting, formulas, links, charts, or pictures. Excel workbooks, by contrast, are not plain text: a single file can hold multiple worksheets along with formatting, formulas, links, charts, and pictures. They need a spreadsheet application or a library to open, and they are usually slower to process for larger datasets.

In [5]:
data = {
    'Employee ID': [101, 102, 103, 104, 105],
    'Name': ['John Smith', 'Jane Doe', 'Bob Johnson', 'Alice Brown', 'Charlie Wilson'],
    'Department': ['HR', 'Finance', 'Engineering', 'Marketing', 'Sales'],
    'Position': ['Manager', 'Analyst', 'Engineer', 'Coordinator', 'Sales Representative'],
    'Salary': [75000, 60000, 80000, 50000, 65000]
}

# Create a DataFrame
employee_df = pd.DataFrame(data)

# Display the DataFrame
employee_df

   Employee ID            Name   Department              Position  Salary
0          101      John Smith           HR               Manager   75000
1          102        Jane Doe      Finance               Analyst   60000
2          103     Bob Johnson  Engineering              Engineer   80000
3          104     Alice Brown    Marketing           Coordinator   50000
4          105  Charlie Wilson        Sales  Sales Representative   65000


In [6]:
employee_df.to_csv('Test1.csv')

In [7]:
employee_df.to_excel('Test2.xlsx')
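By default `to_csv` writes the DataFrame index as an extra, unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows two common ways to handle this. It uses in-memory text rather than `Test1.csv`, and the two-row DataFrame is made up for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({"Employee ID": [101, 102], "Salary": [75000, 60000]})

# Default behaviour: the index is written as an unnamed first column.
with_index = df.to_csv()
reread = pd.read_csv(io.StringIO(with_index))
print(list(reread.columns))

# Option 1: do not write the index at all.
no_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(list(no_index.columns))

# Option 2: keep the index in the file and tell read_csv which column it is.
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(list(restored.columns))
```

`to_excel` accepts the same `index=False` argument if the index should be left out of the workbook.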

## Feather files

A Feather file is a binary file format for efficiently storing and sharing data frames between different programming languages and data analysis tools. Feather is built on the Apache Arrow columnar format and was designed to be lightweight and to make it fast to move data between tools such as pandas in Python and R, with Arrow implementations also available for Julia and other languages.

Feather files have a few key features:

Language-Agnostic: Feather files are designed to be language-agnostic, which means you can read and write them from multiple programming languages without losing data integrity or performance.

Columnar Storage: Feather stores data in a columnar format, which is often more efficient for data analysis and allows for faster read and write operations, especially when working with large datasets.

Metadata: Feather files include metadata that helps describe the data, such as data types and column names, making it self-descriptive.

Efficient Serialization: Feather is optimized for fast serialization and deserialization, making it suitable for reading and writing data frames or tables quickly.

In [8]:
import pandas as pd

# Create a sample DataFrame
data = {'Column1': [1, 2, 3, 4, 5], 'Column2': ['A', 'B', 'C', 'D', 'E']}
df = pd.DataFrame(data)

# Write the DataFrame to a Feather file
df.to_feather('example.feather')

# Read the Feather file back into a DataFrame
df_from_feather = pd.read_feather('example.feather')

# Display the DataFrame read from Feather
print(df_from_feather)


   Column1 Column2
0        1       A
1        2       B
2        3       C
3        4       D
4        5       E


## Fixed-width files (FWF)

The `read_fwf` function in pandas is used when you have data in a fixed-width format, and you want to read that data into a DataFrame for further analysis and manipulation. Fixed-width format data is a type of plain text data where each column has a specified width, and the data within each column is aligned to those widths.

Here are some common scenarios when you might use `read_fwf`:

1. **Legacy Data Formats**: Fixed-width files were common in legacy systems or data sources where data was stored and exchanged in a format where each column's position was predetermined. If you need to work with such data, you would use `read_fwf` to parse and load it into a DataFrame.

2. **Government Data**: Government agencies and organizations often provide data in fixed-width format files. Examples include census data, economic indicators, or demographic data.

3. **Mainframe Systems**: Data exported from mainframe systems or older databases may be in fixed-width format. Reading this data with `read_fwf` can be essential for data analysis or migration to modern systems.

4. **Financial Data**: Financial data, including stock market data or financial reports, may be distributed in fixed-width format. Analysts often use `read_fwf` to load this data for analysis.

5. **Custom Data Export Formats**: Sometimes, organizations or systems export data in custom fixed-width formats for specific applications. If you encounter such data, `read_fwf` can help you parse and work with it.

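When working with fixed-width data, it's crucial to know the exact column widths and data types in the file, as these details are necessary to correctly parse the data. You'll typically provide either the `colspecs` parameter (the start and end position of each column) or the `widths` parameter (the width of each consecutive column). If you provide neither, pandas tries to infer the column boundaries from the whitespace in the first rows of the file.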

Here's a general use case for `read_fwf`:

```python
import pandas as pd

# Define each column as a half-open (start, end) range of character positions
colspecs = [(0, 5), (5, 10), (10, 15)]  # Adjust to match your data

# Read the FWF file into a DataFrame
df = pd.read_fwf('data.fwf', colspecs=colspecs)

# Perform data analysis and manipulation using the DataFrame
```

In this example, `colspecs` specifies the start and end positions of three columns in the FWF file, and `read_fwf` reads the data accordingly. You should adjust `colspecs` to match the specific formatting of your FWF file.
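To make the role of `colspecs` concrete without needing a `data.fwf` file, here is a small sketch that builds fixed-width text in memory and parses it twice: once with `colspecs` and once with the equivalent `widths`. The records, field widths, and column names are assumptions chosen for the example.

```python
import io

import pandas as pd

rows = [("Alice", "HR", 75000), ("Bob", "Sales", 65000)]

# Build fixed-width text: name padded to 10 characters, department to 8,
# and salary right-aligned in 7 characters.
raw = "".join(f"{name:<10}{dept:<8}{salary:>7}\n" for name, dept, salary in rows)

names = ["name", "dept", "salary"]

# colspecs are half-open (start, end) character offsets ...
by_colspecs = pd.read_fwf(
    io.StringIO(raw),
    colspecs=[(0, 10), (10, 18), (18, 25)],
    header=None,
    names=names,
)

# ... while widths gives the length of each consecutive field.
by_widths = pd.read_fwf(io.StringIO(raw), widths=[10, 8, 7], header=None, names=names)

print(by_colspecs)
print(by_colspecs.equals(by_widths))
```

`widths` is the shorter option when the fields are contiguous; `colspecs` is needed when some character ranges should be skipped.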

In [9]:
filepath = r"C:\Users\crist\Downloads\data.txt"

In [10]:
import pandas as pd
 
df = pd.read_fwf(filepath, colspecs='infer', header=None)
print(df)

           0
0  123456789
1  987654321
2  456789123


## Benefits of FWF Files

Fixed-width format (FWF) files have several benefits in data storage and processing:

Predictable Structure: FWF files have a fixed structure where each field occupies a specific number of characters or positions within each record. This predictability makes it easy to parse and read the data accurately.

Human-Readable: FWF files are often human-readable because of their fixed-column layout. This makes it easier for people to inspect the data without the need for specialized software.

Efficiency: Because every field sits at a known offset, a parser can slice each record by position without scanning for delimiters or handling quoting. The padding usually makes the file larger than the equivalent CSV, however, so any speed advantage depends on the data and on the reader.

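Preservation of Leading Zeros: FWF files are useful for storing data where leading zeros are significant (e.g., ZIP codes, product codes, or identification numbers) because they maintain the exact character positions. When loading such a file with pandas, read those columns as strings (for example with `dtype=str`); otherwise they are parsed as numbers and the leading zeros are dropped.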

Data Integrity: FWF files are less prone to data corruption due to missing or misplaced delimiters that can occur in variable-width files.

Compatibility: FWF files are well-suited for integration with legacy systems or other software that expects data in a fixed-width format.

Data Validation: The fixed-width format makes it easier to enforce data validation rules as data must conform to the specified column widths.

Alignment: When displaying FWF data in a text editor or fixed-width font, the columns align neatly, making it easier for users to visually interpret the data.

However, it's essential to consider the specific use case and requirements when choosing between FWF and other data storage formats like CSV, TSV, or JSON. FWF is most beneficial when data has a consistent and predictable structure. If your data has varying column widths or complex structures, other formats may be more suitable.
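The point about leading zeros can be illustrated with a short sketch. The ZIP code records and column names below are made up; the example compares the default type inference of `read_fwf` with an explicit string dtype.

```python
import io

import pandas as pd

# Made-up records: a 5-character ZIP code followed by a 10-character city field.
raw = "02134Boston    \n00501Holtsville\n"

# Default type inference treats the ZIP column as integers.
default = pd.read_fwf(
    io.StringIO(raw), widths=[5, 10], header=None, names=["zip", "city"]
)

# Forcing the column to str keeps the text exactly as stored in the file.
as_text = pd.read_fwf(
    io.StringIO(raw),
    widths=[5, 10],
    header=None,
    names=["zip", "city"],
    dtype={"zip": str},
)

print(default["zip"].tolist())  # integers: the leading zeros are gone
print(as_text["zip"].tolist())  # strings: the leading zeros are kept
```

The file itself always preserves the zeros; whether they survive loading depends on the dtype chosen by the reader.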

## Google BigQuery

In [11]:
import pandas as pd

# Define your BigQuery SQL query as a string
query = """
SELECT
  column1,
  column2
FROM
  your_project_id.your_dataset.your_table
WHERE
  some_condition
"""

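# Set up the BigQuery authentication (you need to authenticate to access your BigQuery data).
# pd.read_gbq relies on the separate pandas-gbq package, which must be installed.
# You can use application default credentials or a service-account JSON key file.
# With application default credentials, no extra argument is needed:
# pd.read_gbq(query, project_id=your_project_id, dialect='standard')

# To authenticate with a service-account JSON key file, build a credentials object
# (recent pandas versions no longer accept the older private_key argument):
# from google.oauth2 import service_account
# credentials = service_account.Credentials.from_service_account_file('path/to/your/credentials.json')
# pd.read_gbq(query, project_id=your_project_id, credentials=credentials, dialect='standard')

# Use the read_gbq function to execute the query and retrieve the data into a DataFrame.
# your_project_id is a placeholder: assign your real project ID string to it first,
# otherwise this line raises the NameError shown in the output below.
df = pd.read_gbq(query, project_id=your_project_id, dialect='standard')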

# Now, you can work with the data in the DataFrame 'df'
print(df.head())


NameError: name 'your_project_id' is not defined

## Benefits of Google BigQuery

Google BigQuery is a fully-managed, serverless, and highly scalable data warehouse and analytics platform offered by Google Cloud. It provides several benefits for organizations and data professionals:

Scalability: BigQuery is designed to handle massive datasets with ease. It can automatically scale to accommodate growing data volumes, ensuring that you can run complex queries on large datasets without worrying about infrastructure limitations.

Serverless: You don't need to provision or manage servers when using BigQuery. It's a serverless platform, which means Google takes care of infrastructure management, including hardware and software updates.

Speed: BigQuery uses a distributed architecture and columnar storage to execute queries quickly, even on petabyte-scale datasets.

Cost-Effective: With BigQuery's pay-as-you-go pricing model, you only pay for the data you query and store. It eliminates the need for upfront capital expenditures and allows you to control costs effectively.

SQL Support: BigQuery supports standard SQL, making it easy for data analysts and SQL developers to write and run queries without the need to learn a new query language.

Integration: It integrates with other Google Cloud services, such as Google Cloud Storage, Looker Studio (formerly Google Data Studio), and Google Sheets, allowing you to build end-to-end data pipelines and analytics solutions.

Data Warehousing and Data Lake Capabilities: BigQuery can function as both a data warehouse and a data lake. You can store structured and semi-structured data in BigQuery tables or query data directly from external storage like Google Cloud Storage.

Security: Google Cloud provides robust security features, including encryption at rest and in transit, identity and access management (IAM), and audit logs. It complies with industry standards and certifications.

Real-Time Data Analysis: BigQuery supports real-time data streaming, enabling you to analyze data as it arrives, making it suitable for real-time analytics use cases.

Machine Learning Integration: You can leverage Google's machine learning capabilities and services like BigQuery ML to build and deploy machine learning models directly within BigQuery.

Geo-spatial and Advanced Analytics: BigQuery offers support for geospatial data and a wide range of advanced analytics functions, including window functions, machine learning, and statistical analysis.

Data Sharing and Collaboration: You can easily share datasets and queries with others in your organization or externally, facilitating collaboration and data sharing.

Automatic Backups and High Availability: BigQuery automatically takes care of data backups and provides high availability, ensuring that your data is safe and accessible.

Cost Optimization Tools: Google Cloud provides cost optimization tools and features to help you analyze and control your BigQuery costs, making it easier to manage your budget.

Community and Support: BigQuery has a large and active user community, and Google Cloud offers various levels of support, including documentation, forums, and premium support options.

Overall, Google BigQuery is a powerful and versatile platform that can help organizations make data-driven decisions, gain insights from their data, and leverage the benefits of cloud computing without the complexity of managing infrastructure.

## HDF File


An HDF file, which stands for "Hierarchical Data Format" file, is a file format designed for storing and organizing large amounts of data. HDF files are particularly popular in the scientific and engineering communities for applications that involve complex and multidimensional datasets. Here are some key features of HDF files:

Hierarchical Structure: HDF files have a hierarchical structure, similar to a file system, where data can be organized into groups and datasets. This hierarchical organization makes it easy to store and manage structured data.

Support for Various Data Types: HDF supports a wide range of data types, including numerical data (integers, floats, etc.), text, images, and more. This versatility makes it suitable for a broad spectrum of applications.

Compression: HDF files can be compressed, which helps reduce file size while preserving data integrity. This is particularly useful when dealing with large datasets.

Portability: HDF files are designed to be platform-independent. You can create and access HDF files on various operating systems, including Windows, macOS, and Linux.

Data Chunking: HDF allows data to be divided into smaller, regularly sized chunks. This can improve data access and manipulation, especially for large datasets.

Metadata: You can attach metadata to HDF datasets, providing additional information about the data's content, source, and any relevant attributes.

Libraries and APIs: There are libraries and APIs available for various programming languages (e.g., HDF5 for C/C++, h5py for Python) that allow developers to work with HDF files, making it easier to create, read, write, and manipulate data stored in HDF format.

HDF files are commonly used in fields such as astronomy, geoscience, climate modeling, bioinformatics, and more, where researchers need to store and analyze large and complex datasets. The two major versions of HDF are HDF4 and HDF5, with HDF5 being the more modern and widely adopted version due to its improved capabilities and performance.

## Create dummy HDF data

In [12]:
import h5py
import numpy as np

# Define your dataset (a simple 2D array in this example)
data = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Create an HDF5 file in write mode
with h5py.File('my_dataset.h5', 'w') as hdf_file:
    # Create a dataset and write data to it
    dataset = hdf_file.create_dataset('my_data', data=data)

    # Optionally, add attributes and metadata to the dataset
    dataset.attrs['description'] = 'My sample dataset'
    dataset.attrs['author'] = 'Your Name'

# The HDF5 file is automatically closed when the 'with' block exits

In [13]:
dataset

<Closed HDF5 dataset>

In [14]:
type(dataset)

h5py._hl.dataset.Dataset

## Visualize dataset

In [15]:
import h5py

# Open the HDF5 file in read mode
with h5py.File('my_dataset.h5', 'r') as hdf_file:
    # Access the dataset
    dataset = hdf_file['my_data']
    
    # Convert the dataset to a NumPy array
    data = dataset[()]
    
    # Print the contents of the array
    print(data)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


## Convert HDF File to Dataframe format

In [16]:
import h5py

# Open the HDF5 file in read mode
with h5py.File('my_dataset.h5', 'r') as hdf_file:
    # Access the dataset
    dataset = hdf_file['my_data']
    
    # Convert the dataset to a pandas DataFrame
    df = pd.DataFrame(dataset[()])

# Now you have the data in a pandas DataFrame
print(df)

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
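Several of the HDF features described earlier (groups, chunking, compression, and attributes) can be sketched with `h5py` as follows. The file name, group name, chunk shape, and attribute value are assumptions chosen for the example, and the file is written to a temporary directory so that `my_dataset.h5` is left untouched.

```python
import os
import tempfile

import h5py
import numpy as np

readings = np.arange(10_000, dtype="float64").reshape(100, 100)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features_demo.h5")  # hypothetical file name

    with h5py.File(path, "w") as f:
        # Groups behave like folders inside the file.
        group = f.create_group("experiment_1")
        dset = group.create_dataset(
            "readings",
            data=readings,
            chunks=(10, 100),    # stored in blocks of 10 rows
            compression="gzip",  # each chunk is gzip-compressed
        )
        dset.attrs["units"] = "arbitrary"  # metadata attached to the dataset

    with h5py.File(path, "r") as f:
        dset = f["experiment_1/readings"]  # path-style access through the group
        first_rows = dset[:10]             # slicing loads only the requested rows
        units = dset.attrs["units"]
        chunk_shape = dset.chunks
        compression = dset.compression

print(first_rows.shape, units, chunk_shape, compression)
```

Everything needed after the file is closed is copied into ordinary variables inside the `with` block, which avoids the `<Closed HDF5 dataset>` situation shown earlier.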


## HTML Files 

In [17]:
# Define some example data
data = [
    {"name": "John", "age": 30, "city": "New York"},
    {"name": "Alice", "age": 25, "city": "Los Angeles"},
    {"name": "Bob", "age": 35, "city": "Chicago"},
]

# Create an HTML string
html_string = "<html>\n"
html_string += "<head><title>Sample Dataset</title></head>\n"
html_string += "<body>\n"
html_string += "<h1>Sample Dataset</h1>\n"
html_string += "<table border='1'>\n"
html_string += "<tr><th>Name</th><th>Age</th><th>City</th></tr>\n"

for record in data:
    html_string += "<tr>"
    html_string += f"<td>{record['name']}</td>"
    html_string += f"<td>{record['age']}</td>"
    html_string += f"<td>{record['city']}</td>"
    html_string += "</tr>\n"

html_string += "</table>\n"
html_string += "</body>\n"
html_string += "</html>"

# Save the HTML string to a file
with open("sample_dataset.html", "w") as html_file:
    html_file.write(html_string)

print("Sample dataset HTML file created: sample_dataset.html")


Sample dataset HTML file created: sample_dataset.html


## When to parse HTML files in Python

You would parse HTML files in Python when you need to extract and manipulate data or information from HTML documents. Parsing HTML is common in various scenarios, including:

Web Scraping: When you want to extract data from websites for purposes such as data analysis, research, or building web applications. Python libraries like Beautiful Soup and Scrapy are often used for web scraping tasks.

Data Extraction: When you need to extract structured data from HTML documents, such as tables, lists, or specific elements, for further processing or analysis.

Web Testing and Automation: When you automate web interactions or perform web testing, you may need to parse HTML to locate and interact with specific elements on a web page.

Data Cleaning: When you have HTML-encoded content in your dataset and you want to convert it to plain text or extract specific information.

Generating Dynamic Content: When you dynamically generate HTML documents based on data from databases or other sources.

Web Development: When building web applications using frameworks like Flask or Django, you often work with HTML templates to render dynamic content.

Python offers several libraries and tools for parsing HTML, including:

Beautiful Soup: A popular library for parsing HTML and XML documents, making it easy to navigate and search the document's structure.

lxml: A high-performance library for parsing XML and HTML documents. It's often used in combination with XPath for advanced data extraction.

html.parser: The built-in HTML parser in Python's standard library, which can be used for basic HTML parsing tasks.

Selenium: A tool often used for web testing and automation, allowing you to programmatically interact with web pages and extract data.

Scrapy: A powerful web crawling and scraping framework that provides a comprehensive set of tools for extracting and processing data from websites.

The specific use case for parsing HTML in Python depends on your project requirements. Whether it's extracting data from a website, cleaning and transforming HTML-encoded content, or interacting with web pages programmatically, Python offers a range of tools and libraries to help you achieve your goals.
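As a minimal sketch of the data extraction case, the snippet below uses the standard library's `html.parser` to pull the cells out of a table shaped like the one in `sample_dataset.html`. The `TableParser` class and the inline `sample_html` string are made up for illustration, and the parser assumes simple, well-formed, non-nested tables.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])       # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")   # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data  # accumulate text inside the current cell


sample_html = (
    "<table border='1'>"
    "<tr><th>Name</th><th>Age</th><th>City</th></tr>"
    "<tr><td>John</td><td>30</td><td>New York</td></tr>"
    "<tr><td>Alice</td><td>25</td><td>Los Angeles</td></tr>"
    "</table>"
)

parser = TableParser()
parser.feed(sample_html)
parser.close()

header, *records = parser.rows
print(header)
print(records)
```

All values come back as strings, so numeric columns need converting afterwards. For tables, `pd.read_html` is usually the shorter route, but it requires an additional parser library such as lxml or Beautiful Soup with html5lib.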

## JSON 

JSON (JavaScript Object Notation) and DataFrames both represent data, but they are different kinds of things: JSON is a text-based serialization format, while a DataFrame is an in-memory tabular data structure. Here are the key differences between them:

Data Representation:

JSON: JSON is a lightweight, text-based data interchange format. It's designed to represent structured data as a collection of key-value pairs, where keys are strings, and values can be strings, numbers, objects, arrays, booleans, or null. JSON is often used for data exchange between systems and for configuration files.

DataFrame: A DataFrame is a tabular data structure commonly used in data analysis and manipulation. It's a two-dimensional, labeled data structure with rows and columns, similar to a spreadsheet or SQL table. Each column can have a different data type, making it suitable for heterogeneous data.

Use Cases:

JSON: JSON is typically used for data exchange between applications, configuration files, or representing data in a semi-structured format. It's commonly used in web APIs, configuration files, and data serialization.

DataFrame: DataFrames are used for data analysis, data manipulation, and exploration tasks. They are a fundamental data structure in data science libraries like pandas in Python and are used for tasks such as filtering, grouping, aggregation, and visualization.

Storage Format:

JSON: JSON is stored as plain text and is human-readable. It's a flexible format for representing structured data, but it may not be as space-efficient as binary formats for large datasets.

DataFrame: DataFrames are typically stored in memory, and they can be serialized to various formats like CSV, Excel, HDF5, or Parquet for storage. These formats may offer compression and better storage efficiency compared to JSON.

Schema and Type Information:

JSON: JSON does not have a predefined schema or type information. It's up to the application to interpret and validate the data.

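DataFrame: DataFrames have a schema that defines the column names and a data type (dtype) for each column. This keeps values within a column consistent, making it easier to work with structured data, although a column with the generic `object` dtype can still hold mixed Python values.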

Access and Manipulation:

JSON: Accessing and manipulating data in JSON typically involves parsing the text and working with the resulting data structure in your programming language. Libraries like json in Python are commonly used for this purpose.

DataFrame: DataFrames provide a high-level API for data manipulation and analysis. Libraries like pandas offer powerful functions for filtering, transforming, aggregating, and visualizing data in a tabular format.

In summary, JSON files are more suited for data interchange and configuration, while DataFrames are designed for data analysis and manipulation. The choice between the two depends on your specific use case and the type of data you are working with.

## Create a dummy JSON file

In [18]:
import json

# Define a dictionary with sample data
data = {
    "employees": [
        {
            "name": "John",
            "age": 30,
            "department": "HR"
        },
        {
            "name": "Alice",
            "age": 28,
            "department": "Engineering"
        },
        {
            "name": "Bob",
            "age": 35,
            "department": "Marketing"
        }
    ]
}

# Serialize the data to a JSON-formatted string
json_string = json.dumps(data, indent=4)

# Print the JSON string
print(json_string)

# Alternatively, save the data to a JSON file
with open("sample_dataset.json", "w") as json_file:
    json.dump(data, json_file, indent=4)

print("Sample JSON dataset created: sample_dataset.json")


{
    "employees": [
        {
            "name": "John",
            "age": 30,
            "department": "HR"
        },
        {
            "name": "Alice",
            "age": 28,
            "department": "Engineering"
        },
        {
            "name": "Bob",
            "age": 35,
            "department": "Marketing"
        }
    ]
}
Sample JSON dataset created: sample_dataset.json


## Convert JSON to a dataframe

In [19]:
import pandas as pd

# Define the JSON data (you can also read it from a file)
data = {
    "employees": [
        {
            "name": "John",
            "age": 30,
            "department": "HR"
        },
        {
            "name": "Alice",
            "age": 28,
            "department": "Engineering"
        },
        {
            "name": "Bob",
            "age": 35,
            "department": "Marketing"
        }
    ]
}

# Convert the JSON data to a DataFrame
df = pd.DataFrame(data["employees"])

# Print the DataFrame
print(df)


    name  age   department
0   John   30           HR
1  Alice   28  Engineering
2    Bob   35    Marketing
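When the records contain nested objects, passing the list straight to `pd.DataFrame` leaves each nested object as a dictionary in a single column. One option is `pd.json_normalize`, which flattens nested keys into dotted column names. The sketch below assumes a made-up JSON string with an extra `address` object that is not part of `sample_dataset.json`.

```python
import json

import pandas as pd

# Made-up JSON text: like the sample above, but each record has a nested object.
text = """
{
  "employees": [
    {"name": "John", "age": 30, "address": {"city": "New York", "zip": "10001"}},
    {"name": "Alice", "age": 28, "address": {"city": "Boston", "zip": "02134"}}
  ]
}
"""

parsed = json.loads(text)  # plain Python dicts and lists

# Flatten each employee record; nested keys become "address.city", "address.zip".
df = pd.json_normalize(parsed["employees"])

print(list(df.columns))
print(df.dtypes)
```

The `dtypes` output shows the schema pandas inferred: numbers become numeric columns, while JSON strings, including the ZIP codes with their leading zeros, stay as text.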


## ORC Files

An ORC (Optimized Row Columnar) file is a columnar storage file format used for storing and managing large volumes of structured data efficiently. It was developed as an open-source project within the Hadoop ecosystem, originally for Apache Hive, and is used in big data processing frameworks like Apache Hive, Apache Spark, and Apache Impala.

Key characteristics and features of ORC files include:

Columnar Storage: Unlike traditional row-based storage formats, ORC files store data column by column. This allows for better compression and encoding of data because similar data types within a column can be stored together.

Compression: ORC files reduce the storage space required in two stages. Column values are first stored with lightweight encodings such as run-length and dictionary encoding, and the result can then be compressed with a general-purpose codec such as Zlib, Snappy, or LZO.

Predicate Pushdown: ORC files support predicate pushdown, which means that query engines can apply filtering and predicate operations directly on the data stored in the file. This reduces the amount of data that needs to be read from storage during query execution.

Lightweight Indexing: ORC files include lightweight indexes that help with metadata operations and skip scanning of irrelevant data blocks when processing queries.

Schema Evolution: ORC supports schema evolution, allowing you to add, remove, or modify columns in your datasets while maintaining compatibility with existing data.

Performance: ORC is designed for high performance and is optimized for both read and write operations. It is particularly well-suited for complex queries on large datasets.

Compatibility: ORC is commonly used in the Hadoop ecosystem, and many data processing tools and frameworks have built-in support for reading and writing ORC files.

ORC files are used primarily in data warehousing, data lakes, and analytics applications, where efficiency in storage and query performance is crucial. They are often a preferred format for storing large datasets in distributed computing environments, such as Hadoop clusters, due to their performance benefits and compatibility with various data processing tools.

## Create a dummy ORC file

In [6]:
import pyorc
import random
import string

# Define the ORC schema for your dummy data
schema = "struct<id:int,name:string,age:int>"

# Generate some dummy data
def generate_dummy_data(num_records):
    data = []
    for _ in range(num_records):
        data.append((random.randint(1, 100), ''.join(random.choice(string.ascii_letters) for _ in range(10)), random.randint(18, 65)))
    return data

# Specify the number of records you want
num_records = 1000

# Generate dummy data
dummy_data = generate_dummy_data(num_records)

# Open a file for writing the ORC data
with open('dummy_data.orc', 'wb') as orc_file:
    # Create a pyorc.Writer and write the data to the file
    with pyorc.Writer(orc_file, schema) as writer:
        writer.writerows(dummy_data)


In [7]:
dummy_data

[(68, 'uvGNBPZtIq', 46),
 (28, 'HgmBbCwqeZ', 21),
 (73, 'pulnNMsGic', 52),
 (25, 'PNyWLBnhkJ', 62),
 (37, 'jjOcGeJqQi', 35),
 (38, 'IJoklXzAuy', 65),
 (55, 'oEGMtKZYYu', 55),
 (63, 'ZKbneHlUlQ', 24),
 (93, 'FAktlamclY', 50),
 (11, 'nOPhWNUrNw', 30),
 (84, 'CbFCIKbNzA', 33),
 (33, 'sYXwMHxGaI', 33),
 (83, 'yppDcDidBO', 46),
 (91, 'yjRlQsoJDO', 20),
 (85, 'JnASZzNTwL', 29),
 (42, 'FSHhyoqmCz', 47),
 (13, 'VJYDjduoDq', 43),
 (51, 'OeEIuogFxn', 63),
 (89, 'MHUziBXXmE', 27),
 (28, 'LVHGtMsBMZ', 39),
 (34, 'xnpoGMZQjo', 18),
 (99, 'SKeHhfJOsO', 52),
 (6, 'zVqxUZClva', 53),
 (56, 'WpNUzmuZNh', 39),
 (29, 'cxFuOVlufg', 51),
 (65, 'AiFwSVxczn', 60),
 (41, 'lYlZFgBDZn', 33),
 (24, 'TXzRNSVebz', 60),
 (59, 'iHcXdeTGfv', 27),
 (64, 'UhHKtasAHQ', 45),
 (88, 'BrwvtnvPvr', 52),
 (19, 'MhAGnMDUIC', 26),
 (13, 'HVagCpYPTP', 33),
 (49, 'eXdKLYVAtH', 41),
 (100, 'KtAjwUjjOr', 38),
 (34, 'fQNhxPsnnF', 63),
 (73, 'pjfmlNHzEF', 65),
 (84, 'aeocqfPdKH', 64),
 (92, 'LatBigqGQQ', 61),
 (93, 'tuTyBxmsFj', 64),


In [8]:
type(dummy_data)

list

In [14]:
import pandas as pd
orc_test = pd.read_orc('dummy_data.orc')

In [15]:
type(orc_test)

pandas.core.frame.DataFrame

In [16]:
orc_test

Unnamed: 0,id,name,age
0,68,uvGNBPZtIq,46
1,28,HgmBbCwqeZ,21
2,73,pulnNMsGic,52
3,25,PNyWLBnhkJ,62
4,37,jjOcGeJqQi,35
...,...,...,...
995,16,HVZYtUmoVq,41
996,68,mdlXFGOLgt,59
997,23,AVyPcqWjnO,39
998,75,SHrxsNxTLi,53


## Spark and Hive

**Apache Spark**

- **General-purpose data processing**: Spark is a general-purpose, distributed data processing framework that can handle a wide range of data processing tasks, including data analysis, machine learning, and more.
- **In-memory processing**: Spark performs in-memory processing, which can make it faster for iterative algorithms and interactive data analysis.
- **Programming in Scala, Java, Python, or R**: You can write Spark applications using various programming languages.

If you have complex data processing and analysis needs, especially when dealing with large-scale data, Spark can be a powerful choice.

**Apache Hive**

- **SQL-like query language**: Hive provides a SQL-like query language called HiveQL, which makes it easy for analysts and SQL users to work with large datasets.
- **Data warehousing**: Hive is often used for data warehousing tasks, where you need to structure and query data stored in a data lake or data warehouse.
- **Integration with the Hadoop ecosystem**: Hive integrates well with the Hadoop ecosystem and can work with various file formats, including ORC, Parquet, and others.

If you primarily need to query and analyze structured data using SQL-like queries and work with data stored in ORC or other columnar formats, Hive can be a good choice.

In the context of ORC files, both Spark and Hive can be used to analyze them, but the choice depends on the complexity of your analysis, your familiarity with the tools, and your overall data processing architecture. Spark might be more suitable for complex data transformations and machine learning tasks, while Hive is often used for straightforward SQL-like queries and data warehousing tasks.

Ultimately, the choice between Spark and Hive (or any other tool) should align with your specific use case, data volume, and the skills of your team. You can also integrate Spark and Hive if needed, allowing you to leverage the strengths of both tools in your data analysis workflows.





## Other tools 

Several tools and libraries can be used to analyze ORC (Optimized Row Columnar) files, depending on your specific needs and preferences. Here are some commonly used tools and libraries for working with ORC files:

- **Apache Hive**: As mentioned earlier, Apache Hive provides a SQL-like query language (HiveQL) that allows you to query and analyze ORC files along with other data formats. It is particularly well-suited for structured data analysis.
- **Apache Pig**: Apache Pig is a platform for analyzing large data sets. You can use the Pig Latin scripting language to load and process ORC files.
- **Apache Impala**: Apache Impala is a massively parallel processing SQL query engine for data stored in Hadoop. It supports ORC files and enables interactive SQL queries.
- **Apache Drill**: Apache Drill is a distributed SQL query engine that supports a variety of data sources, including ORC files. It allows you to run ANSI SQL queries on ORC data without the need for ETL.
- **Apache Parquet**: While Parquet is a different columnar file format, it is often used alongside ORC. Many tools that support Parquet also support ORC. Parquet is popular for its performance and columnar storage characteristics.
- **Python libraries**: You can use Python libraries like PyORC and Pandas with the pyarrow library to read and analyze ORC files in Python. These libraries allow you to work with ORC data in a Python environment.
- **Apache Arrow**: Apache Arrow is a cross-language development platform for in-memory data that includes support for reading and writing ORC files. It can be used with various programming languages.
- **Big data platforms**: If you're working within a big data platform like Databricks, Amazon EMR, Google Dataprep, or other cloud-based services, these platforms often provide tools and APIs for working with ORC files as part of their data processing capabilities.
- **Custom applications**: You can also build custom applications using programming languages like Java, Scala, or C++ by leveraging ORC file reader libraries and integrating them into your data processing pipelines.

The choice of tool or library depends on your specific use case, programming language preference, data volume, and existing data infrastructure. Many of these tools support multiple file formats, including ORC, Parquet, and more, so you can choose the one that best fits your data and analysis needs.







## Start PySpark session

In [20]:
# ! pip install pyspark

In [21]:
# ! pip install findspark
import findspark
findspark.init()

In [2]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("Test").getOrCreate()

# Check the SparkContext
sc = spark.sparkContext

# Verify that Spark is running
print(sc.version)


3.5.0


## Parquet



### Parquet:

- Columnar storage format optimized for analytics.
- Binary format with schema preservation and compression.
- Ideal for big data and analytical processing.

### CSV (Comma-Separated Values):

- Text-based format with human-readability.
- Lacks schema information and supports limited data types.
- Common for simple data exchange between applications.

### Excel:

- Proprietary spreadsheet application, not a file format.
- Supports rich formatting, but not suitable for big data.
- Used for interactive data analysis with formulas, charts, etc.
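
The point that CSV lacks schema information can be seen with the standard library alone. The sketch below writes a few hypothetical rows (the column names and values are made up for illustration) to an in-memory CSV and reads them back; every value returns as a string, because the CSV text itself carries no types.

```python
import csv
import io

# Hypothetical sample rows with an int, a str, and a float column
rows = [
    {"ID": 1, "Name": "Alice", "Score": 85.5},
    {"ID": 2, "Name": "Bob", "Score": 92.0},
]

# Write the rows to an in-memory CSV "file"
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["ID", "Name", "Score"])
writer.writeheader()
writer.writerows(rows)

# Read them back: the CSV text carries no type information
buffer.seek(0)
loaded = list(csv.DictReader(buffer))

print(loaded[0])
print({key: type(value).__name__ for key, value in loaded[0].items()})
```

A format such as Parquet stores the column types in the file, so a reader does not have to guess or re-declare them.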

## Create dummy Parquet data

In [17]:
import pandas as pd
import numpy as np

# Create a DataFrame with dummy data
data = {
    'ID': np.arange(1, 101),
    'Name': [f'User {i}' for i in range(1, 101)],
    'Age': np.random.randint(18, 65, size=100)
}

df = pd.DataFrame(data)

# Save the DataFrame to a Parquet file
df.to_parquet('dummy_data.parquet', index=False)

In [18]:
df

Unnamed: 0,ID,Name,Age
0,1,User 1,18
1,2,User 2,23
2,3,User 3,36
3,4,User 4,49
4,5,User 5,56
...,...,...,...
95,96,User 96,26
96,97,User 97,25
97,98,User 98,46
98,99,User 99,38


## Pickle


A pickle file is a serialized binary file format used in Python to store and exchange Python objects. It's named after the concept of "pickling," which is a common term for serializing objects in Python. Pickling allows you to convert complex Python objects, such as lists, dictionaries, classes, and more, into a binary representation that can be easily saved to a file or transmitted over a network.

Here are some key characteristics of pickle files:

Serialization: Pickle files are used for the serialization of Python objects. Serialization is the process of converting an object's state into a binary format that can be stored or transmitted.

Binary Format: Pickle files are binary files, meaning they are not human-readable. They contain a binary representation of the object's data and structure.

Python-Specific: Pickle files are specific to Python and are generally not compatible with other programming languages. For data that must be read elsewhere, Python provides modules to serialize and deserialize objects in more widely compatible formats such as JSON.

Use Cases:

- Data Persistence: You can use pickle files to save Python objects to disk, preserving their state and structure. This is commonly used for saving machine learning models, caching data, or storing program configurations.
- Interprocess Communication: Pickle files can be used to pass Python objects between different Python processes or scripts.
- Network Communication: Pickle files can be used to send Python objects over a network connection.

Security Considerations: Loading pickle files from untrusted or unauthenticated sources poses a security risk, because malicious code can be executed when objects are unpickled. Only unpickle data that comes from a source you trust.
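
One way to reduce the unpickling risk described above is to restrict which globals an unpickler may load, a technique described in the Python `pickle` documentation. The following is a minimal sketch: the allow-list `SAFE_BUILTINS` and the helper `restricted_loads` are names chosen here for illustration, and a real allow-list would depend on what your data legitimately contains.

```python
import builtins
import io
import pickle

# Hypothetical allow-list: only these harmless builtins may be referenced
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream asks for a global object
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(payload):
    """Like pickle.loads(), but refuses globals outside SAFE_BUILTINS."""
    return RestrictedUnpickler(io.BytesIO(payload)).load()


# Plain containers of basic types never go through find_class
safe_payload = pickle.dumps({"name": "John", "age": 30})
print(restricted_loads(safe_payload))

# A payload that references os.system is rejected instead of executed
malicious_payload = b"cos\nsystem\n(S'echo hello world'\ntR."
try:
    restricted_loads(malicious_payload)
except pickle.UnpicklingError as error:
    print("Rejected:", error)
```

This narrows the attack surface but does not make untrusted pickles safe in general; for data from unknown sources a format such as JSON is the more cautious choice.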

To work with pickle files in Python, you can use the pickle module, which provides functions like pickle.dump() to save objects to a file and pickle.load() to load objects from a file. Here's a simple example:








In [20]:

import pickle

# Create a Python object
data = {'name': 'John', 'age': 30}

# Serialize and save it to a pickle file
with open('data.pkl', 'wb') as file:
    pickle.dump(data, file)

# Load the object from the pickle file
with open('data.pkl', 'rb') as file:
    loaded_data = pickle.load(file)

print(loaded_data)  # Output: {'name': 'John', 'age': 30}
# In this example, we create a dictionary, serialize it to a pickle file, and then load it back into Python.

{'name': 'John', 'age': 30}


## Difference between Pickle and Parquet


The key differences between Pickle and Parquet are:

Pickle is a Python-specific binary format for serializing Python objects, preserving their state and structure. It's used for saving and loading Python objects within the Python ecosystem.

Parquet is a cross-language binary file format optimized for efficient storage and analysis of structured tabular data, often used in big data processing frameworks. It is not Python-specific and can be read and written from many programming languages.

In [21]:
import re
regex = re.compile(r'read')
list(filter(regex.match, dir(pd)))

['read_clipboard',
 'read_csv',
 'read_excel',
 'read_feather',
 'read_fwf',
 'read_gbq',
 'read_hdf',
 'read_html',
 'read_json',
 'read_orc',
 'read_parquet',
 'read_pickle',
 'read_sas',
 'read_spss',
 'read_sql',
 'read_sql_query',
 'read_sql_table',
 'read_stata',
 'read_table',
 'read_xml']

## SAS

SAS files refer to datasets and other data-related files used by the SAS (Statistical Analysis System) software, a popular statistical and data analysis tool used in various fields such as statistics, data analytics, and business intelligence. SAS files encompass various types of data and metadata files used within the SAS ecosystem. Here are some common types of SAS files:

1. **SAS Data Sets (.sas7bdat)**: These are binary files used to store structured data tables. SAS data sets contain data variables, observations, and metadata describing the structure of the data, including variable names, data types, and labels. These files are commonly used for data analysis and manipulation within SAS.

2. **SAS Program Files (.sas)**: SAS program files are plain text files that contain SAS code written in the SAS programming language. SAS programs are used to perform data analysis, generate reports, and automate tasks within SAS.

3. **SAS Catalogs (.sas7bcat)**: SAS catalogs store SAS-specific objects as catalog entries, such as user-defined formats, compiled macros, and graphics output. They are often needed alongside a data set, for example to apply the value formats that the data set refers to.

4. **SAS Output Files (.log, .lst, .out)**: SAS generates log, listing (lst), and output (out) files when executing SAS programs. The log file contains information about the execution of the SAS program, including error messages and warnings. The listing and output files often contain the results of data analysis and reports generated by SAS programs.

5. **SAS Macro Files (.sas)**: SAS macros are reusable code segments or programs that can be invoked within SAS programs. SAS macro files contain macro definitions and are used to automate repetitive tasks and enhance code modularity.

6. **SAS Transport Files (.xpt)**: SAS transport files are used for data exchange between different SAS environments or versions. They can store data sets, formats, and other metadata in a standardized format that is portable across different platforms.

7. **SAS Access Database Files**: SAS can access various external data sources, including relational databases, Excel files, and more. SAS access files are used to connect to and access data from these external sources.

8. **SAS Metadata Files (.sas7bmdp)**: These files store metadata information about SAS metadata repositories, which help manage and organize metadata for various SAS objects and resources.

SAS files are an integral part of working with SAS software, allowing users to store, analyze, and manage data and code efficiently. SAS is commonly used in industries such as healthcare, finance, market research, and academia for its data analysis and statistical capabilities.

## Create dummy SAS data

https://pypi.org/project/pyreadstat/

In [26]:
! pip install pyreadstat

Collecting pyreadstat
  Obtaining dependency information for pyreadstat from https://files.pythonhosted.org/packages/fd/f4/d3098b49c073da3e0ba2d6fb7c7fc7355729386331532422ed1d2ba57338/pyreadstat-1.2.3-cp311-cp311-win_amd64.whl.metadata
  Downloading pyreadstat-1.2.3-cp311-cp311-win_amd64.whl.metadata (1.1 kB)
Downloading pyreadstat-1.2.3-cp311-cp311-win_amd64.whl (2.4 MB)

In [29]:
# Import the required libraries
import pandas as pd

# Create a dummy dataset using pandas
data = {
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve']
}
df = pd.DataFrame(data)

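# pandas cannot write .sas7bdat files, so build the text of a SAS program
# (a DATA step) that recreates the table when it is run inside SAS
sas_program = """LIBNAME mydata 'C:/path/to/sas7bdat/files';
DATA mydata.dummy;
    INPUT ID Name $;
    DATALINES;
"""

# Add one space-separated data row per DataFrame row (SAS list input)
for _, row in df.iterrows():
    sas_program += f"{row['ID']} {row['Name']}\n"

# A line holding a single semicolon ends the data block; RUN ends the step
sas_program += """;
RUN;
"""

# Save the SAS program (not a binary dataset) to a .sas file
with open('dummy.sas', 'w') as sas_file:
    sas_file.write(sas_program)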


## SPSS files

In [30]:
import pandas as pd
import pyreadstat

# Create a dummy dataset using pandas
data = {
    'ID': [1, 2, 3, 4, 5],
    'Score': [85, 92, 78, 88, 95]
}
df = pd.DataFrame(data)

# Define the SPSS file path
spss_file_path = 'dummy.sav'

# Save the DataFrame to the SPSS file
pyreadstat.write_sav(df, spss_file_path)

# Optional: Read and verify the contents of the SPSS file
df_from_spss, meta = pyreadstat.read_sav(spss_file_path)
print(df_from_spss)

    ID  Score
0  1.0   85.0
1  2.0   92.0
2  3.0   78.0
3  4.0   88.0
4  5.0   95.0


## .SAV Files

The SPSS data file (.sav) was originally developed as the file format for the computer program IBM SPSS. Today, it is the most widely used format for storing survey data, and most advanced data analysis software can create and read it.

### Overview of the file format

SPSS data files, often called "S A V" files, are binary files. The key feature of the file format is that it is very rich in terms of metadata. In particular, a large amount of metadata is stored for each variable, including:

- A Variable Name. E.g., Attitude.
- A Variable Label. E.g., Attitude may have the label How strongly do you agree with the statement ‘Data Science is Cool’?
- A set of Value Labels. For example, a 1 for Gender may mean Male and a 2 may mean Female.
- One variable may be flagged as a weight and another as a filter.
- Related variables may be grouped into Multiple Response Sets.
- Certain values may be flagged as Missing Values.
- The scale type of each variable is stored as Nominal, Ordinal, or Scale.
- Information about the date format may be stored.

### Strengths of the file format

The richness of the metadata makes the SPSS Data File a good format for storing survey data. This, combined with the file format being 50+ years old, has made it, by far, the most widely used file format for survey-based data analysis (e.g., psychology, marketing, sociology, politics, market research, social research, polling).

### Weaknesses of the file format

- The file format has evolved over its 50+ years of existence, and this can cause some compatibility issues, particularly with text variables.
- Poor support for very long variable and value labels.
- The file format cannot be used for very large data files. This is because the file format requires the whole file to be read into memory in order to be analyzed.
- Limited support for metadata for variable sets. It only supports multiple response questions and not grids.

## SQL Files

In [40]:
import pandas as pd

data = {
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 22, 35, 28]
}

df = pd.DataFrame(data)

In [41]:
df

Unnamed: 0,ID,Name,Age
0,1,Alice,25
1,2,Bob,30
2,3,Charlie,22
3,4,David,35
4,5,Eve,28


In [42]:
import sqlite3

# Connect to the SQLite database (or create it if it doesn't exist)
conn = sqlite3.connect('dummy.db')

# Create a cursor object
cursor = conn.cursor()

# Define the SQL command to create a table
create_table_query = '''
CREATE TABLE IF NOT EXISTS dummy_data (
    ID INT PRIMARY KEY,
    Name TEXT,
    Age INT
);
'''

# Execute the SQL command
cursor.execute(create_table_query)

<sqlite3.Cursor at 0x245539d0140>

In [43]:
# Commit the changes and close the connection
conn.commit()
conn.close()

In [44]:
# Reconnect to the SQLite database
conn = sqlite3.connect('dummy.db')
cursor = conn.cursor()

# Convert the DataFrame to a list of tuples
data_to_insert = [tuple(row) for row in df.values]

# Define the SQL command to insert data
insert_data_query = 'INSERT INTO dummy_data (ID, Name, Age) VALUES (?, ?, ?);'

# Execute the SQL command for each row of data
cursor.executemany(insert_data_query, data_to_insert)

<sqlite3.Cursor at 0x24556018140>

In [45]:
# Commit the changes and close the connection
conn.commit()
conn.close()


In [46]:
import sqlite3

# Connect to the SQLite database
conn = sqlite3.connect('dummy.db')
cursor = conn.cursor()

# Define the SELECT query
select_query = 'SELECT * FROM dummy_data;'

# Execute the SELECT query
cursor.execute(select_query)

# Fetch all the results
results = cursor.fetchall()

# Close the connection
conn.close()

# Display the results
for row in results:
    print(row)


(1, 'Alice', 25)
(2, 'Bob', 30)
(3, 'Charlie', 22)
(4, 'David', 35)
(5, 'Eve', 28)
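
The cells above open and close the connection several times, and re-running the insert cell against an existing `dummy.db` would raise `sqlite3.IntegrityError` because `ID` is a primary key. A more compact variant is sketched below, assuming an in-memory database (`":memory:"`) and a shortened row list purely to keep the example self-contained; it uses context managers for commit and cleanup, `INSERT OR REPLACE` so it can be re-run, and a parameterized `WHERE` clause.

```python
import sqlite3
from contextlib import closing

rows = [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", 22)]

# ":memory:" is an assumption to keep the sketch self-contained;
# pass 'dummy.db' instead to work on the file used above
with closing(sqlite3.connect(":memory:")) as conn:
    # "with conn" commits on success and rolls back on an exception
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS dummy_data ("
            "ID INTEGER PRIMARY KEY, Name TEXT, Age INTEGER)"
        )
        # INSERT OR REPLACE avoids a primary-key error when re-run
        conn.executemany(
            "INSERT OR REPLACE INTO dummy_data (ID, Name, Age) VALUES (?, ?, ?)",
            rows,
        )
    # Parameterized query: the value is bound, not pasted into the SQL text
    older = conn.execute(
        "SELECT Name, Age FROM dummy_data WHERE Age > ? ORDER BY Age", (24,)
    ).fetchall()

print(older)
```

Note that `with conn` only manages the transaction; `closing(...)` is what actually closes the connection at the end of the block.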


https://www.postgresql.org/download/

## STATA Files

Stata files are data files associated with Stata, a popular statistical software package used for data analysis, manipulation, and visualization. Stata files are used to store structured datasets, and they come in different formats, each serving a specific purpose:

1. **Stata Data File (.dta)**: This is the primary data file format used by Stata. It stores data in a binary format, preserving variable names, labels, data types, and other metadata. Stata data files can store both data and command results.

2. **Stata Program Files (.do)**: These are plain text files that contain Stata programming code written in the Stata scripting language. Stata program files are used to automate tasks, perform data analysis, and generate reports. They can be executed within Stata's interactive environment.

3. **Stata Log Files (.log)**: Stata log files capture the entire session of commands and output when running Stata programs or interacting with data. They are useful for documenting and replicating analyses.

4. **Stata Output Files (.out)**: These files store output generated by Stata when running programs or commands. They often contain statistical summaries, regression results, and other analysis outputs.

5. **Stata Ado-Files (.ado)**: These "automatic do-files" contain Stata commands or "programs," both official and user-written. Users can create custom commands in .ado files, which can then be invoked like built-in Stata commands.

6. **Stata Journal Files (.sj)**: These files are used for the Stata Journal, a publication featuring articles on using Stata for data analysis. They typically contain articles, code, and datasets used in the journal.

7. **Stata Graphics Files (.gph)**: These files store graphs and plots generated by Stata. Stata provides a wide range of graphing options for data visualization, and the results can be saved in .gph files.

8. **Stata Viewer Files (.stview)**: Stata Viewer is a separate program for viewing Stata data and results files. .stview files are associated with Stata Viewer and are used to store data and results for viewing and sharing with others.

Stata files are widely used in fields such as economics, social sciences, epidemiology, and public health for data analysis and statistical research. The .dta format, in particular, is the standard for storing datasets in Stata, making it easy to share and collaborate on data-driven projects within the Stata ecosystem.

In [49]:
import pandas as pd

# Create a dummy dataset using pandas
data = {
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 22, 35, 28]
}

df = pd.DataFrame(data)

# Save the DataFrame to a Stata file using pandas
df.to_stata('dummy.dta')
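
To check what was written, the file can be read back with `pd.read_stata`. The sketch below is a self-contained variant of the cell above, assuming a temporary directory and a shortened table for illustration; it also passes `write_index=False`, since by default `to_stata` stores the pandas index as an extra column.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 22],
})

# A temporary directory is used only to keep the sketch self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dummy.dta")
    # write_index=False keeps the pandas index out of the Stata file
    df.to_stata(path, write_index=False)
    # Read the .dta file back into a new DataFrame
    df_from_stata = pd.read_stata(path)

print(df_from_stata.columns.tolist())
print(df_from_stata)
```

The integer columns may come back with a narrower integer dtype than they started with, because Stata has its own set of numeric storage types.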

## Read Table files 

read_table is a pandas function that reads delimited text files into a DataFrame. It is typically used for files whose columns are separated by a tab character ('\t'), which is its default delimiter, or by another character given through the sep argument.

In pandas, read_table was briefly deprecated in favor of read_csv and later reinstated. The two functions accept the same options and differ mainly in their default delimiter (a tab for read_table, a comma for read_csv), so read_csv with an explicit sep is the more commonly used form.
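
A short sketch of the relationship between the two functions, using a small tab-separated string made up for illustration and read from memory via `io.StringIO`:

```python
import io

import pandas as pd

# Hypothetical tab-separated content
tsv_text = "ID\tName\tAge\n1\tAlice\t25\n2\tBob\t30\n"

# read_table splits on tabs by default
df_table = pd.read_table(io.StringIO(tsv_text))

# read_csv gives the same result once the separator is set to a tab
df_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df_table)
print(df_table.equals(df_csv))
```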

## XML files

XML (Extensible Markup Language) files are a popular format for storing and exchanging structured data. XML is a markup language that uses tags to define elements and their hierarchical relationships within a document. XML files are both human-readable and machine-readable, making them suitable for a wide range of applications, including data storage, configuration files, and data interchange between different systems.

Key characteristics of XML files include:

1. **Hierarchical Structure**: XML documents are organized hierarchically, consisting of elements nested within other elements. Elements can have attributes and contain text data, other elements, or a combination of both.

2. **Tags**: XML uses tags to define elements. Tags are enclosed in angle brackets (< >) and come in pairs: an opening tag and a closing tag. For example, `<book>` is an opening tag, and `</book>` is a closing tag.

3. **Attributes**: Elements can have attributes that provide additional information about the element. Attributes are typically name-value pairs and are specified within the opening tag. For example, `<book title="Introduction to XML">`.

4. **Nesting**: Elements can be nested within other elements to represent a hierarchical structure. For example:

   ```xml
   <library>
       <book title="Introduction to XML">
           <author>John Doe</author>
           <price>29.99</price>
       </book>
       <book title="Data Science Basics">
           <author>Jane Smith</author>
           <price>34.95</price>
       </book>
   </library>
   ```

5. **Human-Readable**: XML files are plain text and can be easily read and edited by humans using a text editor. The hierarchical structure and use of tags make the data's organization clear.

6. **Machine-Readable**: XML files can be parsed and processed by software applications, making them suitable for data exchange between different systems.

XML is widely used in various domains, including web services (SOAP and REST), configuration files (e.g., XML configuration files for software applications), data interchange formats, and data representation in databases. Additionally, XML has served as the foundation for other markup languages like XHTML and is closely related to HTML (Hypertext Markup Language) used for web page structure.

XML files are identified by the '.xml' file extension, and they adhere to well-defined rules and standards outlined in the XML specification, making them a versatile choice for data representation and exchange.
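
The elements, attributes, and nesting described above can be read with Python's built-in `xml.etree.ElementTree` module. The sketch below parses the library example from this section out of a string; the list name `books` and the choice to convert the price to a float are illustrative.

```python
import xml.etree.ElementTree as ET

xml_text = """
<library>
    <book title="Introduction to XML">
        <author>John Doe</author>
        <price>29.99</price>
    </book>
    <book title="Data Science Basics">
        <author>Jane Smith</author>
        <price>34.95</price>
    </book>
</library>
"""

# Parse the string into an element tree and get the root element
root = ET.fromstring(xml_text)

books = []
for book in root.findall("book"):
    books.append({
        "title": book.get("title"),              # attribute on the opening tag
        "author": book.findtext("author"),       # text of a nested element
        "price": float(book.findtext("price")),  # element text is always a string
    })

print(root.tag)
for entry in books:
    print(entry)
```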

In [51]:
import xml.etree.ElementTree as ET

# Create the root element
root = ET.Element("library")

# Create child elements
book1 = ET.SubElement(root, "book")
book1_title = ET.SubElement(book1, "title")
book1_title.text = "Introduction to XML"
book1_author = ET.SubElement(book1, "author")
book1_author.text = "John Doe"
book1_price = ET.SubElement(book1, "price")
book1_price.text = "29.99"

book2 = ET.SubElement(root, "book")
book2_title = ET.SubElement(book2, "title")
book2_title.text = "Data Science Basics"
book2_author = ET.SubElement(book2, "author")
book2_author.text = "Jane Smith"
book2_price = ET.SubElement(book2, "price")
book2_price.text = "34.95"

# Create an ElementTree from the root element
tree = ET.ElementTree(root)

# Save the XML to a file
tree.write("dummy.xml")


In [52]:
tree


<xml.etree.ElementTree.ElementTree at 0x24556473010>

In [53]:
df = pd.read_xml('dummy.xml')

In [54]:
df

Unnamed: 0,title,author,price
0,Introduction to XML,John Doe,29.99
1,Data Science Basics,Jane Smith,34.95
