# Importing different types of files

In [None]:
import pandas as pd

In [None]:
from os import listdir
from os.path import join

listdir('.')

A delimiter is a sequence of one or more characters that marks the boundary between text strings. Common delimiters are commas (,), semicolons (;), quotes (", '), braces ({}), pipes (|), and slashes (/ \). When a program stores sequential or tabular data, it separates each item of data with a predefined delimiter.
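
As a minimal sketch of how the delimiter affects parsing, the same records can be stored with commas or with pipes and read back through the `sep` argument of `pd.read_csv`. The sample strings and variable names below are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Illustrative data only: the same table stored with two different delimiters
comma_text = "name,city\nJohn,New York\nAlice,Chicago\n"
pipe_text = "name|city\nJohn|New York\nAlice|Chicago\n"

# read_csv assumes a comma unless `sep` says otherwise
comma_df = pd.read_csv(StringIO(comma_text))
pipe_df = pd.read_csv(StringIO(pipe_text), sep="|")

# Both inputs parse to the same DataFrame
print(comma_df.equals(pipe_df))
```

Reading the pipe-delimited text without `sep="|"` would instead produce a single column, because no commas are found to split on.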

### 1.1 Going under the hood of pandas `read_csv`

In [None]:
import pandas as pd
pd.read_csv?

### 1.1.2 Types of files pandas can import

In [None]:
import re
regex = re.compile(r'read')
list(filter(regex.match, dir(pd)))

The clipboard is a temporary storage area in your computer's memory that holds the information you copy or cut. The information can be text, images, or other types of data. You can then paste the contents of the clipboard into another location, such as a document or an email. pandas can read tabular text from the clipboard with `pd.read_clipboard`, which is why that function appears in the list above.

## CSV vs. Excel

CSV stands for Comma-Separated Values, while Excel is a spreadsheet application that saves files in its own formats (`.xlsx`, `.xls`). CSV files store tabular data as plain text, with values separated by commas. They can be opened with any text editor (such as Notepad) and are generally faster to open and process. However, they cannot store other information such as formatting, links, charts, or pictures. Excel workbooks, on the other hand, can hold multiple worksheets, store formatting, and perform operations on the data. They can contain symbols, links, charts, pictures, and so on, but they are not plain text, so reading them requires a spreadsheet application or a library, and large workbooks are slower to load.
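
To make the "plain text" point concrete, the sketch below serializes a small, made-up DataFrame to CSV in memory and prints the raw text. Calling `to_csv` without a path returns the CSV as a string. Passing `index=False` leaves out the row index, which would otherwise be written as an extra unnamed first column.

```python
from io import StringIO

import pandas as pd

# Illustrative data, not the employee table used elsewhere in this notebook
sample_df = pd.DataFrame({"Name": ["John Smith", "Jane Doe"], "Salary": [75000, 60000]})

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = sample_df.to_csv(index=False)
print(csv_text)

# Reading the text back recovers the same table
round_trip = pd.read_csv(StringIO(csv_text))
print(round_trip.equals(sample_df))
```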

In [None]:
data = {
    'Employee ID': [101, 102, 103, 104, 105],
    'Name': ['John Smith', 'Jane Doe', 'Bob Johnson', 'Alice Brown', 'Charlie Wilson'],
    'Department': ['HR', 'Finance', 'Engineering', 'Marketing', 'Sales'],
    'Position': ['Manager', 'Analyst', 'Engineer', 'Coordinator', 'Sales Representative'],
    'Salary': [75000, 60000, 80000, 50000, 65000]
}

# Create a DataFrame
employee_df = pd.DataFrame(data)

# Display the DataFrame
employee_df

In [None]:
employee_df.to_csv('Test1.csv')

In [None]:
employee_df.to_excel('Test2.xlsx')

## Feather files

A Feather file is a binary file format for efficiently storing and sharing data between different programming languages and data analysis tools. Feather is built on Apache Arrow and was designed to be lightweight and to optimize data transfer between data analysis libraries, such as pandas in Python and their counterparts in R, Julia, and other languages.

Feather files have a few key features:

Language-Agnostic: Feather files are designed to be language-agnostic, which means you can read and write them from multiple programming languages without losing data integrity or performance.

Columnar Storage: Feather stores data in a columnar format, which is often more efficient for data analysis and allows for faster read and write operations, especially when working with large datasets.

Metadata: Feather files include metadata that helps describe the data, such as data types and column names, making it self-descriptive.

Efficient Serialization: Feather is optimized for fast serialization and deserialization, making it suitable for reading and writing data frames or tables quickly.
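
The idea behind columnar storage can be illustrated with plain Python containers. This is a conceptual sketch only: it shows the difference between grouping values by row and grouping them by column, not how Feather actually lays out bytes on disk. The variable names are made up for the example.

```python
import pandas as pd

df_small = pd.DataFrame({"Column1": [1, 2, 3], "Column2": ["A", "B", "C"]})

# Row-oriented view: one record per row, mixing types within each record
row_oriented = df_small.to_dict(orient="records")

# Column-oriented view: one list per column, each holding a single type
column_oriented = df_small.to_dict(orient="list")

print(row_oriented)
print(column_oriented)
```

In the column-oriented view, reading only `Column1` touches one contiguous list, which is the property columnar formats exploit for analytical workloads.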

In [None]:
import pandas as pd

# Create a sample DataFrame
data = {'Column1': [1, 2, 3, 4, 5], 'Column2': ['A', 'B', 'C', 'D', 'E']}
df = pd.DataFrame(data)

# Write the DataFrame to a Feather file
df.to_feather('example.feather')

# Read the Feather file back into a DataFrame
df_from_feather = pd.read_feather('example.feather')

# Display the DataFrame read from Feather
print(df_from_feather)


## Fixed-width files (FWF)

The `read_fwf` function in pandas is used when you have data in a fixed-width format, and you want to read that data into a DataFrame for further analysis and manipulation. Fixed-width format data is a type of plain text data where each column has a specified width, and the data within each column is aligned to those widths.

Here are some common scenarios when you might use `read_fwf`:

1. **Legacy Data Formats**: Fixed-width files were common in legacy systems or data sources where data was stored and exchanged in a format where each column's position was predetermined. If you need to work with such data, you would use `read_fwf` to parse and load it into a DataFrame.

2. **Government Data**: Government agencies and organizations often provide data in fixed-width format files. Examples include census data, economic indicators, or demographic data.

3. **Mainframe Systems**: Data exported from mainframe systems or older databases may be in fixed-width format. Reading this data with `read_fwf` can be essential for data analysis or migration to modern systems.

4. **Financial Data**: Financial data, including stock market data or financial reports, may be distributed in fixed-width format. Analysts often use `read_fwf` to load this data for analysis.

5. **Custom Data Export Formats**: Sometimes, organizations or systems export data in custom fixed-width formats for specific applications. If you encounter such data, `read_fwf` can help you parse and work with it.

When working with fixed-width data, it's crucial to know the exact column widths and data types in the file, as these details are necessary to correctly parse the data. You'll typically need to provide the `colspecs` parameter to specify the start and end positions of each column, or the `widths` parameter when the fields are contiguous.

Here's a general use case for `read_fwf`:

```python
import pandas as pd

# Define the (start, end) character positions of each column; the end position is exclusive
colspecs = [(0, 5), (5, 10), (10, 15)]  # Adjust to match your data

# Read the FWF file into a DataFrame
df = pd.read_fwf('data.fwf', colspecs=colspecs)

# Perform data analysis and manipulation using the DataFrame
```

In this example, `colspecs` specifies the start and end positions of three columns in the FWF file, and `read_fwf` reads the data accordingly. You should adjust `colspecs` to match the specific formatting of your FWF file.
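
The snippet above depends on a `data.fwf` file that is not part of this notebook. As a self-contained sketch, the example below builds three made-up records in memory with string formatting and reads them back. The record layout (a 5-character id, a 10-character name, a 7-character amount) and the column names are assumptions chosen for illustration.

```python
from io import StringIO

import pandas as pd

# Made-up records, written with fixed field widths of 5, 10, and 7 characters
records = [(1, "Alice", 12.5), (2, "Bob", 7.25), (3, "Charlie", 100.0)]
fwf_text = "".join(
    f"{rid:05d}{name:<10}{amount:>7.2f}\n" for rid, name, amount in records
)
print(fwf_text)

# Half-open (start, end) positions matching the widths used above
colspecs = [(0, 5), (5, 15), (15, 22)]

df_fwf = pd.read_fwf(
    StringIO(fwf_text),
    colspecs=colspecs,
    header=None,
    names=["id", "name", "amount"],
)
print(df_fwf)
```

Padding is stripped from each field during parsing, so the `name` column contains `Alice` rather than `Alice` followed by spaces.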

In [None]:
filepath = r"C:\Users\crist\Downloads\data.txt"

In [None]:
import pandas as pd
 
df = pd.read_fwf(filepath, colspecs='infer', header=None)
print(df)

## Benefits of FWF Files

Fixed-width format (FWF) files have several benefits in data storage and processing:

Predictable Structure: FWF files have a fixed structure where each field occupies a specific number of characters or positions within each record. This predictability makes it easy to parse and read the data accurately.

Human-Readable: FWF files are often human-readable because of their fixed-column layout. This makes it easier for people to inspect the data without the need for specialized software.

Efficiency: Because every field sits at a known offset, a parser can slice each record without scanning for delimiters or handling quoting, and a specific record can be located by position. The trade-off is that padding often makes FWF files larger than the equivalent delimited file (e.g., CSV).

Preservation of Leading Zeros: FWF files are useful for storing data where leading zeros are significant (e.g., ZIP codes, product codes, or identification numbers) because they maintain the exact character positions.

Data Integrity: FWF files are less prone to data corruption due to missing or misplaced delimiters that can occur in variable-width files.

Compatibility: FWF files are well-suited for integration with legacy systems or other software that expects data in a fixed-width format.

Data Validation: The fixed-width format makes it easier to enforce data validation rules as data must conform to the specified column widths.

Alignment: When displaying FWF data in a text editor or fixed-width font, the columns align neatly, making it easier for users to visually interpret the data.

However, it's essential to consider the specific use case and requirements when choosing between FWF and other data storage formats like CSV, TSV, or JSON. FWF is most beneficial when data has a consistent and predictable structure. If your data has varying column widths or complex structures, other formats may be more suitable.
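
Whether leading zeros survive also depends on how the file is read. In the sketch below, which uses illustrative ZIP-code records and assumed column names, the default type inference of `read_fwf` turns the codes into integers, while passing `dtype` keeps them as text.

```python
from io import StringIO

import pandas as pd

# Illustrative records: a 5-character ZIP code followed by a 10-character city name
rows = [("00501", "Holtsville"), ("02108", "Boston"), ("10001", "New York")]
zip_text = "".join(f"{zip_code:<5}{city:<10}\n" for zip_code, city in rows)

# Default type inference parses the ZIP column as integers, dropping leading zeros
inferred = pd.read_fwf(
    StringIO(zip_text), widths=[5, 10], header=None, names=["zip", "city"]
)

# Reading the column as text keeps every character
as_text = pd.read_fwf(
    StringIO(zip_text),
    widths=[5, 10],
    header=None,
    names=["zip", "city"],
    dtype={"zip": str},
)

print(inferred["zip"].tolist())
print(as_text["zip"].tolist())
```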







## Google BigQuery

In [None]:
import pandas as pd

# Placeholder: replace with your own Google Cloud project ID
project_id = "your_project_id"

# Define your BigQuery SQL query as a string
# (the table name, columns, and condition are placeholders)
query = """
SELECT
  column1,
  column2
FROM
  `your_project_id.your_dataset.your_table`
WHERE
  some_condition
"""

# BigQuery access requires authentication. By default, read_gbq uses
# application default credentials.
# To authenticate with a service-account JSON key file instead, build a
# credentials object and pass it via `credentials` (the older `private_key`
# argument is no longer supported):
# from google.oauth2 import service_account
# credentials = service_account.Credentials.from_service_account_file('path/to/your/credentials.json')
# df = pd.read_gbq(query, project_id=project_id, credentials=credentials, dialect='standard')

# Use the read_gbq function to execute the query and retrieve the data into a DataFrame.
# This requires the pandas-gbq package. pd.read_gbq is deprecated as of pandas 2.2;
# pandas_gbq.read_gbq accepts the same arguments used here.
df = pd.read_gbq(query, project_id=project_id, dialect='standard')

# Now, you can work with the data in the DataFrame 'df'
print(df.head())


## Benefits of Google BigQuery

Google BigQuery is a fully-managed, serverless, and highly scalable data warehouse and analytics platform offered by Google Cloud. It provides several benefits for organizations and data professionals:

Scalability: BigQuery is designed to handle massive datasets with ease. It can automatically scale to accommodate growing data volumes, ensuring that you can run complex queries on large datasets without worrying about infrastructure limitations.

Serverless: You don't need to provision or manage servers when using BigQuery. It's a serverless platform, which means Google takes care of infrastructure management, including hardware and software updates.

Speed: BigQuery uses a distributed architecture and columnar storage to execute queries quickly, even on petabyte-scale datasets.

Cost-Effective: With BigQuery's pay-as-you-go pricing model, you only pay for the data you query and store. It eliminates the need for upfront capital expenditures and allows you to control costs effectively.

SQL Support: BigQuery supports standard SQL, making it easy for data analysts and SQL developers to write and run queries without the need to learn a new query language.

Integration: It integrates with other Google Cloud services, such as Google Cloud Storage, Looker Studio (formerly Google Data Studio), and Google Sheets, allowing you to build end-to-end data pipelines and analytics solutions.

Data Warehousing and Data Lake Capabilities: BigQuery can function as both a data warehouse and a data lake. You can store structured and semi-structured data in BigQuery tables or query data directly from external storage like Google Cloud Storage.

Security: Google Cloud provides robust security features, including encryption at rest and in transit, identity and access management (IAM), and audit logs. It complies with industry standards and certifications.

Real-Time Data Analysis: BigQuery supports real-time data streaming, enabling you to analyze data as it arrives, making it suitable for real-time analytics use cases.

Machine Learning Integration: You can leverage Google's machine learning capabilities and services like BigQuery ML to build and deploy machine learning models directly within BigQuery.

Geo-spatial and Advanced Analytics: BigQuery offers support for geospatial data and a wide range of advanced analytics functions, including window functions, machine learning, and statistical analysis.

Data Sharing and Collaboration: You can easily share datasets and queries with others in your organization or externally, facilitating collaboration and data sharing.

Automatic Backups and High Availability: BigQuery automatically takes care of data backups and provides high availability, ensuring that your data is safe and accessible.

Cost Optimization Tools: Google Cloud provides cost optimization tools and features to help you analyze and control your BigQuery costs, making it easier to manage your budget.

Community and Support: BigQuery has a large and active user community, and Google Cloud offers various levels of support, including documentation, forums, and premium support options.

Overall, Google BigQuery is a powerful and versatile platform that can help organizations make data-driven decisions, gain insights from their data, and leverage the benefits of cloud computing without the complexity of managing infrastructure.







## HDF File


An HDF file, which stands for "Hierarchical Data Format" file, is a file format designed for storing and organizing large amounts of data. HDF files are particularly popular in the scientific and engineering communities for applications that involve complex and multidimensional datasets. Here are some key features of HDF files:

Hierarchical Structure: HDF files have a hierarchical structure, similar to a file system, where data can be organized into groups and datasets. This hierarchical organization makes it easy to store and manage structured data.

Support for Various Data Types: HDF supports a wide range of data types, including numerical data (integers, floats, etc.), text, images, and more. This versatility makes it suitable for a broad spectrum of applications.

Compression: HDF files can be compressed, which helps reduce file size while preserving data integrity. This is particularly useful when dealing with large datasets.

Portability: HDF files are designed to be platform-independent. You can create and access HDF files on various operating systems, including Windows, macOS, and Linux.

Data Chunking: HDF allows data to be divided into smaller, regularly sized chunks. This can improve data access and manipulation, especially for large datasets.

Metadata: You can attach metadata to HDF datasets, providing additional information about the data's content, source, and any relevant attributes.

Libraries and APIs: There are libraries and APIs available for various programming languages (e.g., HDF5 for C/C++, h5py for Python) that allow developers to work with HDF files, making it easier to create, read, write, and manipulate data stored in HDF format.

HDF files are commonly used in fields such as astronomy, geoscience, climate modeling, bioinformatics, and more, where researchers need to store and analyze large and complex datasets. The two major versions of HDF are HDF4 and HDF5, with HDF5 being the more modern and widely adopted version due to its improved capabilities and performance.

## Create dummy HDF data

In [None]:
import h5py
import numpy as np

# Define your dataset (a simple 2D array in this example)
data = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Create an HDF5 file in write mode
with h5py.File('my_dataset.h5', 'w') as hdf_file:
    # Create a dataset and write data to it
    dataset = hdf_file.create_dataset('my_data', data=data)

    # Optionally, add attributes and metadata to the dataset
    dataset.attrs['description'] = 'My sample dataset'
    dataset.attrs['author'] = 'Your Name'

# The HDF5 file is automatically closed when the 'with' block exits,
# so `dataset` refers to a closed dataset from this point on

In [None]:
dataset

In [None]:
type(dataset)

## Read the dataset back

In [None]:
import h5py

# Open the HDF5 file in read mode
with h5py.File('my_dataset.h5', 'r') as hdf_file:
    # Access the dataset
    dataset = hdf_file['my_data']
    
    # Convert the dataset to a NumPy array
    data = dataset[()]
    
    # Print the contents of the array
    print(data)







## Convert HDF File to Dataframe format

In [None]:
import h5py
import pandas as pd

# Open the HDF5 file in read mode
with h5py.File('my_dataset.h5', 'r') as hdf_file:
    # Access the dataset
    dataset = hdf_file['my_data']
    
    # Convert the dataset to a pandas DataFrame
    df = pd.DataFrame(dataset[()])

# Now you have the data in a pandas DataFrame
print(df)

## HTML Files 

In [None]:
# Define some example data
data = [
    {"name": "John", "age": 30, "city": "New York"},
    {"name": "Alice", "age": 25, "city": "Los Angeles"},
    {"name": "Bob", "age": 35, "city": "Chicago"},
]

# Create an HTML string
html_string = "<html>\n"
html_string += "<head><title>Sample Dataset</title></head>\n"
html_string += "<body>\n"
html_string += "<h1>Sample Dataset</h1>\n"
html_string += "<table border='1'>\n"
html_string += "<tr><th>Name</th><th>Age</th><th>City</th></tr>\n"

for record in data:
    html_string += "<tr>"
    html_string += f"<td>{record['name']}</td>"
    html_string += f"<td>{record['age']}</td>"
    html_string += f"<td>{record['city']}</td>"
    html_string += "</tr>\n"

html_string += "</table>\n"
html_string += "</body>\n"
html_string += "</html>"

# Save the HTML string to a file
with open("sample_dataset.html", "w") as html_file:
    html_file.write(html_string)

print("Sample dataset HTML file created: sample_dataset.html")


## When to parse HTML files in Python

You would parse HTML files in Python when you need to extract and manipulate data or information from HTML documents. Parsing HTML is common in various scenarios, including:

Web Scraping: When you want to extract data from websites for purposes such as data analysis, research, or building web applications. Python libraries like Beautiful Soup and Scrapy are often used for web scraping tasks.

Data Extraction: When you need to extract structured data from HTML documents, such as tables, lists, or specific elements, for further processing or analysis.

Web Testing and Automation: When you automate web interactions or perform web testing, you may need to parse HTML to locate and interact with specific elements on a web page.

Data Cleaning: When you have HTML-encoded content in your dataset and you want to convert it to plain text or extract specific information.

Generating Dynamic Content: When you dynamically generate HTML documents based on data from databases or other sources.

Web Development: When building web applications using frameworks like Flask or Django, you often work with HTML templates to render dynamic content.

Python offers several libraries and tools for parsing HTML, including:

Beautiful Soup: A popular library for parsing HTML and XML documents, making it easy to navigate and search the document's structure.

lxml: A high-performance library for parsing XML and HTML documents. It's often used in combination with XPath for advanced data extraction.

html.parser: The built-in HTML parser in Python's standard library, which can be used for basic HTML parsing tasks.

Selenium: A tool often used for web testing and automation, allowing you to programmatically interact with web pages and extract data.

Scrapy: A powerful web crawling and scraping framework that provides a comprehensive set of tools for extracting and processing data from websites.

The specific use case for parsing HTML in Python depends on your project requirements. Whether it's extracting data from a website, cleaning and transforming HTML-encoded content, or interacting with web pages programmatically, Python offers a range of tools and libraries to help you achieve your goals.
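
As a minimal sketch of the data-extraction scenario, the example below uses the standard library's `html.parser` to pull the rows out of a small table and load them into a DataFrame. The `TableParser` class and the inline HTML are illustrative assumptions; the markup mirrors the structure of the sample file generated earlier. pandas also offers `pd.read_html` for this task, but it relies on an optional parser library being installed.

```python
from html.parser import HTMLParser

import pandas as pd


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


html_text = """
<table border='1'>
<tr><th>Name</th><th>Age</th><th>City</th></tr>
<tr><td>John</td><td>30</td><td>New York</td></tr>
<tr><td>Alice</td><td>25</td><td>Los Angeles</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html_text)

# The first row holds the headers; the remaining rows hold the data
header, *body = parser.rows
people_df = pd.DataFrame(body, columns=header)
people_df["Age"] = people_df["Age"].astype(int)
print(people_df)
```

This sketch handles a single flat table only. Nested tables, merged cells, or malformed markup are better left to a dedicated library such as Beautiful Soup or lxml.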

## JSON 

JSON (JavaScript Object Notation) and DataFrames both represent data, but JSON is a text format while a DataFrame is an in-memory data structure, so they have different characteristics and purposes. Here are the key differences between them:

Data Representation:

JSON: JSON is a lightweight, text-based data interchange format. It's designed to represent structured data as a collection of key-value pairs, where keys are strings, and values can be strings, numbers, objects, arrays, booleans, or null. JSON is often used for data exchange between systems and for configuration files.

DataFrame: A DataFrame is a tabular data structure commonly used in data analysis and manipulation. It's a two-dimensional, labeled data structure with rows and columns, similar to a spreadsheet or SQL table. Each column can have a different data type, making it suitable for heterogeneous data.

Use Cases:

JSON: JSON is typically used for data exchange between applications, configuration files, or representing data in a semi-structured format. It's commonly used in web APIs, configuration files, and data serialization.

DataFrame: DataFrames are used for data analysis, data manipulation, and exploration tasks. They are a fundamental data structure in data science libraries like pandas in Python and are used for tasks such as filtering, grouping, aggregation, and visualization.

Storage Format:

JSON: JSON is stored as plain text and is human-readable. It's a flexible format for representing structured data, but it may not be as space-efficient as binary formats for large datasets.

DataFrame: DataFrames are typically stored in memory, and they can be serialized to various formats like CSV, Excel, HDF5, or Parquet for storage. These formats may offer compression and better storage efficiency compared to JSON.

Schema and Type Information:

JSON: JSON does not have a predefined schema or type information. It's up to the application to interpret and validate the data.

DataFrame: Every DataFrame column has a name and a data type (dtype). Numeric, boolean, and datetime columns hold values of a single type, which makes structured data easier to work with; only `object` columns can mix types.

Access and Manipulation:

JSON: Accessing and manipulating data in JSON typically involves parsing the text and working with the resulting data structure in your programming language. Libraries like json in Python are commonly used for this purpose.

DataFrame: DataFrames provide a high-level API for data manipulation and analysis. Libraries like pandas offer powerful functions for filtering, transforming, aggregating, and visualizing data in a tabular format.

In summary, JSON files are more suited for data interchange and configuration, while DataFrames are designed for data analysis and manipulation. The choice between the two depends on your specific use case and the type of data you are working with.
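
The gap between nested JSON and a flat table can be bridged with `pd.json_normalize`. The sketch below is a minimal example with a made-up document: `record_path` names the list to expand into rows, and `meta` copies a top-level field onto every row.

```python
import json

import pandas as pd

# Illustrative JSON text with one top-level field and a nested list of records
raw = """
{
    "company": "Acme",
    "employees": [
        {"name": "John", "age": 30, "department": "HR"},
        {"name": "Alice", "age": 28, "department": "Engineering"}
    ]
}
"""

# Parse the text into Python dicts and lists
parsed = json.loads(raw)

# Expand the nested list into rows and attach the company name to each one
employees_df = pd.json_normalize(parsed, record_path="employees", meta=["company"])
print(employees_df)
```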

## Create a dummy JSON file

In [None]:
import json

# Define a dictionary with sample data
data = {
    "employees": [
        {
            "name": "John",
            "age": 30,
            "department": "HR"
        },
        {
            "name": "Alice",
            "age": 28,
            "department": "Engineering"
        },
        {
            "name": "Bob",
            "age": 35,
            "department": "Marketing"
        }
    ]
}

# Serialize the data to a JSON-formatted string
json_string = json.dumps(data, indent=4)

# Print the JSON string
print(json_string)

# Alternatively, save the data to a JSON file
with open("sample_dataset.json", "w") as json_file:
    json.dump(data, json_file, indent=4)

print("Sample JSON dataset created: sample_dataset.json")


## Convert JSON to a dataframe

In [None]:
import pandas as pd

# Define the JSON data (you can also read it from a file)
data = {
    "employees": [
        {
            "name": "John",
            "age": 30,
            "department": "HR"
        },
        {
            "name": "Alice",
            "age": 28,
            "department": "Engineering"
        },
        {
            "name": "Bob",
            "age": 35,
            "department": "Marketing"
        }
    ]
}

# Convert the JSON data to a DataFrame
df = pd.DataFrame(data["employees"])

# Print the DataFrame
print(df)


## ORC Files

An ORC (Optimized Row Columnar) file is a columnar storage file format used for storing and managing large volumes of structured data efficiently. It originated in the Apache Hive project within the Hadoop ecosystem, is now an open-source Apache project in its own right, and is widely used in big data processing frameworks like Apache Hive, Apache Spark, and Apache Impala.

Key characteristics and features of ORC files include:

Columnar Storage: Unlike traditional row-based storage formats, ORC files store data column by column. This allows for better compression and encoding of data because similar data types within a column can be stored together.

Compression: ORC files employ various compression techniques to reduce the storage space required. They often use lightweight compression algorithms like Zlib, Snappy, or LZO.

Predicate Pushdown: ORC files support predicate pushdown, which means that query engines can apply filtering and predicate operations directly on the data stored in the file. This reduces the amount of data that needs to be read from storage during query execution.

Lightweight Indexing: ORC files include lightweight indexes that help with metadata operations and skip scanning of irrelevant data blocks when processing queries.

Schema Evolution: ORC supports schema evolution, allowing you to add, remove, or modify columns in your datasets while maintaining compatibility with existing data.

Performance: ORC is designed for high performance and is optimized for both read and write operations. It is particularly well-suited for complex queries on large datasets.

Compatibility: ORC is commonly used in the Hadoop ecosystem, and many data processing tools and frameworks have built-in support for reading and writing ORC files.

ORC files are used primarily in data warehousing, data lakes, and analytics applications, where efficiency in storage and query performance is crucial. They are often a preferred format for storing large datasets in distributed computing environments, such as Hadoop clusters, due to their performance benefits and compatibility with various data processing tools.






## Create a dummy ORC file

In [2]:
# ! pip install pyspark

In [5]:
# ! pip install findspark
import findspark
findspark.init()

In [6]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("Test").getOrCreate()

# Check the SparkContext
sc = spark.sparkContext

# Verify that Spark is running
print(sc.version)


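Output of the cell above: `PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.` PySpark raises this error when it cannot launch a JVM, typically because Java is not installed or `JAVA_HOME` is not set. Install a Java runtime and rerun the cell.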

In [1]:
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("DummyORC").getOrCreate()

# Create a DataFrame with sample data
data = [("John", 30, "HR"), ("Alice", 28, "Engineering"), ("Bob", 35, "Marketing")]
columns = ["name", "age", "department"]
df = spark.createDataFrame(data, columns)

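# Write the DataFrame in ORC format. Spark creates a directory named dummy.orc
# containing part files; mode("overwrite") lets the cell be rerun safely.
df.write.mode("overwrite").orc("dummy.orc")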

# Stop the Spark session
spark.stop()

print("Dummy ORC file created: dummy.orc")

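Output of the cell above: `PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.` As with the previous cell, the error means no JVM could be started, so no ORC output was written.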