# Data Sources

Apache Spark supports various data sources, allowing users to read and write data from/to different storage systems, file formats, and databases. Here are some commonly used data sources in Spark:

### 1. **Hadoop Distributed File System (HDFS):**
   - Spark can read and write data stored in HDFS, which is the distributed file system used by Hadoop.

### 2. **Apache Hive:**
   - Spark can read from and write to Hive tables and can use the Hive metastore as its catalog, making it compatible with existing Hive data warehouses.

### 3. **Apache HBase:**
   - Spark can read and write data to Apache HBase, a NoSQL database that runs on top of the Hadoop Distributed File System.

### 4. **Apache Cassandra:**
   - Spark can interact with Apache Cassandra, a distributed NoSQL database.

### 5. **Amazon S3:**
   - Spark provides connectors to read and write data from Amazon S3, a widely used cloud storage service.

### 6. **Azure Blob Storage:**
   - Similar to S3, Spark can read and write data from Azure Blob Storage, Microsoft's cloud object storage solution.

### 7. **File Formats:**
   - Spark supports various file formats, including Parquet, ORC, JSON, CSV, and plain text out of the box, and Avro through the separate `spark-avro` module. Each format has its advantages, and users can choose the one that best suits their needs.

### 8. **Apache Kafka:**
   - Spark can consume and process data from Apache Kafka, a distributed event streaming platform.

### 9. **Relational Databases:**
   - Spark can connect to and interact with traditional relational databases such as MySQL, PostgreSQL, Oracle, and others. This is often done using JDBC connectors.

### 10. **Elasticsearch:**
    - Spark can read and write data to Elasticsearch, a distributed search and analytics engine.

### 11. **JDBC Data Sources:**
    - Spark supports reading and writing data using Java Database Connectivity (JDBC) to connect to various databases.

### 12. **Graph Databases (Neo4j, etc.):**
    - For graph-based data, Spark can interact with graph databases like Neo4j.

### 13. **In-memory Data Grids (e.g., Apache Ignite):**
    - Spark can leverage in-memory data grids for distributed in-memory processing.

### 14. **Custom Data Sources:**
    - Spark provides APIs for developers to create custom data sources, allowing integration with specialized or proprietary data storage systems.

### 15. **DataFrames and Datasets:**
    - DataFrames and Datasets are Spark's own abstractions for structured data rather than external sources. Every source above is read into a DataFrame (or, in Scala and Java, a typed Dataset), and results are written back out from one, which gives a unified API across sources and sinks.

Spark's flexibility in supporting various data sources makes it a versatile framework for working with diverse data ecosystems and use cases. Users can seamlessly integrate Spark with their preferred storage systems and formats.

# Reading and Writing CSV Files

In Apache Spark, you can perform read and write operations on CSV (Comma-Separated Values) files using the `spark.read` and `DataFrame.write` API. Here's how you can do it:

### Reading CSV Files:

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("CSVExample").getOrCreate()

# Specify the path to the CSV file
csv_file_path = "path/to/your/csv/file.csv"

# Read the CSV file into a DataFrame
df = spark.read.csv(csv_file_path, header=True, inferSchema=True)

# Show the DataFrame
df.show()

# Stop the Spark session when done
spark.stop()
```

Explanation:
- `header=True`: Treats the first row of the CSV file as column names instead of data.
- `inferSchema=True`: Infers the data type of each column automatically. This costs an extra pass over the data; without it, every column is read as a string.

### Writing to CSV Files:

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("CSVWriteExample").getOrCreate()

# Assume df is the DataFrame you want to write to a CSV file

# Specify the path to save the CSV file
output_csv_path = "path/to/save/output.csv"

# Write the DataFrame to a CSV file
df.write.csv(output_csv_path, header=True, mode="overwrite")

# Stop the Spark session when done
spark.stop()
```

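Explanation:
- `header=True`: Writes the column names as the first line of each output file.
- `mode="overwrite"`: Replaces any existing output at the path. Other values for `mode` are `"append"`, `"ignore"`, and `"error"` (also spelled `"errorifexists"`), which is the default.
- The output path is a directory, not a single file: Spark writes one `part-*` file per partition into it.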

Keep in mind that when reading and writing CSV files in Spark, you can customize the options based on your specific requirements. The `spark.read.csv` and `DataFrame.write.csv` methods provide various options for handling header, delimiter, schema, and more.

Note: Make sure to replace `"path/to/your/csv/file.csv"` and `"path/to/save/output.csv"` with the actual paths in your file system.

If your CSV file has a different delimiter (not a comma), you can specify it using the `sep` option. For example, if the delimiter is a semicolon, you can use `df.write.option("sep", ";").csv(output_csv_path)` when writing and `spark.read.option("sep", ";").csv(csv_file_path)` when reading. Adjust these options according to the specifics of your CSV file.

# Working with JSON Files

Working with JSON files in Apache Spark involves using the `spark.read` and `DataFrame.write` API to read and write data in JSON format. Here's a basic guide on how to perform these operations:

### Reading JSON Files:

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("JSONExample").getOrCreate()

# Specify the path to the JSON file
json_file_path = "path/to/your/json/file.json"

# Read the JSON file into a DataFrame
df = spark.read.json(json_file_path)

# Show the DataFrame
df.show()

# Stop the Spark session when done
spark.stop()
```

### Writing to JSON Files:

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("JSONWriteExample").getOrCreate()

# Assume df is the DataFrame you want to write to a JSON file

# Specify the path to save the JSON file
output_json_path = "path/to/save/output.json"

# Write the DataFrame to a JSON file
df.write.json(output_json_path, mode="overwrite")

# Stop the Spark session when done
spark.stop()
```

### Options and Configurations:

You can customize the behavior of reading and writing JSON files by using various options. For example:

#### Reading JSON Files with Options:

```python
# Read the JSON file with options
df = spark.read.option("multiLine", "true").json(json_file_path)
```

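Here, the `multiLine` option is set to `true` so that Spark can parse files in which a single JSON record (or a top-level array of records) spans several lines. By default, Spark expects JSON Lines input, with one complete JSON object per line.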
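To make the two layouts concrete, here is a plain-Python sketch (no Spark required) that renders the same records both ways. The records are hypothetical samples shaped like the flight data used in this document.

```python
import json

# Hypothetical sample records, used only for illustration
records = [
    {"DEST_COUNTRY_NAME": "United States", "ORIGIN_COUNTRY_NAME": "Romania", "count": 3},
    {"DEST_COUNTRY_NAME": "Egypt", "ORIGIN_COUNTRY_NAME": "United States", "count": 13},
]

# JSON Lines: one complete object per line (what spark.read.json expects by default)
json_lines = "\n".join(json.dumps(record) for record in records)

# Pretty-printed array: each record spans several lines (needs multiLine=true in Spark)
multi_line = json.dumps(records, indent=2)

print(json_lines)
print(multi_line)

# Every line of the JSON Lines layout parses on its own
parsed_per_line = [json.loads(line) for line in json_lines.splitlines()]
```

In the second layout no individual line is a complete JSON document, which is why a line-oriented reader cannot parse it without `multiLine`.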

#### Writing to JSON Files with Options:

```python
# Write the DataFrame to a JSON file with options
df.write.option("compression", "gzip").json(output_json_path)
```

In this example, the `compression` option is set to `"gzip"` to compress the output JSON files.

### Handling Nested JSON Structures:

If your JSON file contains nested structures, Spark will automatically infer the schema. You can explore the nested structure using the `printSchema` method:

```python
# Print the schema of the DataFrame
df.printSchema()
```

Spark will display the inferred schema, including the nested structures.

Keep in mind that the specifics of reading and writing JSON files may depend on the structure of your JSON data. Customize the options based on your data format and requirements.

Adjust the file paths in the examples with the actual paths in your file system.

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("JSONExample").getOrCreate()

# Specify the path to the JSON file
json_file_path = "/home/blackheart/Documents/Data/Apache-Spark/Data/flight_data/2011-summary.json"

# Read the JSON file into a DataFrame
df = spark.read.json(json_file_path)

# Show the DataFrame
df.show()
```

```
+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|       United States|       Saint Martin|    2|
|       United States|             Guinea|    2|
|       United States|            Croatia|    1|
|       United States|            Romania|    3|
|       United States|            Ireland|  268|
|               Egypt|      United States|   13|
|       United States|              India|   76|
|       United States|          Singapore|   24|
|       United States|            Grenada|   59|
|          Costa Rica|      United States|  494|
|             Senegal|      United States|   29|
|              Guyana|      United States|   26|
|       United States|   Marshall Islands|   49|
|       United States|       Sint Maarten|  223|
|               Malta|      United States|    1|
|             Bolivia|      United States|   61|
|            Anguilla|      United States|   21|
|       United State
```

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("JSONWriteExample").getOrCreate()

# Assume df is the DataFrame you want to write to a JSON file

# Specify the path to save the JSON file
output_json_path = "/home/blackheart/Documents/Data/Apache-Spark/Data/flight_data/2011-summary_output.json"

# Write the DataFrame to a JSON file
df.write.json(output_json_path, mode="overwrite")

# Stop the Spark session when done
spark.stop()
```


```python
# Print the schema of the DataFrame
df.printSchema()
```

```
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)
```


# Parquet File

Parquet is a columnar storage file format that is highly optimized for use with big data processing frameworks like Apache Spark, Apache Hive, and Apache Impala. It is designed to provide better performance and storage efficiency compared to traditional row-based file formats like CSV and JSON. Parquet is particularly well-suited for analytics workloads on large datasets.

Here's an example of working with Parquet files in Apache Spark:

### Writing to Parquet:

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("ParquetExample").getOrCreate()

# Assume df is the DataFrame you want to write to a Parquet file

# Specify the path to save the Parquet file
output_parquet_path = "path/to/save/output.parquet"

# Write the DataFrame to a Parquet file
df.write.parquet(output_parquet_path, mode="overwrite")

# Stop the Spark session when done
spark.stop()
```

In this example, the `write.parquet` method writes the DataFrame to a Parquet directory. `mode="overwrite"` replaces any existing output at that path. Other values for `mode` are `"append"`, `"ignore"`, and `"error"` (the default).

### Reading Parquet Files:

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("ParquetReadExample").getOrCreate()

# Specify the path to the Parquet file
parquet_file_path = "path/to/your/parquet/file.parquet"

# Read the Parquet file into a DataFrame
df_parquet = spark.read.parquet(parquet_file_path)

# Show the DataFrame
df_parquet.show()

# Stop the Spark session when done
spark.stop()
```

The `read.parquet` method reads a Parquet file or directory into a DataFrame. Parquet files are self-describing, so Spark takes the schema from the file metadata rather than inferring it from the data.

### Options and Configurations:

You can customize the behavior of reading and writing Parquet files by using various options. For example:

#### Reading Parquet Files with Options:

```python
# Read the Parquet file with options
df_parquet = spark.read.option("mergeSchema", "true").parquet(parquet_file_path)
```

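Here, the `mergeSchema` option is set to `true` so that Spark reconciles Parquet files whose schemas differ but are compatible (for example, files that add columns) into a single DataFrame. It is off by default because merging schemas is relatively expensive.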

#### Writing to Parquet Files with Options:

```python
# Write the DataFrame to a Parquet file with options
df.write.option("compression", "snappy").parquet(output_parquet_path)
```

In this example, the `compression` option is set to `"snappy"` to use the Snappy compression algorithm for the output Parquet files.

Parquet files provide benefits such as better compression, schema evolution support, and predicate pushdown optimization. They are widely used in big data ecosystems for efficient data storage and processing. Adjust the file paths in the examples with the actual paths in your file system.

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("ParquetReadExample").getOrCreate()

# Specify the path to the Parquet file
parquet_file_path = "/home/blackheart/Documents/Data/Apache-Spark/Data/flight_data/part-r-00000-1a9822ba-b8fb-4d8e-844a-ea30d0801b9e.gz.parquet"

# Read the Parquet file into a DataFrame
df_parquet = spark.read.parquet(parquet_file_path)

# Show the DataFrame
df_parquet.show()

# Stop the Spark session when done
spark.stop()
```

```
+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|       United States|            Romania|    1|
|       United States|            Ireland|  264|
|       United States|              India|   69|
|               Egypt|      United States|   24|
|   Equatorial Guinea|      United States|    1|
|       United States|          Singapore|   25|
|       United States|            Grenada|   54|
|          Costa Rica|      United States|  477|
|             Senegal|      United States|   29|
|       United States|   Marshall Islands|   44|
|              Guyana|      United States|   17|
|       United States|       Sint Maarten|   53|
|               Malta|      United States|    1|
|             Bolivia|      United States|   46|
|            Anguilla|      United States|   21|
|Turks and Caicos ...|      United States|  136|
|       United States|        Afghanistan|    2|
|Saint Vincent and..
```

```python
df_parquet.printSchema()
```

```
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)
```



# ORC File

ORC (Optimized Row Columnar) is a columnar storage file format that is highly optimized for use with big data processing frameworks like Apache Spark and Apache Hive. It is designed to provide better performance and storage efficiency compared to traditional row-based file formats like CSV and JSON. ORC is particularly well-suited for analytics workloads on large datasets.

Here's an example of reading and writing ORC files in Apache Spark:

### Writing to ORC:

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("ORCExample").getOrCreate()

# Assume df is the DataFrame you want to write to an ORC file

# Specify the path to save the ORC file
output_orc_path = "path/to/save/output.orc"

# Write the DataFrame to an ORC file
df.write.orc(output_orc_path, mode="overwrite")

# Stop the Spark session when done
spark.stop()
```

In this example, the `write.orc` method writes the DataFrame to an ORC directory. `mode="overwrite"` replaces any existing output at that path. Other values for `mode` are `"append"`, `"ignore"`, and `"error"` (the default).

### Reading ORC Files:

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("ORCReadExample").getOrCreate()

# Specify the path to the ORC file
orc_file_path = "path/to/your/orc/file.orc"

# Read the ORC file into a DataFrame
df_orc = spark.read.orc(orc_file_path)

# Show the DataFrame
df_orc.show()

# Stop the Spark session when done
spark.stop()
```

The `read.orc` method reads an ORC file or directory into a DataFrame. ORC files are self-describing, so Spark takes the schema from the file metadata rather than inferring it from the data.

### Options and Configurations:

You can customize the behavior of reading and writing ORC files by using various options. For example:

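#### Reading ORC Files with Options:

```python
# Read the ORC files, merging their schemas into one
df_orc = spark.read.option("mergeSchema", "true").orc(orc_file_path)
```

Here, the `mergeSchema` option (available since Spark 3.0) is set to `true` to reconcile ORC files with different but compatible schemas. Note that `compression` is a write-side option only: the codec is recorded in each ORC file, so Spark detects it automatically when reading.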

#### Writing to ORC Files with Options:

```python
# Write the DataFrame to an ORC file with options
df.write.option("compression", "snappy").orc(output_orc_path)
```

In this example, the `compression` option is set to `"snappy"` to use the Snappy compression algorithm for the output ORC files.

ORC files provide benefits such as better compression, predicate pushdown optimization, and improved query performance. They are widely used in big data ecosystems for efficient data storage and processing. Adjust the file paths in the examples with the actual paths in your file system.

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("ORCReadExample").getOrCreate()

# Specify the path to the ORC file
orc_file_path = "/home/blackheart/Documents/Data/Apache-Spark/Data/flight_data/part-r-00000-2c4f7d96-e703-4de3-af1b-1441d172c80f.snappy.orc"

# Read the ORC file into a DataFrame
df_orc = spark.read.orc(orc_file_path)

# Show the DataFrame
df_orc.show()

# Stop the Spark session when done
spark.stop()
```

```
+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|       United States|            Romania|    1|
|       United States|            Ireland|  264|
|       United States|              India|   69|
|               Egypt|      United States|   24|
|   Equatorial Guinea|      United States|    1|
|       United States|          Singapore|   25|
|       United States|            Grenada|   54|
|          Costa Rica|      United States|  477|
|             Senegal|      United States|   29|
|       United States|   Marshall Islands|   44|
|              Guyana|      United States|   17|
|       United States|       Sint Maarten|   53|
|               Malta|      United States|    1|
|             Bolivia|      United States|   46|
|            Anguilla|      United States|   21|
|Turks and Caicos ...|      United States|  136|
|       United States|        Afghanistan|    2|
|Saint Vincent and..
```

# Text Files

In Apache Spark, reading and writing text files is a common operation. Here's an example of how to read and write text files using Spark:

### Writing to Text File:

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("TextFileWriteExample").getOrCreate()

# Assume df is a DataFrame with a single string column (the text writer accepts nothing else)

# Specify the path to save the text output
output_text_path = "path/to/save/output.txt"

# Write the DataFrame as text; the save mode is set with .mode(),
# because write.text() does not take a mode argument
df.write.mode("overwrite").text(output_text_path, compression="gzip")

# Stop the Spark session when done
spark.stop()
```

In this example:
- `write.text`: The method used to write the DataFrame as text. The DataFrame must contain exactly one column of string type; each row becomes one line.
- `output_text_path`: The path of the output directory that Spark creates.
- `compression="gzip"`: Optional. Specifies the compression codec to use. Here, it's set to Gzip compression.
- `.mode("overwrite")`: Specifies that if the output directory already exists, it should be overwritten.

### Reading Text File:

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("TextFileReadExample").getOrCreate()

# Specify the path to the text file
text_file_path = "path/to/your/text/file.txt"

# Read the text file into a DataFrame
df_text = spark.read.text(text_file_path)

# Show the DataFrame
df_text.show(truncate=False)

# Stop the Spark session when done
spark.stop()
```

In this example:
- `read.text`: The method used to read a text file into a DataFrame.
- `text_file_path`: The path to the text file.

### Options and Configurations:

You can customize the behavior of reading and writing text files by using various options. For example:

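#### Reading Text File with Options:

```python
# Read each text file as a single row instead of one row per line
df_text = spark.read.option("wholetext", "true").text(text_file_path)
```

Here, the `wholetext` option is set to `true` so that the entire content of each file is loaded as one row. The text source has no `header` option: every line is data, and the result always has a single string column named `value`.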

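#### Writing to Text File with Options:

```python
# Write the DataFrame to a text file with a custom line separator
df.write.option("lineSep", ";").text(output_text_path)
```

In this example, the `lineSep` option is set to `;` so that rows are separated by semicolons instead of newlines. The text writer has no field `delimiter` option, because it writes a single string column; to produce delimited fields, use the CSV writer or concatenate the columns into one string first.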

Text files are a simple and versatile format, but keep in mind that they might not be the most efficient for large-scale data processing compared to columnar formats like Parquet or ORC. Adjust the file paths and options in the examples based on your specific use case.
