# Reading Data

Reading data in PySpark means using a `SparkSession` to create a DataFrame, a distributed collection of data organized into named columns. PySpark reads from many sources, including CSV, JSON, Parquet, and Avro (Avro requires the external `spark-avro` package). The steps and terms involved are described below:

### 1. **Initialize a Spark Session:**

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
```

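Here, we create a `SparkSession` with the application name "example". `getOrCreate()` returns the session that is already running, if there is one, and otherwise creates a new one. The session is the entry point to programming Spark with the DataFrame and SQL APIs.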

### 2. **Read Data into a DataFrame:**

You can use the `read` attribute of the `SparkSession` to read data from different formats. For example, reading from a CSV file:

```python
# Read CSV data into a DataFrame
df = spark.read.csv("path/to/your/file.csv", header=True, inferSchema=True)
```

- **`csv("path/to/your/file.csv")`:**
  - Specifies the path to the CSV file.
- **`header=True`:**
  - Specifies that the first row of the CSV file contains the column names.
- **`inferSchema=True`:**
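  - Asks Spark to infer each column's data type from the data, which requires an extra pass over the file. Without it, every column is read as a string.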

### 3. **Show the DataFrame:**

```python
# Show the first few rows of the DataFrame
df.show()
```

The `show()` method displays the first 20 rows of the DataFrame by default, in a tabular format, and truncates string values longer than 20 characters.

### 4. **DataFrame Schema:**

```python
# Display the schema of the DataFrame
df.printSchema()
```

The `printSchema()` method prints the schema of the DataFrame as a tree, showing each column's name, data type, and nullability.

### Example:

Putting it all together:

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Read CSV data into a DataFrame
df = spark.read.csv("path/to/your/file.csv", header=True, inferSchema=True)

# Show the first few rows of the DataFrame
df.show()

# Display the schema of the DataFrame
df.printSchema()

# Stop the Spark session when done
spark.stop()
```

Replace "path/to/your/file.csv" with the actual path to your CSV file.

### Terms:

- **Spark Session (`SparkSession`):**
  - The entry point to programming Spark. It creates DataFrames and executes SQL queries.

- **DataFrame:**
  - A distributed collection of data organized into named columns. It is the primary abstraction in PySpark.

- **Read API (`read`):**
  - An attribute of `SparkSession` used for reading data from external sources.

- **CSV (Comma-Separated Values):**
  - A popular plain-text format for tabular data, where each line of the file corresponds to a row, and values are separated by commas.

- **Header:**
  - The first row of a CSV file that contains column names.

- **`inferSchema`:**
  - A read option that makes Spark infer the data types of columns automatically when reading data.

- **Schema:**
  - The structure that defines the data types of each column in a DataFrame.

- **`show()`:**
  - A DataFrame method that displays the first rows of a DataFrame (20 by default).

- **`printSchema()`:**
  - A DataFrame method that prints the schema of a DataFrame.

These terms and steps give a basic overview of how to read data into a PySpark DataFrame. Depending on the format of your data, use a different reader method (e.g., `read.json()` or `read.parquet()`) with format-specific options.

In [1]:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

23/11/10 00:48:43 WARN Utils: Your hostname, blackheart resolves to a loopback address: 127.0.1.1; using 192.168.74.222 instead (on interface wlp1s0)
23/11/10 00:48:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/10 00:48:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
# Read CSV data into a DataFrame
df = spark.read.csv("/home/blackheart/Documents/Data/Apache-Spark/Data/flight_data/2015-summary.csv", header=True, inferSchema=True)


                                                                                

In [4]:
# Show the first few rows of the DataFrame
df.show()

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|       United States|            Romania|   15|
|       United States|            Croatia|    1|
|       United States|            Ireland|  344|
|               Egypt|      United States|   15|
|       United States|              India|   62|
|       United States|          Singapore|    1|
|       United States|            Grenada|   62|
|          Costa Rica|      United States|  588|
|             Senegal|      United States|   40|
|             Moldova|      United States|    1|
|       United States|       Sint Maarten|  325|
|       United States|   Marshall Islands|   39|
|              Guyana|      United States|   64|
|               Malta|      United States|    1|
|            Anguilla|      United States|   41|
|             Bolivia|      United States|   30|
|       United States|           Paraguay|    6|
|             Algeri

In [5]:
# Display the schema of the DataFrame
df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)



In [8]:
flightData2015 = spark\
  .read\
  .option("inferSchema", "true")\
  .option("header", "true")\
  .csv("/home/blackheart/Documents/Data/Apache-Spark/Data/flight_data/2015-summary.csv")

In [9]:
flightData2015.show()

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|       United States|            Romania|   15|
|       United States|            Croatia|    1|
|       United States|            Ireland|  344|
|               Egypt|      United States|   15|
|       United States|              India|   62|
|       United States|          Singapore|    1|
|       United States|            Grenada|   62|
|          Costa Rica|      United States|  588|
|             Senegal|      United States|   40|
|             Moldova|      United States|    1|
|       United States|       Sint Maarten|  325|
|       United States|   Marshall Islands|   39|
|              Guyana|      United States|   64|
|               Malta|      United States|    1|
|            Anguilla|      United States|   41|
|             Bolivia|      United States|   30|
|       United States|           Paraguay|    6|
|             Algeri

In [10]:
flightData2015.createOrReplaceTempView("flight_data_2015")

# Reading Data the SQL Way

In [12]:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Read CSV data into a DataFrame
df = spark.read.csv("path/to/your/file.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary SQL table
df.createOrReplaceTempView("my_table")

# Use SQL queries to interact with the data (assumes the file has an `age` column)
result = spark.sql("SELECT * FROM my_table WHERE age > 25")

# Show the result
result.show()

# Stop the Spark session when done; skip this step if later cells still use `spark`
spark.stop()


In [14]:
sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1)
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
""")

In [15]:
dataFrameWay = flightData2015\
  .groupBy("DEST_COUNTRY_NAME")\
  .count()

In [16]:
sqlWay.explain()
dataFrameWay.explain()


== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[DEST_COUNTRY_NAME#57], functions=[count(1)])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#57, 200), ENSURE_REQUIREMENTS, [plan_id=110]
      +- HashAggregate(keys=[DEST_COUNTRY_NAME#57], functions=[partial_count(1)])
         +- FileScan csv [DEST_COUNTRY_NAME#57] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/blackheart/Documents/Data/Apache-Spark/Data/flight_data/201..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>


== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[DEST_COUNTRY_NAME#57], functions=[count(1)])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#57, 200), ENSURE_REQUIREMENTS, [plan_id=123]
      +- HashAggregate(keys=[DEST_COUNTRY_NAME#57], functions=[partial_count(1)])
         +- FileScan csv [DEST_COUNTRY_NAME#57] Batched: false, DataFilters: [], Format: CSV, Location: 

In [17]:
sqlWay.show()

+--------------------+--------+
|   DEST_COUNTRY_NAME|count(1)|
+--------------------+--------+
|            Anguilla|       1|
|              Russia|       1|
|            Paraguay|       1|
|             Senegal|       1|
|              Sweden|       1|
|            Kiribati|       1|
|              Guyana|       1|
|         Philippines|       1|
|            Djibouti|       1|
|            Malaysia|       1|
|           Singapore|       1|
|                Fiji|       1|
|              Turkey|       1|
|                Iraq|       1|
|             Germany|       1|
|              Jordan|       1|
|               Palau|       1|
|Turks and Caicos ...|       1|
|              France|       1|
|              Greece|       1|
+--------------------+--------+
only showing top 20 rows


