# Spark Types

In Spark, the term "Spark types" refers to the data types used to represent and process data within the framework. Spark provides a set of built-in data types for use with DataFrames, Datasets, and other Spark APIs. Here are some of the most common ones:

1. **Numeric Types:**
   - **`IntegerType`:** Represents 32-bit signed integers (`int` in Python).
   - **`LongType`:** Represents 64-bit signed integers (`int` in Python; Python 3 has no separate `long` type).
   - **`FloatType`:** Represents 32-bit floating-point numbers (`float` in Python).
   - **`DoubleType`:** Represents 64-bit double-precision floating-point numbers (`float` in Python, which is itself a 64-bit double).

2. **String Type:**
   - **`StringType`:** Represents variable-length character strings (`str` in Python).

3. **Boolean Type:**
   - **`BooleanType`:** Represents boolean values (`bool` in Python).

4. **Binary Type:**
   - **`BinaryType`:** Represents binary data, typically used for storing raw binary blobs.

5. **Timestamp and Date Types:**
   - **`TimestampType`:** Represents a timestamp with both date and time information.
   - **`DateType`:** Represents a date without time information.

6. **ArrayType:**
   - **`ArrayType`:** Represents an array or list of elements. Elements can be of any valid Spark type, including nested arrays.

7. **MapType:**
   - **`MapType`:** Represents a map or dictionary where keys and values can be of any valid Spark type.

8. **StructType and StructField:**
   - **`StructType`:** Represents a structure or a row with named fields.
   - **`StructField`:** Represents a field within a `StructType`.

9. **Decimal Type:**
   - **`DecimalType`:** Represents fixed-precision decimal numbers with a configurable precision (up to 38 digits) and scale (`decimal.Decimal` in Python). It is often used for financial data.

10. **User-Defined Types (UDTs):**
    - Spark allows users to define custom data types by creating User-Defined Types (UDTs). This is useful when working with specialized data types that are not covered by the built-in types.
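
To make the numeric widths and the decimal type in the list above concrete, here is a minimal sketch that uses only the Python standard library, not Spark itself. The names `INT32_MAX`, `INT64_MAX`, and `to_float32` are made up for this illustration; the bit widths are the ones listed above.

```python
import struct
from decimal import Decimal

# Value ranges implied by the bit widths of IntegerType and LongType
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float (the width FloatType uses)."""
    return struct.unpack("f", struct.pack("f", x))[0]


print(INT32_MAX)         # 2147483647
print(INT64_MAX)         # 9223372036854775807
print(to_float32(0.1))   # 0.10000000149011612 (precision lost at 32 bits)

# Binary floating point cannot represent 0.1 exactly; decimals can
print(0.1 + 0.2 == 0.3)                                      # False
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))  # True
```

The last two lines hint at why a decimal type is a common choice for monetary amounts: sums of decimal values stay exact, while binary floating-point sums can drift.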

These types are used when defining the schema of a DataFrame or Dataset and when specifying the data types of columns. In PySpark, each Spark type maps to a Python type (for example, `IntegerType` corresponds to Python's `int`). Understanding and using these types correctly is essential for effective data processing and analysis with Spark.

# Working with Different Types of Data

Let's go through examples of working with different types of data, including numbers, strings, booleans, and complex types, in PySpark DataFrames.

### Working with Numbers:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Sample data with numbers
data = [(1, 10.5), (2, 20.3), (3, 30.1)]

# Define the schema
schema = ["id", "value"]

# Create a DataFrame
df_numbers = spark.createDataFrame(data, schema=schema)

# Show the DataFrame
df_numbers.show()

# Perform operations on numeric columns
result_numbers = df_numbers.withColumn("double_value", col("value") * 2)

# Show the result
result_numbers.show()

# Keep the session running: the examples below reuse `spark`.
# Call spark.stop() only after the last example has run.
```

In this example, we create a DataFrame with numeric data and perform a simple operation, doubling the values in the "value" column.

### Working with Strings:

```python
# Sample data with strings
data_strings = [("Alice", "New York"), ("Bob", "San Francisco"), ("Charlie", "Los Angeles")]

# Define the schema
schema_strings = ["name", "city"]

# Create a DataFrame
df_strings = spark.createDataFrame(data_strings, schema=schema_strings)

# Show the DataFrame
df_strings.show()

# Concatenate string columns
result_strings = df_strings.withColumn("full_location", expr("name || ' in ' || city"))

# Show the result
result_strings.show()
```

Here, we create a DataFrame with strings and concatenate the "name" and "city" columns to create a new column, "full_location."

### Working with Booleans:

```python
# Sample data with booleans
data_booleans = [("Alice", True), ("Bob", False), ("Charlie", True)]

# Define the schema
schema_booleans = ["name", "is_student"]

# Create a DataFrame
df_booleans = spark.createDataFrame(data_booleans, schema=schema_booleans)

# Show the DataFrame
df_booleans.show()

# Filter rows where the boolean column is true (no comparison with True needed)
result_booleans = df_booleans.filter(col("is_student"))

# Show the result
result_booleans.show()
```

In this example, we create a DataFrame with boolean values and filter rows based on a boolean condition.

### Working with Complex Types:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Sample data with complex types
data_complex = [(1, ("Alice", "New York")), (2, ("Bob", "San Francisco")), (3, ("Charlie", "Los Angeles"))]

# Define the schema with a struct type
schema_complex = StructType([
    StructField("id", IntegerType(), True),
    StructField("info", StructType([
        StructField("name", StringType(), True),
        StructField("city", StringType(), True)
    ]), True)
])

# Create a DataFrame
df_complex = spark.createDataFrame(data_complex, schema=schema_complex)

# Show the DataFrame
df_complex.show()

# Access elements in a struct column
result_complex = df_complex.withColumn("person_name", col("info.name"))

# Show the result
result_complex.show()
```

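In this example, we define an explicit schema with a nested `StructType` and use dot notation (`info.name`) to extract a field from the struct column into its own column. Let's continue with more examples of different types of data in PySpark DataFrames.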

### Working with Arrays:

```python
from pyspark.sql.functions import split

# Sample data with arrays
data_arrays = [("Alice", "New York,Chicago"), ("Bob", "San Francisco,Los Angeles"), ("Charlie", "Seattle")]

# Define the schema
schema_arrays = ["name", "locations"]

# Create a DataFrame
df_arrays = spark.createDataFrame(data_arrays, schema=schema_arrays)

# Show the DataFrame
df_arrays.show()

# Split the string in the "locations" column into an array
result_arrays = df_arrays.withColumn("location_array", split(col("locations"), ","))

# Show the result
result_arrays.show()
```

In this example, we create a DataFrame with a string column containing comma-separated locations. We use the `split` function to split the string into an array.

### Working with Maps:

```python
from pyspark.sql.functions import create_map

# Sample data with maps
data_maps = [("Alice", "NY"), ("Bob", "SF"), ("Charlie", "LA")]

# Define the schema
schema_maps = ["name", "abbreviation"]

# Create a DataFrame
df_maps = spark.createDataFrame(data_maps, schema=schema_maps)

# Show the DataFrame
df_maps.show()

# Create a map column
result_maps = df_maps.withColumn("location_map", create_map(col("name"), col("abbreviation")))

# Show the result
result_maps.show()
```

Here, we create a DataFrame with two columns, "name" and "abbreviation." We use the `create_map` function to create a map column, mapping names to abbreviations.

### Working with Dates and Timestamps:

```python
from pyspark.sql.functions import current_date, current_timestamp, date_add

# current_date() and current_timestamp() return Column expressions, so they
# cannot be placed inside the local rows passed to createDataFrame.
# Build a one-row DataFrame first, then add them as columns.
df_dates = (
    spark.createDataFrame([(1,)], ["id"])
    .withColumn("current_date", current_date())
    .withColumn("current_timestamp", current_timestamp())
)

# Show the DataFrame
df_dates.show()

# Add 5 days to the current date
result_dates = df_dates.withColumn("future_date", date_add(col("current_date"), 5))

# Show the result
result_dates.show()
```

In this example, we create a one-row DataFrame and add a column for the current date and another for the current timestamp. We then add 5 days to the current date with the `date_add` function.

These examples cover a range of scenarios when working with different types of data in PySpark DataFrames. Remember to adjust these examples based on your specific use cases and data.

# Example Output

The tables below show the output of the examples above when run in a notebook, with Python warnings suppressed via `warnings.filterwarnings("ignore")`.

### Output of the Numbers Example:

+---+-----+
| id|value|
+---+-----+
|  1| 10.5|
|  2| 20.3|
|  3| 30.1|
+---+-----+

+---+-----+------------+
| id|value|double_value|
+---+-----+------------+
|  1| 10.5|        21.0|
|  2| 20.3|        40.6|
|  3| 30.1|        60.2|
+---+-----+------------+



                                                                                

### Output of the Strings Example:

+-------+-------------+
|   name|         city|
+-------+-------------+
|  Alice|     New York|
|    Bob|San Francisco|
|Charlie|  Los Angeles|
+-------+-------------+

+-------+-------------+--------------------+
|   name|         city|       full_location|
+-------+-------------+--------------------+
|  Alice|     New York|   Alice in New York|
|    Bob|San Francisco|Bob in San Francisco|
|Charlie|  Los Angeles|Charlie in Los An...|
+-------+-------------+--------------------+



### Output of the Booleans Example:

+-------+----------+
|   name|is_student|
+-------+----------+
|  Alice|      true|
|    Bob|     false|
|Charlie|      true|
+-------+----------+





+-------+----------+
|   name|is_student|
+-------+----------+
|  Alice|      true|
|Charlie|      true|
+-------+----------+



                                                                                

### Output of the Complex Types Example:

+---+--------------------+
| id|                info|
+---+--------------------+
|  1|   {Alice, New York}|
|  2|{Bob, San Francisco}|
|  3|{Charlie, Los Ang...|
+---+--------------------+

+---+--------------------+-----------+
| id|                info|person_name|
+---+--------------------+-----------+
|  1|   {Alice, New York}|      Alice|
|  2|{Bob, San Francisco}|        Bob|
|  3|{Charlie, Los Ang...|    Charlie|
+---+--------------------+-----------+



### Output of the Arrays Example:

+-------+--------------------+
|   name|           locations|
+-------+--------------------+
|  Alice|    New York,Chicago|
|    Bob|San Francisco,Los...|
|Charlie|             Seattle|
+-------+--------------------+

+-------+--------------------+--------------------+
|   name|           locations|      location_array|
+-------+--------------------+--------------------+
|  Alice|    New York,Chicago| [New York, Chicago]|
|    Bob|San Francisco,Los...|[San Francisco, L...|
|Charlie|             Seattle|           [Seattle]|
+-------+--------------------+--------------------+



### Output of the Maps Example:

+-------+------------+
|   name|abbreviation|
+-------+------------+
|  Alice|          NY|
|    Bob|          SF|
|Charlie|          LA|
+-------+------------+

+-------+------------+---------------+
|   name|abbreviation|   location_map|
+-------+------------+---------------+
|  Alice|          NY|  {Alice -> NY}|
|    Bob|          SF|    {Bob -> SF}|
|Charlie|          LA|{Charlie -> LA}|
+-------+------------+---------------+




# **Thank You!**