
# **PySpark in Google Colab**

Setting up PySpark in Colab takes more than a single `pip install`. You first need to install its dependencies: **Java 17** and **Apache Spark 3.5.1** built for **Hadoop 3.3**.

In [4]:
# Install dependencies
!apt-get install openjdk-17-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
!tar xf spark-3.5.1-bin-hadoop3.tgz
!pip install -q findspark

The next step is to set up the environment variables. This ensures that the Colab environment can correctly locate and use the installed dependencies.

Python's **os** module lets you set these variables for the current process through `os.environ`.

In [12]:
# Configure environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.1-bin-hadoop3"

# Make PySpark importable
import findspark
findspark.init('/content/spark-3.5.1-bin-hadoop3')
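
Before starting a session, it can help to confirm that both variables point to directories that actually exist. The check below is a minimal sketch; `find_missing_dirs` is a hypothetical helper written for this notebook, not part of `findspark` or PySpark.

```python
import os

def find_missing_dirs(var_names, environ=os.environ):
    """Return the names of variables that are unset or do not point to a directory."""
    missing = []
    for name in var_names:
        path = environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# Check the two variables configured above
missing = find_missing_dirs(["JAVA_HOME", "SPARK_HOME"])
if missing:
    print("Not set or not a directory:", ", ".join(missing))
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories.")
```

If a variable is reported, the most likely cause is that the install cell did not finish or that the path does not match the extracted folder name.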

With everything set up, let's run a local session to test if the installation worked correctly.

In [13]:
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("Introduction").getOrCreate()

# Test the Spark session
df = spark.createDataFrame([(1, 'foo'), (2, 'bar')], ['id', 'label'])
df.show()

+---+-----+
| id|label|
+---+-----+
|  1|  foo|
|  2|  bar|
+---+-----+



## **1 Using Spark with Python**

The first step in using Spark is to connect to a cluster.

In practice, the cluster is hosted on remote machines. One machine, known as the master, is responsible for distributing data and computations. It delegates tasks and data to the other machines in the cluster, known as workers, which process their share and return the results to the master.
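
The master/worker flow can be illustrated with a plain-Python analogy. This sketch does not use Spark at all: it uses a thread pool from the standard library to stand in for the workers, and the function names (`split_into_partitions`, `worker_task`, `master_sum_of_squares`) are hypothetical ones chosen for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, num_partitions):
    """Split data into contiguous chunks of roughly equal size."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions

def worker_task(partition):
    """Work done by one worker: sum the squares of its own partition."""
    return sum(x * x for x in partition)

def master_sum_of_squares(data, num_workers=3):
    """The master splits the data, delegates each chunk, then combines the results."""
    partitions = split_into_partitions(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(worker_task, partitions))
    return sum(partial_results)

numbers = list(range(1, 11))
print(master_sum_of_squares(numbers))  # 385
```

Spark follows the same split, process, and combine pattern, but the workers are separate machines and Spark handles the partitioning and scheduling for you.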

### **1.1 Creating a SparkSession**

Creating multiple `SparkSession` and `SparkContext` objects can lead to issues, so it is best practice to use the `SparkSession.builder.getOrCreate()` method. It returns the existing `SparkSession` if one is already present in the environment and creates a new one only if necessary, so you never end up with several sessions or contexts running at the same time.

In [14]:
# Start a local session
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("Introduction").getOrCreate()

In [15]:
# Verify the SparkSession
print(spark)

# Print Spark version
print(spark.version)

<pyspark.sql.session.SparkSession object at 0x78a9e46d4850>
3.5.1
