# **Apache Spark**

* It is a fast, general-purpose cluster computing system.

* It provides high-level APIs in Java, Scala, Python, R, and SQL.

* It makes parallel jobs easy to write and has an optimized engine that supports
  general computation graphs.

* It also supports a rich set of higher-level tools, including Spark SQL (for SQL and structured data; it replaced the earlier Shark project, Hive on Spark), MLlib (for machine learning), GraphX (for graph processing), and Spark Streaming.


## **Features**
 1. (Speed) Run some workloads up to 100x faster than Hadoop MapReduce, mainly through in-memory processing (the project's own claim).

 2. (Easy to Use) Write applications quickly in Java, Scala, Python, R, and SQL.

 3. (Generality) Combine SQL, Streaming, and complex analytics.

 4. (Runs Everywhere) Spark runs on Hadoop, Apache Mesos, Kubernetes, Standalone, or in the cloud. It can access diverse data sources.

## **PySpark Library**

*   To use Spark from Python, use the PySpark library.

# Installation of Libraries

In [1]:
! pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)


[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
! pip install pandas numpy matplotlib scikit-learn

Collecting pandas
  Obtaining dependency information for pandas from https://files.pythonhosted.org/packages/61/11/1812ef6cbd7433ad240f72161ce5f84c4c450cede4db080365d371d29117/pandas-2.2.1-cp311-cp311-win_amd64.whl.metadata
  Downloading pandas-2.2.1-cp311-cp311-win_amd64.whl.metadata (19 kB)
Collecting numpy
  Obtaining dependency information for numpy from https://files.pythonhosted.org/packages/3f/6b/5610004206cf7f8e7ad91c5a85a8c71b2f2f8051a0c0c4d5916b76d6cbb2/numpy-1.26.4-cp311-cp311-win_amd64.whl.metadata
  Using cached numpy-1.26.4-cp311-cp311-win_amd64.whl.metadata (61 kB)
Collecting matplotlib
  Obtaining dependency information for matplotlib from https://files.pythonhosted.org/packages/2d/d5/6227732ecab9165586966ccb54301e3164f61b470c954c4cf6940654fbe1/matplotlib-3.8.4-cp311-cp311-win_amd64.whl.metadata
  Downloading matplotlib-3.8.4-cp311-cp311-win_amd64.whl.metadata (5.9 kB)
Collecting scikit-learn
  Obtaining dependency information for scikit-learn from https://files.pythonh




# Basics

In [1]:
import pyspark

In [27]:
import pandas as pd
df1 = pd.read_csv('Test1.csv')
df1

       Name  age
0     Nehal   21
1    Chirag   20
2  Devanshu   19
3     Aryan   25
4     Daksh   24
5     Dhyey   30


In [28]:
type(df1)

pandas.core.frame.DataFrame

### **Spark Session**

A SparkSession is the entry point to the DataFrame and SQL APIs. It can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files.

To create a SparkSession, use the builder pattern described below.

#### **Steps to create a Spark session**

   1. Set the `JAVA_HOME` environment variable if you are running on a local Windows machine.

   2. Import the library.
   
   3. Build the session.

In [4]:
import os
os.environ["JAVA_HOME"] = "C:\\Program Files\\Java\\jdk-21"
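
Spark needs a Java runtime, so it can help to confirm that `JAVA_HOME` points at a real Java installation before building the session. The sketch below is one possible check, not part of PySpark; the helper name `check_java_home` is hypothetical.

```python
import os
from pathlib import Path


def check_java_home(env=None):
    """Return the JAVA_HOME path if it looks like a Java install, else None.

    `check_java_home` is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return None
    bin_dir = Path(java_home) / "bin"
    # The launcher is java.exe on Windows and java elsewhere.
    if (bin_dir / "java.exe").is_file() or (bin_dir / "java").is_file():
        return java_home
    return None


if check_java_home() is None:
    print("JAVA_HOME is unset or has no bin/java; Spark may fail to start.")
```

Running this right after setting `os.environ["JAVA_HOME"]` surfaces a wrong path early, rather than as an error when the session is built.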

In [3]:
from pyspark.sql import SparkSession

* `builder` - A class attribute used to construct `SparkSession` instances.

* `master()` - Sets the Spark master URL, which decides whether the application runs locally or connects to a remote cluster.

  * On a cluster, pass your cluster manager to `master()`. Depending on your cluster setup, this is usually `yarn` or a `mesos://` URL.

  * For a remote standalone master, use `spark://<ip>:<port>`, replacing `<ip>` with the IP address of the master and `<port>` with its port number.

  * Use `local[x]` to run Spark locally on a single machine. `x` must be an integer greater than 0; it sets the number of worker threads and therefore the default parallelism used when partitioning RDDs, DataFrames, and Datasets. Ideally, `x` should equal the number of CPU cores you have.

* `appName()` - Sets your application name.

* `getOrCreate()` - Returns the existing `SparkSession` if there is one; otherwise creates a new one.
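
The master URL forms described above are plain strings, so they can be built and validated before being passed to `master()`. The following is a small sketch; the helper names `local_master` and `standalone_master` are hypothetical, and 7077 is assumed here because it is the default port of a Spark standalone master.

```python
def local_master(threads=None):
    """Build a local-mode master URL: 'local[*]' or 'local[N]'."""
    if threads is None:
        return "local[*]"  # one worker thread per logical core
    if isinstance(threads, bool) or not isinstance(threads, int) or threads < 1:
        raise ValueError("threads must be a positive integer")
    return f"local[{threads}]"


def standalone_master(host, port=7077):
    """Build a standalone-cluster master URL (hypothetical helper)."""
    return f"spark://{host}:{port}"


print(local_master(4))                  # local[4]
print(standalone_master("192.0.2.10"))  # spark://192.0.2.10:7077
```

Either result can be passed straight to `SparkSession.builder.master(...)`. For YARN, the master is simply the string `yarn`.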

In [5]:
'''Alternatives that set the master explicitly:

spark = (SparkSession.builder
         .master("local[4]")
         .appName('Practise')
         .getOrCreate())

(or)

spark = (SparkSession.builder
         .master("spark://<ip>:<port>")
         .appName('Practise')
         .getOrCreate())
'''

spark = SparkSession.builder.appName('Practise').getOrCreate() # type: ignore

In [7]:
spark

In [19]:
df_pyspark1 = spark.read.csv('Test1.csv')

In [20]:
df_pyspark1

DataFrame[_c0: string, _c1: string]

In [21]:
df_pyspark1.show()

+--------+---+
|     _c0|_c1|
+--------+---+
|    Name|age|
|   Nehal| 21|
|  Chirag| 20|
|Devanshu| 19|
|   Aryan| 25|
|   Daksh| 24|
|   Dhyey| 30|
+--------+---+



By default, Spark names the columns `_c0`, `_c1` and treats the header row as data. Set the `header` option so that the first row supplies the column names instead:

In [22]:
df_pyspark = spark.read.option('header','true').csv('Test1.csv')

In [23]:
df_pyspark

DataFrame[Name: string, age: string]

In [24]:
df_pyspark.show()

+--------+---+
|    Name|age|
+--------+---+
|   Nehal| 21|
|  Chirag| 20|
|Devanshu| 19|
|   Aryan| 25|
|   Daksh| 24|
|   Dhyey| 30|
+--------+---+



In [30]:
type(df_pyspark)

pyspark.sql.dataframe.DataFrame

In [31]:
df_pyspark.show(5)

+--------+---+
|    Name|age|
+--------+---+
|   Nehal| 21|
|  Chirag| 20|
|Devanshu| 19|
|   Aryan| 25|
|   Daksh| 24|
+--------+---+
only showing top 5 rows



In [35]:
df_pyspark.head()

Row(Name='Nehal', age='21')

In [34]:
df_pyspark.head(3)

[Row(Name='Nehal', age='21'),
 Row(Name='Chirag', age='20'),
 Row(Name='Devanshu', age='19')]

In [37]:
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- age: string (nullable = true)
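
The schema above reports `age` as `string` because, unless told otherwise, the CSV reader treats every column as text. Two common ways around this are `inferSchema=True`, which makes Spark take an extra pass over the data to guess the types, or an explicit schema. Below is a minimal sketch, assuming the `spark` session and `Test1.csv` from earlier; the function names are hypothetical.

```python
def read_with_inferred_types(spark, path):
    """Read a CSV with a header row and let Spark guess column types."""
    return spark.read.csv(path, header=True, inferSchema=True)


def read_with_explicit_types(spark, path):
    """Read a CSV with a header row and a schema given as a DDL string."""
    return spark.read.csv(path, header=True, schema="Name STRING, age INT")


# Usage, given the session built earlier:
# df_typed = read_with_inferred_types(spark, 'Test1.csv')
# df_typed.printSchema()
```

With either variant, `age` should be reported as an integer type rather than `string`. The explicit schema avoids the extra pass over the file, which matters more for large inputs.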

