# PySpark (Python API for Spark)
* PySpark is the Python API for Apache Spark, an open-source distributed computing framework. It allows for parallel and distributed data processing across a cluster of machines.


* PySpark is built on Resilient Distributed Datasets (RDDs), fault-tolerant collections of data that can be processed in parallel. RDDs support transformations, which lazily define a new dataset, and actions, which trigger computation and return a result; together they enable complex data workflows.

* PySpark provides high-level APIs for various data processing tasks, including SQL queries, streaming data processing, and machine learning. It allows seamless integration with popular Python libraries like Pandas and NumPy.

* PySpark leverages the Spark engine's ability to distribute data processing tasks across a cluster of machines, providing scalability for large-scale data processing. It is designed to deliver high performance by utilizing in-memory computation and lazy evaluation.

Reference link - (https://spark.apache.org/)
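
Lazy evaluation, mentioned in the last bullet, can be illustrated without a cluster. The sketch below is a plain-Python analogy built on generators, not Spark code; `read_rows` and `log` are hypothetical names used only for illustration. Building the pipeline does no work, and consuming it (the analogue of a Spark action) triggers every step.

```python
# Plain-Python analogy for lazy evaluation (this is not Spark code).
log = []  # records every row that is actually read


def read_rows():
    """Hypothetical data source that logs each row as it is produced."""
    for n in range(1, 6):
        log.append(f"read {n}")
        yield n


# "Transformations": these lines only describe the pipeline.
evens = (n for n in read_rows() if n % 2 == 0)
doubled = (n * 2 for n in evens)
steps_before_action = len(log)  # nothing has been read yet

# "Action": consuming the pipeline triggers the reads and the computation.
result = list(doubled)

print(steps_before_action, result, len(log))
```

Spark applies the same idea at cluster scale: transformations build up a plan, and the plan runs only when an action asks for a result.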



In [2]:
# Install PySpark
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.9/316.9 MB 3.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425345 sha256=d23157c4cd94d3b9a2742905ab88e55ed0161be3c28a78bb7b599aafbb12db35
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


# Spark Session
* `SparkSession` is the entry point to programming Spark with the Dataset and DataFrame API.
* A Spark session lets us create and manipulate DataFrames.

Reference - https://spark.apache.org/docs/latest/sql-getting-started.html

In [3]:
# Import necessary libraries
import math  # Import the math module for mathematical operations
import numpy as np  # Import NumPy for numerical operations
import pandas as pd  # Import Pandas for data manipulation
import pyspark  # Import PySpark for distributed data processing
from pyspark.sql import SparkSession  # Import SparkSession for Spark DataFrame operations
from pyspark.sql.functions import isnan, when, count, col, isnull, asc, desc, mean
# Import specific functions from PySpark for data wrangling tasks

# Create a Spark session
spark = SparkSession.builder.master("local").appName("DataWrangling").getOrCreate()
# Initialize a Spark session with the following options:

# - "master": Specifies the master URL, in this case, "local" means running in local mode.
#   Local mode runs Spark on a single machine, useful for development and testing.
#   Other options include "yarn" for running on a Hadoop cluster or a specific master URL.

# - "appName": Specifies the application name, which helps identify the application on the Spark cluster UI.
#   In this case, the application is named "DataWrangling."

# - "getOrCreate()": Retrieves an existing Spark session if available or creates a new one if none exists.

# The Spark session serves as the entry point to programming Spark with the DataFrame and SQL API.
# It provides a unified interface to access Spark functionality and resources.

# Set this configuration to get output similar to Pandas
spark.conf.set('spark.sql.repl.eagerEval.enabled', True)
# Configure Spark to enable eager evaluation for similar output behavior to Pandas


In [6]:
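# Read the CSV file into a Spark DataFrame.
# header=True uses the first row as column names; because inferSchema is not set,
# every column is read as a string.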
df = spark.read.csv('/content/train.csv',header=True)
df.limit(5)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen ...",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. Joh...",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. ...",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Ja...",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. Willia...",male,35,0,0,373450,8.05,,S


In [7]:
# Count the rows in the DataFrame
df.count()

891

# Basic Queries using PySpark



In [8]:
# Count of values in the 'Sex' column
df.groupBy('Sex').count()
# Use the groupBy function to group the DataFrame by the 'Sex' column,
# and then apply the count function to calculate the number of occurrences for each unique value in the 'Sex' column.

Sex,count
female,314
male,577


In [9]:
# Find distinct values of the 'Embarked' column in the DataFrame
df.select('Embarked').distinct()
# Use the select function to extract the 'Embarked' column, and then apply the distinct function
# to obtain a DataFrame containing distinct values present in the 'Embarked' column.

Embarked
Q
C
S
""


In [27]:
# Select a single column ('Survived') and limit the result to 2 rows
df.select('Survived').limit(2)
# Use the select function to choose the 'Survived' column,
# and then apply the limit function to restrict the result to the first 2 rows in the DataFrame.

Survived
0
1


In [11]:
# Select several columns ('Survived', 'Age', 'Ticket') and limit the result to 5 rows
df.select('Survived', 'Age', 'Ticket').limit(5)

Survived,Age,Ticket
0,22,A/5 21171
1,38,PC 17599
1,26,STON/O2. 3101282
1,35,113803
0,35,373450


In [13]:
# Keep only the rows where the 'Age' column is not null
df.filter(col('Age').isNotNull()).limit(5)
# Use the filter function along with isNotNull condition on the 'Age' column
# to obtain a DataFrame containing rows where the 'Age' column is not null, and then limit the result to the first 5 rows.

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen ...",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. Joh...",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. ...",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Ja...",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. Willia...",male,35,0,0,373450,8.05,,S


In [14]:
# Another way to keep non-null 'Age' rows, using a SQL expression string
df.filter("Age is not NULL").limit(5)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen ...",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. Joh...",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. ...",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Ja...",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. Willia...",male,35,0,0,373450,8.05,,S


In [17]:
# Find the mean of the 'Age' column in the DataFrame
mean_ = df.select(mean(col('Age'))).take(1)[0][0]
# select(mean(...)) yields a one-row DataFrame; take(1) returns it as a list holding a single Row,
# so [0][0] extracts the mean value itself.

# Round up the mean value using math.ceil
mean_ = math.ceil(mean_)
# Use the math.ceil function to round up the mean value to the nearest integer.
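
Spark's `mean` skips null values, and `math.ceil` always moves up to the next integer, unlike `round`. A plain-Python sketch of both behaviours, using hypothetical ages rather than the Titanic data:

```python
import math
from statistics import mean

ages = [22, 38, None, 26, None, 35]  # hypothetical values; None marks a missing age

# Drop missing values before averaging, as Spark's mean does with nulls
known = [a for a in ages if a is not None]
mean_age = mean(known)

rounded_up = math.ceil(mean_age)   # always rounds toward the next integer up
rounded_nearest = round(mean_age)  # rounds to the nearest integer instead

print(mean_age, rounded_up, rounded_nearest)
```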


In [18]:
# Find the value counts of the 'Cabin' column in the DataFrame to look for the mode
df.groupBy(col('Cabin')).count().sort(desc("count")).limit(5)
# Use the groupBy function on the 'Cabin' column to group the DataFrame by cabin values,
# then apply the count function to calculate the occurrences of each unique cabin value.
# Sort the result in descending order based on the count, and limit it to the top 5 rows.
# Null is grouped like any other value, so the largest group here is the missing cabins;
# the most frequent actual cabin values are the rows that follow it.

Cabin,count
,687
B96 B98,4
G6,4
C23 C25 C27,4
F2,3


In [19]:
# Find the mode of the 'Embarked' column in the DataFrame
embarked_mode = df.groupBy(col('Embarked')).count().sort(desc("count")).take(1)[0][0]
# Use the groupBy function on the 'Embarked' column to group the DataFrame by embarkation values,
# then apply the count function to calculate the occurrences of each unique embarkation value.
# Sort the result in descending order based on the count, take the top row (mode), and extract the 'Embarked' value.

In [20]:
# Fill the missing values in the DataFrame
df = df.fillna({'Age': mean_, 'Cabin': 'C23', 'Embarked': embarked_mode})
# Use the fillna function to replace missing values in specific columns with specified values.
# In this case, missing values in the 'Age' column are replaced with the calculated mean_,
# missing values in the 'Cabin' column are replaced with 'C23', and missing values in the 'Embarked' column are replaced with the mode.

In [21]:
# Drop a single column ('Age') from the DataFrame
df.drop('Age').limit(5)
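# Use the drop function to remove the specified column ('Age') from the DataFrame,
# and then limit the result to the first 5 rows to display the modified DataFrame.
# drop returns a new DataFrame; df itself still contains 'Age' because the result is not assigned back.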

PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen ...",male,1,0,A/5 21171,7.25,C23,S
2,1,1,"Cumings, Mrs. Joh...",female,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. ...",female,0,0,STON/O2. 3101282,7.925,C23,S
4,1,1,"Futrelle, Mrs. Ja...",female,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. Willia...",male,0,0,373450,8.05,C23,S


In [22]:
# Drop multiple columns ('Age', 'Parch', 'Ticket') from the DataFrame
df.drop('Age', 'Parch', 'Ticket').limit(5)
# Use the drop function to remove the specified columns ('Age', 'Parch', 'Ticket') from the DataFrame,
# and then limit the result to the first 5 rows to display the modified DataFrame.

PassengerId,Survived,Pclass,Name,Sex,SibSp,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen ...",male,1,7.25,C23,S
2,1,1,"Cumings, Mrs. Joh...",female,1,71.2833,C85,C
3,1,3,"Heikkinen, Miss. ...",female,0,7.925,C23,S
4,1,1,"Futrelle, Mrs. Ja...",female,1,53.1,C123,S
5,0,3,"Allen, Mr. Willia...",male,0,8.05,C23,S


In [23]:
# Sort the DataFrame by the 'Age' column in descending order
df.sort(desc('Age')).limit(5)
# Use the sort function with desc (descending) order on the 'Age' column,
# and then limit the result to the first 5 rows to display the sorted DataFrame.
# Note: 'Age' was read from the CSV as a string, so this sort is lexicographic,
# which is why the top rows show 9. Cast the column to a numeric type to sort numerically.

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
183,0,3,"Asplund, Master. ...",male,9,4,2,347077,31.3875,C23,S
148,0,3,"""Ford, Miss. Robi...",female,9,2,2,W./C. 6608,34.375,C23,S
166,1,3,"""Goldsmith, Maste...",male,9,0,2,363291,20.525,C23,S
490,1,3,"""Coutts, Master. ...",male,9,1,1,C.A. 37671,15.9,C23,S
481,0,3,"Goodwin, Master. ...",male,9,5,2,CA 2144,46.9,C23,S
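
The rows above all show an age of 9 because 'Age' was read as a string, and strings are compared character by character. A plain-Python sketch with hypothetical age values shows the difference; in PySpark the analogous fix is to cast the column to a numeric type before sorting, for example with `col('Age').cast('double')`.

```python
ages_as_text = ["22", "9", "80", "35", "9"]  # hypothetical values

# String comparison goes character by character, so "9" ranks above "80"
text_desc = sorted(ages_as_text, reverse=True)

# Converting to numbers for the comparison gives the expected ordering
numeric_desc = sorted(ages_as_text, key=float, reverse=True)

print(text_desc)
print(numeric_desc)
```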


### As you reach the end of this PySpark notebook, remember: every line of code written is a step towards mastering the power of big data. Best of luck in your future data adventures, and may your insights be as profound as your code is elegant! Happy Sparking!