In [None]:
!pip install pyspark

In [1]:
import pyspark

In [2]:
# Reading a file with pandas
import pandas as pd
# Raw string so the backslashes in the Windows path are not treated as escape sequences
data_path = r"C:\Personal\Carrier Path\Data_Scientist\Advanced Phase\PySpark/"
data = pd.read_csv(data_path + "data_1.csv")
data

     Name  Age  Experience
0  Gaurav   29          10
1  Mukesh   30           8
2  Nibesh   31           4


In [7]:
data.dtypes

Name          object
Age            int64
Experience     int64
dtype: object

### Creating a PySpark session

In [3]:
# Before doing any work in PySpark, start a Spark session
from pyspark.sql import SparkSession

# 'Practice' is the application name given to this Spark session
spark = SparkSession.builder.appName('Practice').getOrCreate()

**Why do we need to create a PySpark session?**
* A SparkSession is the entry point to Spark from Python. Reading data, creating DataFrames, and running SQL queries all go through it, so it has to exist before any distributed processing can start.
***
**What is the relevance of `appName('Practice')`?**
* `appName()` is a method of the `SparkSession.Builder` class that sets a name for your Spark application.
* A meaningful name helps with monitoring and debugging, especially when several Spark applications run concurrently, because it lets you identify your application in the Spark UI and in the logs.

**What is the difference between SparkContext and SparkSession?**
* SparkSession is the high-level entry point (introduced in Spark 2.0) for creating DataFrames, executing SQL queries, and processing structured data.
* SparkContext is the lower-level entry point. It represents the connection to the cluster, manages the execution context, and works directly with RDDs, the low-level distributed collections that DataFrames are built on.
* Every SparkSession wraps a SparkContext, which is available as `spark.sparkContext`.

In [4]:
spark

In [5]:
# Importing data through pyspark
df_pyspark = spark.read.csv(data_path + "data_1.csv")

In [6]:
# Evaluating a PySpark DataFrame shows only its schema (column names and types), not its rows
df_pyspark

DataFrame[_c0: string, _c1: string, _c2: string]

In [8]:
# Without the header option, the first row is treated as data rather than as the header
df_pyspark.show()

+------+---+----------+
|   _c0|_c1|       _c2|
+------+---+----------+
|  Name|Age|Experience|
|Gaurav| 29|        10|
|Mukesh| 30|         8|
|Nibesh| 31|         4|
+------+---+----------+



In [9]:
# Here the first row is used as the header of the PySpark DataFrame
df_pyspark1 = spark.read.option('header','true').csv(data_path + "data_1.csv")

In [10]:
df_pyspark1.show()

+------+---+----------+
|  Name|Age|Experience|
+------+---+----------+
|Gaurav| 29|        10|
|Mukesh| 30|         8|
|Nibesh| 31|         4|
+------+---+----------+



**spark.read.option('header','true').csv(data_path + "data_1.csv").show()**
* `option('header','true')`: options are set on the reader before the data is loaded, and they control how the source is parsed. Here the option tells Spark to use the first row as column names.
* `option()` takes two parameters: a) `key`: the name of the option (for example `'header'`) b) `value`: the value to set for that option (for example `'true'`).

In [11]:
# Compare the types of the PySpark DataFrame and the pandas DataFrame
print(type(df_pyspark1) , type(data))

<class 'pyspark.sql.dataframe.DataFrame'> <class 'pandas.core.frame.DataFrame'>


In [12]:
# head() also works here
# Unlike pandas, it returns only one row by default
df_pyspark1.head()

Row(Name='Gaurav', Age='29', Experience='10')

In [13]:
# Unlike pandas, where head(n) returns a DataFrame, here it returns a list of Row objects
df_pyspark1.head(5)

[Row(Name='Gaurav', Age='29', Experience='10'),
 Row(Name='Mukesh', Age='30', Experience='8'),
 Row(Name='Nibesh', Age='31', Experience='4')]

In [14]:
# Print the column names and data types of the DataFrame
# Every column is a string here because the CSV was read without schema inference
df_pyspark1.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Experience: string (nullable = true)

