**Topics Covered**
* PySpark Dataframe
* Reading the dataset
* Checking the data types of the columns (schema)
* Selecting Columns and Indexing
* Summary statistics with describe(), compared with pandas
* Adding Columns 
* Dropping Columns
* Renaming Columns

In [1]:
import pyspark
import pandas as pd

In [2]:
# create a pyspark session
from pyspark.sql import SparkSession
pyspark_session = SparkSession.builder.appName('practice').getOrCreate()
pyspark_session

In [3]:
# Data import using pyspark
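# Raw string: backslashes in the Windows path are kept literally instead of being read as escape sequences
data_path = r"C:\Personal\Carrier Path\Data_Scientist\Advanced Phase\PySpark/"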
data = pyspark_session.read.option('header','true').csv(data_path + "data_1.csv")
data.show()

+------+---+----------+
|  Name|Age|Experience|
+------+---+----------+
|Gaurav| 29|        10|
|Mukesh| 30|         8|
|Nibesh| 31|         4|
+------+---+----------+
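
The Windows path above contains backslashes inside a string literal, which Python may try to read as escape sequences. A minimal sketch of two ways to build such a path safely, assuming the same folder layout as the notebook (the use of `PureWindowsPath` is an illustration, not part of the original code):

```python
from pathlib import PureWindowsPath

# A raw string keeps every backslash literal, so sequences such as "\P" or "\A"
# are never interpreted as (invalid) escape sequences.
data_path = r"C:\Personal\Carrier Path\Data_Scientist\Advanced Phase\PySpark"

# PureWindowsPath joins path pieces without manual string concatenation.
# It never touches the file system, so this runs on any operating system.
csv_path = PureWindowsPath(data_path) / "data_1.csv"

print(csv_path)             # Windows-style path with backslashes
print(csv_path.as_posix())  # the same path written with forward slashes
```

Either form can be converted with `str(...)` before it is handed to a reader function.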



In [4]:
# data import using pandas - we will use this to compare with pyspark functionality 
records = pd.read_csv(data_path + "data_1.csv")
records

     Name  Age  Experience
0  Gaurav   29          10
1  Mukesh   30           8
2  Nibesh   31           4


In [5]:
# Check the schema (variables and their datatypes)
# This corresponds to DataFrame.info() in pandas
data.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Experience: string (nullable = true)



* By default, reading the data with pyspark_session.read.option('header', 'true').csv(data_path + 'data_1.csv') assigns the string type to every column. To let Spark detect the column types from the data, pass inferSchema = True to csv(), as in the next cell.


In [6]:
data1 = pyspark_session.read.option('header', 'true').csv(data_path + 'data_1.csv',inferSchema = True)
data1.printSchema()
# Now the column data types match the values stored in the CSV file.

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)



In [7]:
# Use both the 'inferSchema' and the 'header' option within the 'csv' function
data2 = pyspark_session.read.csv(data_path + 'data_1.csv' , header = True , inferSchema = True)

In [8]:
data2.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)



In [9]:
# Get the list of column names of the dataframe - this works the same way as in pandas
data2.columns

['Name', 'Age', 'Experience']

In [10]:
# pandas: get a single variable as a dataframe
records.iloc[:, 0:1]

     Name
0  Gaurav
1  Mukesh
2  Nibesh


In [11]:
# pyspark: get a single variable as a dataframe (without show(), only the schema is displayed)
data2.select('Name')

DataFrame[Name: string]

In [12]:
data2.select('Name').show()

+------+
|  Name|
+------+
|Gaurav|
|Mukesh|
|Nibesh|
+------+



In [13]:
# pandas: get multiple variables as a dataframe
records.iloc[:, 0:2]

     Name  Age
0  Gaurav   29
1  Mukesh   30
2  Nibesh   31


In [14]:
# pyspark: get multiple variables as a dataframe
data2.select(['Name','Age']).show()

+------+---+
|  Name|Age|
+------+---+
|Gaurav| 29|
|Mukesh| 30|
|Nibesh| 31|
+------+---+



In [15]:
# pyspark: get the data types of all the variables
# The result is a list of (column name, type) tuples
# The attribute has the same name in PySpark and pandas (dtypes); only the format of the output differs
data2.dtypes

[('Name', 'string'), ('Age', 'int'), ('Experience', 'int')]

In [16]:
data2.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)



In [17]:
# pandas: get the data types of all the variables
# The result is a pandas Series
records.dtypes

Name          object
Age            int64
Experience     int64
dtype: object

In [19]:
# pyspark: describe
data2.describe().show()

+-------+------+----+-----------------+
|summary|  Name| Age|       Experience|
+-------+------+----+-----------------+
|  count|     3|   3|                3|
|   mean|  NULL|30.0|7.333333333333333|
| stddev|  NULL| 1.0|3.055050463303893|
|    min|Gaurav|  29|                4|
|    max|Nibesh|  31|               10|
+-------+------+----+-----------------+



In [20]:
# Pandas
records.describe()

        Age  Experience
count   3.0         3.0
mean   30.0    7.333333
std     1.0     3.05505
min    29.0         4.0
25%    29.5         6.0
50%    30.0         8.0
75%    30.5         9.0
max    31.0        10.0


**pandas vs. PySpark: differences in describe()**
* By default, pandas describes only the variables with numerical data types, whereas PySpark also describes string variables (for these, mean and stddev are NULL).
* PySpark reports 5 summary statistics (count, mean, stddev, min, max), whereas pandas reports 8: the same five plus the three quartiles (25%, 50%, 75%).

### Adding columns 

In [21]:
# Pyspark 
data2 = data2.withColumn('Exp_after_2_yrs' , data2['Experience'] + 2)

In [22]:
data2.show()

+------+---+----------+---------------+
|  Name|Age|Experience|Exp_after_2_yrs|
+------+---+----------+---------------+
|Gaurav| 29|        10|             12|
|Mukesh| 30|         8|             10|
|Nibesh| 31|         4|              6|
+------+---+----------+---------------+



In [23]:
# pyspark: dropping the columns
data2 = data2.drop('Exp_after_2_yrs')

In [24]:
data2.show()

+------+---+----------+
|  Name|Age|Experience|
+------+---+----------+
|Gaurav| 29|        10|
|Mukesh| 30|         8|
|Nibesh| 31|         4|
+------+---+----------+



In [27]:
# pyspark: renaming the columns
data2.withColumnRenamed('Age', 'Ageing')

DataFrame[Name: string, Ageing: int, Experience: int]

**Although I renamed a column above, the new name does not appear when I call 'data2.show()'**
* withColumnRenamed returns a new DataFrame with the renamed column; the original data2 remains unchanged because Spark DataFrames are immutable. Assign the result back to data2 (or to another variable) to keep the change.

In [29]:
data2 = data2.withColumnRenamed('Age', 'Ageing')

In [30]:
data2.show()

+------+------+----------+
|  Name|Ageing|Experience|
+------+------+----------+
|Gaurav|    29|        10|
|Mukesh|    30|         8|
|Nibesh|    31|         4|
+------+------+----------+

