# **PySpark: Basics**

**Installing PySpark**

In [23]:
!pip install pyspark



**Importing the PySpark Library**

In [24]:
import pyspark

**Importing SparkSession**

In [25]:
from pyspark.sql import SparkSession

**Starting a PySpark Session**

In [26]:
# getOrCreate() reuses an existing session if one is already running
spark = SparkSession.builder.appName("dataframe").getOrCreate()

**Reading a CSV File**

In [27]:
# header=True uses the first row as column names; inferSchema=True detects column types
df_pyspark = spark.read.csv("test2.csv", header=True, inferSchema=True)

**Checking the First 3 Rows of the PySpark DataFrame (returned as a list of Row objects)**

In [28]:
df_pyspark.head(3)

[Row(Name='Krish', age=31, Experience=10, Salary=30000),
 Row(Name='Sudhanshu', age=30, Experience=8, Salary=25000),
 Row(Name='Sunny', age=29, Experience=4, Salary=20000)]

**Using printSchema() to display the DataFrame's schema as a tree of column names and data types**

In [29]:
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- Experience: integer (nullable = true)
 |-- Salary: integer (nullable = true)



**Viewing the PySpark DataFrame and Its Data Types**

In [30]:
df_pyspark.show()

+---------+----+----------+------+
|     Name| age|Experience|Salary|
+---------+----+----------+------+
|    Krish|  31|        10| 30000|
|Sudhanshu|  30|         8| 25000|
|    Sunny|  29|         4| 20000|
|     Paul|  24|         3| 20000|
|   Harsha|  21|         1| 15000|
|  Shubham|  23|         2| 18000|
|   Mahesh|null|      null| 40000|
|     null|  34|        10| 38000|
|     null|  36|      null|  null|
+---------+----+----------+------+



In [31]:
type(df_pyspark)

pyspark.sql.dataframe.DataFrame

In [32]:
df_pyspark['Name']

Column<'Name'>

In [33]:
df_pyspark.dtypes

[('Name', 'string'), ('age', 'int'), ('Experience', 'int'), ('Salary', 'int')]

**Selecting the Name and Experience Columns from the PySpark DataFrame**

In [34]:
df_pyspark.select(['Name','Experience']).show()

+---------+----------+
|     Name|Experience|
+---------+----------+
|    Krish|        10|
|Sudhanshu|         8|
|    Sunny|         4|
|     Paul|         3|
|   Harsha|         1|
|  Shubham|         2|
|   Mahesh|      null|
|     null|        10|
|     null|      null|
+---------+----------+



**Checking Summary Statistics of the DataFrame**

In [35]:
df_pyspark.describe().show()

+-------+------+------------------+------------------+-----------------+
|summary|  Name|               age|        Experience|           Salary|
+-------+------+------------------+------------------+-----------------+
|  count|     7|                 8|                 7|                8|
|   mean|  null|              28.5| 5.428571428571429|          25750.0|
| stddev|  null|5.3718844791323335|3.8234863173611093|9361.776388210581|
|    min|Harsha|                21|                 1|            15000|
|    max| Sunny|                36|                10|            40000|
+-------+------+------------------+------------------+-----------------+



**Adding a Column to the PySpark DataFrame**

In [36]:
# Add a column; withColumn returns a new DataFrame, so the result is reassigned
df_pyspark = df_pyspark.withColumn('Experience After 2 year', df_pyspark['Experience'] + 2)
df_pyspark.show()

+---------+----+----------+------+-----------------------+
|     Name| age|Experience|Salary|Experience After 2 year|
+---------+----+----------+------+-----------------------+
|    Krish|  31|        10| 30000|                     12|
|Sudhanshu|  30|         8| 25000|                     10|
|    Sunny|  29|         4| 20000|                      6|
|     Paul|  24|         3| 20000|                      5|
|   Harsha|  21|         1| 15000|                      3|
|  Shubham|  23|         2| 18000|                      4|
|   Mahesh|null|      null| 40000|                   null|
|     null|  34|        10| 38000|                     12|
|     null|  36|      null|  null|                   null|
+---------+----+----------+------+-----------------------+



**Renaming a Column in the PySpark DataFrame**

In [37]:
# Rename a column; the result is only displayed here, so df_pyspark itself keeps 'Name'
df_pyspark.withColumnRenamed('Name', 'New Name').show()

+---------+----+----------+------+-----------------------+
| New Name| age|Experience|Salary|Experience After 2 year|
+---------+----+----------+------+-----------------------+
|    Krish|  31|        10| 30000|                     12|
|Sudhanshu|  30|         8| 25000|                     10|
|    Sunny|  29|         4| 20000|                      6|
|     Paul|  24|         3| 20000|                      5|
|   Harsha|  21|         1| 15000|                      3|
|  Shubham|  23|         2| 18000|                      4|
|   Mahesh|null|      null| 40000|                   null|
|     null|  34|        10| 38000|                     12|
|     null|  36|      null|  null|                   null|
+---------+----+----------+------+-----------------------+



**Dropping a Column from the PySpark DataFrame**

In [38]:
# Drop the column added earlier; drop returns a new DataFrame, so the result is reassigned
df_pyspark = df_pyspark.drop('Experience After 2 year')

In [39]:
df_pyspark.show()

+---------+----+----------+------+
|     Name| age|Experience|Salary|
+---------+----+----------+------+
|    Krish|  31|        10| 30000|
|Sudhanshu|  30|         8| 25000|
|    Sunny|  29|         4| 20000|
|     Paul|  24|         3| 20000|
|   Harsha|  21|         1| 15000|
|  Shubham|  23|         2| 18000|
|   Mahesh|null|      null| 40000|
|     null|  34|        10| 38000|
|     null|  36|      null|  null|
+---------+----+----------+------+

