<a href="https://colab.research.google.com/github/Ashik9576/PySpark_Learning/blob/main/PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **PySpark Installation**

In [24]:
!pip install pyspark



In [25]:
import pyspark

In [26]:
# mount Google Drive so the CSV files are reachable from Colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [27]:
import pandas as pd
df=pd.read_csv("/content/drive/MyDrive/Pyspark/counties.csv")

In [28]:
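# The file has no header row, so read_csv above already used the first record
# (Adair, 7682) as the header. Renaming the columns here does not bring that row back.
df.columns=['Name','ID']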

In [29]:
df.head()

        Name     ID
0      Adams   4029
1  Allamakee  14330
2  Appanoose  12884
3    Audubon   6119
4     Benton  26076
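
The `head()` output above starts at Adams, although the Spark read of the same file later in this notebook shows Adair as the first record. A likely cause is that `read_csv` treats the first line as a header by default. A minimal sketch of the difference, assuming a small in-memory stand-in for the file (the three rows are copied from outputs in this notebook; `csv_text`, `lost_row` and `kept` are illustrative names):

```python
import io
import pandas as pd

# Stand-in for counties.csv: no header line, just "name,value" records.
csv_text = "Adair,7682\nAdams,4029\nAllamakee,14330\n"

# Default behaviour: the first record is consumed as the header.
lost_row = pd.read_csv(io.StringIO(csv_text))
print(len(lost_row))  # one record fewer than the file contains

# header=None keeps every record; names= supplies the column labels.
kept = pd.read_csv(io.StringIO(csv_text), header=None, names=["Name", "ID"])
print(kept.head())
```

Passing `header=None` together with `names=` avoids the separate `df.columns = [...]` step and keeps the first record.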


# **Reading file in Spark**


**First, create a Spark session**

In [30]:
from pyspark.sql import SparkSession

In [31]:
spark=SparkSession.builder.appName('Ashik').getOrCreate()

In [32]:
spark

In [33]:
df_spark=spark.read.csv("/content/drive/MyDrive/Pyspark/counties.csv")

In [34]:
df_spark.show()

+-----------+------+
|        _c0|   _c1|
+-----------+------+
|      Adair|  7682|
|      Adams|  4029|
|  Allamakee| 14330|
|  Appanoose| 12884|
|    Audubon|  6119|
|     Benton| 26076|
| Black Hawk|131090|
|      Boone| 26306|
|     Bremer| 24276|
|   Buchanan| 20958|
|Buena Vista| 20260|
|     Butler| 14867|
|    Calhoun|  9670|
|    Carroll| 20816|
|       Cass| 13956|
|      Cedar| 18499|
|Cerro Gordo| 44151|
|   Cherokee| 12072|
|  Chickasaw| 12439|
|     Clarke|  9286|
+-----------+------+
only showing top 20 rows



# **Renaming the columns**

In [38]:
df_spark=df_spark.withColumnRenamed("_c0","Name")\
 .withColumnRenamed("_c1","ID")

In [39]:
df_spark.show(5)

+---------+-----+
|     Name|   ID|
+---------+-----+
|    Adair| 7682|
|    Adams| 4029|
|Allamakee|14330|
|Appanoose|12884|
|  Audubon| 6119|
+---------+-----+
only showing top 5 rows



In [40]:
df_spark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- ID: string (nullable = true)



The ID column holds integers, but the schema reports it as string because Spark reads every CSV column as a string by default. To have Spark detect the column types, pass `inferSchema=True` when reading the dataset.

In [43]:
df_spark=spark.read.csv("/content/drive/MyDrive/Pyspark/counties.csv",inferSchema=True)

In [44]:
df_spark=df_spark.withColumnRenamed("_c0","Name")\
 .withColumnRenamed("_c1","ID")

In [45]:
# checking the datatype of columns
df_spark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- ID: integer (nullable = true)



Now the ID column is reported as integer.

In [46]:
# return the first 3 rows as a list of Row objects
df_spark.head(3)

[Row(Name='Adair', ID=7682),
 Row(Name='Adams', ID=4029),
 Row(Name='Allamakee', ID=14330)]

In [47]:
# check the type of df_spark
type(df_spark)

pyspark.sql.dataframe.DataFrame

In [48]:
# list all column names
df_spark.columns

['Name', 'ID']

In [51]:
# select a column by name
df_spark.select('Name').show(5)

+---------+
|     Name|
+---------+
|    Adair|
|    Adams|
|Allamakee|
|Appanoose|
|  Audubon|
+---------+
only showing top 5 rows



In [55]:
# check datatype of each column
df_spark.dtypes

[('Name', 'string'), ('ID', 'int')]

In [57]:
# summary statistics for the dataset
df_spark.describe().show()

+-------+------+------------------+
|summary|  Name|                ID|
+-------+------+------------------+
|  count|    99|                99|
|   mean|  null| 30771.23232323232|
| stddev|  null|52888.737874675055|
|    min| Adair|              4029|
|    max|Wright|            430640|
+-------+------+------------------+



In [61]:
# add a column to the DataFrame (withColumn returns a new DataFrame, so reassign it)
df_spark=df_spark.withColumn('ID2',df_spark['ID']+2)

In [62]:
df_spark.show(5)

+---------+-----+-----+
|     Name|   ID|  ID2|
+---------+-----+-----+
|    Adair| 7682| 7684|
|    Adams| 4029| 4031|
|Allamakee|14330|14332|
|Appanoose|12884|12886|
|  Audubon| 6119| 6121|
+---------+-----+-----+
only showing top 5 rows



In [64]:
# drop a column from the DataFrame
df_spark=df_spark.drop('ID2')

In [65]:
df_spark.show(5)

+---------+-----+
|     Name|   ID|
+---------+-----+
|    Adair| 7682|
|    Adams| 4029|
|Allamakee|14330|
|Appanoose|12884|
|  Audubon| 6119|
+---------+-----+
only showing top 5 rows



# **Day 2**

In [80]:
df_spark=spark.read.csv("/content/drive/MyDrive/Pyspark/PYSPA.csv",header=True, inferSchema=True)

In [81]:
df_spark.show()

+----+----+----------+-----+
|Name| age|Experience|Salay|
+----+----+----------+-----+
|   A|  31|        10|30000|
|   B|  30|         8|25000|
|   C|  29|         4|20000|
|   D|  24|         3|15000|
|   E|  21|         1|18000|
|   F|  23|         2|40000|
|   G|null|      null|38000|
|null|  34|        10| null|
|null|  36|      null| null|
+----+----+----------+-----+



In [82]:
# drop rows containing null values: how="any" drops a row if at least one column is null,
# while how="all" drops a row only if every column is null
df_spark.na.drop(how="any").show()

+----+---+----------+-----+
|Name|age|Experience|Salay|
+----+---+----------+-----+
|   A| 31|        10|30000|
|   B| 30|         8|25000|
|   C| 29|         4|20000|
|   D| 24|         3|15000|
|   E| 21|         1|18000|
|   F| 23|         2|40000|
+----+---+----------+-----+



In [83]:
df_spark.show()

+----+----+----------+-----+
|Name| age|Experience|Salay|
+----+----+----------+-----+
|   A|  31|        10|30000|
|   B|  30|         8|25000|
|   C|  29|         4|20000|
|   D|  24|         3|15000|
|   E|  21|         1|18000|
|   F|  23|         2|40000|
|   G|null|      null|38000|
|null|  34|        10| null|
|null|  36|      null| null|
+----+----+----------+-----+



In [76]:
# thresh=2 keeps only rows with at least two non-null values; rows with fewer are dropped.
# When thresh is given, it overrides the how argument.

In [84]:
df_spark.na.drop(how="any",thresh=2).show()

+----+----+----------+-----+
|Name| age|Experience|Salay|
+----+----+----------+-----+
|   A|  31|        10|30000|
|   B|  30|         8|25000|
|   C|  29|         4|20000|
|   D|  24|         3|15000|
|   E|  21|         1|18000|
|   F|  23|         2|40000|
|   G|null|      null|38000|
|null|  34|        10| null|
+----+----+----------+-----+

