### *PySpark DataFrame Commonly Used Functions*
Apache Spark is one of the most widely used projects of the Apache Software Foundation and is designed for fast, distributed computing. PySpark SQL is the Spark module that integrates relational processing with Spark's functional programming API. It lets us extract data with SQL queries, or with DataFrame methods that mirror the same SQL operations.

In [0]:
df=spark.read.format("csv").option("header",True).option("inferSchema",True).option("delimiter",',').load("/FileStore/tables/jamesbond.csv")
display(df)

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,


#### *printSchema()*
DataFrame.printSchema() prints the schema of the DataFrame as a tree, showing each column's name, data type and nullability.

In [0]:
df.printSchema()

root
 |-- Film: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Actor: string (nullable = true)
 |-- Director: string (nullable = true)
 |-- Box Office: double (nullable = true)
 |-- Budget: double (nullable = true)
 |-- Bond Actor Salary: double (nullable = true)



In [0]:
# df.count() will get the count of rows
df.count()

Out[94]: 26

In [0]:
df.show()

+--------------------+----+--------------+------------------+----------+------+-----------------+
|                Film|Year|         Actor|          Director|Box Office|Budget|Bond Actor Salary|
+--------------------+----+--------------+------------------+----------+------+-----------------+
|              Dr. No|1962|  Sean Connery|     Terence Young|     448.8|   7.0|              0.6|
|From Russia with ...|1963|  Sean Connery|     Terence Young|     543.8|  12.6|              1.6|
|          Goldfinger|1964|  Sean Connery|      Guy Hamilton|     820.4|  18.6|              3.2|
|         Thunderball|1965|  Sean Connery|     Terence Young|     848.1|  41.9|              4.7|
|       Casino Royale|1967|   David Niven|        Ken Hughes|     315.0|  85.0|             null|
| You Only Live Twice|1967|  Sean Connery|     Lewis Gilbert|     514.2|  59.9|              4.4|
|On Her Majesty's ...|1969|George Lazenby|     Peter R. Hunt|     291.5|  37.3|              0.6|
|Diamonds Are Foreve

#### *select()*
In PySpark, select() is used to select a single column, multiple columns, or all columns from a DataFrame. select() is a transformation, so it returns a new DataFrame with the selected columns.

In [0]:
df.select("*").display()

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,


#### Select Single & Multiple Columns From PySpark

In [0]:
df.select(df.Film).display()
# display() returns None, so its result is not assigned to a variable
df.select("Film","Actor",df.Director).display()

Film
Dr. No
From Russia with Love
Goldfinger
Thunderball
Casino Royale
You Only Live Twice
On Her Majesty's Secret Service
Diamonds Are Forever
Live and Let Die
The Man with the Golden Gun


Film,Actor,Director
Dr. No,Sean Connery,Terence Young
From Russia with Love,Sean Connery,Terence Young
Goldfinger,Sean Connery,Guy Hamilton
Thunderball,Sean Connery,Terence Young
Casino Royale,David Niven,Ken Hughes
You Only Live Twice,Sean Connery,Lewis Gilbert
On Her Majesty's Secret Service,George Lazenby,Peter R. Hunt
Diamonds Are Forever,Sean Connery,Guy Hamilton
Live and Let Die,Roger Moore,Guy Hamilton
The Man with the Golden Gun,Roger Moore,Guy Hamilton


In [0]:
# Use distinct() to select the unique values of the Actor column.

df.select(df.Actor).distinct().display(truncate=False)

Actor
George Lazenby
Sean Connery
Roger Moore
David Niven
Daniel Craig
Timothy Dalton
Pierce Brosnan


#### *filter() , where()*
PySpark's filter() function filters the rows of a DataFrame based on a given condition or SQL expression. If you are coming from an SQL background you can use where() instead; where() is an alias for filter(), so both behave exactly the same. filter() returns a new DataFrame containing only the rows that meet the specified condition.

In [0]:
# DataFrame where() with Column Condition

df.where(df.Year>=1998).display()

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


In [0]:
#  PySpark where() with Multiple Conditions

df.where((df.Year >=1962) & (df.Budget>=100)).display()

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


#### Filter Based on startswith(), endswith(), contains()

In [0]:
df.filter(df.Actor.startswith('Daniel')).display()
df.filter(df.Actor.endswith('Niven')).display()
df.filter(df.Actor.contains('Pierce')).display()

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9


In [0]:
df.filter(df.Year.isin(1983, 2012)).display()
df.filter(df.Year.between(1950, 1980)).display()

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Never Say Never Again,1983,Sean Connery,Irvin Kershner,380.0,86.0,
Octopussy,1983,Roger Moore,John Glen,373.8,53.9,7.8
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5


Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,


#### *isNotNull() , isNull()*
isNotNull() returns True if the current expression is not null; isNull() returns True if the current expression is null.

In [0]:
# isNotNull() isNull()

df.filter(df.Budget.isNotNull()).display()
df.filter(df.Actor.isNull()).display()

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,


Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary


In [0]:
# PySpark Filter like()

df.filter(df.Year.like("%81")).display()

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,


In [0]:
from pyspark.sql.functions import asc,desc
# count() already names its output column "count", so no alias is needed
df.groupBy(df.Actor).count().sort(asc("count")).display()

Actor,count
George Lazenby,1
David Niven,1
Timothy Dalton,2
Daniel Craig,4
Pierce Brosnan,4
Sean Connery,7
Roger Moore,7


In [0]:
from pyspark.sql.functions import max as max_  # Spark's max, not Python's built-in max
df.groupBy(df.Actor).agg(max_("Budget").alias("maximum")).display()

Actor,maximum
George Lazenby,37.3
Sean Connery,86.0
Roger Moore,91.5
David Niven,85.0
Daniel Craig,206.3
Timothy Dalton,68.8
Pierce Brosnan,158.3


In [0]:
df.select(df.Budget.alias("Amount")).sort(desc("Amount")).display()

Amount
206.3
181.4
170.2
158.3
154.2
145.3
133.9
91.5
86.0
85.0
