# 2) PySpark DataFrame (Part I)

In this section, we will cover:

 * PySpark DataFrame
 * Reading the dataset
 * Checking the data types of the columns (schema)
 * Selecting columns and indexing
 * Using `describe()`, similar to Pandas
 * Adding columns
 * Dropping columns
 * Renaming columns

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("Dataframe").getOrCreate()

In [3]:
spark

In [4]:
# read the dataset, treating the first row as the header
spark.read.option('header', 'true').csv("test1.csv")

DataFrame[Name: string, age: string, Experience: string]

In [5]:
df_pyspark = spark.read.option('header', 'true').csv("test1.csv", inferSchema=True)
df_pyspark.show()

+------------+---+----------+
|        Name|age|Experience|
+------------+---+----------+
|      Younes| 31|        10|
|      Mourad| 30|         8|
|Fatima Zahra| 29|         4|
|     M'hamed| 24|         3|
|       Fdila| 21|         1|
|       Hamid| 23|         2|
+------------+---+----------+



`inferSchema=True` tells Spark to infer each column's data type from the data. Without it, all columns are imported as strings by default, as the output of cell 4 shows.
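To make the difference concrete, here is a small plain-Python sketch of the idea behind type inference. It is not Spark's implementation: Spark decides one type per column by scanning the data, while this illustration converts each value on its own. The helper names `infer_value` and `read_rows` are hypothetical, and the sample rows are copied from the table above.

```python
import csv
import io

# Two sample rows copied from the table above (assumed layout of test1.csv).
CSV_TEXT = "Name,age,Experience\nYounes,31,10\nMourad,30,8\n"


def infer_value(text):
    """Try int, then float, and fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def read_rows(text, infer=False):
    """Parse CSV text into dicts, optionally converting each field."""
    reader = csv.DictReader(io.StringIO(text))
    if not infer:
        # A CSV parser only ever sees text, so every field is a str.
        return list(reader)
    return [{key: infer_value(value) for key, value in row.items()}
            for row in reader]


raw_rows = read_rows(CSV_TEXT)
typed_rows = read_rows(CSV_TEXT, infer=True)

print(type(raw_rows[0]["age"]).__name__)    # str
print(type(typed_rows[0]["age"]).__name__)  # int
```

The first call mirrors reading without `inferSchema` (everything stays a string), and the second mirrors reading with it (numeric-looking fields become numbers).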

In [6]:
# check the schema
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- Experience: integer (nullable = true)



In [7]:
df_pyspark = spark.read.csv("test1.csv", header=True, inferSchema=True)
df_pyspark.show()

+------------+---+----------+
|        Name|age|Experience|
+------------+---+----------+
|      Younes| 31|        10|
|      Mourad| 30|         8|
|Fatima Zahra| 29|         4|
|     M'hamed| 24|         3|
|       Fdila| 21|         1|
|       Hamid| 23|         2|
+------------+---+----------+



In [8]:
# checking the data types of the columns
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- Experience: integer (nullable = true)



In [9]:
type(df_pyspark)

pyspark.sql.dataframe.DataFrame

In [10]:
# show column names
df_pyspark.columns

['Name', 'age', 'Experience']

In [11]:
# getting the first 3 rows
df_pyspark.head(3)


[Row(Name='Younes', age=31, Experience=10),
 Row(Name='Mourad', age=30, Experience=8),
 Row(Name='Fatima Zahra', age=29, Experience=4)]

In [12]:
print(type(df_pyspark.head(3)))

<class 'list'>


In [13]:
df_pyspark.show()

+------------+---+----------+
|        Name|age|Experience|
+------------+---+----------+
|      Younes| 31|        10|
|      Mourad| 30|         8|
|Fatima Zahra| 29|         4|
|     M'hamed| 24|         3|
|       Fdila| 21|         1|
|       Hamid| 23|         2|
+------------+---+----------+



In [14]:
# selecting a column
df_pyspark.select("Name")

DataFrame[Name: string]

In [15]:
df_pyspark.select("Name").show()

+------------+
|        Name|
+------------+
|      Younes|
|      Mourad|
|Fatima Zahra|
|     M'hamed|
|       Fdila|
|       Hamid|
+------------+



In [16]:
# the type of the selected column
type(df_pyspark.select("Name"))

pyspark.sql.dataframe.DataFrame

In [17]:
# select multiple columns (column names are case-insensitive by default, so "experience" matches "Experience")
df_pyspark.select(["Name", "experience"]).show()

+------------+----------+
|        Name|experience|
+------------+----------+
|      Younes|        10|
|      Mourad|         8|
|Fatima Zahra|         4|
|     M'hamed|         3|
|       Fdila|         1|
|       Hamid|         2|
+------------+----------+



In [18]:
# Checking the datatypes
df_pyspark.dtypes

[('Name', 'string'), ('age', 'int'), ('Experience', 'int')]

In [19]:
# summary statistics for the selected columns, similar to Pandas' describe()
df_pyspark["age", "experience"].describe().show()

+-------+------------------+-----------------+
|summary|               age|       experience|
+-------+------------------+-----------------+
|  count|                 6|                6|
|   mean|26.333333333333332|4.666666666666667|
| stddev| 4.179314138308661|3.559026084010437|
|    min|                21|                1|
|    max|                31|               10|
+-------+------------------+-----------------+



In [20]:
# Adding columns to the DataFrame
df_pyspark = df_pyspark.withColumn("Experience After 3 years", df_pyspark["experience"]+3)
df_pyspark

DataFrame[Name: string, age: int, Experience: int, Experience After 3 years: int]

In [21]:
df_pyspark.show()

+------------+---+----------+------------------------+
|        Name|age|Experience|Experience After 3 years|
+------------+---+----------+------------------------+
|      Younes| 31|        10|                      13|
|      Mourad| 30|         8|                      11|
|Fatima Zahra| 29|         4|                       7|
|     M'hamed| 24|         3|                       6|
|       Fdila| 21|         1|                       4|
|       Hamid| 23|         2|                       5|
+------------+---+----------+------------------------+



In [22]:
# Dropping columns
df_pyspark = df_pyspark.drop("Experience After 3 years")
df_pyspark

DataFrame[Name: string, age: int, Experience: int]

In [23]:
df_pyspark.show()

+------------+---+----------+
|        Name|age|Experience|
+------------+---+----------+
|      Younes| 31|        10|
|      Mourad| 30|         8|
|Fatima Zahra| 29|         4|
|     M'hamed| 24|         3|
|       Fdila| 21|         1|
|       Hamid| 23|         2|
+------------+---+----------+



In [25]:
# Renaming a column
df_pyspark.withColumnRenamed("Name", "New Name").show()

+------------+---+----------+
|    New Name|age|Experience|
+------------+---+----------+
|      Younes| 31|        10|
|      Mourad| 30|         8|
|Fatima Zahra| 29|         4|
|     M'hamed| 24|         3|
|       Fdila| 21|         1|
|       Hamid| 23|         2|
+------------+---+----------+

