## Data Scientist Jr.: Dr. Eddy Giusepe Chirinos Isidro

In this part, we will cover:
* PySpark DataFrames
* Reading the dataset
* Checking the data types of the columns (schema)
* Selecting columns and indexing
* Using `describe()`, similar to pandas
* Adding columns
* Dropping columns
* Renaming columns

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('Dataframe').getOrCreate()

22/03/01 16:49:57 WARN Utils: Your hostname, eddygiusepe resolves to a loopback address: 127.0.1.1; using 192.168.1.122 instead (on interface wlp0s20f3)
22/03/01 16:49:57 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 16:49:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
spark

#### <font color="orange">Reading my DataSet</font>

In [4]:
## Read the DataSet

spark.read.option("header", "true").csv("test1.csv").show()

+---------+---+----------+
|     Name|age|Experience|
+---------+---+----------+
|    Krish| 31|        10|
|Sudhanshu| 30|         8|
|    Sunny| 29|         4|
+---------+---+----------+



In [5]:
df_spark = spark.read.option("header", "true").csv("test1.csv")
df_spark.show()

+---------+---+----------+
|     Name|age|Experience|
+---------+---+----------+
|    Krish| 31|        10|
|Sudhanshu| 30|         8|
|    Sunny| 29|         4|
+---------+---+----------+



In [6]:
## Check the Schema
# Note that, by default, every column is read as "string"
df_spark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- Experience: string (nullable = true)



In [7]:
# We can instead ask Spark to infer the column types

df_spark = spark.read.option("header", "true").csv("test1.csv", inferSchema=True)
df_spark.show()

+---------+---+----------+
|     Name|age|Experience|
+---------+---+----------+
|    Krish| 31|        10|
|Sudhanshu| 30|         8|
|    Sunny| 29|         4|
+---------+---+----------+



In [8]:
# Check the schema again
# Note that the numeric columns are now "integer"
df_spark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- Experience: integer (nullable = true)



#### <font color="orange">Using a more compact call to read the DataFrame</font>

In [9]:
# In short, we can use the following form (more practical)

df_spark = spark.read.csv('test1.csv', header=True, inferSchema=True)
df_spark.show()

+---------+---+----------+
|     Name|age|Experience|
+---------+---+----------+
|    Krish| 31|        10|
|Sudhanshu| 30|         8|
|    Sunny| 29|         4|
+---------+---+----------+



In [10]:
type(df_spark)

pyspark.sql.dataframe.DataFrame

#### <font color="orange">Visualizing our columns of the DataFrame</font>

In [11]:
# We can list the column names:

df_spark.columns

['Name', 'age', 'Experience']

In [12]:
# We can also look at the first rows:

df_spark.head(3)

[Row(Name='Krish', age=31, Experience=10),
 Row(Name='Sudhanshu', age=30, Experience=8),
 Row(Name='Sunny', age=29, Experience=4)]

#### <font color="orange">Selecting columns in a DataFrame</font>

In [13]:
# How do I select a column?
df_spark.show()

+---------+---+----------+
|     Name|age|Experience|
+---------+---+----------+
|    Krish| 31|        10|
|Sudhanshu| 30|         8|
|    Sunny| 29|         4|
+---------+---+----------+



In [15]:
df_spark.select('Name').show()

+---------+
|     Name|
+---------+
|    Krish|
|Sudhanshu|
|    Sunny|
+---------+



In [17]:
# Two columns

df_spark.select(['Name', 'Experience']).show()  # or: df_spark.select('Name', 'Experience').show()


+---------+----------+
|     Name|Experience|
+---------+----------+
|    Krish|        10|
|Sudhanshu|         8|
|    Sunny|         4|
+---------+----------+



In [20]:
df_spark['Name']

Column<'Name'>

In [21]:
df_spark.dtypes

[('Name', 'string'), ('age', 'int'), ('Experience', 'int')]

In [22]:
df_spark.describe()

DataFrame[summary: string, Name: string, age: string, Experience: string]

In [23]:
df_spark.describe().show()

+-------+-----+----+-----------------+
|summary| Name| age|       Experience|
+-------+-----+----+-----------------+
|  count|    3|   3|                3|
|   mean| null|30.0|7.333333333333333|
| stddev| null| 1.0|3.055050463303893|
|    min|Krish|  29|                4|
|    max|Sunny|  31|               10|
+-------+-----+----+-----------------+



#### <font color="orange">Adding columns to a DataFrame</font>

In [24]:
df_spark.show()

+---------+---+----------+
|     Name|age|Experience|
+---------+---+----------+
|    Krish| 31|        10|
|Sudhanshu| 30|         8|
|    Sunny| 29|         4|
+---------+---+----------+



In [29]:
df_spark = df_spark.withColumn("Experience after 2 year", df_spark["Experience"] + 2)
df_spark.show()

+---------+---+----------+-----------------------+
|     Name|age|Experience|Experience after 2 year|
+---------+---+----------+-----------------------+
|    Krish| 31|        10|                     12|
|Sudhanshu| 30|         8|                     10|
|    Sunny| 29|         4|                      6|
+---------+---+----------+-----------------------+



#### <font color="orange">Dropping columns from a DataFrame</font>

In [32]:
df_spark = df_spark.drop('Experience after 2 year')
df_spark.show()

+---------+---+----------+
|     Name|age|Experience|
+---------+---+----------+
|    Krish| 31|        10|
|Sudhanshu| 30|         8|
|    Sunny| 29|         4|
+---------+---+----------+



#### <font color="orange">Renaming columns in a DataFrame</font>

In [33]:
df_spark.withColumnRenamed("Name", "New Name").show()

+---------+---+----------+
| New Name|age|Experience|
+---------+---+----------+
|    Krish| 31|        10|
|Sudhanshu| 30|         8|
|    Sunny| 29|         4|
+---------+---+----------+

