# DataFrames
We will cover:
- what PySpark DataFrames are  
- how to read a dataset as a DataFrame  
- checking the types of features  
- adding/dropping/renaming columns  
- describing columns  
- selecting/indexing

In [4]:
from pyspark.sql import SparkSession

# start a Spark session (or reuse one that is already running)
spark = SparkSession.builder.appName('DataFrames').getOrCreate()
spark

### Reading

In [13]:
# read the semicolon-delimited CSV, treating the first row as the header
df_spark = spark.read.options(delimiter=';', header=True).csv('data/tut02_test.csv')
df_spark.show()

+-------+-----------+---------+
|   Name|Subscribers|    Views|
+-------+-----------+---------+
|  Krish|     616000| 58258882|
|sentdex|    1120000|103555956|
|Ken Jee|     212000|  6692842|
+-------+-----------+---------+



In [14]:
# check the schema (datatypes of columns)
df_spark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Subscribers: string (nullable = true)
 |-- Views: string (nullable = true)



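When reading a CSV file, the `inferSchema` option defaults to *False*, which is why every column comes back with the *string* type. Setting it to *True* makes Spark take an extra pass over the data to work out each column's type.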

In [16]:
# read the dataset again, this time letting Spark infer the column types
df_spark = spark.read.options(delimiter=';', header=True, inferSchema=True).csv('data/tut02_test.csv')

# check the schema (datatypes of columns)
df_spark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Subscribers: integer (nullable = true)
 |-- Views: integer (nullable = true)



In [32]:
# the `dtypes` attribute also lists column types (similar to pandas)
df_spark.dtypes

[('Name', 'string'), ('Subscribers', 'int'), ('Views', 'int')]

### DataFrames

In [17]:
# check the type of our data
type(df_spark)

pyspark.sql.dataframe.DataFrame

A DataFrame is a distributed collection of rows organized into named columns. It supports operations such as selecting, describing, adding, dropping, and renaming columns.

In [19]:
# check columns
df_spark.columns

['Name', 'Subscribers', 'Views']

In [20]:
# first two rows, returned as a list of Row objects
df_spark.head(2)

[Row(Name='Krish', Subscribers=616000, Views=58258882),
 Row(Name='sentdex', Subscribers=1120000, Views=103555956)]

### Selecting

In [27]:
# select a single column
df_spark.select('Name').show()

+-------+
|   Name|
+-------+
|  Krish|
|sentdex|
|Ken Jee|
+-------+



In [31]:
# select several columns
df_spark.select(['Name', 'Subscribers']).show()

+-------+-----------+
|   Name|Subscribers|
+-------+-----------+
|  Krish|     616000|
|sentdex|    1120000|
|Ken Jee|     212000|
+-------+-----------+



### Describe function

In [34]:
# summary statistics for every column (mean and stddev are null for string columns)
df_spark.describe().show()

+-------+-------+-----------------+--------------------+
|summary|   Name|      Subscribers|               Views|
+-------+-------+-----------------+--------------------+
|  count|      3|                3|                   3|
|   mean|   null|649333.3333333334|5.6169226666666664E7|
| stddev|   null|454916.8422177105| 4.846535575030215E7|
|    min|Ken Jee|           212000|             6692842|
|    max|sentdex|          1120000|           103555956|
+-------+-------+-----------------+--------------------+



### Adding/Dropping/Renaming

In [41]:
# add a column computed from an existing one
df_spark = df_spark.withColumn(colName='Views after two years', col=df_spark['Views']*1.5)
df_spark.show()

+-------+-----------+---------+---------------------+
|   Name|Subscribers|    Views|Views after two years|
+-------+-----------+---------+---------------------+
|  Krish|     616000| 58258882|          8.7388323E7|
|sentdex|    1120000|103555956|         1.55333934E8|
|Ken Jee|     212000|  6692842|          1.0039263E7|
+-------+-----------+---------+---------------------+



In [44]:
# drop several columns at once
df_spark = df_spark.drop('Views after two years', 'Views')
df_spark.show()

+-------+-----------+
|   Name|Subscribers|
+-------+-----------+
|  Krish|     616000|
|sentdex|    1120000|
|Ken Jee|     212000|
+-------+-----------+



In [45]:
# rename a column
df_spark = df_spark.withColumnRenamed('Name', 'New Name')
df_spark.show()

+--------+-----------+
|New Name|Subscribers|
+--------+-----------+
|   Krish|     616000|
| sentdex|    1120000|
| Ken Jee|     212000|
+--------+-----------+

