# Tutorial 1 - PySpark DataFrames - pt1

Covers:

* Creating a Spark session
* The PySpark DataFrame
* Reading the dataset
* Checking column datatypes (schema)
* Selecting columns and indexing
* Using `describe`, similar to pandas
* Adding columns
* Dropping columns
* Renaming columns


**Setup**

In [1]:
from pyspark.sql import SparkSession

**Creating a Spark Session**

In [2]:
spark=SparkSession.builder.appName('DataFrame').getOrCreate()

In [3]:
spark

**Reading the Dataset**

In [4]:
## Read the dataset
df_pyspark=spark.read.option('header','true').csv('Pokemon.csv',inferSchema=True) 

Optional arguments used when reading the dataset:
* `.option('header', 'true')`: treats the first row of the file as the column names.
* `inferSchema=True` (passed to `.csv`): infers each column's datatype from the data; without it, every column is read as a string.
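The same two settings can also be collected in a dictionary and applied in one call through the reader's `options` method. This is a minimal sketch, assuming a live `spark` session and the `Pokemon.csv` file used above; the helper `read_csv_with_options` and the constant `CSV_OPTIONS` are names made up for this illustration, not part of PySpark.

```python
# Reader settings from the cell above, gathered in one place.
# (CSV_OPTIONS and read_csv_with_options are illustrative names.)
CSV_OPTIONS = {
    "header": True,       # treat the first row as column names
    "inferSchema": True,  # scan the data to infer column types
}


def read_csv_with_options(spark, path, options=None):
    """Read a CSV file, applying every entry in `options` to the reader."""
    opts = dict(CSV_OPTIONS if options is None else options)
    return spark.read.options(**opts).csv(path)


# Usage (needs a running SparkSession named `spark`):
# df_pyspark = read_csv_with_options(spark, "Pokemon.csv")
# df_strings = read_csv_with_options(spark, "Pokemon.csv", {"header": True})
# df_strings.printSchema()  # without inferSchema, columns are read as strings
```

Keeping the options in one dictionary makes it easy to compare a read with and without `inferSchema` by changing a single entry.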

In [5]:
## Show the DataFrame as a formatted table (first 20 rows by default)
df_pyspark.show()

+---+--------------------+------+------+-----+---+------+-------+-------+-------+-----+----------+---------+
|  #|                Name|Type 1|Type 2|Total| HP|Attack|Defense|Sp. Atk|Sp. Def|Speed|Generation|Legendary|
+---+--------------------+------+------+-----+---+------+-------+-------+-------+-----+----------+---------+
|  1|           Bulbasaur| Grass|Poison|  318| 45|    49|     49|     65|     65|   45|         1|    false|
|  2|             Ivysaur| Grass|Poison|  405| 60|    62|     63|     80|     80|   60|         1|    false|
|  3|            Venusaur| Grass|Poison|  525| 80|    82|     83|    100|    100|   80|         1|    false|
|  3|VenusaurMega Venu...| Grass|Poison|  625| 80|   100|    123|    122|    120|   80|         1|    false|
|  4|          Charmander|  Fire|  null|  309| 39|    52|     43|     60|     50|   65|         1|    false|
|  5|          Charmeleon|  Fire|  null|  405| 58|    64|     58|     80|     65|   80|         1|    false|
|  6|           Cha

In [6]:
## Check the schema (datatypes)
df_pyspark.printSchema()  # roughly comparable to pandas .info()

root
 |-- #: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Type 1: string (nullable = true)
 |-- Type 2: string (nullable = true)
 |-- Total: integer (nullable = true)
 |-- HP: integer (nullable = true)
 |-- Attack: integer (nullable = true)
 |-- Defense: integer (nullable = true)
 |-- Sp. Atk: integer (nullable = true)
 |-- Sp. Def: integer (nullable = true)
 |-- Speed: integer (nullable = true)
 |-- Generation: integer (nullable = true)
 |-- Legendary: boolean (nullable = true)
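The schema is also available programmatically: `df.dtypes` returns a list of `(column name, type name)` pairs, where the type names are Spark's short forms such as `int`, `string`, and `boolean`. The sketch below filters that list by type; `columns_of_type` is a hypothetical helper written for this example.

```python
def columns_of_type(df, type_name):
    """Return the names of columns whose Spark type name equals `type_name`.

    `df.dtypes` yields (name, type) pairs, e.g. ("HP", "int").
    """
    return [name for name, dtype in df.dtypes if dtype == type_name]


# Usage (needs the DataFrame loaded above):
# columns_of_type(df_pyspark, "string")
# columns_of_type(df_pyspark, "int")
```

Given the schema printed above, the `"string"` call should return the three text columns (`Name`, `Type 1`, `Type 2`).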



In [7]:
## A more readable way to read the CSV: pass the options as keyword arguments
df_pyspark=spark.read.csv('Pokemon.csv',header=True,inferSchema=True)

In [9]:
type(df_pyspark)

pyspark.sql.dataframe.DataFrame

**Selecting columns and indexing**

In [10]:
## See the columns
df_pyspark.columns

['#',
 'Name',
 'Type 1',
 'Type 2',
 'Total',
 'HP',
 'Attack',
 'Defense',
 'Sp. Atk',
 'Sp. Def',
 'Speed',
 'Generation',
 'Legendary']

In [11]:
## Select specific columns (and show them)
df_pyspark.select(['Name', 'Type 1', 'Type 2']).show()
# select also accepts the names as separate arguments, so the square brackets are optional

+--------------------+------+------+
|                Name|Type 1|Type 2|
+--------------------+------+------+
|           Bulbasaur| Grass|Poison|
|             Ivysaur| Grass|Poison|
|            Venusaur| Grass|Poison|
|VenusaurMega Venu...| Grass|Poison|
|          Charmander|  Fire|  null|
|          Charmeleon|  Fire|  null|
|           Charizard|  Fire|Flying|
|CharizardMega Cha...|  Fire|Dragon|
|CharizardMega Cha...|  Fire|Flying|
|            Squirtle| Water|  null|
|           Wartortle| Water|  null|
|           Blastoise| Water|  null|
|BlastoiseMega Bla...| Water|  null|
|            Caterpie|   Bug|  null|
|             Metapod|   Bug|  null|
|          Butterfree|   Bug|Flying|
|              Weedle|   Bug|Poison|
|              Kakuna|   Bug|Poison|
|            Beedrill|   Bug|Poison|
|BeedrillMega Beed...|   Bug|Poison|
+--------------------+------+------+
only showing top 20 rows
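One caveat when selecting from this dataset: a column name that contains a dot, such as `Sp. Atk`, is generally parsed by Spark as access to a nested field unless the name is wrapped in backticks. The sketch below quotes every name before selecting; `quote_column` and `select_columns` are hypothetical helpers for illustration, and the sketch assumes the `df_pyspark` DataFrame loaded above.

```python
def quote_column(name):
    """Wrap a column name in backticks so Spark reads it literally.

    A backtick inside the name is escaped by doubling it.
    """
    return "`" + name.replace("`", "``") + "`"


def select_columns(df, *names):
    """Select columns by name, quoting each one first."""
    return df.select(*[quote_column(n) for n in names])


# Usage (needs the DataFrame loaded above):
# select_columns(df_pyspark, "Name", "Sp. Atk", "Sp. Def").show()
```

Quoting every name, not only the dotted ones, keeps the helper simple; names with spaces such as `Type 1` work either way.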

