# Hello and welcome

## This notebook walks through the steps to test PySpark on a Dataproc cluster

### Keep in mind that this is only one of many ways you can run PySpark.

##### This is a follow-up; the first steps can be found in [this file](pyspark_in_dataproc_cluster.md)

In [None]:
import pyspark

In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder \
        .master("local[*]") \
        .appName('test') \
        .getOrCreate()

Usually, you would be able to open the Spark UI at localhost:4040, but on a cluster hosted in the cloud this is only possible by:
* Accessing the machine via SSH, which can be done with one click in the 'VM INSTANCES' tab, inside 'Cluster details' in Dataproc;
* Port forwarding from the cluster to your computer;
* Setting up an SSH tunnel between the cluster and your computer and operating the cluster machine through it.

Personally, I think the first option is the best one, although I don't think you will need to access the Spark UI at all.
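
If you do want to try the port forwarding route, the command to run on your own computer can be sketched as below. This is only a sketch under assumptions: `spark_ui_tunnel_command` is a hypothetical helper, and the instance name `my-cluster-m` and zone `us-central1-a` are placeholders that you would replace with your own master node and zone.

```python
import shlex

def spark_ui_tunnel_command(instance, zone, port=4040):
    """Build a gcloud command that forwards the Spark UI port over SSH."""
    return shlex.join([
        "gcloud", "compute", "ssh", instance,
        "--zone", zone,
        "--",                                # arguments after this go to ssh itself
        "-N",                                # open the tunnel only, no remote shell
        "-L", f"{port}:localhost:{port}",    # local port -> same port on the VM
    ])

# Placeholder instance and zone, replace with your own values
print(spark_ui_tunnel_command("my-cluster-m", "us-central1-a"))
```

While that command is running in a terminal on your computer, localhost:4040 in your browser should reach the Spark UI of the session started in this notebook.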

In [None]:
!wget https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2020-06.csv

#### One thing you need to remember: when running Spark inside this cluster, Spark does not read from the same directories it would if you had made the setup on your own computer. Files on the local disk are not visible to Spark until you mirror them into HDFS using the 'hdfs dfs' command.

In [None]:
!hdfs dfs -copyFromLocal 'fhvhv_tripdata_2020-06.csv'

In [None]:
df = spark.read \
        .option("header", "true") \
        .csv('fhvhv_tripdata_2020-06.csv')

In [None]:
import pandas as pd

In [None]:
!head -n 1001 fhvhv_tripdata_2020-06.csv > head.csv

#### Again, copy the file from the local disk into HDFS. This step is always important; keep it in mind every time your code throws a 'Path not found' error.

In [None]:
!hdfs dfs -copyFromLocal 'head.csv'
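
Since this copy step is easy to forget, it can be wrapped in a small helper that uploads a file only when it is not in HDFS yet. This is a minimal sketch, not part of the original steps: `ensure_in_hdfs` is a hypothetical name, and the `run` parameter exists only so the function can be tried without a Hadoop installation.

```python
import os
import subprocess

def ensure_in_hdfs(local_path, hdfs_path=None, run=subprocess.run):
    """Copy local_path into HDFS unless the target already exists.

    hdfs_path defaults to the bare file name, which HDFS resolves
    against the current user's HDFS home directory.
    Returns True if a copy was made, False if the file was already there.
    """
    target = hdfs_path or os.path.basename(local_path)
    # `hdfs dfs -test -e` exits with status 0 when the path exists
    if run(["hdfs", "dfs", "-test", "-e", target]).returncode == 0:
        return False
    # check=True raises CalledProcessError if the copy fails
    run(["hdfs", "dfs", "-copyFromLocal", local_path, target], check=True)
    return True

# In the notebook, on the cluster:
# ensure_in_hdfs('head.csv')
```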

In [None]:
df_pandas = pd.read_csv('head.csv')

In [None]:
df_pandas.dtypes

In [34]:
spark.createDataFrame(df_pandas).schema

StructType(List(StructField(hvfhs_license_num,StringType,true),StructField(dispatching_base_num,StringType,true),StructField(pickup_datetime,StringType,true),StructField(dropoff_datetime,StringType,true),StructField(PULocationID,LongType,true),StructField(DOLocationID,LongType,true),StructField(SR_Flag,DoubleType,true)))

In [35]:
from pyspark.sql import types

In [None]:
schema = types.StructType([
    types.StructField('hvfhs_license_num', types.StringType(), True),
    types.StructField('dispatching_base_num', types.StringType(), True),
    types.StructField('pickup_datetime', types.TimestampType(), True),
    types.StructField('dropoff_datetime', types.TimestampType(), True),
    types.StructField('PULocationID', types.IntegerType(), True),
    types.StructField('DOLocationID', types.IntegerType(), True),
    types.StructField('SR_Flag', types.StringType(), True)
])
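
The schema defined above only takes effect once it is handed to the reader; without it, and without schema inference, Spark reads every CSV column as a string. A minimal sketch of that step, assuming the `spark` session and `schema` object from the earlier cells; `read_with_schema` is a hypothetical helper name, not part of PySpark:

```python
def read_with_schema(spark, path, schema):
    """Read a CSV that has a header row, applying an explicit schema."""
    return (
        spark.read
        .option("header", "true")   # first line holds the column names
        .schema(schema)             # use the declared types instead of strings
        .csv(path)
    )

# In the notebook, with `spark` and `schema` already defined:
# df = read_with_schema(spark, 'fhvhv_tripdata_2020-06.csv', schema)
# df.printSchema()
```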

### Congrats if you got through all of this without any errors! Your cluster is now ready, running and 'slightly' tested.