<a href="https://colab.research.google.com/github/SophiaHe/Datacamp_PySpark/blob/master/Big_Data_with_PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Course 1: Introduction to PySpark**

See what tables are in your cluster by calling spark.catalog.listTables()

query = "FROM flights SELECT * LIMIT 10"

Get the first 10 rows of flights: 
flights10 = spark.sql(query)

Show the results:
flights10.show()

Convert the results to a pandas DataFrame (flight_counts is assumed to be a Spark DataFrame returned by an earlier spark.sql() call):
pd_counts = flight_counts.toPandas()

# Create pd_temp (requires pandas and NumPy)
import numpy as np
import pandas as pd
pd_temp = pd.DataFrame(np.random.random(10))

# Create spark_temp from pd_temp
spark_temp = spark.createDataFrame(pd_temp)

# Add spark_temp to the catalog
spark_temp.createOrReplaceTempView("temp")

# Examine the tables in the catalog again
print(spark.catalog.listTables())

# Don't change this file path
file_path = "/usr/local/share/datasets/airports.csv"

# Read in the airports data
airports = spark.read.csv(file_path, header = True)

# Create the DataFrame flights
flights = spark.table("flights")

# Show the head
flights.show()




**Course 2: Manipulating Data**

# Add duration_hrs
flights = flights.withColumn('duration_hrs',flights.air_time/60)

# Filter flights by passing a string
long_flights1 = flights.filter("distance > 1000")

# Filter flights by passing a column of boolean values
long_flights2 = flights.filter(flights.distance > 1000)

# Select the first set of columns
selected1 = flights.select("tailnum","origin","dest")

# Select the second set of columns
temp = flights.select(flights.origin, flights.dest, flights.carrier)

# Define avg_speed
avg_speed = (flights.distance/(flights.air_time/60)).alias("avg_speed")

# Create the same table using a SQL expression
speed2 = flights.selectExpr("origin", "dest", "tailnum", "distance/(air_time/60) as avg_speed")

# Find the shortest flight from PDX in terms of distance
flights.filter(flights.origin == 'PDX').groupBy().min("distance").show()

# Average duration of Delta flights from SEA
flights.filter(flights.carrier == "DL").filter(flights.origin == 'SEA').groupBy().avg("air_time").show()

# Total hours in the air
flights.withColumn("duration_hrs", flights.air_time/60).groupBy().sum("duration_hrs").show()

# Import pyspark.sql.functions as F
import pyspark.sql.functions as F

# Standard deviation of departure delay
# (by_month_dest is assumed to be a grouped DataFrame such as flights.groupBy("month", "dest"))
by_month_dest.agg(F.stddev('dep_delay')).show()

# Rename the faa column
airports = airports.withColumnRenamed('faa','dest')

# Join the DataFrames
flights_with_airports = flights.join(airports,on = 'dest', how = 'leftouter')

**Course 3: Machine Learning Pipeline**

At the core of the pyspark.ml module are the Transformer and Estimator classes. Almost every other class in the module behaves similarly to these two basic classes.

Transformer classes have a .transform() method that takes a DataFrame and returns a new DataFrame; usually the original one with a new column appended. For example, you might use the class Bucketizer to create discrete bins from a continuous feature or the class PCA to reduce the dimensionality of your dataset using principal component analysis.

Estimator classes all implement a .fit() method. These methods also take a DataFrame, but instead of returning another DataFrame they return a model object. This can be something like a StringIndexerModel for including categorical data saved as strings in your models, or a RandomForestClassificationModel or RandomForestRegressionModel that uses the random forest algorithm for classification or regression, respectively. The returned model is itself a Transformer, so it has its own .transform() method.
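To make the fit/transform contract concrete, here is a minimal plain-Python sketch that needs no Spark installation. `ToyStringIndexer` and `ToyStringIndexerModel` are hypothetical stand-ins written for this note, not pyspark classes, and rows are plain dictionaries rather than a DataFrame. The estimator learns a mapping in `fit()` and returns a model, and the model's `transform()` returns new rows with one column appended.

```python
from collections import Counter


class ToyStringIndexerModel:
    """Transformer: maps each string label to the index learned during fit()."""

    def __init__(self, input_col, output_col, label_to_index):
        self.input_col = input_col
        self.output_col = output_col
        self.label_to_index = label_to_index

    def transform(self, rows):
        # Return new rows with one extra column; the input rows are not modified.
        return [
            {**row, self.output_col: float(self.label_to_index[row[self.input_col]])}
            for row in rows
        ]


class ToyStringIndexer:
    """Estimator: fit() learns a label-to-index mapping and returns a model."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        counts = Counter(row[self.input_col] for row in rows)
        # In this toy version the most frequent label gets index 0,
        # and ties are broken alphabetically.
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        mapping = {label: index for index, label in enumerate(ordered)}
        return ToyStringIndexerModel(self.input_col, self.output_col, mapping)


rows = [{"carrier": "DL"}, {"carrier": "AA"}, {"carrier": "DL"}]
model = ToyStringIndexer("carrier", "carrier_index").fit(rows)  # Estimator -> model
indexed = model.transform(rows)  # the model behaves as a Transformer
print(indexed)
```

The same two-step shape appears further down in these notes, where a pipeline is fitted and the fitted result is then used to transform the data.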

#### Spark's machine learning routines only handle numeric data. That means every column you pass to a model must be either an integer or a decimal (a "double" in Spark)

It's important to note that .cast() works on columns, while .withColumn() works on DataFrames. The only argument you need to pass to .cast() is the kind of value you want to create, in string form. For example, to create integers, you'll pass the argument "integer" and for decimal numbers you'll use "double".

dataframe = dataframe.withColumn("col", dataframe.col.cast("new_type")) \\
model_data = model_data.withColumn("month", model_data.month.cast('integer'))

Convert to an integer \\
model_data = model_data.withColumn("label", model_data.is_late.cast('integer'))

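Import the feature classes used below \\
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

Create a StringIndexer \\
carr_indexer = StringIndexer(inputCol='carrier', outputCol='carrier_index')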

Create a OneHotEncoder \\
carr_encoder = OneHotEncoder(inputCol='carrier_index',outputCol='carrier_fact')

Make a VectorAssembler \\
vec_assembler = VectorAssembler(inputCols=['month','air_time','carrier_fact','dest_fact','plane_age'], outputCol='features')

Import Pipeline \\
from pyspark.ml import Pipeline

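Make the pipeline (dest_indexer and dest_encoder are assumed to be built for the dest column in the same way as carr_indexer and carr_encoder, producing dest_fact) \\
flights_pipe = Pipeline(stages=[dest_indexer, dest_encoder, carr_indexer, carr_encoder, vec_assembler])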

Fit and transform the data \\
piped_data = flights_pipe.fit(model_data).transform(model_data)

Split the data into training and test sets \\
training, test = piped_data.randomSplit([0.6,0.4])

**Course 4: Model tuning and selection**

Import LogisticRegression \\
from pyspark.ml.classification import LogisticRegression

Create a LogisticRegression Estimator \\
lr = LogisticRegression()

Import the evaluation submodule \\
import pyspark.ml.evaluation as evals

Create a BinaryClassificationEvaluator \\
evaluator = evals.BinaryClassificationEvaluator(metricName='areaUnderROC')

Import the tuning submodule \\
import pyspark.ml.tuning as tune

Create the parameter grid \\
grid = tune.ParamGridBuilder()

Add the hyperparameters (requires NumPy) \\
import numpy as np
grid = grid.addGrid(lr.regParam, np.arange(0, .1, .01))
grid = grid.addGrid(lr.elasticNetParam, [0, 1])

Build the grid \\
grid = grid.build()
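The built grid is the set of all combinations of the values added above. The following plain-Python sketch enumerates the same combinations with `itertools.product`, so it runs without Spark. The dictionary keys are illustrative labels only; the real grid holds Spark param maps, and their ordering may differ from the ordering here.

```python
from itertools import product

# Ten regParam candidates, 0.00 through 0.09 (the values np.arange(0, .1, .01)
# produces, up to floating-point rounding), and two elasticNetParam candidates.
reg_params = [round(i * 0.01, 2) for i in range(10)]
elastic_net_params = [0, 1]

# One dictionary per combination, analogous to one param map in the built grid.
param_maps = [
    {"regParam": reg, "elasticNetParam": enet}
    for reg, enet in product(reg_params, elastic_net_params)
]

print(len(param_maps))  # 20 combinations: 10 regParam values x 2 elasticNetParam values
print(param_maps[:2])
```

Each extra hyperparameter multiplies the number of models to fit, which is worth keeping in mind before widening the grid.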

Create the CrossValidator \\
cv = tune.CrossValidator(estimator=lr,
               estimatorParamMaps=grid,
               evaluator=evaluator
               )
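CrossValidator is created here without a numFolds argument, so it uses its default of 3 folds. As a rough illustration of what k-fold cross-validation does, the sketch below splits row indices into folds and lets each fold serve once as the validation set. `k_fold_indices` is a hypothetical helper, and the round-robin assignment is a simplification chosen to keep the output deterministic; Spark's own fold assignment differs in detail.

```python
def k_fold_indices(n_rows, num_folds=3):
    """Yield (train_indices, validation_indices) pairs, one pair per fold."""
    # Round-robin assignment: row i goes to fold i % num_folds.
    folds = [list(range(start, n_rows, num_folds)) for start in range(num_folds)]
    for held_out in range(num_folds):
        validation = folds[held_out]
        train = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
        yield train, validation


splits = list(k_fold_indices(n_rows=9, num_folds=3))
for train, validation in splits:
    print("train:", train, "validation:", validation)
```

Every candidate in the parameter grid is fitted once per fold, so with the grid defined above (10 regParam values and 2 elasticNetParam values) and 3 folds that is 60 fits before the best setting is chosen.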

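Fit the CrossValidator on the training set and keep the best model \\
best_lr = cv.fit(training).bestModel

Use the model to predict the test set \\
test_results = best_lr.transform(test)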

Evaluate the predictions \\
print(evaluator.evaluate(test_results))
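The number printed here is the area under the ROC curve. One way to read it is as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The plain-Python sketch below computes that pairwise quantity directly on made-up labels and scores. `area_under_roc` is a hypothetical helper for illustration; Spark computes the metric from the ROC curve itself, so its value on real data may differ slightly.

```python
def area_under_roc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Made-up example: two late flights (label 1) and two on-time flights (label 0).
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(area_under_roc(labels, scores))  # 0.75: 3 of the 4 positive/negative pairs are ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a scale for judging the evaluator's output.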

In [0]:
# The following are from https://medium.com/@rmache/big-data-with-spark-in-google-colab-7c046e24b3
# Install spark-related dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz

!pip install -q findspark
!pip install pyspark
# Set up required environment variables

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"
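With the environment variables in place, a Spark session still has to be started before any of the spark.* calls in the course notes will work. The following is a minimal sketch of one way to do that, assuming findspark and pyspark were installed by the commands above. `start_local_spark` and the app name are hypothetical names chosen for this example.

```python
import os


def start_local_spark(app_name="colab-pyspark"):
    """Initialise findspark and return a local SparkSession.

    Assumes JAVA_HOME and SPARK_HOME are already set and that
    findspark and pyspark are installed.
    """
    missing = [name for name in ("JAVA_HOME", "SPARK_HOME") if name not in os.environ]
    if missing:
        raise RuntimeError("set these environment variables first: " + ", ".join(missing))

    # Imported lazily so the check above runs even where Spark is not installed.
    import findspark
    findspark.init()  # makes the Spark installation in SPARK_HOME importable

    from pyspark.sql import SparkSession
    return SparkSession.builder.master("local[*]").appName(app_name).getOrCreate()


# Usage in the notebook (requires Java and Spark to be installed):
# spark = start_local_spark()
# print(spark.version)
```

The master setting "local[*]" runs Spark on the Colab machine itself using all available cores, which is enough for working through the exercises.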

In [5]:
# Point Colaboratory to your Google Drive

from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
# Download datasets directly to your Google Drive "Colab Datasets" folder

import requests

# Download the 2007 and 2008 data
for year in (2007, 2008):
    file_url = f"http://stat-computing.org/dataexpo/2009/{year}.csv.bz2"
    r = requests.get(file_url, stream=True)
    r.raise_for_status()  # stop on an HTTP error instead of saving an error page

    # Stream the response to disk in 1 KB blocks
    with open(f"/content/gdrive/My Drive/Colab Datasets/{year}.csv.bz2", "wb") as file:
        for block in r.iter_content(chunk_size=1024):
            if block:
                file.write(block)