# SQL Options in Spark

PySpark provides two main options for running straight SQL: Spark SQL and SQL Transformer.

## 1. Spark SQL

DataFrames provide two methods that register a temporary view, which lets you run **SQL** queries against a Spark DataFrame:

 - **createOrReplaceTempView:** The lifetime of this temporary view is tied to the SparkSession that was used to create the dataset. It creates (or replaces, if that view name already exists) a lazily evaluated "view" that you can then use like a Hive table in Spark SQL. It is not persisted in memory unless you cache the dataset that underpins the view.
 - **createGlobalTempView:** The lifetime of this temporary view is tied to the Spark application. This is useful when you want to share data among different sessions and keep it alive until your application ends. Global temporary views live in the system database `global_temp`, so queries must refer to them by the qualified name, e.g. `global_temp.my_view`.
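The difference in scope can be sketched as follows. This is a minimal illustration rather than part of the original notebook: the function `view_scopes_demo`, the helper `global_view_name`, the view names and the two-row dataframe are all made up for the example, and `global_temp` is assumed to be the database that holds global views (Spark's default setting).

```python
def global_view_name(name, database="global_temp"):
    """Qualified name for a global temp view (assumes the default database)."""
    return f"{database}.{name}"


def view_scopes_demo(spark):
    """Register one view of each kind and inspect them from a second session.

    `spark` is an existing SparkSession; all names below are illustrative.
    """
    demo = spark.createDataFrame(
        [("South West", 3090), ("London", 120)], ["Region", "Count"]
    )
    demo.createOrReplaceTempView("session_view")    # visible to this session
    demo.createOrReplaceGlobalTempView("app_view")  # visible application-wide

    other = spark.newSession()  # same application, separate temporary views
    rows_via_global = other.sql(
        f"SELECT * FROM {global_view_name('app_view')}"
    ).count()
    # Names of the tables and session-scoped views the second session lists
    names_in_other = [t.name for t in other.catalog.listTables()]
    return rows_via_global, "session_view" in names_in_other
```

Calling `view_scopes_demo(spark)` with a live session should show that the second session can read the global view through its qualified name, while the session-scoped view is not expected to appear in its listing.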

**SparkSession vs. Spark application:**

A **Spark application** can:

- run a single batch job
- serve an interactive session with multiple jobs
- act as a long-lived server continually satisfying requests
- run jobs that consist of more than just a single map and reduce
- contain more than one SparkSession

A **SparkSession**, on the other hand:

 - is the entry point to DataFrame and SQL functionality within an application, with its own temporary views and SQL configuration
 - can be created without creating a SparkConf, SparkContext or SQLContext (they're encapsulated within the SparkSession, which was introduced in Spark 2.0)


## 2. SQL Transformer

The second option is `SQLTransformer` from `pyspark.ml.feature`, which lets you write free-form SQL statements and apply them to a DataFrame.

# SQL Options within regular PySpark calls

1. The expr function in PySpark's SQL function library
2. PySpark's selectExpr function

We will go over all of these in detail, so buckle up!


Let's start with Spark SQL. But first we need to create a Spark Session!

In [1]:
# import findspark
# findspark.init()

import pyspark # if you use findspark, import only after findspark.init()
from pyspark.sql import SparkSession
# May take a while locally
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

# Note: this counts block managers (the driver plus any executors), not CPU cores
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print("You are working with", cores, "core(s)")
spark

You are working with 1 core(s)


## Let's Read in our DataFrame for this Notebook

### About this data

Recorded crime for the Police Force Areas of England and Wales. The data are rolling 12-month totals, with points at the end of each financial year from year ending March 2003 to March 2007 and at the end of each quarter from June 2007.

**Source:** https://www.kaggle.com/r3w0p4/recorded-crime-data-at-police-force-area-level

In [2]:
# Start by reading a basic csv dataset
# Let Spark know about the header and infer the Schema types!

path = 'Datasets/'

crime = spark.read.csv(path+"rec-crime-pfa.csv",header=True,inferSchema=True)

In [16]:
# toPandas() gives a more readable display than show()
crime.limit(5).toPandas()

Unnamed: 0,12 months ending,PFA,Region,Offence,Rolling year total number of offences
0,31/03/2003,Avon and Somerset,South West,All other theft offences,25959
1,31/03/2003,Avon and Somerset,South West,Bicycle theft,3090
2,31/03/2003,Avon and Somerset,South West,Criminal damage and arson,26202
3,31/03/2003,Avon and Somerset,South West,Death or serious injury caused by illegal driving,2
4,31/03/2003,Avon and Somerset,South West,Domestic burglary,14561


In [17]:
# printSchema() prints the schema itself and returns None, so no print() is needed
crime.printSchema()

root
 |-- 12 months ending: string (nullable = true)
 |-- PFA: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Offence: string (nullable = true)
 |-- Rolling year total number of offences: integer (nullable = true)


To make SQL calls against this dataframe easier, we will rename any column we plan to use whose name contains spaces (otherwise the name would have to be wrapped in backticks in every query). We will not be using the first column, so I'll leave that one as is, but we will be using the last one, so I will rename it to Count.

In [4]:
# Rename the long column so it is easy to reference in SQL
df = crime.withColumnRenamed('Rolling year total number of offences','Count')
df.printSchema()

root
 |-- 12 months ending: string (nullable = true)
 |-- PFA: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Offence: string (nullable = true)
 |-- Count: integer (nullable = true)
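Renaming one column by hand works for this dataset. When many columns contain spaces, a small helper can generate SQL-friendly names for all of them at once. The sketch below is only an illustration: `sql_safe_name` is a hypothetical helper, not a PySpark function, and both the underscore convention and the `col_` prefix for names that start with a digit are assumptions.

```python
import re


def sql_safe_name(name):
    """Replace each run of non-alphanumeric characters with one underscore."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_")
    # Assumption: prefix names that start with a digit to stay on the safe side
    return f"col_{cleaned}" if cleaned[:1].isdigit() else cleaned


columns = ["12 months ending", "PFA", "Region", "Offence",
           "Rolling year total number of offences"]
safe_columns = [sql_safe_name(c) for c in columns]
print(safe_columns)

# With a live dataframe, the new names could be applied in one call:
# renamed = crime.toDF(*safe_columns)
```

The notebook keeps the shorter name `Count` for the last column, which is easier to type in the queries that follow.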


In [5]:
# Create a temporary view of the dataframe
df.createOrReplaceTempView("tempview")

In [11]:
# Then Query the temp view
spark.sql("SELECT * FROM tempview WHERE Count > 1000").limit(5).toPandas()

+----------------+-----------------+----------+--------------------+-----+
|12 months ending|              PFA|    Region|             Offence|Count|
+----------------+-----------------+----------+--------------------+-----+
|      31/03/2003|Avon and Somerset|South West|All other theft o...|25959|
|      31/03/2003|Avon and Somerset|South West|       Bicycle theft| 3090|
|      31/03/2003|Avon and Somerset|South West|Criminal damage a...|26202|
|      31/03/2003|Avon and Somerset|South West|   Domestic burglary|14561|
|      31/03/2003|Avon and Somerset|South West|       Drug offences| 2308|
|      31/03/2003|Avon and Somerset|South West|      Fraud offences| 5339|
|      31/03/2003|Avon and Somerset|South West|Miscellaneous cri...| 1597|
|      31/03/2003|Avon and Somerset|South West|Non-domestic burg...|15621|
|      31/03/2003|Avon and Somerset|South West|Public order offe...| 4025|
|      31/03/2003|Avon and Somerset|South West|             Robbery| 3504|
|      31/03/2003|Avon an

In [9]:
# Or select only the columns you want
spark.sql("SELECT Region, PFA FROM tempview WHERE Count > 1000").limit(5).toPandas()

Unnamed: 0,Region,PFA
0,South West,Avon and Somerset
1,South West,Avon and Somerset
2,South West,Avon and Somerset
3,South West,Avon and Somerset
4,South West,Avon and Somerset


In [21]:
# You can also pass your query results to an object 
# (we don't need to use .collect() here)
sql_results = spark.sql("SELECT * FROM tempview WHERE Count > 1000 AND Region='South West'")
sql_results.limit(5).toPandas()

Unnamed: 0,12 months ending,PFA,Region,Offence,Count
0,31/03/2003,Avon and Somerset,South West,All other theft offences,25959
1,31/03/2003,Avon and Somerset,South West,Bicycle theft,3090
2,31/03/2003,Avon and Somerset,South West,Criminal damage and arson,26202
3,31/03/2003,Avon and Somerset,South West,Domestic burglary,14561
4,31/03/2003,Avon and Somerset,South West,Drug offences,2308


In [10]:
# We can even do aggregated "group by" calls like this
spark.sql("SELECT Region, sum(Count) AS Total FROM tempview GROUP BY Region").limit(5).toPandas()

Unnamed: 0,Region,Total
0,Fraud: CIFAS,7678981
1,North West,30235732
2,British Transport Police,3029117
3,Wales,11137260
4,London,42691902


Basically, any valid Spark SQL query works here.

### SQL Transformer

You also have the option to use `SQLTransformer`, which lets you write free-form SQL statements.

In [7]:
# First we need to import SQL transformer
from pyspark.ml.feature import SQLTransformer

In [10]:
# Then we create an SQL call 
sqlTrans = SQLTransformer(
    statement="SELECT PFA,Region,Offence FROM __THIS__") 
# And use it to transform our df object
sqlTrans.transform(df).show(5)

+----------------+-----------------+----------+--------------------+-----+
|12 months ending|              PFA|    Region|             Offence|Count|
+----------------+-----------------+----------+--------------------+-----+
|      31/03/2003|Avon and Somerset|South West|All other theft o...|25959|
|      31/03/2003|Avon and Somerset|South West|       Bicycle theft| 3090|
|      31/03/2003|Avon and Somerset|South West|Criminal damage a...|26202|
|      31/03/2003|Avon and Somerset|South West|   Domestic burglary|14561|
|      31/03/2003|Avon and Somerset|South West|       Drug offences| 2308|
+----------------+-----------------+----------+--------------------+-----+
only showing top 5 rows



In [28]:
type(sqlTrans)

pyspark.ml.feature.SQLTransformer

In [25]:
# Note that "__THIS__" is a reserved placeholder for the input dataframe and cannot be changed to, for example, __THAT__
sqlTrans = SQLTransformer(
    statement="SELECT PFA,Region,Offence FROM __THAT__") 
# And use it to transform our df object
sqlTrans.transform(df).show(5)

AnalysisException: 'Table or view not found: __THAT__; line 1 pos 31'
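The error message hints at what happens under the hood. As far as I understand it, `SQLTransformer` registers the input dataframe as a temporary view under a generated name and substitutes that name for `__THIS__` before running the statement, so any other token is treated as an ordinary (and here missing) table name. The sketch below imitates that behavior; `resolve_statement` and `run_like_sql_transformer` are hypothetical helpers written for illustration and are not part of PySpark.

```python
import uuid

PLACEHOLDER = "__THIS__"


def resolve_statement(statement, view_name):
    """Swap the placeholder for the name of a registered view."""
    return statement.replace(PLACEHOLDER, view_name)


def run_like_sql_transformer(spark, df, statement):
    """Approximate SQLTransformer.transform with plain Spark SQL calls."""
    view_name = "sql_transformer_" + uuid.uuid4().hex  # made-up unique name
    df.createOrReplaceTempView(view_name)
    try:
        return spark.sql(resolve_statement(statement, view_name))
    finally:
        # The query plan should already be resolved, so the helper view can go
        spark.catalog.dropTempView(view_name)


# Only __THIS__ is rewritten; __THAT__ is left for Spark to look up as a table
print(resolve_statement("SELECT PFA,Region,Offence FROM __THIS__", "v"))
print(resolve_statement("SELECT PFA,Region,Offence FROM __THAT__", "v"))
```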

In [23]:
# Also Note that a call like this won't work...
SQLTransformer(statement="SELECT PFA,Region,Offence FROM __THIS__").show()

AttributeError: 'SQLTransformer' object has no attribute 'show'

**Now how about a group by call**

In [26]:
# Note that this call will not work on the original dataframe "crime", which has no "Count" column (it only exists in df after the rename)

sqlTrans = SQLTransformer(
    statement="SELECT Offence, SUM(Count) as Total FROM __THIS__ GROUP BY Offence") 
sqlTrans.transform(df).show(5)

+--------------------+--------+
|             Offence|   Total|
+--------------------+--------+
|Public order offe...|10925676|
|       Bicycle theft| 5297006|
|Residential burglary| 1671469|
|Violence without ...|16590158|
|All other theft o...|30979393|
+--------------------+--------+
only showing top 5 rows



**And a where statement**

In [27]:
sqlTrans = SQLTransformer(
    statement="SELECT PFA,Offence FROM __THIS__ WHERE Count > 1000") 
sqlTrans.transform(df).show(5)

+-----------------+--------------------+
|              PFA|             Offence|
+-----------------+--------------------+
|Avon and Somerset|All other theft o...|
|Avon and Somerset|       Bicycle theft|
|Avon and Somerset|Criminal damage a...|
|Avon and Somerset|   Domestic burglary|
|Avon and Somerset|       Drug offences|
+-----------------+--------------------+
only showing top 5 rows



**You can also, of course, assign the output to a dataframe**

In [29]:
result = sqlTrans.transform(df)
result.show(5)

+-----------------+--------------------+
|              PFA|             Offence|
+-----------------+--------------------+
|Avon and Somerset|All other theft o...|
|Avon and Somerset|       Bicycle theft|
|Avon and Somerset|Criminal damage a...|
|Avon and Somerset|   Domestic burglary|
|Avon and Somerset|       Drug offences|
+-----------------+--------------------+
only showing top 5 rows



# SQL Options within regular PySpark calls

### The expr function in PySpark's SQL Function Library

You can also use the expr function from the pyspark.sql.functions module, coupled with either PySpark's withColumn function or the select function.

In [30]:
# First we need to import the function
from pyspark.sql.functions import expr 

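Let's add a percent column to the dataframe. To do this, we first need the grand total of the Count column across the whole dataframe (a row-level expression like the ones below can't look this up for us, unfortunately, so we compute it first and hard-code the result).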

In [34]:
sqlTrans = SQLTransformer(
    statement="SELECT SUM(Count) as Total FROM __THIS__") 
sqlTrans.transform(df).show(5)

+---------+
|    Total|
+---------+
|244720928|
+---------+
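The total above can be fed into the percent expression without retyping it by hand. This is a minimal sketch, assuming the total has been copied into a Python variable (with a live dataframe it could come from something like `df.agg({'Count': 'sum'}).first()[0]`). `percent_of_total` is a hypothetical helper that only mirrors the arithmetic in plain Python.

```python
total = 244720928  # value reported by the SUM(Count) query


def percent_of_total(count, total):
    """Plain-Python mirror of round((count/total)*100, 2)."""
    return round(count / total * 100, 2)


# Build the SQL expression from the variable instead of hard-coding the number
percent_expr = f"round((Count/{total})*100,2)"

print(percent_expr)
print(percent_of_total(25959, total))  # a Count value from the crime data

# Usage with Spark (assumes `df` and `expr` from this notebook):
# df.withColumn("percent", expr(percent_expr)).show()
```

Because every row is a tiny share of the grand total, most values round to 0.0 or 0.01 at two decimal places. Python's `round` and Spark's `round` treat exact ties differently, but none of the values here fall on a tie.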



In [36]:
# We could add a percent column to our df 
# that shows the offence %
# with the "withColumn" command
df.withColumn("percent",expr("round((count/244720928)*100,2)")).show()

+----------------+-----------------+----------+--------------------+-----+-------+
|12 months ending|              PFA|    Region|             Offence|Count|percent|
+----------------+-----------------+----------+--------------------+-----+-------+
|      31/03/2003|Avon and Somerset|South West|All other theft o...|25959|   0.01|
|      31/03/2003|Avon and Somerset|South West|       Bicycle theft| 3090|    0.0|
|      31/03/2003|Avon and Somerset|South West|Criminal damage a...|26202|   0.01|
|      31/03/2003|Avon and Somerset|South West|Death or serious ...|    2|    0.0|
|      31/03/2003|Avon and Somerset|South West|   Domestic burglary|14561|   0.01|
|      31/03/2003|Avon and Somerset|South West|       Drug offences| 2308|    0.0|
|      31/03/2003|Avon and Somerset|South West|      Fraud offences| 5339|    0.0|
|      31/03/2003|Avon and Somerset|South West|            Homicide|   19|    0.0|
|      31/03/2003|Avon and Somerset|South West|Miscellaneous cri...| 1597|    0.0|
|   

In [35]:
# Same thing with the "select" command
df.select("*",expr("round((count/244720928)*100,2) AS percent")).show()

+----------------+-----------------+----------+--------------------+-----+-------+
|12 months ending|              PFA|    Region|             Offence|Count|percent|
+----------------+-----------------+----------+--------------------+-----+-------+
|      31/03/2003|Avon and Somerset|South West|All other theft o...|25959|   0.01|
|      31/03/2003|Avon and Somerset|South West|       Bicycle theft| 3090|    0.0|
|      31/03/2003|Avon and Somerset|South West|Criminal damage a...|26202|   0.01|
|      31/03/2003|Avon and Somerset|South West|Death or serious ...|    2|    0.0|
|      31/03/2003|Avon and Somerset|South West|   Domestic burglary|14561|   0.01|
|      31/03/2003|Avon and Somerset|South West|       Drug offences| 2308|    0.0|
|      31/03/2003|Avon and Somerset|South West|      Fraud offences| 5339|    0.0|
|      31/03/2003|Avon and Somerset|South West|            Homicide|   19|    0.0|
|      31/03/2003|Avon and Somerset|South West|Miscellaneous cri...| 1597|    0.0|
|   

### PySpark's selectExpr function

Very similar idea here but slightly different syntax.

In [37]:
df.selectExpr("*","round((count/244720928)*100,2) AS percent").filter("Region ='South West'").show()

+----------------+-----------------+----------+--------------------+-----+-------+
|12 months ending|              PFA|    Region|             Offence|Count|percent|
+----------------+-----------------+----------+--------------------+-----+-------+
|      31/03/2003|Avon and Somerset|South West|All other theft o...|25959|   0.01|
|      31/03/2003|Avon and Somerset|South West|       Bicycle theft| 3090|    0.0|
|      31/03/2003|Avon and Somerset|South West|Criminal damage a...|26202|   0.01|
|      31/03/2003|Avon and Somerset|South West|Death or serious ...|    2|    0.0|
|      31/03/2003|Avon and Somerset|South West|   Domestic burglary|14561|   0.01|
|      31/03/2003|Avon and Somerset|South West|       Drug offences| 2308|    0.0|
|      31/03/2003|Avon and Somerset|South West|      Fraud offences| 5339|    0.0|
|      31/03/2003|Avon and Somerset|South West|            Homicide|   19|    0.0|
|      31/03/2003|Avon and Somerset|South West|Miscellaneous cri...| 1597|    0.0|
|   

## That's all folks! Great job!

In [None]:
# Speed test

In [15]:
spark.sql("SELECT * FROM tempview WHERE Count > 1000").show()

+----------------+-----------------+----------+--------------------+-----+
|12 months ending|              PFA|    Region|             Offence|Count|
+----------------+-----------------+----------+--------------------+-----+
|      31/03/2003|Avon and Somerset|South West|All other theft o...|25959|
|      31/03/2003|Avon and Somerset|South West|       Bicycle theft| 3090|
|      31/03/2003|Avon and Somerset|South West|Criminal damage a...|26202|
|      31/03/2003|Avon and Somerset|South West|   Domestic burglary|14561|
|      31/03/2003|Avon and Somerset|South West|       Drug offences| 2308|
|      31/03/2003|Avon and Somerset|South West|      Fraud offences| 5339|
|      31/03/2003|Avon and Somerset|South West|Miscellaneous cri...| 1597|
|      31/03/2003|Avon and Somerset|South West|Non-domestic burg...|15621|
|      31/03/2003|Avon and Somerset|South West|Public order offe...| 4025|
|      31/03/2003|Avon and Somerset|South West|             Robbery| 3504|
|      31/03/2003|Avon an

In [None]:
# Then we create an SQL call 
sqlTrans = SQLTransformer(
    statement="SELECT * FROM __THIS__ WHERE Count > 1000")
# And use it to transform our df object
sqlTrans.transform(df).show(5)

In [16]:
# Then we create an SQL call 
SQLTransformer(statement="SELECT * FROM __THIS__ WHERE Count > 1000").transform(df).show()

+----------------+-----------------+----------+--------------------+-----+
|12 months ending|              PFA|    Region|             Offence|Count|
+----------------+-----------------+----------+--------------------+-----+
|      31/03/2003|Avon and Somerset|South West|All other theft o...|25959|
|      31/03/2003|Avon and Somerset|South West|       Bicycle theft| 3090|
|      31/03/2003|Avon and Somerset|South West|Criminal damage a...|26202|
|      31/03/2003|Avon and Somerset|South West|   Domestic burglary|14561|
|      31/03/2003|Avon and Somerset|South West|       Drug offences| 2308|
|      31/03/2003|Avon and Somerset|South West|      Fraud offences| 5339|
|      31/03/2003|Avon and Somerset|South West|Miscellaneous cri...| 1597|
|      31/03/2003|Avon and Somerset|South West|Non-domestic burg...|15621|
|      31/03/2003|Avon and Somerset|South West|Public order offe...| 4025|
|      31/03/2003|Avon and Somerset|South West|             Robbery| 3504|
|      31/03/2003|Avon an
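The cells above run the same query through Spark SQL and through `SQLTransformer`, but they do not record how long each one takes. One way to compare them is a small timing helper. This is a sketch only: `timed` is a hypothetical helper, and `count()` appears in the commented usage as an assumed way to force Spark to evaluate the whole query rather than just the first rows.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


# Stand-in workload so the helper can be tried without a Spark session
result, seconds = timed(lambda: sum(range(1_000_000)))
print(f"result={result}, took {seconds:.4f}s")

# With the objects from this notebook, the comparison could look like:
# _, t_sql = timed(lambda: spark.sql(
#     "SELECT * FROM tempview WHERE Count > 1000").count())
# _, t_trans = timed(lambda: SQLTransformer(
#     statement="SELECT * FROM __THIS__ WHERE Count > 1000"
# ).transform(df).count())
```

A single run is noisy, so repeating each measurement a few times gives a fairer comparison.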

---

# SQL Options in Spark HW Solutions

Let's start with Spark SQL. But first we need to create a Spark Session!

In [1]:
import findspark
findspark.init()

import pyspark # import only after findspark.init()
from pyspark.sql import SparkSession
# May take a while locally
spark = SparkSession.builder.appName("SparkSQLHWSolutions").getOrCreate()

# Note: this counts block managers (the driver plus any executors), not CPU cores
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print("You are working with", cores, "core(s)")
spark

You are working with 1 core(s)


## Read in our DataFrame for this Notebook

For this notebook we will be using the Google Play Store csv file.

### About this dataset

Contains a list of Google Play Store Apps and info about the apps like the category, rating, reviews, size, etc.

Source: https://www.kaggle.com/lava18/google-play-store-apps

In [3]:
path = 'Datasets/'

googlep = spark.read.csv(path+"googleplaystore.csv",header=True,inferSchema=True)

In [4]:
googlep.limit(5).toPandas()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [5]:
googlep.printSchema()

root
 |-- App: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Rating: string (nullable = true)
 |-- Reviews: string (nullable = true)
 |-- Size: string (nullable = true)
 |-- Installs: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Content Rating: string (nullable = true)
 |-- Genres: string (nullable = true)
 |-- Last Updated: string (nullable = true)
 |-- Current Ver: string (nullable = true)
 |-- Android Ver: string (nullable = true)



Looks like we need to edit some of the datatypes. Let's just cast Rating to float and Reviews and Price to integer for now, since the Size and Installs columns will need a bit more cleaning.

In [6]:
from pyspark.sql.types import IntegerType, FloatType
df = googlep.withColumn("Rating", googlep["Rating"].cast(FloatType())) \
            .withColumn("Reviews", googlep["Reviews"].cast(IntegerType())) \
            .withColumn("Price", googlep["Price"].cast(IntegerType()))
# printSchema() prints the schema itself and returns None, so no print() is needed
df.printSchema()
df.limit(5).toPandas()

root
 |-- App: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Rating: float (nullable = true)
 |-- Reviews: integer (nullable = true)
 |-- Size: string (nullable = true)
 |-- Installs: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: integer (nullable = true)
 |-- Content Rating: string (nullable = true)
 |-- Genres: string (nullable = true)
 |-- Last Updated: string (nullable = true)
 |-- Current Ver: string (nullable = true)
 |-- Android Ver: string (nullable = true)


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
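The Size and Installs columns are left as strings here because values such as `19M` or `10,000+` cannot be cast directly. The helpers below sketch one way the cleaning could go. The function names are hypothetical, and treating `k` as 1/1024 of a megabyte is an assumption about the data rather than something the dataset documents.

```python
import re


def parse_installs(text):
    """'10,000+' -> 10000; returns None when the text contains no digits."""
    digits = re.sub(r"\D", "", text or "")
    return int(digits) if digits else None


def parse_size_mb(text):
    """'19M' -> 19.0 and '512k' -> 0.5; other values (e.g. 'Varies with device') -> None."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([Mk])", (text or "").strip())
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    # Assumption: sizes ending in 'k' are kilobytes, converted to megabytes
    return value if unit == "M" else value / 1024


print(parse_installs("5,000,000+"))
print(parse_size_mb("8.7M"))
print(parse_size_mb("Varies with device"))
```

These functions could be wrapped with `pyspark.sql.functions.udf` and applied through `withColumn`, although built-in functions such as `regexp_replace` followed by a cast would usually be faster on large data.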


## 1. Create Tempview

Go ahead and create a tempview of the dataframe so we can work with it in Spark SQL.

In [7]:
# Create a temporary view of the dataframe
df.createOrReplaceTempView("tempview")

## 2. Select all apps with ratings above 4.1

In [8]:
# Then Query the temp view
spark.sql("SELECT * FROM tempview WHERE Rating > 4.1").limit(5).toPandas()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
1,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
2,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
3,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
4,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up


## 3. Now pass your results to an object (i.e., create a Spark dataframe)

Select just the App and Rating columns where the Category is COMICS and the Rating is above 4.5.

In [9]:
# Or pass it to an object
sql_results = spark.sql("SELECT App,Rating FROM tempview WHERE Category = 'COMICS' AND Rating > 4.5")
sql_results.limit(5).toPandas()

Unnamed: 0,App,Rating
0,Manga Master - Best manga & comic reader,4.6
1,GANMA! - All original stories free of charge f...,4.7
2,Röhrich Werner Soundboard,4.7
3,Unicorn Pokez - Color By Number,4.8
4,Manga - read Thai translation,4.6


## 4. Which category has the most cumulative reviews?

Only select the one category with the most reviews.

Note: this will require adding all the reviews together for each category.

In [10]:
spark.sql("SELECT Category, sum(Reviews) AS Total_Reviews FROM tempview GROUP BY Category ORDER BY Total_Reviews DESC") \
        .limit(1).toPandas()

Unnamed: 0,Category,Total_Reviews
0,GAME,1585422349


## 5. Which App has the most reviews?

Display ONLY the top result

Include only the App column and the Reviews column.

In [11]:
spark.sql("SELECT App, Reviews FROM tempview ORDER BY Reviews DESC").show(1)

+--------+--------+
|     App| Reviews|
+--------+--------+
|Facebook|78158306|
+--------+--------+
only showing top 1 row



## 5. Select all apps that contain the word 'dating' anywhere in the title

In [12]:
spark.sql("SELECT * FROM tempview WHERE App LIKE '%dating%'").limit(5).toPandas()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,"Meet, chat & date. Free dating app - Chocolate...",DATING,3.9,8661,9.5M,"1,000,000+",Free,0,Mature 17+,Dating,"April 3, 2018",0.1.11,4.0 and up
1,Friend Find: free chat + flirt dating app,DATING,,23,11M,100+,Free,0,Mature 17+,Dating,"July 31, 2018",1.0,4.4 and up
2,Spine- The dating app,DATING,5.0,5,9.3M,500+,Free,0,Teen,Dating,"July 14, 2018",4.0,4.0.3 and up
3,Princess Closet : Otome games free dating sim,FAMILY,4.5,29495,56M,"1,000,000+",Free,0,Teen,Simulation,"May 24, 2018",1.11.0,4.0.3 and up
4,happn – Local dating app,LIFESTYLE,4.3,1118201,Varies with device,"10,000,000+",Free,0,Mature 17+,Lifestyle,"July 24, 2018",Varies with device,Varies with device
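Note that `LIKE` in Spark SQL is case-sensitive, so the query above would miss titles that spell the word as `Dating`. A hedged sketch of a case-insensitive variant follows. `build_like_query` is a hypothetical helper, and it interpolates its arguments directly into the SQL string, so it is only suitable for trusted input.

```python
def build_like_query(view, column, word):
    """Return a query matching `word` anywhere in `column`, ignoring case."""
    return (
        f"SELECT * FROM {view} "
        f"WHERE lower({column}) LIKE '%{word.lower()}%'"
    )


query = build_like_query("tempview", "App", "Dating")
print(query)

# With the session from this notebook:
# spark.sql(query).limit(5).toPandas()
```

Lowercasing both the column and the search word means `dating`, `Dating` and `DATING` are all matched by the same pattern.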


## 6. Use SQL Transformer to display how many free apps there are in this list

In [13]:
# First we need to import SQL transformer
from pyspark.ml.feature import SQLTransformer

In [14]:
sqlTrans = SQLTransformer(
    statement="SELECT count(*) FROM __THIS__ WHERE Type = 'Free'") 
sqlTrans.transform(df).show()

+--------+
|count(1)|
+--------+
|   10037|
+--------+



## 7. What is the most popular Genre?

In [15]:
sqlTrans = SQLTransformer(
    statement="SELECT Genres, count(*) as Total FROM __THIS__ GROUP BY Genres ORDER BY Total DESC") 
sqlTrans.transform(df).show(1)

+------+-----+
|Genres|Total|
+------+-----+
| Tools|  842|
+------+-----+
only showing top 1 row



## 8. Select all the apps in the 'Tools' genre that have more than 100 reviews

In [16]:
sqlTrans = SQLTransformer(
    statement="SELECT App, Reviews FROM __THIS__ WHERE Genres = 'Tools' AND Reviews > 100") 
sqlTrans.transform(df).show(10)

+--------------------+--------+
|                 App| Reviews|
+--------------------+--------+
|   Moto File Manager|   38655|
|              Google| 8033493|
|    Google Translate| 5745093|
|        Moto Display|   18239|
|      Motorola Alert|   24199|
|     Motorola Assist|   37333|
|Cache Cleaner-DU ...|12759663|
|  Moto Suggestions ™|     308|
|          Moto Voice|   33216|
|          Calculator|   40770|
+--------------------+--------+
only showing top 10 rows

