# SQL Options in Spark HW

Alright, let's apply what we learned in the lecture to a new dataset!

**But first!**

To use Spark SQL, we need to create a Spark Session.

In [2]:
import findspark

findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SQLOptions").master("local[*]").getOrCreate()

spark

## Read in our DataFrame for this Notebook

For this notebook we will be using the Google Play Store csv file attached to this lecture. Let's go ahead and read it in. 

### About this dataset

This dataset contains a list of Google Play Store apps and info about them, such as the category, rating, reviews, size, etc.

**Source:** https://www.kaggle.com/lava18/google-play-store-apps

In [3]:
gps = spark.read.csv(
    path=r"Datasets/googleplaystore.csv",
    inferSchema=True,
    sep=",",
    header=True,
)

gps.printSchema()

root
 |-- App: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Rating: string (nullable = true)
 |-- Reviews: string (nullable = true)
 |-- Size: string (nullable = true)
 |-- Installs: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Content Rating: string (nullable = true)
 |-- Genres: string (nullable = true)
 |-- Last Updated: string (nullable = true)
 |-- Current Ver: string (nullable = true)
 |-- Android Ver: string (nullable = true)



## First things first

Let's check out the first few lines of the dataframe to see what we are working with.

In [4]:
gps.limit(10).toPandas()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
5,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
6,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,19M,"50,000+",Free,0,Everyone,Art & Design,"April 26, 2018",1.1,4.0.3 and up
7,Infinite Painter,ART_AND_DESIGN,4.1,36815,29M,"1,000,000+",Free,0,Everyone,Art & Design,"June 14, 2018",6.1.61.1,4.2 and up
8,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up
9,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3.1M,"10,000+",Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",2.8,4.0.3 and up


Let's also check the schema to make sure all the column types were correctly inferred.

In [None]:
# Done above

Looks like we need to edit some of the datatypes. For now we will convert Reviews and Price to integers and Rating to a float; the Size and Installs columns will need a bit more cleaning first. Since we haven't been over this yet, I'm going to provide the code for you here so you can get a quick look at how it is used (and how often we need it!).

**make sure to change the df name to whatever you named your df**

In [76]:
from pyspark.sql.types import IntegerType, FloatType

new_gps = gps.withColumn("Rating", gps["Rating"].cast(FloatType())) \
            .withColumn("Reviews", gps["Reviews"].cast(IntegerType())) \
            .withColumn("Price", gps["Price"].cast(IntegerType()))
new_gps.printSchema()
new_gps.limit(5).toPandas()

root
 |-- App: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Rating: float (nullable = true)
 |-- Reviews: integer (nullable = true)
 |-- Size: string (nullable = true)
 |-- Installs: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: integer (nullable = true)
 |-- Content Rating: string (nullable = true)
 |-- Genres: string (nullable = true)
 |-- Last Updated: string (nullable = true)
 |-- Current Ver: string (nullable = true)
 |-- Android Ver: string (nullable = true)


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


Looks like that worked! Great! Let's dig in. 
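
The Size and Installs columns were left as strings because values such as `19M` and `10,000+` are not plain numbers. As a rough sketch of the kind of cleaning they would need, the plain-Python helpers below parse those two formats. The function names are hypothetical and not part of the lecture code, and the `k` suffix and the fallback to `None` are assumptions, since only values like `19M` and `10,000+` are visible in the preview. In Spark itself this logic would more typically be written with column functions before casting.

```python
from typing import Optional


def parse_installs(raw: str) -> Optional[int]:
    """Turn '10,000+' into 10000; return None for anything non-numeric."""
    digits = raw.replace(",", "").rstrip("+").strip()
    return int(digits) if digits.isdigit() else None


def parse_size_mb(raw: str) -> Optional[float]:
    """Turn '19M' into 19.0 (megabytes); return None for unrecognised values."""
    text = raw.strip()
    if not text:
        return None
    # Assumed suffixes: 'M' for megabytes, 'k' for kilobytes.
    units = {"M": 1.0, "k": 1.0 / 1024}
    factor = units.get(text[-1])
    if factor is None:
        return None
    try:
        return float(text[:-1]) * factor
    except ValueError:
        return None


print(parse_installs("10,000+"))            # 10000
print(parse_size_mb("8.7M"))                # 8.7
print(parse_size_mb("Varies with device"))  # None
```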

## 1. Create Tempview

Go ahead and create a temp view of the dataframe so we can work with it in Spark SQL.

In [77]:
new_gps.createOrReplaceTempView("NEWGPS")

## 2. Select all apps with ratings above 4.1

Use your temp view to select all apps with ratings above 4.1.

In [78]:
spark.sql(
    """
    SELECT *
    FROM NEWGPS
    WHERE Rating > 4.1
    """
).limit(5).toPandas()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
1,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
2,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
3,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
4,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up


## 3. Now pass your results to an object 
(i.e., create a Spark dataframe)

Select just the App and Rating columns where the Category is COMICS and the Rating is above 4.5.

In [80]:
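# Despite its name, above45 keeps every app rated above 4.1;
# the stricter 4.5 cutoff is applied in the filter further down.
above45 = spark.sql(
    """
    SELECT *
    FROM NEWGPS
    WHERE Rating > 4.1
    """
)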

above45.printSchema()

above45.createOrReplaceTempView(
    "ABOVE45"
)

above45.select(
    above45.App,
    above45.Rating,
).where(
    (above45.Category.isin(["COMICS"])) &
    (above45.Rating > 4.5)
).limit(5).toPandas()

root
 |-- App: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Rating: float (nullable = true)
 |-- Reviews: integer (nullable = true)
 |-- Size: string (nullable = true)
 |-- Installs: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: integer (nullable = true)
 |-- Content Rating: string (nullable = true)
 |-- Genres: string (nullable = true)
 |-- Last Updated: string (nullable = true)
 |-- Current Ver: string (nullable = true)
 |-- Android Ver: string (nullable = true)



Unnamed: 0,App,Rating
0,Manga Master - Best manga & comic reader,4.6
1,GANMA! - All original stories free of charge f...,4.7
2,Röhrich Werner Soundboard,4.7
3,Unicorn Pokez - Color By Number,4.8
4,Manga - read Thai translation,4.6


In [81]:
spark.sql(
    """
    SELECT App, Rating
    FROM ABOVE45
    WHERE Category = 'COMICS' AND Rating > 4.5
    """
).limit(5).toPandas()

Unnamed: 0,App,Rating
0,Manga Master - Best manga & comic reader,4.6
1,GANMA! - All original stories free of charge f...,4.7
2,Röhrich Werner Soundboard,4.7
3,Unicorn Pokez - Color By Number,4.8
4,Manga - read Thai translation,4.6


## 4. Which category has the most cumulative reviews

Only select the one category with the most reviews.

*Note: this will require adding all the reviews together for each category*

In [90]:
above45.select(
    "Category",
    "Reviews",
).groupBy(
    "Category"
).agg(
    F.sum("Reviews"),
).toPandas()

Unnamed: 0,Category,sum(Reviews)
0,EVENTS,66650.0
1,COMICS,3148919.0
2,SPORTS,62617620.0
3,WEATHER,14257100.0
4,VIDEO_PLAYERS,107141900.0
5,AUTO_AND_VEHICLES,1136069.0
6,PARENTING,909569.0
7,ENTERTAINMENT,51605140.0
8,PERSONALIZATION,88051410.0
9,HEALTH_AND_FITNESS,36537880.0


In [88]:
spark.sql(
    """
    SELECT Category, SUM(Reviews)
    FROM ABOVE45
    GROUP BY Category
    """
).toPandas()

Unnamed: 0,Category,sum(Reviews)
0,EVENTS,66650.0
1,COMICS,3148919.0
2,SPORTS,62617620.0
3,WEATHER,14257100.0
4,VIDEO_PLAYERS,107141900.0
5,AUTO_AND_VEHICLES,1136069.0
6,PARENTING,909569.0
7,ENTERTAINMENT,51605140.0
8,PERSONALIZATION,88051410.0
9,HEALTH_AND_FITNESS,36537880.0
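
Both queries above return one row per category, so the largest total still has to be picked out by eye. A minimal sketch of narrowing this to a single row: alias the sum, sort by it in descending order, and keep one row. The rows below are made up, and the statement is run through Python's built-in `sqlite3` module only so that the sketch works without a Spark session. It is plain SQL, so the same `GROUP BY ... ORDER BY ... LIMIT 1` shape should carry over to `spark.sql` against the `ABOVE45` view.

```python
import sqlite3

# In-memory stand-in for the ABOVE45 view, filled with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apps (Category TEXT, Reviews INTEGER)")
conn.executemany(
    "INSERT INTO apps VALUES (?, ?)",
    [
        ("COMICS", 120),
        ("TOOLS", 300),
        ("COMICS", 50),
        ("TOOLS", 25),
        ("EVENTS", 10),
    ],
)

# Sum the reviews per category, sort by that sum, keep only the top row.
top_category = conn.execute(
    """
    SELECT Category, SUM(Reviews) AS TotalReviews
    FROM apps
    GROUP BY Category
    ORDER BY TotalReviews DESC
    LIMIT 1
    """
).fetchone()

print(top_category)  # ('TOOLS', 325)
```

On the DataFrame side, the equivalent would presumably be to alias the aggregate with `F.sum("Reviews").alias("TotalReviews")` and then chain `.orderBy(F.desc("TotalReviews")).limit(1)`.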


## 5. Which App has the most reviews?

Display ONLY the top result

Include only the App column and the Reviews column.

In [91]:
above45.select(
    "App",
    "Reviews",
).orderBy(
    F.desc("Reviews"),
).limit(1).toPandas()

Unnamed: 0,App,Reviews
0,WhatsApp Messenger,69119316


In [93]:
spark.sql(
    """
    SELECT App, Reviews
    FROM ABOVE45
    ORDER BY Reviews DESC
    """
).limit(1).toPandas()

Unnamed: 0,App,Reviews
0,WhatsApp Messenger,69119316


## 5. Select all apps that contain the word 'dating' anywhere in the title

*Note: we did not cover this in the lecture. You'll have to use your SQL knowledge :) Google it if you need to.*

In [0]:
above45.select(
    "App"
).where(
    F.col("App").contains("dating")
).toPandas()

In [96]:
spark.sql(
    """
    SELECT App
    FROM ABOVE45
    WHERE App LIKE '%dating%'
    """
).toPandas()

Unnamed: 0,App
0,Friend Find: free chat + flirt dating app
1,Spine- The dating app
2,Princess Closet : Otome games free dating sim
3,happn – Local dating app
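
Note that the pattern is matched case-sensitively in Spark, for both `LIKE` and `contains`, so a title spelled with a capital "Dating" would be missed. Lowercasing the title before comparing is one common workaround. The sketch below shows the idea on made-up rows (only "Spine- The dating app" comes from the output above), using Python's built-in `sqlite3` as a stand-in so it runs without Spark. It uses `instr` rather than `LIKE` because SQLite's own `LIKE` ignores ASCII case by default; in Spark SQL the analogous filter would be `lower(App) LIKE '%dating%'`.

```python
import sqlite3

# In-memory stand-in table; two of the three titles are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apps (App TEXT)")
conn.executemany(
    "INSERT INTO apps VALUES (?)",
    [("Spine- The dating app",), ("Hot Dating Chat",), ("Calculator",)],
)


def titles(query: str) -> list:
    """Run a query and return the matching app titles in sorted order."""
    return sorted(row[0] for row in conn.execute(query))


# Case-sensitive substring test: misses the capitalised 'Dating'.
exact_case = titles("SELECT App FROM apps WHERE instr(App, 'dating') > 0")

# Lowercase the title first so both spellings match.
any_case = titles("SELECT App FROM apps WHERE instr(lower(App), 'dating') > 0")

print(exact_case)  # ['Spine- The dating app']
print(any_case)    # ['Hot Dating Chat', 'Spine- The dating app']
```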


## 6. Use SQL Transformer to display how many free apps there are in this list

In [101]:
from pyspark.ml.feature import SQLTransformer

trans = SQLTransformer(
    statement="""
    SELECT COUNT(APP)
    FROM __THIS__
    WHERE PRICE = 0
    """
)

trans.transform(above45).toPandas()

Unnamed: 0,count(APP)
0,6939


## 7. What is the most popular Genre?

Which genre appears most often in the dataframe? Show only the top result.

In [109]:
above45.select(
    "Genres",
).groupBy(
    "Genres",
).agg(
    F.count("Genres").alias("Count"),
).orderBy(
    F.desc(F.col("Count"))
).limit(1).toPandas()

Unnamed: 0,Genres,Count
0,Tools,511


In [112]:
spark.sql(
    """
    SELECT Genres, COUNT(Genres) AS Count
    FROM ABOVE45
    GROUP BY Genres
    ORDER BY Count DESC
    """
).limit(1).toPandas()

Unnamed: 0,Genres,Count
0,Tools,511


## 8. Select all the apps in the 'Tools' genre that have more than 100 reviews

In [0]:
above45.select(
    "App",
    "Reviews",
    "Genres",
).where(
    (above45.Genres == "Tools") &
    (above45.Reviews > 100)
).toPandas()

In [116]:
spark.sql(
    """
    SELECT App, Reviews, Genres
    FROM ABOVE45
    WHERE Genres = 'Tools' AND Reviews > 100
    """
).toPandas()

Unnamed: 0,App,Reviews,Genres
0,Google,8033493,Tools
1,Google Translate,5745093,Tools
2,Moto Display,18239,Tools
3,Motorola Alert,24199,Tools
4,Cache Cleaner-DU Speed Booster (booster & clea...,12759663,Tools
...,...,...,...
312,Fingerprint Quick Action,8484,Tools
313,Finger Scanner Gestures,2531,Tools
314,ChopAssistant,455,Tools
315,Reindeer VPN - Proxy VPN,7339,Tools


## That's all folks! Great job!