# SQL Options in Spark HW

Alright, let's apply what we learned in the lecture to a new dataset!

**But first!**

Before we start with Spark SQL, we need to mount the drive and create a Spark Session!

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
path = "drive/MyDrive/5. Spark/spark-scripts/section2/Datasets/"

In [None]:
!pip install pyspark
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparksql').getOrCreate()
spark


## Read in our DataFrame for this Notebook

For this notebook we will be using the Google Play Store CSV file attached to this lecture. Let's go ahead and read it in.

### About this dataset

The dataset contains a list of Google Play Store apps, along with information about each app such as its category, rating, reviews, and size.

**Source:** https://www.kaggle.com/lava18/google-play-store-apps

In [None]:
df = spark.read.csv(path + 'googleplaystore.csv', inferSchema=True, header=True)

## First things first

Let's check out the first few rows of the DataFrame to see what we are working with.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth',None)
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)

df.limit(5).toPandas()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide Apps",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


Let's also check the schema to make sure all the column types were correctly inferred.

In [None]:
df.printSchema()

root
 |-- App: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Rating: string (nullable = true)
 |-- Reviews: string (nullable = true)
 |-- Size: string (nullable = true)
 |-- Installs: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Content Rating: string (nullable = true)
 |-- Genres: string (nullable = true)
 |-- Last Updated: string (nullable = true)
 |-- Current Ver: string (nullable = true)
 |-- Android Ver: string (nullable = true)



Looks like we need to edit some of the datatypes. For now we will cast Reviews and Price to integers and Rating to a float; the Size and Installs columns will need a bit more cleaning first. Since we haven't been over this yet, I'm going to provide the code for you here so you can get a quick look at how it is used (and how often we need it!).

**Make sure to change the df name to whatever you named your df.**

In [None]:
from pyspark.sql.types import IntegerType, FloatType
df = df.withColumn("Rating", df["Rating"].cast(FloatType())) \
            .withColumn("Reviews", df["Reviews"].cast(IntegerType())) \
            .withColumn("Price", df["Price"].cast(IntegerType()))
df.printSchema()  # printSchema() prints on its own; wrapping it in print() only adds a stray "None"
df.limit(5).toPandas()

root
 |-- App: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Rating: float (nullable = true)
 |-- Reviews: integer (nullable = true)
 |-- Size: string (nullable = true)
 |-- Installs: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: integer (nullable = true)
 |-- Content Rating: string (nullable = true)
 |-- Genres: string (nullable = true)
 |-- Last Updated: string (nullable = true)
 |-- Current Ver: string (nullable = true)
 |-- Android Ver: string (nullable = true)


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide Apps",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


Looks like that worked! Great! Let's dig in. 

## 1. Create Tempview

Go ahead and create a temp view of the DataFrame so we can work with it in Spark SQL.

In [None]:
df.createOrReplaceTempView('view')

## 2. Select all apps with ratings above 4.1

Use your temp view to select all apps with ratings above 4.1.

In [None]:
spark.sql('SELECT * FROM view WHERE rating > 4.1').show()

+--------------------+--------------+------+-------+----+-----------+----+-----+--------------+--------------------+------------------+------------------+------------+
|                 App|      Category|Rating|Reviews|Size|   Installs|Type|Price|Content Rating|              Genres|      Last Updated|       Current Ver| Android Ver|
+--------------------+--------------+------+-------+----+-----------+----+-----+--------------+--------------------+------------------+------------------+------------+
|U Launcher Lite –...|ART_AND_DESIGN|   4.7|  87510|8.7M| 5,000,000+|Free|    0|      Everyone|        Art & Design|    August 1, 2018|             1.2.4|4.0.3 and up|
|Sketch - Draw & P...|ART_AND_DESIGN|   4.5| 215644| 25M|50,000,000+|Free|    0|          Teen|        Art & Design|      June 8, 2018|Varies with device|  4.2 and up|
|Pixel Draw - Numb...|ART_AND_DESIGN|   4.3|    967|2.8M|   100,000+|Free|    0|      Everyone|Art & Design;Crea...|     June 20, 2018|               1.1|  4.4 

## 3. Now pass your results to an object 
(i.e. create a Spark DataFrame)

Select just the App and Rating columns where the Category is comics and the Rating is above 4.5.

In [None]:
data = spark.sql("select App,rating from view where lower(Category) = 'comics' and rating > 4.5 ")
data.limit(5).toPandas()

Unnamed: 0,App,rating
0,Manga Master - Best manga & comic reader,4.6
1,GANMA! - All original stories free of charge for all original comics,4.7
2,Röhrich Werner Soundboard,4.7
3,Unicorn Pokez - Color By Number,4.8
4,Manga - read Thai translation,4.6


## 4. Which category has the most cumulative reviews?

Only select the one category with the most reviews.

*Note: this will require adding all the reviews together for each category.*

In [None]:
spark.sql('select Category, sum(reviews) as Total_reviews from view group by category order by sum(reviews) desc').show()

+-------------------+-------------+
|           Category|Total_reviews|
+-------------------+-------------+
|               GAME|   1585422349|
|      COMMUNICATION|    815462260|
|             SOCIAL|    621241422|
|             FAMILY|    410226330|
|              TOOLS|    273185044|
|        PHOTOGRAPHY|    213516650|
|           SHOPPING|    115041222|
|       PRODUCTIVITY|    114116975|
|      VIDEO_PLAYERS|    110380188|
|    PERSONALIZATION|     89346140|
|             SPORTS|     70830169|
|   TRAVEL_AND_LOCAL|     62617919|
|      ENTERTAINMENT|     59178154|
| NEWS_AND_MAGAZINES|     54400863|
|          EDUCATION|     39595786|
| HEALTH_AND_FITNESS|     37891234|
|MAPS_AND_NAVIGATION|     30557006|
|BOOKS_AND_REFERENCE|     21959069|
|            FINANCE|     17550728|
|            WEATHER|     14604735|
+-------------------+-------------+
only showing top 20 rows



## 5. Which App has the most reviews?

Display ONLY the top result

Include only the App column and the Reviews column.

In [None]:
spark.sql('select app,reviews from view order by reviews desc limit 1').toPandas()

Unnamed: 0,app,reviews
0,Facebook,78158306


## 5. Select all apps that contain the word 'dating' anywhere in the title

*Note: we did not cover this in the lecture. You'll have to use your SQL knowledge :) Google it if you need to.*

In [None]:
spark.sql("select * from view where app like '%dating%' ").toPandas()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,"Meet, chat & date. Free dating app - Chocolate app",DATING,3.9,8661,9.5M,"1,000,000+",Free,0,Mature 17+,Dating,"April 3, 2018",0.1.11,4.0 and up
1,Friend Find: free chat + flirt dating app,DATING,,23,11M,100+,Free,0,Mature 17+,Dating,"July 31, 2018",1.0,4.4 and up
2,Spine- The dating app,DATING,5.0,5,9.3M,500+,Free,0,Teen,Dating,"July 14, 2018",4.0,4.0.3 and up
3,Princess Closet : Otome games free dating sim,FAMILY,4.5,29495,56M,"1,000,000+",Free,0,Teen,Simulation,"May 24, 2018",1.11.0,4.0.3 and up
4,happn – Local dating app,LIFESTYLE,4.3,1118201,Varies with device,"10,000,000+",Free,0,Mature 17+,Lifestyle,"July 24, 2018",Varies with device,Varies with device


## 6. Use SQL Transformer to display how many free apps there are in this list

In [None]:
from pyspark.ml.feature import SQLTransformer

query = SQLTransformer(statement = "select count(*) as Number_of_apps from __THIS__ where type = 'Free' ")

query.transform(df).toPandas()


Unnamed: 0,Number_of_apps
0,10037


## 7. What is the most popular Genre?

Which genre appears most often in the DataFrame? Show only the top result.

In [None]:
spark.sql('select category, count(*) from view group by category order by count(*) desc limit 1 ').toPandas()

Unnamed: 0,category,count(1)
0,FAMILY,1972


## 8. Select all the apps in the 'Tools' genre that have more than 100 reviews

In [None]:
spark.sql("select * from view where lower(category) = 'tools' and reviews > 100 ").toPandas()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Moto File Manager,TOOLS,4.1,38655,5.9M,"10,000,000+",Free,0.0,Everyone,Tools,"February 1, 2018",v3.7.93,5.0 and up
1,Google,TOOLS,4.4,8033493,Varies with device,"1,000,000,000+",Free,0.0,Everyone,Tools,"August 3, 2018",Varies with device,Varies with device
2,Google Translate,TOOLS,4.4,5745093,Varies with device,"500,000,000+",Free,0.0,Everyone,Tools,"August 4, 2018",Varies with device,Varies with device
3,Moto Display,TOOLS,4.2,18239,Varies with device,"10,000,000+",Free,0.0,Everyone,Tools,"August 6, 2018",Varies with device,Varies with device
4,Motorola Alert,TOOLS,4.2,24199,3.9M,"50,000,000+",Free,0.0,Everyone,Tools,"November 21, 2014",1.02.53,4.4 and up
5,Motorola Assist,TOOLS,4.1,37333,Varies with device,"50,000,000+",Free,0.0,Everyone,Tools,"January 17, 2016",Varies with device,Varies with device
6,Cache Cleaner-DU Speed Booster (booster & cleaner),TOOLS,4.5,12759663,15M,"100,000,000+",Free,0.0,Everyone,Tools,"July 25, 2018",3.1.2,4.0 and up
7,Moto Suggestions ™,TOOLS,4.6,308,4.3M,"1,000,000+",Free,0.0,Everyone,Tools,"June 8, 2018",0.2.32,8.0 and up
8,Moto Voice,TOOLS,4.1,33216,Varies with device,"10,000,000+",Free,0.0,Everyone,Tools,"June 5, 2018",Varies with device,Varies with device
9,Calculator,TOOLS,4.3,40770,Varies with device,"100,000,000+",Free,0.0,Everyone,Tools,"November 21, 2017",Varies with device,Varies with device


## That's all folks! Great job!