# Environment Setup and Dataset Download
Run the following code snippets to set up the environment.

Download the Netflix Subscription dataset from this [link](https://drive.google.com/file/d/1optmRfNfXUFSTWY2l4FAod6aiYl4y91P/view?usp=sharing) using your IIT account and upload it to this session's storage.

In [None]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

In [3]:
import pyspark
import pyspark.sql as pyspark_sql
import pyspark.sql.types as pyspark_types
import pyspark.sql.functions as pyspark_functions
from pyspark import SparkContext, SparkConf

In [4]:
# configure Spark (serve the Spark UI on port 4050)
conf = SparkConf().set("spark.ui.port", "4050")

# create the context (or reuse it if this cell is re-run), then the session
sc = SparkContext.getOrCreate(conf=conf)
spark = pyspark_sql.SparkSession.builder.getOrCreate()

# DataFrame Ops

In [131]:
# Load the dataset
data = spark.read.csv("Netflix subscription fee Dec-2021.csv", header=True, inferSchema=True)

In [132]:
data.printSchema()

root
 |-- Country_code: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Total Library Size: integer (nullable = true)
 |-- No_of_TVShows: integer (nullable = true)
 |-- No_of_Movies: integer (nullable = true)
 |-- Cost Per Month - Basic ($): double (nullable = true)
 |-- Cost Per Month - Standard ($): double (nullable = true)
 |-- Cost Per Month - Premium ($): double (nullable = true)



In [133]:
data.show()

+------------+----------+------------------+-------------+------------+--------------------------+-----------------------------+----------------------------+
|Country_code|   Country|Total Library Size|No_of_TVShows|No_of_Movies|Cost Per Month - Basic ($)|Cost Per Month - Standard ($)|Cost Per Month - Premium ($)|
+------------+----------+------------------+-------------+------------+--------------------------+-----------------------------+----------------------------+
|          ar| Argentina|              4760|         3154|        1606|                      3.74|                          6.3|                        9.26|
|          au| Australia|              6114|         4050|        2064|                      7.84|                        12.12|                       16.39|
|          at|   Austria|              5640|         3779|        1861|                      9.03|                        14.67|                       20.32|
|          be|   Belgium|              4990|        

In [135]:
data.describe('Cost Per Month - Premium ($)').show()

+-------+----------------------------+
|summary|Cost Per Month - Premium ($)|
+-------+----------------------------+
|  count|                          65|
|   mean|          15.612923076923078|
| stddev|           4.040672408104298|
|    min|                        4.02|
|    max|                       26.96|
+-------+----------------------------+



In [None]:
# collect() pulls every row to the driver and returns a regular Python list of Row objects
data.collect()

In [137]:
from pyspark.sql import functions as F

# Selection of a column

# Filter Operation

## Filtering Questions

In [138]:
# Filter all the countries where the number of movies offered is > 2000


Filter all the countries where the number of movies offered is > 2000 AND "Cost Per Month - Basic ($)" is greater than $8 per month.

In [139]:
# Filter all the countries where the number of movies offered is > 2000 AND the Basic plan costs more than $8 per month


In [140]:
# Filter countries with library size > 5000


In [141]:
# Select countries with Premium plan cost < $15


In [142]:
# Filter for more TV shows than movies


# Selection based on a condition

Print 1 for a condition otherwise 0

In [143]:
# all countries with library size < 7000


# Challenging Questions

1. Find the countries with the lowest and the highest number of movies offered.

2. Determine whether each country is in the 1st, 2nd, 3rd or 4th quartile of the distribution of values in the number-of-movies column. We want a column that says 1, 2, 3 or 4, denoting those quartiles respectively.

1. First Question

In [144]:
# find min and max


In [145]:
# can use either filtering or F.when to select the records
# (For Smartasses: yes, there is a way to get it by aggregating, but we haven't covered it yet)


2. Second Question

Hey Nerds... yes, I know you can use custom Lambda Functions to do this. I get it, you're smart. While we cover them for the other mere mortals in the class in the next lecture, please make do with doing it in Lowly Earthly Peasantly Python.
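
The plain-Python part of this can be sketched as follows. The list of counts is made up and stands in for the values collected from the number-of-movies column; `quartile_of` is a hypothetical helper name, and putting a value that sits exactly on a cut point into the lower quartile is an assumption of this sketch.

```python
import statistics
from bisect import bisect_left

# Made-up movie counts standing in for the collected column values
movie_counts = [1200, 1500, 1800, 2100, 2400, 2700, 3000, 3300]

# quantiles(n=4) returns the three cut points Q1, Q2 (the median) and Q3
q1, q2, q3 = statistics.quantiles(movie_counts, n=4)

def quartile_of(value):
    """Return 1..4; a value exactly on a cut point goes to the lower quartile."""
    return bisect_left([q1, q2, q3], value) + 1

labels = [quartile_of(v) for v in movie_counts]
print((q1, q2, q3))
print(labels)
```

Once the three cut points are known as plain numbers, they can be fed back into a chain of `F.when` conditions to build the quartile column on the Spark side.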

In [146]:
# convert Spark to Python List


In [147]:
# convert raw spark types to ints


In [148]:
# get quartiles values for the list of ints
import statistics

# Calculate quartiles


In [149]:
# all countries with quartiles of no of movies
