# Environment Setup and Dataset Download
Run the following code snippets to set up the environment.

Download the Netflix Subscription dataset from this [link](https://drive.google.com/file/d/1optmRfNfXUFSTWY2l4FAod6aiYl4y91P/view?usp=sharing) using your IIT account and upload it to this session's storage.

In [None]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

In [2]:
import pyspark
import pyspark.sql as pyspark_sql
import pyspark.sql.types as pyspark_types
import pyspark.sql.functions as pyspark_functions
from pyspark import SparkContext, SparkConf

In [3]:
# configure Spark (serve the Spark UI on port 4050)
conf = SparkConf().set("spark.ui.port", "4050")

# create the context (getOrCreate lets this cell be re-run safely), then the session
sc = SparkContext.getOrCreate(conf=conf)
spark = pyspark_sql.SparkSession.builder.getOrCreate()

# Dataframe Ops

In [5]:
# Load the dataset
data = spark.read.csv("Netflix subscription fee Dec-2021.csv", header=True, inferSchema=True)

In [6]:
data.printSchema()

root
 |-- Country_code: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Total Library Size: integer (nullable = true)
 |-- No_of_TVShows: integer (nullable = true)
 |-- No_of_Movies: integer (nullable = true)
 |-- Cost Per Month - Basic ($): double (nullable = true)
 |-- Cost Per Month - Standard ($): double (nullable = true)
 |-- Cost Per Month - Premium ($): double (nullable = true)



In [7]:
data.show()

+------------+----------+------------------+-------------+------------+--------------------------+-----------------------------+----------------------------+
|Country_code|   Country|Total Library Size|No_of_TVShows|No_of_Movies|Cost Per Month - Basic ($)|Cost Per Month - Standard ($)|Cost Per Month - Premium ($)|
+------------+----------+------------------+-------------+------------+--------------------------+-----------------------------+----------------------------+
|          ar| Argentina|              4760|         3154|        1606|                      3.74|                          6.3|                        9.26|
|          au| Australia|              6114|         4050|        2064|                      7.84|                        12.12|                       16.39|
|          at|   Austria|              5640|         3779|        1861|                      9.03|                        14.67|                       20.32|
|          be|   Belgium|              4990|        

In [9]:
data.describe('Cost Per Month - Premium ($)').show()

+-------+----------------------------+
|summary|Cost Per Month - Premium ($)|
+-------+----------------------------+
|  count|                          65|
|   mean|          15.612923076923078|
| stddev|           4.040672408104298|
|    min|                        4.02|
|    max|                       26.96|
+-------+----------------------------+



In [10]:
# collect() brings every row to the driver and returns them as a regular Python list of Row objects
data.collect()

[Row(Country_code='ar', Country='Argentina', Total Library Size=4760, No_of_TVShows=3154, No_of_Movies=1606, Cost Per Month - Basic ($)=3.74, Cost Per Month - Standard ($)=6.3, Cost Per Month - Premium ($)=9.26),
 Row(Country_code='au', Country='Australia', Total Library Size=6114, No_of_TVShows=4050, No_of_Movies=2064, Cost Per Month - Basic ($)=7.84, Cost Per Month - Standard ($)=12.12, Cost Per Month - Premium ($)=16.39),
 Row(Country_code='at', Country='Austria', Total Library Size=5640, No_of_TVShows=3779, No_of_Movies=1861, Cost Per Month - Basic ($)=9.03, Cost Per Month - Standard ($)=14.67, Cost Per Month - Premium ($)=20.32),
 Row(Country_code='be', Country='Belgium', Total Library Size=4990, No_of_TVShows=3374, No_of_Movies=1616, Cost Per Month - Basic ($)=10.16, Cost Per Month - Standard ($)=15.24, Cost Per Month - Premium ($)=20.32),
 Row(Country_code='bo', Country='Bolivia', Total Library Size=4991, No_of_TVShows=3155, No_of_Movies=1836, Cost Per Month - Basic ($)=7.99, Co

In [11]:
from pyspark.sql import functions as F

# Week 2

## Joining in Spark


Download this dataset https://www.kaggle.com/datasets/iamsouravbanerjee/world-population-dataset and upload it to the session storage in Colab.

In [19]:
# Load the dataset

+----+----+-------------------+----------------+-------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+----------+-----------------+-----------+---------------------------+
|Rank|CCA3|  Country/Territory|         Capital|    Continent|2022 Population|2020 Population|2015 Population|2010 Population|2000 Population|1990 Population|1980 Population|1970 Population|Area (km²)|Density (per km²)|Growth Rate|World Population Percentage|
+----+----+-------------------+----------------+-------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+----------+-----------------+-----------+---------------------------+
|  36| AFG|        Afghanistan|           Kabul|         Asia|       41128771|       38972230|       33753499|       28189672|       19542982|       10694796|       12486631|       10752971|    652230|          63.0587| 

## Sort

Sort the dataframe by "World Population Percentage".

+----+----+-----------------+----------------+-------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+----------+-----------------+-----------+---------------------------+
|Rank|CCA3|Country/Territory|         Capital|    Continent|2022 Population|2020 Population|2015 Population|2010 Population|2000 Population|1990 Population|1980 Population|1970 Population|Area (km²)|Density (per km²)|Growth Rate|World Population Percentage|
+----+----+-----------------+----------------+-------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+----------+-----------------+-----------+---------------------------+
|   1| CHN|            China|         Beijing|         Asia|     1425887337|     1424929781|     1393715448|     1348191368|     1264099069|     1153704252|      982372466|      822534450|   9706961|         146.8933|        1

+----+----+-----------------+----------------+-------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+----------+-----------------+-----------+---------------------------+
|Rank|CCA3|Country/Territory|         Capital|    Continent|2022 Population|2020 Population|2015 Population|2010 Population|2000 Population|1990 Population|1980 Population|1970 Population|Area (km²)|Density (per km²)|Growth Rate|World Population Percentage|
+----+----+-----------------+----------------+-------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+----------+-----------------+-----------+---------------------------+
|   9| RUS|           Russia|          Moscow|       Europe|      144713314|      145617329|      144668389|      143242599|      146844839|      148005704|      138257420|      130093010|  17098242|           8.4636|     0.99

## Download and upload to session

Download this dataset as well and upload it to the session storage: https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv

+-------------------+-------+-------+------------+-------------+--------+--------------------+-------------------+-----------+---------------+------------------------+
|               name|alpha-2|alpha-3|country-code|   iso_3166-2|  region|          sub-region|intermediate-region|region-code|sub-region-code|intermediate-region-code|
+-------------------+-------+-------+------------+-------------+--------+--------------------+-------------------+-----------+---------------+------------------------+
|        Afghanistan|     AF|    AFG|           4|ISO 3166-2:AF|    Asia|       Southern Asia|               NULL|        142|             34|                    NULL|
|      Åland Islands|     AX|    ALA|         248|ISO 3166-2:AX|  Europe|     Northern Europe|               NULL|        150|            154|                    NULL|
|            Albania|     AL|    ALB|           8|ISO 3166-2:AL|  Europe|     Southern Europe|               NULL|        150|             39|                  

## Joining

Join the two dataframes on the CCA3 and alpha-3 columns.

+----+----+-------------------+----------------+-------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+----------+-----------------+-----------+---------------------------+-------------------+-------+-------+------------+-------------+--------+--------------------+-------------------+-----------+---------------+------------------------+
|Rank|CCA3|  Country/Territory|         Capital|    Continent|2022 Population|2020 Population|2015 Population|2010 Population|2000 Population|1990 Population|1980 Population|1970 Population|Area (km²)|Density (per km²)|Growth Rate|World Population Percentage|               name|alpha-2|alpha-3|country-code|   iso_3166-2|  region|          sub-region|intermediate-region|region-code|sub-region-code|intermediate-region-code|
+----+----+-------------------+----------------+-------------+---------------+---------------+---------------+---------------+---------------+------

## withColumn

withColumn adds a new column (or replaces an existing one with the same name) computed from an expression over existing columns or literal values.

Convert the country code to upper case and store it in a new column 'CC_Capitals'.

+------------+----------+------------------+-------------+------------+--------------------------+-----------------------------+----------------------------+-----------+
|Country_code|   Country|Total Library Size|No_of_TVShows|No_of_Movies|Cost Per Month - Basic ($)|Cost Per Month - Standard ($)|Cost Per Month - Premium ($)|CC_Capitals|
+------------+----------+------------------+-------------+------------+--------------------------+-----------------------------+----------------------------+-----------+
|          ar| Argentina|              4760|         3154|        1606|                      3.74|                          6.3|                        9.26|         AR|
|          au| Australia|              6114|         4050|        2064|                      7.84|                        12.12|                       16.39|         AU|
|          at|   Austria|              5640|         3779|        1861|                      9.03|                        14.67|                      

## Join

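https://i.ytimg.com/vi/Sh4ENyItXK8/hqdefault.jpg

A short introduction to join types is given in the picture linked above.

Join the capitalised dataframe with the population dataframe (the one already joined with the ISO codes, which supplies the alpha-2 column) on the CC_Capitals and alpha-2 columns.

Name the resulting dataframe "netflix_data_with_populations".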

65

## Aggregating functions

1.   Find the average, minimum, and maximum population in each continent using groupBy and agg
2.   Find out how to get the maximum value of a column using dataframe operations



+-------------+--------------------+--------------+--------------+
|    Continent|      Population Avg|Population min|Population max|
+-------------+--------------------+--------------+--------------+
|       Europe|       1.486295076E7|           510|     144713314|
|       Africa|2.5030367228070177E7|        107118|     218541212|
|North America|        1.50074034E7|          4390|     338289857|
|South America|3.1201186285714287E7|          3780|     215313498|
|      Oceania|           1958198.0|          1871|      26177413|
|         Asia|       9.442766548E7|        449002|    1425887337|
+-------------+--------------------+--------------+--------------+





For the above dataframe, also count how many countries there are in each continent.

+-------------+--------------------+--------------+--------------+-------------+
|    Continent|      Population Avg|Population min|Population max|Country Count|
+-------------+--------------------+--------------+--------------+-------------+
|       Europe|       1.486295076E7|           510|     144713314|           50|
|       Africa|2.5030367228070177E7|        107118|     218541212|           57|
|North America|        1.50074034E7|          4390|     338289857|           40|
|South America|3.1201186285714287E7|          3780|     215313498|           14|
|      Oceania|           1958198.0|          1871|      26177413|           23|
|         Asia|       9.442766548E7|        449002|    1425887337|           50|
+-------------+--------------------+--------------+--------------+-------------+



## Challenging Tasks

For the netflix_data_with_populations dataframe:

1.   Find the average, minimum, and maximum population in each continent using groupBy and agg
2.   Find out how to get the maximum value of a column using dataframe operations
3.   For this dataframe, also count how many countries there are in each continent
4.   Why is there a difference in the number of countries?

+-------------+--------------------+--------------+--------------+-------------+
|    Continent|      Population Avg|Population min|Population max|Country Count|
+-------------+--------------------+--------------+--------------+-------------+
|       Europe|2.0960307411764707E7|         32649|     144713314|           34|
|       Africa|         5.9893885E7|      59893885|      59893885|            1|
|North America|         8.9617651E7|       5180829|     338289857|            6|
|South America|        4.35081505E7|       3422794|     215313498|           10|
|      Oceania|        1.56813505E7|       5185288|      26177413|            2|
|         Asia|        1.85114481E8|       5975689|    1417173173|           12|
+-------------+--------------------+--------------+--------------+-------------+



# Do try to answer other interesting questions using the combined df you have made.