
# Glue Studio Notebook
You are now running a **Glue Studio** notebook; before you can start using your notebook you *must* start an interactive session.

## Available Magics
|          Magic              |   Type       |                                                                        Description                                                                        |
|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| %%configure                 |  Dictionary  |  A JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
| %profile                    |  String      |  Specify a profile in your AWS configuration to use as the credentials provider.                                                                          |
| %iam_role                   |  String      |  Specify an IAM role to execute your session with.                                                                                                        |
| %region                     |  String      |  Specify the AWS region in which to initialize a session.                                                                                                 |
| %session_id                 |  String      |  Returns the session ID for the running session.                                                                                                          |
| %connections                |  List        |  Specify a comma separated list of connections to use in the session.                                                                                     |
| %additional_python_modules  |  List        |  Comma separated list of pip packages, S3 paths or private pip arguments.                                                                                 |
| %extra_py_files             |  List        |  Comma separated list of additional Python files from S3.                                                                                                 |
| %extra_jars                 |  List        |  Comma separated list of additional Jars to include in the cluster.                                                                                       |
| %number_of_workers          |  Integer     |  The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too.                                          |
| %worker_type                |  String      |  Standard, G.1X, *or* G.2X. number_of_workers must be set too. Default is G.1X.                                                                           |
| %glue_version               |  String      |  The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0.                                                       |
| %security_configuration     |  String      |  Define a security configuration to be used with this session.                                                                                            |
| %%sql                       |  String      |  Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code.                                                            |
| %streaming                  |  String      |  Changes the session type to Glue Streaming.                                                                                                              |
| %etl                        |  String      |  Changes the session type to Glue ETL.                                                                                                                    |
| %status                     |              |  Returns the status of the current Glue session including its duration, configuration and executing user / role.                                          |
| %stop_session               |              |  Stops the current session.                                                                                                                               |
| %list_sessions              |              |  Lists all currently running sessions by name and ID.                                                                                                     |

In [None]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
It looks like there is a newer version of the kernel available. The latest version is 0.31 and you have 0.30 installed.
Please run `pip install --upgrade aws-glue-sessions` to upgrade your kernel
Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::533588983801:role/FULL_GLUE
Attempting to use existing AssumeRole session credentials.
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: e80924ae-e5a4-4bea-89b6-06f1b5415d61
Applying the following default arguments:
--glue_kernel_version 0.30
--enable-glue-datacatalog true
Waiting for session e80924ae-e5a4-4bea-89b6-06f1b5415d61 to get into ready s

In [1]:
ruta = 's3://xxxxxxxx/LOAD00000001.parquet'
rdd = spark.read.parquet(ruta)




In [1]:
#rdd.show(2)

In [17]:
from pyspark.sql import DataFrame
from pyspark.rdd import RDD

print(isinstance(rdd,DataFrame))
print(isinstance(rdd,RDD))

True
False


rdd2.show() only works on DataFrames.
For an RDD, rdd2.take(10) is the equivalent.



In [19]:
print(type(rdd2))
print(type(rdd))

<class 'pyspark.rdd.PipelinedRDD'>
<class 'pyspark.sql.dataframe.DataFrame'>


In [1]:
rdd = sc.textFile('s3://xxxxxxxx/ps_text.txt')




In [2]:
rdd2 = rdd.map(lambda x: x.split(' '))
rdd2.collect()

[['Hi', 'how', 'are', 'you?'], ['Hope', 'you', 'are', 'doing'], ['great']]


In [3]:
def getResult(x):
    # Return the length of each space-separated word in the line
    words = x.split(' ')
    a = []
    for word in words:
        a.append(len(word))
    return a

rdd3 = rdd.map(lambda x: [len(y) for y in x.split(' ')])




In [4]:
rdd3.collect()

[[2, 3, 3, 4], [4, 3, 3, 5], [5]]


In [5]:
rdd2 = rdd.map(getResult)
rdd2.collect()

[[2, 3, 3, 4], [4, 3, 3, 5], [5]]


### RDD flatMap() [Transformation]
flatMap() is an extension of map(): it applies a function to each element and then flattens ("explodes") the results before the final output.
It works much like map() and also returns a new RDD, but the result is a single flat list (not a list of lists).
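
To make the difference concrete, here is a plain-Python sketch (no Spark session needed) that mimics what map() and flatMap() return for the three lines of ps_text.txt collected earlier. The helper names `map_like` and `flat_map_like` are hypothetical and exist only for this illustration; they are not part of PySpark.

```python
from itertools import chain

# The three lines of ps_text.txt as collected earlier in this notebook
lines = ['Hi how are you?', 'Hope you are doing', 'great']

def map_like(func, data):
    # One output element per input element (like rdd.map)
    return [func(x) for x in data]

def flat_map_like(func, data):
    # Apply func, then flatten the results by one level (like rdd.flatMap)
    return list(chain.from_iterable(func(x) for x in data))

mapped = map_like(lambda x: x.split(' '), lines)
flattened = flat_map_like(lambda x: x.split(' '), lines)

print(mapped)     # a list of lists, one inner list per line
print(flattened)  # a single flat list of words
```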

In [6]:
rdd4 = rdd.flatMap(lambda x: x.split(' '))




In [2]:
#rdd4.collect()

### RDD filter() [Transformation]
filter() creates a new RDD that contains only the elements that pass a test :). You pass the test as a function argument; it must return True for the elements you want to keep.

In [1]:
rdd = sc.textFile('s3://xxxxxxxx/quiz2.txt')
rdd.collect()

['This mango company animal', 'Cat dog ant mic laptop', 'Chair switch mobile am charger cover', 'Amanda any alarm ant']


In [2]:
rdd2 = rdd.flatMap(lambda x: x.split(' '))




In [3]:
rdd2.collect()

['This', 'mango', 'company', 'animal', 'Cat', 'dog', 'ant', 'mic', 'laptop', 'Chair', 'switch', 'mobile', 'am', 'charger', 'cover', 'Amanda', 'any', 'alarm', 'ant']


In [13]:
rdd3 = rdd2.filter(lambda x: (x[0].upper() != 'A') and (x[0].upper() != 'C') )




In [4]:
def filt(x):
    #if ((x[0].upper() == 'A' ) or (x[0].upper() == 'C')):
    if x.upper().startswith('A') or x.upper().startswith('C'):
        return False
    else:
        return True
    
rdd4 = rdd2.filter(filt)
rdd4.collect()

['This', 'mango', 'dog', 'mic', 'laptop', 'switch', 'mobile']


In [15]:
rdd3.collect()

['This', 'mango', 'dog', 'mic', 'laptop', 'switch', 'mobile']


### RDD distinct() [T]
distinct() returns a new RDD that contains only the distinct elements of the source RDD.

In [5]:
rdd = sc.textFile('s3://xxxxxxxx/ps_text.txt')




In [6]:
rdd.collect()

['Hi how are you?', 'Hope you are doing', 'great']


In [7]:
rdd.flatMap(lambda x: x.split(' ')).collect()

['Hi', 'how', 'are', 'you?', 'Hope', 'you', 'are', 'doing', 'great']


In [8]:
rdd.flatMap(lambda x: x.split(' ')).distinct().collect()

['how', 'you', 'Hi', 'you?', 'great', 'doing', 'Hope', 'are']


### RDD groupByKey() [T]
It's used to group values by key.
The data must be in key-value pairs, for example:

('raid', 5)
('raid', 2)
('raid', 7)
('raid', 3)

groupByKey() returns the values of each key as a PySpark iterable (ResultIterable).
To see the output as ('raid', [5, 2, 7, 3]), use mapValues() and pass the type the values should be converted to, e.g. list.
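
As a rough illustration of the grouping described above, the following plain-Python sketch collects the values of the four ('raid', n) pairs under their key. `group_by_key_like` is a hypothetical helper, not a PySpark function, and it already returns lists, which is what groupByKey().mapValues(list) gives in Spark.

```python
from collections import defaultdict

# The key-value pairs from the example above
pairs = [('raid', 5), ('raid', 2), ('raid', 7), ('raid', 3)]

def group_by_key_like(kv_pairs):
    # Collect every value under its key, in arrival order
    groups = defaultdict(list)
    for key, value in kv_pairs:
        groups[key].append(value)
    return list(groups.items())

grouped = group_by_key_like(pairs)
print(grouped)
```

On a real cluster the order of the values inside each group is not guaranteed, because they can arrive from different partitions.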

In [3]:
rdd = sc.textFile('s3://xxxxxxxx/quiz2.txt')
rdd.collect()

['This mango company animal', 'Cat dog ant mic laptop', 'Chair switch mobile am charger cover', 'Amanda any alarm ant']


In [4]:
rdd3 = rdd.flatMap(lambda x: x.split(' ')).map(lambda x: (x, len(x)))
rdd3.groupByKey().collect()

[('mango', <pyspark.resultiterable.ResultIterable object at 0x7f4b8efd1590>), ('Amanda', <pyspark.resultiterable.ResultIterable object at 0x7f4b8efd1510>), ('alarm', <pyspark.resultiterable.ResultIterable object at 0x7f4b8efd1610>), ('charger', <pyspark.resultiterable.ResultIterable object at 0x7f4b8efd1690>), ('ant', <pyspark.resultiterable.ResultIterable object at 0x7f4b8efd1710>), ('laptop', <pyspark.resultiterable.ResultIterable object at 0x7f4b8efd1790>), ('mobile', <pyspark.resultiterable.ResultIterable object at 0x7f4b8efd1910>), ('any', <pyspark.resultiterable.ResultIterable object at 0x7f4b8efd1810>), ('Cat', <pyspark.resultiterable.ResultIterable object at 0x7f4b8efd15d0>), ('mic', <pyspark.resultiterable.ResultIterable object at 0x7f4b8efd1950>), ('This', <pyspark.resultiterable.ResultIterable object at 0x7f4b8efd18d0>), ('Chair', <pyspark.resultiterable.ResultIterable object at 0x7f4b8efd19d0>), ('am', <pyspark.resultiterable.ResultIterable object at 0x7f4b8efd1a50>), ('dog

In [5]:
rdd3.groupByKey().mapValues(list).collect()

[('mango', [5]), ('Amanda', [6]), ('alarm', [5]), ('charger', [7]), ('ant', [3, 3]), ('laptop', [6]), ('mobile', [6]), ('any', [3]), ('Cat', [3]), ('mic', [3]), ('This', [4]), ('Chair', [5]), ('am', [2]), ('dog', [3]), ('company', [7]), ('animal', [6]), ('switch', [6]), ('cover', [5])]


### reduceByKey() [T]
*It's used to combine data based on Keys in RDD.*

``` python
reduceByKey(lambda x,y: x+y)
``` 
  
**Example:**  
We have the following data as key-value pairs:
- ('raid',3)
- ('raid',6)
- ('raid',3)
- ('raid',10)

We use **reduceByKey(lambda x,y: x+y)**. It works as follows:
- x = 3 and y = 6 => x + y  = 9 
- **then:** x = 9 and y = 3 => x + y = 12
- **then:** x = 12 and y = 10 => x + y = 22

**So finally, reduceByKey(lambda x,y: x+ y) will give us: ('raid', 22)**
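
The same step-by-step folding can be traced in plain Python with `functools.reduce`. This is only a sketch of the idea for a single key; `traced_add` is a hypothetical name, and Spark itself does not call `functools.reduce`.

```python
from functools import reduce

values = [3, 6, 3, 10]  # the values that share the key 'raid'

def traced_add(x, y):
    # Same logic as the lambda passed to reduceByKey, plus a trace of each step
    print(f'x = {x} and y = {y} => x + y = {x + y}')
    return x + y

total = reduce(traced_add, values)
result = ('raid', total)
print(result)
```

On a cluster, Spark may combine the values in a different order across partitions, so the function passed to reduceByKey() should be associative and commutative, as addition is.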


#### Quiz: reduceByKey()

In [6]:
rdd = sc.textFile('s3://xxxxxxxx/quiz2.txt')
rdd.collect()

['This mango company animal', 'Cat dog ant mic laptop', 'Chair switch mobile am charger cover', 'Amanda any alarm ant']


In [5]:
rdd.flatMap(lambda x: x.lower().split(' ')).map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y).collect()

[('mango', 1), ('alarm', 1), ('this', 1), ('charger', 1), ('ant', 2), ('cat', 1), ('laptop', 1), ('chair', 1), ('mobile', 1), ('amanda', 1), ('any', 1), ('mic', 1), ('am', 1), ('dog', 1), ('company', 1), ('animal', 1), ('switch', 1), ('cover', 1)]


### count() [A]
**rdd.count()** returns the number of elements in the RDD.

In [7]:
rdd = sc.textFile('s3://xxxxxxxx/quiz2.txt')
rdd.collect()

['This mango company animal', 'Cat dog ant mic laptop', 'Chair switch mobile am charger cover', 'Amanda any alarm ant']


In [8]:
rdd.count()

4


In [10]:
rdd2 = rdd.flatMap(lambda x: x.split(' '))
rdd2.collect()

['This', 'mango', 'company', 'animal', 'Cat', 'dog', 'ant', 'mic', 'laptop', 'Chair', 'switch', 'mobile', 'am', 'charger', 'cover', 'Amanda', 'any', 'alarm', 'ant']


In [11]:
rdd2.count()

19


### countByValue() [A]
It returns how many times each value occurs in the RDD, as a dictionary of value: count.  
``` python
rdd.countByValue()
```
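
For intuition, `collections.Counter` produces the same kind of value-to-count mapping in plain Python. The sketch below uses the four lines of quiz2.txt shown in this notebook; it only mimics the result and does not involve Spark.

```python
from collections import Counter

# The four lines of quiz2.txt as collected in this notebook
lines = ['This mango company animal',
         'Cat dog ant mic laptop',
         'Chair switch mobile am charger cover',
         'Amanda any alarm ant']

# Split every line into words (the flatMap step), then count each word
words = [word for line in lines for word in line.split(' ')]
counts = Counter(words)

print(counts['ant'])
print(len(counts), 'distinct words out of', sum(counts.values()))
```

Because countByValue() is an action that brings the whole dictionary back to the driver, it is best suited to data with a modest number of distinct values.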

In [1]:
rdd = sc.textFile('s3://xxxxxxxx/quiz2.txt')
rdd.collect()

['This mango company animal', 'Cat dog ant mic laptop', 'Chair switch mobile am charger cover', 'Amanda any alarm ant']


In [3]:
%time
rdd.flatMap(lambda x: x.split(' ')).countByValue()

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.48 µs
defaultdict(<class 'int'>, {'This': 1, 'mango': 1, 'company': 1, 'animal': 1, 'Cat': 1, 'dog': 1, 'ant': 2, 'mic': 1, 'laptop': 1, 'Chair': 1, 'switch': 1, 'mobile': 1, 'am': 1, 'charger': 1, 'cover': 1, 'Amanda': 1, 'any': 1, 'alarm': 1})


In [4]:
hp = sc.textFile('s3://xxxxxxxx/Harry Potter_ The Complete Coll - J.K. Rowling.txt')




In [6]:
hp.getNumPartitions() # number of partitions the RDD has been distributed across

2


In [14]:
%time
hp2 = hp.flatMap(lambda x: x.split(' ')).filter(lambda x: x not in ('Harry','Potter',''))

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.48 µs



In [21]:
hp2.count()

1109657


In [22]:
hp2.countByValue().head()

AttributeError: 'collections.defaultdict' object has no attribute 'head'
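
The error above occurs because countByValue() returns a plain Python dictionary (a defaultdict), which has no head() method. One way to peek at the most frequent entries is to sort the dictionary items by count. The sketch below assumes a small, made-up dictionary standing in for the real result; `top_n` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical stand-in for the dictionary returned by countByValue()
counts = defaultdict(int, {'the': 5, 'wand': 2, 'owl': 3, 'and': 4})

def top_n(value_counts, n):
    # Sort (value, count) pairs by count, largest first, and keep the first n
    ranked = sorted(value_counts.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]

top_two = top_n(counts, 2)
print(top_two)
```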


### saveAsTextFile() [A]  
It's used to save an RDD as text. The path you pass is created as a directory that contains one part file per partition.

In [7]:
hp.take(2)

['CONTENTS', '']


In [1]:
#hp.saveAsTextFile('s3://xxxxxxxx/output/harrypotterRdd')

### repartition() [T]
repartition() is used to change the number of partitions in an RDD. It can increase or decrease the number, and it always performs a full shuffle.

``` python
rdd.repartition(numPartitions)
```
### coalesce() [T]
coalesce() is used to decrease the number of partitions in an RDD. By default it avoids a full shuffle.
``` python
rdd.coalesce(numPartitions)
```

In [1]:
hp = sc.textFile('s3://xxxxxxxx/Harry Potter_ The Complete Coll - J.K. Rowling.txt')




In [2]:
hp.getNumPartitions()

2


### Quiz: Calculate the average score in each month

In [3]:
rdd = sc.textFile('s3://xxxxxxxx/average_quiz_sample.csv')




In [4]:
rdd.collect()

['JAN,NY,3.0', 'JAN,PA,1.0', 'JAN,NJ,2.0', 'JAN,CT,4.0', 'FEB,PA,1.0', 'FEB,NJ,1.0', 'FEB,NY,2.0', 'FEB,VT,1.0', 'MAR,NJ,2.0', 'MAR,NY,1.0', 'MAR,VT,2.0', 'MAR,PA,3.0']


In [10]:
rdd2 = rdd.map(lambda x: x.split(','))




In [11]:
rdd2.collect()

[['JAN', 'NY', '3.0'], ['JAN', 'PA', '1.0'], ['JAN', 'NJ', '2.0'], ['JAN', 'CT', '4.0'], ['FEB', 'PA', '1.0'], ['FEB', 'NJ', '1.0'], ['FEB', 'NY', '2.0'], ['FEB', 'VT', '1.0'], ['MAR', 'NJ', '2.0'], ['MAR', 'NY', '1.0'], ['MAR', 'VT', '2.0'], ['MAR', 'PA', '3.0']]


In [24]:
rdd3 = rdd2.map(lambda x: (x[0],(float(x[2]),1)))




In [25]:
rdd3.collect()

[('JAN', (3.0, 1)), ('JAN', (1.0, 1)), ('JAN', (2.0, 1)), ('JAN', (4.0, 1)), ('FEB', (1.0, 1)), ('FEB', (1.0, 1)), ('FEB', (2.0, 1)), ('FEB', (1.0, 1)), ('MAR', (2.0, 1)), ('MAR', (1.0, 1)), ('MAR', (2.0, 1)), ('MAR', (3.0, 1))]


In [37]:
rdd4 = rdd3.reduceByKey( lambda x, y: (x[0]+y[0], x[1]+y[1]))
rdd4.collect()

[('MAR', (8.0, 4)), ('JAN', (10.0, 4)), ('FEB', (5.0, 4))]


In [39]:
rdd4.map(lambda x: (x[0], x[1][0]/x[1][1])).collect()

[('MAR', 2.0), ('JAN', 2.5), ('FEB', 1.25)]
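
The (sum, count) trick used in the cells above can be checked in plain Python on the same twelve rows. This is a sketch of the idea only; `average_by_key` is a hypothetical helper and a dictionary stands in for the reduceByKey() step.

```python
# The rows of average_quiz_sample.csv as collected above
rows = ['JAN,NY,3.0', 'JAN,PA,1.0', 'JAN,NJ,2.0', 'JAN,CT,4.0',
        'FEB,PA,1.0', 'FEB,NJ,1.0', 'FEB,NY,2.0', 'FEB,VT,1.0',
        'MAR,NJ,2.0', 'MAR,NY,1.0', 'MAR,VT,2.0', 'MAR,PA,3.0']

def average_by_key(lines):
    # Keep a running (sum, count) pair per month, as reduceByKey does
    totals = {}
    for line in lines:
        month, _state, score = line.split(',')
        running_sum, running_count = totals.get(month, (0.0, 0))
        totals[month] = (running_sum + float(score), running_count + 1)
    # Divide sum by count only at the very end
    return {month: s / c for month, (s, c) in totals.items()}

averages = average_by_key(rows)
print(averages)
```

Carrying the count next to the sum matters because averages cannot be merged directly: the average of two partial averages is generally not the overall average.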


### Finding Min and Max

Convert the data to key-value pairs  
```python
(x.split(',')[0],x.split(',')[1])
```
#### reduceByKey()
```python
rdd2 = rdd.reduceByKey(lambda x,y: x if x < y else y)
```
For example, with these pairs:
- ('The Matrix', 5)  
- ('The Matrix', 3)  
- ('The Matrix', 4)  

the function is applied step by step:
- 5 , 3 => 3  
- 3 , 4 => 3

The output of each iteration is used as x in the next iteration, so the final result is ('The Matrix', 3).

### Quiz Max Min:
Calculate the minimum and maximum rating given by each city

In [1]:
rdd = sc.textFile('s3://xxxxxxxx/average_quiz_sample.csv')




In [2]:
rdd.collect()

['JAN,NY,3.0', 'JAN,PA,1.0', 'JAN,NJ,2.0', 'JAN,CT,4.0', 'FEB,PA,1.0', 'FEB,NJ,1.0', 'FEB,NY,2.0', 'FEB,VT,1.0', 'MAR,NJ,2.0', 'MAR,NY,1.0', 'MAR,VT,2.0', 'MAR,PA,3.0']


In [3]:
rdd2 = rdd.map(lambda x: (x.split(',')[1],x.split(',')[2]))




In [7]:
rddMin = rdd2.reduceByKey(lambda x,y: x if x<y else y)
rddMin.collect()

[('NY', '1.0'), ('NJ', '1.0'), ('PA', '1.0'), ('CT', '4.0'), ('VT', '1.0')]


In [8]:
rddMax = rdd2.reduceByKey(lambda x,y: x if x>y else y)
rddMax.collect()

[('NY', '3.0'), ('NJ', '2.0'), ('PA', '3.0'), ('CT', '4.0'), ('VT', '2.0')]
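
One caveat about the two cells above: the ratings are still strings, so `<` and `>` compare them character by character. That happens to give the right answer for values such as '1.0' to '4.0', but it would break for a value like '10.0'. A minimal plain-Python sketch of the issue, using made-up ratings that are not in the CSV file:

```python
from functools import reduce

ratings = ['9.0', '10.0', '2.5']  # hypothetical ratings, still strings

def pick_max(x, y):
    # Same logic as the lambda used for rddMax
    return x if x > y else y

max_as_text = reduce(pick_max, ratings)                # compares characters
max_as_number = reduce(pick_max, map(float, ratings))  # compares numeric values

print(max_as_text)
print(max_as_number)
```

In the notebook, converting the rating in the map step, for example with `float(x.split(',')[2])`, would avoid the problem.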


## Spark RDD Project :]
### Perform the following analytics on the data
- Show the number of students in the file
- Show the total marks achieved by Female and Male students
- Show the total number of students that have passed and failed. 50+ marks are required to pass the course
- Show the total number of students enrolled per course 
- Show the total marks that students have achieved per course 
- Show the average marks that students have achieved per course 
- Show the minimum and maximum marks achieved per course 
- Show the average age of male and female students

In [1]:
rdd = sc.textFile('s3://xxxxxxxx/StudentData.csv')




In [15]:
rdd.take(10)

['age,gender,name,course,roll,marks,email', '28,Female,Hubert Oliveras,DB,02984,59,Annika Hoffman_Naoma Fritts@OOP.com', '29,Female,Toshiko Hillyard,Cloud,12899,62,Margene Moores_Marylee Capasso@DB.com', '28,Male,Celeste Lollis,PF,21267,45,Jeannetta Golden_Jenna Montague@DSA.com', '29,Female,Elenore Choy,DB,32877,29,Billi Clore_Mitzi Seldon@DB.com', '28,Male,Sheryll Towler,DSA,41487,41,Claude Panos_Judie Chipps@OOP.com', '28,Male,Margene Moores,MVC,52771,32,Toshiko Hillyard_Clementina Menke@MVC.com', '28,Male,Neda Briski,OOP,61973,69,Alberta Freund_Elenore Choy@DB.com', '28,Female,Claude Panos,Cloud,72409,85,Sheryll Towler_Alberta Freund@Cloud.com', '28,Male,Celeste Lollis,MVC,81492,64,Nicole Harwood_Claude Panos@MVC.com']


In [2]:
col = rdd.first()




In [4]:
rdd2 = rdd.filter(lambda x: x != col)
rdd2.count()

1000


In [5]:
rdd2.take(10)

['28,Female,Hubert Oliveras,DB,02984,59,Annika Hoffman_Naoma Fritts@OOP.com', '29,Female,Toshiko Hillyard,Cloud,12899,62,Margene Moores_Marylee Capasso@DB.com', '28,Male,Celeste Lollis,PF,21267,45,Jeannetta Golden_Jenna Montague@DSA.com', '29,Female,Elenore Choy,DB,32877,29,Billi Clore_Mitzi Seldon@DB.com', '28,Male,Sheryll Towler,DSA,41487,41,Claude Panos_Judie Chipps@OOP.com', '28,Male,Margene Moores,MVC,52771,32,Toshiko Hillyard_Clementina Menke@MVC.com', '28,Male,Neda Briski,OOP,61973,69,Alberta Freund_Elenore Choy@DB.com', '28,Female,Claude Panos,Cloud,72409,85,Sheryll Towler_Alberta Freund@Cloud.com', '28,Male,Celeste Lollis,MVC,81492,64,Nicole Harwood_Claude Panos@MVC.com', '29,Male,Cordie Harnois,OOP,92882,51,Judie Chipps_Clementina Menke@MVC.com']


In [20]:
rdd3  = rdd2.map(lambda x: x.split(','))




In [22]:
rdd3.take(10)

[['28', 'Female', 'Hubert Oliveras', 'DB', '02984', '59', 'Annika Hoffman_Naoma Fritts@OOP.com'], ['29', 'Female', 'Toshiko Hillyard', 'Cloud', '12899', '62', 'Margene Moores_Marylee Capasso@DB.com'], ['28', 'Male', 'Celeste Lollis', 'PF', '21267', '45', 'Jeannetta Golden_Jenna Montague@DSA.com'], ['29', 'Female', 'Elenore Choy', 'DB', '32877', '29', 'Billi Clore_Mitzi Seldon@DB.com'], ['28', 'Male', 'Sheryll Towler', 'DSA', '41487', '41', 'Claude Panos_Judie Chipps@OOP.com'], ['28', 'Male', 'Margene Moores', 'MVC', '52771', '32', 'Toshiko Hillyard_Clementina Menke@MVC.com'], ['28', 'Male', 'Neda Briski', 'OOP', '61973', '69', 'Alberta Freund_Elenore Choy@DB.com'], ['28', 'Female', 'Claude Panos', 'Cloud', '72409', '85', 'Sheryll Towler_Alberta Freund@Cloud.com'], ['28', 'Male', 'Celeste Lollis', 'MVC', '81492', '64', 'Nicole Harwood_Claude Panos@MVC.com'], ['29', 'Male', 'Cordie Harnois', 'OOP', '92882', '51', 'Judie Chipps_Clementina Menke@MVC.com']]


In [36]:
rdd4 = rdd3.map(lambda x: (x[1], int(x[5])))




In [37]:
rdd4.take(10)

[('Female', 59), ('Female', 62), ('Male', 45), ('Female', 29), ('Male', 41), ('Male', 32), ('Male', 69), ('Female', 85), ('Male', 64), ('Male', 51)]


In [38]:
rdd4.reduceByKey(lambda x, y: x+y).collect()

[('Male', 30461), ('Female', 29636)]


In [44]:
rdd3 = rdd2.map(lambda x: x.split(','))




In [45]:
rdd3.take(10)

[['28', 'Female', 'Hubert Oliveras', 'DB', '02984', '59', 'Annika Hoffman_Naoma Fritts@OOP.com'], ['29', 'Female', 'Toshiko Hillyard', 'Cloud', '12899', '62', 'Margene Moores_Marylee Capasso@DB.com'], ['28', 'Male', 'Celeste Lollis', 'PF', '21267', '45', 'Jeannetta Golden_Jenna Montague@DSA.com'], ['29', 'Female', 'Elenore Choy', 'DB', '32877', '29', 'Billi Clore_Mitzi Seldon@DB.com'], ['28', 'Male', 'Sheryll Towler', 'DSA', '41487', '41', 'Claude Panos_Judie Chipps@OOP.com'], ['28', 'Male', 'Margene Moores', 'MVC', '52771', '32', 'Toshiko Hillyard_Clementina Menke@MVC.com'], ['28', 'Male', 'Neda Briski', 'OOP', '61973', '69', 'Alberta Freund_Elenore Choy@DB.com'], ['28', 'Female', 'Claude Panos', 'Cloud', '72409', '85', 'Sheryll Towler_Alberta Freund@Cloud.com'], ['28', 'Male', 'Celeste Lollis', 'MVC', '81492', '64', 'Nicole Harwood_Claude Panos@MVC.com'], ['29', 'Male', 'Cordie Harnois', 'OOP', '92882', '51', 'Judie Chipps_Clementina Menke@MVC.com']]


In [53]:
rdd4 = rdd3.map(lambda x: 1 if int(x[5]) > 50 else 0)




In [54]:
rdd4.take(10)

[1, 1, 0, 0, 0, 0, 1, 1, 1, 1]


In [37]:
rdd4.countByValue()

defaultdict(<class 'int'>, {('Female', 1): 501, ('Male', 1): 499})
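
For the 0/1 flags produced by the pass/fail cell, the tally is expected to be a dictionary keyed by 1 (passed) and 0 (failed). The plain-Python sketch below shows the idea on the marks of the first ten students printed earlier; `PASS_MARK` is a hypothetical name, and the strict `>` comparison is an assumption copied from the cell above.

```python
from collections import Counter

# Marks of the first ten students, taken from the take(10) output above
marks = [59, 62, 45, 29, 41, 32, 69, 85, 64, 51]

PASS_MARK = 50  # assumption: more than 50 marks passes, matching `> 50` above

# 1 means passed, 0 means failed
flags = [1 if mark > PASS_MARK else 0 for mark in marks]
tally = Counter(flags)

print(flags)
print('passed:', tally[1], 'failed:', tally[0])
```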


### Show the total number of students enrolled per course

In [38]:
rdd = sc.textFile('s3://xxxxxxxx/StudentData.csv')
col = rdd.first()
print(col)

age,gender,name,course,roll,marks,email


In [39]:
rdd = rdd.filter( lambda x: x != col)




In [40]:
rdd2 = rdd.map(lambda x: x.split(','))




In [42]:
rdd3 = rdd2.map(lambda x: (x[3],1))
rdd4 = rdd3.reduceByKey(lambda x, y: x+y)
rdd4.collect()

[('Cloud', 192), ('DSA', 176), ('DB', 157), ('MVC', 157), ('PF', 166), ('OOP', 152)]


### Show the total marks that students have achieved per course 

In [64]:
rdd = sc.textFile('s3://xxxxxxxx/StudentData.csv')
col = rdd.first()
print(col)

age,gender,name,course,roll,marks,email


In [65]:
rdd = rdd.filter( lambda x: x != col)




In [66]:
rdd2 = rdd.map(lambda x: x.split(','))
rdd2.take(10)

[['28', 'Female', 'Hubert Oliveras', 'DB', '02984', '59', 'Annika Hoffman_Naoma Fritts@OOP.com'], ['29', 'Female', 'Toshiko Hillyard', 'Cloud', '12899', '62', 'Margene Moores_Marylee Capasso@DB.com'], ['28', 'Male', 'Celeste Lollis', 'PF', '21267', '45', 'Jeannetta Golden_Jenna Montague@DSA.com'], ['29', 'Female', 'Elenore Choy', 'DB', '32877', '29', 'Billi Clore_Mitzi Seldon@DB.com'], ['28', 'Male', 'Sheryll Towler', 'DSA', '41487', '41', 'Claude Panos_Judie Chipps@OOP.com'], ['28', 'Male', 'Margene Moores', 'MVC', '52771', '32', 'Toshiko Hillyard_Clementina Menke@MVC.com'], ['28', 'Male', 'Neda Briski', 'OOP', '61973', '69', 'Alberta Freund_Elenore Choy@DB.com'], ['28', 'Female', 'Claude Panos', 'Cloud', '72409', '85', 'Sheryll Towler_Alberta Freund@Cloud.com'], ['28', 'Male', 'Celeste Lollis', 'MVC', '81492', '64', 'Nicole Harwood_Claude Panos@MVC.com'], ['29', 'Male', 'Cordie Harnois', 'OOP', '92882', '51', 'Judie Chipps_Clementina Menke@MVC.com']]


In [69]:
rdd3 = rdd2.map( lambda x: (x[3],int(x[5])))




In [70]:
rdd3.take(10)

[('DB', 59), ('Cloud', 62), ('PF', 45), ('DB', 29), ('DSA', 41), ('MVC', 32), ('OOP', 69), ('Cloud', 85), ('MVC', 64), ('OOP', 51)]


In [72]:
rdd4 = rdd3.reduceByKey(lambda x, y: x + y)




In [73]:
rdd4.take(10)

[('Cloud', 11443), ('DSA', 10950), ('DB', 9270), ('MVC', 9585), ('PF', 9933), ('OOP', 8916)]


### Show the average marks that students have achieved per course 

In [44]:
rdd = sc.textFile('s3://xxxxxxxx/StudentData.csv')
col = rdd.first()
print(col)

age,gender,name,course,roll,marks,email


In [45]:
rdd2 = rdd.filter(lambda x: x != col)




In [46]:
rdd3 = rdd2.map(lambda x: x.split(','))
rdd3.take(10)

[['28', 'Female', 'Hubert Oliveras', 'DB', '02984', '59', 'Annika Hoffman_Naoma Fritts@OOP.com'], ['29', 'Female', 'Toshiko Hillyard', 'Cloud', '12899', '62', 'Margene Moores_Marylee Capasso@DB.com'], ['28', 'Male', 'Celeste Lollis', 'PF', '21267', '45', 'Jeannetta Golden_Jenna Montague@DSA.com'], ['29', 'Female', 'Elenore Choy', 'DB', '32877', '29', 'Billi Clore_Mitzi Seldon@DB.com'], ['28', 'Male', 'Sheryll Towler', 'DSA', '41487', '41', 'Claude Panos_Judie Chipps@OOP.com'], ['28', 'Male', 'Margene Moores', 'MVC', '52771', '32', 'Toshiko Hillyard_Clementina Menke@MVC.com'], ['28', 'Male', 'Neda Briski', 'OOP', '61973', '69', 'Alberta Freund_Elenore Choy@DB.com'], ['28', 'Female', 'Claude Panos', 'Cloud', '72409', '85', 'Sheryll Towler_Alberta Freund@Cloud.com'], ['28', 'Male', 'Celeste Lollis', 'MVC', '81492', '64', 'Nicole Harwood_Claude Panos@MVC.com'], ['29', 'Male', 'Cordie Harnois', 'OOP', '92882', '51', 'Judie Chipps_Clementina Menke@MVC.com']]


In [47]:
rdd4 = rdd3.map(lambda x: (x[3], (int(x[5]), 1)))
rdd4.take(10)

[('DB', (59, 1)), ('Cloud', (62, 1)), ('PF', (45, 1)), ('DB', (29, 1)), ('DSA', (41, 1)), ('MVC', (32, 1)), ('OOP', (69, 1)), ('Cloud', (85, 1)), ('MVC', (64, 1)), ('OOP', (51, 1))]


In [48]:
rdd5 = rdd4.reduceByKey( lambda x, y: (x[0] + y[0], x[1]+y[1]) )
rdd5.take(10)

[('Cloud', (11443, 192)), ('DSA', (10950, 176)), ('DB', (9270, 157)), ('MVC', (9585, 157)), ('PF', (9933, 166)), ('OOP', (8916, 152))]


In [49]:
rdd6 = rdd5.map(lambda x: ( x[0], x[1][0]/x[1][1]))
rdd6.collect()

[('Cloud', 59.598958333333336), ('DSA', 62.21590909090909), ('DB', 59.044585987261144), ('MVC', 61.05095541401274), ('PF', 59.83734939759036), ('OOP', 58.6578947368421)]


mapValues() applies a function to the value of each (key, value) pair and leaves the key unchanged, so in this case the calculation is easier to write.
Example: 
```python
rdd5.mapValues(lambda x: x[0]/x[1]).collect()
```
It gives the same result as the previous cell :) 
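
A plain-Python sketch of what mapValues() does, using three of the (sum, count) pairs from the rdd5 output above. `map_values_like` is a hypothetical helper, not a PySpark function.

```python
# Three of the (course, (total marks, students)) pairs from rdd5
sum_count = [('Cloud', (11443, 192)), ('DSA', (10950, 176)), ('DB', (9270, 157))]

def map_values_like(func, kv_pairs):
    # Apply func to the value only; the key passes through untouched
    return [(key, func(value)) for key, value in kv_pairs]

averages = map_values_like(lambda v: v[0] / v[1], sum_count)
print(averages)
```

Compared with map(), the function never sees the key, so there is no need to rebuild the (key, value) tuple by hand.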

In [50]:
rdd5.mapValues(lambda x: x[0] / x[1] ).collect()

[('Cloud', 59.598958333333336), ('DSA', 62.21590909090909), ('DB', 59.044585987261144), ('MVC', 61.05095541401274), ('PF', 59.83734939759036), ('OOP', 58.6578947368421)]


### Show the minimum and maximum marks achieved per course 

In [8]:
rdd = sc.textFile('s3://xxxxxxxx/StudentData.csv')
col = rdd.first() 
print(col)

age,gender,name,course,roll,marks,email


In [9]:
rd2 = rdd.filter(lambda x: x != col)




In [10]:
rd3 = rd2.map(lambda x: x.split(','))
rd3.take(10)

[['28', 'Female', 'Hubert Oliveras', 'DB', '02984', '59', 'Annika Hoffman_Naoma Fritts@OOP.com'], ['29', 'Female', 'Toshiko Hillyard', 'Cloud', '12899', '62', 'Margene Moores_Marylee Capasso@DB.com'], ['28', 'Male', 'Celeste Lollis', 'PF', '21267', '45', 'Jeannetta Golden_Jenna Montague@DSA.com'], ['29', 'Female', 'Elenore Choy', 'DB', '32877', '29', 'Billi Clore_Mitzi Seldon@DB.com'], ['28', 'Male', 'Sheryll Towler', 'DSA', '41487', '41', 'Claude Panos_Judie Chipps@OOP.com'], ['28', 'Male', 'Margene Moores', 'MVC', '52771', '32', 'Toshiko Hillyard_Clementina Menke@MVC.com'], ['28', 'Male', 'Neda Briski', 'OOP', '61973', '69', 'Alberta Freund_Elenore Choy@DB.com'], ['28', 'Female', 'Claude Panos', 'Cloud', '72409', '85', 'Sheryll Towler_Alberta Freund@Cloud.com'], ['28', 'Male', 'Celeste Lollis', 'MVC', '81492', '64', 'Nicole Harwood_Claude Panos@MVC.com'], ['29', 'Male', 'Cordie Harnois', 'OOP', '92882', '51', 'Judie Chipps_Clementina Menke@MVC.com']]


In [11]:
rd4 = rd3.map(lambda x: (x[3],x[5]))
rd4.take(10)

[('DB', '59'), ('Cloud', '62'), ('PF', '45'), ('DB', '29'), ('DSA', '41'), ('MVC', '32'), ('OOP', '69'), ('Cloud', '85'), ('MVC', '64'), ('OOP', '51')]


In [17]:
minim = rd4.reduceByKey(lambda x, y: x if x<y else y).collect()
maxim = rd4.reduceByKey(lambda x, y: x if x>y else y).collect()
print('min: ', minim)
print('max: ', maxim)

min:  [('Cloud', '20'), ('DSA', '20'), ('DB', '20'), ('MVC', '22'), ('PF', '20'), ('OOP', '20')]
max:  [('Cloud', '99'), ('DSA', '99'), ('DB', '98'), ('MVC', '99'), ('PF', '99'), ('OOP', '99')]
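
As a variation on the cell above, the minimum and maximum can be tracked together as a (min, max) pair per course, with the marks converted to integers first. The sketch below is plain Python on the ten (course, mark) pairs shown earlier; `min_max_by_key` is a hypothetical helper, and a dictionary stands in for reduceByKey().

```python
# The (course, mark) pairs from the rd4.take(10) output above
rows = [('DB', '59'), ('Cloud', '62'), ('PF', '45'), ('DB', '29'), ('DSA', '41'),
        ('MVC', '32'), ('OOP', '69'), ('Cloud', '85'), ('MVC', '64'), ('OOP', '51')]

def min_max_by_key(kv_pairs):
    # Keep a running (lowest, highest) pair per course in a single pass
    result = {}
    for course, mark in kv_pairs:
        mark = int(mark)  # compare numbers, not strings
        low, high = result.get(course, (mark, mark))
        result[course] = (min(low, mark), max(high, mark))
    return result

extremes = min_max_by_key(rows)
print(extremes)
```

Converting to int also keeps the comparison correct if a mark ever has a different number of digits, such as 5 or 100.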


### Show the average age of male and female students

In [19]:
rdd = sc.textFile('s3://xxxxxxxx/StudentData.csv')
col = rdd.first()
print(col)

age,gender,name,course,roll,marks,email


In [56]:
rdd2 = rdd.filter(lambda x: x != col) 
rdd3 = rdd2.map(lambda x: x.split(','))
rdd4 = rdd3.map(lambda x: (x[1],(int(x[0]),1)))
total = rdd4.count()
rdd4.take(10)

[('Female', (28, 1)), ('Female', (29, 1)), ('Male', (28, 1)), ('Female', (29, 1)), ('Male', (28, 1)), ('Male', (28, 1)), ('Male', (28, 1)), ('Female', (28, 1)), ('Male', (28, 1)), ('Male', (29, 1))]


In [57]:
rdd5 = rdd4.reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1]))
rdd5.collect()

[('Male', (14233, 499)), ('Female', (14273, 501))]


In [62]:
rdd5.map(lambda x: (x[0],x[1][0]/x[1][1])).collect()

[('Male', 28.52304609218437), ('Female', 28.489021956087825)]


In [61]:
rdd5.mapValues(lambda x: x[0]/x[1]).collect()

[('Male', 28.52304609218437), ('Female', 28.489021956087825)]
