# Student Alcohol Consumption

### Introduction:

This time you will download a dataset from the UCI Machine Learning Repository.

### Step 1. Import the necessary libraries

In [1]:
!pip install pyspark

Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [28]:
from pyspark.sql import SparkSession, functions as f
from pyspark.sql.types import StringType, BooleanType
from pyspark.files import SparkFiles

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/Students_Alcohol_Consumption/student-mat.csv).

In [4]:
spark = SparkSession.builder.appName("exercise40").getOrCreate()
url = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/Students_Alcohol_Consumption/student-mat.csv"
spark.sparkContext.addFile(url)

### Step 3. Assign it to a variable called df.

In [6]:
df = spark.read.csv("file://" + SparkFiles.get("student-mat.csv"), header=True, inferSchema=True)
df.printSchema()
df.show()

root
 |-- school: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- address: string (nullable = true)
 |-- famsize: string (nullable = true)
 |-- Pstatus: string (nullable = true)
 |-- Medu: integer (nullable = true)
 |-- Fedu: integer (nullable = true)
 |-- Mjob: string (nullable = true)
 |-- Fjob: string (nullable = true)
 |-- reason: string (nullable = true)
 |-- guardian: string (nullable = true)
 |-- traveltime: integer (nullable = true)
 |-- studytime: integer (nullable = true)
 |-- failures: integer (nullable = true)
 |-- schoolsup: string (nullable = true)
 |-- famsup: string (nullable = true)
 |-- paid: string (nullable = true)
 |-- activities: string (nullable = true)
 |-- nursery: string (nullable = true)
 |-- higher: string (nullable = true)
 |-- internet: string (nullable = true)
 |-- romantic: string (nullable = true)
 |-- famrel: integer (nullable = true)
 |-- freetime: integer (nullable = true)
 |-- goout: integer (nullable = true)

### Step 4. For the purpose of this exercise slice the dataframe from 'school' until the 'guardian' column

In [13]:
school_idx = df.columns.index("school")
guardian_idx = df.columns.index("guardian")

df.select(*df.columns[school_idx:guardian_idx + 1]).show()

+------+---+---+-------+-------+-------+----+----+--------+--------+----------+--------+
|school|sex|age|address|famsize|Pstatus|Medu|Fedu|    Mjob|    Fjob|    reason|guardian|
+------+---+---+-------+-------+-------+----+----+--------+--------+----------+--------+
|    GP|  F| 18|      U|    GT3|      A|   4|   4| at_home| teacher|    course|  mother|
|    GP|  F| 17|      U|    GT3|      T|   1|   1| at_home|   other|    course|  father|
|    GP|  F| 15|      U|    LE3|      T|   1|   1| at_home|   other|     other|  mother|
|    GP|  F| 15|      U|    GT3|      T|   4|   2|  health|services|      home|  mother|
|    GP|  F| 16|      U|    GT3|      T|   3|   3|   other|   other|      home|  father|
|    GP|  M| 16|      U|    LE3|      T|   4|   3|services|   other|reputation|  mother|
|    GP|  M| 16|      U|    LE3|      T|   2|   2|   other|   other|      home|  mother|
|    GP|  F| 17|      U|    GT3|      A|   4|   4|   other| teacher|      home|  mother|
|    GP|  M| 15|     
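
The index arithmetic in Step 4 can be wrapped in a small helper. The sketch below is plain Python, so it can be checked without a Spark session. `columns_between` is a hypothetical name (not part of PySpark), and `sample_columns` is a shortened stand-in for `df.columns`.

```python
def columns_between(columns, first, last):
    """Return the names from `first` through `last`, inclusive."""
    start = columns.index(first)
    stop = columns.index(last)
    if start > stop:
        raise ValueError(f"{first!r} comes after {last!r}")
    # Slices exclude their upper bound, so add 1 to keep `last`.
    return columns[start:stop + 1]


# Shortened stand-in for df.columns (assumption: same order as the schema).
sample_columns = ["school", "sex", "age", "address", "famsize", "Pstatus",
                  "Medu", "Fedu", "Mjob", "Fjob", "reason", "guardian",
                  "traveltime", "studytime"]

selected = columns_between(sample_columns, "school", "guardian")
print(selected)
```

With the real DataFrame, `df.select(*columns_between(df.columns, "school", "guardian"))` should select the same columns as the cell in this step.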

### Step 5. Create a lambda function that will capitalize strings.

In [20]:
# Return None for null values so the UDF does not fail on missing data.
capitalize = f.udf(lambda x: x.capitalize() if x is not None else None, StringType())
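
Because a UDF wraps ordinary Python, its logic can be tried out before Spark is involved. The following is a minimal sketch under that assumption. `safe_capitalize` is a hypothetical name, and the sample values are illustrative rather than read from the dataset.

```python
def safe_capitalize(value):
    """Capitalize a string; pass None through unchanged."""
    if value is None:
        return None
    # str.capitalize upper-cases the first character and lower-cases the rest.
    return value.capitalize()


examples = ["at_home", "teacher", "OTHER", None]
results = [safe_capitalize(v) for v in examples]
print(results)
```

A named function can be wrapped the same way as the lambda, for example `f.udf(safe_capitalize, StringType())`. The resulting column should then be labelled with the function name instead of `<lambda>`.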

### Step 6. Capitalize both Mjob and Fjob

In [21]:
df.select(capitalize(df.Mjob), capitalize(df.Fjob)).show()

+--------------+--------------+
|<lambda>(Mjob)|<lambda>(Fjob)|
+--------------+--------------+
|       At_home|       Teacher|
|       At_home|         Other|
|       At_home|         Other|
|        Health|      Services|
|         Other|         Other|
|      Services|         Other|
|         Other|         Other|
|         Other|       Teacher|
|      Services|         Other|
|         Other|         Other|
|       Teacher|        Health|
|      Services|         Other|
|        Health|      Services|
|       Teacher|         Other|
|         Other|         Other|
|        Health|         Other|
|      Services|      Services|
|         Other|         Other|
|      Services|      Services|
|        Health|         Other|
+--------------+--------------+
only showing top 20 rows



### Step 7. Print the last elements of the data set.

In [22]:
df.tail(1)

[Row(school='MS', sex='M', age=19, address='U', famsize='LE3', Pstatus='T', Medu=1, Fedu=1, Mjob='other', Fjob='at_home', reason='course', guardian='father', traveltime=1, studytime=1, failures=0, schoolsup='no', famsup='no', paid='no', activities='no', nursery='yes', higher='yes', internet='yes', romantic='no', famrel=3, freetime=2, goout=3, Dalc=3, Walc=3, health=5, absences=5, G1=8, G2=9, G3=9)]

### Step 8. Did you notice the original dataframe is still lowercase? Why is that? Fix it and capitalize Mjob and Fjob.

In [27]:
df = df.withColumn("Mjob", capitalize(df.Mjob)).withColumn("Fjob", capitalize(df.Fjob))
df.show()


+------+---+---+-------+-------+-------+----+----+--------+--------+----------+--------+----------+---------+--------+---------+------+----+----------+-------+------+--------+--------+------+--------+-----+----+----+------+--------+---+---+---+
|school|sex|age|address|famsize|Pstatus|Medu|Fedu|    Mjob|    Fjob|    reason|guardian|traveltime|studytime|failures|schoolsup|famsup|paid|activities|nursery|higher|internet|romantic|famrel|freetime|goout|Dalc|Walc|health|absences| G1| G2| G3|
+------+---+---+-------+-------+-------+----+----+--------+--------+----------+--------+----------+---------+--------+---------+------+----+----------+-------+------+--------+--------+------+--------+-----+----+----+------+--------+---+---+---+
|    GP|  F| 18|      U|    GT3|      A|   4|   4| At_home| Teacher|    course|  mother|         2|        2|       0|      yes|    no|  no|        no|    yes|   yes|      no|      no|     4|       3|    4|   1|   1|     3|       6|  5|  6|  6|
|    GP|  F| 17|    

### Step 9. Create a function called majority that returns a boolean value to a new column called legal_drinker (Consider majority as older than 17 years old)


In [29]:
# Keep nulls as null instead of raising a TypeError when age is missing.
majority = f.udf(lambda age: None if age is None else age > 17, BooleanType())

In [30]:
df.withColumn("legal_drinker", majority(df.age)).show()

+------+---+---+-------+-------+-------+----+----+--------+--------+----------+--------+----------+---------+--------+---------+------+----+----------+-------+------+--------+--------+------+--------+-----+----+----+------+--------+---+---+---+-------------+
|school|sex|age|address|famsize|Pstatus|Medu|Fedu|    Mjob|    Fjob|    reason|guardian|traveltime|studytime|failures|schoolsup|famsup|paid|activities|nursery|higher|internet|romantic|famrel|freetime|goout|Dalc|Walc|health|absences| G1| G2| G3|legal_drinker|
+------+---+---+-------+-------+-------+----+----+--------+--------+----------+--------+----------+---------+--------+---------+------+----+----------+-------+------+--------+--------+------+--------+-----+----+----+------+--------+---+---+---+-------------+
|    GP|  F| 18|      U|    GT3|      A|   4|   4| At_home| Teacher|    course|  mother|         2|        2|       0|      yes|    no|  no|        no|    yes|   yes|      no|      no|     4|       3|    4|   1|   1|     3|
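
The boundary in Step 9 ("older than 17") is easy to get off by one, so it helps to check the rule on a few ages in plain Python first. This is a sketch only. `is_majority` and `LEGAL_AGE_THRESHOLD` are hypothetical names, and the ages are made-up test values.

```python
LEGAL_AGE_THRESHOLD = 17  # "older than 17", as stated in Step 9


def is_majority(age):
    """Return True when age is strictly above the threshold; None stays None."""
    if age is None:
        return None
    return age > LEGAL_AGE_THRESHOLD


ages = [15, 17, 18, 22, None]
flags = [is_majority(a) for a in ages]
print(flags)
```

For a comparison this simple, a native column expression such as `df.withColumn("legal_drinker", f.col("age") > 17)` is an alternative that avoids the Python UDF round trip.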

### Step 10. Multiply every number of the dataset by 10.
##### I know this makes no sense; don't forget it is just an exercise.

In [38]:
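def multiplier(c, n=10):
    # Scale integer columns by n; pass every other column through unchanged.
    # Note: df.dtypes reports "int" only for 32-bit integers, which covers
    # every numeric column in this dataset.
    if dict(df.dtypes)[c] == "int":
        return f.col(c) * n
    return f.col(c)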

In [39]:
df.select(*map(multiplier, df.columns)).show()

+------+---+----------+-------+-------+-------+-----------+-----------+--------+--------+----------+--------+-----------------+----------------+---------------+---------+------+----+----------+-------+------+--------+--------+-------------+---------------+------------+-----------+-----------+-------------+---------------+---------+---------+---------+
|school|sex|(age * 10)|address|famsize|Pstatus|(Medu * 10)|(Fedu * 10)|    Mjob|    Fjob|    reason|guardian|(traveltime * 10)|(studytime * 10)|(failures * 10)|schoolsup|famsup|paid|activities|nursery|higher|internet|romantic|(famrel * 10)|(freetime * 10)|(goout * 10)|(Dalc * 10)|(Walc * 10)|(health * 10)|(absences * 10)|(G1 * 10)|(G2 * 10)|(G3 * 10)|
+------+---+----------+-------+-------+-------+-----------+-----------+--------+--------+----------+--------+-----------------+----------------+---------------+---------+------+----+----------+-------+------+--------+--------+-------------+---------------+------------+-----------+-----------
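
The `"int"` check in Step 10 works here because every numeric column in this file is inferred as a 32-bit integer. On other data the inferred type could be `bigint` or `double`. The sketch below shows one way to widen the check, using plain Python over `(name, dtype)` pairs shaped like `df.dtypes`. `NUMERIC_TYPES`, `is_numeric`, and `numeric_columns` are hypothetical names, and `sample_dtypes` is a shortened stand-in.

```python
# Simple type names as they appear in df.dtypes. Decimal types carry a
# precision suffix such as "decimal(10,2)", so they are matched by prefix.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def is_numeric(dtype):
    """Return True if a dtype string names a numeric Spark SQL type."""
    return dtype in NUMERIC_TYPES or dtype.startswith("decimal")


def numeric_columns(dtypes):
    """Return the column names whose dtype is numeric, in original order."""
    return [name for name, dtype in dtypes if is_numeric(dtype)]


# Shortened stand-in for df.dtypes (assumption: a few columns only).
sample_dtypes = [("school", "string"), ("age", "int"), ("Mjob", "string"),
                 ("absences", "int"), ("G3", "int")]

print(numeric_columns(sample_dtypes))
```

Inside `multiplier`, the membership test could replace the `== "int"` comparison. Returning `(f.col(c) * n).alias(c)` would keep the original column names instead of labels like `(age * 10)`.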