# Student Alcohol Consumption

### Introduction:

This time you will download a dataset from the UCI Machine Learning Repository.

### Step 1. Import the necessary libraries

In [5]:
!pip install pyspark



In [6]:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
# Only col is imported by name; everything else is reached through F,
# which avoids shadowing Python built-ins such as sum, min and max
from pyspark.sql.functions import col

# Start a local Spark session that uses all available cores
spark = SparkSession.builder.master("local[*]").getOrCreate()

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/Students_Alcohol_Consumption/student-mat.csv).

In [7]:
!wget -O student-mat.csv https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/Students_Alcohol_Consumption/student-mat.csv

--2024-04-11 09:40:51--  https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/Students_Alcohol_Consumption/student-mat.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 41983 (41K) [text/plain]
Saving to: ‘student-mat.csv’


2024-04-11 09:40:51 (26.1 MB/s) - ‘student-mat.csv’ saved [41983/41983]



### Step 3. Assign it to a variable called df.

In [15]:
df = spark.read.csv('student-mat.csv', sep=',', header=True, inferSchema=True)

In [9]:
df.show(10)

+------+---+---+-------+-------+-------+----+----+--------+--------+----------+--------+----------+---------+--------+---------+------+----+----------+-------+------+--------+--------+------+--------+-----+----+----+------+--------+---+---+---+
|school|sex|age|address|famsize|Pstatus|Medu|Fedu|    Mjob|    Fjob|    reason|guardian|traveltime|studytime|failures|schoolsup|famsup|paid|activities|nursery|higher|internet|romantic|famrel|freetime|goout|Dalc|Walc|health|absences| G1| G2| G3|
+------+---+---+-------+-------+-------+----+----+--------+--------+----------+--------+----------+---------+--------+---------+------+----+----------+-------+------+--------+--------+------+--------+-----+----+----+------+--------+---+---+---+
|    GP|  F| 18|      U|    GT3|      A|   4|   4| at_home| teacher|    course|  mother|         2|        2|       0|      yes|    no|  no|        no|    yes|   yes|      no|      no|     4|       3|    4|   1|   1|     3|       6|  5|  6|  6|
|    GP|  F| 17|    

### Step 4. For the purpose of this exercise, slice the dataframe from the 'school' column through the 'guardian' column.

In [16]:
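# Positions of the first and last columns to keep
start_columns_to_slice = df.columns.index('school')

end_columns_to_slice = df.columns.index('guardian')

# +1 so that 'guardian' itself is included in the slice
columns_needed = df.columns[start_columns_to_slice:end_columns_to_slice + 1]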

In [17]:
columns_needed

['school',
 'sex',
 'age',
 'address',
 'famsize',
 'Pstatus',
 'Medu',
 'Fedu',
 'Mjob',
 'Fjob',
 'reason',
 'guardian']

In [18]:
df = df.select(columns_needed)

In [19]:
df.show()

+------+---+---+-------+-------+-------+----+----+--------+--------+----------+--------+
|school|sex|age|address|famsize|Pstatus|Medu|Fedu|    Mjob|    Fjob|    reason|guardian|
+------+---+---+-------+-------+-------+----+----+--------+--------+----------+--------+
|    GP|  F| 18|      U|    GT3|      A|   4|   4| at_home| teacher|    course|  mother|
|    GP|  F| 17|      U|    GT3|      T|   1|   1| at_home|   other|    course|  father|
|    GP|  F| 15|      U|    LE3|      T|   1|   1| at_home|   other|     other|  mother|
|    GP|  F| 15|      U|    GT3|      T|   4|   2|  health|services|      home|  mother|
|    GP|  F| 16|      U|    GT3|      T|   3|   3|   other|   other|      home|  father|
|    GP|  M| 16|      U|    LE3|      T|   4|   3|services|   other|reputation|  mother|
|    GP|  M| 16|      U|    LE3|      T|   2|   2|   other|   other|      home|  mother|
|    GP|  F| 17|      U|    GT3|      A|   4|   4|   other| teacher|      home|  mother|
|    GP|  M| 15|     
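
The slice by column name can also be wrapped in a small helper. This is a minimal sketch rather than part of the original exercise: `slice_columns` is a hypothetical name, and the example list is shortened for illustration.

```python
def slice_columns(columns, start, end):
    """Return the column names from `start` through `end`, inclusive."""
    first = columns.index(start)  # raises ValueError if the name is missing
    last = columns.index(end)
    if first > last:
        raise ValueError(f"{start!r} comes after {end!r}")
    # +1 because Python slices exclude the upper bound
    return columns[first:last + 1]


# A shortened, illustrative column list
example_columns = ['school', 'sex', 'age', 'guardian', 'traveltime']
print(slice_columns(example_columns, 'sex', 'guardian'))  # ['sex', 'age', 'guardian']
```

Applied to a freshly loaded DataFrame, `df.select(slice_columns(df.columns, 'school', 'guardian'))` should keep the same twelve columns as the cells above.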

### Step 5. Create a lambda function that will capitalize strings.
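
In PySpark the built-in `initcap` used in Step 6 is usually the simpler choice, but a lambda can still be written and, if needed, wrapped in a UDF. The sketch below is one possible way to do that; `capitalizer` and `capitalize_columns` are hypothetical names, and the helper assumes the named columns hold strings.

```python
# A plain Python lambda; None is passed through because Spark hands nulls to UDFs as None
capitalizer = lambda s: s.capitalize() if s is not None else None


def capitalize_columns(df, names):
    """Apply `capitalizer` to each named string column through a Spark UDF."""
    # Imported here so the lambda above can be tried without PySpark installed
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    capitalize_udf = F.udf(capitalizer, StringType())
    for name in names:
        df = df.withColumn(name, capitalize_udf(F.col(name)))
    return df


print(capitalizer('at_home'))  # At_home
```

Note that `str.capitalize` only upper-cases the first character of the whole string, while `initcap` works word by word, so the two can differ on values that contain spaces.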

### Step 6. Capitalize both Mjob and Fjob

In [28]:
df.select(F.initcap('Mjob'), F.initcap('Fjob'), '*').show()

+-------------+-------------+------+---+---+-------+-------+-------+----+----+--------+--------+----------+--------+
|initcap(Mjob)|initcap(Fjob)|school|sex|age|address|famsize|Pstatus|Medu|Fedu|    Mjob|    Fjob|    reason|guardian|
+-------------+-------------+------+---+---+-------+-------+-------+----+----+--------+--------+----------+--------+
|      At_home|      Teacher|    GP|  F| 18|      U|    GT3|      A|   4|   4| At_home| Teacher|    course|  mother|
|      At_home|        Other|    GP|  F| 17|      U|    GT3|      T|   1|   1| At_home|   Other|    course|  father|
|      At_home|        Other|    GP|  F| 15|      U|    LE3|      T|   1|   1| At_home|   Other|     other|  mother|
|       Health|     Services|    GP|  F| 15|      U|    GT3|      T|   4|   2|  Health|Services|      home|  mother|
|        Other|        Other|    GP|  F| 16|      U|    GT3|      T|   3|   3|   Other|   Other|      home|  father|
|     Services|        Other|    GP|  M| 16|      U|    LE3|    

##### Or replace the columns with `withColumn` and reassign the result to df:

In [11]:
df = df.withColumn('Mjob', F.initcap(col('Mjob'))).withColumn('Fjob', F.initcap(col('Fjob')))

In [29]:
df.show()

+------+---+---+-------+-------+-------+----+----+--------+--------+----------+--------+
|school|sex|age|address|famsize|Pstatus|Medu|Fedu|    Mjob|    Fjob|    reason|guardian|
+------+---+---+-------+-------+-------+----+----+--------+--------+----------+--------+
|    GP|  F| 18|      U|    GT3|      A|   4|   4| At_home| Teacher|    course|  mother|
|    GP|  F| 17|      U|    GT3|      T|   1|   1| At_home|   Other|    course|  father|
|    GP|  F| 15|      U|    LE3|      T|   1|   1| At_home|   Other|     other|  mother|
|    GP|  F| 15|      U|    GT3|      T|   4|   2|  Health|Services|      home|  mother|
|    GP|  F| 16|      U|    GT3|      T|   3|   3|   Other|   Other|      home|  father|
|    GP|  M| 16|      U|    LE3|      T|   4|   3|Services|   Other|reputation|  mother|
|    GP|  M| 16|      U|    LE3|      T|   2|   2|   Other|   Other|      home|  mother|
|    GP|  F| 17|      U|    GT3|      A|   4|   4|   Other| Teacher|      home|  mother|
|    GP|  M| 15|     

### Step 7. Print the last elements of the data set.

In [20]:
df.tail(5)

[Row(school='MS', sex='M', age=20, address='U', famsize='LE3', Pstatus='A', Medu=2, Fedu=2, Mjob='services', Fjob='services', reason='course', guardian='other'),
 Row(school='MS', sex='M', age=17, address='U', famsize='LE3', Pstatus='T', Medu=3, Fedu=1, Mjob='services', Fjob='services', reason='course', guardian='mother'),
 Row(school='MS', sex='M', age=21, address='R', famsize='GT3', Pstatus='T', Medu=1, Fedu=1, Mjob='other', Fjob='other', reason='course', guardian='other'),
 Row(school='MS', sex='M', age=18, address='R', famsize='LE3', Pstatus='T', Medu=3, Fedu=2, Mjob='services', Fjob='other', reason='course', guardian='mother'),
 Row(school='MS', sex='M', age=19, address='U', famsize='LE3', Pstatus='T', Medu=1, Fedu=1, Mjob='other', Fjob='at_home', reason='course', guardian='father')]

### Step 8. Did you notice the original dataframe is still lowercase? Why is that? Fix it and capitalize Mjob and Fjob.

In [None]:
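# Nothing to fix when the withColumn version from Step 6 is used.
# Spark DataFrames are immutable: df.select(F.initcap(...)).show() only displays
# a capitalized copy, while reassigning the result of withColumn to df keeps the change.
# Some outputs in this notebook still show lowercase values, apparently because
# cells were run out of order after the CSV was reloaded.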

### Step 9. Create a function called majority that returns a boolean value to a new column called legal_drinker (Consider majority as older than 17 years old)

In [21]:
# True for students older than 17
df = df.withColumn('legal_drinker', col('age') > 17)

In [22]:
df.show()

+------+---+---+-------+-------+-------+----+----+--------+--------+----------+--------+-------------+
|school|sex|age|address|famsize|Pstatus|Medu|Fedu|    Mjob|    Fjob|    reason|guardian|legal_drinker|
+------+---+---+-------+-------+-------+----+----+--------+--------+----------+--------+-------------+
|    GP|  F| 18|      U|    GT3|      A|   4|   4| at_home| teacher|    course|  mother|         true|
|    GP|  F| 17|      U|    GT3|      T|   1|   1| at_home|   other|    course|  father|        false|
|    GP|  F| 15|      U|    LE3|      T|   1|   1| at_home|   other|     other|  mother|        false|
|    GP|  F| 15|      U|    GT3|      T|   4|   2|  health|services|      home|  mother|        false|
|    GP|  F| 16|      U|    GT3|      T|   3|   3|   other|   other|      home|  father|        false|
|    GP|  M| 16|      U|    LE3|      T|   4|   3|services|   other|reputation|  mother|        false|
|    GP|  M| 16|      U|    LE3|      T|   2|   2|   other|   other|     
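
The step asks for a function named `majority`, while the cell above writes the comparison inline. A minimal sketch of such a function follows; it relies on the fact that a Spark `Column` overloads `>`, so the same function accepts either a plain number or a column expression.

```python
def majority(age):
    """Return whether `age` is older than 17.

    Works for a plain number and, because Column overloads `>`,
    for a Spark Column such as col('age').
    """
    return age > 17


print(majority(18), majority(17))  # True False
```

With the DataFrame from this notebook, `df.withColumn('legal_drinker', majority(col('age')))` should produce the same column as the inline comparison, without needing a UDF.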

### Step 10. Multiply every number of the dataset by 10.
##### I know this makes no sense; remember, it is just an exercise.

In [43]:
df.dtypes

[('school', 'string'),
 ('sex', 'string'),
 ('age', 'int'),
 ('address', 'string'),
 ('famsize', 'string'),
 ('Pstatus', 'string'),
 ('Medu', 'int'),
 ('Fedu', 'int'),
 ('Mjob', 'string'),
 ('Fjob', 'string'),
 ('reason', 'string'),
 ('guardian', 'string'),
 ('legal_drinker', 'boolean')]

In [44]:
columns_to_mult = [name for name, dtype in df.dtypes if dtype == 'int']

In [45]:
columns_to_mult

['age', 'Medu', 'Fedu']

In [51]:
# Overwrite each integer column with its value multiplied by 10
for colm in columns_to_mult:
    df = df.withColumn(colm, col(colm) * 10)

In [52]:
df.show(5)

+------+---+---+-------+-------+-------+----+----+-------+--------+------+--------+-------------+
|school|sex|age|address|famsize|Pstatus|Medu|Fedu|   Mjob|    Fjob|reason|guardian|legal_drinker|
+------+---+---+-------+-------+-------+----+----+-------+--------+------+--------+-------------+
|    GP|  F|180|      U|    GT3|      A|  40|  40|at_home| teacher|course|  mother|         true|
|    GP|  F|170|      U|    GT3|      T|  10|  10|at_home|   other|course|  father|        false|
|    GP|  F|150|      U|    LE3|      T|  10|  10|at_home|   other| other|  mother|        false|
|    GP|  F|150|      U|    GT3|      T|  40|  20| health|services|  home|  mother|        false|
|    GP|  F|160|      U|    GT3|      T|  30|  30|  other|   other|  home|  father|        false|
+------+---+---+-------+-------+-------+----+----+-------+--------+------+--------+-------------+
only showing top 5 rows
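
The filter above matches only columns typed `int`, which is enough for this slice of the data. As a hedged sketch of a more general version, the helper below also accepts other numeric type names reported by `df.dtypes` and performs the multiplication in a single `select`. `NUMERIC_TYPES`, `numeric_columns` and `multiply_numeric` are hypothetical names introduced for illustration.

```python
# Spark SQL type names (as reported by df.dtypes) treated as numeric here
NUMERIC_TYPES = {'tinyint', 'smallint', 'int', 'bigint', 'float', 'double'}


def numeric_columns(dtypes):
    """Pick column names whose type string is numeric; booleans and strings are skipped."""
    return [name for name, dtype in dtypes
            if dtype in NUMERIC_TYPES or dtype.startswith('decimal')]


def multiply_numeric(df, factor=10):
    """Return a new DataFrame with every numeric column multiplied by `factor`."""
    # Imported here so numeric_columns can be tried without PySpark installed
    from pyspark.sql import functions as F

    targets = set(numeric_columns(df.dtypes))
    return df.select([
        (F.col(name) * factor).alias(name) if name in targets else F.col(name)
        for name in df.columns
    ])


# A few of the (name, type) pairs shown by df.dtypes above
sample_dtypes = [('school', 'string'), ('age', 'int'), ('Medu', 'int'),
                 ('Fedu', 'int'), ('legal_drinker', 'boolean')]
print(numeric_columns(sample_dtypes))  # ['age', 'Medu', 'Fedu']
```

Calling `multiply_numeric(df)` on the unmultiplied DataFrame should give the same result as the loop, while building one projection instead of chaining a `withColumn` call per column.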

