# Occupation

### Introduction:

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
from pyspark.sql import SparkSession
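# Import only the functions used below; note that max and min shadow the Python built-ins
from pyspark.sql.functions import col, count, desc, max, min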
import requests

In [4]:
spark = SparkSession.builder.master("local[1]").appName("user").getOrCreate()



### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

In [2]:
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user"
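data_content = requests.get(url)
data_content.raise_for_status()  # stop early if the download failed

# Save the pipe-delimited text locally so Spark can read it
with open("user.csv", "w") as f:
    f.write(data_content.text)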

### Step 3. Assign it to a variable called users.

In [5]:
users = spark.read.options(header=True, inferSchema=True, delimiter="|").csv("user.csv")

In [6]:
users.show(5)

+-------+---+------+----------+--------+
|user_id|age|gender|occupation|zip_code|
+-------+---+------+----------+--------+
|      1| 24|     M|technician|   85711|
|      2| 53|     F|     other|   94043|
|      3| 23|     M|    writer|   32067|
|      4| 24|     M|technician|   43537|
|      5| 33|     F|     other|   15213|
+-------+---+------+----------+--------+
only showing top 5 rows



In [7]:
users.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- zip_code: string (nullable = true)
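
To make the read options concrete, here is a minimal pure-Python sketch of what `header=True` and `delimiter="|"` imply for a file laid out like this one. It uses the standard `csv` module on a small in-memory sample copied from the `users.show(5)` output above. The names `sample_text` and `parse_users` are hypothetical, and Spark's own reader and schema inference work differently internally, so treat this only as an illustration of the file layout.

```python
import csv
import io

# Sample copied from the first rows shown above, in a pipe-delimited layout.
sample_text = """user_id|age|gender|occupation|zip_code
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
"""


def parse_users(text):
    """Parse pipe-delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    rows = []
    for record in reader:
        rows.append({
            "user_id": int(record["user_id"]),  # integer, as in the schema above
            "age": int(record["age"]),          # integer, as in the schema above
            "gender": record["gender"],
            "occupation": record["occupation"],
            "zip_code": record["zip_code"],     # kept as a string, as in the schema above
        })
    return rows


rows = parse_users(sample_text)
print(rows[0])
```

The first line supplies the column names, and every value starts out as text; the explicit `int(...)` conversions stand in for the typing that `inferSchema=True` performs automatically.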



### Step 4. Discover the mean age per occupation

In [9]:
users.select("age", "occupation").groupby("occupation").mean().show(100)

+-------------+------------------+
|   occupation|          avg(age)|
+-------------+------------------+
|    librarian|              40.0|
|      retired| 63.07142857142857|
|       lawyer|             36.75|
|         none|26.555555555555557|
|       writer| 36.31111111111111|
|   programmer|33.121212121212125|
|    marketing| 37.61538461538461|
|        other|34.523809523809526|
|    executive|          38.71875|
|    scientist| 35.54838709677419|
|      student|22.081632653061224|
|     salesman|35.666666666666664|
|       artist|31.392857142857142|
|   technician|33.148148148148145|
|administrator| 38.74683544303797|
|     engineer| 36.38805970149254|
|   healthcare|           41.5625|
|     educator| 42.01052631578948|
|entertainment| 29.22222222222222|
|    homemaker| 32.57142857142857|
|       doctor| 43.57142857142857|
+-------------+------------------+
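
As a rough cross-check of what the grouped mean computes, the sketch below repeats the calculation in plain Python on the five rows shown by `users.show(5)`. The helper `mean_age_by_occupation` is a hypothetical name, and the sample is far too small to reproduce the averages in the table above; it only illustrates the group-then-average logic.

```python
from collections import defaultdict

# (occupation, age) pairs copied from the five rows shown by users.show(5).
sample = [
    ("technician", 24),
    ("other", 53),
    ("writer", 23),
    ("technician", 24),
    ("other", 33),
]


def mean_age_by_occupation(pairs):
    """Return {occupation: mean age} for an iterable of (occupation, age) pairs."""
    totals = defaultdict(lambda: [0, 0])  # occupation -> [sum of ages, row count]
    for occupation, age in pairs:
        totals[occupation][0] += age
        totals[occupation][1] += 1
    return {occ: age_sum / n for occ, (age_sum, n) in totals.items()}


means = mean_age_by_occupation(sample)
print(means)
```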



### Step 5. Discover the male ratio per occupation and sort it from highest to lowest

In [45]:
total_count = users.groupBy("occupation").count()

In [46]:
occup_count = users.groupBy("occupation", "gender").agg(
    count("occupation").alias("occup_CNT")
)

In [29]:
final_df = occup_count.join(total_count, occup_count.occupation==total_count.occupation)
final_df.show()

+----------+------+---------+----------+-----+
|occupation|gender|occup_CNT|occupation|count|
+----------+------+---------+----------+-----+
| librarian|     F|       29| librarian|   51|
| librarian|     M|       22| librarian|   51|
|   retired|     M|       13|   retired|   14|
|   retired|     F|        1|   retired|   14|
|    lawyer|     M|       10|    lawyer|   12|
|    lawyer|     F|        2|    lawyer|   12|
|      none|     M|        5|      none|    9|
|      none|     F|        4|      none|    9|
|    writer|     M|       26|    writer|   45|
|    writer|     F|       19|    writer|   45|
|programmer|     F|        6|programmer|   66|
|programmer|     M|       60|programmer|   66|
| marketing|     F|       10| marketing|   26|
| marketing|     M|       16| marketing|   26|
|     other|     M|       69|     other|  105|
|     other|     F|       36|     other|  105|
| executive|     F|        3| executive|   32|
| executive|     M|       29| executive|   32|
| scientist| 

In [39]:
final_df = final_df.withColumn("ratio", 100*col("occup_CNT")/col("count"))

In [44]:
final_df.filter(col("gender") == "M").sort(desc("ratio")).show()

+-------------+------+---------+-------------+-----+-----------------+
|   occupation|gender|occup_CNT|   occupation|count|            ratio|
+-------------+------+---------+-------------+-----+-----------------+
|       doctor|     M|        7|       doctor|    7|            100.0|
|     engineer|     M|       65|     engineer|   67|97.01492537313433|
|   technician|     M|       26|   technician|   27|96.29629629629629|
|      retired|     M|       13|      retired|   14|92.85714285714286|
|   programmer|     M|       60|   programmer|   66| 90.9090909090909|
|    executive|     M|       29|    executive|   32|           90.625|
|    scientist|     M|       28|    scientist|   31| 90.3225806451613|
|entertainment|     M|       16|entertainment|   18|88.88888888888889|
|       lawyer|     M|       10|       lawyer|   12|83.33333333333333|
|     salesman|     M|        9|     salesman|   12|             75.0|
|     educator|     M|       69|     educator|   95|72.63157894736842|
|     
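
The ratio column is simply the male count divided by the occupation total, times 100. The following plain-Python sketch repeats that arithmetic and the descending sort for a few occupations, using counts copied from the joined table above. The names `counts`, `male_ratio`, and `ranked` are hypothetical and are not part of the notebook.

```python
# (occupation, male count, total count) copied from the joined table above.
counts = [
    ("librarian", 22, 51),
    ("retired", 13, 14),
    ("lawyer", 10, 12),
    ("programmer", 60, 66),
    ("executive", 29, 32),
]


def male_ratio(male, total):
    """Percentage of men among all users in one occupation."""
    return 100 * male / total


# Sort from the highest male ratio to the lowest.
ranked = sorted(
    ((occ, male_ratio(m, t)) for occ, m, t in counts),
    key=lambda item: item[1],
    reverse=True,
)

for occ, ratio in ranked:
    print(f"{occ:<12}{ratio:.2f}")
```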

### Step 6. For each occupation, calculate the minimum and maximum ages

In [51]:
users.select("occupation","age").groupby("occupation").agg(
    min("age").alias("MIN_AGE"), max("age").alias("MAX_AGE")
).show()

+-------------+-------+-------+
|   occupation|MIN_AGE|MAX_AGE|
+-------------+-------+-------+
|    librarian|     23|     69|
|      retired|     51|     73|
|       lawyer|     21|     53|
|         none|     11|     55|
|       writer|     18|     60|
|   programmer|     20|     63|
|    marketing|     24|     55|
|        other|     13|     64|
|    executive|     22|     69|
|    scientist|     23|     55|
|      student|      7|     42|
|     salesman|     18|     66|
|       artist|     19|     48|
|   technician|     21|     55|
|administrator|     21|     70|
|     engineer|     22|     70|
|   healthcare|     22|     62|
|     educator|     23|     63|
|entertainment|     15|     50|
|    homemaker|     20|     50|
+-------------+-------+-------+
only showing top 20 rows



### Step 7. For each combination of occupation and gender, calculate the mean age

In [52]:
users.select("occupation","gender","age").groupby("occupation","gender").mean().show()

+-------------+------+------------------+
|   occupation|gender|          avg(age)|
+-------------+------+------------------+
|   technician|     M| 32.96153846153846|
|     educator|     F| 39.11538461538461|
|       lawyer|     F|              39.5|
|entertainment|     F|              31.0|
|       lawyer|     M|              36.2|
|      retired|     F|              70.0|
|      student|     F|             20.75|
|   healthcare|     F| 39.81818181818182|
|administrator|     M| 37.16279069767442|
|    marketing|     M|            37.875|
|     engineer|     F|              29.5|
|    homemaker|     F|34.166666666666664|
|       artist|     F|30.307692307692307|
|         none|     F|              36.5|
|       doctor|     M| 43.57142857142857|
|       writer|     F| 37.63157894736842|
|     educator|     M| 43.10144927536232|
|    scientist|     M| 36.32142857142857|
|   technician|     F|              38.0|
|       writer|     M| 35.34615384615385|
+-------------+------+------------------+

### Step 8. For each occupation, present the percentage of women and men

In [53]:
final_df.show()

+----------+------+---------+----------+-----+------------------+
|occupation|gender|occup_CNT|occupation|count|             ratio|
+----------+------+---------+----------+-----+------------------+
| librarian|     F|       29| librarian|   51| 56.86274509803921|
| librarian|     M|       22| librarian|   51| 43.13725490196079|
|   retired|     M|       13|   retired|   14| 92.85714285714286|
|   retired|     F|        1|   retired|   14| 7.142857142857143|
|    lawyer|     M|       10|    lawyer|   12| 83.33333333333333|
|    lawyer|     F|        2|    lawyer|   12|16.666666666666668|
|      none|     M|        5|      none|    9| 55.55555555555556|
|      none|     F|        4|      none|    9| 44.44444444444444|
|    writer|     M|       26|    writer|   45| 57.77777777777778|
|    writer|     F|       19|    writer|   45| 42.22222222222222|
|programmer|     F|        6|programmer|   66| 9.090909090909092|
|programmer|     M|       60|programmer|   66|  90.9090909090909|
| marketin
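
The table above holds one row per occupation and gender. As a minimal sketch of how those rows could be folded into a single record per occupation, the plain-Python example below recomputes the percentages from a few counts copied from the table and checks that each pair adds up to 100. The names `gender_counts` and `gender_percentages` are hypothetical, and this is an illustration rather than the notebook's method.

```python
# (occupation, gender, count) copied from the table above.
gender_counts = [
    ("librarian", "F", 29), ("librarian", "M", 22),
    ("retired", "M", 13), ("retired", "F", 1),
    ("lawyer", "M", 10), ("lawyer", "F", 2),
]


def gender_percentages(rows):
    """Return {occupation: {gender: percentage}} from (occupation, gender, count) rows."""
    totals = {}
    for occ, _, n in rows:
        totals[occ] = totals.get(occ, 0) + n
    table = {}
    for occ, gender, n in rows:
        table.setdefault(occ, {})[gender] = 100 * n / totals[occ]
    return table


percentages = gender_percentages(gender_counts)
for occ, shares in percentages.items():
    print(occ, round(shares["F"], 2), round(shares["M"], 2))
```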