# Occupation

### Introduction:

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [2]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=e48a758b546c6a1b746cd7aa9b2dcc964e1558d136015a35c4e53c3562de99af
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [3]:
from pyspark.sql import SparkSession, Window, functions as f
from pyspark.files import SparkFiles

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

In [4]:
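url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user"

spark = SparkSession.builder.appName("Exercise31").getOrCreate()
# Fetch the file over HTTPS; SparkFiles.get then returns its local path.
spark.sparkContext.addFile(url)

# The file is pipe-delimited with a header row; inferSchema detects the column types.
df = spark.read.csv("file://" + SparkFiles.get("u.user"), sep="|", header=True, inferSchema=True)
df.printSchema()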

root
 |-- user_id: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- zip_code: string (nullable = true)



### Step 3. Assign it to a variable called users.

### Step 4. Find the mean age per occupation

In [5]:
df.groupBy("occupation").agg(f.mean("age")).show()

+-------------+------------------+
|   occupation|          avg(age)|
+-------------+------------------+
|    librarian|              40.0|
|      retired| 63.07142857142857|
|       lawyer|             36.75|
|         none|26.555555555555557|
|       writer| 36.31111111111111|
|   programmer|33.121212121212125|
|    marketing| 37.61538461538461|
|        other|34.523809523809526|
|    executive|          38.71875|
|    scientist| 35.54838709677419|
|      student|22.081632653061224|
|     salesman|35.666666666666664|
|       artist|31.392857142857142|
|   technician|33.148148148148145|
|administrator| 38.74683544303797|
|     engineer| 36.38805970149254|
|   healthcare|           41.5625|
|     educator| 42.01052631578948|
|entertainment| 29.22222222222222|
|    homemaker| 32.57142857142857|
+-------------+------------------+
only showing top 20 rows



### Step 5. Compute the male ratio per occupation and sort it from highest to lowest

In [39]:
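# Count users per occupation, separately for each gender.
males = df.filter(df.gender == "M").groupBy("occupation").agg(f.count("gender").alias("count-M"))
females = df.filter(df.gender == "F").groupBy("occupation").agg(f.count("gender").alias("count-F"))
# Inner join: keeps only occupations that have both male and female users.
males_and_females = males.join(females, on="occupation")

# Male ratio = count-M / (count-M + count-F); no sort is applied here.
total = f.col("count-M") + f.col("count-F")
males_and_females.select(f.col("occupation"), f.col("count-M"), f.col("count-F"), (f.col("count-M") / total).alias("ratio males")).show()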

+-------------+-------+-------+-------------------+
|   occupation|count-M|count-F|        ratio males|
+-------------+-------+-------+-------------------+
|    librarian|     22|     29|0.43137254901960786|
|      retired|     13|      1| 0.9285714285714286|
|       lawyer|     10|      2| 0.8333333333333334|
|         none|      5|      4| 0.5555555555555556|
|       writer|     26|     19| 0.5777777777777777|
|   programmer|     60|      6| 0.9090909090909091|
|    marketing|     16|     10| 0.6153846153846154|
|        other|     69|     36| 0.6571428571428571|
|    executive|     29|      3|            0.90625|
|    scientist|     28|      3| 0.9032258064516129|
|      student|    136|     60| 0.6938775510204082|
|     salesman|      9|      3|               0.75|
|       artist|     15|     13| 0.5357142857142857|
|   technician|     26|      1| 0.9629629629629629|
|administrator|     43|     36| 0.5443037974683544|
|     engineer|     65|      2| 0.9701492537313433|
|   healthca

### Step 6. For each occupation, calculate the minimum and maximum ages

In [6]:
df.groupBy("occupation").agg(f.min("age"),f.max("age")).show()

+-------------+--------+--------+
|   occupation|min(age)|max(age)|
+-------------+--------+--------+
|    librarian|      23|      69|
|      retired|      51|      73|
|       lawyer|      21|      53|
|         none|      11|      55|
|       writer|      18|      60|
|   programmer|      20|      63|
|    marketing|      24|      55|
|        other|      13|      64|
|    executive|      22|      69|
|    scientist|      23|      55|
|      student|       7|      42|
|     salesman|      18|      66|
|       artist|      19|      48|
|   technician|      21|      55|
|administrator|      21|      70|
|     engineer|      22|      70|
|   healthcare|      22|      62|
|     educator|      23|      63|
|entertainment|      15|      50|
|    homemaker|      20|      50|
+-------------+--------+--------+
only showing top 20 rows



### Step 7. For each combination of occupation and gender, calculate the mean age

In [7]:
df.groupBy("occupation", "gender").agg(f.mean("age")).show()

+-------------+------+------------------+
|   occupation|gender|          avg(age)|
+-------------+------+------------------+
|   technician|     M| 32.96153846153846|
|     educator|     F| 39.11538461538461|
|       lawyer|     F|              39.5|
|entertainment|     F|              31.0|
|       lawyer|     M|              36.2|
|      retired|     F|              70.0|
|      student|     F|             20.75|
|   healthcare|     F| 39.81818181818182|
|administrator|     M| 37.16279069767442|
|    marketing|     M|            37.875|
|     engineer|     F|              29.5|
|    homemaker|     F|34.166666666666664|
|       artist|     F|30.307692307692307|
|         none|     F|              36.5|
|       doctor|     M| 43.57142857142857|
|       writer|     F| 37.63157894736842|
|     educator|     M| 43.10144927536232|
|    scientist|     M| 36.32142857142857|
|   technician|     F|              38.0|
|       writer|     M| 35.34615384615385|
+-------------+------+------------------+

### Step 8. For each occupation, present the percentage of women and men

In [43]:
males_and_females.select(f.col("occupation"), f.col("count-M"), f.col("count-F"),
                         (f.col("count-M")/(f.col("count-M")+f.col("count-F"))*100).alias("percentage-males"),
                         (f.col("count-F")/(f.col("count-M")+f.col("count-F"))*100).alias("percentage-females")).show()

+-------------+-------+-------+------------------+------------------+
|   occupation|count-M|count-F|  percentage-males|percentage-females|
+-------------+-------+-------+------------------+------------------+
|    librarian|     22|     29| 43.13725490196079| 56.86274509803921|
|      retired|     13|      1| 92.85714285714286| 7.142857142857142|
|       lawyer|     10|      2| 83.33333333333334|16.666666666666664|
|         none|      5|      4| 55.55555555555556| 44.44444444444444|
|       writer|     26|     19| 57.77777777777777| 42.22222222222222|
|   programmer|     60|      6|  90.9090909090909| 9.090909090909092|
|    marketing|     16|     10| 61.53846153846154| 38.46153846153847|
|        other|     69|     36| 65.71428571428571|34.285714285714285|
|    executive|     29|      3|            90.625|             9.375|
|    scientist|     28|      3| 90.32258064516128|  9.67741935483871|
|      student|    136|     60| 69.38775510204081|30.612244897959183|
|     salesman|     