# Occupation

### Introduction:

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=f4f598583b09c54de63c43a73621ad2daa06af7fb0ae484a10234518a7f87fe5
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [2]:
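from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, FloatType
# Note: sum, min and max imported here shadow Python's built-ins for the rest of the notebook.
from pyspark.sql.functions import expr, col, mean, when, sum, count, desc, min, max
# Start (or reuse) a local Spark session that uses all available cores.
spark = SparkSession.builder.master("local[*]").getOrCreate()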

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user).

In [3]:
!wget -O u.user.csv https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user

--2024-04-09 17:32:59--  https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22667 (22K) [text/plain]
Saving to: ‘u.user.csv’


2024-04-09 17:33:00 (97.2 MB/s) - ‘u.user.csv’ saved [22667/22667]



### Step 3. Assign it to a variable called users.

In [4]:
users = spark.read.csv('u.user.csv', sep='|', header=True, inferSchema=True)
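
The file is pipe-delimited and starts with a header row, which is why `sep='|'` and `header=True` are passed above. As a rough illustration of that layout, here is a plain-Python sketch (not Spark, and not how Spark parses the file internally) that reads a few rows with the standard `csv` module. The `SAMPLE` string is an assumption built from the column names and rows displayed by `users.show()` in Step 5, without the derived `Male` column.

```python
import csv
import io

# A few rows of u.user, copied from the users.show() output in Step 5.
SAMPLE = """user_id|age|gender|occupation|zip_code
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
8|36|M|administrator|05201
"""

# DictReader takes the first line as the header, like header=True in spark.read.csv.
reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="|")

# Convert the numeric columns by hand; zip_code stays a string.
rows = [dict(r, user_id=int(r["user_id"]), age=int(r["age"])) for r in reader]

print(rows[0])
print(rows[-1]["zip_code"])
```

Keeping `zip_code` as text preserves leading zeros such as `05201`, which matches how the value is displayed in the Step 5 output.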

### Step 4. Discover what is the mean age per occupation

In [12]:
users.groupBy(col('occupation')).agg({"age": "mean"}).collect()

[Row(occupation='librarian', avg(age)=40.0),
 Row(occupation='retired', avg(age)=63.07142857142857),
 Row(occupation='lawyer', avg(age)=36.75),
 Row(occupation='none', avg(age)=26.555555555555557),
 Row(occupation='writer', avg(age)=36.31111111111111),
 Row(occupation='programmer', avg(age)=33.121212121212125),
 Row(occupation='marketing', avg(age)=37.61538461538461),
 Row(occupation='other', avg(age)=34.523809523809526),
 Row(occupation='executive', avg(age)=38.71875),
 Row(occupation='scientist', avg(age)=35.54838709677419),
 Row(occupation='student', avg(age)=22.081632653061224),
 Row(occupation='salesman', avg(age)=35.666666666666664),
 Row(occupation='artist', avg(age)=31.392857142857142),
 Row(occupation='technician', avg(age)=33.148148148148145),
 Row(occupation='administrator', avg(age)=38.74683544303797),
 Row(occupation='engineer', avg(age)=36.38805970149254),
 Row(occupation='healthcare', avg(age)=41.5625),
 Row(occupation='educator', avg(age)=42.01052631578948),
 Row(occupa

### Step 5. Discover the Male ratio per occupation and sort it from the most to the least

In [13]:
users.columns

['user_id', 'age', 'gender', 'occupation', 'zip_code']

In [22]:
# 1 for male users, 0 otherwise, so that sum('Male') counts the men in each group.
users = users.withColumn('Male', when(col('gender')=='M', 1).otherwise(0))

In [25]:
users.show()

+-------+---+------+-------------+--------+----+
|user_id|age|gender|   occupation|zip_code|Male|
+-------+---+------+-------------+--------+----+
|      1| 24|     M|   technician|   85711|   1|
|      2| 53|     F|        other|   94043|   0|
|      3| 23|     M|       writer|   32067|   1|
|      4| 24|     M|   technician|   43537|   1|
|      5| 33|     F|        other|   15213|   0|
|      6| 42|     M|    executive|   98101|   1|
|      7| 57|     M|administrator|   91344|   1|
|      8| 36|     M|administrator|   05201|   1|
|      9| 29|     M|      student|   01002|   1|
|     10| 53|     M|       lawyer|   90703|   1|
|     11| 39|     F|        other|   30329|   0|
|     12| 28|     F|        other|   06405|   0|
|     13| 47|     M|     educator|   29206|   1|
|     14| 45|     M|    scientist|   55106|   1|
|     15| 49|     F|     educator|   97301|   0|
|     16| 21|     M|entertainment|   10309|   1|
|     17| 30|     M|   programmer|   06355|   1|
|     18| 35|     F|

In [38]:
users_grouped_by = users.groupBy('occupation').agg(count('*').alias('total'), sum('Male').alias('male_count'))

In [39]:
users_grouped_by = users_grouped_by.withColumn('male_ratio', (col('male_count') / col('total')))

In [40]:
users_grouped_by = users_grouped_by.sort(desc(col('male_ratio')))
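
The approach in this step is: flag each male user with 1, sum the flags and count the rows per occupation, divide, then sort in descending order. The same arithmetic can be sketched in plain Python as a cross-check. This is only an illustration, not the Spark code path: `SAMPLE` is an assumption made of the 17 complete rows visible in the `users.show()` output, so the ratios below describe that sample and not the full dataset.

```python
from collections import defaultdict

# (gender, occupation) for user_id 1 to 17, copied from the users.show() output.
SAMPLE = [
    ("M", "technician"), ("F", "other"), ("M", "writer"), ("M", "technician"),
    ("F", "other"), ("M", "executive"), ("M", "administrator"),
    ("M", "administrator"), ("M", "student"), ("M", "lawyer"), ("F", "other"),
    ("F", "other"), ("M", "educator"), ("M", "scientist"), ("F", "educator"),
    ("M", "entertainment"), ("M", "programmer"),
]

totals = defaultdict(int)  # rows per occupation, like count('*')
males = defaultdict(int)   # male rows per occupation, like sum('Male')

for gender, occupation in SAMPLE:
    totals[occupation] += 1
    males[occupation] += 1 if gender == "M" else 0

# male_ratio = male_count / total, then sort from highest to lowest.
male_ratio = {occ: males[occ] / totals[occ] for occ in totals}
ranked = sorted(male_ratio.items(), key=lambda kv: kv[1], reverse=True)

for occupation, ratio in ranked:
    print(f"{occupation:15s} {ratio:.2f}")
```

Summing a 0/1 indicator and dividing by the group size is the same as taking the mean of the indicator, so in Spark `mean('Male')` would be an alternative to the separate `count` and `sum`.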

### Step 6. For each occupation, calculate the minimum and maximum ages

In [33]:
users.groupBy(col('occupation')).agg(min('age'), max('age')).show()

+-------------+--------+--------+
|   occupation|min(age)|max(age)|
+-------------+--------+--------+
|    librarian|      23|      69|
|      retired|      51|      73|
|       lawyer|      21|      53|
|         none|      11|      55|
|       writer|      18|      60|
|   programmer|      20|      63|
|    marketing|      24|      55|
|        other|      13|      64|
|    executive|      22|      69|
|    scientist|      23|      55|
|      student|       7|      42|
|     salesman|      18|      66|
|       artist|      19|      48|
|   technician|      21|      55|
|administrator|      21|      70|
|     engineer|      22|      70|
|   healthcare|      22|      62|
|     educator|      23|      63|
|entertainment|      15|      50|
|    homemaker|      20|      50|
+-------------+--------+--------+
only showing top 20 rows



### Step 7. For each combination of occupation and gender, calculate the mean age

In [34]:
users.groupBy('occupation', 'gender').agg(mean('age')).show()

+-------------+------+------------------+
|   occupation|gender|          avg(age)|
+-------------+------+------------------+
|   technician|     M| 32.96153846153846|
|     educator|     F| 39.11538461538461|
|       lawyer|     F|              39.5|
|entertainment|     F|              31.0|
|       lawyer|     M|              36.2|
|      retired|     F|              70.0|
|      student|     F|             20.75|
|   healthcare|     F| 39.81818181818182|
|administrator|     M| 37.16279069767442|
|    marketing|     M|            37.875|
|     engineer|     F|              29.5|
|    homemaker|     F|34.166666666666664|
|       artist|     F|30.307692307692307|
|         none|     F|              36.5|
|       doctor|     M| 43.57142857142857|
|       writer|     F| 37.63157894736842|
|     educator|     M| 43.10144927536232|
|    scientist|     M| 36.32142857142857|
|   technician|     F|              38.0|
|       writer|     M| 35.34615384615385|
+-------------+------+------------
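
Grouping by two columns treats each distinct (occupation, gender) pair as its own group. A minimal plain-Python sketch of that idea uses a tuple as the dictionary key. It is an illustration only: `SAMPLE` is an assumption built from the 17 complete rows shown in the Step 5 `users.show()` output, so its means will not match the full-dataset table above.

```python
from collections import defaultdict
from statistics import mean

# (age, gender, occupation) for user_id 1 to 17, copied from the Step 5 output.
SAMPLE = [
    (24, "M", "technician"), (53, "F", "other"), (23, "M", "writer"),
    (24, "M", "technician"), (33, "F", "other"), (42, "M", "executive"),
    (57, "M", "administrator"), (36, "M", "administrator"), (29, "M", "student"),
    (53, "M", "lawyer"), (39, "F", "other"), (28, "F", "other"),
    (47, "M", "educator"), (45, "M", "scientist"), (49, "F", "educator"),
    (21, "M", "entertainment"), (30, "M", "programmer"),
]

# Collect ages under a composite key, one list per (occupation, gender) pair.
ages = defaultdict(list)
for age, gender, occupation in SAMPLE:
    ages[(occupation, gender)].append(age)

# One mean per pair, analogous to groupBy('occupation', 'gender').agg(mean('age')).
mean_age = {key: mean(values) for key, values in ages.items()}

for (occupation, gender), value in sorted(mean_age.items()):
    print(occupation, gender, value)
```

Pairs that never occur in the data simply produce no row, which is also why a table grouped by two columns can omit some occupation and gender combinations.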

### Step 8. For each occupation, present the percentage of women and men

In [45]:
users_grouped_by = users_grouped_by.withColumn("Female_ratio", 1 - col('male_ratio'))

In [49]:
users_grouped_by = users_grouped_by.withColumn("Female_ratio", col('Female_ratio') * 100)
users_grouped_by = users_grouped_by.withColumn("male_ratio", col('male_ratio') * 100)
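
Because `Female_ratio` is derived as `1 - male_ratio` and then scaled by 100, its last digits can carry floating-point residue (for example `3.703703703703709` for technician in the output of this step, where 100/27 is 3.7037037...). The sketch below, in plain Python and not Spark, recomputes both percentages directly from the counts. `COUNTS` is an assumption holding a few `(total, male_count)` pairs copied from the `users_grouped_by.show()` output in this step.

```python
import math

# occupation -> (total, male_count), copied from the output table in this step.
COUNTS = {
    "doctor": (7, 7),
    "engineer": (67, 65),
    "technician": (27, 26),
    "salesman": (12, 9),
}

percentages = {}
for occupation, (total, male_count) in COUNTS.items():
    male_pct = male_count / total * 100
    # Derive the female share from the counts instead of from 1 - male_ratio.
    female_pct = (total - male_count) / total * 100
    percentages[occupation] = (male_pct, female_pct)
    # The two shares should add up to 100 up to rounding error.
    assert math.isclose(male_pct + female_pct, 100.0)

for occupation, (male_pct, female_pct) in percentages.items():
    print(f"{occupation:12s} male {male_pct:6.2f}%  female {female_pct:6.2f}%")
```

Formatting to two decimals at display time is one way to hide the residue without changing the stored values.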

In [52]:
users_grouped_by.show()

+-------------+-----+----------+-----------------+------------------+
|   occupation|total|male_count|       male_ratio|      Female_ratio|
+-------------+-----+----------+-----------------+------------------+
|       doctor|    7|         7|            100.0|               0.0|
|     engineer|   67|        65|97.01492537313433| 2.985074626865669|
|   technician|   27|        26|96.29629629629629| 3.703703703703709|
|      retired|   14|        13|92.85714285714286|  7.14285714285714|
|   programmer|   66|        60| 90.9090909090909| 9.090909090909093|
|    executive|   32|        29|           90.625|             9.375|
|    scientist|   31|        28|90.32258064516128| 9.677419354838712|
|entertainment|   18|        16|88.88888888888889|11.111111111111116|
|       lawyer|   12|        10|83.33333333333334|16.666666666666664|
|     salesman|   12|         9|             75.0|              25.0|
|     educator|   95|        69|72.63157894736842|27.368421052631575|
|      student|  196