# Exercises

Try to use `pandas` functions as much as you can.

Do **not** use any Python `if` statements or `for` loops.

Do **not** overwrite the `df` variable unless an exercise explicitly asks you to modify it.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('occupation.csv', sep='|')

## Show the first 10 elements of the DataFrame.

In [4]:
df.head(10)


Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,05201
8,9,29,M,student,01002
9,10,53,M,lawyer,90703


## Show the last 8 elements of the DataFrame.

In [5]:
df.tail(8)

Unnamed: 0,user_id,age,gender,occupation,zip_code
935,936,24,M,other,32789
936,937,48,M,educator,98072
937,938,38,F,technician,55038
938,939,26,F,student,33319
939,940,32,M,administrator,02215
940,941,20,M,student,97229
941,942,48,F,librarian,78209
942,943,22,M,student,77841


## How many different occupations are there in the dataset?

**Hint**: `nunique()`

In [25]:
df['occupation'].nunique()


21

## How many different genders?

**Hint**: `unique()`

In [26]:
df['gender'].unique()

array(['M', 'F'], dtype=object)

## How many elements are there for each gender?

**Hint**: `value_counts()`

In [28]:
df['gender'].value_counts()


gender
M    670
F    273
Name: count, dtype: int64

## Show the individuals aged 16 or older.

In [49]:
df.loc[df['age'] >= 16, ['user_id', 'age']]

Unnamed: 0,user_id,age
0,1,24
1,2,53
2,3,23
3,4,24
4,5,33
...,...,...
938,939,26
939,940,32
940,941,20
941,942,48


## Show the individuals aged between 30 and 40, both included.

In [54]:
df[df['age'].between(30, 40)]

Unnamed: 0,user_id,age,gender,occupation,zip_code
4,5,33,F,other,15213
7,8,36,M,administrator,05201
10,11,39,F,other,30329
16,17,30,M,programmer,06355
17,18,35,F,other,37212
...,...,...,...,...,...
910,911,37,F,writer,53210
917,918,40,M,scientist,70116
919,920,30,F,artist,90008
937,938,38,F,technician,55038


## Show females between 30 and 35 years old (both included) who work as educators.

In [57]:
df[df['age'].between(30, 35) & (df['gender'] == 'F') & (df['occupation'] == 'educator')]

Unnamed: 0,user_id,age,gender,occupation,zip_code
151,152,33,F,educator,68767
208,209,33,F,educator,85710
223,224,31,F,educator,43512
329,330,35,F,educator,33884
449,450,35,F,educator,11758
555,556,35,F,educator,30606
592,593,31,F,educator,68767


## Show male individuals who work as technicians or writers.

In [64]:
df[(df['gender'] == 'M') & df['occupation'].isin(['technician', 'writer'])]

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
2,3,23,M,writer,32067
3,4,24,M,technician,43537
20,21,26,M,writer,30068
21,22,25,M,writer,40206
27,28,32,M,writer,55369
43,44,26,M,technician,46260
49,50,21,M,writer,52245
76,77,30,M,technician,29379
142,143,42,M,technician,08832


## How many elements are there in the previous filter?
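One possible way to count the matches, sketched on a ten-row stand-in for `df` (the rows printed by `df.head(10)` above) so that it runs without `occupation.csv`. The names `mask` and `n_matches` are illustrative.

```python
import pandas as pd

# Stand-in for `df`: the ten rows shown by df.head(10).
df = pd.DataFrame({
    "user_id": range(1, 11),
    "age": [24, 53, 23, 24, 33, 42, 57, 36, 29, 53],
    "gender": list("MFMMFMMMMM"),
    "occupation": ["technician", "other", "writer", "technician", "other",
                   "executive", "administrator", "administrator", "student", "lawyer"],
})

# Same condition as the previous exercise.
mask = (df["gender"] == "M") & df["occupation"].isin(["technician", "writer"])

# A boolean mask sums to the number of True entries, i.e. the matching rows.
n_matches = int(mask.sum())
print(n_matches)
```

`len(df[mask])` or `df[mask].shape[0]` should give the same number.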

## Randomly take 10 rows from the DataFrame.

**Hint**: `sample()`

## Randomly take 3 rows from the DataFrame whose zip code is `55337`.
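A minimal sketch of filtering first and sampling afterwards. The rows below are made up for illustration, and the zip code is compared as a string on the assumption that `zip_code` is stored as text (the leading zeros in the outputs above suggest so).

```python
import pandas as pd

# Made-up rows for illustration only; the real `df` comes from occupation.csv.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "zip_code": ["55337", "55337", "10003", "55337", "55337"],
})

# Keep only the rows in the requested zip code, then draw 3 of them.
in_zip = df[df["zip_code"] == "55337"]
picked = in_zip.sample(n=3, random_state=0)  # random_state only fixes the draw
print(picked)
```

Note that `sample(n=3)` raises a `ValueError` when fewer than 3 rows match, unless `replace=True` is passed.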

## Sort the dataset based on age, from older to younger.

## Sort the dataset by age as before, but break ties by user_id in ascending order (smaller user_ids go first).
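Both sorting exercises can be sketched with `sort_values()`, which accepts a list of columns and a matching list of directions. The stand-in below reuses the rows from `df.head(10)`.

```python
import pandas as pd

# Stand-in for `df`: user_id and age from the rows shown by df.head(10).
df = pd.DataFrame({
    "user_id": range(1, 11),
    "age": [24, 53, 23, 24, 33, 42, 57, 36, 29, 53],
})

# Oldest first.
by_age = df.sort_values("age", ascending=False)

# Oldest first; equal ages are ordered by ascending user_id.
by_age_then_id = df.sort_values(["age", "user_id"], ascending=[False, True])
print(by_age_then_id)
```

Neither call changes `df`, because `sort_values()` returns a new DataFrame by default.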

## Plot a histogram showcasing the age distribution of users.
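A possible sketch using the pandas plotting wrapper around matplotlib. The number of bins, the title and the axis labels are arbitrary choices, and the `Agg` backend line is only there so the snippet runs without a display (drop it inside a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd

# Stand-in for df["age"]: the ages shown by df.head(10).
ages = pd.Series([24, 53, 23, 24, 33, 42, 57, 36, 29, 53], name="age")

# kind="hist" bins the values and draws one bar per bin.
ax = ages.plot(kind="hist", bins=5, edgecolor="black", title="Age distribution")
ax.set_xlabel("age")
ax.set_ylabel("number of users")
```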

## Plot a pie chart depicting the distribution of male and female users.

## Remove all rows with occupation `other`.

## Remove the columns gender and user_id.
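The two removal exercises above could be approached as follows: a boolean filter for the rows and `drop(columns=...)` for the columns. Both return new DataFrames, so `df` stays as it is. The result names `no_other` and `slim` are illustrative.

```python
import pandas as pd

# Stand-in for `df`: the ten rows shown by df.head(10).
df = pd.DataFrame({
    "user_id": range(1, 11),
    "age": [24, 53, 23, 24, 33, 42, 57, 36, 29, 53],
    "gender": list("MFMMFMMMMM"),
    "occupation": ["technician", "other", "writer", "technician", "other",
                   "executive", "administrator", "administrator", "student", "lawyer"],
})

# Rows: keep everything whose occupation is not "other".
no_other = df[df["occupation"] != "other"]

# Columns: drop gender and user_id.
slim = df.drop(columns=["gender", "user_id"])
print(no_other.shape, list(slim.columns))
```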

## Plot a bar chart of the average age of users for each occupation.

That is, display a bar for each occupation, with the height being the average user age.

**Extra**: Display them sorted in descending order, that is, the occupation with the highest average age goes first and so on.

**Hint**: `kind='bar'` or `kind='barh'` (for displaying it horizontal).
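A sketch of one way to do this: group by occupation, average the ages, sort, and plot. It runs on the `df.head(10)` rows as a stand-in, and the `Agg` backend line is only needed outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd

# Stand-in for `df`: age and occupation from the rows shown by df.head(10).
df = pd.DataFrame({
    "age": [24, 53, 23, 24, 33, 42, 57, 36, 29, 53],
    "occupation": ["technician", "other", "writer", "technician", "other",
                   "executive", "administrator", "administrator", "student", "lawyer"],
})

# Mean age per occupation, highest first.
avg_age = df.groupby("occupation")["age"].mean().sort_values(ascending=False)

# One bar per occupation; use kind="barh" for horizontal bars.
ax = avg_age.plot(kind="bar", title="Average age by occupation")
ax.set_ylabel("average age")
```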

## Create multiple boxplots in a single chart to display the spread of ages for each occupation.

**Hint**: `boxplot()` method in a pandas DF.

## Change the column name `occupation` to `job`.

## Filter the rows that have an age equal to one of the following numbers: [19, 43, 26, 12]

**Hint**: `isin()`
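A short sketch of the `isin()` filter, using the rows printed by `df.tail(8)` as a stand-in for `df`.

```python
import pandas as pd

# Stand-in for `df`: user_id and age from the rows shown by df.tail(8).
df = pd.DataFrame({
    "user_id": [936, 937, 938, 939, 940, 941, 942, 943],
    "age": [24, 48, 38, 26, 32, 20, 48, 22],
})

# isin() returns True for every row whose age appears in the list.
wanted_ages = [19, 43, 26, 12]
matches = df[df["age"].isin(wanted_ages)]
print(matches)
```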

## Group the individuals by occupation and show how many men and women there are in each group. Remember to use `pandas` only.
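One possible approach, sketched on the `df.head(10)` rows: group by both columns, count with `size()`, and pivot gender into columns with `unstack()`. The `fill_value=0` argument is a choice made here so that missing combinations show as 0 instead of `NaN`.

```python
import pandas as pd

# Stand-in for `df`: gender and occupation from the rows shown by df.head(10).
df = pd.DataFrame({
    "gender": list("MFMMFMMMMM"),
    "occupation": ["technician", "other", "writer", "technician", "other",
                   "executive", "administrator", "administrator", "student", "lawyer"],
})

# Rows: occupations. Columns: F and M. Values: number of users.
counts = df.groupby(["occupation", "gender"]).size().unstack(fill_value=0)
print(counts)
```

`pd.crosstab(df["occupation"], df["gender"])` should produce an equivalent table.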

## Create a new column `age_range` that has 3 possible values:
- 0: the individual is 25 years old or younger.
- 1: the individual is between 26 and 65 years old.
- 2: the individual is more than 65 years old.
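A sketch using `pd.cut` with right-closed bins, so the groups are ages up to 25, 26 to 65, and above 65. Placing age 25 in group 0 is this sketch's choice, and the four ages are picked only to hit each group and the boundary.

```python
import numpy as np
import pandas as pd

# Illustrative ages: one per group plus the boundary value 25.
df = pd.DataFrame({"user_id": [1, 2, 3, 4], "age": [24, 53, 25, 70]})

# Bins are (-inf, 25], (25, 65], (65, inf) and map to the labels 0, 1, 2.
age_range = pd.cut(df["age"], bins=[-np.inf, 25, 65, np.inf], labels=[0, 1, 2])

# assign() returns a copy with the extra column, leaving `df` untouched.
with_range = df.assign(age_range=age_range.astype(int))
print(with_range)
```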

## Randomly add `NaN` (`np.nan`) values in the age column. Then filter the dataset, showing all the rows with any `NaN` value.

Here you have to modify the `df` variable.
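A minimal sketch, assuming that hiding roughly 30% of the ages is acceptable (the fraction and the seed are arbitrary). `Series.where` keeps the values where the condition is true and writes `NaN` elsewhere, which also converts the column to floats.

```python
import numpy as np
import pandas as pd

# Stand-in for `df`: user_id and age from the rows shown by df.head(10).
df = pd.DataFrame({
    "user_id": range(1, 11),
    "age": [24, 53, 23, 24, 33, 42, 57, 36, 29, 53],
})

rng = np.random.default_rng(0)      # seeded only to make the sketch repeatable
hide = rng.random(len(df)) < 0.3    # True for roughly 30% of the rows

# Overwrite the age column: hidden positions become NaN.
df["age"] = df["age"].where(~hide)

# Rows that contain at least one NaN in any column.
rows_with_nan = df[df.isna().any(axis=1)]
print(rows_with_nan)
```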

## Replace all `NaN` values by `-1`.

Do not modify `df`.
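Since `fillna()` returns a new object by default, a sketch like the following leaves `df` unchanged. The three-row frame is a stand-in with one missing age.

```python
import numpy as np
import pandas as pd

# Tiny stand-in with one missing age.
df = pd.DataFrame({"user_id": [1, 2, 3], "age": [24.0, np.nan, 23.0]})

# Every NaN in the frame becomes -1 in the copy; `df` keeps its NaN.
filled = df.fillna(-1)
print(filled)
```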

## Replace `NaN` values in the age column by the mean of all the column values.

Modify `df`.
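A sketch of mean imputation on a small stand-in. `Series.mean()` skips `NaN` by default, so the mean is taken over the known ages only.

```python
import numpy as np
import pandas as pd

# Tiny stand-in with one missing age.
df = pd.DataFrame({"user_id": [1, 2, 3, 4], "age": [24.0, np.nan, 23.0, 25.0]})

mean_age = df["age"].mean()             # NaN values are ignored: (24 + 23 + 25) / 3
df["age"] = df["age"].fillna(mean_age)  # assign back, since this exercise modifies df
print(df)
```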

## Age Bracketing with Occupations

Create a new DataFrame where you categorize users into age brackets: 18-30, 31-50, 51-70, and 70+.

Then, for each occupation and age bracket compute the average age and the male percentage.

**Hint**: Check pandas' `cut` method for computing age brackets.
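The steps above could be sketched as follows, on the `df.head(10)` rows. The bin edges are an interpretation: age 70 is placed in `51-70`, and ages below 18 fall outside every bracket and are therefore left out of the summary. The names `bracketed`, `is_male` and `summary` are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for `df`: the rows shown by df.head(10).
df = pd.DataFrame({
    "age": [24, 53, 23, 24, 33, 42, 57, 36, 29, 53],
    "gender": list("MFMMFMMMMM"),
    "occupation": ["technician", "other", "writer", "technician", "other",
                   "executive", "administrator", "administrator", "student", "lawyer"],
})

# Right-closed bins: (17, 30], (30, 50], (50, 70], (70, inf).
brackets = pd.cut(df["age"], bins=[17, 30, 50, 70, np.inf],
                  labels=["18-30", "31-50", "51-70", "70+"])

# New DataFrame with the bracket and a boolean helper column.
bracketed = df.assign(age_bracket=brackets, is_male=df["gender"].eq("M"))

# The mean of a boolean column is the fraction of True values.
summary = (bracketed
           .groupby(["occupation", "age_bracket"], observed=True)
           .agg(avg_age=("age", "mean"), male_pct=("is_male", "mean")))
summary["male_pct"] = summary["male_pct"] * 100
print(summary)
```

`observed=True` keeps only the occupation and bracket combinations that actually occur.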

## Gender Dominance by Zip Code

Identify zip codes where a single gender represents more than 90% of individuals.
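One possible approach is to compute each gender's share per zip code with `value_counts(normalize=True)` and keep the shares above 0.9. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows: one all-male zip code, one mixed, one with a single woman.
df = pd.DataFrame({
    "zip_code": ["11111", "11111", "11111", "22222", "22222", "33333"],
    "gender":   ["M",     "M",     "M",     "M",     "F",     "F"],
})

# Series indexed by (zip_code, gender) holding each gender's share in that zip code.
shares = df.groupby("zip_code")["gender"].value_counts(normalize=True)

# Keep only the (zip_code, gender) pairs where one gender exceeds 90%.
dominated = shares[shares > 0.9]
print(dominated)
```

Zip codes with a single user qualify trivially (share of 100%), so on real data it may be worth also requiring a minimum number of users per zip code.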

## Most Diverse Occupations

Identify the top 5 occupations that have the closest male-to-female ratio.

**Hint**:
- If performing a `groupby()` with occupation and gender, check what the `size()` method provides. Basically, it returns a "multiindex" Series with two indices: occupation and gender, with the values corresponding to the number of elements of each group.
- This multiindex can be "ungrouped" and transformed into a DF with occupation as index and gender as columns with the `unstack()` method.
- From here, think how to do the rest.
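Continuing from the hint, one way to finish is to turn the counts into an imbalance score and take the 5 smallest. The score used here, `|M - F| / (M + F)`, is just one possible definition of "closest ratio" (0 means perfectly balanced, 1 means a single gender). The rows are made up for illustration.

```python
import pandas as pd

# Made-up rows: artist is balanced, student is 2:1, doctor is all male.
df = pd.DataFrame({
    "occupation": ["artist", "artist", "student", "student", "student",
                   "doctor", "doctor", "doctor"],
    "gender":     ["M", "F", "M", "M", "F", "M", "M", "M"],
})

# Occupation as index, one column per gender, counts as values.
counts = df.groupby(["occupation", "gender"]).size().unstack(fill_value=0)

# 0 = equal numbers of men and women, 1 = only one gender present.
imbalance = (counts["M"] - counts["F"]).abs() / (counts["M"] + counts["F"])

most_balanced = imbalance.nsmallest(5)
print(most_balanced)
```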

## User Density by State

Assuming the first two characters in the `zip_code` represent a USA state, find out which state has the highest user density (number of users per unique occupation).
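A sketch of the computation on four rows taken from the outputs above (users 1, 152, 209 and 593). The `state` key is simply the first two characters of the zip code, as the exercise assumes.

```python
import pandas as pd

# Stand-in for `df`: four rows that appear in the outputs above.
df = pd.DataFrame({
    "user_id": [1, 152, 209, 593],
    "occupation": ["technician", "educator", "educator", "educator"],
    "zip_code": ["85711", "68767", "85710", "68767"],
})

# First two characters of the zip code, used as the grouping key.
state = df["zip_code"].astype(str).str[:2].rename("state")

# Users and distinct occupations per state.
per_state = df.groupby(state).agg(users=("user_id", "count"),
                                  occupations=("occupation", "nunique"))

# Density = users per unique occupation; idxmax() returns the top state.
density = per_state["users"] / per_state["occupations"]
top_state = density.idxmax()
print(top_state, density[top_state])
```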

## Rare Occupations

Identify the occupations that are unique to each gender (i.e., occupations held only by males or only by females).
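One possible approach: an occupation is unique to a gender when its group contains exactly one distinct gender value. The stand-in uses the `df.head(10)` rows plus user 938 from the `df.tail(8)` output, so that `technician` has both genders.

```python
import pandas as pd

# Stand-in for `df`: rows from df.head(10) plus user 938 (a female technician).
df = pd.DataFrame({
    "gender": list("MFMMFMMMMM") + ["F"],
    "occupation": ["technician", "other", "writer", "technician", "other",
                   "executive", "administrator", "administrator", "student",
                   "lawyer", "technician"],
})

# Per occupation: number of distinct genders and the first gender seen.
per_job = df.groupby("occupation")["gender"].agg(["nunique", "first"])

# Keep occupations with a single gender; the value says which gender it is.
single_gender = per_job.loc[per_job["nunique"] == 1, "first"]
print(single_gender)
```

On a ten-row sample almost every occupation looks single-gender; the result is only meaningful on the full dataset.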

## Pie Gender Ratio

Based on the age brackets from the "Age Bracketing with Occupations" exercise, plot a pie chart of the gender ratio for each age bracket.

## Occupation Age Range

For each occupation, compute the age range (max - min). Display the top 10 occupations with the most diverse age ranges using a horizontal bar chart.
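A sketch of the computation and the chart on the `df.head(10)` rows. The `Agg` backend line is only needed outside a notebook, and the title is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd

# Stand-in for `df`: age and occupation from the rows shown by df.head(10).
df = pd.DataFrame({
    "age": [24, 53, 23, 24, 33, 42, 57, 36, 29, 53],
    "occupation": ["technician", "other", "writer", "technician", "other",
                   "executive", "administrator", "administrator", "student", "lawyer"],
})

# Youngest and oldest user per occupation, then the difference.
extremes = df.groupby("occupation")["age"].agg(["min", "max"])
age_range = (extremes["max"] - extremes["min"]).nlargest(10)

# barh draws the first entry at the bottom, so sort ascending to put the widest range on top.
ax = age_range.sort_values().plot(kind="barh", title="Age range by occupation")
ax.set_xlabel("max age - min age")
```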