# Working With A Dataset On The Prestige Of Canadian Occupations



In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Importing The Dataset

The dataset is from https://vincentarelbundock.github.io/Rdatasets/doc/carData/Prestige.html. This website hosts, in .csv format, datasets that normally ship with packages for the R programming language; this one comes from the carData package.

In [2]:
# Importing Data:

'''
Documentation Info: https://vincentarelbundock.github.io/Rdatasets/doc/carData/Prestige.html
This data frame contains the following columns:

education : Average education of occupational incumbents, years, in 1971.

income: Average income of incumbents, dollars, in 1971.

women: Percentage of incumbents who are women.

prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.

census: Canadian Census occupational code.

type: Type of occupation. A factor with levels (note: out of order): bc, Blue Collar; prof, 
      Professional, Managerial, and Technical; wc, White Collar.
'''

prestige_data = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/carData/Prestige.csv")

# Preview data with .head(15)

prestige_data.head(15)



Unnamed: 0.1,Unnamed: 0,education,income,women,prestige,census,type
0,gov.administrators,13.11,12351,11.16,68.8,1113,prof
1,general.managers,12.26,25879,4.02,69.1,1130,prof
2,accountants,12.77,9271,15.7,63.4,1171,prof
3,purchasing.officers,11.42,8865,9.11,56.8,1175,prof
4,chemists,14.62,8403,11.68,73.5,2111,prof
5,physicists,15.64,11030,5.13,77.6,2113,prof
6,biologists,15.09,8258,25.65,72.6,2133,prof
7,architects,15.44,14163,2.69,78.1,2141,prof
8,civil.engineers,14.52,11377,1.03,73.1,2143,prof
9,mining.engineers,14.64,11023,0.94,68.8,2153,prof


In [3]:
# Preview data with .tail(15)

prestige_data.tail(15)

Unnamed: 0.1,Unnamed: 0,education,income,women,prestige,census,type
87,electrical.linemen,9.05,8316,1.34,40.9,8731,bc
88,electricians,9.93,7147,0.99,50.2,8733,bc
89,construction.foremen,8.24,8880,0.65,51.1,8780,bc
90,carpenters,6.92,5299,0.56,38.9,8781,bc
91,masons,6.6,5959,0.52,36.2,8782,bc
92,house.painters,7.81,4549,2.46,29.9,8785,bc
93,plumbers,8.33,6928,0.61,42.9,8791,bc
94,construction.labourers,7.52,3910,1.09,26.5,8798,bc
95,pilots,12.27,14032,0.58,66.1,9111,prof
96,train.engineers,8.49,8845,0.0,48.9,9131,bc


In [4]:
# Rename columns

prestige_data.columns = ['Job Worker', 'Education', 'Income', 'Women', 'Prestige', 'Census', 'Type']

prestige_data.head(15)

Unnamed: 0,Job Worker,Education,Income,Women,Prestige,Census,Type
0,gov.administrators,13.11,12351,11.16,68.8,1113,prof
1,general.managers,12.26,25879,4.02,69.1,1130,prof
2,accountants,12.77,9271,15.7,63.4,1171,prof
3,purchasing.officers,11.42,8865,9.11,56.8,1175,prof
4,chemists,14.62,8403,11.68,73.5,2111,prof
5,physicists,15.64,11030,5.13,77.6,2113,prof
6,biologists,15.09,8258,25.65,72.6,2133,prof
7,architects,15.44,14163,2.69,78.1,2141,prof
8,civil.engineers,14.52,11377,1.03,73.1,2143,prof
9,mining.engineers,14.64,11023,0.94,68.8,2153,prof
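Assigning a list to `.columns` depends on the columns arriving in exactly the expected order. As a hedged alternative, the sketch below renames by explicit mapping with `.rename(columns=...)`. The small `raw` frame is a stand-in built from two rows of the preview, and the first column's name (`Unnamed: 0`) is taken from that preview; it may differ depending on how the CSV is read.

```python
import pandas as pd

# Stand-in for the freshly loaded frame; values copied from the preview above.
raw = pd.DataFrame({
    "Unnamed: 0": ["gov.administrators", "general.managers"],
    "education": [13.11, 12.26],
    "income": [12351, 25879],
    "women": [11.16, 4.02],
    "prestige": [68.8, 69.1],
    "census": [1113, 1130],
    "type": ["prof", "prof"],
})

# Map each old name to its new name, so column order no longer matters.
column_names = {
    "Unnamed: 0": "Job Worker",  # assumption: name of the first column as read
    "education": "Education",
    "income": "Income",
    "women": "Women",
    "prestige": "Prestige",
    "census": "Census",
    "type": "Type",
}

# .rename returns a new frame and leaves `raw` unchanged.
renamed = raw.rename(columns=column_names)
print(renamed.columns.tolist())
```

A name in the mapping that is absent from the frame is ignored by default, so a wrong key fails quietly rather than mislabelling a column.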


# Some Pandas Operations

Python's pandas library can be used to extract information from the dataset.

## 1) Jobs By Education

You can use boolean filtering in pandas to extract jobs whose incumbents average fewer than 8 years of education, more than 8 years, or more than 12 years.
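Conditions can also be combined into a single filter. Here is a minimal sketch, using a small hand-built `sample` frame (values copied from rows of this dataset) in place of `prestige_data`, that keeps jobs with more than 8 and at most 12 years of education.

```python
import pandas as pd

# Hand-built sample; Education values are taken from rows of the dataset.
sample = pd.DataFrame({
    "Job Worker": ["cooks", "electricians", "typesetters", "pilots", "physicists"],
    "Education": [7.74, 9.93, 10.0, 12.27, 15.64],
})

# Each comparison yields a boolean Series; `&` combines them element-wise.
# The parentheses are required because `&` binds tighter than `>` and `<=`.
mask = (sample["Education"] > 8) & (sample["Education"] <= 12)

between_8_and_12 = sample[mask]
print(between_8_and_12)
```

Using `|` instead of `&` gives a logical OR, and `~mask` inverts the selection.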

In [5]:
# Jobs with less than 8 years of education:

jobs_less_eightyrs = prestige_data[prestige_data.Education < 8]

jobs_less_eightyrs.head(12)

Unnamed: 0,Job Worker,Education,Income,Women,Prestige,Census,Type
59,cooks,7.74,3116,52.0,29.7,6121,bc
63,launderers,7.33,3000,69.31,20.8,6162,bc
64,janitors,7.11,3472,33.57,17.3,6191,bc
65,elevator.operators,7.58,3582,30.08,20.1,6193,bc
66,farmers,6.84,3643,3.6,44.1,7112,
69,bakers,7.54,4199,33.3,38.9,8213,bc
70,slaughterers.1,7.64,5134,17.26,25.2,8215,bc
71,slaughterers.2,7.64,5134,17.26,34.8,8215,bc
72,canners,7.42,1890,72.24,23.2,8221,bc
73,textile.weavers,6.69,4443,31.36,33.3,8267,bc


In [6]:
# Jobs with more than 8 years of education:

jobs_more_eightyrs = prestige_data[prestige_data.Education > 8]

jobs_more_eightyrs.head(12)

Unnamed: 0,Job Worker,Education,Income,Women,Prestige,Census,Type
0,gov.administrators,13.11,12351,11.16,68.8,1113,prof
1,general.managers,12.26,25879,4.02,69.1,1130,prof
2,accountants,12.77,9271,15.7,63.4,1171,prof
3,purchasing.officers,11.42,8865,9.11,56.8,1175,prof
4,chemists,14.62,8403,11.68,73.5,2111,prof
5,physicists,15.64,11030,5.13,77.6,2113,prof
6,biologists,15.09,8258,25.65,72.6,2133,prof
7,architects,15.44,14163,2.69,78.1,2141,prof
8,civil.engineers,14.52,11377,1.03,73.1,2143,prof
9,mining.engineers,14.64,11023,0.94,68.8,2153,prof


In [7]:
# Jobs with more than 12 years of education:

jobs_more_twelveyrs = prestige_data[prestige_data.Education > 12]

jobs_more_twelveyrs.head(12)

Unnamed: 0,Job Worker,Education,Income,Women,Prestige,Census,Type
0,gov.administrators,13.11,12351,11.16,68.8,1113,prof
1,general.managers,12.26,25879,4.02,69.1,1130,prof
2,accountants,12.77,9271,15.7,63.4,1171,prof
4,chemists,14.62,8403,11.68,73.5,2111,prof
5,physicists,15.64,11030,5.13,77.6,2113,prof
6,biologists,15.09,8258,25.65,72.6,2133,prof
7,architects,15.44,14163,2.69,78.1,2141,prof
8,civil.engineers,14.52,11377,1.03,73.1,2143,prof
9,mining.engineers,14.64,11023,0.94,68.8,2153,prof
10,surveyors,12.39,5902,1.91,62.0,2161,prof


In [8]:
# Job With Least Amount of Educational Years

prestige_data[prestige_data['Education'] == prestige_data['Education'].min()]

Unnamed: 0,Job Worker,Education,Income,Women,Prestige,Census,Type
83,sewing.mach.operators,6.38,2847,90.67,28.2,8563,bc


In [9]:
# Job With Highest Amount of Educational Years

prestige_data[prestige_data['Education'] == prestige_data['Education'].max()]

Unnamed: 0,Job Worker,Education,Income,Women,Prestige,Census,Type
20,university.teachers,15.97,12480,19.59,84.6,2711,prof
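Another way to find the extreme rows is `idxmin()` / `idxmax()`, which return the index label of the smallest or largest value. The sketch below uses a small hand-built `sample` frame (values and index labels copied from the outputs above) rather than the full dataset.

```python
import pandas as pd

# Hand-built sample; index labels match the row numbers shown in the outputs.
sample = pd.DataFrame(
    {
        "Job Worker": ["sewing.mach.operators", "masons", "physicians", "university.teachers"],
        "Education": [6.38, 6.6, 15.96, 15.97],
    },
    index=[83, 91, 23, 20],
)

# idxmin/idxmax give the index label; .loc then fetches that single row.
lowest = sample.loc[sample["Education"].idxmin()]
highest = sample.loc[sample["Education"].idxmax()]

print(lowest["Job Worker"], highest["Job Worker"])
```

One difference worth keeping in mind: when several rows tie for the minimum or maximum, `idxmin()` / `idxmax()` return only the first label, while the equality filter used above returns every tied row.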


In [10]:
# Top Five Jobs In Terms Of Amount of Educational Years
# Use .nlargest(<num>, 'column_name')

prestige_data.nlargest(5, 'Education')

Unnamed: 0,Job Worker,Education,Income,Women,Prestige,Census,Type
20,university.teachers,15.97,12480,19.59,84.6,2711,prof
23,physicians,15.96,25308,10.56,87.2,3111,prof
24,veterinarians,15.94,14558,4.32,66.7,3115,prof
16,lawyers,15.77,19263,5.13,82.3,2343,prof
5,physicists,15.64,11030,5.13,77.6,2113,prof


In [11]:
# Five Jobs With The Fewest Educational Years
# Use .nsmallest(<num>, 'column_name')

prestige_data.nsmallest(5, 'Education')

Unnamed: 0,Job Worker,Education,Income,Women,Prestige,Census,Type
83,sewing.mach.operators,6.38,2847,90.67,28.2,8563,bc
91,masons,6.6,5959,0.52,36.2,8782,bc
86,railway.sectionmen,6.67,4696,0.0,27.3,8715,bc
73,textile.weavers,6.69,4443,31.36,33.3,8267,bc
74,textile.labourers,6.74,3485,39.48,28.8,8278,bc


## 2) Jobs By Income

Filtering methods in pandas can also be used to extract rows with certain incomes. I use the `.nlargest()` and `.nsmallest()` methods to obtain the top five jobs and bottom five jobs by income.

In [12]:
# Top Five Jobs In Terms Of Income
# Use .nlargest(<num>, 'column_name')

prestige_data.nlargest(5, 'Income')

Unnamed: 0,Job Worker,Education,Income,Women,Prestige,Census,Type
1,general.managers,12.26,25879,4.02,69.1,1130,prof
23,physicians,15.96,25308,10.56,87.2,3111,prof
16,lawyers,15.77,19263,5.13,82.3,2343,prof
25,osteopaths.chiropractors,14.71,17498,6.91,68.4,3117,prof
24,veterinarians,15.94,14558,4.32,66.7,3115,prof


In [13]:
# Bottom Five Jobs In Terms Of Income
# Use .nsmallest(<num>, 'column_name')

prestige_data.nsmallest(5, 'Income')

Unnamed: 0,Job Worker,Education,Income,Women,Prestige,Census,Type
62,babysitters,9.46,611,96.53,25.9,6147,
52,newsboys,9.62,918,7.0,14.8,5143,
67,farm.workers,8.6,1656,27.75,21.5,7182,bc
72,canners,7.42,1890,72.24,23.2,8221,bc
53,service.station.attendant,9.93,2370,3.69,23.3,5145,bc
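For intuition, `.nlargest(n, col)` selects the same rows as sorting by `col` in descending order and taking the first `n`. A small sketch on a hand-built `sample` frame (incomes copied from the outputs above) compares the two; it is an illustration, not a claim about how pandas implements either call.

```python
import pandas as pd

# Hand-built sample; Income values are copied from the outputs above.
sample = pd.DataFrame(
    {
        "Job Worker": ["general.managers", "babysitters", "lawyers", "newsboys", "physicians"],
        "Income": [25879, 611, 19263, 918, 25308],
    },
    index=[1, 62, 16, 52, 23],
)

# Top three by income, two ways.
top3 = sample.nlargest(3, "Income")
top3_sorted = sample.sort_values("Income", ascending=False).head(3)

# Bottom two by income.
bottom2 = sample.nsmallest(2, "Income")

print(top3)
print(bottom2)
```

With no tied incomes in the sample, both approaches return the same rows in the same order. When values tie, the `keep` argument of `.nlargest()` / `.nsmallest()` controls which of the tied rows are kept.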


## 3) Jobs By Prestige

The Prestige column holds a numeric score for each occupation: the Pineo-Porter prestige score, taken from a social survey conducted in the mid-1960s.

In [14]:
prestige_data.tail(10)

Unnamed: 0,Job Worker,Education,Income,Women,Prestige,Census,Type
92,house.painters,7.81,4549,2.46,29.9,8785,bc
93,plumbers,8.33,6928,0.61,42.9,8791,bc
94,construction.labourers,7.52,3910,1.09,26.5,8798,bc
95,pilots,12.27,14032,0.58,66.1,9111,prof
96,train.engineers,8.49,8845,0.0,48.9,9131,bc
97,bus.drivers,7.58,5562,9.47,35.9,9171,bc
98,taxi.drivers,7.93,4224,3.59,25.1,9173,bc
99,longshoremen,8.37,4753,0.0,26.1,9313,bc
100,typesetters,10.0,6462,13.58,42.2,9511,bc
101,bookbinders,8.55,3617,70.87,35.2,9517,bc


**Sorting Jobs By Prestige**

In [15]:
# Sort Jobs By Prestige From Highest To Lowest:
# Take Top 15 Prestigious Jobs

prestige_data.sort_values(by = ['Prestige'], ascending = False).head(15)

Unnamed: 0,Job Worker,Education,Income,Women,Prestige,Census,Type
23,physicians,15.96,25308,10.56,87.2,3111,prof
20,university.teachers,15.97,12480,19.59,84.6,2711,prof
16,lawyers,15.77,19263,5.13,82.3,2343,prof
7,architects,15.44,14163,2.69,78.1,2141,prof
5,physicists,15.64,11030,5.13,77.6,2113,prof
14,psychologists,14.36,7405,48.28,74.9,2315,prof
4,chemists,14.62,8403,11.68,73.5,2111,prof
8,civil.engineers,14.52,11377,1.03,73.1,2143,prof
19,ministers,14.5,4686,4.14,72.8,2511,prof
6,biologists,15.09,8258,25.65,72.6,2133,prof


In [16]:
# Sort Jobs By Prestige From Lowest To Highest
# Take The 15 Least Prestigious Jobs

prestige_data.sort_values(by = ['Prestige'], ascending = True).head(15)

Unnamed: 0,Job Worker,Education,Income,Women,Prestige,Census,Type
52,newsboys,9.62,918,7.0,14.8,5143,
64,janitors,7.11,3472,33.57,17.3,6191,bc
65,elevator.operators,7.58,3582,30.08,20.1,6193,bc
60,bartenders,8.5,3930,15.51,20.2,6123,bc
63,launderers,7.33,3000,69.31,20.8,6162,bc
67,farm.workers,8.6,1656,27.75,21.5,7182,bc
72,canners,7.42,1890,72.24,23.2,8221,bc
53,service.station.attendant,9.93,2370,3.69,23.3,5145,bc
98,taxi.drivers,7.93,4224,3.59,25.1,9173,bc
70,slaughterers.1,7.64,5134,17.26,25.2,8215,bc


## 4) Jobs By Type

The Type column of the dataset is a factor (a categorical variable) with three levels. These three levels are:

* bc for Blue Collar
* prof for Professional, Managerial and Technical
* wc for White Collar Jobs

You can extract the jobs of a given type using pandas subsetting. For example, `prestige_data[prestige_data.Type == 'bc']` selects the occupations that are blue collar.


In [17]:
'''
type: Type of occupation. A factor with levels (note: out of order): bc, Blue Collar; prof, 
      Professional, Managerial, and Technical; wc, White Collar.
'''

# Some Blue collar jobs:

prestige_data[prestige_data.Type == 'bc'].head()

Unnamed: 0,Job Worker,Education,Income,Women,Prestige,Census,Type
27,nursing.aides,9.45,3485,76.14,34.9,3135,bc
53,service.station.attendant,9.93,2370,3.69,23.3,5145,bc
57,firefighters,9.47,8895,0.0,43.5,6111,bc
58,policemen,10.93,8891,1.65,51.6,6112,bc
59,cooks,7.74,3116,52.0,29.7,6121,bc
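The same subsetting idea extends to several types at once with `.isin()`, and to rows that have no type with `.isna()`. The sketch below uses a hand-built `sample` frame; it assumes that the blank Type shown for rows such as newsboys and babysitters in earlier outputs is a missing value, represented here as `None`.

```python
import pandas as pd

# Hand-built sample; None stands in for the blank Type seen in earlier outputs
# (assumption: those blanks are missing values in the loaded frame).
sample = pd.DataFrame({
    "Job Worker": ["gov.administrators", "nursing.aides", "newsboys", "babysitters", "cooks"],
    "Type": ["prof", "bc", None, None, "bc"],
})

# Keep rows whose Type is any of the listed levels.
blue_or_prof = sample[sample["Type"].isin(["bc", "prof"])]

# Keep rows whose Type is missing; `== None` would not match these rows.
untyped = sample[sample["Type"].isna()]

print(blue_or_prof)
print(untyped)
```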


**Pandas Groupby Method**

The `.groupby()` method from pandas is very useful in obtaining aggregate information for groups.

In [18]:
# Counting The Number Of Occupations For Each Occupation Type:

prestige_data.groupby(['Type']).size()

Type
bc      44
prof    31
wc      23
dtype: int64

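The dataset contains 44 blue collar jobs, 31 professional jobs and 23 white collar jobs. That accounts for 98 of the 102 rows; the four occupations whose Type is missing are left out of these counts.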
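`.groupby()` can do more than count. As a rough sketch on a hand-built `sample` frame (values copied from earlier outputs, with `None` assumed to stand for a missing Type), the code below counts every group including the missing one by passing `dropna=False`, and computes the mean income per type.

```python
import pandas as pd

# Hand-built sample; Income and Type values are copied from earlier outputs.
sample = pd.DataFrame({
    "Job Worker": ["cooks", "janitors", "physicians", "lawyers", "newsboys"],
    "Income": [3116, 3472, 25308, 19263, 918],
    "Type": ["bc", "bc", "prof", "prof", None],  # None: assumed missing Type
})

# By default groupby drops rows whose key is missing; dropna=False keeps them
# as their own group (available in pandas 1.1 and later).
counts = sample.groupby("Type", dropna=False).size()

# Aggregate a numeric column per group; rows with a missing Type are skipped.
mean_income = sample.groupby("Type")["Income"].mean()

print(counts)
print(mean_income)
```

Swapping `.mean()` for `.median()`, `.min()` or `.max()` gives other per-type summaries.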