# Bayesian Statistics

This notebook explores how to query a dataset for an occupation and count the matching records.

Our references are:
* [Pandas](https://pandas.pydata.org/), version 2.2.0, date: 20 Jan, 2024
* [Python](https://www.python.org/), version 3.12.1
* The University of Chicago's NORC dataset entitled [The General Social Survey](https://gss.norc.org/)
* [Think Bayes: Bayesian Statistics in Python, 2nd Edition - Allen B. Downey - Github](http://allendowney.github.io/ThinkBayes2/index.html)
* [US Census Bureau: Data & Maps](https://www.census.gov/data.html)
* [US Bureau of Labor Statistics: SOC (Standard Occupational Classification)](https://www.bls.gov/soc/)

We can upload the data files (CSV files) that we want to work with by dragging them from our local drive onto the folder icon on the left side of the Google Colab screen:
* Download the GSS data to your desktop
* Click the folder icon on the left side of the Colab screen, roughly at mid-height
* Drag and drop the gss_bayes.csv file from your desktop into the folder pane to upload it to Google Colab
* You will need to repeat this upload every time you use the notebook, because files uploaded this way do not persist between Colab sessions
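
Because the upload has to be repeated each session, it can help to confirm that the file is present before trying to load it. The following is a minimal sketch, not part of the original notebook: the helper name `dataset_is_uploaded` and the constant `DATA_FILE` are hypothetical, and the check assumes the file was dropped into the notebook's current working directory.

```python
from pathlib import Path

# File name taken from the upload steps above
DATA_FILE = "gss_bayes.csv"


def dataset_is_uploaded(filename: str = DATA_FILE, folder: str = ".") -> bool:
    """Return True if `filename` exists as a regular file inside `folder`.

    Hypothetical helper for illustration; `folder` defaults to the
    current working directory, which is an assumption about where the
    file was uploaded.
    """
    return (Path(folder) / filename).is_file()


if dataset_is_uploaded():
    print(f"Found {DATA_FILE}; it is ready to load with pd.read_csv().")
else:
    print(f"{DATA_FILE} not found; upload it before running the next cell.")
```

Running this check first gives a clearer message than the `FileNotFoundError` that `pd.read_csv` raises when the file is missing.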

This script will accomplish the following tasks:

* Import the Pandas library
* Import the GSS Bayes dataset into Pandas
* Display the first and last five lines of the GSS Bayes dataset


In [6]:
# Import Pandas Library
import pandas as pd

# Load the dataset into Pandas
gss = pd.read_csv('gss_bayes.csv')

# Display the first and last five rows of the dataset
gss

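| | caseid | year | age | sex | polviews | partyid | indus10 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1974 | 21.0 | 1 | 4.0 | 2.0 | 4970.0 |
| 1 | 2 | 1974 | 41.0 | 1 | 5.0 | 0.0 | 9160.0 |
| 2 | 5 | 1974 | 58.0 | 2 | 6.0 | 1.0 | 2670.0 |
| 3 | 6 | 1974 | 30.0 | 1 | 5.0 | 4.0 | 6870.0 |
| 4 | 7 | 1974 | 48.0 | 1 | 5.0 | 4.0 | 7860.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 49285 | 2863 | 2016 | 57.0 | 2 | 1.0 | 0.0 | 7490.0 |
| 49286 | 2864 | 2016 | 77.0 | 1 | 6.0 | 7.0 | 3590.0 |
| 49287 | 2865 | 2016 | 87.0 | 2 | 4.0 | 5.0 | 770.0 |
| 49288 | 2866 | 2016 | 55.0 | 2 | 5.0 | 5.0 | 8680.0 |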
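
The preview above relies on the notebook truncating a long DataFrame automatically. As a minimal sketch, the same first-and-last view can be requested explicitly with `head()` and `tail()`. The frame below is an assumption made for illustration: it holds only the nine rows visible in the output above and a subset of their columns, not the full dataset, and the name `sample` is hypothetical.

```python
import pandas as pd

# Rows copied from the output above (subset of columns); NOT the full dataset
sample = pd.DataFrame(
    {
        "caseid": [1, 2, 5, 6, 7, 2863, 2864, 2865, 2866],
        "year": [1974, 1974, 1974, 1974, 1974, 2016, 2016, 2016, 2016],
        "age": [21.0, 41.0, 58.0, 30.0, 48.0, 57.0, 77.0, 87.0, 55.0],
        "indus10": [4970.0, 9160.0, 2670.0, 6870.0, 7860.0,
                    7490.0, 3590.0, 770.0, 8680.0],
    },
    index=[0, 1, 2, 3, 4, 49285, 49286, 49287, 49288],
)

# head(n) returns the first n rows, tail(n) the last n rows
first_rows = sample.head(2)
last_rows = sample.tail(2)

# Stack the two slices to build a first-and-last preview by hand
preview = pd.concat([first_rows, last_rows])
print(preview)

# shape reports (number of rows, number of columns)
print(sample.shape)
```

On the real `gss` DataFrame, `gss.head()` and `gss.tail()` default to five rows each.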


This script provides descriptive statistics for the GSS Bayes dataset.

In [7]:
gss.describe()

| | caseid | year | age | sex | polviews | partyid | indus10 |
|---|---|---|---|---|---|---|---|
| count | 49290.0 | 49290.0 | 49290.0 | 49290.0 | 49290.0 | 49290.0 | 49290.0 |
| mean | 1167.140901 | 1995.36405 | 46.143132 | 1.537858 | 4.105052 | 2.753905 | 5993.666504 |
| std | 848.141764 | 12.336592 | 17.11142 | 0.49857 | 1.37716 | 2.048108 | 2796.295069 |
| min | 1.0 | 1974.0 | 18.0 | 1.0 | 1.0 | 0.0 | 170.0 |
| 25% | 507.0 | 1985.0 | 32.0 | 1.0 | 3.0 | 1.0 | 3890.0 |
| 50% | 1035.0 | 1996.0 | 44.0 | 2.0 | 4.0 | 3.0 | 6990.0 |
| 75% | 1592.0 | 2006.0 | 59.0 | 2.0 | 5.0 | 5.0 | 8190.0 |
| max | 4510.0 | 2016.0 | 89.0 | 2.0 | 7.0 | 7.0 | 9870.0 |
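
The `count` row is worth a second look: it reports the number of non-missing values per column, so identical counts across all columns (49290 here) indicate that no column has gaps. The sketch below illustrates that reading on a toy frame. The frame `toy` is hypothetical: its values are taken from the first rows of the dataset, except that one age has been deliberately blanked out to show how the count changes.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only: one age value is deliberately set to NaN
toy = pd.DataFrame(
    {
        "age": [21.0, np.nan, 58.0],
        "polviews": [4.0, 5.0, 6.0],
    }
)

# describe() counts only non-missing values in each column
counts = toy.describe().loc["count"]
print(counts)

# isna().sum() gives the complementary view: missing values per column
missing = toy.isna().sum()
print(missing)
```

Here `age` has a count of 2 while `polviews` has a count of 3, which is the kind of mismatch that would signal missing data.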


Background on occupation data:

* The U.S. Census Bureau currently collects data on industry, occupation, and class of worker for Americans in the labor force in several surveys. The largest are the ACS (American Community Survey), the CPS (Current Population Survey), and the SIPP (Survey of Income and Program Participation).

* The 2018 Standard Occupational Classification (SOC) system is a federal statistical standard used by federal agencies to classify workers into occupational categories for the purpose of collecting, calculating, or disseminating data. All workers are classified into one of 867 detailed occupations according to their occupational definition. To facilitate classification, detailed occupations are combined to form 459 broad occupations, 98 minor groups, and 23 major groups. Detailed occupations in the SOC with similar job duties, and in some cases skills, education, and/or training, are grouped together.

* The 2022 GSS Cross-section data, featuring a new multi-mode design, were collected between May and December of 2022. They include repeated measures and new data related to health, marriage and family, and work conditions, as well as several new experiments in data collection and survey design.

This script will:
* Search the `indus10` column of the GSS Bayes dataset for the code 6870, which represents "Banking and related activities"
* Provide a count of the number of 'bankers' in the GSS Bayes dataset

In [10]:
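# Boolean Series: True where the industry code is 6870 (banking)
banker = (gss['indus10'] == 6870)
# Each True counts as 1, so the sum is the number of bankers
banker.sum()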

728
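
The count works because comparing a column to a value produces a Boolean Series, and pandas treats `True` as 1 and `False` as 0 when summing. The following minimal sketch shows the mechanics on the first five rows of the dataset. The names `sample`, `BANKING_CODE`, and `is_banker` are hypothetical, and the frame is a small excerpt rather than the full data.

```python
import pandas as pd

# First five rows from the dataset preview (subset of columns); NOT the full data
sample = pd.DataFrame(
    {
        "caseid": [1, 2, 5, 6, 7],
        "indus10": [4970.0, 9160.0, 2670.0, 6870.0, 7860.0],
    }
)

BANKING_CODE = 6870  # "Banking and related activities"

# Element-wise comparison yields a Boolean Series
is_banker = sample["indus10"] == BANKING_CODE

# True counts as 1, so sum() is the number of matches
n_bankers = int(is_banker.sum())

# mean() of a Boolean Series is the fraction of rows that match
fraction = float(is_banker.mean())

# The same mask can select the matching rows
banker_rows = sample[is_banker]

print(n_bankers, fraction)
print(banker_rows)
```

Applied to the full dataset, the same `mean()` call corresponds to 728 / 49290, or roughly 1.5% of respondents.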