## Exercise
We will be working with the same dataset and questions we dealt with in the previous lab, so that you have some reference to work on (and check if your results match).

We will use the [Adult UCI dataset](https://archive.ics.uci.edu/ml/datasets/adult) with a few modifications; download the following files: [description](./files/adults/adults.names), [data - part1](./files/adults/adults1.csv), and [data - part2](./files/adults/adults2.csv).

Follow the instructions below and answer the questions. 

Filtering and joins should be done in DAX, not in the UI. You should submit the PowerBI file plus a Jupyter Notebook including the DAX code for each question and a snapshot of the visualization. 

Steps 1-2 are the same instructions you followed for the Postgres lab; they are reproduced here for your convenience.

1. Create the two tables in the DB with the right data types. To use enumerated types in a table definition, you first need to create the type. For example:

```
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE person (
    name text,
    current_mood mood
);
INSERT INTO person VALUES ('Moe', 'happy');
SELECT * FROM person WHERE current_mood = 'happy';
 name | current_mood 
------+--------------
 Moe  | happy
```

Notice that each row/sample does not have an id. Instead of using an INTEGER data type for the id, we recommend you look into SERIAL.

2. Load the datasets. We should use this version of the copy_from command where we specify the name of the columns and the value for NULL in the file of origin. Notice how we are not specifying the id column, since that will be autogenerated by the DB. Caveat: once you have loaded the data, double check that the id/SERIAL column starts at 1. 
```
cursor.copy_from(f, 'adults1', columns=('age', 'workclass', ...), sep=',', null='?')
```
3. How many people under 18 years old have never worked? Of the never having worked people (all ages) is there any race bias (how many by race)? Is there any sex bias?
4. Look at the hours per week of people with a paying job, by sex. Look at how many people have an income above and below 50k. Compare and analyse.
5. How many people with college education do manual labour?
6. What is the minimum, mean and maximum capital gain and capital loss for every marital status?

## 1. Create the two tables in the DB with the right data types.

First we connect to the database. From lab 1 we already know the contents of the two files and the type each column should have, so we only need to create the two tables with the right types. Each table also needs an id column, for which we use SERIAL.

```python
import psycopg2

try:
    conn = psycopg2.connect("dbname='postgres' user='postgres' host='localhost' password='pass1234'")
    print('Success connecting to the DB')
    cursor = conn.cursor()
except psycopg2.Error as e:
    # catch only database errors instead of using a bare except
    print('I am unable to connect to the database')
    print(e)
```

```
Success connecting to the DB
```


```python
try:
    # enumerated types and table for the first file (adults1.csv)
    cursor.execute("""  CREATE TYPE workclass_type AS ENUM ('Private', 'Self-emp-not-inc', 'Self-emp-inc', 'Federal-gov', 'Local-gov', 'State-gov', 'Without-pay', 'Never-worked');
                        CREATE TYPE education_type as ENUM('Bachelors', 'Some-college', '11th', 'HS-grad', 'Prof-school', 'Assoc-acdm', 'Assoc-voc', '9th', '7th-8th', '12th', 'Masters', '1st-4th', '10th', 'Doctorate', '5th-6th', 'Preschool');
                        CREATE TYPE "marital-status_type" as ENUM('Married-civ-spouse', 'Divorced', 'Never-married', 'Separated', 'Widowed', 'Married-spouse-absent', 'Married-AF-spouse');
                        CREATE TYPE occupation_type as ENUM('Tech-support', 'Craft-repair', 'Other-service', 'Sales', 'Exec-managerial', 'Prof-specialty', 'Handlers-cleaners', 'Machine-op-inspct', 'Adm-clerical', 'Farming-fishing', 'Transport-moving', 'Priv-house-serv', 'Protective-serv', 'Armed-Forces');
                        CREATE TABLE adults1 (
                            id SERIAL PRIMARY KEY,
                            age smallint,
                            workclass workclass_type,
                            fnlwgt bigint,
                            education education_type,
                            "education-num" bigint,
                            "marital-status" "marital-status_type",
                            occupation occupation_type
                        );
                        """)

    # enumerated types and table for the second file (adults2.csv)
    cursor.execute("""  CREATE TYPE relationship_type AS ENUM ('Wife', 'Own-child', 'Husband', 'Not-in-family', 'Other-relative', 'Unmarried');
                        CREATE TYPE race_type as ENUM('White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other', 'Black');
                        CREATE TYPE sex_type as ENUM('Female', 'Male');
                        CREATE TYPE "native-country_type" as ENUM('United-States', 'Cambodia', 'England', 'Puerto-Rico', 'Canada', 'Germany', 'Outlying-US(Guam-USVI-etc)', 'India', 'Japan', 'Greece', 'South', 'China', 'Cuba', 'Iran', 'Honduras', 'Philippines', 'Italy', 'Poland', 'Jamaica', 'Vietnam', 'Mexico', 'Portugal', 'Ireland', 'France', 'Dominican-Republic', 'Laos', 'Ecuador', 'Taiwan', 'Haiti', 'Columbia', 'Hungary', 'Guatemala', 'Nicaragua', 'Scotland', 'Thailand', 'Yugoslavia', 'El-Salvador', 'Trinadad&Tobago', 'Peru', 'Hong', 'Holand-Netherlands');
                        CREATE TYPE income_type as ENUM('>50K', '<=50K');
                        CREATE TABLE adults2 (
                            id SERIAL PRIMARY KEY,
                            relationship relationship_type,
                            race race_type,
                            sex sex_type,
                            "capital-gain" bigint,
                            "capital-loss" bigint,
                            "hours-per-week" smallint,
                            "native-country" "native-country_type",
                            income income_type
                        );
                        """)

    conn.commit()
except Exception as e:
    # if the transaction aborts we need to roll back before issuing new commands
    conn.rollback()
    print(e)
```

## 2. Load the datasets. Once you have loaded the data, double check that the id/SERIAL column starts at 1. 

We simply open each CSV file and load its data into the corresponding table (`adults1.csv` into `adults1`, `adults2.csv` into `adults2`).

```python
try:
    # the id column is not listed: it is generated by the DB (SERIAL)
    with open('./files/adults/adults1.csv', 'r') as f:
        cursor.copy_from(f, 'adults1', columns=('age','workclass','fnlwgt','education','education-num','marital-status','occupation'), sep=',', null='?')

    with open('./files/adults/adults2.csv', 'r') as g:
        cursor.copy_from(g, 'adults2', columns=('relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country','income'), sep=',', null='?')

    conn.commit()
except Exception as e:
    # if the transaction aborts we need to roll back before issuing new commands
    conn.rollback()
    print(e)
```

Now let's check the serial column. Let's start by getting the serial sequence.

```python
try:
    # name of the sequence that backs each SERIAL id column
    cursor.execute("""  SELECT pg_get_serial_sequence('adults1', 'id'), pg_get_serial_sequence('adults2', 'id');""")
    rows = cursor.fetchall()
    print(rows)
except Exception as e:
    # if the transaction aborts we need to roll back before issuing new commands
    conn.rollback()
    print(e)
```

```
[('public.adults1_id_seq', 'public.adults2_id_seq')]
```


Once we have the sequence names, we can check the start value of each sequence:

```python
try:
    # start_value is the value each sequence was configured to begin at
    cursor.execute("""  SELECT start_value FROM pg_sequences WHERE schemaname = 'public' AND sequencename IN ('adults1_id_seq','adults2_id_seq');""")
    rows = cursor.fetchall()
    print(rows)
except Exception as e:
    # if the transaction aborts we need to roll back before issuing new commands
    conn.rollback()
    print(e)
```

```
[(1,), (1,)]
```
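
`start_value` describes how each sequence was configured, not which ids were actually stored (a load that failed and was retried can, for example, leave the ids starting above 1). A complementary check is to look at the ids in the tables themselves. The sketch below is only an illustration: the helper names `id_summary` and `ids_are_contiguous_from_one` are hypothetical, and the demo runs against an in-memory SQLite table with made-up rows so that it works without a server. With the real database, the psycopg2 `cursor` from above could be passed instead.

```python
import sqlite3


def id_summary(cursor, table):
    """Return (min_id, max_id, row_count) for a table with an integer id column."""
    # Table names cannot be passed as query parameters, so accept only known names.
    if table not in ("adults1", "adults2"):
        raise ValueError(f"unexpected table name: {table}")
    cursor.execute(f"SELECT MIN(id), MAX(id), COUNT(*) FROM {table}")
    return tuple(cursor.fetchone())


def ids_are_contiguous_from_one(summary):
    """True when the ids run 1..N without gaps, given (min_id, max_id, row_count)."""
    min_id, max_id, row_count = summary
    return min_id == 1 and max_id == row_count


# Stand-in demo (assumption): SQLite instead of Postgres, three made-up rows.
demo = sqlite3.connect(":memory:")
demo_cur = demo.cursor()
demo_cur.execute("CREATE TABLE adults1 (id INTEGER PRIMARY KEY AUTOINCREMENT, age INTEGER)")
demo_cur.executemany("INSERT INTO adults1 (age) VALUES (?)", [(39,), (50,), (38,)])

summary = id_summary(demo_cur, "adults1")
print(summary, ids_are_contiguous_from_one(summary))
```

Since the questions below join `adults1` and `adults2` on `id`, it is also worth confirming that both tables return the same summary.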


## 3. How many people under 18 years old have never worked? Of the never having worked people (all ages) is there any race bias (how many by race)? Is there any sex bias? - Note: no need to elaborate, just present the data in tables. 

First we count how many people under 18 years old have never worked, with the following DAX code:

![image.png](attachment:image.png)

There is only one person under 18 who has never worked.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In the never-worked group, the ratio of White to Black individuals is 2.5, whereas in the full dataset it is 8.9. The counts are too small to confirm a bias, but the difference is a hint that warrants further investigation, ideally on other datasets.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

We cannot conclude that there is sex bias, since the proportions are similar and, most importantly, the number of samples is too small.
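
To compare these figures with the Postgres lab, the same counts can be expressed in SQL. This is a sketch rather than the DAX used for the visuals: the function names are hypothetical, and the demo uses an in-memory SQLite database with made-up rows as a stand-in. The query strings are written so that they should also run through the psycopg2 `cursor` against the real tables.

```python
import sqlite3

UNDER_18_NEVER_WORKED = """
    SELECT COUNT(*)
    FROM adults1
    WHERE age < 18 AND workclass = 'Never-worked'
"""

# race and sex are stored in adults2, so the breakdowns need a join on id.
NEVER_WORKED_BY = """
    SELECT a2.{column}, COUNT(*)
    FROM adults1 AS a1
    JOIN adults2 AS a2 ON a1.id = a2.id
    WHERE a1.workclass = 'Never-worked'
    GROUP BY a2.{column}
"""


def count_under_18_never_worked(cursor):
    cursor.execute(UNDER_18_NEVER_WORKED)
    return cursor.fetchone()[0]


def never_worked_by(cursor, column):
    """Return {category: count} for the never-worked group, by race or sex."""
    if column not in ("race", "sex"):
        raise ValueError(f"unexpected column: {column}")
    cursor.execute(NEVER_WORKED_BY.format(column=column))
    return dict(cursor.fetchall())


# Stand-in demo (assumption): SQLite with four made-up people.
demo = sqlite3.connect(":memory:")
cur = demo.cursor()
cur.execute("CREATE TABLE adults1 (id INTEGER PRIMARY KEY, age INTEGER, workclass TEXT)")
cur.execute("CREATE TABLE adults2 (id INTEGER PRIMARY KEY, race TEXT, sex TEXT)")
cur.executemany("INSERT INTO adults1 VALUES (?, ?, ?)", [
    (1, 17, "Never-worked"), (2, 20, "Never-worked"),
    (3, 30, "Private"), (4, 18, "Never-worked"),
])
cur.executemany("INSERT INTO adults2 VALUES (?, ?, ?)", [
    (1, "White", "Male"), (2, "Black", "Female"),
    (3, "White", "Male"), (4, "White", "Female"),
])

under_18 = count_under_18_never_worked(cur)
by_race = never_worked_by(cur, "race")
by_sex = never_worked_by(cur, "sex")
print(under_18, by_race, by_sex)
```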

## 4. Look at the hours per week of people with a paying job, by sex. Look at how many people have an income above and below 50k. Compare and analyse.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

When examining the hours worked per week among people with a paying job, there is a clear difference between men and women. On average, men work about 42.9 hours per week, while women work around 37 hours, meaning that men work roughly 6 hours more per week on average.

When we analyze income distribution, the gender gap becomes even more evident. Among people with paying jobs:

- Only 11.4% of women earn more than 50K, while 88.6% earn 50K or less.
- In contrast, 31.4% of men earn more than 50K, and 68.6% earn 50K or less.

There is also a positive correlation between hours worked and higher income: both men and women who earn more than 50K work longer hours on average (about 46.6 hours for men and 40.8 hours for women) compared to those earning 50K or less.

Overall, these results highlight a double disparity: men tend to work longer hours and are also much more likely to hold higher-paying positions, which together contribute to a significant gender income gap among people with paying jobs.
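
The averages and shares above can be cross-checked with a grouped SQL query. The sketch below makes one assumption that the text does not spell out: "people with a paying job" is taken to mean a workclass other than `Never-worked` or `Without-pay` (rows with a missing workclass are dropped by the `NOT IN` comparison). The function names are hypothetical, and the demo data is made up and loaded into an in-memory SQLite database as a stand-in for Postgres.

```python
import sqlite3

HOURS_BY_SEX_AND_INCOME = """
    SELECT a2.sex, a2.income, COUNT(*), AVG(a2."hours-per-week")
    FROM adults1 AS a1
    JOIN adults2 AS a2 ON a1.id = a2.id
    WHERE a1.workclass NOT IN ('Never-worked', 'Without-pay')
    GROUP BY a2.sex, a2.income
"""


def hours_and_income_by_sex(cursor):
    """Return {(sex, income): {"count": n, "avg_hours": mean}} for paid workers."""
    cursor.execute(HOURS_BY_SEX_AND_INCOME)
    return {
        (sex, income): {"count": n, "avg_hours": float(avg_hours)}
        for sex, income, n, avg_hours in cursor.fetchall()
    }


def share_above_50k(stats, sex):
    """Fraction of paid workers of the given sex whose income is '>50K'."""
    above = stats.get((sex, ">50K"), {"count": 0})["count"]
    below = stats.get((sex, "<=50K"), {"count": 0})["count"]
    total = above + below
    return above / total if total else None


# Stand-in demo (assumption): SQLite with seven made-up people.
demo = sqlite3.connect(":memory:")
cur = demo.cursor()
cur.execute("CREATE TABLE adults1 (id INTEGER PRIMARY KEY, workclass TEXT)")
cur.execute('CREATE TABLE adults2 (id INTEGER PRIMARY KEY, sex TEXT, '
            '"hours-per-week" INTEGER, income TEXT)')
cur.executemany("INSERT INTO adults1 VALUES (?, ?)", [
    (1, "Private"), (2, "Private"), (3, "Private"), (4, "Self-emp-inc"),
    (5, "Without-pay"), (6, None), (7, "Private"),
])
cur.executemany("INSERT INTO adults2 VALUES (?, ?, ?, ?)", [
    (1, "Male", 40, "<=50K"), (2, "Male", 50, ">50K"),
    (3, "Female", 30, "<=50K"), (4, "Female", 40, ">50K"),
    (5, "Female", 20, "<=50K"), (6, "Male", 60, ">50K"),
    (7, "Female", 35, "<=50K"),
])

stats = hours_and_income_by_sex(cur)
print(stats)
print(share_above_50k(stats, "Female"), share_above_50k(stats, "Male"))
```

Grouping by both sex and income returns the per-group average hours and the counts needed for the income shares in a single pass.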

## 5. How many people with college education do manual labour?

![image.png](attachment:image.png)

The column education contains the categorical values that describe a person’s level of studies, while the column education-num provides a corresponding numerical scale, where 1 represents the most basic level of education and 16 corresponds to the highest, a doctorate. Based on this information, we can define college education as those cases where education-num >= 10.

4722 people with some college education do manual jobs.
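
The filter described above can be sketched in SQL as follows. The threshold `"education-num" >= 10` comes from the text, but the list of occupations counted as manual labour is an assumption made for this example, so the count it produces on the real data may differ from the figure reported here. The demo rows are made up and stored in an in-memory SQLite table as a stand-in for Postgres.

```python
import sqlite3

# Assumption: which occupations count as "manual" is a judgment call.
COLLEGE_MANUAL_LABOUR = """
    SELECT COUNT(*)
    FROM adults1
    WHERE "education-num" >= 10
      AND occupation IN ('Craft-repair', 'Handlers-cleaners', 'Machine-op-inspct',
                         'Farming-fishing', 'Transport-moving')
"""


def count_college_manual_labour(cursor):
    cursor.execute(COLLEGE_MANUAL_LABOUR)
    return cursor.fetchone()[0]


# Stand-in demo (assumption): SQLite with five made-up people.
demo = sqlite3.connect(":memory:")
cur = demo.cursor()
cur.execute('CREATE TABLE adults1 (id INTEGER PRIMARY KEY, '
            '"education-num" INTEGER, occupation TEXT)')
cur.executemany("INSERT INTO adults1 VALUES (?, ?, ?)", [
    (1, 13, "Craft-repair"),      # college, manual -> counted
    (2, 9, "Craft-repair"),       # below the education threshold
    (3, 10, "Transport-moving"),  # college, manual -> counted
    (4, 14, "Exec-managerial"),   # college, not manual
    (5, 10, None),                # occupation missing
])

college_manual = count_college_manual_labour(cur)
print(college_manual)
```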

## 6. What is the minimum, mean and maximum capital gain and capital loss for every marital status?

![image.png](attachment:image.png)

### Capital Gain

We can see that for all marital-status groups, the minimum capital gain is 0, meaning that at least one individual in each group did not receive any gains.

The average capital gain is higher for Married-civ-spouse and Divorced individuals. The higher mean for Married-civ-spouse could be due to greater financial stability in dual-income households, allowing for larger or more frequent investments.

The maximum capital gains are extremely high (99999) for most groups, highlighting the presence of outliers. These extreme values increase the standard deviation, making the spread of the data appear larger. Married-AF-spouse is an exception, with a lower maximum gain (7298) and smaller standard deviation, likely because the sample size for this group is very small.

### Capital Loss

Similarly, the minimum capital loss is 0 for all groups, indicating that at least one individual in each group did not lose money.

The average capital loss is highest among Married-civ-spouse, which may suggest that individuals with higher financial stability are willing to take greater risks, resulting in larger losses. Most other groups have a mean loss around 50–60. The Married-AF-spouse group shows 0 average loss, likely due to the very small number of observations.

The maximum capital loss is around 3900 for most groups, again reflecting the presence of extreme cases. Married-AF-spouse reports no losses, which is probably because the sample size is too small to include any cases of loss.

![image.png](attachment:image.png)

However, the fact that the minimum is always 0 is curious: it may indicate that missing values in the numeric columns are encoded as 0 (even though the dataset description does not say so). One option would be to replace those 0s with `None` and recompute the statistics.
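
One way to try this without modifying the stored data is to turn zeros into NULL inside the query: `NULLIF(x, 0)` returns NULL when `x` is 0, and `MIN`, `AVG` and `MAX` skip NULLs. The sketch below illustrates the idea under stated assumptions: the function name is hypothetical, and the demo rows are made up and held in an in-memory SQLite database as a stand-in for Postgres.

```python
import sqlite3

ALLOWED_COLUMNS = ("capital-gain", "capital-loss")


def nonzero_stats_by_marital_status(cursor, column):
    """Return {marital_status: (min, mean, max)}, treating 0 as missing."""
    # Column names cannot be passed as query parameters, so accept only known names.
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unexpected column: {column}")
    query = f"""
        SELECT a1."marital-status",
               MIN(NULLIF(a2."{column}", 0)),
               AVG(NULLIF(a2."{column}", 0)),
               MAX(NULLIF(a2."{column}", 0))
        FROM adults1 AS a1
        JOIN adults2 AS a2 ON a1.id = a2.id
        GROUP BY a1."marital-status"
    """
    cursor.execute(query)
    return {status: (lo, mean, hi) for status, lo, mean, hi in cursor.fetchall()}


# Stand-in demo (assumption): SQLite with four made-up people.
demo = sqlite3.connect(":memory:")
cur = demo.cursor()
cur.execute('CREATE TABLE adults1 (id INTEGER PRIMARY KEY, "marital-status" TEXT)')
cur.execute('CREATE TABLE adults2 (id INTEGER PRIMARY KEY, '
            '"capital-gain" INTEGER, "capital-loss" INTEGER)')
cur.executemany("INSERT INTO adults1 VALUES (?, ?)", [
    (1, "Divorced"), (2, "Divorced"), (3, "Divorced"), (4, "Widowed"),
])
cur.executemany("INSERT INTO adults2 VALUES (?, ?, ?)", [
    (1, 0, 0), (2, 2000, 0), (3, 4000, 1500), (4, 0, 0),
])

gain_stats = nonzero_stats_by_marital_status(cur, "capital-gain")
loss_stats = nonzero_stats_by_marital_status(cur, "capital-loss")
print(gain_stats)
print(loss_stats)
```

A group whose values are all 0 comes back as `(None, None, None)`, which makes it explicit that no non-zero observations exist for it, as may be the case for capital loss in the Married-AF-spouse group.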