# Exercise

We will work with the same dataset and questions as in the previous lab, so you have a reference to work from and can check whether your results match.

We will use the [Adult UCI dataset](https://archive.ics.uci.edu/ml/datasets/adult) with a few modifications. Download the following files: [description](./files/adults/adults.names), [data - part1](./files/adults/adults1.csv), [data - part2](./files/adults/adults2.csv).

Follow the instructions below and answer the questions. 

1. Create the two tables in the DB with the right data types. To use enumerated types in a table definition, you first need to create the type. For example:

```
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE person (
    name text,
    current_mood mood
);
INSERT INTO person VALUES ('Moe', 'happy');
SELECT * FROM person WHERE current_mood = 'happy';
 name | current_mood 
------+--------------
 Moe  | happy
```

Notice that each row/sample does not have an id. Instead of using an INTEGER data type for the id, we recommend you look into SERIAL.

2. Load the datasets. Use the form of the `copy_from` command shown below, which specifies the column names and the string that represents NULL in the source file. Notice that the id column is not listed, since it will be autogenerated by the DB. Caveat: once you have loaded the data, double check that the id/SERIAL column starts at 1.
```
cursor.copy_from(f, 'adults1', columns=('age', 'workclass', ...), sep=',', null='?')
```
3. How many people under 18 years old have never worked? Among people who have never worked (all ages), is there any race bias (how many by race)? Is there any sex bias? - Note: no need to elaborate, just present the data in tables.
4. Look at the hours per week of people with a paying job, by sex. Look at how many have an income above and below 50k. Compare and analyse. - Note: no need to elaborate, produce a single table that shows the data for both sex and salary.
5. How many people with college education do manual labour?
6. What is the minimum, mean and maximum capital gain and capital loss for every marital status?

## 1. Create the two tables in the DB with the right data types.

First we connect to our database. From lab 1 we already know what the two data files contain and which type each column should have, so we only need to create the two tables with the right types. We also need to give each table an id column.

In [1]:
import psycopg2


try:
    conn = psycopg2.connect("dbname='postgres' user='postgres' host='localhost' password='pass1234'")
    print('Success connecting to the DB')
except psycopg2.Error as e:
    # report why the connection failed instead of hiding the error
    print('I am unable to connect to the database:', e)


Success connecting to the DB


In [2]:
cursor = conn.cursor()
try:
    cursor.execute("""  CREATE TYPE workclass AS ENUM ('Private', 'Self-emp-not-inc', 'Self-emp-inc', 'Federal-gov', 'Local-gov', 'State-gov', 'Without-pay', 'Never-worked');
                        CREATE TYPE education as ENUM('Bachelors', 'Some-college', '11th', 'HS-grad', 'Prof-school', 'Assoc-acdm', 'Assoc-voc', '9th', '7th-8th', '12th', 'Masters', '1st-4th', '10th', 'Doctorate', '5th-6th', 'Preschool');
                        CREATE TYPE "marital-status" as ENUM('Married-civ-spouse', 'Divorced', 'Never-married', 'Separated', 'Widowed', 'Married-spouse-absent', 'Married-AF-spouse');
                        CREATE TYPE occupation as ENUM('Tech-support', 'Craft-repair', 'Other-service', 'Sales', 'Exec-managerial', 'Prof-specialty', 'Handlers-cleaners', 'Machine-op-inspct', 'Adm-clerical', 'Farming-fishing', 'Transport-moving', 'Priv-house-serv', 'Protective-serv', 'Armed-Forces');
                        CREATE TABLE adults1 (
                            id SERIAL PRIMARY KEY,
                            age bigint,
                            workclass workclass,
                            fnlwgt bigint,
                            education education,
                            "education-num" bigint,
                            "marital-status" "marital-status",
                            occupation occupation
                        );
                        """)
    
    cursor.execute("""  CREATE TYPE relationship AS ENUM ('Wife', 'Own-child', 'Husband', 'Not-in-family', 'Other-relative', 'Unmarried');
                        CREATE TYPE race as ENUM('White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other', 'Black');
                        CREATE TYPE sex as ENUM('Female', 'Male');
                        CREATE TYPE "native-country" as ENUM('United-States', 'Cambodia', 'England', 'Puerto-Rico', 'Canada', 'Germany', 'Outlying-US(Guam-USVI-etc)', 'India', 'Japan', 'Greece', 'South', 'China', 'Cuba', 'Iran', 'Honduras', 'Philippines', 'Italy', 'Poland', 'Jamaica', 'Vietnam', 'Mexico', 'Portugal', 'Ireland', 'France', 'Dominican-Republic', 'Laos', 'Ecuador', 'Taiwan', 'Haiti', 'Columbia', 'Hungary', 'Guatemala', 'Nicaragua', 'Scotland', 'Thailand', 'Yugoslavia', 'El-Salvador', 'Trinadad&Tobago', 'Peru', 'Hong', 'Holand-Netherlands');
                        CREATE TYPE income as ENUM('>50K', '<=50K');
                        CREATE TABLE adults2 (
                            id SERIAL PRIMARY KEY,
                            relationship relationship,
                            race race,
                            sex sex,
                            "capital-gain" bigint,
                            "capital-loss" bigint,
                            "hours-per-week" bigint,
                            "native-country" "native-country",
                            income income
                        );
                        """)
    
    conn.commit()
except Exception as e:
    # if the transaction aborts we will need to rollback
    cursor.execute("ROLLBACK")
    print(e)
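
Every cell in this notebook repeats the same try/except/rollback pattern. A minimal sketch of a helper that factors it out is shown below. `run_query` is a hypothetical name, not part of psycopg2, and it assumes `conn` and `cursor` behave like the psycopg2 objects created above (it only relies on `cursor.execute`, `cursor.fetchall` and `conn.rollback`).

```python
def run_query(conn, cursor, sql, params=None):
    """Run a SELECT and return its rows, or None if the query fails.

    Hypothetical helper: `conn` and `cursor` are assumed to be a psycopg2
    connection and one of its cursors.
    """
    try:
        cursor.execute(sql, params)
        return cursor.fetchall()
    except Exception as e:
        # an aborted transaction blocks later queries until it is rolled back
        conn.rollback()
        print(e)
        return None
```

Calling `conn.rollback()` has the same effect as executing a literal `ROLLBACK` statement, but it goes through the connection object instead of a cursor that may already be in a failed state.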

## 2. Load the datasets. Once you have loaded the data, double check that the id/SERIAL column starts at 1. 

We simply open each CSV file and load its data into the corresponding table (`adults1.csv` into `adults1`, `adults2.csv` into `adults2`).

In [3]:
try:
    with open('./files/adults/adults1.csv', 'r') as f:
        cursor.copy_from(f, 'adults1', columns=('age','workclass','fnlwgt','education','education-num','marital-status','occupation'), sep=',', null='?')
        
    with open('./files/adults/adults2.csv', 'r') as g:
        cursor.copy_from(g, 'adults2', columns=('relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country','income'), sep=',', null='?')
        
    conn.commit()
except Exception as e:
    # if the transaction aborts we will need to rollback
    cursor.execute("ROLLBACK")
    print(e)

Now let's check the serial column, starting with the name of the sequence behind each id column.

In [4]:
try:    
    cursor.execute("""  SELECT pg_get_serial_sequence('adults1', 'id'), pg_get_serial_sequence('adults2', 'id');""")
    rows = cursor.fetchall()
    print(rows)
except Exception as e:
    # if the transaction aborts we will need to rollback
    cursor.execute("ROLLBACK")
    print(e)

[('public.adults1_id_seq', 'public.adults2_id_seq')]


Once we have the sequence names, we can check each sequence's start value.

In [5]:
try:
    cursor.execute("""  SELECT start_value FROM pg_sequences WHERE schemaname = 'public' AND sequencename IN ('adults1_id_seq','adults2_id_seq');""")
    rows = cursor.fetchall()
    print(rows)
except Exception as e:
    # if the transaction aborts we will need to rollback
    cursor.execute("ROLLBACK")
    print(e)

[(1,), (1,)]


We can see that the sequences behind the id columns of both tables start at 1.
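
The start value describes how a sequence was configured, not necessarily what is stored in the table: sequence values consumed by a load that failed and was rolled back are not handed out again, so a later successful load could begin above 1. A complementary check, sketched below, looks at the ids actually stored. `id_range` is a hypothetical helper, and table names are restricted to the two tables of this lab because identifiers cannot be passed as query parameters.

```python
ALLOWED_TABLES = ("adults1", "adults2")  # the two tables created in this lab

def id_range(cursor, table):
    """Return (min_id, max_id, row_count) for one of the lab tables."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table}")
    # the table name was checked against the whitelist above before formatting
    cursor.execute(f"SELECT MIN(id), MAX(id), COUNT(*) FROM {table};")
    return cursor.fetchone()
```

With a live connection, `id_range(cursor, 'adults1')` would be expected to return a minimum of 1 and a maximum equal to the row count when the data was loaded once into a fresh table.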

## 3. How many people under 18 years old have never worked? Among people who have never worked (all ages), is there any race bias (how many by race)? Is there any sex bias? - Note: no need to elaborate, just present the data in tables.

In [6]:
try:
    cursor.execute("""
        SELECT * FROM adults1
        WHERE age < 18 AND workclass = 'Never-worked';
    """)
    rows = cursor.fetchall()
    print(rows)

    cursor.execute("""
        SELECT COUNT(*) FROM adults1
        WHERE age < 18 AND workclass = 'Never-worked';
    """)
    count = cursor.fetchone()
    print("Total:", count[0])
except Exception as e:
    # if the transaction aborts we will need to rollback
    cursor.execute("ROLLBACK")
    print(e)

[(14773, 17, 'Never-worked', 237272, '10th', 6, 'Never-married', None)]
Total: 1


In [7]:
try:
    cursor.execute("""
        SELECT a2.race, COUNT(*) AS count_by_race
        FROM adults1 a1
        JOIN adults2 a2 ON a1.id = a2.id
        WHERE a1.workclass = 'Never-worked'
        GROUP BY a2.race
        ORDER BY count_by_race DESC;

    """)
    rows = cursor.fetchall()

    print("\nPeople who never worked by race:")
    print(f"{'Race':<25} {'Count':>10}")
    print("-" * 35)
    for race, count in rows:
        print(f"{race:<25} {count:>10,}")

    print('')

    cursor.execute("""
        SELECT a2.race, COUNT(*) AS count_by_race
        FROM adults1 a1
        JOIN adults2 a2 ON a1.id = a2.id
        GROUP BY a2.race
        ORDER BY count_by_race DESC;

    """)
    rows = cursor.fetchall()
    print("\nPeople by race:")
    print(f"{'Race':<25} {'Count':>10}")
    print("-" * 35)
    for race, count in rows:
        print(f"{race:<25} {count:>10,}")

except Exception as e:
    # if the transaction aborts we will need to rollback
    cursor.execute("ROLLBACK")
    print(e)


People who never worked by race:
Race                           Count
-----------------------------------
White                              5
Black                              2


People by race:
Race                           Count
-----------------------------------
White                         27,816
Black                          3,124
Asian-Pac-Islander             1,039
Amer-Indian-Eskimo               311
Other                            271


In [8]:
try:
    cursor.execute("""
        SELECT a2.sex, COUNT(*) AS count_by_sex
        FROM adults1 a1
        JOIN adults2 a2 ON a1.id = a2.id
        WHERE a1.workclass = 'Never-worked'
        GROUP BY a2.sex
        ORDER BY count_by_sex DESC;

    """)
    rows = cursor.fetchall()

    print("\nPeople who never worked by sex:")
    print(f"{'Sex':<25} {'Count':>10}")
    print("-" * 35)
    for sex, count in rows:
        print(f"{sex:<25} {count:>10,}")

    print('')

    cursor.execute("""
        SELECT a2.sex, COUNT(*) AS count_by_sex
        FROM adults1 a1
        JOIN adults2 a2 ON a1.id = a2.id
        GROUP BY a2.sex
        ORDER BY count_by_sex DESC;

    """)
    rows = cursor.fetchall()
    print("\nPeople by sex:")
    print(f"{'Sex':<25} {'Count':>10}")
    print("-" * 35)
    for sex, count in rows:
        print(f"{sex:<25} {count:>10,}")

except Exception as e:
    # if the transaction aborts we will need to rollback
    cursor.execute("ROLLBACK")
    print(e)


People who never worked by sex:
Sex                            Count
-----------------------------------
Male                               5
Female                             2


People by sex:
Sex                            Count
-----------------------------------
Male                          21,790
Female                        10,771
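
Raw counts are hard to compare because the groups have very different sizes. One way to read the tables above is to compare each group's share among people who never worked with its share in the whole dataset. The sketch below does this arithmetic on the counts printed above. `shares` is a hypothetical helper, and with only 7 people in the never-worked group the percentages are indicative at best.

```python
def shares(counts):
    """Convert a {group: count} mapping into {group: percentage of total}."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 1) for group, n in counts.items()}

# counts copied from the outputs above
never_worked_by_race = {"White": 5, "Black": 2}
all_by_race = {"White": 27816, "Black": 3124, "Asian-Pac-Islander": 1039,
               "Amer-Indian-Eskimo": 311, "Other": 271}
never_worked_by_sex = {"Male": 5, "Female": 2}
all_by_sex = {"Male": 21790, "Female": 10771}

for label, subset, everyone in [("race", never_worked_by_race, all_by_race),
                                ("sex", never_worked_by_sex, all_by_sex)]:
    print(f"never worked, by {label}:", shares(subset))
    print(f"whole dataset, by {label}:", shares(everyone))
```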


## 4. Look at the hours per week of people with a paying job, by sex. Look at how many have an income above and below 50k. Compare and analyse. - Note: no need to elaborate, produce a single table that shows the data for both sex and salary.

In [9]:
try:
    cursor.execute("""
        SELECT a2.sex,
        COUNT(*)              AS count_people,
        ROUND(AVG(a2."hours-per-week"), 2) AS avg_hours_per_week,
        MIN(a2."hours-per-week")           AS min_hours_per_week,
        MAX(a2."hours-per-week")           AS max_hours_per_week,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY a2."hours-per-week") AS median_hours
        FROM adults1 a1
        JOIN adults2 a2 ON a1.id = a2.id
        WHERE a1.workclass NOT IN ('Without-pay', 'Never-worked')
        GROUP BY a2.sex
        ORDER BY a2.sex;
    """)
    rows = cursor.fetchall()
    print("\nHours per week of people with a paying job, by sex:")
    print(f"{'Sex':<10} {'Count':>10} {'Avg Hours':>12} {'Min':>6} {'Median':>8} {'Max':>6}")
    print("-" * 60)
    for sex, count, avg, min_h, max_h, median in rows:
        print(f"{sex:<10} {count:>10,} {float(avg):>12.2f} {min_h:>6} {median:>8} {max_h:>6}")


    cursor.execute("""
        SELECT a2.sex,
        a2.income,
        COUNT(*) AS count_people,
        ROUND(AVG(a2."hours-per-week"), 2) AS avg_hours_per_week,
        MIN(a2."hours-per-week")           AS min_hours_per_week,
        MAX(a2."hours-per-week")           AS max_hours_per_week,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY a2."hours-per-week") AS median_hours
        FROM adults1 a1
        JOIN adults2 a2 ON a1.id = a2.id
        WHERE a1.workclass NOT IN ('Without-pay', 'Never-worked')
        GROUP BY a2.sex, a2.income
        ORDER BY a2.sex, a2.income;
    """)

    rows = cursor.fetchall()
    print("\nIncome distribution of people with a paying job, by sex:")
    print(f"{'Sex':<10} {'Income':<6} {'Count':>10} {'Avg Hours':>12} {'Min':>6} {'Median':>8} {'Max':>6}")
    print("-" * 70)
    for sex, income, count, avg, min_h, max_h, median in rows:
        print(f"{sex:<10} {income:<6} {count:>10,} {float(avg):>12.2f} {min_h:>6} {median:>8} {max_h:>6}")

except Exception as e:
    # if the transaction aborts we will need to rollback
    cursor.execute("ROLLBACK")
    print(e)


Hours per week of people with a paying job, by sex:
Sex             Count    Avg Hours    Min   Median    Max
------------------------------------------------------------
Female          9,925        36.96      1     40.0     99
Male           20,779        42.86      1     40.0     99

Income distribution of people with a paying job, by sex:
Sex        Income      Count    Avg Hours    Min   Median    Max
----------------------------------------------------------------------
Female     >50K        1,127        40.82      2     40.0     99
Female     <=50K       8,798        36.46      1     40.0     99
Male       >50K        6,523        46.55      1     45.0     99
Male       <=50K      14,256        41.17      1     40.0     99
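
The exercise asks for a single table covering both sex and salary. One possible layout, sketched below, pivots the second result set so that each sex is one row with the two income brackets side by side. `pivot_by_income` is a hypothetical helper, and the input rows are the `(sex, income, count, avg_hours)` values copied from the output above rather than fetched from the database.

```python
def pivot_by_income(rows):
    """Group (sex, income, count, avg_hours) rows into {sex: {income: (count, avg_hours)}}."""
    table = {}
    for sex, income, count, avg_hours in rows:
        table.setdefault(sex, {})[income] = (count, avg_hours)
    return table

# values copied from the "Income distribution" output above
rows = [
    ("Female", ">50K", 1127, 40.82),
    ("Female", "<=50K", 8798, 36.46),
    ("Male", ">50K", 6523, 46.55),
    ("Male", "<=50K", 14256, 41.17),
]

pivot = pivot_by_income(rows)
print(f"{'Sex':<8} {'<=50K':>8} {'Avg h':>7} {'>50K':>8} {'Avg h':>7} {'% >50K':>8}")
for sex, by_income in pivot.items():
    low_n, low_h = by_income["<=50K"]
    high_n, high_h = by_income[">50K"]
    # share of this sex earning more than 50K
    pct_high = 100 * high_n / (low_n + high_n)
    print(f"{sex:<8} {low_n:>8,} {low_h:>7.2f} {high_n:>8,} {high_h:>7.2f} {pct_high:>8.1f}")
```

The same layout could also be produced directly in SQL with conditional aggregates; the Python version is used here only to keep the sketch independent of a live connection.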


## 5. How many people with college education do manual labour?

In [10]:
try:

    cursor.execute("""
        SELECT
            COUNT(*) FILTER (WHERE occupation IN ('Tech-support', 'Sales', 'Exec-managerial', 'Prof-specialty', 'Adm-clerical')) AS desk,
            COUNT(*) FILTER (WHERE occupation NOT IN ('Tech-support', 'Sales', 'Exec-managerial', 'Prof-specialty', 'Adm-clerical')) AS manual
        FROM adults1
        WHERE occupation IS NOT NULL
          AND "education-num" IS NOT NULL
          AND workclass IS NOT NULL
          AND "education-num" >= 10;
        """)
    rows = cursor.fetchall()
    colnames = [desc[0] for desc in cursor.description]

    print(f"{'occupation_type':<15} | {'count_people':>12}")
    print("-" * 30)
    for col, val in zip(colnames, rows[0]):
        print(f"{col:<15} | {val:>12}")

except Exception as e:
    # if the transaction aborts we will need to rollback
    cursor.execute("ROLLBACK")
    print(e)

occupation_type | count_people
------------------------------
desk            |        12207
manual          |         4722
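
The query above encodes two assumptions that are worth making explicit. It treats `education-num >= 10` as college education, and it counts every occupation outside the five desk categories as manual, which also includes groups such as `Other-service`, `Protective-serv` and `Armed-Forces`. The sketch below spells out the first assumption in Python. The level list is assumed to follow the usual `education-num` ordering of the Adult dataset and should be checked against the data (for example with a `SELECT DISTINCT` on both columns); `is_college` is a hypothetical helper.

```python
# education levels in increasing education-num order (1 = Preschool, 16 = Doctorate);
# assumed ordering, to be verified against the loaded data
EDUCATION_LEVELS = [
    "Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th",
    "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm", "Bachelors", "Masters",
    "Prof-school", "Doctorate",
]
EDUCATION_NUM = {name: i for i, name in enumerate(EDUCATION_LEVELS, start=1)}

def is_college(education_num, threshold=10):
    """True when education-num is at or above the threshold used in the query."""
    return education_num >= threshold

# levels that the query's threshold counts as college education
college_levels = [name for name, n in EDUCATION_NUM.items() if is_college(n)]
print(college_levels)
```

Under this ordering the threshold of 10 selects `Some-college` and every level above it, so people who attended college without completing a degree are included in the counts.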


## 6. What is the minimum, mean and maximum capital gain and capital loss for every marital status?