# Assignment 1

For this assignment, you will work on the *Contracts of apprenticeship in early modern Venice* [dataset](https://github.com/mromanello/ADA-DHOxSS/tree/master/data#contracts-of-apprenticeship-in-early-modern-venice).

**Please carefully read the assignment guidelines in Canvas. You are expected to work in groups and submit as a group.**

Consider the dataset on contracts of apprenticeship in Venice: is it sufficiently tidy? If not (hint: it is not), work on tidying it up. First, create a schema for it (use pen and paper and reason in terms of tidy tables and relations among tables, or an [entity-relationship model](https://en.wikipedia.org/wiki/Entity–relationship_model)), then implement your schema in this notebook. Second, check the values of each column and assess whether you need to clean anything up (e.g., remove non-uniform values). End your work by briefly discussing the benefits of your resulting tidy dataset over the original.

**Please make sure to carefully explain and motivate your choices via markdown cells and Python comments, as appropriate.**

*Hint: you might want to start by considering which observation types are currently in the dataset (all within the same table) and thus which variables (columns) might be redundant. Examples include the profession categorization and the masters.*
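One way to reason about redundancy is to test whether one column is fully determined by another: if it is, the pair probably describes a separate observation type and could live in its own table. The sketch below is only an illustration on invented toy rows that mimic the shape of the profession columns; the helper `is_determined_by` is a hypothetical name, not part of pandas or of the assignment.

```python
import pandas as pd

# Toy rows with invented values; the real dataset is loaded further below.
toy = pd.DataFrame({
    "profession_code_strict": ["orese", "orese", "marzer", "marzer", "batioro"],
    "profession_cat": ["orefice", "orefice", "merciaio", "merciaio", "battioro"],
    "annual_salary": [None, 12.5, 7.0, None, 3.0],
})

def is_determined_by(df, key, col):
    """Return True if every value of `key` maps to at most one value of `col`.

    Missing values in `col` count as a distinct value (dropna=False).
    """
    counts = df.groupby(key)[col].nunique(dropna=False)
    return bool((counts <= 1).all())

# The category depends only on the profession code: a candidate lookup table.
print(is_determined_by(toy, "profession_code_strict", "profession_cat"))

# The salary varies between contracts with the same code: it belongs to the contract.
print(is_determined_by(toy, "profession_code_strict", "annual_salary"))
```

On the real data such a check may fail for a handful of rows because of inconsistent coding, which is itself useful information for the cleaning step.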

In [1]:
import pandas as pd

In [6]:
df_contracts = pd.read_csv("https://raw.githubusercontent.com/mromanello/ADA-DHOxSS/master/data/apprenticeship_venice/professions_data.csv", sep=";")

In [7]:
df_contracts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9653 entries, 0 to 9652
Data columns (total 47 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   page_title                9653 non-null   object 
 1   register                  9653 non-null   object 
 2   annual_salary             7870 non-null   float64
 3   a_profession              9653 non-null   object 
 4   profession_code_strict    9618 non-null   object 
 5   profession_code_gen       9614 non-null   object 
 6   profession_cat            9597 non-null   object 
 7   corporation               9350 non-null   object 
 8   keep_profession_a         9653 non-null   int64  
 9   complete_profession_a     9653 non-null   int64  
 10  enrolmentY                9628 non-null   float64
 11  enrolmentM                9631 non-null   float64
 12  startY                    9533 non-null   float64
 13  startM                    9539 non-null   float64
 14  length  

In [4]:
df_contracts.head(5)

Unnamed: 0,page_title,register,annual_salary,a_profession,profession_code_strict,profession_code_gen,profession_cat,corporation,keep_profession_a,complete_profession_a,...,personal_care_master,clothes_master,generic_expenses_master,salary_in_kind_master,pledge_goods_master,pledge_money_master,salary_master,female_guarantor,period_cat,incremental_salary
0,Carlo Della sosta (Orese) 1592-08-03,"asv, giustizia vecchia, accordi dei garzoni, 1...",,orese,orese,orefice,orefice,Oresi,1,1,...,1,1,1,0,0,0,0,0,,0
1,Antonio quondam Andrea (squerariol) 1583-01-09,"asv, giustizia vecchia, accordi dei garzoni, 1...",12.5,squerariol,squerariol,lavori allo squero,lavori allo squero,Squerarioli,1,1,...,0,0,1,0,0,0,1,0,1.0,0
2,Cristofollo di Zuane (batioro in carta) 1591-0...,"asv, giustizia vecchia, accordi dei garzoni, 1...",,batioro,batioro,battioro,fabbricatore di foglie/fili/cordelle d'oro o a...,Battioro,1,1,...,0,0,0,0,0,0,0,0,,0
3,Illeggibile (marzer) 1584-06-21,"asv, giustizia vecchia, accordi dei garzoni, 1...",,marzer,marzer,marzer,merciaio,Merzeri,1,1,...,0,0,0,0,0,0,0,0,,0
4,Domenico Morebetti (spechier) 1664-09-13,"asv, giustizia vecchia, accordi dei garzoni, 1...",7.0,marzer,marzer,marzer,merciaio,Merzeri,1,1,...,0,0,1,0,0,0,1,0,1.0,0


In [5]:
df_contracts.columns

Index(['page_title', 'register', 'annual_salary', 'a_profession',
       'profession_code_strict', 'profession_code_gen', 'profession_cat',
       'corporation', 'keep_profession_a', 'complete_profession_a',
       'enrolmentY', 'enrolmentM', 'startY', 'startM', 'length', 'has_fled',
       'm_profession', 'm_profession_code_strict', 'm_profession_code_gen',
       'm_profession_cat', 'm_corporation', 'keep_profession_m',
       'complete_profession_m', 'm_gender', 'm_name', 'm_surname',
       'm_patronimic', 'm_atelier', 'm_coords', 'a_name', 'a_age', 'a_gender',
       'a_geo_origins', 'a_geo_origins_std', 'a_coords', 'a_quondam',
       'accommodation_master', 'personal_care_master', 'clothes_master',
       'generic_expenses_master', 'salary_in_kind_master',
       'pledge_goods_master', 'pledge_money_master', 'salary_master',
       'female_guarantor', 'period_cat', 'incremental_salary'],
      dtype='object')

Every row represents an apprenticeship contract. Contracts were registered both at the guild and at a public office. This is a sample of contracts from a much larger set of records.

The variables you will need to work with are:
* `annual_salary`: the annual salary paid to the apprentice, if any (in Venetian ducats).
* `a_profession` to `corporation`: from specific to increasingly generic classifications for the apprentice's stated profession.
* `startY`, `startM`: start of the contract, year and month.
* `length`: length of the contract, in months.
* `has_fled`: whether the apprentice has fled from the master during the contract (boolean value).
* `m_profession`, `m_profession_code_strict`, `m_profession_code_gen`, `m_profession_cat`, `m_corporation`: from specific to increasingly generic classifications for the master's stated profession.
* `m_gender` and `a_gender`: of master and apprentice respectively.
* `a_age`: age of the apprentice at entry, in years.
* `m_name`, `m_surname`, `a_name`: the name and surname of the master (`m_`) and the full name of the apprentice (`a_`).
* `a_geo_origins_std`: the place where the apprentice was from.
* `female_guarantor`: if at least one of the contract's guarantors was female (boolean value).
* `incremental_salary`: whether the salary went up over time (boolean value).
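When assessing the values of each column, a simple starting point is to count every distinct value, missing ones included, and then normalise obvious variants. The following is a minimal sketch on invented toy values (the spellings and the 0/1 encoding shown here are assumptions for illustration, not observations about the real data).

```python
import pandas as pd

# Toy values, invented for illustration only.
toy = pd.DataFrame({
    "a_geo_origins_std": ["Venezia", " venezia", "Bergamo", None],
    "has_fled": [0, 1, 0, 0],
})

# Count each distinct value, keeping missing values visible.
print(toy["a_geo_origins_std"].value_counts(dropna=False))

# Normalise whitespace and capitalisation so that variants collapse into one value.
cleaned = toy["a_geo_origins_std"].str.strip().str.title()
print(cleaned.value_counts(dropna=False))

# Store 0/1 flags as proper booleans (assumes the column has no missing values).
toy["has_fled"] = toy["has_fled"].astype(bool)
print(toy.dtypes)
```

Whether normalising capitalisation is appropriate depends on the column: for place names it is usually harmless, but each such choice should be motivated in a markdown cell.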

In [8]:
df_contracts = df_contracts[['annual_salary', 'a_profession',
       'profession_code_strict', 'profession_code_gen', 'profession_cat',
       'corporation', 'startY', 'startM', 'length', 'has_fled',
       'm_profession', 'm_profession_code_strict', 'm_profession_code_gen',
       'm_profession_cat', 'm_corporation', 'm_gender', 'm_name', 'm_surname',
       'a_name', 'a_age', 'a_gender', 'a_geo_origins_std', 
       'female_guarantor', 'incremental_salary']]

*Hint: for your tidy version, consider splitting this table into a few tables with uniform observation types for each. While you should keep one table for apprenticeship contracts, you can consider having separate tables for, e.g., profession codes and corporations, places, and even persons (limited to masters because for apprentices we only know their names).*

---

In [None]:
# your code here