# Data science for business project
This project was developed by:
- Drago Emanuele
- Lambrughi Achille

# Index


- [Introduction](#Introduction)<a href='#Introduction'></a><br>
- [Resources](#res)<a href='res'></a><br>
- [Configuration](#config)<a href='config'></a><br>
- [Library](#lib)<a href='lib'></a><br>
- [Function defined](#my_fun)<a href='my_fun'></a><br>

[**Dataset analysis**](#cap1)<a href='cap1'></a><br>
- [Dataset Import](#dt_import)<a href='dt_import'></a><br>
- [Outlier detection](#dec_outlier)<a href='dec_outlier'></a><br>
    - [DATA column](#data_cl)<a href='data_cl'></a><br>
    - [SETTOREECONOMICODETTAGLIO column](#set_ec_cl)<a href='set_ec_cl'></a><br>
    - [MODALITALAVORO column](#mod_work_cl)<a href='mod_work_cl'></a><br>
    
[**Building machine learning models**](#ml_model)<a href='ml_model'></a><br>
- [Feature encoding](#feature_encoding)<a href='feature_encoding'></a><br>
    - [Direct Encoding](#simp_encoding)<a href='simp_encoding'></a><br>
    - [Our Encoding](#elab_encoding)<a href='elab_encoding'></a><br>
- [Models Training](#models_trainig)<a href='models_trainig'></a><br>
- [Models Evaluation](#model_evaluation)<a href='model_evaluation'></a><br>
- [Possible application](#model_usage)<a href='model_usage'></a><br>

[**Conclusion**](#conclusion)<a href='conclusion'></a><br>



<a id='Introduction'></a>
## Introduction
The aim of this project is to illustrate the main phases of a data science project. In the first phase we import and analyze a dataset; then, based on the results of the analysis, we clean the data by removing outliers and null values.
Once the data is ready, we build a machine learning model able to answer some questions about the data.

<a id='res'></a>
## Resources
The datasets considered are available at the following links:
- https://www.dati.lombardia.it/Attivit-Produttive/Rapporti-di-lavoro-attivati/qbau-cyuc
- https://esploradati.censimentopopolazione.istat.it/databrowser/#/it/censtest/ITC4

Other resources:
- https://www.google.it/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&ved=2ahUKEwiamPWll5T4AhUSG-wKHYxBBooQFnoECAcQAQ&url=https%3A%2F%2Fwww.istat.it%2Fstorage%2Fcodici-unita-amministrative%2FElenco-comuni-italiani.xls&usg=AOvVaw1grUzCb-YznlY1XTyzCUJE

  In this file (`Elenco-comuni-italiani.xls`) the column *Denominazione dell'Unità territoriale sovracomunale (valida a fini statistici)* has been renamed to *Provincia*.

- The file CodiceAteco.xlsx, which contains the Italian classification of economic activities.

To run the notebook, all the files must be placed in a folder called 'dataset' in the root folder of the project.

# Objective
The main dataset we have chosen is feature-rich and records all newly started work contracts, so our goal is to predict the type of contract under which a person is most likely to be hired.

<a id='config'></a>
### Configuration
This project uses the `xgboost` module, so it must be installed for the code to run correctly.

<a id='lib'></a>
### Used libraries
The following cell imports the main libraries used in the project; some other libraries are imported right before being used.

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl, matplotlib.pyplot as plt
from pathlib import PurePath
from datetime import datetime, timedelta
import seaborn as sns
mpl.rcParams['figure.dpi'] = 150

<a id='my_fun'></a>
### Defining some functions
Before proceeding with the dataset we define the following functions:
- `series_to_set`: takes a column name and a dataframe and returns a set containing the values of that column;
- `mapping`: takes a `Series` and a mapping dictionary and returns the series with its values replaced according to the dictionary; values that are not in the dictionary are left unchanged.

In [None]:
def series_to_set(column, source_df):
    SET = set()
    for elem in source_df[column]:
        SET.add(elem)
    return SET

def mapping(series, mapp): 
    series = series.apply(lambda x: mapp.get(x) if mapp.get(x) != None else x)
    return series
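
As a minimal sketch on made-up data (the toy frame, the dictionary and the names `series_to_set_alt` and `mapping_alt` are hypothetical and not part of the project), the same two behaviours can also be obtained with built-in pandas idioms: building a set directly from the column, and `Series.map` followed by `fillna`.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({'GENERE': ['M', 'F', 'M', 'X']})
toy_map = {'M': 'male', 'F': 'female'}

def series_to_set_alt(column, source_df):
    # A set can be built directly from a column
    return set(source_df[column])

def mapping_alt(series, mapp):
    # map() yields NaN for values missing from the dictionary;
    # fillna(series) restores the original value in those positions
    return series.map(mapp).fillna(series)

unique_values = series_to_set_alt('GENERE', toy)
mapped = mapping_alt(toy['GENERE'], toy_map)
print(unique_values)
print(mapped.tolist())
```

The vectorised form avoids a Python-level lambda call per row, which may matter on a dataset with many rows.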

<a id='cap1'></a>
# Dataset analysis
In this section we analyze the content of the dataset. The chosen dataset contains the new work contracts activated in Lombardy. It has the following columns:
- DATA: date of the contract
- GENERE: gender of the person
- ETA: age of the person
- SETTOREECONOMICODETTAGLIO: category of work
- TITOLOSTUDIO: level of education of the person
- CONTRATTO: type of contract
- MODALITALAVORO: work mode
- PROVINCIAIMPRESA: province of the place of work
- ITALIANO: nationality of the person


<a id='dt_import'></a>
### Dataset import
Here we import the dataset, parse the field `DATA` as datetime and take a first look at the columns and their types.

In [None]:
source_df = pd.read_csv(PurePath('dataset', 'Rapporti_di_lavoro_attivati.csv'),parse_dates=['DATA'])

In [None]:
rap_lavoro_attivati = source_df.copy()

In [None]:
# info() prints its report directly and returns None, so it does not need to be wrapped in print()
rap_lavoro_attivati.info()
rap_lavoro_attivati.head(10)

### Changing data type
As we can see, almost all columns of the dataset have type `object` and contain strings. This means that the machine learning model will be fed with categorical data or a numerical representation of it. Before proceeding with the analysis we change the type from `object` to `string`.

Since categorical data is difficult to use for training a machine learning model, we already know that these values will have to be replaced with numerical ones. So, during the analysis, we will try to understand which numerical values are most suitable for this task.

In [None]:
# Columns holding categorical text values
string_columns = ['GENERE', 'SETTOREECONOMICODETTAGLIO', 'TITOLOSTUDIO', 'CONTRATTO',
                  'MODALITALAVORO', 'PROVINCIAIMPRESA', 'ITALIANO']
rap_lavoro_attivati[string_columns] = rap_lavoro_attivati[string_columns].astype('string')

### Searching wrong and null data
Now we look inside the data, checking for null or possibly wrong values; these will then be corrected or deleted.
First, we count how many null values there are in each column.

In [None]:
print(rap_lavoro_attivati.isnull().sum())

From this first look we can see that a large share of the values in the column `MODALITALAVORO` is missing. These values need to be replaced, while the nulls in the other columns can simply be deleted because they affect only a small part of the dataset.
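
To judge whether a column's missing values are a "large" or a "small" part of the data, the absolute counts can be turned into percentages. The following is a minimal sketch on an invented toy frame (the values are placeholders, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    'MODALITALAVORO': ['a', np.nan, np.nan, 'b'],
    'ETA': [25, 31, np.nan, 40],
})

# isnull().mean() gives the fraction of missing values per column
null_share = toy.isnull().mean() * 100
print(null_share)
```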

Before proceeding, we check the actual values of some columns to see whether they contain implausible data,
starting from the column `DATA`.

<a id='dec_outlier'></a>
## Identifying outliers

<a id='data_cl'></a>
### column DATA

First, we extract the year from the date, so that the column `DATA` becomes easier to analyze.

In [None]:
rap_lavoro_attivati['DATA'] = rap_lavoro_attivati['DATA'].apply(lambda x: x.year)

In [None]:
rap_lavoro_attivati = rap_lavoro_attivati[rap_lavoro_attivati['DATA'] < 2022]

In [None]:
fig, ax = plt.subplots()
ax.boxplot(rap_lavoro_attivati['DATA'])
plt.show()

The boxplot above shows many outliers in the years before 2000, so we should at least restrict the dataset to the last 20 years. Moreover, we will combine this dataset with other datasets that have values only for the years 2018 and 2019. This is not a problem, since those two years represent a fairly recent job situation (unlike the early 2000s); we do not analyze data from 2020 onwards because it is altered by the consequences of the COVID-19 pandemic.
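
The points a boxplot draws as outliers are, with matplotlib's default settings, the values lying more than 1.5 times the interquartile range beyond the quartiles. The following sketch applies the same rule numerically to a handful of invented years (toy values, not the real column).

```python
import pandas as pd

# Toy years, invented for illustration only
years = pd.Series([1975, 2015, 2016, 2017, 2018, 2018, 2019, 2019, 2020, 2021])

q1, q3 = years.quantile(0.25), years.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are the points a boxplot draws as outliers
outliers = years[(years < lower) | (years > upper)]
print(lower, upper, outliers.tolist())
```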

In [None]:
rap_lavoro_attivati['DATA'].head()


## Working on column `DATA`
In this section we analyze the column `DATA` in more detail and adjust the data accordingly.
The first step is to keep only the years 2018 and 2019, group by year and count how many contracts were activated per year.

In [None]:
rap_lavoro_attivati = rap_lavoro_attivati[(rap_lavoro_attivati.DATA >= 2018) & (rap_lavoro_attivati.DATA < 2020)]
print(rap_lavoro_attivati.groupby(['DATA'])['GENERE'].count())

This shows how many contracts were activated in Lombardy in 2018 and 2019.

In [None]:
rap_lavoro_attivati.groupby(pd.Grouper(key='DATA'))['DATA'].count().plot(label="attivati", kind='bar')

plt.legend()
plt.show()

As we can see, there is little difference between the two years. The next graph shows how the contracts are split between men and women.

In [None]:
m_to_f_rationA = rap_lavoro_attivati.GENERE.value_counts()

xaxisA = m_to_f_rationA.index
valueA = m_to_f_rationA.values

fig1, axis = plt.subplots(1)
fig1.dpi = 100

#Attivati male to female Graph
axis.pie(valueA, labels=xaxisA, autopct='%1.1f%%', startangle=90)
axis.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.


From this graph we can say that there are probably more men than women in the population, but this should be verified by plotting data from a dataset that covers the whole population, not only the newly hired.

In [None]:
rap_lavoro_attivati.head()

The next plot shows the number of `LAVORO A TEMPO INDETERMINATO` contracts in relation to the working time: full-time or part-time (and its variations).

In [None]:
df = rap_lavoro_attivati[['CONTRATTO', 'MODALITALAVORO']]
df = df[df['CONTRATTO'] == 'LAVORO A TEMPO INDETERMINATO']
df = df.groupby(['MODALITALAVORO']).count()
df.plot(kind='bar')

The result suggests that working hours and contract type are strongly related, hence the feature `MODALITALAVORO` should be exploited by our model.

Next, we check whether there is a relation between province and contract type.

In [None]:
df = rap_lavoro_attivati[['CONTRATTO', 'PROVINCIAIMPRESA']]
df = df[df['CONTRATTO'] == 'LAVORO A TEMPO INDETERMINATO']
df = df.groupby(['PROVINCIAIMPRESA']).count()
df.plot(kind='bar', legend=False)
plt.ylabel('Tempo indeterminato')

They seem to be correlated, but we should pay attention to what we have just plotted. We counted the **total** number of activated contracts in each province, and this can lead to a wrong evaluation because different provinces have different populations. For example, the province of Milan has roughly 3.2 million inhabitants, while about 230 thousand people live in the province of Lodi. So, instead of the total number of contracts, we will consider the *percentage* over the total number of newly started contracts in each province.
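
The difference between absolute counts and per-province percentages can be sketched with `pd.crosstab`, whose `normalize='index'` option divides each row by its total. The toy rows below are invented for illustration and only reuse the column names of the dataset.

```python
import pandas as pd

# Toy contracts, invented for illustration only
toy = pd.DataFrame({
    'PROVINCIAIMPRESA': ['MILANO'] * 6 + ['LODI'] * 2,
    'CONTRATTO': ['LAVORO A TEMPO INDETERMINATO', 'LAVORO A TEMPO DETERMINATO'] * 4,
})

# Absolute counts favour the larger province ...
counts = pd.crosstab(toy['PROVINCIAIMPRESA'], toy['CONTRATTO'])
# ... while row-normalised shares are comparable across provinces
shares = pd.crosstab(toy['PROVINCIAIMPRESA'], toy['CONTRATTO'], normalize='index') * 100
print(counts)
print(shares)
```

In this toy case the counts differ by a factor of three while the shares are identical, which is exactly the distortion that percentages remove. A single crosstab also yields the shares of every contract type at once.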

In [None]:
df = rap_lavoro_attivati[['CONTRATTO', 'PROVINCIAIMPRESA']]
df = df[df['CONTRATTO'] == 'LAVORO A TEMPO INDETERMINATO']
df = df.groupby(['PROVINCIAIMPRESA']).count()
df.loc[:, 'CONTRATTO'] = 100*df["CONTRATTO"]/rap_lavoro_attivati.groupby(['PROVINCIAIMPRESA']).count()['CONTRATTO']
df.plot(kind='bar', legend=False)
plt.ylabel('% Tempo indeterminato')
################################################
df = rap_lavoro_attivati[['CONTRATTO', 'PROVINCIAIMPRESA']]
df = df[df['CONTRATTO'] == 'LAVORO A TEMPO DETERMINATO']
df = df.groupby(['PROVINCIAIMPRESA']).count()
df.loc[:, 'CONTRATTO'] = 100*df["CONTRATTO"]/rap_lavoro_attivati.groupby(['PROVINCIAIMPRESA']).count()['CONTRATTO']
df.plot(kind='bar', legend=False)
plt.ylabel('% Tempo determinato')
################################################
df = rap_lavoro_attivati[['CONTRATTO', 'PROVINCIAIMPRESA']]
df = df[df['CONTRATTO'] == 'TIROCINIO']
df = df.groupby(['PROVINCIAIMPRESA']).count()
df.loc[:, 'CONTRATTO'] = 100*df["CONTRATTO"]/rap_lavoro_attivati.groupby(['PROVINCIAIMPRESA']).count()['CONTRATTO']
df.plot(kind='bar', legend=False)
plt.ylabel('% Tirocinio')
################################################
df = rap_lavoro_attivati[['CONTRATTO', 'PROVINCIAIMPRESA']]
df = df[df['CONTRATTO'] == 'LAVORO DOMESTICO']
df = df.groupby(['PROVINCIAIMPRESA']).count()
df.loc[:, 'CONTRATTO'] = 100*df["CONTRATTO"]/rap_lavoro_attivati.groupby(['PROVINCIAIMPRESA']).count()['CONTRATTO']
df.plot(kind='bar', legend=False)
plt.ylabel('% Lavoro domestico')

From the plots we can tell that the type of contract and the province in which people work are related.

For example, `MONZA E BRIANZA` has the highest percentage of both open-ended contracts and internships, while it has the lowest percentage of fixed-term contracts. This could mean that people in Monza e Brianza are more frequently hired as interns and then moved directly to an open-ended contract.

In [None]:
rap_lavoro_attivati.groupby(['PROVINCIAIMPRESA']).count()

<a id='set_ec_cl'></a>
### column SETTOREECONOMICODETTAGLIO

In [None]:
settore_economico = series_to_set('SETTOREECONOMICODETTAGLIO', rap_lavoro_attivati)
len(settore_economico)

Let's plot the values of `SETTOREECONOMICODETTAGLIO` and their weight in the dataset.

In [None]:
fig, axis = plt.subplots()

axis.pie(rap_lavoro_attivati.groupby(['SETTOREECONOMICODETTAGLIO']).count()['GENERE'])

plt.show()

We believe that the pie plot describes the problem with this column well. Taken as is, it cannot be used for analysis: the number of distinct values is huge.

Fortunately, the labels seem to match the Ateco codes. Hence, we can map them into wider categories.

In [None]:
sett_eco_codes = pd.read_excel(PurePath('dataset', 'CodiceAteco.xlsx'))
sett_eco_codes.head()

In [None]:
sett_eco_codes['Descrizione'] = sett_eco_codes['Descrizione'].str.lower()
sett_eco_codes = sett_eco_codes[['Lettera', 'Descrizione']]
ateco = sett_eco_codes.set_index('Descrizione').to_dict().get('Lettera')

In [None]:
rap_lavoro_attivati["SETTOREECONOMICODETTAGLIO"] = rap_lavoro_attivati["SETTOREECONOMICODETTAGLIO"].str.lower()
rap_lavoro_attivati.head()

In [None]:
rap_lavoro_attivati["SETTOREECONOMICODETTAGLIO"] = rap_lavoro_attivati["SETTOREECONOMICODETTAGLIO"].map(ateco)
rap_lavoro_attivati.head()

In [None]:
rap_lavoro_attivati['SETTOREECONOMICODETTAGLIO'].isna().sum()

In [None]:
rap_lavoro_attivati = rap_lavoro_attivati[rap_lavoro_attivati['SETTOREECONOMICODETTAGLIO'].notna()]

In [None]:
rap_lavoro_attivati['SETTOREECONOMICODETTAGLIO'].isna().sum()
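
The lowercase-then-map steps above can be sketched on toy data as follows. The two-row lookup table and the sample descriptions are invented for illustration; only the column names `Lettera` and `Descrizione` come from the notebook. Counting the unmatched rows before dropping them shows how much data the mapping costs.

```python
import pandas as pd

# Toy lookup and data, invented for illustration only
lookup = pd.DataFrame({'Lettera': ['F', 'P'], 'Descrizione': ['Costruzioni', 'Istruzione']})
toy = pd.Series(['COSTRUZIONI', 'ISTRUZIONE', 'voce non presente'])

# Lowercase both sides so the key matches regardless of case
letter_by_description = dict(zip(lookup['Descrizione'].str.lower(), lookup['Lettera']))
letters = toy.str.lower().map(letter_by_description)

# Unmatched descriptions become NaN, so the loss can be measured before dropping them
unmatched = int(letters.isna().sum())
letters = letters[letters.notna()]
print(unmatched, letters.tolist())
```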

Now we can visualize the column after the mapping.

In [None]:
fig, axis = plt.subplots()

axis.pie(rap_lavoro_attivati.groupby(['SETTOREECONOMICODETTAGLIO']).count()['GENERE'])

plt.show()

Much better: the column now has a manageable number of distinct values and can be used for analysis.

In [None]:
df = rap_lavoro_attivati[['CONTRATTO', 'SETTOREECONOMICODETTAGLIO']]
df = df[df['CONTRATTO'] == 'LAVORO A TEMPO INDETERMINATO']
df = df.groupby(['SETTOREECONOMICODETTAGLIO']).count()
df.loc[:, 'CONTRATTO'] = 100*df["CONTRATTO"]/rap_lavoro_attivati.groupby(['SETTOREECONOMICODETTAGLIO']).count()['CONTRATTO']
df.plot(kind='bar', legend=False)
plt.ylabel('% Tempo indeterminato')

To interpret the bar plot above correctly, we need to know the meaning of the letters:
- A: AGRICOLTURA, SILVICOLTURA E PESCA;
- B: ESTRAZIONE DI MINERALI DA CAVE E MINIERE;
- C: ATTIVITÀ MANIFATTURIERE;
- D: FORNITURA DI ENERGIA ELETTRICA, GAS, VAPORE E ARIA CONDIZIONATA;
- E: FORNITURA DI ACQUA; RETI FOGNARIE, ATTIVITÀ DI GESTIONE DEI RIFIUTI E RISANAMENTO;
- F: COSTRUZIONI;
- G: COMMERCIO ALL'INGROSSO E AL DETTAGLIO; RIPARAZIONE DI AUTOVEICOLI E MOTOCICLI;
- H: TRASPORTO E MAGAZZINAGGIO;
- I: ATTIVITÀ DEI SERVIZI DI ALLOGGIO E DI RISTORAZIONE;
- J: SERVIZI DI INFORMAZIONE E COMUNICAZIONE;
- K: ATTIVITÀ FINANZIARIE E ASSICURATIVE;
- L: ATTIVITA' IMMOBILIARI;
- M: ATTIVITÀ PROFESSIONALI, SCIENTIFICHE E TECNICHE;
- N: NOLEGGIO, AGENZIE DI VIAGGIO, SERVIZI DI SUPPORTO ALLE IMPRESE;
- O: AMMINISTRAZIONE PUBBLICA E DIFESA; ASSICURAZIONE SOCIALE OBBLIGATORIA;
- P: ISTRUZIONE;
- Q: SANITA' E ASSISTENZA SOCIALE;
- R: ATTIVITÀ ARTISTICHE, SPORTIVE, DI INTRATTENIMENTO E DIVERTIMENTO;
- S: ALTRE ATTIVITÀ DI SERVIZI;
- T: ATTIVITÀ DI FAMIGLIE E CONVIVENZE COME DATORI DI LAVORO PER PERSONALE DOMESTICO; PRODUZIONE DI BENI E SERVIZI INDIFFERENZIATI PER USO PROPRIO DA PARTE DI FAMIGLIE E CONVIVENZE;
- U: ORGANIZZAZIONI ED ORGANISMI EXTRATERRITORIALI;

In [None]:
rap_lavoro_attivati['CONTRATTO'].value_counts()

As we can see, there are categories with just a few entries. Since these have a low impact on the dataset, we can delete them with the following code. It groups the data by `CONTRATTO`, builds an index containing only the contract types that satisfy a condition (at least 2000 entries) and then keeps only the rows whose contract type is in that index.

In [None]:
#delete low number contracts
byContract = rap_lavoro_attivati.groupby('CONTRATTO').aggregate(np.count_nonzero)
tags = byContract[byContract.GENERE >= 2000].index
rap_lavoro_attivati = rap_lavoro_attivati[rap_lavoro_attivati['CONTRATTO'].isin(tags)]
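
A similar filter can be sketched with `value_counts`, which counts rows per category directly. This is a slightly different criterion from the cell above (which counts non-zero entries of `GENERE`), and both the toy data and the lowered threshold `min_rows` are assumptions made for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only; the threshold is lowered to 2 to fit it
toy = pd.DataFrame({'CONTRATTO': ['A', 'A', 'A', 'B', 'B', 'C']})
min_rows = 2

counts = toy['CONTRATTO'].value_counts()
frequent = counts[counts >= min_rows].index
filtered = toy[toy['CONTRATTO'].isin(frequent)]
print(filtered['CONTRATTO'].value_counts())
```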

<a id='mod_work_cl'></a>
### Column MODALITALAVORO

In `rap_lavoro_attivati` there are a lot of null values in the column `MODALITALAVORO`; removing those rows would reduce the set of data we are analysing. Hence, we prefer to fill the null values with a suitable value.

The value `NON DEFINITO` can be used to fill the missing entries, since, unlike the other available values, it does not introduce much bias.

In [None]:
def clean(df):
    df['MODALITALAVORO'] = df['MODALITALAVORO'].fillna('NON DEFINITO')
    df.dropna(axis = 0, inplace = True)
    

clean(rap_lavoro_attivati)
rap_lavoro_attivati.isnull().sum()
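
The `clean` function above modifies its argument in place. When the frame passed in is itself a filtered view of another frame, pandas may emit a `SettingWithCopyWarning`; a variant that returns a cleaned copy avoids that. The sketch below uses a hypothetical helper name `clean_copy` and an invented toy frame.

```python
import numpy as np
import pandas as pd

def clean_copy(df, column='MODALITALAVORO', placeholder='NON DEFINITO'):
    # Work on a copy so the caller's frame is left untouched
    out = df.copy()
    out[column] = out[column].fillna(placeholder)
    # Rows that still contain nulls in other columns are dropped
    return out.dropna(axis=0)

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    'MODALITALAVORO': ['a', np.nan, np.nan],
    'ETA': [30, 41, np.nan],
})
cleaned = clean_copy(toy)
print(cleaned)
```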

### Column GENERE
Now we try to understand whether gender and contract type are related.

In [None]:
gender_df = rap_lavoro_attivati[['GENERE', "CONTRATTO"]]
gender_df = gender_df[gender_df['CONTRATTO'] == 'LAVORO A TEMPO INDETERMINATO']
group_gender = gender_df.groupby(['GENERE']).count()

group_gender.loc[:, 'CONTRATTO'] = 100*group_gender["CONTRATTO"]/rap_lavoro_attivati.groupby(['GENERE']).count()['CONTRATTO']

fig, ax = plt.subplots()
colors = ['green', 'blue']
ax.bar(group_gender.index, group_gender['CONTRATTO'], color=colors)
ax.set_ylabel('Contratti tempo indeterminato')
plt.show()
################
gender_df = rap_lavoro_attivati[['GENERE', "CONTRATTO"]]
gender_df = gender_df[gender_df['CONTRATTO'] == 'LAVORO A TEMPO DETERMINATO']
group_gender = gender_df.groupby(['GENERE']).count()
group_gender.loc[:, 'CONTRATTO'] = 100*group_gender["CONTRATTO"]/rap_lavoro_attivati.groupby(['GENERE']).count()['CONTRATTO']

fig, ax = plt.subplots()
colors = ['green', 'blue']
ax.bar(group_gender.index, group_gender['CONTRATTO'], color=colors)
ax.set_ylabel('Contratti tempo determinato')
plt.show()
#################
gender_df = rap_lavoro_attivati[['GENERE', "CONTRATTO"]]
gender_df = gender_df[gender_df['CONTRATTO'] == 'TIROCINIO']
group_gender = gender_df.groupby(['GENERE']).count()
group_gender.loc[:, 'CONTRATTO'] = 100*group_gender["CONTRATTO"]/rap_lavoro_attivati.groupby(['GENERE']).count()['CONTRATTO']

fig, ax = plt.subplots()
colors = ['green', 'blue']
ax.bar(group_gender.index, group_gender['CONTRATTO'], color=colors)
ax.set_ylabel('Tirocinio')
plt.show()

They seem to be correlated. Over all started contracts, open-ended contracts account for $\sim 10\%$ of the contracts started by women and $\sim 14\%$ of those started by men. The opposite happens for internships: women are at $\sim 4\%$, while men are at $\sim 3\%$.

In [None]:
set(rap_lavoro_attivati['CONTRATTO'])

### Dividing by age group
To make the data easier to read, we create a new column with the age group.

In [None]:
bins = np.arange(14, 68, 3).tolist()
bins

In [None]:
bins = np.arange(14, 68, 3).tolist()
rap_lavoro_attivati['agerange'] = pd.cut(rap_lavoro_attivati['ETA'], bins)
rap_lavoro_attivati
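
A short sketch of how `pd.cut` treats these bin edges, using a few invented ages: by default each interval is open on the left and closed on the right, so any age outside $(14, 65]$ is assigned `NaN`. Such rows would then be removed by a subsequent `dropna`.

```python
import numpy as np
import pandas as pd

bins = np.arange(14, 68, 3).tolist()   # edges 14, 17, ..., 65

# Toy ages, invented for illustration only
ages = pd.Series([14, 15, 17, 18, 65, 66])
groups = pd.cut(ages, bins)

# Intervals are open on the left and closed on the right
print(groups.astype(str).tolist())
```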

In [None]:
rap_lavoro_attivati.dropna(axis = 0, inplace = True)

In [None]:
testAge = rap_lavoro_attivati.agerange.value_counts()

labels = testAge.index
newContract = np.log(testAge.values)

x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, newContract, width, label='New Contract')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('log(number of contracts)')  # the plotted values are logarithms
ax.set_title('Number of contracts by age group')
ax.set_xticks(x)
ax.set_xticklabels(labels=labels,rotation=45,
    horizontalalignment='right');
ax.legend()



fig.tight_layout()

plt.show()

## Introducing another dataset
This dataset contains the level of education by age for the region of Lombardy from 2018 to 2020. The educational levels are encoded in the following way:
- NED = nessun titolo di studio
- IL =  analfabeti
- LBNA = analfabeti privi di titolo di studio
- PSE = licenza di scuola elementare
- LSE = licenza media o avviamento professionale (conseguito non oltre l'anno 1965)/ Diploma di istruzione secondaria di I grado
- USE_IF = Diploma di istruzione secondaria di II grado o di qualifica professionale (corso di 3-4 anni) compresi IFTS
- BL = Diploma tecnico superiore ITS o titolo di studio terziario di primo livello
- ML_RDD = titolo di studio terziario di secondo livello e dottorato di ricerca
- ML = Titolo di studio terziario di secondo livello
- RDD = Dottorato di ricerca/diploma accademico di formazione alla ricerca
- ALL = totale

In [None]:
grado_istruzione_age  = pd.read_csv(PurePath('dataset', 'Grado_istruzione_per_età_Lombardia_IT1,DF_DCSS_ISTR_LAV_PEN_2_REG.csv'),low_memory=False)
grado_istruzione_age.head()

In [None]:
print(grado_istruzione_age.isnull().sum())

Some columns contain null values and can be deleted; we keep only the ones we need.

In [None]:
grado_istruzione_age = grado_istruzione_age[['REF_AREA', 'GENDER', 'AGE_NOCLASS', 'EDU_ATTAIN', 'TIME_PERIOD', 'OBS_VALUE']]

In [None]:
grado_istruzione_age ['AGE_NOCLASS'].value_counts()

Those values represent age ranges, in particular:
- Y_GE9: people aged 9 or older, in this case all the people in the dataset
- Y25-49: people aged between 25 and 49
- Y50-64: people aged between 50 and 64
- Y_GE65: people aged 65 or older
- Y9-24: people aged between 9 and 24

In the following part we will add a column with the same age ranges as the dataset *Grado_istruzione_per_età_Lombardia*.

Since the minimum age in `rap_lavoro_attivati` is 16 while in `Grado_istruzione_per_età_Lombardia` it is 9, we delete all the illiterate people, as well as the people aged between 9 and 24 who hold only an elementary school certificate. We chose to do so because most of the people in that age range with only an elementary school certificate are under 16 years old.

In [None]:
grado_istruzione_age = grado_istruzione_age [~((grado_istruzione_age ['EDU_ATTAIN']== 'IL')|(grado_istruzione_age ['EDU_ATTAIN']== 'LBNA')|((grado_istruzione_age ['EDU_ATTAIN']== 'PSE')&(grado_istruzione_age ['AGE_NOCLASS']== 'Y9_24')))]
grado_istruzione_age

In [None]:
print(rap_lavoro_attivati['TITOLOSTUDIO'].head())
grado_istruzione_age['EDU_ATTAIN'].head()

In [None]:
print(series_to_set('TITOLOSTUDIO', rap_lavoro_attivati))
print(series_to_set('EDU_ATTAIN', grado_istruzione_age))

In [None]:
grado_istruzione_age.loc[(grado_istruzione_age.EDU_ATTAIN.isin(['ML_RDD'])) & ~(grado_istruzione_age.AGE_NOCLASS == 'Y_GE9')]

In [None]:
grado_istruzione_age = grado_istruzione_age[grado_istruzione_age['TIME_PERIOD'] != 2020]
grado_istruzione_age = grado_istruzione_age[~grado_istruzione_age.EDU_ATTAIN.isin(['ALL', 'ML', 'RDD'])]
grado_istruzione_age = grado_istruzione_age[grado_istruzione_age.AGE_NOCLASS != 'Y_GE9']
grado_istruzione_age = grado_istruzione_age[grado_istruzione_age.GENDER != 'T']
grado_istruzione_age.head()

Unfortunately the educational levels of the two datasets do not match directly, but we can generalize a bit and map the different values to a common set. In this way we can then unify the two datasets, or at least analyze the first dataset by exploiting the second.

In [None]:
edu_map = {
    'IL': 'NESSUN TITOLO DI STUDIO',
    'NED': 'NESSUN TITOLO DI STUDIO',
    'LBNA': 'NESSUN TITOLO DI STUDIO',
    'PSE': 'LICENZA ELEMENTARE',
    'LSE': 'LICENZA MEDIA',
    'USE_IF': 'DIPLOMA DI ISTRUZIONE SECONDARIA SUPERIORE',
    "TITOLO DI ISTRUZIONE SECONDARIA SUPERIORE (SCOLASTICA ED EXTRA-SCOLASTICA) CHE NON PERMETTE L'ACCESSO ALL'UNIVERSITÀ ()": 'DIPLOMA DI ISTRUZIONE SECONDARIA SUPERIORE',
    "DIPLOMA DI ISTRUZIONE SECONDARIA SUPERIORE  CHE PERMETTE L'ACCESSO ALL'UNIVERSITA": 'DIPLOMA DI ISTRUZIONE SECONDARIA SUPERIORE',
    'BL': 'LAUREA',
    'ML_RDD': 'TITOLO DI STUDIO TERZIARIO DI SECONDO LIVELLO O DOTTORATO',
    'DIPLOMA TERZIARIO EXTRA-UNIVERSITARIO': 'TITOLO DI STUDIO TERZIARIO DI SECONDO LIVELLO O DOTTORATO',
    'MASTER UNIVERSITARIO DI PRIMO LIVELLO': 'TITOLO DI STUDIO TERZIARIO DI SECONDO LIVELLO O DOTTORATO',
    'LAUREA - Vecchio o nuovo ordinamento': 'LAUREA',
    'DIPLOMA DI SPECIALIZZAZIONE': 'TITOLO DI STUDIO TERZIARIO DI SECONDO LIVELLO O DOTTORATO',
    'DIPLOMA UNIVERSITARIO': 'LAUREA',
    'TITOLO DI STUDIO POST-LAUREA': 'TITOLO DI STUDIO TERZIARIO DI SECONDO LIVELLO O DOTTORATO',
    'TITOLO DI DOTTORE DI RICERCA': 'TITOLO DI STUDIO TERZIARIO DI SECONDO LIVELLO O DOTTORATO'
}


In [None]:
rap_lavoro_attivati['TITOLOSTUDIO'] = mapping(rap_lavoro_attivati['TITOLOSTUDIO'], edu_map)

grado_istruzione_age['EDU_ATTAIN'] = mapping(grado_istruzione_age['EDU_ATTAIN'], edu_map)

grado_istruzione_age['EDU_ATTAIN']


In [None]:
city_codes = pd.read_excel(PurePath('dataset', 'Elenco-comuni-italiani.xls'))
city_codes = city_codes[['Codice Comune formato alfanumerico', 
                        'Denominazione in italiano',
                        'Denominazione Regione',
                        'Provincia',
                        'Codice NUTS3 2021',
                        'Codice NUTS2 2021 (3) '
                        ]]
city_codes = city_codes[city_codes['Denominazione Regione'] == 'Lombardia']
city_codes = city_codes.set_index('Codice Comune formato alfanumerico')
city_codes.loc[1] = ['','','Milano','ITC45','']   #The code in the dataset does not match with the code of Milano, 
                                                  #manually added
city_codes.head()

In [None]:
def get_provincia_by_code(code):
    try:
        return city_codes[city_codes['Codice NUTS3 2021'] == code]['Provincia'].iloc[0]
    except IndexError:
        # Code not found in the lookup table: handle the known special case
        if code == 'IT108':
            return 'MONZA E BRIANZA'
        return None
                    

In [None]:
area_codes = set(city_codes['Codice NUTS3 2021'])
area_codes.add('IT108')
grado_istruzione_age = grado_istruzione_age[grado_istruzione_age.REF_AREA.isin(area_codes)]

In [None]:
grado_istruzione_age['REF_AREA'] = grado_istruzione_age['REF_AREA'].apply(lambda x: get_provincia_by_code(x)) 
grado_istruzione_age
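
The per-row lookup above filters the whole `city_codes` frame for every value. As a sketch under the assumption that each code maps to a single province, the same result can be obtained by building a dictionary once and using `Series.map`. The toy table and the code `XX001` are invented; `ITC45` and `IT108` are the codes mentioned in the notebook.

```python
import pandas as pd

# Toy lookup table, invented for illustration only
toy_codes = pd.DataFrame({
    'Codice NUTS3 2021': ['ITC45', 'XX001'],
    'Provincia': ['Milano', 'Esempio'],
})

# One dictionary built up front replaces a DataFrame filter per row
provincia_by_code = dict(zip(toy_codes['Codice NUTS3 2021'], toy_codes['Provincia']))
provincia_by_code['IT108'] = 'MONZA E BRIANZA'   # special case handled in the notebook

ref_area = pd.Series(['ITC45', 'IT108', 'XX001', 'ZZ999'])
provinces = ref_area.map(provincia_by_code)      # unknown codes become NaN
print(provinces.tolist())
```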

In [None]:
titoli = set(rap_lavoro_attivati['TITOLOSTUDIO'])

Now we look at how the education levels are distributed among the started contracts, and then at how they are distributed in the population.

In [None]:
istruction_df = rap_lavoro_attivati[['DATA', 'TITOLOSTUDIO', "CONTRATTO"]]
istruction_df = istruction_df[(istruction_df['TITOLOSTUDIO'].isin(titoli))]
group_ist = istruction_df.groupby(['TITOLOSTUDIO', 'DATA']).count()
group_ist['CONTRATTO']

group_ist.unstack().plot(kind='bar', stacked=False)
plt.legend([2018,2019])
plt.xticks(rotation=90)
plt.show()

In [None]:
len(rap_lavoro_attivati)

In [None]:
test = group_ist
test['TITOLOSTUDIO'] = test.index.get_level_values(0)
test.head()

In [None]:
test_ist = grado_istruzione_age[grado_istruzione_age['EDU_ATTAIN'].isin(titoli)]
test['OBS_VALUE'] = test_ist.groupby(['EDU_ATTAIN', 'TIME_PERIOD']).sum()['OBS_VALUE']

In [None]:
test.loc[:, 'CONTRATTO'] = test['CONTRATTO']/test['OBS_VALUE']
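
The division in the cell above relies on pandas aligning the two operands on their index labels rather than on row position. A minimal sketch with invented numbers (the counts are toy values, chosen so that every ratio is 0.1):

```python
import pandas as pd

# Toy counts, invented for illustration only
idx = pd.MultiIndex.from_tuples(
    [('LAUREA', 2018), ('LAUREA', 2019), ('LICENZA MEDIA', 2018), ('LICENZA MEDIA', 2019)],
    names=['TITOLOSTUDIO', 'DATA'],
)
contracts = pd.Series([10, 12, 30, 33], index=idx)
# Same keys, deliberately listed in a different order
population = pd.Series([300, 100, 330, 120], index=idx[[2, 0, 3, 1]])

# Division aligns on the index labels, not on row position
ratio = contracts / population
print(ratio)
```

Because of this alignment, any key that appears in only one of the two series produces `NaN`, so it is worth checking the result for missing values before plotting.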

In the following plot we see, for each educational level, the ratio between the number of started contracts and the total population with the same educational level.

In [None]:
test = test[['CONTRATTO', 'TITOLOSTUDIO']]
test.unstack().plot(kind='bar', stacked=False)
plt.xticks(rotation=90)
plt.show()

Surprisingly, people without a qualification are the most hired relative to their total number. We can analyze the dataset further, in particular the values about people who do not have any qualification.

First, let's see which types of contract are most frequent for them.

In [None]:
df = rap_lavoro_attivati[['CONTRATTO', 'TITOLOSTUDIO']]
df = df[df['TITOLOSTUDIO'] == 'NESSUN TITOLO DI STUDIO']
df = df.groupby(['CONTRATTO']).count()
df.plot(kind='bar', legend=False)
plt.xlabel('Contratti Nessun titolo di studio')
plt.title('Numero di persone senza titolo di studio divise per tipo di contratto')
plt.show()

The vast majority of the contracts are fixed-term contracts. This could mean that many people start more than one contract per year, so multiple contracts often belong to the same person.

We can also try to understand at which age people without an educational qualification are hired compared to the others.

In [None]:
df = rap_lavoro_attivati[['CONTRATTO', 'ETA', 'TITOLOSTUDIO']]
df = df[df['TITOLOSTUDIO'] == 'NESSUN TITOLO DI STUDIO']
df = df[['ETA', 'CONTRATTO']]
df = df.groupby(['ETA']).count()

df.plot(kind='bar', legend=False, stacked=False, figsize=(12,3))
plt.ylabel('Contratti')
plt.xlabel('Età senza titolo di studio')
plt.show()
###########################################################################
df = rap_lavoro_attivati[['CONTRATTO', 'ETA', 'TITOLOSTUDIO']]
df = df[df['TITOLOSTUDIO'] != 'NESSUN TITOLO DI STUDIO']
df = df[['ETA', 'CONTRATTO']]
df = df.groupby(['ETA']).count()

df.plot(kind='bar', legend=False, stacked=False, figsize=(12,3))
plt.ylabel('Contratti')
plt.xlabel('Età con titolo di studio')
plt.show()

The plot for people *with* a qualification has a maximum at age 25, then decreases quickly until 30 and more slowly after that. The plot for people *without* a qualification is very different: it stays high until 43 and then starts decreasing slowly. This means that people without schooling are hired over a much wider age range, and so many more of them are hired.

Now we can plot the number of people hired with an open-ended contract, divided by age and educational qualification, to stress once again the importance of these two features.

In [None]:
df = rap_lavoro_attivati[['CONTRATTO', 'ETA', 'TITOLOSTUDIO']]
df = df[df['TITOLOSTUDIO'] == 'NESSUN TITOLO DI STUDIO']
df = df[df['CONTRATTO'] == 'LAVORO A TEMPO INDETERMINATO'][['ETA', 'CONTRATTO']]
df = df.groupby(['ETA']).count()

df.plot(kind='bar', legend=False, stacked=False, figsize=(12,3))
plt.ylabel('Contratti tempo indeterminato')
plt.xlabel('Età senza titolo di studio')
plt.show()
###########################################################################
df = rap_lavoro_attivati[['CONTRATTO', 'ETA', 'TITOLOSTUDIO']]
df = df[df['TITOLOSTUDIO'] == 'LAUREA']
df = df[df['CONTRATTO'] == 'LAVORO A TEMPO INDETERMINATO'][['ETA', 'CONTRATTO']]
df = df.groupby(['ETA']).count()

df.plot(kind='bar', legend=False, stacked=False, figsize=(12,3))
plt.ylabel('Contratti tempo indeterminato')
plt.xlabel('Età laureati')
plt.show()
###########################################################################
df = rap_lavoro_attivati[['CONTRATTO', 'ETA', 'TITOLOSTUDIO']]
df = df[df['TITOLOSTUDIO'] == 'DIPLOMA DI ISTRUZIONE SECONDARIA SUPERIORE']
df = df[df['CONTRATTO'] == 'LAVORO A TEMPO INDETERMINATO'][['ETA', 'CONTRATTO']]
df = df.groupby(['ETA']).count()

df.plot(kind='bar', legend=False, stacked=False, figsize=(12,3))
plt.ylabel('Contratti tempo indeterminato')
plt.xlabel('Età diplomati')
plt.show()

In general, the lower the schooling level, the higher the chance of getting an open-ended contract *after* the age of 30; the higher the educational qualification, the higher the chance of obtaining an open-ended contract *before* 30.

In the next cell we plot a graph comparing the number of contracts activated with the total population in 2018.

In [None]:
import datetime as dt

activate2018 = rap_lavoro_attivati[rap_lavoro_attivati['DATA'] == 2018].copy()
bins = [0, 25, 50, 65, 200]
labels = ['Y9-24', 'Y25-49', 'Y50-64', 'Y_GE65']
# right=False gives the intervals [0, 25), [25, 50), [50, 65), [65, 200), matching the age classes of the second dataset
activate2018['agerange'] = pd.cut(activate2018['ETA'], bins, labels=labels, right=False)

istr2018 = grado_istruzione_age[grado_istruzione_age['TIME_PERIOD'] == 2018]
total = istr2018.groupby('AGE_NOCLASS')['OBS_VALUE'].sum()

# Reindex both series on the same label order so that paired bars refer to the same age class
testAge = activate2018['agerange'].value_counts().reindex(labels)
newContract = np.log(testAge.values)
totalPopulation = np.log(total.reindex(labels).values)

x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, newContract, width, label='New Contract')
rects2 = ax.bar(x + width/2, totalPopulation, width, label='Total Population')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('log number')
ax.set_title('Activated contracts vs total population (2018, log scale)')
ax.set_xticks(x)
ax.set_xticklabels(labels=labels,rotation=45,
    horizontalalignment='right');
ax.legend()



fig.tight_layout()

plt.show()
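
One detail of the binning above is worth checking. By default `pd.cut` builds right-closed intervals, so with the bins `[0, 25, 50, 65, 200]` an age of exactly 25 falls in the first bin even though its label reads `Y9-24`. The sketch below compares the default with `right=False`, which makes the intervals left-closed. Whether the age classes are meant to be left-closed is an assumption suggested by the label names, and the `toy_` names are introduced only for this illustration.

```python
import pandas as pd

toy_bins = [0, 25, 50, 65, 200]
toy_labels = ['Y9-24', 'Y25-49', 'Y50-64', 'Y_GE65']
toy_ages = [24, 25, 49, 50, 64, 65]

# Default: intervals are (0, 25], (25, 50], ... so 25 lands in 'Y9-24'
right_closed = list(pd.cut(toy_ages, toy_bins, labels=toy_labels, include_lowest=True))

# right=False: intervals are [0, 25), [25, 50), ... so 25 lands in 'Y25-49'
left_closed = list(pd.cut(toy_ages, toy_bins, labels=toy_labels, right=False))

print(right_closed)
print(left_closed)
```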

In the next plots we will see the effect of age alone on the type of contract.

In [None]:
df = rap_lavoro_attivati[['ETA', 'CONTRATTO']]
df = df[df['CONTRATTO'] == 'TIROCINIO']
df = df.groupby(['ETA']).count()
plt.figure(figsize=(12, 3))
plt.xticks(ticks=range(15,66), rotation=90)
plt.ylabel('Tirocinio')
plt.bar(df.index, df["CONTRATTO"], width=0.4)
plt.show()
###########################
df = rap_lavoro_attivati[['ETA', 'CONTRATTO']]
df = df[df['CONTRATTO'] == 'LAVORO A TEMPO INDETERMINATO']
df = df.groupby(['ETA']).count()
plt.figure(figsize=(12, 3))
plt.xticks(ticks=range(15,66), rotation=90)
plt.ylabel('Tempo indeterminato')
plt.bar(df.index, df["CONTRATTO"], width=0.4)
plt.show()
###########################
df = rap_lavoro_attivati[['ETA', 'CONTRATTO']]
df = df[df['CONTRATTO'] == 'LAVORO A TEMPO DETERMINATO']
df = df.groupby(['ETA']).count()
plt.figure(figsize=(12, 3))
plt.xticks(ticks=range(15,66), rotation=90)
plt.ylabel('Tempo determinato')
plt.bar(df.index, df["CONTRATTO"], width=0.4)
plt.show()

###########################
df = rap_lavoro_attivati[['ETA', 'CONTRATTO']]
df = df[df['CONTRATTO'] == 'LAVORO INTERMITTENTE']
df = df.groupby(['ETA']).count()
plt.figure(figsize=(12, 3))
plt.xticks(ticks=range(15,66), rotation=90)
plt.ylabel('Lavoro intermittente')
plt.bar(df.index, df["CONTRATTO"], width=0.4)
plt.show()

As expected, the better job contracts are obtained around the age of 30. Interestingly, internships have two peaks, at 19 and 24, which are the ages at which people usually finish high school and university, respectively.

In [None]:
#TODO description of condizione professionale

In [None]:
condizione_professionale_age = pd.read_csv(PurePath('dataset', 'Condizione professionale per età - Lombardia.csv'),low_memory=False)
condizione_professionale_age.head()

In [None]:
condizione_professionale_age = condizione_professionale_age[['REF_AREA', 'GENDER', 'AGE_NOCLASS', 'CUR_ACT_STAT', 'TIME_PERIOD', 'OBS_VALUE']]
condizione_professionale_age.head()

In [None]:
condizione_professionale_age = condizione_professionale_age[condizione_professionale_age.REF_AREA.isin(area_codes)]

condizione_professionale_age.loc[:, 'REF_AREA'] = condizione_professionale_age['REF_AREA'].apply(lambda x: get_provincia_by_code(x)) 
condizione_professionale_age.head()

Regarding the meaning of the values in `CUR_ACT_STAT`:
- $22$: labor force;
    - $1$: employed;
    - $12$: unemployed;
- $23$: non-labor force.

Thus, for our analysis, we can keep the labor force only. 

In [None]:
condizione_professionale_age = condizione_professionale_age[condizione_professionale_age['CUR_ACT_STAT'] == 22]
condizione_professionale_age.head()

For now, we can consider only the overall number and not the age range.

In [None]:
condizione_professionale_age = condizione_professionale_age[condizione_professionale_age['AGE_NOCLASS'] == 'Y_GE15']
condizione_professionale_age.head()

In [None]:
df = condizione_professionale_age[['GENDER', 'TIME_PERIOD', 'OBS_VALUE']].copy()
df = df[df['GENDER'] != 'T']
df = df.groupby(['GENDER', 'TIME_PERIOD']).sum()
df.unstack().plot(kind='pie', subplots=True, autopct='%1.1f%%', startangle=90)

The same holds for `GENDER`: we can consider the total number of people that could potentially work.

In [None]:
condizione_professionale_age = condizione_professionale_age[condizione_professionale_age['GENDER'] == 'T']
condizione_professionale_age.head()

<a id='ml_model'></a>
# Building a machine learning model
In this section we will see three different machine learning algorithms and decide which is the best for our goal.
Before creating the training and test sets we must encode the data. To do this we will use two approaches: the first is a simple encoding, in which a number is assigned to every distinct value of a feature; the second tries to assign, when possible, a numeric value that carries some meaning to each distinct value of a feature.

In [None]:
import matplotlib as mtl
import matplotlib.pyplot as plt
import matplotlib.figure as fig
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

<a id='feature_encoding'></a>
## Feature encoding

The first step is to create a new dataset to which we will apply all the needed changes. We add a new column for every encoding we perform, so in the end this dataset contains the original values plus the encoded ones. This choice makes it possible to trace each encoded value back to its original value.

In [None]:
model_dt= rap_lavoro_attivati[['DATA', 'GENERE', 'ETA', 'agerange', 'TITOLOSTUDIO', 'CONTRATTO', 'MODALITALAVORO','PROVINCIAIMPRESA','SETTOREECONOMICODETTAGLIO']].copy(deep=True)

<a id='simp_encoding'></a>
### Simple encoding
In this case we simply assign a numeric value to each unique value. All fields are converted to float, except `CONTRATTO`, which is converted to int and shifted so that its values start from zero. This is done to avoid relying on a deprecated method.

In [None]:
model_dt['RANKGENERE'] = model_dt['GENERE'].rank(method='dense', ascending=False).astype('float')
model_dt['RANKTITOLOSTUDIO'] = model_dt['TITOLOSTUDIO'].rank(method='dense', ascending=False).astype('float')
model_dt['RANKMODALITALAVORO'] = model_dt['MODALITALAVORO'].rank(method='dense', ascending=False).astype('float')
model_dt['RANKPROVINCIAIMPRESA'] = model_dt['PROVINCIAIMPRESA'].rank(method='dense', ascending=False).astype('float')
model_dt['RANKagerange'] = model_dt['agerange'].rank(method='dense', ascending=False).astype('float')
model_dt['RANKSETTOREECONOMICODETTAGLIO'] = model_dt['SETTOREECONOMICODETTAGLIO'].rank(method='dense', ascending=False).astype('float')
model_dt['RANKCONTRATTO'] = model_dt['CONTRATTO'].rank(method='dense', ascending=False).astype('int')
model_dt['RANKCONTRATTO'] = model_dt['RANKCONTRATTO'].sub(1)
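
To make the effect of `rank(method='dense', ascending=False)` concrete, the sketch below applies it to a toy column. Distinct values are ordered (here alphabetically, in descending order) and numbered 1, 2, 3, ... without gaps; subtracting 1 yields zero-based class labels. The toy rows and the names `toy_contratto` and `zero_based` are illustrative only.

```python
import pandas as pd

# Toy column with three distinct contract types
toy_contratto = pd.Series(['TIROCINIO',
                           'LAVORO A TEMPO DETERMINATO',
                           'TIROCINIO',
                           'LAVORO INTERMITTENTE'])

# Dense ranking: equal values share a rank and ranks have no gaps
dense_rank = toy_contratto.rank(method='dense', ascending=False).astype('int')
zero_based = dense_rank.sub(1)  # shift so that labels start from zero

print(pd.DataFrame({'CONTRATTO': toy_contratto, 'RANKCONTRATTO': zero_based}))
```

The resulting numbers only identify the category: their order reflects the alphabetical order of the labels, not any property of the contracts.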

<a id='elab_encoding'></a>
### Elaborate encoding
For this encoding, the province column is replaced with the number of people living in that area in the same year (`MyENPROVINCIAIMPRESA`), the study title with the minimum age necessary to acquire it (`MyENTITOLOSTUDIO`), and the economic sector and work modality with the number of occurrences of each value (`MyENSETTOREECONOMICODETTAGLIO` and `MyENMODALITALAVORO`).

#### Province column

To obtain the number of people we will use the dataset `Grado_istruzione_per_età_Lombardia`, from which we will extract the population of every Lombardy province, create a map and apply it to `model_dt`.

In [None]:
model_dt.index = range(0,len(model_dt))
grado_istruzione_age.loc[:, 'REF_AREA'] = grado_istruzione_age['REF_AREA'].str.upper() 
to_zip = grado_istruzione_age[['REF_AREA', 'TIME_PERIOD', 'OBS_VALUE']].groupby(['REF_AREA', 'TIME_PERIOD']).sum().copy()
province_map = to_zip.to_dict().get('OBS_VALUE')
model_dt['MyENPROVINCIAIMPRESA'] = pd.Series(model_dt[['PROVINCIAIMPRESA','DATA']].itertuples(index=False, name=None)).map(province_map)
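
The key idea in the cell above is that grouping by two columns and calling `to_dict()` yields a dictionary keyed by `(province, year)` tuples. The sketch below reproduces this on invented population figures and performs the lookup with a plain list comprehension instead of `Series.map`; the numbers and the `toy_` names are made up for illustration.

```python
import pandas as pd

# Invented figures: two age classes per province and year
toy_population = pd.DataFrame({
    'REF_AREA': ['BERGAMO', 'BERGAMO', 'MILANO', 'MILANO'],
    'TIME_PERIOD': [2018, 2018, 2018, 2018],
    'OBS_VALUE': [100, 150, 400, 600],
})

# Summing per (province, year) gives a dict with tuple keys
toy_map = (toy_population.groupby(['REF_AREA', 'TIME_PERIOD'])['OBS_VALUE']
           .sum()
           .to_dict())

toy_contracts = pd.DataFrame({'PROVINCIAIMPRESA': ['MILANO', 'BERGAMO', 'COMO'],
                              'DATA': [2018, 2018, 2018]})

# Look up each (province, year) pair; pairs absent from the map yield a missing value
toy_contracts['population'] = [
    toy_map.get(key)
    for key in zip(toy_contracts['PROVINCIAIMPRESA'], toy_contracts['DATA'])
]
print(toy_contracts)
```

Rows whose pair is not in the map (here `COMO`) end up with a missing value, so it is worth checking the encoded column for nulls before training.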

In [None]:
model_dt

#### Study title column
Here we manually create a dictionary that assigns to each study title the minimum age at which it can be obtained, then we apply it to the dataset.

In [None]:
rank_edu_map = {
    'NESSUN TITOLO DI STUDIO':0,
    'LICENZA ELEMENTARE':11,
    'LICENZA MEDIA':14,
    'DIPLOMA DI ISTRUZIONE SECONDARIA SUPERIORE':19,
    'LAUREA':22,
    'TITOLO DI STUDIO TERZIARIO DI SECONDO LIVELLO O DOTTORATO':24    
}


In [None]:
model_dt['MyENTITOLOSTUDIO'] = mapping(model_dt['TITOLOSTUDIO'], rank_edu_map)

####  Economic sector and work modality columns
For these two columns we simply count the occurrences of each value and replace the value with its count.

In [None]:
model_dt['MyENMODALITALAVORO'] = mapping(model_dt['MODALITALAVORO'], dict(model_dt['MODALITALAVORO'].value_counts()))
model_dt['MyENSETTOREECONOMICODETTAGLIO'] =  mapping(model_dt['SETTOREECONOMICODETTAGLIO'], dict(model_dt['SETTOREECONOMICODETTAGLIO'].value_counts()))
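
The frequency encoding used above can be sketched on a toy column. The example uses `Series.map` directly rather than the notebook's `mapping` helper, whose definition is not shown in this section; `'ALTRO'` is a made-up category used only for illustration.

```python
import pandas as pd

# Toy column; 'ALTRO' is a made-up category
toy_modalita = pd.Series(['TEMPO PIENO', 'TEMPO PIENO', 'ALTRO', 'TEMPO PIENO'])

# value_counts() maps each category to its number of occurrences
toy_counts = toy_modalita.value_counts()

# Replace every value with the frequency of its category
toy_encoded = toy_modalita.map(toy_counts)
print(toy_encoded.tolist())
```

Note that two categories with the same number of occurrences receive the same encoded value and become indistinguishable to the model.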

The remaining columns keep the previous encoding, because for age group and gender there is no additional information to preserve.

<a id='models_trainig'></a>
# Training the models
In this section we divide our data into training and test sets with `train_test_split` and take a first look at the model performance. This work is repeated for the two encodings.

In [None]:
X = model_dt[['RANKGENERE', 'RANKTITOLOSTUDIO', 'ETA', 'RANKPROVINCIAIMPRESA','RANKSETTOREECONOMICODETTAGLIO', 'RANKMODALITALAVORO']]
y = model_dt['RANKCONTRATTO']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)
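
`train_test_split` is called above without a `test_size`, so scikit-learn's default applies and 25% of the rows are held out for testing. The sketch below shows this on toy arrays; the `toy`-suffixed names are introduced here so that the notebook's own `X` and `y` are not overwritten.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 8 samples, 2 features
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# No test_size given: 25% of the samples go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=100)
print(len(X_tr), len(X_te))
```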

## Simple encoding

### Multi layer perceptron

In [None]:
mlpSimpModel = MLPClassifier(random_state=1, max_iter=20, hidden_layer_sizes=(24,24), early_stopping = True)
mlpSimpModel = mlpSimpModel.fit(X_train, y_train)

#prediction and probability
mlpSimpPred = mlpSimpModel.predict(X_test)
mlp_proba = mlpSimpModel.predict_proba(X_test)

#Report metrics
simpReportMLP = classification_report(y_test, mlpSimpPred,output_dict= True, zero_division=0)
accMlp = accuracy_score(y_test, mlpSimpPred)
print('Accuracy:')
print(accMlp)

### Random forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

simpForest = RandomForestClassifier(n_estimators=24, max_depth=6, random_state=0,max_features = None,n_jobs = -1)
simpForest.fit(X_train, y_train)
    
forestSimpPred = simpForest.predict(X_test)
reportSimpForest = classification_report(y_test, forestSimpPred,output_dict= True, zero_division=0)

score = simpForest.score(X_test, y_test)

print('Accuracy:')
print(score)

### XGboost

In [None]:
# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

import xgboost as xgb

xgSimp = xgb.XGBClassifier(learning_rate = 0.3,
                max_depth =4, alpha = 2, n_estimators = 35, eval_metric='mlogloss',use_label_encoder=False)
#Fit the model
xgSimp.fit(X_train,y_train)
#xg_reg.save_model("categorical-model.json")

#Make predictions
xgSimpPred = xgSimp.predict(X_test)
xgSimpReport = classification_report(y_test, xgSimpPred,output_dict= True)

In [None]:
accuracy_score(y_test, xgSimpPred)

## Elaborate encoding

In [None]:
X = model_dt[['RANKGENERE', 'MyENTITOLOSTUDIO', 'ETA', 'MyENPROVINCIAIMPRESA','MyENSETTOREECONOMICODETTAGLIO', 'MyENMODALITALAVORO']]
y = model_dt['RANKCONTRATTO']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)

### Multi Layer Perceptron

In [None]:
mlpModel = MLPClassifier(random_state=1, max_iter=20, hidden_layer_sizes=(24,24), early_stopping = True)
mlpModel = mlpModel.fit(X_train, y_train)

#prediction and probability
mlp_pred = mlpModel.predict(X_test)
mlp_proba = mlpModel.predict_proba(X_test)

#Report metrics
reportMLP = classification_report(y_test, mlp_pred,output_dict= True, zero_division=0)
accMlp = accuracy_score(y_test, mlp_pred)
print('Accuracy:')
print(accMlp)

### Random Forest


In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=15, max_depth=5, random_state=0,max_features = None,n_jobs = -1)
model.fit(X_train, y_train)

forestPred = model.predict(X_test)
reportForest = classification_report(y_test, forestPred,output_dict= True, zero_division=0)
score = model.score(X_test, y_test)

In [None]:
print('Accuracy:')
print(score)

### XGboost

In [None]:
import xgboost as xgb
xg_class = xgb.XGBClassifier(learning_rate = 0.3,
                max_depth =4, alpha = 2, n_estimators = 35, eval_metric='mlogloss',use_label_encoder =False)
#Fit the model
xg_class.fit(X_train,y_train)
#xg_reg.save_model("categorical-model.json")

#Make predictions
preds = xg_class.predict(X_test)

xbreport = classification_report(y_test, preds,output_dict= True)

In [None]:
print('Accuracy:')
print(accuracy_score(y_test, preds))

<a id='model_evaluation'></a>
## Model Evaluation

### Accuracy Comparison
A first measure for evaluating machine learning models is the accuracy, that is, the number of correct predictions divided by the total number of predictions made.
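
As a minimal sketch of this definition, the function below computes the accuracy by hand on toy labels. The function and variable names are introduced only for this illustration; the notebook itself relies on scikit-learn's `accuracy_score` and `classification_report`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels: 4 of the 5 predictions are correct
y_true_toy = [0, 1, 2, 1, 0]
y_pred_toy = [0, 1, 2, 0, 0]
print(accuracy(y_true_toy, y_pred_toy))  # 0.8
```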

In [None]:
modelsAccuracy =[simpReportMLP['accuracy'],
reportSimpForest['accuracy'],
xgSimpReport['accuracy'],
reportMLP['accuracy'],
reportForest['accuracy'],
xbreport['accuracy']]

modelName = ['simple MLP', 'simple Forest', 'simple xgboost', 'mlp', 'Forest', 'xgboost']


x = np.arange(len(modelName))
width = 0.35

accuracyGraph, axis = plt.subplots()
accuracyGraph.dpi = 100
accuracyGraph.set_figheight(8)

# Draw one bar per model and annotate each bar with its accuracy
bars = axis.bar(x, modelsAccuracy, width, color='cornflowerblue')
axis.bar_label(bars, padding=3)

axis.set_title("Accuracy Comparison")
axis.set_xticks(x)
axis.set_xticklabels(labels=modelName, rotation=45, horizontalalignment='right')
axis.set(ylabel='Accuracy')

plt.show()

As the graph shows, the accuracy is almost the same for all models, with the exception of the multi layer perceptron with the custom encoding. A possible explanation is that this type of model works more easily with small values; alternatively, the simple encoding may create false relations by assigning the same value to different features.

### Precision Comparison
Because accuracy alone is not enough to evaluate a model, we now look at the precision, that is, the number of true positives divided by the sum of true and false positives. Considering these two metrics together makes it possible to better analyze the performance of the different models.
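
In a multi-class problem the precision is computed per class and then averaged; the cell below reads the `macro avg` row of the classification report, which is the unweighted mean over classes. The sketch that follows computes this by hand on toy labels, with names introduced only for this illustration.

```python
def macro_precision(y_true, y_pred):
    """Unweighted mean over classes of TP / (TP + FP)."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
        # A class that is never predicted gets precision 0 (as with zero_division=0)
        per_class.append(tp / (tp + fp) if tp + fp else 0.0)
    return sum(per_class) / len(per_class)

y_true_toy = [0, 1, 2, 1, 0]
y_pred_toy = [0, 1, 2, 0, 0]

# Class 0: 2 of 3 predictions correct; classes 1 and 2: all predictions correct
print(macro_precision(y_true_toy, y_pred_toy))
```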

In [None]:
modelsPrecision =[simpReportMLP['macro avg']['precision'],
reportSimpForest['macro avg']['precision'],
xgSimpReport['macro avg']['precision'],
reportMLP['macro avg']['precision'],
reportForest['macro avg']['precision'],
xbreport['macro avg']['precision']]

x = np.arange(len(modelName))
width = 0.35

precisonGraph, axis = plt.subplots()
precisonGraph.dpi = 100
precisonGraph.set_figheight(8)

# Draw one bar per model and annotate each bar with its precision
rects1 = axis.bar(x, modelsPrecision, width)
axis.bar_label(rects1, padding=3)

axis.set_title("Precision Comparison")
axis.set_xticks(x)
axis.set_xticklabels(labels=modelName, rotation=45, horizontalalignment='right')
axis.set(ylabel='Precision')
plt.show()

From this comparison it is clear that XGboost has the best performance and that there are some problems with the multi layer perceptron with the custom encoding, while the models that use random forest have similar performance.
As previously said, it is possible that the low variance of the values in the simple encoding leads the multi layer perceptron to make wrong assumptions about the relations between features, and that this produces better results than the model that uses our encoding.

Because the difference between the two models that use XGboost is minimal, we will use the one with the custom encoding to build an example application.

<a id='model_usage'></a>
## Possible application
The following part shows how the trained model can be used to predict a type of contract when given all the necessary information.
Here we feed the data directly in the code, but in a more realistic scenario the model could be deployed in the backend of a website, and a user could choose the parameters from the front end through, for example, a drop-down list; doing so would simplify the choice for the user and prevent spelling errors.

In [None]:
contrattoMap=dict(zip(model_dt.RANKCONTRATTO, model_dt.CONTRATTO))
modlavoroMap=dict(zip(model_dt.MODALITALAVORO, model_dt.MyENMODALITALAVORO))
settoreEcoMap=dict(zip(model_dt.SETTOREECONOMICODETTAGLIO, model_dt.MyENSETTOREECONOMICODETTAGLIO))
provinciaMap=dict(zip(model_dt.PROVINCIAIMPRESA, model_dt.MyENPROVINCIAIMPRESA))
agerangeMap=dict(zip(model_dt.agerange, model_dt.RANKagerange))
titoloStudioMap=dict(zip(model_dt.TITOLOSTUDIO, model_dt.MyENTITOLOSTUDIO))
genereMap=dict(zip(model_dt.GENERE, model_dt.RANKGENERE))

In [None]:
def findContract(genere, studyTitle, age, provincia, settoreEco, modLavoro):
    # Encode every input with the same maps used for the training data
    myJob = {
        'RANKGENERE': [genereMap[genere]],
        'MyENTITOLOSTUDIO': [titoloStudioMap[studyTitle]],
        'ETA': [age],
        'MyENPROVINCIAIMPRESA': [provinciaMap[provincia]],
        'MyENSETTOREECONOMICODETTAGLIO': [settoreEcoMap[settoreEco]],
        'MyENMODALITALAVORO': [modlavoroMap[modLavoro]],
    }
    # Predict the encoded contract class and translate it back to its label
    contract = contrattoMap[xg_class.predict(pd.DataFrame(myJob))[0]]
    return contract

### Testing the method

In [None]:
findContract('M','DIPLOMA DI ISTRUZIONE SECONDARIA SUPERIORE', 18, 'BERGAMO', 'I', 'TEMPO PIENO')
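
`findContract` raises a `KeyError` when a value is not present in one of the lookup maps, for example after a spelling error. With a drop-down list this should not happen, but a backend would still need to validate its input. The sketch below shows one possible way to do it; `encode_choice` and `toy_genere_map` are hypothetical names introduced for illustration and are not part of the notebook.

```python
def encode_choice(value, lookup, field_name):
    """Return the encoded value, or raise a readable error listing the valid options."""
    if value not in lookup:
        options = ', '.join(sorted(map(str, lookup)))
        raise ValueError(f"Unknown {field_name} '{value}'. Valid options: {options}")
    return lookup[value]

# Hypothetical stand-in for a map such as genereMap
toy_genere_map = {'M': 1.0, 'F': 2.0}

print(encode_choice('M', toy_genere_map, 'GENERE'))  # 1.0
try:
    encode_choice('X', toy_genere_map, 'GENERE')
except ValueError as err:
    print(err)
```

The same list of keys could also be used to populate the drop-down options, so that the front end and the model always agree on the admissible values.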

# Conclusion
In this project we have seen how data is imported, analyzed and cleaned. Then we formulated a question based on the data we had, and tried to answer it by building a machine learning model.

Because of the categorical nature of the data, we needed to encode it before feeding it to the machine learning algorithms. We implemented two different encodings to see whether a simple encoding performs better than a custom encoding in which we tried to give meaningful values to the features.

After training, testing and evaluating different algorithms, we found that the simple encoding performs better than the custom one in most cases, although it is possible that it creates false connections between features.

Finally, we implemented an example application that uses the trained model to answer the question.