### Creating a schema

In [1]:
# Import the whyqd module
import whyqd as _w

# Add descriptive details for the schema
details = {
        "name": "human-development-report",
        "title": "UN Human Development Report 2007 - 2008",
        "description": """
        In 1990 the first Human Development Report introduced a new approach for
        advancing human wellbeing. Human development, or the human development approach, is about
        expanding the richness of human life, rather than simply the richness of the economy in which
        human beings live. It is an approach focused on people, their opportunities and their choices."""
}
schema = _w.Schema()
schema.set_details(**details)


# Define the fields in our schema

fields = [
    {
        "name": "Country Name",
        "title": "Country Name",
        "type": "string",
        "description": "Official country names.",
        "constraints": {
            "required": True
        }
    },
    {
        "name": "HDI Category",
        "title": "HDI Category",
        "type": "string",
        "description": "Human Development Index Category derived from the HDI Rank.",
    },
    {
        "name": "Indicator Name",
        "title": "Indicator Name",
        "type": "string",
        "description": "Indicator described in the data series.",
    },
    {
        "name": "Reference",
        "title": "Reference",
        "type": "string",
        "description": "Reference to data source.",
    },
    {
        "name": "Year",
        "title": "Year",
        "type": "year",
        "description": "Year of release.",
    },
    {
        "name": "Values",
        "title": "Values",
        "type": "number",
        "description": "Value for the Year and Indicator Name.",
        "constraints": {
            "required": True
        }
    },
]
for field in fields:
    schema.set_field(**field)

In [2]:
# Access a field that has been set

schema.field("country_name")


{'name': 'country_name',
 'type': 'string',
 'constraints': {'required': True},
 'title': 'Country Name',
 'description': 'Official country names.'}
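
Note that the field was declared with the name "Country Name" but is retrieved as "country_name". The sketch below illustrates that kind of name normalisation. `to_snake_case` is a hypothetical helper written for this example, not whyqd's own function, and it assumes names are lower-cased with runs of non-alphanumeric characters replaced by underscores.

```python
import re


def to_snake_case(name: str) -> str:
    """Hypothetical helper: lower-case a name and join its words with underscores."""
    # Replace each run of non-alphanumeric characters with a single underscore
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


field_names = ["Country Name", "HDI Category", "Indicator Name", "Reference", "Year", "Values"]
print([to_snake_case(n) for n in field_names])
```

Under these assumptions the six field names map to the same terms that whyqd later lists as the schema terms for the method.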

In [3]:
# Save the schema
directory = "table3"
# You can also specify an optional filename
# If you omit it, the filename defaults to the schema name
filename = "new"
# If the file already exists, you must specify "overwrite=True" or you will get an error
schema.save(directory, filename=filename, overwrite=True)

True

### Creating a method
Methods are how you define the steps whyqd must perform to restructure your data and align them with your Schema. There is not much coding to do, but there are a lot of decisions to make.

In [6]:

# The following imports and settings give you wide, unwrapped output for your tables
from IPython.display import HTML, display
display(HTML("<style>pre { white-space: pre !important; }</style>"))

import numpy as np
import whyqd as _w

SCHEMA_SOURCE = "table3new.json"
DIRECTORY = "table3"
INPUT_DATA = [
   # Raw string, because the path contains a backslash
   r"C:/Users/PCHOME\Desktop/Jasminehelene/HDR 2007-2008 Table 03.xlsx"
]
method = _w.Method(SCHEMA_SOURCE, directory=DIRECTORY, input_data=INPUT_DATA)


In [7]:
# Show what the data look like
print(method.print_input_data())



Data id: eba4c7ba-72ad-4011-92c9-f464273ef815
Original source: C:/Users/PCHOME\Desktop/Jasminehelene/HDR 2007-2008 Table 03.xlsx

  ..  Unnamed: 0                                         Unnamed: 1    Unnamed: 2    Monitoring human development: enlarging people's choices …    Unnamed: 4    Unnamed: 5    Unnamed: 6    Unnamed: 7    Unnamed: 8    Unnamed: 9    Unnamed: 10    Unnamed: 11    Unnamed: 12    Unnamed: 13    Unnamed: 14    Unnamed: 15    Unnamed: 16    Unnamed: 17    Unnamed: 18    Unnamed: 19    Unnamed: 20    Unnamed: 21    Unnamed: 22    Unnamed: 23    Unnamed: 24    Unnamed: 25    Unnamed: 26    Unnamed: 27    Unnamed: 28    Unnamed: 29    Unnamed: 30
   0  3 Human and income poverty Developing countries           nan           nan                                                           nan           nan           nan           nan           nan           nan           nan            nan            nan            nan            nan            nan            nan        

In [8]:
# List the morph types available for restructuring tables before merging them
method.default_morph_types

['CATEGORISE', 'DEBLANK', 'DEDUPE', 'DELETE', 'MELT', 'REBASE', 'RENAME']

In [9]:
# See what each one does
method.default_morph_settings("CATEGORISE")

{'name': 'CATEGORISE',
 'title': 'Categorise',
 'type': 'morph',
 'description': 'Convert row-level categories into column categorisations.',
 'structure': ['rows', 'column_names']}

The standard way to write a morph is:

["MORPH_NAME", [rows], [columns], [column_names]]

In [10]:
# Use _id, or another variable name, since `id` is the name of a Python built-in function
_id = method.input_data[0]["id"]
df = method.input_dataframe(_id)
df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Monitoring human development: enlarging people's choices …,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30
0,3 Human and income poverty Developing countries,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


If you reach a point where you are completely confused, reset_input_data_morph(id) will delete all the morphs and let you start again:

In [11]:
#method.reset_input_data_morph(_id)

In [12]:
# Rebase the table so that it starts at the top of the actual data
method.add_input_data_morph(_id, ["REBASE", 11])

In [13]:
# We can get rid of the rows from 144 down to the end of the table.
# Take the value of the last index label, then add 1 so that the range includes it
rows = [int(i) for i in np.arange(144, df.index[-1]+1)]
method.add_input_data_morph(_id, ["DELETE", rows])
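
The same row list can be built with the built-in `range`, which already yields plain Python integers, so no conversion from NumPy integers is needed. A small sketch, assuming the last index label is 179 (the final row number listed in the DELETE morph printed later in this notebook):

```python
import json

# Assumption: the last index label of the dataframe is 179
last_index = 179

# range() excludes its stop value, hence the + 1
rows = list(range(144, last_index + 1))

print(rows[0], rows[-1], len(rows))

# Plain Python ints serialise to JSON without any extra handling
serialised = json.dumps(["DELETE", rows])
```

Plain integers matter if the morph is later written out as JSON, since NumPy integer types are not accepted by the standard `json` module.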

In [14]:
# Now name the remaining columns after their original names. Note that the reference columns were previously unlabelled

columns = [
    "HDI rank",
    "Country",
    "Human poverty index (HPI-1) - Rank",
    "Reference 1",
    "Human poverty index (HPI-1) - Value (%)",
    "Probability at birth of not surviving to age 40 (% of cohort) 2000-05",
    "Reference 2",
    "Adult illiteracy rate (% aged 15 and older) 1995-2005",
    "Reference 3",
    "Population not using an improved water source (%) 2004",
    "Reference 4",
    "Children under weight for age (% under age 5) 1996-2005",
    "Reference 5",
    "Population below income poverty line (%) - $1 a day 1990-2005",
    "Reference 6",
    "Population below income poverty line (%) - $2 a day 1990-2005",
    "Reference 7",
    "Population below income poverty line (%) - National poverty line 1990-2004",
    "Reference 8",
    "HPI-1 rank minus income poverty rank"
]
method.add_input_data_morph(_id, ["RENAME", columns])

In [15]:
df = method.input_dataframe(_id)
df.head()

Unnamed: 0,HDI rank,Country,Human poverty index (HPI-1) - Rank,Reference 1,Human poverty index (HPI-1) - Value (%),Probability at birth of not surviving to age 40 (% of cohort) 2000-05,Reference 2,Adult illiteracy rate (% aged 15 and older) 1995-2005,Reference 3,Population not using an improved water source (%) 2004,Reference 4,Children under weight for age (% under age 5) 1996-2005,Reference 5,Population below income poverty line (%) - $1 a day 1990-2005,Reference 6,Population below income poverty line (%) - $2 a day 1990-2005,Reference 7,Population below income poverty line (%) - National poverty line 1990-2004,Reference 8,HPI-1 rank minus income poverty rank
14,HIGH HUMAN DEVELOPMENT,,,,,,,,,,,,,,,,,,,
15,21,"Hong Kong, China (SAR)",..,,..,1.5,e,..,,..,,..,,..,,..,,..,,..
16,25,Singapore,7,,5.2,1.8,,7.5,,0,,3,,..,,..,,..,,..
17,26,Korea (Republic of),..,,..,2.5,,1.0,,8,,..,,<2,,<2,,..,,..
18,28,Cyprus,..,,..,2.4,,3.2,,0,,..,,..,,..,,..,,..


In [17]:
# Get the indices of the rows that hold categorical data

hdi_categories = ["HIGH HUMAN DEVELOPMENT", "MEDIUM HUMAN DEVELOPMENT", "LOW HUMAN DEVELOPMENT"]
rows = df[df["HDI rank"].isin(hdi_categories)].index
method.add_input_data_morph(_id, ["CATEGORISE", rows, "HDI category"])



ValueError: Task morph `CATEGORISE` has invalid structure `['rows', 'column_names']`.

In [18]:
type(rows)

pandas.core.indexes.numeric.Int64Index

In [19]:
method.add_input_data_morph(_id, ["CATEGORISE", list(rows), "HDI category"])
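
The first attempt fails because `rows` is a pandas index object rather than a plain list, and wrapping it in `list(...)` fixes it. The sketch below reproduces the selection on a toy frame with invented placeholder countries. It uses `Index.tolist()`, which, like `list(...)`, produces an ordinary list but also converts each label to a built-in Python int.

```python
import pandas as pd

# Toy frame with the same layout as the notebook: category labels sit in the "HDI rank" column
df = pd.DataFrame({
    "HDI rank": ["HIGH HUMAN DEVELOPMENT", 21, 25, "MEDIUM HUMAN DEVELOPMENT", 71],
    "Country": [None, "Country A", "Country B", None, "Country C"],
})
hdi_categories = ["HIGH HUMAN DEVELOPMENT", "MEDIUM HUMAN DEVELOPMENT", "LOW HUMAN DEVELOPMENT"]

# Selecting rows returns a pandas Index, not a list
index = df[df["HDI rank"].isin(hdi_categories)].index

# tolist() gives a plain list of built-in ints
rows = index.tolist()

print(type(index).__name__)
print(type(rows).__name__, rows)
```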

In [20]:
df = method.input_dataframe(_id)
df.head()

Unnamed: 0,HDI rank,Country,Human poverty index (HPI-1) - Rank,Reference 1,Human poverty index (HPI-1) - Value (%),Probability at birth of not surviving to age 40 (% of cohort) 2000-05,Reference 2,Adult illiteracy rate (% aged 15 and older) 1995-2005,Reference 3,Population not using an improved water source (%) 2004,...,Children under weight for age (% under age 5) 1996-2005,Reference 5,Population below income poverty line (%) - $1 a day 1990-2005,Reference 6,Population below income poverty line (%) - $2 a day 1990-2005,Reference 7,Population below income poverty line (%) - National poverty line 1990-2004,Reference 8,HPI-1 rank minus income poverty rank,HDI category
15,21,"Hong Kong, China (SAR)",..,,..,1.5,e,..,,..,...,..,,..,,..,,..,,..,HIGH HUMAN DEVELOPMENT
16,25,Singapore,7,,5.2,1.8,,7.5,,0,...,3,,..,,..,,..,,..,HIGH HUMAN DEVELOPMENT
17,26,Korea (Republic of),..,,..,2.5,,1.0,,8,...,..,,<2,,<2,,..,,..,HIGH HUMAN DEVELOPMENT
18,28,Cyprus,..,,..,2.4,,3.2,,0,...,..,,..,,..,,..,,..,HIGH HUMAN DEVELOPMENT
19,30,Brunei Darussalam,..,,..,3.0,,7.3,,..,...,..,,..,,..,,..,,..,HIGH HUMAN DEVELOPMENT


Most of these columns are actually indicators and can be pivoted into an Indicator column, with the Values assigned to a single column. This is called a MELT:

In [21]:

# Select all the columns to "melt"
columns = [
    "HDI rank",
    "Human poverty index (HPI-1) - Rank",
    "Human poverty index (HPI-1) - Value (%)",
    "Probability at birth of not surviving to age 40 (% of cohort) 2000-05",
    "Adult illiteracy rate (% aged 15 and older) 1995-2005",
    "Population not using an improved water source (%) 2004",
    "Children under weight for age (% under age 5) 1996-2005",
    "Population below income poverty line (%) - $1 a day 1990-2005",
    "Population below income poverty line (%) - $2 a day 1990-2005",
    "Population below income poverty line (%) - National poverty line 1990-2004",
    "HPI-1 rank minus income poverty rank"
]
method.add_input_data_morph(_id, ["MELT", columns, ["Indicator Name", "Indicator Value"]])
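
For readers who have not met the idea before, the wide-to-long reshaping that a melt performs can be sketched with plain pandas. This is only an illustration on an invented two-country table, and whyqd's MELT morph may differ in its details.

```python
import pandas as pd

# Toy wide table: one row per country, one column per indicator (values are made up)
wide = pd.DataFrame({
    "Country": ["Country A", "Country B"],
    "Indicator 1": [1.5, 1.8],
    "Indicator 2": [7.5, 1.0],
})

# Pivot the indicator columns into name/value pairs, keeping Country as the identifier
long = wide.melt(
    id_vars=["Country"],
    value_vars=["Indicator 1", "Indicator 2"],
    var_name="Indicator Name",
    value_name="Indicator Value",
)
print(long)
```

Each original row is repeated once per melted column, so two countries and two indicators give four rows.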

In [22]:
# Similarly, the References can also be pivoted into a separate column:
columns = [
    "Reference 1",
    "Reference 2",
    "Reference 3",
    "Reference 4",
    "Reference 5",
    "Reference 6",
    "Reference 7",
    "Reference 8",
]
method.add_input_data_morph(_id, ["MELT", columns, ["Reference Name", "Reference"]])

In [23]:
# Add a final DEBLANK just to be sure
method.add_input_data_morph(_id, ["DEBLANK"])

In [24]:
# Get the current state of the morphs and take a look:
df = method.input_dataframe(_id)
df.head()

Unnamed: 0,Indicator Value,Country,Indicator Name,HDI category,Reference Name,Reference
0,21,"Hong Kong, China (SAR)",HDI rank,HIGH HUMAN DEVELOPMENT,Reference 1,
1,25,Singapore,HDI rank,HIGH HUMAN DEVELOPMENT,Reference 1,
2,26,Korea (Republic of),HDI rank,HIGH HUMAN DEVELOPMENT,Reference 1,
3,28,Cyprus,HDI rank,HIGH HUMAN DEVELOPMENT,Reference 1,
4,30,Brunei Darussalam,HDI rank,HIGH HUMAN DEVELOPMENT,Reference 1,


In [25]:
print(method.help("merge"))


`merge` will join, in order from right to left, your input data on a common column.

To add input data, where `input_data` is a filename, or list of filenames:

	>>> method.add_input_data(input_data)

To remove input data, where `id` is the unique id for that input data:

	>>> method.remove_input_data(id)

Prepare an `order_and_key` list, where each dict in the list has:

	{{id: input_data id, key: column_name for merge}}

Run the merge by calling (and, optionally - if you need to overwrite an existing merge - setting
`overwrite_working=True`):

	>>> method.merge(order_and_key, overwrite_working=True)

To view your existing `input_data`:

	>>> method.input_data


Data id: eba4c7ba-72ad-4011-92c9-f464273ef815
Original source: C:/Users/PCHOME\Desktop/Jasminehelene/HDR 2007-2008 Table 03.xlsx

  ..  Unnamed: 0                                         Unnamed: 1    Unnamed: 2    Monitoring human development: enlarging people's choices …    Unnamed: 4    Unnamed: 5    Unnamed: 6    Unnamed:

# The cell below is the one that restarts the kernel, so there is probably an error somewhere

In [26]:
%time method.merge(overwrite_working=True)

Wall time: 8.24 s
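
This notebook has a single input file, and merge is called here without an `order_and_key` argument. When there are several inputs, the help text above describes a list with one dict per input. The sketch below builds such a list. The ids and the key column "Country" are placeholders chosen for illustration and do not come from this dataset's method.

```python
# Placeholder ids; in the notebook these would come from method.input_data
input_data = [
    {"id": "id-of-first-table"},
    {"id": "id-of-second-table"},
]

# One dict per input, each naming the column to merge on (assumed here to be "Country")
order_and_key = [{"id": item["id"], "key": "Country"} for item in input_data]

print(order_and_key)
# The list would then be passed as: method.merge(order_and_key, overwrite_working=True)
```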


The next step is the structure.



In [27]:
print(method.help("structure"))


`structure` is the core of the wrangling process and is the process where you define the actions
which must be performed to restructure your working data.

Create a list of methods of the form:

	{
		"schema_field1": ["action", "column_name1", ["action", "column_name2"]],
		"schema_field2": ["action", "column_name1", "modifier", ["action", "column_name2"]],
	}

The format for defining a `structure` is as follows::

	[action, column_name, [action, column_name]]

e.g.::

	["CATEGORISE", "+", ["ORDER", "column_1", "column_2"]]

This permits the creation of quite expressive wrangling structures from simple building
blocks.

The schema for this method consists of the following terms:

['country_name', 'hdi_category', 'indicator_name', 'reference', 'year', 'values']

The actions:

['CALCULATE', 'CATEGORISE', 'JOIN', 'NEW', 'ORDER', 'ORDER_NEW', 'ORDER_OLD', 'RENAME']

The columns from your working data:

['Indicator Value', 'Country', 'Indicator Name', 'HDI category', 'Reference Name', 'Ref

In [28]:

structure = {
    "country_name": ["RENAME", "Country"],
    "hdi_category": ["RENAME", "HDI category"],
    "indicator_name": ["RENAME", "Indicator Name"],
    "reference": ["RENAME", "Reference"],
    "values": ["RENAME", "Indicator Value"],
}
# Note the "**" in front of the parameter name
# This "unpacks" the dictionary so that all of its terms are visible to the function
method.set_structure(**structure)
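
The effect of `**` can be seen in isolation with a small stand-in function. `show_keywords` is a hypothetical function defined only for this illustration and is not part of whyqd.

```python
def show_keywords(**kwargs):
    """Hypothetical stand-in for any function that accepts arbitrary keyword arguments."""
    # Each dictionary key arrives as a keyword name, each value as its argument
    return sorted(kwargs)


structure = {
    "country_name": ["RENAME", "Country"],
    "values": ["RENAME", "Indicator Value"],
}

# These two calls are equivalent
unpacked = show_keywords(**structure)
explicit = show_keywords(country_name=["RENAME", "Country"], values=["RENAME", "Indicator Value"])

print(unpacked == explicit, unpacked)
```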

In [29]:
# Through all of this, whyqd has preserved your source data. It is now time to run your data transformation and
# save it:

method.transform(overwrite_output=True)
FILENAME = "hdi_report_exercise"
method.save(DIRECTORY, filename=FILENAME, overwrite=True)

In [30]:
# You can review your method as JSON output, using .settings for the whole method or .input_data_morphs(_id)
# for the morphs themselves:
method.input_data_morphs(_id)


[{'2b1d0544-1200-4f90-9d26-416cf9c28c78': ['DEBLANK']},
 {'1c024f61-3ce6-4b77-90c9-016f141f7ccb': ['DEDUPE']},
 {'cea1f353-6e7c-42df-8109-bf431e17fea1': ['REBASE', [11]]},
 {'51ceb2fa-4db3-482c-b15e-41943dfa6f3f': ['DELETE',
   [144,
    145,
    146,
    147,
    148,
    149,
    150,
    151,
    152,
    153,
    154,
    155,
    156,
    157,
    158,
    159,
    160,
    161,
    162,
    163,
    164,
    165,
    166,
    167,
    168,
    169,
    170,
    171,
    172,
    173,
    174,
    175,
    176,
    177,
    178,
    179]]},
 {'1bc6d13e-26bd-479e-b87c-346eac650d1c': ['RENAME',
   ['HDI rank',
    'Country',
    'Human poverty index (HPI-1) - Rank',
    'Reference 1',
    'Human poverty index (HPI-1) - Value (%)',
    'Probability at birth of not surviving to age 40 (% of cohort) 2000-05',
    'Reference 2',
    'Adult illiteracy rate (% aged 15 and older) 1995-2005',
    'Reference 3',
    'Population not using an improved water source (%) 2004',
    'Reference 4',

#### Validating and manipulating the data

In [31]:
# If you want to check whether your data are valid in whyqd, you can simply run one command:
%time method.validates


Wall time: 3.88 s


True

So what happens if you need to know what your data contain before you can use them? You can use pandas to explore your data.
Our new output file is available to us:

In [32]:
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\Users\PCHOME\Desktop\Jasminehelene\table3\output_3027fd53-e95b-4529-92b1-7ae869d0634a.csv")
df.head()

Unnamed: 0,year,country_name,hdi_category,indicator_name,reference,values
0,,"Hong Kong, China (SAR)",HIGH HUMAN DEVELOPMENT,HDI rank,,21
1,,Singapore,HIGH HUMAN DEVELOPMENT,HDI rank,,25
2,,Korea (Republic of),HIGH HUMAN DEVELOPMENT,HDI rank,,26
3,,Cyprus,HIGH HUMAN DEVELOPMENT,HDI rank,,28
4,,Brunei Darussalam,HIGH HUMAN DEVELOPMENT,HDI rank,,30


In [33]:
df.info()

# We can use pandas to simply replace what we don't like.
# We can also check for specific validation problems.


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2079 entries, 0 to 2078
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   year            0 non-null      float64
 1   country_name    2079 non-null   object 
 2   hdi_category    1892 non-null   object 
 3   indicator_name  2079 non-null   object 
 4   reference       693 non-null    object 
 5   values          2079 non-null   object 
dtypes: float64(1), object(5)
memory usage: 97.6+ KB
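
The values column is read as `object` because the source table mixes numbers with markers such as ".." and "<2", both of which appear in the earlier previews. One way to coerce such a column to numbers with pandas is sketched below on a toy series. Whether "<2" should become missing or some numeric bound is a decision for the analyst, and this sketch simply treats it as missing.

```python
import pandas as pd

# Toy sample of the kinds of entries seen in the values column
values = pd.Series(["21", "5.2", "..", "<2", "0"])

# Anything that cannot be parsed as a number becomes NaN
numeric = pd.to_numeric(values, errors="coerce")

print(numeric.dtype)
print(int(numeric.isna().sum()), "entries could not be parsed")
```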


In [34]:
# Just as you installed whyqd, you need to install PandasSchema from the terminal.
# pandas_schema provides a set of methods to parse and validate your data against a schema. We will only touch on it briefly.
# Let's see how that goes:




In [35]:
from pandas_schema import Column, Schema
from pandas_schema.validation import LeadingWhitespaceValidation, TrailingWhitespaceValidation, IsDtypeValidation, InListValidation
# We will only test these columns
columns = ["country_name", "hdi_category", "values"]
# And these categories
hdi_categories = ["HIGH HUMAN DEVELOPMENT", "MEDIUM HUMAN DEVELOPMENT", "LOW HUMAN DEVELOPMENT"]

schema = Schema([
   Column("country_name", [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
   Column("hdi_category", [InListValidation(hdi_categories)]),
   Column("values", [IsDtypeValidation(np.dtype(float)), IsDtypeValidation(np.dtype(int))])
])

errors = schema.validate(df[columns])

print(f"Number of errors: {len(errors)}")

# Just the first 10
for error in errors[:10]:
   print(error)

Number of errors: 189
The column values has a dtype of object which is not a subclass of the required type float64
The column values has a dtype of object which is not a subclass of the required type int32
{row: 112, column: "hdi_category"}: "nan" is not in the list of legal options (HIGH HUMAN DEVELOPMENT, MEDIUM HUMAN DEVELOPMENT, LOW HUMAN DEVELOPMENT)
{row: 113, column: "hdi_category"}: "nan" is not in the list of legal options (HIGH HUMAN DEVELOPMENT, MEDIUM HUMAN DEVELOPMENT, LOW HUMAN DEVELOPMENT)
{row: 114, column: "hdi_category"}: "nan" is not in the list of legal options (HIGH HUMAN DEVELOPMENT, MEDIUM HUMAN DEVELOPMENT, LOW HUMAN DEVELOPMENT)
{row: 115, column: "hdi_category"}: "nan" is not in the list of legal options (HIGH HUMAN DEVELOPMENT, MEDIUM HUMAN DEVELOPMENT, LOW HUMAN DEVELOPMENT)
{row: 116, column: "hdi_category"}: "nan" is not in the list of legal options (HIGH HUMAN DEVELOPMENT, MEDIUM HUMAN DEVELOPMENT, LOW HUMAN DEVELOPMENT)
{row: 117, column: "hdi_category
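
The error total can be cross-checked against the `df.info()` output shown earlier. The arithmetic below assumes that each missing hdi_category produces exactly one error and that the whitespace checks on country_name produce none.

```python
total_rows = 2079            # RangeIndex size reported by df.info()
non_null_categories = 1892   # non-null count reported for hdi_category

missing_categories = total_rows - non_null_categories
dtype_errors = 2             # the float64 and int32 messages for the values column

print(missing_categories, missing_categories + dtype_errors)
```

Under those assumptions the count matches the 189 errors reported. Since a column has a single dtype, the two dtype checks on values cannot both pass, so at least one dtype message would remain even after the column is converted to numbers.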

#### Publishing and citing the data

In [37]:
for part in method.citation.split(","):
   print(part)

2020-05-26
 UN Human Development Report 2007 - 2008
 29ea76d29f4756c0669a57009a46e14f724c75ee4e9df7058b0ed557179011647baa0e8317c5fde92c59cc8e9e8471186ce67830f56f96e3ed579522953eb7f9
 [input sources: C:/Users/PCHOME\Desktop/Jasminehelene/HDR 2007-2008 Table 03.xlsx
 7d95ebdb36966c7b97b7b4e578cac70ea89463e95f64ccada60cf15a76f29c68b56f64aca9e28b8042e3c9ce37522fc03a13d1a1e8b05eac6edf26e09e5c32d5]
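
The long hexadecimal strings in the citation appear to be 128 characters each, that is 512 bits, which is the digest size of algorithms such as BLAKE2b and SHA-512. This notebook does not say which algorithm whyqd uses, so the sketch below assumes BLAKE2b purely for illustration. It shows how a checksum of that length can be computed for a file, and `file_checksum` is a hypothetical helper.

```python
import hashlib
import os
import tempfile


def file_checksum(path: str, chunk_size: int = 65536) -> str:
    """Return a 512-bit hex digest of a file (BLAKE2b is an assumption, not whyqd's documented choice)."""
    digest = hashlib.blake2b()  # default digest size is 64 bytes, i.e. 128 hex characters
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not need to fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstrate on a throwaway file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example bytes")
checksum = file_checksum(tmp.name)
os.remove(tmp.name)

print(len(checksum))
```

A checksum of this kind lets anyone holding the same source file confirm that it is byte-for-byte identical to the one cited.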
