<h2>Match Ansible tasks to descriptions from the official documentation</h2>

<p>The work is organized as follows:</p>
<ul>
    <li>
        <p>We scrape the official Ansible documentation website to get the currently supported modules. The result is a dictionary in which each key is a module name and each value is that module's description. The function <i>get_ansible_modules</i> returns this dictionary.</p>
    </li>
    <li>
        <p>Using the class <i>identify_ansible_modules</i>, we search within the method description of each Ansible task. In detail:
        </p>
            <ul>
                <li><p>For each key of the method description, we check whether it appears in the official Ansible documentation and, if so, return its respective text. The resulting text is saved in a new column.</p></li>
                <li><p>For each value of each key in the method description:</p></li>
                <li>
                    <ul>
                        <li>If the value is a nested dictionary, we check whether any of its keys are part of the official Ansible documentation and return their (concatenated) text in a new column.</li>
                        <li>If the value is a nested dictionary in which none of the keys appears in the official Ansible documentation but which contains the key 'name', we return the text stored under 'name'.</li>
                        <li>If the value is a nested dictionary in which none of the keys appears in the official Ansible documentation and which does not contain the key 'name', we return its keys as a single string, separated by spaces.</li>
                        <!-- <li>if the value is a string we check if it is an ansible variable (i.e. contains the characters '{{' and '}}') and return the label 'variable'. Otherwise we return the string as it is.</li> -->
                    </ul>
                </li>
            </ul>
    </li>
</ul>
<p>The resulting dataframe contains <strong>16348 rows (i.e. 89% of the initial tasks fed for identification)</strong>, in which the keys of the method descriptions were matched with the respective Ansible module documentation.</p>
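
<p>The matching rules listed above can be sketched as a single function. This is only an illustration of the prose rules: the function name <i>describe_task</i> and its return shape are assumptions, and the real <i>identify_ansible_modules</i> class (not shown here) may behave differently, for example in what it returns when no nested key matches.</p>

```python
def describe_task(method_description, module_docs):
    """Return (key_text, value_text) for one task, following the prose rules.

    Hypothetical helper: `module_docs` maps module name -> description,
    as returned by get_ansible_modules in the notebook.
    """
    key_texts, value_texts = [], []
    for key, value in method_description.items():
        # Rule 1: a top-level key that is a documented module contributes its description.
        if key in module_docs:
            key_texts.append(module_docs[key])
        # Only nested dictionaries are inspected further.
        if not isinstance(value, dict):
            continue
        # Rule 2a: nested keys that are documented modules contribute their descriptions.
        nested_hits = [module_docs[k] for k in value if k in module_docs]
        if nested_hits:
            value_texts.extend(nested_hits)
        # Rule 2b: otherwise fall back to the text stored under 'name'.
        elif 'name' in value:
            value_texts.append(str(value['name']))
        # Rule 2c: otherwise return the nested keys, separated by spaces.
        else:
            value_texts.append(' '.join(value))
    return ' '.join(key_texts), ' '.join(value_texts)


# Small hand-made documentation dictionary, used only for this example.
docs = {
    'service': 'Manage services',
    'file': 'Manage files and file properties',
    'group': 'Add or remove groups',
}
example = describe_task({'service': {'name': 'datadog-agent', 'state': 'restarted'}}, docs)
```

<p>In the notebook, a function like this could be applied row by row with <i>df['method_description'].apply(...)</i> to fill the two text columns.</p>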

In [1]:
import pickle
import pandas as pd
import numpy as np
from ansible_modules import get_ansible_modules, top10_modules_used
from ident_ans_mods import identify_ansible_modules
from collections import Counter
from itertools import chain
%load_ext autoreload
%autoreload 2

In [2]:
with open('tasks_from_repos.pkl', 'rb') as input_file:
    tasks = pickle.load(input_file)   
tasks.columns

Index(['task_name', 'method_description', 'file_name', 'repo_name',
       'is_roles'],
      dtype='object')

In [3]:
descriptions = tasks[['task_name','method_description']]

In [4]:
mda = get_ansible_modules('https://docs.ansible.com/ansible/latest/modules/list_of_all_modules.html')
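
<p>The implementation of <i>get_ansible_modules</i> is not shown in this notebook. As a rough sketch of the idea, the snippet below parses a small inline HTML sample with the standard library instead of fetching the page. It assumes that each module appears as a link whose text has the form "name, en dash, description"; that page layout, the class name <i>ModuleListParser</i> and the sample HTML are assumptions made for illustration.</p>

```python
from html.parser import HTMLParser

# Assumed separator between module name and description in the link text.
SEPARATOR = ' \u2013 '


class ModuleListParser(HTMLParser):
    """Collect {module_name: description} from link texts (hypothetical layout)."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self._buffer = []
        self.modules = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_link:
            self._in_link = False
            text = ''.join(self._buffer).strip()
            name, sep, description = text.partition(SEPARATOR)
            # Links without the separator (navigation, index) are skipped.
            if sep:
                self.modules[name.strip()] = description.strip()


# Inline sample standing in for the downloaded page.
sample_html = """
<ul>
  <li><a href="service_module.html">service \u2013 Manage services</a></li>
  <li><a href="file_module.html">file \u2013 Manage files and file properties</a></li>
  <li><a href="index.html">Back to index</a></li>
</ul>
"""

parser = ModuleListParser()
parser.feed(sample_html)
modules = parser.modules
```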

In [5]:
df_desc_mod = descriptions.copy()
df_desc_mod = df_desc_mod.reset_index()
df_desc_mod = df_desc_mod.rename(columns={'index':'repo_count'})
df_desc_mod

Unnamed: 0,repo_count,task_name,method_description
0,0,restart datadog-agent,"{'service': {'name': 'datadog-agent', 'state':..."
1,1,restart datadog-agent-win,"{'win_service': {'name': 'datadogagent', 'stat..."
2,2,populate service facts,{'service_facts': None}
3,3,"add ""{{ datadog_user }}"" user to additional gr...","{'user': 'name=""{{ datadog_user }}"" groups=""{{..."
4,4,Create Datadog agent config directory,"{'file': {'dest': '/etc/datadog-agent', 'state..."
...,...,...,...
18281,243,Install expect (EL5),"{'package': {'name': '{{ item }}', 'state': '{..."
18282,244,Debian/Ubuntu | Remove Wazuh repository.,{'apt_repository': {'repo': 'deb https://packa...
18283,245,Debian/Ubuntu | Remove Nodejs repository.,{'apt_repository': {'repo': 'deb https://deb.n...
18284,246,RedHat/CentOS/Fedora | Remove NodeJS repositor...,"{'yum_repository': {'name': 'NodeJS', 'state':..."


In [6]:
inst = identify_ansible_modules(df_desc_mod,mda)

In [7]:
df = inst.create_text_cols()

In [8]:
df

Unnamed: 0,repo_count,task_name,method_description,mod_keys_found,mod_values_found,key_module_text,value_module_text
0,0,restart datadog-agent,"{'service': {'name': 'datadog-agent', 'state':...",[service],[datadog-agent],Manage services,datadog-agent
1,1,restart datadog-agent-win,"{'win_service': {'name': 'datadogagent', 'stat...",[win_service],[datadogagent],Manage and query Windows services,datadogagent
2,2,populate service facts,{'service_facts': None},[service_facts],[],Return service state information as fact data,
3,3,"add ""{{ datadog_user }}"" user to additional gr...","{'user': 'name=""{{ datadog_user }}"" groups=""{{...",[user],[],Manage user accounts,
4,4,Create Datadog agent config directory,"{'file': {'dest': '/etc/datadog-agent', 'state...",[file],[],Manage files and file properties,
...,...,...,...,...,...,...,...
18281,243,Install expect (EL5),"{'package': {'name': '{{ item }}', 'state': '{...",[package],[{{ item }}],Generic OS package manager,{{ item }}
18282,244,Debian/Ubuntu | Remove Wazuh repository.,{'apt_repository': {'repo': 'deb https://packa...,[apt_repository],[],Add and remove APT repositories,
18283,245,Debian/Ubuntu | Remove Nodejs repository.,{'apt_repository': {'repo': 'deb https://deb.n...,[apt_repository],[],Add and remove APT repositories,
18284,246,RedHat/CentOS/Fedora | Remove NodeJS repositor...,"{'yum_repository': {'name': 'NodeJS', 'state':...",[yum_repository],[NodeJS],Add or remove YUM repositories,NodeJS


In [9]:
df[df['mod_keys_found'].apply(len).gt(0)]

Unnamed: 0,repo_count,task_name,method_description,mod_keys_found,mod_values_found,key_module_text,value_module_text
0,0,restart datadog-agent,"{'service': {'name': 'datadog-agent', 'state':...",[service],[datadog-agent],Manage services,datadog-agent
1,1,restart datadog-agent-win,"{'win_service': {'name': 'datadogagent', 'stat...",[win_service],[datadogagent],Manage and query Windows services,datadogagent
2,2,populate service facts,{'service_facts': None},[service_facts],[],Return service state information as fact data,
3,3,"add ""{{ datadog_user }}"" user to additional gr...","{'user': 'name=""{{ datadog_user }}"" groups=""{{...",[user],[],Manage user accounts,
4,4,Create Datadog agent config directory,"{'file': {'dest': '/etc/datadog-agent', 'state...",[file],[],Manage files and file properties,
...,...,...,...,...,...,...,...
18281,243,Install expect (EL5),"{'package': {'name': '{{ item }}', 'state': '{...",[package],[{{ item }}],Generic OS package manager,{{ item }}
18282,244,Debian/Ubuntu | Remove Wazuh repository.,{'apt_repository': {'repo': 'deb https://packa...,[apt_repository],[],Add and remove APT repositories,
18283,245,Debian/Ubuntu | Remove Nodejs repository.,{'apt_repository': {'repo': 'deb https://deb.n...,[apt_repository],[],Add and remove APT repositories,
18284,246,RedHat/CentOS/Fedora | Remove NodeJS repositor...,"{'yum_repository': {'name': 'NodeJS', 'state':...",[yum_repository],[NodeJS],Add or remove YUM repositories,NodeJS


In [10]:
df = df[df['mod_keys_found'].apply(len).gt(0)]

In [11]:
df = df.reset_index(drop=True)

In [12]:
with open('tasks_ast_raw.pkl', 'wb') as output_file:
    pickle.dump(df, output_file)

In [13]:
with open('tasks_ast_raw.pkl', 'rb') as input_file:
    tasks_ast_raw = pickle.load(input_file)   
tasks_ast_raw.shape

(16348, 7)

In [14]:
tasks_ast_raw = tasks_ast_raw[tasks_ast_raw['mod_keys_found'].apply(len).gt(0)]

In [15]:
tasks_ast_raw.columns

Index(['repo_count', 'task_name', 'method_description', 'mod_keys_found',
       'mod_values_found', 'key_module_text', 'value_module_text'],
      dtype='object')

<h4>Top 10 Ansible modules used</h4>

In [16]:
counted_modules, counted_modules_df = top10_modules_used(tasks_ast_raw)

In [17]:
counted_modules

['shell',
 'command',
 'set_fact',
 'template',
 'file',
 'gather_facts',
 'copy',
 'service',
 'debug',
 'fail']

In [18]:
counted_modules_df.head(10)

Unnamed: 0,ansible_modules,frequency
153,shell,2126
23,command,1702
150,set_fact,1246
163,template,1198
50,file,1151
54,gather_facts,806
24,copy,773
148,service,569
27,debug,484
48,fail,395
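
<p>The helper <i>top10_modules_used</i> is imported from a module that is not shown here. A frequency ranking like the one above could be computed along the following lines, using <i>Counter</i> and <i>chain</i> (both imported at the top of the notebook). The function name <i>count_modules</i> and its return shape are assumptions; the real helper returns a dataframe rather than a list of pairs.</p>

```python
from collections import Counter
from itertools import chain


def count_modules(module_lists, top_n=10):
    """Rank module names by how often they occur across all tasks.

    `module_lists` is an iterable of lists, e.g. the 'mod_keys_found' column.
    Returns (top_n names, full ranking as (name, frequency) pairs).
    """
    # Flatten the per-task lists and count every module occurrence.
    counts = Counter(chain.from_iterable(module_lists))
    ranked = counts.most_common()
    top = [name for name, _ in ranked[:top_n]]
    return top, ranked


# Tiny made-up input, used only for this example.
sample = [['shell'], ['command'], ['shell'], ['file', 'shell'], ['command']]
top_two, ranking = count_modules(sample, top_n=2)
```

<p>In the notebook this would be called as <i>count_modules(tasks_ast_raw['mod_keys_found'])</i>.</p>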


In [19]:
with open('top10_list.pkl', 'wb') as output_file:
    pickle.dump(counted_modules, output_file)

In [20]:
with open('top10_list.pkl', 'rb') as input_file:
    top10_module_list = pickle.load(input_file)   
top10_module_list

['shell',
 'command',
 'set_fact',
 'template',
 'file',
 'gather_facts',
 'copy',
 'service',
 'debug',
 'fail']

In [21]:
tasks_ast_raw = tasks_ast_raw[tasks_ast_raw['mod_keys_found'].apply(lambda v: len(set(v) & set(top10_module_list)) > 0)]

In [22]:
tasks_ast_raw.shape

(10450, 7)

In [24]:
with open('tasks_top10_modules.pkl', 'wb') as output_file:
    pickle.dump(tasks_ast_raw, output_file)

In [25]:
with open('tasks_top10_modules.pkl', 'rb') as input_file:
    tasks_top10_modules = pickle.load(input_file)   
tasks_top10_modules.shape

(10450, 7)

In [26]:
tasks_10 = tasks_top10_modules.copy()

In [27]:
two_keys_set = tasks_10[tasks_10['mod_keys_found'].apply(lambda x: len(x) >1)]
two_keys_set.shape

(54, 7)

In [28]:
one_key_set = tasks_10[tasks_10['mod_keys_found'].apply(lambda x: len(x) < 2)]
one_key_set.shape

(10396, 7)

In [29]:
one_key_set.columns

Index(['repo_count', 'task_name', 'method_description', 'mod_keys_found',
       'mod_values_found', 'key_module_text', 'value_module_text'],
      dtype='object')

In [30]:
# one_key_set['param_descrption'] = one_key_set['mod_keys_found'].apply(lambda x: get_module_parameters(x))

In [30]:
# .assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered slice.
one_key_set = one_key_set.assign(mod_keys_found_string=one_key_set['mod_keys_found'].apply(''.join))


In [31]:
one_key_set.shape

(10396, 8)

In [32]:
type(one_key_set['mod_keys_found'][0])

list

In [33]:
list(one_key_set['mod_keys_found_string'].unique())

['service',
 'file',
 'template',
 'set_fact',
 'fail',
 'command',
 'debug',
 'copy',
 'shell',
 'gather_facts']

In [34]:
from module_parameters import get_module_parameters

In [35]:
params_dict = get_module_parameters(top10_module_list)

In [36]:
params_dict['set_fact']

{'cacheable_required': False,
 'cacheable': 'boolean',
 'cacheable_text': "This boolean converts the variable into an actual 'fact' which will also be added to the fact cache, if fact caching is enabled.Normally this module creates 'host level variables' and has much higher precedence, this option changes the nature and precedence (by 7 steps) of the variable created. https://docs.ansible.com/ansible/latest/user_guide/playbooks_variables.html#variable-precedence-where-should-i-put-a-variableThis actually creates 2 copies of the variable, a normal 'set_fact' host variable with high precedence and a lower 'ansible_fact' one that is available for persistance via the facts cache plugin. This creates a possibly confusing interaction with meta: clear_facts as it will remove the 'ansible_fact' but not the host variable.",
 'key_value_required': True,
 'key_value': '-',
 'key_value_text': 'The set_fact module takes key=value pairs as variables to set in the playbook scope. Or alternatively, ac

In [37]:
one_key_set['method_description'][6]

{'file': {'dest': '/etc/datadog-agent/conf.d/{{ item }}.d',
  'state': 'directory',
  'owner': '{{ datadog_user }}',
  'group': '{{ datadog_group }}'},
 'with_items': '{{ datadog_checks|list }}'}

In [38]:
from module_parameters import map_module_used_parameters

In [39]:
one_key_set.shape

(10396, 8)

In [42]:
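# errors='ignore' keeps this cell from failing when the column does not exist yet (e.g. on a fresh run).
one_key_set = one_key_set.drop(columns=['found_used_parameters'], errors='ignore')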

In [43]:
one_key_set['found_used_parameters'] = one_key_set['method_description'].apply(lambda x: map_module_used_parameters(x,params_dict))

In [45]:
one_key_set.head(10)

Unnamed: 0,repo_count,task_name,method_description,mod_keys_found,mod_values_found,key_module_text,value_module_text,mod_keys_found_string,found_used_parameters
0,0,restart datadog-agent,"{'service': {'name': 'datadog-agent', 'state':...",[service],[datadog-agent],Manage services,datadog-agent,service,"{'module_used': 'service', 'intersected_params..."
4,4,Create Datadog agent config directory,"{'file': {'dest': '/etc/datadog-agent', 'state...",[file],[],Manage files and file properties,,file,"{'module_used': 'file', 'intersected_params': ..."
5,5,Create main Datadog agent configuration file,"{'template': {'src': 'datadog.yaml.j2', 'dest'...",[template],[group],Template a file out to a remote server,Add or remove groups,template,"{'module_used': 'template', 'intersected_param..."
6,6,Ensure configuration directories are present f...,{'file': {'dest': '/etc/datadog-agent/conf.d/{...,[file],[group],Manage files and file properties,Add or remove groups,file,"{'module_used': 'file', 'intersected_params': ..."
7,7,Create a configuration file for each Datadog c...,"{'template': {'src': 'checks.yaml.j2', 'dest':...",[template],[group],Template a file out to a remote server,Add or remove groups,template,"{'module_used': 'template', 'intersected_param..."
8,8,Remove old configuration file for each Datadog...,{'file': {'dest': '/etc/datadog-agent/conf.d/{...,[file],[],Manage files and file properties,,file,"{'module_used': 'file', 'intersected_params': ..."
9,9,Create trace agent configuration file,"{'template': {'src': 'datadog.conf.j2', 'dest'...",[template],[group],Template a file out to a remote server,Add or remove groups,template,"{'module_used': 'template', 'intersected_param..."
10,10,Create process agent configuration file,"{'template': {'src': 'datadog.conf.j2', 'dest'...",[template],[group],Template a file out to a remote server,Add or remove groups,template,"{'module_used': 'template', 'intersected_param..."
11,11,Create system-probe configuration file,"{'template': {'src': 'system-probe.yaml.j2', '...",[template],[group],Template a file out to a remote server,Add or remove groups,template,"{'module_used': 'template', 'intersected_param..."
12,12,Ensure datadog-agent is running,"{'service': {'name': 'datadog-agent', 'state':...",[service],[datadog-agent],Manage services,datadog-agent,service,"{'module_used': 'service', 'intersected_params..."


In [44]:
one_key_set['method_description'][4]

{'file': {'dest': '/etc/datadog-agent', 'state': 'directory'}}
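
<p>The function <i>map_module_used_parameters</i> is imported from a module that is not shown here. Judging only from the outputs above, it returns a dictionary with the keys 'module_used' and 'intersected_params'. The sketch below shows one way such a mapping could work, assuming that <i>params_dict[module]</i> is a flat dictionary with '&lt;param&gt;', '&lt;param&gt;_required' and '&lt;param&gt;_text' entries, as in the <i>set_fact</i> output. The function name <i>used_parameters</i>, the sample <i>sample_params</i> and the lack of alias handling (e.g. 'dest' for 'path') are assumptions of this sketch.</p>

```python
def used_parameters(method_description, params_dict):
    """Intersect the parameters a task passes with the documented ones.

    Hypothetical stand-in for map_module_used_parameters; returns None
    when no key of the task is a module present in params_dict.
    """
    suffix = '_required'
    for module, args in method_description.items():
        if module not in params_dict:
            continue
        # Recover documented parameter names from the '<param>_required' keys.
        documented = {key[:-len(suffix)] for key in params_dict[module] if key.endswith(suffix)}
        # Only dictionary-style arguments carry explicit parameter names.
        given = set(args) if isinstance(args, dict) else set()
        return {'module_used': module, 'intersected_params': sorted(documented & given)}
    return None


# Hand-made parameter documentation, used only for this example.
sample_params = {
    'file': {
        'state_required': False, 'state': 'string', 'state_text': 'placeholder text',
        'path_required': True, 'path': 'path', 'path_text': 'placeholder text',
    }
}
task = {'file': {'dest': '/etc/datadog-agent', 'state': 'directory'}}
result = used_parameters(task, sample_params)
```

<p>Here 'dest' is not matched because the sketch compares literal parameter names only.</p>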

In [45]:
one_key_set.shape

(10396, 9)

In [48]:
with open('tasks_top10_mapped_parameters.pkl', 'wb') as output_file:
    pickle.dump(one_key_set, output_file)

In [49]:
with open('tasks_top10_mapped_parameters.pkl', 'rb') as input_file:
    tasks_top10_mapped_parameters = pickle.load(input_file)   
tasks_top10_mapped_parameters.shape

(10396, 9)