## Data Preprocessing Python Script
- The CSV file produced by the web scraping script is converted to a Parquet file
- The output uses the same standard format as the 200-301 dataset and the MMLU dataset
- 350-701 Cisco CCNP questions
    - 140 questions
    - Contains images encoded as base64
    - A second dataset without images is also exported
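
As a rough illustration of the target format, the sketch below checks a single record against the column layout this notebook ends up with (`question`, `choices`, `answer`, `image`, `exam`). The helper name `is_valid_record` and the exact rules it enforces are assumptions made for illustration; they are not part of the original script.

```python
def is_valid_record(record):
    """Return True if a dict looks like one row of the target format (assumed rules)."""
    question = record.get("question")
    choices = record.get("choices")
    answer = record.get("answer")
    # The question must be a non-empty string
    if not isinstance(question, str) or not question.strip():
        return False
    # The choices must be a list of strings
    if not isinstance(choices, list) or not all(isinstance(c, str) for c in choices):
        return False
    # The answer must be a non-empty list of zero-based indices into choices
    if not isinstance(answer, list) or not answer:
        return False
    return all(isinstance(i, int) and 0 <= i < len(choices) for i in answer)


# Row values taken from the no-image output shown at the end of this notebook
example = {
    "question": "Which form of attack is launched using botnets?",
    "choices": ["TCP flood", "DDOS", "DOS", "virus"],
    "answer": [1],
    "image": None,
    "exam": "350-701",
}
print(is_valid_record(example))
```

A check like this could be run over every row before exporting, so that an out-of-range answer index is caught early.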

In [1]:
from pandas import read_csv
import pandas as pd
import ast

In [12]:
ccnp_350_701 = read_csv('/home/iai/sb7059/git/llm_test/archive/raw_data/extracted_questions_answers_350_701.csv')
ccna_200_301 = pd.read_parquet('/home/iai/sb7059/git/llm_test/data/201-301-CCNA.parquet')

In [13]:
ccnp_350_701

| Unnamed: 0 | Question | Answers | Correct Answer | Image |
|---|---|---|---|---|
| 0 | Which functions of an SDN architecture require... | ['SDN controller and the network elements', 'm... | A | |
| 1 | Which two request methods of REST API are vali... | ['put', 'options', 'get', 'push', 'connect'] | AC | |
| 2 | The main function of northbound APIs in the SD... | ['SDN controller and the cloud', 'management c... | D | |
| 3 | What is a feature of the open platform capabil... | ['application adapters', 'domain integration',... | C | |
| 4 | Refer to the exhibit. What does the API do whe... | ['create an SNMP pull mechanism for managing A... | D | iVBORw0KGgoAAAANSUhEUgAAAgsAAAGmCAIAAAB9V1A0AA... |
| ... | ... | ... | ... | ... |
| 135 | An engineer has been tasked with configuring a... | ['Enable traffic analysis in the Cisco FTD.', ... | C | |
| 136 | An organization uses Cisco FMC to centrally ma... | ['Change the management port on Cisco FMC so t... | C | |
| 137 | An administrator is establishing a new site-to... | ['crypto isakmp identity address 172.19.20.24'... | D | |
| 138 | A Cisco FTD engineer is creating a newIKEv2 po... | ['Change the encryption to AES* to support all... | D | |
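
The image column holds base64 text. The `iVBORw0KGgo` prefix visible in row 4 is what the 8-byte PNG file signature looks like after base64 encoding, so a quick sanity check on the scraped images could look like the sketch below. The helper name `looks_like_png` is hypothetical and not part of the original script.

```python
import base64

# The fixed 8-byte signature that starts every PNG file
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(b64_text):
    """Return True if a base64 string decodes to data starting with the PNG signature."""
    # Missing images show up as NaN (a float), so anything that is not a string is rejected
    if not isinstance(b64_text, str) or len(b64_text) < 12:
        return False
    # 12 base64 characters decode to 9 bytes, enough to cover the 8-byte signature
    head = base64.b64decode(b64_text[:12])
    return head.startswith(PNG_SIGNATURE)


print(looks_like_png("iVBORw0KGgoAAAANSUhEUgAA"))
print(looks_like_png(float("nan")))
```

Only the first 12 characters are decoded, which keeps the check cheap even for large images.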


In [15]:
display(ccnp_350_701.iloc[10]['Answers'])

"['Define security group memberships.', 'Revoke expired CRL of the websites.', 'Use antispyware software.']"

In [5]:
# Rename the columns to the target schema: Question -> question, Answers -> choices, Correct Answer -> answer, Image -> image
ccnp_350_701 = ccnp_350_701.rename(columns={'Question': 'question', 'Answers': 'choices', 'Correct Answer': 'answer', 'Image': 'image'})

In [6]:
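# Intended to convert each entry in 'choices' to a pandas Series; as the output below shows, the values
# are still strings afterwards, so the actual parsing into lists happens later with ast.literal_eval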
ccnp_350_701['choices'] = ccnp_350_701['choices'].apply(lambda x: pd.Series(x))

In [7]:
# Separate the letters of multi-letter answers with spaces, e.g. 'AB' -> 'A B' and 'ABC' -> 'A B C'
ccnp_350_701['answer'] = ccnp_350_701['answer'].apply(lambda x: ' '.join(list(x)) if isinstance(x, str) else x)

In [11]:
display(ccnp_350_701.iloc[10])
display(ccnp_350_701.iloc[10]['choices'])
display(ccnp_350_701.iloc[10]['question'])

question    Which two mechanisms are used to control phish...
choices     ['Define security group memberships.', 'Revoke...
answer                                                    A E
image                                                     NaN
Name: 10, dtype: object

"['Define security group memberships.', 'Revoke expired CRL of the websites.', 'Use antispyware software.']"

'Which two mechanisms are used to control phishing attacks? (Choose two.)'

In [76]:
# Convert the answer letters to a list of zero-based indices: A=0, B=1, C=2, D=3, so 'A B' -> [0, 1]
def convert_to_number(x):
    if isinstance(x, str):
        return [ord(i) - 65 for i in x.split()]
    else:
        return x

ccnp_350_701['answer'] = ccnp_350_701['answer'].apply(convert_to_number)
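
The two answer steps above (spacing out the letters, then mapping them to numbers) can also be expressed as a single function. The following is a minimal sketch of that variant, not the notebook's own code; the name `letters_to_indices` is hypothetical.

```python
def letters_to_indices(letters):
    """Map answer letters to zero-based indices; accepts 'AC' as well as 'A C'."""
    # Skip anything that is not a letter (such as spaces) and normalise to upper case
    return [ord(ch) - ord("A") for ch in letters.upper() if ch.isalpha()]


print(letters_to_indices("AC"))
print(letters_to_indices("A C"))
print(letters_to_indices("D"))
```

Because spaces are skipped, this variant gives the same result whether or not the letters were separated beforehand.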

In [77]:
# Parse the string representations into actual lists and strip whitespace from each choice
ccnp_350_701['choices'] = ccnp_350_701['choices'].apply(lambda x: [i.strip() for i in ast.literal_eval(x)])
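
To make the effect of this cell concrete, here is a small standalone sketch that applies the same parsing to the string shown earlier for row 10. `ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for scraped text.

```python
import ast

# The raw value of 'choices' for row 10, exactly as displayed earlier in this notebook
raw = "['Define security group memberships.', 'Revoke expired CRL of the websites.', 'Use antispyware software.']"

# Parse the string into a real list and strip surrounding whitespace from each entry
choices = [item.strip() for item in ast.literal_eval(raw)]

print(type(raw).__name__, "->", type(choices).__name__)
print(choices[0])
```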

In [78]:
ccnp_350_701

| Unnamed: 0 | question | choices | answer | image |
|---|---|---|---|---|
| 0 | Which functions of an SDN architecture require... | [SDN controller and the network elements, mana... | [0] | |
| 1 | Which two request methods of REST API are vali... | [put, options, get, push, connect] | [0, 2] | |
| 2 | The main function of northbound APIs in the SD... | [SDN controller and the cloud, management cons... | [3] | |
| 3 | What is a feature of the open platform capabil... | [application adapters, domain integration, int... | [2] | |
| 4 | Refer to the exhibit. What does the API do whe... | [create an SNMP pull mechanism for managing AM... | [3] | iVBORw0KGgoAAAANSUhEUgAAAgsAAAGmCAIAAAB9V1A0AA... |
| ... | ... | ... | ... | ... |
| 135 | An engineer has been tasked with configuring a... | [Enable traffic analysis in the Cisco FTD., Im... | [2] | |
| 136 | An organization uses Cisco FMC to centrally ma... | [Change the management port on Cisco FMC so th... | [2] | |
| 137 | An administrator is establishing a new site-to... | [crypto isakmp identity address 172.19.20.24, ... | [3] | |
| 138 | A Cisco FTD engineer is creating a newIKEv2 po... | [Change the encryption to AES* to support all ... | [3] | |


In [79]:
#Add a column with the exam name
ccnp_350_701['exam'] = '350-701'

In [80]:
#Export CCNP 350-701 to parquet with filename 350-701-CCNP.parquet in the folder /home/iai/sb7059/git/llm_test/data
ccnp_350_701.to_parquet('/home/iai/sb7059/git/llm_test/data/350-701-CCNP.parquet')

In [81]:
# Keep only the rows without an image (NaN in column 'image') and export them as a separate dataset
ccnp_350_701_no_image = ccnp_350_701[ccnp_350_701['image'].isna()]
ccnp_350_701_no_image.to_parquet('/home/iai/sb7059/git/llm_test/data/350-701-CCNP_no_image.parquet')

In [82]:
ccnp_350_701_no_image

| Unnamed: 0 | question | choices | answer | image | exam |
|---|---|---|---|---|---|
| 0 | Which functions of an SDN architecture require... | [SDN controller and the network elements, mana... | [0] | | 350-701 |
| 1 | Which two request methods of REST API are vali... | [put, options, get, push, connect] | [0, 2] | | 350-701 |
| 2 | The main function of northbound APIs in the SD... | [SDN controller and the cloud, management cons... | [3] | | 350-701 |
| 3 | What is a feature of the open platform capabil... | [application adapters, domain integration, int... | [2] | | 350-701 |
| 5 | Which form of attack is launched using botnets? | [TCP flood, DDOS, DOS, virus] | [1] | | 350-701 |
| ... | ... | ... | ... | ... | ... |
| 135 | An engineer has been tasked with configuring a... | [Enable traffic analysis in the Cisco FTD., Im... | [2] | | 350-701 |
| 136 | An organization uses Cisco FMC to centrally ma... | [Change the management port on Cisco FMC so th... | [2] | | 350-701 |
| 137 | An administrator is establishing a new site-to... | [crypto isakmp identity address 172.19.20.24, ... | [3] | | 350-701 |
| 138 | A Cisco FTD engineer is creating a newIKEv2 po... | [Change the encryption to AES* to support all ... | [3] | | 350-701 |
