## **Simple data cleaning**



After annotating the SAS scripts (generating a matching instruction for each one with the GPT-3.5 API), we clean and prepare the dataset for the fine-tuning process.

In [1]:
import pandas as pd
parquet_file = r'/content/df_annotated_finalbatch.parquet'
df=pd.read_parquet(parquet_file, engine = 'pyarrow')
df

Unnamed: 0,content,Annotation
0,"libname storm xlsx ""/home/student/Courses/gasa...","""Set the library 'storm' to the Excel file loc..."
1,/* Advent of Code Day 2 Puzzle 2 */\n\nDATA da...,"Create a data frame named ""day2"" by reading in..."
2,%macro Ulcer_Index_test(keep=FALSE);\n%global ...,"I am sorry, but I cannot provide an instructio..."
3,FILENAME REFFILE '/folders/myshortcuts/SASUniv...,"""Import the CSV file located at '/folders/mysh..."
4,"\nlibname b ""./"";\n\nproc contents data=b.pers...",Please create a program that accomplishes the ...
...,...,...
11383,**********************************************...,The instruction to match the SAS programming l...
11384,"/* \nFrom: Xie, Fenglong [mailto:fenglongxie@u...","Create a macro named ""ciras"" that takes in thr..."
11385,/*********************************************...,Create SAS informats for missing values using ...
11386,"proc iml;\n\n\tstart m(t) global(c1,c2,age);\n...","Sorry, I cannot provide an instruction for an ..."


We want to remove every faulty annotation that would harm the performance of our models. As a first step, we remove any annotation where the model refused to generate an instruction (i.e. any response of the form 'Sorry, I cannot provide an instruction ...').

In [2]:
df['Annotation_lower'] = df['Annotation'].str.lower()
keywords = ['Sorry', 'Cannot provide']
# Ensure keywords are lowercase
keywords = [keyword.lower() for keyword in keywords]

filtered_df = df[df['Annotation_lower'].str.contains('|'.join(keywords))]
filtered_df


Unnamed: 0,content,Annotation,Annotation_lower
2,%macro Ulcer_Index_test(keep=FALSE);\n%global ...,"I am sorry, but I cannot provide an instructio...","i am sorry, but i cannot provide an instructio..."
7,*Ex24_Week10_List_of_Files.sas;\ntitle; title1...,"Sorry, as an AI language model, I cannot provi...","sorry, as an ai language model, i cannot provi..."
21,begin_version\n3.FOND\nend_version\nbegin_metr...,"I'm sorry, but it is not possible to provide a...","i'm sorry, but it is not possible to provide a..."
27,%macro Prospect_Ratio_test(keep=FALSE);\n%glob...,"I'm sorry, as an AI language model, I cannot p...","i'm sorry, as an ai language model, i cannot p..."
30,%macro table_CalendarReturns_test2(keep=FALSE)...,"I'm sorry, as an AI language model, I cannot p...","i'm sorry, as an ai language model, i cannot p..."
...,...,...,...
11358,%macro return_calculate_test2(keep=FALSE);\n%g...,"I'm sorry, I cannot provide an LLM model instr...","i'm sorry, i cannot provide an llm model instr..."
11371,/*********************************************...,"Sorry, as an AI language model, I am not able ...","sorry, as an ai language model, i am not able ..."
11372,/*data example; input batch percent @@; cards;...,"Sorry, as an AI language model, I do not have ...","sorry, as an ai language model, i do not have ..."
11380,%macro return_centered_test(keep=FALSE);\n%glo...,"I am sorry, but it is not clear what task you ...","i am sorry, but it is not clear what task you ..."


In [3]:
filtered_df['Annotation'].iloc[0]

'I am sorry, but I cannot provide an instruction for an LLM model to match this SAS programming language script as LLM models are used for natural language processing and cannot replicate programming language scripts.'

We could regenerate annotations for these rows, since the model seems to have misinterpreted our intent, likely because of the prompt used for annotation. However, given the project's time constraints and the small number of rows affected, we simply drop them.
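
Before dropping rows, the share of matched annotations can be measured to check that it is indeed small. The following is a minimal sketch on a toy DataFrame; the `toy` frame and the `refusal_share` helper are hypothetical names introduced for illustration, while the column name and keywords follow the cells above.

```python
import pandas as pd

def refusal_share(annotations: pd.Series, keywords: list[str]) -> float:
    """Return the fraction of annotations containing any keyword (case-insensitive)."""
    pattern = "|".join(k.lower() for k in keywords)
    # na=False counts missing annotations as non-matching
    mask = annotations.str.lower().str.contains(pattern, na=False)
    return float(mask.mean())

# Toy data standing in for df['Annotation'] (hypothetical examples)
toy = pd.DataFrame({
    "Annotation": [
        "Create a macro named ciras.",
        "I'm sorry, but I cannot provide an instruction.",
        "Import the CSV file.",
        None,
    ]
})
share = refusal_share(toy["Annotation"], ["Sorry", "Cannot provide"])
print(f"{share:.1%} of rows look like refusals")
```

On the real data, the same helper would be called as `refusal_share(df['Annotation'], keywords)`.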

In [4]:
# Select rows that do NOT contain any of the keywords
cleaned_df = df[~df['Annotation_lower'].str.contains('|'.join(keywords))]
cleaned_df

Unnamed: 0,content,Annotation,Annotation_lower
0,"libname storm xlsx ""/home/student/Courses/gasa...","""Set the library 'storm' to the Excel file loc...","""set the library 'storm' to the excel file loc..."
1,/* Advent of Code Day 2 Puzzle 2 */\n\nDATA da...,"Create a data frame named ""day2"" by reading in...","create a data frame named ""day2"" by reading in..."
3,FILENAME REFFILE '/folders/myshortcuts/SASUniv...,"""Import the CSV file located at '/folders/mysh...","""import the csv file located at '/folders/mysh..."
4,"\nlibname b ""./"";\n\nproc contents data=b.pers...",Please create a program that accomplishes the ...,please create a program that accomplishes the ...
5,More precise computations at macro excecution ...,Perform more precise calculations at macro exe...,perform more precise calculations at macro exe...
...,...,...,...
11382,/*--------------------------------------------...,Provide a demo of MISSOVER in a SAS program. U...,provide a demo of missover in a sas program. u...
11383,**********************************************...,The instruction to match the SAS programming l...,the instruction to match the sas programming l...
11384,"/* \nFrom: Xie, Fenglong [mailto:fenglongxie@u...","Create a macro named ""ciras"" that takes in thr...","create a macro named ""ciras"" that takes in thr..."
11385,/*********************************************...,Create SAS informats for missing values using ...,create sas informats for missing values using ...


## **Tokenization for precise price estimation of annotation**

Let's tokenize our SAS scripts using the code-tokenize Python library.

In [5]:
!pip install code-tokenize


In [6]:
import code_tokenize as ctok

In [7]:
selected_row = cleaned_df.loc[0, 'content']

# Tokenize the selected SAS code
# We found no SAS tokenizer, so the Python grammar serves as an approximation;
# syntax_error="ignore" suppresses syntax-error reporting on the non-Python input
tokens = ctok.tokenize(selected_row, lang="python", syntax_error="ignore")

print("Tokens:", tokens)
len(tokens)



Tokens: [libname, storm, xlsx, "/home/student/Courses/gasas/Storm.xlsx", #NEWLINE#, ;, proc, contents, data, =, storm, ., _all_, #NEWLINE#, nods, ;, run, #NEWLINE#, ;, libname, gasas, "/home/student/Courses/gasas", #NEWLINE#, ;, proc, copy, inlib, =, storm, outlib, =, gasas, #NEWLINE#, ;, run, #NEWLINE#, ;]




37

In [8]:
# Work on an explicit copy to avoid pandas' SettingWithCopyWarning
cleaned_df = cleaned_df.copy()
cleaned_df['Tokens'] = cleaned_df['content'].apply(lambda x: ctok.tokenize(x, lang="python", syntax_error="ignore"))

# Count the number of tokens for each script
cleaned_df['Token_Count'] = cleaned_df['Tokens'].apply(len)
cleaned_df


Unnamed: 0,content,Annotation,Annotation_lower,Tokens,Token_Count
0,"libname storm xlsx ""/home/student/Courses/gasa...","""Set the library 'storm' to the Excel file loc...","""set the library 'storm' to the excel file loc...","[libname, storm, xlsx, ""/home/student/Courses/...",37
1,/* Advent of Code Day 2 Puzzle 2 */\n\nDATA da...,"Create a data frame named ""day2"" by reading in...","create a data frame named ""day2"" by reading in...","[/, *, Advent, of, Code, Day, 2, Puzzle, 2, *,...",131
3,FILENAME REFFILE '/folders/myshortcuts/SASUniv...,"""Import the CSV file located at '/folders/mysh...","""import the csv file located at '/folders/mysh...","[FILENAME, REFFILE, '/folders/myshortcuts/SASU...",541
4,"\nlibname b ""./"";\n\nproc contents data=b.pers...",Please create a program that accomplishes the ...,please create a program that accomplishes the ...,"[libname, b, ""./"", #NEWLINE#, ;, proc, content...",266
5,More precise computations at macro excecution ...,Perform more precise calculations at macro exe...,perform more precise calculations at macro exe...,"[More, precise, computations, at, macro, excec...",559
...,...,...,...,...,...
11382,/*--------------------------------------------...,Provide a demo of MISSOVER in a SAS program. U...,provide a demo of missover in a sas program. u...,"[/, *, -, -, -, -, -, -, -, -, -, -, -, -, -, ...",925
11383,**********************************************...,The instruction to match the SAS programming l...,the instruction to match the sas programming l...,"[*, *, *, *, *, *, *, *, *, *, *, *, *, *, *, ...",758
11384,"/* \nFrom: Xie, Fenglong [mailto:fenglongxie@u...","Create a macro named ""ciras"" that takes in thr...","create a macro named ""ciras"" that takes in thr...","[/, *, From, :, Xie, ,, Fenglong, [, mailto, :...",809
11385,/*********************************************...,Create SAS informats for missing values using ...,create sas informats for missing values using ...,"[, /, *, *, *, *, *, *, *, *, *, *, *, *, *, *...",1480


In [9]:
max_token_count = cleaned_df['Token_Count'].max()

print("Maximum Token Count:", max_token_count)


Maximum Token Count: 10759


In [10]:
cleaned_df['Token_Count'].describe()

count    10909.000000
mean       750.249702
std        620.457866
min          5.000000
25%        266.000000
50%        647.000000
75%       1047.000000
max      10759.000000
Name: Token_Count, dtype: float64

These counts give a more precise estimate of the annotation cost, even though they come from the Python grammar rather than a SAS tokenizer, which we have not found so far. The updated estimate is in the Coût_annotation_SAS.xlsx file.
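
As an illustration of how token counts translate into a cost figure, here is a minimal sketch. The `estimate_annotation_cost` helper and the price constant are hypothetical and are not taken from the spreadsheet. Note also that the API bills by its own subword tokens, so counts from a syntax tokenizer are only a proxy.

```python
def estimate_annotation_cost(token_counts, price_per_1k_tokens):
    """Estimate the input-side cost of annotating scripts, billed per 1,000 tokens."""
    total_tokens = sum(token_counts)
    return total_tokens / 1000 * price_per_1k_tokens

# Hypothetical price, for illustration only; check the provider's current rates.
PRICE_PER_1K = 0.0005
counts = [37, 131, 541, 266, 559]  # Token_Count values of the first rows shown above
cost = estimate_annotation_cost(counts, PRICE_PER_1K)
print(f"{sum(counts)} tokens -> ${cost:.6f}")
```

On the full dataset, `cleaned_df['Token_Count']` would be passed in place of `counts`.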

## **Formatting the dataset for Mistral-Instruct**

To leverage instruction fine-tuning, each prompt must be wrapped in [INST] and [/INST] tokens. The very first instruction begins with the beginning-of-sentence (BOS) token \<s>; subsequent instructions do not. Each assistant response ends with the end-of-sentence (EOS) token \</s>.

Example: text = "\<s>[INST] What is your favourite condiment? [/INST]"
"Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!\</s> "
"[INST] Do you have mayonnaise recipes? [/INST]"

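The template described above can be sketched as a small helper. This is an illustration under assumptions: `format_mistral_conversation` is a hypothetical name, and the spacing follows the example above rather than any official reference, so the tokenizer's own chat template remains authoritative.

```python
def format_mistral_conversation(turns):
    """Build a Mistral-Instruct prompt from (instruction, response) pairs.

    A response of None marks a final instruction that awaits generation.
    """
    parts = []
    for i, (instruction, response) in enumerate(turns):
        bos = "<s>" if i == 0 else ""  # BOS only before the first instruction
        parts.append(f"{bos}[INST] {instruction} [/INST]")
        if response is not None:
            parts.append(f" {response}</s> ")  # every completed response ends with EOS
    return "".join(parts)

text = format_mistral_conversation([
    ("What is your favourite condiment?", "Lemon juice."),
    ("Do you have mayonnaise recipes?", None),
])
print(text)
```

For single-turn fine-tuning samples, as in this dataset, each (annotation, script) pair would be passed as a one-element list.
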
In [11]:
df = cleaned_df[['content', 'Annotation']]
df

Unnamed: 0,content,Annotation
0,"libname storm xlsx ""/home/student/Courses/gasa...","""Set the library 'storm' to the Excel file loc..."
1,/* Advent of Code Day 2 Puzzle 2 */\n\nDATA da...,"Create a data frame named ""day2"" by reading in..."
3,FILENAME REFFILE '/folders/myshortcuts/SASUniv...,"""Import the CSV file located at '/folders/mysh..."
4,"\nlibname b ""./"";\n\nproc contents data=b.pers...",Please create a program that accomplishes the ...
5,More precise computations at macro excecution ...,Perform more precise calculations at macro exe...
...,...,...
11382,/*--------------------------------------------...,Provide a demo of MISSOVER in a SAS program. U...
11383,**********************************************...,The instruction to match the SAS programming l...
11384,"/* \nFrom: Xie, Fenglong [mailto:fenglongxie@u...","Create a macro named ""ciras"" that takes in thr..."
11385,/*********************************************...,Create SAS informats for missing values using ...


We first save this cleaned version as a Parquet file, since we may need it to fine-tune models other than Mistral-Instruct.

In [12]:
# Specifying the file path where we want to save the DataFrame
parquet_file_path = '/content/df_annotated_cleaned.parquet'

# Save the DataFrame to a Parquet file
df.to_parquet(parquet_file_path, index=False)

In [13]:
import json

In [None]:
# Initialize the formatted data list
formatted_data = []

# Iterate by position: index labels are no longer contiguous after filtering
for position, (_, row) in enumerate(df.iterrows()):
    # <s> opens only the first instruction; every response ends with </s>
    bos = '<s>' if position == 0 else ''
    formatted_row = f'{bos}[INST] {row["Annotation"]} [/INST] {row["content"]}</s>'
    formatted_data.append(formatted_row)

# Join all pieces into a single string
formatted_string = " ".join(formatted_data)

# Save to JSON
file_path = '/content/instructions_data_mistral.json'
with open(file_path, 'w', encoding='utf-8') as outfile:
    json.dump({'text': formatted_string}, outfile)

print(f"Data has been correctly formatted and saved to {file_path}")


Data has been correctly formatted and saved to /content/instructions_data_mistral.json


In [None]:
file_path = '/content/formatted_instructions.jsonl'

with open(file_path, 'w') as outfile:
    for index, row in df.iterrows():
        # Format the instruction and response according to the given structure
        formatted_text = f'<s>[INST] {row["Annotation"]} [/INST] {row["content"]} </s>'

        # Write each formatted string as a separate line in the .jsonl file
        json.dump({"text": formatted_text}, outfile)
        outfile.write('\n')  # Ensure each JSON object is on a new line

print(f"Data has been formatted and saved to {file_path}")

Data has been formatted and saved to /content/formatted_instructions.jsonl
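
A quick sanity check on the written file can catch formatting slips before training. The sketch below is an assumption-laden illustration: `check_jsonl` is a hypothetical helper, and the demo runs on a temporary file with two made-up samples rather than on the real output.

```python
import json
import os
import tempfile

def check_jsonl(path):
    """Return the number of records; raise ValueError on a malformed record."""
    count = 0
    with open(path, encoding="utf-8") as infile:
        for line_no, line in enumerate(infile, start=1):
            record = json.loads(line)
            text = record.get("text", "")
            if not (text.startswith("<s>[INST]") and text.rstrip().endswith("</s>")):
                raise ValueError(f"line {line_no}: unexpected format")
            count += 1
    return count

# Demo on a temporary file with two hypothetical samples
samples = [
    "<s>[INST] Print hello [/INST] data _null_; put 'hello'; run; </s>",
    "<s>[INST] List contents [/INST] proc contents data=a; run; </s>",
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    for s in samples:
        tmp.write(json.dumps({"text": s}) + "\n")
n_records = check_jsonl(tmp.name)
os.remove(tmp.name)
print(n_records)
```

Applied to the real file, the returned count should equal `len(df)`.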
