**Introduction**

In this notebook, we prepare data for fine-tuning an Arabic model focused on financial guidance. Following the CRISP-DM framework, we cover the Data Understanding and Data Preparation phases.

Data Collection
Data is sourced from PDFs and generated through prompt engineering. Both are structured into Question-Answer pairs to reflect realistic financial advisory scenarios.

Data Types
General Financial Advice: Covers foundational financial habits.
Strategic Financial Steps: Focuses on advanced financial planning strategies.

Preprocessing
We clean, normalize, and format each dataset to prepare it for fine-tuning, so that the model can deliver actionable financial insights.



**Importing Libraries:**

In [1]:
import pandas as pd
import numpy as np
import re
import json
from google.colab import files

In [2]:
!pip install pdfplumber
import pdfplumber


Collecting pdfplumber
  Downloading pdfplumber-0.11.4-py3-none-any.whl.metadata (41 kB)
Collecting pdfminer.six==20231228 (from pdfplumber)
  Downloading pdfminer.six-20231228-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
Downloading pdfplumber-0.11.4-py3-none-any.whl (59 kB)
Downloading pdfminer.six-20231228-py3-none-any.whl (5.6 MB)

## Collecting Data for General Financial Advice:

General Financial Advice:

In [3]:
questions = [
    "What are the main aspects of personal finance?",
    "How can I create a good budget?",
    "What are good financial habits to develop?",
    "Why is an emergency fund important?",
    "How can I improve my credit score?",
    "What is compound interest and why is it important?",
    "How can I start investing with little money?",
    "What is the difference between saving and investing?",
    "Why is diversification important in investing?",
    "How can I manage debt effectively?",
    "What is a retirement account and why is it important?",
    "How do I choose a credit card that's right for me?",
    "What are the advantages of automatic savings?",
    "Why should I invest in a 401(k) or IRA?",
    "What are the risks of not having health insurance?",
    "What is the best way to save for a large purchase?",
    "How do I set realistic financial goals?",
    "What is a good strategy for building wealth?",
    "How can I balance saving and enjoying life?",
    "What are the benefits of creating a financial plan?",
    "How does inflation affect personal finances?",
    "What are the different types of investment accounts?",
    "How do I avoid lifestyle inflation?",
    "How can I reduce my monthly expenses?",
    "Why should I avoid high-interest debt?",
    "What is the rule of 72 in investing?",
    "How do I calculate my net worth?",
    "What is a stock, and how does it work?",
    "How do I protect myself from financial scams?",
    "What is a credit report, and how do I read it?",

]

# List of 30 sample answers, aligned by index with the questions above
answers = [
    "Personal finance involves managing your income, expenses, savings, investments, and debt to achieve financial stability.",
    "Start by listing all sources of income and categorizing your expenses. Allocate a portion of your income to savings and set limits on non-essential spending.",
    "Good habits include tracking your spending, saving consistently, avoiding unnecessary debt, and investing wisely for the future.",
    "An emergency fund helps you cover unexpected expenses, like medical bills or car repairs, without going into debt. Aim for 3-6 months of living expenses saved.",
    "To improve your credit score, pay bills on time, keep your credit utilization low, avoid opening too many accounts, and check your credit report regularly.",
    "Compound interest is the interest on both the initial principal and the accumulated interest from previous periods. It accelerates the growth of savings or investments over time.",
    "Start with low-cost index funds, exchange-traded funds (ETFs), or micro-investing apps. Consistent investing, even with small amounts, can grow over time due to compounding.",
    "Saving is setting aside money for short-term goals or emergencies, usually in low-risk accounts. Investing involves putting money into assets like stocks, bonds, or real estate to grow wealth over time.",
    "Diversification spreads your investments across different assets, reducing risk. If one investment underperforms, others may balance it out, protecting your portfolio from major losses.",
    "To manage debt, prioritize high-interest debt, avoid taking on new debt, create a repayment plan, and consider debt consolidation or refinancing to lower interest rates.",
    "A retirement account allows you to save for your future, often with tax advantages, and provides financial security during retirement.",
    "Choose a credit card by considering factors like interest rates, rewards, annual fees, and your credit score.",
    "Automatic savings help you save without thinking, ensuring you consistently set money aside for future goals.",
    "A 401(k) or IRA allows you to save for retirement with tax advantages, helping your savings grow faster.",
    "Without health insurance, you risk facing overwhelming medical bills that could harm your financial stability.",
    "To save for a large purchase, set a specific goal, create a budget, and consistently save a portion of your income toward that goal.",
    "Set realistic financial goals by evaluating your current financial situation, identifying your priorities, and making a plan to achieve them.",
    "Building wealth involves earning more than you spend, saving consistently, and investing wisely to grow your assets over time.",
    "Balancing saving and enjoying life means creating a budget that allows for both financial goals and personal enjoyment.",
    "Creating a financial plan helps you set goals, make informed decisions, and stay on track to achieve financial security.",
    "Inflation erodes purchasing power, meaning your money buys less over time, so it's important to invest and grow your wealth.",
    "Investment accounts can include brokerage accounts, retirement accounts, and college savings accounts, each with different tax implications.",
    "Avoid lifestyle inflation by keeping your expenses consistent even as your income grows, allowing you to save and invest more.",
    "Reduce monthly expenses by tracking your spending, cutting out unnecessary purchases, and finding more affordable alternatives.",
    "High-interest debt can trap you in a cycle of payments, so it's important to pay off balances quickly to avoid accruing more interest.",
    "The rule of 72 is a simple way to estimate how long it will take for your investment to double at a given interest rate.",
    "To calculate your net worth, subtract your liabilities (debts) from your assets (savings, investments, property, etc.).",
    "A stock represents ownership in a company, and its value fluctuates based on the company's performance and market conditions.",
    "Protect yourself from financial scams by being cautious with your personal information, verifying the legitimacy of offers, and reporting suspicious activity.",
    "A credit report is a detailed summary of your credit history, including your credit accounts, balances, and payment history.",
]

print(len(questions))
print(len(answers))



30
30


In [4]:
# Build the first dataset directly from the paired question and answer lists
data_1 = pd.DataFrame({
    'Question': questions,
    'Answer': answers
})

# Optional: preview the DataFrame
print(data_1.head())

                                         Question  \
0  What are the main aspects of personal finance?   
1                 How can I create a good budget?   
2      What are good financial habits to develop?   
3             Why is an emergency fund important?   
4              How can I improve my credit score?   

                                              Answer  
0  Personal finance involves managing your income...  
1  Start by listing all sources of income and cat...  
2  Good habits include tracking your spending, sa...  
3  An emergency fund helps you cover unexpected e...  
4  To improve your credit score, pay bills on tim...  


Parsing data from a PDF file that contains answers to common and important financial questions

In [5]:
# Function to extract Q&A from text
def extract_q_and_a(text):
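    # Regular expression for question lines: re.match anchors only at the start,
    # so any line that contains a '?' (not just one ending with it) counts as a question
    question_pattern = r'(.*\?)'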
    # Split the text into lines
    lines = text.split('\n')

    q_and_a_dict = {}
    question = None
    answer = []

    for line in lines:
        line = line.strip()  # Remove any extra spaces
        if re.match(question_pattern, line):  # If it's a question
            if question and answer:  # If there's an existing Q&A pair, save it
                q_and_a_dict[question] = ' '.join(answer).strip()  # Join answer lines together
            question = line  # Set the new question
            answer = []  # Clear the answer list for the new question
        elif question:  # If an answer is expected (after a question)
            answer.append(line)

    # For the last Q&A pair, after the loop ends
    if question and answer:
        q_and_a_dict[question] = ' '.join(answer).strip()

    return q_and_a_dict

# Upload the PDF file (`files` was already imported from google.colab above)
uploaded = files.upload()

# Get the name of the uploaded file
file_name = list(uploaded.keys())[0]  # Get the first uploaded file name

# Open the PDF and parse text from page 3 onwards
with pdfplumber.open(file_name) as pdf:
    all_text = ""
    for page_num in range(2, len(pdf.pages)):  # Start from page 3 (index 2)
        page = pdf.pages[page_num]
        all_text += page.extract_text()

# Extract Questions and Answers
q_and_a_dict = extract_q_and_a(all_text)

#  Print the dictionary of Questions and Answers
for question, answer in q_and_a_dict.items():
    print(f"Q: {question}")
    print(f"A: {answer}")
    print()


Saving financial.pdf to financial.pdf
Q: 1) What is Accounting?
A: American Institute of Certified Public Accountants (AICPA) formulated the following definition of accounting in 1961 – “Accounting is the art of recording, classifying and summarizing in a significant manner and in terms of money, transactions and events which are, in part at least, of a financial character, and interpreting the result thereof” As per the ICAI, Accounting is simply an art of record keeping. The aim of accounting is to meet the information needs of rational and sound decision-makers. Hence, it is called the language of business. Thus, Accounting, in the simplest form, is the process of systematically recording, summarizing, classifying, analyzing, and reporting financial transactions. Following are the four types of Accounting – 1) Financial Accounting 2) Cost Accounting 3) Tax Accounting 4) Management Accounting Accounting is an art of Recording Classifying S ummarizing In terpreting Reporting business 
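
The parser above starts a new question on any line that contains a `?`, so a question that wraps onto a second line is split in two. The sketch below is one possible way to tolerate wrapped questions. It assumes every question begins with a number followed by `)`, as in `1) What is Accounting?`. The function name `extract_numbered_q_and_a` and the sample text are made up for illustration and are not part of the notebook.

```python
import re

# Assumption: every question starts with "<number>) ", as in "1) What is Accounting?"
QUESTION_START = re.compile(r'^\d+\)\s')

def extract_numbered_q_and_a(text):
    """Hypothetical variant of extract_q_and_a that tolerates wrapped questions."""
    pairs = {}
    question_lines, answer_lines = [], []
    in_question = False

    def flush():
        # Store the pair collected so far, if both parts are present
        if question_lines and answer_lines:
            pairs[' '.join(question_lines)] = ' '.join(answer_lines).strip()

    for raw in text.split('\n'):
        line = raw.strip()
        if not line:
            continue
        if QUESTION_START.match(line):
            flush()
            question_lines, answer_lines = [line], []
            # The question is still open if this line does not end with '?'
            in_question = not line.endswith('?')
        elif in_question:
            # Continuation of a question that wrapped onto the next line
            question_lines.append(line)
            in_question = not line.endswith('?')
        elif question_lines:
            answer_lines.append(line)

    flush()
    return pairs

# Made-up sample text with one wrapped question
sample = (
    "1) What is Accounting?\n"
    "Accounting is the process of recording transactions.\n"
    "2) Who are the users of accounting\n"
    "information?\n"
    "Internal users and external users.\n"
)
result = extract_numbered_q_and_a(sample)
for q, a in result.items():
    print(f"Q: {q}\nA: {a}\n")
```

One caveat: a numbered list item inside an answer (for example `1) Financial Accounting` on its own line) would also match the question prefix, so the result should be checked by eye before it is used.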

In [6]:
import csv

# Function to save Q&A to a CSV file
def save_to_csv(q_and_a_dict, file_name='questions_answers.csv'):
    # Open (or create) a CSV file for writing
    with open(file_name, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        # Write the header row
        writer.writerow(["Question", "Answer"])

        # Write each question and answer as a row
        for question, answer in q_and_a_dict.items():
            writer.writerow([question, answer])

# Save the Q&A to a CSV file
save_to_csv(q_and_a_dict)

# Optional: Download the CSV file to your local machine (if you're using Colab)
#from google.colab import files
#files.download('questions_answers.csv')


Exploring the Q&A dictionary as a DataFrame

In [7]:
data_2 = pd.DataFrame(list(q_and_a_dict.items()), columns=['Question', 'Answer'])


In [8]:
display(data_2.head())
print(data_2.info())

Unnamed: 0,Question,Answer
0,1) What is Accounting?,American Institute of Certified Public Account...
1,32) How does Accounting Information help in ma...,Accounting information plays a crucial role in...
2,3) Why Should an Investor Understand Accountin...,An Investor should understand Accounting Infor...
3,44) How does limited liability encourage entre...,“Limited liability is a legal status where a p...
4,information?,1 . Internal Users Users of Accounting Informa...


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Question  64 non-null     object
 1   Answer    64 non-null     object
dtypes: object(2)
memory usage: 1.1+ KB
None
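
The preview above contains a row whose question is only `information?`, which looks like the tail of a question that wrapped across two lines in the PDF. A quick way to list such rows is sketched below. It assumes a well-formed question starts with a number followed by `)`. The `demo` frame is a small stand-in with illustrative rows, not the real `data_2`.

```python
import pandas as pd

# Small stand-in for data_2; the rows are illustrative, not the real dataset
demo = pd.DataFrame({
    'Question': ['1) What is Accounting?', 'information?', '2) What is a ledger?'],
    'Answer': ['answer one', 'answer two', 'answer three'],
})

# Assumption: a well-formed question starts with "<number>) "
is_numbered = demo['Question'].str.match(r'^\d+\)\s')

# Rows that fail the check are candidates for manual repair or removal
suspect_rows = demo[~is_numbered]
print(suspect_rows)
```

Running the same check on `data_2` would show how many of the 64 entries need attention before they are merged with the other dataset.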


In [9]:
print(data_2.describe())

                      Question  \
count                       64   
unique                      64   
top     1) What is Accounting?   
freq                         1   

                                                   Answer  
count                                                  64  
unique                                                 64  
top     American Institute of Certified Public Account...  
freq                                                    1  


Next, we upload a CSV file of scenario-specific customer questions, each paired with an answer that outlines a financial strategy.

In [10]:
uploaded = files.upload()
data_3 = pd.read_csv("unique_financial.csv")


Saving unique_financial.csv to unique_financial.csv
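
The cell above reads a hard-coded file name, which fails if the uploaded file is called something else. In Colab, `files.upload()` returns a dictionary that maps each file name to its bytes, so the CSV can be read straight from that dictionary. The sketch below uses a hand-made dictionary in place of a real upload, and the file name and row in it are made up.

```python
import io
import pandas as pd

# Stand-in for the dict returned by files.upload() in Colab: {file name: file bytes}
uploaded = {
    "my_strategies.csv": b"Question,Answer\nHow do I start?,Build a budget first.\n"
}

# Take whichever file was uploaded rather than assuming its name
file_name, content = next(iter(uploaded.items()))
df = pd.read_csv(io.BytesIO(content))

# Fail early if the expected columns are missing
missing = {"Question", "Answer"} - set(df.columns)
if missing:
    raise ValueError(f"{file_name} is missing columns: {sorted(missing)}")

print(file_name, df.shape)
```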


In [11]:
data_3.head()


Unnamed: 0,Question,Answer
0,I'm a 27-year-old freelancer in digital market...,"With a stable freelance income, begin by creat..."
1,I’m a 32-year-old small business owner with an...,Consider a SEP IRA or Solo 401(k) for self-emp...
2,I am a 24-year-old tech startup founder making...,It’s essential to balance growth and risk. All...
3,I’m a 45-year-old entrepreneur with a stable i...,Maximize contributions to retirement accounts ...
4,As a 30-year-old e-commerce business owner wit...,Create a cash buffer to manage fluctuating inc...


In [12]:
data_3.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Question  100 non-null    object
 1   Answer    100 non-null    object
dtypes: object(2)
memory usage: 1.7+ KB


In [13]:
nan_summary = data_3.isna().sum()  # count NaNs in each column
print("NaN values per column:\n", nan_summary)

# Drop rows with NaN values
data_3 = data_3.dropna()

# Check for duplicate rows
duplicates = data_3.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

NaN values per column:
 Question    0
Answer      0
dtype: int64
Number of duplicate rows: 0


In [14]:
data_3.describe()


Unnamed: 0,Question,Answer
count,100,100
unique,100,100
top,I'm a 27-year-old freelancer in digital market...,"With a stable freelance income, begin by creat..."
freq,1,1


## Combining Our Two DataFrames for the General Financial Model

In [15]:
data = pd.concat([data_1, data_2], ignore_index=True)

In [16]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Question  94 non-null     object
 1   Answer    94 non-null     object
dtypes: object(2)
memory usage: 1.6+ KB
None


In [17]:
nan_summary = data.isna().sum()
print("NaN values per column:\n", nan_summary)

data = data.dropna()

# Check for duplicate rows
duplicates = data.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

NaN values per column:
 Question    0
Answer      0
dtype: int64
Number of duplicate rows: 0
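
`DataFrame.duplicated()` only flags rows that match exactly in every column, so two questions that differ in case or spacing are not counted. The sketch below shows one way to look for such near-duplicates by comparing a normalized copy of the question text. The `demo` rows are made up for illustration.

```python
import pandas as pd

# Illustrative rows: the first two questions differ only in case and spacing
demo = pd.DataFrame({
    'Question': [
        'How can I create a good budget?',
        'how can i  create a good budget? ',
        'Why is an emergency fund important?',
    ],
    'Answer': ['Answer A', 'Answer B', 'Answer C'],
})

# Exact-match check, as used in the notebook
exact_duplicates = int(demo.duplicated().sum())

# Normalize case and whitespace before comparing questions
normalized = demo['Question'].str.lower().str.split().str.join(' ')
near_duplicates = demo[normalized.duplicated(keep=False)]

print(f"Exact duplicate rows: {exact_duplicates}")
print(f"Rows sharing a normalized question: {len(near_duplicates)}")
```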


In [18]:
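# Function to clean and preprocess text
def clean_text(text):
    # Keep only ASCII letters, digits, whitespace, basic punctuation (. , ? !)
    # and the currency symbols $ € ₹; every other character is removed
    text = re.sub(r'[^A-Za-z0-9.,?!\s$€₹]', '', text)
    # Collapse runs of whitespace into a single space and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text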
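
The character whitelist in `clean_text` also removes hyphens, parentheses, apostrophes, percent signs, and every non-Latin letter. That changes the meaning of some financial text and empties Arabic text completely. The sketch below reproduces the notebook's function and compares it with a gentler variant. `clean_text_keep_punct` is a hypothetical alternative, and the extra characters it keeps are an assumption about what this dataset needs.

```python
import re

def clean_text(text):
    # Same function as in the notebook
    text = re.sub(r'[^A-Za-z0-9.,?!\s$€₹]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def clean_text_keep_punct(text):
    # Hypothetical variant: also keeps - % ( ) ' / : ; and the Arabic Unicode block
    text = re.sub(r"[^A-Za-z0-9\u0600-\u06FF.,?!\s$€₹%()'/:;-]", '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

sample = "Why should I invest in a 401(k)? Aim for 3-6 months."
strict = clean_text(sample)
gentle = clean_text_keep_punct(sample)

print(strict)  # Why should I invest in a 401k? Aim for 36 months.
print(gentle)  # Why should I invest in a 401(k)? Aim for 3-6 months.

# Arabic input is removed entirely by the original whitelist
arabic = "ما هو الادخار؟"
print(repr(clean_text(arabic)))             # ''
print(repr(clean_text_keep_punct(arabic)))  # the text is preserved
```

Whether the wider whitelist is appropriate depends on the tokenizer and the target model, so it is worth comparing a few rows before and after cleaning.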


In [19]:
data_f=data.copy()

In [20]:
data_f['Question'] = data['Question'].apply(clean_text)
data_f['Answer'] = data['Answer'].apply(clean_text)


## Strategy Data Cleaning:

In [21]:
data_3f=data_3.copy()

In [22]:
data_3f['Question'] = data_3['Question'].apply(clean_text)
data_3f['Answer'] = data_3['Answer'].apply(clean_text)

## Downloading the Two DataFrames as CSV Files for Fine-Tuning the Models:

In [23]:
data_f.to_csv("general.csv", index=False)
# Download the file
files.download("general.csv")

# Save the cleaned strategy DataFrame as a CSV file
data_3f.to_csv("strategie.csv", index=False)
# Download the file
files.download("strategie.csv")

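
Before handing the exported files to a fine-tuning job, it can help to confirm that a CSV reloads to the same table, since answers contain commas and quotes that rely on correct CSV quoting. The sketch below is a minimal round-trip check on a made-up frame written to a temporary directory. Only the file name `general.csv` mirrors the notebook.

```python
import os
import tempfile
import pandas as pd

# Illustrative frame with a comma and quotes inside an answer
demo = pd.DataFrame({
    'Question': ['How can I create a good budget?'],
    'Answer': ['List your income, then set "limits" on spending.'],
})

# Write to a temporary location so nothing in the working directory is touched
path = os.path.join(tempfile.mkdtemp(), 'general.csv')
demo.to_csv(path, index=False)

# Reload and compare with the original
reloaded = pd.read_csv(path)
round_trip_ok = reloaded.equals(demo)
print(f"Round trip preserved the data: {round_trip_ok}")
```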