# Data pre-processing
Prepare and pre-process the documents to be used as training data.

#### Convert the Word or PDF file to a text file and save it **(run only once)**

Several PDF text-extraction methods were tried in order to improve table recognition and, above all, equation recognition.

In [14]:
from docx import Document  # python-docx: text extraction from Word (.docx) files
import fitz  # PyMuPDF: a common way to extract the text layer of a PDF
import pdfplumber  # alternative PDF text extraction

from pdf2image import convert_from_path
from PIL import Image
import pytesseract

# Function to extract text from a word file
def extract_text_from_docs(doc_path):
    doc = Document(doc_path)
    return '\n'.join([para.text for para in doc.paragraphs if para.text.strip()])

# Extract text from a PDF file using PyMuPDF
def extract_text_from_pdf_pymupdf(file_path):
    text = ""
    with fitz.open(file_path) as doc:
        for page in doc:
            text += page.get_text()
    return text

# Extract text from a PDF file using pdfplumber
def extract_text_from_pdf_pdfplumber(file_path):
    text = ""
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"
    return text

# Extract text from a PDF file using OCR (for pages without a usable text layer)
def extract_text_from_pdf_ocr(file_path, lang='eng'):
    # Render each PDF page to an image
    try:
        pages = convert_from_path(file_path, dpi=300)
    except Exception as e:
        # Without page images there is nothing to OCR, so return an empty string
        print(f"PDF-to-image conversion failed: {e}")
        return ""
    # OCR each page, skipping pages on which Tesseract fails
    text = ""
    for i, page in enumerate(pages, start=1):
        try:
            page_text = pytesseract.image_to_string(page, lang=lang)
            text += f"\n\n--- Page {i} ---\n\n{page_text}"
        except Exception as e:
            print(f"Warning: OCR failed on page {i}: {e}")

    return text


# Save the extracted text to a .txt file
def save_text_to_file(text, output_path):
    with open(output_path, 'w', encoding='utf-8') as file:
        file.write(text)

pdf_path = r'Books/pdf/Diagnosisofbusiness.pdf'

#text = extract_text_from_docs(r'C:\Users\arez3\Desktop\Etudes\Limitless Learning\Gen AI\Mini projet\Books\word\Diagnosisofbusiness.docx')
text = extract_text_from_pdf_ocr(pdf_path)
save_text_to_file(text, r'C:\Users\arez3\Desktop\Etudes\Limitless Learning\Gen AI\Mini projet\Books\text\Diagnosisofbusiness4.txt')
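
Since several extractors were tried, one option is to run the cheaper text-layer extractors first and fall back to OCR only when they return too little text. The sketch below is a hypothetical helper: `extract_with_fallback` and the `min_chars` threshold are assumptions, not part of the notebook. It takes the extractor functions as arguments, so it runs without any PDF library installed.

```python
def extract_with_fallback(file_path, extractors, min_chars=200):
    """Return (name, text) from the first extractor whose output looks usable.

    `extractors` is a list of (name, function) pairs tried in order.
    `min_chars` is an assumed threshold: shorter results are treated as a
    failed extraction (e.g. a scanned PDF without a text layer).
    """
    best_name, best_text = None, ""
    for name, extract in extractors:
        try:
            text = extract(file_path) or ""
        except Exception as e:
            print(f"{name} failed on {file_path}: {e}")
            continue
        if len(text.strip()) >= min_chars:
            return name, text
        # Remember the longest result in case no extractor reaches the threshold
        if len(text) > len(best_text):
            best_name, best_text = name, text
    return best_name, best_text


# Stand-in extractors so the example runs on its own
def empty_text_layer(path):
    return ""

def fake_ocr(path):
    return "recognised text " * 20

method, extracted = extract_with_fallback(
    "example.pdf", [("text layer", empty_text_layer), ("ocr", fake_ocr)]
)
print(method, len(extracted))
```

In this notebook the list could hold `extract_text_from_pdf_pymupdf`, `extract_text_from_pdf_pdfplumber` and `extract_text_from_pdf_ocr`, in that order.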




#### Chunk the text for embedding

Split the document into smaller, overlapping chunks so that each one can be embedded and retrieved on its own.

In [51]:
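# TODO: decide whether chunks should follow thematic boundaries or whether fixed-size windows are enough

def chunk_text(text, chunk_size=1000, overlap=200):
    # Split text into fixed-size character windows; consecutive windows share `overlap` characters
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks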

chunks = chunk_text(text)
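
On the TODO about thematic versus arbitrary chunks: a possible middle ground is to pack whole paragraphs into each chunk, so that a chunk never starts or ends mid-sentence. The following is a minimal sketch, not the method used in this notebook; `chunk_by_paragraph` and its `max_chars` parameter are hypothetical names, and it assumes paragraphs are separated by newlines, as in the output of the extraction functions above.

```python
def chunk_by_paragraph(text, max_chars=1000):
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A paragraph longer than max_chars is cut into fixed-size pieces so that
    no chunk exceeds the limit.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n")):
        if not para:
            continue
        # Cut oversized paragraphs into fixed-size pieces
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "Chapter 1\nFirst paragraph.\n\nSecond paragraph is here.\nThird one."
for c in chunk_by_paragraph(sample, max_chars=30):
    print(repr(c))
```

Unlike `chunk_text`, this variant has no overlap between chunks; whether that hurts retrieval on this book would need to be checked.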


In [52]:
import os
os.environ["TRANSFORMERS_NO_TF"] = "1"  # disable tensorflow fallback

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

def search_similar_chunks(query, model, index, chunks, top_k=3):
    query_embedding = model.encode([query])
    D, I = index.search(np.array(query_embedding), top_k)
    return [chunks[i] for i in I[0]]


model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(chunks)

dimension = len(embeddings[0])
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))
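
`faiss.IndexFlatL2` performs an exact, brute-force search: it compares the query with every stored vector and ranks them by squared Euclidean distance. The pure-Python sketch below illustrates that idea on toy 2-dimensional vectors; `l2_search` is a hypothetical name and the vectors are made-up values, not real embeddings.

```python
def l2_search(query, vectors, top_k=3):
    """Return (distances, indices) of the top_k vectors closest to query.

    Distances are squared Euclidean distances, sorted in increasing order.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, idx))
    scored.sort()
    top = scored[:top_k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-dimensional "embeddings" (made-up values for illustration)
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = l2_search([1.0, 1.0], toy_vectors, top_k=2)
print(distances, indices)
```

A smaller distance means a closer match. FAISS returns its results as arrays with one row per query, which is why `search_similar_chunks` reads `I[0]`.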

In [53]:
query = "Give a diagnosis of my business problems and possible solutions"
relevant_chunks = search_similar_chunks(query, model, index, chunks, top_k=1)

print(relevant_chunks)

[' modern approaches to diagnose a business performance of a company which activates in a modern sustainable development economy. Also it proposes general models for diagnosing a business. This work is necessary both for the academic environment and the business environment, providing the guarantee of acquiring rich and up-to-date skills in the the financial analysis area, in order to ensure the professionalisation of all economic specialists at a high level.\nCHAPTER 1. FUNDAMENTAL CONCEPTS OF BUSINESS DIAGNOSIS\nConceptual approaches\nThe diagnosis process, in the general sense, represent „a broad investigation of the main aspects of the organization activity, of economic, technical, sociological, legal and managerial nature, in order to identify strengths and disruption, causes that generated them, and to design some recommendations for improvement and development ” (Miles, 2000, p.86).\nIn the view Bătrâncea et al. (2008) analysis and diagnosis of a business involves the decomposit

In [54]:
from transformers import AutoTokenizer

# Load a tokenizer to approximate token counts. 'bert-base-uncased' is only a stand-in:
# it is not the tokenizer used by the Gemini/Gemma models, so the counts are rough estimates.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def trim_text_by_tokens(text, max_tokens):
    # Split the text into tokens
    tokens = tokenizer.tokenize(text)

    # Keep at most max_tokens tokens
    if len(tokens) > max_tokens:
        tokens = tokens[:max_tokens]

    # Convert tokens back to a string (note: the uncased tokenizer lowercases the text)
    trimmed_text = tokenizer.convert_tokens_to_string(tokens)
    return trimmed_text

In [55]:
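import os
import time
import google.generativeai as genai
from google.api_core.exceptions import ResourceExhausted

# Read the API key from the environment instead of hard-coding it in the notebook
# (the variable name GOOGLE_API_KEY is an assumption, not part of the original notebook)
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])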
modelai = genai.GenerativeModel("gemma-3-12b-it")

def ask_gemini(question, context_chunks, max_context_chars=10000, max_retries=5, initial_wait_time=5):
    context = "\n\n".join(context_chunks)
    context = context[:max_context_chars]  # truncate safely
    context = trim_text_by_tokens(context,1000)
    prompt = f"""Based on the following business knowledge, answer the question:\n\nContext:\n{context}\n\nQuestion: {question}"""

    retries = 0
    wait_time = initial_wait_time
    while retries < max_retries:
        try:
            response = modelai.generate_content(prompt)
            return response.text
        except ResourceExhausted as e:
            print(f"Rate limit exceeded. Retrying in {wait_time} seconds... (Attempt {retries + 1}/{max_retries})")
            print(f"Error details: {e}")
            time.sleep(wait_time)
            retries += 1
            wait_time *= 2  # Exponential backoff
        except Exception as e: # Catch other potential errors
            print(f"An unexpected error occurred: {e}")
            raise # Re-raise other errors
    
    raise ResourceExhausted(f"Failed to get response after {max_retries} retries due to rate limiting.")
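
The retry loop in `ask_gemini` can be factored into a reusable helper. The sketch below is one possible variant under stated assumptions: `retry_with_backoff`, `max_wait` and the jitter range are hypothetical choices, not part of the notebook or of the Google client library. Compared with the loop above, it caps the wait, adds random jitter, and re-raises the original error after the last attempt.

```python
import random
import time


def retry_with_backoff(call, retryable=(Exception,), max_retries=5,
                       initial_wait=5.0, max_wait=60.0, sleep=time.sleep):
    """Call `call()` and retry on `retryable` errors with exponential backoff.

    The wait doubles after each failure, is capped at max_wait, and gets
    random jitter so that parallel clients do not retry in lockstep.
    """
    wait = initial_wait
    for attempt in range(1, max_retries + 1):
        try:
            return call()
        except retryable as e:
            if attempt == max_retries:
                raise
            delay = min(wait, max_wait) * random.uniform(0.5, 1.0)
            print(f"Attempt {attempt}/{max_retries} failed ({e}); retrying in {delay:.1f}s")
            sleep(delay)
            wait *= 2


# Stand-in for a rate-limited API call: fails twice, then succeeds
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limit exceeded")
    return "ok"

result = retry_with_backoff(flaky_call, retryable=(RuntimeError,),
                            sleep=lambda seconds: None)
print(result, attempts["count"])
```

In the notebook, `call` could be `lambda: modelai.generate_content(prompt).text` with `retryable=(ResourceExhausted,)`. The `sleep` argument exists only so the example and tests can skip the real waiting.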


In [56]:

# Example
answer = ask_gemini("Sales at my retail business have been declining for the past 9 months. I’m seeing fewer customers, my inventory isn't moving fast, and my online ads aren't working well. Can you help me diagnose what might be wrong and suggest where to focus", relevant_chunks)
print(answer)

Okay, based on the provided context, here's a breakdown of how to approach diagnosing your retail business's declining performance, and where to focus your efforts. This aligns with the "business diagnosis" concept described in the text.

**1. Understanding the Diagnostic Approach (Based on the Context)**

The context emphasizes a *broad investigation* of your business, looking at economic, technical, sociological, legal, and managerial aspects. It's not just about the numbers (though those are important!), but also the *why* behind them.  The goal is to identify:

*   **Strengths:** What's still working well?
*   **Disruptions:** What's going wrong (you've already identified some!)
*   **Causes:** *Why* are these disruptions happening?
*   **Recommendations:** What can you do to improve and develop?

**2. Applying the Diagnostic Framework to Your Situation**

Let's break down your situation and apply this framework. Here's a structured approach, categorized by the areas mentioned in t

In [11]:
answer = ask_gemini("What is H infinity Controller ?", relevant_chunks)
print(answer)

This text **does not mention H infinity Controller**. It focuses on business diagnosis and financial analysis within a sustainable development economy. The provided context describes the general process of business diagnosis, referencing the works of Miles (2000) and Batrancea et al. (2008). It's entirely unrelated to control systems or H infinity controllers.



Therefore, based solely on the provided text, there is no information about what an H infinity Controller is.


In [12]:
answer = ask_gemini("Hello", relevant_chunks)
print(answer)

Hello! It appears you've started a question but haven't finished it. Based on the provided context, you likely want to ask something about business diagnosis. 

**Please complete your question.** I'm ready to answer it based on the information given. For example, you could ask:

* "According to Batrancea et al. (2008), what does analysis and diagnosis of a business involve?"
* "What is the general definition of business diagnosis according to Miles (2000)?"
* "What is the purpose of business diagnosis in a modern sustainable development economy?"



I'm waiting for your full question!


In [38]:
answer = ask_gemini("Can you give me the exhaustive list of all the bibliography list provided in the book ?", relevant_chunks)
print(answer)

This is a trick question! The provided text **does not contain a bibliography list**. It's a description of a company's situation (strengths, weaknesses, opportunities) – a business analysis, not an academic paper with citations.



Therefore, the answer is: **There is no bibliography list provided in the text.**


In [37]:
print(relevant_chunks)

['et position for most of its products and services; traditional customers represents over 50 % from the total number of customers of the company; there are no quantitative restrictions in raw materials and materials supply; the distribution is organized through own networks; very good product quality; traditional relations with beneficiaries; the company functions in a significant competitive milieu, but, through its products, it becomes competitive now and in the future; The company is the only producer or distributor.\nWeaknesses points may include: lack of financial resources leading to the interruption of supply; the current production capacity and it technical level represents a limit in entering other market segments; decrease (in real terms) of the incomes obtained from important activities; distribution through middlemen; Weak product quality; Performing activities with no or little profitability.\nMarket opportunities may include: fast market increase; possibility of expandin
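
In the cells above, `relevant_chunks` is retrieved once and then reused for every call to `ask_gemini`, so the context does not depend on the question being asked. A minimal sketch of retrieving per question follows; `answer_with_retrieval` is a hypothetical wrapper, and its `retrieve` and `generate` arguments stand in for `search_similar_chunks` and `ask_gemini`. The toy data exists only to make the example runnable without a model or an index.

```python
def answer_with_retrieval(question, retrieve, generate, top_k=3):
    """Retrieve context for this specific question, then generate an answer."""
    context_chunks = retrieve(question, top_k)
    return generate(question, context_chunks)


# Stand-ins so the example runs without a model or an index
toy_chunks = {
    "sales": "toy text about sales",
    "bibliography": "toy text about the bibliography",
}

def fake_retrieve(question, top_k):
    hits = [text for key, text in toy_chunks.items() if key in question.lower()]
    return hits[:top_k]

def fake_generate(question, context_chunks):
    return f"{len(context_chunks)} chunk(s) used: {context_chunks}"

print(answer_with_retrieval("Why are my sales falling?", fake_retrieve, fake_generate))
```

With the notebook's own functions, the call could look like `answer_with_retrieval(q, lambda q, k: search_similar_chunks(q, model, index, chunks, top_k=k), ask_gemini)`.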