# AI-Powered Document Analyzer

### Import modules

In [1]:
from PyPDF2 import PdfReader

In [2]:
pdf_reader = PdfReader("../data/Sample.pdf")

text = ""
for page in pdf_reader.pages:
    text += page.extract_text()
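
The loop above assumes every page yields a string. As a minimal sketch, the helper below joins page text while skipping pages that return nothing usable (scanned pages, for instance). `join_page_text` and `PageLike` are hypothetical names introduced for illustration; the only thing assumed about a page is that it exposes an `extract_text()` method, as the pages from `PdfReader` do.

```python
from typing import Iterable, List, Optional, Protocol


class PageLike(Protocol):
    """Anything exposing extract_text(), such as a PyPDF2 page object."""

    def extract_text(self) -> Optional[str]: ...


def join_page_text(pages: Iterable[PageLike], separator: str = "\n") -> str:
    """Concatenate the text of all pages, skipping pages without text."""
    parts: List[str] = []
    for page in pages:
        page_text = page.extract_text() or ""  # treat None as "no text"
        if page_text.strip():
            parts.append(page_text)
    return separator.join(parts)
```

With the reader from the cell above, this could be called as `text = join_page_text(pdf_reader.pages)`.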

In [3]:
text

"                    \n     © Database Town. com  \n \n \nhttps://d atabasetown. com    Page 1 \n \nSTATISTICS FOR DATA SCIENCE  \nA. DESCRIPTIVE STATISTICS:  \nBefore going to discuss about descriptive statistics, first  we recall the basic concept of data \nand its types again here  before starting descriptive statistics ….. \nData : \n \nData is a collection of factual information based on numbers, words, observations, \nmeasurements which can be utilized for calculation, discussion and reasoning.  \n \nTYPES OF DATA :\n \nThe crude dataset is the basic foundation of data science and it may be of dif ferent kinds like \nStructured Data (Tabular structure), Unstructured Data (pictures, recordings, messages, PDF \ndocuments and so forth.) and Semi Structured.  DATA (plural) \nSingular form is \ndatum\nCategorical or Qualitative Data \nbased on descriptive information \ne.g He is a clever boy\nBionomial Data\nVariable data with only two \noptions \ne.g. good or bad, true or false\nNomi

### Text Processing

In [4]:
import spacy
import re

### Load spaCy Model

In [5]:
nlp = spacy.load("en_core_web_sm")

In [6]:
doc = nlp(text)
doc

                    
     © Database Town. com  
 
 
https://d atabasetown. com    Page 1 
 
STATISTICS FOR DATA SCIENCE  
A. DESCRIPTIVE STATISTICS:  
Before going to discuss about descriptive statistics, first  we recall the basic concept of data 
and its types again here  before starting descriptive statistics ….. 
Data : 
 
Data is a collection of factual information based on numbers, words, observations, 
measurements which can be utilized for calculation, discussion and reasoning.  
 
TYPES OF DATA :
 
The crude dataset is the basic foundation of data science and it may be of dif ferent kinds like 
Structured Data (Tabular structure), Unstructured Data (pictures, recordings, messages, PDF 
documents and so forth.) and Semi Structured.  DATA (plural) 
Singular form is 
datum
Categorical or Qualitative Data 
based on descriptive information 
e.g He is a clever boy
Bionomial Data
Variable data with only two 
options 
e.g. good or bad, true or false
Nominal or Unordered Data
Variable

In [7]:
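# Keep lemmas of alphanumeric tokens that are neither stop words nor punctuation
alnum_pattern = re.compile(r'^[a-zA-Z0-9]*$')  # compiled once instead of once per token
tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct and alnum_pattern.match(token.text)]
preprocessed_text = " ".join(tokens)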

In [8]:
preprocessed_text = preprocessed_text.lower()

In [9]:
preprocessed_text

'database town com atabasetown com page 1 statistics datum science descriptive statistics go discuss descriptive statistic recall basic concept datum \n type start descriptive statistic \n datum data collection factual information base number word observation \n measurement utilize calculation discussion reasoning type datum crude dataset basic foundation datum science dif ferent kind like \n structured data tabular structure unstructured data picture recording message pdf \n document forth semi structured data plural \n singular form \n datum \n categorical qualitative data \n base descriptive information \n clever boy \n bionomial data \n variable datum \n option \n good bad true false \n nominal unordered data \n variable datum \n unordered form \n red green man \n ordinal data \n variable datum proper \n order \n short medium longnumerical quantitative data \n base numerical informtaion \n 2 leg \n discrete data \n datum countable \n child \n numberscontinuous datum \n datum measur
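
The stray `\n` entries in the output above come from the token filter: the pattern `^[a-zA-Z0-9]*$` accepts zero characters, and `$` also matches just before a trailing newline, so a bare newline token passes. The sketch below contrasts that pattern with a stricter check based on `fullmatch`. The names `LOOSE`, `STRICT`, and `is_alnum_token` are hypothetical, and the sample strings merely stand in for token text.

```python
import re

# Pattern from the filter above: "*" allows an empty match and "$" also
# matches right before a trailing newline, so "\n" and "" both pass.
LOOSE = re.compile(r'^[a-zA-Z0-9]*$')

# Stricter alternative: one or more characters, and the whole string must match.
STRICT = re.compile(r'[a-zA-Z0-9]+')


def is_alnum_token(token_text: str) -> bool:
    """Return True only for non-empty, purely ASCII-alphanumeric text."""
    return STRICT.fullmatch(token_text) is not None


samples = ["data", "2", "\n", "", "dif-ferent"]
loose_kept = [s for s in samples if LOOSE.match(s)]
strict_kept = [s for s in samples if is_alnum_token(s)]
```

Swapping the regex test in the list comprehension for `is_alnum_token(token.text)` would drop the newline tokens, at the cost of changing the preprocessed output shown above.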

In [10]:
from langchain_groq import ChatGroq

In [11]:
llm = ChatGroq(
    model="llama-3.1-70b-versatile",
    temperature=0,
    groq_api_key="YOUR_GROQ_API_KEY",  # Paste your Groq API key here
)
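
Pasting a key into a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, reads the key from an environment variable. The helper `load_groq_api_key` is hypothetical, and the variable name `GROQ_API_KEY` is an assumption here rather than something the notebook specifies.

```python
import os
from typing import Mapping


def load_groq_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the Groq API key from the environment or fail with a clear message."""
    key = env.get("GROQ_API_KEY", "").strip()  # assumed variable name
    if not key:
        raise RuntimeError("Set the GROQ_API_KEY environment variable first.")
    return key


# Usage (not executed here):
# llm = ChatGroq(
#     model="llama-3.1-70b-versatile",
#     temperature=0,
#     groq_api_key=load_groq_api_key(),
# )
```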

In [12]:
response = llm.invoke("""'This theorem states that the distribution of sample means approximates a normal 
distribution  as  the  sample  size  gets  larger  (assuming  that  all  samples  are  the  same  in  size), 
regardless of population distribution shape. If the sample sizes= or >30 are considered enough for 
the  Central  Limit  Theorem  to  hold.  The  main  aspect  of  this  theorem  is  that  the  average  of  the 
sample  means  and  standard  deviations  will  equal  the  population  mean  and  standard  deviation. 
Furthermore,  an  adequately  large  sample  size  can  forecast  the  characteristics  of  a  population 
accurately.'
======================================================
Translate the above paragraph into Hindi.
Don't give any preamble, just give only translated text.
""")

In [13]:
response.content

'यह प्रमेय बताता है कि नमूना माध्यों का वितरण नमूना आकार के बड़ा होने के साथ एक सामान्य वितरण के अनुरूप होता है (यह मानकर कि सभी नमूने आकार में समान हैं), जनसंख्या वितरण आकार की परवाह किए बिना। यदि नमूना आकार 30 या अधिक माने जाते हैं तो केंद्रीय सीमा प्रमेय के लिए पर्याप्त है। इस प्रमेय का मुख्य पहलू यह है कि नमूना माध्यों और मानक विचलनों का औसत जनसंख्या माध्य और मानक विचलन के बराबर होगा। इसके अलावा, एक पर्याप्त रूप से बड़ा नमूना आकार एक जनसंख्या की विशेषताओं को सटीक रूप से अनुमान लगा सकता है।'
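
The translation prompt above hard-codes both the paragraph and the target language. As a rough sketch, the same prompt layout can be produced by a small function so that either part can vary. `build_translation_prompt` is a hypothetical helper, and the layout simply mirrors the string passed to `llm.invoke` above.

```python
def build_translation_prompt(paragraph: str, target_language: str) -> str:
    """Build a translation prompt: quoted paragraph, separator, instructions."""
    separator = "=" * 54
    return (
        f"'{paragraph.strip()}'\n"
        f"{separator}\n"
        f"Translate the above paragraph into {target_language}.\n"
        "Don't give any preamble, just give only translated text.\n"
    )
```

It could then be used as `llm.invoke(build_translation_prompt(paragraph, "Hindi"))`, where `paragraph` holds the text to translate.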

In [14]:
from langchain_core.prompts import PromptTemplate

In [15]:
# Define a question-answering prompt template
qa_prompt_template = PromptTemplate.from_template(
    """
    ### DOCUMENT: {document_text}
    
    ### INSTRUCTION:
    You are a question-answering system. The user will provide a document and ask a question about it. Your task is to answer the question based on the content of the document.

    Question: {question}
    Answer:

    """
)

qa_chain = qa_prompt_template | llm
response = qa_chain.invoke(
    {
        "document_text": preprocessed_text,
        "question": "What is Central Limit Theorem?"
    }
)

In [16]:
response.content

'The Central Limit Theorem (CLT) states that the distribution of the sample mean will approximate a normal distribution as the sample size gets large, regardless of the shape of the population distribution. This theorem assumes that the sample size is sufficiently large (usually greater than 30) and that the population distribution has a finite mean and standard deviation. The CLT is important because it allows us to make inferences about a population based on a sample of data, even if the population distribution is unknown.'
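
For asking several questions about the same document, the invoke-and-read-content steps can be wrapped in a small helper. This is only a sketch: `answer_question` is a hypothetical name, and it assumes the chain accepts the `document_text` and `question` keys used by the template above and returns an object with a `content` attribute.

```python
from typing import Any


def answer_question(chain: Any, document_text: str, question: str) -> str:
    """Invoke a prompt | llm chain and return the answer as plain text."""
    if not document_text.strip():
        raise ValueError("document_text is empty; extract the PDF text first.")
    response = chain.invoke(
        {"document_text": document_text, "question": question}
    )
    # Chat models return a message with .content; fall back to str() otherwise.
    return str(getattr(response, "content", response)).strip()
```

A call might look like `answer_question(qa_chain, preprocessed_text, "What is Central Limit Theorem?")`.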

In [17]:
document_list = [
    """
        Paris, the capital city of France, is renowned for its rich history and cultural heritage. 
        Situated in the north-central part of the country along the Seine River, Paris is often referred to 
        as "The City of Light" due to its pivotal role during the Age of Enlightenment and its stunning illuminated landmarks. 
        The city is famous for its iconic monuments, such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, 
        which houses thousands of works of art including the Mona Lisa. Paris is also a global hub for fashion, cuisine, and the arts, 
        making it a major destination for tourists from around the world.
    """,
    """
        Marie Curie, born Maria Skłodowska in Warsaw, Poland in 1867, was a pioneering scientist renowned for her groundbreaking work 
        in the field of radioactivity. She was the first woman to win a Nobel Prize and remains the only person to 
        have won Nobel Prizes in two different scientific fields: Physics (1903, shared with her husband Pierre Curie 
        and Henri Becquerel) and Chemistry (1911). Marie Curie's research was instrumental in the development of X-ray machines, 
        and she made significant contributions to the understanding of radioactive elements such as radium and polonium. 
        Her work laid the foundation for advancements in medical treatments and nuclear science.
    """
]

In [18]:
# Define a multiple documents comparison prompt template
compare_docs_prompt_template = PromptTemplate.from_template(
    """
    ## DOCUMENTS FOR COMPARISON:
        - List of Documents Content:
            - {document_list}

    ==================================================================

    ## INSTRUCTIONS:
    
    - If the user provides only one document, provide only the Summary and Key Insights; do not provide Similarities.
    - Do not include a preamble.
    - If the user provides only one document, do not label the response as "Document 1", etc.
    - Treat this response as a markdown file.

    Summary:
        - Provide a concise summary of each document from given document dictionary values, capturing the main points and overall message.
        - In your response, include the dictionary key as the document name, followed by the summary of the corresponding value.
        - Format:
            **Document Name**: 
                - [Summary of the document]

    ===================================================================

    Similarities:
        - Identify and describe the similarities across the documents. Focus on common themes, topics, or pieces of information.

    ===================================================================

    Key Insights:
        - Extract and list key pieces of information from each document, such as specific places or regions mentioned, organizations or entities referenced, notable facts, data, or events, and individuals or other relevant people.
        - If any of these items are present in the document, give that word Desert (#FAD5A5) as the background color and black as the font color, wrap it in a rounded 3-pixel border, and make the font weight bold.
        - If any of these items are not present in the document, do not include them in the response.
        
        - In your response, include the dictionary key as the document name, followed by the key insights of the corresponding value.
        - Format:
            **Document Name**: 
                - [Key insights of the document]

    ## EXAMPLE RESPONSE:

    ### Summary:
    
        - **Document Name**: 
            - [Summary]
            

    ### Similarities:
    
        - [List of similarities]
        

    ### Key Insights:
    
        - **Document Name**: 
            - [Key Insights]
    """
)

compare_chain = compare_docs_prompt_template | llm

In [19]:
response = compare_chain.invoke(
    {
        "document_list": document_list
    }
)
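
The prompt asks the model to use "the dictionary key as the document name", while the input here is a plain list, so the model has to infer names on its own. One way to make the names explicit, sketched below, is to keep the documents in a dictionary and render them into a single labelled string before filling the template. `format_documents` and the `Document Name:` / `Content:` labels are hypothetical choices, not part of the original notebook.

```python
from typing import List, Mapping


def format_documents(documents: Mapping[str, str]) -> str:
    """Render {name: content} pairs as labelled blocks for the prompt."""
    blocks: List[str] = []
    for name, content in documents.items():
        body = " ".join(content.split())  # collapse indentation and newlines
        blocks.append(f"Document Name: {name}\nContent: {body}")
    return "\n\n".join(blocks)
```

The result could be passed as the `document_list` value, for example `{"document_list": format_documents({"Paris": document_list[0], "Marie Curie": document_list[1]})}`.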

In [20]:
print(response.content)

### Summary:

- **Paris**: 
    - Paris, the capital city of France, is famous for its rich history, cultural heritage, and iconic landmarks like the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It's a global hub for fashion, cuisine, and the arts, attracting tourists worldwide.

- **Marie Curie**: 
    - Marie Curie was a pioneering scientist who made groundbreaking contributions to the field of radioactivity, winning two Nobel Prizes in Physics and Chemistry. Her research led to the development of X-ray machines and advancements in medical treatments and nuclear science.

### Similarities:

- Both documents highlight the significance of their respective subjects in their fields, with Paris being a cultural and historical hub, and Marie Curie being a pioneering scientist.
- Both documents mention the subjects' impact on the world, with Paris attracting tourists and Marie Curie's research leading to advancements in medical treatments and nuclear science.

### Key Insights:

- **Paris**: 
    - <span style="background-color: #FAD5A5; border-radius: 3px; font-weight: bold; color: black">Paris</span> is the capital city of <span style="background-color: #FAD5A5; border-radius: 3px; font-weight: bold; color: black">France</span>.
    - The city is situated along the <span style="background-color: #FAD5A5; border-radius: 3px; font-weight: bold; color: black">Seine River</span>.
    - Famous landmarks include the <span style="background-color: #FAD5A5; border-radius: 3px; font-weight: bold; color: black">Eiffel Tower</span>, <span style="background-color: #FAD5A5; border-radius: 3px; font-weight: bold; color: black">Notre-Dame Cathedral</span>, and the <span style="background-color: #FAD5A5; border-radius: 3px; font-weight: bold; color: black">Louvre Museum</span>.
    - The city is a global hub for <span style="background-color: #FAD5A5; border-radius: 3px; font-weight: bold; color: black">fashion</span>, <span style="background-color: #FAD5A5; border-radius: 3px; font-weight: bold; color: black">cuisine</span>, and the <span style="background-color: #FAD5A5; border-radius: 3px; font-weight: bold; color: black">arts</span>.

- **Marie Curie**: 
    - <span style="background-color: #FAD5A5; border-radius: 3px; font-weight: bold; color: black">Marie Curie</span> was born in <span style="background-color: #FAD5A5; border-radius: 3px; font-weight: bold; color: black">Warsaw, Poland</span> in 1867.
    - She was the first woman to win a <span style="background-color: #FAD5A5; border-radius: 3px; font-weight: bold; color: black">Nobel Prize</span> and the only person to win Nobel Prizes in two different scientific fields: <span style="background-color: #FAD5A5; border-radius: 3px; font-weight: bold; color: black">Physics</span> (1903) and <span style="background-color: #FAD5A5; border-radius: 3px; font-weight: bold; color: black">Chemistry</span> (1911).
    - Her research led to the development of <span style="background-color: #FAD5A5; border-radius: 3px; font-weight: bold; color: black">X-ray machines</span> and advancements in medical treatments and <span style="background-color: #FAD5A5; border-radius: 3px; font-weight: bold; color: black">nuclear science</span>.
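
Since the highlighted terms arrive wrapped in `<span>` tags, they can also be pulled out as a plain list, for example to build an index of entities. The sketch below assumes the model keeps emitting spans with a `#FAD5A5` background as in the output above, which is not guaranteed. `extract_highlighted_terms` is a hypothetical helper.

```python
import re
from typing import List

# Matches spans whose inline style mentions the highlight color used above.
SPAN_RE = re.compile(
    r'<span style="[^"]*background-color:\s*#FAD5A5[^"]*">(.*?)</span>',
    re.IGNORECASE | re.DOTALL,
)


def extract_highlighted_terms(markdown_text: str) -> List[str]:
    """Return highlighted terms in order of first appearance, without repeats."""
    seen = set()
    terms: List[str] = []
    for match in SPAN_RE.finditer(markdown_text):
        term = match.group(1).strip()
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return terms
```

Calling `extract_highlighted_terms(response.content)` on the response above would return the highlighted words and phrases as strings.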