<a href="https://colab.research.google.com/github/Rdughan/A-Safer-Campus/blob/main/Document_Intelligence_Tool.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q transformers torch gradio sentencepiece

In [2]:
import gradio as gr
from transformers import pipeline
import torch



In [3]:
device = 0 if torch.cuda.is_available() else -1
print(f"🖥️  Running on: {'GPU' if device == 0 else 'CPU'}")
if device == 0:
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB\n")


🖥️  Running on: CPU



In [4]:
print("🤖 Loading AI models (this takes ~1-2 minutes)...\n")

🤖 Loading AI models (this takes ~1-2 minutes)...



In [5]:
print("Loading sentiment analyzer...")
sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=device
)

Loading sentiment analyzer...


Device set to use cpu


In [6]:
print("Loading summarizer...")
summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    device=device)

Loading summarizer...


Device set to use cpu


In [7]:
print("Loading Q&A system...")
qa_model = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",
    device=device
)

Loading Q&A system...


Device set to use cpu


In [8]:
print("Loading entity recognizer...")
ner_model = pipeline(
    "ner",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
    device=device
)

Loading entity recognizer...


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


In [9]:
print("\n✅ All models loaded successfully!\n")


✅ All models loaded successfully!



In [10]:
def analyze_sentiment(text):
    """
    Analyzes emotional tone of text.
    Returns: Label (POSITIVE/NEGATIVE) and confidence score
    """
    if not text.strip():
        return "Please enter some text"

    result = sentiment_analyzer(text[:512])[0]  # Keep only the first 512 characters (a rough guard, not a token limit)
    label = result['label']
    score = result['score']

    emoji = "😊" if label == "POSITIVE" else "😞"
    return f"{emoji} {label} (confidence: {score:.2%})"
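
The SST-2 checkpoint only emits POSITIVE or NEGATIVE, so neutral text still receives one of the two labels. One possible refinement, sketched here, is to report low-confidence predictions as uncertain. The helper name `label_with_threshold` and the 0.75 cutoff are assumptions for illustration and are not part of the notebook; the input mimics the `{'label': ..., 'score': ...}` dict that the pipeline returns.

```python
def label_with_threshold(result, threshold=0.75):
    """Format a sentiment result, flagging low-confidence predictions.

    `result` is a dict shaped like one pipeline output item:
    {'label': 'POSITIVE' or 'NEGATIVE', 'score': float in [0, 1]}.
    `threshold` is an illustrative cutoff, not a tuned value.
    """
    label, score = result["label"], result["score"]
    if score < threshold:
        return f"🤔 UNCERTAIN (leaning {label}, confidence: {score:.2%})"
    emoji = "😊" if label == "POSITIVE" else "😞"
    return f"{emoji} {label} (confidence: {score:.2%})"


# Hand-written dicts stand in for real model output here.
print(label_with_threshold({"label": "POSITIVE", "score": 0.98}))
print(label_with_threshold({"label": "NEGATIVE", "score": 0.55}))
```

In the app, this could be applied to `sentiment_analyzer(...)[0]` in place of the inline formatting; the right cutoff would need to be chosen by inspecting real predictions.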

In [11]:
def summarize_text(text, max_length=130, min_length=30):
    """
    Creates a concise summary of long text.
    Uses BART fine-tuned on CNN/DailyMail news articles.
    """
    if not text.strip():
        return "Please enter some text"

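    if len(text.split()) < 50:
        return "⚠️ Text too short to summarize (need at least 50 words)"

    # Gradio sliders can pass floats; generation lengths must be integers
    max_length, min_length = int(max_length), int(min_length)
    if min_length >= max_length:
        return "⚠️ Min summary length must be smaller than max summary length"

    try:
        summary = summarizer(
            text,
            max_length=max_length,
            min_length=min_length,
            do_sample=False,
            truncation=True  # clip input that exceeds the model's maximum length
        )[0]['summary_text']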
        return summary
    except Exception as e:
        return f"Error: {str(e)}"
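
The summarization model accepts a bounded input (about 1024 tokens for `facebook/bart-large-cnn`), so very long documents cannot be summarized in a single call. A minimal sketch of one workaround is to split the text into word-based chunks first. The helper `chunk_words` and the 400-word default are assumptions for illustration; word count is only a rough proxy for token count.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive chunks of at most `max_words` words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Each chunk could be passed to summarize_text() and the partial
# summaries joined; here we only show the splitting step.
chunks = chunk_words("one two three four five six seven", max_words=3)
print(chunks)
```

Splitting on word boundaries can cut a sentence in half, so a sentence-aware splitter may give better summaries at the cost of more code.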

In [12]:
def answer_question(context, question):
    """
    Answers questions based on provided context.
    Uses RoBERTa fine-tuned on SQuAD 2.0.
    """
    if not context.strip() or not question.strip():
        return "Please provide both context and question"

    try:
        result = qa_model(question=question, context=context)
        answer = result['answer']
        confidence = result['score']
        return f"📝 Answer: {answer}\n\n🎯 Confidence: {confidence:.2%}"
    except Exception as e:
        return f"Error: {str(e)}"
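
Besides `answer` and `score`, the question-answering pipeline result also carries `start` and `end` character offsets into the context. The sketch below uses them to show where the answer sits in the document. The helper `show_answer_in_context` and the hand-built result dict are hypothetical and only stand in for real model output.

```python
def show_answer_in_context(context, result, window=40):
    """Return the answer with up to `window` characters of context on each side.

    `result` is a dict shaped like the question-answering pipeline output,
    where 'start' and 'end' are character offsets into `context`.
    """
    start, end = result["start"], result["end"]
    left = context[max(0, start - window):start]
    right = context[end:end + window]
    return f"...{left}[{context[start:end]}]{right}..."


context = "The majority of the forest is contained within Brazil, with 60% of the rainforest."
# Hand-built result standing in for real model output.
answer = "60%"
start = context.index(answer)
fake_result = {"answer": answer, "score": 0.9, "start": start, "end": start + len(answer)}
print(show_answer_in_context(context, fake_result, window=20))
```

Showing the surrounding text makes it easier to judge whether a low-confidence answer is actually supported by the document.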

In [13]:
def extract_entities(text):
    """
    Identifies named entities (people, places, organizations, and miscellaneous names).
    Returns formatted list of findings.
    """
    if not text.strip():
        return "Please enter some text"

    entities = ner_model(text[:512])  # Only the first 512 characters are analyzed

    if not entities:
        return "No entities found"

    # Group by entity type
    grouped = {}
    for entity in entities:
        ent_type = entity['entity_group']
        ent_text = entity['word']
        score = entity['score']

        if ent_type not in grouped:
            grouped[ent_type] = []
        grouped[ent_type].append(f"{ent_text} ({score:.2%})")

    # Format output
    output = []
    for ent_type, items in grouped.items():
        output.append(f"\n**{ent_type}:**")
        output.extend([f"  • {item}" for item in items])

    return "\n".join(output)
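
The grouping step above lists every mention, so a name that appears several times is repeated, and the raw tags (`PER`, `ORG`, `LOC`, `MISC`) are shown as-is. A possible variant, sketched under the assumption that the input has the same `entity_group` / `word` / `score` keys, keeps the best score per unique word and maps tags to readable names. `LABEL_NAMES` and `group_entities` are hypothetical names introduced for this example.

```python
from collections import defaultdict

# Tag names used by CoNLL-2003-style NER models such as dslim/bert-base-NER.
LABEL_NAMES = {
    "PER": "People",
    "ORG": "Organizations",
    "LOC": "Locations",
    "MISC": "Miscellaneous",
}


def group_entities(entities):
    """Group entity dicts by type, keeping the best score per unique word."""
    best = defaultdict(dict)
    for ent in entities:
        group, word = ent["entity_group"], ent["word"]
        score = float(ent["score"])  # pipeline scores may be NumPy floats
        if score > best[group].get(word, 0.0):
            best[group][word] = score
    return {LABEL_NAMES.get(group, group): words for group, words in best.items()}


# Hand-written entities stand in for real model output.
sample = [
    {"entity_group": "LOC", "word": "Brazil", "score": 0.99},
    {"entity_group": "LOC", "word": "Brazil", "score": 0.95},
    {"entity_group": "LOC", "word": "Peru", "score": 0.98},
]
print(group_entities(sample))
```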


In [14]:
def full_analysis(text):
    """
    Runs all analyses on input text.
    Great for getting complete document insights.
    """
    if not text.strip():
        return "Please enter some text"

    results = []
    results.append("# 📊 FULL DOCUMENT ANALYSIS\n")

    # Sentiment
    results.append("## 1️⃣ Sentiment")
    results.append(analyze_sentiment(text))

    # Entities
    results.append("\n## 2️⃣ Named Entities")
    results.append(extract_entities(text))

    # Summary (if long enough)
    if len(text.split()) >= 50:
        results.append("\n## 3️⃣ Summary")
        results.append(summarize_text(text))

    return "\n".join(results)


In [15]:
print("🎨 Building interactive interface...\n")

🎨 Building interactive interface...



In [22]:
sample_article = """The Amazon rainforest, covering much of northwestern Brazil and extending into Colombia, Peru and other South American countries, is the world’s largest tropical rainforest, famed for its biodiversity. It’s crisscrossed by thousands of rivers, including the powerful Amazon. River towns, with 19th-century architecture from rubber-boom days, include Brazil’s Manaus and Belém and Peru’s Iquitos and Puerto Maldonado.\nThe Amazon basin is the part of South America drained by the Amazon River and its tributaries. The Amazon drainage basin covers an area of about 7.05 million square kilometers (2.72 million square miles), or about 40 percent of the South American continent. It is located in the countries of Bolivia, Brazil, Colombia, Ecuador, French Guiana, Guyana, Peru, Suriname, and Venezuela.\n"""

sample_context = """The Amazon rainforest is a moist broadleaf forest in the Amazon biome that covers most of the Amazon basin of South America. This basin encompasses 7 million square kilometers (2.7 million square miles), of which 5.5 million square kilometers (2.1 million square miles) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, and with minor amounts in Colombia, Ecuador, Bolivia, Guyana, Suriname, and French Guiana. States or departments in four nations have "Amazonas" in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world."""

with gr.Blocks(title="Document Intelligence Tool", theme=gr.themes.Soft()) as demo:

    gr.Markdown("""
    # 🤖 Document Intelligence Tool
    ### Learn ML/AI by analyzing real text with Hugging Face models

    **What this does:** Analyzes documents using state-of-the-art NLP models

    **What you're learning:** Transformers, tokenization, transfer learning, model pipelines
    """)

    with gr.Tab("📊 Full Analysis"):
        gr.Markdown("Get complete insights: sentiment, entities, and summary")
        with gr.Row():
            with gr.Column():
                full_input = gr.Textbox(
                    label="Your Text",
                    placeholder="Paste an article, email, or document...",
                    lines=10,
                    value=sample_article
                )
                full_btn = gr.Button("🔍 Analyze Document", variant="primary")
            with gr.Column():
                full_output = gr.Markdown(label="Analysis Results")

        full_btn.click(fn=full_analysis, inputs=full_input, outputs=full_output)



In [None]:
with demo:  # re-enter the Blocks context so this tab is attached to the app
    with gr.Tab("😊 Sentiment"):
        gr.Markdown("Detect emotional tone: positive or negative")
        sent_input = gr.Textbox(
            label="Text to analyze",
            placeholder="Enter a review, tweet, or comment...",
            lines=3
        )
        sent_btn = gr.Button("Analyze Sentiment")
        sent_output = gr.Textbox(label="Result")

        sent_btn.click(fn=analyze_sentiment, inputs=sent_input, outputs=sent_output)

        gr.Examples(
            examples=[
                ["This product exceeded my expectations! Absolutely love it."],
                ["Terrible service, would not recommend to anyone."],
                ["The movie was okay, nothing special but not bad either."]
            ],
            inputs=sent_input
        )

In [None]:
with demo:  # re-enter the Blocks context so this tab is attached to the app
    with gr.Tab("📝 Summarization"):
        gr.Markdown("Generate concise summaries of long text")
        summ_input = gr.Textbox(
            label="Long text to summarize",
            placeholder="Paste an article (min 50 words)...",
            lines=8,
            value=sample_article
        )
        with gr.Row():
            max_len = gr.Slider(50, 200, value=130, label="Max summary length")
            min_len = gr.Slider(20, 100, value=30, label="Min summary length")
        summ_btn = gr.Button("Generate Summary")
        summ_output = gr.Textbox(label="Summary", lines=4)

        summ_btn.click(
            fn=summarize_text,
            inputs=[summ_input, max_len, min_len],
            outputs=summ_output
        )

In [None]:
with demo:  # re-enter the Blocks context so this tab is attached to the app
    with gr.Tab("❓ Q&A"):
        gr.Markdown("Ask questions about your documents")
        with gr.Row():
            with gr.Column():
                qa_context = gr.Textbox(
                    label="Context (document text)",
                    placeholder="Paste the text you want to ask about...",
                    lines=8,
                    value=sample_context
                )
            with gr.Column():
                qa_question = gr.Textbox(
                    label="Your Question",
                    placeholder="What do you want to know?",
                    lines=2
                )
                qa_btn = gr.Button("Get Answer")
                qa_output = gr.Textbox(label="Answer", lines=4)

        qa_btn.click(fn=answer_question, inputs=[qa_context, qa_question], outputs=qa_output)

        gr.Examples(
            examples=[
                ["How much of the rainforest is in Brazil?"],
                ["What is the size of the Amazon basin?"],
                ["Which countries contain the Amazon rainforest?"]
            ],
            inputs=qa_question
        )

In [None]:
with demo:  # re-enter the Blocks context so this tab is attached to the app
    with gr.Tab("🏷️ Named Entities"):
        gr.Markdown("Extract people, places, and organizations")
        ner_input = gr.Textbox(
            label="Text to analyze",
            placeholder="Enter text with names, locations, organizations...",
            lines=5,
            value=sample_article
        )
        ner_btn = gr.Button("Extract Entities")
        ner_output = gr.Markdown(label="Found Entities")

        ner_btn.click(fn=extract_entities, inputs=ner_input, outputs=ner_output)

In [None]:
with demo:  # re-enter the Blocks context so this tab is attached to the app
    with gr.Tab("📚 Learn More"):
        gr.Markdown("""
        ## 🎓 What You're Learning

        ### Models Used:
        1. **DistilBERT** (Sentiment) - Smaller, faster version of BERT
        2. **BART** (Summarization) - Encoder-decoder architecture
        3. **RoBERTa** (Q&A) - Optimized BERT variant
        4. **BERT-NER** (Entities) - Fine-tuned for named entity recognition

        ### Key ML Concepts:
        - **Transfer Learning**: Using pre-trained models for new tasks
        - **Tokenization**: Converting text into numeric token IDs that models can process
        - **Attention Mechanisms**: How models focus on relevant parts of text
        - **Fine-tuning**: Adapting general models to specific tasks

        ### Next Steps:
        1. Try your own documents
        2. Compare different model outputs
        3. Experiment with parameters (summary length, etc.)
        4. Learn to fine-tune models on custom data

        ### Resources:
        - [Hugging Face Docs](https://huggingface.co/docs)
        - [Transformers Course](https://huggingface.co/course)
        - [Model Hub](https://huggingface.co/models)
        """)

print("✅ Interface ready!\n")

In [None]:
demo.launch(share=True, debug=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://a41d6e7e84eaf0137d.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
