<a href="https://colab.research.google.com/github/Shrimayee30/AgileAI/blob/main/playground1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AgileAI

### 1. Import Libraries and Load the Transformer Model

In [None]:
!pip install transformers accelerate



In [None]:
# Load the GPT-J model
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "EleutherAI/gpt-j-6B"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

KeyboardInterrupt: 

### 2. Data Loading
Fetch the data from Google Drive, then validate and preprocess it.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Define your folder paths
data_folder = "/content/drive/MyDrive/AgileAI/Part 1/data"
template_folder = "/content/drive/MyDrive/AgileAI/Part 1/templates"

In [4]:
import os

# List project description files (PDFs)
data_files = sorted([f for f in os.listdir(data_folder) if f.endswith('.pdf')])
print("Project Description Files:", data_files)

# List template files (TXT)
template_files = sorted([f for f in os.listdir(template_folder) if f.endswith('.txt')])
print("Template Files:", template_files)

Project Description Files: ['AI model - Online Transaction Fraud Detection.pdf', 'Data Science - Sales Analysis.pdf', 'IOT - smart home.pdf', 'Mobile Application - Digital Banking.pdf', 'Website - E commerce fashion.pdf']
Template Files: ['AI-system_template.txt', 'DS_template.txt', 'Mobile-app_template.txt', 'Universal_template.txt', 'Webapp_template.txt', 'iot_template.txt']
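
The listing above only confirms that files with the right extension exist. A minimal validation sketch, assuming that "valid" means the folder exists and each matching file is non-empty; the helper name `validate_folder` is hypothetical and not part of the original notebook:

```python
import os

def validate_folder(folder, extension):
    """Return (valid_files, problems) for files in `folder` ending with `extension`.

    Hypothetical helper: it only checks that the folder exists and that each
    matching file is non-empty. It does not check the file contents.
    """
    if not os.path.isdir(folder):
        return [], [f"Folder not found: {folder}"]
    valid, problems = [], []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(extension):
            continue  # ignore files of other types
        path = os.path.join(folder, name)
        if os.path.getsize(path) == 0:
            problems.append(f"Empty file: {name}")
        else:
            valid.append(name)
    if not valid:
        problems.append(f"No usable {extension} files in {folder}")
    return valid, problems
```

In this notebook it could be called as `validate_folder(data_folder, ".pdf")` and `validate_folder(template_folder, ".txt")`, printing any reported problems before moving on to text extraction.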


In [5]:
# Read the project description PDFs
!pip install PyMuPDF

import fitz  # PyMuPDF
import os

def extract_text_from_pdf(pdf_path):
    # Open with a context manager so the document is closed after reading
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)

pdf_files = sorted([f for f in os.listdir(data_folder) if f.endswith('.pdf')])
raw_texts = {f: extract_text_from_pdf(os.path.join(data_folder, f)) for f in pdf_files}

Collecting PyMuPDF
  Downloading pymupdf-1.26.5-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.5-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.1/24.1 MB 68.9 MB/s eta 0:00:00
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.26.5


In [6]:
# Load all templates
template_texts = {}
for filename in template_files:
    path = os.path.join(template_folder, filename)
    with open(path, "r", encoding="utf-8") as f:
        template_texts[filename] = f.read()

In [7]:
# Let's look at our data: preview the first 1000 characters of each project
for name, text in raw_texts.items():
    print(f"\n--- {name} ---\n")
    print(text[:1000])


--- AI model - Online Transaction Fraud Detection.pdf ---

          International Research Journal of Engineering and Technology (IRJET)       e-ISSN: 2395-0056 
                Volume: 11 Issue: 04 | Apr 2024              www.irjet.net                                                                        p-ISSN: 2395-0072 
 
 
© 2024, IRJET       |       Impact Factor value: 8.226       |       ISO 9001:2008 Certified Journal       |     Page 2499 
 
Online Transaction Fraud Detection Using Machine Learning  
 Dr.Ranjit K N1, Ms. Bhoomika Rajendra Vernekar2, Ms. Chandana MR3, Ms. Spandana MP4,  
Mr. Mallikarjun Bachwar5 
1 Associate Professor, Dept. of Computer Science and Engineering, Maharaja Institute of Technology, 
Thandavapura 
2,3,4,5Students, Dept of Computer Science and Engineering, Maharaja Institute of Technology, Thandavapura 
---------------------------------------------------------------------***---------------------------------------------------------------------
Abs

### Data Preprocessing
From the PDFs, we're going to remove the unnecessary sections:
- Page numbers, headers/footers
- Citations, references, footnotes
- Academic boilerplate (e.g., "Abstract", "Acknowledgements")
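
Running headers, footers, and page numbers tend to repeat on every page, so one way to catch them is to count how often each line occurs across pages. The following is a minimal sketch of that idea, not the approach used in the cells below. It assumes the text is available page by page as a list of strings; `drop_repeated_lines` and the `min_fraction` threshold are hypothetical names and values chosen for illustration.

```python
import re
from collections import Counter

def _line_key(line):
    # Replace digit runs so "Page 2499" and "Page 2500" count as the same line
    return re.sub(r"\d+", "#", line.strip())

def drop_repeated_lines(pages, min_fraction=0.6):
    """Remove lines that occur on at least `min_fraction` of the pages.

    Assumption: a line repeated on most pages is a header, footer, or page
    number. Body lines that happen to repeat would be removed as well.
    """
    page_lines = [page.splitlines() for page in pages]
    counts = Counter()
    for lines in page_lines:
        # Count each distinct line once per page, ignoring blank lines
        counts.update({_line_key(ln) for ln in lines if ln.strip()})
    # Require at least two occurrences so single-page documents are untouched
    threshold = max(2, min_fraction * len(pages))
    repeated = {key for key, n in counts.items() if n >= threshold}
    return [
        "\n".join(ln for ln in lines if _line_key(ln) not in repeated)
        for lines in page_lines
    ]
```

With PyMuPDF, the `pages` list could be built as `[page.get_text() for page in doc]` before the pages are joined into one string.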

Keep only sections that describe:
- Project goal or purpose
- Functional modules/screens/pages
- Technologies used
- User roles or flows
- Data sources or integrations

Lastly, we're going to normalize the text for prompt consistency to ensure:
- Consistent casing (e.g., title case for page names)
- Clear separation between modules
- No trailing whitespace or broken sentences
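
The three normalization goals above can be sketched as follows. This is an illustration under a strong assumption: that the text has already been split into named modules, passed in as a dict of name to description. The notebook's pipeline does not produce such a dict, and `normalize_for_prompt` is a hypothetical helper.

```python
def normalize_for_prompt(modules):
    """Format {module_name: description} as title-cased, clearly separated blocks.

    Assumption: `modules` is a dict built elsewhere; this sketch does not
    detect module boundaries itself.
    """
    blocks = []
    for name, description in modules.items():
        # Consistent casing: collapse whitespace and title-case the name
        title = " ".join(name.split()).title()
        # No stray whitespace: collapse runs and trim both ends
        body = " ".join(description.split())
        # No broken sentences: cut a trailing fragment back to the last
        # sentence-ending punctuation mark, if there is one
        if body and body[-1] not in ".!?":
            cut = max(body.rfind(c) for c in ".!?")
            if cut != -1:
                body = body[:cut + 1]
        blocks.append(f"{title}:\n{body}")
    # Clear separation: one blank line between modules
    return "\n\n".join(blocks)
```

Note that `str.title()` capitalizes after every non-letter character, so names containing apostrophes or acronyms may need special handling.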

In [9]:
import re

# Step 1: Remove academic metadata and boilerplate
def clean_text(text):
    text = re.sub(r'\s+', ' ', text)

    # Journal metadata
    text = re.sub(r'(e-ISSN|p-ISSN|Impact Factor|Volume:?|Issue:?|ISO 9001|Certified Journal|IRJET|IRJMETS|IJSRST|OPUS Open Portal|National Conference)', '', text, flags=re.IGNORECASE)

    # Author and affiliation blocks
    text = re.sub(r'(Dr\.|Mr\.|Ms\.|Prof\.|By [A-Z][a-z]+|[A-Z][a-z]+ [A-Z]\.?[A-Z]*\d*|Department of .*?Engineering|Institute of .*?Technology|University of .*?Technology).*?(?=\s[A-Z])', '', text)

    # Declaration and acknowledgements
    text = re.sub(r'(Candidate’s Declaration|This is to certify|I hereby declare|submitted in partial fulfillment|ACKNOWLEDGEMENTS|Supervisor|Project Coordinator|authorized administrator).*?(?=\s[A-Z])', '', text, flags=re.IGNORECASE)
    text = re.sub(r'(Follow this and additional works at|Part of the .*? Commons|Visit the .*? Department|This project has been made possible by .*?|I would like to express .*?|In completing this graduate project .*?)', '', text, flags=re.IGNORECASE)

    # Copyright and license
    text = re.sub(r'(Copyright|Creative Commons Attribution|Licensee|Open-access article|© \d{4})', '', text, flags=re.IGNORECASE)

    # URLs and emails
    text = re.sub(r'http\S+|www\.\S+|@\S+', '', text)

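    # Broken figure/table labels (after the whitespace collapse above there are
    # no newlines left, so `$` only matches a label at the very end of the text)
    text = re.sub(r'(Figure|Table) \d+:?\s*$', '', text)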

    return text.strip()

# Step 2: Fix fragmented and broken phrases
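def fix_phrases(text):
    # Remove runs of pipes, hyphens, or colons left over from layout rules
    text = re.sub(r'[\|]{2,}|[-]{2,}|[:\-]{2,}', '', text)
    # Collapse repeated whitespace
    text = re.sub(r'([a-zA-Z])\s{2,}([a-zA-Z])', r'\1 \2', text)
    text = re.sub(r'\s{2,}', ' ', text)
    # Capitalize the first letter after a period (also affects abbreviations such as "e.g.")
    text = re.sub(r'\.\s*([a-z])', lambda m: '. ' + m.group(1).upper(), text)
    # Insert a period between a lowercase letter and a capitalized word
    # (aggressive: also splits phrases such as "Machine Learning")
    text = re.sub(r'([a-z]) ([A-Z])', r'\1. \2', text)
    # Add the missing space after a period that is directly followed by a lowercase letter
    text = re.sub(r'\.(?=[a-z])', '. ', text)
    return text.strip()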

# Step 3: Remove structural artifacts
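def strip_structural_artifacts(text):
    # Drop all-caps tokens of 2+ letters (section labels, but also acronyms such as "AI" or "IOT")
    text = re.sub(r'\b[A-Z]{2,}\b', '', text)
    # Drop numbered headings like "2.1 Title" up to the next capitalized word (also hits decimals)
    text = re.sub(r'\d+\.\d+.*?(?=\s[A-Z])', '', text)
    # Drop table-of-contents leaders and other long runs of dots, dashes, or bullets
    text = re.sub(r'DETAILS[.·•:_\-–—=]{5,}', '', text)
    text = re.sub(r'[.·•:_\-–—=]{5,}', '', text)
    # Drop known leftover fragments
    text = re.sub(r'\bProjects at to\b|\bProjects by an of to\b', '', text)
    return text.strip()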

# Step 4: Extract scope-relevant sentences
def extract_scope(text):
    keywords = ['project', 'system', 'module', 'screen', 'dashboard', 'sensor', 'user', 'technology', 'goal', 'objective', 'function', 'feature', 'process', 'application', 'payment', 'security', 'automation', 'shopping cart']
    sentences = re.split(r'(?<=[.!?]) +', text)
    scoped = [s for s in sentences if any(k in s.lower() for k in keywords) and len(s.strip()) > 50]
    return " ".join(scoped)


In [10]:
def preprocess_pipeline(text):
    cleaned = clean_text(text)
    fixed = fix_phrases(cleaned)
    stripped = strip_structural_artifacts(fixed)
    scoped = extract_scope(stripped)
    return scoped


In [15]:
final_texts = {name: preprocess_pipeline(raw) for name, raw in raw_texts.items()}
for name, text in final_texts.items():
    print(f"\n--- {name} ---\n")
    print(text[:1000])



--- AI model - Online Transaction Fraud Detection.pdf ---

Technology () : 2395-0056 11 04 | Apr 2024 : 2395-0072 , | value:  Page 2499 K N1, Vernekar2, Dept. Engineering, Technology, Thandavapura 2,3,4,5Students, Dept of. Engineering, Technology, Thandavapura *** Abstract – Nowadays, people conduct practically all their business online. While there are many benefits to online transactions, like viability, speedier payments, and ease of use, there are drawbacks as well, including fraud, phishing, and data theft. A feature-engineered machine learning-based model that is. Finally, understanding the costs and risks associated with payment methods is essential to fighting fraud in a methodical and cost- effective manner. Despite the deployment of multiple security measures, fraudulent transactions nonetheless result in the loss of a substantial amount of money. The process of monitoring user activity to assess, spot, or stop unwanted behaviour, such as fraud, intrusion, and defaults, is k

As we can see, I have removed most of the irrelevant material, such as acknowledgements, related work, and the index. Some residue remains (for example, fragments of the journal header and author block), but the text is much cleaner and more tightly scoped than before.
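
To put a number on "cleaner", the raw and processed texts can be compared by length. This is a rough sketch that assumes character count is an acceptable proxy for how much was removed; `reduction_report` is a hypothetical helper, and it says nothing about whether the right content was kept.

```python
def reduction_report(raw_texts, final_texts):
    """Return {name: (raw_chars, final_chars, percent_removed)} per document."""
    report = {}
    for name, raw in raw_texts.items():
        final = final_texts.get(name, "")
        # Guard against empty inputs to avoid dividing by zero
        removed = 100.0 * (1 - len(final) / len(raw)) if raw else 0.0
        report[name] = (len(raw), len(final), round(removed, 1))
    return report
```

Calling `reduction_report(raw_texts, final_texts)` on the dictionaries built above would give one tuple per PDF, which makes it easier to spot a document where the keyword filter removed almost everything or almost nothing.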