## 🚀 Full-Stack NLP: Preparing Text Data from Emails, PDFs, and Web for AI Training

Full Data Pipeline
* ✅ Extract emails from Gmail (IMAP)
* ✅ Extract text from PDF (pdfplumber)
* ✅ Clean Extracted Text (PDF & Email)
* ✅ Scrape text data from websites (BeautifulSoup)
* ✅ Convert text data (Scrape, Email & PDF) to JSONL for LLM fine-tuning
    * ✅ Extra: Load JSONL Data into Pandas for Data manipulation & Analysis
* ✅ Fine-tune an LLM (Hugging Face Transformers)
* ✅ Load the fine-tuned model and test it


#### 📌 Step 1: Setting Up Your Environment

Run the following cell to install the required libraries:

In [1]:
!pip install pandas pdfplumber imaplib2 beautifulsoup4 requests nltk jsonlines python-dotenv selenium



These packages cover data extraction, web scraping, and natural language processing (NLP). Here's what each one does:  

1. **`pandas`** – Used for data manipulation and analysis, especially for handling structured data in DataFrames.  
2. **`pdfplumber`** – Extracts text and images from PDF files.  
3. **`imaplib2`** – A threaded variant of Python's built-in `imaplib` for talking to email servers via IMAP. The code in this notebook uses the standard-library `imaplib`, so this package is optional.  
4. **`beautifulsoup4`** – Parses and extracts data from HTML and XML documents (often used for web scraping).  
5. **`requests`** – Sends HTTP requests to web pages and APIs (commonly used in web scraping and API interactions).  
6. **`nltk`** – The Natural Language Toolkit for NLP tasks such as tokenization, stemming, and sentiment analysis.  
7. **`jsonlines`** – Reads and writes data in JSON Lines format (`.jsonl`), which is useful for handling large JSON datasets efficiently.  
8. **`python-dotenv`** – Loads environment variables from a `.env` file (imported as `dotenv`), useful for keeping API keys and secrets out of the code.  
9. **`selenium`** – Automates web browser interactions, often used for web scraping when JavaScript rendering is required.  


In [2]:
# Install Hugging Face Transformers, Datasets, PyTorch, and Accelerate

!pip install transformers datasets torch accelerate



These dependencies are essential for working with deep learning models, particularly in **Natural Language Processing (NLP)** and **Large Language Model (LLM) fine-tuning**:

1. **`transformers`** – A library by Hugging Face for using and fine-tuning pre-trained transformer models like BERT, GPT, and T5.  
2. **`datasets`** – A lightweight library by Hugging Face for loading, processing, and managing large-scale datasets efficiently.  
3. **`torch`** – The core PyTorch library, required for building and training deep learning models.  
4. **`accelerate`** – A Hugging Face library that handles device placement and distributed training; the `Trainer` class depends on it when used with PyTorch.  

#### 📌 Step 2: Import Required Libraries


In [3]:
''' Extract Emails from Gmail
'''
import os  # To interact with the operating system, including handling environment variables.
from dotenv import load_dotenv  # loads environment variables from a .env file into os.environ

import imaplib  # Import the imaplib module for connecting to and interacting with an email server using IMAP.
import email  # Import the email module to parse email messages.
from email.header import decode_header  # Import decode_header from email.header to decode email subject headers.


''' Extract Text from PDF
'''
import pdfplumber  # Import pdfplumber to extract text from PDF files.


''' Clean Extracted Text
'''
import nltk  # Import nltk (Natural Language Toolkit) for natural language processing tasks.
from nltk.corpus import stopwords # Import stopwords from nltk.corpus to filter out common words in text processing. 
import re  # Import re (regular expressions) for pattern matching and text processing.


''' Save Data in JSONL Format
'''
import jsonlines  # Import jsonlines to read and write JSON Lines (.jsonl) files.


''' Load JSONL Data into Pandas
'''
import json  # For handling JSON files
import pandas as pd  # For handling structured data in dataframe for data manipulation and analysis


''' Web Scraping for LLM Training Data
'''
import requests  # For sending HTTP requests to web pages and APIs
from bs4 import BeautifulSoup  # Parses and extracts data from HTML and XML documents


''' Fine-Tune a Pretrained LLM with Your Extracted Data
'''
# Import necessary modules from the Hugging Face Transformers library  
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments  
# Import functions to load datasets from the Hugging Face Datasets library  
from datasets import load_dataset  
# Import DatasetDict to manage multiple dataset splits (train, validation, test)  
from datasets import DatasetDict  
# Accelerate handles device placement and distributed training for the Hugging Face Trainer
import accelerate

#### 📌 Step 3 Option 1: Extract Emails from Gmail

Run the following cells to fetch emails:

NOTE: To use your Gmail account for IMAP access

     1️⃣ Enable IMAP in Gmail
        1. Go to Gmail Settings (⚙️ → "See all settings").
        2. Navigate to the "Forwarding and POP/IMAP" tab.
        3. Under IMAP access, select Enable IMAP and save changes.
        
     2️⃣ Generate an App Password (for security)
        Google blocks sign-ins from less secure apps, so you must use an App Password instead of your actual Gmail password. App Passwords require 2-Step Verification to be enabled on the account.
        1. Go to Google App Passwords.
        2. Select Mail and choose the device (e.g., Windows, Mac).
        3. Click Generate – Google will give you a 16-character password.
        4. Use this password in your script instead of your Gmail password.
     
     The credentials are read from environment variables (loaded from a .env file) so that the Gmail details are never hardcoded in the notebook.
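
The cell below reads three variables, `EMAIL`, `PASSWORD`, and `IMAP_SERVER`, from `gmail_details.env`. As a minimal sketch, the file might look like the comment in the following block (the values are placeholders; `imap.gmail.com` is Gmail's IMAP host). The helper `missing_credentials` is a hypothetical name, not part of the notebook; it only reports which variables are unset, which makes a failed login easier to diagnose.

```python
import os

# Example contents of gmail_details.env (placeholder values, not real credentials):
#   EMAIL=you@gmail.com
#   PASSWORD=your-16-character-app-password
#   IMAP_SERVER=imap.gmail.com

REQUIRED_VARS = ("EMAIL", "PASSWORD", "IMAP_SERVER")

def missing_credentials(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

# Demonstration with a plain dict standing in for os.environ
example_env = {"EMAIL": "you@gmail.com", "IMAP_SERVER": "imap.gmail.com"}
print(missing_credentials(example_env))  # ['PASSWORD']
```

Calling `missing_credentials()` with no arguments after `load_dotenv(...)` would check the real environment instead of the example dictionary.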

In [4]:
# Gmail IMAP settings
'''
Credentials are loaded from a .env file so that they are
never hardcoded in the notebook.

Edit the access details saved in the .env file before running.
'''

# Load environment variables from .env file
load_dotenv("gmail_details.env")

# Retrieve credentials
EMAIL = os.getenv("EMAIL")
PASSWORD = os.getenv("PASSWORD")
IMAP_SERVER = os.getenv("IMAP_SERVER")

In [5]:
# Connect to Gmail
mail = imaplib.IMAP4_SSL(IMAP_SERVER)
mail.login(EMAIL, PASSWORD)
mail.select("inbox")

('OK', [b'101'])
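
The tuple printed above is the response from `mail.select`: a status string and a list whose first item is the number of messages in the mailbox, as bytes. `mail.search` replies in a similar shape, with the message IDs packed into one space-separated byte string. A small sketch with hard-coded responses (the sample values are assumptions for illustration; no server is contacted) shows how both are unpacked:

```python
# Simulated IMAP responses, shaped like what imaplib returns
select_response = ("OK", [b"101"])
search_response = ("OK", [b"1 2 3 4 5 6 7"])

# select: the first data item is the message count
status, data = select_response
message_count = int(data[0])

# search: split the byte string into individual message IDs
status, messages = search_response
email_ids = messages[0].split()
latest_ids = email_ids[-5:]  # the five most recent IDs

print(message_count)
print(latest_ids)
```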

In [6]:
# Search for all emails
status, messages = mail.search(None, "ALL")
email_ids = messages[0].split()

In [7]:
# Extract latest 5 emails
emails = []
for num in email_ids[-5:]:  # <-- Remove [-5:] to fetch all emails
    status, msg_data = mail.fetch(num, "(RFC822)")
    for response in msg_data:
        if isinstance(response, tuple):
            msg = email.message_from_bytes(response[1])
            # Decode the subject; fall back to an empty string if the header is missing
            subject, encoding = decode_header(msg["Subject"] or "")[0]
            if isinstance(subject, bytes):
                subject = subject.decode(encoding or "utf-8", errors="replace")

            # Get email body (first text/plain part), decoded with its declared charset
            body = ""
            if msg.is_multipart():
                for part in msg.walk():
                    content_type = part.get_content_type()
                    if content_type == "text/plain":
                        charset = part.get_content_charset() or "utf-8"
                        body = part.get_payload(decode=True).decode(charset, errors="replace")
                        break
            else:
                charset = msg.get_content_charset() or "utf-8"
                body = msg.get_payload(decode=True).decode(charset, errors="replace")

            emails.append({"subject": subject, "body": body})

mail.logout()

# Display extracted emails
emails[:3]  # Show first 3 emails

[{'subject': 'Whole has subscribed to you on YouTube!',
  'body': "Whole has subscribed to you on YouTube!\r\nChannels who subscribe to you will be notified when you upload new videos  \r\nor respond to others' videos (by favoriting, commenting, rating, etc). You  \r\ncan control which of your actions are publicly visible by going to your  \r\nSharing settings -  \r\nhttps://www.youtube.com/account_privacy?feature=em-subscription_create\r\nHelp Center - https://support.google.com/youtube\r\nEmail options -  \r\nhttps://www.youtube.com/account_notifications?feature=em-subscription_create\r\nUnsubscribe -  \r\nhttps://www.youtube.com/email_unsubscribe?uid=AcDEyFiAiOwI_2jbBZmxw3s9u9cz7E-unwCC1I-86ohMATfpb7kTxX47-OEC&action_unsubscribe=subscriber&timestamp=1738134675&feature=em-subscription_create\r\n(C) 2025 YouTube, LLC 901 Cherry Ave, San Bruno, CA 94066\r\n"},
 {'subject': 'Andrea Haynes has subscribed to you on YouTube!',
  'body': "Andrea Haynes has subscribed to you on YouTube!\r\nC
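
Real mailboxes contain messages with encoded subjects, missing headers, and bodies that are not UTF-8. The following sketch builds a message in memory and extracts its subject and plain-text body using only the standard library. The helper names `decode_subject` and `plain_text_body` are hypothetical, and the sample message is made up for illustration.

```python
import email
from email.header import decode_header
from email.message import EmailMessage

def decode_subject(msg):
    """Decode a possibly RFC 2047-encoded Subject header; return '' if absent."""
    raw = msg["Subject"]
    if raw is None:
        return ""
    parts = []
    for value, charset in decode_header(raw):
        if isinstance(value, bytes):
            value = value.decode(charset or "utf-8", errors="replace")
        parts.append(value)
    return "".join(parts)

def plain_text_body(msg):
    """Return the first text/plain part, decoded with its declared charset."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is not None:
                charset = part.get_content_charset() or "utf-8"
                return payload.decode(charset, errors="replace")
    return ""

# Build a sample multipart message and round-trip it through bytes,
# which is the form in which IMAP delivers messages
sample = EmailMessage()
sample["Subject"] = "Café menu"
sample.set_content("Plain body")
sample.add_alternative("<p>HTML body</p>", subtype="html")
parsed = email.message_from_bytes(sample.as_bytes())

print(decode_subject(parsed))
print(plain_text_body(parsed).strip())
```

Unlike taking only the first element of `decode_header`, joining all decoded parts handles subjects that mix plain and encoded segments.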

#### 📌 Step 3 Option 2: Extract Text from PDFs
Run this in a separate cell to extract text from PDFs:

In [8]:
def extract_text_from_pdf(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() may return None for pages without a text layer
            text += (page.extract_text() or "") + "\n"
    return text

In [9]:
# Insert PDF file name into the function for text extraction
pdf_text = extract_text_from_pdf("benny+hinn-+good+morning+holy+spirit.pdf")

In [10]:
# Display extracted text
print(pdf_text[:1000])  # Show first 1000 characters

http://www.nd-warez.info/
Good
Morning,
Holy
Spirit
Books by Benny Hinn from
Thomas Nelson Publishers
The Anointing
The Biblical Road to Blessing
Good Morning, Holy Spirit
Welcome, Holy Spirit

Copyright © 1990,1997 by Benny Hinn
All rights reserved. Written permission must be secured from the publisher to use or reproduce
any part of this book, except for brief quotations in critical reviews or articles.
Published in Nashville, Tennessee, by Thomas Nelson, Inc.
Scripture quotations are from THE NEW KING JAMES VERSION of the Bible. Copyright © 1979,
1980, 1982, Thomas Nelson, Inc., Publishers.
Library of Congress Cataloging-in-Publication Data
Hinn, Benny.
Good morning, Holy Spirit / Benny Hinn.
p. cm.
Includes bibliographical references.
ISBN 0-7852-7176-7 (pbk.)
1. Hinn, Benny. 2. Pentecostal churches—United States—Clergy-Biography. 3.
Evangelists—United States—Biography. 4. Holy Spirit. I. Title.
BX8762.Z8H5S 1997
289.9'4'092-dc21
[B] 97-5430
CIP
Printed in the United States of Amer
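
Pages without a text layer (for example, scanned images) may yield no text at all, and in some pdfplumber versions `page.extract_text()` returns `None` for them, so joining page text benefits from a guard. The sketch below shows that guard with a hypothetical `FakePage` class standing in for pdfplumber's page objects, so it runs without a PDF file; `join_page_text` is likewise an illustrative name.

```python
class FakePage:
    """Hypothetical stand-in for a pdfplumber page (illustration only)."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text

def join_page_text(pages):
    """Concatenate page text, treating pages with no extractable text as empty."""
    chunks = []
    for page in pages:
        chunks.append(page.extract_text() or "")
    return "\n".join(chunks)

# The middle page simulates a scanned page with no text layer
pages = [FakePage("Title page"), FakePage(None), FakePage("Chapter 1")]
combined = join_page_text(pages)
print(repr(combined))
```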

#### 📌 Step 4: Clean Extracted Text
Run this to clean the data:

In [11]:
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [12]:
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r"\s+", " ", text)  # Remove extra spaces
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation
    words = text.split()
    words = [word for word in words if word not in stop_words]  # Remove stopwords
    return " ".join(words)

In [13]:
# Clean email text
for email_data in emails:
    email_data["clean_body"] = clean_text(email_data["body"])

In [14]:
# Clean PDF text
pdf_text_clean = clean_text(pdf_text)

In [15]:
# Display cleaned data
pdf_text_clean[:1000]  # Show first 1000 characters

'httpwwwndwarezinfo good morning holy spirit books benny hinn thomas nelson publishers anointing biblical road blessing good morning holy spirit welcome holy spirit copyright 19901997 benny hinn rights reserved written permission must secured publisher use reproduce part book except brief quotations critical reviews articles published nashville tennessee thomas nelson inc scripture quotations new king james version bible copyright 1979 1980 1982 thomas nelson inc publishers library congress cataloginginpublication data hinn benny good morning holy spirit benny hinn p cm includes bibliographical references isbn 0785271767 pbk 1 hinn benny 2 pentecostal churchesunited statesclergybiography 3 evangelistsunited statesbiography 4 holy spirit title bx8762z8h5s 1997 28994092dc21 b 975430 cip printed united states america 48 01 00 99 98 dedication person holy spirit reason daughters jessica natasha lord tarry carry message generation contents acknowledgmentsviii 1 really know you11 2 jaffa end
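
The cleaned output above shows a side effect of deleting punctuation outright: words joined by a dash or hyphen are fused (`churchesunited`, `cataloginginpublication`). One possible variation, sketched below, replaces punctuation with a space before collapsing whitespace. The function name `clean_text_spaced` and the tiny `STOP_WORDS` set are assumptions made so the example runs without downloading NLTK data; they are not the notebook's definitions.

```python
import re

# Small illustrative stop-word set; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "of", "in", "a", "and", "to"}

def clean_text_spaced(text, stop_words=STOP_WORDS):
    """Lowercase, turn punctuation into spaces, collapse whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation becomes a space, not nothing
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "Pentecostal churches—United States—Clergy-Biography."
print(clean_text_spaced(sample))
```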

#### 📌 Step 5: Web Scraping for LLM Training Data
Run this in a separate cell to scrape text from a web page:

In [16]:
#🔸 Scrape Web Pages for Text Data

def scrape_website(url):
    """Scrapes text content from a webpage"""
    response = requests.get(url, timeout=30)  # Avoid hanging forever on a slow server
    response.raise_for_status()  # Stop early on HTTP errors (e.g., 403 or 404)
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract text from paragraphs
    paragraphs = soup.find_all("p")
    text_data = "\n".join([p.get_text() for p in paragraphs])

    return text_data

In [17]:
# Example: Scrape Wikipedia
url = "https://en.wikipedia.org/wiki/Data_science"
data = scrape_website(url)

In [18]:
# Save to file for LLM fine-tuning
with open("scraped_data.txt", "w", encoding="utf-8") as f:
    f.write(data)

print("✅ Web scraping complete. Data saved!")

✅ Web scraping complete. Data saved!
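
To make the paragraph-extraction step concrete without a network request, here is a dependency-free sketch that collects the text inside `<p>` tags using the standard library's `html.parser`. It illustrates the idea rather than reproducing the BeautifulSoup code used in this notebook; the class name `ParagraphExtractor` and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self._in_paragraph = False
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the paragraph
        if self._in_paragraph:
            self._buffer.append(data)

html = "<html><body><h1>Title</h1><p>First <b>bold</b> paragraph.</p><p>Second.</p></body></html>"
parser = ParagraphExtractor()
parser.feed(html)
text_data = "\n".join(parser.paragraphs)
print(text_data)
```

The heading text is skipped because only `<p>` content is collected, which is the same filtering that `soup.find_all("p")` performs.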


#### 📌 Step 6: Save Data in JSONL Format
Run this to store the structured data:

In [19]:
# Save extracted email data and PDF text
with jsonlines.open("data.jsonl", "w") as file:
    # Save email data
    for email_data in emails:
        file.write({"source": "email", "text": email_data["clean_body"]})

    # Save PDF text
    file.write({"source": "pdf", "text": pdf_text_clean})

    
print("Email & PDF data saved successfully in data.jsonl!")

Email & PDF data saved successfully in data.jsonl!


In [20]:
# Convert scraped_data.txt to JSONL (Append to data.jsonl)

def convert_text_file_to_jsonl(input_file, output_file, source_type="web"):
    """
    Reads a text file line by line, converts it to JSONL format, and appends it to the existing dataset.
    
    :param input_file: The input text file containing extracted web data.
    :param output_file: The output JSONL file where data will be stored.
    :param source_type: The source label (default: "web").
    """
    with open(input_file, "r", encoding="utf-8") as f:
        data_list = f.readlines()  # Read all lines from the text file

    with jsonlines.open(output_file, "a") as f:  # Open in append mode
        for text in data_list:
            f.write({"source": source_type, "text": text.strip()})  # Write each line as JSON

# Convert scraped text and append to JSONL file
convert_text_file_to_jsonl("scraped_data.txt", "data.jsonl", source_type="web")

print("✅ Web data converted and saved in data.jsonl")

✅ Web data converted and saved in data.jsonl
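
JSON Lines stores one JSON object per line, which is why the file can be opened in append mode and extended record by record. The sketch below writes and appends records with the standard `json` module and skips blank lines, which a line-by-line conversion would otherwise store as records with an empty `text`. The temporary file path and the helper name `append_records` are hypothetical.

```python
import json
import os
import tempfile

def append_records(path, texts, source):
    """Append one JSON object per non-empty text to a JSONL file; return the count written."""
    written = 0
    with open(path, "a", encoding="utf-8") as f:
        for text in texts:
            text = text.strip()
            if not text:  # skip blank lines instead of storing empty records
                continue
            f.write(json.dumps({"source": source, "text": text}, ensure_ascii=False) + "\n")
            written += 1
    return written

path = os.path.join(tempfile.mkdtemp(), "example.jsonl")
append_records(path, ["first email"], "email")
append_records(path, ["Data science is a field.\n", "\n", "It uses statistics.\n"], "web")

# Read the file back: each line parses independently
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(len(records))  # 3
```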


#### 📌 Extra Step: Load JSONL Data into Pandas
Run this to:

    1. Read the data.jsonl file line by line.
    2. Convert it into a structured Pandas DataFrame.
    3. Display the first few rows for inspection.

In [21]:
# Load the JSONL data into a list
data = []
with jsonlines.open("data.jsonl", "r") as file:
    for line in file:
        data.append(line)

# Convert to Pandas DataFrame
df = pd.DataFrame(data)

In [22]:
# Display the first few rows
df.head(6)

|   | source | text |
|---|--------|------|
| 0 | email | whole subscribed youtube channels subscribe no... |
| 1 | email | andrea haynes subscribed youtube channels subs... |
| 2 | email | image google new signin windows revelationchan... |
| 3 | email | image google account recovered successfully re... |
| 4 | email | image google app password created sign account... |
| 5 | pdf | httpwwwndwarezinfo good morning holy spirit bo... |


### ✅ Fine-Tuning an Open-Source LLM (Example: LLaMA 2 or GPT-2) Using Extracted Data
Once you’ve gathered data from emails, PDFs, and web scraping, you need to format it for fine-tuning. This notebook uses GPT-2 because it is small enough to train on a CPU.

In [23]:
# Load dataset from JSONL file
dataset = load_dataset("json", data_files={"train": "data.jsonl"})

Generating train split: 0 examples [00:00, ? examples/s]

In [24]:
# Load a Pretrained Model (LLaMA-2 or GPT-2)
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

In [25]:
# GPT-2 has no padding token by default, so set it to the EOS token before tokenizing
tokenizer.pad_token = tokenizer.eos_token

In [26]:
# Tokenize dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/41 [00:00<?, ? examples/s]

In [27]:
print(tokenized_datasets.keys())

dict_keys(['train'])


In [28]:
# Define split ratio (e.g., 90% train, 10% validation)
split_datasets = tokenized_datasets["train"].train_test_split(test_size=0.1)

# Rename test set to validation
tokenized_datasets = DatasetDict({
    "train": split_datasets["train"],
    "validation": split_datasets["test"]  # Rename test set to validation
})

In [29]:
print(tokenized_datasets.keys())

dict_keys(['train', 'validation'])
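
The progress bars in this notebook show 41 examples before the split and 36 plus 5 after it. The sketch below reproduces that arithmetic, assuming the test size is rounded up and the remainder goes to training (which matches the counts shown), and uses a seeded shuffle so the split is repeatable. `split_indices` is a hypothetical helper, not the `datasets` implementation.

```python
import math
import random

def split_indices(n_examples, test_size=0.1, seed=42):
    """Shuffle indices with a fixed seed and split them into train and validation."""
    n_test = math.ceil(n_examples * test_size)
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

train_idx, val_idx = split_indices(41, test_size=0.1)
print(len(train_idx), len(val_idx))  # 36 5
```

`train_test_split` also accepts a `seed` argument, which keeps the split identical between runs.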


In [30]:
# Causal LM training needs a "labels" column; the model shifts the labels internally
def add_labels(batch):
    batch["labels"] = batch["input_ids"].copy()  # Use input_ids as labels (includes padded positions)
    return batch

tokenized_datasets = tokenized_datasets.map(add_labels, batched=True)

Map:   0%|          | 0/36 [00:00<?, ? examples/s]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]
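
Because the pad token was set to the EOS token and every example is padded to the maximum length, copying `input_ids` into `labels` means padded positions also count toward the loss. Hugging Face causal-LM models ignore label positions set to `-100`, so one common refinement is to mask padding using `attention_mask`. The following is a minimal sketch on plain Python lists (the batch is a toy example, and `mask_padding_labels` is a hypothetical name); it is not what produced the results shown in this notebook.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_padding_labels(batch):
    """Copy input_ids to labels, replacing padded positions with IGNORE_INDEX."""
    batch["labels"] = [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(batch["input_ids"], batch["attention_mask"])
    ]
    return batch

# Toy batch: 50256 is GPT-2's EOS id, used here as padding
batch = {
    "input_ids": [[15496, 995, 50256, 50256], [40, 588, 1366, 3783]],
    "attention_mask": [[1, 1, 0, 0], [1, 1, 1, 1]],
}
batch = mask_padding_labels(batch)
print(batch["labels"])
```

A function of this shape could be passed to `tokenized_datasets.map(..., batched=True)` in place of `add_labels`; the validation loss would then reflect only real tokens and would likely be higher than the values reported here.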

In [31]:
# Define Training Parameters
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    save_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    logging_dir="./logs",
)



In [32]:
# Train Model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 0.170684 |
| 2 | No log | 0.152763 |
| 3 | No log | 0.150106 |


TrainOutput(global_step=27, training_loss=1.4366320857295283, metrics={'train_runtime': 2370.2296, 'train_samples_per_second': 0.046, 'train_steps_per_second': 0.011, 'total_flos': 56439078912000.0, 'train_loss': 1.4366320857295283, 'epoch': 3.0})
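
The `global_step=27` in the output follows from the dataset and batch sizes: 36 training examples at 4 per batch give 9 optimizer steps per epoch, and 3 epochs give 27. The training-loss column reads `No log` because `TrainingArguments` logs every 500 steps by default, and this run ends well before that. The arithmetic, as a sketch that assumes a single device and no gradient accumulation:

```python
import math

def total_training_steps(n_examples, batch_size, epochs):
    """Number of optimizer steps for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs

steps = total_training_steps(n_examples=36, batch_size=4, epochs=3)
print(steps)  # 27
```

Setting `logging_steps` to a small value in `TrainingArguments` would make the training loss appear in the table for a run this short.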

In [33]:
# Save fine-tuned model
model.save_pretrained("fine_tuned_llm")
tokenizer.save_pretrained("fine_tuned_llm")

print("✅ Fine-tuning complete! Model saved.")


✅ Fine-tuning complete! Model saved.


### Load the Fine-Tuned Model and Test It


In [34]:
# Load the fine-tuned model and tokenizer
model_path = "fine_tuned_llm"  # Update with your saved model path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

In [35]:
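# Function to generate text
def generate_text(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt")
    # max_length counts the prompt tokens plus the generated tokens.
    # Note: temperature only takes effect with do_sample=True; as written, decoding is greedy.
    output = model.generate(**inputs, max_length=max_length, temperature=0.7)
    return tokenizer.decode(output[0], skip_special_tokens=True)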

In [36]:
# Test the model with a sample prompt

# Ask the user for a prompt, e.g. "What is data science?" (end it with a question mark)
prompt = input("🔹 Enter a prompt: ")

🔹 Enter a prompt: What is Data Science?


In [37]:
# Generate text based on user input
generated_text = generate_text(prompt)

# Display the model's response
print("\n📝 Model Output:")
print(generated_text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



📝 Model Output:
What is Data Science?

Data science is a field of study that uses data to understand and understand the world around us. Data science is a field of study that uses data to understand and understand the world around us.
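
The repeated sentence in the output is typical of greedy decoding, where the most likely token is always chosen. `temperature` only matters when tokens are sampled: the logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. A minimal pure-Python sketch of that scaling (the logits are made-up numbers, and `softmax_with_temperature` is an illustrative name):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, temperature=0.7)
flat = softmax_with_temperature(logits, temperature=1.5)

# The top token gets more probability mass at the lower temperature
print(round(sharp[0], 3), round(flat[0], 3))
```

In `model.generate`, passing `do_sample=True` alongside `temperature` turns sampling on, which usually reduces this kind of repetition at the cost of less predictable output.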
