<a href="https://colab.research.google.com/github/Capsone34/ML/blob/main/Email%20Resume.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the stopword list and the Punkt tokenizer data (skipped if already present)
nltk.download('stopwords')
nltk.download('punkt')

def preprocess_email(text):
    # Strip URLs, then collapse all whitespace (including newlines) into single spaces
    text = re.sub(r"http\S+|www\S+", '', text)
    text = re.sub(r'\s+', ' ', text)

    # Tokenize text
    tokens = word_tokenize(text.lower())

    # Remove stopwords; punctuation tokens such as ',' and '$' are left in place
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]

    return ' '.join(filtered_tokens)

# Example email content
email_text = """
Hello John, we are happy to inform you that the contract has been signed by XYZ Corp.
The payment of $50,000 will be processed on January 5th, 2024.
"""
cleaned_text = preprocess_email(email_text)
print(cleaned_text)


hello john , happy inform contract signed xyz corp. payment $ 50,000 processed january 5th , 2024 .


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
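
The cleaned text printed above still contains standalone punctuation tokens (`,`, `$`, `.`), because `preprocess_email` filters stopwords only. If those tokens are unwanted downstream, one option is to drop every token that has no letter or digit. The following is a minimal sketch and not part of the original notebook: `drop_punctuation_tokens` is a hypothetical helper, and the example splits on whitespace only so that it runs without NLTK.

```python
def drop_punctuation_tokens(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

# The cleaned string printed by the cell above, split on whitespace.
cleaned = ("hello john , happy inform contract signed xyz corp. payment "
           "$ 50,000 processed january 5th , 2024 .")
print(' '.join(drop_punctuation_tokens(cleaned.split())))
# hello john happy inform contract signed xyz corp. payment 50,000 processed january 5th 2024
```

Tokens such as `corp.` and `50,000` survive because they contain alphanumeric characters; only pure punctuation is removed.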


In [4]:
import spacy

# Load a pre-trained English NER model
nlp = spacy.load("en_core_web_sm")

def extract_entities(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

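# Extract entities from the cleaned email.
# The cleaned text is lowercased and has stopwords removed, so entity spans
# can pick up stray punctuation (for example 'john ,').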
entities = extract_entities(cleaned_text)
print(entities)


[('john ,', 'PERSON'), ('xyz corp.', 'ORG'), ('50,000', 'MONEY'), ('january 5th , 2024', 'DATE')]
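
The entity list above is a flat sequence of `(text, label)` tuples, and some spans carry leftover punctuation. One way to make it easier to consume is to group the spans by label and trim stray commas and spaces. This is a sketch under that assumption; `group_entities` is a hypothetical helper and the input is copied from the output above, so the block does not need spaCy to run.

```python
from collections import defaultdict

def group_entities(entities):
    """Group (text, label) pairs by label, trimming stray commas and spaces."""
    grouped = defaultdict(list)
    for text, label in entities:
        # Only leading/trailing characters are stripped, so '50,000' is untouched.
        grouped[label].append(text.strip(" ,"))
    return dict(grouped)

entities = [('john ,', 'PERSON'), ('xyz corp.', 'ORG'),
            ('50,000', 'MONEY'), ('january 5th , 2024', 'DATE')]
print(group_entities(entities))
# {'PERSON': ['john'], 'ORG': ['xyz corp.'], 'MONEY': ['50,000'], 'DATE': ['january 5th , 2024']}
```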


In [7]:
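def extract_relationships(doc):
    # Label-based heuristics rather than true relation extraction: each entity
    # is reported by type, and any ORG is assumed to be the contract signer
    # whenever the word "contract" appears anywhere in the text.
    mentions_contract = 'contract' in doc.text.lower()
    for ent in doc.ents:
        if ent.label_ == 'ORG' and mentions_contract:
            print(f"Organization {ent.text} signed a contract")
        if ent.label_ == 'MONEY':
            print(f"Amount involved: {ent.text}")
        if ent.label_ == 'PERSON':
            print(f"Name of the person: {ent.text}")
        if ent.label_ == 'DATE':
            print(f"The date is: {ent.text}")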

extract_relationships(nlp(cleaned_text))


Name of the person: john ,
Organization xyz corp. signed a contract
Amount involved: 50,000
The date is: january 5th , 2024


In [8]:
from gensim import corpora
from gensim.models import LdaModel

def topic_modeling(texts):
    # Preprocess text for LDA
    tokens = [word_tokenize(preprocess_email(text)) for text in texts]

    # Create a dictionary representation of the documents
    dictionary = corpora.Dictionary(tokens)

    # Convert documents to bag-of-words format
    corpus = [dictionary.doc2bow(text) for text in tokens]

    # Build the LDA model (no random_state is set, so topics can differ between runs)
    lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10)

    # Print the topics
    topics = lda_model.print_topics(num_words=4)
    for topic in topics:
        print(topic)

# Example of multiple email texts
email_texts = [
    "The project is delayed due to budget constraints.",
    "We need to finalize the contract with XYZ Corp."
]
topic_modeling(email_texts)


(0, '0.091*"need" + 0.091*"due" + 0.091*"." + 0.091*"budget"')
(1, '0.091*"need" + 0.091*"." + 0.091*"finalize" + 0.091*"corp"')
(2, '0.149*"." + 0.085*"constraints" + 0.085*"delayed" + 0.085*"project"')
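
The weight 0.091 that repeats throughout topics 0 and 1 is what a uniform distribution over the vocabulary would give, which suggests those topics learned little from two short emails. The sketch below reproduces the arithmetic. The token lists are an assumption about what `preprocess_email` followed by `word_tokenize` yields for the two example sentences; they are written out by hand so the block runs without NLTK or gensim.

```python
# Assumed tokens for the two example emails after preprocessing.
docs = [
    ["project", "delayed", "due", "budget", "constraints", "."],
    ["need", "finalize", "contract", "xyz", "corp", "."],
]

# The period is shared, so the vocabulary has 11 distinct terms.
vocabulary = sorted({tok for doc in docs for tok in doc})
uniform_weight = 1 / len(vocabulary)
print(len(vocabulary), round(uniform_weight, 3))  # 11 0.091
```

The period also shows up as a top term in every topic, which is a side effect of keeping punctuation tokens during preprocessing.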


In [10]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

# Initialize the sentiment analyzer
sid = SentimentIntensityAnalyzer()

def analyze_sentiment(text):
    sentiment = sid.polarity_scores(text)
    return sentiment

# Perform sentiment analysis
sentiment = analyze_sentiment(email_text)
print(sentiment)


{'neg': 0.0, 'neu': 0.879, 'pos': 0.121, 'compound': 0.5719}


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
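
The `compound` value in the output above is a normalized score between -1 and 1. To turn it into a label, a commonly used convention is to treat scores of at least 0.05 as positive and scores of at most -0.05 as negative. The sketch below applies that convention; `label_sentiment` and its `threshold` parameter are hypothetical names, and the cutoff is a convention rather than something this notebook sets.

```python
def label_sentiment(scores, threshold=0.05):
    """Map a VADER-style score dict to 'positive', 'negative', or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores copied from the output above.
scores = {'neg': 0.0, 'neu': 0.879, 'pos': 0.121, 'compound': 0.5719}
print(label_sentiment(scores))  # positive
```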


In [11]:
from transformers import pipeline

# Load a summarization pipeline (no model is named, so the library's default checkpoint is used)
summarizer = pipeline("summarization")

def summarize_email(text):
    summary = summarizer(text, max_length=50, min_length=25, do_sample=False)
    return summary[0]['summary_text']

# Summarize email content
email_summary = summarize_email(email_text)
print(email_summary)


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Your max_length is set to 50, but your input_length is only 43. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=21)


 The payment of $50,000 will be processed on January 5th, 2024 . XYZ Corp. has signed a contract with John Defterios . The payment will take place in January 2024 .
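
The summary above contains the surname "Defterios", which does not appear in the email, so abstractive output is worth checking against its source. One rough check is to flag capitalized words in the summary that never occur in the original text. This is a heuristic sketch only: `unsupported_capitalized_words` is a hypothetical helper, and the inputs are copied from the cells above.

```python
import re

def unsupported_capitalized_words(source, summary):
    """Return capitalized words in `summary` that never occur in `source`."""
    source_words = set(re.findall(r"[A-Za-z]+", source.lower()))
    flagged = []
    for word in re.findall(r"\b[A-Z][A-Za-z]+\b", summary):
        if word.lower() not in source_words and word not in flagged:
            flagged.append(word)
    return flagged

email_text = """
Hello John, we are happy to inform you that the contract has been signed by XYZ Corp.
The payment of $50,000 will be processed on January 5th, 2024.
"""
summary = (" The payment of $50,000 will be processed on January 5th, 2024 . "
           "XYZ Corp. has signed a contract with John Defterios . "
           "The payment will take place in January 2024 .")
print(unsupported_capitalized_words(email_text, summary))  # ['Defterios']
```

The check only looks at capitalized words, so it catches invented names but not other unsupported claims.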


In [12]:
def process_email(email_text):
    # Step 1: Preprocess email
    cleaned_email = preprocess_email(email_text)

    # Step 2: Named Entity Recognition (spaCy runs once; the doc is reused in Step 3)
    doc = nlp(cleaned_email)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    print("Entities:", entities)

    # Step 3: Relation Extraction
    extract_relationships(doc)

    # Step 4: Topic Modeling (on multiple emails)
    # topic_modeling([email_text])  # Assuming a list of emails

    # Step 5: Sentiment Analysis
    sentiment = analyze_sentiment(email_text)
    print("Sentiment:", sentiment)

    # Step 6: Summarization
    summary = summarize_email(email_text)
    print("Summary:", summary)

# Process a sample email
process_email(email_text)


Your max_length is set to 50, but your input_length is only 43. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=21)


Entities: [('john ,', 'PERSON'), ('xyz corp.', 'ORG'), ('50,000', 'MONEY'), ('january 5th , 2024', 'DATE')]
Name of the person: john ,
Organization xyz corp. signed a contract
Amount involved: 50,000
The date is: january 5th , 2024
Sentiment: {'neg': 0.0, 'neu': 0.879, 'pos': 0.121, 'compound': 0.5719}
Summary:  The payment of $50,000 will be processed on January 5th, 2024 . XYZ Corp. has signed a contract with John Defterios . The payment will take place in January 2024 .
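
`process_email` prints each result, which makes the values hard to reuse. A possible alternative is to collect the output of each step in a dictionary keyed by step name. The sketch below shows the idea with stand-in callables so that it runs on its own; `run_steps` is a hypothetical helper, and in the notebook the steps would be functions such as `analyze_sentiment` and `summarize_email`.

```python
def run_steps(text, steps):
    """Apply each named step to `text` and collect the results in a dict."""
    return {name: step(text) for name, step in steps.items()}

# Stand-in steps for illustration only; any callable that takes a string works.
results = run_steps(
    "The contract has been signed.",
    {"length": len, "upper": str.upper},
)
print(results)
```

Returning a dictionary keeps the pipeline easy to test and lets callers decide how to display or store each result.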
