In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


### Corpus Creation

In [3]:
df = pd.read_csv("articles.csv")
data = df["text"].values

print("Number of Articles : ", len(data))

Number of Articles :  337


#### Pre-Processing

In [4]:
import re
from nltk.tokenize import sent_tokenize

def pre_processing(text):
    
    # text to sentence
    tokenized = sent_tokenize(text)
    
    # Remove Punctuation
    # Lower Case 
    # Strip White Spaces
    pattern   = re.compile(r'[^a-zA-Z0-9\s]')
    tokenized = [pattern.sub('', sent).strip().lower() for sent in tokenized]
    
    return tokenized

corpus = []
for doc in data:
    corpus.extend(pre_processing(doc))
    
print("Number of Sentences in Corpus : ", len(corpus))

Number of Sentences in Corpus :  32376


### Pre-Processing on Input Text

In [128]:
input_text = """
 Education
Shiv Nadar University Chennai Chennai
B.Tech Artificial Intelligence and Data Science 2021 - 2025
• Activities: President of SNUC Art Club, Active member at SNU Coding club, Volunteer at NSO.
 Internship
Circular Edge Solution Pvt Ltd. May 2023
Data Engineer Chennai
• Trained on company’s application AtomIQ for visualization creation and dashboard building within DI-Studio.
• Demonstrated proficiency in creating objects and conducting basic testing within the company’s mobile application
• Integrated acquired training to map visualizations and dashboards from DI-Studio to the company’s mobile app.
§ Projects
EmoSense | BERT, BeautifulSoup, Scikit-learn, Tensorflow, Python Link to Demo ®
• Developed a BERT-based sentiment analysis chatbot system for diverse textual inputs.
• Employed advanced text preprocessing techniques to enhance input data quality, thereby improving the model’s
accuracy and robustness.
• Utilized advanced visualization tools to offer intuitive insights into sentiment trends and patterns.
• Achieved outstanding accuracy rates, through model fine-tuning and optimization, ensuring dependable sentiment
predictions.
Text2Img | GANs, Pytorch, Python Link to Demo ®
• Developed a PyTorch-based application for generating images from textual prompts utilizing Generative
Adversarial Networks (GANs).
• Implemented a Generator to create images from noise and prompts, alongside a Discriminator to distinguish
between real and fake images.
• Applied in diverse creative tasks and advanced data augmentation techniques in computer vision, enhancing model
performance and promoting artistic exploration.
Poetify | RNNs, GRU, Tensorflow, Python Link to Demo ®
• Developed a poetry generation application utilizing Recurrent Neural Networks (RNNs) implemented with Keras
and TensorFlow.
• Trained the RNN model on a dataset of poems to capture intricate patterns and structures of poetry.
• Leveraged the analysis of character sequences to generate new poems with coherent styles and themes.
PlateVisionizer | CNNs, Tesseract OCR, Tensorflow, Python Link to Demo ®
• Developed an Automated License Plate Recognition system for license plate extraction and character recognition.
• Implemented Convolutional Neural Network (CNN) to recognize number plates in vehicle images.
• Integrated Tesseract OCR library to achieve 90 percent accuracy in character recognition.
% Technical Skills
Languages: Python, C, MySQL, JavaScript, HTML/CSS, ReactJS.
Developer Tools: Tableau, SQL, VS Code, Figma, Jupyter Notebook.
Technologies: Scikit-learn, TensorFlow, PyTorch, Keras, OpenCV, SpaCy, Tessaract and NLTK.
 Certifications
Microsoft: Career Essentials in Data Analysis.
University of Michigan: Applied Machine Learning in Python.
LinkedIn: Computer Vision Deep Dive in Python, Natural Language Processing with TensorFlow.
University of Texas: Problem solving in Python.
"""

input_text = input_text.replace("\n", " ")
sentences = sent_tokenize(input_text)
input_tok = pre_processing(input_text)

In [7]:
pip install rouge-score

Collecting rouge-scoreNote: you may need to restart the kernel to use updated packages.

  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py): started
  Building wheel for rouge-score (setup.py): finished with status 'done'
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24972 sha256=630a0ec32dc88bfa728a374db764575c9f9128430243aa74391577f5720401f0
  Stored in directory: c:\users\agilan m a\appdata\local\pip\cache\wheels\85\9d\af\01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


### ROUGE Score

In [110]:
from rouge_score import rouge_scorer

expected = """
Education: Holds an M.S. in Public Administration, Government Law from Sul Ross University and 
a B.S. in Psychology, Education from Morgan State University. Specialized Training includes Security+, 
Cyber 200, Basic Computer Operations, DISA Action Officers, DOD Information Assurance Boot Camp, and 
Computer Network Defense/Threat courses. Skills encompass Air Force operations, hardware, 
software development, and project management. Professional Experience: Currently serving as a 
Branch Chief Information Technology Specialist, overseeing enterprise-level IT programs for the Air Force, 
with a TOP SECRET (SCI) clearance. Proficient in Office of Management and Budget (OMB), Department of Defense (DoD), 
and U.S. Air Force regulations. Expertise includes crisis action planning, cyber operations, and collaboration with 
government partners on Global Information Grid (GIG) operations. Demonstrates exceptional managerial 
and communication skills, contributing to impactful outcomes through multifaceted experience and education. 
Responsibilities also include software development analysis and comprehensive planning for computer network operations.
"""

expected = expected.replace("\n", " ").strip()

def rouge_metrics(summary):
    
    scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
    scores = scorer.score(summary, expected)
    
    print("Rouge Score : ", scores, end="\n\n")

### Summarize Function

In [144]:
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

def summarize(input_vec):
    # Cosine Similarity
    similarity_matrix = cosine_similarity(input_vec, input_vec)

    # Matrix to Graph
    G = nx.from_numpy_array(similarity_matrix)

    # PageRank Algorithm
    pagerank_scores = nx.pagerank(G)

    # Sort sentences based on PageRank Scores
    sorted_sentences = sorted(pagerank_scores, key=pagerank_scores.get, reverse=True)

    # Select top 10 
    top_k = 4
    summary = [sentences[i] for i in sorted_sentences[:top_k]]

    rouge_metrics(" ".join(summary))
    print(" ".join(summary))

### Vectorization

#### Bag of Words

In [145]:
from sklearn.feature_extraction.text import CountVectorizer

bag_of_words = CountVectorizer()

corpus_bow = bag_of_words.fit_transform(corpus)
input_bow  = bag_of_words.transform(input_tok)

In [146]:
summarize(input_bow)

Rouge Score :  {'rouge1': Score(precision=0.18120805369127516, recall=0.28421052631578947, fmeasure=0.22131147540983606)}

Poetify | RNNs, GRU, Tensorflow, Python Link to Demo ® • Developed a poetry generation application utilizing Recurrent Neural Networks (RNNs) implemented with Keras and TensorFlow. • Applied in diverse creative tasks and advanced data augmentation techniques in computer vision, enhancing model performance and promoting artistic exploration. • Demonstrated proficiency in creating objects and conducting basic testing within the company’s mobile application • Integrated acquired training to map visualizations and dashboards from DI-Studio to the company’s mobile app. • Implemented a Generator to create images from noise and prompts, alongside a Discriminator to distinguish between real and fake images.


In [150]:
similarity_matrix_bow = summarize(input_bow)
print("Cosine Similarity Matrix - Bag of Words:")
print(similarity_matrix_bow)

Cosine Similarity Matrix - Bag of Words:
[[1.         0.10540926 0.07254763 0.         0.10846523 0.06201737
  0.06454972 0.         0.09128709 0.13693064 0.04472136 0.15811388
  0.11547005 0.04303315 0.         0.         0.         0.
  0.0745356  0.09128709 0.15811388 0.         0.16903085]
 [0.10540926 1.         0.22941573 0.05892557 0.11433239 0.13074409
  0.06804138 0.1028689  0.09622504 0.14433757 0.0942809  0.16666667
  0.06085806 0.09072184 0.         0.         0.         0.
  0.07856742 0.09622504 0.         0.         0.        ]
 [0.07254763 0.22941573 1.         0.08111071 0.23606684 0.17996851
  0.09365858 0.14159847 0.29801978 0.19867985 0.16222142 0.22941573
  0.25131234 0.12487811 0.14048787 0.1956464  0.         0.
  0.10814761 0.06622662 0.05735393 0.04682929 0.06131393]
 [0.         0.05892557 0.08111071 1.         0.06063391 0.13867505
  0.07216878 0.38188131 0.10206207 0.05103104 0.35       0.05892557
  0.12909944 0.38490018 0.07216878 0.07537784 0.09449112 0.
 

#### TF - IDF

In [148]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf = TfidfVectorizer()

corpus_idf = tf_idf.fit_transform(corpus)
input_idf = tf_idf.transform(input_tok)

In [None]:
summarize(input_idf)

In [151]:
similarity_matrix_idf = summarize(input_idf)
print("Cosine Similarity Matrix - TF-IDF:")
print(similarity_matrix_idf)

Cosine Similarity Matrix - TF-IDF:
[[1.         0.02140829 0.00732773 0.         0.02329379 0.00580939
  0.0056516  0.         0.01032363 0.0258113  0.00431721 0.01734659
  0.01088386 0.00355651 0.         0.         0.         0.
  0.00623609 0.02829541 0.09723799 0.         0.09285926]
 [0.02140829 1.         0.22063799 0.00982113 0.02243904 0.09232229
  0.00544422 0.05583886 0.00994481 0.02486417 0.05226325 0.06614514
  0.00532699 0.0099745  0.         0.         0.         0.
  0.00600726 0.02725713 0.         0.         0.        ]
 [0.00732773 0.22063799 1.         0.00656722 0.01919947 0.01544339
  0.00806554 0.0502658  0.03881499 0.0234427  0.04710975 0.02107484
  0.01972775 0.00945445 0.01502991 0.08792616 0.         0.
  0.00889969 0.00843801 0.00902628 0.00723203 0.00861982]
 [0.         0.00982113 0.00656722 1.         0.00482327 0.11581812
  0.10760728 0.33123139 0.00925219 0.07738151 0.30175559 0.0052944
  0.06497176 0.25216828 0.00605737 0.00568159 0.05204595 0.
  0.1463

#### Continuous Bag of Words

In [134]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

g_model = Word2Vec(sentences=[word_tokenize(sent) for sent in corpus], vector_size=200, window=5, workers=5, epochs=500)

In [135]:
def get_embeddings(sent_l):
    vec = np.array([g_model.wv[word] if word in g_model.wv else np.zeros((200)) for word in sent_l])
    vec = vec.sum(axis=0)
    return vec

input_cbow = np.array([get_embeddings(sent) for sent in [word_tokenize(sent) for sent in input_tok]])

In [136]:
summarize(input_cbow)

Rouge Score :  {'rouge1': Score(precision=0.22818791946308725, recall=0.20359281437125748, fmeasure=0.2151898734177215)}

§ Projects EmoSense | BERT, BeautifulSoup, Scikit-learn, Tensorflow, Python Link to Demo ® • Developed a BERT-based sentiment analysis chatbot system for diverse textual inputs. Poetify | RNNs, GRU, Tensorflow, Python Link to Demo ® • Developed a poetry generation application utilizing Recurrent Neural Networks (RNNs) implemented with Keras and TensorFlow. PlateVisionizer | CNNs, Tesseract OCR, Tensorflow, Python Link to Demo ® • Developed an Automated License Plate Recognition system for license plate extraction and character recognition. Text2Img | GANs, Pytorch, Python Link to Demo ® • Developed a PyTorch-based application for generating images from textual prompts utilizing Generative Adversarial Networks (GANs). • Applied in diverse creative tasks and advanced data augmentation techniques in computer vision, enhancing model performance and promoting artistic ex

In [184]:
similarity_matrix_w2v = summarize(input_cbow)
print("Cosine Similarity Matrix - CBOW:")
print(similarity_matrix_w2v)

Cosine Similarity Matrix - CBOW:
[[ 1.          0.15565664  0.08100839  0.05865292  0.08710148  0.03701585
  -0.05190951  0.06589054 -0.10415016  0.23521386  0.00616118 -0.00211996
  -0.02286441  0.04096264 -0.02529175  0.02009513  0.06758164  0.05574347
   0.01297578  0.28621116  0.34575958  0.17755406  0.23209398]
 [ 0.15565664  1.          0.44057592  0.17048051  0.12470869  0.23945396
   0.1230022   0.13020147  0.11116875  0.23193111  0.07823255  0.13417519
   0.04202913  0.07123264  0.08054879  0.0121653  -0.00901814  0.16228839
   0.12014976  0.2004436   0.15387239  0.0090165   0.01253701]
 [ 0.08100839  0.44057592  1.          0.05541327  0.07398488  0.12917327
   0.17329663  0.13842553  0.20760785  0.08504656  0.09547156  0.25701417
   0.09174199  0.10952233  0.07379617  0.15391577 -0.03877023  0.07974737
   0.09755216  0.15318517 -0.01560299 -0.05159263 -0.01798513]
 [ 0.05865292  0.17048051  0.05541327  1.          0.14890251  0.20602826
   0.17063515  0.46506948  0.13735624 

#### Skip gram

In [137]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

g_model = Word2Vec(sentences=[word_tokenize(sent) for sent in corpus], vector_size=200, window=5, workers=5, epochs=500, sg=1)


KeyboardInterrupt



In [None]:
def get_embeddings(sent_l):
    vec = np.array([g_model.wv[word] if word in g_model.wv else np.zeros((200)) for word in sent_l])
    vec = vec.sum(axis=0)
    return vec

input_sg = np.array([get_embeddings(sent) for sent in [word_tokenize(sent) for sent in input_tok]])

In [None]:
summarize(input_sg)

#### Word2Vec - PreTrained Embeddings

In [None]:
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-200")

def get_embeddings(sent_l):
    vec = np.array([model[word] if word in model else np.zeros((200)) for word in sent_l])
    vec = vec.sum(axis=0)
    return vec

input_wv = np.array([get_embeddings(sent) for sent in [word_tokenize(sent) for sent in input_tok]])

In [None]:
summarize(input_wv)

#### FastText

In [124]:
from gensim.models import FastText
from nltk.tokenize import word_tokenize

f_model = FastText(sentences=[word_tokenize(sent) for sent in corpus], vector_size=200, window=5, workers=5, epochs=500)

In [125]:
def get_embeddings(sent_l):
    vec = np.array([f_model.wv[word] if word in f_model.wv else np.zeros((200)) for word in sent_l])
    vec = vec.sum(axis=0)
    return vec

input_ft = np.array([get_embeddings(sent) for sent in [word_tokenize(sent) for sent in input_tok]])

In [126]:
summarize(input_ft)

Rouge Score :  {'rouge1': Score(precision=0.4429530201342282, recall=0.29596412556053814, fmeasure=0.35483870967741943)}

*Thorough knowledge ofSCADAsystems operations, security, safeguardsand protection *SECURITYCLEARANCE:TOP SECRET (SCI) w/CI Polygraph ProfessionalExperience 07/2014 to Current BranchChiefInformationTechnology Specialist CompanyNameï¼ City Effectively oversee, manage,and evaluate multipleenterpriselevelIT programs, serveas senior technicaladvisorand evaluator for programs using cutting edgetechnology for the Headquarters Air Force(HAF)command,control,communications,computer, intelligence, surveillanceand reconnaissance(C4ISR). Developed information operationsand computer network operations plans, including defensivecomputer operations planning, to ensure support Geographic Combatant Commanders' intent. 08/2009 to 07/2014 InformationTechnology Specialist (INFOSEC/NETOPS) CompanyNameï¼ City , State Developed detailed operations plansand operations orders supporting cybe

In [153]:
similarity_matrix_ft = summarize(input_ft)
print("Cosine Similarity Matrix - FastText:")
print(similarity_matrix_ft)


Cosine Similarity Matrix - FastText:
[[1.         0.2981681  0.44218078 ... 0.31538147 0.58161676 0.50535953]
 [0.2981681  0.99999976 0.37946737 ... 0.5391238  0.3251799  0.1696044 ]
 [0.44218078 0.37946737 1.0000001  ... 0.2909866  0.33070162 0.16499981]
 ...
 [0.31538147 0.5391238  0.2909866  ... 0.9999997  0.47525415 0.2794646 ]
 [0.58161676 0.3251799  0.33070162 ... 0.47525415 1.0000001  0.61722654]
 [0.50535953 0.1696044  0.16499981 ... 0.2794646  0.61722654 0.9999998 ]]


## Summarization using BERT

In [155]:
pip install transformers

Note: you may need to restart the kernel to use updated packages.


In [156]:
from transformers import BertTokenizer, BertModel

In [157]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')


In [158]:
input_text_tokens = tokenizer.encode_plus(input_text, return_tensors="pt", max_length=512, truncation=True, padding=True)


In [160]:
import torch

In [161]:
with torch.no_grad():
    outputs = model(**input_text_tokens)

In [167]:
import numpy as np

# Assuming your 1D array is named sentence_embeddings
sentence_embeddings_2d = sentence_embeddings.reshape(1, -1)

# Now, you can pass sentence_embeddings_2d to the cosine_similarity function
similarity_matrix = cosine_similarity(sentence_embeddings_2d, sentence_embeddings_2d)


In [166]:
last_hidden_states = outputs.last_hidden_state
sentence_embeddings = torch.mean(last_hidden_states, dim=1).squeeze().numpy()

In [168]:
import networkx as nx
G = nx.from_numpy_array(similarity_matrix)

In [169]:
pagerank_scores = nx.pagerank(G)


In [170]:
sorted_sentences = sorted(pagerank_scores, key=pagerank_scores.get, reverse=True)

In [181]:
top_k = 100
summary_sentences = [sentences[i] for i in sorted_sentences[:top_k]]

#### Summary

In [182]:
summary = " ".join(summary_sentences)
print("Summary using BERT embeddings:")
print(summary)

Summary using BERT embeddings:
  Education Shiv Nadar University Chennai Chennai B.Tech Artificial Intelligence and Data Science 2021 - 2025 • Activities: President of SNUC Art Club, Active member at SNU Coding club, Volunteer at NSO.


In [177]:
pip install rouge

Collecting rougeNote: you may need to restart the kernel to use updated packages.

  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [179]:
from rouge import Rouge

In [180]:
rouge = Rouge()

# Calculate ROUGE scores
scores = rouge.get_scores(summary, expected)

# Print ROUGE scores
print(scores)

[{'rouge-1': {'r': 0.0423728813559322, 'p': 0.16129032258064516, 'f': 0.06711409066438465}, 'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0}, 'rouge-l': {'r': 0.0423728813559322, 'p': 0.16129032258064516, 'f': 0.06711409066438465}}]
