
# BY AHMED BARGADY

## Load Data

In [110]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [111]:
df = pd.read_csv("https://firebasestorage.googleapis.com/v0/b/esi-school-resources.appspot.com/o/text_mining%2Fproject%2Ffinal_dataset.csv?alt=media&token=cc09b493-3102-4983-b4d1-355c0dcc3b0f")
print(df.shape)
df.head()

(4571, 3)


,id,content,category
0,2312.06659,We establish the convergence of the unified tw...,math
1,2312.06656,Using the notion of integral distance to analy...,math
2,2312.06651,"This paper is the first part of the series ""Sp...",math
3,2312.0665,"This paper is the second part of the series ""S...",math
4,2312.06649,This paper is the fourth and the last part of ...,math


# Data Preprocessing

In [112]:
df['content'][0]

'We establish the convergence of the unified two-timescale Reinforcement Learning (RL) algorithm presented by Angiuli et al. This algorithm provides solutions to Mean Field Game (MFG) or Mean Field Control (MFC) problems depending on the ratio of two learning rates, one for the value function and the other for the mean field term. We focus a setting with finite state and action spaces, discrete time and infinite horizon. The proof of convergence relies on a generalization of the two-timescale approach of Borkar. The accuracy of approximation to the true solutions depends on the smoothing of the policies. We then provide an numerical example illustrating the convergence. Last, we generalize our convergence result to a three-timescale RL algorithm introduced by Angiuli et al. to solve mixed Mean Field Control Games (MFCGs).'

In [113]:
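# Count articles per category. Naming the index and the count column explicitly
# yields the columns ('category', 'count') regardless of the pandas version.
category_counts = df['category'].value_counts().rename_axis('category').reset_index(name='count')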

category_counts

,category,count
0,cs,2248
1,math,955
2,astro-ph,376
3,eess,352
4,quant-ph,266
5,stat,229
6,q-bio,76
7,hep-ex,69


In [114]:
import plotly.express as px

In [115]:
fig = px.bar(category_counts, x='category', y='count', color='count',
             title='Distribution of Number of Articles in Each Category')

fig.update_layout(xaxis_tickangle=-45)
fig.show()

In [116]:
# A qualitative palette (Set1) is reused here as the color scale for the counts
bubble_colors = px.colors.qualitative.Set1[:len(category_counts)]

# Create a bubble plot with plotly.express; bubble size and color both encode the count
fig = px.scatter(category_counts, x='category', y='count', size='count',
                 color='count',
                 color_continuous_scale=bubble_colors,
                 labels={'category': 'Category', 'count': 'Count'},
                 title='Bubble Plot of Category Counts')

fig.update_layout(xaxis_tickangle=-45)
fig.show()

In [117]:
fig = px.pie(category_counts, names='category', values='count',
             title='Distribution of Article Counts by Category',
             labels={'category': 'Category', 'count': 'Article Count'})

fig.show()

# Sentence Tokenization

In [118]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
import numpy as np
import networkx as nx
import re

In [119]:
nltk.download('punkt')  # Download the Punkt tokenizer models

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [120]:
def tokenize_sentences(text):

    sentences = sent_tokenize(text)

    return sentences

In [121]:
def clean_text(tokens):

    cleaned_tokens = []
    # replace characters that are not alphanumeric (i.e., not letters or numbers)
    for sentence in tokens:

        cleaned_tokens.append(re.sub(r'[^a-zA-Z0-9]', ' ', sentence))

    return cleaned_tokens

In [122]:
example = df['content'][0]
example_tokens = tokenize_sentences(example)


print("original: ", example)
print(f"\nexample_tokenized: {example_tokens}")

original:  We establish the convergence of the unified two-timescale Reinforcement Learning (RL) algorithm presented by Angiuli et al. This algorithm provides solutions to Mean Field Game (MFG) or Mean Field Control (MFC) problems depending on the ratio of two learning rates, one for the value function and the other for the mean field term. We focus a setting with finite state and action spaces, discrete time and infinite horizon. The proof of convergence relies on a generalization of the two-timescale approach of Borkar. The accuracy of approximation to the true solutions depends on the smoothing of the policies. We then provide an numerical example illustrating the convergence. Last, we generalize our convergence result to a three-timescale RL algorithm introduced by Angiuli et al. to solve mixed Mean Field Control Games (MFCGs).

example_tokenized: ['We establish the convergence of the unified two-timescale Reinforcement Learning (RL) algorithm presented by Angiuli et al.', 'This al

# Extractive Summarization

## Sentence Similarity

In [123]:
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

The cell above loads the Universal Sentence Encoder, a pre-trained model that converts each sentence into a fixed-size vector (512 dimensions, as the shape check below shows) that captures semantic information.

In [124]:
sentences = ["This is an example sentence.", "Another example sentence."]

In [125]:
A = embed([sentences[0]])[0]
print(A.shape)

(512,)


In [126]:
from sentence_transformers import SentenceTransformer

embedd = SentenceTransformer("bert-base-nli-mean-tokens")

In [127]:
embeddings = embedd.encode([sentences[0]])[0]

print(embeddings.shape)

(768,)


In [128]:
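def sentence_similarity(sent1,sent2,embed):
    # The encoder takes a list of sentences and returns one vector per sentence
    A = embed([sent1])[0]
    B = embed([sent2])[0]
    # Note: this returns 1 - cosine similarity (i.e. the cosine distance),
    # so values near 0 mean the sentences are close and larger values mean they differ
    return 1 - (np.dot(A,B)/(np.linalg.norm(A)*np.linalg.norm(B)))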
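
The function above returns one minus the cosine similarity, which is usually called the cosine distance. The following is a minimal sketch on toy 2-D vectors (not real sentence embeddings) that shows how the two quantities relate; the helper names `cosine_similarity` and `cosine_distance` are illustrative and do not appear in the notebook.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical directions, 1 for orthogonal ones."""
    return 1.0 - cosine_similarity(a, b)

# Toy vectors standing in for embeddings (assumption: real embeddings are 512-D)
u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
w = np.array([0.0, 1.0])

print("similarity(u, u):", cosine_similarity(u, u))
print("similarity(u, v):", cosine_similarity(u, v))
print("distance(u, w):  ", cosine_distance(u, w))
```

Under this reading, a score such as the one printed in the next cell is a distance: the closer it is to 0, the more alike the two sentences are.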

In [129]:
print(f"\033[92m Sentence 1 : {example_tokens[0]}")
print(f"\033[92m Sentence 2 : {example_tokens[1]}")
print(f"\033[91m Similarity Score : {sentence_similarity(example_tokens[0], example_tokens[1], embed)}")

Sentence 1 : We establish the convergence of the unified two-timescale Reinforcement Learning (RL) algorithm presented by Angiuli et al.
Sentence 2 : This algorithm provides solutions to Mean Field Game (MFG) or Mean Field Control (MFC) problems depending on the ratio of two learning rates, one for the value function and the other for the mean field term.
Similarity Score : 0.8045627176761627


In [130]:
def build_similarity_matrix(sentences, embeds):
    similarity_matrix = np.zeros((len(sentences),len(sentences)))
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1!=idx2:
                similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], embeds)
    return similarity_matrix
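
The nested loop above calls the encoder twice for every pair of sentences. As a possible alternative, the sketch below computes the same kind of matrix (one minus cosine similarity, with a zero diagonal) from a single batch of embeddings. It assumes the encoder can embed a whole list of sentences in one call and that no embedding is the zero vector; `cosine_distance_matrix` is a hypothetical helper, and the toy array stands in for real embeddings.

```python
import numpy as np

def cosine_distance_matrix(embeddings):
    """Pairwise (1 - cosine similarity) for the rows of an (n, d) array."""
    E = np.asarray(embeddings, dtype=float)
    # Scale every row to unit length so a dot product equals the cosine
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T
    # Match the loop version, which leaves the diagonal at exactly zero
    np.fill_diagonal(dist, 0.0)
    return dist

# Toy stand-in for the output of an encoder applied to three sentences
toy_embeddings = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
toy_matrix = cosine_distance_matrix(toy_embeddings)
print(toy_matrix.shape)
```

With the notebook's encoder, the input would be the array obtained by embedding all sentences at once, so the number of encoder calls drops from roughly n² to one.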

In [131]:
sim_mat_example = build_similarity_matrix(example_tokens, embed)

In [132]:
print(sim_mat_example.shape)
sim_mat_example

(8, 8)


array([[0.        , 0.80456272, 0.63578242, 0.64462206, 0.8813566 ,
        0.58914715, 0.30881792, 0.87431081],
       [0.80456272, 0.        , 0.84788474, 0.96362628, 0.81968093,
        0.93547247, 0.83536229, 0.58116105],
       [0.63578242, 0.84788474, 0.        , 0.84665209, 0.93106667,
        0.7588222 , 0.72783697, 0.82372005],
       [0.64462206, 0.96362628, 0.84665209, 0.        , 0.63377598,
        0.52669197, 0.51487094, 0.96521413],
       [0.8813566 , 0.81968093, 0.93106667, 0.63377598, 0.        ,
        0.80164887, 0.72645575, 0.91921887],
       [0.58914715, 0.93547247, 0.7588222 , 0.52669197, 0.80164887,
        0.        , 0.52317116, 0.88563332],
       [0.30881792, 0.83536229, 0.72783697, 0.51487094, 0.72645575,
        0.52317116, 0.        , 0.94849145],
       [0.87431081, 0.58116105, 0.82372005, 0.96521413, 0.91921887,
        0.88563332, 0.94849145, 0.        ]])

In [133]:
fig = px.imshow(sim_mat_example,
                labels=dict(x="Sentence Index", y="Sentence Index", color="Similarity"),
                x=[f"Sentence {i}" for i in range(sim_mat_example.shape[0])],
                y=[f"Sentence {i}" for i in range(sim_mat_example.shape[0])],
                color_continuous_scale="Viridis")

fig.update_layout(title="Sentence Similarity Matrix",
                  xaxis_title="Sentences",
                  yaxis_title="Sentences")

fig.show()

In [134]:
import networkx as nx
from bokeh.io import output_notebook, show, save
from bokeh.plotting import figure
from bokeh.plotting import from_networkx
from bokeh.models import Range1d, Circle, ColumnDataSource, MultiLine

In [135]:
output_notebook()

In [136]:
g = nx.Graph()
for i in range(sim_mat_example.shape[0]):
    for j in range(sim_mat_example.shape[1]):
        if sim_mat_example[i][j] >=.9:
            g.add_edge(i, j)

In [137]:
HOVER_TOOLTIPS = [("sent_tok", "@index")]
plot = figure(tooltips = HOVER_TOOLTIPS, tools="pan,wheel_zoom,save,reset", active_scroll='wheel_zoom',x_range=Range1d(-10.1, 10.1), y_range=Range1d(-10.1, 10.1))

In [138]:
network_graph = from_networkx(g, nx.spring_layout, scale=7, center=(0, 0))
network_graph.node_renderer.glyph = Circle(size=15,fill_color='green')
network_graph.edge_renderer.glyph = MultiLine(line_alpha=0.5, line_width=1)
plot.renderers.append(network_graph)
show(plot)

In [139]:
def generate_summary(text, top_n, embeds):
    sentences = tokenize_sentences(text)
    sentence_similarity_matrix = build_similarity_matrix(sentences, embeds)
    # Treat the matrix as a weighted graph and score each sentence with PageRank
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)
    ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)),reverse=True)
    # Slicing avoids an IndexError when the text has fewer than top_n sentences
    summarize_text = [s for _, s in ranked_sentences[:top_n]]
    return " ".join(summarize_text)
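
The summary above joins the selected sentences in order of their PageRank score, so they can appear in a different order than in the source text. If reading order matters, one option is to pick the top-scoring indices and then sort them back into document order. The sketch below assumes `scores` is a mapping from sentence index to score, as `nx.pagerank` returns; `select_top_sentences` is a hypothetical helper, and the sentences and scores are made up for illustration.

```python
def select_top_sentences(sentences, scores, top_n):
    """Keep the top_n highest-scoring sentences, emitted in original order."""
    top_n = min(top_n, len(sentences))  # guard against short texts
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:top_n])       # restore document order
    return " ".join(sentences[i] for i in keep)

# Made-up sentences and scores, only to exercise the selection logic
toy_sentences = ["A.", "B.", "C.", "D."]
toy_scores = {0: 0.1, 1: 0.4, 2: 0.2, 3: 0.3}

print(select_top_sentences(toy_sentences, toy_scores, top_n=2))
```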

In [140]:
Original_Text = df['content'][8]
Summarized_Text = generate_summary(Original_Text, top_n=5, embeds=embed)

In [141]:
Original_Text

'In this paper, we test various models of wastewater infrastructure for risk analysis and compare their performance. While many representations are available, existing studies do not consider selection of the appropriate model for risk analysis. In this paper, we define two characteristics of wastewater models: the network granularity and the fidelity of the governing equations. We consider different combinations of these characteristics to determine 6 network representations that could be used as the foundation for risk analysis. We test the performance of each model as compared to predictions from the most detailed model, the full network with dynamic wave flow equations. We demonstrate the model selection for Seaside, Oregon. We conclude that the full network granularity is needed as compared to a coarse network representation. For the fidelity of the governing equations, connectivity analysis is reasonable if the primary goal is to determine the spatial distribution of hazard impac

In [142]:
Summarized_Text

'We demonstrate the model selection for Seaside, Oregon. To more accurately predict nodal performance measures, the dynamic wave equations are needed as they capture important physical phenomena. For the fidelity of the governing equations, connectivity analysis is reasonable if the primary goal is to determine the spatial distribution of hazard impact. While many representations are available, existing studies do not consider selection of the appropriate model for risk analysis. We conclude that the full network granularity is needed as compared to a coarse network representation.'

In [144]:
!pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [145]:
from rouge import Rouge

In [146]:
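rouge = Rouge()
# get_scores takes the hypothesis first and the reference second; here the
# original abstract serves as the reference for its own extractive summary
scores = rouge.get_scores(Summarized_Text, Original_Text)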

In [148]:
print(scores[0])

{'rouge-1': {'r': 0.6666666666666666, 'p': 1.0, 'f': 0.7999999952000001}, 'rouge-2': {'r': 0.5448275862068965, 'p': 0.9634146341463414, 'f': 0.6960352376758718}, 'rouge-l': {'r': 0.6666666666666666, 'p': 1.0, 'f': 0.7999999952000001}}
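
The ROUGE-1 precision of 1.0 above follows from the setup: the summary is made of sentences copied from the reference, so every summary word also occurs in the reference. The sketch below illustrates this with a simplified unigram-overlap score on whitespace tokens. It is an assumption-laden approximation for illustration, not the `rouge` package's implementation; `rouge1_sketch` and the toy strings are hypothetical.

```python
def rouge1_sketch(hypothesis, reference):
    """Simplified ROUGE-1: overlap of unique lowercase whitespace tokens."""
    hyp = set(hypothesis.lower().split())
    ref = set(reference.lower().split())
    overlap = len(hyp & ref)
    p = overlap / len(hyp) if hyp else 0.0   # share of summary tokens found in the reference
    r = overlap / len(ref) if ref else 0.0   # share of reference tokens covered by the summary
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"p": p, "r": r, "f": f}

# Toy source and an extractive "summary" made of its first sentence
toy_source = "the cat sat on the mat . the dog barked ."
toy_summary = "the cat sat on the mat ."

toy_rouge = rouge1_sketch(toy_summary, toy_source)
print(toy_rouge)
```

Because precision is fixed at 1.0 by construction in this setup, recall is the more informative number when the source text is used as the reference.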
