In [1]:
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt') # one time execution
import re

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user1\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# Data prep: the Excel sheet was saved as text, re-saved from Notepad with UTF-8 encoding, then saved as a .csv file (also UTF-8)
df = pd.read_csv("AI_define.csv")

In [3]:
df.head()

Unnamed: 0,article_id,article_text,Source
0,1,Artificial intelligence (AI) makes it possible...,https://www.sas.com/en_us/insights/analytics/w...
1,2,"Artificial intelligence (AI), the ability of a...",https://www.britannica.com/technology/artifici...
2,3,Artificial intelligence today is properly know...,https://futureoflife.org/background/benefits-r...
3,4,Artificial intelligence (AI) promises to be th...,https://www.behblaw.com/Hidden-Pages/The-Role-...
4,5,"The world of technology is changing rapidly, w...",The Future is here: Artificial Intelligence an...


In [4]:
df['article_text'][0]

'Artificial intelligence (AI) makes it possible for machines to learn from experience, adjust to new inputs and perform human-like tasks. Most AI examples that you hear about today – from chess-playing computers to self-driving cars – rely heavily on deep learning and natural language processing. Using these technologies, computers can be trained to accomplish specific tasks by processing large amounts of data and recognizing patterns in the data.'

In [5]:
from nltk.tokenize import sent_tokenize
sentences = []
for s in df['article_text']:
  sentences.append(sent_tokenize(s))

sentences = [y for x in sentences for y in x] # flatten list

In [6]:
sentences[:5]

['Artificial intelligence (AI) makes it possible for machines to learn from experience, adjust to new inputs and perform human-like tasks.',
 'Most AI examples that you hear about today – from chess-playing computers to self-driving cars – rely heavily on deep learning and natural language processing.',
 'Using these technologies, computers can be trained to accomplish specific tasks by processing large amounts of data and recognizing patterns in the data.',
 'Artificial intelligence (AI), the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings.',
 'The term is frequently applied to the project of developing systems endowed with the intellectual processes characteristic of humans, such as the ability to reason, discover meaning, generalize, or learn from past experience.']

In [7]:
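# replace punctuation, digits and special characters with spaces
# (regex=True is required: pandas 2.0+ treats the pattern as a literal string by default)
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# convert everything to lowercase
clean_sentences = [s.lower() for s in clean_sentences]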
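
To see what the cleaning step does to a single sentence, here is a minimal sketch using only the standard `re` module. The helper name `clean_sentence` and the example string are hypothetical, not part of the notebook. It also collapses the leftover runs of spaces, which the notebook handles later through `split()`.

```python
import re

def clean_sentence(text):
    """Keep ASCII letters only, lowercase, and collapse whitespace (illustrative helper)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return " ".join(letters_only.lower().split())

example = "Most AI examples (from 2 chess-playing computers) rely on NLP!"
print(clean_sentence(example))
```

Note that the pattern also drops accented and other non-ASCII letters, which is acceptable for this English corpus but may not be for other text.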

In [8]:
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user1\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # a set makes the membership tests below fast

In [10]:
# function to remove stopwords
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [11]:
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

In [12]:
# Extract word vectors
# We use the pre-trained Wikipedia 2014 + Gigaword 5 GloVe vectors available at: https://nlp.stanford.edu/data/glove.6B.zip. Note that this download is 822 MB.
word_embeddings = {}
# each line holds a word followed by its 100 vector components, separated by spaces
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs


In [13]:
len(word_embeddings)

400000

In [14]:
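# Represent each sentence as the average of its word vectors.
# Words missing from GloVe contribute a zero vector; 0.001 is a small smoothing term in the denominator.
sentence_vectors = []
for i in clean_sentences:
  if len(i) != 0:
    v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
  else:
    v = np.zeros((100,))
  sentence_vectors.append(v)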
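
The averaging above can be checked on a toy vocabulary. The following is a small sketch, assuming two-dimensional made-up embeddings and a hypothetical helper `sentence_vector`; plain lists keep it dependency-free, whereas the notebook itself works with NumPy arrays.

```python
def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of the words; unknown words count as zero vectors."""
    words = sentence.split()
    if not words:
        return [0.0] * dim
    total = [0.0] * dim
    for w in words:
        vec = embeddings.get(w, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    # same 0.001 smoothing term as in the notebook cell
    return [t / (len(words) + 0.001) for t in total]

# made-up 2-d embeddings, for illustration only
toy_embeddings = {"machines": [1.0, 0.0], "learn": [0.0, 1.0]}
vec = sentence_vector("machines learn quickly", toy_embeddings, dim=2)
print(vec)
```

Because the unknown word "quickly" still counts in the denominator, out-of-vocabulary words pull the sentence vector toward zero. Cosine similarity ignores vector length, so this only matters when every word in a sentence is unknown.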

In [15]:
# similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])

In [16]:
from sklearn.metrics.pairwise import cosine_similarity


In [17]:
# fill the matrix with pairwise cosine similarities; the diagonal stays 0 (no self-similarity)
for i in range(len(sentences)):
  for j in range(len(sentences)):
    if i != j:
      sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
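
For reference, the quantity stored in each cell is the dot product of two sentence vectors divided by the product of their lengths. Below is a dependency-free sketch of the same matrix on toy vectors. The helper names `cosine` and `similarity_matrix` are hypothetical, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Symmetric matrix of pairwise similarities with a zero diagonal."""
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # cosine similarity is symmetric, so compute each pair once
            sim[i][j] = sim[j][i] = cosine(vectors[i], vectors[j])
    return sim

toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
sim = similarity_matrix(toy_vectors)
for row in sim:
    print(row)
```

Using the symmetry halves the number of similarity computations compared with looping over every ordered pair.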

In [18]:
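import networkx as nx

# build a weighted graph (nodes = sentences, edge weights = similarities) and score the nodes with PageRank
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)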
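
The idea behind the ranking step can be sketched as a simple power iteration over the weighted similarity matrix. This is an illustrative version under stated assumptions, not the networkx implementation: the function name `pagerank_sketch`, the tolerance and the toy weights are made up here, while the damping value 0.85 matches the default `alpha` of `nx.pagerank`.

```python
def pagerank_sketch(weights, damping=0.85, tol=1e-10, max_iter=200):
    """Weighted PageRank by power iteration on a square list-of-lists matrix."""
    n = len(weights)
    row_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # score held by nodes without outgoing weight is spread evenly
        dangling = sum(scores[i] for i in range(n) if row_sums[i] == 0.0)
        new_scores = []
        for j in range(n):
            incoming = sum(
                scores[i] * weights[i][j] / row_sums[i]
                for i in range(n)
                if row_sums[i] > 0.0
            )
            new_scores.append((1.0 - damping) / n + damping * (incoming + dangling / n))
        change = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores

# toy graph: sentence 0 is similar to both others, which are unrelated to each other
toy_weights = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
ranks = pagerank_sketch(toy_weights)
print(ranks)
```

In the toy graph the central sentence receives the highest score, which is the behavior the summarizer relies on: sentences similar to many others rank first.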

In [19]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)


In [20]:
# Extract the top 5 sentences as the summary (slicing avoids an IndexError if there are fewer than 5)
for score, sentence in ranked_sentences[:5]:
  print(sentence)

Artificial intelligence (AI) makes it possible for machines to learn from experience, adjust to new inputs and perform human-like tasks.
The most promising aspect of applying AI in the legal profession lies in automating simple and repetitive tasks, like e-discovery or legal bill review, while enabling human experts to improve results beyond what machines or people could do alone.
Most AI examples that you hear about today – from chess-playing computers to self-driving cars – rely heavily on deep learning and natural language processing.
Artificial intelligence (AI) promises to be the most disruptive class of technologies in driving digital business forward during the next ten years.
The term is frequently applied to the project of developing systems endowed with the intellectual processes characteristic of humans, such as the ability to reason, discover meaning, generalize, or learn from past experience.
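
When only the best few sentences are needed, sorting the whole list is not required. A minimal alternative sketch with `heapq.nlargest` from the standard library follows. The helper name `top_sentences` and the toy data are hypothetical; `scores` is assumed to be indexable by sentence position, as with the dict returned by `nx.pagerank` for this graph.

```python
import heapq

def top_sentences(scores, sentences, n=5):
    """Return up to n sentences with the highest scores, best first."""
    best = heapq.nlargest(n, range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in best]

# toy scores keyed by sentence index, for illustration only
toy_scores = {0: 0.2, 1: 0.5, 2: 0.3}
toy_sentences = ["first sentence", "second sentence", "third sentence"]
for s in top_sentences(toy_scores, toy_sentences, n=2):
    print(s)
```

Asking for more sentences than exist simply returns all of them in ranked order.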
