In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
%cd /content/drive/My Drive/Colab Notebooks/Text_Summarization_Task

/content/drive/My Drive/Colab Notebooks/Text_Summarization_Task


# **Extractive Model #2 Using the TextRank Algorithm and Word Embeddings**
We continue to explore models for extractive text summarization.
In this notebook we use the TextRank algorithm together with word embeddings.

TextRank is an extractive and unsupervised text summarization technique. Let’s take a look at the flow of the TextRank algorithm that we will be following:

* The first step would be to concatenate all the text contained in the articles

* Then split the text into individual sentences
* In the next step, we will find vector representation (word embeddings) for each and every sentence

* Similarities between sentence vectors are then calculated and stored in a matrix
* The similarity matrix is then converted into a graph, with sentences as vertices and similarity scores as edges, for sentence rank calculation
* Finally, a certain number of top-ranked sentences form the final summary



## **Running the Preprocessing Notebook to Use Its Functions**

In [0]:
%run Preprocessing.ipynb


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## **Importing Libraries**

In [0]:
import numpy as np
import gensim
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## **Getting The Data**

Just as in the other notebook (Extractive Simple Model), we will read one Arabic and one English Wikipedia article and work on them.

We will use the functions from the preprocessing notebook. Check that notebook for more information about the functions used here.

In [0]:
English_article = read_wiki('https://en.wikipedia.org/wiki/20th_century')
Arabic_article = read_wiki('https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AD%D8%B1%D8%A8_%D8%A7%D9%84%D8%B9%D8%A7%D9%84%D9%85%D9%8A%D8%A9_%D8%A7%D9%84%D8%AB%D8%A7%D9%86%D9%8A%D8%A9')

## **Preprocessing**

## **Split article into sentences**

We will use sent_tokenize to split each article string into a list of sentences.

In [0]:
English_sentences = sent_tokenize(English_article)
Arabic_sentences  = sent_tokenize(Arabic_article)

We will do some basic cleaning of the data. The preprocessing function imported from the preprocessing notebook performs, by default, the steps we want:

*   Convert everything to lowercase (English article)
*   Contraction mapping (English article)
*   Eliminate punctuations and special characters
*   Remove stopwords
*   Remove Tashkeel, i.e. diacritics (Arabic article)

For more information about the function, check the preprocessing notebook.
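
To make the cleaning steps concrete, here is a minimal sketch of what such a function could look like. It is a hypothetical stand-in: `toy_preprocess`, `TOY_STOPWORDS`, and `TASHKEEL` are names invented here, contraction mapping is omitted, and the real `preprocess` in the preprocessing notebook may behave differently.

```python
import re
import string

# Hypothetical stand-in for the notebook's preprocess(); the real function
# lives in Preprocessing.ipynb and may behave differently.
TASHKEEL = re.compile(r'[\u064B-\u0652]')  # Arabic diacritic marks (fathatan to sukun)
TOY_STOPWORDS = {'the', 'a', 'of', 'in', 'and', 'was'}  # tiny illustrative list

def toy_preprocess(sentence, lang='en'):
    if lang == 'en':
        sentence = sentence.lower()
    else:
        sentence = TASHKEEL.sub('', sentence)
    # Replace ASCII punctuation with spaces so that tokens stay separated
    sentence = re.sub('[' + re.escape(string.punctuation) + ']', ' ', sentence)
    tokens = [t for t in sentence.split() if t not in TOY_STOPWORDS]
    return ' '.join(tokens)

print(toy_preprocess('The 20th century was the tenth century of the millennium.'))
```

Note that this sketch only strips ASCII punctuation; Arabic punctuation marks would need to be added to the pattern.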

In [0]:
clean_en_sentences=[]
clean_ar_sentences=[]

for s in English_sentences:
  clean_en_sentences.append(preprocess(s , lang = 'en' , stemming = False , rm_short = True))

for s in Arabic_sentences:
  clean_ar_sentences.append(preprocess(s , lang = 'ar' , stemming = False))


In [0]:
print(clean_en_sentences)
print(clean_ar_sentences)

['twentieth century century began january ended december', 'tenth final century millennium', 'century dominated chain events heralded significant changes world history redefine era flu pandemic world war world war nuclear power space exploration nationalism decolonization cold war post cold war conflicts intergovernmental organizations cultural homogenization developments emerging transportation communications technology poverty reduction world population growth awareness environmental degradation ecological extinction birth digital revolution enabled wide adoption mos transistors integrated circuits', 'saw great advances power generation communication medical technology late allowed near instantaneous worldwide computer communication genetic modification life', 'century saw largest transformation world order since fall rome global total fertility rates sea level rise ecological collapses increased resulting competition land dwindling resources accelerated deforestation water depletion

## **Word Embedding**

In this section we will load word embeddings for both Arabic and English. These word embeddings will be used to create vectors for our sentences.

### **English Word Embedding (GLOVE)**

GloVe, coined from Global Vectors, is a model for distributed word representation. The model is an unsupervised learning algorithm for obtaining vector representations for words. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity.  

It can be downloaded from [Here](https://nlp.stanford.edu/data/glove.6B.zip)
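
As a rough illustration of "distance is related to semantic similarity", the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors are invented for the example; real GloVe vectors from `glove.6B.100d.txt` have 100 dimensions and different values.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    'king':  [0.8, 0.6, 0.1],
    'queen': [0.7, 0.7, 0.1],
    'apple': [0.1, 0.0, 0.9],
}

def cosine(u, v):
    # Cosine of the angle between u and v: dot product over product of norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(toy_vectors['king'], toy_vectors['queen']))  # close to 1
print(cosine(toy_vectors['king'], toy_vectors['apple']))  # much smaller
```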

In [0]:
# Extract word vectors: each line holds a word followed by its 100 coefficients
english_embeddings = {}
glove_path = '/content/drive/My Drive/Colab Notebooks/Text_Summarization_Task/Word_Embedding/glove/glove.6B.100d.txt'
with open(glove_path, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        english_embeddings[word] = coefs

### **Arabic Word Embedding (AraVec)**

AraVec is one of the few Arabic word embedding collections available. You can download it and read more about it here:
[Github](https://github.com/bakrianoo/aravec)

I am using the 300-dimensional Wikipedia Skip-Gram model:
[Here](https://bakrianoo.sfo2.digitaloceanspaces.com/aravec/full_uni_sg_300_wiki.zip)

In [0]:
t_model = gensim.models.Word2Vec.load('/content/drive/My Drive/Colab Notebooks/Text_Summarization_Task/Word_Embedding/AraVec/full_uni_sg_300_wiki.mdl')
arabic_embedding= t_model.wv



## **Vector Representation of Sentences**

Now we will create the vector representation of each sentence.

We fetch the vectors of the words that make up a sentence and then take the mean of those vectors to arrive at a single consolidated vector for the sentence.

In [0]:
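def eng_vec_rep(clean_sentences):
  # Average the 100-d GloVe vectors of the words in each cleaned sentence
  sentence_vectors = []
  for i in clean_sentences:
    if len(i) != 0:
      # Unknown words contribute a zero vector; +0.001 avoids division by zero
      v = sum([english_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
      v = np.zeros((100,))
    sentence_vectors.append(v)
  return sentence_vectors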

def ar_vec_rep(clean_sentences):
  sentence_vectors = []
  for i in clean_sentences:
    if len(i) != 0:
      v = 0
      for w in i.split():
        if w in arabic_embedding:
          v+= arabic_embedding[w]
        else:
          v+= np.zeros((300,))
      v/= len(i.split())+0.001
    else:
      v = np.zeros((300,))
    sentence_vectors.append(v)
  return sentence_vectors

en_sentence_vectors = eng_vec_rep(clean_en_sentences)
ar_sentence_vectors = ar_vec_rep(clean_ar_sentences)
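
The averaging is easier to follow on a toy example. The sketch below uses an invented two-dimensional embedding table and a helper named `mean_sentence_vector` (both are assumptions for illustration, not part of the notebook) and mirrors the logic of the functions above.

```python
import numpy as np

# Toy 2-dimensional embedding table, invented for illustration.
toy_embeddings = {
    'world': np.array([1.0, 0.0]),
    'war':   np.array([0.0, 1.0]),
}

def mean_sentence_vector(sentence, embeddings, dim):
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    # Out-of-vocabulary words contribute a zero vector
    total = sum(embeddings.get(w, np.zeros(dim)) for w in words)
    # The small constant mirrors the functions above
    return total / (len(words) + 0.001)

print(mean_sentence_vector('world war', toy_embeddings, 2))
print(mean_sentence_vector('world war unknownword', toy_embeddings, 2))
```

An out-of-vocabulary word shrinks the mean toward zero but does not change its direction, so the cosine similarity computed later is not affected by it.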


## **Similarity matrix**

The next step is to find similarities between the sentences, and we will use cosine similarity for this. We create an empty similarity matrix and populate it with the cosine similarities of the sentences.

We first define a zero matrix of dimensions (n * n), where n is the number of sentences. This matrix will then be filled with the cosine similarity scores of the sentences.

In [0]:
# similarity matrix
en_sim_mat = np.zeros([len(clean_en_sentences), len(clean_en_sentences)])
ar_sim_mat = np.zeros([len(clean_ar_sentences), len(clean_ar_sentences)])

We will use cosine similarity to compute the similarity between each pair of sentences and fill the matrix with these scores.

In [0]:
from sklearn.metrics.pairwise import cosine_similarity

#For English
for i in range(len(clean_en_sentences)):
  for j in range(len(clean_en_sentences)):
    if i != j:
      en_sim_mat[i][j] = cosine_similarity(en_sentence_vectors[i].reshape(1,100), en_sentence_vectors[j].reshape(1,100))[0,0]

#For Arabic
for i in range(len(clean_ar_sentences)):
  for j in range(len(clean_ar_sentences)):
    if i != j:
      ar_sim_mat[i][j] = cosine_similarity(ar_sentence_vectors[i].reshape(1,300), ar_sentence_vectors[j].reshape(1,300))[0,0]
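
The double loop computes one pair of sentences at a time, which gets slow for long articles. As an alternative sketch (not the notebook's method), the whole matrix can be computed at once with NumPy; `cosine_similarity_matrix` is a name introduced here for illustration.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    X = np.vstack(vectors)                       # shape (n_sentences, dim)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                      # all-zero vectors keep similarity 0
    X_unit = X / norms
    sim = X_unit @ X_unit.T                      # all pairwise cosines at once
    np.fill_diagonal(sim, 0.0)                   # same effect as the i != j check
    return sim

toy_vectors = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 0.0])]
print(cosine_similarity_matrix(toy_vectors))
```

Assuming the sentence vectors from the previous section, `cosine_similarity_matrix(en_sentence_vectors)` should give the same values as `en_sim_mat` up to floating-point error.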

## **Applying PageRank Algorithm**

Before proceeding further, let’s convert each similarity matrix (en_sim_mat and ar_sim_mat) into a graph. The nodes of this graph represent the sentences and the edge weights represent the similarity scores between the sentences. On this graph, we will apply the PageRank algorithm to arrive at the sentence rankings.

In [0]:
import networkx as nx

#for English
en_nx_graph = nx.from_numpy_array(en_sim_mat)
en_scores = nx.pagerank(en_nx_graph)

#for Arabic
ar_nx_graph = nx.from_numpy_array(ar_sim_mat)
ar_scores = nx.pagerank(ar_nx_graph)
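
To show roughly what PageRank does with the similarity matrix, here is a simplified power-iteration sketch in NumPy. It is an illustration under stated assumptions, not the networkx implementation: `pagerank_power_iteration` is a name introduced here, and the damping factor 0.85 is chosen to match the default `alpha` of `nx.pagerank`.

```python
import numpy as np

def pagerank_power_iteration(sim, damping=0.85, tol=1e-8, max_iter=100):
    n = sim.shape[0]
    row_sums = sim.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Row-normalise the weights; rows with no edges fall back to uniform
    transition = np.where(row_sums > 0, sim / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # A sentence's score is a damped, weighted sum of its neighbours' scores
        new_scores = (1 - damping) / n + damping * transition.T @ scores
        if np.abs(new_scores - scores).sum() < tol:
            break
        scores = new_scores
    return new_scores

toy_sim = np.array([[0.0, 0.9, 0.1],
                    [0.9, 0.0, 0.8],
                    [0.1, 0.8, 0.0]])
toy_scores = pagerank_power_iteration(toy_sim)
print(toy_scores, toy_scores.sum())
```

In the toy matrix, sentence 1 is strongly similar to both other sentences, so it receives the highest score.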

## **Summary Extraction**

It’s time to extract the top N sentences based on their rankings for summary generation.

In [0]:
#english
en_ranked_sentences = sorted(((en_scores[i],s) for i,s in enumerate(English_sentences)), reverse=True)

#arabic
ar_ranked_sentences = sorted(((ar_scores[i],s) for i,s in enumerate(Arabic_sentences)), reverse=True)
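
Sorting by score means the summary sentences appear in rank order rather than in the order they occur in the article. One possible variation (not used in this notebook) is to pick the top N sentences and then restore document order; `top_n_in_document_order` is a hypothetical helper sketched below.

```python
import heapq

def top_n_in_document_order(scores, sentences, n):
    # Indices of the n highest-scoring sentences
    best = heapq.nlargest(n, range(len(sentences)), key=lambda i: scores[i])
    # Re-sort by position so the summary follows the article's order
    return [sentences[i] for i in sorted(best)]

# Toy scores keyed by sentence index, like the dict returned by nx.pagerank
toy_scores = {0: 0.10, 1: 0.40, 2: 0.20, 3: 0.30}
toy_sentences = ['A.', 'B.', 'C.', 'D.']
print(top_n_in_document_order(toy_scores, toy_sentences, 2))  # ['B.', 'D.']
```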


## **Results**

English

In [0]:
# Printing top 5 sentences as the summary
for i in range(5):
  print(en_ranked_sentences[i][1])


The period was marked by a new arms race as the USSR became the second nation to develop nuclear weapons, which were produced by both sides in sufficient numbers to end most human life on the planet had a large-scale nuclear exchange ever occurred.
Western Europe was rebuilt with the aid of the American Marshall Plan, resulting in a major post-war economic boom, and many of the affected nations became close allies of the United States.
The dissolution of the Soviet Union in 1991 after the collapse of its European alliance was heralded by the West as the end of communism, though by the century's end roughly one in six people on Earth lived under communist rule, mostly in China which was rapidly rising as an economic and geopolitical power.
The 20th century was dominated by a chain of events that heralded significant changes in world history as to redefine the era: flu pandemic, World War I and World War II, nuclear power and space exploration, nationalism and decolonization, the Cold Wa

Arabic

In [0]:
# Printing top 5 sentences as the summary
for i in range(5):
  print(ar_ranked_sentences[i][1])

في الأول من سبتمبر هاجمت القوات الألمانية بولندا بحجة أن بولندا شنت هجمات على الأراضي الألمانية[64]، وبعد يومين، في 3 سبتمبر 1939 ونتيجة لتجاهل ألمانيا الإنذار البريطاني بوقف العمليات العسكرية في بولندا
أعلنت كل من المملكة المتحدة وفرنسا الحرب على ألمانيا وتبعهما في إعلان الحرب دول الكومنولث البريطاني وهي كل من أستراليا (في 3 سبمتمر) ونيوزلندا (3 سبتمبر) وجنوب أفريقيا (6 سبمتبر) وكندا (10 سبتمبر)[13]، وبدأت بريطانيا بإرسال جيوشها إلى فرنسا، إلا أن هذه الجيوش لم تقدم أي مساعدة فعلية للبولنديين خلال الغزو، وبقيت الحدود الفرنسية الألمانية هادئة، وبدأت ما تعرف بالحرب الزائفة[65]، حيث اكتفت الجيوش بفرض حصار اقتصادي على ألمانيا مما دفع ألمانيا بمهاجمة السفن التجارية في المحيط الأطلسي وبداية معركة الأطلسي.
في 17 سبتمبر 1939 قام الاتحاد السوفيتي بمهاجمة بولندا[64][66]، وفي 27 سبتمبر 1939 استسلمت بولندا للألمان مع وجود جيوب للمقاومة[67]، وفي 6 أكتوبر تم إعلان الاستسلام النهائي لبولندا بعد معركة كوك[68][69]، وتم تشكيل حكومة منفى في لندن تابعت لاحقاً إدارة العمليات القتالية لجيش الوطن - حركة المق

## **Observations**

This model produces better summaries than the previous simple model. It uses word embeddings to capture the meaning of words, which gives each sentence a better score and therefore a better ranking.