# UnFound NLP Assignment 

1. Given some input word or phrase, figure out the relevant Wikipedia article. If there’s
nothing relevant, return null.
2. Create a timeline from the article. To do that, extract all sentences with dates (and
temporal words like yesterday, tomorrow, etc.). Return only the best n (NOT necessarily the first
n) sentences in the timeline, in chronological order. [We leave it up to you how you infer "best".]
3. For each sentence in the timeline, output a sentence embedding. Choose an embedding
of your choice (Word2Vec, GloVe, Universal Sentence Encoder, ELMo, etc.).
4. Create a microservice (BONUS: host it on Heroku or similar) which takes some
word/phrase and a number n as input, and returns:
a. Relevant Wikipedia page name (return null if irrelevant)
b. Best n timeline sentences (with sentence embedding)
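
The response described in step 4 could be shaped roughly as follows. This is only a sketch of the payload, not a full service: the function name `build_response` and the JSON field names (`page`, `timeline`, `sentence`, `embedding`) are assumptions made for illustration and are not specified by the assignment.

```python
import json

def build_response(page_title, sentences, embeddings):
    """Build the JSON body for one query (function and field names are hypothetical)."""
    if page_title is None:
        # No relevant article: json.dumps turns None into JSON null
        return json.dumps({"page": None, "timeline": []})
    timeline = [
        {"sentence": s, "embedding": [float(x) for x in e]}
        for s, e in zip(sentences, embeddings)
    ]
    return json.dumps({"page": page_title, "timeline": timeline})

# The "irrelevant query" case
print(build_response(None, [], []))
```

Converting every component with `float` keeps the embedding serialisable even when it arrives as a NumPy array, which `json.dumps` cannot encode directly.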

# For the First Task
    1. Import the required libraries: wikipedia, wikipediaapi
    2. Accept the user's query as input
    3. Search for the article relevant to the query

# 1.1 Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import wikipedia
import wikipediaapi

%matplotlib inline

# 1.2 Take input from User


In [2]:
article=input()

prime minister of india


#### 1.2.1  Processing the query before fetching results

In [3]:
import string
article=string.capwords(article)

article

'Prime Minister Of India'

# 1.3 Finding article related to query

   #### 1.3.1 Storing page content if exists

In [4]:
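# Recent wikipedia-api releases may also require a user_agent argument here.
wiki_wiki = wikipediaapi.Wikipedia('en')

page_py = wiki_wiki.page(article)
if page_py.exists():
    # page_py.text holds the full article text, not only the summary
    page_content = page_py.text
    print("Page - Summary: %s" % page_content)
else:
    # No relevant article: keep page_content defined so later cells can check it
    page_content = None
    print("NULL")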

Page - Summary: The Prime Minister of India is the leader of the executive of the Government of India. The prime minister is also the chief adviser to the President of India and head of the Council of Ministers. They can be a member of any of the two houses of the Parliament of India — the Lok Sabha (House of the People) and the Rajya Sabha (Council of the States) — but has to be a member of the political party or coalition, having a majority in the Lok Sabha.
The prime minister is the senior-most member of cabinet in the executive of government in a parliamentary system. The prime minister selects and can dismiss members of the cabinet; allocates posts to members within the government; and is the presiding member and chairperson of the cabinet.
The union cabinet headed by the prime minister is appointed by the President of India to assist the latter in the administration of the affairs of the executive. Union cabinet is collectively responsible to the Lok Sabha as per article 75(3) of
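
The lookup above depends on the query matching a page title. One possible way to decide whether any title is "relevant" is to compare the query against a list of candidate titles and return nothing when none is close. The sketch below assumes the candidates would come from `wikipedia.search(article)`; the helper name `pick_title` and the 0.6 cutoff are illustrative choices, not part of the notebook.

```python
import difflib

def pick_title(query, candidates, cutoff=0.6):
    """Return the candidate title closest to the query, or None if none is close."""
    lowered = {c.lower(): c for c in candidates}
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

# Hard-coded candidates for illustration; in the notebook they could come
# from wikipedia.search(article)
titles = ["Prime Minister of India", "List of prime ministers of India", "Narendra Modi"]
print(pick_title("prime minister of india", titles))
print(pick_title("qwertyuiop", titles))
```

Comparing in lowercase avoids depending on `string.capwords`, which capitalises "Of" and so does not reproduce the exact page title.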

In [5]:
page_content

"The Prime Minister of India is the leader of the executive of the Government of India. The prime minister is also the chief adviser to the President of India and head of the Council of Ministers. They can be a member of any of the two houses of the Parliament of India — the Lok Sabha (House of the People) and the Rajya Sabha (Council of the States) — but has to be a member of the political party or coalition, having a majority in the Lok Sabha.\nThe prime minister is the senior-most member of cabinet in the executive of government in a parliamentary system. The prime minister selects and can dismiss members of the cabinet; allocates posts to members within the government; and is the presiding member and chairperson of the cabinet.\nThe union cabinet headed by the prime minister is appointed by the President of India to assist the latter in the administration of the affairs of the executive. Union cabinet is collectively responsible to the Lok Sabha as per article 75(3) of the Constitu

# 2. Select the best n lines from the article (having dates and temporal words) in chronological order
    1. Text Cleaning
    2. Split page line by line
    3. Creating a DataFrame to store relevant values
    4. Data Preprocessing
    5. Take input N 
    6. Select sentence according to ranking and chronological order
    

## 2.1 Text Cleaning
### 2.1.1 Remove \n from article




In [6]:
page_content=page_content.replace('\n',' ')

### 2.1.2 Remove abbreviated text
   #### - usually consists of name initials in the form of [A-Z].

In [9]:
import re
page_content=re.sub(r"\b[A-Z\.]\b"," ",page_content)

## 2.2 Split page line by line

In [10]:
page_lines=page_content.split('.')
page_lines

['The Prime Minister of India is the leader of the executive of the Government of India',
 ' The prime minister is also the chief adviser to the President of India and head of the Council of Ministers',
 ' They can be a member of any of the two houses of the Parliament of India — the Lok Sabha (House of the People) and the Rajya Sabha (Council of the States) — but has to be a member of the political party or coalition, having a majority in the Lok Sabha',
 ' The prime minister is the senior-most member of cabinet in the executive of government in a parliamentary system',
 ' The prime minister selects and can dismiss members of the cabinet; allocates posts to members within the government; and is the presiding member and chairperson of the cabinet',
 ' The union cabinet headed by the prime minister is appointed by the President of India to assist the latter in the administration of the affairs of the executive',
 ' Union cabinet is collectively responsible to the Lok Sabha as per articl

### Updating  page content and page lines

In [11]:
# Update after testing: drop the orphaned " ." left where initials were blanked out
page_content=re.sub(r"\s\.","",page_content)
# Split into lines again on the cleaned text
page_lines=page_content.split('.')
page_lines

['The Prime Minister of India is the leader of the executive of the Government of India',
 ' The prime minister is also the chief adviser to the President of India and head of the Council of Ministers',
 ' They can be a member of any of the two houses of the Parliament of India — the Lok Sabha (House of the People) and the Rajya Sabha (Council of the States) — but has to be a member of the political party or coalition, having a majority in the Lok Sabha',
 ' The prime minister is the senior-most member of cabinet in the executive of government in a parliamentary system',
 ' The prime minister selects and can dismiss members of the cabinet; allocates posts to members within the government; and is the presiding member and chairperson of the cabinet',
 ' The union cabinet headed by the prime minister is appointed by the President of India to assist the latter in the administration of the affairs of the executive',
 ' Union cabinet is collectively responsible to the Lok Sabha as per articl
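
Splitting on every `.` is why the initials had to be removed first. A possible alternative, sketched here under the assumption that sentences start with a capital letter, is to split only at sentence-ending punctuation and to skip periods that belong to a single-letter initial. The names `SENTENCE_END` and `split_sentences` are made up for this example.

```python
import re

# Split after ., ! or ? when followed by whitespace and a capital letter,
# unless the period belongs to a single-letter initial such as "P."
SENTENCE_END = re.compile(r"(?<!\b[A-Z]\.)(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Return the non-empty sentences of text, keeping their final punctuation."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]

sample = "P. V. Narasimha Rao took office in 1991. His tenure ended in May 1996."
print(split_sentences(sample))
```

The trade-off is that a sentence which really ends in a single capital letter will not be split from the next one.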

## 2.3 Create a DataFrame to store all sentences

In [12]:
df=pd.DataFrame({'Sentence':page_lines})

In [13]:
df.describe()

Unnamed: 0,Sentence
count,140
unique,140
top,The Special Protection Group ( ) is charged w...
freq,1


### 2.3.1 Form a year column

In [14]:
# Forming a year column in dataframe
Years_line=[]
for x in page_lines:
    l=re.search(r"(\d{4})",x)
    if l:
        h=int(l.group(0))
        if h<2099:
            Years_line.append(h)
        else:
            Years_line.append(None)
    else:
        Years_line.append(None)
df['Year']=Years_line
df   

Unnamed: 0,Sentence,Year
0,The Prime Minister of India is the leader of t...,
1,The prime minister is also the chief adviser ...,
2,They can be a member of any of the two houses...,
3,The prime minister is the senior-most member ...,
4,The prime minister selects and can dismiss me...,
5,The union cabinet headed by the prime ministe...,
6,Union cabinet is collectively responsible to ...,
7,The prime minister has to enjoy the confidenc...,
8,Origins and history India follows a parliame...,
9,"In such systems, the head of state, or, the h...",


In [15]:
df.describe()

Unnamed: 0,Year
count,35.0
mean,1982.314286
std,21.137883
min,1947.0
25%,1963.0
50%,1984.0
75%,1999.0
max,2014.0
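
The year column only records the first run of four digits, so two sentences from the same year cannot be ordered and a longer number such as 20000 would be read as the year 2000. A possible refinement is sketched below: it captures an optional day and month in front of the year and requires word boundaries around the match. The names `DATE_RE` and `date_key` and the 1000 to 2099 year range are assumptions for this example.

```python
import re

MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
MONTH_NUMBER = {name: i for i, name in enumerate(MONTH_NAMES, start=1)}

# Optional day, optional month name, then a year between 1000 and 2099
DATE_RE = re.compile(
    r"\b(?:(\d{1,2})\s+)?(?:(" + "|".join(MONTH_NAMES) + r")\s+)?(1\d{3}|20\d{2})\b",
    re.IGNORECASE,
)

def date_key(sentence):
    """Return (year, month, day) for the first date found, with 0 for missing parts."""
    match = DATE_RE.search(sentence)
    if match is None:
        return None
    day, month, year = match.groups()
    month_number = MONTH_NUMBER[month.lower()] if month else 0
    # A day is only trusted when a month name follows it
    return (int(year), month_number, int(day) if day and month else 0)

print(date_key("India's first prime minister took oath on 15 August 1947"))
```

Because the result is a tuple, sorting by it orders sentences by year first, then month, then day.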


In [16]:
MONTHS_PATTERN = ['january','february','march','april','may','june','july','august','september','october','november','december','jan','feb','mar','apr','jun','jul','aug','sep','sept','oct','nov','dec','today','tomorrow','yesterday']

### 2.3.2 Add a new column to the DataFrame for date and month/temporal word detection
#### A value of 1 indicates that a year or a month/temporal word is present

In [17]:
dpset=set()
for num,line in enumerate(page_lines):
    l=re.search(r"(\d{4})",line)
    if l:
        h=l.group(0)
        if int(h)<2099:
            dpset.add(num)
    words=line.split(' ')
    for wn,word in enumerate(words):
        if word.lower() in MONTHS_PATTERN:
            dpset.add(num)
for x in dpset:
    df.at[x, 'Date Present']=1

df

Unnamed: 0,Sentence,Year,Date Present
0,The Prime Minister of India is the leader of t...,,
1,The prime minister is also the chief adviser ...,,
2,They can be a member of any of the two houses...,,
3,The prime minister is the senior-most member ...,,
4,The prime minister selects and can dismiss me...,,
5,The union cabinet headed by the prime ministe...,,
6,Union cabinet is collectively responsible to ...,,
7,The prime minister has to enjoy the confidenc...,,
8,Origins and history India follows a parliame...,,
9,"In such systems, the head of state, or, the h...",,
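
Splitting on spaces leaves punctuation attached to words, so a token such as "May," does not match the list, while the modal verb "may" does. One way to handle both cases is sketched below: it tokenises on letters only and accepts the ambiguous words only when they are capitalised. The helper name `has_temporal_word` and the choice of ambiguous words are assumptions for illustration.

```python
import re

TEMPORAL_WORDS = {"january", "february", "april", "june", "july", "august",
                  "september", "october", "november", "december",
                  "jan", "feb", "apr", "jun", "jul", "aug", "sep", "sept",
                  "oct", "nov", "dec", "today", "tomorrow", "yesterday"}
# These are also ordinary English words, so they only count when capitalised
AMBIGUOUS = {"may", "march", "mar"}

def has_temporal_word(sentence):
    """Return True if the sentence contains a month name or temporal word."""
    for token in re.findall(r"[A-Za-z]+", sentence):
        lowered = token.lower()
        if lowered in TEMPORAL_WORDS:
            return True
        if lowered in AMBIGUOUS and token[0].isupper():
            return True
    return False

print(has_temporal_word("His tenure ended in May, 1964"))
print(has_temporal_word("They may be a member of either house"))
```

A sentence that begins with the modal "May" would still be flagged; the capitalisation check is a heuristic, not a parser.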


## 2.4 Data preprocessing
### 2.4.1 Drop rows with NA values

In [18]:
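# Keep only rows that have both a Year and a Date Present flag;
# sentences with a temporal word but no four-digit year are dropped here too.
df.dropna(inplace=True)
df=df.reset_index(drop=True)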
df

Unnamed: 0,Sentence,Year,Date Present
0,"History 1947-1984 Since 1947, there have bee...",1947.0,1.0
1,The first few decades after 1947 saw the Indi...,1947.0,1.0
2,India's first prime minister — Jawaharlal Neh...,1947.0,1.0
3,"His tenure ended in May 1964, on his death",1964.0,1.0
4,Shastri's tenure saw the Indo-Pakistani War o...,1965.0,1.0
5,"In addition, events such as the Indo-Pakistan...",1971.0,1.0
6,"In 1975, President Fakhruddin Ali Ahmed — on ...",1975.0,1.0
7,All of the political parties of the oppositio...,1977.0,1.0
8,"Ultimately, after two and a half years as ; ...",1979.0,1.0
9,"In 1980, after a three-year absence, the Cong...",1980.0,1.0


### 2.4.2 Best Sentences 

#### 2.4.2.1 Calculate Number of rows

In [19]:
# Store Value of number of rows
nr=df.shape[0]

#### 2.4.2.2 Forming a vector representation of the best sentences using the doc2vec model

#### 2.4.2.2.1 Calculating similarity and giving ranks to sentences

In [20]:
from gensim.models import doc2vec
from collections import namedtuple

# Load data

doc1 =df['Sentence']

# Transform data (you can add more data preprocessing steps) 

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))

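# Train model (set min_count = 1, if you want the model to work with the provided example data set)
# Note: `vector_size` replaces the older `size` argument, which gensim 4 no longer accepts.

model = doc2vec.Doc2Vec(docs, vector_size = 100, window = 300, min_count = 1, workers = 4)


# Lowercase the query so its tokens match the lowercased training words
tokens =article.lower().split()

new_vector = model.infer_vector(tokens)
# (tag, similarity) pairs, most similar first; gensim 4 exposes docvecs as `model.dv`
sim_list = model.docvecs.most_similar([new_vector],topn=nr)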



#### Giving rank to each sentence 

In [21]:
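# Rank of each sentence = position of its tag in sim_list (0 = most similar)
position={tag:pos for pos,(tag,_) in enumerate(sim_list)}
rank=[position[i] for i in range(nr)]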

In [22]:
df['Ranking']=rank
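
The ranking step can be illustrated without gensim. The sketch below scores each sentence vector by cosine similarity to a query vector and converts the scores into one rank per sentence, where rank 0 is the closest match. The function names and the toy 2-d vectors are invented for the example; the notebook itself takes its similarities from the Doc2Vec model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_similarity(query_vec, sentence_vecs):
    """Return rank[i] for sentence i, where 0 is the most similar to the query."""
    scores = [cosine(query_vec, v) for v in sentence_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    rank = [0] * len(scores)
    for position, index in enumerate(order):
        rank[index] = position
    return rank

# Toy 2-d vectors, made up for illustration
print(rank_by_similarity([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]))
```

The important detail is the last loop: the sorted order lists sentence indices by similarity, and the rank list is its inverse, indexed by sentence.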

## 2.5 Taking input for N for best lines

In [23]:
# Number of best Lines

no_best_lines=input()

5


## 2.6 Selecting the n best sentences and ordering them in chronological order

In [24]:
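nbest_lines=list()
nyear=list()
# Sort by Year so the timeline is chronological (mergesort is stable, so ties keep article order)
for index, row in df.sort_values('Year', kind='mergesort').iterrows():
    if row['Ranking'] < int(no_best_lines):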
        print (row['Year'],':', row['Sentence'])
        nbest_lines.append(row['Sentence'])
        nyear.append(row['Year'])

1947.0 :  India's first prime minister — Jawaharlal Nehru — took oath on 15 August 1947
1965.0 :  Shastri's tenure saw the Indo-Pakistani War of 1965
1984.0 :  Subsequently, on 31 October 1984, Indira was shot dead by Satwant Singh and Beant Singh — two of her bodyguards — in the garden of her residence at 1, Safdarjung Road, New Delhi
1996.0 :  Rao, however, did complete five continuous years in office, becoming the first prime minister outside of the Nehru—Gandhi family to do so After the end of Rao's tenure in May 1996, the nation saw four prime ministers in a span of three years, viz
2002.0 :  But during his reign, the 2002 Gujarat communal riots in the state of Gujarat took place; resulting in the death of about 2,000 deaths
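
The selection step has two parts: pick the n best-ranked sentences, then order them by date. A minimal sketch of that logic on plain tuples is shown below. The helper name `timeline`, the rank values, and the shortened sentences are invented for illustration; the notebook does the same thing on the DataFrame.

```python
import heapq

def timeline(rows, n):
    """rows are (year, rank, sentence) tuples; a lower rank means more relevant."""
    best = heapq.nsmallest(n, rows, key=lambda row: row[1])
    # Chronological order is applied only after the best n have been chosen
    return sorted(best, key=lambda row: row[0])

# Ranks and shortened sentences are made up for this example
rows = [
    (1984, 0, "Indira was shot dead on 31 October 1984"),
    (1947, 2, "Nehru took oath on 15 August 1947"),
    (2002, 3, "The 2002 Gujarat riots took place"),
    (1965, 1, "Shastri's tenure saw the Indo-Pakistani War of 1965"),
]
for year, _, sentence in timeline(rows, 3):
    print(year, ":", sentence)
```

Sorting before selecting would give the first n sentences by date rather than the best n, which is what the task explicitly rules out.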


# 3. For each sentence in the timeline, output a sentence embedding. Choose an embedding of your choice.

        1. Converting sentences to vectors
        2. Output sentences in Correct order

## 3.1 Converting the best n sentences to vectors using doc2vec

In [25]:
# Sentence Embedding 


nbest_docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(nbest_lines):
    words = text.lower().split()
    tags = [i]
    nbest_docs.append(analyzedDocument(words, tags))

In [26]:
# Model to convert the best sentences to vector format
# (`vector_size` replaces the older `size` argument, which gensim 4 no longer accepts)

best_model = doc2vec.Doc2Vec(nbest_docs, vector_size = 100, window = 300, min_count = 1, workers = 4)
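
A model trained on only n short sentences has very little data to learn from. A different technique, which the notebook does not use, is to average pretrained word vectors such as Word2Vec or GloVe, both named in the task. The sketch below shows only the averaging step. The function name `average_embedding` and the 3-d toy vectors are assumptions; real vectors would be loaded from a pretrained model.

```python
def average_embedding(sentence, word_vectors, dim):
    """Mean of the vectors of known words; a zero vector if no word is known."""
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]

# Toy 3-d vectors, invented for illustration; real ones would come from a trained model
toy_vectors = {"nehru": [1.0, 0.0, 0.0], "took": [0.0, 1.0, 0.0], "oath": [0.0, 0.0, 1.0]}
print(average_embedding("Nehru took oath", toy_vectors, 3))
```

Averaging ignores word order, so it is a baseline rather than a replacement for a trained sentence encoder.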



## 3.2 Displaying sentences in chronological order with their sentence embeddings

In [27]:
# Displaying N best sentences with Sentence Embedding

# Iterate over the sentence indices; iterating over best_model.docvecs itself
# runs past the last tag and raises a KeyError
for num in range(min(int(no_best_lines), len(nbest_lines))):
    print(nyear[num],':',nbest_lines[num])
    print(best_model.docvecs[num])

1947.0 :  India's first prime minister — Jawaharlal Nehru — took oath on 15 August 1947
[-3.3806867e-03  3.5947154e-03 -2.6295851e-03  3.0730427e-03
  4.8870132e-03  2.3795976e-03  1.3583943e-03  2.6353032e-03
  6.7270664e-04  6.6648971e-04  2.0569549e-03 -1.6148591e-03
 -4.6742093e-03  4.5079775e-03  4.1837613e-03 -3.8102674e-03
 -1.2791312e-03 -1.6936724e-03 -3.5667235e-03  3.8790375e-03
  3.5140817e-03 -2.6407030e-03  1.3354226e-03 -1.3470687e-03
  3.0024722e-04 -1.5032118e-04  1.9756428e-03  1.6037764e-03
 -2.2000265e-03  1.6455081e-03  4.3766676e-03 -2.6347581e-03
 -2.5495049e-03  4.5804060e-03  1.0401267e-03  2.9754702e-03
  4.5681647e-03  3.1581477e-04  4.1134492e-03  3.4596310e-03
  6.1815273e-04  4.2663021e-03  6.6707534e-04 -3.8048925e-03
  4.0109311e-03  4.4902717e-03 -2.5108941e-03 -1.0492139e-03
  2.6680594e-03  5.2443764e-04  4.1539958e-03 -4.8922081e-03
 -3.4825201e-03 -2.6757058e-04  2.1119779e-03 -1.8075345e-03
 -3.8266899e-03 -1.0501815e-03 -4.1286703e-03  2.5823072e-

### Conclusion:
In this assignment I have tried to achieve the required goal. The notebook finds the Wikipedia article for the given query and prints the best n dated sentences from it, each with its sentence embedding.

## I would welcome your suggestions and improvements