<a href="https://colab.research.google.com/github/manish284/Natural-Language-Processing/blob/master/Find_the_Duplicate_News_Heahlines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Problem Statement:

The attached data contains headlines (sentences) from various newspapers.
Your task is to build a model that can find similar sentences in the data set.

For example, "Where are you currently residing?" is similar to "What is your address?"

If you pass query = "where are you currently residing?",
your model should return something like the following:

1. What is your current address?
2. Can you give me your current address?
3. Please provide me your address.
4. In which city are you residing? ... and so on

The most similar sentence should come first, followed by the others in decreasing order of similarity.

# Importing the required packages

In [56]:
import re
import os
import nltk
import gensim
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity

In [57]:
# download the stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Unzip the data folder

In [3]:
!unzip /content/news-text.csv.zip

Archive:  /content/news-text.csv.zip
  inflating: news-text.csv           


# Read the data from file

In [58]:
dataframe = pd.read_csv('/content/news-text.csv')

In [59]:
# check the top 5 rows
print(dataframe.head())
dataframe.shape

   publish_date                                      headline_text
0      20030219  aba decides against community broadcasting lic...
1      20030219     act fire witnesses must be aware of defamation
2      20030219     a g calls for infrastructure protection summit
3      20030219           air nz staff in aust strike for pay rise
4      20030219      air nz strike to affect australian travellers


(1186018, 2)

# Preprocessing the raw data

In [60]:
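# raw strings avoid invalid-escape warnings for the backslashes in the patterns
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # punctuation replaced by a space
GOOD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')         # everything else outside this set is dropped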
STOPWORDS = set(stopwords.words('english'))

In [61]:
def text_prepare(text):
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    text = GOOD_SYMBOLS_RE.sub('', text)
    text = ' '.join([x for x in text.split() if x and x not in STOPWORDS])
    return text.strip()
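To see what the cleaning step does to a single headline, here is a minimal sketch of the same pipeline. It assumes a tiny hand-picked stopword set (`DEMO_STOPWORDS`, a hypothetical stand-in for NLTK's English list) so that it runs without any download; the function name `text_prepare_demo` and the sample headline are also made up for illustration.

```python
import re

# Same patterns as in the preprocessing cell, written as raw strings.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
GOOD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Assumption: a small hand-picked set stands in for NLTK's English stopwords.
DEMO_STOPWORDS = {'the', 'a', 'is', 'for', 'in', 'of', 'to'}

def text_prepare_demo(text, stopwords=DEMO_STOPWORDS):
    text = text.lower()                        # lowercase first, the patterns only allow a-z
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # brackets, slashes, commas -> space
    text = GOOD_SYMBOLS_RE.sub('', text)       # drop every other disallowed symbol
    return ' '.join(w for w in text.split() if w not in stopwords)

sample = "Air NZ staff (in Aust) strike for pay-rise!"
cleaned = text_prepare_demo(sample)
print(cleaned)  # air nz staff aust strike payrise
```

Note that the hyphen is deleted rather than replaced by a space, so "pay-rise" becomes the single token "payrise", which is unlikely to have a pre-trained vector.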

In [62]:
prepared_dataframe = []
for line in dataframe['headline_text']:
  prepared_dataframe.append(text_prepare(line))

In [63]:
# checking the top 10 rows after cleaning
prepared_dataframe[0:10]

['aba decides community broadcasting licence',
 'act fire witnesses must aware defamation',
 'g calls infrastructure protection summit',
 'air nz staff aust strike pay rise',
 'air nz strike affect australian travellers',
 'ambitious olsson wins triple jump',
 'antic delighted record breaking barca',
 'aussie qualifier stosur wastes four memphis match',
 'aust addresses un security council iraq',
 'australia locked war timetable opp']

# Note

To solve the problem I am going to use two approaches:

1. Pre-trained word and phrase vectors:
    using this method I will prepare the groups of duplicate sentences present in the raw file.

2. StarSpace:
    using this model I will train embeddings on the data prepared in step 1.

# Downloading pre-trained word and phrase vectors
These vectors were trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases.

In [64]:
# downloading the pre-trained word2vec vectors
!wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz

--2020-10-11 20:06:29--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.77.102
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.77.102|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz.1’


2020-10-11 20:07:12 (37.2 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz.1’ saved [1647046227/1647046227]



In [65]:
# loading word2vec with the gensim library (limit=500000 reads only the first 500,000 vectors)
wv_embeddings = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True, limit=500000)

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


In [None]:
# checking the pre-trained word vectors
wv_embeddings['me']

In [66]:
# generate the embedding of a sentence
def question_to_vec(question, embeddings, dim=300):
    """
        question: a string
        embeddings: dict where the key is a word and the value is its embedding
        dim: size of the representation

        result: vector representation for the question (the mean of its known word
                vectors, or a zero vector when no word is in the vocabulary)
    """
    result = np.zeros(dim)
    c = 0
    words = question.split()
    for word in words:
        if word in embeddings:
            result += np.array(embeddings[word])
            c += 1
    if c != 0:
        result /= c
    return result
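The averaging in `question_to_vec` is easier to follow on a toy vocabulary. The sketch below assumes hypothetical 3-dimensional embeddings (made up for illustration, not taken from the Google News model) and a helper name, `average_vector`, that does not appear in the notebook.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, made up for illustration only.
toy_embeddings = {
    'fire': np.array([1.0, 0.0, 0.0]),
    'strike': np.array([0.0, 1.0, 0.0]),
    'pay': np.array([0.0, 1.0, 1.0]),
}

def average_vector(sentence, embeddings, dim):
    """Mean of the vectors of in-vocabulary words; zeros if no word is known."""
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

vec = average_vector('strike pay rise', toy_embeddings, dim=3)
unknown = average_vector('zzz qqq', toy_embeddings, dim=3)
print(vec)      # 'rise' is out of vocabulary, so only 'strike' and 'pay' are averaged
print(unknown)  # no known word: the zero vector
```

Because out-of-vocabulary words are skipped silently, two headlines that differ only in unknown words end up with identical vectors.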

In [67]:
# compute the cosine similarity between a sentence and every candidate sentence
def rank_candidates(question, candidates, embeddings, dim=300):
    """
        question: a string
        candidates: a list of strings (candidates) which we want to rank
        embeddings: some embeddings
        dim: dimension of the current embeddings
        
        result: a list of triples (cosine similarity, initial position in the list, candidate),
                sorted by decreasing similarity and limited to similarities above 0.4
    """
    # embed the question once as a single-row matrix instead of once per candidate
    q_vec = question_to_vec(question, embeddings, dim).reshape(1, -1)
    cand_vecs = np.array([question_to_vec(candidate, embeddings, dim) for candidate in candidates])
    cosines = cosine_similarity(q_vec, cand_vecs)[0]
    merged_list = list(zip(cosines, range(len(candidates)), candidates))
    sorted_list  = sorted(merged_list, key=lambda x: x[0], reverse=True)
    result = [(a,b,c) for a,b,c in sorted_list if a> 0.4]
    return result
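The ranking step in `rank_candidates` comes down to three operations: compute a cosine per candidate, sort in decreasing order, and keep scores above 0.4. Here is a dependency-free sketch of that idea; the vectors and labels are hypothetical, and the `cosine` helper is written by hand rather than taken from scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine similarity; treated as 0.0 here when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical sentence vectors, made up for illustration.
query_vec = [1.0, 0.0]
candidate_vecs = {
    'same direction': [2.0, 0.0],
    'partly related': [1.0, 1.0],
    'unrelated': [0.0, 1.0],
}

THRESHOLD = 0.4  # same cut-off as in rank_candidates
ranked = sorted(
    ((cosine(query_vec, vec), name) for name, vec in candidate_vecs.items()),
    reverse=True,
)
kept = [(score, name) for score, name in ranked if score > THRESHOLD]
print(kept)  # 'unrelated' scores 0.0 and is filtered out
```

Cosine similarity ignores vector length, which is why `[2.0, 0.0]` scores a full 1.0 against `[1.0, 0.0]`.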

In [None]:
wv_prepared_ranking = [] 
for indx,question in enumerate(prepared_dataframe):
    candidate=prepared_dataframe
    ranks = rank_candidates(question, candidate, wv_embeddings)
    wv_prepared_ranking.append([r[2] for r in ranks])
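The loop above compares every headline with every other headline, so its cost grows quadratically with the number of rows. One possible way to keep memory bounded is to normalise all sentence vectors once and compute the cosines chunk by chunk as matrix products. The sketch below is an assumption-laden illustration, not the notebook's method: `similar_indices`, the chunk size, and the toy vectors are all hypothetical, and the total work is still quadratic.

```python
import numpy as np

def similar_indices(sentence_vecs, threshold=0.4, chunk_size=2):
    """For each row, list the indices of rows whose cosine exceeds threshold, best first."""
    norms = np.linalg.norm(sentence_vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows stay zero instead of dividing by 0
    unit = sentence_vecs / norms
    results = []
    for start in range(0, len(unit), chunk_size):
        # (chunk, dim) @ (dim, n): cosines of one chunk against every row
        cosines = unit[start:start + chunk_size] @ unit.T
        for row in cosines:
            order = np.argsort(-row, kind='stable')  # stable sort keeps ties in index order
            results.append([int(i) for i in order if row[i] > threshold])
    return results

# Hypothetical 2-dimensional sentence vectors, for illustration only.
vecs = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [0.0, 0.0]])
neighbours = similar_indices(vecs)
print(neighbours)  # [[0, 1], [0, 1], [2], []]
```

Each row matches itself with cosine 1.0, as in the original loop; a sentence with no known words (the zero row) matches nothing.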

In [None]:
# save each group of similar sentences as one tab-separated row
def prepare_file():
  with open('/content/prepared_train.tsv', 'w') as out:
    for line in wv_prepared_ranking:
      new_line = [text_prepare(q) for q in line]
      print(*new_line, sep='\t', file=out)

In [None]:
prepare_file()
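The file written by `prepare_file()` holds one group of similar headlines per line, with the headlines separated by tabs. A small round-trip sketch of that layout follows; the groups and the temporary file name are hypothetical, chosen only so the example runs anywhere.

```python
import os
import tempfile

# Hypothetical groups of similar headlines, made up for illustration.
groups = [
    ['air nz staff strike pay rise', 'air nz strike affect travellers'],
    ['man charged murder', 'jury consider verdict murder case'],
]

path = os.path.join(tempfile.mkdtemp(), 'demo_train.tsv')
with open(path, 'w', encoding='utf-8') as out:
    for group in groups:
        out.write('\t'.join(group) + '\n')  # one group per line, tab-separated

with open(path, encoding='utf-8') as handle:
    read_back = [line.rstrip('\n').split('\t') for line in handle]
print(read_back)
```

Reading the file back and comparing it with the input is a cheap check that no headline contained a stray tab or newline.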

# 2. StarSpace

Training the model on the prepared training file

In [82]:
# Setup
def setup_starspace():
    if not os.path.exists("/usr/local/bin/starspace"):
        os.system("wget https://dl.bintray.com/boostorg/release/1.63.0/source/boost_1_63_0.zip")
        os.system("unzip boost_1_63_0.zip && mv boost_1_63_0 /usr/local/bin")
        os.system("git clone https://github.com/facebookresearch/Starspace.git")
        os.system("cd Starspace && make && cp -Rf starspace /usr/local/bin")
setup_starspace()

In [85]:
# training the model on the prepared training file
!starspace train -trainFile "/content/prepared_train.tsv" -model starspace_embedding \
-trainMode 3 -adagrad true -ngrams 1 -epoch 5 -dim 100 -similarity cosine -minCount 2 \
-verbose true -fileFormat labelDoc -negSearchLimit 10 -lr 0.05

Arguments: 
lr: 0.05
dim: 100
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: cosine
maxNegSamples: 10
negSearchLimit: 10
batchSize: 5
thread: 10
minCount: 2
minCountLabel: 1
label: __label__
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 1
trainMode: 3
fileFormat: labelDoc
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
useWeight: 0
weightSep: :
Start to initialize starspace model.
Build dict from input file : /content/prepared_train.tsv
Read 0M words
Number of words in dictionary:  394
Number of labels in dictionary: 0
Loading data from file : /content/prepared_train.tsv
Total number of examples loaded : 89
Initialized model weights. Model size :
matrix : 394 100
Training epoch 0: 0.05 0.01
Epoch: 88.9%  lr: 0.050000  loss: 0.332594  eta: <1min   tot: 0h0m0s  (17.8%)
 ---+++                Epoch    0 Train error : 0.61408126 +++--- ☃
Training epoch 1: 0.04 0.01
Epoch: 88.9%  lr: 0.040000  loss: 0.188699  eta: <1min   t

In [91]:
# storing the generated embeddings
starspace_embeddings = dict()
with open('starspace_embedding.tsv', encoding='utf-8') as embedding_file:
    for line in embedding_file:
        row = line.strip().split('\t')
        starspace_embeddings[row[0]] = np.array(row[1:], dtype=np.float32)
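Each row of the embedding file is a word followed by its tab-separated vector components. Before ranking, it can help to confirm that every row has the same length and that it matches the `-dim` value used for training (100 here). A minimal sketch of such a check, using a hypothetical two-line excerpt held in memory instead of the real file:

```python
import io

# Hypothetical excerpt; a real file has one row per word and `dim` values per row.
fake_file = io.StringIO("fire\t0.1\t-0.2\t0.3\nstrike\t0.0\t0.5\t-0.5\n")

embeddings = {}
for line in fake_file:
    word, *values = line.rstrip('\n').split('\t')
    embeddings[word] = [float(v) for v in values]

# A single element in this set means every vector has the same dimension.
dims = {len(v) for v in embeddings.values()}
print(embeddings['fire'], dims)
```

If `dims` contained more than one value, or a value different from the `dim` argument passed to `rank_candidates`, the vector sums in `question_to_vec` would fail or mix incompatible shapes.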

In [94]:
# checking the result
starspace_ranks_results = []
for indx,question in enumerate(prepared_dataframe[0:100]):
    candidate=prepared_dataframe[0:100]
    ranks = rank_candidates(question, candidate, starspace_embeddings,100)
    starspace_ranks_results.append([r[2] for r in ranks])


In [109]:
def get_duplicte(question):
  candidate=prepared_dataframe[0:100]
  ranks = rank_candidates(question, candidate, starspace_embeddings,100)
  print('Question:', question)
  print('Duplicate Headlines:')
  print('Rank','\t\t\t','Duplicates')
  for i in ranks:
    print(i[0],i[2])


In [110]:
test ='act fire witnesses must aware defamation'
get_duplicte(test)

Question: act fire witnesses must aware defamation
Duplicate Headlines:
Rank 			 Duplicates
0.9999999999999997 act fire witnesses must aware defamation
0.7779561812346191 businesses prepare terrorist attacks
0.5745279379977436 jury consider verdict murder case
0.5537601082881319 man charged cooma murder
0.44224793381100824 korean subway fire 314 still missing
0.42304599351810324 dog mauls 18 month old toddler nsw
