### NLP Search Engine

Tutorial Link - https://medium.com/towards-artificial-intelligence/similar-texts-search-in-python-with-a-few-lines-of-code-an-nlp-project-9ace2861d261

Dataset Link - https://www.kaggle.com/sameersmahajan/people-wikipedia-data

In [1]:
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
df = pd.read_csv('dataset/people_wiki.csv')
df.head()

| | URI | name | text |
|---|---|---|---|
| 0 | <http://dbpedia.org/resource/Digby_Morrell> | Digby Morrell | digby morrell born 10 october 1979 is a former... |
| 1 | <http://dbpedia.org/resource/Alfred_J._Lewy> | Alfred J. Lewy | alfred j lewy aka sandy lewy graduated from un... |
| 2 | <http://dbpedia.org/resource/Harpdog_Brown> | Harpdog Brown | harpdog brown is a singer and harmonica player... |
| 3 | <http://dbpedia.org/resource/Franz_Rottensteiner> | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lowe... |
| 4 | <http://dbpedia.org/resource/G-Enka> | G-Enka | henry krvits born 30 december 1974 in tallinn ... |


In [3]:
df.shape

(42786, 3)

In [4]:
df.describe()

| | URI | name | text |
|---|---|---|---|
| count | 42786 | 42786 | 42786 |
| unique | 42786 | 42785 | 42786 |
| top | <http://dbpedia.org/resource/Jeetendra_Singh_B... | author) | galeard lee wade born january 20 1929 is a for... |
| freq | 1 | 2 | 1 |


In [5]:
df.isnull().sum()

URI     0
name    0
text    0
dtype: int64

### How to Vectorize?
Python’s scikit-learn library provides a class named `CountVectorizer`. It assigns an index to each word in its vocabulary and generates a vector that contains the number of times each word appears in a piece of text. Here, I will demonstrate it with a small text. Suppose this is our text:

###### text = ["Jen is a good student. Jen plays guiter as well"]
Let’s take the `CountVectorizer` class imported from scikit-learn above and fit it on the text.

###### vectorizer = CountVectorizer()
###### vectorizer.fit(text)

Here, I am printing the vocabulary:
###### print(vectorizer.vocabulary_)
###### Output:  {'jen': 4, 'is': 3, 'good': 1, 'student': 6, 'plays': 5, 'guiter': 2, 'as': 0, 'well': 7}
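Look, each word of the text received a number. That number is the word's index in the output vector. The text has eight distinct words of two or more characters (the default tokenizer drops single-character tokens such as 'a', and 'jen' is counted as one word), so the indices run from 0 to 7. Next, we need to transform the text. I will print the transformed vector as an array.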

###### vector = vectorizer.transform(text)
###### print(vector.toarray())
Here is the output: [[1 1 1 1 2 1 1 1]]. ‘jen’ has index 4 and appears twice, so the element at index 4 of the output vector is 2. Every other word appears only once, so the remaining elements are ones.
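
To make the counting step concrete, here is a minimal from-scratch sketch of the same idea using only the standard library. It is not scikit-learn's implementation; the helper names `tokenize`, `fit_vocabulary`, and `count_vectors` are hypothetical, and the regular expression is assumed to mirror `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(texts):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = set()
    for text in texts:
        tokens.update(tokenize(text))
    return {token: index for index, token in enumerate(sorted(tokens))}


def count_vectors(texts, vocabulary):
    """Build one count vector per text, with one column per vocabulary word."""
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for token, count in Counter(tokenize(text)).items():
            if token in vocabulary:
                row[vocabulary[token]] = count
        rows.append(row)
    return rows


text = ["Jen is a good student. Jen plays guiter as well"]
vocabulary = fit_vocabulary(text)
vector = count_vectors(text, vocabulary)

print(vocabulary)
print(vector)
```

On this sample text the sketch yields the same word-to-index mapping and the same counts as the scikit-learn output above, with 'a' dropped because it is a single character.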

In [6]:
vect = CountVectorizer()
word_weight = vect.fit_transform(df['text'])

In [7]:
print(word_weight.shape)

(42786, 437503)


#### Model Fit

In [8]:
nn = NearestNeighbors(metric = 'euclidean')
nn.fit(word_weight)

NearestNeighbors(metric='euclidean')
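
Conceptually, a nearest-neighbour model with `metric='euclidean'` ranks rows by the straight-line distance between their count vectors. The sketch below shows that idea as a brute-force search in plain Python. The `k_nearest` helper and the toy vectors are hypothetical, and this is an illustration of the distance calculation rather than a description of how scikit-learn implements it.

```python
import math


def k_nearest(query, vectors, k):
    """Return (distance, row_index) pairs for the k rows closest to query."""
    scored = [(math.dist(query, row), index) for index, row in enumerate(vectors)]
    return sorted(scored)[:k]


# Hypothetical toy count vectors: rows are documents, columns are words.
vectors = [
    [2, 1, 0, 0],
    [2, 0, 1, 0],
    [0, 0, 3, 4],
]

# Query with the first document itself, as the notebook does with a dataset row.
neighbors = k_nearest(vectors[0], vectors, k=2)
print(neighbors)
```

Because the query is itself a row of the data, it comes back as its own closest neighbour at distance 0, followed by the most similar other row.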

#### Predict

First, find the index of ‘Barack Obama’ in the dataset.

In [9]:
obama_index = df[df['name'] == 'Barack Obama'].index[0]

In [10]:
obama_index

35811

In [11]:
distances, indices = nn.kneighbors(word_weight[obama_index], n_neighbors = 10)

In [12]:
distances

array([[ 0.        , 33.01514804, 34.3074336 , 35.79106034, 36.06937759,
        36.24913792, 36.27671429, 36.40054945, 36.44173432, 36.83748091]])

In [13]:
distances.flatten()

array([ 0.        , 33.01514804, 34.3074336 , 35.79106034, 36.06937759,
       36.24913792, 36.27671429, 36.40054945, 36.44173432, 36.83748091])

In [14]:
neighbors = pd.DataFrame({'distance': distances.flatten(), 'id': indices.flatten()})
print(neighbors)

    distance     id
0   0.000000  35811
1  33.015148  24478
2  34.307434  28441
3  35.791060  14754
4  36.069378  35351
5  36.249138  31417
6  36.276714  13229
7  36.400549  36358
8  36.441734  22745
9  36.837481   7660


In [15]:
nearest_info = (df.merge(neighbors, right_on = 'id', left_index = True).sort_values('distance')[['id', 'name', 'distance']])
print(nearest_info)

      id                        name   distance
0  35811                Barack Obama   0.000000
1  24478                   Joe Biden  33.015148
2  28441              George W. Bush  34.307434
3  14754                 Mitt Romney  35.791060
4  35351            Lawrence Summers  36.069378
5  31417              Walter Mondale  36.249138
6  13229            Francisco Barrio  36.276714
7  36358                  Don Bonker  36.400549
8  22745  Wynn Normington Hugh-Jones  36.441734
9   7660    Refael (Rafi) Benvenisti  36.837481


#### Search for user entered data

In [16]:
usr_text = df[df['name'] == 'Refael (Rafi) Benvenisti']['text']

In [17]:
usr_text

7660    refael rafi benvenisti hebrew born in 1937 was...
Name: text, dtype: object

In [18]:
usr_input = vect.transform(usr_text)  # or pass free text in a list, e.g. ["born in athens georgia on 1972"]

In [19]:
usr_input.shape

(1, 437503)
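
The query vector has the same number of columns as the fitted vocabulary because `transform` counts words against the vocabulary learned earlier and ignores words it has never seen. The following sketch illustrates that behaviour in plain Python. The `vectorize_query` helper and the four-word vocabulary are hypothetical stand-ins for the fitted `vect`.

```python
import re
from collections import Counter


def vectorize_query(query, vocabulary):
    """Count query tokens against a fixed vocabulary; unknown words are skipped."""
    counts = Counter(re.findall(r"\b\w\w+\b", query.lower()))
    row = [0] * len(vocabulary)
    for token, count in counts.items():
        index = vocabulary.get(token)
        if index is not None:
            row[index] = count
    return row


# Hypothetical vocabulary, standing in for the one learned from df['text'].
vocabulary = {"athens": 0, "born": 1, "georgia": 2, "in": 3}

query_vector = vectorize_query("born in athens georgia on 1972", vocabulary)
print(query_vector)
```

Here 'on' and '1972' are not in the toy vocabulary, so they contribute nothing, and the vector length stays equal to the vocabulary size.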

In [20]:
distances, indices  = nn.kneighbors(usr_input, n_neighbors = 10)

In [21]:
print(distances)
print(indices)

[[ 0.         31.38470965 31.38470965 31.67017524 31.68595904 32.18695388
  32.23352292 32.31098884 32.37282811 32.46536616]]
[[ 7660 21309 42246 14609 17730  9210 18215  5678  8620 13556]]


In [23]:
usr_search_neighbors = pd.DataFrame({'distance': distances.flatten(), 'id': indices.flatten()})
print(usr_search_neighbors)

    distance     id
0   0.000000   7660
1  31.384710  21309
2  31.384710  42246
3  31.670175  14609
4  31.685959  17730
5  32.186954   9210
6  32.233523  18215
7  32.310989   5678
8  32.372828   8620
9  32.465366  13556


In [24]:
usr_search_info = (df.merge(usr_search_neighbors, right_on = 'id', left_index = True).sort_values('distance')[['id', 'name', 'distance']])
print(usr_search_info)

      id                      name   distance
0   7660  Refael (Rafi) Benvenisti   0.000000
1  21309           Philippe Augier  31.384710
2  42246          Eric D. Huntsman  31.384710
3  14609    Joe Young (politician)  31.670175
4  17730    Douglas Young (lawyer)  31.685959
5   9210              Andy Anstett  32.186954
6  18215            Norman Zabusky  32.233523
7   5678    Gopal Ballav Pattanaik  32.310989
8   8620      Krzysztof Piesiewicz  32.372828
9  13556            Robert A. Roth  32.465366


#### Save models

In [25]:
import pickle

In [27]:
# Use context managers so the files are closed after writing
with open('nearest_neighbor.pickle', 'wb') as f:
    pickle.dump(nn, f)
with open('count_vectorizer.pickle', 'wb') as f:
    pickle.dump(vect, f)
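
To reuse the saved models later, they need to be loaded back with `pickle.load`. The sketch below shows a save-and-load round trip with a stand-in dictionary instead of the real `nn` and `vect` objects; the file name and temporary directory are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be `nn` or `vect`.
model = {"metric": "euclidean", "n_neighbors": 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pickle")  # hypothetical file name

    # Save: the context manager closes the file once the dump finishes.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load: read the bytes back into an equivalent Python object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code. For scikit-learn objects it is also safest to load them with the same library version that saved them.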