**Building a demo knowledge graph on Wikidata using the spaCy library**


**Loading dependencies**

In [3]:
import re
import pandas as pd
import bs4   # Beautiful Soup, for parsing data read from a website
import requests
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')

from spacy.matcher import Matcher 
from spacy.tokens import Span 

import networkx as nx

import matplotlib.pyplot as plt
from tqdm import tqdm

pd.set_option('display.max_colwidth', 200)
%matplotlib inline

**Loading data from Google Drive**

In [9]:
from google.colab import drive
drive.mount('/content/drive')
dataset = pd.read_csv('/content/drive/MyDrive/model.csv', na_values='?') 
dataset = dataset.reset_index(drop=True)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [10]:
dataset

Unnamed: 0,sentence
0,"confused and frustrated, connie decides to leave on her own."
1,"later, a woman’s scream is heard in the distance."
2,christian is then paralyzed by an elder.
3,the temple is set on fire.
4,"outside, the cult wails with him."
...,...
4313,"confidencial also responded negatively, calling the film a barren drama, unsubtle and self-indulgent."
4314,and le parisien gave the film their highest five-star rating.
4315,"the museum collection includes 37,000 film titles, 60,000 posters, 700,000 photographs and 20,000 books."
4316,"its predecessor was the dutch historical film archive, founded in 1946."


In [11]:
dataset['sentence'].sample(1)

2435    kanchivaram  was selected to be premiered at the toronto international film festival.
Name: sentence, dtype: object

**Extracting sentences that have exactly one subject and one object. The example below shows how spaCy labels every word of a sentence with a dependency tag, which tells us which word is the subject and which is the object.**

In [13]:
sentences = nlp("Walmart is a good company for working")
for words in sentences:
  print(words.text,".....",words.dep_)

Walmart ..... nsubj
is ..... ROOT
a ..... det
good ..... amod
company ..... attr
for ..... prep
working ..... pcomp


**This function finds the subject and the object of a sentence and returns them as a pair. Because many entities span more than one word, it also collects the compound words and modifiers that precede the subject or object.**

In [21]:
def get_entities(sent):
  entity1 = ""              # subject
  entity2 = ""              # object
  previous_dependency = ""  # dependency tag of the previous non-punctuation token
  previous_text = ""        # text of the previous non-punctuation token
  prefix = ""               # compound word(s) preceding an entity
  modifier = ""             # modifier (amod, nummod, ...) preceding an entity
  for token in nlp(sent):   # parse the sentence once instead of on every iteration
    if token.dep_ == "punct":
      continue              # skip punctuation (the original loop never advanced past it)
    if token.dep_ == "compound":
      prefix = token.text
      if previous_dependency == "compound":
        prefix = previous_text + " " + token.text
    if token.dep_.endswith("mod"):
      modifier = token.text
      if previous_dependency == "compound":
        modifier = previous_text + " " + token.text
    if "subj" in token.dep_:   # nsubj, nsubjpass, csubj, ...
      entity1 = modifier + " " + prefix + " " + token.text
      prefix = ""
      modifier = ""
      previous_dependency = ""
      previous_text = ""
    if "obj" in token.dep_:    # dobj, pobj, ...
      entity2 = modifier + " " + prefix + " " + token.text
    previous_dependency = token.dep_
    previous_text = token.text
  return [entity1.strip(), entity2.strip()]  # strip leading and trailing spaces

**Testing the get_entities function**

In [22]:
get_entities("My Name is Khan")
get_entities("The film has 200 patents")

['film', '200  patents']
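
The double space in `'200  patents'` comes from the string concatenation in `get_entities`: the modifier is "200", the prefix is empty, and both separators are kept. A minimal sketch of one possible alternative follows. `join_parts` is a hypothetical helper, not part of the notebook, that joins only the non-empty parts.

```python
def join_parts(*parts):
    """Join the non-empty parts with single spaces."""
    return " ".join(p for p in parts if p)

# Plain concatenation, as in get_entities: the empty prefix leaves a gap.
concatenated = ("200" + " " + "" + " " + "patents").strip()

# Joining only the non-empty parts avoids the gap.
joined = join_parts("200", "", "patents")

print(repr(concatenated))  # '200  patents'
print(repr(joined))        # '200 patents'
```

Whether to normalize the spacing depends on how the entity strings are used later; if they serve as node names in a graph, consistent spacing keeps the same entity from appearing under two spellings.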