<a href="https://colab.research.google.com/github/GMurf/Hongda-George/blob/main/NLP_SERP_Analysis_with_SpaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Top Results from Google Search
**Installing libraries**

In [None]:
# Install the libraries needed for scraping, cleaning and text analysis only

!pip install google
!pip install trafilatura

import re
import pandas as pd
import numpy as np
import trafilatura
import pprint


In [None]:
# Installation tensorflow + transformers + pipelines
# You need this to summarize the SERP and to run question-answering on the extracted corpus of text 

!pip install "transformers == 3.3.0"
from transformers import pipeline





### Running the query

Here are the parameters that we can use:

* **query**: the query string that we want to search for.
* **tld**: the top-level domain to search on, for example `com` for google.com or `in` for google.in.
* **lang**: the language of the results.
* **num**: the number of results we want.
* **start**: the first result to retrieve.
* **stop**: the last result to retrieve. Use `None` to keep searching forever.
* **pause**: the time to wait between HTTP requests. Too short a pause may cause Google to block your IP; a longer pause makes the program slower, but it is the safer option.

The function returns a generator (iterator) that yields the URLs found. If the `stop` parameter is `None`, the iterator will loop forever.

Here is the documentation: https://python-googlesearch.readthedocs.io/en/latest/
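
Because the return value is a generator, the number of results can also be capped on the caller's side. The sketch below is only an illustration: `fake_search` is a hypothetical stand-in (not part of the `googlesearch` library) that yields URLs forever, mimicking the behaviour described for `stop=None`, and `itertools.islice` takes just the first few items from it.

```python
from itertools import count, islice

def fake_search(query):
    """Hypothetical stand-in for a search generator that never stops."""
    for i in count(1):
        yield f"https://example.com/{query}/result-{i}"

# Take only the first 5 URLs from the endless generator
first_five = list(islice(fake_search("semantic-search"), 5))
for url in first_five:
    print(url)
```

The same pattern should apply to any generator of URLs, so it is a cheap safeguard against an accidental endless loop.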

In [None]:
uQuery_1 = "HOW DOES SEMANTIC SEARCH IMPACT SEO?" #@param {type:"string"}
uNum = 10

def getResults(uQuery, uTLD, uNum, uStart, uStop):
  try:
      from googlesearch import search
  except ImportError:
      # Without the library nothing below can run, so report and stop here
      print("No module named 'google' found")
      raise

  # What are we searching for
  query = uQuery

  # Prepare the list to store URLs
  d = []

  # pause=2 leaves a gap between HTTP requests to avoid being blocked
  for j in search(query, tld=uTLD, num=uNum, start=uStart, stop=uStop, pause=2):
      d.append(j)
      print(j)
  return d

results_1 = getResults(uQuery_1, "com", uNum, 1,uNum)


https://www.searchenginejournal.com/semantic-search-seo/264037/
https://blog.hubspot.com/marketing/semantic-search
https://zagfirst.com/what-is-semantic-search-and-how-does-it-impact-seo/
https://www.hillwebcreations.com/what-is-semantic-search/
https://www.infidigit.com/blog/semantic-search/
https://ahrefs.com/blog/semantic-search/
https://www.seo.com/blog/positive-impacts-of-semantic-search-for-seo/
https://www.seoblog.com/semantic-search-seo-impact/
https://blog.alexa.com/semantic-search/
https://betterstudio.com/blog/semantic-search/


### Scraping results with Trafilatura

The library downloads, parses and converts web documents: it scrapes the main body text while preserving part of the text formatting and page structure, and converts the output to TXT, CSV, XML or TEI-XML.

Here is the documentation: https://trafilatura.readthedocs.io/


In [None]:
pd.set_option('display.max_colwidth', None) # make sure output is not truncated (cols width)
pd.set_option("display.max_rows", 100) # make sure output is not truncated (rows)

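def basicPreprocess(text): # Pre-processing
  try:
    processed_text = text.lower()
    # Drop a non-word character (e.g. punctuation) that is followed by spaces
    processed_text = re.sub(r'\W +', ' ', processed_text)
    # Replace newlines, both real ones and the literal "\n" left by escaped strings
    processed_text = re.sub(r'\\n|\n', ' ', processed_text)
    # Strip HTML tags
    processed_text = re.sub(r'<[^>]+>',' ', processed_text)
    # Replace non-breaking spaces, both real ones and the literal "\xa0"
    processed_text = re.sub(r'\\xa0|\xa0', ' ', processed_text)

  except Exception as e:
    # text is None when the extraction returned nothing
    print("Exception:",e,",on text:", text)
    return None
  return processed_text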

def readResults(urls, query):
    # Prepare the list to store results
    x = []
    position = 0 # position on the serp

    # Loop items in results
    for page in urls:
        position += 1
        downloaded = trafilatura.fetch_url(page)
        if downloaded is not None: # assuming the download was successful
            result = trafilatura.extract(downloaded, include_tables=False, include_formatting=False, include_comments=False)
            result = basicPreprocess(result)
            x.append((page, result, query, position))
    return x

d = readResults(results_1, uQuery_1) # get results from the 1st query

df_1 = pd.DataFrame(d, columns=('url', 'result', 'query', 'position')) # store data in a data frame

df_final = pd.concat([df_1])
print("total number of articles (before filtering) ",len(df_final))

# Remove rows where result is empty
df_final['result'] = df_final['result'].replace(' ', np.nan)
df_final = df_final.dropna(subset=['result'])

# Remove rows where the article is 200 characters or shorter
df_final = df_final[df_final['result'].apply(lambda x: len(str(x))>200)]


# Reindex df
df_final.index = range(len(df_final.index))

# Set the file name
uQuery = uQuery_1
cleanQuery = re.sub(r'\W+','', uQuery)
file_name = cleanQuery + ".csv"


total number of articles (before filtering)  10
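
To see what the cleaning step does to a piece of text, here is a small standalone sketch. It assumes a hypothetical helper `clean_text` that applies the same kind of substitutions as `basicPreprocess`, written so that it catches both real newline and non-breaking-space characters and their escaped, literal forms. The sample sentence is made up for illustration.

```python
import re

def clean_text(text):
    """Hypothetical helper mirroring the substitutions in basicPreprocess."""
    text = text.lower()
    text = re.sub(r'\W +', ' ', text)        # non-word character followed by spaces
    text = re.sub(r'\\n|\n', ' ', text)      # literal "\n" or a real newline
    text = re.sub(r'<[^>]+>', ' ', text)     # HTML tags
    text = re.sub(r'\\xa0|\xa0', ' ', text)  # literal "\xa0" or a real non-breaking space
    return text

sample = "Semantic Search, explained.\nIt matters for <b>SEO</b>."
cleaned = clean_text(sample)
print(cleaned)

# The same \W+ idea turns the query into a file name
file_name = re.sub(r'\W+', '', "HOW DOES SEMANTIC SEARCH IMPACT SEO?") + ".csv"
print(file_name)
```

Note that the first substitution removes punctuation only when it is followed by a space, so a period at the very end of a line survives while a comma in the middle of a sentence does not.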


# Analyze terms from the corpus of results
This section visualizes how language differs among search results. Scattertext is a tool for finding distinguishing terms in small-to-medium-sized corpora like the one we're using here.

Scattertext presents terms/concepts in an interactive, HTML scatter plot. Points corresponding to terms are selectively labeled so that they don't overlap with other labels or points.

Here is the documentation: https://github.com/JasonKessler/scattertext

In [None]:
# Getting additional horsepower - adding more libraries
!pip install scattertext

%matplotlib inline
import scattertext as st
from sklearn.feature_extraction import _stop_words

import io
from scipy.stats import rankdata, hmean, norm
import spacy
import os, pkgutil, json, urllib
from urllib.request import urlopen
from IPython.display import IFrame
from IPython.display import display, HTML
from scattertext import CorpusFromPandas, produce_scattertext_explorer
display(HTML("<style>.container { width:98% !important; }</style>"))

nlp = spacy.load('en_core_web_sm') # make sure you have the right language model here


In [None]:
# Label each result by SERP position: 'Top 3' or 'Positions 4 - 10'
df_1['top_result'] = ['Top 3' if x <= 3 else 'Positions 4 - 10' for x in df_1['position']]

# Remove rows where result is empty
df_1['result'] = df_1['result'].replace(' ', np.nan)
df_1 = df_1.dropna(subset=['result']).copy()

df_1['index'] = df_1.index

# Word count per category
print(df_1.groupby('top_result').apply(lambda x: x.result.apply(lambda x: len(x.split())).sum()))

# Parse each article with spaCy
df_1['parsed'] = df_1.result.apply(nlp)

# Turn it into a Scattertext corpus
corpus_1 = (st.CorpusFromParsedDocuments(df_1, 
                                       category_col='top_result', 
                                       parsed_col='parsed')
          .build()) 

### Visualizing the Top Results

In [None]:
html = produce_scattertext_explorer(corpus_1,
                                    category='Top 3',
                                    category_name='Top 3',
                                    not_category_name='Positions 4 - 10',
                                    width_in_pixels=900,
                                    minimum_term_frequency=3,
                                    term_significance = st.LogOddsRatioUninformativeDirichletPrior())
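# Save the explorer to disk; the with-block closes the file afterwards
with open("SERP-Visualization_top3.html", 'wb') as html_file:
    html_file.write(html.encode('utf-8'))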
display(HTML(html))


## Top 25 Terms


In [None]:
# Term frequencies per category; head(25) shows the first 25 terms in corpus order
term_freq_df = corpus_1.get_term_freq_df()

term_freq_df.head(25)

term,Top 3 freq,Positions 4 - 10 freq
many,3,15
things,8,6
have,13,40
changed,2,5
since,3,3
2010,1,1
when,12,23
seo,18,117
was,5,18
more,19,87
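
To give a feel for how raw counts like these can be turned into a term significance score, here is a rough sketch of a z-scored log-odds ratio with a uniform Dirichlet prior, the idea behind the `LogOddsRatioUninformativeDirichletPrior` option used for the explorer. It is not Scattertext's own implementation: the function name `log_odds_z_scores` and the prior value `ALPHA = 0.01` are assumptions made for illustration, and the category totals are computed from the ten rows shown above only, so the scores are indicative at best.

```python
import math

# Counts copied from the ten rows shown above: term -> (Top 3, Positions 4 - 10)
counts = {
    "many": (3, 15), "things": (8, 6), "have": (13, 40), "changed": (2, 5),
    "since": (3, 3), "2010": (1, 1), "when": (12, 23), "seo": (18, 117),
    "was": (5, 18), "more": (19, 87),
}

ALPHA = 0.01  # assumed prior pseudo-count per term (illustrative choice)

def log_odds_z_scores(counts, alpha=ALPHA):
    """z-scored log-odds ratio of each term between two categories."""
    n_a = sum(a for a, _ in counts.values())  # total count, first category
    n_b = sum(b for _, b in counts.values())  # total count, second category
    alpha_0 = alpha * len(counts)             # total prior mass
    scores = {}
    for term, (y_a, y_b) in counts.items():
        odds_a = (y_a + alpha) / (n_a + alpha_0 - y_a - alpha)
        odds_b = (y_b + alpha) / (n_b + alpha_0 - y_b - alpha)
        delta = math.log(odds_a) - math.log(odds_b)
        variance = 1.0 / (y_a + alpha) + 1.0 / (y_b + alpha)
        scores[term] = delta / math.sqrt(variance)
    return scores

z_scores = log_odds_z_scores(counts)
# Positive scores lean towards 'Top 3', negative towards 'Positions 4 - 10'
for term, z in sorted(z_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:>8}  {z:+.2f}")
```

A term with a high count in both categories can still score close to zero, which is why a score of this kind is more informative than the raw frequencies when comparing the top results with the rest.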


# Natural Language Processing

## Install Libraries

### Extracting the content from Top 5

In [None]:
df_entity = df_1[df_1['position'] < 6]


In [None]:
# Remove rows where the article is 300 characters or shorter
df_entity = df_entity[df_entity['result'].apply(lambda x: len(str(x))>300)]


# getting text ready by merging all pages together (no index)
full_body = df_entity[['result']].agg(''.join, axis=1).to_string(index=False).strip()

# cleaning up the text
full_body = basicPreprocess(full_body)

pp = pprint.PrettyPrinter(indent=35)

pp.pprint(full_body) 

with open('output.txt', 'w', encoding='utf-8') as text_file:
    text_file.write(full_body)

('many things have changed since 2010 when seo was more concerned with getting '
 'as many backlinks as you could and including as many keywords as possible. '
 'in 2021 the focus has shifted to understanding intent and behavior and the '
 'context semantics behind them. today search engine understanding has evolved '
 'and we’ve changed how we optimize for it as a result the days of '
 'reverse-engineering content that ranks higher are behind us and identifying '
 'keywords is no longer enough. now you need to understand what those keywords '
 'mean provide rich information that contextualizes those keywords and firmly '
 'understand user intent. these things are vital for seo in an age of semantic '
 'search where machine learning and natural language processing are helping '
 'search engines understand context and consumers better. in this piece you’ll '
 'learn what semantic search is why it’s essential for seo and how to optimize '
 'your content for it. what is semantic search? s

### Extracting Entities

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(full_body)

# Collect one row per entity, then build the DataFrame in one step
# (DataFrame.append was removed in pandas 2.0)
entity_rows = [{"Entity": ent.text, "Type": ent.label_} for ent in doc.ents]
df_spacy_entities = pd.DataFrame(entity_rows, columns=['Entity','Type'])


df_spacy_entities.head()


,Entity,Type
0,2010,DATE
1,2021,DATE
2,today,DATE
3,the days,DATE
4,google,ORG


In [None]:
# Remove duplicate entities from the DataFrame, keeping the first occurrence

df_spacy_entities = df_spacy_entities.drop_duplicates(subset=['Entity'])

df_spacy_entities.head()

,Entity,Type
0,2010,DATE
1,2021,DATE
2,today,DATE
3,the days,DATE
4,google,ORG
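
`drop_duplicates(subset=['Entity'])` compares rows on the entity text only and, by default, keeps the first row for each text. The standard-library sketch below shows the same logic on the five rows from the output above, plus one repeated entity added here purely for illustration.

```python
# The rows shown above, plus one repeated entity added for illustration
entities = [
    ("2010", "DATE"), ("2021", "DATE"), ("today", "DATE"),
    ("the days", "DATE"), ("google", "ORG"), ("google", "ORG"),
]

seen = set()
unique_entities = []
for text, label in entities:
    if text not in seen:  # compare on the entity text only
        seen.add(text)
        unique_entities.append((text, label))

print(unique_entities)
```

Because only the text is compared, an entity that spaCy labels with two different types in different places would keep just the type of its first occurrence.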


## Visualizing the Data


In [None]:
!pip install plotly==4.5
import plotly.express as px
import numpy as np



In [None]:
fig_3 = px.treemap(df_spacy_entities, path=['Type','Entity'])
fig_3.show()
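
The treemap nests each entity under its type. As a rough illustration of the hierarchy that `path=['Type','Entity']` describes, the sketch below groups the entity/type pairs from the earlier output into a plain dictionary and counts the entities per type. It uses only the standard library and assumes that, with no `values` argument, every entity carries the same weight; that assumption is worth checking against the Plotly documentation for the installed version.

```python
from collections import defaultdict

# Entity/type pairs taken from the output shown earlier
pairs = [
    ("2010", "DATE"), ("2021", "DATE"), ("today", "DATE"),
    ("the days", "DATE"), ("google", "ORG"),
]

# Outer level: entity type; inner level: the entities of that type
by_type = defaultdict(list)
for entity, label in pairs:
    by_type[label].append(entity)

for label, members in by_type.items():
    print(label, len(members), members)
```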

In [None]:
# Store DataFrame to CSV (Entity, Type)

from google.colab import files

df_spacy_entities.to_csv('entities-to-add.csv') 
files.download('entities-to-add.csv')
