![Image of Yaktocat](https://seeders.nl/wp-content/uploads/2020/03/seeders-logo.png)
## SERP Analyser
*This notebook gets SERPs for top searched keywords in Europe and anlayses the top 10 results to gain insights for important SEO ranking factors across Europe.*

We will analyse SERPS based on the following questions<br>
 - Is the domain exstension a ranking factor?<br>
 - ??

### Import libraries

In [1]:
import pandas as pd
import gspread
from gspread_dataframe import get_as_dataframe
from oauth2client.service_account import ServiceAccountCredentials

import requests as rq
from requests import get
from bs4 import BeautifulSoup
import time
from tqdm import tqdm
from urllib.parse import urlparse

import matplotlib.pyplot as plt 
import seaborn as sns

### Get the data from google spreadsheets

In [2]:
scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
credentials = ServiceAccountCredentials.from_json_keyfile_name('C:/Users/Anne/PycharmProjects/crawlersAndscrapers/Scrapers and Crawlers-79156bc3792f.json', scope)
client = gspread.authorize(credentials)
print("Authorizing.......")

spreadsheet_key = '1n6lCJTKjX6ZDlP8WSv_6ZNgq11SwFM_Owbkfo_hmCwo'
print("Opening.......")
sheet = client.open("Zoekwoorden voor onderzoek").sheet1

Authorizing.......
Opening.......


### Clean the data

In [3]:
df = get_as_dataframe(sheet, header=[0,1])#GET ONLY ROWS POPULATED WITH DATA
df = df[0:21]
df = df.filter(regex='^((?!Unnamed).)*$', axis=1) #REMOVE ALL COLUMNS WHERE THE HEADER IS UNNAMED
df = df.filter(regex='^((?!Volume).)*$', axis=1) #REMOVE ALL COLUMNS CONTAINING VAGUE SEARCH VOLUME DATA
df = df.rename(columns=lambda x: x.strip()) #REMOVE WHITESPACE FROM COLUMN NAMES
df.head(3)

Unnamed: 0_level_0,Nederland,Duitsland,Engeland,Spanje,Italie,Frankrijk,Portugal,Belgie,Denemarken,Zweden,Polen
Unnamed: 0_level_1,Keyword,Keyword,Keyword,Keyword,Keyword,Keyword,Keyword,Keyword,Keyword,Keyword,Keyword
0,autoverzekering,Autoversicherung,Car insurance,seguro coche,assicurazione auto,assurance auto,seguro automóvel,autoverzekering,bilforsikring,bilförsäkring,ubezpieczenie samochodu
1,sneakers,Sneakers,Sneakers,Sneakers,scarpe da ginnastica,sneakers,ténis,sneakers,sneakers,sneakers,sneakers
2,geld lenen,Geld leihen,Money loan,prestar dinero,prestiti,prêt,empréstimo,geld lenen,låne penge,låna pengar,pożyczka gotówkowa


In [4]:
#Get only first two columns

df = df.filter(items=[( 'Nederland', 'Keyword'),('Duitsland', 'Keyword')])
df.columns.names = ['Country', 'Atts']
df = df.head(5)
df

Country,Nederland,Duitsland
Atts,Keyword,Keyword
0,autoverzekering,Autoversicherung
1,sneakers,Sneakers
2,geld lenen,Geld leihen
3,eten bestellen,Essen bestellen
4,hypotheek,Hypothek


In [5]:
# testlist = ['een', 'twee', 'drie', 'vier' , 'vijf', 'zes', 'zeven', 'acht', 'negen', 'tien', 'elf', 'twaalf', 'dertien', 'viertien', 'vijftien', 'zestien', 'zeventien', 'achttien', 'negentien', 'twintig',' eenentwintig']

### Build Google Search function

In [9]:
def search(term, num_results=10, lang="nl"):
    usr_agent = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'}

    def fetch_results(search_term, number_results, language_code):
        escaped_search_term = search_term.replace(' ', '+')

        google_url = 'https://www.google.com/search?q={}&num={}&hl={}'.format(escaped_search_term, number_results+1,language_code)
        response = get(google_url, headers=usr_agent)
#         status_code = response.status_code
        
        if response.status_code == 200:  ## doesnt work, maybe try yield ipv. return? https://dzone.com/articles/when-to-use-yield-instead-of-return-in-python#:~:text=return%20sends%20a%20specified%20value,is%20used%20in%20Python%20generators.
#             global succes_message
            succes_message = "Succesfully connected to SERP"
            print(succes_message)
        elif status_code == 429:
            print(rq.get_retry_after(response))
            
        response.raise_for_status()

        return response.text

    def parse_results(raw_html):
        soup = BeautifulSoup(raw_html, 'html.parser')
        result_block = soup.find_all('div', attrs={'class': 'g'})
        for result in result_block:
            link = result.find('a', href=True)
            title = result.find('h3')
            if link and title:
                yield link['href']
                
#     print(succes_message)
    html = fetch_results(term, num_results, lang)
    return list(parse_results(html))

In [10]:
urls = []
for column in df:
    column_urls = []
    kws = df[column]
    for kw in kws:
        print("Current KW:    ", kw)
        temp = []
        print("Sleeping for 2 seconds.....")
        time.sleep(2)
        for j in search(kw):
            temp.append(j)
            while len(temp) > 10:
                temp.pop()
        column_urls.append(temp)
    urls.append(column_urls)

Current KW:     autoverzekering
Sleeping for 2 seconds.....
Succesfully connected to SERP
Current KW:     sneakers
Sleeping for 2 seconds.....
Succesfully connected to SERP
Current KW:     geld lenen
Sleeping for 2 seconds.....
Succesfully connected to SERP
Current KW:     eten bestellen
Sleeping for 2 seconds.....
Succesfully connected to SERP
Current KW:     hypotheek
Sleeping for 2 seconds.....


KeyboardInterrupt: 

In [16]:
print(len(urls))
test_urls = urls.copy()
print(len(test_urls))

for column in df:
    print(column)
    for countrylist in test_urls:
        df[str(column[0]),'urls'] = countrylist
        if len(urls) == 2:
            test_urls.pop(0)
        else:
            pass
        
df = df.sort_index(axis=1)
df

'''
Alright fuckboi, 
1. rewrite the above loop to functions. Either, updat the search funtion to take column(name) or see if you can 
change proxy within loop. 
1.1 Check how to print my current IP/location to check if changing proxy works. 
Don't forget to check the NordVPN library
2. Test and find out how to print response code while requesting to ensure the 429 Retry After is printed when it occurs.
'''

2
2
('Duitsland', 'Keyword')
('Duitsland', 'urls')
('Nederland', 'Keyword')
('Nederland', 'urls')


'\nAlright fuckboi, \n1. rewrite the above loop to functions. Either, updat the search funtion to take column(name) or see if you can \nchange proxy within loop. \n2. Test and find out how to print response code while requesting to ensure the 429 Retry After is printed when it occurs.\n'