#### info for google package

search(query, tld='com', lang='en', num=10, start=0, stop=None, pause=2.0)

    query : Query string to search for.
    tld : Top-level domain, which selects the Google domain to search on, e.g. google.com or google.in.
    lang : Language of the search results.
    num : Number of results per page.
    start : First result to retrieve.
    stop : Last result to retrieve. Use None to keep searching forever.
    pause : Time to wait between HTTP requests, in seconds. Too short a pause may cause Google to block your IP. A longer pause makes your program slower, but it is the safer option.
    Return : Generator (iterator) that yields the URLs found. If the stop parameter is None, the iterator will loop forever.
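
Since the return value is a generator that never ends when stop is None, it can help to cap the number of results on the caller's side as well. The following is a minimal sketch of that idea using `itertools.islice`. Note that `fake_search` and `first_n` are hypothetical names introduced here: `fake_search` is only a stand-in that yields made-up URLs without any network access, so the snippet runs offline. The same `islice` call should also work on the generator returned by the real `search`.

```python
from itertools import count, islice


def fake_search(query):
    """Hypothetical stand-in for search(query, stop=None): yields made-up URLs forever."""
    for i in count(1):
        yield "https://example.com/{}/{}".format(query, i)


def first_n(hits, n):
    """Return at most the first n items of any iterable, even an endless one."""
    return list(islice(hits, n))


top3 = first_n(fake_search("obama"), 3)
print(top3)
```

Because `islice` stops pulling items after `n` results, the endless generator is never exhausted and no further requests would be triggered.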

#### import google python package

In [3]:
try: 
    from googlesearch import search 
except ImportError:  
    print("Module 'googlesearch' not found; it is provided by the 'google' package")
    print("i) hit windows key")
    print("ii) search for and open anaconda prompt")
    print("iii) type: pip install google")
    print("iv) hit enter and wait for the package to install")

**example 1**:
simply use the "search" function from the "google" package.
With stop=1 the generator yields only the first search result, so a single link is printed.

In [30]:
# search string
query = "uniprot acetyl-CoA acetyltransferase, mitochondrial precursor [Homo sapiens]"
  
for geneWebpage in search(query, tld="com", num=10, start=0, stop=1, pause=2): 
    print(geneWebpage) 

https://www.uniprot.org/uniprot/P24752


**example 2**: search google until a hit containing a certain string is found

In [51]:
query = "Barack Obama" # search string
# you can try "twitter", "wikipedia", "chicago", "trump" e.g.
URLstr = "chicago" # string to be found in link that is google hit
nrSearchesMAX = 10 # define the maximum number of searches to be conducted
nrSearches = 0 # initialize searches counter
for hit in search(query, tld="com", num=1, start=0, stop=nrSearchesMAX, pause=2): 
    print(hit) # prints the websites found by google
    nrSearches = nrSearches+1
    if hit.find(URLstr) == -1: # URLstr was not found
        if nrSearches == nrSearchesMAX:
            print("Maximum number of searches reached. <URLstr> not found in google hits.")
    else: # URLstr was found
        print(nrSearches)
        break

https://barackobama.com/
https://en.wikipedia.org/wiki/Barack_Obama
https://twitter.com/barackobama
https://www.britannica.com/biography/Barack-Obama
https://www.history.com/topics/us-presidents/barack-obama
https://www.facebook.com/barackobama/
https://www.chicagotribune.com/topic/politics-government/government/barack-obama-PEPLT007408-topic.html
7
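
The loop in example 2 can also be wrapped in a small helper that works on any iterable of URLs, which makes it easy to try out without a live search. This is only a sketch: `first_hit_containing` is a hypothetical name, and `sample_hits` simply reuses the URLs printed above instead of calling `search`.

```python
from itertools import islice


def first_hit_containing(hits, substring, max_hits):
    """Return (position, url) of the first URL containing substring, else (None, None).

    hits can be any iterable of URL strings; at most max_hits of them are checked.
    """
    for position, url in enumerate(islice(hits, max_hits), start=1):
        if substring in url:
            return position, url
    return None, None


# URLs copied from the output of example 2, used here instead of a live search
sample_hits = [
    "https://barackobama.com/",
    "https://en.wikipedia.org/wiki/Barack_Obama",
    "https://twitter.com/barackobama",
    "https://www.britannica.com/biography/Barack-Obama",
    "https://www.history.com/topics/us-presidents/barack-obama",
    "https://www.facebook.com/barackobama/",
    "https://www.chicagotribune.com/topic/politics-government/government/barack-obama-PEPLT007408-topic.html",
]

position, url = first_hit_containing(sample_hits, "chicago", 10)
print(position, url)
```

Passing the generator returned by `search(...)` as `hits` should behave the same way, because the helper only iterates over it.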


**example 3**: write a function that searches until a website containing a certain search string is found (see example 2), <br>
pulls the HTML text from that website and searches for information in that text

In [49]:
# import requests which allows getting info (html) from and interacting with webpage
import requests

In [None]:
# define a helper function that we will need in the main function below
# it finds all instances of sub in strIN
# finding overlapping matches is slower but can be done (see the end of the loop)
def findAll(strIN, sub):
    # help docstring for this function:
    "this generator finds all instances of <sub> in <strIN> and yields the indices where the matches start in <strIN>"
    start = 0
    while True:
        start = strIN.find(sub, start)
        if start == -1: return
        yield start
        # non-overlapping matches, faster:
        start += len(sub) 
        # find overlapping matches, slower:
        #start += 1
#
# use like so:
#listIND = list(findAll(textStr, searchStr))
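
To make the difference between non-overlapping and overlapping matches concrete, here is a small self-contained sketch. It is a variant of the helper above, not a replacement for it: the name `find_all` and the `overlapping` flag are assumptions introduced for this illustration.

```python
def find_all(text, sub, overlapping=False):
    """Yield the start index of every occurrence of sub in text."""
    if not sub:
        return  # an empty search string would otherwise never advance
    # step over the whole match, or by one character to allow overlaps
    step = 1 if overlapping else len(sub)
    start = text.find(sub)
    while start != -1:
        yield start
        start = text.find(sub, start + step)


print(list(find_all("GN=A GN=B", "GN=")))               # [0, 5]
print(list(find_all("aaaa", "aa")))                     # [0, 2]
print(list(find_all("aaaa", "aa", overlapping=True)))   # [0, 1, 2]
```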

In [80]:
# define function that gets a gene symbol from a link found with a google search (aiming at uniprot website results)
def getGeneSymbol(geneDescription, URLstr, nrSearchesMAX):
    # help docstring for this function:
    "gets a gene symbol from a link found with a google search (aiming at uniprot website results); returns '-' if none is found"
    nrSearches = 0 # initialize searches counter
    # get google results for gene descriptions
    query = "uniprot " + geneDescription
    for hit in search(query, tld="com", num=1, stop=nrSearchesMAX, pause=2):
        print(hit) # prints the websites found by google
        nrSearches = nrSearches+1
        if hit.find(URLstr) == -1: # URLstr was not found
            if nrSearches == nrSearchesMAX:
                print("Maximum number of searches reached. <URLstr> not found in google hits.")
                return("-")
        else: # URLstr was found
            print(nrSearches)
            page = requests.get(hit) # get page info from google hit
            # search for "GN=" which is where gene symbol name is stored in webpage text
            GNindStart = list(findAll(page.text, "GN="))
            if len(GNindStart) > 1:
                print("more than one instance of <GN=> found!")
                return("-")
            elif len(GNindStart) == 0:
                print("no instance of <GN=> found!")
                return("-")
            else:
                geneSym = page.text[GNindStart[0]:]
                # find the next space, which on this website happens to separate the different entries
                geneSymStop = geneSym.find(' ')
                if geneSymStop == -1: # no space after the symbol: take the rest of the text
                    geneSymStop = len(geneSym)
                # get the string between "GN=" and the space
                geneSym = geneSym[3:geneSymStop] # 3 is the length of the string "GN="
                return(geneSym)
    # the search ended with fewer than nrSearchesMAX hits and without a match
    return("-")
#
# use like so:
#geneDescription = "acetyl-CoA acetyltransferase, mitochondrial precursor [Homo sapiens]" # "uniprot " is prepended inside the function
#geneSymbol = getGeneSymbol(geneDescription, "uniprot.org", 5)
#print(geneSymbol)

In [81]:
# use function
geneDescription = "acetyl-CoA acetyltransferase, mitochondrial precursor [Homo sapiens]" # "uniprot " is prepended inside the function
geneSymbol = getGeneSymbol(geneDescription, "uniprot.org", 5)
print(geneSymbol)

https://www.uniprot.org/uniprot/P24752
1
ACAT1
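
The text-parsing step of `getGeneSymbol` can be tried out separately from the search and the page download. Below is a minimal sketch that uses a regular expression instead of `findAll` and slicing, which is a swapped-in technique and not the approach used above. The name `extract_gene_symbol` is hypothetical, and `sample` is a hand-written string that only mimics the "GN=" pattern, not real page content.

```python
import re


def extract_gene_symbol(text):
    """Return the symbol after 'GN=' if text contains exactly one such tag, else '-'."""
    # \S+ captures everything up to the next whitespace character
    symbols = re.findall(r"GN=(\S+)", text)
    if len(symbols) != 1:
        return "-"
    return symbols[0]


# hand-written stand-in for page.text (assumption, not downloaded content)
sample = "Acetyl-CoA acetyltransferase, mitochondrial OS=Homo sapiens GN=ACAT1 PE=1"
symbol = extract_gene_symbol(sample)
print(symbol)
```

Like the slicing version, this stops at the first whitespace after "GN=", so it relies on the same assumption about how the page separates its entries.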
