# Analysis Ideas for Search Queries from an Outdoors Site
## Background
The client has a log of every keyword search submitted to their outdoors site over the past five years.

The log file has several delimited columns (the examples below parse it with `;` as the separator), but the only column of interest right now is the search query text submitted by site users.

Each search query submission is a single string of one or more space-separated words.

The client already has lists of the names of Rivers, Towns, Trips and Runs, and would like to know:
- what proportion of searches name an entry from each category {rivers, towns, trips, runs}, e.g. "ohio" = {river} [assuming no name appears in more than one list/category]
- for each category, which list entry is requested most often, e.g. "colorado" is the most frequently searched river
- what proportion of searches request only one type of data (strings that appear in the category lists and nothing else) vs. those that also include filtering terminology (e.g. "x river" vs. "x river rafting")
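The three questions above could be answered in a single pass over the queries. The following is a minimal sketch, not the client's actual data or method: the `CATEGORIES` contents, the sample queries, and the helper names `classify` and `summarize` are all hypothetical, and it assumes a query counts as "filtered" when it contains a list entry plus at least one extra word.

```python
from collections import Counter

# Hypothetical category lists; the real ones would be loaded from the client's files.
CATEGORIES = {
    "river": {"ohio", "colorado"},
    "town": {"moab"},
}


def classify(query, categories):
    """Return (category, entry, has_extra_terms), or (None, None, False) if nothing matches."""
    q = " ".join(query.lower().split())  # normalize case and whitespace
    # Exact match: the query is nothing but a list entry.
    for category, names in categories.items():
        if q in names:
            return category, q, False
    # Otherwise look for a list entry appearing as a run of whole words.
    tokens = q.split()
    for category, names in categories.items():
        for name in names:
            name_tokens = name.split()
            n = len(name_tokens)
            if any(tokens[i:i + n] == name_tokens
                   for i in range(len(tokens) - n + 1)):
                return category, name, True
    return None, None, False


def summarize(queries, categories):
    """Count searches per category, per list entry, and per category with extra terms."""
    total = 0
    by_category = Counter()
    by_entry = {category: Counter() for category in categories}
    filtered = Counter()
    for query in queries:
        total += 1
        category, entry, has_extra = classify(query, categories)
        if category is None:
            continue
        by_category[category] += 1
        by_entry[category][entry] += 1
        if has_extra:
            filtered[category] += 1
    return total, by_category, by_entry, filtered


# Made-up sample queries, for illustration only.
sample = ["colorado", "Colorado rafting", "ohio", "moab", "tents", "colorado"]
total, by_category, by_entry, filtered = summarize(sample, CATEGORIES)

for category, count in by_category.items():
    top_entry, top_count = by_entry[category].most_common(1)[0]
    print(category, count / total, top_entry, filtered[category] / count)
```

Dividing each category count by `total` gives the proportion asked for in the first question, `most_common(1)` answers the second, and the `filtered` counter separates plain requests from those with extra terms.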

## Code ideas
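- list.remove(string) # removes only the first instance of a string from a list - to remove every instance, I can reuse the remove_all() function I defined
- string.split(' ') # will break apart the search query string into tokens (consecutive spaces produce empty tokens; split() with no argument avoids that)
- x in y # will test whether 'x' is an item of list 'y'; if 'y' is a string, it tests whether 'x' is a substring of 'y' [strings are sequences, not lists]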
- I could add new fields to the CSV (or write a new CSV) for each search entry, to indicate things like
- - the category it belongs to
- - whether it includes a filtering term (True/False)
- Stripping quotes around strings in CSV fields is automatic for the CSV module: http://stackoverflow.com/questions/1707558/can-python-remove-double-quotes-from-a-string-when-reading-in-text-file
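As a rough illustration of how these ideas might fit together, the sketch below tokenizes each query, checks the tokens against one category list, and writes an annotated CSV with the two extra fields. The `remove_all` shown here is only a hypothetical stand-in for the function mentioned above (its real definition is not included in these notes), and the `rivers` set, the `annotate` helper, and the sample queries are made up.

```python
import csv
import io


def remove_all(items, value):
    """Hypothetical stand-in for remove_all(): return items without any occurrence of value."""
    return [item for item in items if item != value]


def annotate(query, names, category):
    """Build an output row: [query, category or '', has_filter_term]."""
    # split(' ') yields '' for repeated spaces, so drop those empty tokens.
    tokens = remove_all(query.lower().split(' '), '')
    matched = [token for token in tokens if token in names]
    # A filtering term is any extra word alongside a matched list entry.
    has_filter_term = bool(matched) and len(matched) < len(tokens)
    return [query, category if matched else '', has_filter_term]


rivers = {"ohio", "colorado"}  # hypothetical list entries

out = io.StringIO()  # stands in for a new CSV file opened for writing
writer = csv.writer(out, delimiter=';')
writer.writerow(["query", "category", "has_filter_term"])
for query in ["colorado", "ohio  rafting", "tents"]:
    writer.writerow(annotate(query, rivers, "river"))

annotated = out.getvalue().splitlines()
print("\n".join(annotated))
```

This version only handles single-word list entries; multi-word names would need phrase matching rather than a per-token test.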

One generic way to open a file and start parsing:

In [60]:
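with open('searchresults.csv', 'r') as f: # 'r' opens the file in text mode (lines are str); 'rb' opens it in binary mode (lines are bytes)
    f.readline() # skip the first two lines
    f.readline()
    query = f.readline() # third line of the file
querylist = query.split(';') # columns are separated by semicolons
querylist[1] # column 2 holds the search query; a plain split leaves the double quotes attached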

'"colorado"'

Here's another approach that uses the csv module, which also strips the quotes around each field:

In [102]:
import csv
filename = 'searchresults.csv'
states = 'states.csv'
state_names = set() # a set keeps the membership test below fast
with open(states, 'r', newline='') as st:
    for row in csv.reader(st, delimiter=';'):
        state_names.add(row[1].upper()) # states.csv has multiple columns; column 2 contains the spelled-out state name
countTrue = 0 # searches that exactly match a state name
countFalse = 0 # all other searches
with open(filename, 'r', newline='') as f:
    for row in csv.reader(f, delimiter=';'):
        if row[1].upper() in state_names: # searchresults.csv has multiple columns; column 2 contains the search query terms
            countTrue += 1
        else:
            countFalse += 1
print('True searches = ', countTrue)
print('False searches = ', countFalse)

True searches =  2662
False searches =  37085
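
Since the client asked for proportions rather than raw counts, the two totals above can be turned into a share of all searches. A minimal sketch, with the counts copied from the output above (the variable names are only illustrative):

```python
count_true = 2662    # searches matching a state name, from the output above
count_false = 37085  # all other searches, from the output above

total = count_true + count_false
share = count_true / total  # proportion of searches that matched

print(f"{share:.1%} of {total} searches matched")
```

That works out to roughly 6.7% of the 39747 logged searches.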
