# What do you store in your Google Drive?

Sometimes crawling web data is troublesome, for example when you can't simply collect data from web pages because the website requires authentication. Today's tutorial is about a special type of dataset, namely Google Drive data. You will need to get access to the system using the OAuth protocol, then download and parse files of different types.

Plan. 
1. Download [this little archive](https://drive.google.com/open?id=1Xji4A_dEAm_ycnO0Eq6vxj7ThcqZyJZR), **unzip** it and place the folder anywhere inside your Google Drive. You should get a subtree of 6 folders with files of different types: presentations, pdf-files, texts, and even code.
2. Go to the [Google Drive API](https://developers.google.com/drive/api/v3/quickstart/python) documentation, read the [intro](https://developers.google.com/drive/api/v3/about-sdk) and learn how to [search for files](https://developers.google.com/drive/api/v3/reference/files/list) and [download](https://developers.google.com/drive/api/v3/manage-downloads) them.
3. Learn how to open files such as [pptx](https://python-pptx.readthedocs.io/en/latest/user/quickstart.html), pdf or docx from Python, or use general-purpose libraries like [textract](https://textract.readthedocs.io/en/stable/index.html).
4. Build a search index (preferably an inverted one) based on the documents you get and learn to retrieve file names (e.g. `at least this file.txt`) in response to a query. Validate your code on the following set of queries (there are documents for each of them!):
```
segmentation
algorithm
classifer
printf
predecessor
Шеннон
Huffman
function
constructor
machine learning
dataset
Протасов
Protasov
```

## 2. Access GDrive ##

This is an example of how you can organize your code. It's fine if you change it.

Let's extract the list of all files that are contained (recursively) in the folder of interest. In my case, I called it `air_oauth_folder`.
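
Looking a folder up by name means putting that name inside a quoted literal of the Drive `q` parameter. The sketch below shows one way to build such a query string. The helper `drive_name_query` and its `folders_only` flag are hypothetical names introduced here for illustration, and the escaping of backslashes and single quotes follows the Drive search syntax as I understand it, so check it against the documentation before relying on it.

```python
def drive_name_query(name, folders_only=True):
    """Build a Drive `q` string matching items with exactly this name (hypothetical helper)."""
    # Escape backslashes first, then single quotes, so that the name
    # cannot terminate the quoted literal early.
    escaped = name.replace("\\", "\\\\").replace("'", "\\'")
    query = "name = '{}'".format(escaped)
    if folders_only:
        # Restrict the match to folders, so a file with the same name is not picked up.
        query += " and mimeType = 'application/vnd.google-apps.folder'"
    return query


print(drive_name_query("air_oauth_folder"))
print(drive_name_query("Bob's notes", folders_only=False))
```

The resulting string can be passed as the `q` argument of `service.files().list(...)`.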

In [1]:
'''
Note: this code is taken from the Google Drive API quickstart example
https://developers.google.com/drive/api/v3/quickstart/python
'''
from __future__ import print_function
import pickle, io, re
import os.path
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from googleapiclient.http import MediaIoBaseDownload

creds = None
SCOPES = ['https://www.googleapis.com/auth/drive']
# The file token.pickle stores the user's access and refresh tokens, and is
# created automatically when the authorization flow completes for the first
# time.
if os.path.exists('token.pickle'):
    with open('token.pickle', 'rb') as token:
        creds = pickle.load(token)
# If there are no (valid) credentials available, let the user log in.
if not creds or not creds.valid:
    if creds and creds.expired and creds.refresh_token:
        creds.refresh(Request())
    else:
        flow = InstalledAppFlow.from_client_secrets_file(
            'credentials.json', SCOPES)
        creds = flow.run_local_server(port=0)
    # Save the credentials for the next run
    with open('token.pickle', 'wb') as token:
        pickle.dump(creds, token)

service = build('drive', 'v3', credentials=creds)


In [2]:
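def extract_folder_content(service, folder):
    """Return all direct children (files and folders) of the given Drive folder."""
    res = []
    query_str = "('{}' in parents)".format(folder["id"])
    page_token = None
    while True:  # follow nextPageToken so that large folders are listed completely
        response = service.files().list(q=query_str, spaces='drive', pageToken=page_token,
                                        fields="nextPageToken, files(id, name, mimeType)").execute()
        res.extend(response.get('files', []))
        page_token = response.get('nextPageToken')
        if page_token is None:
            return res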

In [3]:
def gdrive_get_all_files_in_folder(service,folder_name):
    query_str = "(name = '{}')".format(folder_name)
    
    results = service.files().list(
        q = query_str,spaces='drive',
        fields="nextPageToken, files(id, name)").execute()
    
    folder = results.get('files',[])

    if not folder:
        print('{} : does not exist in Drive'.format(folder_name))
        return []
    else:
        result_files = []
        folders = [folder[0]]  # stack of folders that still have to be visited

        while folders:  # go through all the sub-folders, if any, and collect every document in them
            child_folder = folders.pop()
            for item in extract_folder_content(service, child_folder):
                if item["mimeType"] == 'application/vnd.google-apps.folder':
                    folders.append(item)
                else:
                    result_files.append(item)

        return result_files

def gdrive_download_file(service, file, path_to_save):
    """Download a Drive file (a dict with 'id' and 'name') into the folder path_to_save."""
    request = service.files().get_media(fileId=file.get("id"))
    fh = io.BytesIO()
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while not done:  # the file arrives in chunks
        status, done = downloader.next_chunk()

    filename = os.path.join(path_to_save, file.get('name'))
    with open(filename, 'wb') as f:
        f.write(fh.getvalue())

In [8]:
folder_of_interest = 'data'
files = gdrive_get_all_files_in_folder(service,folder_of_interest)

test_dir = "test_files"
os.makedirs(test_dir, exist_ok=True)  # unlike !mkdir, does not fail if the folder already exists
for item in files:
    gdrive_download_file(service,item, test_dir)

## 2. Tests ##
Please feel free to change function signatures and behaviour.

In [5]:
assert len(files) == 34, 'Number of files is incorrect'
print('n_files:', len(files))

print("file here means id, name and type e.g.: ", files[0])

gdrive_download_file(service,files[0], '.')

import os.path
assert os.path.isfile(os.path.join('.', files[0]["name"])), "File is not downloaded correctly"

n_files: 34
file here means id, name and type e.g.:  {'id': '1hOd7eR7kdvSxm7Jz8G8ZgXUjNAalahM0', 'name': 'dsa.pdf', 'mimeType': 'application/pdf'}


## 3. Read files ##

Write the code here to extract text from the files you just downloaded.
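
Some formats can be read without any extra dependency. A `.docx` file is a zip archive whose main text lives in `word/document.xml`, so the standard library is enough for a rough extraction. The sketch below is only an illustration: `docx_to_text` is a hypothetical helper, it ignores headers, footnotes and other parts of the archive, and the tiny archive built at the end is a stand-in for a real document, not a valid Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace of WordprocessingML elements such as <w:p> and <w:t>
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the paragraphs of a .docx (path or file-like object) joined by newlines."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for par in root.iter(W_NS + "p"):
        # A paragraph is split into runs; the visible text sits in <w:t> nodes.
        runs = [node.text or "" for node in par.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


# Build a minimal in-memory archive to try the function without real files.
_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p><w:r><w:t>Huffman </w:t></w:r><w:r><w:t>coding</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Protasov</w:t></w:r></w:p></w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", _xml)

print(docx_to_text(buffer))
```

For real documents a dedicated parser is usually more robust. This only shows what such a parser has to do under the hood.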

In [9]:
# !pip3 install tika
from tika import parser

def get_file_strings(path):
    """Return the text of a file as space-separated Russian/English words, or None if it cannot be parsed."""
    # Skip audio, video and image files: there is no text to extract from them
    skipped_extensions = ('.mp3', '.mpg', '.mp2', '.mpeg', '.mp4', '.m4p', '.m4v', '.avi',
                          '.wmv', '.png', '.jpeg', '.gif', '.esp', '.svg')
    if path.lower().endswith(skipped_extensions):
        return None
    try:
        file_data = parser.from_file(path)  # Tika returns a dict; the text is under 'content'
        if file_data["content"] is None:
            return None
        # Keep only runs of Cyrillic and Latin letters, separated by single spaces
        words = re.split("[^а-яА-ЯёЁa-zA-Z]+", str(file_data["content"]))
        return " ".join(" ".join(words).split())
    except Exception:
        print("Skipped : {}".format(path))
        return None


In [10]:
# creating dictionary of parsed files
files_data = dict()
for file in os.scandir(test_dir):    
    strings = get_file_strings(file.path)
    if strings is not None:
        files_data[file.name] = strings

## 3. Tests ##

In [11]:
# Here I changed the last test because of the way I preprocess text in the get_file_strings method
# Note: I think it's better to stick to one language (not mix Russian and English)
assert len(files_data) == 31
print(len(files_data))

assert "Protasov" in get_file_strings(os.path.join(test_dir, 'at least this file.txt')), "TXT File parsed incorrectly"
assert "A Image classification" in get_file_strings(os.path.join(test_dir, 'deep-features-scene (1).pdf')), "PDF File parsed incorrectly"

31


## 4. Index and search ##


Build a search index based on files you just parsed.
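
As a warm-up before the full implementation, here is a minimal sketch of what an inverted index is: a map from each word to the set of documents that contain it, queried with a boolean AND. The names `build_toy_index`, `search_all_terms` and the three toy documents are made up for illustration. Tokenization is a plain whitespace split, with no stemming or stop-word removal.

```python
from collections import defaultdict


def build_toy_index(docs):
    """Map each lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def search_all_terms(index, query):
    """Return names of documents that contain every word of the query (boolean AND)."""
    words = query.lower().split()
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)


toy_docs = {
    "a.txt": "machine learning dataset",
    "b.txt": "huffman coding algorithm",
    "c.txt": "learning algorithm",
}
toy_index = build_toy_index(toy_docs)
print(sorted(search_all_terms(toy_index, "learning algorithm")))  # ['c.txt']
print(sorted(search_all_terms(toy_index, "Machine")))             # ['a.txt']
```

Storing term frequencies next to each document name, instead of a bare set, is what turns this structure into an index suitable for ranked retrieval.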

Create a Preprocessor to prepare document text for querying.

In [12]:
import nltk
from nltk.corpus import stopwords

class Preprocessor:
    
    def __init__(self):
        self.stop_words = set(stopwords.words('russian') + stopwords.words('english'))  # set gives fast membership checks
        self.ps = nltk.stem.PorterStemmer()


    # word tokenize text using nltk lib
    def tokenize(self, text):
        return nltk.word_tokenize(text)


    # stem word using provided stemmer
    def stem(self, word, stemmer):
        return stemmer.stem(word)


    # check if word is appropriate - not a stop word and isalpha, 
    # i.e consists of letters, not punctuation, numbers, dates
    def is_apt_word(self, word):
        return word not in self.stop_words and word.isalpha()


    # combines all previous methods together
    # tokenizes lowercased text and stems it, ignoring not appropriate words
    def preprocess(self, text):
        tokenized = self.tokenize(text.lower())
        return [self.stem(w, self.ps) for w in tokenized if self.is_apt_word(w) and len(w) > 2]

In [13]:
#Build inverted index with the documents from drive
from collections import Counter
def build_inverted_index(files_data):
    """Returns the inverted list, document lengths and a map of document names.
    They are used for Okapi BM25 scoring (boolean retrieval can be used as well); the implementation comes from previous labs."""
    
    prep = Preprocessor()
    doc_lengths = {}
    inverted_list = {}
    doc_names = {}
    
    for doc_id , doc in enumerate(files_data.items()):
        file_content = prep.preprocess(doc[1])
        doc_lengths[doc_id] = len(file_content)
        doc_names[doc_id] = doc[0]
        
        article_index = Counter(file_content)
        
        for term in article_index.keys():
            article_freq = article_index[term]
            if term not in inverted_list:                
                inverted_list[term] = [article_freq, (doc_id, article_freq)]
            else:
                inverted_list[term][0] += article_freq
                inverted_list[term].append((doc_id, article_freq))
                
    return [inverted_list, doc_lengths, doc_names]

In [14]:
inverted_index, doc_lengths, doc_names = build_inverted_index(files_data)

In [15]:
import math
def find(query, index, doc_lengths, k1=1.2, b=0.75):
    scores = {}
    N = len(doc_lengths)
    avgdl = sum(doc_lengths.values()) / float(len(doc_lengths))
    for term in query.keys():
        if term not in index:  # ignoring absent terms
            continue
        n_docs = len(index[term]) - 1
        idf = math.log10((N - n_docs + 0.5) / (n_docs + 0.5))
        postings = index[term][1:]
        for posting in postings:
            doc_id = posting[0]
            doc_tf = posting[1]
            score = idf * doc_tf * (k1 + 1) / (doc_tf + k1 * (1 - b + b * (doc_lengths[doc_id] / avgdl)))
            if doc_id not in scores:
                scores[doc_id] = score
            else:  # accumulate scores
                scores[doc_id] += score
    # Rank the matching documents: the highest accumulated score comes first
    return sorted(scores, key=scores.get, reverse=True)
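
One property of the IDF formula used in `find` is worth keeping in mind: it becomes negative for a term that occurs in more than half of the documents, which lowers the score of documents containing that term. The sketch below compares it with a variant that adds 1 inside the logarithm, a form used by some BM25 implementations. The function names are hypothetical, and the collection size of 31 is simply the number of parsed files from the tests above.

```python
import math


def idf_as_in_find(n_docs_total, n_docs_with_term):
    """IDF in the same form as in `find`; negative for very frequent terms."""
    return math.log10((n_docs_total - n_docs_with_term + 0.5) / (n_docs_with_term + 0.5))


def idf_non_negative(n_docs_total, n_docs_with_term):
    """Variant with 1 added inside the logarithm, so the value stays positive."""
    return math.log10(1 + (n_docs_total - n_docs_with_term + 0.5) / (n_docs_with_term + 0.5))


# Compare a rare term, a term in about half of the documents, and a very common term.
for n in (1, 15, 30):
    print(n, round(idf_as_in_find(31, n), 3), round(idf_non_negative(31, n), 3))
```

With a small collection like this one, a word such as "function" can easily fall into the frequent-term case, so the choice of IDF form can change the ranking.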

## 4. Tests ## 

In [16]:
queries = ["segmentation", "algorithm", "printf", "predecessor", "Huffman",
           "function", "constructor", "machine learning", "dataset", "Protasov"]
prep = Preprocessor()

for query in queries:
    query = Counter(prep.preprocess(query))
    r = find(query, inverted_index,doc_lengths)
    r = list(map(lambda x : doc_names.get(x), r))
    #print("Results for: ", query)
    #print("\t", r)
    assert len(r) > 0, "Query should return at least 1 document"
    assert len(r) > 1, "Query should return at least 2 documents"
    assert "at least this file.txt" in r, "This file has all the queries. It should be in a result"