# What do you store in your Google Drive?

Crawling web data can be troublesome, for example when pages cannot be collected directly because the website requires authentication. Today's tutorial is about a special type of dataset: Google Drive data. You will need to gain access to the system using the OAuth protocol, then download and parse files of different types.

Plan. 
1. Download [this little archive](https://drive.google.com/open?id=1Xji4A_dEAm_ycnO0Eq6vxj7ThcqZyJZR), **unzip** it and place the folder anywhere inside your Google Drive. You should get a subtree of 6 folders with files of different types: presentations, pdf-files, texts, and even code.
2. Go to the [Google Drive API](https://developers.google.com/drive/api/v3/quickstart/python) documentation, read the [intro](https://developers.google.com/drive/api/v3/about-sdk) and learn how to [search for files](https://developers.google.com/drive/api/v3/reference/files/list) and [download](https://developers.google.com/drive/api/v3/manage-downloads) them.
3. Learn how to open files such as [pptx](https://python-pptx.readthedocs.io/en/latest/user/quickstart.html), pdf, and docx from Python, or use general-purpose libraries like [textract](https://textract.readthedocs.io/en/stable/index.html).
4. Build a search index (preferably an inverted one) from the documents you get and learn to retrieve file names (e.g. `at least this file.txt`) in response to a query. Validate your code on the following set of queries (there are documents for each of them!):
```
segmentation
algorithm
classifer
printf
predecessor
Шеннон
Huffman
function
constructor
machine learning
dataset
Протасов
Protasov
```

## 2. Access GDrive ##

This is an example of how you can organize your code; it's fine if you change it.

Let's extract the list of all files that are contained (recursively) in the folder of interest. In my case, I called it `air_oauth_folder`.

In [None]:
!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib
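
Before any file can be listed, the notebook needs an authorized Drive client. The sketch below follows the flow from the Python quickstart linked in the plan. The file name `credentials.json`, the read-only scope, and the helper name `build_drive_service` are assumptions made for illustration, not part of the assignment.

```python
# Read-only access is enough for listing and downloading (assumed scope).
SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]


def build_drive_service(client_secrets_path="credentials.json"):
    """Run the OAuth consent flow and return a Drive v3 client.

    `client_secrets_path` is a hypothetical location of the OAuth client
    file downloaded from the Google Cloud console.
    """
    # Imported here so that defining the helper does not require the
    # packages from the pip cell to be installed yet.
    from google_auth_oauthlib.flow import InstalledAppFlow
    from googleapiclient.discovery import build

    flow = InstalledAppFlow.from_client_secrets_file(client_secrets_path, SCOPES)
    # Opens a browser window for consent and waits on a free local port.
    credentials = flow.run_local_server(port=0)
    return build("drive", "v3", credentials=credentials)
```

`run_local_server` expects a browser on the same machine, so a remotely hosted notebook may need a different way of obtaining credentials.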

In [2]:
def gdrive_get_all_files_in_folder(folder_name):
    #TODO retrieve all files from a given folder   
    return []

def gdrive_download_file(file, path_to_save): 
    #TODO download file and save it under the path
    pass
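
One possible way to fill in the listing stub is sketched below. It assumes a `service` object created with `googleapiclient.discovery.build('drive', 'v3', ...)`; the helper names `find_folder_id` and `list_files_recursively` are hypothetical. The folder is first resolved by name, then walked with `files().list`, following `nextPageToken` until each folder is exhausted.

```python
FOLDER_MIME = "application/vnd.google-apps.folder"


def find_folder_id(service, folder_name):
    """Return the id of the first folder with this exact name, or None."""
    # Note: a name containing an apostrophe would need escaping in the query.
    response = service.files().list(
        q=f"name = '{folder_name}' and mimeType = '{FOLDER_MIME}' and trashed = false",
        fields="files(id, name)",
    ).execute()
    matches = response.get("files", [])
    return matches[0]["id"] if matches else None


def list_files_recursively(service, folder_id):
    """Return (id, name) pairs for every non-folder file under folder_id."""
    found = []
    stack = [folder_id]  # folders that still have to be visited
    while stack:
        parent = stack.pop()
        page_token = None
        while True:
            response = service.files().list(
                q=f"'{parent}' in parents and trashed = false",
                fields="nextPageToken, files(id, name, mimeType)",
                pageToken=page_token,
            ).execute()
            for item in response.get("files", []):
                if item["mimeType"] == FOLDER_MIME:
                    stack.append(item["id"])  # descend into it later
                else:
                    found.append((item["id"], item["name"]))
            page_token = response.get("nextPageToken")
            if page_token is None:
                break
    return found
```

Returning `(id, name)` tuples matches the shape that the tests in this notebook print for `files[0]`.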

In [3]:
folder_of_interest = 'air_oauth_folder'
files = gdrive_get_all_files_in_folder(folder_of_interest)
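
For the download stub, a minimal sketch along the lines of the manage-downloads page linked in the plan might look as follows. It assumes each `file` is an `(id, name)` tuple and that an authorized `service` object is passed in explicitly; the name `download_drive_file` is hypothetical.

```python
import io
import os


def download_drive_file(service, file, path_to_save):
    """Download one (id, name) file into path_to_save and return its path."""
    # Imported here so the helper can be defined before the client
    # library is installed.
    from googleapiclient.http import MediaIoBaseDownload

    file_id, name = file
    os.makedirs(path_to_save, exist_ok=True)
    target = os.path.join(path_to_save, name)

    request = service.files().get_media(fileId=file_id)
    with io.FileIO(target, "wb") as handle:
        downloader = MediaIoBaseDownload(handle, request)
        done = False
        while not done:
            # Each call fetches the next chunk and reports whether it was the last.
            _status, done = downloader.next_chunk()
    print(f"Download {name}, 100%.")
    return target
```

`get_media` works for ordinary uploaded files such as the ones in the archive. Documents stored in Google's own editor formats would need `export_media` with a target MIME type instead.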

In [5]:
test_dir = "test_files"
for item in files:
    gdrive_download_file(item, test_dir)

Download MS2 - Problems of multithread programming.pptx, 100%.
Download DSA_15 Lion in the desert.pptx, 100%.
Download DSA_09 - 2-3-4 and B-Trees.pdf, 100%.
Download ai-junior.pdf, 100%.
Download dsa.pdf, 100%.
Download origin-06.mp3, 100%.
Download L5-problems-2015.pdf, 100%.
Download cs.pdf, 100%.
Download 3cases.pdf, 100%.
Download [DM]-Course Description.docx, 100%.
Download Tutorial 9.pdf, 100%.
Download FuncnNEW.pdf, 100%.
Download Tutorial #8.pdf, 100%.
Download students.txt, 100%.
Download origin-05.mp3, 100%.
Download rdtsc-vc.cpp, 100%.
Download sort.js, 100%.
Download skiplist.js, 100%.
Download AY16-17 Academic Calendar .pdf, 100%.
Download Assessment Criteria (May).pdf, 100%.
Download rdtsc-gcc.c, 100%.
Download retake-2016-08-18.docx, 100%.
Download Program.cs, 100%.
Download hockey.avi, 100%.
Download nn.cpp, 100%.
Download lockexamples.c, 100%.
Download cyclomat.c, 100%.
Download neuro.html, 100%.
Download grant.txt, 100%.
Download grant-translate.txt, 100%.
Download bl

## 2. Tests ##
Please feel free to change function signatures and behaviour.

In [6]:
assert len(files) == 34, 'Number of files is incorrect'
print('n_files:', len(files))

print("file here means id and name, e.g.: ", files[0])

gdrive_download_file(files[0], '.')

import os.path
assert os.path.isfile(os.path.join('.', files[0][1])), "File is not downloaded correctly"

n_files: 34
file here means id and name, e.g.:  ('15f8dFmQ0zzS7PIJM8JU2WDGGWnteZtAA', 'MS2 - Problems of multithread programming.pptx')
Download MS2 - Problems of multithread programming.pptx, 100%.


## 3. Read files ##

Write the code to extract text from the files you just downloaded here.

In [None]:
!pip install textract

In [8]:
# for windows please refer to https://textract.readthedocs.io/en/latest/installation.html#don-t-see-your-operating-system-installation-instructions-here
# https://www.xpdfreader.com/download.html
# ALSO BE CAREFUL WITH SPACES IN NAMES. Better save without spaces!!!!!

import textract 

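def get_file_strings(path):
    #TODO change this function to handle different data types properly - textract is not able to parse everything
    # Take care of non-text data too
    # textract.process returns bytes; decode them rather than calling str(),
    # which would keep the b'...' wrapper and escape all non-ASCII text
    text = textract.process(path).decode('utf-8')
    texts = text.replace('\r', '').split('\n')
    return texts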
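
Since textract cannot parse everything, one option is to read plain-text and source-code files directly and hand only the remaining formats to a parser. The sketch below takes that parser as a `fallback` argument so it can be tried without textract installed. The extension set is taken from the download log above, while the `cp1251` fallback encoding for Russian text and the name `extract_lines` are assumptions.

```python
import os

# Extensions from the download log that are already plain text.
PLAIN_TEXT_EXTENSIONS = {".txt", ".c", ".cpp", ".cs", ".js", ".html"}


def extract_lines(path, fallback=None, encodings=("utf-8", "cp1251")):
    """Return the file's text as a list of lines, or [] if it cannot be parsed."""
    extension = os.path.splitext(path)[1].lower()
    if extension in PLAIN_TEXT_EXTENSIONS:
        with open(path, "rb") as handle:
            raw = handle.read()
        for encoding in encodings:
            try:
                return raw.decode(encoding).splitlines()
            except UnicodeDecodeError:
                continue  # try the next candidate encoding
        # Last resort: keep what is readable and mark the rest.
        return raw.decode("utf-8", errors="replace").splitlines()
    if fallback is None:
        return []
    try:
        return fallback(path)
    except Exception as error:  # deliberately broad: parsers fail in many ways
        print(f"Could not parse {path}: {error}")
        return []
```

A call such as `extract_lines(file.path, fallback=get_file_strings)` would then return an empty list for audio and video files, which the `if strings:` check in the next cell skips.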

In [9]:
# creating dictionary of parsed files
files_data = dict()
for file in os.scandir(test_dir):    
    strings = get_file_strings(file.path)
    if strings:
        files_data[file.name] = strings

File format .avi is not supported
File format .mp3 is not supported
File format .mp3 is not supported


## 3. Tests ##

In [10]:
assert len(files_data) == 31 
print(len(files_data))

assert "Protasov" in get_file_strings(os.path.join(test_dir, 'at least this file.txt')), "TXT File parsed incorrectly"
assert "A. Image classification" in get_file_strings(os.path.join(test_dir, 'deep-features-scene (1).pdf')), "PDF File parsed incorrectly"

31


## 4. Index and search ##

Build a search index based on files you just parsed.

In [11]:
def build_inverted_index(files_data):
    #TODO build search index from files    
    return {}
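
As a starting point, an inverted index can be as simple as a dictionary that maps each lowercased term to the set of file names containing it. The sketch below assumes `files_data` maps a file name to a list of text lines, as in the cell that builds it; the names `tokenize` and `build_index_sketch` are hypothetical, and no stemming or stop-word removal is attempted.

```python
import re
from collections import defaultdict

# \w matches Unicode letters and digits, so Cyrillic terms are kept too.
TOKEN_RE = re.compile(r"\w+")


def tokenize(line):
    """Split a line into lowercased word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(line)]


def build_index_sketch(files_data):
    """Map every term to the set of file names in which it occurs."""
    index = defaultdict(set)
    for file_name, lines in files_data.items():
        for line in lines:
            for term in tokenize(line):
                index[term].add(file_name)
    return dict(index)
```

Storing sets of file names keeps lookups cheap and makes multi-term queries a matter of set intersection.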

In [12]:
inverted_index = build_inverted_index(files_data)

In [13]:
def find(query, index):
    #TODO implement search procedure
    return []
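
A simple search procedure over such an index is a boolean AND: a document matches when it contains every term of the query. The sketch below assumes the index maps lowercased terms to collections of file names; `find_sketch` is a hypothetical name, and the query is normalized only by lowercasing and whitespace splitting, which should mirror whatever normalization the index itself used.

```python
def find_sketch(query, index):
    """Return the sorted file names that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    # Start from the shortest posting list to keep the intersection small.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result.intersection_update(posting)
    return sorted(result)
```

With this choice, a query like `machine learning` returns files that mention both words anywhere, not necessarily as a phrase.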

## 4. Tests ## 

In [14]:
queries = ["segmentation", "algorithm", "printf", "predecessor", "Huffman",
           "function", "constructor", "machine learning", "dataset", "Protasov"]

for query in queries:
    r = find(query, inverted_index)
#     print("Results for: ", query)
#     print("\t", r)
    assert len(r) > 0, "Query should return at least 1 document"
    assert len(r) > 1, "Query should return at least 2 documents"
    assert "at least this file.txt" in r, "This file has all the queries. It should be in a result"