# What do you store in your Google Drive?

Crawling web data can be troublesome, for example when you can't simply collect data from web pages because the website requires authentication. Today's tutorial is about a special kind of dataset: Google Drive data. You will need to get access to the system using the OAuth protocol, then download and parse files of different types.

Plan. 
1. Download [this little archive](https://drive.google.com/open?id=1Xji4A_dEAm_ycnO0Eq6vxj7ThcqZyJZR), **unzip** it and place the folder anywhere inside your Google Drive. You should get a subtree of 6 folders with files of different types: presentations, pdf-files, texts, and even code.
2. Go to the [Google Drive API](https://developers.google.com/drive/api/v3/quickstart/python) documentation, read the [intro](https://developers.google.com/drive/api/v3/about-sdk) and learn how to [search for files](https://developers.google.com/drive/api/v3/reference/files/list) and [download](https://developers.google.com/drive/api/v3/manage-downloads) them.
3. Learn how to open files such as [pptx](https://python-pptx.readthedocs.io/en/latest/user/quickstart.html), pdf, and docx from Python, or use general-purpose libraries like [textract](https://textract.readthedocs.io/en/stable/index.html).
4. Build a search index (preferably an inverted one) from the documents you get and learn to retrieve file names (e.g. `at least this file.txt`) in response to a query. Validate your code on the following set of queries (there are documents for each of them!):
```
segmentation
algorithm
classifer
printf
predecessor
Шеннон
Huffman
function
constructor
machine learning
dataset
Протасов
Protasov
```

## 2. Access GDrive ##

This is an example of how you can organize your code; it's fine if you change it.

Let's extract the list of all files that are contained (recursively) in the folder of interest. In my case, I called it `air_oauth_folder`.
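
One possible way to walk the folder tree is sketched below. It assumes `service` is the object returned by `build('drive', 'v3', ...)` and that you already know the id of the starting folder; the names `list_files_recursively` and `FOLDER_MIME` are introduced here for illustration only. Unlike the class further down, this sketch tells folders from files by their MIME type and follows `nextPageToken`, so large or empty folders should not confuse it.

```python
# MIME type that the Drive API reports for folders.
FOLDER_MIME = 'application/vnd.google-apps.folder'

def list_files_recursively(service, folder_id):
    """Return {'id', 'name'} dicts for every non-folder item below folder_id."""
    found, to_visit = [], [folder_id]
    while to_visit:
        parent_id = to_visit.pop()
        page_token = None
        while True:
            response = service.files().list(
                q=f"'{parent_id}' in parents and trashed = false",
                fields='nextPageToken, files(id, name, mimeType)',
                pageToken=page_token,
            ).execute()
            for item in response.get('files', []):
                if item['mimeType'] == FOLDER_MIME:
                    to_visit.append(item['id'])  # descend into it later
                else:
                    found.append({'id': item['id'], 'name': item['name']})
            # Keep requesting pages until the API stops returning a token.
            page_token = response.get('nextPageToken')
            if page_token is None:
                break
    return found
```

The result has the same shape (`id` and `name`) as the file dictionaries used in the rest of this notebook.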

In [1]:
!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib

Requirement already up-to-date: google-api-python-client in /usr/local/lib/python3.6/dist-packages (1.7.11)
Requirement already up-to-date: google-auth-httplib2 in /usr/local/lib/python3.6/dist-packages (0.0.3)
Requirement already up-to-date: google-auth-oauthlib in /usr/local/lib/python3.6/dist-packages (0.4.1)


In [0]:
from __future__ import print_function
import pickle
import os
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
import io

class Drive:
    def __init__(self, ):
        """
        init connection to google drive
        """
        SCOPES = ['https://www.googleapis.com/auth/drive.readonly']
        creds = None
        if os.path.exists('token.pickle'):
            with open('token.pickle', 'rb') as token:
                creds = pickle.load(token)
        # If there are no (valid) credentials available, let the user log in.
        if not creds or not creds.valid:
            if creds and creds.expired and creds.refresh_token:
                creds.refresh(Request())
            else:
                flow = InstalledAppFlow.from_client_secrets_file(
                    'credentials.json', SCOPES)
                creds = flow.run_local_server(port=0)
            # Save the credentials for the next run
            with open('token.pickle', 'wb') as token:
                pickle.dump(creds, token)
        self.service = build('drive', 'v3', credentials=creds)

    def gdrive_get_all_files_in_folder(self, folder_name):
        results = self.service.files().list(q=f"name = '{folder_name}'",
                                            fields='nextPageToken, files(id, name)').execute()
        parents, result_files = [], [] # initial set of parents
        for file in results.get('files', []):
            parents.append(file)
        # while there are unvisited parents, keep retrieving their children
        while parents:
            # take the last parent off the list of items we still have to check
            parent = parents.pop()
            results = self.service.files().list(q=f"'{parent['id']}' in parents",
                                              fields='nextPageToken, files(id, name)').execute()
            children = results.get('files', [])
            if children:
                for file in children:
                    parents.append(file)
            else:
                result_files.append(parent)
        return result_files

    def gdrive_download_file(self, file, path_to_save):
        """Download a Drive file (a dict with 'id' and 'name') into path_to_save."""
        os.makedirs(path_to_save, exist_ok=True)
        request = self.service.files().get_media(fileId=file['id'])
        # Stream the content in chunks straight into a binary file on disk.
        with open(os.path.join(path_to_save, file['name']), 'wb') as fh:
            downloader = MediaIoBaseDownload(fh, request)
            done = False
            while not done:
                try:
                    _, done = downloader.next_chunk()
                except Exception as err:
                    # Report the failure instead of silently swallowing it.
                    print(f"Could not download {file['name']}: {err}")
                    break


In [0]:
# folder_of_interest = 'air_oauth_folder'
drive = Drive()
folder_of_interest = 'data'
files = drive.gdrive_get_all_files_in_folder(folder_of_interest)

In [0]:
test_dir = "test_files"
for item in files:
    drive.gdrive_download_file(item, test_dir)

## 2. Tests ##
Please feel free to change function signatures and behaviour.

In [5]:
assert len(files) == 34, 'Number of files is incorrect'
print('n_files:', len(files))

print("file here means id and name, e.g.: ", files[0])

drive.gdrive_download_file(files[0], '.')

import os.path
assert os.path.isfile(os.path.join('.', files[0]['name'])), "File is not downloaded correctly"

n_files: 34
file here means id and name, e.g.:  {'id': '1_Prdscwt_Pu2_Zb5yoJTEP-QZSacQlNy', 'name': 'bloomset.js'}


## 3. Read files ##

Write here the code to extract text from the files you just downloaded.
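
One way to organize the extraction is to dispatch on the file extension. The sketch below uses only the standard library; `read_plain_text`, `PARSERS` and `extract_text` are hypothetical names, and the list of extensions is just a guess at what counts as plain text in the archive.

```python
import os

def read_plain_text(path):
    """Read a text or source-code file and flatten it to a single line."""
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        return ' '.join(f.read().split())

# Hypothetical registry: lowercase extension -> function that parses it.
PARSERS = {ext: read_plain_text for ext in ('.txt', '.c', '.cpp', '.js', '.html')}

def extract_text(path, parsers=PARSERS):
    """Return the text of a file, or None when its extension has no parser."""
    ext = os.path.splitext(path)[1].lower()
    parser_fn = parsers.get(ext)
    if parser_fn is None:
        return None  # e.g. audio or video files
    return parser_fn(path)
```

Parsers for `.pdf`, `.docx` and `.pptx` built on tika, python-docx or python-pptx could be registered in the same dictionary. Compared with trying every parser in turn, this makes it explicit which formats are supported and which files are skipped.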

In [6]:
!pip install textract


In [7]:
!pip install python-docx 
!pip install tika


In [8]:
# for windows please refer to https://textract.readthedocs.io/en/latest/installation.html#don-t-see-your-operating-system-installation-instructions-here
# https://www.xpdfreader.com/download.html
# ALSO BE CAREFUL WITH SPACES IN NAMES. Better save without spaces!!!!!

import textract 
import nltk
from tika import parser
from pptx import Presentation
from docx import Document
nltk.download('punkt')

def extract_docx(path):
    """Return the text of all paragraphs of a .docx file."""
    doc = Document(path)
    return ' '.join(para.text for para in doc.paragraphs)

def extract_pptx(path):
    """Return the text of all runs in all text frames of a .pptx file."""
    prs = Presentation(path)
    text_runs = []
    for slide in prs.slides:
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    text_runs.append(run.text)
    return ' '.join(text_runs)

def extract_pdf(path):
    """Return the text that Apache Tika extracts from a file."""
    raw = parser.from_file(path)
    return raw['content'].replace('\n', ' ').strip()

def get_file_strings(path):
    """Return the text of a file, or None if no parser can read it."""
    # Audio files carry no text, so skip them right away.
    if path.endswith('.mp3'):
        return None
    # 1. Plain text and source code: read as text, replacing newlines with spaces.
    try:
        with open(path, 'r') as f:
            return f.read().replace('\n', ' ')
    except Exception:
        pass  # not a text file (e.g. UnicodeDecodeError); try the binary formats
    # 2. Binary formats: try each parser in turn until one succeeds.
    for extract in (extract_pptx, extract_docx, extract_pdf):
        try:
            return extract(path)
        except Exception:
            continue
    print(path)  # report files that no parser could handle
    return None

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [9]:
# creating dictionary of parsed files
files_data = dict()
for file in os.scandir(test_dir): 
    strings = get_file_strings(file.path)
    # print(strings)
    if strings:
        files_data[file.name] = strings

2020-03-05 13:39:27,267 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.23/tika-server-1.23.jar to /tmp/tika-server.jar.
2020-03-05 13:39:27,953 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.23/tika-server-1.23.jar.md5 to /tmp/tika-server.jar.md5.
2020-03-05 13:39:28,370 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


## 3. Tests ##

In [10]:
assert len(files_data) == 31
print(len(files_data))
assert "Protasov" in get_file_strings(os.path.join(test_dir, 'at least this file.txt')), "TXT File parsed incorrectly"
assert "A. Image classification" in get_file_strings(os.path.join(test_dir, 'deep-features-scene (1).pdf')), "PDF File parsed incorrectly"

31


## 4. Index and search ##

Build a search index based on files you just parsed.
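
As a reminder of the idea, here is a deliberately tiny inverted index: each word maps to the set of documents that contain it, and a query is answered by intersecting those sets. The documents and the function names `build_toy_index` and `search_all_terms` are made up for this illustration; there is no stemming or stop-word removal here.

```python
from collections import defaultdict

def build_toy_index(docs):
    """Map every lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

def search_all_terms(query, index):
    """Return the documents that contain every word of the query."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    'a.txt': 'Huffman coding algorithm',
    'b.txt': 'sorting algorithm',
}
toy_index = build_toy_index(docs)
print(search_all_terms('algorithm', toy_index))          # both documents
print(search_all_terms('huffman algorithm', toy_index))  # {'a.txt'}
```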

In [11]:
import nltk
from collections import Counter
nltk.download('stopwords')

class Preprocessor:
    
    def __init__(self):
        self.stop_words = set(nltk.corpus.stopwords.words('english'))  # set gives fast membership tests
        self.ps = nltk.stem.PorterStemmer()


    # word tokenize text using nltk lib
    def tokenize(self, text):
        return nltk.word_tokenize(text)


    # stem word using provided stemmer
    def stem(self, word, stemmer):
        return stemmer.stem(word)


    # check if word is appropriate - not a stop word and isalpha, 
    # i.e consists of letters, not punctuation, numbers, dates
    def is_apt_word(self, word):
        return word not in self.stop_words and word.isalpha()


    # combines all previous methods together
    # tokenizes lowercased text and stems it, ignoring not appropriate words
    def preprocess(self, text):
        tokenized = self.tokenize(text.lower())
        return [self.stem(w, self.ps) for w in tokenized if self.is_apt_word(w)]

def build_inverted_index(files_data):
    """Map each term to [total_frequency, (file_name, frequency), ...]."""
    index = dict()
    prep = Preprocessor()  # create once and reuse for every document

    def index_doc(doc_content, doc_id):
        term_counts = Counter(prep.preprocess(doc_content))
        for term, freq in term_counts.items():
            if term not in index:
                index[term] = [freq, (doc_id, freq)]
            else:
                index[term][0] += freq
                index[term].append((doc_id, freq))

    # File names serve as document ids, so search results are file names.
    for file_name, content in files_data.items():
        index_doc(content, file_name)
    return index

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [0]:
inverted_index = build_inverted_index(files_data)

In [0]:
class QueryProcessing:
    @staticmethod
    def prepare_query(raw_query):
        prep = Preprocessor()
        # pre-process query the same way as documents
        query = prep.preprocess(raw_query)
        # count frequency
        return Counter(query)
    
    @staticmethod
    def boolean_retrieval(query, index):
        postings = []
        for term in query.keys():
            if term not in index:  # ignoring absent terms
                continue
            # skip the total frequency at position 0 and keep document names only
            postings.append({doc_id for doc_id, _ in index[term][1:]})
        if not postings:  # none of the query terms is in the index
            return set()
        return set.intersection(*postings)

def find(query, index):
    #TODO implement search procedure
    # preprocess query
    query = QueryProcessing.prepare_query(query)
    return QueryProcessing.boolean_retrieval(query, index)
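
Boolean retrieval returns an unordered set, although the index also stores per-document term frequencies. As an optional extension, the sketch below orders documents by the summed frequency of the query terms they contain. It assumes the index layout `[total_frequency, (doc, freq), ...]` and query terms that are already preprocessed; `rank_by_term_frequency` is a hypothetical helper and the numbers in `toy_index` are invented.

```python
from collections import Counter

def rank_by_term_frequency(query_terms, index):
    """Order documents by the summed frequency of the query terms they contain."""
    scores = Counter()
    for term in query_terms:
        # index[term] is [total_frequency, (doc, freq), (doc, freq), ...]
        for doc, freq in index.get(term, [0])[1:]:
            scores[doc] += freq
    return [doc for doc, _ in scores.most_common()]

toy_index = {
    'algorithm': [5, ('dsa.pdf', 4), ('nn.cpp', 1)],
    'huffman': [2, ('dsa.pdf', 2)],
}
print(rank_by_term_frequency(['huffman', 'algorithm'], toy_index))
# ['dsa.pdf', 'nn.cpp']
```

Note that this scores every document matching at least one term, so it behaves like an OR query rather than the AND query above.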

## 4. Tests ## 

In [14]:
queries = ["segmentation", "algorithm", "printf", "predecessor", "Huffman",
           "function", "constructor", "machine learning", "dataset", "Protasov"]

for query in queries:
    r = find(query, inverted_index)
    print("Results for: ", query)
    print("\t", r)
    assert len(r) > 0, "Query should return at least 1 document"
    assert len(r) > 1, "Query should return at least 2 documents"
    assert "at least this file.txt" in r, "This file has all the queries. It should be in a result"

Results for:  segmentation
	 {'deep-features-scene (1).pdf', 'at least this file.txt'}
Results for:  algorithm
	 {'Tutorial 9.pdf', 'grant.txt', 'dsa.pdf', 'at least this file.txt', 'nn.cpp', 'grant-translate.txt', '[DM]-Course Description.docx', 'deep-features-scene (1).pdf', 'DSA_15 Lion in the desert.pptx', 'DSA_09 - 2-3-4 and B-Trees.pdf', 'Tutorial #8.pdf', 'cs.pdf', 'retake-2016-08-18.docx'}
Results for:  printf
	 {'lockexamples.c', 'rdtsc-gcc.c', 'cyclomat.c', 'at least this file.txt'}
Results for:  predecessor
	 {'skiplist.js', 'Tutorial 9.pdf', 'DSA_09 - 2-3-4 and B-Trees.pdf', 'at least this file.txt'}
Results for:  Huffman
	 {'DSA_15 Lion in the desert.pptx', 'dsa.pdf', 'at least this file.txt'}
Results for:  function
	 {'sort.js', 'dsa.pdf', 'at least this file.txt', 'bloomset.js', 'grant-translate.txt', '[DM]-Course Description.docx', 'neuro.html', 'deep-features-scene (1).pdf', 'DSA_15 Lion in the desert.pptx', 'FuncnNEW.pdf', 'Tutorial #8.pdf', 'Assessment Criteria (May)