Upload HTML directly to S3 bucket, do not dump in database #273

btbonval · 2014-01-13T22:47:29Z

The S3 static buckets for beta and prod now have an 'html' directory in the base.

When we have HTML, we want to write it to S3 as a file (possibly named by hash) instead of storing it in the database. The database would then hold only the relative static path, in something like Note.static_relpath.

The Note detail template would then use an IFRAME pointing at {{ STATIC_URL }}html/{{Note.static_relpath}}.

btbonval · 2014-01-13T22:50:42Z

We will also need to post-process the HTML already in Production and on the VM and push that out to S3.

This will be a one time deal rather than a recurring thing, so a quick script that doesn't need to hang around should suffice.

btbonval · 2014-01-13T23:58:10Z

Did a quick search to see how out of fashion IFRAMEs are. Found this question about IFRAME and SEO. Being that SEO is a pretty recent topic, there is a good comment in here: http://productforums.google.com/forum/#!topic/webmasters/Y6DyIR7wLXg

Make sure there is an anchor link to IFRAME content on the page with the IFRAME. That sounds like good practice anyway, in case someone turns off IFRAMEs because they're so 1995.

btbonval · 2014-01-14T05:33:44Z

sanitize_html parses HTML in-place on the model, i.e. it loads self.html and saves self.html. We probably want to change this into a filter.
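A minimal sketch of the in-place versus filter distinction. Everything here is a hypothetical stand-in: `ToyNote` is not the real model, and the script-stripping regex is only a placeholder rule, since the actual sanitizing rules are not shown in this thread.

```python
import re

# Placeholder rule for illustration only: drop <script> elements.
_SCRIPT_RE = re.compile(r'<script\b.*?</script\s*>', re.IGNORECASE | re.DOTALL)


def filter_html(html):
    """Filter style: take HTML in, return sanitized HTML, touch no model state."""
    return _SCRIPT_RE.sub('', html)


class ToyNote(object):
    """Toy model used only to contrast the two styles."""

    def __init__(self, html):
        self.html = html

    def sanitize_html(self):
        # In-place style: reads self.html and overwrites self.html.
        self.html = filter_html(self.html)
```

The filter form can be applied to HTML that never touches the database, which is what uploading straight to S3 needs.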

btbonval · 2014-01-14T05:39:45Z

We probably won't need to batch process HTML across notes, and if we do, the current function will need to be rewritten anyway. Should remove this:
https://github.com/FinalsClub/karmaworld/blob/b7ebe2b1d390232a16618977fb3b19cfa790f7b9/karmaworld/apps/notes/management/commands/process_note_html.py

beautifulsoup is part of the requirements. lxml does one thing, which is in sanitize_html. I have to rewrite sanitize_html to be a filter anyway, so if I replace lxml, the world will be a better, brighter place.

btbonval · 2014-01-14T06:24:03Z

No need to store a URL for the HTML snippet. Note.slug is supposed to be unique. I'm adding unique, not-null to Document.slug which will inherit to Note.slug. The static S3 filename will be based on the Note slug.
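A rough sketch of deriving the S3 path from the slug. The function names and the example STATIC_URL are assumptions; the `html/<slug>.html` layout is inferred from the paths that show up later in this thread.

```python
HTML_PREFIX = 'html/'  # the 'html' directory at the base of the static bucket


def relative_s3_path(slug):
    """Build the bucket-relative path for a note's HTML from its unique slug."""
    return '{0}{1}.html'.format(HTML_PREFIX, slug)


def static_html_url(static_url, slug):
    """Mirror the template's {{ STATIC_URL }}html/... concatenation.

    Assumes static_url ends with a slash, as Django's STATIC_URL normally does.
    """
    return static_url + relative_s3_path(slug)
```

Because the slug is unique and not null, two notes cannot collide on the same key, so no separate path column is needed.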

btbonval · 2014-01-14T06:39:37Z

was trying from django.core.files.storage import default_storage to write files.

unfortunately:

>>> default_storage.bucket_acl
'public-read'

Our static_s3.py configs are for a read-only API interface, which means there won't be uploading? How the heck does collectstatic work if it can't actually write to the S3 bucket using the static_s3.py settings?

I am missing something key.

btbonval · 2014-01-14T06:46:06Z

I was trying to create a file by simply opening it and writing to it, as per http://django-storages.readthedocs.org/en/latest/backends/amazon-S3.html#storage

That gives me an IOError even though open is set to write/create mode:

>>> somefile = default_storage.open('bryantestfile.html', 'w')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/core/files/storage.py", line 33, in open
    return self._open(name, mode)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/storages/backends/s3boto.py", line 177, in _open
    raise IOError('File does not exist: %s' % name)
IOError: File does not exist: bryantestfile.html

Maybe Django sets the default_storage to read-only mode for static hosting reasons, but switches it for collectstatic. Clearly the bucket has everything it needs:

>>> default_storage.bucket.get_acl()
<Policy: Andrew (owner) = FULL_CONTROL>

btbonval · 2014-01-14T06:55:34Z

Nothing in docs about default_storage.acl. Nothing about ACL in django-storages. Only thing about ACL is in s3boto, but we can see that bucket ACL from s3boto is just fine.

guess who has two thumbs and has to read source code. this guy. nn/ \nn

btbonval · 2014-01-14T06:57:59Z

http://tartarus.org/james/diary/2013/07/18/fun-with-django-storage-backends

btbonval · 2014-01-14T07:04:25Z

>>> import storages.backends.s3boto
>>> protected_storage = storages.backends.s3boto.S3BotoStorage(acl='private')
>>> with protected_storage.open('html/bryantest.html', 'w') as s3file:
...     s3file.write(html)
... 
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/core/files/storage.py", line 33, in open
    return self._open(name, mode)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/storages/backends/s3boto.py", line 177, in _open
    raise IOError('File does not exist: %s' % name)
IOError: File does not exist: html/bryantest.html
>>> protected_storage.acl
'private'
>>> protected_storage.bucket_acl
'public-read'
>>> protected_storage.bucket.get_acl()
<Policy: Andrew (owner) = FULL_CONTROL>

DO NOT WANT

btbonval · 2014-01-14T07:07:24Z

http://www.laurii.info/2013/05/improve-s3boto-djangostorages-performance-custom-settings/

btbonval · 2014-01-14T07:22:37Z

According to the above link, the acl comes from here: http://docs.aws.amazon.com/AmazonS3/latest/dev/ACLOverview.html

'public-read' should still give the owner full control, but the allusers group gets read.

It would seem like a bad idea to change the S3 ACL from 'public-read'. Not sure how to access this S3boto stuff as the owner.

btbonval · 2014-01-14T07:38:28Z

Files are called Keys in the raw s3boto bucket, e.g. default_storage.bucket.get_key('img/asc.gif'). new_key() creates a theoretical file on the S3 bucket. The Key.open*() commands don't work, which is a shame because they would be nice for writing directly to the S3 file. Key.send_file() does work. Wrap up the HTML in a little StringIO file-like object and BAM, I just uploaded to S3.

Tested and confirmed. Ugly as junk.

>>> flo = StringIO(html)
>>> nk = default_storage.bucket.new_key('html/bryantest.html')
>>> nk.exists()
False
>>> nk.send_file(flo)
>>> nk.exists()
True
>>> with default_storage.open('html/bryantest.html', 'r') as s3file:
...     print s3file.read()
... 

<html>
<body>
<a href="whaaaat">the</a>
<a href="test" target="_blank">
woop
</a>
<a href="nope" target="werrird">wa</a>
</body>
</html>

btbonval · 2014-01-14T09:05:24Z

Most of the code is written now. I tried to kick off a process to convert HTML in the database to files on S3, but failed:

(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ python manage.py populate_s3
Traceback (most recent call last):
...
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/management/commands/populate_s3.py", line 42, in handle
    htmlflo = StringIO(note.html)
UnicodeEncodeError: 'ascii' codec can't encode character u'\ue001' in position 10111407: ordinal not in range(128)

"The StringIO object can accept either Unicode or 8-bit strings, but mixing the two may take some care. If both are used, 8-bit strings that cannot be interpreted as 7-bit ASCII (that use the 8th bit) will cause a UnicodeError to be raised when getvalue() is called."
http://docs.python.org/2/library/stringio.html

might as well pass the HTML into BeautifulSoup to see if it can read in the data and output it in consistent UTF-8.

btbonval · 2014-01-14T09:17:26Z

liar liar pants on fire. It turns out BeautifulSoup does not output UTF-8 by default even though all the docs say it does. Gotta run soup.prettify("utf-8") and suddenly StringIO is pleased.
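For comparison, a sketch of the same idea without BeautifulSoup: encode the text to UTF-8 bytes explicitly before wrapping it in a file-like object. This is a swapped-in technique, not what the branch does, and `html_to_file_like` is a hypothetical helper name. It uses io.BytesIO rather than StringIO so the bytes/text split is explicit.

```python
import io


def html_to_file_like(html):
    """Return a seekable file-like object holding the HTML as UTF-8 bytes."""
    if isinstance(html, bytes):
        data = html  # already encoded; assumed (not checked) to be UTF-8
    else:
        data = html.encode('utf-8')  # text in, bytes out, no implicit ASCII step
    return io.BytesIO(data)
```

The point is that the encode step happens once, on purpose, instead of implicitly with the ASCII codec somewhere inside the file-like object.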

btbonval · 2014-01-14T09:23:08Z

oh good. random disconnection errors or something. More or less exactly what I want to deal with right now.

(venv)vagrant@vagrant-ubuntu-precise-32:~/karmaworld$ python manage.py populate_s3
Processing html/mit6_007s11_lec07pdf.html
Traceback (most recent call last):
...
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/management/commands/populate_s3.py", line 48, in handle
    newkey.send_file(htmlflo)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/boto/connection.py", line 910, in make_request
    return self._mexe(http_request, sender, override_num_retries)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/boto/connection.py", line 872, in _mexe
    raise e
socket.error: [Errno 32] Broken pipe

btbonval · 2014-01-14T09:24:52Z

well I guess I won't be running this overnight to process.

Can't test if anything worked until I get one Note onto S3 to see if my VM hosts it properly. Can't get one Note onto S3 because broken pipe.

Pushing WIP to origin as feature_html_on_s3 with commit HEAD 87bf8e2

btbonval · 2014-01-16T01:54:51Z

rebased master into branch and ran tests.

... still running.

still running?

btbonval · 2014-01-16T01:55:56Z

top says the CPU is mostly running SSHD and top. tests deadlocked?

btbonval · 2014-01-16T01:57:49Z

Looks like the manage.py tests are stuck running Xvfb, which is in turn not running anything (although it should run firefox). Time to double check master still works.

vagrant@vagrant-ubuntu-precise-32:~$ ps ax | grep python
 3219 pts/1    S+     0:02 python manage.py test
 3286 pts/0    S+     0:00 grep --color=auto python
vagrant@vagrant-ubuntu-precise-32:~$ pstree -p | grep -C 3 3219
        |-rsyslogd(828)-+-{rsyslogd}(837)
        |               |-{rsyslogd}(838)
        |               `-{rsyslogd}(839)
        |-sshd(799)-+-sshd(1158)---sshd(1244)---bash(1245)---python(3219)---Xvfb(3242)
        |           `-sshd(2078)---sshd(2164)---bash(2165)-+-grep(3289)
        |                                                  `-pstree(3288)
        |-udevd(323)-+-udevd(399)

btbonval · 2014-01-16T02:02:49Z

Tests completed on master branch in ~4 minutes.

Something tripped up feature_html_on_s3 branch so that tests deadlock :( No backtraces to help.

btbonval · 2014-01-16T02:10:37Z

python manage.py test -v 2 seems to be giving better output. Looks to be hung up on Evernote.

Test searching for a school by partial name ... ok
Test upload of an Evernote note ...

Same pstree as before with the dangling Xvfb. Definitely stuck here.

Code:

karmaworld/karmaworld/apps/document_upload/tests.py, lines 47 to 53 in fe3879e:

    def testEvernoteConversion(self):
        """Test upload of an Evernote note"""
        self.doConversionForPost({'fp_file': 'https://www.filepicker.io/api/file/vOtEo0FrSbu2WDbAOzLn',
                                  'course': str(self.course.id),
                                  'name': 'KarmaNotes test 3',
                                  'tags': '',
                                  'mimetype': 'text/enml'})

calls

karmaworld/karmaworld/apps/document_upload/tests.py, lines 30 to 37 in fe3879e:

    def doConversionForPost(self, post, user=None, session_key=None):
        self.assertEqual(Note.objects.count(), 0)
        r_d_f = RawDocumentForm(post)
        self.assertTrue(r_d_f.is_valid())
        raw_document = r_d_f.save(commit=False)
        raw_document.fp_file = post['fp_file']
        convert_raw_document(raw_document, user=user, session_key=session_key)
        self.assertEqual(Note.objects.count(), 1)

Only place I can imagine it hanging is on convert_raw_document?

btbonval · 2014-01-16T02:13:21Z

The feature_html_on_s3 branch has no changes in the raw_document app.

btbonval · 2014-01-16T02:18:20Z

Double ctrl-c got a super long backtrace!

Test upload of an Evernote note ... ^C^CTraceback (most recent call last):
  File "manage.py", line 14, in <module>
    execute_from_command_line(sys.argv)
...
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/core/management/base.py", line 255, in execute
    output = self.handle(*args, **options)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/south/management/commands/test.py", line 8, in handle
    super(Command, self).handle(*args, **kwargs)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/core/management/commands/test.py", line 89, in handle
    failures = test_runner.run_tests(test_labels)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django_nose/runner.py", line 155, in run_tests
    result = self.run_suite(nose_argv)
...
  File "/usr/lib/python2.7/unittest/case.py", line 327, in run
    testMethod()
  File "/home/vagrant/karmaworld/karmaworld/apps/document_upload/tests.py", line 53, in testEvernoteConversion
    'mimetype': 'text/enml'})
  File "/home/vagrant/karmaworld/karmaworld/apps/document_upload/tests.py", line 36, in doConversionForPost
    convert_raw_document(raw_document, user=user, session_key=session_key)
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/gdrive.py", line 244, in convert_raw_document
    newkey.send_file(htmlflo)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/boto/s3/key.py", line 727, in send_file
    query_args=query_args)

Ahh that'd certainly be unique to this branch. Hanging on direct upload to S3. The html folder on the appropriate S3 is empty. Guess I'll play with this feature a little more, it's still leaving cake on the toothpick.

btbonval · 2014-01-16T02:48:29Z

note for later: It seems worth moving this one function for uploading to S3 from gdrive.py into Note.

btbonval · 2014-01-16T03:05:48Z

Testing a PDF that renders to 2.87 MiB of HTML using (mostly) what would be performed right now. Upload seems to do zilch.

In [7]: rds = RawDocument.objects.all()
In [14]: fp_file = rds[1].get_file()
In [19]: html = pdf2html(fp_file.read())
Preprocessing: 88/88
Working: 88/88
In [20]: len(html)
Out[20]: 3012503
In [21]: fhtml = notes[0].filter_html(html)
In [22]: len(fhtml)
Out[22]: 3365756
In [23]: filepath = notes[0].get_relative_s3_path()
In [24]: filepath
Out[24]: 'html/certificate-path-validation-testingpdf.html'
In [28]: fhtmlflo = StringIO(fhtml)
In [29]: newkey = default_storage.bucket.new_key(filepath)
In [30]: newkey.exists()
Out[30]: False
In [33]: fhtmlflo.seek(0)
In [35]: def status_update(transmit, maximum): print "transferred {0} / {1}".format(transmit, maximum)
In [36]: newkey.send_file(fhtmlflo, cb=status_update)
transferred 0 / 0
transferred 0 / 0
transferred 0 / 0
...

btbonval · 2014-01-16T03:40:01Z

Rewrote upload code to use set_contents_from_string. Moved upload code into Note. Replaced copy pasta in gdrive.py and populate_s3.py to make use of the upload code in Note. commit 7b61d07

Running tests again.

btbonval · 2014-01-16T03:45:58Z

a number of tests errored. It looks like the tests hung, but firefox is actively running at the moment. It's been 5 minutes. :/

btbonval · 2014-01-16T03:49:46Z

karmaworld.apps.notes.models: ERROR: Error with IndexDen:
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/indextank/client.py", line 131, in create_index
    raise TooManyIndexes(e.msg)
TooManyIndexes: "Too many indexes for this account"

also made a copy/paste mistake.

btbonval · 2014-01-16T03:57:04Z

A few errors showing up, hanging on the firefox test as before.

This time, however, there are three HTML files on the S3!

The hanging thing bothers me. I'll have to use verbose output to see where that is happening.

btbonval · 2014-01-16T04:00:30Z

Test upload of an Evernote note ... ok
Test upload of a file with a bogus mimetype ... ok

No files in S3 after these.

The later upload tests have files in S3 after they run.

btbonval · 2014-01-16T04:05:41Z

Tests didn't hang using verbose output. How bizarre.

Test that Note.save() doesn't make a slug ... ERROR
Search for a note within IndexDen ... ERROR
Test that the slug field is slugifying unicode Note.names ... ok
ERROR
testCreateCourse (test_selenium.AddCourseTest) ... ok

This test appears moot now that slug is unique and not nullable.

======================================================================
ERROR: Test that Note.save() doesn't make a slug
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/tests.py", line 85, in test_save_no_slug
    self.note.save() # re-save the note
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 54, in execute
    return self.cursor.execute(query, args)
IntegrityError: null value in column "slug" violates not-null constraint

I'm guessing this is due to IndexDen not adding any more indices right now.

======================================================================
ERROR: test suite for <class 'karmaworld.apps.notes.tests.TestNoes'>
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/nose/suite.py", line 227, in run
    self.tearDown()
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/nose/suite.py", line 350, in tearDown
...
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/tests.py", line 58, in tearDownClass
    api.delete_index(secret.INDEX)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/indextank/client.py", line 38, in delete_index
    self.get_index(index_name).delete_index()
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/indextank/client.py", line 152, in delete_index
    _request('DELETE', self.__index_url)
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/indextank/client.py", line 457, in _request
    raise HttpException(response.status, response.body)
HttpException: HTTP 404: ["No index existed for the given name"]

Three failures from error, no true failures.

Time to check it by hand!

btbonval · 2014-01-16T04:19:55Z

Removed obsolete null Note.slug test, down to 2 errors caused by IndexDen. Can't get much further than this for now.

btbonval · 2014-01-16T23:06:53Z

Objects uploaded to S3 do not grant permission to open/download them.

Need to do what is in this comment: #68 (comment)

btbonval · 2014-01-17T00:47:41Z

Figured out the IndexDen problem. Back to using Beta's IndexDen and all the tests ran just fine.

btbonval · 2014-01-17T01:27:37Z

These docs are about as helpful as a bag of wet socks. I guess there are uses for a bag of wet socks, but not many.
http://boto.readthedocs.org/en/latest/ref/s3.html

Here's what an Everyone Open/Download policy looks like in s3boto:

In [35]: policy.acl.grants[4].permission
Out[35]: u'READ'
In [36]: policy.acl.grants[4].display_name
In [37]: policy.acl.grants[4].type
Out[37]: u'Group'
In [38]: policy.acl.grants[4].uri
Out[38]: u'http://acs.amazonaws.com/groups/global/AllUsers'
In [39]: policy.acl.grants[4].id
In [42]: policy.acl.grants[4].__class__
Out[42]: boto.s3.acl.Grant

So to make that, it'd be something like

from boto.s3.acl import Grant
# once key exists
policy = newkey.get_acl()
policy.acl.add_grant(Grant(permission=u'READ', type=u'GROUP', uri=u'http://acs.amazonaws.com/groups/global/AllUsers'))

btbonval · 2014-01-17T02:00:32Z

Permission attempt failed. No errors, but the permissions according to S3 do not include Everyone.

Time for guess and check.

btbonval · 2014-01-17T02:08:22Z

I think the first problem is that changing the policy as noted above does not save that policy remotely. Probably need to call one of the newkey.set_*acl() commands.

In [12]: newkey.set_acl(policy)
S3ResponseError: S3ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>MalformedACLError</Code><Message>The XML you provided was not well-formed or did not validate against our published schema</Message><RequestId>3E57DBBC88D03C8E</RequestId><HostId>W1O4/vy8nDyXEhcgawGHyJrCFmGsaYpqwPcE5CwaLVWVXhuSfB/Suhq/6w0YFMSu</HostId></Error>

Here's a problem. Converting the permission into XML ignores the AllUsers URI.

In [23]: all_read.uri
Out[23]: u'http://acs.amazonaws.com/groups/global/AllUsers'
In [24]: all_read.to_xml()
Out[24]: u'<Grant><Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="GROUP"><EmailAddress>None</EmailAddress></Grantee><Permission>READ</Permission></Grant>'

btbonval · 2014-01-17T02:14:57Z

type is "GROUP". Looking at Boto source code it is case sensitive 'Group'.
https://github.com/boto/boto/blob/develop/boto/s3/acl.py#L155-L156

I'm tempted to write a ticket over there, but it's probably one of those things where the standard for the XML or whatever is case sensitive, therefore the Python must be as well.
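To make the case-sensitivity issue concrete, here is a simplified stand-in for the serialization. It is not boto's code, only a sketch modelled on the output pasted above, and `grant_to_xml` is a hypothetical name. Any type that is not exactly 'Group' falls through to the email branch, so the URI is dropped.

```python
ALL_USERS_URI = 'http://acs.amazonaws.com/groups/global/AllUsers'


def grant_to_xml(grant_type, permission, uri=None, email_address=None):
    """Serialize a grant, branching on an exact, case-sensitive type match."""
    xml = '<Grant>'
    xml += ('<Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
            'xsi:type="%s">' % grant_type)
    if grant_type == 'Group':
        xml += '<URI>%s</URI>' % uri
    else:
        # 'GROUP' lands here, so the URI is silently ignored.
        xml += '<EmailAddress>%s</EmailAddress>' % email_address
    xml += '</Grantee>'
    xml += '<Permission>%s</Permission>' % permission
    xml += '</Grant>'
    return xml
```

Calling it with 'GROUP' reproduces the EmailAddress element seen in the earlier output, while 'Group' emits the URI element.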

btbonval · 2014-01-17T02:28:57Z

Here's what the grant XML should look like when it's correct vs what is being generated (identical):

In [48]: oldkey.get_xml_acl()
Out[48]: '...<Grant><Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="Group"><URI>http://acs.amazonaws.com/groups/global/AllUsers</URI></Grantee><Permission>READ</Permission></Grant>...'
In [50]: all_read.to_xml()
Out[50]: u'<Grant><Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="Group"><URI>http://acs.amazonaws.com/groups/global/AllUsers</URI></Grantee><Permission>READ</Permission></Grant>'

So the problem appears to be with boto's ability to generate either the ACL XML or the Policy XML in a way that satisfies S3.

As an experiment, let's just take the preexisting acl text and write it to the new key.

In [51]: newkey.set_xml_acl(oldkey.get_xml_acl())
In [52]:

Looks good on the S3 management page.
I guess I'll just grab that raw XML and put that into the source code. :(

btbonval · 2014-01-17T02:47:06Z

Fugly fugly fugly but it worked. That XML ACL is huge to be dropping in as a string, but boto is too messed up to do anything else I guess. I see the file on S3 with proper ACLs.

When viewing on the site, the URL asks if I want to download it, rather than showing it in the IFRAME.

Changed over to static S3 properly, and it still pops up a download question. It's an HTML file! Maybe the meta data is wrong?

btbonval · 2014-01-17T02:51:37Z

Yup. Metadata problem.
content-type: application/octet-stream

Gotta make sure these things all get uploaded with content-type as text/html.

That fixes the problem, but it takes forever to download from S3! Also the one I'm looking at looks terrible.

btbonval · 2014-01-17T02:58:01Z

DIEEEEEE BOTOOOOO!!!! (read as: boto.s3 doesn't do nothin with metadata!?)

In [5]: oldkey = default_storage.bucket.new_key('html/14_motor1pdf.html')
In [6]: oldkey.exists()
Out[6]: True
In [7]: oldkey.metadata
Out[7]: {}
In [8]: oldkey.get_metadata()
---------------------------------------------------------------------------
TypeError: get_metadata() takes exactly 2 arguments (1 given)
In [9]: oldkey.get_metadata('content-type')
In [10]: oldkey.get_metadata('Content-Type')
In [11]: help(oldkey.get_metadata)
Help on method get_metadata in module boto.s3.key:

get_metadata(self, name) method of boto.s3.key.Key instance
In [15]: oldkey.get_metadata(oldkey.name)
In [16]:

btw there is absolutely a content-type on every single object, and especially on this one, where I set it explicitly.

btbonval · 2014-01-17T03:01:43Z

Also tried the above with lookup instead of new_key, but I suspect they are exactly the same thing.

btbonval · 2014-01-17T03:05:27Z

get_metadata is just a wrapper around metadata attribute.
https://github.com/boto/boto/blob/develop/boto/s3/key.py#L523-L524

Here's where it gets metadata, during open_read() (not during __init__, of course!). Not even a memoized fetching dict, just a dict.
https://github.com/boto/boto/blob/develop/boto/s3/key.py#L274-L275

I don't have enough middle fingers for this.

In [25]: oldkey.open_read()
In [26]: oldkey.metadata
Out[26]: {}
In [27]: oldkey.metadata.__class__
Out[27]: dict

btbonval · 2014-01-17T03:08:39Z

So even if I /read/ the metadata, it'd just be a local cached dict that gets updated.
https://github.com/boto/boto/blob/develop/boto/s3/key.py#L526-L534
https://github.com/boto/boto/blob/develop/boto/s3/key.py#L536-L537

It doesn't push that stuff anywhere. ever.

btbonval · 2014-01-17T03:16:43Z

Two types of metadata.
http://www.bucketexplorer.com/documentation/amazon-s3--amazon-s3-objects-metadata-http-header.html

Looks like HTTP Headers are used to set the Metadata for Files. But when? on upload?

btbonval · 2014-01-17T05:16:47Z

From above link:
"HTTP Headers: You can specify metadata for Amazon S3 Objects (Files), which are Name- Value pairs, which can be sent along with Amazon S3 PUT Request, similar to other standard HTTP headers. Once you upload the S3 Object, you cannot update the Object metadata on Amazon S3. The only way to modify the Object Metadata is to make a copy of the Object and set the Metadata."

They can be changed at the S3 console. So it looks like headers needs a dict with Content-Type. Let's try!

btbonval · 2014-01-17T05:26:10Z

Well I guess I dun gone shoopted some woops. Passed {'Content-Type': 'text/html'} into Key.set_contents_from_string() headers parameter. Wouldn't you know it, S3 manager shows the right content type.

Better still, the file uploaded with a preview ready to go. The HTML still looks like junk, but that's the fault of pdf2html or something. Not the problem of this ticket.
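Putting the pieces from this thread together, a sketch of what the upload step might look like. `upload_html` and its parameters are hypothetical names, and `key` is assumed to be a boto Key such as the one returned by default_storage.bucket.new_key(); the two method calls are the ones exercised in the comments above.

```python
HTML_HEADERS = {'Content-Type': 'text/html'}


def upload_html(key, html_bytes, acl_xml=None):
    """Upload HTML to an S3 key with the right content type, then set its ACL.

    key        -- object exposing set_contents_from_string() and set_xml_acl()
    html_bytes -- the UTF-8 encoded HTML to store
    acl_xml    -- optional raw ACL XML (e.g. copied from a correctly
                  configured key); skipped when None
    """
    # The Content-Type has to go along with the upload itself, since
    # get_metadata/set_metadata only touch a local dict.
    key.set_contents_from_string(html_bytes, headers=HTML_HEADERS)
    if acl_xml is not None:
        key.set_xml_acl(acl_xml)
```

Passing the key in, rather than building it inside the function, keeps the helper easy to exercise with a fake key that just records its calls.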

btbonval · 2014-01-17T05:49:35Z

pulled in master and running tests.

In the meantime, clicking around the VM site. Got a weird javascript error:

TypeError: $(...).dataTable is not a function

Testing finished. One error. Seems gdrive auth was refused? Guess I'll run it all again.

======================================================================
ERROR: Test setting the user of an uploaded document
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/vagrant/karmaworld/karmaworld/apps/document_upload/tests.py", line 117, in testSessionUserAssociation3
    session_key=s.session_key)
  File "/home/vagrant/karmaworld/karmaworld/apps/document_upload/tests.py", line 36, in doConversionForPost
    convert_raw_document(raw_document, user=user, session_key=session_key)
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/gdrive.py", line 203, in convert_raw_document
    service = build_api_service()
  File "/home/vagrant/karmaworld/karmaworld/apps/notes/gdrive.py", line 73, in build_api_service
    return build('drive', 'v2', http=credentials.authorize(httplib2.Http()))
...
  File "/var/www/karmaworld/venv/local/lib/python2.7/site-packages/oauth2client/client.py", line 629, in _do_refresh_request
    raise AccessTokenRefreshError(error_msg)
AccessTokenRefreshError: Invalid response 403.

btbonval · 2014-01-17T05:58:51Z

Random fluke. Second test run finished fine.

Removed all pyc files and restarted the VM web system. Javascript error cleaned up.

Created a course. Uploaded a PDF. Viewed the PDF. All good.

Deleted course from moderator page. Cascaded down to note and tags. Evyting be irie.

Beta can't handle this kind of awesome, so holding back merge until tomorrow's meeting.

AndrewMagliozzi · 2014-01-17T14:16:44Z

Sweet. I can't wait to see it in action later today.

btbonval · 2014-01-17T16:49:57Z

Merged into master at commit 5319934

btbonval · 2014-01-18T05:59:20Z

how to delete all files in the html/ directory, for reference.
http://stackoverflow.com/questions/11426560/amazon-s3-boto-how-to-delete-folder

from django.core.files.storage import default_storage
keys = []
for key in default_storage.bucket.list('html/'):
    if key.name[-1] == '/':
        # placeholder for a directory, don't delete
        continue
    keys.append(key)
    if len(keys) >= 250:
        default_storage.bucket.delete_keys(keys)
        keys = []
if len(keys):
    default_storage.bucket.delete_keys(keys)

ghost assigned btbonval Jan 13, 2014

This was referenced Jan 13, 2014

Import MIT Notes #68

Closed

Re-visit S3 backups #89

Closed

btbonval added a commit that referenced this issue Jan 14, 2014

initial attempt at #273 to replace HTML in database with static file …

be5e062

…HTML

btbonval added a commit that referenced this issue Jan 16, 2014

initial attempt at #273 to replace HTML in database with static file …

344bf28

…HTML

btbonval closed this as completed Jan 18, 2014
