Upload HTML directly to S3 bucket, do not dump in database #273
Comments
We will also need to post-process the HTML already in Production and on the VM and push that out to S3. This will be a one time deal rather than a recurring thing, so a quick script that doesn't need to hang around should suffice. |
Did a quick search to see how out of fashion IFRAMEs are. Found this question about IFRAME and SEO. Being that SEO is a pretty recent topic, there is a good comment in here: http://productforums.google.com/forum/#!topic/webmasters/Y6DyIR7wLXg Make sure there is an anchor link to IFRAME content on the page with the IFRAME. That sounds like good practice anyway, in case someone turns off IFRAMEs because they're so 1995. |
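A minimal sketch of the anchor-link advice above, assuming Python 3 and a made-up helper name (`iframe_with_fallback` is not part of the project). It just renders an IFRAME plus a plain link to the same content, so the content stays reachable if IFRAMEs are turned off.

```python
from html import escape


def iframe_with_fallback(src, link_text="View the note HTML directly"):
    """Return IFRAME markup followed by an anchor pointing at the same URL.

    Hypothetical helper for illustration; the real template may look different.
    """
    safe_src = escape(src, quote=True)
    return (
        '<iframe src="{0}"></iframe>\n'
        '<a href="{0}">{1}</a>'
    ).format(safe_src, escape(link_text))
```

The same thing could be done directly in the Django template; the point is only that the anchor and the IFRAME share one URL.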
|
We probably won't need to batch process HTML across notes, and if we do, the current function will need to be rewritten anyway. Should remove this: beautifulsoup is part of the requirements. lxml does one thing, which is in |
No need to store a URL for the HTML snippet. |
was trying unfortunately:
Or I am missing something key. |
I was trying to create a file by simply opening it and writing to it, as per http://django-storages.readthedocs.org/en/latest/backends/amazon-S3.html#storage That gives me an IOError even though open is set to write/create mode:
Maybe Django sets the
|
Nothing in docs about guess who has two thumbs and has to read source code. this guy. nn/ \nn |
DO NOT WANT |
According to the above link, the acl comes from here: http://docs.aws.amazon.com/AmazonS3/latest/dev/ACLOverview.html 'public-read' should still give the owner full control, but the allusers group gets read. It would seem like a bad idea to change the S3 ACL from 'public-read'. Not sure how to access this S3boto stuff as the owner. |
Files are called Keys in the raw s3boto bucket. e.g. Tested and confirmed. Ugly as junk.
|
Most of the code is written now. I tried to kick off a process to convert HTML in the database to files on S3, but failed:
"The StringIO object can accept either Unicode or 8-bit strings, but mixing the two may take some care. If both are used, 8-bit strings that cannot be interpreted as 7-bit ASCII (that use the 8th bit) will cause a UnicodeError to be raised when getvalue() is called." might as well pass the HTML into BeautifulSoup to see if it can read in the data and output it in consistent UTF-8. |
liar liar pants on fire. It turns out BeautifulSoup does not output UTF-8 by default even though all the docs say it does. Gotta run |
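One way around the mixed Unicode/8-bit problem is to force everything to UTF-8 bytes before it goes anywhere near a file-like object. This is a rough sketch in Python 3 syntax, not the code on the branch; `to_utf8_bytes`, `as_upload_buffer` and the `source_encoding` parameter are names invented here, and the input encoding is an assumption the caller has to supply.

```python
import io


def to_utf8_bytes(html, source_encoding="utf-8"):
    """Return html as UTF-8 encoded bytes, whether it arrived as text or bytes."""
    if isinstance(html, bytes):
        # Decode first so undecodable input fails here, not during the upload.
        html = html.decode(source_encoding)
    return html.encode("utf-8")


def as_upload_buffer(html, source_encoding="utf-8"):
    """Wrap the normalized bytes in a file-like object for an uploader."""
    return io.BytesIO(to_utf8_bytes(html, source_encoding))
```

Because the buffer only ever holds bytes, there is no point at which text and 8-bit strings get mixed.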
oh good. random disconnection errors or something. More or less exactly what I want to deal with right now.
|
well I guess I won't be running this overnight to process. Can't test if anything worked until I get one Note onto S3 to see if my VM hosts it properly. Can't get one Note onto S3 because broken pipe. Pushing WIP to origin as feature_html_on_s3 with commit HEAD 87bf8e2 |
rebased master into branch and ran tests. ... still running. still running? |
top says the CPU is mostly running SSHD and top. tests deadlocked? |
Looks like the manage.py tests are stuck running Xvfb, which is in turn not running anything (although it should run firefox). Time to double check master still works.
|
Tests completed on master branch in ~4 minutes. Something tripped up feature_html_on_s3 branch so that tests deadlock :( No backtraces to help. |
Same pstree as before with the dangling Xvfb. Definitely stuck here. Code: karmaworld/karmaworld/apps/document_upload/tests.py Lines 47 to 53 in fe3879e
calls karmaworld/karmaworld/apps/document_upload/tests.py Lines 30 to 37 in fe3879e
Only place I can imagine it hanging is on |
The feature_html_on_s3 branch has no changes in the raw_document app. |
Double ctrl-c got a super long backtrace!
Ahh that'd certainly be unique to this branch. Hanging on direct upload to S3. The html folder on the appropriate S3 is empty. Guess I'll play with this feature a little more, it's still leaving cake on the toothpick. |
note for later: It seems worth moving this one function for uploading to S3 from gdrive.py into Note. |
Testing a PDF that renders to 2.87 MiB of HTML using (mostly) what would be performed right now. Upload seems to do zilch.
|
Rewrote upload code to use Running tests again. |
a number of tests errored. It looks like the tests hung, but firefox is actively running at the moment. It's been 5 minutes. :/ |
also made a copy/paste mistake. |
A few errors showing up, hanging on the firefox test as before. This time, however, there are three HTML files on the S3! The hanging thing bothers me. I'll have to use some verbose to see where that is happening. |
No files in S3 after these. The later upload tests have files in S3 after they run. |
Tests didn't hang using verbose output. How bizarre.
This test appears moot now that slug is unique and not nullable.
I'm guessing this is due to IndexDen not adding any more indices right now.
Three failures from error, no true failures. Time to check it by hand! |
Removed obsolete null Note.slug test, down to 2 errors caused by IndexDen. Can't get much further than this for now. |
uploaded objects to S3 do not give permission to open/download them. Need to do what is in this comment: #68 (comment) |
Figured out the IndexDen problem. Back to using Beta's IndexDen and all the tests ran just fine. |
These docs are about as helpful as a bag of wet socks. I guess there are uses for a bag of wet socks, but not many. Here's what an Everyone Open/Download policy looks like in s3boto:
So to make that, it'd be something like:
from boto.s3.acl import Grant
# once key exists
policy = newkey.get_acl()
policy.acl.add_grant(Grant(permission=u'READ', type=u'GROUP', uri=u'http://acs.amazonaws.com/groups/global/AllUsers')) |
Permission attempt failed. No errors, but the permissions according to S3 do not include Everyone. Time for guess and check. |
I think the first problem is that changing the policy as noted above does not save that policy remotely. Probably need to call one of the
Here's a problem. Converting the permission into XML ignores the AllUsers URI.
|
type is "GROUP". Looking at Boto source code it is case sensitive 'Group'. I'm tempted to write a ticket over there, but it's probably one of those things where the standard for the XML or whatever is case sensitive, therefore the Python must be as well. |
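For reference, here is a sketch of what a single AllUsers READ grant looks like as XML, built with the standard library rather than boto (Python 3 syntax). It is only a `Grant` fragment, not a full AccessControlPolicy document, and the function name is made up for illustration. The thing to notice is the grantee type, which is spelled `Group`.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
ALL_USERS = "http://acs.amazonaws.com/groups/global/AllUsers"

# Serialize the XSI namespace with the conventional "xsi" prefix.
ET.register_namespace("xsi", XSI)


def all_users_read_grant():
    """Build a <Grant> element giving the AllUsers group READ permission."""
    grant = ET.Element("Grant")
    grantee = ET.SubElement(grant, "Grantee")
    # The grantee type is case sensitive: "Group", not "GROUP".
    grantee.set("{%s}type" % XSI, "Group")
    ET.SubElement(grantee, "URI").text = ALL_USERS
    ET.SubElement(grant, "Permission").text = "READ"
    return grant


grant_xml = ET.tostring(all_users_read_grant(), encoding="unicode")
```

Comparing a fragment like this against what boto emits is one way to spot a casing mismatch.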
Here's what the grant XML should look like when it's correct vs what is being generated (identical):
So the problem appears to be with boto's ability to generate either the ACL XML or the Policy XML in a way that satisfies S3. As an experiment, let's just take the preexisting acl text and write it to the new key.
Looks good on the S3 management page. |
Fugly fugly fugly but it worked. That XML ACL is huge to be dropping in as a string, but boto is too messed up to do anything else I guess. I see the file on S3 with proper ACLs. When viewing on the site, the URL asks if I want to download it, rather than showing it in the IFRAME. Changed over to static S3 properly, and it still pops up a download question. It's an HTML file! Maybe the meta data is wrong? |
Yup. Metadata problem. Gotta make sure these things all get uploaded with content-type as That fixes the problem, but it takes forever to download from S3! Also the one I'm looking at looks terrible. |
DIEEEEEE BOTOOOOO!!!! (read as: boto.s3 doesn't do nothin with metadata!?)
btw there is absolutely content-type on every single object, but especially this one when I explicitly set. |
Also tried the above with |
Here's where it gets metadata, during I don't have enough middle fingers for this.
|
So even if I /read/ the metadata, it'd just be a local cached dict that gets updated. It doesn't push that stuff anywhere. ever. |
Two types of metadata. Looks like HTTP Headers are used to set the Metadata for Files. But when? on upload? |
From above link: They can be changed at the S3 console. So it looks like headers needs a dict with Content-Type. Let's try! |
Well I guess I dun gone shoopted some woops. Passed Better still, the file uploaded with a preview ready to go. The HTML still looks like junk, but that's the fault of pdf2html or something. Not the problem of this ticket. |
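A hedged sketch of the headers-on-upload idea, for anyone landing here later. It assumes a boto 2 style bucket (`new_key`) and key (`set_contents_from_string` with `headers` and `policy` keyword arguments), and it assumes `text/html` is the content type wanted. `upload_html` is a made-up name, and passing a canned `policy` is an alternative to writing the ACL XML by hand, not what was done in this ticket.

```python
def upload_html(bucket, key_name, html_bytes, policy=None):
    """Upload HTML so S3 serves it with an HTML content type.

    bucket is assumed to behave like a boto 2 Bucket. policy is an optional
    canned ACL name such as 'public-read'; None leaves the ACL untouched.
    """
    key = bucket.new_key(key_name)
    key.set_contents_from_string(
        html_bytes,
        # The Content-Type header sent on upload becomes the object's metadata.
        headers={"Content-Type": "text/html"},
        policy=policy,
    )
    return key
```

Setting the header at upload time matters because changing a local metadata dict on the key afterwards does not push anything to S3.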
pulled in master and running tests. In the meantime, clicking around the VM site. Got a weird javascript error:
Testing finished. One error. Seems gdrive auth was refused? Guess I'll run it all again.
|
Random fluke. Second test run finished fine. Removed all pyc files and restarted the VM web system. Javascript error cleaned up. Created a course. Uploaded a PDF. Viewed the PDF. All good. Deleted course from moderator page. Cascaded down to note and tags. Evyting be irie. Beta can't handle this kind of awesome, so holding back merge until tomorrow's meeting. |
Sweet. I can't wait to see it in action later today. On Fri, Jan 17, 2014 at 12:58 AM, Bryan Bonvallet
|
Merged into master at commit 5319934 |
how to delete all files in the html/ directory, for reference.
from django.core.files.storage import default_storage
keys = []
for key in default_storage.bucket.list('html/'):
    if key.name[-1] == '/':
        # placeholder for a directory, don't delete
        continue
    keys.append(key)
    if len(keys) >= 250:
        default_storage.bucket.delete_keys(keys)
        keys = []
if len(keys):
    default_storage.bucket.delete_keys(keys) |
The S3 static buckets for beta and prod now have an 'html' directory in the base.
When we have HTML, instead of storing it in the database, we want to write it as a file onto S3 (possibly name the file by hash). Instead of storing the HTML in the database, we want to store the relative static path in something like Note.static_relpath. The Note detail template would then use an IFRAME pointing at {{ STATIC_URL }}html/{{Note.static_relpath}}.
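A small sketch of the "name the file by hash" idea, assuming SHA-1 over the HTML bytes and a `.html` suffix. Both choices are assumptions, as are the helper names; the issue only says the name could possibly be a hash.

```python
import hashlib


def static_relpath_for(html_bytes):
    """Derive a stable relative file name from the HTML content itself."""
    return hashlib.sha1(html_bytes).hexdigest() + ".html"


def iframe_src(static_url, relpath):
    """Mirror the template expression STATIC_URL + 'html/' + static_relpath."""
    return "{0}html/{1}".format(static_url, relpath)
```

A content hash means identical HTML always maps to the same file name, so re-uploading the same note overwrites rather than duplicates.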