Import MIT Notes #68
Comments
Let's also make sure we add a link to the original page so we maintain the CC-BY-NC compliance. On Jan 31, 2013, at 6:41 PM, Seth Woodworth notifications@github.com wrote:
|
CC-BY-NC licensing is now issue #97 |
Here is a link to the scraper for the MIT-OCW site: https://github.com/AndrewMagliozzi/mit-ocw-scraper (make sure to checkout the MIT-notes branch) |
"courseLink": "http://ocw.mit.edu//courses/nuclear-engineering/22-105-electromagnetic-interactions-fall-2005",
"courseStub": "22-105-electromagnetic-interactions-fall-2005",
"courseTitle": "Electromagnetic Interactions",
"professor": "Prof. Jeffrey Freidberg",
"noteLinks": [
{
"link": "http://ocw.mit.edu/courses/nuclear-engineering/22-105-electromagnetic-interactions-fall-2005/lecture-notes/lecture1.pdf",
"fileName": "lecture1.pdf"
},
parse out course info: year from courseLink, title, professor.
parse out all links as notes.
parse note info: title from fileName, no email address, tags of mit-ocw and karma. |
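The parsing steps above could be sketched roughly like this. It is only an illustration under assumptions: the function names and the output field names (`upstream_link`, `email`, `tags`) are made up for the example, and it assumes the course stub always ends in a four-digit year, as in the sample entry.

```python
import re

# Sample entry copied from the scraper output quoted in this thread.
SAMPLE = {
    "courseLink": "http://ocw.mit.edu//courses/nuclear-engineering/22-105-electromagnetic-interactions-fall-2005",
    "courseStub": "22-105-electromagnetic-interactions-fall-2005",
    "courseTitle": "Electromagnetic Interactions",
    "professor": "Prof. Jeffrey Freidberg",
    "noteLinks": [
        {
            "link": "http://ocw.mit.edu/courses/nuclear-engineering/22-105-electromagnetic-interactions-fall-2005/lecture-notes/lecture1.pdf",
            "fileName": "lecture1.pdf",
        }
    ],
}


def parse_course(entry):
    """Pull year, title and professor out of one scraped course entry."""
    # Fall back to the last path segment of courseLink if the stub is missing.
    stub = entry.get("courseStub") or entry["courseLink"].rstrip("/").rsplit("/", 1)[-1]
    # Assumption: the stub ends in a four-digit year, e.g. "...-fall-2005".
    match = re.search(r"(\d{4})$", stub)
    return {
        "title": entry["courseTitle"],
        "professor": entry.get("professor"),
        "year": int(match.group(1)) if match else None,
    }


def parse_notes(entry):
    """Turn every noteLinks item into a note record (hypothetical field names)."""
    notes = []
    for link in entry.get("noteLinks", []):
        notes.append({
            # Title comes from fileName with the extension dropped.
            "title": link["fileName"].rsplit(".", 1)[0],
            "upstream_link": link["link"],
            "email": None,  # no email address for imported notes
            "tags": ["mit-ocw", "karma"],
        })
    return notes
```

Calling `parse_course(SAMPLE)` and `parse_notes(SAMPLE)` gives the course fields and one note record for `lecture1.pdf`.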
Also modify the database for handling licenses. |
pass remote link to FilePicker. (figure that bit out) how to convert filepicker results and shove them into database: https://github.com/FinalsClub/karmaworld/blob/c5af62fe0c2d14f2420f1eef0ab577b95f2e68d9/karmaworld/apps/document_upload/tests.py |
looks like there is no pythonic interface to FilePicker. Best answer seems to always be curl. Might as well implement something with urllib or whatevs, grab the API key out of secrets, whatnot. |
hrm. However, when I upload files using |
I give up for now. No matter what I do, Filepicker says the file type is "multipart/form-data", yet I see no reason for this. Check back with fresh eyes. commit to feature_ocw_upload in 3eb6d5e only other thing I can think of is to pass in the byte array using |
this is the bit that won't seem to upload properly: karmaworld/karmaworld/apps/notes/management/commands/import_ocw_json.py Lines 95 to 102 in 3eb6d5e
|
Is there an option to do a buffered download?
|
I also don't think we need to download each note to memory. We can simply pass the MIT link directly to filepicker then use the FP response for GDrive processing. Give a buzz when you're up, Bryan. I'd like to help with this.
|
Have to take cat to vet shortly, but I'll be ready to take a look when I Good thought on uploading via link, but I didn't see how to do that via FP
|
I think you can just pass the URL instead of the local file path. Let's On Mon, Jan 6, 2014 at 1:07 PM, Bryan Bonvallet notifications@github.com wrote:
|
curl -X POST -d "url=palmzlib.sourceforge.net/images/pengbrew.png"; " On Mon, Jan 6, 2014 at 3:28 PM, Andrew Magliozzi <andrew.magliozzi@gmail.com
|
aha, it's in the API.
This is how you specify the URL to FP and let them download it. |
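For reference, a urllib version of the curl idea (POST a `url` form field and let Filepicker fetch the file itself) might look something like this. It is a sketch, not the code from the import script: the endpoint is left as a parameter because the exact store URL is not quoted in this thread, and passing the API key as a `key` query parameter is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_store_request(endpoint, api_key, remote_url):
    """Build (but do not send) a POST asking the service to fetch remote_url.

    endpoint: the Filepicker store URL (placeholder, supply the real one).
    api_key:  assumed here to travel as a `key` query parameter.
    """
    query = urlencode({"key": api_key})
    # Same form field the curl example uses: -d "url=..."
    body = urlencode({"url": remote_url}).encode("ascii")
    return Request("{0}?{1}".format(endpoint, query), data=body, method="POST")
```

Sending it would be `urllib.request.urlopen(build_store_request(...))`. Building the request separately makes it easy to inspect what is going over the wire before pointing it at the real service.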
Getting non-unique error from same course over different academic years.
There is a unique constraint which does not include Academic Year but should. However, there is no way to add Academic Year in the form. #253 Also we need to toss department into the import following completion of #236 |
Notes are duplicating. It appears Django is deciding to insert instead of update. One note has license and upstream_link set, the other does not. There is a single call of gdrive's |
RawDocument is updated in |
So celery does it one time and the conversion code does it one time. |
#253 is no longer the fix for Academic Year unique problems. remove "year" from the create_or_get statement so that it grabs the correct course agnostic of year. |
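To spell out why dropping "year" from the lookup helps, here is a toy stand-in for the get-or-create logic in plain Python. The names (`get_or_create_course`, the `name` and `school` keys) are hypothetical and this is not the model code. The point is only that fields used for the lookup decide whether an existing course is found, while everything else is applied only when a new row is created.

```python
def get_or_create_course(courses, name, school, **defaults):
    """Look up a course by (name, school) only; year is NOT part of the key.

    `courses` is a dict standing in for the course table.
    Returns (course, created), like Django's get_or_create does.
    """
    key = (name, school)
    if key in courses:
        # Existing course wins, regardless of which year is being imported.
        return courses[key], False
    course = dict(defaults, name=name, school=school)
    courses[key] = course
    return course, True
```

With year inside the lookup, importing the same course for 2005 and then 2006 would miss the existing row, try to insert a second one, and trip the unique constraint. With year passed only as a default, the second import finds the first course.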
VM is sucking in courses. Start new VM from scratch, suck in ALL notes. If that works, move to beta. |
Upload to VM one time. If everything works well, switch over to using dump_json and restore_json to bring the VM notes over to beta. |
403 from here: karmaworld/karmaworld/apps/notes/management/commands/import_ocw_json.py Lines 142 to 145 in 200ca5c
Filepicker didn't used to return forbidden. |
We did just change the filepicker API. I suppose it won't hurt to use the old one as a test. yup. original filepicker API works fine, newer one fails. Does that mean beta Filepicker will fail? |
Things look mostly good. The static URL is returning error 403. While this link works just fine both in a new tab and imported onto the page: |
folders within buckets do not have special permissions. buckets have permissions as a whole. farg. |
Interestingly, "Static web hosting" is not enabled for the bucket at all. So whatever we're doing, we're not checking those tick marks. Man, I remember this from before. there's some evil voodoo crap going on. Some things work and some things do not work. Last time I had to nuke the VM and start over and suddenly CSS and so forth started working from S3. No changes to the S3 server made. |
The original S3 static hosting instructions used certainly did not mention anything at all about changing S3 settings themselves, just how to make Django push static files up to S3. Previous dark time with no real resolution: I think we're not doing it right, but somehow we're getting lucky. |
Each S3 object has its permissions. There is no way to inherit permissions from the bucket. There is no way to batch apply permissions across all objects in a bucket through the S3 interface. The only answer here is to change permissions on the Key at upload time in the |
Migrating Filepicker URLs won't work directly. Files uploaded to beta are on the Beta filepicker account, so the links are under that account's management.
|
I don't believe there is a way to check which account a link belongs to.
|
We'll have to use the prod Filepicker account creds on your VM for the MIT data.
|
I'm hoping the Filepicker API will have something clever I can use. Even if Actually I have been using the prod Filepicker account on my VM, which has Beta's Filepicker uploads to Filepicker S3 or whatever that we don't have On Thu, Jan 16, 2014 at 8:49 PM, Andrew Magliozzi
|
You're right. Nothing helpful with the Filepicker API. You can CRUD each That puts a very minor wrench in the cogs. It means we won't be able to On Thu, Jan 16, 2014 at 8:50 PM, Bryan btbonval@gmail.com wrote:
|
Since the first 15 or so notes were converted to HTML poorly, I deleted them on the S3. (I also deleted the other things converted poorly with HTML in the database) Reran populate_s3 to fix the stuff with HTML in the database. That left 15 notes that aren't statically hosted on S3, of which a few are from the previous import OCW tests:
Intro to Neuroscience is from the previous OCW attempt. That has been cascade deleted and will be repopulated with the import OCW script. What are these other things? They are all notes which have null html and null text.
Interestingly, all the blank notes were uploaded on 9 November 2013. Probably not a coincidence. There is no information which might help recover these files besides the note name and course name. Even then, only the originator would know what file that name refers to. Deleting those notes from the database. |
Running MIT OCW BCS dept notes on production in tmux window. First note finished and shows up in the right course. Looks good. links open in a new window. Will leave the script running and check on it later. |
I think I can find those blank files again. Stay tuned. Andrew
|
BCS and Chemistry department notes uploaded for MIT OCW. Beginning Anthropology and Economics. All the notes in Intro to Anthro are missing, but the script now skips missing upstream links:
|
Wrote a quick little ditty. Notes by department (I'm shooting for departments in the middle as we prioritize):
|
Anthropology and Economics uploaded. Physics and PolySci, why not? Launched for import. |
Awesome! PS - I found two more spam courses On Mon, Jan 20, 2014 at 4:15 AM, Bryan Bonvallet
|
PPS - Can we remove all courses where the professor is null? On Mon, Jan 20, 2014 at 9:53 AM, Andrew Magliozzi <
|
There are 316 notes for courses taught by null professors.
Are you asking me to clear out the MIT OCW courses which have no notes? The MIT OCW script has specific tags we can search against to find all courses which were uploaded by the script and have no notes. The subquery would look something like this (no ORDER BY clause):
|
231 MIT scraped courses have no notes. 200 MIT scraped courses have notes. |
That is exactly what I was thinking.
|
Done. According to the front page, only one course has no notes now. I see you also deleted another spam course that popped up. |
Andrew and I agree this ticket is done, but we might continue the discussion about MIT notes on it. |
I did this to clean out MIT OCW courses with no notes. Ugly nested subqueries, but it is fast enough and gets the job done. Might be worth an additional join so the tag IDs are not hard coded.
DELETE FROM courses_course
WHERE id IN
(SELECT id FROM
(SELECT cc.id, COUNT(nn.id) AS notes
FROM courses_course AS cc
INNER JOIN taggit_taggeditem AS tt ON (tt.object_id = cc.id)
LEFT OUTER JOIN notes_note AS nn ON (cc.id = nn.course_id)
WHERE tt.tag_id IN (108,109) GROUP BY cc.id) AS subquery
WHERE notes = 0); |
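The "additional join" idea could look roughly like the following, tried here against an in-memory SQLite database rather than the real one. Treat it as a sketch: the tag name `mit-ocw` is assumed to be one of the tags behind IDs 108 and 109, the tables are cut down to the columns the query touches, and the join on `taggit_taggeditem.object_id` ignores content types just like the query above does.

```python
import sqlite3

# Cut-down stand-ins for the real tables; only the columns the query needs.
SCHEMA = """
CREATE TABLE courses_course (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE notes_note (id INTEGER PRIMARY KEY, course_id INTEGER);
CREATE TABLE taggit_tag (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE taggit_taggeditem (id INTEGER PRIMARY KEY, tag_id INTEGER, object_id INTEGER);
"""

# Same idea as the hand-written query, but the tag is matched by name
# through taggit_tag instead of by hard-coded ID.
DELETE_EMPTY = """
DELETE FROM courses_course
WHERE id IN (
    SELECT cc.id
    FROM courses_course AS cc
    INNER JOIN taggit_taggeditem AS tt ON tt.object_id = cc.id
    INNER JOIN taggit_tag AS tg ON tg.id = tt.tag_id
    LEFT OUTER JOIN notes_note AS nn ON nn.course_id = cc.id
    WHERE tg.name = ?
    GROUP BY cc.id
    HAVING COUNT(nn.id) = 0
)
"""


def delete_empty_tagged_courses(conn, tag_name):
    """Delete courses carrying tag_name that have no notes; return rows deleted."""
    cur = conn.execute(DELETE_EMPTY, (tag_name,))
    conn.commit()
    return cur.rowcount
```

`HAVING COUNT(nn.id) = 0` replaces the outer `WHERE notes = 0` wrapper, so one level of nesting goes away. Untagged courses are never touched because the inner joins drop them before the count.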
Python script for counting the notes across all courses in an OCW json file.
import sys
import json

# filename supplied as the first argument
filename = sys.argv[1]
# load the json structure from the supplied filename
with open(filename, 'r') as fd:
    fc = json.load(fd)
# prepare some structures
courses = fc['courses']
ncourses = len(courses)

def num_links(obj):
    # return number of links, or 0 if the key is missing
    return len(obj.get('noteLinks', []))

# sum the notes for all courses
nnotes = sum(num_links(course) for course in courses)
print("{0},{1}".format(nnotes, filename))
Run it something like so:
|
KarmaNotes is using CC-by on all pages.
inherit OCW CC-by-nc onto OCW pages for both course and note.
possibly create a license table. There'd be two entries to start: index 0 = CC-by, 1 = CC-by-nc. Add license FK into course and note models to license.
Default = 0 for KarmaNotes.
Importing from OCW will explicitly set license to 1.
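The license table idea above, tried out as a quick SQLite schema. This is only a sketch of the plan, not the eventual Django migration: the table name `licenses_license` and the helper names are hypothetical, and the tables carry just enough columns to show the default and the explicit OCW value.

```python
import sqlite3

KARMANOTES_LICENSE = 0  # CC-by, the site-wide default
OCW_LICENSE = 1         # CC-by-nc, set explicitly by the OCW import

SCHEMA = """
CREATE TABLE licenses_license (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
INSERT INTO licenses_license (id, name) VALUES (0, 'CC-by');
INSERT INTO licenses_license (id, name) VALUES (1, 'CC-by-nc');
CREATE TABLE courses_course (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    license_id INTEGER NOT NULL DEFAULT 0 REFERENCES licenses_license (id)
);
CREATE TABLE notes_note (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    license_id INTEGER NOT NULL DEFAULT 0 REFERENCES licenses_license (id)
);
"""


def make_db():
    """Create an in-memory database with the two starting licenses."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_note(conn, name, license_id=None):
    """Insert a note; leaving license_id out falls back to the CC-by default."""
    if license_id is None:
        cur = conn.execute("INSERT INTO notes_note (name) VALUES (?)", (name,))
    else:
        cur = conn.execute(
            "INSERT INTO notes_note (name, license_id) VALUES (?, ?)",
            (name, license_id),
        )
    return cur.lastrowid


def license_name(conn, note_id):
    """Look up the license name for a note through the foreign key."""
    row = conn.execute(
        "SELECT l.name FROM notes_note AS n "
        "JOIN licenses_license AS l ON l.id = n.license_id WHERE n.id = ?",
        (note_id,),
    ).fetchone()
    return row[0]
```

A regular upload goes through `add_note(conn, "my notes")` and ends up CC-by. The OCW importer would call `add_note(conn, "lecture1", license_id=OCW_LICENSE)` and the page can then show CC-by-nc.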