Import MIT Notes #68

Closed
sethwoodworth opened this issue Feb 1, 2013 · 81 comments
@sethwoodworth
Member

KarmaNotes is using CC-by on all pages.

inherit OCW CC-by-nc onto OCW pages for both course and note.

possibly create a license table. There would be two entries to start: index 0 = CC-by, index 1 = CC-by-nc. Add a license FK from the course and note models to the license table.

Default = 0 for KarmaNotes.

Importing from OCW will explicitly set license to 1.
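A minimal sketch of the proposed license lookup, with hypothetical names (`LICENSES`, `license_for`); in practice the table would be a Django model with a FK from course and note:

```python
# Hypothetical in-memory stand-in for the proposed license table.
# Index 0 = CC-by (KarmaNotes default), index 1 = CC-by-nc (OCW imports).
LICENSES = [
    {"id": 0, "name": "cc-by"},
    {"id": 1, "name": "cc-by-nc"},
]

def license_for(source):
    """Return the license FK value for a note based on where it came from."""
    return 1 if source == "ocw" else 0
```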

@AndrewMagliozzi
Member

Let's also make sure we add a link to the original page so we maintain CC-BY-NC compliance.

On Jan 31, 2013, at 6:41 PM, Seth Woodworth notifications@github.com wrote:

Consider adding ten of the top courses with video from MIT OCW as well. Do this by hand and add no more than 15 courses per network.

@sethwoodworth
Member Author

CC-BY-NC licensing is now issue #97

@ghost ghost assigned AndrewMagliozzi Dec 16, 2013
@AndrewMagliozzi
Member

Here is a link to the scraper for the MIT-OCW site: https://github.com/AndrewMagliozzi/mit-ocw-scraper (make sure to checkout the MIT-notes branch)

@btbonval
Member

btbonval commented Jan 2, 2014

      "courseLink": "http://ocw.mit.edu//courses/nuclear-engineering/22-105-electromagnetic-interactions-fall-2005",
      "courseStub": "22-105-electromagnetic-interactions-fall-2005",
      "courseTitle": "Electromagnetic Interactions",
      "professor": "Prof. Jeffrey Freidberg",
      "noteLinks": [
        {
          "link": "http://ocw.mit.edu/courses/nuclear-engineering/22-105-electromagnetic-interactions-fall-2005/lecture-notes/lecture1.pdf",
          "fileName": "lecture1.pdf"
        },

parse out course info: year from courseLink, title, professor.

parse out all links as notes. parse note info: title from fileName, no email address, tags of mit-ocw and karma.
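As a rough sketch of that parsing (pure Python; helper names are hypothetical, and the year regex assumes stubs always end in a four-digit year, as in the example above):

```python
import re

def parse_course(course):
    # Year: trailing four-digit group on the courseStub (e.g. ...-fall-2005).
    m = re.search(r"(\d{4})$", course["courseStub"])
    return {
        "title": course["courseTitle"],
        "instructor": course["professor"],
        "year": int(m.group(1)) if m else None,
    }

def parse_note(note):
    # Note title: fileName minus its extension; fixed tags, no email address.
    return {
        "title": note["fileName"].rsplit(".", 1)[0],
        "tags": ["mit-ocw", "karma"],
        "link": note["link"],
    }
```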

@btbonval
Member

btbonval commented Jan 3, 2014

Also modify the database for handling licenses.

@ghost ghost assigned btbonval Jan 3, 2014
@btbonval
Member

btbonval commented Jan 3, 2014

pass remote link to FilePicker. (figure that bit out)

how to convert filepicker results and shove them into database: https://github.com/FinalsClub/karmaworld/blob/c5af62fe0c2d14f2420f1eef0ab577b95f2e68d9/karmaworld/apps/document_upload/tests.py

@btbonval
Member

btbonval commented Jan 3, 2014

license handling of #97 is done in commit 34ea96f

@btbonval
Member

btbonval commented Jan 3, 2014

looks like there is no pythonic interface to FilePicker. Best answer seems to always be curl.
http://stackoverflow.com/questions/14115280/store-files-to-filepicker-io-from-the-command-line

Might as well implement something with urllib or whatevs, grab the API key out of secrets, whatnot.

@btbonval
Member

btbonval commented Jan 6, 2014

hrm. curl -F blah=@file will use multipart/form-data to upload files as though submit to a form. This is recommended by the above stackoverflow and on Filepicker's RESTful API:
https://developers.inkfilepicker.com/docs/web/#inkblob-store

However, when I upload files using requests multipart/form-data, the MIME type returned by Filepicker is "multipart/form-data" rather than the MIME type of the actual file.
http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file

@btbonval
Member

btbonval commented Jan 6, 2014

I give up for now. No matter what I do, Filepicker says the file type is "multipart/form-data", yet I see no reason for this. Check back with fresh eyes.

commit to feature_ocw_upload in 3eb6d5e

Only other thing I can think of is to pass in the byte array using dlresp.content instead of the file-like object of dlresp.raw, but that shouldn't change how the files parameter works for the requests POST (and thus should not affect the mimetype interpretation). Worth a try though.

@btbonval
Member

btbonval commented Jan 6, 2014

this is the bit that won't seem to upload properly:

# Upload raw contents of note to Filepicker
# https://developers.inkfilepicker.com/docs/web/#inkblob-store
print "Uploading to FP."
ulresp = requests.post(fpurl, files={
#'fileUpload': (note['fileName'], dlresp.raw)
'fileUpload': dlresp.raw,
})
ulresp.raise_for_status()
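One likely culprit: `dlresp.raw` has no file name, so the multipart part carries no useful content type. `requests` accepts a `(filename, fileobj, content_type)` tuple per part, which pins the MIME type explicitly. A sketch with a hypothetical `part_for` helper (`mimetypes` is stdlib):

```python
import mimetypes

def part_for(file_name, file_obj):
    # Guess the MIME type from the file name so Filepicker sees e.g.
    # application/pdf rather than a generic multipart type.
    content_type = mimetypes.guess_type(file_name)[0] or "application/octet-stream"
    return {"fileUpload": (file_name, file_obj, content_type)}
```

The upload call would then become `requests.post(fpurl, files=part_for(note['fileName'], dlresp.raw))`.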

@AndrewMagliozzi
Member

Is there an option to do a buffered download?

@AndrewMagliozzi
Member

I also don't think we need to download each note to memory. We can simply pass the MIT link directly to Filepicker, then use the FP response for GDrive processing. Give a buzz when you're up, Bryan. I'd like to help with this.

@btbonval
Member

btbonval commented Jan 6, 2014

Have to take the cat to the vet shortly, but I'll be ready to take a look when I get back.

Good thought on uploading via link, but I didn't see how to do that via the FP RESTful API docs. Should be possible.

@AndrewMagliozzi
Member

I think you can just pass the URL instead of the local file path. Let's try it when you get back.

@AndrewMagliozzi
Member

curl -X POST -d "url=palmzlib.sourceforge.net/images/pengbrew.png" "filepicker.io/api/store/S3?key=MY_API_KEY&path=/images/…"

@btbonval
Member

btbonval commented Jan 6, 2014

aha, it's in the API.

curl -X POST -d url="https://www.inkfilepicker.com/static/img/watermark.png" https://www.filepicker.io/api/store/S3?key=MY_API_KEY

This is how you specify the URL to FP and let them download it.
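The same call from Python without pulling the file locally. This sketch just assembles the endpoint and form body (hypothetical `store_url_request` helper; the actual POST would go through requests or urllib):

```python
from urllib.parse import urlencode  # plain urllib.urlencode in the Python 2 code above

FP_STORE = "https://www.filepicker.io/api/store/S3"

def store_url_request(api_key, url):
    # Mirrors: curl -X POST -d url=<url> <FP_STORE>?key=<api_key>
    endpoint = "{0}?key={1}".format(FP_STORE, api_key)
    body = urlencode({"url": url})
    return endpoint, body
```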

@btbonval
Member

btbonval commented Jan 6, 2014

Getting non-unique error from same course over different academic years.

DETAIL:  Key (school_id, name, instructor_name)=(10464, Designing Your Life, Gabriella Jordan, Lauren Zander) already exists.

There is a unique constraint which does not include Academic Year but should.

However, there is no way to add Academic Year in the form. #253

Also we need to toss department into the import following completion of #236

@btbonval
Member

btbonval commented Jan 6, 2014

Notes are duplicating. It appears Django is deciding to insert instead of update. One note has license and upstream_link set, the other does not. There is a single call of gdrive's convert_raw_document over a single RawDocument object.

@btbonval
Member

btbonval commented Jan 6, 2014

RawDocument is updated in convert_raw_document. Note only has save called once, excepting possibly the call to sanitize_html or some other Note method which might do its own save.

@btbonval
Member

btbonval commented Jan 6, 2014

RawDocument.save calls celery to run convert_raw_document via process_raw_document.

So celery does it one time and the conversion code does it one time.

@btbonval
Member

btbonval commented Jan 6, 2014

#253 is no longer the fix for Academic Year unique problems.

remove "year" from the create_or_get statement so that it grabs the correct course agnostic of year.
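In other words, the lookup key behind create_or_get shrinks to (school, name, instructor), so re-imports across academic years land on the same course row. Roughly (hypothetical helper):

```python
def course_lookup_key(school_id, name, instructor_name):
    # Year deliberately excluded: the same course imported from a different
    # academic year must match the existing row rather than violate the
    # unique (school_id, name, instructor_name) constraint.
    return (school_id, name.strip(), instructor_name.strip())
```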

@btbonval
Member

btbonval commented Jan 6, 2014

VM is sucking in courses.

Start new VM from scratch, suck in ALL notes.

If that works, move to beta.

@btbonval
Member

btbonval commented Jan 7, 2014

Upload to VM one time. If everything works well, switch over to using dump_json and restore_json to bring the VM notes over to beta.

@btbonval
Member

403 from here:

ulresp = requests.post(fpurl, data={
'url': url,
})
ulresp.raise_for_status()

Filepicker didn't return forbidden before.

@btbonval
Member

We did just change the filepicker API. I suppose it won't hurt to use the old one as a test.

yup. original filepicker API works fine, newer one fails. Does that mean beta Filepicker will fail?

@btbonval
Member

Things look mostly good. The static URL is returning error 403.
https://s3.amazonaws.com/karma-beta/html/09_vision1pdf.html

While this link works just fine both in a new tab and imported onto the page:
https://s3.amazonaws.com/karma-beta/css/global.css

@btbonval
Member

folders within buckets do not have special permissions. buckets have permissions as a whole.

farg.

@btbonval
Member

Interestingly, "Static web hosting" is not enabled for the bucket at all. So whatever we're doing, we're not checking those tick marks.

Man, I remember this from before. there's some evil voodoo crap going on. Some things work and some things do not work. Last time I had to nuke the VM and start over and suddenly CSS and so forth started working from S3. No changes to the S3 server made.

@btbonval
Member

The original S3 static hosting instructions we used certainly did not mention anything at all about changing S3 settings themselves, just how to make Django push static files up to S3.
#65 (comment)

Previous dark time with no real resolution:
#192

I think we're not doing it right, but somehow we're getting lucky.

@btbonval
Member

Each S3 object has its permissions. There is no way to inherit permissions from the bucket. There is no way to batch apply permissions across all objects in a bucket through the S3 interface.

The only answer here is to change permissions on the Key at upload time in the Note.send_to_s3() code.
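A boto-style sketch of that change in Note.send_to_s3(); `key` is assumed to be a boto `Key` object, and the essential addition is the per-object `set_acl` call:

```python
def send_to_s3(key, html):
    # Upload the note HTML, then make this specific object publicly readable.
    # S3 permissions here are per-object, not inherited from the bucket, so
    # the ACL must be applied on each Key at upload time.
    key.set_contents_from_string(html)
    key.set_acl('public-read')
```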

@btbonval
Member

Migrating Filepicker URLs won't work directly. Files uploaded to beta are on the Beta filepicker account, so the links are under that account's management.

  1. Can we check if the fp_file URL is owned by the current Filepicker API?
  2. Can we add a hook to Python's load_data which checks that, and if not owned, migrates it over?

@AndrewMagliozzi
Member

I don't believe there is a way to check which account a link belongs to.

@AndrewMagliozzi
Member

We'll have to use the prod Filepicker account creds on your VM for the MIT data.

@btbonval
Member

I'm hoping the Filepicker API will have something clever I can use, even if it has to do it by checksum (which could be slow, but worthwhile).

Actually, I have been using the prod Filepicker account on my VM, which has a secondary benefit: I can see the HTML uploaded on S3.

Beta's Filepicker uploads to Filepicker's own S3, which we don't have access to.

@btbonval
Member

You're right. Nothing helpful in the Filepicker API. You can CRUD each file given its filepicker URL, but there isn't even a way to list files.
https://developers.inkfilepicker.com/docs/web/#rest

That puts a very minor wrench in the cogs. It means we won't be able to test this import stuff on beta without pointing at prod's static S3 URL. Easy thing to do for a quick read test, and then change it back.
-Bryan

@btbonval
Member

Since the first 15 or so notes were converted to HTML poorly, I deleted them from S3. (I also deleted the other poorly converted HTML from the database.)
#273 (comment)

Reran populate_s3 to fix the stuff with HTML in the database.

That left 15 notes that aren't statically hosted on S3, of which a few are from the previous import OCW tests:

karmanotes=# SELECT cc.id, cc.slug FROM courses_course AS cc, notes_note AS nn WHERE nn.static_html = FALSE AND cc.id = nn.course_id;
 id  |                         slug                          
-----+-------------------------------------------------------
  52 | economics-10
  65 | culture-and-belief-17-the-roman-games
  76 | societies-of-the-world-39-slavery-and-slave-trade
  45 | metaphysical-poetry
  55 | history-1330-social-thought-in-modern-america
  39 | government-1295-comparative-politics-in-latin-america
  46 | psychology-13-cognitive-psychology
  48 | us-and-the-world-13-medicine-and-society-in-america
  54 | government-1540-the-american-presidency
 120 | introduction-to-neuroscience-120
 120 | introduction-to-neuroscience-120
 120 | introduction-to-neuroscience-120
 120 | introduction-to-neuroscience-120
 120 | introduction-to-neuroscience-120
 120 | introduction-to-neuroscience-120
(15 rows)

Intro to Neuroscience is from the previous OCW attempt. That has been cascade deleted and will be repopulated with the import OCW script.

What are these other things? They are all notes which have null html and null text.

karmanotes=# SELECT course_id, slug, length(html) AS html_len, length(text) AS text_len FROM notes_note WHERE static_html = FALSE;
 course_id |                       slug                       | html_len | text_
len 
-----------+--------------------------------------------------+----------+----------
        52 | aggregate-demand-componentspdf                   |          |         
        65 | the-roman-games-study-guide                      |          |         
        76 | slavery-and-slave-trade-study-guide-11-9-378297  |          |         
        45 | classnotes-from-22305                            |          |         
        55 | guide-to-jello                                   |          |         
        39 | comparative-politics-of-latin-americ-class-notes |          |         
        46 | cognitive-psychology-notes                       |          |         
        48 | medicine-and-society-midterm-2-guide-11-9-60087  |          |         
        54 | the-american-presidency-study-guide              |          |         
(9 rows)
karmanotes=# SELECT length(NULL);
 length 
--------

(1 row)
karmanotes=# SELECT cc.name,nn.name,nn.uploaded_at,fp_file,mimetype,file_type,pdf_file,gdrive_url FROM notes_note AS nn, courses_course AS cc WHERE static_html = FALSE AND cc.id = nn.course_id;
                          name                           |                        name                        |          uploaded_at          | fp_file | mimetype | file_type | pdf_file | gdrive_url 
---------------------------------------------------------+----------------------------------------------------+-------------------------------+---------+----------+-----------+----------+------------
 Economics 10                                            | Aggregate Demand Components.pdf                    | 2013-11-09 18:11:36.495527+00 |         |          | ???       |          | 
 Culture and Belief 17 - The Roman Games                 | The Roman Games - Study Guide                      | 2013-11-09 18:11:50.345225+00 |         |          | ???       |          | 
 Societies of the World 39 - Slavery and Slave Trade     | Slavery and Slave Trade - Study Guide              | 2013-11-09 18:11:47.378297+00 |         |          | ???       |          | 
 Metaphysical Poetry                                     | Classnotes from 2/23/05                            | 2013-11-09 18:11:43.725581+00 |         |          | ???       |          | 
 History 1330 - Social Thought in Modern America         | Guide to Jello                                     | 2013-11-09 18:11:43.736942+00 |         |          | ???       |          | 
 Government 1295 - Comparative Politics in Latin America | Comparative Politics of Latin Americ - Class Notes | 2013-11-09 18:11:49.267428+00 |         |          | ???       |          | 
 Psychology 13 - Cognitive Psychology                    | Cognitive Psychology - Notes                       | 2013-11-09 18:11:46.11973+00  |         |          | ???       |          | 
 US and the World 13 - Medicine and Society in America   | Medicine and Society - Midterm 2 Guide             | 2013-11-09 18:11:47.060087+00 |         |          | ???       |          | 
 Government 1540 - The American Presidency               | The American Presidency - Study Guide              | 2013-11-09 18:11:49.523136+00 |         |          | ???       |          | 
(9 rows)

Interestingly, all the blank notes were uploaded on 9 November 2013. Probably not a coincidence. There is no information which might help recover these files besides the note name and course name, and even then only the originator would know which file that name refers to. Deleting those notes from the database.

@btbonval
Member

Running MIT OCW BCS dept notes on production in tmux window.

First note finished and shows up in the right course.
http://www.karmanotes.org/massachusetts-institute-of-technology/introduction-to-neuroscience-121/09_vision1pdf

Looks good. links open in a new window. Will leave the script running and check on it later.

@AndrewMagliozzi
Member

I think I can find those blank files again. Stay tuned.

Andrew

@btbonval
Member

BCS and Chemistry department notes uploaded for MIT OCW.

Beginning Anthropology and Economics.

All the notes in Intro to Anthro are missing, but the script now skips missing upstream links:

Course is in the database: Introduction to Anthropology
Uploading link http://ocw.mit.edu/courses/anthropology/21a-100-introduction-to-anthropology-fall-2004/lecture-notes/Ses1_OPENER.pdf to FP.
Failed to upload note: 404 Client Error: NOT FOUND

@btbonval
Member

Wrote a quick little ditty. Notes by department (I'm shooting for departments in the middle as we prioritize):

28 , ./Athletics, Physical Education, and Recreation.json 
36 , ./Literature.json 
63 , ./Writing and Humanistic Studies.json 
82 , ./History.json 
112 , ./Women's and Gender Studies.json 
140 , ./Media Arts and Sciences.json 
151 , ./Experimental Study Group.json 
174 , ./Music and Theater Arts.json 
177 , ./Science, Technology, and Society.json 
213 , ./Comparative Media Studies.json 
226 , ./Foreign Languages and Literatures.json 
312 , ./Special Programs.json 
320 , ./Architecture.json 
330 , ./Biology.json 
347 , ./Anthropology.json 
463 , ./Political Science.json 
475 , ./Nuclear Science and Engineering.json 
478 , ./Brain and Cognitive Sciences.json 
501 , ./Biological Engineering.json 
536 , ./Chemistry.json 
553 , ./Chemical Engineering.json 
602 , ./Health Sciences and Technology.json 
704 , ./Economics.json 
727 , ./Linguistics and Philosophy.json 
857 , ./Physics.json 
883 , ./Materials Science and Engineering.json 
943 , ./Urban Studies and Planning.json 
1088 , ./Earth, Atmospheric, and Planetary Sciences.json 
1166 , ./Engineering Systems Division.json 
1361 , ./Aeronautics and Astronautics.json 
1450 , ./Mechanical Engineering.json 
1484 , ./Civil and Environmental Engineering.json 
1926 , ./Management.json 
2186 , ./Mathematics.json 
3324 , ./Electrical Engineering and Computer Science.json

@btbonval
Member

Anthropology and Economics uploaded.

Physics and PolySci, why not? Launched for import.

@AndrewMagliozzi
Member

Awesome! PS - I found two more spam courses

@AndrewMagliozzi
Member

PPS - Can we remove all courses where the professor is null?

@btbonval
Member

There are 316 notes for courses taught by null professors.

karmanotes=# SELECT count(nn.id) FROM notes_note AS nn, courses_course AS cc, courses_professortaught AS cpt WHERE nn.course_id = cc.id AND cc.id = cpt.course_id AND cpt.professor_id = 1;
 count 
-------
   316
(1 row)

Are you asking me to clear out the MIT OCW courses which have no notes? The MIT OCW script has specific tags we can search against to find all courses which were uploaded by the script and have no notes.

The subquery would look something like this (no ORDER BY clause):

SELECT cc.id, COUNT(nn.id) AS notes FROM courses_course AS cc INNER JOIN taggit_taggeditem AS tt ON (tt.object_id = cc.id) LEFT OUTER JOIN notes_note AS nn ON (cc.id = nn.course_id) WHERE tt.tag_id IN (108,109) GROUP BY cc.id ORDER BY notes ASC, cc.id ASC;

@btbonval
Member

231 MIT scraped courses have no notes. 200 MIT scraped courses have notes.

@AndrewMagliozzi
Member

That is exactly what I was thinking.

@btbonval
Member

Done. According to the front page, only one course has no notes now. I see you also deleted another spam course that popped up.

@btbonval
Member

Andrew and I agree this ticket is done, but we might continue the discussion about MIT notes on it.

@btbonval
Member

I did this to clean out MIT OCW courses with no notes. Ugly nested subqueries, but it is fast enough and gets the job done. Might be worth an additional join so the tag IDs are not hard coded.

DELETE FROM courses_course
WHERE id IN
    (SELECT id FROM
        (SELECT cc.id, COUNT(nn.id) AS notes
         FROM courses_course AS cc
             INNER JOIN taggit_taggeditem AS tt ON (tt.object_id = cc.id)
             LEFT OUTER JOIN notes_note AS nn ON (cc.id = nn.course_id)
         WHERE tt.tag_id IN (108,109) GROUP BY cc.id) AS subquery
     WHERE notes = 0);

@btbonval
Member

btbonval commented Mar 8, 2015

python script for counting notes per course in the OCW json file.

import sys
import json
from itertools import imap

# filename supplied as the first argument
filename = sys.argv[1]

# load the json structure from the supplied filename
fd = open(filename, 'r')
fc = json.load(fd)
fd.close()

# prepare some structures
courses = fc['courses']
ncourses = len(courses)
def num_links(obj):
    # return the number of note links, or 0 if the key is missing
    return len(obj.get('noteLinks', []))

# sum the notes for all courses
nnotes = sum(imap(num_links, iter(courses)))

print "{0},{1}".format(nnotes, filename)

Run it something like so:

find ./ -name "*.json" -print0 | xargs -0 -i% python ../count.py % | sort -n
