Import MIT Notes #68

Closed
sethwoodworth opened this issue Feb 1, 2013 · 81 comments
@sethwoodworth
Member

KarmaNotes is using CC-by on all pages.

inherit OCW CC-by-nc onto OCW pages for both course and note.

possibly create a license table. There would be two entries to start: index 0 = CC-by, index 1 = CC-by-nc. Add a license FK from the course and note models to the license table.

Default = 0 for KarmaNotes.

Importing from OCW will explicitly set license to 1.
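A minimal sketch of the proposed license lookup, with hypothetical names (`LICENSES`, `license_for`); in practice the table would be a Django model with a FK from course and note:

```python
# Hypothetical in-memory stand-in for the proposed license table.
# Index 0 = CC-by (KarmaNotes default), index 1 = CC-by-nc (OCW imports).
LICENSES = [
    {"id": 0, "name": "cc-by"},
    {"id": 1, "name": "cc-by-nc"},
]

def license_for(source):
    """Return the license FK value for a note based on where it came from."""
    return 1 if source == "ocw" else 0
```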

@AndrewMagliozzi
Member

Let's also make sure we add a link to the original page so we maintain CC-BY-NC compliance.

On Jan 31, 2013, at 6:41 PM, Seth Woodworth notifications@github.com wrote:

Consider adding ten of the top courses with video from MIT OCW as well. Do this by hand and add no more than 15 courses per network.

@sethwoodworth
Member Author

CC-BY-NC licensing is now issue #97

@ghost ghost assigned AndrewMagliozzi Dec 16, 2013
@AndrewMagliozzi
Member

Here is a link to the scraper for the MIT-OCW site: https://github.com/AndrewMagliozzi/mit-ocw-scraper (make sure to checkout the MIT-notes branch)

@btbonval
Member

btbonval commented Jan 2, 2014

      "courseLink": "http://ocw.mit.edu//courses/nuclear-engineering/22-105-electromagnetic-interactions-fall-2005",
      "courseStub": "22-105-electromagnetic-interactions-fall-2005",
      "courseTitle": "Electromagnetic Interactions",
      "professor": "Prof. Jeffrey Freidberg",
      "noteLinks": [
        {
          "link": "http://ocw.mit.edu/courses/nuclear-engineering/22-105-electromagnetic-interactions-fall-2005/lecture-notes/lecture1.pdf",
          "fileName": "lecture1.pdf"
        },

parse out course info: year from courseLink, title, professor.

parse out all links as notes. parse note info: title from fileName, no email address, tags of mit-ocw and karma.
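As a rough sketch of that parsing (pure Python; helper names are hypothetical, and the year regex assumes stubs always end in a four-digit year, as in the example above):

```python
import re

def parse_course(course):
    # Year: trailing four-digit group on the courseStub (e.g. ...-fall-2005).
    m = re.search(r"(\d{4})$", course["courseStub"])
    return {
        "title": course["courseTitle"],
        "instructor": course["professor"],
        "year": int(m.group(1)) if m else None,
    }

def parse_note(note):
    # Note title: fileName minus its extension; fixed tags, no email address.
    return {
        "title": note["fileName"].rsplit(".", 1)[0],
        "tags": ["mit-ocw", "karma"],
        "link": note["link"],
    }
```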

@btbonval
Member

btbonval commented Jan 3, 2014

Also modify the database for handling licenses.

@ghost ghost assigned btbonval Jan 3, 2014
@btbonval
Member

btbonval commented Jan 3, 2014

pass remote link to FilePicker. (figure that bit out)

how to convert filepicker results and shove them into database: https://github.com/FinalsClub/karmaworld/blob/c5af62fe0c2d14f2420f1eef0ab577b95f2e68d9/karmaworld/apps/document_upload/tests.py

@btbonval
Member

btbonval commented Jan 3, 2014

license handling of #97 is done in commit 34ea96f

@btbonval
Member

btbonval commented Jan 3, 2014

looks like there is no pythonic interface to FilePicker. Best answer seems to always be curl.
http://stackoverflow.com/questions/14115280/store-files-to-filepicker-io-from-the-command-line

Might as well implement something with urllib or whatevs, grab the API key out of secrets, whatnot.

@btbonval
Member

btbonval commented Jan 6, 2014

hrm. curl -F blah=@file will use multipart/form-data to upload files as though submit to a form. This is recommended by the above stackoverflow and on Filepicker's RESTful API:
https://developers.inkfilepicker.com/docs/web/#inkblob-store

However, when I upload files using requests multipart/form-data, the MIME type returned by Filepicker is "multipart/form-data" rather than the MIME type of the actual file.
http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file

@btbonval
Member

btbonval commented Jan 6, 2014

I give up for now. No matter what I do, Filepicker says the file type is "multipart/form-data", yet I see no reason for this. Check back with fresh eyes.

commit to feature_ocw_upload in 3eb6d5e

Only other thing I can think of is to pass in the byte array using dlresp.content instead of the file-like object of dlresp.raw, but that shouldn't change how the files parameter works for the requests POST (and thus should not affect the mimetype interpretation). Worth a try though.

@btbonval
Member

btbonval commented Jan 6, 2014

this is the bit that won't seem to upload properly:

# Upload raw contents of note to Filepicker
# https://developers.inkfilepicker.com/docs/web/#inkblob-store
print "Uploading to FP."
ulresp = requests.post(fpurl, files={
#'fileUpload': (note['fileName'], dlresp.raw)
'fileUpload': dlresp.raw,
})
ulresp.raise_for_status()
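One likely culprit: `dlresp.raw` has no file name, so the multipart part carries no useful content type. `requests` accepts a `(filename, fileobj, content_type)` tuple per part, which pins the MIME type explicitly. A sketch with a hypothetical `part_for` helper (`mimetypes` is stdlib):

```python
import mimetypes

def part_for(file_name, file_obj):
    # Guess the MIME type from the file name so Filepicker sees e.g.
    # application/pdf rather than a generic multipart type.
    content_type = mimetypes.guess_type(file_name)[0] or "application/octet-stream"
    return {"fileUpload": (file_name, file_obj, content_type)}
```

The upload call would then become `requests.post(fpurl, files=part_for(note['fileName'], dlresp.raw))`.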

@AndrewMagliozzi
Member

Is there an option to do a buffered download?

@AndrewMagliozzi
Member

I also don't think we need to download each note to memory. We can simply pass the MIT link directly to Filepicker, then use the FP response for GDrive processing. Give a buzz when you're up, Bryan. I'd like to help with this.

@btbonval
Member

btbonval commented Jan 6, 2014

Have to take the cat to the vet shortly, but I'll be ready to take a look when I get back.

Good thought on uploading via link, but I didn't see how to do that via the FP RESTful API docs. Should be possible.

@AndrewMagliozzi
Member

I think you can just pass the URL instead of the local file path. Let's try it when you get back.

@AndrewMagliozzi
Member

curl -X POST -d "url=palmzlib.sourceforge.net/images/pengbrew.png" "filepicker.io/api/store/S3?key=MY_API_KEY&path=/images/…"

@btbonval
Member

btbonval commented Jan 6, 2014

aha, it's in the API.

curl -X POST -d url="https://www.inkfilepicker.com/static/img/watermark.png" https://www.filepicker.io/api/store/S3?key=MY_API_KEY

This is how you specify the URL to FP and let them download it.
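The same call from Python without pulling the file locally. This sketch just assembles the endpoint and form body (hypothetical `store_url_request` helper; the actual POST would go through requests or urllib):

```python
from urllib.parse import urlencode  # plain urllib.urlencode in the Python 2 code above

FP_STORE = "https://www.filepicker.io/api/store/S3"

def store_url_request(api_key, url):
    # Mirrors: curl -X POST -d url=<url> <FP_STORE>?key=<api_key>
    endpoint = "{0}?key={1}".format(FP_STORE, api_key)
    body = urlencode({"url": url})
    return endpoint, body
```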

@btbonval
Member

btbonval commented Jan 6, 2014

Getting non-unique error from same course over different academic years.

DETAIL:  Key (school_id, name, instructor_name)=(10464, Designing Your Life, Gabriella Jordan, Lauren Zander) already exists.

There is a unique constraint which does not include Academic Year but should.

However, there is no way to add Academic Year in the form. #253

Also we need to toss department into the import following completion of #236

@btbonval
Member

btbonval commented Jan 6, 2014

Notes are duplicating. It appears Django is deciding to insert instead of update. One note has license and upstream_link set, the other does not. There is a single call of gdrive's convert_raw_document over a single RawDocument object.

@btbonval
Member

btbonval commented Jan 6, 2014

RawDocument is updated in convert_raw_document. Note only has save called once, excepting possibly the call to sanitize_html or some other Note method which might do its own save.

@btbonval
Member

btbonval commented Jan 6, 2014

RawDocument.save calls celery to run convert_raw_document via process_raw_document.

So celery does it one time and the conversion code does it one time.

@btbonval
Member

btbonval commented Jan 6, 2014

#253 is no longer the fix for Academic Year unique problems.

remove "year" from the create_or_get statement so that it grabs the correct course agnostic of year.
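In other words, the lookup key behind create_or_get shrinks to (school, name, instructor), so re-imports across academic years land on the same course row. Roughly (hypothetical helper):

```python
def course_lookup_key(school_id, name, instructor_name):
    # Year deliberately excluded: the same course imported from a different
    # academic year must match the existing row rather than violate the
    # unique (school_id, name, instructor_name) constraint.
    return (school_id, name.strip(), instructor_name.strip())
```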

@btbonval
Member

btbonval commented Jan 6, 2014

VM is sucking in courses.

Start new VM from scratch, suck in ALL notes.

If that works, move to beta.

@btbonval
Member

btbonval commented Jan 7, 2014

Upload to VM one time. If everything works well, switch over to using dump_json and restore_json to bring the VM notes over to beta.

@btbonval
Member

403 from here:

ulresp = requests.post(fpurl, data={
'url': url,
})
ulresp.raise_for_status()

Filepicker didn't return forbidden before.

@btbonval
Member

We did just change the filepicker API. I suppose it won't hurt to use the old one as a test.

yup. original filepicker API works fine, newer one fails. Does that mean beta Filepicker will fail?

@btbonval
Member

Things look mostly good. The static URL is returning error 403.
https://s3.amazonaws.com/karma-beta/html/09_vision1pdf.html

While this link works just fine both in a new tab and imported onto the page:
https://s3.amazonaws.com/karma-beta/css/global.css

@btbonval
Member

folders within buckets do not have special permissions. buckets have permissions as a whole.

farg.

@btbonval
Member

Interestingly, "Static web hosting" is not enabled for the bucket at all. So whatever we're doing, we're not checking those tick marks.

Man, I remember this from before. there's some evil voodoo crap going on. Some things work and some things do not work. Last time I had to nuke the VM and start over and suddenly CSS and so forth started working from S3. No changes to the S3 server made.

@btbonval
Member

The original S3 static hosting instructions we used certainly did not mention anything at all about changing S3 settings themselves, just how to make Django push static files up to S3.
#65 (comment)

Previous dark time with no real resolution:
#192

I think we're not doing it right, but somehow we're getting lucky.

@btbonval
Member

Each S3 object has its permissions. There is no way to inherit permissions from the bucket. There is no way to batch apply permissions across all objects in a bucket through the S3 interface.

The only answer here is to change permissions on the Key at upload time in the Note.send_to_s3() code.
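A boto-style sketch of that change in Note.send_to_s3(); `key` is assumed to be a boto `Key` object, and the essential addition is the per-object `set_acl` call:

```python
def send_to_s3(key, html):
    # Upload the note HTML, then make this specific object publicly readable.
    # S3 permissions here are per-object, not inherited from the bucket, so
    # the ACL must be applied on each Key at upload time.
    key.set_contents_from_string(html)
    key.set_acl('public-read')
```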

@btbonval
Member

Migrating Filepicker URLs won't work directly. Files uploaded to beta are on the Beta filepicker account, so the links are under that account's management.

  1. Can we check if the fp_file URL is owned by the current Filepicker API?
  2. Can we add a hook to Python's load_data which checks that, and if not owned, migrates it over?

@AndrewMagliozzi
Member

I don't believe there is a way to check which account a link belongs to.

@AndrewMagliozzi
Member

We'll have to use the prod Filepicker account creds on your VM for the MIT data.

@btbonval
Member

I'm hoping the Filepicker API will have something clever I can use, even if it has to do it by checksum (which could be slow, but worthwhile).

Actually, I have been using the prod Filepicker account on my VM, which has a secondary benefit: I can see the HTML uploaded on S3.

Beta's Filepicker uploads to Filepicker's own S3, which we don't have access to.

@btbonval
Member

You're right. Nothing helpful in the Filepicker API. You can CRUD each file given its filepicker URL, but there isn't even a way to list files.
https://developers.inkfilepicker.com/docs/web/#rest

That puts a very minor wrench in the cogs. It means we won't be able to test this import stuff on beta without pointing at prod's static S3 URL. Easy thing to do for a quick read test, and then change it back.
-Bryan

@btbonval
Member

Since the first 15 or so notes were converted to HTML poorly, I deleted them from S3. (I also deleted the other poorly converted HTML from the database.)
#273 (comment)

Reran populate_s3 to fix the stuff with HTML in the database.

That left 15 notes that aren't statically hosted on S3, of which a few are from the previous import OCW tests:

karmanotes=# SELECT cc.id, cc.slug FROM courses_course AS cc, notes_note AS nn WHERE nn.static_html = FALSE AND cc.id = nn.course_id;
 id  |                         slug                          
-----+-------------------------------------------------------
  52 | economics-10
  65 | culture-and-belief-17-the-roman-games
  76 | societies-of-the-world-39-slavery-and-slave-trade
  45 | metaphysical-poetry
  55 | history-1330-social-thought-in-modern-america
  39 | government-1295-comparative-politics-in-latin-america
  46 | psychology-13-cognitive-psychology
  48 | us-and-the-world-13-medicine-and-society-in-america
  54 | government-1540-the-american-presidency
 120 | introduction-to-neuroscience-120
 120 | introduction-to-neuroscience-120
 120 | introduction-to-neuroscience-120
 120 | introduction-to-neuroscience-120
 120 | introduction-to-neuroscience-120
 120 | introduction-to-neuroscience-120
(15 rows)

Intro to Neuroscience is from the previous OCW attempt. That has been cascade deleted and will be repopulated with the import OCW script.

What are these other things? They are all notes which have null html and null text.

karmanotes=# SELECT course_id, slug, length(html) AS html_len, length(text) AS text_len FROM notes_note WHERE static_html = FALSE;
 course_id |                       slug                       | html_len | text_
len 
-----------+--------------------------------------------------+----------+----------
        52 | aggregate-demand-componentspdf                   |          |         
        65 | the-roman-games-study-guide                      |          |         
        76 | slavery-and-slave-trade-study-guide-11-9-378297  |          |         
        45 | classnotes-from-22305                            |          |         
        55 | guide-to-jello                                   |          |         
        39 | comparative-politics-of-latin-americ-class-notes |          |         
        46 | cognitive-psychology-notes                       |          |         
        48 | medicine-and-society-midterm-2-guide-11-9-60087  |          |         
        54 | the-american-presidency-study-guide              |          |         
(9 rows)
karmanotes=# SELECT length(NULL);
 length 
--------

(1 row)
karmanotes=# SELECT cc.name,nn.name,nn.uploaded_at,fp_file,mimetype,file_type,pdf_file,gdrive_url FROM notes_note AS nn, courses_course AS cc WHERE static_html = FALSE AND cc.id = nn.course_id;
                          name                           |                        name                        |          uploaded_at          | fp_file | mimetype | file_type | pdf_file | gdrive_url 
---------------------------------------------------------+----------------------------------------------------+-------------------------------+---------+----------+-----------+----------+------------
 Economics 10                                            | Aggregate Demand Components.pdf                    | 2013-11-09 18:11:36.495527+00 |         |          | ???       |          | 
 Culture and Belief 17 - The Roman Games                 | The Roman Games - Study Guide                      | 2013-11-09 18:11:50.345225+00 |         |          | ???       |          | 
 Societies of the World 39 - Slavery and Slave Trade     | Slavery and Slave Trade - Study Guide              | 2013-11-09 18:11:47.378297+00 |         |          | ???       |          | 
 Metaphysical Poetry                                     | Classnotes from 2/23/05                            | 2013-11-09 18:11:43.725581+00 |         |          | ???       |          | 
 History 1330 - Social Thought in Modern America         | Guide to Jello                                     | 2013-11-09 18:11:43.736942+00 |         |          | ???       |          | 
 Government 1295 - Comparative Politics in Latin America | Comparative Politics of Latin Americ - Class Notes | 2013-11-09 18:11:49.267428+00 |         |          | ???       |          | 
 Psychology 13 - Cognitive Psychology                    | Cognitive Psychology - Notes                       | 2013-11-09 18:11:46.11973+00  |         |          | ???       |          | 
 US and the World 13 - Medicine and Society in America   | Medicine and Society - Midterm 2 Guide             | 2013-11-09 18:11:47.060087+00 |         |          | ???       |          | 
 Government 1540 - The American Presidency               | The American Presidency - Study Guide              | 2013-11-09 18:11:49.523136+00 |         |          | ???       |          | 
(9 rows)

Interestingly, all the blank notes were uploaded on 9 November 2013. Probably not a coincidence. There is no information which might help recover these files besides the note name and course name, and even then only the originator would know which file that name refers to. Deleting those notes from the database.

@btbonval
Member

Running MIT OCW BCS dept notes on production in tmux window.

First note finished and shows up in the right course.
http://www.karmanotes.org/massachusetts-institute-of-technology/introduction-to-neuroscience-121/09_vision1pdf

Looks good. links open in a new window. Will leave the script running and check on it later.

@AndrewMagliozzi
Member

I think I can find those blank files again. Stay tuned.

Andrew

@btbonval
Member

BCS and Chemistry department notes uploaded for MIT OCW.

Beginning Anthropology and Economics.

All the notes in Intro to Anthro are missing, but the script now skips missing upstream links:

Course is in the database: Introduction to Anthropology
Uploading link http://ocw.mit.edu/courses/anthropology/21a-100-introduction-to-anthropology-fall-2004/lecture-notes/Ses1_OPENER.pdf to FP.
Failed to upload note: 404 Client Error: NOT FOUND

@btbonval
Member

Wrote a quick little ditty. Notes by department (I'm shooting for departments in the middle as we prioritize):

28 , ./Athletics, Physical Education, and Recreation.json 
36 , ./Literature.json 
63 , ./Writing and Humanistic Studies.json 
82 , ./History.json 
112 , ./Women's and Gender Studies.json 
140 , ./Media Arts and Sciences.json 
151 , ./Experimental Study Group.json 
174 , ./Music and Theater Arts.json 
177 , ./Science, Technology, and Society.json 
213 , ./Comparative Media Studies.json 
226 , ./Foreign Languages and Literatures.json 
312 , ./Special Programs.json 
320 , ./Architecture.json 
330 , ./Biology.json 
347 , ./Anthropology.json 
463 , ./Political Science.json 
475 , ./Nuclear Science and Engineering.json 
478 , ./Brain and Cognitive Sciences.json 
501 , ./Biological Engineering.json 
536 , ./Chemistry.json 
553 , ./Chemical Engineering.json 
602 , ./Health Sciences and Technology.json 
704 , ./Economics.json 
727 , ./Linguistics and Philosophy.json 
857 , ./Physics.json 
883 , ./Materials Science and Engineering.json 
943 , ./Urban Studies and Planning.json 
1088 , ./Earth, Atmospheric, and Planetary Sciences.json 
1166 , ./Engineering Systems Division.json 
1361 , ./Aeronautics and Astronautics.json 
1450 , ./Mechanical Engineering.json 
1484 , ./Civil and Environmental Engineering.json 
1926 , ./Management.json 
2186 , ./Mathematics.json 
3324 , ./Electrical Engineering and Computer Science.json

@btbonval
Member

Anthropology and Economics uploaded.

Physics and PolySci, why not? Launched for import.

@AndrewMagliozzi
Member

Awesome! PS - I found two more spam courses

@AndrewMagliozzi
Member

PPS - Can we remove all courses where the professor is null?

@btbonval
Member

There are 316 notes for courses taught by null professors.

karmanotes=# SELECT count(nn.id) FROM notes_note AS nn, courses_course AS cc, courses_professortaught AS cpt WHERE nn.course_id = cc.id AND cc.id = cpt.course_id AND cpt.professor_id = 1;
 count 
-------
   316
(1 row)

Are you asking me to clear out the MIT OCW courses which have no notes? The MIT OCW script has specific tags we can search against to find all courses which were uploaded by the script and have no notes.

The subquery would look something like this (no ORDER BY clause):

SELECT cc.id, COUNT(nn.id) AS notes FROM courses_course AS cc INNER JOIN taggit_taggeditem AS tt ON (tt.object_id = cc.id) LEFT OUTER JOIN notes_note AS nn ON (cc.id = nn.course_id) WHERE tt.tag_id IN (108,109) GROUP BY cc.id ORDER BY notes ASC, cc.id ASC;

@btbonval
Member

231 MIT scraped courses have no notes. 200 MIT scraped courses have notes.

@AndrewMagliozzi
Member

That is exactly what I was thinking.

@btbonval
Member

Done. According to the front page, only one course has no notes now. I see you also deleted another spam course that popped up.

@btbonval
Member

Andrew and I agree this ticket is done, but we might continue the discussion about MIT notes on it.

@btbonval
Member

I did this to clean out MIT OCW courses with no notes. Ugly nested subqueries, but it is fast enough and gets the job done. Might be worth an additional join so the tag IDs are not hard coded.

DELETE FROM courses_course
WHERE id IN
    (SELECT id FROM
        (SELECT cc.id, COUNT(nn.id) AS notes
         FROM courses_course AS cc
             INNER JOIN taggit_taggeditem AS tt ON (tt.object_id = cc.id)
             LEFT OUTER JOIN notes_note AS nn ON (cc.id = nn.course_id)
         WHERE tt.tag_id IN (108,109) GROUP BY cc.id) AS subquery
     WHERE notes = 0);

@btbonval
Member

btbonval commented Mar 8, 2015

python script for counting notes per course in the OCW json file.

import sys
import json
from itertools import imap

# filename supplied as the first argument
filename = sys.argv[1]

# load the json structure from the supplied filename
fd = open(filename, 'r')
fc = json.load(fd)
fd.close()

# prepare some structures
courses = fc['courses']
ncourses = len(courses)
def num_links(obj):
    # return the number of note links, or 0 if the key is missing
    return len(obj.get('noteLinks', []))

# sum the notes for all courses
nnotes = sum(imap(num_links, iter(courses)))

print "{0},{1}".format(nnotes, filename)

Run it something like so:

find ./ -name "*.json" -print0 | xargs -0 -i% python ../count.py % | sort -n
