# Update records on GEOMG via script

This script uses Python to modify documents on GEOMG. It is most useful when the same or a similar change must be applied to a large number of documents. Instead of downloading and re-uploading records, or opening and editing each one on its web page, the script updates the chosen fields for all listed records in one run.




> Originally created by **Gene Cheng [(@Ziiiiing)](https://github.com/Ziiiiing)** on **Oct 3, 2021**

In [None]:
# uncomment & run this cell if the 'mechanize' module is not installed yet

# pip install mechanize

In [None]:
import mechanize
import time
import csv

## Step 1: Prepare Modification

Prepare the **new value** and the corresponding **name** for each updated field. **Name** here refers to the name attribute of the input element in HTML. Store all pairs in a dictionary.

> For example, the `name` attribute for `<input class="..." type="text" name="document[title]" value="..." id="...">` is **document[title]**.

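If you have saved the HTML of an edit page, the name attributes can be collected programmatically rather than read by eye. The following is a minimal sketch using only the standard library; the `NameCollector` class and the sample HTML (including its `class`, `value`, and `id` values) are illustrative assumptions, not part of the original script or of GEOMG's actual markup.

```python
from html.parser import HTMLParser


class NameCollector(HTMLParser):
    """Collect the name attribute of every form control in an HTML string."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Only form controls carry the names that the script needs.
        if tag in ("input", "select", "textarea"):
            name = dict(attrs).get("name")
            if name:
                self.names.append(name)


# Hypothetical snippet standing in for a saved edit page.
sample_html = """
<form>
  <input class="form-control" type="text" name="document[title]" value="Old title" id="document_title">
  <select name="document[b1g_status_s]"><option value="Active">Active</option></select>
</form>
"""

collector = NameCollector()
collector.feed(sample_html)
print(collector.names)
```

Each collected name can then be used as a key in the dictionary described above.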

**The following tables list the name attribute corresponding to each editable field:**

***[will be finished soon]***



#### Identification
|             | Field              | Name attr.                                  | Control Type | Format    |
| :---------- | :----------------- | :------------------------------------------ | :----------- | :-------- |
| Descriptive | Title              | document[title]                             |              |           |
|             | Alternative Title  | document[dct_alternative_sm_attributes][]   |              |           |
|             | Description        | document[dct_description_sm_attributes][]   |              |           |
|             | Language           | document[dct_language_sm_attributes][]      |              |           |
| Credits     | Creator            | document[dct_creator_sm_attributes][]       |              |           |
|             | Publisher          | document[dct_publisher_sm_attributes][]     |              |           |
|             | Provider           | document[schema_provider_s]                 |              |           |
| Categories  | Resource Class     | document[gbl_resourceClass_sm_attributes][] |              |           |
|             | Resource Type      | document[gbl_resourceType_sm_attributes][]  |              |           |
|             | Subject            | document[dct_subject_sm_attributes][]       |              |           |
|             | ISO Topic Category | document[dcat_theme_sm_attributes][]        |              |           |
|             | Keyword            | document[dcat_keyword_sm_attributes][]      |              |           |
| Temporal    | Temporal Coverage  | document[dct_temporal_sm_attributes][]      |              |           |
|             | Date Issued        | document[dct_issued_s]                      |              | YYYY      |
|             | Date range         | document[gbl_dateRange_drsim_attributes][]  |              | YYYY-YYYY |
| Spatial     | Spatial Coverage   | document[dct_spatial_sm_attributes][]       |              |           |
|             | Bounding Box       | document[locn_geometry]                     |              | W,S,E,N   |
|             | GeoNames           | document[b1g_geonames_sm_attributes][]      |              |           |
| Relations   | Relation           | document[dct_relation_sm_attributes][]      |              |           |
|             | Member Of          | document[pcdm_memberOf_sm_attributes][]     |              |           |
|             | Is Part Of         | document[dct_isPartOf_sm_attributes][]      |              |           |
|             | Source             | document[dct_source_sm_attributes][]        |              |           |
|             | Version            | document[dct_isVersionOf_sm_attributes][]   |              |           |
|             | Replaces           | document[dct_replaces_sm_attributes][]      |              |           |
|             | Is Replaced By     | document[dct_isReplacedBy_sm_attributes][]  |              |           |

#### Distribution
|        | Field              | Name attr.                                             | Control Type | Format |
| :----- | :----------------- | :----------------------------------------------------- | :----------- | :----- |
| Object | Format             | document[dct_format_s]                                 |              |        |
|        | File Size          | document[gbl_fileSize_s]                               |              |        |
|        | WxS Identifier     | document[gbl_wxsIdentifier_s]                          |              |        |
|        | Georeferenced      | document[gbl_georeferenced_b]                          |              |        |
| Links  | Reference Category | document[dct_references_s_attributes][index][category] |              |        |
|        | Reference Value    | document[dct_references_s_attributes][index][value]    |              |        |
|        | B1G Image URL      | document[b1g_image_ss]                                 |              |        |

#### Administrative
|               | Field               | Name attr.                                    | Control Type | Format |
| :------------ | :------------------ | :-------------------------------------------- | :----------- | :----- |
| Codes         | ID                  | document[geomg_id_s]                          |              |        |
|               | Identifier          | document[dct_identifier_sm_attributes][]      |              |        |
|               | Code                | document[b1g_code_s]                          |              |        |
| Rights        | Access Rights       | document[dct_accessRights_s]                  |              |        |
|               | Rights Holder       | document[dct_rightsHolder_sm_attributes][]    |              |        |
|               | License             | document[dct_license_sm_attributes][]         |              |        |
|               | Rights              | document[dct_rights_sm_attributes][]          |              |        |
| Life Cycle    | Accrual Method      | document[b1g_dct_accrualMethod_s]             |              |        |
|               | Accrual Periodicity | document[b1g_dct_accrualPeriodicity_s]        |              |        |
|               | Date Accessioned    | document[b1g_dateAccessioned_sm_attributes][] |              |        |
|               | Date Retired        | document[b1g_dateRetired_s]                   |              |        |
|               | Status              | document[b1g_status_s]                        |              |        |
|               | Publication State   | document[publication_state]                   |              |        |
| Accessibility | Gbl suppressed b    | document[gbl_suppressed_b]                    |              |        |
|               | Child Record        | document[b1g_child_record_b]                  |              |        |
|               | Mediator            | document[b1g_dct_mediator_sm_attributes][]    |              |        |
|               | Access              | document[b1g_access_s]                        |              |        |

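The Format column above gives a pattern for three fields (`YYYY`, `YYYY-YYYY`, and `W,S,E,N`). A small check run before submitting can catch malformed values early. This is only a sketch: the helper names are hypothetical, the coordinate ranges assume decimal degrees, and GEOMG itself may accept or reject values differently.

```python
import re

# Patterns taken from the "Format" column of the tables above.
FORMAT_RULES = {
    "document[dct_issued_s]": re.compile(r"\d{4}"),                          # YYYY
    "document[gbl_dateRange_drsim_attributes][]": re.compile(r"\d{4}-\d{4}"),  # YYYY-YYYY
}


def is_valid_bbox(value):
    """Return True if value looks like 'W,S,E,N' in decimal degrees (assumed)."""
    parts = value.split(",")
    if len(parts) != 4:
        return False
    try:
        west, south, east, north = (float(p) for p in parts)
    except ValueError:
        return False
    return (-180 <= west <= 180 and -180 <= east <= 180
            and -90 <= south <= north <= 90)


def find_format_problems(modifies):
    """Return the names in 'modifies' whose values do not match the expected format."""
    problems = []
    for name, value in modifies.items():
        rule = FORMAT_RULES.get(name)
        if rule is not None and not rule.fullmatch(str(value)):
            problems.append(name)
        if name == "document[locn_geometry]" and not is_valid_bbox(str(value)):
            problems.append(name)
    return problems
```

Calling `find_format_problems(modifies)` before Step 4 returns an empty list when every checked field matches its pattern.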

In [None]:
# Find the name-value pairs and store in a dictionary
modifies = {}

In [None]:
# Hello, please edit here !!!
# Example of constructing name-value pairs

modifies["document[title]"] = "I am a new title!"
modifies["document[b1g_dateRetired_s]"] = "2021-10-03"
modifies["document[b1g_status_s]"] = ["Inactive"]
modifies["document[publication_state]"] = ["unpublished"]

## Step 2: Prepare IDs

In [None]:
# Store the IDs of the documents (records) to update in a list.
doc_id = []

In [None]:
# Hello, please edit here !!
# Example of constructing the id list

doc_id.append("322ee79b35f748869974ec661bd04bbc_10")
doc_id.append("322ee79b35f748869974ec661bd04bbc_27")
doc_id.append("9b2537e7a6e749328d84ab8d071f5c9f_158")

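For long lists, the IDs could instead be read from a CSV file. The sketch below assumes a hypothetical column header `ID` and uses an in-memory sample in place of a real file; the `read_ids` helper is not part of the original script.

```python
import csv
import io


def read_ids(csv_file, column="ID"):
    """Return the non-empty, de-duplicated values of 'column', in file order."""
    ids = []
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip()
        if value and value not in ids:
            ids.append(value)
    return ids


# In-memory stand-in for a file; with a real file use open(path, newline="").
sample = io.StringIO(
    "ID,Title\n"
    "322ee79b35f748869974ec661bd04bbc_10,First\n"
    "322ee79b35f748869974ec661bd04bbc_27,Second\n"
    "322ee79b35f748869974ec661bd04bbc_10,Duplicate\n"
)
ids_from_csv = read_ids(sample)
print(ids_from_csv)
```

The resulting list can be assigned to `doc_id` in place of the manual `append` calls.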
## Step 3: User Login on GEOMG

With the preparation done, we are ready to interact with GEOMG. First, replace the values of `username` and `password` with your own GEOMG login credentials. Do not share or publish the notebook with your credentials filled in.

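One way to keep credentials out of the notebook is to read them from environment variables and fall back to an interactive prompt. This is a sketch under assumptions: the variable names `GEOMG_USERNAME` and `GEOMG_PASSWORD` and the `load_credentials` helper are hypothetical and not defined by GEOMG or the original script.

```python
import os
from getpass import getpass


def load_credentials(env=None, prompt=getpass):
    """Return (username, password) from the environment, prompting if missing."""
    env = os.environ if env is None else env
    # Hypothetical variable names; pick whatever suits your setup.
    username = env.get("GEOMG_USERNAME") or input("GEOMG username: ")
    password = env.get("GEOMG_PASSWORD") or prompt("GEOMG password: ")
    return username, password


# Usage (prompts only for values that are not set in the environment):
# username, password = load_credentials()
```

With this approach the password never appears in a saved cell or its output.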
In [None]:
# Hello, please edit here !!

username = "<your_username>"
password = "<your_password>"

In [None]:
# Perform login 
login_url = "https://geomg.lib.umn.edu/users/sign_in"

br = mechanize.Browser()
br.set_handle_robots(False)   # ignore robots

# browse the Login Page and select the right form for login
br.open(login_url)
br.select_form(nr=1)

# input and submit the username & password
br["user[email]"] = username
br["user[password]"] = password
br.submit()

# a successful login redirects away from the login page
if br.geturl() == login_url:
    print(">>> Failed to login.")
else:
    print(">>> Successfully logged in.")


## Step 4: Scrape & Modify Fields Online

In [None]:
# iterate the 'modifies' dictionary and make updates
count = 0
nonexist = []
failed = []

for item in doc_id:
    count += 1
    item_url = "https://geomg.lib.umn.edu/documents/{}".format(item)

    try:
        br.open(item_url)          # open the edit page for each record
        br.select_form(nr=1)       # the index of the form is 1
    
        # apply every name-value pair to the form
        for name, val in modifies.items():
            br[name] = val

        # submit the changes for this document
        br.submit()
        print(">>> [{}/{}] Updating {} .................... √".format(count, len(doc_id), item))
    
    # HTTP errors: a 404 means the record does not exist; anything else is a failed update
    except mechanize.HTTPError as e:
        print(">>> [{}/{}] Updating {} .................... x".format(count, len(doc_id), item))
        if e.code == 404:
            nonexist.append(item)
        else:
            failed.append(item)        # store failed item and try again later
    # any other error, e.g. a form control that is missing or rejects the value
    except Exception:
        print(">>> [{}/{}] Updating {} .................... x".format(count, len(doc_id), item))
        failed.append(item)
            
# print out the summary
print('\n-------------- Summary --------------')
print('Successful Updates: {}'.format(len(doc_id)-len(nonexist)-len(failed)))
print('Records Not Exist: {}'.format(len(nonexist)))
print('Failed Updates: {}'.format(len(failed)))

if failed:
    print('\n-------------- Manual Edits Needed for Failed Updates --------------')
    for item in failed:
        item_url = 'https://geomg.lib.umn.edu/documents/{}'.format(item)
        print(item_url)
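
Before editing the failed records by hand, they could be retried automatically, since some failures may be transient. The helper below is a generic sketch, not part of the original script: `retry_failed`, its parameters, and the `update_one` callable (which would wrap the open, select, fill, and submit steps for one record) are all hypothetical.

```python
import time


def retry_failed(items, update_one, attempts=3, delay=2.0, sleep=time.sleep):
    """Call update_one(item) up to 'attempts' times; return items that still fail."""
    still_failed = []
    for item in items:
        for attempt in range(1, attempts + 1):
            try:
                update_one(item)
                break                      # success: stop retrying this item
            except Exception:
                if attempt < attempts:
                    sleep(delay * attempt)  # wait a little longer each time
        else:
            # the loop finished without a successful call
            still_failed.append(item)
    return still_failed


# Usage, assuming update_one(item) performs the update for a single record:
# failed = retry_failed(failed, update_one)
```

Only the records returned by the helper would then need manual edits.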