# Categorical data validation
What expression asserts that the distinct Nobel Prize categories catalogued by the "prizes" collection are the same as those catalogued by the "laureates"? Remember to explore example documents in the console via e.g. db.prizes.find_one() and db.laureates.find_one().

In [1]:
from pymongo import MongoClient

# Connect to the local MongoDB server
client = MongoClient('mongodb://localhost:27017')

# Connect to the "nobel" database
db = client.nobel

In [2]:
# The distinct prize categories should agree across the two collections
assert set(db.prizes.distinct("category")) == set(db.laureates.distinct("prizes.category"))

# Never from there, but sometimes there at last
There are some recorded countries of death ("diedCountry") that do not appear as a country of birth ("bornCountry") for laureates. One such country is "East Germany".

In [3]:
# Countries recorded as countries of death but not as countries of birth
countries = set(db.laureates.distinct("diedCountry")) - set(db.laureates.distinct("bornCountry"))
print(countries)

{'USSR', 'Puerto Rico', 'Jamaica', 'Greece', 'Czechoslovakia', 'Israel', 'Gabon', 'Yugoslavia (now Serbia)', 'Tunisia', 'East Germany', 'Northern Rhodesia (now Zambia)', 'Philippines', 'Barbados'}
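
The direction of the set difference matters: died-minus-born and born-minus-died generally give different answers. The sketch below illustrates this on a few made-up laureate documents (the sample values and the helper name distinct_values are hypothetical, not taken from the collection). Documents without a field contribute nothing to the set, which is assumed here to mirror how distinct treats absent fields.

```python
# Hypothetical sample documents; a living laureate has no "diedCountry".
sample_laureates = [
    {"surname": "A", "bornCountry": "Germany", "diedCountry": "East Germany"},
    {"surname": "B", "bornCountry": "USA", "diedCountry": "USA"},
    {"surname": "C", "bornCountry": "Japan"},
]


def distinct_values(docs, field):
    """Collect the distinct values of a top-level field, skipping documents without it."""
    return {doc[field] for doc in docs if field in doc}


born = distinct_values(sample_laureates, "bornCountry")
died = distinct_values(sample_laureates, "diedCountry")

died_not_born = died - born  # death countries never recorded as a birth country
born_not_died = born - died  # birth countries never recorded as a death country

print(died_not_born)
print(born_not_died)
```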


# Countries of affiliation
We saw in the last exercise that countries can be associated with a laureate as their country of birth and as their country of death. For each prize a laureate received, they may also have been affiliated with an institution at the time, located in a country.

In [4]:
# The number of distinct countries of laureate affiliation for prizes
count = len(db
    .laureates
    .distinct('prizes.affiliations.country'))
print(count)

29
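
The dotted key 'prizes.affiliations.country' reaches through two levels of arrays: each laureate has a list of prizes, and each prize may have a list of affiliations. As a rough illustration of that traversal, here is a plain-Python sketch on made-up documents. The helpers values_at_path and distinct_at are hypothetical names, and this is an approximation of the behaviour rather than MongoDB's implementation.

```python
def values_at_path(value, keys):
    """Yield every value reached by following the keys, descending into lists on the way."""
    if isinstance(value, list):
        for item in value:
            yield from values_at_path(item, keys)
    elif not keys:
        yield value
    elif isinstance(value, dict) and keys[0] in value:
        yield from values_at_path(value[keys[0]], keys[1:])


def distinct_at(docs, dotted_path):
    """Return the sorted distinct values found at a dotted path across all documents."""
    keys = dotted_path.split(".")
    return sorted({v for doc in docs for v in values_at_path(doc, keys)})


# Hypothetical sample documents, not real Nobel data.
sample_laureates = [
    {"surname": "X", "prizes": [
        {"category": "physics", "affiliations": [{"name": "U1", "country": "USA"}]},
        {"category": "chemistry", "affiliations": [
            {"name": "U2", "country": "France"},
            {"name": "U3", "country": "USA"},
        ]},
    ]},
    {"surname": "Y", "prizes": [{"category": "peace"}]},  # no affiliation recorded
]

countries = distinct_at(sample_laureates, "prizes.affiliations.country")
print(countries, len(countries))
```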


# Born here, went there
In which countries have USA-born laureates had affiliations for their prizes?

Possible Answers:

Australia, Denmark, United Kingdom, USA

In [5]:
print(db.laureates.distinct('prizes.affiliations.country', {'bornCountry': 'USA'}))

['Australia', 'Denmark', 'USA', 'United Kingdom']


# Triple plays (mostly) all around
Prizes can be shared, even by more than two laureates. In fact, all prize categories but one – literature – have had prizes shared by three or more laureates.

In [6]:
# Save a filter for prize documents with three or more laureates
criteria = {
    'laureates.2': {
        '$exists': True
    }
}

# Save the set of distinct prize categories in documents satisfying the criteria
triple_play_categories = set(db
                            .prizes
                            .distinct(
                                'category', 
                                criteria))

# Confirm literature as the only category not satisfying the criteria.
assert set(db.prizes.distinct("category")) - triple_play_categories == {"literature"}
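
The key 'laureates.2' addresses the element at zero-based index 2 of the laureates array, so '$exists': True holds when that array has at least three elements. A minimal plain-Python sketch of the same test, on made-up prize documents (has_index is a hypothetical helper, not a pymongo function):

```python
# Hypothetical prize documents, not real Nobel data.
sample_prizes = [
    {"category": "physics", "laureates": [{"id": 1}, {"id": 2}, {"id": 3}]},
    {"category": "literature", "laureates": [{"id": 4}]},
    {"category": "peace", "laureates": [{"id": 5}, {"id": 6}]},
]


def has_index(doc, field, index):
    """Mimic {"<field>.<index>": {"$exists": True}} for an array-valued field."""
    values = doc.get(field, [])
    return isinstance(values, list) and len(values) > index


triple_play = {p["category"] for p in sample_prizes if has_index(p, "laureates", 2)}
all_categories = {p["category"] for p in sample_prizes}

# Categories that never had three or more laureates in this sample
print(all_categories - triple_play)
```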

# Sharing in physics after World War II
What is the approximate ratio of the number of laureates who won an unshared ({"share": "1"}) prize in physics in or after 1945, the year World War II ended ({"year": {"$gte": "1945"}}), to the number of laureates who won a shared prize in physics over the same period?

For reference, the code below determines the number of laureates who won a shared prize in physics before 1945.

```
db.laureates.count_documents({
    "prizes": {"$elemMatch": {
        "category": "physics",
        "share": {"$ne": "1"},
        "year": {"$lt": "1945"}}}})
```
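
The year values in these filters are strings, so "$lt" and "$gte" compare text rather than numbers. That is safe here because every year has four digits. The sketch below uses Python's own string comparison as an analogy (an assumption for illustration, not a statement about MongoDB's exact comparison rules) to show why equal-length digit strings behave like numbers and mixed lengths do not.

```python
years = ["1901", "1944", "1945", "2016"]

# Equal-length digit strings order the same way as the numbers they spell.
before = [y for y in years if y < "1945"]
in_or_after = [y for y in years if y >= "1945"]
print(before, in_or_after)

# Mixed lengths break the correspondence: as text, "999" sorts after "1945".
text_comparison = "999" < "1945"
number_comparison = 999 < 1945
print(text_comparison, number_comparison)
```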

In [7]:
# Reference query: laureates with a shared physics prize before 1945
print(db.laureates.count_documents({
    "prizes": {"$elemMatch": {
        "category": "physics",
        "share": {"$ne": "1"},
        "year": {"$lt": "1945"}}}}))

19


In [8]:
# Laureates with a shared physics prize in or after 1945
print(db.laureates.count_documents({
    "prizes": {"$elemMatch": {
        "category": "physics",
        "share": {"$ne": "1"},
        "year": {"$gte": "1945"}}}}))

143


In [9]:
# Shared before 1945 divided by shared in or after 1945
19 / 143

0.13286713286713286
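
The ratio the question asks for has unshared physics prizes in or after 1945 in the numerator, whereas the 19 above counts shared prizes before 1945. A minimal sketch of the calculation as asked follows. The names physics_filter and unshared_to_shared_ratio are illustrative, and no result is shown because the sketch has not been run against the collection here.

```python
def physics_filter(share_condition):
    """Build a laureate filter for a physics prize in or after 1945 with the given share condition."""
    return {
        "prizes": {"$elemMatch": {
            "category": "physics",
            "share": share_condition,
            "year": {"$gte": "1945"},
        }}
    }


unshared = physics_filter("1")
shared = physics_filter({"$ne": "1"})


def unshared_to_shared_ratio(laureates):
    """Divide the unshared count by the shared count.

    `laureates` is assumed to be a pymongo collection such as db.laureates.
    """
    return laureates.count_documents(unshared) / laureates.count_documents(shared)
```

In the notebook session this would be called as unshared_to_shared_ratio(db.laureates).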

# Meanwhile, in other categories...
We learned in the last exercise that there has been significantly more sharing of physics prizes since World War II: the ratio of the number of laureates who won an unshared prize in physics in or after 1945 to the number of laureates who shared a prize in physics in or after 1945 is approximately 0.13. What is this ratio for prize categories other than physics, chemistry, and medicine?

In [10]:
# Save a filter for laureates with unshared prizes
unshared = {
    "prizes": {"$elemMatch": {
        "category": {
            "$nin": [
                "physics",
                "chemistry",
                "medicine"]
        },
        "share": "1",
        "year": {
            "$gte": "1945"
        },
    }}}

# Save a filter for laureates with shared prizes
shared = {
    "prizes": {"$elemMatch": {
        "category": {
            "$nin": [
                "physics", 
                "chemistry", 
                "medicine"]
        },
        "share": {
            "$ne": "1"
        },
        "year": {
            "$gte": "1945"
        },
    }}}

ratio = db.laureates.count_documents(unshared) / db.laureates.count_documents(shared)
print(ratio)

1.3653846153846154
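
Both filters use $elemMatch so that the category, share, and year conditions must all hold for the same prize. With separate dotted keys, each condition could be satisfied by a different element of the prizes array. The plain-Python sketch below mimics the two behaviours on a made-up laureate with two prizes. It uses a simple equality test on category instead of $nin, so it is an analogy for the idea rather than a re-implementation of the queries above.

```python
# Hypothetical laureate: a shared peace prize in 1950 and an unshared physics prize in 1930.
laureate = {
    "prizes": [
        {"category": "peace", "share": "2", "year": "1950"},
        {"category": "physics", "share": "1", "year": "1930"},
    ]
}


def is_peace(prize):
    return prize["category"] == "peace"


def is_unshared(prize):
    return prize["share"] == "1"


def is_recent(prize):
    return prize["year"] >= "1945"


prizes = laureate["prizes"]

# Like $elemMatch: a single prize must pass every test.
same_element = any(is_peace(p) and is_unshared(p) and is_recent(p) for p in prizes)

# Like separate dotted keys: each test may be passed by a different prize.
any_element = (
    any(is_peace(p) for p in prizes)
    and any(is_unshared(p) for p in prizes)
    and any(is_recent(p) for p in prizes)
)

print(same_element, any_element)
```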


# Organizations and prizes over time
How many organizations won prizes before 1945 versus in or after 1945?

In [11]:
# Save a filter for organization laureates with prizes won before 1945
before = {
    "gender": "org",
    "prizes.year": {
        "$lt": "1945"
    },
}

# Save a filter for organization laureates with prizes won in or after 1945
in_or_after = {
    "gender": "org",
    "prizes.year": {
        "$gte": "1945"
    },
}

n_before = db.laureates.count_documents(before)
n_in_or_after = db.laureates.count_documents(in_or_after)
ratio = n_in_or_after / (n_in_or_after + n_before)
print(ratio)

0.84


# Glenn, George, and others in the G.S. crew
There are two laureates with Berkeley, California as a prize affiliation city who have the initials G.S. - Glenn Seaborg and George Smoot. How many laureates in total have a first name beginning with "G" and a surname beginning with "S"?

Evaluate the expression

```
db.laureates.count_documents({"firstname": Regex(____), "surname": Regex(____)})
```

in the console, filling in the blanks appropriately.

In [12]:
from bson.regex import Regex
print(db.laureates.count_documents({
    "firstname": Regex("^G"), 
    "surname": Regex("^S")
    }))

9
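
The ^ anchor is what ties each pattern to the start of the value. Without it, a pattern such as "G" would match a capital G anywhere in the name. A short sketch with Python's re module on made-up first names shows the difference. re.search stands in here for how the pattern is applied to a field, and matching is case-sensitive by default.

```python
import re

# Hypothetical first names, not taken from the collection.
first_names = ["Glenn", "George", "Hans Georg", "Luigi"]

anchored = [n for n in first_names if re.search("^G", n)]
unanchored = [n for n in first_names if re.search("G", n)]

print(anchored)    # names that begin with "G"
print(unanchored)  # names that contain "G" anywhere
```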


# Germany, then and now
Just as we saw with Poland, there are laureates who were born somewhere that was in Germany at the time but no longer is, and others born somewhere that was not in Germany at the time but now is.

In [13]:
from bson.regex import Regex

# Filter for laureates with "Germany" in their "bornCountry" value
criteria = {
    "bornCountry": Regex("Germany")
    }
print(set(db.laureates.distinct("bornCountry", criteria)))

{'Germany (now Russia)', 'Mecklenburg (now Germany)', 'Germany (now Poland)', 'East Friesland (now Germany)', 'Bavaria (now Germany)', 'Hesse-Kassel (now Germany)', 'Prussia (now Germany)', 'Schleswig (now Germany)', 'West Germany (now Germany)', 'W&uuml;rttemberg (now Germany)', 'Germany', 'Germany (now France)'}


In [14]:
from bson.regex import Regex

# Filter for laureates with a "bornCountry" value starting with "Germany"
criteria = {
    "bornCountry": Regex("^Germany", 0)
    }

print(set(db.laureates.distinct("bornCountry", criteria)))

{'Germany (now Russia)', 'Germany (now Poland)', 'Germany', 'Germany (now France)'}


In [15]:
from bson.regex import Regex

# Fill in a string value to be sandwiched between the strings "^Germany " and "now"
criteria = {"bornCountry": Regex("^Germany " + r"\(" + "now")}
print(set(db.laureates.distinct("bornCountry", criteria)))

{'Germany (now Russia)', 'Germany (now Poland)', 'Germany (now France)'}


In [16]:
from bson.regex import Regex

# Filter for currently-Germany countries of birth. Fill in a string value to be sandwiched between the strings "now" and "$"
criteria = {"bornCountry": Regex("now" + r" Germany\)" + "$")}
print(set(db.laureates.distinct("bornCountry", criteria)))

{'Mecklenburg (now Germany)', 'East Friesland (now Germany)', 'Bavaria (now Germany)', 'Hesse-Kassel (now Germany)', 'Prussia (now Germany)', 'Schleswig (now Germany)', 'West Germany (now Germany)', 'W&uuml;rttemberg (now Germany)'}
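
Parentheses are grouping characters in a regular expression, so a literal "(" or ")" has to be escaped with a backslash. Writing the pattern as a raw string keeps Python from interpreting that backslash itself. The sketch below checks the two patterns with Python's re module against a handful of the bornCountry values printed above. It assumes re behaves like the server-side regex for these simple patterns.

```python
import re

# A few bornCountry values copied from the printed output.
born_countries = [
    "Germany",
    "Germany (now Poland)",
    "Germany (now France)",
    "Prussia (now Germany)",
    "West Germany (now Germany)",
]

# Raw strings pass \( and \) through to the regex engine as literal parentheses.
formerly_germany = re.compile(r"^Germany \(now")
currently_germany = re.compile(r"now Germany\)$")

was_germany = [c for c in born_countries if formerly_germany.search(c)]
now_germany = [c for c in born_countries if currently_germany.search(c)]

print(was_germany)
print(now_germany)
```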


# The prized transistor
Three people shared a Nobel prize "for their researches on semiconductors and their discovery of the transistor effect". We can filter on "transistor" as a substring of a laureate's "prizes.motivation" field value to find these laureates.

In [None]:
from bson.regex import Regex

# Save a filter for laureates with prize motivation values containing "transistor" as a substring
criteria = {'prizes.motivation': Regex('transistor')}

# Save the field names corresponding to a laureate's first name and last name
first, last = 'firstname', 'surname'
print([(laureate[first], laureate[last]) for laureate in db.laureates.find(criteria)])