<font size="+3"><strong>1. DS Internship Applicants</strong></font>

<div class="alert alert-block alert-warning">
<b>Data Ethics:</b> This project is based on <b>synthetic data</b>. It is designed to have characteristics similar to the real thing without exposing any actual personal data (such as names, birthdays, and email addresses) that would violate our students' privacy.
</div>

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
from pprint import PrettyPrinter
from pymongo.mongo_client import MongoClient
from country_converter import CountryConverter

The DS student data is stored in a MongoDB database, so we'll start the notebook by creating a `PrettyPrinter` and connecting to the right database and collection.

# Prepare Data

## Connect

In [3]:
pp = PrettyPrinter(indent=2)

In [5]:
# `uri` (the MongoDB connection string) must be defined before this cell runs
client = MongoClient(uri)
db = client["ds-abtest"]
ds_app = db["ds-applicants"]

print("client:", type(client))
print("ds_app:", type(ds_app))

client: <class 'pymongo.mongo_client.MongoClient'>
ds_app: <class 'pymongo.collection.Collection'>


# Explore

## Nationality

Let's start the analysis. One of the fields in each record is the applicant's country of origin, so we'll begin by seeing where applicants are coming from.

First, we'll perform an aggregation to count countries.

In [6]:
result = ds_app.aggregate(
    [
        {
            "$group": {
                "_id": "$countryISO2",
                "count": {"$count": {}},
            }
        }
    ]
)
print("result type:", type(result))

result type: <class 'pymongo.command_cursor.CommandCursor'>
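
To make the `$group` stage concrete, here is a minimal sketch of the same counting logic in plain Python. The `records` list is hypothetical sample data, not taken from the database; only the `countryISO2` field name comes from the pipeline above.

```python
from collections import Counter

# Hypothetical documents standing in for the collection (assumption).
records = [
    {"countryISO2": "NG"},
    {"countryISO2": "KE"},
    {"countryISO2": "NG"},
    {"countryISO2": "IT"},
]

# Group by country code and count the documents in each group, mirroring
# {"$group": {"_id": "$countryISO2", "count": {"$count": {}}}}.
counts = Counter(doc["countryISO2"] for doc in records)

# Shape the result like the documents the aggregation yields.
result_like = [{"_id": code, "count": n} for code, n in counts.items()]
print(result_like)
```

Each dictionary in `result_like` has the same `_id` and `count` keys as the documents produced by the aggregation, which is why the cursor can be passed straight to `pd.DataFrame`.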


Next, we'll create and print a DataFrame with the results.

In [7]:
df_nationality = (
    pd.DataFrame(result).rename({"_id": "country_iso2"}, axis=1).sort_values("count")
)

print("df_nationality type:", type(df_nationality))
print("df_nationality shape:", df_nationality.shape)
df_nationality.head()

df_nationality type: <class 'pandas.core.frame.DataFrame'>
df_nationality shape: (100, 2)


|    | country_iso2 | count |
|----|--------------|-------|
| 49 | GY | 1 |
| 43 | SN | 1 |
| 46 | IT | 1 |
| 50 | MU | 1 |
| 51 | MZ | 1 |


Now we have the countries, but they're represented using the [ISO 3166-1 alpha-2 standard](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2), where each country has a two-letter code. It'll be much easier to interpret our data if we have the full country name, so we'll need to do some data enrichment using the [`country_converter`](https://github.com/konstantinstadler/country_converter) library.

Since `country_converter` is an open-source library, there are several things to think about before we bring it into our project. First, we need to check the library's license to confirm we're allowed to use it for the kind of project we're working on. `country_converter` is released under the [GNU General Public License](https://www.gnu.org/licenses/gpl-3.0.en.html), so there are no worries there.

Second, we need to make sure the software is being actively maintained. If the last time anybody changed the library was back in 2014, we're probably going to run into some problems when we try to use it. `country_converter`'s last update is very recent, so we aren't going to have any trouble there either.

Third, we need to see what kinds of quality-control measures are in place. Even if the library was updated five minutes ago and includes a license that gives us permission to do whatever we want, it's going to be entirely useless if it's full of mistakes. Happily, `country_converter`'s testing coverage and build badges look excellent, so we're good to go there as well.

The last thing we need to do is make sure the library will do the things we need it to do by looking at its documentation. `country_converter`'s documentation is very thorough, so if we run into any problems, we'll almost certainly be able to figure out what went wrong.

`country_converter` looks good across all those dimensions, so let's put it to work!

In [8]:
# Instantiate a CountryConverter object named cc, and then use it to add a "country_name" column to the DataFrame df_nationality
cc = CountryConverter()
df_nationality["country_name"] = cc.convert(
    df_nationality["country_iso2"], to="name_short"
)

print("df_nationality shape:", df_nationality.shape)
df_nationality.head()

df_nationality shape: (100, 3)


|    | country_iso2 | count | country_name |
|----|--------------|-------|--------------|
| 49 | GY | 1 | Guyana |
| 43 | SN | 1 | Senegal |
| 46 | IT | 1 | Italy |
| 50 | MU | 1 | Mauritius |
| 51 | MZ | 1 | Mozambique |


In [9]:
# Create horizontal bar chart
fig = px.bar(
    data_frame=df_nationality.tail(10),
    x="count",
    y="country_name",
    orientation="h",
    title="DS Applicants: Nationality",
)
# Set axis labels
fig.update_layout(xaxis_title="Frequency [count]", yaxis_title="Country")
fig.show()

That's showing us the raw number of applicants from each country, but since we're working with admissions data, it might be more helpful to see the proportion of applicants each country represents. We can get there by normalizing the dataset.

In [10]:
df_nationality["count_pct"] = (
    (df_nationality["count"] / df_nationality["count"].sum()) * 100
)
print("df_nationality shape:", df_nationality.shape)
df_nationality.head()

df_nationality shape: (100, 4)


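|    | country_iso2 | count | country_name | count_pct |
|----|--------------|-------|--------------|-----------|
| 49 | GY | 1 | Guyana | 0.074906 |
| 43 | SN | 1 | Senegal | 0.074906 |
| 46 | IT | 1 | Italy | 0.074906 |
| 50 | MU | 1 | Mauritius | 0.074906 |
| 51 | MZ | 1 | Mozambique | 0.074906 |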
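
As a quick sanity check on the normalization, the arithmetic for a country with a single applicant can be worked by hand. The sketch below assumes a total of 1,335 applicants, which is the total implied by the `count_pct` values in the table; the variable names are illustrative only.

```python
# Assumed total number of applicants, consistent with the table above.
total_applicants = 1335

# A country that contributes exactly one applicant.
count = 1

# Same formula as the notebook cell: share of the total, expressed in percent.
count_pct = count / total_applicants * 100
print(round(count_pct, 6))  # 0.074906
```

Because every value is divided by the same total, the `count_pct` column sums to 100 across all countries.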


Now we can turn that into a new bar chart.

In [11]:
# Create horizontal bar chart
fig = px.bar(
    data_frame=df_nationality.tail(10),
    x="count_pct",
    y="country_name",
    orientation="h",
    title="DS Applicants: Nationality",
)
# Set axis labels
fig.update_layout(xaxis_title="Frequency [%]", yaxis_title="Country")
fig.show()

Bar charts are useful, but since we're talking about actual places here, let's see how this data looks when we put it on a world map. By default, Plotly Express identifies countries on a choropleth by their [ISO 3166-1 alpha-3](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) codes, so we'll need to add another column to our DataFrame before we can make our visualization.

In [12]:
df_nationality["country_iso3"] = cc.convert(df_nationality["country_iso2"], to="ISO3")

print("df_nationality shape:", df_nationality.shape)
df_nationality.head()

df_nationality shape: (100, 5)


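|    | country_iso2 | count | country_name | count_pct | country_iso3 |
|----|--------------|-------|--------------|-----------|--------------|
| 49 | GY | 1 | Guyana | 0.074906 | GUY |
| 43 | SN | 1 | Senegal | 0.074906 | SEN |
| 46 | IT | 1 | Italy | 0.074906 | ITA |
| 50 | MU | 1 | Mauritius | 0.074906 | MUS |
| 51 | MZ | 1 | Mozambique | 0.074906 | MOZ |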


Let's turn the table into a map!

In [13]:
def build_nat_choropleth():
    fig = px.choropleth(
        data_frame=df_nationality,
        locations="country_iso3",
        color="count_pct",
        projection="natural earth",
        color_continuous_scale=px.colors.sequential.Blues,
        title="DS Applicants: Nationality",
    )
    return fig


nat_fig = build_nat_choropleth()
print("nat_fig type:", type(nat_fig))
nat_fig.show()

nat_fig type: <class 'plotly.graph_objs._figure.Figure'>


## Age

Now that we know where the applicants are from, let's see what else we can learn. For instance, how old are DS Lab applicants? We know the birthday of all our applicants, but we'll need to perform another aggregation to calculate their ages. We'll use the `"$birthday"` field and the `"$$NOW"` variable.

In [14]:
result = ds_app.aggregate(
    [
        {
            "$project": {
                "years": {
                    "$dateDiff": {
                        "startDate": "$birthday",
                        "endDate": "$$NOW",
                        "unit": "year",
                    }
                }
            }
        }
    ]
)


print("result type:", type(result))

result type: <class 'pymongo.command_cursor.CommandCursor'>
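
One detail worth keeping in mind: as far as the MongoDB documentation describes it, `$dateDiff` measures a duration by counting how many unit boundaries are crossed between the two dates. With `"unit": "year"`, an applicant whose birthday has not yet occurred this year would therefore be counted as one year older than their completed age. The sketch below contrasts the two definitions in plain Python; the function names and dates are hypothetical and are not part of the notebook.

```python
from datetime import date


def year_boundaries_crossed(start: date, end: date) -> int:
    """Count calendar-year boundaries between two dates (assumed $dateDiff behavior)."""
    return end.year - start.year


def completed_years(start: date, end: date) -> int:
    """Count full years elapsed, i.e. age as usually stated."""
    had_birthday = (end.month, end.day) >= (start.month, start.day)
    return end.year - start.year - (0 if had_birthday else 1)


# Hypothetical applicant born in December, evaluated in June.
birthday = date(2000, 12, 15)
today = date(2024, 6, 1)

print(year_boundaries_crossed(birthday, today))  # 24
print(completed_years(birthday, today))          # 23
```

For a histogram with 20 bins, a difference of at most one year is unlikely to change the overall picture, but it matters if exact ages are needed.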


Once we have the query results, we can put them into a Series.

In [15]:
ages = pd.DataFrame(result)["years"]

print("ages type:", type(ages))
print("ages shape:", ages.shape)
ages.head()

ages type: <class 'pandas.core.series.Series'>
ages shape: (1335,)


0    44
1    24
2    20
3    39
4    39
Name: years, dtype: int64

And finally, plot a histogram to show the distribution of ages.

In [16]:
def build_age_hist():
    # Create histogram of `ages`
    fig = px.histogram(x=ages, nbins=20, title="DS Applicants: Distribution of Ages")
    # Set axis labels
    fig.update_layout(xaxis_title="Age", yaxis_title="Frequency [count]")
    return fig


age_fig = build_age_hist()
print("age_fig type:", type(age_fig))
age_fig.show()

age_fig type: <class 'plotly.graph_objs._figure.Figure'>


## Education

Okay, there's one more attribute left for us to explore: educational attainment. Which degrees do our applicants have? First, let's count the number of applicants in each category...

In [17]:
result = ds_app.aggregate(
    [
        {
            "$group": {
                "_id": "$highestDegreeEarned",
                "count": {"$count": {}},
            }
        }
    ]
)

print("result type:", type(result))

result type: <class 'pymongo.command_cursor.CommandCursor'>


... and create a Series...

In [18]:
education = (
    pd.DataFrame(result)
    .rename({"_id": "highest_degree_earned"}, axis="columns")
    .set_index("highest_degree_earned")
    .squeeze()
)


print("education type:", type(education))
print("education shape:", education.shape)
education.head()

education type: <class 'pandas.core.series.Series'>
education shape: (5,)


highest_degree_earned
High School or Baccalaureate    221
Some College (1-3 years)        157
Master's degree                 226
Bachelor's degree               704
Doctorate (e.g. PhD)             27
Name: count, dtype: int64
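
The `squeeze()` call at the end of that chain is what turns the one-column DataFrame into a Series. Here is a minimal sketch of that step on its own, using a small hypothetical DataFrame in place of the query results.

```python
import pandas as pd

# Hypothetical aggregation output with two categories (assumption).
df = pd.DataFrame(
    {"_id": ["Bachelor's degree", "Master's degree"], "count": [3, 2]}
)

# After set_index, only the "count" column remains, so squeeze()
# collapses the DataFrame into a Series named "count".
series = (
    df.rename({"_id": "highest_degree_earned"}, axis="columns")
    .set_index("highest_degree_earned")
    .squeeze()
)

print(type(series).__name__)  # Series
print(series.name)            # count
```

Note that `squeeze()` with no arguments would return a scalar if the DataFrame had only one row; passing `axis="columns"` restricts the squeeze to the column axis and avoids that edge case.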

Since we're talking about the highest level of education our applicants have, we need to sort the categories hierarchically rather than alphabetically or numerically. The order should be: `"High School or Baccalaureate"`, `"Some College (1-3 years)"`, `"Bachelor's degree"`, `"Master's degree"`, and `"Doctorate (e.g. PhD)"`. Let's do that with a function.

In [19]:
def ed_sort(counts):
    """Return sort positions that order the degree labels in `counts` from lowest to highest."""
    degrees = [
        "High School or Baccalaureate",
        "Some College (1-3 years)",
        "Bachelor's degree",
        "Master's degree",
        "Doctorate (e.g. PhD)",
    ]
    mapping = {k: v for v, k in enumerate(degrees)}
    sort_order = [mapping[c] for c in counts]
    return sort_order


education.sort_index(key=ed_sort, inplace=True)
education

highest_degree_earned
High School or Baccalaureate    221
Some College (1-3 years)        157
Bachelor's degree               704
Master's degree                 226
Doctorate (e.g. PhD)             27
Name: count, dtype: int64
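
As an alternative sketch (not the approach used above), the same ordering can be obtained by reindexing the Series against the list of degrees. The example rebuilds the Series from the counts printed earlier; the names `education_demo` and `sorted_demo` are illustrative.

```python
import pandas as pd

# Degree labels from lowest to highest, as in `ed_sort`.
degrees = [
    "High School or Baccalaureate",
    "Some College (1-3 years)",
    "Bachelor's degree",
    "Master's degree",
    "Doctorate (e.g. PhD)",
]

# Counts in the unsorted order returned by the aggregation.
education_demo = pd.Series(
    {
        "High School or Baccalaureate": 221,
        "Some College (1-3 years)": 157,
        "Master's degree": 226,
        "Bachelor's degree": 704,
        "Doctorate (e.g. PhD)": 27,
    },
    name="count",
)

# reindex returns the rows in the order given by `degrees`.
sorted_demo = education_demo.reindex(degrees)
print(sorted_demo)
```

One difference in behavior: `reindex` fills in `NaN` for any degree missing from the data and silently drops labels not in `degrees`, whereas `ed_sort` raises a `KeyError` when it meets an unknown label.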

In [20]:
def build_ed_bar():
    # Create bar chart
    fig = px.bar(
        x=education,
        y=education.index,
        orientation="h",
        title="DS Applicants: Highest Degree Earned"
        )
    # Add axis labels
    fig.update_layout(xaxis_title="Frequency [count]", yaxis_title="Highest Degree Earned")
    return fig


ed_fig = build_ed_bar()
print("ed_fig type:", type(ed_fig))
ed_fig.show()

ed_fig type: <class 'plotly.graph_objs._figure.Figure'>
