# Workshop 1

Here you'll test your __junior data scientist__ skills. Read each problem carefully, write your own test cases, and validate that everything works as expected.

## 1. Regular Expressions

Complete the code below according to each __requirement__. The part marked `#YOUR CODE HERE` is where you _should_ add your solution, although you _may_ change anything else you want.

### Problem 1.1

Find all of the names in the following string using _regex_.

In [24]:
import re
def names() -> list:
    """
    This function uses regular expressions to extract all names in the simple_string.
    
    Returns:
        list: A list of all names in the simple_string
    """
    simple_string = """Amy is 5 years old, and her sister Mary is 2 years old. 
    Ruth and Peter, their parents, have 3 kids."""

    pattern = r'[A-Z][a-z]{2,}'
    
    return re.findall(pattern, simple_string) 

In [25]:
# example of test case
assert len(names()) == 4, "There are four names in the simple_string."
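
The length check passes for any four matches, so a stricter test can compare the exact names. The sketch below reuses the string and pattern from the cell above; the extra sentence at the end is a hypothetical input, added only to show where this pattern stops being reliable.

```python
import re

# Text and pattern copied from the names() cell above.
simple_string = """Amy is 5 years old, and her sister Mary is 2 years old. 
    Ruth and Peter, their parents, have 3 kids."""
pattern = r'[A-Z][a-z]{2,}'

found = re.findall(pattern, simple_string)
assert found == ["Amy", "Mary", "Ruth", "Peter"]

# Hypothetical sentence (not part of the workshop data) showing two limits
# of the pattern: "The" is matched although it is not a name, and the
# two-letter name "Al" is missed because {2,} requires three letters in total.
edge_case = re.findall(pattern, "The dog Rex met Al.")
assert edge_case == ["The", "Rex"]
```

The pattern works here only because every capitalized word of three or more letters in `simple_string` happens to be a name.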

### Problem 1.2

The _dataset file_ in [assets/grades.txt](./assets/grades.txt) contains a line-separated _list of people_ with their __grade__ in a class. Create a _regex_ to generate a list of just those students who received a __B__ in the course.

In [26]:
import re

def grades() -> list:
    """
    The grades function returns the names of students who got B in their grades.
    
    Returns:
        list: A list of strings, where each string is the name of a student who got a B.
    """
    with open("assets/grades.txt", "r") as file:
        grades = file.read()

    # YOUR CODE HERE
    pattern = r'(\w+\s+\w+): B'
    students = re.findall(pattern, grades)

    return students

In [27]:
# example of test case
assert len(grades()) == 16
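
To try the pattern without the data file, it can be run against a small inline sample. This is only a sketch: the names below are invented, and the `First Last: GRADE` layout is an assumption inferred from the pattern in the cell above, not copied from `grades.txt`.

```python
import re

# Hypothetical sample in the "First Last: GRADE" layout that the pattern
# above assumes; these names are invented, not taken from grades.txt.
sample = "Ada Stone: A\nBen Ortiz: B\nCara Lind: C\nDev Patel: B\n"

# Same pattern as in grades().
loose = re.findall(r'(\w+\s+\w+): B', sample)

# Variant anchored to whole lines: ^ and $ match at line boundaries
# when re.MULTILINE is set.
anchored = re.findall(r'^(\w+ \w+): B$', sample, flags=re.MULTILINE)

assert loose == anchored == ["Ben Ortiz", "Dev Patel"]
```

Both versions agree on this sample. The anchored variant is stricter: it would skip a line whose grade merely starts with `B`, if the file contained one.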

### Problem 1.3

Consider the standard _web log file_ in [assets/logdata.txt](./assets/logdata.txt). This _file_ records the _access_ a user makes when visiting a web page. Each __line of the log__ has the following _items_:

- a __host__ (e.g., `146.204.224.152`)
- a __user_name__ (e.g., `feest6811`. _Hint:_ sometimes the user name is missing! In this case, use `-` as the value for the username.)
- the __time__ a request was made (e.g., `21/Jun/2019:15:45:24 -0700`)
- the __request type__ (e.g., `POST /incentivize HTTP/1.1`. _Note:_ not everything is a POST!)

Your task is to convert this into a list of dictionaries, where each dictionary looks like the following:

```python
example_dict = {"host":"146.204.224.152", 
                "user_name":"feest6811", 
                "time":"21/Jun/2019:15:45:24 -0700",
                "request":"POST /incentivize HTTP/1.1"}
```
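
Named groups map directly onto these dictionary keys, because `groupdict()` returns one entry per `(?P<name>...)` group. The following is a minimal sketch on two hypothetical lines: the first reuses the example values above, while the second line and the trailing status code and byte count on both are assumptions about the file layout.

```python
import re

# Hypothetical sample lines; the trailing status code and byte count, and the
# whole second line, are assumptions rather than content from logdata.txt.
sample = (
    '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] '
    '"POST /incentivize HTTP/1.1" 302 4622\n'
    '197.109.77.178 - - [21/Jun/2019:15:45:25 -0700] '
    '"DELETE /virtual/solutions HTTP/2.0" 203 26554\n'
)

pattern = re.compile(
    r'(?P<host>\d+\.\d+\.\d+\.\d+)\s+-\s+'  # IPv4-style host, then the "-" separator
    r'(?P<user_name>\S+)\s+'                # user name, or "-" when it is missing
    r'\[(?P<time>[^\]]+)\]\s+'              # everything between the square brackets
    r'"(?P<request>[^"]+)"'                 # everything between the double quotes
)

# One dictionary per matched line, keyed by the group names.
parsed = [m.groupdict() for m in pattern.finditer(sample)]

assert parsed[0]["user_name"] == "feest6811"
assert parsed[1]["user_name"] == "-"
```

Using `\S+` for the user name covers both a real name and the `-` placeholder with a single group.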

In [56]:
import re


def logs() -> list:
    """
    The logs function returns a list of dictionaries, where each dictionary contains 
    the parts of the log message.
    
    Returns:
        list: A list of dictionaries where each dictionary corresponds to a log message 
        from the logdata file.
    """
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
    actions = re.split(r"\n", logdata)

    logs_list = []

    # YOUR CODE HERE
    pattern = r'(?P<host>\d+\.\d+\.\d+\.\d+)\s+-\s+(?P<user_name>(\w*|-))\s+\[(?P<time>.+)\]\s+"(?P<request>.+)"'
    for action in actions:
        match = re.match(pattern, action)
        if not match:
            # Skip lines that do not parse (e.g. the empty string left after
            # the final newline) instead of stopping at the first failure.
            continue
        logs_list.append(match.groupdict())

    return logs_list

In [57]:
# example of test case
one_item = {
    "host": "146.204.224.152",
    "user_name": "feest6811",
    "time": "21/Jun/2019:15:45:24 -0700",
    "request": "POST /incentivize HTTP/1.1",
}
assert (
    one_item in logs()
), "Sorry, this item should be in the log results, check your formatting"

## 2. Descriptive Analysis

For this section, you'll be looking at _2017 data on immunizations_ from the _CDC_. Your _datafile_ for the next tasks is in [assets/NISPUF17.csv](./assets/NISPUF17.csv). A _data users guide_, which you'll need to map the variables in the data to the questions being asked, is available at [assets/NIS-PUF17-DUG.pdf](./assets/NIS-PUF17-DUG.pdf).

### Problem 2.1

Write a function called _proportion\_of\_education_ which returns the proportion of __children__ in the dataset whose mother's education level was less than high school ($<12$), high school ($12$), more than high school but not a college graduate ($>12$), or _college degree_.

This _function_ should return a __dictionary__ in the form of (use the _correct numbers_, do not round numbers):

```python
{
    "less than high school": 0.2,
    "high school": 0.4,
    "more than high school but not college": 0.2,
    "college": 0.2
}
```

In [29]:
import pandas as pd

def proportion_of_education() -> dict:
    """
    This function returns the proportion of children who had a mother with the education 
    levels of less than high school, high school, more than high school but not a college 
    graduate, and college graduate.
    
    Returns:
        dict: A dictionary where the keys are the education levels and the values are the 
        proportion of children who had a mother with that education level.
    """
    proportions = {}

    # YOUR CODE HERE
    immunization_df = pd.read_csv('assets/NISPUF17.csv')
    mother_education_proportion = immunization_df['EDUC1'].value_counts()\
                                  /len(immunization_df['EDUC1'])
    proportions["less than high school"] = mother_education_proportion[1]
    proportions["high school"] = mother_education_proportion[2]
    proportions["more than high school but not college"] = mother_education_proportion[3]
    proportions["college"] = mother_education_proportion[4]

    return proportions

In [30]:
proportion_of_education()

{'less than high school': np.float64(0.10202002459160373),
 'high school': np.float64(0.172352011241876),
 'more than high school but not college': np.float64(0.24588090637625154),
 'college': np.float64(0.47974705779026877)}

In [31]:
# example of test cases
assert type(proportion_of_education()) == type({}), "You must return a dictionary."
assert (
    len(proportion_of_education()) == 4
), "You have not returned a dictionary with four items in it."
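
One thing worth testing is whether the four proportions add up to 1. Dividing `value_counts()` by `len()` only guarantees that when the column has no missing values. The sketch below illustrates the difference on made-up codes (not survey data), alongside the `normalize=True` option of `value_counts`.

```python
import math
import pandas as pd

# Made-up education codes (1-4) with one missing value; not real survey data.
codes = pd.Series([1, 2, 2, 3, 4, 4, 4, None])

# Dividing by len() counts the missing row in the denominator.
by_len = codes.value_counts() / len(codes)

# normalize=True divides by the number of non-missing values instead.
normalized = codes.value_counts(normalize=True)

assert math.isclose(by_len.sum(), 7 / 8)
assert math.isclose(normalized.sum(), 1.0)
assert math.isclose(normalized.loc[4.0], 3 / 7)
```

The proportions shown in the output above do sum to 1, which suggests `EDUC1` has no missing values in this dataset, so both approaches should agree there.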

### Problem 2.2

Let's explore the relationship between being _fed breastmilk_ as a child and getting a seasonal _influenza vaccine_ from a healthcare provider. Return a __tuple__ of the _average number of influenza vaccines_ for those children we know received breastmilk as a child and those we know did not.

This _function_ should return a __tuple__ in the form (use the _correct numbers_):

```python
(2.5, 0.1)
```

In [52]:
import pandas as pd

def average_influenza_doses() -> tuple:
    """
    This function calculates a tuple of the average number of influenza vaccines 
    for children we know received breastmilk and for those we know did not.

    Returns:
        tuple: The average number of influenza vaccines for children we know 
        received breastmilk, followed by the average for those we know did not.
    """
    doses = ()

    # YOUR CODE HERE
    immunization_df = pd.read_csv('assets/NISPUF17.csv')
    doses = (immunization_df[immunization_df['CBF_01'] == 1]['P_NUMFLU'].mean(),
             immunization_df[immunization_df['CBF_01'] == 2]['P_NUMFLU'].mean())

    return doses

In [53]:
# example of test cases
assert (
    len(average_influenza_doses()) == 2
), "Return two values in a tuple, the first for yes and the second for no."
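
The same two averages can also be computed with a single `groupby`. The sketch below uses a made-up frame with the column names from the cell above; the code `99` is a hypothetical stand-in for any answer other than yes (1) or no (2).

```python
import pandas as pd

# Made-up rows, not survey data. 99 is a hypothetical "other" code.
toy = pd.DataFrame({
    "CBF_01":   [1, 1, 1, 2, 2, 99],
    "P_NUMFLU": [2, 3, None, 0, 1, 4],
})

# mean() skips missing dose counts by default, so the third row is ignored.
means = toy.groupby("CBF_01")["P_NUMFLU"].mean()

# float() turns the NumPy scalars into plain Python floats.
doses = (float(means.loc[1]), float(means.loc[2]))
assert doses == (2.5, 0.5)
```

Selecting only the labels `1` and `2` from the grouped result keeps any other codes out of the tuple.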

### Problem 2.3

It would be interesting to see if there is any evidence of a link between _vaccine effectiveness_ and _sex of the child_. Calculate the _ratio of the number of children_ who contracted __chickenpox__ but _were vaccinated against it_ (at least one varicella dose) versus those who were vaccinated but did not contract _chickenpox_. Return results by _sex_.

This _function_ should return a __dictionary__ in the form of (use the _correct numbers_):

```python
{
    "male":0.2,
    "female":0.4
}
```

_Note:_ To aid in verification, the `chickenpox_by_sex()['female']` value the autograder is looking for starts with the digits `0.0077`.


In [49]:
import pandas as pd

def chickenpox_by_sex() -> dict:
    """
    This function returns, by sex, the ratio of vaccinated children who contracted 
    chickenpox to vaccinated children who did not.
    
    Returns:
        dict: A dictionary where the keys are 'male' and 'female' and the values are 
        the corresponding ratios.
    """

    stats = {}

    # YOUR CODE HERE
    chickenpox_df = pd.read_csv('assets/NISPUF17.csv')
    chickenpox_df = chickenpox_df[['P_NUMVRC', 'SEX', 'HAD_CPOX']].dropna()

    vaccinated_and_contracted = chickenpox_df[(chickenpox_df['P_NUMVRC'] >= 1)\
                                            & (chickenpox_df['HAD_CPOX'] == 1)]
    vaccinated_and_not_contracted = chickenpox_df[(chickenpox_df['P_NUMVRC'] >= 1)\
                                                & (chickenpox_df['HAD_CPOX'] == 2)]

    stats = {
        'male': len(vaccinated_and_contracted[vaccinated_and_contracted['SEX'] == 1]) \
            / len(vaccinated_and_not_contracted[vaccinated_and_not_contracted['SEX'] == 1]),
        'female': len(vaccinated_and_contracted[vaccinated_and_contracted['SEX'] == 2]) \
            / len(vaccinated_and_not_contracted[vaccinated_and_not_contracted['SEX'] == 2])
    }

    return stats

In [50]:
chickenpox_by_sex()

{'male': 0.009675583380762664, 'female': 0.0077918259335489565}
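
A cross-tabulation gives the same counts in one step and makes the ratio easy to check by hand. This is a sketch on made-up rows using the column names and codes from the cell above (`SEX` 1 = male, 2 = female; `HAD_CPOX` 1 = yes, 2 = no); it is an alternative to the filtering approach, not the method the grader expects.

```python
import pandas as pd

# Made-up rows, not survey data.
toy = pd.DataFrame({
    "SEX":      [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "P_NUMVRC": [1, 2, 1, 0, 1, 1, 2, 1, None],
    "HAD_CPOX": [1, 2, 2, 1, 1, 2, 2, 2, 2],
})

# Keep children with at least one varicella dose; rows with a missing
# dose count compare as False and drop out as well.
vaccinated = toy[toy["P_NUMVRC"] >= 1]

# Rows are SEX, columns are HAD_CPOX, cells are counts of children.
counts = pd.crosstab(vaccinated["SEX"], vaccinated["HAD_CPOX"])

# contracted (column 1) divided by not contracted (column 2), per sex
ratio = counts[1] / counts[2]
result = {"male": float(ratio.loc[1]), "female": float(ratio.loc[2])}

assert result["male"] == 0.5
assert abs(result["female"] - 1 / 3) < 1e-12
```

In the toy data, one vaccinated boy had chickenpox and two did not, which gives 0.5; for girls the counts are one and three.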

### Problem 2.4

A __correlation__ is a _statistical relationship_ between two variables. If we wanted to know _if vaccines work_, we might look at the correlation between the use of the vaccine and whether it results in prevention of the infection or disease. In this task, you are to see if there is a correlation between _having had chickenpox_ and the _number of chickenpox vaccine doses given_ (varicella).

Some notes on interpreting the answer. The `had_chickenpox_column` is either $1$ (for _yes_) or $2$ (for _no_), and the `num_chickenpox_vaccine_column` is the number of doses a child has been given of the varicella vaccine. A _positive correlation_ (i.e., $corr > 0$) means that an increase in _had\_chickenpox\_column_ (which means more _no_’s) goes along with an increase in the values of _num\_chickenpox\_vaccine\_column_ (which means _more doses of vaccine_). A _negative correlation_ (i.e., $corr < 0$) indicates that having had chickenpox is related to an increase in the number of vaccine doses.

Also, $pval$ is the probability of observing, by chance alone, a correlation between _had\_chickenpox\_column_ and _num\_chickenpox\_vaccine\_column_ at least as strong as the one measured. A _small pval_ means that the observed correlation is highly unlikely to occur by chance. In this case, _pval_ should be very small (it will end in $e-18$, indicating a very small number).
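
The sign convention is easy to get backwards, so here is a small worked sketch on made-up data. It uses the textbook Pearson formula written out by hand rather than `stats.pearsonr`, and the helper name `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Textbook Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up data: 1 = had chickenpox, 2 = did not; doses of varicella vaccine.
had_chickenpox = [1, 1, 2, 2, 2, 2]
num_doses = [0, 1, 1, 2, 2, 2]

r = pearson_r(had_chickenpox, num_doses)
assert r > 0  # more "no" answers go together with more doses

# Recoding so that 2 = yes and 1 = no flips the sign but not the strength.
flipped = [3 - h for h in had_chickenpox]
assert math.isclose(pearson_r(flipped, num_doses), -r)
```

In this toy data the children who did not have chickenpox received more doses, so the coefficient is positive under the 1 = yes, 2 = no coding.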

In [70]:
import scipy.stats as stats
import pandas as pd


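def corr_chickenpox() -> float:
    """
    This function computes the Pearson correlation between having had chickenpox 
    and the number of varicella vaccine doses received.
    
    Returns:
        float: The correlation coefficient, between -1.0 and 1.0.
    """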

    # YOUR CODE HERE
    # load and prepare dataframes
    immunization_df = pd.read_csv('assets/NISPUF17.csv')
    immunization_df = immunization_df[["HAD_CPOX", "P_NUMVRC"]].dropna()
    # keep only rows coded 1 (yes) or 2 (no)
    immunization_df = immunization_df[immunization_df["HAD_CPOX"] < 3]

    chickenpox_df = pd.DataFrame(
        {
            "had_chickenpox_column": immunization_df["HAD_CPOX"],
            "num_chickenpox_vaccine_column": immunization_df["P_NUMVRC"]
        }
    )

    # correlation
    corr, pval = stats.pearsonr(
        chickenpox_df["had_chickenpox_column"],
        chickenpox_df["num_chickenpox_vaccine_column"]
    )
    
    return corr

In [71]:
# example of test cases
assert (
    -1 <= corr_chickenpox() <= 1
), "You must return a float number between -1.0 and 1.0."