# Workshop 1

Here you will test your __junior data scientist__ skills. Read each problem carefully, write your own test cases, and validate that everything works as expected.

## 1. Regular Expressions

Complete the code below based on each __requirement__. Each task has a `#YOUR CODE HERE` part that you _should complete_ to accomplish the task. However, you _may_ change anything you want.

### Problem 1.1

Find a list of all of the names in the following string using _regex_.

In [1]:
import re
def names():
    simple_string = """Amy is 5 years old, and her sister Mary is 2 years old. 
    Ruth and Peter, their parents, have 3 kids."""

    pattern = r'\b[A-Z]\w*'
    
    return re.findall(pattern, simple_string) 


In [2]:
# example of test case
assert len(names()) == 4, "There are four names in the simple_string."
print(names())

['Amy', 'Mary', 'Ruth', 'Peter']
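
The pattern above treats every capitalized word as a name. That is enough for this string, but it would also pick up a sentence-initial word such as "The". A minimal sketch of that limitation, assuming a hypothetical sentence (`sample`) that is not part of the workshop data:

```python
import re

# Same pattern as in names(): word boundary, one uppercase letter, then word characters.
pattern = r'\b[A-Z]\w*'

# Hypothetical sentence, not taken from the workshop data.
sample = "The teacher asked Amy and Peter to sit down."

capitalized_words = re.findall(pattern, sample)
print(capitalized_words)  # ['The', 'Amy', 'Peter']
```

Writing a test case like this one is a quick way to check whether a pattern still holds on text other than the given string.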


### Problem 1.2

The _dataset file_ in [assets/grades.txt](./assets/grades.txt) contains a line-separated _list of people_ with their __grade__ in a class. Create a _regex_ to generate a list of just those students who received a __B__ in the course.

In [1]:
import re


def grades():
    with open("assets/grades.txt", "r") as file:
        grades = file.read()

    b_students = r'\w[\w ]*: B'
    students = re.findall(b_students, grades)

    return students

In [2]:
# example of test case
assert len(grades()) == 16
print(grades())

['Bell Kassulke: B', 'Simon Loidl: B', 'Elias Jovanovic: B', 'Hakim Botros: B', 'Emilie Lorentsen: B', 'Jake Wood: B', 'Fatemeh Akhtar: B', 'Kim Weston: B', 'Yasmin Dar: B', 'Viswamitra Upandhye: B', 'Killian Kaufman: B', 'Elwood Page: B', 'Elodie Booker: B', 'Adnan Chen: B', 'Hank Spinka: B', 'Hannah Bayer: B']
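
If only the names are wanted, without the `: B` suffix, one option is a capture group anchored to whole lines. The sketch below assumes the `Name: Grade` layout seen in the output above; `sample_grades` is a hypothetical inline sample, not the real file.

```python
import re

# Hypothetical sample in the "Name: Grade" layout; the real file may differ.
sample_grades = """Ana Perez: A
Bell Kassulke: B
Li Wei: C
Simon Loidl: B
"""

# re.MULTILINE makes ^ and $ match at every line, so each match covers one full line.
# The parentheses capture only the name, so findall returns names without ": B".
b_pattern = re.compile(r'^([\w ]+): B$', re.MULTILINE)

b_names = b_pattern.findall(sample_grades)
print(b_names)  # ['Bell Kassulke', 'Simon Loidl']
```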


### Problem 1.3

Consider the standard _web log file_ in [assets/logdata.txt](./assets/logdata.txt). This _file_ records each _access_ a user makes when visiting a web page. Each __line of the log__ has the following _items_:

- a __host__ (e.g., `146.204.224.152`)
- a __user_name__ (e.g., `feest6811`. _Hint:_ sometimes the user name is missing! In this case, use `-` as the value for the username.)
- the __time__ a request was made (e.g., `21/Jun/2019:15:45:24 -0700`)
- the __request type__ (e.g., `POST /incentivize HTTP/1.1`. _Note:_ not everything is a POST!)

Your task is to convert this into a list of dictionaries, where each dictionary looks like the following:

```python
example_dict = {"host":"146.204.224.152", 
                "user_name":"feest6811", 
                "time":"21/Jun/2019:15:45:24 -0700",
                "request":"POST /incentivize HTTP/1.1"}
```

In [3]:
import re


def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
    actions = re.split(r"\n", logdata)

    logs_list = []

    # Define regex pattern to extract data
    log_pattern = re.compile(
        r'(?P<host>[\d.]+) - (?P<user_name>\S+) \[(?P<time>[^\]]+)\] "(?P<request>[^"]+)"'
    )

    for action in actions:
        match = log_pattern.match(action)
        if match:
            log_entry = {
                "host": match.group("host"),
                "user_name": match.group("user_name"),
                "time": match.group("time"),
                "request": match.group("request")
            }
            logs_list.append(log_entry)
    
    return logs_list

In [4]:
# example of test case
one_item = {
    "host": "146.204.224.152",
    "user_name": "feest6811",
    "time": "21/Jun/2019:15:45:24 -0700",
    "request": "POST /incentivize HTTP/1.1",
}
assert (
    one_item in logs()
), "Sorry, this item should be in the log results, check your formatting"
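
Before running against the whole file, the pattern can be checked on a couple of hand-written lines. This is a sketch under assumptions: the first line is assembled from the example values above, the second is hypothetical and uses `-` for a missing user name, and the trailing status and size fields are assumed rather than taken from the file.

```python
import re

# Same named-group pattern as in logs().
log_pattern = re.compile(
    r'(?P<host>[\d.]+) - (?P<user_name>\S+) \[(?P<time>[^\]]+)\] "(?P<request>[^"]+)"'
)

# Hand-written sample lines; the trailing numbers are assumed status/size fields.
sample_lines = [
    '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622',
    '10.0.0.1 - - [21/Jun/2019:15:45:25 -0700] "GET /index HTTP/1.1" 200 100',
]

# groupdict() turns the named groups of a match into a dictionary in one step.
parsed = [m.groupdict() for m in map(log_pattern.match, sample_lines) if m]

print(parsed[0]["request"])    # POST /incentivize HTTP/1.1
print(parsed[1]["user_name"])  # -
```

Because the group names match the required keys, `groupdict()` could replace the hand-built dictionary in `logs()` without changing the result.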

## 2. Descriptive Analysis

For this section, you'll be looking at _2017 data on immunizations_ from the _CDC_. Your _datafile_ for the next tasks is in [assets/NISPUF17.csv](./assets/NISPUF17.csv). A _data user's guide_, which you'll need to map the variables in the data to the questions being asked, is available at [assets/NIS-PUF17-DUG.pdf](./assets/NIS-PUF17-DUG.pdf).

### Problem 2.1

Write a function called _proportion\_of\_education_ which returns the proportion of __children__ in the dataset whose mother's education level was less than high school ($<12$), high school ($12$), more than high school but not a college graduate ($>12$), or _college degree_.

This _function_ should return a __dictionary__ in the form of (use the _correct numbers_, do not round numbers):

```python
{
    "less than high school": 0.2,
    "high school": 0.4,
    "more than high school but not college": 0.2,
    "college": 0.2
}
```

In [3]:
import pandas as pd

'''
    value in dataset -> category
    1 -> ( < 12 ) less than high school
    2 -> ( 12 ) high school
    3 -> ( > 12 ) more than high school but not college
    4 -> college degree
''' 

def proportion_of_education() -> dict:

    children_df = pd.read_csv('assets/NISPUF17.csv')

    # Share of rows for each EDUC1 code: count per code divided by the total number of rows
    total_rows = children_df.shape[0]
    count_occurrences_by_value = children_df['EDUC1'].value_counts()
    proportion_calculated = count_occurrences_by_value / total_rows

    # Keys follow the wording required by the problem statement
    proportions = {
        "less than high school": float(proportion_calculated.loc[1]),
        "high school": float(proportion_calculated.loc[2]),
        "more than high school but not college": float(proportion_calculated.loc[3]),
        "college": float(proportion_calculated.loc[4])
    }

    return proportions


In [4]:
# example of test cases
assert isinstance(proportion_of_education(), dict), "You must return a dictionary."
assert (
    len(proportion_of_education()) == 4
), "You have not returned a dictionary with four items in it."

print(proportion_of_education())

{'less than high school': 0.10202002459160373, 'high school': 0.172352011241876, 'more than high school but not college': 0.24588090637625154, 'college': 0.47974705779026877}
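
As a cross-check, pandas can compute shares directly with `value_counts(normalize=True)`. The sketch below uses a hypothetical ten-row frame (`toy_df`), not the real dataset. Note that `normalize=True` divides by the number of non-null values, so it matches a division by the total row count only when the column has no missing entries.

```python
import pandas as pd

# Hypothetical EDUC1 codes for ten children (not taken from NISPUF17.csv).
toy_df = pd.DataFrame({"EDUC1": [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]})

labels = {
    1: "less than high school",
    2: "high school",
    3: "more than high school but not college",
    4: "college",
}

# normalize=True returns each count divided by the number of non-null values.
shares = toy_df["EDUC1"].value_counts(normalize=True)

# Map each numeric code to its label.
toy_proportions = {labels[code]: float(share) for code, share in shares.items()}
print(toy_proportions)
```

A useful extra test case for the real function is that the four proportions add up to approximately 1.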


### Problem 2.2

Let's explore the relationship between being _fed breastmilk_ as a child and getting a seasonal _influenza vaccine_ from a healthcare provider. Return a __tuple__ of the _average number of influenza vaccines_ for those children we know received breastmilk as a child and those we know did not.

This _function_ should return a __tuple__ in the form (use the _correct numbers_):

```python
(2.5, 0.1)
```

In [6]:
'''
    [CBF_01]
    value in dataset -> category
    1 -> Yes
    2 -> No
''' 

def average_influenza_doses() -> tuple:

    children_df = pd.read_csv('assets/NISPUF17.csv')

    # Total rows
    #print(children_df.shape[0])
    
    # Does CBF_01 have null values?
    #print(children_df['CBF_01'].value_counts().sum())
    
    # Does P_NUMFLU have null values? -> 46% are null values
    #print(children_df['P_NUMFLU'].isnull().sum() / children_df.shape[0])
       
    # mean() skips null values by default
    average_per_category_df = children_df.groupby(['CBF_01'], observed=False).agg({'P_NUMFLU': 'mean'})
    
    doses = (float(average_per_category_df.loc[1, 'P_NUMFLU']), float(average_per_category_df.loc[2, 'P_NUMFLU']))

    return doses



In [12]:
# example of test cases
assert (
    len(average_influenza_doses()) == 2
), "Return two values in a tuple, the first for yes and the second for no."

print(average_influenza_doses())

(1.8799187420058687, 1.5963945918878317)
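
The solution relies on `mean` skipping missing values within each group. A minimal sketch of that behavior on a hypothetical six-row frame (`toy_df`, with made-up values) where each group has one missing dose count:

```python
import numpy as np
import pandas as pd

# Hypothetical data: CBF_01 is 1 (yes) or 2 (no); P_NUMFLU has one NaN per group.
toy_df = pd.DataFrame({
    "CBF_01": [1, 1, 1, 2, 2, 2],
    "P_NUMFLU": [2.0, np.nan, 4.0, 1.0, 1.0, np.nan],
})

# mean() ignores NaN by default, so each average uses only the known dose counts.
means = toy_df.groupby("CBF_01")["P_NUMFLU"].mean()

toy_doses = (float(means.loc[1]), float(means.loc[2]))
print(toy_doses)  # (3.0, 1.0)
```

The first group averages 2 and 4 only, so the missing value neither counts as zero nor enlarges the denominator.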


### Problem 2.3

It would be interesting to see if there is any evidence of a link between _vaccine effectiveness_ and _sex of the child_. Calculate the _ratio of the number of children_ who contracted __chickenpox__ but _were vaccinated against it_ (at least one varicella dose) versus those who were vaccinated but did not contract _chickenpox_. Return results by _sex_.

This _function_ should return a __dictionary__ in the form of (use the _correct numbers_):

```python
{
    "male":0.2,
    "female":0.4
}
```

_Note:_ To aid in verification, the `chickenpox_by_sex()['female']` value the autograder is looking for starts with the digits `0.0077`.


In [8]:
'''
    [HAD_CPOX]
    value in dataset -> category
    1 -> Yes
    2 -> No
    ---------------
    [SEX]
    value in dataset -> category
    1 -> Male
    2 -> Female
    [P_NUMVRC]
    total number of varicella doses
''' 

def calculate_ratio_by_sex(sex_code: int, children_df: pd.DataFrame) -> float:
    
    yes_code = 1
    no_code = 2
    
    # number of children who were vaccinated and infected / comparisons with NaN are False, so NaN rows are excluded
    children_vaccinated_infected = children_df[
        (children_df['SEX'] == sex_code)
        & (children_df['HAD_CPOX'] == yes_code)
        & (children_df['P_NUMVRC'] >= 1)
    ].shape[0]

    # number of children who were vaccinated and not infected / comparisons with NaN are False, so NaN rows are excluded
    children_vaccinated_not_infected = children_df[
        (children_df['SEX'] == sex_code)
        & (children_df['HAD_CPOX'] == no_code)
        & (children_df['P_NUMVRC'] >= 1)
    ].shape[0]

    return children_vaccinated_infected / children_vaccinated_not_infected

def chickenpox_by_sex():
    
    children_df = pd.read_csv('assets/NISPUF17.csv')
    male_code = 1
    female_code = 2
    
    male_ratio = calculate_ratio_by_sex(male_code, children_df)
    female_ratio = calculate_ratio_by_sex(female_code, children_df)

    stats = {
        "male": male_ratio,
        "female": female_ratio
    }

    return stats

print(chickenpox_by_sex())


{'male': 0.009675583380762664, 'female': 0.0077918259335489565}


### Problem 2.4

A __correlation__ is a _statistical relationship_ between two variables. If we wanted to know _if vaccines work_, we might look at the correlation between the use of the vaccine and whether it results in prevention of the infection or disease. In this task, you are to see if there is a correlation between _having had the chicken pox_ and the _number of chickenpox vaccine doses given_ (varicella).

Some notes on interpreting the answer. The `had_chickenpox_column` is either $1$ (for _yes_) or $2$ (for _no_), and the `num_chickenpox_vaccine_column` is the number of doses of the varicella vaccine a child has been given. A _positive correlation_ (e.g., $corr > 0$) means that higher values of _had\_chickenpox\_column_ (which means more _no_’s) go together with higher values of _num\_chickenpox\_vaccine\_column_ (which means _more doses of vaccine_). A _negative correlation_ (e.g., $corr < 0$) indicates that having had chickenpox is related to an increase in the number of vaccine doses.

Also, $pval$ is the probability of observing, by chance alone, a correlation between _had\_chickenpox\_column_ and _num\_chickenpox\_vaccine\_column_ at least as strong as the one measured. A _small pval_ means that the observed correlation is highly unlikely to occur by chance. In this case, _pval_ should be very small (will end in $e-18$ indicating a very small number).
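
The interpretation above assumes the first column holds only $1$ and $2$. The sketch below shows that setup on a hypothetical frame (`toy_df`). It assumes that any other code (here a made-up `77`) stands for an unknown answer and should be dropped, and that rows with a missing dose count are removed before calling `stats.pearsonr`, which returns the correlation and the p-value.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Hypothetical data; 77 is an assumed "unknown" code, NaN is a missing dose count.
toy_df = pd.DataFrame({
    "had_chickenpox_column": [1, 1, 2, 2, 2, 2, 77, 2],
    "num_chickenpox_vaccine_column": [0, 1, 1, 2, 2, 1, 1, np.nan],
})

# Keep only yes/no answers, then drop rows with missing values.
clean_df = toy_df[toy_df["had_chickenpox_column"].isin([1, 2])].dropna()

corr, pval = stats.pearsonr(
    clean_df["had_chickenpox_column"],
    clean_df["num_chickenpox_vaccine_column"],
)

# corr is positive here: the rows coded 2 (no chickenpox) have more doses on average.
print(float(corr), float(pval))
```

With only six usable rows the p-value is not meaningful. The point is the sign convention: more _no_’s paired with more doses gives $corr > 0$.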

In [9]:
import scipy.stats as stats
import numpy as np
import pandas as pd

'''
    [had_chickenpox_column]
    1 -> yes
    2 -> no
    [num_chicken_vaccine_column]
    number of doses given
'''

def corr_chickenpox() -> float:

    # load the dataset and keep children with at least one varicella dose
    children_df = pd.read_csv('assets/NISPUF17.csv')
    vaccined_children_df = children_df[children_df["P_NUMVRC"] >= 1]

    # Pearson correlation and its p-value; HAD_CPOX is used as stored,
    # so any codes other than 1 and 2 are not filtered out here
    corr, pval = stats.pearsonr(
        vaccined_children_df["HAD_CPOX"], vaccined_children_df["P_NUMVRC"]
    )
    
    return float(corr)


In [13]:
# example of test cases
assert (
    -1 <= corr_chickenpox() <= 1
), "You must return a float number between -1.0 and 1.0."

print(corr_chickenpox())

-0.00727421127309124
