# Workshop 1

Here you will test your __junior data scientist__ skills. Read each problem carefully, write your own test cases, and validate that everything works as expected.

## 1. Regular Expressions

Complete the code below according to each __requirement__. The part marked `#YOUR CODE HERE` is where you _should_ write your solution, although you _could_ change anything else you want.

### Problem 1.1

Find a list of all of the names in the following string using _regex_.

In [5]:
import re
def names() -> list:
    """
    Find all names in a given string using regex.

    Returns:
        A list of names found in the string.
    """
    simple_string = """Amy is 5 years old, and her sister Mary is 2 years old. 
    Ruth and Peter, their parents, have 3 kids."""

    #All names are (and should be) capitalized
    pattern = r'[A-Z][a-z]+'

    return re.findall(pattern, simple_string)

In [6]:
# example of test case
assert len(names()) == 4, "There are four names in the simple_string."
print(names())

['Amy', 'Mary', 'Ruth', 'Peter']
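
The pattern above treats every capitalized word as a name, which works for this string but not in general. A minimal sketch of that limitation, using a made-up sentence (`sample` is not part of the workshop data) and one possible, heuristic refinement:

```python
import re

# Hypothetical sentence (not from the workshop) that starts with a non-name word.
sample = "Yesterday Amy met Mary."

# The capitalized-word pattern also picks up the sentence-initial word.
capitalized = re.findall(r"[A-Z][a-z]+", sample)

# One possible refinement: require whitespace before the word, so a word at
# the very start of the text is skipped. This is only a heuristic: it would
# also skip a real name that opens the text, such as "Amy" in simple_string.
after_space = re.findall(r"(?<=\s)[A-Z][a-z]+", sample)

print(capitalized)
print(after_space)
```

Neither pattern understands what a name is, so test cases on your own strings are the best way to decide which trade-off suits the data.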


### Problem 1.2

The _dataset file_ in [assets/grades.txt](./assets/grades.txt) contains a line-separated _list of people_ with their __grade__ in a class. Create a _regex_ to generate a list of just those students who received a __B__ in the course.

In [7]:
import re
def grades() -> list:
    """
    Generate a list of students who received a B in the course.

    Returns:
        A list of students who received a B.
    """
    with open("assets/grades.txt", "r") as file:
        grades = file.read()

    pattern = r'^[A-Za-z\s]+: B\s*$'
    #\s to match any space, $ means the end of the line
    students = re.findall(pattern, grades, re.MULTILINE)

    return students

In [8]:
# example of test case
assert len(grades()) == 16
print(grades())

['Bell Kassulke: B', 'Simon Loidl: B ', 'Elias Jovanovic: B ', 'Hakim Botros: B', 'Emilie Lorentsen: B', 'Jake Wood: B', 'Fatemeh Akhtar: B', 'Kim Weston: B', 'Yasmin Dar: B', 'Viswamitra Upandhye: B', 'Killian Kaufman: B', 'Elwood Page: B', 'Elodie Booker: B', 'Adnan Chen: B', 'Hank Spinka: B', 'Hannah Bayer: B']
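
Note that the result above keeps the `: B` suffix, and some entries keep a trailing space. If only the names are wanted, a capture group is one way to get them. The sketch below uses a few made-up lines (`sample_grades` is hypothetical, written in the same `Name: Grade` layout) instead of the real file:

```python
import re

# Hypothetical lines in the same "Name: Grade" layout as assets/grades.txt.
sample_grades = "Ann Lee: B\nBob Stone: A\nCara Diaz: B \nDan Fox: C\n"

# The parentheses capture only the name. [ \t]* tolerates trailing spaces or
# tabs but, unlike \s, cannot run past the end of the line.
pattern = r"^([A-Za-z ]+): B[ \t]*$"
b_students = re.findall(pattern, sample_grades, re.MULTILINE)

print(b_students)
```

With a single capture group, `re.findall` returns the captured text instead of the whole match, so the list contains bare names.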


### Problem 1.3

Consider the standard _web log file_ in [assets/logdata.txt](./assets/logdata.txt). This _file_ records the _access_ a user makes when visiting a web page. Each __line of the log__ has the following _items_:

- a __host__ (e.g., `146.204.224.152`)
- a __user_name__ (e.g., `feest6811`. _Hint:_ sometimes the user name is missing! In this case, use `-` as the value for the username.)
- the __time__ a request was made (e.g., `21/Jun/2019:15:45:24 -0700`)
- the __request type__ (e.g., `POST /incentivize HTTP/1.1`. _Note:_ not everything is a POST!)

Your task is to convert this into a list of dictionaries, where each dictionary looks like the following:

```python
example_dict = {"host":"146.204.224.152", 
                "user_name":"feest6811", 
                "time":"21/Jun/2019:15:45:24 -0700",
                "request":"POST /incentivize HTTP/1.1"}
```

In [18]:
import re
def logs() -> list:
    """
    Create a list of dictionaries from a web log file using regex.

    Returns:
        A list of dictionaries with keys 'host', 'user_name', 'time', and 'request'.
    """
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
    actions = re.split(r"\n", logdata)

    logs_list = []

    pattern = r'(\d+\.\d+\.\d+\.\d+) - (\S+) \[(.*?)\] "(.*?)"'
    #each () is a group, \S is used to match any non-whitespace character
    #(.*?) gets everything inside the square brackets and the requests (which is inside the double quotes)
    
    for action in actions:
        match = re.match(pattern, action)
        if match:
            log_entry = {
                "host": match.group(1),
                "user_name": match.group(2),
                "time": match.group(3),
                "request": match.group(4),
            }
            logs_list.append(log_entry)

    return logs_list



In [19]:
# Test cases
one_item = {
    "host": "146.204.224.152",
    "user_name": "feest6811",
    "time": "21/Jun/2019:15:45:24 -0700",
    "request": "POST /incentivize HTTP/1.1",
}

# Ensure the item is in the logs
assert (
    one_item in logs()
), "Sorry, this item should be in the log results, check your formatting"

# Ensure the log length is correct
assert len(logs()) == 979, "There are 979 entries in the logdata."

# Print a few entries to verify
for entry in logs()[:3]:
    print(entry)

# To print the length of logs
print(f"Total log entries: {len(logs())}")

{'host': '146.204.224.152', 'user_name': 'feest6811', 'time': '21/Jun/2019:15:45:24 -0700', 'request': 'POST /incentivize HTTP/1.1'}
{'host': '197.109.77.178', 'user_name': 'kertzmann3129', 'time': '21/Jun/2019:15:45:25 -0700', 'request': 'DELETE /virtual/solutions/target/web+services HTTP/2.0'}
{'host': '156.127.178.177', 'user_name': 'okuneva5222', 'time': '21/Jun/2019:15:45:27 -0700', 'request': 'DELETE /interactive/transparent/niches/revolutionize HTTP/1.1'}
Total log entries: 979
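
As an alternative to numbered groups, named groups let the regex itself carry the dictionary keys. This is a sketch on two made-up lines (`sample_log` is hypothetical and only mimics the layout described above; the second line has no user name), not a replacement for the solution:

```python
import re

# Hypothetical log lines in the layout described above.
sample_log = (
    '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1"\n'
    '10.0.0.1 - - [21/Jun/2019:15:45:30 -0700] "GET /index HTTP/1.1"\n'
)

# (?P<name>...) names each group, so groupdict() returns the final dictionary.
log_pattern = re.compile(
    r'(?P<host>\d+\.\d+\.\d+\.\d+) - (?P<user_name>\S+) '
    r'\[(?P<time>.*?)\] "(?P<request>.*?)"'
)

entries = [match.groupdict() for match in log_pattern.finditer(sample_log)]
for entry in entries:
    print(entry)
```

Because `\S+` also matches a lone `-`, a missing user name is stored as `-` without any special handling.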


## 2. Descriptive Analysis

For this section, you'll be looking at _2017 data on immunizations_ from the _CDC_. Your _data file_ for the next tasks is [assets/NISPUF17.csv](./assets/NISPUF17.csv). A _data user's guide_, which you'll need to map the variables in the data to the questions being asked, is available at [assets/NIS-PUF17-DUG.pdf](./assets/NIS-PUF17-DUG.pdf).

### Problem 2.1

Write a function called _proportion\_of\_education_ which returns the proportion of __children__ in the dataset whose mother's education level was less than high school ($<12$), high school ($12$), more than high school but not a college graduate ($>12$), or a _college degree_.

This _function_ should return a __dictionary__ in the form of (use the _correct numbers_, do not round numbers):

```python
{
    "less than high school": 0.2,
    "high school": 0.4,
    "more than high school but not college": 0.2,
    "college": 0.2
}
```

In [22]:
import pandas as pd

def proportion_of_education() -> dict:
    """
    Calculates the proportion of children whose mothers have different levels of education.

    Returns:
        A dictionary with the proportion of children whose mothers have less than high school, 
        high school, more than high school but not college, and college education.
    """
    # YOUR CODE HERE
    df = pd.read_csv("assets/NISPUF17.csv")

    # Assuming the column for mother's education level is 'EDUC1'
    education_levels = {
        1: "less than high school",
        2: "high school",
        3: "more than high school but not college",
        4: "college"
    }

    # Share of rows for each education code (the shares sum to 1)
    proportions = df['EDUC1'].value_counts(normalize=True)

    result = {education_levels[level]: proportions[level] for level in education_levels}

    return result

print(proportion_of_education())


{'less than high school': 0.10202002459160373, 'high school': 0.172352011241876, 'more than high school but not college': 0.24588090637625154, 'college': 0.47974705779026877}
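
To see what `value_counts(normalize=True)` does without loading the full file, here is a small sketch on toy data. The five rows are made up; only the column name `EDUC1` and the labels come from the solution above:

```python
import pandas as pd

# Toy data (made up, not taken from NISPUF17.csv): five coded education levels.
toy = pd.DataFrame({"EDUC1": [1, 2, 2, 3, 4]})

labels = {
    1: "less than high school",
    2: "high school",
    3: "more than high school but not college",
    4: "college",
}

# normalize=True divides each count by the number of non-missing rows.
shares = toy["EDUC1"].value_counts(normalize=True)
result = {labels[code]: float(shares[code]) for code in labels}

print(result)
```

Code `2` appears in two of the five rows, so its share is 0.4 and every other share is 0.2, the same shape as the example dictionary in the problem statement.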


In [21]:
# example of test cases
assert type(proportion_of_education()) == type({}), "You must return a dictionary."
assert (
    len(proportion_of_education()) == 4
), "You have not returned a dictionary with four items in it."



### Problem 2.2

Let's explore the relationship between being _fed breastmilk_ as a child and getting a seasonal _influenza vaccine_ from a healthcare provider. Return a __tuple__ of the _average number of influenza vaccines_ for those children we know received breastmilk as a child and those we know did not.

This _function_ should return a __tuple__ in the form (use the _correct numbers_):

```python
(2.5, 0.1)
```

In [24]:
import pandas as pd
def average_influenza_doses() -> tuple:
    """
    Calculate the average number of influenza vaccines for children who were breastfed and those who were not.

    Returns:
        A tuple containing the average number of influenza vaccines 
        for children who were breastfed and those who were not.
    """

    # Read the CSV file
    df = pd.read_csv("assets/NISPUF17.csv")

    breastfed = df[df['CBF_01'] == 1]['P_NUMFLU']
    not_breastfed = df[df['CBF_01'] == 2]['P_NUMFLU']

    avg_breastfed = breastfed.mean()
    avg_not_breastfed = not_breastfed.mean()

    doses = (avg_breastfed, avg_not_breastfed)

    return doses

# Example of calling the function
print(average_influenza_doses())

(1.8799187420058687, 1.5963945918878317)
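
The same two averages can also be computed in one step with `groupby`. The sketch below uses made-up rows; only the column names and the `1`/`2` coding of `CBF_01` are taken from the solution above:

```python
import pandas as pd

# Toy data (made up): CBF_01 is 1 for breastfed and 2 for not breastfed.
# One dose count is missing on purpose.
toy = pd.DataFrame({
    "CBF_01": [1, 1, 1, 2, 2],
    "P_NUMFLU": [2.0, 3.0, None, 1.0, 2.0],
})

# mean() skips missing values, so the third row does not affect the average.
means = toy.groupby("CBF_01")["P_NUMFLU"].mean()
doses = (float(means[1]), float(means[2]))

print(doses)
```

Grouping keeps the filter values in one place, but any other codes present in `CBF_01` would get their own group, so the tuple still has to pick out codes `1` and `2` explicitly.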


In [25]:
# example of test cases
assert (
    len(average_influenza_doses()) == 2
), "Return two values in a tuple, the first for yes and the second for no."

### Problem 2.3

It would be interesting to see if there is any evidence of a link between _vaccine effectiveness_ and _sex of the child_. Calculate the _ratio of the number of children_ who contracted __chickenpox__ but _were vaccinated against it_ (at least one varicella dose) versus those who were vaccinated but did not contract _chickenpox_. Return results by _sex_.

This _function_ should return a __dictionary__ in the form of (use the _correct numbers_):

```python
{
    "male":0.2,
    "female":0.4
}
```

_Note:_ To aid in verification, the `chickenpox_by_sex()['female']` value the autograder is looking for starts with the digits `0.0077`.


In [28]:
import pandas as pd
def chickenpox_by_sex() -> dict:
    """
    Calculates the ratio of the number of children who contracted chickenpox but were vaccinated 
    against it versus those who were vaccinated but did not contract chickenpox, by sex.

    Returns:
        A dictionary with the ratio for male and female children.
    """

    stats = {}

    # YOUR CODE HERE
    df = pd.read_csv("assets/NISPUF17.csv")
    vaccinated = df[df['P_NUMVRC'] >= 1]

    # Males
    male_vaccinated = vaccinated[vaccinated['SEX'] == 1]
    male_had_cpox = male_vaccinated[male_vaccinated['HAD_CPOX'] == 1].shape[0]
    male_not_had_cpox = male_vaccinated[male_vaccinated['HAD_CPOX'] == 2].shape[0]
    male_ratio = male_had_cpox / male_not_had_cpox if male_not_had_cpox != 0 else 0

    # Females
    female_vaccinated = vaccinated[vaccinated['SEX'] == 2]
    female_had_cpox = female_vaccinated[female_vaccinated['HAD_CPOX'] == 1].shape[0]
    female_not_had_cpox = female_vaccinated[female_vaccinated['HAD_CPOX'] == 2].shape[0]
    female_ratio = female_had_cpox / female_not_had_cpox if female_not_had_cpox != 0 else 0

    stats = {
        "male": male_ratio,
        "female": female_ratio
    }

    return stats

print(chickenpox_by_sex())

{'male': 0.009675583380762664, 'female': 0.0077918259335489565}
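
The per-sex counting above can be written more compactly with `pd.crosstab`. This is a sketch on made-up rows, assuming the same codes as the solution (`SEX` 1 = male, 2 = female; `HAD_CPOX` 1 = yes, 2 = no):

```python
import pandas as pd

# Toy data (made up); column names and codes follow the solution above.
toy = pd.DataFrame({
    "SEX":      [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "HAD_CPOX": [1, 2, 2, 1, 1, 2, 2, 2, 2, 1],
    "P_NUMVRC": [1, 2, 1, 0, 2, 1, 1, 1, 2, 0],
})

# Keep only children with at least one varicella dose.
vaccinated = toy[toy["P_NUMVRC"] >= 1]

# Rows are SEX codes, columns are HAD_CPOX codes, cells are counts.
counts = pd.crosstab(vaccinated["SEX"], vaccinated["HAD_CPOX"])

# Ratio of "had chickenpox" (column 1) to "did not" (column 2), per sex.
ratios = counts[1] / counts[2]
stats = {"male": float(ratios[1]), "female": float(ratios[2])}

print(stats)
```

In the toy data one of three vaccinated boys and one of five vaccinated girls had chickenpox, giving ratios of 1/2 and 1/4. Unlike the solution, this sketch does not guard against a zero denominator.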


### Problem 2.4

A __correlation__ is a _statistical relationship_ between two variables. If we wanted to know _if vaccines work_, we might look at the correlation between the use of the vaccine and whether it results in prevention of the infection or disease. In this task, you are to see if there is a correlation between _having had chickenpox_ and the _number of chickenpox vaccine doses given_ (varicella).

Some notes on interpreting the answer. The `had_chickenpox_column` is either $1$ (for _yes_) or $2$ (for _no_), and the `num_chickenpox_vaccine_column` is the number of doses a child has been given of the varicella vaccine. A _positive correlation_ (e.g., $corr > 0$) means that an increase in _had\_chickenpox\_column_ (which means more _no_’s) would also increase the values of _num\_chickenpox\_vaccine\_column_ (which means _more doses of vaccine_). If there is a _negative correlation_ (e.g., $corr < 0$), it indicates that having had chickenpox is related to an increase in the number of vaccine doses.

Also, $pval$ is the probability of observing, purely by chance, a correlation between _had\_chickenpox\_column_ and _num\_chickenpox\_vaccine\_column_ that is at least as strong as the one measured. A _small pval_ means that the observed correlation is highly unlikely to occur by chance. In this case, _pval_ should be very small (it will end in $e-18$, indicating a very small number).
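
The sign convention described above is easy to get backwards, so here is a small sketch that computes the Pearson coefficient from its definition on made-up data. The helper `pearson_r` and both dose lists are hypothetical; the first value returned by `stats.pearsonr` is this same coefficient.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data (made up): 1 = had chickenpox, 2 = did not.
had_chickenpox = [1, 1, 1, 2, 2, 2]
doses_more_for_no = [0, 1, 0, 2, 1, 2]   # children coded 2 ("no") got more doses
doses_more_for_yes = [2, 1, 2, 0, 1, 0]  # children coded 1 ("yes") got more doses

r_positive = pearson_r(had_chickenpox, doses_more_for_no)
r_negative = pearson_r(had_chickenpox, doses_more_for_yes)

print(r_positive)  # greater than 0
print(r_negative)  # less than 0
```

Because _no_ is coded as the larger number, a positive coefficient means that more doses go together with not having had chickenpox.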

In [38]:
import scipy.stats as stats
import numpy as np
import pandas as pd


def corr_chickenpox() -> float:
    """
    Calculate the correlation between having had chickenpox and the number of chickenpox vaccine doses given.

    Returns:
        A float number which is the correlation coefficient.
    """

    # this is just an example dataframe
    df = pd.DataFrame(
        {
            "had_chickenpox_column": np.random.randint(1, 3, size=(100)),
            "num_chickenpox_vaccine_column": np.random.randint(0, 6, size=(100)),
        }
    )

    # here is some stub code to actually run the correlation
    corr, pval = stats.pearsonr(
        df["had_chickenpox_column"], df["num_chickenpox_vaccine_column"]
    )

    # YOUR CODE HERE
    df = pd.read_csv("assets/NISPUF17.csv")

    df = df[['HAD_CPOX', 'P_NUMVRC']].dropna()
    df = df[(df['HAD_CPOX'] == 1) | (df['HAD_CPOX'] == 2)]  # There are some values that are not 1 or 2

    df = df.rename(columns={'HAD_CPOX': 'had_chickenpox_column', 'P_NUMVRC': 'num_chickenpox_vaccine_column'})

    # Correlation and Pval
    corr, pval = stats.pearsonr(df['had_chickenpox_column'], df['num_chickenpox_vaccine_column'])
    print(pval)
    print(corr)

    return corr

correlation = corr_chickenpox()

2.7780263183463457e-18
0.07044873460147985


In [36]:
# example of test cases
assert (
    -1 <= corr_chickenpox() <= 1
), "You must return a float number between -1.0 and 1.0."

2.7780263183463457e-18
0.07044873460147985
