# Understanding Password Strength
## CourseKata Performance Assessment

This assessment was originally designed for college students, so some parts may feel challenging. Just do your best and use this opportunity to show what you have learned.

Run the code below to get started.

In [None]:
# load coursekata package
suppressPackageStartupMessages({
library(coursekata)
})

# get the data
pw_data <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSYR6mecmran9-VSd4MppNE2pKUeCxeMX0y42utpc8pizCnuY9-cwpXLuS9mjiquRGgg7iSbq8KoqQI/pub?output=csv") %>%
  mutate(contains_number = as.factor(contains_number),
         contains_special = as.factor(contains_special),
         contains_upper_lower = as.factor(contains_upper_lower))


## 1. Understanding Password Strength

In today’s digital world, passwords are one of the most common forms of security. However, not all passwords are created equal. Some passwords are extremely easy to guess, while others are much harder to crack.

Security researchers measure password strength based on factors like:

- **Length:** Longer passwords tend to be harder to guess.
- **Complexity:** Including a mix of uppercase and lowercase letters, numbers, and special characters can make a password stronger.
- **Unpredictability:** Common words or sequences (like "password" or "123456") are much weaker than random combinations of characters.

### The `pw_data` data frame

Today we'll look at the dataset `pw_data`.

#### About `pw_data` 

This dataset contains **192 passwords** that were part of a simulated data leak. The data is inspired by real-world password data from [Information is Beautiful](https://docs.google.com/spreadsheets/d/1cz7TDhm0ebVpySqbTvrHrD3WpxeyE4hLZtifWSnoNTQ/edit#gid=21) and [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/main/data/2020/2020-01-14/readme.md), but it has been modified for educational purposes.

The dataset includes several pieces of information about each password, including its length, estimated strength, and an estimate of how long it would take to "crack" using a computer program designed to guess passwords. **Your job is to explore patterns in the dataset to find out what features make a password stronger or harder to guess.**

#### Variables in `pw_data`

- `password_text` The actual password text.
- `password_length` The number of characters in the password.
- `password_strength` A score from 1 to 10 indicating the estimated strength of the password (the higher the score the stronger the password).
- `crack_time_sec` The estimated time (in seconds) it would take a computer program to correctly guess the password. 
- `contains_number` 1 means the password contains at least one number; 0 means it does not.
- `contains_special` 1 means the password contains at least one special character (`!@#$%^&*` etc.); 0 means it does not.
- `contains_upper_lower` 1 means the password contains both uppercase and lowercase letters; 0 means it does not.
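
To make the 0/1 indicator variables concrete, here is a minimal sketch of how such columns could be derived from raw password text. It is an illustration only: `describe_password()` is a hypothetical helper, the example passwords are made up, and treating any non-alphanumeric character as "special" is an assumption. It is not necessarily how the values in `pw_data` were produced.

```r
# Hypothetical helper: build length and 0/1 indicator columns
# from a character vector of passwords (base R only).
describe_password <- function(pw) {
  data.frame(
    password_text = pw,
    # number of characters in each password
    password_length = nchar(pw),
    # 1 if at least one digit is present
    contains_number = as.integer(grepl("[0-9]", pw, perl = TRUE)),
    # 1 if any character is not a letter or digit (assumed definition of "special")
    contains_special = as.integer(grepl("[^A-Za-z0-9]", pw, perl = TRUE)),
    # 1 only if both an uppercase and a lowercase letter are present
    contains_upper_lower = as.integer(
      grepl("[A-Z]", pw, perl = TRUE) & grepl("[a-z]", pw, perl = TRUE)
    ),
    stringsAsFactors = FALSE
  )
}

# Made-up example passwords, not rows from pw_data
examples <- describe_password(c("password", "123456", "Tr0ub4dor&3"))
print(examples)
```

Note that `contains_upper_lower` is 1 only when both cases appear, so an all-lowercase password such as "password" gets a 0.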

### 1.2 Write some code to see what's in the data frame

In [None]:
# code here

### 1.3 Which variables in the data frame would be possible outcome variables for your analysis? Which one will you choose to use for your analysis? Explain why.

(a) List the possible outcome variables

(b) Which variable will you use as the outcome variable?

(c) Why did you choose that variable?

## 2. Explore Variation

### 2.1 Create a visualization to help you examine the distribution of the outcome variable. 

In [None]:
# code here

### 2.2 Describe what you notice about the distribution. 

Does the distribution make sense? Do you see any unusual patterns or weird things? Consider the features of a distribution that are important to comment on. 

### 2.3 Formulate a hypothesis

Let's use `password_strength` as the outcome variable. Write a word equation to represent the hypothesis that longer passwords are stronger passwords (as indicated by `password_strength`).

### 2.4 Create a visualization to explore the hypothesis

Create **at least one** data visualization to explore the hypothesis.  

In [None]:
# code here

### 2.5 Interpret your visualization

Look at the visualization you created above. Does your visualization support your hypothesis? Why or why not?

(Optional) What other patterns or trends do you notice in your data?   

## 3. Model Variation

### 3.1 Fit and visualize the model  

Now that you've explored the hypothesis with a plot, let's fit a model to describe the relationship between your chosen outcome and explanatory variable.

In the cell below, write R code to: 
1. Fit your model and save it as `my_model`.  
2. Fit an empty model and save it as `empty_model`.  
3. Recreate your visualization, then put the `empty_model` and `my_model` onto it.  


In [None]:
# code here

### 3.2 Compare models visually

Does `my_model` seem to fit the data better than the empty model? Explain your answer.

### 3.3 Get the parameter estimates  

Use the code cell below to print out the parameter estimates from `my_model`.  

In [None]:
# code here

### 3.4 Interpret the parameter estimates  

Look at the parameter estimates from `my_model`. What does each estimate mean? How do the estimates relate to the hypothesis?  

(a) What does each estimate mean?

(b) How do the estimates relate to the hypothesis?


### 3.5 Write the model in GLM notation  

Now that you've found the best-fitting model based on the hypothesis, express it using General Linear Model (GLM) notation, substituting in the actual parameter estimates from `my_model`.

(If you aren't sure how to write mathematical notation in markdown, feel free to use regular letters such as X1, b0, or even just write the name of the variable such as password_strength.)
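
As a reminder of the general shape only, a model with one explanatory variable can be written in GLM notation as shown below. The symbols are generic placeholders, not estimates from `pw_data`.

$$Y_i = b_0 + b_1 X_i + e_i$$

Here $Y_i$ is the outcome for observation $i$, $X_i$ is the explanatory variable, $b_0$ and $b_1$ are the parameter estimates, and $e_i$ is the residual (error) for that observation.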

### 3.6 Write R code to generate measures of how well the model fits the data

In order to assess how well the model fits the data, we need to examine some quantitative measures of model fit. Write R code to generate such measures. 

In [None]:
# code here

### 3.7 Assess model fit  

How well does the model fit the data? Explain your answer using one or more quantitative measures (such as PRE, F, or SS).
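
As a refresher on how these measures relate to one another, the sketch below computes SS, PRE, and F by hand. It deliberately uses R's built-in `mtcars` data rather than `pw_data`, and the names `empty_fit` and `wt_fit` are illustrative only. In the course, the `supernova()` function reports the same quantities in a single table.

```r
# Stand-in example on built-in data (not pw_data)
empty_fit <- lm(mpg ~ NULL, data = mtcars)  # empty model: the mean only
wt_fit    <- lm(mpg ~ wt, data = mtcars)    # model with one explanatory variable

# Sums of squares
ss_total <- sum(resid(empty_fit)^2)  # error left over from the empty model
ss_error <- sum(resid(wt_fit)^2)     # error left over from the fuller model
ss_model <- ss_total - ss_error      # error reduced by adding the predictor

# PRE: proportion of the empty model's error that the fuller model removes
pre <- ss_model / ss_total

# F: variance explained per model df, relative to variance left per error df
df_model <- 1
f_ratio  <- (ss_model / df_model) / (ss_error / df.residual(wt_fit))

print(c(SS_total = ss_total, SS_model = ss_model, SS_error = ss_error,
        PRE = pre, F = f_ratio))
```

A PRE close to 0 means the explanatory variable removes little of the empty model's error; a PRE close to 1 means it removes most of it.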


## 4. Conclusion  

Now that you've explored your hypothesis, fit a model, and assessed how well the model explains variation in the outcome, take a step back and reflect on what you’ve learned.

### 4.1 Summarize your findings and interpret the results
In a paragraph or two, describe how the results from your data analysis relate to the original hypothesis. Cite quantitative evidence to support your answer. Also discuss the implications of your findings for password security policies, and suggest any follow-up questions you'd like to explore in a later data analysis.

## 5. (Optional) What is data science **to you**?

Now that you are nearing the end of your class, take a moment to reflect. This section is entirely optional and will not be graded. 

### 5.1 If your friend asked you to describe what data science is, what would you tell them?

### 5.2 Where do you think data science should be taught in high school?

- As part of a computer science course
- As part of a math course
- As part of a statistics course
- As a standalone course
- Other

### 5.2b If you marked "other" in the question above, where do you think data science should be taught in high school?

### 5.3 Why do you think data science should be taught that way?