# DSC 80: Review Problems

### Not to be turned in!

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import disc10 as disc

In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import os
import requests
from bs4 import BeautifulSoup
import re
import sklearn

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix

# Grouping and Joining

**Question 1**

*The datasets were taken from https://github.com/fivethirtyeight/data/tree/master/college-majors and altered for the purposes of this problem*
<br><br>
In this problem, there are two csv files:
- `majors-data.csv` contains employment and salary data for college majors post-grad.
- `majors-list.csv` contains major name and major category data.


You want to explore which major categories are the ***best***.
<br><br>
First, create a function `merge_majors` that merges the two csv files together. Assume that parameter `df1` will be the `majors-list.csv` file and that parameter `df2` will be the `majors-data.csv` file.
<br>

*Note: in the resultant dataframe, keep the column `Major_code` but remove the column `FOD1P`. It should look something like this:*
<img src="data/merged.png"> 
<br>
Then, create a function `best_majors` that takes in the merged dataframe and returns a list with the following values in the order given below:
- `Major_Category` with the highest average employment **rate** (not number employed).
- `Major_Category` with the highest median median salary.
- `Major_Category` with the highest minimum P75th salary.
- `Major_Category` with highest number of people employed year round.

In [4]:
df1 = pd.read_csv('data/majors-list.csv')
df2 = pd.read_csv('data/majors-data.csv')

In [6]:
def merge_majors(df1, df2):
    '''
    Merge the two input dataframes on major code number
    >>> df1 = pd.read_csv('data/majors-list.csv')
    >>> df2 = pd.read_csv('data/majors-data.csv')
    >>> merged = merge_majors(df1, df2)
    >>> len(merged) == len(df1)
    True
    >>> len(merged.columns)
    10
    >>> 'FOD1P' in merged.columns
    False
    '''
    return pd.merge(df1, df2, left_on='FOD1P', right_on='Major_code', how='left').drop('FOD1P', axis=1)

In [7]:
merged = merge_majors(df1, df2)
merged.head()

| |Major|Major_Category|Major_code|Total|Employed|Employed_full_time_year_round|Unemployed|Median|P25th|P75th|
|---|---|---|---|---|---|---|---|---|---|---|
|0|GENERAL AGRICULTURE|Agriculture & Natural Resources|1100|128148|90245|74078|2423|50000|34000|80000.0|
|1|AGRICULTURE PRODUCTION AND MANAGEMENT|Agriculture & Natural Resources|1101|95326|76865|64240|2266|54000|36000|80000.0|
|2|AGRICULTURAL ECONOMICS|Agriculture & Natural Resources|1102|33955|26321|22810|821|63000|40000|98000.0|
|3|ANIMAL SCIENCES|Agriculture & Natural Resources|1103|103549|81177|64937|3619|46000|30000|72000.0|
|4|FOOD SCIENCE|Agriculture & Natural Resources|1104|24280|17281|12722|894|62000|38500|90000.0|


In [8]:
def best_majors(df):
    '''
    Return a list of "best" majors
    >>> df1 = pd.read_csv('data/majors-list.csv')
    >>> df2 = pd.read_csv('data/majors-data.csv')
    >>> merged = merge_majors(df1, df2)
    >>> best = best_majors(merged)
    >>> len(best)
    4
    >>> all(pd.Series(best).isin(merged.Major_Category.unique()))
    True
    '''
    
    cop = df.copy()
    cop['Employment_rate'] = cop['Employed'] / cop['Total']
    best_emply_rate = cop.groupby('Major_Category')['Employment_rate'].mean().idxmax()
    best_median_sal = cop.groupby('Major_Category')['Median'].median().idxmax()
    best_min_p75 = cop.groupby('Major_Category')['P75th'].min().idxmax()
    year_round = cop.groupby('Major_Category')['Employed_full_time_year_round'].sum().idxmax()
    
    return [best_emply_rate, best_median_sal, best_min_p75, year_round]

In [9]:
best_majors(merged)

['Computers & Mathematics', 'Engineering', 'Engineering', 'Business']
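The merge-then-group pattern used in this question can be tried on a tiny example. The sketch below uses made-up toy tables (the values and the category names `X` and `Y` are hypothetical, not taken from the real csv files) and also shows why the average of per-major rates differs from the pooled rate.

```python
import pandas as pd

# Hypothetical toy tables standing in for majors-list.csv and majors-data.csv.
majors_list = pd.DataFrame({
    'FOD1P': [1, 2, 3],
    'Major': ['A', 'B', 'C'],
    'Major_Category': ['X', 'X', 'Y'],
})
majors_data = pd.DataFrame({
    'Major_code': [1, 2, 3],
    'Total': [100, 200, 50],
    'Employed': [80, 100, 45],
})

# Join on differently named key columns, then drop the redundant key.
toy = (
    majors_list
    .merge(majors_data, left_on='FOD1P', right_on='Major_code', how='left')
    .drop(columns='FOD1P')
)

# Average of the per-major rates (what the question asks for) ...
toy['Employment_rate'] = toy['Employed'] / toy['Total']
mean_rate = toy.groupby('Major_Category')['Employment_rate'].mean()

# ... versus the pooled rate, which weights large majors more heavily.
sums = toy.groupby('Major_Category')[['Employed', 'Total']].sum()
pooled_rate = sums['Employed'] / sums['Total']

# Label of the category with the largest average rate.
best_category = mean_rate.idxmax()
```

In this toy data, category `X` has an average rate of 0.65 but a pooled rate of 0.60, because the larger major has the lower rate.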

# Hypothesis Testing and Permutation Testing

* Given a *single* observed sample, make an assumption of how it came to be:
    - This assumption is the *null hypothesis*.
    - Generate data under this assumption (*probability model*).
* Simulate data under the null hypothesis (*the null distribution*).
* Ask "is it likely the given observation arose from this assumption?"


* **Null Hypothesis**
   - Nothing special happened / anything special you observed is purely by chance.
   - A hypothesis associated with a contradiction to a theory one would like to prove (e.g. two means are the same).
   
* **Alternative Hypothesis**
   - Something special happened.
   - A hypothesis associated with a theory one would like to prove (e.g. two means are **NOT** the same).
   
### Finding the Appropriate Statistics

* Statistics are used as a tool to compare two datasets. 
* Common statistics to use: TVD, ks-statistics, difference of means.

**P-Value**
* The probability that, if the null hypothesis is true, we observe a test statistic that is at least as extreme as what we've observed.
    * e.g. Calculating the proportion of simulated TVDs that are greater than or equal to the observed TVD.
* The significance level (the cutoff that the p-value is compared against) must be decided before conducting a test.

**Permutation Test**
* Shuffle the group labels a number of times.
    * **Null Hypothesis**: two distributions are the same.
    * **Alternative Hypothesis**: two distributions are **NOT** the same.
* Intuition: if two distributions are exactly the same, the pairing of labels and data is completely random! One can therefore pair the labels and data randomly to mimic the situation under the null hypothesis. 
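The shuffling idea can be sketched in a few lines of NumPy. This is a minimal illustration on made-up numbers, assuming a difference of means as the test statistic and a one-sided alternative (group 1 is larger); the data and the helper name `diff_of_means` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Hypothetical data: one value per observation, with group labels 0 and 1.
values = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def diff_of_means(values, labels):
    """Mean of group 1 minus mean of group 0."""
    return values[labels == 1].mean() - values[labels == 0].mean()

observed = diff_of_means(values, labels)

# Shuffling the labels breaks any link between label and value,
# which is the situation the null hypothesis describes.
simulated = np.array([
    diff_of_means(values, rng.permutation(labels))
    for _ in range(1000)
])

# One-sided p-value: share of shuffles at least as extreme as the observation.
p_value = (simulated >= observed).mean()
```

Here only a shuffle that reproduces the original grouping reaches the observed difference, so the p-value comes out small.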




**Question 2**

The dataset is taken from https://archive.ics.uci.edu/ml/datasets/adult and has been preprocessed for the purpose of this problem.

In the dataset, each bank customer with an `ID` has an average payment amount `AVG_PAY_AMT` for the last six months, and a `DEFAULT` value that indicates whether the customer defaults on the next payment or not (1: Yes, 0: No).

You want to explore whether a customer's average payment is related to his/her defaulting on the next payment.

To do this, use a *permutation test* to assess whether customers who default on the next payment have a lower average payment in the last six months. 

First, create a function `null_and_statistic` that returns a hard-coded list of answers to the two questions below:

* Which of the following is **NOT** a valid null hypothesis?
    1. A customer's average payment is independent of his/her defaulting on the next payment.
    2. Average payment for a customer who defaults is the same as that for a customer who does not default.
    3. A customer who defaults tends to have a lower average payment amount.
* Which of the following is a better test statistic to use in this scenario?
    1. Difference of Means
    2. TVD

Second, create a function `simulate_null` that takes in a dataframe like `default`, and returns one instance of the test statistic that you chose above under the null hypothesis. 

Lastly, create a function `pval_default` that takes in a dataframe like `default`, and calculates the p-value for the permutation test using **1000** trials.

In [10]:
def null_and_statistic():
    """
    answers to the two multiple-choice
    questions listed above.

    :Example:
    >>> out = null_and_statistic()
    >>> isinstance(out, list)
    True
    >>> out[0] in [1,2,3]
    True
    >>> out[1] in [1,2]
    True
    """
    
    return [3, 1]

In [11]:
def simulate_null(data):
    """
    simulate_null takes in a dataframe like default, 
    and returns one instance of the test-statistic 
    (difference of means) under the null hypothesis.

    :Example:
    >>> default_fp = os.path.join('data', 'default.csv')
    >>> default = pd.read_csv(default_fp)
    >>> out = simulate_null(default)
    >>> isinstance(out, float)
    True
    >>> bool(np.isfinite(out))
    True
    """
    # shuffle the DEFAULT labels
    shuffled_DEFAULT = (
        data['DEFAULT']
        .sample(replace=False, frac=1)
        .reset_index(drop=True)
    )
    
    # put them in a table
    shuffled = (
        data
        .assign(**{'Shuffled DEFAULT': shuffled_DEFAULT})
    )
    
    # compute the group differences (test statistic!)
    group_means = (
        shuffled
        .groupby('Shuffled DEFAULT')['AVG_PAY_AMT']
        .mean()
    )
    difference = group_means.diff().iloc[-1]

    return difference

In [12]:
def pval_default(data):
    """
    pval_default takes in a dataframe like default, 
    and calculates the p-value for the permutation 
    test using 1000 trials.
    
    :Example:
    >>> default_fp = os.path.join('data', 'default.csv')
    >>> default = pd.read_csv(default_fp)
    >>> out = pval_default(default)
    >>> isinstance(out, float)
    True
    >>> 0 <= out <= 0.1
    True
    """
    results = []
    for _ in range(1000):
        results.append(simulate_null(data))
        
    obs = (
        data
        .groupby('DEFAULT')['AVG_PAY_AMT']
        .mean()
        .diff()
        .iloc[-1]
    )
    # one-sided test: share of simulated differences at least as low as the observed one
    pval = (pd.Series(results) <= obs).mean()
    
    return pval

## Missingness Types

**Useful review materials from class to check out:**
- [Lecture 8: Missingness](https://github.com/ucsd-ets/dsc80-sp19/blob/master/lectures/08/Lecture%2008%20Missingness.ipynb)
- [Lecture 9: Data Imputation](https://github.com/ucsd-ets/dsc80-sp19/blob/master/lectures/09/Lecture%2009%20Data%20Imputation.ipynb)
- [Discussion 5](https://github.com/ucsd-ets/dsc80-sp19/blob/master/discussions/05/Discussion%2005.ipynb)
- [Lab 4: Questions 7 and 8](https://github.com/ucsd-ets/dsc80-sp19/blob/master/labs/lab04/lab04.ipynb)
- [Lab 5: Questions 1, 2, 3, 4](https://github.com/ucsd-ets/dsc80-sp19/blob/master/labs/lab05/lab05.ipynb)

##### Missing by Design (MD)
- The field is deliberately not collected or is set to null (hence, "missing by design").
- Whether a value will be null can be predicted exactly from the other columns alone, using a function of the rows of the dataset.
- **Example:** The executive board of a student organization is trying to choose 5 new board members for next year from a selection pool of 50. The selection process includes three rounds. After each round, each participant is ranked on a scale of 0 to 100 and only the top 50% of participants in each round are chosen to go to the next round. *In this scenario, only participants who scored highly in each round will have scores for the next round. Therefore, scores for rounds 2 and 3 are missing by design (MD) because only high scoring participants from the previous round will have these scores.*

##### Missing Completely At Random (MCAR)
- The missingness of a value isn't related to the actual, unreported value itself, nor to the values in any other fields. The missingness is not systematic.
- The missingness is unconditionally uniform across rows. MCAR doesn't bias the observed data.
- There is no relationship between the missing data and any of the other data, observed or missing.
- **Example:** A student organization is taking headshots for all of their members (the order of the students is completely random). About 2 hours into the photo shoot, the camera battery dies and some of the students are left without headshots. *In this scenario, the missing headshots (data) have nothing to do with the students or anything they did (it was completely random); it was just bad luck. Therefore, the missingness of the headshots is missing completely at random (MCAR) because there is no correlation between the missing headshots and any of the other variables regarding the organization members.*

##### Missing At Random (MAR)
- The missingness of the missing value has nothing to do with the value itself, but may be related to another field.
- The missingness is uniform across rows, perhaps conditional on another column. MAR biases the observed data, but is fixable.
- There is a systematic relationship between the missing values and the observed data (but not the missing values themselves).
- Difference between MD and MAR: If you can *exactly/always* determine missingness from other columns, the missingness is MD. If there is just some sort of systematic relationship between the missing columns/values and other columns/values that may help us predict missingness, the missingness is MAR.
- **Example:** A department (let's say it's the DSC department) has a dataframe of all the students in the DSC major, their grades in all of the DSC classes, and their grade level. Some of the students have NaN values for certain DSC courses and the department notices that there's a lot of missingness for first year students. *In this scenario, the missing grades (data) are related to the students' grade level because most first year students haven't had the opportunity to take certain courses such as DSC80 or DSC170 yet, since they haven't finished their prerequisite classes. Therefore, the missingness of the grades is missing at random (MAR) because there is a relation between the missing grades and the grade levels of the students (observed data).*

##### Not Missing At Random (NMAR)
- The missingness of the missing value is related to the actual, unreported value.
- NMAR biases the observed data in unobservable ways.
- There is a relationship between the propensity of a value to be missing and its value.
- **Example:** A company is hiring students for an internship. On the online application, the GPA field is optional (while fields such as Name, Email, etc. are not). The company notices that some of the applications don't have a submitted GPA. *In this scenario, students with lower GPAs are less likely to self-report their GPA. Therefore, the missingness of the GPAs is not missing at random (NMAR) because there is a relation between the missing GPAs and the actual values of the missing GPAs.*
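One way to build intuition for these mechanisms is to simulate them. The sketch below is an illustration under made-up assumptions (a hypothetical `year` and `gpa` table, arbitrary missingness probabilities); it only shows how the three mechanisms differ in what the missingness probability depends on.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical data: GPA rises slightly with class year.
df = pd.DataFrame({'year': rng.integers(1, 5, size=n)})
df['gpa'] = (2.8 + 0.1 * df['year'] + rng.normal(0, 0.4, size=n)).clip(0, 4)

# MCAR: every row has the same 30% chance of being missing.
mcar = df['gpa'].mask(rng.random(n) < 0.3)

# MAR: the chance of being missing depends on an observed column (year).
p_mar = np.where(df['year'] == 1, 0.6, 0.1)
mar = df['gpa'].mask(rng.random(n) < p_mar)

# NMAR: the chance of being missing depends on the unreported value itself.
p_nmar = np.where(df['gpa'] < 3.0, 0.6, 0.1)
nmar = df['gpa'].mask(rng.random(n) < p_nmar)

true_mean = df['gpa'].mean()
# Share of missing values per class year under the MAR mechanism.
mar_rate_by_year = mar.isna().groupby(df['year']).mean()
```

Comparing `mcar.mean()` and `nmar.mean()` against `true_mean` shows the observed mean staying close under MCAR and drifting upward under NMAR, while `mar_rate_by_year` shows the missingness concentrated in first-year rows.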


## Imputation
#####  Listwise deletion
- Procedure: .dropna()
- If MCAR, doesn't change statistics of the data
- If MCAR and small, may have high variance

##### Imputation with a single value: mean, median, mode  
- Procedure: .fillna(dataframe[col].mean()) (or other statistic)
- If MCAR, gives unbiased estimate of mean; variance is too low.
- Analogue for categorical data: imputation with the mode.

##### Imputation with a single value using a model: regression, kNN  
- Procedure: for a column c1, conditional on a second column c2:  
    `means = dataframe.groupby('c2')['c1'].transform('mean')`  
    `imputed = dataframe['c1'].fillna(means)`
- If MAR, gives unbiased estimate of mean; variance is too low.
- Increases correlations between the columns.
- If dependent on more than one column: use linear regression to predict missing value.

##### Probabilistic imputation by drawing from a distribution
- Procedure: draw from empirical distribution of observed data to fill missing values.
- If MCAR, gives unbiased estimate of mean and variance.
- Extending to MAR case: draw from conditional empirical distributions
    - If conditional on a single categorical column c2:
    - Apply MCAR procedure to the groups of `dataframe.groupby(c2)`
  
##### Multiple Imputation 
- Procedure: Apply probabilistic imputation multiple times, resulting in $N$ imputed datasets.
- Do analyses separately on the $N$ imputed datasets (e.g. compute correlation coefficient).
- Plot the distribution of the results of these analyses!
- If a column is missing conditional on multiple columns, your "multiple imputations" should include probabilistic imputations for each!
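The variance claims above can be checked with a short simulation. This is a sketch on synthetic data, assuming MCAR missingness; the column and the 40% missing rate are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical column with about 40% of values removed completely at random.
full = pd.Series(rng.normal(50, 10, size=5_000))
observed = full.mask(rng.random(len(full)) < 0.4)

# Single-value imputation: every hole gets the observed mean.
mean_imputed = observed.fillna(observed.mean())

# Probabilistic imputation: every hole gets a random draw from the observed values.
n_missing = int(observed.isna().sum())
draws = observed.dropna().sample(n_missing, replace=True, random_state=0)
prob_imputed = observed.copy()
prob_imputed[observed.isna()] = draws.to_numpy()
```

Both versions keep the mean roughly unchanged, but `mean_imputed.std()` is noticeably smaller than `observed.std()`, while `prob_imputed.std()` stays close to it.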

### Missingness Identification

**Question 3** 

In each of the following scenarios, choose the *best* answer. Return your answers in a function `identifications`.
1. Professors are expected to turn in their final grades on June 17th so that the university can release final grades the next day. However, professors do have the option to turn in grades late. On the day grades are released, some of the grades are missing. Consider the data to have two pieces of information: the course and the professor.
    - Is the missingness of the grades `MD, MCAR, MAR, or NMAR`?

2. At the end of a new student orientation, as students are being picked up, the orientation leaders ask new students to fill out an optional survey with their name, intended major, and favorite color. When reviewing the dataframe of surveys, the OLs notice that some of the new students did not fill out the survey. 
    - Is the missingness of the surveys `MD, MCAR, MAR, or NMAR`?
    
3. When collecting data on new employees on an online form, a company has an optional question where an employee can submit their preferred name if they have one (alongside their legal name). When reviewing this information, the company notices that the preferred name column has many empty values.
    - Is the missingness of preferred names `MD, MCAR, MAR, or NMAR`?

4. A translating company creates a language translator to change words from one language to another. Their software compiles the translated text into a dataframe, with one row for each translated word. If there is no direct conversion of the word from language 1 to language 2 the software continues onto the next word. 
    - Is the missingness of words `MD, MCAR, MAR, or NMAR`?
    
5. The DSC department has compiled a dataframe of DSC professors, their hire date, the classes they've taught, and their average recommendation rate and releases it to the students. Some of the students notice that certain professors don't have an average recommendation rate.
    - Is the missingness of the professors' average recommendation rates `MD, MCAR, MAR, or NMAR`?
    
**Be sure to address why you picked the answer you did!**

In [13]:
def identifications():
    """
    Multiple choice response for Question 3
    >>> out = identifications()
    >>> ans = ['MD', 'MCAR', 'MAR', 'NMAR']
    >>> len(out) == 5
    True
    >>> set(out) <= set(ans)
    True
    """
    # 1. MAR - based on professor
    # 2. MCAR - some students left already, not related to their name/major/color
    # 3. MAR - people with more complicated names are more likely to have a nickname;
    #    MD is reasonable as well (and is the answer returned below)
    # 4. NMAR - if there is no direct translation, the software skips the word
    # 5. MAR - professors that were just hired/haven't taught won't have an avg rec rate
    
    return ['MAR', 'MCAR', 'MD', 'NMAR', 'MAR']

### Imputation Questions
To explore imputation, we will be using the `cars` dataframe to practice different imputation techniques. The `cars` dataframe has three columns: `car_make`, `car_color`, and `car_year`.

First, we'll explore two different methods of conditional imputation: conditional median imputation and probabilistic imputation by drawing from a distribution.

**Question 4**

Create a function `impute_years` that takes in a DataFrame like `cars` and imputes the missing values of `car_year` with the (single) median value of `car_year` conditional on `car_make` of the missing row. Set the dtype of `car_year` to *int* when you're done!

In [None]:
fp = os.path.join('data', 'cars.csv')
cars = pd.read_csv(fp)

In [None]:
def impute_years(cars):
    """
    impute_years takes in a DataFrame of car data
    with missing values and imputes them using the scheme in
    the question.
    :Example:
    >>> fp = os.path.join('data', 'cars.csv')
    >>> df = pd.read_csv(fp)
    >>> out = impute_years(df)
    >>> out['car_year'].dtype == int
    True
    >>> out['car_year'].min() == df['car_year'].min()
    True
    """
    # median year per make, rounded to a whole year
    medians = cars.groupby('car_make')['car_year'].median().round()
    # In this dataset, the only McLaren car has no year,
    # so fall back to the median of the per-make medians
    medians = medians.fillna(medians.median()).round()

    out = cars.copy()
    # fill each missing year with the median for that row's make, then cast to int
    out['car_year'] = out['car_year'].fillna(out['car_make'].map(medians)).astype(int)
    return out

**Question 5**

Create a function `impute_colors` that takes in a DataFrame like `cars` and probabilistically imputes the missing values of `car_color` conditional on `car_make` using the distributions of `car_color`. To do this, you will need to sample from `car_color` distributions based on the `car_make` of the row. 

*Note*: This method of imputation should not radically change the distribution of `car_color` conditional on `car_make`!

In [None]:
def impute_colors(cars):
    """
    impute_colors takes in a DataFrame of car data
    with missing values and imputes them using the scheme in
    the question.
    :Example:
    >>> fp = os.path.join('data', 'cars.csv')
    >>> df = pd.read_csv(fp)
    >>> out = impute_colors(df)
    >>> out.loc[out['car_make'] == 'Toyota', 'car_color'].nunique() == 19
    True
    >>> 'Crimson' in out.loc[out['car_make'] == 'Austin']['car_color'].unique()
    False
    """
    cars = cars.copy()  # avoid mutating the caller's DataFrame
    # counts of each observed color within each make
    color_dists = cars.groupby('car_make')['car_color'].value_counts()

    def draw_color(make):
        # draw one color, weighted by how often it appears for this make
        counts = color_dists[make]
        return counts.sample(weights=counts.values).index[0]

    missing = cars['car_color'].isnull()
    cars.loc[missing, 'car_color'] = cars.loc[missing, 'car_make'].apply(draw_color)

    return cars

# Data Collection (HTTP, Networking/Services, APIs)

### HyperText Transfer Protocol (HTTP)

- `GET`
  - Used to request data from a specified resource.
  - One of the most common HTTP methods.
- `POST`
  - Used to send data to a server to create/update a resource.

### Status Code
  - **200** (OK): The request has succeeded, the information returned with the response is dependent on the method used in the request.
  - **400** (Bad Request): The request could not be understood by the server due to malformed syntax. The client SHOULD NOT repeat the request without modifications.
  - **404** (Not Found): The server has not found anything matching the Request-URI. No indication is given of whether the condition is temporary or permanent. 
      - `requests.get("https://httpstat.us/404").status_code`
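The codes above can also be looked up without making any network request. The sketch below uses the standard library's `http.HTTPStatus` enum; the helper `is_success` is a hypothetical name added for illustration.

```python
from http import HTTPStatus

# Print the standard reason phrase for each status code discussed above.
for code in (200, 400, 404):
    status = HTTPStatus(code)
    print(code, status.phrase)

def is_success(code):
    """True for 2xx codes, the range that signals a successful request."""
    return 200 <= code < 300
```

A check like `is_success(response.status_code)` is one simple way to decide whether a response is worth parsing.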

### APIs
  - Allows for authentication (and access to sensitive data).
  - Usually has more reliable data that is easier to parse.
  - Allows hosts to monitor usage and protect their website.
  - Why use APIs?
      - Data is changing quickly.
          - E.g: stock price data: don't want to scrape a page every few minutes.
      - Want a small piece of a much larger set of data.
          - E.g: just pull your own comments on Reddit? (all is too much)
          - E.g: want your Google GPS history? (private)
      - Usability and stability, not changing HTML requiring translation.
          - Websites change all the time.

### `robots.txt`
  - Many sites have a published policy allowing or disallowing automatic access to their site.  
  - This policy is in a text file `robots.txt`: learn more about it [here](https://moz.com/learn/seo/robotstxt).
  - Remember scraping best practices: just because you aren't prohibited by the robots policy doesn't mean you can scrape the site!
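A robots policy can be checked programmatically with the standard library's `urllib.robotparser`. The sketch below parses a made-up policy string (the rules and the `example.com` URLs are hypothetical) instead of downloading a real file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy text; a real one is served at <site>/robots.txt.
policy = """\
User-agent: *
Disallow: /posts/
Allow: /
"""

parser = RobotFileParser()
parser.parse(policy.splitlines())

# Ask whether a generic user agent ('*') may fetch each URL.
can_fetch_home = parser.can_fetch('*', 'https://example.com/')
can_fetch_posts = parser.can_fetch('*', 'https://example.com/posts/1')
```

With this policy, the home page is allowed and anything under `/posts/` is not.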

## HTML

### The Anatomy of HTML

* **HTML Document**: the totality of markup that makes up a web-page.
* **Document Object Model**: the internal representation of an HTML document as a *tree* structure.
* **HTML Element**: an object in the DOM, such as a paragraph, header, title.
* **HTML Tags**: markers that denote the *start* and *end* of an element (e.g. `<p>` and `</p>`).


### Useful Elements/Tags:

|Structure Elements|Description|
|---|---|
|`<html>`|the document|
|`<head>`|the header|
|`<body>`|the body|
|`<div>` |a logical division of the document|
|`<span>`|an *in-line* logical division|

|Head/Body Elements|Description|
|---|---|
|`<p>`|the paragraph|
|`<h1>, <h2>, ...`|header(s)|
|`<img>`|images|
|`<a>`| anchor (hyper-link)|
|[MANY MORE](https://en.wikipedia.org/wiki/HTML_element)||


### Example: Images and Hyperlinks

* Image:
```
<img src="HumDum.png" alt="Humpty Dumpty">
```

* Hyperlink: 

```
<a href="https://ucsd.edu/">Visit our page!</a>
```

We don't have an HTML parsing problem for you, so let's walk through an example of how to grab a list of artist names from the web. 

In [None]:
# request data using the GET method
page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')
page

In [None]:
# using html.parser, parse through the text on the web page
soup = BeautifulSoup(page.text, 'html.parser')
soup

In [None]:
# grab text from the 'BodyText' class
artist_name_list = soup.find(class_='BodyText')
artist_name_list

In [None]:
# all artist names are hyper-linked, so let's grab them
artist_name_list_items = artist_name_list.find_all('a')

for artist_name in artist_name_list_items:
    print(artist_name.prettify())

In [None]:
# the above result doesn't look very nice, so let's grab the contents instead to find the names
for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    print(names)
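The same steps can be tried offline on a small hard-coded document. The HTML below is made up for illustration (the names and links are hypothetical); only the `BodyText` class name mirrors the walkthrough above.

```python
from bs4 import BeautifulSoup

# Hypothetical page: a div of hyperlinked names plus an unrelated paragraph.
html = """
<html>
  <body>
    <div class="BodyText">
      <a href="/artist/1">Artist One</a>
      <a href="/artist/2">Artist Two</a>
    </div>
    <p>Not a link.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find the container by class, then collect every anchor tag inside it.
links = soup.find(class_='BodyText').find_all('a')
names = [a.get_text() for a in links]
hrefs = [a['href'] for a in links]
```

`get_text()` returns the text inside a tag, and indexing a tag with `['href']` returns that attribute's value.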

## Regular Expressions, Statistics on Text, and Text Features (NLP)

Some links you might find useful:

1. Read [Lecture 12 slides](https://github.com/ucsd-ets/dsc80-sp19/blob/master/lectures/12/Lecture%2012%20Text%20Data.ipynb) for general explanations and examples of the regular expressions we commonly use in this class. 
2. Utilize `cmd`+`F`/`ctrl`+`F` and [the official documentation of Regular expression operations](https://docs.python.org/3/library/re.html) for a quick lookup if you are unsure about your regular expression.
3. Try it out at https://regex101.com/ (Note: you should try to come up with the complete regular expression by hand first, as regex101 won't be available during the Final.)


In both of the questions below, refer to the starter code and doctests for a more detailed specification.

**Question 7** 

Write a one-line regular expression that checks whether you (a general user) are allowed to scrape the entire website.

In [None]:
def match(robots):
    """
    >>> robots1 = "User-Agent: *\\nDisallow: /posts/\\nDisallow: /posts?\\nDisallow: /amzn/click/\\nDisallow: /questions/ask/\\nAllow: /"
    >>> match(robots1)
    False
    >>> robots2 = "User-Agent: *\\nAllow: /"
    >>> match(robots2)
    True
    >>> robots3 = "User-agent: Googlebot-Image\\nDisallow: /*/ivc/*\\nUser-Agent: *\\nAllow: /"
    >>> match(robots3)
    True
    """
    return re.search(r'\bUser-Agent: \*\nAllow: /$',robots) is not None

**Question 8**

Write a function that extracts all phone numbers from the given text and returns the findings as a list of strings. Phone numbers might contain parentheses or hyphens. You don't need to clean them.

In [14]:
def extract(text):
    """
    extracts all phone numbers from given 
    text and return the findings as a 
    list of strings
    :Example:
    >>> text1 = "Contact us\\nFinancial Aid and Scholarships Office\\nPhone: (858)534-4480\\nFax: (858)534-5459\\nWebsite: fas.ucsd.edu\\nEmail: finaid@ucsd.edu\\nMailing address:\\n9500 Gilman Drive, Mail Code 0013\\nLa Jolla, CA 92093-0013"
    >>> extract(text1)
    ['(858)534-4480', '(858)534-5459']
    >>> text2 = "Contact us\\nPhone: 858-534-4480\\nFax: 858-534-5459\\nMailing address:\\n9500 Gilman Drive, Mail Code 0013\\nLa Jolla, CA 92093-00130"
    >>> extract(text2)
    ['858-534-4480', '858-534-5459']
    """
    # area code with or without parentheses, then 3 and 4 digits, hyphens optional
    return re.findall(r'(?:\(\d{3}\)|\d{3})-?\d{3}-?\d{4}', text)

In [15]:
text1 = "Contact us\\nFinancial Aid and Scholarships Office\\nPhone: (858)534-4480\\nFax: (858)534-5459\\nWebsite: fas.ucsd.edu\\nEmail: finaid@ucsd.edu\\nMailing address:\\n9500 Gilman Drive, Mail Code 0013\\nLa Jolla, CA 92093-0013"
extract(text1)

['(858)534-4480', '(858)534-5459']

# Statistics on Text 

**Question 9**

What is the tf-idf value of the word 'data' in each of the corpus sentences given below? Write a function `tfidf_data` that takes in a Series like `sentences` and calculates the tf-idf value of the word 'data' for each text.

In [18]:
sentences = pd.Series(['In text processing, words of the text represent discrete, categorical features ',
                         'How do we encode such data in a way which is ready to be used by the algorithms',
                         'The mapping from textual data to real valued vectors is called feature extraction'])
sentences

0    In text processing, words of the text represen...
1    How do we encode such data in a way which is r...
2    The mapping from textual data to real valued v...
dtype: object

In [21]:
def tfidf_data(sentences):
    """
    tf-idf of the word 'data' in each sentence of the Series `sentences`.
    """
    # term frequency: occurrences of 'data' divided by the number of words per sentence
    tf = sentences.str.count(r'\bdata\b') / (sentences.str.count(' ') + 1)
    idf = np.log(len(sentences) / sentences.str.contains(r'\bdata\b').sum())

    tfidf = tf*idf
    return tfidf

In [22]:
tfidf_data(sentences)

0    0.000000
1    0.022526
2    0.031190
dtype: float64
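The calculation can be written out by hand for a single term. The sketch below assumes the same definitions as the solution above (term frequency as count over number of words, inverse document frequency as the natural log of documents over documents containing the term); the helper `tfidf` and the three-sentence corpus are made up for illustration.

```python
import math

def tfidf(term, doc, corpus):
    """tf-idf of `term` in `doc`, where `corpus` is a list of documents."""
    words = doc.split()
    tf = words.count(term) / len(words)
    # number of documents that contain the term at least once
    n_containing = sum(term in d.split() for d in corpus)
    idf = math.log(len(corpus) / n_containing)
    return tf * idf

# Hypothetical corpus; 'data' appears in two of the three documents.
corpus = ['the data is big', 'the model is small', 'data about data']
scores = [tfidf('data', doc, corpus) for doc in corpus]
```

This helper splits on whitespace only, so punctuation attached to a word would prevent a match, and a term that appears in no document would cause a division by zero.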

### Text Features

Suppose we're some millionaires who plan on investing in a new movie. Before we throw in our money, we would like to find out what type of movie is most popular and what type of movie makes the most. For simplicity, we are only considering the score of the movies and gross income of the movies when evaluating.

`movie` is a collection of movie records. We'll be primarily working with the following columns in movie: `gross`, `genres`, `imdb_score`.

In [23]:
# loads the dataset for you to use, do not change anything here
movie = pd.read_csv('data/movie_metadata.csv')
movie = movie.loc[:,['gross', 'genres', 'imdb_score']].dropna()
movie.head()

| |gross|genres|imdb_score|
|---|---|---|---|
|0|760505847.0|Action\|Adventure\|Fantasy\|Sci-Fi|7.9|
|1|309404152.0|Action\|Adventure\|Fantasy|7.1|
|2|200074175.0|Action\|Adventure\|Thriller|6.8|
|3|448130642.0|Action\|Thriller|8.5|
|5|73058679.0|Action\|Adventure\|Sci-Fi|6.6|


**Question 10**

We begin by extracting information from the `genres` column. Observe that the genres of each movie are given as a single string. Write a helper function called `vectorize` that vectorizes the `genres` column. (Check out [Lecture 12](https://github.com/ucsd-ets/dsc80-sp19/blob/master/lectures/12/Lecture%2012%20Text%20Data.ipynb) if you get stuck! This is bag of words!)

In [24]:
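def vectorize(row):
    """
    Create a vector, indexed by the distinct genres, with counts of the genres in that entry.
    The `.split` call only works on a single string, so `row` must be one row of `movie`,
    e.g. via movie.apply(vectorize, axis=1).
    """
    return pd.Series(row['genres'].split('|')).value_counts()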
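Because the `genres` strings are pipe-separated, a whole column can also be turned into a bag-of-words table in one call. The sketch below uses `Series.str.get_dummies` on a made-up Series; it is an alternative illustration, not necessarily the approach intended by the question.

```python
import pandas as pd

# Hypothetical genre strings in the same pipe-separated format as `movie['genres']`.
genres = pd.Series(['Action|Adventure', 'Action|Thriller', 'Comedy'])

# One column per distinct genre; 1 if the movie has that genre, 0 otherwise.
bag = genres.str.get_dummies(sep='|')

# Column sums give the number of movies per genre.
genre_counts = bag.sum()
```

Each row of `bag` is the indicator vector for one movie, with the columns ordered alphabetically by genre.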

# Evaluating Models

You're working with a team to create a security feature for a new web browser and search engine: *Noogle Brome*.  You want to classify domains as either malicious or benign so that your searches don't return any sites that could harm your users!

Your team has already collected features from a bunch of sites whose maliciousness was already known.  Someone on your team looked through the scikit-learn documentation and built a model, but they don't know if it's any good.  That's where you come in.

Read in the data, `malicious-sites.csv` and take a look at what you're working with.

In [25]:
df = pd.read_csv("data/malicious-sites.csv")
df.head()

| |MALICIOUS|URL_LENGTH|NUMBER_SPECIAL_CHARACTERS|CHARSET|SERVER|WHOIS_COUNTRY|WHOIS_STATEPRO|WHOIS_AGE_DAYS|WHOIS_UPDATED_DAYS|TCP_CONVERSATION_EXCHANGE|DIST_REMOTE_TCP_PORT|REMOTE_IPS|APP_BYTES|SOURCE_APP_PACKETS|REMOTE_APP_PACKETS|SOURCE_APP_BYTES|REMOTE_APP_BYTES|APP_PACKETS|DNS_QUERY_TIMES|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0|17|6|ISO-8859-1|nginx|US|AK|7995|1999|31|22|3|3812|39|37|18784|4380|39|8.0|
|1|0|17|6|UTF-8|NaN|US|TX|8212|573|57|2|5|4278|61|62|129889|4586|61|4.0|
|2|0|18|7|UTF-8|nginx|SC|Mahe|1179|1177|11|6|9|894|11|13|838|894|11|0.0|
|3|0|18|6|iso-8859-1|Apache/2|US|CO|6150|1240|12|0|3|1189|14|13|8559|1327|14|2.0|
|4|0|19|6|us-ascii|Microsoft-HTTPAPI/2.0|US|FL|8109|803|0|0|0|0|0|0|0|0|0|0.0|


A teammate has already created a train-test split, and initialized a Random Forest classifier.

*Note:* Do NOT modify this cell.  It is imperative for Question 12 that these parameters and random states remain fixed.

In [26]:
### Do NOT modify this cell ###

X, y = df.drop("MALICIOUS", axis=1), df["MALICIOUS"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

rf_classifier = RandomForestClassifier(
    n_estimators=50, max_depth=5, min_samples_split=2, class_weight={0:0.5},
    random_state=123
)

**Question 11**

Go ahead and finish the preprocessing.  We want all of our categorical columns to be one-hot encoded.  Return a list of the categorical columns in our dataset (any order is fine) in the function `categorical_columns`.

Then, finish the preprocessing step of the pipeline below by using the return value of `categorical_columns` to help you one-hot encode the categorical features before arriving at the classifier.  Make sure to get this correct, as a correct pipeline is necessary for the rest of this section.

In [32]:
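preproc = ColumnTransformer(
    transformers=[
        # One-hot encode the categorical columns; categories unseen during fit are ignored
        ('ohe', OneHotEncoder(handle_unknown='ignore'), ["CHARSET", "SERVER", "WHOIS_COUNTRY", "WHOIS_STATEPRO"]),
    ],
    remainder='passthrough'  # leave all other columns unchanged
)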

pl = Pipeline([
    
    # Finish this step!
    ("pre", preproc),
    
    ("clf", rf_classifier)
    
])

pl.fit(X_train, y_train)
y_pred = pl.predict(X_test)
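
The cell above hard-codes the categorical column names. A minimal sketch of `categorical_columns` might instead infer them from the dtypes, assuming that every non-numeric column is categorical. The function signature (taking a DataFrame) and the miniature `sample` frame are assumptions made for illustration.

```python
import pandas as pd

def categorical_columns(df):
    """Return the names of the non-numeric columns (assumed here to be the categorical ones)."""
    return [col for col in df.columns if not pd.api.types.is_numeric_dtype(df[col])]

# Hypothetical miniature frame with a few of the dataset's columns.
sample = pd.DataFrame({
    'URL_LENGTH': [17, 18],
    'CHARSET': ['UTF-8', 'us-ascii'],
    'SERVER': ['nginx', None],
    'WHOIS_COUNTRY': ['US', 'SC'],
})

print(categorical_columns(sample))
```

Judging by the columns shown in `df.head()`, this heuristic should pick out the same four columns that are listed in the `ColumnTransformer`, but that is worth checking against `df.dtypes` on the full data.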

**Question 12**

We're curious how "good" that model really is.  Calculate various metrics to get a sense of what our model is predicting, and how good it is, and then answer the following questions.

In [33]:
accuracy_score(y_test, y_pred)  # sklearn metrics take (y_true, y_pred) in that order

0.9420731707317073

If our model has too many false positives, then we're \_\_\_\_, but if there are too many false negatives, then we're \_\_\_\_.  Return only the best choice in `false_consequences`.

1. Hurting innocent domain holders; Hurting malicious domain holders
2. Hurting malicious domain holders; Hurting innocent domain holders
3. Hurting innocent domain holders; Exposing users to malicious sites
4. Exposing users to malicious sites; Hurting innocent domain holders

What do we call the proportion of malicious websites that you manage to successfully block? Write a function, `blocked_malicious`, which returns all correct choices as well as the proportion that the model above manages to successfully block (go to three decimal places).

1. Sensitivity
2. True positive rate
3. Recall
4. <font color="blue">+++++ OTHERS +++++</font>

If your boss wants you to claim that the model above is perfect (even though it isn't), what metric would you report?  In the function `fairness_claims`, return all correct choices as well as the value which will be reported.

1. Specificity
2. Precision
3. <font color="blue">+++++ OTHERS +++++</font>


In [None]:
def false_consequences():
    """
    
    >>> false_consequences() in range(1, 5)
    True
    """
    
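    # Positive class = malicious: a false positive blocks a benign site,
    # while a false negative lets a malicious site reach users.
    return 3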

In [None]:
def blocked_malicious():
    """
    
    >>> out = blocked_malicious()
    >>> set(out[0]) <= set(range(5))
    True
    >>> 0 <= out[1] <= 1
    True
    """
    
    return ([1, 2, 3], 0.345)

In [None]:
def fairness_claims():
    """
    
    >>> out = fairness_claims()
    >>> set(out[0]) <= set(range(5))
    True
    >>> 0 <= out[1] <= 1
    True
    """
    
    return ([1, 2], 1)
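
To make the metric names above concrete, here is a small sketch that derives recall (sensitivity, true positive rate), specificity, and precision from a confusion matrix. The labels are made up for illustration and are not the output of the model above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = malicious, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

recall = tp / (tp + fn)        # share of malicious sites that get blocked
specificity = tn / (tn + fp)   # share of benign sites that are left alone
precision = tp / (tp + fp)     # share of blocked sites that were truly malicious

print(recall, specificity, precision)
```

The hand-computed `recall` and `precision` should agree with `recall_score(y_true, y_hat)` and `precision_score(y_true, y_hat)`; scikit-learn has no dedicated specificity function, so it is computed from the counts here.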

**Question 13:**  

Now that we've taken a look at some of the different metrics, conduct three parameter searches on our RandomForestClassifier in order to maximize Recall, Precision, and F1 respectively.

As you do so, think about the effect each of the resulting models has on our users and domain holders.  What is maximizing recall good for?  Also, take a look at the confusion matrix for each of the resulting "best" models.  Decide for yourself which model *you* would consider to be the fairest.

You should check the following parameters:
- n_estimators: 10, 100
- max_depth: 10, 50
- min_samples_split: 2, 4
- class_weight: 50%, 20%, 10% weight on class zero

Write a function, `parameters`, which returns a dictionary of parameters over which the pipeline will search. The keys of your dictionary should be `<step name>__<param name>`, using two underscores (see [sklearn documentation](https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py)).

Write a function, `parameter_search`, which takes in X, y, and a pipeline like the one defined in Question 11 (you may assume that there is a step named "pre" and a step named "clf") and conducts three parameter searches.

For each parameter search, return the best pipeline which was able to maximize the score for Recall, Precision, and F1 respectively, using 3-Fold cross validation.

*Note:* Our classifier exists inside of a Pipeline.  You will need to figure out how to specify this classifier in order to conduct the parameter search.

*Note:* A warning may occur for some of the parameter searches.  Don't be alarmed!  Can you figure out what is causing the warning?

In [34]:
def parameters():
    
    params = {
        "clf__n_estimators": [10, 100],
        "clf__max_depth": [10, 50],
        "clf__min_samples_split": [2, 4],
        "clf__class_weight": [{0:0.5}, {0:0.2}, {0:0.1}]
    }
    
    return params
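
As a quick sanity check on the size of this search, the grid can be expanded with `ParameterGrid`. This is only a sketch: the dictionary is repeated so the block runs on its own, and the fit count ignores the final refit that `GridSearchCV` performs by default.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as returned by parameters(), repeated so the block is self-contained.
params = {
    "clf__n_estimators": [10, 100],
    "clf__max_depth": [10, 50],
    "clf__min_samples_split": [2, 4],
    "clf__class_weight": [{0: 0.5}, {0: 0.2}, {0: 0.1}],
}

grid = ParameterGrid(params)
n_candidates = len(grid)                # 2 * 2 * 2 * 3 combinations
n_fits_per_search = n_candidates * 3    # each candidate is fit once per fold with 3-fold CV

print(n_candidates, n_fits_per_search)
```

Since three separate searches are run (recall, precision, F1), the total work is roughly three times `n_fits_per_search`.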

In [35]:
from sklearn.model_selection import GridSearchCV

def parameter_search(X, y, pl):
    
    params = parameters()

    # One grid search per metric, each using 3-fold cross validation
    recl_grid = GridSearchCV(pl, params, scoring="recall", cv=3)
    prec_grid = GridSearchCV(pl, params, scoring="precision", cv=3)
    f1_grid = GridSearchCV(pl, params, scoring="f1", cv=3)
    
    # best_estimator_ is the pipeline refit on all of X, y with the best parameters
    pl_recl = recl_grid.fit(X, y).best_estimator_
    pl_prec = prec_grid.fit(X, y).best_estimator_
    pl_f1 = f1_grid.fit(X, y).best_estimator_
    
    return [pl_recl, pl_prec, pl_f1]

**Question 14:**

Finally, let's take a look at parity measures.  You believe that the majority of your users will visit sites that have been established for around one or two decades, so you want to make sure that sites of different ages are being treated with equal fairness.

Write a function `age_parity` which takes in X, y, a pipeline, a scoring function, and a value k, and computes parity measures for different age brackets.  Your function should return a Series indexed by age bracket with values as the score of predictions in that age bracket.

The domain ages should be bracketed into bins with a width of k years, represented by their upper bound.  For example, if we were to use $k=4$, then a domain that is $372$ days old would fall into bin $4$, and a domain that is $6925$ days old would fall into bin $20$.
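
The binning rule can be checked by hand on the two examples above. The sketch below handles a single age rather than a whole column, and it assumes an average year of 365.25 days; the helper name `age_bracket` is only illustrative.

```python
def age_bracket(age_days, k):
    """Upper bound, in years, of the k-year bin containing an age given in days."""
    days_per_bin = k * 365.25  # assumption: an average year has 365.25 days
    return int(k * (age_days // days_per_bin + 1))

print(age_bracket(372, 4))    # 372 days is within the first 4 years
print(age_bracket(6925, 4))   # 6925 days is between 16 and 20 years
```

With $k=4$ each bin spans $1461$ days, so $372$ days lands in the first bin (upper bound $4$) and $6925$ days lands in the fifth (upper bound $20$).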

In [39]:
def age_parity(X, y, pl, scoring, k):
    
    def age_brackets(ages, k):
        # Upper bound (in years) of the k-year bin that each age in days falls into
        return ages.apply(lambda x: k * (x // (k*365.25) + 1))
    
    results = X.assign(
        malicious=y,
        predicted=pl.predict(X),
        age_bracket=age_brackets(X.WHOIS_AGE_DAYS, k)
    )
    
    # Score the predictions separately within each age bracket
    return results.groupby("age_bracket").apply(lambda x: scoring(x.malicious, x.predicted))

In [41]:
age_parity(X, y, pl, accuracy_score, 4)

age_bracket
4.0     0.916667
8.0     0.740741
12.0    0.925110
16.0    0.888889
20.0    0.945824
24.0    0.991848
28.0    1.000000
32.0    1.000000
dtype: float64