# Data Science Interview Questions

## Problem: Fair and unfair coins
This problem was asked by Facebook.   
There is a fair coin (one side heads, one side tails) and an unfair coin (both sides tails).  
You pick one at random, flip it 5 times, and observe that it comes up as tails all five times. What is the chance that you are flipping the unfair coin?



#### Solution
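By Bayes' theorem, the likelihood times the prior must be divided by the total probability of seeing five tails:

$$P(\text{unfair} \mid 5\text{ tails}) = \frac{P(5\text{ tails} \mid \text{unfair})\,P(\text{unfair})}{P(5\text{ tails} \mid \text{unfair})\,P(\text{unfair}) + P(5\text{ tails} \mid \text{fair})\,P(\text{fair})} = \frac{1 \cdot 0.5}{1 \cdot 0.5 + 0.5^5 \cdot 0.5} = \frac{32}{33} \approx 0.97$$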

In [2]:
p_unfair = 1
p_fair = 0.5**5
p_fair

0.03125

In [3]:
p_unfair / (p_unfair + p_fair)

0.9696969696969697
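
The same calculation can be written with the priors made explicit, which helps if the two coins are not equally likely to be picked. This is a minimal sketch: the function name `posterior_unfair` and its `prior_unfair` parameter are illustrative, not part of the original solution.

```python
from fractions import Fraction


def posterior_unfair(n_tails, prior_unfair=Fraction(1, 2)):
    """Posterior probability of the two-tailed coin after n_tails tails in a row."""
    like_unfair = Fraction(1)               # the two-tailed coin always lands tails
    like_fair = Fraction(1, 2) ** n_tails   # a fair coin lands tails n times with (1/2)^n
    # Total probability of the observation (denominator of Bayes' theorem)
    evidence = like_unfair * prior_unfair + like_fair * (1 - prior_unfair)
    return like_unfair * prior_unfair / evidence


print(posterior_unfair(5))          # 32/33
print(float(posterior_unfair(5)))   # roughly 0.97
```

Using `Fraction` keeps the result exact, so the answer can be read off as 32/33 rather than a rounded decimal.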

---

## Problem: Min of two uniform distributions 
This problem was asked by Google.    
Say we have X ~ Uniform(0, 1) and Y ~ Uniform(0, 1). What is the expected value of the minimum of X and Y?

#### Solution
Let $Z = \min(X, Y)$. The minimum exceeds $z$ only when both variables do, so:

$P(Z \le z) = 1 - P(Z > z) = 1 - P(X > z, Y > z)$        

For a uniform distribution the following is true for a value of z between 0 and 1:    

$P(X > z) = 1-z \space \text{and} \space P(Y>z) = 1 - z$        

Since X and Y are iid, this yields:    

$P(Z \le z) = 1 - P(X > z, Y > z) = 1 - (1-z)^2$

Now we have the CDF of $Z$. We can get the PDF by taking the derivative with respect to $z$:    

$f_Z(z) = 2(1-z)$ for $0 \le z \le 1$

Then we can solve for the expected value by taking the integral:    

$E[Z] = \int_{0}^{1} zf_Z(z)dz = 2\int_{0}^{1} z(1-z)dz = 2(\frac{1}{2}-\frac{1}{3}) = \frac{1}{3}$
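
The analytic result can be sanity-checked with a quick Monte Carlo simulation. This is only a sketch: the function name, sample size, and seed are arbitrary choices made for illustration.

```python
import random


def simulate_expected_min(n_samples=200_000, seed=0):
    """Estimate E[min(X, Y)] for independent X, Y ~ Uniform(0, 1) by simulation."""
    rng = random.Random(seed)   # seeded so the run is reproducible
    total = 0.0
    for _ in range(n_samples):
        total += min(rng.random(), rng.random())
    return total / n_samples


estimate = simulate_expected_min()
print(f"simulated E[min(X, Y)] = {estimate:.4f} (analytic value: {1 / 3:.4f})")
```

With this many samples the estimate should land within a few thousandths of 1/3.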

---

## Problem: Fraud detection trade-offs
This problem was asked by Affirm.    
Assume we have a classifier that produces a score between 0 and 1 for the probability of a particular loan application being fraudulent.    
In this scenario:    
- a) what are false positives 
- b) what are false negatives
- c) what are the trade-offs between them in terms of dollars and how should the model be weighted accordingly?

#### Solution
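- a) A false positive is a legitimate loan application that the model classifies as fraudulent.
- b) A false negative is a fraudulent loan application that the model classifies as not fraudulent.
- c) A false negative typically costs far more in dollars: the lender can lose the full loan amount, whereas a false positive costs the forgone profit on a good loan plus some customer goodwill. We therefore want to avoid false negatives above all in this case. One way is to adjust the cost function so that false negatives are weighted more heavily, roughly in proportion to the ratio of the two dollar costs.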
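
The dollar trade-off in (c) can be made concrete by picking the score threshold that minimizes total cost. The sketch below is illustrative only: the scores, labels, and the dollar costs per error are made-up numbers, and `expected_cost` and `best_threshold` are hypothetical helper names.

```python
def expected_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Total dollar cost when every application with score >= threshold is flagged."""
    cost = 0.0
    for score, is_fraud in zip(scores, labels):
        flagged = score >= threshold
        if flagged and not is_fraud:
            cost += cost_fp   # false positive: a good loan is rejected
        elif not flagged and is_fraud:
            cost += cost_fn   # false negative: a fraudulent loan is approved
    return cost


def best_threshold(scores, labels, cost_fp, cost_fn):
    """Try each observed score (plus 'flag nothing') and keep the cheapest threshold."""
    candidates = sorted(set(scores)) + [1.01]   # 1.01 lies above every score
    return min(candidates,
               key=lambda t: expected_cost(scores, labels, t, cost_fp, cost_fn))


# Toy data (assumed): model scores and true labels (1 = fraud)
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1]

print(best_threshold(scores, labels, cost_fp=100, cost_fn=100))    # 0.7
print(best_threshold(scores, labels, cost_fp=100, cost_fn=1000))   # 0.35
```

When a false negative is assumed to cost ten times as much as a false positive, the cheapest threshold drops from 0.7 to 0.35: the model flags more applications and accepts extra false positives to avoid missing fraud.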

---

## Problem: Ranking list of shows 
This problem was asked by Netflix.    
How would you design a metric to compare rankings of lists of shows for a given user?    

#### Solution
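- An edit distance between the two lists, analogous to the edit distance between strings: it counts how many changes are needed to turn one list into the other.
- A weighted sum over the titles the two lists share, with weights set by each title's position so that agreement near the top of the list (higher priority) counts more.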
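
One possible way to combine both ideas is a top-weighted overlap score: at every depth `d`, measure how much the top-`d` prefixes of the two lists agree, then average over depths. This is a sketch of one such metric, not a definitive choice; the function name `top_weighted_overlap` and the placeholder titles are assumptions.

```python
def top_weighted_overlap(list_a, list_b, depth=None):
    """Average, over depths 1..depth, of the fraction of titles shared by the two prefixes.

    Returns 1.0 for identical rankings; disagreements near the top lower the score
    more than disagreements near the bottom, because early positions appear in
    every later prefix.
    """
    if depth is None:
        depth = min(len(list_a), len(list_b))
    if depth == 0:
        return 0.0   # nothing to compare
    seen_a, seen_b = set(), set()
    total = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        total += len(seen_a & seen_b) / d   # agreement of the two top-d prefixes
    return total / depth


base = ["A", "B", "C"]
print(top_weighted_overlap(base, ["A", "B", "C"]))   # 1.0
print(top_weighted_overlap(base, ["A", "C", "B"]))   # swap at the bottom
print(top_weighted_overlap(base, ["B", "A", "C"]))   # swap at the top scores lower
```

Swapping the last two titles costs less than swapping the first two, which matches the intuition that the top of a recommendation list matters most to the user.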

---

## Problem: Measure user retention 
This problem was asked by Facebook.    
Assume you have the below tables on user actions. Write a query to get the active user retention by month.     
Table: `user_actions`

| column_name | type |
| --- | --- |
| user_id | integer |
| event_id | string ("sign-in", "like", "comment") |
| timestamp | datetime |

#### Solution
In the actions table, we can first define a temporary table called "mau" to get the monthly active users by month. Then we can do a self join by comparing every user_id across last month vs. this month, using the add_months() function to compare this month versus last month, as follows:
```sql
WITH mau AS (
    -- One row per (month, user) with at least one action in that month
    SELECT DISTINCT
        DATE_TRUNC('month', timestamp) AS month,
        user_id
    FROM user_actions
)
SELECT
    curr_month.month,
    COUNT(DISTINCT curr_month.user_id) AS retained_users
FROM
    mau curr_month
JOIN
    mau last_month
ON curr_month.month = add_months(last_month.month, 1)
    AND curr_month.user_id = last_month.user_id
GROUP BY 1
ORDER BY 1 ASC;
```
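
To see what the self join computes, the same logic can be sketched in plain Python on a handful of made-up rows. The function name `retained_users_by_month` and the sample data are assumptions for illustration; only `user_id` and `timestamp` are used, as in the query.

```python
from collections import defaultdict
from datetime import datetime


def retained_users_by_month(actions):
    """actions: iterable of (user_id, timestamp). Returns {(year, month): retained users}."""
    # Equivalent of the "mau" table: the set of active users per calendar month
    active = defaultdict(set)
    for user_id, ts in actions:
        active[(ts.year, ts.month)].add(user_id)

    retained = {}
    for (year, month), users in sorted(active.items()):
        prev = (year - 1, 12) if month == 1 else (year, month - 1)
        if prev in active:
            both = users & active[prev]   # active this month AND last month
            if both:                      # an inner join yields no row for empty matches
                retained[(year, month)] = len(both)
    return retained


# Toy rows (assumed): user 1 is active in Jan and Feb, user 3 in Feb and Mar
actions = [
    (1, datetime(2024, 1, 5)),
    (2, datetime(2024, 1, 9)),
    (1, datetime(2024, 2, 1)),
    (3, datetime(2024, 2, 3)),
    (3, datetime(2024, 3, 7)),
]
print(retained_users_by_month(actions))   # {(2024, 2): 1, (2024, 3): 1}
```

January has no row because there is no previous month in the data to join against, which mirrors the behavior of the inner join.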

---

## Problem: Consecutive integers summing to n
This problem was asked by Robinhood.

Given a number n, return the number of lists of consecutive positive integers that sum up to n.

For example, for n = 9, you should return 3 since the lists are: [2, 3, 4], [4, 5], and [9]. Can you do it in linear time?



#### Solution

In [25]:
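def n_consecutive_list_sum_n(N):
    """Count runs of consecutive positive integers that sum to N.

    Sliding window with a running sum: each integer enters and leaves the
    window at most once, so the counting itself takes linear time.
    """
    count = 0
    start = 1        # smallest integer currently in the window
    window_sum = 0   # sum of start..end
    for end in range(1, N + 1):
        window_sum += end
        # Shrink from the left for as long as the window overshoots N
        while window_sum > N:
            window_sum -= start
            start += 1
        if window_sum == N:
            print('Found: ', list(range(start, end + 1)))
            count += 1
    return count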

n_consecutive_list_sum_n(9)

Found:  [2, 3, 4]
Found:  [4, 5]
Found:  [9]


3

---

## Problem: Is the coin biased?
This problem was asked by Google.

A coin was flipped 1000 times, and 550 times it showed up heads. Do you think the coin is biased? Why or why not?

#### Solution
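The number of heads $X$ in $n = 1000$ flips of a fair coin follows a binomial distribution (a sum of Bernoulli trials), which is well approximated by a normal distribution. The statistic $Z$ we are looking for is calculated as:
$$ Z = \frac{X - \mu}{\sigma} $$
For the binomial distribution with $p = 0.5$:
$$\mu = E[X] = np = 1000 \cdot 0.5 = 500$$
$$\sigma^2 = Var[X] = np(1-p) = 1000 \cdot 0.5 \cdot 0.5 = 250, \quad \sigma = \sqrt{250} \approx 15.81$$
$$ Z = \frac{550 - 500}{15.81} \approx 3.16 > 3 $$
This means that, if the coin were fair, a result at least this far from 500 heads should occur with a < 1% chance (about 0.2%, two-sided) under normality assumptions. Therefore, the coin is likely biased.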
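
The normal-approximation test for 550 heads in 1000 flips can be checked numerically with the standard library. This is a minimal sketch; the function name `two_sided_z_test` is an illustrative choice.

```python
import math


def two_sided_z_test(successes, trials, p_null=0.5):
    """z statistic and two-sided p-value for a binomial count under H0: p = p_null."""
    mean = trials * p_null
    std = math.sqrt(trials * p_null * (1 - p_null))
    z = (successes - mean) / std
    # Two-sided tail of the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p_value = two_sided_z_test(550, 1000)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.4f}")
```

A p-value well below 0.01 means that a fair coin would rarely produce a deviation this large, which supports the conclusion that the coin is likely biased.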

---

## Problem: Metrics for a fraud classifier
This problem was asked by Uber.

Say you need to produce a binary classifier for fraud detection. What metrics would you look at, how is each defined, and what is the interpretation of each one?

#### Solution
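Because the classes are going to be imbalanced, accuracy alone is misleading and we need to look at the error rates we are interested in.    
For this case, a FN (classifying a case as non-fraudulent when it is fraudulent) is the most costly error, and fraud is probably the underrepresented class.    
Essentially I want to penalize FN. I am happy to accept some extra FP in the trade-off for reducing the FN rate.    
Therefore, sensitivity (recall, $TPR = TP / (TP + FN)$, the share of actual fraud that is caught) is my main metric in this experiment, and it must be as high as possible. Precision ($TP / (TP + FP)$, the share of flagged cases that are truly fraud) shows the price paid in false alarms.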
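
The rates discussed here follow directly from the confusion-matrix counts. The sketch below computes them from made-up labels and predictions; the function name `classification_metrics` and the toy data are assumptions.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, precision and false positive rate from binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")   # undefined when the denominator is 0

    return {
        "sensitivity": ratio(tp, tp + fn),           # share of real fraud that is caught
        "precision": ratio(tp, tp + fp),             # share of flagged cases that are fraud
        "false_positive_rate": ratio(fp, fp + tn),   # share of legitimate cases flagged
    }


# Toy data (assumed): 3 fraudulent cases out of 10
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On this toy data the classifier reaches 90% accuracy while still missing one of the three fraud cases, which is why sensitivity is more informative than accuracy on imbalanced classes.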

---

## Problem: Skewness of social graphs
This problem was asked by Facebook.

Imagine the social graphs for both Facebook and Twitter. How do they differ? What metric would you use to measure how skewed the social graphs are?

#### Solution
The main difference is that Facebook is composed of friendships, in which two users are mutual friends, whereas Twitter is composed of followership, in which one user follows another (usually an influential figure). This leads to Twitter likely having a small number of people with very large followership, whereas in Facebook that is less often the case (besides the fact that the number of friends on a personal profile is capped).

One way to measure the skewness of the social graphs is to represent each user as a node, and look at the degrees of the nodes. The degree of a node is simply the number of connections it has to other nodes. It is likely that for Twitter, you will see more right skewness, i.e. most nodes have a low degree but a small number of nodes have a very high degree - like a “hub-and-spoke” model.
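
The degree-based measure can be sketched as the skewness of the degree distribution computed from an edge list. This is a simplified illustration: edges are treated as undirected (for a follower graph one might count in-degree instead), and the two toy graphs and the function name `degree_skewness` are assumptions.

```python
import statistics
from collections import Counter


def degree_skewness(edges):
    """Skewness of the node-degree distribution of an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    values = list(degree.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std < 1e-12:
        return 0.0   # every node has the same degree, so there is no skew
    # Third standardized moment: positive means a long right tail
    return sum(((d - mean) / std) ** 3 for d in values) / len(values)


hub_and_spoke = [("hub", f"user{i}") for i in range(10)]   # one node linked to ten others
ring = [(i, (i + 1) % 10) for i in range(10)]              # every node has exactly two links

print(degree_skewness(hub_and_spoke))   # strongly positive
print(degree_skewness(ring))            # 0.0
```

The hub-and-spoke graph produces a large positive skewness, while the ring, where every node has the same degree, produces none.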

---

## Problem: Unbiased and consistent estimators
What does it mean for an estimator to be unbiased?     
What about consistent?     
Give examples of an unbiased but not consistent estimator, as well as a biased but consistent estimator.

#### Solution
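An estimator $\hat{\theta}$ of a parameter $\theta$ is unbiased when its expected value over repeated samples equals the true value, $E[\hat{\theta}] = \theta$: individual estimates vary around the real population value without systematically over- or underestimating it. It is consistent when it converges in probability to $\theta$ as the sample size $N$ grows.   

One unbiased but not consistent estimator of the population mean is the first observation $X_1$ alone: $E[X_1] = \mu$, but its variance does not shrink as $N$ grows.   
One biased but consistent estimator is the variance estimator that divides by $N$. It underestimates the true variance by a factor of $(N-1)/N$, which is why we divide by $N-1$ to remove the bias, yet that bias vanishes as $N$ grows. Another biased example is any estimate of a parameter that includes a regularization factor: we bias the estimator towards limiting its complexity, and it stays consistent only if the influence of the penalty vanishes as the sample grows.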
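
A small simulation can illustrate the bias of the divide-by-$N$ variance estimator and how it fades with sample size. This is a sketch under assumed settings: standard normal data (true variance 1), an arbitrary seed, and a hypothetical function name `variance_estimates`.

```python
import random


def variance_estimates(n, n_trials=5_000, seed=0):
    """Average divide-by-n and divide-by-(n-1) variance estimates over many samples."""
    rng = random.Random(seed)
    biased_total = unbiased_total = 0.0
    for _ in range(n_trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]   # true variance is 1
        mean = sum(sample) / n
        squared_dev = sum((x - mean) ** 2 for x in sample)
        biased_total += squared_dev / n
        unbiased_total += squared_dev / (n - 1)
    return biased_total / n_trials, unbiased_total / n_trials


for n in (5, 50, 200):
    biased, unbiased = variance_estimates(n)
    print(f"n = {n:3d}: divide by n -> {biased:.3f}, divide by n-1 -> {unbiased:.3f}")
```

For small samples the divide-by-$N$ average should sit near $(N-1)/N$ (about 0.8 for $N = 5$), while the divide-by-$(N-1)$ average stays near 1; as $N$ grows, both approach the true variance.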


---
