# `10 More (Real) Data Science Interview Questions`

<font color=red> Mr Fugu Data Science</font>


# (◕‿◕✿)



`----------------------------------------`

# 1.) In a `Random Forest` classifier, which of the following choices involves randomness?

**`Choose Answer[s]:`**

+ Choosing which trees to discard during inference

+ Choosing which loss function to use

+ Choosing which subset of features to use for a given tree

+ Choosing which branch to follow at a given node

**`Work Through It`**:

What is a `Random Forest Classifier`?

+ It is an ensemble machine learning algorithm: in a simple explanation, you build multiple decision trees and then merge their predictions to create a more accurate and stable model.

    + Can be used for Classification and Regression
    + Understand also, this is a predictive tool NOT a descriptive tool
    + Fairly robust against overfitting, unlike a single decision tree, mostly because of the randomly selected features.

+ Instead of searching every feature for the most important split, each tree looks only at random subsets of your features. This is the randomness the question asks about: `Choosing which subset of features to use for a given tree`.

+ The random forest is built from decision trees; each tree is built from randomly selected features. Note: not all trees see all the features, or even all the observations. This helps because you are trying to avoid correlation between the trees, and therefore overfitting.
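
The two random choices described above can be sketched in a few lines. This is only an illustration and not how any particular library does it: the helper `draw_tree_inputs`, the feature names, and the square-root subset size are assumptions made for the example.

```python
import math
import random

def draw_tree_inputs(n_rows, feature_names, rng):
    """Pick the random rows and random features one tree would be trained on."""
    # Bootstrap sample: draw n_rows row indices with replacement,
    # so some rows repeat and some are left out.
    row_ids = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Random feature subset; sqrt(number of features) is a common choice of size.
    k = max(1, int(math.sqrt(len(feature_names))))
    features = rng.sample(feature_names, k)
    return row_ids, features

rng = random.Random(0)  # fixed seed so the sketch is repeatable
feature_names = ["age", "income", "height", "weight"]  # hypothetical features
forest_inputs = [draw_tree_inputs(10, feature_names, rng) for _ in range(3)]

for row_ids, features in forest_inputs:
    print(sorted(set(row_ids)), features)
```

Each tree ends up with its own rows and its own features, which is what keeps the trees from all making the same mistakes.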

https://towardsdatascience.com/feature-selection-using-random-forest-26d7b747597f

# 2.) Which type of SQL join can result in more matching rows than in either input table?

**`Choose Answer:`**

+ Full Outer Join

+ Right Join

+ Inner Join

+ Left Join

`Work Through It`

+ `Full Outer Join`: returns all rows from both tables; rows are combined where there is a match and padded with `NULL` where there is not

+ `Right Join`: returns everything from the right table, plus the matching rows from the left (`NULL` where nothing matches)

+ `Inner Join`: returns only the rows that match in both tables (the intersection)

+ `Left Join`: returns everything from the left table, plus the matching rows from the right (`NULL` where nothing matches)

Now, what is our answer?

*The question says (more matching) rows than you had initially.*

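If I am thinking of matching rows, then I will not count rows padded with `NULL` values. Therefore, while the Outer Join family can return more rows overall, the extra rows are unmatched rows filled with `NULL`s, so they do not count. An Inner Join can still return more rows than either table when the join key repeats, because every left row is paired with every right row that has the same key.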

`-------`

My answer will be `Inner Join`
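
A small check of that claim, using Python's built-in `sqlite3` module. The tables `orders` and `visits` and their contents are made up for the example; the point is only that a repeated join key multiplies the matching rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, item TEXT);
    CREATE TABLE visits (customer_id INTEGER, day TEXT);
    -- the same key appears 2 times on the left and 3 times on the right
    INSERT INTO orders VALUES (1, 'pen'), (1, 'ink');
    INSERT INTO visits VALUES (1, 'Mon'), (1, 'Tue'), (1, 'Wed');
""")

rows = conn.execute("""
    SELECT o.item, v.day
    FROM orders AS o
    INNER JOIN visits AS v ON o.customer_id = v.customer_id
""").fetchall()
conn.close()

print(len(rows))  # 2 orders x 3 visits = 6 matching rows
```

Six rows come back from tables of two and three rows, and none of them contain `NULL`.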

https://en.wikipedia.org/wiki/Join_(SQL)

# 3.) Which approaches would give the best tradeoff between *speed* and *accuracy* when calculating the product of many probabilities?


**`Choose Answer:`**

+ Use an arbitrary precision decimal type (think of Big-Int)

+ Use a single-precision floating point (like float32)

+ Use a Fourier transform trick

+ Log-sum exponential trick

`Work Through It`

**Considerations Before Getting Started**

+ Something to mention: a product of probabilities becomes a sum of log-probabilities, since $\log(ab) = \log a + \log b$, and adding is cheap!
    + Think of multiplying very large numbers
    + If we multiply two $n$-digit numbers the schoolbook way we have to think of $O(n^2)$, so two numbers of $10^3$ digits each would end up needing about $10^6$ digit operations. This is why algorithms were built to get lower than $O(n^2)$. Another issue is the memory you have available.

+ Also, we have to think about underflow when multiplying probabilities. The product of many numbers below 1 can become smaller than the smallest floating point number and round to 0.

Now: what are we noticing from the answers?
+ **Well it appears that we are comparing**: 

1.) converting multiplications to additions (*`speed`*)

2.) under/overflow. (*`accuracy`*)

3.) Hardware vs Software implementing your calculations (*`for speed`*)

If we notice this then there are some considerations that we should take into account.

`-------------`

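+ `Log Sum Exp Trick`: work with log-probabilities. This converts the multiplications to additions, runs at hardware floating point speed, and avoids the underflow issue, which makes it the best tradeoff of the four.

+ `The float-32`: fast because it runs in hardware, but its range is limited, so a long product of probabilities underflows to 0

+ `Big Int`: arbitrary precision avoids underflow, but it is implemented in software and slows down as the number of digits grows

+ `Fourier Transform Trick:` good for multiplying very large numbers or polynomials, around $O(n \log n)$ time. It speeds up one huge multiplication; it does not help with underflow in a long product of small numbers.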
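
A quick way to see the underflow problem and the log fix side by side. The list of 1000 probabilities of 0.01 is made up for the example.

```python
import math

probs = [0.01] * 1000  # hypothetical probabilities

# Direct product: the true value is 1e-2000, far below what a float can hold.
naive = 1.0
for p in probs:
    naive *= p

# Log space: the product becomes a sum of logs, which stays representable.
log_product = sum(math.log(p) for p in probs)

print(naive)        # 0.0 (underflow)
print(log_product)  # roughly -4605.17
```

The log value can be compared, added to other log-probabilities, or normalised with log-sum-exp without ever converting back to the tiny raw number.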


https://cs.stackexchange.com/questions/77135/why-is-adding-log-probabilities-faster-than-multiplying-probabilities

https://www.quantamagazine.org/the-math-behind-a-faster-multiplication-algorithm-20190923/

http://www-personal.umich.edu/~mejn/cp/chapters/errors.pdf

# 4.)  Can you compare a validation set with a test set?

`Work Through It`

+ The validation set is a subset of the original dataset that is held out from training. You are doing a train|validate|test split.
    + The validation set is used while you are developing the model, basically to make adjustments and tune hyperparameters. In this step you are trying to detect and adjust for overfitting.
+ The test set is used at the end to see how well your final model performs. The test set must never have been seen by the model, in training or in tuning!
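
A minimal sketch of a train|validate|test split in plain Python. The function name `train_val_test_split` and the 60/20/20 proportions are assumptions for the example, not a fixed rule.

```python
import random

def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut the data into three non-overlapping parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]                # touched only for the final score
    val = shuffled[n_test:n_test + n_val]   # used for tuning
    train = shuffled[n_test + n_val:]       # used for fitting
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```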
    


# 5.) Given a streaming dataset that can only be read once, because the file does not fit into memory, which of the following statistics can be calculated?

**`Choose Answer:`**

+ Neither mean nor variance

+ Mean, but not variance

+ Both

+ Variance but not mean

+ who cares o_0

`Work Through It`


+ Both can be calculated in a single pass.

+ Welford's algorithm computes a running (one-pass) variance.

+ You can find the mean recursively, by updating the previous mean with each new value.
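
A short sketch of the one-pass update, following Welford's method. The function name `streaming_mean_variance` and the sample numbers are made up; the variance returned here is the population variance (divide by `count - 1` instead for the sample variance).

```python
def streaming_mean_variance(stream):
    """Read each value once and keep only three running numbers."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count     # update the mean from the previous mean
        m2 += delta * (x - mean)  # uses the old and the new mean
    if count == 0:
        raise ValueError("empty stream")
    return mean, m2 / count       # population variance

# iter(...) stands in for a stream that can only be consumed once
mean, variance = streaming_mean_variance(iter([2, 4, 4, 4, 5, 5, 7, 9]))
print(mean, variance)  # mean is about 5.0, variance about 4.0
```

Memory use stays constant no matter how long the stream is, which is why the file size does not matter.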

https://faithfull.me/recursive-statistics-for-data-streams/

# 6.) What `assumptions` does Naive-Bayes algorithm make?

**`Choose Answer:`**
    
+ Test data will be normally distributed

+ Test data is linearly separable 

+ Input features are conditionally independent

+ Input features are linearly independent

`Work Through It`

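+ `Conditional Independence`: the input features are assumed to be independent of each other given the class. This is done in order to reduce the number of parameters the model has to estimate.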
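
Written out, with $y$ as the class and $x_1, \dots, x_n$ as the features (notation chosen for this note), the assumption lets the joint likelihood factor into one term per feature:

$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

As a rough count, assuming $n$ binary features and a binary class: modelling $P(x_1, \dots, x_n \mid y)$ directly needs $2(2^n - 1)$ parameters, while the factored form needs only $2n$.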


# 7.) Reverse List:

```python
def reverse_list(dta):
    new_list = []
    for i in range(len(dta)):
        # ---Place Code HERE
    return new_list
```

**`Choose Answer:`**

+ new_list.append( dta [ len( dta ) ] )

+ new_list.append( dta [ i ] )

+ new_list.append( dta [ len( dta )- i - 1] )

+ new_list.append( dta [ len( dta )-1 ] )

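The index `len(dta) - i - 1` starts at the last element when `i` is 0 and walks back to the first, so the third choice is the one that works:

```python
def reverse_list(dta):
    new_list = []
    for i in range(len(dta)):
        # i = 0 picks the last element, i = len(dta) - 1 picks the first
        new_list.append(dta[len(dta) - i - 1])
    return new_list

reverse_list([1, 2, 3, 4, 5])
# [5, 4, 3, 2, 1]
```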

# 8.) What are both of these queries achieving?
(*Think distinction*)

```sql
SELECT category, AVG(price)
FROM products
GROUP BY category
```

```sql
SELECT category, AVG(price)
OVER (PARTITION BY category)
FROM products
```

+ The first computes the average price per category; while the second does not

+ the first is not valid SQL, the second is

+ The first returns unique categories; second may return duplicate categories

+ first is less efficient to compute than second

`Work Through It`

+ `Group By`: commonly used with aggregates such as sum() and avg(), and will return fewer rows, one per group [*collapsed*]

+ `Over Partition`: will not affect the number of rows [*preserved*]
    + This is a window function: you are performing aggregate-like operations but producing a result for each row, unlike the group by, which condenses each group into a single row. So the first query returns unique categories, while the second may return duplicate categories.
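
The row counts can be compared directly with Python's built-in `sqlite3` module. This sketch assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added; the `products` rows are made up, and `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (category TEXT, price REAL);
    INSERT INTO products VALUES ('pens', 2.0), ('pens', 4.0), ('paper', 10.0);
""")

# GROUP BY collapses each category into one row.
grouped = conn.execute(
    "SELECT category, AVG(price) FROM products "
    "GROUP BY category ORDER BY category"
).fetchall()

# The window function keeps every row and repeats the category average.
windowed = conn.execute(
    "SELECT category, AVG(price) OVER (PARTITION BY category) "
    "FROM products ORDER BY category"
).fetchall()
conn.close()

print(grouped)   # [('paper', 10.0), ('pens', 3.0)]
print(windowed)  # [('paper', 10.0), ('pens', 3.0), ('pens', 3.0)]
```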

https://dev.mysql.com/doc/refman/8.0/en/window-functions-usage.html

# 9.) What is an `Activation Function` when doing Deep Learning such as Neural Networks?

`Work Through It`

+ Think of the `Activation Function` as a way to dictate whether a neuron turns on or not, like a switch. The neuron first computes a weighted sum of its inputs and adds a bias term; the activation function is then applied to that value, and this in turn introduces non-linearity into your neuron output. During training the weights and biases get updated depending on the error we have, which changes the output we receive.
    + Since this is a non-linear process, the network is able to learn complex tasks.
+ Consider the input as the output of the current neuron moving to the next layer; the value reaches the activation function and is either passed on (on) or suppressed (off).

+ The types of Activation Functions are:
    + Linear
    + Non-Linear (for example sigmoid, tanh, ReLU)
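
To make the "weighted sum, then bias, then activation" order concrete, here is a single neuron in plain Python. The input values, weights, and bias are made up, and `neuron` is just a helper name for this sketch.

```python
import math

def sigmoid(z):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Passes positive values through and switches negative values off."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    # Weighted sum plus bias first, activation function last.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

inputs = [1.0, 2.0]
weights = [0.5, -1.0]
bias = 0.5  # z = 0.5 - 2.0 + 0.5 = -1.0

print(neuron(inputs, weights, bias, relu))     # 0.0, the neuron is "off"
print(neuron(inputs, weights, bias, sigmoid))  # about 0.269
```

The same pre-activation value gives a hard "off" with ReLU and a soft, partial response with sigmoid.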
    
https://medium.com/@snaily16/what-why-and-which-activation-functions-b2bf748c0441

# 10.) What's wrong with this code block?

```python
try:
    file = open(filepath)
    data = file.read()
finally:
    file.close()
```

**`Choose Answer:`**

+ Not all bytes from the file are read

+ The file may be closed before all data is read

+ if opening the file fails, a different error code is raised

+ if an error is raised, the file will remain open


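`Work Through It`

+ `if opening the file fails, a different error code is raised`

If `open(filepath)` fails, the name `file` is never assigned. The `finally` block still runs, so `file.close()` raises a `NameError`, and that is the error you see instead of the original one (for example `FileNotFoundError`). Nothing here catches errors either; you could have handled them in an `except` clause or an optional `else` clause, or let a `with` statement manage the file.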
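
One common way to avoid the problem is a `with` statement, sketched below. The helper name `read_text` and the file names are made up for the example.

```python
import os
import tempfile

def read_text(filepath):
    # The file is closed automatically when the block exits.
    # If open() itself fails, the original error propagates unchanged.
    with open(filepath) as file:
        return file.read()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w") as f:
        f.write("hello")
    print(read_text(path))  # hello

    try:
        read_text(os.path.join(tmp, "missing.txt"))
    except FileNotFoundError as err:
        print(type(err).__name__)  # FileNotFoundError, not NameError
```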

https://docs.python.org/3/tutorial/errors.html

https://stackoverflow.com/questions/8774830/how-with-is-better-than-try-catch-to-open-a-file-in-python

# `Bonus`: Which of the following expressions defines the variance of a random variable X?

**`Choose Answer:`**

+ $E[ X^2-E[X] ]$

+ $E[X-E[X] ]$

+ $E[ X-E[X] ]^2$

+ $E[( X-E[X] )^2 ]$


`------------------------------------`

`Work Through It` (*discrete*)

+ <font color=red>1.)</font>

$E[X^2-E[X]] = E[X^2] - E[E[X]] = E[X^2] - E[X]$

This is not the variance: the second term would have to be $(E[X])^2$, not $E[X]$.


+ <font color=red>2.)</font>

$E[ X - E[X]] = E[X] - E[ E[X] ] = E[X] - E[X] = 0$

`Now:` $E[X]$ is a constant, so $E[E[X]] = E[X]$



+ <font color=red>3.)</font>

$(E[ X - E[X] ])^2 = E[ (X - E[X])^2 ] - Var(X - E[X]) = Var(X) - Var(X) = 0$


+ <font color=red>4.)</font> 

$Var(X) = \sum_x (x-\mu)^2 P(x)$

$= \sum_x (x^2-2 \mu x+\mu^2) P(x)$

$= \sum_x x^2 P(x) -2 \mu \sum_x x P(x) + \mu^2 \sum_x P(x)$

`Now Substitute:` $\mu = \sum_x x P(x)$, $E[X^2] = \sum_x x^2 P(x)$ and $\sum_x P(x) = 1$

$= E[X^2] - 2 \mu \mu + \mu^2 (1)$

$= E[X^2] - \mu^2$

`Now:` $E[X]= \mu$

$= E[X^2] - (E[X])^2$

$= E[( X-E[X] )^2 ]$
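
The identity can be checked with exact fractions on a small example. A fair six-sided die is assumed here purely for illustration.

```python
from fractions import Fraction

# Probability of each face of a fair die (assumed example distribution)
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())                          # E[X]
var_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())  # E[(X - E[X])^2]
var_shortcut = sum(x**2 * p for x, p in pmf.items()) - mean**2     # E[X^2] - (E[X])^2
first_moment_of_deviation = sum((x - mean) * p for x, p in pmf.items())  # E[X - E[X]]

print(mean, var_definition, var_shortcut)  # 7/2 35/12 35/12
print(first_moment_of_deviation)           # 0
```

Both variance expressions agree, and the plain expected deviation is 0, matching choices 2 and 3 above.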

# Citations & Help:

# ◔̯◔

`Random Forest:`

https://builtin.com/data-science/random-forest-algorithm

https://medium.com/brillio-data-science/what-is-random-in-random-forest-7825be12c8c3

https://link.springer.com/chapter/10.1007/11731139_12

https://medium.com/analytics-vidhya/disclose-the-secret-of-randomness-in-random-forests-705eb751d4d7

https://cfss.uchicago.edu/notes/decision-trees/ 

https://bradleyboehmke.github.io/HOML/random-forest.html

https://scholar.smu.edu/cgi/viewcontent.cgi?article=1041&context=datasciencereview

`SQL Joins:`

https://www.w3schools.com/sql/sql_join.asp

`Accuracy & Speed Tradeoff for Multiplying Probabilities:`

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

http://www.mikemeredith.net/blog/2017/UnderOverflow.htm

https://blog.smola.org/post/987977550/log-probabilities-semirings-and-floating-point

https://blog.feedly.com/tricks-of-the-trade-logsumexp/

`Streaming Data:`

http://www.nowozin.net/sebastian/blog/streaming-mean-and-variance-computation.html

`Activation Function:`

https://www.geeksforgeeks.org/activation-functions-neural-networks/#:~:text=Definition%20of%20activation%20function%3A%2D,the%20output%20of%20a%20neuron

https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0

`Group By & Partition Over:`

https://stackoverflow.com/questions/2404565/sql-server-difference-between-partition-by-and-group-by#:~:text=A%20group%20by%20normally%20reduces,window%20function's%20result%20is%20calculated

https://learnsql.com/blog/difference-between-group-by-partition-by/

`Bonus Question:`

https://www.statisticshowto.com/probability-and-statistics/expected-value/#:~:text=The%20basic%20expected%20value%20formula,(x)%20*%20n)

http://www.milefoot.com/math/stat/rv-expvar.htm

https://people.math.umass.edu/~lr7q/ps_files/teaching/math456/lecture16.pdf

https://math.stackexchange.com/questions/920853/does-ex-ex-0

https://www.youtube.com/watch?v=_dcyF_H2-r0