# Python Programming Questions

# Question 1

Write a function which takes as an argument a sentence (as a string) and returns the sentence where each word has been reversed *but the order of the words remains the same*.  Name the function `reverse_words`.  Shown below are some examples.


**Input**: "My Cat Is Small"

**Expected Output**: 'yM taC sI llamS'


**Input**: "Data science is a hot topic in industry"

**Expected Output**: "ataD ecneics si a toh cipot ni yrtsudni"

**Rationale Behind Question**: To solve this question, students should be knowledgeable in loops and/or list comprehensions.  Students will first need to split the sentence at spaces (if they are unfamiliar with string manipulation, a hint may be provided), then loop through the list of words, reversing each word by slicing the string.  A successful solution shows a proper understanding of loops and slicing.

In [9]:
def reverse_words(sentence):
    
    split_sentence = sentence.split()
    
    revd_words = [word[::-1] for word in split_sentence]
    
    return ' '.join(revd_words)

print(reverse_words('My Cat Is Small'))

print(reverse_words('Data science is a hot topic in industry'))

yM taC sI llamS
ataD ecneics si a toh cipot ni yrtsudni
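
The rationale mentions loops as an alternative to a list comprehension. A minimal sketch of that variant is shown below; the name `reverse_words_loop` is only illustrative and is not part of the question.

```python
def reverse_words_loop(sentence):
    # split() with no arguments splits on any whitespace and collapses
    # runs of spaces, so the original spacing is not preserved.
    reversed_words = []
    for word in sentence.split():
        # word[::-1] steps through the string backwards.
        reversed_words.append(word[::-1])
    return ' '.join(reversed_words)

print(reverse_words_loop('My Cat Is Small'))
```

Both versions do the same work; the loop simply makes each step explicit, which may be easier to grade for partial credit.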


# Question 2

Shown below is data relating to the position of a car in meters. The data was recorded at the indicated times below (so at time = 1, the car was 1 meter from the starting position).  Load the data as a numpy array. Calculate the average speed at which the car was traveling between time points.  Do this with a loop and again using array slicing.

Hint: Speed = (Distance Travelled)/(Time To Travel Distance)


positions: [0, 1, 1.2, 1.8, 2.0, 1.7, 1.5, 1.9, 2.1, 2.3]

times:  [0, 1, 1.5, 1.9, 2.3, 2.7, 3.8, 4.8, 5.4, 7.0]

**Rationale Behind Question**: Students should be familiar with vectorization, since it is typically faster and often more readable than an explicit loop.  Partial points may be awarded for the solution involving loops, since keeping track of indices can be tricky.

In [10]:
import numpy as np

position = np.array([0, 1, 1.2, 1.8, 2.0, 1.7, 1.5, 1.9, 2.1, 2.3])
times = np.array([0, 1, 1.5, 1.9, 2.3, 2.7, 3.8, 4.8, 5.4, 7.0])

#With slicing
speed_vectorized = (position[1:] - position[:-1])/(times[1:] - times[:-1])

#Without slicing
speed = np.zeros(position.size - 1)
for i in range(speed.size):
    speed[i] = (position[i+1] - position[i])/(times[i+1] - times[i])
    
#Both are equivalent
np.isclose(speed,speed_vectorized, rtol = 1e-8).all()
    

True
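
Some students may reach for `np.diff` instead of explicit slicing. The sketch below assumes that variant is acceptable for the vectorized part; it computes the same consecutive differences as `position[1:] - position[:-1]`.

```python
import numpy as np

position = np.array([0, 1, 1.2, 1.8, 2.0, 1.7, 1.5, 1.9, 2.1, 2.3])
times = np.array([0, 1, 1.5, 1.9, 2.3, 2.7, 3.8, 4.8, 5.4, 7.0])

# np.diff(a) returns a[1:] - a[:-1], so this matches the slicing solution.
speed_diff = np.diff(position) / np.diff(times)

print(speed_diff)
```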

# Question 3

Generate a random 100 by 100 2d array of integers using `numpy.random.randint` ranging from 1 to 10 (inclusive).  To ensure your answer is the same as ours, set the random seed to `19920908`. 

Which row has the largest mean?

Which column has the smallest sum?

Which is the first column (from left to right) to have a sum exceeding 600?

Answer these questions without the use of a loop.

Hint: The `argmin`, `argmax`, and `argwhere` functions may be useful.

**Rationale Behind Question**: This question is intended to make students look up function arguments. This is an important skill to have and will likely prevent being inundated with small, easy questions.  If students are not familiar with the `numpy.random.randint`, `argmin`, `argmax`, and/or `argwhere` functions, they should be capable of looking up the documentation. 

In [11]:
np.random.seed(19920908)

#high is exclusive, so 10+1 is needed to include 10.
X = np.random.randint(low = 1, high = 10+1, size = (100,100))

#Which row has largest mean?

print(X.mean(axis = 1).argmax())

#Which column has smallest sum?

print(X.sum(axis = 0).argmin())

#Which is the first column (from left to right) to have sum exceeding 600?

print(np.argwhere((X.sum(axis = 0)>600)).min())


73
92
13
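
The `axis` argument is the usual stumbling block here. The sketch below repeats the three operations on a small hand-made array (`X_small`, with a threshold of 13; both are illustrative choices, not part of the question) so the results can be checked by eye.

```python
import numpy as np

# A small hand-made array (not the seeded 100 by 100 array from the question).
X_small = np.array([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 9]])

# axis=1 aggregates across columns, giving one value per row.
row_with_largest_mean = X_small.mean(axis=1).argmax()

# axis=0 aggregates down rows, giving one value per column.
col_sums = X_small.sum(axis=0)
col_with_smallest_sum = col_sums.argmin()

# argwhere returns the indices where the condition holds; min() picks the first.
first_col_over_13 = np.argwhere(col_sums > 13).min()

print(row_with_largest_mean, col_with_smallest_sum, first_col_over_13)
```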


# Question 4

Newton's method is a numerical method for finding the roots of a function.  Its update rule is

$$ x_{n+1} = x_{n} - \dfrac{f(x_n)}{f'(x_n)} $$

Below, I've written a function to try to use Newton's method to find the two roots of the function $f(x) = \exp(-x)\ln(x+1) - 0.25$.

My function should:

* Terminate when $\vert f(x_n) \vert < 1\times10^{-8}$ or when the number of iterations exceeds 1000.

* Take as its first argument the starting point for the method (i.e. $x_0$)

* Take as its second argument the function $f$

* Take as its third argument the function $f'$

My code, as it stands, does not return the right answer.  Look through the code and debug the function so that it returns answers similar to `scipy.optimize.newton`.  Please don't completely rewrite the code (I spent a long time on it and want to learn what I messed up!).


Don't worry about `f` and `fprime`.  I've ensured those are correct.


**Rationale Behind Question**:  Debugging is an essential skill.

In [19]:
f = lambda x: np.exp(-x)*np.log(x+1) - 0.25
fprime = lambda x: -np.exp(-x)*np.log(x+1) + np.exp(-x)/(x+1)

def broken_newtons_method(x0,f, fprime, tol = 1e-8, maxiter = 1000):
    
    res = float('inf')
    iters = 0
    x_n = x0
    
    while (res<tol) and (iters<maxiter):
        
        x_n -= f(x_n)\fprime(x_n)
        
        res = abs(f(x_n))
        
    return x_n
        
    
print('My algorithm, starting at 0.01, yields answer: ',broken_newtons_method(0.01,f,fprime))
print('My algorithm, starting at 2, yields answer: ', broken_newtons_method(2,f,fprime))
        

#compare with scipy
from scipy.optimize import newton

print('scipy.optimize.newton starting at 0.01 returns ',newton(f,0.01))
print('scipy.optimize.newton starting at 2 returns ', newton(f,2))

SyntaxError: unexpected character after line continuation character (<ipython-input-19-d3134f5a1b35>, line 12)

In [None]:
#Solution
f = lambda x: np.exp(-x)*np.log(x+1) - 0.25
fprime = lambda x: -np.exp(-x)*np.log(x+1) + np.exp(-x)/(x+1)


def true_newtons_method(x0,f, fprime, tol = 1e-8, maxiter = 1000):
    
    res = float('inf')
    iters = 0
    x_n = x0
    
    #Bug 1: the comparison was reversed (res<tol), so the loop never ran
    while (res>tol) and (iters<maxiter):
        
        #Bug 2: a backslash was used instead of / for division
        x_n -= f(x_n)/fprime(x_n)
        #Bug 3: the iteration counter was never updated
        iters+=1
        res = abs(f(x_n))

        
    return x_n
        
    
    
from scipy.optimize import newton


print(true_newtons_method(0.01,f,fprime))
print(newton(f,0.01))

print(true_newtons_method(2,f,fprime))
print(newton(f,2))
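
The question states that `f` and `fprime` are correct. One way a student could confirm this for themselves is a central finite-difference check, sketched below. The helper `central_difference`, the step `h`, and the list `check_points` are illustrative assumptions, not part of the question.

```python
import numpy as np

f = lambda x: np.exp(-x)*np.log(x+1) - 0.25
fprime = lambda x: -np.exp(-x)*np.log(x+1) + np.exp(-x)/(x+1)

def central_difference(func, x, h=1e-6):
    # Symmetric difference quotient; its truncation error shrinks like h**2.
    return (func(x + h) - func(x - h)) / (2*h)

# Arbitrary test locations inside the domain x > -1.
check_points = [0.01, 0.5, 2.0]
max_error = max(abs(central_difference(f, x) - fprime(x)) for x in check_points)

print(max_error)
```

A small `max_error` suggests the analytic derivative agrees with the function, so any wrong answer must come from the iteration itself.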

# Question 5

Estimate through simulation the probability that a baseball player batting 0.300 gets fewer hits than a baseball player batting 0.275 over 45 at bats.

In [None]:
from scipy.stats import binom
np.mean(binom(n = 45, p = .3).rvs(1_000_000) < binom(n = 45, p = 0.275).rvs(1_000_000))
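
A simulation estimate can be cross-checked against the exact probability, since both hit counts are binomial. The sketch below assumes, as the simulation does, that at bats and players are independent. It uses only the standard library, and the helper `binom_pmf` is an illustrative name.

```python
from math import comb

def binom_pmf(k, n, p):
    # Probability of exactly k successes in n independent trials.
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 45
pmf_300 = [binom_pmf(k, n, 0.300) for k in range(n + 1)]
pmf_275 = [binom_pmf(k, n, 0.275) for k in range(n + 1)]

# P(.300 hitter gets strictly fewer hits than the .275 hitter):
# sum over all pairs (i, j) with i < j of P(X_300 = i) * P(X_275 = j).
exact = sum(pmf_300[i] * pmf_275[j]
            for i in range(n + 1)
            for j in range(i + 1, n + 1))

print(exact)
```

With a million draws, the simulated value should land close to this number.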

---
# Statistics Questions

# Question 1

What is the correct interpretation of the 95% confidence interval?

A. There is a 95% probability the true mean lies in your interval.

B. There is a 95% probability that if you ran the experiment again, the true mean would be in the 95% confidence interval.

C. There is a 95% probability that the mean is the midpoint of the interval.

D. Upon repeated construction, the long-run relative frequency of 95% confidence intervals containing the true mean is 95%.


Answer: D
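
Option D can be illustrated with a small coverage simulation. The sketch below assumes normally distributed data with a known standard deviation. The values of `true_mean`, `sigma`, `n`, and `reps` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# All of these values are illustrative assumptions.
true_mean, sigma, n, reps = 5.0, 2.0, 30, 10_000

# Each row is one repetition of the experiment.
samples = rng.normal(loc=true_mean, scale=sigma, size=(reps, n))
sample_means = samples.mean(axis=1)

# Known-sigma 95% interval: sample mean +/- 1.96 * sigma / sqrt(n).
half_width = 1.96 * sigma / np.sqrt(n)
covered = np.abs(sample_means - true_mean) <= half_width

# Fraction of the repeated intervals that contain the true mean.
coverage = covered.mean()
print(coverage)
```

The printed fraction should sit near 0.95, which is the long-run frequency that option D describes.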

# Question 2

Bill James is credited with creating sabermetrics (baseball analytics).  In one of his early "Baseball Abstracts", Bill writes...

>...If you see 15 games a year, there is a 40% chance that a .275 hitter will have more hits than a .300 hitter.

Bill refers to players by their *batting average* (i.e. .275 means the hitter gets a hit, on average, 275 times for every 1000 at bats).  The actual probability is noticeably smaller than that. Bill wrote this in the late 1970s, without the ubiquity of computers to perform the simulations we can.  It is quite plausible that Bill used a Normal approximation to arrive at this conclusion.

Assuming that every batter appears 3 times per game for 15 games (for a total of 45 at bats), use a Normal approximation to estimate the probability that a .275 batter gets more hits than a .300 batter.  Assume the batters are independent.



# Solution

Let $A \sim \mbox{Binom}(45, 0.275)$ and $B \sim \mbox{Binom}(45, 0.300)$ be the number of hits for the .275 and .300 batters, respectively.  

We are looking for $p(A>B)$ or alternatively $p(0<A-B)$.

The expectation of $A-B$ is $E(A-B) = E(A) - E(B) = 12.375 - 13.5 = -1.125$

The variance of $A-B$ is $\operatorname{Var}(A-B) = \operatorname{Var}(A) + \operatorname{Var}(B) - 2\operatorname{Cov}(A,B)$.

Since batters are assumed to be independent, $\operatorname{Cov}(A,B) = 0$.

So the variance is then $\operatorname{Var}(A) + \operatorname{Var}(B) = 45(0.275)(0.725) + 45(0.300)(0.700) \approx 18.42$

The Normal approximation is then $A-B \sim \mathcal{N}(-1.125, 18.42)$

and so the probability that $A-B>0$ is $1- \mathbf{\Phi}(0) \approx 0.4$

Where $\mathbf{\Phi}$ is the CDF for our normal approximation.



In [None]:
from scipy.stats import norm

#p(A-B>0) = 1 - CDF(0) under the Normal approximation
1 - norm(loc = -1.125, scale = np.sqrt(18.42)).cdf(0)

# Question 3

A diagnostic test has a 99% chance of correctly labeling a person as sick if they are truly sick.  The probability that the test labels someone as sick, regardless of disease status, is 98%.  If 0.1% of the population has this disease, what is the probability you have the disease if your test comes back positive?

# Solution

Applying Bayes' Theorem...

$$ p(D+ \vert T+)  = \dfrac{p(T+ \vert D+) p(D+) }{p(T+)} = \dfrac{0.99 \times 0.001}{0.98}$$

$$ \approx 0.1\%$$
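
The arithmetic can be checked in a few lines. The variable names below are illustrative; the three input values come straight from the question.

```python
p_pos_given_sick = 0.99   # p(T+ | D+)
p_sick = 0.001            # p(D+)
p_pos = 0.98              # p(T+)

# Bayes' theorem: p(D+ | T+) = p(T+ | D+) * p(D+) / p(T+)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos

print(f"{p_sick_given_pos:.4%}")
```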

# Question 4

Why might someone want to know the median rather than the mean of their data?

# Solution

The median is far less sensitive to outliers than the mean.  If the data have many outliers, then the mean might not be a good measure of central tendency.
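
A short sketch with made-up numbers shows the effect. The two lists below are hypothetical data and differ only in one extreme value.

```python
from statistics import mean, median

# Hypothetical data: the second list swaps one value for an extreme outlier.
typical = [10, 11, 12, 13, 14]
with_outlier = [10, 11, 12, 13, 1000]

print(mean(typical), median(typical))
print(mean(with_outlier), median(with_outlier))
```

The single outlier pulls the mean far from the bulk of the data, while the median does not move.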

# Question 5

You obtain a dataset with $n$ rows and $n$ columns (the same number of rows and columns). Each column houses numeric data (no categories, just numbers). You're asked to perform a linear regression on this data (the outcome is in a different file; it is not one of the $n$ columns).  Assume that the data matrix is full rank.

What will the $R^2$ of this regression be?

# Solution

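$R^2$ will be one.  With $n$ observations, $n$ predictors, and a full-rank data matrix, the problem is perfectly determined, so the fitted values reproduce the outcome exactly.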
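
This can be checked numerically. The sketch below assumes a small random Gaussian data matrix (`n = 5`, an illustrative size) and an outcome that has nothing to do with the predictors, and fits the regression without an intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # illustrative size

# A random Gaussian matrix is full rank with probability one.
X = rng.normal(size=(n, n))
y = rng.normal(size=n)  # outcome generated independently of X

# Least-squares fit without an intercept.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)
print(r_squared)
```

Even though `y` is pure noise, the fit is perfect up to floating-point error, which is why a high $R^2$ in this setting says nothing about predictive value.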

---

# Linear Algebra Questions

# Question 1

If $A$ is an $n \times n$ matrix, and $A$ has full rank, is $A$ invertible?

Answer: Yes.
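
A quick numerical illustration, using an arbitrary $2 \times 2$ matrix chosen for this sketch:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])  # illustrative full-rank matrix

rank = np.linalg.matrix_rank(A)
A_inv = np.linalg.inv(A)

# Full rank means the rank equals the number of rows (and columns).
print(rank == A.shape[0])
# Multiplying by the inverse recovers the identity, up to rounding.
print(np.allclose(A @ A_inv, np.eye(2)))
```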

# Question 2

If a matrix, $A$, is positive definite, which of the following is not necessarily true?

A) $\mathbf{x}^T A \mathbf{x} >0 $ for every vector which is not 0

B) Every element of A is positive

C) The Eigenvalues of A are positive

D) A is symmetric

Answer: B
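
A counterexample makes the answer concrete. The matrix below is an illustrative choice: it is symmetric with positive eigenvalues, yet it has negative off-diagonal entries.

```python
import numpy as np

A = np.array([[ 2.0, -1.0],
              [-1.0,  2.0]])

# eigvalsh is intended for symmetric matrices and returns real eigenvalues.
eigenvalues = np.linalg.eigvalsh(A)

print(eigenvalues)
print((A < 0).any())
```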

# Question 3

Let $x$ and $y$ be vectors such that $\vert x \vert = 3$ and $\vert y \vert = 4$.  Use the triangle inequality to put an upper bound on the length $\vert x+y \vert$.

Answer: $\vert x+y \vert \leq \vert x \vert + \vert y \vert =7$
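
Two illustrative cases, with vectors chosen for this sketch, show that the bound holds and that it is attained when the vectors point the same way.

```python
import numpy as np

x = np.array([3.0, 0.0])            # length 3
y_perp = np.array([0.0, 4.0])       # length 4, perpendicular to x
y_parallel = np.array([4.0, 0.0])   # length 4, same direction as x

print(np.linalg.norm(x + y_perp))
print(np.linalg.norm(x + y_parallel))
```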



# Question 4

Let $A$ be a matrix, and let $\mathbf{x},\mathbf{y}$ be vectors.  If $A\mathbf{x} = [4,3,2]^T$ and $A\mathbf{y} = [-1,2,0]^T$ what is $A(2\mathbf{x} - \mathbf{y})$?

Answer: $A(2\mathbf{x} - \mathbf{y}) = 2A\mathbf{x} - A\mathbf{y} = [8,6,4]^T - [-1,2,0]^T = [9,4,4]^T$
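
The computation relies only on linearity, so $A$ itself is never needed. The sketch below works from the given products and then spot-checks linearity on a hypothetical `A`, `x`, and `y` chosen arbitrarily for illustration.

```python
import numpy as np

Ax = np.array([4, 3, 2])
Ay = np.array([-1, 2, 0])

# By linearity, A(2x - y) = 2(Ax) - (Ay).
result = 2*Ax - Ay
print(result)

# Spot-check linearity on a hypothetical A, x, y (not the ones in the question).
A = np.array([[1, 2], [0, 1], [3, -1]])
x = np.array([1, 1])
y = np.array([2, -1])
print(np.array_equal(A @ (2*x - y), 2*(A @ x) - A @ y))
```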