# Lab 2, Problem 2 - Machine Learning for Industry
Linköping University, Fall 2019.

Author: Mattias Villani, Linköping and Stockholm University, http://mattiasvillani.com

In the previous problem we looked at predicting if an item was **sold** on eBay. The dataset actually contained more information. The column <tt>nBids</tt> records the **exact number of bids** in each auction. 
In this part of the lab you will build a model that can predict the number of bids in an auction.

It is not obvious how to choose a loss function here. We could use squared error $(y-\hat y)^2$, but that is less suitable when the data are not Gaussian. Rather than making up a loss function in an ad hoc fashion, we will assume a distribution for the data and then use maximum likelihood to learn the parameters. This lab will get you closer to the bare metal of fitting ML models to data.

**Important**: Study the code in [MLEbyOptimization.ipynb](/Extras/MLEbyOptimization.ipynb) before moving on. The logistic regression example can serve as a template for this problem.
    
Here the data are discrete **counts**, i.e. non-negative integers. There are many distributions for such data, but the most famous one is the **Poisson distribution** $Y\sim\mathrm{Pois}(\mu)$ with probability mass function:
$$
p(y) = \frac{e^{-\mu}\mu^y}{y!}  
$$
Now, if $Y\sim\mathrm{Pois}(\mu)$, then the mean is $\mathbb{E}Y=\mu$. We want to model the mean of this distribution as a regression, that is as a function of features, just like we did in the classification problem. However, setting $\mathbb{E}Y=\mu=\mathbf{x}^\top \mathbf{w}$ is not great since $\mu$ must be strictly positive and we would have to put awkward restrictions on $\mathbf{w}$ to guarantee this. Let us instead use the model:
$$Y_i \vert \mathbf{x}_i \sim \mathrm{Pois}(\mu_i), \text{ where } \mu_i = \exp(\mathbf{x}_i^\top \mathbf{w}).$$

Note that this model uses a different distribution for each auction, depending on the features of that specific auction. Also, $\mu_i>0$ for all $i$.
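
Under this model, and assuming the auctions are independent given their features, the log-likelihood follows directly from the Poisson mass function. A sketch of the derivation, where $n$ denotes the number of auctions (notation introduced here):
$$
\ell(\mathbf{w}) = \sum_{i=1}^n \log p(y_i \vert \mathbf{x}_i) = \sum_{i=1}^n \left( y_i \log \mu_i - \mu_i - \log y_i! \right) = \sum_{i=1}^n \left( y_i\,\mathbf{x}_i^\top \mathbf{w} - \exp(\mathbf{x}_i^\top \mathbf{w}) - \log y_i! \right).
$$
The term $\log y_i!$ does not depend on $\mathbf{w}$, so it does not affect where the maximum is located.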

<font color = "red">**Your first task**</font> is to use numerical maximum likelihood to learn the weight vector in the Poisson regression with <tt>nBids</tt> as response and the same features that were used in the classification model in the previous problem. This entails programming the (negative) log-likelihood function for the Poisson regression and then using SciPy's <tt>minimize</tt> function to find the maximum likelihood estimate. Don't worry if <tt>minimize</tt> reports that the optimization did not terminate successfully; the estimates are fine. Use all the data for training.\
**Hint**: <tt>from scipy.stats import poisson</tt>
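
As a minimal sketch of the general recipe, the example below fits a Poisson regression to synthetic data that are simulated here purely for illustration (not the eBay data). The names <tt>neg_log_lik</tt>, <tt>X_sim</tt>, <tt>y_sim</tt> and <tt>w_true</tt> are assumptions of this example, and BFGS is just one possible choice of optimizer.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

def neg_log_lik(w, X, y):
    """Negative Poisson log-likelihood with mean mu_i = exp(x_i' w)."""
    mu = np.exp(X @ w)
    return -np.sum(poisson.logpmf(y, mu))

# Synthetic data, made up for this illustration only
rng = np.random.default_rng(123)
n = 2000
X_sim = np.column_stack([rng.normal(size=n), np.ones(n)])  # one feature + constant
w_true = np.array([0.5, 1.0])                              # hypothetical true weights
y_sim = rng.poisson(np.exp(X_sim @ w_true))

# Minimize the negative log-likelihood, starting from all-zero weights
w_init = np.zeros(X_sim.shape[1])
result = minimize(neg_log_lik, w_init, args=(X_sim, y_sim), method="BFGS")
w_hat = result.x
print(w_hat)  # should be close to w_true
```

With the lab data, the same kind of function could presumably be called with the feature matrix and response converted to NumPy arrays, for example <tt>X.values</tt> and <tt>y.values</tt>.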

Here are some imports, the seed for the random number generator, and the code that gets the dataset ready for the problem.

In [6]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(seed=123) # Set the seed for reproducibility

eBayData = pd.read_csv('https://github.com/STIMALiU/ml4industry/raw/master/Labs/eBayData.csv', sep = ',')
X = eBayData.drop(['nBids','Sold'], axis = 1)
X['const'] = 1
y = eBayData['nBids']
X.head()

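| | PowerSeller | VerifyID | Sealed | Minblem | MajBlem | LargNeg | LogBook | MinBidShare | const |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.2237 | -0.2088 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.6073 | -0.3478 | 1 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0.0332 | 0.4423 | 1 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0.3755 | 0.1441 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 1 | 1.4347 | -0.4104 | 1 |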


OK, over to you now.

In [11]:
# YOUR CODE FOR THE LOG-LIKELIHOOD FUNCTION HERE

In [12]:
# YOUR CODE FOR FINDING THE MAXIMUM LIKELIHOOD ESTIMATES HERE

<font color = "red">**Your second task**</font> is to use your learned Poisson model to plot the predictive distribution of <tt>nBids</tt> in a new auction with features $\mathbf{x}_{\star}$:
- PowerSeller = 0
- VerifyID = 0
- Sealed = 1
- Minblem = 0
- MajBlem = 0
- LargNeg = 0
- LogBook = 2
- MinBidShare = -0.5

Proceed like this:
- Compute the Poisson mean, $\mu$, for this specific auction.
- Simulate 10000 draws from this Poisson distribution.
- Make a barplot (<tt>plt.bar</tt>) of the draws (you need to count them first) to approximate the predictive distribution of <tt>nBids</tt> in this new auction. 
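
The steps above can be sketched as follows, using a made-up mean <tt>mu_example = 2.5</tt> rather than the value implied by a fitted model. In the lab, the mean would instead be $\exp(\mathbf{x}_{\star}^\top \mathbf{w})$ evaluated at the estimated weights, with the feature vector ordered like the columns of <tt>X</tt> and including the constant. All variable names here are assumptions of this example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(123)
mu_example = 2.5  # hypothetical mean, NOT the fitted value for the new auction
draws = rng.poisson(mu_example, size=10000)

# Count how often each value occurs and convert to relative frequencies
values, counts = np.unique(draws, return_counts=True)
rel_freq = counts / draws.size

# Bar plot approximating the predictive distribution
fig, ax = plt.subplots()
ax.bar(values, rel_freq)
ax.set_xlabel("nBids")
ax.set_ylabel("Relative frequency")
ax.set_title("Simulated predictive distribution")
```

Plotting relative frequencies instead of raw counts makes the bar heights directly comparable to Poisson probabilities.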

In [13]:
# YOUR CODE HERE

<font color = "red">**Your final task**</font> is to use the simulation from the predictive distribution to compute an estimate of $\mathrm{Pr}(Y\geq 4 \vert \mathbf{x}_{\star})$, the probability of at least four bids in the new auction.
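
A tail probability of this kind can be estimated as the fraction of simulated draws that are at least four. The sketch below again uses a hypothetical mean <tt>mu_example = 2.5</tt>, not the fitted one, and compares the simulation estimate with the exact Poisson tail probability from <tt>scipy.stats.poisson.sf</tt> as a sanity check.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(123)
mu_example = 2.5  # hypothetical mean, NOT the fitted value for the new auction
draws = rng.poisson(mu_example, size=10000)

# Monte Carlo estimate: share of simulated auctions with at least four bids
prob_mc = np.mean(draws >= 4)

# Exact value for comparison: sf(3) = Pr(Y > 3) = Pr(Y >= 4)
prob_exact = poisson.sf(3, mu_example)
print(prob_mc, prob_exact)
```

With 10000 draws the two numbers should typically agree to about two decimal places.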

In [10]:
# YOUR CODE HERE