# Assignment group 3: Probabilistic modeling and prediction

## Module C _(51 pts)_ Exploring Probabilistic Models of Discrete Data
Here, we'll be working with some data from the Indego bikeshare company:

- `./data/indego-trips-2017-q3.csv`

Our goal is to look at a particular numeric aspect:

- how often bikes get used (and worn out).

The data set covers one quarter of 2017. So every bike is observed over the same span of time, right? Well, if so, and if each bike gets rented randomly at a fixed rate, $\lambda$, then the distribution of bike usage probabilities:

$$P(\text{a bike gets rented }\:x\:\text{ times in a quarter})$$

will be a Poisson distribution! Let's investigate to see if we can support this possibility.
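
For reference, the standard Poisson distribution with rate $\lambda$ assigns the probability below to each count $x = 0, 1, 2, \ldots$. Its mean and variance both equal $\lambda$, which gives a quick sanity check on any data we suspect of being Poisson:

$$P(x) = \frac{\lambda^x e^{-\lambda}}{x!}$$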

__C1.__ _(2 pts)_ To get started, import pandas and load the data as usual. Print the spreadsheet's head so that the data's structure is close at hand.

In [None]:
# code here
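
As a minimal sketch of the loading step, the snippet below reads a tiny in-memory stand-in for the CSV. The toy rows and the `trip_id` column are assumptions made for illustration; only `bike_id` and `start_time` are named elsewhere in this module.

```python
import io

import pandas as pd

# Toy stand-in for the real file; rows and the trip_id column are hypothetical.
toy_csv = io.StringIO(
    "trip_id,start_time,bike_id\n"
    "1,2017-07-01 08:15:00,3331\n"
    "2,2017-07-01 09:40:00,2472\n"
    "3,2017-07-02 17:05:00,3331\n"
)

# For the real data, pass the file path instead:
# df = pd.read_csv("./data/indego-trips-2017-q3.csv")
df = pd.read_csv(toy_csv)

# head() shows the first rows so the column layout is visible.
print(df.head())
```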

__C2.__ _(5 pts)_ Now, let's start things out by counting the total number of trips that each bike has, using pandas `df.groupby()` to group the trips. Then use a `Counter()`, `NumBikes`, to store the number of bikes, $n$, rented $x$ times in the quarter, $n(x)$.

In [None]:
# code here
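
One possible shape for this step, sketched on a toy table rather than the real trips: group by `bike_id`, take the group sizes as trips per bike, then count how many bikes share each trip total. The toy bike ids are hypothetical.

```python
from collections import Counter

import pandas as pd

# Toy trips table: one row per trip (bike ids are made up).
trips = pd.DataFrame({"bike_id": [1, 1, 1, 2, 2, 3, 4, 4]})

# Trips per bike: the size of each bike_id group.
trips_per_bike = trips.groupby("bike_id").size()

# n(x): how many bikes were rented x times.
NumBikes = Counter(trips_per_bike.tolist())
print(NumBikes)
```

Here bike 1 has three trips, bikes 2 and 4 have two each, and bike 3 has one, so two bikes land on $x = 2$.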

__C3.__ _(5 pts)_ Now that we've got our bikes counted up, let's compute the empirical probabilities:

$$P(x) = P(\text{a bike is rented }\:x\:\text{ times in a quarter}) = 
\frac{n(x)}{\sum n(x)}.$$

We already have $n(x)$ in our `Counter()` from __C2__, so let's start by turning its keys and values into numpy arrays (vectors), `x` and `n`, respectively. After this is done, we can make the probabilities, `p`, by scaling `n` by a scalar: divide it by its sum.

In [None]:
# code here
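
A small sketch of the conversion, assuming a toy `NumBikes` counter in place of the real one: the keys become `x`, the values become `n`, and dividing by the total normalizes the counts into probabilities.

```python
from collections import Counter

import numpy as np

# Toy n(x): one bike rented once, two rented twice, one rented three times.
NumBikes = Counter({1: 1, 2: 2, 3: 1})

# Keys and values come out in matching order, so x[i] pairs with n[i].
x = np.array(list(NumBikes.keys()))
n = np.array(list(NumBikes.values()))

# Dividing the vector by a scalar (its sum) gives probabilities that sum to 1.
p = n / n.sum()
print(x, n, p)
```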

__C4.__ _(2 pts)_ Now it's time to find the average number of times a bike gets rented in a quarter. We'll call this quantity $\lambda$. So far, we've talked about averages of data, e.g., the arithmetic mean of $x$:

$$\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i$$

But what we're now interested in is the average, or center, of our probability distribution, $P(x)$. This quantity has a special name: the _expectation of $x$_, which is computed as:

$$E[x] = \sum_{x} x\,P(x)$$

This is actually a generalization of the arithmetic mean above, if you view the arithmetic mean as using a _uniform_ probability distribution, with equal probability ($1/n$) for each value, $x_i$. Here's the nice part for us: looking at the equation for $E[x]$, we simply have an inner product between $P(x)$ and $x$. So let's compute $\overline{x} = E[x]$ using the numpy dot product as an easy trick!

In [None]:
# code here
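
To illustrate the inner-product view on toy numbers (not the bikeshare data), the sketch below computes $E[x]$ with `np.dot` and compares it with the plain arithmetic mean of the raw per-bike counts that produced the same distribution.

```python
import numpy as np

# Toy distribution: x values and their probabilities P(x).
x = np.array([1, 2, 3])
p = np.array([0.25, 0.5, 0.25])

# Expectation as an inner product: sum over x of x * P(x).
lam = np.dot(x, p)

# The raw per-bike counts behind this toy distribution.
raw_counts = np.array([3, 2, 1, 2])

# Both routes give the same average.
print(lam, raw_counts.mean())
```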

__C5.__ _(2 pts)_ Now let's use $\overline{x}$ to sample a Poisson distribution. We sample as many points as there are unique bikes, using the `numpy.random.poisson(rate, size = nbikes)` function for this.

In [None]:
# code here
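
A sketch of the sampling call with placeholder numbers: `rate` and `nbikes` below are hypothetical stand-ins for the expectation from __C4__ and the number of unique bikes. The seed is only there to make the toy run repeatable.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the toy run is repeatable

rate = 40.0    # hypothetical average rentals per bike
nbikes = 1000  # hypothetical number of unique bikes

# One simulated rental count per bike.
sample = np.random.poisson(rate, size=nbikes)

# The sample mean should land close to the rate.
print(sample[:10], sample.mean())
```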

__C6.__ _(3 pts)_ Now that we've got our Poisson sample, let's compute the analogous values to __C2__ for the sample and build a `Counter()` with the number of sample-bikes, `s_n`, that were used $x$ times in the quarter.

In [None]:
# code here

__C7.__ _(2 pts)_ With the sample counted up, put the values and keys in numpy arrays, and then find the sample probabilities, just like we did in __C3__.

In [None]:
# code here
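
Since the count-then-normalize pattern from __C2__ and __C3__ repeats here, one option is to wrap it in a small helper. The function name `counts_to_pmf` is hypothetical, and the sample values are a toy list rather than a real Poisson draw.

```python
from collections import Counter

import numpy as np


def counts_to_pmf(values):
    """Return sorted x values, their counts n(x), and probabilities P(x)."""
    counter = Counter(int(v) for v in values)
    keys = sorted(counter)
    xs = np.array(keys)
    ns = np.array([counter[k] for k in keys])
    return xs, ns, ns / ns.sum()


# Toy "sample": simulated rental counts for eight sample-bikes.
toy_sample = [2, 0, 1, 2, 3, 2, 1, 0]
s_x, s_n, s_p = counts_to_pmf(toy_sample)
print(s_x, s_n, s_p)
```

Sorting the keys is a design choice: it keeps the arrays in increasing $x$ order, which is convenient when plotting later.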

__C8.__ _(5 pts)_ Now it's time to plot _bar plots_ of both your sample and the data. Note: we're using bar plots, since our data are _discrete_ and not in need of binning! The positions of our bars will be given by the $x$ values and the heights will be given by the $P(x)$ values.

What do you notice? Do the two line up well? If not, what Poisson-distribution assumptions might not have been met?

_Response._

In [None]:
# code here
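
One way to lay out the comparison, sketched with made-up probabilities: draw the two bar series side by side by shifting each half a bar width from the integer $x$ positions. All numbers are toy values, and the `Agg` backend line is only there so the sketch runs without a display; in a notebook it can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Toy probabilities for the data and for a Poisson sample.
x_data = np.array([0, 1, 2, 3, 4])
p_data = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
x_samp = np.array([0, 1, 2, 3, 4, 5])
p_samp = np.array([0.14, 0.27, 0.27, 0.18, 0.09, 0.05])

fig, ax = plt.subplots()
width = 0.4

# Shift each series by half a bar so the pairs sit next to each other.
ax.bar(x_data - width / 2, p_data, width=width, label="data")
ax.bar(x_samp + width / 2, p_samp, width=width, label="Poisson sample")

ax.set_xlabel("x (number of rentals)")
ax.set_ylabel("P(x)")
ax.legend()
```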

__C9.__ _(3 pts)_ Okay, so there are probably a few things that went wrong:

- Did all bikes get used for the _same_ amount of time, i.e., were some brought in/out of commission?
- Do some bikes just happen to exist in higher-traffic areas?
- Do some times of year see more bike use than others?

It's not the _easiest_ thing in the world to overcome these issues. However, perhaps a slightly different quantity avoids them:

- The number of times a given bike gets used in one day.

Does this quantity better meet the criteria above, and if so, why?

_Response_.

- Question: Does the quantity suffer from comparing bikes with different wear and tear?
    - _Response._
- Question: Does the quantity still suffer from dependence on environmental variation?   
    - _Response._
- Question: How does this quantity deal with seasonal variation differences, if at all?
    - _Response._

__C10.__ _(2 pts)_ Is an _individual_ bike's daily rental count Poisson? To find out, we'll need to be able to count the number of times a given bike is rented each day. But first, we have to restrict to just the bike of interest. I choose `bike_id = 3331`. Create a boolean mask and filter the rows for this bike. When you're done, print out the head of the resulting `start_time` column.

In [None]:
# code here
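
A minimal sketch of boolean-mask filtering on a toy table: comparing a column to a value yields a True/False series, and indexing the frame with it keeps only the True rows. The rows below are made up; the column names `bike_id` and `start_time` follow this module.

```python
import pandas as pd

# Toy trips table (rows are hypothetical).
trips = pd.DataFrame({
    "bike_id": [3331, 2472, 3331, 5001, 3331],
    "start_time": [
        "2017-07-01 08:15:00",
        "2017-07-01 09:40:00",
        "2017-07-01 12:30:00",
        "2017-07-02 10:00:00",
        "2017-07-02 17:05:00",
    ],
})

# Boolean mask: True wherever the row belongs to the chosen bike.
mask = trips["bike_id"] == 3331

# Indexing with the mask keeps only that bike's trips.
one_bike = trips[mask]
print(one_bike["start_time"].head())
```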

__C11.__ _(5 pts)_ What I want to know is whether the number of times this particular bike is rented in a given day follows a Poisson distribution. So, we'll have to count up the bike's usage _by day_. While we could do some datetime parsing and high-level work to make this happen, a quick and dirty way ignores the timestamp portion and simply slices out the first ten characters of each date string, which have the form:

- `yyyy-mm-dd`

using:

- `DateStr = DateTimeStr[0:10]`

With these, we can count up our bike's trips in a `Counter()` called `NumTripsPerDay`. We'll then go on to use this to build more $x$ and $P(x)$ values, but just for this bike.

In [None]:
# code here
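
The slicing trick can be sketched on a handful of toy timestamps, assuming they are strings that begin with `yyyy-mm-dd` as described above.

```python
from collections import Counter

# Toy start times for one bike (hypothetical values).
start_times = [
    "2017-07-01 08:15:00",
    "2017-07-01 12:30:00",
    "2017-07-02 17:05:00",
    "2017-07-04 07:50:00",
    "2017-07-04 09:10:00",
    "2017-07-04 18:45:00",
]

# Keep only the first ten characters (the date) and count trips per date.
NumTripsPerDay = Counter(t[0:10] for t in start_times)
print(NumTripsPerDay)
```

Note that a day with no trips, such as `2017-07-03` in this toy list, never appears as a key, so days with $x = 0$ are not counted by this approach.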

__C12.__ _(2 pts)_ Now that we have the bike's trips grouped by date, let's compute a `Counter()`, `NumBikes`, that now stores the number of days, $n$, in which our bike was rented $x$ times: $n(x)$, as in __C2__.

In [None]:
# code here
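
Counting the counts is a one-liner once the per-day tallies exist. The sketch below uses a toy `NumTripsPerDay` with hypothetical dates and totals.

```python
from collections import Counter

# Toy per-day tallies for one bike (hypothetical values).
NumTripsPerDay = Counter({
    "2017-07-01": 2,
    "2017-07-02": 1,
    "2017-07-04": 3,
    "2017-07-05": 2,
})

# n(x): the number of days on which the bike was rented x times.
NumBikes = Counter(NumTripsPerDay.values())
print(NumBikes)
```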

__C13.__ _(3 pts)_ Now that we've revised our frequency data, let's go on and compute our vector forms of $n$, $x$, and $P(x)$, and then the expectation, $E[x] = \overline{x}$, as in __C3&ndash;C4__.

In [None]:
# code here

__C14.__ _(6 pts)_ We can now draw a Poisson sample again, but for the trips-per-day data and average. Additionally, continue through the steps in __C5&ndash;C7__ to compute the sample data's probabilities.

In [None]:
# code here
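
As an alternative to the `Counter()` route (a swapped-in technique, not the one the prompt names), `np.unique` with `return_counts=True` gives the sorted values and their counts in one call. The rate and number of days below are hypothetical placeholders; 92 is simply the number of calendar days in July through September.

```python
import numpy as np

np.random.seed(1)  # fixed seed so the toy run is repeatable

lam_day = 2.5  # hypothetical average trips per day for one bike
ndays = 92     # calendar days in Q3; the real count of observed days may differ

# One simulated trip count per day.
s = np.random.poisson(lam_day, size=ndays)

# Sorted unique counts and how many days produced each one.
s_x, s_n = np.unique(s, return_counts=True)
s_p = s_n / s_n.sum()
print(s_x, s_p)
```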

__C15.__ _(4 pts)_ Plot the two bar plots again. How well do they line up? Do these data appear to be more Poisson? If so, why do you think that is?

_Response._

In [None]:
# code here