Below we have a series of questions for you to translate into a technical plan. For each question, describe how you would make it testable and translate it from a general question into something statistically rigorous.

1. You work at an e-commerce company that sells three goods: widgets, doodads, and fizzbangs. The head of advertising asks you which they should feature in their new advertising campaign. You have data on individual visitors' sessions (activity on a website, pageviews, and purchases), as well as whether or not those users converted from an advertisement for that session. You also have the cost and price information for the goods.

The question we need to answer is: given what pages users viewed, what they purchased, and whether they purchased after clicking on an ad or not, what product would sell best if we featured it in an ad campaign? But first: is advertising really that effective? Is it even worth spending the money on? We should start by trying to determine the amount of revenue generated by advertisements. We could do that naively by summing the revenue from purchases converted from ads over a given timeframe and seeing if it exceeds the cost by a wide enough margin. (This is naive because it may not include purchases that were made because of an ad, but not at the time the user saw or clicked on the ad.)
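As a rough illustration of the naive check described above, the sketch below sums revenue from ad-converted sessions and compares it to what was spent on ads. The session fields (`revenue`, `from_ad`) and the `ad_spend` figure are hypothetical stand-ins; the prompt does not say we have ad spend, so that number is an assumption.

```python
def naive_ad_return(sessions, ad_spend):
    """Return (ad_revenue, revenue per ad dollar) for ad-converted sessions.

    `sessions` is a list of dicts with hypothetical keys `revenue` and
    `from_ad`; `ad_spend` is an assumed total ad cost for the same period.
    """
    if ad_spend <= 0:
        raise ValueError("ad_spend must be positive")
    ad_revenue = sum(s["revenue"] for s in sessions if s["from_ad"])
    return ad_revenue, ad_revenue / ad_spend


# Made-up sessions, purely for illustration.
sessions = [
    {"revenue": 40.0, "from_ad": True},
    {"revenue": 0.0, "from_ad": True},
    {"revenue": 25.0, "from_ad": False},
    {"revenue": 60.0, "from_ad": True},
]
ad_revenue, ratio = naive_ad_return(sessions, ad_spend=50.0)
```

A ratio comfortably above 1 would suggest the ads pay for themselves under this naive attribution, which undercounts purchases made later because of an ad.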

For the example's sake, let's say it's worth it.

Now we have another question: which product should we advertise? This is kind of a loaded question, because the product that sells the best may not be the product that is most profitable. It may be more profitable for the company to boost the sales of a product that ISN'T selling well rather than boosting the sales of a product that is already seeing success. First of all, we should compare the profit margins for each of our 3 products (we have the cost, the price, and, because we have the purchase data, the sales of each product).
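To make the margin comparison concrete, here is a minimal sketch. The prices, costs, and unit counts are invented for illustration, and the dictionary layout is an assumption rather than the company's real schema.

```python
def margin_summary(products):
    """Compute unit margin, margin percentage, and total profit per product."""
    summary = {}
    for name, p in products.items():
        unit_margin = p["price"] - p["cost"]
        summary[name] = {
            "unit_margin": unit_margin,
            "margin_pct": unit_margin / p["price"],
            "total_profit": unit_margin * p["units_sold"],
        }
    return summary


# Hypothetical numbers: the best seller (doodad) has the thinnest margin.
products = {
    "widget": {"price": 20.0, "cost": 12.0, "units_sold": 500},
    "doodad": {"price": 50.0, "cost": 45.0, "units_sold": 900},
    "fizzbang": {"price": 100.0, "cost": 40.0, "units_sold": 30},
}
summary = margin_summary(products)
highest_margin = max(summary, key=lambda name: summary[name]["unit_margin"])
```

With these made-up figures, the product with the most units sold is not the one with the highest unit margin, which is exactly the tension described above.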

We should then compare how well they perform without advertisements. Hopefully we have enough customers purchasing from the website without converting from an ad to give us a big enough sample size to be confident in our conclusions for this question. Then we could look at the data for purchases converted from ads for each product. Since we already determined/assumed that advertising is effective, the product we want to boost the sales of is the one that will increase profit the most. The question is, where will the money spent on ads be most effective? Is it better spent on boosting the most successful products, or, because they're already selling so well, does advertising them suffer from diminishing returns (i.e. it may improve the sales, but not by that much)? We need to generate some measurement of how effective advertising is.
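One possible way to check whether the sample is large enough to distinguish ad-driven sessions from non-ad sessions for a product is a two-proportion z-test on purchase rates. This is a sketch under stated assumptions: the counts are invented, and because visitors were not randomly assigned to see ads, a significant difference shows association, not proof that the ad caused the purchases.

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two purchase rates.

    Returns (rate difference, z statistic, p-value), using the pooled
    standard error and the normal approximation.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; test is undefined")
    z = (p_a - p_b) / se
    # For a standard normal, the two-sided tail area is erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value


# Hypothetical counts for one product: ad sessions vs. non-ad sessions.
diff, z, p_value = two_proportion_z(conv_a=60, n_a=1000, conv_b=300, n_b=10000)
```

Running the same test per product would show for which products the ad and non-ad purchase rates differ by more than sampling noise.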

The effectiveness of the ads for a product should ideally be measured by the number of sales generated by the ads (keep in mind this is the _number of sales,_ an integer, not the dollar value of the sales). We have that statistic (the number of sales converted from ads). We then multiply that by the price of the product being sold, less its cost, to see how much money was generated by each ad. This assumes there are existing ads for all products that we can use as a reference, which seems likely, since we already have a statistic based on purchases converted from ads. Ad effectiveness may vary by product, since ads might not be equally successful for all products.

Then, we can simply multiply the effectiveness of the ads by the price of the products to see how much additional revenue is projected to be generated by the ads, subtract the cost of the goods sold, and pick the product with the greatest estimated profit.
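The calculation for this first question can be sketched as follows. All figures are made up, and `ad_spend` per product is an assumption (the prompt only gives cost and price for the goods), so treat this as one way to set up the comparison rather than a definitive method.

```python
def ad_profit_by_product(products):
    """Estimate profit from ad-converted sales for each product.

    profit_from_ads      = ad-converted unit sales * (price - cost)
    profit_per_ad_dollar = profit_from_ads / ad_spend  (ad_spend is assumed)
    """
    results = {}
    for name, p in products.items():
        profit = p["ad_sales"] * (p["price"] - p["cost"])
        results[name] = {
            "profit_from_ads": profit,
            "profit_per_ad_dollar": profit / p["ad_spend"],
        }
    return results


# Hypothetical inputs for the three products.
products = {
    "widget": {"price": 20.0, "cost": 12.0, "ad_sales": 100, "ad_spend": 400.0},
    "doodad": {"price": 50.0, "cost": 45.0, "ad_sales": 200, "ad_spend": 800.0},
    "fizzbang": {"price": 100.0, "cost": 40.0, "ad_sales": 15, "ad_spend": 300.0},
}
results = ad_profit_by_product(products)
best = max(results, key=lambda name: results[name]["profit_per_ad_dollar"])
```

Ranking by profit per ad dollar rather than raw ad-converted sales is a design choice: it favors the product where additional ad money appears to go furthest, which may not be the product with the most sales.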

2. You work at a web design company that offers to build websites for clients. Signups have slowed, and you are tasked with finding out why. The onboarding funnel has three steps: email and password signup, plan choice, and payment. On a user level you have information on what steps they have completed as well as timestamps for all of those events for the past 3 years. You also have information on marketing spend on a weekly level.

The question we need to answer is ambiguous. Signups have slowed. Does that mean fewer people are making accounts? Wouldn't we expect that? As more prospective clients make accounts, there are fewer prospective clients left to make accounts. Are we serving our clients faster than new potential clients are growing up (and entering the pool of possible clients)? Or are there too many clients to keep up with? If we are serving clients faster than new clients can appear, that's why signups are slowing. Also, we should check to see how many people who make accounts actually pick a plan and make a payment. You would think that people who make an account would be making one so they could choose a plan and give us their money, but maybe there are things you can do with just an account, without choosing a plan or paying anything. Are we concerned about signups, then? Or sales? Shouldn't we be most concerned about sales? But we're being asked about signups. Does that mean that signups are positively correlated with sales? That would make sense intuitively, but is there data to back it up? Well, yes, we can answer that question with our data by looking at the user information for when they completed the 3 onboarding steps. How many people who sign up also choose a plan and make a payment?
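The funnel check at the end of that paragraph could look something like this. The field names (`signup_at`, `plan_at`, `paid_at`) are hypothetical, with `None` meaning the step was never completed.

```python
def funnel_rates(users):
    """Count users at each onboarding step and the step-to-step rates."""
    signed_up = sum(1 for u in users if u["signup_at"] is not None)
    chose_plan = sum(1 for u in users if u["plan_at"] is not None)
    paid = sum(1 for u in users if u["paid_at"] is not None)

    def rate(numerator, denominator):
        # Guard against empty steps so an empty funnel does not crash.
        return numerator / denominator if denominator else 0.0

    return {
        "signed_up": signed_up,
        "chose_plan": chose_plan,
        "paid": paid,
        "plan_given_signup": rate(chose_plan, signed_up),
        "paid_given_plan": rate(paid, chose_plan),
        "paid_given_signup": rate(paid, signed_up),
    }


# Made-up users; timestamps are plain strings here for brevity.
users = [
    {"signup_at": "2023-01-02", "plan_at": "2023-01-02", "paid_at": "2023-01-03"},
    {"signup_at": "2023-01-05", "plan_at": "2023-01-06", "paid_at": None},
    {"signup_at": "2023-01-07", "plan_at": None, "paid_at": None},
    {"signup_at": "2023-01-09", "plan_at": "2023-01-09", "paid_at": "2023-01-09"},
    {"signup_at": "2023-01-11", "plan_at": None, "paid_at": None},
]
rates = funnel_rates(users)
```

Computing these rates per signup month would show whether the slowdown is at the top of the funnel (fewer accounts) or further down (fewer accounts turning into payments).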

In any case, we also know how much money we are spending on marketing. This may make the question something like the following: given user activity data and the amount of money spent on marketing, why are people signing up at a slower rate than before? We can compare the rate of signups with the amount of money spent on marketing, over time, and see if marketing expenditures are correlated with signup rate. We have 3 years of data to work with, so we can probably find long enough periods of time at different marketing spending levels and compare them. This is not necessarily the only thing affecting the number of signups, but it may be one of the few things we have control over. If it turns out that marketing spending is correlated with signup rate, then maybe we are just spending less on marketing and relying too much on word of mouth to create new business for our company. If there is no correlation, or there is a correlation but we haven't decreased our spending, then we might need to gather some data on other variables related to how our business spreads to help understand why fewer people are signing up.
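A minimal sketch of the spend-versus-signups comparison, using a hand-rolled Pearson correlation so it needs nothing beyond the standard library. The weekly figures are invented; in practice signups would be counted per week from the signup timestamps so they line up with the weekly spend data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly marketing spend and signup counts.
weekly_spend = [1000, 1200, 900, 1500, 700, 600]
weekly_signups = [110, 125, 100, 150, 85, 70]
r = pearson_r(weekly_spend, weekly_signups)
```

A strong correlation here would be consistent with spend driving signups but would not prove it; seasonality or a lag between spend and signups could produce or hide the same pattern.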

3. You work at a hotel website and currently the website ranks search results by price. For simplicity's sake, let's say it's a website for one city with 100 hotels. You are tasked with proposing a better ranking system. You have session information, price information for the hotels, and whether each hotel is currently available.

What the hell does "session information" mean? Does it mean the experience a guest has staying at the hotel? Or does it mean visitors to the website? What does "better ranking" entail? The first thing that comes to mind for me is the fact that not everyone cares about only the price. They might care about time frame, or about quality, or about location. We need some better definitions to give a good answer to this question. In any case, since "session information" is ambiguous, I will answer for both possible definitions that I can think of. If session information means the quality of a guest's stay at the hotel, and consists of survey data of guests from each hotel, we could produce rankings by quality (some people may want to spend more money to stay at a hotel with better service or nicer facilities), below certain price points. The problem with this approach is that it's based on survey data, which is each guest's subjective opinion. With enough data though, over time, there should hopefully be a large enough sample that some patterns will emerge. Now, if the session information just refers to website traffic, then I'm not sure how that relates to ranking hotels. It could however relate to the actual list itself and its ease of use. Maybe our website visitors aren't spending as much time on the website as we would like, so we need to make the website more concise, or more navigable. This is a front end design problem though, and not really something that is feasible for us to attempt to solve with data. Price information is not really anything new, since we already use that data in our current ranking. The best improvement we could make to this system, I believe, is to let visitors limit results to a certain price range instead of giving them only the option to sort in ascending/descending order by price. They may be looking for only a certain price range, and may not want to scroll through every price from the top.
They may not want to spend too much money on their hotel, but they also don't want to be too cheap and stay in a crappy hotel either. This type of functionality would help them find a middle of the road hotel that balances their level of comfort with their budget. The catch here though is that the price of a hotel is not always directly correlated with the quality of the guests' experiences there. This is why I was hoping that session information might have some survey data of guest experiences to help us try to provide some insight into the level of service at each price point for our website users. Using the availability data, we need our backend engineers to quickly whip up a way to sort the hotels by availability on certain dates. It says we have information on whether each hotel is _currently_ available... but current as of when? Yesterday? Today? Tomorrow? How far into the future do we have availability information? There are plenty of prospective guests, I'm sure, who will only be staying at a hotel for the convention held there on a particular weekend, or some event in town, or nearby. Since our website specializes in one city, we might have some idea of when those events occur, too. I would also advise, given that we probably have some knowledge of such events, that we also gather location data for hotels so that our website visitors can see how far away the hotel is from the convention center, or wherever they need to be during their stay in the city. That information should not be too hard to gather since you can probably just google it.
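The price-range and availability filtering proposed above could be sketched like this. The hotel records and field names (`name`, `price`, `available`) are hypothetical, and `available` is treated as a single current flag because that is all the prompt says we have.

```python
def search_hotels(hotels, min_price, max_price):
    """Return available hotels within a price band, cheapest first."""
    matches = [
        h for h in hotels
        if h["available"] and min_price <= h["price"] <= max_price
    ]
    return sorted(matches, key=lambda h: h["price"])


# Made-up listings for illustration.
hotels = [
    {"name": "Budget Inn", "price": 60, "available": True},
    {"name": "Midtown Suites", "price": 140, "available": True},
    {"name": "Riverside Hotel", "price": 120, "available": False},
    {"name": "Grand Plaza", "price": 320, "available": True},
    {"name": "Station Lodge", "price": 110, "available": True},
]
mid_range = search_hotels(hotels, min_price=100, max_price=200)
mid_range_names = [h["name"] for h in mid_range]
```

If survey scores or distances to a venue were collected later, they could replace price as the sort key within the chosen band.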

4. You work at a social network, and the management is worried about churn (users stopping using the product). You are tasked with finding out if their churn is atypical. You have three years of data for users with an entry for every time they've logged in, including the timestamp and length of session.

This is a tough question. What counts as logged in? I normally always have a tab on my computer logged in to Facebook... but does that mean Facebook thinks I'm constantly using their website? (Well, no, of course not, since they must also have a record of activity.) But the social network that I work at doesn't have that. They only have a login time and an associated duration for the session. Does that mean that when I log in to this social network that I work for, the duration is huge for me because I just never bother to log out? This is probably not a great way to measure website activity. That said, it does filter out the people who aren't logging in to the website, even though it does not tell us the degree to which everyone is using the product.

The next problem is the word "atypical." Let's assume we have some amount of churn that we already know to be normal. A user should be considered part of the churn once they haven't logged in for a certain amount of time. Let's assume we know that already too, though we could test it by sampling, say, 10,000 people who haven't logged in for a week/month/2 months/etc., and then looking forward in time to see if they have logged in again since then. We'll see some average amount of time that represents the longest a person may go without using the service before quitting it. For example, it may be normal for people to not log into the social network for a week, but if someone doesn't log in for 2 weeks, then on average they won't log in again. We're just looking for the average longest timedelta between login dates. Now that we know the threshold where people change from active users to churn, we can compare the rate at which users become churn. We can see how many users cross that threshold in a month, on average. We can compare the amount of churn over the past recent months to the churn in previous years to determine if the current amount of churn is greater than normal (and therefore "atypical"). If there is some amount of churn we already determined was atypical, we would just compare it to that instead. I know I said earlier that that's how we would determine it, but I wasn't really satisfied with just assuming we knew it, so I figure we could create our own definition of "normal" by comparing the present to the past.
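The monthly churn comparison described above can be sketched as follows. The 14-day threshold, the user IDs, and all counts are assumptions for illustration; a user is counted as churned in the month when their last login plus the threshold falls, and the recent month is compared to history with a simple z-score.

```python
from collections import Counter
from datetime import date, timedelta
import statistics


def monthly_churn_counts(last_login_by_user, threshold_days, as_of):
    """Count users crossing the inactivity threshold, keyed by (year, month)."""
    counts = Counter()
    for last_login in last_login_by_user.values():
        churn_date = last_login + timedelta(days=threshold_days)
        # Users who could still return before the cutoff are not yet churned.
        if churn_date <= as_of:
            counts[(churn_date.year, churn_date.month)] += 1
    return counts


def churn_z_score(recent_count, historical_counts):
    """How many sample standard deviations the recent month sits from history."""
    mean = statistics.mean(historical_counts)
    sd = statistics.stdev(historical_counts)
    return (recent_count - mean) / sd


# Hypothetical last-login dates per user.
last_logins = {
    "u1": date(2023, 1, 10),
    "u2": date(2023, 1, 20),
    "u3": date(2023, 2, 25),
    "u4": date(2023, 3, 30),  # still within the threshold as of April 1
}
counts = monthly_churn_counts(last_logins, threshold_days=14, as_of=date(2023, 4, 1))

# Made-up monthly churn totals: five historical months vs. the latest month.
z = churn_z_score(recent_count=140, historical_counts=[100, 110, 90, 105, 95])
```

Raw counts ignore growth in the user base, so dividing each month's count by the number of active users at the start of that month would likely give a fairer comparison across three years.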