# Introduction to Data Science
## Homework 2

Student Name: Yiyan Chen

Student Netid: yc2462
***

### Part 1: Case study
- Read [this article](http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html) in the New York Times.
- Use what we've learned in class and from the book to describe how one could set Target's problem up as a predictive modeling problem, such that they could have gotten the results that they did.  Formulate your solution as a proposed plan using our data science terminology.  Include all the aspects of the formulation that you see as relevant to solving the problem.  Be precise but concise.

Target wants to know about customers or potential customers' behaviors and then influence them to buy more stuff from Target and form a behavioral habit. And they found out people changed their shopping patterns after major life events, especially after a child came to their lives. Therefore, Target was intended to attract more pregnant women to form a routine to buy their stuff. 

They constructed a predictive model where the response variable is whether a woman is pregnant or not. The binary response variable determines this is a classification problem. The predictors are this persons’ the history purchases, such as whether the woman buys scent-free soaps, extra-big bags of cotton balls, and vitamins. The most appropriate model to start with is logistic regression. 

$$F(x) = \frac{1}{1+e^{-(\beta_{0}+\beta_{1}x_{1}+\ldots+\beta_{n}x_{n})}}$$

Besides, we need to run the univariate, pairwise and global correlations to see how each predictor performs when forecasting the response variable. Then do some transformations on variables, construct the model and see the accuracy of how the model predicts. 

After the model construction, the model needs to be put into reality to test. Target can put the recent customers’ data pool to see how the model perform. If the accuracy rate is satisfying, the model can put into use. 
 

### Part 2: Exploring data in the command line
For this part we will be using the data file located in `"advertising_events.csv"`. This file consists of records that pertain to some online advertising events on a given day. There are 4 comma separated columns in this order: `userid`, `timestamp`, `domain`, and `action`. These fields are of type `int`, `int`, `string`, and `int` respectively. Answer the following questions using Linux/Unix bash commands. All questions can be answered in one line (sometimes, with pipes)! Some questions will have many possible solutions. Don't forget that in IPython notebooks you must prefix all bash commands with an exclamation point, i.e. `"!command arguments"`.

[Hints: You can experiment with whatever you want in the notebook and then delete things to construct your answer later.  You can also use ssh to use the actual bash shell in your terminal and then just paste your answers here. Recall that once you enter the "!" then filename completion should work. Also, these are standard data exploration commands that are quick and easy to use in a terminal or in the notebook. We don't cover command line operations formally in this class, but these are worth learning (and thus are part of the HW). Be resourceful. Use whatever online cheat sheets or Stackoverflow to answer the question.]

1\. How many records (lines) are in this file (look up wc)?

In [1]:
# Place your code here
!wc -l advertising_events.csv

   10341 advertising_events.csv


2\. How many unique users are in this file? (hint: consider the 'cut' command and use pipe operator '|')

In [None]:
# Place your code here
!cut -d, -f1 advertising_events.csv | sort | uniq -c | wc -l

3\. Rank all domains by the number of visits they received in descending order. (hint: consider the 'cut', 'uniq' and 'sort' commands and the pipe operator).

In [1]:
# Place your code here
!cut -d, -f3 advertising_events.csv | sort | uniq -c | sort -k1 -n -r

3114 google.com
2092 facebook.com
1036 youtube.com
1034 yahoo.com
1022 baidu.com
 513 wikipedia.org
 511 amazon.com
 382 qq.com
 321 twitter.com
 316 taobao.com


4\. List all records for the user with user id 37. (hint: this can be done using 'grep')

In [None]:
# Place your code here
!grep -w "37" advertising_events.csv

### Part 3: Dealing with data Pythonically

In [4]:
# You might find these packages useful. You may import any others you want!
import pandas as pd
import numpy as np

1\. Load the data set `"ads_dataset.tsv"` into a Python Pandas data frame called `ads`.

In [91]:
# Place your code here
ads = pd.read_csv("ads_dataset.tsv",sep='\t')
ads.reset_index(drop = True)

Unnamed: 0,isbuyer,buy_freq,visit_freq,buy_interval,sv_interval,expected_time_buy,expected_time_visit,last_buy,last_visit,multiple_buy,multiple_visit,uniq_urls,num_checkins,y_buy
0,0,,1,0.0,0.000000,0.0,0.000000,106,106,0,0,169,2130,0
1,0,,1,0.0,0.000000,0.0,0.000000,72,72,0,0,154,1100,0
2,0,,1,0.0,0.000000,0.0,0.000000,5,5,0,0,4,12,0
3,0,,1,0.0,0.000000,0.0,0.000000,6,6,0,0,150,539,0
4,0,,2,0.0,0.500000,0.0,-101.149300,101,101,0,1,103,362,0
5,0,,1,0.0,0.000000,0.0,0.000000,42,42,0,0,17,35,0
6,0,,1,0.0,0.000000,0.0,0.000000,42,42,0,0,42,110,0
7,0,,2,0.0,29.791670,0.0,-106.188300,121,121,0,1,101,401,0
8,0,,3,0.0,45.479170,0.0,-34.144730,64,64,0,1,100,298,0
9,0,,1,0.0,0.000000,0.0,0.000000,13,13,0,0,53,247,0


2\. Write a Python function called `getDfSummary()` that does the following:
- Takes as input a data frame
- For each variable in the data frame calculates the following features:
  - `number_nan` to count the number of missing not-a-number values
  - Ignoring missing, NA, and Null values:
    - `number_distinct` to count the number of distinct values a variable can take on
    - `mean`, `max`, `min`, `std` (standard deviation), and `25%`, `50%`, `75%` to correspond to the appropriate percentiles
- All of these new features should be loaded in a new data frame. Each row of the data frame should be a variable from the input data frame, and the columns should be the new summary features.
- Returns this new data frame containing all of the summary information

Hint: The pandas `describe()` [(manual page)](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) method returns a useful series of values that can be used here.

In [75]:
def getDfSummary(input_data):
    # Place your code here
    output_data = pd.DataFrame()
    number_nan = input_data.isnull().sum()
    in_mean = np.mean(input_data)
    in_max = np.max(input_data)
    in_min = np.min(input_data)
    in_std = np.std(input_data)
    in_25 = input_data.quantile(0.25)
    in_50 = input_data.quantile(0.50)
    in_75 = input_data.quantile(0.75)
#     input_data = input_data.dropna(axis = 0, how = "any")
    number_distinct = []
#     for num in range(len(input_data)):
#         if num not in count_num:
#             count_num.append(num)
#     number_distinct = len(count_num)
    for col in range(len(input_data.columns)):
        number_distinct.append(len(np.unique(input_data.iloc[:, col])))
        
    
    output_data["Count of NaN"] = number_nan
    output_data["Distinct Values"] = number_distinct
    output_data["Mean"] = in_mean
    output_data["Maximum"] = in_max
    output_data["Minmum"] = in_min
    output_data["Standard Deviation"] = in_std
    output_data["25% Percentile"] = in_25
    output_data["50% Percentile"] = in_50
    output_data["75% Percentile"] = in_75
    return output_data

3\. How long does it take for your `getDfSummary()` function to work on your `ads` data frame? Show us the results below.

Hint: `%timeit getDfSummary(ads)`

In [77]:
# Place your code here
%timeit getDfSummary(ads)

86 ms ± 4.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


4\. Using the results returned from `getDfSummary()`, which fields, if any, contain missing `NaN` values?

In [78]:
# Place your code here
getDfSummary(ads)

Unnamed: 0,Count of NaN,Distinct Values,Mean,Maximum,Minmum,Standard Deviation,25% Percentile,50% Percentile,75% Percentile
isbuyer,0,2,0.042632,1.0,0.0,0.202025,0.0,0.0,0.0
buy_freq,52257,52267,1.240653,15.0,1.0,0.78206,1.0,1.0,1.0
visit_freq,0,64,1.852777,84.0,0.0,2.921794,1.0,1.0,2.0
buy_interval,0,295,0.210008,174.625,0.0,3.92198,0.0,0.0,0.0
sv_interval,0,5886,5.82561,184.9167,0.0,17.595281,0.0,0.0,0.104167
expected_time_buy,0,348,-0.19804,84.28571,-181.9238,4.997746,0.0,0.0,0.0
expected_time_visit,0,15135,-10.210786,91.40192,-187.6156,31.87943,0.0,0.0,0.0
last_buy,0,189,64.729335,188.0,0.0,53.476168,18.0,51.0,105.0
last_visit,0,189,64.729335,188.0,0.0,53.476168,18.0,51.0,105.0
multiple_buy,0,2,0.006357,1.0,0.0,0.079478,0.0,0.0,0.0


Only buy_freq contain missing NaN values. 

5\. For the fields with missing values, does it look like the data is missing at random? Are there any other fields that correlate perfectly, or predict that the data is missing? If missing, what should the data value be?

Hint: create another data frame that has just the records with a missing value. Get a summary of this data frame using `getDfSummary()` and compare the differences. Do some feature distributions change dramatically?

In [79]:
# Place your code here
missing_data = ads[np.isnan(ads.buy_freq)]
getDfSummary(missing_data)

Unnamed: 0,Count of NaN,Distinct Values,Mean,Maximum,Minmum,Standard Deviation,25% Percentile,50% Percentile,75% Percentile
isbuyer,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0
buy_freq,52257,52257,,,,,,,
visit_freq,0,48,1.651549,84.0,1.0,2.147934,1.0,1.0,2.0
buy_interval,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sv_interval,0,5112,5.686388,184.9167,0.0,17.623387,0.0,0.0,0.041667
expected_time_buy,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0
expected_time_visit,0,13351,-9.669298,91.40192,-187.6156,31.238731,0.0,0.0,0.0
last_buy,0,189,65.741317,188.0,0.0,53.48411,19.0,52.0,106.0
last_visit,0,189,65.741317,188.0,0.0,53.48411,19.0,52.0,106.0
multiple_buy,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The buy_freq is correlated with variables such as isbuyer, buy_interval, expected_time_buy, multiple_buy. When those variables' values are zero, buy_freq is NaN which makes sense intuitively. 

6\. Which variables are binary?

In [98]:
# Place your code here
getDfSummary(ads).index[getDfSummary(ads)["Distinct Values"]==2]

Index(['isbuyer', 'multiple_buy', 'multiple_visit', 'y_buy'], dtype='object')

The binary variables are isbuyer, multiple_buy, multiple_visit and y_buy. 