# Introduction to Data Science
## Homework 2

Student Name: Zixuan Shao

Student Netid: zs2167
***

### Part 1: Case study (5 Points)
- Read [this article](http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html) in the New York Times.
- Use what we've learned in class and from the book to describe how one could set Target's problem up as a predictive modeling problem, such that they could have gotten the results that they did.  Formulate your solution as a proposed plan using our data science terminology.  Include all the aspects of the data mining process, and be sure to include the motivation for predictive modeling and give a sketch of a solution.  Be precise but concise.

### Motivation
- Consumers' buying habits are unstable around the birth of a child, which makes expectant parents unusually open to changing where they shop. To reach these consumers first, Target needs to predict which women are pregnant so that it can recommend related products even before the birth.

### Data preparation
- Extract raw data from the database and transform it, through feature engineering, into a dataset suitable for modeling. Useful features include each consumer's gender, age, and buying history. A window of ten months to one year of history is a reasonable choice, since it covers the whole pregnancy for most women in the data.

### Predictive modeling
- Frame this as a classical classification problem: the target variable is whether the consumer is pregnant (the model outputs an estimated probability), and all other features are the predictors. Split the data into training, validation, and test sets, fit several candidate classification models on the training set, and compare them on the validation set using one consistent evaluation metric, such as F1 score or ROC AUC.
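
To make the split-and-evaluate step concrete, here is a minimal sketch using only the Python standard library. The toy records, the `bought_lotion` feature, and the one-rule "model" are hypothetical illustrations, not Target's data or method; a real project would fit an actual classifier on engineered purchase features.

```python
import random

def split_data(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle rows and cut them into train / validation / test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_valid = int(len(rows) * valid_frac)
    return rows[:n_train], rows[n_train:n_train + n_valid], rows[n_train + n_valid:]

def f1_score(y_true, y_pred):
    """F1 is the harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy records: (bought_lotion, is_pregnant)
records = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0),
           (0, 1), (1, 1), (0, 0), (0, 0), (1, 1)]
train, valid, test = split_data(records)

# Placeholder "model": predict pregnant whenever the feature is 1.
y_true = [label for _, label in test]
y_pred = [feature for feature, _ in test]
test_f1 = f1_score(y_true, y_pred)
```

The test set is touched only once, after model selection on the validation set, so the reported score is not biased by the selection process.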

### Evaluation
- Measure the performance of the chosen model on the held-out test set, and visualize the results.

### Deployment
- Finally, deploy the model in a real-time system that collects consumers' current buying logs. The system extracts useful features from the raw data, feeds them to the model, and flags consumers with a high predicted chance of pregnancy, so that the business strategy can be applied to this targeted group.

### Part 2: Exploring data in the command line (4 Points)
For this part we will be using the data file located in `"data/advertising_events.csv"`. This file consists of records that pertain to some online advertising events on a given day. There are 4 comma separated columns in this order: `userid`, `timestamp`, `domain`, and `action`. These fields are of type `int`, `int`, `string`, and `int` respectively. Answer the following questions using Linux/Unix bash commands. All questions can be answered in one line (sometimes, with pipes)! Some questions will have many possible solutions. Don't forget that in IPython notebooks you must prefix all bash commands with an exclamation point, i.e. `"!command arguments"`.

[Hints: You can experiment with whatever you want in the notebook and then delete things to construct your answer later.  You can also use a bash shell (i.e., EC2 or a Mac terminal) and then just paste your answers here. Recall that once you enter the "!" then filename completion should work.]

[Here](https://opensource.com/article/17/2/command-line-tools-data-analysis-linux) is a good linux command line reference.

1\. How many records (lines) are in this file? (look up 'wc' command)

In [4]:
!wc -l advertising_events.csv

   10341 advertising_events.csv


2\. How many unique users are in this file? (hint: consider the 'cut' command and use pipe operator '|')

In [15]:
!cut -d',' -f 1 advertising_events.csv | sort -n | uniq | wc -l

     732


3\. Rank all domains by the number of visits they received in descending order. (hint: consider the 'cut', 'uniq' and 'sort' commands and the pipe operator).

In [12]:
!cut -d',' -f 3 advertising_events.csv | sort | uniq -c | sort -k 1nr

3114 google.com
2092 facebook.com
1036 youtube.com
1034 yahoo.com
1022 baidu.com
 513 wikipedia.org
 511 amazon.com
 382 qq.com
 321 twitter.com
 316 taobao.com


4\. List all records for the user with user id 37. (hint: this can be done using 'grep')

In [16]:
!grep '^37,' advertising_events.csv

37,648061658,google.com,0
37,642479972,google.com,2
37,644493341,facebook.com,2
37,654941318,facebook.com,1
37,649979874,baidu.com,1
37,653061949,yahoo.com,1
37,655020469,google.com,3
37,640878012,amazon.com,0
37,659864136,youtube.com,1
37,640361378,yahoo.com,1
37,653862134,facebook.com,0
37,648828970,youtube.com,0
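
As a cross-check on the shell answers above, the same four questions can be sketched in Python with the standard library. This assumes the file has no header row, which is consistent with the domain counts above summing to the 10341 lines reported by `wc`. The helper names and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

def summarize_events(lines):
    """Return (record count, unique user count, domain ranking) for header-less CSV lines."""
    rows = list(csv.reader(lines))
    users = {row[0] for row in rows}
    domain_counts = Counter(row[2] for row in rows)
    # most_common() sorts domains by visit count, descending
    return len(rows), len(users), domain_counts.most_common()

def records_for_user(lines, userid):
    """Keep only rows whose first column equals userid exactly."""
    return [row for row in csv.reader(lines) if row[0] == str(userid)]

# Hypothetical sample in the same layout: userid, timestamp, domain, action.
sample = """37,648061658,google.com,0
137,642479972,google.com,2
5,644493341,facebook.com,37
37,654941318,facebook.com,1
5,649979874,google.com,1
"""

n_records, n_users, ranking = summarize_events(io.StringIO(sample))
user_37 = records_for_user(io.StringIO(sample), 37)
```

Matching on the first column only means that user 137, or a row with the value 37 in another column, is not picked up. For the real file, pass an open file handle (for example `open("advertising_events.csv", newline="")`) instead of the `StringIO` sample.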


### Part 3: Dealing with data Pythonically (16 Points)

1\. (1 Point) Download the data set `"data/ads_dataset.tsv"` and load it into a Python Pandas data frame called `ads`.

In [19]:
import pandas as pd
ads = pd.read_csv("ads_dataset.tsv", sep = "\t")

2\. (4 Points) Write a Python function called `getDfSummary()` that does the following:
- Takes as input a data frame
- For each variable in the data frame calculates the following features:
  - `number_nan` to count the number of missing not-a-number values
  - Ignoring missing, NA, and Null values:
    - `number_distinct` to count the number of distinct values a variable can take on
    - `mean`, `max`, `min`, `std` (standard deviation), and `25%`, `50%`, `75%` to correspond to the appropriate percentiles
- All of these new features should be loaded in a new data frame. Each row of the data frame should be a variable from the input data frame, and the columns should be the new summary features.
- Returns this new data frame containing all of the summary information

Hint: The pandas `describe()` method returns a useful series of values that can be used here.

In [44]:
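def getDfSummary(input_data):
    """Summarize each numeric column of input_data, one row per variable."""
    # describe() skips NaN; keep mean, std, min, quartiles and max, then transpose
    output_data = input_data.describe().loc['mean':'max',:].T
    # count missing values per column
    output_data["number_nan"] = input_data.isna().sum()
    # nunique() ignores NaN by default
    output_data["number_distinct"] = input_data.nunique()
    return output_data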
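
A quick way to sanity-check the function is to run it on a tiny frame whose summary can be worked out by hand. The toy frame below is hypothetical; the function body is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def getDfSummary(input_data):
    output_data = input_data.describe().loc['mean':'max', :].T
    output_data["number_nan"] = input_data.isna().sum()
    output_data["number_distinct"] = input_data.nunique()
    return output_data

# Hypothetical toy frame: column "a" has one missing value, "b" is binary.
toy = pd.DataFrame({"a": [1.0, 2.0, np.nan, 2.0], "b": [0, 1, 0, 1]})
toy_summary = getDfSummary(toy)
print(toy_summary[["mean", "number_nan", "number_distinct"]])
```

Column `a` should report one missing value and two distinct values, with its mean computed over the three non-missing entries only.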

3\. How long does it take for your `getDfSummary()` function to work on your `ads` data frame? Show us the results below.

Hint: use `%timeit`

In [46]:
%timeit getDfSummary(ads)

52.2 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


4\. (2 Points) Using the results returned from `getDfSummary()`, which fields, if any, contain missing `NaN` values?

In [48]:
summary_data = getDfSummary(ads)

In [49]:
summary_data["number_nan"]

isbuyer                    0
buy_freq               52257
visit_freq                 0
buy_interval               0
sv_interval                0
expected_time_buy          0
expected_time_visit        0
last_buy                   0
last_visit                 0
multiple_buy               0
multiple_visit             0
uniq_urls                  0
num_checkins               0
y_buy                      0
Name: number_nan, dtype: int64

#### Only `buy_freq` has NaN values (52,257 of them)
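
Rather than reading the column by eye, the fields with missing values can be selected with a boolean filter. The sketch below rebuilds the `number_nan` series from the counts printed above so that it runs without the data file; on the real data the same filter would be applied to `summary_data["number_nan"]`.

```python
import pandas as pd

# Missing-value counts copied from the number_nan output above.
number_nan = pd.Series({
    "isbuyer": 0, "buy_freq": 52257, "visit_freq": 0, "buy_interval": 0,
    "sv_interval": 0, "expected_time_buy": 0, "expected_time_visit": 0,
    "last_buy": 0, "last_visit": 0, "multiple_buy": 0, "multiple_visit": 0,
    "uniq_urls": 0, "num_checkins": 0, "y_buy": 0,
}, name="number_nan")

# Keep only the fields with at least one missing value.
fields_with_nan = number_nan[number_nan > 0]
print(fields_with_nan)
```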

5\. (4 Points) For the fields with missing values, does it look like the data is missing at random? Are there any other fields that correlate perfectly, or make it more likely that the data is missing? If missing, what should the data value be? Don't just show code here. Please explain your answer.

Hint: create another data frame that has just the records with a missing value. Get a summary of this data frame using `getDfSummary()` and compare the differences. Do some feature distributions change dramatically?

In [54]:
getDfSummary(ads[ads["buy_freq"].isna()])

| | mean | std | min | 25% | 50% | 75% | max | number_nan | number_distinct |
|---|---|---|---|---|---|---|---|---|---|
| isbuyer | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 |
| buy_freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 52257 | 0 |
| visit_freq | 1.651549 | 2.147955 | 1.0 | 1.0 | 1.0 | 2.0 | 84.0 | 0 | 48 |
| buy_interval | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 |
| sv_interval | 5.686388 | 17.623555 | 0.0 | 0.0 | 0.0 | 0.041667 | 184.9167 | 0 | 5112 |
| expected_time_buy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 |
| expected_time_visit | -9.669298 | 31.23903 | -187.6156 | 0.0 | 0.0 | 0.0 | 91.40192 | 0 | 13351 |
| last_buy | 65.741317 | 53.484622 | 0.0 | 19.0 | 52.0 | 106.0 | 188.0 | 0 | 189 |
| last_visit | 65.741317 | 53.484622 | 0.0 | 19.0 | 52.0 | 106.0 | 188.0 | 0 | 189 |
| multiple_buy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 |


In [55]:
summary_data

| | mean | std | min | 25% | 50% | 75% | max | number_nan | number_distinct |
|---|---|---|---|---|---|---|---|---|---|
| isbuyer | 0.042632 | 0.202027 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 2 |
| buy_freq | 1.240653 | 0.782228 | 1.0 | 1.0 | 1.0 | 1.0 | 15.0 | 52257 | 10 |
| visit_freq | 1.852777 | 2.92182 | 0.0 | 1.0 | 1.0 | 2.0 | 84.0 | 0 | 64 |
| buy_interval | 0.210008 | 3.922016 | 0.0 | 0.0 | 0.0 | 0.0 | 174.625 | 0 | 295 |
| sv_interval | 5.82561 | 17.595442 | 0.0 | 0.0 | 0.0 | 0.104167 | 184.9167 | 0 | 5886 |
| expected_time_buy | -0.19804 | 4.997792 | -181.9238 | 0.0 | 0.0 | 0.0 | 84.28571 | 0 | 348 |
| expected_time_visit | -10.210786 | 31.879722 | -187.6156 | 0.0 | 0.0 | 0.0 | 91.40192 | 0 | 15135 |
| last_buy | 64.729335 | 53.476658 | 0.0 | 18.0 | 51.0 | 105.0 | 188.0 | 0 | 189 |
| last_visit | 64.729335 | 53.476658 | 0.0 | 18.0 | 51.0 | 105.0 | 188.0 | 0 | 189 |
| multiple_buy | 0.006357 | 0.079479 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 2 |


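#### Comparing the summary of the full data with the summary of the rows where `buy_freq` is missing shows that the data is not missing at random. Every one of the 52,257 rows with a missing `buy_freq` has `isbuyer` equal to 0, and `buy_interval`, `expected_time_buy`, and `multiple_buy` are also constant at 0 in those rows, so the missingness is fully explained by `isbuyer`: if a visitor is not a buyer, no buying frequency can be recorded. Where `buy_freq` is present, its minimum is 1, which fits the same pattern. To recover the data, all missing `buy_freq` values should be set to 0, because a non-buyer has a buying frequency of zero.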
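
The claim that missingness lines up exactly with non-buyers can be checked directly rather than inferred from the summaries. The sketch below uses a hypothetical toy frame with the same column names (the values are made up); on the real data the same two expressions would be applied to `ads`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows using the source's column names; the values are made up.
toy_ads = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1, 0],
    "buy_freq": [np.nan, 2.0, np.nan, 1.0, np.nan],
})

# True when buy_freq is missing exactly for the non-buyers and for no one else.
missing_iff_nonbuyer = (toy_ads["buy_freq"].isna() == (toy_ads["isbuyer"] == 0)).all()

# Impute zeros on a copy so the original frame is left untouched.
imputed = toy_ads.assign(buy_freq=toy_ads["buy_freq"].fillna(0))
```

If `missing_iff_nonbuyer` came out `False` on the real data, some buyers would be missing a frequency or some non-buyers would have one, and filling with 0 would need a second look.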

6\. (4 Points) Which variables are binary?

In [50]:
summary_data["number_distinct"]

isbuyer                    2
buy_freq                  10
visit_freq                64
buy_interval             295
sv_interval             5886
expected_time_buy        348
expected_time_visit    15135
last_buy                 189
last_visit               189
multiple_buy               2
multiple_visit             2
uniq_urls                207
num_checkins            4628
y_buy                      2
Name: number_distinct, dtype: int64

#### The summary shows that `isbuyer`, `multiple_buy`, `multiple_visit`, and `y_buy` are binary variables: each takes exactly two distinct values
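
Having two distinct values does not by itself guarantee that those values are 0 and 1, so a stricter check looks at the values themselves. The helper name `binary_columns` and the toy frame are hypothetical; this is a sketch of one way to confirm the answer, not part of the assignment's required output.

```python
import pandas as pd

def binary_columns(df):
    """Return the columns whose non-missing values are exactly the set {0, 1}."""
    return [col for col in df.columns
            if set(df[col].dropna().unique()) == {0, 1}]

# Hypothetical toy frame: only "isbuyer" is a true 0/1 column.
toy = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1],
    "visit_freq": [1, 2, 2, 5],
    "two_valued": [3, 7, 3, 7],  # two distinct values, but not 0/1
})
binary = binary_columns(toy)
```

Here `two_valued` would pass a `number_distinct == 2` filter but is correctly excluded by the value check.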