# Introduction to Data Science
## Homework 2

Student Name: Zhengyuan Ding

Student Netid: zd415
***

### Part 1: Case study (5 Points)
- Read [this article](http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html) in the New York Times.
- Use what we've learned in class and from the book to describe how one could set Target's problem up as a predictive modeling problem, such that they could have gotten the results that they did.  Formulate your solution as a proposed plan using our data science terminology.  Include all the aspects of the formulation that you see as relevant to solving the problem.  Be precise but concise.

- Input data and feature selection

As the article describes, Target collects a large amount of data about its customers. Each customer is assigned a unique Guest ID, which is linked to that individual's demographic information and shopping behavior. The demographic information includes age, gender, marital status, place of birth, distance to the store, etc. The behavioral data include payment method, whether a coupon was used, whether a refund was mailed in, etc. Some of these attributes are numerical and others are categorical.

First, we should perform feature selection, both to reduce the size of the data and to improve prediction. We can compute the information gain of each attribute with respect to the target and rank the attributes by informativeness. In particular, an entropy graph for each attribute could help us better understand the data.
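
As a rough illustration of ranking attributes by information gain, here is a minimal sketch in plain Python. The attribute names (`bought_lotion`, `used_coupon`) and the four toy records are hypothetical and are not taken from Target's data.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# Hypothetical toy data: 1 = pregnant, 0 = not pregnant.
labels = [1, 1, 0, 0]
bought_lotion = ["yes", "yes", "no", "no"]  # separates the two classes perfectly
used_coupon = ["yes", "no", "yes", "no"]    # carries no information about the class

features = {"bought_lotion": bought_lotion, "used_coupon": used_coupon}
ranking = sorted(
    features,
    key=lambda name: information_gain(features[name], labels),
    reverse=True,
)
print(ranking)
```

Attributes near the top of `ranking` would be the first candidates to keep.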

In addition, we can generate new features from the input data, guided by exploratory analysis and domain knowledge. For instance, we can check the summary statistics of each variable and plot its distribution as a histogram to look for insights.

- Data preprocessing

After feature selection, we should preprocess the selected data. For instance, categorical variables can be one-hot encoded and numerical variables can be normalized to the same range. For data cleaning, we can check for outliers and missing values. If values are missing at random, we could replace them with the column average, or simply delete the affected records if they are rare.
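
The preprocessing steps above could look roughly like this in pandas. The `customers` table, its column names, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical customer table; columns and values are invented for this sketch.
customers = pd.DataFrame({
    "age": [25.0, 40.0, None, 31.0],
    "payment_method": ["credit", "cash", "credit", "debit"],
})

# Impute the missing numeric value with the column mean
# (only reasonable if the value is missing at random).
customers["age"] = customers["age"].fillna(customers["age"].mean())

# Min-max normalize the numeric column to the range [0, 1].
age_min, age_max = customers["age"].min(), customers["age"].max()
customers["age"] = (customers["age"] - age_min) / (age_max - age_min)

# One-hot encode the categorical column.
encoded = pd.get_dummies(customers, columns=["payment_method"])
print(encoded.columns.tolist())
```

In a real pipeline the mean, minimum, and maximum would normally be computed on the training split only and then reused for the validation and test splits.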

We should also split the dataset into training, validation, and test sets.
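
A minimal sketch of such a split, using only the standard library. The 60/20/20 proportions and the helper name `train_val_test_split` are assumptions chosen for illustration.

```python
import random


def train_val_test_split(records, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle records and split them into training, validation, and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical guest IDs standing in for customer records.
train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```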

- Target value y 

The target value y is binary: 1 if the customer is pregnant and 0 otherwise.

- Training Model

We could use an SVM or a decision tree for this classification problem. To detect overfitting, we plot a fitting curve of AUC against model complexity on both the training set and the validation set (or use cross-validation). A large gap between the two curves indicates overfitting, in which case we should tune the hyperparameters to control the complexity of the model.
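
One way to read a fitting curve programmatically is sketched below. The AUC numbers are invented purely for illustration, and the `max_gap` threshold and the helper name `pick_complexity` are assumptions, not an established rule.

```python
def pick_complexity(fitting_curve, max_gap=0.05):
    """Return the complexity setting with the best validation AUC whose
    train/validation gap stays within max_gap (None if no setting qualifies)."""
    candidates = [
        (val_auc, setting)
        for setting, (train_auc, val_auc) in fitting_curve.items()
        if train_auc - val_auc <= max_gap
    ]
    return max(candidates)[1] if candidates else None


# Made-up values: tree depth -> (training AUC, validation AUC).
fitting_curve = {
    2: (0.70, 0.69),
    4: (0.78, 0.76),
    8: (0.90, 0.74),   # large gap: the deeper tree is starting to overfit
    16: (0.99, 0.68),
}
print(pick_complexity(fitting_curve))
```

Here the depths 8 and 16 are rejected because training AUC keeps rising while validation AUC falls, which is the pattern the fitting curve is meant to expose.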

- Evaluation

After tuning the hyperparameters in the training stage, we evaluate the models.
First, accuracy should at least exceed that of always predicting the majority class (the base rate).
Second, we can compare the performance of the two models by plotting ROC curves and computing AUC scores on the test set. The model whose ROC curve bows further toward the top-left corner is likely to perform better. We could also use a lift curve to check performance.
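
For concreteness, AUC and lift can be computed from test-set labels and model scores as in the sketch below. The labels and scores are hypothetical, and the pairwise AUC computation is chosen for readability rather than speed.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive is scored above a
    random negative (ties count as one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def lift_at(labels, scores, fraction):
    """Positive rate among the top-scored fraction divided by the base rate."""
    ranked = [
        y for _, y in sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    ]
    top_n = max(1, int(len(ranked) * fraction))
    base_rate = sum(labels) / len(labels)
    return (sum(ranked[:top_n]) / top_n) / base_rate


# Hypothetical test-set labels (1 = pregnant) and model scores.
labels = [1, 0, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1]
print(auc_score(labels, scores))
print(lift_at(labels, scores, 0.25))
```

A lift above 1 at a given fraction means that targeting the top-scored customers finds more positives than targeting the same number of customers at random.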


### Part 2: Exploring data in the command line (4 Points)
For this part we will be using the data file located in `"data/advertising_events.csv"`. This file consists of records that pertain to some online advertising events on a given day. There are 4 comma separated columns in this order: `userid`, `timestamp`, `domain`, and `action`. These fields are of type `int`, `int`, `string`, and `int` respectively. Answer the following questions using Linux/Unix bash commands. All questions can be answered in one line (sometimes, with pipes)! Some questions will have many possible solutions. Don't forget that in IPython notebooks you must prefix all bash commands with an exclamation point, i.e. `"!command arguments"`.

[Hints: You can experiment with whatever you want in the notebook and then delete things to construct your answer later.  You can also use ssh to use the actual bash shell on EC2 (see original directions) and then just paste your answers here. Recall that once you enter the "!" then filename completion should work.]

1\. How many records (lines) are in this file? (look up 'wc' command)

In [1]:
# Place your code here
!wc -l "data/advertising_events.csv"

   10341 data/advertising_events.csv


2\. How many unique users are in this file? (hint: consider the 'cut' command and use pipe operator '|')

In [2]:
# Place your code here
!cut -d ',' -f 1 "data/advertising_events.csv"| sort -u | wc -l

     732


3\. Rank all domains by the number of visits they received in descending order. (hint: consider the 'cut', 'uniq' and 'sort' commands and the pipe operator).

In [3]:
# Place your code here
!cut -d ',' -f 3 "data/advertising_events.csv" | sort | uniq -c | sort -rn

3114 google.com
2092 facebook.com
1036 youtube.com
1034 yahoo.com
1022 baidu.com
 513 wikipedia.org
 511 amazon.com
 382 qq.com
 321 twitter.com
 316 taobao.com


4\. List all records for the user with user id 37. (hint: this can be done using 'grep')

In [4]:
# Place your code here
!grep "^37," "data/advertising_events.csv"

37,648061658,google.com,0
37,642479972,google.com,2
37,644493341,facebook.com,2
37,654941318,facebook.com,1
37,649979874,baidu.com,1
37,653061949,yahoo.com,1
37,655020469,google.com,3
37,640878012,amazon.com,0
37,659864136,youtube.com,1
37,640361378,yahoo.com,1
37,653862134,facebook.com,0
37,648828970,youtube.com,0


### Part 3: Dealing with data Pythonically (16 Points)

1\. (1 Point) Download the data set `"data/ads_dataset.tsv"` and load it into a Python Pandas data frame called `ads`.

In [5]:
# Place your code here
import pandas as pd
ads = pd.read_csv('data/ads_dataset.tsv',sep='\t',header=0)

In [6]:
ads.head()

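| isbuyer | buy_freq | visit_freq | buy_interval | sv_interval | expected_time_buy | expected_time_visit | last_buy | last_visit | multiple_buy | multiple_visit | uniq_urls | num_checkins | y_buy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 106 | 106 | 0 | 0 | 169 | 2130 | 0 |
| 0 | NaN | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 72 | 72 | 0 | 0 | 154 | 1100 | 0 |
| 0 | NaN | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 5 | 5 | 0 | 0 | 4 | 12 | 0 |
| 0 | NaN | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 6 | 6 | 0 | 0 | 150 | 539 | 0 |
| 0 | NaN | 2 | 0.0 | 0.5 | 0.0 | -101.1493 | 101 | 101 | 0 | 1 | 103 | 362 | 0 |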


2\. (4 Points) Write a Python function called `getDfSummary()` that does the following:
- Takes as input a data frame
- For each variable in the data frame calculates the following features:
  - `number_nan` to count the number of missing not-a-number values
  - Ignoring missing, NA, and Null values:
    - `number_distinct` to count the number of distinct values a variable can take on
    - `mean`, `max`, `min`, `std` (standard deviation), and `25%`, `50%`, `75%` to correspond to the appropriate percentiles
- All of these new features should be loaded in a new data frame. Each row of the data frame should be a variable from the input data frame, and the columns should be the new summary features.
- Returns this new data frame containing all of the summary information

Hint: The pandas `describe()` method returns a useful series of values that can be used here.

In [7]:
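def getDfSummary(input_data):
    # describe() ignores NaN values; transpose so each row is a variable
    output_data = input_data.describe().transpose()
    # Count missing (NaN) values per variable
    output_data['number_nan'] = input_data.isnull().sum()
    # Count distinct non-missing values per variable
    output_data['number_distinct'] = input_data.nunique(dropna=True)
    # Drop the leading 'count' column, which was not requested
    return output_data.iloc[:, 1:]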

3\. How long does it take for your `getDfSummary()` function to work on your `ads` data frame? Show us the results below.

Hint: use `%timeit`

In [8]:
# Place your code here
%timeit getDfSummary(ads)

59.7 ms ± 7.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


4\. (2 Points) Using the results returned from `getDfSummary()`, which fields, if any, contain missing `NaN` values?

In [9]:
# Place your code here
getDfSummary(ads)

| | mean | std | min | 25% | 50% | 75% | max | number_nan | number_distinct |
|---|---|---|---|---|---|---|---|---|---|
| isbuyer | 0.042632 | 0.202027 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 2 |
| buy_freq | 1.240653 | 0.782228 | 1.0 | 1.0 | 1.0 | 1.0 | 15.0 | 52257 | 10 |
| visit_freq | 1.852777 | 2.92182 | 0.0 | 1.0 | 1.0 | 2.0 | 84.0 | 0 | 64 |
| buy_interval | 0.210008 | 3.922016 | 0.0 | 0.0 | 0.0 | 0.0 | 174.625 | 0 | 295 |
| sv_interval | 5.82561 | 17.595442 | 0.0 | 0.0 | 0.0 | 0.104167 | 184.9167 | 0 | 5886 |
| expected_time_buy | -0.19804 | 4.997792 | -181.9238 | 0.0 | 0.0 | 0.0 | 84.28571 | 0 | 348 |
| expected_time_visit | -10.210786 | 31.879722 | -187.6156 | 0.0 | 0.0 | 0.0 | 91.40192 | 0 | 15135 |
| last_buy | 64.729335 | 53.476658 | 0.0 | 18.0 | 51.0 | 105.0 | 188.0 | 0 | 189 |
| last_visit | 64.729335 | 53.476658 | 0.0 | 18.0 | 51.0 | 105.0 | 188.0 | 0 | 189 |
| multiple_buy | 0.006357 | 0.079479 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 2 |


<font color='blue'>Question4 Answer: Among the fields summarized above, only buy_freq contains missing values (52,257 NaN entries).</font>
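
Rather than reading the table by eye, the fields with missing values can be filtered out directly. The sketch below uses a small stand-in frame, `toy_ads`, with made-up values instead of the real `ads` data.

```python
import numpy as np
import pandas as pd

# Small stand-in for `ads`; the values are invented for illustration.
toy_ads = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1],
    "buy_freq": [np.nan, 2.0, np.nan, 1.0],
    "visit_freq": [1, 3, 2, 5],
})

# Count NaN values per column and keep only the columns that have any.
nan_counts = toy_ads.isnull().sum()
fields_with_nan = nan_counts[nan_counts > 0]
print(fields_with_nan)
```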

5\. (4 Points) For the fields with missing values, does it look like the data is missing at random? Are there any other fields that correlate perfectly, or predict that the data is missing? If missing, what should the data value be? Don't just show code here. Please explain your answer.

Hint: create another data frame that has just the records with a missing value. Get a summary of this data frame using `getDfSummary()` and compare the differences. Do some feature distributions change dramatically?

In [10]:
# Place your code here
nan_ads = ads[ads['buy_freq'].isnull()]
getDfSummary(nan_ads)

| | mean | std | min | 25% | 50% | 75% | max | number_nan | number_distinct |
|---|---|---|---|---|---|---|---|---|---|
| isbuyer | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 |
| buy_freq | | | | | | | | 52257 | 0 |
| visit_freq | 1.651549 | 2.147955 | 1.0 | 1.0 | 1.0 | 2.0 | 84.0 | 0 | 48 |
| buy_interval | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 |
| sv_interval | 5.686388 | 17.623555 | 0.0 | 0.0 | 0.0 | 0.041667 | 184.9167 | 0 | 5112 |
| expected_time_buy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 |
| expected_time_visit | -9.669298 | 31.23903 | -187.6156 | 0.0 | 0.0 | 0.0 | 91.40192 | 0 | 13351 |
| last_buy | 65.741317 | 53.484622 | 0.0 | 19.0 | 52.0 | 106.0 | 188.0 | 0 | 189 |
| last_visit | 65.741317 | 53.484622 | 0.0 | 19.0 | 52.0 | 106.0 | 188.0 | 0 | 189 |
| multiple_buy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 |


<font color='blue'>Question5 answer: The values are not missing at random. In the records where buy_freq is missing, isbuyer, buy_interval, expected_time_buy, and multiple_buy are all 0, which indicates that these users made no purchase at all. The missing values should therefore be 0.</font>
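
A possible way to confirm this relationship and then impute is sketched below, again on a made-up stand-in frame `toy_ads`. Whether the check holds exactly on the real `ads` data would need to be run there.

```python
import numpy as np
import pandas as pd

# Small stand-in for `ads`; the values are invented for illustration.
toy_ads = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1],
    "buy_freq": [np.nan, 2.0, np.nan, 1.0],
})

# Does missingness in buy_freq coincide exactly with isbuyer == 0?
missing = toy_ads["buy_freq"].isnull()
perfectly_predicted = bool((missing == (toy_ads["isbuyer"] == 0)).all())
print(perfectly_predicted)

# If so, a missing purchase frequency means "no purchases", so fill with 0.
filled = toy_ads.assign(buy_freq=toy_ads["buy_freq"].fillna(0))
print(filled)
```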

6\. (4 Points) Which variables are binary?

<font color='blue'>Question6 answer: isbuyer, multiple_buy, multiple_visit, y_buy</font>
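
Binary variables can also be identified programmatically. The sketch below treats a column as binary when it takes exactly two distinct non-missing values, both in {0, 1}. The stand-in frame `toy_ads` and its values are made up.

```python
import pandas as pd

# Small stand-in for `ads`; the values are invented for illustration.
toy_ads = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1],
    "visit_freq": [1, 3, 2, 5],
    "multiple_buy": [0, 0, 1, 0],
})

# Keep columns whose non-missing values are exactly the two values 0 and 1.
binary_columns = [
    name
    for name in toy_ads.columns
    if toy_ads[name].nunique() == 2
    and set(toy_ads[name].dropna().unique()) <= {0, 1}
]
print(binary_columns)
```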