# Introduction to Data Science
## Homework 1: Due Midnight, March 4th. 1/3 of a Grade Deducted for each day late

Student Name: Ruojun Hong

Student Netid: rh2544
***

### Part 1: Case study
- Read [this article](http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html) in the New York Times.
- Use what we've learned in class and from the book to describe how one could set Target's problem up as a predictive modeling problem, such that they could have gotten the results that they did.  Formulate your solution as a proposed plan using our data science terminology.  Include all the aspects of the formulation that you see as relevant to solving the problem.  Be precise but concise.

Target's business goal is to increase profit, which requires attracting and retaining more customers. A particularly good time to win over a customer is her second trimester, since a shopper captured then may stay loyal for years. The business problem therefore translates into a more specific, unambiguous data science problem: identifying pregnant customers.
The second step is to collect and understand the data. Labeled raw data can be obtained from Target's baby shower registry, which records the buying behavior of customers known to be pregnant. Candidate features include the products bought, the pregnancy time, age, income, etc.
To start the machine learning, we first split the data into a training set, a cross-validation set and a test set. We build the predictive model on the training set, tune it on the cross-validation set, and evaluate the final model on the test set. This is a supervised learning problem in which the model outputs a "pregnancy prediction" score for each shopper. By applying this predictive model to new shoppers, we will eventually be able to identify potential mothers-to-be and send them promotional information and coupons.
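
The three-way split described above can be sketched as follows. This is only an illustration, not Target's actual pipeline: the function name `split_records`, the 60/20/20 proportions, and the fixed seed are all assumptions chosen for the example.

```python
import random


def split_records(records, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle records and split them into train, validation, and test lists.

    The fractions and the seed are illustrative assumptions.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything left over
    return train, val, test


# Hypothetical shopper IDs standing in for labeled registry records.
shoppers = list(range(100))
train, val, test = split_records(shoppers)
print(len(train), len(val), len(test))
```

Shuffling before splitting keeps the three sets comparable, and the test set is touched only once, after all tuning is done.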

### Part 2: Dealing with messy data
Not all data you will deal with is going to be clean. In fact, much of it will be very messy! For example, we have the HTML page that lists the contributors to Facebook's [osquery](https://github.com/facebook/osquery) project that is hosted on [Github.com](https://github.com). In this case, all we are interested in are the contributors and how many commits each of them has. Given the HTML page in `"data/osquery_contributors.html"` you will sift through tons of irrelevant data so that you can build a useful data structure.

Notice that the first six (out of 59 total) contributors are named "theopolis", "marpaia", "javuto", "jedi22", "unixist", and "mofarrell". They have 553, 477, 104, 49, 30, 25 commits respectively.

![Screenshot](images/osquery_contributors.png)

To get a better understanding of how this data is stored in the file, try searching through the raw data file for these usernames to look for any patterns. Your final dictionary should have 59 elements!

1\. Turn this data into a Python dictionary called `contributors` where the keys are the contributor names and the values are the number of commits that each contributor has.

In [1]:
import re # you might find this package useful
import os

contributors = dict()

# Place your code here
datadir = os.getcwd()+'/data/'
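# Raw strings keep the regex escapes (\w, \d) intact.
with open(datadir+"osquery_contributors.html", "r") as f:
    for line in f:
        author = re.search(r"author=(\w+)", line)
        commits = re.search(r"(\d+) commit", line)
        # Record a contributor only when both fields appear on the line.
        if author and commits:
            contributors[author.group(1)] = int(commits.group(1))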
        
# This line will print your dictionary for grading purposes. Do not remove this line!!!
print(contributors)

{'theopolis': 553, 'marpaia': 477, 'javuto': 104, 'jedi22': 49, 'unixist': 30, 'mofarrell': 25, 'sharvilshah': 23, 'lwhsu': 22, 'wxsBSD': 20, 'polachok': 14, 'zwass': 14, 'eastebry': 9, 'maus': 9, 'vmauge': 8, 'astanway': 6, 'maclennann': 6, 'blakefrantz': 6, 'akshaydixi': 5, 'arirubinstein': 4, 'cdown': 4, 'deniszh': 3, 'achmiel': 3, 'brandt': 3, 'nlsun': 3, 'mimeframe': 3, 'mgoffin': 2, 'mathieuk': 2, 'ga2arch': 2, 'glensc': 2, 'jreese': 2, 'Anubisss': 2, 'timzimmermann': 2, 'jamesgpearce': 2, 'schettino72': 2, 'castrapel': 2, 'mlw': 2, 'apage43': 1, 'SimplyAhmazing': 1, 'quad': 1, 'yannick': 1, 'blackfist': 1, 'DavidGosselin': 1, 'ecin': 1, 'arubdesu': 1, 'yetanotherhacker': 1, 'rjeczalik': 1, 'larzconwell': 1, 'justintime32': 1, 'alex': 1, 'vlajos': 1, 'dreid': 1, 'kost': 1, 'mtmcgrew': 1, 'tburgin': 1, 'mark': 1, 'shawndavenport': 1, 'jacknagz': 1, 'd0ugal': 1, 'stevenhilder': 1}
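
To make the pattern matching concrete, here is a small sketch that applies the same two regular expressions to a hand-written line. The sample string is made up to resemble a contributor entry and is not copied from `osquery_contributors.html`; the helper name `parse_contributor` is also hypothetical.

```python
import re

AUTHOR_RE = re.compile(r"author=(\w+)")
COMMITS_RE = re.compile(r"(\d+) commit")


def parse_contributor(line):
    """Return (name, commits) if the line holds both fields, else None."""
    author = AUTHOR_RE.search(line)
    commits = COMMITS_RE.search(line)
    if author and commits:
        return author.group(1), int(commits.group(1))
    return None


# Made-up line for illustration only.
sample = '<a href="/facebook/osquery/commits?author=theopolis">553 commits</a>'
print(parse_contributor(sample))
print(parse_contributor("<div>no contributor here</div>"))
```

One caveat: `\w+` stops at a hyphen, so a username containing one would be truncated by this pattern.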


### Part 3: Dealing with data Pythonically

In [2]:
# You might find these packages useful. You may import any others you want!
import pandas as pd
import numpy as np

1\. Load the data set `"data/ads_dataset.tsv"` into a Python Pandas data frame called `ads`.

In [3]:
# Place your code here
ads = pd.read_csv(datadir+"ads_dataset.tsv",header=0,sep='\t')

2\. Write a Python function called `getDfSummary()` that does the following:
- Takes as input a data frame
- For each variable in the data frame calculates the following features:
  - `number_nan` to count the number of missing not-a-number values
  - Ignoring missing, NA, and Null values:
    - `number_distinct` to count the number of distinct values a variable can take on
    - `mean`, `max`, `min`, `std` (standard deviation), and `25%`, `50%`, `75%` to correspond to the appropriate percentiles
- All of these new features should be loaded in a new data frame. Each row of the data frame should be a variable from the input data frame, and the columns should be the new summary features.
- Returns this new data frame containing all of the summary information

Hint: The pandas `describe()` [(manual page)](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) method returns a useful series of values that can be used here.

In [10]:
def getDfSummary(input_data):
    # Place your code here
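    # describe() skips NaN values and yields mean, std, min, max, and quartiles.
    output_data = input_data.describe().transpose()
    output_data["number_nan"] = input_data.isnull().sum()
    # nunique() also ignores NaN by default.
    output_data["number_distinct"] = input_data.apply(pd.Series.nunique)
    # The keyword form is required in recent pandas; drop("count", 1) no longer works there.
    output_data = output_data.drop(columns="count")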
    return output_data
getDfSummary(ads)

| | mean | std | min | 25% | 50% | 75% | max | number_nan | number_distinct |
|---|---|---|---|---|---|---|---|---|---|
| isbuyer | 0.042632 | 0.202027 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 2 |
| buy_freq | 1.240653 | 0.782228 | 1.0 | 1.0 | 1.0 | 1.0 | 15.0 | 52257 | 10 |
| visit_freq | 1.852777 | 2.92182 | 0.0 | 1.0 | 1.0 | 2.0 | 84.0 | 0 | 64 |
| buy_interval | 0.210008 | 3.922016 | 0.0 | 0.0 | 0.0 | 0.0 | 174.625 | 0 | 295 |
| sv_interval | 5.82561 | 17.595442 | 0.0 | 0.0 | 0.0 | 0.104167 | 184.9167 | 0 | 5886 |
| expected_time_buy | -0.19804 | 4.997792 | -181.9238 | 0.0 | 0.0 | 0.0 | 84.28571 | 0 | 348 |
| expected_time_visit | -10.210786 | 31.879722 | -187.6156 | 0.0 | 0.0 | 0.0 | 91.40192 | 0 | 15135 |
| last_buy | 64.729335 | 53.476658 | 0.0 | 18.0 | 51.0 | 105.0 | 188.0 | 0 | 189 |
| last_visit | 64.729335 | 53.476658 | 0.0 | 18.0 | 51.0 | 105.0 | 188.0 | 0 | 189 |
| multiple_buy | 0.006357 | 0.079479 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 2 |


3\. How long does it take for your `getDfSummary()` function to work on your `ads` data frame? Show us the results below.

Hint: `%timeit getDfSummary(ads)`

In [11]:
# Place your code here
%timeit getDfSummary(ads)

10 loops, best of 3: 77.8 ms per loop


4\. Using the results returned from `getDfSummary()`, which fields, if any, contain missing `NaN` values?

Answer: buy_freq

In [12]:
# Place your code here
summary_ads = getDfSummary(ads)
summary_ads[summary_ads.number_nan>0]

| | mean | std | min | 25% | 50% | 75% | max | number_nan | number_distinct |
|---|---|---|---|---|---|---|---|---|---|
| buy_freq | 1.240653 | 0.782228 | 1.0 | 1.0 | 1.0 | 1.0 | 15.0 | 52257 | 10 |


5\. For the fields with missing values, does it look like the data is missing at random? Are there any other fields that correlate perfectly, or predict that the data is missing? If missing, what should the data value be?

Hint: create another data frame that has just the records with a missing value. Get a summary of this data frame using `getDfSummary()` and compare the differences. Do some feature distributions change dramatically?

Answer: "buy_freq" is not missing at random. Comparing the two summaries shows that in the records where "buy_freq" is missing, "isbuyer", "buy_interval", "expected_time_buy" and "multiple_buy" each take a single value, 0. In other words, every record with a missing "buy_freq" belongs to a shopper with "isbuyer" equal to 0. Since the smallest observed "buy_freq" is 1, the missing values most plausibly stand for 0 purchases.

In [7]:
# Place your code here
missing_data=ads[ads.isnull().any(axis=1)]
getDfSummary(missing_data)

| | mean | std | min | 25% | 50% | 75% | max | number_nan | number_distinct |
|---|---|---|---|---|---|---|---|---|---|
| isbuyer | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 |
| buy_freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 52257 | 0 |
| visit_freq | 1.651549 | 2.147955 | 1.0 | 1.0 | 1.0 | 2.0 | 84.0 | 0 | 48 |
| buy_interval | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 |
| sv_interval | 5.686388 | 17.623555 | 0.0 | 0.0 | 0.0 | 0.041667 | 184.9167 | 0 | 5112 |
| expected_time_buy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 |
| expected_time_visit | -9.669298 | 31.23903 | -187.6156 | 0.0 | 0.0 | 0.0 | 91.40192 | 0 | 13351 |
| last_buy | 65.741317 | 53.484622 | 0.0 | 19.0 | 52.0 | 106.0 | 188.0 | 0 | 189 |
| last_visit | 65.741317 | 53.484622 | 0.0 | 19.0 | 52.0 | 106.0 | 188.0 | 0 | 189 |
| multiple_buy | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 |
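
One way to check whether missingness tracks another field is to compare a missing-value mask against that field directly. The sketch below uses a small made-up frame (not rows from `ads_dataset.tsv`); only the column names are borrowed from the data set.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1, 0],
    "buy_freq": [np.nan, 2.0, np.nan, 1.0, np.nan],
    "visit_freq": [1, 3, 2, 5, 1],
})

# Boolean mask marking the rows where buy_freq is missing.
is_missing = toy["buy_freq"].isnull()

# Which isbuyer values occur among the rows with a missing buy_freq?
values_when_missing = sorted(toy.loc[is_missing, "isbuyer"].unique().tolist())
print(values_when_missing)

# True when buy_freq is missing for exactly the rows with isbuyer == 0.
perfectly_aligned = bool((is_missing == (toy["isbuyer"] == 0)).all())
print(perfectly_aligned)
```

If the mask and the field line up row for row, the field predicts missingness perfectly, which is evidence against the values being missing at random.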


6\. Which variables are binary?

"isbuyer", "multiple_buy", "multiple_visit" and "y_buy" are binary because their "number_distinct" entry is 2.

In [17]:
# Place your code here
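# `min` and `max` must be read with brackets (attribute access returns the DataFrame
# methods), and each comparison needs parentheses because & binds tighter than ==.
summary_ads[(summary_ads["number_distinct"] == 2) & (summary_ads["min"] == 0) & (summary_ads["max"] == 1)]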
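
A mask that combines several comparisons needs each comparison wrapped in parentheses, because `&` binds more tightly than `==`. Columns named `min` and `max` also need bracket access, since attribute access returns the DataFrame methods of the same name. A minimal sketch on a toy summary frame follows; the values are made up, and only the column names mirror what `getDfSummary()` returns.

```python
import pandas as pd

# Toy summary table with made-up values, for illustration only.
toy_summary = pd.DataFrame(
    {
        "min": [0.0, 1.0, 0.0],
        "max": [1.0, 15.0, 84.0],
        "number_distinct": [2, 10, 64],
    },
    index=["isbuyer", "buy_freq", "visit_freq"],
)

# toy_summary.min is the DataFrame.min method, so use brackets to reach the column.
is_binary = (
    (toy_summary["number_distinct"] == 2)
    & (toy_summary["min"] == 0)
    & (toy_summary["max"] == 1)
)

binary_vars = toy_summary[is_binary].index.tolist()
print(binary_vars)
```

Checking `min` and `max` in addition to `number_distinct` rules out variables that take exactly two values other than 0 and 1.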