# Introduction to Data Science
## Homework 2

Student Name: Parth Patel 

Student Netid: pmp331
***

### Part 1: Case study
- Read [this article](http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html) in the New York Times.
- Use what we've learned in class and from the book to describe how one could set Target's problem up as a predictive modeling problem, such that they could have gotten the results that they did.  Formulate your solution as a proposed plan using our data science terminology.  Include all the aspects of the formulation that you see as relevant to solving the problem.  Be precise but concise.

Target, like almost every major retailer, from grocery chains to investment banks to the USPS, has a “predictive analytics” department devoted to understanding not just consumers’ shopping lists but also their personal habits, so as to market to them more efficiently.

Target wanted an answer to a simple question: “Is this customer pregnant, even if she doesn’t want us to know?” As the article notes, new parents are a retailer’s holy grail. If Target could identify pregnant shoppers, it could earn millions, which it eventually did. The business motivation behind the analysis was therefore to earn money by marketing baby products, and later other household products, to parents-to-be. Target wanted to reach this segment of the population before any of its competitors did.

The first step toward answering this kind of question is to build a dataset in which patterns can be found. The article makes clear that birth records are public and easy for companies to obtain, so they could serve as one data source. Surveys that Target conducts among female customers could be another. The main and most valuable source would be the customer data Target itself gathers: whether a customer uses a credit card or a coupon, mails in a refund, calls the customer help line, opens an e-mail Target sent, or visits its Web site.

After cleaning and merging the data from all these sources, Target could have selected just the subset needed to predict whether a customer is pregnant, for example the records of female customers only, with features such as the products they bought, their age, and their marital status. The training set could consist of female customers who have had babies, together with the products they bought during their pregnancy. The pregnancy period can be inferred because the baby's date of birth is available from public records.

The target variable is whether the customer is pregnant or not. I would recommend an SVM (Support Vector Machine) here: given a set of training data in which each example is marked as belonging to one of two categories (pregnant: Yes/No), an SVM training algorithm builds a model that assigns new examples to one category or the other, which makes it a non-probabilistic binary linear classifier.
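
The labelling of the training set described above can be sketched as follows. This is only an illustration under stated assumptions, not Target's actual method: the record layout, the `label_purchases` helper, the 40-week window, and the toy records are all hypothetical.

```python
from datetime import date, timedelta

# Assumption: treat the 40 weeks before a recorded birth as the pregnancy period.
PREGNANCY_WINDOW = timedelta(weeks=40)


def label_purchases(purchases, birth_dates):
    """Attach a 0/1 'pregnant' label to each purchase record.

    purchases   -- list of (customer_id, purchase_date, product) tuples
    birth_dates -- dict mapping customer_id to the baby's date of birth
    """
    labeled = []
    for customer_id, purchase_date, product in purchases:
        birth = birth_dates.get(customer_id)
        # Label 1 only if a birth is on record and the purchase falls in the window.
        pregnant = int(
            birth is not None
            and birth - PREGNANCY_WINDOW <= purchase_date < birth
        )
        labeled.append((customer_id, product, pregnant))
    return labeled


# Toy records, invented for the example.
purchases = [
    (1, date(2011, 3, 1), "unscented lotion"),
    (1, date(2009, 1, 15), "batteries"),
    (2, date(2011, 3, 1), "unscented lotion"),
]
birth_dates = {1: date(2011, 8, 20)}

training_rows = label_purchases(purchases, birth_dates)
```

Each labeled row could then be aggregated per customer into product-count features before a classifier such as the SVM is trained on it.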


# Grader  : 5.0/ 5.0

### Part 2: Exploring data in the command line
For this part we will be using the data file located in `"data/advertising_events.csv"`. This file consists of records that pertain to some online advertising events on a given day. There are 4 comma separated columns in this order: `userid`, `timestamp`, `domain`, and `action`. These fields are of type `int`, `int`, `string`, and `int` respectively. Answer the following questions using Linux/Unix bash commands. All questions can be answered in one line (sometimes, with pipes)! Some questions will have many possible solutions. Don't forget that in IPython notebooks you must prefix all bash commands with an exclamation point, i.e. `"!command arguments"`.

[Hints: You can experiment with whatever you want in the notebook and then delete things to construct your answer later.  You can also use ssh to use the actual bash shell on EC2 (see original directions) and then just paste your answers here. Recall that once you enter the "!" then filename completion should work.]

1\. How many records (lines) are in this file?

In [2]:
!wc -l advertising_events.csv

10341 advertising_events.csv


2. How many unique users are in this file? (hint: consider the 'cut' command and use pipe operator '|')

In [3]:
!cut -d, -f1 advertising_events.csv | sort | uniq | wc -l

732


3. Rank all domains by the number of visits they received in descending order. (hint: consider the 'cut', 'uniq' and 'sort' commands and the pipe operator).

In [6]:
!cut -d, -f3 advertising_events.csv | sort | uniq -c | sort -nr

   3114 google.com
   2092 facebook.com
   1036 youtube.com
   1034 yahoo.com
   1022 baidu.com
    513 wikipedia.org
    511 amazon.com
    382 qq.com
    321 twitter.com
    316 taobao.com
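
As a cross-check on the first three answers, the same counts can be computed in Python. This is a minimal sketch on a small in-memory sample that stands in for `advertising_events.csv`; the sample rows are made up for the example. Counts are compared as integers here, which corresponds to a numeric sort (`sort -nr`) in the shell.

```python
import csv
import io
from collections import Counter

# Toy stand-in for advertising_events.csv: userid,timestamp,domain,action
sample = io.StringIO(
    "37,648061658,google.com,0\n"
    "37,644493341,facebook.com,2\n"
    "12,640878012,google.com,1\n"
    "137,640361378,yahoo.com,1\n"
)
rows = list(csv.reader(sample))

# Equivalent of: wc -l
num_records = len(rows)

# Equivalent of: cut -d, -f1 | sort | uniq | wc -l
num_users = len({row[0] for row in rows})

# Equivalent of: cut -d, -f3 | sort | uniq -c | sort -nr
domain_rank = Counter(row[2] for row in rows).most_common()
```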


4. List all records for the user with user id 37. (hint: this can be done using 'grep')

In [8]:
!grep '^37,' advertising_events.csv

37,648061658,google.com,0
37,642479972,google.com,2
37,644493341,facebook.com,2
37,654941318,facebook.com,1
37,649979874,baidu.com,1
37,653061949,yahoo.com,1
37,655020469,google.com,3
37,640878012,amazon.com,0
37,659864136,youtube.com,1
37,640361378,yahoo.com,1
37,653862134,facebook.com,0
37,648828970,youtube.com,0
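
The `^` anchor and the trailing comma in the pattern matter: without them, `37` would also match other user ids, timestamps, and action values. A small sketch with Python's `re` module shows the difference. The sample lines are made up for the example.

```python
import re

lines = [
    "37,648061658,google.com,0",   # user 37: the only line we want
    "137,640361378,yahoo.com,1",   # user 137 contains "37"
    "12,640878012,google.com,37",  # "37" appears in the last field
    "5,637,amazon.com,0",          # "37" appears inside the second field
]

# Anchored pattern, as used with grep above
anchored = [line for line in lines if re.search(r"^37,", line)]

# Unanchored pattern matches far too much
unanchored = [line for line in lines if re.search(r"37", line)]
```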


### Part 3: Dealing with data Pythonically

In [9]:
# You might find these packages useful. You may import any others you want!
import pandas as pd
import numpy as np

1\. Load the data set `"data/ads_dataset.tsv"` into a Python Pandas data frame called `ads`.

In [11]:
#ads = pd.read_csv(r"C:\Users\parth\OneDrive\Documents\Jupyter Lab\ads_dataset.tsv", sep = "\t")
ads = pd.read_csv("data/ads_dataset.tsv", sep='\t')

2\. Write a Python function called `getDfSummary()` that does the following:
- Takes as input a data frame
- For each variable in the data frame calculates the following features:
  - `number_nan` to count the number of missing not-a-number values
  - Ignoring missing, NA, and Null values:
    - `number_distinct` to count the number of distinct values a variable can take on
    - `mean`, `max`, `min`, `std` (standard deviation), and `25%`, `50%`, `75%` to correspond to the appropriate percentiles
- All of these new features should be loaded in a new data frame. Each row of the data frame should be a variable from the input data frame, and the columns should be the new summary features.
- Returns this new data frame containing all of the summary information

Hint: The pandas `describe()` [(manual page)](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) method returns a useful series of values that can be used here.

In [12]:
def getDfSummary(input_data):
    # Initialize the lists which will store the computed data
    attr_name = []
    number_nan = []
    number_distinct = []
    mean = []
    max_number = []
    min_number = []
    std = []
    q25 = []
    q50 = []
    q75 = []

    # Loop through the columns of the input data frame (not the global `ads`)
    # and calculate the needed statistics; pandas skips NaN in these by default
    for column in input_data.columns:
        attr_name.append(column)
        number_nan.append(input_data[column].isnull().sum())
        # Note: unique() counts NaN as a value; nunique() would exclude it
        number_distinct.append(len(input_data[column].unique()))
        mean.append(input_data[column].mean())
        max_number.append(input_data[column].max())
        min_number.append(input_data[column].min())
        std.append(input_data[column].std())
        q25.append(input_data[column].quantile(0.25))
        q50.append(input_data[column].quantile(0.5))
        q75.append(input_data[column].quantile(0.75))

    # Create the output data frame and rearrange its columns in the order asked for in the question
    output_data = pd.DataFrame({"variable":attr_name,"number_nan":number_nan, "number_distinct":number_distinct, "mean":mean, "max":max_number, "min":min_number, "std":std, "25%":q25, "50%":q50, "75%":q75})
    output_data = output_data[['variable', 'number_nan', 'number_distinct', 'mean', 'max', 'min', 'std', '25%', '50%', '75%']]

    return output_data

#Call the method
getDfSummary(ads)

| | variable | number_nan | number_distinct | mean | max | min | std | 25% | 50% | 75% |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | isbuyer | 0 | 2 | 0.042632 | 1.0 | 0.0 | 0.202027 | 0 | 0 | 0.0 |
| 1 | buy_freq | 52257 | 11 | 1.240653 | 15.0 | 1.0 | 0.782228 | 1 | 1 | 1.0 |
| 2 | visit_freq | 0 | 64 | 1.852777 | 84.0 | 0.0 | 2.92182 | 1 | 1 | 2.0 |
| 3 | buy_interval | 0 | 295 | 0.210008 | 174.625 | 0.0 | 3.922016 | 0 | 0 | 0.0 |
| 4 | sv_interval | 0 | 5886 | 5.82561 | 184.9167 | 0.0 | 17.595442 | 0 | 0 | 0.104167 |
| 5 | expected_time_buy | 0 | 348 | -0.19804 | 84.28571 | -181.9238 | 4.997792 | 0 | 0 | 0.0 |
| 6 | expected_time_visit | 0 | 15135 | -10.210786 | 91.40192 | -187.6156 | 31.879722 | 0 | 0 | 0.0 |
| 7 | last_buy | 0 | 189 | 64.729335 | 188.0 | 0.0 | 53.476658 | 18 | 51 | 105.0 |
| 8 | last_visit | 0 | 189 | 64.729335 | 188.0 | 0.0 | 53.476658 | 18 | 51 | 105.0 |
| 9 | multiple_buy | 0 | 2 | 0.006357 | 1.0 | 0.0 | 0.079479 | 0 | 0 | 0.0 |

3. How long does it take for your `getDfSummary()` function to work on your `ads` data frame? Show us the results below.

Hint: `%timeit getDfSummary(ads)`

In [13]:
%timeit getDfSummary(ads)

1 loop, best of 3: 144 ms per loop


4. Using the results returned from `getDfSummary()`, which fields, if any, contain missing `NaN` values?

In [14]:
odf = getDfSummary(ads)

# Keep the names of the variables whose NaN count is positive
nan_fields = odf.loc[odf['number_nan'] > 0, 'variable'].tolist()

print('The field(s) with missing NaN values : ', nan_fields)


('The field(s) with missing NaN values : ', ['buy_freq'])


5\. For the fields with missing values, does it look like the data is missing at random? Are there any other fields that correlate perfectly, or predict that the data is missing? If missing, what should the data value be?

Hint: create another data frame that has just the records with a missing value. Get a summary of this data frame using `getDfSummary()` and compare the differences. Do some feature distributions change dramatically?

In [15]:
# ads_nan is another data frame that has just the records with a missing nan value
ads_nan = ads[ads.isnull().any(axis = 1)]
ads_nan_summary = getDfSummary(ads_nan)

From the summary of the data frame that contains only the records with a missing value, I think these are the records of customers who have never bought anything, i.e. window shoppers. So the data does not look like it is missing at random: `buy_freq` is missing for customers who are not buyers, which ties it to the `isbuyer` feature. In my view the missing values should be 0.
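
One way to check this claim is to compare the missingness of `buy_freq` against `isbuyer` directly. The sketch below uses a toy frame with the same two column names. The values are made up, so it only illustrates the check and says nothing about the real `ads` data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ads data frame.
toy = pd.DataFrame({
    "isbuyer": [0, 0, 1, 1, 0],
    "buy_freq": [np.nan, np.nan, 1.0, 3.0, np.nan],
})

missing = toy["buy_freq"].isnull()

# Share of missing buy_freq values within each isbuyer group
missing_rate = missing.groupby(toy["isbuyer"]).mean()

# True when buy_freq is missing exactly for the non-buyers
perfectly_predicted = bool((missing == (toy["isbuyer"] == 0)).all())

# Non-buyers have made no purchases, so 0 is a natural fill value
filled = toy["buy_freq"].fillna(0)
```

If `perfectly_predicted` came out `True` on the real data, `isbuyer` would fully determine which `buy_freq` values are missing.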

6\. Which variables are binary?

In [17]:
odf = getDfSummary(ads)

# A variable is treated as binary when it takes exactly two distinct values
binary_features = odf.loc[odf['number_distinct'] == 2, 'variable'].tolist()
print(binary_features)

['isbuyer', 'multiple_buy', 'multiple_visit', 'y_buy']
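
Counting distinct values is only a proxy for "binary": a column holding two arbitrary values, or one value plus NaN when NaN is counted as distinct, would also pass. A stricter check, sketched below under the assumption that "binary" means a 0/1 flag, compares the set of non-missing values with `{0, 1}`. The `binary_columns` helper and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def binary_columns(df):
    """Return the columns whose non-missing values are exactly {0, 1}."""
    result = []
    for column in df.columns:
        values = set(df[column].dropna().unique())
        if values == {0, 1}:
            result.append(column)
    return result


# Toy data, invented for the example.
toy = pd.DataFrame({
    "isbuyer": [0, 1, 0, 1],
    "rating": [1, 5, 5, 1],                  # two distinct values, but not a 0/1 flag
    "buy_freq": [np.nan, 1.0, np.nan, 1.0],  # one value plus NaN
})
binary = binary_columns(toy)
```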


# Part 3 : 15/16
# Total : 5+3.5 +15 = 23.5