*Material adapted from: http://stanford.edu/~mwaskom/software/seaborn/tutorial/distributions.html*

February, 2016

#EXPLORATORY DATA ANALYSIS

This notebook's main goal is to get acquainted with the data.

The choices of model and functions are explained throughout the notebook.

###IMPORT PACKAGES

Let's import packages used in this notebook:

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
#VISUALIZATION
%matplotlib inline
import matplotlib.pyplot as plt

###IMPORT DATA

First, we import the file "raw.csv", which contains the data provided for this project.

In [None]:
#Use pandas to import the csv
data = pd.read_csv('raw.csv', sep=';')
print(data.columns.values)

#Column 0 includes ad_ref. Column 1 refers to outcome (0= no click, 1=click)
data.loc[0:5]

The first column holds the reference of each record. Let's import again, using this first column as the row names.

In [None]:
#Import again, first column is row names
data = pd.read_csv('raw.csv', index_col = 0, sep=';')
data[0:5]

Let's look at the values of the first record:

In [None]:
data.loc[0]

There are several columns that contain dates. Let's check them:

In [None]:
data[["check_in","check_out","ts"]].loc[0]

In [None]:
print("Type of 'check_in': ")
print(type(data["check_in"].loc[0]))

So... they were imported as text. Let's try to import them as dates:

In [None]:
#Use pandas to import the csv, parsing the date columns
data = pd.read_csv('raw.csv', sep=';', parse_dates=["check_in","check_out","ts"])
print(data.columns.values)
data.loc[0:5]

The import looks fine. Let's check that the type is a date:

In [None]:
print("Type of 'check_in': ")
print(type(data["check_in"].loc[0]))

It's a success!
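To make the effect of `parse_dates` concrete, here is a minimal sketch on a tiny in-memory CSV. The values are made up and only mimic the `;`-separated layout of "raw.csv"; the real file is not read here.

```python
import io
import pandas as pd

# Tiny stand-in for raw.csv (made-up values, same ';' separator)
csv_text = "ad_ref;check_in;check_out\n0;2016-03-01;2016-03-04\n1;2016-03-10;2016-03-12\n"

# Without parse_dates the date columns come back as plain strings
as_text = pd.read_csv(io.StringIO(csv_text), sep=";")
# With parse_dates they come back as pandas Timestamps
as_dates = pd.read_csv(io.StringIO(csv_text), sep=";", parse_dates=["check_in", "check_out"])

text_is_str = isinstance(as_text["check_in"].loc[0], str)
date_is_timestamp = isinstance(as_dates["check_in"].loc[0], pd.Timestamp)

# Once parsed, date arithmetic works directly
nights = (as_dates["check_out"] - as_dates["check_in"]).dt.days.tolist()
print(text_is_str, date_is_timestamp, nights)
```

Once the columns are real dates, differences such as the length of the stay can be computed without any string handling.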

Now let's import the labels:

In [None]:
#ONLY ONE MODEL
label = pd.read_csv('outcome.csv', sep=';', names=["y"])
data["y"]=label["y"].loc[:]
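One thing worth keeping in mind when attaching the labels: assigning a Series to a DataFrame column matches rows by index label, not by position. The sketch below uses made-up frames (the index values 10, 11, 12 are hypothetical) to show what happens when the two indexes differ, for example if the data had been loaded with `index_col = 0`.

```python
import pandas as pd

features = pd.DataFrame({"adults": [2, 1, 2]}, index=[10, 11, 12])  # made-up index labels
labels = pd.Series([0, 1, 0], name="y")                            # default index 0, 1, 2

aligned = features.copy()
aligned["y"] = labels            # matched on index labels: none match, so every value is NaN

positional = features.copy()
positional["y"] = labels.values  # assigned by position, ignoring the index

n_missing = int(aligned["y"].isna().sum())
print(n_missing, positional["y"].tolist())
```

With the default integer index on both files, as in the cell above, the two approaches give the same result.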

Now let's play a little bit. First, shape of the dataframe:

In [None]:
print(data.shape)
print(type(data))

In [None]:
data.loc[0]

We have a pandas DataFrame with 681314 cases and 17 features + 1 label.

##Which features to check?

##Number of adults per booking
Let's begin by looking at the profile of the bookings:


In [None]:
#Let's plot the number of adults per booking

plt.hist(data["adults"])
plt.title("Number of adults")
plt.xlabel("Number of adults")
plt.ylabel("#cases")
plt.show()

It might be interesting to check how bookings behave depending on whether they are for a family or not.
Price: let's keep the price per night and per person (ppnp).

First, let's get a copy of the data:

In [None]:
#.copy() gives an independent frame, so adding columns does not touch the original data
data_2 = data[['adults','children','adv','stay','y']].copy()
data_2["ppnp"] = data['price']/((data['adults']+data['children'])*data['stay'])
data_2.loc[0]
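The ppnp ratio divides by (adults + children) * stay, so a booking with zero nights or zero people would produce an infinite value. Whether such rows exist in this dataset is not known; the sketch below, on made-up numbers, shows one possible way to turn them into missing values instead.

```python
import numpy as np
import pandas as pd

# Made-up bookings; the last one has a zero-night stay
toy = pd.DataFrame({
    "price": [300.0, 120.0, 80.0],
    "adults": [2, 1, 2],
    "children": [1, 0, 0],
    "stay": [2, 3, 0],
})

person_nights = (toy["adults"] + toy["children"]) * toy["stay"]
# Replace zero denominators with NaN so the ratio becomes NaN instead of inf
toy["ppnp"] = toy["price"] / person_nights.replace(0, np.nan)
print(toy["ppnp"].tolist())
```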

I want to check whether solo, couple, family and group bookings behave the same...
So, I will create a column where 1 = solo, 2 = couple, 3 = family, 4 = group (more than two adults, no children) and 0 = anything else.
I also want to check the total number of people.

In [None]:
data_2["is_family"] = data_2['children'] != 0.0
data_2["no_children"] = data_2['children'] == 0.0
data_2["is_couple"] = (data_2['adults'] == 2.0) & data_2["no_children"]
#Parentheses matter here: without them, 1.0*no_children is evaluated before ==
data_2["is_solo"] = (data_2['adults'] == 1.0) & data_2["no_children"]
data_2["is_group"] = (data_2['adults'] > 2.0) & data_2["no_children"]

#The four flags are mutually exclusive, so the weighted sum gives one code per booking
data_2["type"] = 4*data_2["is_group"] + 1*data_2["is_solo"] + 2*data_2["is_couple"] + 3*data_2["is_family"]
data_2["people"] = data_2["adults"] + data_2["children"]

#Let's plot the number of each kind

plt.hist(data_2["type"], bins=20)
plt.title("Bookings by type: 1=solo, 2= couple, 3= family, 4=group")
plt.xlabel("Type")
plt.ylabel("#cases")
plt.show()
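The flag arithmetic above can also be written with `np.select`, which picks the code of the first matching condition and falls back to a default. This is an alternative sketch on made-up rows, not the approach used in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up bookings: solo, couple, family, group, and one with no adults at all
toy = pd.DataFrame({"adults": [1, 2, 2, 4, 0], "children": [0, 0, 1, 0, 0]})

no_children = toy["children"] == 0
conditions = [
    (toy["adults"] == 1) & no_children,   # solo
    (toy["adults"] == 2) & no_children,   # couple
    toy["children"] != 0,                 # family
    (toy["adults"] > 2) & no_children,    # group
]
# Rows that match no condition get the default code 0
toy["type"] = np.select(conditions, [1, 2, 3, 4], default=0)
print(toy["type"].tolist())
```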

Let's drop the data we won't use:

In [None]:
del data_2["is_couple"]
del data_2["is_family"]
del data_2["is_solo"]
del data_2["is_group"]
del data_2["children"]
del data_2["adults"]
del data_2["no_children"]

data_2[0:5]

In [None]:
sns.stripplot(x="type", y="ppnp", data=data_2, jitter=True);

There seems to be a very pricey booking that prevents me from seeing the distribution. Let's exclude it:

In [None]:

price = data_2["ppnp"]
print(price.quantile(q=0.999))

In [None]:
price = data_2["ppnp"]
sns.stripplot(x="type", y="ppnp", data=data_2.loc[price < 1000], jitter=True)
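The cutoff of 1000 is typed in by hand after looking at the quantile. A possible variant is to derive the cutoff from the quantile directly, as in this sketch on made-up prices (the 0.75 quantile is chosen only so the toy example is easy to follow).

```python
import pandas as pd

prices = pd.Series([20.0, 35.0, 50.0, 65.0, 5000.0])  # made-up values with one extreme price

cutoff = prices.quantile(q=0.75)   # value below which 75% of the toy prices fall
kept = prices[prices <= cutoff]    # boolean mask keeps only the rows at or under the cutoff
print(cutoff, kept.tolist())
```

The same mask can be passed to `.loc` on the full dataframe, as done in the cell above with `price < 1000`.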

Let's see if it relates to click:

In [None]:
sns.stripplot(x="type", y="ppnp", data=data_2.loc[price < 1000], jitter=True, hue="y")
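A numeric summary can complement the plot. As a sketch on made-up rows, grouping by type and averaging the 0/1 outcome gives the share of clicks per booking type, together with the number of cases behind each share.

```python
import pandas as pd

# Made-up bookings: type codes as defined earlier, y = 1 for click and 0 for no click
toy = pd.DataFrame({
    "type": [1, 1, 2, 2, 2, 3],
    "y":    [0, 1, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the click rate; count shows how many rows support it
summary = toy.groupby("type")["y"].agg(["mean", "count"])
print(summary)
```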

Now let's see about distributions:


In [None]:
sns.violinplot(x="type", y="ppnp", hue="y", data=data_2.loc[price < 1000], split=True);

I think most of the data is under 200. Let's check:

In [None]:
print(price.quantile(q=0.95))

In [None]:
sns.violinplot(x="type", y="ppnp", hue="y", data=data_2.loc[price < 100], split=True);

Let's see if price and the previous days (adv) have an impact on clicks. First, all the data:

In [None]:
sns.jointplot(x="adv", y="ppnp", data=data_2);

Now we will separate click / no click:

In [None]:
click = data_2["y"]
print("CLICK=1")
sns.jointplot(x="adv", y="ppnp", data=data_2[click == 1]);

In [None]:
print("CLICK=0")
sns.jointplot(x="adv", y="ppnp", data=data_2[click == 0]);

In [None]:
data_3 = data_2[["adv","ppnp","y"]]
sns.pairplot(data_3, hue="y")
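Plots that draw one marker per row can take a while on several hundred thousand bookings. One possible shortcut, sketched here on random made-up data, is to plot a reproducible random subsample drawn with `DataFrame.sample`; the sample size of 1000 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the booking table
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "type": rng.randint(1, 5, size=10000),
    "ppnp": rng.exponential(scale=40.0, size=10000),
})

# Draw 1000 rows without replacement; random_state makes the draw repeatable
subset = toy.sample(n=1000, random_state=0)
print(subset.shape)
```

The subsample can then be passed as the `data` argument of the plotting calls in place of the full dataframe.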

In [None]:
#The outcome column is named "y" in data_2
sns.violinplot(x="type", y="adv", hue="y", data=data_2.loc[price < 100],
               split=True, inner="stick");

In [93]:
adv = data_2["adv"].loc[:]
print(adv.quantile(q=0.95))

57.0


In [None]:

sns.violinplot(x="type", y="adv", hue="y", data=data_2.loc[adv <= 57],
               split=True, inner="stick");

That's it for now!