In [None]:
import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

**Crisp Methodology**

![](https://www.kdnuggets.com/wp-content/uploads/crisp-dm-4-problems-fig1.png)
CRISP-DM is one of the widely accepted methodologies for data mining tasks. As the figure shows, the work can be grouped into three main stages that must be completed to deliver a product to the business:
*     Data cleaning
        1. Understand the business and the data.
        2. Extract the data that the business problem requires.
        3. Understand the dependencies between attributes, analyze the target variable, handle missing values, and transform the data into standard formats.
*     Data Modeling
        1. Understand the business and the data.
        2. Select the most suitable classifier or regression engine based on the characteristics of each candidate.
        3. Train a model.
*     Evaluation and Deployment
        1. Evaluate the trained model using evaluation methods (held-out test data, cross-validation, etc.).
        2. Carefully evaluate the model on real data (e.g. A/B testing). As the CRISP diagram shows, there is a link between business understanding and evaluation.
        3. Migrate to the new model, replacing the old version.


**imports**
*     Importing packages and libraries.

In [None]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
%matplotlib inline

**Reading Data**

Reading the data and taking a first look at what it contains.

In [None]:
train_data = pd.read_csv("../input/train.csv")
train_data.head()

In [None]:
train_data.describe()

In [None]:
list(train_data.columns.values)

**Features descriptions**:

Going back to the data description to understand the features:

*     channelGrouping - The channel via which the user came to the Store.
*     date - The date on which the user visited the Store.
*     device - The specifications for the device used to access the Store.
*     fullVisitorId - A unique identifier for each user of the Google Merchandise Store.
*     geoNetwork - This section contains information about the geography of the user.
*     sessionId - A unique identifier for this visit to the store.
*     socialEngagementType - Engagement type, either "Socially Engaged" or "Not Socially Engaged".
*     totals - This section contains aggregate values across the session.
*     trafficSource - This section contains information about the Traffic Source from which the session originated.
*     visitId - An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.
*     visitNumber - The session number for this user. If this is the first session, then this is set to 1.
*     visitStartTime - The timestamp (expressed as POSIX time).


**channelGrouping feature**

In [None]:
train_data.channelGrouping.value_counts().plot(kind="bar",title="channelGrouping distro",figsize=(8,8),rot=25,colormap='Paired')

**date and visitStartTime**

Two variables are related to time and can be used in time-dependent analyses, especially time series.

In [None]:
"date :{}, visitStartTime:{}".format(train_data.head(1).date[0],train_data.head(1).visitStartTime[0])

date is stored as a string and should be converted to the pandas datetime format.
visitStartTime is stored as a Unix epoch timestamp (in seconds) and should also be converted to the pandas datetime format.
The cell below applies both conversions and stores the results in the same columns.

In [None]:
train_data["date"] = pd.to_datetime(train_data["date"],format="%Y%m%d")
train_data["visitStartTime"] = pd.to_datetime(train_data["visitStartTime"],unit='s')

Checking the transformed features.

In [None]:
train_data.head(1)[["date","visitStartTime"]]

**device**

device is stored in JSON format, so its fields need to be extracted before they can be analyzed. The json library is used to deserialize the values.

In [None]:
list_of_devices = train_data.device.apply(json.loads).tolist()
keys = []
for devices_iter in list_of_devices:
    for list_element in list(devices_iter.keys()):
        if list_element not in keys:
            keys.append(list_element)

The keys present in the device attribute are listed below.
Features that are not useful for the rest of the process should now be ignored: a feature that is unrelated to the task, or that contains many "NaN" values, should be discarded.
We select ["browser","operatingSystem","deviceCategory","isMobile"] for the analysis. The remaining device features are ignored and will be removed.
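
As a rough illustration of this flattening-and-filtering step, here is a minimal sketch on made-up rows. The JSON values and the `language` key are hypothetical stand-ins, and `pd.json_normalize` is used as one possible alternative to building the DataFrame by hand; it is not necessarily how the rest of this notebook proceeds.

```python
import json

import pandas as pd

# Toy stand-in for the raw `device` column: one JSON string per session (values are made up).
raw_device = pd.Series([
    '{"browser": "Chrome", "operatingSystem": "Windows", "isMobile": false, "language": "unknown"}',
    '{"browser": "Safari", "operatingSystem": "iOS", "isMobile": true, "language": "unknown"}',
])

# Parse each JSON string into a dict, then expand the dicts into columns.
device_df = pd.json_normalize(raw_device.apply(json.loads).tolist())

# Columns that hold a single value carry no information and can be dropped.
constant_columns = [
    col for col in device_df.columns if device_df[col].nunique(dropna=False) <= 1
]
device_df = device_df.drop(columns=constant_columns)
```

On these toy rows only the hypothetical `language` column is constant, so it is the only one removed.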

In [None]:
"keys existed in device attribute are:{}".format(keys)

In [None]:
tmp_device_df = pd.DataFrame(train_data.device.apply(json.loads).tolist())[["browser","operatingSystem","deviceCategory","isMobile"]]

In [None]:
tmp_device_df.head()

In [None]:
tmp_device_df.describe()

In [None]:
fig, axes = plt.subplots(2,2,figsize=(15,15))
tmp_device_df["isMobile"].value_counts().plot(kind="bar",ax=axes[0][0],rot=25,legend="isMobile",color='tan')
tmp_device_df["browser"].value_counts().head(10).plot(kind="bar",ax=axes[0][1],rot=40,legend="browser",color='teal')
tmp_device_df["deviceCategory"].value_counts().head(10).plot(kind="bar",ax=axes[1][0],rot=25,legend="deviceCategory",color='lime')
tmp_device_df["operatingSystem"].value_counts().head(10).plot(kind="bar",ax=axes[1][1],rot=80,legend="operatingSystem",color='c')

**geoNetwork**

It is also stored as JSON, so it is handled in the same way as the previous feature (device).


In [None]:
tmp_geo_df = pd.DataFrame(train_data.geoNetwork.apply(json.loads).tolist())[["continent","subContinent","country","city"]]

In [None]:
tmp_geo_df.head()

In [None]:
tmp_geo_df.describe()

Analyzing the distribution of users across the five continents.

In [None]:
fig, axes = plt.subplots(3,2, figsize=(15,15))
tmp_geo_df["continent"].value_counts().plot(kind="bar",ax=axes[0][0],title="Global Distributions",rot=0,color="c")
tmp_geo_df[tmp_geo_df["continent"] == "Americas"]["subContinent"].value_counts().plot(kind="bar",ax=axes[1][0], title="America Distro",rot=0,color="tan")
tmp_geo_df[tmp_geo_df["continent"] == "Asia"]["subContinent"].value_counts().plot(kind="bar",ax=axes[0][1], title="Asia Distro",rot=0,color="r")
tmp_geo_df[tmp_geo_df["continent"] == "Europe"]["subContinent"].value_counts().plot(kind="bar",ax=axes[1][1],  title="Europe Distro",rot=0,color="lime")
tmp_geo_df[tmp_geo_df["continent"] == "Oceania"]["subContinent"].value_counts().plot(kind="bar",ax = axes[2][0], title="Oceania Distro",rot=0,color="teal")
tmp_geo_df[tmp_geo_df["continent"] == "Africa"]["subContinent"].value_counts().plot(kind="bar" , ax=axes[2][1], title="Africa Distro",rot=0,color="silver")

**socialEngagementType**

Describing this feature shows that it holds a single unique value, so its entropy is 0 and it should be dropped.
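
To make the entropy argument concrete, here is a small sketch that computes the Shannon entropy of a column on toy data. The helper name `shannon_entropy` and the example values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def shannon_entropy(values: pd.Series) -> float:
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    probabilities = values.value_counts(normalize=True, dropna=False)
    return float(-(probabilities * np.log2(probabilities)).sum())


# A column with one repeated value versus a column split evenly between two values.
constant_column = pd.Series(["Not Socially Engaged"] * 4)
mixed_column = pd.Series(["a", "a", "b", "b"])

constant_entropy = shannon_entropy(constant_column)
mixed_entropy = shannon_entropy(mixed_column)
```

A constant column has entropy 0 because its only value has probability 1, while the evenly split column carries one full bit.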

In [None]:
train_data["socialEngagementType"].describe()

**totals**


In [None]:
# Pull transactionRevenue out of the JSON-encoded totals column (NaN where the key is absent).
train_data["revenue"] = pd.DataFrame(train_data.totals.apply(json.loads).tolist())["transactionRevenue"]


Extracting the revenue of every session gives us an overview of the total revenue.
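
The sketch below shows the same idea on made-up rows, assuming (as the integer cast in the next cell suggests) that revenue is stored as a string and is absent for sessions without a purchase. The toy JSON values are hypothetical.

```python
import json

import pandas as pd

# Toy stand-in for the raw `totals` column; only the second session has revenue.
raw_totals = pd.Series([
    '{"hits": "1", "pageviews": "1"}',
    '{"hits": "12", "pageviews": "9", "transactionRevenue": "37860000"}',
    '{"hits": "3", "pageviews": "3"}',
])

# dict.get returns None when the key is absent, which pandas treats as missing.
revenue = raw_totals.apply(lambda s: json.loads(s).get("transactionRevenue"))
revenue = pd.to_numeric(revenue)  # strings become numbers, None becomes NaN

share_with_revenue = revenue.notna().mean()  # fraction of sessions with revenue
total_revenue = revenue.sum()                # NaN entries are skipped by default
```

Keeping the missing values as NaN, rather than dropping the rows immediately, makes it easy to report how many sessions produced any revenue at all.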

In [None]:
revenue_datetime_df = train_data[["revenue" , "date"]].dropna()
revenue_datetime_df["revenue"] = revenue_datetime_df.revenue.astype(np.int64)
revenue_datetime_df.head()

Aggregating by day and plotting the daily revenue.

In [None]:
daily_revenue_df = revenue_datetime_df.groupby(by=["date"]).sum()
fig, axes = plt.subplots(figsize=(20,10))
axes.set_title("Daily Revenue")
axes.set_ylabel("Revenue")
axes.set_xlabel("date")
axes.plot(daily_revenue_df["revenue"])


In [None]:
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(9, 9))
axes.set_title("Daily revenue Violin")
axes.set_ylabel("revenue")
axes.violinplot(list(daily_revenue_df["revenue"].values),showmeans=False,showmedians=True)

**visitNumber**

The number of visits has strong potential to be an important factor in the regression.

In [None]:
# Work on a copy so the cast below does not trigger a SettingWithCopyWarning.
visit_datetime_df = train_data[["date","visitNumber"]].copy()
visit_datetime_df["visitNumber"] = visit_datetime_df.visitNumber.astype(np.int64)

In [None]:
daily_visit_df = visit_datetime_df.groupby(by=["date"]).sum()

fig, axes = plt.subplots(1,1,figsize=(20,10))
axes.set_ylabel("# of visits")
axes.set_xlabel("date")
axes.set_title("Daily Visits")
axes.plot(daily_visit_df["visitNumber"])
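
One caveat worth keeping in mind when reading this chart: visitNumber is a per-user session index, so summing it per day adds up those indices rather than counting sessions. A minimal sketch of the difference on made-up rows (the column names mirror the dataset, the values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-09-02", "2016-09-02", "2016-09-03"]),
    "visitNumber": [1, 5, 2],
})

# Summing adds the session indices of all sessions on that day.
summed_visit_numbers = toy.groupby("date")["visitNumber"].sum()

# Counting rows gives the number of sessions recorded on that day.
sessions_per_day = toy.groupby("date").size()
```

If the goal is the number of daily visits, the row count is probably the quantity of interest; the sum is dominated by a few heavy repeat visitors.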

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(9, 9))
axes.set_title("Daily visits Violin")
axes.set_ylabel("# of visitors")
axes.violinplot(list(daily_visit_df["visitNumber"].values),showmeans=False,showmedians=True)

Now let's look at another side of the 'visitNumber' feature. As mentioned in the data description, visitNumber is the session number for each user, so it can also indicate a user's interest. Let's 'describe' and 'visualize' it.
Using the 'collections' package, we can count how often each value occurs.

In [None]:
train_data.visitNumber.describe()

Per the summary above, 75% of sessions have a visitNumber of at most 1, i.e. they are first visits. You can get more information about percentiles by calling the np.percentile method.

In [None]:
"90 percent of sessions have visitNumber lower than {} times.".format(np.percentile(list(train_data.visitNumber),90))

Let's find the most common and least common visitNumbers to get familiar with the collections module and its powerful tools ;-)

In [None]:
import collections

# Count how often each visitNumber occurs (computed once and reused).
visit_number_counts = collections.Counter(train_data.visitNumber)
# most_common() sorts by descending frequency; the reversed slice takes the 10 rarest values.
least_visitNumbers = [value for value, _ in visit_number_counts.most_common()[:-10-1:-1]]
most_visitNumbers = [value for value, _ in visit_number_counts.most_common(10)]
"10 most_common visitNumbers are {} times and 10 least_common visitNumbers are {} times".format(most_visitNumbers,least_visitNumbers)

It is clear that the dispersion of 'visitNumber' per session is huge. For features like this, we can apply a logarithm to map the values onto a much smaller range.
As a result of this mapping, the data becomes easier to visualize.
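
A small sketch of how strongly the logarithm compresses a heavy-tailed feature, using made-up values rather than the real column:

```python
import numpy as np

# Made-up heavy-tailed values: mostly first visits plus a few very active users.
visit_numbers = np.array([1, 1, 1, 2, 3, 10, 100, 395])

log_values = np.log(visit_numbers)

raw_range = visit_numbers.max() - visit_numbers.min()  # spread before the transform
log_range = log_values.max() - log_values.min()        # spread after the transform
```

Because visitNumber starts at 1, the plain logarithm is defined everywhere; for features that can be 0, `np.log1p` is a common alternative.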

In [None]:
fig,ax = plt.subplots(1,1,figsize=(9,5))
ax.set_title("Histogram of log(visitNumbers) \n don't forget it is per session")
ax.set_ylabel("Repetition")
ax.set_xlabel("Log(visitNumber)")
ax.grid(color='b', linestyle='-', linewidth=0.1)
ax.hist(np.log(train_data.visitNumber))

**trafficSource**

How do visitors most commonly reach the website to do their shopping? The trafficSource attribute can answer this question.
Like the previous JSON elements in the dataset, this attribute is also stored as JSON, so we deserialize it in the same way. We select keyword, source and medium as the features likely to carry the most useful information.


In [None]:
traffic_source_df = pd.DataFrame(train_data.trafficSource.apply(json.loads).tolist())[["keyword","medium" , "source"]]

In [None]:
fig,axes = plt.subplots(1,2,figsize=(15,10))
traffic_source_df["medium"].value_counts().plot(kind="bar",ax = axes[0],title="Medium",rot=0,color="tan")
traffic_source_df["source"].value_counts().head(10).plot(kind="bar",ax=axes[1],title="source",rot=75,color="teal")

As the source chart clearly shows, google is the most frequent source. It would be interesting to replace all google subdomains with plain 'google' and repeat the analysis. Let's do it.

In [None]:
# na=False keeps rows with a missing source out of the mask instead of raising an error.
traffic_source_df.loc[traffic_source_df["source"].str.contains("google", na=False) ,"source"] = "google"
fig,axes = plt.subplots(1,1,figsize=(8,8))
traffic_source_df["source"].value_counts().head(15).plot(kind="bar",ax=axes,title="source",rot=75,color="teal")

Google-related redirects are more than twice as frequent as youtube sources. Combining this feature with revenue and visits may yield important results; we will do that in the next step (when analyzing feature correlations).
Now let's move on to the keyword feature.
A glance at the keyword feature reveals many missing values marked '(not provided)'. We draw one bar chart with them included and one with them excluded.
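
Before plotting, the share of placeholder values can be quantified. The sketch below treats '(not provided)' as missing on a toy Series; the keyword values are made up for illustration.

```python
import pandas as pd

# Toy keyword column: placeholders, real keywords, and one genuinely absent value.
keywords = pd.Series(
    ["(not provided)", "google merchandise store", None, "(not provided)", "youtube"]
)

# Turn the placeholder into a proper missing value.
cleaned = keywords.mask(keywords == "(not provided)")

missing_share = cleaned.isna().mean()             # fraction of sessions without a keyword
top_keywords = cleaned.value_counts().head(10)    # value_counts skips missing values
```

Converting the placeholder to NaN lets the usual pandas tools (`isna`, `dropna`, `value_counts`) handle both kinds of missing keyword uniformly.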


In [None]:
fig,axes = plt.subplots(1,2,figsize=(15,10))
traffic_source_df["keyword"].value_counts().head(10).plot(kind="bar",ax=axes[0], title="keywords (total)",color="orange")
traffic_source_df[traffic_source_df["keyword"] != "(not provided)"]["keyword"].value_counts().head(15).plot(kind="bar",ax=axes[1],title="keywords (dropping NA)",color="c")

**fullVisitorId**

Now let's see how many users are repeat visitors. This feature can provide important information for answering the question: do more repeat visits lead to more purchases?
The answer will be discussed in the next section (where we analyze compound features); for now, let's calculate the percentiles of visits per user.

In [None]:
# Number of sessions recorded for each visitor, sorted ascending.
repetitive_users = sorted(collections.Counter(train_data["fullVisitorId"]).values())
"25th percentile: {}, 50th percentile: {}, 75th percentile: {}, 88th percentile: {}, 89th percentile: {}".format(
    *np.percentile(repetitive_users, q=[25, 50, 75, 88, 89]))

As shown, only about 12 percent of users are repeat visitors who came to the website more than once.
(Look up churn rate and conversion rate if you want to know why we analyzed this feature ;-) )
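
The share of repeat visitors can also be computed directly instead of being read off the percentiles. A minimal sketch on made-up visitor IDs, with one entry per session:

```python
import pandas as pd

# Toy stand-in for the fullVisitorId column: visitor "a" has three sessions.
visitor_ids = pd.Series(["a", "b", "a", "c", "d", "a", "e"])

# Number of sessions per distinct visitor.
sessions_per_user = visitor_ids.value_counts()

# Fraction of visitors with more than one session.
repeat_share = (sessions_per_user > 1).mean()
```

On the real data the same two lines, applied to `train_data["fullVisitorId"]`, should give a figure comparable to the percentile-based estimate above.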
