Business Goal: The challenge of this competition is to predict the potential business value of a person who has performed a specific activity.

Objective: Create a classification algorithm that accurately identifies which customers have the most potential business value for Red Hat based on their characteristics and activities.

The predicted outcome is either zero (not valuable) or one (valuable), so this is a binary classification problem. Most of the features are anonymized as chars (characteristics), but we’ll try to see how they relate to the outcome.

Classification performance is measured by AUC (area under the ROC curve).
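
For reference, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below implements that definition directly in plain Python; the function name `pairwise_auc` and the toy labels and scores are illustrative assumptions, not part of the competition code.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(n_pos * n_neg), so it is
    only meant for small illustrative inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

On the full data set a library routine such as scikit-learn's `roc_auc_score` would be the practical choice; the loop above is only meant to make the metric concrete.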

Initial Hypotheses:

1. There are some activities which bring a higher business value than other activities.
1. During certain times of the year, the chances of deriving business value from customers are higher.
1. Some groups of people allow for higher business value.
1. Characteristics of people and activities are indicative of business value.
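
A simple way to probe hypotheses like these is to compare the share of valuable outcomes across groups. The sketch below does this on a small made-up frame; the column names `activity_category` and `outcome` follow the competition data, but every value is invented for illustration.

```python
import pandas as pd

# Toy stand-in for the merged training data (values are invented).
toy = pd.DataFrame({
    "activity_category": ["type 1", "type 1", "type 2", "type 2", "type 2", "type 3"],
    "outcome": [1, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 outcome is the share of valuable rows in each group.
rate_by_category = (
    toy.groupby("activity_category")["outcome"]
       .agg(share_valuable="mean", n_rows="size")
       .sort_values("share_valuable", ascending=False)
)
print(rate_by_category)
```

The same pattern applies to any grouping key, for example a month or weekday column derived from the activity date.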

Data Extraction

This competition uses two separate data files that may be joined together to create a single, unified data table: a people file and an activity file.

- People.csv: Each row in the people file represents a unique person. Each person has a unique people_id. The file contains the characteristics of people.

- Activity.csv: The activity file contains all of the unique activities (and the corresponding activity characteristics) that each person has performed over time. Each row in the activity file represents a unique activity performed by a person on a certain date. Each activity has a unique activity_id.

The activity file contains several different categories of activities. Type 1 activities are different from type 2-7 activities because there are more known characteristics associated with type 1 activities (nine in total) than type 2-7 activities (which have only one associated characteristic).
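
Because both files carry a column named `date`, pandas appends the suffixes `_x` and `_y` to the overlapping names when the files are joined on `people_id`. The toy frames below are invented to show that behaviour; only the column names mirror the competition files, and the `validate` argument is an optional safeguard added here, not something the original analysis uses.

```python
import pandas as pd

# Invented stand-ins for the people and activity files.
people = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_2"],
    "date": pd.to_datetime(["2021-01-01", "2021-06-01"]),
})
activities = pd.DataFrame({
    "activity_id": ["act_1", "act_2", "act_3"],
    "people_id": ["ppl_1", "ppl_1", "ppl_2"],
    "date": pd.to_datetime(["2022-03-01", "2022-03-05", "2022-08-10"]),
})

# Overlapping columns from the left frame get "_x", from the right frame "_y".
# validate="many_to_one" raises if people_id is not unique in `people`.
merged = pd.merge(activities, people, on="people_id", validate="many_to_one")
print(merged.columns.tolist())
```

With the activity frame on the left, `date_x` is the activity date and `date_y` is the date from the people file.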

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
activities_train = pd.read_csv('../input/predicting-red-hat-business-value/act_train.csv.zip', parse_dates=['date'])
activities_test = pd.read_csv('../input/predicting-red-hat-business-value/act_test.csv.zip', parse_dates=['date'])
ppl = pd.read_csv('../input/predicting-red-hat-business-value/people.csv.zip', parse_dates=['date'])

df_train = pd.merge(activities_train, ppl, on='people_id')
df_test = pd.merge(activities_test, ppl, on='people_id')
del activities_train, activities_test, ppl

In [None]:
print(df_test.shape)
print(df_train.shape)
print(df_train.shape[0]/df_test.shape[0])

In [None]:
# A notebook cell only renders its last expression, so display both explicitly.
display(df_train.head(5))
display(df_test.head(5))

In [None]:
# `null_counts` was deprecated and later removed from pandas; `show_counts` replaces it.
df_train.info(show_counts=True)


Data Analysis


There are 189k potential customers and 2.1M customer activities in the training set.
The test set contains 498k customer activities, so about 18.5% of all activities are in the test set.

Potential typos/mistakes in mixed-type fields:
people: ppl_group, ppl_char_1 to ppl_char_9
activity: act_category, act_char_10
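
One way to check a field for mixed types is to count the Python types of its values. The sketch below uses an invented column; the name `mixed_col` and its contents are assumptions for illustration and do not come from the competition data.

```python
import pandas as pd

# Invented object column holding strings, an int and a float.
toy = pd.DataFrame({"mixed_col": ["type 1", 17, "type 3", 2.5, "type 1"]})

# Count how many values of each Python type the column holds.
type_counts = toy["mixed_col"].map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

A column whose values all share one type yields a single entry here; more than one entry is a hint to inspect the field more closely.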

Missing data
ppl: No missing data detected by pandas.
activities: act_char_1 to act_char_9 all have the same number of missing values. This agrees with the documentation, since these are the nine characteristics available only for type 1 activities. However, act_char_10 also has missing values. Why?
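
The question about act_char_10 can be approached by computing the share of missing values per activity category. The sketch below runs on a small invented frame; the column names `char_1` and `char_10` are assumed stand-ins for the activity characteristics, and the toy values say nothing about the real data.

```python
import pandas as pd

# Invented activity rows; None marks a missing characteristic.
toy_act = pd.DataFrame({
    "activity_category": ["type 1", "type 1", "type 2", "type 3"],
    "char_1": ["type 5", "type 2", None, None],
    "char_10": [None, None, "type 76", "type 1"],
})

# Share of missing values per column within each activity category.
missing_share = (
    toy_act.drop(columns="activity_category")
           .isna()
           .groupby(toy_act["activity_category"])
           .mean()
)
print(missing_share)
```

If the real data behaved like this toy frame, the missing `char_10` values would simply be the type 1 rows; that is a guess to verify against the actual files, not a finding.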

Critical questions

people characteristics: char_1 to char_38. What does group_1 mean? date could be the date of first contact with the person.
The date fields contain timestamps from future dates! Why?


In [None]:
df_train.sample(10, random_state=42)

In [None]:
df_num = df_train.select_dtypes(include = ['float64', 'int64'])
df_num.head()

df_num.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8);

In [None]:
df_train['activity_category'] = df_train['activity_category'].astype('category')
df_test['activity_category'] = df_test['activity_category'].astype('category')


Univariate Analysis

activity_outcome: surprisingly fairly balanced classes.

date: activities are spread fairly evenly over time, spanning roughly 1 year and 1 month. There are extreme values, with a maximum of 48174 activities in a single day.

people_id: users. For a significant share of the people in people.csv (20%), we do not have any activities recorded. Can we discard those?

In [None]:
df_train['outcome'].astype('bool').value_counts(normalize=True)

In [None]:
# Each row of df_train is an activity, so the bars show the share of activities per outcome.
ax = df_train['outcome'].astype('bool').value_counts(normalize=True).mul(100).plot(kind='bar')
ax.set_xlabel('business value')
ax.set_ylabel('% of activities')
plt.xticks(rotation=0)

How are the activities distributed over time?

In [None]:
for d in ['date_x', 'date_y']:
    print('Start of ' + d + ': ' + str(df_train[d].min().date()))
    print('  End of ' + d + ': ' + str(df_train[d].max().date()))
    print('Range of ' + d + ': ' + str(df_train[d].max() - df_train[d].min()) + '\n')

In [None]:
activities_per_day = df_train.groupby([pd.Grouper(key='date_x', freq='1D')])['activity_id'].count().reset_index()

In [None]:
fig, ax = plt.subplots(figsize=(16,4))
sns.lineplot(data=activities_per_day, x='date_x', y='activity_id', ax=ax)
ax.set_ylabel('# of activities'); ax.set_xlabel('date')

Days with the most and the least activity

In [None]:
pd.concat([activities_per_day[activities_per_day['activity_id'] == activities_per_day['activity_id'].max()],
activities_per_day[activities_per_day['activity_id'] == activities_per_day['activity_id'].min()]],axis=0)

Users: 189118 in people.csv but only 151295 in the training activities
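
The gap between the two counts can be measured directly by checking which `people_id` values never appear in the activity data. A minimal sketch on invented toy frames (the IDs and counts are made up):

```python
import pandas as pd

# Invented stand-ins for the people and activity files.
people = pd.DataFrame({"people_id": ["ppl_1", "ppl_2", "ppl_3", "ppl_4", "ppl_5"]})
activities = pd.DataFrame({
    "activity_id": ["act_1", "act_2", "act_3", "act_4"],
    "people_id": ["ppl_1", "ppl_1", "ppl_2", "ppl_4"],
})

# True for every person with at least one recorded activity.
has_activity = people["people_id"].isin(activities["people_id"])
n_without = int((~has_activity).sum())
share_without = n_without / len(people)
print(n_without, share_without)
```

People who are missing from the training activities may still appear in the test activities, so it is worth checking both files before discarding anyone.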

In [None]:
# Derive calendar features from the activity date (date_x) with vectorized accessors.
df_train['weekday'] = df_train['date_x'].dt.day_name()
df_train['monthday'] = df_train['date_x'].dt.day
df_train['month'] = df_train['date_x'].dt.month

In [None]:
df_train['weekday'].value_counts(normalize=True).plot(kind='bar')

In [None]:
fig , ax = plt.subplots(figsize=(18,4))
df_train['monthday'].value_counts(normalize=True).sort_index().plot(kind='bar', ax=ax)
ax.set_ylabel('fraction of activities'); ax.set_xlabel('day of month')