The goal is to predict the potential business value of a person who has performed a specific activity. Business value is recorded as a yes/no outcome field attached to each unique activity in the activity file. The field indicates whether the person completed the outcome within a fixed window of time after performing that activity.

In [1]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import re

In [2]:
print("Read people.csv...")
people = pd.read_csv(r'people.csv',
                     dtype={'people_id': str,
                            'char_38': np.int32},
                     parse_dates=['date'])

Read people.csv...


In [3]:
print("Load train.csv...")
train = pd.read_csv(r'act_train.csv',
                    dtype={'people_id': str,
                           'activity_id': str,
                           'outcome': np.int8},
                    parse_dates=['date'])

Load train.csv...


In [4]:
print("Load test.csv...")
test = pd.read_csv(r'act_test.csv',
                   dtype={'people_id': str,
                          'activity_id': str},
                   parse_dates=['date'])

Load test.csv...


The activity file contains all of the unique activities (and the corresponding activity characteristics) that each person has performed over time. Each row in the activity file represents a unique activity performed by a person on a certain date. Each activity has a unique activity_id.
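
The structure described above (one row per activity, possibly many rows per person) can be checked directly. The following is a minimal sketch on a hypothetical three-row table; the pairing of IDs and the outcome values are invented for illustration and are not taken from act_train.csv.

```python
import pandas as pd

# Hypothetical miniature activity table; the real one is loaded from act_train.csv.
activities = pd.DataFrame({
    "people_id": ["ppl_100", "ppl_100", "ppl_100002"],
    "activity_id": ["act2_1734928", "act2_2434093", "act2_3404049"],
    "outcome": [0, 0, 1],
})

# One row per activity: activity_id must not repeat.
ids_are_unique = activities["activity_id"].is_unique

# A person may appear on several rows, one per activity performed.
rows_per_person = activities.groupby("people_id").size()

print(ids_are_unique)
print(rows_per_person)
```

On the real data, the same `is_unique` check could be run on `train['activity_id']` right after loading.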

In [5]:
print ("\n\n---------------------")
print ("TRAIN SET INFORMATION")
print ("---------------------")
print ("Shape of training set:", train.shape, "\n")
print ("Column Headers:", list(train.columns.values), "\n")
print (train.dtypes)



---------------------
TRAIN SET INFORMATION
---------------------
Shape of training set: (2197291, 15) 

Column Headers: ['people_id', 'activity_id', 'date', 'activity_category', 'char_1', 'char_2', 'char_3', 'char_4', 'char_5', 'char_6', 'char_7', 'char_8', 'char_9', 'char_10', 'outcome'] 

people_id                    object
activity_id                  object
date                 datetime64[ns]
activity_category            object
char_1                       object
char_2                       object
char_3                       object
char_4                       object
char_5                       object
char_6                       object
char_7                       object
char_8                       object
char_9                       object
char_10                      object
outcome                        int8
dtype: object


In [6]:
missing_values = []
nonumeric_values = []

print ("TRAINING SET INFORMATION")
print ("========================\n")

for column in train:
    # Find all the unique feature values
    uniq = train[column].unique()
    print ("'{}' has {} unique values" .format(column,uniq.size))
    if (uniq.size > 10):
        print("~~Listing up to 10 unique values~~")
    print (uniq[0:10])
    print ("\n-----------------------------------------------------------------------\n")
    
    # Find features with missing values
    if pd.isnull(uniq).any():
        s = "{} has {} missing".format(column, pd.isnull(train[column]).sum())
        missing_values.append(s)

    # Find features with non-numeric values:
    # skip missing entries and flag the column at the first non-numeric value
    for value in uniq:
        if pd.isnull(value):
            continue
        if not re.search(r'(^\d+\.?\d*$)|(^\d*\.?\d+$)', str(value)):
            nonumeric_values.append(column)
            break
  
print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
print ("Features with missing values:\n{}\n\n" .format(missing_values))
print ("Features with non-numeric values:\n{}" .format(nonumeric_values))
print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")

TRAINING SET INFORMATION

'people_id' has 151295 unique values
~~Listing up to 10 unique values~~
['ppl_100' 'ppl_100002' 'ppl_100003' 'ppl_100006' 'ppl_100013' 'ppl_100019'
 'ppl_100025' 'ppl_100028' 'ppl_100029' 'ppl_100032']

-----------------------------------------------------------------------

'activity_id' has 2197291 unique values
~~Listing up to 10 unique values~~
['act2_1734928' 'act2_2434093' 'act2_3404049' 'act2_3651215' 'act2_4109017'
 'act2_898576' 'act2_1233489' 'act2_1623405' 'act2_1111598' 'act2_1177453']

-----------------------------------------------------------------------

'date' has 411 unique values
~~Listing up to 10 unique values~~
['2023-08-26T00:00:00.000000000' '2022-09-27T00:00:00.000000000'
 '2023-08-04T00:00:00.000000000' '2022-11-23T00:00:00.000000000'
 '2023-02-07T00:00:00.000000000' '2023-06-28T00:00:00.000000000'
 '2022-08-10T00:00:00.000000000' '2023-03-02T00:00:00.000000000'
 '2022-09-13T00:00:00.000000000' '2023-02-10T00:00:00.000000000']

------
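
The per-column inspection above can also be approximated with vectorized pandas calls. This is a sketch on a hypothetical three-row frame, not the author's method. It classifies columns by dtype rather than by matching each value against a regular expression, so the two approaches can disagree on columns that store digits as strings.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for `train`; values are invented.
df = pd.DataFrame({
    "activity_category": ["type 4", "type 2", "type 2"],
    "char_1": [np.nan, "type 3", np.nan],
    "outcome": np.array([0, 1, 0], dtype=np.int8),
})

# Count missing entries per column, keeping only columns that have any.
missing = df.isnull().sum()
missing = missing[missing > 0]

# Columns whose dtype is not numeric (strings, dates, and so on).
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

print(missing)
print(non_numeric)
```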

The people file contains all of the unique people who have performed activities over time, together with their characteristics. Each row in the people file represents a unique person.

In [7]:
print ("\n\n---------------------")
print ("PEOPLE SET INFORMATION")
print ("---------------------")
print ("Shape of people set:", people.shape, "\n")
print ("Column Headers:", list(people.columns.values), "\n")
print (people.dtypes)



---------------------
PEOPLE SET INFORMATION
---------------------
Shape of people set: (189118, 41) 

Column Headers: ['people_id', 'char_1', 'group_1', 'char_2', 'date', 'char_3', 'char_4', 'char_5', 'char_6', 'char_7', 'char_8', 'char_9', 'char_10', 'char_11', 'char_12', 'char_13', 'char_14', 'char_15', 'char_16', 'char_17', 'char_18', 'char_19', 'char_20', 'char_21', 'char_22', 'char_23', 'char_24', 'char_25', 'char_26', 'char_27', 'char_28', 'char_29', 'char_30', 'char_31', 'char_32', 'char_33', 'char_34', 'char_35', 'char_36', 'char_37', 'char_38'] 

people_id            object
char_1               object
group_1              object
char_2               object
date         datetime64[ns]
char_3               object
char_4               object
char_5               object
char_6               object
char_7               object
char_8               object
char_9               object
char_10                bool
char_11                bool
char_12                bool
char_13      

In [8]:
missing_values = []
nonumeric_values = []

print ("PEOPLE SET INFORMATION")
print ("========================\n")

for column in people:
    # Find all the unique feature values
    uniq = people[column].unique()
    print ("'{}' has {} unique values" .format(column,uniq.size))
    if (uniq.size > 10):
        print("~~Listing up to 10 unique values~~")
    print (uniq[0:10])
    print ("\n-----------------------------------------------------------------------\n")
    
    # Find features with missing values
    if pd.isnull(uniq).any():
        s = "{} has {} missing".format(column, pd.isnull(people[column]).sum())
        missing_values.append(s)

    # Find features with non-numeric values:
    # skip missing entries and flag the column at the first non-numeric value
    for value in uniq:
        if pd.isnull(value):
            continue
        if not re.search(r'(^\d+\.?\d*$)|(^\d*\.?\d+$)', str(value)):
            nonumeric_values.append(column)
            break
  
print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
print ("Features with missing values:\n{}\n\n" .format(missing_values))
print ("Features with non-numeric values:\n{}" .format(nonumeric_values))
print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")

PEOPLE SET INFORMATION

'people_id' has 189118 unique values
~~Listing up to 10 unique values~~
['ppl_100' 'ppl_100002' 'ppl_100003' 'ppl_100004' 'ppl_100006' 'ppl_10001'
 'ppl_100010' 'ppl_100013' 'ppl_100019' 'ppl_100025']

-----------------------------------------------------------------------

'char_1' has 2 unique values
['type 2' 'type 1']

-----------------------------------------------------------------------

'group_1' has 34224 unique values
~~Listing up to 10 unique values~~
['group 17304' 'group 8688' 'group 33592' 'group 22593' 'group 6534'
 'group 25417' 'group 4204' 'group 45749' 'group 36096' 'group 18035']

-----------------------------------------------------------------------

'char_2' has 3 unique values
['type 2' 'type 3' 'type 1']

-----------------------------------------------------------------------

'date' has 1196 unique values
~~Listing up to 10 unique values~~
['2021-06-29T00:00:00.000000000' '2021-01-06T00:00:00.000000000'
 '2022-06-10T00:00:00.000000000' 

In [9]:
print ("Processing...")

for table in [train, test]:
    # Split the activity date into numeric parts, then drop the original column
    table['year'] = table['date'].dt.year
    table['month'] = table['date'].dt.month
    table['day'] = table['date'].dt.day
    table.drop('date', axis=1, inplace=True)
    # 'type N' -> N
    table['activity_category'] = (table['activity_category']
                                  .str.replace('type ', '', regex=False)
                                  .astype(np.int32))
    for i in range(1, 11):
        col = 'char_' + str(i)
        # Missing characteristics become category 0
        table[col] = (table[col].fillna('type 0')
                      .str.replace('type ', '', regex=False)
                      .astype(np.int32))

people['year'] = people['date'].dt.year
people['month'] = people['date'].dt.month
people['day'] = people['date'].dt.day
people.drop('date', axis=1, inplace=True)
# 'group N' -> N
people['group_1'] = people['group_1'].str.replace('group ', '', regex=False).astype(np.int32)
# char_1 .. char_9 are 'type N' strings
for i in range(1, 10):
    col = 'char_' + str(i)
    people[col] = people[col].str.replace('type ', '', regex=False).astype(np.int32)
# char_10 .. char_37 are booleans; store them as 0/1
for i in range(10, 38):
    col = 'char_' + str(i)
    people[col] = people[col].astype(np.int32)

Processing...
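
Removing a textual prefix such as 'type ' or 'group ' deserves some care, because `str.lstrip` treats its argument as a set of characters rather than as a prefix. The sketch below illustrates the difference in plain Python. The value `'type e5'` and the helper `strip_prefix` are hypothetical and do not appear in the dataset or the notebook.

```python
# str.lstrip removes any leading characters found in the given set.
label = "type 12"
stripped = label.lstrip("type ")        # strips leading 't', 'y', 'p', 'e', ' '
# This yields the number only because digits are not in the character set.

tricky = "type e5"                      # hypothetical value, not in the dataset
over_stripped = tricky.lstrip("type ")  # the 'e' after the space is stripped too


def strip_prefix(value: str, prefix: str) -> str:
    """Remove `prefix` only when `value` actually starts with it."""
    return value[len(prefix):] if value.startswith(prefix) else value


safe = strip_prefix(tricky, "type ")
group_number = int(strip_prefix("group 17304", "group "))

print(stripped, over_stripped, safe, group_number)
```

For labels of the form 'type N' and 'group N' with numeric N, both approaches give the same result; the prefix-based version simply fails more predictably on unexpected input.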


The two separate data files, the people file and the activity file, must be joined on people_id to create a single, unified data table.
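
As a minimal sketch of such a join, the example below left-merges two hypothetical toy tables on `people_id`. All values are invented. The `suffixes` and `validate` arguments are optional additions, assumed here to be useful because both files have columns named `char_1` and because each activity should match exactly one person.

```python
import pandas as pd

# Hypothetical toy tables; column values are invented for illustration.
activities = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_1", "ppl_2"],
    "activity_id": ["act_a", "act_b", "act_c"],
    "char_1": [3, 0, 5],
})
persons = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_2"],
    "group_1": [17304, 8688],
    "char_1": [2, 1],
})

merged = pd.merge(
    activities,
    persons,
    how="left",                 # keep every activity row
    on="people_id",
    suffixes=("_act", "_ppl"),  # disambiguate the clashing char_1 columns
    validate="many_to_one",     # raise if a people_id repeats in `persons`
)

print(merged)
```

Without explicit `suffixes`, pandas falls back to `_x` and `_y` for overlapping column names, which still works but makes the origin of each column harder to read.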

In [10]:
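print("Merge...")
# Left-join the person attributes onto each activity row.
# left_index=True is omitted: it conflicts with on=, and recent pandas
# releases raise a MergeError for that combination.
train = pd.merge(train, people, how='left', on='people_id')
train.fillna(0.0, inplace=True)
test = pd.merge(test, people, how='left', on='people_id')
test.fillna(0.0, inplace=True)

train = train.drop(['people_id'], axis=1)

train.describe()

# Separate label and data
Y = train['outcome']
X = train.drop(['outcome'], axis=1)

# Drop the first remaining column (activity_id): it is an identifier, not a feature
X = X.iloc[:, 1:]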

Merge...


In [14]:
# Predict using Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=100)
# Fit the model to our training data
lr = lr.fit(X, Y)
# Note: this accuracy is measured on the same rows the model was fit on
score = lr.score(X, Y)
print("Mean accuracy of Logistic Regression: {0}".format(score))

Mean accuracy of Logistic Regression: 0.8260721952622571


In [12]:
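# Create the random forest object
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
# Fit the model to our training data
rfc = rfc.fit(X, Y)
# Note: this is training accuracy; a value near 1.0 mostly shows that the
# forest can reproduce the rows it was fit on, not how well it generalizes
score = rfc.score(X, Y)
print("Mean accuracy of Random Forest: {0}".format(score))

# Apply the same column handling to the test set: drop people_id, then skip activity_id
test = test.drop(['people_id'], axis=1)
test_x = test.iloc[:, 1:]
test_y = list(map(int, rfc.predict(test_x)))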

Mean accuracy of Random Forest: 0.9999995448941447
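
Both scores above are computed on the training rows themselves. One possible way to get a less optimistic estimate is to hold out a validation set in which no person appears on both sides of the split. The sketch below uses synthetic data and scikit-learn's `GroupShuffleSplit`; the arrays `X_demo`, `y_demo`, and `groups` are hypothetical stand-ins. In this notebook the person IDs would need to be saved before `people_id` is dropped.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 rows, 5 features, up to 40 hypothetical people.
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)
groups = rng.integers(0, 40, size=200)

# Hold out about 25% of the people, so no person is in both parts.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, valid_idx = next(splitter.split(X_demo, y_demo, groups=groups))

model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(X_demo[train_idx], y_demo[train_idx])

# Compare accuracy on seen rows with accuracy on unseen people.
train_acc = model.score(X_demo[train_idx], y_demo[train_idx])
valid_acc = model.score(X_demo[valid_idx], y_demo[valid_idx])
print("train accuracy:", train_acc)
print("validation accuracy:", valid_acc)
```

Grouping by person is an assumption about how the test set differs from the training set; a plain random split would be a simpler alternative if the same people appear in both.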


In [13]:
# File for submission
test['outcome'] = test_y
test[['activity_id', 'outcome']] \
    .to_csv('results-rfc.csv', index=False)