# McKinsey Data Scientist Hackathon

link: https://datahack.analyticsvidhya.com/contest/mckinsey-analytics-online-hackathon-recommendation/?utm_source=sendinblue&utm_campaign=Download_The_Dataset_McKinsey_Analytics_Online_Hackathon__Recommendation_Design_is_now_Live&utm_medium=email

slack: https://analyticsvidhya.slack.com/messages/C8X88UJ5P/


## Problem Statement ##

Your client is a fast-growing mobile platform for hosting coding challenges. They have a unique business model in which they crowdsource problems from various creators (authors). These authors create problems and release them on the client's platform. The users then select the challenges they want to solve. The authors make money based on the difficulty of their problems and on how many users take up their challenges.

The client, on the other hand, makes money when users find challenges that interest them and stay on the platform. To date, the client has relied on its domain expertise, user interface and experience with user behaviour to suggest problems a user might be interested in. You have now been appointed as the data scientist who needs to come up with an algorithm to keep users engaged on the platform.

The client has provided the history of the last 10 challenges each user has solved, and you need to predict the next 3 challenges the user might be interested in solving. Apply your data science skills to help the client make a big mark on their user engagement and revenue.

### Data Relationships
- Client: the platform maintainer
- Creators: the problem contributors (authors)
- Users: the people who solve the problems

Question: given the 10 challenges a user has solved, which 3 challenges might the user want to solve next?

## Now let's first look at some raw data

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn
import pandas
import seaborn

In [2]:
x_data = pandas.read_csv('./train_mddNHeX/train.csv')
y_data = pandas.read_csv('./train_mddNHeX/challenge_data.csv')
x_test = pandas.read_csv('./test.csv')
y_sub_temp = pandas.read_csv('./sample_submission_J0OjXLi_DDt3uQN.csv')

In [3]:
print('shape of submission data = {}, number of users = {}'.format(y_sub_temp.shape, y_sub_temp.shape[0] // 3))  # 3 predictions per user
y_sub_temp.head()

shape of submission data = (119196, 2), number of users = 39732


Unnamed: 0,user_sequence,challenge
0,4577_11,CI23648
1,4577_12,CI23648
2,4577_13,CI23648
3,4578_11,CI23648
4,4578_12,CI23648


In [4]:
print('shape of user data = {}, number of users = {}'.format(x_data.shape, x_data.shape[0] // 13))  # 13 rows per training user
#x_data.sort_values('user_id')
x_data[0:20]

shape of user data = (903916, 4), number of users = 69532


Unnamed: 0,user_sequence,user_id,challenge_sequence,challenge
0,4576_1,4576,1,CI23714
1,4576_2,4576,2,CI23855
2,4576_3,4576,3,CI24917
3,4576_4,4576,4,CI23663
4,4576_5,4576,5,CI23933
5,4576_6,4576,6,CI25135
6,4576_7,4576,7,CI23975
7,4576_8,4576,8,CI25126
8,4576_9,4576,9,CI24915
9,4576_10,4576,10,CI24957


In [85]:
print('shape of user test data = {}, number of users = {}'.format(x_test.shape, x_test.shape[0] // 10))  # 10 rows per test user
#x_test[0:15]
x_test.head(15)
#x_test.sort_values('user_id').head(15)

shape of user test data = (397320, 4), number of users = 39732


Unnamed: 0,user_sequence,user_id,challenge_sequence,challenge
0,4577_1,4577,1,CI23855
1,4577_2,4577,2,CI23933
2,4577_3,4577,3,CI24917
3,4577_4,4577,4,CI24915
4,4577_5,4577,5,CI23714
5,4577_6,4577,6,CI23663
6,4577_7,4577,7,CI24958
7,4577_8,4577,8,CI25135
8,4577_9,4577,9,CI25727
9,4577_10,4577,10,CI24530


In [6]:
print('shape of challenge data = {}'.format(y_data.shape))
y_data[0:10]#.tail()
#print(y_data.loc[:,['challenge_ID','challenge_series_ID']])
#print(y_data.groupby('challenge_series_ID'))

shape of challenge data = (5606, 9)


Unnamed: 0,challenge_ID,programming_language,challenge_series_ID,total_submissions,publish_date,author_ID,author_gender,author_org_ID,category_id
0,CI23478,2,SI2445,37.0,06-05-2006,AI563576,M,AOI100001,
1,CI23479,2,SI2435,48.0,17-10-2002,AI563577,M,AOI100002,32.0
2,CI23480,1,SI2435,15.0,16-10-2002,AI563578,M,AOI100003,
3,CI23481,1,SI2710,236.0,19-09-2003,AI563579,M,AOI100004,70.0
4,CI23482,2,SI2440,137.0,21-03-2002,AI563580,M,AOI100005,
5,CI23483,2,SI2445,1434.0,06-05-2006,,,,70.0
6,CI23484,1,SI2440,509.0,22-03-2006,AI563582,F,AOI100007,32.0
7,CI23485,1,SI2435,287.0,17-10-2002,AI563583,F,AOI100008,23.0
8,CI23486,2,SI2440,19.0,20-03-2002,AI563584,M,AOI100009,
9,CI23487,1,SI2435,28.0,14-10-2002,AI563585,M,AOI100010,141.0


## Dirty try
1. Find a feature vector for each challenge
    -  Built from [prog_lang, challenge_series, total_submissions, publish_date, auth_id, auth_gender, auth_org, categ]
2. Create a preference vector for each user
    -  This will be randomly initialized
3. Use the first 10 samples from each user as ground truth for training the feature vectors and the preference vectors
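
The plan above amounts to scoring every challenge for a user as the dot product of the challenge's feature vector and the user's preference vector. A minimal NumPy sketch on random toy data follows; the names `score_challenges`, `features` and `preference` are illustrative and do not appear in the notebook, and no training step is shown.

```python
import numpy as np

def score_challenges(features, preference):
    """Score each challenge for one user as a dot product.

    features:   (n_c, n_f) array, one row per challenge
    preference: (n_f,) array, the user's preference vector
    """
    return features @ preference

rng = np.random.default_rng(0)
features = rng.random((6, 8))        # 6 toy challenges, 8 features each
preference = rng.normal(size=8)      # randomly initialized, to be learned later
scores = score_challenges(features, preference)
top3 = np.argsort(scores)[::-1][:3]  # indices of the three highest scores, best first
print(scores.shape, top3)
```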

## Prepare training data

Let's prepare the challenge IDs as a lookup table to construct the training data.


In [7]:
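def str2ascii(astr):
    """
        input: 
            astr: a string
        output: 
            val: an integer encoding of the string: the sum of the ASCII codes
                 of the non-digit characters, followed by the digits themselves
                 as the low-order decimal places (e.g. 'SI2445' -> 1562445).
    """
    val = 0         # sum of the ASCII codes of the non-digit characters
    real = 0        # the digits of the string read as one integer
    count_real = 0  # number of digits seen
    for ch in astr:
        if '0' <= ch <= '9':
            real = real*10 + int(ch)
            count_real += 1
        else:
            val += ord(ch)
    return val*10**count_real + real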
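
To make the encoding concrete: non-digit characters contribute the sum of their ASCII codes, and the digits are appended as the low-order decimal places. The sketch below restates the function compactly (the name `encode_id` is hypothetical) so the values can be checked by hand. For `'SI2445'`, `ord('S') + ord('I') = 83 + 73 = 156`, which gives `156 * 10**4 + 2445`.

```python
def encode_id(text):
    """Compact restatement of str2ascii, for illustration only."""
    letters = sum(ord(ch) for ch in text if not '0' <= ch <= '9')
    digits = ''.join(ch for ch in text if '0' <= ch <= '9')
    return letters * 10 ** len(digits) + int(digits or 0)

print(encode_id('SI2445'))      # 1562445
print(encode_id('06-05-2006'))  # 9006052006: two '-' (45 each) give 90, then the 8 digits
print(encode_id('M'))           # 77: no digits, so just the ASCII code
```

Because only the sum of the non-digit codes is kept, two different strings can map to the same number (for example `'AB1'` and `'BA1'`), so this is a rough encoding rather than a unique ID.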

In [8]:
# Retain an untouched copy of y_data (ch_table is only an alias here and is rebuilt below)
ch_table = y_data
orig_y_data = y_data.copy()

In [9]:
print(ch_table.columns)

Index([u'challenge_ID', u'programming_language', u'challenge_series_ID',
       u'total_submissions', u'publish_date', u'author_ID', u'author_gender',
       u'author_org_ID', u'category_id'],
      dtype='object')


In [10]:
## Fill NaN with some values
values = {'challenge_series_ID':'SI0000','author_ID':'AI000000','author_gender':'I'
          ,'author_org_ID':'AOI000000', 'category_id':0.0
          ,'programming_language':0,'total_submissions':0, 'publish_date':'00-00-0000'}
ch_table = y_data.fillna(value = values)
print(y_data.head(), ch_table.head())

(  challenge_ID  programming_language challenge_series_ID  total_submissions  \
0      CI23478                     2              SI2445               37.0   
1      CI23479                     2              SI2435               48.0   
2      CI23480                     1              SI2435               15.0   
3      CI23481                     1              SI2710              236.0   
4      CI23482                     2              SI2440              137.0   

  publish_date author_ID author_gender author_org_ID  category_id  
0   06-05-2006  AI563576             M     AOI100001          NaN  
1   17-10-2002  AI563577             M     AOI100002         32.0  
2   16-10-2002  AI563578             M     AOI100003          NaN  
3   19-09-2003  AI563579             M     AOI100004         70.0  
4   21-03-2002  AI563580             M     AOI100005          NaN  ,   challenge_ID  programming_language challenge_series_ID  total_submissions  \
0      CI23478                     2

In [11]:
ch_table.iloc[3996]

challenge_ID               CI27492
programming_language             1
challenge_series_ID         SI0000
total_submissions               95
publish_date            01-12-2008
author_ID                 AI566131
author_gender                    M
author_org_ID            AOI101339
category_id                     39
Name: 3996, dtype: object

In [12]:
## Change strings to some encoded values
columns = ['challenge_series_ID','author_ID','author_gender','author_org_ID','publish_date']
#print(ch_table[0:10])
for col in columns:
    print(col)
    #ch_table[col] = ch_table.apply(lambda x: str2ascii(x[col]),axis=1)
    ch_table[col] = ch_table[col].apply(lambda x: str2ascii(x))

challenge_series_ID
author_ID
author_gender
author_org_ID
publish_date
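
One side effect of encoding `publish_date` this way is that the day of the month ends up in the most significant digits, so the encoded values do not sort chronologically (`17-10-2002` encodes to a larger number than `06-05-2006`). A possible alternative, sketched here under the assumption that all dates use the `dd-mm-yyyy` format seen in the rows above, is to parse them with `pandas.to_datetime` and use the number of days since an arbitrary reference date:

```python
import pandas

dates = pandas.Series(['06-05-2006', '17-10-2002', None])
parsed = pandas.to_datetime(dates, format='%d-%m-%Y', errors='coerce')
# Days since an arbitrary reference date; missing dates become NaN
days = (parsed - pandas.Timestamp('2000-01-01')).dt.days
print(days)
```

The NaN entries would still need a fill value before scaling.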


In [13]:
ch_table[0:10]
y_data['programming_language'].describe()

count    5606.000000
mean        1.081877
std         0.316487
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         3.000000
Name: programming_language, dtype: float64

### Now, we need to normalize the table

In [14]:
## using normalizer
from sklearn import preprocessing
normalizer = preprocessing.Normalizer()
min_max_scaler = preprocessing.MinMaxScaler()

In [15]:
## Rescale every feature column so that the columns have comparable ranges
columns = ch_table.columns
#print(columns[1:],ch_table.loc[:,columns[1:]])
ch_table.loc[:,columns[1:]].head()
minmax_ch_table = min_max_scaler.fit_transform(ch_table.loc[:,columns[1:]])
norm_ch_table = preprocessing.normalize(ch_table.loc[:,columns[1:]],norm='l2')
#ch_table.loc[:,columns[1:]] = norm_ch_table

In [16]:
#ch_table.head()
print(pandas.DataFrame(minmax_ch_table, columns=columns[1:]).head(2))
print(pandas.DataFrame(norm_ch_table, columns=columns[1:]).head(2))

   programming_language  challenge_series_ID  total_submissions  publish_date  \
0                   0.5             0.852213           0.000852      0.167498   
1                   0.5             0.848728           0.001106      0.534729   

   author_ID  author_gender  author_org_ID  category_id  
0   0.993856            1.0        0.98312     0.000000  
1   0.993858            1.0        0.98313     0.105263  
   programming_language  challenge_series_ID  total_submissions  publish_date  \
0          2.219821e-10             0.000173       4.106670e-09      0.999591   
1          2.217103e-10             0.000173       5.321048e-09      0.999592   

   author_ID  author_gender  author_org_ID   category_id  
0   0.015379   8.546312e-09       0.024096  0.000000e+00  
1   0.015360   8.535848e-09       0.024067  3.547365e-09  
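
The two outputs differ because `MinMaxScaler` rescales each column to [0, 1] independently, whereas `preprocessing.normalize` with `norm='l2'` rescales each row to unit length, so the column with the largest raw values (here the encoded `publish_date`) dominates every row. A small sketch with made-up numbers shows the effect:

```python
import numpy as np
from sklearn import preprocessing

# Toy table: column 0 is small, column 1 is huge (like an encoded date)
X = np.array([[1.0, 9006052006.0],
              [2.0, 9017102002.0],
              [3.0, 9021032002.0]])

col_scaled = preprocessing.MinMaxScaler().fit_transform(X)  # scales per column
row_scaled = preprocessing.normalize(X, norm='l2')          # scales per row

print(col_scaled)
print(row_scaled)
```

This is why the min-max version is the one written back into `ch_table`.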


In [17]:
## Finally put the scaled data back
ch_table[columns[1:]] = minmax_ch_table

In [18]:
ch_table.head(10)

Unnamed: 0,challenge_ID,programming_language,challenge_series_ID,total_submissions,publish_date,author_ID,author_gender,author_org_ID,category_id
0,CI23478,0.5,0.852213,0.000852,0.167498,0.993856,1.0,0.98312,0.0
1,CI23479,0.5,0.848728,0.001106,0.534729,0.993858,1.0,0.98313,0.105263
2,CI23480,0.0,0.848728,0.000346,0.501495,0.99386,1.0,0.98314,0.0
3,CI23481,0.0,0.94458,0.005437,0.600864,0.993861,1.0,0.983149,0.230263
4,CI23482,0.5,0.850471,0.003156,0.665337,0.993863,1.0,0.983159,0.0
5,CI23483,0.5,0.852213,0.033035,0.167498,0.0,0.428571,0.0,0.230263
6,CI23484,0.0,0.850471,0.011726,0.698571,0.993867,0.0,0.983179,0.105263
7,CI23485,0.0,0.848728,0.006612,0.534729,0.993868,0.0,0.983189,0.075658
8,CI23486,0.5,0.850471,0.000438,0.632104,0.99387,1.0,0.983199,0.0
9,CI23487,0.0,0.848728,0.000645,0.435028,0.993872,1.0,0.983208,0.463816


## Great! Now we have a feature vector for every challenge

Next, let's prepare the ground-truth matrix for the users.

Shape of y = (n_u, n_c)

1.  n_u: the number of users
2.  n_c: the number of challenges

In [19]:
## ch_features holds the feature matrix, with rows sorted by challenge_ID
ch_features = ch_table.sort_values('challenge_ID')
ch_features = ch_features.loc[:,columns[1:]].values
#ch_features.head(10)
print('Shape of feature (n_c, n_f) = {}'.format(ch_features.shape))

Shape of feature (n_c, n_f) = (5606, 8)


In [173]:
## Setting up the lookup tables
ch_id_lookup = ch_table['challenge_ID'].to_dict()                # row index -> challenge_ID
ch_lookup = {ch_id: idx for idx, ch_id in ch_id_lookup.items()}  # challenge_ID -> row index

In [21]:
## Look up the feature vector of a single challenge
def findChallengeFeatures(challenge_id, table, lookup):
    """
        input: 
            challenge_id: a string of the challenge_id
            table: pandas dataframe lookup table
            lookup: dict mapping challenge_id to the row index in table
        output:
            features: pandas Series of features (all columns except challenge_ID)
    """
    columns = table.columns
    return table.loc[lookup[challenge_id], columns[1:]]

In [22]:
%%time
ch_table.head()
featureVec = findChallengeFeatures(x_data.loc[0,'challenge'],ch_table, ch_lookup)
print(featureVec.shape)

(8,)
CPU times: user 68 ms, sys: 36.5 ms, total: 105 ms
Wall time: 117 ms


In [212]:
%%time
columns = ch_table.columns.values
usr_table = x_data  # alias, not a copy: the feature columns are added to x_data itself
print(columns[1:])
for i in columns[1:]:
    usr_table[i] = np.nan
nSamples = x_data.shape[0]
## Row index in ch_table of every challenge in the training data
indices = np.array([ch_lookup[i] for i in x_data.loc[:nSamples-1,'challenge']])
print(indices.shape)
usr_table.loc[:nSamples-1, columns[1:]] = ch_table.loc[indices, columns[1:]].values

['programming_language' 'challenge_series_ID' 'total_submissions'
 'publish_date' 'author_ID' 'author_gender' 'author_org_ID' 'category_id']
(903916,)
CPU times: user 1.33 s, sys: 333 ms, total: 1.66 s
Wall time: 1.77 s
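
The same join can also be written as a left merge on the challenge ID, which avoids building the index array by hand. This is a sketch on toy frames rather than the notebook's data; the column names match the tables above, and `validate='many_to_one'` simply asserts that each `challenge_ID` appears only once in the lookup table.

```python
import pandas

train = pandas.DataFrame({
    'user_id': [4576, 4576, 4577],
    'challenge': ['CI23714', 'CI23855', 'CI23855'],
})
challenges = pandas.DataFrame({
    'challenge_ID': ['CI23714', 'CI23855'],
    'programming_language': [0.0, 0.5],
    'total_submissions': [0.34, 0.48],
})

# A left merge keeps every training row, in its original order
merged = train.merge(challenges, how='left',
                     left_on='challenge', right_on='challenge_ID',
                     validate='many_to_one').drop(columns='challenge_ID')
print(merged)
```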


In [213]:
usr_table.head(15)
usr_table.to_csv('train_withFeatureVec_allsamples.csv')
ch_table.to_csv('challenge_featureVecTable_allsamples.csv')

## Let's prepare the labels

First, we need an empty array to hold the challenges

In [37]:
ch_emptyVec = np.zeros((ch_table.shape[0]))
ch_emptyVec.shape

(5606,)

In [48]:
x_data.head(13)

Unnamed: 0,user_sequence,user_id,challenge_sequence,challenge,programming_language,challenge_series_ID,total_submissions,publish_date,author_ID,author_gender,author_org_ID,category_id
0,4576_1,4576,1,CI23714,0.0,0.863367,0.339169,0.367232,0.994191,1.0,0.984378,0.095395
1,4576_2,4576,2,CI23855,0.0,0.86023,0.483609,0.766368,0.994292,1.0,0.0,0.095395
2,4576_3,4576,3,CI24917,0.0,0.887069,1.0,0.035228,0.995641,0.0,0.988822,0.217105
3,4576_4,4576,4,CI23663,0.0,0.861624,0.204957,0.46793,0.994117,1.0,0.984074,0.148026
4,4576_5,4576,5,CI23933,0.0,0.86023,0.347532,0.866068,0.994221,1.0,0.984575,0.101974
5,4576_6,4576,6,CI25135,0.0,0.890903,0.125458,0.799934,0.99417,1.0,0.984378,0.200658
6,4576_7,4576,7,CI23975,0.0,0.858139,0.21203,0.764374,0.994292,1.0,0.0,0.200658
7,4576_8,4576,8,CI25126,0.0,0.890903,0.098344,0.666999,0.994378,1.0,0.983946,0.151316
8,4576_9,4576,9,CI24915,0.0,0.887069,0.170218,0.035228,0.994233,0.0,0.984624,0.095395
9,4576_10,4576,10,CI24957,0.0,0.887069,0.051625,0.168162,0.994233,0.0,0.984624,0.095395


In [241]:
%%time
## Build x_train with shape (n_u, 10, n_f) and y_train with shape (n_u, n_c)
columns = ch_table.columns
nSamples = int(x_data.shape[0])
nUsers = nSamples // 13  ## every user contributes 13 consecutive rows
x_train = np.zeros((nUsers, 10, len(ch_table.columns)-1)) ## (n_u, n_i, n_f)
y_train = np.zeros((nUsers, ch_table.shape[0])) ## (n_u, n_c)
for i in range(nUsers): 
    curpt = i*13
    ## .loc slicing is inclusive: rows curpt..curpt+9 are the first 10 challenges
    x_train[i] = x_data.loc[curpt:(curpt+9), columns[1:]]
    ## rows curpt+10..curpt+12 are the 3 challenges to predict (multi-hot label)
    indices = [int(ch_lookup[j]) for j in x_data.loc[(curpt+10):(curpt+12), 'challenge']]
    y_train[i, indices] = 1
print('x_train shape = {}, y_train shape = {}'.format(x_train.shape, y_train.shape))

x_train shape = (69532, 10, 8), y_train shape = (69532, 5606)


In [245]:
## Flatten each user's (10, 8) block into a single row of 80 features
x_train = x_train.reshape((x_train.shape[0],-1))
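
Because every user contributes exactly 13 consecutive rows, the same kind of arrays can also be built without a Python loop by reshaping. The sketch below uses random toy data and assumes the rows are already ordered by user and by `challenge_sequence`; the variable names are illustrative.

```python
import numpy as np

n_users, n_feat, n_challenges = 4, 8, 20
rng = np.random.default_rng(0)
feats = rng.random((n_users * 13, n_feat))            # one feature row per (user, step)
ch_idx = rng.integers(0, n_challenges, n_users * 13)  # challenge row index per step

feats = feats.reshape(n_users, 13, n_feat)
ch_idx = ch_idx.reshape(n_users, 13)

x = feats[:, :10, :].reshape(n_users, -1)             # first 10 steps, flattened to 80 values
y = np.zeros((n_users, n_challenges), dtype=np.float32)
rows = np.repeat(np.arange(n_users), 3)               # each user index repeated 3 times
y[rows, ch_idx[:, 10:].ravel()] = 1                   # multi-hot labels for steps 11-13
print(x.shape, y.shape)
```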

## Finally, let's dump it into a classifier

In [73]:
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
gnb = GaussianNB()
clf = tree.DecisionTreeClassifier()


In [74]:
clf.fit(x_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [78]:
## A simple NN
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(64, input_dim=80))
model.add(Activation('relu'))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(1024))
model.add(Activation('relu'))
model.add(Dense(y_train.shape[1]))
model.add(Activation('softmax'))



Using TensorFlow backend.


In [79]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

In [246]:
model.fit(x_train, y_train, epochs=1)

Epoch 1/1


<keras.callbacks.History at 0x105bc9bd0>

In [247]:
model.save_weights('simpleNN.h5')


## Running out of time just gonna plug it in and submit

In [248]:
%%time
nSamples = x_test.shape[0]
columns = ch_table.columns
test_table = x_test
print(columns[1:]) 
for i in columns[1:]:
    test_table[i] = np.nan
indices = np.array([ch_lookup[i] for i in x_test.loc[:nSamples-1,'challenge']])
test_table.loc[:nSamples-1, columns[1:]] = ch_table.loc[indices, columns[1:]].values
print(indices.shape)


Index([u'programming_language', u'challenge_series_ID', u'total_submissions',
       u'publish_date', u'author_ID', u'author_gender', u'author_org_ID',
       u'category_id'],
      dtype='object')
(397320,)
CPU times: user 647 ms, sys: 412 ms, total: 1.06 s
Wall time: 1.66 s


In [249]:
%%time
test_table.to_csv('prepared_test_table_for_prediction.csv')

CPU times: user 2.92 s, sys: 274 ms, total: 3.19 s
Wall time: 3.93 s


In [250]:
nUsersTest = nSamples // 10  ## 10 rows per test user
x_submit = np.zeros((nUsersTest, 10, len(ch_table.columns)-1)) ## (m, n_i, n_f)
y_submit = pandas.DataFrame(columns=['user_sequence','challenge'], 
                            data = np.empty((nUsersTest*3,2), dtype=str))
#y_submit['user_sequence']
#y_submit.head(15)

In [251]:
%%time
for i in range(nSamples // 10): 
    curpt = i*10
    ## .loc slicing is inclusive: rows curpt..curpt+9 are this user's 10 known challenges
    x_submit[i] = x_test.loc[curpt:(curpt+9), columns[1:]]
    pred = model.predict(x_submit[i].reshape((1,80)))
    ## argsort is ascending, so ids holds the 3 highest-scoring challenges with the best one last
    ids = np.argsort(pred.reshape(-1))[-3:]
    
    outpt = i*3
    user_id = x_test.loc[curpt,'user_id']
    y_submit.iloc[outpt:outpt+3,:] = [[str(user_id)+'_11', ch_id_lookup[ids[0]]],
                                      [str(user_id)+'_12', ch_id_lookup[ids[1]]],
                                      [str(user_id)+'_13', ch_id_lookup[ids[2]]]
                                     ]
y_submit.head()

CPU times: user 6min 9s, sys: 30.8 s, total: 6min 40s
Wall time: 6min 29s
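
Much of the run time above likely comes from calling `model.predict` once per user. Predicting on the whole `(n_users, 80)` array in a single call and then taking the top three columns of each row should be considerably faster. The ranking step is sketched below with a random score matrix standing in for the model output; since `np.argsort` sorts in ascending order, the slice is reversed to put the highest score first.

```python
import numpy as np

rng = np.random.default_rng(0)
pred = rng.random((5, 12))  # stand-in for model.predict(x): 5 users, 12 challenges

# Per-row indices of the three largest scores, best first
top3 = np.argsort(pred, axis=1)[:, -3:][:, ::-1]
print(top3)
```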


In [252]:
y_submit.head(15)

Unnamed: 0,user_sequence,challenge
0,4577_11,CI25126
1,4577_12,CI24915
2,4577_13,CI24958
3,4578_11,CI23691
4,4578_12,CI24915
5,4578_13,CI24958
6,4579_11,CI23848
7,4579_12,CI23855
8,4579_13,CI23933
9,4583_11,CI24228


In [253]:
y_submit.to_csv('ftl_submission.csv')