Version 1.1.0

# Mean encodings

In this programming assignment you will work with the `1C` dataset from the final competition. You are asked to encode `item_id` in 4 different ways:

    1) Via KFold scheme;  
    2) Via Leave-one-out scheme;
    3) Via smoothing scheme;
    4) Via expanding mean scheme.

**You will need to submit** the correlation coefficient between each resulting encoding and the target variable, to 4 decimal places.

### General tips

* Fill NaNs in the encoding with `0.3343`.
* Some encoding schemes depend on the sort order of the rows, so to avoid confusion, please use the following code snippet to construct the data frame. The snippet also implements mean encoding without regularization.

In [1]:
import pandas as pd
import numpy as np
from itertools import product
from grader import Grader

# Read data

In [2]:
sales = pd.read_csv('../readonly/final_project_data/sales_train.csv.gz')

In [3]:
sales.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
2,05.01.2013,0,25,2552,899.0,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.0,1.0


In [4]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
date              object
date_block_num    int64
shop_id           int64
item_id           int64
item_price        float64
item_cnt_day      float64
dtypes: float64(2), int64(3), object(1)
memory usage: 134.4+ MB


In [5]:
sales.isnull().sum()

date              0
date_block_num    0
shop_id           0
item_id           0
item_price        0
item_cnt_day      0
dtype: int64

In [6]:
sales.describe()

Unnamed: 0,date_block_num,shop_id,item_id,item_price,item_cnt_day
count,2935849.0,2935849.0,2935849.0,2935849.0,2935849.0
mean,14.56991,33.00173,10197.23,890.8532,1.242641
std,9.422988,16.22697,6324.297,1729.8,2.618834
min,0.0,0.0,0.0,-1.0,-22.0
25%,7.0,22.0,4476.0,249.0,1.0
50%,14.0,31.0,9343.0,399.0,1.0
75%,23.0,47.0,15684.0,999.0,1.0
max,33.0,59.0,22169.0,307980.0,2169.0


# Aggregate data

Since the competition task is to make a monthly prediction, we need to aggregate the data to the monthly level before doing any encodings. The following code cell serves just that purpose.

In [None]:
# Note: do not run this cell; the same steps are executed one cell at a time below

index_cols = ['shop_id', 'item_id', 'date_block_num']

# For every month we create a grid from all shops/items combinations from that month
grid = [] 
for block_num in sales['date_block_num'].unique():
    cur_shops = sales[sales['date_block_num']==block_num]['shop_id'].unique()
    cur_items = sales[sales['date_block_num']==block_num]['item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))

#turn the grid into pandas dataframe
grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)

#get aggregated values for (shop_id, item_id, month)
gb = sales.groupby(index_cols,as_index=False).agg({'item_cnt_day':{'target':'sum'}})

#fix column names
gb.columns = [col[0] if col[-1]=='' else col[-1] for col in gb.columns.values]
#join aggregated data to the grid
all_data = pd.merge(grid,gb,how='left',on=index_cols).fillna(0)
#sort the data
all_data.sort_values(['date_block_num','shop_id','item_id'],inplace=True)
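
The nested-dict form of `agg` used above was deprecated in pandas 0.20 and removed in pandas 1.0, so the cell fails on recent versions. A minimal sketch of the same aggregation with named aggregation (available since pandas 0.25) follows. `sales_toy` and `gb_toy` are hypothetical names and toy values, not the assignment data. The result already has a flat `target` column, so the column-name fix is not needed in this variant.

```python
import pandas as pd

# Hypothetical toy data standing in for `sales`
sales_toy = pd.DataFrame({
    'shop_id':        [0, 0, 0, 1],
    'item_id':        [10, 10, 11, 10],
    'date_block_num': [0, 0, 0, 1],
    'item_cnt_day':   [1.0, 2.0, 5.0, -1.0],
})
index_cols = ['shop_id', 'item_id', 'date_block_num']

# Named aggregation: output column `target` is the sum of `item_cnt_day`
gb_toy = sales_toy.groupby(index_cols, as_index=False).agg(
    target=('item_cnt_day', 'sum')
)
print(gb_toy)
```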

In [4]:
index_cols = ['shop_id', 'item_id', 'date_block_num']

In [5]:
# For every month we create a grid from all shops/items combinations from that month
grid = [] 
for block_num in sales['date_block_num'].unique():
    cur_shops = sales[sales['date_block_num']==block_num]['shop_id'].unique()
    cur_items = sales[sales['date_block_num']==block_num]['item_id'].unique()
    # all combinations of cur_shops, cur_items and [block_num]
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))

# itertools product test

In [5]:
l1 = ['a', 'b', 'c']
l2 = ['X', 'Y', 'Z']

p = product(l1, l2)
# The form below gives the same result
# p = product(*[l1, l2])
list(p)

[('a', 'X'),
 ('a', 'Y'),
 ('a', 'Z'),
 ('b', 'X'),
 ('b', 'Y'),
 ('b', 'Z'),
 ('c', 'X'),
 ('c', 'Y'),
 ('c', 'Z')]

# --------------------

In [6]:
grid

[array([[   59, 22154,     0],
        [   59,  2552,     0],
        [   59,  2554,     0],
        ..., 
        [   45,   628,     0],
        [   45,   631,     0],
        [   45,   621,     0]], dtype=int32), array([[   50,  3880,     1],
        [   50,  4128,     1],
        [   50,  4124,     1],
        ..., 
        [   28, 12885,     1],
        [   28, 12791,     1],
        [   28, 13433,     1]], dtype=int32), array([[    5, 20175,     2],
        [    5, 20083,     2],
        [    5,    31,     2],
        ..., 
        [    4, 12388,     2],
        [    4, 12340,     2],
        [    4, 10649,     2]], dtype=int32), array([[   25,  8092,     3],
        [   25,  7850,     3],
        [   25,  8051,     3],
        ..., 
        [   41, 14063,     3],
        [   41, 20690,     3],
        [   41, 19235,     3]], dtype=int32), array([[   59, 22114,     4],
        [   59, 20239,     4],
        [   59, 20238,     4],
        ..., 
        [    6,  1924,     4],
      

In [7]:
len(grid)

34

In [8]:
np.vstack(grid)

array([[   59, 22154,     0],
       [   59,  2552,     0],
       [   59,  2554,     0],
       ..., 
       [   21,  7640,    33],
       [   21,  7632,    33],
       [   21,  7440,    33]], dtype=int32)

In [6]:
#turn the grid into pandas dataframe
grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)

In [10]:
grid.head()

Unnamed: 0,shop_id,item_id,date_block_num
0,59,22154,0
1,59,2552,0
2,59,2554,0
3,59,2555,0
4,59,2564,0


In [7]:
#get aggregated values for (shop_id, item_id, month)
gb = sales.groupby(index_cols,as_index=False).agg({'item_cnt_day':{'target':'sum'}}) # same as agg({'item_cnt_day':'sum'}); it only adds 'target' as a column-name level



In [12]:
gb.head()

Unnamed: 0_level_0,shop_id,item_id,date_block_num,item_cnt_day
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,target
0,0,30,1,31.0
1,0,31,1,11.0
2,0,32,0,6.0
3,0,32,1,10.0
4,0,33,0,3.0


In [13]:
# The column names are hierarchical (two levels)
gb.columns.values

array([('shop_id', ''), ('item_id', ''), ('date_block_num', ''),
       ('item_cnt_day', 'target')], dtype=object)

In [8]:
#fix column names: flatten the hierarchical column names
gb.columns = [col[0] if col[-1]=='' else col[-1] for col in gb.columns.values]

In [15]:
gb.columns

Index(['shop_id', 'item_id', 'date_block_num', 'target'], dtype='object')

In [9]:
#join aggregated data to the grid
all_data = pd.merge(grid,gb,how='left',on=index_cols).fillna(0)

In [17]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target
0,59,22154,0,1.0
1,59,2552,0,0.0
2,59,2554,0,0.0
3,59,2555,0,0.0
4,59,2564,0,0.0


In [10]:
#sort the data
all_data.sort_values(['date_block_num','shop_id','item_id'],inplace=True) # inplace=True overwrites the original variable

In [19]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target
139255,0,19,0,0.0
141495,0,27,0,0.0
144968,0,28,0,0.0
142661,0,29,0,0.0
138947,0,32,0,6.0


# Mean encodings without regularization

Now that the technical work is done, we are ready to actually *mean encode* the desired `item_id` variable. 

Here are two ways to implement mean encoding features *without* any regularization. You can use this code as a starting point to implement regularized techniques. 

#### Method 1

In [None]:
# Note: do not run this cell; the same steps are executed one cell at a time below

# Calculate a mapping: {item_id: target_mean}
item_id_target_mean = all_data.groupby('item_id').target.mean()

# In our non-regularized case we just *map* the computed means to the `item_id`'s
all_data['item_target_enc'] = all_data['item_id'].map(item_id_target_mean)

# Fill NaNs
all_data['item_target_enc'].fillna(0.3343, inplace=True) 

# Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

In [11]:
# Calculate a mapping: {item_id: target_mean}
# compute the mean of target for each item_id
item_id_target_mean = all_data.groupby('item_id').target.mean()

In [12]:
# In our non-regularized case we just *map* the computed means to the `item_id`'s
all_data['item_target_enc'] = all_data['item_id'].map(item_id_target_mean)

In [22]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,item_target_enc
139255,0,19,0,0.0,0.022222
141495,0,27,0,0.0,0.056834
144968,0,28,0,0.0,0.141176
142661,0,29,0,0.0,0.037383
138947,0,32,0,6.0,1.319042


In [23]:
all_data.isnull().sum()

shop_id            0
item_id            0
date_block_num     0
target             0
item_target_enc    0
dtype: int64

In [13]:
# Fill NaNs
all_data['item_target_enc'].fillna(0.3343, inplace=True) 

In [14]:
# Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

0.483038698862


In [26]:
np.corrcoef(all_data['target'].values, encoded_feature)

array([[ 1.       ,  0.4830387],
       [ 0.4830387,  1.       ]])

#### Method 2

In [None]:
# Note: do not run this cell; the same steps are executed one cell at a time below

'''
   Unlike `.target.mean()`, `transform` returns a Series
   with the same index as `all_data`.
   This single line of code is equivalent to the first two lines of Method 1.
'''
all_data['item_target_enc'] = all_data.groupby('item_id')['target'].transform('mean')

# Fill NaNs
all_data['item_target_enc'].fillna(0.3343, inplace=True) 

# Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])
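
To make the equivalence of the two methods concrete, here is a small sketch on a hypothetical toy frame (`toy`, `enc_map` and `enc_transform` are illustrative names, not part of the assignment). Both routes should yield the same per-row encoding.

```python
import pandas as pd

# Hypothetical toy data: two items with known means (1.0 and 4.0)
toy = pd.DataFrame({'item_id': [1, 1, 2, 2, 2],
                    'target':  [0.0, 2.0, 3.0, 3.0, 6.0]})

# Method 1: build the {item_id: mean} mapping, then map it onto the column
means = toy.groupby('item_id')['target'].mean()
enc_map = toy['item_id'].map(means)

# Method 2: transform broadcasts each group's mean back to its rows
enc_transform = toy.groupby('item_id')['target'].transform('mean')

print((enc_map == enc_transform).all())
```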

In [31]:
all_data.groupby('item_id')['target'].mean()

item_id
0        0.020000
1        0.023810
2        0.019802
3        0.019802
4        0.020000
5        0.020000
6        0.020000
7        0.020000
8        0.019802
9        0.019608
10       0.020000
11       0.020000
12       0.021739
13       0.020000
14       0.020000
15       0.020000
16       0.020000
17       0.020000
18       0.019608
19       0.022222
20       0.019608
21       0.020000
22       0.021277
23       0.021277
24       0.021277
25       0.019608
26       0.019231
27       0.056834
28       0.141176
29       0.037383
           ...   
22140    0.191461
22141    0.276042
22142    0.069930
22143    1.930804
22144    0.266667
22145    0.650407
22146    0.040334
22147    0.082100
22148    0.021739
22149    0.058923
22150    0.088821
22151    1.159375
22152    0.155109
22153    0.026627
22154    0.109870
22155    0.093671
22156    0.029197
22157    0.021978
22158    0.022727
22159    0.172414
22160    0.097030
22161    0.022222
22162    1.556793
22163    0.581395
22

In [32]:
all_data.groupby('item_id')['target'].transform('mean')

139255      0.022222
141495      0.056834
144968      0.141176
142661      0.037383
138947      1.319042
138948      0.527112
138949      0.146108
139247      0.944681
142672      0.070943
142065      0.085828
139208      0.070596
142670      0.032847
139207      0.086773
138950      0.110971
143764      0.058450
141505      0.076040
139199      0.069005
138952      0.116646
139176      0.044444
138951      0.148802
139177      0.067236
139178      0.119798
139179      0.073126
143769      0.112575
142671      0.052989
144539      0.098361
139180      0.157485
138953      0.071168
144265      0.043796
141744      0.056250
              ...   
10772600    0.830357
10770510    0.140000
10769953    0.502326
10769955    1.362817
10768833    0.163556
10769961    0.370044
10770625    0.159066
10769956    0.699195
10771598    1.937198
10767854    2.173392
10768086    3.324716
10768087    0.751576
10768088    1.317150
10767847    2.267442
10769954    1.003861
10767848    6.594595
10767849    0

In [15]:
all_data['item_target_enc'] = all_data.groupby('item_id')['target'].transform('mean')

In [16]:
# Fill NaNs
all_data['item_target_enc'].fillna(0.3343, inplace=True)

In [17]:
# Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

0.483038698862


See the printed value? It is the correlation coefficient between the target variable and your new encoded feature. For each encoding you implement, you need to **compute this correlation coefficient** and **submit it to Coursera**.

In [18]:
grader = Grader()

# 1. KFold scheme

Explained starting at 41 sec of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

**Now it's your turn to write the code!** 

You may use 'Regularization' video as a reference for all further tasks.

First, implement KFold scheme with five folds. Use KFold(5) from sklearn.model_selection. 

1. Split your data in 5 folds with `sklearn.model_selection.KFold` with `shuffle=False` argument.
2. Iterate through folds: use all but the current fold to calculate the mean target for each level of `item_id`, and fill the current fold.

    *  See **Method 1** from the example implementation. In particular, learn what the built-in `map` and `pd.Series.map` functions do. They are handy in many situations.
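
The two steps above can be sketched on a toy frame as follows. This is only an illustration under stated assumptions: `kfold_mean_encode` is a hypothetical helper name, the data are made up, and three folds are used so the result can be checked by hand.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_mean_encode(df, col, target, n_splits=5, fill_value=0.3343):
    """Out-of-fold mean encoding of `col` (hypothetical helper)."""
    enc = pd.Series(np.nan, index=df.index, dtype='float64')
    kf = KFold(n_splits=n_splits, shuffle=False)
    for tr_ind, val_ind in kf.split(df):
        tr, val = df.iloc[tr_ind], df.iloc[val_ind]
        # Means are computed on the other folds only
        means = tr.groupby(col)[target].mean()
        enc.iloc[val_ind] = val[col].map(means).values
    # Categories absent from the other folds have no mean and get the fill value
    return enc.fillna(fill_value)

toy = pd.DataFrame({'item_id': [1, 1, 2, 2, 1, 2],
                    'target':  [1.0, 3.0, 2.0, 4.0, 5.0, 6.0]})
toy['kfold_enc'] = kfold_mean_encode(toy, 'item_id', 'target', n_splits=3)
print(toy)
```

With `shuffle=False` the folds are contiguous blocks of rows, so the encoding depends on the row order of the frame.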

In [37]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,item_target_enc
139255,0,19,0,0.0,0.022222
141495,0,27,0,0.0,0.056834
144968,0,28,0,0.0,0.141176
142661,0,29,0,0.0,0.037383
138947,0,32,0,6.0,1.319042


# References
[K-Fold Target Encoding](https://medium.com/@pouryaayria/k-fold-target-encoding-dfe9a594874b)

In [19]:
from sklearn.model_selection import KFold

#  Split your data in 5 folds with sklearn.model_selection.KFold with shuffle=False argument.
# random_state has no effect when shuffle=False, and recent scikit-learn versions reject the combination
kf = KFold(n_splits=5, shuffle=False)

In [20]:
all_data['Kfold_target_enc'] = np.nan

In [21]:
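for tr_ind, val_ind in kf.split(all_data):
    X_tr, X_val = all_data.iloc[tr_ind], all_data.iloc[val_ind]
    # mean target per item_id, computed on the other folds only
    fold_means = X_tr.groupby('item_id')['target'].mean()
    all_data.loc[all_data.index[val_ind], 'Kfold_target_enc'] = X_val['item_id'].map(fold_means)

# items unseen in the other folds have no mean; fill them once, after the loop
all_data['Kfold_target_enc'] = all_data['Kfold_target_enc'].fillna(0.3343)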

In [41]:
all_data.describe()

Unnamed: 0,shop_id,item_id,date_block_num,target,item_target_enc,Kfold_target_enc
count,10913850.0,10913850.0,10913850.0,10913850.0,10913850.0,10913850.0
mean,31.1872,11309.26,14.97334,0.3342731,0.3342731,0.3514806
std,17.34959,6209.978,9.495618,3.417243,1.650661,1.727671
min,0.0,0.0,0.0,-22.0,-0.06043956,-0.1
25%,16.0,5976.0,7.0,0.0,0.06092715,0.05762305
50%,30.0,11391.0,14.0,0.0,0.1266491,0.1288855
75%,46.0,16605.0,23.0,0.0,0.2905983,0.3343
max,59.0,22169.0,33.0,2253.0,129.4976,149.5291


In [22]:
# YOUR CODE GOES HERE
encoded_feature = all_data['Kfold_target_enc'].values

# You will need to compute correlation like that
corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('KFold_scheme', corr)

0.41645907128
Current answer for task KFold_scheme is: 0.41645907128


In [23]:
all_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10913850 entries, 139255 to 10770511
Data columns (total 6 columns):
shop_id             int32
item_id             int32
date_block_num      int32
target              float64
item_target_enc     float64
Kfold_target_enc    float64
dtypes: float64(3), int32(3)
memory usage: 778.0 MB


In [24]:
# Drop the column to keep memory usage down
all_data = all_data.drop('Kfold_target_enc', axis = 1)

In [25]:
all_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10913850 entries, 139255 to 10770511
Data columns (total 5 columns):
shop_id            int32
item_id            int32
date_block_num     int32
target             float64
item_target_enc    float64
dtypes: float64(2), int32(3)
memory usage: 694.7 MB


# 2. Leave-one-out scheme

Now, implement the leave-one-out scheme. Note that if you simply set the number of folds to the number of samples and run the code from the **KFold scheme**, you will probably wait for a very long time. 

To implement a faster version, note that to calculate the mean target value using all the objects with the same `item_id` but one *given object*, you can:

1. Calculate the sum of the target values using all the objects.
2. Then subtract the target of the *given object* and divide the resulting value by `n_objects - 1`. 

Note that you do not need to perform step `1.` for every object, and step `2.` can be implemented without any `for` loop.

It is most convenient to use the `.transform` function, as in **Method 2**.
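
One possible way to follow the two steps above without a Python-level function is to broadcast the group sum and group count with `transform` and do the arithmetic on whole columns. This is a sketch on hypothetical toy data; the names `group_sum`, `group_cnt` and `loo_enc` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({'item_id': [1, 1, 1, 2, 2, 2, 2],
                    'target':  [3, 4, 5, 6, 7, 8, 9]})

grp = toy.groupby('item_id')['target']
group_sum = grp.transform('sum')     # step 1, computed once per group
group_cnt = grp.transform('count')   # n_objects within the group

# Step 2 without a loop: remove the row's own target, divide by n_objects - 1.
# A group with a single row gives 0/0 = NaN, which would need filling afterwards.
toy['loo_enc'] = (group_sum - toy['target']) / (group_cnt - 1)
print(toy)
```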

In [26]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,item_target_enc
139255,0,19,0,0.0,0.022222
141495,0,27,0,0.0,0.056834
144968,0,28,0,0.0,0.141176
142661,0,29,0,0.0,0.037383
138947,0,32,0,6.0,1.319042


# Test with small data

In [45]:
test = pd.DataFrame({'item_id': [1, 1, 1, 2, 2, 2, 2], 'target': [3, 4, 5, 6, 7, 8, 9]})
test

Unnamed: 0,item_id,target
0,1,3
1,1,4
2,1,5
3,2,6
4,2,7
5,2,8
6,2,9


In [27]:
# Function that takes one group (a Series) and returns its leave-one-out target encoding
loo = lambda x: (x.sum() - x)/(len(x) - 1)

In [47]:
test.groupby('item_id')['target'].transform(loo)

0    4.500000
1    4.000000
2    3.500000
3    8.000000
4    7.666667
5    7.333333
6    7.000000
Name: target, dtype: float64

# ---------------------------

In [28]:
all_data['Loo_target_enc'] = all_data.groupby('item_id')['target'].transform(loo)

In [29]:
# YOUR CODE GOES HERE
encoded_feature = all_data['Loo_target_enc'].values

corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Leave-one-out_scheme', corr)

0.480384831129
Current answer for task Leave-one-out_scheme is: 0.480384831129


In [30]:
# Drop the column to keep memory usage down
all_data = all_data.drop('Loo_target_enc', axis = 1)

# 3. Smoothing

Explained starting at 4:03 of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

Next, implement smoothing scheme with $\alpha = 100$. Use the formula from the first slide in the video and $0.3343$ as `globalmean`. Note that `nrows` is the number of objects that belong to a certain category (not the number of rows in the dataset).

# video 4:03 formula
$$\frac{\text{mean}(\text{target}) \cdot \text{nrows} + \text{globalmean} \cdot \alpha}{\text{nrows} + \alpha}$$
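
As a rough illustration of the formula, the sketch below applies it column-wise to a hypothetical toy frame. Here `globalmean` is taken from the toy data (6.0) purely so the arithmetic can be checked by hand; the assignment itself asks for `0.3343`. For `item_id` 1 the result is (4 · 3 + 6 · 100) / (3 + 100) ≈ 5.9417.

```python
import pandas as pd

toy = pd.DataFrame({'item_id': [1, 1, 1, 2, 2, 2, 2],
                    'target':  [3, 4, 5, 6, 7, 8, 9]})
alpha = 100
globalmean = toy['target'].mean()   # 6.0 for this toy frame (assumption for the demo)

grp = toy.groupby('item_id')['target']
cat_mean = grp.transform('mean')    # mean(target) within the category
nrows = grp.transform('count')      # rows in the category, not in the frame

toy['smooth_enc'] = (cat_mean * nrows + globalmean * alpha) / (nrows + alpha)
print(toy)
```

With `alpha` much larger than `nrows`, every category is pulled strongly towards `globalmean`, which is why the two toy categories end up so close to 6.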

In [31]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,item_target_enc
139255,0,19,0,0.0,0.022222
141495,0,27,0,0.0,0.056834
144968,0,28,0,0.0,0.141176
142661,0,29,0,0.0,0.037383
138947,0,32,0,6.0,1.319042


# Test with small data

In [32]:
test = pd.DataFrame({'item_id': [1, 1, 1, 2, 2, 2, 2], 'target': [3, 4, 5, 6, 7, 8, 9]})
test

Unnamed: 0,item_id,target
0,1,3
1,1,4
2,1,5
3,2,6
4,2,7
5,2,8
6,2,9


In [33]:
globalmean = test.target.mean()
alpha = 100

In [32]:
# Function that takes one group (a Series) and returns its smoothed target encoding
smooth = lambda x: (x.mean() * len(x) + globalmean * alpha) / (len(x) + alpha)

In [35]:
test.groupby('item_id')['target'].transform(smooth)

0    5.941748
1    5.941748
2    5.941748
3    6.057692
4    6.057692
5    6.057692
6    6.057692
Name: target, dtype: float64

# ----------------------

In [36]:
all_data.target.mean()

0.33427305671234259

In [33]:
globalmean = 0.3343
alpha = 100

In [34]:
all_data['Smooth_target_enc'] = all_data.groupby('item_id')['target'].transform(smooth)

In [35]:
# YOUR CODE GOES HERE
encoded_feature = all_data['Smooth_target_enc'].values

corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Smoothing_scheme', corr)

0.48181987971
Current answer for task Smoothing_scheme is: 0.48181987971


In [36]:
# Drop the column to keep memory usage down
all_data = all_data.drop('Smooth_target_enc', axis = 1)

# 4. Expanding mean scheme

Explained starting at 5:50 of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

Finally, implement the *expanding mean* scheme. It is basically already implemented for you in the video, but you can challenge yourself and try to implement it yourself. You will need [`cumsum`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.cumsum.html) and [`cumcount`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.cumcount.html) functions from pandas.
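
Because the expanding mean only looks at earlier rows of the same category, it is one of the schemes whose result depends on row order. The sketch below, with a hypothetical helper `expanding_mean_encode` and made-up data, encodes the same three rows in forward and in reversed order to show the difference.

```python
import pandas as pd

def expanding_mean_encode(df, col, target, fill_value=0.3343):
    """Mean of the target over earlier rows of the same category (hypothetical helper)."""
    grp = df.groupby(col)[target]
    prev_sum = grp.cumsum() - df[target]   # sum over previous rows only
    prev_cnt = grp.cumcount()              # number of previous rows
    # The first row of each category is 0/0 = NaN and receives the fill value
    return (prev_sum / prev_cnt).fillna(fill_value)

toy = pd.DataFrame({'item_id': [1, 1, 1], 'target': [3.0, 4.0, 5.0]})
forward = expanding_mean_encode(toy, 'item_id', 'target')
backward = expanding_mean_encode(toy.iloc[::-1], 'item_id', 'target')

print(forward.tolist())                  # rows processed top to bottom
print(backward.sort_index().tolist())    # rows processed bottom to top
```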

# note
Expanding mean encoding is built into CatBoost.

# test with small data

In [42]:
test = pd.DataFrame({'item_id': [1, 1, 1, 2, 2, 2, 2], 'target': [3, 4, 5, 6, 7, 8, 9]})
test

Unnamed: 0,item_id,target
0,1,3
1,1,4
2,1,5
3,2,6
4,2,7
5,2,8
6,2,9


In [43]:
test.groupby('item_id')['target'].cumsum()

0     3
1     7
2    12
3     6
4    13
5    21
6    30
Name: target, dtype: int64

In [45]:
cumsum = test.groupby('item_id')['target'].cumsum() - test['target']

In [47]:
test.groupby('item_id').cumcount()

0    0
1    1
2    2
3    0
4    1
5    2
6    3
dtype: int64

In [48]:
cumcnt = test.groupby('item_id').cumcount()

In [49]:
test['Expanding_target_enc'] = cumsum/cumcnt

In [50]:
test

Unnamed: 0,item_id,target,Expanding_target_enc
0,1,3,
1,1,4,3.0
2,1,5,3.5
3,2,6,
4,2,7,6.0
5,2,8,6.5
6,2,9,7.0


In [52]:
test['Expanding_target_enc'].fillna(0.3343, inplace = True)

In [53]:
np.corrcoef(test['target'].values, test['Expanding_target_enc'].values)[0][1]

0.81206508905724351

# -----------------------

In [37]:
cumsum = all_data.groupby('item_id')['target'].cumsum() - all_data['target']

In [38]:
cumcnt = all_data.groupby('item_id').cumcount()

In [39]:
all_data['Expanding_target_enc'] = cumsum/cumcnt

In [40]:
all_data['Expanding_target_enc'].fillna(0.3343, inplace = True)

In [41]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,item_target_enc,Expanding_target_enc
139255,0,19,0,0.0,0.022222,0.3343
141495,0,27,0,0.0,0.056834,0.3343
144968,0,28,0,0.0,0.141176,0.3343
142661,0,29,0,0.0,0.037383,0.3343
138947,0,32,0,6.0,1.319042,0.3343


In [42]:
# YOUR CODE GOES HERE
encoded_feature = all_data['Expanding_target_enc'].values

corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Expanding_mean_scheme', corr)

0.502524521108
Current answer for task Expanding_mean_scheme is: 0.502524521108


## Authorization & Submission
To submit assignment parts to the Coursera platform, please enter your e-mail and token into the variables below. You can generate a token on this programming assignment's page. Note: the token expires 30 minutes after generation.

In [43]:
STUDENT_EMAIL = ''  # EMAIL HERE
STUDENT_TOKEN = ''  # TOKEN HERE
grader.status()

You want to submit these numbers:
Task KFold_scheme: 0.41645907128
Task Leave-one-out_scheme: 0.480384831129
Task Smoothing_scheme: 0.48181987971
Task Expanding_mean_scheme: 0.502524521108


In [44]:
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)

Submitted to Coursera platform. See results on assignment page!
