# Preprocessing 

In [1]:
import numpy as np
import pandas as pd

## Data 
- follower_count: from the `user` table.
- friend_count: from the `user` table.
- friend_avg_inflow: from the `repost` and `friend` tables.
$\frac{\sum_i \mathrm{retweet}(i) \times \mathrm{inflow}(i)}{\sum_i \mathrm{retweet}(i)}$, where $\mathrm{retweet}(i)$ is the number of retweets from friend u' to user u and $\mathrm{inflow}(i)$ is the number of tweets that friend u' posts in a period of time.
- friend_avg_outflow: from the `repost` and `friend` tables.
$\frac{\sum_i \mathrm{retweet}(i) \times \mathrm{retweet\_rate}(i)}{\sum_i \mathrm{retweet}(i)}$, where $\mathrm{retweet\_rate}(i)$ is the number of retweets from user u to follower u'' in a period of time.
- inflow rate: the number of posts a user receives in a period of time (the sum of the outflow rates of the user's friends).
- outflow rate: the number of posts a user sends in a period of time; from the `status` table.

In summary, we need the number of retweets from u' to u, the retweet rate from u' to u, and the status rate of every user u.
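The retweet-weighted average above can be illustrated on a toy edge list. This is only a sketch: the frame `edges`, its column names, and all values are hypothetical and are not taken from the actual tables.

```python
import pandas as pd

# Toy data (hypothetical values): one row per (user, friend) edge.
edges = pd.DataFrame({
    "user_id":   [1, 1, 2],
    "friend_id": [10, 11, 10],
    "retweet":   [3, 1, 0],           # retweets from that friend to the user
    "inflow":    [20.0, 4.0, 20.0],   # posts by that friend in the period
})

edges["weighted"] = edges["retweet"] * edges["inflow"]
sums = edges.groupby("user_id")[["weighted", "retweet"]].sum()

# Weighted mean; users with no retweets give 0/0 = NaN, which is set to 0.
friend_avg_inflow = (sums["weighted"] / sums["retweet"]).fillna(0.0)
print(friend_avg_inflow)
```

For user 1 the result is (3 × 20 + 1 × 4) / 4; user 2 has no retweets and falls back to 0.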

## Cascades 
$T_{i,j}$ 
Every repost in the cascade is recorded as a tuple (parent, user, time).

*This needs complete cascade info.*

## Survival analysis 
Take a few sample cascades and see if the size growth matches survival functions.

~~Implementation: take `repost` table and join with `status` table for the time. Sort by time.~~
`status_update` tracks the growth of cascades.
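One possible way to compare cascade growth with a survival function is to compute the empirical survival curve of repost delays for a single cascade. The sketch below assumes the delays (seconds since the root post) are already available; the helper `empirical_survival` and the numbers are hypothetical.

```python
import numpy as np

def empirical_survival(delays, grid):
    """Fraction of reposts whose delay exceeds each value in grid."""
    delays = np.sort(np.asarray(delays, dtype=float))
    # searchsorted(..., side="right") counts delays <= t for each t in grid
    n_le = np.searchsorted(delays, grid, side="right")
    return 1.0 - n_le / delays.size

# Hypothetical repost delays for one cascade.
delays = [5, 30, 30, 120, 600]
grid = np.array([0, 30, 100, 600])
surv = empirical_survival(delays, grid)
print(surv)
```

The cascade size at time t is then `len(delays) * (1 - surv)`, which can be plotted against a fitted parametric survival function.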

In [2]:
repost = pd.read_csv('data/repost.csv',header=None,names= ['original_status_id','status_id','user_id','topic','original_user_id'])

In [3]:
repost.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614637 entries, 0 to 614636
Data columns (total 5 columns):
original_status_id    614637 non-null int64
status_id             614637 non-null int64
user_id               614637 non-null int64
topic                 614637 non-null int64
original_user_id      614637 non-null int64
dtypes: int64(5)
memory usage: 23.4 MB


In [20]:
retweet = pd.read_csv('data/retweet.csv',header=None,names = ['original_status_id','status_id','user_id'])

In [24]:
retweet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 662870 entries, 0 to 662869
Data columns (total 3 columns):
original_status_id    662870 non-null int64
status_id             662870 non-null int64
user_id               662870 non-null int64
dtypes: int64(3)
memory usage: 15.2 MB


The overlap between the `friends` table and the `repost` table is very small.
Here we assume that every repost goes from a friend to the user.

Set a multilevel index on `user_id` and `original_user_id`.

In [21]:
status_new = pd.read_csv('data/status_new.csv',header=None,
                     names= ['status_id','retweet_count','favorite_count','created_at','text','catch_time','user_id'])

In [22]:
status_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1573155 entries, 0 to 1573154
Data columns (total 7 columns):
status_id         1573155 non-null int64
retweet_count     1573155 non-null int64
favorite_count    1573155 non-null int64
created_at        1573155 non-null object
text              1573155 non-null object
catch_time        1573155 non-null object
user_id           1573155 non-null int64
dtypes: int64(4), object(3)
memory usage: 84.0+ MB


In [14]:
status = pd.read_csv('data/status.csv',header=None,
                     names= ['status_id','retweet_count','favorite_count','created_at','text','catch_time','user_id'])

In [15]:
status.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218834 entries, 0 to 2218833
Data columns (total 7 columns):
status_id         int64
retweet_count     int64
favorite_count    int64
created_at        object
text              object
catch_time        object
user_id           int64
dtypes: int64(4), object(3)
memory usage: 118.5+ MB


In [23]:
statusall = pd.concat([status, status_new], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

In [25]:
repostWithTime = pd.merge(retweet,status,how='inner',on='status_id') 

In [26]:
repostWithTime.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 988 entries, 0 to 987
Data columns (total 9 columns):
original_status_id    988 non-null int64
status_id             988 non-null int64
user_id_x             988 non-null int64
retweet_count         988 non-null int64
favorite_count        988 non-null int64
created_at            988 non-null object
text                  988 non-null object
catch_time            988 non-null object
user_id_y             988 non-null int64
dtypes: int64(6), object(3)
memory usage: 77.2+ KB


*Merging the reposts (`retweet`) with `status` yields only 988 rows, so the repost times are still unknown.*
*The repost rate is therefore unknown as well.*
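To see how many reposts actually find a matching status row, a left merge with `indicator=True` can be used instead of an inner merge. The sketch below runs on toy stand-ins; `retweet_toy`, `status_toy` and their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the `retweet` and `status` tables (hypothetical ids).
retweet_toy = pd.DataFrame({"status_id": [1, 2, 3, 4], "user_id": [7, 8, 9, 7]})
status_toy = pd.DataFrame({"status_id": [2, 4, 5], "created_at": ["t2", "t4", "t5"]})

# A left merge keeps every retweet; the `_merge` column records
# whether a matching status row was found ("both") or not ("left_only").
check = pd.merge(retweet_toy, status_toy, how="left", on="status_id", indicator=True)
coverage = (check["_merge"] == "both").mean()
print(check["_merge"].value_counts())
print(f"share of retweets with a timestamp: {coverage:.2f}")
```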

In [27]:
repost_count = repost.groupby(by=['user_id','original_user_id']).size()

In [31]:
rc = pd.DataFrame(repost_count)
rc.reset_index(inplace=True)
rc.columns= ['user_id','original_user_id','repostn']

In [32]:
rc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 605409 entries, 0 to 605408
Data columns (total 3 columns):
user_id             605409 non-null int64
original_user_id    605409 non-null int64
repostn             605409 non-null int64
dtypes: int64(3)
memory usage: 13.9 MB


In [36]:
status_count = statusall.groupby(by=['user_id']).size()

In [38]:
sc = pd.DataFrame(status_count)
sc.reset_index(inplace=True)
sc.columns = ['user_id','statusn']

We randomly collect statuses from the Twitter firehose. If the sampling is uniform, the number of statuses collected from a user is proportional to that user's post rate, up to sampling noise.
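The proportionality claim can be checked with a small simulation. This is a sketch under the stated assumption of uniform, independent sampling; the sampling probability `p` and the post counts are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true numbers of posts for three users in the period.
true_posts = np.array([1_000, 10_000, 100_000])
p = 0.1  # assumed uniform sampling probability (unknown in practice)

# Each post is kept independently with probability p.
sampled = rng.binomial(true_posts, p)

# Sampled counts are roughly p * true_posts, so ratios between users
# are approximately preserved.
print(sampled / true_posts)
```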

In [39]:
friends = pd.read_csv('data/friends.csv',header=None,names = ['user_id','friend_id'])

In [40]:
inflow = pd.merge(friends,sc,left_on='friend_id',right_on='user_id',how='left')

In [43]:
inflow.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28783177 entries, 0 to 28783176
Data columns (total 3 columns):
user_id_x    int64
friend_id    int64
statusn      float64
dtypes: float64(1), int64(2)
memory usage: 878.4 MB


In [42]:
inflow.drop('user_id_y',axis=1,inplace=True)

In [44]:
inflow.columns=['user_id','friend_id','statusn']

In [45]:
inflow2 = pd.merge(inflow,rc,left_on=['user_id','friend_id'],right_on=['user_id','original_user_id'],how='left')

In [46]:
inflow2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28783177 entries, 0 to 28783176
Data columns (total 5 columns):
user_id             int64
friend_id           int64
statusn             float64
original_user_id    float64
repostn             float64
dtypes: float64(3), int64(2)
memory usage: 1.3 GB


In [47]:
inflow2.drop('original_user_id',axis=1,inplace=True)
inflow2.loc[inflow2.statusn.isnull(),'statusn']=0
inflow2.loc[inflow2.repostn.isnull(),'repostn']=0

In [49]:
inflow2['scxrepost'] = inflow2['statusn'] * inflow2['repostn']

In [68]:
inflow2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28783177 entries, 0 to 28783176
Data columns (total 5 columns):
user_id      int64
friend_id    int64
statusn      float64
repostn      float64
scxrepost    float64
dtypes: float64(3), int64(2)
memory usage: 1.3 GB


In [70]:
friend_avg_inflow = inflow2.groupby('user_id').agg({
        'statusn':np.sum,
        'repostn':np.sum,
        'scxrepost':np.sum,
    })

In [86]:
friend_avg_inflow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80480 entries, 0 to 80479
Data columns (total 5 columns):
user_id                80480 non-null int64
inflow                 80480 non-null float64
repost_sum             80480 non-null float64
status_weighted_sum    80480 non-null float64
avg_inflow             80480 non-null float64
dtypes: float64(4), int64(1)
memory usage: 3.1 MB


After grouping by `user_id`, the 28,783,177 friend edges collapse to 80,480 users. This implies that only a portion of users are included in the `friends` table.

In [73]:
friend_avg_inflow.reset_index(inplace=True)

friend_avg_inflow.columns = ['user_id','inflow','repost_sum','status_weighted_sum']

In [74]:
friend_avg_inflow['avg_inflow'] = friend_avg_inflow['status_weighted_sum']/friend_avg_inflow['repost_sum']  # weighted mean: sum(retweet*inflow)/sum(retweet)

In [75]:
friend_avg_inflow.loc[friend_avg_inflow.repost_sum==0,'avg_inflow']=0  # 0/0 -> 0 for users with no reposts

In [76]:
friend_avg_inflow.head()

Unnamed: 0,user_id,inflow,repost_sum,status_weighted_sum,avg_inflow
0,1000,47.0,0.0,0.0,0.0
1,1001,2.0,0.0,0.0,0.0
2,10006,136.0,0.0,0.0,0.0
3,10026,106.0,0.0,0.0,0.0
4,10051,29.0,0.0,0.0,0.0


In [77]:
sc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2026674 entries, 0 to 2026673
Data columns (total 2 columns):
user_id    int64
statusn    int64
dtypes: int64(2)
memory usage: 30.9 MB


In [78]:
temp = pd.merge(friend_avg_inflow,sc,on='user_id',how='left')

In [80]:
friend_avg_inflow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80480 entries, 0 to 80479
Data columns (total 5 columns):
user_id                80480 non-null int64
inflow                 80480 non-null float64
repost_sum             80480 non-null float64
status_weighted_sum    80480 non-null float64
avg_inflow             80480 non-null float64
dtypes: float64(4), int64(1)
memory usage: 3.1 MB


In [83]:
temp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 80480 entries, 0 to 80479
Data columns (total 4 columns):
user_id       80480 non-null int64
inflow        80480 non-null float64
avg_inflow    80480 non-null float64
outflow       80480 non-null float64
dtypes: float64(3), int64(1)
memory usage: 3.1 MB


In [81]:
temp.loc[temp.statusn.isnull(),'statusn']=0

In [82]:
temp.drop(['repost_sum','status_weighted_sum'],axis=1,inplace=True)
temp.columns = ['user_id','inflow','avg_inflow','outflow']

In [85]:
user1 = pd.read_csv('data/users.csv',header=None,
                    names = ['user_id','friend_count','follower_count','listed_count',
                             'status_count','favorites_count','created_at','name','verified'],
                   usecols=[0,1,2])
user2 = pd.read_csv('data/users_new.csv',header=None,
                    names = ['user_id','friend_count','follower_count','listed_count',
                             'status_count','favorites_count','created_at','name','verified'],
                   usecols=[0,1,2])
users = pd.concat([user1, user2], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

In [87]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2138359 entries, 0 to 2138358
Data columns (total 3 columns):
user_id           int64
friend_count      int64
follower_count    int64
dtypes: int64(3)
memory usage: 48.9 MB


In [88]:
user_data = pd.merge(users,temp,on='user_id',how='left')

In [89]:
user_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2138359 entries, 0 to 2138358
Data columns (total 6 columns):
user_id           int64
friend_count      int64
follower_count    int64
inflow            float64
avg_inflow        float64
outflow           float64
dtypes: float64(3), int64(3)
memory usage: 114.2 MB


In [90]:
user_data.loc[user_data.inflow.isnull(),'inflow']=0
user_data.loc[user_data.avg_inflow.isnull(),'avg_inflow']=0
user_data.loc[user_data.outflow.isnull(),'outflow']=0

In [92]:
user_data.to_csv('data/micro-macro-user-features.csv')

## Testing on another dataset: Influence Locality

In [15]:
diffusion = pd.read_csv('../data/micro/diffusion-10k-training.csv',
                        parse_dates=['time'],infer_datetime_format=True)


In [17]:
diffusion.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1157699 entries, 0 to 1157698
Data columns (total 4 columns):
root_status_id    1157699 non-null int64
root_user_id      1157699 non-null int64
time              1157699 non-null datetime64[ns]
user_id           1157699 non-null int64
dtypes: datetime64[ns](1), int64(3)
memory usage: 35.3 MB


In [18]:
diffusion.columns = ['original_status_id','original_user_id','time','user_id']

diffusion.drop_duplicates(subset=['original_status_id','user_id'],inplace=True)

In [20]:
diffusion.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1157699 entries, 0 to 1157698
Data columns (total 4 columns):
original_status_id    1157699 non-null int64
original_user_id      1157699 non-null int64
time                  1157699 non-null datetime64[ns]
user_id               1157699 non-null int64
dtypes: datetime64[ns](1), int64(3)
memory usage: 44.2 MB


In [56]:
from datetime import timedelta

In [27]:
def f(group):
    n = len(group)  # number of reposts in the group (group.size counts cells, not rows)
    return pd.DataFrame({
            'count':n,
            'rate':(group['time'].max()-group['time'].min())/n,
            'original_user_id':group['original_user_id'],
            'user_id':group['user_id']
        })


In [28]:
# retweet = diffusion.groupby(['original_user_id','user_id']).apply(f) 
# very slow performance

KeyboardInterrupt: 

In [91]:
grouped = diffusion.groupby(['original_user_id','user_id'])

count = grouped.size()

max_time = grouped.time.max()

min_time = grouped.time.min()

rate = count.div(max_time.sub(min_time).dt.total_seconds())  # reposts per second; zero span gives inf

In [92]:
retweet = pd.DataFrame({
        'rate':rate,
        'count':count,
    },index=count.index)

In [93]:
retweet.info() # original_user_id | user_id | count | rate 

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1139792 entries, (36, 36) to (1786772, 1786855)
Data columns (total 2 columns):
count    1139792 non-null int64
rate     1139792 non-null float64
dtypes: float64(1), int64(1)
memory usage: 26.1+ MB


In [108]:
retweet.loc[np.isinf(retweet['rate']),'rate']=0.0
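The rate computation and the handling of infinite values can be illustrated on a toy repost log. This is a minimal sketch; the frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy repost log (hypothetical ids and times).
toy = pd.DataFrame({
    "original_user_id": [1, 1, 1, 2],
    "user_id":          [5, 5, 5, 6],
    "time": pd.to_datetime([
        "2012-01-01 00:00:00", "2012-01-01 00:00:10",
        "2012-01-01 00:00:20", "2012-01-01 00:00:00",
    ]),
})

g = toy.groupby(["original_user_id", "user_id"])["time"]
span = (g.max() - g.min()).dt.total_seconds()
rate = g.size() / span              # reposts per second; a zero span gives inf
rate = rate.replace(np.inf, 0.0)    # pairs with a single repost get rate 0
print(rate)
```

A pair with only one repost has a zero time span, so its rate is undefined; setting it to 0 is the convention used in the cell above.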

In [109]:
retweet.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,count,rate
original_user_id,user_id,Unnamed: 2_level_1,Unnamed: 3_level_1
36,36,1,0.0
52,52,1,0.0
52,256469,1,0.0
52,331685,1,0.0
52,487544,1,0.0


In [111]:
g = diffusion.groupby('user_id')
outflow = g.size().div(g.time.max().sub(g.time.min()).dt.total_seconds())

outflow.loc[np.isinf(outflow)]=0.0

In [114]:
A = retweet.reset_index()
B = pd.DataFrame(outflow).reset_index()
tmp = A.merge(B,left_on='original_user_id',right_on='user_id',how='inner')

In [115]:
tmp.head()

Unnamed: 0,original_user_id,user_id_x,count,rate,user_id_y,0
0,36,36,1,0.0,36,1.836699e-07
1,52,52,1,0.0,52,2.453959e-07
2,52,256469,1,0.0,52,2.453959e-07
3,52,331685,1,0.0,52,2.453959e-07
4,52,487544,1,0.0,52,2.453959e-07


In [None]:
tmp.drop('user_id_y',axis=1,inplace=True)
tmp.columns = ['original_user_id','user_id','count','rate','inflow']

tmp['cxo'] = tmp['count'] * tmp['inflow']

In [128]:
avg_inflow = tmp.groupby('user_id').aggregate({
        'cxo':np.sum,
        'count':np.sum,
        'inflow':np.sum,
    })

avg_inflow['avg_inflow'] = avg_inflow['cxo']/avg_inflow['count']

In [129]:
avg_inflow.head()

Unnamed: 0_level_0,cxo,inflow,count,avg_inflow
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,4.67419e-07,4.67419e-07,1,4.67419e-07
1,1.111496e-07,1.111496e-07,1,1.111496e-07
3,3.937098e-07,3.937098e-07,4,9.842745e-08
7,3.343089e-07,3.343089e-07,1,3.343089e-07
8,2.623796e-07,2.623796e-07,3,8.745988e-08


In [122]:
A['cxr'] = A['count'] * A['rate']
avg_outflow = A.groupby('original_user_id').aggregate({
        'cxr':np.sum,
        'count':np.sum,
    })
avg_outflow['avg_outflow'] = avg_outflow['cxr']/avg_outflow['count']

In [123]:
avg_outflow.head()

Unnamed: 0_level_0,cxr,count,avg_outflow
original_user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
36,0.0,1,0.0
52,0.0,7,0.0
62,2.208047e-07,2,1.104024e-07
112,0.0,1,0.0
124,0.0,9,0.0


In [None]:
user_detail = avg_inflow[['inflow','avg_inflow']].join(avg_outflow[['avg_outflow']],how='outer',)

user_detail.fillna(0,inplace=True)

outflow.name='outflow'

In [None]:
user_data = user_detail.join(outflow,how='outer')

user_data.reset_index(inplace=True)

In [147]:
user_data.columns=['user_id','inflow','avg_inflow','avg_outflow','outflow']

In [149]:
user_data.to_csv('../data/micro/user_detail.csv',index=False)

In [148]:
user_data.head()

Unnamed: 0,user_id,inflow,avg_inflow,avg_outflow,outflow
0,0,4.67419e-07,4.67419e-07,0.0,0.0
1,1,1.111496e-07,1.111496e-07,0.0,0.0
2,3,3.937098e-07,9.842745e-08,0.0,1.124817e-07
3,7,3.343089e-07,3.343089e-07,0.0,0.0
4,8,2.623796e-07,8.745988e-08,0.0,5.636523e-08


### Sampling and profiling

In [35]:
diffusion_subset = diffusion.sample(frac=0.01)

In [36]:
%timeit diffusion_subset.groupby(['original_user_id','user_id']).apply(lambda x: (x.time.max()-x.time.min())/x.size)

1 loop, best of 3: 14.8 s per loop


In [41]:
%timeit diffusion_subset.groupby(['original_user_id','user_id']).aggregate({'time':lambda x: np.max(x) - np.min(x)})

1 loop, best of 3: 12 s per loop


In [45]:
def get_tweet_rate(group):
    return pd.DataFrame({
            'count':group.size,
            'max':group.time.max(),
            'min':group.time.min(),
        },index=group.index)

In [46]:
%timeit diffusion_subset.groupby(['original_user_id','user_id']).apply(get_tweet_rate)

1 loop, best of 3: 54.6 s per loop
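The timings suggest that calling a Python function once per group dominates the cost. A sketch of the alternative, using only built-in groupby reductions, is shown below on hypothetical toy data; it produces the same time spans as the lambda-based version.

```python
import pandas as pd

toy = pd.DataFrame({
    "original_user_id": [1, 1, 2, 2, 2],
    "user_id":          [5, 5, 6, 6, 7],
    "time": pd.to_datetime([
        "2012-01-01 00:00:00", "2012-01-01 00:01:00",
        "2012-01-01 00:00:00", "2012-01-01 00:00:30",
        "2012-01-01 00:05:00",
    ]),
})
keys = ["original_user_id", "user_id"]

# Per-group Python callback, similar to the timed cells above.
slow = toy.groupby(keys)["time"].apply(lambda s: s.max() - s.min())

# Built-in reductions avoid one Python call per group.
stats = toy.groupby(keys)["time"].agg(["min", "max", "count"])
fast = stats["max"] - stats["min"]

print((fast.values == slow.values).all())
```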


## Model preparation

In [2]:
user_data = pd.read_csv('../data/micro/user_data-10k.csv')
cascade = pd.read_csv('../data/micro/diffusion-10k.csv',
                          parse_dates=['time'],
                          infer_datetime_format=True)

In [3]:
cascade.columns = ['original_status_id', 'original_user_id', 'time', 'user_id']

In [4]:
from pickle import load
with  open('../data/micro/messaged-10k', 'rb') as mfile:
    message_dict = load(mfile)

In [5]:
def map_index(df, col_list, dict_list):
    """Map each column in col_list through the matching dictionary in dict_list,
    drop rows whose value has no mapping, and index the result by col_list."""
    df_mapped = df.copy()
    for col, d in zip(col_list, dict_list):
        # d.get returns None for missing keys; those rows are dropped below
        df_mapped[col] = df_mapped[col].apply(lambda x: d.get(x))
        df_mapped = df_mapped.dropna(subset=[col])
    return df_mapped.set_index(col_list)
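A small usage example may make the behaviour of `map_index` clearer. The function is restated in compact form so the snippet runs on its own; the toy frame and the dictionary `id_map` are hypothetical.

```python
import pandas as pd

def map_index(df, col_list, dict_list):
    # Same logic as the cell above, restated so this example runs alone.
    out = df.copy()
    for col, d in zip(col_list, dict_list):
        out[col] = out[col].map(d.get)
        out = out.dropna(subset=[col])
    return out.set_index(col_list)

# Hypothetical raw status ids mapped to dense row numbers 0..n-1.
toy = pd.DataFrame({"original_status_id": [900, 901, 999], "user_id": [1, 2, 3]})
id_map = {900: 0, 901: 1}   # 999 has no entry, so its row is dropped

mapped = map_index(toy, ["original_status_id"], [id_map])
print(mapped)
```

Rows whose id is missing from the dictionary disappear, so the mapped frame can be shorter than the input.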

In [6]:
cascade_mapped = map_index(cascade,['original_status_id'],[message_dict])

In [7]:
cascade_mapped.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1537661 entries, 0 to 9999
Data columns (total 3 columns):
original_user_id    1537661 non-null int64
time                1537661 non-null datetime64[ns]
user_id             1537661 non-null int64
dtypes: datetime64[ns](1), int64(2)
memory usage: 46.9 MB


In [7]:
messagen = len(message_dict)
# usern = user_data.shape[0]
usern = cascade_mapped.original_user_id.max()+1 

In [8]:
from scipy import sparse
from datetime import timedelta

In [9]:
user_data.set_index('uid', inplace=True)

In [10]:
def to_sparse(t, m, n):  # create sparse matrix from multi-index
    i, j = list(zip(*t.index))
    data = t.values
    return sparse.coo_matrix((data, (i, j)), shape=(m, n))
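The following sketch shows what `to_sparse` produces for a small Series with a two-level index. The function is restated so the snippet is self-contained; the index names and values are hypothetical.

```python
import pandas as pd
from scipy import sparse

def to_sparse(t, m, n):
    # Restated from the cell above so the example is self-contained.
    i, j = list(zip(*t.index))
    return sparse.coo_matrix((t.values, (i, j)), shape=(m, n))

# Hypothetical counts indexed by (cascade row, user column).
idx = pd.MultiIndex.from_tuples([(0, 2), (1, 0), (1, 3)], names=["cascade", "user"])
counts = pd.Series([3, 1, 4], index=idx)

mat = to_sparse(counts, 2, 4)
dense = mat.toarray()
print(dense)
```

The first index level becomes the row and the second the column, so both levels must already be integers in the range of the requested shape.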

In [11]:
def get_delta_series(group):
    # return Series
    s = pd.Series((group['time'] - group['time'].min()).apply(lambda dt: dt / timedelta(seconds=1)),index=group.index)
    return s

In [12]:
user_matrix = user_data.values  # as_matrix() was removed in pandas 1.0

# cascade.sort_values(['original_status_id', 'original_user_id', 'time'], axis=0, inplace=True)
# M is the number of reposts for each <cascade, original_user> pair
m = cascade_mapped.reset_index().groupby(['original_status_id', 'original_user_id']).size()
m_sparse = to_sparse(m,messagen,usern)#m is sparse

C = cascade_mapped.reset_index().groupby(['original_status_id', 'original_user_id']).apply(get_delta_series)
C.index = C.index.droplevel(2)  # drop the original index
C = C[C != 0]

In [14]:
m.head()

original_status_id  original_user_id
0                   1500872               3
1                   153602               18
2                   519514              104
3                   490872              146
4                   1501242               5
dtype: int64

In [13]:
def G1(Lambda, K, C, m, cascaden,usern):
    logL = 0
    for i in range(cascaden):
        tmp =pd.DataFrame(C.loc[i, :]).reset_index().\
            drop('original_status_id', axis=1)\
            .set_index(['original_user_id'], append=True)
        tmp.index = tmp.index.swaplevel(i=0,j=1)
        coln = m.getrow(i).sum()
        c_sparse = to_sparse(tmp['time'],usern,coln )
        K_full = np.hstack((K,) * coln)
        p = np.power(c_sparse.toarray(),K_full).sum(axis=0)
        logL = logL + m.getrow(i).multiply(np.log(K).transpose()).sum() + \
               (np.log(c_sparse.toarray()).sum(axis=0) * (K-1)).sum()-\
               m.getrow(i).multiply(K.transpose()).multiply(np.log(Lambda.transpose())).sum() -\
            (np.power(Lambda,K*(-1))*p).sum()
        break
    return -logL
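Reading the terms accumulated in `G1`, the quantity appears to be the log-likelihood of a Weibull distribution with a per-user scale $\lambda_u$ and shape $k_u$. The notation below is introduced here for illustration and is an assumption: $m_u$ is the number of reposts attributed to user $u$ in one cascade and $t_{u,j}$ are the corresponding delays.

$$
\log L = \sum_u \Big[ m_u \log k_u + (k_u - 1) \sum_{j=1}^{m_u} \log t_{u,j} - m_u k_u \log \lambda_u - \lambda_u^{-k_u} \sum_{j=1}^{m_u} t_{u,j}^{k_u} \Big]
$$

With $G_1 = -\log L$, differentiating term by term gives

$$
\frac{\partial G_1}{\partial \lambda_u} = \frac{m_u k_u}{\lambda_u} - k_u \lambda_u^{-(k_u+1)} \sum_{j} t_{u,j}^{k_u},
\qquad
\frac{\partial G_1}{\partial k_u} = -\frac{m_u}{k_u} - \sum_{j} \log t_{u,j} + m_u \log \lambda_u + \lambda_u^{-k_u} \sum_{j} t_{u,j}^{k_u} \left( \log t_{u,j} - \log \lambda_u \right).
$$

These are the expressions that gradient functions such as `derLambda` and `derK` would need to reproduce, before any regularization term is added.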

In [14]:
K_array = np.ones((usern,1))
Lambda_array = np.ones((usern,1))

In [18]:
%timeit loglikelihood = G1(Lambda_array,K_array,C,m_sparse,messagen,usern) #this will take 3h for 10k

1 loop, best of 3: 3.11 s per loop


In [44]:
Beta = np.random.rand(7,1)
Gamma = np.random.rand(7,1)
Alpha = [6 * 10 ** (-5), 8 * 10 ** (-6)]  # according to the paper
Mu = 10
Yita = 10

In [37]:
def get_C_row(C, usern, coln, i):
    tmp = pd.DataFrame(C.loc[i, :]).reset_index().drop('original_status_id', axis=1).set_index(['original_user_id'], append=True)
    tmp.index = tmp.index.swaplevel(i=0, j=1)
    c_sparse = to_sparse(tmp['time'], usern, coln)
    return c_sparse.toarray()

In [83]:
def derLambda(L, cascaden, usern, C, m, user_data, Alpha, Mu, K, Beta):
    derG1 = np.zeros(L.shape) #col vector
    for i in range(cascaden):
        coln = m.getrow(i).sum()
        c_array = get_C_row(C, usern, coln, i)
        K_full = np.hstack((K,) * coln)
        p = to_col(np.power(c_array,K_full).sum(axis=1))
        a = m.getrow(i).transpose().multiply(K)/L  # col: m*K/Lambda
        b = np.power(L,-(K+1)) *p *K  # col: K * Lambda^-(K+1) * sum(t^K)
        derG1 = derG1 +a -b  # gradient of -logL, matching the sign of G1
        break 
    loss = np.log(L) - np.log(user_data).dot(Beta)
    derG2 = 1/usern * (loss / L)
    result = derG1 + derG2 * Mu
    return result

In [84]:
%timeit result = derLambda(Lambda_array,messagen,usern,C,m_sparse,user_matrix,Alpha,Mu,K_array,Beta)

ValueError: operands could not be broadcast together with shapes (1786773,1) (1655678,1) 

In [None]:
# L and K must have the same number of rows as user_data; usern is taken from max(original_user_id)+1, so the shapes differ
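The broadcast error can be reproduced and avoided on a small scale. This is only a sketch: the sizes, the random feature matrix, and the names `lam_bad` and `lam_ok` are hypothetical stand-ins for `Lambda_array` and `user_matrix`.

```python
import numpy as np

n_users_features = 5   # hypothetical: rows in the user feature matrix
n_users_ids = 7        # hypothetical: max(user id) + 1

features = np.random.rand(n_users_features, 3) + 1.0  # positive, so log is finite
beta = np.ones((3, 1))

# Sizing the parameter vector from the id range reproduces the error.
lam_bad = np.ones((n_users_ids, 1))
try:
    np.log(lam_bad) - np.log(features).dot(beta)
    raised = False
except ValueError:
    raised = True

# Sizing it from the feature matrix keeps both operands aligned.
lam_ok = np.ones((features.shape[0], 1))
loss = np.log(lam_ok) - np.log(features).dot(beta)
print(raised, loss.shape)
```

In the notebook itself, the sparse matrices built with `usern` would also have to use the same user indexing as `user_data` for the rows to line up.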

In [80]:
def to_col(l):
    # reshape a 1-D array into a column vector of shape (n, 1)
    return np.asarray(l).reshape(-1, 1)

In [81]:
def derK(K, cascaden, usern, C, m, user_data, Alpha, Yita, Lambda, Gamma):
    derG1 = np.zeros(K.shape)
    for i in range(cascaden):
        coln = m.getrow(i).sum()
        c_array = get_C_row(C,usern,coln,i)
        tmpm = m.getrow(i)
        K_full = np.hstack((K,) * coln)
        p= to_col(np.power(c_array,K_full).sum(axis=1))
        q = to_col((np.power(c_array,K_full) * np.log(c_array)).sum(axis=1))
        derG1 = derG1 + tmpm.transpose()/K \
             + to_col(np.log(c_array).sum(axis=1)) \
             - tmpm.transpose().multiply(np.log(Lambda)).toarray()\
             + np.power(Lambda,K*(-1)) * np.log(Lambda) * p \
             - np.power(Lambda,K*(-1))*q  # derG1 accumulates d(logL)/dK
        break
    loss = np.log(K) - np.log(user_data).dot(Gamma)
    derG3 = 1/usern * (loss/K)
    result = -derG1 + derG3 *Yita  # negate: the objective uses G1 = -logL
    return result

In [82]:
result2 = derK(K_array,messagen,usern,C,m_sparse,user_matrix,Alpha,Yita,Lambda_array,Gamma)

ValueError: operands could not be broadcast together with shapes (1786773,1) (1655678,1) 