## Twitter Models
__ Author: Gabrielle Agrocostea__<br>
__Last updated: March 2016__

This notebook is used to predict twitter impressions based on retweets and followers.
The model is used to initialize, clean data, train, predict and preform cross validation. 

### Model 1: 
The 1st model is built using all 27 properties and by remove top and bottom 10% of retweets and followers. Ridge regression is used with alpha set to 0.1, and R-squared is 0.764 using 90% training and 10% test data. Using an average error threshold  of 60% filters out properties with larger average errors such as CNN, NFL, MTV, BET and InStyle. Performing 5-fold cross validation to find the best alpha for ridge regression yields about the same score with slightly lower average errors.

### Model 2: 
The 2nd model removes properties that had the highest average error (over 60%) and also performs 5-fold cross-validation and searches for best alpha for Ridge regression. The average score on the folds is higher than the 1st model (around 0.82). The largest errors are just at 50% with smaller properties, which leads me to think that maybe we need to build a model for properties such as CNN and MTV and another model for smaller properties.

### Model 3: 
The 3rd model is built using only those properties with the largest errors (over 60%) and the average R-squared error is closer to 0.89, so this model outperforms the other two. While there are still some properties with large errors, some like CNN do a better job using this model than the 1st model which includes all properties. 


### A bit of documentation on Model class: 
    
The Model class initializes a model with the following attributes:
- Twitter and user data <br>
- Model results: a dataframe to store page_id, followers, retweets, impressions, predicted and percent differences
- rmse score
- n data points
- model (linear regression)

The _clean_ function:
- Removes retweets and impressions above and below the input quantile (this is .10 by default)
- Add weekday as a feature and drop time column

The _train_ function gets randomized training and test data based on percent to train (default is 90%), fits the model and sets the score and predicted values.

The _cross validate_ function takes as input alphas and the number of k-folds to run (default is 5) and performs cross-validation to find the best alpha. It returns the best alpha found based on k-fold cross validation.


In [29]:
class Model:
    '''
    class to build linear regression model to predict impressions based on followers & re-tweets
    '''
    def __init__(self):
        '''
        Initialize model with Twitter data with user data (not used yet)
        Has rmse, r_score, n and model attributes
        '''
        self.tw_data = load_tw_data()
        self.user_data = load_user_data()

        # self.result = pd.DataFrame(columns = ['base_user','cross_user','base_total_overlap','perc_overlap'])
        self.model_results = pd.DataFrame(columns=['page_id', 'followers', 'retweets', 'impressions', 'predicted', \
                                                   'perc_diff', 'err_cat'])
        self.rmse = None
        self.r_score = None
        self.n = None
        self.model = None

    def clean(self, quantile):
        '''
        
        :param quantile: float between 0.0 and 1.0 used to clean up top and bottom quantiles for tweets and impressions 
        '''
        self.tw_data = self.tw_data[(self.tw_data.retweets > self.tw_data.retweets.quantile(quantile)) & (
            self.tw_data.retweets < self.tw_data.retweets.quantile(1.0 - quantile))]
        self.tw_data = self.tw_data[(self.tw_data.impressions > self.tw_data.impressions.quantile(quantile)) & (
            self.tw_data.impressions < self.tw_data.impressions.quantile(1.0 - quantile))]
        # reset new index for data
        self.n = self.tw_data.shape[0]
        # twitter_index = np.arange(0, self.n)
        self.tw_data = self.tw_data.set_index(np.arange(0, self.n))

        # add weekday
        try:
            self.tw_data['weekday'] = self.tw_data.time.map(lambda x: np.int(x.date().weekday()))
            self.tw_data.drop('time', axis=1, inplace=True)

        except:
            pass


    def _get_train_test(self, perc_train=0.9):
        '''
        Based on 90% training and 10% testing, set training & test data
        :param perc_train: default is .90
        :return: x_train, y_train, x_test, y_test data
        '''
        # sample for training
        n_train = np.round(self.n * perc_train)
        ndex_train = np.random.randint(0, self.n, int(n_train))

        # # training data
        x_train = self.tw_data.ix[ndex_train]
        y_train = self.tw_data['impressions'][ndex_train]
        x_train = x_train[["followers", "retweets"]]

        # # test data
        x_test = self.tw_data.drop(ndex_train, axis=0)
        y_test = x_test[['impressions']]

        self.model_results[['page_id', 'followers', 'retweets', 'impressions']] = x_test[
            ['page_id', 'followers', 'retweets', 'impressions']].copy()
        x_test = x_test[["followers", "retweets"]]

        return [x_train, y_train, x_test, y_test]

    def cross_validate(self, alphas, folds):
        '''
        Perform k-fold cross-validation
        :param alphas: list of alphas for parameter search 
        :param folds: how many time to perform cross-validation (default k is 5)
        :return: best alpha found
        '''
        model_cv = linear_model.RidgeCV(alphas=alphas, cv=5)
        k_fold = cross_validation.KFold(n=self.n, n_folds=folds, shuffle=True)
        cv_scores = list()
        cv_alphas = list()
        for k, (train, test) in enumerate(k_fold):
            # X = self.tw_data.drop(['tw_name','impressions'], axis=1)
            X = self.tw_data.drop(['impressions'], axis=1)
            Y = self.tw_data.impressions
            model_cv.fit(X.ix[train], Y.ix[train])
            model_cv.alpha_ = alphas[k]
            cv_alphas.append(model_cv.alpha_)
            cv_scores.append(model_cv.score(X.ix[test], Y.ix[test]))
            print("[fold {0}] alpha: {1:.9f}, score: {2:.5f}".format(k, model_cv.alpha_,
                                                                     model_cv.score(X.ix[test], Y.ix[test])))
        model_cv_df = pd.DataFrame({'fold': range(folds), 'alpha': cv_alphas, 'score': cv_scores})
        print "Best alpha's\n", model_cv_df.sort_values('score', ascending=False).head(10)
        # print "Best alpha's\n", model_cv_df.sort_values('score', ascending=False).head(10)
        best_alpha = model_cv_df.sort_values('score', ascending=False).alpha.iloc[0]
        return best_alpha

    def train(self, model, perc_train, alpha):
        '''
        fit model on training data, set score and model results 
        :param model: linear model to set (Ridge, Lasso..)  
        :param perc_train: percent to use for training (float between 0 & 1)
        :param alpha: penalty value for model 

        '''
        x_train, y_train, x_test, y_test = self._get_train_test(perc_train)
        self.model = model(alpha)
        self.model.fit(x_train, y_train)
        self.r_score = self.model.score(x_test, y_test)
        self.model_results.predicted = self.model.predict(x_test)

    def get_coefs(self):
        '''
        
        :return: model coefficients 
        '''
        return self.model.coef_

    def get_results(self):
        tw_samples = self.user_data[['user_id', 'tw_name']].drop_duplicates()
        tw_samples.rename(columns={'user_id': 'page_id'}, inplace=True)

        diff = np.subtract(self.model_results.impressions, self.model_results.predicted)
        self.model_results.perc_diff = abs(diff / self.model_results.impressions)

        bins = np.arange(0.0, 1.0, .3)
        large_bins = np.arange(1.0, self.model_results.perc_diff.max() + 5, 4)
        all_bins = np.concatenate([bins, large_bins])

        self.model_results['err_cat'] = pd.cut(self.model_results.perc_diff, bins=all_bins, right=True)
        self.model_results = self.model_results.merge(tw_samples, on='page_id')

        results = self.model_results.groupby(['page_id', 'tw_name', 'err_cat'])[['err_cat']].agg('count')
        results.rename(columns={'err_cat': 'frequency'}, inplace=True)
        results = results.reset_index()
        results.sort_values(['frequency', 'tw_name'], ascending=False, inplace=True)
        results['perc_total_err'] = results.groupby('err_cat')['frequency'].apply(lambda x: x / (x.sum()))
        return results


### Ridge Regression - Model 1 

In [2]:
print "*" * 30
print "MODEL 1 - Using all 27 properties for modeling impressions"
print ".... CLEANING DATA .... REMOVING OUTLIERS ...."
linModel1 = Model()
linModel1.clean(quantile = .1)

print ".... TRAINING MODEL 1 ...."
linModel1.train(model = Ridge, perc_train = .9, alpha = .1)

print ".... GETTING RESULTS FOR MODEL 1...."
score = linModel1.r_score
print "ADJUSTED R2 = ", score

results = linModel1.get_results()
print "Model Coefficients:"
coefs = linModel1.get_coefs()
print coefs

******************************
MODEL 1 - Using all 27 properties for modeling impressions
.... CLEANING DATA .... REMOVING OUTLIERS ....
.... TRAINING MODEL 1 ....


.... GETTING RESULTS FOR MODEL 1....
ADJUSTED R2 =  0.765818282025


Model Coefficients:
[  8.88925118e-03   2.56322289e+02]


In [12]:
err_threshold = 0.6
model_err_stats = linModel1.model_results.groupby(['page_id','tw_name'])['perc_diff'].agg(['count','sum','mean']).sort_values('mean',ascending = False, axis =0).reset_index()
model_err_stats.rename(columns = {'count':'frequency', 'sum':'err_sum' ,'mean':'err_mean'}, inplace = True)
tw_names_drop = model_err_stats[model_err_stats.err_mean > err_threshold][['page_id','tw_name']].drop_duplicates()

print "MODEL 1 RESULTS:"
print model_err_stats.head(10)
print
print "BIGGEST ERRORS:"
print tw_names_drop.head(10)


MODEL 1 RESULTS:
     page_id      tw_name  frequency       err_sum   err_mean
0     759251          CNN        350   4086.403460  11.675438
1   19426551          nfl        123    875.916783   7.121275
2    2367911          MTV       5697  17919.794619   3.145479
3   30309979   106andpark        316    585.683421   1.853429
4   16560657          bet       7939   8510.282256   1.071959
5   14934818      InStyle      14974  13662.544602   0.912418
6   32448740    brueggers         66     55.459232   0.840291
7  634784951  dasaniwater          1      0.814984   0.814984
8   27677483   essencemag       5839   4565.960017   0.781976
9   18342955   abc11_wtvd       4178   3119.168456   0.746570

BIGGEST ERRORS:
     page_id      tw_name
0     759251          CNN
1   19426551          nfl
2    2367911          MTV
3   30309979   106andpark
4   16560657          bet
5   14934818      InStyle
6   32448740    brueggers
7  634784951  dasaniwater
8   27677483   essencemag
9   18342955   abc11_wtv

In [30]:
# perform 5-fold cross validation and search for best alpha
print "...5-fold cross validation... "
alphas = np.logspace(-4, -.5, 20)
best_alpha = linModel1.cross_validate(alphas=alphas, folds=5)
print "\nFOUND BEST ALPHA TO USE IN MODEL 1"
print ".... TRAINING MODEL 1 with alpha = ...", best_alpha
linModel1 = Model()
linModel1.clean(quantile = .1)
linModel1.train(model = Ridge, perc_train = .9, alpha = best_alpha)
print "\n"
print ".... GETTING RESULTS FOR MODEL 1...."
score = linModel1.r_score
print "ADJUSTED R2 = ", score
print "\n"
results = linModel1.get_results()
print "Model Coefficients:"
coefs = linModel1.get_coefs()
print coefs

...5-fold cross validation... 
[fold 0] alpha: 0.000100000, score: 0.76615
[fold 1] alpha: 0.000152831, score: 0.76248
[fold 2] alpha: 0.000233572, score: 0.76689
[fold 3] alpha: 0.000356970, score: 0.75942
[fold 4] alpha: 0.000545559, score: 0.76597
Best alpha's
      alpha  fold     score
2  0.000234     2  0.766890
0  0.000100     0  0.766153
4  0.000546     4  0.765967
1  0.000153     1  0.762475
3  0.000357     3  0.759420

FOUND BEST ALPHA TO USE IN MODEL 1
.... TRAINING MODEL 1 with alpha = ... 0.000233572146909


.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix




.... GETTING RESULTS FOR MODEL 1....
ADJUSTED R2 =  0.762183871331


Model Coefficients:
[  8.94001327e-03   2.54000698e+02]


In [15]:
err_threshold = 0.6
model_err_stats = linModel1.model_results.groupby(['page_id','tw_name'])['perc_diff'].agg(['count','sum','mean']).sort_values('mean',ascending = False, axis =0).reset_index()
model_err_stats.rename(columns = {'count':'frequency', 'sum':'err_sum' ,'mean':'err_mean'}, inplace = True)
tw_names_drop = model_err_stats[model_err_stats.err_mean > err_threshold][['page_id','tw_name']].drop_duplicates()

print "MODEL 1 RESULTS USING BEST ALPHA:"
print model_err_stats.head(10)
print
print "BIGGEST ERRORS:"
print tw_names_drop.head(10)


MODEL 1 RESULTS USING BEST ALPHA:
     page_id      tw_name  frequency       err_sum   err_mean
0     759251          CNN        350   4086.403460  11.675438
1   19426551          nfl        123    875.916783   7.121275
2    2367911          MTV       5697  17919.794619   3.145479
3   30309979   106andpark        316    585.683421   1.853429
4   16560657          bet       7939   8510.282256   1.071959
5   14934818      InStyle      14974  13662.544602   0.912418
6   32448740    brueggers         66     55.459232   0.840291
7  634784951  dasaniwater          1      0.814984   0.814984
8   27677483   essencemag       5839   4565.960017   0.781976
9   18342955   abc11_wtvd       4178   3119.168456   0.746570

BIGGEST ERRORS:
     page_id      tw_name
0     759251          CNN
1   19426551          nfl
2    2367911          MTV
3   30309979   106andpark
4   16560657          bet
5   14934818      InStyle
6   32448740    brueggers
7  634784951  dasaniwater
8   27677483   essencemag
9   183

### Ridge Regression - Model 2  
- remove properties with largest errors (over 60% incorrect on average)
- use 5-fold cross-validation for best alpha

In [17]:
alphas = np.logspace(-4, -.5, 20)
print "*" * 30
print "MODEL 2"
print "REMOVING TWITTER USERS WITH AVG ERRORS > 60%"


linModel2 = Model()
# remove names with largest average of error
linModel2.tw_data = linModel2.tw_data[~linModel2.tw_data.page_id.isin(tw_names_drop.page_id)]
print ".... CLEANING DATA .... REMOVING OUTLIERS ...."
linModel2.clean(quantile = .1)
print "\n"
print "5 FOLD CROSS VALIDATION ...."
best_alpha = linModel2.cross_validate(alphas=alphas, folds=5)
print
print "FOUND BEST ALPHA USED IN MODEL 2: ", best_alpha
print ".... TRAINING MODEL 2...."
linModel2.train(model = Ridge, perc_train = .9, alpha = best_alpha)
score2 = linModel2.r_score
print "ADJUSTED R2 = ", score2
results2 = linModel2.get_results()
print "Model Coefficients:"
coefs2 = linModel2.get_coefs()
print coefs2

******************************
MODEL 2
REMOVING TWITTER USERS WITH AVG ERRORS > 60%
.... CLEANING DATA .... REMOVING OUTLIERS ....


5 FOLD CROSS VALIDATION ....
[fold 0] alpha: 0.000100000, score: 0.81777
[fold 1] alpha: 0.000152831, score: 0.82348
[fold 2] alpha: 0.000233572, score: 0.82171
[fold 3] alpha: 0.000356970, score: 0.81639
[fold 4] alpha: 0.000545559, score: 0.81868
Best alpha's
      alpha  fold     score
1  0.000153     1  0.823482
2  0.000234     2  0.821708
4  0.000546     4  0.818679
0  0.000100     0  0.817768
3  0.000357     3  0.816386

FOUND BEST ALPHA USED IN MODEL 2:  0.000152830673266
.... TRAINING MODEL 2....
ADJUSTED R2 =  0.816688528879
Model Coefficients:
[  8.76712081e-03   2.40737750e+02]


In [16]:
model_err_stats2 = linModel2.model_results.groupby(['page_id','tw_name'])['perc_diff'].agg(['count','sum','mean']).sort_values('mean',ascending = False, axis =0).reset_index()
model_err_stats2.rename(columns = {'count':'frequency', 'sum':'err_sum' ,'mean':'err_mean'}, inplace = True)
tw_names_drop2 = model_err_stats2[model_err_stats2.err_mean > err_threshold][['page_id','tw_name']].drop_duplicates()

print "MODEL 2 RESULTS USING BEST ALPHA:"
print model_err_stats2
print
print "BIGGEST ERRORS:"
print tw_names_drop2


MODEL 2 RESULTS USING BEST ALPHA:
      page_id          tw_name  frequency       err_sum  err_mean
0   192981351     LandRoverUSA         42     21.147072  0.503502
1    25053299  fortunemagazine      22297  10810.510498  0.484841
2    21308602   cartoonnetwork        247    108.331923  0.438591
3     5988062     theeconomist       3055   1167.208866  0.382065
4    14946736          DIRECTV        307    110.054806  0.358485
5    73200694            Coach        122     36.610598  0.300087
6   436171805        fusionpop         25      7.497409  0.299896
7   226299107          betnews         29      7.099350  0.244805
8     9695312        billboard      11431   2498.741824  0.218593
9   119606058   aquiyahorashow         14      2.841013  0.202930
10   16374678             ABC7      11804   2360.990145  0.200016
11   25589776           people      25607   3735.139277  0.145864
12   14293310             TIME      29090   3846.456865  0.132226

BIGGEST ERRORS:
Empty DataFrame
Columns: 

### Ridge Regression - Model 3  
- model using properties where error is >60%
- perform 10-fold cross validation

In [18]:
print "*" * 30
print "MODEL 3"

print "*" * 30
print "BUILDING MODEL FOR TWITTER USERS GUILTY OF LARGEST ERRORS"

linModel3 = Model()
linModel3.tw_data = linModel3.tw_data[linModel3.tw_data.page_id.isin(tw_names_drop.page_id)]
print ".... CLEANING DATA .... REMOVING OUTLIERS ...."
linModel3.clean(quantile = .1)
print "\n"
print "10 FOLD CROSS VALIDATION ...."
best_alpha_3 = linModel3.cross_validate(alphas=alphas, folds=10)
print
print "FOUND BEST ALPHA USED IN MODEL 3: ", best_alpha_3
print ".... TRAINING MODEL 3...."
linModel3.train(model = Ridge, perc_train = .9, alpha = best_alpha_3)
score3 = linModel3.r_score
print "ADJUSTED R2 = ", score3
print "Model Coefficients:"
coefs3 = linModel3.get_coefs()
print coefs3

******************************
MODEL 3
******************************
BUILDING MODEL FOR TWITTER USERS GUILTY OF LARGEST ERRORS
.... CLEANING DATA .... REMOVING OUTLIERS ....


10 FOLD CROSS VALIDATION ....
[fold 0] alpha: 0.000100000, score: 0.89105
[fold 1] alpha: 0.000152831, score: 0.89236
[fold 2] alpha: 0.000233572, score: 0.88658
[fold 3] alpha: 0.000356970, score: 0.89746
[fold 4] alpha: 0.000545559, score: 0.88830
[fold 5] alpha: 0.000833782, score: 0.89534
[fold 6] alpha: 0.001274275, score: 0.88810
[fold 7] alpha: 0.001947483, score: 0.89164
[fold 8] alpha: 0.002976351, score: 0.89149
[fold 9] alpha: 0.004548778, score: 0.89680
Best alpha's
      alpha  fold     score
3  0.000357     3  0.897459
9  0.004549     9  0.896798
5  0.000834     5  0.895344
1  0.000153     1  0.892364
7  0.001947     7  0.891644
8  0.002976     8  0.891486
0  0.000100     0  0.891055
4  0.000546     4  0.888304
6  0.001274     6  0.888100
2  0.000234     2  0.886581

FOUND BEST ALPHA USED IN MODEL 

In [22]:
model_err_stats3 = linModel3.model_results.groupby(['page_id','tw_name'])['perc_diff'].agg(['count','sum','mean']).sort_values('mean',ascending = False, axis =0).reset_index()
model_err_stats3.rename(columns = {'count':'frequency', 'sum':'err_sum' ,'mean':'err_mean'}, inplace = True)

print "MODEL 3 RESULTS USING BEST ALPHA:"
print model_err_stats3

MODEL 3 RESULTS USING BEST ALPHA:
       page_id          tw_name  frequency       err_sum  err_mean
0     18342955       abc11_wtvd       3294  12386.628306  3.760361
1     32448740        brueggers         48    165.043697  3.438410
2   1426645165           bustle       3209  10950.713596  3.412500
3    223525053     ringlingbros         30     95.801266  3.193376
4      2367911              MTV       6596  19620.996732  2.974681
5     27677483       essencemag       5628  15410.318480  2.738152
6     25453312  hallmarkchannel       2481   6604.009992  2.661834
7     19426551              nfl        644   1702.890740  2.644240
8     30309979       106andpark        339    651.353689  1.921397
9    634784951      dasaniwater          2      2.129312  1.064656
10      759251              CNN      13272   8216.304356  0.619071
11    14934818          InStyle      15052   8632.908271  0.573539
12    16560657              bet       8277   1810.044625  0.218684
