# Part 3: Using cross validation

We perform cross validation on the dataset with three different models: Linear Regression, Random Forest, and SVM (support vector regression with an RBF kernel). Then, we evaluate the accuracy of each model by calculating its average prediction error.

In this part of the project, we create a separate regression model for each of three time periods:
1. Before Feb. 1, 8:00 a.m.
2. Between Feb. 1, 8:00 a.m. and 8:00 p.m.
3. After Feb. 1, 8:00 p.m.

Then, we use these regression models to predict the results for the three different periods.
Average prediction error is then calculated to evaluate the accuracy of the models for different
periods.
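
The three periods can be located in an hourly feature table by converting the boundary times into hour offsets from the first tweet. The following is a minimal sketch of that idea, not the notebook's own code: `period_row_bounds` is a hypothetical helper, and it assumes that timestamps are Unix seconds, that the boundaries are on Feb. 1, 2015, and that they are expressed in Pacific Standard Time (UTC-8).

```python
from datetime import datetime, timedelta, timezone

# Assumption: the period boundaries are given in Pacific Standard Time (UTC-8).
PST = timezone(timedelta(hours=-8))
PERIOD_START = datetime(2015, 2, 1, 8, 0, tzinfo=PST)   # Feb. 1, 8:00 a.m.
PERIOD_END = datetime(2015, 2, 1, 20, 0, tzinfo=PST)    # Feb. 1, 8:00 p.m.

def period_row_bounds(first_timestamp):
    """Return (start_row, end_row) of the middle period in an hourly table
    whose row 0 begins at first_timestamp (Unix seconds)."""
    start_row = int((PERIOD_START.timestamp() - first_timestamp) // 3600)
    end_row = int((PERIOD_END.timestamp() - first_timestamp) // 3600)
    return start_row, end_row

# Made-up first timestamp, exactly 440 hours before the first boundary.
first_ts = PERIOD_START.timestamp() - 440 * 3600
start_row, end_row = period_row_bounds(first_ts)
print(start_row, end_row)
```

Rows before `start_row` would then belong to the first period, rows from `start_row` up to (but not including) `end_row` to the second, and the remaining rows to the third.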

## Importing libraries

In [17]:
import os
import json
import numpy as np
import pandas as pd
from sklearn.svm import SVR #For non-linear SVM
import statsmodels.api as sm
from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression 
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

## Extracting data

In [11]:

#----------------------Function to extract the data from the file--------------
def extract_info(hashtag):
    hashtag_dictionary = {'#GoHawks' : ['tweets_#gohawks.txt', 188136],
                          '#GoPatriots' : ['tweets_#gopatriots.txt', 26232],
                          '#NFL' : ['tweets_#nfl.txt', 259024],
                          '#Patriots' : ['tweets_#patriots.txt', 489713],
                          '#SB49' : ['tweets_#sb49.txt', 826951],
                          '#SuperBowl' : ['tweets_#superbowl.txt', 1348767]}             

    #-----------------------To extract the data from the file------------------                
    time_stamps = [0]*hashtag_dictionary[hashtag][1]
    is_retweet = [False]*hashtag_dictionary[hashtag][1]
    followers_of_users = [0]*hashtag_dictionary[hashtag][1]
    
    no_of_url_citations = [0]*hashtag_dictionary[hashtag][1]
    usernames = ['']*hashtag_dictionary[hashtag][1]
    no_of_mentions = [0]*hashtag_dictionary[hashtag][1]
    ranking_scores = [0.0]*hashtag_dictionary[hashtag][1] # for each tweet, we will get a ranking score
    no_of_hashtags = [0]*hashtag_dictionary[hashtag][1]
    
    with open('./Training_data/'+hashtag_dictionary[hashtag][0], encoding="utf8") as file_in:
        for (line, index) in zip(file_in, range(0, hashtag_dictionary[hashtag][1])):
            tr_data = json.loads(line)
            time_stamps[index] = tr_data['citation_date']
            followers_of_users[index] = tr_data['author']['followers']

            # A tweet is treated as a retweet when the posting user differs
            # from the original author.
            username = tr_data['author']['nick']
            original_username = tr_data['original_author']['nick']
            if username != original_username:
                is_retweet[index] = True

            no_of_url_citations[index] = len(tr_data['tweet']['entities']['urls'])
            usernames[index] = username
            no_of_mentions[index] = len(tr_data['tweet']['entities']['user_mentions'])
            ranking_scores[index] = tr_data['metrics']['ranking_score']
            no_of_hashtags[index] = tr_data['title'].count('#')
    
    #--------------------To calculate the related parameters-------------------
    hrs_passed = int((max(time_stamps)-min(time_stamps))/3600)+1
    hr_no_of_tweets = [0] * hrs_passed
    hr_no_of_retweets = [0] * hrs_passed
    hr_sum_of_followers = [0] * hrs_passed
    hr_max_no_of_followers = [0] * hrs_passed
    hr_time_of_the_day = [0] * hrs_passed
    hr_no_of_url_citations = [0] * hrs_passed
    hr_no_of_users = [0] * hrs_passed
    hr_user_set = [0] * hrs_passed
    hr_no_of_mentions = [0] * hrs_passed
    hr_total_ranking_scores = [0.0] * hrs_passed
    hr_no_of_hashtags = [0] * hrs_passed
    for i in range(0, hrs_passed):
        hr_user_set[i] = set([])
    
    start_time = min(time_stamps)
    for i in range(0, hashtag_dictionary[hashtag][1]):
        current_hr = int((time_stamps[i]-start_time)/3600)
        
        hr_no_of_tweets[current_hr] += 1
        
        if is_retweet[i]:
            hr_no_of_retweets[current_hr] += 1
    
        if followers_of_users[i] > hr_max_no_of_followers[current_hr]:
            hr_max_no_of_followers[current_hr] = followers_of_users[i]

        hr_sum_of_followers[current_hr] += followers_of_users[i]
        hr_no_of_url_citations[current_hr] += no_of_url_citations[i]
        hr_user_set[current_hr].add(usernames[i])
        hr_no_of_mentions[current_hr] += no_of_mentions[i]
        hr_total_ranking_scores[current_hr] += ranking_scores[i]
        hr_no_of_hashtags[current_hr] += no_of_hashtags[i]

    for i in range(0, len(hr_user_set)):
        hr_no_of_users[i] = len(hr_user_set[i])
    
    for i in range(0, len(hr_time_of_the_day)):
        hr_time_of_the_day[i] = i%24

    #------------------To build the DataFrame and save it to file--------------
    target = hr_no_of_tweets[1:]
    target.append(0)
    
    data = np.array([hr_no_of_tweets,
                     hr_no_of_retweets,
                     hr_sum_of_followers,
                     hr_max_no_of_followers,
                     hr_time_of_the_day,
                     hr_no_of_url_citations,
                     hr_no_of_users,
                     hr_no_of_mentions,
                     hr_total_ranking_scores,
                     hr_no_of_hashtags,
                     target])
    data = np.transpose(data)
    
    data_frame = DataFrame(data)
    data_frame.columns = ['no_of_tweets', 
                          'no_of_retweets', 
                          'sum_of_followers',
                          'max_no_of_followers',
                          'time_of_day',
                          'no_of_URLs',
                          'no_of_users',
                          'no_of_mentions',
                          'ranking_score',
                          'no_of_hashtags',
                          'target']
    
    # Create the output directory if it does not exist yet
    os.makedirs('./Extracted_data', exist_ok = True)
    
    data_frame.to_csv('./Extracted_data/data_cv_'+hashtag+'.csv', index = False)   
#------------------------------------------------------------------------------
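
The core of `extract_info` is the bucketing of tweets into hours and the construction of the target as the next hour's tweet count. The sketch below isolates these two steps on made-up timestamps; `hourly_counts` and `next_hour_targets` are hypothetical helper names used only for illustration.

```python
def hourly_counts(time_stamps):
    """Count tweets per hour, where hour 0 starts at the earliest timestamp."""
    start = min(time_stamps)
    n_hours = int((max(time_stamps) - start) / 3600) + 1
    counts = [0] * n_hours
    for ts in time_stamps:
        counts[int((ts - start) / 3600)] += 1
    return counts

def next_hour_targets(counts):
    """Shift the counts left by one hour; the last hour has no successor,
    so its target is set to 0 (as in extract_info)."""
    return counts[1:] + [0]

# Made-up timestamps in seconds: three tweets in hour 0, one in hour 1,
# none in hour 2 and two in hour 3.
stamps = [0, 10, 3599, 3600, 10800, 14399]
counts = hourly_counts(stamps)
targets = next_hour_targets(counts)
print(counts)   # [3, 1, 0, 2]
print(targets)  # [1, 0, 2, 0]
```

Each row of the feature table thus pairs the statistics of one hour with the number of tweets observed in the following hour.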

## Function to perform cross validation

In [12]:
#-----------------To calculate the average cross-validation error--------------
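def avg_cross_validation_error(target_data, cross_validation_values):
    # Mean absolute error between the actual targets and the cross-validated
    # predictions. The value is returned so that the caller can print it
    # next to its label.
    total_error = 0.0
    for (actual, predicted) in zip(target_data, cross_validation_values):
        total_error += abs(actual - predicted)
    return total_error/len(target_data)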
#------------------------------------------------------------------------------
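
The average prediction error used here is the mean absolute error, i.e. the average of |actual - predicted| over all hours. A small worked example on made-up numbers makes the arithmetic explicit; `mean_absolute_error` is an illustrative stand-in written for this sketch, not the notebook's function.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all samples."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up hourly tweet counts and predictions.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 40]
error = mean_absolute_error(actual, predicted)
print(error)  # (2 + 2 + 3 + 0) / 4 = 1.75
```

Because the error is expressed in tweets per hour, its magnitude grows with the tweet volume, so values for a busy hashtag and a quiet one are not directly comparable.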

In [13]:

#---------------------Function to perform Cross Validation---------------------
def cross_validation(hashtag):
    training_data = pd.read_csv('./Extracted_data/data_cv_'+hashtag+'.csv')
    
    #----------------------------One-hot Encoding------------------------------
    time_of_day_set = range(0,24)
    for time_of_day in time_of_day_set:
        time_of_day_column_to_add = []
        for time_of_day_item in training_data['time_of_day']:
            if time_of_day_item == time_of_day:
                time_of_day_column_to_add.append(1)
            else:
                time_of_day_column_to_add.append(0)
        training_data.insert(training_data.shape[1]-1,
                             str(time_of_day)+'th_hour',
                             time_of_day_column_to_add)
    
    #----------------------------Splitting the data----------------------------   
    training_data.drop(columns = 'time_of_day', inplace = True)
    target_data = training_data.pop('target')
    
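    # Hard-coded hourly row indices for the period boundaries: row 440 is taken as
    # Feb. 1, 8:00 a.m. and row 452 as Feb. 1, 8:00 p.m. To derive them, find the
    # minimum and maximum timestamps and convert them to date and hour.
    training_data_before = training_data[:440]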
    training_data_during = training_data[440:452]
    training_data_after = training_data[452:]
        
    target_data_before = target_data[:440]
    target_data_during = target_data[440:452]
    target_data_after = target_data[452:]   
        
    #------------------------Cross Validation for RF---------------------------
    reg_before = RandomForestRegressor(n_estimators = 20, max_depth = 9)
    reg_during = RandomForestRegressor(n_estimators = 20, max_depth = 9)
    reg_after = RandomForestRegressor(n_estimators = 20, max_depth = 9)
    
    cross_validation_values_before = cross_val_predict(reg_before,
                                                       training_data_before,
                                                       target_data_before,
                                                       cv = 10)
    
    cross_validation_values_during = cross_val_predict(reg_during,
                                                       training_data_during,
                                                       target_data_during,
                                                       cv = 10)

    cross_validation_values_after = cross_val_predict(reg_after,
                                                      training_data_after,
                                                      target_data_after,
                                                      cv = 10)

    cross_validation_values = np.concatenate([cross_validation_values_before,
                                              cross_validation_values_during,
                                              cross_validation_values_after])
    print(hashtag)
    print('For Random Forest:')
    print('    Average cross-validation error before Super Bowl:',
    avg_cross_validation_error(target_data_before,cross_validation_values_before))

    print('    Average cross-validation error during Super Bowl:',
    avg_cross_validation_error(target_data_during,cross_validation_values_during))

    print('    Average cross-validation error after Super Bowl:',
    avg_cross_validation_error(target_data_after,cross_validation_values_after))

    print('    Total average cross-validation error:',
    avg_cross_validation_error(target_data,cross_validation_values))
    print('')
#------------------------------------------------------------------------------

#--------------------Cross Validation for Linear Regression--------------------
    reg_before = LinearRegression(fit_intercept = False) 
    reg_during = LinearRegression(fit_intercept = False)
    reg_after = LinearRegression(fit_intercept = False)
    
    cross_validation_values_before = cross_val_predict(reg_before,
                                                       training_data_before,
                                                       target_data_before,
                                                       cv = 10)
    
    cross_validation_values_during = cross_val_predict(reg_during,
                                                       training_data_during,
                                                       target_data_during,
                                                       cv = 10)

    cross_validation_values_after = cross_val_predict(reg_after,
                                                      training_data_after,
                                                      target_data_after,
                                                      cv = 10)

    cross_validation_values = np.concatenate([cross_validation_values_before,
                                              cross_validation_values_during,
                                              cross_validation_values_after])
    print('For Linear Regression:')
    print('    Average cross-validation error before Super Bowl:',
    avg_cross_validation_error(target_data_before,cross_validation_values_before))

    print('    Average cross-validation error during Super Bowl:',
    avg_cross_validation_error(target_data_during,cross_validation_values_during))

    print('    Average cross-validation error after Super Bowl:',
    avg_cross_validation_error(target_data_after,cross_validation_values_after))

    print('    Total average cross-validation error:',
    avg_cross_validation_error(target_data,cross_validation_values))
    print('')
#------------------------------------------------------------------------------
         
#------------Cross Validation for Non-Linear SVM with RBF kernel---------------
    reg_before = SVR(kernel='rbf')  #Epsilon-Support Vector Regression.
    reg_during = SVR(kernel='rbf')
    reg_after = SVR(kernel='rbf')
    
    cross_validation_values_before = cross_val_predict(reg_before,
                                                       training_data_before,
                                                       target_data_before,
                                                       cv = 10)
    
    cross_validation_values_during = cross_val_predict(reg_during,
                                                       training_data_during,
                                                       target_data_during,
                                                       cv = 10)

    cross_validation_values_after = cross_val_predict(reg_after,
                                                      training_data_after,
                                                      target_data_after,
                                                      cv = 10)

    cross_validation_values = np.concatenate([cross_validation_values_before,
                                              cross_validation_values_during,
                                              cross_validation_values_after])
    print('For Non-linear SVM:')
    print('    Average cross-validation error before Super Bowl:',
    avg_cross_validation_error(target_data_before,cross_validation_values_before))

    print('    Average cross-validation error during Super Bowl:',
    avg_cross_validation_error(target_data_during,cross_validation_values_during))

    print('    Average cross-validation error after Super Bowl:',
    avg_cross_validation_error(target_data_after,cross_validation_values_after))

    print('    Total average cross-validation error:',
    avg_cross_validation_error(target_data,cross_validation_values))
    print('')
#-----------------------------------------------------------------------------
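
The one-hot encoding step in `cross_validation` replaces the single `time_of_day` column with 24 indicator columns, so that the hour is treated as a category rather than as a number. The sketch below shows the same idea for a single value in plain Python; `one_hot_hour` is a hypothetical helper, not part of the notebook.

```python
def one_hot_hour(hour, n_hours=24):
    """Return n_hours indicators with a single 1 at position `hour`."""
    if not 0 <= hour < n_hours:
        raise ValueError("hour must be in [0, n_hours)")
    return [1 if h == hour else 0 for h in range(n_hours)]

encoded = one_hot_hour(5)
print(encoded)
```

For a whole column, `pandas.get_dummies` offers comparable functionality and could replace the explicit loops, although its column names would differ from the `'<hour>th_hour'` names used above.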


## Performing cross validation for each hashtag

In [14]:
def perform_cross_validation(hashtag):
    extract_info(hashtag)
    cross_validation(hashtag)

In [15]:
perform_cross_validation('#GoHawks')
perform_cross_validation('#GoPatriots')
perform_cross_validation('#NFL')
perform_cross_validation('#Patriots')
perform_cross_validation('#SB49')
perform_cross_validation('#SuperBowl')

#GoHawks
For Random Forest:
    Average cross-validation error before Super Bowl: 162.83403155028927
    Average cross-validation error during Super Bowl: 1998.0166666666664
    Average cross-validation error after Super Bowl: 26.03819674164867
    Total average cross-validation error: 170.86360080883722

For Linear Regression:
    Average cross-validation error before Super Bowl: 423.1941463092207
    Average cross-validation error during Super Bowl: 1874.479577852336
    Average cross-validation error after Super Bowl: 42.65350708308319
    Total average cross-validation error: 369.80341055239506

For Non-linear SVM:
    Average cross-validation error before Super Bowl: 224.51454545454484
    Average cross-validation error during Super Bowl: 3233.5833333333335
    Average cross-validation error after Super Bowl: 37.16822195145171
    Total average cross-validation error: 245.78543037622495

#GoPatriots
For Random Forest:
    Average cross-validation error before Super Bowl: 11.08602067025832
    Average cross-validation error during Super Bowl: 1020.3458333333333
    Average cross-validation error after Super Bowl: 3.732529618493883
    Total average cross-validation error: 30.57582650084938

For Linear Regression:
    Average cross-validation error before Super Bowl: 23.21725047340594
    Average cross-validation error during Super Bowl: 4730.221389742165
    Average cross-validation error after Super Bowl: 2.7677754715267726
    Total average cross-validation error: 117.07597090122158

For Non-linear SVM:
    Average cross-validation error before Super Bowl: 13.938616769900696
    Average cross-validation error during Super Bowl: 1488.75
    Average cross-validation error after Super Bowl: 4.575286309400884
    Total average cross-validation error: 42.71435059967413

#NFL
For Random Forest:
    Average cross-validation error before Super Bowl: 113.3268111078512
    Average cross-validation error during Super Bowl: 2805.8291666666664
    Average cross-validation error after Super Bowl: 155.1840470098972
    Total average cross-validation error: 177.9958998872073

For Linear Regression:
    Average cross-validation error before Super Bowl: 126.99462055599338
    Average cross-validation error during Super Bowl: 5173.07777335174
    Average cross-validation error after Super Bowl: 152.14540569863436
    Total average cross-validation error: 235.93559811613915

For Non-linear SVM:
    Average cross-validation error before Super Bowl: 189.33184415584435
    Average cross-validation error during Super Bowl: 4284.916666666667
    Average cross-validation error after Super Bowl: 293.32641975308644
    Total average cross-validation error: 296.9745793785996

#Patriots
For Random Forest:
    Average cross-validation error before Super Bowl: 194.6911017211036
    Average cross-validation error during Super Bowl: 15885.933333333332
    Average cross-validation error after Super Bowl: 103.71210429517251
    Total average cross-validation error: 494.5424511705855

For Linear Regression:
    Average cross-validation error before Super Bowl: 352.62035534600165
    Average cross-validation error during Super Bowl: 105075.182919283
    Average cross-validation error after Super Bowl: 124.36033067294389
    Total average cross-validation error: 2440.9604702291044

For Non-linear SVM:
    Average cross-validation error before Super Bowl: 262.7485151515152
    Average cross-validation error during Super Bowl: 14053.333333333334
    Average cross-validation error after Super Bowl: 152.09990123456825
    Total average cross-validation error: 519.2211811470759

#SB49
For Random Forest:
    Average cross-validation error before Super Bowl: 686.0651927541616
    Average cross-validation error during Super Bowl: 12174.887499999999
    Average cross-validation error after Super Bowl: 119.08446973650594
    Total average cross-validation error: 795.1413385031101

For Linear Regression:
    Average cross-validation error before Super Bowl: 1252.7310661341423
    Average cross-validation error during Super Bowl: 18817.900222159882
    Average cross-validation error after Super Bowl: 117.00433378210579
    Total average cross-validation error: 1359.0806852322428

For Non-linear SVM:
    Average cross-validation error before Super Bowl: 728.9761670151291
    Average cross-validation error during Super Bowl: 51105.75
    Average cross-validation error after Super Bowl: 291.71984732824427
    Total average cross-validation error: 1667.639474248125

#SuperBowl
For Random Forest:
    Average cross-validation error before Super Bowl: 292.13376570078486
    Average cross-validation error during Super Bowl: 55290.13333333333
    Average cross-validation error after Super Bowl: 235.0854726723641
    Total average cross-validation error: 1405.327491888127

For Linear Regression:
    Average cross-validation error before Super Bowl: 400.79212945426565
    Average cross-validation error during Super Bowl: 241660.82895463065
    Average cross-validation error after Super Bowl: 579.832191830187
    Total average cross-validation error: 5382.211600888547

For Non-linear SVM:
    Average cross-validation error before Super Bowl: 443.882424242424
    Average cross-validation error during Super Bowl: 94434.75
    Average cross-validation error after Super Bowl: 593.2559701492521
    Total average cross-validation error: 2402.767178612051



## Evaluations for the combined model


In [16]:
#-------------------Evaluations for the combined model------------------------
# Stack the hourly feature tables of all six hashtags into one DataFrame.
# The tables are stacked hashtag by hashtag, so the positional split inside
# cross_validation() assigns the first 452 rows (all from #GoHawks) to the
# "before" and "during" periods and every remaining row to "after".
hashtags = ['#GoHawks','#GoPatriots','#NFL','#Patriots','#SB49','#SuperBowl']
aggregate = pd.concat([pd.read_csv('./Extracted_data/data_cv_'+hashtag+'.csv')
                       for hashtag in hashtags])
aggregate.to_csv('./Extracted_data/data_cv_Aggregate.csv', index = False)
cross_validation('Aggregate')

Aggregate
For Random Forest:
    Average cross-validation error before Super Bowl: 161.58057831949841
    Average cross-validation error during Super Bowl: 1920.229166666667
    Average cross-validation error after Super Bowl: 605.9665707740219
    Total average cross-validation error: 554.5628860358805

For Linear Regression:
    Average cross-validation error before Super Bowl: 423.1941463112231
    Average cross-validation error during Super Bowl: 1874.4795778424568
    Average cross-validation error after Super Bowl: 1175.7967363308478
    Total average cross-validation error: 1083.5002120212973

For Non-linear SVM:
    Average cross-validation error before Super Bowl: 224.51454545454484
    Average cross-validation error during Super Bowl: 3233.5833333333335
    Average cross-validation error after Super Bowl: 953.8457860979605
    Total average cross-validation error: 869.902721952613



We see that the error during the Super Bowl, i.e., for the second period, is much larger than for the other
two periods, i.e., before and after the Super Bowl. The reasons for this are as follows:
1. The second period covers only 12 hours (Feb. 1, 8:00 a.m. to 8:00 p.m.). In comparison, the other two periods contain several hundred hours. The amount of data in the second period is therefore too small to train the model properly, so the fit is poor and the error is large.
2. An unusually large number of users were tweeting during the Super Bowl, so the tweet volume is much less regular and harder to predict. It is difficult to fit a simple linear model because bursts of tweets often happen within a very short period.
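
The effect of the small sample can be made concrete by looking at the fold sizes. The middle period is the slice `training_data[440:452]`, i.e. 12 rows, and it is evaluated with `cv = 10`. The sketch below assumes unshuffled K-fold splitting, which, to the best of our knowledge, is what scikit-learn applies to a regressor when `cv` is an integer; `kfold_sizes` is a hypothetical helper written for this illustration.

```python
def kfold_sizes(n_samples, n_splits):
    """Fold sizes for unshuffled k-fold splitting: the first
    n_samples % n_splits folds receive one extra sample."""
    if n_splits > n_samples:
        raise ValueError("cannot have more folds than samples")
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

during = kfold_sizes(12, 10)   # rows 440..451
before = kfold_sizes(440, 10)  # rows 0..439
print(during)  # [2, 2, 1, 1, 1, 1, 1, 1, 1, 1]
print(before)  # ten folds of 44 rows each
```

Under this assumption, every model for the middle period is trained on only 10 or 11 hourly rows and tested on 1 or 2, while the number of features (including the 24 hour indicators) exceeds the number of training rows.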

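By total average cross-validation error, Random Forest is the best of the three models for every hashtag and for the aggregated data. SVR generally performs better than Linear Regression; the exceptions are #NFL and #SB49, where Linear Regression has the lower total error.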
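
The comparison between models can be checked directly against the reported totals. The sketch below copies the total average cross-validation errors of the six hashtags from the output above (rounded to two decimals) and ranks the models for each hashtag; the dictionary layout and the short model labels are choices made for this illustration only.

```python
# Total average cross-validation errors per hashtag, rounded to two decimals.
totals = {
    '#GoHawks':    {'RF': 170.86,  'LR': 369.80,  'SVM': 245.79},
    '#GoPatriots': {'RF': 30.58,   'LR': 117.08,  'SVM': 42.71},
    '#NFL':        {'RF': 178.00,  'LR': 235.94,  'SVM': 296.97},
    '#Patriots':   {'RF': 494.54,  'LR': 2440.96, 'SVM': 519.22},
    '#SB49':       {'RF': 795.14,  'LR': 1359.08, 'SVM': 1667.64},
    '#SuperBowl':  {'RF': 1405.33, 'LR': 5382.21, 'SVM': 2402.77},
}

# Sort the models of each hashtag from lowest to highest total error.
rankings = {hashtag: sorted(errors, key=errors.get)
            for hashtag, errors in totals.items()}

for hashtag, ranking in rankings.items():
    print(hashtag, ' < '.join(ranking))
```

The ranking only reflects the total error; within a single period the order can differ, as the per-period values for the hours during the Super Bowl show.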