# Interesting Tweet Analysis

Twitter provides API to extract the tweet content based on the search keyword or id or user. Tweet content includes various numerical as well as textual parameters such as tweet text, like count, retweet count, etc. It does not provide the direct reply message content of any tweet or the chat based export of any tweet. For this user have to use different web scrapping methods (PHP, HTML or javascript based). And that will be a very big and time-consuming project.

Accordingly, in below assignment, I have built a ML model to extract the 'n' number of interesting tweet for any keyword or hashtag. I have used two different techniques to train two different models and finally combine them for the better result. These models are trained using limited trained data and without taking help of any content expert. So the accuracy of a model will not be much impressive. All other explanations are mentioned along with the code.

In [18]:
import tweepy
import csv
import nltk
import pandas as pd
import numpy as np
from textblob import TextBlob
import re
import pickle

## Tweet scraping

Twitter API only provides a limited number of tweets per pool as well as there is a restriction of 15 min for total pooling. So for complete tweet extraction, I am using web scraping tool rather than Twitter API.

The limitation of this tool is that it only provides limited details such as tweet text, username, likes, retweets, etc. But it does not provide you with the detail of user such as a number of followers, verified or not, etc. So after analysing scrapped tweets, we will use its ids to extract all the relevant information for further modelling.

Here I have scrapped all the tweets of "W Brom vs Liverpool" premier league match with a hashtag <b>'#WBALIV'.

##### Note: After extracting tweets, I have removed my Twitter API credential for privacy issues

In [65]:
#API credentials
#removed Twitter API credential for privacy issues

consumer_key = '#'
consumer_secret = '#'
access_token = '#-#'
access_token_secret = '#'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth,wait_on_rate_limit=True)

In [20]:
#Scrapping the tweet with web scrapper
#scrapping query has been marked as a comment to avoid it from running every time.

from twitterscraper import query_tweets
import datetime as dt

#if __name__ == '__main__':
    #list_of_tweets = query_tweets("#WBALIV", 10000, begindate=dt.date(2018,4,20), enddate=dt.date.today(), lang='en')
    

In [21]:
#print the retrieved tweets to the screen:

def populate_tweet_df(tweets):
    data = pd.DataFrame()
    data['id'] = list(map(lambda tweet: int(tweet.id), tweets))
    data['fullname'] = list(map(lambda tweet: tweet.fullname, tweets))
    data['user'] = list(map(lambda tweet: tweet.user, tweets))
    data['text'] = list(map(lambda tweet: tweet.text, tweets))
    data['likes'] = list(map(lambda tweet: int(tweet.likes), tweets))
    data['retweets'] = list(map(lambda tweet: int(tweet.retweets), tweets))  
    data['replies'] = list(map(lambda tweet: int(tweet.replies), tweets)) 
    data['timestamp'] = list(map(lambda tweet: tweet.timestamp, tweets))
    return data

In [22]:
#Saving the tweets as pickle object

#with open("tweets.pickle", "wb") as f:
        #pickle.dump(list_of_tweets, f)

In [23]:
with open('tweets.pickle', 'rb') as handle:
    tweets = pickle.load(handle)

In [24]:
#Converting tweet object into dataframe

tweet_data=populate_tweet_df(tweets).drop_duplicates(keep='first')

In [25]:
#validating the tweets

tweet_data.sort_values(by='retweets', ascending=False).reset_index(drop=True).head()

Unnamed: 0,id,fullname,user,text,likes,retweets,replies,timestamp
0,987680042141601793,Alan Shearer,alanshearer,Magnificent @22mosalah @premierleague goals...,9213,2332,101,2018-04-21 13:10:46
1,987679304241831941,Premier League,premierleague,Mo Salah has scored 31 #PL goals this season ‚Äì...,5311,1700,96,2018-04-21 13:07:50
2,987682515057102848,Liverpool FC,LFC,The points are shared.\n\n#WBALIV pic.twitter....,4285,1439,550,2018-04-21 13:20:36
3,987682605301665799,Premier League,premierleague,The Baggies refuse to give up and are rewarded...,1908,566,105,2018-04-21 13:20:58
4,987680971142172672,Liverpool FC,LFC,88: Goal for West Brom. Rondon.\n\n[2-2]\n\n#W...,2119,527,455,2018-04-21 13:14:28


In [26]:
#Extracting ids of the tweets having more than 1 likes or retweets

ids=list(tweet_data[(tweet_data['retweets']>1) | (tweet_data['likes']>1)]['id'])

In [27]:
#Using tweepy to extract the complete details of above-extracted ids

import tweepy

#tweets = twapi.statuses_lookup(id_=idlist, include_entities=True, trim_user=False)

#This does not works with extented tweet texts (280 character limit)

In [28]:
# Creating loop to extract individual tweet from each id with more than one (1) likes or retweet

#all_tweets=[]
#for i in range(len(ids)):
    #x=api.get_status(ids[i], tweet_mode='extended')._json
    #all_tweets.append(x)

In [29]:
#Function to transfer tweepy output into a dataframe for better preprocessing and modelling

def get_tweet_df(tweets):    
    tw = pd.DataFrame()
 
    tw['text'] = list(map(lambda tweet: tweet['full_text'], tweets))
    tw['retweet_count'] = list(map(lambda tweet: tweet['retweet_count'], tweets))
    tw['favorite_count'] = list(map(lambda tweet: tweet['favorite_count'], tweets))  
    tw['lang'] = list(map(lambda tweet: tweet['lang'], tweets))
    
    #User Details
    tw['user_screen_name'] = list(map(lambda tweet: tweet['user']['screen_name'], tweets))
    tw['user_verified'] = list(map(lambda tweet: tweet['user']['verified'], tweets))
    tw['user_followers_count'] = list(map(lambda tweet: tweet['user']['followers_count'], tweets))
    tw['user_favourites_count'] = list(map(lambda tweet: tweet['user']['favourites_count'], tweets))
    tw['user_listed_count'] = list(map(lambda tweet: tweet['user']['statuses_count'], tweets))
    tw['user_statuses_count'] = list(map(lambda tweet: tweet['user']['statuses_count'], tweets))
    
    return tw

In [30]:
#Save tweets dataframe into excel format for tweet classification

#get_tweet_df(all_tweets).to_excel('all_tweet.xlsx', index_label=None)

## Model 1

In first model we will extract the notable tweets with general criteria like retweet_count > 1 & favorite_count > 4. Then we manually tag the interesting tweet for interesting categories (~360 tweets).

Then we preprocess the tweet text for modelling. We will train the model using Tf-IDF vectoriser and LinearSVC algorithm.
Based on the trained model, we will predict the interesting category for all of the notable tweets (~710). 

Here we are using very less manually tagged training data so accuracy might not be very good. We can get better accuracy with larger training data set.

In [31]:
train=pd.read_excel('all_tweet.xlsx', index=False)

In [32]:
#Marking tweet as interesting if mentioned criteria are followed

for i in range(len(train)):
    if (train.loc[i,'user_verified']==False) & (train.loc[i,'retweet_count']>10) & (train.loc[i,'favorite_count']>10):
        train.loc[i,'interesting']=True

In [33]:
# Extracting tweets with more than 1 retweet or 4 likes for manually marking interesting tag

train_rel=train[(train['retweet_count']>1) | (train['favorite_count']>4)].reset_index(drop=True)

In [34]:
#train_rel.to_excel('all_tweet_train.xlsx', index_label=None)

In [35]:
#Reading the manually marked tweet data

df=pd.read_excel('all_tweet_tagged.xlsx', index=False)
df.head()

Unnamed: 0,text,retweet_count,favorite_count,lang,user_screen_name,user_verified,user_followers_count,user_favourites_count,user_listed_count,user_statuses_count,interesting
0,Game day make it ‚öΩÔ∏è#31 Mo #WBALIV https://t.co...,3,22,en,O1Paul,False,7758,12008,1395,1395,False
1,‚òÅÔ∏è9Ô∏è\n\n@LFC are unbeaten in their last nine #...,296,2467,en,premierleague,True,17811702,999,91545,91545,True
2,"Early risers, we've got another one at 730AM t...",1,5,en,LFCAtlanta,False,2683,4218,31447,31447,False
3,#PL üè¥Û†ÅßÛ†Å¢Û†Å•Û†ÅÆÛ†ÅßÛ†Åø‚öΩ\nWest Bromwich Albion üÜö Liverpool...,8,5,en,JugadaDepCL,False,6585,978,13554,13554,False
4,"West Brom vs Liverpool: Preview, Team News, an...",16,36,en,BloodsugarNatz,False,7252,19824,9590,9590,True


In [36]:
def clean_tweet(tweet):
    '''
    Utility function to clean the text in a tweet by removing 
    links and special characters using regex.
    '''
    clean_tweets= ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split()).lower()
    
    return clean_tweets

In [37]:
df['clean_text']=df['text'].apply(lambda x: clean_tweet(x))

print(df['clean_text'][:10])

0                        game day make it 31 mo wbaliv
1    9 are unbeaten in their last nine pl meetings ...
2    early risers we ve got another one at 730am to...
3    pl west bromwich albion liverpool relatos come...
4    west brom vs liverpool preview team news and w...
5    tomorrow west brom v liverpool ko 12 30pm it s...
6    here s how ray would lineup the lfc side for w...
7    looking forward to a trip to the baggies tomor...
8    the latest news from both sides ahead of our t...
9    before the big 1 next wk need saturday ritual ...
Name: clean_text, dtype: object


In [38]:
#Text preprocessing for stopwords removal, lemmatization and spelling correction

from nltk.corpus import stopwords
from textblob import Word
stop_words = set(stopwords.words('english'))

def tweet_preprocessing(tweet):
    
    stopwords=  " ".join([word for word in tweet.split() if word not in stop_words])
    lemmatize= " ".join([Word(word).lemmatize() for word in stopwords.split()])
    clean_tweet= str(TextBlob(lemmatize).correct())
    
    return clean_tweet

In [39]:
df['text_processed'] = df['clean_text'].apply(lambda x: tweet_preprocessing(x))

print(df['text_processed'][:10])

0                           game day make 31 mo wbaliv
1           9 beaten last nine ll meeting we do wbaliv
2    early riser got another one 730am tomorrow mor...
3    ll west bromwich action liverpool relates comm...
4    west from v liverpool review team news way wat...
5    tomorrow west from v liverpool to 12 pm bit tr...
6            ray would line of side wbaliv join u hour
7    looking forward trip maggie tomorrow sit hand ...
8    latest news side ahead trip hawthorne wbaliv t...
9    big 1 next we need saturday ritual done go cla...
Name: text_processed, dtype: object


In [40]:
x=df['text_processed']
y=df['interesting']

In [41]:
#Creating a model using Tf-IDF vectoriser and LinearSVC
#Used GridsearchCV for hyperparameter tuning

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

params = {"tfidf__ngram_range": [(1, 2)], "svc__C": [0.001,.01, .1, 1, 10, 100]}

clf = Pipeline([("tfidf", TfidfVectorizer(sublinear_tf=True, stop_words='english')), ("svc", LinearSVC())])

gs = GridSearchCV(clf, params, cv=30, verbose=2, n_jobs=-1)
gs.fit(x, y)
print(gs.best_estimator_)
print(gs.best_score_)

Fitting 30 folds for each of 6 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    8.2s
[Parallel(n_jobs=-1)]: Done 165 out of 180 | elapsed:   10.2s remaining:    0.8s


Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
 ...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])
0.7119113573407202


[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:   10.5s finished


In [42]:
#Preprocessing all the tweets

train['clean_text']=train['text'].apply(lambda x: clean_tweet(x))
train['text_processed'] = train['clean_text'].apply(lambda x: tweet_preprocessing(x))

train.head()

Unnamed: 0,text,retweet_count,favorite_count,lang,user_screen_name,user_verified,user_followers_count,user_favourites_count,user_listed_count,user_statuses_count,interesting,clean_text,text_processed
0,Game day make it ‚öΩÔ∏è#31 Mo #WBALIV https://t.co...,3,22,en,O1Paul,False,7758,12008,1395,1395,,game day make it 31 mo wbaliv,game day make 31 mo wbaliv
1,‚òÅÔ∏è9Ô∏è\n\n@LFC are unbeaten in their last nine #...,296,2467,en,premierleague,True,17811702,999,91545,91545,,9 are unbeaten in their last nine pl meetings ...,9 beaten last nine ll meeting we do wbaliv
2,Another big win for the Baggies against a weak...,0,3,en,JPW_NBCSports,True,18786,9396,33749,33749,,another big win for the baggies against a weak...,another big win maggie weakened liverpool side...
3,"Early risers, we've got another one at 730AM t...",1,5,en,LFCAtlanta,False,2683,4218,31447,31447,,early risers we ve got another one at 730am to...,early riser got another one 730am tomorrow mor...
4,#PL üè¥Û†ÅßÛ†Å¢Û†Å•Û†ÅÆÛ†ÅßÛ†Åø‚öΩ\nWest Bromwich Albion üÜö Liverpool...,8,5,en,JugadaDepCL,False,6585,978,13554,13554,,pl west bromwich albion liverpool relatos come...,ll west bromwich action liverpool relates comm...


In [43]:
#Predicting the categories of all the tweets

predicted=gs.predict(train['clean_text'])
predicted.shape

(709,)

In [44]:
#Creating the dataframe of model 1 output

final_predict = pd.DataFrame(predicted,columns=['interesting'])
result = train[['text', 'user_screen_name','favorite_count', 'retweet_count']]
tweets1 = pd.concat([result,final_predict],axis=1)

## Model 2

Here we use tweet features such as retweet_count, favorite_count, user_verified, user_followers_count, etc. for model building. 
We will build multiple binary classification models using various classification algorithm. Finally, we will use VotingClassifier function to create the final model.

In [45]:
df.columns

Index(['text', 'retweet_count', 'favorite_count', 'lang', 'user_screen_name',
       'user_verified', 'user_followers_count', 'user_favourites_count',
       'user_listed_count', 'user_statuses_count', 'interesting', 'clean_text',
       'text_processed'],
      dtype='object')

In [46]:
df_x=df[['retweet_count', 'favorite_count','user_verified', 'user_followers_count', 'user_favourites_count',
       'user_listed_count']]
df_y=df['interesting']

In [47]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(df_x)
scaled = scaler.transform(df_x)

In [48]:
#Cross validation parameters
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

kfold = KFold(n_splits=40, random_state=42)

In [49]:
#Applying Support Vector Machine

from sklearn.svm import LinearSVC
l_svc = LinearSVC(random_state =42, C=0.0001)

results_l_svc = cross_val_score(l_svc, scaled, df_y, cv=kfold, scoring='accuracy')
print("Accuracy: %.3f" %(results_l_svc.mean()*100))

Accuracy: 65.167


In [50]:
#Applying Logistic Regression

from sklearn.linear_model import LogisticRegression
lr=LogisticRegression(penalty='l2',random_state=42)

results_lr = cross_val_score(lr, scaled, df_y, cv=kfold, scoring='accuracy')
print("Accuracy: %.3f" %(results_lr.mean()*100))

Accuracy: 66.028


In [51]:
#Applying Naive Bayes

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

results_gnb = cross_val_score(gnb, scaled, df_y, cv=kfold, scoring='accuracy')
print("Accuracy: %.3f" %(results_gnb.mean()*100))

Accuracy: 66.861


In [52]:
#Hyperparameter tuning for RandomForest with GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on approximte params  

param_grid = {
    'max_depth': [None, 1, 2, 4, 8],
    'n_estimators': [5, 10, 20, 40, 70, 100, 200]
}

# Create a based model
rf = RandomForestClassifier()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, cv = 5, n_jobs = -1, verbose = 2)

# Fit the grid search to the data
grid_search.fit(scaled, df_y)
grid_search.best_params_

Fitting 5 folds for each of 35 candidates, totalling 175 fits


[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    7.9s
[Parallel(n_jobs=-1)]: Done 160 out of 175 | elapsed:   11.3s remaining:    1.0s
[Parallel(n_jobs=-1)]: Done 175 out of 175 | elapsed:   12.1s finished


{'max_depth': None, 'n_estimators': 200}

In [53]:
#Applying random forest

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, criterion='entropy',max_depth=None, oob_score=True, random_state=42, n_jobs=-1)

results_rf = cross_val_score(rf, scaled, df_y, cv=kfold, scoring='accuracy')
print("Accuracy: %.3f" %(results_rf.mean()*100))

Accuracy: 74.333


In [56]:
#Applying VotingClassifier function
#VotingClassifier function collects the output from each model and gives the mean of all for classification problems

from sklearn.ensemble import VotingClassifier

eclf = VotingClassifier(estimators=[('l_svc', l_svc), ('LogReg', lr), ('Bayes', gnb), 
                                    ('forest', rf)], voting='hard')
eclf.fit(scaled, df_y)
results_eclf = cross_val_score(eclf, scaled, df_y, cv=kfold, scoring='accuracy',  n_jobs=-1)
print("Accuracy: %.3f" %(results_eclf.mean()*100))

Accuracy: 67.667


In [57]:
#Applying trained model on all the relevant tweets

test_x=train[['retweet_count', 'favorite_count','user_verified', 'user_followers_count', 'user_favourites_count',
       'user_listed_count']]

In [58]:
#Applying trained model on all the relevant tweets

predicted_2 = eclf.predict(scaler.transform(test_x))
predicted_2.shape

  if diff:


(709,)

In [59]:
#Creating the dataframe of predicted values

final_predict_2 = pd.DataFrame(predicted_2,columns=['check_interesting'])
tweets_check = pd.concat([tweets1,final_predict_2],axis=1)

## Model 1 + Model 2

In [61]:
#Final Dataframe: it contains only tweets which are interessting as per both the models

tweet_final=tweets_check[(tweets_check['check_interesting']==True) & (tweets_check['interesting']==True)]
tweet_final.head()

Unnamed: 0,text,user_screen_name,favorite_count,retweet_count,interesting,check_interesting
5,"Happy Friday, Reds! We‚Äôll see you bright and e...",LFCRaleigh,3,0,True,True
6,Gn family...sweet dreams... Going to try and ...,sshlfc,3,1,True,True
7,West Brom have scored in each of their last se...,FPLUpdates_Tips,3,0,True,True
8,"West Brom vs Liverpool: Preview, Team News, an...",BloodsugarNatz,36,16,True,True
9,"West Brom vs Liverpool: Preview, Team News, an...",LFCOffside,2,1,True,True


In [62]:
#Function which return the top tweets based on the user requirements
#It provides top tweets for two methods ('likes' or 'retweets')

def top_tweets(method, n):
    #method: 'likes' or 'retweets'
    #n: total number of tweets
    if 'retweets' in str(method):
        interessting_tweets = tweet_final.sort_values(by='retweet_count', ascending=False).reset_index(drop=True)[['text','user_screen_name','favorite_count','retweet_count']].head(int(n))
    else: interessting_tweets = tweet_final.sort_values(by='favorite_count', ascending=False).reset_index(drop=True)[['text','user_screen_name','favorite_count','retweet_count']].head(int(n))
        
    return interessting_tweets

In [63]:
#Final Result

top_tweets('retweets', 10)

Unnamed: 0,text,user_screen_name,favorite_count,retweet_count
0,ü§°#JokeReferee of the weekend winner is Stuart ...,JokeReferee,450,252
1,@jeremycorbyn @jacindaardern @EmilyThornberry ...,Jezza4_PM,304,228
2,üôå Darren Moore as WBA manager...\n\n- Unbeaten...,TheSportsman,721,159
3,"@jeremycorbyn ""He's put up with the most appal...",Jezza4_PM,273,141
4,ü§°Stuart Atwell seen here clearly looking at an...,JokeReferee,191,99
5,Stuart Attwell didn't have the best of games. ...,jimbeglin,547,95
6,Klopp just sent the interviewer for a hot dog ...,ZIYAAD_LFC,207,78
7,Darren Moore has really got the players going ...,mrdanwalker,477,57
8,"This Hegazi is a right dirty little bastard, s...",crlfc74,217,45
9,The greatest threat posed is unlawful evil int...,GoboMontaco,57,37


### How can we increase the accuracy of above model:

 1) By increasing the number of tagged tweets (training data). 
<br> 2) By increasing the quality of training data.
<br> 3) By spending more time in cleaning each tweet text.
<br> 4) By calculating Klout score for each user.
<br> 5) By tracking media content of all the tweets such as GIFs, Images, Videos, etc.