1.2 Twitter data preparation and cleaning
======


**In this example we process the hashtags and quoted_hashtags columns of the tweet data so that each hashtag is paired with a quoted hashtag in one data table. These pairs can later be used to visualise the relationships between hashtags and quoted hashtags.**

In [128]:
#Import required libraries
import pandas as pd
import glob, os 

# folder that contains the CSV files
path = r'./'
# collect all CSV files whose names match the pattern
all_csv = glob.glob(os.path.join(path, "dba_tweets_*.csv"))
# read each file lazily
frames = (pd.read_csv(f) for f in all_csv)
# concatenate everything into one data frame with a fresh 0..n-1 index
df_all = pd.concat(frames, ignore_index=True)
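
The loading step above fails with a fairly opaque `ValueError` from `pd.concat` when no file matches the pattern. A minimal sketch of the same step wrapped in a function with an explicit check is shown here; the helper name `load_tweet_csvs` and the two tiny demo files are hypothetical and exist only for illustration.

```python
import glob
import os
import tempfile

import pandas as pd


def load_tweet_csvs(folder, pattern="dba_tweets_*.csv"):
    """Concatenate every CSV in `folder` that matches `pattern`."""
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    if not paths:
        # fail early with a readable message instead of a concat error
        raise FileNotFoundError(f"no files match {pattern!r} in {folder!r}")
    return pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)


# demonstration with two small made-up files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2):
        pd.DataFrame(
            {"Tweets": [f"a{i}", f"b{i}"], "hashtags": ["[]", "['AI']"]}
        ).to_csv(os.path.join(tmp, f"dba_tweets_{i}.csv"), index=False)
    demo = load_tweet_csvs(tmp)

print(demo.shape)         # (4, 2)
print(list(demo.index))   # [0, 1, 2, 3]
```

Because of `ignore_index=True`, the row labels of the individual files are discarded and the combined frame is numbered from 0.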

**Check and inspect the data frame.**

In [129]:
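# Check and inspect the data frame.
print(df_all.shape)
# note: head is referenced without parentheses, so this prints the bound method
# together with the repr of the whole frame; df_all.head() returns the first five rows
print(df_all.head)
print(df_all.dtypes)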

(5000, 13)
<bound method NDFrame.head of                                                  Tweets  \
0     RT @SpirosMargaris: 5 #Free #Books \n\nto #Lea...   
1     RT @Eli_Krumova: #Anatomy Education in #3D #Au...   
2     RT @Eli_Krumova: #Anatomy Education in #3D #Au...   
3     RT @Eli_Krumova: #Anatomy Education in #3D #Au...   
4     RT @Eli_Krumova: #Anatomy Education in #3D #Au...   
...                                                 ...   
4995  Key findings in the @SPLUNK "#DataAge" survey ...   
4996  RT @Bomoimajid: 💡Get a Bang 💣 for Your Buck! W...   
4997  RT @Bomoimajid: 💡Get a Bang 💣 for Your Buck! W...   
4998  RT @Udemy_Coupons1: Microsoft OneDrive Master ...   
4999  RT @Bomoimajid: 💡Get a Bang 💣 for Your Buck! W...   

                             User  User_statuses_count  user_followers  \
0                           Irene                  698              12   
1     The Secret Junior Developer               226992            1187   
2     The Secret Junior Deve
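
The output above starts with `<bound method NDFrame.head of ...` because `head` was printed without being called. A small sketch of the difference, using a made-up eight-row frame:

```python
import pandas as pd

demo = pd.DataFrame({"hashtags": [f"tag{i}" for i in range(8)]})

# without parentheses we only get the method object
print(callable(demo.head))    # True

# calling it returns the first five rows by default
first_rows = demo.head()
print(first_rows.shape)       # (5, 1)

# an explicit row count can be passed as well
print(demo.head(3).shape)     # (3, 1)
```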

**Select only the hashtag columns to make a subset of the data frame.**

In [130]:
# select the two hashtag columns from the data
df_hashtags = df_all[['hashtags', 'quoted_hashtags']]
print(df_hashtags.head)

<bound method NDFrame.head of                                                hashtags quoted_hashtags
0     ['Free', 'Books', 'Learn', 'Statistics', 'Data...              []
1     ['Anatomy', '3D', 'AugmentedReality', 'Video',...              []
2     ['Anatomy', '3D', 'AugmentedReality', 'Video',...              []
3     ['Anatomy', '3D', 'AugmentedReality', 'Video',...              []
4     ['Anatomy', '3D', 'AugmentedReality', 'Video',...              []
...                                                 ...             ...
4995                                        ['DataAge']              []
4996                   ['YouTube', 'Online', 'BigData']              []
4997                   ['YouTube', 'Online', 'BigData']              []
4998  ['udemycoupon', 'MachineLearning', 'BigData', ...              []
4999                   ['YouTube', 'Online', 'BigData']              []

[5000 rows x 2 columns]>


**Remove rows in which hashtags or quoted_hashtags contains only [], which means the tweet has no hashtags or no quoted hashtags.**

In [131]:
# drop rows whose hashtags or quoted_hashtags value is the string '[]'
hashtags_clear = df_hashtags[~df_hashtags['hashtags'].isin(['[]',])]
hashtags_clear = hashtags_clear[~hashtags_clear['quoted_hashtags'].isin(['[]',])]
print(hashtags_clear.shape)

(39, 2)
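
The columns hold lists that were written to CSV as text, which is why the filter above compares against the string `'[]'`. An alternative sketch, assuming every cell is a valid Python list literal, parses the text with `ast.literal_eval` and then keeps only rows where both lists are non-empty. The three rows are invented for illustration.

```python
import ast

rows = [
    {"hashtags": "['Anatomy', '3D']", "quoted_hashtags": "[]"},
    {"hashtags": "[]", "quoted_hashtags": "['AI']"},
    {"hashtags": "['BigData']", "quoted_hashtags": "['AI', 'Rstats']"},
]

# turn the text "['a', 'b']" into a real list ['a', 'b']
parsed = [{key: ast.literal_eval(value) for key, value in row.items()} for row in rows]

# an empty list is falsy, so this keeps rows with at least one tag on both sides
kept = [row for row in parsed if row["hashtags"] and row["quoted_hashtags"]]

print(kept)
# [{'hashtags': ['BigData'], 'quoted_hashtags': ['AI', 'Rstats']}]
```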


**Separate the hashtags, using the comma as the separator.**

In [132]:
# split each hashtags string on commas and stack the pieces so that every hashtag gets its own row
from pandas import DataFrame
hashtags_stack = DataFrame(hashtags_clear.hashtags.str.split(',').tolist(), index=hashtags_clear.quoted_hashtags).stack()
hashtags_stack = hashtags_stack.reset_index()[[0, 'quoted_hashtags']] # the stacked values are in the column labelled 0
hashtags_stack.columns = ['hashtags', 'quoted_hashtags'] # rename column 0 to 'hashtags'
hashtags_stack.head()

Unnamed: 0,hashtags,quoted_hashtags
0,['Anatomy',"['3D', 'AugmentedReality', 'Analytics', 'AI', ..."
1,'3D',"['3D', 'AugmentedReality', 'Analytics', 'AI', ..."
2,'AugmentedReality',"['3D', 'AugmentedReality', 'Analytics', 'AI', ..."
3,'Video',"['3D', 'AugmentedReality', 'Analytics', 'AI', ..."
4,'Analytics',"['3D', 'AugmentedReality', 'Analytics', 'AI', ..."


**Separate the quoted_hashtags, using the comma as the separator.**

In [133]:
# split each quoted_hashtags string on commas and stack the pieces so that every quoted hashtag gets its own row
hashtags_stack2 = DataFrame(hashtags_stack.quoted_hashtags.str.split(',').tolist(), index=hashtags_stack.hashtags).stack()
hashtags_stack2 = hashtags_stack2.reset_index()[[0, 'hashtags']] # the stacked values are in the column labelled 0
hashtags_stack2.columns = ['quoted_hashtags','hashtags'] # rename column 0 to 'quoted_hashtags'
hashtags_stack2.head()

Unnamed: 0,quoted_hashtags,hashtags
0,['3D',['Anatomy'
1,'AugmentedReality',['Anatomy'
2,'Analytics',['Anatomy'
3,'AI',['Anatomy'
4,'Rstats',['Anatomy'


**Remove the brackets and quotes ([, ] and ') from the hashtags and quoted_hashtags content.**

In [134]:
# strip surrounding spaces, brackets and quotes, then remove any remaining quotes
hashtags_stack2['quoted_hashtags'] = hashtags_stack2['quoted_hashtags'].map(lambda x: x.strip(" ['] ").replace("'",""))
hashtags_stack2['hashtags'] = hashtags_stack2['hashtags'].map(lambda x: x.strip(" ['] ").replace("'",""))

hashtags_stack2.head()



Unnamed: 0,quoted_hashtags,hashtags
0,3D,Anatomy
1,AugmentedReality,Anatomy
2,Analytics,Anatomy
3,AI,Anatomy
4,Rstats,Anatomy
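
The split, stack and strip steps above can also be expressed with `DataFrame.explode`. The following is a sketch of that alternative, not the method used in this notebook. It assumes the cells are valid Python list literals, and the two rows are invented for illustration.

```python
import ast

import pandas as pd

demo = pd.DataFrame({
    "hashtags": ["['Anatomy', '3D']", "['BigData']"],
    "quoted_hashtags": ["['AI', 'Rstats']", "['AI']"],
})

# parse the text cells into real lists
for col in ["hashtags", "quoted_hashtags"]:
    demo[col] = demo[col].apply(ast.literal_eval)

# one row per hashtag, then one row per (hashtag, quoted hashtag) pair;
# the index is reset between the two steps to keep row labels unique
pairs = (
    demo.explode("hashtags")
        .reset_index(drop=True)
        .explode("quoted_hashtags")
        .reset_index(drop=True)
)

print(pairs.values.tolist())
# [['Anatomy', 'AI'], ['Anatomy', 'Rstats'], ['3D', 'AI'], ['3D', 'Rstats'], ['BigData', 'AI']]
```

Since the lists are parsed before they are expanded, no brackets or quotes are left over to strip.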


**Check the dimensions of the processed hashtags data frame.**

In [135]:
hashtags_stack2.shape

(939, 2)

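**Check the dimensions again after removing duplicate rows.**

In [136]:
# drop_duplicates() returns a new data frame; the result is not assigned here,
# so hashtags_stack2 itself and the shape reported below are unchanged
hashtags_stack2.drop_duplicates()
hashtags_stack2.shape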

(939, 2)
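
The shape is the same as before because `drop_duplicates()` does not modify the data frame it is called on. A minimal sketch with three invented rows, two of them identical, shows that the result has to be assigned for the duplicates to disappear:

```python
import pandas as pd

pairs = pd.DataFrame({
    "quoted_hashtags": ["AI", "AI", "Rstats"],
    "hashtags": ["Anatomy", "Anatomy", "Anatomy"],
})

# the returned frame is discarded, so pairs keeps all three rows
pairs.drop_duplicates()
print(pairs.shape)      # (3, 2)

# assigning the result keeps only the unique rows
deduped = pairs.drop_duplicates().reset_index(drop=True)
print(deduped.shape)    # (2, 2)
```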

**Convert the content of both columns to title case: the first character becomes upper case and the remaining letters lower case.**

In [137]:
# title-case both columns (for example AugmentedReality becomes Augmentedreality)
hashtags_stack2['quoted_hashtags']=hashtags_stack2['quoted_hashtags'].str.title() 
hashtags_stack2['hashtags']=hashtags_stack2['hashtags'].str.title()
hashtags_stack2.head()

Unnamed: 0,quoted_hashtags,hashtags
0,3D,Anatomy
1,Augmentedreality,Anatomy
2,Analytics,Anatomy
3,Ai,Anatomy
4,Rstats,Anatomy
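
As the output shows, title-casing also lower-cases every letter after the first one. The sketch below uses Python's built-in `str.title`, which `Series.str.title` is documented to apply element-wise, on a few invented tags to show how case variants of the same tag end up identical:

```python
tags = ["AugmentedReality", "AI", "ai", "3D", "bigdata"]

titled = [tag.title() for tag in tags]
print(titled)
# ['Augmentedreality', 'Ai', 'Ai', '3D', 'Bigdata']

# 'AI' and 'ai' now collapse into the single value 'Ai'
print(len(set(tags)), len(set(titled)))   # 5 4
```

A digit counts as a word boundary for `title`, which is why `3D` keeps its upper-case `D`.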


**Check the dimensions again after removing duplicate rows.**

In [138]:
# as before, the result of drop_duplicates() is not assigned,
# so the shape of hashtags_stack2 reported below is unchanged
hashtags_stack2.drop_duplicates()
hashtags_stack2.shape

(939, 2)

**Remove the rows in which the quoted hashtag and the original hashtag are the same.**

In [139]:
# keep only rows whose quoted hashtag differs from the original hashtag
hashtags_stack_final = hashtags_stack2[hashtags_stack2['quoted_hashtags']!=hashtags_stack2['hashtags']]
# note: this prints the shape of the unfiltered hashtags_stack2, not of hashtags_stack_final
hashtags_stack2.shape


(939, 2)
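
The shape shown above belongs to `hashtags_stack2`, so it does not reveal how many rows the filter removed. A small sketch with three invented pairs, one of which is a self-pair, compares the shapes before and after the filter:

```python
import pandas as pd

pairs = pd.DataFrame({
    "quoted_hashtags": ["Ai", "3D", "Rstats"],
    "hashtags": ["Ai", "Ai", "Anatomy"],
})

# boolean mask: True where the two columns differ
differs = pairs["quoted_hashtags"] != pairs["hashtags"]
final = pairs[differs]

print(pairs.shape)   # (3, 2)
print(final.shape)   # (2, 2)
```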

**Save the processed hashtags data to a CSV file.**

In [140]:
# save processed hashtags data to csv
hashtags_stack_final.to_csv("hashtags2.csv")
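
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below uses an invented one-row frame and an in-memory CSV string, and shows the effect of passing `index=False`. Whether the index is wanted in `hashtags2.csv` depends on what the later script expects, which is not shown here.

```python
import io

import pandas as pd

pairs = pd.DataFrame({"quoted_hashtags": ["3D"], "hashtags": ["Ai"]})

# to_csv() without a path returns the CSV text instead of writing a file
with_index = pd.read_csv(io.StringIO(pairs.to_csv()))
without_index = pd.read_csv(io.StringIO(pairs.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'quoted_hashtags', 'hashtags']
print(list(without_index.columns))  # ['quoted_hashtags', 'hashtags']
```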

**Filter the pairs to those whose hashtag contains one of the selected terms, so that the network graph centres on the selected terms instead of on everything.**

In [141]:
# keep rows whose original hashtag contains one of the selected terms (case-insensitive substring match);
# the mask is built from hashtags_stack_final itself so that its index matches the frame being filtered
selected = hashtags_stack_final['hashtags'].str.contains('Datascience|Bigdata|Ai', case=False)
hashtags_test = hashtags_stack_final[selected]
hashtags_test.head()

Unnamed: 0,quoted_hashtags,hashtags
40,3D,Ai
41,Augmentedreality,Ai
42,Analytics,Ai
44,Rstats,Ai
45,Reactjs,Ai
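
Note that `str.contains` matches substrings, so a short term such as `Ai` also selects any hashtag that merely contains those letters. The sketch below uses four invented pairs to compare the substring filter with an exact match via `isin`. Which behaviour is wanted depends on the analysis, and the exact match is only an assumption about one possible intent.

```python
import pandas as pd

pairs = pd.DataFrame({
    "quoted_hashtags": ["3D", "Rstats", "Ai", "3D"],
    "hashtags": ["Ai", "Training", "Bigdata", "Anatomy"],
})

# substring match: 'Training' is selected because it contains 'ai'
substring = pairs[pairs["hashtags"].str.contains("Datascience|Bigdata|Ai", case=False)]
print(substring["hashtags"].tolist())   # ['Ai', 'Training', 'Bigdata']

# exact match against the list of selected hashtags
exact = pairs[pairs["hashtags"].isin(["Datascience", "Bigdata", "Ai"])]
print(exact["hashtags"].tolist())       # ['Ai', 'Bigdata']
```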


**Save the filtered data frame to a CSV file.**

In [142]:
# save filtered hashtags data to csv
hashtags_test.to_csv("hashtags_test.csv")

**Run the customised network graph Python script. It writes the network graph to quoted_hashtags_networkx_graph2.html.**

In [None]:
%run network/network.py 
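
The content of `network/network.py` is not shown in this notebook. As a rough sketch of how hashtag pairs translate into graph data, the example below uses only the standard library: each distinct pair becomes an edge weighted by how often it occurs, and each hashtag becomes a node. The four pairs are invented, and a real script would read them from the saved CSV file and hand them to a graph library.

```python
from collections import Counter, defaultdict

# (quoted hashtag, original hashtag) pairs, invented for illustration
pairs = [("3D", "Ai"), ("Rstats", "Ai"), ("3D", "Ai"), ("Ai", "Bigdata")]

# edge weight = number of times the pair occurs
edge_weights = Counter(pairs)

# undirected adjacency: which hashtags are linked to which
neighbours = defaultdict(set)
for quoted, original in edge_weights:
    neighbours[quoted].add(original)
    neighbours[original].add(quoted)

# node degree = number of distinct neighbours
degree = {node: len(adjacent) for node, adjacent in neighbours.items()}

print(edge_weights[("3D", "Ai")])   # 2
print(degree["Ai"])                 # 3
```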
