## Exploring the data in each field

This notebook explores each field of the dataset by looking at its values. These are my conclusions:

* Include *user_id* in the dataset: without it we cannot compare users. For example, one user may post many tweets that all share the same *user_verified* value, and we cannot tell whether those tweets come from the same user.

* Irrelevant columns:
	* favorite_count: All 0
	* retweet_count: All 0
	* retweeted: All False
	* user_following: All NaN

* The data runs from Aug 12th to Sept 12th, 2016, roughly 2 to 3 months before the election (Nov 8th). Comparing data over time will be difficult because the window covers only 1 month.

* Only 5705 tweets come from verified users

* Unstructured data:
	* country: The same country appears under several names, such as 'México' and 'Mexico', or 'Nederland' and 'The Netherlands'. To group the data by country we should use 'place_country_code' instead, because 'country' is unstructured
	* user_location: location can be a city, country, home, etc.

In [88]:
import pandas as pd
import numpy as np

tweets = pd.read_pickle("../../data/cp_tweets.pkl")

In [89]:
len(tweets)

657307

In [90]:
tweets.keys()

Index(['tweet_id', 'created_at', 'entities_hashtags', 'place_bounding_box',
       'country', 'place_country_code', 'place_full_name', 'place_id',
       'place_name', 'place_place_type', 'place_url', 'favorite_count',
       'geo_coordinates', 'geo_type', 'text', 'lang', 'retweet_count',
       'retweeted', 'source', 'timestamp_ms', 'user_created_at',
       'user_favourites_count', 'user_followers_count', 'user_following',
       'user_friends_count', 'user_location', 'user_screen_name',
       'user_statuses_count', 'user_time_zone', 'user_url', 'user_verified'],
      dtype='object')

In [92]:
tweets.tweet_id.value_counts().head()

766452532084342788    3
767841456241410048    3
773872651718815744    2
773809373646692353    2
764898321483739137    2
Name: tweet_id, dtype: int64

Some tweet ids appear more than once, so the dataset contains repeated tweets. We should probably remove them.
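
A minimal sketch of how the repeated tweets could be removed, assuming that rows sharing a `tweet_id` are true duplicates. The `sample` frame and its values are made up for illustration and stand in for `tweets`.

```python
import pandas as pd

# Toy stand-in for the real `tweets` frame; ids and texts are made up.
sample = pd.DataFrame({
    "tweet_id": [1, 1, 2, 3, 3, 3],
    "text": ["a", "a", "b", "c", "c", "c"],
})

# Keep the first row of every tweet_id and drop the later repeats.
deduped = sample.drop_duplicates(subset="tweet_id", keep="first")

print(len(sample), "->", len(deduped))
```

Before applying this to the real data it would be worth checking that the repeated rows are identical in every column, not only in `tweet_id`.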

In [3]:
tweets['created_at'].describe()

count                  657307
unique                 562062
top       2016-09-01 01:55:48
freq                        8
first     2016-08-12 10:04:00
last      2016-09-12 13:20:48
Name: created_at, dtype: object

The data runs from Aug 12th to Sept 12th, 2016, roughly 2 to 3 months before the election (Nov 8th). Comparing data over time will be difficult because the window covers only 1 month.
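
To see how the tweets are spread over that window, one option is to measure the covered span and count tweets per day. The sketch below uses a made-up series: only the first and last timestamps are taken from the output above, the two in between are invented.

```python
import pandas as pd

# Stand-in for tweets['created_at']; the middle two values are invented.
created_at = pd.to_datetime(pd.Series([
    "2016-08-12 10:04:00",
    "2016-08-12 18:30:00",
    "2016-08-20 09:15:00",
    "2016-09-12 13:20:48",
]))

# Length of the observation window.
span = created_at.max() - created_at.min()

# Truncate each timestamp to its day, then count tweets per day.
per_day = created_at.dt.floor("D").value_counts().sort_index()

print(span.days, "days covered")
print(per_day)
```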

In [4]:
print(tweets['entities_hashtags'].at[99998])
print(tweets.at[99998,'text'])

[[[48, 63], 'socialservices']]
But people WAIT! It's for "women and children!" #socialservices https://t.co/1bog9XaQp8


This field gives the exact position of each hashtag in the tweet text. It could be useful if we want to remove hashtags.
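
As a sketch of how those positions could be used, the hypothetical helper `strip_hashtags` below cuts each span out of the text. It assumes every entry has the `[[start, end], tag]` layout printed above, with `end` exclusive and offsets counted in characters, which holds for this sample.

```python
def strip_hashtags(text, hashtags):
    """Remove every hashtag span listed in `hashtags` from `text`.

    Assumes entries shaped like [[start, end], tag], with `end` exclusive.
    """
    # Cut from the rightmost span first so earlier offsets stay valid.
    for (start, end), _tag in sorted(hashtags, key=lambda h: h[0][0], reverse=True):
        text = text[:start] + text[end:]
    return text


text = 'But people WAIT! It\'s for "women and children!" #socialservices https://t.co/1bog9XaQp8'
hashtags = [[[48, 63], "socialservices"]]

cleaned = strip_hashtags(text, hashtags)
print(cleaned)
```

Removing a span leaves the surrounding spaces in place, so a later whitespace cleanup may be needed.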

In [23]:
t = tweets.place_bounding_box.count()
print(t/len(tweets)*100,'%')

99.9952837867237 %


Almost all the tweets (99.995%) have a value in this field.
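
The same coverage check can be run for every column at once. A minimal sketch on a made-up `sample` frame (the values are placeholders, not real data):

```python
import pandas as pd

# Toy stand-in for `tweets`; None marks a missing value.
sample = pd.DataFrame({
    "place_bounding_box": ["box_a", "box_b", "box_c", None],
    "geo_coordinates": [None, None, "point_a", None],
})

# notna() gives a boolean frame; the mean of each column is its non-null share.
coverage = sample.notna().mean() * 100

print(coverage.round(2))
```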

In [9]:
tweets.country.describe()

count            657276
unique              350
top       United States
freq             591990
Name: country, dtype: object

In [93]:
tweets.country.value_counts().head(30)

United States      591990
Canada              17228
United Kingdom       8599
México               7830
Australia            2613
Mexico               2439
India                1593
France               1257
Estados Unidos       1041
Ireland               915
Singapore             906
Colombia              804
Germany               800
Japan                 646
South Africa          633
Deutschland           623
Nederland             561
Italia                559
Vietnam               531
Spain                 497
Armenia               476
New Zealand           470
Brasil                468
Pakistan              436
Thailand              424
Nigeria               410
Venezuela             382
España                358
The Netherlands       337
Mauritania            319
Name: country, dtype: int64

The same country appears under several names, such as 'México' and 'Mexico', or 'Nederland' and 'The Netherlands'. To group the data by country we should use 'place_country_code' instead, which makes this field irrelevant.
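
One way to find every country that is affected is to count the distinct spellings per country code. This is a sketch on a made-up `sample` frame, not on the real data:

```python
import pandas as pd

# Toy stand-in for `tweets` with the spellings mentioned above.
sample = pd.DataFrame({
    "country": ["México", "Mexico", "Nederland", "The Netherlands", "United States"],
    "place_country_code": ["MX", "MX", "NL", "NL", "US"],
})

# Number of distinct spellings of `country` for each code.
n_spellings = sample.groupby("place_country_code")["country"].nunique()

# Codes with more than one spelling would be split if we grouped by `country`.
ambiguous_codes = n_spellings[n_spellings > 1].index.tolist()

print(ambiguous_codes)
```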

In [95]:
tweets.place_country_code.value_counts().head(10)

US    593268
CA     17249
MX     10293
GB      8637
AU      2613
IN      1625
DE      1448
FR      1293
SG      1039
IE       921
Name: place_country_code, dtype: int64

In [106]:
tweets.place_full_name.value_counts().head(10)

Florida, USA           17582
Los Angeles, CA        12917
Pennsylvania, USA      12640
Manhattan, NY          12423
Georgia, USA           10121
Chicago, IL             9880
Kentucky, USA           9119
North Carolina, USA     7443
New York, USA           7290
Texas, USA              7218
Name: place_full_name, dtype: int64

In [107]:
tweets.place_id.head()

0       29a119f18820c3ad
1       c7ef5f3368b68777
10      faef11a3eaa8abdb
100     dd9c503d6c35364b
1000    49a6be2d1d5284d1
Name: place_id, dtype: object

In [108]:
tweets.place_name.head()

0              Frontenac
1            Baton Rouge
10      Chesapeake Beach
100         Pennsylvania
1000          Flemington
Name: place_name, dtype: object

In [109]:
tweets.place_place_type.value_counts()

city            489275
admin           156311
country           8082
poi               2218
neighborhood      1390
Name: place_place_type, dtype: int64

In [12]:
tweets.place_url.head()

0       https://api.twitter.com/1.1/geo/id/29a119f1882...
1       https://api.twitter.com/1.1/geo/id/c7ef5f3368b...
10      https://api.twitter.com/1.1/geo/id/faef11a3eaa...
100     https://api.twitter.com/1.1/geo/id/dd9c503d6c3...
1000    https://api.twitter.com/1.1/geo/id/49a6be2d1d5...
Name: place_url, dtype: object

In [14]:
tweets.favorite_count.describe()

count    657307.0
mean          0.0
std           0.0
min           0.0
25%           0.0
50%           0.0
75%           0.0
max           0.0
Name: favorite_count, dtype: float64

All values are 0, so we can skip this field.

In [24]:
t = tweets.geo_coordinates.count()
print(t/len(tweets)*100,'%')

2.130967721323522 %


Only 2.13% of the tweets contain geo coordinates, so this field may be irrelevant.

In [25]:
tweets.geo_type.value_counts()

Point    14007
Name: geo_type, dtype: int64

This field goes with geo_coordinates: both have 14007 non-null values, and the only type is Point.

In [26]:
tweets.text.describe()

count               657307
unique              649107
top       @realDonaldTrump
freq                   506
Name: text, dtype: object

In [96]:
tweets.text.value_counts().head()

@realDonaldTrump                                                                                                                     506
@HillaryClinton                                                                                                                      142
@realDonaldTrump https://t.co/pVTNPKABtg #MakeAmericaGreatAgain #NoSacredCows #noPC #nomoresocialexperiments #TeamTrump #Breaking    117
@greta @realDonaldTrump yes                                                                                                          104
@greta @HillaryClinton yes                                                                                                            68
Name: text, dtype: int64

In [28]:
tweets.text.count()

657307

There are no null tweets: every row has text.

In [97]:
tweets.lang.value_counts().head()

en     563329
und     72178
es      12881
fr       1959
tl       1013
Name: lang, dtype: int64

The language could not be determined for 72178 tweets. This may happen because the tweet contains no real text, only mentions, hashtags or links.

In [58]:
tweets[tweets['lang'] == 'und']['text'].head(10).tolist()

['@theblaze @realDonaldTrump https://t.co/TY9DlZ584c',
 '@bfraser747 @WinnaWinna2016 @HillaryClinton @RedNationRising #Alinsky #Soros #Hill #Barry IAF https://t.co/7eLCaMLcs9',
 '@FoxNews @realDonaldTrump @seanhannity  LOL',
 '#Obamasucks  https://t.co/JhQRJTyuez',
 '#USA #MAGA #Vets4Trump #ElvisPresley https://t.co/1LyiF6Kind',
 '@FoxNews @realDonaldTrump @seanhannity https://t.co/g1qKPfwwRQ',
 '@HillaryClinton @JoeBiden @realDonaldTrump https://t.co/j0GlnPnV1R',
 '@realDonaldTrump @maddow @Morning_Joe @JohnCleese https://t.co/ADcoFGaz2i',
 '@SarahKSilverman @realDonaldTrump https://t.co/l27ZvNA7sJ',
 '@realDonaldTrump']
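
The sample above suggests that these tweets hold mostly mentions, hashtags and links. One way to check this is to strip those tokens and look at what remains. The sketch below uses a hypothetical helper, `residual_text`, on a few of the tweets listed above. The regular expression is a rough approximation, not Twitter's own tokenization.

```python
import re

# Mentions, hashtags and URLs: tokens that carry little language signal.
TOKEN_RE = re.compile(r"(@\w+|#\w+|https?://\S+)")


def residual_text(text):
    """Return what is left of a tweet once mentions, hashtags and URLs are removed."""
    return TOKEN_RE.sub(" ", text).strip()


samples = [
    "@theblaze @realDonaldTrump https://t.co/TY9DlZ584c",
    "@FoxNews @realDonaldTrump @seanhannity  LOL",
    "#Obamasucks  https://t.co/JhQRJTyuez",
    "@realDonaldTrump",
]

for s in samples:
    print(repr(residual_text(s)))
```

An empty result means the tweet had nothing for a language detector to work with.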

In [60]:
tweets.retweet_count.describe()

count    657307.0
mean          0.0
std           0.0
min           0.0
25%           0.0
50%           0.0
75%           0.0
max           0.0
Name: retweet_count, dtype: float64

All values are 0, so we can skip this field too.

In [62]:
tweets.retweeted.describe()

count     657307
unique         1
top        False
freq      657307
Name: retweeted, dtype: object

All values are False, so we can skip this field too.
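
Fields like this one, which hold a single value in every row, can also be found automatically. A minimal sketch on a made-up `sample` frame that mimics the column names of `tweets`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `tweets`; only the last column varies.
sample = pd.DataFrame({
    "favorite_count": [0.0, 0.0, 0.0],
    "retweeted": [False, False, False],
    "user_following": [np.nan, np.nan, np.nan],
    "user_followers_count": [97, 427, 1615],
})

# dropna=False counts NaN as a value, so an all-NaN column has 1 distinct value.
n_distinct = sample.nunique(dropna=False)
constant_columns = n_distinct[n_distinct <= 1].index.tolist()

print(constant_columns)
```

On the real frame this may fail for columns that hold lists, such as `entities_hashtags`, because lists are not hashable. Those columns would need to be excluded or converted first.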

In [67]:
tweets.source.head(10).tolist()

['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
 '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
 '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
 '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
 '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
 '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
 '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>']

It tells us which client the tweet was sent from.
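
The value is an HTML anchor, so the client name has to be extracted before it can be counted. A sketch with a hypothetical helper, `client_name`, assuming every value has the `<a ...>name</a>` shape shown above:

```python
import re

# Captures the text between the opening tag and the closing </a>.
ANCHOR_TEXT_RE = re.compile(r">([^<]*)</a>")


def client_name(source_html):
    """Return the anchor text of a `source` value, or the value itself if it is not an anchor."""
    match = ANCHOR_TEXT_RE.search(source_html)
    return match.group(1) if match else source_html


source = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
print(client_name(source))
```

On the real frame this could be applied with `tweets.source.map(client_name).value_counts()`.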

In [68]:
tweets.timestamp_ms.describe()

count                         657307
unique                        656947
top       2016-08-19 01:51:38.305000
freq                               3
first     2016-08-12 10:04:00.225000
last      2016-09-12 13:20:48.096000
Name: timestamp_ms, dtype: object

It gives the tweet creation time with millisecond precision, so it is more precise than created_at.

In [76]:
tweets.user_created_at.describe()

count                  657307
unique                  86398
top       2010-02-17 16:55:49
freq                     6244
first     2006-07-05 19:52:46
last      2016-09-12 00:13:33
Name: user_created_at, dtype: object

In [81]:
tweets.user_favourites_count.describe()

count    657307.000000
mean       8988.641432
std       21316.855890
min           0.000000
25%         369.000000
50%        2196.000000
75%        7876.000000
max      743534.000000
Name: user_favourites_count, dtype: float64

In [82]:
tweets.user_followers_count.describe()

count    6.573070e+05
mean     3.179759e+03
std      4.478260e+04
min      0.000000e+00
25%      9.700000e+01
50%      4.270000e+02
75%      1.615000e+03
max      1.136233e+07
Name: user_followers_count, dtype: float64

In [83]:
tweets.user_following.describe()

count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: user_following, dtype: float64

In [84]:
tweets.user_friends_count.describe()

count    657307.000000
mean       1843.207267
std        6521.584588
min           0.000000
25%         151.000000
50%         543.000000
75%        1896.000000
max      945156.000000
Name: user_friends_count, dtype: float64

In [98]:
tweets.user_location.value_counts().head(20)

United States          15892
USA                     9447
Crab Orchard, Ky.       6244
Florida, USA            6049
Chicago, IL             5680
California, USA         4456
Austin, TX              3805
Washington, DC          3580
longville La            3470
Los Angeles, CA         3359
North Carolina, USA     3094
New York, NY            3043
Texas, USA              2906
New Jersey, USA         2801
Alabama, USA            2778
home                    2776
San Diego, CA           2611
NYC                     2586
Pennsylvania, USA       2465
New York                2292
Name: user_location, dtype: int64

This field is unstructured: a location can be a city, a country, 'home', etc.

In [100]:
tweets.user_screen_name.value_counts().head(10)

AppaloosaGuy     6244
sunnyherring1    4541
ofarther         3470
chigobiker       3006
pvtbonehead      2707
Unclerojelio     2184
Non_MSM_News     2074
Kegan05          1692
djcaldwelldmd    1640
purdycan         1439
Name: user_screen_name, dtype: int64

In [101]:
tweets.user_statuses_count.head()

0       17620
1        5046
10        277
100     10294
1000      944
Name: user_statuses_count, dtype: int64

The number of tweets (including retweets) posted by the user.

In [104]:
tweets.user_time_zone.value_counts().head(10)

Eastern Time (US & Canada)     113807
Pacific Time (US & Canada)      84910
Central Time (US & Canada)      58968
Atlantic Time (Canada)          18142
Quito                           14165
Mountain Time (US & Canada)     10806
Arizona                          9875
America/New_York                 4832
London                           4702
Hawaii                           4678
Name: user_time_zone, dtype: int64

In [111]:
tweets.user_verified.value_counts()

False    651602
True       5705
Name: user_verified, dtype: int64
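
Since the frame has one row per tweet, the counts above are tweets, not users. A minimal sketch of how distinct verified users could be counted, using `user_screen_name` as a stand-in for a user id (an assumption, since screen names can change). The `sample` frame is made up.

```python
import pandas as pd

# Toy stand-in for `tweets`: three users, two of them verified.
sample = pd.DataFrame({
    "user_screen_name": ["alice", "alice", "bob", "carol", "carol", "carol"],
    "user_verified": [True, True, False, True, True, True],
})

# Tweets posted by verified users (this is what value_counts() reports).
verified_tweets = int(sample["user_verified"].sum())

# Distinct verified users, identified here by screen name.
verified_users = sample.loc[sample["user_verified"], "user_screen_name"].nunique()

print(verified_tweets, "tweets from", verified_users, "verified users")
```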