# Initial Data Discovery

In this notebook, I start with a simple exploration of the data to evaluate the dataset, then apply some cleaning and formatting before saving the result to a new file.

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('trump.csv')
data.head()

Unnamed: 0,"source,text,created_at,retweet_count,favorite_count,is_retweet,id_str"
0,"Twitter for Android,""@BrettNeveraski: I see yo..."
1,"Twitter for Android,To EVERYONE including all ..."
2,"Twitter for Android,""@cpetelis: @realDonaldTru..."
3,"Twitter for Android,""@djspookyshadow: Feeling ..."
4,"Twitter for Android,""@joelmch2os: @realDonaldT..."


In [3]:
data = data["source,text,created_at,retweet_count,favorite_count,is_retweet,id_str"].str.split(",", n=7, expand=True)

In [4]:
data.columns = ['source','text','time','retweets','favorite','is_retweet','id']
data.head()

Unnamed: 0,source,text,time,retweets,favorite,is_retweet,id
0,Twitter for Android,"""@BrettNeveraski: I see you @realDonaldTrump h...",12-31-2014 21:07:30,53,166,False,550397860240707584
1,Twitter for Android,To EVERYONE including all haters and losers HA...,12-31-2014 21:15:21,1271,1209,False,550399835682390016
2,Twitter for Android,"""@cpetelis: @realDonaldTrump If you run for Pr...",12-31-2014 23:56:23,6,18,False,550440363090280448
3,Twitter for Android,"""@djspookyshadow: Feeling a deep gratitude for...",12-31-2014 23:57:02,9,31,False,550440523094577152
4,Twitter for Android,"""@joelmch2os: @realDonaldTrump announce your p...",12-31-2014 23:57:25,8,26,False,550440620792492032
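The split step above can be illustrated on a tiny stand-in frame. This is only a sketch: the rows and the `raw` column name are made up, and `n=6` is my assumption for seven fields (the cell above uses `n=7`).

```python
import pandas as pd

# Toy stand-in for the single packed column seen above (values are made up).
packed = pd.DataFrame({
    "raw": [
        "Twitter for Android,hello world,12-31-2014 21:07:30,53,166,false,1",
        "Twitter for iPhone,second tweet,01-01-2015 08:00:00,6,18,true,2",
    ]
})

# n caps the number of splits; expand=True returns one column per piece.
parts = packed["raw"].str.split(",", n=6, expand=True)
parts.columns = ["source", "text", "time", "retweets", "favorite", "is_retweet", "id"]

print(parts.shape)
```

Splitting runs left to right, so a comma inside the tweet text would shift every later field one position to the right. It is worth checking a few rows after the split.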


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28564 entries, 0 to 28563
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   source      28564 non-null  object
 1   text        28564 non-null  object
 2   time        28564 non-null  object
 3   retweets    28564 non-null  object
 4   favorite    28564 non-null  object
 5   is_retweet  28564 non-null  object
 6   id          28564 non-null  object
dtypes: object(7)
memory usage: 1.5+ MB


In [6]:
data.source.value_counts()

Twitter for iPhone          18438
Twitter for Android          7304
Twitter Web Client           2153
Twitter Media Studio          159
Media Studio                  156
Twitter Ads                    97
Twitter for BlackBerry         94
Instagram                      70
Twitter for iPad               60
Twitter QandA                  10
Periscope                       7
Neatly For BlackBerry 10        5
Twitter Web App                 4
Mobile Web (M5)                 2
TweetDeck                       2
Facebook                        2
Twitter Mirror for iPad         1
Name: source, dtype: int64

In [7]:
data.time.value_counts().head()

07-15-2018 13:33:15    5
05-10-2019 11:22:22    4
06-09-2019 12:26:37    4
05-20-2019 11:20:53    4
09-04-2019 21:08:58    4
Name: time, dtype: int64

In [8]:
data.retweets.describe()

count     28564
unique    16718
top           9
freq        169
Name: retweets, dtype: object

In [9]:
data.favorite.head()

0     166
1    1209
2      18
3      31
4      26
Name: favorite, dtype: object

In [10]:
data.is_retweet.sample(10)

1006     false
24768     true
25012     true
2964     false
6822     false
8326     false
19137    false
25606     true
235      false
7143     false
Name: is_retweet, dtype: object

In [11]:
data.id.head()

0    550397860240707584
1    550399835682390016
2    550440363090280448
3    550440523094577152
4    550440620792492032
Name: id, dtype: object

In [12]:
data.is_retweet.unique()
data.is_retweet.value_counts()

false    25083
true      3423
            58
Name: is_retweet, dtype: int64

After a quick look at the data, I decided to remove the following column:
- source



In [13]:
list(data.columns)

['source', 'text', 'time', 'retweets', 'favorite', 'is_retweet', 'id']

In [14]:
data = data.drop(['source'], axis=1)


In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28564 entries, 0 to 28563
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   text        28564 non-null  object
 1   time        28564 non-null  object
 2   retweets    28564 non-null  object
 3   favorite    28564 non-null  object
 4   is_retweet  28564 non-null  object
 5   id          28564 non-null  object
dtypes: object(6)
memory usage: 1.3+ MB


Since every column has the object dtype, the following columns need to be converted to integers:
- favorite (favorite_count in the original file)
- retweets (retweet_count in the original file)


In [16]:
# change values to numeric
data['favorite'] = data['favorite'].astype(int)
data['retweets'] = data['retweets'].astype(int)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28564 entries, 0 to 28563
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   text        28564 non-null  object
 1   time        28564 non-null  object
 2   retweets    28564 non-null  int64 
 3   favorite    28564 non-null  int64 
 4   is_retweet  28564 non-null  object
 5   id          28564 non-null  object
dtypes: int64(2), object(4)
memory usage: 1.3+ MB
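The `astype(int)` conversion works here because every value parses cleanly. If some values did not, a more defensive variant might look like the sketch below. The sample values are made up, and `pd.to_numeric` is an alternative I am suggesting, not what the cell above uses.

```python
import pandas as pd

# Made-up values, including two that are not valid integers.
counts = pd.Series(["166", "1209", "", "abc"], name="favorite")

# errors="coerce" turns unparseable strings into NaN instead of raising.
as_numbers = pd.to_numeric(counts, errors="coerce")

# Inspect the rows that failed to parse before deciding what to do with them.
bad_rows = counts[as_numbers.isna()]
print(bad_rows.tolist())

# The nullable Int64 dtype keeps integers while allowing missing values.
as_int = as_numbers.astype("Int64")
```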


Next, convert the following column to a datetime format:
- time (created_at in the original file)

In [17]:
data.describe()

Unnamed: 0,retweets,favorite
count,28564.0,28564.0
mean,11686.523771,38478.624142
std,12813.099887,52291.454507
min,0.0,0.0
25%,1530.75,36.0
50%,9138.0,9519.0
75%,17798.0,70529.75
max,369530.0,879647.0


In [18]:
# change to datetime 
data["time"] = pd.to_datetime(data["time"])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28564 entries, 0 to 28563
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   text        28564 non-null  object        
 1   time        28564 non-null  datetime64[ns]
 2   retweets    28564 non-null  int64         
 3   favorite    28564 non-null  int64         
 4   is_retweet  28564 non-null  object        
 5   id          28564 non-null  object        
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 1.3+ MB
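The call above lets pandas infer the date layout. Passing an explicit format is a small safeguard, since a string such as 05-10-2019 could otherwise be read as either May 10 or 5 October. The sketch below assumes every row follows the month-day-year pattern seen in the outputs above.

```python
import pandas as pd

# Two timestamps copied from the outputs above.
stamps = pd.Series(["12-31-2014 21:07:30", "05-10-2019 11:22:22"])

# An explicit format pins the field order to month-day-year.
parsed = pd.to_datetime(stamps, format="%m-%d-%Y %H:%M:%S")

print(parsed.dt.year.tolist())
```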


In [19]:
data.set_index('time', inplace=True)
data.head()

time,text,retweets,favorite,is_retweet,id
2014-12-31 21:07:30,"""@BrettNeveraski: I see you @realDonaldTrump h...",53,166,False,550397860240707584
2014-12-31 21:15:21,To EVERYONE including all haters and losers HA...,1271,1209,False,550399835682390016
2014-12-31 23:56:23,"""@cpetelis: @realDonaldTrump If you run for Pr...",6,18,False,550440363090280448
2014-12-31 23:57:02,"""@djspookyshadow: Feeling a deep gratitude for...",9,31,False,550440523094577152
2014-12-31 23:57:25,"""@joelmch2os: @realDonaldTrump announce your p...",8,26,False,550440620792492032


The dataset is almost ready. One last thing to check is the set of values in the is_retweet column.

In [20]:
data.is_retweet.unique()

array(['false', 'true', ''], dtype=object)

In [21]:
data.is_retweet.value_counts()

false    25083
true      3423
            58
Name: is_retweet, dtype: int64

A total of 58 rows have no is_retweet value, which is about 0.2% of the dataset. I decided to remove them so that the column can be converted to a boolean type.

In [22]:
data = data[data.is_retweet != ''].copy()  # copy so later column assignments act on an independent frame
data.is_retweet.head()


time
2014-12-31 21:07:30    false
2014-12-31 21:15:21    false
2014-12-31 23:56:23    false
2014-12-31 23:57:02    false
2014-12-31 23:57:25    false
Name: is_retweet, dtype: object

In [28]:
To_Bool = {'true': True, 'false': False}

data['is_retweet'] = data['is_retweet'].map(To_Bool)

In [29]:
data.is_retweet.unique()

array([False,  True])

In [30]:
data.is_retweet.value_counts()

False    25083
True      3423
Name: is_retweet, dtype: int64
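The filter-then-map pattern used above can be checked on a small made-up series. The sketch also adds one guard of my own: `map()` yields NaN for any label missing from the dictionary, so it confirms nothing was left unmapped.

```python
import pandas as pd

# Made-up values mirroring the three labels found in the column.
flags = pd.Series(["false", "true", "", "false"], name="is_retweet")

# Share of blank labels, as a percentage.
blank_share = round((flags == "").mean() * 100, 1)

# Drop the blanks, then translate the remaining strings to real booleans.
kept = flags[flags != ""]
as_bool = kept.map({"true": True, "false": False})

# Guard: map() returns NaN for labels that are not in the dictionary.
assert as_bool.notna().all()
as_bool = as_bool.astype(bool)

print(blank_share, as_bool.tolist())
```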

In [31]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 28506 entries, 2014-12-31 21:07:30 to 2020-03-30 20:50:35
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   text        28506 non-null  object
 1   retweets    28506 non-null  int64 
 2   favorite    28506 non-null  int64 
 3   is_retweet  28506 non-null  bool  
 4   id          28506 non-null  object
dtypes: bool(1), int64(2), object(2)
memory usage: 1.1+ MB


In [32]:
# save the new data to another CSV file before moving to the next stage
data.to_csv("TheRealDonald.csv")
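
A CSV file does not store dtypes, so the next stage has to restore them when loading. Here is a minimal sketch of a round trip, using an in-memory buffer and a one-row made-up frame in place of the real file. The `read_csv` arguments are my suggestion for reloading, not something taken from this notebook.

```python
import io
import pandas as pd

# One-row stand-in with the same columns and index as the cleaned data.
frame = pd.DataFrame(
    {
        "text": ["hello"],
        "retweets": [53],
        "favorite": [166],
        "is_retweet": [False],
        "id": ["550397860240707584"],
    },
    index=pd.DatetimeIndex(["2014-12-31 21:07:30"], name="time"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# Restore the datetime index and keep id as a string rather than an inferred integer.
restored = pd.read_csv(
    buffer,
    index_col="time",
    parse_dates=["time"],
    dtype={"id": str},
)

print(type(restored.index).__name__)
```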