# Brainster Academy Final Project
### Team 3: Tatjana Veljkovic, Ilija Todorov, Ivana Tomovska Efremov

#### Project Description: the project task is to analyze health-related tweets from 16 media outlets (BBC, CBC, CNN, Kaiser Health News, etc.) published between 2011 and 2015. The final task is to compare the outlets and identify trends, such as the dominant health news topic per outlet and how the dominant topics change over time.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import date

## Step 1: Loading and joining the data 
### Each file is read, its columns are renamed, and a column with the source name is added

In [2]:
list_of_all_files = ['bbchealth.txt','cbchealth.txt', 'cnnhealth.txt', 'everydayhealth.txt',
                     'foxnewshealth.txt','gdnhealthcare.txt', 'goodhealth.txt', 'KaiserHealthNews.txt',
                     'latimeshealth.txt','msnhealthnews.txt', 'NBChealth.txt', 'nprhealth.txt',
                     'nytimeshealth.txt','reuters_health.txt', 'usnewshealth.txt', 'wsjhealth.txt']
df_lists = []

In [3]:
for filename in list_of_all_files:
    # Source name is the file name without its extension, e.g. 'bbchealth'
    DataFileName = filename.split('.', 1)[0]
    # Files are pipe-delimited with no header row; malformed lines are skipped.
    # on_bad_lines='skip' (pandas >= 1.3) replaces the removed error_bad_lines/warn_bad_lines flags.
    DataFrame = pd.read_csv(filename, delimiter='|', on_bad_lines='skip', header=None)
    DataFrame.rename(columns = {0:'Number', 1:'Date_Time',2:'info'}, inplace = True)
    DataFrame['source'] = DataFileName  # the scalar is broadcast to every row
    df_lists.append(DataFrame)
print('Reading all files is done !')

Reading all files is done !
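As a rough illustration of the loading step, the sketch below reads two tiny in-memory stand-ins for the pipe-delimited files and tags each row with its source. The `sample_files` dictionary, the short IDs and the `example.com` links are made up for the example; only the column names and the rule "source = file name without extension" come from the cells above.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for two of the pipe-delimited files.
sample_files = {
    "bbchealth.txt": (
        "1|Thu Apr 09 01:31:50 +0000 2015|Breast cancer risk test devised http://example.com/a\n"
    ),
    "cbchealth.txt": (
        "2|Wed Apr 08 23:30:18 +0000 2015|Drugs need careful monitoring http://example.com/b\n"
        "3|Wed Apr 08 18:05:28 +0000 2015|Another headline http://example.com/c\n"
    ),
}

frames = []
for filename, text in sample_files.items():
    # names= assigns the column labels at read time, so no rename is needed.
    frame = pd.read_csv(io.StringIO(text), sep="|", header=None,
                        names=["Number", "Date_Time", "info"])
    frame["source"] = filename.split(".", 1)[0]  # scalar is broadcast to all rows
    frames.append(frame)

# ignore_index=True gives the stacked frame a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
print(combined[["source", "info"]])
```

Passing `names=` and `ignore_index=True` is only one possible design; it would fold the rename and the later index reset into the loading step.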


### Merging files

In [4]:
twitter_health = pd.concat(df_lists, axis=0)

In [5]:
twitter_health

Unnamed: 0,Number,Date_Time,info,source
0,585978391360221184,Thu Apr 09 01:31:50 +0000 2015,Breast cancer risk test devised http://bbc.in/...,bbchealth
1,585947808772960257,Wed Apr 08 23:30:18 +0000 2015,GP workload harming care - BMA poll http://bbc...,bbchealth
2,585947807816650752,Wed Apr 08 23:30:18 +0000 2015,Short people's 'heart risk greater' http://bbc...,bbchealth
3,585866060991078401,Wed Apr 08 18:05:28 +0000 2015,New approach against HIV 'promising' http://bb...,bbchealth
4,585794106170839041,Wed Apr 08 13:19:33 +0000 2015,Coalition 'undermined NHS' - doctors http://bb...,bbchealth
...,...,...,...,...
3195,415494259022655489,Tue Dec 24 14:48:45 +0000 2013,RT @stefaniei: Addiction and the brain: scient...,wsjhealth
3196,415493351396233216,Tue Dec 24 14:45:09 +0000 2013,RT @timothywmartin: Ho-ho-hold up! A surprise ...,wsjhealth
3197,415493203983204352,Tue Dec 24 14:44:33 +0000 2013,RT @stefaniei: Health-Insurance Deadline Exten...,wsjhealth
3198,415386956420231169,Tue Dec 24 07:42:22 +0000 2013,Boston Scientific Eyes China Expansion http://...,wsjhealth


In [6]:
twitter_health['source'].value_counts()

goodhealth          7708
nytimeshealth       5947
nprhealth           4837
reuters_health      4719
NBChealth           4215
latimeshealth       4171
cnnhealth           4045
bbchealth           3929
cbchealth           3728
KaiserHealthNews    3508
everydayhealth      3239
wsjhealth           3200
msnhealthnews       3199
gdnhealthcare       2977
foxnewshealth       2000
usnewshealth        1395
Name: source, dtype: int64

In [7]:
for file in df_lists:
    print('File name:', file['source'].unique(), len(file))

File name: ['bbchealth'] 3929
File name: ['cbchealth'] 3728
File name: ['cnnhealth'] 4045
File name: ['everydayhealth'] 3239
File name: ['foxnewshealth'] 2000
File name: ['gdnhealthcare'] 2977
File name: ['goodhealth'] 7708
File name: ['KaiserHealthNews'] 3508
File name: ['latimeshealth'] 4171
File name: ['msnhealthnews'] 3199
File name: ['NBChealth'] 4215
File name: ['nprhealth'] 4837
File name: ['nytimeshealth'] 5947
File name: ['reuters_health'] 4719
File name: ['usnewshealth'] 1395
File name: ['wsjhealth'] 3200


### Resetting the index

In [8]:
twitter_health=twitter_health.reset_index()

In [9]:
twitter_health

Unnamed: 0,index,Number,Date_Time,info,source
0,0,585978391360221184,Thu Apr 09 01:31:50 +0000 2015,Breast cancer risk test devised http://bbc.in/...,bbchealth
1,1,585947808772960257,Wed Apr 08 23:30:18 +0000 2015,GP workload harming care - BMA poll http://bbc...,bbchealth
2,2,585947807816650752,Wed Apr 08 23:30:18 +0000 2015,Short people's 'heart risk greater' http://bbc...,bbchealth
3,3,585866060991078401,Wed Apr 08 18:05:28 +0000 2015,New approach against HIV 'promising' http://bb...,bbchealth
4,4,585794106170839041,Wed Apr 08 13:19:33 +0000 2015,Coalition 'undermined NHS' - doctors http://bb...,bbchealth
...,...,...,...,...,...
62812,3195,415494259022655489,Tue Dec 24 14:48:45 +0000 2013,RT @stefaniei: Addiction and the brain: scient...,wsjhealth
62813,3196,415493351396233216,Tue Dec 24 14:45:09 +0000 2013,RT @timothywmartin: Ho-ho-hold up! A surprise ...,wsjhealth
62814,3197,415493203983204352,Tue Dec 24 14:44:33 +0000 2013,RT @stefaniei: Health-Insurance Deadline Exten...,wsjhealth
62815,3198,415386956420231169,Tue Dec 24 07:42:22 +0000 2013,Boston Scientific Eyes China Expansion http://...,wsjhealth
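The extra "index" column in the output above holds the old per-file row labels. A minimal sketch, using two made-up frames, of how `reset_index()` differs from `reset_index(drop=True)`:

```python
import pandas as pd

# Two made-up frames standing in for two source files.
a = pd.DataFrame({"info": ["x", "y"]})
b = pd.DataFrame({"info": ["z"]})

stacked = pd.concat([a, b], axis=0)        # row labels repeat: 0, 1, 0
kept = stacked.reset_index()               # old labels are kept as an "index" column
dropped = stacked.reset_index(drop=True)   # old labels are discarded

print(stacked.index.tolist())
print(list(kept.columns))
print(list(dropped.columns))
```

Keeping the old labels, as the notebook does, is harmless here because the "index" column is dropped again later; `drop=True` would simply skip that step.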


## Step 2: Data cleaning

##### The "info" column can be split on "http", since the tweet text comes first and everything from the link onward is redundant. However, 119 rows of the "info" column start with "http", so they must be handled first in order not to lose the text of these 119 rows.

#### DataFrame made up of the 119 rows which start with "http" in the "info" column

In [10]:
# Select the rows whose "info" text starts with a link
split_exceptions = twitter_health[twitter_health['info'].str.startswith("http")]
split_exceptions

Unnamed: 0,index,Number,Date_Time,info,source
19543,2602,532856398615674880,Thu Nov 13 11:24:00 +0000 2014,http://gu.com/p/438n2/tw,gdnhealthcare
25536,5618,280298707818262529,Sun Dec 16 13:09:54 +0000 2012,http://pinterest.com/pin/34832597089912501/ Br...,goodhealth
25683,5765,273832894416441344,Wed Nov 28 16:57:04 +0000 2012,http://pinterest.com/pin/34832597089793116/ We...,goodhealth
26351,6433,199182851944628225,Sun May 06 17:04:27 +0000 2012,http://pinterest.com/pin/34832597088302332/ 7 ...,goodhealth
26353,6435,198905597666656257,Sat May 05 22:42:44 +0000 2012,http://pinterest.com/pin/34832597088298094/ 5 ...,goodhealth
...,...,...,...,...,...
59957,340,533426053901860864,Sat Nov 15 01:07:36 +0000 2014,http://HealthCare.gov Expected to Work Better ...,wsjhealth
60379,762,520007777175879680,Thu Oct 09 00:28:10 +0000 2014,http://HealthCare.gov Shortens Insurance Appli...,wsjhealth
60394,777,519674939507359745,Wed Oct 08 02:25:35 +0000 2014,http://HealthCare.gov Testing to Be Confidenti...,wsjhealth
60465,848,517514080131629056,Thu Oct 02 03:19:06 +0000 2014,http://HealthCare.gov Delays Web Host Switch h...,wsjhealth


In [11]:
rows_to_skip = split_exceptions.index
rows_to_skip

Int64Index([19543, 25536, 25683, 26351, 26353, 26367, 26415, 26439, 26444,
            26446,
            ...
            54888, 55210, 55674, 59848, 59956, 59957, 60379, 60394, 60465,
            62206],
           dtype='int64', length=119)

#### DataFrame without the 119 rows which start with "http" in the "info" column

In [12]:
file_to_split = twitter_health.drop((rows_to_skip), axis=0)
file_to_split

Unnamed: 0,index,Number,Date_Time,info,source
0,0,585978391360221184,Thu Apr 09 01:31:50 +0000 2015,Breast cancer risk test devised http://bbc.in/...,bbchealth
1,1,585947808772960257,Wed Apr 08 23:30:18 +0000 2015,GP workload harming care - BMA poll http://bbc...,bbchealth
2,2,585947807816650752,Wed Apr 08 23:30:18 +0000 2015,Short people's 'heart risk greater' http://bbc...,bbchealth
3,3,585866060991078401,Wed Apr 08 18:05:28 +0000 2015,New approach against HIV 'promising' http://bb...,bbchealth
4,4,585794106170839041,Wed Apr 08 13:19:33 +0000 2015,Coalition 'undermined NHS' - doctors http://bb...,bbchealth
...,...,...,...,...,...
62812,3195,415494259022655489,Tue Dec 24 14:48:45 +0000 2013,RT @stefaniei: Addiction and the brain: scient...,wsjhealth
62813,3196,415493351396233216,Tue Dec 24 14:45:09 +0000 2013,RT @timothywmartin: Ho-ho-hold up! A surprise ...,wsjhealth
62814,3197,415493203983204352,Tue Dec 24 14:44:33 +0000 2013,RT @stefaniei: Health-Insurance Deadline Exten...,wsjhealth
62815,3198,415386956420231169,Tue Dec 24 07:42:22 +0000 2013,Boston Scientific Eyes China Expansion http://...,wsjhealth


In [13]:
62817 - 119

62698

In [14]:
file_to_split_check =  file_to_split[file_to_split['info'].str.startswith("http")]
file_to_split_check

Unnamed: 0,index,Number,Date_Time,info,source
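The partition into "exception" rows and regular rows can also be sketched with one boolean mask and its negation, which guarantees that the two parts are disjoint and together cover every row. The three sample strings are taken from the outputs above; the variable names are illustrative only.

```python
import pandas as pd

tweets = pd.DataFrame({"info": [
    "Breast cancer risk test devised http://bbc.in/1CimpJF",
    "http://HealthCare.gov Delays Web Host Switch http://on.wsj.com/1rKuiWc",
    "http://gu.com/p/438n2/tw",
]})

starts_with_url = tweets["info"].str.startswith("http")
exception_rows = tweets[starts_with_url]    # rows that begin with a link
regular_rows = tweets[~starts_with_url]     # all remaining rows

# Sanity check: the two parts add up to the full frame.
print(len(exception_rows), len(regular_rows), len(tweets))
```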


### Dataframe status 

In [15]:
# rows which start with 'http' are excluded - 119 rows removed
file_to_split

Unnamed: 0,index,Number,Date_Time,info,source
0,0,585978391360221184,Thu Apr 09 01:31:50 +0000 2015,Breast cancer risk test devised http://bbc.in/...,bbchealth
1,1,585947808772960257,Wed Apr 08 23:30:18 +0000 2015,GP workload harming care - BMA poll http://bbc...,bbchealth
2,2,585947807816650752,Wed Apr 08 23:30:18 +0000 2015,Short people's 'heart risk greater' http://bbc...,bbchealth
3,3,585866060991078401,Wed Apr 08 18:05:28 +0000 2015,New approach against HIV 'promising' http://bb...,bbchealth
4,4,585794106170839041,Wed Apr 08 13:19:33 +0000 2015,Coalition 'undermined NHS' - doctors http://bb...,bbchealth
...,...,...,...,...,...
62812,3195,415494259022655489,Tue Dec 24 14:48:45 +0000 2013,RT @stefaniei: Addiction and the brain: scient...,wsjhealth
62813,3196,415493351396233216,Tue Dec 24 14:45:09 +0000 2013,RT @timothywmartin: Ho-ho-hold up! A surprise ...,wsjhealth
62814,3197,415493203983204352,Tue Dec 24 14:44:33 +0000 2013,RT @stefaniei: Health-Insurance Deadline Exten...,wsjhealth
62815,3198,415386956420231169,Tue Dec 24 07:42:22 +0000 2013,Boston Scientific Eyes China Expansion http://...,wsjhealth


In [16]:
# rows which begin with 'http', kept out of the main split and cleaned separately below
split_exceptions

Unnamed: 0,index,Number,Date_Time,info,source
19543,2602,532856398615674880,Thu Nov 13 11:24:00 +0000 2014,http://gu.com/p/438n2/tw,gdnhealthcare
25536,5618,280298707818262529,Sun Dec 16 13:09:54 +0000 2012,http://pinterest.com/pin/34832597089912501/ Br...,goodhealth
25683,5765,273832894416441344,Wed Nov 28 16:57:04 +0000 2012,http://pinterest.com/pin/34832597089793116/ We...,goodhealth
26351,6433,199182851944628225,Sun May 06 17:04:27 +0000 2012,http://pinterest.com/pin/34832597088302332/ 7 ...,goodhealth
26353,6435,198905597666656257,Sat May 05 22:42:44 +0000 2012,http://pinterest.com/pin/34832597088298094/ 5 ...,goodhealth
...,...,...,...,...,...
59957,340,533426053901860864,Sat Nov 15 01:07:36 +0000 2014,http://HealthCare.gov Expected to Work Better ...,wsjhealth
60379,762,520007777175879680,Thu Oct 09 00:28:10 +0000 2014,http://HealthCare.gov Shortens Insurance Appli...,wsjhealth
60394,777,519674939507359745,Wed Oct 08 02:25:35 +0000 2014,http://HealthCare.gov Testing to Be Confidenti...,wsjhealth
60465,848,517514080131629056,Thu Oct 02 03:19:06 +0000 2014,http://HealthCare.gov Delays Web Host Switch h...,wsjhealth


#### Cleaning the 119 rows that start with "http"

In [17]:
split_exceptions['info']

19543                             http://gu.com/p/438n2/tw
25536    http://pinterest.com/pin/34832597089912501/ Br...
25683    http://pinterest.com/pin/34832597089793116/ We...
26351    http://pinterest.com/pin/34832597088302332/ 7 ...
26353    http://pinterest.com/pin/34832597088298094/ 5 ...
                               ...                        
59957    http://HealthCare.gov Expected to Work Better ...
60379    http://HealthCare.gov Shortens Insurance Appli...
60394    http://HealthCare.gov Testing to Be Confidenti...
60465    http://HealthCare.gov Delays Web Host Switch h...
62206    http://HealthCare.Gov Plans Deadline Leeway ht...
Name: info, Length: 119, dtype: object

#### Splitting column "info" from dataset "split_exceptions" on the first space (' ')

In [18]:
split_exceptions_splitted = split_exceptions['info'].str.split(" ", n=1, expand=True)

In [19]:
split_exceptions_splitted.rename(columns = {0:'redundant_info', 1:'Core_info'}, inplace = True)
split_exceptions_splitted

Unnamed: 0,redundant_info,Core_info
19543,http://gu.com/p/438n2/tw,
25536,http://pinterest.com/pin/34832597089912501/,Brooke Burke Charvet: 8 Things You Should Know...
25683,http://pinterest.com/pin/34832597089793116/,Weight Loss Tip: Peel Off Pudge With Pepper
26351,http://pinterest.com/pin/34832597088302332/,7 Foods That Help You Shed Pounds
26353,http://pinterest.com/pin/34832597088298094/,5 Healthy Cinco de Mayo Recipes
...,...,...
59957,http://HealthCare.gov,Expected to Work Better This Year http://on.ws...
60379,http://HealthCare.gov,Shortens Insurance Application http://on.wsj.c...
60394,http://HealthCare.gov,Testing to Be Confidential http://on.wsj.com/1...
60465,http://HealthCare.gov,Delays Web Host Switch http://on.wsj.com/1rKuiWc


In [20]:
split_exceptions_splitted = split_exceptions_splitted.drop(['redundant_info'], axis=1)
split_exceptions_splitted

Unnamed: 0,Core_info
19543,
25536,Brooke Burke Charvet: 8 Things You Should Know...
25683,Weight Loss Tip: Peel Off Pudge With Pepper
26351,7 Foods That Help You Shed Pounds
26353,5 Healthy Cinco de Mayo Recipes
...,...
59957,Expected to Work Better This Year http://on.ws...
60379,Shortens Insurance Application http://on.wsj.c...
60394,Testing to Be Confidential http://on.wsj.com/1...
60465,Delays Web Host Switch http://on.wsj.com/1rKuiWc
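A small sketch of what `str.split(" ", n=1, expand=True)` does to the exception rows: only the first space is used as a cut point, and a tweet that consists of a link alone gets a missing value in the second column. The two sample strings and their row labels are copied from the outputs above.

```python
import pandas as pd

info = pd.Series(
    [
        "http://gu.com/p/438n2/tw",
        "http://HealthCare.gov Delays Web Host Switch http://on.wsj.com/1rKuiWc",
    ],
    index=[19543, 60465],
)

# n=1 splits at the first space only; expand=True returns a DataFrame.
parts = info.str.split(" ", n=1, expand=True)
parts.columns = ["redundant_info", "Core_info"]
print(parts)
```

The missing value in row 19543 is what later shows up as the single NA in "Core_info_final".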


In [21]:
exceptions = pd.concat([split_exceptions, split_exceptions_splitted], axis = 1)
exceptions

Unnamed: 0,index,Number,Date_Time,info,source,Core_info
19543,2602,532856398615674880,Thu Nov 13 11:24:00 +0000 2014,http://gu.com/p/438n2/tw,gdnhealthcare,
25536,5618,280298707818262529,Sun Dec 16 13:09:54 +0000 2012,http://pinterest.com/pin/34832597089912501/ Br...,goodhealth,Brooke Burke Charvet: 8 Things You Should Know...
25683,5765,273832894416441344,Wed Nov 28 16:57:04 +0000 2012,http://pinterest.com/pin/34832597089793116/ We...,goodhealth,Weight Loss Tip: Peel Off Pudge With Pepper
26351,6433,199182851944628225,Sun May 06 17:04:27 +0000 2012,http://pinterest.com/pin/34832597088302332/ 7 ...,goodhealth,7 Foods That Help You Shed Pounds
26353,6435,198905597666656257,Sat May 05 22:42:44 +0000 2012,http://pinterest.com/pin/34832597088298094/ 5 ...,goodhealth,5 Healthy Cinco de Mayo Recipes
...,...,...,...,...,...,...
59957,340,533426053901860864,Sat Nov 15 01:07:36 +0000 2014,http://HealthCare.gov Expected to Work Better ...,wsjhealth,Expected to Work Better This Year http://on.ws...
60379,762,520007777175879680,Thu Oct 09 00:28:10 +0000 2014,http://HealthCare.gov Shortens Insurance Appli...,wsjhealth,Shortens Insurance Application http://on.wsj.c...
60394,777,519674939507359745,Wed Oct 08 02:25:35 +0000 2014,http://HealthCare.gov Testing to Be Confidenti...,wsjhealth,Testing to Be Confidential http://on.wsj.com/1...
60465,848,517514080131629056,Thu Oct 02 03:19:06 +0000 2014,http://HealthCare.gov Delays Web Host Switch h...,wsjhealth,Delays Web Host Switch http://on.wsj.com/1rKuiWc
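The column-wise `pd.concat(..., axis=1)` used above matches rows by index label, not by position. A minimal sketch with two made-up frames whose rows are deliberately listed in different order:

```python
import pandas as pd

left = pd.DataFrame({"source": ["gdnhealthcare", "wsjhealth"]},
                    index=[19543, 60465])
# Same labels, opposite order, to show that alignment is by label.
right = pd.DataFrame({"Core_info": ["Delays Web Host Switch", None]},
                     index=[60465, 19543])

joined = pd.concat([left, right], axis=1)
print(joined)
```

This is why the split result can be attached safely as long as the index of `split_exceptions` is left untouched between the two steps.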


#### Preparing core dataframe "file_to_split" to be merged with dataframe "exceptions"

In [22]:
file_to_split

Unnamed: 0,index,Number,Date_Time,info,source
0,0,585978391360221184,Thu Apr 09 01:31:50 +0000 2015,Breast cancer risk test devised http://bbc.in/...,bbchealth
1,1,585947808772960257,Wed Apr 08 23:30:18 +0000 2015,GP workload harming care - BMA poll http://bbc...,bbchealth
2,2,585947807816650752,Wed Apr 08 23:30:18 +0000 2015,Short people's 'heart risk greater' http://bbc...,bbchealth
3,3,585866060991078401,Wed Apr 08 18:05:28 +0000 2015,New approach against HIV 'promising' http://bb...,bbchealth
4,4,585794106170839041,Wed Apr 08 13:19:33 +0000 2015,Coalition 'undermined NHS' - doctors http://bb...,bbchealth
...,...,...,...,...,...
62812,3195,415494259022655489,Tue Dec 24 14:48:45 +0000 2013,RT @stefaniei: Addiction and the brain: scient...,wsjhealth
62813,3196,415493351396233216,Tue Dec 24 14:45:09 +0000 2013,RT @timothywmartin: Ho-ho-hold up! A surprise ...,wsjhealth
62814,3197,415493203983204352,Tue Dec 24 14:44:33 +0000 2013,RT @stefaniei: Health-Insurance Deadline Exten...,wsjhealth
62815,3198,415386956420231169,Tue Dec 24 07:42:22 +0000 2013,Boston Scientific Eyes China Expansion http://...,wsjhealth


In [23]:
file_to_split['Core_info'] = file_to_split['info']

In [24]:
file_to_split

Unnamed: 0,index,Number,Date_Time,info,source,Core_info
0,0,585978391360221184,Thu Apr 09 01:31:50 +0000 2015,Breast cancer risk test devised http://bbc.in/...,bbchealth,Breast cancer risk test devised http://bbc.in/...
1,1,585947808772960257,Wed Apr 08 23:30:18 +0000 2015,GP workload harming care - BMA poll http://bbc...,bbchealth,GP workload harming care - BMA poll http://bbc...
2,2,585947807816650752,Wed Apr 08 23:30:18 +0000 2015,Short people's 'heart risk greater' http://bbc...,bbchealth,Short people's 'heart risk greater' http://bbc...
3,3,585866060991078401,Wed Apr 08 18:05:28 +0000 2015,New approach against HIV 'promising' http://bb...,bbchealth,New approach against HIV 'promising' http://bb...
4,4,585794106170839041,Wed Apr 08 13:19:33 +0000 2015,Coalition 'undermined NHS' - doctors http://bb...,bbchealth,Coalition 'undermined NHS' - doctors http://bb...
...,...,...,...,...,...,...
62812,3195,415494259022655489,Tue Dec 24 14:48:45 +0000 2013,RT @stefaniei: Addiction and the brain: scient...,wsjhealth,RT @stefaniei: Addiction and the brain: scient...
62813,3196,415493351396233216,Tue Dec 24 14:45:09 +0000 2013,RT @timothywmartin: Ho-ho-hold up! A surprise ...,wsjhealth,RT @timothywmartin: Ho-ho-hold up! A surprise ...
62814,3197,415493203983204352,Tue Dec 24 14:44:33 +0000 2013,RT @stefaniei: Health-Insurance Deadline Exten...,wsjhealth,RT @stefaniei: Health-Insurance Deadline Exten...
62815,3198,415386956420231169,Tue Dec 24 07:42:22 +0000 2013,Boston Scientific Eyes China Expansion http://...,wsjhealth,Boston Scientific Eyes China Expansion http://...


In [25]:
display(file_to_split)
display(exceptions)

Unnamed: 0,index,Number,Date_Time,info,source,Core_info
0,0,585978391360221184,Thu Apr 09 01:31:50 +0000 2015,Breast cancer risk test devised http://bbc.in/...,bbchealth,Breast cancer risk test devised http://bbc.in/...
1,1,585947808772960257,Wed Apr 08 23:30:18 +0000 2015,GP workload harming care - BMA poll http://bbc...,bbchealth,GP workload harming care - BMA poll http://bbc...
2,2,585947807816650752,Wed Apr 08 23:30:18 +0000 2015,Short people's 'heart risk greater' http://bbc...,bbchealth,Short people's 'heart risk greater' http://bbc...
3,3,585866060991078401,Wed Apr 08 18:05:28 +0000 2015,New approach against HIV 'promising' http://bb...,bbchealth,New approach against HIV 'promising' http://bb...
4,4,585794106170839041,Wed Apr 08 13:19:33 +0000 2015,Coalition 'undermined NHS' - doctors http://bb...,bbchealth,Coalition 'undermined NHS' - doctors http://bb...
...,...,...,...,...,...,...
62812,3195,415494259022655489,Tue Dec 24 14:48:45 +0000 2013,RT @stefaniei: Addiction and the brain: scient...,wsjhealth,RT @stefaniei: Addiction and the brain: scient...
62813,3196,415493351396233216,Tue Dec 24 14:45:09 +0000 2013,RT @timothywmartin: Ho-ho-hold up! A surprise ...,wsjhealth,RT @timothywmartin: Ho-ho-hold up! A surprise ...
62814,3197,415493203983204352,Tue Dec 24 14:44:33 +0000 2013,RT @stefaniei: Health-Insurance Deadline Exten...,wsjhealth,RT @stefaniei: Health-Insurance Deadline Exten...
62815,3198,415386956420231169,Tue Dec 24 07:42:22 +0000 2013,Boston Scientific Eyes China Expansion http://...,wsjhealth,Boston Scientific Eyes China Expansion http://...


Unnamed: 0,index,Number,Date_Time,info,source,Core_info
19543,2602,532856398615674880,Thu Nov 13 11:24:00 +0000 2014,http://gu.com/p/438n2/tw,gdnhealthcare,
25536,5618,280298707818262529,Sun Dec 16 13:09:54 +0000 2012,http://pinterest.com/pin/34832597089912501/ Br...,goodhealth,Brooke Burke Charvet: 8 Things You Should Know...
25683,5765,273832894416441344,Wed Nov 28 16:57:04 +0000 2012,http://pinterest.com/pin/34832597089793116/ We...,goodhealth,Weight Loss Tip: Peel Off Pudge With Pepper
26351,6433,199182851944628225,Sun May 06 17:04:27 +0000 2012,http://pinterest.com/pin/34832597088302332/ 7 ...,goodhealth,7 Foods That Help You Shed Pounds
26353,6435,198905597666656257,Sat May 05 22:42:44 +0000 2012,http://pinterest.com/pin/34832597088298094/ 5 ...,goodhealth,5 Healthy Cinco de Mayo Recipes
...,...,...,...,...,...,...
59957,340,533426053901860864,Sat Nov 15 01:07:36 +0000 2014,http://HealthCare.gov Expected to Work Better ...,wsjhealth,Expected to Work Better This Year http://on.ws...
60379,762,520007777175879680,Thu Oct 09 00:28:10 +0000 2014,http://HealthCare.gov Shortens Insurance Appli...,wsjhealth,Shortens Insurance Application http://on.wsj.c...
60394,777,519674939507359745,Wed Oct 08 02:25:35 +0000 2014,http://HealthCare.gov Testing to Be Confidenti...,wsjhealth,Testing to Be Confidential http://on.wsj.com/1...
60465,848,517514080131629056,Thu Oct 02 03:19:06 +0000 2014,http://HealthCare.gov Delays Web Host Switch h...,wsjhealth,Delays Web Host Switch http://on.wsj.com/1rKuiWc


#### Merging back into the full dataset, after removing the leading link from the column "Core_info" in the 119 exception rows

In [26]:
file_for_split_full = pd.concat([file_to_split, exceptions], axis = 0)
file_for_split_full

Unnamed: 0,index,Number,Date_Time,info,source,Core_info
0,0,585978391360221184,Thu Apr 09 01:31:50 +0000 2015,Breast cancer risk test devised http://bbc.in/...,bbchealth,Breast cancer risk test devised http://bbc.in/...
1,1,585947808772960257,Wed Apr 08 23:30:18 +0000 2015,GP workload harming care - BMA poll http://bbc...,bbchealth,GP workload harming care - BMA poll http://bbc...
2,2,585947807816650752,Wed Apr 08 23:30:18 +0000 2015,Short people's 'heart risk greater' http://bbc...,bbchealth,Short people's 'heart risk greater' http://bbc...
3,3,585866060991078401,Wed Apr 08 18:05:28 +0000 2015,New approach against HIV 'promising' http://bb...,bbchealth,New approach against HIV 'promising' http://bb...
4,4,585794106170839041,Wed Apr 08 13:19:33 +0000 2015,Coalition 'undermined NHS' - doctors http://bb...,bbchealth,Coalition 'undermined NHS' - doctors http://bb...
...,...,...,...,...,...,...
59957,340,533426053901860864,Sat Nov 15 01:07:36 +0000 2014,http://HealthCare.gov Expected to Work Better ...,wsjhealth,Expected to Work Better This Year http://on.ws...
60379,762,520007777175879680,Thu Oct 09 00:28:10 +0000 2014,http://HealthCare.gov Shortens Insurance Appli...,wsjhealth,Shortens Insurance Application http://on.wsj.c...
60394,777,519674939507359745,Wed Oct 08 02:25:35 +0000 2014,http://HealthCare.gov Testing to Be Confidenti...,wsjhealth,Testing to Be Confidential http://on.wsj.com/1...
60465,848,517514080131629056,Thu Oct 02 03:19:06 +0000 2014,http://HealthCare.gov Delays Web Host Switch h...,wsjhealth,Delays Web Host Switch http://on.wsj.com/1rKuiWc


#### Splitting the column "Core_info" of the entire dataset (62817 rows) on "http" in order to remove the links

In [27]:
Core_info_split = file_for_split_full['Core_info'].str.split("http", n=1, expand=True)
Core_info_split

Unnamed: 0,0,1
0,Breast cancer risk test devised,://bbc.in/1CimpJF
1,GP workload harming care - BMA poll,://bbc.in/1ChTBRv
2,Short people's 'heart risk greater',://bbc.in/1ChTANp
3,New approach against HIV 'promising',://bbc.in/1E6jAjt
4,Coalition 'undermined NHS' - doctors,://bbc.in/1CnLwK7
...,...,...
59957,Expected to Work Better This Year,://on.wsj.com/117yaGw
60379,Shortens Insurance Application,://on.wsj.com/1sdfGyR
60394,Testing to Be Confidential,://on.wsj.com/1BPftTH
60465,Delays Web Host Switch,://on.wsj.com/1rKuiWc


In [28]:
Core_info_split.rename(columns = {0:'Core_info_final', 1:'redundant_info'}, inplace = True)
Core_info_split

Unnamed: 0,Core_info_final,redundant_info
0,Breast cancer risk test devised,://bbc.in/1CimpJF
1,GP workload harming care - BMA poll,://bbc.in/1ChTBRv
2,Short people's 'heart risk greater',://bbc.in/1ChTANp
3,New approach against HIV 'promising',://bbc.in/1E6jAjt
4,Coalition 'undermined NHS' - doctors,://bbc.in/1CnLwK7
...,...,...
59957,Expected to Work Better This Year,://on.wsj.com/117yaGw
60379,Shortens Insurance Application,://on.wsj.com/1sdfGyR
60394,Testing to Be Confidential,://on.wsj.com/1BPftTH
60465,Delays Web Host Switch,://on.wsj.com/1rKuiWc


#### Adding the split Core_info columns to the entire dataset

In [29]:
file_for_split_full
Core_info_split

Unnamed: 0,Core_info_final,redundant_info
0,Breast cancer risk test devised,://bbc.in/1CimpJF
1,GP workload harming care - BMA poll,://bbc.in/1ChTBRv
2,Short people's 'heart risk greater',://bbc.in/1ChTANp
3,New approach against HIV 'promising',://bbc.in/1E6jAjt
4,Coalition 'undermined NHS' - doctors,://bbc.in/1CnLwK7
...,...,...
59957,Expected to Work Better This Year,://on.wsj.com/117yaGw
60379,Shortens Insurance Application,://on.wsj.com/1sdfGyR
60394,Testing to Be Confidential,://on.wsj.com/1BPftTH
60465,Delays Web Host Switch,://on.wsj.com/1rKuiWc


In [30]:
twitter_full = pd.concat([file_for_split_full, Core_info_split], axis = 1)

In [31]:
twitter_full

Unnamed: 0,index,Number,Date_Time,info,source,Core_info,Core_info_final,redundant_info
0,0,585978391360221184,Thu Apr 09 01:31:50 +0000 2015,Breast cancer risk test devised http://bbc.in/...,bbchealth,Breast cancer risk test devised http://bbc.in/...,Breast cancer risk test devised,://bbc.in/1CimpJF
1,1,585947808772960257,Wed Apr 08 23:30:18 +0000 2015,GP workload harming care - BMA poll http://bbc...,bbchealth,GP workload harming care - BMA poll http://bbc...,GP workload harming care - BMA poll,://bbc.in/1ChTBRv
2,2,585947807816650752,Wed Apr 08 23:30:18 +0000 2015,Short people's 'heart risk greater' http://bbc...,bbchealth,Short people's 'heart risk greater' http://bbc...,Short people's 'heart risk greater',://bbc.in/1ChTANp
3,3,585866060991078401,Wed Apr 08 18:05:28 +0000 2015,New approach against HIV 'promising' http://bb...,bbchealth,New approach against HIV 'promising' http://bb...,New approach against HIV 'promising',://bbc.in/1E6jAjt
4,4,585794106170839041,Wed Apr 08 13:19:33 +0000 2015,Coalition 'undermined NHS' - doctors http://bb...,bbchealth,Coalition 'undermined NHS' - doctors http://bb...,Coalition 'undermined NHS' - doctors,://bbc.in/1CnLwK7
...,...,...,...,...,...,...,...,...
59957,340,533426053901860864,Sat Nov 15 01:07:36 +0000 2014,http://HealthCare.gov Expected to Work Better ...,wsjhealth,Expected to Work Better This Year http://on.ws...,Expected to Work Better This Year,://on.wsj.com/117yaGw
60379,762,520007777175879680,Thu Oct 09 00:28:10 +0000 2014,http://HealthCare.gov Shortens Insurance Appli...,wsjhealth,Shortens Insurance Application http://on.wsj.c...,Shortens Insurance Application,://on.wsj.com/1sdfGyR
60394,777,519674939507359745,Wed Oct 08 02:25:35 +0000 2014,http://HealthCare.gov Testing to Be Confidenti...,wsjhealth,Testing to Be Confidential http://on.wsj.com/1...,Testing to Be Confidential,://on.wsj.com/1BPftTH
60465,848,517514080131629056,Thu Oct 02 03:19:06 +0000 2014,http://HealthCare.gov Delays Web Host Switch h...,wsjhealth,Delays Web Host Switch http://on.wsj.com/1rKuiWc,Delays Web Host Switch,://on.wsj.com/1rKuiWc


In [32]:
twitter_health = twitter_full.drop(['Core_info', 'index', "redundant_info"], axis = 1)

In [33]:
twitter_health

Unnamed: 0,Number,Date_Time,info,source,Core_info_final
0,585978391360221184,Thu Apr 09 01:31:50 +0000 2015,Breast cancer risk test devised http://bbc.in/...,bbchealth,Breast cancer risk test devised
1,585947808772960257,Wed Apr 08 23:30:18 +0000 2015,GP workload harming care - BMA poll http://bbc...,bbchealth,GP workload harming care - BMA poll
2,585947807816650752,Wed Apr 08 23:30:18 +0000 2015,Short people's 'heart risk greater' http://bbc...,bbchealth,Short people's 'heart risk greater'
3,585866060991078401,Wed Apr 08 18:05:28 +0000 2015,New approach against HIV 'promising' http://bb...,bbchealth,New approach against HIV 'promising'
4,585794106170839041,Wed Apr 08 13:19:33 +0000 2015,Coalition 'undermined NHS' - doctors http://bb...,bbchealth,Coalition 'undermined NHS' - doctors
...,...,...,...,...,...
59957,533426053901860864,Sat Nov 15 01:07:36 +0000 2014,http://HealthCare.gov Expected to Work Better ...,wsjhealth,Expected to Work Better This Year
60379,520007777175879680,Thu Oct 09 00:28:10 +0000 2014,http://HealthCare.gov Shortens Insurance Appli...,wsjhealth,Shortens Insurance Application
60394,519674939507359745,Wed Oct 08 02:25:35 +0000 2014,http://HealthCare.gov Testing to Be Confidenti...,wsjhealth,Testing to Be Confidential
60465,517514080131629056,Thu Oct 02 03:19:06 +0000 2014,http://HealthCare.gov Delays Web Host Switch h...,wsjhealth,Delays Web Host Switch


In [34]:
print('----Split check-----')
print('  ')
print('Before split:')
print(file_to_split['info'][3929])
print('---------------')
print('After split:')
print(twitter_full['Core_info_final'][3929])
print(twitter_full['redundant_info'][3929])

----Split check-----
  
Before split:
Drugs need careful monitoring for expiry dates, pharmacists say http://www.cbc.ca/news/health/drugs-need-careful-monitoring-for-expiry-dates-pharmacists-say-1.3026749?cmp=rss
---------------
After split:
Drugs need careful monitoring for expiry dates, pharmacists say 
://www.cbc.ca/news/health/drugs-need-careful-monitoring-for-expiry-dates-pharmacists-say-1.3026749?cmp=rss
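The check above shows that the text part keeps the space that preceded the link. A minimal sketch of the same split followed by `str.strip()`, in case the trailing whitespace matters for later text processing; the first sample string is the tweet printed above, the second is a headline without a link from the earlier outputs.

```python
import pandas as pd

text = pd.Series([
    "Drugs need careful monitoring for expiry dates, pharmacists say "
    "http://www.cbc.ca/news/health/drugs-need-careful-monitoring-for-expiry-dates-pharmacists-say-1.3026749?cmp=rss",
    "Weight Loss Tip: Peel Off Pudge With Pepper",
])

# Cut at the first "http"; column 0 is the text, column 1 the rest of the link.
parts = text.str.split("http", n=1, expand=True)
headline = parts[0].str.strip()   # drop leading/trailing whitespace

print(repr(parts.loc[0, 0]))
print(repr(headline[0]))
```

Rows without any link, like the second one, are left unchanged in column 0 and get a missing value in column 1.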


#### Check: rows whose original "info" starts with "http" now have link-free text in "Core_info_final"

In [35]:
twitter_health[twitter_health['info'].str.startswith("http")]

Unnamed: 0,Number,Date_Time,info,source,Core_info_final
19543,532856398615674880,Thu Nov 13 11:24:00 +0000 2014,http://gu.com/p/438n2/tw,gdnhealthcare,
25536,280298707818262529,Sun Dec 16 13:09:54 +0000 2012,http://pinterest.com/pin/34832597089912501/ Br...,goodhealth,Brooke Burke Charvet: 8 Things You Should Know...
25683,273832894416441344,Wed Nov 28 16:57:04 +0000 2012,http://pinterest.com/pin/34832597089793116/ We...,goodhealth,Weight Loss Tip: Peel Off Pudge With Pepper
26351,199182851944628225,Sun May 06 17:04:27 +0000 2012,http://pinterest.com/pin/34832597088302332/ 7 ...,goodhealth,7 Foods That Help You Shed Pounds
26353,198905597666656257,Sat May 05 22:42:44 +0000 2012,http://pinterest.com/pin/34832597088298094/ 5 ...,goodhealth,5 Healthy Cinco de Mayo Recipes
...,...,...,...,...,...
59957,533426053901860864,Sat Nov 15 01:07:36 +0000 2014,http://HealthCare.gov Expected to Work Better ...,wsjhealth,Expected to Work Better This Year
60379,520007777175879680,Thu Oct 09 00:28:10 +0000 2014,http://HealthCare.gov Shortens Insurance Appli...,wsjhealth,Shortens Insurance Application
60394,519674939507359745,Wed Oct 08 02:25:35 +0000 2014,http://HealthCare.gov Testing to Be Confidenti...,wsjhealth,Testing to Be Confidential
60465,517514080131629056,Thu Oct 02 03:19:06 +0000 2014,http://HealthCare.gov Delays Web Host Switch h...,wsjhealth,Delays Web Host Switch


### Cleaned column 'Core_info_final' to work with

In [36]:
twitter_health

Unnamed: 0,Number,Date_Time,info,source,Core_info_final
0,585978391360221184,Thu Apr 09 01:31:50 +0000 2015,Breast cancer risk test devised http://bbc.in/...,bbchealth,Breast cancer risk test devised
1,585947808772960257,Wed Apr 08 23:30:18 +0000 2015,GP workload harming care - BMA poll http://bbc...,bbchealth,GP workload harming care - BMA poll
2,585947807816650752,Wed Apr 08 23:30:18 +0000 2015,Short people's 'heart risk greater' http://bbc...,bbchealth,Short people's 'heart risk greater'
3,585866060991078401,Wed Apr 08 18:05:28 +0000 2015,New approach against HIV 'promising' http://bb...,bbchealth,New approach against HIV 'promising'
4,585794106170839041,Wed Apr 08 13:19:33 +0000 2015,Coalition 'undermined NHS' - doctors http://bb...,bbchealth,Coalition 'undermined NHS' - doctors
...,...,...,...,...,...
59957,533426053901860864,Sat Nov 15 01:07:36 +0000 2014,http://HealthCare.gov Expected to Work Better ...,wsjhealth,Expected to Work Better This Year
60379,520007777175879680,Thu Oct 09 00:28:10 +0000 2014,http://HealthCare.gov Shortens Insurance Appli...,wsjhealth,Shortens Insurance Application
60394,519674939507359745,Wed Oct 08 02:25:35 +0000 2014,http://HealthCare.gov Testing to Be Confidenti...,wsjhealth,Testing to Be Confidential
60465,517514080131629056,Thu Oct 02 03:19:06 +0000 2014,http://HealthCare.gov Delays Web Host Switch h...,wsjhealth,Delays Web Host Switch
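For comparison, the same cleaning could probably be done in one pass over the original column, without separating the exception rows first. This is an alternative sketch, not the method used in this notebook: a regular expression removes a leading link, and the remainder is cut at the next "http". The sample strings come from the outputs above.

```python
import pandas as pd

info = pd.Series([
    "Breast cancer risk test devised http://bbc.in/1CimpJF",
    "http://HealthCare.gov Delays Web Host Switch http://on.wsj.com/1rKuiWc",
    "http://gu.com/p/438n2/tw",
])

# 1) Remove a link at the very start of the tweet, plus the space after it.
without_leading_url = info.str.replace(r"^http\S*\s*", "", regex=True)
# 2) Keep only the text before the next link and trim whitespace.
cleaned = without_leading_url.str.split("http", n=1).str[0].str.strip()
# 3) Link-only tweets end up as empty strings; mark them as missing.
cleaned = cleaned.mask(cleaned == "")

print(cleaned)
```

One difference to keep in mind: here a link-only tweet becomes an empty string first, so it has to be converted to a missing value explicitly before `dropna` can remove it.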


#### Some additional cleaning
#### * Removing NA values from "Core_info_final"
#### * Columns "Number", "Date_Time", and "info" can be deleted once the "year" feature below has been derived, since they are not needed for further analysis

In [37]:
twitter_health.isna().sum()

Number             0
Date_Time          0
info               0
source             0
Core_info_final    1
dtype: int64

In [38]:
twitter_health.loc[twitter_health['Core_info_final'].isna()]

Unnamed: 0,Number,Date_Time,info,source,Core_info_final
19543,532856398615674880,Thu Nov 13 11:24:00 +0000 2014,http://gu.com/p/438n2/tw,gdnhealthcare,


In [39]:
twitter_health.dropna(subset=['Core_info_final'], inplace=True)

In [40]:
twitter_health.shape

(62816, 5)
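A minimal sketch of the NA handling on a made-up two-row frame: count the missing values per column, then drop only the rows where "Core_info_final" is missing. The non-`inplace` form shown here returns a new frame and leaves the original untouched, which can make before/after comparisons easier.

```python
import pandas as pd

frame = pd.DataFrame({
    "source": ["gdnhealthcare", "wsjhealth"],
    "Core_info_final": [None, "Delays Web Host Switch"],
})

missing_per_column = frame.isna().sum()
# subset= restricts the NA test to the listed column(s).
without_missing = frame.dropna(subset=["Core_info_final"])

print(missing_per_column)
print(frame.shape, without_missing.shape)
```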

#### Saving the cleaned file to CSV for further use

In [41]:
twitter_health.to_csv('PrvDel.csv', index=False)

#### Adding a "year" feature derived from the "Date_Time" column for further use in the analysis

In [42]:
twitter_health['Date_Time'] = pd.to_datetime(twitter_health['Date_Time'], errors='coerce')
twitter_health['year'] = twitter_health['Date_Time'].dt.year

In [43]:
years = twitter_health['year'].value_counts()
years.sort_index()

2011     2303
2012     5797
2013    18906
2014    25995
2015     9815
Name: year, dtype: int64
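Because `errors='coerce'` silently turns unparseable timestamps into `NaT`, it may be worth counting them. The sketch below parses the timestamp layout seen in the "Date_Time" column with an explicit format string; the format string and the deliberately broken third value are assumptions added for the illustration.

```python
import pandas as pd

raw = pd.Series([
    "Thu Apr 09 01:31:50 +0000 2015",
    "Tue Dec 24 07:42:22 +0000 2013",
    "not a date",                      # made-up bad value
])

# Assumed layout: weekday, month, day, time, UTC offset, year.
parsed = pd.to_datetime(raw, format="%a %b %d %H:%M:%S %z %Y",
                        errors="coerce", utc=True)
year = parsed.dt.year

print(parsed.isna().sum(), "value(s) could not be parsed")
print(year.tolist())
```

If the count of `NaT` values is zero on the real data, the "year" column contains no gaps and the per-year totals add up to the full row count.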

In [45]:
years = ['2011','2012','2013','2014','2015']

# Write one CSV per year (PartOne_2011.csv ... PartOne_2015.csv) for use in Part Two
for year in years:
    tweet_year = twitter_health[twitter_health['year'] == int(year)]
    tweet_year.to_csv('PartOne_' + year + '.csv', index=False)
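The per-year export could also be driven by the data itself, so that the list of years does not have to be typed in. A sketch using `groupby` on a made-up three-row frame; the temporary output directory and the `written` dictionary are placeholders for the example, while the `PartOne_<year>.csv` naming follows the cell above.

```python
import tempfile
from pathlib import Path

import pandas as pd

tweets = pd.DataFrame({
    "year": [2011, 2011, 2015],
    "Core_info_final": ["a", "b", "c"],   # made-up text
})

out_dir = Path(tempfile.mkdtemp())        # hypothetical output location
written = {}
for year, group in tweets.groupby("year"):
    path = out_dir / f"PartOne_{int(year)}.csv"
    group.to_csv(path, index=False)
    written[int(year)] = path

print(sorted(p.name for p in written.values()))
```

Only years that actually occur in the data get a file, which avoids writing empty CSVs.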

#### End of Part One; the analysis continues in Part Two