# GDELT Raw Data File Collection

### This notebook collects the GDELT data in the following steps
#### 1. Download the master file list from the official GDELT website.
#### 2. Download the compressed raw data files selected by the user's preferences.
#### 3. Uncompress the downloaded files.
#### 4. Combine the uncompressed files into a single CSV file.
#### 5. Filter the data to healthcare events: Actor1Code = 'HLH' and Actor2Code = 'HLH', with eventRootCode IN ('10','11','12','13','14').
#### 6. Push the combined CSV file to MongoDB (NoSQL).
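The filter step listed above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual implementation: the helper names `is_healthcare_event` and `filter_events` are hypothetical, the rows are assumed to be dictionaries keyed by column name, the column key `EventRootCode` is assumed to match the combined CSV's header, and the sample rows are made up for demonstration.

```python
# Values taken from the filter step in the list above.
HEALTH_ACTOR = "HLH"
ROOT_CODES = {"10", "11", "12", "13", "14"}


def is_healthcare_event(row):
    """Return True when both actors are 'HLH' and the root code is 10-14."""
    return (
        row.get("Actor1Code") == HEALTH_ACTOR
        and row.get("Actor2Code") == HEALTH_ACTOR
        and row.get("EventRootCode") in ROOT_CODES
    )


def filter_events(rows):
    """Keep only the rows that satisfy the healthcare filter."""
    return [row for row in rows if is_healthcare_event(row)]


# Made-up sample rows, for illustration only.
sample_rows = [
    {"Actor1Code": "HLH", "Actor2Code": "HLH", "EventRootCode": "11"},
    {"Actor1Code": "HLH", "Actor2Code": "GOV", "EventRootCode": "11"},
    {"Actor1Code": "HLH", "Actor2Code": "HLH", "EventRootCode": "04"},
]
healthcare_rows = filter_events(sample_rows)
```

Root codes are compared as strings, so the values must be read from the CSV as text; if they are parsed as integers, the set would need to hold integers instead.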

### Download the Master file

#### Check whether an old master file is already present in the directory. If it is, delete it so that the latest one can be downloaded from GDELT.

In [1]:
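import os
import time
#import tracemalloc
#tracemalloc.start()
# Note: process_time() counts CPU time only, not time spent waiting on downloads
start_time = time.process_time()

# Remove any stale master file so that a fresh copy is downloaded below
file_path = r".\master\masterfilelist.txt"
if os.path.isfile(file_path):
    os.remove(file_path)
    print("Old Master file existed and has been deleted")
else:
    print("Old Master file not found!")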

Old Master file existed and has been deleted


#### Download the master file which maps the raw data files

In [2]:
import os
import wget
path_master = r".\master"
os.makedirs(path_master, exist_ok=True)  # ensure the target directory exists
wget.download("http://data.gdeltproject.org/gdeltv2/masterfilelist.txt", out=path_master)
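Once the master file is on disk, the raw files to fetch can be picked from it. The sketch below assumes that each line of `masterfilelist.txt` holds three whitespace-separated fields (size, hash, URL) and that event files are named `YYYYMMDDHHMMSS.export.CSV.zip`; both are assumptions about the file layout rather than something shown in this notebook. The function name `select_export_urls` and the sample lines (including the placeholder hashes) are hypothetical.

```python
from datetime import datetime


def select_export_urls(lines, start, end):
    """Return export-file URLs whose filename timestamp lies in [start, end]."""
    selected = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        url = parts[2]
        name = url.rsplit("/", 1)[-1]
        if not name.endswith(".export.CSV.zip"):
            continue  # ignore files that are not event exports
        try:
            stamp = datetime.strptime(name[:14], "%Y%m%d%H%M%S")
        except ValueError:
            continue  # filename does not start with a timestamp
        if start <= stamp <= end:
            selected.append(url)
    return selected


# Made-up sample lines in the assumed "size hash url" layout.
sample_lines = [
    "150383 hash-placeholder-1 http://data.gdeltproject.org/gdeltv2/20150218230000.export.CSV.zip",
    "318084 hash-placeholder-2 http://data.gdeltproject.org/gdeltv2/20150218230000.mentions.CSV.zip",
    "149211 hash-placeholder-3 http://data.gdeltproject.org/gdeltv2/20150219000000.export.CSV.zip",
]
urls = select_export_urls(
    sample_lines,
    start=datetime(2015, 2, 18),
    end=datetime(2015, 2, 18, 23, 59, 59),
)
```

In practice `lines` would come from iterating over the opened master file, and the date range would come from the user's preferences.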

[wget download progress output omitted; reported total size: 82746692 bytes]

 86% [..............................................................          ] 71458816 / 82746692 86% [..............................................................          ] 71467008 / 82746692 86% [..............................................................          ] 71475200 / 82746692 86% [..............................................................          ] 71483392 / 82746692 86% [..............................................................          ] 71491584 / 82746692 86% [..............................................................          ] 71499776 / 82746692 86% [..............................................................          ] 71507968 / 82746692 86% [..............................................................          ] 71516160 / 82746692 86% [..............................................................          ] 71524352 / 82746692 86% [..............................................................          ] 71532544 / 82746692

 89% [................................................................        ] 73662464 / 82746692 89% [................................................................        ] 73670656 / 82746692 89% [................................................................        ] 73678848 / 82746692 89% [................................................................        ] 73687040 / 82746692 89% [................................................................        ] 73695232 / 82746692 89% [................................................................        ] 73703424 / 82746692 89% [................................................................        ] 73711616 / 82746692 89% [................................................................        ] 73719808 / 82746692 89% [................................................................        ] 73728000 / 82746692 89% [................................................................        ] 73736192 / 82746692

 91% [..................................................................      ] 75890688 / 82746692 91% [..................................................................      ] 75898880 / 82746692 91% [..................................................................      ] 75907072 / 82746692 91% [..................................................................      ] 75915264 / 82746692 91% [..................................................................      ] 75923456 / 82746692 91% [..................................................................      ] 75931648 / 82746692 91% [..................................................................      ] 75939840 / 82746692 91% [..................................................................      ] 75948032 / 82746692 91% [..................................................................      ] 75956224 / 82746692 91% [..................................................................      ] 75964416 / 82746692

 94% [...................................................................     ] 78110720 / 82746692 94% [...................................................................     ] 78118912 / 82746692 94% [...................................................................     ] 78127104 / 82746692 94% [...................................................................     ] 78135296 / 82746692 94% [...................................................................     ] 78143488 / 82746692 94% [....................................................................    ] 78151680 / 82746692 94% [....................................................................    ] 78159872 / 82746692 94% [....................................................................    ] 78168064 / 82746692 94% [....................................................................    ] 78176256 / 82746692 94% [....................................................................    ] 78184448 / 82746692

 97% [.....................................................................   ] 80330752 / 82746692 97% [.....................................................................   ] 80338944 / 82746692 97% [.....................................................................   ] 80347136 / 82746692 97% [.....................................................................   ] 80355328 / 82746692 97% [.....................................................................   ] 80363520 / 82746692 97% [.....................................................................   ] 80371712 / 82746692 97% [.....................................................................   ] 80379904 / 82746692 97% [.....................................................................   ] 80388096 / 82746692 97% [.....................................................................   ] 80396288 / 82746692 97% [.....................................................................   ] 80404480 / 82746692

 99% [....................................................................... ] 82550784 / 82746692 99% [....................................................................... ] 82558976 / 82746692 99% [....................................................................... ] 82567168 / 82746692 99% [....................................................................... ] 82575360 / 82746692 99% [....................................................................... ] 82583552 / 82746692 99% [....................................................................... ] 82591744 / 82746692 99% [....................................................................... ] 82599936 / 82746692 99% [....................................................................... ] 82608128 / 82746692 99% [....................................................................... ] 82616320 / 82746692 99% [....................................................................... ] 82624512 / 82746692

'.\\master/masterfilelist.txt'

### Read the master file into a dataframe

In [3]:
import pandas as pd
header_list = ['A','B','C']
df = pd.read_csv(r".\master\masterfilelist.txt", sep= ' ', names = header_list)

#### Preview the contents of the master file and check how many rows it has

In [4]:
df.head()

Unnamed: 0,A,B,C
0,150383,297a16b493de7cf6ca809a7cc31d0b93,http://data.gdeltproject.org/gdeltv2/201502182...
1,318084,bb27f78ba45f69a17ea6ed7755e9f8ff,http://data.gdeltproject.org/gdeltv2/201502182...
2,10768507,ea8dde0beb0ba98810a92db068c0ce99,http://data.gdeltproject.org/gdeltv2/201502182...
3,149211,2a91041d7e72b0fc6a629e2ff867b240,http://data.gdeltproject.org/gdeltv2/201502182...
4,339037,dec3f427076b716a8112b9086c342523,http://data.gdeltproject.org/gdeltv2/201502182...


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 774279 entries, 0 to 774278
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   A       774279 non-null  object
 1   B       774224 non-null  object
 2   C       774224 non-null  object
dtypes: object(3)
memory usage: 17.7+ MB
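The `df.info()` output reports 774279 rows but only 774224 non-null values in `B` and `C`, so 55 lines of the master list did not split into three fields. The sketch below shows one way to read the list with descriptive column names and drop such rows. The names `size_bytes`, `md5` and `url` are assumptions based on how the three fields look (a byte count, a 32-character hex string, a URL), and the sample lines are made up for illustration.

```python
import io
import pandas as pd

# Made-up sample in the same space-separated layout as masterfilelist.txt.
# The second line is deliberately malformed (a single field).
sample = io.StringIO(
    "150383 00000000000000000000000000000001 http://example.org/20200101000000.export.CSV.zip\n"
    "http://example.org/\n"
    "318084 00000000000000000000000000000002 http://example.org/20200101000000.mentions.CSV.zip\n"
)

# Hypothetical column names; the notebook itself uses 'A', 'B', 'C'.
master = pd.read_csv(sample, sep=" ", names=["size_bytes", "md5", "url"], dtype=str)

# Rows with fewer than three fields end up with missing values and are dropped here.
clean = master.dropna(subset=["md5", "url"]).reset_index(drop=True)
print(len(master), len(clean))
```

Dropping the incomplete rows up front means later steps do not need to guard against missing URLs.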


### Fetch all the export table files of event data
#### This step selects only the URLs of the event (export) files from the master list; the files themselves are downloaded in a later step

In [6]:
list_of_export_table = []

In [7]:
for i in df['C']:
    if 'export' in str(i):
        list_of_export_table.append(i)

In [8]:
len(list_of_export_table)

258073

In [9]:
# Show only the first 5 records for reference
list_of_export_table[:5]

['http://data.gdeltproject.org/gdeltv2/20150218230000.export.CSV.zip',
 'http://data.gdeltproject.org/gdeltv2/20150218231500.export.CSV.zip',
 'http://data.gdeltproject.org/gdeltv2/20150218233000.export.CSV.zip',
 'http://data.gdeltproject.org/gdeltv2/20150218234500.export.CSV.zip',
 'http://data.gdeltproject.org/gdeltv2/20150219000000.export.CSV.zip']
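Matching on the substring `export` works for these URLs, but an anchored pattern is stricter and also exposes the timestamp embedded in each file name. The following is a minimal sketch under that assumption; `export_files` and `EXPORT_RE` are hypothetical names, and the URLs are placeholders that only mimic the naming pattern shown above.

```python
import re
from datetime import datetime

# Placeholder URLs following the <timestamp>.<table>.CSV.zip naming pattern.
urls = [
    "http://example.org/gdeltv2/20150218230000.export.CSV.zip",
    "http://example.org/gdeltv2/20150218230000.mentions.CSV.zip",
    "http://example.org/gdeltv2/20150218231500.export.CSV.zip",
    None,  # a malformed master-list row shows up as a missing value
]

# 14-digit timestamp followed by the export suffix, anchored at the end of the URL.
EXPORT_RE = re.compile(r"/(\d{14})\.export\.CSV\.zip$")

def export_files(candidates):
    """Return (timestamp, url) pairs for export files, oldest first."""
    found = []
    for url in candidates:
        match = EXPORT_RE.search(str(url))
        if match:
            stamp = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
            found.append((stamp, url))
    return sorted(found)

pairs = export_files(urls)
```

With the timestamps parsed, files could be selected by date range instead of by position in the list.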

### Download the most recent N files
#### This downloads the N most recent export files, counting back from the time the code is run

In [10]:
print("ENTER THE NUMBER OF FILES YOU WANT TO DOWNLOAD")
num = input()

ENTER THE NUMBER OF FILES YOU WANT TO DOWNLOAD
100
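The value returned by `input()` is a string and is later used in a slice of the form `list[-int(num):]`. One edge case is worth guarding against: entering `0` gives `list[-0:]`, which is the whole list rather than an empty one. A minimal validation sketch follows; `parse_file_count` is a hypothetical helper, not part of the notebook.

```python
def parse_file_count(raw, available):
    """Convert user input to a file count between 1 and `available`."""
    try:
        count = int(raw.strip())
    except ValueError:
        raise ValueError(f"expected a whole number, got {raw!r}") from None
    if count < 1:
        # Guards against list[-0:], which would select every file.
        raise ValueError("the number of files must be at least 1")
    return min(count, available)

# Example: 258073 export files are available, the user asks for 100.
requested = parse_file_count(" 100 ", 258073)
```

The result can then be used directly as the slice length.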


In [11]:
path_zip = r'.\zip_data'

In [12]:
import shutil
# Remove the directory left over from a previous run, if any,
# so that the os.mkdir call in the next cell always succeeds
if os.path.isdir(path_zip):
    shutil.rmtree(path_zip)

In [13]:
os.mkdir(path_zip)

In [14]:
for i in list_of_export_table[-int(num):]:
    wget.download(i, out = path_zip)

100% [..............................................................................] 92549 / 92549

In [15]:
len(os.listdir(path_zip))

100
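Counting the files confirms that 100 downloads arrived, but not that they are intact. The second column of the master list looks like a 32-character hex digest, which is consistent with an MD5 checksum; treating it as one is an assumption here. Under that assumption, each download could be checked with a helper like the hypothetical `md5_of_file` below, demonstrated on a throwaway file.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demonstration on a temporary file rather than a real GDELT download.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as handle:
        handle.write(b"hello")
    demo_digest = md5_of_file(demo_path)

print(demo_digest)
```

A mismatch between the computed digest and the listed value would suggest re-downloading that file.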

### Unzip the downloaded files to a separate CSV folder

In [16]:
import shutil

In [17]:
path_csv = r'.\csv_data'

In [18]:
import shutil
# Remove the directory left over from a previous run, if any,
# so that the os.mkdir call in the next cell always succeeds
if os.path.isdir(path_csv):
    shutil.rmtree(path_csv)

In [19]:
os.mkdir(path_csv)
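The same "clear, then recreate" pattern is used for both the zip and the CSV directory. A small helper can cover the case where the directory does not exist yet as well as the case where it exists and has content. This is a sketch only; `reset_dir` is a hypothetical name, and the demonstration runs inside a temporary directory.

```python
import os
import shutil
import tempfile

def reset_dir(path):
    """Delete `path` if it exists, then recreate it empty."""
    shutil.rmtree(path, ignore_errors=True)  # no error if the directory is missing
    os.makedirs(path)
    return path

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "csv_data")
    reset_dir(target)                                    # directory did not exist yet
    open(os.path.join(target, "old.csv"), "w").close()   # leave a stale file behind
    reset_dir(target)                                    # directory existed with content
    remaining = os.listdir(target)
```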

In [20]:
file_names = os.listdir(path_zip)

In [21]:
for i in file_names:
    shutil.unpack_archive(os.path.join(path_zip, i), path_csv)
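A single corrupt archive would stop the extraction loop with an exception. As a sketch of a more tolerant variant, the hypothetical `extract_all` below uses the standard `zipfile` module, skips archives that cannot be read, and reports their names. The demonstration builds one valid and one broken archive in a temporary directory.

```python
import os
import tempfile
import zipfile

def extract_all(zip_dir, out_dir):
    """Extract every .zip in zip_dir into out_dir; return the names that failed."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for name in sorted(os.listdir(zip_dir)):
        if not name.lower().endswith(".zip"):
            continue
        try:
            with zipfile.ZipFile(os.path.join(zip_dir, name)) as archive:
                archive.extractall(out_dir)
        except zipfile.BadZipFile:
            failed.append(name)
    return failed

with tempfile.TemporaryDirectory() as tmp:
    zip_dir = os.path.join(tmp, "zip_data")
    csv_dir = os.path.join(tmp, "csv_data")
    os.makedirs(zip_dir)
    # One valid archive containing a tiny tab-separated file.
    with zipfile.ZipFile(os.path.join(zip_dir, "good.zip"), "w") as archive:
        archive.writestr("good.CSV", "1\t2\n")
    # One file that has a .zip name but is not a zip archive.
    with open(os.path.join(zip_dir, "broken.zip"), "wb") as handle:
        handle.write(b"not a zip archive")
    failed = extract_all(zip_dir, csv_dir)
    extracted = sorted(os.listdir(csv_dir))
```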

### Combine all the CSV into one

In [22]:
header = ['GLOBALEVENTID', 
            'SQLDATE', 
            'MonthYear',
            'Year', 
            'FractionDate',
            'Actor1Code', 
            'Actor1Name', 
            'Actor1CountryCode',
            'Actor1KnownGroupCode',
            'Actor1EthnicCode',
            'Actor1Religion1Code',
            'Actor1Religion2Code',
            'Actor1Type1Code',
            'Actor1Type2Code',
            'Actor1Type3Code',
            'Actor2Code',
            'Actor2Name',
            'Actor2CountryCode',
            'Actor2KnownGroupCode',
            'Actor2EthnicCode',
            'Actor2Religion1Code',
            'Actor2Religion2Code',
            'Actor2Type1Code',
            'Actor2Type2Code',
            'Actor2Type3Code',
            'IsRootEvent',
            'EventCode',
            'EventBaseCode',
            'EventRootCode',
            'QuadClass',
            'GoldsteinScale',
            'NumMentions',
            'NumSources',
            'NumArticles',
            'AvgTone',
            'Actor1Geo_Type',
            'Actor1Geo_FullName',
            'Actor1Geo_CountryCode',
            'Actor1Geo_ADM1Code',
            'Actor1Geo_ADM2Code',
            'Actor1Geo_Lat',
            'Actor1Geo_Long',
            'Actor1Geo_FeatureID',
            'Actor2Geo_Type',
            'Actor2Geo_FullName',
            'Actor2Geo_CountryCode',
            'Actor2Geo_ADM1Code',
            'Actor2Geo_ADM2Code',
            'Actor2Geo_Lat',
            'Actor2Geo_Long',
            'Actor2Geo_FeatureID',
            'ActionGeo_Type',
            'ActionGeo_FullName',
            'ActionGeo_CountryCode',
            'ActionGeo_ADM1Code',
            'ActionGeo_ADM2Code',
            'ActionGeo_Lat',
            'ActionGeo_Long',
            'ActionGeo_FeatureID',
            'DATEADDED',
            'SOURCEURL']


In [23]:
len(header)

61

In [24]:
import glob
files = os.path.join(path_csv, "*.csv")

In [25]:
files

'.\\csv_data\\*.csv'

In [26]:
files = glob.glob(files)

In [27]:
final_csv = pd.concat([pd.read_csv(f, sep = '\t', names = header) for f in files])

In [28]:
final_csv.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 114833 entries, 0 to 1316
Data columns (total 61 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   GLOBALEVENTID          114833 non-null  int64  
 1   SQLDATE                114833 non-null  int64  
 2   MonthYear              114833 non-null  int64  
 3   Year                   114833 non-null  int64  
 4   FractionDate           114833 non-null  float64
 5   Actor1Code             103614 non-null  object 
 6   Actor1Name             103614 non-null  object 
 7   Actor1CountryCode      64340 non-null   object 
 8   Actor1KnownGroupCode   1404 non-null    object 
 9   Actor1EthnicCode       663 non-null     object 
 10  Actor1Religion1Code    1422 non-null    object 
 11  Actor1Religion2Code    355 non-null     object 
 12  Actor1Type1Code        49726 non-null   object 
 13  Actor1Type2Code        3613 non-null    object 
 14  Actor1Type3Code        78 non-null    

In [29]:
len(final_csv)

114833
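The `info()` output above reports an index running from 0 to 1316 for 114833 rows, which means each file's row labels were kept and therefore repeat across files. If unique row labels are wanted, `pd.concat` accepts `ignore_index=True`. The sketch below illustrates this with two tiny in-memory tab-separated inputs and a shortened, illustrative header.

```python
import io
import pandas as pd

# Shortened header for the demo; the notebook uses the full 61-column list.
columns = ["GLOBALEVENTID", "Actor1Code", "EventRootCode"]

# Two made-up "files" with no header row, like the GDELT export files.
part_a = io.StringIO("1\tHLH\t11\n2\t\t4\n")
part_b = io.StringIO("3\tGOV\t14\n")

frames = [pd.read_csv(part, sep="\t", names=columns, header=None)
          for part in (part_a, part_b)]

# ignore_index=True renumbers the rows 0..n-1 across all inputs.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```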

#### Filter the combined data to healthcare events (actor code HLH) with event root codes 10 to 14

In [30]:
import pandasql as pds

In [31]:
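# Column names are left unquoted: a quoted 'Actor2Code' would be treated as a string literal and could never equal 'HLH'
query = """SELECT * FROM final_csv WHERE EventRootCode IN ('10','11','12','13','14') AND (Actor1Code = 'HLH' OR Actor2Code = 'HLH') """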

In [32]:
new_df = pds.sqldf(query,globals())

In [33]:
new_df.head()

Unnamed: 0,GLOBALEVENTID,SQLDATE,MonthYear,Year,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1KnownGroupCode,Actor1EthnicCode,...,ActionGeo_Type,ActionGeo_FullName,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_ADM2Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,DATEADDED,SOURCEURL
0,1060046088,20220824,202208,2022,2022.6411,HLH,HOSPITAL,,,,...,4,"Wellington, New Zealand (general), New Zealand",NZ,NZ00,22391,-41.3,174.783,-1521348,20220824221500,https://kitchener.ctvnews.ca/grand-river-hospi...
1,1060047085,20220824,202208,2022,2022.6411,HLH,DOCTOR,,,,...,1,United Kingdom,UK,UK,,54.0,-4.0,UK,20220824223000,https://vancouverisland.ctvnews.ca/british-col...
2,1060050113,20220824,202208,2022,2022.6411,HLH,NURSE,,,,...,3,"Asheville, North Carolina, United States",US,USNC,NC021,35.6009,-82.554,1018864,20220824230000,https://www.healthleadersmedia.com/nursing/mis...
3,1060050114,20220824,202208,2022,2022.6411,HLH,NURSE,,,,...,3,"Asheville, North Carolina, United States",US,USNC,NC021,35.6009,-82.554,1018864,20220824230000,https://www.healthleadersmedia.com/nursing/mis...
4,1060050116,20220824,202208,2022,2022.6411,HLH,NURSE,,,,...,3,"Asheville, North Carolina, United States",US,USNC,NC021,35.6009,-82.554,1018864,20220824230000,https://www.healthleadersmedia.com/nursing/mis...


In [34]:
len(new_df)

69

In [35]:
new_df.EventRootCode.unique()

array([11, 12, 10, 14, 13], dtype=int64)
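The same kind of filter (root codes 10 to 14, and either actor coded `HLH`) can also be written with pandas boolean masks, which avoids the SQL layer. This is a sketch on a made-up four-row frame, not the notebook's data.

```python
import pandas as pd

# Made-up rows covering the relevant combinations.
events = pd.DataFrame({
    "Actor1Code": ["HLH", "GOV", None, "HLH"],
    "Actor2Code": [None, "HLH", "HLH", "GOV"],
    "EventRootCode": [11, 14, 3, 19],
})

root_codes = [10, 11, 12, 13, 14]

# Either actor is a healthcare actor; missing codes compare as False.
is_health = (events["Actor1Code"] == "HLH") | (events["Actor2Code"] == "HLH")

filtered = events[events["EventRootCode"].isin(root_codes) & is_health]
print(filtered)
```

Only the first two rows pass: the third has a root code outside the list, and so does the fourth.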



#### Write the filtered data to a new CSV file

In [37]:
%%time
new_df.to_csv(r".\GDELT_data.csv", index = False)

CPU times: total: 15.6 ms
Wall time: 188 ms


### Export data to MongoDB collection

In [38]:
import pymongo

In [39]:
#Create connection to the mongodb client
client = pymongo.MongoClient("mongodb://localhost:27017")

In [40]:
client.list_database_names()

['GDELT', 'admin', 'config', 'local']

In [41]:
db = client['GDELT']
db.list_collection_names()

['Balanced_Data_All', 'Balanced_Data_No_Protest_Code_Significant']

In [42]:
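# Drop the collection left over from a previous run so the insert below starts from an empty collection
if 'raw_data_files' in db.list_collection_names():
    db.raw_data_files.drop()
else:
    print("Collection does not exist!")

Collection does not exist!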


In [43]:
raw_data_files = db['raw_data_files']

In [44]:
#read the generated CSV
data = pd.read_csv(r".\GDELT_data.csv")

In [45]:
import json
final_data = json.loads(data.to_json(orient='records'))

In [46]:
%%time
raw_data_files.insert_many(final_data)
end_time = time.process_time()

CPU times: total: 0 ns
Wall time: 608 ms
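The records passed to `insert_many` are plain dictionaries in which missing values are `None`. As an alternative sketch to the CSV and JSON round trip, the hypothetical `to_records` helper below builds such dictionaries directly from a DataFrame. It does not connect to MongoDB, and the demo frame is made up.

```python
import pandas as pd

def to_records(frame):
    """Turn a DataFrame into a list of dicts, with missing values as None."""
    return [
        {key: (None if pd.isna(value) else value) for key, value in row.items()}
        for row in frame.to_dict(orient="records")
    ]

# Made-up frame with one missing string and one missing number.
demo = pd.DataFrame({
    "GLOBALEVENTID": [1, 2],
    "Actor1CountryCode": ["US", None],
    "AvgTone": [1.5, float("nan")],
})

records = to_records(demo)
print(records)
```

With a running server, a list like `records` could be handed to `insert_many` in the same way as `final_data` above.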


In [47]:
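# CPU seconds used by this process; time.process_time() does not count time spent sleeping or waiting on I/O
total_time = end_time - start_time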

In [48]:
total_time

32.53125
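The figure above comes from `time.process_time()`, which measures CPU time of the process and does not include time elapsed during sleep. For a download-heavy run, elapsed wall-clock time is often the more relevant number, and `time.perf_counter()` measures that. The sketch below contrasts the two, with `time.sleep` standing in for waiting on a download.

```python
import time

cpu_start = time.process_time()
wall_start = time.perf_counter()

time.sleep(0.2)  # stands in for time spent waiting on the network

cpu_elapsed = time.process_time() - cpu_start
wall_elapsed = time.perf_counter() - wall_start

print(f"CPU time:  {cpu_elapsed:.3f} s")
print(f"Wall time: {wall_elapsed:.3f} s")
```

The wall-clock figure includes the wait, while the CPU figure stays close to zero.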