# Data Exploration

## Steps during data exploration:
### 1) Load the dataset from a CSV file using pandas.
### 2) Inspect the head (first few rows) of the dataset to see whether
### any cleaning needs to be done.
### 3) Filter out unnecessary columns to clean the dataset and reduce
### its size for later data processing.
### 4) Use matplotlib to create a simple plot of the main features.

## Note:
### - A previous mentor advised that it is sufficient to work with the head of each file to
### perform exploratory data analysis.
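As a minimal sketch of that head-only approach, `pd.read_csv` accepts an `nrows` argument, so only the first few rows are parsed instead of loading the whole file and then calling `.head()`. The in-memory sample text and the `read_head` helper below are made up for illustration. They only assume the tab-delimited layout used by the files in this notebook.

```python
import io

import pandas as pd

# Hypothetical tab-delimited sample standing in for one of the real files.
sample_text = (
    "DATE\tNUMARTS\tTHEMES\n"
    "20210325\t1\tEDUCATION;KILL\n"
    "20210325\t4\tAFFECT\n"
    "20210325\t1\tKILL\n"
)


def read_head(source, n_rows=5):
    """Parse only the first n_rows data rows of a tab-delimited file."""
    return pd.read_csv(source, sep="\t", header=0, nrows=n_rows)


head_df = read_head(io.StringIO(sample_text), n_rows=2)
print(head_df)
```

For a large file this avoids parsing rows that the exploration never looks at.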

In [29]:
# Import packages
import pandas as pd

In [16]:
# Read file

# - Note that we have 14 zip files in our current directory that
# we want to explore.
# - Each CSV file needs to be unzipped and then read with pandas (imported above).

# global_20210325 = pd.read_csv("20210325.gkg.csv.zip")
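Since there are several zip files to explore, one possible approach is to loop over them and keep only the head of each. The sketch below assumes that every archive contains a single tab-delimited CSV named like the archive without the `.zip` suffix, as is the case for `20210325.gkg.csv.zip`. The helper name `read_zip_heads` and the two tiny demo archives are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd


def read_zip_heads(directory, pattern="*.csv.zip", n_rows=5):
    """Return {archive name: DataFrame with the first n_rows} per matching zip.

    Assumes each archive holds one tab-delimited CSV whose name is the
    archive name without the trailing '.zip'.
    """
    heads = {}
    for zip_path in sorted(Path(directory).glob(pattern)):
        member = zip_path.name[: -len(".zip")]
        with zipfile.ZipFile(zip_path) as z:
            with z.open(member) as f:
                heads[zip_path.name] = pd.read_csv(
                    f, sep="\t", header=0, nrows=n_rows
                )
    return heads


# Demo on two tiny made-up archives written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.gkg.csv", "b.gkg.csv"):
        with zipfile.ZipFile(Path(tmp) / f"{name}.zip", "w") as z:
            z.writestr(name, "DATE\tNUMARTS\n20210325\t1\n20210325\t4\n")
    heads = read_zip_heads(tmp, n_rows=1)

for name, df in heads.items():
    print(name, df.shape)
```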

In [27]:
# Try (from StackOverflow):
# - Open the zip archive, then open the CSV member inside it as a file
# object, so that pandas can read it without extracting it to disk.
# - The resulting DataFrame (global_2021325) is the object we want to
# work with to perform the crucial data transformations.
import zipfile
import pandas as pd
with zipfile.ZipFile("20210325.gkg.csv.zip") as z:
    with z.open("20210325.gkg.csv") as f:
        global_2021325 = pd.read_csv(f, header=0, delimiter="\t")
        print(global_2021325.head())    # print the first 5 rows

       DATE  NUMARTS                                             COUNTS  \
0  20210325        1                                                NaN   
1  20210325        1                                                NaN   
2  20210325        4                                                NaN   
3  20210325        1  KILL#2014##2#New York, United States#US#USNY#4...   
4  20210325        1  AFFECT#2##2#New York, United States#US#USNY#42...   

                                              THEMES  \
0  TAX_FNCACT;TAX_FNCACT_GUIDE;TAX_WORLDFISH;TAX_...   
1  TAX_FNCACT;TAX_FNCACT_LEADERS;TAX_ETHNICITY;TA...   
2  EDUCATION;WB_470_EDUCATION;WB_482_TERTIARY_EDU...   
3  KILL;EDUCATION;SOC_POINTSOFINTEREST;SOC_POINTS...   
4  UNGP_FORESTS_RIVERS_OCEANS;AFFECT;SECURITY_SER...   

                                           LOCATIONS  \
0  2#Minnesota, United States#US#USMN#45.7326#-93...   
1  1#Germany#GM#GM#51.5#10.5#GM;1#Hungary#HU#HU#4...   
2  1#Bangladesh#BG#BG#24#90#BG;1#United King

### Above, we can see null values in the COUNTS and CAMEOEVENTIDS columns already in the first few rows.
### In addition, most of the columns hold multiple delimited values per row and should be cleaned.
### We will do more work on this in Step 5.
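A rough sketch of both checks: counting nulls per column, and turning a semicolon-delimited column such as THEMES into one value per row. The small DataFrame is made up to mimic the layout printed above, and this is one possible cleaning approach rather than the final one.

```python
import pandas as pd

# Made-up rows that mimic the layout of the head shown above.
df = pd.DataFrame(
    {
        "DATE": [20210325, 20210325, 20210325],
        "COUNTS": [None, None, "KILL#2014##2#New York, United States#US"],
        "THEMES": ["TAX_FNCACT;EDUCATION", "KILL;EDUCATION", None],
    }
)

# Number of missing values in each column.
null_counts = df.isna().sum()
print(null_counts)

# Split THEMES on ';' and expand to one theme per row.
themes_long = (
    df[["DATE", "THEMES"]]
    .dropna(subset=["THEMES"])
    .assign(THEMES=lambda d: d["THEMES"].str.split(";"))
    .explode("THEMES")
    .reset_index(drop=True)
)
print(themes_long)
```

The long format makes it straightforward to count or filter individual themes later.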

### Next, we want to remove unnecessary columns and write the code so that the same data transformations can be applied to large amounts of data.
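One way to scale this to larger files, sketched here under assumptions, is to select the wanted columns at read time with `usecols` and to process the file in pieces with `chunksize`. The column list matches the columns kept in this notebook. The `iter_filtered_chunks` helper, the chunk size, and the in-memory sample are hypothetical.

```python
import io

import pandas as pd

# Columns kept after filtering in this notebook.
KEEP_COLUMNS = ["DATE", "THEMES", "LOCATIONS", "PERSONS", "ORGANIZATIONS"]


def iter_filtered_chunks(source, chunk_size=100_000):
    """Yield DataFrames of at most chunk_size rows holding only KEEP_COLUMNS."""
    reader = pd.read_csv(
        source, sep="\t", header=0, usecols=KEEP_COLUMNS, chunksize=chunk_size
    )
    yield from reader


# Made-up three-row sample with some of the columns that get dropped.
sample = io.StringIO(
    "DATE\tNUMARTS\tCOUNTS\tTHEMES\tLOCATIONS\tPERSONS\tORGANIZATIONS\tTONE\n"
    "20210325\t1\t\tEDUCATION;KILL\t1#Germany#GM#GM#51.5#10.5#GM\tjeff brown\troyal society\t0.9\n"
    "20210325\t4\t\tAFFECT\t1#Bangladesh#BG#BG#24#90#BG\tmark richards\teuropean union\t0.1\n"
    "20210325\t1\t\tKILL\t1#Germany#GM#GM#51.5#10.5#GM\tadrian smith\troyal society\t-2.0\n"
)

chunk_sizes = []
seen_columns = None
for chunk in iter_filtered_chunks(sample, chunk_size=2):
    chunk_sizes.append(len(chunk))
    seen_columns = list(chunk.columns)

print(chunk_sizes, seen_columns)
```

Each chunk can then be cleaned and aggregated independently, so memory use depends on the chunk size rather than the file size.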

In [28]:
# Display rows of all columns:
global_2021325.head(5)

Unnamed: 0,DATE,NUMARTS,COUNTS,THEMES,LOCATIONS,PERSONS,ORGANIZATIONS,TONE,CAMEOEVENTIDS,SOURCES,SOURCEURLS
0,20210325,1,,TAX_FNCACT;TAX_FNCACT_GUIDE;TAX_WORLDFISH;TAX_...,"2#Minnesota, United States#US#USMN#45.7326#-93...",jeff brown;marty kirchner;red wing;mike mccormick,stoddard fire department;mccormick river guide...,"0.925925925925926,1.85185185185185,0.925925925...",,lacrossetribune.com,https://lacrossetribune.com/outdoors/outdoors-...
1,20210325,1,,TAX_FNCACT;TAX_FNCACT_LEADERS;TAX_ETHNICITY;TA...,1#Germany#GM#GM#51.5#10.5#GM;1#Hungary#HU#HU#4...,jessica johnson;igor matovi;janez jan;mateusz ...,euratom community;european union;european comm...,"-0.840336134453782,2.18487394957983,3.02521008...",,euractiv.com,https://www.euractiv.com/section/energy-enviro...
2,20210325,4,,EDUCATION;WB_470_EDUCATION;WB_482_TERTIARY_EDU...,1#Bangladesh#BG#BG#24#90#BG;1#United Kingdom#U...,mark richards;adrian smith,royal society;careers research;advisory centre...,"0.114285714285714,1.14285714285714,1.028571428...","976419965,976452643,976419965,976452643,976419...",bracknellnews.co.uk;bridgwatermercury.co.uk;as...,https://www.bracknellnews.co.uk/news/national/...
3,20210325,1,"KILL#2014##2#New York, United States#US#USNY#4...",KILL;EDUCATION;SOC_POINTSOFINTEREST;SOC_POINTS...,"2#New York, United States#US#USNY#42.1497#-74....",russell rickford;harry smith;andrew cuomo;svan...,department of community solutions;tompkins cou...,"-2.06185567010309,2.86368843069874,4.925544100...","976466942,976466943,976466963,976467024,976467...",apnews.com,https://apnews.com/article/new-york-ithaca-rac...
4,20210325,1,"AFFECT#2##2#New York, United States#US#USNY#42...",UNGP_FORESTS_RIVERS_OCEANS;AFFECT;SECURITY_SER...,"2#New York, United States#US#USNY#42.1497#-74....",john christopher nelson;jeff dougherty,jamestown police department;york state police ...,"-0.735294117647059,1.47058823529412,2.20588235...",976375445,post-journal.com,https://www.post-journal.com/news/latest-news/...


In [22]:
# Filter columns: drop the ones not needed for later processing
global_2021325 = global_2021325.drop(
    columns=['NUMARTS', 'COUNTS', 'TONE', 'CAMEOEVENTIDS', 'SOURCES', 'SOURCEURLS']
)
global_2021325.head(5)

Unnamed: 0,DATE,THEMES,LOCATIONS,PERSONS,ORGANIZATIONS
0,20210325,TAX_FNCACT;TAX_FNCACT_GUIDE;TAX_WORLDFISH;TAX_...,"2#Minnesota, United States#US#USMN#45.7326#-93...",jeff brown;marty kirchner;red wing;mike mccormick,stoddard fire department;mccormick river guide...
1,20210325,TAX_FNCACT;TAX_FNCACT_LEADERS;TAX_ETHNICITY;TA...,1#Germany#GM#GM#51.5#10.5#GM;1#Hungary#HU#HU#4...,jessica johnson;igor matovi;janez jan;mateusz ...,euratom community;european union;european comm...
2,20210325,EDUCATION;WB_470_EDUCATION;WB_482_TERTIARY_EDU...,1#Bangladesh#BG#BG#24#90#BG;1#United Kingdom#U...,mark richards;adrian smith,royal society;careers research;advisory centre...
3,20210325,KILL;EDUCATION;SOC_POINTSOFINTEREST;SOC_POINTS...,"2#New York, United States#US#USNY#42.1497#-74....",russell rickford;harry smith;andrew cuomo;svan...,department of community solutions;tompkins cou...
4,20210325,UNGP_FORESTS_RIVERS_OCEANS;AFFECT;SECURITY_SER...,"2#New York, United States#US#USNY#42.1497#-74....",john christopher nelson;jeff dougherty,jamestown police department;york state police ...
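The remaining columns still pack several values into one cell. As a hedged sketch, a LOCATIONS cell can be split on `;` into entries and on `#` into fields. The field names below are an assumption: they are borrowed from the GEO_* columns of the gkgcounts file read later in this notebook, whose order appears to match the seven `#`-separated parts. The `parse_locations` helper is hypothetical.

```python
import pandas as pd

# Assumed field order, inferred from the GEO_* columns of the gkgcounts file.
LOCATION_FIELDS = [
    "GEO_TYPE",
    "GEO_FULLNAME",
    "GEO_COUNTRYCODE",
    "GEO_ADM1CODE",
    "GEO_LAT",
    "GEO_LONG",
    "GEO_FEATUREID",
]


def parse_locations(value):
    """Split one LOCATIONS cell into a list of dicts, one per location."""
    if pd.isna(value):
        return []
    records = []
    for entry in value.split(";"):
        parts = entry.split("#")
        if len(parts) != len(LOCATION_FIELDS):
            continue  # skip entries that do not match the assumed layout
        records.append(dict(zip(LOCATION_FIELDS, parts)))
    return records


# Two complete entries taken from the rows shown above.
cell = "1#Germany#GM#GM#51.5#10.5#GM;1#Bangladesh#BG#BG#24#90#BG"
parsed = parse_locations(cell)
print(parsed)
```

All parsed values are strings here. Converting GEO_LAT and GEO_LONG to floats would be a natural follow-up once the layout is confirmed.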


In [37]:
# Let's try another file:

with zipfile.ZipFile("20210325.gkgcounts.csv.zip") as z:
    with z.open("20210325.gkgcounts.csv") as f:
        globalcounts_2021325 = pd.read_csv(f, header=0, delimiter="\t")
        print(globalcounts_2021325.head())    # print the first 5 rows

       DATE  NUMARTS                          COUNTTYPE  NUMBER OBJECTTYPE  \
0  20210325        1                 CRISISLEX_T03_DEAD      25  Americans   
1  20210325        2                               KILL     689        NaN   
2  20210325        1               CRISISLEX_C07_SAFETY      10        NaN   
3  20210325        1  EPU_CATS_MIGRATION_FEAR_MIGRATION      90        NaN   
4  20210325        1                               KILL      14        NaN   

   GEO_TYPE                                     GEO_FULLNAME GEO_COUNTRYCODE  \
0         3  Washington, District Of Columbia, United States              US   
1         2                              Iowa, United States              US   
2         3       Spring Lake, South Carolina, United States              US   
3         1                                    United States              US   
4         4        Chilwell, Nottinghamshire, United Kingdom              UK   

  GEO_ADM1CODE    GEO_LAT  GEO_LONG GEO_FEATUREID 

In [39]:
# Obtain head
globalcounts_2021325.head(5)

Unnamed: 0,DATE,NUMARTS,COUNTTYPE,NUMBER,OBJECTTYPE,GEO_TYPE,GEO_FULLNAME,GEO_COUNTRYCODE,GEO_ADM1CODE,GEO_LAT,GEO_LONG,GEO_FEATUREID,CAMEOEVENTIDS,SOURCES,SOURCEURLS
0,20210325,1,CRISISLEX_T03_DEAD,25,Americans,3,"Washington, District Of Columbia, United States",US,USDC,38.8951,-77.0364,531871,"976480217,976480335,976349267,976331594,976425...",theepochtimes.com,https://www.theepochtimes.com/gohmert-pelosis-...
1,20210325,2,KILL,689,,2,"Iowa, United States",US,USIA,42.0046,-93.214,IA,"976442782,976442783,976528776,976528785,976528...",ottumwacourier.com;ottumwacourier.com,https://www.ottumwacourier.com/news/reynolds-d...
2,20210325,1,CRISISLEX_C07_SAFETY,10,,3,"Spring Lake, South Carolina, United States",US,USSC,34.2746,-80.7348,1235213,"976402281,976402282,976402283,976402284,976334...",sanfordherald.com,https://www.sanfordherald.com/news/police-beat...
3,20210325,1,EPU_CATS_MIGRATION_FEAR_MIGRATION,90,,1,United States,US,US,39.828175,-98.5795,US,"976333943,976331763,976331822",live5news.com,https://www.live5news.com/2021/03/25/graham-vi...
4,20210325,1,KILL,14,,4,"Chilwell, Nottinghamshire, United Kingdom",UK,UKJ9,52.9333,-1.23333,-2592455,,nottinghampost.com,https://www.nottinghampost.com/news/local-news...


In [41]:
# I wanted to see how many COUNTTYPE categories exist (unique values).
len(globalcounts_2021325["COUNTTYPE"].unique())

41
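One detail worth keeping in mind, shown on a made-up Series: `len(series.unique())` counts a missing value as its own category, whereas `series.nunique()` excludes missing values by default. `value_counts()` additionally shows how often each category occurs.

```python
import pandas as pd

# Made-up COUNTTYPE values, including one missing entry.
count_types = pd.Series(["KILL", "AFFECT", "KILL", None, "CRISISLEX_T03_DEAD"])

n_with_missing = len(count_types.unique())  # the missing value counts as a category
n_without_missing = count_types.nunique()   # missing values are excluded by default
top_types = count_types.value_counts()      # frequency of each non-missing category

print(n_with_missing, n_without_missing)
print(top_types)
```

If COUNTTYPE has no missing values, both counts agree.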

### There are 41 unique COUNTTYPE values in the second data stream (the gkgcounts file).
#### - This is useful if we want to do a different type of analysis.
### Q: What if we want to predict the range of values for a given category or country code?
#### - We can select "DATE", "COUNTTYPE", and "GEO_COUNTRYCODE" as input features for our predictive model.
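Before building a model, a simple empirical baseline could be the observed minimum and maximum per category and country code. The sketch below assumes that NUMBER is the value whose range we care about, and it uses a made-up DataFrame with the same column names as the gkgcounts file.

```python
import pandas as pd

# Made-up rows using the gkgcounts column names.
counts = pd.DataFrame(
    {
        "COUNTTYPE": ["KILL", "KILL", "AFFECT", "KILL"],
        "GEO_COUNTRYCODE": ["US", "US", "US", "UK"],
        "NUMBER": [689, 14, 2, 14],
    }
)

# Observed range of NUMBER for each (category, country code) pair.
ranges = (
    counts.groupby(["COUNTTYPE", "GEO_COUNTRYCODE"])["NUMBER"]
    .agg(["min", "max", "count"])
    .reset_index()
)
print(ranges)
```

This is a descriptive summary, not a predictive model, but it gives a reference point against which any later predictions can be compared.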

In [42]:
# Number of rows in the second data stream sample dataset (26252)

len(globalcounts_2021325)

26252
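Step 4 of the outline calls for a simple matplotlib plot. A minimal sketch, assuming a bar chart of rows per COUNTTYPE is a reasonable first view, is shown below on made-up values. With the real data, `globalcounts_2021325["COUNTTYPE"]` would take the place of the sample column.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up COUNTTYPE values standing in for the real column.
counts = pd.DataFrame(
    {
        "COUNTTYPE": [
            "KILL", "KILL", "AFFECT", "KILL", "CRISISLEX_T03_DEAD", "AFFECT",
        ]
    }
)

# Rows per category, most frequent first.
type_counts = counts["COUNTTYPE"].value_counts()

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(type_counts.index.tolist(), type_counts.values)
ax.set_xlabel("COUNTTYPE")
ax.set_ylabel("Number of rows")
ax.set_title("Rows per COUNTTYPE (made-up sample)")
ax.tick_params(axis="x", rotation=45)  # long category names overlap otherwise
fig.tight_layout()
```

With 41 categories in the real file, plotting only the most frequent ones (for example `type_counts.head(10)`) would likely be easier to read.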