# Client Project - Problem 2: Leveraging News and Media for Situational Awareness
## Part 1 - Data Gathering
## Problem Statement
When disaster strikes, it is critically important to provide the most relevant information to first responders and the general public. During a disaster, people are inundated with a barrage of news sources, resulting in an environment of confusion and misinformation. Currently, there is no central place to find relevant news sources for a specific disaster. The goal of this project is to deliver a public website where users can find relevant information and get the key facts during a disaster.

## Executive Summary
1. Data Gathering
2. Data Cleaning and Preprocessing
3. Combined Data Tokenized Analysis and Individual Page Content Creation
4. Word2Vec Search Engine, Recommender and Home Page Content Creation

### Importing Libraries

In [4]:
import requests 
from bs4 import BeautifulSoup
import pprint
import pandas as pd
import numpy as np 

### Webscraping News API 

In [4]:
url = 'https://newsapi.org/v2/everything?'

Enter your News API key before running the following cells.

In [5]:
api_key = "Put your API key here"

#### Wildfire news 10/29

In [238]:
parameters_10_29 = {
    'q': 'Wildfire',
    'pageSize': 100,
    'apiKey': api_key,
    'language': 'en',
    'from' : '2019-10-29'
    
}
response_10_29 = requests.get(url, params = parameters_10_29)
response_json_10_29 = response_10_29.json()
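
Before reading `'articles'` from a response, it can help to confirm that the request succeeded. The sketch below is a hypothetical helper, not part of the original notebook. It assumes the News API envelope in which `'status'` is `'ok'` on success and `'error'` (with `'code'` and `'message'` fields) on failure. The payloads are hand-written stand-ins for `response.json()`, so no network call is made.

```python
def check_newsapi_payload(payload):
    """Return the article list from a decoded News API response, or raise.

    Assumption: 'status' is 'ok' on success; on failure it is 'error'
    and the payload carries 'code' and 'message' fields.
    """
    if payload.get('status') != 'ok':
        raise RuntimeError(
            f"News API error {payload.get('code')}: {payload.get('message')}"
        )
    return payload.get('articles', [])


# Hand-written payloads standing in for response.json().
ok_payload = {'status': 'ok', 'totalResults': 1, 'articles': [{'title': 'Sample'}]}
bad_payload = {'status': 'error', 'code': 'apiKeyInvalid', 'message': 'Bad key'}

articles = check_newsapi_payload(ok_payload)
```

Calling the helper on `bad_payload` raises a `RuntimeError` instead of failing later with a `KeyError` on `'articles'`.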

In [273]:
#response_json_10_29

#### Wildfire news 10/26

In [244]:
parameters_10_26 = {
    'q': 'Wildfire',
    'pageSize': 100,
    'apiKey': api_key,
    'language': 'en',
    'from' : '2019-10-26',
    
    
}
response_10_26 = requests.get(url, params = parameters_10_26)
response_json_10_26 = response_10_26.json()

In [272]:
#response_json_10_26

#### Wildfire news 10/23

In [246]:
parameters_10_23 = {
    'q': 'Wildfire',
    'pageSize': 100,
    'apiKey': api_key,
    'language': 'en',
    'from' : '2019-10-23',
    
    
}
response_10_23 = requests.get(url, params = parameters_10_23)
response_json_10_23 = response_10_23.json()

In [271]:
#response_json_10_23

#### Wildfire news 10/20

In [249]:
parameters_10_20 = {
    'q': 'Wildfire',
    'pageSize': 100,
    'apiKey': api_key,
    'language': 'en',
    'from' : '2019-10-20',
   
    
}
response_10_20 = requests.get(url, params = parameters_10_20)
response_json_10_20 = response_10_20.json()

In [270]:
#response_json_10_20

#### Wildfire news 10/17

In [265]:
parameters_10_17 = {
    'q': 'Wildfire',
    'pageSize': 100,
    'apiKey': api_key,
    'language': 'en',
    'from' : '2019-10-17',
   
    
}
response_10_17 = requests.get(url, params = parameters_10_17)
response_json_10_17 = response_10_17.json()

In [2]:
response_json_10_17
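
The wildfire requests in this section differ only in their `'from'` date, so the parameter dictionaries could be generated rather than copied. This is a minimal sketch; `build_wildfire_params` is a hypothetical helper, not part of the original notebook, and the date list is taken from the section headings.

```python
def build_wildfire_params(from_date, api_key):
    """Build the query parameters for one wildfire request (hypothetical helper)."""
    return {
        'q': 'Wildfire',
        'pageSize': 100,
        'apiKey': api_key,
        'language': 'en',
        'from': from_date,
    }


# Dates taken from the section headings (10/29 back to 10/17).
from_dates = ['2019-10-29', '2019-10-26', '2019-10-23', '2019-10-20', '2019-10-17']

params_by_date = {
    date: build_wildfire_params(date, 'Put your API key here')
    for date in from_dates
}
```

Each value in `params_by_date` can then be passed as `params` to `requests.get(url, ...)` in the same way as the cells in this section.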

Because the Kincade wildfire was burning while this project was under way, most of the news we were able to scrape relates to wildfires.

#### Flood

In [None]:
flood_parameters = {
    'q': 'flood',
    'pageSize': 100,
    'apiKey': api_key,
    'language': 'en'
}
flood_response = requests.get(url, params = flood_parameters)
flood_response_json = flood_response.json()

#### Tornado

In [None]:
torn_parameters = {
    'q': 'tornado',
    'pageSize': 100,
    'apiKey': api_key,
    'language': 'en'
}
torn_response = requests.get(url, params = torn_parameters)
torn_response_json = torn_response.json()

#### Earthquake

In [None]:
earth_parameters = {
    'q': 'earthquake',
    'pageSize': 100,
    'apiKey': api_key,
    'language': 'en'
}
earth_response = requests.get(url, params = earth_parameters)
earth_response_json = earth_response.json()

#### Hurricane

In [None]:
hurr_parameters = {
    'q': 'hurricane',
    'pageSize': 100,
    'apiKey': api_key,
    'language': 'en'
}
hurr_response = requests.get(url, params = hurr_parameters)
hurr_response_json = hurr_response.json()

#### Blizzard

In [None]:
blizz_parameters = {
    'q': 'blizzard',
    'pageSize': 100,
    'apiKey': api_key,
    'language': 'en'
}
blizz_response = requests.get(url, params = blizz_parameters)
blizz_response_json = blizz_response.json()

### Webscraping cleaning 

#### Seeing the keys we need to access the data 

In [251]:
response_json_10_29.keys()

dict_keys(['status', 'totalResults', 'articles'])

#### Converting the JSON file to an organized readable format 

In [275]:
wildfire_articles_10_29 = response_json_10_29['articles']
wildfire_articles_10_26 = response_json_10_26['articles']
wildfire_articles_10_23 = response_json_10_23['articles']
wildfire_articles_10_20 = response_json_10_20['articles']
wildfire_articles_10_17 = response_json_10_17['articles']
flood_articles = flood_response_json["articles"]
torn_articles = torn_response_json["articles"]
earth_articles = earth_response_json["articles"]
hurr_articles = hurr_response_json["articles"]
blizz_articles = blizz_response_json["articles"]

#### This function extracts the relevant information from the list of article dictionaries

In [223]:
def get_articles(file):
    # `file` is the list of article dictionaries stored under the 'articles' key
    article_results = []

    for article in file:
        # Keep only the fields used downstream, renaming two of them
        article_dict = {}
        article_dict['title'] = article['title']
        article_dict['author'] = article['author']
        article_dict['source'] = article['source']
        article_dict['description'] = article['description']
        article_dict['content'] = article['content']
        article_dict['pub_date'] = article['publishedAt']
        article_dict['url'] = article['url']
        article_dict['photo_url'] = article['urlToImage']

        article_results.append(article_dict)
    return article_results
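
`get_articles` indexes every field directly, so an article that lacks one of the keys would raise a `KeyError`. As a hedged alternative, the sketch below uses `dict.get`, which fills missing fields with `None`. `FIELD_MAP` and `get_articles_safe` are hypothetical names, and the sample article is made up for illustration.

```python
# Output column name -> key in the raw article dictionary.
FIELD_MAP = {
    'title': 'title',
    'author': 'author',
    'source': 'source',
    'description': 'description',
    'content': 'content',
    'pub_date': 'publishedAt',
    'url': 'url',
    'photo_url': 'urlToImage',
}


def get_articles_safe(articles):
    """Same fields as get_articles, but missing keys become None."""
    return [
        {new_key: article.get(raw_key) for new_key, raw_key in FIELD_MAP.items()}
        for article in articles
    ]


# Made-up article that omits several fields.
sample = [{'title': 'Sample', 'publishedAt': '2019-10-29T14:05:00Z'}]
rows = get_articles_safe(sample)
```

The resulting list can be passed to `pd.DataFrame` exactly like the output of `get_articles`.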
        
    

#### Converting the wildfire scrape into customized dataframes 

In [276]:
wildfire_df_10_29 = pd.DataFrame(get_articles(wildfire_articles_10_29))
wildfire_df_10_26 = pd.DataFrame(get_articles(wildfire_articles_10_26))
wildfire_df_10_23 = pd.DataFrame(get_articles(wildfire_articles_10_23))
wildfire_df_10_20 = pd.DataFrame(get_articles(wildfire_articles_10_20))
wildfire_df_10_17 = pd.DataFrame(get_articles(wildfire_articles_10_17))
flood_df = pd.DataFrame(get_articles(flood_articles))
torn_df = pd.DataFrame(get_articles(torn_articles))
earth_df = pd.DataFrame(get_articles(earth_articles))
hurr_df = pd.DataFrame(get_articles(hurr_articles))
blizz_df = pd.DataFrame(get_articles(blizz_articles))

#### Checking the shape of all the dataframes 

In [277]:
print(wildfire_df_10_29.shape)
print(wildfire_df_10_26.shape)
print(wildfire_df_10_23.shape)
print(wildfire_df_10_20.shape)
print(wildfire_df_10_17.shape)
print(flood_df.shape)
print(torn_df.shape)
print(earth_df.shape)
print(hurr_df.shape)
print(blizz_df.shape)

(100, 8)
(100, 8)
(100, 8)
(100, 8)
(100, 8)


#### This function extracts the media source name from the dictionary-valued column "source".

In [230]:
def source_getter(df):
    
    source = []
    for source_dict in df['source']:
        source.append(source_dict['name'])
   
    df['source'] = source
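
`source_getter` overwrites the column in place, so running it a second time on the same dataframe would fail because the values are no longer dictionaries. A minimal sketch of a re-runnable variant follows; `extract_source_name` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd


def extract_source_name(value):
    """Return the 'name' field of a source dict; pass other values through."""
    if isinstance(value, dict):
        return value.get('name')
    return value


# Made-up rows shaped like the 'source' column before cleaning.
sample_df = pd.DataFrame({
    'source': [
        {'id': 'cnn', 'name': 'CNN'},
        {'id': None, 'name': 'Example News'},
    ]
})

sample_df['source'] = sample_df['source'].map(extract_source_name)
# A second pass leaves the already-extracted strings unchanged.
sample_df['source'] = sample_df['source'].map(extract_source_name)
```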

In [278]:
source_getter(wildfire_df_10_29)
source_getter(wildfire_df_10_26)
source_getter(wildfire_df_10_23)
source_getter(wildfire_df_10_20)
source_getter(wildfire_df_10_17)
source_getter(flood_df)
source_getter(torn_df)
source_getter(earth_df)
source_getter(hurr_df)
source_getter(blizz_df)

#### This lambda function converts the publication timestamp into a more readable date

In [282]:
wildfire_df_10_29['pub_date'] = pd.to_datetime(wildfire_df_10_29['pub_date']).apply(lambda x: x.date())
wildfire_df_10_26['pub_date'] = pd.to_datetime(wildfire_df_10_26['pub_date']).apply(lambda x: x.date())
wildfire_df_10_23['pub_date'] = pd.to_datetime(wildfire_df_10_23['pub_date']).apply(lambda x: x.date())
wildfire_df_10_20['pub_date'] = pd.to_datetime(wildfire_df_10_20['pub_date']).apply(lambda x: x.date())
wildfire_df_10_17['pub_date'] = pd.to_datetime(wildfire_df_10_17['pub_date']).apply(lambda x: x.date())
flood_df['pub_date'] = pd.to_datetime(flood_df['pub_date']).apply(lambda x: x.date())
torn_df['pub_date'] = pd.to_datetime(torn_df['pub_date']).apply(lambda x: x.date())
earth_df['pub_date'] = pd.to_datetime(earth_df['pub_date']).apply(lambda x: x.date())
hurr_df['pub_date'] = pd.to_datetime(hurr_df['pub_date']).apply(lambda x: x.date())
blizz_df['pub_date'] = pd.to_datetime(blizz_df['pub_date']).apply(lambda x: x.date())
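
The same conversion can be written once and applied in a loop, using the `.dt.date` accessor in place of the lambda. This is a sketch on two made-up frames; in the notebook the tuple would hold the ten dataframes above.

```python
import datetime

import pandas as pd

# Made-up frames with ISO 8601 timestamps in the format shown in 'publishedAt'.
df_a = pd.DataFrame({'pub_date': ['2019-10-29T14:05:00Z']})
df_b = pd.DataFrame({'pub_date': ['2019-10-26T03:30:00Z']})

for df in (df_a, df_b):
    # Parse the timestamp strings, then keep only the calendar date.
    df['pub_date'] = pd.to_datetime(df['pub_date']).dt.date
```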

#### Combining all of the wildfire articles into one dataframe

In [10]:
wildfire_df_pt1 = pd.concat([wildfire_df_10_29,wildfire_df_10_26, 
                                        wildfire_df_10_23, wildfire_df_10_20,
                                        wildfire_df_10_17])
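
Each wildfire request sets only a `'from'` date, so the date windows overlap and the same article may appear in more than one response. The sketch below shows one way to combine frames while dropping repeats. It assumes `url` identifies an article uniquely, which the original notebook does not state, and it uses made-up rows.

```python
import pandas as pd

# Made-up frames that share one article (title 'B').
newer = pd.DataFrame({
    'title': ['A', 'B'],
    'url': ['https://example.com/a', 'https://example.com/b'],
})
older = pd.DataFrame({
    'title': ['B', 'C'],
    'url': ['https://example.com/b', 'https://example.com/c'],
})

combined = (
    pd.concat([newer, older], ignore_index=True)
    .drop_duplicates(subset='url')   # keep the first occurrence of each URL
    .reset_index(drop=True)
)
```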

In [8]:
wildfire_df_pt1.head(3)

In [5]:
pd.set_option('display.max_colwidth', None)

In [None]:
wildfire_df_pt1

### Converting Dataframe into CSV

In [305]:
wildfire_df_pt1.to_csv('1.raw_data/teamNJV_wildfire_df.csv', index = False)

In [None]:
flood_df.to_csv('1.raw_data/teamNJV_flood.csv', index = False)
torn_df.to_csv('1.raw_data/teamNJV_tornado.csv', index = False)
earth_df.to_csv('1.raw_data/teamNJV_earthquake.csv', index = False)
hurr_df.to_csv('1.raw_data/teamNJV_hurricane.csv', index = False)
blizz_df.to_csv('1.raw_data/teamNJV_blizzard.csv', index = False)
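
`to_csv` does not create missing directories, so the export fails if `1.raw_data/` does not exist yet. A minimal sketch of a helper that creates the folder first and writes every frame in one loop is shown below. `save_frames` is a hypothetical name, and the demo writes to a temporary directory so that nothing in the project folder is touched.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_frames(frames, out_dir):
    """Write each dataframe to out_dir as teamNJV_<name>.csv and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    written = []
    for name, df in frames.items():
        target = out_path / f'teamNJV_{name}.csv'
        df.to_csv(target, index=False)
        written.append(target)
    return written


# Demo with a made-up frame, written under a temporary directory.
demo_dir = Path(tempfile.mkdtemp()) / '1.raw_data'
written = save_frames({'flood': pd.DataFrame({'title': ['Sample']})}, demo_dir)
```

In the notebook, `frames` would map names such as `'flood'` and `'tornado'` to `flood_df` and `torn_df`.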