# Which place made news the most in the first week of January 2017?

The purpose of this project is to identify which town, city, or country was mentioned most often in The New York Times from January 1st through January 7th, 2017.

I use the NYTimes Article Search API to retrieve details about each article, including its URL, and then use those URLs to download the full text of the articles.

In [1]:
#Importing libraries
from nytimesarticle import articleAPI
import time

In [2]:
#nytimes_api must hold your NYTimes API key as a string (it is not defined in this notebook)
api = articleAPI(nytimes_api)

### Each API call returns one page of the results. Each page contains details of 10 articles.

### Due to the NYTimes API call limits, we will restrict our results to 50 articles for this project.

In [3]:
def articlesForAMonth_50(start,end):
    '''Function that makes API calls to extract details of articles'''
    #Local list, so that repeated calls do not accumulate pages from earlier calls
    article_details = []
    #5 pages = 50 articles
    for i in range(0,5):
        article_details.append(api.search(fq = {'source':'The New York Times'}, begin_date = start, end_date = end, page=i))
        #To avoid per second api call limit
        time.sleep(2)
    return article_details
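The paging arithmetic above (10 articles per page, so 5 pages for 50 articles) can be tried without touching the API. In the minimal sketch below, `fetch_pages` and `fake_search` are hypothetical names that are not part of the nytimesarticle package; `fake_search` merely stands in for `api.search` so the loop can run offline.

```python
import time

def fetch_pages(search, n_pages, pause=0.0):
    '''Call search(page=i) for pages 0..n_pages-1 and collect the responses.'''
    pages = []
    for i in range(n_pages):
        pages.append(search(page=i))
        # Pause between calls to stay under a per-second rate limit
        time.sleep(pause)
    return pages

def fake_search(page):
    '''Hypothetical stand-in for api.search: returns 10 dummy docs per page.'''
    docs = [{'_id': page * 10 + k} for k in range(10)]
    return {'response': {'docs': docs}}

pages = fetch_pages(fake_search, n_pages=5)
n_articles = sum(len(p['response']['docs']) for p in pages)
print(n_articles)
```

With a real client, `pause` would be set to a couple of seconds, as in the function above.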

In [4]:
#Extracting articles for the first week of January 2017 (dates in YYYYMMDD format)
january17_fifty = articlesForAMonth_50(20170101, 20170107)
len(january17_fifty)

5

### Thus we have verified that 5 pages of results, i.e. details for 50 articles, have been extracted successfully.

### The following function 'parse_articles' takes in the list of article details and parses it into a list of dictionaries, one per article, with details such as
* Article ID
* Article Headline
* Article Type
* Publication Date
* Web URL
* Locations mentioned (using Keywords)
* Subject of the article

In [5]:
def parse_articles(articles_list):
    '''
    This function takes in a response to the NYT api and parses
    the articles into a list of dictionaries
    '''
    news = []
    
    for page in articles_list:
        for i in page['response']['docs']:
            dic = {}
            dic['id'] = i['_id']
            dic['headline'] = i['headline']['main'].encode("utf8")
            dic['date'] = i['pub_date'][0:10] # cutting time of day.
            dic['type'] = i['type_of_material']
            dic['url'] = i['web_url']
            # locations
            locations = []
            for x in range(0,len(i['keywords'])):
                if 'glocations' in i['keywords'][x]['name']:
                    locations.append(i['keywords'][x]['value'])
            dic['locations'] = locations
            # subject
            subjects = []
            for x in range(0,len(i['keywords'])):
                if 'subject' in i['keywords'][x]['name']:
                    subjects.append(i['keywords'][x]['value'])
            dic['subjects'] = subjects   
            news.append(dic)
    return(news)
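The keyword filtering inside 'parse_articles' can be tried on a single hand-made record. The sketch below assumes, as the function above does, that each keyword is a dictionary with 'name' and 'value' entries. `keyword_values` and `sample_doc` are made-up names for illustration, and the record is not real API output. It uses an exact match on the keyword name rather than the substring test used above.

```python
def keyword_values(doc, kind):
    '''Return the values of all keywords in doc whose name equals kind.'''
    return [kw['value'] for kw in doc['keywords'] if kw['name'] == kind]

# Made-up record in the shape that parse_articles expects
sample_doc = {
    'keywords': [
        {'name': 'glocations', 'value': 'Russia'},
        {'name': 'subject', 'value': 'Cyberwarfare and Defense'},
        {'name': 'glocations', 'value': 'United States'},
    ]
}

locations = keyword_values(sample_doc, 'glocations')
subjects = keyword_values(sample_doc, 'subject')
print(locations)
print(subjects)
```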

In [6]:
jan50News = parse_articles(january17_fifty)
len(jan50News)

50

Thus we have verified that details for all 50 articles have been parsed.

### Collecting the locations from the parsed articles into a single list

In [7]:
temp = [article['locations'] for article in jan50News]
#The initial value [] keeps reduce from failing if no articles were parsed
places = reduce(lambda x,y : x+y, temp, [])
places = [element.encode('ascii', 'ignore') for element in places]
places

['Russia',
 'Russia',
 'Afghanistan',
 'Russia',
 'Richmond (Va)',
 'Nashville (Tenn)',
 'Tennessee',
 'Washington (State)',
 'United States',
 'Russia',
 'Walden Pond (Concord, Mass)',
 'New York City',
 'Texas',
 'North Carolina',
 'Fort Lauderdale (Fla)',
 'Puerto Rico',
 'Anchorage (Alaska)',
 'New York State',
 'Missouri',
 'Ferguson (Mo)',
 'Portugal',
 'Alabama',
 'Russia',
 'Azaz (Syria)',
 'Syria',
 'Kilis (Turkey)',
 'Ivory Coast',
 'Russia',
 'California',
 'China',
 'Clinton Hill (Brooklyn, NY)',
 'China',
 'Brazil',
 'Cuiaba (Brazil)',
 'China',
 'Fort Lauderdale (Fla)']
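Flattening the per-article lists and tallying repeated keywords can also be done with the standard library. The sketch below is only an illustration: `nested` is a made-up grouping that reuses a few names from the output above, since the real per-article grouping is not shown.

```python
from collections import Counter
from itertools import chain

# Made-up per-article keyword lists (the real grouping is not shown above)
nested = [['Russia'], [], ['Russia', 'Afghanistan'], ['China'], ['Russia', 'China']]

# chain.from_iterable flattens one level of nesting, like the reduce call above
flat = list(chain.from_iterable(nested))

# Counter tallies how often each keyword occurs
keyword_counts = Counter(flat)
print(keyword_counts.most_common(2))
```

Note that this counts how many articles carry a keyword, which is a different measure from how often a name appears in the article text.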

### Extracting article contents

In [8]:
#Importing necessary packages
import requests
from bs4 import BeautifulSoup

In [9]:
#Extracting the text in a list 'all_text'
all_text = []
for i in range(0,50):
    url = jan50News[i]['url'].encode('ascii', 'ignore')
    webpage = requests.get(url)
    soup = BeautifulSoup(webpage.content, 'html.parser')
    for paragraph in soup.find_all('p', class_='story-body-text'):
        temp = (paragraph.text).encode('ascii', 'ignore')
        all_text.append(temp)

In [10]:
#Getting total number of paragraphs
print 'Number of paragraphs found in the extract: ' +str(len(all_text))

Number of paragraphs found in the extract: 1148


In [11]:
#Saving file for future use; the with-block closes the file once writing is done
with open('nyTimes_Jan50articles.txt', 'w') as myFile:
    for item in all_text:
        myFile.write("%s\n" % item)

In [12]:
#Converting the list to a string for further use
textString = ''.join(all_text)

In [13]:
#with open('nyTimes_Jan50articles.txt', 'r') as myfile:
#    textString = myfile.read().replace('\n', '') 
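The save and reload steps can be checked with a small round trip. The sketch below uses made-up sample paragraphs and a temporary file (both assumptions for illustration) to show that reading the file back and dropping the newlines gives the same string as joining the list directly.

```python
import os
import tempfile

paragraphs = ['First paragraph.', 'Second paragraph.']  # made-up sample data

path = os.path.join(tempfile.mkdtemp(), 'sample_articles.txt')

# Write one paragraph per line; the with-block closes the file on exit
with open(path, 'w') as f:
    for item in paragraphs:
        f.write('%s\n' % item)

# Read the file back and drop the newlines, as in the commented-out cell
with open(path, 'r') as f:
    restored = f.read().replace('\n', '')

print(restored == ''.join(paragraphs))
```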

### A close look at our list of 'places' reveals a few entries whose format could affect the accuracy of the results.

#### Example 1:

'Richmond (Va)' and 'Nashville (Tenn)' are attached to the articles as keywords, but they may not recur in that exact format in the text of those or other articles. The abbreviated state names must be removed to increase the chances of finding those cities within the collected corpus.

#### Example 2:
'Walden Pond (Concord, Mass)' and 'Clinton Hill (Brooklyn, NY)', though precise, may not be ideal to plot on a map set to a global scale. It may also be easier to find instances of 'Brooklyn' than of 'Clinton Hill' within the corpus.
Hence, this granularity needs to be reduced for the sake of visualizing the results.

#### Example 3:
This list also contains duplicate entries (e.g. Russia). The duplicates need to be eliminated too.


### Since our list of 'places' contains only 36 values, I will edit it manually.

In [14]:
places_edit = ['Russia', 'Afghanistan', 'Richmond', 'Nashville', 'Tennessee', 'Washington', 'United States', 'Concord',
               'New York City', 'Texas', 'North Carolina', 'Fort Lauderdale', 'Puerto Rico', 'Anchorage', 'New York State',
                'Missouri', 'Ferguson', 'Portugal', 'Alabama', 'Azaz', 'Syria', 'Kilis', 'Ivory Coast', 'California', 
                'China', 'Brooklyn', 'Brazil', 'Cuiaba']
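Part of the manual edit above could be automated. The sketch below is one possible approach, not the method used in this notebook: the hypothetical helper `simplify` strips a trailing parenthetical qualifier and `unique_in_order` drops duplicates while keeping the original order. It does not cover Example 2, where the wanted name sits inside the parentheses ('Walden Pond (Concord, Mass)' would become 'Walden Pond'), so those entries would still need a manual mapping.

```python
import re

def simplify(place):
    '''Drop a trailing parenthetical qualifier, e.g. "Richmond (Va)" -> "Richmond".'''
    return re.sub(r'\s*\(.*\)\s*$', '', place)

def unique_in_order(items):
    '''Remove duplicates while keeping first-seen order.'''
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

# A few entries taken from the 'places' output above
sample = ['Russia', 'Russia', 'Richmond (Va)', 'Nashville (Tenn)',
          'Washington (State)', 'Fort Lauderdale (Fla)', 'Fort Lauderdale (Fla)']
cleaned = unique_in_order(simplify(p) for p in sample)
print(cleaned)
```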

### We now use the variable 'textString', which contains the text of all the articles, to get a frequency count for each place in the list 'places_edit'

In [15]:
#Generating list of frequencies
places_count = [textString.count(place) for place in places_edit]
places_count

[108,
 8,
 10,
 3,
 2,
 25,
 66,
 0,
 8,
 15,
 3,
 8,
 6,
 5,
 2,
 3,
 5,
 8,
 8,
 5,
 14,
 1,
 5,
 16,
 31,
 10,
 26,
 3]
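One caveat about these numbers: `str.count` counts raw substrings, so 'Russia' is also counted inside 'Russian' and 'China' inside 'Chinatown'. Whether such matches should count is a judgment call. The sketch below shows a whole-word alternative based on regular expressions; `count_whole_word` and `sample_text` are made-up names for illustration, and this is not the method used for the counts above.

```python
import re

def count_whole_word(text, name):
    '''Count occurrences of name that are not part of a longer word.'''
    pattern = r'\b' + re.escape(name) + r'\b'
    return len(re.findall(pattern, text))

sample_text = 'Russian officials met in Russia. Talks on Russia continue in Chinatown.'

substring_count = sample_text.count('Russia')              # also matches "Russian"
whole_word_count = count_whole_word(sample_text, 'Russia')  # matches "Russia" only
print(substring_count)
print(whole_word_count)
```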

### In order to plot the locations on a map, we will need to derive their latitude-longitude.

In [16]:
#Import library for geocoding locations
import geocoder

In [17]:
#Generating list of [latitude,longitude] for each place
lat_lon = [geocoder.google(place).latlng for place in places_edit]
lat_lon

[[61.52401, 105.318756],
 [33.93911, 67.709953],
 [37.5407246, -77.4360481],
 [36.1626638, -86.7816016],
 [35.5174913, -86.5804473],
 [47.7510741, -120.7401386],
 [37.09024, -95.712891],
 [42.4603719, -71.3489484],
 [40.7127837, -74.0059413],
 [31.9685988, -99.9018131],
 [35.7595731, -79.01929969999999],
 [26.1224386, -80.13731740000001],
 [18.220833, -66.590149],
 [61.2180556, -149.9002778],
 [43.2994285, -74.21793260000001],
 [37.9642529, -91.8318334],
 [38.7442175, -90.3053915],
 [39.39987199999999, -8.224454],
 [32.3182314, -86.902298],
 [36.5868261, 37.0480843],
 [34.80207499999999, 38.996815],
 [36.716477, 37.114661],
 [7.539988999999999, -5.547079999999999],
 [36.778261, -119.4179324],
 [35.86166, 104.195397],
 [40.6781784, -73.9441579],
 [-14.235004, -51.92528],
 [-15.6014109, -56.0978917]]

In [18]:
#Separating them into 2 lists 'latitude' and 'longitude'
latitude = [element[0] for element in lat_lon]
longitude = [element[1] for element in lat_lon]
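The split above assumes every lookup returned a coordinate pair. If a lookup fails, the geocoder may return no coordinates (assumed here to be None or an empty list), and indexing would raise an error. A minimal defensive sketch, with the hypothetical helper `split_lat_lon`:

```python
def split_lat_lon(pairs):
    '''Split [lat, lon] pairs into two lists; missing pairs become None.'''
    lats, lons = [], []
    for pair in pairs:
        if pair:  # skips None and empty lists
            lats.append(pair[0])
            lons.append(pair[1])
        else:
            lats.append(None)
            lons.append(None)
    return lats, lons

# The first two pairs are from the output above; the None entry is a made-up failed lookup
sample = [[61.52401, 105.318756], [33.93911, 67.709953], None]
lats, lons = split_lat_lon(sample)
print(lats)
print(lons)
```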

### We will now combine all the necessary lists as a Pandas dataframe

In [19]:
import pandas as pd

results = pd.DataFrame(
    {'places': places_edit,
     'count': places_count,
     'latitude': latitude,
     'longitude': longitude
    })
results.head()

| | count | latitude | longitude | places |
|---|---|---|---|---|
| 0 | 108 | 61.52401 | 105.318756 | Russia |
| 1 | 8 | 33.93911 | 67.709953 | Afghanistan |
| 2 | 10 | 37.540725 | -77.436048 | Richmond |
| 3 | 3 | 36.162664 | -86.781602 | Nashville |
| 4 | 2 | 35.517491 | -86.580447 | Tennessee |
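The question in the title can be answered by sorting the places by their counts. The sketch below uses plain Python lists copied from the cells above rather than the dataframe, so it runs without pandas; `ranked` is a name introduced here for illustration.

```python
# Lists copied from the 'places_edit' and 'places_count' cells above
places_edit = ['Russia', 'Afghanistan', 'Richmond', 'Nashville', 'Tennessee', 'Washington',
               'United States', 'Concord', 'New York City', 'Texas', 'North Carolina',
               'Fort Lauderdale', 'Puerto Rico', 'Anchorage', 'New York State', 'Missouri',
               'Ferguson', 'Portugal', 'Alabama', 'Azaz', 'Syria', 'Kilis', 'Ivory Coast',
               'California', 'China', 'Brooklyn', 'Brazil', 'Cuiaba']
places_count = [108, 8, 10, 3, 2, 25, 66, 0, 8, 15, 3, 8, 6, 5, 2, 3, 5, 8, 8, 5, 14, 1,
                5, 16, 31, 10, 26, 3]

# Pair each place with its count and sort by count, largest first
ranked = sorted(zip(places_edit, places_count), key=lambda pair: pair[1], reverse=True)

for place, count in ranked[:5]:
    print('%s: %d' % (place, count))
```

On these counts, Russia comes first with 108 mentions, followed by the United States (66) and China (31).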


### We will now use the Python package 'folium' to generate a web map for our data

In [20]:
#Importing library for plotting webmaps
import folium

In [21]:
mymap = folium.Map(location=[33.642063, -13.535156], zoom_start=2)
results.apply(lambda row:folium.CircleMarker(location=[row["latitude"], row["longitude"]], radius=6, popup=(str(row['places'])+' appeared '+str(row['count'])+' times in NYTimes articles'),
                                            fill_color = '#ff0000', fill_opacity = 0.5).add_to(mymap), axis=1)
mymap.save('mymap.html')
mymap
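All markers above share a radius of 6, so the counts are visible only in the popups. One option, sketched here as an assumption rather than as part of the original map, is to let the radius grow with the square root of the count so that large counts do not swamp the map. `marker_radius`, `base` and `scale` are hypothetical names; the returned value could be passed as the `radius` argument of `folium.CircleMarker`.

```python
import math

def marker_radius(count, base=4.0, scale=1.5):
    '''Hypothetical helper: radius grows with the square root of the count.'''
    return base + scale * math.sqrt(count)

# Counts taken from the results above
for place, count in [('Russia', 108), ('Afghanistan', 8), ('Concord', 0)]:
    print('%s -> radius %.1f' % (place, marker_radius(count)))
```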