# Project 1

| Name             | PID           |
|------------------|---------------|
| Peter Murphy     | `petermurphy` |
| Joseph McAlister | `josephrm`    |

We have neither given nor received unauthorized assistance on this assignment. See the course syllabus for details on the Honor Code policy. In particular, sharing lines of solution code is prohibited.

# Q1: Qualitative

## Q1.1: Introduction / Initial Problem Statement
Our goal is to find relationships between suicide rates and country, age demographic, and year, using a Kaggle data set that spans 1985 - 2016.

Specifically, some questions we think might be interesting to investigate in depth include:
1. What country has the most suicides, or the highest average rate of suicide from year to year?
2. Is there a relationship between generation/age demographic and propensity for suicide?
3. Is there a relationship between a country's GDP and its rate of suicide?
4. Do the events of a given year have an impact on global or regional suicide rates? If so, which years and what kinds of events?
  - If we can, derive our own criteria from these findings for factors associated with lower suicide rates (e.g. high GDP, or positive words in the text analysis). This is the motivation for the question.

## Q1.2 Data Collection
Our [primary data set](https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016), taken from Kaggle, has sufficient information to examine the first three questions. The fourth requires information about the events of each year, which may give insight into why the data changes from year to year and country to country. We therefore collect additional data, with the aim of modelling a sentiment value that we can assign to each year in the primary data set.

To gather information about each year from 1985 - 2016, we scrape the Wikipedia page for each year and save each notable event into a dataframe. We can then mine this dataframe for positive/negative sentiment, which may give insight into why some years show a global, regional, or isolated increase in suicide rates.

![](img1.png)

We observe that the events on each page are stored in `<ul>` elements, each directly following an `<h3>` tag that contains a `<span>` naming the month. With this in mind, our approach is to find the `<span>` corresponding to each month, navigate to the first `<ul>` that follows it, and gather the text from each constituent `<li>` tag within.

In [3]:
DATA_ALREADY_COLLECTED = False # a flag which can be used to skip the data collection process

In [21]:
import pandas as pd
from bs4 import BeautifulSoup as soup
from bs4 import Tag
import requests

if not DATA_ALREADY_COLLECTED:
    years = range(1985, 2017) # the years that correspond to our master data set
    months = ['January', 'February', 'March', 'April', 'May', 'June', 
              'July', 'August', 'September', 'October', 'November', 'December']
    base_url = "https://en.wikipedia.org/wiki"

    rows = []                 # one dict per event; the dataframe is built once at the end

    for year in years:        # iterate over each year
        print('.', end='')    # simulate a loading bar to indicate scraping progress
        url = f"{base_url}/{year}"

        response = requests.get(url)
        if response.status_code != 200:
            # raise explicitly so that a failed request stops the scrape
            raise RuntimeError(f"GET {url} failed with response code: {response.status_code}")

        year_soup = soup(response.text, 'html5lib')

        for month in months:   # iterate over each month
            span = year_soup.find('span', {'id': month}) # find the corresponding span
            if span is None:                             # no heading for this month on the page
                continue
            ul = span.parent.find_next('ul')             # the first <ul> after the month heading
            if ul is None:
                continue
            for elem in ul:                              # iterate over each child of that <ul>
                if isinstance(elem, Tag):                # ensure the element is a Tag
                    # record the contents of the element, dropping non-ASCII characters
                    rows.append({'year': int(year),
                                 'month': month,
                                 'text': elem.get_text().encode('ascii', 'ignore').decode()})

    events = pd.DataFrame(rows, columns=['month', 'text', 'year'])
    events.to_csv('data/events.csv')   # the index is written as an unnamed first column
    DATA_ALREADY_COLLECTED = True
else: 
    events = pd.read_csv('data/events.csv')
    events = events.drop('Unnamed: 0', axis=1) # drop the excess index column if preloading

events

|      | month    | text                                              | year   |
|------|----------|---------------------------------------------------|--------|
| 0    | January  | January 1\nThe Internet's Domain Name System i... | 1985.0 |
| 1    | January  | January 7 Japan Aerospace Exploration Agency ...  | 1985.0 |
| 2    | January  | January 10 Sinclair C5, the world's first mas...  | 1985.0 |
| 3    | January  | January 13 A passenger train plunges into a r...  | 1985.0 |
| 4    | January  | January 15 Tancredo Neves is elected presiden...  | 1985.0 |
| ...  | ...      | ...                                               | ...    |
| 2947 | December | December 19 Andrei Karlov, the Russian ambass...  | 2016.0 |
| 2948 | December | December 22 A study finds the VSV-EBOV vaccin...  | 2016.0 |
| 2949 | December | December 23 The United Nations Security Counc...  | 2016.0 |
| 2950 | December | December 25 2016 Russian Defence Ministry Tup...  | 2016.0 |


## Q1.3 Data Processing
Next, we process and group our data to make it easier to analyze the questions we posed.
Beginning with the `events` data set above:
- we clean the data,
- filter out stop words,
- tally the number of negative words in each entry using the bag-of-words approach
  - keep an additional tally of mentions of 'suicide', 'suicidal', etc.
- add per-year measures of negative sentiment (one general, one suicide-specific), which we can join with our primary data set as needed
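
To make the scoring concrete, here is a minimal sketch of the bag-of-words tally on a single sentence. The word lists and the sentence are hypothetical stand-ins (not the contents of `data/negative.txt` or `data/stopwords.txt`), and the helper names `tokenize` and `negative_sentiment` are ours for illustration only.

```python
# Toy illustration of the bag-of-words scoring. The word lists and the
# sentence below are hypothetical, not taken from the project's data files.
STOPWORDS = {"a", "the", "into", "in", "of"}
NEGATIVES = {"worst", "disaster", "crash"}


def tokenize(text):
    """Lowercase, replace commas and periods with spaces, split, drop stop words."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return [w for w in words if w not in STOPWORDS]


def negative_sentiment(text):
    """Return (bag, negative words, fraction of the bag that is negative)."""
    bag = tokenize(text)
    neg_words = [w for w in bag if w in NEGATIVES]
    # Guard against entries that are empty once stop words are removed.
    score = len(neg_words) / len(bag) if bag else 0.0
    return bag, neg_words, score


bag, neg_words, score = negative_sentiment(
    "A severe storm floods the northern coast, the worst disaster in decades."
)
print(bag)        # the 8 words left after stop-word removal
print(neg_words)  # ['worst', 'disaster']
print(score)      # 2 negative words / 8 total words = 0.25
```

The score is simply the share of meaningful words that appear in the negative list, so it depends heavily on which words that list contains.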

In [32]:
# load / define words of negative sentiment (sets give fast membership tests)
with open('data/negative.txt') as f:
    negatives = set(f.read().split())
with open('data/stopwords.txt') as f:
    stopwords = set(f.read().split())
keywords = {'suicide', 'suicidal', 'depression'}

# create bag of words/sentiment for each row
events_cleaned = events.copy()

# store all the non stopwords
events_cleaned['bag'] = events_cleaned['text'].apply(lambda t: 
    [w for w in t.lower().replace(',',' ').replace('.', ' ').split() if w not in stopwords])

# store all the negative words
events_cleaned['neg_words'] = events_cleaned['bag'].apply(lambda t: 
    [w for w in t if w in negatives])

# store all the keywords -- TODO: this may be unnecessary, since there appear to be none
events_cleaned['keywords'] = events_cleaned['bag'].apply(lambda t: 
    [w for w in t if w in keywords])

# tally number of negatives
events_cleaned['n_neg_words'] = events_cleaned['neg_words'].apply(len)
# tally number of keywords 
events_cleaned['n_keywords'] = events_cleaned['keywords'].apply(len)
# tally number of total meaningful words
events_cleaned['n_total_words'] = events_cleaned['bag'].apply(len)
# extract a negative sentiment value: the fraction of meaningful words that are negative
events_cleaned['neg_sentiment'] = events_cleaned['n_neg_words'] / events_cleaned['n_total_words']
events_cleaned
# TODO: group by year
# events_by_year = events.groupby('year')

|      | month    | text                                              | year   | bag                                               | neg_words         | keywords | n_neg_words | n_keywords | n_total_words | neg_sentiment |
|------|----------|---------------------------------------------------|--------|---------------------------------------------------|-------------------|----------|-------------|------------|---------------|---------------|
| 0    | January  | January 1\nThe Internet's Domain Name System i... | 1985.0 | [january, 1, internet's, domain, created, gree... | []                | []       | 0           | 0          | 25            | 0.000000      |
| 1    | January  | January 7 Japan Aerospace Exploration Agency ...  | 1985.0 | [january, 7, japan, aerospace, exploration, ag... | [deep]            | []       | 1           | 0          | 20            | 0.050000      |
| 2    | January  | January 10 Sinclair C5, the world's first mas...  | 1985.0 | [january, 10, sinclair, c5, world's, mass-prod... | []                | []       | 0           | 0          | 8             | 0.000000      |
| 3    | January  | January 13 A passenger train plunges into a r...  | 1985.0 | [january, 13, passenger, train, plunges, ravin... | [worst, disaster] | []       | 2           | 0          | 16            | 0.125000      |
| 4    | January  | January 15 Tancredo Neves is elected presiden...  | 1985.0 | [january, 15, tancredo, neves, elected, presid... | []                | []       | 0           | 0          | 12            | 0.000000      |
| ...  | ...      | ...                                               | ...    | ...                                               | ...               | ...      | ...         | ...        | ...           | ...           |
| 2947 | December | December 19 Andrei Karlov, the Russian ambass...  | 2016.0 | [december, 19, andrei, karlov, russian, ambass... | []                | []       | 0           | 0          | 16            | 0.000000      |
| 2948 | December | December 22 A study finds the VSV-EBOV vaccin...  | 2016.0 | [december, 22, study, finds, vsv-ebov, vaccine... | [virus, disease]  | []       | 2           | 0          | 16            | 0.125000      |
| 2949 | December | December 23 The United Nations Security Counc...  | 2016.0 | [december, 23, united, nations, security, coun... | []                | []       | 0           | 0          | 17            | 0.000000      |
| 2950 | December | December 25 2016 Russian Defence Ministry Tup...  | 2016.0 | [december, 25, 2016, russian, defence, ministr... | [black]           | []       | 1           | 0          | 44            | 0.022727      |
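
The per-event tallies still need to be rolled up into the per-year measures described in the plan for this section. One possible way to do that is sketched below on a small made-up frame that reuses the column names of `events_cleaned`. The numbers are invented for illustration, and `keyword_rate` is a hypothetical column name that does not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical per-event tallies with the same column names as `events_cleaned`;
# the numbers are made up for illustration.
toy = pd.DataFrame({
    "year":          [1985, 1985, 1985, 1986, 1986],
    "n_neg_words":   [0, 1, 2, 0, 3],
    "n_keywords":    [0, 0, 1, 0, 0],
    "n_total_words": [25, 20, 15, 10, 20],
})

# Sum the counts per year, then divide, so long entries weigh more than short ones.
by_year = toy.groupby("year")[["n_neg_words", "n_keywords", "n_total_words"]].sum()
by_year["neg_sentiment"] = by_year["n_neg_words"] / by_year["n_total_words"]
by_year["keyword_rate"] = by_year["n_keywords"] / by_year["n_total_words"]
by_year = by_year.reset_index()  # make `year` a regular column again

print(by_year)
```

Summing counts before dividing is a design choice. Averaging the per-event `neg_sentiment` values instead would give every event equal weight regardless of length. Assuming the primary data set also has a `year` column, the resulting frame could then be joined to it with `DataFrame.merge` on `year`.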
