* It should have some categorical features. However, it does not need to only have categorical features -
a mix of categorical and continuous features is also fine.
* All features should either be categorical or numerical. This means that you shouldn’t choose a data
set where the observations are images, long blocks of texts, or other data types that would require
significant processing before applying machine algorithms (you’ll have the opportunity to work with
this type of data for Project 3).
* Your data set should be reasonably large.
Because you’ve already done a project on classification, I suggest that you try to do a project on regression,
but this is not a requirement.
I encourage you to look for your own data set (through the UCI repository, or other sources). If you’re
having trouble finding one that interests you, here are some suggestions:
* Bank Marketing Data Set. https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
* Census Income Data Set. https://archive.ics.uci.edu/ml/datasets/Census+Income
* Contraceptive Method Choice Data Set. https://archive.ics.uci.edu/ml/datasets/Contraceptive+
Method+Choice
* Crimes in Chicago. https://www.kaggle.com/currie32/crimes-in-chicago
* Default of Credit Card Clients Data Set. https://archive.ics.uci.edu/ml/datasets/default+
of+credit+card+clients
* Mushroom Data Set. https://archive.ics.uci.edu/ml/datasets/Mushroom
* New York City Property Sales. https://www.kaggle.com/new-york-city/nyc-property-sales
* Online Shoppers Purchasing Intention Dataset. https://archive.ics.uci.edu/ml/datasets/Online+
Shoppers+Purchasing+Intention+Dataset
* Poker Hand Data Set. https://archive.ics.uci.edu/ml/datasets/Poker+Hand
* Synthetic Financial Datasets for Fraud Detection. https://www.kaggle.com/ntnu-testimon/paysim1
(actually a fake data set, but should still provide useful practice)
* Video Game Sales. https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings
(continued on reverse)
Once you’ve chosen your data set, complete the following tasks.
Write a Plan. Before you start working with your data set, make a plan for the workflow of your project.
Write up this plan, and include it at the beginning of your Jupyter notebook. Once you start to work with
your data set, you might find that you need to deviate from your original plan, this is good! When this
happens, don’t make delete tasks from your original plan - just acknowledge that you encountered something
unexpected, and explain what adjustments will need to be made.
Your plan should include the following:
* What will be your goal in working with this data set?
* What data cleaning will you need to do to prepare the data set?
* What kinds of exploratory data analysis will you do?
* Which machine learning algorithms will you use to train models? Why are you choosing these algorithms?
* How will you attempt to optimize your models?
* How will you analyze the accuracy of your models? Approximately what do you think will be “good”
accuracy rates for your chosen task?
Preparing Your Data Set. Following the outline given in your plan, prepare your data set for training
using the methods we’ve discussed. This is likely to include handling missing values and categorical variables.
Any choices that you make should be explained.
Exploratory Data Analysis. Following the outline given in your plan, perform some exploratory data
analysis. Discuss the results, and what this means for the distribution of the data, as well as your expectations for your results. This should include multiple methods of data visualization.
Train Models Using Machine Learning Algorithms. Although you do not need to use every algorithm
that we’ve covered, you should use those that make sense for your data set. You should explain your choices,
and perform parameter tuning in order to optimize their performance.
Discussion of Results. Compare the performance of your models, and discuss their accuracy. Do you consider your results successful? Explain why or why not, and discuss how further improvements might be made.
Evaluation of Your Plan. Discuss how closely you were able to follow your initial plan. What unexpected
issues did you encounter? What adjustments did you need to make? How would this affect planning a future
machine learning project?
For this project, you will submit a Jupyter Notebook through moodle and give an in-class presentation on
Thursday, January 24th. In order to make sure that you’re on track, you should write your plan
prior to class on Monday, January 20th, so we can discuss it.

# Suicide Detection using Socio-Economic Metrics as Signals


The data and a description of the data can be found at 
https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016.

Content:
    This compiled dataset pulled from four other datasets linked by time and place, and was built to find signals correlated to increased suicide rates among different cohorts globally, across the socio-economic spectrum.
    
References:

    - United Nations Development Program. (2018). Human development index (HDI). Retrieved from                                                 http://hdr.undp.org/en/indicators/137506
    http://hdr.undp.org/en/content/human-development-index-hdi

    - World Bank. (2018). World development indicators: GDP (current US$) by country:1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#

    - [Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook

    - World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/



# Preprocessing

In [23]:
import pandas as pd

# load data from path into a dataframe
path = "master.csv"
df = pd.read_csv(path)

df.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


The syntax of the features is not consistent. There are both underscores to fill in spaces and spaces. We will change the names of the features to make the accessing the data easier in the preprocessing.

In [24]:
df.rename(columns={"gdp_per_capita ($)": "gdp_per_capita", "gdp_for_year ($)": "gdp_for_year", 
                   "HDI for year": "HDI_for_year", "country-year": "country_year", 
                   "suicides/100k pop": "suicides/100k_pop"})

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers
5,Albania,1987,female,75+ years,1,35600,2.81,Albania1987,,2156624900,796,G.I. Generation
6,Albania,1987,female,35-54 years,6,278800,2.15,Albania1987,,2156624900,796,Silent
7,Albania,1987,female,25-34 years,4,257200,1.56,Albania1987,,2156624900,796,Boomers
8,Albania,1987,male,55-74 years,1,137500,0.73,Albania1987,,2156624900,796,G.I. Generation
9,Albania,1987,female,5-14 years,0,311000,0.00,Albania1987,,2156624900,796,Generation X


In [None]:
# seperate features and targets
features = ["country", "year", "sex", "age", "suicides_no", "population"," suicides/100k_pop", "country_year",
            "HDI_for_year", "gdp_for_year", "gdp_per_capita", "generation"]

# targets = 

# print("features: ", features)


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27820 entries, 0 to 27819
Data columns (total 12 columns):
country               27820 non-null object
year                  27820 non-null int64
sex                   27820 non-null object
age                   27820 non-null object
suicides_no           27820 non-null int64
population            27820 non-null int64
suicides/100k pop     27820 non-null float64
country-year          27820 non-null object
HDI for year          8364 non-null float64
 gdp_for_year ($)     27820 non-null object
gdp_per_capita ($)    27820 non-null int64
generation            27820 non-null object
dtypes: float64(2), int64(4), object(6)
memory usage: 2.5+ MB


In [9]:
df.describe()

Unnamed: 0,year,suicides_no,population,suicides/100k pop,HDI for year,gdp_per_capita ($)
count,27820.0,27820.0,27820.0,27820.0,8364.0,27820.0
mean,2001.258375,242.574407,1844794.0,12.816097,0.776601,16866.464414
std,8.469055,902.047917,3911779.0,18.961511,0.093367,18887.576472
min,1985.0,0.0,278.0,0.0,0.483,251.0
25%,1995.0,3.0,97498.5,0.92,0.713,3447.0
50%,2002.0,25.0,430150.0,5.99,0.779,9372.0
75%,2008.0,131.0,1486143.0,16.62,0.855,24874.0
max,2016.0,22338.0,43805210.0,224.97,0.944,126352.0


Let's check for null values.

In [10]:
#number of missing values in the entire dataframe
print(df.isnull().values.sum())

#the column-wise distribution of missing values
print(df.isnull().sum())

19456
country                   0
year                      0
sex                       0
age                       0
suicides_no               0
population                0
suicides/100k pop         0
country-year              0
HDI for year          19456
 gdp_for_year ($)         0
gdp_per_capita ($)        0
generation                0
dtype: int64


There is only one feature where there are missing values, HDI for year. HDI, Human Development Index is a metric of measuring the . The United Nations releases a study every year on the HDI of every country. This feature may be a very meaniful metric of suicide detection.

0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
5          NaN
6          NaN
7          NaN
8          NaN
9          NaN
10         NaN
11         NaN
12         NaN
13         NaN
14         NaN
15         NaN
16         NaN
17         NaN
18         NaN
19         NaN
20         NaN
21         NaN
22         NaN
23         NaN
24         NaN
25         NaN
26         NaN
27         NaN
28         NaN
29         NaN
         ...  
27790    0.668
27791    0.668
27792    0.668
27793    0.668
27794    0.668
27795    0.668
27796    0.672
27797    0.672
27798    0.672
27799    0.672
27800    0.672
27801    0.672
27802    0.672
27803    0.672
27804    0.672
27805    0.672
27806    0.672
27807    0.672
27808    0.675
27809    0.675
27810    0.675
27811    0.675
27812    0.675
27813    0.675
27814    0.675
27815    0.675
27816    0.675
27817    0.675
27818    0.675
27819    0.675
Name: HDI for year, Length: 27820, dtype: float64


We will also have to check for any duplicate values in the data.

In [15]:
for feature in features:
    print(df[feature].value_counts())

female    13910
male      13910
Name: sex, dtype: int64


Convert catagorical values to continous values.

In [None]:
def to_binary(entry):
    if entry == "male" or entry == "Male":
        return 0
    if entry == "female" or entry == "Female":
        return 1
    return entry

df = df.applymap(to_binary)
df.head()