# Overview
From the Kaggle web site (https://www.kaggle.com/datasets) download the Suicide Rates Overview 1985 to 2016 dataset. This dataset has 12 features and 27820 data points. In this assignment we would like to develop a machine learned model to predict, given some feature vectors, if the outcome would be suicide or not, as a binary dependent variable. The binary categories could be {"low suicide rate", "high suicide rate"}. (Note that a different approach could seek to generate a numerical value by solving a regression problem.)


A machine learning solution would require us to pre-process the dataset and prepare/design our experimentation.


Load the dataset in your model development framework (Jupyter notebook) and examine the features. Note that the Kaggle website also has histograms that you can inspect. However, you might want to look at the data grouped by some other features. For example, what does the 'number of suicides / 100k' histogram look like from country to country?


To answer the following questions, you have to think thoroughly, and possibly attempt some pilot experiments. There is no one right or wrong answer to some questions below, but you will always need to work from the data to build a convincing argument for your audience.

### 1. [10 pts] Due to the severity of this real-world crisis, what information would be the most important to "machine learn"? Can it be learned? (Note that this is asking you to define the big-picture question that we want to answer from this dataset. This is not asking you to conjecture which feature is going to turn out being important.

#### 1. Answer

In my opinion the most important thing to determine with the dataset is what causes suicides.  It seems a problem people struggle to understand, therefore teaching a machine to understand it wouldn't be possible.  Maybe we could teach a machine to observe for causes of suicide.  I think a machine can be taught to observe for causes of suicide, and how likely they are to occur in a population or an individual.  It seems unlikely that we will be able to do that with this dataset.  The data has too few features-- `year` and `generation` are tightly correlated, as are `year` and `country-year`. 

In [1]:
### 1. Experiments
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.dpi"] = 72
import numpy as np
import pandas as pd
# Visualizations
import seaborn as sns; sns.set(style="ticks", color_codes=True)

# Locate and load the data file
df = pd.read_csv('./datasets/master.csv', thousands=',')

# Sanity
print(f'#rows={len(df)} #columns={len(df.columns)}')
df.head()

#rows=27820 #columns=12


Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


### 2. [10 pts] Explain in detail how one should set up the problem. Would it be a regression or a classification problem? Is any unsupervised approach, to look for patterns, worthwhile?

#### 2. Answer
Starting with the last question, "is any unsupervised approach worthwhile," considering this dataset no.  The dataset has labeled features.  It might be useful do a comparison of omitting labels and testing whether an unsupervised approach finds possible relations.

To address the larger question-- we could use a Decision Tree, or by extension a random forest, but this is a regression problem since we want to determine where on a numeric scale the rate of suicides per 100,000 persons will trend.  To setup the problem we will need to normalize the numeric features `HDI for year`, `gdp_for_year($)`, `gdp_per_capita ($)`, `population`, `suicides_no`, `year` (we could have standardized them too); map the ordinal features `age` and `generation`; and encode the nominal features `country` and `sex`.  

We will drop the feature `country-year` since it is captured in the dataset.  We could have chosen to drop `country` and `year` however I like that year is numeric.  Additionally, I will train two version of the model, one including `suicides_no` and `population` and one without.  These two features in combination have a correlation coefficient of 1 with the target.  That doesn't answer the question _I_ am curious about, which is about what in a population makes them susceptible to suicide.

### 3. [20 pts] What should be the dependent variable?

#### 3. Answer
According to the dataset page on Kaggle, the dataset intends to collate information on "suicide rates by cohort."  This means the `suicides/100k pop` is the target, which makes sense.  We will explore it in the next question, but that may mean we can remove `suicides_no` or `population` features for training deployment.


### 4. [20 pts] Find some strong correlations between the independent variables and the dependent variable you decided and use them to rank the independent variables.

In [16]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder


df = pd.read_csv('./datasets/master.csv', thousands=',')
#############################################
####  Cleanup the data
# Checkin
## what are the columns that are strings
## how many unique values in these columns?
# Test 1
## drop CountryYear, 
## one hot encode country
# Test 2
## drop county, year
## one-hot encode CountryYear
# One-hot encode generation
#############################################
print(df.dtypes)


# Check for duplicates, this adds a new column to the dataset
print(f'Count of duplicates: {len(df.duplicated())}')
## none, good


print(f'{df.isna().any()}')
## 
print(f'"HDI for year" is the only column with NaN.  Let\'s compare how many NaN {df["HDI for year"].isna().count()} vs {df["HDI for year"].notna().count()} non-NaN of a total of {len(df["HDI for year"])} samples')
print(df["HDI for year"].describe())
print(f'This is reporting specious results.  DataFrame.describe() is showing 8364 non-NaN values.  Let\'s fill NaN with the means and see what that changes.')
# df.loc[:,df["HDI for year"].isna()] = df["HDI for year"].mean()
# df.fillna(df["HDI for year"].mean())
mean_value = df['HDI for year'].mean()
df['HDI for year'] = np.where(df['HDI for year'].isna(), mean_value, df['HDI for year'])
print(df['HDI for year'].describe())

## Using the method described in the module notebook, check unique values by column
for col in df.columns:
    if df[col].dtype == object:
        print(col, df[col].unique())

## We will remove country-year since it is described by two other columns, one of which is numeric
df = df.drop(columns=['country-year'])


## LabelEncoder
# Encode object types. They are all strings.  Save the labelencoders paired with the column names so we can reverse the values later
columns_to_encode = df.select_dtypes(include='object')
encoders = dict.fromkeys(columns_to_encode)
for column in columns_to_encode:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column].astype(str))
    encoders[column] = le

#############################################
### Read the data
#############################################
# df = pd.read_csv('./datasets/master.csv', thousands=',')
labels = list(df.columns)
feature_labels = list(df.columns)
target_label = 'suicides/100k pop'
feature_labels.remove(target_label)
X = df.drop(target_label, axis=1).values
y = df[target_label].values
print(labels, '\n',feature_labels, '\n', target_label, '\n', )
print(f'\n\n{"-"*50}\nPandas DataFrame.corrwith():\n{"-"*50}\n{df.corrwith(df[target_label])}')
print('\n\n\n')



#######################################################################
##          From the book
#######################################################################

# ### sc = StandardScaler()
# X_train_std = sc.fit_transform(X_train)

# df = pd.read_csv('./datasets/master.csv', thousands=',')
# feat_labels = df.columns[1:]

# forest = RandomForestClassifier(n_estimators=500,
#                                 random_state=1)

# forest.fit(X_train, y_train)
# importances = forest.feature_importances_

# indices = np.argsort(importances)[::-1]

# for f in range(X_train.shape[1]):
#     print("%2d) %-*s %f" % (f + 1, 30, 
#                             feat_labels[indices[f]], 
#                             importances[indices[f]]))

# plt.title('Feature importance')
# plt.bar(range(X_train.shape[1]), 
#         importances[indices],
#         align='center')

# plt.xticks(range(X_train.shape[1]), 
#            feat_labels[indices], rotation=90)
# plt.xlim([-1, X_train.shape[1]])
# plt.tight_layout()
# # plt.savefig('figures/04_10.png', dpi=300)
# plt.show()

country                object
year                    int64
sex                    object
age                    object
suicides_no             int64
population              int64
suicides/100k pop     float64
country-year           object
HDI for year          float64
 gdp_for_year ($)       int64
gdp_per_capita ($)      int64
generation             object
dtype: object
Count of duplicates: 27820
country               False
year                  False
sex                   False
age                   False
suicides_no           False
population            False
suicides/100k pop     False
country-year          False
HDI for year           True
 gdp_for_year ($)     False
gdp_per_capita ($)    False
generation            False
dtype: bool
"HDI for year" is the only column with NaN.  Let's compare how many NaN 27820 vs 27820 non-NaN of a total of 27820 samples
count    8364.000000
mean        0.776601
std         0.093367
min         0.483000
25%         0.713000
50%         0.779000
75

#### 4. Answer

TODO: describe the steps I outline in my comments, and what I actually did.

### 5. [20 pts] Pre-process the dataset and list the major features you want to use. Note that not all features are crucial. For example, country-year variable is a derived feature and for a classifier it would not be necessary to include the year, the country and the country -year together. In fact, one must avoid adding a derived feature and the original at the same time.
List the independent features you want to use.

### 6. [20 pts] Devise a classification problem and present a working prototype model. (It does not have to perform great, but it has to be functional.) Note that we will continue with this problem in the following modules.

# References
1. Raschka, Sebastian, et al. Machine Learning with PyTorch and Scikit-Learn: Develop machine learning and deep learning models with Python. Packt Publishing Ltd, 2022.
2. Guven, Erhan. Applied Machine Learning: Module 3 Notebook.  Last accesses 6 February, 2025.