# Overview
From the Kaggle web site (https://www.kaggle.com/datasets) download the Suicide Rates Overview 1985 to 2016 dataset. This dataset has 12 features and 27820 data points. In this assignment we would like to develop a machine learned model to predict, given some feature vectors, if the outcome would be suicide or not, as a binary dependent variable. The binary categories could be {"low suicide rate", "high suicide rate"}. (Note that a different approach could seek to generate a numerical value by solving a regression problem.)


A machine learning solution would require us to pre-process the dataset and prepare/design our experimentation.


Load the dataset in your model development framework (Jupyter notebook) and examine the features. Note that the Kaggle website also has histograms that you can inspect. However, you might want to look at the data grouped by some other features. For example, what does the 'number of suicides / 100k' histogram look like from country to country?


To answer the following questions, you have to think thoroughly, and possibly attempt some pilot experiments. There is no one right or wrong answer to some questions below, but you will always need to work from the data to build a convincing argument for your audience.

### 1. [10 pts] Due to the severity of this real-world crisis, what information would be the most important to "machine learn"? Can it be learned? (Note that this is asking you to define the big-picture question that we want to answer from this dataset. This is not asking you to conjecture which feature is going to turn out being important.

#### 1. Answer

In my opinion the most important thing to determine with the dataset is what causes suicides.  It seems a problem people struggle to understand, therefore teaching a machine to understand it wouldn't be possible.  Maybe we could teach a machine to observe for causes of suicide.  I think a machine can be taught to observe for causes of suicide, and how likely they are to occur in a population or an individual.  It seems unlikely that we will be able to do that with this dataset.  The data has too few features-- `year` and `generation` are tightly correlated, as are `year` and `country-year`. 

In [1]:
### 1. Experiments
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.dpi"] = 72
import numpy as np
import pandas as pd
# Visualizations
import seaborn as sns; sns.set(style="ticks", color_codes=True)

# Locate and load the data file
df = pd.read_csv('./datasets/master.csv', thousands=',')

# Sanity
print(f'#rows={len(df)} #columns={len(df.columns)}')
df.head()

#rows=27820 #columns=12


Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


### 2. [10 pts] Explain in detail how one should set up the problem. Would it be a regression or a classification problem? Is any unsupervised approach, to look for patterns, worthwhile?

#### 2. Answer
Starting with the last question, "is any unsupervised approach worthwhile," considering this dataset no.  The dataset has labeled features.  It might be useful do a comparison of omitting labels and testing whether an unsupervised approach finds possible relations.

To address the larger question-- we could use a Decision Tree, or by extension a random forest; but this is also a regression problem since we want to determine where on a numeric scale the rate of suicides per 100,000 persons will trend.  To setup the problem we will need to normalize the numeric features `HDI for year`, `gdp_for_year($)`, `gdp_per_capita ($)`, `population`, `suicides_no`, `year` (we could have standardized them too); map the ordinal features `age` and `generation`; and encode the nominal features `country` and `sex`.  

We will drop the feature `country-year` since it is captured in the dataset.  We could have chosen to drop `country` and `year` however I like that year is numeric, and perhaps there are relations between the year and different countries.  Additionally, I will train two version of the model, one including `suicides_no` and `population` and one without.  These two features in combination have a correlation coefficient of 1 with the target.  That doesn't answer the question _I_ am curious about, which is about what in a population makes them susceptible to suicide.

### 3. [20 pts] What should be the dependent variable?

#### 3. Answer
According to the dataset page on Kaggle, the dataset intends to collate information on "suicide rates by cohort."  This means the `suicides/100k pop` is the target, which makes sense.


### 4. [20 pts] Find some strong correlations between the independent variables and the dependent variable you decided and use them to rank the independent variables.

#### 4. Answer

Finding correlations when there are missing or NaN values would be in error since it would skew the data.  Granted, I will skew the data by adding the mean value if the feature is a numeric type.  After processing the data I will use Pandas `DataFrame.corrwith` to find correlation with the target column `suicides/100k pop`, which I will rename to `suicides per 100k pop`.  


Here is a summary of the steps I took.
- Drop `country-year` since it is accounted for with other features.
- Remove ($), parens and `/` from feature names. Note `gdp_for_year ($)` gave me trouble so I had to do it in an irritating way. 
- Check for duplicates. _None found._
- Check for null values. _None found._
- Check for NaN. _Only found NaN values in the `HDI for year` features. `DataFrame.describe() shows 8364 non-NaN values, which means almost more than 2x that number are NaN.  Perhaps I should drop this features, but there are so few already that I am choosing to keep it._
- Fill NaN values of `HDI for year`with the mean of `HDI for year`.
- Scale the numeric features using normalization
- Use Pandas to check for correlation of each feature with the target

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns; sns.set(style="ticks", color_codes=True)
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder


def fix_data_before_numeric_or_ordinal_changes(df, printer=True):
    if printer: print(df.dtypes)
    
    # df = df.drop(columns=['country-year', 'population', 'suicides_no'])
    df = df.drop(columns=['suicides_no', 'country-year'])
    df = df.rename(columns={'gdp_per_capita ($)': 'gdp_per_capita', 'suicides/100k pop': 'suicides per 100k pop'})
    
    
    for c in df.columns:
        if 'gdp_for_year' in c:
            new_column_name = c.replace('($)','').strip()
            df = df.rename(columns={c: new_column_name})
            if printer: print(new_column_name)
    
    if printer: print(df.columns)
    
    # Check for duplicates, this adds a new column to the dataset
    if printer: print(f'Count of duplicates: {len(df.duplicated())}')
    
    if printer: print('\n\n\n')
    if printer: print(df.isnull().any())
    
    
    if printer: print('\n\n\n')
    if printer: print(f'See that there is only one column with NaN values:\n{df.isna().any()}')
    if printer: print(f'"HDI for year" is the only column with NaN.  Let\'s compare how many NaN {df["HDI for year"].isna().count()} vs {df["HDI for year"].notna().count()} non-NaN of a total of {len(df["HDI for year"])} samples')
    if printer: print(df["HDI for year"].describe())
    if printer: print(f'This is reporting specious results.  DataFrame.describe() is showing 8364 non-NaN values.  Let\'s fill NaN with the means and see what that changes.')
    
    # Replace NaN values, or leave it as is otherwise
    mean_value = df['HDI for year'].mean()
    df['HDI for year'] = df['HDI for year'].fillna(mean_value)
    # This shows that there are no longer any missing values
    if printer: print(f'\n\nSee, no more missing values.{df.isna().any()}')
    
    
    if printer: print('\n\n\n')
    ## Using the method described in the module notebook, check unique values by column
    for col in df.columns:
        if df[col].dtype == object:
            if printer: print(col, df[col].unique())
    return df


df4 = fix_data_before_numeric_or_ordinal_changes(df)

# Scale the numeric features
numeric_columns = list(df4.select_dtypes(exclude=['object']).columns)
sc = StandardScaler()
df4[numeric_columns] = sc.fit_transform(df4[numeric_columns])


## LabelEncoder
# Encode object types. They are all strings.  Save the labelencoders paired with the column names so we can reverse the values later
columns_to_encode = df4.select_dtypes(include='object')
for column in columns_to_encode:
    le = LabelEncoder()
    df4[column] = le.fit_transform(df4[column].astype(str))


# Use pandas to print correlation
print('\n\n\n')
labels = list(df4.columns)
feature_labels = list(df4.columns)
target_label = 'suicides per 100k pop'
feature_labels.remove(target_label)
X = df4.drop(target_label, axis=1).values
y = df4[target_label].values
print(labels, '\n',feature_labels, '\n', target_label, '\n', )
print(f'\n\n{"-"*50}\nPandas DataFrame.corrwith():\n{"-"*50}\n{df4.corrwith(df4[target_label]).sort_values()}')

country                object
year                    int64
sex                    object
age                    object
suicides_no             int64
population              int64
suicides/100k pop     float64
country-year           object
HDI for year          float64
 gdp_for_year ($)       int64
gdp_per_capita ($)      int64
generation             object
dtype: object
gdp_for_year
Index(['country', 'year', 'sex', 'age', 'population', 'suicides per 100k pop',
       'HDI for year', 'gdp_for_year', 'gdp_per_capita', 'generation'],
      dtype='object')
Count of duplicates: 27820




country                  False
year                     False
sex                      False
age                      False
population               False
suicides per 100k pop    False
HDI for year              True
gdp_for_year             False
gdp_per_capita           False
generation               False
dtype: bool




See that there is only one column with NaN values:
country                  False
y

### 5. [20 pts] Pre-process the dataset and list the major features you want to use. Note that not all features are crucial. For example, country-year variable is a derived feature and for a classifier it would not be necessary to include the year, the country and the country -year together. In fact, one must avoid adding a derived feature and the original at the same time.
List the independent features you want to use.

#### 5. Answer
My independent variables will be `age`, `country`, `generation`, `gdp_per_capita`, `gdp_for_year`, `HDI for year`, `population`, `sex`, and `year`.  I will not be using `suicide_no`because dividing it with `population` is equal to `suicides per 100k pop`.

### 6. [20 pts] Devise a classification problem and present a working prototype model. (It does not have to perform great, but it has to be functional.) Note that we will continue with this problem in the following modules.

#### 6. Answer

I called this problem a regression problem at the start of the homework due to the datatype I saw in `suicides/100k pop`.  I see now that it is a classification problem, where the classes are bins.  


For the sake of my earlier statements I tried to get good results with a regressor. The `GradientBoostingRegressor` has good documentation, many examples, and plenty of hyperparameters to tune.  I tried it with the default parameters and acheived an MSE of 42, making this barely better than a flip of the coin.  I tried normalizing features, rather than standardizing them, which received worse scores in MSE and R^2.  Note: R^2 is the "coefficient of determination;" where a score of 1 is perfect, a score of 0 means the input features aren't used; and a negative score means you have a bad model. Finally, I tried using 5x more trees than the default, and lowering the learning rate to 10% of the default; the results are better than normalizing the features and worse than using the default parameters...


To fall on my sword, I lastly show a `RandomForestClassifier`.  Using the default parameters it acheives 93% accuracy and an F-1 score of 93%.  This is excellent, given that I tried to force a wrong option for an entire day.


Recall that in `#4` I extracted `X` and `y` for correlation.

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

def q6_gbr_scaled(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    gbr = GradientBoostingRegressor(n_estimators=100, random_state=42)
    gbr.fit(X_train, y_train)
    
    y_pred = gbr.predict(X_test)
    
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    print(f'******* Test with Gradient Boosted Regressor\n******* standardized features')
    print(f'Mean Squared Error: {mse:.2f}')
    print(f'R^2 Score: {r2:.2f}')
q6_gbr_scaled(X, y)

******* Test with Gradient Boosted Regressor
******* standardized features
Mean Squared Error: 0.42
R^2 Score: 0.57


In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, Normalizer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

def q6_gbr_normalized(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    normer = Normalizer()
    X_train = normer.fit_transform(X_train)
    X_test = normer.transform(X_test)
    
    gbr = GradientBoostingRegressor(n_estimators=100, random_state=42)
    gbr.fit(X_train, y_train)
    
    y_pred = gbr.predict(X_test)
    
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f'******* Test with Gradient Boosted Regressor\n******* normalized features')
    print(f'Mean Squared Error: {mse:.2f}')
    print(f'R^2 Score: {r2:.2f}')
q6_gbr_normalized(X, y)

******* Test with Gradient Boosted Regressor
******* normalized features
Mean Squared Error: 0.48
R^2 Score: 0.51


In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

def q6_gbr_custom_params(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    # Custom parameters. Similar to what examples on Scitkit learn have for GradientBoostedRegressor
    params = {
        "n_estimators": 500,
        "max_depth": 4,
        "min_samples_split": 5,
        "learning_rate": 0.01,
    }
    
    gbr = GradientBoostingRegressor(**params)
    gbr.fit(X_train, y_train)
    
    # Make predictions on the test set
    y_pred = gbr.predict(X_test)
    
    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f'******* Test with Gradient Boosted Regressor\n******* custom parameters')
    print(f'Mean Squared Error: {mse:.2f}')
    print(f'R^2 Score: {r2:.2f}')

q6_gbr_custom_params(X, y)

******* Test with Gradient Boosted Regressor
******* custom parameters
Mean Squared Error: 0.44
R^2 Score: 0.55


In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report


def q6_rfc(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)
    
    # Predictions
    y_pred = clf.predict(X_test)
    
    # Evaluate model
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)

    print(f'Test with Random Forest Classifier\ndefault parameters')
    print(f'Accuracy: {accuracy:.2f}')
    print('Classification Report:\n', report)

# Re-work for continuous data
df_rfc = fix_data_before_numeric_or_ordinal_changes(df, printer=False)

columns_to_encode = df_rfc.select_dtypes(include='object')
for column in columns_to_encode:
    le = LabelEncoder()
    df_rfc[column] = le.fit_transform(df_rfc[column].astype(str))
X = df_rfc.drop(columns=['suicides per 100k pop'])
y = (df_rfc['suicides per 100k pop'] > df_rfc['suicides per 100k pop'].median()).astype(int)  
q6_rfc(X, y)

Test with Random Forest Classifier
default parameters
Accuracy: 0.93
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.93      0.93      2819
           1       0.93      0.93      0.93      2745

    accuracy                           0.93      5564
   macro avg       0.93      0.93      0.93      5564
weighted avg       0.93      0.93      0.93      5564



# References
1. Raschka, Sebastian, et al. Machine Learning with PyTorch and Scikit-Learn: Develop machine learning and deep learning models with Python. Packt Publishing Ltd, 2022.
2. Guven, Erhan. Applied Machine Learning: Module 3 Notebook.  Last accessed 6 February, 2025.
3. Scikit-Learn. https://scikit-learn.org/stable/. Last accessed 6 February, 2025.