Continuing with the previous machine learning problem, let's return to the pre-processed
Suicide Rates Overview 1985 to 2016 dataset. We would like to have a machine learning
model to predict the suicide rate 'suicides/100k pop'.

1. [10 pts] Use your previous pre-processed dataset, keep the variables as one-hot encoded,
and develop a multiple linear regression model. How many regression coefficients does this
model have?

#### Ans.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.dpi"] = 72
from IPython.display import display
import numpy as np
import pandas as pd
import seaborn as sns; sns.set(style="ticks", color_codes=True)

# Locate and load the data file
df_raw = pd.read_csv('../../Desktop/APML/Datasets/suicide_predict.csv')

# Renaming columns for cleaner names
df_raw.rename(columns={
    "suicides/100k pop": "suicide_100k_pop",
    "HDI for year": "hdi_for_year",
    "country-year": "country_year",
    " gdp_for_year ($) ": "gdp_for_year",
    "gdp_per_capita ($)": "gdp_per_capita",
}, inplace=True)

In [2]:
# Checking for duplicates - adapted from Guven, 2024.
df_raw["is_duplicate"]= df_raw.duplicated()

print(f"#total= {len(df_raw)}")
print(f"#duplicated= {len(df_raw[df_raw['is_duplicate']==True])}")

# Drop the duplicate rows using index - best way to drop in pandas
index_to_drop = df_raw[df_raw['is_duplicate']==True].index
df_raw.drop(index_to_drop, inplace=True)

# Remove the duplicate marker column
df_raw.drop(columns='is_duplicate', inplace=True)
print(f'#total= {len(df_raw)}')

#total= 27820
#duplicated= 0
#total= 27820


In [3]:
# Checking for missing values in df
df_raw.isnull().sum()

# Dropping country-year (derived feature) and hdi_for_year (missing data) 
# df is the clean dataset
df = df_raw.drop(['country_year','hdi_for_year'],axis=1)

# Convert 'gdp_for_year' to numeric after removing commas
df['gdp_for_year'] = df['gdp_for_year'].str.replace(',', '').astype(float)

# Check clean dataset
print(f'N rows={len(df)}, M columns={len(df.columns)}')
df.head()

N rows=27820, M columns=10


Unnamed: 0,country,year,sex,age,suicides_no,population,suicide_100k_pop,gdp_for_year,gdp_per_capita,generation
0,Albania,1987,male,15-24 years,21,312900,6.71,2156625000.0,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,2156625000.0,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,2156625000.0,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,2156625000.0,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,2156625000.0,796,Boomers


In [4]:
# Dropping variables judged to be redundant features;
# country is dropped to simplify the feature space
df = df.drop(columns=['suicides_no', 'country', 'gdp_for_year'])

In [5]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# Preparing features for normalization and encoding
numerical_features = ['gdp_per_capita']
categorical_features = ['sex', 'age', 'generation']

# Normalization for numerical features
numerical_transformer = StandardScaler()

# One-hot encoding for categorical features
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Preprocessor for numerical and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', LinearRegression())])

# Splitting the original dataset into features and target variable, excluding the features we decided to drop
X = df.drop(columns=['suicide_100k_pop'])
y = df['suicide_100k_pop']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fitting model pipeline to the training data
pipeline.fit(X_train, y_train)

# Access the fitted model's coefficients and intercept
coefficients = pipeline.named_steps['model'].coef_
intercept = pipeline.named_steps['model'].intercept_

# Report the number of regression coefficients, counting the intercept as one
print(f'Regression coefficients: {len(coefficients) + 1}')

Regression coefficients: 16
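
The count of 16 can be checked by hand: one coefficient for the scaled `gdp_per_capita`, one per one-hot level of `sex` (2), `age` (6), and `generation` (6), plus the intercept. The sketch below reproduces that arithmetic on a small synthetic frame. The `toy` data and its target are made up for illustration; only the category levels are taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Category levels (spelling kept as in the dataset)
sexes = ['female', 'male']
ages = ['5-14 years', '15-24 years', '25-34 years',
        '35-54 years', '55-74 years', '75+ years']
generations = ['G.I. Generation', 'Silent', 'Boomers',
               'Generation X', 'Millenials', 'Generation Z']

# Synthetic rows (assumption: values are arbitrary); every level appears at least once
n = 72
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'gdp_per_capita': rng.uniform(500, 50000, n),
    'sex': [sexes[i % 2] for i in range(n)],
    'age': [ages[i % 6] for i in range(n)],
    'generation': [generations[(i // 6) % 6] for i in range(n)],
})
y_toy = rng.uniform(0, 40, n)

toy_pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['gdp_per_capita']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex', 'age', 'generation']),
    ])),
    ('model', LinearRegression()),
])
toy_pipeline.fit(toy, y_toy)

encoder = toy_pipeline.named_steps['preprocessor'].named_transformers_['cat']
n_onehot = sum(len(levels) for levels in encoder.categories_)  # 2 + 6 + 6
n_slopes = len(toy_pipeline.named_steps['model'].coef_)        # 1 numeric + one-hot columns
n_coefficients = n_slopes + 1                                  # plus the intercept
print(n_onehot, n_slopes, n_coefficients)
```

Because each group of one-hot columns sums to one, these coefficients are not all independently identified. `OneHotEncoder(drop='first')` would be one way to remove that redundancy, at the cost of a smaller count.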


2. [10 pts] Use this model to predict the target variable for people with age 20, male, and
generation X. Report this prediction. What is the MAE error of this prediction?

#### Ans. 
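The test set is filtered to the rows matching the specified criteria (age 20 falls in the '15-24 years' group), the model predicts the suicide rate for those rows, and the MAE is computed between the predictions and the observed values.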

In [6]:
# Filter the test set based on the specified criteria
X_test_filter = X_test[(X_test['age'] == '15-24 years') & (X_test['sex'] == 'male') & (X_test['generation'] == 'Generation X')]

# Also filter y_test to the corresponding target values (label-based lookup)
y_test_filter = y_test.loc[X_test_filter.index]

# Use the model to predict suicide rates for the filtered test set
filtered_predictions = pipeline.predict(X_test_filter)

# Calculate the MAE
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test_filter, filtered_predictions)
mae

9.831707092574694
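
Q2 also asks for the prediction itself, which the cell above does not print. A minimal sketch of how a reported prediction and its MAE relate, using made-up numbers rather than actual model output (both arrays are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical observed and predicted rates for a filtered subgroup
y_true = np.array([6.0, 12.5, 20.0, 9.5])
y_pred = np.array([10.0, 11.0, 14.0, 13.5])

# One way to summarize "the prediction" for the subgroup: the mean predicted rate
mean_prediction = y_pred.mean()

# MAE is the average absolute gap between observed and predicted values
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(f'mean prediction = {mean_prediction:.3f}')
print(f'MAE (manual) = {mae_manual:.3f}, MAE (sklearn) = {mae_sklearn:.3f}')
```

In the notebook, the analogous values would be `filtered_predictions.mean()` and `mae`.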

3. [20 pts] Now go back to the original sex, age, and generation variables in their original
numerical form (i.e. prior to the one-hot encoding) and build a new model. (That is, feature
engineer the original nominal age and generation features into truly numerical features.)
How many line coefficients are there?

#### Ans.

In [7]:
sex_mapping = {
    'female': 0,
    'male' : 1
}

# Mapping for 'generation' 
generation_mapping = {
    'G.I. Generation': 1,
    'Silent': 2,
    'Boomers': 3,
    'Generation X': 4,
    'Millenials': 5,
    'Generation Z': 6,
    'Gen Alpha': 7
}

# Mapping 'age' to numerical 
age_mapping = {
    '5-14 years': 1,
    '15-24 years': 2,
    '25-34 years': 3,
    '35-54 years': 4,
    '55-74 years': 5,
    '75+ years': 6
}

df_num = df.copy()

df_num['sex_n'] = df_num['sex'].map(sex_mapping)
df_num['generation_n'] = df_num['generation'].map(generation_mapping)
df_num['age_n'] = df_num['age'].map(age_mapping)

df_num.head()

Unnamed: 0,year,sex,age,population,suicide_100k_pop,gdp_per_capita,generation,sex_n,generation_n,age_n
0,1987,male,15-24 years,312900,6.71,796,Generation X,1,4,2
1,1987,male,35-54 years,308000,5.19,796,Silent,1,2,4
2,1987,female,15-24 years,289700,4.83,796,Generation X,0,4,2
3,1987,male,75+ years,21800,4.59,796,G.I. Generation,1,1,6
4,1987,male,25-34 years,274300,3.28,796,Boomers,1,3,3


In [8]:
# Adjusting the preprocessing pipeline: all four features are now numeric and are standardized.
# Columns not listed here (e.g. year, population) are dropped by the default remainder='drop'.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['gdp_per_capita', 'sex_n', 'generation_n', 'age_n']),
    ])

# Creating a new pipeline with the adjusted preprocessing and linear regression model
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', LinearRegression())])

# Splitting the dataset into features and the target variable again
X = df_num.drop(columns=['suicide_100k_pop'])
y = df_num['suicide_100k_pop']

# Splitting the data into training and testing sets again
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Keep only the numeric mapping
X_train =  X_train.drop(columns = ['sex', 'age', 'generation'])

# Fitting the new model pipeline to the training data
pipeline.fit(X_train, y_train)

# Accessing the new model's coefficients
new_coefficients = pipeline.named_steps['model'].coef_
new_intercept = pipeline.named_steps['model'].intercept_

# The number of regression coefficients in the new model, including the intercept
new_num_coefficients = len(new_coefficients) + 1
# Report the coefficient count
print(f'Regression coefficients: {new_num_coefficients}')

Regression coefficients: 5
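
`X_train` still contains `year` and `population`, yet the model has only four slopes. This follows from `ColumnTransformer` defaulting to `remainder='drop'`, so only the listed columns reach the regression. A small sketch on made-up data (the `toy` frame and `y_toy` values are assumptions for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up rows with the same column names as the notebook's X_train
toy = pd.DataFrame({
    'year':           [1991, 2005, 2009, 2013, 2015, 1999],
    'population':     [2310000, 575210, 105083, 1815829, 1088177, 400000],
    'gdp_per_capita': [6404, 1704, 3765, 11478, 9431, 5200],
    'sex_n':          [0, 1, 1, 0, 1, 0],
    'generation_n':   [3, 3, 3, 2, 6, 4],
    'age_n':          [3, 4, 4, 6, 1, 2],
})
y_toy = [4.2, 15.1, 18.3, 6.0, 0.9, 3.3]  # hypothetical target values

used = ['gdp_per_capita', 'sex_n', 'generation_n', 'age_n']
toy_pipeline = Pipeline(steps=[
    # remainder='drop' is the default; written out here for emphasis
    ('preprocessor', ColumnTransformer([('num', StandardScaler(), used)], remainder='drop')),
    ('model', LinearRegression()),
])
toy_pipeline.fit(toy, y_toy)

n_slopes = len(toy_pipeline.named_steps['model'].coef_)
print(f'slopes = {n_slopes}, coefficients incl. intercept = {n_slopes + 1}')
```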


In [9]:
X_train.head()

Unnamed: 0,year,population,gdp_per_capita,sex_n,generation_n,age_n
668,1991,2310000,6404,0,3,3
18772,2005,575210,1704,1,3,4
8703,2009,105083,3765,1,3,4
16909,2013,1815829,11478,0,2,6
20924,2015,1088177,9431,1,6,1


4. [10 pts] Use this new Q3. model to predict the target value for the people with age 20,
male, and generation X. Report the prediction. What is the MAE error of this prediction?

In [10]:
# Filter the test set based on the specified criteria
X_test_filter = X_test[(X_test['age'] == '15-24 years') & (X_test['sex'] == 'male') & (X_test['generation'] == 'Generation X')]

# Keep only the numeric mapping
X_test_filter = X_test_filter.drop(columns = ['sex', 'age', 'generation'])  

# Also filter the y_test to have the corresponding target values
y_test_filter = y_test[X_test_filter.index]

# Use the model to predict suicide rates for the filtered test set
filtered_predictions = pipeline.predict(X_test_filter)

# Calculate the MAE
mae = mean_absolute_error(y_test_filter, filtered_predictions)
mae

9.128505752474695

5. [10 pts] Did you note any change in these two model performances?

#### Ans.
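On the male, 15-24 years, Generation X subset of the test set, the model with numerically encoded features has a slightly lower MAE (9.13) than the one-hot encoded model (9.83). The numerical model is also simpler, with 5 coefficients instead of 16, which makes it easier to interpret and potentially less prone to overfitting. The trade-off is that a numerical encoding forces each feature's effect to be linear across its levels, so it may miss category-specific effects that one-hot encoding can capture. For this subgroup it still performs sufficiently well for predictive purposes, although a comparison on one subgroup does not show which model is better overall.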

6. [10 pts] Use your Q3. model to predict the target value for age 33, male, and generation
Alpha (i.e. the generation after generation Z); report the prediction.

#### Ans. 

Note: Generation Alpha does not appear in the dataset, so it is mapped to 7, one step after Generation Z (which is 6). The question does not specify a GDP per capita, so the test-set mean is used.

In [14]:
# Adjusting the input for a 33-year-old male from Generation Alpha using the numerical mappings defined earlier
input_data = {
    'gdp_per_capita': X_test['gdp_per_capita'].mean(),  # Using the mean GDP per capita from the test set
    'generation_n': 7,  # Generation Alpha
    'age_n': 3,  # age 33 falls into the '25-34 years' category, mapped to 3
    'sex_n': 1 # male
}

input_df = pd.DataFrame([input_data])

# Using the Q3. model to predict the suicide rate for the specified individual
prediction = pipeline.predict(input_df)

prediction[0]

16.80779347706776
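
Note that `generation_n = 7` lies outside the range seen in training (1 to 6), so this prediction is a linear extrapolation: the model extends the fitted slope by one more step. A toy illustration with a single made-up feature (the data follow y = 2x + 1 exactly, an assumption chosen so the arithmetic is easy to check):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training levels 1..6 only; level 7 is never observed
x_train = np.arange(1, 7).reshape(-1, 1)
y_train_toy = 2 * x_train.ravel() + 1

toy_model = LinearRegression().fit(x_train, y_train_toy)

pred_6 = toy_model.predict(np.array([[6]]))[0]   # inside the training range
pred_7 = toy_model.predict(np.array([[7]]))[0]   # outside: extrapolated
step = pred_7 - pred_6                           # equals the fitted slope

print(pred_6, pred_7, step, toy_model.coef_[0])
```

Whether the real trend continues past Generation Z is not something the data can confirm, so the Q6 prediction should be read with that caveat.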

7. [10 pts] Give one advantage when using regression (as opposed to classification with
nominal features) in terms of independent variables.

#### Ans. 
Regression can use continuous independent variables directly, without first binning them into nominal categories. This keeps the full information in each variable (for example, the exact GDP per capita rather than a coarse category) and lets the model capture gradual variation in the data, which can lead to more precise estimates of the dependent variable.

8. [10 pts] Give one advantage when using regular numerical values rather than one-hot
encoding for regression.

#### Ans.
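Using regular numerical values instead of one-hot encoding preserves the order and magnitude of the original variable (for example, that '25-34 years' lies between '15-24 years' and '35-54 years'), so the regression can use this information directly. It also needs only one coefficient per feature instead of one per category: 5 coefficients instead of 16 in this assignment.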
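
The hand-written `age_mapping` dictionary can also be expressed with scikit-learn's `OrdinalEncoder`, given an explicit category order. This is a sketch of an alternative, not what the notebook does. Note that `OrdinalEncoder` codes start at 0 rather than 1, which shifts the intercept of a linear model but not its predictions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

age_order = ['5-14 years', '15-24 years', '25-34 years',
             '35-54 years', '55-74 years', '75+ years']

# A few sample rows (made up for illustration)
sample = pd.DataFrame({'age': ['15-24 years', '75+ years', '5-14 years']})

# Passing the category list fixes the order, so the codes respect age
encoder = OrdinalEncoder(categories=[age_order])
codes = encoder.fit_transform(sample[['age']]).ravel()

print(codes.tolist())
```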

9. [10 pts] Now that you developed both a classifier (previously) and a regression model for
the problem in this assignment, which method do you suggest to your machine learning
model customer? Classifier or regression? Why?

#### Ans. 

For suicide prevention based on the predicted suicide rate, I would suggest a regression model. It predicts the rate directly as a continuous value, and its coefficients (and, with a statistics package that reports them, their confidence intervals) show how each predictor is associated with the rate. Regression also captures gradual variation in the predictor variables that is lost when the target is binned into classes. Finally, the customer remains free to bin the predicted rates into their own categories for a specific analysis, which increases the usability of the model.
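
The binning flexibility mentioned above can be sketched as follows. The bin edges and labels are hypothetical choices a customer might make, and the predicted rates are made-up values, not model output:

```python
import numpy as np
import pandas as pd

# Made-up predicted rates (suicides per 100k population)
predicted_rates = pd.Series([3.2, 9.1, 16.8, 27.4])

# Hypothetical customer-defined risk bands
bin_edges = [0, 5, 15, np.inf]
labels = ['low', 'medium', 'high']

risk_band = pd.cut(predicted_rates, bins=bin_edges, labels=labels, include_lowest=True)
print(risk_band.tolist())
```

A different customer could pass different `bin_edges` to the same predictions without retraining, which is not possible once a classifier has been trained on fixed classes.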