<h1>An Exploratory Analysis of the Factors Influencing Youth Smoking Risk and Predictive Modeling Using NHS Survey Data</h1>

# Table of Contents

### **1. Introduction**
* [1.1 Project Background](#1.1-Project-Background)
* [1.2 Problem Definition](#1.2-Problem-Definition)
* [1.3 Objectives](#1.3-Objectives)
* [1.4 Significance of the Study](#1.4-Significance-of-the-Study)

---

### **2. Data Overview and Preparation**
* [2.1 Data Loading](#2.1-Data-Loading)
* [2.2 Data Dictionary](#2.2-Data-Dictionary)
* [2.3 Data Inspection and Summary Statistics](#2.3-Data-Inspection-and-Summary-Statistics)
* [2.4 Handling Missing Values](#2.4-Handling-Missing-Values)
    * [2.4.1 Imputation Strategies](#2.4.1-Imputation-Strategies)

---

### **3. Exploratory Data Analysis (EDA)**
* [3.1 Univariate Analysis](#3.1-Univariate-Analysis)
* [3.2 Bivariate Analysis](#3.2-Bivariate-Analysis)
    * [3.2.1 Youth Perceptions of Peer Smoking Reasons with Age](#3.2.1-Youth-Perceptions-of-Peer-Smoking-Reasons-with-Age)
    * [3.2.2 Proportion of Youth Who Have Smoked by Exposure to Secondhand Smoke](#3.2.2-Proportion-of-Youth-Who-Have-Smoked-by-Exposure-to-Secondhand-Smoke)
    * [3.2.3 Peer Influences](#3.2.3-Peer-Influences)
    * [3.2.4 Education Efforts](#3.2.4-Education-Efforts)

---

### **4. Feature Engineering**
* [4.1 Composite Features](#4.1-Composite-Features)
* [4.2 Binary Flags](#4.2-Binary-Flags)
* [4.3 Interaction Feature](#4.3-Interaction-Feature)
* [4.4 Encode Categorical Variables](#4.4-Encode-Categorical-Variables)
* [4.5 Final Dataset Summary](#4.5-Final-Dataset-Summary)

---

### **5. Predictive Modeling**
* [5.1 Model Selection](#5.1-Model-Selection)
* [5.2 Model Training and Evaluation](#5.2-Model-Training-and-Evaluation)
* [5.3 Model Comparison](#5.3-Model-Comparison)
---

### **6. Findings and Limitations**
* [6.1 Key Findings](#6.1-Key-Findings)
* [6.2 Challenges and Limitations](#6.2-Challenges-and-Limitations)

## **1. Introduction**

#### **1.1 Project Background**

Smoking among young students in the UK remains a significant public health issue, particularly for those aged 11 to 15 in secondary schools (NDRS, n.d.). Recent changes in survey methods, transitioning from paper to online formats, have improved data collection and allowed the inclusion of social and emotional well-being factors, such as loneliness and exclusion. This provides a deeper understanding of the factors influencing smoking behaviors among youth.

#### **1.2 Problem Definition**

Despite declining smoking rates, the habit continues to affect secondary school students, posing long-term health risks and challenges for public health efforts. Current interventions often lack insights into the social and emotional dimensions that influence smoking behaviors. This study seeks to fill this gap by analyzing a comprehensive dataset to identify key predictors and inform more targeted prevention strategies.



#### **1.3 Objectives**

This study aims to examine smoking behaviors among secondary school students in England using survey data collected from 2021. The analysis focuses on identifying key trends, behaviors, and influencing factors, with the ultimate goal of providing recommendations to reduce smoking rates in this demographic.

#### **1.4 Significance of the Study**
This research offers actionable insights into the behavioral and social factors associated with youth smoking, enabling more effective public health campaigns and school-based interventions. By addressing these dimensions, the study contributes to broader efforts to support the well-being of young individuals and promote smoke-free lifestyles.

## **2. Data Overview and Preparation**
In this section, we will load the dataset, examine its structure, and define the key variables used in our analysis.


#### **2.1 Data Loading**
We begin by importing the necessary libraries and loading the dataset containing information about smoking behaviors, drug use, and alcohol consumption among youth. The dataset has been obtained from the NHS and is based on a survey conducted in 2021, targeting secondary school students in the UK.

In [3]:
# Importing core libraries for data manipulation, visualization, and machine learning
import numpy as np  # For numerical operations
import pandas as pd  # For handling dataframes
import matplotlib.pyplot as plt  # For plotting
import seaborn as sns  # For advanced visualization

# For missing data handling and imputation
from sklearn.experimental import enable_iterative_imputer  # To use iterative imputer
from sklearn.impute import KNNImputer, IterativeImputer, SimpleImputer  # For missing data handling
from sklearn.linear_model import LinearRegression  # To support imputation with regression

# For data preprocessing and splitting
from sklearn.preprocessing import StandardScaler  # For feature scaling
from sklearn.model_selection import train_test_split  # For splitting data

# Machine learning models
#from sklearn.ensemble import RandomForestClassifier
#from xgboost import XGBClassifier
#from lightgbm import LGBMClassifier

# For evaluation metrics and performance analysis
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    precision_recall_curve, confusion_matrix
)


In [4]:
# Load the dataset into a dataframe
df = pd.read_csv("./data/survey_data.csv")

#### **2.2 Data Dictionary**
To streamline the analysis, we define the key columns related to smoking behaviors and other factors of interest. These columns are extracted into a separate dataframe for targeted analysis.


In [5]:
# Import a helper function to display the data dictionary
from helpers import show_dictionary  

# Define the relevant columns for analysis related to smoking behaviors
smoking_cols = ['sex', 'age1115', 'ethnicgp5','cgpppar', 'cgppfrsa', 'cgfamn', 'cgfams',
                'cgwhypre', 'cgwhyadd', 'cgwhyrel', 'cgwhystr', 'cgwhygdf', 'cgwhycoo', 'okcg1', 
                'okcgw', 'cgshin', 'cgshcar', 'cgwhohme', 'einfsmk', 'dlssmk','dcgevr']

# Create a subset of the dataframe with only the smoking-related variables
df_smoking = df[smoking_cols].copy()


# Display the data dictionary for the selected columns
show_dictionary(df_smoking)

Unnamed: 0,Subsection,Variable Name,Description,Data Type
0,Demographics,sex,Sex of the respondent,float64
1,,age1115,Age group (11-15),float64
2,,ethnicgp5,Ethnicity categorized into 5 groups,float64
3,Family attitude,cgpppar,Whether parents/guardians of the respondent smoke,float64
4,,cgppfrsa,Whether friends of the respondent smoke,float64
5,,cgfamn,Family attitudes towards smoking (non-smokers),float64
6,,cgfams,Family attitudes towards smoking (smokers),float64
7,Personal attitudes,dcgevr,Has ever smoked,float64
8,,cgwhypre,Belief that peers smoke due to peer pressure,float64
9,,cgwhyadd,Belief that peers smoke because of addiction,float64


#### **2.3 Data Inspection and Summary Statistics**
Before performing any analysis, it's crucial to inspect the dataset for its shape, structure, and summary statistics. This step helps in understanding the data distribution, identifying potential issues such as missing values, and planning for any necessary preprocessing.


In [None]:
# Print the shape and datatype of the dataset to understand its dimensions and type consistency
print('The dataset has {0} rows and {1} columns of the datatype {2}'.format(
    df_smoking.shape[0], df_smoking.shape[1], df_smoking.dtypes.value_counts().index[0]))

In [None]:
# Display the first five rows to get an overview of the dataset's structure and sample values
df_smoking.head()

In [None]:
# Using the `.describe()` method, we obtain key summary statistics for each column
df_smoking.describe()

#### **2.4 Handling Missing Values**
In this dataset, missing values are encoded as `-1.0` and `-9.0` to represent "Not Applicable" or "Not Answered" responses and `-8.0`, and `6.0` to "Don't know". To standardize the handling of missing data, these values are replaced with `NaN`. This step allows for consistent processing in subsequent analyses.

In [None]:
# Replace encoded missing values (-1, -8, 6, -9) with NaN for consistent handling
df_smoking.replace([-1.0, -9.0, -8.0,6.0], np.nan, inplace=True)

##### **Missing Value Summary**
The proportion of missing values in each column is calculated to identify the extent of missingness. Columns with high missing values might require special attention, such as imputation or exclusion from the analysis.

In [None]:
# Calculate the percentage of missing values for each column
missing_values = df_smoking.isna().sum() / len(df_smoking) * 100
print('-' * 42)
print("Columns with the highest percentage of missing values:")
print(missing_values[missing_values > 50].sort_values(ascending=False))
print('-' * 42)

##### **Visualizing Missing Data**
A heatmap is used to visualize the distribution of missing values across the dataset. Yellow regions indicate missing values, helping us identify patterns or clusters of missing data.


In [None]:
# Visualize missing values
sns.heatmap(df_smoking.isna(), cbar=False, cmap='viridis');

In [None]:
# Calculate and visualize the correlation matrix to understand relationships between variables
correlation_matrix = df_smoking.corr()
# Visualize the correlation matrix using a heatmap
plt.figure(figsize=(12,8))
sns.heatmap(
    correlation_matrix,  # Correlation matrix data
    annot=True,          # Display correlation values
    cmap="coolwarm",     # Color scheme for better differentiation
    fmt='.2f',           # Format values to two decimal places
    vmin=-1, vmax=1      # Set min/max values for color scale
)
plt.title('Correlation matrix')
plt.show();


The correlation matrix visualizes relationships between numerical variables (Bock, 2018). High correlations (near 1 or -1) indicate strong relationships, guiding feature selection and engineering. Values near 0 suggest weak or no relationship. This helps identify key variables before imputing missing values.

#### **2.4.1 Imputation Strategies**
Each method was chosen based on the nature of the variables:

**Mode** the imputation for binary and categorical variables to retain class distributions.\
**Group-based** imputation to account for relationships between demographic variables.\
**KNN, regression, and iterative** imputation for more complex relationships, ensuring dependencies between features are respected.

In [None]:
# Mode Imputation for Binary/Categorical Variables
# Using the mode (most frequent value) as it is representative for such variables
binary_columns = ['einfsmk', 'dlssmk', 'dcgevr', 'sex']
for column in binary_columns:
    mode_value = df_smoking[column].mode()[0]  # Get the most frequent value
    df_smoking[column].fillna(mode_value, inplace=True)

# Group-Based Median Imputation for 'age1115'
# Impute missing 'age1115' values using the median within groups defined by 'sex'
df_smoking['age1115'] = df_smoking.groupby('sex', group_keys=False)['age1115'].apply(
    lambda group: group.fillna(group.median())
)

# Mode Imputation for 'ethnicgp5'
# Impute missing 'ethnicgp5' values by mode within groups defined by 'sex' and 'age1115'
df_smoking['ethnicgp5'] = df_smoking.groupby(['sex', 'age1115'])['ethnicgp5'].transform(
    lambda group: group.fillna(group.mode().iloc[0]) if not group.mode().empty else group
)

# KNN Imputation for Family Attitude Variables
family_columns = ['cgpppar', 'cgppfrsa', 'cgfamn', 'cgfams']
knn_imputer = KNNImputer(n_neighbors=5)
df_smoking[family_columns] = knn_imputer.fit_transform(df_smoking[family_columns])

# Regression-Based Imputation for Personal Attitude Variables
# Use linear regression to predict missing values for personal attitude variables
# Missing values are predicted based on the relationship with other features
personal_attitude_vars = ['cgwhypre', 'cgwhyadd', 'cgwhyrel', 'cgwhystr', 'cgwhygdf', 'cgwhycoo']
for target_var in personal_attitude_vars:
    # Split data into known and missing values
    known_data = df_smoking[df_smoking[target_var].notnull()]
    missing_data = df_smoking[df_smoking[target_var].isnull()]
    
    # Features (X) and target (y) for regression
    X = known_data.drop(columns=personal_attitude_vars).fillna(known_data.median())
    y = known_data[target_var]
    
    # Train regression model
    model = LinearRegression().fit(X, y)
    
    # Predict missing values using the trained model
    missing_X = missing_data.drop(columns=personal_attitude_vars).fillna(X.median())
    df_smoking.loc[df_smoking[target_var].isnull(), target_var] = model.predict(missing_X)

# Imputation for Exposure and Opinion Variables
# This method models each variable as a function of the others, iteratively updating estimates
exposure_opinion_columns = ['okcg1', 'okcgw', 'cgshin', 'cgshcar', 'cgwhohme']
iter_imputer = IterativeImputer(max_iter=10, random_state=42)
df_smoking[exposure_opinion_columns] = iter_imputer.fit_transform(df_smoking[exposure_opinion_columns])


After imputation, missing values are verified to ensure the dataset is ready for analysis.\
Categorical variables are rounded to maintain discrete values.

In [None]:

# Ensuring categorical and binary variables remain discrete after imputation by rounding
df_smoking = df_smoking.round()

# Generate a heatmap to confirm the absence of missing values
sns.heatmap(df_smoking.isna(), cbar=False, cmap='viridis');


## **3. Exploratory Data Analysis (EDA)**

#### **3.1 Univariate Analysis**

In [None]:
# Map values for the 'sex' variable
df_demographics = df_smoking[['sex', 'age1115', 'ethnicgp5']].copy()

# Replace numerical values with labels for ethnicity
ethnicity_map = {1.0: "White", 2.0: "Mixed", 3.0: "Asian", 4.0: "Black", 5.0: "Other"}
df_demographics['ethnicgp5'] = df_demographics['ethnicgp5'].replace(ethnicity_map)

# Replace numerical values with labels for gender
gender_map = {1.0: "Male", 2.0: "Female"}
df_demographics['sex'] = df_demographics['sex'].replace(gender_map)

# Rename columns for clarity in plots
df_demographics.rename(columns={'sex': 'Gender', 'age1115': 'Age Group', 'ethnicgp5': 'Ethnicity Group'}, inplace=True)

# Plot distributions for demographics
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
axes = axes.flatten()

for i, column in enumerate(df_demographics.columns):
    sns.countplot(data=df_demographics, x=column, ax=axes[i], color='#ffb3c6')
    axes[i].set_title(f"Distribution of {column}")
    axes[i].set_xlabel(column)
    axes[i].set_ylabel("Count")
    axes[i].tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show();


In [None]:
# Subset and map Family Attitude variables
df_family = df_smoking[['cgpppar', 'cgppfrsa', 'cgfamn', 'cgfams']].copy()

# Mapping values
family_map = {0.0: "No", 1.0: "Yes"}
df_family = df_family.replace(family_map)

# Define variable titles
family_vars = {
    'cgpppar': 'Parents Smoking',
    'cgppfrsa': 'Friend Smoking',
}

# Plot distributions
fig, axes = plt.subplots(1,2, figsize=(12, 5))
axes = axes.flatten()

for i, (var, title) in enumerate(family_vars.items()):
    sns.countplot(data=df_family, x=var, ax=axes[i], color='#ffb3c6')
    axes[i].set_title(f"Distribution of {title}")
    axes[i].set_xlabel(title)
    axes[i].set_ylabel("Count")
    axes[i].tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show();


In [None]:
# Subset and map Exposure variables
df_exposure = df_smoking[['cgshcar', 'cgwhohme']].copy()
df_exposure['cgwhohme'] = df_exposure['cgwhohme'].replace({1.0:"Yes",2.0:"No"})
# Mapping values
exposure_map = {1.0: "Always",2.0:"Often",3.0:"Sometimes",4.0:"Rarely",5.0:"Never"}
df_exposure = df_exposure.replace(exposure_map)

# Define variable titles
exposure_vars = {    
    'cgshcar': 'Secondhand Smoke in Cars',
    'cgwhohme': 'Smokers in the household'
}

# Plot distributions
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes = axes.flatten()

for i, (var, title) in enumerate(exposure_vars.items()):
    sns.countplot(data=df_exposure, x=var, ax=axes[i], color='#ffb3c6')
    axes[i].set_title(f"Distribution of {title}")
    axes[i].set_xlabel(title)
    axes[i].set_ylabel("Count")
    axes[i].tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show();



#### **3.2 Bivariate Analysis**

##### **3.2.1 Youth Perceptions of Peer Smoking Reasons with Age**

In [None]:
belief_df = df_smoking[['age1115', 'cgwhypre', 'cgwhygdf', 'cgwhycoo', 'cgwhystr']].copy()
# Rename belief columns for clarity
belief_df.rename(columns={
    'cgwhypre': 'Pressure',
    'cgwhygdf': 'Good Feeling',
    'cgwhycoo': 'Look Cool',
    'cgwhystr': 'Cope with Stress'
}, inplace=True)

# Melt the dataset into long format
belief_long = belief_df.melt(id_vars='age1115', var_name='Belief', value_name='Value')

# Filter rows where the belief holds true (Value == 1)
belief_long = belief_long[belief_long['Value'] == 1]

# Calculate the proportion of each belief within each age group
belief_proportion = belief_long.groupby(['age1115', 'Belief']).size().reset_index(name='Count')
total_by_age = belief_long.groupby('age1115').size().reset_index(name='Total')
belief_proportion = belief_proportion.merge(total_by_age, on='age1115')
belief_proportion['Proportion'] = belief_proportion['Count'] / belief_proportion['Total']

print("-" *50)
print("Proportions table")
print(belief_proportion)
print("-" *50)

# Pivot the data for stacked bar plotting
belief_pivot = belief_proportion.pivot(index='age1115', columns='Belief', values='Proportion').fillna(0)

# Plot the stacked bar chart
plt.figure(figsize=(8, 5))
belief_pivot.plot(kind='bar', stacked=True, width=0.8, color=['#9ecae1', '#fdae61', '#fee08b', '#ffb3c6'], ax=plt.gca())
plt.title('Proportion of Beliefs About Why Peers Smoke by Age Group')
plt.xlabel('Age')
plt.ylabel('Proportion of Respondents')
plt.xticks(rotation=0)

# Place the legend outside the plot on the right
plt.legend(title='Belief About Why Peers Smoke', loc='center left', bbox_to_anchor=(1, 0.5))

# Ensure layout is tight to avoid clipping
plt.tight_layout()
plt.show();


This chart highlights how beliefs about why peers smoke vary by age group, with peer pressure consistently being the most cited reason across all ages. As age increases, the perception of smoking to "look cool" slightly declines, while beliefs about smoking for "coping with stress" and "good feelings" remain relatively stable.

##### **3.2.2 Proportion of Youth Who Have Smoked by Exposure to Secondhand Smoke**

In [None]:
exposure_df = df_smoking[['cgshin', 'dcgevr']].copy()

# Map codes to meaningful labels
exposure_df['cgshin'] = exposure_df['cgshin'].map({
    5: 'Never',
    4: 'Rarely',
    3: 'Sometimes',
    2: 'Often',
    1: 'Always'
})
exposure_df['dcgevr'] = exposure_df['dcgevr'].map({1: 'Yes', 2: 'No'})

# Calculate proportions
exposure_counts = exposure_df.groupby(['cgshin', 'dcgevr']).size().reset_index(name='Count')
total_by_exposure = exposure_df.groupby('cgshin').size().reset_index(name='Total')
exposure_proportion = exposure_counts.merge(total_by_exposure, on='cgshin')
exposure_proportion['Proportion'] = exposure_proportion['Count'] / exposure_proportion['Total']
print("-" *50)
print("Proportions table")
print(exposure_proportion)
print("-" *50)

# Reorder the categories for exposure
exposure_order = ['Never', 'Rarely', 'Sometimes', 'Often', 'Always']

# Plot the proportional bar chart for exposure
plt.figure(figsize=(8, 5))
sns.barplot(data=exposure_proportion, x='cgshin', y='Proportion', hue='dcgevr', palette=['#ffb3c6', '#b3e6ff'], order=exposure_order)
plt.title('Proportion of Youth Who Have Smoked by Exposure to Secondhand Smoke')
plt.xlabel('Exposure to Secondhand Smoke')
plt.ylabel('Proportion of Respondents')
plt.legend(title='Has Ever Smoked')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show();

This visualization demonstrates a strong positive relationship between exposure to secondhand smoke and the likelihood of having ever smoked. Respondents who reported "always" being exposed to secondhand smoke had the highest proportion of smokers, while those with no exposure were the least likely to smoke. This highlights the significant influence of secondhand smoke exposure as a risk factor for smoking behavior among youth.

##### **3.2.3 Peer Influences**

In [None]:
peer_influence_df = df_smoking[['cgppfrsa', 'dcgevr']].copy()

# Map codes to meaningful labels for the number of friends who smoke
peer_influence_df['cgppfrsa'] = peer_influence_df['cgppfrsa'].map({
    0: 'No Friends Smoke',
    1: 'Some Friends Smoke',
})
peer_influence_df['dcgevr'] = peer_influence_df['dcgevr'].map({1: 'Yes', 2: 'No'})

# Calculate proportions
peer_counts = peer_influence_df.groupby(['cgppfrsa', 'dcgevr']).size().reset_index(name='Count')
total_by_peer = peer_influence_df.groupby('cgppfrsa').size().reset_index(name='Total')
peer_proportion = peer_counts.merge(total_by_peer, on='cgppfrsa')
peer_proportion['Proportion'] = peer_proportion['Count'] / peer_proportion['Total']
print("-" *50)
print("Proportions table")
print(peer_proportion)
print("-" *50)

# Plot the proportional bar chart
plt.figure(figsize=(8, 5))
sns.barplot(data=peer_proportion, x='cgppfrsa', y='Proportion', hue='dcgevr', palette=['#ffb3c6', '#b3e6ff'])
plt.title('Proportion of Youth Who Have Smoked by Peer Influence')
plt.xlabel('Peer Smoking')
plt.ylabel('Proportion of Respondents')
plt.legend(title='Has Ever Smoked')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show();

This visualization highlights the influence of peer smoking behavior on youth smoking habits. Among respondents with friends who smoke, a significantly higher proportion reported having ever smoked compared to those whose friends do not smoke. This indicates a strong correlation between peer influence and smoking behavior, underscoring the social pressures and environmental factors that contribute to youth smoking.

##### **3.2.4 Education Efforts**

In [None]:
# Create a new DataFrame for education's effect on smoking
education_df = df_smoking[['einfsmk', 'dlssmk', 'dcgevr']].copy()

# Map codes to meaningful labels
education_df['einfsmk'] = education_df['einfsmk'].map({2.0: 'No Info Provided', 1.0: 'School provided Info'})
education_df['dlssmk'] = education_df['dlssmk'].map({2.0: 'No Lessons', 1.0: 'Lessons Provided'})
education_df['dcgevr'] = education_df['dcgevr'].map({1: 'Yes', 2: 'No'})

# Melt the dataset for proportional analysis
education_long = education_df.melt(id_vars='dcgevr', var_name='Education Variable', value_name='Response')

# Calculate proportions
education_counts = education_long.groupby(['Education Variable', 'Response', 'dcgevr']).size().reset_index(name='Count')
total_by_edu = education_long.groupby(['Education Variable', 'Response']).size().reset_index(name='Total')
education_proportion = education_counts.merge(total_by_edu, on=['Education Variable', 'Response'])
education_proportion['Proportion'] = education_proportion['Count'] / education_proportion['Total']
print("-" *50)
print("Proportions table")
print(education_proportion)
print("-" *50)

# Plot the proportional bar chart
plt.figure(figsize=(8, 5))
sns.barplot(data=education_proportion, x='Response', y='Proportion', hue='dcgevr', palette=['#ffb3c6', '#b3e6ff'])
plt.title('Proportion of Youth Who Have Smoked by Education Efforts')
plt.xlabel('Education Efforts')
plt.ylabel('Proportion of Respondents')
plt.legend(title='Has Ever Smoked', loc='upper right')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show();


This visualization demonstrates the role of education efforts in influencing youth smoking behavior. Respondents who were provided with school lessons or information on smoking reported significantly lower proportions of having ever smoked compared to those who received no such education. This highlights the potential effectiveness of educational interventions in reducing youth smoking rates and emphasizes the need for consistent and comprehensive education programs targeting smoking prevention.

We further explore relationships between our variables and smoking behaviour:

In [None]:
age_smoking = df_smoking.groupby('age1115')['dcgevr'].value_counts(normalize=True).unstack()
age_smoking.columns = ['Yes', 'No'] 
print("-" * 42)
print(age_smoking)
print("-" * 42)

In [None]:
ethnic_smoking = df_smoking.groupby('ethnicgp5')['dcgevr'].value_counts(normalize=True).unstack()
ethnic_smoking.columns = ['Yes', 'No']  
print("-" * 42)
print(ethnic_smoking)
print("-" * 42)

In [None]:
peer_smoking = df_smoking.groupby('cgppfrsa')['dcgevr'].value_counts(normalize=True).unstack()
peer_smoking.columns = ['Yes', 'No']  
print("-" * 42)
print(peer_smoking)
print("-" * 42)

In [None]:
belief_vars = ['cgwhypre', 'cgwhyadd', 'cgwhyrel', 'cgwhystr']
for var in belief_vars:
    belief_smoking = df_smoking.groupby(var)['dcgevr'].value_counts(normalize=True).unstack()
    belief_smoking.columns = ['Yes', 'No']  # Assuming 1=Yes, 2=No in dcgevr
    print(f"Smoking behavior based on {var}:\n", belief_smoking)


## **4. Feature Engineering**

#### **4.1 Composite Features**
We create composite features by aggregating related variables to simplify the analysis and highlight key behavioral and environmental influences. These include family influence, social influence, and exposure scores.

In [None]:
# Calculating the average of variables related to peer pressure
df_smoking['peer_pressure_score'] = df_smoking[
    ['cgwhypre', 'cgwhyadd', 'cgwhyrel', 'cgwhystr', 'cgwhygdf', 'cgwhycoo']
].mean(axis=1)

# Averaging variables indicating secondhand smoke exposure
df_smoking['exposure_score'] = df_smoking[['cgshin', 'cgshcar', 'cgwhohme']].mean(axis=1)

# Averaging family attitude variables to gauge family influence
df_smoking['family_influence'] = df_smoking[['cgpppar', 'cgppfrsa', 'cgfamn', 'cgfams']].mean(axis=1)

# Calculating attitude scores based on individual motivations for smoking
df_smoking['attitude_score'] = df_smoking[['okcg1', 'okcgw']].mean(axis=1)

#### **4.2 Binary Flags**

In [None]:
# Flag indicating if an individual smokes for stress relief
df_smoking['smokes_for_stress'] = (df_smoking['cgwhystr'] > 0).astype(int)

# Flag indicating if an individual smokes to relax
df_smoking['smokes_to_relax'] = (df_smoking['cgwhygdf'] > 0).astype(int)

#### **4.3 Interaction Feature**
Interaction features highlight relationships between variables, such as the interplay of gender and exposure. These features can improve predictive modeling by capturing nuanced patterns in the data (Zhao and Cen,2014).

In [None]:
# Interaction between gender and exposure score
df_smoking['gender_exposure_interaction'] = df_smoking['sex'] * df_smoking['exposure_score']

#### **4.4 Encode Categorical Variables**
Categorical variables are encoded into numerical representations

In [None]:
# One-hot encoding the 'ethnicgp5' column to represent categorical data numerically
df_smoking = pd.get_dummies(df_smoking, columns=['ethnicgp5'], drop_first=True)

#### **4.5 Final Dataset Summary**
This section provides an overview of the final dataset after data cleaning, feature engineering, and encoding. It ensures that the dataset is ready for predictive modeling

In [None]:
# Display the shape of the final dataset
print(f"The final dataset has {df_smoking.shape[0]} rows and {df_smoking.shape[1]} columns.")


# Preview the final dataset
print("\nPreview of the final dataset:")
display(df_smoking.head())


## **5. Predictive Modelling**
For this study, we selected machine learning models capable of handling both engineered features and imbalanced datasets effectively. The target variable, representing smoking behavior, exhibits class imbalance, where fewer students report engaging in smoking behavior. To address this, we chose models that offer mechanisms like class weighting or boosting to ensure accurate predictions for the minority class.

These models are well-suited for capturing complex relationships and interactions within the dataset, such as the effects of family influence, peer pressure, and demographic factors.

Below, we provide a detailed justification for each model.

#### **5.1 Model Selection**

**LightGBM** (Light Gradient Boosting Machine) is an efficient gradient-boosting algorithm designed for large datasets and high-dimensional data (Analytics Vidhya, 2021). It is particularly effective for datasets with class imbalances, as it includes the class_weight parameter to penalize misclassification of the minority class.

**Random Fores**t is a learning method that builds multiple decision trees and aggregates their outputs to improve accuracy and robustness (IBM, 2023). It is less prone to overfitting compared to individual decision trees, making it a reliable choice for datasets with complex datasets

**XGBoost** (Extreme Gradient Boosting) is a high-performance that models complex relationships effectively. It works well with imbalanced datasets by using the scale_pos_weight parameter to balance the contribution of the minority class during training (NVIDIA Data Science Glossary, n.d.).

#### **5.2 Model Training and Evaluation**
We preprocess the data by scaling the features using `StandardScaler`, split the dataset into training and testing sets, and train each model using the training set. 

Performance metrics (Accuracy, Precision, Recall, F1-Score, ROC-AUC) are calculated to evaluate model performance.


In [None]:
# Define features and target variable

ml_cols = [
    'family_influence',
    'peer_pressure_score',     # Engineered feature
    'attitude_score',          # Engineered feature
    'exposure_score',          # Engineered feature
    'smokes_for_stress',       # Binary flag
    'smokes_to_relax',         # Binary flag
    'gender_exposure_interaction',  # Interaction term
    'sex',                     # Original feature
    'age1115',                 # Original feature
    'ethnicgp5_2.0', 'ethnicgp5_3.0', 'ethnicgp5_4.0', 'ethnicgp5_5.0'
]

target_col = 'dcgevr'

# Separate features and target variable
X = df_smoking[ml_cols]
y = df_smoking[target_col]

# Map target labels to 0 and 1 if needed
if y.min() > 0:  
    y = y.replace({1.0: 0, 2.0: 1})


In [None]:
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, stratify=y, random_state=42
)

In [None]:
# Define models
class_counts = y_train.value_counts()
majority_class = class_counts.index[0]
minority_class = class_counts.index[1]

models = {
    "LightGBM": LGBMClassifier(
        class_weight='balanced', learning_rate=0.1, n_estimators=100, random_state=42
    ),
    "Random Forest": RandomForestClassifier(
        class_weight='balanced', n_estimators=100, random_state=42
    ),
    "XGBoost": XGBClassifier(
        scale_pos_weight=class_counts[majority_class] / class_counts[minority_class],
        eval_metric='logloss', random_state=42
    )
}


In [None]:
# Train models and evaluate
results = []

for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Evaluate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    # Store results
    results.append({
        "Model": name,
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1-Score": f1,
        "ROC-AUC": roc_auc
    })

#### **5.3 Model Comparison**
The results of the models are compared using key metrics:
- **Accuracy**: Overall correctness of the predictions.
- **Precision**: Proportion of correctly identified positive cases.
- **Recall**: Ability to capture all positive cases.
- **F1-Score**: Harmonic mean of Precision and Recall.
- **ROC-AUC**: Model's ability to discriminate between positive and negative cases.

The table below summarizes the performance metrics for each model:


In [None]:
# Display results in a DataFrame
comparison_df = pd.DataFrame(results)
print(comparison_df)

Based on the evaluation metrics, XGBoost emerged as the strongest model due to its exceptional recall (0.9921) and F1-score (0.9288), which highlight its effectiveness in identifying smoking behaviors and balancing precision with recall. Random Forest followed with strong overall performance, particularly in accuracy (0.8127) and precision (0.9234), indicating its robustness for general predictions. LightGBM, while achieving a respectable accuracy (0.7467), demonstrated slightly lower recall, suggesting it may be less effective in identifying all smoking cases. Each model leveraged the dataset’s features differently, but XGBoost’s ability to handle imbalances and complex interactions gave it a clear edge.

In [None]:
# Feature Importance for XGBoost
xgb_model = models["XGBoost"]
feature_importances = xgb_model.feature_importances_
features = ml_cols

# Create a DataFrame for feature importances
feature_imp_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})
feature_imp_df = feature_imp_df.sort_values(by='Importance', ascending=True)

# Plot
plt.figure(figsize=(10, 6))
plt.barh(feature_imp_df['Feature'], feature_imp_df['Importance'], color='#9ecae1')
plt.xlabel('Importance')
plt.title('XGBoost Feature Importance')
plt.show();


The XGBoost model highlights **attitude_score**, **peer_pressure_score**, and **family_influence** as the top predictors for smoking behavior. These features underscore the significant impact of social and psychological factors. Such insights can guide interventions targeting these specific areas.

In [None]:
# Precision-Recall Curve
y_pred_proba = xgb_model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)

plt.figure(figsize=(10, 6))
plt.plot(recall, precision, marker='.')
plt.title('Precision-Recall Curve (XGBoost)')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show();

The curve demonstrates the XGBoost model's strength in maintaining high precision while covering a wide range of recall values. This balance is critical for identifying smoking behaviors effectively without excessive false positives.

In [None]:
# Confusion Matrix
y_pred = xgb_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (XGBoost)')
plt.show();

The confusion matrix reveals strong performance, with a high number of true positives (correctly predicted smokers) compared to a minimal number of false negatives. This suggests the model is reliable in detecting smoking behavior while maintaining minimal errors in prediction.

## **6. Findings and Limitations**


#### **6.1 Key Findings**
The findings below provide valuable insights into the factors influencing youth smoking behavior, including demographic, social, and environmental variables. They underline the importance of targeted education, family involvement, and peer influence mitigation in reducing smoking prevalence.

- Smoking prevalence increases significantly with age among respondents. At age 11, only 1.9% of respondents reported smoking, but by age 15, this rises to 26.2%. This highlights adolescence as a critical period for smoking prevention efforts.
  
- Among respondents whose friends do not smoke, only 10.4% report smoking. This proportion rises sharply to 32.4% for respondents with friends who smoke. This finding underscores the significant role of social influence and peer pressure in smoking initiation.

- Exposure to secondhand smoke in households and cars is associated with higher smoking prevalence. For instance, 32.2% of respondents exposed to smoke "Always" in households reported smoking, compared to only 5.3% among those "Never" exposed. These results suggest that secondhand exposure may normalize smoking behavior within families.

- Only 12.1% of respondents who received smoking lessons in school reported smoking, while 17.2% of those with "No information provided" reported smoking. Similarly, schools providing smoking-related information reduced smoking prevalence to 12.2%.

- Respondents who believe peers smoke to "relax" or "cope with stress" are significantly more likely to smoke themselves (24.5% and 15.5%, respectively) compared to those who do not hold these beliefs (4.2% and 4.8%, respectively).

- Among the models tested, XGBoost emerged as the most effective, achieving the highest recall (99.2%) and F1-Score (92.9%) with a strong accuracy (87.0%). This model effectively identified smoking behavior while maintaining a balance between false positives and negatives.

  
- Feature importance analysis revealed that attitude scores and peer pressure scores were the most significant predictors, emphasizing the psychological and social dimensions of smoking.


#### **6.2 Challenges and Limitations**
During our analysis, several limitations and challenges in the dataset became apparent. While we made efforts to address some issues, such as class imbalances and missing data, others required more in-depth exploration or alternative approaches that were beyond the scope of this project. Due to time constraints and the complexity of some factors, these limitations remain partially unresolved. These limitations are shown below:

- Imbalanced Data: The dataset exhibited significant class imbalances which requires advanced techniques such as resampling to mitigate their impact during predictive modeling.
  
- Many variables contained missing data, While imputation efforts aimed to maintain data integrity, the process may have introduced biases or inaccuracies.

- Some questions were specific to smokers or non-smokers. It created challenges in drawing consistent comparisons and limiting the generalization of some variables to the entire population.

- Because the survey is self reported, Responses might be biased. Participants may have misrepresented or underreported certain behaviors, affecting data reliability.

#### **References**
Analytics Vidhya. (2021). LightGBM in Python | Complete guide on how to Use LightGBM in Python. [online] Available at: https://www.analyticsvidhya.com/blog/2021/08/complete-guide-on-how-to-use-lightgbm-in-python/.

Bock, T. (2018). What is a Correlation Matrix? | Displayr. [online] Displayr. Available at: https://www.displayr.com/what-is-a-correlation-matrix/.

IBM (2023). What Is Random Forest? | IBM. [online] www.ibm.com. Available at: https://www.ibm.com/topics/random-forest.

NDRS. (n.d.). Smoking, Drinking and Drug Use among Young People in England, 2021. [online] Available at: https://digital.nhs.uk/data-and-information/publications/statistical/smoking-drinking-and-drug-use-among-young-people-in-england/2021#data-sets.

NVIDIA Data Science Glossary. (n.d.). What is XGBoost? [online] Available at: https://www.nvidia.com/en-gb/glossary/xgboost/.

Zhao, Y. and Cen, Y. (). Data mining applications with R. Waltham, MA: Academic Press.

#### **GenAI**

![alt text2](GenAI-components.jpg)

#### **Trello**

Week 1

![](./Trello/1.jpeg)

Week 2

![](./Trello/2.jpeg)

Week 3

![](./Trello/3.jpeg)

Week 4

![](./Trello/4.jpeg)

Week 5

![](./Trello/5.jpeg)

Week 6

![](./Trello/6.jpeg)

Week 7

![](7.jpg)