## Data preparation

Select only the features listed above.
Then check whether any of these features contain missing values.

In [3]:

import pandas as pd

# Specify the full path to the CSV file
csv_path = r'C:\Users\marti\Desktop\Machine_Learning\Zoomcamp\bank-full.csv'

# Load the data from the CSV file
df = pd.read_csv(csv_path, sep=';')

# Check the first few rows of the DataFrame
print(df.head())



   age           job  marital  education default  balance housing loan  \
0   58    management  married   tertiary      no     2143     yes   no   
1   44    technician   single  secondary      no       29     yes   no   
2   33  entrepreneur  married  secondary      no        2     yes  yes   
3   47   blue-collar  married    unknown      no     1506     yes   no   
4   33       unknown   single    unknown      no        1      no   no   

   contact  day month  duration  campaign  pdays  previous poutcome   y  
0  unknown    5   may       261         1     -1         0  unknown  no  
1  unknown    5   may       151         1     -1         0  unknown  no  
2  unknown    5   may        76         1     -1         0  unknown  no  
3  unknown    5   may        92         1     -1         0  unknown  no  
4  unknown    5   may       198         1     -1         0  unknown  no  


## Select specific columns

In [4]:
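# Select only the specified columns; .copy() gives an independent DataFrame,
# which avoids a possible SettingWithCopyWarning when 'y' is reassigned later
columns = ['age', 'job', 'marital', 'education', 'balance', 'housing', 'contact', 
           'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'y']
df = df[columns].copy()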

# Display the first few rows of the selected columns
print(df.head())


   age           job  marital  education  balance housing  contact  day month  \
0   58    management  married   tertiary     2143     yes  unknown    5   may   
1   44    technician   single  secondary       29     yes  unknown    5   may   
2   33  entrepreneur  married  secondary        2     yes  unknown    5   may   
3   47   blue-collar  married    unknown     1506     yes  unknown    5   may   
4   33       unknown   single    unknown        1      no  unknown    5   may   

   duration  campaign  pdays  previous poutcome   y  
0       261         1     -1         0  unknown  no  
1       151         1     -1         0  unknown  no  
2        76         1     -1         0  unknown  no  
3        92         1     -1         0  unknown  no  
4       198         1     -1         0  unknown  no  


## Check missing values

In [5]:
# Check for missing values in each column
missing_values = df.isnull().sum()
print("Missing values per column:\n", missing_values)

# Check which columns have any missing values
missing_columns = df.isnull().any()
print("\nColumns with missing values:\n", missing_columns[missing_columns])

# Get a summary of the DataFrame
print("\nDataFrame info:\n")
df.info()

# Get a statistical summary of the DataFrame
print("\nStatistical summary:\n", df.describe())

Missing values per column:
 age          0
job          0
marital      0
education    0
balance      0
housing      0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

Columns with missing values:
 Series([], dtype: bool)

DataFrame info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   balance    45211 non-null  int64 
 5   housing    45211 non-null  object
 6   contact    45211 non-null  object
 7   day        45211 non-null  int64 
 8   month      45211 non-null  object
 9   duration   45211 non-null  int64 
 10  campaign   45211 non-null  int64 
 11  pdays      45211 non-null  int64 
 12  p
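
The `isnull()` check above only counts real `NaN` values. The `head()` output shows that this dataset also uses the literal string `unknown` as a placeholder in several categorical columns, and `isnull()` does not flag those. A minimal sketch of how such placeholders could be counted, using a small made-up frame (the `toy` data is an assumption for illustration, not the real dataset):

```python
import pandas as pd

# Toy frame mimicking a few categorical columns (values are made up)
toy = pd.DataFrame({
    "job": ["management", "unknown", "technician", "unknown"],
    "education": ["tertiary", "secondary", "unknown", "primary"],
    "housing": ["yes", "no", "yes", "no"],
})

# isnull() finds no NaN values here
nan_counts = toy.isnull().sum()

# Counting the literal string "unknown" reveals the placeholder values
unknown_counts = (toy == "unknown").sum()
print(unknown_counts)
```

Whether `unknown` should be treated as missing or kept as its own category is a modelling decision; the notebook keeps it as a category.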

## Question 1

What is the most frequent observation (mode) for the column education?

In [6]:
# Mode of the 'education' column
mode_education = df['education'].mode()[0]
print("Mode of the 'education' column:", mode_education)


Mode of the 'education' column: secondary
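
As a cross-check on `mode()`, `value_counts()` shows the full frequency table, which makes it easy to see how far ahead the most frequent value is. A small sketch on made-up values (the `education` Series below is illustrative only):

```python
import pandas as pd

# Made-up sample of education levels
education = pd.Series(
    ["secondary", "tertiary", "secondary", "primary", "unknown", "secondary"]
)

# Frequency of each value, sorted from most to least common
counts = education.value_counts()
most_frequent = counts.idxmax()

print(counts)
print("Most frequent:", most_frequent)
```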


## Question 2

Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

In [7]:
# Select the numerical columns
numerical_cols = ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']

# Create the correlation matrix
correlation_matrix = df[numerical_cols].corr()

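# Rank all feature pairs by correlation, highest first
# (signed values are ranked, so a strong negative correlation would not come out on top)
max_corr = correlation_matrix.unstack().sort_values(ascending=False)
# Ignore the diagonal (self-correlation) and keep each pair only once
max_corr = max_corr[max_corr < 1].drop_duplicates()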

# Print the pair of features with the highest correlation
print(max_corr.head(1))


previous  pdays    0.45482
dtype: float64
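
The cell above ranks signed correlations and removes the diagonal by filtering on values below 1. An alternative sketch, under the assumption that "biggest correlation" means largest in absolute value, is to take absolute values and keep only the strict upper triangle of the matrix, so each pair appears exactly once. The `toy` data below is synthetic:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": -a + rng.normal(scale=0.1, size=200),  # strongly negatively correlated with a
    "c": rng.normal(size=200),                   # independent noise
})

# Absolute correlations, so strong negative relationships also rank high
corr = toy.corr().abs()

# Boolean mask for the strict upper triangle: drops the diagonal and mirrored pairs
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

top_pair = pairs.index[0]
print(top_pair, round(pairs.iloc[0], 3))
```

With this approach a pair whose correlation is exactly 1 is not discarded by accident, which the `< 1` filter would do.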


## Target encoding

Now we want to encode the y variable.

Let's replace the values yes/no with 1/0.

In [8]:
# Encode the 'y' column
df['y'] = df['y'].map({'yes': 1, 'no': 0})
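
One thing to keep in mind with `map` and a dictionary: any value that is not a key in the dictionary becomes `NaN` silently. A short sketch of a sanity check, on a made-up Series that contains a stray capitalised value (an assumption for illustration; the real column may contain only `yes` and `no`):

```python
import pandas as pd

# Made-up target values; the last one does not match either dictionary key
y_raw = pd.Series(["yes", "no", "no", "Yes"])

y_mapped = y_raw.map({"yes": 1, "no": 0})

# Values missing from the dictionary turn into NaN, so count them explicitly
n_unmapped = y_mapped.isnull().sum()
print("Unmapped values:", n_unmapped)

# A comparison-based encoding never produces NaN: anything other than "yes" becomes 0
y_binary = (y_raw == "yes").astype(int)
print(y_binary.tolist())
```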


## Split the data

Split your data in train/val/test sets with 60%/20%/20% distribution.

Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.

Make sure that the target value y is not in your dataframe.

In [9]:
from sklearn.model_selection import train_test_split

# Split the data into training, validation, and test sets
df_train, df_temp = train_test_split(df, test_size=0.4, random_state=42)
df_val, df_test = train_test_split(df_temp, test_size=0.5, random_state=42)

# Extract the target variable (y)
y_train = df_train['y']
y_val = df_val['y']
y_test = df_test['y']

# Remove the target variable from the features
X_train = df_train.drop(columns=['y'])
X_val = df_val.drop(columns=['y'])
X_test = df_test.drop(columns=['y'])
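
The two-step split works because 40% of the data is held out first and then halved, giving 20% each for validation and test. A quick sketch on synthetic data (the `toy` frame is made up) confirms the resulting sizes and that the subsets do not overlap:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with 1000 rows
toy = pd.DataFrame({"x": np.arange(1000), "y": np.arange(1000) % 2})

# Hold out 40%, then split that part in half
train, temp = train_test_split(toy, test_size=0.4, random_state=42)
val, test = train_test_split(temp, test_size=0.5, random_state=42)

sizes = (len(train), len(val), len(test))
print(sizes)

# Every row should land in exactly one subset
all_indices = set(train.index) | set(val.index) | set(test.index)
print(len(all_indices) == len(toy))
```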


## Question 3

Calculate the mutual information score between y and other categorical variables in the dataset. Use the training set only.

Round the scores to 2 decimals using round(score, 2).

In [10]:
from sklearn.feature_selection import mutual_info_classif

# Select the categorical columns
categorical_cols = ['job', 'marital', 'education', 'housing', 'contact', 'month', 'poutcome']

# Apply One-Hot encoding on the categorical columns
X_train_encoded = pd.get_dummies(X_train, columns=categorical_cols)

# Calculate mutual information scores
mi_scores = mutual_info_classif(X_train_encoded, y_train)

# Display the scores: one per encoded column, with the numerical columns included
mi_scores_df = pd.Series(mi_scores, index=X_train_encoded.columns).round(2)
print(mi_scores_df)


age                    0.01
balance                0.02
day                    0.01
duration               0.07
campaign               0.00
pdays                  0.03
previous               0.01
job_admin.             0.00
job_blue-collar        0.00
job_entrepreneur       0.00
job_housemaid          0.00
job_management         0.00
job_retired            0.00
job_self-employed      0.00
job_services           0.00
job_student            0.00
job_technician         0.00
job_unemployed         0.00
job_unknown            0.00
marital_divorced       0.00
marital_married        0.00
marital_single         0.00
education_primary      0.00
education_secondary    0.00
education_tertiary     0.01
education_unknown      0.00
housing_no             0.01
housing_yes            0.02
contact_cellular       0.01
contact_telephone      0.00
contact_unknown        0.01
month_apr              0.00
month_aug              0.00
month_dec              0.00
month_feb              0.00
month_jan           

In [11]:
# Find the variable with the highest mutual information score
max_mi_variable = mi_scores_df.idxmax()  # Get the index (variable name) with the highest score
max_mi_score = mi_scores_df.max()         # Get the highest mutual information score

# Print the result
print(f"The variable with the highest mutual information score is '{max_mi_variable}' with a score of {max_mi_score}.")


The variable with the highest mutual information score is 'duration' with a score of 0.07.
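
Note that the question asks about the categorical variables, while the cells above score every one-hot column together with the numerical ones, which is why a numerical feature (`duration`) comes out on top. One possible alternative, sketched here on a tiny made-up frame, is to compute `mutual_info_score` from `sklearn.metrics` directly between each original categorical column and the target. The `toy` data and its values are assumptions for illustration, not results from the real dataset:

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Tiny made-up training frame: 'contact' fully determines y, 'marital' barely helps
toy = pd.DataFrame({
    "contact": ["cellular", "cellular", "unknown", "unknown", "cellular", "unknown"],
    "marital": ["single", "married", "single", "married", "single", "married"],
    "y":       [1, 1, 0, 0, 1, 0],
})

categorical = ["contact", "marital"]

# One score per original categorical column, rounded as the question asks
mi = {col: round(mutual_info_score(toy[col], toy["y"]), 2) for col in categorical}

best = max(mi, key=mi.get)
print(mi)
print("Highest mutual information:", best)
```

Applied to the notebook's data, the same loop would run over `categorical_cols` with `X_train[col]` and `y_train`.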


## Question 4

Now let's train a logistic regression.

Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.

Fit the model on the training dataset.

To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)

Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

In [12]:
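from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# One-hot encode all categorical variables
X_train_encoded = pd.get_dummies(X_train)
X_val_encoded = pd.get_dummies(X_val)

# Align the validation columns with the training columns in case a category
# is missing from one of the splits
X_val_encoded = X_val_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)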

# Train a Logistic Regression model
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train_encoded, y_train)

# Calculate the accuracy on the validation set
y_pred_val = model.predict(X_val_encoded)
accuracy_val = accuracy_score(y_val, y_pred_val)
print("Accuracy:", round(accuracy_val, 2))


Accuracy: 0.9
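
Calling `pd.get_dummies` separately on the training and validation frames only produces matching columns when both splits contain the same set of categories. A hedged alternative is scikit-learn's `DictVectorizer`, which learns the columns from the training data and reuses them for any later data. The sketch below uses two made-up frames in which the validation set contains a category the training set never saw:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Made-up frames: 'unemployed' appears only in the validation data
train = pd.DataFrame({"job": ["admin.", "student", "retired"], "age": [30, 22, 65]})
val = pd.DataFrame({"job": ["student", "unemployed"], "age": [25, 40]})

# get_dummies on each frame separately yields different column counts
n_train_cols = pd.get_dummies(train).shape[1]
n_val_cols = pd.get_dummies(val).shape[1]
print(n_train_cols, n_val_cols)

# DictVectorizer fixes the columns at fit time and ignores unseen categories later
dv = DictVectorizer(sparse=False)
X_tr = dv.fit_transform(train.to_dict(orient="records"))
X_va = dv.transform(val.to_dict(orient="records"))
print(X_tr.shape, X_va.shape)
```

The unseen category simply gets zeros in all `job` columns, so the model receives a matrix of the shape it was trained on.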


## Question 5

Let's find the least useful feature using the feature elimination technique.

Train a model with all these features (using the same parameters as in Q4).

Now exclude each feature from this set and train a model without it. Record the accuracy for each model.

For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

In [18]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train the model with all features
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train_encoded, y_train)

# Accuracy with all features
accuracy_full = accuracy_score(y_val, model.predict(X_val_encoded))
print("Accuracy with all features:", round(accuracy_full, 4))

# Features to evaluate
# ('marital' is one-hot encoded into marital_* columns, so it is reported as not found below)
features_to_evaluate = ['age', 'balance', 'marital', 'previous']

# Dictionary to store the accuracy differences
differences = {}

# Exclude each feature in turn and measure the impact on accuracy
for feature in features_to_evaluate:
    # Check that the feature is present in the encoded columns
    if feature in X_train_encoded.columns:
        # Exclude the feature
        X_train_reduced = X_train_encoded.drop(columns=[feature])
        X_val_reduced = X_val_encoded.drop(columns=[feature])

        # Train the model without this feature
        model.fit(X_train_reduced, y_train)

        # Accuracy without the feature
        accuracy_reduced = accuracy_score(y_val, model.predict(X_val_reduced))

        # Compute the difference
        difference = accuracy_full - accuracy_reduced
        differences[feature] = round(difference, 4)
    else:
        print(f"Feature '{feature}' not found in the encoded features.")

# Print the differences
for feature, diff in differences.items():
    print(f"Difference for {feature}: {diff}")

# Find the feature with the smallest difference
if differences:  # Make sure there are differences to evaluate
    least_impact_feature = min(differences, key=differences.get)
    print("Feature with the smallest difference:", least_impact_feature)
else:
    print("No valid features to evaluate.")




Accuracy with all features: 0.9005
Feature 'marital' not found in the encoded features.
Difference for age: -0.0002
Difference for balance: -0.0003
Difference for previous: -0.0001
Feature with the smallest difference: balance
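
The output above reports `marital` as not found because one-hot encoding replaces it with several `marital_*` columns. One way to handle this, sketched below on synthetic data, is to drop every encoded column that originates from the feature being excluded. The helper names `columns_for` and `fit_and_score`, the `toy` data and the synthetic target are all hypothetical and only serve the illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data: one numerical and one categorical feature
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "marital": rng.choice(["single", "married", "divorced"], size=n),
})
y = (toy["marital"] == "single").astype(int)  # synthetic target tied to 'marital'

X = pd.get_dummies(toy, dtype=int)
X_tr, X_va, y_tr, y_va = X[:300], X[300:], y[:300], y[300:]


def columns_for(feature, encoded_columns):
    """Return the encoded columns derived from `feature` (itself or feature_*)."""
    return [c for c in encoded_columns if c == feature or c.startswith(feature + "_")]


def fit_and_score(cols):
    """Train on the given columns and return validation accuracy."""
    model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
    model.fit(X_tr[cols], y_tr)
    return accuracy_score(y_va, model.predict(X_va[cols]))


baseline = fit_and_score(list(X.columns))

differences = {}
for feature in ["age", "marital"]:
    dropped = columns_for(feature, X.columns)
    kept = [c for c in X.columns if c not in dropped]
    differences[feature] = baseline - fit_and_score(kept)

print(differences)
```

The prefix match assumes no other feature name starts with the same `name_` prefix. If "smallest difference" is meant in absolute terms, `min(differences, key=lambda f: abs(differences[f]))` would be the matching selection.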


## Question 6

Now let's train a regularized logistic regression.

Let's try the following values of the parameter C: [0, 0.01, 0.1, 1, 10].

Train models using all the features as in Q4.

Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Test different values of C. scikit-learn requires C to be strictly positive,
# so the 0 from the question's list is dropped and 100 is added
C_values = [0.01, 0.1, 1, 10, 100]

for C in C_values:
    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    model.fit(X_train_encoded, y_train)
    
    # Accuracy on the validation set
    accuracy_val = accuracy_score(y_val, model.predict(X_val_encoded))
    print(f"Accuracy for C={C}: {round(accuracy_val, 3)}")



Accuracy for C=0.01: 0.899
Accuracy for C=0.1: 0.9
Accuracy for C=1: 0.9
Accuracy for C=10: 0.901
Accuracy for C=100: 0.901


In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Test different values of C
C_values = [0.01, 0.1, 1, 10, 100]
best_accuracy = 0
best_C = None

# Loop through each C value
for C in C_values:
    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    model.fit(X_train_encoded, y_train)
    
    # Calculate accuracy on the validation set
    accuracy_val = accuracy_score(y_val, model.predict(X_val_encoded))
    print(f"Accuracy for C={C}: {round(accuracy_val, 3)}")
    
    # Update the best accuracy and best C
    if accuracy_val > best_accuracy:
        best_accuracy = accuracy_val
        best_C = C

# Print the best accuracy and corresponding C
print(f"\nThe best accuracy is {round(best_accuracy, 3)} at C={best_C}.")


Accuracy for C=0.01: 0.899
Accuracy for C=0.1: 0.9
Accuracy for C=1: 0.9
Accuracy for C=10: 0.901
Accuracy for C=100: 0.901

The best accuracy is 0.901 at C=10.
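
For intuition on what `C` controls: it is the inverse of the regularization strength, so smaller values penalize large weights more heavily. The sketch below illustrates this on synthetic data from `make_classification` (not the bank dataset) by comparing the size of the learned weight vector across a few values of `C`. With the `liblinear` solver the intercept is regularized as well, so it is included in the norm:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

norms = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(solver="liblinear", C=C, max_iter=1000, random_state=42)
    model.fit(X, y)
    # Stack coefficients and intercept, since liblinear penalizes both
    weights = np.append(model.coef_.ravel(), model.intercept_)
    norms[C] = float(np.linalg.norm(weights))

for C, norm in norms.items():
    print(f"C={C}: weight norm = {norm:.3f}")
```

The accuracies in the notebook change only in the third decimal across `C`, which suggests that regularization strength matters little for this model on this dataset.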
