# PRODIGY_DS_03

<img src='https://github.com/Theconjecture/PRODIGY_DS_03/blob/main/Purchase_Predict.png?raw=true' alt="Alttext" title="Title" />

## Task: 

* **Build a decision tree classifier to predict whether a customer will <br>purchase a product or service based on their demographic and behavioral data**

## Dataset: 

__[Dataset source](https://archive.ics.uci.edu/dataset/222/bank+marketing)__

* **The data is related to direct marketing campaigns of a Portuguese banking institution.<br> The marketing campaigns were based on phone calls. Often, more than one contact with the<br> same client was required in order to assess whether the product (bank term deposit)<br> would be ('yes') or would not be ('no') subscribed.**

###   **There are four datasets:**
    
1. **`bank-additional-full.csv` with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014].**
2. **`bank-additional.csv` with 10% of the examples (4119), randomly selected from file 1, and 20 inputs.**
3. **`bank-full.csv` with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).**
4. **`bank.csv` with 10% of the examples and 17 inputs, randomly selected from file 3 (older version of this dataset with fewer inputs).**

*The classification goal is to predict whether the client will subscribe to a term deposit (variable y, yes/no).*

<div class="alert alert-block alert-info">
<b>Note:</b> Although four files are provided, we use only <code>bank-additional-full.csv</code> and create our own train-test split from it.
</div>

In [1]:
## Paths
bank_additional_full = '/Users/mostaphaatta/Downloads/Internship/Task_3/bank+marketing/bank-additional/bank-additional-full.csv'

## Data Import & Exploration

In [2]:
import pandas as pd
import numpy as np

In [3]:
# importing the csv file
df = pd.read_csv(bank_additional_full)
# show all columns so the preview is not truncated
pd.set_option('display.max_columns', None)
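
The original UCI download of this file is, as far as we know, semicolon-delimited with quoted strings. If `df.shape` reports a single column after loading, passing `sep=';'` is the likely fix. The sketch below uses a two-row in-memory sample (made up to imitate that layout) rather than the real path:

```python
import io
import pandas as pd

# Two-row sample imitating a semicolon-delimited layout (illustrative only).
sample = io.StringIO(
    '"age";"job";"marital";"y"\n'
    '56;"housemaid";"married";"no"\n'
    '57;"services";"married";"no"\n'
)

# sep=';' tells pandas to split fields on semicolons instead of commas.
sample_df = pd.read_csv(sample, sep=';')
print(sample_df.shape)
print(sample_df.columns.tolist())
```

If your copy of the file is already comma-separated, the default `pd.read_csv(path)` call is enough.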

In [4]:
df.shape

(41188, 21)

In [5]:
df.head()

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | duration | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | 261 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | 149 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | 226 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | 151 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | 307 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |


In [6]:
df.describe()

| | age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 |
| mean | 40.02406 | 258.28501 | 2.567593 | 962.475454 | 0.172963 | 0.081886 | 93.575664 | -40.5026 | 3.621291 | 5167.035911 |
| std | 10.42125 | 259.279249 | 2.770014 | 186.910907 | 0.494901 | 1.57096 | 0.57884 | 4.628198 | 1.734447 | 72.251528 |
| min | 17.0 | 0.0 | 1.0 | 0.0 | 0.0 | -3.4 | 92.201 | -50.8 | 0.634 | 4963.6 |
| 25% | 32.0 | 102.0 | 1.0 | 999.0 | 0.0 | -1.8 | 93.075 | -42.7 | 1.344 | 5099.1 |
| 50% | 38.0 | 180.0 | 2.0 | 999.0 | 0.0 | 1.1 | 93.749 | -41.8 | 4.857 | 5191.0 |
| 75% | 47.0 | 319.0 | 3.0 | 999.0 | 0.0 | 1.4 | 93.994 | -36.4 | 4.961 | 5228.1 |
| max | 98.0 | 4918.0 | 56.0 | 999.0 | 7.0 | 1.4 | 94.767 | -26.9 | 5.045 | 5228.1 |


In [7]:
df.isnull().sum() # checking for NULL

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64
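
`isnull()` only detects real NaN values. In the preview above, the `default` column contains the string `unknown`, which suggests that missing information is stored as an ordinary category rather than as NaN. A minimal sketch for counting such placeholders, using a small made-up frame in place of `df`:

```python
import pandas as pd

# Made-up rows standing in for the real DataFrame.
toy = pd.DataFrame({
    "job": ["housemaid", "services", "unknown", "admin."],
    "default": ["no", "unknown", "no", "unknown"],
    "age": [56, 57, 37, 40],
})

# Compare every cell with the placeholder string and count matches per column.
unknown_counts = (toy == "unknown").sum()
print(unknown_counts)
```

Whether to keep `unknown` as its own category or treat it as missing is a modeling choice; one-hot encoding simply keeps it as a separate column.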

**Before training the decision tree classifier, we need to take care of two things:**

>1. Confirm there are no NULL values (the dataset documentation says there are none, and the check above agrees)
>2. Separate our target variable

**Separating our target variable:**

In [8]:
target = df['y']
features = df.drop(labels = 'y', axis = 1)

In [9]:
features.shape

(41188, 20)

<div class="alert alert-block alert-info">
<b>Note:</b> Now we have to make sure the attributes are either numeric or categorical
</div>


In [10]:
features.dtypes

age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
dtype: object

In [11]:
# changing the data type from object to category
for col in features.select_dtypes(include='object').columns:
    features[col] = features[col].astype('category')

**Why change the data type?**

>**Memory Efficiency**: A categorical column stores each distinct label once plus a compact integer code per row, so it usually consumes less memory than an object column.<br> This can be significant when working with large datasets.

>**Performance Improvement**: pandas operations such as grouping and comparisons are often faster on categorical columns.<br> Note that the scikit-learn `DecisionTreeClassifier` used here still expects numeric input, so these columns are encoded before training.

>**Explicit Categorical Information**: Using the category data type makes it clear that certain columns are categorical, <br>which can help with data analysis and understanding.
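
A quick way to check the memory point is `memory_usage(deep=True)`. The sketch below uses a made-up column with three repeated labels; exact byte counts depend on the pandas version, so none are quoted here:

```python
import pandas as pd

# 10,000 values drawn from three labels (made-up data).
as_object = pd.Series(["married", "single", "divorced", "married"] * 2500, dtype="object")
as_category = as_object.astype("category")

# deep=True also counts the Python string objects, not just the pointers to them.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print("object:  ", object_bytes)
print("category:", category_bytes)
```

The saving comes from storing each label once; a column where almost every value is unique would gain little.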



In [12]:
## Importing decision tree and other ML metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

### Encoding String Values in Machine Learning

In machine learning, encoding is the process of converting categorical string values into numerical values that models can understand and process. This step is crucial because most machine learning algorithms require numerical input.

### Types of Encoding:

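1. **Label Encoding**: Converts each unique category to a different integer. Suitable for ordinal data where the categories have an inherent order. Note that scikit-learn's `LabelEncoder` is intended for target labels and assigns integers in sorted order, so check that the resulting codes match the intended ranking.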

    ```python
    from sklearn.preprocessing import LabelEncoder
    encoder = LabelEncoder()
    df['category'] = encoder.fit_transform(df['category'])
    ```

2. **One-Hot Encoding**: Creates a binary column for each category. Suitable for nominal data where there is no order between categories.

    ```python
    df_encoded = pd.get_dummies(df, columns=['category'])
    ```
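
For an ordinal feature, one option is scikit-learn's `OrdinalEncoder` with the order spelled out explicitly, so that the integer codes follow the ranking rather than alphabetical order. The ranking below is an assumption made for illustration, not something the dataset documents:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative values; the ranking is assumed, lowest to highest.
toy = pd.DataFrame({"education": ["basic.4y", "illiterate", "high.school", "basic.4y"]})
order = [["illiterate", "basic.4y", "high.school"]]

encoder = OrdinalEncoder(categories=order)
codes = encoder.fit_transform(toy[["education"]])
print(codes.ravel())  # [1. 0. 2. 1.]
```

Sorted order would have put `basic.4y` first; passing `categories` keeps the codes aligned with the assumed ranking.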

### Importance of Encoding:

1. **Model Compatibility**: The scikit-learn implementations of algorithms such as decision trees, SVMs, and neural networks expect numerical input and raise an error on raw strings.
2. **Performance**: Proper encoding can improve the performance and accuracy of machine learning models by correctly representing categorical data.
3. **Avoiding Misinterpretation**: For categorical data without a specific order, one-hot encoding prevents algorithms from assuming any ordinal relationship between categories, which can be misleading.

By converting string values to numerical values, encoding ensures that machine learning models can process and learn from the data effectively.

In [13]:
# one-hot encode every categorical column in a single call
features = pd.get_dummies(features)

In [14]:
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=.2, random_state = 23)
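
With an imbalanced target, a plain random split can leave the test set with a slightly different share of 'yes' than the training set. Passing `stratify` keeps the proportions equal. A minimal sketch on made-up data (90 'no', 10 'yes'); the `toy_*` names are ours:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 90 'no' rows and 10 'yes' rows.
toy_features = pd.DataFrame({"x": range(100)})
toy_target = pd.Series(["no"] * 90 + ["yes"] * 10)

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=23, stratify=toy_target
)

# Both splits keep the original 10% share of 'yes'.
print((y_tr == "yes").mean(), (y_te == "yes").mean())
```

Applied to the notebook's split, this would mean adding `stratify=target`; the reported metrics would then change slightly because the rows in each split change.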

In [15]:
dtc = DecisionTreeClassifier()

In [16]:
dtc.fit(x_train,y_train)
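
With default arguments, the tree keeps splitting until its leaves are pure or too small to split, and without a fixed `random_state` the fitted tree can differ between runs. A minimal sketch on synthetic data (not the bank dataset) showing how `max_depth`, `min_samples_leaf`, and `random_state` constrain it; the specific values are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features.
X, y = make_classification(n_samples=500, n_features=10, random_state=23)

full_tree = DecisionTreeClassifier(random_state=23).fit(X, y)
capped_tree = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=10, random_state=23
).fit(X, y)

# The unconstrained tree memorizes the training data; the capped one cannot.
print("full:  ", full_tree.get_depth(), full_tree.get_n_leaves(), full_tree.score(X, y))
print("capped:", capped_tree.get_depth(), capped_tree.get_n_leaves(), capped_tree.score(X, y))
```

A perfect training score from the unconstrained tree is a sign of memorization rather than of good generalization, which is why the test-set metrics matter.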

In [17]:
pred = dtc.predict(x_test)

In [18]:
accuracy = accuracy_score(y_test, pred)
print("Accuracy:", accuracy)

Accuracy: 0.8917212915756252


In [19]:
# Confusion matrix
conf_matrix = confusion_matrix(y_test, pred)
print("Confusion Matrix:\n", conf_matrix)

Confusion Matrix:
 [[6881  470]
 [ 422  465]]


In [20]:
# Classification report
class_report = classification_report(y_test, pred)
print("Classification Report:\n", class_report)

Classification Report:
               precision    recall  f1-score   support

          no       0.94      0.94      0.94      7351
         yes       0.50      0.52      0.51       887

    accuracy                           0.89      8238
   macro avg       0.72      0.73      0.72      8238
weighted avg       0.89      0.89      0.89      8238



### Model Evaluation

#### Confusion Matrix

**Feedback:**
- **True Negatives (6881)**: The model correctly predicted 'no' for 6881 instances.
- **False Positives (470)**: The model incorrectly predicted 'yes' for 470 instances that are actually 'no'.
- **False Negatives (422)**: The model incorrectly predicted 'no' for 422 instances that are actually 'yes'.
- **True Positives (465)**: The model correctly predicted 'yes' for 465 instances.

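The model performs well on the 'no' class but struggles with 'yes': it finds only 465 of the 887 actual 'yes' cases, and about half of its 'yes' predictions are wrong. The test set is imbalanced (887 'yes' against 7351 'no', roughly 11% positives), which makes the minority class harder to learn.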
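
The four counts are enough to recompute the headline metrics for the 'yes' class by hand. A small arithmetic sketch using the numbers from the confusion matrix above (variable names are ours):

```python
# Counts copied from the confusion matrix above.
tn, fp, fn, tp = 6881, 470, 422, 465

precision_yes = tp / (tp + fp)   # share of predicted 'yes' that are correct
recall_yes = tp / (tp + fn)      # share of actual 'yes' that were found
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision_yes:.2f} recall={recall_yes:.2f} "
      f"f1={f1_yes:.2f} accuracy={accuracy:.4f}")
# precision=0.50 recall=0.52 f1=0.51 accuracy=0.8917
```

With scikit-learn's `confusion_matrix` for a binary problem, the same four values can be unpacked with `conf_matrix.ravel()`, in the order `tn, fp, fn, tp`.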

#### Classification Report

**Feedback:**
- **Precision**: 'no' - 0.94, 'yes' - 0.50
  - Precision is the fraction of instances predicted as a class that actually belong to it. For 'yes' this is 465 / (465 + 470), so about half of the predicted subscribers are false alarms.
- **Recall**: 'no' - 0.94, 'yes' - 0.52
  - Recall is the fraction of actual instances of a class that the model finds. For 'yes' this is 465 / (465 + 422), so nearly half of the real subscribers are missed.
- **F1-score**: 'no' - 0.94, 'yes' - 0.51
  - The F1-score is the harmonic mean of precision and recall. The model performs much better for 'no' than for 'yes'.
- **Overall Accuracy**: 0.89
  - The accuracy of 89% looks high, but it is dominated by the majority class and hides the weak results for the minority class ('yes'), which need improvement.

#### Accuracy Calculation

**Feedback:**
The accuracy of 0.89 (or 89%) means the model classifies 89% of the test instances correctly. That figure needs context: 7351 of the 8238 test instances are 'no', so a model that always predicts 'no' would also reach about 89.2% accuracy. With such an imbalance, per-class metrics are more informative, and the classification report and confusion matrix show that the model performs much better for the 'no' class.
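
One way to put the 89% in context is to compare it with a baseline that always predicts 'no'. The sketch below does this with the counts printed above; the variable names are ours, and balanced accuracy (the mean of per-class recall) is included as one imbalance-aware alternative:

```python
# Test-set class counts from the 'support' column of the report,
# and cell counts from the confusion matrix.
n_no, n_yes = 7351, 887
tn, fp, fn, tp = 6881, 470, 422, 465

model_accuracy = (tn + tp) / (n_no + n_yes)
baseline_accuracy = n_no / (n_no + n_yes)          # always answer 'no'
balanced_accuracy = (tn / n_no + tp / n_yes) / 2   # mean of per-class recall

print(f"model={model_accuracy:.4f} baseline={baseline_accuracy:.4f} "
      f"balanced={balanced_accuracy:.2f}")
```

The baseline comes out marginally above the model's accuracy, while balanced accuracy matches the macro-average recall of 0.73 in the report.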

#### Suggestions for Improvement

**1. Address Class Imbalance:** 
   - Resampling (over-sampling the minority class or under-sampling the majority class).
   - Using class weights to give more importance to the minority class during training.
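
A minimal sketch of the class-weight option, assuming synthetic data with about 10% positives as a stand-in for the bank data. It compares minority-class recall with and without `class_weight="balanced"`; the outcome on the real dataset may differ:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data, not the bank dataset.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=23
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=23
)

recalls = {}
for label, weight in [("default", None), ("balanced", "balanced")]:
    tree = DecisionTreeClassifier(max_depth=5, class_weight=weight, random_state=23)
    tree.fit(X_tr, y_tr)
    recalls[label] = recall_score(y_te, tree.predict(X_te))  # recall of class 1

print(recalls)
```

Class weighting typically trades some precision for recall, so both metrics should be checked before adopting it.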

**2. Feature Engineering:** 
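   - Explore additional features or transformations that might help the model distinguish between 'yes' and 'no' better. For example, the dataset documentation notes that `pdays = 999` means the client was not previously contacted, so a separate "previously contacted" flag may be more useful than the raw value.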

**3. Model Tuning:** 
   - Tune hyperparameters such as `max_depth` and `min_samples_leaf` to limit overfitting, or try more complex models like Random Forests or Gradient Boosting, which might capture the patterns better.
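
One way to run such a search is `GridSearchCV`. The sketch below assumes synthetic data with 0/1 labels and an arbitrary small grid; it is meant to show the mechanics, not recommended values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data, not the bank dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=23
)

param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 10, 50]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=23),
    param_grid,
    scoring="f1",  # F1 of the positive class instead of accuracy
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

With the notebook's 'yes'/'no' string labels, the scorer needs to know which class is positive, for example `make_scorer(f1_score, pos_label='yes')` from `sklearn.metrics`.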

**4. Cross-validation:** 
   - Use cross-validation to ensure the model generalizes well to unseen data and the results are not due to overfitting.
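
A minimal cross-validation sketch, again on synthetic data. `StratifiedKFold` keeps the class ratio in every fold, and `balanced_accuracy` is used as the score so that the majority class does not dominate; both choices are ours:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data, not the bank dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=23
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=5, random_state=23),
    X, y, cv=cv, scoring="balanced_accuracy",
)
print("mean:", scores.mean(), "std:", scores.std())
```

A large spread between folds suggests that a single train-test split, like the one used here, may give an optimistic or pessimistic picture by chance.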

**5. Ensemble Methods:** 
   - Combine multiple models to improve prediction performance, especially for the minority class.
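
As a rough illustration of the ensemble idea, the sketch below fits a single tree and a random forest on the same synthetic split and reports the F1-score of the positive class for each. The data and settings are assumptions; no claim is made about which model wins on the bank data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data, not the bank dataset.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=23
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=23
)

models = {
    "tree": DecisionTreeClassifier(random_state=23),
    "forest": RandomForestClassifier(n_estimators=100, random_state=23),
}

f1_by_model = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    f1_by_model[name] = f1_score(y_te, model.predict(X_te))  # F1 of class 1

print(f1_by_model)
```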