<a href="https://colab.research.google.com/github/Ayesha52774/PRODIGY-DS-02/blob/main/Task3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Build a decision tree classifier to predict whether a
customer will purchase a product or service based on their
demographic and behavioral data.

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt

In [None]:
import pandas as pd

# UCI Adult dataset URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

# Column names
columns = ["age", "workclass", "fnlwgt", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "sex",
           "capital_gain", "capital_loss", "hours_per_week", "native_country", "income"]

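# Load dataset.
# Note: skipinitialspace=True strips the leading space from each field before
# na_values is checked, so the marker " ?" does not match and missing entries
# stay in the data as the literal string "?".
df = pd.read_csv(url, names=columns, na_values=" ?", skipinitialspace=True)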

print(df.head())
print(df.info())

   age         workclass  fnlwgt  education  education_num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital_status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital_gain  capital_loss  hours_per_week native_country income  
0          2174             0              40  United-States  <=50K  
1             0             0             

In [None]:
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
       'income'],
      dtype='object')

Column           | Type         | Meaning
-----------------|--------------|-----------------------------------------------
age              | Numeric      | Age of the person
workclass        | Categorical  | Employment type (Private, Self-employed, Govt)
fnlwgt           | Numeric      | Sampling weight (can be dropped)
education        | Categorical  | Highest education level (Bachelors, HS-grad)
education_num    | Numeric      | Ordinal code for education level (numeric version of education)
marital_status   | Categorical  | Marital status (Married, Never-married, Divorced)
occupation       | Categorical  | Job type (Sales, Exec-managerial, Tech-support)
relationship     | Categorical  | Relationship within family (Husband, Wife, etc.)
race             | Categorical  | Race (White, Black, Asian-Pac-Islander, etc.)
sex              | Categorical  | Gender (Male, Female)
capital_gain     | Numeric      | Capital gains in the year
capital_loss     | Numeric      | Capital losses in the year
hours_per_week   | Numeric      | Working hours per week
native_country   | Categorical  | Country of origin
income (Target)  | Binary       | <=50K or >50K (whether they earn more than 50K/year)

In [None]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
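The all-zero counts deserve a second look, because adult.data marks missing entries with a question mark. The following is a minimal sketch on two made-up rows (not the real file) of how `na_values` interacts with `skipinitialspace=True`. The variable names are illustrative only, and the second call is an assumed alternative, not what this notebook ran.

```python
import io
import pandas as pd

# Two made-up rows in the adult.data layout: fields are separated by ", "
# and a missing workclass is written as "?".
raw = "39, State-gov\n50, ?\n"
cols = ["age", "workclass"]

# Same options as the loading cell: the leading space is stripped first,
# so the field becomes "?" and the marker " ?" is never matched.
as_loaded = pd.read_csv(io.StringIO(raw), names=cols,
                        na_values=" ?", skipinitialspace=True)

# Assumed alternative: give the marker as it looks after stripping.
stripped_marker = pd.read_csv(io.StringIO(raw), names=cols,
                              na_values="?", skipinitialspace=True)

print("NaN count with ' ?':", as_loaded["workclass"].isnull().sum())
print("NaN count with '?': ", stripped_marker["workclass"].isnull().sum())
```

If the same holds for the full file, the "?" entries are carried through the rest of this notebook as an ordinary category. Switching the marker would change the encoded columns and therefore the reported metrics.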


In [None]:
df.drop('fnlwgt', axis=1, inplace=True)
display(df.head())

,age,workclass,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [None]:
# Convert the target column 'income' to binary values (0 = <=50K, 1 = >50K)
df['income'] = df['income'].map({'<=50K': 0, '>50K': 1})

In [None]:
# List of categorical columns
cat_cols = ['workclass','education','marital_status','occupation',
            'relationship','race','sex','native_country']

# One-hot encode them
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

print(df.head())
print("✅ All categorical columns converted to numeric!")

   age  education_num  capital_gain  capital_loss  hours_per_week  income  \
0   39             13          2174             0              40       0   
1   50             13             0             0              13       0   
2   38              9             0             0              40       0   
3   53              7             0             0              40       0   
4   28             13             0             0              40       0   

   workclass_Federal-gov  workclass_Local-gov  workclass_Never-worked  \
0                  False                False                   False   
1                  False                False                   False   
2                  False                False                   False   
3                  False                False                   False   
4                  False                False                   False   

   workclass_Private  ...  native_country_Portugal  \
0              False  ...                   
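To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up three-row frame (the `toy` data is an assumption for illustration, not taken from the dataset). It compares the columns produced with and without the option.

```python
import pandas as pd

# Made-up frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["Private", "State-gov", "Local-gov"],
})

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(toy, columns=["workclass"])

# drop_first=True removes the indicator of the first category in sorted
# order; that category is then represented by all remaining indicators
# being False.
reduced = pd.get_dummies(toy, columns=["workclass"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level matters mostly for linear models, where the full set of indicators is collinear. A decision tree works with either encoding, so the choice here mainly reduces the column count.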

In [None]:
# Separate the features (X) from the target (Y)
X = df.drop("income", axis=1)
Y = df["income"]

In [None]:
# Split into training (80%) and test (20%) sets
X_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
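Because the two income classes are imbalanced, a stratified split is one option worth knowing about. The sketch below uses made-up arrays (`X_toy`, `y_toy` are illustrative names) to show how the `stratify` argument of `train_test_split` keeps the class ratio the same in both parts. The notebook itself uses a plain random split, so this is an alternative, not what produced the results reported here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 25% of them in the positive class.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 75 + [1] * 25)

# stratify=y_toy asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("positive share in train:", y_tr.mean())
print("positive share in test: ", y_te.mean())
```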

In [None]:
# Build and fit the model
Model = DecisionTreeClassifier(
    max_depth=10,             # cap the tree depth to limit overfitting
    min_samples_split=50,     # only split nodes that hold at least 50 samples
    min_samples_leaf=20,      # ensure each leaf has enough samples
    class_weight='balanced',  # handle class imbalance
    random_state=42,
)
Model.fit(X_train, y_train)
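As a rough illustration of what `class_weight='balanced'` does, the sketch below computes the weights for a made-up label vector (`y_toy` is an assumption, not the real training labels). scikit-learn documents the rule as `n_samples / (n_classes * np.bincount(y))`, and `compute_class_weight` exposes the same calculation.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 75 negatives and 25 positives.
y_toy = np.array([0] * 75 + [1] * 25)

# Weights as scikit-learn computes them for class_weight="balanced".
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)

# The same rule written out by hand: n_samples / (n_classes * class counts).
manual = len(y_toy) / (2 * np.bincount(y_toy))

print("library:", weights)
print("manual: ", manual)
```

The rarer class gets the larger weight, so errors on it cost more during training. That is consistent with a model that favors recall on the minority class at the expense of precision.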

In [None]:
Y_pred=Model.predict(x_test)

In [None]:

# Convert 0/1 → Yes/No meaning
pred_labels = ["YES (Will Purchase)" if p == 1 else "NO (Will NOT Purchase)" for p in Y_pred]

# Show first 10 predictions with Yes/No
print(pred_labels[:10])

['NO (Will NOT Purchase)', 'YES (Will Purchase)', 'YES (Will Purchase)', 'NO (Will NOT Purchase)', 'NO (Will NOT Purchase)', 'YES (Will Purchase)', 'YES (Will Purchase)', 'NO (Will NOT Purchase)', 'NO (Will NOT Purchase)', 'YES (Will Purchase)']


In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print("Accuracy:", accuracy_score(y_test, Y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, Y_pred))
print("\nReport:\n", classification_report(y_test, Y_pred))

Accuracy: 0.8053124520190389

Confusion Matrix:
 [[3923 1019]
 [ 249 1322]]

Report:
               precision    recall  f1-score   support

           0       0.94      0.79      0.86      4942
           1       0.56      0.84      0.68      1571

    accuracy                           0.81      6513
   macro avg       0.75      0.82      0.77      6513
weighted avg       0.85      0.81      0.82      6513



✅ Summary

The decision tree achieved 80.5% accuracy on the 6,513 test rows. For non-purchasers (class 0), precision is high (94%) but recall is moderate (79%). For purchasers (class 1), recall is high (84%), so the model catches most buyers, but precision is lower (56%): 1,019 of the 2,341 rows predicted as class 1 are false positives.
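The percentages in this summary follow directly from the confusion matrix printed above. As a check, this short sketch recomputes them by hand. The matrix values are copied from the output, while the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the evaluation output:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = np.array([[3923, 1019],
               [249, 1322]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
precision_1 = tp / (tp + fp)   # of predicted class 1, how many are truly 1
recall_1 = tp / (tp + fn)      # of true class 1, how many were found
precision_0 = tn / (tn + fn)   # of predicted class 0, how many are truly 0
recall_0 = tn / (tn + fp)      # of true class 0, how many were found

print(f"accuracy    {accuracy:.4f}")
print(f"class 0     precision {precision_0:.2f}  recall {recall_0:.2f}")
print(f"class 1     precision {precision_1:.2f}  recall {recall_1:.2f}")
```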

In [None]:

# Pick one example from test data
sample_customer = x_test.iloc[[1]]   # double brackets to keep it as DataFrame

# Predict
sample_pred = Model.predict(sample_customer)[0]

# Show result
print("Prediction for this customer:", "YES (Will Purchase)" if sample_pred == 1 else "NO (Will NOT Purchase)")

Prediction for this customer: NO (Will NOT Purchase)


In [None]:
# --- Now predicting for NEW customer data ---

# Step 1: Define a new customer’s raw data (as dict)
new_customer = {
    "age": 40,
    "workclass": "Private",
    "education": "Bachelors",
    "education_num": 13,
    "marital_status": "Married-civ-spouse",
    "occupation": "Exec-managerial",
    "relationship": "Husband",
    "race": "White",
    "sex": "Male",
    "capital_gain": 0,
    "capital_loss": 0,
    "hours_per_week": 50,
    "native_country": "United-States"
}

# Step 2: Convert dict to DataFrame
new_df = pd.DataFrame([new_customer])

# Step 3: One-hot encode the new row (no drop_first here; Step 4 aligns the columns with training)
new_df_encoded = pd.get_dummies(new_df)

# Step 4: Align new data columns with training data columns, fill missing with 0
new_df_encoded = new_df_encoded.reindex(columns=X.columns, fill_value=0)

# Step 5: Predict
prediction = Model.predict(new_df_encoded)[0]

# Step 6: Convert prediction to human-readable result
result = "YES (Will Purchase)" if prediction == 1 else "NO (Will NOT Purchase)"
print("Prediction for NEW customer:", result)

Prediction for NEW customer: YES (Will Purchase)
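The `get_dummies` plus `reindex` approach above works, but it silently drops any category that was not seen during training. One alternative, sketched here on made-up data, is to keep the encoding inside a scikit-learn `Pipeline` with `OneHotEncoder(handle_unknown="ignore")`, so raw rows can be passed straight to `predict`. The toy frame, the column subset, and the names `encoder` and `clf` are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up training data with one numeric and one categorical feature.
train = pd.DataFrame({
    "age": [25, 40, 52, 31, 46, 23],
    "workclass": ["Private", "Private", "State-gov",
                  "Private", "State-gov", "Private"],
    "income": [0, 1, 1, 0, 1, 0],
})
cat_features = ["workclass"]

# One-hot encode the categorical column and pass the rest through unchanged.
# handle_unknown="ignore" encodes unseen categories as all zeros.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), cat_features)],
    remainder="passthrough",
)

clf = Pipeline([
    ("encode", encoder),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
clf.fit(train[["age", "workclass"]], train["income"])

# A raw new row with a category that never appeared in training.
new_row = pd.DataFrame([{"age": 40, "workclass": "Self-emp-inc"}])
pred = clf.predict(new_row)[0]
print("Predicted class:", pred)
```

The design point is that the encoder is fitted once on the training data and reused unchanged at prediction time, which removes the need to align columns by hand.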


✅ Overall Summary

Goal: Predict whether a customer will purchase a product/service from demographic and behavioral data. The UCI Adult dataset stands in for customer data, with income >50K treated as "will purchase".

Steps Taken:

1. Loaded the data, dropped the unneeded fnlwgt column, and checked for missing values.
2. Converted target (income) into binary (0 = ≤50K, 1 = >50K).
3. One-hot encoded categorical columns to make them numeric.
4. Split data into training (80%) and testing (20%).
5. Built a Decision Tree Classifier with hand-set parameters (max_depth=10, min_samples_split=50, min_samples_leaf=20, class_weight='balanced').
6. Evaluated the model using accuracy, confusion matrix & classification report.
7. Tested predictions on a sample customer and a completely new customer.


Results:

Accuracy: 80.5%

Non-purchasers: high precision (94%) but moderate recall (79%)

Purchasers: high recall (84%) but lower precision (56%) → catches most buyers, but about 44% of predicted purchasers are false positives