## Problem statement
A cloth manufacturing company wants to know which attributes contribute to high sales. Build a decision tree and a random forest model with Sales as the target variable (first convert it into a categorical variable).




### Business Objective
The main goal is to understand the attributes contributing to high sales and build a predictive model to categorize sales based on these attributes.

### Constraints
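Data balance: Sales is a continuous variable, so it must be converted into a categorical variable (high/low) before classification. The cut-off should be chosen from the distribution of Sales so that the two classes stay roughly balanced.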

In [52]:
import pandas as pd
import numpy as np
df=pd.read_csv("Company_Data.csv")
df.head()

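| | Sales | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.5 | 138 | 73 | 11 | 276 | 120 | Bad | 42 | 17 | Yes | Yes |
| 1 | 11.22 | 111 | 48 | 16 | 260 | 83 | Good | 65 | 10 | Yes | Yes |
| 2 | 10.06 | 113 | 35 | 10 | 269 | 80 | Medium | 59 | 12 | Yes | Yes |
| 3 | 7.4 | 117 | 100 | 4 | 466 | 97 | Medium | 55 | 14 | Yes | Yes |
| 4 | 4.15 | 141 | 64 | 3 | 340 | 128 | Bad | 38 | 13 | Yes | No |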


In [53]:
df.shape

(400, 11)

### Preprocessing

In [57]:
df.isnull().sum()

Sales          0
CompPrice      0
Income         0
Advertising    0
Population     0
Price          0
ShelveLoc      0
Age            0
Education      0
Urban          0
US             0
dtype: int64

In [58]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Sales        400 non-null    float64
 1   CompPrice    400 non-null    int64  
 2   Income       400 non-null    int64  
 3   Advertising  400 non-null    int64  
 4   Population   400 non-null    int64  
 5   Price        400 non-null    int64  
 6   ShelveLoc    400 non-null    object 
 7   Age          400 non-null    int64  
 8   Education    400 non-null    int64  
 9   Urban        400 non-null    object 
 10  US           400 non-null    object 
dtypes: float64(1), int64(7), object(3)
memory usage: 34.5+ KB


In [59]:
df.describe()

| | Sales | CompPrice | Income | Advertising | Population | Price | Age | Education |
|---|---|---|---|---|---|---|---|---|
| count | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 |
| mean | 7.496325 | 124.975 | 68.6575 | 6.635 | 264.84 | 115.795 | 53.3225 | 13.9 |
| std | 2.824115 | 15.334512 | 27.986037 | 6.650364 | 147.376436 | 23.676664 | 16.200297 | 2.620528 |
| min | 0.0 | 77.0 | 21.0 | 0.0 | 10.0 | 24.0 | 25.0 | 10.0 |
| 25% | 5.39 | 115.0 | 42.75 | 0.0 | 139.0 | 100.0 | 39.75 | 12.0 |
| 50% | 7.49 | 125.0 | 69.0 | 5.0 | 272.0 | 117.0 | 54.5 | 14.0 |
| 75% | 9.32 | 135.0 | 91.0 | 12.0 | 398.5 | 131.0 | 66.0 | 16.0 |
| max | 16.27 | 175.0 | 120.0 | 29.0 | 509.0 | 191.0 | 80.0 | 18.0 |


**Convert the Sales column to a categorical variable (High/Low) based on the median**

In [62]:
sales_median=df['Sales'].median()
sales_median

7.49

In [66]:
df['Sales_category']=np.where(df['Sales']>=sales_median ,'High','Low')
df['Sales_category'].value_counts()

Sales_category
High    201
Low     199
Name: count, dtype: int64
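
The split above uses `>=`, so any row whose Sales value equals the median is placed in the High class. That is consistent with the 201 to 199 counts rather than an exact 200 to 200 split. A minimal sketch of the same rule on a hypothetical five-row frame (the `toy` frame and its values are assumptions for illustration, not the company data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, used only to illustrate the rule
toy = pd.DataFrame({"Sales": [4.15, 7.40, 7.49, 9.50, 11.22]})

# With an odd number of rows the median is the middle sorted value
threshold = toy["Sales"].median()

# >= sends rows that sit exactly on the median to 'High'
toy["Sales_category"] = np.where(toy["Sales"] >= threshold, "High", "Low")
counts = toy["Sales_category"].value_counts()

print(threshold)
print(counts)
```

Using `>` instead of `>=` would move the tied rows to Low; either choice is defensible as long as it is stated.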

In [68]:
df.head()

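| | Sales | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US | Sales_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.5 | 138 | 73 | 11 | 276 | 120 | Bad | 42 | 17 | Yes | Yes | High |
| 1 | 11.22 | 111 | 48 | 16 | 260 | 83 | Good | 65 | 10 | Yes | Yes | High |
| 2 | 10.06 | 113 | 35 | 10 | 269 | 80 | Medium | 59 | 12 | Yes | Yes | High |
| 3 | 7.4 | 117 | 100 | 4 | 466 | 97 | Medium | 55 | 14 | Yes | Yes | Low |
| 4 | 4.15 | 141 | 64 | 3 | 340 | 128 | Bad | 38 | 13 | Yes | No | Low |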


In [70]:
# Drop the raw Sales column so the model cannot read the target directly from it
df = df.drop('Sales', axis=1)
df.head()

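| | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US | Sales_category |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 138 | 73 | 11 | 276 | 120 | Bad | 42 | 17 | Yes | Yes | High |
| 1 | 111 | 48 | 16 | 260 | 83 | Good | 65 | 10 | Yes | Yes | High |
| 2 | 113 | 35 | 10 | 269 | 80 | Medium | 59 | 12 | Yes | Yes | High |
| 3 | 117 | 100 | 4 | 466 | 97 | Medium | 55 | 14 | Yes | Yes | Low |
| 4 | 141 | 64 | 3 | 340 | 128 | Bad | 38 | 13 | Yes | No | Low |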


**Encode the categorical columns (ShelveLoc, Urban, US, Sales_category) as integers**

In [73]:
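from sklearn.preprocessing import LabelEncoder
# LabelEncoder assigns integer codes in sorted label order:
# ShelveLoc: Bad=0, Good=1, Medium=2; Urban and US: No=0, Yes=1; Sales_category: High=0, Low=1
label_encoder = LabelEncoder()
# The same encoder is refit on each column, so it only remembers the last mapping
df['ShelveLoc'] = label_encoder.fit_transform(df['ShelveLoc'])
df['Urban'] = label_encoder.fit_transform(df['Urban'])
df['US'] = label_encoder.fit_transform(df['US'])
df['Sales_category'] = label_encoder.fit_transform(df['Sales_category'])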

In [75]:
df.head()

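| | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US | Sales_category |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 138 | 73 | 11 | 276 | 120 | 0 | 42 | 17 | 1 | 1 | 0 |
| 1 | 111 | 48 | 16 | 260 | 83 | 1 | 65 | 10 | 1 | 1 | 0 |
| 2 | 113 | 35 | 10 | 269 | 80 | 2 | 59 | 12 | 1 | 1 | 0 |
| 3 | 117 | 100 | 4 | 466 | 97 | 2 | 55 | 14 | 1 | 1 | 1 |
| 4 | 141 | 64 | 3 | 340 | 128 | 0 | 38 | 13 | 1 | 0 | 1 |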
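
Comparing the encoded rows with the earlier `df.head()` output shows that the codes follow alphabetical order: High becomes 0 and Low becomes 1, and ShelveLoc becomes Bad=0, Good=1, Medium=2, which does not match the natural quality order. The sketch below reproduces this on the five displayed rows and shows one possible alternative, an explicit ordinal mapping. The `shelf_order` dictionary and the `ShelveLoc_ordinal` column are hypothetical names introduced here, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# The ShelveLoc and Sales_category values of the five rows shown by df.head()
toy = pd.DataFrame({
    "ShelveLoc": ["Bad", "Good", "Medium", "Medium", "Bad"],
    "Sales_category": ["High", "High", "High", "Low", "Low"],
})

# LabelEncoder sorts the labels, so codes follow alphabetical order
enc = LabelEncoder()
codes = enc.fit_transform(toy["Sales_category"])
label_map = {str(label): code for code, label in enumerate(enc.classes_)}
print(label_map)

# Hypothetical alternative: keep the quality order Bad < Medium < Good
shelf_order = {"Bad": 0, "Medium": 1, "Good": 2}
toy["ShelveLoc_ordinal"] = toy["ShelveLoc"].map(shelf_order)
print(toy)
```

Tree-based models can still split on the alphabetical codes, but an ordered mapping lets a single threshold separate Good from the rest, and knowing that 0 means High matters when reading the confusion matrices.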


In [77]:
# Separate the features (X) from the target (y)
X = df.drop('Sales_category', axis=1)
y = df['Sales_category']

In [79]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
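
A plain random split does not guarantee that the test set keeps the near 50/50 class ratio of the full data. One option, sketched below on synthetic stand-in data, is to pass `stratify=y` to `train_test_split`. The arrays `X_demo` and `y_demo` are assumptions that only mimic the 201/199 class counts; they are not the company data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 400 rows, 3 random features, classes split 201/199
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 3))
y_demo = np.array([0] * 201 + [1] * 199)

# stratify keeps the class proportions (almost) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

test_counts = np.bincount(y_te)
print(len(y_tr), len(y_te), test_counts)
```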

### Build a decision tree model

In [82]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier


In [84]:
dt_model = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_model.fit(X_train, y_train)
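
Because the business objective is to understand which attributes drive sales, it can help to print the learned rules of a fitted tree. The following is a minimal sketch using `sklearn.tree.export_text` on synthetic data; the data, the `feature_i` names, and the shallow `max_depth=2` are assumptions chosen to keep the printout short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; feature names are hypothetical placeholders
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# A shallow tree keeps the printed rules readable
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

# export_text renders the split thresholds and leaf classes as indented text
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

On the notebook's own objects the equivalent call would be `export_text(dt_model, feature_names=list(X.columns))`.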

In [86]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
y_pred_dt = dt_model.predict(X_test)
print("\nDecision Tree Model - Accuracy:", accuracy_score(y_test, y_pred_dt))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_dt))
print("Classification Report:\n", classification_report(y_test, y_pred_dt))


Decision Tree Model - Accuracy: 0.6666666666666666
Confusion Matrix:
 [[48 17]
 [23 32]]
Classification Report:
               precision    recall  f1-score   support

           0       0.68      0.74      0.71        65
           1       0.65      0.58      0.62        55

    accuracy                           0.67       120
   macro avg       0.66      0.66      0.66       120
weighted avg       0.67      0.67      0.66       120



### Build a random forest model


In [89]:
rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)
print("\nRandom Forest Model - Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))


Random Forest Model - Accuracy: 0.8083333333333333
Confusion Matrix:
 [[52 13]
 [10 45]]
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.80      0.82        65
           1       0.78      0.82      0.80        55

    accuracy                           0.81       120
   macro avg       0.81      0.81      0.81       120
weighted avg       0.81      0.81      0.81       120



In [91]:
feature_importance = pd.Series(rf_model.feature_importances_, index=X.columns)
print("\nFeature Importance for Random Forest:\n", feature_importance.sort_values(ascending=False))


Feature Importance for Random Forest:
 Price          0.296689
Age            0.160650
ShelveLoc      0.135991
Advertising    0.095522
CompPrice      0.091934
Income         0.086186
Population     0.076417
Education      0.038907
US             0.011384
Urban          0.006320
dtype: float64
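
The `feature_importances_` values are impurity-based and computed on the training data, and they indicate how much a feature is used for splitting, not whether higher values push sales up or down. A complementary check is permutation importance on held-out data. The sketch below is an illustration on synthetic data; the `feature_i` column names and the data are assumptions, not results for the company data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and measure the score drop
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
perm_importance = pd.Series(
    result.importances_mean, index=X_demo.columns
).sort_values(ascending=False)
print(perm_importance)
```

On the notebook's objects the corresponding call would be `permutation_importance(rf_model, X_test, y_test, n_repeats=10, random_state=42)`.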


### Price, Age, and ShelveLoc are the most important features for predicting whether sales are high or low