## **Project Overview: Predicting eBay Auction Competitiveness using Decision Trees**

### **1. Objective**
This project aims to develop and evaluate a supervised learning model to classify whether an eBay auction listing is *competitive* based on various categorical and numerical features. The classification task uses a Decision Tree Classifier, a non-parametric model capable of capturing nonlinear relationships and providing interpretable decision rules.

### **2. Dataset Overview**
* Source: `ebayAuctions.xlsx`
* Target Variable: `Competitive?`, a binary label indicating whether an auction attracted sufficient bidding interest to count as competitive.
* Predictor Variables: Categorical (e.g., `Category`, `Duration`, `Currency`, `endDay`) and numerical (e.g., `OpenPrice`, `sellerRating`).

In [None]:
import pandas as pd

# Open the workbook and read its second sheet (sheet_name is zero-based)
xlsx = pd.ExcelFile('ebayAuctions.xlsx')
ebay_df = pd.read_excel(xlsx, sheet_name=1)
ebay_df.info()

### **3. Data Preprocessing**
#### **Dummy Encoding**
All categorical features were one-hot encoded to convert them into a format suitable for scikit-learn's tree-based models:

In [None]:
df = pd.get_dummies(ebay_df, columns=['Category', 'Currency', 'Duration', 'endDay'], drop_first=False)
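
To make the encoding step concrete, the sketch below applies the same `pd.get_dummies` call to a tiny hand-made frame. The rows are hypothetical and only the column names mirror the dataset. Note that `Duration` is numeric in this toy frame but is still expanded into indicator columns because it is listed in `columns`.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Category": ["Books", "Toys", "Books"],
    "Duration": [5, 7, 5],
    "OpenPrice": [0.99, 10.0, 4.5],
})

# Each listed column is replaced by one indicator column per distinct value.
# drop_first=False (the default) keeps every level, which is fine for trees.
encoded = pd.get_dummies(toy, columns=["Category", "Duration"], drop_first=False)
print(list(encoded.columns))
```

Columns that are not listed (here `OpenPrice`) pass through unchanged and appear before the generated indicators.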

#### **Class Distribution Check**

In [None]:
df["Competitive?"].value_counts().plot(kind='bar')
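
Alongside the bar chart, class balance can be stated numerically. The following is a minimal sketch on a hypothetical label series (not the real data); the share of the majority class is also the accuracy a trivial "always predict the majority" classifier would reach, which gives a baseline for the accuracy figures reported later.

```python
import pandas as pd

# Hypothetical labels standing in for df["Competitive?"].
labels = pd.Series([1, 0, 1, 1, 0, 1], name="Competitive?")

# normalize=True returns proportions instead of raw counts.
shares = labels.value_counts(normalize=True)

# Accuracy of a classifier that always predicts the most frequent class.
majority_share = shares.max()
print(shares)
print(majority_share)
```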

### **4. Correlation Analysis**
A correlation matrix was computed over the numeric columns of `df` to reveal potential linear relationships:

In [None]:
numeric_predictors = df.select_dtypes(include=['number'])
correlation_matrix = numeric_predictors.corr()
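
A correlation matrix is easier to read when the variable pairs are ranked by strength. The sketch below shows one way to do this on a toy frame with hypothetical values: it keeps the upper triangle of the matrix (so each pair appears once) and sorts by absolute correlation.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; the string column is ignored by select_dtypes.
toy = pd.DataFrame({
    "OpenPrice": [1.0, 2.0, 3.0, 4.0, 5.0],
    "ClosePrice": [2.0, 4.0, 6.0, 8.0, 10.0],
    "sellerRating": [5, 3, 4, 1, 2],
    "Category": ["a", "b", "a", "b", "a"],
})

corr = toy.select_dtypes(include=["number"]).corr()

# Keep only the strict upper triangle so each pair is listed once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank pairs by absolute correlation, strongest first.
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```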

### **5. Model 1: Basic Decision Tree Classifier**
#### **Train-Test Split (60/40)**

In [None]:
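from sklearn.model_selection import train_test_split

# Separate the predictors from the target, then hold out 40% of rows for testing
X = df.drop(columns=['Competitive?'])
y = df['Competitive?']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)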
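
The split above is purely random. As an optional variation that the original notebook does not use, `train_test_split` accepts a `stratify` argument that preserves the class proportions in both partitions. A minimal sketch on hypothetical arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 rows, 60% class 0 and 40% class 1.
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.array([0] * 30 + [1] * 20)

# stratify=y_demo keeps the 60/40 class ratio in both the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=1, stratify=y_demo
)
print(len(X_tr), len(X_te), y_tr.mean(), y_te.mean())
```

Stratification matters most when one class is rare; with roughly balanced labels a plain random split usually gives similar proportions.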

#### **Model Training**

In [None]:
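from sklearn.tree import DecisionTreeClassifier

# Require at least 50 training samples in every leaf to limit overfitting
FirstClassTree = DecisionTreeClassifier(min_samples_leaf=50, random_state=1)
FirstClassTree.fit(X_train, y_train)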
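
The effect of `min_samples_leaf=50` can be checked directly on a fitted tree. The sketch below uses synthetic data (an assumption for illustration, not the auction data) and reads leaf sizes from the fitted estimator's `tree_` attribute, comparing against an unconstrained tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary problem used only for illustration.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

constrained = DecisionTreeClassifier(min_samples_leaf=50, random_state=1).fit(X_demo, y_demo)
unconstrained = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)

# Leaves are the nodes without children (children_left == -1).
is_leaf = constrained.tree_.children_left == -1
leaf_sizes = constrained.tree_.n_node_samples[is_leaf]

print(leaf_sizes.min(), constrained.get_n_leaves(), unconstrained.get_n_leaves())
```

With 400 training rows and a 50-sample minimum, the constrained tree can have at most 8 leaves, which is what keeps the plotted tree small enough to read.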

#### **Evaluation**
* **Accuracy**:

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, FirstClassTree.predict(X_test))

* **Confusion Matrix**:

In [None]:
from dmba import classificationSummary  # helper from the dmba package
classificationSummary(y_test, FirstClassTree.predict(X_test))
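
If the `dmba` helper is not available, the same information can be obtained from scikit-learn alone. This is a sketch with hypothetical labels and predictions, showing how accuracy relates to the cells of the confusion matrix.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of predictions on the diagonal.
acc = accuracy_score(y_true, y_pred)
print(cm)
print(acc)
```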

#### **Interpretability**

In [None]:
from sklearn import tree
tree.plot_tree(FirstClassTree, feature_names=list(X.columns), class_names=['0', '1'], rounded=True)

### **6. Model 2: Enhanced Tree with Feature Engineering**
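A second version of the model was built with the `ClosePrice` column excluded. The closing price is known only after an auction ends, so it is an outcome rather than a predictor, and including it leaks information about the target: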

In [None]:
# AUC was never defined; this assumes it is the raw auction data minus ClosePrice
AUC = ebay_df.drop(columns=['ClosePrice'])
AUC_dummy = pd.get_dummies(AUC, columns=['Category', 'Currency', 'Duration', 'endDay'])
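
Why a post-outcome column distorts evaluation can be illustrated on synthetic data. In the sketch below, all names and values are hypothetical: `open_price` is only weakly related to the label, while `close_price` is constructed so that it almost reveals the label. The tree that sees the leaky column scores far higher, but that score could not be reproduced at listing time.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label shifts open_price slightly and close_price strongly.
rng = np.random.default_rng(0)
n = 1000
y_demo = rng.integers(0, 2, size=n)
open_price = rng.normal(size=n) + 0.5 * y_demo
close_price = open_price + 5.0 * y_demo  # leaky: only known after the outcome


def holdout_accuracy(features):
    """Fit a tree on 60% of the rows and return accuracy on the remaining 40%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y_demo, test_size=0.4, random_state=1
    )
    model = DecisionTreeClassifier(min_samples_leaf=50, random_state=1).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


acc_honest = holdout_accuracy(open_price.reshape(-1, 1))
acc_leaky = holdout_accuracy(np.column_stack([open_price, close_price]))
print(acc_honest, acc_leaky)
```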

#### **New Train-Test Split**

In [None]:
X2 = AUC_dummy.drop(columns=['Competitive?'])
y2 = AUC_dummy['Competitive?']
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.4, random_state=1)

#### **Training and Evaluation**

In [None]:
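# Fit the second tree on the data without ClosePrice
Second_ClassTree = DecisionTreeClassifier(min_samples_leaf=50, random_state=1)
Second_ClassTree.fit(X_train2, y_train2)
# print() is needed because a cell only displays its last expression
print(accuracy_score(y_test2, Second_ClassTree.predict(X_test2)))
classificationSummary(y_test2, Second_ClassTree.predict(X_test2))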

### **7. Visual Exploration**
To further explore data relationships, scatter plots were used to evaluate the interaction between key numeric features:

In [None]:
import seaborn as sns
sns.scatterplot(x="sellerRating", y="OpenPrice", hue="Competitive?", data=AUC)

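Because both variables are skewed and contain outliers, log10-transformed versions were also plotted; zeros are mapped to NaN first so they are excluded rather than producing `-inf`: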

In [None]:
import numpy as np
X_resized = np.log10(np.where(AUC_dummy['sellerRating'] == 0, np.nan, AUC_dummy['sellerRating']))
Y_resized = np.log10(np.where(AUC_dummy['OpenPrice'] == 0, np.nan, AUC_dummy['OpenPrice']))
sns.scatterplot(x=X_resized, y=Y_resized, hue=AUC_dummy['Competitive?'])
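
The zero-handling in the transformation above can be seen on a small hypothetical array. A plain `np.log10` maps zero to `-inf`, whereas replacing zeros with NaN first produces a missing value instead.

```python
import numpy as np

# Hypothetical seller ratings, including a zero.
ratings = np.array([0, 1, 10, 1000])

# Naive transform: log10(0) evaluates to -inf (warning suppressed here).
with np.errstate(divide="ignore"):
    naive = np.log10(ratings)

# Transform used in the notebook: zeros become NaN before the logarithm.
safe = np.log10(np.where(ratings == 0, np.nan, ratings))
print(naive)
print(safe)
```

A side effect worth keeping in mind is that listings with a zero value no longer appear on the log-scale view.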

### **8. Final Model Representation**
The final tree structure was visualized using both textual and graphical formats:

In [None]:
tree.plot_tree(Second_ClassTree, feature_names=list(X2.columns), class_names=['0', '1'], rounded=True)
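
For the textual format, scikit-learn provides `export_text`, which prints the fitted tree as indented if/else rules. The sketch below fits a small tree on synthetic data; the feature names are borrowed from the dataset purely as labels, and the values are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data where the label depends only on the first feature.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

clf = DecisionTreeClassifier(min_samples_leaf=50, random_state=1).fit(X_demo, y_demo)

# feature_names is passed as a plain list of strings.
rules = export_text(clf, feature_names=["OpenPrice", "sellerRating"])
print(rules)
```

The text form is convenient for copying rules into a report or diffing two trees, which is harder to do with the plotted figure.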

### **9. Conclusion**
This project demonstrated how to build interpretable classification models using decision trees on eBay auction data. Key takeaways include:

* The dataset needed no cleaning beyond encoding, and the class distribution check showed reasonably balanced labels, so preprocessing was straightforward.
* Categorical variables were handled via one-hot encoding.
* The decision tree with `min_samples_leaf=50`, which requires at least 50 training samples per leaf to curb overfitting, generalized well on the 40% held-out test set.
* Visual and textual tree representations enhanced model transparency.
* Excluding `ClosePrice`, which is known only after an auction ends, removed target leakage and made the second model relevant for predictions made at listing time.