# Chapter 2 - Group Exercise 2 - Regression Analysis and Feature Selection
## Kaggle data source: https://www.kaggle.com/datasets/asadullahcreative/e-commerce-microphone-marketplace-dataset

# Part 1: Handling Missing Values & Outliers (Done by Saniya Shaikh)

## Loading Dataset (Done by Saniya Shaikh)

In [22]:
import pandas as pd
import numpy as np




In [23]:
df = pd.read_excel("../data/microphone_data.xlsx")
df.shape


(1360, 30)

## Handling Missing Values (Done by Saniya Shaikh)

In [24]:
df.isnull().sum()


serial no                   0
title                       0
price                       0
sold_count                  0
rating                      0
review_count                0
location                    0
seller_name                 0
category                  150
original_price              0
discount_percent            0
is_discounted               0
discount_category           0
price_tier                  0
price_per_rating            0
popularity_score            0
review_rate                 0
value_score                 0
price_vs_category_avg     150
is_top_rated              150
seller_tier               782
seller_avg_rating           0
seller_total_sales          0
discount_effectiveness      0
market_position             0
review_density              0
virality_score              0
price_capped                0
sold_count_capped           0
review_count_capped         0
dtype: int64
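
Raw counts are easier to judge as a share of the 1360 rows: `seller_tier` is missing in 782 rows (57.5%), while `category`, `price_vs_category_avg` and `is_top_rated` are each missing in 150 rows (about 11%). A minimal sketch of how counts and percentages could be reported together, using a small hypothetical frame (`toy`) rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real DataFrame
toy = pd.DataFrame({
    "price": [10.0, 12.5, np.nan, 9.0],
    "category": ["A", None, None, "B"],
    "rating": [4.5, 4.0, 3.5, 5.0],
})

missing = toy.isnull().sum()
report = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(1),
})

# Keep only the columns that actually have gaps, worst first
report = report[report["missing"] > 0].sort_values("missing", ascending=False)
print(report)
```

Applied to `df`, the same pattern would list only the four columns with gaps.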

## Separating numerical & categorical features (Done by Saniya Shaikh)

In [25]:
num_cols = df.select_dtypes(include=["int64", "float64"]).columns
cat_cols = df.select_dtypes(include=["object"]).columns


## Imputing missing values (Done by Saniya Shaikh)

In [26]:
from sklearn.impute import SimpleImputer

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

df[num_cols] = num_imputer.fit_transform(df[num_cols])
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])
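
Mode imputation fills every gap with the most common existing value, which is a strong assumption for a column such as `seller_tier` where more than half of the rows are missing. One alternative, not used in this notebook, is to keep missingness visible as its own category. The sketch below assumes a hypothetical label `"Unknown"` and toy data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; the real columns have many more rows
toy = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 50.0],
    "seller_tier": pd.Series(["Starter", np.nan, np.nan, "Growing"], dtype=object),
})

median_imputer = SimpleImputer(strategy="median")
# "Unknown" is an assumed placeholder label, not a value from the dataset
constant_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")

toy[["price"]] = median_imputer.fit_transform(toy[["price"]])
toy[["seller_tier"]] = constant_imputer.fit_transform(toy[["seller_tier"]])
print(toy)
```

Whether a separate category is preferable to the mode depends on why the values are missing, which the dataset description would need to confirm.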


## Handling Outliers (Done by Saniya Shaikh)

In [27]:
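def remove_outliers_iqr(data, columns):
    """Drop rows that fall outside the 1.5 * IQR fences of any listed column.

    Columns are filtered one after another, so the quartiles of later
    columns are computed on the rows that survived the earlier ones.
    """
    df_clean = data.copy()
    for col in columns:
        Q1 = df_clean[col].quantile(0.25)
        Q3 = df_clean[col].quantile(0.75)
        IQR = Q3 - Q1
        # Tukey fences: values beyond 1.5 * IQR from the quartiles count as outliers
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
        df_clean = df_clean[(df_clean[col] >= lower) & (df_clean[col] <= upper)]
    return df_clean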


In [28]:
df = remove_outliers_iqr(df, num_cols)
df.shape


(343, 30)
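
Because the filter runs over every column in `num_cols` in turn, the row losses accumulate and the frame shrinks from 1360 to 343 rows. One alternative, not used in this notebook, is to cap values at the same fences so that no rows are lost. A minimal sketch, where `clip_outliers_iqr` and the toy data are hypothetical:

```python
import pandas as pd

def clip_outliers_iqr(data, columns, k=1.5):
    """Cap values at the IQR fences instead of dropping rows (hypothetical helper)."""
    clipped = data.copy()
    for col in columns:
        q1 = clipped[col].quantile(0.25)
        q3 = clipped[col].quantile(0.75)
        iqr = q3 - q1
        # Values beyond the fences are replaced by the fence value
        clipped[col] = clipped[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return clipped

# Toy data with one extreme price
toy = pd.DataFrame({
    "price": [10.0, 11.0, 12.0, 13.0, 500.0],
    "rating": [4.0, 4.2, 4.1, 4.3, 4.4],
})

capped = clip_outliers_iqr(toy, ["price", "rating"])
print(capped.shape)           # row count is unchanged
print(capped["price"].max())  # the extreme price is capped at the upper fence
```

Capping keeps the sample size but distorts the tails, so the choice between dropping and capping is a modelling decision rather than a fix.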

# Part 2: Feature Scaling (Standardization) (Done by Saniya Shaikh)

In [29]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
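
`num_cols` was built before the target was separated, so the step above presumably standardizes `sold_count` along with the features. If the target should stay in its original units, it can be excluded from the scaled columns. A minimal sketch on hypothetical toy data (`feature_cols` is an assumed name):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data
toy = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "rating": [3.0, 4.0, 4.5, 5.0],
    "sold_count": [5, 50, 20, 100],
})
target = "sold_count"

# Scale every numeric column except the target
feature_cols = [c for c in toy.select_dtypes(include="number").columns if c != target]

scaler = StandardScaler()
scaled = toy.copy()
scaled[feature_cols] = scaler.fit_transform(toy[feature_cols])

print(scaled[feature_cols].mean().round(6).tolist())
print(scaled[target].tolist())
```

In a train/test setting the scaler would be fitted on the training rows only and then applied to the test rows with `scaler.transform`.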


# Part 3: Feature Selection (Done by Saniya Shaikh)

In [None]:
target = "sold_count"

X = df.drop(columns=[target])
y = df[target]



In [31]:
# Encode categorical features
X = pd.get_dummies(X, drop_first=True)


In [33]:
# Select the 20 features with the highest univariate F-scores
from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=20)
X_selected = selector.fit_transform(X, y)


In [34]:
selected_features = X.columns[selector.get_support()]
selected_features


Index(['category_W', 'category_WIRELESS MIC K',
       'category_WOE LAVALIER MICROPHONE ', 'category_X', 'category_XO MKF ',
       'category_Y', 'category_YOGA M', 'discount_category_low',
       'discount_category_medium', 'discount_category_none',
       'price_tier_luxury', 'price_tier_mid', 'price_tier_premium',
       'seller_tier_Growing', 'seller_tier_Starter', 'seller_tier_Top Seller',
       'discount_effectiveness_No Heavy Discount',
       'market_position_luxury_High', 'market_position_mid_High',
       'market_position_premium_High'],
      dtype='object')
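
All 20 selected features are one-hot dummy columns and no numeric feature was kept. That may be a genuine result, but it can also happen when the scores are tied or undefined, for example if the target or many features became constant after filtering. A quick sanity check, sketched here on hypothetical toy data (`X_toy`, `y_toy`), is to look at the scores themselves before trusting the selection:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical toy data: one informative, one noisy and one constant feature
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "constant": np.ones(n),
})
y_toy = 3.0 * X_toy["signal"] + rng.normal(scale=0.5, size=n)

# A constant target or constant features make F-scores uninformative
print("target unique values:", y_toy.nunique())
constant_cols = X_toy.columns[X_toy.nunique() <= 1].tolist()
print("constant columns:", constant_cols)

selector = SelectKBest(score_func=f_regression, k=1).fit(X_toy, y_toy)
scores = pd.DataFrame({
    "feature": X_toy.columns,
    "F": selector.scores_,
    "p_value": selector.pvalues_,
}).sort_values("F", ascending=False)
print(scores)
```

Running the same checks on `X` and `y` from this notebook would show whether the ranking reflects real differences in F-score.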

### Preprocessing Summary
- Missing values were imputed with the median (numerical columns) and the mode (categorical columns).
- Outliers were removed with the 1.5 × IQR rule applied to each numerical column in turn, which reduced the dataset from 1360 to 343 rows.
- Numerical features were standardized to zero mean and unit variance so that columns on different scales are comparable.
- Feature selection used SelectKBest with the `f_regression` score to keep the 20 highest-scoring features for predicting `sold_count`.
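
The imputation, scaling, encoding and selection steps summarized above could also be chained in a single scikit-learn pipeline, so that each step is fitted in a fixed order and can be reused on new data. This is a sketch on hypothetical toy data, not the notebook's actual workflow; outlier handling is left out, and the column lists and `k=2` are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with gaps in numeric and categorical columns
toy = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 40.0, 55.0, 60.0],
    "rating": [3.0, 4.0, np.nan, 5.0, 4.5, 3.5],
    "price_tier": pd.Series(
        ["mid", "mid", np.nan, "premium", "premium", "luxury"], dtype=object
    ),
    "sold_count": [5.0, 12.0, 20.0, 33.0, 41.0, 58.0],
})
X_toy = toy.drop(columns=["sold_count"])
y_toy = toy["sold_count"]

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["price", "rating"]),
    ("cat", categorical, ["price_tier"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(score_func=f_regression, k=2)),
])

selected = pipeline.fit_transform(X_toy, y_toy)
print(selected.shape)
```

A regressor could be appended as a final step, which would keep preprocessing inside cross-validation folds.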
