Task 1: Load and split preprocessed data

In [142]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Load the preprocessed dataset
data = pd.read_csv('market_segmentation.csv')

# View dataset information
data.info()

# Split the dataset
X = data.drop('Segmentation', axis=1)  
y = data['Segmentation']

# Handle missing values for numeric features (mean imputation)
numerical_features = ['Age', 'Work_Experience', 'Family_Size']
imputer = SimpleImputer(strategy='mean')
X[numerical_features] = imputer.fit_transform(X[numerical_features])

# Handle missing values for categorical features (fill with the most frequent value)
categorical_features = ['Gender', 'Ever_Married', 'Graduated', 'Profession', 'Spending_Score', 'Var_1']
X[categorical_features] = X[categorical_features].fillna(X[categorical_features].mode().iloc[0])

# One-hot encode the string-typed features
# (scikit-learn 1.2+ names this parameter sparse_output; older releases use sparse=False)
encoder = OneHotEncoder(sparse_output=False)
X_encoded = pd.DataFrame(encoder.fit_transform(X[categorical_features]))
X_encoded.columns = encoder.get_feature_names_out(categorical_features)
X.drop(categorical_features, axis=1, inplace=True)
X = pd.concat([X, X_encoded], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8068 entries, 0 to 8067
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               8068 non-null   int64  
 1   Gender           8068 non-null   object 
 2   Ever_Married     7928 non-null   object 
 3   Age              8068 non-null   int64  
 4   Graduated        7990 non-null   object 
 5   Profession       7944 non-null   object 
 6   Work_Experience  7239 non-null   float64
 7   Spending_Score   8068 non-null   object 
 8   Family_Size      7733 non-null   float64
 9   Var_1            7992 non-null   object 
 10  Segmentation     8068 non-null   object 
dtypes: float64(2), int64(2), object(7)
memory usage: 693.5+ KB




Task 2: Choose an algorithm

In [143]:
classifier = LogisticRegression()

Task 3: Train and test a model

In [144]:
# Train the model
classifier.fit(X_train, y_train)

# Predict the segment for the held-out test set
y_pred_classification = classifier.predict(X_test)


Task 4: Evaluate the model

In [145]:
accuracy = accuracy_score(y_test, y_pred_classification)

print("Classification Accuracy:", accuracy)

Classification Accuracy: 0.2936802973977695
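
An accuracy figure is easier to judge next to a trivial baseline. The sketch below is not part of the original notebook: it assumes scikit-learn's DummyClassifier is an acceptable reference point, and the helper name baseline_accuracy is hypothetical. It reports the accuracy obtained by always predicting the most frequent training class.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score


def baseline_accuracy(X_train, y_train, X_test, y_test):
    """Accuracy of a model that always predicts the most frequent training class."""
    # DummyClassifier ignores the feature values; it only learns the class counts.
    baseline = DummyClassifier(strategy="most_frequent")
    baseline.fit(X_train, y_train)
    return accuracy_score(y_test, baseline.predict(X_test))
```

In this notebook it could be called as baseline_accuracy(X_train, y_train, X_test, y_test) and compared with the accuracy printed above.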


Task 5: Summary

Steps:

Loaded and explored the dataset to understand its structure and features.
Preprocessed the data by handling missing values and converting categorical features into numerical representation using one-hot encoding.
Selected Logistic Regression for the classification task, since the target column, Segmentation, is categorical.
Trained the model on the training split (80% of the data) and predicted the segmentation category for the held-out test split (20%).
Evaluated the trained model on the test split using the accuracy score.

Results: 

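The classification model achieved an accuracy of approximately 29.37% on the test set, which is low. Two aspects of the current setup may contribute: the ID column is still among the features, and the numeric features are not scaled before Logistic Regression is fitted. Improved preprocessing or alternative algorithms are needed to achieve better results.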
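
One possible next step, sketched here under assumptions rather than taken from the notebook: wrap preprocessing and model in a scikit-learn Pipeline so that the imputers, scaler, and encoder are fitted on the training split only. The column lists are the ones used in Task 1. The function name build_pipeline, the use of StandardScaler, dropping ID through remainder="drop", and max_iter=1000 are all illustrative choices, and none of them is guaranteed to raise accuracy.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERICAL = ["Age", "Work_Experience", "Family_Size"]
CATEGORICAL = ["Gender", "Ever_Married", "Graduated",
               "Profession", "Spending_Score", "Var_1"]


def build_pipeline(numerical=NUMERICAL, categorical=CATEGORICAL):
    """Preprocessing plus Logistic Regression in a single estimator."""
    numeric_steps = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ])
    categorical_steps = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Categories unseen during fit are encoded as all zeros instead of raising.
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer(
        [("num", numeric_steps, numerical),
         ("cat", categorical_steps, categorical)],
        remainder="drop",  # columns not listed above (such as ID) are discarded
    )
    return Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000)),  # assumed iteration budget
    ])
```

Used this way, the raw frame is split first and the pipeline is fitted afterwards, for example build_pipeline().fit(X_train_raw, y_train), where X_train_raw is a hypothetical name for the training rows before any imputation or encoding.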

Interesting Observations:

Class imbalance may contribute to the classification model struggling with certain classes more than others. Conducting exploratory data analysis (EDA) on model predictions can identify the classes the model struggles with the most. Techniques like oversampling or undersampling can address class imbalance and potentially improve model performance.
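
To check whether some classes are in fact harder than others, the class counts and per-class recall can be put side by side. The helper below is a minimal sketch, not code from the notebook; the name per_class_report is hypothetical.

```python
import pandas as pd
from sklearn.metrics import recall_score


def per_class_report(y_true, y_pred):
    """Return one row per class with its support (true count) and recall."""
    y_true = pd.Series(y_true)
    labels = sorted(y_true.unique())
    # average=None yields one recall value per label, in the order of `labels`.
    recalls = recall_score(y_true, y_pred, labels=labels,
                           average=None, zero_division=0)
    support = y_true.value_counts().reindex(labels)
    return pd.DataFrame({"support": support.to_numpy(), "recall": recalls},
                        index=labels)
```

In this notebook it could be called as per_class_report(y_test, y_pred_classification). If the supports turn out to be uneven, passing class_weight="balanced" to LogisticRegression is a lighter-weight alternative to resampling that may be worth trying first.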
Coefficients of the Logistic Regression model describe how each feature relates to the log-odds of each segment. Inspecting them helps identify the features with the largest influence on the predictions. Note that coefficient magnitudes are only comparable across features when the features are on a similar scale, and their interpretation depends on the data context.
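
For a fitted Logistic Regression classifier such as the one from Task 3, the coefficients can be laid out as a table for inspection. This is an illustrative sketch; the helper name coefficient_table is hypothetical.

```python
import pandas as pd


def coefficient_table(fitted_model, feature_names):
    """Rows are classes, columns are features, values are fitted coefficients."""
    coef = fitted_model.coef_
    classes = list(fitted_model.classes_)
    # With exactly two classes scikit-learn stores a single coefficient row,
    # which refers to the second class in classes_.
    index = classes if coef.shape[0] == len(classes) else classes[1:]
    return pd.DataFrame(coef, index=index, columns=list(feature_names))
```

In this notebook it could be called as coefficient_table(classifier, X_train.columns).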
Overall, the training and evaluation process provided insights into the data, model performance, and areas for improvement.
