In [5]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, MultiLabelBinarizer

In [7]:
# Step 1: Load the dataset
file_path = "../../data/external/advertisement.csv"
df = pd.read_csv(file_path)

# Select the categorical columns (string, category, and boolean dtypes)
categorical_columns = df.select_dtypes(include=['object', 'category', 'bool']).columns
categorical_data = df[categorical_columns]

# Display the categorical columns
print("Categorical Columns:\n", categorical_columns)
# print("\nCategorical Data (First 5 Rows):\n", categorical_data.head())

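# Label-encode each categorical column with its own encoder so that the
# integer-to-category mapping (encoder.classes_) can be looked up later.
# LabelEncoder assigns codes in sorted order of the values, so the codes
# carry no ordinal meaning.
label_encoders = {}
for col in categorical_columns:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])
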
print(df.describe())

Categorical Columns:
 Index(['gender', 'education', 'married', 'city', 'occupation',
       'most bought item', 'labels'],
      dtype='object')
               age       gender        income    education     married  \
count  1000.000000  1000.000000   1000.000000  1000.000000  1000.00000   
mean     40.836000     0.494000  49349.796167     1.534000     0.49000   
std      13.786848     0.500214   9894.479148     1.109989     0.50015   
min      18.000000     0.000000  21908.867759     0.000000     0.00000   
25%      28.000000     0.000000  42577.352034     1.000000     0.00000   
50%      41.000000     0.000000  48993.757137     2.000000     0.00000   
75%      53.000000     1.000000  56566.795992     3.000000     1.00000   
max      64.000000     1.000000  79459.294416     3.000000     1.00000   

          children         city  occupation  purchase_amount  \
count  1000.000000  1000.000000  1000.00000      1000.000000   
mean      1.508000   483.397000     5.09000       101.098170

### Report on Categorical Data in the Dataset

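This report summarizes the seven categorical columns in the dataset using the summary statistics printed above. The dataset has 1,000 rows covering demographics, purchasing behavior, and labels. Because every categorical column was label-encoded before `df.describe()` was called, the statistics below describe integer codes that `LabelEncoder` assigns in sorted order of the original values, not measured quantities.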

#### Categorical Columns Overview

The following categorical columns are present in the dataset:

1. **Gender**
2. **Education**
3. **Married**
4. **City**
5. **Occupation**
6. **Most Bought Item**
7. **Labels**

### Summary Statistics

#### 1. Gender
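- **Type**: Binary, label-encoded. The mapping 0 = Female, 1 = Male is what sorted-order encoding would produce for those two strings, but it should be confirmed against the raw values.
- **Mean**: 0.494
- **Distribution**: Nearly balanced. A mean of 0.494 over 1,000 rows means 494 rows are coded 1 and 506 are coded 0.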
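
The reading of the mean as a class share can be checked on a small example. This is a minimal sketch on toy data: the strings `"Female"` and `"Male"` are assumed, since the raw values of the `gender` column are not shown in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the gender column; the raw strings are an assumption.
gender = pd.Series(["Male", "Female", "Female", "Male", "Female"])

encoder = LabelEncoder()
codes = encoder.fit_transform(gender)

# classes_ is sorted; each class's position in it is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}

# For a 0/1 column, the mean equals the share of rows coded 1.
share_of_ones = codes.mean()

print(mapping)
print(share_of_ones)
```

Inspecting `classes_` on the fitted encoder is the reliable way to learn which category each code stands for.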

#### 2. Education
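- **Type**: Four categories, coded 0 to 3. The codes come from `LabelEncoder`, which sorts the original values, so they do not necessarily follow the order of educational attainment. The meaning of each code has to be read from the raw data.
- **Mean**: 1.534
- **Distribution**: The quartiles are 1, 2, and 3, so the codes are spread fairly evenly and at least a quarter of the rows carry the highest code (3). Per-category counts are needed before drawing conclusions about attainment levels.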
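
If education is meant to be treated as ordinal, the order has to be supplied explicitly. The sketch below uses an ordered `pd.Categorical` for this; the level names and their order are hypothetical placeholders, not the values in the dataset.

```python
import pandas as pd

# Hypothetical level names, listed from lowest to highest attainment.
# Replace with the values actually present in df["education"].
education_order = ["High School", "Bachelor", "Master", "PhD"]

education = pd.Series(["Master", "High School", "PhD", "Bachelor", "Master"])

# An ordered categorical assigns codes by position in education_order,
# not by alphabetical order as LabelEncoder does.
ordered = pd.Categorical(education, categories=education_order, ordered=True)
education_codes = pd.Series(ordered.codes)

print(education_codes.tolist())
```

With sorted-order encoding, `"Bachelor"` would receive code 0 and `"High School"` code 1, which reverses the two lowest levels in this example.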

#### 3. Married
- **Type**: Binary, label-encoded (0 = Not Married, 1 = Married, assuming the codes follow the sorted order of the original values)
- **Mean**: 0.490
- **Distribution**: Close to an even split: 490 rows are coded 1 and 510 are coded 0.

#### 4. City
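- **Type**: Nominal, label-encoded. The codes are arbitrary identifiers, not a continuous measurement.
- **Mean**: 483.397 (not meaningful for a nominal code)
- **Range**: Codes run from 0 to 968, so there are 969 distinct cities among 1,000 rows.
- **Insights**: With almost one city per row, the column carries little signal as encoded. It would need grouping (for example by region) or a different encoding before it is useful for analysis.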
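
One common option for a high-cardinality nominal column is frequency encoding, where each category is replaced by its share of rows. This is a sketch of that technique on toy data, not the approach used in this notebook; the city names are made up.

```python
import pandas as pd

# Toy stand-in for the city column; the names are illustrative only.
city = pd.Series(["Pune", "Delhi", "Pune", "Agra", "Pune", "Delhi"])

# Cardinality check: how many distinct categories are there?
n_unique = city.nunique()

# Relative frequency of each category, then mapped back onto the rows.
freq = city.value_counts(normalize=True)
city_freq = city.map(freq)

print(n_unique)
print(city_freq.round(3).tolist())
```

When nearly every category appears only once, as the 969 codes here suggest, frequency encoding also becomes uninformative, and grouping cities into broader regions is likely the better first step.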

#### 5. Occupation
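- **Type**: Nominal, label-encoded (codes 0 to 10, i.e. 11 distinct occupations)
- **Mean**: 5.090 (not meaningful for a nominal code)
- **Distribution**: Eleven occupations are represented. Because the codes follow sorted order of the occupation names, the mean does not indicate any "occupation level"; per-category counts are the appropriate summary.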

#### 6. Most Bought Item
- **Type**: Nominal, label-encoded item names. The codes are identifiers, not counts.
- **Mean**: 11.505 (not meaningful for a nominal code)
- **Range**: Codes run from 0 to 23, so there are 24 distinct items.
- **Insights**: Counting rows per item would show which items are bought most often.

#### 7. Labels
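- **Type**: Categorical, label-encoded. The codes are identifiers for distinct label values, not a continuous quantity.
- **Mean**: 194.313 (not meaningful for a categorical code)
- **Range**: Codes run from 0 to 396, so there are 397 distinct label values among 1,000 rows.
- **Insights**: This column likely serves as the target for predictive modeling. The large number of distinct values suggests each entry may combine several labels, which should be checked against the raw data.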
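
If each entry in `labels` does turn out to hold several labels, a multi-label indicator matrix is a more faithful encoding than a single integer code. The sketch below assumes space-separated label strings, which is a guess about the raw format; the label names are made up.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the labels column. Space-separated labels are an
# assumption about the raw data, not something shown in this notebook.
labels = pd.Series(["sports tech", "tech", "food sports"])

# Split each entry into a list of labels, then build one 0/1 column per label.
mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(labels.str.split())

label_names = list(mlb.classes_)
print(label_names)
print(indicator)
```

Each row of `indicator` marks which labels apply to that sample, so a row can have more than one 1.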

### Insights and Observations

1. **Gender Balance**:
   - The dataset appears to have a balanced representation of genders, which is beneficial for generalizing findings across different gender groups.

2. **Educational Attainment**:
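   - The four education codes are spread fairly evenly across the sample. Since the codes are assigned in sorted order by `LabelEncoder`, the level each code stands for must be recovered from the raw values before making claims about attainment.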

3. **Marital Status Representation**:
   - The near-equal distribution of married and unmarried individuals allows for comparative studies on purchasing behavior based on marital status.

4. **City Representation**:
   - City is a nominal variable with 969 distinct values in 1,000 rows. Its integer codes should not be treated as a continuous quantity, and the column needs grouping or another encoding before analysis.

5. **Occupation Diversity**:
   - The variety in occupations provides an opportunity to explore how different job roles influence consumer behavior.

6. **Purchasing Behavior Insights**:
   - The "Most Bought Item" feature can be analyzed to understand consumer preferences and trends in purchasing behavior.

7. **Label Distribution for Target Variable**:
   - The labels column is the likely target for predictive modeling. With 397 distinct values, its distribution and structure should be examined before a modeling technique is chosen.

### Conclusion

This dataset contains valuable categorical information that can provide insights into consumer behavior based on demographic factors such as gender, education, marital status, city, and occupation. Further analysis should focus on how these categorical variables interact with purchasing behaviors and the target labels for predictive modeling purposes.

### Recommendations

1. For nominal variables with few categories (e.g., occupation, most bought item), prefer one-hot encoding over integer label codes in models that would otherwise interpret the codes as ordered.
2. Explore relationships between categorical features and purchasing behavior through visualizations.
3. Conduct further statistical analysis to identify significant predictors among categorical variables for the target labels.
4. Do not treat label-encoded nominal columns (e.g., city) as continuous; group or re-encode high-cardinality columns before modeling.
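
The first and third recommendations can be sketched on a toy frame as follows. Only the column names `occupation`, `married`, and `purchase_amount` come from the dataset; the values are invented for illustration.

```python
import pandas as pd

# Toy rows; the category values and amounts are illustrative only.
toy = pd.DataFrame({
    "occupation": ["Doctor", "Engineer", "Doctor", "Artist"],
    "married": ["yes", "no", "no", "yes"],
    "purchase_amount": [120.0, 80.0, 100.0, 70.0],
})

# One-hot encode the nominal columns: one 0/1 column per category.
one_hot = pd.get_dummies(toy, columns=["occupation", "married"], dtype=int)

# A first look at how a categorical feature relates to spending.
mean_by_married = toy.groupby("married")["purchase_amount"].mean()

print(one_hot.columns.tolist())
print(mean_by_married)
```

One-hot encoding is applied to the raw category values here, so it would replace the label-encoding step rather than follow it.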