<a name = Section1></a>
#### **1. Importing Libraries**

In [2]:
import pandas as pd                                                 # Importing pandas for panel data analysis
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing NumPy (Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface of matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warnings to control runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings

In [3]:
##DataFrame for unit testing
country_name_ls = ['Albania'] *6 + ['Russia']*2 + ['India']*2
year_ls = ['2009', '2009','2010', '2011', '2011', '2012', '2010', '2012', '2009', '2010']
# sex_ls = ['m']*4 + ['f']*2 + ['m'] + ['f']*3
sex_ls = ['m']*5 + ['f']*5
test_df = pd.DataFrame()
test_df['Country'] = country_name_ls
test_df['Year'] = year_ls
test_df['Sex'] = sex_ls
print(test_df)
print(test_df.shape)

   Country  Year Sex
0  Albania  2009   m
1  Albania  2009   m
2  Albania  2010   m
3  Albania  2011   m
4  Albania  2011   m
5  Albania  2012   f
6   Russia  2010   f
7   Russia  2012   f
8    India  2009   f
9    India  2010   f
(10, 3)


<a name = Section1></a>
#### **2. Data Acquisition and Description**

Let's analyze the dataset and identify which attributes require generalization/categorization before we perform BUC (Bottom-Up Computation) on them.

In [4]:
data = pd.read_excel('./data/master.xlsx') # Load the Excel dataset
print('Shape of the dataset:', data.shape)
data.head(3)

Shape of the dataset: (27820, 12)


| | country | year | sex | age | suicides_no | population | suicides/100k pop | country-year | HDI for year | gdp_for_year ($) | gdp_per_capita ($) | generation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 1987 | male | 15-24 years | 21 | 312900 | 6.71 | Albania1987 | NaN | 2156624900 | 796 | Generation X |
| 1 | Albania | 1987 | male | 35-54 years | 16 | 308000 | 5.19 | Albania1987 | NaN | 2156624900 | 796 | Silent |
| 2 | Albania | 1987 | female | 15-24 years | 14 | 289700 | 4.83 | Albania1987 | NaN | 2156624900 | 796 | Generation X |


- We have 27,820 records and 12 attributes.
- The records contain a variety of data types, including nominal, binary, and numerical data.

In [5]:
data.info() # Display basic information about the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27820 entries, 0 to 27819
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   country             27820 non-null  object 
 1   year                27820 non-null  int64  
 2   sex                 27820 non-null  object 
 3   age                 27820 non-null  object 
 4   suicides_no         27820 non-null  int64  
 5   population          27820 non-null  int64  
 6   suicides/100k pop   27820 non-null  float64
 7   country-year        27820 non-null  object 
 8   HDI for year        8364 non-null   float64
 9    gdp_for_year ($)   27820 non-null  int64  
 10  gdp_per_capita ($)  27820 non-null  int64  
 11  generation          27820 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.5+ MB


Next, we check for:
1. duplicate rows, which would be deleted
2. missing values in each column

In [7]:
duplicate = data[data.duplicated()] # Selecting duplicate rows except first occurrence based on all columns
duplicate

| | country | year | sex | age | suicides_no | population | suicides/100k pop | country-year | HDI for year | gdp_for_year ($) | gdp_per_capita ($) | generation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|


- The result is empty, which means there are no duplicate records.

In [9]:
print(data.isnull().sum())  # Check for missing values

country                   0
year                      0
sex                       0
age                       0
suicides_no               0
population                0
suicides/100k pop         0
country-year              0
HDI for year          19456
 gdp_for_year ($)         0
gdp_per_capita ($)        0
generation                0
dtype: int64


- 'HDI for year' has 19,456 null values out of 27,820 entries (about 70%). Since more than half of the entries are missing, we drop this column.
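
The share of missing values can also be computed directly rather than read off the raw counts. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from the dataset); the 0.5 threshold mirrors the "more than half" rule above and is an assumption about where to draw the line.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only): 'hdi' is missing in 3 of 4 rows.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "hdi": [np.nan, 0.7, np.nan, np.nan],
})

# isnull().mean() gives the fraction of missing values per column.
missing_share = toy.isnull().mean()

# Columns whose missing share exceeds the assumed 50% threshold.
threshold = 0.5
sparse_columns = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print("Columns to drop:", sparse_columns)
```

Applied to the real data, the same expression would report roughly 0.70 for 'HDI for year' (19,456 / 27,820).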

In [10]:
data.drop(['HDI for year'], axis=1, inplace=True)   # Remove the mentioned column

In [11]:
data.columns # remaining columns in our dataframe

Index(['country', 'year', 'sex', 'age', 'suicides_no', 'population',
       'suicides/100k pop', 'country-year', ' gdp_for_year ($) ',
       'gdp_per_capita ($)', 'generation'],
      dtype='object')

<a name = Section2></a>
#### **3. Data Analysis and AOI**

Now let's analyze the remaining 11 dimensions one by one and determine which of them need Attribute Oriented Induction (AOI) for generalization/categorization.

Data generalization summarizes data by replacing relatively low-level values with higher-level concepts, or by reducing the number of dimensions to summarize data in concept space involving fewer dimensions.
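
As a small illustration of replacing low-level values with higher-level concepts, the sketch below climbs one level in a concept hierarchy for the age groups. The mapping `age_hierarchy` and the labels in it are hypothetical; this notebook keeps 'age' unchanged, so the example only shows the mechanism.

```python
import pandas as pd

# Hypothetical concept hierarchy (not used elsewhere in this notebook):
# each low-level age band is mapped to a coarser, higher-level concept.
age_hierarchy = {
    "5-14 years": "child",
    "15-24 years": "young_adult",
    "25-34 years": "young_adult",
    "35-54 years": "middle_aged",
    "55-74 years": "senior",
    "75+ years": "senior",
}

toy = pd.DataFrame(
    {"age": ["5-14 years", "15-24 years", "25-34 years", "75+ years"]}
)

# Generalization step: replace each value by its parent concept.
toy["age_group"] = toy["age"].map(age_hierarchy)

# The generalized attribute has fewer distinct values than the original.
print(toy["age"].nunique(), "->", toy["age_group"].nunique())
```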

In [12]:
print("Unique values in column country:\n", data["country"].unique())
print("---------------------------------------------------------")
print("Number of unique values:", data["country"].nunique())

Unique values in column country:
 ['Albania' 'Antigua and Barbuda' 'Argentina' 'Armenia' 'Aruba' 'Australia'
 'Austria' 'Azerbaijan' 'Bahamas' 'Bahrain' 'Barbados' 'Belarus' 'Belgium'
 'Belize' 'Bosnia and Herzegovina' 'Brazil' 'Bulgaria' 'Cabo Verde'
 'Canada' 'Chile' 'Colombia' 'Costa Rica' 'Croatia' 'Cuba' 'Cyprus'
 'Czech Republic' 'Denmark' 'Dominica' 'Ecuador' 'El Salvador' 'Estonia'
 'Fiji' 'Finland' 'France' 'Georgia' 'Germany' 'Greece' 'Grenada'
 'Guatemala' 'Guyana' 'Hungary' 'Iceland' 'Ireland' 'Israel' 'Italy'
 'Jamaica' 'Japan' 'Kazakhstan' 'Kiribati' 'Kuwait' 'Kyrgyzstan' 'Latvia'
 'Lithuania' 'Luxembourg' 'Macau' 'Maldives' 'Malta' 'Mauritius' 'Mexico'
 'Mongolia' 'Montenegro' 'Netherlands' 'New Zealand' 'Nicaragua' 'Norway'
 'Oman' 'Panama' 'Paraguay' 'Philippines' 'Poland' 'Portugal'
 'Puerto Rico' 'Qatar' 'Republic of Korea' 'Romania' 'Russian Federation'
 'Saint Kitts and Nevis' 'Saint Lucia' 'Saint Vincent and Grenadines'
 'San Marino' 'Serbia' 'Seychelles' 'Singap

- We will use the values of the 'country' dimension as they are; we treat country as the top level of its concept hierarchy, so no further generalization is applied.

In [13]:
print("Unique values in column year:\n", data["year"].unique())
print("--------------------------------------------")
print("Number of unique values:", data["year"].nunique())

Unique values in column year:
 [1987 1988 1989 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002
 2003 2004 2005 2006 2007 2008 2009 2010 1985 1986 1990 1991 2012 2013
 2014 2015 2011 2016]
--------------------------------------------
Number of unique values: 32


- We will use the values of the 'year' dimension as they are, since it has only 32 distinct values.

In [15]:
print("Unique values in column sex:\n", data["sex"].unique())
print("--------------------------------------------")
print("Number of unique values:", data["sex"].nunique())

Unique values in column sex:
 ['male' 'female']
--------------------------------------------
Number of unique values: 2


- We will use the values of the 'sex' dimension as they are, because it is already generalized, with only two distinct values.

In [16]:
print("Unique values in column age:\n", data["age"].unique())
print("--------------------------------------------")
print("Number of unique values:", data["age"].nunique())

Unique values in column age:
 ['15-24 years' '35-54 years' '75+ years' '25-34 years' '55-74 years'
 '5-14 years']
--------------------------------------------
Number of unique values: 6


- We will use the values of the 'age' dimension as they are, because it is already categorized into six distinct values.

In [17]:
print("Unique values in column suicides_no:\n", data["suicides_no"].unique())
print("--------------------------------------------")
print("Number of unique values:", data["suicides_no"].nunique())

Unique values in column suicides_no:
 [  21   16   14 ... 5503 4359 2872]
--------------------------------------------
Number of unique values: 2084


In [21]:
data["suicides_no"].describe()    # describe the values of the suicides_no attribute.

count    27820.000000
mean       242.574407
std        902.047917
min          0.000000
25%          3.000000
50%         25.000000
75%        131.000000
max      22338.000000
Name: suicides_no, dtype: float64

In [20]:
data["suicides_no"].value_counts()

suicides_no
0       4281
1       1539
2       1102
3        867
4        696
        ... 
2158       1
525        1
2297       1
5241       1
2872       1
Name: count, Length: 2084, dtype: int64

- From `describe()`, the minimum number of suicides is 0 and the maximum is 22338.
- The 25th percentile is 3: 25% of the records report 3 or fewer suicides.
- The 50th percentile (median) is 25: half of the records report 25 or fewer suicides. For a high-level description, we label all values up to the median as `low_suicides_range`.
- The 75th percentile is 131: 75% of the records report 131 or fewer suicides. Values above the median and up to the 75th percentile are labelled `medium_suicides_range`.
- All values above the 75th percentile, up to the maximum of 22338, are labelled `high_suicides_range`.
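
The three ranges described above can also be sketched with `pd.cut`, using the median and the 75th percentile as bin edges. This is an alternative formulation on made-up values, not the approach the notebook itself takes; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative values only (not taken from the dataset).
values = pd.Series([0, 10, 20, 30, 40, 50, 60, 70, 80])

q50 = values.quantile(0.5)
q75 = values.quantile(0.75)

# Right-closed bins: (-inf, q50], (q50, q75], (q75, inf)
bins = [-np.inf, q50, q75, np.inf]
labels = ["low_suicides_range", "medium_suicides_range", "high_suicides_range"]

# pd.cut returns a categorical; convert to plain strings for readability.
ranges = pd.cut(values, bins=bins, labels=labels).astype(str)

print(ranges.value_counts())
```

`pd.cut` requires distinct, increasing bin edges, so this sketch assumes the two quantiles differ; on heavily skewed data that assumption would need checking first.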

In [22]:
# Define the labels and conditions
conditions = [
    (data['suicides_no'] <= data['suicides_no'].quantile(0.5)),
    (data['suicides_no'] > data['suicides_no'].quantile(0.5)) & (data['suicides_no'] <= data['suicides_no'].quantile(0.75)),
    (data['suicides_no'] > data['suicides_no'].quantile(0.75))
]

labels = ['low_suicides_range', 'medium_suicides_range', 'high_suicides_range']

# Create a new column with the labels
data['suicides_range'] = np.select(conditions, labels, default='unknown')
data.drop(['suicides_no'], axis=1, inplace=True)   # Remove the column 'suicides_no' because we are using 'suicides_range' in place of that.
# Display the first few rows of the DataFrame with the new column
data.head(2)

| | country | year | sex | age | population | suicides/100k pop | country-year | gdp_for_year ($) | gdp_per_capita ($) | generation | suicides_range |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 1987 | male | 15-24 years | 312900 | 6.71 | Albania1987 | 2156624900 | 796 | Generation X | low_suicides_range |
| 1 | Albania | 1987 | male | 35-54 years | 308000 | 5.19 | Albania1987 | 2156624900 | 796 | Silent | low_suicides_range |


In [23]:
print("Unique values in column population:\n", data["population"].unique())
print("--------------------------------------------")
print("Number of unique values:", data["population"].nunique())

Unique values in column population:
 [ 312900  308000  289700 ... 2762158 2631600 1438935]
--------------------------------------------
Number of unique values: 25564


- 'population' cannot be used directly: it has 25,564 distinct numerical values, so we perform AOI to create higher-level categories for it.

In [24]:
data["population"].describe().apply(lambda x: format(x, 'f')) # Suppress Scientific Notation

count       27820.000000
mean      1844793.617398
std       3911779.441756
min           278.000000
25%         97498.500000
50%        430150.000000
75%       1486143.250000
max      43805214.000000
Name: population, dtype: object

In [25]:
data["population"].value_counts()

population
24000      20
26900      13
20700      12
22000      12
4900       11
           ..
3282478     1
3953119     1
5745824     1
8448839     1
1438935     1
Name: count, Length: 25564, dtype: int64

- From `describe()`, the minimum reported population is 278 and the maximum is 43805214.
- The 25th percentile is 97498.5: 25% of the records have a population at or below this value.
- The 50th percentile (median) is 430150: half of the records have a population at or below this value.
- The 75th percentile is 1486143.25: 75% of the records have a population at or below this value.
- For a high-level description:
    - values up to the 25th percentile are labelled `low_population_range`.
    - values above the 25th percentile and up to the 75th percentile are labelled `medium_population_range`.
    - values above the 75th percentile, up to the maximum reported value, are labelled `high_population_range`.

In [26]:
# Define the labels and conditions
conditions = [
    (data['population'] <= data['population'].quantile(0.25)),
    (data['population'] > data['population'].quantile(0.25)) & (data['population'] <= data['population'].quantile(0.75)),
    (data['population'] > data['population'].quantile(0.75))
]

labels = ['low_population_range', 'medium_population_range', 'high_population_range']

# Create a new column with the labels
data['population_range'] = np.select(conditions, labels, default='unknown')
data.drop(['population'], axis=1, inplace=True)   # Remove the column 'population' because we are using 'population_range' in place of that.
# Display the first few rows of the DataFrame with the new column
data.head(2)

| | country | year | sex | age | suicides/100k pop | country-year | gdp_for_year ($) | gdp_per_capita ($) | generation | suicides_range | population_range |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 1987 | male | 15-24 years | 6.71 | Albania1987 | 2156624900 | 796 | Generation X | low_suicides_range | medium_population_range |
| 1 | Albania | 1987 | male | 35-54 years | 5.19 | Albania1987 | 2156624900 | 796 | Silent | low_suicides_range | medium_population_range |


In [29]:
print(data["suicides/100k pop"].value_counts())

suicides/100k pop
0.00     4281
0.29       72
0.32       69
0.34       55
0.37       52
         ... 
46.73       1
41.47       1
61.03       1
28.25       1
26.61       1
Name: count, Length: 5298, dtype: int64


In [30]:
print("Unique values in the column \"suicides/100k pop\":\n", data["suicides/100k pop"].unique())
print("Number of unique values:", data["suicides/100k pop"].nunique())
print("--------------------------------------------")
print("Unique values in the column \"country-year\":\n", data["country-year"].unique())
print("Number of unique values:", data["country-year"].nunique())
print("--------------------------------------------")

Unique values in the column "suicides/100k pop":
 [ 6.71  5.19  4.83 ... 47.86 40.75 26.61]
Number of unique values: 5298
--------------------------------------------
Unique values in the column "country-year":
 ['Albania1987' 'Albania1988' 'Albania1989' ... 'Uzbekistan2012'
 'Uzbekistan2013' 'Uzbekistan2014']
Number of unique values: 2321
--------------------------------------------


- Because we already use 'population' and 'suicides_no' in their generalized forms, we can remove the derived dimension "suicides/100k pop" from our dataset.
- Similarly, 'country-year' only combines the 'country' and 'year' dimensions, which we keep as separate dimensions, so we drop it as well.
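
The redundancy of these two columns can be checked on the three rows shown by `data.head(3)` earlier. The sketch below assumes that "suicides/100k pop" is the suicide count per 100,000 people rounded to two decimals, and that 'country-year' is the country name followed by the year; both formulas are assumptions that happen to be consistent with these rows.

```python
import numpy as np
import pandas as pd

# The first three rows displayed by data.head(3) earlier in this notebook.
sample = pd.DataFrame({
    "country": ["Albania", "Albania", "Albania"],
    "year": [1987, 1987, 1987],
    "suicides_no": [21, 16, 14],
    "population": [312900, 308000, 289700],
    "suicides/100k pop": [6.71, 5.19, 4.83],
    "country-year": ["Albania1987"] * 3,
})

# Assumed formula: suicides per 100,000 people, rounded to two decimals.
rate = (sample["suicides_no"] / sample["population"] * 100_000).round(2)

# Assumed construction: country name followed by the year.
country_year = sample["country"] + sample["year"].astype(str)

rate_matches = bool(np.isclose(rate, sample["suicides/100k pop"]).all())
key_matches = bool((country_year == sample["country-year"]).all())

print("rate can be derived:", rate_matches)
print("country-year can be derived:", key_matches)
```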

In [31]:
# Remove the two derived columns, along with 'gdp_per_capita ($)'
data.drop(columns=['suicides/100k pop', 'country-year', 'gdp_per_capita ($)'], inplace=True)

In [32]:
data[" gdp_for_year ($) "].describe().apply(lambda x: format(x, 'f')) # Suppress Scientific Notation

count             27820.000000
mean       445580969025.726624
std       1453609985940.912109
min            46919625.000000
25%          8985352832.000000
50%         48114688201.000000
75%        260202429150.000000
max      18120714000000.000000
Name:  gdp_for_year ($) , dtype: object

- From `describe()`, the minimum reported `gdp_for_year` value is 46,919,625 USD and the maximum is 18,120,714,000,000 USD.
- The 25th percentile is 8,985,352,832 USD: 25% of the records have a GDP at or below this value.
- The 50th percentile (median) is 48,114,688,201 USD: half of the records have a GDP at or below this value.
- The 75th percentile is 260,202,429,150 USD: 75% of the records have a GDP at or below this value.
- For a high-level description:
    - values up to the 25th percentile are labelled `low_income_range`.
    - values above the 25th percentile and up to the 75th percentile are labelled `medium_income_range`.
    - values above the 75th percentile, up to the maximum reported value, are labelled `high_income_range`.

In [33]:
# Define the labels and conditions
conditions = [
    (data[' gdp_for_year ($) '] <= data[' gdp_for_year ($) '].quantile(0.25)),
    (data[' gdp_for_year ($) '] > data[' gdp_for_year ($) '].quantile(0.25)) & (data[' gdp_for_year ($) '] <= data[' gdp_for_year ($) '].quantile(0.75)),
    (data[' gdp_for_year ($) '] > data[' gdp_for_year ($) '].quantile(0.75))
]

labels = ['low_income_range', 'medium_income_range', 'high_income_range']

# Create a new column with the labels
data['gdp_per_year_income_range'] = np.select(conditions, labels, default='unknown')
data.drop([' gdp_for_year ($) '], axis=1, inplace=True)   # Remove the column ' gdp_for_year ($) ' because 'gdp_per_year_income_range' replaces it.
# Display the first few rows of the DataFrame with the new column
data.head(2)

| | country | year | sex | age | gdp_per_capita ($) | generation | suicides_range | population_range | gdp_per_year_income_range |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 1987 | male | 15-24 years | 796 | Generation X | low_suicides_range | medium_population_range | low_income_range |
| 1 | Albania | 1987 | male | 35-54 years | 796 | Silent | low_suicides_range | medium_population_range | low_income_range |


In [35]:
print(data["generation"].value_counts())

generation
Generation X       6408
Silent             6364
Millenials         5844
Boomers            4990
G.I. Generation    2744
Generation Z       1470
Name: count, dtype: int64


In [36]:
print("Unique values:\n", data["generation"].unique())
print("--------------------------------------------")
print("Number of unique values:", data["generation"].nunique())

Unique values:
 ['Generation X' 'Silent' 'G.I. Generation' 'Boomers' 'Millenials'
 'Generation Z']
--------------------------------------------
Number of unique values: 6


In [37]:
print(data)

          country  year     sex          age  gdp_per_capita ($)  \
0         Albania  1987    male  15-24 years                 796   
1         Albania  1987    male  35-54 years                 796   
2         Albania  1987  female  15-24 years                 796   
3         Albania  1987    male    75+ years                 796   
4         Albania  1987    male  25-34 years                 796   
...           ...   ...     ...          ...                 ...   
27815  Uzbekistan  2014  female  35-54 years                2309   
27816  Uzbekistan  2014  female    75+ years                2309   
27817  Uzbekistan  2014    male   5-14 years                2309   
27818  Uzbekistan  2014  female   5-14 years                2309   
27819  Uzbekistan  2014  female  55-74 years                2309   

            generation         suicides_range         population_range  \
0         Generation X     low_suicides_range  medium_population_range   
1  

In [39]:
# Delete unnecessary variables after the preprocessing steps
del conditions, duplicate, labels

<a name = Section4></a>
#### **4. BUC Implementation**

In [2]:
##DataFrame for unit testing
country_name_ls = ['Albania'] *4 + ['Russia']*2 + ['Albania']*2 + ['India']*2
year_ls = ['2009', '2009','2010', '2011', '2011', '2012', '2010', '2012', '2009', '2010']
sex_ls = ['m']*4 + ['f']*2 + ['m'] + ['f']*3
test_df = pd.DataFrame()
test_df['Country'] = country_name_ls
test_df['Year'] = year_ls
test_df['Sex'] = sex_ls

In [5]:
class preprocess_df:
  '''
  Class to preprocess DataFrame
  '''
  def encode_attributes(self, input_df, column_indices):
    transformed_dicts_ls = []
    transformed_df = input_df.copy(deep = True)
    column_names = transformed_df.columns.tolist()
    print(f"{column_names = }")
    for col_iter in column_indices:
      temp_dict = {}
      temp_key = 0
      temp_ls = []
      column_name = column_names[col_iter]
      for col in transformed_df.iloc[:,col_iter].tolist():
        if col not in temp_dict:
          temp_dict[col] = temp_key
          temp_key += 1
        temp_ls.append(temp_dict[col])
      dict_inv = {v:k for k,v in temp_dict.items()}
      transformed_dicts_ls.append(dict_inv)
      transformed_df[column_name] = temp_ls
    return transformed_df, transformed_dicts_ls
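
For comparison, pandas ships a helper that produces the same kind of first-appearance integer encoding as `encode_attributes`. The sketch below uses `pd.factorize` on made-up values; it is shown as a possible alternative, not as what the class above does internally.

```python
import pandas as pd

# Illustrative column values.
values = pd.Series(["Albania", "Albania", "Russia", "India", "Russia"])

# codes follow first-appearance order; uniques[i] is the value encoded as i.
codes, uniques = pd.factorize(values)

# Inverse mapping (code -> original value), comparable to dict_inv above.
decode = dict(enumerate(uniques))

print(codes.tolist())
print(decode)
```

Looking up `decode[code]` recovers the original label, which is the role the dictionaries in `transformed_dicts_ls` play when BUC writes its output cells.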

In [11]:
#BUC implementation
class buc:
    '''
    Class for implementing BUC
    '''
    def __init__(self, df, column_enc_dicts_ls, minsup):
        self.numDims = df.shape[1]
        self.cardinality = []
        self.minsup = minsup
        self.output_df = None
        self.datacounts = [[] for _ in range(df.shape[1])]  # One independent list per dimension
        self.attribute_ls = ["*"] * df.shape[1]
        self.debug_counter = 0
        self.output_dict = {}
        self.column_enc_dicts_ls = column_enc_dicts_ls

    def counting_sort(self, array_a, df_idx_ls):
      '''
      Inputs 
      array_a: List to be sorted
      df_idx_ls: Index list corresponding to the array_a. For example: DataFrame indices corresponding to array_a.
      Output
      idx_ls: Order in which df_idx_ls should be arranged so that array_a is in the sorted order.
      '''
      array_c = [0]*(max(array_a) + 1)
      idx_ls = [-1] * (len(array_a))

      # print(f"{array_a = }")
      # print(f"{array_c = }")
      for i in range(0, len(array_a)):
        array_c[array_a[i]] += 1

      for i in range(0, len(array_c) - 1):
        array_c[i+1] = array_c[i] + array_c[i+1]

      for i in range(len(array_a) - 1, -1, -1):
        array_c[array_a[i]] = array_c[array_a[i]] - 1
        idx = array_c[array_a[i]]
        idx_ls[idx] = df_idx_ls[i]

      # idx_ls = [i + min_idx for i in idx_ls]
      # print(f"{array_a = }")
      # print(f"{idx_ls = }")
      return idx_ls


    def partition(self, input_df, d, bigc):
        '''
        Implements partitioning logic i.e sorts the input dataframe and populates self.datacounts
        Inputs:
        input_df: Input DataFrame
        d: column number based on which sorting is performed
        Output:
        input_df: DataFrame which is sorted according to the specified column
        '''
        #Sorting the dataframe
        temp_counter_dict = {}
        sorted_idx = self.counting_sort(input_df.iloc[:,d].tolist(), input_df.index.tolist())
        input_df = input_df.reindex(sorted_idx)
        #Populating self.datacounts
        for attribute in input_df.iloc[:,d].tolist():
            temp_counter_dict[attribute] = temp_counter_dict.get(attribute, 0) + 1
        self.datacounts[d] = [*temp_counter_dict.values()]
        # self.datacounts[d] = input_df.iloc[:,d].value_counts().tolist()
        # print(f"{self.datacounts = }")
        return input_df

    def buc_implementation(self, input, dim):
        '''
        Function to implement BUC as indicated in the original paper. 
        Populates self.output_dict which is the output dictionary.
        NOTE:All the variable names are exactly as indicated in the original paper.
        Input
        input: Input DataFrame
        dim: Starting column for performing aggregation
        '''
        self.debug_counter += 1
        # print(f"iter: {self.debug_counter - 1}")
        # print(f"Aggregate: {input.shape[0]}")
        if tuple(self.attribute_ls) in self.output_dict:
          print(f"Error: cell {tuple(self.attribute_ls)} was already computed")
        self.output_dict[tuple(self.attribute_ls)] = input.shape[0]
        # if self.debug_counter == 12:
          # return 
        # print(f"{dim = }")
        for d in range(dim, self.numDims,1):
            bigc = input.iloc[:,d].nunique()
            # print(f"{d= }, {bigc=}")
            # print(f"Input before partitioning: {input}")
            input = self.partition(input, d, bigc)
            # print(f"Input after partitioning on {d}: {input}")
            # print(f"{self.datacounts = }")
            k = 0
            for i in range(0, bigc, 1):
                # print(f"################Inside i loop######################")
                # print(f"{d = }, {i = }")
                # print(f"{self.datacounts = }")
                smallc = self.datacounts[d][i]
                # print(f"{smallc = }")
                if smallc >= self.minsup:
                    # print(f"**********************Inside if condition***********************")
                    print(f"k:{k},  d:{d}")
                    # print(f"Attribute: {input.iloc[k,d]}")
                    # print(f"{transformed_dicts[d] = }")
                    self.attribute_ls[d] = self.column_enc_dicts_ls[d][input.iloc[k,d]]
                    self.buc_implementation(input.iloc[k:k+smallc,:], dim=d+1)
                    # if self.debug_counter == 12:
                      # return
                    # print(f"d inside if condition: {d}")
                    # print(f"******************************************************************")
                k += smallc
            # print(f"#################################################################")
            # print(f"ALL:")
            print(f"k: {k}, d:{d}")
            self.attribute_ls[d] = "*"
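
To make the role of `counting_sort` in `partition` easier to follow, here is a standalone sketch of the same idea: a stable counting sort that returns row labels in the order that sorts the keys. The function name `counting_argsort` and the sample values are illustrative, and the sketch assumes a non-empty list of non-negative integer keys, such as the codes produced by `encode_attributes`.

```python
def counting_argsort(keys, labels):
    """Return labels reordered so that keys are ascending (stable)."""
    counts = [0] * (max(keys) + 1)
    # Count how often each key occurs.
    for key in keys:
        counts[key] += 1
    # Prefix sums: counts[v] becomes the number of keys <= v.
    for value in range(1, len(counts)):
        counts[value] += counts[value - 1]
    ordered = [None] * len(keys)
    # Walk backwards so equal keys keep their original relative order.
    for position in range(len(keys) - 1, -1, -1):
        key = keys[position]
        counts[key] -= 1
        ordered[counts[key]] = labels[position]
    return ordered


keys = [2, 0, 1, 0]            # encoded attribute values
row_labels = [10, 11, 12, 13]  # DataFrame index labels of the same rows
sorted_labels = counting_argsort(keys, row_labels)
print(sorted_labels)  # [11, 13, 12, 10]
```

Because the sort is stable, rows that share a key stay in their previous order, which is what lets `partition` slice contiguous blocks of equal values afterwards.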

In [12]:
#Parameter
minsup = 1
input_df = test_df
# print(f"input_df: {input_df}")
preprocess_obj = preprocess_df()
transformed_df, column_enc_dicts_ls = preprocess_obj.encode_attributes(input_df, [*range(0,input_df.shape[1])]) #NOTE: This should be modified as required
# print(f"transformed_df: {transformed_df}")
# print(f"column_enc_dicts_ls: {column_enc_dicts_ls}")
buc_obj = buc(transformed_df, column_enc_dicts_ls, minsup)
buc_obj.buc_implementation(transformed_df, 0)
output_dict = buc_obj.output_dict
# print(f"{output_dict = }")
# output_df = pd.DataFrame(columns=input_df.columns.tolist())

column_names = ['Country', 'Year', 'Sex']
self.output_dict ={('*', '*', '*'): 10}
self.output_dict ={('*', '*', '*'): 10, ('Albania', '*', '*'): 6}
self.output_dict ={('*', '*', '*'): 10, ('Albania', '*', '*'): 6, ('Albania', '2009', '*'): 2}
self.output_dict ={('*', '*', '*'): 10, ('Albania', '*', '*'): 6, ('Albania', '2009', '*'): 2, ('Albania', '2009', 'm'): 2}
self.output_dict ={('*', '*', '*'): 10, ('Albania', '*', '*'): 6, ('Albania', '2009', '*'): 2, ('Albania', '2009', 'm'): 2, ('Albania', '2010', '*'): 1}
self.output_dict ={('*', '*', '*'): 10, ('Albania', '*', '*'): 6, ('Albania', '2009', '*'): 2, ('Albania', '2009', 'm'): 2, ('Albania', '2010', '*'): 1, ('Albania', '2010', 'm'): 1}
self.output_dict ={('*', '*', '*'): 10, ('Albania', '*', '*'): 6, ('Albania', '2009', '*'): 2, ('Albania', '2009', 'm'): 2, ('Albania', '2010', '*'): 1, ('Albania', '2010', 'm'): 1, ('Albania', '2011', '*'): 2}
self.output_dict ={('*', '*', '*'): 10, ('Albania', '*', '*'): 6, ('Albania', '2009', '

In [10]:
output_dict_transformed = {}
columns_ls = input_df.columns.tolist()
for column in columns_ls:
    output_dict_transformed[column] = []
output_dict_transformed['count'] = []   # Initialise the count column once, outside the loop
for tuple_key, value in output_dict.items():
    output_dict_transformed['count'].append(value)
    for column, attribute in zip(columns_ls, tuple_key):
        output_dict_transformed[column].append(attribute)
print(f"{output_dict = }")
# print(f"{output_dict_transformed = }")
output_df = pd.DataFrame.from_dict(output_dict_transformed)
columns_order = ['Country', 'Year', 'Sex', 'count']
# columns_order = ['country', 'year', 'sex', 'age', 'generation', 'suicides_range', 'population_range', 'gdp_per_year_income_range', 'count']
output_df = output_df.reindex(columns = columns_order)
output_df

output_dict = {('*', '*', '*'): 10, ('Albania', '*', '*'): 6, ('Albania', '2009', '*'): 2, ('Albania', '2009', 'm'): 2, ('Albania', '2010', '*'): 1, ('Albania', '2010', 'm'): 1, ('Albania', '2011', '*'): 2, ('Albania', '2011', 'm'): 2, ('Albania', '2012', '*'): 1, ('Albania', '2012', 'f'): 1, ('Albania', '*', 'm'): 5, ('Albania', '*', 'f'): 1, ('Russia', '*', '*'): 2, ('Russia', '2010', '*'): 1, ('Russia', '2010', 'f'): 1, ('Russia', '2012', '*'): 1, ('Russia', '2012', 'f'): 1, ('Russia', '*', 'f'): 2, ('India', '*', '*'): 2, ('India', '2009', '*'): 1, ('India', '2009', 'f'): 1, ('India', '2010', '*'): 1, ('India', '2010', 'f'): 1, ('India', '*', 'f'): 2, ('*', '2009', '*'): 3, ('*', '2009', 'm'): 2, ('*', '2009', 'f'): 1, ('*', '2010', '*'): 3, ('*', '2010', 'm'): 1, ('*', '2010', 'f'): 2, ('*', '2011', '*'): 2, ('*', '2011', 'm'): 2, ('*', '2012', '*'): 2, ('*', '2012', 'f'): 2, ('*', '*', 'm'): 5, ('*', '*', 'f'): 5}


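| | Country | Year | Sex | count |
|---|---|---|---|---|
| 0 | * | * | * | 10 |
| 1 | Albania | * | * | 6 |
| 2 | Albania | 2009 | * | 2 |
| 3 | Albania | 2009 | m | 2 |
| 4 | Albania | 2010 | * | 1 |
| 5 | Albania | 2010 | m | 1 |
| 6 | Albania | 2011 | * | 2 |
| 7 | Albania | 2011 | m | 2 |
| 8 | Albania | 2012 | * | 1 |
| 9 | Albania | 2012 | f | 1 |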


In [None]:
output_dict
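
One way to sanity-check the BUC result is to compare it against a naive enumeration of every group-by. The sketch below is not BUC: it simply runs `groupby` over all subsets of the dimensions and keeps cells whose count reaches `minsup`. The helper name `brute_force_cube` is hypothetical, and the test frame reproduces the ten rows printed at the start of this notebook.

```python
from itertools import combinations

import pandas as pd


def brute_force_cube(df, minsup=1):
    """Count every cube cell by plain group-bys (no pruning, unlike BUC)."""
    dims = df.columns.tolist()
    cube = {}
    # Apex cell: all dimensions aggregated.
    if len(df) >= minsup:
        cube[("*",) * len(dims)] = len(df)
    for size in range(1, len(dims) + 1):
        for subset in combinations(dims, size):
            counts = df.groupby(list(subset)).size()
            for key, count in counts.items():
                if count < minsup:
                    continue
                # Single-column group-bys yield scalar keys.
                key = key if isinstance(key, tuple) else (key,)
                fixed = dict(zip(subset, key))
                cell = tuple(fixed.get(dim, "*") for dim in dims)
                cube[cell] = int(count)
    return cube


# Same ten rows as the unit-test DataFrame printed earlier.
test_df = pd.DataFrame({
    "Country": ["Albania"] * 6 + ["Russia"] * 2 + ["India"] * 2,
    "Year": ["2009", "2009", "2010", "2011", "2011",
             "2012", "2010", "2012", "2009", "2010"],
    "Sex": ["m"] * 5 + ["f"] * 5,
})

cube = brute_force_cube(test_df, minsup=1)
print(len(cube))
print(cube[("Albania", "*", "m")])
```

With `minsup = 1` the dictionary returned here should contain the same cells and counts as `output_dict`, so a comparison such as `cube == output_dict` gives a quick regression check on small inputs.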