<a href="https://colab.research.google.com/github/MungeliDeli/speaker-ruling-classification/blob/main/MyDataPreparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Data Preparation**
This section performs the Data Preparation phase for our parliamentary Speaker's Rulings classification project.

Following the CRISP-DM methodology, this phase is critical as it directly impacts the quality of our model.

##**Goals of Data Preparation Phase**

* Data Selection: Choose relevant features for multi-label classification
* Data Preprocessing: Clean and normalize text data for TF-IDF processing
* Data Transformation: Create model-ready features for Logistic Regression
* Multi-label Preparation: Transform categories for multi-label classification

#1. Environment Setup and Data Loading

This section sets up the preparation environment:

* Importing the necessary libraries
> * ***pandas***
> * ***numpy***
> * ***scikit-learn***
> * ***nltk***
> * ***string***

* Downloading the necessary NLTK packages:
> * ***punkt_tab***
> * ***stopwords***

* Creating a working copy of the dataset






In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nltk
import string

# Download necessary NLTK data for text preprocessing
nltk.download('punkt_tab')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:


# Load the Speaker's Rulings dataset
file_path = "/content/drive/MyDrive/rulings_classifier_data/speaker_ruling_classification.csv"
raw_rulings_df = pd.read_csv(file_path)

print("Dataset loaded successfully!")
print(f"Original dataset shape: {raw_rulings_df.shape}")

# Create working copy to preserve original data
working_df = raw_rulings_df.copy()

Dataset loaded successfully!
Original dataset shape: (143, 5)


#2. Data Selection Process
Based on our business understanding and data understanding, we select the most relevant features.

The Dataset has the following features:

> * **rulingTitle**
> * **rulingText**
> * **context**
> * **categories** (target column)
> * **standingOrder**

Selected relevant columns:

> * **rulingText** (merged with the ruling title)
> * **context**
> * **categories** (target column)
> * **standingOrder**

**Rationale:**

###2.1 Checking and Handling duplicates

In [None]:
# Check for duplicates based on rulingText, since it is the primary column

duplicate_rows = working_df['rulingText'].duplicated().sum()

print(f"Number of duplicates based on rulingText:{duplicate_rows}" )



Number of duplicates based on rulingText:6


In [None]:
#removing the duplicates
working_df.drop_duplicates(subset=['rulingText'], inplace=True)


In [None]:
# Check the number of duplicates based on rulingText after removal
duplicate_rows = working_df['rulingText'].duplicated().sum()

print(f"Number of duplicates after removing duplicates:{duplicate_rows}" )

Number of duplicates after removing duplicates:0
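
As a small illustration of the check-and-drop pattern used above, the sketch below runs it on a made-up frame (the rows are hypothetical, not taken from the rulings dataset). It also resets the index afterwards, which is an optional extra that the cells above do not perform; it keeps the row labels contiguous so that later index-based joins line up.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the notebook
toy_df = pd.DataFrame({
    "rulingText": ["ruling a", "ruling b", "ruling a", "ruling c"],
    "context": ["point of order", "complaint", "point of order", "guidance"],
})

# Count rows whose rulingText repeats an earlier row
n_duplicates = int(toy_df["rulingText"].duplicated().sum())

# Keep the first occurrence of each text, then renumber the rows 0..n-1
deduped_df = (
    toy_df.drop_duplicates(subset=["rulingText"], keep="first")
          .reset_index(drop=True)
)

print(n_duplicates)
print(deduped_df.index.tolist())
```

Without `reset_index(drop=True)` the surviving rows would keep their original labels (here 0, 1 and 3), leaving a gap where the duplicate was removed.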


###2.2 Merging ruling text with the title
This section merges rulingTitle into rulingText: where both exist they are concatenated, and where one of them is missing the other is used on its own.

In [None]:

# Combine rulingTitle and rulingText into a single column
# If rulingTitle is missing, just use rulingText, if rulingText is missing, just use rulingTitle
working_df['rulingText'] = (
    working_df['rulingTitle'].fillna('') + ' ' + working_df['rulingText'].fillna('')
).str.strip()


#Drop rulingTitle
working_df = working_df.drop(columns=['rulingTitle'])


In [None]:
#checking if the merge was successful

working_df.head(10)

Unnamed: 0,rulingText,context,categories,standingOrder
0,RULING BY THE HONOURABLE MADAM SPEAKER ON A PO...,point of order,"disciplinary actions, procedural rulings",139
1,Ruling by Hon Madam Speaker - On a Point of Or...,point of order,"disciplinary actions, procedural rulings","223, 243"
2,RULING BY THE HONOURABLE MADAM SPEAKER ON A PO...,point of order,"constitutional interpretations, procedural rul...",139
3,RULING BY THE HON MADAM FIRST DEPUTY SPEAKER O...,point of order,procedural rulings,"215, 213"
4,RULING BY THE HONOURABLE MADAM FIRST DEPUTY SP...,point of order,"disciplinary actions, procedural rulings","223, 140"
5,RULING BY THE HON SECOND DEPUTY SPEAKER ON A P...,point of order,"constitutional interpretations, disciplinary a...",72
6,RULING BY THE HON MADAM SPEAKER ON A POINT OF ...,point of order,"disciplinary actions, procedural rulings","203, 139"
7,RULING BY THE HON MADAM SPEAKER ON A POINT OF ...,point of order,"disciplinary actions, procedural rulings","205, 140"
8,RULING BY THE HON MADAM FIRST DEPUTY SPEAKER O...,complaint,"disciplinary actions, procedural rulings","203, 205"
9,RULING BY THE HON MADAM FIRST DEPUTY SPEAKER O...,complaint,"disciplinary actions, procedural rulings","202, 208, 204, 205"


###2.3 Checking and Handling missing values

In [None]:
# Columns with missing values and their counts
missing_values_count = working_df.isnull().sum()
missing_values_count = missing_values_count[missing_values_count > 0]

print(f"missing values before handling: {missing_values_count}")


missing values before handling: categories        7
standingOrder    22
dtype: int64


In [None]:
# Drop every row whose categories (the target) are missing
working_df.dropna(subset=['categories'], inplace=True)

# Replace missing standing orders with the placeholder string 'NaN'

working_df.fillna({'standingOrder': 'NaN'}, inplace=True)

In [None]:
# Missing values after handling them
missing_values_count = working_df.isnull().sum()
missing_values_count = missing_values_count[missing_values_count > 0]
print(f"missing values after handling: {missing_values_count}")

missing values after handling: Series([], dtype: int64)
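
The two handling rules above can be tried on a toy frame. This is a minimal sketch with made-up rows; it only mirrors the column names and the `'NaN'` placeholder string used in this notebook.

```python
import pandas as pd

# Hypothetical toy data with one missing target and one missing standing order
toy_df = pd.DataFrame({
    "categories": ["procedural rulings", None, "disciplinary actions"],
    "standingOrder": ["65", "53", None],
})

# Rows without a target label cannot be used for supervised training
toy_df = toy_df.dropna(subset=["categories"])

# Fill missing standing orders with a placeholder string (not a real NaN)
toy_df = toy_df.fillna({"standingOrder": "NaN"})

remaining_missing = int(toy_df.isnull().sum().sum())
print(toy_df["standingOrder"].tolist())
print(remaining_missing)
```

Because the placeholder is an ordinary string, `isnull()` no longer reports it as missing, so any later step has to filter the text `"NaN"` out explicitly.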


#3. Text Preprocessing
This section preprocesses the ruling text with a preprocessing pipeline: lowercasing, punctuation removal, stop-word removal and stemming.

In [None]:
# Build the stop-word set and the stemmer once instead of on every call
var_stop_words = set(stopwords.words('english'))
var_stemmer = PorterStemmer()

def fxn_convert_to_lowercase(var_text):
    return var_text.lower()

def fxn_remove_punctuation(var_text):
    # Drop every ASCII punctuation character
    return "".join([var_char for var_char in var_text if var_char not in string.punctuation])

def fxn_remove_stopwords(var_text):
    var_tokens = word_tokenize(var_text)
    var_filtered_tokens = [var_word for var_word in var_tokens if var_word not in var_stop_words]
    return " ".join(var_filtered_tokens)

def fxn_stem_text(var_text):
    var_tokens = word_tokenize(var_text)
    var_stemmed_tokens = [var_stemmer.stem(var_word) for var_word in var_tokens]
    return " ".join(var_stemmed_tokens)

def fxn_preprocess_text_pipeline(var_text):
    # Non-string values (e.g. NaN) become an empty document
    if not isinstance(var_text, str):
        return ""
    var_processed_text = fxn_convert_to_lowercase(var_text)
    var_processed_text = fxn_remove_punctuation(var_processed_text)
    var_processed_text = fxn_remove_stopwords(var_processed_text)
    var_processed_text = fxn_stem_text(var_processed_text)
    return var_processed_text

working_df['rulingText'] = working_df['rulingText'].apply(fxn_preprocess_text_pipeline)
print("--- Text Pre-processing Complete ---")
working_df['rulingText'].head()

--- Text Pre-processing Complete ---


Unnamed: 0,rulingText
0,rule honour madam speaker point order rais tue...
1,rule hon madam speaker point order rais tuesda...
2,rule honour madam speaker point order rais wed...
3,rule hon madam first deputi speaker point orde...
4,rule honour madam first deputi speaker point o...
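
One detail of the pipeline above is worth checking: deleting punctuation outright joins words that were separated only by a punctuation mark, and tokens such as `zambiatherefor` in the TF-IDF vocabulary later in this notebook may come from that. The sketch below is a possible variant, not the method used in this notebook, that replaces each punctuation character with a space instead. The function name `fxn_replace_punctuation_with_space` and the sample sentence are hypothetical.

```python
import string

# Translation table mapping every ASCII punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def fxn_replace_punctuation_with_space(var_text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    var_spaced = var_text.translate(PUNCT_TO_SPACE)
    return " ".join(var_spaced.split())

sample = "order.Therefore, the House adjourned"

# Deleting punctuation glues the two words around the full stop together
deleted = "".join(ch for ch in sample if ch not in string.punctuation)

# Replacing punctuation with a space keeps them apart
spaced = fxn_replace_punctuation_with_space(sample)

print(deleted)
print(spaced)
```

Whether this matters for the classifier depends on how often such joined tokens occur in the rulings, which would need to be checked on the actual corpus.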


#4. Categorical Encoding

This section encodes our three categorical columns: ***categories***, ***context*** and ***standingOrder***.

The first step is to understand the data, see what it looks like, and work out the best way to encode it for the model.




##4.1 Understanding the categorical columns

In [None]:
# Get the unique values of categories and their frequencies
value_count = working_df['categories'].value_counts()

print("Unique categories and their frequencies")
print(value_count)

Unique categories and their frequencies
categories
procedural rulings                                                          65
disciplinary actions, procedural rulings                                    24
disciplinary actions                                                        22
constitutional interpretations, procedural rulings                           6
debate management, procedural rulings                                        6
constitutional interpretations, disciplinary actions, procedural rulings     3
administrative decisions, procedural rulings                                 1
administrative decisions                                                     1
constitutional interpretations, disciplinary actions                         1
constitutional interpretations                                               1
Name: count, dtype: int64


In [None]:
# Get the unique values of context and their frequencies
value_count = working_df['context'].value_counts()

print("Unique contexts and their frequencies")
print(value_count)

Unique contexts and their frequencies
context
point of order                        116
complaint                              11
matter of urgent public importance      2
guidance                                1
Name: count, dtype: int64


In [None]:
# Get the unique values of standingOrder and their frequencies
value_count = working_df['standingOrder'].value_counts()

print("Unique standing orders and their frequencies")
print(value_count)

Unique standing orders and their frequencies
standingOrder
65          21
NaN         21
53           8
131          6
139          4
            ..
145, 148     1
33, 34       1
27           1
19           1
70, 72       1
Name: count, Length: 62, dtype: int64


It is clear that categories and standingOrder hold comma-separated values, so for proper encoding we need to convert them into lists.

##4.2 Converting the raw comma-separated text to a list

In [None]:
# Convert categories to lists for easier processing
working_df['categories'] = working_df['categories'].apply(lambda x: [cat.strip() for cat in x.split(',')])

print("\n--- After Converting to Lists ---")
print(working_df['categories'])


--- After Converting to Lists ---
0             [disciplinary actions, procedural rulings]
1             [disciplinary actions, procedural rulings]
2      [constitutional interpretations, procedural ru...
3                                   [procedural rulings]
4             [disciplinary actions, procedural rulings]
                             ...                        
137    [constitutional interpretations, procedural ru...
138              [debate management, procedural rulings]
140           [disciplinary actions, procedural rulings]
141              [debate management, procedural rulings]
142              [debate management, procedural rulings]
Name: categories, Length: 130, dtype: object


In [None]:
# Convert standingOrder to lists for easier processing
working_df['standingOrder'] = working_df['standingOrder'].apply(lambda x: [cat.strip() for cat in x.split(',')])

print("\n--- After Converting to Lists ---")
print(working_df['standingOrder'])


--- After Converting to Lists ---
0           [139]
1      [223, 243]
2           [139]
3      [215, 213]
4      [223, 140]
          ...    
137      [70, 72]
138         [NaN]
140         [NaN]
141         [165]
142         [165]
Name: standingOrder, Length: 130, dtype: object
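
The output above shows that the `'NaN'` placeholder survives the split as the one-element list `[NaN]`. As a sketch of one way to handle both steps together, the hypothetical helper `split_labels` below splits a comma-separated cell and drops the placeholder in the same pass. The helper name and its `placeholder` parameter are assumptions for illustration, not part of this notebook.

```python
def split_labels(raw_value, placeholder="NaN"):
    """Split a comma-separated string into a list of stripped labels."""
    # Anything that is not a string (e.g. a real NaN float) yields no labels
    if not isinstance(raw_value, str):
        return []
    labels = [part.strip() for part in raw_value.split(",")]
    # Drop empty fragments and the placeholder used for missing values
    return [label for label in labels if label and label != placeholder]

print(split_labels("disciplinary actions, procedural rulings"))
print(split_labels("223, 243"))
print(split_labels("NaN"))
```

A function like this could be passed to `Series.apply` in place of the lambdas used above.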


##4.3 Encoding the categorical columns

This section encodes the categorical columns:
* ***categories*** and ***standingOrder*** receive multi-hot encoding because one record can belong to several of them
* ***context*** has exactly one value per record, so it receives one-hot encoding
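
To make the difference concrete, here is a plain-Python sketch of what multi-hot encoding produces. It is only an illustration of the idea; the notebook itself relies on scikit-learn's `MultiLabelBinarizer`, which also orders its classes alphabetically. The function name `multi_hot` and the sample records are hypothetical.

```python
def multi_hot(label_lists):
    """Return (sorted class names, one 0/1 row per record)."""
    classes = sorted({label for labels in label_lists for label in labels})
    rows = [[1 if cls in labels else 0 for cls in classes] for labels in label_lists]
    return classes, rows

records = [
    ["disciplinary actions", "procedural rulings"],
    ["procedural rulings"],
    ["debate management", "procedural rulings"],
]

classes, rows = multi_hot(records)
print(classes)
for row in rows:
    print(row)
```

One-hot encoding is the special case in which every record carries exactly one label, so each row contains a single 1.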

In [None]:
# Use MultiLabelBinarizer from scikit-learn to multi-hot encode categories
from sklearn.preprocessing import MultiLabelBinarizer


# Multi-label encoding
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(working_df['categories'])

# Reuse working_df's index: rows were dropped earlier, so a default 0..n-1 index
# would misalign the join and introduce NaN values
encoded_df = pd.DataFrame(encoded, columns=mlb.classes_, index=working_df.index).astype(int)
working_df = working_df.drop(columns=['categories']).join(encoded_df)



In [None]:
working_df.head().T

Unnamed: 0,0,1,2,3,4
rulingText,rule honour madam speaker point order rais tue...,rule hon madam speaker point order rais tuesda...,rule honour madam speaker point order rais wed...,rule hon madam first deputi speaker point orde...,rule honour madam first deputi speaker point o...
context,point of order,point of order,point of order,point of order,point of order
standingOrder,[139],"[223, 243]",[139],"[215, 213]","[223, 140]"
administrative decisions,0.0,0.0,0.0,0.0,0.0
constitutional interpretations,0.0,0.0,1.0,0.0,0.0
debate management,0.0,0.0,0.0,0.0,0.0
disciplinary actions,1.0,1.0,0.0,0.0,1.0
procedural rulings,1.0,1.0,1.0,1.0,1.0
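
The float values (`0.0`, `1.0`) in the output above are worth a second look: an integer frame only turns into floats after a join when some rows find no match and are filled with NaN. A plausible cause is that `working_df` has gaps in its index after rows were dropped, while a frame built from a NumPy array gets a fresh 0..n-1 index. The toy sketch below (made-up data and column names) shows the effect and one way to avoid it by passing the original index.

```python
import numpy as np
import pandas as pd

# Toy frame whose index has gaps, as happens after dropping rows
toy_df = pd.DataFrame({"text": ["a", "b", "c"]}, index=[0, 2, 5])
encoded = np.array([[1, 0], [0, 1], [1, 1]])

# Default RangeIndex (0, 1, 2): the join matches on labels, not on position
misaligned = toy_df.join(pd.DataFrame(encoded, columns=["x", "y"]))

# Reusing the original index keeps every row paired with its own encoding
aligned = toy_df.join(pd.DataFrame(encoded, columns=["x", "y"], index=toy_df.index))

print(misaligned)
print(aligned)
```

In the misaligned result the row labelled 5 has no partner and becomes NaN, and the row labelled 2 silently receives the encoding that belongs to the third record.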


In [None]:
# Use MultiLabelBinarizer from scikit-learn to multi-hot encode standing orders, ignoring the 'NaN' placeholder
from sklearn.preprocessing import MultiLabelBinarizer

working_df['standingOrder'] = working_df['standingOrder'].apply(
    lambda x: [str(i).strip() for i in (x if isinstance(x, list) else [x]) if str(i).strip().lower() != "nan"]
)



# Multi-label encoding
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(working_df['standingOrder'])

# Reuse working_df's index so the join pairs each row with its own encoding
encoded_df = pd.DataFrame(encoded, columns=mlb.classes_, index=working_df.index).astype(int)
working_df = working_df.drop(columns=['standingOrder']).join(encoded_df)

In [None]:
working_df.head().T

Unnamed: 0,0,1,2,3,4
rulingText,rule honour madam speaker point order rais tue...,rule hon madam speaker point order rais tuesda...,rule honour madam speaker point order rais wed...,rule hon madam first deputi speaker point orde...,rule honour madam first deputi speaker point o...
context,point of order,point of order,point of order,point of order,point of order
administrative decisions,0.0,0.0,0.0,0.0,0.0
constitutional interpretations,0.0,0.0,1.0,0.0,0.0
debate management,0.0,0.0,0.0,0.0,0.0
disciplinary actions,1.0,1.0,0.0,0.0,1.0
procedural rulings,1.0,1.0,1.0,1.0,1.0
1,0.0,0.0,0.0,0.0,0.0
11,0.0,0.0,0.0,0.0,0.0
131,0.0,0.0,0.0,0.0,0.0


In [None]:
# Perform one-hot encoding for the context column
encoder = OneHotEncoder(sparse_output=False)
context_encoded = encoder.fit_transform(working_df[['context']]).astype(int)
# Reuse working_df's index so pd.concat aligns rows instead of adding unmatched ones
context_encoded_df = pd.DataFrame(
    context_encoded,
    columns=encoder.get_feature_names_out(['context']),
    index=working_df.index
)
working_df = pd.concat([working_df, context_encoded_df], axis=1)



In [None]:
working_df.head().T

Unnamed: 0,0,1,2,3,4
rulingText,rule honour madam speaker point order rais tue...,rule hon madam speaker point order rais tuesda...,rule honour madam speaker point order rais wed...,rule hon madam first deputi speaker point orde...,rule honour madam first deputi speaker point o...
context,point of order,point of order,point of order,point of order,point of order
administrative decisions,0.0,0.0,0.0,0.0,0.0
constitutional interpretations,0.0,0.0,1.0,0.0,0.0
debate management,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...
86,0.0,0.0,0.0,0.0,0.0
context_complaint,0.0,0.0,0.0,0.0,0.0
context_guidance,0.0,0.0,0.0,0.0,0.0
context_matter of urgent public importance,0.0,0.0,0.0,0.0,0.0


Now that the categorical encoding is done, this is how the dataset looks altogether:

In [None]:
#view entire dataset with encoded categories

working_df.head().T

Unnamed: 0,0,1,2,3,4
rulingText,rule honour madam speaker point order rais tue...,rule hon madam speaker point order rais tuesda...,rule honour madam speaker point order rais wed...,rule hon madam first deputi speaker point orde...,rule honour madam first deputi speaker point o...
context,point of order,point of order,point of order,point of order,point of order
categories,"[disciplinary actions, procedural rulings]","[disciplinary actions, procedural rulings]","[constitutional interpretations, procedural ru...",[procedural rulings],"[disciplinary actions, procedural rulings]"
standingOrder,[139],"[223, 243]",[139],"[215, 213]","[223, 140]"


#5. Text Transformation
In this section we transform the ruling text into numeric features using TF-IDF.
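
As a rough illustration of what TF-IDF computes, the sketch below reproduces the weighting that scikit-learn's `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each document. It assumes simple whitespace tokenisation, whereas scikit-learn uses its own token pattern, and the function name `tfidf_rows` and the two sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Compute L2-normalised TF-IDF weights with smoothed IDF."""
    n_docs = len(docs)
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        # Scale each document vector to unit length (guard against empty docs)
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = tfidf_rows(["rule point order", "rule complaint"])
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

A term that occurs in every document, such as `rule` here, receives the smallest IDF and therefore a lower weight than terms unique to one document.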

In [None]:
# TF-IDF transformation for the 'rulingText' column (scikit-learn)

# Prepare your DataFrame
df = working_df.copy()
df['rulingText'] = df['rulingText'].fillna('')
df['rulingText'] = df['rulingText'].astype(str)


# Configure the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

#Fit the vectorizer and transform the corpus
tfidf_matrix = vectorizer.fit_transform(df['rulingText'])

#tfidf_matrix is a scipy.sparse CSR matrix (memory-efficient)
print(f"TF-IDF matrix shape: {tfidf_matrix.shape}  (rows: docs, cols: features)")

# Convert to a DataFrame for inspection (sparse-aware)
feature_names = vectorizer.get_feature_names_out()
tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf_matrix, index=df.index, columns=feature_names)

# Quick sample inspection
print("\n--- TF-IDF sample (sparse DataFrame head) ---")
print(tfidf_df.head())   # first five documents, one column per term





TF-IDF matrix shape: (142, 4279)  (rows: docs, cols: features)

--- TF-IDF sample (sparse DataFrame head) ---
   000  00000  000000  00000000  0028  029  03  07  10  100  ...  \
0    0      0       0         0     0    0   0   0   0    0  ...   
1    0      0       0         0     0    0   0   0   0    0  ...   
2    0      0       0         0     0    0   0   0   0    0  ...   
3    0      0       0         0     0    0   0   0   0    0  ...   
4    0      0       0         0     0    0   0   0   0    0  ...   

   zambiatherefor  zctu  zealous  zesco  zfe  znbc  zr      zulu  zx  être  
0               0     0        0      0    0     0   0  0.187183   0     0  
1               0     0        0      0    0     0   0         0   0     0  
2               0     0        0      0    0     0   0         0   0     0  
3               0     0        0      0    0     0   0         0   0     0  
4               0     0        0      0    0     0   0         0   0     0  

[5 rows x 4279 columns]

In [None]:
# Join the TF-IDF vectors with the DataFrame for inspection
df_tfidf = pd.DataFrame.sparse.from_spmatrix(
    tfidf_matrix,
    index=df.index,
    columns=vectorizer.get_feature_names_out()
)

# A plain df.join(df_tfidf) fails with
# "ValueError: columns overlap but no suffix specified", because TF-IDF terms
# such as 'context', '11' and '131' share names with existing columns.
# Prefixing the TF-IDF columns keeps the names unique.
df_full = df.join(df_tfidf.add_prefix('tfidf_'))


In [None]:
working_df.head(1).T

Unnamed: 0,0
rulingText,rule honour madam speaker point order rais tue...
context,point of order
administrative decisions,0.0
constitutional interpretations,0.0
debate management,0.0
...,...
86,0.0
context_complaint,0.0
context_guidance,0.0
context_matter of urgent public importance,0.0


In [None]:
#looking at our data dimension
print(f"Working DataFrame shape: {working_df.shape}")

# 6. Data Preparation Summary

## What We Accomplished

### Data Selection
- Selected relevant columns for multi-label text classification  

### Quality Assessment
- Identified and handled missing values, duplicates, and data inconsistencies  

### Text Preprocessing
- Merged ruling titles and text for comprehensive content  
- Cleaned the merged ruling text for TF-IDF processing  
- Implemented complete NLP pipeline:
  - Lowercasing  
  - Punctuation removal  
  - Stopword removal  
  - Stemming  

### Multi-label Preparation
- Parsed category strings into lists  
- Applied multi-label binarization for target variables  

### Feature Engineering
- Created text-based and categorical features  

### Vectorization Preparation
- Fitted a TF-IDF vectorizer on the preprocessed ruling text  


## Key Outputs
- **Text Corpus**: Clean, processed text ready for TF-IDF vectorization  
- **Multi-label Targets**: Binary matrix for all categories  
- **Additional Features**: Multi-hot encoded standing orders and one-hot encoded context  
- **Model Configuration**: Fitted TF-IDF vectorizer and multi-label binarizers, available in the notebook session  



