# Data Processing and Cleaning for BERT-based Vector Search

## Overview

This notebook reads a CSV file named `legislation_content.csv`, transforms its data, and saves the result as a new CSV called `cleanerData.csv`. The main steps are to combine the paragraph entries within each section into a single string and to clean the 'Section' column by removing the URL prefix and replacing slashes with spaces. The processed data is then suitable for use with BERT to create embeddings for vector-based semantic search and retrieval.



**Description:** The cell below imports the pandas library, which provides the data structures and functions used here for data manipulation and analysis in Python. The `as pd` alias lets the rest of the code refer to pandas as `pd`.

**Reasoning:** pandas is used to read, transform, and write the CSV files.


In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('legislation_content.csv')

**Description:** This loads the `legislation_content.csv` file into a pandas DataFrame, which is assigned to the variable `df`.

**Reasoning:** `pd.read_csv` loads the raw data; every later step operates on `df`.

**Convert 'Paragraph' Column to String Type and Combine by Section:**

In [None]:
# Convert 'Paragraph' to strings so that ' '.join works on every value
df['Paragraph'] = df['Paragraph'].astype(str)
# Group by 'Section' and concatenate 'Paragraph' entries
df_combined = df.groupby('Section')['Paragraph'].agg(' '.join).reset_index()

In [None]:
df_combined

Unnamed: 0,Section,Paragraph
0,/ukpga/2016/19/body,(1)The Secretary of State must appoint a perso...
1,/ukpga/2016/19/contents,Introductory Text PART 1 Labour market and ill...
2,/ukpga/2016/19/contents/enacted,Introductory Text PART 1 Labour market and ill...
3,/ukpga/2016/19/introduction,nan An Act to make provision about the law on ...
4,/ukpga/2016/19/part/1,(1)The Secretary of State must appoint a perso...
...,...,...
586,/ukpga/2016/19/section/92,(1)The Secretary of State may by regulations m...
587,/ukpga/2016/19/section/93,(1)Regulations made by the Secretary of State ...
588,/ukpga/2016/19/section/94,(1)Subject to subsections (3) to (5) this Act ...
589,/ukpga/2016/19/section/95,"(1)This Act extends to England and Wales, Scot..."


**Description:** This code first converts the 'Paragraph' column to strings, then groups the rows by 'Section' and joins all 'Paragraph' entries within each section into a single space-separated string. The result is stored in a new DataFrame, `df_combined`, with one row per section.

**Reasoning:** `' '.join` only accepts strings, hence the conversion. The aggregation collects all the paragraphs of the same section into one text entry.
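
The `introduction` row in the output above begins with `nan`: `astype(str)` converts a missing value into the literal string `'nan'`, which is then joined into the text. A minimal sketch of one way to avoid this, assuming missing paragraphs should simply be skipped, is shown below. The toy rows and the helper name `join_paragraphs` are illustrative and do not come from the notebook.

```python
import pandas as pd

# Toy rows standing in for legislation_content.csv (illustrative values only).
toy = pd.DataFrame({
    "Section": [
        "/ukpga/2016/19/introduction",
        "/ukpga/2016/19/introduction",
        "/ukpga/2016/19/section/1",
        "/ukpga/2016/19/section/1",
    ],
    "Paragraph": [None, "An Act to make provision.", "(1)First.", "(2)Second."],
})


def join_paragraphs(df: pd.DataFrame) -> pd.DataFrame:
    """Concatenate the paragraphs of each section, skipping missing values."""
    out = df.copy()
    # Fill missing values with '' first so astype(str) cannot produce 'nan'.
    out["Paragraph"] = out["Paragraph"].fillna("").astype(str)
    return (
        out.groupby("Section")["Paragraph"]
        # Drop empty strings so they do not add stray spaces to the result.
        .agg(lambda parts: " ".join(p for p in parts if p))
        .reset_index()
    )


combined = join_paragraphs(toy)
print(combined)
```

Whether dropping missing paragraphs is appropriate depends on the data; if an empty entry is meaningful, it may be better to inspect those rows before grouping.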

**Remove Prefix from 'Section' Column:**

In [None]:
# Remove the prefix from the 'Section' column
df_combined['Section'] = df_combined['Section'].str.replace(r'^/ukpga/2016/19/', '', regex=True)

In [None]:
df_combined

Unnamed: 0,Section,Paragraph
0,body,(1)The Secretary of State must appoint a perso...
1,contents,Introductory Text PART 1 Labour market and ill...
2,contents/enacted,Introductory Text PART 1 Labour market and ill...
3,introduction,nan An Act to make provision about the law on ...
4,part/1,(1)The Secretary of State must appoint a perso...
...,...,...
586,section/92,(1)The Secretary of State may by regulations m...
587,section/93,(1)Regulations made by the Secretary of State ...
588,section/94,(1)Subject to subsections (3) to (5) this Act ...
589,section/95,"(1)This Act extends to England and Wales, Scot..."


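**Description:** This code removes the `/ukpga/2016/19/` prefix from the start of each entry in the 'Section' column, leaving only the relative path (for example, `section/92`). The `^` anchor restricts the match to the beginning of the string.

**Reasoning:** This is a data cleaning step. The prefix identifies the Act and is the same in every row shown, so it does not help distinguish one section from another.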
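
The prefix above is hard-coded for one Act. As a hedged sketch, the same step can be written with a pattern that matches any prefix of the form `/ukpga/<year>/<number>/`. That shape is an assumption based on the paths shown in this notebook, and the third toy path (a different Act) is hypothetical.

```python
import pandas as pd

sections = pd.Series([
    "/ukpga/2016/19/section/92",
    "/ukpga/2016/19/contents/enacted",
    "/ukpga/2010/15/section/4",  # hypothetical second Act, for illustration
])

# Assumed pattern: '/ukpga/<4-digit year>/<number>/' anchored at the start.
ACT_PREFIX = r"^/ukpga/\d{4}/\d+/"

stripped = sections.str.replace(ACT_PREFIX, "", regex=True)
print(stripped.tolist())  # ['section/92', 'contents/enacted', 'section/4']
```

If a file did contain more than one Act, stripping the prefix would make section names ambiguous, so the Act identifier would need to be kept in a separate column first.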

**Replace Slashes with Spaces in 'Section' Column:**

In [None]:
df_combined['Section'] = df_combined['Section'].str.replace('/', ' ')
df_combined

Unnamed: 0,Section,Paragraph
0,body,(1)The Secretary of State must appoint a perso...
1,contents,Introductory Text PART 1 Labour market and ill...
2,contents enacted,Introductory Text PART 1 Labour market and ill...
3,introduction,nan An Act to make provision about the law on ...
4,part 1,(1)The Secretary of State must appoint a perso...
...,...,...
586,section 92,(1)The Secretary of State may by regulations m...
587,section 93,(1)Regulations made by the Secretary of State ...
588,section 94,(1)Subject to subsections (3) to (5) this Act ...
589,section 95,"(1)This Act extends to England and Wales, Scot..."


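**Description:** This code replaces every forward slash (`/`) in the 'Section' column with a space, so `section/92` becomes `section 92` and `contents/enacted` becomes `contents enacted`.

**Reasoning:** This is another data cleaning step. It turns the remaining path fragments into plain words and standardizes the format of the 'Section' column.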
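
The default value of the `regex` argument of `Series.str.replace` has changed between pandas releases. A small sketch that passes `regex=False` explicitly, so that `/` is always treated as a literal string, could look like this (the toy values mirror the rows shown above):

```python
import pandas as pd

sections = pd.Series(["body", "contents/enacted", "part/1", "section/92"])

# regex=False treats '/' as a literal string regardless of the pandas default.
spaced = sections.str.replace("/", " ", regex=False)
print(spaced.tolist())  # ['body', 'contents enacted', 'part 1', 'section 92']
```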

**Save the Cleaned Data to CSV:**

In [None]:
df_combined.to_csv('cleanerData.csv', index=False)
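
Passing `index=False` leaves the DataFrame's row index out of the file, so the CSV contains only the 'Section' and 'Paragraph' columns. The sketch below is a round-trip check on toy data, written to a temporary directory rather than the real output path. The rows are made up for illustration; one contains a comma to show that quoting is handled on write and read.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_combined (illustrative text only).
cleaned = pd.DataFrame({
    "Section": ["section 1", "section 2"],
    "Paragraph": ["(1)Plain text.", "(1)Text with a comma, inside."],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleanerData.csv")
    cleaned.to_csv(path, index=False)   # no index column is written
    reloaded = pd.read_csv(path)        # read the file back for comparison

print(list(reloaded.columns))  # ['Section', 'Paragraph']
print(reloaded["Paragraph"].tolist() == cleaned["Paragraph"].tolist())
```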