# Data Inspection: Landmark Complaint Dataset
This notebook provides an overview of the dataset, including the structure of the JSON file, exploration of key fields, and a summary of the initial data analysis.

In [3]:
import json
import pandas as pd

In [4]:
# Load the JSON file
file_path = '../data/raw/rows.json'

with open(file_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

# Display the top-level keys in the JSON file
print("Top-level keys in the JSON file:", data.keys())

# `data` is a dict, so len() counts its top-level keys, not the complaint records
print("Number of top-level keys:", len(data))

# Check the data type
print(type(data))

Top-level keys in the JSON file: dict_keys(['meta', 'data'])
Number of top-level keys: 2
<class 'dict'>


In [5]:
# The complaint records are stored as a list under the 'data' key
data_items = data['data']

# Print the first 5 records
print(data_items[:5])



The printed records show that the entries under the 'data' key form a list of lists, with each inner list holding the fields of one complaint. One of those fields is the free-text complaint description, which is the part we need. The next step is to isolate that text from the rest of each record for further analysis.
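The position of the complaint text can also be looked up by name rather than read off the printed rows. The following is a minimal sketch, assuming the 'meta' block lists one entry per column, in row order, under `meta['view']['columns']`. That layout, the column names, and the sample values are assumptions made for illustration and are not taken from this dataset, so the real 'meta' block should be inspected first.

```python
# Hypothetical stand-in for the loaded JSON; the real file's `meta` layout
# and column names are assumptions and must be checked before relying on this.
sample = {
    "meta": {"view": {"columns": [
        {"name": "sid"},
        {"name": "Address"},
        {"name": "Complaint Text"},
    ]}},
    "data": [
        ["row-1", "1 Example St", "Painting front walls"],
        ["row-2", "2 Example Ave", "Installation of bar on patio"],
    ],
}


def column_index(payload, column_name):
    """Return the position of `column_name` within each row, or raise KeyError."""
    names = [col.get("name") for col in payload["meta"]["view"]["columns"]]
    if column_name not in names:
        raise KeyError(f"{column_name!r} not found; available columns: {names}")
    return names.index(column_name)


idx = column_index(sample, "Complaint Text")

# Same extraction pattern as the notebook, but with a looked-up index
texts = [row[idx] for row in sample["data"] if len(row) > idx]
print(idx, texts)
```

If the real metadata follows this shape, a lookup like this would confirm whether index 18 is in fact the complaint description column.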

In [9]:
# Extract the complaint text at index 18, skipping any record too short to have that field
complaint_texts = [entry[18] for entry in data_items if len(entry) > 18]

# Print the first 5 complaint descriptions to verify the extraction
for i, complaint in enumerate(complaint_texts[:5]):
    print(f"Complaint {i+1}: {complaint}")

# Convert the list to a DataFrame
complaints_df = pd.DataFrame(complaint_texts, columns=['Complaint Text'])

# Save the complaints to a CSV file
complaints_df.to_csv('complaints_extracted.csv', index=False)

Complaint 1: painting base of building without permits
Complaint 2: work being done; working on building façade
Complaint 3: Construction of rear yard addition without permits; photographs with complaint file
Complaint 4: Painting front walls
Complaint 5: Installation of bar on patio


We have extracted the complaint descriptions and printed the first 5 to verify the result. They describe reported issues such as painting without permits, façade work, and unpermitted construction. Working with the descriptions alone lets us apply NLP techniques to find common themes and topics in the dataset. The complaints are also stored in a DataFrame and saved to a CSV file so that they can be loaded directly in the next steps.
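Before moving on, it can help to check the extracted column for missing and repeated entries, since the extraction keeps whatever value sits at the chosen index. The sketch below runs on a small hypothetical sample rather than the real data; the sample values, including the deliberate `None` and the duplicate, are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample; the None and the repeated text are included on purpose
sample_texts = [
    "Painting front walls",
    None,
    "Installation of bar on patio",
    "Painting front walls",
]
df = pd.DataFrame(sample_texts, columns=["Complaint Text"])

# Count entries with no complaint text
missing = int(df["Complaint Text"].isna().sum())

# Count repeated descriptions among the non-missing entries
duplicates = int(df["Complaint Text"].dropna().duplicated().sum())

print(f"rows={len(df)}, missing={missing}, duplicates={duplicates}")
```

The same two checks could be run on `complaints_df` to decide whether empty or repeated complaints need handling during preprocessing.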

In this notebook, we loaded and inspected the dataset and found that it is a dictionary with 'meta' and 'data' keys, where 'data' holds a list of lists with various fields, including complaint descriptions. We isolated the complaint text, which will be the focus of our NLP analysis.

In the next notebook, we will clean and preprocess the complaint texts to prepare them for topic modeling and other NLP tasks.