# **DataDriven.Nola.gov Data Analysis**

Intro, talk about data sources

In [None]:
%%capture
# Install necessary libraries (%%capture must be the first line of the cell)
!pip install bertopic
import pandas as pd
import spacy
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

### **Public Records Requests**
Using BERTopic, we performed topic modeling on the public records requests data to gauge community data interests and determine the most frequently requested data topics. By varying the model parameters, we trained four different models. Topics that showed up in the results of all four models were deemed the most relevant data topics.

In [None]:
# load the requests data and convert the request text to a list of strings
requests = pd.read_csv('Public_Records_Requests.csv')
requests = requests['Request Text'].tolist()

Below, we train our first model. The result is a set of topics, each represented by a group of words and ranked by how many requests fall under it. The 20 most frequent topics are displayed in the table below.

In [None]:
# training
topic_model_1 = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model_1.fit_transform(requests)

In [None]:
# save model 1
topic_model_1.save("PublicRecords_Model1")

In [None]:
# most frequent topics; topic -1 collects outliers and can be ignored
freq = topic_model_1.get_topic_info()
freq.head(20)

,Topic,Count,Name
0,-1,2530,-1_the_and_of_in
1,0,223,0_fire_report_occurred_incident
2,1,144,1_camera_footage_video_accident
3,2,135,2_fire_incident_report_occurred
4,3,132,3_rfp_proposals_selection_evaluation
5,4,104,4_public_records_fees_act
6,5,104,5_paid_taxes_nola_st
7,6,100,6_street_permits_orleans_new
8,7,93,7_incident_initial_report_view
9,8,93,8_paid_taxes_years_la


The most frequent topic in the public records requests appears to be fire incident reports in New Orleans. Below is the cluster of words representing this topic.

In [None]:
topic_model_1.get_topic(0) # get the most frequent topic

[('fire', 0.05267922523855218),
 ('report', 0.03489122195061917),
 ('occurred', 0.03273741016577411),
 ('incident', 0.025259593276499192),
 ('la', 0.02479762762000258),
 ('involving', 0.02093564846224196),
 ('new', 0.018817773365469956),
 ('orleans', 0.01856295081160376),
 ('on', 0.017813844971275566),
 ('at', 0.01724144049364158)]

The second most frequent topic is related to traffic camera footage.

In [None]:
topic_model_1.get_topic(1) # get the 2nd most frequent topic

[('camera', 0.03233983545989223),
 ('footage', 0.03190395171780871),
 ('video', 0.03158366868124489),
 ('accident', 0.03074310280447741),
 ('intersection', 0.030413903174532117),
 ('pm', 0.02656250932480622),
 ('traffic', 0.017989825800340278),
 ('cameras', 0.015957888315200414),
 ('at', 0.015665788564913737),
 ('on', 0.015453408727017172)]

Based on the relationship between the words representing each topic, we label each topic and display the 10 most frequent topics below.

In [None]:
# Set labels for top 10 topics
topic_model_1.set_topic_labels({0: "Fire Incident", 1: "Traffic Camera Footage",
                                2: "Fire Incident", 3: "RFP", 4: "Public Records Requests",
                                5: "Tax Payment", 6: "Street Permits", 7: "Incident Reports",
                                8: "Taxes", 9: "Emails"})

# Display model 1 bar chart
topic_model_1.visualize_barchart(top_n_topics=10, title='Model 1: Topic Word Scores',
                                 custom_labels=True)
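
Because topics 0 and 2 both received the label "Fire Incident", it can help to see how many requests each label covers once same-labeled topics are combined. The following is a minimal sketch of that tally using only the counts shown in the model 1 table above. The helper name `counts_by_label` is hypothetical, and topic 9 is omitted because its count does not appear in the displayed output.

```python
from collections import Counter

# Counts copied from the model 1 topic table above (topics 0-8 only).
# Topic 9 ("Emails") is left out because its count is not shown.
topic_counts = {0: 223, 1: 144, 2: 135, 3: 132, 4: 104,
                5: 104, 6: 100, 7: 93, 8: 93}

# Labels copied from the set_topic_labels call above.
topic_labels = {0: "Fire Incident", 1: "Traffic Camera Footage",
                2: "Fire Incident", 3: "RFP", 4: "Public Records Requests",
                5: "Tax Payment", 6: "Street Permits", 7: "Incident Reports",
                8: "Taxes"}


def counts_by_label(counts, labels):
    """Sum request counts over all topics that share the same label."""
    totals = Counter()
    for topic_id, count in counts.items():
        totals[labels[topic_id]] += count
    return totals


label_totals = counts_by_label(topic_counts, topic_labels)
for label, total in label_totals.most_common():
    print(f"{label}: {total}")
```

With these numbers, the two fire topics together account for 358 requests, which puts "Fire Incident" well ahead of every other label.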

Now, we modify the parameters of the first model to create model 2. By changing the n-gram range, we can control how many words each term in a topic may contain. The first model uses the default range of (1, 1), meaning that each topic is represented by a group of single words. We now change this to a range of (1, 3), so a term can be one, two, or three words long. For example, "fire code violation" could be a single term representing a topic. Let's see how this affects our results.
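
To make the effect of the n-gram range concrete, here is a small pure-Python sketch that lists the candidate terms for one phrase under each range. It is only an illustration: the helper `word_ngrams` is hypothetical and splits on whitespace, whereas `CountVectorizer` applies its own tokenization rules.

```python
def word_ngrams(text, ngram_range=(1, 3)):
    """Return every word n-gram in `text` for n in the inclusive range.

    Simplified illustration: tokens are lowercased and split on whitespace.
    """
    tokens = text.lower().split()
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the phrase.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# Range (1, 1): single words only
print(word_ngrams("fire code violation", (1, 1)))

# Range (1, 3): single words, pairs, and the full three-word phrase
print(word_ngrams("fire code violation", (1, 3)))
```

Under the wider range, the phrase contributes six candidate terms instead of three, which is why multi-word names such as "fire report" can appear in the topic names.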

In [None]:
# change ngram_range, let each entity in a topic be up to 3 words
vectorizer_model = CountVectorizer(ngram_range=(1, 3))

# update model 1
topic_model_1.update_topics(requests, vectorizer_model=vectorizer_model)

In [None]:
# get most frequent topics: ngram_range = (1, 3)
freq = topic_model_1.get_topic_info()
freq.head(20)

,Topic,Count,Name
0,-1,2374,-1_the_and_of_to
1,0,364,0_paid_tax_bill_taxes
2,1,223,1_fire_report_the fire_fire report
3,2,143,2_rfp_proposals_rfp no_proposal
4,3,134,3_fire_report_incident_fire report
5,4,132,4_camera_footage_accident_intersection
6,5,110,5_public_records_request_public records
7,6,96,6_bid_group_tabulations_bid tabulations
8,7,89,7_contract_the city_the city of_city
9,8,87,8_incident_initial incident_incident report_in...


In [None]:
# train a new model with modified topic sizes (min topic size = 5)
topic_model_2 = BERTopic(language="english", min_topic_size=5, calculate_probabilities=True, verbose=True)
topics, probs = topic_model_2.fit_transform(requests)

In [None]:
# save model 2
topic_model_2.save("PublicRecords_Model2")

In [None]:
# most frequent topics: model 2
freq = topic_model_2.get_topic_info()
freq.head(20)

,Topic,Count,Name
0,-1,2122,-1_the_and_in_of
1,0,365,0_paid_bill_taxes_tax
2,1,243,1_fire_report_occurred_incident
3,2,132,2_video_footage_intersection_accident
4,3,79,3_rfp_proposals_sheets_scoring
5,4,77,4_incident_initial_view_report
6,5,74,5_permit_permits_seasons_issued
7,6,74,6_code_violations_open_parcel
8,7,69,7_bid_unit_prices_tabulations
9,8,68,8_body_footage_camera_cam


In [None]:
topic_model_2.get_topic(0) # get the most frequent topic

[('paid', 0.030763668590070976),
 ('bill', 0.022534097061466675),
 ('taxes', 0.021453433857983782),
 ('tax', 0.01726269520375312),
 ('bills', 0.014618764186750375),
 ('payment', 0.013607243854968174),
 ('years', 0.013133040618896929),
 ('need', 0.011279474878138566),
 ('amount', 0.01105217765898051),
 ('property', 0.01071072975638051)]

In [None]:
topic_model_2.get_topic(1) # get the 2nd most frequent topic

[('fire', 0.03495184621558374),
 ('report', 0.023251193681501966),
 ('occurred', 0.022447373691165347),
 ('incident', 0.018760099343314),
 ('la', 0.015005217922794222),
 ('involving', 0.014626335187983272),
 ('new', 0.010739276696985394),
 ('on', 0.010698407010312922),
 ('orleans', 0.010648460715898724),
 ('at', 0.010626758741238479)]
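
Since topic IDs are not stable across models (fire reports are topic 0 in model 1 but topic 1 in model 2), finding topics that recur across models means comparing their word lists rather than their IDs. The sketch below shows one possible way to do this with Jaccard overlap of the top words. This is an assumption for illustration, not necessarily how the recurring topics were identified in this analysis; the function names and the 0.5 threshold are hypothetical.

```python
# Top words copied from the get_topic outputs above.
model_1_topics = {
    0: ["fire", "report", "occurred", "incident", "la",
        "involving", "new", "orleans", "on", "at"],
    1: ["camera", "footage", "video", "accident", "intersection",
        "pm", "traffic", "cameras", "at", "on"],
}
model_2_topics = {
    0: ["paid", "bill", "taxes", "tax", "bills",
        "payment", "years", "need", "amount", "property"],
    1: ["fire", "report", "occurred", "incident", "la",
        "involving", "new", "on", "orleans", "at"],
}


def jaccard(words_a, words_b):
    """Share of distinct words that the two lists have in common."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_topics(topics_a, topics_b, threshold=0.5):
    """Pair topics whose top-word overlap meets the (assumed) threshold."""
    matches = []
    for id_a, words_a in topics_a.items():
        for id_b, words_b in topics_b.items():
            score = jaccard(words_a, words_b)
            if score >= threshold:
                matches.append((id_a, id_b, score))
    return matches


shared = match_topics(model_1_topics, model_2_topics)
for id_a, id_b, score in shared:
    print(f"model 1 topic {id_a} ~ model 2 topic {id_b} (overlap {score:.2f})")
```

On these word lists, only the fire incident topic is matched between the two models. The camera topic shares just the stop words "at" and "on" with the fire topic, which falls far below the threshold.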

### **Search Terms**

### **Website Feedback**
Users were asked "Was this page helpful?" and then prompted to give feedback on their experience with the site. We analyze this data below.