## Analysing the data from the Market Research Survey

In [1]:
# import libraries
import pandas as pd

In [2]:
# import csv file as dataframe
data_path = "tudublin_amenities_access_survey.csv"
survey_data = pd.read_csv(data_path, delimiter=",", encoding='unicode_escape')
survey_data.head(5)

Unnamed: 0,Id,Start time,Completion time,Email,Name,Which county do you work/live in?,Which sector do you work in?\n,Does your job involve accessing/working with data on publicly available amenities?\n,Which devices do you primarily use for your work?\n,What type of public amenities information do you usually require for your job?\n,...,Do you have any additional suggestions/feedback on our proposal ?\n,"If you'd like to contact us for more information on the project, please share your email below\n",Which devices do you primarily use for your day-to-day personal tasks?\n,How often do you use digital tools/applications for navigation?\n,"Based on the demo below, how useful would a web app be for accessing parking locations and quantities in your day-to-day?\n",Why would this be impractical for you?\n,Please tell us what you are using instead\n,What other public amenity would you like more information on?,What other features would you like to see alongside location & quantity of parking spaces?\n,"If you'd like to contact us for more information on the project, please share your email below\n1"
0,1,22/10/2024 18:14,22/10/2024 18:17,anonymous,,,Education,Yes,Desktop computer;Laptop;Smartphone;,"Mechanical (water grid, electric grid, etc);Tr...",...,,,,,,,,,,
1,2,29/10/2024 16:29,29/10/2024 16:36,anonymous,,Dublin,Government,No,,,...,,,Desktop computer;Laptop;Smartphone;,Weekly,Extremely useful,,,,Cost;Parking hours;,
2,3,31/10/2024 12:18,31/10/2024 12:19,anonymous,,Fingal,Construction,No,,,...,,,Desktop computer;,Never,Somewhat useful,,,,Parking hours;,
3,4,31/10/2024 12:25,31/10/2024 12:29,anonymous,,Dublin,Architecture,Yes,Desktop computer;Laptop;Smartphone;,"Recreational (parks, sport facilities, hiking ...",...,,,,,,,,,,
4,5,31/10/2024 12:24,31/10/2024 12:29,anonymous,,Dublin,Construction,Yes,Desktop computer;,"Healthcare & Safety (emergency services, hospi...",...,,,,,,,,,,


In [3]:
# check number of rows and columns
print("Number of columns: ", len(survey_data.columns))
print("Number of rows: ", len(survey_data))

Number of columns:  36
Number of rows:  111


In [4]:
# check data type of columns -- mostly object (Id is int64, Name is float64)
print("Data types of columns:")
survey_data.dtypes

Data types of columns:


Id                                                                                                                               int64
Start time                                                                                                                      object
Completion time                                                                                                                 object
Email                                                                                                                           object
Name                                                                                                                           float64
Which county do you work/live in?                                                                                               object
Which sector do you work in?\n                                                                                                  object
Does your job involve accessing/working with data on pu

### First order of business: clean up the df
<li> I will start by removing the unnecessary columns: start & completion time, name & email, because they hold no usable information.<br></li>
<li>The columns also need to be renamed.</li>
<li>Next, I will fill in the NaN values in the county responses, and group the different "other" responses for job sector.<br></li>
<li>Then, I will split the df into those who are employed and unemployed, and then into those who use public amenity data in their professional life and those who do not.</li>

In [5]:
# remove unnecessary columns (start time, completion time, email, name)
survey_data = survey_data.drop(survey_data.columns[[1,2,3,4]], axis=1)

# import column names (the context manager closes the file automatically)
with open("col_names.txt", "r") as col_file:
    col_names = col_file.read()
col_list = col_names.replace(' ','').split(",")

# rename columns
survey_data = survey_data.set_axis(col_list, axis=1)
print(survey_data.columns)

Index(['id', 'county', 'sector', 'use_amenity_data', 'device_work',
       'type_amenity_data_work', 'freq_amenity_work', 'type_tool_work',
       'freq_tool_work', 'satisfaction_tool_work', 'why_unsatisfied_tool_work',
       'feature_wish_work', 'demo_useful_work', 'why_impractical_demo_work',
       'other_amenity_work', 'searchfunctionality_useful_rank',
       'realtimeavail_useful_rank', 'qtyamenity_useful_rank',
       'costamenity_useful_rank', 'routeplan_useful_rank',
       'filtertype_useful_rank', 'export_useful_rank', 'feedback_work',
       'contact_work', 'device_personal', 'freq_tool_personal',
       'demo_useful_personal', 'why_impractical_demo_personal',
       'other_tool_personal', 'other_amenity_personal',
       'other_feature_personal', 'contact_personal'],
      dtype='object')


In [6]:
# remove trailing semicolons in all columns
survey_data = survey_data.map(lambda x: x.rstrip(';') if isinstance(x, str) else x)
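
To illustrate what this cleanup does, here is a minimal sketch on a made-up column (the sample values are assumptions, not survey rows). It uses the column-wise `Series.str.rstrip` instead of the element-wise lambda; for string cells the result should be the same, and missing values stay missing.

```python
import pandas as pd

# toy multi-select answers with a trailing semicolon (made-up values)
toy = pd.Series(["Desktop computer;Laptop;", "Smartphone;", None])

# strip the trailing semicolon; the missing value is left as missing
cleaned = toy.str.rstrip(";")
print(cleaned.tolist())
```

Only the trailing separator is removed, so the semicolons between answers are still available for splitting later.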

In [7]:
# split into 2 sub-dataframes: those who use amenity data for work and those who don't
# users_A = don't use amenity data
# users_B = use amenity data

# .copy() gives independent frames, so later column assignments do not touch survey_data
users_A = survey_data[survey_data["use_amenity_data"] == "No"].copy()
users_B = survey_data[survey_data["use_amenity_data"] == "Yes"].copy()
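
A quick sanity check that may be worth running after a split like this: respondents with a missing answer match neither mask, so they end up in neither group. The sketch below uses a made-up frame (the `toy` names and values are assumptions) to show how to count them.

```python
import pandas as pd

# toy frame: one respondent never answered the amenity-data question
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "use_amenity_data": ["Yes", "No", None, "No"],
})

toy_A = toy[toy["use_amenity_data"] == "No"].copy()
toy_B = toy[toy["use_amenity_data"] == "Yes"].copy()

# rows with a missing answer fall into neither sub-dataframe
unassigned = len(toy) - len(toy_A) - len(toy_B)
print(len(toy_A), len(toy_B), unassigned)
```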

#### Preprocess users_A
THOSE THAT DON'T USE AMENITY DATA

In [8]:
# remove useless columns, i.e. those with only NaN values
users_A = users_A.dropna(axis=1, how='all')
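
As a small illustration of `dropna(axis=1, how='all')`, the sketch below uses a made-up frame (column values are assumptions). A column is dropped only when every value in it is missing; a partly answered column is kept.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2],
    "device_personal": ["Laptop", None],  # partly answered: kept
    "device_work": [None, None],          # never answered by this group: dropped
})

trimmed = toy.dropna(axis=1, how="all")
print(trimmed.columns.tolist())
```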



In [10]:
users_A.columns

Index(['id', 'county', 'sector', 'use_amenity_data', 'device_personal',
       'freq_tool_personal', 'demo_useful_personal',
       'why_impractical_demo_personal', 'other_tool_personal',
       'other_amenity_personal', 'other_feature_personal', 'contact_personal'],
      dtype='object')

In [14]:
# replace NaN with "Unapplicable" in columns users never saw (branching) and with "Empty" in those users chose not to answer

A_branch_cols_list = users_A.columns[7:8].tolist()
A_unrequired_cols_list = users_A.columns[9:].tolist()

def solve_nan(df, col_list, value):
    for col in col_list:
        df[col] = df[col].fillna(value)

solve_nan(users_A, A_branch_cols_list, "Unapplicable")
solve_nan(users_A, A_unrequired_cols_list, "Empty")
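
The distinction between the two placeholders can be sketched on a made-up frame. The helper name `fill_missing` and the toy columns are assumptions for illustration; the helper follows the same idea as `solve_nan`.

```python
import pandas as pd

def fill_missing(df, cols, value):
    """Fill NaN in the given columns with a placeholder (same idea as solve_nan)."""
    for col in cols:
        df[col] = df[col].fillna(value)

toy = pd.DataFrame({
    "demo_useful": ["Not useful", "Extremely useful"],
    # branching question: only shown when the demo was rated negatively
    "why_impractical": ["I don't like web applications", None],
    # optional question: shown to everyone, nobody answered
    "contact": [None, None],
})

fill_missing(toy, ["why_impractical"], "Unapplicable")
fill_missing(toy, ["contact"], "Empty")
print(toy)
```

Keeping the two placeholders separate makes it possible to tell later whether a blank means "not asked" or "asked but skipped".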

In [None]:
users_A

In [None]:
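# function to handle "other" answers
def prefix_other_answers(row, answer):
    # leave missing values and the NaN placeholders untouched,
    # otherwise "Unapplicable" would become "Other: Unapplicable"
    if not isinstance(row, str) or row in ("Unapplicable", "Empty"):
        return row
    items = row.split(";") # split
    updated_items = [
        item.strip() if item.strip() in answer else f"Other: {item.strip()}"
        for item in items
    ]
    return "; ".join(updated_items)  # join back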

# adding "other" prefix to why impractical demo personal answers
neg_reason_demo_list = ["Already have access to this information","I don't like web applications"]
users_A["why_impractical_demo_personal"] = users_A["why_impractical_demo_personal"].apply(prefix_other_answers, answer=neg_reason_demo_list)
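
To see what the prefixing does to a single multi-select cell, here is a minimal self-contained sketch. `prefix_other_demo` is a hypothetical variant written for illustration, and the free-text answer is made up; skipping the placeholder values is an assumption about the intended behaviour.

```python
def prefix_other_demo(row, answer, placeholders=("Unapplicable", "Empty")):
    # hypothetical variant of prefix_other_answers, for illustration only
    if not isinstance(row, str) or row in placeholders:
        return row  # leave missing values and placeholders as they are
    items = [item.strip() for item in row.split(";")]
    return "; ".join(item if item in answer else f"Other: {item}" for item in items)

known = ["Already have access to this information", "I don't like web applications"]

# one predefined option plus one free-text answer (made up)
print(prefix_other_demo("I don't like web applications;I cycle everywhere", known))
# → I don't like web applications; Other: I cycle everywhere

print(prefix_other_demo("Unapplicable", known))
# → Unapplicable
```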

#### Preprocess users_B
THOSE THAT USE AMENITY DATA

In [None]:
# replace the missing county in the first row with Mayo (Damian); note that fillna fills every NaN in this column
users_B["county"] = users_B["county"].fillna("Mayo")

# remove useless columns aka those with only NAN values
users_B = users_B.dropna(axis=1, how='all')

# replace NaN with "Unapplicable" in columns users never saw (branching) and with "Empty" in those users chose not to answer

branch_cols_list = ["why_unsatisfied_tool_work", "why_impractical_demo_work"]
unrequired_cols_list = survey_data.columns[[11]].tolist() + survey_data.columns[15:].tolist()
# keep only the columns that survived dropna above, otherwise solve_nan raises a KeyError
unrequired_cols_list = [col for col in unrequired_cols_list if col in users_B.columns]

# solve_nan is already defined in the users_A preprocessing above
solve_nan(users_B, branch_cols_list, "Unapplicable")
solve_nan(users_B, unrequired_cols_list, "Empty")

In [None]:
# shorten answers for type amenity and type tool
new_amenity_list = ["Recreational","Transport & Mobility","Healthcare & Safety", "Technological","Mechanical","Accessibility"]
og_amenity_list = ["Recreational (parks, sport facilities, hiking trails, public beaches, etc)",
                   "Transport & mobility (bus stops, EV charging stations, parking, bicycle lanes, etc)",
                   "Healthcare & Safety (emergency services, hospitals, pharmacies, public defibrillators, etc)",
                   "Technological (public wi-fi, etc)",
                   "Mechanical (water grid, electric grid, etc)",
                   "Accessibility features (wheelchair ramps, tactile pavement, public toilets, etc)"]

new_tool_list = ["gov_db","city_software","nav_app"]
og_tool_list = ["Government database (i.e: data.gov.ie)",
                   "City planning or Zoning software",
                   "Navigation applications (i.e: Google Maps)"]

# use mapping dict to account for multiple answers
amenity_mapping = dict(zip(og_amenity_list, new_amenity_list))
users_B["type_amenity_data_work"] = users_B["type_amenity_data_work"].apply(
    lambda x: ";".join([amenity_mapping.get(item.strip(), item.strip()) for item in x.split(";")])
)
tool_mapping = dict(zip(og_tool_list, new_tool_list))
users_B["type_tool_work"] = users_B["type_tool_work"].apply(
    lambda x: ";".join([tool_mapping.get(item.strip(), item.strip()) for item in x.split(";")])
)

# adding "other" prefix to custom amenity answers
users_B["type_amenity_data_work"] = users_B["type_amenity_data_work"].apply(prefix_other_answers, answer=new_amenity_list)

# adding "other" prefix to custom tool answers
users_B["type_tool_work"] = users_B["type_tool_work"].apply(prefix_other_answers, answer=new_tool_list)

# adding "other" prefix to why unsatisfied tool work "other" answers
neg_reason_tool_list = ["Incomplete information","Not user friendly","Slow - not modern"]
users_B["why_unsatisfied_tool_work"] = users_B["why_unsatisfied_tool_work"].apply(prefix_other_answers, answer=neg_reason_tool_list)

# adding "other" prefix to additional amenity "other" answers
other_amenity_list = ["Bike lanes","Bike sheds","Hiking trails","Car parking","Parks","Public bathrooms"]
users_B["other_amenity_work"] = users_B["other_amenity_work"].apply(prefix_other_answers, answer=other_amenity_list)

# adding "other" prefix to why impractical demo "other" answers
users_B["why_impractical_demo_work"] = users_B["why_impractical_demo_work"].apply(prefix_other_answers, answer=neg_reason_demo_list)
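
The shortening step can be illustrated on a single cell. This is a minimal sketch with a reduced mapping; the helper name `shorten` and the custom answer "Street lighting" are assumptions. Options found in the mapping are replaced by their short label, while anything else passes through unchanged, which is what lets the later prefixing step mark it as "Other".

```python
# reduced version of the amenity mapping, for illustration
full_to_short = {
    "Recreational (parks, sport facilities, hiking trails, public beaches, etc)": "Recreational",
    "Technological (public wi-fi, etc)": "Technological",
}

def shorten(cell, mapping):
    # map each selected option to its short label; unknown answers are kept as typed
    return ";".join(mapping.get(item.strip(), item.strip()) for item in cell.split(";"))

print(shorten("Technological (public wi-fi, etc);Street lighting", full_to_short))
# → Technological;Street lighting
```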

In [None]:
users_B.head(5)

### Second order of business: probably encoding some of the columns
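
One possible way to encode the multi-select columns is one indicator column per answer, using `Series.str.get_dummies`. The sketch below is only an assumption about how the encoding could look, on made-up values; it expects answers separated by a bare `;`, so cells joined with `"; "` would need their separator normalised first.

```python
import pandas as pd

# toy multi-select column (made-up rows), answers separated by ";"
toy = pd.Series(
    ["Cost;Parking hours", "Parking hours", "Empty"],
    name="other_feature_personal",
)

# one 0/1 indicator column per distinct answer
encoded = toy.str.get_dummies(sep=";")
print(encoded)
```

Note that the "Empty" placeholder becomes its own indicator column here; it could be dropped afterwards if it is not wanted as a feature.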