<div style="background:#FFFFEE; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFN619 - Data Analytics for Strategic Decision Makers (2024 Sem 1)</div>

---

## Data for both Part A and Part B

This assignment uses data from the Queensland Government [Open Data Portal](https://www.data.qld.gov.au). Both parts will use data on [Advance Queensland Funding Recipients](https://www.data.qld.gov.au/dataset/advance-queensland-funding-recipients). You should familiarise yourself with the [Advance Queensland Program and Grants](https://advance.qld.gov.au) to understand the context for the data. You should also refer to the `field descriptions` metadata to better understand the fields that are relevant to the `funding recipients` data.

---
## Part A

**IMPORTANT** For the following task, keep a record of the dates and times where you demonstrated your understanding with your tutor. These should be AFTER you have completed the questions, and BEFORE week 5.

### [Q1] Read the data

- Open the CSV version of the file. Open directly from the URL into a pandas dataframe.
- Identify an appropriate index, and make a note of the columns.

Import necessary libraries:

In [2]:
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import numpy as np
from plotly.subplots import make_subplots

Read the CSV data from the URL:

In [3]:
url = 'https://www.data.qld.gov.au/dataset/db190f2d-f866-4811-9a6e-4b78744b551b/resource/0f97b985-f5c7-49d2-8b0a-bc5dfbe070b9/download/advance-queensland-funding-recipients.csv'
df = pd.read_csv(url, encoding='ISO-8859-1')

Inspect the dataframe to identify an appropriate index and note the columns:

In [4]:
df.head(10)

Unnamed: 0,Program,Round,Recipient Name,Physical Address of Recipient - Suburb/Location,Physical Address of Recipient - Post Code,University Collaborator (if applicable),Other Partners; Collaborators (if applicable),Investment/Project Title,Primary Location of Activity/Project - Suburb,Primary Location of Activity/Project - Post Code,Multiple Locations of Activity/Project (if applicable),Approval date,Local Government /Council,RAP Region,State Electorate,Actual Contractual Commitment ($)
0,Aboriginal and Torres Strait Islander PhD Scho...,AQ Aboriginal & Torres Strait Islander PhD Sch...,Central Queensland University,Norman Gardens,4701.0,,BHP Billiton,Decolonising the systematic barriers and enabl...,Brisbane City,4001.0,,14/06/2019,Rockhampton (R),Brisbane and Redlands,Keppel,107084
1,Aboriginal and Torres Strait Islander PhD Scho...,AQ Aboriginal & Torres Strait Islander PhD Sch...,Griffith University,Nathan,4111.0,,,An indigenous journey through the 21st century...,Nathan,4111.0,,28/01/2016,Brisbane (C),Brisbane and Redlands,Toohey,117500
2,Aboriginal and Torres Strait Islander Research...,AQ Aboriginal & Torres Strait Islander Researc...,Queensland University of Technology,Brisbane City,4000.0,,Engineered Wood Products Association Australa...,An Innovative Framing System for Taller Timber...,Brisbane City,4000.0,,21/08/2018,Brisbane (C),Brisbane and Redlands,McConnel,240000
3,Aboriginal and Torres Strait Islander Research...,AQ Aboriginal & Torres Strait Islander Researc...,CSIRO,Smithfield,4878.0,,CSIRO,Transforming hidden data: An integrative infor...,Smithfield,4878.0,,28/01/2016,Cairns (R),Far North Queensland,Barron River,158032
4,Advancing Regional Innovation Program,AQ Advancing Regional Innovation Full 2016-17,Redland City Council,Cleveland,4163.0,,Community Information Support Services Ltd \n...,Growing innovation in the Redlands and Logan r...,Cleveland,4163.0,"Meadowbrook, Alexandra Hills, Springwood",21/09/2017,Redland (C),Brisbane and Redlands,Oodgeroo,500000
5,Advancing Regional Innovation Program,AQ Advancing Regional Innovation Full 2020-21,Tablelands Regional Council,Atherton,4883.0,,Natural Evolution Pty Ltd \n Farmer Meets Foo...,Tablelands Innovation Program,Atherton,4883.0,,11/11/2020,Tablelands (R),Far North Queensland,Hill,45000
6,Advancing Regional Innovation Program,AQ Advancing Regional Innovation Full 2016-17,Sunshine Coast Regional Council,Maroochydore,4558.0,,Noosa Shire Council \n University of the Suns...,Sunshine Coast Regional Innovation Program #SC...,Maroochydore,4558.0,"Sippy Downs, Nambour, Tewantin",2/06/2017,Sunshine Coast (R),Sunshine Coast,Maroochydore,500000
7,Advancing Regional Innovation Program,AQ Advancing Regional Innovation Full 2020-21,Smart Precinct NQ Limited,Townsville City,4810.0,,,"North Queensland Investment, Innovation & Indu...",Townsville City,4810.0,,2/07/2020,Townsville (C),Townsville,Townsville,500000
8,Advancing Regional Innovation Program,AQ Advancing Regional Innovation Staged 2016-17,Wide Bay Burnett Regional Organisation of Coun...,Gympie,4570.0,,University of the Sunshine Coast \n The Gener...,Develop a Collaborative Action Plan and advise...,Gympie,4570.0,,19/01/2017,Gympie (R),Wide Bay,Gympie,20000
9,Advancing Regional Innovation Program,AQ Advancing Regional Innovation Full 2016-17,Wide Bay Burnett Regional Organisation of Coun...,Gympie,4570.0,,University of the Sunshine Coast \n Fraser Co...,Delivering Innovation in the Wide Bay Burnett ...,Gympie,4570.0,"Gayndah, Kingaroy",17/07/2017,Gympie (R),Wide Bay,Gympie,478330


In [5]:
# check available columns
df.columns

Index(['Program', 'Round ', 'Recipient Name',
       'Physical Address of Recipient - Suburb/Location',
       'Physical Address of Recipient - Post Code',
       'University Collaborator (if applicable)',
       'Other Partners; Collaborators (if applicable)',
       'Investment/Project Title',
       'Primary Location of Activity/Project - Suburb',
       'Primary Location of Activity/Project - Post Code',
       'Multiple Locations of Activity/Project (if applicable)',
       'Approval date', 'Local Government /Council', 'RAP Region',
       'State Electorate', 'Actual Contractual Commitment ($)'],
      dtype='object')

In [6]:
# check data type of columns
df.dtypes

Program                                                    object
Round                                                      object
Recipient Name                                             object
Physical Address of Recipient - Suburb/Location            object
Physical Address of Recipient - Post Code                 float64
University Collaborator (if applicable)                    object
Other Partners; Collaborators (if applicable)              object
Investment/Project Title                                   object
Primary Location of Activity/Project - Suburb              object
Primary Location of Activity/Project - Post Code          float64
Multiple Locations of Activity/Project (if applicable)     object
Approval date                                              object
Local Government /Council                                  object
RAP Region                                                 object
State Electorate                                           object
Actual Con

In [14]:
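# Check empty (NaN) values, counting +/-inf as missing as well.
# (The pd.options.mode.use_inf_as_na option is deprecated in recent pandas.)
empty_values = df.replace([np.inf, -np.inf], np.nan).isna().sum()
print("Empty values:\n", empty_values)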

Empty values:
 Program                                                      0
Round                                                       51
Recipient Name                                               0
Physical Address of Recipient - Suburb/Location              0
Physical Address of Recipient - Post Code                   71
University Collaborator (if applicable)                   1327
Other Partners; Collaborators (if applicable)              862
Investment/Project Title                                     0
Primary Location of Activity/Project - Suburb               33
Primary Location of Activity/Project - Post Code            55
Multiple Locations of Activity/Project (if applicable)    1090
Approval date                                                0
Local Government /Council                                    0
RAP Region                                                   0
State Electorate                                             0
Actual Contractual Commitment ($)       



In [22]:
# check Unique values

unique_values = df.nunique()
print("Unique values:\n", unique_values)

Unique values:
 Program                                                     81
Round                                                      124
Recipient Name                                             816
Physical Address of Recipient - Suburb/Location            350
Physical Address of Recipient - Post Code                  194
University Collaborator (if applicable)                     11
Other Partners; Collaborators (if applicable)              382
Investment/Project Title                                  1243
Primary Location of Activity/Project - Suburb              303
Primary Location of Activity/Project - Post Code           188
Multiple Locations of Activity/Project (if applicable)     231
Approval date                                              218
Local Government /Council                                   47
RAP Region                                                  16
State Electorate                                            88
Actual Contractual Commitment ($)      
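
Q1 asks for an appropriate index. In the output above the largest unique count is 1243 (`Investment/Project Title`), while the row labels run from 0 to 1338, so no single column identifies every row and the default integer index is a reasonable choice. The check can be sketched as follows; `unique_columns` is a hypothetical helper and the small frames are illustrative stand-ins, not the real data.

```python
import pandas as pd


def unique_columns(frame):
    """Return the columns that could act as an index:
    no missing values and no repeated values."""
    return [
        col for col in frame.columns
        if frame[col].notna().all() and frame[col].is_unique
    ]


# Illustrative frame in which every column repeats a value (assumption, not real data)
no_key = pd.DataFrame({
    "Program": ["A", "A", "B"],
    "Recipient Name": ["X Pty Ltd", "Y University", "X Pty Ltd"],
    "Investment/Project Title": ["P1", "P1", "P2"],
})

# Illustrative frame where the project title happens to be unique
with_key = no_key.assign(**{"Investment/Project Title": ["P1", "P2", "P3"]})

print(unique_columns(no_key))    # no candidate: keep the default RangeIndex
print(unique_columns(with_key))  # the title column qualifies as an index
```

When the list is empty, as it would be for the unique counts shown above, keeping the default `RangeIndex` avoids duplicate index labels.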

### [Q2] Group the data
- Choose at least one category and group the data
- Obtain an appropriate aggregate for the groups (e.g. Sum, Mean, etc)

#### The function clean_dataframe preprocesses a pandas DataFrame by:

- Stripping spaces from column names.
- Converting specific columns to title case and handling newline characters.
- Filling missing values with empty strings.
- Removing excess whitespace and converting appropriate columns to numeric types.
- Ensuring consistent data types across columns, including converting monetary values to float after removing commas.

These steps make the values consistent and prepare the DataFrame for analysis.

In [63]:
def clean_dataframe(df):
    # Strip leading/trailing spaces from column names
    df.columns = df.columns.str.strip()

    # Round - MARK NA WITH EMPTY VALUE
    df['Round'] = df['Round'].fillna('')

    # Program - convert to title case
    df['Program'] = df['Program'].str.title()

    # Recipient Name - remove newline characters, convert to title case
    # ('\n' with regex=False matches a real newline on every pandas version)
    df['Recipient Name'] = df['Recipient Name'].str.replace('\n', '', regex=False).str.title()

    # University Collaborator (if applicable) - collapse whitespace (including newlines),
    # convert to title case, replace NaN with an empty string
    df['University Collaborator (if applicable)'] = df['University Collaborator (if applicable)'].str.replace(r'\s+', ' ', regex=True).str.title().fillna('')

    # Investment/Project Title - REMOVE WHITE SPACE, CHANGE ALL TO title
    df['Investment/Project Title'] = df['Investment/Project Title'].str.replace(r'\s+', ' ', regex=True).str.title()

    # RAP Region - CHANGE ALL TO title
    df['RAP Region'] = df['RAP Region'].str.title()

    # State Electorate - CHANGE ALL TO title
    df['State Electorate'] = df['State Electorate'].str.title()

    # Physical Address of Recipient - Post Code: keep as float; missing values stay NaN
    df['Physical Address of Recipient - Post Code'] = df['Physical Address of Recipient - Post Code'].replace('', np.nan).astype(float).round(1)

    # Primary Location of Activity/Project - Suburb: CHANGE ALL TO title
    df['Primary Location of Activity/Project - Suburb'] = df['Primary Location of Activity/Project - Suburb'].str.title()

    # Local Government /Council - CHANGE ALL TO title
    df['Local Government /Council'] = df['Local Government /Council'].str.title()

    # Multiple Locations of Activity/Project (if applicable) - REPLACE NaN WITH EMPTY, CHANGE ALL TO title
    df['Multiple Locations of Activity/Project (if applicable)'] = df['Multiple Locations of Activity/Project (if applicable)'].str.title().fillna('')

    # Convert columns to specified types
    df['Program'] = df['Program'].astype(str)
    df['Round'] = df['Round'].astype(str)
    df['Recipient Name'] = df['Recipient Name'].astype(str)
    df['Physical Address of Recipient - Suburb/Location'] = df['Physical Address of Recipient - Suburb/Location'].astype(str)
    df['Physical Address of Recipient - Post Code'] = df['Physical Address of Recipient - Post Code'].replace('', np.nan).astype(float).round(1)
    df['University Collaborator (if applicable)'] = df['University Collaborator (if applicable)'].astype(str).fillna('')
    # Fill NaN first so missing partners become '' rather than the string 'nan'
    df['Other Partners; Collaborators (if applicable)'] = df['Other Partners; Collaborators (if applicable)'].fillna('').astype(str)
    df['Investment/Project Title'] = df['Investment/Project Title'].astype(str)
    df['Primary Location of Activity/Project - Suburb'] = df['Primary Location of Activity/Project - Suburb'].astype(str)
    df['Primary Location of Activity/Project - Post Code'] = df['Primary Location of Activity/Project - Post Code'].replace('', np.nan).astype(float).round(1)
    df['Multiple Locations of Activity/Project (if applicable)'] = df['Multiple Locations of Activity/Project (if applicable)'].astype(str).fillna('')
    df['Approval date'] = df['Approval date'].astype(str)
    df['Local Government /Council'] = df['Local Government /Council'].astype(str)
    df['RAP Region'] = df['RAP Region'].astype(str)
    df['State Electorate'] = df['State Electorate'].astype(str)
    # Remove thousands separators, then convert to float; empty or unparseable values become NaN
    commitment = df['Actual Contractual Commitment ($)'].astype(str).str.replace(',', '', regex=False).str.strip()
    df['Actual Contractual Commitment ($)'] = pd.to_numeric(commitment, errors='coerce').round(1)

    return df

df = clean_dataframe(df)
# check dataframe cleaned properly
df.tail()

Unnamed: 0,Program,Round,Recipient Name,Physical Address of Recipient - Suburb/Location,Physical Address of Recipient - Post Code,University Collaborator (if applicable),Other Partners; Collaborators (if applicable),Investment/Project Title,Primary Location of Activity/Project - Suburb,Primary Location of Activity/Project - Post Code,Multiple Locations of Activity/Project (if applicable),Approval date,Local Government /Council,RAP Region,State Electorate,Actual Contractual Commitment ($)
1334,Young Starters' Fund,AQ Young Starters Fund Round 2015-16 Round 5,Griffith University,Southport,4215.0,,,Mentor Revolution  Get Started - Ygstrs-49249...,Southport,4215.0,University Of Queensland - St Lucia,21/06/2016,Gold Coast (C),Gold Coast,Bonney,19226.0
1335,Young Starters' Fund,AQ Young Starters Fund Round 2015-16 Round 5,Fifty Six Creations Pty Ltd - Mt Gravatt,Upper Mount Gravatt,4122.0,,,Fiftysix Academy And Advance Queensland In Mac...,Mackay,4740.0,,21/06/2016,Brisbane (C),Mackay-Whitsunday,Mansfield,20000.0
1336,Young Starters' Fund,AQ Young Starters Fund Round 2016-17 Round 1,South Bank Business Association Incorporated,South Brisbane,4101.0,,,The Big 5 - 5 Big Learnings From Industry Experts,South Brisbane,4101.0,,28/07/2016,Brisbane (C),Brisbane And Redlands,South Brisbane,7500.0
1337,Young Starters' Fund,AQ Young Starters Fund Round 2015-16 Round 5,Time Masters (Australia) Pty Limited,Runaway Bay,4216.0,,,Open Your Eyes To Cash - Logan - Ygstrs-479994...,Loganholme,4129.0,Logan,21/06/2016,Gold Coast (C),Logan,Broadwater,10350.0
1338,Young Starters' Fund,AQ Young Starters Fund Round 2015-16 Round 5,Marist Youth Care Limited,Paddington,4064.0,,,Impact National Conference - Ygstrs-5061022-69,South Brisbane,4101.0,,21/06/2016,Brisbane (C),Brisbane And Redlands,Cooper,9546.0
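
The grouping that Q2 asks for can be sketched as below: choose one category, then compute aggregates of the committed funds per group. The three rows are illustrative values only, and the choice of `RAP Region` as the category is an assumption; any categorical column would work the same way.

```python
import pandas as pd

# Illustrative rows only (not the full dataset)
rows = pd.DataFrame({
    "RAP Region": ["Gold Coast", "Gold Coast", "Logan"],
    "Actual Contractual Commitment ($)": [19226.0, 10000.0, 10350.0],
})

# Group by one category and compute several aggregates at once
grouped = (
    rows.groupby("RAP Region")["Actual Contractual Commitment ($)"]
        .agg(["sum", "mean", "count"])
)
print(grouped)
```

Requesting `sum`, `mean` and `count` together makes it easier to tell whether a large total comes from many small commitments or from a few large ones.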


### [Q3] Save the data
- Transform your grouped data into a dataframe
- Save the dataframe as a CSV file


In [64]:
cleaned_file_path = r'cleaned-queensland-funding-recipients.csv'
# Save the cleaned dataframe to a CSV file
df.to_csv(cleaned_file_path, index=False)
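
Q3 refers to the grouped data, whereas the cell above saves the cleaned dataframe. A minimal sketch of saving a grouped result is shown below. It writes to an in-memory buffer so that it runs without touching the disk; the series values and the column name `Total Commitment ($)` are assumptions made for illustration.

```python
import io

import pandas as pd

# Illustrative grouped result: a Series indexed by the grouping category
grouped = pd.Series(
    [29226.0, 10350.0],
    index=pd.Index(["Gold Coast", "Logan"], name="RAP Region"),
    name="Total Commitment ($)",
)

# reset_index() turns the group labels back into an ordinary column,
# giving a dataframe that can be written without the index
grouped_df = grouped.reset_index()

buffer = io.StringIO()
grouped_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back confirms the round trip
restored = pd.read_csv(io.StringIO(csv_text))
```

To write a real file, pass a path to `to_csv` instead of the buffer.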

### [Q4] Visualise the data
	
- Visualise the grouped data with an appropriate chart
- Ensure X and Y axes are labelled appropriately
- Add an appropriate title for the chart

### [Q5] Create new dataframes
- Select 2 groups from a particular category and filter the data into 2 separate dataframes
- For each dataframe group by at least one logical category with meaningful aggregates
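
One way to approach Q5 is sketched below: filter two regions into separate dataframes, then aggregate each one by program. `region_summary` is a hypothetical helper, and the four rows use made-up amounts.

```python
import pandas as pd

# Illustrative rows with made-up amounts
df_small = pd.DataFrame({
    "RAP Region": ["Brisbane And Redlands", "Central Queensland",
                   "Brisbane And Redlands", "Gold Coast"],
    "Program": ["Ignite Ideas Fund", "Ignite Ideas Fund",
                "Research Fellowships", "Young Starters' Fund"],
    "Actual Contractual Commitment ($)": [100000.0, 50000.0, 300000.0, 19226.0],
})


def region_summary(frame, region):
    """Filter one RAP region, then total and count commitments per program."""
    subset = frame[frame["RAP Region"] == region]
    return (
        subset.groupby("Program")["Actual Contractual Commitment ($)"]
              .agg(total="sum", projects="count")
              .reset_index()
    )


brisbane_df = region_summary(df_small, "Brisbane And Redlands")
central_df = region_summary(df_small, "Central Queensland")
print(brisbane_df)
print(central_df)
```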

### [Q6] Obtain descriptive statistics
	
- Find the descriptive statistics for the funds committed for the different groups.
- Assign the `count`, `mean`, `min`, and `max` to variables. Round the mean to a reasonable precision.
- Use the variables to create a string which describes in words the basic descriptive statistics of the committed funds.
- Print the constructed strings for each group
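
The steps in Q6 can be sketched as follows. The five amounts are the Young Starters' Fund rows shown in the `df.tail()` output earlier in this notebook; the label text and variable names are assumptions.

```python
import pandas as pd

# The five commitments visible in the df.tail() output
funds = pd.Series(
    [19226.0, 20000.0, 7500.0, 10350.0, 9546.0],
    name="Actual Contractual Commitment ($)",
)

# describe() returns count, mean, std, min, quartiles and max in one Series
stats = funds.describe()
count = int(stats["count"])
mean = round(stats["mean"], 2)  # two decimal places is ample for dollar amounts
minimum = stats["min"]
maximum = stats["max"]

label = "Young Starters' Fund (last five rows)"
summary = (
    f"{label}: {count} commitments, mean ${mean:,.2f}, "
    f"min ${minimum:,.0f}, max ${maximum:,.0f}"
)
print(summary)
```

For several groups, the same lines can be wrapped in a function and called once per dataframe.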

### [Q7] Visualise the data

- Using the plotly library, create histograms of the committed funds for the different groups
- Set the number of bins to an appropriate value
- Display the actual counts in the bars
- Enhance the visualisation of the variance by including a box plot
- Use suitable colours and add appropriate textual information

### Charts

#### Chart 1 - Top 19 Programs by Average Actual Contractual Commitment ($)

- The "Uq - Covid-19 Vaccine" program has by far the highest average commitment of any program, which points to a strong funding emphasis on vaccine development.
- Its average is well above that of the other leading programs, such as Data61.

In [68]:
# Group the dataframe by 'Program' and calculate the mean of 'Actual Contractual Commitment ($)'
average_commitment_df = df.groupby('Program')['Actual Contractual Commitment ($)'].mean().reset_index()

# Rename the column to reflect that it contains average values
average_commitment_df = average_commitment_df.rename(columns={'Actual Contractual Commitment ($)': 'Average Actual Contractual Commitment ($)'})

# Sort the dataframe in descending order and select the top 19 programs
top_19_programs_df = average_commitment_df.sort_values(by='Average Actual Contractual Commitment ($)', ascending=False).head(19)

top_19_programs_df = top_19_programs_df.reset_index(drop=True)
top_19_programs_df['Rank'] = top_19_programs_df.index + 1

fig = px.bar(
    top_19_programs_df,
    x='Average Actual Contractual Commitment ($)',
    y='Program',
    orientation='h',
    title='Top 19 Programs by Average Actual Contractual Commitments ($)',
    hover_data={'Average Actual Contractual Commitment ($)':True, 'Program': True},
    text='Average Actual Contractual Commitment ($)',
    color='Program',
    color_discrete_sequence=px.colors.qualitative.Set3
)

# Update the text and layout of the bar chart
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(
    title=dict(font=dict(size=20)),
    xaxis_title="Average Actual Contractual Commitment ($)",
    yaxis_title="Program",
    margin=dict(l=150, r=20, t=70, b=70),
    height=700 
)

# Show the figure
fig.show()


#### Chart 2 - Top 15 Investment/Project Titles by Average Actual Contractual Commitment ($)
* The Uq Covid-19 Vaccine Program received significantly more funding than any other project. This level of investment suggests that the project was treated as an urgent priority in responding to the pandemic.

* The "Tier 1 Of 3" project under Advanced Autonomy Platform Technologies also received a high level of funding, indicating substantial investment in the development of autonomous-systems technology.


In [12]:
# Group the dataframe by 'Investment/Project Title' and calculate the mean of 'Actual Contractual Commitment ($)'
average_commitment_df = df.groupby('Investment/Project Title')['Actual Contractual Commitment ($)'].mean().reset_index()

# Rename the column to reflect that it contains average values
average_commitment_df = average_commitment_df.rename(columns={'Actual Contractual Commitment ($)': 'Average Actual Contractual Commitment ($)'})

# Sort the dataframe in descending order and select the top 15 investment/projects
top_15_projects_df = average_commitment_df.sort_values(by='Average Actual Contractual Commitment ($)', ascending=False).head(15)

# Reset the index and add a 'Rank' column
top_15_projects_df = top_15_projects_df.reset_index(drop=True)
top_15_projects_df['Rank'] = top_15_projects_df.index + 1

# Truncate labels after 20 characters
top_15_projects_df['Truncated Title'] = top_15_projects_df['Investment/Project Title'].apply(lambda x: x[:20] + '...' if len(x) > 20 else x)


fig = px.pie(
    top_15_projects_df,
    values='Average Actual Contractual Commitment ($)',
    names='Truncated Title',
    title='Top 15 Investment/Projects Title by Average Actual Contractual Commitments ($)',
    color='Investment/Project Title',
    color_discrete_sequence=px.colors.qualitative.Set3,
    hole=0.3,
    hover_name='Investment/Project Title'
)

fig.update_layout(
    title=dict(font=dict(size=20)),
    margin=dict(l=150, r=20, t=70, b=70),
    height=700 
)

fig.show()


#### Chart 3 - Top 10 Recipient Names by Average Actual Contractual Commitment ($)
- The Commonwealth Scientific And Industrial Research Organisation T/A Data61 has the highest average "Actual Contractual Commitment ($)" of any recipient, ahead of others such as Boeing Defence Australia Ltd.

- Because the commitment is the amount awarded to the recipient, this suggests that Data61 was funded for large-scale research and development projects.


In [13]:
# Group the dataframe by 'Recipient Name' and calculate the mean of 'Actual Contractual Commitment ($)'
average_commitment_df = df.groupby('Recipient Name')['Actual Contractual Commitment ($)'].mean().reset_index()

# Rename the column to reflect that it contains average values
average_commitment_df = average_commitment_df.rename(columns={'Actual Contractual Commitment ($)': 'Average Actual Contractual Commitment ($)'})

# Sort the dataframe in descending order and select the top 10 recipients
top_10_RecipientName_df = average_commitment_df.sort_values(by='Average Actual Contractual Commitment ($)', ascending=False).head(10)

# Reset the index and add a 'Rank' column
top_10_RecipientName_df = top_10_RecipientName_df.reset_index(drop=True)
top_10_RecipientName_df['Rank'] = top_10_RecipientName_df.index + 1

# Truncate labels after 20 characters
top_10_RecipientName_df['Truncated Title'] = top_10_RecipientName_df['Recipient Name'].apply(lambda x: x[:20] + '...' if len(x) > 20 else x)


# Create the bar chart with updated colors and sorted in descending order
fig = px.bar(
    top_10_RecipientName_df,
    x='Average Actual Contractual Commitment ($)',
    y='Truncated Title',
    orientation='h',
    title='Top 10 Recipient Names averaged by Actual Contractual Commitments ($)',
    labels={'Average Actual Contractual Commitment ($)': 'Average Actual Contractual Commitment ($)', 'Truncated Title': 'Recipient Name'},
    text='Average Actual Contractual Commitment ($)',
    color='Truncated Title',
    color_discrete_sequence=px.colors.qualitative.Set3,
    hover_data={'Recipient Name': True}  # Add the full recipient name for hover data
)

# Update the text and layout of the bar chart
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(
    title=dict(font=dict(size=20)),
    xaxis_title="Average Actual Contractual Commitment ($)",
    yaxis_title="Recipient Name",
    margin=dict(l=150, r=20, t=70, b=70),
    height=700
)

# Show the figure
fig.show()


#### Chart 4 - Total, Mean, Median and Count of Actual Contractual Commitment ($) by Program
1. Allocation of Funds and Establishment of Priorities:

- The substantial total funding allocated to specific programs implies a strategic prioritization in domains such as vaccine development, artificial intelligence, and technological innovation. These allocations likely mirror overarching policy objectives or emerging necessities, such as the management of the COVID-19 crisis.

2. Impact Assessment of Projects:

- Individual projects that receive substantial funding, such as "Uq - Covid-19 Vaccine" and "Data61," may represent high-impact endeavors anticipated to yield significant results. On the other hand, multi-project programs with varying funding levels may be exploring diverse aspects or phases of innovation, diversifying risks, or aiming for incremental progress.

3. Relationship between Project Numbers and Funding Allocation:

- The prevalence of single-project programs among the listed initiatives indicates a concentrated allocation of funds per program. Conversely, programs with multiple projects, such as "Platform Technology Program" and "Innovation Partnerships Grants," imply sustained or diversified funding efforts for ongoing or comprehensive initiatives.

4. Comparison of Mean and Median Funding Levels:

- The mean and median funding amounts for most programs are identical owing to their single-project nature. In contrast, for multi-project programs, the difference between mean and median funding values offers insights into the distribution of funds across various projects. For instance, the "Platform Technology Program" displays a mean funding amount of $3,826,918 and a median of $2,653,836, indicating that certain projects receive substantially higher funding than others.
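
The mean-versus-median comparison in point 4 can be illustrated with a small worked example. The amounts are hypothetical and chosen only to show the effect of one large project on the mean.

```python
import pandas as pd

# Hypothetical multi-project program: three mid-sized projects and one large one
multi = pd.Series([1_000_000, 2_000_000, 3_000_000, 10_000_000])

# Hypothetical single-project program
single = pd.Series([10_000_000])

multi_mean, multi_median = multi.mean(), multi.median()
single_mean, single_median = single.mean(), single.median()

print(f"multi:  mean ${multi_mean:,.0f}, median ${multi_median:,.0f}")
print(f"single: mean ${single_mean:,.0f}, median ${single_median:,.0f}")
```

Here the multi-project mean is $4,000,000 and the median is $2,500,000: the single $10,000,000 project pulls the mean up while the median stays near the typical project. With only one project the two values are always equal.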


In [14]:
# Convert 'Actual Contractual Commitment ($)' to numeric, coercing errors to NaN
df['Actual Contractual Commitment ($)'] = pd.to_numeric(df['Actual Contractual Commitment ($)'], errors='coerce')
# Group by 'Program' and compute the total, mean, median and count of commitments
grouped_data = df.groupby('Program').agg({
    'Actual Contractual Commitment ($)': ['sum', 'mean', 'median', 'count']
}).reset_index()

# Flatten the column names
grouped_data.columns = ['Program', 'Total Funding Amount ($)', 'Mean Funding Amount ($)', 'Median Funding Amount ($)', 'Count of Projects']
grouped_data = grouped_data.sort_values(by='Mean Funding Amount ($)', ascending=False).head(15)
# Create a truncated version of the 'Program' names for x-axis display
grouped_data['Program_Truncated'] = grouped_data['Program'].apply(lambda x: x[:10] + '...' if len(x) > 10 else x)

# Create a 2x2 grid of subplots: bar charts on the top row, scatter plots on the bottom row
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{"type": "bar"}, {"type": "bar"}],
           [{"type": "scatter"}, {"type": "scatter"}]],
    subplot_titles=("Total Funding Amount by Program", "Count of Projects by Program",
                    "Mean Funding Amount by Program", "Median Funding Amount by Program")
)

# Add bar chart for 'Total Funding Amount ($)'
fig.add_trace(go.Bar(x=grouped_data['Program_Truncated'], y=grouped_data['Total Funding Amount ($)'], name='Total Funding Amount', 
                     text=grouped_data['Program'], hovertemplate='%{text}<br>Total Funding Amount: $%{y}<extra></extra>'), row=1, col=1)

# Add bar chart for 'Count of Projects'
fig.add_trace(go.Bar(x=grouped_data['Program_Truncated'], y=grouped_data['Count of Projects'], name='Count of Projects', 
                     text=grouped_data['Program'], hovertemplate='%{text}<br>Count of Projects: %{y}<extra></extra>'), row=1, col=2)

# Add scatter plot for 'Mean Funding Amount ($)'
fig.add_trace(go.Scatter(x=grouped_data['Program_Truncated'], y=grouped_data['Mean Funding Amount ($)'], mode='markers+lines', name='Mean Funding Amount', 
                         text=grouped_data['Program'], hovertemplate='%{text}<br>Mean Funding Amount: $%{y}<extra></extra>'), row=2, col=1)

# Add scatter plot for 'Median Funding Amount ($)'
fig.add_trace(go.Scatter(x=grouped_data['Program_Truncated'], y=grouped_data['Median Funding Amount ($)'], mode='markers+lines', name='Median Funding Amount', 
                         text=grouped_data['Program'], hovertemplate='%{text}<br>Median Funding Amount: $%{y}<extra></extra>'), row=2, col=2)

# Update layout
fig.update_layout(
    title_text="Funding Analysis by Program",
    height=900,
    showlegend=False,
    title=dict(font=dict(size=20))
)

# Show the figure
fig.show()

---
## Part B - creating a narrative to answer significant questions

**SCENARIO:**  The allocation of public money (obtained from the public via taxes) is a politically sensitive activity with governments regularly coming under scrutiny for how this money is spent. A respected media organisation is looking into the Queensland Government's Advance Queensland program. The resulting story could be a "good news" story reporting on the success of the program, however if inappropriate spending or irregularities are found, it could become a story that is critical of the scheme, and potentially the Government.

As a data analyst, your task is to analyse the publicly available data on the distribution of the funds over time. You are looking for patterns that may support the "good news" story, or which may be a cause for concern. It is up to you how deeply you explore the data, but at a minimum you should look at (a) the balance between South-East Queensland and the remainder of the state (regional Queensland); and (b) how distributions align with the objectives of the scheme which may include supporting specified groups of people. 

**ETHICAL APPROACH:** You are expected to be fair and ethical in your analysis, and therefore the insights that you draw should take into account contextual factors. You should avoid simplistic assumptions like assuming all groups and activities should receive equal funding. For example, disproportionate funding may be appropriate due to social circumstances or the costs involved in a particular activity. Further, benefits to Queensland may come in different forms. For example, cultural benefits cannot be directly compared to economic benefits.

**ESSENTIAL REQUIREMENTS:** Your task as a data analyst is to:

- Ensure that you use the techniques and libraries/packages that have been used in class
- Identify high quality questions that when answered may be helpful in addressing the scenario above
- Obtain the data in JSON form from the API.
- Clean and filter the data as appropriate
- Analyse the data in a way that answers your questions and ultimately addresses the concern in the scenario
- Visualise your results in a meaningful way that is helpful in making visible key findings
- Provide a detailed summary of the insights found and how they address the original questions and scenario
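
A minimal sketch of the "obtain the data in JSON form from the API" step. It assumes the portal exposes the standard CKAN `datastore_search` action and that the resource ID from the CSV URL in Q1 is enabled for it; both should be confirmed on the dataset's API page. The helper names are hypothetical, and `sample_payload` is a hand-written stand-in with the CKAN response shape, not real API output.

```python
import json
import urllib.parse
import urllib.request

import pandas as pd

# Assumed endpoint (standard CKAN action API) and the resource ID from the Q1 CSV URL
API_URL = "https://www.data.qld.gov.au/api/3/action/datastore_search"
RESOURCE_ID = "0f97b985-f5c7-49d2-8b0a-bc5dfbe070b9"


def build_query_url(resource_id, limit=100, offset=0):
    """Build a datastore_search URL for one page of records."""
    query = urllib.parse.urlencode(
        {"resource_id": resource_id, "limit": limit, "offset": offset}
    )
    return f"{API_URL}?{query}"


def fetch_page(resource_id, limit=100, offset=0):
    """Download one page of results as a dict (needs network access)."""
    with urllib.request.urlopen(build_query_url(resource_id, limit, offset)) as response:
        return json.load(response)


def records_to_frame(payload):
    """Convert a CKAN-style response into a dataframe of records."""
    if not payload.get("success"):
        raise ValueError("API request was not successful")
    return pd.DataFrame(payload["result"]["records"])


# Hand-written stand-in for an API response (illustrative values)
sample_payload = {
    "success": True,
    "result": {"records": [
        {"Program": "Young Starters' Fund", "RAP Region": "Logan"},
        {"Program": "Ignite Ideas Fund", "RAP Region": "Gold Coast"},
    ]},
}
frame = records_to_frame(sample_payload)
print(frame)
```

In a notebook, `records_to_frame(fetch_page(RESOURCE_ID))` would fetch the first page; larger datasets need repeated calls with an increasing `offset`.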

**AUTHENTICITY AND INTEGRITY**: You will be marked on (a) *HOW* you undertake the task; together with (b) detail of *WHY* you made various decisions involved in the tasks; and (c) acknowledgement of **WHERE** you used material that is not directly yours. Therefore, you must document your thinking and approach throughout the notebook using the Markdown cells, and give credit to other resources as appropriate. You are encouraged to use the `Exemplars` PDF to help write your code. You may use online resources including `GenAI tools` and `stackoverflow` to help you write your code, however you must acknowledge that you are using these resources in the markdown cells explaining your analysis. Note that you do not need to use formal referencing for this.

---



#### Chart 5 - Actual Contractual Commitment ($) per Program, grouped by RAP Region, focusing on Brisbane And Redlands and Central Queensland
### Analysis Insights:

#### Brisbane And Redlands:

1. **Elevated Levels of Funding:**

   - The Brisbane And Redlands region has received notably higher funding across programs. The largest allocation is to the "Ignite Ideas Fund" at $27,914,294, followed by "Industry Research Fellowships" at $19,669,661 and "Research Fellowships" at $15,226,045.

2. **Variety of Initiatives:**

   - A broad spectrum of initiatives is endorsed in the area, encompassing fields such as health research (e.g., "Clem Jones Centre For Ageing Dementia Research") and technological progress (e.g., "Platform Technology Program").

3. **Strategic Investments:**

   - Prominent endeavors like the "Uq - Covid-19 Vaccine" project, funded with $10,000,000, reflect an emphasis on impactful research and development.

#### Central Queensland:

1. **Reduced Funding Levels:**

   - Central Queensland exhibits comparatively lower levels of funding, with the most substantial allocation directed towards the "Rockhampton Technology And Innovation Centre" at $2,800,000. Programs like "Ignite Ideas Fund" and "Industry Research Fellowships" received $847,993 and $682,440 respectively, indicating investments of a smaller scale.

2. **Targeted Financial Support:**

   - Funding is channeled into a limited number of projects, albeit of significant importance, such as the "Rockhampton Technology And Innovation Centre" and the "Biofutures Commercialisation Program".

3. **Prospects for Expansion:**

   - The existence of investments, though modest in scale, hints at the potential for future expansion and advancement in the realms of innovation and technology.
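
To make the size of the gap easier to read, the program totals quoted in the insights above can be placed side by side. This is only a sketch: it re-enters the four figures by hand rather than reading them from `df`, and the `Ratio` column is an illustrative measure added here, not part of the original analysis.

```python
import pandas as pd

# Totals copied from the insights above (Actual Contractual Commitment, $)
totals = pd.DataFrame({
    'Program': ['Ignite Ideas Fund', 'Industry Research Fellowships'] * 2,
    'RAP Region': ['Brisbane And Redlands'] * 2 + ['Central Queensland'] * 2,
    'Actual Contractual Commitment ($)': [27914294, 19669661, 847993, 682440],
})

# One row per program, one column per region
side_by_side = totals.pivot(
    index='Program',
    columns='RAP Region',
    values='Actual Contractual Commitment ($)',
)

# How many times larger the Brisbane And Redlands total is
side_by_side['Ratio'] = (
    side_by_side['Brisbane And Redlands'] / side_by_side['Central Queensland']
).round(1)

print(side_by_side)
```

On these figures the Brisbane And Redlands totals are roughly 33 and 29 times the Central Queensland totals. A raw ratio like this ignores context such as population or the number of applicants, so it should be read as a prompt for further questions rather than as evidence of unfairness.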

In [15]:
# Filter the data for the specified RAP regions
filtered_df = df[df['RAP Region'].isin(['Brisbane And Redlands', 'Central Queensland'])]

# Group by Program and RAP Region and sum the Actual Contractual Commitment
grouped_df = filtered_df.groupby(['Program', 'RAP Region'])['Actual Contractual Commitment ($)'].sum().reset_index()

# Get the top 15 programs by sum of Actual Contractual Commitment
top_programs = grouped_df.groupby('Program')['Actual Contractual Commitment ($)'].sum().nlargest(15).index
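# .copy() gives an independent dataframe, so adding a column below does not raise SettingWithCopyWarning
grouped_df = grouped_df[grouped_df['Program'].isin(top_programs)].copy()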

# Truncate Program labels longer than 10 characters
grouped_df['Truncated Program'] = grouped_df['Program'].apply(lambda x: x if len(x) <= 10 else x[:10] + '...')

# Create a histogram with truncated labels but full labels in hover
fig_hist = px.histogram(
    grouped_df, 
    x='Truncated Program', 
    y='Actual Contractual Commitment ($)', 
    color='RAP Region', 
    barmode='group', 
    histfunc='sum',
    text_auto=True,
    title='Sum of Actual Contractual Commitment ($) per Program by RAP Region'
)

# Customize the histogram
fig_hist.update_layout(
    xaxis_title='Program',
    yaxis_title='Sum of Actual Contractual Commitment ($)',
    yaxis=dict(range=[0, 10000000]),  # Set y-axis limit to 10M
    legend_title='RAP Region',
    bargap=0.2,
    height=700  # Set plot height to 700
)

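# Add hover data (set per trace so each region's bars are matched to their own Program names)
fig_hist.update_traces(
    hovertemplate='<b>Program:</b> %{customdata[0]}<br><b>Sum:</b> %{y}<br>'
)
fig_hist.for_each_trace(
    lambda trace: trace.update(
        customdata=grouped_df.loc[grouped_df['RAP Region'] == trace.name, ['Program']].values
    )
)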

# Create a box plot with truncated labels but full labels in hover
fig_box = px.box(
    grouped_df, 
    x='Truncated Program', 
    y='Actual Contractual Commitment ($)', 
    color='RAP Region',
    title='Distribution of Actual Contractual Commitment ($) per Program by RAP Region'
)

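# Add hover data to box plot (set per trace so each region's points are matched to their own Program names)
fig_box.update_traces(
    hovertemplate='<b>Program:</b> %{customdata[0]}<br><b>Value:</b> %{y}<br>'
)
fig_box.for_each_trace(
    lambda trace: trace.update(
        customdata=grouped_df.loc[grouped_df['RAP Region'] == trace.name, ['Program']].values
    )
)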

# Combine the plots into a single figure
fig = go.Figure(data=fig_hist.data + fig_box.data)

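# Apply layout to the combined figure (titles and bar mode of the source figures are not carried over)
fig.update_layout(
    title='Actual Contractual Commitment ($) per Program by RAP Region',
    xaxis_title='Program',
    yaxis_title='Actual Contractual Commitment ($)',
    yaxis=dict(range=[0, 10000000]),  # Set y-axis limit to 10M
    barmode='group',  # Keep the two regions side by side, as in fig_hist
    legend_title='RAP Region',
    height=700  # Set plot height to 700
)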

# Show the plot
fig.show()
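
One caveat with truncating labels to 10 characters is that two different programs could end up with the same label, in which case their bars would be summed together on the x-axis. The sketch below shows one way to check for this. The helper names `truncate_label` and `find_label_collisions` and the example program names are hypothetical, used only to illustrate the check.

```python
import pandas as pd


def truncate_label(name, max_len=10):
    """Apply the same truncation rule as the chart code above."""
    return name if len(name) <= max_len else name[:max_len] + '...'


def find_label_collisions(program_names, max_len=10):
    """Return truncated labels shared by more than one distinct program."""
    labels = pd.Series(
        {name: truncate_label(name, max_len) for name in set(program_names)}
    )
    counts = labels.value_counts()
    return sorted(counts[counts > 1].index)


# Hypothetical program names, used only to illustrate the check
example = ['Advance Queensland Alpha', 'Advance Queensland Beta', 'Ignite Ideas Fund']
print(find_label_collisions(example))  # ['Advance Qu...']
```

Running `find_label_collisions(grouped_df['Program'])` before plotting would show whether a longer `max_len` is needed for this dataset.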

### Reflection:
- This comparative analysis examines the funding disparities between Brisbane And Redlands and Central Queensland. Brisbane And Redlands holds a significant funding advantage across programs, which may reflect its well-established infrastructure and its research and innovation capability. The region's success in attracting large-scale ventures, particularly in health research and technology, underlines its role as a central development hub.

- In contrast, Central Queensland receives less funding in total but benefits from targeted investments aimed at stimulating regional progress. Initiatives like the "Rockhampton Technology And Innovation Centre" suggest a deliberate emphasis on building regional innovation hubs. Although comparatively small, these investments help develop local capability and support more equitable development across the state.

- Furthermore, the analysis points to the need for a more balanced distribution of funding that supports both urban and regional areas. While Brisbane And Redlands excels in securing high-value projects, supporting regional endeavors in Central Queensland can help narrow the gap and promote more balanced and inclusive growth. This approach could enrich the overall innovation landscape and foster sustainable economic growth across diverse regions.