# IFN619 :: UA2 - Extending Analytics (40%)

**IMPORTANT:** Refer to the instructions in Canvas [UA2 - Assignment 2 - extending analytics](https://canvas.qut.edu.au/courses/17432/assignments/163774) *BEFORE* working on this assignment.

#### REQUIREMENTS ####

1. Complete and run the code cell below to display your name, student number, and assignment option
2. Identify an appropriate question (or questions) to be addressed by your overall data analytics narrative
3. Extend your analysis in assignment 1 with:
    - the analysis of additional unstructured data using the Guardian API (See accessing the Guardian API notebook),
    - the use of one machine learning technique (as used in the class materials), and
    - identification of ethical considerations relevant to the analysis (by drawing on class materials).
4. Ensure that you include documentation of your thinking and decision-making using markdown cells
5. Ensure that you include appropriate visualisations, and that they support the overall narrative
6. Ensure that your insights answer your question/s and are appropriate to your narrative. 
7. Ensure that your insights are consistent with the ethical considerations identified.

**NOTE:** you should not repeat the analysis from assignment 1, but you may need to save dataframes from assignment 1 and reload for use in this assignment. You may also summarise your assignment 1 insights as part of the process of identifying questions for analysis.

#### SUBMISSION ####

1. Create an assignment 2 folder named in the form **UA2-surname-idnumber** and put your notebook and any data files inside this folder. Note, do not put large training data in this folder (reference any training data that you used but keep it outside this folder), only keep small data files and models in this folder with your notebook.
2. When you have everything in the correct folder, reset all cells and restart the kernel, then run the notebook completely, checking that all cells have run without error. If you encounter errors, fix your notebook and re-run the process. It is important that your notebook runs without errors only requiring the files in the folder that you have created.
3. When the notebook is error free, zip the entire folder (you can select download folder in Jupyter).
4. Submit the zipped folder on Canvas [UA2 - Assignment 2 - extending analytics](https://canvas.qut.edu.au/courses/17432/assignments/163774)


<div style="background:#FFFFEE; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFN619 - Data Analytics for Strategic Decision Makers (2024 Sem 1)</div>

---


### Summary and Analysis of Assignment 1 Insights

##### Distribution Chart: Histogram of Funding Amounts
- **Observation**: The histogram shows a significantly right-skewed distribution, with most funding amounts concentrated within the $0 to $100k range.
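
The right skew noted above can be quantified, and a log transform is one common way to compress it. The sketch below is illustrative only: the amounts are made-up values, not figures from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical funding amounts in dollars (illustrative only, not from the dataset)
amounts = pd.Series([5_000, 20_000, 50_000, 80_000, 100_000, 250_000, 2_000_000, 9_000_000])

raw_skew = amounts.skew()            # sample skewness; positive means a long right tail
log_skew = np.log1p(amounts).skew()  # skewness after a log(1 + x) transform

print(f"Skewness of raw amounts: {raw_skew:.2f}")
print(f"Skewness of log amounts: {log_skew:.2f}")
```

A skewness well above zero confirms the long right tail; the log-transformed values are noticeably closer to symmetric.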

##### Funding by Program
- **Observation**: Top-funded programs include Platform Technology Program and UQ - Covid-19 Vaccine Development, each exceeding $9 million.

##### Top 100 Funding by Location (Suburb)
- **Observation**: St Lucia, Brisbane City, and Fortitude Valley receive the highest funding.

##### Funding Over Time
- **Observation**: Significant fluctuations with peaks in 2017, 2018, 2019, and 2020.

##### Top 10 Recipients of Funding
- **Observation**: The University of Queensland is the dominant recipient.

##### Funding by Year
- **Observation**: A surge in funding between 2015 and 2017, followed by a decline.

##### Top 10 Local Governments/Councils by Funding
- **Observation**: Brisbane (C) receives the largest allocation.

##### Committed Funds Histogram and Box Plot
- **Hot Desq**: Diverse funding with specific focal points.
- **Advancing Regional Innovation Program**: Emphasizes substantial funding diversity. 

Import necessary libraries:

In [1]:

import pandas as pd
import plotly.express as px
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Local helper module that provides the column-cleaning routines used below
from helperFunc import CleanDataFrame

#### Overview

- Question: "Can we predict the funding amount based on the characteristics of the recipients and the program details?"
- Technique: To predict the funding amount, we will use a regression model, namely `RandomForestRegressor`.
- Process: Train a Random Forest Regressor model, assess its performance, and visualize the outcomes.
- Visualizations will include key feature histograms, correlation heatmaps, scatter plots, and a prediction vs. actual values plot.

Read the cleaned funding data from the CSV file into a DataFrame, then draw a random sample of 200 rows.

In [2]:
# Define the file path for cleaned datasets
file_path = "cleaned-jsonapidata-queensland-funding-recipients.csv"

# Read the CSV file into a pandas DataFrame
try:
  df = pd.read_csv(file_path)
  # Randomly sample 200 rows from the DataFrame
  df_sample = df.sample(n=200, random_state=42)
  # Display the first few rows of the sampled dataset to understand its structure
  print(df_sample.head())
  df = df_sample
except FileNotFoundError:
  print("Error: File not found. Please check the file path.")


     _Id                        Program  \
521  522              Ignite Ideas Fund   
737  738  Industry Research Fellowships   
740  741  Industry Research Fellowships   
660  661  Industry Research Fellowships   
411  412              Ignite Ideas Fund   

                                             Round  \
521           Aq Ignite Ideas Fund 2015-16 Round 1   
737  Aq Industry Research Fellowships 2019 Round 2   
740  Aq Industry Research Fellowships 2019 Round 2   
660  Aq Industry Research Fellowships 2021 Round 4   
411           Aq Ignite Ideas Fund 2018-19 Round 5   

                    Recipient Name  \
521                   Opmantek Ltd   
737   The University Of Queensland   
740   The University Of Queensland   
660   The University Of Queensland   
411  Vostronet (Australia) Pty Ltd   

    Physical Address Of Recipient - Suburb/Location  \
521                                Surfers Paradise   
737                                        St Lucia   
740                   

In [3]:
df.columns

Index(['_Id', 'Program', 'Round', 'Recipient Name',
       'Physical Address Of Recipient - Suburb/Location',
       'Physical Address Of Recipient - Post Code',
       'University Collaborator (If Applicable)',
       'Other Partners; Collaborators (If Applicable)',
       'Investment/Project Title',
       'Primary Location Of Activity/Project - Suburb',
       'Primary Location Of Activity/Project - Post Code',
       'Multiple Locations Of Activity/Project (If Applicable)',
       'Approval Date', 'Local Government /Council', 'Rap Region',
       'State Electorate', 'Actual Contractual Commitment ($)'],
      dtype='object')

### Chart - Actual Contractual Commitment ($) per Program, Grouped by RAP Region

#### Observations:

1. **Primary Funding Initiatives**:

- The **Ignite Ideas Fund** emerges as the program that has garnered the highest overall funding, surpassing all other initiatives by a considerable margin. This suggests a noteworthy focus on nurturing innovative concepts and nascent enterprises.

- Noteworthy contributions are also attributed to the **Industry Research Fellowships** and the **Platform Technology Program**, implying a significant allocation of funds towards research endeavors and technological advancements.

2. **Geographical Allocation**:

- **Brisbane and Redlands** emerge as pivotal regions, displaying substantial financial support across various initiatives. The prevalence of funding in this area may be attributed to its robust infrastructure and the clustering of research establishments and industries.

- Significant funding is also directed towards **Brisbane**, **Gold Coast**, and **Sunshine Coast**, underscoring their significance within the state's innovation landscape.

- Regions such as **Far North Queensland**, **Townsville**, **Central Queensland**, and **Mackay-Whitsunday** exhibit a dispersion of funding across multiple initiatives, illustrating a wide-ranging yet less concentrated investment strategy.

3. **Diverse Program Landscape Across Regions**:

- The data depicts a diverse array of initiatives receiving funding in various regions, signaling a multifaceted approach to nurturing innovation. Initiatives like the **Ignite Ideas Fund** and **Industry Research Fellowships** display a broad geographical reach.

- Certain initiatives such as **Data61** and the **Artificial Intelligence Hub** receive more localized funding, potentially due to specialized expertise or facilities in those regions.

4. **Specialized Initiatives and Tailored Emphasis**:

- Despite having lower overall funding, several initiatives like the **Female Founders Accelerators** and the **Agtech And Logistics Hub** garner significant attention in specific regions, indicating targeted assistance for particular industries or demographic groups.


In [4]:
# Read the CSV file into a pandas DataFrame
try:
  data = pd.read_csv(file_path)
except FileNotFoundError:
  print("Error: File not found. Please check the file path.")

# Grouping the data by Program and RAP Region and summing the Actual Contractual Commitment
grouped_data = data.groupby(['Program', 'Rap Region'])['Actual Contractual Commitment ($)'].sum().reset_index()
# Sorting the grouped data in descending order and keeping the 60 largest Program/region combinations
sorted_grouped_data = grouped_data.sort_values(by='Actual Contractual Commitment ($)', ascending=False).head(60)

# Creating a bar plot for Actual Contractual Commitment ($) per Program, grouped by RAP Region
fig = px.bar(sorted_grouped_data, 
             x='Program', 
             y='Actual Contractual Commitment ($)', 
             color='Rap Region', 
             title='Actual Contractual Commitment ($) per Program, grouped by RAP Region',
             labels={'Actual Contractual Commitment ($)': 'Contractual Commitment ($)', 'Program': 'Program'},
             height=600)

fig.update_layout(xaxis_title="Program", yaxis_title="Actual Contractual Commitment ($)", 
                  bargap=0.2, showlegend=True,height=700)

fig.show()

### Chart - Actual Contractual Commitment ($) per Program, Grouped by University Collaborator

#### Observations:

1. **Preeminence of the Artificial Intelligence Hub**:

- Among the projects that list a university collaborator (only 6 of the 1,000 records; rows without one are excluded from this chart because `groupby` drops missing keys), the **Artificial Intelligence Hub** receives the highest financial support, exceeding $5 million. This substantial allocation implies a pronounced focus on advancing AI capabilities, potentially positioning Queensland as a frontrunner in this domain.

2. **Eminent Collaborations with Universities**:

- A significant portion of the funding allocated to the **Artificial Intelligence Hub** is attributed to a partnership between **The University of Queensland (UQ)** and **Queensland University of Technology (QUT)**, highlighting a strategic alliance between prominent academic institutions to capitalize on their collective proficiency in AI.

- The **Agtech and Logistics Hub** also secures considerable funding, approximately $3 million, through a notable collaboration involving the **University of Southern Queensland (USQ)** and **UQ**, underscoring the significance of agricultural technology and logistics in the region's strategy for innovation.

3. **Programs with Limited Financial Support**:

- Initiatives such as **Hot DesQ**, **Create Queensland**, and **Innovation Precincts and Places** receive comparatively modest funding. This trend may signify early developmental stages, specialized thematic orientations, or alternative funding mechanisms not depicted in this analysis.

In [5]:
# Grouping the data by Program and University Collaborator and summing the Actual Contractual Commitment
grouped_data = data.groupby(['Program', 'University Collaborator (If Applicable)'])['Actual Contractual Commitment ($)'].sum().reset_index()

# Renaming the column
grouped_data.rename(columns={'University Collaborator (If Applicable)': 'University Collaborator'}, inplace=True)

# Sorting the grouped data in descending order (all rows are kept)
sorted_grouped_data = grouped_data.sort_values(by='Actual Contractual Commitment ($)', ascending=False)

# Creating a bar plot for Actual Contractual Commitment ($) per Program, grouped by University Collaborator
fig = px.bar(sorted_grouped_data, 
             x='Program', 
             y='Actual Contractual Commitment ($)', 
             color='University Collaborator', 
             title='Actual Contractual Commitment ($) per Program, grouped by University Collaborator',
             labels={'Actual Contractual Commitment ($)': 'Contractual Commitment ($)', 'Program': 'Program'},
             height=600)

fig.update_layout(xaxis_title="Program", yaxis_title="Actual Contractual Commitment ($)", 
                  bargap=0.2, showlegend=True, height=700)

fig.show()
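
Because `University Collaborator (If Applicable)` is empty for most rows, it helps to remember that pandas `groupby` excludes rows with a missing group key by default. The sketch below uses made-up rows (not from the dataset) to show the difference between the default and `dropna=False`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows (illustrative only): two of three projects list no collaborator
toy = pd.DataFrame({
    "University Collaborator": ["UQ", np.nan, np.nan],
    "Actual Contractual Commitment ($)": [5.0, 2.0, 3.0],
})

amount = "Actual Contractual Commitment ($)"

# Default behaviour: rows whose group key is NaN are dropped from the result
with_collaborator = toy.groupby("University Collaborator")[amount].sum()

# dropna=False keeps a separate group for the rows with a missing key
all_rows = toy.groupby("University Collaborator", dropna=False)[amount].sum()

print(with_collaborator)
print(all_rows)
```

The default result therefore describes only projects with a named collaborator, which matters when reading totals off the chart.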

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                                                  Non-Null Count  Dtype  
---  ------                                                  --------------  -----  
 0   _Id                                                     1000 non-null   int64  
 1   Program                                                 1000 non-null   object 
 2   Round                                                   960 non-null    object 
 3   Recipient Name                                          1000 non-null   object 
 4   Physical Address Of Recipient - Suburb/Location         1000 non-null   object 
 5   Physical Address Of Recipient - Post Code               930 non-null    float64
 6   University Collaborator (If Applicable)                 6 non-null      object 
 7   Other Partners; Collaborators (If Applicable)           286 non-null    object 
 8   Investment/Project Title               

### Data Exploration
- We start by exploring the dataset to understand its structure and contents. The cell below counts the missing values in each column of the 200-row sample:

In [8]:
# Check for missing values
df.isnull().sum()

_Id                                                         0
Program                                                     0
Round                                                       8
Recipient Name                                              0
Physical Address Of Recipient - Suburb/Location             0
Physical Address Of Recipient - Post Code                  14
University Collaborator (If Applicable)                   198
Other Partners; Collaborators (If Applicable)             145
Investment/Project Title                                    0
Primary Location Of Activity/Project - Suburb               3
Primary Location Of Activity/Project - Post Code            5
Multiple Locations Of Activity/Project (If Applicable)    165
Approval Date                                               0
Local Government /Council                                   0
Rap Region                                                  0
State Electorate                                            0
Actual C

### Feature Selection and Data Preprocessing
We identify the features that may influence the grant amount and handle any missing values and categorical data.
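
The `CleanDataFrame` helpers used in the next cell come from the local `helperFunc` module, whose source is not shown here. As a rough illustration only, the sketch below assumes they fill missing text with a placeholder and title-case it, and coerce numeric columns while filling gaps with 0. The function names `clean_str_column` and `clean_float_column` are hypothetical stand-ins, not the module's actual API.

```python
import numpy as np
import pandas as pd


def clean_str_column(series: pd.Series) -> pd.Series:
    """Hypothetical stand-in: fill missing text, trim whitespace, title-case."""
    return series.fillna("Nan").astype(str).str.strip().str.title()


def clean_float_column(series: pd.Series) -> pd.Series:
    """Hypothetical stand-in: coerce to numbers and replace missing values with 0."""
    return pd.to_numeric(series, errors="coerce").fillna(0.0)


# Made-up rows (illustrative only, not from the dataset)
toy = pd.DataFrame({
    "Program": ["  ignite ideas fund", None],
    "Post Code": ["4870", np.nan],
})
toy["Program"] = clean_str_column(toy["Program"])
toy["Post Code"] = clean_float_column(toy["Post Code"])
print(toy)
```

After cleaning of this kind, `isnull().sum()` reports zero for every column, but the placeholder values still represent missing information and should be interpreted as such.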


In [9]:
# Columns to clean, grouped by data type
lsStringTypeColumnHeader = ["Program",'Round', 'Recipient Name','Physical Address Of Recipient - Suburb/Location','University Collaborator (If Applicable)','Other Partners; Collaborators (If Applicable)','Investment/Project Title','Primary Location Of Activity/Project - Suburb','Multiple Locations Of Activity/Project (If Applicable)','Local Government /Council', 'Rap Region','State Electorate']
lsFloatTypeColumnHeader = ["Primary Location Of Activity/Project - Post Code","Actual Contractual Commitment ($)","Physical Address Of Recipient - Post Code"]

# Clean the values in the string-type and float-type columns
for column in lsStringTypeColumnHeader:
    df[column] = CleanDataFrame.MSCleanStrTypeColumn(df[column])

for column in lsFloatTypeColumnHeader:
    df[column] = CleanDataFrame.MSCleanFloatTypeColumn(df[column])

df.tail()

Unnamed: 0,_Id,Program,Round,Recipient Name,Physical Address Of Recipient - Suburb/Location,Physical Address Of Recipient - Post Code,University Collaborator (If Applicable),Other Partners; Collaborators (If Applicable),Investment/Project Title,Primary Location Of Activity/Project - Suburb,Primary Location Of Activity/Project - Post Code,Multiple Locations Of Activity/Project (If Applicable),Approval Date,Local Government /Council,Rap Region,State Electorate,Actual Contractual Commitment ($)
408,409,Ignite Ideas Fund,Aq Ignite Ideas Fund 2019-20 Round 6,Health Management Pty Ltd,Westcourt,4870.0,Nan,Nan,Sophus Nutrition Commercialisation,Westcourt,4870.0,Nan,2020-05-08,Cairns (R),Far North Queensland,Cairns,200000
332,333,Ignite Ideas Fund,Aq Ignite Ideas Fund 2020-21 Round 7,Fiffy Solutions Pty Ltd,West End,4101.0,Nan,Nan,Commercialisation Of People Counting Solution,St Lucia,4067.0,Nan,2020-12-02,Brisbane (C),Brisbane And Redlands,South Brisbane,100000
208,209,Hot Desq,Round 2,Canvas Coworking Inc,Toowoomba,4350.0,Nan,Nan,Host,Toowoomba City,4350.0,Nan,2016-03-14,Toowoomba (R),Darling Downs,Toowoomba North,6000
613,614,Ignite Ideas Fund,Aq Ignite Ideas Fund 2017-18 Round 3,Adivo Pty Ltd,Clifton Beach,4879.0,Nan,Nan,Adivo - Automatically Inflating Lifejacket Sys...,Clifton Beach,4879.0,Nan,2017-09-05,Cairns (R),Far North Queensland,Barron River,97141
78,79,Female Founders Program,Round 1,Sbe Australia Limited,"Bondi Junction, Nsw",,Nan,Nan,Sbe Evolve,Fortitude Valley,4006.0,Nan,2021-09-22,Nonqueensland,Brisbane,Nonqueensland,30000


In [10]:
# Check for missing values
df.isnull().sum()

_Id                                                       0
Program                                                   0
Round                                                     0
Recipient Name                                            0
Physical Address Of Recipient - Suburb/Location           0
Physical Address Of Recipient - Post Code                 0
University Collaborator (If Applicable)                   0
Other Partners; Collaborators (If Applicable)             0
Investment/Project Title                                  0
Primary Location Of Activity/Project - Suburb             0
Primary Location Of Activity/Project - Post Code          0
Multiple Locations Of Activity/Project (If Applicable)    0
Approval Date                                             0
Local Government /Council                                 0
Rap Region                                                0
State Electorate                                          0
Actual Contractual Commitment ($)       

In [11]:
df.nunique()

_Id                                                       200
Program                                                    26
Round                                                      49
Recipient Name                                            153
Physical Address Of Recipient - Suburb/Location           104
Physical Address Of Recipient - Post Code                  74
University Collaborator (If Applicable)                     3
Other Partners; Collaborators (If Applicable)              53
Investment/Project Title                                  193
Primary Location Of Activity/Project - Suburb             101
Primary Location Of Activity/Project - Post Code           75
Multiple Locations Of Activity/Project (If Applicable)     35
Approval Date                                              62
Local Government /Council                                  23
Rap Region                                                 15
State Electorate                                           56
Actual C

### Training the RandomForestRegressor Model
We will train a Random Forest Regressor model using the training data and evaluate its performance on the test data.

In [12]:
# Select relevant features
features = ['Program', 'Recipient Name']

# Select only the necessary columns
df = df[features + ['Actual Contractual Commitment ($)', 'Approval Date']]

# Handle missing values (for simplicity, we'll drop rows with missing values)
df = df.dropna(subset=features + ['Actual Contractual Commitment ($)', 'Approval Date'])

# Convert categorical features to numerical values
df = pd.get_dummies(df, columns=['Program', 'Recipient Name'], drop_first=True)

# Convert date feature to datetime and extract year and month
df['Approval Date'] = pd.to_datetime(df['Approval Date'], errors='coerce')

# Ensure no invalid date entries
df = df.dropna(subset=['Approval Date'])

df['approval_year'] = df['Approval Date'].dt.year
df['approval_month'] = df['Approval Date'].dt.month

# Drop original date column
df = df.drop(columns=['Approval Date'])

# Define the target variable and features
X = df.drop(columns=['Actual Contractual Commitment ($)'])
y = df['Actual Contractual Commitment ($)']

# Ensure all features are numeric
X = X.apply(pd.to_numeric, errors='coerce')
X = X.fillna(0)

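# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling: fit on the training set only so no test-set statistics leak into training
# (random forests are insensitive to feature scale, so this step is optional)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)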


In [13]:
df.shape

(200, 180)
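
Of these 180 columns, 177 are one-hot indicators (25 for `Program` and 152 for `Recipient Name`, given the unique counts reported earlier and `drop_first=True`), so the model sees almost as many features as rows. One alternative, sketched below on made-up rows, is to put the encoding inside a scikit-learn `Pipeline` so that it is fitted on the training rows only. The toy column `amount` and the values are assumptions for illustration, not the notebook's data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows (illustrative only, not taken from the dataset)
toy = pd.DataFrame({
    "Program": ["A", "A", "B", "B", "C", "C", "A", "B"],
    "Recipient Name": ["R1", "R2", "R3", "R1", "R4", "R5", "R6", "R7"],
    "approval_year": [2016, 2017, 2018, 2019, 2020, 2021, 2019, 2018],
    "amount": [10.0, 12.0, 50.0, 55.0, 5.0, 6.0, 11.0, 52.0],
})
X = toy.drop(columns=["amount"])
y = toy["amount"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# One-hot encode the categorical columns; categories unseen during fitting
# are encoded as all zeros instead of raising an error
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["Program", "Recipient Name"])],
    remainder="passthrough",
)
model = Pipeline([
    ("encode", encode),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=42)),
])

model.fit(X_train, y_train)   # the encoder is fitted on the training rows only
y_pred = model.predict(X_test)
print(len(y_pred))
```

Keeping the encoder inside the pipeline means a recipient that appears only in the test set cannot influence training, which is a more realistic check of how the model would handle new applicants.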

In [14]:
# Function to evaluate models
def evaluate_model(model):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"{model.__class__.__name__} - Mean Squared Error: {mse}, R-squared: {r2}")
    return y_pred

# Models to train and evaluate (only one here; the list makes it easy to add more)
models = [
    RandomForestRegressor(n_estimators=100, random_state=42)
]

predictions = {}
for model in models:
    predictions[model.__class__.__name__] = evaluate_model(model)



RandomForestRegressor - Mean Squared Error: 1085344146480.2076, R-squared: -0.010119437372201556
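
To make these two numbers easier to read, the sketch below converts the reported MSE into an RMSE in dollars and shows, on made-up values, that always predicting the mean yields an R-squared of 0. The toy array is an assumption for illustration only.

```python
import math

import numpy as np
from sklearn.metrics import r2_score

# MSE reported in the cell output above
mse = 1085344146480.2076
rmse = math.sqrt(mse)  # back in dollars, the unit of the target
print(f"RMSE: ${rmse:,.0f}")

# Illustrative values (not from the dataset): a constant prediction equal to
# the mean of the actual values scores an R-squared of 0
y_true = np.array([1.0, 2.0, 3.0, 6.0])
y_mean = np.full_like(y_true, y_true.mean())
baseline_r2 = r2_score(y_true, y_mean)
print(f"R-squared of mean baseline: {baseline_r2}")
```

An RMSE of roughly $1.04 million and a slightly negative R-squared together indicate that the model is no more accurate than a constant guess at the average funding amount.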


### Model Performance and Insights
The performance of the Random Forest Regressor model is evaluated using Mean Squared Error (MSE) and R-squared metrics. We also visualize the predictions against the actual values.

In [15]:
fig = px.scatter(x=y_test, y=predictions['RandomForestRegressor'], labels={'x': 'Actual Funding Amount', 'y': 'Predicted Funding Amount'},
                 title='Actual vs Predicted Funding Amount - Random Forest Regressor')
fig.show()


### Conclusion
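The Random Forest Regressor does not predict funding amounts reliably from the program, recipient, and approval date. An R-squared of about -0.01 means the model performs no better than a constant prediction of the average funding amount, and the MSE of roughly 1.09 × 10^12 corresponds to a typical error of about $1.04 million. The model therefore tells us little about how these features influence the funding amount. Likely contributors are the small sample (200 rows), the large number of one-hot encoded columns (180 columns in the model DataFrame), and the strongly right-skewed funding amounts. There is room for improvement, possibly by using more complex models or additional features. The scatter plot of actual vs. predicted values helps us visually assess the model's accuracy.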
