# IFN619 :: UA2 - Extending Analytics (40%)

**IMPORTANT:** Refer to the instructions in Canvas [UA2 - Assignment 2 - extending analytics](https://canvas.qut.edu.au/courses/17432/assignments/163774) *BEFORE* working on this assignment.

#### REQUIREMENTS ####

1. Complete and run the code cell below to display your name, student number, and assignment option
2. Identify an appropriate question (or questions) to be addressed by your overall data analytics narrative
3. Extend your analysis in assignment 1 with:
    - the analysis of additional unstructured data using the Guardian API (See accessing the Guardian API notebook),
    - the use of one machine learning technique (as used in the class materials), and
    - identification of ethical considerations relevant to the analysis (by drawing on class materials).
4. Ensure that you include documentation of your thinking and decision-making using markdown cells
5. Ensure that you include appropriate visualisations, and that they support the overall narrative
6. Ensure that your insights answer your question/s and are appropriate to your narrative. 
7. Ensure that your insights are consistent with the ethical considerations identified.

**NOTE:** you should not repeat the analysis from assignment 1, but you may need to save dataframes from assignment 1 and reload for use in this assignment. You may also summarise your assignment 1 insights as part of the process of identifying questions for analysis.

#### SUBMISSION ####

1. Create an assignment 2 folder named in the form **UA2-surname-idnumber** and put your notebook and any data files inside this folder. Note, do not put large training data in this folder (reference any training data that you used but keep it outside this folder), only keep small data files and models in this folder with your notebook.
2. When you have everything in the correct folder, reset all cells and restart the kernel, then run the notebook completely, checking that all cells have run without error. If you encounter errors, fix your notebook and re-run the process. It is important that your notebook runs without errors only requiring the files in the folder that you have created.
3. When the notebook is error free, zip the entire folder (you can select download folder in Jupyter).
4. Submit the zipped folder on Canvas [UA2 - Assignment 2 - extending analytics](https://canvas.qut.edu.au/courses/17432/assignments/163774)


<div style="background:#FFFFEE; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFN619 - Data Analytics for Strategic Decision Makers (2024 Sem 1)</div>

---


Import necessary libraries:

In [7]:

import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import numpy as np
from plotly.subplots import make_subplots

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

In [8]:
# Define the file path for the cleaned dataset
file_path = "cleaned-queensland-funding-recipients.csv"

# Read the CSV file into a pandas DataFrame
try:
    df = pd.read_csv(file_path)
except FileNotFoundError:
    # Every later cell depends on df, so report the problem and stop here
    print("Error: File not found. Please check the file path.")
    raise

#### Charts

#### Chart 1 - Top 15 Other Partners; Collaborators (if applicable) averaged by Actual Contractual Commitments ($)
* Entries labelled The Partners Will Be Selected show a higher average contractual commitment than the other partner categories, which may reflect a larger capital investment in those projects.

* After funding is secured, the average amount committed to contracts is notably higher for The Partners Will Be Selected than for Data61 and the other collaborators. This suggests that these projects carry greater financial responsibility and a larger stake in the project's outcomes.

In [9]:
# Group the dataframe by 'Other Partners; Collaborators (if applicable)' and calculate the mean of 'Actual Contractual Commitment ($)'
average_commitment_df = df.groupby('Other Partners; Collaborators (if applicable)')['Actual Contractual Commitment ($)'].mean().reset_index()

# Rename the column to reflect that it contains average values
average_commitment_df = average_commitment_df.rename(columns={'Actual Contractual Commitment ($)': 'Average Actual Contractual Commitment ($)'})

# Sort the dataframe in descending order and select the top 15 partners/collaborators
top_15_other_partner_collaborators = average_commitment_df.sort_values(by='Average Actual Contractual Commitment ($)', ascending=False).head(15)

# Reset the index and add a 'Rank' column
top_15_other_partner_collaborators = top_15_other_partner_collaborators.reset_index(drop=True)
top_15_other_partner_collaborators['Rank'] = top_15_other_partner_collaborators.index + 1

# Truncate labels after 20 characters
top_15_other_partner_collaborators['Truncated Title'] = top_15_other_partner_collaborators['Other Partners; Collaborators (if applicable)'].apply(lambda x: x[:20] + '...' if len(x) > 20 else x)


# Create the bar chart with updated colors and sorted in descending order
fig = px.bar(
    top_15_other_partner_collaborators,
    x='Average Actual Contractual Commitment ($)',
    y='Truncated Title',
    orientation='h',
    title='Top 15 Other Partners; Collaborators (if applicable) averaged by Actual Contractual Commitments ($)',
    labels={'Average Actual Contractual Commitment ($)': 'Average Actual Contractual Commitment ($)', 'Truncated Title': 'Other Partners; Collaborators (if applicable)'},
    text='Average Actual Contractual Commitment ($)',
    color='Truncated Title',
    color_discrete_sequence=px.colors.qualitative.Set3,
    hover_data={'Other Partners; Collaborators (if applicable)': True}
)


fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(
    title=dict(font=dict(size=20)),
    xaxis_title="Average Actual Contractual Commitment ($)",
    yaxis_title="Other Partners; Collaborators (if applicable)",
    margin=dict(l=150, r=20, t=70, b=70),
    height=700
)

fig.show()
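
Because this chart ranks partners by a mean, a category with only one or two large contracts can top the list. The following is a minimal sketch, using made-up figures and hypothetical partner names rather than the assignment data, of reporting the record count and median alongside the mean so that such cases are visible:

```python
import pandas as pd

# Made-up figures for illustration only; not taken from the funding dataset
toy = pd.DataFrame({
    "partner": ["A", "A", "A", "A", "B"],
    "commitment": [100_000, 120_000, 90_000, 110_000, 900_000],
})

# Report count and median next to the mean so single-record groups stand out
summary = (
    toy.groupby("partner")["commitment"]
    .agg(records="count", mean="mean", median="median")
    .sort_values("mean", ascending=False)
    .reset_index()
)
print(summary)
```

In this toy example partner B ranks first on the mean but is backed by a single record, which is the kind of caveat worth stating next to a ranking of averages.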

#### Chart 2 - Analyze Average Actual Contractual Commitments ($) for RAP Region
* Nonqueensland lags significantly behind the other regions.
* Some differences in contract funding between regions are small, but it is worth investigating why Nonqueensland has considerably less funding allocated to contracts than other areas.
* RAP Region appears to influence the amount allocated to contracts, although not every difference between categories is meaningful; some may be due to chance. For instance, the average for Brisbane And Redlands does not differ significantly from the statewide average. Nonqueensland, however, has significantly lower contract funding than Brisbane And Redlands, Darling Downs, Brisbane, Gold Coast, and Ipswich.

In [10]:
# Group the dataframe by 'RAP Region' and calculate the mean of 'Actual Contractual Commitment ($)'
average_commitment_df = df.groupby('RAP Region')['Actual Contractual Commitment ($)'].mean().reset_index()

# Rename the column to reflect that it contains average values
average_commitment_df = average_commitment_df.rename(columns={'Actual Contractual Commitment ($)': 'Average Actual Contractual Commitment ($)'})
average_commitment_df = average_commitment_df.sort_values(by='Average Actual Contractual Commitment ($)', ascending=False)


fig = px.bar(
    average_commitment_df,
    x='Average Actual Contractual Commitment ($)',
    y='RAP Region',
    orientation='h',
    title='Average Actual Contractual Commitment ($) by RAP Region',
    labels={'Average Actual Contractual Commitment ($)': 'Average Actual Contractual Commitment ($)', 'RAP Region': 'RAP Region'},
    text='Average Actual Contractual Commitment ($)',
    color='RAP Region',
    color_discrete_sequence=px.colors.qualitative.Set3,
    hover_data={'RAP Region': True}
)


fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(
    title=dict(font=dict(size=20)),
    xaxis_title="Average Actual Contractual Commitment ($)",
    yaxis_title="RAP Region",
    margin=dict(l=150, r=20, t=70, b=70),
    height=700
)

fig.show()
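
The notes for this chart describe some regional differences as significant and others as possibly random. One way to check such a claim is a permutation test on the difference in means. The sketch below is illustrative only: the function name `permutation_p_value`, the region labels, and the figures are all assumptions, not part of the assignment data or the original analysis.

```python
import numpy as np
import pandas as pd


def permutation_p_value(values, in_group, n_permutations=5000, seed=0):
    """Two-sided permutation test: mean of one group versus mean of the rest."""
    values = np.asarray(values, dtype=float)
    in_group = np.asarray(in_group, dtype=bool)
    rng = np.random.default_rng(seed)

    observed = values[in_group].mean() - values[~in_group].mean()
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the group labels while keeping the group sizes fixed
        shuffled = rng.permutation(in_group)
        diff = values[shuffled].mean() - values[~shuffled].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    # Add 1 to numerator and denominator so the estimate is never exactly zero
    return observed, (hits + 1) / (n_permutations + 1)


# Made-up data for illustration only
toy = pd.DataFrame({
    "region": ["X"] * 6 + ["Y"] * 6,
    "commitment": [10, 12, 11, 9, 13, 10, 50, 55, 48, 52, 51, 49],
})

observed, p_value = permutation_p_value(toy["commitment"], toy["region"] == "X")
print(f"difference in means: {observed:.2f}, p-value: {p_value:.4f}")
```

A small p-value indicates that a gap as large as the observed one rarely arises when region labels are shuffled at random. With the real data, the same call could compare one RAP Region against all others.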

#### Chart 3 - Average Actual Contractual Commitments ($) within State Electorate
* The chart shows the share (%) of records for each of the nineteen State Electorate categories.

* McConnel and Maiwar have notably higher percentages than the other categories, which may reflect more funded activity in those electorates. The Rest category pools many electorates, so this should be kept in mind when interpreting the proportions.

* McConnel accounts for 23.36% of records and Maiwar for 12.09%, both well above the other categories.

* The Rest category combines 88 electorates. Because this group is so large, it must be taken into account when reading the chart.

In [11]:
# Calculate the frequency count and mean actual contractual commitment for each state electorate
frequency_df = df.groupby('State Electorate').agg(
    Frequency=('State Electorate', 'size'),
    Average_Commitment=('Actual Contractual Commitment ($)', 'mean')
).reset_index()

# Calculate the mean frequency count
mean_frequency = frequency_df['Frequency'].mean()

# Separate entries above and below the threshold
above_threshold = frequency_df[frequency_df['Frequency'] >= mean_frequency]
below_threshold = frequency_df[frequency_df['Frequency'] < mean_frequency]

# Pool the entries below the threshold into a single "Rest" row
rest_sum_frequency = below_threshold['Frequency'].sum()
# Weight each electorate's mean by its record count so the pooled value is a true average
rest_sum_commitment = (below_threshold['Average_Commitment'] * below_threshold['Frequency']).sum()
rest_row = pd.DataFrame([{'State Electorate': 'Rest', 'Frequency': rest_sum_frequency, 'Average_Commitment': rest_sum_commitment / rest_sum_frequency}])

# Concatenate the above threshold data with the "Rest" row
final_df = pd.concat([above_threshold, rest_row], ignore_index=True)

# Sort the data in descending order and take the top 15 entries
final_df = final_df.sort_values(by='Frequency', ascending=False).head(15)

# Calculate percentages and round them
total_frequency = final_df['Frequency'].sum()
final_df['Percentage'] = ((final_df['Frequency'] / total_frequency) * 100).round(2)

# Create the bar chart with updated colors and sorted in descending order
fig = px.bar(
    final_df,
    x='Frequency',
    y='State Electorate',
    orientation='h',
    title='State Electorate by Frequency Count and Average Actual Contractual Commitments ($)',
    labels={'Frequency': 'Frequency Count', 'State Electorate': 'State Electorate'},
    text='Frequency',
    color='Average_Commitment',
    color_continuous_scale=px.colors.sequential.Viridis,
    hover_data={
        'State Electorate': True,
        'Average_Commitment': True,
        'Frequency': True,
        'Percentage': ':.2f'  # Format the percentage to 2 decimal places
    }
)

# Update the text to include percentage
final_df['Text'] = final_df.apply(lambda row: f"{row['Frequency']} ({row['Percentage']}%)", axis=1)
fig.update_traces(texttemplate=final_df['Text'], textposition='outside')


fig.update_layout(
    title=dict(font=dict(size=20)),
    xaxis_title="Frequency Count",
    yaxis_title="State Electorate",
    margin=dict(l=150, r=20, t=70, b=70),
    height=700
)

# Show the figure
fig.show()
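
When many electorates are pooled into one "Rest" row, the pooled average should weight each electorate's mean by its record count. The sketch below uses made-up numbers and hypothetical electorate labels to show how far a record-weighted mean can sit from an unweighted mean of means:

```python
import pandas as pd

# Made-up per-electorate summaries for illustration only
groups = pd.DataFrame({
    "electorate": ["P", "Q"],
    "Frequency": [1, 9],
    "Average_Commitment": [1000.0, 100.0],
})

total_records = groups["Frequency"].sum()

# Weighted: rebuild each group's total, then divide by the pooled record count
weighted_mean = (groups["Average_Commitment"] * groups["Frequency"]).sum() / total_records

# Unweighted: treats a 1-record group the same as a 9-record group
mean_of_means = groups["Average_Commitment"].mean()

print(weighted_mean)   # 190.0
print(mean_of_means)   # 550.0
```

The weighted value matches what a direct mean over all ten underlying records would give, so it is the figure to use for the pooled row's colour scale.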

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Convert categorical variables to dummy variables if needed
data = pd.get_dummies(df)

# Split the data into features (X) and target variable (y)
X = data.drop('Actual Contractual Commitment ($)', axis=1)
y = data['Actual Contractual Commitment ($)']

# Adjust the test size parameter
test_size = 0.3  # You can adjust this value as needed

# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)

# Check if test set has enough samples
if len(X_test) < 2:
    print("Test set has too few samples. Try increasing the test size or using a larger dataset.")
else:
    # Train the Decision Tree Regressor model
    model = DecisionTreeRegressor(random_state=42)
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f'Mean Squared Error: {mse}')
    print(f'R^2 Score: {r2}')


Best Parameters: {'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 2}
Mean Squared Error: 100731362994.17317
R^2 Score: 0.15457031853342773
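
The output reports best values for `max_depth`, `min_samples_leaf` and `min_samples_split`, which points to a hyperparameter search. The sketch below shows one way such a search could be run with scikit-learn's `GridSearchCV`. It is not the notebook's actual method: the data is synthetic and the grid values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only; not the funding dataset
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 3))
y = 5 * X[:, 0] + rng.normal(0, 1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Hypothetical grid; these candidate values are assumptions
param_grid = {
    "max_depth": [3, 10, 30],
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 10],
}

# 5-fold cross-validation on the training set only, scored by (negated) MSE
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set
y_pred = search.best_estimator_.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = float(np.sqrt(mse))  # same units as the target, easier to interpret
r2 = r2_score(y_test, y_pred)

print(f"Best Parameters: {search.best_params_}")
print(f"Mean Squared Error: {mse}")
print(f"RMSE: {rmse}")
print(f"R^2 Score: {r2}")
```

Reporting the RMSE alongside the MSE makes the error easier to read. For the results above, the square root of the reported MSE is roughly $317,000 per prediction, and an R^2 of about 0.15 means the model explains only around 15% of the variance in the test set.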
