# **Assignment 2: Machine Learning**

## **1 Simple Correlation Analysis**

Feature Selection and Dataset Preparation:
* Based on the earlier EDA, we have identified population, income, education, and other factors that might be important predictors of IRSD.
* We select relevant columns related to population characteristics, education, residential status, cultural background, and minority representation.
* Next, we use correlation analysis to measure the linear relationship between each selected feature and IRSD.

In [68]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [69]:
df = pd.read_csv('Data/communities.csv')

columns_of_interest = [
    'IRSD (avg)', 
    'Equivalent household income <$600/week', 
    'Personal income <$400/week, persons', 
    'Unemployed, persons', 
    'Public Housing Dwellings',
    '% dwellings which are public housing',
    'Primary school students', 
    'Secondary school students', 
    'TAFE students', 
    'Holds degree or higher, persons', 
    'Did not complete year 12, persons',
    'Population Density',
    'Distance to GPO (km)',  
    'Dwellings with no motor vehicle',
    'Dwellings with no internet',
    'Born overseas, persons', 
    'Born in non-English speaking country, persons', 
    'Speaks LOTE at home, persons',
    'Aboriginal or Torres Strait Islander, persons',
    'Poor English proficiency, persons',
    '2012 ERP age 0-4, persons', 
    '2012 ERP age 5-9, persons', 
    '2012 ERP age 10-14, persons', 
    '2012 ERP age 15-19, persons', 
    '2012 ERP age 20-24, persons', 
    'Public hospital separations, 2012-13',
    'Travel time to nearest public hospital',
    'Primary Schools', 
    'Secondary Schools'
]

for col in columns_of_interest:
    if col not in df.columns:
        print(f"Column '{col}' does not exist in the data")

df_filtered = df[[col for col in columns_of_interest if col in df.columns]].copy()

df_filtered.replace('<5', 5, inplace=True)

for col in df_filtered.columns:
    if df_filtered[col].isnull().sum() > 0:
        df_filtered[col] = df_filtered[col].fillna(df_filtered[col].median())

correlation_matrix = df_filtered.corr()

irsd_correlations = correlation_matrix['IRSD (avg)'].sort_values(ascending=False)

pd.set_option('display.max_rows', None)  # show all rows
print(irsd_correlations)

IRSD (avg)                                       1.000000
Holds degree or higher, persons                  0.238098
Population Density                               0.152509
Primary school students                          0.087607
Secondary school students                        0.081053
2012 ERP age 10-14, persons                      0.060856
2012 ERP age 5-9, persons                        0.060523
2012 ERP age 15-19, persons                      0.052011
2012 ERP age 20-24, persons                      0.039232
2012 ERP age 0-4, persons                        0.023887
Born overseas, persons                          -0.008502
Travel time to nearest public hospital          -0.019597
TAFE students                                   -0.034866
Personal income <$400/week, persons             -0.047917
Primary Schools                                 -0.051335
Born in non-English speaking country, persons   -0.053694
Dwellings with no motor vehicle                 -0.058276
Did not comple

A lower IRSD score indicates greater disadvantage, so a negative correlation marks a feature associated with disadvantage. All correlations below are weak to moderate; rows are ordered by magnitude within each sign.

| Feature | Correlation with IRSD (Pearson r) | Analysis |
|---------|-----------------------------------|----------|
| % dwellings which are public housing | -0.429 | The strongest relationship found (moderate): a higher share of public housing goes with a lower IRSD score. |
| Distance to GPO (km) | -0.332 | Communities farther from the city center tend to score lower, possibly because remote areas have less access to resources and opportunities. |
| Public Housing Dwellings | -0.284 | A larger number of public housing dwellings suggests greater reliance on public welfare, which is associated with disadvantage. |
| Aboriginal or Torres Strait Islander, persons | -0.193 | Areas with larger Indigenous populations tend to score lower, which may reflect systemic challenges. The correlation is weak. |
| Dwellings with no internet | -0.174 | Lack of internet access reflects digital exclusion, which is weakly linked to disadvantage here. |
| Poor English proficiency, persons | -0.170 | Poor English proficiency can limit access to education and employment. The correlation is weak. |
| Equivalent household income <$600/week | -0.146 | More low-income households indicates economic hardship. The correlation is weak. |
| Holds degree or higher, persons | 0.238 | Higher educational attainment is the strongest positive indicator of socio-economic advantage, although the correlation is still weak. |
| Population Density | 0.153 | Denser communities tend to score slightly higher, possibly because resources and opportunities are concentrated there. |
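Because both strongly negative and strongly positive correlations are informative, it can help to rank features by the absolute value of the coefficient rather than by its signed value. The following is a minimal sketch on synthetic data; the column names and the helper `rank_by_abs_correlation` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the communities data (hypothetical column names).
rng = np.random.default_rng(0)
n = 500
target = rng.normal(1000, 60, n)
demo = pd.DataFrame({
    "target": target,
    "neg_feature": -0.5 * target + rng.normal(0, 40, n),   # negatively related
    "pos_feature": 0.2 * target + rng.normal(0, 40, n),    # weakly positive
    "noise_feature": rng.normal(0, 1, n),                   # unrelated
})

def rank_by_abs_correlation(frame: pd.DataFrame, target_col: str) -> pd.DataFrame:
    """Return Pearson r against the target, ordered by |r| (target excluded)."""
    r = frame.corr(numeric_only=True)[target_col].drop(target_col)
    ranked = pd.DataFrame({"r": r, "abs_r": r.abs()})
    return ranked.sort_values("abs_r", ascending=False)

ranking = rank_by_abs_correlation(demo, "target")
print(ranking)
```

Sorting by `abs_r` keeps the sign visible in the `r` column while placing the most informative features first, whichever direction they point in.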

In [70]:
columns_of_ml = [
    'IRSD (avg)', 
    '% dwellings which are public housing', 
    'Distance to GPO (km)', 
    'Public Housing Dwellings', 
    'Equivalent household income <$600/week',
    'Poor English proficiency, persons',
    'Dwellings with no internet',
    'Aboriginal or Torres Strait Islander, persons',
    'Holds degree or higher, persons',
    'Population Density'
]

for col in columns_of_ml:
    if col not in df.columns:
        print(f"Column '{col}' does not exist in the data")


In [71]:
df_filtered = df[[col for col in columns_of_ml if col in df.columns]].copy()

X = df_filtered.drop(columns=['IRSD (avg)'])
y = df_filtered['IRSD (avg)']


In [72]:
X.head() # Features 


|   | % dwellings which are public housing | Distance to GPO (km) | Public Housing Dwellings | Equivalent household income <$600/week | Poor English proficiency, persons | Dwellings with no internet | Aboriginal or Torres Strait Islander, persons | Holds degree or higher, persons | Population Density |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.815789 | 4.264157 | 87 | 400 | 305 | 246 | 16 | 1784 | 3082.440714 |
| 1 | 4.684173 | 9.881527 | 66 | 265 | 45 | 193 | 12 | 877 | 2426.66545 |
| 2 | NaN | 134.213743 | <5 | 31 | <5 | 17 | 8 | 73 | 0.841522 |
| 3 | 1.226598 | 124.859887 | 19 | 137 | <5 | 68 | 8 | 245 | 213.059443 |
| 4 | NaN | 14.758418 | <5 | 121 | 14 | 64 | 7 | 84 | 210.819042 |


In [73]:
y.head() # Label: IRSD(avg)

0    1054.014288
1    1087.153516
2    1061.326811
3    1056.031657
4    1033.615698
Name: IRSD (avg), dtype: float64

In [74]:
# Save the filtered features and target data

X.to_csv('Data/features_X.csv', index=False)
y.to_csv('Data/target_y.csv', index=False)

## **2 Data Cleaning and Preprocessing**

* Ensure all missing or invalid values are handled.
* Perform data normalization if required, especially if we plan to use algorithms like linear regression or KNN that are sensitive to data scales.
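The cleaning steps in this section can be sketched as a single helper that maps the suppressed-count token `<5` to a number, coerces every column to numeric, and fills the remaining gaps with the column median. This is an illustrative sketch on a toy frame; the function name `clean_numeric` and the toy columns are hypothetical, and treating `<5` as 5 follows the choice made in this notebook rather than any official guidance.

```python
import pandas as pd

# Toy frame with the same kinds of problems as the real data (hypothetical columns).
raw = pd.DataFrame({
    "count_a": ["87", "<5", "19", None],
    "share_b": [3.8, None, 1.2, 4.7],
})

def clean_numeric(frame, suppressed_token="<5", suppressed_value=5):
    """Replace the suppressed-count token, coerce to numeric, impute medians."""
    cleaned = frame.replace(suppressed_token, suppressed_value)
    # Anything that still cannot be parsed becomes NaN instead of raising.
    cleaned = cleaned.apply(pd.to_numeric, errors="coerce")
    # fillna with a Series of medians fills each column with its own median.
    return cleaned.fillna(cleaned.median())

clean = clean_numeric(raw)
print(clean)
```

Returning a new frame instead of modifying columns in place avoids the chained-assignment pitfalls that pandas warns about.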

### **(1) Check for missing values in the dataset**

In [75]:
# Load the filtered features and target data
X = pd.read_csv('Data/features_X.csv')
y = pd.read_csv('Data/target_y.csv')



In [76]:
# Check for missing values
print("\nChecking for missing values in the dataset:")
print(X.isnull().sum())



Checking for missing values in the dataset:
% dwellings which are public housing             427
Distance to GPO (km)                               0
Public Housing Dwellings                           0
Equivalent household income <$600/week             0
Poor English proficiency, persons                  0
Dwellings with no internet                         0
Aboriginal or Torres Strait Islander, persons      0
Holds degree or higher, persons                    0
Population Density                                 0
dtype: int64


In [77]:
# Convert non-numeric values like '<5' to numeric 
X.replace('<5', 5, inplace=True)

# Convert all columns to numeric if possible
for col in X.columns:
    X[col] = pd.to_numeric(X[col], errors='coerce')

In [78]:
# Check for missing values in the dataset
if X.isnull().sum().any() or y.isnull().sum().any():
    print("There are missing values in the dataset. Handling missing values now.")
    # Fill missing values with median for numerical features.
    # Assign the result back instead of calling fillna(inplace=True) on a
    # column selection, which pandas deprecates.
    for col in X.select_dtypes(include=['number']).columns:
        X[col] = X[col].fillna(X[col].median())
    # Fill missing values with mode for categorical features (if any)
    for col in X.select_dtypes(include=['object']).columns:
        X[col] = X[col].fillna(X[col].mode()[0])

There are missing values in the dataset. Handling missing values now.


### **(2) Removing Collinearity**


In [79]:
# Calculate correlation matrix to identify collinear features
correlation_matrix = X.corr().abs()

# Select the upper triangle of the correlation matrix
upper_triangle = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))

# Debug: Print the upper triangle of the correlation matrix
print("\nUpper Triangle of Correlation Matrix:")
upper_triangle




Upper Triangle of Correlation Matrix:


|   | % dwellings which are public housing | Distance to GPO (km) | Public Housing Dwellings | Equivalent household income <$600/week | Poor English proficiency, persons | Dwellings with no internet | Aboriginal or Torres Strait Islander, persons | Holds degree or higher, persons | Population Density |
|---|---|---|---|---|---|---|---|---|---|
| % dwellings which are public housing | NaN | 0.053928 | 0.639015 | 0.184397 | 0.137929 | 0.202796 | 0.222425 | 0.017475 | 0.187085 |
| Distance to GPO (km) | NaN | NaN | 0.123699 | 0.195515 | 0.28056 | 0.160891 | 0.090628 | 0.370361 | 0.497407 |
| Public Housing Dwellings | NaN | NaN | NaN | 0.701377 | 0.475657 | 0.72315 | 0.631942 | 0.406994 | 0.328614 |
| Equivalent household income <$600/week | NaN | NaN | NaN | NaN | 0.709704 | 0.975595 | 0.674283 | 0.663377 | 0.326698 |
| Poor English proficiency, persons | NaN | NaN | NaN | NaN | NaN | 0.677571 | 0.269453 | 0.544159 | 0.376646 |
| Dwellings with no internet | NaN | NaN | NaN | NaN | NaN | NaN | 0.704293 | 0.614944 | 0.304507 |
| Aboriginal or Torres Strait Islander, persons | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.229075 | 0.026501 |
| Holds degree or higher, persons | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.59383 |
| Population Density | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


In [80]:
# Find features with correlation greater than 0.9 and remove them
collinear_features = [column for column in upper_triangle.columns if any(upper_triangle[column] > 0.9)]
print(f"\nCollinear features to be removed: {collinear_features}")

# Drop collinear features
X = X.drop(columns=collinear_features)

# Debug: Check the remaining features
print("\nRemaining features after removing collinear features:")
print(X.columns)


Collinear features to be removed: ['Dwellings with no internet']

Remaining features after removing collinear features:
Index(['% dwellings which are public housing', 'Distance to GPO (km)',
       'Public Housing Dwellings', 'Equivalent household income <$600/week',
       'Poor English proficiency, persons',
       'Aboriginal or Torres Strait Islander, persons',
       'Holds degree or higher, persons', 'Population Density'],
      dtype='object')


Based on the output:

*Dwellings with no internet* was removed because its absolute correlation with *Equivalent household income <$600/week* is 0.976, above the 0.9 threshold, so the two features carry largely the same information. Keeping both would introduce multicollinearity, which makes the coefficients of a linear model unstable and harder to interpret.
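The list comprehension used for the filter reports only the column that gets dropped, not the feature it collides with. A small helper that lists the offending pairs makes the decision easier to audit. This is a sketch on synthetic data; `correlated_pairs` and the columns `a`, `b`, `c` are hypothetical names.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9):
    """List (feature_1, feature_2, |r|) for pairs whose |r| exceeds the threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs > threshold].sort_values(ascending=False)
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

# Synthetic example: b is almost a copy of a, c is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=400)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=400),
    "c": rng.normal(size=400),
})

pairs = correlated_pairs(demo, threshold=0.9)
for first, second, r in pairs:
    print(f"{first} ~ {second}: |r| = {r:.3f}")
```

With the pair in hand, the choice of which member to drop can be made deliberately, for example by keeping the feature that is easier to interpret.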

### **(3) Standardizing**

In [81]:
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


In [82]:
# Debug: Check the mean and standard deviation after scaling
print("\nTraining set mean after scaling:", X_scaled.mean(axis=0))
print("Training set standard deviation after scaling:", X_scaled.std(axis=0))



Training set mean after scaling: [ 1.97372982e-17  9.21073917e-17  2.30268479e-17 -6.00342821e-17
 -1.64477485e-18 -3.28954970e-18  1.97372982e-17 -1.18423789e-16]
Training set standard deviation after scaling: [1. 1. 1. 1. 1. 1. 1. 1.]


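The output confirms that every feature in `X_scaled` has a mean of approximately 0 and a standard deviation of 1. Two caveats apply. First, the scaler is fitted on the full feature matrix before any split, so the "Training set" labels in the printout actually describe the whole dataset. Second, the partitioning step below splits `X` rather than `X_scaled`, so the models in Section 3 are trained on the unscaled features.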
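A scaler fitted on all rows also sees the rows that are later held out for testing. A common alternative is to fit the scaler on the training split only and reuse its statistics for the other splits, which a scikit-learn pipeline does automatically. The sketch below uses synthetic data and hypothetical variable names; it is not the notebook's actual workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[10.0, 500.0, 0.5], scale=[2.0, 100.0, 0.1], size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] - 0.02 * X_demo[:, 1] + rng.normal(0, 1, 300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on X_tr only, then applies it wherever needed.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

scaler = pipe.named_steps["standardscaler"]
train_scaled = scaler.transform(X_tr)
test_scaled = scaler.transform(X_te)

print("Train means:", train_scaled.mean(axis=0))
print("Test means: ", test_scaled.mean(axis=0))  # close to, but not exactly, 0
print("Test R²:", pipe.score(X_te, y_te))
```

The test-set means are not exactly zero because the test rows are scaled with training statistics, which is the intended behavior.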

### **(4) Data Partitioning (70/30)**


In [83]:
# Split the dataset into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Further split the training set into training (80%) and validation (20%) sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Debug: Check the sizes of each dataset
print("Training set shape:", X_train.shape)
print("Validation set shape:", X_val.shape)
print("Test set shape:", X_test.shape)

Training set shape: (604, 8)
Validation set shape: (152, 8)
Test set shape: (324, 8)


In [84]:
# Check for missing values in the training and testing sets
if X_train.isnull().sum().any() or y_train.isnull().sum().any():
    print("There are missing values in the training set. Handling missing values now.")
    X_train.fillna(X_train.median(), inplace=True)
    y_train.fillna(y_train.median(), inplace=True)

if X_test.isnull().sum().any() or y_test.isnull().sum().any():
    print("There are missing values in the testing set. Handling missing values now.")
    X_test.fillna(X_test.median(), inplace=True)
    y_test.fillna(y_test.median(), inplace=True)

There are missing values in the training set. Handling missing values now.


In [85]:
print("Number of missing values in X_train:", X_train.isnull().sum().sum())
print("Number of missing values in y_train:", y_train.isnull().sum().sum())

Number of missing values in X_train: 0
Number of missing values in y_train: 0


## **3 Model Selection**

### **(1) Linear Regression**

In [86]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


In [87]:
model = LinearRegression()

# Train the model with the training set
model.fit(X_train, y_train)

In [88]:
# Print model coefficients
print("\nModel coefficients:", model.coef_)
print("Model intercept:", model.intercept_)


Model coefficients: [[-4.63708230e+00 -2.25143119e-01  2.56991234e-02 -3.40621470e-02
  -3.42191251e-02  7.85105129e-02  2.26538595e-02 -2.93843507e-03]]
Model intercept: [1064.68608172]


In [89]:
# Predict on the validation set
y_val_pred = model.predict(X_val)

# Calculate MSE and R² score
val_mse = mean_squared_error(y_val, y_val_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Print validation metrics
print("\nValidation Set Metrics:")
print(f"Mean Squared Error (MSE): {val_mse}")
print(f"R² Score: {val_r2}")

# Predict on the test set
y_test_pred = model.predict(X_test)

# Calculate MSE and R² score for the test set
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

# Print test set metrics
print("\nTest Set Metrics:")
print(f"Mean Squared Error (MSE): {test_mse}")
print(f"R² Score: {test_r2}")


Validation Set Metrics:
Mean Squared Error (MSE): 2785.911826562375
R² Score: 0.3960944983462573

Test Set Metrics:
Mean Squared Error (MSE): 2992.5656258006697
R² Score: 0.4426969846249328


From the output:

1. On the validation set the MSE is 2785.91 (an RMSE of about 52.8 IRSD points) and the R² score is 0.396, so the model explains only about 39.6% of the variance in IRSD.

2. On the test set the MSE is 2992.57 (an RMSE of about 54.7) and the R² score is 0.443. The R² is slightly higher than on the validation set even though the MSE is also higher, which can happen when the test targets vary more than the validation targets. Overall, the linear fit remains weak.
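The two metrics are linked by R² = 1 - MSE / Var(y), where Var(y) is the population variance of the targets in the evaluated set. The sketch below uses only the figures reported in the output above to recover the RMSE and the target variance implied for each split; the dictionary and variable names are illustrative.

```python
import math

# Metrics copied from the linear regression output above.
reported = {
    "validation": {"mse": 2785.911826562375, "r2": 0.3960944983462573},
    "test": {"mse": 2992.5656258006697, "r2": 0.4426969846249328},
}

summary = {}
for split, metrics in reported.items():
    rmse = math.sqrt(metrics["mse"])
    # Rearranging R² = 1 - MSE / Var(y) gives Var(y) = MSE / (1 - R²).
    implied_variance = metrics["mse"] / (1.0 - metrics["r2"])
    summary[split] = {"rmse": rmse, "implied_variance": implied_variance}
    print(f"{split}: RMSE = {rmse:.2f}, implied Var(y) = {implied_variance:.1f}")
```

A larger implied variance on the test split is consistent with that split showing both a higher MSE and a higher R².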

### **(2) Random Forest**

In [92]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV


In [93]:
# Initialize the Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

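# Fit the model. y_train is a single-column DataFrame (it was read back from
# CSV), so flatten it to 1-D to avoid scikit-learn's column-vector warning.
rf_model.fit(X_train, y_train.values.ravel())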


In [94]:
# Predict on the validation set
y_val_pred = rf_model.predict(X_val)

# Calculate evaluation metrics for validation set
val_mse = mean_squared_error(y_val, y_val_pred)
val_r2 = r2_score(y_val, y_val_pred)

print(f"Validation Set Metrics:\nMean Squared Error (MSE): {val_mse}\nR² Score: {val_r2}")

# Predict on the test set
y_test_pred = rf_model.predict(X_test)

# Calculate evaluation metrics for test set
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

print(f"Test Set Metrics:\nMean Squared Error (MSE): {test_mse}\nR² Score: {test_r2}")



Validation Set Metrics:
Mean Squared Error (MSE): 1547.9840709825817
R² Score: 0.6644416065054497
Test Set Metrics:
Mean Squared Error (MSE): 1565.052870328168
R² Score: 0.7085415015345136


Based on the output:

1. The MSE is 1547.98 on the validation set and 1565.05 on the test set, down from 2785.91 and 2992.57 for linear regression, a reduction of roughly 44% and 48% respectively.

2. The R² score rises to 0.6644 on the validation set and 0.7085 on the test set, up from 0.396 and 0.443, so the random forest explains substantially more of the variance in IRSD.

In [95]:
# Feature importance
feature_importances = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nFeature Importances:")
print(feature_importances)


Feature Importances:
                                         Feature  Importance
0           % dwellings which are public housing    0.270662
6                Holds degree or higher, persons    0.227155
1                           Distance to GPO (km)    0.168945
3         Equivalent household income <$600/week    0.138643
5  Aboriginal or Torres Strait Islander, persons    0.065303
7                             Population Density    0.054776
4              Poor English proficiency, persons    0.046985
2                       Public Housing Dwellings    0.027531


### **K-means Cluster**

## **Interpretation**