# **Assignment 2: Machine Learning**

## **1 Simple Correlation Analysis**

Feature Selection and Dataset Preparation:
* The earlier EDA suggested that population, income, education, and related factors may be important predictors of IRSD (Index of Relative Socio-economic Disadvantage, where a lower score indicates greater disadvantage).
* We select columns covering population characteristics, education, residential status, cultural background, and minority representation.
* We then use correlation analysis to measure the linear relationship between each selected feature and IRSD.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [17]:
df = pd.read_csv('Data/communities.csv')

columns_of_interest = [
    'IRSD (avg)', 
    'Equivalent household income <$600/week', 
    'Personal income <$400/week, persons', 
    'Unemployed, persons', 
    'Public Housing Dwellings',
    '% dwellings which are public housing',
    'Primary school students', 
    'Secondary school students', 
    'TAFE students', 
    'Holds degree or higher, persons', 
    'Did not complete year 12, persons',
    'Population Density',
    'Distance to GPO (km)',  
    'Dwellings with no motor vehicle',
    'Dwellings with no internet',
    'Born overseas, persons', 
    'Born in non-English speaking country, persons', 
    'Speaks LOTE at home, persons',
    'Aboriginal or Torres Strait Islander, persons',
    'Poor English proficiency, persons',
    '2012 ERP age 0-4, persons', 
    '2012 ERP age 5-9, persons', 
    '2012 ERP age 10-14, persons', 
    '2012 ERP age 15-19, persons', 
    '2012 ERP age 20-24, persons', 
    'Public hospital separations, 2012-13',
    'Travel time to nearest public hospital',
    'Primary Schools', 
    'Secondary Schools'
]

for col in columns_of_interest:
    if col not in df.columns:
        print(f"Column '{col}' does not exist in the data")

df_filtered = df[[col for col in columns_of_interest if col in df.columns]].copy()

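# '<5' marks a suppressed small count; substitute 5 so the column can be treated as numeric
df_filtered = df_filtered.replace('<5', 5)

# Coerce every column to numeric (unparseable entries become NaN),
# then fill the remaining gaps with each column's median
df_filtered = df_filtered.apply(pd.to_numeric, errors='coerce')
df_filtered = df_filtered.fillna(df_filtered.median())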

correlation_matrix = df_filtered.corr()

irsd_correlations = correlation_matrix['IRSD (avg)'].sort_values(ascending=False)

pd.set_option('display.max_rows', None)  # show all rows
print(irsd_correlations)

IRSD (avg)                                       1.000000
Holds degree or higher, persons                  0.238098
Population Density                               0.152509
Primary school students                          0.087607
Secondary school students                        0.081053
2012 ERP age 10-14, persons                      0.060856
2012 ERP age 5-9, persons                        0.060523
2012 ERP age 15-19, persons                      0.052011
2012 ERP age 20-24, persons                      0.039232
2012 ERP age 0-4, persons                        0.023887
Born overseas, persons                          -0.008502
Travel time to nearest public hospital          -0.019597
TAFE students                                   -0.034866
Personal income <$400/week, persons             -0.047917
Primary Schools                                 -0.051335
Born in non-English speaking country, persons   -0.053694
Dwellings with no motor vehicle                 -0.058276
Did not comple

| Feature | Correlation with IRSD | Analysis |
|---|---|---|
| % dwellings which are public housing | -0.429 | The strongest (moderate) correlation in the set: a higher share of public housing goes with a lower IRSD score, i.e. greater disadvantage. |
| Distance to GPO (km) | -0.332 | Communities farther from the city center tend to score lower, possibly because remote areas have less access to resources and opportunities. |
| Public Housing Dwellings | -0.284 | A larger number of public housing dwellings suggests greater reliance on public welfare, which is associated with socio-economic disadvantage. |
| Aboriginal or Torres Strait Islander, persons | -0.193 | Areas with larger Indigenous populations tend to score lower, which points to systemic challenges; the correlation is weak. |
| Dwellings with no internet | -0.174 | Lack of internet access reflects digital exclusion, which is weakly linked to socio-economic disadvantage here. |
| Poor English proficiency, persons | -0.170 | Poor English proficiency can limit access to education and employment; the correlation with IRSD is weak. |
| Equivalent household income <$600/week | -0.146 | A larger count of low-income households indicates economic hardship; the correlation with IRSD is weak. |
| Holds degree or higher, persons | 0.238 | Higher educational attainment goes with a higher IRSD score, i.e. less disadvantage; this is the largest positive correlation in the set. |
| Population Density | 0.153 | Denser areas tend to score slightly higher, possibly because resources and opportunities are concentrated there. |
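
Because the printed series above is sorted by signed value, the strongest negative correlations end up at the bottom of the output. A minimal sketch of ranking by absolute correlation instead is shown below. The helper name `rank_by_abs_correlation` and the toy columns are illustrative assumptions, not part of the original notebook or the communities dataset.

```python
import numpy as np
import pandas as pd


def rank_by_abs_correlation(frame: pd.DataFrame, target: str) -> pd.DataFrame:
    """Rank every non-target column by |Pearson r| with the target column."""
    corr = frame.corr()[target].drop(target)
    ranked = pd.DataFrame({"r": corr, "abs_r": corr.abs()})
    return ranked.sort_values("abs_r", ascending=False)


# Synthetic stand-in data (hypothetical column names)
rng = np.random.default_rng(1)
score = rng.normal(size=300)
toy = pd.DataFrame({
    "score": score,
    "strong_negative": -score + rng.normal(scale=0.5, size=300),
    "weak_positive": 0.3 * score + rng.normal(size=300),
    "noise": rng.normal(size=300),
})

ranking = rank_by_abs_correlation(toy, "score")
print(ranking)
```

Applied to the notebook's data, the call would presumably be `rank_by_abs_correlation(df_filtered, 'IRSD (avg)')`, assuming all columns have already been converted to numeric.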

In [18]:
columns_of_ml = [
    'IRSD (avg)', 
    '% dwellings which are public housing', 
    'Distance to GPO (km)', 
    'Public Housing Dwellings', 
    'Equivalent household income <$600/week',
    'Poor English proficiency, persons',
    'Dwellings with no internet',
    'Aboriginal or Torres Strait Islander, persons',
    'Holds degree or higher, persons',
    'Population Density'
]

for col in columns_of_ml:
    if col not in df.columns:
        print(f"Column '{col}' does not exist in the data")


In [19]:
df_filtered = df[[col for col in columns_of_ml if col in df.columns]].copy()

X = df_filtered.drop(columns=['IRSD (avg)'])
y = df_filtered['IRSD (avg)']


In [20]:
X.head() # Features 


| | % dwellings which are public housing | Distance to GPO (km) | Public Housing Dwellings | Equivalent household income <$600/week | Poor English proficiency, persons | Dwellings with no internet | Aboriginal or Torres Strait Islander, persons | Holds degree or higher, persons | Population Density |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.815789 | 4.264157 | 87 | 400 | 305 | 246 | 16 | 1784 | 3082.440714 |
| 1 | 4.684173 | 9.881527 | 66 | 265 | 45 | 193 | 12 | 877 | 2426.66545 |
| 2 | NaN | 134.213743 | <5 | 31 | <5 | 17 | 8 | 73 | 0.841522 |
| 3 | 1.226598 | 124.859887 | 19 | 137 | <5 | 68 | 8 | 245 | 213.059443 |
| 4 | NaN | 14.758418 | <5 | 121 | 14 | 64 | 7 | 84 | 210.819042 |


In [21]:
y.head() # Label: IRSD(avg)

0    1054.014288
1    1087.153516
2    1061.326811
3    1056.031657
4    1033.615698
Name: IRSD (avg), dtype: float64

In [22]:
# Save the filtered features and target data

X.to_csv('Data/features_X.csv', index=False)
y.to_csv('Data/target_y.csv', index=False)

## **2 Data Cleaning and Preprocessing**

* Ensure all missing or invalid values are handled.
* Perform data normalization if required, especially if we plan to use algorithms like linear regression or KNN that are sensitive to data scales.
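
To illustrate why scale matters for distance-based methods such as KNN, the sketch below compares Euclidean distances before and after z-score standardization. The three rows are made-up values (loosely styled as distance in km and population density), not records from the dataset.

```python
import numpy as np

# Hypothetical rows: [distance in km, population density]; illustrative values only
points = np.array([
    [5.0, 3000.0],
    [120.0, 2990.0],
    [6.0, 200.0],
])


def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    """Plain Euclidean distance between two 1-D vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


# On the raw scale, the density column dominates the distance
raw_d01 = euclidean(points[0], points[1])
raw_d02 = euclidean(points[0], points[2])

# z-score each column: subtract the column mean, divide by the column std
z = (points - points.mean(axis=0)) / points.std(axis=0)
z_d01 = euclidean(z[0], z[1])
z_d02 = euclidean(z[0], z[2])

print(f"raw:    d(0,1)={raw_d01:.2f}  d(0,2)={raw_d02:.2f}")
print(f"scaled: d(0,1)={z_d01:.2f}  d(0,2)={z_d02:.2f}")
```

On the raw scale, row 2 looks far more distant from row 0 than row 1 does, purely because density is measured in larger numbers. After standardization both features contribute comparably.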

### **(1) Check for missing values in the dataset**

In [23]:
# Load the filtered features and target data
X = pd.read_csv('Data/features_X.csv')
y = pd.read_csv('Data/target_y.csv')



In [28]:
# Check for missing values (execution count 28: this cell was run after the imputation in cell 27, so all counts are zero)
print("\nChecking for missing values in the dataset:")
print(X.isnull().sum())



Checking for missing values in the dataset:
% dwellings which are public housing             0
Distance to GPO (km)                             0
Public Housing Dwellings                         0
Equivalent household income <$600/week           0
Poor English proficiency, persons                0
Dwellings with no internet                       0
Aboriginal or Torres Strait Islander, persons    0
Holds degree or higher, persons                  0
Population Density                               0
dtype: int64


In [25]:
# Convert non-numeric values like '<5' to numeric 
X.replace('<5', 5, inplace=True)

# Convert all columns to numeric if possible
for col in X.columns:
    X[col] = pd.to_numeric(X[col], errors='coerce')

In [27]:
# Check for missing values in the dataset
if X.isnull().sum().any() or y.isnull().sum().any():
    print("There are missing values in the dataset. Handling missing values now.")
    # Fill missing values with median for numerical features.
    # Assign the result back instead of using inplace=True on a column selection,
    # which operates on a copy and triggers a pandas FutureWarning.
    for col in X.select_dtypes(include=['number']).columns:
        X[col] = X[col].fillna(X[col].median())
    # Fill missing values with mode for categorical features (if any)
    for col in X.select_dtypes(include=['object']).columns:
        X[col] = X[col].fillna(X[col].mode()[0])
    # Note: y is included in the check above but is not imputed here

There are missing values in the dataset. Handling missing values now.

### **(2) Removing Collinearity**


In [29]:
# Calculate correlation matrix to identify collinear features
correlation_matrix = X.corr().abs()

# Select the upper triangle of the correlation matrix
upper_triangle = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))

# Debug: Print the upper triangle of the correlation matrix
print("\nUpper Triangle of Correlation Matrix:")
upper_triangle




Upper Triangle of Correlation Matrix:


| | % dwellings which are public housing | Distance to GPO (km) | Public Housing Dwellings | Equivalent household income <$600/week | Poor English proficiency, persons | Dwellings with no internet | Aboriginal or Torres Strait Islander, persons | Holds degree or higher, persons | Population Density |
|---|---|---|---|---|---|---|---|---|---|
| % dwellings which are public housing | NaN | 0.053928 | 0.639015 | 0.184397 | 0.137929 | 0.202796 | 0.222425 | 0.017475 | 0.187085 |
| Distance to GPO (km) | NaN | NaN | 0.123699 | 0.195515 | 0.28056 | 0.160891 | 0.090628 | 0.370361 | 0.497407 |
| Public Housing Dwellings | NaN | NaN | NaN | 0.701377 | 0.475657 | 0.72315 | 0.631942 | 0.406994 | 0.328614 |
| Equivalent household income <$600/week | NaN | NaN | NaN | NaN | 0.709704 | 0.975595 | 0.674283 | 0.663377 | 0.326698 |
| Poor English proficiency, persons | NaN | NaN | NaN | NaN | NaN | 0.677571 | 0.269453 | 0.544159 | 0.376646 |
| Dwellings with no internet | NaN | NaN | NaN | NaN | NaN | NaN | 0.704293 | 0.614944 | 0.304507 |
| Aboriginal or Torres Strait Islander, persons | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.229075 | 0.026501 |
| Holds degree or higher, persons | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.59383 |
| Population Density | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


In [30]:
# Find features with correlation greater than 0.9 and remove them
collinear_features = [column for column in upper_triangle.columns if any(upper_triangle[column] > 0.9)]
print(f"\nCollinear features to be removed: {collinear_features}")

# Drop collinear features
X = X.drop(columns=collinear_features)

# Debug: Check the remaining features
print("\nRemaining features after removing collinear features:")
print(X.columns)


Collinear features to be removed: ['Dwellings with no internet']

Remaining features after removing collinear features:
Index(['% dwellings which are public housing', 'Distance to GPO (km)',
       'Public Housing Dwellings', 'Equivalent household income <$600/week',
       'Poor English proficiency, persons',
       'Aboriginal or Torres Strait Islander, persons',
       'Holds degree or higher, persons', 'Population Density'],
      dtype='object')


Based on the output:

*Dwellings with no internet* was removed because its absolute correlation with *Equivalent household income <$600/week* is 0.976, above the 0.9 threshold, so the two features carry largely redundant information. Keeping both would introduce multicollinearity, which makes the coefficients of a linear model unstable and hard to interpret.
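
The removal step prints only the dropped column, not the feature it is collinear with. A minimal sketch of listing the offending pairs is shown below. The helper `collinear_pairs` and the toy columns are hypothetical names introduced for illustration; the 0.9 threshold is the one used in the notebook.

```python
import numpy as np
import pandas as pd


def collinear_pairs(features: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (if any survive stacking) compare False and are filtered out
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic stand-in data: 'a_copy' is a noisy duplicate of 'a'
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.05, size=200),
    "b": rng.normal(size=200),
})

pairs = collinear_pairs(toy)
print(pairs)
```

Calling `collinear_pairs(X)` before the drop would be expected to report the income and internet pair identified above, which makes it easier to decide which of the two features to keep.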

### **(3) Standardizing**

In [34]:
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


In [36]:
# Debug: Check the mean and standard deviation after scaling
print("\nTraining set mean after scaling:", X_scaled.mean(axis=0))
print("Training set standard deviation after scaling:", X_scaled.std(axis=0))



Training set mean after scaling: [ 1.97372982e-17  9.21073917e-17  2.30268479e-17 -6.00342821e-17
 -1.64477485e-18 -3.28954970e-18  1.97372982e-17 -1.18423789e-16]
Training set standard deviation after scaling: [1. 1. 1. 1. 1. 1. 1. 1.]


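The means are numerically zero (on the order of 1e-17) and the standard deviations are 1, which confirms that the features are standardized. Despite the "Training set" label in the printout, the scaler here is fitted on the full dataset before any split, so these statistics describe all rows. Note also that the split in the next step operates on the unscaled `X`, so `X_scaled` is not used downstream.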

### **(4) Data Partitioning (70/30)**


In [39]:
# Split the dataset into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Further split the training set into training (80%) and validation (20%) sets,
# i.e. roughly 56% / 14% / 30% of all rows for train / validation / test
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Debug: Check the sizes of each dataset
print("Training set shape:", X_train.shape)
print("Validation set shape:", X_val.shape)
print("Test set shape:", X_test.shape)

Training set shape: (604, 8)
Validation set shape: (152, 8)
Test set shape: (324, 8)
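
Because the scaler in step (3) was fitted on all rows and the split above uses the unscaled `X`, one common alternative is to split first and fit the scaler on the training rows only, so that validation and test statistics do not leak into preprocessing. The sketch below assumes that ordering and uses synthetic data of the same shape (1080 rows, 8 features); the `_demo` and `_s` variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the notebook's feature matrix
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1080, 8))
y_demo = rng.normal(size=1080)

# Same two-stage split as above: 70/30, then 80/20 of the training part
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=0.2, random_state=42)

# Fit on the training rows only, then reuse the fitted parameters elsewhere
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_va_s = scaler.transform(X_va)
X_te_s = scaler.transform(X_te)

print("Train / validation / test shapes:", X_tr_s.shape, X_va_s.shape, X_te_s.shape)
print("Train means after scaling:", X_tr_s.mean(axis=0).round(6))
print("Validation means after scaling:", X_va_s.mean(axis=0).round(3))
```

With this ordering only the training set has exactly zero mean and unit variance. The validation and test sets are close to, but not exactly, standardized, which is expected.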


## **3 Model Selection**

### **Linear Regression**

### **Random Forest**

### **K-means Clustering**

## **4 Model Training**
Train each model on the training dataset and use cross-validation to check that its performance is stable across different subsets of the data.
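
A minimal sketch of this step for the two supervised models named above is shown below, assuming 5-fold cross-validation scored by R². The data is synthetic, and the fold count, metric, and hyperparameters are assumptions rather than choices made in the notebook. K-means is left out because it is unsupervised and has no prediction score against IRSD.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the training split
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
coef = rng.normal(size=8)
y_demo = X_demo @ coef + rng.normal(scale=0.5, size=200)

models = {
    # The pipeline refits the scaler inside each fold, avoiding leakage
    "linear_regression": make_pipeline(StandardScaler(), LinearRegression()),
    # Tree ensembles do not need feature scaling
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    cv_scores[name] = scores
    print(f"{name}: mean R^2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

A large spread between folds would suggest that the model's performance depends heavily on which rows it sees, which is the robustness concern raised above.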

## **5 Model Evaluation**

## **6 Interpretation**