# **Assignment 2: Machine Learning**

## **1 Simple Correlation Analysis**

Feature selection and dataset preparation:
* The earlier EDA identified population, income, education, and related factors as likely predictors of IRSD.
* We select columns covering population characteristics, education, residential status, cultural background, and minority representation.
* Next, we use correlation analysis to measure the relationship between each selected feature and IRSD.

In [3]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_csv('Data/communities.csv')

columns_of_interest = [
    'IRSD (avg)', 
    'Equivalent household income <$600/week', 
    'Personal income <$400/week, persons', 
    'Unemployed, persons', 
    'Public Housing Dwellings',
    '% dwellings which are public housing',
    'Primary school students', 
    'Secondary school students', 
    'TAFE students', 
    'Holds degree or higher, persons', 
    'Did not complete year 12, persons',
    'Population Density',
    'Distance to GPO (km)',  
    'Dwellings with no motor vehicle',
    'Dwellings with no internet',
    'Born overseas, persons', 
    'Born in non-English speaking country, persons', 
    'Speaks LOTE at home, persons',
    'Aboriginal or Torres Strait Islander, persons',
    'Poor English proficiency, persons',
    '2012 ERP age 0-4, persons', 
    '2012 ERP age 5-9, persons', 
    '2012 ERP age 10-14, persons', 
    '2012 ERP age 15-19, persons', 
    '2012 ERP age 20-24, persons', 
    'Public hospital separations, 2012-13',
    'Travel time to nearest public hospital',
    'Primary Schools', 
    'Secondary Schools'
]

for col in columns_of_interest:
    if col not in df.columns:
        print(f"Column '{col}' does not exist in the data")

df_filtered = df[[col for col in columns_of_interest if col in df.columns]].copy()

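# '<5' marks a suppressed small count; replace it with 5, then coerce every
# column to a numeric dtype so that median() and corr() include all columns.
df_filtered = df_filtered.replace('<5', 5).apply(pd.to_numeric, errors='coerce')

# Fill any remaining missing values with the column median.
df_filtered = df_filtered.fillna(df_filtered.median())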

correlation_matrix = df_filtered.corr()

irsd_correlations = correlation_matrix['IRSD (avg)'].sort_values(ascending=False)

pd.set_option('display.max_rows', None)  # show all rows
print(irsd_correlations)

IRSD (avg)                                       1.000000
Holds degree or higher, persons                  0.238098
Population Density                               0.152509
Primary school students                          0.087607
Secondary school students                        0.081053
2012 ERP age 10-14, persons                      0.060856
2012 ERP age 5-9, persons                        0.060523
2012 ERP age 15-19, persons                      0.052011
2012 ERP age 20-24, persons                      0.039232
2012 ERP age 0-4, persons                        0.023887
Born overseas, persons                          -0.008502
Travel time to nearest public hospital          -0.019597
TAFE students                                   -0.034866
Personal income <$400/week, persons             -0.047917
Primary Schools                                 -0.051335
Born in non-English speaking country, persons   -0.053694
Dwellings with no motor vehicle                 -0.058276
Did not comple

| Feature | Pearson r with IRSD | Interpretation |
|---------|---------------------|----------------|
| % dwellings which are public housing | -0.429 | Strongest relationship in the set (moderate): a higher share of public housing goes with a lower IRSD score, i.e. greater socio-economic disadvantage. |
| Distance to GPO (km) | -0.332 | Moderate: communities farther from the city center tend to score lower, possibly because remote areas have less access to resources and opportunities. |
| Public Housing Dwellings | -0.284 | Weak to moderate: more public housing dwellings go with a lower IRSD score, consistent with greater reliance on public welfare. |
| Aboriginal or Torres Strait Islander, persons | -0.193 | Weak: areas with larger Indigenous populations tend to score lower, which may reflect systemic challenges. |
| Dwellings with no internet | -0.174 | Weak: lack of internet access reflects digital exclusion, which is often linked to socio-economic disadvantage. |
| Poor English proficiency, persons | -0.170 | Weak: poor English proficiency can limit access to education and employment, and is associated with a lower IRSD score. |
| Equivalent household income <$600/week | -0.146 | Weak: more low-income households indicate economic hardship and go with a lower IRSD score. |
| Holds degree or higher, persons | 0.238 | Weak to moderate positive: higher educational attainment goes with a higher IRSD score, i.e. less disadvantage. |
| Population Density | 0.153 | Weak positive: denser areas tend to score higher, possibly because resources and opportunities are concentrated there. |
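The shortlist in the table can be reproduced by ranking features on the absolute value of their correlation with the target. The sketch below is only illustrative: it uses synthetic data rather than `communities.csv`, and the helper `rank_by_abs_correlation`, the column names, and the `min_abs_r=0.1` cut-off are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_filtered; column names and values are made up.
rng = np.random.default_rng(0)
n = 200
target = rng.normal(size=n)
demo = pd.DataFrame({
    "target": target,
    "feat_pos": target + rng.normal(scale=0.5, size=n),   # strong positive link
    "feat_neg": -target + rng.normal(scale=2.0, size=n),  # weaker negative link
    "feat_noise": rng.normal(size=n),                      # unrelated to target
})


def rank_by_abs_correlation(frame, target_col, min_abs_r=0.1):
    """Pearson r of each column against target_col, sorted by |r| descending.

    Columns with |r| below min_abs_r (a hypothetical cut-off) are dropped.
    """
    r = frame.corr()[target_col].drop(target_col)
    r = r[r.abs() >= min_abs_r]
    order = r.abs().sort_values(ascending=False).index
    return r.reindex(order)


ranked = rank_by_abs_correlation(demo, "target")
print(ranked)
```

Sorting on the absolute value keeps strong negative predictors next to strong positive ones, which a plain descending sort separates.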

## **Data Cleaning and Preprocessing**

* Handle all missing or invalid values.
* Normalize the features if required. Distance-based algorithms such as KNN and K-means are sensitive to feature scale; for linear regression, scaling mainly makes the coefficients comparable across features.
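One common way to normalize is z-score standardization. The following is a minimal sketch on made-up values; the `standardize` helper and its `reference` parameter are hypothetical names. Passing the training set as `reference` lets the test set be scaled with training statistics only.

```python
import pandas as pd

# Made-up values on very different scales (illustration only).
demo = pd.DataFrame({
    "population_density": [120.0, 3500.0, 800.0, 45.0, 2100.0],
    "distance_km": [5.0, 2.0, 40.0, 150.0, 12.0],
})


def standardize(frame, reference=None):
    """Z-score each column using the mean and std of `reference`.

    `reference` defaults to `frame` itself; pass the training set here
    when scaling a test set.
    """
    reference = frame if reference is None else reference
    mean = reference.mean()
    # Constant columns have std 0; use 1.0 to avoid division by zero.
    std = reference.std(ddof=0).replace(0, 1.0)
    return (frame - mean) / std


scaled = standardize(demo)
print(scaled.round(3))
```

After scaling, each column has mean 0 and unit variance, so no single feature dominates a distance calculation because of its units.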

## **Feature Engineering**

## **Data Partitioning (70/30)**
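A 70/30 split could be done with scikit-learn's `train_test_split`. The sketch below assumes scikit-learn is available and uses synthetic data; the column names and the `random_state` value are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared dataset (made-up columns and values).
rng = np.random.default_rng(42)
demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
demo["target"] = 2.0 * demo["f1"] - demo["f2"] + rng.normal(scale=0.1, size=100)

X = demo.drop(columns="target")
y = demo["target"]

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle
# so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```

Any scaling or imputation statistics are best computed on `X_train` only and then applied to `X_test`, so the held-out data does not leak into preprocessing.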


## **Model Selection**

### **Linear Regression**

### **Random Forest**

### **K-means Clustering**

## **Model Training**
Train each model on the training dataset and validate it with cross-validation to check that its performance is robust.
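As a rough sketch of this step, the two supervised models can be compared with 5-fold cross-validation using scikit-learn's `cross_val_score`. The data is synthetic, and the fold count, the `r2` scoring choice, and the hyperparameters are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic training data (made up): the target depends linearly on two features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(150, 4))
y_train = 3.0 * X_train[:, 0] - 2.0 * X_train[:, 1] + rng.normal(scale=0.5, size=150)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Shuffled 5-fold split, reused for both models so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="r2")
    cv_results[name] = scores
    print(f"{name}: mean R^2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

K-means is unsupervised and has no target to score against, so it is left out of this comparison; it would need a separate measure such as the silhouette score.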

## **Model Evaluation**

## **Interpretation**