# Clone GitHub repo into Colab

In [None]:
# Clone the public GitHub repo into Colab
!git clone https://github.com/Nixis/geochem-orebody-proximity-prediction.git

# Change into the repo folder
%cd geochem-orebody-proximity-prediction



Cloning into 'geochem-orebody-proximity-prediction'...
remote: Enumerating objects: 69, done.
remote: Counting objects: 100% (69/69), done.
remote: Compressing objects: 100% (61/61), done.
remote: Total 69 (delta 22), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (69/69), 235.00 KiB | 3.98 MiB/s, done.
Resolving deltas: 100% (22/22), done.
/content/geochem-orebody-proximity-prediction/geochem-orebody-proximity-prediction


# Step 1 – Load Data

In [None]:
import pandas as pd

raw_data_path = 'data/raw/data_for_distribution.csv'
df = pd.read_csv(raw_data_path)

print("Dataset shape:", df.shape)
df.head()


Dataset shape: (4771, 13)


Unnamed: 0,Unique_ID,holeid,from,to,As,Au,Pb,Fe,Mo,Cu,S,Zn,Class
0,A04812,SOLVE003,561,571.0,,0.066,1031.0,61380.0,138.2,3.6,3586.0,43.6,A
1,A03356,SOLVE003,571,581.0,,0.152,1982.0,50860.0,75.4,4.8,1822.0,36.4,A
2,A04764,SOLVE003,581,591.0,,0.068,1064.8,57940.0,29.2,3.0,740.4,36.6,A
3,A04626,SOLVE003,591,601.0,,0.074,891.6,48620.0,63.0,4.2,820.8,39.6,A
4,A05579,SOLVE003,601,611.0,,0.043125,801.25,51025.0,56.0625,4.875,745.6875,32.3125,A


# Step 2 – QAQC Data

# 2a - Count missing values

In [None]:
# Count missing values in each column
df.isna().sum()


Unnamed: 0,0
Unique_ID,0
holeid,0
from,0
to,0
As,1503
Au,6
Pb,15
Fe,62
Mo,30
Cu,25


**Observations:**
- Arsenic (As) is missing in 1503 of 4771 rows (about 31.5%) → consider median imputation.
- Each of the other elements is missing in fewer than 70 rows → those rows can be dropped or imputed.
- Key identifiers (`Unique_ID`, `holeid`, `from`, `to`) and `Class` have no missing values.
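
Expressing the counts as percentages makes the drop-or-impute decision easier to justify. The sketch below does this on a small made-up frame; the names `toy` and `missing_summary` and all values are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the assay table (values are made up).
toy = pd.DataFrame({
    "As": [5.4, np.nan, np.nan, 20.0],
    "Pb": [132.2, 396.7, np.nan, 940.2],
    "Class": ["A", "A", "B", "B"],
})

# isna().sum() counts missing entries; isna().mean() gives the missing fraction.
missing_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing_summary)
```

Applied to `df`, the same two lines would put As at the top of the table, well ahead of every other column.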

# 2b - Count invalid values

In [None]:
# Count -999 entries per column
(df == -999).sum()


Unnamed: 0,0
Unique_ID,0
holeid,0
from,0
to,0
As,0
Au,0
Pb,0
Fe,0
Mo,28
Cu,0


**Observations:**
- Only **Mo (Molybdenum)** contains `-999` values (28 entries).
- No other column shown contains `-999`, which suggests any earlier placeholder replacement was successful.
- The remaining Mo placeholders should be replaced with `NaN` and then imputed or dropped before further analysis.
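
One way to handle such placeholders is to declare them at load time, so they never reach the DataFrame as numbers. The sketch below uses the `na_values` parameter of `pd.read_csv` on a hypothetical in-memory CSV; the notebook itself replaces the values after loading, so this is an alternative and not the author's method.

```python
import io
import pandas as pd

# Hypothetical two-column CSV; -999 marks an invalid assay.
csv_text = "holeid,Mo\nH1,1.4\nH2,-999\nH3,17.4\n"

# na_values tells read_csv to treat -999 as missing while parsing.
toy = pd.read_csv(io.StringIO(csv_text), na_values=[-999])

n_missing = int(toy["Mo"].isna().sum())
n_placeholder = int((toy["Mo"] == -999).sum())
print(n_missing, n_placeholder)
```

The trade-off is that the placeholder count is no longer visible after loading, so a QAQC step that reports it, as above, has to run on the raw file.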


# 2c - Non-Numeric values check

In [None]:
# Check which columns have non-numeric entries without converting
for col in df.columns:
    non_numeric_count = df[col].apply(lambda x: not isinstance(x, (int, float, type(None)))).sum()
    if non_numeric_count > 0:
        print(f"Column '{col}' has {non_numeric_count} non-numeric values")


Column 'Unique_ID' has 4771 non-numeric values
Column 'holeid' has 4771 non-numeric values
Column 'Au' has 4765 non-numeric values
Column 'Class' has 4771 non-numeric values


**Observations:**
- Columns 'Unique_ID', 'holeid', and 'Class' are categorical / identifier columns → expected to be non-numeric.
- Column 'Au' has 4765 non-numeric entries, i.e. every non-missing value is stored as a string → it should be numeric, so it needs investigation and conversion before analysis.
- The remaining columns report no non-numeric entries in this check; any text-like entries found later should be cleaned before modeling.
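
The check above flags every string, including strings that would parse cleanly as numbers. To narrow the investigation to the entries that actually block conversion, one option is to compare the column before and after `pd.to_numeric(..., errors="coerce")`. The values below are made up (the source does not show what the problematic Au entries look like), and `au`, `parsed` and `bad` are illustrative names.

```python
import pandas as pd

# Made-up values imitating an assay column that was read as text.
au = pd.Series(["0.066", "0,068", "<0.01", None], name="Au")

# Coerce to numeric: anything that does not parse becomes NaN.
parsed = pd.to_numeric(au, errors="coerce")

# Entries that are present in the raw column but failed to parse.
bad = au[au.notna() & parsed.isna()]

print(bad.tolist())
```

Inspecting `bad.value_counts()` on the real column would show whether the problem is decimal commas, detection-limit markers, or something else.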

# 2d - General info about Raw Data

In [None]:
# Get general info about dataset
df.info()

# Show basic statistics
df.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4771 entries, 0 to 4770
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Unique_ID  4771 non-null   object 
 1   holeid     4771 non-null   object 
 2   from       4771 non-null   int64  
 3   to         4771 non-null   float64
 4   As         3268 non-null   float64
 5   Au         4765 non-null   object 
 6   Pb         4756 non-null   float64
 7   Fe         4709 non-null   float64
 8   Mo         4741 non-null   float64
 9   Cu         4746 non-null   float64
 10  S          4761 non-null   float64
 11  Zn         4762 non-null   float64
 12  Class      4771 non-null   object 
dtypes: float64(8), int64(1), object(4)
memory usage: 484.7+ KB


Unnamed: 0,from,to,As,Pb,Fe,Mo,Cu,S,Zn
count,4771.0,4771.0,3268.0,4756.0,4709.0,4741.0,4746.0,4761.0,4762.0
mean,750.379585,760.353574,19.730855,689.831232,49952.514598,9.991452,12.450601,9750.033213,59.389636
std,447.126995,447.114592,37.181529,1047.642566,21490.606419,87.098943,107.438873,15557.657335,120.489477
min,71.0,81.0,1.0,1.6,2080.0,-999.0,1.0,26.0,5.6
25%,421.0,431.0,5.4,132.2,39260.0,1.4,3.0,1338.0,29.8
50%,641.0,651.0,9.2,396.7,49020.0,4.4,4.6,3636.0,38.2
75%,991.0,1001.0,20.0,940.2,58420.0,17.4,8.0,10988.0,52.6
max,2201.0,2211.0,827.8,29793.8,397000.0,1939.4,6767.0,217600.0,3455.0


# 2e - QAQC Observations

- Arsenic (As) has a few extreme values (max=827.8 vs median=9.2), likely outliers.
- Molybdenum (Mo) still contains -999 placeholders (min=-999.0) that need replacement.
- Other elements (Cu, Zn, Mo) have very high maxima relative to their medians, worth noting for scaling or visualization.
- As column has 1503 missing values → median imputation recommended.
- The column `Au` is missing from the `describe()` table because it is stored as text (`object` dtype): 4765 string entries out of 4771 samples. It should be numeric (float) for analysis.
- Columns 'from', 'to', 'Unique_ID', 'holeid', 'Class' have no missing values.
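
To put a number on "a few extreme values", a common rule of thumb is Tukey's upper fence, Q3 + 1.5 × IQR. The notebook does not use this rule; it is shown here only as one possible way to count candidate outliers, on made-up As values.

```python
import pandas as pd

# Made-up As values with one extreme reading.
as_ppm = pd.Series([5.4, 9.2, 7.1, 20.0, 12.3, 827.8], name="As")

# Quartiles and the upper Tukey fence.
q1, q3 = as_ppm.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Values above the fence are flagged, not removed.
outliers = as_ppm[as_ppm > upper_fence]
print(outliers.tolist())  # → [827.8]
```

Geochemical concentrations are often strongly right-skewed, so a flagged value may be a genuine high-grade interval and not an error; flagging for review is safer than dropping.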



# Step 3 – Clean Data

In [None]:
import numpy as np
import pandas as pd

# Step 1: Replace invalid detection-limit placeholders (-999) with NaN
# np.nan keeps the numeric columns as float64
df.replace(-999, np.nan, inplace=True)

# Step 2: Clean Au column
# Strip spaces, replace decimal commas, convert to numeric, impute median
df['Au'] = df['Au'].apply(lambda x: x.strip() if isinstance(x, str) else x)
df['Au'] = df['Au'].apply(lambda x: x.replace(',', '.') if isinstance(x, str) else x)
df['Au'] = pd.to_numeric(df['Au'], errors='coerce')
# Assign the result back; fillna(..., inplace=True) on a column selection
# is deprecated and will stop working in pandas 3.0
df['Au'] = df['Au'].fillna(df['Au'].median())

# Step 3: Impute As (large number of missing values) with median
df['As'] = df['As'].fillna(df['As'].median())

# Step 4: Drop rows with remaining NaNs in other columns
df_clean = df.dropna()

# Step 5: Save cleaned dataset to processed folder
df_clean.to_csv('data/processed/cleaned_data.csv', index=False)

# Step 6: Quick check
print("Original shape:", df.shape)
print("Cleaned shape:", df_clean.shape)
print("Missing values per column after cleaning:")
print(df_clean.isna().sum())


Original shape: (4771, 13)
Cleaned shape: (4619, 13)
Missing values per column after cleaning:
Unique_ID    0
holeid       0
from         0
to           0
As           0
Au           0
Pb           0
Fe           0
Mo           0
Cu           0
S            0
Zn           0
Class        0
dtype: int64




# 3a - Read cleaned data


In [None]:
import pandas as pd

df_cleaned = pd.read_csv('data/processed/cleaned_data.csv')
df_cleaned.head()


Unnamed: 0,Unique_ID,holeid,from,to,As,Au,Pb,Fe,Mo,Cu,S,Zn,Class
0,A04812,SOLVE003,561,571.0,9.2,0.066,1031.0,61380.0,138.2,3.6,3586.0,43.6,A
1,A03356,SOLVE003,571,581.0,9.2,0.152,1982.0,50860.0,75.4,4.8,1822.0,36.4,A
2,A04764,SOLVE003,581,591.0,9.2,0.068,1064.8,57940.0,29.2,3.0,740.4,36.6,A
3,A04626,SOLVE003,591,601.0,9.2,0.074,891.6,48620.0,63.0,4.2,820.8,39.6,A
4,A05579,SOLVE003,601,611.0,9.2,0.043125,801.25,51025.0,56.0625,4.875,745.6875,32.3125,A


# 3b - Export cleaned data

In [None]:
from google.colab import files

files.download('data/processed/cleaned_data.csv')



# 3c - Data Cleaning Observations
**Actions Taken:**

1. **Replaced invalid placeholders:**  
   - All `-999` values in the dataset were replaced with `NaN` to indicate missing/invalid data.

2. **Au column cleaning:**  
   - Stripped extra spaces from entries.  
   - Replaced commas with decimal points where needed.  
   - Converted all values to numeric, coercing any remaining invalid entries to `NaN`.  
   - Imputed missing values with the **median** of the Au column.

3. **As column handling:**  
   - Due to a large number of missing values (1503), all `NaN` entries were imputed with the **median**.

4. **Other columns:**  
   - No imputation was applied to the columns with few missing values (Pb, Fe, Mo, Cu, S, Zn).  
   - Rows with a missing value in any of these columns were dropped (152 rows, 4771 → 4619), leaving a dataset with no missing values.

5. **Result:**  
   - Cleaned dataset saved to `data/processed/cleaned_data.csv`.  
   - All numeric columns are now properly formatted.  
   - Ready for exploratory data analysis and predictive modeling.
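
The Au steps in item 2 can also be written with pandas' vectorized string methods. This is a sketch of an equivalent formulation, not the code the notebook ran. It assumes every non-missing Au entry is a string, as the check in 2c suggests; the sample values and the names `au_raw`, `au_clean` and `au_filled` are made up.

```python
import pandas as pd

# Made-up raw values: padded strings, a decimal comma, an unparseable entry, a gap.
au_raw = pd.Series(["0.066", " 0.152", "0,068 ", "n/a", None], name="Au")

# Strip whitespace, swap decimal commas for points, then coerce to float.
au_clean = pd.to_numeric(
    au_raw.str.strip().str.replace(",", ".", regex=False),
    errors="coerce",
)

# Fill the gaps (original and coerced) with the column median.
au_filled = au_clean.fillna(au_clean.median())
print(au_filled.tolist())
```

If the column mixed real floats with strings, `.str` would return `NaN` for the floats, which is why the notebook's `isinstance(x, str)` guard is the more defensive choice.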

**Observations:**  
- Median imputation preserves central tendency without being affected by extreme outliers.  
- Au and As columns required special attention due to formatting and missing data issues.
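
The robustness claim can be checked on a tiny example. The sketch below compares mean and median fill values on made-up data with one extreme reading, and also shows a side effect worth keeping in mind: filling many gaps with a single value reduces the spread of the column.

```python
import numpy as np
import pandas as pd

# Made-up As values: one extreme reading and two gaps.
as_ppm = pd.Series([5.0, 9.0, 20.0, 800.0, np.nan, np.nan], name="As")

mean_value = as_ppm.mean()      # pulled upward by the 800.0 reading
median_value = as_ppm.median()  # barely affected by it

median_fill = as_ppm.fillna(median_value)

print("mean:", mean_value, "median:", median_value)
print("std before:", round(as_ppm.std(), 1), "std after:", round(median_fill.std(), 1))
```

In the cleaned dataset about 31.5% of As values (1503 of 4771) carry the same imputed value, 9.2, so models that rely on the variance of As may be affected.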
