# Mining Data Analysis Report
This notebook analyzes a 20% sample of the mining dataset, focusing on data preprocessing, cleaning, and visualization.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Set style for better visualizations
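# The 'seaborn' matplotlib style name was deprecated in matplotlib 3.6;
# sns.set_theme() applies the seaborn defaults directly.
sns.set_theme()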
sns.set_palette("husl")
pd.set_option('display.max_columns', None)

## 1. Data Loading and Sampling

In [None]:
# Read the Excel file
df = pd.read_excel('Copy of MTL Coding Exercise_MSS_New.xlsx')

# Take 20% random sample
df_sample = df.sample(frac=0.2, random_state=42)

print(f"Original dataset size: {len(df)}")
print(f"Sample dataset size: {len(df_sample)}")

## 2. Data Preprocessing and Cleaning

In [None]:
# Check for missing values
missing_values = df_sample.isnull().sum()
print("Missing values in each column:")
print(missing_values[missing_values > 0])

# Basic statistics of numerical columns
numeric_stats = df_sample.describe()
print("\nBasic statistics of numerical columns:")
print(numeric_stats)

In [None]:
# Clean the data
df_clean = df_sample.copy()

# Handle missing values: fill with the column mean.
# Assigning the result back avoids the chained-assignment problem of
# calling fillna(inplace=True) on a selected column.
for col in ['mean_lh', 'cuka_dcr', 'moka_dcr']:
    df_clean[col] = df_clean[col].fillna(df_clean[col].mean())

# Remove outliers using IQR method for key measurements
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

for col in ['cugrade', 'mograde', 'mean_lh']:
    df_clean = remove_outliers(df_clean, col)

print(f"Dataset size after cleaning: {len(df_clean)}")

## 3. Data Visualization

In [None]:
# 1. Grade Distribution Plot
plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
sns.histplot(data=df_clean, x='cugrade', bins=30)
plt.title('Distribution of Copper Grade')
plt.xlabel('Copper Grade (%)')

plt.subplot(1, 2, 2)
sns.histplot(data=df_clean, x='mograde', bins=30)
plt.title('Distribution of Molybdenum Grade')
plt.xlabel('Molybdenum Grade (%)')

plt.tight_layout()
plt.show()

In [None]:
# 2. Correlation Analysis
numerical_cols = ['mean_lh', 'cuka_dcr', 'moka_dcr', 'cugrade', 'mograde', 
                 'valid_bh_num', 'avg_bh_grade_cu', 'avg_bh_grade_mo', 'Dist_to_NN_bh']

plt.figure(figsize=(12, 10))
correlation_matrix = df_clean[numerical_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap of Key Variables')
plt.xticks(rotation=45)
plt.yticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# 3. Scatter Plot Matrix for Key Variables
key_vars = ['cugrade', 'mograde', 'avg_bh_grade_cu', 'avg_bh_grade_mo']
sns.pairplot(df_clean[key_vars], diag_kind='kde')
plt.suptitle('Scatter Plot Matrix of Grade Measurements', y=1.02)
plt.show()

In [None]:
# 4. Grade by Shift Analysis
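shift_stats = df_clean.groupby('shift_id')[['cugrade', 'mograde']].mean()
# DataFrame.plot creates its own figure, so set the size here rather than
# calling plt.figure() first (which would leave an empty figure behind).
shift_stats.plot(kind='bar', figsize=(10, 6))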
plt.title('Average Grades by Shift')
plt.xlabel('Shift')
plt.ylabel('Grade (%)')
plt.legend(['Copper', 'Molybdenum'])
plt.tight_layout()
plt.show()

## 4. Analysis Report

### Data Preprocessing and Cleaning:
1. A random 20% sample of the dataset was drawn for analysis (fixed seed, `random_state=42`)
2. Missing values in `mean_lh`, `cuka_dcr`, and `moka_dcr` were filled with the column mean
3. Outliers were removed with the 1.5 × IQR rule, applied in turn to `cugrade`, `mograde`, and `mean_lh`
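
As a minimal sketch of steps 2 and 3, the block below applies mean imputation and then the 1.5 × IQR fences to a handful of toy values (invented for illustration, not taken from the mining dataset). The helper name `iqr_bounds` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy values (assumption): one missing entry and one obvious outlier.
toy = pd.Series([1.0, 2.0, np.nan, 3.0, 4.0, 100.0])

# Step 2: fill the gap with the mean of the non-missing values (22.0 here).
filled = toy.fillna(toy.mean())

# Step 3: keep only the values inside the IQR fences.
lower, upper = iqr_bounds(filled)
kept = filled[filled.between(lower, upper)]

print(f"fences: [{lower}, {upper}]")
print(kept.tolist())
```

In this toy case the outlier 100.0 pulls the imputed value up to 22.0, and that value still falls inside the fences. It may therefore be worth checking whether imputing after outlier removal, or imputing with the median, suits the real columns better.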

### Key Insights from Visualizations:

1. **Grade Distributions**:
   - The histograms show the range and spread of copper and molybdenum grades after outlier removal
   - Skewness or multi-modal patterns, if present, are visible in these plots

2. **Correlation Analysis**:
   - The heatmap shows the pairwise Pearson correlations among the nine numeric variables
   - Strong correlations between sensor readings (`cuka_dcr`, `moka_dcr`) and actual grades would validate sensor accuracy
   - Distance to the nearest blasthole (`Dist_to_NN_bh`) may affect prediction accuracy

3. **Grade Relationships**:
   - The scatter plot matrix shows relationships between predicted and actual grades
   - Helps identify any systematic bias in predictions
   - Shows if there's any correlation between copper and molybdenum grades

4. **Shift Analysis**:
   - Compares grade measurements between day and night shifts
   - Helps identify any systematic differences in measurements between shifts
   - Important for quality control and operational consistency
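
The four checks above can also be expressed as numbers rather than read off the plots. The following is a rough sketch on synthetic data: the column names mirror the notebook, but every value is generated, the shift labels `"A"` and `"B"` are placeholders, and treating `avg_bh_grade_cu` as the reference with `cugrade` as the estimate is an assumption made only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): values are invented, not from the dataset.
rng = np.random.default_rng(0)
n = 400
reference = rng.normal(0.5, 0.1, n)
demo = pd.DataFrame({
    "avg_bh_grade_cu": reference,
    "cugrade": reference + 0.05 + rng.normal(0, 0.02, n),  # deliberate +0.05 offset
    "cuka_dcr": 1000 * reference + rng.normal(0, 20, n),
    "shift_id": np.where(np.arange(n) % 2 == 0, "A", "B"),
})

# 1. Grade distribution: sample skewness (near 0 for a roughly symmetric shape).
skewness = demo["cugrade"].skew()

# 2. Correlation of each numeric column with the reference column,
#    ordered by absolute strength while keeping the sign.
corr = demo.select_dtypes(include="number").corr()["avg_bh_grade_cu"]
corr = corr.drop("avg_bh_grade_cu")
corr_ranked = corr.loc[corr.abs().sort_values(ascending=False).index]

# 3. Systematic bias: mean error plus a straight-line fit of estimate vs reference.
error = demo["cugrade"] - demo["avg_bh_grade_cu"]
slope, intercept = np.polyfit(demo["avg_bh_grade_cu"], demo["cugrade"], deg=1)

# 4. Per-shift summary with the standard error of the mean.
shift_summary = demo.groupby("shift_id")["cugrade"].agg(["count", "mean", "std"])
shift_summary["sem"] = shift_summary["std"] / np.sqrt(shift_summary["count"])

print(f"skewness: {skewness:.3f}")
print(corr_ranked)
print(f"mean error: {error.mean():.3f}, slope: {slope:.3f}, intercept: {intercept:.3f}")
print(shift_summary)
```

A mean error far from zero, or a fitted slope far from 1, would point to systematic bias. Comparing the gap between shift means with the `sem` column gives a first indication of whether a shift difference is larger than sampling noise.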

### Recommendations:
1. Monitor and calibrate sensors based on correlation analysis results
2. Investigate any significant shift-based variations
3. Use distance to nearest blasthole as a confidence metric for grade predictions
4. Regularly validate sensor predictions against laboratory assays
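
One possible way to explore recommendation 3 is to bin records by distance to the nearest blasthole and compare the prediction error per bin. The sketch below uses invented values; the `abs_error` column and the bin edges are hypothetical and would need to be derived from the real data and its units.

```python
import pandas as pd

# Invented example values (assumption): distance and absolute prediction error.
demo = pd.DataFrame({
    "Dist_to_NN_bh": [1.0, 2.0, 6.0, 7.0, 12.0, 15.0],
    "abs_error": [0.01, 0.03, 0.05, 0.07, 0.10, 0.14],
})

# Hypothetical bin edges; pd.cut builds the intervals (0, 5], (5, 10], (10, 20].
bins = [0, 5, 10, 20]
demo["dist_bin"] = pd.cut(demo["Dist_to_NN_bh"], bins=bins)

# Record count and mean absolute error per distance bin.
by_distance = demo.groupby("dist_bin", observed=True)["abs_error"].agg(["count", "mean"])
print(by_distance)
```

If the mean error in the real data rises steadily with distance, as it does in these made-up numbers, distance could reasonably serve as a confidence indicator; a flat profile would argue against it.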