# 1.1 - Data Assessing and Data Cleaning

___

## Project Workflow


1.0) Geological Setting and Mineral Availability

**1.1) Data Assessing and Data Cleaning**

1.2) Exploratory Data Analysis & Geostats

2.1) Spatial Analysis

2.2) Hydrographic Basin Delineation

2.3) Correlation between basins and samples

3.0) Conclusion
___

## Table Of Contents

[a) Importing Libraries](#il)

[b) Importing File](#if)

[c) Data Assessing](#da)

[d) Data Cleaning](#dc)

<a name="il"></a>
## Importing Libraries

In [1]:
import pandas as pd
import geopandas as gpd
import numpy as np
import folium
from shapely import geometry
import matplotlib.pyplot as plt

%matplotlib inline

<a name="if"></a>
## Importing File

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [2]:
df = pd.read_csv('/content/NURE_15.csv', sep =',')
NURE_15 = df.copy()
NURE_15.head()

Unnamed: 0,rec_no,prime_id,samptyp,latitude,longitude,ornlid,site,state,quad,mapcode,...,zn_ppm,zr_ppm,po4_ppm,so4_ppm,methods,tapefile,reformat,coordprb,comments,comment2
0,5239129,8986,15,29.703,-102.759,8986,1105,TX,EMORY PEAK,NH1309,...,333,56,0,0,"OR1, OR2, OR6, OR8",XG0386.03,,,"1-38 MARAVILLAS CANYON, QUAD., 15 MIN. SAMPLE ...",
1,5239061,8874,15,29.465,-102.834,8874,1001,TX,EMORY PEAK,NH1309,...,59,64,0,0,"OR1, OR2, OR6, OR8",XG0386.03,,SAME COORDINATES AS ORNLID(029338):,1-78 GARY HOMERSTAD % AREA MANAGER BLACK GAP W...,
2,5239062,8875,15,29.473,-102.816,8875,1002,TX,EMORY PEAK,NH1309,...,64,92,0,0,"OR1, OR2, OR6, OR8",XG0386.03,,,1-78 GARY HOMERSTAD % AREA MANAGER BLACK GAP W...,
3,5239065,8880,15,29.432,-102.863,8880,1007,TX,EMORY PEAK,NH1309,...,54,76,0,0,"OR1, OR2, OR6, OR8",XG0386.03,,,"1-38 STILLWELL CROSSING QUAD., 7.5 MIN., SAMPL...",
4,5239066,8884,15,29.871,-102.93,8884,1010,TX,EMORY PEAK,NH1309,...,19,27,0,0,"OR1, OR2, OR6, OR8",XG0386.03,HSSR SAMPLE USED IN TERRELL STUDY AREA: SAME R...,,"1-38 DOVE MTN. QUAD., 15 MIN., SAMPLE TAKEN OV...",


<a name="da"></a>
## Data Assessing

In [3]:
NURE_15.sample(10)

Unnamed: 0,rec_no,prime_id,samptyp,latitude,longitude,ornlid,site,state,quad,mapcode,...,zn_ppm,zr_ppm,po4_ppm,so4_ppm,methods,tapefile,reformat,coordprb,comments,comment2
10,5239075,8896,15,29.791,-103.5,8896,1021,TX,EMORY PEAK,NH1309,...,56,77,0,0,"OR1, OR2, OR6, OR8",XG0386.03,HSSR SAMPLE USED IN TASCOTAL STUDY AREA: SAME ...,,"1-38 BUCK HILL QUAD., 15 MIN., SAMPLE TAKEN OV...",
250,5239652,27463,15,29.673,-102.986,27463,G129,TX,EMORY PEAK,NH1309,...,101,80,0,0,"OR1, OR2, OR7, OR8",XG0386.24,STILLWELL MOUNTAINS STUDY AREA:,,1-38 MARAVILLAS CANYON N.W. 7.5' QUAD COLLECTE...,
486,5240097,29607,15,29.538,-102.84,29607,H569,TX,EMORY PEAK,NH1309,...,74,61,0,0,"OR1, OR2, OR7, OR8",XG0386.24,STILLWELL MOUNTAINS STUDY AREA:,,1-38 MARAVILLAS CANYON SE 7.5 MIN QUAD COLLECT...,"URNE % SAN ANGELO, TX 76901:"
678,5330925,29869,92,30.079,-103.534,29869,F137,TX,FORT STOCKTON,NH1306,...,0,0,0,0,OR-GS,XG0386.25,TASCOTAL STUDY AREA: RADIOMETRIC DATA-EQ_K_PCT...,,"1-38 FT. STOCKTON, 1X2 DEGREE, BLOW-UP:",
142,5239390,14501,15,29.864,-102.505,14501,1452,TX,EMORY PEAK,NH1309,...,33,57,0,0,"OR1, OR2, OR6, OR8",XG0386.03,HSSR SAMPLE USED IN TERRELL STUDY AREA: SAME R...,,"1-38 BULLIS GAP QUAD., 15 MIN. COLLECTED SAMPL...",
117,5239339,14203,15,29.353,-103.089,14203,0739,TX,EMORY PEAK,NH1309,...,275,98,0,0,"OR1, OR2, OR6, OR8",XG0386.03,HSSR SAMPLE USED IN SOLITARIO STUDY AREA: SAME...,,1-38 ROYS PEAK QUAD. SAMPLE COLLECTED OVER A 6...,
507,5241234,29130,15,30.125,-103.478,29130,F105,TX,FORT STOCKTON,NH1306,...,162,271,0,0,"OR1, OR2, OR6, OR8",XG0386.25,TASCOTAL STUDY AREA:,,"1-38 MONUMENT SPRING 15' TOPO, FT. STOCKTON QU...",AND LITHIC FRAGMENTS:
329,5239940,29347,15,29.441,-102.868,29347,H316,TX,EMORY PEAK,NH1309,...,75,69,0,0,"OR1, OR2, OR7, OR8",XG0386.24,STILLWELL MOUNTAINS STUDY AREA:,,"1-38 STILLWELL CROSSING 7.5 MIN. TOPO, COLLECT...",
230,5239537,15036,15,29.938,-103.301,15036,1645,TX,EMORY PEAK,NH1309,...,81,83,0,0,"OR1, OR2, OR6, OR8",XG0386.03,HSSR SAMPLE USED IN TASCOTAL STUDY AREA: SAME ...,,1-38 SANTIAGO PEAK 15' OF EMORY PEAK. SAMPLE T...,R REYNOLDS CREEK:
648,5330850,29609,92,29.611,-102.907,29609,H571,TX,EMORY PEAK,NH1309,...,0,0,0,0,NONE,XG0386.24,STILLWELL MOUNTAINS STUDY AREA:,,1-38 MARAVILLAS CANYON SW 7.5 MIN QUAD READING...,


In [4]:
NURE_15.shape

(680, 137)

In [5]:
NURE_15.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 680 entries, 0 to 679
Columns: 137 entries, rec_no to comment2
dtypes: float64(17), int64(65), object(55)
memory usage: 727.9+ KB


In [6]:
# Count fully duplicated rows
NURE_15.duplicated().sum()

0
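
Besides shape, dtypes and duplicates, it can help to count missing values and zeros per column before cleaning, since the notebook later replaces zeros with NaN. The following is a minimal sketch on a small made-up frame; `toy` and its values are illustrative assumptions, not NURE data.

```python
import pandas as pd

# Toy frame standing in for NURE_15; the values are made up for illustration.
toy = pd.DataFrame({
    "zn_ppm": [333, 59, 0, 54],
    "au_ppm": [0, 0, 0, 0],
    "quad": ["EMORY PEAK", "EMORY PEAK", None, "FORT STOCKTON"],
})

# Number of missing (NaN/None) entries per column
missing_per_col = toy.isna().sum()

# Number of zeros per numeric column
zeros_per_col = (toy.select_dtypes(include="number") == 0).sum()

print(missing_per_col)
print(zeros_per_col)
```

A column such as `au_ppm` in this toy frame, which contains only zeros, is a likely candidate for removal once zeros are treated as missing.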

<a name="dc"></a>
## Data Cleaning

`1.` Tidiness Issues

- Change columns dtypes

    - Define a consistent convention for values reported below a threshold (e.g. <10 ppm)
    
- Filter the dataframe to only elements with high mobility

- Rename the `longitude` and `latitude` columns to `X` and `Y`


In [7]:
NURE_15.rename(columns = {'longitude':'X', 'latitude':'Y'},inplace = True)
NURE_15.head(1)

Unnamed: 0,rec_no,prime_id,samptyp,Y,X,ornlid,site,state,quad,mapcode,...,zn_ppm,zr_ppm,po4_ppm,so4_ppm,methods,tapefile,reformat,coordprb,comments,comment2
0,5239129,8986,15,29.703,-102.759,8986,1105,TX,EMORY PEAK,NH1309,...,333,56,0,0,"OR1, OR2, OR6, OR8",XG0386.03,,,"1-38 MARAVILLAS CANYON, QUAD., 15 MIN. SAMPLE ...",


___

`2.` Quality Issues

- Keep only the elements with enough data to be statistically meaningful (a value recorded in >= 50% of the samples)

- Remove `<` or `>` sign from numbers.
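
The sign-removal step can be sketched as follows. This assumes a hypothetical raw column in which censored readings are stored as strings such as `"<10"`. The `NURE_15` columns shown in this notebook are already numeric, so the series `raw` and the half-limit convention for `<` readings are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical raw column with censored readings stored as strings.
raw = pd.Series(["12", "<10", ">500", "7.5", None], name="cu_ppm")

# Record which readings were censored before the sign is stripped.
below_limit = raw.str.startswith("<", na=False)
above_limit = raw.str.startswith(">", na=False)

# Strip a leading "<" or ">" and convert to float; unparseable entries become NaN.
values = pd.to_numeric(raw.str.lstrip("<>"), errors="coerce")

# Assumed convention: report readings below the limit as half of that limit.
values = values.where(~below_limit, values / 2)

print(values.tolist())
```

Keeping the `below_limit` and `above_limit` masks makes it possible to tell measured values from substituted ones later on.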

In [8]:
NURE_15 = NURE_15.replace(0, np.nan)
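
The call above replaces zeros in every column of the frame. If zeros should count as missing only in the concentration columns, a narrower variant might look like the sketch below. The frame `toy` and the column `site_flag` are hypothetical, and the `ppm`/`pct` suffix filter mirrors the one used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame; `site_flag` is a hypothetical non-concentration column where 0 is a valid value.
toy = pd.DataFrame({
    "site_flag": [0, 1, 0],
    "zn_ppm": [333, 0, 64],
    "so4_ppm": [0, 0, 0],
})

# Select only the concentration columns by suffix.
elements = [c for c in toy.columns if c.endswith(("ppm", "pct"))]

# Replace zeros with NaN in those columns only.
toy[elements] = toy[elements].replace(0, np.nan)

print(toy)
```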

In [9]:
NURE_15.head()

Unnamed: 0,rec_no,prime_id,samptyp,Y,X,ornlid,site,state,quad,mapcode,...,zn_ppm,zr_ppm,po4_ppm,so4_ppm,methods,tapefile,reformat,coordprb,comments,comment2
0,5239129,8986,15,29.703,-102.759,8986,1105,TX,EMORY PEAK,NH1309,...,333.0,56.0,,,"OR1, OR2, OR6, OR8",XG0386.03,,,"1-38 MARAVILLAS CANYON, QUAD., 15 MIN. SAMPLE ...",
1,5239061,8874,15,29.465,-102.834,8874,1001,TX,EMORY PEAK,NH1309,...,59.0,64.0,,,"OR1, OR2, OR6, OR8",XG0386.03,,SAME COORDINATES AS ORNLID(029338):,1-78 GARY HOMERSTAD % AREA MANAGER BLACK GAP W...,
2,5239062,8875,15,29.473,-102.816,8875,1002,TX,EMORY PEAK,NH1309,...,64.0,92.0,,,"OR1, OR2, OR6, OR8",XG0386.03,,,1-78 GARY HOMERSTAD % AREA MANAGER BLACK GAP W...,
3,5239065,8880,15,29.432,-102.863,8880,1007,TX,EMORY PEAK,NH1309,...,54.0,76.0,,,"OR1, OR2, OR6, OR8",XG0386.03,,,"1-38 STILLWELL CROSSING QUAD., 7.5 MIN., SAMPL...",
4,5239066,8884,15,29.871,-102.93,8884,1010,TX,EMORY PEAK,NH1309,...,19.0,27.0,,,"OR1, OR2, OR6, OR8",XG0386.03,HSSR SAMPLE USED IN TERRELL STUDY AREA: SAME R...,,"1-38 DOVE MTN. QUAD., 15 MIN., SAMPLE TAKEN OV...",


In [16]:
# Replace negative sample values with half of their absolute value, indicating that the analysis was performed and the value is set to 1/2 the detection limit


In [10]:
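# Only concentration columns are modified, so negative longitudes in `X` stay untouched
conc_cols = [c for c in NURE_15.select_dtypes(include=np.number).columns if c.endswith(('ppm', 'pct'))]
for col in conc_cols:
    NURE_15[col] = NURE_15[col].apply(lambda x: abs(x / 2) if x < 0 else x)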
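
The same substitution can also be written without a Python-level loop. The sketch below uses `DataFrame.mask` on a made-up frame (`toy` is an assumption, not NURE data) and is limited to columns ending in `ppm` or `pct`, so that coordinates such as negative longitudes are not altered.

```python
import numpy as np
import pandas as pd

# Toy frame with a negative longitude column and two concentration columns.
toy = pd.DataFrame({
    "X": [-102.759, -102.834, -102.816],
    "zn_ppm": [333.0, -10.0, np.nan],
    "zr_ppm": [-4.0, 64.0, 92.0],
})

conc_cols = [c for c in toy.columns if c.endswith(("ppm", "pct"))]
conc = toy[conc_cols]

# Where a value is negative, substitute half of its absolute value;
# non-negative values and NaN are left unchanged.
toy[conc_cols] = conc.mask(conc < 0, conc.abs() / 2)

print(toy)
```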

In [11]:
elements = [col for col in NURE_15.columns if col.endswith(('ppm', 'pct'))]

In [12]:
print(elements)

['orgn_pct', 'u_xx_ppm', 'u_dn_ppm', 'u_fl_ppm', 'u_ms_ppm', 'u_na_ppm', 'ag_ppm', 'al_pct', 'as_ppm', 'au_ppm', 'b_ppm', 'ba_ppm', 'be_ppm', 'bi_ppm', 'br_ppm', 'ca_pct', 'cd_ppm', 'ce_ppm', 'cl_ppm', 'co_ppm', 'cr_ppm', 'cs_ppm', 'cu_ppm', 'dy_ppm', 'eu_ppm', 'f_ppm', 'fe_pct', 'hf_ppm', 'hg_ppm', 'k_pct', 'la_ppm', 'li_ppm', 'lu_ppm', 'mg_pct', 'mn_ppm', 'mo_ppm', 'na_pct', 'nb_ppm', 'ni_ppm', 'p_ppm', 'pb_ppm', 'pt_ppm', 'rb_ppm', 'sb_ppm', 'sc_ppm', 'se_ppm', 'sm_ppm', 'sn_ppm', 'sr_ppm', 'ta_ppm', 'tb_ppm', 'th_ppm', 'ti_ppm', 'v_ppm', 'w_ppm', 'y_ppm', 'yb_ppm', 'zn_ppm', 'zr_ppm', 'po4_ppm', 'so4_ppm']


In [13]:
# List of element columns to drop
remove = []

# Iterate over all element columns
for e in elements:
    # Count the missing (NaN) values, i.e. samples without a reading for this element
    not_sampled = NURE_15[e].isnull().sum()

    # Fraction of samples with a valid reading
    validity = 1 - (not_sampled / NURE_15.shape[0])

    # Elements recorded in fewer than 50% of the samples are marked for removal
    if validity < 0.5:
        print(f'Remove {e}, because its value is under the proper rate {validity:0.2f}')
        remove.append(e)

Remove u_xx_ppm, because its value is under the proper rate 0.00
Remove u_ms_ppm, because its value is under the proper rate 0.00
Remove u_na_ppm, because its value is under the proper rate 0.00
Remove au_ppm, because its value is under the proper rate 0.00
Remove bi_ppm, because its value is under the proper rate 0.00
Remove br_ppm, because its value is under the proper rate 0.00
Remove cd_ppm, because its value is under the proper rate 0.00
Remove ce_ppm, because its value is under the proper rate 0.40
Remove cl_ppm, because its value is under the proper rate 0.00
Remove cs_ppm, because its value is under the proper rate 0.00
Remove dy_ppm, because its value is under the proper rate 0.00
Remove eu_ppm, because its value is under the proper rate 0.00
Remove f_ppm, because its value is under the proper rate 0.00
Remove hf_ppm, because its value is under the proper rate 0.33
Remove hg_ppm, because its value is under the proper rate 0.00
Remove la_ppm, because its value is under the prop
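
The 50% completeness check can also be computed for all element columns at once with `notna().mean()`. This is a minimal sketch on a made-up frame; `toy` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: one complete column, one sparse column, one empty column.
toy = pd.DataFrame({
    "zn_ppm": [333.0, 59.0, 64.0, 54.0],
    "ce_ppm": [np.nan, 40.0, np.nan, np.nan],
    "au_ppm": [np.nan] * 4,
})
elements = [c for c in toy.columns if c.endswith(("ppm", "pct"))]

# Fraction of non-missing values per element column
validity = toy[elements].notna().mean()

# Elements recorded in fewer than half of the samples
remove = validity[validity < 0.5].index.tolist()

print(validity.round(2).to_dict())
print(remove)
```

Keeping `validity` as a Series also makes it easy to inspect or plot the completeness of every element, not only the ones below the threshold.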

In [14]:
NURE_15.columns

Index(['rec_no', 'prime_id', 'samptyp', 'Y', 'X', 'ornlid', 'site', 'state',
       'quad', 'mapcode',
       ...
       'zn_ppm', 'zr_ppm', 'po4_ppm', 'so4_ppm', 'methods', 'tapefile',
       'reformat', 'coordprb', 'comments', 'comment2'],
      dtype='object', length=137)

In [15]:
# Drop the element columns flagged for removal
NURE_15 = NURE_15.drop(columns = remove)

In [16]:
NURE_15.shape

(680, 108)

In [20]:
clean = NURE_15.copy()
clean.to_csv(r'C:\Users\srs6239\Box\Spring2024_GRA_SRS\Brewster_Ranch\_DATA\NURE\NURE_15_data_cleaned.csv', sep = ',', index = False)

In [21]:
removed_elements_df = pd.DataFrame(remove, columns=['Removed Elements'])

# Display the table
display(removed_elements_df)

Unnamed: 0,Removed Elements
0,u_xx_ppm
1,u_ms_ppm
2,u_na_ppm
3,au_ppm
4,bi_ppm
5,br_ppm
6,cd_ppm
7,ce_ppm
8,cl_ppm
9,cs_ppm
