# Overview
```
🎇 Can you guess which elements are the most common in minerals?
✨ Can you name the countries richest in minerals?
```

# 1. Observe the dataset

### 1.1  Import necessary modules

In [None]:
import pandas as pd      # data processing, CSV file I/O
import seaborn as sns    # beautiful plots
sns.set()
import matplotlib.pyplot as plt
%matplotlib inline

### 1.2 Read and understand data
* Mineral Name = common name (Pyroxferroite, Gerhardtite, Hydroxylherderite, etc.; Silver, Gold and Carbon are elements that form minerals on their own)
* RRUFF Chemistry (plain) = RRUFF formula (Stoiberite = Cu2+5O2(V5+O4)2, which stands for the formula `Cu²⁺₅O₂(V⁵⁺O₄)₂`)
* IMA Chemistry (plain) = International Mineralogical Association formula (Stoiberite = Cu5O2(VO4)2, which stands for the formula `Cu₅O₂(VO₄)₂`)
* Chemistry Elements = chemical elements in the mineral (Stoiberite = Cu, V, O; Silver = Ag)
* IMA Number = unique IMA number, if applicable (Stoiberite = IMA1979-016)
* RRUFF IDs = one or more unique RRUFF IDs, if applicable (Silver = R070416, R070463, R070754)
* Country of Type Locality = country (including 'unknown' and '?')
* Year First Published = year the mineral was first officially described
* IMA Status = official status assigned by the IMA (Approved, Grandfathered, Pending publication)
* Structural Groupname = structural group of the mineral (Platarsite = Pyrite)
* Fleischers Groupname = structural group according to Fleischer's Glossary 2008 (corresponds to Structural Groupname)
* Status Notes = publication
* Crystal Systems = crystal system(s) of the mineral, i.e. the set of point groups and their corresponding space groups assigned to a lattice system (monoclinic, cubic, orthorhombic, hexagonal, etc.)
* Oldest Known Age (Ma) = oldest known age in megaannums (1 Ma = one million years)

In [None]:
minerals = pd.read_csv("/kaggle/input/ima-database-of-mineral-properties/RRUFF_Export_20191025_022204.csv")
minerals.sample(5)

We have 14 columns with descriptive information about minerals.


### 1.3 Rename columns with complicated headings

In [None]:
minerals.rename(columns={'Mineral Name':'Mineral','Chemistry Elements':'Elements', 'Country of Type Locality' : 'Country', 'Crystal Systems':'Systems', 'Oldest Known Age (Ma)' : 'Age', 'Structural Groupname' : 'Groupname' }, inplace=True)

# 2. Working with nulls

### 2.1 Check nulls

* `Mineral` (unique mineral name) and `RRUFF Chemistry (plain)` have no null values.  
* `Groupname` and `Fleischers Groupname` have empty string values.  
* `IMA Chemistry (plain)` and `Elements` have 4 nulls.


In [None]:
minerals.isnull().sum(axis = 0)
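
Note that `isnull()` counts only missing values (NaN/None), so empty strings would not show up in the output above. A minimal sketch of counting both, on made-up toy data (the frame `toy` and its values are assumptions for illustration, not rows from the dataset):

```python
import pandas as pd

# Toy frame standing in for `minerals`; values are invented for illustration.
toy = pd.DataFrame({
    "Mineral": ["A", "B", "C"],
    "Groupname": ["Pyrite", "", None],
})

null_counts = toy.isnull().sum()   # counts only NaN/None per column
empty_counts = (toy == "").sum()   # counts only empty strings per column

print(null_counts)
print(empty_counts)
```

Running the same two lines on `minerals` would show whether a column mixes nulls and empty strings.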

### 2.2 Check nulls in Elements and IMA Chemistry
* use mask on columns `Elements` and `IMA Chemistry (plain)`

In [None]:
el_null_mask = minerals['Elements'].isnull() | minerals['IMA Chemistry (plain)'].isnull()
minerals[el_null_mask]

Four rows are found, each with nulls in both `Elements` and `IMA Chemistry (plain)`. Drop these rows:

In [None]:
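# .copy() makes the filtered frame independent, so later column assignments do not trigger SettingWithCopyWarning
minerals = minerals[minerals.Elements.notna()].copy()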
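
An equivalent way to drop such rows is `dropna` with an explicit `subset`, which states both columns in the call. A small sketch on made-up toy data (the frame `toy` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with one row missing both columns; values are invented.
toy = pd.DataFrame({
    "Mineral": ["A", "B", "C"],
    "Elements": ["Cu O", np.nan, "Ag"],
    "IMA Chemistry (plain)": ["CuO", np.nan, "Ag"],
})

# Drop a row if either listed column is null (how="any" is the default).
cleaned = toy.dropna(subset=["Elements", "IMA Chemistry (plain)"]).copy()
print(cleaned)
```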

### 2.3 Fill nulls in Country
* Some values in the `Country` column are null; fill them with `'unknown'`.

In [None]:
minerals['Country'] = minerals.Country.fillna(value = 'unknown')
minerals.isnull().sum(axis = 0)

# 3. Understand the distribution of chemistry elements in dataset

### 3.1 Split column Elements by space
* The table currently looks like this:

| Mineral    | Elements        |
|------------|-----------------|
| Nealite    | Pb Fe As O Cl H |
| Hilgardite | Ca B O Cl H     |
| ...        | ...             |


* The aim is to reshape it into a new dataframe `elem` with one row per mineral and element:

| Mineral    | Element  |
|------------|----------|
| Nealite    | Pb       |
| Nealite    | Fe       |
| Nealite    | As       |
| Nealite    | ...      |
| Hilgardite | Ca       |
| Hilgardite | B        |
| Hilgardite | ...      |

In [None]:
elem = minerals.set_index('Mineral').Elements.str.split(' ', expand=True).stack().reset_index('Mineral').reset_index(drop=True)
elem.columns = ['Mineral', 'Element']
elem
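
The same reshaping can also be written with `Series.explode` (available in pandas 0.25 and later). A minimal sketch on the two example rows from the table above; `toy` and `elem_toy` are illustrative names, and `str.split()` without arguments splits on any whitespace rather than on a single space:

```python
import pandas as pd

toy = pd.DataFrame({
    "Mineral": ["Nealite", "Hilgardite"],
    "Elements": ["Pb Fe As O Cl H", "Ca B O Cl H"],
})

elem_toy = (
    toy.assign(Element=toy["Elements"].str.split())  # each cell becomes a list of symbols
       .explode("Element")[["Mineral", "Element"]]   # one row per list item
       .reset_index(drop=True)
)
print(elem_toy)
```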

### 3.2 Count all elements

In [None]:
elem['Element'].value_counts()

* Plot the 10 most frequent chemical elements:  


In [None]:
elem['Element'].value_counts()[0:10].sort_values().plot(kind='barh', figsize=(8, 6))
plt.xlabel("Count of occurrences", labelpad=14)
plt.ylabel("Chemistry Element", labelpad=14)
plt.title("Most frequent elements in minerals", y=1.02);

**<span style="color:red">O (oxygen) and H (hydrogen) take first and second place, respectively.</span>**
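
Raw counts can also be expressed as the share of minerals that contain each element. A rough sketch on made-up toy data (the frame `elem_toy` and the mineral names `M1`, `M2` are assumptions; it also assumes each element is listed at most once per mineral):

```python
import pandas as pd

# Toy long-format table in the same shape as `elem`; values are invented.
elem_toy = pd.DataFrame({
    "Mineral": ["M1", "M1", "M1", "M2", "M2"],
    "Element": ["O", "H", "Cu", "O", "Si"],
})

counts = elem_toy["Element"].value_counts()
share = counts / elem_toy["Mineral"].nunique()  # fraction of minerals containing the element
print(share)
```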

# 4. Understand the distribution of countries in dataset
### 4.1 Understand data in Country column

* Nulls in the `Country` column were already filled with `unknown`.

In [None]:
minerals['Country'].value_counts()

* Some cells contain two or more countries delimited by ` / ` (80 rows), so let's first expand them into one row per country:

In [None]:
countr = minerals.set_index('Mineral').Country.str.split(' / ', expand=True).stack().reset_index('Mineral').reset_index(drop=True)
countr.columns = ['Mineral', 'Country']
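
The number of multi-country rows can be checked before splitting by testing for the ` / ` delimiter. A minimal sketch on made-up toy data (the frame `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy frame standing in for `minerals`; values are invented.
toy = pd.DataFrame({
    "Mineral": ["M1", "M2", "M3"],
    "Country": ["USA", "Germany / Czech Republic", "unknown"],
})

# regex=False treats " / " as a literal substring.
multi = toy["Country"].str.contains(" / ", regex=False)
n_multi = int(multi.sum())
print(n_multi)
```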

* Some entries have the form `Some country (meteorite)`, meaning the mineral was found in a meteorite in that country:

In [None]:
print(countr[countr['Country'].str.contains('meteorite', regex=False)])

* There is also uncertain data marked with the `?` symbol:

In [None]:
print(countr[countr['Country'].str.contains('?', regex=False)])

* Moreover, some minerals are of `meteorite` origin.

### 4.2 Reorganize countries

1. Remove the ` ?` uncertainty marker attached to a country name
2. If a cell contains `Some country (meteorite)`, replace it with `meteorite`
3. If a cell contains only the `?` symbol, replace it with `unknown`
4. Replace `IDP (interplanetary dust particle) over USA` with `IDP` (just one row, for the mineral Brownleeite)


In [None]:
# raw string avoids the invalid-escape warning for "\?" in newer Python versions
countr['Country'] = countr['Country'].replace({r' \?':''}, regex=True)
countr.loc[countr['Country'].str.contains('meteorite', case=False), 'Country'] = 'meteorite'
countr['Country'] = countr['Country'].replace('?', 'unknown')
countr.loc[countr['Country'].str.contains('IDP', case=False), 'Country'] = 'IDP'

5. Strip leading and trailing whitespace, including tabs and newlines (just in case)  
6. Drop rows where the country is `unknown`

In [None]:
# str.strip() returns a new Series, so assign the result back
countr['Country'] = countr['Country'].str.strip()
countr.drop(countr[countr.Country == 'unknown'].index, inplace=True)
countr['Country'].value_counts()[0:20]
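
The six steps above can be sketched as a single function and tried on a few sample strings. The helper `clean_country` and the sample values are hypothetical and only illustrate the order of operations; they are not part of the original notebook:

```python
import pandas as pd

def clean_country(s: pd.Series) -> pd.Series:
    """Hypothetical helper applying steps 1-5 to a Series of country strings."""
    s = s.str.replace(r" \?", "", regex=True)                          # step 1
    s = s.mask(s.str.contains("meteorite", case=False), "meteorite")   # step 2
    s = s.replace("?", "unknown")                                      # step 3 (exact match)
    s = s.mask(s.str.contains("IDP", case=False), "IDP")               # step 4
    return s.str.strip()                                               # step 5

# Invented sample values covering each case.
toy = pd.Series([
    "USA ?",
    "Mexico (meteorite)",
    "?",
    "IDP (interplanetary dust particle) over USA",
    " Russia ",
])

cleaned = clean_country(toy)
cleaned = cleaned[cleaned != "unknown"]  # step 6
print(cleaned.tolist())
```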

Plot the 20 countries richest in minerals:

In [None]:
countr['Country'].value_counts()[0:20].sort_values().plot(kind='barh', figsize=(10, 7))
plt.xlabel("Count of occurrences", labelpad=14)
plt.ylabel("Country", labelpad=14)
plt.title("Number of minerals by country", y=1.02);

**<span style="color:red">USA and Russia take first and second place, respectively.</span>**

# 5. Merge countries and minerals with expanded elements

* Keep only frequent countries (at least 140 rows) and frequent elements (at least 600 rows), dropping the rest:

In [None]:
countr = countr[countr['Country'].map(countr['Country'].value_counts()) >= 140]
elem = elem[elem['Element'].map(elem['Element'].value_counts()) >= 600]
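
The same frequency filter can be written with `groupby(...).transform("count")`, which attaches each group's size to every row. A minimal sketch on made-up toy data (the frame `toy` and the threshold `min_count` are illustrative assumptions):

```python
import pandas as pd

# Toy frame in the same shape as `countr`; values are invented.
toy = pd.DataFrame({
    "Mineral": ["a", "b", "c", "d", "e", "f"],
    "Country": ["USA", "USA", "USA", "Russia", "Russia", "Chile"],
})

min_count = 2  # illustrative threshold, not the notebook's value
freq = toy.groupby("Country")["Mineral"].transform("count")  # rows per country, aligned to toy
frequent = toy[freq >= min_count]
print(frequent)
```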

* Merge two dataframes (countries and elements) on `Mineral` column:

In [None]:
result = pd.merge(countr, elem, on='Mineral')
result
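
Because both tables can hold several rows per mineral, the merge produces one row for every country and element combination of that mineral. A small sketch on made-up toy data (the frames and the names `M1`, `M2`, `X`, `Y` are assumptions for illustration):

```python
import pandas as pd

# M1 has two countries and two elements; M2 has one of each. Values are invented.
countr_toy = pd.DataFrame({
    "Mineral": ["M1", "M1", "M2"],
    "Country": ["X", "Y", "X"],
})
elem_toy = pd.DataFrame({
    "Mineral": ["M1", "M1", "M2"],
    "Element": ["O", "H", "O"],
})

merged = pd.merge(countr_toy, elem_toy, on="Mineral")  # inner join by default
print(merged)
```

Here `M1` contributes 2 x 2 = 4 rows and `M2` contributes 1, so a mineral with several type-locality countries is counted once for each of them.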

Plot the count of minerals per country, split by element:

In [None]:
sns.catplot(x="Country", hue="Element", kind="count", palette="pastel", edgecolor=".6", data=result, height=8, aspect = 2)

**<span style="color:red">Broken down by country, H and O still show the highest bars.</span>**
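
The bar heights can also be read as numbers with `pd.crosstab`, which counts rows for each country and element pair. A minimal sketch on made-up toy data (the frame `result_toy` is an assumption standing in for `result`):

```python
import pandas as pd

# Toy frame in the same shape as `result`; values are invented.
result_toy = pd.DataFrame({
    "Country": ["USA", "USA", "Russia", "Russia", "USA"],
    "Element": ["O", "H", "O", "O", "O"],
})

table = pd.crosstab(result_toy["Country"], result_toy["Element"])  # rows: countries, columns: elements
print(table)
```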