# Statistical Analysis of Extracted Text Data

## 📌 Overview
This notebook processes previously extracted **xDD text snippets** and computes **word statistics**, including:
- **Co-occurrence statistics**: Measure how often words appear together.
- **Entropy and joint entropy**: Measure the information carried by individual words and by pairs of words.
- **Mutual information**: Measures how strongly two words or terms are associated.
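
As a rough illustration of how these three quantities relate, the sketch below computes entropy, joint entropy, and mutual information for the presence or absence of two words across a set of documents. The function names and the document-level presence/absence model are assumptions made for this example; the formulas actually used in `nlp_statistics` may differ.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def word_pair_information(n_w1, n_w2, n_both, n_docs):
    """Entropy, joint entropy and mutual information for the
    presence/absence of two words across n_docs documents."""
    p1, p2, p12 = n_w1 / n_docs, n_w2 / n_docs, n_both / n_docs
    h1 = entropy([p1, 1 - p1])
    h2 = entropy([p2, 1 - p2])
    # Joint distribution over the four presence/absence outcomes.
    joint = [p12, p1 - p12, p2 - p12, 1 - p1 - p2 + p12]
    h12 = entropy(joint)
    # Mutual information: I(w1; w2) = H(w1) + H(w2) - H(w1, w2).
    return {"H_w1": h1, "H_w2": h2, "H_joint": h12, "MI": h1 + h2 - h12}

# Two words that each occur in half of the documents, independently of each other.
stats = word_pair_information(n_w1=50, n_w2=50, n_both=25, n_docs=100)
```

In this sketch, independent words give a mutual information of zero, while words that always occur together give a mutual information equal to the entropy of either word.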


In [3]:
# Import core libraries
import os
import sys
import pandas as pd
!python -m spacy download en_core_web_sm

# Import custom functions
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..", "src")))

# Import the functions for statistics calculations
from nlp_statistics import process_all_terms

# Move the working directory one level up 
os.chdir(os.path.abspath(os.path.join(os.getcwd(), "..")))

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## 1. Define Data Directory & Processing Parameters

Extracted text is stored in the `data` folder. We will:
✔ Process **each term folder** separately.
✔ Compute **co-occurrence & TF-IDF scores**.
✔ Store results as CSV files inside each term folder.


In [6]:
root_folder = r"D:\MSc\data"
n = 1000000  # Number of most frequent words to analyze
total_docs = 16023367  # This number is the total number of documents in xDD at the time of the search.
overwrite = False  # Set to True to force reprocessing of terms previously searched.

## 2. Run Statistical Analysis

This step will:
- **Scan each term folder** for processed text files.
- **Run calculations in parallel** to improve efficiency.
- **Save results as CSV files** inside the corresponding term folder.
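
The steps above can be sketched roughly as follows. This is not the implementation of `process_all_terms`; `process_term`, `process_all`, and `max_workers` are hypothetical names, the statistics are replaced by a placeholder header row, and a thread pool stands in for whatever parallel mechanism the real function uses. Only the output file naming (`<term>_cooccurrence_stats.csv`) and the skip-unless-`overwrite` behaviour are taken from this notebook.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_term(term_folder, overwrite=False):
    """Hypothetical per-term worker: skip terms whose output already exists."""
    term = os.path.basename(term_folder)
    out_file = os.path.join(term_folder, f"{term}_cooccurrence_stats.csv")
    if os.path.exists(out_file) and not overwrite:
        return term, "skipped"
    # Placeholder for the real statistics; here we only write a header row.
    with open(out_file, "w", encoding="utf-8") as fh:
        fh.write("word_1,word_2,count\n")
    return term, "processed"

def process_all(root_folder, overwrite=False, max_workers=4):
    """Scan every term folder under root_folder and process them in parallel."""
    folders = sorted(
        entry.path for entry in os.scandir(root_folder) if entry.is_dir()
    )
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(lambda f: process_term(f, overwrite), folders))
```

Calling `process_all(root_folder)` twice in a row would report every term as `"processed"` the first time and `"skipped"` the second, mirroring the "already processed. Skipping." messages in the output below.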


In [7]:
process_all_terms(root_folder, n, total_docs, overwrite)

✅ abyssal plain already processed. Skipping.
✅ accretionary already processed. Skipping.
✅ accretionary prism already processed. Skipping.
✅ accretionary wedge already processed. Skipping.
✅ active already processed. Skipping.
✅ active margin already processed. Skipping.
✅ adakite already processed. Skipping.
✅ aegean already processed. Skipping.
✅ aeolian already processed. Skipping.
✅ affinity already processed. Skipping.
✅ age already processed. Skipping.
✅ ailik domain already processed. Skipping.
✅ aishihik plutonic suite already processed. Skipping.
✅ alaska already processed. Skipping.
✅ alberta group already processed. Skipping.
✅ aleutian already processed. Skipping.
✅ alexander terrane already processed. Skipping.
✅ algoma-type bif already processed. Skipping.
✅ alkaline already processed. Skipping.
✅ alkaline intrusion-hosted ree already processed. Skipping.
✅ alkaline rock already processed. Skipping.
✅ alluvial already processed.

Building word list for 'sask craton': 100%|██████████| 894/894 [00:03<00:00, 251.22 docs/s]

✅ Saved co-occurrence statistics to D:\MSc\data\sask craton\sask craton_cooccurrence_stats.csv
📊 Saved corpus stats for sask craton to D:\MSc\data\sask craton\sask craton_corpus_stats.csv
✅ scapolite already processed. Skipping.
✅ scheelite already processed. Skipping.
✅ schist already processed. Skipping.
✅ scoria already processed. Skipping.
✅ sea already processed. Skipping.
✅ sedex already processed. Skipping.
✅ sedex pb-zn already processed. Skipping.
✅ sediment already processed. Skipping.
✅ sediment-hosted cu already processed. Skipping.
✅ sediment-hosted vms already processed. Skipping.
✅ sedimentary already processed. Skipping.
✅ sedimentology already processed. Skipping.
✅ seismic survey already processed. Skipping.
✅ sequence stratigraphy already processed. Skipping.
✅ serpentine already processed. Skipping.
✅ serpentinite already processed. Skipping.
✅ setting already processed. Skipping.
✅ shale already processed. Skipping.
✅ shea




## 3. Review Extracted Statistics

Once processing is complete, each term folder will contain:
✔ `term_cooccurrence_stats.csv` → Co-occurrence frequencies.
✔ `term_tfidf.csv` → Term importance scores.

Let's now preview some of these results.


In [8]:
# Define the root directory where data is stored.
# The path is absolute, so joining it with os.getcwd() has no effect and is dropped.
root_folder = r"D:\MSc\data"

term_folder = os.path.join(root_folder, "volcanic arc")  # Example term
cooc_file = os.path.join(term_folder, "volcanic arc_cooccurrence_stats.csv")

if os.path.exists(cooc_file):
    df_cooc = pd.read_csv(cooc_file)
    print(f"✅ Successfully loaded {len(df_cooc)} co-occurrence records from {cooc_file}")
    display(df_cooc.head(10))
else:
    print("⚠ No co-occurrence data found. Ensure the processing step was completed successfully.")


✅ Successfully loaded 19765 co-occurrence records from D:\MSc\data\volcanic arc\volcanic arc_cooccurrence_stats.csv


Unnamed: 0,word_1,word_2,count,prob_w1w2
0,volcanic arc,granite,3981,5.743173e-08
1,volcanic arc,rock,3527,5.088212e-08
2,volcanic arc,active,2613,3.769633e-08
3,volcanic arc,subduction,2538,3.661435e-08
4,volcanic arc,basalt,2399,3.460907e-08
5,volcanic arc,basin,2367,3.414743e-08
6,volcanic arc,plate,2320,3.346938e-08
7,volcanic arc,central,2283,3.29356e-08
8,volcanic arc,continental,2267,3.270478e-08
9,volcanic arc,within,2218,3.199788e-08
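
The `Unnamed: 0` column in the preview is most likely the DataFrame index that was written to the CSV when the file was saved. A minimal sketch, using a small inline sample copied from the first rows of the preview rather than the real file, shows how passing `index_col=0` to `pd.read_csv` reads that column back as the index:

```python
import io
import pandas as pd

# Inline sample mirroring the first rows of the preview (not the full file).
sample_csv = io.StringIO(
    ",word_1,word_2,count,prob_w1w2\n"
    "0,volcanic arc,granite,3981,5.743173e-08\n"
    "1,volcanic arc,rock,3527,5.088212e-08\n"
)

# index_col=0 treats the first (unnamed) column as the index.
df_sample = pd.read_csv(sample_csv, index_col=0)
print(df_sample.columns.tolist())
```

The same argument could be added to the `pd.read_csv(cooc_file)` call in the cell above if the extra column is not wanted.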


### 🔹 What Are Co-Occurrence Statistics?

Co-occurrence statistics help identify **which words frequently appear together**. This allows us to:
- **Understand context**: Words that co-occur frequently may represent key relationships.
- **Improve keyword extraction**: Helps refine NLP models by identifying dominant term associations.
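
To make the idea concrete, here is a minimal sketch of counting co-occurrences for one search term. It assumes that a co-occurrence means "appears in the same document as the term" and uses naive whitespace tokenisation; `cooccurrence_counts` and the sample sentences are hypothetical, and the real pipeline (which uses spaCy) may count at a different granularity.

```python
from collections import Counter

def cooccurrence_counts(term, documents):
    """Count, for each word, the number of documents in which it appears with `term`."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        if term not in lowered:
            continue
        # Remove the term itself, then keep each remaining word once per document.
        words = set(lowered.replace(term, " ").split())
        counts.update(words)
    return counts

docs = [
    "The volcanic arc hosts granite and basalt",
    "Granite intrusions occur along the volcanic arc",
    "Basalt flows cover the basin",
]
counts = cooccurrence_counts("volcanic arc", docs)
```

Here `counts["granite"]` is 2 because granite appears with the term in two documents, while `counts["basin"]` is 0 because the third document does not mention the term at all. Function words such as "the" also score highly, which is why a real pipeline would typically filter stop words.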

---


## 4. Next Steps

Now that we have:  
✔ Extracted data (`1_xdd_data_extraction.ipynb`).  
✔ Processed word statistics (`2_calculate_statistics.ipynb`).  

📌 **Next Notebook: `3_DDB Import and Analysis.ipynb`**
- Import statistics into **PostgreSQL**.
- Run further calculations such as **mutual information**.
- Visualize and explore **semantic trends in geoscience literature**.

---
