# Bigfoot Sighting Natural Language Processing
## Data courtesy of [Tim Renner and the Bigfoot Field Researchers Organization](https://www.kaggle.com/datasets/thedevastator/unlocking-mysteries-of-bigfoot-through-sightings)

## Import necessary modules

In [46]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

## Pull down our selected model for encoding the sighting reports as vectors
### In this case we are using [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), a general-purpose sentence-embedding model fine-tuned from [mpnet-base](https://huggingface.co/microsoft/mpnet-base)

In [2]:
model = SentenceTransformer("all-mpnet-base-v2")
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
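The printed pipeline shows mean-token pooling (`pooling_mode_mean_tokens: True`) followed by `Normalize()`. As a rough illustration of what those two stages do, here is a minimal NumPy sketch on made-up token vectors. The function name `mean_pool_and_normalize` and the toy values are assumptions for illustration only, not the library's internal code.

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Average the vectors of real (non-padding) tokens, then scale to unit length."""
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # shape (tokens, 1)
    summed = (token_embeddings * mask).sum(axis=0)   # padding rows contribute zero
    count = max(mask.sum(), 1.0)                     # guard against an all-padding input
    pooled = summed / count
    return pooled / np.linalg.norm(pooled)           # L2 normalization

# Toy input: 4 token positions, 3 dimensions each; the last position is padding.
tokens = np.array([[1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [0.0, 0.0, 2.0],
                   [9.0, 9.0, 9.0]])
mask = np.array([1, 1, 1, 0])

sentence_vector = mean_pool_and_normalize(tokens, mask)
print(sentence_vector, np.linalg.norm(sentence_vector))
```

Because of the final normalization step, every sentence vector has length 1, so differences between reports show up only in direction. The `max_seq_length` of 384 in the output above also means that tokens beyond that limit in a long report do not contribute to its vector.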

## Load in the geocoded data that includes the reports

In [3]:
df = pd.read_csv("./data/bfro_reports_geocoded.csv")
df.head()

Unnamed: 0,index,observed,location_details,county,state,season,title,latitude,longitude,date,...,moon_phase,precip_intensity,precip_probability,precip_type,pressure,summary,uv_index,visibility,wind_bearing,wind_speed
0,0,I was canoeing on the Sipsey river in Alabama....,,Winston County,Alabama,Summer,,,,,...,,,,,,,,,,
1,1,Ed L. was salmon fishing with a companion in P...,East side of Prince William Sound,Valdez-Chitina-Whittier County,Alaska,Fall,,,,,...,,,,,,,,,,
2,2,"While attending U.R.I in the Fall of 1974,I wo...","Great swamp area, Narragansett Indians",Washington County,Rhode Island,Fall,Report 6496: Bicycling student has night encou...,41.45,-71.5,1974-09-20,...,0.16,0.0,0.0,,1020.61,Foggy until afternoon.,4.0,2.75,198.0,6.92
3,3,"Hello, My name is Doug and though I am very re...",I would rather not have exact location (listin...,York County,Pennsylvania,Summer,,,,,...,,,,,,,,,,
4,4,It was May 1984. Two friends and I were up in ...,"Logging roads north west of Yamhill, OR, about...",Yamhill County,Oregon,Spring,,,,,...,,,,,,,,,,


In [4]:
df.shape

(5021, 29)

## Check for nulls in the column we plan to encode, and split the data on whether `observed` is null

In [50]:
df_with_observed = df[~df["observed"].isna()].copy()
df_without_observed = df[df["observed"].isna()].copy()
df_with_observed.shape

(4983, 29)

In [51]:
df_without_observed.sample(1)

Unnamed: 0,index,observed,location_details,county,state,season,title,latitude,longitude,date,...,moon_phase,precip_intensity,precip_probability,precip_type,pressure,summary,uv_index,visibility,wind_bearing,wind_speed
1535,1535,,It happened in Bancroft park in Lansing,Ingham County,Michigan,Spring,Report 49621: Teen recounts possible encounter...,42.75195,-81.52805,2015-04-01,...,0.41,0.0,0.0,,1019.36,Mostly cloudy in the morning.,5.0,10.0,137.0,3.21
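In the cells above, 4983 of the 5021 rows have report text, which leaves 38 rows without it. The same mask-and-split pattern can be sketched on a toy frame; the column values below are invented, and `notna()` is used here as the positive form of the `~isna()` test.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the reports table.
toy = pd.DataFrame({
    "index": [0, 1, 2, 3],
    "observed": ["Heard knocking at night.", np.nan, "Saw tracks near the creek.", None],
})

has_text = toy["observed"].notna()      # True where there is text to encode
toy_with = toy[has_text].copy()         # rows that will be embedded
toy_without = toy[~has_text].copy()     # rows set aside and re-attached later

print(len(toy_with), len(toy_without))
```

Taking `.copy()` of each slice avoids pandas' chained-assignment warnings when a new column is added to either half later on.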


## Perform the encoding, transforming each report into a 768-dimensional vector

In [7]:
embeddings = model.encode(list(df_with_observed["observed"]))
embeddings

array([[ 0.02051164,  0.02096521,  0.02246091, ...,  0.10477722,
         0.00103394,  0.02230784],
       [-0.04386169,  0.04560776, -0.03324964, ...,  0.08863181,
         0.00317093,  0.01241465],
       [ 0.03663173,  0.0735049 ,  0.00363371, ...,  0.04705494,
         0.00171463, -0.00958929],
       ...,
       [ 0.03145492,  0.03121825, -0.00923622, ...,  0.06445298,
         0.00094692,  0.0089708 ],
       [ 0.02337824,  0.01296577,  0.00216981, ...,  0.07106277,
         0.0101575 ,  0.00332086],
       [ 0.02623024,  0.10074759,  0.00023831, ...,  0.03221185,
        -0.02743666, -0.01710202]], dtype=float32)

## Save the embeddings for later use, if desired

In [47]:
np.save("bigfoot_embeddings.npy", embeddings)
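Reloading the saved array later avoids re-running the encoder. A minimal round-trip sketch, using a random stand-in array and a temporary file (both assumptions, so the real `bigfoot_embeddings.npy` is not touched):

```python
import os
import tempfile
import numpy as np

# Stand-in for the real embeddings: 5 rows of 768 float32 values.
toy_embeddings = np.random.default_rng(0).random((5, 768), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_embeddings.npy")
    np.save(path, toy_embeddings)   # writes shape and dtype along with the data
    restored = np.load(path)        # reads the full array back into memory

print(restored.shape, restored.dtype)
```

The `.npy` file stores only the array, not which report each row belongs to, so the row order of `df_with_observed` needs to be kept alongside it.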

## Compute the mean embedding (the centroid of all report vectors) to use as the comparison point

In [30]:
mean_embed = embeddings.mean(axis=0)
mean_embed.shape

(768,)
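`axis=0` averages down the rows, so the result has one value per embedding dimension. A small sketch with made-up 2-dimensional unit vectors shows the shape, and also that the mean of unit vectors is generally shorter than unit length:

```python
import numpy as np

# Three hypothetical unit-length "embeddings" in 2 dimensions.
toy = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 0.0]])

centroid = toy.mean(axis=0)          # average over rows, one value per dimension
print(centroid.shape)
print(np.linalg.norm(centroid))      # less than 1 unless all rows point the same way
```

The shorter length does not affect the next step, since cosine similarity divides by the norms of both vectors.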

## Calculate the [cosine similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) of each embedding to the mean, and save the value to the DataFrame

In [52]:
# One vectorized call: every embedding row against the single mean vector
similarity_to_mean = cosine_similarity(embeddings, mean_embed.reshape(1, -1)).ravel()
df_with_observed["similarity_to_mean"] = similarity_to_mean
df_with_observed.sample(10)

Unnamed: 0,index,observed,location_details,county,state,season,title,latitude,longitude,date,...,precip_intensity,precip_probability,precip_type,pressure,summary,uv_index,visibility,wind_bearing,wind_speed,similarity_to_mean
3836,3836,Near Cresent City: a television production cre...,,Del Norte County,California,Summer,,,,,...,,,,,,,,,,0.590819
1306,1306,I was hunting hogs and felt I was being watche...,Near the Sabine River.,Wood County,Texas,Winter,Report 8376: Hog hunter encounters unknown ani...,32.69028,-95.61056,2004-03-25,...,0.0,0.0,,1021.64,Mostly cloudy throughout the day.,4.0,8.79,150.0,12.41,0.708542
510,510,10 of us rented a cabin in the woods Thursday ...,It was in Hocking Hills area in Logan Ohio. We...,Perry County,Ohio,Fall,Report 67239: Cabin renters have sighting outs...,39.59607,-82.35291,2020-10-31,...,0.0007,0.11,rain,1027.7,Clear throughout the day.,4.0,10.0,134.0,3.94,0.804978
137,137,I have always wanted to tell this story to som...,From Hubert you can take Hwy. 172 to where it ...,Onslow County,North Carolina,Summer,Report 9750: Motorist observes bipedal animal ...,34.53784,-77.50434,1999-07-16,...,0.0006,0.73,rain,1022.35,Partly cloudy until evening.,10.0,8.85,112.0,1.65,0.820102
3865,3865,I wanted to comment on your report case #8595....,,El Dorado County,California,Summer,"Report 8650: Brothers hear unusual, late-night...",38.71806,-120.5619,2004-05-10,...,0.001,0.62,rain,,Clear throughout the day.,0.0,,168.0,1.14,0.721236
2442,2442,"Found a fresh footprint approximately 18"" or s...",North Fork of the Skokomish River.,Mason County,Washington,Spring,Report 1567: Hikers find a large track,47.3542,-123.2334,1996-04-14,...,0.0,0.0,,1016.13,Mostly cloudy overnight.,7.0,9.9,182.0,0.8,0.509357
3048,3048,I was on the BFRO's West Virginia expedition t...,Along the Greenbrier River,Pocahontas County,West Virginia,Spring,Report 13083: Participant observations during ...,38.10785,-80.19044,2005-04-09,...,0.0,0.0,,,Clear throughout the day.,9.0,10.0,93.0,1.57,0.701987
1029,1029,My name is Dustin Anderson and my friends name...,"Toward the back of Mooormill,Out Lampa Valley ...",Coos County,Oregon,Summer,,,,,...,,,,,,,,,,0.684169
3340,3340,My name is T.W. and I live in Arkansas. Befor...,Right beside my house. I would say it had to h...,Conway County,Arkansas,Summer,,,,,...,,,,,,,,,,0.736541
4985,4985,In the fall of 1978 we were in the house eatin...,It's been a long time since we lived there. If...,Henry County,Kentucky,Fall,"Report 2383: Tall, hairy creature observed sta...",38.44148,-85.0016,1978-09-30,...,,,,,,,,,,0.833205
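For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it for every row against one reference vector in plain NumPy; the helper name `cosine_to_reference` and the toy vectors are illustrative assumptions.

```python
import numpy as np

def cosine_to_reference(vectors, reference):
    """Cosine similarity of each row of `vectors` to a single `reference` vector."""
    dots = vectors @ reference                                   # one dot product per row
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(reference)
    return dots / norms

toy = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
centroid = toy.mean(axis=0)          # [2/3, 2/3]

scores = cosine_to_reference(toy, centroid)
print(scores)
```

In this toy case the third row points in exactly the same direction as the centroid and scores 1.0, while the two axis-aligned rows score about 0.707. Read the same way, a low `similarity_to_mean` marks a report whose wording is unlike the bulk of the collection.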


## Add a `similarity_to_mean` column, filled with NaN, to the DataFrame of rows without `observed` text

In [53]:
df_without_observed["similarity_to_mean"] = np.nan

## Union and sort the DataFrames

In [55]:
final_df = pd.concat([df_with_observed, df_without_observed]).sort_values("index")
final_df.shape


(5021, 30)
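The split, score, and re-union steps can also be expressed without splitting at all, by writing the scores into the original frame through the same boolean mask. The sketch below runs both routes on a toy frame and checks that they agree; the scores are invented placeholders, not model output.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "index": [0, 1, 2, 3],
    "observed": ["report a", None, "report b", "report c"],
})
has_text = toy["observed"].notna()
toy_scores = np.array([0.81, 0.64, 0.72])   # hypothetical, one per row with text

# Route A: split, add the column to each half, concatenate, sort back into order.
with_text = toy[has_text].copy()
without_text = toy[~has_text].copy()
with_text["similarity_to_mean"] = toy_scores
without_text["similarity_to_mean"] = np.nan
route_a = pd.concat([with_text, without_text]).sort_values("index")

# Route B: start from NaN everywhere, then fill only the rows that have text.
route_b = toy.copy()
route_b["similarity_to_mean"] = np.nan
route_b.loc[has_text, "similarity_to_mean"] = toy_scores

print(route_a["similarity_to_mean"].tolist())
print(route_b["similarity_to_mean"].tolist())
```

Route B relies on the scores being in the same row order as the masked rows, which holds here because the embeddings were produced from the masked column in order.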

## Save the final DataFrame to CSV

In [57]:
final_df.to_csv("bfro_reports_geocoded_with_similarity.csv", index=False)
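One possible follow-up is to reload the file and list the reports that are least similar to the mean. The sketch below uses an in-memory buffer and invented rows in place of the real CSV, so the file name and values are assumptions.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical stand-in for the saved table.
toy = pd.DataFrame({
    "index": [0, 1, 2, 3],
    "state": ["Ohio", "Texas", "Oregon", "Alaska"],
    "similarity_to_mean": [0.80, np.nan, 0.51, 0.71],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)     # index=False keeps the row labels out of the file
buffer.seek(0)
reloaded = pd.read_csv(buffer)      # NaN is written as an empty field and read back as NaN

# Drop rows that had no text to score, then take the lowest similarities.
least_typical = (
    reloaded.dropna(subset=["similarity_to_mean"])
            .sort_values("similarity_to_mean")
            .head(2)
)
print(least_typical)
```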