lsgordon@seas.upenn.edu | github.com/lsgordon
Hello!
My name is Leo. I'm a first-year grad student at UPenn and a senior at Haverford College. I'm interested in Machine Learning, Data Engineering, and Data Science. These projects are the best examples of my work. Feel free to reach out with any questions you have.
BirdCLEF+ 2025: Species Classification w/ YAMNet | GitHub May 2025
- Developed a robust system for classifying 206 endangered and under-studied animal species (including birds, amphibians, mammals, and insects) from the Middle Magdalena Valley of Colombia using audio recordings for the BirdCLEF 2025 competition.
- Engineered a comprehensive data preprocessing pipeline: converted audio files to mono, standardized sample rates by downsampling to 22 kHz using scipy.signal.resample, and implemented custom functions for data cleaning and handling irregularities (preprocessing sketch below).
- Created fixed-length audio representations (1000 samples per segment) by padding shorter sequences and chunking/padding longer ones, preparing data for consistent model input.
- Utilized the pre-trained YAMNet model from TensorFlow Hub for feature extraction, transforming raw audio segments into 1024-dimensional embeddings.
- Addressed significant class imbalance in the BirdCLEF 2025 dataset by strategically downsampling processed audio segments, limiting each of the 206 classes to a maximum of 1000 samples for training the final model.
- Designed, built, and trained a deep neural network using the Keras functional API. The architecture included an input layer for 1024-dimensional embeddings, multiple Dense layers (512, 256, 128 units) with ReLU activations, Layer Normalization for improved stability, Dropout (0.2, 0.3) for regularization, and a residual connection, leading to a Softmax output layer for multi-class classification.
- Optimized model training by employing EarlyStopping (monitoring val_loss with a patience of 5 epochs), which helped prevent overfitting and restored the best-performing model weights (classifier sketch below).
- Achieved a test accuracy of approximately 71% and a test loss of 1.14 on the classification of 206 species.
- Leveraged parallel processing with concurrent.futures to expedite the audio data reading and initial preprocessing stages (parallelization sketch below).
- Technologies: Python, Keras, TensorFlow, TensorFlow Hub, Scikit-learn, Pandas, NumPy, SoundFile, SciPy, Matplotlib, tqdm, concurrent.futures.
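For reference, a minimal sketch of the preprocessing described above. The sample rate and segment length follow the bullet points; file handling and helper names are illustrative, not the project's exact code.

```python
import numpy as np
import soundfile as sf
from scipy.signal import resample

TARGET_SR = 22_000      # target sample rate noted above
SEGMENT_LEN = 1_000     # fixed number of samples per segment

def load_mono(path):
    """Read an audio file and collapse it to a single channel."""
    audio, sr = sf.read(path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)          # average channels -> mono
    return audio.astype(np.float32), sr

def to_target_rate(audio, sr):
    """Downsample to TARGET_SR with scipy.signal.resample."""
    n_out = int(len(audio) * TARGET_SR / sr)
    return resample(audio, n_out)

def to_segments(audio):
    """Pad short clips and chunk long ones into fixed-length segments."""
    n_chunks = max(1, int(np.ceil(len(audio) / SEGMENT_LEN)))
    padded = np.pad(audio, (0, n_chunks * SEGMENT_LEN - len(audio)))
    return padded.reshape(n_chunks, SEGMENT_LEN)
```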
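And a sketch of the embedding/classifier stage. The layer sizes, dropout rates, and EarlyStopping settings match the description above; the exact layer ordering and the placement of the residual connection are assumptions rather than the project's exact architecture.

```python
import tensorflow_hub as hub
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 206

# Pre-trained YAMNet from TF Hub; calling it on a waveform returns
# (scores, embeddings, spectrogram), with 1024-d embeddings per frame.
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

inputs = keras.Input(shape=(1024,))                      # YAMNet embedding
x = layers.Dense(512, activation="relu")(inputs)
x = layers.LayerNormalization()(x)
x = layers.Dropout(0.2)(x)
skip = layers.Dense(256, activation="relu")(x)
x = layers.Dense(256, activation="relu")(skip)
x = layers.Add()([x, skip])                              # residual connection
x = layers.LayerNormalization()(x)
x = layers.Dropout(0.3)(x)
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",    # assumes integer labels
              metrics=["accuracy"])

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[early_stop])
```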
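The parallel read/preprocess step, assuming the helpers from the preprocessing sketch (process_file here is a hypothetical wrapper):

```python
from concurrent.futures import ProcessPoolExecutor

def process_file(path):
    """Read one recording and return its fixed-length segments."""
    audio, sr = load_mono(path)                  # helpers from the sketch above
    return to_segments(to_target_rate(audio, sr))

def preprocess_all(paths, max_workers=8):
    """Fan the per-file work out across worker processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_file, paths))
```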
Big Data Final Project | GitHub April 2025
- Developed a model to differentiate between real and fake news articles using linguistic features.
- Engineered features by calculating various text statistics (e.g., Coleman-Liau index, SMOG index, average sentence length, subjectivity, sentiment, word count, syllable count, Flesch reading ease, and Flesch-Kincaid grade level).
- Implemented Logistic Regression, XGBoost, LSTM, and TF-IDF-based classification models, evaluating their performance with accuracy, F1-score, and confusion matrices; the best models reached 99% accuracy (baseline sketched below).
- Technologies: Python, Pandas, Statsmodels, Textstat, TextBlob, Scikit-learn, XGBoost, Matplotlib, Seaborn, SHAP
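For reference, a minimal sketch of the readability/sentiment feature extraction and a TF-IDF + Logistic Regression baseline (the XGBoost and LSTM variants are not shown; variable names and pipeline settings are assumptions):

```python
import textstat
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, f1_score

def readability_features(text):
    """Per-article linguistic features like those listed above."""
    blob = TextBlob(text)
    return {
        "coleman_liau": textstat.coleman_liau_index(text),
        "smog": textstat.smog_index(text),
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        "syllables": textstat.syllable_count(text),
        "words": textstat.lexicon_count(text),
        "subjectivity": blob.sentiment.subjectivity,
        "sentiment": blob.sentiment.polarity,
    }

# TF-IDF + Logistic Regression baseline over the raw article text
tfidf_logreg = make_pipeline(TfidfVectorizer(max_features=20_000),
                             LogisticRegression(max_iter=1_000))
# tfidf_logreg.fit(train_texts, train_labels)
# preds = tfidf_logreg.predict(test_texts)
# print(accuracy_score(test_labels, preds), f1_score(test_labels, preds))
```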
Hidden Markov Model For Stock Prediction | GitHub Jan 2025
- Designed a Hidden Markov Model class from scratch and applied it to predict the $AAPL stock price (forward-algorithm core sketched below).
- Technologies: Pandas, Refinitiv, Python
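A minimal sketch of the from-scratch HMM core (the forward algorithm for a discrete-observation model); the real project adds parameter fitting and the $AAPL prediction step, and the numbers below are illustrative only:

```python
import numpy as np

class HiddenMarkovModel:
    def __init__(self, start_probs, trans_probs, emit_probs):
        self.pi = np.asarray(start_probs)    # (n_states,)
        self.A = np.asarray(trans_probs)     # (n_states, n_states)
        self.B = np.asarray(emit_probs)      # (n_states, n_symbols)

    def forward(self, observations):
        """Likelihood of an observation sequence via the forward algorithm."""
        alpha = self.pi * self.B[:, observations[0]]
        for obs in observations[1:]:
            alpha = (alpha @ self.A) * self.B[:, obs]
        return alpha.sum()

# Example: two hidden regimes, three discretized daily-return buckets
hmm = HiddenMarkovModel([0.6, 0.4],
                        [[0.7, 0.3], [0.4, 0.6]],
                        [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(hmm.forward([0, 2, 1]))
```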
J.P. Morgan Data for Good MKE Fellows Project | Pres Link October 2024
- Won first place for our proposal recommending the next city for the MKE Fellows non-profit to expand into.
- Built k-means and agglomerative clustering models, engineered census data, and designed the presentation in 24 hours (clustering step sketched below).
- Technologies: Pandas, Python, Plotly
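A minimal sketch of the clustering step, assuming scikit-learn-style models and hypothetical census feature columns (the actual notebook also builds the Plotly visuals):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering

def cluster_cities(df: pd.DataFrame, feature_cols, k=5):
    """Standardize census features and label each city with both clusterings."""
    X = StandardScaler().fit_transform(df[feature_cols])
    df = df.copy()
    df["kmeans_label"] = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    df["agglo_label"] = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    return df
```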
D3 Track Times | GitHub May 2024
- Built a dual MongoDB/PostgreSQL database of over 300K track records, queried to find an athlete's percentile rank within D3 (example query sketched below).
- Used by several D3 teams in the PA area to aid recruiting and model expected times.
- Technologies: HTML, CSS, JS, PSQL, Mongo
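The site itself is JS-backed, but the percentile-rank lookup boils down to a window-function query like this Python/psycopg2 sketch (table and column names are assumptions):

```python
import psycopg2

SQL = """
SELECT percent_rank
FROM (
    SELECT athlete_id,
           PERCENT_RANK() OVER (PARTITION BY event ORDER BY time_seconds) AS percent_rank
    FROM results
    WHERE event = %(event)s
) ranked
WHERE athlete_id = %(athlete_id)s;
"""

def d3_percentile(conn, event, athlete_id):
    """Return an athlete's percentile rank within one event, or None if absent."""
    with conn.cursor() as cur:
        cur.execute(SQL, {"event": event, "athlete_id": athlete_id})
        row = cur.fetchone()
        return row[0] if row else None

# conn = psycopg2.connect("dbname=d3track")
# print(d3_percentile(conn, "400m", 12345))
```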