Data Science Portfolio by Gregory Hilston

This repository acts as a high-level overview of the data science projects I've worked on. Each project's README.md serves as a full description of the project, but I'll provide a short description here for convenience. All of these projects are in Python. Links are provided to projects whose code can be made public.

Kaggle Competitions

  • Talking Data Ad Tracking Fraud Detection Challenge
    • Classification challenge on mobile advertisement data where fraudulent clicks were the class of interest. I spent most of my time on exploratory data analysis (EDA) and used a random forest classifier for my submission. The data set was over 135 million rows, so I used an Elastic Compute Cloud (EC2) instance on Amazon Web Services (AWS) for processing. A sketch of the modeling step follows below.
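
A minimal sketch of that modeling step, assuming the competition's train.csv columns (ip, app, device, os, channel, is_attributed); the file path, row cap and hyperparameters here are illustrative, not the exact submission:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Read only the needed columns, with compact dtypes, to keep memory manageable.
dtypes = {"ip": "uint32", "app": "uint16", "device": "uint16",
          "os": "uint16", "channel": "uint16", "is_attributed": "uint8"}
train = pd.read_csv("train.csv", usecols=list(dtypes), dtype=dtypes,
                    nrows=5_000_000)  # subsample; the full file is ~135M rows

X = train.drop(columns="is_attributed")
y = train["is_attributed"]
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print("validation AUC:", roc_auc_score(y_valid, clf.predict_proba(X_valid)[:, 1]))
```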

Visualizations

  • Meteorites
    • Visualized geospatial meteorite data using pandas, seaborn, matplotlib, numpy and folium (see the folium sketch after this list).
  • It Is Wednesday My Dudes Google Trends
    • Investigated Google Trends data with pandas and matplotlib to visually demonstrate a time series trend.
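
A minimal sketch of the folium piece of the Meteorites project, assuming the NASA meteorite landings CSV with its name, mass (g), reclat and reclong columns; the file name and marker cap are illustrative:

```python
import folium
import pandas as pd

# Drop rows with no coordinates before plotting.
df = pd.read_csv("meteorite_landings.csv").dropna(subset=["reclat", "reclong"])

m = folium.Map(location=[0, 0], zoom_start=2)
for _, row in df.head(500).iterrows():  # cap markers so the map stays responsive
    folium.CircleMarker(
        location=[row["reclat"], row["reclong"]],
        radius=3,
        popup=f'{row["name"]} ({row["mass (g)"]} g)',
    ).add_to(m)
m.save("meteorites.html")  # open in a browser to explore the landings
```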

Presentations

  • Markov Chains
    • This presentation was a fun, interactive way to explain what Markov Chains are and how they can be used to generate new sentences, given a corpus (a minimal generator sketch follows this list). The text of books and scripts of television shows were used to demonstrate having a corpus, and sentences were generated against the entire corpus as well as per character. Additionally, all Slack messages sent in the channels #general and #random were used to demonstrate generating sentences for each coworker. Two small games were played:
      1. Was this sentence actually sent by this person, or is it generated?
      2. Given these five generated sentences, which person was this Markov Chain trained on?
  • Pandas
    • This presentation was not a lecture; rather, participants broke into small groups and walked through a Jupyter Notebook together. The Notebook was broken up into sections, some of which involved simply reading and executing lines of code to learn the basics. The last section posed twenty challenges to familiarize the user with pandas commands.
  • Word Embeddings
    • Discusses what word embeddings are and what they can be used for. Combines a lecture format with an IPython Notebook that can be used independently by viewers or as part of the presentation (a small demo follows this list).
  • Big Data Processing: Using Hadoop, Hive and Pig
  • Generative Adversarial Networks
    • A high-level summary of what Generative Adversarial Networks are and how they can be used. Best given to an audience with little to no data science background, as it is interesting and easy to follow without relevant technical experience.
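
For the Markov Chains talk, the generation idea fits in a few lines. This is a minimal sketch, not the presentation's actual code; the order parameter and toy corpus are illustrative:

```python
import random
from collections import defaultdict

def build_chain(corpus, order=2):
    """Map each n-gram of words to the words observed to follow it."""
    words = corpus.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=15):
    """Walk the chain from a random start state, sampling followers."""
    state = random.choice(list(chain))
    out = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(tuple(out[-len(state):]))
        if not followers:  # dead end: no observed continuation
            break
        out.append(random.choice(followers))
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran off the mat"
print(generate(build_chain(corpus, order=1)))
```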
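
And for the Word Embeddings talk, a small example of the kind of demo such a Notebook walks through, here using gensim's downloader with one of its published pre-trained GloVe models; the word choices are illustrative:

```python
import gensim.downloader

# Downloads ~66MB of pre-trained GloVe vectors on first use.
vectors = gensim.downloader.load("glove-wiki-gigaword-50")

print(vectors.most_similar("coffee", topn=3))   # nearest neighbors in vector space
print(vectors.similarity("king", "queen"))      # cosine similarity of two words
# The classic analogy: king - man + woman is approximately queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```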

Natural Language Processing (NLP)

  • Most Important Phrases In Corpus
    • Wrote a production-ready, microservice-based system to find the N "important" sentences that would summarize a given corpus. Used TF-IDF to discover "important" sentences and PageRank to sort them (a sketch of the scoring core follows this list). An API accepted jobs and posted them to Amazon Web Services (AWS) Simple Queue Service (SQS), while N drones pulled down jobs, processed the request and POSTed the results back to the requester. Leveraged nltk, sklearn, numpy, flask and boto3. This also served as an opportunity to create an example project for the data science team, on using:
      • Docker
      • linting (prospector and mypy)
      • Git hooks (to run linting)
      • Continuous integration / continuous deployment (CircleCI)
      • Unit tests (pytest)
  • User Review Sentiment Analysis
    • Implemented a cron polling system that processes the sentiment of user reviews using flask and Google Cloud's Language API (a minimal call sketch follows this list).
  • Inter Company Categorization Mapping
    • Automated the mapping of data from one company's categorization to another's, using gensim and GloVe vectors.
  • Synonym Generator
    • Generated synonyms programmatically using nltk, spacy, gensim and GloVe vectors to improve smart search capabilities (a WordNet sketch follows this list).
  • Business Classifier
    • Classified business data using keras, spacy, gensim, sklearn and neural networks, saving years of projected manual labor.
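
A minimal sketch of one way the scoring core of Most Important Phrases In Corpus can combine TF-IDF and PageRank, using sklearn vectors and networkx's PageRank over a sentence-similarity graph; it omits the Flask API and SQS plumbing, and the function name and toy corpus are illustrative:

```python
import networkx as nx
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")
from sklearn.feature_extraction.text import TfidfVectorizer

def summarize(corpus, n=3):
    """Return the n sentences PageRank rates most central, in document order."""
    sentences = sent_tokenize(corpus)
    tfidf = TfidfVectorizer().fit_transform(sentences)
    # Rows are L2-normalized, so the dot product is cosine similarity.
    similarity = (tfidf * tfidf.T).toarray()
    scores = nx.pagerank(nx.from_numpy_array(similarity))
    ranked = sorted(range(len(sentences)), key=scores.get, reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]

corpus = ("Our drones poll SQS for jobs. Each job holds a corpus to summarize. "
          "TF-IDF scores the sentences and PageRank orders them. "
          "The top sentences are POSTed back to the requester.")
print(summarize(corpus, n=2))
```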
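
For User Review Sentiment Analysis, the Google Cloud call itself is small. A sketch assuming the google-cloud-language client library and configured application credentials; the function name and sample review are illustrative:

```python
from google.cloud import language_v1

def review_sentiment(text):
    """Return the document-level sentiment score (-1.0 to 1.0) for one review."""
    client = language_v1.LanguageServiceClient()  # uses GOOGLE_APPLICATION_CREDENTIALS
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT)
    response = client.analyze_sentiment(request={"document": document})
    return response.document_sentiment.score

print(review_sentiment("The food was wonderful but the wait was far too long."))
```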
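
And a minimal sketch of the WordNet half of the Synonym Generator (the GloVe half would query nearest neighbors in a gensim KeyedVectors model in the same spirit); the function name is illustrative:

```python
from nltk.corpus import wordnet  # requires nltk.download("wordnet")

def wordnet_synonyms(word):
    """Collect lemma names across every WordNet synset for the word."""
    return {lemma.name().replace("_", " ")
            for synset in wordnet.synsets(word)
            for lemma in synset.lemmas()} - {word}

print(wordnet_synonyms("quick"))
```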

Dashboard

  • Data Dashboard
    • Created an internal dashboard using folium and dash to generate visualizations, allowing leadership to make data-driven decisions instead of going off gut reactions (a minimal sketch follows this list).
      • An example of this would be suggesting which US cities our product could target next by accounting for population, number of businesses and other data points, rather than just choosing by gut.
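
A minimal sketch of a dash app in the spirit of the Data Dashboard, with made-up city metrics standing in for the internal data:

```python
import dash
import pandas as pd
import plotly.express as px
from dash import dcc, html

# Hypothetical city-level metrics; the real dashboard pulled internal data.
df = pd.DataFrame({
    "city": ["Denver", "Austin", "Portland"],
    "population": [715_000, 965_000, 650_000],
    "businesses": [33_000, 48_000, 29_000],
})

app = dash.Dash(__name__)
app.layout = html.Div([
    html.H2("Candidate cities"),
    dcc.Graph(figure=px.scatter(df, x="population", y="businesses", text="city")),
])

if __name__ == "__main__":
    app.run(debug=True)  # serves the dashboard locally
```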

Image Processing

  • Google Vision API Wrapper
    • Developed a wrapper for Google Cloud's Vision API to filter inappropriate user-submitted images, detect memes and determine the percentage of an image occupied by text. Written using flask and Pillow, deployed to AWS Lambda using zappa. A sketch of the SafeSearch call follows below.
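
A minimal sketch of the SafeSearch piece, assuming the google-cloud-vision client library and configured credentials; the helper name and threshold are illustrative, and the text-percentage logic is omitted:

```python
from google.cloud import vision

def is_appropriate(image_bytes, threshold=vision.Likelihood.LIKELY):
    """Flag an image if SafeSearch rates adult or violent content
    at or above the given likelihood."""
    client = vision.ImageAnnotatorClient()
    response = client.safe_search_detection(image=vision.Image(content=image_bytes))
    annotation = response.safe_search_annotation
    return annotation.adult < threshold and annotation.violence < threshold

with open("upload.jpg", "rb") as f:  # hypothetical user-submitted image
    print(is_appropriate(f.read()))
```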

Recommenders

  • Geolocation Item Based Recommender
    • Researched item- and user-based recommenders for a review aggregation platform using pandas, sqlalchemy and turicreate. Ended up with a basic model that accounted for the user's geolocation when producing recommendations (a sketch follows below).
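
A minimal sketch of a turicreate item-based baseline, with toy ratings standing in for the platform's review data; the geolocation re-ranking is only noted in a comment:

```python
import turicreate as tc

# Hypothetical ratings data: one row per (user, business) review.
ratings = tc.SFrame({
    "user_id":     ["a", "a", "b", "b", "c"],
    "business_id": ["x", "y", "x", "z", "y"],
    "rating":      [5, 3, 4, 2, 5],
})

model = tc.item_similarity_recommender.create(
    ratings, user_id="user_id", item_id="business_id", target="rating")

# Top 3 suggestions per user; the production model then re-ranked
# candidates by distance from the user's geolocation.
print(model.recommend(users=["a", "b"], k=3))
```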

Data Munging (Cleaning)

While most, if not all, of these projects have a data munging aspect, the following were exclusively data munging efforts.

  • Hours Regularization
    • Took previously unformatted hours data, collected with no front-end validation, and wrote a large amount of regex to parse it into a structured format. Successfully parsed ~2.3 million pieces of data, saving the company time and money. The newly formatted data can now be used programmatically for many new features. Leveraged pandas for data processing and ray for parallelizing operations, taking execution from 1.5 hours to about 15 minutes (a sketch of the pattern follows below).
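
A minimal sketch of the parse-and-parallelize pattern, with one toy regex standing in for the project's much larger set; the chunk size and sample data are illustrative:

```python
import re
import ray

ray.init()

# One of many patterns, e.g. "Mon-Fri 9am-5pm"; the real project used far more.
PATTERN = re.compile(
    r"(?P<days>\w{3}-\w{3})\s+(?P<open>\d{1,2}[ap]m)-(?P<close>\d{1,2}[ap]m)")

@ray.remote
def parse_chunk(rows):
    """Parse one chunk of raw hours strings into structured dicts."""
    results = []
    for raw in rows:
        match = PATTERN.search(raw)
        results.append(match.groupdict() if match else None)
    return results

rows = ["Mon-Fri 9am-5pm", "open late", "Sat-Sun 10am-4pm"] * 1000
chunks = [rows[i:i + 500] for i in range(0, len(rows), 500)]
parsed = [r for chunk in ray.get([parse_chunk.remote(c) for c in chunks])
          for r in chunk]
print(sum(r is not None for r in parsed), "of", len(parsed), "parsed")
```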

Games

  • Decision Tree Based Tic Tac Toe Player
    • Created a driver to play Tic Tac Toe against an opponent that placed marks randomly. Used games against the random opponent to record what a good player, me, would do. Used this data set to train a Decision Tree Classifier, and then had the new model play. The model fell back on randomness when the small data set fell short. Done using pandas and scikit-learn (a sketch follows below).
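
A minimal sketch of the training and fallback logic, with a toy handful of recorded (board, move) pairs standing in for the real data set:

```python
import random
from sklearn.tree import DecisionTreeClassifier

# Boards are 9-vectors: 0 empty, 1 my mark, -1 opponent's. Each training
# row pairs a board with the cell the good player chose (toy data here).
boards = [[0]*9, [1, -1, 0, 0, 0, 0, 0, 0, 0], [1, -1, 0, 0, 1, -1, 0, 0, 0]]
moves = [4, 4, 8]  # e.g. take the center, then complete a diagonal

model = DecisionTreeClassifier().fit(boards, moves)

def choose_move(board):
    """Ask the tree for a move; fall back to random when it picks a taken cell."""
    move = int(model.predict([board])[0])
    if board[move] != 0:
        move = random.choice([i for i, cell in enumerate(board) if cell == 0])
    return move

print(choose_move([1, -1, 0, 0, 0, 0, 0, 0, 0]))
```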

Book List
