A lightweight machine learning project that builds a CNN-based file type classifier inspired by Google's Magika. The model distinguishes JavaScript files from other code types using byte-level analysis.

    pip install requests tqdm tensorflow numpy scikit-learn python-dotenv seaborn

To avoid GitHub API rate limits, create a personal access token:

- Go to GitHub Settings → Developer settings → Personal access tokens
- Generate a new token (classic) with the `public_repo` scope
- Create a `.env` file in the project root:

      GITHUB_TOKEN=ghp_your_token_here

Or set the environment variable directly in your shell:

    export GITHUB_TOKEN=ghp_your_token_here

Without a token, you're limited to 60 API requests per hour.

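
As a rough illustration of how the token might be picked up in code, the sketch below reads `GITHUB_TOKEN` from the environment and builds request headers. The helper name `github_headers` is hypothetical and not part of this project; it falls back to unauthenticated headers when no token is set.

```python
import os


def github_headers(token=None):
    """Build headers for GitHub REST API calls (hypothetical helper)."""
    # Use the explicit argument if given, otherwise the environment variable.
    if token is None:
        token = os.environ.get("GITHUB_TOKEN")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # Authenticated requests get a much higher hourly rate limit.
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

If you keep the token in `.env`, calling `load_dotenv()` from `python-dotenv` before this helper should make the variable visible through `os.environ`. The returned dictionary can then be passed as `headers=` to `requests.get`.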
First, we will implement a basic version of the Magika model to distinguish JavaScript files from other file types:
- Data Collection (JS and non-JS files from GitHub)
- Data Preprocessing
- Implement the CNN model from the Magika paper
- Train and Evaluate the Model
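
To make the preprocessing step more concrete, here is a minimal sketch of one way to turn raw file bytes into a fixed-length integer sequence that a CNN could consume. It is a simplification, not the project's actual pipeline: the function name `bytes_to_features`, the window size of 512, the head-plus-tail selection, and the padding value 256 are all assumptions chosen for illustration.

```python
PAD = 256  # assumed padding token, outside the valid byte range 0-255


def bytes_to_features(data, window=512):
    """Encode the first and last `window` bytes as a fixed-length int list."""
    head = list(data[:window])   # leading bytes as integers 0-255
    tail = list(data[-window:])  # trailing bytes (whole file if it is short)
    # Right-pad the head and left-pad the tail so both are exactly `window` long.
    head = head + [PAD] * (window - len(head))
    tail = [PAD] * (window - len(tail)) + tail
    return head + tail
```

Every file, regardless of size, maps to a list of `2 * window` integers in the range 0 to 256, so the model's embedding layer would need a vocabulary of 257 entries under these assumptions.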