Data Science - Assignment 1

Sean Herman, Daniel Reidler

Setup

This project was developed with Python 2.7 and Spark 1.6.0 (built for Hadoop 2.6). First, install the project's Python dependencies with pip:

$ pip install -r requirements.txt

Criteo Data

This project analyzes data from Kaggle's Display Advertising Challenge. Update config.py so that DAC_FILES_PATH points to the directory containing the train.txt file from the Criteo archive. Also set SPLIT_FILES_PATH, RESULTS_PATH, and MODELS_PATH to the storage locations for the train file splits, the result charts, and the saved trained models, respectively.
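As a rough illustration, the four settings in config.py might look like the sketch below. Only the four variable names come from this README; BASE_DIR and every directory name are hypothetical placeholders to be replaced with local paths.

```python
import os

# Hypothetical base directory; point this wherever the Criteo archive was extracted.
BASE_DIR = os.path.expanduser("~/criteo")

# Directory that must contain train.txt from the Criteo archive.
DAC_FILES_PATH = os.path.join(BASE_DIR, "dac")

# Output locations (directory names are assumptions, not the project's defaults).
SPLIT_FILES_PATH = os.path.join(BASE_DIR, "splits")   # train file splits
RESULTS_PATH = os.path.join(BASE_DIR, "results")      # result charts
MODELS_PATH = os.path.join(BASE_DIR, "models")        # saved trained models
```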

Run the split.py script to split Criteo's train.txt dataset into a test set, test.txt (approx. 38 million rows), and a training set (approx. 10 million rows). The training set is further divided into train_5m.txt, test_3m.txt, and validation_2m.txt.

$ ./split.py
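The README does not describe how split.py assigns rows, so the following is only a sketch of one common approach: draw a uniform random number per line and route the line to an output file in proportion to the approximate row counts given above. The function names, the weights, and the seed are all assumptions, not the script's actual logic.

```python
import random

# Approximate target sizes in millions of rows, taken from the README text.
SPLITS = [
    ("test.txt", 38.0),
    ("train_5m.txt", 5.0),
    ("test_3m.txt", 3.0),
    ("validation_2m.txt", 2.0),
]

def pick_split(u, splits=SPLITS):
    """Map a uniform draw u in [0, 1) to an output file name by weight."""
    total = sum(weight for _, weight in splits)
    cumulative = 0.0
    for name, weight in splits:
        cumulative += weight / total
        if u < cumulative:
            return name
    return splits[-1][0]  # guard against floating-point round-off

def split_lines(lines, seed=0):
    """Distribute lines across the splits; returns {file name: [lines]}."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    out = dict((name, []) for name, _ in SPLITS)
    for line in lines:
        out[pick_split(rng.random())].append(line)
    return out
```

For a file the size of train.txt, the lines would be streamed to open file handles rather than collected in lists; the in-memory version here only keeps the sketch short.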

Summary Statistics Instructions

Open the hw1_summary_statistics.ipynb notebook.

Running the notebook computes histograms for the integer and categorical features, along with summary statistics (mean, standard deviation, skewness, kurtosis) for the integer features.
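For reference, the four summary statistics can be computed from central moments as in the sketch below. It is plain Python rather than the notebook's Spark code, and it assumes population (biased) moments and excess kurtosis; the notebook may use different conventions. The function name summary_stats is hypothetical.

```python
import math

def summary_stats(values):
    """Return mean, std, skewness, and excess kurtosis of a list of numbers.

    Uses population moments (divide by n), which is an assumption here.
    """
    n = float(len(values))
    mean = sum(values) / n
    # Central moments of order 2, 3, and 4.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        # A constant column has no defined skewness or kurtosis.
        skewness = kurtosis = float("nan")
    else:
        skewness = m3 / m2 ** 1.5
        kurtosis = m4 / m2 ** 2 - 3.0  # excess kurtosis (normal distribution gives 0)
    return {"mean": mean, "std": math.sqrt(m2),
            "skewness": skewness, "kurtosis": kurtosis}
```

Missing values, which are common in the Criteo integer features, would need to be filtered out before calling a function like this.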

Classification Instructions

Once the setup and the Criteo data splits are complete, run the analysis with classify.py.

Train on train_5m.txt and make predictions for test_3m.txt:

$ ./classify.py

Train on train_5m.txt and make predictions for validation_2m.txt:

$ PY_ENV=validate ./classify.py

Train on train_5m.txt and make predictions for test.txt (approx. 38 million rows):

$ PY_ENV=production ./classify.py
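The three invocations above differ only in the PY_ENV environment variable. One way a script could map that variable to the file it predicts on is sketched below; the function name, the dictionary, and the "development" label for the unset case are assumptions, not taken from classify.py.

```python
import os

# Hypothetical mapping from PY_ENV value to the file used for predictions.
# "development" is an assumed name for the default case (PY_ENV unset).
TEST_FILES = {
    "development": "test_3m.txt",
    "validate": "validation_2m.txt",
    "production": "test.txt",
}

def select_test_file(environ=None):
    """Return the prediction file name for the current PY_ENV setting."""
    environ = os.environ if environ is None else environ
    env = environ.get("PY_ENV", "development")
    if env not in TEST_FILES:
        raise ValueError("Unknown PY_ENV value: %r" % env)
    return TEST_FILES[env]
```

Passing the environment as an argument keeps the function easy to exercise without changing the real process environment, for example select_test_file({"PY_ENV": "validate"}).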
