psnegi/data_science_tools1


Data Science Tools 1

  • Course: Data Science Project 1 COMP-4447-1
  • Class time: M, Wed 07:00 PM - 08:50 PM |Engineering & Computer Science | Room 410
  • Instructor: Pooran Singh Negi, pooran.negi@du.edu webpage
  • Office: 470
  • Office Hours: Tue, Thu, 3.30 p.m. - 5.30 p.m. Email for 1-on-1 help.
  • GTA: Mitchell Wright, GTA office hours ECS 126, Mon 4-6 p.m, Fri 3-5 p.m

Books

Other books

- Think Bayes

Optional material online

Course Description

It is recommended that you consult this GitHub page often for material related to this course. You should check your e-mail periodically for messages. Assignments will be uploaded here and to Canvas.

The main objective of Data Science Tools 1 is to learn various tools for performing data analysis. The focus in Tools 1 is data cleanup, summarization, and visualization. It is more like a hacking skill set, but our primary focus will be on the scientific Python and Linux ecosystem. We’ll use Jupyter notebook/lab in class and for homework. This should make our learning interactive.

For the final project, students will work through individual or team projects applying course-work to the data lifecycle within a particular domain. The focus will also be on best data science/software engineering practices and reproducible work.

Please select a project by January 20th as per your preference. You may form a group of 2 to 3 students, but the project work must justify the team size. There will be a homework asking about the details of your final project. We’ll provide feedback about the feasibility of the final project. The final project can be based on initial capstone work. Please let us know if this is the case, since we need to go over the details.

Syllabus

This syllabus is subject to change at the discretion of the instructor.

  • Jupyter Notebook for reproducible workflow.
  • Data science and EDA.
  • Git tools and workflow.
  • Data science at the command prompt: Linux command line, bash, basic awk and sed.
  • Data collection and ingestion (web scraping and reading datasets + pandas).
  • Data cleanup and imputation + pandas.
  • Data summarization and visualization + pandas (groupby, apply, aggregate, etc.).
  • Go over some topics as per students’ demands.
  • More to come.

    Linux command line and scientific Python (primarily numpy, matplotlib, requests, seaborn, basic pandas) will be used throughout the course.

Grading

There will be coding/analysis homework assignments, a midterm, and a final project. We’ll drop your worst assignment grade.

There will be a final presentation of the final project. You will be required to submit a final project report in the jupyter notebook format.

Dates

| Component | Weight |
|---|---|
| Coding homework | 50% |
| Midterm, 13 Feb, in class | 15% |
| Comprehensive final, 13 March. We’ll use the best of your midterm or final marks | |
| Final project presentation, 10 minutes, 18 March, in class | 15% |
| Final project report, due 18 March. Please refer to the final report format above for the submission guideline | 20% |
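
As a rough illustration of how the weights above combine, here is a minimal Python sketch. It assumes every score is a percentage (0 to 100), that the homework component is the mean of the homework scores after dropping the single lowest one, and that the 15% exam component uses the better of the midterm and the comprehensive final. The function name `course_score` and its parameters are hypothetical and not part of the course material.

```python
def course_score(homeworks, midterm, final_exam, presentation, report):
    """Combine component scores (each 0-100) into a weighted course score.

    Assumptions (not spelled out in the syllabus): homework is averaged
    after dropping the single lowest score, and the exam component is the
    better of the midterm and the comprehensive final.
    """
    if len(homeworks) < 2:
        raise ValueError("need at least two homework scores to drop one")
    kept = sorted(homeworks)[1:]        # drop the lowest homework score
    homework_avg = sum(kept) / len(kept)
    exam = max(midterm, final_exam)     # best of midterm or final
    return (0.50 * homework_avg
            + 0.15 * exam
            + 0.15 * presentation
            + 0.20 * report)


if __name__ == "__main__":
    score = course_score([90, 80, 100, 60], midterm=70, final_exam=85,
                         presentation=90, report=95)
    print(round(score, 2))
```

How homework is actually averaged is up to the instructor; treat the result as an estimate only.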

Final course grading rubric

grade range [(‘A’, >=93), (‘A_minus’, >=89), (‘B_plus’, >=85), (‘B’, >=81), (‘B_minus’, >=77), (‘C_plus’, >=73), (‘C’, >=69), (‘C_minus’, >=65), (‘D_plus’, >61), (‘D’, >=57), (‘D_minus’, >=53), (‘F’, < 53)]
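
The rubric above can be read as an ordered lookup: walk the thresholds from highest to lowest and return the first letter whose bound the score meets. The sketch below is one possible reading, not the official grading script. The names `GRADE_BOUNDS` and `letter_grade` are hypothetical, and the strict comparison for `D_plus` simply mirrors how the rubric is written.

```python
# (letter, lower bound, bound is inclusive?) in descending order.
# D_plus is strict (> 61) because that is how the rubric is written.
GRADE_BOUNDS = [
    ("A", 93, True), ("A_minus", 89, True), ("B_plus", 85, True),
    ("B", 81, True), ("B_minus", 77, True), ("C_plus", 73, True),
    ("C", 69, True), ("C_minus", 65, True), ("D_plus", 61, False),
    ("D", 57, True), ("D_minus", 53, True),
]


def letter_grade(score):
    """Return the first letter whose lower bound the score satisfies."""
    for letter, bound, inclusive in GRADE_BOUNDS:
        if score > bound or (inclusive and score == bound):
            return letter
    return "F"  # anything below 53


if __name__ == "__main__":
    for s in (95, 88.9, 61, 52):
        print(s, letter_grade(s))
```

With the rubric as written, a score of exactly 61 falls through to `D`, because the `D_plus` bound is strict.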

Honor code

All members of the University of Denver community are expected to uphold the values of Integrity, Respect, and Responsibility. These values embody the standards of conduct for students, faculty, staff, and administrators as members of the University community. Our institutional values are defined as:

Integrity: acting in an honest and ethical manner;

Respect: honoring differences in people, ideas, experiences, and opinions;

Responsibility: accepting ownership for one’s own behavior and conduct.

Please respect DU Honor Yourself, Honor the Code

Students with Disabilities

Students with recognized disabilities will be provided reasonable accommodations, appropriate to the course, upon documentation of the disability with a Student Accommodation Form from the Disability Services Program. To receive these accommodations, you must request the specific accommodations by submitting them to the instructor in writing by the end of the first week of classes. Visit the CAMPUS LIFE & INCLUSIVE EXCELLENCE webpage for details.

Withdrawal Policy

Please see the registrar’s calendar for academic deadlines. We’ll strictly follow the deadlines.

Data set for Projects

We need to know your project/dataset before we approve it for the final project.

More to come.

Software Installation

Python

We want everybody to have the same experience using computational tools in Data Science Tools 1. Please follow the steps for your operating system.

Windows-based installation

Please install Windows Subsystem for Linux (WSL) on Windows 10. Follow the instructions in the post Using Windows Subsystem for Linux for Data Science by Hugo Ferreira for installing Linux. **Ignore the install Anaconda part.**

You can also watch this video to see installation of Windows 10 Bash & Linux Subsystem Setup.

Linux/Mac users should already have a bash command prompt

You can run echo $0 to check the current shell. Change to the bash shell using chsh -s /bin/bash

Once you are at the Linux/Mac bash command prompt, please follow the instructions below.

Python3 installation

Please follow the instructions here to install python3 if it is not installed on your system. This link also covers Windows Subsystem for Linux (WSL) for Windows 10 (Windows 10 Creators or Anniversary Update). I am using Python 3.5.2. Hopefully any version of Python 3 should work.

Creating a virtual environment and installing packages for Data Science Tools 1

Run the following commands from the command prompt.

  • apt-get install python3-venv (prefix with sudo if you are not running as root)
  • Using the command line (cd command), go to the folder where you want to keep the Python files and notebooks related to this course.
  • run **python3 -m venv /path/to/new/virtual/environment**
    • e.g. I ran python3 -m venv dst1_env
  • To activate your environment, run source /path/to/new/virtual/environment/bin/activate
    • e.g. from this course directory I run source dst1_env/bin/activate
  • run python3 -m pip install --upgrade pip. Note that there are 2 dashes in the upgrade option.
  • run wget https://raw.githubusercontent.com/psnegi/data_science_tools1/master/requirements.txt
  • run pip install -r requirements.txt
  • run jupyter notebook or jupyter lab.
  • In the browser you should see your current files.
  • Click on the notebook you want to run.
  • Click on the RISE slideshow extension in the notebook if you want to see the notebook as a slideshow.

To deactivate the Python virtual environment, run deactivate
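
To confirm that the environment is active before installing packages or starting Jupyter, a small check like the following may help. It is only a sketch: the function name `in_virtual_environment` is hypothetical, and the check relies on the fact that inside an environment created with `python3 -m venv`, `sys.prefix` differs from `sys.base_prefix`.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("virtual environment active:", in_virtual_environment())
```

Run it with `python3` after `source dst1_env/bin/activate`; the interpreter path should then point inside the environment folder.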

Python learning resources

You can also go to my python for reproducible research GitHub repository and start by running the pythonBasic.ipynb notebook. I will go over the basics of Python and Jupyter notebook.

data analysis tools in python

  • more to come

Notebooks

Jan 7

Jan 9

Jan 14

Jan 16

Jan 23

Jan 28

Jan 30

Feb 4

Feb 6

Feb 11

Feb 18

Feb 20

Feb 25

Feb 27

March 4

March 6

March 11

Homeworks

No late hw will be accepted

| Due date | HW no | Description and links | Solution |
|---|---|---|---|
| Monday 21 Jan, 11.59 p.m. | 1 | Complete questions in this notebook | |
| Friday 25 Jan, 11.59 p.m. | 2 | Complete questions in this notebook | |
| Thursday 31 Jan, 11.59 p.m. | 3 | Complete questions in this notebook | |
| Friday 8 Feb, 11.59 p.m. | 4 | Complete question in this bash file | |
| Friday 15 Feb, 11.59 p.m. | 5 | Complete questions in this notebook | |
| Friday 23 Feb, 11.59 p.m. | 6 | Complete questions in this notebook | |
| Friday 1 March, 11.59 p.m. | 7 | Complete question in this notebook | |
| Monday 11 March, 11.59 p.m. | 8 | Complete this hw notebook | |

Midterm

Course Activity

| Date | Reading/Coding Assignments | Class activity |
|---|---|---|
| 7 Jan | Install jupyter environment. Python Virtual Environments. | Mitchell covered the Jupyter introduction notebook and also helped with installation. Covered the jupyter introduction and data science notebook. |
| 9 Jan | Resources to learn git. We’ll also go over data science. | It may be time consuming to wait for the notebook to get started via binder every time. Go to the folder for this course on your computer and run git clone https://github.com/psnegi/data_science_tools1.git. Run the command ls. You should see the data_science_tools1 folder. Activate your virtual environment. Navigate to the course directory using cd data_science_tools1. Change to the notebook directory using the command cd notebooks. Now run jupyter notebook. You should see all the notebooks in a browser window. Click on the notebook you want to run. To run a cell in the notebook press alt+enter or ctrl+enter. Note that whenever new content is posted, you must run git pull origin master from the data_science_tools1 directory to make sure you have the latest content. Don’t worry about the above git commands. We’ll start git in the next class. Please start with the git notebook. I don’t like notebooks, Joel Grus video (provided by Laura Atkinson). |
| 14 Jan | | Covered git for managing a local project and the git workflow in a team. If you are using a Mac, you may need to install Xcode Command Line Tools or install git. If you haven’t set up Windows Subsystem for Linux and want to use git on Windows, see How to Install GIT client on Windows. I use emacs, but use any editor you like for coding Python. ATOM is a good choice. |
| 16 Jan | Will work on git tool part 2 | Covered workflow in a team, when to push a branch to the remote (you don’t have integration set up, other team members want to look at the feature code for review, etc.), merge conflicts, tagging. Started with “forget to work on a feature branch”. |
| 23 Jan | Data science at command prompt | Finished how to move changes to a feature branch. Note that when cleaning the master branch using a soft or mixed reset, the master branch will still contain your changes. If you use a hard reset, the changes will be lost in master. **HEAD detached** will contain the changes if required. Finished Linux overview, basic commands, redirection and pipes. |
| 28 Jan | Practice posted notebooks. See notebooks in the notebooks section. | Finished regular expressions. Using basic Linux commands and regular expressions (curl, grep, sort, uniq), found the top k words in a Gutenberg book. Finished basic awk and sed. |
| 30 Jan | See notebooks in the notebooks section. | Finished positional parameters and command substitution in bash scripting. Note that you need to use the bc command to do floating point arithmetic. numpy library for scientific computation. In the jupyter notebook use ? or ?? to read about a function (like np.array?). Press shift+tab to get a tool tip for function arguments (like np.ones( and press shift+tab). Started with REST API. Please install Chrome so that we have the same options to click when inspecting https messages. |
| | See 4 Feb notebooks | Covered REST API. Will cover how to create a REST API in tool2 using AWS API Gateway and a lambda function. |
| 4 Feb | Web Scraping in class version | Finished scraping the Fry’s Electronics website for telescopes. |
| 6 Feb | Pandas basics | See notebook section. |
| 11 Feb | Data ingestion and cleaning | Covered basic data ingestion API and cleanup functionality. See pd.qcut (quantile-based discretization) too. |
| 13 Feb | | In-class midterm. |
| 18 Feb | | Python re library and data wrangling. |
| 20 Feb | | Basics of NLP and normalization of text data. |
| 25 Feb | | Text cleanup, contractions, using wordnet for synonyms, antonyms, hypernyms, hyponyms, and edit distance. There will be a comprehensive final in-class exam. We’ll use the best of your midterm or final marks (15% weight). |
| 27 Feb | | Extracting text and tables from pdf files. Concept of split-apply-combine. Pandas group by. If you had issues installing pdf miner on a Mac, it can be Java related. Install the JDK using this link https://www.oracle.com/technetwork/java/javase/downloads/jdk11-downloads-5066655.html and also run sudo R CMD javareconf, otherwise other packages that use java will fail (provided by Chris Haddad). |
| 4 March | | Covered matplotlib theory, hierarchical organization (tree structure) of figure components. Started seaborn. |
| 6 March | | Seaborn when some variables are categorical: scatter, swarm (concept of hue, jitter). For big data, plotting statistical summaries: distplot, jointplot, pairplot, boxplot, bar plot (uni/bivariate). Linear relationships using regplot. Touched upon geo plot (choropleth map) using folium. |
| 11 March | | Time series, Timestamp and Period concepts. Feature engineering (shift, rolling, weighted feature summary) and started time series analysis. |
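
The 28 Jan entry above mentions finding the top k words in a Gutenberg book with curl, grep, sort and uniq. For comparison, a Python analogue of that pipeline could look like the sketch below. It is not the in-class solution: the function name `top_k_words` is hypothetical, the text is assumed to be already loaded into a string, and a word is assumed to be any run of ASCII letters after lowercasing.

```python
import re
from collections import Counter


def top_k_words(text, k=10):
    """Return the k most frequent words in text as (word, count) pairs."""
    # Lowercase the text and treat each run of ASCII letters as a word;
    # this tokenisation rule is an assumption made for the sketch.
    words = re.findall(r"[a-z]+", text.lower())
    # Counter plays the role of `sort | uniq -c | sort -rn | head -k`.
    return Counter(words).most_common(k)


if __name__ == "__main__":
    sample = "the cat and the hat and the bat"
    print(top_k_words(sample, k=2))  # [('the', 3), ('and', 2)]
```

For a real book, read the downloaded text file into a string first and pass it to the function.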