Due 1/9 (Th), 12:30pm
The Internet is full of published linguistic data sets. Let's data-surf! Instructions:
- Go out and find two linguistic data sets you like. One should be a corpus; the other should be in some other format. They must be free and downloadable in full. Make sure they are linguistic data sets, meaning they were designed for linguistic inquiry.
- You might want to start with various bookmark sites listed in the following Learning Resources sections: Linguistic Data, Open Access, Data Publishing, and Corpus Linguistics. But don't be constrained by them.
- Download the data sets and poke around. Open up a file or two to take a peek. (No need to do this in Python.)
- In a text file (should have the .txt extension), make note of:
- The name of the data resource
- The author(s)
- The URL of the download page
- Its makeup: size, type of language, format, etc.
- License: whether it comes with one, and if so what kind?
- Anything else noteworthy about the data. A sentence or two will do.
- If you are comfortable with markdown, make an .md file instead of a text file.
SUBMISSION: Upload your text file to the To-do1 submission link on CourseWeb. If you do not have CourseWeb access, email your submission to Jevon and cc Cassie and John.
Due 1/16 (Th), 12:30pm
Learn about the numpy library: study the Python Data Science Handbook and/or the NumPy documentation.
While doing so, create your own study notes as a Jupyter Notebook file entitled numpy_notes_yourname.ipynb.
Include examples, explanations, etc. You could also replicate DataCamp's examples.
You are essentially creating your own reference material.
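As a rough idea of what one cell in such a notebook might look like, here is a minimal sketch. The numbers and variable names (token_counts, type_counts) are made up for illustration and do not come from any course data set; it touches on three common NumPy ideas: element-wise arithmetic, boolean masking, and broadcasting.

```python
import numpy as np

# Toy data (hypothetical): token and type counts for three short texts.
token_counts = np.array([120, 85, 240])
type_counts = np.array([60, 51, 96])

# Element-wise arithmetic: type-token ratio per text, with no explicit loop.
ttr = type_counts / token_counts

# Boolean masking: keep only the texts longer than 100 tokens.
long_texts = token_counts[token_counts > 100]

# Broadcasting: a (3, 1) column times a (2,) row yields a (3, 2) grid.
scaled = token_counts.reshape(3, 1) * np.array([1, 10])

print(ttr)
print(long_texts)
print(scaled.shape)
```

Pairing each snippet with a sentence or two on what it does, in a markdown cell, is what turns a notebook like this into usable reference material.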
SUBMISSION: Your file should be in the todo2/
directory of the Class-Exercise-Repo
.
Make sure it's configured for the "upstream" remote and your fork is up-to-date. Push to your GitHub fork, and create a pull request for me.
Due 1/21 (Tue)
Study the pandas library (through the Python Data Science Handbook and/or the documentation). pandas is a big topic with lots to learn: aim to cover about half. While doing so, try it out on TWO spreadsheet (.csv, .tsv, etc.) files:
- The first file should be your choice. You can get one from this CSV Files archive, or make up your own. Keep it super simple! It's supposed to be a toy dataset.
- The second one should be billboard_lyrics_1964-2015.csv by Kaylin Pavlik, from her project '50 Years of Pop Music'. (Note: you might need to specify ISO8859 encoding.)
Name your Jupyter Notebook file pandas_notes_yourname.ipynb. Don't change the filename of any downloaded CSV files or edit them in any way.
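The encoding note above can be illustrated with a small, self-contained sketch. It does not use the real Billboard file: it writes a hypothetical two-row CSV (the file name toy_latin1.csv and its rows are invented) in ISO-8859-1, shows that the default UTF-8 read fails on the accented characters, and then reads it again with the encoding named explicitly.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy rows, not taken from the real Billboard file.
csv_text = "Rank,Song,Artist,Year\n1,Déjà Vu,Beyoncé,2006\n2,Hello,Adele,2015\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_latin1.csv")

    # Write the file in ISO-8859-1 (Latin-1) rather than UTF-8.
    with open(path, "w", encoding="ISO-8859-1", newline="") as f:
        f.write(csv_text)

    # Reading with the default UTF-8 codec fails on the accented bytes.
    try:
        pd.read_csv(path)
        utf8_ok = True
    except UnicodeDecodeError:
        utf8_ok = False

    # Naming the encoding explicitly lets pandas decode the file.
    songs = pd.read_csv(path, encoding="ISO-8859-1")

print(utf8_ok)
print(songs.shape)
print(songs.loc[0, "Artist"])
```

If the Billboard file raises a similar UnicodeDecodeError, the same encoding argument passed to pd.read_csv is the likely fix; whether it is actually needed may depend on your setup.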
SUBMISSION: Your files should be in the todo3/ directory of Class-Exercise-Repo.
Commit and push all three files to your GitHub fork, and create a pull request for me.
Due 1/23 (Thu)
This one is a continuation of To-do #3: work further on your pandas study notes. You may create a new JNB file, or you can expand the existing one. Also: try out a spreadsheet submitted by a classmate. You are welcome to view the classmate's notebook to see what they did with it. (How to find out who submitted what? Git/GitHub history of course.) Give them a shout-out.
SUBMISSION: We'll stick to the todo3/ directory in Class-Exercise-Repo. Push to your GitHub fork, and create a pull request for me.