# CPSC 322 Data Science Algorithms
[Gonzaga University](https://www.gonzaga.edu/) |
[Sophina Luitel](https://www.gonzaga.edu/school-of-engineering-applied-science/faculty/detail/sophina-luitel-phd-0dba6a9d)

---

## PA1 Environment Setup (100 pts)

## Learner Objectives
At the conclusion of this programming assignment, participants should be able to:
* Work with files, functions, and lists in Python
* Extract data from a column in a table
* Remove missing values
* Execute unit tests

## Prerequisites
Before starting this programming assignment, participants should be able to:
* Use variables, operators, conditionals, and loops in Python
* Run Python code in Docker containers
* Submit assignments via Github Classroom

## Acknowledgments
Content used in this assignment is based upon information in the following sources:
* [Kaggle TV shows dataset](https://www.kaggle.com/ruchi798/tv-shows-on-netflix-prime-video-hulu-and-disney)

## Github Classroom Setup
For this assignment, you will use GitHub Classroom to create a private code repository to track code changes and submit your assignment. Open this PA1 link to accept the assignment and create your private repository in GitHub Classroom: https://classroom.github.com/a/Pugg-apj

For example, your repo will be named DataScienceAlgorithms/pa1-yourusername (where yourusername is your GitHub username). I highly recommend committing and pushing regularly so your work is always backed up.

## Dev Environment Setup (5 pts)
Include a screenshot showing your [continuumio/anaconda3:2024.06-1](https://hub.docker.com/r/continuumio/anaconda3) Docker container running locally on your machine. Your screenshot should include your PA1 repository name in some shape or form, for example:

![](https://raw.githubusercontent.com/DataScienceAlgorithms/M1_Introduction/main/figures/sl_docker_container_running.png)

## Programming with the 🎬 TV Shows Dataset 🎬 (80 pts)
Write a program (`pa1.py`) that opens a data file called [tv_shows.csv](https://www.kaggle.com/ruchi798/tv-shows-on-netflix-prime-video-hulu-and-disney). According to its Kaggle description, "the data scraped comprises a comprehensive list of tv shows available on various streaming platforms." The file has the following columns (AKA attributes):
1. Unique TV show ID
1. Title
1. Year: The year in which the tv show was produced
1. Age: Target age group
1. IMDb: IMDb rating
1. Rotten Tomatoes: Rotten Tomatoes %
1. Netflix: Whether the tv show is found on Netflix
1. Hulu: Whether the tv show is found on Hulu
1. Prime Video: Whether the tv show is found on Prime Video
1. Disney+: Whether the tv show is found on Disney+

Write code to do the following:

**1. Load the data**
1. Define/call a function that loads the TV shows data into a 2D Python list (AKA table).
2. Remove (and store) the first row, which contains the header of the table.
3. You may use Python’s built-in [`csv` module](https://docs.python.org/3/library/csv.html) to help with file I/O (recommended), or write the parsing from scratch.
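
The loading steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the required solution: the function name `load_table` and the sample rows are made up for illustration, and an in-memory string stands in for `tv_shows.csv` so the example runs on its own.

```python
import csv
import io


def load_table(infile):
    """Read CSV rows from an open file object and return (header, table)."""
    reader = csv.reader(infile)
    table = [row for row in reader]  # each row is a list of strings
    header = table.pop(0)  # remove (and store) the first row
    return header, table


# Hypothetical stand-in for tv_shows.csv (made-up rows)
sample = io.StringIO("ID,Title,Year\n1,Show A,2008\n2,Show B,2016\n")
header, table = load_table(sample)
print(header)
print(table)
```

With a real file, the same function could be called inside `with open("tv_shows.csv", newline="") as infile:` (the `csv` documentation recommends `newline=""`). Note that every value comes back as a string, including numbers.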
    
**2. Clean the data**

1. Finish the function `remove_missing_values(table, header, col_name)`
    1. Accepts the following parameters:
        1. `table`: the 2D list
        1. `header`: the header of the table
        1. `col_name`: the name of the column to check for missing values. A missing value is represented as an empty string (""). If a row has a missing value in this column, drop the row.
    1. Returns the table with the rows dropped
    1. Note: only call this function *per column, as needed*. If you call this function for each column in your table, you are unnecessarily discarding rows that may have usable data in other columns.
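
One way the described behavior could look is sketched below on a toy table. This is an illustration, not the official solution: the toy data is made up, and the sketch assumes it is acceptable to return a new list rather than modify `table` in place. The starter code and `test_pa1.py` determine what is actually expected.

```python
def remove_missing_values(table, header, col_name):
    """Return a new table without the rows whose col_name value is ""."""
    col_index = header.index(col_name)  # position of the column to check
    return [row for row in table if row[col_index] != ""]


# Toy data (made up for illustration)
header = ["Title", "IMDb", "Age"]
table = [
    ["Show A", "9.5", "18+"],
    ["Show B", "", "7+"],
    ["Show C", "8.1", ""],
]
cleaned = remove_missing_values(table, header, "IMDb")
print(cleaned)
```

Only "Show B" is dropped here. "Show C" is kept even though its Age is missing, because only the IMDb column was checked, which is the "per column, as needed" behavior described in the note above.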
       
**3. Answer the following data science questions (write one function per question):**
1. Q1: Which TV show has the highest IMDb rating? 
    *Note: be sure to use your `remove_missing_values()` function!!*
1. Q2: Which streaming service hosts the most TV shows?
1. Q3: What is the oldest TV show in the dataset (based on Year)?
1. Q4: How many TV shows are rated for the “18+” age group?
1. Q5: Define and answer a data science question of your own that you are interested in about the dataset. Be creative!
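
Several of these questions involve pulling one column out of the table and comparing its values. The sketch below shows that pattern on toy data. The helper name `get_column` and the rows are hypothetical, and it assumes the column holds plain numeric text, so check the actual format in `tv_shows.csv`. Values read from a CSV file are strings, and strings compare character by character, so convert them to numbers before using `min()` or `max()`.

```python
def get_column(table, header, col_name):
    """Return the values of one column as a list."""
    col_index = header.index(col_name)
    return [row[col_index] for row in table]


# Toy data (made up for illustration)
header = ["Title", "Score"]
table = [
    ["Show A", "9.5"],
    ["Show B", "10.0"],
    ["Show C", "8.8"],
]
scores_text = get_column(table, header, "Score")
scores = [float(value) for value in scores_text]

print(max(scores_text))  # string comparison picks "9.5"
print(max(scores))       # numeric comparison picks 10.0
```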

Note: we are learning data science from scratch! The only library you should need for this assignment is `csv`. This means you should not `pip install` any libraries beyond what is included in the [continuumio/anaconda3:2024.06-1](https://hub.docker.com/r/continuumio/anaconda3) Docker image, and you should not use `pandas`, `numpy`, `scipy`, etc.



## Testing Setup (10 pts)
Once you have thoroughly tested your code yourself, run the unit test for the `remove_missing_values()` function with `pytest --verbose test_pa1.py` at the Docker container command line. Once your test passes, include two screenshots:
1. (5 pts) The test passing output locally in your terminal
1. (5 pts) The test passing output on Github. To see this, push your code, go to your repo on Github, click on the "Actions" tab, under "All workflows" click your most recent commit, click "test-code", and expand "Test code in Docker container." This setup uses Github Actions and is part of a "continuous integration" workflow where every time you push your code to Github, a job executes that tests your code. For **reproducible results, this workflow tests your code using the same Docker image you set up to run locally on your machine.** How cool is that!? Congrats, you passed your first unit test running in a Docker container!

![](https://raw.githubusercontent.com/DataScienceAlgorithms/M1_Introduction/main/figures/slpytest_local.png)

![](https://raw.githubusercontent.com/DataScienceAlgorithms/M1_Introduction/main/figures/githubtest.png)

Note: your screenshots should include your PA1 repository name in some shape or form.

## Submitting Assignments
1. Turn in your assignment files via a Github Classroom repo. See the "Github Classroom Setup" section at the beginning of this document for details on how to do this.
    1. Your repo should contain all of the files needed to run and test your solution (e.g. .py file(s), input files, etc.). 
    1. Double-check that this is the case by "pretending to be the grader": clone (or download a zip of) your submission repo and run your code in a fresh [continuumio/anaconda3:2024.06-1](https://hub.docker.com/r/continuumio/anaconda3) Docker container, as we will when we grade your code.
1. Submit this PA’s associated assignment in Canvas to mark your PA as "done" and ready for grading. We will then pull your Github repo and grade your PA as soon as possible. The date and time you submit the PA assignment in Canvas will be used for marking your assignment as "late" or "on-time."

## Grading Guidelines
This assignment is worth 100 points. Your assignment will be evaluated based on a successful execution in the [continuumio/anaconda3:2024.06-1](https://hub.docker.com/r/continuumio/anaconda3) Docker container and adherence to the program requirements. We will grade according to the following criteria:

* 5 pts for including a screenshot of your Docker container running locally on your machine
* 5 pts for loading the table from the CSV file
* 5 pts for removing missing values
* 15 pts for answering Q1
* 15 pts for answering Q2
* 15 pts for answering Q3
* 10 pts for answering Q4
* 15 pts for answering Q5
* 5 pts for adherence to course [coding standard](https://github.com/DataScienceAlgorithms/PAs/blob/main/Coding%20Standard.ipynb)
* 10 pts for passing the unit test and including screenshots