# Assignment 4
## Data Science Tools I
### Professor: Don Dalton

---

### Student: Duncan Ferguson


The questions in this notebook all relate to the same sets of data, linked on the Canvas assignment page. Each question walks through various manipulations of the data, each in multiple parts. These are designed to review the topics covered in week 3, and many parts (with some exceptions) can be answered with one or a few lines of code.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Question 1 - Data Cleaning

## Part (a)
### 3 points

Use `pd.read_csv` to read the `cereals_data.csv` file into a DataFrame called `cereals_data`. If you are using Google Colab, uploaded files reside in the root folder `/content/`. If you are using a local file, provide the path to the file relative to the folder your notebook is in.

Print the dimensions (shape) of the data frame and output the first 5 rows to take a look at what sort of information we're dealing with.

In [4]:
cereals_data = pd.read_csv('cereals_data.csv')
cereals_data.head()

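| Unnamed: 0 | name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 1 | 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.0 | 33.983679 |
| 2 | All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| 3 | All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |
| 4 | Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |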


If you take a close look at this .csv file, you will find that some column names as well as some of the categorical values have extra whitespace on the left- and/or right-hand sides. For example, the column index `unnamed: ` has a space after the colon.

Strip the whitespace off of the column names by using `cereals_data.columns.str.strip()`. The `.str.strip()` method can be applied to values within individual columns as well. Apply this method to the columns `Name`, `Manuf`, and `Type`. (Even if you print the first 5 rows again, it may not be immediately clear if you have successfully stripped the whitespace until you start accessing particular values.)
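As a rough illustration of the stripping described above, here is a minimal sketch on a made-up frame. The name `toy` and its contents are assumptions for demonstration only and do not come from `cereals_data.csv`.

```python
import pandas as pd

# Hypothetical toy frame with padded column names and padded string values
toy = pd.DataFrame({" Name": ["Corn Flakes ", " Cheerios"], "Type ": [" C", "C "]})

# Strip whitespace from the column labels and assign the result back
toy.columns = toy.columns.str.strip()

# Strip whitespace from the values of each string column
for col in ["Name", "Type"]:
    toy[col] = toy[col].str.strip()

print(list(toy.columns))     # ['Name', 'Type']
print(toy["Name"].tolist())  # ['Corn Flakes', 'Cheerios']
```

Note that `.str.strip()` returns a new object, so the result has to be assigned back for the change to persist.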

In [6]:
cereals_data.columns


Index(['name', 'mfr', ' type', 'calories', 'protein', 'fat', 'sodium', 'fiber',
       'carbo', 'sugars', 'potass', 'vitamins', 'shelf', 'weight', 'cups',
       'rating'],
      dtype='object')

## Part (b) 
### 3 points


The `unnamed:` column is not very useful, so permanently drop that column. 

Additionally, to follow the pattern of abbreviations, rename `Calories` to `Cal` and `Vitamins` to `Vit`. (This can be done in one line.) 

Again output the data frame dimensions and the first five rows to confirm these changes. Successfully performing these changes should demonstrate that the extraneous whitespace in the columns was removed.
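One possible way to drop and rename columns is sketched below on a hypothetical frame (`toy` and its values are invented for illustration). It assumes the column labels have already been stripped of whitespace.

```python
import pandas as pd

# Hypothetical toy frame; the labels mimic the ones discussed above
toy = pd.DataFrame({"Unnamed: 0": [0, 1], "Calories": [70, 120], "Vitamins": [25, 0]})

# drop returns a new frame, so reassigning makes the change permanent
toy = toy.drop(columns="Unnamed: 0")

# rename accepts a mapping of old label -> new label
toy = toy.rename(columns={"Calories": "Cal", "Vitamins": "Vit"})

print(toy.shape)  # (2, 2)
print(toy.head())
```

If a label still carried stray whitespace, `rename` would silently leave it unchanged, because unmatched keys in the mapping are ignored by default.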

## Part (c)

### 3 points


Print the unique values in the columns `Manuf`, `Type`, and `Shelf`. You should find that there are 7 unique characters for `Manuf`, 'C' and 'H' for `Type`, and 1, 2, and 3 for `Shelf`.

Change the `Type` column such that it uses 'cold' and 'hot' instead of 'C' and 'H' respectively. Output the first five rows to confirm this change.
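The sketch below shows one way to inspect unique values and recode a categorical column. The frame `toy` is a made-up stand-in, and `replace` with a dictionary is only one of several workable approaches (`map` would be another).

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({"Type": ["C", "H", "C"], "Shelf": [3, 1, 3]})

# unique() returns the distinct values in order of first appearance
for col in ["Type", "Shelf"]:
    print(col, toy[col].unique())

# Recode the single-letter codes into descriptive labels
toy["Type"] = toy["Type"].replace({"C": "cold", "H": "hot"})

print(toy.head())
```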

# Question 2 - Indexing

## Part (a)
### 5 points

Use the `iloc` method to select the 5th and 7th rows of `cereals_data`. Note that indexes start at 0, so the 5th row, for example, is at index 4.

"Select" simply means let the notebook run this command and then output the result to the screen to view the contents.

Now select the same two rows using the `loc` method, but additionally select all the columns from `Type` through `Carbo`, inclusive.

Now make the exact same selection as above (with the columns from `Type` to `Carbo`), but using `iloc` once again.

Use the square bracket syntax (using neither `loc` nor `iloc`) to select the first 10 rows and the columns `Protein`, `Sodium`, and `Carbo`.

Use `iloc` to select the first five rows.
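To make the difference between positional and label-based selection concrete, here is a small sketch on an invented frame. The name `toy` and its columns are assumptions chosen so that the default integer index matches the row positions.

```python
import pandas as pd

# Hypothetical toy frame with a default 0..6 integer index
toy = pd.DataFrame({
    "Name": list("abcdefg"),
    "Type": ["C"] * 7,
    "Cal": list(range(70, 140, 10)),
    "Carbo": list(range(5, 12)),
})

# iloc selects by position: the 5th and 7th rows sit at positions 4 and 6
by_position = toy.iloc[[4, 6]]

# loc selects by label, and a label slice includes both endpoints
by_label = toy.loc[[4, 6], "Type":"Carbo"]

# iloc slices exclude the stop position, so 1:4 covers columns 1, 2, and 3
same_by_position = toy.iloc[[4, 6], 1:4]

# Plain square brackets: pick columns with a list, then slice the rows
first_rows = toy[["Cal", "Carbo"]][:3]

print(by_label.equals(same_by_position))  # True
```

The row labels and row positions coincide here only because the index is the default one; after filtering or sorting they would generally differ.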

## Part (b)

### 3 points


Use boolean indexing to select the rows with a value of `Cal` greater than 100.

Use boolean indexing again to select rows with calories greater than 120, but additionally only the rows that also meet the condition that the `Manuf` is 'K'.

Select the same rows as above (calories > 120 and manufacturer being 'K'), but using the `query` method.

`query` was not covered in class, so here is a [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) to the corresponding documentation. Simply put, `query` behaves just like boolean indexing, but uses a slightly more succinct syntax in string format.
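The equivalence between boolean indexing and `query` can be sketched as follows. The frame `toy` is hypothetical and only reuses the column names from this assignment.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"Manuf": ["K", "G", "K", "N"], "Cal": [140, 130, 110, 90]})

# Boolean indexing: each condition needs parentheses, combined with &
mask_result = toy[(toy["Cal"] > 120) & (toy["Manuf"] == "K")]

# query: the same conditions written as a single string expression
query_result = toy.query("Cal > 120 and Manuf == 'K'")

print(mask_result.equals(query_result))  # True
```

In the boolean-indexing form the parentheses matter, because `&` binds more tightly than the comparison operators.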

# Question 3 - Combining Data

## Part (a)

### 2 points


Read the .csv file `cereal_names.csv` as a data frame called `cereal_names` and output the first five rows.

Create a new data frame called `cereals_data2` that combines (joins) `cereals_data` with `cereal_names` horizontally, placing `cereal_names` on the right. Output the first five rows and last 10 columns to confirm the resulting combined data frame. (Remember that negative indexes wrap around to the end of an array-like structure.)
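A horizontal join of this kind could be done with `pd.concat` along `axis=1`, as in the sketch below. The frames `left` and `right` and the column `FullName` are invented for illustration; the real files may differ.

```python
import pandas as pd

# Hypothetical frames that share the same default row index
left = pd.DataFrame({"Cal": [70, 120], "Fat": [1, 5]})
right = pd.DataFrame({"FullName": ["Bran Flakes", "Oat Squares"]})

# axis=1 places the frames side by side, aligning rows on the index
wide = pd.concat([left, right], axis=1)

# First five rows and the last two columns, using a negative positional slice
print(wide.iloc[:5, -2:])
```

Because the rows are aligned by index label, both frames should have matching indexes; otherwise the result would contain missing values.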

## Part (b)

### 4 points


Read the .csv file `more_cereals_data.csv` as a data frame called `more_cereals_data`. As we did with `cereals_data.csv`, strip the column names of extraneous whitespace, and do the same for the values in the columns `Name`, `Manuf`, and `Type`.

Output the first five rows to get a look at the data.

Create a new data frame called `cereals_data3` that combines (appends) `cereals_data2` with `more_cereals_data` vertically, placing `more_cereals_data` below `cereals_data2`. To avoid duplicating indexes, make sure that indexes are ignored when appending.

Output the last 10 rows to confirm the resulting combined data frame. (You should see that `Cal`, `Vit`, `Calories`, and `Vitamins` are all present since the additional data frame did not use the abbreviated terms you applied earlier. This is an intentional result.)
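The effect described above, where mismatched column names survive side by side, can be reproduced on a tiny scale. The frames `top` and `bottom` below are made up for this sketch.

```python
import pandas as pd

# Hypothetical frames: one uses the abbreviated label, the other does not
top = pd.DataFrame({"Name": ["a", "b"], "Cal": [70, 120]})
bottom = pd.DataFrame({"Name": ["c"], "Calories": [110]})

# ignore_index=True discards the original indexes and renumbers from 0
tall = pd.concat([top, bottom], ignore_index=True)

print(tall)
print(tall.isnull().sum())
```

Columns present in only one of the inputs are kept and filled with missing values for the rows that came from the other frame.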

## Part (c)

### 4 points


Check for any missing values in `cereals_data3` by computing for each column how many values are null.

Iterate over the columns of `cereals_data3`. For each numeric column, fill in the missing (null) values of the column with the column's mean. For each categorical (non-numeric) column, fill in the missing values using the "forward fill" method.

To check if a column's data type is numeric, you may import the top-level function `is_numeric_dtype` and provide a specific column. The import statement is provided below:

```python
from pandas.api.types import is_numeric_dtype
```
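One way the column loop might look is sketched below on a hypothetical frame (`toy` is not part of the assignment data). It uses the `ffill()` method for the forward fill.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy frame with one gap in each column
toy = pd.DataFrame({"Manuf": ["K", None, "G"], "Cal": [100.0, np.nan, 140.0]})

# Count the null values per column before filling
print(toy.isnull().sum())

for col in toy.columns:
    if is_numeric_dtype(toy[col]):
        # Numeric column: fill gaps with the column mean
        toy[col] = toy[col].fillna(toy[col].mean())
    else:
        # Non-numeric column: carry the previous valid value forward
        toy[col] = toy[col].ffill()

print(toy)
```

A forward fill cannot fill a gap in the very first row, since there is no earlier value to carry forward.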

## Part (d)

### 6 points


Pandas provides two "cut" functions: `cut` and `qcut`. `cut` separates data (such as a column in a DataFrame) into bins of equal width, based on the range of the values. `qcut` separates the data into bins that each hold (roughly) the same number of values. For example, a bell-shaped, normally distributed dataset split into 3 bins with `cut` should have most values in the middle bin, whereas `qcut` with a value of 3 for `q` would instead adjust the range of each bin so that each one holds an almost equal number of values. ("q" stands for "quantile"; the more familiar "quartile" corresponds to a `q` of 4.)

Use the `Cal` column of `cereals_data3` to demonstrate this difference. Cut the data using `cut` with 3 bins and the labels `low`, `moderate`, and `high`. Do the same thing with `qcut` with the same labels and a value of 3 for `q`. Print out the `value_counts()` table for each type of cut to see the results.

Combine the two counts of each cut into a single data frame with one column called `cut` and the other `qcut`, each with counts for the three labels. Then plot a histogram / bar graph displaying the counts side-by-side on the same graph.
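The contrast between the two functions shows up clearly on skewed data. The sketch below uses an invented series `values` with one large value; it is an illustration under that assumption, not the assignment's data.

```python
import pandas as pd

# Hypothetical skewed data: eight small values and one large one
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
labels = ["low", "moderate", "high"]

# cut: three bins of equal width across the range 1..100
cut_counts = pd.cut(values, bins=3, labels=labels).value_counts(sort=False)

# qcut: three bins holding roughly equal numbers of values
qcut_counts = pd.qcut(values, q=3, labels=labels).value_counts(sort=False)

# Combine both counts into one frame, one column per method
counts = pd.DataFrame({"cut": cut_counts, "qcut": qcut_counts})
print(counts)
```

With this data, `cut` puts eight values in `low` and one in `high`, while `qcut` puts three values in each bin. Calling `counts.plot.bar()` on such a frame would draw the two sets of counts side by side.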

## Part (e)

### 5 points


Write a function called `standardize` that takes in a data frame and a list of columns. The function should create and return a data frame that contains only the numeric columns from the given data frame that are provided in the columns list, but in a standardized form. (You may find it easiest to start with an initially empty data frame and add columns to it.) This standardized form is called a "z-score" and can be computed as follows:

$$
z = \frac{x-\bar{x}}{s}
$$

where $x$ is the original value, $\bar{x}$ is the mean of the values in that column, and $s$ is the standard deviation of values in that column.

Call `standardize` by giving it the `cereals_data3` data frame and the columns `Carbo`, `Sugars`, `Potass`, `Vit`, and `Name`. (`Name` is not a numeric column, so it should not show up in the resulting data frame, nor should it cause an error.)



Using this standardized data frame, check for outliers by finding any values that are greater than 3. Output for each column a count of the number of outliers in that column.
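A minimal sketch of the z-score computation and the outlier count is given below. It is one possible reading of the task, run on a made-up frame `toy`. It assumes the sample standard deviation (the pandas default for `std`) is the intended $s$.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def standardize(df, columns):
    """Return a frame of z-scores for the numeric columns listed in `columns`."""
    result = pd.DataFrame(index=df.index)
    for col in columns:
        # Skip names that are missing or not numeric instead of raising
        if col in df.columns and is_numeric_dtype(df[col]):
            result[col] = (df[col] - df[col].mean()) / df[col].std()
    return result

# Hypothetical toy frame: mean of Carbo is 7 and sample std is 2
toy = pd.DataFrame({"Name": ["a", "b", "c"], "Carbo": [5.0, 7.0, 9.0]})
z = standardize(toy, ["Carbo", "Name"])
print(z)

# Count, per column, the z-scores greater than 3
outlier_counts = (z > 3).sum()
print(outlier_counts)
```

Here the z-scores for `Carbo` come out as -1, 0, and 1, and `Name` is left out of the result because it is not numeric.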

# Question 4 - Data Aggregation

All parts in this question will use `cereals_data3` as the data frame to analyze.

## Part (a)

### 3 points


Group the data based on manufacturer and name using `groupby` and compute the mean for the columns `Cal`, `Fat`, `Sodium`, `Fiber`, and `Sugars`.

Produce the same table as the above GroupBy example, but using `pivot_table` instead.

Use the `agg` method to compute the min, mean, and max of the columns `Cal`, `Carbo`, and `Sugars`.
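The three aggregation styles mentioned above can be compared on a small invented frame. The name `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame; the first two rows share manufacturer and name
toy = pd.DataFrame({
    "Manuf": ["K", "K", "G"],
    "Name": ["a", "a", "c"],
    "Cal": [100, 120, 110],
    "Sugars": [6, 10, 8],
})

# groupby: mean of selected columns per (Manuf, Name) pair
grouped = toy.groupby(["Manuf", "Name"])[["Cal", "Sugars"]].mean()

# pivot_table: the same table expressed through index/values/aggfunc
pivoted = toy.pivot_table(index=["Manuf", "Name"], values=["Cal", "Sugars"], aggfunc="mean")

# agg: several statistics at once for the selected columns
summary = toy[["Cal", "Sugars"]].agg(["min", "mean", "max"])

print(grouped)
print(pivoted)
print(summary)
```

`pivot_table` uses the mean as its default aggregation, so passing `aggfunc="mean"` here only makes the intent explicit.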

## Part (b)

### 3 points


Create a cross tabulation between `Manuf` and `Shelf` using raw frequencies. (If you properly removed whitespace after reading in `more_cereals_data.csv` in Question 3, you should not see any duplicate values for `Manuf` here.)

Using the above crosstab, determine what percentage of cereals on shelf 2 are made by manufacturer 'K'. Your code should compute and output this value as a floating point number.
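The percentage calculation can be sketched on a hypothetical frame as follows. The values in `toy` are invented, so the resulting number says nothing about the real data.

```python
import pandas as pd

# Hypothetical toy frame: four cereals on shelf 2, two of them from 'K'
toy = pd.DataFrame({"Manuf": ["K", "K", "G", "G", "K"], "Shelf": [2, 3, 2, 2, 2]})

# Raw frequency table: manufacturers as rows, shelves as columns
table = pd.crosstab(toy["Manuf"], toy["Shelf"])
print(table)

# Share of shelf-2 cereals made by 'K', as a percentage
pct = float(table.loc["K", 2] / table[2].sum() * 100)
print(pct)  # 50.0
```

The column label `2` is an integer here because the `Shelf` values are integers; a string-typed column would need `"2"` instead.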

# Question 5 - Data Visualization

## Part (a)

### 4 points


Create a `plt` figure and set its size in inches (use `set_size_inches`) to 12x8. Add six subplots to this figure, arranged in a 2x3 grid. The x-values for each subplot should come from `Protein`, `Fat`, `Sodium`, `Fiber`, `Carbo`, and `Sugars` respectively. The y-values for all subplots should be the values for `Cal`. Set the x-label (use `set_xlabel`) for each subplot to be its respective column name.
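The figure-and-subplot pattern described above might look like the following sketch. It uses a made-up frame `toy` and only two subplots in a 1x2 grid to stay short; the same loop would extend to a 2x3 grid.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"Protein": [4, 3, 2], "Fat": [1, 5, 2], "Cal": [70, 120, 110]})

fig = plt.figure()
fig.set_size_inches(12, 8)

x_columns = ["Protein", "Fat"]
for i, col in enumerate(x_columns, start=1):
    # add_subplot takes (rows, columns, position), with position starting at 1
    ax = fig.add_subplot(1, 2, i)
    ax.scatter(toy[col], toy["Cal"])
    ax.set_xlabel(col)
    ax.set_ylabel("Cal")
```

In a notebook the figure is normally rendered at the end of the cell; in a plain script, a call to `plt.show()` would be needed to display it.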

## Part (b)

### 4 points


Use Seaborn to create a scatterplot using `Carbo` vs `Sugars`. Change the hue of the data points based on `Shelf` so that each shelf (1, 2, and 3) has its own color represented in the plot.

Get a list of the unique manufacturers in the data. Create a figure and axes using `subplots` that is 1x7 (for the 7 manufacturers). Set the figure to 2x14 inches.

For each manufacturer, use Seaborn to create a scatterplot displaying `Carbo` vs `Sugars` for each subplot based on each manufacturer. (You can pass an AxesSubplot object to a Seaborn scatterplot using the `ax` parameter.) You should have 7 subplots, each corresponding to a particular manufacturer displaying the correlation between `Carbo` and `Sugars` for that manufacturer's cereal.