# CSC271H1 Week 03 Tutorial: Arrays and Dataframe Inspection

In this tutorial, you will practice working with NumPy arrays. You will also step into the role of data analyst and inspect a dataset of **Canadian rental housing costs**. This first step will prepare you for upcoming data cleaning and preparation tasks.

<div class="alert alert-block alert-danger">
 
 <b>Important</b>: The autotests inspect the variables you are asked to create, and their values, so be sure to use the variable names specified.
</div>

## Task 1: Practice with NumPy arrays

For these exercises, you'll work with an array of numbers. Use NumPy operations, functions, and methods to complete each exercise.

Your solutions should work with any array `values`, not just the example array shown below.

In [10]:
import numpy as np

# An example array for testing your code. Your solutions should work with any array of numbers.
values = np.array([-4, 3, 2, 6.5, 7.9, 3, -2.5, -5, 3])

<div class="alert alert-block alert-success">
1A. Create a variable named <code>evens</code> that refers to a new array containing only the elements of <code>values</code> that are even integers.
</div>

In [11]:
# 1A.
evens = values[values % 2 == 0]
print(evens)

[-4.  2.]
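
The 1A solution relies on boolean mask indexing. As a minimal sketch of the two steps involved, the example below uses a made-up array named `sample` (an assumption for illustration, not part of the tutorial data) and stores the intermediate mask in its own variable so you can inspect it.

```python
import numpy as np

# Hypothetical sample array, chosen only for illustration.
sample = np.array([-4, 3, 2, 6.5, 0, 7])

# Step 1: the comparison produces one boolean per element.
mask = sample % 2 == 0
print(mask)

# Step 2: indexing with the mask keeps only the positions that are True.
sample_evens = sample[mask]
print(sample_evens)
```

Note that `6.5 % 2` is `0.5`, so non-integer values are filtered out by the same test.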


<div class="alert alert-block alert-success">
1B. Create a variable named <code>negatives</code> that refers to an integer representing the number of negative numbers in <code>values</code>.
</div>

In [None]:
# Scratch notes: two ways to count the negatives.
# (values < 0).sum()                 -> already a numpy.int64
# np.int64(values[values < 0].size)  -> .size is a plain int, cast to numpy.int64

(3,)


In [57]:
# 1B.
negatives = (values < 0).sum()
print(negatives)

3
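
Summing a boolean array works as a count because each `True` is treated as 1 and each `False` as 0. The sketch below, using a hypothetical array named `sample`, compares that approach with `np.count_nonzero`, which should give the same count.

```python
import numpy as np

# Hypothetical sample array, chosen only for illustration.
sample = np.array([-4, 3, -2.5, 0, 7.9])

# One boolean per element: True where the element is negative.
is_negative = sample < 0

# Summing booleans counts the True entries.
count_by_sum = is_negative.sum()

# count_nonzero counts the entries that are not False/0.
count_by_nonzero = np.count_nonzero(is_negative)

print(count_by_sum, count_by_nonzero)
```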


<div class="alert alert-block alert-success">
1C. Update the <code>values</code> array so that all 3s are replaced by 4s.
</div>

In [22]:
# 1C.
values = np.where(values == 3, 4, values)
print(values)

[-4.   4.   2.   6.5  7.9  4.  -2.5 -5.   4. ]
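
`np.where(condition, a, b)` builds a new array, taking `a` where the condition is `True` and `b` elsewhere. The sketch below, on a hypothetical array named `sample`, contrasts this with boolean-mask assignment, which changes an existing array in place. Either approach could be used for 1C; which one the autotests expect is not something this example can confirm.

```python
import numpy as np

# Hypothetical sample array, chosen only for illustration.
sample = np.array([3, 1.5, 3, -2])

# Option 1: np.where returns a NEW array; `sample` itself is unchanged.
replaced = np.where(sample == 3, 4, sample)

# Option 2: boolean-mask assignment modifies an array in place.
# We work on a copy here so that `sample` stays intact for comparison.
in_place = sample.copy()
in_place[in_place == 3] = 4

print(sample)
print(replaced)
print(in_place)
```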


<div class="alert alert-block alert-success">
1D. Create a variable named <code>types</code> that refers to an array of categories (either <code>'neg'</code> or <code>'non-neg'</code>) corresponding to each element of <code>values</code>, where negative values are labeled <code>'neg'</code> and all other values are labeled <code>'non-neg'</code>. (Zero is considered non-negative.)
</div>

In [23]:
# 1D.
types = np.where(values >= 0, 'non-neg', 'neg')
print(types)

['neg' 'non-neg' 'non-neg' 'non-neg' 'non-neg' 'non-neg' 'neg' 'neg'
 'non-neg']
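
The two choices passed to `np.where` do not have to be numbers. As a small sketch on a hypothetical array that includes zero, the example below shows that the result is an array of strings and that zero lands in the `'non-neg'` category.

```python
import numpy as np

# Hypothetical sample array that includes zero, chosen only for illustration.
sample = np.array([-1.5, 0, 2])

# Each element becomes one of two string labels.
labels = np.where(sample >= 0, 'non-neg', 'neg')

print(labels)
print(labels.dtype)  # a NumPy Unicode string dtype
```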


## Task 2: Inspect Data

For the rest of this tutorial and in upcoming weeks, you'll work with Canadian rental cost data. Download the `Canadian_rental_costs.csv` file from Quercus, put it in the same folder as this Jupyter notebook, and run the code cell below. Make sure the code runs without error before moving on to the next tasks.

In [24]:
import pandas as pd

df = pd.read_csv('Canadian_rental_costs.csv')
df.head()

Unnamed: 0,REF_DATE,GEO,DGUID,Rental unit type,Estimates,UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
0,2024-01,"St. John's, Census metropolitan area (CMA)",2021S0503001,Apartment - 1 bedroom,Average asking rent,Dollars,81,units,0,v1675424962,1.2.1,1150.0,,,,0
1,2024-04,"St. John's, Census metropolitan area (CMA)",2021S0503001,Apartment - 1 bedroom,Average asking rent,Dollars,81,units,0,v1675424962,1.2.1,1050.0,E,,,0
2,2024-07,"St. John's, Census metropolitan area (CMA)",2021S0503001,Apartment - 1 bedroom,Average asking rent,Dollars,81,units,0,v1675424962,1.2.1,1200.0,,,,0
3,2024-10,"St. John's, Census metropolitan area (CMA)",2021S0503001,Apartment - 1 bedroom,Average asking rent,Dollars,81,units,0,v1675424962,1.2.1,1230.0,,,,0
4,2025-01,"St. John's, Census metropolitan area (CMA)",2021S0503001,Apartment - 1 bedroom,Average asking rent,Dollars,81,units,0,v1675424962,1.2.1,1210.0,,,,0


For this task, you'll inspect the dataset to get a sense of its shape, its values, and any issues like missing data.

### A: Get Basic Information


The pandas `DataFrame` type has many attributes and methods. In the `DataFrame` documentation, you can see a list of attributes with descriptions by scrolling down the page to the Attributes table:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

In the code cell below, we use the `shape` attribute to get the number of rows and columns in the `DataFrame`.

In [27]:
# Print the shape as a tuple containing the number of rows and columns
print(df.shape)

(630, 16)


<div class="alert alert-block alert-success">
2A. The <code>shape</code> attribute is a <code>tuple[int, int]</code>. Write code to assign the number of rows to variable <code>num_rows</code> and the number of columns to <code>num_cols</code>. You must use <code>df.shape</code> in your answers.
</div>

In [28]:
# 2A. 
num_rows = df.shape[0]
num_cols = df.shape[1]
print(num_rows)
print(num_cols)

630
16
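
Because `shape` is a tuple, it can also be unpacked into two variables in a single statement. The sketch below uses a small made-up `DataFrame` named `toy` (an assumption for illustration, not the rental dataset) to show the idea.

```python
import pandas as pd

# Hypothetical toy data, not the rental dataset.
toy = pd.DataFrame({
    "city": ["A", "B", "C"],
    "rent": [1150, 1050, 1200],
})

# Tuple unpacking: the first item is the row count, the second the column count.
n_rows, n_cols = toy.shape

print(n_rows, n_cols)
```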


### B: Find column names


<div class="alert alert-block alert-success">
2B. Complete the code below to create a variable <code>col_labels</code> that refers to the column labels from <code>df</code>. (Tip: to find the relevant attribute, look at the <code>DataFrame</code> attributes table in the documentation linked above or see the Week 02 Lecture prep.)
</div>

In [29]:
# 2B.
col_labels = df.columns
print(col_labels)

Index(['REF_DATE', 'GEO', 'DGUID', 'Rental unit type', 'Estimates', 'UOM',
       'UOM_ID', 'SCALAR_FACTOR', 'SCALAR_ID', 'VECTOR', 'COORDINATE', 'VALUE',
       'STATUS', 'SYMBOL', 'TERMINATED', 'DECIMALS'],
      dtype='object')
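
The `columns` attribute is a pandas `Index` object rather than a plain list. As a sketch on a hypothetical `DataFrame` named `toy`, the example below converts the labels to a list and checks whether a particular label is present.

```python
import pandas as pd

# Hypothetical toy data, not the rental dataset.
toy = pd.DataFrame({
    "city": ["A", "B"],
    "rent": [1150, 1050],
})

labels = toy.columns          # a pandas Index
label_list = labels.tolist()  # a plain Python list of the same labels
has_rent = "rent" in labels   # membership test on the Index

print(label_list)
print(has_rent)
```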


### C: Check for Missing Values

We can find out which columns have missing values, and how many, by using the following command:

In [30]:
df.isna().sum()

REF_DATE              0
GEO                   0
DGUID                 0
Rental unit type      0
Estimates             0
UOM                   0
UOM_ID                0
SCALAR_FACTOR         0
SCALAR_ID             0
VECTOR                0
COORDINATE            0
VALUE                30
STATUS              569
SYMBOL              630
TERMINATED          630
DECIMALS              0
dtype: int64
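
To see why `df.isna().sum()` produces per-column counts, it can help to run the two method calls separately. The sketch below does this on a small made-up `DataFrame` named `toy` (an assumption for illustration, not the rental dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the rental dataset.
toy = pd.DataFrame({
    "city": ["A", "B", "C"],
    "rent": [1150.0, np.nan, 1200.0],
    "flag": [np.nan, "E", np.nan],
})

# Step 1: isna() marks each cell True where a value is missing.
is_missing = toy.isna()
print(is_missing)

# Step 2: sum() adds up each column, counting True as 1.
missing_counts = is_missing.sum()
print(missing_counts)
```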

<div class="alert alert-block alert-success">

2C. The code cell below contains a statement to create a variable named <code>cols_missing</code> with the labels and counts of the columns that contain missing values. 

Complete the assignment statement to create a variable named <code>cols_complete</code> with the labels and counts of the columns that <i>do not</i> contain missing values.

</div>

In [32]:
# 2C.

counts = df.isna().sum()
cols_missing = counts[counts > 0]
print(cols_missing)

cols_complete = counts[counts == 0]
print(cols_complete)

VALUE          30
STATUS        569
SYMBOL        630
TERMINATED    630
dtype: int64
REF_DATE            0
GEO                 0
DGUID               0
Rental unit type    0
Estimates           0
UOM                 0
UOM_ID              0
SCALAR_FACTOR       0
SCALAR_ID           0
VECTOR              0
COORDINATE          0
DECIMALS            0
dtype: int64
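
The filtering in 2C is the same boolean-mask idea from Task 1, applied to a pandas `Series`. As a minimal sketch with made-up counts (the `toy_counts` values are hypothetical), the example below also pulls out just the labels of each group using `.index`.

```python
import pandas as pd

# Hypothetical per-column missing counts, made up for illustration.
toy_counts = pd.Series({"city": 0, "rent": 1, "flag": 2})

# A comparison on a Series gives a boolean Series, which can be used as a mask.
with_missing = toy_counts[toy_counts > 0]
without_missing = toy_counts[toy_counts == 0]

# .index holds the labels that survived the filter.
print(with_missing.index.tolist())
print(without_missing.index.tolist())
```

Every column falls into exactly one of the two groups, so their lengths add up to the total number of columns.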


## Final Task: Show your TA and submit to MarkUs

If you have not already done so, make sure your TA has recorded your attendance.

Submit your completed `w03_tutorial.ipynb` file to [MarkUs](https://markus.teach.cs.toronto.edu/markus/courses/128) and run the tests.