# Recitation 2: Numpy and Pandas

In this recitation, you went through tutorials for two of the most important Python packages for machine learning and data science at large: `numpy` and `pandas`.

In this notebook, I'll ask you to download a dataset into a `pandas` DataFrame, and then execute some transformations on the data both with `pandas` and with `numpy`.

## Step 1: Download and format the data.

We are going to start by using `pandas` to download a weather dataset from the National Oceanic and Atmospheric Administration (NOAA), a U.S. federal agency that studies climate phenomena and publishes a wealth of open climate data.  This dataset contains the average temperature, in degrees Fahrenheit, of the [continental United States](https://www.ncei.noaa.gov/access/monitoring/dyk/us-climate-divisions) measured in January of each year from 1895 to 2025.

Stepping through the code, first we import the `pandas` library and store it in a variable called `pd`.  When you run this cell, this makes `pandas` available in that variable `pd` in *all cells* in this notebook.  Usually we import libraries and define variables towards the top of a Jupyter notebook, so that they are available in the cells towards the bottom.

Next, we pass a URL that points to the data (held in a `.csv` file, which stands for Comma-Separated Values) to the `read_csv` function from the `pandas` library.  The `read_csv` function is very powerful: it downloads the data, automatically separates the fields, and loads them into an object called a DataFrame, which is a rich class representing a dataset.  DataFrames have many convenience methods, which you can try out yourself in the subsequent cells.  We use a simple one, `head()`, which returns the first 5 rows of a dataset by default.  This is a very common "sanity check", i.e. a quick way to make sure the dataset is in the format you expect.
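If you want to see what `read_csv`, `skiprows`, and `head()` do without depending on the network, here is a minimal sketch on a small made-up CSV held in a string.  The contents of `raw_csv` are invented for illustration and are not the NOAA data; only the general shape (a couple of non-data lines, then a header, then rows) is assumed.

```python
import io

import pandas as pd

# Made-up file contents: two non-data lines, then a header row and six data rows.
raw_csv = """# Example title line (invented)
# Example units line (invented)
Date,Value
200001,30.0
200101,31.0
200201,32.0
200301,33.0
200401,34.0
200501,35.0
"""

# io.StringIO lets read_csv treat the string like a file.
# skiprows=2 drops the first two lines so the header row is read correctly.
demo_df = pd.read_csv(io.StringIO(raw_csv), skiprows=2)

# head() returns the first 5 rows by default; head(3) would return the first 3.
print(demo_df.head())
print(demo_df.dtypes)
```

Printing `dtypes` alongside `head()` is a cheap extra check: it shows how `pandas` interpreted each column, which matters for the date handling later in this notebook.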

In [None]:
import pandas as pd

# This URL points to the NOAA data on average temperature in the US in the month of January from 1895 to 2025.
url = "https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/national/time-series/110/tavg/1/1/1895-2025/data.csv"

# We skip the first 2 rows because pandas can't parse the comments at the top of the file.
df = pd.read_csv(url, skiprows=2)

df.head()

This data looks ok, but the date field won't be particularly useful to us if we are doing machine learning.  While it's numeric, it packs the year and the month into a single number, and we would like just the year, i.e. `1895` would be more useful than `189501`.  To do this, we are going to use a `pandas` function to first convert the values in the `Date` column into datetime objects, then extract the year itself from those datetime objects.  Run the code below.

In [None]:
df['Formatted-Date'] = pd.to_datetime(df['Date'], format='%Y%m')

df.head()

In [None]:
df['year'] = df['Formatted-Date'].dt.year

df.head()

To extract the year as a new column, first we used the `pandas` function `to_datetime`.  This function takes in a collection of date-like values, here a single column of the DataFrame, which is known as a Series.  By default, it tries to infer the date format on its own, but it is usually safer to provide a format string, which tells the parser exactly what format the date is in.  Here, we passed in `%Y%m`, which means a 4-digit year immediately followed by a 2-digit month.  After that, we print out the head of the DataFrame and see that the Formatted-Date column contains a full date (1895-01-01), rather than the awkward date value (189501).

Next, we create a new column which extracts the year from those datetime values.  This time, we use the `.dt` accessor on our Formatted-Date column.  `.dt` is an attribute rather than a function call, and it exposes datetime properties such as `year`, `month`, or `day` for every value in the column at once.
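The two steps above can be sketched on a tiny standalone Series.  The values in `raw_dates` are just illustrative strings in the same year-month layout; the variable names are made up for this example.

```python
import pandas as pd

# Illustrative year-month strings in the same layout as the Date column.
raw_dates = pd.Series(["189501", "189601", "202501"])

# %Y = 4-digit year, %m = 2-digit month; the day defaults to the 1st.
parsed = pd.to_datetime(raw_dates, format="%Y%m")

# The .dt accessor exposes datetime properties for the whole Series.
years = parsed.dt.year
months = parsed.dt.month

print(parsed.iloc[0])    # 1895-01-01 00:00:00
print(years.tolist())    # [1895, 1896, 2025]
print(months.tolist())   # [1, 1, 1]
```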

--------

Next, I want you to try to update this DataFrame in two ways.  First, I want you to rename the `Value` feature to `degrees-fahr`.  Then, I want you to create a new column called `degrees-cent` that holds the temperature in Celsius.  There are many ways to update the name: you could [try the `df.rename(...)` function](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html), for example.  To create the new column with the temperature in Celsius, try to look up how to assign a new column based on existing columns - StackOverflow might be helpful here.
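As a hint rather than a solution, here is the same pattern on an unrelated, made-up DataFrame of distances.  The `trips` DataFrame, its column names, and its values are all hypothetical; the point is only the shape of the two operations, which you will need to adapt to the temperature columns yourself.

```python
import pandas as pd

# Hypothetical data: distances in miles stored under a generic column name.
trips = pd.DataFrame({"Value": [1.0, 26.2, 100.0]})

# rename returns a new DataFrame, so assign the result back.
trips = trips.rename(columns={"Value": "miles"})

# Arithmetic on a column applies to every row; assigning to a new
# column name adds that column to the DataFrame.
trips["km"] = trips["miles"] * 1.609344

print(trips)
```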

In [None]:
# Your code here

## Step 2: Convert to numpy

`pandas` is a nice library for formatting and transforming data.  It could be considered a "heavy" interface to the data - once you put your data into a DataFrame object, it has lots of functionality.  

`numpy` is a library that can also transform data, but it operates more directly on numerical matrices - i.e. it doesn't do well with nonnumeric columns, like the Formatted-Date column in our dataset.

Data can be converted between `pandas` and `numpy` in both directions.  Here, we first drop the `Date` and `Formatted-Date` columns from our dataset (we don't need them anymore now that we have `year`), and then we convert our DataFrame object into a `numpy` `ndarray`.  `ndarray` stands for "n-dimensional array"; a 2-dimensional `ndarray` is what we usually call a matrix.  The attribute `df.values` returns the contents of the DataFrame as a `numpy` `ndarray` object; the `pandas` documentation recommends the equivalent method `df.to_numpy()` for new code.
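To make the conversion concrete, here is a small sketch on a made-up DataFrame (`toy_df` and its columns are hypothetical).  It shows why dropping nonnumeric columns first matters: with a text column present, the resulting array falls back to the generic `object` dtype, which `numpy` cannot do fast arithmetic on.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one text column and two numeric columns.
toy_df = pd.DataFrame({
    "label": ["a", "b", "c"],
    "x": [1.0, 2.0, 3.0],
    "y": [10, 20, 30],
})

# Converting everything, text included, yields an object-dtype array.
mixed_arr = toy_df.to_numpy()
print(mixed_arr.dtype)  # object

# Dropping the text column first yields a purely numeric array.
toy_numeric = toy_df.drop(columns=["label"])
toy_arr = toy_numeric.to_numpy()

# shape is (rows, columns); ints and floats are combined into float64.
print(toy_arr.shape)    # (3, 2)
print(toy_arr.dtype)    # float64
```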

In [None]:
import numpy as np

numeric_df = df.drop(columns=["Date", "Formatted-Date"])
arr = numeric_df.values
arr

`numpy` has some nice features that make it easy to manipulate data - in class we used it to very quickly slice the data into a `trainX`, `trainY`, `testX`, and `testY`.  For now, we are just going to do some simple matrix multiplication.  Here, we define a weight vector `w`, and then do a matrix multiplication of our dataset, `arr`, with our weight vector, `w`.  However, `w` is likely not the correct length.  You need exactly one weight for each column of your matrix `arr`; you can check the number of columns with `arr.shape`.  Add some arbitrary values to `w` until the matrix multiplication at the end of the cell is able to complete.  [Note that `np.dot` is the `numpy` function for matrix multiplication, not just dot products.](https://numpy.org/doc/stable/reference/generated/numpy.dot.html)

Try out some different weights and see how your values change.  Consider that this is what a machine learning algorithm is doing - trying out different values of parameters to see how they affect the answer.

Try out a few values for `w` and then you can consider this lab complete.
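Before editing the cell below, it may help to see the shape rule on a small made-up matrix.  The names `toy`, `toy_w`, and `toy_result` are hypothetical and the numbers are arbitrary; the sketch only illustrates that the weight vector needs one entry per column, and what happens when it does not.

```python
import numpy as np

# Made-up 3x2 matrix: 3 rows (examples), 2 columns (features).
toy = np.array([[1, 2],
                [3, 4],
                [5, 6]])

# One weight per column, so the weight vector has length 2.
toy_w = np.array([10, 1])

# Each output entry is a weighted sum of one row: 10*col0 + 1*col1.
toy_result = np.dot(toy, toy_w)
print(toy_result)  # [12 34 56]

# A weight vector of the wrong length raises a ValueError.
try:
    np.dot(toy, np.array([10, 1, 0]))
    shape_error = None
except ValueError as err:
    shape_error = str(err)
print(shape_error)
```

The result has one value per row of the matrix, which is why changing a single weight changes every entry of the output.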

In [None]:
# CHANGE THESE VALUES TO SEE HOW THE RESULT CHANGES
w = np.array([1, 2, -2])  # HOW MANY WEIGHTS DO YOU NEED? ONE PER COLUMN OF arr

result = np.dot(arr, w)
result