# Lab 03: Numpy

Scientific Python code uses a fast array structure called the numpy array. Those who have worked in Matlab will find this very natural. For reference, the numpy documentation can be found [here](https://numpy.org/doc/stable/reference/).

Let's make a numpy array.



## To install: pip install numpy

In [1]:
import numpy as np

In [None]:
my_array = np.array([1, 2, 3, 4])
my_array

Numpy arrays are listy! Below we compute length, slice, and iterate.

In [None]:
print(len(my_array))
print(my_array[2:4])
for ele in my_array:
    print(ele)

In general you should manipulate numpy arrays by using numpy module functions (np.mean, for example). This is for efficiency purposes, and a discussion follows below this section.

You can calculate the mean of the array elements either by calling the method .mean on a numpy array or by applying the function np.mean with the numpy array as an argument.



In [None]:
print(my_array.mean())
print(np.mean(my_array))

The way we constructed the numpy array above seems redundant. After all, we already had a regular Python list. Indeed, it is the other ways we have to construct numpy arrays that make them super useful.

There are many such numpy array constructors. Here are some commonly used constructors. Look them up in the documentation.



In [None]:
np.ones(10) # generates 10 floating point ones

Numpy gains a lot of its efficiency from being typed. That is, all elements in the array have the same type, such as integer or floating point. The default type, as can be seen above, is a 64-bit float (float64).



In [None]:
np.dtype(float).itemsize # in bytes
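To make the "typed" idea concrete, here is a minimal sketch. The variable names (`float_ones`, `int_array`, `mixed`) are made up for illustration. It shows that one array holds one dtype, and that values are converted to fit it.

```python
import numpy as np

# np.ones produces float64 elements by default.
float_ones = np.ones(3)
print(float_ones.dtype)      # float64
print(float_ones.itemsize)   # 8 bytes per element

# Integer input gives an integer dtype (the exact width depends on the platform).
int_array = np.array([1, 2, 3])
print(int_array.dtype)

# Mixing integers and floats upcasts every element to float.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)           # float64

# The dtype is fixed, so a float assigned into an integer array is truncated.
int_array[0] = 7.9
print(int_array)             # [7 2 3]
```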

In [None]:
np.ones(10, dtype='int') # generates 10 integer ones

In [None]:
np.zeros(10)

Often you will want random numbers. Use the random constructor!



In [None]:
np.random.random(10) # uniform on [0, 1)

In [None]:
normal_array = np.random.randn(1000)
print("The sample mean and standard deviation are %f and %f, respectively." %(np.mean(normal_array), np.std(normal_array)))


In [None]:
first = np.ones(5)
second = np.ones(5)
first + second


In [None]:
first_list = [1., 1., 1., 1., 1.]
second_list = [1., 1., 1., 1., 1.]
first_list + second_list # concatenates the two lists, which is not what you want

In [None]:
first + 1

In [None]:
first*5

# **2D arrays**

Similarly, we can create two-dimensional arrays.

In [None]:
my_array2d = np.array([ [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12] ])

# 3 x 4 array of ones
ones_2d = np.ones([3, 4])
print(ones_2d)
# 3 x 4 array of ones with random noise
ones_noise = ones_2d + .01*np.random.randn(3, 4)
print(ones_noise)
# 3 x 3 identity matrix
my_identity = np.eye(3)
print(my_identity)


In [None]:
print(my_array2d)
my_array2d[2, 3]

Numpy arrays are listy! They have a set length (array dimensions), can be sliced, and can be iterated over with a loop. Below is a schematic illustrating slicing two-dimensional arrays.
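As a small sketch of these three properties in two dimensions, the cell below uses an illustrative array named `grid` (the name is an assumption; the values match `my_array2d` above).

```python
import numpy as np

grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])

# Set length: shape gives (rows, columns); len gives the number of rows.
print(grid.shape)        # (3, 4)
print(len(grid))         # 3

# Slicing: a single index selects a row, a colon selects everything along an axis.
print(grid[1])           # second row: [5 6 7 8]
print(grid[:, 0])        # first column: [1 5 9]
print(grid[0:2, 1:3])    # rows 0-1, columns 1-2

# Iteration: looping over a 2D array yields one row at a time.
for row in grid:
    print(row)
```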


Earlier, when we generated the one-dimensional arrays of ones and random numbers, we gave ones and random the number of elements we wanted in the arrays. In two dimensions, we need to provide the shape of the array, i.e., the number of rows and columns of the array.

In [None]:
onesarray = np.ones([3,4])
onesarray

You can transpose the array:



In [None]:
print(onesarray.shape)    # (3, 4)
print(onesarray.T)
print(onesarray.T.shape)  # (4, 3)

Matrix multiplication is accomplished by np.dot. The * operator will do element-wise multiplication.



In [None]:
print(np.dot(onesarray, onesarray.T)) # 3 x 3 matrix
np.dot(onesarray.T, onesarray) # 4 x 4 matrix
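With an array of ones the two kinds of multiplication are hard to tell apart, so here is a small sketch with distinct entries. The arrays `a` and `b` are made up for illustration. It also shows the `@` operator, which for 2D arrays gives the same result as np.dot.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[10, 20],
              [30, 40]])

# Element-wise product: each entry of a times the matching entry of b.
elementwise = a * b
print(elementwise)       # [[ 10  40] [ 90 160]]

# Matrix product: rows of a dotted with columns of b.
matrix_product = np.dot(a, b)
print(matrix_product)    # [[ 70 100] [150 220]]

# The @ operator is an equivalent spelling for 2D arrays.
same_product = a @ b
print(np.array_equal(matrix_product, same_product))  # True
```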


In [None]:
np.sum(onesarray)


# **EXERCISE**: Verify that two-dimensional arrays are listy. Create a two-dimensional array and show that this array has set length (shape), can be sliced, and can be iterated through with a loop. Your code should slice the array in at least two different ways, and your loop should print out the array entries.



In [None]:
twod = np.zeros((5, 2))
twod

In [None]:
for i in range(twod.shape[0]):
    twod[i, :] = np.random.random(2)
twod

In [None]:
twod = np.random.random(size=(5,2))
twod

# Numpy Arrays vs. Python Lists?
1. Why the need for numpy arrays? Can't we just use Python lists?
2. Iterating over numpy arrays is slow. Slicing is faster.
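One way to see the difference for yourself is to time a Python loop against the equivalent numpy expression. The sketch below is illustrative only: the function names are made up, and the timings will vary by machine, so none are quoted here.

```python
import timeit
import numpy as np

values = np.arange(100_000, dtype=float)

def loop_sum_of_squares(arr):
    """Sum of squares using an explicit Python loop over the elements."""
    total = 0.0
    for x in arr:
        total += x * x
    return total

def vectorized_sum_of_squares(arr):
    """Sum of squares using numpy operations on the whole array."""
    return np.sum(arr * arr)

# Both approaches compute the same number (up to floating point rounding).
loop_result = loop_sum_of_squares(values)
vector_result = vectorized_sum_of_squares(values)
print(np.isclose(loop_result, vector_result))

# Time five runs of each version.
loop_time = timeit.timeit(lambda: loop_sum_of_squares(values), number=5)
vector_time = timeit.timeit(lambda: vectorized_sum_of_squares(values), number=5)
print(f"loop: {loop_time:.4f} s, vectorized: {vector_time:.4f} s")
```

The loop handles each element as a separate Python object, while the vectorized version hands the whole typed array to compiled code in one call.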


# Numpy: https://numpy.org/doc/stable/user/quickstart.html

# **Introduction to Pandas**

## Dataset: https://figshare.com/articles/dataset/Motor_Trend_Car_Road_Tests/3122005?file=4856698


## Datasets: https://github.com/panchaldhruv27223/datasets_for_ml_-csv-xlsl-


# Install pandas: pip install pandas

In [None]:
import pandas as pd

In [None]:
# Read in the csv files
dfcars = pd.read_csv("/content/cars.csv")
type(dfcars)

In [None]:
dfcars.head()

Wow! That was easier and the output is nicer. What we have now is a spreadsheet with indexed rows and named columns, called a dataframe in pandas. dfcars is an instance of the pd.DataFrame class, created by calling the pd.read_csv "constructor function".

The take-away is that dfcars is a dataframe object, and it has methods (functions) belonging to it. For example, dfcars.head() is a method that shows the first 5 rows of the dataframe.

A pandas dataframe is a set of columns pasted together into a spreadsheet, as shown in the schematic below, which is taken from the cheatsheet above. The columns in pandas are called series objects.
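The "columns pasted together" picture can be sketched in code with a toy dataframe. The names `mpg`, `cyl`, and `toy` and the values below are illustrative assumptions, not read from cars.csv.

```python
import pandas as pd

# Two series with toy values; each will become one column.
mpg = pd.Series([21.0, 22.8, 18.7], name="mpg")
cyl = pd.Series([6, 4, 8], name="cyl")

# A dict of series is pasted together column by column.
toy = pd.DataFrame({"mpg": mpg, "cyl": cyl})
print(toy)

# Pulling a single column back out gives a series again.
print(type(toy))           # <class 'pandas.core.frame.DataFrame'>
print(type(toy["mpg"]))    # <class 'pandas.core.series.Series'>
print(toy.columns.tolist())
```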



Let's look again at the first five rows of dfcars.



In [None]:
dfcars.head()

In [None]:
dfcars.columns

In [None]:
dfcars.index

In [None]:
len(dfcars)

In [None]:
dfcars.shape

In [None]:
dfcars.mpg

In [None]:
dfcars.mpg.values

In [None]:
dfcars['mpg']

In [None]:
dfcars[['am', 'mpg']]


# Dataframes and Series
Now that we have our automobile data loaded as a dataframe, we'd like to be able to manipulate it, its series, and its sub-dataframes, say by calculating statistics and plotting distributions of features. Like arrays and other containers, dataframes and series are listy, so we can apply the list operations we already know to these new containers. Below we explore our dataframe and its properties, in the context of listiness.



In [None]:
print(dfcars.shape)     # 12 columns, each of length 32
print(len(dfcars))      # the number of rows in the dataframe, also the length of a series
print(len(dfcars.mpg))  # the length of a series

# Listiness property 2: iteration via loops
One consequence of the column-wise construction of dataframes is that you cannot easily iterate over the rows of the dataframe. Instead, we iterate over the columns, for example, by printing out the column names via a for loop.



In [None]:
for ele in dfcars: # iterating over a dataframe yields the column names, like a dictionary
    print(ele)

In [None]:
for ele in dfcars.cyl:
    print(ele)


How do you iterate over rows? Dataframes are put together column-by-column and you should be able to write code which never requires iteration over rows. But if you still find a need to iterate over rows, you can do it using itertuples. See the documentation.

In general, direct iteration through pandas series/dataframes (and numpy arrays) is a bad idea, for the reasons given in the earlier "Numpy Arrays vs. Python Lists?" section.

Instead, you should manipulate dataframes and series with pandas methods, which are written to be very fast (i.e., they access series and dataframes at the C level). Similarly, numpy arrays should be accessed directly through numpy methods.
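Here is a minimal sketch of both approaches on a toy dataframe. The frame `toy`, its values, and the derived column `mpg_per_cyl` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"mpg": [21.0, 22.8, 18.7], "cyl": [6, 4, 8]})

# If you really need rows: itertuples yields one named tuple per row,
# with the row label available as .Index.
for row in toy.itertuples():
    print(row.Index, row.mpg, row.cyl)

# Usually better: operate on whole columns with pandas methods.
toy["mpg_per_cyl"] = toy["mpg"] / toy["cyl"]
print(toy)
print(toy["mpg"].mean())
```

The column-wise version expresses the same computation without any explicit loop, which is the style the rest of this lab uses.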



# **Listiness property 3: slice**
Let's see how indexing works in dataframes. Like lists in Python and arrays in numpy, dataframes and series are zero-indexed.



In [None]:
dfcars.head()


In [None]:
# index for the dataframe
print(list(dfcars.index))
# index for the cyl series
dfcars.cyl.index

In [None]:
new_index = np.arange(5, 37)
# new dataframe with row labels 5 to 36; labels 32 to 36 are not in dfcars,
# so reindex fills those rows with NaN
dfcars_reindex = dfcars.reindex(new_index)
dfcars_reindex.head()


We now return the first three rows of dfcars_reindex in two different ways, first with iloc and then with loc. With iloc we use the command

In [None]:
dfcars_reindex.iloc[0:3]

since iloc uses the position in the index. Notice that the argument 0:3 with iloc returns the first three rows of the dataframe, which have label names 5, 6, and 7. To access the same rows with loc, we write

In [None]:
dfcars_reindex.loc[5:7] # label-based, end label included; loc[0:7] gives the same rows here, since no labels lie below 5


since loc indexes via the label name.

Here's another example where we return three rows of dfcars_reindex that correspond to column attributes mpg, cyl, and disp. First do it with iloc:


In [None]:
dfcars_reindex.iloc[2:5, 1:4]


Notice that the rows we're accessing, 2, 3, and 4, have label names 7, 8, and 9, and the columns we're accessing, 1, 2, and 3, have label names mpg, cyl, and disp. So for both rows and columns, we're accessing elements of the dataframe using the integer position indices. Now let's do it with loc:

In [None]:
dfcars_reindex.loc[7:9, ['mpg', 'cyl', 'disp']]

We don't have to remember that disp is the third column of the dataframe the way we did when the data was stored as a numpy array -- we can simply access it with loc using the label name disp.

Generally we prefer iloc for indexing rows and loc for indexing columns.
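The difference between position and label is easy to mix up, so here is a self-contained sketch on a toy dataframe whose row labels start at 5. The frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"mpg": [21.0, 22.8, 18.7, 14.3], "cyl": [6, 4, 8, 8]},
    index=[5, 6, 7, 8],  # row labels deliberately differ from positions 0..3
)

# iloc slices by position and excludes the end position.
by_position = toy.iloc[0:2]
print(by_position.index.tolist())    # [5, 6]

# loc slices by label and includes the end label.
by_label = toy.loc[5:6]
print(by_label.index.tolist())       # [5, 6]

# The same cell, addressed by label and then by position.
print(toy.loc[7, "mpg"])             # 18.7
print(toy.iloc[2, 0])                # 18.7
```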

# **EXERCISE**: In this exercise you'll examine the documentation to generate a toy dataframe from scratch. Go to the documentation and click on "10 minutes to pandas" in the table of contents. Then do the following:

Create a series called column_1 with entries 0, 1, 2, 3.

Create a second series called column_2 with entries 4, 5, 6, 7.

Glue these series into a dataframe called table, where the first and second labelled column of the dataframe are column_1 and column_2, respectively. In the dataframe, column_1 should be indexed as col_1 and column_2 should be indexed as col_2.

Oops! You've changed your mind about the index labels for the columns. Use rename to rename col_1 as Col_1 and col_2 as Col_2.

Stretch: Can you figure out how to rename the row indexes? Try to rename 0 as zero, 1 as one, and so on.



In [None]:
dfcars.describe()

## How to create a dataframe

In [None]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

df2

In [None]:
df2.dtypes

In [None]:
df2.head()

In [None]:
df2.tail(1)

In [None]:
df2.to_numpy()

In [None]:
df2.T

In [None]:
df2.sort_values(by="B")

In [None]:
# https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv

# https://raw.githubusercontent.com/panchaldhruv27223/datasets_for_ml_-csv-xlsl-/refs/heads/main/ebird_taxonomy_v2022.csv

# https://raw.githubusercontent.com/panchaldhruv27223/datasets_for_ml_-csv-xlsl-/refs/heads/main/kc_house_data.csv

In [None]:
!mkdir -p ./datasets
!wget -O ./datasets/titanic.csv "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"

In [None]:
!wget -O ./datasets/ebird_taxonomy_v2022.csv "https://raw.githubusercontent.com/panchaldhruv27223/datasets_for_ml_-csv-xlsl-/refs/heads/main/ebird_taxonomy_v2022.csv"

In [None]:
!wget -O ./datasets/kc_house_data.csv "https://raw.githubusercontent.com/panchaldhruv27223/datasets_for_ml_-csv-xlsl-/refs/heads/main/kc_house_data.csv"

# Pandas: https://pandas.pydata.org/docs/user_guide/10min.html