<a href="https://colab.research.google.com/github/SchachtmanLab/Transgenic-sorghum-sorgoleone/blob/master/%5BSTUDENT_COPY%5DWednesday_morning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Good morning!

Welcome back to Day 3 of PyCamp!

Yesterday, we went on a grand tour of `numpy`, a package broadly used to conduct numerical operations in Python. We learned a *ton* of new `numpy` functions, so instead of diving into a new topic right away, we're going to spend a fair chunk of the morning expanding on yesterday's content and walking through how you can use it for your own data.

# Warm-up: reviewing arrays

Yesterday, we spent our whole day discussing the art and science of `numpy` arrays. You probably recall the main points of arrays: they're similar-ish to lists, they can be of different shapes and sizes, and they're very efficient for numerical operations.

Let's begin our review by importing the package that allows us to actually use arrays: `numpy`. **Please follow along with your cheat sheet in hand, so that you know exactly where to find the material during exercises.**

In [7]:
# remember, we need to import the packages we want to use
# you've seen us do it before: now try it yourself

# try it out:
# import the numpy package with its alias
import numpy as np

## Creating arrays

Next, let's recall how we generate arrays. We can generate arrays from lists or nested lists:

```
two_by_five = np.array([[1, 1, 1, 1, 1],
                        [1, 1, 1, 1, 1]])
```

Alternatively, we can use special functions to generate arrays:

* `np.ones()`: Given a tuple of `(rows, columns)`, returns an array of the desired shape where each value is a `1.0`.
* `np.zeros()`: Given a tuple of `(rows, columns)`, returns an array of the desired shape where each value is a `0.0`.
* `np.full()`: Given a tuple of `(rows, columns)` and an input value, returns an array of the desired shape where each value is the input value.
* `np.arange()`: Given three inputs: `start, stop, step`, returns a sequence of numbers spaced by `step` in the range between `start` and `stop`, *exclusive* of `stop`.
  * Example: `np.arange(0, 6, 3)` would return an array with `0, 3`.
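
As a quick refresher, here is a minimal sketch of the four functions listed above. The variable names are made up for this illustration.

```python
import numpy as np

# shape tuples are (rows, columns)
ones_2x3 = np.ones((2, 3))        # 2 rows, 3 columns of 1.0
zeros_2x3 = np.zeros((2, 3))      # 2 rows, 3 columns of 0.0
sevens_2x3 = np.full((2, 3), 7)   # 2 rows, 3 columns of the fill value 7

# start=0, stop=6, step=3: the stop value is excluded
stepped = np.arange(0, 6, 3)

print(ones_2x3.shape)  # (2, 3)
print(stepped)         # [0 3]
```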

## Basic operations

Once we have our arrays in the shape, size, and range we desire, we can perform basic array operations.

In [17]:
# a 3x3 array of 1s + a 3x3 array of 4s
np.ones((3,3))
np.full((3,3),4)
np.ones((3,3)) + np.full((3,3),4)


array([[5., 5., 5.],
       [5., 5., 5.],
       [5., 5., 5.]])

In [19]:
np.array([1,3,5])
np.zeros((1,3))
np.full((5,8),10000000000000000)
np.arange(9,19,1)

array_1 = (np.array([0,9,8,8])).reshape(4,1)
array_1

array([[0],
       [9],
       [8],
       [8]])

In [None]:
# When performing an operation on two arrays in numpy:
# For every dimension, either:
# 1) Both numpy arrays have the same length
# 2) One numpy array has a length of one (broadcasting happens)

We can use *broadcasting* to "stretch" operations across rows and columns:

![broadcasting figure 3](https://numpy.org/doc/stable/_images/broadcasting_4.png)
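
The two rules in the cell above can be checked directly. This is a small sketch with made-up array names: two shapes that broadcast against a 3 x 6 array, and one that does not.

```python
import numpy as np

grid = np.ones((3, 6))

# shape (6,) lines up with the 6 columns, so it is stretched down the rows
per_column = np.arange(1, 7)
print((grid * per_column).shape)   # (3, 6)

# shape (3, 1) has length one in the column dimension,
# so it is stretched across the columns
per_row = np.array([[10], [20], [30]])
print((grid * per_row).shape)      # (3, 6)

# shape (3,) is compared against the 6 columns: 3 != 6 and neither is 1
try:
    grid * np.arange(3)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)             # True
```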

In [34]:
# try it out:
# multiply a 3 x 6 array of ones by an array containing
# the sequence of integers from 1 to 7, exclusive of 7
# and assign it to a variable called int_array
a = np.ones((3,6))
b = np.arange(1,7,1) #default stepsize  = 1

print(a,"\n", b)
int_array = a * b
int_array_2 = a + b
print(int_array, "\n", int_array_2)

[[1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]] 
 [1 2 3 4 5 6]
[[1. 2. 3. 4. 5. 6.]
 [1. 2. 3. 4. 5. 6.]
 [1. 2. 3. 4. 5. 6.]] 
 [[2. 3. 4. 5. 6. 7.]
 [2. 3. 4. 5. 6. 7.]
 [2. 3. 4. 5. 6. 7.]]


## Indexing and slicing
Next, we can index and slice arrays using a `[row, column]` index. Here are the key tips for this:
* You can use the `:` operator in place of the `row` or `column` index to indicate "all rows" or "all columns".
* You can use a tuple in the `row` or `column` index to select specific rows or columns: for example, `(0, 1)` in the column index would select the first and second columns.
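
Here is a short sketch of both tips on a small made-up array (`grid` is not part of the lesson's data):

```python
import numpy as np

# a 3 x 4 array holding 0..11, filled row by row
grid = np.arange(12).reshape(3, 4)

first_row = grid[0, :]          # all columns of the first row
last_column = grid[:, 3]        # all rows of the fourth column
two_columns = grid[:, (0, 2)]   # first and third columns, all rows

print(first_row)     # [0 1 2 3]
print(last_column)   # [ 3  7 11]
print(two_columns)
```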

In [37]:
# try it out: slice out the first, third, and sixth columns of int_array
int_array[0, (0,5)]
int_array[:, (0, 2, 5)] # use a tuple in () to select multiple columns

array([[1., 3., 6.],
       [1., 3., 6.],
       [1., 3., 6.]])

## Summarizing arrays

We can use simple methods and functions to summarize values in arrays.

* `.min()` and `.max()` are equivalent to `np.min()` and `np.max()`.
* `.sum()` is equivalent to `np.sum()`.
* `.mean()` is equivalent to `np.mean()`.
* `.std()` is equivalent to `np.std()`.
* `.var()` is equivalent to `np.var()`.
* Median is only available as a function (`np.median`).

In [42]:
# try it out both ways: calculate the mean of int_array
int_array.var()
int_array.mean()
np.median(int_array)

3.5

### The `axis` parameter
Many methods and functions in `numpy` accept an input parameter called `axis`, which determines whether the summary calculation is performed row-wise or column-wise.
* `axis = 0`: Indicates a column-wise operation.
* `axis = 1`: Indicates a row-wise operation.
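
A handy way to remember this: the result has one value per column when `axis = 0`, and one value per row when `axis = 1`. A minimal sketch on a made-up 2 x 3 array:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

column_means = grid.mean(axis=0)   # one value per column
row_means = grid.mean(axis=1)      # one value per row

print(column_means)   # [2.5 3.5 4.5]
print(row_means)      # [2. 5.]
```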

In [45]:
# column-wise means
a= int_array.mean(axis=0)
b=np.mean(int_array, axis=0)
a==b

array([ True,  True,  True,  True,  True,  True])

In [None]:
# row-wise means


## Random number generation

We can create arrays of pseudo-random numbers using `numpy`'s random number generator object, which is always created with the same line of code:

```
rng = np.random.default_rng()
```

We can slightly modify this line of code with a **seed value**, ensuring that we generate reproducible random numbers (read: the "random" numbers that the lecturer generates will be the same ones that you generate on your own screen).

```
# we will usually use 2025 as a seed
rng = np.random.default_rng(2025)
```

We can use methods for `rng` to draw random numbers from a set or distribution of numbers:
* `.integers()`: Takes three inputs: a lower bound, an upper bound (excluded by default), and the number of desired numbers (or a shape tuple). Draws from the set of *integers*.
* `.uniform()`: Takes three inputs: a lower bound, an upper bound, and the number of desired numbers (or a shape tuple). Each value in the given range is equally likely to be drawn.
* `.normal()`: Takes three inputs: the center (mean) of the distribution, a scale factor (the standard deviation), and the number of desired numbers. Values close to the mean are the most likely to be drawn, following the normal distribution.
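
The sketch below illustrates what the seed buys us: two generators built from the same seed produce identical draws. The names `rng_a` and `rng_b` are made up for this example, and no specific random values are shown because they depend on the `numpy` version.

```python
import numpy as np

# two generators built from the same seed produce the same draws
rng_a = np.random.default_rng(2025)
rng_b = np.random.default_rng(2025)
draws_a = rng_a.integers(0, 10, 5)
draws_b = rng_b.integers(0, 10, 5)
print(np.array_equal(draws_a, draws_b))   # True

# the third input can also be a shape tuple
uniform_draws = rng_a.uniform(0, 1, (2, 3))
normal_draws = rng_a.normal(8, 5, 10)     # mean 8, standard deviation 5
print(uniform_draws.shape, normal_draws.shape)   # (2, 3) (10,)
```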

In [56]:
# try it out:
# using 2025 as a seed, create the rng object
# then draw 10 random numbers from the normal distribution with a scale factor of 5
rng = np.random.default_rng(2025)
rng.integers(0,666, (4,5))
rng.uniform(0,666, (4,5))

rng.normal(8,5, 10)

array([ 7.22444237,  8.67883541,  9.03561987,  8.78305723, 11.89500834,
        7.36791179, 10.73347405,  9.52178924,  2.39126714,  3.42144134])

## Filtering (Boolean masking)
Lastly (and most importantly), we can filter arrays using logical operations. Using a logic check on an array will yield a Boolean array of the same shape and size, which tells us about whether or not each element in the original array passes the logic check.

In [67]:
# generate random integers into a (8,3) array
rng = np.random.default_rng()
random_ints = rng.integers(1,10,(8,3))


print('Starting array:\n', random_ints)

print('Rows where the third column is greater than 4:\n', random_ints[:, 2] > 4)
rr = random_ints[:, 2] > 4
random_ints[rr,:]

print('Row values where the third column is greater than 4:\n', random_ints[rr,:])

Starting array:
 [[4 3 2]
 [3 7 2]
 [4 4 9]
 [1 1 1]
 [6 8 7]
 [1 4 3]
 [1 8 9]
 [3 4 9]]
Rows where the third column is greater than 4:
 [False False  True False  True False  True  True]
Row values where the third column is greater than 4:
 [[4 4 9]
 [6 8 7]
 [1 8 9]
 [3 4 9]]


Once we have the Boolean array, we can use it to select our filtered values and perform downstream operations on them.

In [89]:
# try it out:
# select rows in random_ints where the second column is greater than 2
# then calculate the column-wise means of the selected rows
cc = random_ints[:, 1] > 2
print(cc)

# select the rows that pass the check, then take the column-wise means
# (shown both as a method and as a function)
gg = random_ints[cc, :]
print(gg.mean(axis=0), "\n", np.mean(random_ints[cc, :], axis=0))
print('Starting array:\n', random_ints)

[ True  True  True False  True  True  True  True]
[3.14285714 5.42857143 5.85714286] 
 [3.14285714 5.42857143 5.85714286]
Starting array:
 [[4 3 2]
 [3 7 2]
 [4 4 9]
 [1 1 1]
 [6 8 7]
 [1 4 3]
 [1 8 9]
 [3 4 9]]


# Intro to data exploration
Now that we've refreshed our knowledge of arrays, we're ready to move on to the next phase of our Python journey: learning how to import and work with actual data.

**Data exploration** refers to the process of semi-structured examination of the data at hand – we're not gathering data, but we're also not quite analyzing it rigorously (statistical tests, etc) yet. This process broadly encompasses importing, cleaning, and summarizing data.

In [90]:
# make sure to run this cell to import our external files for today!
# this is not something you have to know the details of, but if you're
# curious about the specifics, you can read the [Optional] section
# on cloning files from GitHub at the end of the notebook

!git clone https://github.com/ccbskillssem/pythonbootcamp.git

Cloning into 'pythonbootcamp'...
remote: Enumerating objects: 298, done.
remote: Counting objects: 100% (200/200), done.
remote: Compressing objects: 100% (163/163), done.
remote: Total 298 (delta 113), reused 82 (delta 37), pack-reused 98 (from 1)
Receiving objects: 100% (298/298), 95.57 MiB | 23.64 MiB/s, done.
Resolving deltas: 100% (141/141), done.
Updating files: 100% (50/50), done.


## Common file types

Scientific data comes in a myriad of file types. The easiest form of data to work with comes in a **delimited file**, which separates values into rows and columns based on *delimiter characters*. Here are some common delimited file types:
* **Comma separated value files** (`.csv`): Columns are separated by commas, and rows are separated by *newlines* (literally new lines, or `\n` special characters).
* **Tab separated value files** (`.tsv`): Columns are separated by tabs, and rows are separated by newlines as well.

You can generally tell the structure of a file given its **file extension**: for example, a comma separated file will usually be denoted `.csv`, and a tab separated file will be denoted `.tsv`.

## Importing data

Let's take a look at a simple dataset called `animals2` [[source here]](https://vincentarelbundock.github.io/Rdatasets/doc/robustbase/Animals2.html), which contains data on average body weight and brain weight for 65 species.

| Index  | Description                   |
|--------|-------------------------------|
| 0      | Body weight (kilograms)       |
| 1      | Brain weight (grams)          |

A **file path** is an address that points to where the data is stored. Our file path, shown below, is a string that points to the location in the runtime where the `animals2` dataset is stored.

```
# we represent file paths as strings
'/content/pythonbootcamp/day_3/Animals2.csv'
```

We're going to use a function called `np.genfromtxt()` to read in our data. This function takes many different input parameters, but the most important one is `delimiter`, a string that specifies the character used to delimit columns. By default, `np.genfromtxt()` splits columns on whitespace, so this is an essential input parameter for comma and tab delimited data.
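
To see the `delimiter` parameter in isolation, here is a minimal sketch that reads the same values from comma-delimited and tab-delimited text. The `io.StringIO` objects are hypothetical in-memory stand-ins for real files, used only so the example runs without anything on disk.

```python
import io
import numpy as np

# hypothetical in-memory stand-ins for a .csv and a .tsv file
csv_text = io.StringIO("1.5,2.0\n3.0,4.5")
tsv_text = io.StringIO("1.5\t2.0\n3.0\t4.5")

from_csv = np.genfromtxt(csv_text, delimiter=',')
from_tsv = np.genfromtxt(tsv_text, delimiter='\t')

print(from_csv)
print(np.array_equal(from_csv, from_tsv))   # True
```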

We can infer from the file path that `animals2` is comma-delimited (`.csv` format). Let's give importing a go and see what our data looks like.

In [91]:
animals2 = np.genfromtxt('/content/pythonbootcamp/day_3/Animals2.csv',
              delimiter=',')
animals2 # look at the imported data

array([[      nan,       nan,       nan],
       [      nan, 1.350e+00, 8.100e+00],
       [      nan, 4.650e+02, 4.230e+02],
       [      nan, 3.633e+01, 1.195e+02],
       [      nan, 2.766e+01, 1.150e+02],
       [      nan, 1.040e+00, 5.500e+00],
       [      nan, 1.170e+04, 5.000e+01],
       [      nan, 2.547e+03, 4.603e+03],
       [      nan, 1.871e+02, 4.190e+02],
       [      nan, 5.210e+02, 6.550e+02],
       [      nan, 1.000e+01, 1.150e+02],
       [      nan, 3.300e+00, 2.560e+01],
       [      nan, 5.290e+02, 6.800e+02],
       [      nan, 2.070e+02, 4.060e+02],
       [      nan, 6.200e+01, 1.320e+03],
       [      nan, 6.654e+03, 5.712e+03],
       [      nan, 9.400e+03, 7.000e+01],
       [      nan, 6.800e+00, 1.790e+02],
       [      nan, 3.500e+01, 5.600e+01],
       [      nan, 1.200e-01, 1.000e+00],
       [      nan, 2.300e-02, 4.000e-01],
       [      nan, 2.500e+00, 1.210e+01],
       [      nan, 5.550e+01, 1.750e+02],
       [      nan, 1.000e+02, 1.570e+02],
       ...

## Cleaning data
Now that our data is imported, we can see that we have a fair number of elements called `nan`, which stands for *not a number*. `nan` values are problematic because they can prevent us from performing meaningful analysis with our data.

> *What's the origin of `nan` values?<br>*
Recall that `numpy` arrays can only contain a single data type. Because our data is mostly numerical, `np.genfromtxt()` imports it as a numerical-only array by default. Any non-numerical data (for example, column or row names) or missing data values (like empty cells in Excel) are automatically transformed into `nan` values.<br>If you download `animals2`, you'll see that it has both column and row names: hence, it has both a row and a column of `nan` values. Tomorrow, we'll learn about a data structure that borrows heavily from arrays and *also* lets us use multiple types.
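
The effect is easy to reproduce on a tiny example. The text below is a hypothetical file with a header row and a column of row names, not the real `animals2` data:

```python
import io
import numpy as np

# hypothetical file contents: a header row plus a column of row names
text = io.StringIO("name,body,brain\ncat,3.3,25.6\ndog,10.0,115.0")

table = np.genfromtxt(text, delimiter=',')
print(table)

# three header cells plus two row names could not be read as numbers
print(np.isnan(table).sum())   # 5
```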

To download the data file, go to the left-hand panel of the Colab notebook and click on the folder icon at the bottom of the panel. This will bring you to Colab's Files menu. Open the `pythonbootcamp/day_3` folder, where you should see a file called `Animals2.csv`: if you hover over it and click the menu with three dots, you can copy the file path (if you want to import it again) or download the file.

There are many methods to resolve these `nan` values. This process is known as **data cleaning**, and it's one of the most important steps for us to do ahead of data analysis.

Let's start by mocking up a smaller array with interspersed `nan` values. We'll practice with this array before returning to `animals2`.

In [94]:
# manually create a smaller sample array with nan values
# using np.nan

nan_array = np.array([[np.nan, np.nan, np.nan, np.nan, np.nan],
                      [0, 2, np.nan, 3, 5],
                      [1, 3, 5, 0, 9],
                      [4, 2, 9, np.nan, 2]])
nan_array

array([[nan, nan, nan, nan, nan],
       [ 0.,  2., nan,  3.,  5.],
       [ 1.,  3.,  5.,  0.,  9.],
       [ 4.,  2.,  9., nan,  2.]])

What happens if we try to perform operations with this array?

In [95]:
# try it out:
# get the column-wise mean for each column
nan_array.mean(axis=0)

array([nan, nan, nan, nan, nan])

### Identifying `nan` values
We can resolve `nan` values using re-assignment and filtering, which we learned about yesterday, alongside a new function called `np.isnan()`.

`np.isnan()` takes in an existing array and performs a logic check on its values, returning a Boolean array where `True` indicates the presence of a `nan` value.

In [96]:
print('Array with nans:\n', nan_array)

print('Boolean map of nans:\n', np.isnan(nan_array))

Array with nans:
 [[nan nan nan nan nan]
 [ 0.  2. nan  3.  5.]
 [ 1.  3.  5.  0.  9.]
 [ 4.  2.  9. nan  2.]]
Boolean map of nans:
 [[ True  True  True  True  True]
 [False False  True False False]
 [False False False False False]
 [False False False  True False]]


In [98]:
#True = 1
#False = 0

np.array([True, False]).mean()

0.5

Although it's easy to manually inspect the Boolean map for our above sample array, it may not be so simple for larger arrays. Thankfully, we can use the `.sum()` method on the Boolean map to get a count of `nan` values. If the sum is non-zero, we have `nan` values.

In [109]:
# try it out:
# use .sum() on the Boolean map to get the number of nan values
boolmap = np.isnan(nan_array)
print(boolmap.sum())

# use .sum() and the axis parameter to get the number of nan values in each row
print(boolmap.sum(axis=1))

7
[5 1 0 1]

### Removing all-`nan` rows/cols

Let's start by performing an obvious data-cleaning step: removing "empty" rows and columns that contain all `nan` values. These rows/columns are devoid of data value and prevent us from performing calculations with our array.

The `.all()` method summarizes the `np.isnan()` map by collapsing it either row-wise or column-wise, depending on the `axis` parameter. Rows/columns that are made up entirely of `nan` values will be collapsed to a `True` value; everything else collapses to `False`.

In [112]:
print(np.isnan(nan_array)[1, :])
print(np.isnan(nan_array)[1, :].all()) # .all() is True only if every element in the array is True

[False False  True False False]
False


In [118]:
print('Boolean map of nans:\n', np.isnan(nan_array))
print('\nColumns containing all nans:\n', np.isnan(nan_array).all(axis=0))

print('\nRows containing all nans:\n', np.isnan(nan_array).all(axis=1))

Boolean map of nans:
 [[ True  True  True  True  True]
 [False False  True False False]
 [False False False False False]
 [False False False  True False]]

Columns containing all nans:
 [False False False False False]

Rows containing all nans:
 [ True False False False]


Above, we see that we have no all-`nan` columns, but we do have one all-`nan` row.

We're going to perform our most complicated filtering and slicing trick yet: we're going to take the 1D Boolean map of rows containing only `nan` values and use it as a Boolean mask to remove the all-`nan` row.

Let's start by examining the 1D Boolean map generated with `.all()`. If we use this as a Boolean mask, we end up selecting *only* the all-`nan` row.

In [125]:
# you can use a 1D Boolean array as a row index to select rows
# True = selects the row
# False = does not select the row
# (named row_filter so we don't overwrite Python's built-in filter function)
row_filter = np.isnan(nan_array).all(axis=1)
print("after filter\n", nan_array[row_filter, :])
print(row_filter)

# ~ flips every value in the mask, so the other rows are selected instead
print("after filter\n", nan_array[~row_filter, :])

after filter
 [[nan nan nan nan nan]]
[ True False False False]
after filter
 [[ 0.  2. nan  3.  5.]
 [ 1.  3.  5.  0.  9.]
 [ 4.  2.  9. nan  2.]]


How does this work?

Above, `np.isnan(nan_array).all(axis = 1)` corresponds to:

```
[True, False, False, False]
```

The `True` value indicates that the first row of `nan_array` contains all `nan` values. When we select with this map, we only retain this all-`nan` row. That's the opposite of what we want!

Therefore, we'll need to **negate** our 1D Boolean map using a new operator: the `~` operator.

This operator will invert the Boolean values in the map, such that the `True` values (indicating a row containing all `nan` values) become `False`, and vice versa.
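
Here is a minimal sketch of `~` on a small hand-written mask (`mask` and `values` are made up for this illustration). `np.logical_not()` is the function form of the same operation.

```python
import numpy as np

mask = np.array([True, False, False, False])

print(~mask)                  # [False  True  True  True]
print(np.logical_not(mask))   # same result, written as a function call

# the flipped mask selects everything the original mask skipped
values = np.array([10, 20, 30, 40])
print(values[~mask])          # [20 30 40]
```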

In [126]:
print('Before negation:\n', np.isnan(nan_array).all(axis = 1))

print('After negation:\n', ~np.isnan(nan_array).all(axis = 1))

Before negation:
 [ True False False False]
After negation:
 [False  True  True  True]


This allows us to specifically *exclude* the all-`nan` row, which is marked as `False`.

In [127]:
print('Array with nans:\n', nan_array)

print('After filtering out all-nan row:\n', nan_array[~np.isnan(nan_array).all(axis = 1), :])

Array with nans:
 [[nan nan nan nan nan]
 [ 0.  2. nan  3.  5.]
 [ 1.  3.  5.  0.  9.]
 [ 4.  2.  9. nan  2.]]
After filtering out all-nan row:
 [[ 0.  2. nan  3.  5.]
 [ 1.  3.  5.  0.  9.]
 [ 4.  2.  9. nan  2.]]


### Setting `nan` to a default value

After removing the all-`nan` row, we still have some interspersed `nan` values. One common method for resolving `nan` values is simply replacing them with a real numeric value.

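The `np.nan_to_num()` function is a convenient function that replaces `nan` with the float `0.0` by default. To substitute a different value, pass it through the `nan` parameter (for example, `nan=-999`).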

In [131]:
# look at our array again
print('Array with nans:\n', nan_array)

# replace nan_array's nans with 0.0
print('Array with replaced nans:\n', np.nan_to_num(nan_array))
print('Array with replaced nans:\n', np.nan_to_num(nan_array, nan=-999))

Array with nans:
 [[nan nan nan nan nan]
 [ 0.  2. nan  3.  5.]
 [ 1.  3.  5.  0.  9.]
 [ 4.  2.  9. nan  2.]]
Array with replaced nans:
 [[0. 0. 0. 0. 0.]
 [0. 2. 0. 3. 5.]
 [1. 3. 5. 0. 9.]
 [4. 2. 9. 0. 2.]]
Array with replaced nans:
 [[-999. -999. -999. -999. -999.]
 [   0.    2. -999.    3.    5.]
 [   1.    3.    5.    0.    9.]
 [   4.    2.    9. -999.    2.]]


In [142]:
# try it out:
# update nan_array by removing all-nan rows
row_filter = np.isnan(nan_array).all(axis=1)
gg = nan_array[~row_filter, :]  # ~ flips the mask so the all-nan row is dropped
print("after filter\n", gg)

# update nan_array, using np.nan_to_num() to replace nan values
cc = np.nan_to_num(gg, nan=-999)
print('Array with replaced nans with -999:\n', cc)

# then calculate the column-wise mean
# (note how the -999 placeholder drags down the means of the third and fourth columns)
print(cc.mean(axis = 0))

after filter
 [[ 0.  2. nan  3.  5.]
 [ 1.  3.  5.  0.  9.]
 [ 4.  2.  9. nan  2.]]
Array with replaced nans with -999:
 [[   0.    2. -999.    3.    5.]
 [   1.    3.    5.    0.    9.]
 [   4.    2.    9. -999.    2.]]
[   1.66666667    2.33333333 -328.33333333 -332.            5.33333333]

Now that we've reviewed the various methods for cleaning our data, finish cleaning up `animals2` using the following steps:

* Count how many `nan` values exist in our dataset.
* Remove all-`nan` rows.
* Remove all-`nan` columns.
* Count how many `nan` values remain.

In [161]:
# use this cell to "reset" animals2 if necessary
animals2 = np.genfromtxt('/content/pythonbootcamp/day_3/Animals2.csv',
              delimiter=',')
animals2
print(animals2[(0,9),:]) # 1st and 10th row
animals2[:10,:]

[[ nan  nan  nan]
 [ nan 521. 655.]]


array([[      nan,       nan,       nan],
       [      nan, 1.350e+00, 8.100e+00],
       [      nan, 4.650e+02, 4.230e+02],
       [      nan, 3.633e+01, 1.195e+02],
       [      nan, 2.766e+01, 1.150e+02],
       [      nan, 1.040e+00, 5.500e+00],
       [      nan, 1.170e+04, 5.000e+01],
       [      nan, 2.547e+03, 4.603e+03],
       [      nan, 1.871e+02, 4.190e+02],
       [      nan, 5.210e+02, 6.550e+02]])

In [245]:
# try it out:
# Count how many nan values exist in our dataset
print("Total nans in dataset:", np.isnan(animals2).sum())

# Remove all-nan rows
fil = np.isnan(animals2).all(axis=1)
ss = animals2[~fil, :]  # the : in the column position keeps all columns

# Remove all-nan columns
fil2 = np.isnan(ss).all(axis=0)
sss = ss[:, ~fil2]

# Count how many nan values remain
print("Total nans in dataset:", np.isnan(sss).sum())

# the column mask: only the first column (row names) was all nan
print(fil2)


Total nans in dataset: 68
Total nans in dataset: 0
[ True False False]


# << `Exercises` >>

The following exercises introduce a function (`clean_data()`) and a dataset (`airquality`) that we'll need in the next section.

## clean_data()
Write a function called `clean_data()` that takes in a 2D array of values and removes all-`nan` rows and columns. Your function should return a cleaned array.

In [208]:
### write your code below ###

def clean_data(precleaned_array):
    # remove all-nan rows
    keep_rows = ~np.isnan(precleaned_array).all(axis=1)
    no_empty_rows = precleaned_array[keep_rows, :]

    # remove all-nan columns
    keep_cols = ~np.isnan(no_empty_rows).all(axis=0)
    cleaned_array = no_empty_rows[:, keep_cols]

    return cleaned_array


In [197]:
clean_data(animals2)

array([[1.350e+00, 8.100e+00],
       [4.650e+02, 4.230e+02],
       [3.633e+01, 1.195e+02],
       [2.766e+01, 1.150e+02],
       [1.040e+00, 5.500e+00],
       [1.170e+04, 5.000e+01],
       [2.547e+03, 4.603e+03],
       [1.871e+02, 4.190e+02],
       [5.210e+02, 6.550e+02],
       [1.000e+01, 1.150e+02],
       [3.300e+00, 2.560e+01],
       [5.290e+02, 6.800e+02],
       [2.070e+02, 4.060e+02],
       [6.200e+01, 1.320e+03],
       [6.654e+03, 5.712e+03],
       [9.400e+03, 7.000e+01],
       [6.800e+00, 1.790e+02],
       [3.500e+01, 5.600e+01],
       [1.200e-01, 1.000e+00],
       [2.300e-02, 4.000e-01],
       [2.500e+00, 1.210e+01],
       [5.550e+01, 1.750e+02],
       [1.000e+02, 1.570e+02],
       [5.216e+01, 4.400e+02],
       [2.800e-01, 1.900e+00],
       [8.700e+04, 1.545e+02],
       [1.220e-01, 3.000e+00],
       [1.920e+02, 1.800e+02],
       [3.385e+00, 4.450e+01],
       [4.800e-01, 1.550e+01],
       [1.483e+01, 9.820e+01],
       [4.190e+00, 5.800e+01],
       ...

In [246]:
# you can test out your function by using it on nan_array
nan_array = np.array([[np.nan, np.nan, np.nan, np.nan, np.nan],
                      [0, 2, np.nan, 3, 5],
                      [1, 3, 5, 0, 9],
                      [4, 2, 9, np.nan, 2]])

#print('Nan_array: \n', nan_array)
#print('Cleaned array:\n', clean_data(nan_array))

# Did it work? :)
clean_data(nan_array)

array([[ 0.,  2., nan,  3.,  5.],
       [ 1.,  3.,  5.,  0.,  9.],
       [ 4.,  2.,  9., nan,  2.]])

## `airquality` dataset
Let's try importing and cleaning another dataset called `airquality`. This dataset is a little more complex, containing 153 daily observations of the air quality values from May to September 1973. [[source]](https://vincentarelbundock.github.io/Rdatasets/doc/datasets/airquality.html)

| Index  | Description                   |
|--------|-------------------------------|
| 0      | Ozone (ppb)                   |
| 1      | Solar radiation (Langleys)    |
| 2      | Wind (mph)                    |
| 3      | Temperature (degrees F)       |
| 4      | Month (`1`-`12`)              |
| 5      | Day of month (`1`-`31`)       |

Start by importing the data using `np.genfromtxt()` and the file path below:

```
'/content/pythonbootcamp/day_3/airquality.csv'
```

In [209]:
### import the data using this cell ###
airquality = np.genfromtxt("/content/pythonbootcamp/day_3/airquality.csv", delimiter = ",")
airquality

array([[ nan,  nan,  nan, ...,  nan,  nan,  nan],
       [ nan,  41., 190., ...,  67.,   5.,   1.],
       [ nan,  36., 118., ...,  72.,   5.,   2.],
       ...,
       [ nan,  14., 191., ...,  75.,   9.,  28.],
       [ nan,  18., 131., ...,  76.,   9.,  29.],
       [ nan,  20., 223., ...,  68.,   9.,  30.]])

How many `nan` values are there? Using `clean_data()`, remove the all-`nan` row and column. How many `nan` values are there after cleaning? Can you identify the remaining `nan` values?

In [210]:
### clean your data below ###
x = clean_data(airquality)
print("Total nans in dataset:", np.isnan(x).sum())

Total nans in dataset: 44


Earlier, we discussed using `np.nan_to_num()` as a way to simply fill in `nan` values with `0.0` substitute values. Although `np.nan_to_num()` is a quick and simple fix, substituting `0.0` for measurements in the `airquality` dataset may compromise our analyses.

Here are some things to explore and consider:
* What is the relative proportion of rows that have any `nan` values, compared to the total number of rows in `airquality`?
* Do the `nan` values appear evenly interspersed throughout all of the data, or do they mostly fall in one column?
  * Do rows with `nan` values tend to have multiple `nan` values, or just one?
* Will the dataset be valuable with partial data per observation, or are the observations only valuable if all columns are filled?
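
These questions can all be answered by summarizing the Boolean map along different axes. The sketch below uses a small made-up array (`toy`) rather than `airquality`, so you can still work out the real answers yourself.

```python
import numpy as np

# hypothetical measurements: 4 observations (rows) x 3 variables (columns)
toy = np.array([[1.0, np.nan, 3.0],
                [4.0, 5.0, 6.0],
                [np.nan, np.nan, 9.0],
                [7.0, 8.0, 9.0]])

nan_map = np.isnan(toy)

rows_with_nan = nan_map.any(axis=1)     # True where a row has at least one nan
share_of_rows = rows_with_nan.mean()    # True counts as 1, so the mean is a proportion
nans_per_column = nan_map.sum(axis=0)   # are the nans concentrated in one column?
nans_per_row = nan_map.sum(axis=1)      # do rows tend to have one nan or several?

print(share_of_rows)     # 0.5
print(nans_per_column)   # [1 2 0]
print(nans_per_row)      # [1 0 2 0]
```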

In [214]:
# find the number of rows with *any* nan values
# use the cleaned array x: in the raw import, the row-name column
# is all nan, so every single row would be flagged
print(x.shape)
rows_with_nan = np.isnan(x).any(axis=1)
print(rows_with_nan.sum())

(153, 6)
42

*Challenge*: One method of resolving `nan` values is to substitute `nan` with values that minimize data distortion. For example, we can use **mean imputation** to substitute the remaining `nan` values with the *mean* of non-`nan` values in the ozone and solar radiation columns.

`numpy` provides a useful function called `np.nanmean()`, which excludes `nan` values. You can read about it [here](https://numpy.org/doc/stable/reference/generated/numpy.nanmean.html?highlight=nanmean#numpy.nanmean).

Try starting with just the ozone column: how can you replace the `nan` values with the `nan`-excluding mean of the existing ozone values?

> *Hint*: If you have a Boolean map that tells you where `nan` values are, you can use the map to select those values and assign them to a different value. Use this strategy to change existing `nan` values to the `nan`-excluding mean value.
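
The hint's strategy looks like this on a made-up 1D array of readings (not the real ozone column): build the Boolean map, compute the `nan`-excluding mean, then assign through the map.

```python
import numpy as np

# hypothetical single column of readings with two missing values
readings = np.array([10.0, np.nan, 30.0, np.nan, 20.0])

missing = np.isnan(readings)         # Boolean map of the nan positions
fill_value = np.nanmean(readings)    # mean of 10, 30 and 20
readings[missing] = fill_value       # assign only where the map is True

print(fill_value)   # 20.0
print(readings)     # [10. 20. 30. 20. 20.]
```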

In [215]:
# Hint: review this example from yesterday

rng = np.random.default_rng(2024)
random_ints = rng.integers(-10, 10, 25)
random_ints = random_ints.reshape((5,5))
print('Before:\n',random_ints)

# select all elements in random_ints that are less than 0
# and then assign them to be 0

random_ints[random_ints < 0] = 0
print('After:\n', random_ints)

Before:
 [[-6  3 -9 -6 -4]
 [-4  8  5  8  9]
 [-9 -8  7 -9 -7]
 [-7  8 -3 -5 -7]
 [-1  1  6  2  9]]
After:
 [[0 3 0 0 0]
 [0 8 5 8 9]
 [0 0 7 0 0]
 [0 8 0 0 0]
 [0 1 6 2 9]]


In [267]:
### write your code below ###
airquality = np.genfromtxt("/content/pythonbootcamp/day_3/airquality.csv", delimiter = ",")

# in the raw import, row 0 is the header and column 0 holds the row names,
# so the ozone values sit in airquality[1:, 1]
print("before add mean:\n", np.isnan(airquality)[1:,1].sum())

# mean of the ozone column, ignoring nan values
ozone_mean = np.nanmean(airquality[1:,1])

# replace the nan values in the ozone column with that mean
airquality[1:,1] = np.nan_to_num(airquality[1:,1], nan=ozone_mean)

print(airquality[1:13,1])  # ozone: nans are now the mean
print(airquality[1:13,2])  # solar radiation: still has nans
np.isnan(airquality)[1:,1].sum()

before add mean:
 37
[41.         36.         12.         18.         42.12931034 28.
 23.         19.          8.         42.12931034  7.         16.        ]
[190. 118. 149. 313.  nan  nan 299.  99.  19. 194.  nan 256.]


0

# [Optional] Removing rows/cols with *any* `nan`s

For absolutely stringent filtering, we can remove all rows and/or columns that contain any `nan` values.

The `.any()` method is similar to the `.all()` method. Applied to a Boolean map of `nan` values, however, it returns a collapsed map where `True` marks a row or column that contains *any* `nan` values.

In [None]:
# re-initialize our array with nan values
nan_array = np.array([[np.nan, np.nan, np.nan, np.nan, np.nan],
                      [0, 2, np.nan, 3, 5],
                      [1, 3, 5, 0, 9],
                      [4, 2, 9, np.nan, 2]])

# look at our map again
print('Array with nans:\n', nan_array)

print('Boolean map of nans:\n', np.isnan(nan_array))

In [None]:
# try it out
# use the .any() method to filter out the rows containing *any* nan values


As you can see, this method can be a little *too* stringent. Most of the time, you can retain the bulk of your data by trimming all-`nan` rows/columns and setting a default value for the rest. Unless you have a very good reason to exclude all missing values, tread carefully!
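For reference, one possible version of the row filter from the exercise, sketched on the same `nan_array`. It assumes `axis=1` to collapse the map within each row, and uses the `~` operator to flip `True` and `False`. The names `rows_with_nan` and `complete_rows` are only illustrative.

```python
import numpy as np

nan_array = np.array([[np.nan, np.nan, np.nan, np.nan, np.nan],
                      [0, 2, np.nan, 3, 5],
                      [1, 3, 5, 0, 9],
                      [4, 2, 9, np.nan, 2]])

# collapse the Boolean map within each row: True if the row has any nan
rows_with_nan = np.isnan(nan_array).any(axis=1)

# keep only the rows where the collapsed map is False
complete_rows = nan_array[~rows_with_nan]

print(complete_rows)
```

Only one of the four rows survives this filter, which illustrates how much data the stringent approach can discard.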

# [Optional] Simple random samples
With very large datasets, we may be interested in *randomly* sampling our rows. We can do this by initiating a random number generator (`rng`), then using the `.choice()` method to draw random rows *without* replacement.

The `.choice()` method takes two main inputs: an array that we wish to sample from, and a number of rows. By default it samples *with* replacement, so we also pass `replace=False` to make sure no row is drawn twice.

In [None]:
# let's sample 10 rows from animals2

rng = np.random.default_rng() # run this a few times: you should get different values
rng.choice(animals2, 10, replace=False)
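Note that `.choice()` samples *with* replacement unless told otherwise. Here is a small sketch of sampling with `replace=False`, using a made-up array called `toy` as a stand-in for a real dataset:

```python
import numpy as np

# hypothetical stand-in for a dataset: 6 rows, 2 columns
toy = np.arange(12).reshape((6, 2))

rng = np.random.default_rng(2024)  # seeded so the draw is repeatable

# draw 4 distinct rows; replace=False prevents the same row appearing twice
sample = rng.choice(toy, 4, replace=False)

print(sample)
```

With `replace=False`, the number of rows requested cannot exceed the number of rows in the array.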

# [Optional] Cloning files from GitHub

[GitHub](https://github.com/) is a website that hosts code and files for software development projects. It serves two major functions: backing up **codebases** (files with data and code that work together) and enabling collaboration between programmers/developers.

We (your staff team) use GitHub as a **repository** for files that are used during PyCamp. We do this so that we have a stable copy of these files that stays out of "I spilled coffee on my laptop the night before PyCamp", or "my laptop was ransomed for cryptocurrency" territory. Moreover, if we accidentally delete a file from the repository, GitHub's **version control** allows us to roll back the repository to a working version. Neat, right?

The command below allows us to **clone** these files from the GitHub repository to our local runtime's session storage. This lets us skip the messy steps of trying to get everyone to download and re-upload the right data.

```
!git clone https://github.com/ccbskillssem/pythonbootcamp.git
```

The `!` prefix is used to indicate *special commands* that would normally be run at a computer's **command line**, rather than in Python. This is akin to communicating with a computer (or in Colab, our runtime) directly to tell it that we want to download files from the given URL.

The GitHub URL that you see above ends in `.git`. It does not point to a single data file: rather, it identifies the GitHub repository of interest, and therefore all the files it contains. In this manner, we never have to worry about giving the file path to each file we want: we just pull all the files in the repository by giving its `.git` URL.

# [Optional] More methods for external data

This section describes the bare essentials of file uploads/downloads with Colab. For a more in-depth exploration, you can visit the official Google Colab notebook on data I/O [here](https://colab.research.google.com/notebooks/io.ipynb).

## Loading data from your computer
You can use Colab's `Files` menu to upload data from your own computer to Colab's temporary **session storage**. Session storage is reset each time the notebook runtime ends or is otherwise reset.

Go to the left hand panel of the Colab notebook and click on the folder icon at the bottom of the panel. This will bring you to Colab's `Files` menu.

Click on the leftmost icon underneath the `'Files'` title of the panel: it should appear as a piece of paper with an up arrow on it. Follow the prompts to upload your data of choice. Once your file is uploaded, you can access the file path by hovering over the file name, clicking on the three-dot menu, then selecting `Copy path`.

___

**CAUTION**: Files that you upload are NOT retained in the `Files` panel after you close the notebook or reset the runtime. If you would prefer to avoid the upload process, consider the next section on loading data from Google Drive.

___

## Loading data from Google Drive
Google Drive is an excellent cloud storage solution for data you wish to work with in Colab. Colab makes it simple to access your Google Drive files: all you have to do is open the `Files` menu by clicking the folder icon on the left hand panel of the Colab notebook.

Once you're in the `Files` menu, click on the third icon below the `'Files'` title: it should appear as a filled-in white folder with the Google Drive icon. Click this button to connect Google Drive to Colab: a pop-up should appear asking you to confirm that you wish to do this, and you may need to wait a few minutes while Google Drive loads.

Once your Drive is mounted, you should see a new folder called `drive` in the `Files` menu. You can access the file path by hovering over the file name, clicking on the three-dot menu, then selecting `Copy path`.

## Saving your arrays to external files

`numpy` also provides a utility function called `np.savetxt()` that saves arrays to external files. We won't be using it in the remainder of the bootcamp, but it might be useful for your post-bootcamp adventures.

`np.savetxt()` takes three key inputs:
1. A file name.
2. The array of interest.
3. A `delimiter` string.

In [None]:
# let's sample 5 random rows from airquality

rng = np.random.default_rng(2024) # this time, using a seed value for reproducibility
random_airquality = rng.choice(airquality, 5)
random_airquality

In [None]:
# now let's save random_airquality as a comma-separated value file
np.savetxt('random_airquality.csv', random_airquality, delimiter = ',')

Go to the left hand panel of the Colab notebook and click on the folder icon at the bottom of the panel. This will bring you to Colab's `Files` menu. You should see a file called `random_airquality.csv`: if you hover over it and click the menu with three dots, you can obtain the file path (if you want to import it again) or download the file.
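To check that a saved file can be read back, here is a minimal round-trip sketch. The `small` array and the temporary file location are made up for illustration; the file is read back with `np.genfromtxt()` using the same delimiter.

```python
import os
import tempfile

import numpy as np

# made-up array for illustration
small = np.array([[1.5, 2.0, 3.25],
                  [4.0, 5.5, 6.0]])

# write to a temporary folder rather than the working directory
out_path = os.path.join(tempfile.mkdtemp(), 'small.csv')
np.savetxt(out_path, small, delimiter=',')

# read the file back with the same delimiter
reloaded = np.genfromtxt(out_path, delimiter=',')

# compare the reloaded array with the original
print(np.array_equal(small, reloaded))
```

The delimiter used for reading has to match the one used for saving, or the values will not be split into the right columns.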
___
**CAUTION**: Files that you save while using Colab are not retained after you close the notebook, as they only exist in Colab's temporary **session storage**. If you generate files and wish to keep them, make sure to download your files (with the same three dots menu) before you exit Colab.
___