# First steps with NumPy and Pandas


You can do the following exercises here on Colab or in your own IDE, whichever is more convenient for you.

In [None]:
import numpy as np
import pandas as pd

## numpy

Create a 1-D array of numbers from 1 to 10 (have a look at the `np.arange()` function)

Multiply the previously created vector by 2

Create a new array of 20 evenly spaced numbers between -1 and 1

Use the `.repeat()` method to repeat each element in the vector three times
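One possible way to approach the four tasks above is sketched below. It is not the only solution, and the variable names (`v`, `doubled`, `grid`, `repeated`) are made up for this example; it also assumes that "the vector" to repeat is the first one.

```python
import numpy as np

# Integers from 1 to 10; the stop value of np.arange() is exclusive, hence 11
v = np.arange(1, 11)

# Arithmetic on an array is applied element by element
doubled = v * 2

# 20 evenly spaced values from -1 to 1, both endpoints included
grid = np.linspace(-1, 1, 20)

# .repeat(3) repeats each element three times in a row: 1, 1, 1, 2, 2, 2, ...
repeated = v.repeat(3)

print(v)
print(doubled)
print(grid)
print(repeated)
```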

Create a new array of 100 randomly generated, normally-distributed values with mean=0 and sd=1

Compute the mean (`.mean()`) and standard deviation (`.std()`) of the above array, using both the function-style syntax ( `<package>.<function>(<object>)` ) and the method-style syntax ( `<object>.<method>()` )

Select all elements greater than 0 in the above array, using appropriate indexing with `[ ]`
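A minimal sketch of the last three tasks, assuming `np.random.normal()` as the generator. The seed is fixed only to make the example reproducible, and the names `x` and `positive` are arbitrary.

```python
import numpy as np

np.random.seed(0)  # fixed seed, only so that the example is reproducible

# 100 draws from a normal distribution with mean 0 and sd 1
x = np.random.normal(loc=0, scale=1, size=100)

# Function-style syntax: <package>.<function>(<object>)
print(np.mean(x), np.std(x))

# Method-style syntax: <object>.<method>()
print(x.mean(), x.std())

# Boolean indexing: x > 0 is an array of True/False used as a mask
positive = x[x > 0]
print(positive.size)
```

Both syntaxes return the same values. Note that NumPy's `.std()` divides by N by default; pass `ddof=1` if you want the sample standard deviation.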

Check the help/documentation to see what the `np.empty()` function does

Use the `np.empty()` function to initialize a new vector of 30 floating-point numbers (hint: `"float"` or `"float64"`); then print its content (does it look empty? Not really, right?)

Now make the previously created array *actually* empty... by filling it with missing values (hint: use `[:]` for indexing the whole vector at once, and remember that missing values are represented by `np.nan`)
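The two `np.empty()` steps above could look like the sketch below (the name `e` is arbitrary). `np.empty()` only reserves memory without initializing it, so the first print shows whatever values happen to be there.

```python
import numpy as np

# Allocate 30 floats WITHOUT initializing them
e = np.empty(30, dtype="float64")
print(e)  # arbitrary leftover values; they can differ from run to run

# [:] addresses every element, so this overwrites the whole vector in place
e[:] = np.nan
print(e)
```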

Use the `np.empty()` function to initialize a new array of 30 strings, each made up of at most 3 Unicode characters (hint: `"<U3"`), then:
* fill the first position of the array with a long string;
* print the array and see what happened to that long string 😏
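A short sketch of the string exercise, with a made-up example string and an arbitrary array name `s`:

```python
import numpy as np

# 30 strings, each stored in a fixed-width slot of 3 Unicode characters
s = np.empty(30, dtype="<U3")

# Assigning a longer string does not raise an error...
s[0] = "a very long string"

# ...but only what fits in the fixed width is kept
print(s)
print(len(s[0]))
```

The fixed width is part of the array's dtype, so NumPy silently truncates anything longer instead of resizing the array.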

Create a new array with 5 rows and 15 columns, filled with any kind of randomly generated numbers, then:
*  compute the overall mean (grand average);
*  compute the mean value for each column (hint: `axis=0`);
*  compute the mean value for each row (hint: `axis=1`);
*  now do the same, but compute standard deviations instead of means.
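The `axis` argument in the list above can be sketched as follows. Normally distributed numbers are used here, but any generator would do, and the variable names are arbitrary.

```python
import numpy as np

np.random.seed(0)  # fixed seed, only so that the example is reproducible

# 5 rows and 15 columns of random numbers
m = np.random.normal(size=(5, 15))

grand_mean = m.mean()        # a single number
col_means = m.mean(axis=0)   # collapses the rows: one value per column (15)
row_means = m.mean(axis=1)   # collapses the columns: one value per row (5)

# Same logic for standard deviations
grand_sd = m.std()
col_sds = m.std(axis=0)
row_sds = m.std(axis=1)

print(grand_mean, col_means.shape, row_means.shape)
print(grand_sd, col_sds.shape, row_sds.shape)
```

A handy way to remember it: `axis` names the dimension that disappears from the result.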

## (mostly) pandas

Generate a new `DataFrame` called `df` with 100 rows and three columns:
*  "id" with a sequence of integer numbers representing participants' IDs;
*  "age" with floating-point numbers randomly drawn from a uniform distribution (hint: `np.random.uniform()`) between 5 and 10 (these are ages in years);
*  "score" with floating-point numbers randomly drawn from a normal distribution (hint: `np.random.normal()`) with mean=50 and sd=10 (these are test scores).


Then make the simulated "score" values more realistic: decrease each score by 2 points for every year of age below 7 and increase it by 2 points for every year of age above 7; if possible, do everything in a single line of code 😊
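Here is one way the DataFrame and the score adjustment might be written. It assumes the adjustment is linear in age, that is, scores go down by 2 points per year below 7 and up by 2 points per year above 7, which is what makes a single line possible. The seed and the helper `raw_score` are additions for this example only.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed, only so that the example is reproducible
n = 100

df = pd.DataFrame({
    "id": np.arange(1, n + 1),                   # participant IDs 1..100
    "age": np.random.uniform(5, 10, size=n),     # ages in years
    "score": np.random.normal(50, 10, size=n),   # test scores
})

raw_score = df["score"].copy()  # kept only to check the adjustment afterwards

# (age - 7) is negative below 7 and positive above 7, so one line covers both cases
df["score"] = df["score"] + 2 * (df["age"] - 7)

print(df.head())
```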

Now round both "age" and "score" to one decimal place (hint: `.round()`)

Now compute the correlation between "age" and "score" (hint: `np.corrcoef()`)

Add a new column to the DataFrame with the z-scores of the "score" variable
(hint: z-scores are scores minus their mean and then divided by their standard deviation)
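The rounding, correlation, and z-score steps could be sketched like this. The data are re-simulated so that the block runs on its own, and the column name `z_score` is just a suggestion.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed, only so that the example is reproducible
n = 100
age = np.random.uniform(5, 10, size=n)
score = np.random.normal(50, 10, size=n) + 2 * (age - 7)
df = pd.DataFrame({"id": np.arange(1, n + 1), "age": age, "score": score})

# Round both columns to one decimal place
df[["age", "score"]] = df[["age", "score"]].round(1)

# np.corrcoef() returns a 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(df["age"], df["score"])[0, 1]
print(r)

# z-scores: subtract the mean, then divide by the standard deviation
df["z_score"] = (df["score"] - df["score"].mean()) / df["score"].std()
print(df.head())
```

Keep in mind that the pandas `.std()` method uses the sample standard deviation (`ddof=1`) by default, whereas NumPy's uses `ddof=0`, so the two give slightly different z-scores.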

Filter the rows to see only those where the "score" is above average

Add a new column called "above" that contains the logical value `True` if "score" is above average, and `False` otherwise
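Filtering and the logical column both rely on the same comparison, as this sketch tries to show. The data are re-simulated so the block is self-contained, and `above_avg` is an arbitrary name.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed, only so that the example is reproducible
n = 100
age = np.random.uniform(5, 10, size=n)
score = np.random.normal(50, 10, size=n) + 2 * (age - 7)
df = pd.DataFrame({"id": np.arange(1, n + 1), "age": age, "score": score})

# The comparison yields a Series of True/False; used inside [ ] it filters the rows
above_avg = df[df["score"] > df["score"].mean()]
print(above_avg.head())

# Stored as a column, the very same comparison becomes the "above" variable
df["above"] = df["score"] > df["score"].mean()
print(df.head())
```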

Use the following methods and attributes to inspect different aspects of the data frame: `.head()`, `.tail()`, `.describe()`, `.ndim`, `.shape`, `.dtypes`

#### Now, a little bit more advanced:

Create a new column that categorizes scores as "high" (>60), "middle" (40-60), or "low" (<40); try to do that without triggering warnings (hint: use `.loc[]`, or for a more complex alternative that does everything in a single line of code, use `np.select()` as presented [here](https://enricotoffalini.github.io/Basics-Python/Slides/20.Programming.html) )

Compute the median (hint: `.median()`) "age" for each score category, if possible using a single line of code (hint: use `.groupby()`)
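A possible sketch of the two advanced tasks, showing both the `.loc[]` and the `np.select()` route. It assumes non-overlapping cut-offs at 40 and 60 (adjust them to the thresholds you intend); the column names `category` and `category_np` are made up, and the data are re-simulated so the block runs on its own.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed, only so that the example is reproducible
n = 100
age = np.random.uniform(5, 10, size=n)
score = np.random.normal(50, 10, size=n) + 2 * (age - 7)
df = pd.DataFrame({"id": np.arange(1, n + 1), "age": age, "score": score})

# Option 1: start from a default label, then overwrite subsets with .loc[]
df["category"] = "middle"
df.loc[df["score"] > 60, "category"] = "high"
df.loc[df["score"] < 40, "category"] = "low"

# Option 2: np.select() assigns, row by row, the label of the first condition that is True
df["category_np"] = np.select(
    [df["score"] > 60, df["score"] < 40],
    ["high", "low"],
    default="middle",
)

# Median age within each score category, in a single line
median_age = df.groupby("category")["age"].median()
print(median_age)
```

Assigning through `.loc[rows, column]` modifies `df` directly, which is what avoids the warnings that chained indexing such as `df[mask]["category"] = ...` tends to produce.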