<a href="https://colab.research.google.com/github/CommunityRADvocate/ida-colabs/blob/main/Week_10_Activities_Introduction_to_NumPy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 10 Activities - Welcome to NumPy!

This week's activities focus on using NumPy in your projects to generate findings and complete the requirements. You'll start working with your data in Python and use NumPy to analyze it.

In [None]:
# This block sets up the notebook by importing libraries and our project data. Make sure you run it before trying to run other blocks of code!
import pandas as pd
import numpy as np

# extension for interactive tables -- limited to 20 columns
# %load_ext google.colab.data_table

# add link to dataset
url = 'https://raw.githubusercontent.com/CommunityRADvocate/ida2404-capstone/main/mxmh_survey_results.csv'
# read dataset and set it as a dataframe using pandas
df = pd.read_csv(url)

# see an overview of the imported dataframe, including column headers and their indexes
df.info()

In [None]:
# Let's put our dataframe into a NumPy array so we can work with it. Just run this block of code!
np_array = df.to_numpy()
np_array
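
To see what `to_numpy()` does on a smaller scale, here's a minimal sketch using a made-up two-row DataFrame (the names `toy_df` and `toy_array` and all values are just for illustration, not from the survey). It assumes a mix of number and text columns, like our dataset has:

```python
import numpy as np
import pandas as pd

# Made-up DataFrame for illustration only (not the survey data)
toy_df = pd.DataFrame({
    "Age": [18, 63],
    "Service": ["Spotify", "YouTube Music"],
})

toy_array = toy_df.to_numpy()

print(toy_array.shape)  # (2, 2)
print(toy_array.dtype)  # object, because the columns mix numbers and text

# A numeric column can be converted to floats if needed
toy_age = toy_array[:, 0].astype(float)
print(toy_age)
```

If a statistics function complains about data types later on, converting a numeric column with `.astype(float)` like this is one possible fix.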

## Using NumPy to understand and slice our data

Next, we'll look at the rank and shape of our data. If you need a reminder on how to determine this, look at the documentation:


*   Get rank using [ndim](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.ndim.html)
*   Get the [shape](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.shape.html) of your array
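
Here's a minimal sketch of both attributes on a small made-up array (the name `toy` and its values are just for illustration):

```python
import numpy as np

# Made-up array for illustration: 3 rows by 4 columns
toy = np.array([
    [18, 3.0, 1, 0],
    [63, 1.5, 0, 1],
    [25, 4.0, 1, 1],
])

rank = toy.ndim     # number of dimensions (axes)
shape = toy.shape   # (number of rows, number of columns)

print(rank)   # 2
print(shape)  # (3, 4)
```

Both `ndim` and `shape` are attributes, so they don't take parentheses.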



In [None]:
# Your code here to determine the rank of the survey data in np_array

In [None]:
# Your code here to determine the shape of the array

Now, slice your array to get the column with Primary Streaming Service data and store it in a variable called *streaming_service*. Do the same for the Age column, stored in a variable called *age*, and the Hours per day column, stored in a variable called *hours*.

***Refer to the*** `df.info()` ***block above to view the index numbers that correspond with the columns***

Print out each column to make sure you got the right data! Reference the [indexing documentation](https://numpy.org/doc/stable/user/basics.indexing.html#) for more information.


```
# To pull all values in a row, you would write:
array_name[n,:]
# where n is the index of the row you want (remember, Python starts counting at 0!)

# Alternatively, to get a column, you would write:
array_name[:,n]
# where n is the index of the column you want

# Be sure to store your column in a new variable!
```
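
As a runnable version of the pattern above, here's a small sketch on a made-up array (the names and values are hypothetical; in this toy example column 0 holds ages and column 1 holds hours):

```python
import numpy as np

# Made-up array for illustration: column 0 = age, column 1 = hours per day
toy = np.array([
    [18, 3.0],
    [63, 1.5],
    [25, 4.0],
])

first_row = toy[0, :]   # every value in row 0
toy_age = toy[:, 0]     # every value in column 0
toy_hours = toy[:, 1]   # every value in column 1

print(first_row)
print(toy_age)
print(toy_hours)
```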



In [None]:
# your code here to create and print the columns streaming_service, age, and hours

Next, use boolean indexing to filter the ages to include only those over 30. Save it in a variable called *ages_over_30*. Reference the same indexing documentation!


```
# Here's the basic format:
new_variable = original_array[<condition>]
```

Then, count the number of survey participants older than 30 using the [size](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.size.html) attribute of your new variable (it's an attribute, so no parentheses are needed). Also count the total number of ages in the original *age* array.
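
Here's a minimal sketch of boolean indexing and `size` on a made-up array (the names `toy_age` and `toy_over_30` and the values are just for illustration):

```python
import numpy as np

# Made-up ages for illustration only (not the survey data)
toy_age = np.array([18, 63, 25, 41, 30])

mask = toy_age > 30          # one True/False value per element
toy_over_30 = toy_age[mask]  # keeps only the elements where mask is True

print(toy_over_30)       # [63 41]
print(toy_age.size)      # 5
print(toy_over_30.size)  # 2
```

Notice that 30 itself is not included, because the condition is strictly greater than 30.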

In [None]:
# your code here to create ages_over_30 and count the number of ages in total and over 30

*There will be a RuntimeWarning due to a `nan` value in the age column, but the code will still run; we'll get more into data cleaning with Pandas*
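
If you're curious why a `> 0` filter can remove `nan` values, here's a small sketch on made-up data. Any comparison with `nan` evaluates to False, so those elements are dropped by the mask. The `np.isnan` version is shown as one alternative, not as the approach this notebook uses:

```python
import numpy as np

# Made-up ages with one missing value, for illustration only
toy_age = np.array([18.0, np.nan, 25.0, 41.0])

mask = toy_age > 0   # the nan position comes out as False
cleaned = toy_age[mask]

# Alternative: drop only the nan values, keeping everything else
cleaned_alt = toy_age[~np.isnan(toy_age)]

print(mask)
print(cleaned)
print(cleaned_alt)
```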

In [None]:
# We'll also remove any null values. Comparisons with nan are False, so `> 0` drops them (along with any values of 0 or less). You can just run this code after creating your variables!
age = age[age > 0]
hours = hours[hours > 0]
# Ignore the runtime warning for now :)


## Using NumPy to Perform Statistics

NumPy can also be used to perform statistics! Let's practice using the age variable we created. Start by pulling the [maximum](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.max.html) and [minimum](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.min.html) values (documentation linked)!
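
Here's a quick sketch of both methods on a made-up array (the name `toy_age` and its values are just for illustration):

```python
import numpy as np

# Made-up ages for illustration only (not the survey data)
toy_age = np.array([18, 63, 25, 41, 30])

oldest = toy_age.max()
youngest = toy_age.min()

print(oldest)    # 63
print(youngest)  # 18
```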

In [None]:
# your code here to pull the maximum and minimum values in the age array

Now, let's try looking at some other statistics. Generate each of the following for the age array:


*   [Mean](https://numpy.org/doc/stable/reference/generated/numpy.mean.html)
*   [Median](https://numpy.org/doc/stable/reference/generated/numpy.median.html)
*   [Standard deviation](https://numpy.org/doc/stable/reference/generated/numpy.std.html)
*   [Variance](https://numpy.org/doc/stable/reference/generated/numpy.var.html)
*   [75th percentile](https://numpy.org/doc/stable/reference/generated/numpy.percentile.html)
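
Here's a minimal sketch of these five functions on a small made-up array (the name `toy` and its values are just for illustration). It assumes the data is already numeric with no `nan` values:

```python
import numpy as np

# Made-up values for illustration only (not the survey data)
toy = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = np.mean(toy)
median = np.median(toy)
std = np.std(toy)            # population standard deviation by default (ddof=0)
var = np.var(toy)            # population variance by default (ddof=0)
p75 = np.percentile(toy, 75)

print(mean)    # 5.0
print(median)  # 4.5
print(std)     # 2.0
print(var)     # 4.0
print(p75)     # 5.5
```

By default, `np.std` and `np.var` divide by the number of values. Passing `ddof=1` gives the sample versions instead.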



In [None]:
# your code here to generate some statistics!

## NumPy and Graphing with Matplotlib

NumPy arrays work directly with Matplotlib, a popular Python graphing library. Let's do some basic practice graphing our data!

We'll start by creating a [box and whiskers plot](https://www.geeksforgeeks.org/box-plot-in-python-using-matplotlib/) of our age data.
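
Here's a minimal sketch of a box plot on made-up data (the name `toy_age`, the values, and the labels are just for illustration). It assumes any `nan` values have already been removed, since missing values can leave the box empty:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up ages for illustration only (not the survey data)
toy_age = np.array([18, 21, 25, 30, 34, 41, 63])

fig, ax = plt.subplots()
box = ax.boxplot(toy_age)   # draws the box, whiskers, and median line
ax.set_title("Toy age data")
ax.set_ylabel("Age")
plt.show()
```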

In [None]:
# import matplotlib
import matplotlib.pyplot as plt

# your code here to create a box and whisker plot of the age data


Let's also create a histogram of this data! Check out the documentation [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html).
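
Here's a minimal sketch of a histogram on made-up data (the name `toy_age`, the values, and the choice of 5 bins are just for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up ages for illustration only (not the survey data)
toy_age = np.array([18, 19, 21, 25, 30, 34, 41, 63])

fig, ax = plt.subplots()
# hist returns the count in each bin, the bin edges, and the drawn bars
counts, bin_edges, patches = ax.hist(toy_age, bins=5)
ax.set_xlabel("Age")
ax.set_ylabel("Number of respondents")
plt.show()

print(counts)
print(bin_edges)
```

Changing `bins` changes how finely the ages are grouped, so it's worth trying a few values.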

In [None]:
# your code here to create a histogram of the age data

We can also create a scatter plot! Let's compare the age of respondents to their hours listened per day. Scatter plot documentation [here](https://www.w3schools.com/python/matplotlib_scatter.asp).
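
Here's a minimal sketch of a scatter plot on made-up data (the names `toy_age` and `toy_hours` and their values are just for illustration). The two arrays need to be the same length so that each point pairs one x value with one y value:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up values for illustration only (not the survey data)
toy_age = np.array([18, 21, 25, 30, 34, 41, 63])
toy_hours = np.array([4.0, 3.5, 2.0, 3.0, 1.5, 2.5, 1.0])

fig, ax = plt.subplots()
points = ax.scatter(toy_age, toy_hours)  # x values first, then y values
ax.set_xlabel("Age")
ax.set_ylabel("Hours per day")
plt.show()
```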

In [None]:
# we'll reset the cleaning we did earlier so both columns have the same length and each age lines up with its hours value
age = np_array[:, 1]
hours = np_array[:, 3]

In [None]:
# your code here to create a scatter plot with age on the x-axis and hours on the y-axis

Awesome work! Although we won't formally cover Matplotlib, feel free to experiment with it in your projects if you'd like to create some visualizations! You have access to the [Treehouse Matplotlib module](https://teamtreehouse.com/library/introduction-to-data-visualization-with-matplotlib) if you'd like to learn on your own.