
# Using numpy for processing lists of number data

---

## numpy is a high performance array processing library for Python

Python lists can contain any type of data, including objects.  numpy arrays are specialised: every item in an array has the same data type, and in this course that type will always be a number.

You can choose how much memory each item uses, and the items are always stored contiguously, which isn't always the case for Python lists.  This makes numpy arrays more efficient at storing, and faster at processing, large data sets.

Use numpy arrays to **store and manipulate large lists of numbers** (for other data types use plain Python lists or a pandas Series).

Use numpy arrays to **process pandas Series (columns) where these contain numerical data and a large number of records**.

Use numpy arrays to **create new sets of data** to add to a dataframe.

For this course we are going to focus on using numpy arrays as a means of holding and working with a list of data from a pandas dataframe column.


---
## Creating a new numpy array (maybe for a new series or an extended series)

To use numpy, you will need to import it.  The conventional way to import numpy is to import the whole library and give it an *alias*:

`import numpy as np`

Every time you want to use a function from the numpy library you use the syntax:  

`np.function_name()`  

Create a new numpy array from a Python list of numbers
   
`arr = np.array([1,2,3])`

Or a new numpy matrix from a Python 2-dimensional list of numbers (note the outer pair of brackets around the rows):

`matrix = np.array([[1,2,3],[4,5,6]])`

Create a numpy array from a dataframe column (series):

`arr = df['column name'].to_numpy(datatype)`

where the datatype matches the data type of the column (`df.info()` will give this information if you are unsure).
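
The three ways of creating an array described above can be sketched together as follows. This is a minimal illustration only: the dataframe and its `height_cm` column are made up for the example and are not part of the course data.

```python
import numpy as np
import pandas as pd

# 1-dimensional array from a flat list
arr = np.array([1, 2, 3])

# 2-dimensional array: one list of rows, so the rows sit inside an outer list
matrix = np.array([[1, 2, 3], [4, 5, 6]])

# Hypothetical dataframe standing in for real data
df = pd.DataFrame({"height_cm": [150.5, 162.0, 171.25]})
heights = df["height_cm"].to_numpy(np.float64)

print(arr.shape)      # (3,)
print(matrix.shape)   # (2, 3)
print(heights.dtype)  # float64
```

Checking `.shape` and `.dtype` straight after creating an array is a quick way to confirm it has the dimensions and number type you expected.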



### Exercise 1 - create a new numpy array to hold the numbers from 1 to 10
---

Write a function called `make_array()` which will:
*   create a list of the numbers 1 to 10  
*   create a new numpy array called **new_array** from that list
*   print `new_array`

Expected output:  
`[ 1  2  3  4  5  6  7  8  9 10]`



In [None]:
import numpy as np

def make_array():
  # build the list 1 to 10, turn it into a numpy array and print it
  num_list = list(range(1, 11))
  new_array = np.array(num_list)
  print(new_array)


# run and test the function against the expected output
make_array()

[ 1  2  3  4  5  6  7  8  9 10]


## Setting the number type in memory

numpy allows you to set the type of number in memory (e.g. `int8`, `int32`) when you create the array.  This allows memory allocation to be as small as possible.  

`new_matrix = np.array([1, 2, 3], np.int8)`    

This creates an array of whole numbers, each stored in 1 byte of memory.  An `int8` can only hold values from -128 to 127, so choose a type that is large enough for your data.
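
To see the effect of the number type on memory, one option is to compare the `itemsize` and `nbytes` attributes of two arrays built from the same values. This is a small sketch with made-up values; `np.iinfo` is used here only to show the range an integer type can hold.

```python
import numpy as np

values = [1, 2, 3]
small = np.array(values, np.int8)    # 1 byte per item
large = np.array(values, np.int64)   # 8 bytes per item

print(small.itemsize, small.nbytes)  # 1 3
print(large.itemsize, large.nbytes)  # 8 24

# np.iinfo reports the smallest and largest value an integer type can hold
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127
```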

---
### Exercise 2 - create list of smallish numbers

Write a function which will:
*   accept a parameter **num_list**    
*   create a new numpy array called **new_array** from `num_list`, with data size `int16`
*   print `new_array`  

Test input:  
`[31112, 32321, 24567,456,324,789]`

Expected output:   
```
[31112 32321 24567   456   324   789]
```

In [None]:
import numpy as np

def create_new_array(num_list):
  # turn num_list into a numpy array of 16-bit integers and print it
  new_array = np.array(num_list, np.int16)
  print(new_array)


# run and test the code against the expected output
create_new_array([31112, 32321, 24567, 456, 324, 789])

[31112 32321 24567   456   324   789]


---
### Exercise 3 - create a numpy array from a pandas dataframe column

Write a function which  will first create a dataframe from the titanic data set, and then will create a numpy array from the Fare column.

*Recap*:  *the Fare column is df['Fare'] (assuming your dataframe is called df)*

*  create a dataframe from the data set in the file at this url:  https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv  
*  create a new numpy array called **fare** from the `Fare` column (remember to specify the data type, e.g. `np.int32` or `np.float64`, when using `to_numpy()`)  
*  print the `fare` array

Expected output:  
```
array([  7.25  ,  71.2833,   7.925 , ......  23.45  ,  30.    ,   7.75  ])
```

  

In [None]:
import pandas as pd
import numpy as np

def create_fare_series():
  # read the data set into a dataframe, then create a numpy array from the Fare column
  url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv"
  titanic = pd.read_csv(url)

  # Fare holds decimal values, so use a float type
  fare = titanic["Fare"].to_numpy(np.float64)

  return fare


# run the function and test against the expected output.
create_fare_series()

array([  7.25  ,  71.2833,   7.925 ,  53.1   ,   8.05  ,   8.4583,
        51.8625,  21.075 ,  11.1333,  30.0708,  16.7   ,  26.55  ,
         8.05  ,  31.275 ,   7.8542,  16.    ,  29.125 ,  13.    ,
        18.    ,   7.225 ,  26.    ,  13.    ,   8.0292,  35.5   ,
        21.075 ,  31.3875,   7.225 , 263.    ,   7.8792,   7.8958,
        27.7208, 146.5208,   7.75  ,  10.5   ,  82.1708,  52.    ,
         7.2292,   8.05  ,  18.    ,  11.2417,   9.475 ,  21.    ,
         7.8958,  41.5792,   7.8792,   8.05  ,  15.5   ,   7.75  ,
        21.6792,  17.8   ,  39.6875,   7.8   ,  76.7292,  26.    ,
        61.9792,  35.5   ,  10.5   ,   7.2292,  27.75  ,  46.9   ,
         7.2292,  80.    ,  83.475 ,  27.9   ,  27.7208,  15.2458,
        10.5   ,   8.1583,   7.925 ,   8.6625,  10.5   ,  46.9   ,
        73.5   ,  14.4542,  56.4958,   7.65  ,   7.8958,   8.05  ,
        29.    ,  12.475 ,   9.    ,   9.5   ,   7.7875,  47.1   ,
        10.5   ,  15.85  ,  34.375 ,   8.05  , 263.    ,   8.0

---
### Exercise 4 - get some statistics from a numpy array created from a data series

This exercise will use data on income in certain US states.  The link is: https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/Income-Data.xlsx?raw=true  This spreadsheet just has one sheet.  

Write a function which will create a numpy array from the `Age` column in the income dataset and will print the following:

*  the average (mean) age of those surveyed  
*  the age of the oldest person
*  the age of the youngest person

TO HELP with this, refer to this helpsheet: http://datacamp-community-prod.s3.amazonaws.com/da466534-51fe-4c6d-b0cb-154f4782eb54 

Expected output:  
```
29.88888888888889
42
22
```
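
The statistics asked for here correspond to methods that every numpy array provides. A minimal sketch, using made-up ages rather than the income dataset:

```python
import numpy as np

# made-up ages, not taken from the income dataset
ages = np.array([25, 30, 35, 40], np.int16)

print(ages.mean())  # 32.5
print(ages.max())   # 40
print(ages.min())   # 25
```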

     

In [None]:
import pandas as pd
import numpy as np

def get_age_stats():
  # read the data set into a dataframe, then create a numpy array from the Age column
  url="https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/Income-Data.xlsx?raw=true"
  income=pd.read_excel(url)
  
  age=income["Age"].to_numpy(np.int16)
  
  mean=age.mean()
  oldest=age.max()
  youngest=age.min()
  
  return mean, oldest, youngest



# run the function and test against the expected output.
get_age_stats()

(29.88888888888889, 42, 22)

---
### Exercise 6 - find the mean and standard deviation of wages

This exercise will again use data on income in certain US states.  The link is: https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/Income-Data.xlsx?raw=true  This spreadsheet just has one sheet.

Write a function which will create a numpy array from the `Income` column in the income dataset and will print the following:

*  the mean income of those surveyed  
*  the standard deviation of income
*  the highest income
*  the lowest income as a percentage of the mean (lowest / mean * 100) 

TO HELP with this, refer to this helpsheet: http://datacamp-community-prod.s3.amazonaws.com/da466534-51fe-4c6d-b0cb-154f4782eb54 

Expected output:  
```
63.388888888888886
13.936916958961463
81
34.70639789658195
```
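
One detail worth knowing when comparing results: numpy's `std()` calculates the population standard deviation by default (`ddof=0`), while the pandas `std()` method defaults to the sample standard deviation (`ddof=1`), so the two can give different answers for the same data. The sketch below uses made-up incomes to show the difference, along with the "lowest as a percentage of the mean" calculation.

```python
import numpy as np
import pandas as pd

# made-up incomes, not taken from the income dataset
incomes = np.array([40.0, 50.0, 60.0, 70.0])

population_sd = incomes.std()          # numpy default, ddof=0
sample_sd = incomes.std(ddof=1)        # divide by n - 1 instead of n
pandas_sd = pd.Series(incomes).std()   # pandas default, ddof=1

# lowest income as a percentage of the mean
lowest_pct = incomes.min() / incomes.mean() * 100

print(population_sd, sample_sd, pandas_sd)
print(lowest_pct)
```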



In [None]:
import pandas as pd
import numpy as np

def get_income_stats():
  # read the data set, create a numpy array from the Income column and print the stats
  url="https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/Income-Data.xlsx?raw=true"
  income=pd.read_excel(url)

  income_stats=income["Income"].to_numpy(np.float64)

  mean=income_stats.mean()
  sd=income_stats.std()
  highest=income_stats.max()
  # lowest income expressed as a percentage of the mean
  lowest_pct=income_stats.min()/mean*100

  print("\n", mean,"\n", sd, "\n", highest, "\n", lowest_pct)

# run the function and test against expected output
get_income_stats()


 63.388888888888886 
 13.936916958961463 
 81.0 
 34.70639789658195


---
### Exercise 7 - finding the correlation between two series

Let's find out if there is a strong correlation between Age and Income in the income data set.

*  read the Income data into a pandas dataframe
*  create a numpy array from the Age column  
*  create a numpy array from the Income column  
*  use the np.corrcoef(nparray1, nparray2) function to get the Pearson's Correlation Coefficient (the measure of linear correlation between the two data sets) and store it in a variable called **coef**
*  print the correlation coefficient output (see below, it will be a 2x2 matrix)
*  print the correlation coefficient itself, which is at position `coef[0][1]`


Expected output:  
```
[[ 1.         -0.14787412]
 [-0.14787412  1.        ]]
-0.1478741157606825

```
The matrix gives 4 values showing the correlation between:

```
   |    (Age/Age)        (Age/Income)     |
   |    (Income/Age)     (Income/Income)  |
```
This suggests that income decreases with age (the correlation is negative, so as one increases the other decreases) but that the correlation is quite weak (a perfect correlation would be 1 or -1, and no correlation would be 0).
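
To get a feel for the scale, it can help to look at the two extremes. The sketch below uses made-up data in which one series rises exactly in step with `x` and another falls exactly in step with it, giving coefficients of 1 and -1 respectively (rounded here to hide tiny floating point differences).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
rising = 2 * x + 1      # increases exactly in step with x
falling = 10 - 3 * x    # decreases exactly in step with x

# the coefficient for the pair is at row 0, column 1 of the 2x2 matrix
r_up = np.corrcoef(x, rising)[0, 1]
r_down = np.corrcoef(x, falling)[0, 1]

print(round(r_up, 6), round(r_down, 6))  # 1.0 -1.0
```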

In [None]:
import pandas as pd
import numpy as np
url="https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/Income-Data.xlsx?raw=true"
income=pd.read_excel(url)
income

def get_correlation():
  # get the correlation figure for age and income
  age_arr = income["Age"].to_numpy(np.float64)
  income_arr = income["Income"].to_numpy(np.float64)

  coef = np.corrcoef(age_arr, income_arr)
  print(coef)
  print(coef[0][1])



# run the function and test against expected output
get_correlation()

[[ 1.         -0.14787412]
 [-0.14787412  1.        ]]
-0.1478741157606825


---
## Broadcasting an operation across an array

Because a numpy array is created from a related set of data, it is useful to be able to operate on every item in the array in the same way.  For instance, the array might hold a set of scores out of 30 and you might want to convert all scores into percentages.

We can do this in a number of ways:  
1.  Create a new array to hold the result of the operation
```
scores = np.array([29,25,15,22,30])
percentages = scores / 30 * 100
print(percentages)
```
Expected output:
```
[ 96.66666667  83.33333333  50.          73.33333333 100.        ]
```
2.  Store the result in the original array
```
scores = np.array([29,25,15,22,30])
scores = scores / 30 * 100
print(scores)
```
Expected output:
```
[ 96.66666667  83.33333333  50.          73.33333333 100.        ]
```
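
Two details of the approaches above are worth a quick illustration. First, dividing an integer array produces a new float array, so "storing the result in the original array" really means pointing the same name at a new array. Second, the same element-by-element arithmetic works between two arrays of equal length. The `marks` and `out_of` arrays are made up for this sketch.

```python
import numpy as np

scores = np.array([29, 25, 15, 22, 30])
print(scores.dtype)   # an integer type (the exact size depends on the platform)

# the division builds a new float array and the name scores now refers to it
scores = scores / 30 * 100
print(scores.dtype)   # float64

# element-by-element arithmetic between two arrays of the same length
marks = np.array([29, 25, 15])
out_of = np.array([30, 50, 20])
percentages = marks / out_of * 100
print(percentages)
```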

Give it a try:





In [None]:
import numpy as np

score = np.array([5,10,24])
percentages = score/5*100
# print the converted values rather than the original array
print(percentages)

scores = np.array([29,25,15,22,30])
scores = scores / 30 * 100
print(scores)

[100. 200. 480.]
[ 96.66666667  83.33333333  50.          73.33333333 100.        ]


---
### Exercise 8 - increase whole array by 20

Write a function which will:
*  create a numpy array of 12 numbers
*  create a new array adding 20 to each of the items in the first array  
*  print the new array

Test input:  
`[1,2,3,4,5,6,7,8,1,2,3,4]`

Expected output:  
`[21 22 23 24 25 26 27 28 21 22 23 24]`

In [None]:
# add your code to define the function to increase all values in an array by 20 and then to call the function

def whole_array():
  new_array = np.array([1,2,3,4,5,6,7,8,1,2,3,4])
  print(new_array)
  new_array_20 = new_array + 20
  print (new_array_20)

whole_array()

[1 2 3 4 5 6 7 8 1 2 3 4]
[21 22 23 24 25 26 27 28 21 22 23 24]


---
## Conversion of values using broadcasting

---
### Exercise 9 - convert Titanic fares into 21st century values 

Write a function which will:  
*  create a dataframe from the titanic data set (https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv)   
*  create a numpy array from the Fare column
*  convert the fares into current value (multiply by a factor of 120.7045 - source https://www.in2013dollars.com/uk/inflation/1912?amount=32 *accessed 22/1/2022*)
*  print the average fare, the maximum fare and the minimum fare

Expected output:  
```
3887.1928207428173
61840.4399214
0.0
```


 

In [None]:
import pandas as pd
import numpy as np

url="https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv"
titanic=pd.read_csv(url)
titanic



def get_current_fares():
  # get the fares data into a numpy array, convert to today's prices and print stats

  fare=titanic["Fare"].to_numpy(np.float64)
  current_fare=fare*120.7045
  average=current_fare.mean()
  # avoid the names max and min, which would hide Python's built-in functions
  highest=current_fare.max()
  lowest=current_fare.min()

  print( "\n", average,"\n", highest,"\n", lowest)


# run the function and test against expected output
get_current_fares()


 3887.1928207428173 
 61840.4399214 
 0.0


---
### Exercise 10 - create a new column in the dataframe from a numpy array

**Challenging**

Write a function which will calculate expected salaries for all in the income data set after an inflation rate of 3.5% (with results in a new numpy array).

Just to show the result, calculate and print the Pearson Correlation Coefficient between the salaries series and the inflated salaries series.  We would expect this to be 1 (ie the inflated salary is always 3.5% higher than the current salary) and the exercise is just meant to show that - the statistic has no relevance.  

Create a new column in the dataframe from the new numpy array (so that the dataframe now contains the original salaries and the inflated salaries).  
(**Recap**:  *to add a new column, just use* `df['new column name']`)  

To assign a numpy array to a pandas column use  
`df['new column name'] = numpyarrayname.tolist()`
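
The round trip from a column to a numpy array and back to a new column can be sketched as follows. The dataframe, the column names and the factor of 1.2 are all made up for this illustration and are not part of the exercise.

```python
import numpy as np
import pandas as pd

# hypothetical dataframe standing in for real data
df = pd.DataFrame({"price": [10.0, 20.0, 30.0]})

# column -> numpy array -> calculation across the whole array
prices = df["price"].to_numpy(np.float64)
with_tax = prices * 1.2

# numpy array -> new dataframe column
df["price_with_tax"] = with_tax.tolist()
print(df)
```

The new array must have the same number of items as the dataframe has rows. pandas will also accept the array directly (`df["price_with_tax"] = with_tax`), without the call to `tolist()`.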

Display the new dataframe and print the correlation coefficient.







In [None]:
import pandas as pd
import numpy as np

url="https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/Income-Data.xlsx?raw=true"
income=pd.read_excel(url)
income

def expected_salaries():

 salaries = income["Income"].to_numpy(np.int64)
 # a 3.5% rise means multiplying each salary by 1.035
 inflated_salaries = salaries * 1.035

 coeff=np.corrcoef(salaries, inflated_salaries)
 income['Salaries_after_Inflation'] = inflated_salaries.tolist()

 return coeff, income['Salaries_after_Inflation']


expected_salaries()

(array([[1., 1.],
        [1., 1.]]), 0     67.275
 1     46.575
 2     47.610
 3     67.275
 4     54.855
 5     64.170
 6     68.310
 7     77.625
 8     22.770
 9     71.415
 10    75.555
 11    77.625
 12    67.275
 13    68.310
 14    80.730
 15    83.835
 16    75.555
 17    64.170
 Name: Salaries_after_Inflation, dtype: float64)

# Reflection
----

## What skills have you demonstrated in completing this notebook?

*Your* answer: Working with the numpy and pandas libraries:
 1) creating numpy arrays from columns of pandas dataframes
 2) calculating statistics from a numpy array created from a data series
 3) incrementing whole arrays
 4) broadcasting
 5) creating a new column in an existing dataset





## What caused you the most difficulty?

Your answer: