
# Using numpy for processing lists of number data

---

## numpy is a high performance array processing library for Python

Python lists can contain any type of data, including objects.  numpy arrays are specialised: every item in an array has the same data type, and in this course that type will always be a number.

Because every item has the same type, you can choose how much memory each item uses, and the items are always stored contiguously, which isn't the case for the objects in a Python list.  This makes numpy arrays more efficient at storing, and faster at processing, large data sets.
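As a rough illustration of the memory point, the sketch below compares a Python list with a numpy array holding the same 1000 numbers. The variable names are made up for this example, and the exact size of the list depends on your Python version.

```python
import sys
import numpy as np

numbers = list(range(1000))

# A Python list stores references to separate int objects,
# so count the list itself plus every int it points to
list_bytes = sys.getsizeof(numbers) + sum(sys.getsizeof(n) for n in numbers)

# A numpy array stores the raw values side by side in one block of memory
arr = np.array(numbers, dtype=np.int32)

print(arr.dtype)                # int32
print(arr.itemsize)             # 4 (bytes per item)
print(arr.nbytes)               # 4000 (1000 items x 4 bytes)
print(list_bytes > arr.nbytes)  # True
```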

Use numpy arrays to **store and manipulate large lists of numbers** (for other data types use plain Python lists or a pandas Series).

Use numpy arrays to **process pandas Series (columns) that contain numerical data and a large number of records**.

Use numpy arrays to **create new sets of data** to add to a dataframe.

For this course we are going to focus on using numpy arrays as a means of holding and working with a list of data from a pandas dataframe column.


---
## Creating a new numpy array (maybe for a new series or an extended series)

To use numpy, you will need to import it.  The conventional way to import numpy is to import the whole library and use an *alias*

`import numpy as np`

Every time you want to use a function from the numpy library you use the syntax:  

`np.function_name()`  

Create a new numpy array from a Python list of numbers
   
`arr = np.array([1,2,3])`

Or a new numpy matrix from a two-dimensional Python list of numbers (a list of rows, so note the outer pair of square brackets):

`matrix = np.array([[1,2,3],[4,5,6]])`

Create a numpy array from a dataframe column (series):

`arr = df['column name'].to_numpy(datatype)`

where the datatype matches the data type of the column (`df.info()` will give this information if you are unsure).
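The three ways of creating an array described above can be tried out together. This is a minimal sketch: the dataframe and its `price` column are made up for illustration and are not part of the course data.

```python
import numpy as np
import pandas as pd

# 1-D array from a flat list
arr = np.array([1, 2, 3])
print(arr.shape)      # (3,)

# 2-D array (matrix) from a list of rows
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix.shape)   # (2, 3)

# Hypothetical dataframe, made up for illustration only
df = pd.DataFrame({"price": [1.5, 2.0, 3.25]})
print(df["price"].dtype)     # float64

# Convert the column to a numpy array with a matching data type
prices = df["price"].to_numpy(np.float64)
print(prices.dtype)          # float64
```

Checking `.shape` is a quick way to confirm whether you have built a flat array or a matrix.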



### Exercise 1 - create a new numpy array to hold the numbers from 1 to 10
---

Write a function called `make_array()` which will:
*   create a list of the numbers 1 to 10  
*   create a new numpy array called **new_array** from that list
*   print `new_array`

Expected output:  
`[1 2 3 4 5 6 7 8 9 10] ` 



In [3]:
import numpy as np

def make_array():
  # build the list 1 to 10, then convert it to a numpy array
  num_list = list(range(1, 11))
  new_array = np.array(num_list)
  print(new_array)



# run and test the function against the expected output
make_array()


[ 1  2  3  4  5  6  7  8  9 10]


## Setting the number type in memory

numpy allows you to set the type of number in memory (e.g. `int8`, `int32`) when you create the array.  This allows memory allocation to be as small as possible.  

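`new_array = np.array([1, 2, 3], dtype=np.int8)`

This creates an array of whole numbers in which each item takes 1 byte of memory.  An `int8` can only hold values from -128 to 127, so choose a type that is large enough for the largest (and smallest) value in your data.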
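To decide which type is big enough, you can ask numpy for the range of each integer type with `np.iinfo`. The sketch below uses made-up values. What happens to a value that is too large for the chosen type depends on your numpy version: it may wrap around to a wrong number or raise an error, so it is worth checking first.

```python
import numpy as np

# Range of values each integer type can hold
for dtype in (np.int8, np.int16, np.int32):
    info = np.iinfo(dtype)
    print(dtype.__name__, info.min, info.max)
# int8 -128 127
# int16 -32768 32767
# int32 -2147483648 2147483647

# Made-up values for illustration only
values = [300, 1200, 32000]

# Every value is below 32767, so int16 is the smallest type that is safe here
small = np.array(values, dtype=np.int16)
large = np.array(values, dtype=np.int64)
print(small.nbytes)   # 6 (3 items x 2 bytes)
print(large.nbytes)   # 24 (3 items x 8 bytes)
```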

---
### Exercise 2 - create list of smallish numbers

Write a function which will:
*   accept a parameter **num_list**    
*   create a new numpy array called **new_array** from `num_list`, with data type `int16`
*   print `new_array`  

Test input:  
`[31112, 32321, 24567,456,324,789]`

Expected output:   
```
[31112 32321 24567 456 324 789]
```

In [68]:
import numpy as np

def create_new_array(num_list):
  # int16 holds -32768 to 32767; int8 (-128 to 127) is too small for this data
  new_array = np.array(num_list, np.int16)
  print(new_array)



# run and test the code against the expected output
create_new_array([31112, 32321, 24567, 456, 324, 789])

[31112 32321 24567   456   324   789]


---
### Exercise 3 - create a numpy array from a pandas dataframe column

Write a function which  will first create a dataframe from the titanic data set, and then will create a numpy array from the Fare column.

*Recap*:  *the Fare column is df['Fare'] (assuming your dataframe is called df)*

*  create a dataframe from the data set in the file at this url:  https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv  
*  create a new numpy array called **fare** from the `Fare` column (*remember to specify the data type, e.g. np.int32 or np.float64, when using* `to_numpy()`)  
*  print the `fare` array

Expected output:  
```
array([  7.25  ,  71.2833,   7.925 , ......  23.45  ,  30.    ,   7.75  ])
```

  

In [53]:
import pandas as pd
import numpy as np


def import_data_csv(url):
  dataset = pd.read_csv(url)
  return dataset

df_titanic = import_data_csv("https://raw.githubusercontent.com/SebastienBienfait/L2C-Data-managment/main/Datasets/titanic_data.csv")
display(df_titanic)
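def create_fare_series(df):
  # float16 keeps only about 3 significant digits, so 71.2833 is stored as 71.3;
  # np.float64 keeps the full precision shown in the expected output
  fare_array1 = df["Fare"].to_numpy(np.float16)
  # equivalent alternative: np.array(df["Fare"], np.float16)
  return fare_array1[0:10]  # first 10 fares only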



# run the function and test against the expected output.
s = create_fare_series(df_titanic)
print(s)
print(type(s[0]))

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


[ 7.25  71.3    7.926 53.1    8.05   8.46  51.88  21.08  11.13  30.08 ]
<class 'numpy.float16'>


---
### Exercise 4 - get some statistics from a numpy array created from a data series

This exercise will use data on income in certain US states.  The link is: https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/Income-Data.xlsx?raw=true  This spreadsheet just has one sheet.  

Write a function which will create a numpy array from the `Age` column in the income dataset and will print the following:

*  the average (mean) age of those surveyed  
*  the age of the oldest person
*  the age of the youngest person

TO HELP with this, refer to this helpsheet: http://datacamp-community-prod.s3.amazonaws.com/da466534-51fe-4c6d-b0cb-154f4782eb54 

Expected output:  
```
29.88888888888889
42
22
```

     

In [70]:
import pandas as pd
import numpy as np

def import_data_xl(url):
  dataset = pd.read_excel(url)
  return dataset

df_income = import_data_xl("https://github.com/SebastienBienfait/L2C-Data-managment/blob/main/Datasets/Income-Data%20(1).xlsx?raw=true")


def get_age_stats(df):
  age_array = df["Age"].to_numpy(np.int8)
  mean_age = age_array.mean()
  oldest = age_array.max()
  youngest = age_array.min()
  return mean_age,oldest,youngest
# run the function and test against the expected output.
print(get_age_stats(df_income))

(29.88888888888889, 42, 22)


---
### Exercise 6 - find the mean and standard deviation of wages

This exercise will again use data on income in certain US states.  The link is: https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/Income-Data.xlsx?raw=true  This spreadsheet just has one sheet.

Write a function which will create a numpy array from the `Income` column in the income dataset and will print the following:

*  the mean income of those surveyed  
*  the standard deviation of income
*  the highest income
*  the lowest income as a percentage of the mean (lowest / mean * 100) 

TO HELP with this, refer to this helpsheet: http://datacamp-community-prod.s3.amazonaws.com/da466534-51fe-4c6d-b0cb-154f4782eb54 

Expected output:  
```
63.388888888888886
13.936916958961463
81
34.70639789658195
```
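One detail worth knowing about the standard deviation: numpy's `std()` calculates the population standard deviation by default (`ddof=0`), whereas the pandas `std()` method calculates the sample standard deviation (`ddof=1`), so the two can give slightly different answers for the same column. A small sketch with made-up values (not the income data):

```python
import numpy as np

# Made-up values for illustration only
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print(values.mean())        # 5.0
print(values.std())         # 2.0 (population standard deviation, ddof=0)
print(values.std(ddof=1))   # sample standard deviation, slightly larger
print(values.max())         # 9

# lowest value as a percentage of the mean
print(values.min() / values.mean() * 100)
```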



In [72]:
import pandas as pd
import numpy as np



def get_income_stats(df):
  income_array = df["Income"].to_numpy(np.int8)

  mean_income = income_array.mean()
  std_income = income_array.std()
  highest_income = income_array.max()
  lowest_per = income_array.min()/mean_income*100

  return mean_income,std_income,highest_income,lowest_per


# run the function and test against expected output
print(get_income_stats(df_income))

(63.388888888888886, 13.936916958961463, 81, 34.70639789658195)


---
### Exercise 7 - finding the correlation between two series

Let's find out if there is a strong correlation between Age and Income in the income data set.

*  read the Income data into a pandas dataframe
*  create a numpy array from the Age column  
*  create a numpy array from the Income column  
*  use the np.corrcoef(nparray1, nparray2) function to get the Pearson's Correlation Coefficient (the measure of linear correlation between the two data sets) and store it in a variable called **coef**
*  print the correlation coefficient output (see below, it will be a 2x2 matrix)
*  print the correlation coefficient (which is at position [0][1], i.e. `coef[0][1]`)


Expected output:  
```
[[ 1.         -0.14787412]
 [-0.14787412  1.        ]]
-0.1478741157606825
```
The matrix gives 4 values showing the correlation between:

```
   |    (Age/Age)        (Age/Income)     |
   |    (Income/Age)     (Income/Income)  |
```
This suggests that income decreases slightly with age (the correlation is negative, so as one increases the other tends to decrease), but the correlation is weak: a perfect correlation would be 1 or -1, and no correlation would be 0.
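To get a feel for the scale, the sketch below runs `np.corrcoef` on three made-up pairs of arrays: one that rises exactly in step, one that falls exactly in step, and one with no linear trend. The values are invented for illustration only.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

rising = x * 2 + 1                      # increases exactly in step with x
falling = 10 - x                        # decreases exactly in step with x
unrelated = np.array([2, 1, 3, 1, 2])   # made-up values with no linear trend

print(np.corrcoef(x, rising)[0, 1])     # close to 1
print(np.corrcoef(x, falling)[0, 1])    # close to -1
print(np.corrcoef(x, unrelated)[0, 1])  # close to 0
```

The Age/Income result of about -0.15 sits much nearer to the third case than to the second.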

In [73]:
import pandas as pd
import numpy as np

def get_correlation(df):
  age_array = df["Age"].to_numpy(np.int8)
  income_array = df["Income"].to_numpy(np.int8)

  # corrcoef returns a 2x2 matrix; the Age/Income value is at [0][1]
  coef = np.corrcoef(age_array,income_array)
  print(coef)
  print(coef[0][1])


# run the function and test against expected output
get_correlation(df_income)

[[ 1.         -0.14787412]
 [-0.14787412  1.        ]]
-0.1478741157606825


---
## Broadcasting an operation across an array

Because a numpy array is created from a related set of data, it is useful to be able to operate on every item in the array in the same way.  For instance, the array might hold a set of scores out of 30 and you might want to convert all scores into percentages.

We can do this in a number of ways:  
1.  Store the result of the operation in a new array
```
scores = np.array([29,25,15,22,30])
percentages = scores / 30 * 100
print(percentages)
```
Expected output:
```
[ 96.66666667  83.33333333  50.          73.33333333 100.        ]
```
2.  Reassign the result to the original variable name (this builds a new array and the name `scores` now refers to it)
```
scores = np.array([29,25,15,22,30])
scores = scores / 30 * 100
print(scores)
```
Expected output:
```
[ 96.66666667  83.33333333  50.          73.33333333 100.        ]
```
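Two things happen behind the scenes in these examples that are worth seeing for yourself. The sketch below (variable names `original`, `counts` and `alias` are made up for illustration) shows that dividing an integer array produces a new array of floats, and that an in-place operator such as `+=` changes the existing array instead of building a new one.

```python
import numpy as np

scores = np.array([29, 25, 15, 22, 30])
original = scores                  # second name for the same array

scores = scores / 30 * 100         # builds a new float array and rebinds the name

print(original)                    # [29 25 15 22 30] (unchanged)
print(scores.dtype)                # float64

# An in-place operator changes the existing array instead
counts = np.array([1, 2, 3])
alias = counts
counts += 10
print(alias)                       # [11 12 13]
```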

Give it a try:





---
### Exercise 8 - increase whole array by 20

Write a function which will:
*  create a numpy array of 12 numbers
*  create a new array adding 20 to each of the items in the first array  
*  print the new array

Test input:  
`[1,2,3,4,5,6,7,8,1,2,3,4]`

Expected output:  
`[21 22 23 24 25 26 27 28 21 22 23 24]`

In [74]:
# add your code to define the function to increase all values in an array by 20 and then to call the function

def add_20(list1):
  arr = np.array(list1)

  return arr+20

print(add_20([1,2,3,4,5,6,7,8,1,2,3,4]))

[21 22 23 24 25 26 27 28 21 22 23 24]


---
## Conversion of values using broadcasting

---
### Exercise 9 - convert Titanic fares into 21st century values 

Write a function which will:  
*  create a dataframe from the titanic data set (https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv)   
*  create a numpy array from the Fare column
*  convert the fares into current value (multiply by a factor of 120.7045 - source https://www.in2013dollars.com/uk/inflation/1912?amount=32 *accessed 22/1/2022*)
*  print the average fare, the maximum fare and the minimum fare

Expected output:  
```
3887.1928207428173
61840.4399214
0.0
```


 

In [77]:
import pandas as pd
import numpy as np

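def get_current_fares(df):
  # float32 keeps about 7 significant digits, so the results print rounded;
  # np.float64 keeps the full precision shown in the expected output
  fare_array = df["Fare"].to_numpy(np.float32)
  # broadcasting: every fare is multiplied by the inflation factor
  fare_array = fare_array * 120.7045

  # print each statistic on its own line
  print(fare_array.mean())
  print(fare_array.max())
  print(fare_array.min())

# run the function and test against expected output
get_current_fares(df_titanic)

3887.193
61840.44
0.0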


---
### Exercise 10 - create a new column in the dataframe from a numpy array

**Challenging**

Write a function which will calculate expected salaries for all in the income data set after an inflation rate of 3.5% (with results in a new numpy array).

To check the result, calculate and print the Pearson Correlation Coefficient between the salaries series and the inflated salaries series.  We would expect this to be 1 (i.e. the inflated salary is always 3.5% higher than the current salary); the exercise is just meant to show that, and the statistic has no other relevance.  

Create a new column in the dataframe from the new numpy array (so that the dataframe now contains the original salaries and the inflated salaries).  
(**Recap**:  *to add a new column, just use* `df['new column name']`)  

To assign a numpy array to a pandas column use  
`df['new column name'] = numpyarrayname.tolist()`

Display the new dataframe and print the correlation coefficient.
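The steps above can be sketched on a tiny made-up dataframe before you try them on the income data. The column names and the 10% rise here are hypothetical. The sketch also shows that a numpy array of the same length as the dataframe can be assigned to a column directly, as an alternative to `.tolist()`.

```python
import numpy as np
import pandas as pd

# Hypothetical data, made up for illustration only
df = pd.DataFrame({"salary": [100.0, 200.0, 300.0]})

salaries = df["salary"].to_numpy(np.float64)
raised = salaries * 1.1                      # new array, 10% higher

df["salary_raised"] = raised                 # direct assignment of the array
df["salary_raised_list"] = raised.tolist()   # the .tolist() approach gives the same column

print(df)
print(np.corrcoef(salaries, raised)[0, 1])   # close to 1
```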







In [81]:
import pandas as pd
import numpy as np

def inflation(df):
  income_array = df["Income"].to_numpy(np.int8)
  income_array_new = income_array*1.035
  df["Income_+3%"] = income_array_new.tolist()
  #df = df.drop("Income", axis = 1)
  corr = np.corrcoef(income_array,income_array_new)
  return df,corr

print(inflation(df_income))


(   State  County  Population   Age  Income  Income_2  Income_+3%
0     TX     1.0        72.0  34.0    65.0    67.275      67.275
1     TX     2.0        33.0  42.0    45.0    46.575      46.575
2     TX     5.0        25.0  23.0    46.0    47.610      47.610
3     TX     6.0        54.0  36.0    65.0    67.275      67.275
4     TX     7.0        11.0  42.0    53.0    54.855      54.855
5     TX     8.0        28.0  25.0    62.0    64.170      64.170
6     TX     9.0        82.0  35.0    66.0    68.310      68.310
7     TX    10.0         5.0  40.0    75.0    77.625      77.625
8     MD    11.0        61.0  27.0    22.0    22.770      22.770
9     MD     2.0         5.0  23.0    69.0    71.415      71.415
10    MD     4.0        98.0  25.0    73.0    75.555      75.555
11    MD     3.0        64.0  29.0    75.0    77.625      77.625
12    MD     2.0        36.0  24.0    65.0    67.275      67.275
13    MD     1.0        24.0  25.0    66.0    68.310      68.310
14    MD     5.0        

# Reflection
----

## What skills have you demonstrated in completing this notebook?

Your answer: 

## What caused you the most difficulty?

Your answer: 