
# Numpy for Statistical Analysis

---

The exercises in this worksheet use data on income in certain US states. Run the cell below to import the data, which will then be available as `df`.

In [1]:
import pandas as pd
import numpy as np

url = 'https://github.com/futureCodersSE/python-programming-for-data/blob/main/Datasets/Income-Data.xlsx?raw=True'
df = pd.read_excel(url)


---
### Exercise 1 - get some statistics from a numpy array created from a data series


Write a function which will create a numpy array from the `Age` column in the income dataset and will print the following:

*  the average (mean) age of those surveyed  
*  the median age of those surveyed
*  the age of the oldest person
*  the age of the youngest person

Expected output:  
```
29.88888888888889
28.0
42
22
```

     

In [2]:
import pandas as pd
import numpy as np

def get_age_stats():
  age_array = np.array(df["Age"], np.int64)
  mean = np.mean(age_array)
  median = np.median(age_array)
  oldest = np.max(age_array)
  youngest = np.min(age_array)

  print(f"The average age is {mean}, the median age is {median}, the oldest person was {oldest} and the youngest person was {youngest}")

  return age_array


age_array = get_age_stats()

The average age is 29.88888888888889, the median age is 28.0, the oldest person was 42 and the youngest person was 22


### Exercise 2 - find the mode value from a numpy array

Write a function that will create a numpy array from the `Age` column in the income dataset and will print the mode value.

*There is only one mode in this data range, but as an extra challenge, include code in your function that will check there is only one value equal to the maximum and print 'more than one mode' if multiple modes are found.*

Expected output:

25

In [3]:
def get_mode(age_array):
  ages, count = np.unique(age_array, return_counts = True)
  max_count = np.max(count)
  count_index = list(count).index(max_count)
  mode_age = ages[count_index]
  print(f"The most common age was {mode_age}")

get_mode(age_array)

The most common age was 25


In [4]:
def get_mode(age_array):
  # Creates two arrays, one with each unique value in the original array, and another with the count of how many times each number appears
  ages, count = np.unique(age_array, return_counts = True)
  # Finds the highest number in the counts array to find how often the most common number appears
  max_count = np.max(count)

  # Code to count how many times the max count appears in the count array, and therefore how many values appear most frequently simultaneously
  check_count = 0
  for item in count:
    if item == max_count:
      check_count += 1

  # Tells the user if there is more than one mode number
  if check_count > 1:
    print("More than one mode")
  # Finds the mode age by finding the corresponding value in the ages array
  else:
    count_index = list(count).index(max_count)
    mode_age = ages[count_index]
    print(f"The most common age was {mode_age}")

get_mode(age_array)

The most common age was 25


In [5]:
# Same code as above except with an array of test data with more than one mode to test if the code checking for that works

def test_get_mode():
  age_array_test = np.array([3, 4, 5, 6, 4, 3, 7, 8, 6, 4, 6]) # 4 and 6 both appear 3 times
  ages, count = np.unique(age_array_test, return_counts = True)
  max_count = np.max(count)

  check_count = 0
  for item in count:
    if item == max_count:
      check_count += 1

  if check_count > 1:
    print("More than one mode")
  else:
    count_index = list(count).index(max_count)
    mode_age = ages[count_index]
    print(f"The most common age was {mode_age}")

test_get_mode()

More than one mode
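
The loop that counts how often the maximum appears can also be written as a boolean mask, which returns every mode directly. This is a minimal sketch rather than part of the worksheet solution; `find_modes` is a hypothetical helper name and the data is the same test array used above.

```python
import numpy as np

def find_modes(values):
    """Return every value that shares the highest count (hypothetical helper)."""
    unique_values, counts = np.unique(values, return_counts=True)
    # Boolean mask: True wherever a count equals the maximum count
    return unique_values[counts == counts.max()]

test_data = np.array([3, 4, 5, 6, 4, 3, 7, 8, 6, 4, 6])  # 4 and 6 both appear 3 times
modes = find_modes(test_data)

if len(modes) > 1:
    print(f"More than one mode: {modes}")
else:
    print(f"The most common value was {modes[0]}")
```

Returning the array of modes instead of printing inside the function makes the result easier to reuse and to check.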


---
### Exercise 3 - find the mean and standard deviation of wages

This exercise will again use data on income in certain US states.  

Write a function which will create a numpy array from the `Income` column in the income dataset and will print the following:

*  the mean income of those surveyed  
*  the standard deviation of income
*  the highest income
*  the lowest income as a percentage of the mean (lowest / mean * 100)


Expected output:  
```
63.388888888888886
13.936916958961463
81
34.70639789658195
```



In [6]:
import pandas as pd
import numpy as np

def get_income_stats():
  income_array = np.array(df["Income"], np.float64)
  mean = np.mean(income_array)
  stddev = np.std(income_array)
  highest = np.max(income_array)
  lowest = np.min(income_array)
  lowest_perc = (lowest / mean) * 100

  print(f"The average income was ${mean} with a standard deviation of {stddev}.")
  print(f"The highest income was ${highest}, and the lowest income was {lowest_perc}% of the average income.")

  return income_array


income_array = get_income_stats()

The average income was $63.388888888888886 with a standard deviation of 13.936916958961463.
The highest income was $81.0, and the lowest income was 34.70639789658195% of the average income.
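
One detail worth knowing about `np.std`: by default it computes the population standard deviation (dividing by `n`), which is what the expected output above uses. Passing `ddof=1` gives the sample standard deviation (dividing by `n - 1`). The sketch below uses made-up values, not the income data, to show the difference.

```python
import numpy as np

# Made-up values chosen so the population standard deviation is a round number
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(sample)       # default ddof=0: divide by n
sample_std = np.std(sample, ddof=1)   # ddof=1: divide by n - 1

print(population_std)  # 2.0
print(sample_std)      # slightly larger than the population value
```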


### Exercise 4 - find income IQR

Write a function that will create a numpy array from the `Income` column of the income dataset and print the following:
* the 25th percentile
* the 75th percentile
* the interquartile range

Expected output:
```
62.0
73.0
11.0
```


In [7]:
def get_income_iqr(income_array):
  perc_25 = np.percentile(income_array, 25)
  perc_75 = np.percentile(income_array, 75)
  iqr = perc_75 - perc_25

  print(f"The 25th percentile is {perc_25}")
  print(f"The 75th percentile is {perc_75}")
  print(f"The interquartile range is {iqr}")

get_income_iqr(income_array)

The 25th percentile is 62.0
The 75th percentile is 73.0
The interquartile range is 11.0
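
`np.percentile` also accepts a list of percentiles, so both quartiles can be computed in one call. This is a small sketch on made-up data; `quartiles_and_iqr` is a hypothetical helper, not part of the worksheet.

```python
import numpy as np

def quartiles_and_iqr(values):
    """Return (Q1, Q3, IQR) for a numpy array (hypothetical helper)."""
    # Passing a list returns one result per requested percentile
    q1, q3 = np.percentile(values, [25, 75])
    return q1, q3, q3 - q1

example = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])  # made-up data
q1, q3, iqr = quartiles_and_iqr(example)
print(q1, q3, iqr)  # 3.0 7.0 4.0
```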


### Exercise 5 - find outliers

Write a function that takes a numpy array created from the `Income` column and does the following:
* calculate the standard deviation, mean, Q1, Q3 and interquartile range of the data (as separate variables)
* calculate the upper limit for outliers based on both standard deviation and iqr
* calculate the lower limit for outliers based on both standard deviation and iqr
* filter four times, once for each outlier type
* print the outliers


Expected output:
```
Upper outliers by std: []
Lower outliers by std: [22]
Upper outliers by iqr: []
Lower outliers by iqr: [45 22]
```

In [10]:
def find_outliers(income_array):
  stddev = np.std(income_array)
  mean = np.mean(income_array)
  perc_25 = np.percentile(income_array, 25)
  perc_75 = np.percentile(income_array, 75)
  iqr = perc_75 - perc_25

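  # IQR rule: outliers lie more than 1.5 * IQR below Q1 or above Q3
  iqr_lower = perc_25 - (iqr * 1.5)
  iqr_upper = perc_75 + (iqr * 1.5)

  # Standard deviation rule: outliers lie more than 2 standard deviations from the mean
  std_lower = mean - (stddev * 2)
  std_upper = mean + (stddev * 2)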

  filter_iqr_lower = income_array < iqr_lower
  outliers_iqr_lower = income_array[filter_iqr_lower]
  print(outliers_iqr_lower)

  filter_iqr_upper = income_array > iqr_upper
  outliers_iqr_upper = income_array[filter_iqr_upper]
  print(outliers_iqr_upper)

  filter_std_lower = income_array < std_lower
  outliers_std_lower = income_array[filter_std_lower]
  print(outliers_std_lower)

  filter_std_upper = income_array > std_upper
  outliers_std_upper = income_array[filter_std_upper]
  print(outliers_std_upper)


find_outliers(income_array)

[45. 22.]
[]
[22.]
[]
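
To make the printed results easier to match against the labelled expected output, the four filters can be collected in a dictionary keyed by label. This is a sketch under assumptions: `labelled_outliers` is a hypothetical helper, the thresholds (2 standard deviations, 1.5 × IQR) are the ones used in the cell above, and the data is made up.

```python
import numpy as np

def labelled_outliers(values, std_factor=2.0, iqr_factor=1.5):
    """Return outliers under both rules as a dict of label -> array (hypothetical helper)."""
    mean, std = np.mean(values), np.std(values)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return {
        "Upper outliers by std": values[values > mean + std_factor * std],
        "Lower outliers by std": values[values < mean - std_factor * std],
        "Upper outliers by iqr": values[values > q3 + iqr_factor * iqr],
        "Lower outliers by iqr": values[values < q1 - iqr_factor * iqr],
    }

# Made-up data with one obviously low value
example = np.array([50, 52, 54, 55, 56, 58, 60, 62, 5])
results = labelled_outliers(example)
for label, outliers in results.items():
    print(f"{label}: {outliers}")
```

With this made-up array, both rules flag only the value 5 as a lower outlier and find no upper outliers.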


---
### Exercise 6 - finding the correlation between two series

Let's find out if there is a strong correlation between Age and Income in the income data set.

*  create a numpy array from the Age column  
*  create a numpy array from the Income column  
*  use the np.corrcoef(nparray1, nparray2) function to get the Pearson's Correlation Coefficient (the measure of linear correlation between the two data sets) and store it in a variable called **coef**
*  print the correlation coefficient output (see below, it will be a 2x2 matrix)
*  print the correlation coefficient (which is at position [0][1] (coef[0][1]))


Expected output:  
```
[[ 1.         -0.14787412]
 [-0.14787412  1.        ]]
-0.1478741157606825
```
The matrix gives 4 values showing the correlation between:

```
   |    (Age/Age)        (Age/Income)     |
   |    (Income/Age)     (Income/Income)  |
```
This suggests that income decreases slightly with age (the correlation is negative, so as one increases the other tends to decrease), but the correlation is quite weak: a perfect correlation would be 1 or -1, and no linear correlation would be 0.
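
To see what the extreme values look like, the sketch below runs `np.corrcoef` on made-up data where one series is an exact linear function of the other. The arrays and variable names are illustrative only and are not taken from the income data.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
rising = 2 * x + 1      # increases exactly in step with x
falling = -3 * x + 20   # decreases exactly in step with x

matrix = np.corrcoef(x, rising)
print(matrix)           # 2x2 matrix: diagonal is each series against itself

r_rising = matrix[0, 1]
r_falling = np.corrcoef(x, falling)[0, 1]
print(r_rising)         # approximately 1: perfect positive correlation
print(r_falling)        # approximately -1: perfect negative correlation
```

The matrix is symmetric, so positions `[0][1]` and `[1][0]` always hold the same coefficient.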

In [13]:
import pandas as pd
import numpy as np

def get_correlation(age_array, income_array):
  coef = np.corrcoef(age_array, income_array)
  print(coef)
  print(coef[0][1])



get_correlation(age_array, income_array)

[[ 1.         -0.14787412]
 [-0.14787412  1.        ]]
-0.1478741157606825


### Exercise 7 - practicing correlation

Repeat exercise 6 but use the Income and Population columns instead.

If you have completed the user defined functions worksheet, try writing a function that takes two arrays as parameters and returns the correlation coefficient.

In [18]:
import pandas as pd
import numpy as np

def practicing_correlation(array1, array2):
  coef = np.corrcoef(array1, array2)
  print(coef)
  print(coef[0][1])



population_array = np.array(df["Population"], np.int64)
practicing_correlation(income_array, population_array)

[[1.         0.11644143]
 [0.11644143 1.        ]]
0.11644142628402859
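
The exercise also suggests a function that returns the coefficient rather than printing it. One possible sketch follows; `correlation_coefficient` is a hypothetical name and the two arrays are made-up data.

```python
import numpy as np

def correlation_coefficient(array1, array2):
    """Return Pearson's correlation coefficient between two arrays (hypothetical helper)."""
    # corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient
    return np.corrcoef(array1, array2)[0, 1]

first = np.array([1, 2, 3, 4, 5])    # made-up data
second = np.array([2, 1, 4, 3, 5])   # made-up data

r = correlation_coefficient(first, second)
print(r)  # approximately 0.8
```

Returning the value lets the caller decide how to use it, for example comparing coefficients for several pairs of columns.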


---
### Exercise 8 - create a new column in the dataframe from a numpy array

**Challenging**

Write a function which will calculate expected salaries for all in the income data set after an inflation rate of 3.5% (with results in a new numpy array).

Just to show the result, calculate and print the Pearson Correlation Coefficient between the salaries series and the inflated salaries series.  We would expect this to be 1 (i.e. the inflated salary is always 3.5% higher than the current salary). The exercise is just meant to show that; the statistic has no other relevance.  

Create a new column in the dataframe from the new numpy array (so that the dataframe now contains the original salaries and the inflated salaries).  
(**Recap**:  *to add a new column, just use* `df['new column name']`)  

To assign a numpy array to a pandas column use  
`df['new column name'] = numpyarrayname.tolist()`

Display the new dataframe and print the correlation coefficient.







In [20]:
import pandas as pd
import numpy as np

def add_new_column(income_array):
  # A 3.5% increase means multiplying by 1.035 (multiplying by 3.5 would be a 250% increase)
  new_salaries = income_array * 1.035
  df["New Salaries"] = new_salaries.tolist()
  coef = np.corrcoef(income_array, new_salaries)
  print(coef)

add_new_column(income_array)



[[1. 1.]
 [1. 1.]]
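
As a quick check of the arithmetic, a 3.5% rise corresponds to multiplying each salary by `1 + 0.035`. The sketch below uses a small made-up dataframe; `example_df`, the `Inflated Income` column name and the constant `INFLATION_RATE` are assumptions for illustration, not part of the income dataset.

```python
import numpy as np
import pandas as pd

INFLATION_RATE = 0.035  # 3.5% expressed as a fraction

# Made-up salaries standing in for the Income column
example_df = pd.DataFrame({"Income": [40.0, 60.0, 80.0]})

incomes = example_df["Income"].to_numpy()
inflated = incomes * (1 + INFLATION_RATE)

# Store the new array as an extra column alongside the original salaries
example_df["Inflated Income"] = inflated.tolist()
print(example_df)

# Scaling by a positive constant keeps the relationship perfectly linear
corr = np.corrcoef(incomes, inflated)[0, 1]
print(corr)  # approximately 1
```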


# Reflection
----

## What skills have you demonstrated in completing this notebook?

Your answer: I have demonstrated an understanding of how to perform statistical calculations on numpy arrays, such as finding the mean, median, min/max values, mode, IQR and standard deviation, and identifying and isolating outliers. I have also passed data between pandas dataframes and numpy arrays, and practiced using functions to find the correlation coefficient between two numpy arrays.

## What caused you the most difficulty?

Your answer: I had to look up some guides to fully understand how to interpret the results of Pearson Correlation Coefficient calculations, but after viewing some [example graphs](https://www.scribbr.com/statistics/pearson-correlation-coefficient/) I was able to understand the meaning of the results.