
# Numpy for Statistical Analysis

---

The exercises in this worksheet use data on income in certain US states. Run the cell below to import the data, which will then be available for you to use as `df`.

In [3]:
import pandas as pd
import numpy as np

url = 'https://github.com/futureCodersSE/python-programming-for-data/blob/main/Datasets/Income-Data.xlsx?raw=True'
df = pd.read_excel(url)


---
### Exercise 1 - get some statistics from a numpy array created from a data series


Write a function which will create a numpy array from the `Age` column in the income dataset and will print the following:

*  the average (mean) age of those surveyed  
*  the median age of those surveyed
*  the age of the oldest person
*  the age of the youngest person

Expected output:  
```
29.88888888888889
28.0
42
22
```

     

In [2]:
import pandas as pd
import numpy as np

def get_age_stats():
  # create a numpy array from the Age column; keeping the column's integer
  # dtype lets the oldest and youngest ages print as 42 and 22
  age = df['Age'].to_numpy()

  # calculate the statistics: mean, median, oldest and youngest age
  mean_age = np.mean(age)
  median_age = np.median(age)
  age_of_oldest = np.max(age)
  age_of_youngest = np.min(age)

  # print the results
  print(mean_age)
  print(median_age)
  print(age_of_oldest)
  print(age_of_youngest)



# run the function and test against the expected output.
get_age_stats()

29.88888888888889
28.0
42
22


### Exercise 2 - find the mode value from a numpy array

Write a function that will create a numpy array from the `Age` column in the income dataset and will print the mode value.

*There is only one mode in this data range, but as an extra challenge, include code in your function that will check there is only one value equal to the maximum and print 'more than one mode' if multiple modes are found.*

Expected output:

25

In [4]:
def get_mode():
  # create a numpy array from the Age column
  age = df['Age'].to_numpy()

  # count the occurrences of each distinct value
  values, counts = np.unique(age, return_counts=True)

  # every value whose count equals the highest count is a mode
  modes = values[counts == np.max(counts)]

  # report the mode, or flag that several values share the highest count
  if len(modes) > 1:
    print("more than one mode")
  else:
    print(modes[0])



get_mode()

25
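
The extra challenge in Exercise 2 can be tried out on a small made-up array in which two values tie for the highest count. The sketch below is only illustrative: `find_modes` and `sample_ages` are hypothetical names and data, not part of the worksheet or the income dataset.

```python
import numpy as np

def find_modes(values):
    """Return every value that occurs with the highest frequency."""
    # np.unique returns the sorted distinct values and how often each occurs
    distinct, counts = np.unique(values, return_counts=True)
    # keep all values whose count equals the maximum count
    return distinct[counts == counts.max()]

sample_ages = np.array([22, 25, 25, 31, 31, 40])  # made-up data with a tie
modes = find_modes(sample_ages)

if len(modes) > 1:
    print("more than one mode:", modes)
else:
    print(modes[0])
```

Comparing `counts` against its maximum keeps all tied values, whereas looking up the position of the first maximum would silently report only one of them.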

---
### Exercise 3 - find the mean and standard deviation of wages

This exercise will again use data on income in certain US states.  

Write a function which will create a numpy array from the `Income` column in the income dataset and will print the following:

*  the mean income of those surveyed  
*  the standard deviation of income
*  the highest income
*  the lowest income as a percentage of the mean (lowest / mean * 100)


Expected output:  
```
63.388888888888886
13.936916958961463
81
34.70639789658195
```
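
By default `np.std` divides by the number of values (`ddof=0`, the population standard deviation); passing `ddof=1` divides by one less (the sample standard deviation). A minimal sketch on made-up numbers, which also shows the "lowest as a percentage of the mean" calculation; the `scores` array is purely illustrative and is not the income data:

```python
import numpy as np

scores = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # illustrative data, mean is 5

population_std = np.std(scores)      # divides by n (ddof=0, the default)
sample_std = np.std(scores, ddof=1)  # divides by n - 1

# lowest value expressed as a percentage of the mean
lowest_pct_of_mean = np.min(scores) / np.mean(scores) * 100

print(population_std)
print(sample_std)
print(lowest_pct_of_mean)
```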



In [5]:
import pandas as pd
import numpy as np

def get_income_stats():
  # create a numpy array from the Income column; the default dtype is used
  # because float16 is too imprecise to reproduce the expected output
  income = df['Income'].to_numpy()

  # calculate the summary statistics
  mean_income = np.mean(income)
  std_income = np.std(income)
  highest_income = np.max(income)
  lowest_income = np.min(income)
  lowest_income_percentage = lowest_income / mean_income * 100

  # print the summary statistics
  print(mean_income)
  print(std_income)
  print(highest_income)
  print(lowest_income_percentage)




# run the function and test against expected output
get_income_stats()

63.388888888888886
13.936916958961463
81
34.70639789658195


### Exercise 4 - find income IQR

Write a function that will create a numpy array from the `Income` column of the income dataset and print the following:
* the 25th percentile
* the 75th percentile
* the interquartile range

Expected output:
```
62.0
73.0
11.0
```
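
With its default settings `np.percentile` interpolates linearly between neighbouring sorted values, which is why a percentile need not be one of the data points. A small sketch on made-up numbers (the `values` array is an assumption for illustration only):

```python
import numpy as np

values = np.array([1, 2, 3, 4])  # illustrative data

# position of the 25th percentile: 0.25 * (4 - 1) = 0.75, i.e. three
# quarters of the way from the 1st value (1) to the 2nd value (2)
q1 = np.percentile(values, 25)
# position of the 75th percentile: 0.75 * (4 - 1) = 2.25, i.e. a
# quarter of the way from the 3rd value (3) to the 4th value (4)
q3 = np.percentile(values, 75)
iqr = q3 - q1

print(q1, q3, iqr)
```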


In [7]:
def get_income_iqr():
  # add your code to calculate the interquartile range
  # create a numpy array from the Income column
  income = df['Income'].to_numpy()

  percentile_25 = np.percentile(income, 25)  # calculate the 25th percentile
  percentile_75 = np.percentile(income, 75)  # calculate the 75th percentile
  iqr = percentile_75 - percentile_25  # the IQR is the difference between the 75th and 25th percentiles

  #print the results above
  print(percentile_25)
  print(percentile_75)
  print(iqr)



get_income_iqr()

62.0
73.0
11.0


### Exercise 5 - find outliers

Write a function that will create a numpy array from the Income column that will do the following:
* calculate the standard deviation, mean, Q1, Q2 and interquartile range of the data (as separate variables)
* calculate the upper limit for outliers based on both standard deviation and iqr
* calculate the lower limit for outliers based on both standard deviation and iqr
* filter four times, once for each outlier type
* print the outliers


Expected output:
```
Upper outliers by std: []
Lower outliers by std: [22]
Upper outliers by iqr: []
Lower outliers by iqr: [45 22]
```
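
The filtering step relies on boolean masks: comparing an array with a limit gives an array of `True`/`False` values, and indexing with that mask keeps only the matching elements. A minimal sketch, assuming the common convention of 1.5 times the IQR beyond the quartiles (the worksheet does not state the multiplier); `iqr_outliers` and `readings` are hypothetical names and data:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the elements lying more than k * IQR outside the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    # combine the two conditions element-wise with | (logical or)
    mask = (values < lower_limit) | (values > upper_limit)
    return values[mask]

readings = np.array([10, 12, 11, 13, 12, 40])  # made-up data
outliers = iqr_outliers(readings)
print(outliers)
```

Here the quartiles are 11.25 and 12.75, so the limits are 9 and 15 and only the value 40 falls outside them.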

In [10]:
def find_outliers():
  # create a numpy array from the Income column
  income = df['Income'].to_numpy()


  std = np.std(income) #calculate the standard deviation
  mean = np.mean(income) #calculate the mean
  q1 = np.percentile(income, 25) #calculate the 25th percentile
  q2 = np.percentile(income, 50) #calculate the 50th percentile (the median)
  q3 = np.percentile(income, 75) #calculate the 75th percentile
  iqr = q3 - q1 #calculate the IQR
  upper_limit_std = mean + (std*2) #upper limit for outliers based on standard deviation
  lower_limit_std = mean - (std*2) #lower limit for outliers based on standard deviation
  upper_limit_iqr = q3 + (iqr*1.5) #upper limit for outliers based on IQR
  lower_limit_iqr = q1 - (iqr*1.5) #lower limit for outliers based on IQR
  upper_outliers_std = income[income > upper_limit_std] #filter the upper outliers based on standard deviation
  lower_outliers_std = income[income < lower_limit_std] #filter the lower outliers based on standard deviation
  upper_outliers_iqr = income[income > upper_limit_iqr] #filter the upper outliers based on IQR
  lower_outliers_iqr = income[income < lower_limit_iqr] #filter the lower outliers based on IQR

  # print the four sets of outliers
  print("Upper outliers by std:", upper_outliers_std)
  print("Lower outliers by std:", lower_outliers_std)
  print("Upper outliers by iqr:", upper_outliers_iqr)
  print("Lower outliers by iqr:", lower_outliers_iqr)



find_outliers()

Upper outliers by std: []
Lower outliers by std: [22]
Upper outliers by iqr: []
Lower outliers by iqr: [45 22]


---
### Exercise 6 - finding the correlation between two series

Let's find out if there is a strong correlation between Age and Income in the income data set.

*  create a numpy array from the Age column  
*  create a numpy array from the Income column  
*  use the np.corrcoef(nparray1, nparray2) function to get the Pearson's Correlation Coefficient (the measure of linear correlation between the two data sets) and store it in a variable called **coef**
*  print the correlation coefficient output (see below, it will be a 2x2 matrix)
*  print the correlation coefficient (which is at position [0][1] (coef[0][1]))


Expected output:  
```
[[ 1.         -0.14787412]
 [-0.14787412  1.        ]]
 -0.1478741157606825

```
The matrix gives 4 values showing the correlation between:

```
   |    (Age/Age)        (Age/Income)     |
   |    (Income/Age)     (Income/Income)  |
```
This suggests that income tends to decrease as age increases (the correlation is negative, so as one rises the other falls), but the correlation is quite weak: a perfect correlation would be 1 or -1, and no correlation would be 0.
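
To see what the matrix entries mean, Pearson's coefficient can be computed by hand and compared with `np.corrcoef`. The sketch below uses two short made-up arrays (`x` and `y` are illustrative, not the Age and Income columns):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up data
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])  # made-up data

# deviations of each value from its own mean
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson's r: sum of products of deviations, divided by the square root
# of the product of the summed squared deviations
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev**2).sum() * (y_dev**2).sum())

coef = np.corrcoef(x, y)
print(r_manual)
print(coef[0, 1])
```

The diagonal of `coef` is always 1 (each array against itself) and the two off-diagonal entries are equal, so either `coef[0][1]` or `coef[1][0]` gives the coefficient between the two arrays.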

In [12]:
import pandas as pd
import numpy as np

def get_correlation():
  # create numpy arrays from the Age and Income columns
  age = df['Age'].to_numpy()
  income = df['Income'].to_numpy()

  # calculate the correlation coefficient matrix for the two arrays
  coef = np.corrcoef(age, income)
  coef1 = coef[0][1]  # the Age/Income entry of the matrix
  print(coef)
  print(coef1)




# run the function and test against expected output
get_correlation()

[[ 1.         -0.14787412]
 [-0.14787412  1.        ]]
-0.1478741157606825


### Exercise 7 - practicing correlation

Repeat exercise 6 but use the Income and Population columns instead.

If you have completed the user defined functions worksheet, try writing a function that takes two arrays as parameters and returns the correlation coefficient.

In [13]:
import pandas as pd
import numpy as np

def get_correlation():
  # create numpy arrays from the Income and Population columns
  income = df['Income'].to_numpy()
  population = df['Population'].to_numpy()

  # get the correlation between the two arrays
  coef = np.corrcoef(income, population)
  coef1 = coef[0][1]  # the Income/Population entry of the matrix

  #print the results
  print(coef)
  print(coef1)



get_correlation()


[[1.         0.11644143]
 [0.11644143 1.        ]]
0.11644142628402859
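
The optional extension in Exercise 7 asks for a function that takes two arrays as parameters and returns the coefficient. One possible sketch is shown here; the name `correlation` and the sample arrays are assumptions made for illustration:

```python
import numpy as np

def correlation(first, second):
    """Return Pearson's correlation coefficient between two 1-D arrays."""
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the
    # coefficient between the two inputs
    return np.corrcoef(first, second)[0, 1]

hours = np.array([1, 2, 3, 4])       # made-up data
score = np.array([10, 20, 30, 40])   # made-up data, perfectly linear in hours

print(correlation(hours, score))
print(correlation(hours, score[::-1]))
```

With the worksheet's data it could be called as `correlation(df['Income'].to_numpy(), df['Population'].to_numpy())`, which avoids redefining `get_correlation` for each pair of columns.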


---
### Exercise 8 - create a new column in the dataframe from a numpy array

**Challenging**

Write a function which will calculate expected salaries for all in the income data set after an inflation rate of 3.5% (with results in a new numpy array).

Just to show the result, calculate and print the Pearson Correlation Coefficient between the salaries series and the inflated salaries series. We would expect this to be 1 (i.e. the inflated salary is always 3.5% higher than the current salary); the exercise is just meant to show that, and the statistic has no other relevance.  

Create a new column in the dataframe from the new numpy array, so that the dataframe now contains both the original salaries and the inflated salaries.  
(**Recap**:  *to add a new column, just use* `df['new column name']`)  

To assign a numpy array to a pandas column use  
`df['new column name'] = numpyarrayname.tolist()`

Display the new dataframe and print the correlation coefficient.
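
The column-assignment step can be tried on a tiny stand-in dataframe first. This is a sketch under assumptions: `demo` and its three incomes are made up, and only the `tolist()` assignment pattern comes from the worksheet.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Income': [50, 60, 80]})  # made-up stand-in data

# multiply the whole array at once to apply 3.5% inflation
inflated = demo['Income'].to_numpy() * 1.035

# assigning to a column name that does not exist yet creates the column
demo['Inflated'] = inflated.tolist()

print(demo)
```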







In [9]:
import pandas as pd
import numpy as np

def inflated_salary():
  # note: float16 holds only about three significant decimal digits, so the
  # inflated values are rounded (65 * 1.035 is stored as 67.3125, not 67.275)
  income = df['Income'].to_numpy(np.float16)
  inflated = income * (1 + 0.035)  # calculate the inflated salary

  # calculate the correlation between salary and inflated salary
  coef = np.corrcoef(income, inflated)
  coef1 = coef[0][1]


  # add the inflated salary to the dataframe
  df['Inflated_salary'] = inflated.tolist()

  # display the new dataframe and the correlation results
  print(df)
  print(coef)
  print(coef1)

inflated_salary()








   State  County  Population  Age  Income  Inflated_salary
0     TX       1          72   34      65         67.31250
1     TX       2          33   42      45         46.59375
2     TX       5          25   23      46         47.62500
3     TX       6          54   36      65         67.31250
4     TX       7          11   42      53         54.87500
5     TX       8          28   25      62         64.18750
6     TX       9          82   35      66         68.31250
7     TX      10           5   40      75         77.62500
8     MD      11          61   27      22         22.78125
9     MD       2           5   23      69         71.43750
10    MD       4          98   25      73         75.56250
11    MD       3          64   29      75         77.62500
12    MD       2          36   24      65         67.31250
13    MD       1          24   25      66         68.31250
14    MD       5          34   31      78         80.75000
15    MD       6          89   22      81         83.87500
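
The inflated values in the table are not exact multiples of 1.035 (an income of 65 shows 67.3125 rather than 67.275). That is consistent with the array having been created as `np.float16`, which keeps only about three significant decimal digits. A small sketch of the effect, using three incomes taken from the table; the variable names are illustrative:

```python
import numpy as np

income = np.array([65, 45, 81])

# half precision: both the rate and the products are rounded to float16
half = income.astype(np.float16) * np.float16(1.035)

# double precision: the default floating-point type in numpy
full = income.astype(np.float64) * 1.035

print(half)
print(full)
```

Leaving out the dtype argument, as in `df['Income'].to_numpy()`, would keep the column's own dtype, and multiplying by 1.035 would then give double-precision results.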

# Reflection
----

## What skills have you demonstrated in completing this notebook?

Your answer: I used NumPy to work with specific columns extracted from a DataFrame, and calculated key statistics such as the mean, median, standard deviation and interquartile range (IQR), which I then used to identify outliers. I also added a new column to a pandas DataFrame from a NumPy array, a skill that is useful for enriching datasets with calculated values or derived features so that they are ready for further analysis.

Lastly, I calculated the Pearson Correlation Coefficient between two datasets, highlighting my understanding of statistical relationships and the ability to measure the strength and direction of linear associations.

## What caused you the most difficulty?

Your answer: This notebook was relatively straightforward for me, as statistics is an area that genuinely interests me. Taking the time to thoroughly read and understand the explanations beforehand made it much easier to approach and tackle the challenges effectively.