
# Investigate, wrangle, add new data to the dataframe, filter and correlate

The following data file contains student scores in maths, reading and writing. The exam scores are assumed to be percentages. It also records gender, ethnicity, parental education, whether the student qualifies for free school meals, and whether the student has taken a preparation course for the exams. The data set has already been cleaned.

1.  Investigate the data set.

2.  Create numpy arrays to hold each of the three sets of scores.  
Create a new numpy array to hold the average exam score (of the three scores for each student)
Add the new numpy array as a new column in the dataframe.

3.  Filter the original dataset into a new dataframe containing just the females.  Calculate the average exam scores for all rows in this new dataframe and then find the mean of the average column.

4.  Do the same for the males.

5.  Use the original dataset to find the correlation coefficient for reading and maths.  How closely do they correlate?  Write what you find in a text box below the code.

6.  Do the same for reading and writing. What do you find?

7.  You might want to filter on different criteria and check correlation (e.g. those on free school meals, or those who had prepared, etc)
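
As a rough illustration of task 7, the sketch below filters on one criterion and then computes the correlation within each subgroup. The `sample` dataframe is made-up data standing in for `exams.csv`, and `subgroup_correlation` is a hypothetical helper, not part of the worksheet.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for exams.csv (values are illustrative only).
sample = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard",
              "free/reduced", "standard", "free/reduced"],
    "reading_score": [72, 55, 90, 61, 80, 48],
    "writing_score": [70, 58, 93, 60, 78, 50],
})

def subgroup_correlation(df, column, value, x, y):
    """Pearson r between columns x and y for rows where df[column] == value."""
    subset = df[df[column] == value]
    return np.corrcoef(subset[x], subset[y])[0, 1]

# Compare the reading/writing correlation across the lunch groups.
for group in sample["lunch"].unique():
    r = subgroup_correlation(sample, "lunch", group,
                             "reading_score", "writing_score")
    print(group, round(r, 3))
```

The same helper could be pointed at any other categorical column, such as gender or the preparation course, to see whether the relationship between two scores differs by subgroup.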







The dataset can be accessed here:  https://raw.githubusercontent.com/lilaceri/Working-with-data-/main/Data%20Sets%20for%20code%20divisio/exams.csv.  This is a .csv file.

**NOTE:** Some useful references are included at the bottom of this worksheet.

Use the code cell below to work your code.

import pandas as pd
import numpy as np

def get_data(url):
  df = pd.read_csv(url)
  return df

def any_nulls(df):
  df_null = df.isna().any()
  return df_null

def find_average_score(list1,list2,list3):
  average_arr = np.mean(np.array([list1, list2, list3]), axis =0)
  return average_arr

def get_correlation(nparray1,nparray2):
  # compute the Pearson correlation matrix for the two arrays
  R_matrix = np.corrcoef(nparray1, nparray2)
  return R_matrix

url = "https://raw.githubusercontent.com/lilaceri/Working-with-data-/main/Data%20Sets%20for%20code%20divisio/exams.csv"
df = get_data(url)
math_arr = np.array(df["math_score"])
read_arr = np.array(df["reading_score"])
write_arr = np.array(df["writing_score"])
#df.info()  # check the column names and data types
#print(any_nulls(df))  # check whether any column contains nulls (rows to drop)
# No null values found, so continuing without dropping rows
df["average_score"] = find_average_score(df["math_score"],df["reading_score"],df["writing_score"]) 

#create new data frames for males and females
df_female = df[df.gender == 'female']
df_male = df[df.gender == 'male']

#find mean scores in 3 subjects for males and females 
female_math_mean = df_female["math_score"].mean()
female_read_mean = df_female["reading_score"].mean()
female_write_mean = df_female["writing_score"].mean()
male_math_mean = df_male["math_score"].mean()
male_read_mean = df_male["reading_score"].mean()
male_write_mean = df_male["writing_score"].mean()

print("The average maths scores for female and male pupils are", female_math_mean, male_math_mean, "respectively.")
print("This means that male pupils score better than female pupils in maths on average.")

print("\nThe average reading scores for female and male pupils are", female_read_mean, male_read_mean, "respectively.")
print("This means that female pupils score better than male pupils in reading on average.")

print("\nThe average writing scores for female and male pupils are", female_write_mean, male_write_mean, "respectively.")
print("This means that female pupils score better than male pupils in writing on average.")


#find mean for average score for male and female
female_score_mean = df_female["average_score"].mean()
male_score_mean = df_male["average_score"].mean()

#Get correlation between reading scores and math scores
R_read_math = get_correlation(math_arr,read_arr)
print("\nThe correlation between Reading and Maths scores is", R_read_math[0][1])
print("This indicates a strong positive linear relationship: pupils who scored higher in reading tended to score higher in maths as well.")

#Get correlation between reading scores and writing scores
R_read_write = get_correlation(write_arr,read_arr)
print("\nThe correlation between Reading and Writing scores is", R_read_write[0][1])
print("This is a very strong positive correlation: pupils who scored high marks in reading generally also did well in writing.")

# Let's see how financial difficulties (lunch) correlate with pupils' scores - I will look at the average score
# Encode lunch as 0 (free/reduced) or 1 (standard), then correlate it with the average score
# map() makes the encoding explicit; any value not in the dictionary becomes NaN
df["lunch_bool"] = df["lunch"].map({'free/reduced': 0, 'standard': 1})
lunch_arr = np.array(df["lunch_bool"])
average_arr = np.array(df["average_score"])
R_lunch_ave = get_correlation(lunch_arr,average_arr)
print("\nThe correlation between Financial difficulties and average scores is", R_lunch_ave[0][1])
print("This indicates a weak-to-moderate positive correlation between having a standard lunch and average scores.")
print("Pupils facing financial difficulties (free/reduced lunch) tend to score lower in exams.")


The average maths scores for female and male pupils are 64.94153225806451 68.74801587301587 respectively.
This means that male pupils score better than female pupils in maths on average.

The average reading scores for female and male pupils are 73.86693548387096 65.64880952380952 respectively.
This means that female pupils score better than male pupils in reading on average.

The average writing scores for female and male pupils are 73.65120967741936 63.58134920634921 respectively.
This means that female pupils score better than male pupils in writing on average.

The correlation between Reading and Maths scores is 0.8145975152425367
This indicates a strong positive linear relationship: pupils who scored higher in reading tended to score higher in maths as well.

The correlation between Reading and Writing scores is 0.9567686336066268
This is a very strong positive correlation: pupils who scored high marks in reading generally also did well in writing.

The correlation between Financial difficulties and average scores is 0.309905332

I tried to add comments and print statements as I progressed with the code. I want to comment on the method I used to correlate financial difficulties with average scores. I encoded lunch as 0 for free/reduced and 1 for standard. A positive correlation therefore means that pupils on standard lunch (those not facing financial difficulties) tend to have higher average scores. This is, I feel, a fairly crude way of measuring the relationship. I chose 0 and 1 because there are only two possibilities: facing financial difficulties or not. I could do the same for the exam preparation course as well.
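
The 0/1 encoding described above could be applied to the preparation course in much the same way. The sketch below is only an illustration on made-up data: the column name `test_preparation_course` and its values `none` and `completed` are assumptions and should be checked against the real file (for example with `df.info()` and `df["..."].unique()`). The correlation between a 0/1 variable and a numeric one is known as the point-biserial correlation, and its sign matches the direction of the difference between the two group means.

```python
import numpy as np
import pandas as pd

# Made-up sample; the column name and its two values are assumptions.
sample = pd.DataFrame({
    "test_preparation_course": ["none", "completed", "none",
                                "completed", "none", "completed"],
    "average_score": [60.0, 75.0, 55.0, 70.0, 65.0, 80.0],
})

# Encode the two categories as 0/1 so they can be passed to np.corrcoef.
prep_flag = sample["test_preparation_course"].map({"none": 0, "completed": 1})

# Correlation between the 0/1 flag and the average score.
r = np.corrcoef(prep_flag, sample["average_score"])[0, 1]

# Group means give a more direct view of the same relationship.
group_means = sample.groupby("test_preparation_course")["average_score"].mean()

print("correlation:", round(r, 3))
print(group_means)
```

Printing the group means alongside the coefficient makes the result easier to interpret than the coefficient alone.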

### Helpful references
---

Filtering on criteria using df references:  
`filtered_df = df[df['column name'] == value]`  

Filtering on multiple criteria using df reference:  
`filtered_df = df[(df['first column name'] == value) & (df['second column name'] >= value)]`
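
The filtering patterns above can be sketched on a small made-up dataframe. The data and variable names here are purely illustrative. Each comparison is wrapped in parentheses because `&` binds more tightly than `==` and `>=` in Python.

```python
import pandas as pd

# Made-up data for illustration only.
sample = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "math_score": [82, 45, 58, 91],
})

# Single criterion: the comparison goes outside the column selection.
females = sample[sample["gender"] == "female"]

# Multiple criteria: wrap each comparison in parentheses and join with &.
high_scoring_females = sample[
    (sample["gender"] == "female") & (sample["math_score"] >= 60)
]

print(len(females), len(high_scoring_females))
```

Use `|` in place of `&` to keep rows that match either condition.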

Numpy help sheet:  http://datacamp-community-prod.s3.amazonaws.com/da466534-51fe-4c6d-b0cb-154f4782eb54

