# NumPy - Final Project

## Introduction
In this project, I am going to use the NumPy library to load and manipulate the numeric data held in a Pandas DataFrame, in order to investigate and wrangle the data and add new data to the DataFrame. I use NumPy for the numeric work because its arrays store values of a single type in contiguous memory, which typically makes calculations fast and memory-efficient.
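As a rough illustration of the memory point, the sketch below compares the size of the same values stored as `int64` (the dtype Pandas reports for the score columns) and as `int8`. The five values are the first five math scores from the data set; the variable names are my own and are not used elsewhere in the notebook.

```python
import numpy as np

# First five math scores from the data set, stored with the default 64-bit integer type
scores_int64 = np.array([65, 75, 62, 60, 34], dtype=np.int64)

# The same values as 8-bit integers; this is safe only because every value fits in -128..127
scores_int8 = scores_int64.astype(np.int8)

bytes_int64 = scores_int64.nbytes  # 5 values * 8 bytes each
bytes_int8 = scores_int8.nbytes    # 5 values * 1 byte each
print(bytes_int64, bytes_int8)
```

The smaller dtype only pays off when the values are known to stay in range, which holds here because the scores are assumed to be percentages.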

### Investigate, wrangle, add new data to the dataframe, filter and correlate

The following data file contains data about student scores in math, reading and writing.  The exam scores are assumed to be percentages. It also contains data about gender, ethnicity, parental education, whether the student qualifies for free school food and whether or not the student has taken a preparation course for the exams.  The data set has already been cleaned. I am going to:

1.  Investigate the data set.

2.  Create NumPy arrays to hold each of the three sets of scores.
    Create a new NumPy array to hold the average exam score (the mean of the three scores for each student).
    Add the new NumPy array as a new column in the dataframe.

3.  Filter the original dataset into a new dataframe containing just the females.  Calculate the average exam scores for all rows in this new dataframe and then find the mean of the average column.

4.  Do the same for the males.

5.  Use the original dataset to find the correlation coefficient for reading and maths.  How closely do they correlate?  Write my findings in a text box below the code.

6.  Do the same for reading and writing.

7.  I might need to filter on different criteria and check correlation (e.g. those on free school meals, or those who had prepared, etc)







In [1]:
import pandas as pd
import numpy as np

url = 'https://raw.githubusercontent.com/lilaceri/Working-with-data-/main/Data%20Sets%20for%20code%20divisio/exams.csv'

# Load a CSV or Excel source into a DataFrame (any other db_type falls back to CSV)
def create_dataframe(url, db_type='csv', sheetname=None):
  if db_type == 'excel':
    # With no sheet name given, pandas reads the first sheet
    if sheetname is None:
      df = pd.read_excel(url)
    else:
      df = pd.read_excel(url, sheet_name=sheetname)
  else:
    df = pd.read_csv(url)
  return df

# Create the DataFrame
exam_df = create_dataframe(url)

# info() prints its summary directly and returns None, so it is not wrapped in display()
exam_df.info()
display(exam_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   gender              1000 non-null   object
 1   ethnicity           1000 non-null   object
 2   parental_education  1000 non-null   object
 3   lunch               1000 non-null   object
 4   preparation_course  1000 non-null   object
 5   math_score          1000 non-null   int64 
 6   reading_score       1000 non-null   int64 
 7   writing_score       1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB

|   | gender | ethnicity | parental_education | lunch | preparation_course | math_score | reading_score | writing_score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group E | some college | free/reduced | none | 65 | 76 | 71 |
| 1 | male | group C | some college | standard | completed | 75 | 72 | 69 |
| 2 | female | group B | some high school | free/reduced | completed | 62 | 56 | 61 |
| 3 | male | group D | some high school | standard | completed | 60 | 60 | 59 |
| 4 | female | group C | high school | free/reduced | completed | 34 | 54 | 55 |


In [2]:
# Create numpy arrays to hold each of the three sets of scores.
# np.int8 stores values from -128 to 127, which is enough for percentage scores (0-100).
def get_score_np_array(df_column, np_data_type=np.int8):
  np_new_array = np.array(df_column, dtype=np_data_type)
  return np_new_array

nd_math_score = get_score_np_array(exam_df['math_score'])
nd_reading_score = get_score_np_array(exam_df['reading_score'])
nd_writing_score = get_score_np_array(exam_df['writing_score'])

# Create a new numpy array to hold the average exam score
nd_avg_score = np.array(exam_df[['math_score', 'reading_score', 'writing_score']].mean(axis=1))
exam_df['average_score'] = nd_avg_score
display(exam_df.head())
print('')

# Filter the dataset on column == criteria and return the filtered dataframe
# together with the mean of its 'average_score' column
def create_sub_df_stats(df, column, criteria=None):
  new_df = df[df[column] == criteria]
  column_exam_mean = new_df['average_score'].mean()
  return new_df, column_exam_mean

female_exam_df, female_exam_mean = create_sub_df_stats(exam_df, 'gender', 'female')
male_exam_df, male_exam_mean = create_sub_df_stats(exam_df, 'gender', 'male')

# Show the result set and stats
display(female_exam_df.head())
print(f'Female Average Score: {round(female_exam_mean)}')
print('')
display(male_exam_df.head())
print(f'Male Average Score: {round(male_exam_mean)}')

|   | gender | ethnicity | parental_education | lunch | preparation_course | math_score | reading_score | writing_score | average_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group E | some college | free/reduced | none | 65 | 76 | 71 | 70.666667 |
| 1 | male | group C | some college | standard | completed | 75 | 72 | 69 | 72.0 |
| 2 | female | group B | some high school | free/reduced | completed | 62 | 56 | 61 | 59.666667 |
| 3 | male | group D | some high school | standard | completed | 60 | 60 | 59 | 59.666667 |
| 4 | female | group C | high school | free/reduced | completed | 34 | 54 | 55 | 47.666667 |





|   | gender | ethnicity | parental_education | lunch | preparation_course | math_score | reading_score | writing_score | average_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group E | some college | free/reduced | none | 65 | 76 | 71 | 70.666667 |
| 2 | female | group B | some high school | free/reduced | completed | 62 | 56 | 61 | 59.666667 |
| 4 | female | group C | high school | free/reduced | completed | 34 | 54 | 55 | 47.666667 |
| 5 | female | group B | some college | free/reduced | none | 53 | 66 | 61 | 60.0 |
| 6 | female | group D | some college | standard | none | 85 | 88 | 92 | 88.333333 |


Female Average Score: 71



|   | gender | ethnicity | parental_education | lunch | preparation_course | math_score | reading_score | writing_score | average_score |
|---|---|---|---|---|---|---|---|---|---|
| 1 | male | group C | some college | standard | completed | 75 | 72 | 69 | 72.0 |
| 3 | male | group D | some high school | standard | completed | 60 | 60 | 59 | 59.666667 |
| 7 | male | group E | some high school | standard | none | 93 | 73 | 70 | 78.666667 |
| 9 | male | group C | associate's degree | standard | completed | 85 | 75 | 82 | 80.666667 |
| 13 | male | group C | some college | standard | none | 90 | 81 | 81 | 84.0 |


Male Average Score: 66
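
The average column in the cell above is computed through the DataFrame. As a sketch of an alternative that stays in NumPy, the three score arrays could be stacked and averaged directly. The values below are the first five rows of the data set; the names `stacked` and `avg` are my own. Adding the `int8` arrays directly would silently wrap around for sums above 127, whereas `np.mean` accumulates integer input as `float64`.

```python
import numpy as np

# First five rows of the three score columns, stored as int8 like the notebook's arrays
math = np.array([65, 75, 62, 60, 34], dtype=np.int8)
reading = np.array([76, 72, 56, 60, 54], dtype=np.int8)
writing = np.array([71, 69, 61, 59, 55], dtype=np.int8)

# Stack into shape (3, n): one row per subject, one column per student
stacked = np.stack([math, reading, writing])

# Average down each column; np.mean uses float64 for integer input,
# so the int8 dtype does not overflow here
avg = stacked.mean(axis=0)
print(np.round(avg, 6))
```

The resulting array could then be assigned to the dataframe in the same way as `nd_avg_score`.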


In [3]:
# find the correlation coefficient for reading and maths scores
ccf_r_m = np.corrcoef(x=nd_reading_score, y=nd_math_score)
display(ccf_r_m)

array([[1.        , 0.81459752],
       [0.81459752, 1.        ]])

There is a strong positive correlation between the 'reading_score' and 'math_score' variables, with a correlation coefficient of about 0.81: students with higher reading scores tend to have higher math scores as well.
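
To make clear what the off-diagonal value in the matrix represents, here is a minimal sketch that computes the Pearson correlation coefficient by hand and compares it with `np.corrcoef`. The helper `pearson_r` is a hypothetical name of my own, and the inputs are only the first five rows of the data set, so the result will not match the full-data figure.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: sum of products of deviations, divided by the
    square root of the product of the summed squared deviations."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# First five reading and math scores from the data set
reading = [76, 72, 56, 60, 54]
math = [65, 75, 62, 60, 34]

r_manual = pearson_r(reading, math)
r_numpy = np.corrcoef(reading, math)[0, 1]  # off-diagonal entry of the 2x2 matrix
print(r_manual, r_numpy)
```

The diagonal entries of the matrix are always 1 because each variable correlates perfectly with itself, so only the `[0, 1]` entry carries information.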

In [4]:
# find the correlation coefficient for reading and writing scores
ccf_r_w = np.corrcoef(x=nd_reading_score, y=nd_writing_score)
display(ccf_r_w)

array([[1.        , 0.95676863],
       [0.95676863, 1.        ]])

There is a very strong positive correlation between the 'reading_score' and 'writing_score' variables, with a correlation coefficient of about 0.96: students with higher reading scores tend to have higher writing scores too. Reading correlates more closely with writing than with math.
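
Rather than calling `np.corrcoef` once per pair, all three score arrays can be passed together to get every pairwise coefficient in one matrix. This is a sketch on the first five rows of the data set only; the names `scores` and `corr` are my own.

```python
import numpy as np

# First five rows of each score column
math = np.array([65, 75, 62, 60, 34], dtype=np.int8)
reading = np.array([76, 72, 56, 60, 54], dtype=np.int8)
writing = np.array([71, 69, 61, 59, 55], dtype=np.int8)

# Each row is one variable, which is what np.corrcoef expects by default (rowvar=True)
scores = np.vstack([math, reading, writing])
corr = np.corrcoef(scores)

# corr[i, j] is the coefficient between row i and row j:
# 0 = math, 1 = reading, 2 = writing
print(np.round(corr, 3))
```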

In [5]:
# Filter on 'lunch' criteria and check the correlation
free_lunch_df, free_lunch_mean = create_sub_df_stats(exam_df, 'lunch', 'free/reduced')
standard_lunch_df, standard_lunch_mean = create_sub_df_stats(exam_df, 'lunch', 'standard')

# Show the result set and stats
display(free_lunch_df.head())
print(f'On free school meal, average score: {round(free_lunch_mean)}')
print('')
display(standard_lunch_df.head())
print(f'On standard school meal, average score: {round(standard_lunch_mean)}')

|   | gender | ethnicity | parental_education | lunch | preparation_course | math_score | reading_score | writing_score | average_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group E | some college | free/reduced | none | 65 | 76 | 71 | 70.666667 |
| 2 | female | group B | some high school | free/reduced | completed | 62 | 56 | 61 | 59.666667 |
| 4 | female | group C | high school | free/reduced | completed | 34 | 54 | 55 | 47.666667 |
| 5 | female | group B | some college | free/reduced | none | 53 | 66 | 61 | 60.0 |
| 11 | female | group D | associate's degree | free/reduced | none | 54 | 69 | 70 | 64.333333 |


On free school meal, average score: 62



|   | gender | ethnicity | parental_education | lunch | preparation_course | math_score | reading_score | writing_score | average_score |
|---|---|---|---|---|---|---|---|---|---|
| 1 | male | group C | some college | standard | completed | 75 | 72 | 69 | 72.0 |
| 3 | male | group D | some high school | standard | completed | 60 | 60 | 59 | 59.666667 |
| 6 | female | group D | some college | standard | none | 85 | 88 | 92 | 88.333333 |
| 7 | male | group E | some high school | standard | none | 93 | 73 | 70 | 78.666667 |
| 8 | female | group C | associate's degree | standard | completed | 58 | 71 | 63 | 64.0 |


On standard school meal, average score: 72
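
The same split can also be done on NumPy arrays with a boolean mask instead of filtering the dataframe. The sketch below uses the 'lunch' and 'average_score' values of the first five rows only, so its means differ from the full-data figures above; the variable names are my own.

```python
import numpy as np

# 'lunch' and 'average_score' values for the first five rows of the data set
lunch = np.array(['free/reduced', 'standard', 'free/reduced', 'standard', 'free/reduced'])
avg_score = np.array([70.666667, 72.0, 59.666667, 59.666667, 47.666667])

# Element-wise comparison gives a boolean array usable as an index
free_mask = lunch == 'free/reduced'

free_mean = avg_score[free_mask].mean()       # rows where the mask is True
standard_mean = avg_score[~free_mask].mean()  # ~ inverts the mask
print(round(free_mean, 2), round(standard_mean, 2))
```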


In [6]:
# Create numpy arrays to hold each of the three sets of scores
nd_math_score_free_lunch = get_score_np_array(free_lunch_df['math_score'])
nd_reading_score_free_lunch = get_score_np_array(free_lunch_df['reading_score'])
nd_writing_score_free_lunch = get_score_np_array(free_lunch_df['writing_score'])

# Find the correlation coefficient for reading and maths scores
ccf_r_m_free_lunch = np.corrcoef(x=nd_reading_score_free_lunch, y=nd_math_score_free_lunch)

# Find the correlation coefficient for reading and writing scores
ccf_r_w_free_lunch = np.corrcoef(x=nd_reading_score_free_lunch, y=nd_writing_score_free_lunch)

print('On Free School Meal (correlation coefficient figures)')
print('='*60)
print('Reading Score vs Math Score:')
print('-'*60)
print('')
display(ccf_r_m_free_lunch)
print('')
print('-'*60)
print('Reading Score vs Writing Score:')
print('-'*60)
print('')
display(ccf_r_w_free_lunch)

On Free School Meal (correlation coefficient figures)
============================================================
Reading Score vs Math Score:
------------------------------------------------------------



array([[1.        , 0.79602045],
       [0.79602045, 1.        ]])


------------------------------------------------------------
Reading Score vs Writing Score:
------------------------------------------------------------



array([[1.        , 0.95514705],
       [0.95514705, 1.        ]])

In [7]:
# Create numpy arrays to hold each of the three sets of scores
nd_math_score_standard_lunch = get_score_np_array(standard_lunch_df['math_score'])
nd_reading_score_standard_lunch = get_score_np_array(standard_lunch_df['reading_score'])
nd_writing_score_standard_lunch = get_score_np_array(standard_lunch_df['writing_score'])

# Find the correlation coefficient for reading and maths scores
ccf_r_m_standard_lunch = np.corrcoef(x=nd_reading_score_standard_lunch, y=nd_math_score_standard_lunch)

# Find the correlation coefficient for reading and writing scores
ccf_r_w_standard_lunch = np.corrcoef(x=nd_reading_score_standard_lunch, y=nd_writing_score_standard_lunch)

print('On Standard School Meal (correlation coefficient figures)')
print('='*60)
print('Reading Score vs Math Score:')
print('-'*60)
print('')
display(ccf_r_m_standard_lunch)
print('')
print('-'*60)
print('Reading Score vs Writing Score:')
print('-'*60)
print('')
display(ccf_r_w_standard_lunch)

On Standard School Meal (correlation coefficient figures)
============================================================
Reading Score vs Math Score:
------------------------------------------------------------



array([[1.       , 0.8073629],
       [0.8073629, 1.       ]])


------------------------------------------------------------
Reading Score vs Writing Score:
------------------------------------------------------------



array([[1.        , 0.95341194],
       [0.95341194, 1.        ]])

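The correlation coefficients above differ only slightly between students on free/reduced and standard school meals: reading vs math is about 0.80 for the free/reduced group and 0.81 for the standard group, while reading vs writing is about 0.96 and 0.95 respectively. The relationship between the exam scores is therefore much the same in both groups.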
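
The per-group steps in the two cells above repeat the same pattern, so they could be folded into one helper that works for any grouping column, such as 'preparation_course'. This is only a sketch: `corr_by_group` is a hypothetical helper of my own, and the input values are made up for illustration and are not taken from the exam data.

```python
import numpy as np

def corr_by_group(labels, x, y):
    """Return {group label: Pearson r between x and y within that group}."""
    labels = np.asarray(labels)
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    result = {}
    for label in np.unique(labels):
        mask = labels == label
        if mask.sum() < 2:
            continue  # a correlation needs at least two observations
        result[str(label)] = float(np.corrcoef(x[mask], y[mask])[0, 1])
    return result

# Made-up illustration data (NOT from the exam data set)
groups = ['completed', 'completed', 'completed', 'none', 'none', 'none']
x_scores = [1, 2, 3, 1, 2, 3]
y_scores = [2, 4, 6, 3, 2, 1]

result = corr_by_group(groups, x_scores, y_scores)
print(result)
```

With the real data, the call might look like `corr_by_group(exam_df['lunch'], nd_reading_score, nd_math_score)`, assuming the arrays are in the same row order as the dataframe.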

## Reflection
I learnt how to use NumPy to load data from a Pandas DataFrame, manipulate it as NumPy arrays, and save the results back to the DataFrame. I also learnt about several of NumPy's statistical functions, such as `np.corrcoef`. Overall, I did not find any difficulties in working with NumPy.