## Exploring data arrays with NumPy


In [None]:
# Suppose a college professor takes a sample of student grades from a class to analyze.

In [29]:
data = [50,50,47,97,49,3,53,42,26,74,82,62,37,15,70,27,36,35,48,52,63,64]
print(data)


[50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64]


In [2]:
import numpy as np

grades = np.array(data)
print(grades)

[50 50 47 97 49  3 53 42 26 74 82 62 37 15 70 27 36 35 48 52 63 64]


In [4]:
# Difference between a plain Python list and a NumPy array:
# NumPy arrays are specifically designed to support mathematical operations on numeric data, which makes them more useful for data analysis than a generic list.

In [3]:
print(type(data), 'x 2:', data * 2)
print('---')
print(type(grades), 'x 2:', grades * 2)

<class 'list'> x 2: [50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64, 50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64]
---
<class 'numpy.ndarray'> x 2: [100 100  94 194  98   6 106  84  52 148 164 124  74  30 140  54  72  70
  96 104 126 128]


In [None]:
# The array's type is numpy.ndarray. The "nd" stands for n-dimensional: the structure can have any number of dimensions.
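
To make the "n dimensions" idea concrete, here is a minimal sketch comparing a one-dimensional and a two-dimensional array. The array names and the shortened values are assumptions for illustration only.

```python
import numpy as np

# A one-dimensional array, similar to the grades array (values shortened for brevity)
one_d = np.array([50, 50, 47, 97])

# A two-dimensional array: two rows of three values each
two_d = np.array([[1, 2, 3], [4, 5, 6]])

print(one_d.ndim, one_d.shape)
print(two_d.ndim, two_d.shape)

# Arithmetic is applied element by element, whatever the number of dimensions
doubled = two_d * 2
print(doubled)
```

The `ndim` attribute reports the number of dimensions and `shape` reports the size along each one.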

## Statistical analysis to explore the dataset

In [13]:
grades.shape

(22,)

In [14]:
grades.mean()

49.18181818181818

In [15]:
# So the mean grade is just around 50, more or less in the middle of the possible range from 0 to 100

In [None]:
# Let's add a second set of data for the same students. This time, we'll record the typical number of hours per week they devoted to studying

In [19]:
# Define an array of study hours
study_hours = [10.0,11.5,9.0,16.0,9.25,1.0,11.5,9.0,8.5,14.5,15.5,
               13.75,9.0,8.0,15.5,8.0,9.0,6.0,10.0,12.0,12.5,12.0]

# Create a 2D array (an array of arrays)
student_data = np.array([study_hours, grades])

# display the array
student_data

array([[10.  , 11.5 ,  9.  , 16.  ,  9.25,  1.  , 11.5 ,  9.  ,  8.5 ,
        14.5 , 15.5 , 13.75,  9.  ,  8.  , 15.5 ,  8.  ,  9.  ,  6.  ,
        10.  , 12.  , 12.5 , 12.  ],
       [50.  , 50.  , 47.  , 97.  , 49.  ,  3.  , 53.  , 42.  , 26.  ,
        74.  , 82.  , 62.  , 37.  , 15.  , 70.  , 27.  , 36.  , 35.  ,
        48.  , 52.  , 63.  , 64.  ]])

In [20]:
student_data.shape

(2, 22)

In [23]:
# Show the last element (index 21) of the second sub-array (index 1)
student_data[1][21]

64.0

In [28]:
# Get the mean value of each sub-array
avg_study = student_data[0].mean()
avg_grade = student_data[1].mean()

print('Average study hours:{:.2f} \n Average grade: {:.2f}'.format(avg_study, avg_grade))

Average study hours:10.52 
 Average grade: 49.18
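
As an alternative to taking the mean of each sub-array separately, the `axis` argument of `mean` can compute all the row means in one call. The following is a minimal sketch on a hypothetical, shortened stand-in for `student_data`; the name `sample` is an assumption.

```python
import numpy as np

# Hypothetical, shortened stand-in for the student_data array
sample = np.array([[10.0, 11.5, 9.0],
                   [50.0, 50.0, 47.0]])

# axis=1 averages along each row: one mean per sub-array
row_means = sample.mean(axis=1)

# axis=0 averages down each column: one mean per student
col_means = sample.mean(axis=0)

print(row_means)
print(col_means)
```

Here `row_means[0]` plays the role of the average study hours and `row_means[1]` the average grade.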


## Exploring tabular data with Pandas
NumPy provides a lot of the functionality and tools you need to work with numbers, such as arrays of numeric values. However, when you start to deal with two-dimensional tables of data, the Pandas package offers a more convenient structure to work with: the DataFrame.

In [None]:
# Create a DataFrame with three columns

In [30]:
import pandas as pd

df_students = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic', 'Jimmie', 
                                     'Rhonda', 'Giovanni', 'Francesca', 'Rajab', 'Naiyana', 'Kian', 'Jenny',
                                     'Jakeem','Helena','Ismat','Anila','Skye','Daniel','Aisha'],
                            'StudyHours':student_data[0],
                            'Grade':student_data[1]})

df_students 

Unnamed: 0,Name,StudyHours,Grade
0,Dan,10.0,50.0
1,Joann,11.5,50.0
2,Pedro,9.0,47.0
3,Rosie,16.0,97.0
4,Ethan,9.25,49.0
5,Vicky,1.0,3.0
6,Frederic,11.5,53.0
7,Jimmie,9.0,42.0
8,Rhonda,8.5,26.0
9,Giovanni,14.5,74.0


In [31]:
# Get the data for index value 15
df_students.loc[15]

Name          Jakeem
StudyHours       8.0
Grade           27.0
Name: 15, dtype: object

In [32]:
# Get the rows with index values from 0 to 3
df_students.loc[0:3]

Unnamed: 0,Name,StudyHours,Grade
0,Dan,10.0,50.0
1,Joann,11.5,50.0
2,Pedro,9.0,47.0
3,Rosie,16.0,97.0


In [33]:
# We can use iloc to select rows by integer position, regardless of the index labels

In [36]:
df_students.iloc[0:3]

Unnamed: 0,Name,StudyHours,Grade
0,Dan,10.0,50.0
1,Joann,11.5,50.0
2,Pedro,9.0,47.0



The loc method returned the rows with index labels in the range 0 to 3,
which includes 0, 1, 2, and 3 (four rows). However, the iloc method returns the rows at the positions in the range 0 to 3. Since integer ranges don't include the upper-bound value, this includes positions 0, 1, and 2 (three rows).
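
The difference between label-based and position-based slicing is easier to see when the index labels do not match the row positions. The sketch below assumes a small hypothetical DataFrame with labels 10, 20, 30, and 40; it is not part of the student dataset.

```python
import pandas as pd

# Hypothetical frame whose index labels (10..40) differ from the row positions (0..3)
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie'],
                   'Grade': [50.0, 50.0, 47.0, 97.0]},
                  index=[10, 20, 30, 40])

# Label-based slice: both end labels are included
by_label = df.loc[10:30]

# Position-based slice: the upper bound is excluded
by_position = df.iloc[0:2]

print(by_label)
print(by_position)
```

In this sketch `loc[10:30]` yields three rows, while `iloc[0:2]` yields two.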

In [43]:
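# Select the first three rows with the columns at positions 1 and 2 (StudyHours and Grade),
# then the first three rows with only the column at position 2 (Grade).
# In a notebook cell, only the result of the last expression is displayed.
df_students.iloc[0:3,[1,2]]
df_students.iloc[0:3,[2]]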

Unnamed: 0,Grade
0,50.0
1,50.0
2,47.0


In [44]:
df_students.loc[0,'Grade']

50.0

We can also select rows based on the student's name:

In [45]:
df_students.loc[df_students['Name']=='Aisha']

Unnamed: 0,Name,StudyHours,Grade
21,Aisha,12.0,64.0


Or even without the loc method:

In [46]:
df_students[df_students['Name']=='Aisha']

Unnamed: 0,Name,StudyHours,Grade
21,Aisha,12.0,64.0


Or by using the DataFrame query method:

In [47]:
df_students.query('Name=="Aisha"')

Unnamed: 0,Name,StudyHours,Grade
21,Aisha,12.0,64.0


Or by using the column name as a property of the DataFrame:

In [48]:
df_students[df_students.Name == 'Aisha']

Unnamed: 0,Name,StudyHours,Grade
21,Aisha,12.0,64.0


## Find the grades and study hours of students with high grades

In [57]:
df_students[df_students.Grade>80]

Unnamed: 0,Name,StudyHours,Grade
3,Rosie,16.0,97.0
10,Francesca,15.5,82.0
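
Filters like this can be combined. The following is a minimal sketch on a hypothetical four-row subset of the data; the thresholds are arbitrary and chosen only for illustration.

```python
import pandas as pd

# Hypothetical subset of the student data
df = pd.DataFrame({'Name': ['Dan', 'Rosie', 'Francesca', 'Giovanni'],
                   'StudyHours': [10.0, 16.0, 15.5, 14.5],
                   'Grade': [50.0, 97.0, 82.0, 74.0]})

# Each condition needs its own parentheses; & means "and", | means "or"
high_grade_under_16h = df[(df.Grade > 80) & (df.StudyHours < 16)]

print(high_grade_under_16h)
```

The parentheses matter because `&` binds more tightly than the comparison operators in Python.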


## Loading a DataFrame from a file

In [101]:
df_students = pd.read_csv('grades.csv', header='infer')
df_students

Unnamed: 0,Name,StudyHours,Grade
0,Dan,10.0,50.0
1,Joann,11.5,50.0
2,Pedro,9.0,47.0
3,Rosie,16.0,97.0
4,Ethan,9.25,49.0
5,Vicky,1.0,3.0
6,Frederic,11.5,53.0
7,Jimmie,9.0,42.0
8,Rhonda,8.5,26.0
9,Giovanni,14.5,74.0
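
If you do not have `grades.csv` at hand, the same call can be tried on in-memory text. This sketch assumes a tiny hypothetical CSV; the real file's contents may differ.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for grades.csv
csv_text = (
    "Name,StudyHours,Grade\n"
    "Dan,10.0,50.0\n"
    "Joann,11.5,50.0\n"
    "Bill,8.0,\n"
    "Ted,,\n"
)

# read_csv accepts a file path or any file-like object;
# header='infer' (the default) takes the column names from the first line
df = pd.read_csv(io.StringIO(csv_text), header='infer')

print(df)
print(df.dtypes)
```

Empty fields in numeric columns are read as NaN, which is how missing values typically end up in a loaded DataFrame.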


## Handling missing values

You can use the isnull method to identify which individual values are null, like this:

In [71]:
df_students.isnull()

Unnamed: 0,Name,StudyHours,Grade
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,False,False
6,False,False,False
7,False,False,False
8,False,False,False
9,False,False,False


With a larger DataFrame, it would be inefficient to review all of the rows and columns individually,
so we can get the sum of missing values for each column like this:

In [72]:
df_students.isnull().sum()

Name          0
StudyHours    1
Grade         2
dtype: int64

We can filter the DataFrame to include only rows where any of the columns (axis 1 of the DataFrame) are null.

In [76]:
df_students[df_students.isnull().any(axis=1)]

Unnamed: 0,Name,StudyHours,Grade
22,Bill,8.0,
23,Ted,,
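
The three steps above (flagging nulls, counting them per column, and selecting the affected rows) can be sketched together on a small hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with some missing values
df = pd.DataFrame({'Name': ['Dan', 'Bill', 'Ted'],
                   'StudyHours': [10.0, 8.0, np.nan],
                   'Grade': [50.0, np.nan, np.nan]})

# True wherever a value is missing
null_mask = df.isnull()

# Number of missing values in each column
per_column = null_mask.sum()

# Rows with at least one missing value
rows_with_nulls = df[null_mask.any(axis=1)]

print(per_column)
print(rows_with_nulls)
```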


## Impute missing values

So now that we've found the null values, what can we do about them?

One common approach is to impute replacement values. For example, if the number of study hours is missing, we could just assume that the student studied for an average amount of time and replace the missing value with the mean study hours. To do this, we can use the fillna method to fill the missing values with the mean StudyHours of the other students, like this:

In [77]:
df_students.StudyHours = df_students.StudyHours.fillna(df_students.StudyHours.mean())
df_students

Unnamed: 0,Name,StudyHours,Grade
0,Dan,10.0,50.0
1,Joann,11.5,50.0
2,Pedro,9.0,47.0
3,Rosie,16.0,97.0
4,Ethan,9.25,49.0
5,Vicky,1.0,3.0
6,Frederic,11.5,53.0
7,Jimmie,9.0,42.0
8,Rhonda,8.5,26.0
9,Giovanni,14.5,74.0
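
Mean imputation works because `mean` ignores missing values by default. A minimal sketch on a hypothetical Series of study hours:

```python
import numpy as np
import pandas as pd

# Hypothetical study hours with one missing value
hours = pd.Series([10.0, 8.0, np.nan, 12.0], name='StudyHours')

# mean() skips NaN by default, so the average is taken over the three known values
mean_hours = hours.mean()

# fillna returns a new Series; the original is unchanged unless you reassign it
filled = hours.fillna(mean_hours)

print(mean_hours)
print(filled.tolist())
```

This is why the notebook assigns the result back to `df_students.StudyHours` rather than calling `fillna` on its own.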


Or we can fill the missing grades with 60, or any other chosen value:

In [79]:
df_students.Grade = df_students.Grade.fillna(60)
df_students

Unnamed: 0,Name,StudyHours,Grade
0,Dan,10.0,50.0
1,Joann,11.5,50.0
2,Pedro,9.0,47.0
3,Rosie,16.0,97.0
4,Ethan,9.25,49.0
5,Vicky,1.0,3.0
6,Frederic,11.5,53.0
7,Jimmie,9.0,42.0
8,Rhonda,8.5,26.0
9,Giovanni,14.5,74.0


Alternatively, it might be important to ensure that you only use data you know to be absolutely correct. In this case, you can drop rows or columns that contain null values by using the dropna method. For example, we'll remove rows (axis 0 of the DataFrame) where any of the columns contain null values:

In [81]:
df_students = df_students.dropna(axis=0)
df_students

Unnamed: 0,Name,StudyHours,Grade
0,Dan,10.0,50.0
1,Joann,11.5,50.0
2,Pedro,9.0,47.0
3,Rosie,16.0,97.0
4,Ethan,9.25,49.0
5,Vicky,1.0,3.0
6,Frederic,11.5,53.0
7,Jimmie,9.0,42.0
8,Rhonda,8.5,26.0
9,Giovanni,14.5,74.0
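
The `dropna` method can also be restricted to particular columns. The sketch below, on a small hypothetical DataFrame, contrasts dropping rows with any null against dropping only rows where StudyHours is null:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with some missing values
df = pd.DataFrame({'Name': ['Dan', 'Bill', 'Ted'],
                   'StudyHours': [10.0, 8.0, np.nan],
                   'Grade': [50.0, np.nan, np.nan]})

# Drop rows that contain a missing value in any column
any_null_dropped = df.dropna(axis=0, how='any')

# Drop rows only when StudyHours is missing
hours_required = df.dropna(axis=0, subset=['StudyHours'])

print(any_null_dropped)
print(hours_required)
```

Which variant is appropriate depends on which columns the later analysis actually needs.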


## Explore data in the DataFrame

In [82]:
# Get the mean grade using the column name as a property (just to make the point!)
mean_grade = df_students.Grade.mean()

In [85]:
print('Average grade:{:.2f}'.format( mean_grade))

Average grade:50.08


Let's filter the DataFrame to find only the students whose grade is higher than the average grade. We also store the mean study hours for use in the next step.

In [91]:
mean_study = df_students.StudyHours.mean()

In [89]:
student_top = df_students[df_students.Grade > mean_grade]
print(student_top)

         Name  StudyHours  Grade
3       Rosie   16.000000   97.0
6    Frederic   11.500000   53.0
9    Giovanni   14.500000   74.0
10  Francesca   15.500000   82.0
11      Rajab   13.750000   62.0
14      Jenny   15.500000   70.0
19       Skye   12.000000   52.0
20     Daniel   12.500000   63.0
21      Aisha   12.000000   64.0
22       Bill    8.000000   60.0
23        Ted   10.413043   60.0


Let's find the average grade for students who undertook more than the average amount of study time.

In [92]:
df_students[df_students.StudyHours > mean_study].Grade.mean()

66.7

## Add a column
1. Create a Pandas Series containing the pass/fail indicator (True for grades of 60 or more).
2. Add that Series to the DataFrame as a new column using pd.concat.

In [102]:
passes = pd.Series(df_students['Grade'] >= 60)
df_students = pd.concat([df_students, passes.rename("Pass")], axis=1)

df_students

Unnamed: 0,Name,StudyHours,Grade,Pass
0,Dan,10.0,50.0,False
1,Joann,11.5,50.0,False
2,Pedro,9.0,47.0,False
3,Rosie,16.0,97.0,True
4,Ethan,9.25,49.0,False
5,Vicky,1.0,3.0,False
6,Frederic,11.5,53.0,False
7,Jimmie,9.0,42.0,False
8,Rhonda,8.5,26.0,False
9,Giovanni,14.5,74.0,True
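
A column can also be added by direct assignment or with `assign`, which some readers may find simpler than `pd.concat`. This is a sketch on a hypothetical three-row DataFrame; the `Margin` column is invented for illustration.

```python
import pandas as pd

# Hypothetical three-row frame
df = pd.DataFrame({'Name': ['Dan', 'Rosie', 'Rajab'],
                   'Grade': [50.0, 97.0, 62.0]})

# Direct assignment adds the column to the existing DataFrame
# (pass mark of 60 assumed, matching the threshold used above)
df['Pass'] = df['Grade'] >= 60

# assign() returns a new DataFrame and leaves the original untouched
df_with_margin = df.assign(Margin=df['Grade'] - 60)

print(df)
print(df_with_margin)
```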


We can use the groupby method to group the student data based on the Pass column we added previously and count the number of names in each group. In other words, we can determine how many students passed and failed.

In [103]:
print(df_students.groupby(df_students.Pass).Name.count())

Pass
False    17
True      7
Name: Name, dtype: int64


In [106]:
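# Select several columns from the grouped data with a list (double brackets);
# the single-bracket tuple form is deprecated and fails in recent pandas versions
print(df_students.groupby(df_students.Pass)[['StudyHours', 'Grade']].mean())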

       StudyHours      Grade
Pass                        
False    8.734375  38.000000
True    14.250000  73.142857
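
Grouped data can also be summarized with a different aggregation per column. The following is a minimal sketch using `agg` on a hypothetical four-row DataFrame; the values are illustrative, not the full dataset.

```python
import pandas as pd

# Hypothetical four-row frame with a Pass column
df = pd.DataFrame({'Name': ['Dan', 'Rosie', 'Rajab', 'Vicky'],
                   'StudyHours': [10.0, 16.0, 13.0, 1.0],
                   'Grade': [50.0, 97.0, 62.0, 3.0],
                   'Pass': [False, True, True, False]})

# Double brackets select a list of columns from the grouped data
summary = df.groupby('Pass')[['StudyHours', 'Grade']].mean()

# agg can apply a different aggregation to each column
counts_and_means = df.groupby('Pass').agg({'Name': 'count', 'Grade': 'mean'})

print(summary)
print(counts_and_means)
```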



In [107]:
# Create a DataFrame with the data sorted by Grade (descending)
df_students = df_students.sort_values('Grade', ascending=False)

# Show the DataFrame
df_students

Unnamed: 0,Name,StudyHours,Grade,Pass
3,Rosie,16.0,97.0,True
10,Francesca,15.5,82.0,True
9,Giovanni,14.5,74.0,True
14,Jenny,15.5,70.0,True
21,Aisha,12.0,64.0,True
20,Daniel,12.5,63.0,True
11,Rajab,13.75,62.0,True
6,Frederic,11.5,53.0,False
19,Skye,12.0,52.0,False
1,Joann,11.5,50.0,False
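
Sorting can use more than one column, which helps when grades are tied. This sketch assumes a hypothetical three-row DataFrame and also renumbers the index after sorting:

```python
import pandas as pd

# Hypothetical three-row frame with a tie on Grade
df = pd.DataFrame({'Name': ['Joann', 'Dan', 'Rosie'],
                   'StudyHours': [11.5, 10.0, 16.0],
                   'Grade': [50.0, 50.0, 97.0]})

# Sort by Grade (highest first), breaking ties by StudyHours (lowest first)
ranked = df.sort_values(['Grade', 'StudyHours'], ascending=[False, True])

# reset_index(drop=True) renumbers the rows from 0 in the new order
ranked = ranked.reset_index(drop=True)

print(ranked)
```

Without `reset_index`, the rows keep their original index labels, as in the sorted output above.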
