### EXPLORING DATA WITH PYTHON (NUMPY)

In [3]:
data = [50,50,47,97,49,3,53,42,26,74,82,62,37,15,70,27,36,35,48,52,63,64]
print(data)

[50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64]


The data has been loaded into a Python list, which is a good data type for general data manipulation but isn't optimized for numeric analysis. For that, we're going to use the NumPy package, which includes specific data types and functions for working with numbers in Python.

Run the following cell to load the data into a NumPy array.


In [4]:
import numpy as np

grades = np.array(data)
print(grades)

[50 50 47 97 49  3 53 42 26 74 82 62 37 15 70 27 36 35 48 52 63 64]


In [6]:
grades.shape

(22,)

In [10]:
grades.mean()

np.float64(49.18181818181818)

In [7]:
print (type(data),'x 2:', data * 2)
print('---')
print (type(grades),'x 2:', grades * 2)

<class 'list'> x 2: [50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64, 50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64]
---
<class 'numpy.ndarray'> x 2: [100 100  94 194  98   6 106  84  52 148 164 124  74  30 140  54  72  70
  96 104 126 128]


Multiplying a list by two creates a new list of twice the length with the original sequence of list elements repeated. Multiplying a NumPy array, on the other hand, performs an element-wise calculation in which the array behaves like a vector, so we end up with an array of the same size in which each element has been multiplied by two.

The key takeaway from this is that NumPy arrays are specifically designed to support mathematical operations on numeric data, which makes them more useful for data analysis than a generic list.
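To make the contrast concrete, here is a minimal sketch showing that an element-wise result from a plain list needs an explicit comprehension, while a NumPy array does it in one expression. The sample values and the names `values`, `doubled_list`, and `doubled_array` are illustrative and not part of the notebook.

```python
import numpy as np

values = [50, 47, 97]  # small illustrative sample, not the full grades list

# A list needs an explicit comprehension for element-wise arithmetic
doubled_list = [v * 2 for v in values]

# A NumPy array applies the operation to every element at once
doubled_array = np.array(values) * 2

print(doubled_list)
print(doubled_array)
```

Both results hold the same numbers; the array version avoids the Python-level loop and stays an array, so further numeric operations can be chained onto it.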

In [14]:
# Define a list of study hours
study_hours = [10.0,11.5,9.0,16.0,9.25,1.0,11.5,9.0,8.5,14.5,15.5,
               13.75,9.0,8.0,15.5,8.0,9.0,6.0,10.0,12.0,12.5,12.0]

# Create a 2D array (an array of arrays)
student_data = np.array([study_hours, grades])

# display the array
student_data

array([[10.  , 11.5 ,  9.  , 16.  ,  9.25,  1.  , 11.5 ,  9.  ,  8.5 ,
        14.5 , 15.5 , 13.75,  9.  ,  8.  , 15.5 ,  8.  ,  9.  ,  6.  ,
        10.  , 12.  , 12.5 , 12.  ],
       [50.  , 50.  , 47.  , 97.  , 49.  ,  3.  , 53.  , 42.  , 26.  ,
        74.  , 82.  , 62.  , 37.  , 15.  , 70.  , 27.  , 36.  , 35.  ,
        48.  , 52.  , 63.  , 64.  ]])

In [13]:
student_data.shape

(2, 22)

In [15]:
student_data[0][0]

np.float64(10.0)

In [17]:
# Get the mean value of each sub-array
avg_study = student_data[0].mean()
avg_grade = student_data[1].mean()

print('Average study hours:{:.2f}\nAverage grade:{:.2f}'.format(avg_study,avg_grade))

Average study hours:10.52
Average grade:49.18
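The cell above takes the mean of each row separately. As a hedged alternative, the sketch below computes both row means in one call with the `axis` argument. The small `sample` array is made up for illustration and stands in for `student_data`.

```python
import numpy as np

# Illustrative 2 x 3 array: row 0 = study hours, row 1 = grades
sample = np.array([[10.0, 11.5, 9.0],
                   [50.0, 50.0, 47.0]])

# axis=1 averages across the columns, giving one mean per row
row_means = sample.mean(axis=1)

print(row_means.shape)
print(row_means)
```

The result is a one-dimensional array with one entry per row, so `row_means[0]` corresponds to study hours and `row_means[1]` to grades.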


### Exploring tabular data with Pandas

NumPy provides a lot of the functionality and tools we need to work with numbers, such as arrays of numeric values. However, when we start to deal with two-dimensional tables of data, the Pandas package offers a more convenient structure to work with: the DataFrame.

Exercise:
Run the following cell to import the Pandas library and create a DataFrame with three columns. The first column is a list of student names, and the second and third columns are the NumPy arrays containing the study time and grade data.


In [None]:
import pandas as pd

df_students = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic', 'Jimmie', 
                                     'Rhonda', 'Giovanni', 'Francesca', 'Rajab', 'Naiyana', 'Kian', 'Jenny',
                                     'Jakeem','Helena','Ismat','Anila','Skye','Daniel','Aisha'],
                            'StudyHours':student_data[0],
                            'Grade':student_data[1]})

df_students 

### Finding and filtering data in a DataFrame

We can use the DataFrame's loc method to retrieve data for a specific index value, like this.

In [None]:
# Get the rows with index values from 0 to 5
df_students.loc[0:5]

In [None]:
# Get data in the first five rows
df_students.iloc[0:5]

Look carefully at the iloc[0:5] results and compare them to the loc[0:5] results you obtained previously. Can you spot the difference?

The loc method returned rows with index label in the list of values from 0 to 5, which includes 0, 1, 2, 3, 4, and 5 (six rows). However, the iloc method returns the rows in the positions included in the range 0 to 5. Since integer ranges don't include the upper-bound value, this includes positions 0, 1, 2, 3, and 4 (five rows).
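The difference between labels and positions is easier to see when the index is not the default 0, 1, 2 sequence. The following sketch uses a small made-up DataFrame with hypothetical index labels 10 to 40; the names `df`, `by_label`, and `by_position` are illustrative.

```python
import pandas as pd

# Illustrative frame whose index labels differ from the row positions
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie']},
                  index=[10, 20, 30, 40])

by_label = df.loc[10:30]     # labels 10 through 30, upper bound included
by_position = df.iloc[0:2]   # positions 0 and 1, upper bound excluded

print(by_label)
print(by_position)
```

Here `loc[10:30]` returns three rows because the label 30 is included, while `iloc[0:2]` returns two rows because position 2 is excluded.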

iloc identifies data values in a DataFrame by position, which extends beyond rows to columns. So, for example, you can use it to find the values for the columns in positions 1 and 2 in row 0, like this:


In [None]:
df_students.iloc[0,[1,2]]

In [None]:
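# loc works with labels, so a column can be selected by name
# (iloc accepts integer positions only and would raise an error here)
df_students.loc[0,'Grade']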

In [None]:
df_students.query('Name=="Aisha"')

In [None]:
df_students[df_students.Name == 'Aisha']

### Loading a DataFrame from a file

We constructed the DataFrame from some existing arrays. However, in many real-world scenarios, data is loaded from sources such as files. Let's replace the student grades DataFrame with the contents of a text file.


In [None]:
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/grades.csv
df_students = pd.read_csv('grades.csv',delimiter=',',header='infer')
df_students.head()  # show the first five rows (positions 0-4)
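As a rough illustration of what `read_csv` does with these arguments, the sketch below parses a tiny made-up CSV from an in-memory string instead of a downloaded file. The CSV text is invented for the example and is not the contents of `grades.csv`. It also shows that `delimiter=','` and `header='infer'` match the defaults, so passing them explicitly is optional.

```python
import io
import pandas as pd

# Made-up CSV text with one missing StudyHours value
csv_text = "Name,StudyHours,Grade\nA,10.0,50\nB,,47\n"

# Same arguments as the cell above, read from an in-memory buffer
explicit = pd.read_csv(io.StringIO(csv_text), delimiter=',', header='infer')

# Relying on the defaults gives an identical DataFrame
default = pd.read_csv(io.StringIO(csv_text))

print(explicit.equals(default))
print(default.head())
```

The empty field in the second row is loaded as `NaN`, which is how missing values typically appear after reading a file.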

In [None]:
df_students.StudyHours = df_students.StudyHours.fillna(df_students.StudyHours.mean())
df_students.Grade = df_students.Grade.fillna(df_students.Grade.mean())
df_students

In [None]:
df_students = df_students.dropna(axis=0, how='any')
df_students
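The two cells above handle missing values in different ways: `fillna` substitutes a value for each missing entry, while `dropna` removes every row that still contains one. The sketch below shows both on a small made-up DataFrame (the data and the names `missing_counts`, `filled`, and `dropped` are illustrative), and uses `isnull().sum()` to count missing values per column first.

```python
import numpy as np
import pandas as pd

# Illustrative data with gaps; not taken from grades.csv
df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                   'StudyHours': [8.0, np.nan, 10.0],
                   'Grade': [60.0, np.nan, np.nan]})

# Count the missing values in each column
missing_counts = df.isnull().sum()

# Option 1: replace missing study hours with the column mean
filled = df.copy()
filled['StudyHours'] = filled['StudyHours'].fillna(filled['StudyHours'].mean())

# Option 2: drop any row that has at least one missing value
dropped = df.dropna(axis=0, how='any')

print(missing_counts)
print(filled)
print(dropped)
```

Filling keeps every row but introduces estimated values; dropping keeps only observed values but discards rows. Which trade-off is acceptable depends on the analysis.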

### Explore data in the DataFrame

Now that we've cleaned up the missing values, we're ready to explore the data in the DataFrame. Let's start by comparing the mean study hours and grades.

In [None]:
# Get the mean study hours using the column name as an index
mean_study = df_students['StudyHours'].mean()

# Get the mean grade using the column name as a property (just to make the point!)
mean_grade = df_students.Grade.mean()

# Print the mean study hours and mean grade
print('Average weekly study hours: {:.2f}\nAverage grade: {:.2f}'.format(mean_study, mean_grade))

# Use @ to reference the local variable in the query
df_students.query('StudyHours > @mean_study')

#### Explanation:

The query() method operates in the DataFrame's namespace by default. To access variables from your Python environment, use @variable_name.

This tells pandas to look for a variable named mean_study in your Python environment rather than looking for a column named "mean_study" in the DataFrame.
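A minimal sketch of the same idea on made-up data: the `@` prefix lets `query()` see a local variable, and the result matches an ordinary boolean filter. The names `df`, `threshold`, `via_query`, and `via_mask` are illustrative.

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({'StudyHours': [6.0, 10.0, 14.0]})

# Local Python variable, not a column in the DataFrame
threshold = df['StudyHours'].mean()

# @threshold refers to the local variable inside the query string
via_query = df.query('StudyHours > @threshold')

# Equivalent boolean-mask filter
via_mask = df[df['StudyHours'] > threshold]

print(via_query.equals(via_mask))
```

Without the `@`, pandas would look for a column called `threshold` and fail because no such column exists.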



In [None]:
# Get students who studied for more than the mean number of hours

mean_study_students = df_students[df_students.StudyHours > mean_study]
mean_study_students

In [None]:
# What was their mean grade?
mean_study_students.Grade.mean()

In [None]:
# Build a boolean Series that is True where the grade is 60 or more
passes = pd.Series(df_students['Grade'] >= 60)
df_students = pd.concat([df_students, passes.rename("Pass")], axis=1)

df_students
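The cell above adds the new column with `pd.concat`. A common alternative, sketched here on made-up data, is to assign the boolean Series directly to a new column name. The pass mark of 60 comes from the cell above; the DataFrame contents are illustrative.

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                   'Grade': [75.0, 42.0, 60.0]})

# Assigning to a new column name adds the column to the DataFrame
df['Pass'] = df['Grade'] >= 60

print(df)
```

Direct assignment modifies the existing DataFrame, whereas `pd.concat` builds a new one; both produce the same columns.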

In [None]:
print(df_students.groupby(df_students.Pass).Name.count())

In [None]:
print(df_students.groupby(df_students.Pass)[['StudyHours', 'Grade']].mean())
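The two `groupby` cells above compute a count and the means separately. As a hedged sketch, named aggregation can combine them into one summary table. The data and the output column names `students`, `mean_hours`, and `mean_grade` are illustrative choices.

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({'StudyHours': [12.0, 8.0, 14.0, 6.0],
                   'Grade': [70.0, 40.0, 80.0, 50.0],
                   'Pass': [True, False, True, False]})

# One row per Pass value, with a count and two means
summary = df.groupby('Pass').agg(students=('Grade', 'count'),
                                 mean_hours=('StudyHours', 'mean'),
                                 mean_grade=('Grade', 'mean'))

print(summary)
```

Each keyword in `agg` names an output column and pairs a source column with the aggregation to apply to it.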

In [None]:
# Create a DataFrame with the data sorted by Grade (descending)
df_students = df_students.sort_values('Grade', ascending=False)

# Show the DataFrame
df_students

## Summary

NumPy arrays and Pandas DataFrames are the workhorses of data science in Python. They give us ways to load, explore, and analyze tabular data. As we will learn in subsequent modules, even advanced analysis methods typically rely on NumPy and Pandas for these important roles.

In our next workbook, we'll take a look at how to create graphs and explore your data in more interesting ways.