In many real-world scenarios, data is loaded from external sources such as files. The dataset for this example is derived from https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/grades.csv


In [17]:
# Importing Pandas
import pandas as pd

# Read the file
df_students = pd.read_csv("/content/sample_data.csv")
df_students.head()

Unnamed: 0,Name,StudyHours,Grade
0,Dan,10.0,50.0
1,Joann,11.5,50.0
2,Pedro,9.0,47.0
3,Rosie,16.0,97.0
4,Ethan,9.25,49.0


In [None]:
# Handling Missing Values
df_students.isnull()

In [18]:
# Summing up all missing values
df_students.isnull().sum()

Name          0
StudyHours    1
Grade         2
dtype: int64

In [19]:
# Filter the DataFrame to include only rows where at least one column (checked along axis 1) is null.
df_students[df_students.isnull().any(axis=1)]

Unnamed: 0,Name,StudyHours,Grade
22,Bill,8.0,
23,Ted,,


In [None]:
"""
Now that we've found the null values, what can we do about them?

One common approach is to impute replacement values. For example,
if the number of study hours is missing, we could assume that the student
studied for an average amount of time and replace the missing value with the mean study hours.
To do this, we can use the fillna method:


df_students.StudyHours = df_students.StudyHours.fillna(df_students.StudyHours.mean())
df_students.Grade = df_students.Grade.fillna(df_students.Grade.mean())
df_students

"""

In [20]:
"""
Alternatively, it might be important to ensure that we only use data we know to be correct.
In that case, we can drop rows or columns that contain null values by using the dropna method.
For example, the following drops rows (axis 0 of the DataFrame) where any of the columns contains a null value:
"""
df_students = df_students.dropna(axis=0, how='any')
df_students

Unnamed: 0,Name,StudyHours,Grade
0,Dan,10.0,50.0
1,Joann,11.5,50.0
2,Pedro,9.0,47.0
3,Rosie,16.0,97.0
4,Ethan,9.25,49.0
5,Vicky,1.0,3.0
6,Frederic,11.5,53.0
7,Jimmie,9.0,42.0
8,Rhonda,8.5,26.0
9,Giovanni,14.5,74.0
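
The two strategies discussed above, imputing with `fillna` and dropping with `dropna`, can be compared side by side. The following is a minimal sketch on a small, made-up DataFrame (the `toy` data and the variable names are illustrative assumptions, not part of the grades dataset). It also shows the `subset` parameter of `dropna`, which may be useful when only some columns must be complete.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the grades dataset used above
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Di"],
    "StudyHours": [10.0, 8.0, np.nan, 12.0],
    "Grade": [50.0, np.nan, np.nan, 70.0],
})

# Strategy 1: impute. mean() skips NaN by default, so each mean is
# computed only from the values that are present.
imputed = toy.copy()
imputed["StudyHours"] = imputed["StudyHours"].fillna(imputed["StudyHours"].mean())
imputed["Grade"] = imputed["Grade"].fillna(imputed["Grade"].mean())

# Strategy 2: drop every row that has a null in any column.
dropped = toy.dropna(axis=0, how="any")

# Variation: drop a row only when StudyHours is null;
# a missing Grade is tolerated.
has_hours = toy.dropna(axis=0, subset=["StudyHours"])

print(imputed)
print(dropped)
print(has_hours)
```

Imputation keeps all four rows but introduces estimated values, while dropping keeps only rows that were fully observed. Which trade-off is appropriate depends on the analysis.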


**Exploring Data in the DataFrame**

---

Now that the data is all cleaned up, let's have some fun with it 😂

In [21]:
# Print mean study hours and mean grade
mean_grade = df_students.Grade.mean()
mean_hours = df_students.StudyHours.mean()

print("Average Grade: {:.2f}\nAverage Study Hours: {:.2f}".format(mean_grade, mean_hours))

Average Grade: 49.18
Average Study Hours: 10.52


In [22]:
# Filter the DataFrame to find only the students who studied for more than the average amount of time.
df_students[df_students.StudyHours > df_students.StudyHours.mean()]

# What was their mean grade?
df_students[df_students.StudyHours > df_students.StudyHours.mean()].Grade.mean()

66.7
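
The filter above works because the comparison produces a boolean Series that is then used as a row mask. As a rough illustration on made-up data (the `toy` DataFrame and variable names are assumptions for this sketch), the mask can be stored once and reused, and inverted with `~` to look at the remaining students.

```python
import pandas as pd

# Hypothetical toy data, not the grades dataset used above
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Di"],
    "StudyHours": [6.0, 8.0, 12.0, 14.0],
    "Grade": [40.0, 50.0, 70.0, 80.0],
})

# Boolean Series: True where the student studied more than the mean (10.0 here)
above_average = toy["StudyHours"] > toy["StudyHours"].mean()

# .loc with a boolean mask keeps only the rows where the mask is True
mean_grade_above = toy.loc[above_average, "Grade"].mean()

# ~ inverts the mask, selecting everyone at or below the mean
mean_grade_rest = toy.loc[~above_average, "Grade"].mean()

print(mean_grade_above, mean_grade_rest)
```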

**Adding a New Column to the DataFrame**

---

Assume that the passing grade for the course is 60.

We can use that information to add a new column to the DataFrame that indicates whether or not each student passed.

First, we'll create a Pandas Series containing the pass/fail indicator (True or False), and then we'll concatenate that series as a new column (axis 1) in the DataFrame.

In [23]:
# A pandas Series is like a column in a table: a one-dimensional array holding data of one type
hasPassed = pd.Series(df_students.Grade >= 60)
hasPassed

0     False
1     False
2     False
3      True
4     False
5     False
6     False
7     False
8     False
9      True
10     True
11     True
12    False
13    False
14     True
15    False
16    False
17    False
18    False
19    False
20     True
21     True
Name: Grade, dtype: bool

In [24]:
# Concatenate the series to the df_students DataFrame
newDfStudents = pd.concat([df_students, hasPassed.rename("HasPassed")], axis=1)
newDfStudents

Unnamed: 0,Name,StudyHours,Grade,HasPassed
0,Dan,10.0,50.0,False
1,Joann,11.5,50.0,False
2,Pedro,9.0,47.0,False
3,Rosie,16.0,97.0,True
4,Ethan,9.25,49.0,False
5,Vicky,1.0,3.0,False
6,Frederic,11.5,53.0,False
7,Jimmie,9.0,42.0,False
8,Rhonda,8.5,26.0,False
9,Giovanni,14.5,74.0,True
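
Concatenation is one way to attach the new column. As an alternative sketch, `DataFrame.assign` can add the same column in a single step and also returns a new DataFrame rather than modifying the original. The `toy` data and the `PASSING_GRADE` constant below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data, not the grades dataset used above
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Grade": [50.0, 97.0, 60.0],
})

PASSING_GRADE = 60  # assumed pass mark, matching the text above

# assign returns a copy with the extra column; toy itself is unchanged
with_flag = toy.assign(HasPassed=toy["Grade"] >= PASSING_GRADE)

# The concat-based approach from the cell above, for comparison
has_passed = (toy["Grade"] >= PASSING_GRADE).rename("HasPassed")
concat_version = pd.concat([toy, has_passed], axis=1)

print(with_flag)
print(with_flag.equals(concat_version))
```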


**Grouping and Aggregating Tables of Data**

---



In [41]:
# Using the groupby method to group the student data into groups based on the HasPassed column
newDfStudents.groupby(newDfStudents.HasPassed).Name.count()

HasPassed
False    15
True      7
Name: Name, dtype: int64

In [34]:
# Calculate the mean for StudyHours and Grade based on the group made by the HasPassed column
newDfStudents.groupby(newDfStudents.HasPassed)[['StudyHours', 'Grade']].mean()


HasPassed,StudyHours,Grade
False,8.783333,38.0
True,14.25,73.142857
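
The count and the means computed in the two cells above can also be produced in one pass. The sketch below uses named aggregation with `agg` on made-up data; the `toy` DataFrame and the output column names (`students`, `mean_hours`, `mean_grade`) are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the grades dataset used above
toy = pd.DataFrame({
    "StudyHours": [6.0, 8.0, 12.0, 14.0],
    "Grade": [40.0, 50.0, 70.0, 80.0],
    "HasPassed": [False, False, True, True],
})

# Each keyword names an output column and maps it to
# (source column, aggregation function)
summary = toy.groupby("HasPassed").agg(
    students=("Grade", "count"),
    mean_hours=("StudyHours", "mean"),
    mean_grade=("Grade", "mean"),
)

print(summary)
```

By default `groupby` sorts the group keys, so the `False` row comes before the `True` row in the result.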


**Sorting Values**

---



In [46]:
# Sort values by Grade in descending order
newDfStudents.sort_values(by='Grade', ascending=False)

Unnamed: 0,Name,StudyHours,Grade,HasPassed
3,Rosie,16.0,97.0,True
10,Francesca,15.5,82.0,True
9,Giovanni,14.5,74.0,True
14,Jenny,15.5,70.0,True
21,Aisha,12.0,64.0,True
20,Daniel,12.5,63.0,True
11,Rajab,13.75,62.0,True
6,Frederic,11.5,53.0,False
19,Skye,12.0,52.0,False
1,Joann,11.5,50.0,False
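
Note that the sorted output keeps the original index labels. As a small sketch on made-up data (the `toy` DataFrame is an assumption for illustration), `sort_values` can also take several columns to break ties, and `reset_index(drop=True)` renumbers the rows if a fresh 0-based index is preferred.

```python
import pandas as pd

# Hypothetical toy data, not the grades dataset used above
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Di"],
    "StudyHours": [11.5, 9.0, 12.0, 10.0],
    "Grade": [50.0, 47.0, 50.0, 97.0],
})

# Sort by Grade (highest first); break ties by StudyHours (lowest first)
ranked = toy.sort_values(by=["Grade", "StudyHours"], ascending=[False, True])

# sort_values keeps the original index labels; renumber from 0
ranked = ranked.reset_index(drop=True)

print(ranked)
```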
