In [2]:
import pandas as pd

In [3]:
students_df = pd.read_csv("StudentPerformanceFactors.csv")
students_df.head()

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70


In [4]:
students_df.shape 

(6607, 20)

The `StudentPerformanceFactors.csv` file has 6607 rows and 20 columns.

# Introduction

Selecting specific values of a pandas **DataFrame** or **Series** </br>
to work on is an implicit step in almost any data operation you will run, </br>
so one of the first things you need to learn when working with data in Python is </br>
how to select the data points relevant to you quickly and effectively.

## 1) Accessing Value

### Native accessors 

Native Python objects provide good ways of indexing data.</br>
pandas carries all of these over, which helps make it easy to start with. </br>

In Python, we can access a property of an object by accessing it as an attribute.</br>
A `book` object, for example, might have a `title` property, which we can access by calling `book.title`.

`Columns` in a pandas `DataFrame` work in much the same way. </br>
Hence, to access the `Attendance` column/property of `students_df`, we can use the following code:

In [5]:
students_df.Attendance

0       84
1       64
2       98
3       89
4       92
        ..
6602    69
6603    76
6604    90
6605    86
6606    67
Name: Attendance, Length: 6607, dtype: int64

### Access Column using indexing Operator [] 

If we have a `Python dictionary`,</br>
we can access its values using the indexing operator [ ].</br>
We can do the same with `columns` in a `DataFrame`:

In [6]:
students_df['Attendance']

0       84
1       64
2       98
3       89
4       92
        ..
6602    69
6603    76
6604    90
6605    86
6606    67
Name: Attendance, Length: 6607, dtype: int64

#### Difference between students_df.Attendance vs students_df['Attendance']

These are two ways of selecting a specific `Series` out of a `DataFrame`.</br>
Neither of them is more or less syntactically valid than the other, but </br>
the `index operator []` does have the `advantage` that it can handle `column names` with </br>
`spaces or reserved characters` in them.

(If we had a column named `Attendance of Students`, `students_df.Attendance of Students` wouldn't work.)
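The difference between the two access styles can be sketched on a toy frame. The column names `Attendance of Students` and `count` below are made up for illustration and are not part of the CSV; the second one is chosen because it clashes with an existing `DataFrame` method.

```python
import pandas as pd

# Toy frame; the column names below are made up for illustration.
df = pd.DataFrame({
    "Attendance": [84, 64, 98],
    "Attendance of Students": [84, 64, 98],  # name contains spaces
    "count": [1, 2, 3],                      # name clashes with DataFrame.count()
})

# Bracket access works for every column name.
spaced = df["Attendance of Students"]
counts = df["count"]

# Attribute access only works for names that are valid identifiers
# and do not shadow an existing DataFrame attribute or method.
plain = df.Attendance
shadowed = df.count  # this is the bound method DataFrame.count, not the column

print(spaced.tolist())
print(callable(shadowed))
```

For that reason, bracket access is usually the safer default when column names come from an external file.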


#### Doesn't a pandas Series look kind of like a fancy dictionary? 
It pretty much is, so it's no surprise that, to drill down to a single specific value, </br>
we need only use the indexing operator [] once more:

In [7]:
students_df['Attendance'][0]

np.int64(84)

- The student at index `0` has an attendance of 84%.

## 2) Indexing in Pandas

The `indexing operator ([])` and `attribute selection (students_df.Attendance)` are nice because </br>
they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use.</br>
However, `pandas` has its own accessor operators, `loc` and `iloc`. </br>
For more advanced operations, these are the ones you are supposed to be using.

### A. Index-based selection: 

Pandas indexing works in one of two paradigms. </br>
The first is `index-based selection` : selecting data based on its numerical position in the data. </br>
`iloc` follows this paradigm.

To select the first `row` of data in a `DataFrame`, we may use the following code:

In [8]:
students_df.iloc[0]

Hours_Studied                          23
Attendance                             84
Parental_Involvement                  Low
Access_to_Resources                  High
Extracurricular_Activities             No
Sleep_Hours                             7
Previous_Scores                        73
Motivation_Level                      Low
Internet_Access                       Yes
Tutoring_Sessions                       0
Family_Income                         Low
Teacher_Quality                    Medium
School_Type                        Public
Peer_Influence                   Positive
Physical_Activity                       3
Learning_Disabilities                  No
Parental_Education_Level      High School
Distance_from_Home                   Near
Gender                               Male
Exam_Score                             67
Name: 0, dtype: object

Both `loc` and `iloc` are `row-first`, `column-second`.</br>
This is the opposite of the chained bracket access used earlier (`students_df['Attendance'][0]`), which is column-first, row-second.</br>

This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. </br>
To get a column with `iloc`, we can do the following:

In [9]:
students_df.iloc[:, 0]

0       23
1       19
2       24
3       29
4       19
        ..
6602    25
6603    23
6604    20
6605    10
6606    15
Name: Hours_Studied, Length: 6607, dtype: int64

On its own, the `:` operator, which also comes from native `Python`, means `"everything"`.</br> 
When combined with other selectors, however, it can be used to indicate a `range of values`. </br>

For example, to select the `Hours_Studied` column from just the **first, second, and third rows**, we would do:

In [10]:
students_df.iloc[0:3, 0]

0    23
1    19
2    24
Name: Hours_Studied, dtype: int64

**Or, to select just the second and third entries, we would do:**

In [11]:
students_df.iloc[1:3,0]

1    19
2    24
Name: Hours_Studied, dtype: int64

**It's also possible to pass a list:**

In [12]:
students_df.iloc[[1,2,3], 0]

1    19
2    24
3    29
Name: Hours_Studied, dtype: int64

In [13]:
students_df.iloc[[1,2,3], 0:4] 

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources
1,19,64,Low,Medium
2,24,98,Medium,Medium
3,29,89,Low,Medium


Finally, it's worth knowing that negative numbers can be used in selection. </br>
Negative positions count backwards from the end of the values. </br>
So, for example, here are the last five rows of the dataset:

In [14]:
students_df.iloc[-5:]

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
6602,25,69,High,Medium,No,7,76,Medium,Yes,1,High,Medium,Public,Positive,2,No,High School,Near,Female,68
6603,23,76,High,Medium,No,8,81,Medium,Yes,3,Low,High,Public,Positive,2,No,High School,Near,Female,69
6604,20,90,Medium,Low,Yes,6,65,Low,Yes,3,Low,Medium,Public,Negative,2,No,Postgraduate,Near,Female,68
6605,10,86,High,High,Yes,6,91,High,Yes,2,Low,Medium,Private,Positive,3,No,High School,Far,Female,68
6606,15,67,Medium,Low,Yes,9,94,Medium,Yes,0,Medium,Medium,Public,Positive,4,No,Postgraduate,Near,Male,64


### B. Label Based Selection 

The second paradigm for attribute selection is the one followed by the `loc` operator: label-based selection. </br>
In this paradigm, it's the data index value, not its position, which matters.

For example, to get the first entry of the `Hours_Studied` column in `students_df`, we would now do the following:

In [15]:
students_df.loc[0, 'Hours_Studied']

np.int64(23)

`iloc` is conceptually simpler than `loc` because it ignores the dataset's indices.</br>
When we use `iloc` we treat the dataset like a big matrix (a list of lists), one that we have to index into by position.</br>
`loc`, by contrast, uses the information in the indices to do its work. </br>
Since your dataset usually has meaningful indices, it's usually easier to do things using `loc` instead.

For example, here's one operation that's much easier using `loc`:

In [16]:
students_df.loc[:, ['Hours_Studied', 'Attendance', 'Access_to_Resources','Peer_Influence','Previous_Scores','Exam_Score']]

Unnamed: 0,Hours_Studied,Attendance,Access_to_Resources,Peer_Influence,Previous_Scores,Exam_Score
0,23,84,High,Positive,73,67
1,19,64,Medium,Negative,59,61
2,24,98,Medium,Neutral,91,74
3,29,89,Medium,Negative,98,71
4,19,92,Medium,Neutral,65,70
...,...,...,...,...,...,...
6602,25,69,Medium,Positive,76,68
6603,23,76,Medium,Positive,81,69
6604,20,90,Low,Negative,65,68
6605,10,86,High,Positive,91,68
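The position-versus-label distinction is easiest to see when the index is not the default 0, 1, 2, .... The following is a minimal sketch on toy data; the index labels 10, 20, 30 are assumptions chosen only to make the two accessors disagree.

```python
import pandas as pd

# Toy data with a non-default integer index (labels chosen for illustration).
df = pd.DataFrame(
    {"Hours_Studied": [23, 19, 24], "Attendance": [84, 64, 98]},
    index=[10, 20, 30],
)

first_by_position = df.iloc[0, 0]             # row at position 0, column at position 0
first_by_label = df.loc[10, "Hours_Studied"]  # row labelled 10, column by name

# Label 0 does not exist here, so loc raises KeyError where iloc succeeds.
try:
    df.loc[0, "Hours_Studied"]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(first_by_position, first_by_label, label_zero_found)
```

With the default index of `students_df`, positions and labels happen to coincide, which is why `iloc[0]` and `loc[0]` return the same row there.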


#### Choosing between loc and iloc

When choosing or transitioning between `loc` and `iloc`, there is one "gotcha" worth keeping in mind, </br>
which is that the two methods use slightly different indexing schemes.</br>

`iloc` uses the Python stdlib (Standard library) indexing scheme, where the first element of the range is included and the last one excluded. So `0:10` will select entries 0,...,9. 

`loc`, meanwhile, indexes inclusively. So `0:10` will select entries 0,...,10.

Why the change? Remember that `loc` can index any stdlib type: strings, for example. </br>
If we have a DataFrame with index values Apples, ..., Potatoes, ..., and </br>
we want to select "all the alphabetical fruit choices between Apples and Potatoes", </br>
then it's a lot more convenient to index `df.loc['Apples':'Potatoes']` than </br>
it is to index something like `df.loc['Apples':'Potatoet']` (t coming after s in the alphabet).

This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000.</br>
In this case `df.iloc[0:1000]` will return 1000 entries, while `df.loc[0:1000]` returns 1001 of them! </br>
To get 1000 elements using `loc`, you will need to go one lower and ask for `df.loc[0:999]`.

Otherwise, the semantics of using `loc` are the same as those for `iloc`.
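The slicing "gotcha" can be checked directly on small frames. This is a sketch with made-up data: a ten-row numeric frame and a four-row frame with string labels.

```python
import pandas as pd

# Default integer index 0..9.
numbers = pd.DataFrame({"value": range(10)})

by_position = numbers.iloc[0:5]  # positions 0,1,2,3,4 -> end excluded
by_label = numbers.loc[0:5]      # labels 0,1,2,3,4,5 -> end included

# A sorted string index (toy labels) behaves the same way under loc.
produce = pd.DataFrame(
    {"price": [1.0, 0.5, 2.0, 0.8]},
    index=["Apples", "Bananas", "Potatoes", "Radishes"],
)
up_to_potatoes = produce.loc["Apples":"Potatoes"]

print(len(by_position), len(by_label), list(up_to_potatoes.index))
```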

#### 1. .loc (Label-based indexing)
- loc is used to select data by label or boolean condition.
- It includes the start and end labels in slicing.
- It allows you to access rows and columns by their actual index or column names.

#### 2. .iloc (Integer-based indexing)
- iloc is used to select data by position (integer index).
- It uses integer positions starting from 0 (like Python lists).
- It excludes the end index in slicing (like Python).

## Manipulating The Index

`Label-based selection` derives its power from the labels in the index. Critically, </br>
the index we use is not immutable. We can manipulate the index in any way we see fit.

The set_index() method can be used to do the job. Here is what happens when we `set_index()` to the `Attendance` column/field:

In [17]:
students_df.set_index('Attendance')

Attendance,Hours_Studied,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
84,23,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
64,19,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61
98,24,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
89,29,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
92,19,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69,25,High,Medium,No,7,76,Medium,Yes,1,High,Medium,Public,Positive,2,No,High School,Near,Female,68
76,23,High,Medium,No,8,81,Medium,Yes,3,Low,High,Public,Positive,2,No,High School,Near,Female,69
90,20,Medium,Low,Yes,6,65,Low,Yes,3,Low,Medium,Public,Negative,2,No,Postgraduate,Near,Female,68
86,10,High,High,Yes,6,91,High,Yes,2,Low,Medium,Private,Positive,3,No,High School,Far,Female,68


This is useful if you can come up with an index for the dataset which is better than the current one.
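A short sketch of how a more meaningful index might be used. `Student_ID` is a hypothetical column that does not exist in the CSV; it stands in for any column with unique, meaningful values.

```python
import pandas as pd

# Toy rows; Student_ID is a hypothetical column, not part of the CSV above.
df = pd.DataFrame({
    "Student_ID": ["S1", "S2", "S3"],
    "Attendance": [84, 64, 98],
    "Exam_Score": [67, 61, 74],
})

# set_index returns a new DataFrame by default; df itself keeps its old index.
by_id = df.set_index("Student_ID")

score = by_id.loc["S2", "Exam_Score"]  # label lookup on the new index

# reset_index moves the index back into an ordinary column.
restored = by_id.reset_index()

print(score, list(df.index), list(restored.columns))
```

Note that the call in `In [17]` is not assigned to anything, so `students_df` itself keeps its original index.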

## 3) Selecting

### Conditional Based Selection / Filtering

So far we've been indexing various strides of data, using structural properties of the `DataFrame` itself. </br>
To do interesting things with the data, however, we often need to ask questions based on conditions.</br>

For example, suppose that we're interested specifically in the `Exam_Score` of students who scored more than 70 and studied 20 hours or more.</br>

We can start by checking, for each student, whether `Hours_Studied` is greater than or equal to 20:

In [18]:
students_df['Hours_Studied'] >= 20

0        True
1       False
2        True
3        True
4       False
        ...  
6602     True
6603     True
6604     True
6605    False
6606    False
Name: Hours_Studied, Length: 6607, dtype: bool

This operation produced a Series of True/False booleans based on the `Hours_Studied` of each record. </br> This result can then be used inside of `loc` to select the relevant data:

In [19]:
students_df.loc[students_df['Hours_Studied'] >= 20]

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
6,29,84,Medium,Low,Yes,7,68,Low,Yes,1,Low,Medium,Private,Neutral,2,No,High School,Moderate,Male,67
7,25,78,Low,High,Yes,6,50,Medium,Yes,1,High,High,Public,Negative,2,No,High School,Far,Male,66
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6599,30,94,Medium,Low,No,5,52,Low,No,3,High,Medium,Private,Neutral,2,No,Postgraduate,Moderate,Female,70
6601,20,83,Medium,Low,No,6,51,Low,Yes,2,Medium,Medium,Public,Neutral,4,No,High School,Moderate,Female,65
6602,25,69,High,Medium,No,7,76,Medium,Yes,1,High,Medium,Public,Positive,2,No,High School,Near,Female,68
6603,23,76,High,Medium,No,8,81,Medium,Yes,3,Low,High,Public,Positive,2,No,High School,Near,Female,69


This DataFrame has 3528 rows, while the original had 6607. That means that around 53% of students studied 20 or more hours.
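Row counts and percentages like these do not have to be read off the output by hand. Because `True` counts as 1, a boolean mask can be summed or averaged. The sketch below uses five made-up values rather than the real dataset.

```python
import pandas as pd

# Toy values, not the real dataset.
df = pd.DataFrame({"Hours_Studied": [23, 19, 24, 29, 19]})

mask = df["Hours_Studied"] >= 20

n_matching = int(mask.sum())   # True counts as 1
share = float(mask.mean())     # fraction of rows where the mask is True

print(n_matching, round(share * 100, 1))
```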

We also wanted to know the `Exam_Score` of students who have scored more than 70 and studied 20 or more hours.

We can use the `ampersand (&)` to bring the two questions together:

In [20]:
students_df.loc[(students_df['Hours_Studied'] >= 20) & (students_df['Exam_Score'] > 70)]

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
9,23,98,Medium,Medium,Yes,8,71,Medium,Yes,0,High,High,Public,Positive,5,No,High School,Moderate,Male,72
27,22,83,High,High,Yes,6,94,Medium,Yes,0,High,Medium,Public,Neutral,2,No,College,Moderate,Male,71
55,26,88,Medium,High,Yes,5,79,Medium,Yes,1,Low,High,Private,Positive,4,No,College,Moderate,Male,72
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6565,24,89,Medium,Low,No,4,99,High,Yes,1,High,High,Private,Positive,3,No,Postgraduate,Near,Male,73
6566,29,96,High,Medium,No,8,85,Low,Yes,3,High,Medium,Public,Neutral,3,No,Postgraduate,Near,Male,76
6572,33,95,Low,Low,Yes,6,77,Low,Yes,2,High,High,Private,Positive,3,No,High School,Moderate,Female,73
6592,29,100,Medium,Low,Yes,8,100,Low,Yes,0,Low,Medium,Private,Neutral,5,Yes,High School,Near,Male,72


This `DataFrame` has only **900 rows**, whereas **3528** students studied **20 hours** or more and the original DataFrame had **6607** rows.

That means:

- `Only around 25% (900 of the 3528 students who studied 20 hours or more) were able to score more than 70.`
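Two details of the `&` syntax are easy to trip over: each comparison needs its own parentheses, and Python's `and` keyword cannot be used in its place. A minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "Hours_Studied": [23, 19, 24, 29, 19],
    "Exam_Score": [67, 61, 74, 71, 70],
})

# Each comparison needs its own parentheses because & binds more tightly
# than >= and >.
both = df.loc[(df["Hours_Studied"] >= 20) & (df["Exam_Score"] > 70)]

# Python's `and` asks for a single truth value of a whole Series,
# which pandas refuses with a ValueError.
try:
    df.loc[(df["Hours_Studied"] >= 20) and (df["Exam_Score"] > 70)]
    and_worked = True
except ValueError:
    and_worked = False

print(list(both.index), and_worked)
```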


In [31]:
students_df.loc[students_df['Exam_Score']>70]

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,...,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score,Assignment
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,...,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74,Yes
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,...,Medium,Public,Negative,4,No,High School,Moderate,Male,71,Yes
5,19,88,Medium,Medium,Yes,8,89,Medium,Yes,3,...,Medium,Public,Positive,3,No,Postgraduate,Near,Male,71,Yes
9,23,98,Medium,Medium,Yes,8,71,Medium,Yes,0,...,High,Public,Positive,5,No,High School,Moderate,Male,72,Yes
11,17,97,Medium,High,Yes,6,87,Low,Yes,2,...,High,Private,Neutral,2,No,High School,Near,Male,71,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6565,24,89,Medium,Low,No,4,99,High,Yes,1,...,High,Private,Positive,3,No,Postgraduate,Near,Male,73,Yes
6566,29,96,High,Medium,No,8,85,Low,Yes,3,...,Medium,Public,Neutral,3,No,Postgraduate,Near,Male,76,Yes
6572,33,95,Low,Low,Yes,6,77,Low,Yes,2,...,High,Private,Positive,3,No,High School,Moderate,Female,73,Yes
6592,29,100,Medium,Low,Yes,8,100,Low,Yes,0,...,Medium,Private,Neutral,5,Yes,High School,Near,Male,72,Yes


- `Only around 16% of students (1083 out of 6607) were able to score more than 70, regardless of how many hours they studied.`

In [22]:
students_df.loc[(students_df['Hours_Studied'] < 20) & (students_df['Exam_Score'] > 70)]

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
5,19,88,Medium,Medium,Yes,8,89,Medium,Yes,3,Medium,Medium,Public,Positive,3,No,Postgraduate,Near,Male,71
11,17,97,Medium,High,Yes,6,87,Low,Yes,2,Low,High,Private,Neutral,2,No,High School,Near,Male,71
21,19,99,Medium,High,No,6,84,Medium,Yes,1,Medium,High,Public,Neutral,3,No,High School,Near,Male,72
57,18,94,Medium,Medium,Yes,6,87,High,Yes,1,Medium,Medium,Public,Positive,4,No,High School,Near,Male,71
88,17,86,Medium,High,No,5,97,High,Yes,3,High,Medium,Public,Neutral,3,No,College,Near,Male,71
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6411,19,92,High,Medium,No,6,70,High,Yes,2,Medium,Medium,Private,Negative,3,No,,Near,Female,71
6487,12,97,High,Low,Yes,6,96,Medium,Yes,2,Low,High,Public,Positive,2,No,Postgraduate,Near,Female,71
6499,18,97,High,High,Yes,6,76,Medium,Yes,0,Medium,High,Public,Neutral,5,No,High School,Moderate,Male,72
6522,18,90,High,High,Yes,6,54,Low,Yes,1,Medium,High,Public,Negative,3,No,High School,Near,Female,95



- `Only around 3% of students (183 out of 6607) scored more than 70 in the exam despite studying less than 20 hours.`

**Suppose we need to find every student who either scored more than 70 or studied 20 hours or more.
For this we use a pipe (|):**

In [23]:
students_df.loc[(students_df['Hours_Studied'] >= 20) | (students_df['Exam_Score'] > 70)]

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
5,19,88,Medium,Medium,Yes,8,89,Medium,Yes,3,Medium,Medium,Public,Positive,3,No,Postgraduate,Near,Male,71
6,29,84,Medium,Low,Yes,7,68,Low,Yes,1,Low,Medium,Private,Neutral,2,No,High School,Moderate,Male,67
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6599,30,94,Medium,Low,No,5,52,Low,No,3,High,Medium,Private,Neutral,2,No,Postgraduate,Moderate,Female,70
6601,20,83,Medium,Low,No,6,51,Low,Yes,2,Medium,Medium,Public,Neutral,4,No,High School,Moderate,Female,65
6602,25,69,High,Medium,No,7,76,Medium,Yes,1,High,Medium,Public,Positive,2,No,High School,Near,Female,68
6603,23,76,High,Medium,No,8,81,Medium,Yes,3,Low,High,Public,Positive,2,No,High School,Near,Female,69


- `Out of 6607 students, 3711 (the 3528 who studied 20 hours or more plus the 183 who scored more than 70 with fewer hours) studied 20 hours or more or scored more than 70 in the exam.`

- `Around 56% of students either studied 20 hours or more or scored more than 70 in the exam.`
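The size of an "or" selection can be cross-checked against the individual counts: rows matching both conditions must be counted only once. The sketch below verifies this identity on six made-up rows.

```python
import pandas as pd

df = pd.DataFrame({
    "Hours_Studied": [23, 19, 24, 29, 19, 17],
    "Exam_Score": [67, 61, 74, 71, 70, 71],
})

studied = df["Hours_Studied"] >= 20
scored = df["Exam_Score"] > 70

n_studied = int(studied.sum())
n_scored = int(scored.sum())
n_both = int((studied & scored).sum())
n_either = int((studied | scored).sum())

# Rows matching both conditions are counted once in the union.
print(n_studied, n_scored, n_both, n_either)
```

With the counts reported earlier (3528, 1083 and 900), the same identity gives 3528 + 1083 - 900 = 3711.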

`Pandas` comes with a `few built-in conditional selectors`, two of which we will highlight here.

#### isin()
The first is `isin`. `isin` lets you select data whose value "is in" a list of values. </br>
For example, here's how we can use it to select only the rows whose `Parental_Involvement` is `Low` or `Medium`:

In [24]:
students_df.loc[students_df.Parental_Involvement.isin(['Low','Medium'])]

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6599,30,94,Medium,Low,No,5,52,Low,No,3,High,Medium,Private,Neutral,2,No,Postgraduate,Moderate,Female,70
6600,12,98,Medium,Low,Yes,4,54,Medium,Yes,2,Medium,High,Private,Neutral,3,No,High School,Near,Female,67
6601,20,83,Medium,Low,No,6,51,Low,Yes,2,Medium,Medium,Public,Neutral,4,No,High School,Moderate,Female,65
6604,20,90,Medium,Low,Yes,6,65,Low,Yes,3,Low,Medium,Public,Negative,2,No,Postgraduate,Near,Female,68


#### isnull() and notnull()
The second is `isnull` (and its companion `notnull`). </br>
These methods let you highlight values which are (or are not) empty (NaN).</br>
For example, to filter out rows that lack a `Sleep_Hours` value, here's what we would do:

In [25]:
students_df.loc[students_df.Sleep_Hours.notnull()]

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6602,25,69,High,Medium,No,7,76,Medium,Yes,1,High,Medium,Public,Positive,2,No,High School,Near,Female,68
6603,23,76,High,Medium,No,8,81,Medium,Yes,3,Low,High,Public,Positive,2,No,High School,Near,Female,69
6604,20,90,Medium,Low,Yes,6,65,Low,Yes,3,Low,Medium,Public,Negative,2,No,Postgraduate,Near,Female,68
6605,10,86,High,High,Yes,6,91,High,Yes,2,Low,Medium,Private,Positive,3,No,High School,Far,Female,68


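No rows appear to be dropped here, which suggests that `Sleep_Hours` has no null (NaN) values. Other columns can still contain missing values: row 6411 in the output of `In [22]` has an empty `Parental_Education_Level`.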
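One way to check every column for missing values at once, rather than one column at a time, is to count the nulls per column. This is a sketch on a toy frame with two deliberately missing entries, not on the real CSV.

```python
import pandas as pd

# Toy frame with two deliberately missing values.
df = pd.DataFrame({
    "Sleep_Hours": [7, 8, 7, 6],
    "Parental_Education_Level": ["High School", None, "College", None],
})

missing_per_column = df.isnull().sum()   # number of missing values in each column
complete_rows = df.loc[df["Parental_Education_Level"].notnull()]

print(missing_per_column.to_dict(), len(complete_rows))
```

Running `students_df.isnull().sum()` on the real data would show which columns, if any, contain missing values.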

In [26]:
students_df.columns

Index(['Hours_Studied', 'Attendance', 'Parental_Involvement',
       'Access_to_Resources', 'Extracurricular_Activities', 'Sleep_Hours',
       'Previous_Scores', 'Motivation_Level', 'Internet_Access',
       'Tutoring_Sessions', 'Family_Income', 'Teacher_Quality', 'School_Type',
       'Peer_Influence', 'Physical_Activity', 'Learning_Disabilities',
       'Parental_Education_Level', 'Distance_from_Home', 'Gender',
       'Exam_Score'],
      dtype='object')

### Assigning Data

In [27]:
# Create a column named 'Assignment' and assign the value 'Yes' to every row.
students_df['Assignment'] = 'Yes'
students_df.head()

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,...,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score,Assignment
0,23,84,Low,High,No,7,73,Low,Yes,0,...,Medium,Public,Positive,3,No,High School,Near,Male,67,Yes
1,19,64,Low,Medium,No,8,59,Low,Yes,2,...,Medium,Public,Negative,4,No,College,Moderate,Female,61,Yes
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,...,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74,Yes
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,...,Medium,Public,Negative,4,No,High School,Moderate,Male,71,Yes
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,...,High,Public,Neutral,4,No,College,Near,Female,70,Yes
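Besides a single constant, a column can also be assigned one value per row, or overwritten only where a condition holds. The sketch below uses toy data; the `Row_Number` column and the rule that sets `Assignment` to `'No'` are hypothetical and serve only to show the syntax.

```python
import pandas as pd

df = pd.DataFrame({
    "Hours_Studied": [23, 19, 24],
    "Exam_Score": [67, 61, 74],
})

# A scalar is broadcast to every row.
df["Assignment"] = "Yes"

# An iterable of matching length assigns one value per row.
df["Row_Number"] = range(1, len(df) + 1)

# loc with a boolean mask overwrites only the matching rows.
# The 'No' rule here is purely illustrative.
df.loc[df["Hours_Studied"] < 20, "Assignment"] = "No"

print(df["Assignment"].tolist(), df["Row_Number"].tolist())
```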


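The original Kaggle exercise reads: create a DataFrame `top_oceania_wines` containing all reviews with at least 95 points (out of 100) for wines from Australia or New Zealand. The cell below applies the same pattern to the students dataset.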

In [32]:
# Question From Kaggle

# Filter students_df for students with an Exam_Score above 70 whose Parental_Education_Level is College or Postgraduate.

highScore_Student_with_Educated_parents = students_df[
    (students_df['Exam_Score'] > 70) & 
    (students_df['Parental_Education_Level'].isin(['College', 'Postgraduate']))
]

highScore_Student_with_Educated_parents


Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,...,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score,Assignment
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,...,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74,Yes
5,19,88,Medium,Medium,Yes,8,89,Medium,Yes,3,...,Medium,Public,Positive,3,No,Postgraduate,Near,Male,71,Yes
27,22,83,High,High,Yes,6,94,Medium,Yes,0,...,Medium,Public,Neutral,2,No,College,Moderate,Male,71,Yes
55,26,88,Medium,High,Yes,5,79,Medium,Yes,1,...,High,Private,Positive,4,No,College,Moderate,Male,72,Yes
64,25,98,Low,High,Yes,8,56,Medium,No,0,...,High,Private,Neutral,3,No,Postgraduate,Near,Male,71,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6516,30,78,Medium,High,Yes,8,84,Medium,Yes,1,...,Low,Private,Negative,1,No,College,Near,Male,71,Yes
6549,23,90,Medium,Medium,No,7,89,Low,Yes,2,...,Medium,Public,Positive,4,No,College,Moderate,Female,71,Yes
6565,24,89,Medium,Low,No,4,99,High,Yes,1,...,High,Private,Positive,3,No,Postgraduate,Near,Male,73,Yes
6566,29,96,High,Medium,No,8,85,Low,Yes,3,...,Medium,Public,Neutral,3,No,Postgraduate,Near,Male,76,Yes


- Out of 6607 students, only 622 (around 9%) scored more than 70 in the exam and have a parent educated to college or postgraduate level.
- Among the 1083 students who scored more than 70 in the exam, 622 (around 57%) have a parent educated to college or postgraduate level.
