**Table of contents**<a id='toc0_'></a>    
- [Cleaning Data](#toc1_)    
  - [Null values](#toc1_1_)    
    - [What are the different scenarios of missing data?](#toc1_1_1_)    
      - [MCAR (Missing Completely at Random)](#toc1_1_1_1_)    
      - [MAR (Missing at Random)](#toc1_1_1_2_)    
      - [Missing Not At Random (MNAR)](#toc1_1_1_3_)    
    - [Why are null values relevant?](#toc1_1_2_)    
    - [Cleaning Null Values](#toc1_1_3_)    
    - [Checking for Null Values](#toc1_1_4_)    
    - [Dropping Null Values](#toc1_1_5_)    
    - [Filling Null Values](#toc1_1_6_)    
    - [💡 Check for understanding](#toc1_1_7_)    
  - [Dealing with Duplicates](#toc1_2_)    
    - [Identifying Duplicates](#toc1_2_1_)    
    - [Removing Duplicates](#toc1_2_2_)    
    - [Removing Duplicates Based on Specific Columns](#toc1_2_3_)    
    - [Resetting the Index](#toc1_2_4_)    
  - [Formatting Data (Recap)](#toc1_3_)    
    - [Formatting Numeric Values (Recap)](#toc1_3_1_)    
    - [Formatting Strings (Recap)](#toc1_3_2_)    
    - [Formatting Dates](#toc1_3_3_)
  - [Changing column datatypes](#toc1_4_)    
  - [Cleaning Column Names](#toc1_5_)    
- [Using `apply()`, `map()`, and `applymap()`](#toc2_)    
    - [More examples](#toc2_1_1_)    
      - [Comparing Map and Apply](#toc2_1_1_1_)    
      - [Calculating the length of the name](#toc2_1_1_2_)    
      - [Converting to float some columns with applymap()](#toc2_1_1_3_)    
      - [Modifying columns names with apply()](#toc2_1_1_4_)    
    - [💡 Check for understanding](#toc2_1_2_)    
    - [💡 Check for understanding](#toc2_1_3_)    
- [Filtering Data](#toc3_)    
    - [Creating a condition](#toc3_1_1_)    
    - [Filtering df](#toc3_1_2_)    
    - [Using multiple conditions](#toc3_1_3_)    
- [More Data Manipulation](#toc4_)    
  - [Setting the index](#toc4_1_)    
  - [Adding/removing rows and/or columns](#toc4_2_)    
  - [💡 Check for understanding](#toc4_3_)    
- [Summary](#toc5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Cleaning Data](#toc0_)

## <a id='toc1_1_'></a>[Null values](#toc0_)

Null values (also known as missing values) are common in datasets and can hinder data analysis and modeling. It is essential to handle null values appropriately to ensure accurate and reliable results. Pandas provides various methods to clean and handle null values in datasets.

In Python, `None` is a special constant that represents the absence of a value. It is commonly used to indicate that a variable or function has no value or hasn't been assigned any value. For example, if a function does not explicitly return a value, it implicitly returns `None`.

On the other hand, `NaN` stands for "Not a Number" and is a special value used to represent missing or undefined numerical data. `NaN` is part of the floating-point representation and is commonly used in numeric data structures like Pandas DataFrames and Series to indicate missing or invalid numerical values.
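
The difference between `None` and `NaN` can be sketched in a few lines. The Series below are made-up illustrations, not part of the Titanic dataset.

```python
import numpy as np
import pandas as pd

# None is a Python object of its own type; NaN is a float value
print(type(None))
print(type(np.nan))

# NaN is not equal to anything, including itself, so test for it with pd.isna()
print(np.nan == np.nan)   # False
print(pd.isna(np.nan))    # True
print(pd.isna(None))      # True

# In a numeric Series, None is converted to NaN and the dtype becomes float64
numeric = pd.Series([1, None, 3])
print(numeric.dtype)      # float64

# isna() also flags None in a Series of strings (hypothetical values)
text = pd.Series(['a', None, 'c'])
print(text.isna().tolist())   # [False, True, False]
```

In practice this means `pd.isna()` (or the `isna()` method) is the safe way to detect either kind of missing value.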

### <a id='toc1_1_1_'></a>[What are the different scenarios of missing data?](#toc0_)

#### <a id='toc1_1_1_1_'></a>[MCAR (Missing Completely at Random)](#toc0_)

Data is MCAR when it is missing for reasons that have nothing to do with the information being recorded: the missingness has no connection to any value in the dataset, whether observed or unobserved. This typically happens when there's **a glitch in the system collecting the data** rather than a problem with the data itself, so there's no way to predict where values will be missing.

For example:
- **Lost Surveys**: Say a surveyor accidentally dropped completed questionnaires into a lake. The fact that there's information missing from these specific questionnaires doesn’t depend on who filled them out or what they contained. The loss of data is completely random.  

- **Sensor Failure in a Weather Station**: A weather station collects temperature data hourly, but the sensor occasionally fails for no particular reason. Because these failures don’t depend on the time of day, season, or temperature, the missing values occur randomly due to a technical issue. However, if the failures depended on **another observed factor**, such as rainfall or time of day, the data would no longer be MCAR but **MAR**. If they depended on the temperature itself (the very value that goes unrecorded), the data would be **MNAR**.  

#### <a id='toc1_1_1_2_'></a>[MAR (Missing at Random)](#toc0_)

> Despite its name, MAR occurs when the absence of data is not random. The probability of missing data is not equal for all measurements. They’re more likely for some observations than others. However, measurements of observed variables predict the unequal probability of missing values occurring. ([Statistics by Jim](https://statisticsbyjim.com/basics/missing-data/))

This means that we can figure out where a value might be missing based on other information in the dataset.

For example: 

- **Medical Study Missing Data on a Drug’s Side Effects**: In a medical study, older participants might be more likely to drop out or miss follow-up appointments. As a result, we may be missing some follow-up data on side effects for older participants. The missing data is linked to an observable factor (age), but it’s not related to the unobserved side effect outcomes themselves.

-  **Student Performance Data**: In a school, students may be more likely to skip reporting their test scores in optional evaluations. For instance, students with lower attendance rates might be more likely to miss submitting their scores. The missing test scores are related to an observed factor (attendance rate), not the missing test scores themselves. 

- **Workplace Wellness Programs and Job Satisfaction**: Employees who are part-time might be less likely to participate in wellness programs, leading to missing data on wellness program engagement for part-time employees. The missing engagement data is related to observed information (employment status: part-time vs. full-time) rather than how much the part-time employees actually benefit from the program.

#### <a id='toc1_1_1_3_'></a>[Missing Not At Random (MNAR)](#toc0_)

MNAR data means that the reason a value is missing is related to the missing value itself. The observed columns alone cannot predict where values will be missing, because the information needed to do so is exactly what was never recorded.

For example:
- **Sensitive Information in Surveys**: Suppose a survey asks people about their income, and higher-income individuals are more likely to skip the question due to privacy concerns. Here, the likelihood of a person not responding is directly related to the value of their income (higher income = more likely to skip). This is MNAR because the missing data (income) is missing specifically because of the value it would have been, not because of any other observed variable like age or occupation, although the latter could also be correlated with a high-income.    

- **Health Studies and Sensitive Symptoms**: In a medical study, participants with more severe symptoms might skip reporting certain symptoms or drop out of the study because they’re uncomfortable sharing the extent of their condition. This missing data would be MNAR since people are missing from the dataset specifically because of their symptom severity, which we can’t observe directly once they’ve dropped out.   

- **Employee Feedback and Job Satisfaction**: If a company survey asks employees about job satisfaction, those who are very dissatisfied may skip questions or avoid taking the survey altogether to avoid potential consequences. This situation would be MNAR because the likelihood of missing data (not answering the job satisfaction questions) is directly related to the unobserved values (job dissatisfaction). However, it could be linked to other factors, such as compensation.  

- **Credit Scores and Loan Applications**: In a financial study, people with very low credit scores might be less likely to report their scores or apply for loans, creating a gap in data. Since the missing data (credit scores) is related to the unobserved values (the actual low scores), this scenario would be classified as MNAR.

### <a id='toc1_1_2_'></a>[Why are null values relevant?](#toc0_)

- **Biased Analysis**: Except for the MCAR case, missing values can bias the conclusion we draw from data, mostly by ignoring a subset of the population that is potentially very different from the population whose data we collected.

- **Lower Statistical Power / Precision**: Having missing data reduces the sample size of our dataset, which in turn reduces the precision and power of statistical tests used for association and hypothesis testing. 
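
To make the bias concrete, here is a small simulation sketch. The incomes are randomly generated (not real data), and the 30% missing rate and the "top earners skip the question" rule are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical population: 10,000 simulated incomes
income = pd.Series(rng.normal(loc=50_000, scale=15_000, size=10_000))

# MCAR: every value has the same 30% chance of being missing
mcar_mask = rng.random(income.size) < 0.30
income_mcar = income.mask(mcar_mask)

# MNAR: the highest 30% of incomes are the ones that go missing
mnar_mask = income > income.quantile(0.70)
income_mnar = income.mask(mnar_mask)

true_mean = income.mean()
mcar_mean = income_mcar.mean()   # mean() skips NaN by default
mnar_mean = income_mnar.mean()

print(f"True mean: {true_mean:,.0f}")
print(f"MCAR mean: {mcar_mean:,.0f}")
print(f"MNAR mean: {mnar_mean:,.0f}")
```

Under MCAR the observed mean stays close to the true mean and only the sample size shrinks. Under MNAR the observed mean is pulled well below the true mean, which is the bias described above.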



### <a id='toc1_1_3_'></a>[Cleaning Null Values](#toc0_)

1. Checking for Null Values:
   - Use `isnull()` method to check for null values in a DataFrame or Series.
   - Use `notnull()` method to check for non-null values in a DataFrame or Series.

2. Dropping Null Values:  
   _Typically useful for MCAR but can be harmful for MAR since it will likely bias the overall analysis._
   - Use `dropna()` method to remove rows with null values from a DataFrame.
   - Use `dropna(axis=1)` to remove columns with null values.

3. Filling Null Values:
   - Use `fillna(value)` method to replace null values with a specific value.
   - Use `ffill()` to forward-fill null values with the previous non-null value (`fillna(method='ffill')` is deprecated since pandas 2.1).
   - Use `bfill()` to backward-fill null values with the next non-null value (`fillna(method='bfill')` is deprecated since pandas 2.1).

### <a id='toc1_1_4_'></a>[Checking for Null Values](#toc0_)

In [1]:
import pandas as pd

# Load Titanic dataset from an online source
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/titanic_train.csv'
df = pd.read_csv(url)

In [None]:
# Review dataframe and df columns
print(df.columns)
display(df.head())

In [None]:
# Checking for Null Values
df.isnull()  # Returns a DataFrame with True where values are null
# isnull is an alias of isna

When working with large datasets, using `isna()` or `isnull()` along with `any()`, `all()` and `sum()` in Pandas becomes essential for quick and efficient data quality assessment.

In [None]:
# Check which cols have any null values
df.isna().any()

In [None]:
# Check if any column has only null values
df.isna().all()

By default, `sum()` adds up the values in each column, counting `True` as 1 and `False` as 0.

In [None]:
# Count the number of null values in each column
df.isna().sum()

If we add the parameter `axis=1` with the `sum()` function, we can calculate the sum of each row (along the columns) of the DataFrame `df`. This results in a Series that contains the count of null values in each row.

In [None]:
# Get null values per row
df.isna().sum(axis=1)

In [None]:
# Get % of null values per row
df.isna().sum(axis=1) * 100 / df.shape[1]

In [None]:
# Sort descending
(df.isna().sum(axis=1) * 100 / df.shape[1]).sort_values(ascending=False)

In [None]:
# What about the null values in each column?
round(df.isna().sum() * 100 / df.shape[0], 2)

### <a id='toc1_1_5_'></a>[Dropping Null Values](#toc0_)

In [None]:
# Dropping rows with any Null Values
df.dropna() 

How many rows did we remove from our dataframe?

In [12]:
# Check rows removed: rows in the original minus rows left after dropna()
df.shape[0] - df.dropna().shape[0]

However, as we can see below, the original DataFrame is unchanged: `dropna()` returns a new DataFrame and leaves `df` as it was. To keep the change, either pass `inplace=True` (`df.dropna(inplace=True)`) or assign the result back to a variable, for example `df = df.dropna()`.

In [None]:
# Check original dataframe
df

In [None]:
# Dropping columns with  Null Values
df.dropna(axis=1)

In the `dropna()` method of Pandas DataFrame, the `subset`, `how`, and `thresh` parameters are used to control the behavior of dropping rows or columns containing NaN (null) values, when we don't want to drop them just because they have *one* null value:

- `subset`: It allows you to specify a subset of columns on which to apply the `dropna()` operation. Only the rows containing NaN values in the specified subset of columns will be dropped.

In [None]:
df.tail()

In [None]:
# Drop cabin nulls
df.dropna(subset=['Cabin']).tail()

- `how`: It specifies the condition for dropping rows. It can take the values `'any'` (the default), which drops rows containing at least one NaN in the `subset`, or `'all'`, which drops only rows where every value in the `subset` is NaN.

In [None]:
# Drop rows only if ALL values are null
df.dropna(how='all').tail()

- `thresh`: It sets a minimum threshold for the number of non-null values that a row must have in the `subset` in order to be kept. Rows with fewer non-null values than the specified threshold will be dropped.

In [None]:
# Test different thresh values
df.dropna(thresh=3).tail()
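
The effect of `subset`, `how`, and `thresh` is easier to see on a tiny DataFrame. The one below is a made-up toy example (the name `toy` and its values are assumptions), built so that each option keeps a different set of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: non-null counts per row are 2, 0, 1, 3, 1
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0, np.nan],
    'b': [np.nan, np.nan, np.nan, 4.0, 5.0],
    'c': ['x', np.nan, 'z', 'w', np.nan],
})

# Default: drop every row with at least one NaN
print(toy.dropna().index.tolist())               # [3]

# subset: only look at column 'c' when deciding
print(toy.dropna(subset=['c']).index.tolist())   # [0, 2, 3]

# how='all': drop only rows where every value is NaN
print(toy.dropna(how='all').index.tolist())      # [0, 2, 3, 4]

# thresh=2: keep rows with at least 2 non-null values
print(toy.dropna(thresh=2).index.tolist())       # [0, 3]
```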

### <a id='toc1_1_6_'></a>[Filling Null Values](#toc0_)

`fillna()` is a Pandas method used to replace NaN (null) values in a DataFrame or Series with specified values.
- You can use `inplace=True` to modify the DataFrame directly.

In [None]:
# Filling Null Values
df.fillna(-1).tail()  # Replaces null values with -1

Be careful when filling with a value of a different data type, since Pandas will change the data type of the whole column. For example:

In [None]:
# Check dtypes
df.dtypes # age is a float

In [None]:
# Fill with "N/A" instead of -1 
df_na = df.fillna("N/A")
df_na.tail()

In [None]:
# Check dtypes
df_na.dtypes # age is not a float anymore, it's an object now

To avoid this, we can choose which column to apply `fillna()` to:

In [None]:
# Fill Cabin only
df.Cabin.fillna("N/A").tail()
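
Another option, sketched here on a small made-up DataFrame (the name `passengers` and its values are assumptions), is to pass `fillna()` a dictionary with one fill value per column. Each column then keeps a sensible data type.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers, mimicking the Age and Cabin columns
passengers = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0],
    'Cabin': ['C85', np.nan, np.nan],
})

# The dict keys are column names, the values are what to fill each column with
filled = passengers.fillna({'Age': passengers['Age'].mean(), 'Cabin': 'N/A'})

print(filled)
print(filled.dtypes)   # Age is still float64
```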

We can also use `mean()`, `median()`, etc. to fill the null values.

In [None]:
# Check Age in last rows
df.tail() 

In [None]:
# Fill with age mean
df.Age.fillna(df.Age.mean()).tail()  # preview the column after filling with the mean

- Two common methods for filling NaN values are `ffill()`, which forward fills using the last valid value, and `bfill()`, which backward fills using the next valid value. (The older spelling `fillna(method='ffill')` is deprecated since pandas 2.1.)

In [None]:
# Forward-fill null values in the Age column
df['Age'].ffill().tail()

In [None]:
# Backward-fill null values in the Age column
df['Age'].bfill().tail()
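
A minimal sketch of how forward and backward filling behave at the edges of a column, using a made-up Series (`readings` is a hypothetical name):

```python
import numpy as np
import pandas as pd

readings = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

forward = readings.ffill()    # carry the last valid value forward
backward = readings.bfill()   # pull the next valid value backward

print(pd.DataFrame({'original': readings, 'ffill': forward, 'bfill': backward}))
```

Note that forward fill cannot fill a gap at the very start (there is no earlier value), and backward fill cannot fill a gap at the very end, so some NaN values may remain.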

### <a id='toc1_1_7_'></a>[💡 Check for understanding](#toc0_)

Consider the following DataFrame containing information about students:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Cathy', None, 'Eva'],
    'Age': [25, 30, None, 22, 28],
    'Gender': ['Female', None, 'Female', 'Male', None],
    'Score': [90, None, 78, None, 85]
}

df_students = pd.DataFrame(data)
```

Your task is to perform the following data cleaning tasks:

1. Check for null values in the DataFrame using `isna()` or `isnull()`.

2. Replace the null values in the 'Age' column with the average age of the students.

3. Replace the null values in the 'Gender' column with "Female".

4. Drop any rows that have null values in the 'Name' column.

5. Forward fill (ffill) the null values in the 'Score' column with the previous valid value.

6. After performing all the cleaning steps, print the cleaned DataFrame.


In [60]:
# Your code goes here

## <a id='toc1_2_'></a>[Dealing with Duplicates](#toc0_)

In data analysis, it's common to encounter duplicate values in datasets. Duplicates can distort our analysis and lead to incorrect conclusions. Fortunately, pandas provides efficient methods to handle duplicates.


### <a id='toc1_2_1_'></a>[Identifying Duplicates](#toc0_)

To identify duplicate rows in a DataFrame, we can use the `duplicated()` method, which returns a boolean Series indicating whether each row is a duplicate or not. We can then use the `sum()` method to count the total number of duplicates.


In [None]:
# Check total # of duplicates
df.duplicated().sum()

In [None]:
# Check if there are any duplicates
df.duplicated().any()

To check for duplicates in specific columns, we can use the `duplicated()` method with the `subset` parameter, or select the column first and then call `duplicated()` on it.


In [None]:
# Check for duplicates in Age column
df.duplicated(subset=['Age']).sum()

In [None]:
df.Age.duplicated().any()
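
By default `duplicated()` only flags the repeats, not the first occurrence. The sketch below uses a made-up `orders` DataFrame (hypothetical names and values) to show the `keep=False` option, which flags every row involved in a duplicate so they can be inspected side by side.

```python
import pandas as pd

# Hypothetical orders: rows 0 and 2 are identical
orders = pd.DataFrame({
    'customer': ['ana', 'ben', 'ana', 'cy'],
    'item': ['pen', 'ink', 'pen', 'pen'],
})

print(orders.duplicated().tolist())                  # [False, False, True, False]
print(orders.duplicated(keep=False).tolist())        # [True, False, True, False]
print(orders.duplicated(subset=['item']).tolist())   # [False, False, True, True]

# Show every row that takes part in a duplicate, not just the repeats
print(orders[orders.duplicated(keep=False)])
```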

### <a id='toc1_2_2_'></a>[Removing Duplicates](#toc0_)

To remove duplicates from a DataFrame, we can use the `drop_duplicates()` method. By default, this method keeps the first occurrence of each duplicated row and removes the rest.


In [65]:
# Remove duplicates and update the DataFrame
df.drop_duplicates(inplace=True) # we know there are none but this is how we would do it

### <a id='toc1_2_3_'></a>[Removing Duplicates Based on Specific Columns](#toc0_)

Sometimes, we may want to remove duplicates based on specific columns. We can pass a subset of column names to the `drop_duplicates()` method to achieve this.


In [None]:
# Remove duplicates based on specific columns
df.drop_duplicates(subset=['Sex', 'Age'])  # let's look at the number of rows if we do this

By default, `drop_duplicates()` keeps the first occurrence of each duplicated row. If we want to keep the last occurrence instead, we can set the `keep` parameter to `'last'`.


In [67]:
# Keep the last occurrence of duplicates
df.drop_duplicates(keep='last', inplace=True) # we know there are none but this is how we would do it

### <a id='toc1_2_4_'></a>[Resetting the Index](#toc0_)

When removing duplicates, the DataFrame index may have gaps due to removed rows. To reset the index after removing duplicates, we can use the `reset_index()` method with the `drop=True` parameter.


In [None]:
# Remove duplicates and reset the index
# drop_duplicates() already returns a new DataFrame, so no copy() is needed first
df_without_duplicates = df.drop_duplicates(subset=['Sex', 'Age'])
df_without_duplicates.tail() # look at the gaps in the index

In [None]:
df_without_duplicates.reset_index(drop=True, inplace=True)
df_without_duplicates.tail()
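
As an alternative sketch, `drop_duplicates()` also accepts `ignore_index=True`, which relabels the result in the same call. The `people` DataFrame below is a made-up example.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 share the same sex and age
people = pd.DataFrame({
    'sex': ['male', 'female', 'male', 'female'],
    'age': [22, 38, 22, 35],
})

with_gaps = people.drop_duplicates(subset=['sex', 'age'])
no_gaps = people.drop_duplicates(subset=['sex', 'age'], ignore_index=True)

print(with_gaps.index.tolist())   # [0, 1, 3]
print(no_gaps.index.tolist())     # [0, 1, 2]
```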

## <a id='toc1_3_'></a>[Formatting Data (Recap)](#toc0_)

### <a id='toc1_3_1_'></a>[Formatting Numeric Values (Recap)](#toc0_)


1. `round()` Method:
   - Rounds numeric values to a specified number of decimal places.

2. `format()` Method:
   - Formats numeric values as strings for better representation.

In [70]:
num = 123456.78910

In [None]:
# Apply round
round(num, 2)

In [None]:
# Format to one decimal
format(num, '.1f')  # '.1f' means one digit after the decimal point

In [None]:
# Apply round to Fare
df['Fare'].round(2)

In [None]:
# We could also use a lambda function:
df['Fare'].apply(lambda x: format(x, '.2f'))

### <a id='toc1_3_2_'></a>[Formatting Strings (Recap)](#toc0_)

We can apply the string functions and methods we learnt about when studying data structures, such as `len()`, `lower()`, `upper()`, `split()`, `replace()`, etc.:

In [78]:
str_example = "This is my second Pandas Lesson"

In [None]:
# string length
len(str_example)

In [None]:
# name length - direct method (raises AttributeError: a Series has no len() method)
df['Name'].len()

In [None]:
# name length - str method
df['Name'].str.len()

In [None]:
# Upper/lowercase
print(str_example.lower())
print(str_example.upper())

In [None]:
# Upper/lowercase
display(df['Ticket'].str.lower())
display(df['Ticket'].str.upper())

### <a id='toc1_3_3_'></a>[Changing string values](#toc0_)

In [None]:
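# regex=False treats 'W./C.' as literal text (otherwise '.' could match any character)
df['Ticket'].str.replace('W./C.', 'wc', regex=False)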
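
Whether the pattern is read as literal text or as a regular expression changes the result. A minimal sketch with made-up ticket strings (`'WA/CX 123'` is a hypothetical value chosen to show the difference):

```python
import pandas as pd

tickets = pd.Series(['W./C. 6607', 'WA/CX 123', 'PC 17599'])

# Literal match: only the exact text 'W./C.' is replaced
literal = tickets.str.replace('W./C.', 'wc', regex=False)
print(literal.tolist())   # ['wc 6607', 'WA/CX 123', 'PC 17599']

# Regex match: each '.' matches any character, so 'WA/CX' matches too
pattern = tickets.str.replace('W./C.', 'wc', regex=True)
print(pattern.tolist())   # ['wc 6607', 'wc 123', 'PC 17599']
```

Passing `regex` explicitly avoids surprises, since its default value has changed between pandas versions.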

### <a id='toc1_3_4_'></a>[Formatting Dates](#toc0_)

We will study this in another Notebook.

## <a id='toc1_4_'></a>[Changing data types](#toc1_4_)

In [None]:
df.dtypes

In [None]:
df.info()

### <a id='toc1_4_2_'></a>[Convert columns datatypes](#toc1_4_2_)

In [None]:
# Converting passenger id
df['PassengerId'].astype(float)

In [None]:
# Convert to boolean
df['Survived'].astype(bool)

In [56]:
# First, convert Age to string
df['Age'] = df['Age'].astype(str)


In [None]:
# This raises a ValueError: strings such as '22.0' or 'nan' cannot be cast directly to int
df['Age'].astype(int)

In [None]:
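# When not sure whether a column should be float or integer, pd.to_numeric is a safe choice
# Assign the result back so that Age is numeric again for the rest of the notebook
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df['Age']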
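
The sketch below shows what `errors='coerce'` does on a made-up Series of strings (`raw_ages` is a hypothetical name). It also shows, as an optional extra, the nullable `'Int64'` dtype, which can hold whole numbers together with missing values.

```python
import pandas as pd

raw_ages = pd.Series(['22.0', '38.0', 'nan', 'unknown'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
ages = pd.to_numeric(raw_ages, errors='coerce')
print(ages.dtype)           # float64
print(ages.isna().sum())    # 2

# The nullable 'Int64' dtype (capital I) stores integers and missing values together
ages_int = ages.astype('Int64')
print(ages_int)
```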

## <a id='toc1_5_'></a>[Cleaning Column Names](#toc1_5_)

We can access the columns using `df.columns`

In [None]:
# Review columns
df.columns

In order to modify them, we can assign new column names to `df.columns` by doing `df.columns = [list_of_new_column_names]` or we can use the `rename()` method to just modify a few of them.

In [None]:
# How can I create a list of columns that is lowercase?
df.columns = df.columns.str.lower()
df.columns

In [None]:
# Can also use a custom list...
# The list needs exactly one name per column, in the original column order
df.columns = ['passenger_id', 'survived', 'pclass', 'name', 'sex', 'age', 'sib_sp', 'par_ch', 'ticket', 'fare',
       'cabin', 'embarked']
df.columns

In [None]:
# Rename using a dictionary
renaming_dict = {
    'par_ch': 'parents_children',
    "sib_sp": 'siblings_spouses'
}
df.rename(columns=renaming_dict, inplace=True)
# Alternative
# df.rename(renaming_dict, axis=1, inplace=True)
print(df.columns)
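
When column names arrive with stray spaces, mixed case, or punctuation, the string methods on `df.columns` can be chained. This is a sketch on made-up column names (the `messy` DataFrame is hypothetical):

```python
import pandas as pd

messy = pd.DataFrame(columns=[' Passenger Id', 'Ticket Class ', 'Fare (USD)'])

messy.columns = (
    messy.columns
    .str.strip()                            # remove leading/trailing spaces
    .str.lower()                            # lowercase everything
    .str.replace(' ', '_', regex=False)     # spaces -> underscores
    .str.replace(r'[()]', '', regex=True)   # drop parentheses
)
print(messy.columns.tolist())   # ['passenger_id', 'ticket_class', 'fare_usd']
```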

# <a id='toc2_'></a>[Using `apply()`, `map()`, and `applymap()`](#toc0_)

- `apply()`
    - Apply a custom function to a Series.
    - Useful for element-wise transformations.
    - Example: `df['squared_numbers'] = df['numbers'].apply(lambda x: x ** 2)`

- `map()`
    - Transform Series elements based on a dictionary.
    - Replaces elements with corresponding dictionary values.
    - Example: `df['gender_mapped'] = df['gender'].map({'M': 'Male', 'F': 'Female'})`

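- `applymap()`
    - Apply a custom function to every element in a DataFrame.
    - Useful for element-wise transformations on entire DataFrames.
    - Example: `df = df.applymap(lambda x: x.upper() if isinstance(x, str) else x)`
    - Note: since pandas 2.1, `applymap()` is deprecated in favor of the equivalent `DataFrame.map()`.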


### `apply()`

In [None]:
# Applying a custom function using apply()
def get_yob(age):
    return 1912 - age  # Titanic sank in 1912; we assume that is when Age was recorded

# Add YOB to dataset
df['yob'] = df['age'].apply(get_yob)
df.head(3)

In [93]:
# Test out using lambda
df['yob'] = df['age'].apply(lambda age: 1912 - age)

In the example above, we can see that to create a new column in pandas, we can simply assign a new Series or list to a new column name within the DataFrame.

To edit the information in a whole column in pandas, you can simply assign a new list or array of values to the column you want to modify.

In [None]:
# Convert the fare from US Dollars to EUR
exchange_rate = 0.9877
df['fare_eur'] = df['fare'].apply(lambda num: exchange_rate * num)
df['fare_eur']

In [117]:
# apply() is great for wrapping logic in a function and then applying it to our DataFrame:

def convert_class_description(col):
    """ Converts pclass to ticket class type. """
    if col == 1:
        return 'First Class'
    elif col == 2:
        return 'Second Class'
    else:
        return 'Third Class'



In [None]:
df['pclass'].apply(convert_class_description)

In [None]:
df.head()

In [None]:
# We can even use apply in the whole dataframe to perform some specific logic
def join_name_year(row):
    return row['name'] + '-' + str(row['age'])


df.apply(join_name_year, axis=1)

### `map()`

In [None]:
# Using map() to transform the 'Sex' column to an integer
gender_mapping = {'male': 0, 'female': 1}
df['sex_mapped'] = df['sex'].map(gender_mapping)
df.head()

In [None]:
# Switch to lambda
df.sex.apply(lambda x: 0 if x=="male" else 1)

### `applymap()`

In [None]:
# Using applymap() to convert all string values to uppercase
df = df.applymap(lambda x: x.upper() if isinstance(x, str) else x)

# Displaying the modified DataFrame
df.head()
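
Because `applymap()` is deprecated from pandas 2.1 onward in favor of `DataFrame.map()`, the sketch below picks whichever method is available. The `mixed` DataFrame is a made-up example, and the second half shows one possible alternative that only touches the text columns.

```python
import pandas as pd

mixed = pd.DataFrame({'name': ['ann', 'bob'], 'port': ['c', 's'], 'age': [30, 41]})

# DataFrame.map exists from pandas 2.1 on; older versions only have applymap
elementwise = mixed.map if hasattr(pd.DataFrame, 'map') else mixed.applymap
upper_all = elementwise(lambda x: x.upper() if isinstance(x, str) else x)
print(upper_all)

# Alternative: uppercase only the text columns with the Series .str accessor
text_cols = ['name', 'port']
upper_text = mixed.copy()
upper_text[text_cols] = upper_text[text_cols].apply(lambda col: col.str.upper())
print(upper_text)
```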

In [None]:
# Are we able to apply the str.upper to the whole dataframe?
# No: the .str accessor only exists on Series, so this raises an AttributeError
df.str.upper()

### <a id='toc2_1_1_'></a>[More examples](#toc0_)

#### <a id='toc2_1_1_1_'></a>[Comparing Map and Apply](#toc0_)

We have a column called "Embarked" containing three possible values: 'C', 'Q', and 'S'. We want to map these values to 0, 1 and 2. In this case, `apply()` with a lambda function would be complex due to the if-elif-else conditions, but `map()` can handle it more easily.

In [None]:
# Mapping 'embarked' values to integer codes using map()
embarked_mapping = {'C': 0, 'Q': 1, 'S': 2}
df['embarked_nr'] = df['embarked'].map(embarked_mapping)

# Display the first few rows of the updated DataFrame
df[['name', 'embarked_nr']].head()

Why is it a float?

In [None]:
# Check null values
df.isna().sum() # Because embarked has null values, the mapped column was converted to float to hold NaN

In [None]:
# Mapping 'embarked' values to integer codes using apply() and a lambda function
df['embarked'].apply(lambda x: 0 if x == 'C' else (1 if x == 'Q' else 2))

# Note that here it doesn't convert to float: NaN fails both comparisons and silently becomes 2

In [None]:
# What happened with the null values?
df.isna().sum()
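
The different treatment of missing values by `map()` and `apply()` can be reproduced on a small made-up Series (`ports` is a hypothetical name):

```python
import numpy as np
import pandas as pd

ports = pd.Series(['C', 'Q', np.nan, 'S'])
codes = {'C': 0, 'Q': 1, 'S': 2}

# map(): values without a matching key (including NaN) become NaN, so the dtype is float64
mapped = ports.map(codes)

# apply(): NaN is not equal to 'C' or 'Q', so it falls into the final else branch
applied = ports.apply(lambda x: 0 if x == 'C' else (1 if x == 'Q' else 2))

print(mapped.tolist())    # [0.0, 1.0, nan, 2.0]
print(applied.tolist())   # [0, 1, 2, 2]
```

With `map()` the missing value stays visible; with this lambda it is quietly turned into a valid-looking code, which is easy to miss later.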

#### <a id='toc2_1_1_2_'></a>[Calculating the length of the name](#toc0_)

What if we wanted to create a new column with the length of the name?

In [None]:
df.columns

In [None]:
df['name_length'] = df['name'].apply(len)
df.head()

#### <a id='toc2_1_1_3_'></a>[Converting to float some columns with applymap()](#toc0_)

Let's look, just as an example, at how to convert the following columns to float: `passenger_id`, `survived`, `pclass`

In [None]:
df[["passenger_id", "survived", "pclass"]].applymap(float)

#### <a id='toc2_1_1_4_'></a>[Modifying columns names with apply()](#toc0_)

In [None]:
# I could also use the apply function by converting df.columns to Series
pd.Series(df.columns).apply(lambda col: col.lower())

### <a id='toc2_1_2_'></a>[💡 Check for understanding](#toc0_)

Convert the column `embarked_nr` to an integer type.

- If you get an error, read the error, and think how you should proceed.
- If you decide to fill the null values, use `mode()` since it's a categorical variable.
- If you get another error, look at what `mode()` is returning in order to fix the error and convert the `embarked_nr` column to integer.

In [None]:
# Your code goes here

### <a id='toc2_1_3_'></a>[💡 Check for understanding](#toc0_)

You are given a dataset of students' exam scores here: https://raw.githubusercontent.com/data-bootcamp-v4/data/main/student_performance.csv. Your task is to perform the following operations using pandas:

1. Read the CSV file into a DataFrame.
2. Create a new column "total_score" that calculates the total score for each student by summing their "math score," "reading score," and "writing score."
3. Create a new column "grade" that assigns a grade to each student based on the following criteria:
   - If the total score is >= 90, the grade is "A."
   - If the total score is >= 80 and < 90, the grade is "B."
   - If the total score is >= 70 and < 80, the grade is "C."
   - If the total score is >= 60 and < 70, the grade is "D."
   - If the total score is < 60, the grade is "F."
4. Convert all values in the "gender" column to uppercase.
5. Create a new column "is_passed" that indicates whether each student has passed the exam or not. If the total score is >= 60, the student has passed; otherwise, they have failed.


In [123]:
df_students = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/student_performance.csv")

In [None]:
# Your code goes here


# <a id='toc3_'></a>[Filtering Data](#toc0_)

One of the primary tasks in dataset analysis is filtering rows.

When filtering DataFrames in Pandas, you can use boolean indexing to select specific rows based on certain conditions. Here's a step-by-step explanation:

1. Identify the column(s) you want to use as a filter condition. For example, in `housing_df` the column named 'SalePrice'.

2. Create a condition using a comparison operator (e.g., `>`, `<`, `==`, etc.) and the column(s) you want to filter. For instance, to filter all rows where the 'SalePrice' is greater than 10000, you would use `condition = housing_df['SalePrice'] > 10000`.

3. Use the condition to filter the DataFrame. You can do this by passing the condition inside square brackets to the DataFrame. For example, `filtered_df = housing_df[condition]` will create a new DataFrame `filtered_df` containing only the rows where the 'SalePrice' is greater than 10000.

Keep in mind that the condition should evaluate to a boolean Series with the same length as the DataFrame, indicating which rows to include (True) or exclude (False).

You can also combine multiple conditions using logical operators like `&` for 'and' and `|` for 'or'. For instance, to filter rows where the 'SalePrice' is greater than 10000 and the 'FullBath' is more than 1, you can use `condition = (housing_df['SalePrice'] > 10000) & (housing_df['FullBath'] > 1)`.

Filtering allows you to extract specific subsets of data from your DataFrame, making it easier to analyze and work with the data that meets your criteria.
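
The steps above can be sketched on a tiny stand-in for `housing_df`. The values below are invented for illustration; only the column names come from the text.

```python
import pandas as pd

# Hypothetical housing data
housing_df = pd.DataFrame({
    'SalePrice': [8000, 12000, 15000, 9500],
    'FullBath': [1, 2, 1, 2],
})

# Step 2: the condition is a boolean Series, one value per row
condition = housing_df['SalePrice'] > 10000
print(condition.tolist())   # [False, True, True, False]

# Step 3: keep only the rows where the condition is True
filtered_df = housing_df[condition]

# Each condition needs its own parentheses, because & and | bind tighter than > or ==
both = housing_df[(housing_df['SalePrice'] > 10000) & (housing_df['FullBath'] > 1)]
either = housing_df[(housing_df['SalePrice'] > 10000) | (housing_df['FullBath'] > 1)]

# ~ negates a condition
not_expensive = housing_df[~condition]
```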


### <a id='toc3_1_1_'></a>[Creating a condition](#toc0_)

In [None]:
# Check fare col and mean
df['fare']

In [None]:
df['fare'].mean()

In [None]:
# Create filter condition - fares higher than the mean
condition = df['fare'] > df['fare'].mean()
condition

### <a id='toc3_1_2_'></a>[Filtering df](#toc0_)

In [None]:
# Get filtered df
filtered_df = df[condition]
filtered_df

In [None]:
# Do it all in one go
df[df['fare'] > df['fare'].mean()]

### <a id='toc3_1_3_'></a>[Using multiple conditions](#toc0_)

In [None]:
# We can combine boolean operators with filters to add conditions
# boolean operators: and is &, or is |

# fare higher than mean but still lower than 50
df[(df['fare'] > df['fare'].mean()) & (df['fare'] <= 50)]

In [None]:
# To filter on categorical data we can also use .isin()
# Get expensive tickets in the lowest classes
df[(df['fare'] > 60) & (df['pclass'].isin([2, 3]))]

# Alternatively, get cheap tickets in the higher classes
df[(df['fare'] < 30) & (df['pclass'].isin([1, 2]))]

In [None]:
# We can also use between() for numerical data
# Get fares between 90-100
df[df['fare'].between(90, 100)]

# <a id='toc4_'></a>[More Data Manipulation](#toc0_)




## <a id='toc4_1_'></a>[Setting the index](#toc0_)

To set an index in pandas, you can use the `set_index()` method of the DataFrame. This method allows you to specify which column you want to use as the index for the DataFrame.

In [None]:
# Basically renaming the rows in our df
df.set_index('passenger_id',inplace=True)
df.head()
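
Once a column is the index, rows can be looked up by label with `loc`, and `reset_index()` turns the index back into a regular column. A minimal sketch on a made-up `crew` DataFrame (hypothetical names and values):

```python
import pandas as pd

crew = pd.DataFrame({'crew_id': [101, 102, 103], 'role': ['cook', 'pilot', 'medic']})

# Without inplace=True, set_index() returns a new DataFrame
indexed = crew.set_index('crew_id')
print(indexed.loc[102, 'role'])   # pilot

# reset_index() moves the index back into a column
restored = indexed.reset_index()
print(restored.columns.tolist())  # ['crew_id', 'role']
```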

## <a id='toc4_2_'></a>[Adding/removing rows and/or columns](#toc0_)

To add or remove rows and/or columns from a pandas DataFrame, you can use the following methods:

1. Adding rows:
   - Use the `concat()` method to add rows to the DataFrame.

2. Removing rows:
   - Use the `drop()` method with the row index or label to remove specific rows.

3. Adding columns:
   - Using `df[new_column]`, you simply assign a list, Series, or scalar value to the new column name
   - Assign a new column to the DataFrame using bracket notation or the `assign()` method.

4. Removing columns:
   - Use the `drop()` method with the column name and `axis=1` to remove specific columns.
   - Alternatively, you can use the `del` keyword to remove a column in-place.
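
The four operations listed above can be sketched on a small made-up `inventory` DataFrame (all names and values here are hypothetical):

```python
import pandas as pd

inventory = pd.DataFrame({'item': ['pen', 'ink'], 'price': [1.5, 4.0]})

# Add a row: build a one-row DataFrame and concatenate it
new_row = pd.DataFrame({'item': ['pad'], 'price': [3.0]})
inventory = pd.concat([inventory, new_row], ignore_index=True)

# Add columns: bracket assignment changes the DataFrame, assign() returns a new one
inventory['in_stock'] = True
with_tax = inventory.assign(price_with_tax=inventory['price'] * 1.2)

# Remove a row by index label, and a column by name (both return new DataFrames)
without_ink = inventory.drop(1)
without_stock = inventory.drop('in_stock', axis=1)

# del removes a column in place
del inventory['in_stock']
print(inventory)
```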

In [None]:
# Add the first row of the df at the end
# df.iloc[[0]] (double brackets) selects the first row as a one-row DataFrame, keeping the column dtypes
new_df = pd.concat([df, df.iloc[[0]]], axis=0)
new_df.tail()

In [None]:
# Remove the row from the new df
# drop() works on index labels: this removes every row whose label is 1 (here, the first row and its appended copy)
new_df.drop(1)

In [None]:
# Remove a column from the dataframe
df.drop('name', axis=1, inplace=True)
df.head()

In [None]:
# Create a survived_bool col
df["survived_bool"] = df['survived'].map({0: False, 1: True})
df

## <a id='toc4_3_'></a>[💡 Check for understanding](#toc0_)

Use the `supermarket_sales.csv` file for this task.

1. **Load the Data**: Use pandas to load the `supermarket_sales.csv` file into a DataFrame.

2. **Null Values**: Check if the DataFrame has any null values. If there are any, count the number of null values in each column.

3. **Formatting Data**: Round any floating point numbers in the DataFrame to two decimal places.

4. **Cleaning Column Names**: Ensure all column names are in lowercase and replace any spaces in the column names with underscores.

5. **Using `apply()`, `map()`, and `applymap()`**: Create a new column called 'total_cost' which is the product of the 'quantity' and 'unit_price' columns (assuming these columns exist in your dataset). Use the `apply()` function for this.

6. **Filtering Data**: Filter the DataFrame to only include rows where 'total_cost' is greater than the average 'total_cost'.

7. **Setting the Index**: Set the 'invoice_id' column (or any other unique identifier) as the index of the DataFrame.



In [None]:
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/supermarket_sales.csv'

In [None]:
# Your code goes here

# <a id='toc5_'></a>[Summary](#toc0_)

1. Null Values:
   - Null values (also known as missing values) can hinder data analysis and modeling.
   - Use `isnull()` or `isna()` to check for null values in a DataFrame or Series.
   - Use `any()` and `sum()` to efficiently assess data quality.
   - Use `dropna()` to remove rows or columns with null values from a DataFrame.
   - Parameters like subset, how, and thresh can control the behavior of dropping rows or columns.
   - Use `fillna()` to replace null values with specific values, such as `mean()`, `median()`, or forward/backward fill.

2. Formatting Data:
   - Use `round()` and `format()` to format numeric values.
   - Use string methods like `lower()`, `upper()`, `title()`, `strip()`, `split()`, and `replace()`.

3. Cleaning Column Names:
   - Use `df.columns` to access column names.
   - Modify column names using `df.columns` or `rename()`.

4. Using `apply()`, `map()`, and `applymap()`:
   - `apply()`: Applies a custom function to a Series.
   - `map()`: Transforms Series elements based on a dictionary.
   - `applymap()`: Applies a custom function to every element in a DataFrame.

5. Filtering Data:
   - Filter rows in a DataFrame using boolean indexing.
   - Use comparison operators (<, >, ==) to create conditions.
   - Combine multiple conditions using logical operators (& for 'and', | for 'or').

6. Setting the Index:
   - Use `set_index()` to set an index for the DataFrame.

7. Adding/Removing Rows and Columns:
   - Use `concat()` to add rows to the DataFrame.
   - Use `drop()` with the row index/label to remove specific rows.
   - Use bracket notation to add columns and `drop()` with `axis=1` to remove them.