### Handling missing data

#### `dropna()` Optional Parameters

- **`axis`:** 
  - `axis=0` or `axis='index'` (default) to drop rows with missing values
  - `axis=1` or `axis='columns'` to drop columns with missing values
- **`how`:** 
  - `how='any'` (default) drops the row/column if **any** `NA` values are present
  - `how='all'` drops the row/column only if **all** values are `NA`
- **`thresh`:** 
  - Sets a threshold for the number of non-NA values. Rows/columns with *fewer* **non-NA** values than the threshold will be dropped
- **`subset`:** 
  - Defines a list of columns in which to look for missing values, useful when `axis=0`
- **`inplace`:** 
  - If `True`, the operation modifies the `DataFrame` in place. Default is `False`, which returns a new `DataFrame`


In [None]:
df_drop = df.dropna(subset=['Column3','Column4']) # drops rows with missing values in Column3 or Column4
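
The remaining `dropna()` parameters listed above can be sketched in the same way. This is a minimal illustration, assuming a small made-up DataFrame (`df_demo` and its values are hypothetical, not part of the original dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, made up for illustration
df_demo = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, np.nan],
    'B': [np.nan, np.nan, 6.0, np.nan],
    'C': [7.0, 8.0, 9.0, np.nan],
})

# how='any' (default): drop every row that contains at least one NA
rows_any = df_demo.dropna()

# how='all': drop only rows in which every value is NA
rows_all = df_demo.dropna(how='all')

# thresh=2: keep rows that have at least 2 non-NA values
rows_thresh = df_demo.dropna(thresh=2)

# axis=1 with thresh=3: keep columns that have at least 3 non-NA values
cols_thresh = df_demo.dropna(axis=1, thresh=3)

# subset=['A']: only look at column A when deciding which rows to drop
rows_subset = df_demo.dropna(subset=['A'])

print(rows_any, rows_all, rows_thresh, cols_thresh, rows_subset, sep='\n\n')
```

None of these calls modifies `df_demo`, because `inplace` is left at its default of `False`.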

## Filling values


1. `fillna()`
2. Forward fill and backward fill
3. Using the mean, median, or mode
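
The three approaches listed above might look like the following sketch. The `scores` DataFrame and its column names are hypothetical, chosen only to show each technique on a few values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, made up for illustration
scores = pd.DataFrame({
    'score': [10.0, np.nan, 30.0, np.nan],
    'grade': ['A', None, 'B', 'A'],
})

# 1. fillna() with a constant value
filled_const = scores['score'].fillna(0)

# 2. Forward fill copies the last valid value downwards;
#    backward fill copies the next valid value upwards
filled_ffill = scores['score'].ffill()
filled_bfill = scores['score'].bfill()  # the last row has no later value, so it stays NaN

# 3. Fill with a summary statistic of the column
filled_mean = scores['score'].fillna(scores['score'].mean())
filled_median = scores['score'].fillna(scores['score'].median())
# mode() returns a Series (there can be ties), so take its first entry
filled_mode = scores['grade'].fillna(scores['grade'].mode()[0])

print(filled_ffill.tolist())
print(filled_mean.tolist())
print(filled_mode.tolist())
```

Mean and median only make sense for numeric columns, which is why the mode is applied to the categorical `grade` column here.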

In [2]:
# Creating a DataFrame from a list of lists, with each inner list representing a column
import pandas as pd
data = [
    [1, 2, 3],
    ['Alice', 'Bob', 'Charlie'],
    [30, 25, 35]
]

# Create a DataFrame with each list as a column
employee_df = pd.DataFrame(list(zip(*data)), columns=['id', 'name', 'age'])
employee_df.head()

   id     name  age
0   1    Alice   30
1   2      Bob   25
2   3  Charlie   35


#### `read_csv()` Optional Parameters

- **`sep`**: Defines the delimiter to use. The default is `,`, and you can check your `CSV` file to see which should be used 
  
- **`header`**: Indicates the row number to use as column names (0-indexed). By default the first line (`0`) is used; set it to `None` if the file has no header row.
  
- **`index_col`**: This parameter is used to specify which column should be used as the row index. It can be an integer (column position) or a string (column label).
  
- **`usecols`**: Useful when you want to load only specific columns. Pass a list of column names or numbers.
  
- **`dtype`**: Sets the data type of the loaded columns. Pass a single type to apply it to every column, or a dictionary mapping column names to types, e.g. `{'age': 'float64'}`
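
As a rough sketch of how these parameters combine, the example below reads a small in-memory CSV instead of a file. The CSV content, the `;` delimiter and the column names are all made up for illustration, and `dtype` is passed as a dictionary mapping a column name to a type:

```python
import io

import pandas as pd

# Hypothetical CSV content, made up for illustration; note the ';' delimiter
csv_text = "id;name;age;city\n1;Alice;30;Paris\n2;Bob;25;Rome\n"

people = pd.read_csv(
    io.StringIO(csv_text),          # a file path or URL would go here instead
    sep=';',                        # the delimiter used in this data
    header=0,                       # the first line holds the column names
    index_col='id',                 # use the 'id' column as the row index
    usecols=['id', 'name', 'age'],  # load only these columns ('city' is skipped)
    dtype={'age': 'float64'},       # read 'age' as float instead of int
)

print(people)
print(people.dtypes)
```

Note that a column named in `index_col` also has to appear in `usecols`, otherwise it is never loaded.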

## Describing the Data

- `.info()`
- `.describe()` 
- `.value_counts()`
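
A minimal sketch of the three methods above, assuming a small hypothetical DataFrame (`pets` and its columns are invented for this example):

```python
import pandas as pd

# Hypothetical sample data, made up for illustration
pets = pd.DataFrame({
    'species': ['cat', 'dog', 'cat', 'bird'],
    'weight_kg': [4.0, 20.0, 5.0, 0.5],
})

# .info() prints the column names, non-null counts, dtypes and memory usage
pets.info()

# .describe() returns summary statistics (count, mean, std, min, quartiles, max)
# for the numeric columns
summary = pets.describe()
print(summary)

# .value_counts() counts how often each distinct value occurs, most frequent first
counts = pets['species'].value_counts()
print(counts)
```

`.info()` prints directly and returns `None`, whereas `.describe()` and `.value_counts()` return objects that can be stored and used further.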

## Sorting the Data

- `sort_values()`
- `sort_index()`

- Sorting by a single column can be achieved using `sort_values()`, by specifying the column in the `by` parameter. By default it sorts in ascending order, so to sort in descending order you need to pass `ascending=False` as well.
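
The behaviour described above can be checked on a tiny example first. The `staff` DataFrame below is hypothetical and uses a deliberately shuffled index so that `sort_index()` has a visible effect:

```python
import pandas as pd

# Hypothetical sample data with a shuffled index, made up for illustration
staff = pd.DataFrame(
    {'name': ['Cara', 'Abe', 'Bo'], 'pay': [300, 100, 200]},
    index=[2, 0, 1],
)

# Ascending order is the default
by_pay_asc = staff.sort_values(by='pay')

# Descending order has to be requested explicitly
by_pay_desc = staff.sort_values(by='pay', ascending=False)

# sort_index() orders the rows by their index labels instead of by a column
by_index = staff.sort_index()

print(by_pay_asc)
print(by_pay_desc)
print(by_index)
```

Both methods return a new DataFrame and leave `staff` unchanged.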


### Selecting Based on Multiple Values
If you want to select rows whose column value matches any of several values, you can use the `.isin()` method. It returns a boolean `Series` that is `True` for each row whose value is a member of the given list, and that `Series` can then be used as a mask:

In [3]:
# Creating a sample dataframe
data = {
    'Name': ['Alansana', 'Briana', 'Chanmony', 'Dietrich', 'Eva'],
    'Age': [23, 34, 45, 36, 50],
    'Country': ['USA', 'Switzerland', 'UK', 'Switzerland', 'Canada']
}
customers = pd.DataFrame(data)

# Displaying the original dataframe
print("Original DataFrame:")
print(customers)

# Using `loc` to select rows based on a logical expression
swiss_customers = customers.loc[customers['Country'] == 'Switzerland']

# Displaying the dataframe with only Swiss customers
print("\nSwiss Customers:")
print(swiss_customers)

Original DataFrame:
       Name  Age      Country
0  Alansana   23          USA
1    Briana   34  Switzerland
2  Chanmony   45           UK
3  Dietrich   36  Switzerland
4       Eva   50       Canada

Swiss Customers:
       Name  Age      Country
1    Briana   34  Switzerland
3  Dietrich   36  Switzerland


In [4]:

# Selecting rows where 'Country' is either 'USA' or 'UK'
selected_countries = ['USA', 'UK']
mask = customers['Country'].isin(selected_countries)

# Using the mask to get the subset of the DataFrame
subset_df = customers[mask]

# Displaying the subset of the DataFrame
print("\nSubset DataFrame (Countries: USA, UK):")
print(subset_df)


Subset DataFrame (Countries: USA, UK):
       Name  Age Country
0  Alansana   23     USA
2  Chanmony   45      UK


In [6]:
import pandas as pd
salary_df = pd.read_csv('https://aicore-files.s3.amazonaws.com/Foundations/Data_Formats/Salaries.csv')


In [7]:
shape = salary_df.shape
print(f'This dataset has {shape[0]} rows and {shape[1]} columns')

This dataset has 676 rows and 13 columns


In [8]:

salary_df['Year'].value_counts()

Year
2011    676
Name: count, dtype: int64

In [9]:
# Sort by BasePay, highest first
sorted_salary_df = salary_df.sort_values(by=['BasePay'], ascending=False)
sorted_salary_df.head(5)

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
12,AM,EDWARD HARRINGTON,EXECUTIVE CONTRACT EMPLOYEE,294580.02,0.0,0.0,,294580.02,294580.02,2011,,San Francisco,
9,AJ,JOANNE HAYES-WHITE,"CHIEF OF DEPARTMENT, (FIRE DEPARTMENT)",285262.0,0.0,17115.73,,302377.73,302377.73,2011,,San Francisco,
13,AN,JOHN MARTIN,DEPARTMENT HEAD V,271329.03,0.0,21342.59,,292671.62,292671.62,2011,,San Francisco,
16,AQ,AMY HART,DEPARTMENT HEAD V,268604.57,0.0,16115.86,,284720.43,284720.43,2011,,San Francisco,
28,BC,DENISE SCHMITT,DEPUTY CHIEF III (POLICE DEPARTMENT),261717.6,0.0,2357.0,,264074.6,264074.6,2011,,San Francisco,


In [11]:
# Sort by JobTitle (A to Z), then by BasePay (highest first) within each title
sorted_salary_df = salary_df.sort_values(by=['JobTitle', 'BasePay'], ascending=[True, False])
sorted_salary_df.head(5)

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
36,BK,SUSAN CURRIN,"ADMINISTRATOR, SFGH MEDICAL CENTER",245124.44,0.0,12000.0,,257124.44,257124.44,2011,,San Francisco,
59,CH,AI-KYUNG CHUNG,ANESTHETIST,214745.44,9161.31,14972.45,,238879.2,238879.2,2011,,San Francisco,
96,DS,SARAH CARY,ANESTHETIST,208925.6,5539.3,12615.6,,227080.5,227080.5,2011,,San Francisco,
110,EG,MARK SMITH,ANESTHETIST,206057.69,3431.01,10921.33,,220410.03,220410.03,2011,,San Francisco,
121,ER,SHELLEY MITCHELL,ANESTHETIST,203658.55,4759.36,9561.44,,217979.35,217979.35,2011,,San Francisco,


In [12]:
# Sort the rows by their index labels
sorted_salary_df = salary_df.sort_index()
sorted_salary_df.head(5)

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,AA,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,AD,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
4,AE,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,


In [10]:
mask = (salary_df['JobTitle'].str.contains('police', case=False)) & (salary_df['BasePay'] > 50000)
print(f'The datatype of the mask is {type(mask)}')
mask.head(5)


The datatype of the mask is <class 'pandas.core.series.Series'>


0    False
1     True
2     True
3    False
4    False
dtype: bool

In [13]:
filtered_df = salary_df[mask] # Applying the mask to the DataFrame
filtered_df.head(5)

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
1,AB,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,AC,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
11,AL,PATRICIA JACKSON,CAPTAIN III (POLICE DEPARTMENT),99722.0,87082.62,110804.3,,297608.92,297608.92,2011,,San Francisco,
15,AP,RICHARD CORRIEA,"COMMANDER III, (POLICE DEPARTMENT)",198778.01,73478.2,13957.65,,286213.86,286213.86,2011,,San Francisco,
25,AZ,GREGORY SUHR,CHIEF OF POLICE,256470.41,0.0,11522.18,,267992.59,267992.59,2011,,San Francisco,
