# Data Wrangling

![link text](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/ADSP_Images/Lesson_09_Data_Wrangling/Introduction.png)
 
- It's a very important step in the Data Science pipeline
- It's the process of cleaning, structuring, and enriching raw data into a desired format for better EDA, statistical evaluation, decision-making, and machine learning model training
- It involves many steps and techniques:
    - Data Acquisition: Gathering raw data from different sources
        - Internal data sources: grabbing local files, connecting to the company's database, etc...
        - External data sources:  data scraping, downloading and using external datasets, connecting to a vendor's data warehouse, etc...
            - Examples:
                - https://archive.ics.uci.edu/
                - https://catalog.data.gov/dataset/
                - https://www.kaggle.com/datasets
    - Data Preparation:
        - Selecting and combining data
            - Slicing and filtering e.g. top 6 neighborhoods by size, patients without diabetes, etc...
            - Integrating data: join or union tables from the same database or from external sources
        - Data cleansing: After doing EDA and sanity checks
            - Dropping unnecessary columns
            - Changing column data types, e.g. parsing date strings into datetimes
            - Formatting values consistently within a column
            - Treating missing values
            - Treating outliers
            - Removing duplicates
        - Data transformation and Feature Engineering
            - Data aggregation
            - Feature extraction, e.g. extract age bands from age, derive a weekend indicator from dates, etc.
            - Label Encoding
            - Outlier Treatment
            - Data Scaling (Standardization)
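
The two feature extraction examples in the list above (age bands and a weekend indicator) can be sketched as follows. This is a minimal illustration on made-up data: the `people` frame, its column names, and the bin edges are assumptions, not part of the bike rental exercise.

```python
import pandas as pd

# Hypothetical toy data; names and values are for illustration only
people = pd.DataFrame({
    "age": [15, 34, 67],
    "visit_date": pd.to_datetime(["2011-01-01", "2011-01-03", "2011-01-09"]),
})

# Age bands: bin a numeric column into labeled intervals (0, 17], (17, 64], (64, 120]
people["age_band"] = pd.cut(
    people["age"],
    bins=[0, 17, 64, 120],
    labels=["minor", "adult", "senior"],
)

# Weekend indicator: dayofweek is 0 for Monday through 6 for Sunday
people["is_weekend"] = people["visit_date"].dt.dayofweek >= 5

print(people)
```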

- Importance of Data Wrangling in Machine Learning:
    - Improving data quality: ensuring the data is accurate, reliable, and complete
    - Enhancing the ML model performance: clean and well-structured data leads to better model training
    - Reducing Bias and Errors
    - It makes EDA much easier and more insightful

## Exercise - Bike Rental Data Wrangling

### Data Dictionary
**Description of Attributes:**

* **instant** - event or instant id
* **dteday** - date of the rental/ride 
* **season** -  
    - 1 = spring
    - 2 = summer
    - 3 = fall
    - 4 = winter 
* **yr** - year, stored as a code (0 in the 2011 rows shown below)
* **mnth** - month (1 to 12)
* **hr** - hour of the rental (0 to 23)
* **holiday** - whether the day is considered a holiday
* **weekday** - day of the week, encoded as an integer from 0 to 6
* **weathersit** 
    * 1: Clear, Few clouds, Partly cloudy
    * 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    * 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    * 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* **temp** - temperature in Celsius
* **atemp** - "feels like" temperature in Celsius
* **hum** - relative humidity
* **windspeed** - wind speed
* **casual** - number of non-registered user rentals initiated
* **registered** - number of registered user rentals initiated
* **cnt** - number of total rentals (casual and registered)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
dataset_1 = pd.read_csv('/Users/bassel_instructor/Documents/Datasets/rental_bike_descr.csv')
dataset_1.sample(10)

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp
76,77,04-01-2011,1,0,1,8,False,2,1,0.14
139,140,07-01-2011,1,0,1,1,False,5,2,0.2
155,156,07-01-2011,1,0,1,18,False,5,1,0.2
179,180,08-01-2011,1,0,1,18,False,6,1,0.14
472,473,21-01-2011,1,0,1,17,False,5,1,0.14
268,269,12-01-2011,1,0,1,15,False,3,1,0.2
544,545,24-01-2011,1,0,1,20,False,1,1,0.14
39,40,02-01-2011,1,0,1,16,False,0,3,0.34
103,104,05-01-2011,1,0,1,12,False,3,1,0.26
211,212,10-01-2011,1,0,1,2,False,1,1,0.12


**Observations**
- The year (yr) column contains zeroes, so it looks like an encoded value rather than a calendar year
- temp is stored as a decimal fraction (sometimes that's OK, e.g. when values are scaled)
- The hour (hr) field uses 24-hour (military) time
- Based on the data dictionary, we're missing 6 columns. In this situation, we can either contact the data provider about the missing columns or look into the other data sources to see if they have them (we may need to integrate datasets)
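
The "missing 6 columns" observation can be checked in code instead of by eye. The sketch below hard-codes the attribute names from the data dictionary and the columns displayed for `dataset_1`; the empty `loaded` frame is a stand-in so the example runs without the CSV file.

```python
import pandas as pd

# Attribute names taken from the data dictionary above
expected_columns = [
    "instant", "dteday", "season", "yr", "mnth", "hr", "holiday", "weekday",
    "weathersit", "temp", "atemp", "hum", "windspeed", "casual", "registered", "cnt",
]

# Stand-in for dataset_1: an empty frame with the columns shown in its output
loaded = pd.DataFrame(columns=[
    "instant", "dteday", "season", "yr", "mnth", "hr", "holiday", "weekday",
    "weathersit", "temp",
])

# Columns that the dictionary promises but the file does not contain
missing = [col for col in expected_columns if col not in loaded.columns]
print(len(missing), missing)
```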

In [3]:
dataset_2 = pd.read_excel('/Users/bassel_instructor/Documents/Datasets/rental_bike_season.xlsx')
dataset_2.head()

Unnamed: 0.1,Unnamed: 0,instant,atemp,hum,windspeed,casual,registered,cnt
0,0,1,0.2879,0.81,0.0,3,13,16
1,1,2,0.2727,0.8,0.0,8,32,40
2,2,3,0.2727,0.8,0.0,5,27,32
3,3,4,0.2879,0.75,0.0,3,10,13
4,4,5,0.2879,0.75,0.0,0,1,1


> NOTE: if you get an error after running `read_excel()`, you may need to run `pip install openpyxl`

**Observations**
- We have a redundant index column, `Unnamed: 0`, which we can drop
- Looks like **dataset_2** completes the missing columns from **dataset_1**
- Therefore, we need to join the 2 tables

### Drop Columns

In [4]:
dataset_2 = dataset_2.drop(columns=['Unnamed: 0']) # to drop multiple columns, name them all in the list
dataset_2.head()

Unnamed: 0,instant,atemp,hum,windspeed,casual,registered,cnt
0,1,0.2879,0.81,0.0,3,13,16
1,2,0.2727,0.8,0.0,8,32,40
2,3,0.2727,0.8,0.0,5,27,32
3,4,0.2879,0.75,0.0,3,10,13
4,5,0.2879,0.75,0.0,0,1,1
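
As a small extension of the cell above, here is a sketch of dropping several columns at once. The `toy` frame is a made-up stand-in for `dataset_2`; `errors="ignore"` is a real `drop()` option that skips labels that are not present instead of raising a `KeyError`, which helps when a cell is re-run.

```python
import pandas as pd

# Hypothetical stand-in for dataset_2
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "instant": [1, 2],
    "hum": [0.81, 0.80],
    "windspeed": [0.0, 0.0],
})

# Drop several columns by listing them
trimmed = toy.drop(columns=["Unnamed: 0", "windspeed"])

# Re-running the drop is safe with errors="ignore" (the column is already gone)
trimmed_again = trimmed.drop(columns=["Unnamed: 0"], errors="ignore")

print(list(trimmed_again.columns))
```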


### Slicing and Filtering The Data

Select rows 100 to 200 and only the instant and hum columns

In [5]:
# example of loc
dataset_2.loc[100:200, ['instant', 'hum']]

Unnamed: 0,instant,hum
100,101,0.37
101,102,0.37
102,103,0.33
103,104,0.33
104,105,0.30
...,...,...
196,197,0.40
197,198,0.37
198,199,0.34
199,200,0.32


Select the first 300 rows and first 4 columns

In [6]:
dataset_2.iloc[:300,:4] #[row range, col range]

Unnamed: 0,instant,atemp,hum,windspeed
0,1,0.2879,0.81,0.0000
1,2,0.2727,0.80,0.0000
2,3,0.2727,0.80,0.0000
3,4,0.2879,0.75,0.0000
4,5,0.2879,0.75,0.0000
...,...,...,...,...
295,296,0.1818,0.40,0.3284
296,297,0.1515,0.47,0.2537
297,298,0.1515,0.47,0.2239
298,299,0.1212,0.46,0.2985


In [7]:
# select a single data point by position
dataset_2.iat[0,2] # [row position, column position]

0.81
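
The three accessors used above differ in how they interpret their arguments: `loc` selects by label and includes the end label, `iloc` selects by position and excludes the end position, and `iat` returns one scalar by position. A minimal sketch on a made-up frame (the `toy` data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical frame with the default 0..9 integer index
toy = pd.DataFrame({
    "instant": list(range(1, 11)),
    "hum": [0.81, 0.80, 0.80, 0.75, 0.75, 0.75, 0.80, 0.86, 0.75, 0.76],
})

by_label = toy.loc[2:5, ["instant"]]   # labels 2, 3, 4, 5 (end label included)
by_position = toy.iloc[2:5, :1]        # positions 2, 3, 4 (end position excluded)
single_value = toy.iat[0, 0]           # scalar at row position 0, column position 0

print(len(by_label), len(by_position), single_value)
```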

#### Filtering Data 

Get (filter) the rows where casual rentals are greater than 3

In [8]:
dataset_2[dataset_2['casual'] > 3]

Unnamed: 0,instant,atemp,hum,windspeed,casual,registered,cnt
1,2,0.2727,0.80,0.0000,8,32,40
2,3,0.2727,0.80,0.0000,5,27,32
9,10,0.3485,0.76,0.0000,8,6,14
10,11,0.3939,0.76,0.2537,12,24,36
11,12,0.3333,0.81,0.2836,26,30,56
...,...,...,...,...,...,...,...
580,581,0.1970,0.93,0.3284,6,35,41
581,582,0.1970,0.93,0.3284,7,41,48
582,583,0.1970,0.93,0.3284,4,43,47
591,592,0.2121,0.74,0.0896,4,55,59


In [9]:
#validate the filter
dataset_2[dataset_2['casual'] > 3].min()

instant        2.0000
atemp          0.0606
hum            0.2100
windspeed      0.0000
casual         4.0000
registered     6.0000
cnt           13.0000
dtype: float64

**Types of Logical Operators:**<br>
If a is the column and b is the value:<br>
![logOp](https://miro.medium.com/v2/resize:fit:640/1*H9m-yjLwZ5-M16qld4fvOA.png)

**Combining Multiple Conditions**<br>
![MulOp](https://miro.medium.com/v2/resize:fit:720/1*nd09QyjA8OvZbSYK6znykQ.png)
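
In pandas, conditions are combined with `&` (and), `|` (or), and `~` (not), and each condition needs its own parentheses because these operators bind more tightly than comparisons. A minimal sketch on made-up data (the `toy` frame is an assumption):

```python
import pandas as pd

# Hypothetical rental counts
toy = pd.DataFrame({"casual": [3, 8, 0, 3], "registered": [13, 32, 1, 5]})

is_three = toy["casual"] == 3
many_registered = toy["registered"] > 10

both = toy[is_three & many_registered]         # AND: both conditions hold
either = toy[is_three | many_registered]       # OR: at least one condition holds
neither = toy[~(is_three | many_registered)]   # NOT: negate the combined mask

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```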

Find the rows with exactly 3 casual rentals and more than 10 registered rentals

In [10]:
my_filter = (dataset_2['casual']==3) & (dataset_2['registered']>10)

dataset_2[my_filter].head()

Unnamed: 0,instant,atemp,hum,windspeed,casual,registered,cnt
0,1,0.2879,0.81,0.0,3,13,16
21,22,0.4091,0.87,0.194,3,31,34
65,66,,0.47,0.1045,3,49,52
66,67,0.197,0.64,0.1343,3,49,52
86,87,0.2576,0.48,0.194,3,179,182


Pandas also offers `query()`, which takes a filter written in a SQL-like syntax

In [11]:
# the same filter written with query()

dataset_2.query('casual == 3 & registered > 10').head()

Unnamed: 0,instant,atemp,hum,windspeed,casual,registered,cnt
0,1,0.2879,0.81,0.0,3,13,16
21,22,0.4091,0.87,0.194,3,31,34
65,66,,0.47,0.1045,3,49,52
66,67,0.197,0.64,0.1343,3,49,52
86,87,0.2576,0.48,0.194,3,179,182


- Both methods work. Method 1 (boolean mask) skips the string-parsing step that method 2 (`query()`) needs, so it can be faster on small datasets.
- Column names with spaces, e.g. 'member id', work directly in method 1; in `query()` they must be wrapped in backticks
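
To illustrate the point about column names with spaces, the sketch below runs the same filter both ways. The `toy` frame and its `member id` column are hypothetical; the backtick quoting inside `query()` is supported by pandas (version 0.25 and later).

```python
import pandas as pd

# Hypothetical frame with a space in one column name
toy = pd.DataFrame({"member id": [10, 11, 12], "casual": [3, 8, 3]})

# Method 1: boolean mask, the column name is just a string key
mask_result = toy[(toy["member id"] > 10) & (toy["casual"] == 3)]

# Method 2: query(), the column name with a space goes in backticks
query_result = toy.query("`member id` > 10 and casual == 3")

print(mask_result.equals(query_result))
```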

In [12]:
# using isin to select specific values instead of ranges

dataset_2[dataset_2['casual'].isin([3,6,7])].head()

Unnamed: 0,instant,atemp,hum,windspeed,casual,registered,cnt
0,1,0.2879,0.81,0.0,3,13,16
3,4,0.2879,0.75,0.0,3,10,13
19,20,0.4242,0.88,0.2537,6,31,37
21,22,0.4091,0.87,0.194,3,31,34
33,34,0.3485,0.81,0.2239,7,46,53


In [21]:
#not equal
dataset_2[dataset_2['casual'].ne(7)].head()

Unnamed: 0,instant,atemp,hum,windspeed,casual,registered,cnt
0,1,0.2879,0.81,0.0,3,13,16
1,2,0.2727,0.8,0.0,8,32,40
2,3,0.2727,0.8,0.0,5,27,32
3,4,0.2879,0.75,0.0,3,10,13
4,5,0.2879,0.75,0.0,0,1,1


In [22]:
# not equal for a list - using ~
dataset_2[~dataset_2['casual'].isin([3,6,7])].head()

Unnamed: 0,instant,atemp,hum,windspeed,casual,registered,cnt
1,2,0.2727,0.8,0.0,8,32,40
2,3,0.2727,0.8,0.0,5,27,32
4,5,0.2879,0.75,0.0,0,1,1
5,6,0.2576,0.75,0.0896,0,1,1
6,7,0.2727,0.8,0.0,2,0,2


Select all the rows that are below average humidity (dynamic selection)

In [13]:
my_filter = dataset_2['hum'] < dataset_2['hum'].mean() # threshold is computed from the data, not hard-coded

dataset_2[my_filter].head()

Unnamed: 0,instant,atemp,hum,windspeed,casual,registered,cnt
41,42,0.3333,0.46,0.3284,10,43,53
42,43,0.2879,0.42,0.4478,1,29,30
43,44,,0.39,0.3582,5,17,22
44,45,0.2273,0.44,0.3284,11,20,31
45,46,0.2121,0.44,0.2985,0,9,9


In [14]:
print(dataset_2['hum'].mean())

0.5624754098360656


### Data Integration (Merging and Concatenating)

- Merging `merge()`: similar to join in SQL
- Concatenating `concat()`: similar to union in SQL (it can also place tables side by side, aligned on the index, which resembles a join)

![joins](https://statisticsglobe.com/wp-content/uploads/2021/12/join-types-python-merge-programming.png)

![sql2](https://miro.medium.com/v2/resize:fit:1200/1*9eH1_7VbTZPZd9jBiGIyNA.png)

#### Data Merge (horizontal integration)

We need to merge (join) dataset_1 and dataset_2 to get the full list of columns. In order to join them successfully:
- To have full info (no missing values), both datasets should have the same number of rows
- Both datasets need a common column/key (single or multiple)

In [15]:
#check the row count
len(dataset_1) == len(dataset_2)

True

In [16]:
print('dataset_1 columns:\n',dataset_1.columns)
print('dataset_2 columns:\n',dataset_2.columns)

dataset_1 columns:
 Index(['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'weathersit', 'temp'],
      dtype='object')
dataset_2 columns:
 Index(['instant', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt'], dtype='object')


In [17]:
# check if instant has the same values on both sides
# note: == compares Series by index label, so this assumes both frames share the same row index
(dataset_1['instant'].sort_values() == dataset_2['instant'].sort_values()).sum()

610

- Both datasets have the same number of rows
- The number of rows equals the number of true matches
- Since both datasets have an equal number of rows and the IDs are aligned, it doesn't matter which type of join is used
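
When the two tables are not guaranteed to share the same row order, one way to confirm that the keys line up is to let `merge()` report it. This is a sketch on made-up frames (`left` and `right` are assumptions): `indicator=True` adds a `_merge` column saying where each key was found, and `validate="one_to_one"` raises an error if a key repeats on either side.

```python
import pandas as pd

# Hypothetical tables with the same keys in a different row order
left = pd.DataFrame({"instant": [1, 2, 3], "temp": [0.24, 0.22, 0.22]})
right = pd.DataFrame({"instant": [3, 1, 2], "hum": [0.80, 0.81, 0.80]})

checked = pd.merge(
    left, right,
    on="instant",
    how="outer",            # keep unmatched keys so they become visible
    indicator=True,         # adds _merge: 'both', 'left_only', or 'right_only'
    validate="one_to_one",  # fails if 'instant' is duplicated on either side
)

# Rows whose key exists on only one side
unmatched = checked[checked["_merge"] != "both"]
print(len(checked), len(unmatched))
```

If `unmatched` is empty, every join type returns the same rows, which is the situation described in the bullets above.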

In [31]:
combined_data = pd.merge(left=dataset_1, right=dataset_2, 
                         how='inner', #any type works here, but we're choosing inner
                         on='instant')

combined_data.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,01-01-2011,1,0,1,0,False,6,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,01-01-2011,1,0,1,1,False,6,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,01-01-2011,1,0,1,2,False,6,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,01-01-2011,1,0,1,3,False,6,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,01-01-2011,1,0,1,4,False,6,1,0.24,0.2879,0.75,0.0,0,1,1


In [32]:
# check for duplicates
combined_data.duplicated().sum()

0

In [33]:
combined_data.drop_duplicates().shape

(610, 16)

We get the same number of rows as the original dataset. Therefore, there are no duplicates.

#### Data Concatenation (Vertical Integration)

> Note: `concat()` can also do horizontal join, but it's not as clean as `merge()`

Let's use `concat()` to join the data vertically (union in SQL)

In [39]:
dataset_3 = pd.read_csv('/Users/bassel_instructor/Documents/Datasets/rental_bike_dataset_ext.csv')
dataset_3.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,620,29-01-2011,1,0,1,1,False,6,1,0.22,0.2273,0.64,0.194,0,20,20
1,621,29-01-2011,1,0,1,2,False,6,1,0.22,0.2273,0.64,0.1642,0,15,15
2,622,29-01-2011,1,0,1,3,False,6,1,0.2,0.2121,0.64,0.1343,3,5,8
3,623,29-01-2011,1,0,1,4,False,6,1,0.16,0.1818,0.69,0.1045,1,2,3
4,624,29-01-2011,1,0,1,6,False,6,1,0.16,0.1818,0.64,0.1343,0,2,2


In [35]:
combined_data.tail()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
605,606,28-01-2011,1,0,1,11,False,5,3,0.18,0.2121,0.93,0.1045,0,30,30
606,607,28-01-2011,1,0,1,12,False,5,3,0.18,0.2121,0.93,0.1045,1,28,29
607,608,28-01-2011,1,0,1,13,False,5,3,0.18,0.2121,0.93,0.1045,0,31,31
608,609,28-01-2011,1,0,1,14,False,5,3,0.22,0.2727,0.8,0.0,2,36,38
609,610,28-01-2011,1,0,1,15,False,5,2,0.2,0.2576,0.86,0.0,1,40,41


**Observation**
- combined_data (dataset_1 and dataset_2) ends at instant = 610
- the top 5 rows of dataset_3 start at 620
- we need to check if we're missing the rows with instant values between 611 and 619. Possible methods:
    - sort the values
    - calculate the min of instant
    - use filters
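
The "use filters" option from the list above can be sketched like this. The `toy` frame is a made-up, unsorted stand-in for `dataset_3`; `between()` includes both endpoints by default.

```python
import pandas as pd

# Hypothetical unsorted ids, standing in for dataset_3['instant']
toy = pd.DataFrame({"instant": [620, 611, 615, 612]})

# Filter: which rows fall in the 611..619 gap?
in_gap = toy[toy["instant"].between(611, 619)]

# Set difference: which ids in the gap are absent from the data?
expected_ids = set(range(611, 620))
missing_ids = sorted(expected_ids - set(toy["instant"]))

print(in_gap["instant"].tolist(), missing_ids)
```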

In [36]:
dataset_3['instant'].min()

611

In [40]:
dataset_3 = dataset_3.sort_values(by='instant').reset_index(drop=True)
dataset_3.head(10)

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,611,28-01-2011,1,0,1,16,False,5,1,0.22,0.2727,0.8,0.0,10,70,80
1,612,28-01-2011,1,0,1,17,False,5,1,0.24,0.2424,0.75,0.1343,2,147,149
2,613,28-01-2011,1,0,1,18,False,5,1,0.24,0.2273,0.75,0.194,2,107,109
3,614,28-01-2011,1,0,1,19,False,5,2,0.24,0.2424,0.75,0.1343,5,84,89
4,615,28-01-2011,1,0,1,20,False,5,2,0.24,0.2273,0.7,0.194,1,61,62
5,616,28-01-2011,1,0,1,21,False,5,2,0.22,0.2273,0.75,0.1343,1,57,58
6,617,28-01-2011,1,0,1,22,False,5,1,0.24,0.2121,0.65,0.3582,0,26,26
7,618,28-01-2011,1,0,1,23,False,5,1,0.24,0.2273,0.6,0.2239,1,22,23
8,619,29-01-2011,1,0,1,0,False,6,1,0.22,0.197,0.64,0.3582,2,26,28
9,620,29-01-2011,1,0,1,1,False,6,1,0.22,0.2273,0.64,0.194,0,20,20


In [41]:
final_dataset = pd.concat([combined_data, dataset_3])
final_dataset.sample(15)

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
226,227,10-01-2011,1,0,1,17,False,1,1,0.2,0.2273,0.4,0.1045,4,174,178
512,513,23-01-2011,1,0,1,11,False,0,1,0.14,0.1364,0.43,0.2239,22,77,99
315,926,11-02-2011,1,0,2,2,False,5,1,0.1,0.1364,0.54,0.0896,0,3,3
149,760,04-02-2011,1,0,2,1,False,5,2,0.16,0.2273,0.59,0.0,0,7,7
321,322,14-01-2011,1,0,1,21,False,5,1,0.16,0.2273,0.69,0.0,4,48,52
237,848,07-02-2011,1,0,2,18,False,1,2,0.34,0.3333,0.66,0.1343,5,170,175
525,526,24-01-2011,1,0,1,0,False,1,1,0.06,0.0606,0.41,0.194,0,7,7
163,164,08-01-2011,1,0,1,2,False,6,2,0.18,0.2424,0.55,0.0,3,13,16
174,785,05-02-2011,1,0,2,3,False,6,2,0.24,0.2424,0.75,0.1642,1,10,11
590,591,27-01-2011,1,0,1,19,False,4,1,0.2,0.2273,0.69,0.0896,3,76,79
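
Note that in the sample above the row labels no longer match `instant` for the rows that came from dataset_3: by default `concat()` keeps each input's original index, so labels repeat. A minimal sketch of the difference, on made-up frames (`first` and `second` are assumptions), using the `ignore_index` option:

```python
import pandas as pd

# Hypothetical frames, each with its own 0-based index
first = pd.DataFrame({"instant": [1, 2]})
second = pd.DataFrame({"instant": [611, 612]})

stacked = pd.concat([first, second])                        # keeps original labels
renumbered = pd.concat([first, second], ignore_index=True)  # builds a fresh 0..n-1 index

print(stacked.index.tolist(), renumbered.index.tolist())
```

Duplicate labels are not an error, but label-based selection such as `loc[0]` then returns more than one row, so renumbering is often the safer choice.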


### DQA Techniques
- Data Quality Assurance (DQA) is a set of methods for checking that the data has no errors and is ready for analysis
- It can be done before building a model or right after importing data (before data wrangling)
- The goal is to make sure the data is: accurate, consistent, complete, reasonable, and reliable
- You can use domain expertise and data dictionaries to determine the DQA results (sanity checks)
- Types of data validation checks:
    - nulls (missing data)
    - row counts after joins
    - duplicates
    - data type consistency e.g. no text values in a numeric column
    - range check (domain expertise) or extreme values (outliers)
    - format check
    - units (measurements)
    - white spaces

#### Sanity Checks

In [43]:
#checking that the union or concat worked well
final_dataset.shape[0] == combined_data.shape[0] + dataset_3.shape[0]

True

In [51]:
# for the season column, check that we have at most 4 distinct values
final_dataset['season'].unique()

array([1])

We have only 1 season. While the data value is correct, we need to be aware that we might be missing the other seasons.

In [53]:
# check that month values fall between 1 and 12
final_dataset['mnth'].unique()

array([1, 2])

We have only 2 months, now it makes sense why we have only 1 season.

In [54]:
# according to the data dictionary, cnt is the total number of rentals (casual + registered)
final_dataset['cnt'].sum() == final_dataset['casual'].sum() + final_dataset['registered'].sum()

True

The check confirms that, in total, cnt = registered + casual.
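
Comparing column totals is a quick check, but totals can agree even when individual rows do not. A stricter variant compares row by row. The sketch below uses a made-up frame (`toy`) built so that the totals match while both rows are wrong:

```python
import pandas as pd

# Hypothetical data: row errors of +1 and -1 cancel out in the total
toy = pd.DataFrame({"casual": [3, 8], "registered": [13, 32], "cnt": [17, 39]})

totals_match = toy["cnt"].sum() == toy["casual"].sum() + toy["registered"].sum()
rows_match = toy["cnt"] == toy["casual"] + toy["registered"]

print(totals_match, rows_match.all())

# Rows that violate cnt = casual + registered
violations = toy[~rows_match]
```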

In [58]:
# check for temp range
final_dataset['temp'].describe()

count    989.000000
mean       0.204712
std        0.077789
min        0.020000
25%        0.160000
50%        0.200000
75%        0.240000
max        0.460000
Name: temp, dtype: float64

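The values fall between 0.02 and 0.46, which is consistent with a scaled (0 to 1) temperature. Reading them as 2 to 46 degrees Celsius assumes a simple factor of 100, so confirm the scaling with the data provider before converting.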

#### Checking for Nulls

In [46]:
# overall null counts
final_dataset.isna().sum().sum()

11

In [45]:
# nulls by column summary
final_dataset.isna().sum()

instant        0
dteday         0
season         0
yr             0
mnth           0
hr             0
holiday        0
weekday        0
weathersit     0
temp           0
atemp         11
hum            0
windspeed      0
casual         0
registered     0
cnt            0
dtype: int64

**Observation**
- We have 11 nulls in the atemp column
- Addressing nulls can be done with 2 main methods:
    - Imputation: replacing nulls with sensible values
    - Dropping nulls
- The dataset above is a good example of where dropping nulls is a reasonable approach, because 11 nulls are not significant
- Steps:
    - Get the ratio of nulls
    - Recommendation: if nulls represent a small portion of the dataset, then drop them
    - As a rule of thumb, "small" means roughly 1% to 5% of the rows
    - In other words, we expect a small impact from losing the rows with null values
- NOTE: if the column with nulls is not important, dropping the column may be a better idea
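
The two options in the list above can be compared side by side. This is a sketch on a made-up `atemp` column; using the median as the fill value is one common choice and an assumption here, not something the exercise prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
toy = pd.DataFrame({"atemp": [0.2, np.nan, 0.4, 0.3]})

# Ratio of nulls in the column (mean of a boolean mask)
null_share = toy["atemp"].isna().mean()

# Option 1: imputation with the column median
imputed = toy["atemp"].fillna(toy["atemp"].median())

# Option 2: drop the rows where atemp is null
dropped = toy.dropna(subset=["atemp"])

print(f"{null_share:.0%}", imputed.tolist(), len(dropped))
```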


In [48]:
# share of rows that contain at least one null
null_perc = final_dataset.isna().any(axis=1).sum() / len(final_dataset)
print(f'Percent of Nulls: {null_perc:.2%}')

Percent of Nulls: 1.10%


**Conclusion** Only 1.1% of the rows contain nulls, so we can safely drop those rows.

In [49]:
#dropping null rows
final_dataset = final_dataset.dropna()
final_dataset.shape

(989, 16)

### Renaming Columns

In [60]:
final_dataset.columns

Index(['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual',
       'registered', 'cnt'],
      dtype='object')

In [None]:
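# optional: replace spaces in column names with underscores
#final_dataset.columns = final_dataset.columns.str.replace(' ', '_')

# method 1: assign the full list of names (length and order must match the existing columns)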
final_dataset.columns = ['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual',
       'registered', 'cnt']

In [61]:
# method 2: rename only selected columns with a mapping
final_dataset = final_dataset.rename(columns={'yr': 'year',
                                              'mnth': 'month',
                                              'hr': 'hour',
                                              'cnt': 'count'})
final_dataset.head()

Unnamed: 0,instant,dteday,season,year,month,hour,holiday,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,count
0,1,01-01-2011,1,0,1,0,False,6,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,01-01-2011,1,0,1,1,False,6,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,01-01-2011,1,0,1,2,False,6,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,01-01-2011,1,0,1,3,False,6,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,01-01-2011,1,0,1,4,False,6,1,0.24,0.2879,0.75,0.0,0,1,1


In [75]:
# capitalize the first letter of each column name
final_dataset.columns = final_dataset.columns.str.title()
final_dataset.head()

Unnamed: 0,Instant,Dteday,Season,Year,Month,Hour,Holiday,Weekday,Weathersit,Temp,Atemp,Hum,Windspeed,Casual,Registered,Count
0,1,01-01-2011,1,0,1,0,False,6,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,01-01-2011,1,0,1,1,False,6,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,01-01-2011,1,0,1,2,False,6,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,01-01-2011,1,0,1,3,False,6,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,01-01-2011,1,0,1,4,False,6,1,0.24,0.2879,0.75,0.0,0,1,1


## Additional Data Wrangling Techniques

#### `dropna()` with threshold

- `thresh` sets the minimum number of non-null values a row must have to be kept

In [64]:
data = {
    'a':[1,3,9,np.nan],
    'b':[np.nan, 2, 4, np.nan],
    'c':[np.nan, 1, 7, 6],
    'd':[np.nan, 2, np.nan, 9]
    }

df = pd.DataFrame(data)
df

Unnamed: 0,a,b,c,d
0,1.0,,,
1,3.0,2.0,1.0,2.0
2,9.0,4.0,7.0,
3,,,6.0,9.0


In [71]:
# keep only rows with at least 4 non-null values
# with 4 columns, this drops every row that has any null (very strict)
# this gives the same result as calling dropna() without a threshold
df.dropna(thresh=4)

Unnamed: 0,a,b,c,d
1,3.0,2.0,1.0,2.0


In [68]:
# keep rows with at least 3 non-null values (drops rows with more than 1 null)
df.dropna(thresh=3)

Unnamed: 0,a,b,c,d
1,3.0,2.0,1.0,2.0
2,9.0,4.0,7.0,


In [69]:
# keep rows with at least 2 non-null values (a max of 2 nulls per row)
df.dropna(thresh=2)

Unnamed: 0,a,b,c,d
1,3.0,2.0,1.0,2.0
2,9.0,4.0,7.0,
3,,,6.0,9.0


In [72]:
# keep rows with at least 1 non-null value (only all-null rows would be dropped)
df.dropna(thresh=1)

Unnamed: 0,a,b,c,d
0,1.0,,,
1,3.0,2.0,1.0,2.0
2,9.0,4.0,7.0,
3,,,6.0,9.0
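
Because `thresh` counts non-null values, it can be easier to think in terms of "maximum nulls allowed" and convert. The sketch below rebuilds the same `df` as above; the `max_nulls` variable is a helper name introduced here for illustration. It also shows `subset`, which limits the null check to chosen columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 3, 9, np.nan],
    "b": [np.nan, 2, 4, np.nan],
    "c": [np.nan, 1, 7, 6],
    "d": [np.nan, 2, np.nan, 9],
})

# Allow at most 1 null per row: thresh = number of columns - max nulls
max_nulls = 1
at_most_one_null = df.dropna(thresh=df.shape[1] - max_nulls)

# Only require columns c and d to be non-null; a and b may be null
complete_c_d = df.dropna(subset=["c", "d"])

print(at_most_one_null.index.tolist(), complete_c_d.index.tolist())
```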
