**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_1_)    
- [Data type constraints](#toc1_2_)    
- [Data range constraints](#toc1_3_)    
- [Duplicate data](#toc1_4_)    


### <a id='toc1_1_'></a>[Import Statements](#toc0_)

In [1]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

### <a id='toc1_2_'></a>[Data type constraints](#toc0_)

<u>Useful functions and methods</u>
- `df.info()` - prints a summary with the data type of each column and the number of non-null values
- `df.dtypes` - returns the data type of each column
- `df.select_dtypes()` - returns a dataframe with only the columns that are of the specified data type
- `df.describe()` - returns a dataframe with the count, mean, standard deviation, minimum, maximum, and quartiles of each numeric column
- `df.astype()` - converts the data type of a column/multiple columns to the specified data type(s)
- The `.str` accessor - allows you to apply string methods to a column that contains strings/objects

In [2]:
df_ride_sharing = pd.read_csv("../datasets/ride_sharing_new.csv", index_col=0)

In [3]:
df_ride_sharing.head(3)

|  | duration | station_A_id | station_A_name | station_B_id | station_B_name | bike_id | user_type | user_birth_year | user_gender |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 minutes | 81 | Berry St at 4th St | 323 | Broadway at Kearny | 5480 | 2 | 1959 | Male |
| 1 | 24 minutes | 3 | Powell St BART Station (Market St at 4th St) | 118 | Eureka Valley Recreation Center | 5193 | 2 | 1965 | Male |
| 2 | 8 minutes | 67 | San Francisco Caltrain Station 2 (Townsend St... | 23 | The Embarcadero at Steuart St | 3652 | 3 | 1993 | Male |


In [4]:
df_ride_sharing.info()

<class 'pandas.core.frame.DataFrame'>
Index: 25760 entries, 0 to 25759
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   duration         25760 non-null  object
 1   station_A_id     25760 non-null  int64 
 2   station_A_name   25760 non-null  object
 3   station_B_id     25760 non-null  int64 
 4   station_B_name   25760 non-null  object
 5   bike_id          25760 non-null  int64 
 6   user_type        25760 non-null  int64 
 7   user_birth_year  25760 non-null  int64 
 8   user_gender      25760 non-null  object
dtypes: int64(5), object(4)
memory usage: 2.0+ MB


In [5]:
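# duration is stored as text such as "12 minutes"; remove the literal unit suffix and convert to int
# (str.strip(" minutes") would strip a set of characters, not the suffix)
df_ride_sharing["duration"] = (
    df_ride_sharing["duration"].str.replace(" minutes", "", regex=False).astype(int)
)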

In [6]:
assert pd.api.types.is_integer_dtype(df_ride_sharing["duration"])
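
A note on removing unit suffixes: `str.strip(chars)` treats its argument as a set of characters to trim from both ends, not as a literal suffix. The sketch below uses hypothetical toy values (not rows from the dataset) to show where that matters; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy values in the same "<number> minutes" format
raw = pd.Series(["12 minutes", "24 minutes", "8 minutes"])

# strip(" minutes") trims any of the characters ' ', m, i, n, u, t, e, s from
# both ends. It gives the right answer here only because digits are not in that set.
stripped = raw.str.strip(" minutes").astype(int)

# Replacing the literal text states the intent directly.
replaced = raw.str.replace(" minutes", "", regex=False).astype(int)

# A hypothetical value made up entirely of characters from the set shows the difference.
label = pd.Series(["sunset minutes"])
via_strip = label.str.strip(" minutes").tolist()                      # ['']
via_replace = label.str.replace(" minutes", "", regex=False).tolist() # ['sunset']

print(stripped.tolist(), replaced.tolist())
print(via_strip, via_replace)
```

For purely numeric prefixes both approaches agree, so the choice is mostly about making the code say what it means.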

In [7]:
df_ride_sharing.head(2)

|  | duration | station_A_id | station_A_name | station_B_id | station_B_name | bike_id | user_type | user_birth_year | user_gender |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 81 | Berry St at 4th St | 323 | Broadway at Kearny | 5480 | 2 | 1959 | Male |
| 1 | 24 | 3 | Powell St BART Station (Market St at 4th St) | 118 | Eureka Valley Recreation Center | 5193 | 2 | 1965 | Male |


In [8]:
df_ride_sharing.nunique()

duration            172
station_A_id          9
station_A_name        9
station_B_id        152
station_B_name      152
bike_id            1805
user_type             3
user_birth_year      63
user_gender           3
dtype: int64

In [9]:
# Although "station_A_id", "station_B_id", "bike_id", and "user_type" are stored as integers, they are
# really categorical identifiers. Summary statistics such as the mean are meaningless for them and
# can lead to wrong conclusions.
col_int_to_cat_map = {
    "station_A_id": "category",
    "station_B_id": "category",
    "bike_id": "category",
    "user_type": "category",
}
df_ride_sharing = df_ride_sharing.astype(col_int_to_cat_map)

# "station_A_name" and "station_B_name" could also be converted to categorical, though this is not strictly necessary.

In [10]:
df_ride_sharing.info()

<class 'pandas.core.frame.DataFrame'>
Index: 25760 entries, 0 to 25759
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   duration         25760 non-null  int64   
 1   station_A_id     25760 non-null  category
 2   station_A_name   25760 non-null  object  
 3   station_B_id     25760 non-null  category
 4   station_B_name   25760 non-null  object  
 5   bike_id          25760 non-null  category
 6   user_type        25760 non-null  category
 7   user_birth_year  25760 non-null  int64   
 8   user_gender      25760 non-null  object  
dtypes: category(4), int64(2), object(3)
memory usage: 1.4+ MB
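
To illustrate why the dtype matters for summary statistics, here is a minimal sketch on a hypothetical five-row column (the values are made up, not taken from the dataset). It compares what `describe()` reports for the same codes stored as integers and as a category.

```python
import pandas as pd

# Hypothetical user_type codes; the numbers are labels, not quantities
toy = pd.DataFrame({"user_type": [1, 2, 2, 3, 2]})

# As integers, describe() reports mean, std, quartiles, which have no meaning for labels
as_int = toy["user_type"].describe()

# As a category, describe() reports count, unique, top (most frequent) and freq
as_cat = toy["user_type"].astype("category").describe()

print(as_int["mean"])   # 2.0, an average of label codes
print(as_cat)
```

The categorical summary answers the questions that make sense for identifiers: how many distinct values there are and which one is most common.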


### <a id='toc1_3_'></a>[Data range constraints](#toc0_)

Some data must fall within a known range. For example, a book rating on a 5-point scale should be between 0 and 5, and the date of a completed transaction should not be in the future.

When data falls outside the expected range, we can take different actions depending on the situation.

- The simplest option is to drop the data. However, depending on how much data falls out of range, we could lose essential information. As a rule of thumb, only drop data when a small proportion of the dataset is affected by out-of-range values, and only once you understand the dataset well enough to judge what is being lost.
- Another option is to clip the data, i.e., to set custom minimums or maximums for the affected columns.
- We could also set the data to missing and impute it (for ordered data, interpolation is a good option).
- Depending on the business assumptions behind our data, we could assign a custom value to any values that go beyond a certain range.
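
The four options above can be sketched on a small example. The ratings below are hypothetical, and the fallback value `3.0` in the last option is an arbitrary stand-in for whatever a real business rule would dictate.

```python
import pandas as pd

# Hypothetical ratings on a 0-5 scale; 7.0 and -1.0 are out of range
ratings = pd.DataFrame({"rating": [4.0, 7.0, 3.0, -1.0, 5.0]})
in_range = ratings["rating"].between(0, 5)

# Option 1: drop the out-of-range rows
dropped = ratings[in_range]

# Option 2: clip to the allowed minimum and maximum
clipped = ratings["rating"].clip(lower=0, upper=5)

# Option 3: set out-of-range values to missing, then impute
# (linear interpolation between neighbours, sensible only for ordered data)
as_missing = ratings["rating"].where(in_range)
imputed = as_missing.interpolate()

# Option 4: assign a custom value (3.0 is an assumed placeholder)
custom = ratings["rating"].mask(~in_range, 3.0)

print(dropped["rating"].tolist())  # [4.0, 3.0, 5.0]
print(clipped.tolist())            # [4.0, 5.0, 3.0, 0.0, 5.0]
print(imputed.tolist())            # [4.0, 3.5, 3.0, 4.0, 5.0]
print(custom.tolist())             # [4.0, 3.0, 3.0, 3.0, 5.0]
```

Each option keeps the original frame untouched, which makes it easy to compare the results before committing to one.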

<u>Useful functions and methods</u>
- `df.describe()` - returns a dataframe with the count, mean, standard deviation, minimum, maximum, and quartiles of each numeric column
- `df.drop()` - drops rows or columns from a dataframe
- `df.clip()` - clips the data to the specified minimum and maximum values
- `df.fillna()` - fills missing values with the specified values
- `df.interpolate()` - interpolates the data (useful for ordered data)
- A histogram (for example `df.hist()` or `sns.histplot()`) can help visualize the distribution of the data and spot out-of-range values
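
Range checks apply to dates as well as numbers. The sketch below revisits the "no completed transaction in the future" example with hypothetical data; the frame `tx`, the column `completed_on`, and the fixed value of `today` are all assumptions chosen so the example is reproducible.

```python
import pandas as pd

# "today" is pinned to a fixed, assumed date so the example is reproducible
today = pd.Timestamp("2024-01-15")

# Hypothetical completed transactions; the second one is dated in the future
tx = pd.DataFrame(
    {"completed_on": pd.to_datetime(["2024-01-10", "2024-02-01", "2023-12-31"])}
)

# Find the offending rows
in_future = tx["completed_on"] > today
n_future = int(in_future.sum())
print(n_future)  # 1

# One possible fix: cap future dates at today
tx.loc[in_future, "completed_on"] = today

# Verify the constraint now holds
assert tx["completed_on"].max() <= today
```

Whether capping, dropping, or setting to missing is appropriate depends on why the future dates appeared in the first place.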

### <a id='toc1_4_'></a>[Duplicate data](#toc0_)

<u>Useful functions and methods</u>

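- `df.duplicated(subset, keep)` - returns a boolean Series indicating which rows are duplicates. `subset` limits the comparison to certain columns; `keep` (`"first"`, `"last"`, or `False`) controls which occurrence is left unmarked, with `False` marking all of them
- `df.drop_duplicates(subset, keep)` - drops duplicate rows from a dataframe, using the same `subset` and `keep` arguments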
- `df.sort_values(by)` - sorts a dataframe by the specified column(s). Useful for placing duplicate rows next to each other, including near-duplicates that match on the key columns but differ slightly in others (differences that are likely to be errors). Use with `df.duplicated(keep=False)`.
- `df.groupby(by).agg(func)` - can be used to aggregate duplicate rows into a single row by applying a function to each group of duplicate rows.
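
A minimal sketch of how these methods fit together, on hypothetical ride data. The column `ride_id` and the choice of aggregation functions (mean duration, minimum birth year) are assumptions for illustration; the right functions depend on the dataset.

```python
import pandas as pd

# Hypothetical rides: rows 0 and 1 are exact duplicates,
# rows 2 and 3 share a ride_id but disagree on duration
rides = pd.DataFrame(
    {
        "ride_id": [33, 33, 55, 55, 71],
        "duration": [10, 10, 12, 14, 9],
        "user_birth_year": [1979, 1979, 1985, 1985, 1990],
    }
)

# keep=False flags every member of a duplicate group; sorting puts them side by side
dupes = rides[rides.duplicated(subset="ride_id", keep=False)].sort_values("ride_id")

# Exact duplicates (all columns equal) can simply be dropped
deduped = rides.drop_duplicates()

# Remaining near-duplicates are collapsed to one row per ride_id
collapsed = (
    deduped.groupby("ride_id")
    .agg({"duration": "mean", "user_birth_year": "min"})
    .reset_index()
)

print(dupes)
print(collapsed)
```

After these steps `collapsed["ride_id"]` contains each identifier once, which can be confirmed with `collapsed["ride_id"].is_unique`.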