# Lesson I 

## Data Type Constraints

In this course, we're going to understand how to diagnose different problems in our data and how they can come up during our workflow.

**In this course:**

* Diagnose dirty data
* Side effects of dirty data
* Clean Data

### Why do we need to clean data?

Data Science workflow:

* Access Data
    - Explore and Process Data
        - Extract Insights
            - Report Insights

Dirty data can appear because of duplicate values, misspellings, data type parsing errors, and legacy systems. If we do not make sure that data is properly cleaned during the exploration and processing phase, we compromise the insights and reports generated afterwards.

***Garbage in, garbage out.***

<img src='pictures/datascienceworklow.jpg' />

### Data type Constraints

| **DataType** | **Example** | **Python data type** |
| -------------|-------------|----------------------|
| Text Data | First name, last name, address ... | ``str`` |
| Integers | # Subscribers, # products sold ... | ``int`` |
| Decimals | Temperature, $ Exchange rates ... | ``float`` |
| Binary | Is married, new customer, yes/no ... | ``bool`` |
| Dates | Order Dates, ship dates ... | ``datetime``  |
| Categories | Marriage status, gender ... | ``category`` |
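
As a rough illustration of the table above, the sketch below builds a small DataFrame with one column per kind of data and prints the dtypes pandas assigns. The `people` DataFrame, its column names, and its values are made up for this example. Text columns typically show up as `object` (or a dedicated string dtype, depending on the pandas version).

```python
import pandas as pd

# Hypothetical toy data: one column per kind of data in the table above
people = pd.DataFrame({
    'first_name': ['Ana', 'Ben'],                                 # text
    'subscribers': [120, 45],                                     # integers
    'temperature': [21.5, 19.0],                                  # decimals
    'is_married': [True, False],                                  # binary
    'order_date': pd.to_datetime(['2021-01-05', '2021-02-11']),   # dates
    'marriage_status': pd.Series(['married', 'single'],
                                 dtype='category'),               # categories
})

# One dtype per column
print(people.dtypes)
```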


### String to Integers

Let's take a look at the following example:

```python
# Import CSV file and output header
sales = pd.read_csv('sales.csv')
sales.head(2)
```

<img src='pictures/saleshead.jpg' />

Here's the head of a DataFrame containing the revenue generated and the quantity of items sold for a sales order. We want to calculate the total revenue generated by all sales orders. As we can see, the Revenue column has the dollar sign on the right hand side.

```python
# Get data types of columns
sales.dtypes    
```

<img src='pictures/datatypes.jpg' />

Inspecting the column data types with the ``.dtypes`` attribute shows that the Revenue column has type *object*, which is how pandas stores strings.

We can also check the data types as well as the number of missing values per column in a DataFrame, by using the ``.info()`` method.

```python
# Get DataFrame information
sales.info()
```

Since the Revenue column is a string, summing across all sales orders returns one large concatenated string containing each row's string. To fix this, we need to first remove the **$** sign from the string so that pandas is able to convert the strings into numbers without error. 


```python
# Print sum of all Revenue column
sales['Revenue'].sum()  # This will return all of the strings concatenated
```

We do this with the ```.str.strip()``` method, while specifying the string we want to strip as an argument, which is in this case the *dollar sign*. Since our dollar values do not contain decimals, we then convert the Revenue column to an *integer* by using the ```.astype()``` method, specifying the desired data type as argument. 

```python
# Remove $ from Revenue column
sales['Revenue'] = sales['Revenue'].str.strip('$')
sales['Revenue'] = sales['Revenue'].astype('int')
```

Had our revenue values been decimal, we would have converted the Revenue column to *float*. We can make sure that the Revenue column is now an integer by using the ```assert``` statement, which takes a condition as input and *returns nothing* if that condition is met, or raises an *error* if it is not.

```python
# Verify that Revenue is now an integer
assert sales['Revenue'].dtype == 'int'
```
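
The steps above can be tried end to end on a small stand-in for `sales.csv`. The sketch below is only an illustration: the three revenue values and the `Quantity` column are invented, since the real file is not shown here.

```python
import pandas as pd

# Hypothetical stand-in for sales.csv (values are made up)
sales = pd.DataFrame({'Revenue': ['23153$', '1457$', '36865$'],
                      'Quantity': [7, 2, 5]})

# Summing a string column concatenates the strings
concatenated = sales['Revenue'].sum()

# Strip the dollar sign, then convert the column to integer
sales['Revenue'] = sales['Revenue'].str.strip('$').astype('int')

# Summing now gives a numeric total
total = sales['Revenue'].sum()
print(total)
```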

### Numeric or Categorical?

Some data looks numeric but actually represents a finite set of categories. This is called **categorical data**.

```python
marriage_status
3
1
2
```

``0`` = Never Married ``1`` = Married ``2`` = Separated ``3`` = Divorced

Here we have a marriage status column, which is represented by 0, 1, 2, and 3.

However, it will be imported as an integer type, which could lead to misleading results when we extract statistical summaries.

We can solve this by using the ``.astype()`` method seen earlier, this time specifying the category data type.

```python
# Convert to categorical
df["marriage_status"] = df['marriage_status'].astype('category')
```
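
To see why the integer type is misleading, the following sketch compares ``.describe()`` before and after the conversion. The six status codes are invented for the example.

```python
import pandas as pd

# Hypothetical marriage status codes (0-3), made up for illustration
df = pd.DataFrame({'marriage_status': [3, 1, 2, 1, 0, 1]})

# As integers, describe() reports mean, std, quartiles:
# numbers that carry no meaning for category labels
numeric_summary = df['marriage_status'].describe()

# As a category, describe() reports count, unique, top and freq
df['marriage_status'] = df['marriage_status'].astype('category')
category_summary = df['marriage_status'].describe()

print(numeric_summary)
print(category_summary)
```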

## Exercise

### Numeric data or ...?

In this exercise, and throughout this chapter, you'll be working with bicycle ride sharing data in San Francisco called ```ride_sharing```. It contains information on the start and end stations, the trip duration, and some user information for a bike sharing service.

The ```user_type``` column contains information on whether a user is taking a free ride and takes on the following values:
* ``1`` for free riders.
* ``2`` for pay per ride.
* ``3`` for monthly subscribers.

In this instance, you will print the information of ```ride_sharing``` using ```.info()``` and see a firsthand example of how an incorrect data type can flaw your analysis of the dataset.

```python
# Import packages
import pandas as pd

# Ride sharing DataFrame
ride_sharing = pd.read_csv('datasets/ride_sharing_new.csv')

# Print the information of ride_sharing
print(ride_sharing.info())

# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())

# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')

# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'

# Print new summary statistics
print(ride_sharing['user_type_cat'].describe())
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25760 entries, 0 to 25759
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       25760 non-null  int64 
 1   duration         25760 non-null  object
 2   station_A_id     25760 non-null  int64 
 3   station_A_name   25760 non-null  object
 4   station_B_id     25760 non-null  int64 
 5   station_B_name   25760 non-null  object
 6   bike_id          25760 non-null  int64 
 7   user_type        25760 non-null  int64 
 8   user_birth_year  25760 non-null  int64 
 9   user_gender      25760 non-null  object
dtypes: int64(6), object(4)
memory usage: 2.0+ MB
None
count    25760.000000
mean         2.008385
std          0.704541
min          1.000000
25%          2.000000
50%          2.000000
75%          3.000000
max          3.000000
Name: user_type, dtype: float64
count     25760
unique        3
top           2
freq      12972
Name: user_type_cat, dty
```

### Summing strings and concatenating numbers

In the previous exercise, you were able to identify that ```category``` is the correct data type for ``user_type`` and convert it in order to extract relevant statistical summaries that shed light on the distribution of ```user_type```.

Another common data type problem is importing what should be numerical values as strings, as mathematical operations such as summing and multiplication lead to string concatenation, not numerical outputs.

In this exercise, you'll be converting the string column ```duration``` to the type ```int```. Before that however, you will need to make sure to strip ```"minutes"``` from the column in order to make sure ```pandas``` reads it as numerical. 

```python
# Strip duration of minutes
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip('minutes')

# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')

# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'

# Print formed columns and calculate average ride duration
print(ride_sharing[['duration','duration_trim','duration_time']])
print(ride_sharing['duration_time'].mean())
```

```
         duration duration_trim  duration_time
0      12 minutes           12              12
1      24 minutes           24              24
2       8 minutes            8               8
3       4 minutes            4               4
4      11 minutes           11              11
...           ...           ...            ...
25755  11 minutes           11              11
25756  10 minutes           10              10
25757  14 minutes           14              14
25758  14 minutes           14              14
25759  29 minutes           29              29

[25760 rows x 3 columns]
11.389052795031056
```
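
One detail worth keeping in mind: ``.str.strip()`` treats its argument as a set of characters to remove from both ends, not as a whole word. That works for values like ``"12 minutes"``, but it leaves the space before the word in place. The sketch below, on a made-up three-row Series, contrasts it with ``.str.replace()``, which removes the exact substring. This is one possible alternative, not the approach the exercise prescribes.

```python
import pandas as pd

# Hypothetical durations in the same format as the exercise
durations = pd.Series(['12 minutes', '24 minutes', '8 minutes'])

# strip() removes any of the characters m, i, n, u, t, e, s from both ends,
# so the space before the word is left behind
stripped = durations.str.strip('minutes')

# replace() with regex=False removes the exact substring ' minutes'
replaced = durations.str.replace(' minutes', '', regex=False)

# The cleaned strings convert to integers without surprises
minutes = replaced.astype('int')
print(minutes.mean())
```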


# Lesson II

## Data range Constraints

Let's first start off with some motivation. Imagine we have a dataset of movies with their respective average rating from a streaming service. The rating can be any integer between 1 and 5.

```python
movies.head()
```

<img src='pictures/movies.jpg' />

After creating a histogram with matplotlib, we see that there are a few movies with an average rating of 6, which is well above the allowable range.

```python
import matplotlib.pyplot as plt
plt.hist(movies['avg_rating'])
plt.title('Average rating of movies (1-5)')
```

<img src='pictures/movieshist.jpg' />

Here's another example, where we see subscription dates in the future for a service. We use the ```date.today()``` function from the datetime package to get today's date, and we filter the dataset for any subscription date later than today's date. ***We need to pay attention to the range of our data!***

```python
# Import datetime
import datetime as dt
# Store today's date and keep only rows with a later subscription date
today_date = dt.date.today()
user_signups[user_signups['subscription_date'] > today_date]
```

<img src='pictures/subdate.jpg' />

### How to deal with out of range Data?

* Dropping Data 
    - Simplest option
    - However, we could lose essential information
    - Rule of thumb:
        - Only drop data when a small proportion of your dataset is affected by out of range values.

* Setting custom minimums and maximums
* Treat as missing and impute
* Setting custom value depending on business assumptions
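
The "treat as missing and impute" option can be sketched as follows. This is a minimal illustration under assumptions: the `movies` data and the `movie_name` column are invented, and the median is just one possible imputation choice.

```python
import numpy as np
import pandas as pd

# Hypothetical movie ratings; 6.0 is outside the allowed 1-5 range
movies = pd.DataFrame({'movie_name': ['A', 'B', 'C', 'D'],
                       'avg_rating': [4.0, 6.0, 3.0, 5.0]})

# Treat out of range ratings as missing
movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = np.nan

# Impute with the median of the remaining valid ratings (one possible choice)
median_rating = movies['avg_rating'].median()
movies['avg_rating'] = movies['avg_rating'].fillna(median_rating)

print(movies)
```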

#### Movie Example

Let's take a look at the movies example. We first isolate the movies with ratings higher than 5. If these values affect only a small part of our data, we can drop them.

```python
import pandas as pd
# Output Movies with rating > 5
movies[movies['avg_rating'] > 5]
```

**We can drop data in two ways**:

* We can create a new filtered movies DataFrame where we only keep values of ```avg_rating``` lower than or equal to 5.

```python
# Drop values using filtering
movies = movies[movies['avg_rating'] <= 5]
```

* Or drop the values by using the drop method.

```python
# Drop values using .drop()
movies.drop(movies[movies['avg_rating'] > 5].index, inplace=True)
# Assert results
assert movies['avg_rating'].max() <= 5 
```

* We can also change the out of range values to a hard limit.

```python
# Convert avg_rating > 5 to 5
movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5

# Assert statement
assert movies['avg_rating'].max() <= 5
```
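
For comparison, here is a small self-contained sketch of dropping versus capping on invented ratings. It uses ``Series.clip()`` for the cap, which is an alternative to the ``.loc`` assignment shown above rather than the method used in this lesson.

```python
import pandas as pd

# Hypothetical ratings; two of them exceed the maximum of 5
movies = pd.DataFrame({'avg_rating': [4.5, 6.0, 3.0, 5.5]})

# Option 1: drop out of range rows by filtering
kept = movies[movies['avg_rating'] <= 5]

# Option 2: cap values at the hard limit with clip()
capped = movies['avg_rating'].clip(upper=5)

assert kept['avg_rating'].max() <= 5
assert capped.max() <= 5
```

Dropping shrinks the dataset, while capping keeps every row but changes the affected values.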

#### Date Range example

We first look at the data types of the columns with the ```.dtypes``` attribute. We can confirm that the ```subscription_date``` column is an *object* and not a *date* or *datetime* object. To compare the values in this column to a date, the first step is to *convert* them to dates as well.

```python
import datetime as dt
import pandas as pd
# Output data types
user_signups.dtypes
```

We do so by first converting it into a pandas ```datetime``` object with the ```to_datetime()``` function from pandas, which takes in as an argument the column we want to convert. 

We then need to convert the ```datetime``` object into a ```date```. This conversion is done by appending ```.dt.date``` to the code.

```python
# Convert to date
user_signups['subscription_date'] = pd.to_datetime(user_signups['subscription_date']).dt.date
```

Could we have converted from an object directly to a date, without the pandas datetime conversion in the middle? ***Yes!*** But we'd have had to provide information about the date's format as a string, so it's just as easy to do it this way.
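
As a sketch of that direct route, the standard library's ``datetime.strptime()`` can parse each string when given the format explicitly. The two sample dates and the year-month-day format below are assumptions for illustration; the real column may use a different format.

```python
import datetime as dt
import pandas as pd

# Hypothetical signup dates stored as strings
user_signups = pd.DataFrame({'subscription_date': ['2021-03-14', '2022-11-02']})

# Assumed format of the strings: year-month-day
date_format = '%Y-%m-%d'

# Parse each string straight into a datetime.date
user_signups['subscription_date'] = user_signups['subscription_date'].apply(
    lambda s: dt.datetime.strptime(s, date_format).date()
)

print(user_signups['subscription_date'].iloc[0])
```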


Now that the column is a ```date```, we can treat it in a variety of ways. We first create a ```today_date``` variable using the datetime function ```date.today()```, which allows us to store today's date.

```python
today_date = dt.date.today()
```

**Drop the Data**

```python
# Drop values using filtering
user_signups = user_signups[user_signups['subscription_date'] <= today_date]
# Drop values using .drop()
user_signups.drop(user_signups[user_signups['subscription_date'] > today_date].index, inplace = True)
```

**Hardcode dates with upper limit**

```python
# Set dates in the future to today's date
user_signups.loc[user_signups['subscription_date'] > today_date, 'subscription_date'] = today_date
# Assert is true (the column already holds date objects, so no .date() call is needed)
assert user_signups['subscription_date'].max() <= today_date
```
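
The whole date workflow can be run on a tiny made-up example. In the sketch below, the two signup dates are invented, with one deliberately placed a year in the future.

```python
import datetime as dt
import pandas as pd

today_date = dt.date.today()
future_date = today_date + dt.timedelta(days=365)

# Hypothetical signups: one valid date and one in the future
user_signups = pd.DataFrame({
    'subscription_date': ['2020-05-01', future_date.isoformat()]
})

# Convert strings to pandas datetime, then to date
user_signups['subscription_date'] = pd.to_datetime(
    user_signups['subscription_date']).dt.date

# Cap future dates at today's date
user_signups.loc[user_signups['subscription_date'] > today_date,
                 'subscription_date'] = today_date

assert user_signups['subscription_date'].max() <= today_date
```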


## Exercise

### Tire Size Constraints

In this lesson, you're going to build on top of the work you've been doing with the ```ride_sharing``` DataFrame. You'll be working with the ```tire_sizes``` column which contains data on each bike's tire size.

Bicycle tire sizes could be either *26″, 27″ or 29″* and are here correctly stored as a categorical value. In an effort to cut maintenance costs, the ride sharing provider decided to set the maximum tire size to be *27″*.

In this exercise, you will make sure the ```tire_sizes``` column has the correct range by first converting it to an *integer*, then setting and testing the new upper limit of 27″ for tire sizes.

```python
# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')

# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27

# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')

# Print tire size description
print(ride_sharing['tire_sizes'].describe())
```

### Back to the Future

An update to the data pipeline feeding into the ```ride_sharing``` DataFrame now registers each ride's date. This information is stored in the ```ride_date``` column of the type object, which represents strings in pandas.

A bug was discovered which was relaying rides taken today as taken next year. To fix this, you will find all instances of the ```ride_date``` column that occur anytime in the *future*, and set the maximum possible value of this column to today's date. Before doing so, you will need to convert ```ride_date``` to a datetime object.

```python
# Import datetime
import datetime as dt

# Convert ride_date to date and store it in ride_dt
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date

# Save today's date
today = dt.date.today()

# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today

# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())
```

# Lesson III

## Uniqueness Constraints

Another common data cleaning problem is **duplicate values**.

Duplicate values can be diagnosed when we have the same exact information repeated across multiple rows, for some or all columns in our DataFrame.

**All Columns have the Same Values**

| **First Name** | **Last Name** | **Address** | **Height** | **Weight** |
| -----------|-----------|--------|--------|--------|
| Justin | Saddlemyer | Boulevard du Jardin Botanique 3, Bruxelles | 193 cm | 87 kg |
| Justin | Saddlemyer | Boulevard du Jardin Botanique 3, Bruxelles | 193 cm | 87 kg |

**Most Columns have the same values**

| **First Name** | **Last Name** | **Address** | **Height** | **Weight** |
| -----------|-----------|--------|--------|--------|
| Justin | Saddlemyer | Boulevard du Jardin Botanique 3, Bruxelles | 193 cm | 87 kg |
| Justin | Saddlemyer | Boulevard du Jardin Botanique 3, Bruxelles | **194 cm** | 87 kg |

### Why do they happen?

* Data entry and human errors
* Bugs and design errors
* Join or merge errors

### How to find duplicate values?

In this example, we're working with a bigger version of the height and weight data seen earlier.

```python
# Print the header
height_weight.head()
```

<img src='pictures/duplicate.jpg' />

We can find duplicates in a DataFrame by using the ```.duplicated()``` method. It returns a Series of boolean values that are True for duplicate values, and False for non-duplicated values.

```python
# Get duplicates across all columns
duplicates = height_weight.duplicated()
height_weight[duplicates]
```

<img src='pictures/duplicate1.jpg' />

We can see exactly which rows are affected by using brackets as such. However, using ```.duplicated()``` without playing around with the arguments of the method can lead to misleading results, as all the columns are required to have duplicate values by default, with all duplicate values being marked as ``True`` except for the first occurrence. 

This limits our ability to properly diagnose what type of duplication we have, and how to effectively treat it.

To properly calibrate how we go about finding duplicates, we will use 2 arguments from the ``.duplicated()`` method:

* ``subset`` : List of column names to check for duplication
    - For example, it allows us to find duplicates for the first and last name columns only
* ``keep`` : Which occurrence to leave unmarked: the **first** (``'first'``), the **last** (``'last'``), or none of them, so that **all** duplicate values are marked (``False``).

In this example, we're checking for duplicates across the first name, last name, and address variables, and we're choosing to keep all duplicates.

```python
# Column names to check for duplication
column_names = ['first_name','last_name', 'address']
duplicates = height_weight.duplicated(subset= column_names, keep=False)
```

To get a better bird's eye view of the duplicates, we sort the duplicate rows using the ```.sort_values()``` method, choosing ```first_name``` to sort by:

```python
# Output duplicate values
height_weight[duplicates].sort_values(by= 'first_name')
```

### How to treat duplicate values?

The complete duplicates can be treated easily. All that is required is to keep one of them and discard the others. This can be done with the ``.drop_duplicates()`` method, together with the inplace argument, which drops the duplicated values directly inside the ``height_weight`` DataFrame.

Here we are dropping complete duplicates only, so it is neither necessary nor advisable to set a subset, and since the keep argument defaults to ``'first'``, we can leave it as is. Note that we could also set it to ``'last'``, but not to ``False``, as that would drop every copy of a duplicated row instead of keeping one.

```python
# Drop duplicates
height_weight.drop_duplicates(inplace=True)
```
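
A short sketch of how the ``keep`` argument changes what ``.drop_duplicates()`` returns, using three invented rows where the first two are complete duplicates:

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical
df = pd.DataFrame({'first_name': ['Justin', 'Justin', 'Ana'],
                   'height': [193, 193, 170]})

# keep='first' (the default): one copy of the duplicated row survives
keep_first = df.drop_duplicates()

# keep=False: every copy of the duplicated row is dropped
keep_none = df.drop_duplicates(keep=False)

print(keep_first)
print(keep_none)
```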

This leaves us with the other 2 sets of duplicates discussed earlier, which are the same for first_name, last_name and address, but contain discrepancies in height and weight. Apart from dropping rows with really small discrepancies, we can use a statistical measure to combine each set of duplicated values.

```python
# Output duplicate values
column_names = ['first_name', 'last_name', 'address']
duplicates = height_weight.duplicated(subset=column_names, keep=False)
height_weight[duplicates].sort_values(by='first_name')
```

For example, we can combine these two rows into one by computing the mean between them, or the maximum, or another statistical measure. The right choice depends on a common sense understanding of our data and what type of data we have.

<img src='pictures/duplicates2.jpg' />

We can do this easily using the groupby method, which when chained with the agg method, lets you group by a set of common columns and return statistical values for specific columns when the aggregation is being performed.

**The ``.groupby()`` and ``.agg()`` methods**

```python
# Group by column names and produce statistical summaries
column_names = ['first_name', 'last_name', 'address']
summaries = {'height':'max', 'weight':'mean'}
height_weight = height_weight.groupby(by= column_names).agg(summaries).reset_index()
# Make sure aggregation is done
duplicates = height_weight.duplicated(subset= column_names, keep= False)
height_weight[duplicates].sort_values(by= 'first_name')
```

Here, we created a dictionary called ``summaries``, which instructs *groupby* to return the maximum of duplicated rows for the height column, and the mean of duplicated rows for the weight column.

We then group ``height_weight`` by the *column names* defined earlier, and chain it with the ``agg`` method, which takes in the *summaries* dictionary we created. We chain this entire line with the ``.reset_index()`` method, so that we have numbered indices in the final output.

We can verify that there are no more duplicate values by running the duplicated method again, and use brackets to output duplicate rows.
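
The aggregation can be checked on a small invented table. In this sketch, the two "Ivor Pierce" rows disagree on height and weight and are merged into one row using the maximum height and the mean weight.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are incomplete duplicates
height_weight = pd.DataFrame({
    'first_name': ['Ivor', 'Ivor', 'Mary'],
    'last_name': ['Pierce', 'Pierce', 'Colon'],
    'address': ['102 Non Road', '102 Non Road', '4674 Ut Rd.'],
    'height': [168, 170, 179],
    'weight': [66, 88, 75],
})

column_names = ['first_name', 'last_name', 'address']
summaries = {'height': 'max', 'weight': 'mean'}

# One row per (first_name, last_name, address) group
merged = height_weight.groupby(by=column_names).agg(summaries).reset_index()

# No duplicates remain on the grouping columns
assert not merged.duplicated(subset=column_names, keep=False).any()
print(merged)
```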

## Exercise

### Finding duplicates

A new update to the data pipeline feeding into ```ride_sharing``` has added the ``ride_id`` column, which represents a unique identifier for each ride.

The update however coincided with radically shorter average ride duration times and irregular user birth dates set in the future. Most importantly, the number of rides taken has increased by *20%* overnight, leading you to think there might be both complete and incomplete duplicates in the ``ride_sharing`` DataFrame.

In this exercise, you will confirm this suspicion by finding those duplicates.

```python
# Find duplicates
duplicates = ride_sharing.duplicated(subset='bike_id', keep= False)

# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values('bike_id')

# Print relevant columns of duplicated_rides
print(duplicated_rides[['bike_id','duration','user_birth_year']])
```

```
       bike_id    duration  user_birth_year
3638        11  12 minutes             1988
6088        11   5 minutes             1985
10857       11   4 minutes             1987
10045       27  13 minutes             1989
16104       27  10 minutes             1970
...        ...         ...              ...
8812      6638  10 minutes             1986
6815      6638   5 minutes             1995
8456      6638   7 minutes             1983
8300      6638   6 minutes             1962
8380      6638   8 minutes             1984

[25717 rows x 3 columns]
```


### Treating duplicates

In the last exercise, you were able to verify that the new update feeding into ``ride_sharing`` contains a bug generating both complete and incomplete duplicated rows for some values of the ``ride_id`` column, with occasional discrepant values for the ``user_birth_year`` and ``duration`` columns.

In this exercise, you will be treating those duplicated rows by first dropping complete duplicates, and then merging the incomplete duplicate rows into one while keeping the average duration, and the minimum ``user_birth_year`` for each set of incomplete duplicate rows.

```python
# CODE WILL ONLY WORK ON DATACAMP WORKSPACE

# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()

# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'mean'}

# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby('bike_id').agg(statistics).reset_index()

# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'bike_id', keep = False)
duplicated_rides = ride_unique[duplicates == True]

# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0
```

```
TypeError: Could not convert 12 minutes5 minutes4 minutes to numeric
```
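
The ``TypeError`` above is consistent with ``duration`` still being a string column in the local dataset, so taking its mean fails. A possible workaround, sketched below on a few invented rows, is to aggregate a numeric version of the column instead. The sample data and the choice of ``bike_id`` as the grouping key are assumptions made to mirror the cell above.

```python
import pandas as pd

# Hypothetical rides; rows 0 and 1 are complete duplicates
ride_sharing = pd.DataFrame({
    'bike_id': [11, 11, 11, 27],
    'duration': ['12 minutes', '12 minutes', '5 minutes', '13 minutes'],
    'user_birth_year': [1988, 1988, 1985, 1989],
})

# Build a numeric duration column before aggregating
ride_sharing['duration_time'] = (
    ride_sharing['duration'].str.replace(' minutes', '', regex=False).astype('int')
)

# Drop complete duplicates, then merge the remaining rows per bike_id
statistics = {'user_birth_year': 'min', 'duration_time': 'mean'}
ride_unique = (ride_sharing.drop_duplicates()
               .groupby('bike_id').agg(statistics).reset_index())

# No bike_id appears more than once after aggregation
assert not ride_unique.duplicated(subset='bike_id', keep=False).any()
print(ride_unique)
```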