# Lesson I 

## Data Type Constraints

In this course, we'll learn how to diagnose different problems in our data and how they can arise during our workflow.

**In this course:**

* Diagnose dirty data
* Side effects of dirty data
* Clean Data

### Why do we need to clean data?

Data Science workflow:

* Access Data
    - Explore and Process Data
        - Extract Insights
            - Report Insights

Dirty data can arise from duplicate values, misspellings, data-type parsing errors, and legacy systems. If the data is not properly cleaned during the exploration and processing phase, the insights and reports generated downstream will be compromised.

***Garbage in, garbage out.***

<img src='pictures/datascienceworklow.jpg' />

### Data Type Constraints

| **DataType** | **Example** | **Python data type** |
| -------------|-------------|----------------------|
| Text Data | First name, last name, address ... | ``str`` |
| Integers | # Subscribers, # products sold ... | ``int`` |
| Decimals | Temperature, $ Exchange rates ... | ``float`` |
| Binary | Is married, new customer, yes/no ... | ``bool`` |
| Dates | Order Dates, ship dates ... | ``datetime``  |
| Categories | Marriage status, gender ... | ``category`` |


### String to Integers

Let's take a look at the following example:

```python
# Import CSV file and output header
sales = pd.read_csv('sales.csv')
sales.head(2)
```

<img src='pictures/saleshead.jpg' />

Here's the head of a DataFrame containing the revenue generated and the quantity of items sold for each sales order. We want to calculate the total revenue generated by all sales orders. As we can see, the Revenue column has the dollar sign on the right-hand side.

```python
# Get data types of columns
sales.dtypes    
```

<img src='pictures/datatypes.jpg' />

Inspecting the DataFrame's column data types with the ``.dtypes`` attribute shows ``object`` for the Revenue column, which here indicates that its values are stored as strings.

We can also check the data types, as well as the number of non-null values per column (and therefore how many values are missing), by using the ``.info()`` method.

```python
# Get DataFrame information
sales.info()
```

Since the Revenue column holds strings, summing it across all sales orders returns one large string made by concatenating every row's value. To fix this, we first need to remove the **$** sign so that pandas can convert the strings into numbers without error.


```python
# Print sum of all Revenue column
sales['Revenue'].sum()  # This will return all of the strings concatenated
```

We do this with the ```.str.strip()``` method, passing the string we want to strip as an argument, which in this case is the *dollar sign*. Since our dollar values do not contain decimals, we then convert the Revenue column to an *integer* with the ```.astype()``` method, specifying the desired data type as its argument.

```python
# Remove $ from Revenue column
sales['Revenue'] = sales['Revenue'].str.strip('$')
sales['Revenue'] = sales['Revenue'].astype('int')
```

Had our revenue values been decimal, we would have converted the Revenue column to *float*. We can make sure that the Revenue column is now an integer by using the ```assert``` statement, which takes a condition as input and *returns nothing* if that condition is met, or raises an *error* if it is not.

```python
# Verify that Revenue is now an integer
assert sales['Revenue'].dtype == 'int'
```
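
The whole sequence can be tried end to end without the CSV file. The following is a minimal sketch that assumes a small made-up `sales` DataFrame in place of `sales.csv`. The Revenue column is created with an explicit `object` dtype to mirror the output shown earlier, and `pd.api.types.is_integer_dtype` is used as an alternative, platform-independent way to check the result.

```python
import pandas as pd

# Hypothetical stand-in for sales.csv; the values are made up
sales = pd.DataFrame({
    "Revenue": pd.Series(["23$", "45$", "10$"], dtype="object"),
    "Quantity": [1, 2, 1],
})

# Summing strings concatenates them instead of adding numbers
concatenated = sales["Revenue"].sum()
print(concatenated)  # 23$45$10$

# Strip the dollar sign, then convert the remaining digits to integers
sales["Revenue"] = sales["Revenue"].str.strip("$").astype("int")

# Passes silently if the column is an integer type, raises AssertionError otherwise
assert pd.api.types.is_integer_dtype(sales["Revenue"])

total = sales["Revenue"].sum()
print(total)  # 78
```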

### Numeric or Categorical?

Some data looks numeric but actually represents a finite set of possible categories. This is called **categorical data**.

```python
marriage_status
3
1
2
```

``0`` = Never Married, ``1`` = Married, ``2`` = Separated, ``3`` = Divorced

Here we have a marriage status column whose values are represented by 0, 1, 2 and 3.

However, it will be imported as an integer column, which can produce misleading results when computing statistical summaries. For example, the mean of these codes has no real-world meaning.

We can solve this by using the ``.astype()`` method seen earlier, this time specifying the ``category`` data type.

```python
# Convert to categorical
df["marriage_status"] = df['marriage_status'].astype('category')
```
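
To see why the conversion matters, here is a minimal sketch comparing ``.describe()`` before and after the change. The `df` below is a hypothetical six-row example, not the course data.

```python
import pandas as pd

# Hypothetical marriage status codes (0-3, as in the legend above)
df = pd.DataFrame({"marriage_status": [3, 1, 2, 1, 0, 1]})

# As integers, describe() reports mean, std and quartiles of the codes,
# which are not meaningful for category labels
numeric_summary = df["marriage_status"].describe()
print(numeric_summary)

# As a category, describe() reports count, unique, top and freq instead
df["marriage_status"] = df["marriage_status"].astype("category")
category_summary = df["marriage_status"].describe()
print(category_summary)
```

The categorical summary answers questions that make sense for labels, such as how many distinct statuses exist and which one is most frequent.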

## Exercise

### Numeric data or ...?

In this exercise, and throughout this chapter, you'll be working with bicycle ride-sharing data from San Francisco, stored in a DataFrame called ```ride_sharing```. It contains information on the start and end stations, the trip duration, and some user information for a bike-sharing service.

The ```user_type``` column contains information on whether a user is taking a free ride and takes on the following values:

* ``1`` for free riders.
* ``2`` for pay per ride.
* ``3`` for monthly subscribers.

In this instance, you will print the information of ```ride_sharing``` using ```.info()``` and see firsthand how an incorrect data type can distort your analysis of the dataset.

```python
# Import packages
import pandas as pd

# Load the ride sharing DataFrame
ride_sharing = pd.read_csv('datasets/ride_sharing_new.csv')

# Print the information of ride_sharing
print(ride_sharing.info())

# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())

# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')

# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'

# Print new summary statistics
print(ride_sharing['user_type_cat'].describe())
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25760 entries, 0 to 25759
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       25760 non-null  int64 
 1   duration         25760 non-null  object
 2   station_A_id     25760 non-null  int64 
 3   station_A_name   25760 non-null  object
 4   station_B_id     25760 non-null  int64 
 5   station_B_name   25760 non-null  object
 6   bike_id          25760 non-null  int64 
 7   user_type        25760 non-null  int64 
 8   user_birth_year  25760 non-null  int64 
 9   user_gender      25760 non-null  object
dtypes: int64(6), object(4)
memory usage: 2.0+ MB
None
count    25760.000000
mean         2.008385
std          0.704541
min          1.000000
25%          2.000000
50%          2.000000
75%          3.000000
max          3.000000
Name: user_type, dtype: float64
count     25760
unique        3
top           2
freq      12972
Name: user_type_cat, dty
```

### Summing strings and concatenating numbers

In the previous exercise, you were able to identify that ```category``` is the correct data type for ``user_type`` and convert it in order to extract relevant statistical summaries that shed light on the distribution of ```user_type```.

Another common data type problem is importing what should be numerical values as strings. Operations such as summing or multiplying by a number then concatenate or repeat the strings instead of producing numerical results.

In this exercise, you'll be converting the string column ```duration``` to the type ```int```. Before that, however, you will need to strip ```"minutes"``` from the column so that ```pandas``` can read it as numerical.

```python
# Strip duration of minutes
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip('minutes')

# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')

# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'

# Print formed columns and calculate average ride duration 
print(ride_sharing[['duration','duration_trim','duration_time']])
print(ride_sharing['duration_time'].mean())
```

```
         duration duration_trim  duration_time
0      12 minutes           12              12
1      24 minutes           24              24
2       8 minutes            8               8
3       4 minutes            4               4
4      11 minutes           11              11
...           ...           ...            ...
25755  11 minutes           11              11
25756  10 minutes           10              10
25757  14 minutes           14              14
25758  14 minutes           14              14
25759  29 minutes           29              29

[25760 rows x 3 columns]
11.389052795031056
```
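
One detail about the exercise above is worth a closer look: ```.str.strip('minutes')``` treats its argument as a set of characters to remove from both ends, not as a whole word. That works here because the values start with digits, but it leaves the separating space in place. The sketch below uses a small hypothetical `durations` Series and shows ```.str.replace()``` with `regex=False` as one possible alternative that removes the exact suffix. It is an illustration, not the course's prescribed solution.

```python
import pandas as pd

# Hypothetical durations in the same format as the duration column
durations = pd.Series(["12 minutes", "8 minutes"])

# strip() removes any of the characters m, i, n, u, t, e, s from both ends;
# the space before "minutes" is not in that set, so it stays
trimmed = durations.str.strip("minutes")
print(trimmed.tolist())  # ['12 ', '8 ']

# Replacing the literal text " minutes" removes the word and the space
cleaned = durations.str.replace(" minutes", "", regex=False)
print(cleaned.tolist())  # ['12', '8']

# Either form converts cleanly, since the integer conversion ignores
# surrounding whitespace
minutes = cleaned.astype("int")
assert pd.api.types.is_integer_dtype(minutes)
print(minutes.mean())  # 10.0
```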
