The previous checkpoint offered an overview of the various elements of exploratory data analysis, or EDA. In this checkpoint, we will begin the **data cleaning** phase with examples in Python. 

Below is a diagram depicting where data cleaning is situated in the overall EDA process:

![data_cleaning.png](assets/data_cleaning.png)


Whether we are working on an established dataset for a personal project (e.g., a dataset from Kaggle) or a brand new dataset for work, the first step is to establish the **variable type** of each feature in our dataset. From there, we can identify *how* to clean the data by scrubbing it of any potential problems for future analysis.

While a programmer might think in terms of strings, integers, floats, or booleans, a data scientist sees variables as either **continuous** or **categorical**. The essential difference is that continuous variables can take a potentially unlimited number of values, while categorical variables can take only a limited number of categories as their values.

This distinction between the two variable types influences how we approach a data science problem. As we will see later in this course, if our target variable is continuous, then we formulate the problem as a **regression problem**. If the target variable is categorical, then we formulate it as a **classification problem**. More immediately, each type of variable requires different EDA techniques. 

Now let's consider variable types more closely along with how to work with them in Pandas. You will see that within the split between categorical and continuous variables, there are a couple of "sub-types," and that it's not always obvious how to model some variables. 

Topics covered:

* Continuous variables: **interval** versus **ratio** variables
* Categorical variables: **nominal** versus **ordinal** variables

# Continuous variables

Continuous variables can take an infinite number of potential values. This is probably the most common variable type you will encounter when working on a dataset. Some real-world examples of continuous variables include speed, distance, weight, height, and so forth. 

There are 2 sub-types of continuous variables:


## 1. Interval variables

These variables are sensitive to both rank-order and distance. Temperature is a good example. The distance between 30 and 40 degrees Fahrenheit is the same as the distance between 70 and 80 degrees (10 degrees). We can also say that 80 degrees is higher than 70 degrees, which is higher than 40 degrees. This means that we can rank-order this variable.  

**The crucial distinction about interval variables is that they lack an absolute zero point**. For example, a temperature of 0 degrees Fahrenheit does *not* mean that there is no warmth in the air at all! The lack of a "true zero" means that we cannot logically calculate ratios from an interval variable. For example, we cannot say that 60 degrees F is "twice as hot" as 30 degrees F.
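
The point about ratios can be checked with a little arithmetic. The sketch below is only an illustration: the helper `fahrenheit_to_celsius` and the variable names are introduced here and are not part of the course material. It shows that the difference between two temperatures stays consistent when the unit changes, while the "twice as hot" ratio does not.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


low_f, high_f = 30.0, 60.0
low_c = fahrenheit_to_celsius(low_f)
high_c = fahrenheit_to_celsius(high_f)

# Differences are meaningful on an interval scale: the 30-degree gap in
# Fahrenheit is the same physical gap in Celsius, rescaled by 5/9.
diff_f = high_f - low_f
diff_c = high_c - low_c

# Ratios are not meaningful: the ratio depends on where the scale puts zero.
ratio_f = high_f / low_f  # 2.0 in Fahrenheit
ratio_c = high_c / low_c  # a negative number in Celsius

print(f"difference: {diff_f:.1f} F is {diff_c:.2f} C")
print(f"ratio in Fahrenheit: {ratio_f:.1f}, ratio in Celsius: {ratio_c:.1f}")
```

Because the ratio changes with the unit, it says nothing about the underlying quantity, which is exactly what the lack of a true zero implies.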

Interval variables can be treated as either categorical or continuous variables. If we are comfortable with the assumption that a continuous variable does not need an absolute zero point, then we can treat interval variables as continuous.

## 2. Ratio variables

The second type of continuous variable is the **ratio variable**. Ratio variables indicate rank and distance, and have a meaningful absolute zero value. When a ratio variable has a score of 0, then none of the quantity measured by the variable is present. The presence of a "true zero" is the main distinction between ratio and interval variables.

As an example, age is a ratio variable: an age of 0 means that no time has been lived at all, which means that a 20-year-old has lived twice as long as a 10-year-old.

# Categorical variables

By definition, a categorical variable can take only a limited number of distinct values, called "categories". As a rule of thumb, if a variable can only take a finite number of values, that variable can be treated as a categorical variable. However, this guideline is not rigid. 

Think about a variable like age, for example. While we described age as a continuous variable above, it is unclear whether there really is a fixed or unlimited number of values that it can take. Biologically, of course, there is effectively a limited range of ages that a human can reach.

But consider a dataset collected from college students, or working adults, or retirees. In these cases, there is in practice a limited number of values that age might take. So, depending on the analysis, it may make sense to treat age as a categorical variable.

There are 2 sub-types of categorical variables: 

## 1. Nominal variables

If a variable is nominal, the ordering between categories does not matter. For example, if we have a variable indicating the origin country of a product, that variable is categorical because it can only take as many values as there are countries. On top of that, it's a nominal variable because a rank-ordering of countries is meaningless. Relationships such as "greater than" or "less than" are not defined among the values of a nominal variable; we can only say whether two values are the same or different.

## 2. Ordinal variables

Ordinal variables indicate a rank-ordering among categories. For example, runners might be scored as 1 for 1st place, 2 for 2nd place, 3 for 3rd place, and so forth. 

However, an ordinal variable does not give any information about the *distance* between the scores. We know that the 1st place runner was faster than the 2nd place runner, but the difference in their *times* (the distance between them within the variable) could be minutes or milliseconds. In addition, the difference in times between 1st and 2nd place is probably not the same as the difference in times between 2nd and 3rd place.
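
As a rough sketch of how this distinction can be expressed in Pandas, the example below stores a nominal variable as an unordered categorical and an ordinal variable as an ordered one. The toy values and the names `country` and `placement` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from any dataset in this checkpoint.
# Nominal: categories have no order.
country = pd.Series(["Japan", "Brazil", "Japan", "Kenya"], dtype="category")

# Ordinal: categories have an explicit order, 1st < 2nd < 3rd.
placement = pd.Series(
    pd.Categorical(
        ["2nd", "1st", "3rd", "1st"],
        categories=["1st", "2nd", "3rd"],
        ordered=True,
    )
)

print(country.cat.ordered)    # False
print(placement.cat.ordered)  # True

# Ordered categories support comparisons and min/max.
best_placement = placement.min()
behind_first = (placement > "1st").tolist()

# Unordered categories only support equality checks, so "<" raises TypeError.
try:
    country < "Japan"
    nominal_is_comparable = True
except TypeError:
    nominal_is_comparable = False

print(best_placement, behind_first, nominal_is_comparable)
```

Note that the ordered categorical still carries no information about the distance between places; it only records which place comes before which.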

# Examining variable types in Pandas

Now that you have a deeper understanding of the different types of variables, let's use Pandas to determine the variable types of a dataset. For this example, we will use a dataset that was previously available on Kaggle called "Top 5000 YouTube channels data from Social Blade".

Let's first import Pandas and then load our dataset:

In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import warnings

warnings.filterwarnings('ignore')

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'youtube'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

youtube_df = pd.read_sql_query('select * from youtube',con=engine)

# no need for an open connection, 
# as we're only doing a single query
engine.dispose()

To get a high-level understanding of the data frame, we can use the `.info()` method from Pandas. This method prints the number of rows and columns in the data frame as well as the data type of each column:

In [3]:

youtube_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
Rank             5000 non-null object
Grade            5000 non-null object
Channel name     5000 non-null object
Video Uploads    5000 non-null object
Subscribers      5000 non-null object
Video views      5000 non-null int64
dtypes: int64(1), object(5)
memory usage: 234.5+ KB


As we can see from the output, this dataset contains 5,000 observations and 6 columns. It appears that only the *Video views* field is numeric and that the rest are objects (which, in Pandas, usually means that they are stored as strings).
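
The same split between numeric and non-numeric columns can also be retrieved programmatically. The following is a minimal sketch on a made-up two-row frame (`toy_df` is hypothetical and only mimics a few of the columns above), using the Pandas `select_dtypes()` method:

```python
import pandas as pd

# Hypothetical two-row sample that mimics three of the columns above.
toy_df = pd.DataFrame({
    "Grade": ["A++", "A"],
    "Subscribers": ["18752951", "--"],
    "Video views": [20869786591, 941376171],
})

# Columns that Pandas stores with a numeric dtype.
numeric_columns = toy_df.select_dtypes(include="number").columns.tolist()

# Everything else, which includes numbers that were stored as text.
other_columns = toy_df.select_dtypes(exclude="number").columns.tolist()

print(numeric_columns)
print(other_columns)
```

In this sample, *Subscribers* ends up in the non-numeric list even though most of its values look like numbers, because the dtype reflects how the values are stored rather than what they mean.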

However, we should be cautious about this interpretation. As we will see upon closer investigation, some of the variables that appear as objects are indeed numeric variables. To discover this, let's print the first few rows of the data frame using the `.head()` function:

In [4]:
# print first rows of the data frame
youtube_df.head()

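|   | Rank | Grade | Channel name | Video Uploads | Subscribers | Video views |
|---|------|-------|--------------|---------------|-------------|-------------|
| 0 | 1st | A++ | Zee TV | 82757 | 18752951 | 20869786591 |
| 1 | 2nd | A++ | T-Series | 12661 | 61196302 | 47548839843 |
| 2 | 3rd | A++ | Cocomelon - Nursery Rhymes | 373 | 19238251 | 9793305082 |
| 3 | 4th | A++ | SET India | 27323 | 31180559 | 22675948293 |
| 4 | 5th | A++ | WWE | 36756 | 32852346 | 26273668433 |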


It appears that both *Video Uploads* and *Subscribers* are numeric! **So, why did these columns appear as *object* and not *int* or *float*?**

This happened because both *Video Uploads* and *Subscribers* contain observations that can't be handled as numeric. This includes both missing values and observations stored in the database as `--`. 

To confirm that this is indeed what's happening, let's select the rows where values of the *Video Uploads* and *Subscribers* columns are equal to `--`:

In [5]:
youtube_df[(youtube_df["Video Uploads"].str.strip() == "--") | (youtube_df["Subscribers"].str.strip() == "--")]

|     | Rank | Grade | Channel name | Video Uploads | Subscribers | Video views |
|-----|------|-------|--------------|---------------|-------------|-------------|
| 17 | 18th | A+ | Vlad and Nikita | 53 | -- | 1428274554 |
| 108 | 109th | A | BIGFUN | 373 | -- | 941376171 |
| 115 | 116th | A | Bee Kids Games - Children TV | 740 | -- | 414535723 |
| 142 | 143rd | A | ChiChi TV Siêu Nhân | 421 | -- | 2600394871 |
| 143 | 144th | A | MusicTalentNow | 1487 | -- | 3252752212 |
| 152 | 153rd | A | Family GamesTV | 282 | -- | 1287242549 |
| 156 | 157th | A | KH Show | 31 | -- | 106302038 |
| 175 | 176th | A | LES BOYS TV2 | 116 | -- | 387595623 |
| 180 | 181st | A | BIBO TOYS | 313 | -- | 1574657579 |
| 189 | 190th | A | Kids Tv Show | 8 | -- | 86516866 |


In total, 390 rows in our dataset include `--` as the value for either the *Video Uploads* or *Subscribers* column; only the first few are shown above. We'll cover how to clean this data in the next checkpoint. For now, we understand that these variables are indeed numeric and, likely, continuous.
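
If we do not know in advance which placeholder a column uses, one way to find the offending values is to try parsing each column as a number and count the failures. The sketch below runs on a small made-up frame (`sample_df` and the helper `count_non_numeric` are hypothetical names), and relies on `pd.to_numeric()` with `errors="coerce"`, which turns anything unparseable into `NaN`:

```python
import pandas as pd

# Hypothetical sample that mimics the two columns discussed above.
sample_df = pd.DataFrame({
    "Video Uploads": ["82757", "12661", "--", "373"],
    "Subscribers": ["18752951", "--", "--", "19238251"],
})


def count_non_numeric(column):
    """Count the entries of a string column that cannot be parsed as numbers."""
    parsed = pd.to_numeric(column.str.strip(), errors="coerce")
    return int(parsed.isna().sum())


non_numeric_counts = {
    name: count_non_numeric(sample_df[name]) for name in sample_df.columns
}
print(non_numeric_counts)
```

Keep in mind that this count also includes genuinely missing values, so it reports how many entries are not usable as numbers rather than how many contain `--` specifically.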

At this point in the data cleaning process, we believe that the continuous variables in the dataset are *Video Uploads*, *Subscribers* and *Video views*, while the categorical variables are *Grade*, *Rank*, and *Channel name*. Let's confirm this. 

Getting the number of unique values for each column of our data frame helps us decide which columns might be categorical. If the number is small, then the column is likely to be categorical.

Let's retrieve this information using the Pandas `.nunique()` method:

In [6]:
youtube_df.nunique()

Rank             5000
Grade               6
Channel name     4993
Video Uploads    2286
Subscribers      4612
Video views      5000
dtype: int64

Here we can see that the *Grade* column has only 6 distinct values, so it's safe to classify this as a categorical variable. But *Channel name* has nearly 5,000 unique values—how can we be sure it's categorical?

Here, we must simply think logically about our data. Since *Channel name* stores the name of the YouTube channel, we can think of each channel as a unique category. The number of possibilities this value can take is limited to the number of YouTube channels, so it's a categorical variable.
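
To make the "small number of unique values" rule of thumb a bit more concrete, we can compare the number of unique values with the number of rows. The sketch below uses a made-up frame, and the 0.5 cutoff is an arbitrary assumption chosen for illustration, not a standard threshold:

```python
import pandas as pd

# Hypothetical toy data: 8 channels with 2 distinct grades.
toy_df = pd.DataFrame({
    "Grade": ["A", "A", "B", "A", "B", "A", "A", "B"],
    "Channel name": ["ch1", "ch2", "ch3", "ch4", "ch5", "ch6", "ch7", "ch8"],
})

# Share of distinct values per column: close to 0 means many repeats,
# 1.0 means every row has its own value.
unique_share = toy_df.nunique() / len(toy_df)

# Arbitrary cutoff, assumed here only for the sake of the example.
CUTOFF = 0.5
low_cardinality = unique_share[unique_share < CUTOFF].index.tolist()

print(unique_share)
print(low_cardinality)
```

A heuristic like this flags *Grade* but not *Channel name*, which is why it can only serve as a first hint. Deciding that an identifier-like column is categorical still requires reasoning about what the column means.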

What about the *Rank* column? Since this contains the rank of each channel, it can be considered as either an ordinal categorical variable or as an interval continuous variable. The variable type we choose will depend on the task at hand and likely some experimentation on the data to decide which is more helpful. As with many tasks in data science, there is no clear-cut "right" answer here, so in this bootcamp you will learn how to develop the best course of action given your objectives. 

# Changing variable types

Sometimes, we may want to work with categorical instead of continuous variables. For example, we might want to look at the difference between the most and least watched channels. In this case, we would categorize the channels with respect to their video views into 3 groups: 

1. The most watched channels, where the video views are at least 1 billion.
2. The moderately watched channels, where the video views are at least 100 million but lower than 1 billion.
3. The least watched channels, where the video views are lower than 100 million.

Here, we are transforming a *continuous* variable into an *ordinal, categorical* variable. 

Let's create a new feature in our Pandas DataFrame, `views_group`, that does this:

In [7]:
# this method returns group numbers 
# given video views
def categorize_video_views(views_num):
    if views_num >= 1000000000:
        return 1
    elif views_num >= 100000000:
        return 2
    else:
        return 3

# we use Pandas' .apply() method by calling the function above.
youtube_df['views_group'] = youtube_df['Video views'].apply(categorize_video_views)

# let's see how many observations we have in each group
print(youtube_df.groupby("views_group")["Video views"].count())

views_group
1    1399
2    2846
3     755
Name: Video views, dtype: int64
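
As an alternative sketch, the same grouping can be expressed with the Pandas `pd.cut()` function, which bins a numeric column by a list of edges. The example below runs on a handful of made-up view counts (`views` is a hypothetical series, not the real column), and it assumes the same boundaries as the function above: values of at least 1 billion go to group 1, and values of at least 100 million go to group 2.

```python
import pandas as pd

# Hypothetical view counts, including the two boundary values.
views = pd.Series(
    [20_869_786_591, 941_376_171, 86_516_866, 1_000_000_000, 100_000_000]
)

views_group = pd.cut(
    views,
    bins=[0, 100_000_000, 1_000_000_000, float("inf")],
    labels=[3, 2, 1],  # one label per bin, from the lowest bin to the highest
    right=False,       # bins are closed on the left: [lower, upper)
)

print(views_group.tolist())
print(views_group.value_counts().sort_index())
```

Because labels are supplied, `pd.cut()` returns an ordered categorical by default, so the result already carries the rank-ordering that makes `views_group` an ordinal variable.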


## Assignment

You will see the benefits of transforming continuous variables into categorical variables when we begin to model the data. For now, let's continue to practice identifying variable types. 

To complete this assignment, answer the following questions, either in a Gist file or directly below:

1. Consider the advantages and disadvantages of treating the *Rank* variable as categorical. Discuss your arguments with your mentor.
2. What are the types of the following variables?
    * Age
    * Salary
    * Revenue
    * Customer type
    * Stock price
    
Submit your work below, and plan on discussing it with your mentor. You can also take a look at this [example solution](https://github.com/Thinkful-Ed/data-201-assignment-solutions/blob/master/model_prep_variable_types/solution.ipynb).