# Cleaning Data With Pandas

When we receive raw data, we have to do a number of things before we're ready to analyze it. This can include any or all of the following:

 - diagnosing the "tidiness" of the data: how much data cleaning we will have to do
 - reshaping the data: getting the right rows and columns for effective analysis
 - combining multiple files
 - changing the types of values: for example, fixing a column where numerical values are stored as strings
 - dropping or filling missing values: how we deal with data that is incomplete or missing
 - manipulating strings to represent the data better


### 1. Diagnose Data

We often describe data that is easy to analyze and visualize as **tidy data**. For data to be tidy, it must have:

 - Each variable as a separate column
 
 - Each row as a separate observation

We would want to re-shape a table that looks like:

![Data Table 1](img/data-01.png)

into a table that looks more like this (using `pd.melt()`):

![Data Table 2](img/data-02.png)

The first step of diagnosing whether or not a dataset is tidy is using pandas functions to explore and probe the dataset. Some of the most useful ones are:

 - `.head()`: display the first 5 rows (the default)
 - `.tail()`: display the last 5 rows (the default)
 - `.info()`: display a summary: number of entries, columns, non-null counts, datatypes and memory usage
 - `.describe()`: display summary statistics for the numeric columns: count, mean, std, min, max, and the 1st, 2nd (median) and 3rd quartiles
 - `.columns`: display the column names
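
As a rough illustration of these calls, here is a minimal sketch. The `accounts` DataFrame and its values are invented for this example and are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
accounts = pd.DataFrame({
    "name": ["Ana", "Ben", "Cho", "Dev", "Eli", "Fay"],
    "Checking": [1200.0, 300.5, 50.0, 980.0, 15.25, 410.0],
    "Savings": [5000.0, 150.0, 0.0, 2200.0, 75.0, 900.0],
})

first_rows = accounts.head()       # first 5 rows by default
last_rows = accounts.tail(2)       # pass n to override the default of 5
summary = accounts.describe()      # statistics for the numeric columns only
column_names = list(accounts.columns)

accounts.info()                    # prints dtypes and non-null counts
print(first_rows)
print(summary)
```

Note that `.info()` prints its report directly, while the other calls return objects that you can inspect or store.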

### 2. Dealing With Multiple Files

We will often have our data spread across multiple files. We can use the Python library `glob`, together with `pandas`, to combine our files into one table so that we can analyze the aggregate data.

In [1]:
import pandas as pd
import glob as gb

# use 'glob' to find multiple files, matching filenames with shell-style wildcards (not regex)
files = gb.glob("data/file*.csv")

df_list = []
# iterate through the 'files' object, read the data into a DataFrame and append to a list
for filename in files:
  data = pd.read_csv(filename) # create dataframe
  df_list.append(data)

# then concatenate all of those DataFrames together.
df = pd.concat(df_list)
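
One thing to be aware of: by default `pd.concat` keeps each file's own row index, so index values repeat across files. The sketch below shows one possible variation that renumbers the rows with `ignore_index=True`. The temporary directory, file names and file contents are made up so the example can run on its own.

```python
import glob
import os
import tempfile

import pandas as pd

# Write three small hypothetical CSV files into a temporary directory
tmp_dir = tempfile.mkdtemp()
for i in range(3):
    part = pd.DataFrame({"id": [0, 1], "score": [10 * i, 10 * i + 1]})
    part.to_csv(os.path.join(tmp_dir, f"file{i}.csv"), index=False)

# sorted() makes the order in which the files are read deterministic
paths = sorted(glob.glob(os.path.join(tmp_dir, "file*.csv")))

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each file's index
combined = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
print(combined)
```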

In [2]:
df.head()

Unnamed: 0,id,full_name,gender_age,Fractions,Probability,grade
0,0,Moses Kirckman,M14,69%,89%,11th grade
1,1,Timofei Strowan,M18,63%,76%,11th grade
2,2,Silvain Poll,M18,69%,77%,9th grade
3,3,Lezley Pinxton,M18,71%,72%,11th grade
4,4,Bernadene Saunper,F17,72%,84%,11th grade


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 900 entries, 0 to 99
Data columns (total 6 columns):
id             900 non-null int64
full_name      900 non-null object
gender_age     900 non-null object
Fractions      900 non-null object
Probability    900 non-null object
grade          900 non-null object
dtypes: int64(1), object(5)
memory usage: 49.2+ KB


### 3. Reshaping Your Data

We want:

 - each variable as a separate column
 
 - each row as a separate observation
 
To do so, we can use the pandas `pd.melt()` function to do the transformation. `pd.melt()` takes in a DataFrame and the columns to unpack:

```py
# compare with tables above
pd.melt(frame=df, id_vars='name', value_vars=['Checking','Savings'], var_name="Account Type", value_name="Amount")
```
You provide the following parameters:

 - `frame`: the DataFrame you want to melt
 - `id_vars`: the column(s) of the old DataFrame to preserve
 - `value_vars`: the column(s) of the old DataFrame that you want to turn into variables
 - `var_name`: what to call the column of the new DataFrame that stores the variables
 - `value_name`: what to call the column of the new DataFrame that stores the values
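
To make the parameters concrete, here is a small sketch on a made-up wide table. The names and amounts are hypothetical; only the column names follow the call shown above.

```python
import pandas as pd

# Hypothetical wide table: one row per person, one column per account type
wide = pd.DataFrame({
    "name": ["Ana", "Ben"],
    "Checking": [1200, 300],
    "Savings": [5000, 150],
})

long_df = pd.melt(
    frame=wide,
    id_vars="name",                        # kept as-is on every output row
    value_vars=["Checking", "Savings"],    # column names move into 'Account Type'
    var_name="Account Type",
    value_name="Amount",
)
print(long_df)
```

Each person now appears once per account type, so two rows and two value columns become four rows.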
 
We often assign to the DataFrame's `.columns` attribute to rename the columns after melting:

```py
df.columns = ["Account", "Account Type", "Amount"]
```

In [5]:
# columns in the dataframe
df.columns

Index(['id', 'full_name', 'gender_age', 'Fractions', 'Probability', 'grade'], dtype='object')

We want to make each row an observation, so we want to transform this table to look like:
![Data 03](img/data-03.png)

In [6]:
df = pd.melt(frame=df, id_vars=['full_name','gender_age','grade'], 
        value_vars=['Fractions', 'Probability'], value_name='score', var_name='exam')
print(df.columns)
print(df.exam.value_counts())

Index(['full_name', 'gender_age', 'grade', 'exam', 'score'], dtype='object')
Fractions      900
Probability    900
Name: exam, dtype: int64


In [7]:
df.head()

Unnamed: 0,full_name,gender_age,grade,exam,score
0,Moses Kirckman,M14,11th grade,Fractions,69%
1,Timofei Strowan,M18,11th grade,Fractions,63%
2,Silvain Poll,M18,9th grade,Fractions,69%
3,Lezley Pinxton,M18,11th grade,Fractions,71%
4,Bernadene Saunper,F17,11th grade,Fractions,72%


### 4. Dealing With Duplicates

To check for duplicates, use the pandas method `.duplicated()`, which will return a pandas `Series` telling us which rows are duplicate rows.

The Series shares the DataFrame's index and holds `True` or `False` for each row, indicating whether that particular row is a duplicate. By default, a row is **flagged** as a duplicate when every field in the row is the same as in a row that appeared earlier, i.e. they are exact duplicates; the first occurrence itself is not flagged. You can use the pandas `.drop_duplicates()` method to drop all flagged rows.

There are instances where two or more rows share only some of their fields, and so aren't removed by `.drop_duplicates()`, e.g.

![Data Field](img/data-04.png)

In the example, we can see that the two "peach" rows remain because there is a difference in the price column. If we wanted to remove every row with a duplicate value in the `item` column, we could specify a `subset`. By default, the first occurrence is kept.

```py
df.drop_duplicates(subset=['item'])
```

![Data Field](img/data-05.png)

When using `subset`, you want to make sure that the columns you drop duplicates from are specifically the ones where duplicates don't belong, e.g. you wouldn't pick the `price` column.
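
The difference between exact duplicates and duplicates within a `subset` can be sketched as follows. The `fruit` table here is invented for the example, loosely modeled on the peach rows described above.

```python
import pandas as pd

# Hypothetical data: two identical apple rows, two peach rows with different prices
fruit = pd.DataFrame({
    "item": ["apple", "peach", "peach", "apple"],
    "price": ["$3", "$5", "$4", "$3"],
})

# True marks a row that repeats an earlier row in every column
is_duplicate = fruit.duplicated()

exact_removed = fruit.drop_duplicates()             # drops only the second apple row
by_item = fruit.drop_duplicates(subset=["item"])    # keeps the first row for each item

print(is_duplicate)
print(by_item)
```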

In [8]:
# find the duplicate rows
duplicates = df.duplicated()

# determine how many rows are exact duplicates
duplicates.value_counts() # returns a pandas series

False    1800
dtype: int64

No duplicates were found. If there had been, we could drop all exact duplicates with:

```py
df = df.drop_duplicates()
```
and then repeat the check:

```py
duplicates = df.duplicated()
duplicates.value_counts()
```

### 5. Splitting by Index

Often multiple measurements are recorded in the same column. We want to make sure each column represents one type of measurement, so that we can do individual analysis on each variable.

For example, we may have a `birthday` column formatted as `MMDDYYYY`, and we want to split this data into `day`, `month`, and `year` so that we can use these columns as separate features.

In this case, we know the exact structure of these strings. The first two characters will always correspond to the `month`, the next two to the `day`, and the rest of the string will always correspond to the `year`. We can easily break the data into three separate columns by slicing the strings using `.str`.

```py
# Create the 'month' column
df['month'] = df.birthday.str[0:2]

# Create the 'day' column
df['day'] = df.birthday.str[2:4]

# Create the 'year' column
df['year'] = df.birthday.str[4:]
```
This would transform a table:

![Data Field](img/data-06.png)

into a table like:

![Data Field](img/data-07.png)
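
A self-contained version of this slicing might look like the sketch below. The `people` DataFrame and its dates are hypothetical, and the final conversion step is an addition: the slices are still strings until they are converted.

```python
import pandas as pd

# Hypothetical records; dates are stored as strings so leading zeros survive
people = pd.DataFrame({"birthday": ["01151990", "12032001"]})

people["month"] = people["birthday"].str[0:2]
people["day"] = people["birthday"].str[2:4]
people["year"] = people["birthday"].str[4:]

# The slices are strings; convert them before doing arithmetic on them
for col in ["month", "day", "year"]:
    people[col] = pd.to_numeric(people[col])

print(people)
```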

In [9]:
# split the 'gender_age' column into 'gender' and 'age' columns
df['gender'] = df.gender_age.str[:1]
df['age'] = df.gender_age.str[1:] # we still need to convert this to a number
df.head()

Unnamed: 0,full_name,gender_age,grade,exam,score,gender,age
0,Moses Kirckman,M14,11th grade,Fractions,69%,M,14
1,Timofei Strowan,M18,11th grade,Fractions,63%,M,18
2,Silvain Poll,M18,9th grade,Fractions,69%,M,18
3,Lezley Pinxton,M18,11th grade,Fractions,71%,M,18
4,Bernadene Saunper,F17,11th grade,Fractions,72%,F,17


In [10]:
# delete the 'gender_age' column, return a subset of the req'd columns
# 'gender_age' is not in the list below, so it is dropped
df = df[['full_name', 'grade', 'exam', 'score', 'gender', 'age']]
df.head()

Unnamed: 0,full_name,grade,exam,score,gender,age
0,Moses Kirckman,11th grade,Fractions,69%,M,14
1,Timofei Strowan,11th grade,Fractions,63%,M,18
2,Silvain Poll,9th grade,Fractions,69%,M,18
3,Lezley Pinxton,11th grade,Fractions,71%,M,18
4,Bernadene Saunper,11th grade,Fractions,72%,F,17


### 6. Splitting by Character

Let's say we have a column called `type` with data entries in the format `admin_US` or `user_Kenya`. This column contains two types of data: a user type (with values like "admin" or "user") and the country this user is in (with values like "US" or "Kenya"). We can split this column, `type`, into two separate columns on the `_` character.

```py
# Split 'type' to create the 'str_split' column
df['str_split'] = df.type.str.split('_')

# Create the 'usertype' column
df['usertype'] = df.str_split.str.get(0)

# Create the 'country' column
df['country'] = df.str_split.str.get(1)
```

This would transform a table like:

![Data Field](img/data-08.png)

into:

![Data Field](img/data-09.png)
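
As an alternative to the intermediate `str_split` column, `.str.split()` also accepts `expand=True`, which returns the pieces as separate columns in one step. This is a sketch on made-up data, not the method used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical data in the 'usertype_country' format described above
users = pd.DataFrame({"type": ["admin_US", "user_Kenya"]})

# expand=True returns a DataFrame with one column per piece;
# n=1 splits only on the first underscore
users[["usertype", "country"]] = users["type"].str.split("_", n=1, expand=True)
print(users)
```

Limiting the split with `n=1` keeps any later underscores inside the second piece.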

In [11]:
# split the 'full_name' column into 'last_name' and 'first_name'
# first, create a Series called name_split by splitting full_name on " "
name_split = df.full_name.str.split(' ')
name_split[:5]

0       [Moses, Kirckman]
1      [Timofei, Strowan]
2         [Silvain, Poll]
3       [Lezley, Pinxton]
4    [Bernadene, Saunper]
Name: full_name, dtype: object

In [12]:
# create the columns 
df['first_name'] = name_split.str.get(0)
df['last_name'] = name_split.str.get(1)
df.head()

Unnamed: 0,full_name,grade,exam,score,gender,age,first_name,last_name
0,Moses Kirckman,11th grade,Fractions,69%,M,14,Moses,Kirckman
1,Timofei Strowan,11th grade,Fractions,63%,M,18,Timofei,Strowan
2,Silvain Poll,9th grade,Fractions,69%,M,18,Silvain,Poll
3,Lezley Pinxton,11th grade,Fractions,71%,M,18,Lezley,Pinxton
4,Bernadene Saunper,11th grade,Fractions,72%,F,17,Bernadene,Saunper


### 7. Data Type Conversion

Each column of a DataFrame holds items of a single data type, or dtype. The dtypes that pandas uses include `float`, `int`, `bool`, `datetime`, `timedelta`, `category` and `object`. Often, we want to convert between types so that we can do better analysis.

To see the specific `dtypes` of a dataframe:

In [13]:
df.dtypes

full_name     object
grade         object
exam          object
score         object
gender        object
age           object
first_name    object
last_name     object
dtype: object

We can see that every column has the dtype `object`, which here means the values are stored as strings. If you try a statistical analysis, e.g. calculating the **mean**, on a column of strings, pandas will typically raise a `TypeError` exception. Convert the column to a numerical type first.

![Data Field](img/data-10.png)

In the above example, we can see that the 'price' column is composed of strings representing dollar amounts. Before converting the price to a `float` we'll want to use a regex to strip off the `$` sign.

```py
# raw string (r'...') so the backslash reaches the regex engine unchanged
fruit.price = fruit['price'].replace(r'[\$,]', '', regex=True)
```
We can use the pandas function `pd.to_numeric()` to convert strings containing numerical values to integers or floats:

```py
fruit.price = pd.to_numeric(fruit.price)
```
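
Put together, the two steps might look like the sketch below. The `fruit` values are hypothetical, including an unparseable `"n/a"` entry. The example uses `.str.replace()` rather than `.replace(..., regex=True)`, and adds `errors="coerce"`, which goes beyond the steps above.

```python
import pandas as pd

# Hypothetical prices stored as strings, with one value that is not a number
fruit = pd.DataFrame({
    "item": ["apple", "peach", "melon"],
    "price": ["$3.50", "$1,200", "n/a"],
})

# Strip dollar signs and thousands separators
cleaned = fruit["price"].str.replace(r"[$,]", "", regex=True)

# errors="coerce" turns anything unparseable into NaN instead of raising
fruit["price"] = pd.to_numeric(cleaned, errors="coerce")
print(fruit)
```

Without `errors="coerce"`, `pd.to_numeric()` raises an error on the first value it cannot parse, which can be the better choice when bad values should not pass silently.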

In [14]:
df.head()

Unnamed: 0,full_name,grade,exam,score,gender,age,first_name,last_name
0,Moses Kirckman,11th grade,Fractions,69%,M,14,Moses,Kirckman
1,Timofei Strowan,11th grade,Fractions,63%,M,18,Timofei,Strowan
2,Silvain Poll,9th grade,Fractions,69%,M,18,Silvain,Poll
3,Lezley Pinxton,11th grade,Fractions,71%,M,18,Lezley,Pinxton
4,Bernadene Saunper,11th grade,Fractions,72%,F,17,Bernadene,Saunper


In [15]:
# strip the '%' symbol from the score column
df['score'] = df.score.replace(r'[%,]', '', regex=True)

# convert score to numerical value
df['score'] = pd.to_numeric(df['score'])

# convert age to numerical type
df['age'] = pd.to_numeric(df['age'])
df.head()

Unnamed: 0,full_name,grade,exam,score,gender,age,first_name,last_name
0,Moses Kirckman,11th grade,Fractions,69,M,14,Moses,Kirckman
1,Timofei Strowan,11th grade,Fractions,63,M,18,Timofei,Strowan
2,Silvain Poll,9th grade,Fractions,69,M,18,Silvain,Poll
3,Lezley Pinxton,11th grade,Fractions,71,M,18,Lezley,Pinxton
4,Bernadene Saunper,11th grade,Fractions,72,F,17,Bernadene,Saunper


In [16]:
df.dtypes

full_name     object
grade         object
exam          object
score          int64
gender        object
age            int64
first_name    object
last_name     object
dtype: object

Sometimes we want to do analysis on numbers that are hidden within string values. We can use regex to extract this numerical data.

![Data Field](img/data-11.png)

In the example above, it would be useful to split `exerciseDescription` into two columns, `exercise` and `reps`. To extract the numbers from the string we can use pandas' `.str.split()` function:

```py
split_df = df['exerciseDescription'].str.split(r'(\d+)', expand=True)
```
Because of `expand=True`, this returns a pandas `DataFrame` with one column per piece:

![Data Field](img/data-12.png)

We can then create the two columns in the dataframe:

```py
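# use bracket notation: attribute assignment (df.reps = ...) does not create a new column
df['reps'] = pd.to_numeric(split_df[1])
df['exercise'] = split_df[0].replace(r'[\- ]', '', regex=True)

# alternative, if 'expand=True' is omitted and split_df is a Series of lists
df['reps'] = pd.to_numeric(split_df.str.get(1))
df['exercise'] = split_df.str.get(0).replace(r'[\- ]', '', regex=True)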
```

So that our dataframe looks like:

![Data Field](img/data-13.png)
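
Another option, not used in this notebook, is `.str.extract()`, which pulls out regex capture groups directly. The sketch below assumes descriptions shaped like `"lunges - 30 reps"`; that sample format is a guess for illustration, not taken from the actual data.

```python
import pandas as pd

# Hypothetical descriptions; the real column may be formatted differently
workouts = pd.DataFrame(
    {"exerciseDescription": ["lunges - 30 reps", "squats - 20 reps"]}
)

# Named groups become the column names of the extracted DataFrame
pattern = r"(?P<exercise>[A-Za-z]+)\D*(?P<reps>\d+)"
parts = workouts["exerciseDescription"].str.extract(pattern)

workouts["exercise"] = parts["exercise"]
workouts["reps"] = pd.to_numeric(parts["reps"])
print(workouts)
```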

In [18]:
# extract the numeric grade from the grade column
df_grade = df.grade.str.split(r'(\d+)', expand=True)[1]
df_grade[:5]

0    11
1    11
2     9
3    11
4    11
Name: 1, dtype: object

In [20]:
# convert the type to numeric
df.grade = pd.to_numeric(df_grade)
df.dtypes

full_name     object
grade          int64
exam          object
score          int64
gender        object
age            int64
first_name    object
last_name     object
dtype: object

In [21]:
df.head()

Unnamed: 0,full_name,grade,exam,score,gender,age,first_name,last_name
0,Moses Kirckman,11,Fractions,69,M,14,Moses,Kirckman
1,Timofei Strowan,11,Fractions,63,M,18,Timofei,Strowan
2,Silvain Poll,9,Fractions,69,M,18,Silvain,Poll
3,Lezley Pinxton,11,Fractions,71,M,18,Lezley,Pinxton
4,Bernadene Saunper,11,Fractions,72,F,17,Bernadene,Saunper


### 8. Dealing With Missing Values

It is common to have fields with missing values. These show up as `NaN`. Some calculations will skip `NaN` values, while other calculations or visualizations will break when a `NaN` is encountered.

We generally use one of two methods to handle missing values:

1. drop all rows with a missing value using the `.dropna()` method, e.g.

```py
df = df.dropna()
```

If you want to drop every row that has a `NaN` value in a specific column, use `subset`, e.g.

```py
# drop every row with 'NaN' for 'num_guests'
df = df.dropna(subset=['num_guests'])
```

2. fill in the particular field with an average or some other aggregate value using the `.fillna()` method, e.g.

```py
# fill in the 'price' and 'num_guests' columns with their column means
df = df.fillna(value={"price":df.price.mean(), "num_guests":df.num_guests.mean()})
```

To fill in all `NaN` values in the dataframe with a single value, use `df = df.fillna(<value>)`.
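
The sketch below compares the approaches on a tiny made-up table. The `bookings` DataFrame and its values are hypothetical; only the column names follow the snippets above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
bookings = pd.DataFrame({
    "price": [100.0, np.nan, 300.0],
    "num_guests": [2.0, 4.0, np.nan],
})

complete_rows = bookings.dropna()                       # keeps only rows with no NaN
has_guests = bookings.dropna(subset=["num_guests"])     # ignores NaN in other columns
filled = bookings.fillna(value={
    "price": bookings["price"].mean(),                  # mean skips NaN
    "num_guests": bookings["num_guests"].mean(),
})

print(complete_rows)
print(has_guests)
print(filled)
```

Each of these calls returns a new DataFrame and leaves `bookings` unchanged, which is why the earlier snippets assign the result back to `df`.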