## Session 3 - Data cleaning

Today we're going to look at how to clean data using pandas.

In [None]:
# First let's read in some movie metadata that we know needs cleaning.
# This dataset is from Kaggle (https://www.kaggle.com/datasets/carolzhangdc/imdb-5000-movie-dataset)

import pandas as pd

data = pd.read_csv('data/movie_metadata_unclean.csv')

data.head()

## Let's examine our data


1. As we look through the dataset, we can note down the problems we find and then come up with ways to fix them.

2. Pandas has some selection methods which we can use to slice and dice the dataset based on our queries.

EXAMPLES:

- Look at some basic stats for the `imdb_score` column: `data.imdb_score.describe()`
- Select a column: `data['movie_title']`
- Select the first 10 rows of a column: `data['duration'][:10]`
- Select multiple columns: `data[['budget', 'gross']]`
- Select all movies over two hours long: `data[data['duration'] > 120]`


In [None]:
data.imdb_score.describe()

In [None]:
data['movie_title']

In [None]:
data['duration'][:10]

In [None]:
data[['budget','gross']]

In [None]:
data[data['duration'] > 120]

## Missing Data

One of the most common problems is missing data. This could be because it was never filled out properly, the data wasn’t available, or there was a computing error. 

> __IMPORTANT: if we leave the blank values in there, it will cause errors in analysis later on, so we need to process the data to deal with missing values__

OPTIONS:

- Add in a default value for the missing data
- Get rid of (delete) the rows that have missing data
- Get rid of (delete) the columns that have a high incidence of missing data


In [None]:
# Let's see how many missing (null) values there are in each column:
# total number of rows minus the number of non-null values per column
len(data) - data.count()
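
The same per-column count of missing values can also be written with `isna().sum()`. Here is a minimal sketch on a tiny made-up DataFrame (the `sample` frame and its values are hypothetical stand-ins for the movie data, not rows from the real file):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the movie metadata
sample = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre", "Star Wars"],
    "country": ["USA", None, None],
    "duration": [178.0, 148.0, None],
})

# isna() marks every missing cell as True; sum() counts the Trues per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Both approaches should give the same numbers; `isna().sum()` just says "count the missing cells" more directly.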

## Add default values

- Let's get rid of all those nasty NaN values.
- QUESTION: what should we put in their place? This is where we need to judge the dataset and make an executive decision!

For our example, let’s look at the ‘country’ column. It’s straightforward enough, but some of the movies don’t have a country provided so the data shows up as NaN. In this case, we probably don’t want to assume the country, so we can replace it with an __empty string__ or some other default value.

In [None]:
data.country.isna()

In [None]:
# Let's see which films don't have a country
data[['movie_title', 'country']][data.country.isna()]

In [None]:
# This replaces the NaN entries in the ‘country’ column with the empty string, 
# but we could just as easily tell it to replace with a default name such as “Not known” or "Other".

data.country = data.country.fillna('')

data.loc[[4]][['movie_title', 'country']]


In [None]:
# Now let's look at a numeric column - duration

data[['movie_title', 'duration']][data.duration.isna()]

In [None]:
# With numerical data like the "duration" of the movie, 
# a calculation like taking the *mean duration* can help us even the dataset out. 

# That way we don’t have crazy numbers like 0 or NaN throwing off our analysis.

data.duration = data.duration.fillna(data.duration.mean())

data.loc[[4]][['movie_title', 'duration']]
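
If several columns need different defaults, `fillna` also accepts a dictionary that maps column names to fill values. A small sketch, assuming a made-up `sample` frame and "Not known" as the text default (both are illustrative choices, not part of the movie dataset):

```python
import pandas as pd

# Hypothetical mini-dataset with one missing country and one missing duration
sample = pd.DataFrame({
    "country": ["USA", None, "UK"],
    "duration": [100.0, None, 120.0],
})

# One default per column: a label for text, the column mean for numbers
defaults = {
    "country": "Not known",
    "duration": sample["duration"].mean(),
}
filled = sample.fillna(defaults)
print(filled)
```

Whether the mean is a sensible default depends on the column, so it is worth checking the distribution before filling.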

## Remove incomplete rows

- Now we want to get rid of any rows that have a missing value. 
- It’s a pretty aggressive technique, but there may be a use case where that’s exactly what __we want to do.__

In [None]:
# Let's create a copy of the data set to play with and count the rows
test_data = data.copy()
test_data.shape

In [None]:
# Let's test dropping all rows with any NA values:
test_data = data.copy()

test_data.dropna(inplace=True)
test_data.shape

In [None]:
# We can also drop rows that have ALL NA values (which we don't have any of):
test_data = data.copy()

test_data.dropna(how='all', inplace=True)
test_data.shape

In [None]:
# Put a limitation on how many non-null values need to be in a row in order to keep it 
# (in this example, the data needs to have at least 25 non-null values):
test_data = data.copy()

test_data.dropna(thresh=25, inplace=True)
test_data.shape

In [None]:
# Here we don't want to include any movie
# that doesn't have information on when it came out:
test_data = data.copy()

test_data.dropna(subset=['title_year'], inplace=True)
test_data.shape
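
To see what each `dropna` option keeps, it can help to try them on a frame small enough to check by eye. A minimal sketch, assuming a made-up four-row `sample` frame (the values are hypothetical):

```python
import pandas as pd

# Hypothetical rows: one complete, two partly missing, one completely empty
sample = pd.DataFrame({
    "movie_title": ["A", "B", None, None],
    "title_year": [2009.0, None, 2015.0, None],
    "duration": [178.0, 148.0, None, None],
})

complete_rows = sample.dropna()                    # keep rows with no missing values
not_empty_rows = sample.dropna(how="all")          # drop rows where every value is missing
at_least_two = sample.dropna(thresh=2)             # keep rows with 2 or more non-null values
has_year = sample.dropna(subset=["title_year"])    # only look at the title_year column

print(len(complete_rows), len(not_empty_rows), len(at_least_two), len(has_year))
```

None of these calls use `inplace=True`, so `sample` itself stays unchanged and each result can be compared side by side.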

## Dealing with error-prone columns

- We can apply the same kind of criteria to our __columns.__ 
- We just need to pass the parameter __axis=1__ in our code.
- That tells pandas to operate on columns, not rows.

> _Do not run the code below if you do not want to delete data - otherwise feel free to experiment!_

In [None]:
# Drop the columns that contain only NA values (we don't have any of these):
test_data = data.copy()

test_data.dropna(axis=1, how='all', inplace=True)
test_data.shape

In [None]:
# Drop all columns with *any* NA values:
test_data = data.copy()

test_data.dropna(axis=1, how='any', inplace=True)
test_data.shape

# Note: we can use the same thresh and subset parameters as we did with rows
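
The `thresh` parameter is a handy way to drop only the columns with a high incidence of missing data. A sketch under stated assumptions: the `sample` frame is made up, and the 50% cut-off is an arbitrary choice for illustration:

```python
import pandas as pd

# Hypothetical mini-dataset: budget is mostly missing, gross is mostly present
sample = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "budget": [1.0, None, None, None],
    "gross": [5.0, 6.0, None, 8.0],
})

# Keep a column only if at least half of its values are present
min_non_null = int(len(sample) * 0.5)
trimmed = sample.dropna(axis=1, thresh=min_non_null)
print(list(trimmed.columns))
```

Here `budget` has only one non-null value out of four, so it falls below the cut-off and is removed, while `gross` is kept.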

## Normalize data types

- Sometimes, especially when we are reading in a CSV with a bunch of numbers, some of the numbers will read in as __strings__ instead of numeric values, or vice versa.

- Let's review a couple of ways to fix and normalise our data types.

- Please note that we are going to read the data from disk again, so that the types are converted while the file is being parsed.

- If we tried to force an integer type while NaN values were still present, pandas would throw an error complaining about NaN values. So we first save our previous results into a file and then read that file again.


In [None]:
# 1. save results into the file again

data.dropna(inplace=True)

# index=False stops pandas from writing the row index as an extra unnamed column
data.to_csv('data/movie_metadata_cleaned.csv', index=False)

# 2. read from the file again

data = pd.read_csv('data/movie_metadata_cleaned.csv')

# Look at how the duration field has been read in (float)
data.duration

In [None]:
# Now force it to be an integer
data = pd.read_csv('data/movie_metadata_cleaned.csv', dtype={'duration': int})
data.duration

In [None]:
# Same with actor_2_facebook_likes field

data = pd.read_csv('data/movie_metadata_cleaned.csv')
data.actor_2_facebook_likes

In [None]:
# Force actor_2_facebook_likes to be a string

data = pd.read_csv('data/movie_metadata_cleaned.csv', dtype={'actor_2_facebook_likes': str})
data.actor_2_facebook_likes
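
Types can also be converted in memory with `astype`, without re-reading the file. A minimal sketch, assuming a small made-up `sample` frame and that no missing values remain in the column (casting NaN to an integer raises an error):

```python
import pandas as pd

# Hypothetical values; 107.2 mimics a duration that was filled with a mean
sample = pd.DataFrame({
    "duration": [178.0, 148.0, 107.2],
    "actor_2_facebook_likes": [936.0, 5000.0, 393.0],
})

# Round first so that fractional values become whole minutes, then cast to int
sample["duration"] = sample["duration"].round().astype(int)

# Cast a numeric column to strings
sample["actor_2_facebook_likes"] = sample["actor_2_facebook_likes"].astype(str)

print(sample.dtypes)
```

Rounding before the cast is a design choice: `astype(int)` on its own would silently truncate 107.2 to 107 and 107.9 to 107 as well.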

## Change casing

- Columns with user-provided data are ripe for corruption. 
- People make typos, leave their caps lock on (or off), and add extra spaces where they shouldn’t.
- Let's see how to correct these issues


In [None]:
# Convert every movie title to upper case
data['movie_title'].str.upper()

In [None]:
# Let's get rid of leading and trailing whitespace

data['movie_title'].str.strip()
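
The `.str` methods return a new Series and can be chained, so several fixes can be applied in one go. A small sketch on made-up titles (the `titles` Series is hypothetical):

```python
import pandas as pd

# Hypothetical messy titles: stray spaces and inconsistent casing
titles = pd.Series(["  avatar ", "SPECTRE", "the dark knight rises  "])

# Strip the whitespace first, then capitalise the first letter of each word
cleaned = titles.str.strip().str.title()
print(cleaned.tolist())
```

To keep the cleaned values in a DataFrame, the result needs to be assigned back to the column, for example `data['movie_title'] = data['movie_title'].str.strip()`.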

## DID YOU KNOW:  

### It is also possible to correct spelling mistakes in your data!

<div class="alert alert-block alert-success">

- We will not be covering this in our course, but you can read about it in your spare time
- It is called __FUZZY MATCHING__ 
- Fuzzy string matching uses __[Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance)__ to calculate the differences between sequences
- Note the exclamation mark below
    
</div>

<img src="data/images/fuzzy.jpg">
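
For the curious, here is a minimal pure-Python sketch of the Levenshtein distance itself. The function name `levenshtein` is our own choice for illustration; in practice you would more likely reach for a dedicated fuzzy-matching library than write this by hand.

```python
def levenshtein(a: str, b: str) -> int:
    """Smallest number of single-character inserts, deletes or substitutions turning a into b."""
    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if the characters match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # prints 3
```

A small distance between two values (for example two spellings of the same country) suggests they may be the same thing typed differently.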

## Rename columns

- If your data was generated by a computer program, it probably has some computer-generated column names too. 
- Those can be hard to read and understand while working.
- We can rename a column to something more user-friendly.
- We already practised this in an earlier pandas tutorial, so let's remind ourselves how to do it.

In [None]:
# rename() returns a new DataFrame; assign it back to data if you want to keep the new names
data.rename(columns={'title_year': 'release_date', 'movie_facebook_likes': 'facebook_likes'})

## Saving Results

- When you’re done cleaning your data, you may want to export it back into CSV format for further processing in another program.
- Always remember to save your data, otherwise all your efforts will be lost.

In [None]:
data.to_csv('data/movie_metadata_cleaned.csv', encoding='utf-8', index=False)

# NON-CODING DEMO SLIDES

<div class="alert alert-block alert-warning">
    
- There are many more advanced techniques for cleaning, inspecting and processing data

- <b>We won't do any coding</b>, but we will do a high-level walkthrough of some of them

- In the future you can explore these techniques in more detail
</div>


## Let's review how to identify and see:

- ##### Missing Data
- ##### Irregular Data


## Technique 1: Missing Data Heatmap

- When there is a smaller number of features, we can visualize the missing data via heatmap.
- The chart below demonstrates the missing data patterns of the first 30 features. 
    - The horizontal axis shows the feature names;
    - the vertical axis shows the observations/rows;
    - yellow marks missing data, while blue marks data that is present.
    
    
<img src="data/images/missing1.png">

## Technique 2: Missing Data Percentage List

- When there are many features in the dataset, we can make a list of missing data % for each feature.
- The list below shows the percentage of missing values for each of the features.

<img src="data/images/missing2.png">
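
If you want to try this later, a percentage list like this can be sketched in one line of pandas. The `sample` frame below is made up for illustration and is not the dataset shown in the image:

```python
import pandas as pd

# Hypothetical mini-dataset with different amounts of missing data per feature
sample = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "budget": [1.0, None, None, None],
    "gross": [5.0, 6.0, None, 8.0],
})

# The mean of a True/False column is the fraction of True values,
# so isna().mean() gives the share of missing values per feature
missing_pct = (sample.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the features with the most missing data at the top of the list.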

## Technique 3: Missing Data Histogram

- A missing data histogram is another technique for when we have many features.
- To learn more about the missing value patterns among observations, we can visualize them with a histogram.
- This histogram helps to identify patterns of missing values among the 30,471 observations.

<img src="data/images/missing3.png">
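
The numbers behind such a histogram are simply "how many values are missing in each row". A minimal sketch, assuming a tiny made-up `sample` frame rather than the 30,471-row dataset in the image:

```python
import pandas as pd

# Hypothetical mini-dataset
sample = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "budget": [1.0, None, None, None],
    "gross": [5.0, 6.0, None, 8.0],
})

# Count the missing values in every row (axis=1 sums across the columns)
missing_per_row = sample.isna().sum(axis=1)

# Then count how many rows have 0, 1, 2, ... missing values
row_counts = missing_per_row.value_counts().sort_index()
print(row_counts)
```

Plotting `row_counts` as bars would give the histogram; the table alone is often enough to spot rows that are mostly empty.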

# Irregular data (Outliers)

- Outliers are data points that are distinctly different from other observations.
- They could be real outliers or mistakes.
- Depending on whether the feature is numeric or categorical, we can use different techniques to study its distribution to detect outliers.

## Technique 1: Histogram/Box Plot

- When the feature is numeric, we can use a histogram and box plot to detect outliers.
- In the example below, the data looks highly skewed, with possible outliers.

<img src="data/images/outlier1.png">
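
A box plot usually draws its whiskers at 1.5 times the interquartile range (IQR), and the same rule can be applied in code to list the suspicious values. This is a sketch of that common rule on made-up durations, not necessarily the method used to produce the image above:

```python
import pandas as pd

# Hypothetical movie durations in minutes, with one suspicious value
durations = pd.Series([90, 95, 100, 105, 110, 115, 120, 511])

q1 = durations.quantile(0.25)
q3 = durations.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged as a possible outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = durations[(durations < lower) | (durations > upper)]
print(outliers.tolist())
```

A flagged value is only a candidate: it still needs a human decision on whether it is a real (very long) movie or a data entry mistake.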


## Technique 2: Bar Chart

- When the feature is categorical, we can use a bar chart to learn about its categories and distribution.
- For example, the feature `ecology` has a reasonable distribution.
- But if there were a category, say “other”, containing only one value, that would be an outlier.

<img src="data/images/outlier2.png">
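
The counts behind a bar chart like this come from `value_counts`. A small sketch, assuming made-up category values for an `ecology` feature (the labels are hypothetical):

```python
import pandas as pd

# Hypothetical categorical feature with one rarely used category
ecology = pd.Series([
    "good", "poor", "good", "excellent",
    "poor", "good", "excellent", "other",
])

# Number of rows per category, most frequent first
counts = ecology.value_counts()

# Categories that appear only once are candidates for a closer look
rare = counts[counts == 1]
print(counts)
print(rare.index.tolist())
```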



# Unnecessary data

- All the data feeding into the model should serve the purpose of the project. 
- Data is unnecessary when it doesn’t add value.
- We cover two main types of unnecessary data, which arise for different reasons.

## Unnecessary type 1: Uninformative / Repetitive

- Sometimes a feature is uninformative because too many of its rows have the same value.
- We can create a list of features with a high percentage of the same value.
- For example, below we show the features where over 95% of rows have the same value.

<img src="data/images/unnecessary1.png">
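
One way to build such a list is to look at the share of the most common value in each feature. A minimal sketch on a made-up `sample` frame, with the 95% cut-off taken from the text above:

```python
import pandas as pd

# Hypothetical data: 'color' is almost always the same, 'imdb_score' varies
sample = pd.DataFrame({
    "color": ["Color"] * 39 + ["Black and White"],
    "imdb_score": [float(i % 10) for i in range(40)],
})

# Share of rows taken up by the most frequent value of each feature
# (dropna=False counts missing values as a value of their own)
top_share = {
    col: sample[col].value_counts(normalize=True, dropna=False).iloc[0]
    for col in sample.columns
}

uninformative = [col for col, share in top_share.items() if share > 0.95]
print(uninformative)
```

A feature on this list is not automatically useless, but it deserves a second look before it goes into a model.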

## Unnecessary type 2: Duplicates

- Duplicate data is when copies of the same observation exist.
- There are two main types of duplicate data:

__1. Duplicates type 1: All Features based__

- This duplicate happens when all the features’ values within the observations are the same. 

__2. Duplicates type 2: Key Features based__

- Sometimes it is better to remove duplicate data based on a set of unique identifiers.
- For example, the chances of two transactions happening at the same time, with the same square footage, the same price, and the same build year are close to zero.
- We can set up a group of critical features as unique identifiers for transactions and we check if there are duplicates based on them.

> We will need to identify and remove those duplicates (if necessary)
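
Both kinds of duplicate check can be sketched with `duplicated` and `drop_duplicates`. The `sales` frame and its column names (`timestamp`, `full_sq`, `price`) are made up to mirror the transaction example above:

```python
import pandas as pd

# Hypothetical transactions: ids 1 and 2 describe the same sale
sales = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "timestamp": ["2014-01-01", "2014-01-01", "2014-02-01", "2014-02-01"],
    "full_sq": [50, 50, 70, 70],
    "price": [100, 100, 200, 250],
})

# Type 1: rows that are identical in every feature (none here, the ids differ)
full_dupes = sales.duplicated().sum()

# Type 2: rows that are identical in the key features only
key_cols = ["timestamp", "full_sq", "price"]
key_dupes = sales.duplicated(subset=key_cols).sum()

# Keep the first occurrence of each key combination and drop the rest
deduped = sales.drop_duplicates(subset=key_cols, keep="first")
print(full_dupes, key_dupes, len(deduped))
```

Which features count as "key" is a judgement call about the data, so it is worth inspecting the flagged rows before deleting anything.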