# CSI 4142 - Introduction to Data Science
# Assignment 2: Data cleaning

Shacha Parker (300235525)\
Callum Frodsham (300199446)\
Group 79

### Setup Instructions To Reproduce this Data Cleaning Notebook:
1. (Optional) Create a Python virtual environment in the project directory to hold the required packages:  
``` 
python -m venv .venv
```
To activate the virtual environment: 
```
.venv/Scripts/Activate.ps1 # on Windows (PowerShell)
source .venv/bin/activate # on macOS/Linux
```
2. Install the required packages (run in the shell of your choice):
```
pip install jupyter
pip install ipykernel
pip install pandas
pip install numpy
```
3. VS Code: make sure the correct Python kernel is selected.
<br> 
If you are using a virtual environment, select the Python interpreter from that environment; otherwise the notebook will not find the installed packages. If you installed everything globally, select the Python interpreter that the packages were installed for.
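
To confirm that the selected kernel can actually see the installed packages, a quick check like the one below can be run in a notebook cell. This is only a sketch; the helper name `environment_summary` is made up for illustration and is not part of the notebook.

```python
import sys

import numpy as np
import pandas as pd


def environment_summary():
    """Return the interpreter path and package versions seen by this kernel."""
    return {
        "python": sys.executable,
        "numpy": np.__version__,
        "pandas": pd.__version__,
    }


summary = environment_summary()
for name, value in summary.items():
    print(f"{name}: {value}")
```

If the printed interpreter path does not point inside `.venv`, the notebook is most likely running on a different kernel than the one the packages were installed into.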

In [None]:
# Initial imports
import numpy as np
import pandas as pd
import re

<h1>Dataset 1: Netflix Movies and TV Shows</h1>
<h3>Clean Data Checking</h3>

Author: Shivam Bansal
<br>
Purpose: This dataset was made to provide insights on the shows and movies that Netflix is hosting on their platform. For example, these insights could be used to see what type of content the platform is missing, or what type of content they have too much of.
<br>
Shape: This dataset is composed of 12 columns and 8807 rows.
<br><br>
Link: <a href="https://www.kaggle.com/datasets/shivamb/netflix-shows">Netflix Movies and TV Shows</a>
<br>
<h3>Dataset Feature List: </h3>
<ol>
    <li>Show Id:
    <br>
    Feature Type: Categorical - Nominal
    <br>
    Description: The show_id is a unique ID that is assigned to each show/movie. There are 8807 entries ranging from s1 to s8807.
    </li>
    <br>
    <li>Type:
    <br>
    Feature Type: Categorical - Nominal
    <br>
    Description: Indicates whether the content is a TV show or a movie.
        </li>
    <br>
    <li>Title:
    <br>
    Feature Type: Categorical - Nominal
    <br>
    Description: The title name of the tv show or movie.
        </li>
    <br>
    <li>Director:
    <br>
    Feature Type: Categorical - Nominal
    <br>
    Description: The name of the person(s) who directed the tv show or movie. 
        </li>
    <br>
    <li>Cast:
    <br>
    Feature Type: Categorical - Nominal
    <br>
    Description: The name(s) of the notable actor(s) who acted in the tv show or movie.
        </li>
    <br>
    <li>Country:
    <br>
    Feature Type: Categorical - Nominal
    <br>
    Description: The country or countries where the TV show or movie was produced.
        </li>
    <br>
    <li>Date Added:
    <br>
    Feature Type: Numerical - Continuous
    <br>
    Description: The date the show or movie was added to Netflix.
        </li>
    <br>
    <li>Release Year:
    <br>
    Feature Type: Numerical - Continuous
    <br>
    Description: The year in which the tv show or movie was originally released.
        </li>
    <br>
    <li>Rating:
    <br>
    Feature Type: Categorical - Ordinal
    <br>
    Description: This rating indicates the acceptable age of viewing for the tv show or movie.
        </li>
    <br>
    <li>Duration:
    <br>
    Feature Type: Mixed Type - Numerical Continuous - Categorical Ordinal 
    <br>
    Description: The duration of the movie in minutes, or if it is a tv show, in seasons.
        </li>
    <br>
    <li>Genre/Listed In:
    <br>
    Feature Type: Categorical - Nominal
    <br>
    Description: The genres/subgenres the tv show or movie falls in.
        </li>
    <br>
    <li>Description:
    <br>
    Feature Type: Categorical - Nominal
    <br>
    Description: The description of the tv show or movie.
        </li>
</ol>
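
Because the Duration feature mixes two units (minutes for movies, seasons for TV shows), it can help to split it into a number and a unit before any numerical analysis. The sketch below uses made-up values and assumes the strings look like "90 min" or "2 Seasons"; the column names `duration_value` and `duration_unit` are illustrative and do not appear in the dataset.

```python
import pandas as pd

# Made-up sample values; the real column is assumed to follow the same pattern
sample = pd.DataFrame({"duration": ["90 min", "2 Seasons", "1 Season", None]})

# Capture the leading number and the unit that follows it
parts = sample["duration"].str.extract(r"^\s*(\d+)\s*(\w+)\s*$")

# The nullable integer dtype keeps missing durations as <NA> instead of failing
sample["duration_value"] = pd.to_numeric(parts[0]).astype("Int64")

# Normalise "Season" and "Seasons" to a single label
sample["duration_unit"] = parts[1].str.lower().str.rstrip("s")

print(sample)
```

Keeping the unit in its own column means movies and TV shows can be filtered apart before comparing durations, since minutes and seasons are not comparable.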

In [None]:
# load the dataset: 
dataset = pd.read_csv("https://raw.githubusercontent.com/CLFrod/Assignment2CSI4142/refs/heads/master/netflix_titles.csv")

<h4>1. Range Check:</h4>
<p>
In this test, we verify that every value of a numerical attribute lies within an allowed range, defined by the minimum and maximum values that the attribute can take.
</p>
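
To make the idea concrete, here is a small sketch on made-up years (not taken from the dataset). It assumes the same bounds used in this notebook, 1888 and 2025, and shows that `Series.between` is inclusive on both ends and treats missing values as out of range.

```python
import numpy as np
import pandas as pd

# Made-up years: one too early, one too late and one missing value
years = pd.Series([1888, 1999, 2025, 1700, 2030, np.nan])

# between() is inclusive on both ends by default
in_range = years.between(1888, 2025)

# Missing values compare as False, so they are reported as out of range too
out_of_range = years[~in_range]

print(out_of_range)
print(f"{len(out_of_range)} value(s) outside the range")
```

Whether a missing value should count as a range error or be left to a separate presence check is a design choice; this sketch simply reports it.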

In [None]:
# Please enter the various attributes below to perform the tests:
range_attributes = ['release_year', 'date_added']
# attribute selection:
test_range_attribute = 'release_year'
# Minimum:
range_minimum = 1888
# Maximum:
range_maximum = 2025

In [None]:
# Checker Code
range_series = dataset[test_range_attribute].between(range_minimum, range_maximum)

# invert the mask: True now marks values that fall outside the range
# (missing values also count as out of range, since between() is False for NaN)
not_range_series = ~range_series

# count the out-of-range values (True counts as 1 when summed)
out_of_range_count = int(not_range_series.sum())

# if at least one value is out of range, report the count and the values
if out_of_range_count > 0:
    print(f"{out_of_range_count} value(s) fall outside the specified range:")
    print(dataset.loc[not_range_series, test_range_attribute])
# if no value is out of range
else:
    print("All values fall within the specified parameters!")

<p style ="font-size:20px">Range check Findings: </p>
No range errors were detected based on the provided parameters.
The dataset's 'release_year' feature only has years that fit within logical parameters.

<h4>2. Format Check:</h4>
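<p>
In this test, we verify that the values of an attribute follow an expected format. Here, 'date_added' is expected to follow the pattern "Month day, year", for example "November 1, 2016".
</p>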

In [None]:
# Please enter the various attributes below to perform the tests:
format_attributes = ['date_added', 'release_year']

test_format_attribute = format_attributes[0]


In [None]:
# Checker Code
# get the Series of dates added
date_added_df = dataset[test_format_attribute]

# convert to datetime; values that do not match the format become NaT
converted_dates = pd.to_datetime(date_added_df, format="%B %d, %Y", errors="coerce")

# get the incorrectly formatted dates
incorrectly_formatted_dates = converted_dates.isna()
print(dataset[incorrectly_formatted_dates][test_format_attribute])


<p style ="font-size:20px">Format check Findings: </p>
There are 98 data points that do not follow the expected date format.
For example:
Row 6068 has no date at all, so it cannot follow the expected format.<br>
<br>
Row 8759 has " November 1, 2016". The date itself is well formed, but the value has a leading space that should not be there.
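
The two kinds of failure described above can be told apart by parsing the values twice, once as they are and once after stripping surrounding whitespace. This is a sketch on made-up values that mirror the two cases; the variable names are illustrative and are not used elsewhere in the notebook.

```python
import pandas as pd

# Made-up values: a clean date, one with a leading space, and a missing one
dates = pd.Series(["September 25, 2021", " November 1, 2016", None])
date_format = "%B %d, %Y"

# Parse the raw values, then parse again after stripping whitespace
strict = pd.to_datetime(dates, format=date_format, errors="coerce")
stripped = pd.to_datetime(dates.str.strip(), format=date_format, errors="coerce")

# Failures that disappear after stripping are whitespace problems;
# failures that remain are missing or genuinely malformed values
whitespace_only = strict.isna() & stripped.notna()
still_invalid = stripped.isna()

print(dates[whitespace_only])
print(dates[still_invalid])
```

Separating the two groups shows how many of the flagged rows could be repaired by trimming alone and how many need a decision about missing data.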

<h4>3. Data Type Check:</h4>
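<p>
In this test, we verify that every value of an attribute has the expected data type. Here, 'release_year' is expected to contain only integers.
</p>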

In [None]:
# Please enter the various attributes below to perform the tests:
attributes = ['release_year']
data_type_test_attribute = attributes[0]

In [None]:
# Checker Code
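# small function to check if val is an integer
# (accepts Python ints and NumPy integers; booleans are rejected,
# because bool is a subclass of int in Python)
def is_integer(val):
    return isinstance(val, (int, np.integer)) and not isinstance(val, bool)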

data_type_check = dataset[data_type_test_attribute].apply(is_integer).all()

if data_type_check:
    print("All values are integers.")
else:
    print("Not all values are integers.")


<p style ="font-size:20px">Data Type check Findings: </p>
No data type errors were detected based on the provided parameters.
The dataset's 'release_year' feature only contains integer values.
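
An alternative to testing each value is to inspect the column's dtype once with `pd.api.types.is_integer_dtype`. The sketch below uses a made-up frame; the column `release_text` is hypothetical and only serves as a counterexample.

```python
import pandas as pd

# Made-up columns: one stored as integers, one stored as text
frame = pd.DataFrame(
    {"release_year": [1999, 2020], "release_text": ["1999", "2020"]}
)

# Check the dtype of each column once instead of testing every value
for column in frame.columns:
    is_int = pd.api.types.is_integer_dtype(frame[column])
    print(f"{column}: dtype={frame[column].dtype}, integer dtype={is_int}")
```

A dtype check is cheaper than a per-value check, but a single missing value turns a default integer column into a float column, so the two approaches can disagree on columns with gaps.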

<h4>4. Consistency Check</h4>
<p>

</p>

In [None]:
# Please enter the various attributes below to perform the tests:
consistency_attributes = ['release_year']


In [None]:
# Checker Code

findings report

<h4>5. Uniqueness Check:</h4>
<p>

</p>

In [None]:
# Please enter the various attributes below to perform the tests:
attributes = []

In [None]:
# Checker Code

findings report

<h4>6. Presence Check:</h4>
<p>

</p>

In [None]:
# Please enter the various attributes below to perform the tests:
attributes = []

In [None]:
# Checker Code

findings report

<h4>7. Length Check:</h4>
<p>

</p>

In [None]:
# Please enter the various attributes below to perform the tests:
attributes = []

In [None]:
# Checker Code

findings report

<h4>8. Look-up Check:</h4>
<p>

</p>

In [None]:
# Please enter the various attributes below to perform the tests:
attributes = []

In [None]:
# Checker Code

findings report

<h4>9. Exact Duplicate Check</h4>
<p>

</p>

In [None]:
# Please enter the various attributes below to perform the tests:
attributes = []

In [None]:
# Checker Code

findings report

<h4>10. Near Duplicate Check:</h4>
<p>

</p>

In [None]:
# Please enter the various attributes below to perform the tests:
attributes = []


In [None]:
# Checker Code

findings report

<h3>References:</h3>
<ul>
<li>
<a href="https://www.w3schools.com/python/python_datetime.asp">Python datetime formatting</a>
</li>
<li>
<a href="https://stackoverflow.com/questions/402504/how-to-determine-a-python-variables-type"> Check Variable Type</a>
</li>
</ul>