# CSI 4142 Data Science 
## Assignment 2 - Data Cleaning

### Identification

Name: Eli Wynn<br/>
Student Number: 300248135

Name: Jack Snelgrove<br/>
Student Number: 300247435


Our datasets are hosted in the following public repository:

- [github.com/eli-wynn/Datasets](https://github.com/eli-wynn/Datasets)

Imports:

In [4]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

### Importing Datasets

In [5]:
netflix  = "https://raw.githubusercontent.com/eli-wynn/Datasets/refs/heads/main/netflix_titles.csv"
netflixData = pd.read_csv(netflix)
startup = "https://raw.githubusercontent.com/eli-wynn/Datasets/refs/heads/main/startup.csv"
startupData = pd.read_csv(startup)

### Clean Data Checker

#### Data Type Error

A data type error occurs when a value entered into a column doesn't match the data type assigned to that column. The checker below finds zero data type errors in the Netflix dataset.
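As a minimal sketch of the idea, the snippet below runs the same `pd.to_numeric(..., errors='coerce')` approach on a made-up column. The `toy` frame and the `type_error_mask` name are assumptions for illustration only, not part of the Netflix data.

```python
import pandas as pd

# Hypothetical toy column: 'release_year' should be numeric, but one entry is text.
toy = pd.DataFrame({"release_year": [2019, "20x1", 2021]})

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
converted = pd.to_numeric(toy["release_year"], errors="coerce")

# A value that was present originally but became NaN is a suspected type error.
type_error_mask = toy["release_year"].notna() & converted.isna()
print(toy[type_error_mask])
```

Comparing against the original non-null values keeps genuinely missing entries from being reported as type errors.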


#### Parameters

In [7]:
intCol = ['release_year']
stringCols = ['show_id', 'type', 'title', 'director', 'cast', 'country', 'rating', 'duration', 'listed_in', 'description']

#### Checker Code

In [8]:
for col in intCol:
    netflixData[col] = pd.to_numeric(netflixData[col], errors='coerce')

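for col in stringCols:
    # Flag non-null values that are not Python strings (casting with astype(str) first would hide them all)
    invalid_strings = netflixData[netflixData[col].notna() & ~netflixData[col].apply(lambda x: isinstance(x, str))]
    if not invalid_strings.empty:
        print(f"\nPossible non-string values in '{col}':\n", invalid_strings.head(5))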

#### Findings

In [9]:
# Find rows where conversion resulted in NaN (potential type errors)
type_errors = netflixData[netflixData[intCol].isna().any(axis=1)]
print("Possible data type errors:\n", type_errors)

Possible data type errors:
 Empty DataFrame
Columns: [show_id, type, title, director, cast, country, date_added, release_year, rating, duration, listed_in, description]
Index: []


#### Range Errors

Checks for values that fall outside an acceptable range, e.g. a negative number of seasons or a release year earlier than the 1925 lower bound used below.
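The range test can be sketched on a few made-up rows. The `toy` frame and the names `low`, `high`, and `minutes` are assumptions for illustration; the bounds mirror the duration limits used in this section.

```python
import pandas as pd

# Hypothetical durations in the same "<number> <unit>" shape as the dataset.
toy = pd.DataFrame({"duration": ["90 min", "312 min", "2 Seasons", "0 min"]})
low, high = 0, 300  # value must be strictly between these bounds

# Take the leading number from strings such as "90 min" or "2 Seasons".
minutes = pd.to_numeric(toy["duration"].str.split(" ").str[0], errors="coerce")

# Keep rows whose number parsed but lies outside the open interval (low, high).
in_range = (minutes > low) & (minutes < high)
out_of_range = toy[minutes.notna() & ~in_range]
print(out_of_range)
```

Values that fail to parse are left out here on purpose, since those are better reported by a format check.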

#### Parameters

In [10]:
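releaseParam = [1925, 2025]  # allowed release_year range, inclusive
durationParams = [0, 300]  # split duration on the space; the leading number must be > 0 and < 300
dateAdded = [2007, 2025]  # allowed range for the year of date_added, inclusive (only the year is checked)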

#### Checker Code

In [14]:
releaseErrors = netflixData[(netflixData['release_year'] < releaseParam[0]) | (netflixData['release_year'] > releaseParam[1])]

netflixData['duration_split'] = netflixData['duration'].str.split(" ").str[0]  # Extract number part
netflixData['duration_split'] = pd.to_numeric(netflixData['duration_split'], errors='coerce')  # Convert to numeric (NaN where parsing fails)
# Identify invalid durations
durationErrors = netflixData[(netflixData['duration_split'] <= durationParams[0]) | (netflixData['duration_split'] >= durationParams[1])]

netflixData['date_added'] = pd.to_datetime(netflixData['date_added'], errors='coerce')
netflixData['year_added'] = netflixData['date_added'].dt.year #take just year value, other date errors will be caught in format error below
# Identify invalid date_added values
dateAddedErrors = netflixData[(netflixData['year_added'] < dateAdded[0]) | (netflixData['year_added'] > dateAdded[1])]

#### Findings

In [13]:
print("\nRelease Year Errors:\n", releaseErrors[['title', 'release_year']].head(5))
print("\nDuration Errors:\n", durationErrors[['show_id', 'title', 'duration', 'duration_split']].head(5))
print("\nDate Added Errors:\n", dateAddedErrors[['title', 'date_added', 'year_added']].head(5))


Release Year Errors:
 Empty DataFrame
Columns: [title, release_year]
Index: []

Duration Errors:
      show_id                       title duration  duration_split
4253   s4254  Black Mirror: Bandersnatch  312 min           312.0

Date Added Errors:
 Empty DataFrame
Columns: [title, date_added, year_added]
Index: []


#### Format Errors

Checks for errors in how values are formatted, e.g. a date written as DD-MM-YYYY when the year is expected first (YYYY-MM-DD).
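A minimal sketch of a strict date-format check follows, assuming the expected layout is year first (`%Y-%m-%d`). The sample strings and the names `parsed` and `bad_format` are made up for illustration.

```python
import pandas as pd

# Hypothetical date strings: the middle one is DD-MM-YYYY instead of YYYY-MM-DD.
toy = pd.Series(["2021-09-25", "25-09-2021", "2020-01-07"])

# With an explicit format, values that do not match become NaT under errors='coerce'.
parsed = pd.to_datetime(toy, format="%Y-%m-%d", errors="coerce")

# Entries that failed to parse are the suspected format errors.
bad_format = toy[parsed.isna()]
print(bad_format)
```

Passing an explicit `format` is a design choice: without it, pandas may infer a layout and quietly accept dates in an unexpected order.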

#### Parameters

In [15]:
dateCol = ['date_added']  # must parse as a valid date
showCol = ['show_id']  # must be "s" followed by 1 to 4 digits
durationCol = ['duration']  # must be a number followed by "min", "Season" or "Seasons"

#### Checker Code

In [None]:
for col in dateCol:
    netflixData[col] = pd.to_datetime(netflixData[col], errors='coerce')
invalid_dates = netflixData[netflixData[dateCol].isna().any(axis=1)]

# Show IDs that are not "s" followed by 1 to 4 digits
brokenID = netflixData[~netflixData['show_id'].astype(str).str.match(r"^s\d{1,4}$", na=False)]
# Durations that are not digits, then a space, then "min", "Season" or "Seasons"
brokenDuration = netflixData[~netflixData['duration'].astype(str).str.match(r"^\d+\s(min|Season|Seasons)$", na=False)]

#### Findings

In [20]:
print("\nPossible date format errors:\n", invalid_dates[['title', 'date_added']].head(5))
print("\nShow ID Format Errors:\n", brokenID[['show_id', 'title']].head(5))
print("\nDuration Format Errors:\n", brokenDuration[['duration', 'title']].head(5))  # the Louis C.K. durations appear in the rating column instead, for some reason


Possible date format errors:
                                             title date_added
6066  A Young Doctor's Notebook and Other Stories        NaT
6079                              Abnormal Summit        NaT
6174              Anthony Bourdain: Parts Unknown        NaT
6177                                     忍者ハットリくん        NaT
6213                                Bad Education        NaT

Show ID Format Errors:
   show_id                  title
0      s1   Dick Johnson Is Dead
1      s2          Blood & Water
2      s3              Ganglands
3      s4  Jailbirds New Orleans
4      s5           Kota Factory

Duration Format Errors:
      duration                                 title
5541      NaN                       Louis C.K. 2017
5794      NaN                 Louis C.K.: Hilarious
5813      NaN  Louis C.K.: Live at the Comedy Store


#### Consistency Errors

#### Parameters

#### Checker Code

#### Findings

#### Uniqueness Errors

#### Parameters

#### Checker Code

#### Findings

#### Presence Errors

#### Parameters

#### Checker Code

#### Findings

#### Length Errors

#### Parameters

#### Checker Code

#### Findings

#### Lookup Errors

#### Parameters

#### Checker Code

#### Findings

#### Exact Duplicate Errors

#### Parameters

#### Checker Code

#### Findings

#### Near Duplicate Errors

#### Parameters

#### Checker Code

#### Findings

### Imputation

#### Test #1

- a) Funding Rounds 
- b) 

Index(['Startup Name', 'Industry', 'Funding Rounds', 'Investment Amount (USD)',
       'Valuation (USD)', 'Number of Investors', 'Country', 'Year Founded',
       'Growth Rate (%)'],
      dtype='object')
