## **01_data_clean.ipynb**

### Objectives

* Fetch data from Kaggle and save as raw data
* Clean the data

### Inputs

* https://www.kaggle.com/datasets/rabieelkharoua/predict-pet-adoption-status-dataset/data
* https://doi.org/10.34740/kaggle/ds/5242440
* Save as data_raw.csv

### Outputs

* Save cleaned data as data_clean.csv


In [None]:
# import required libraries
import os
import pandas as pd
import numpy as np

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [21]:
# get the current working directory
current_dir = os.getcwd()
current_dir

'c:\\Users\\beth_\\Documents\\vscode-projects'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [22]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [23]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\beth_\\Documents'

## Loading the raw data from data_raw.csv

In [24]:
# load the raw data
# use of Copilot to access the correct file path
df = pd.read_csv(os.path.join(current_dir, 'data', 'data_raw.csv'))
df.head(10)

FileNotFoundError: [Errno 2] No such file or directory: 'c:\\Users\\beth_\\Documents\\data\\data_raw.csv'
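
The FileNotFoundError above means there is no data/data_raw.csv under the directory that is current at this point. The following is a minimal sketch, not part of the original notebook, of checking the path before reading it. `load_csv_if_present` is a hypothetical helper name, and the `data` folder location is assumed from the path used in the cell above.

```python
from pathlib import Path

import pandas as pd


def load_csv_if_present(csv_path):
    """Return a DataFrame if csv_path exists; otherwise report the path and return None."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        # show the absolute path so it is clear which folder was searched
        print(f"File not found: {csv_path.resolve()}")
        return None
    return pd.read_csv(csv_path)


# assumed layout: <current directory>/data/data_raw.csv
raw_path = Path.cwd() / "data" / "data_raw.csv"
df_raw = load_csv_if_present(raw_path)
```

If `df_raw` is None, the printed absolute path shows which folder was searched, which helps to decide whether the change of working directory went one level too far.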

## Initial data exploration and cleaning

- missing values
- duplicates
- outliers
- convert data types
- data ethics
- rename/remove columns

In [None]:
# start to find out about the data
df.shape

In [None]:
# get a feel for the data types and non-null counts
df.info()

In [None]:
# in the data the index does not need to be reset as it is in order
# but if needed, uncomment the lines below to reset the index
# df.reset_index(drop=True, inplace=True)
# df.head()

In [None]:
# see if there are any missing values
df.isna().sum()

There are no missing values to deal with.
If there were, depending on the type of missing values, I would consider replacing them with the median or modal value, or filling them from the row above or below (forward or backward fill). The data set is not large, so deleting whole rows would lose valuable data.
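
The options described above could look like the sketch below. It runs on a small made-up DataFrame, not the real data: `toy` and its values are assumptions for illustration, and `WeightKg` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# made-up data with one missing numeric value and one missing category
toy = pd.DataFrame({
    "WeightKg": [4.0, np.nan, 10.0, 6.0],
    "Size": ["Small", "Large", None, "Large"],
})

filled = toy.copy()
# numeric column: replace missing values with the median
filled["WeightKg"] = filled["WeightKg"].fillna(filled["WeightKg"].median())
# categorical column: replace missing values with the most frequent value
filled["Size"] = filled["Size"].fillna(filled["Size"].mode().iloc[0])

# alternative: copy the value from the row above (forward fill)
ffilled = toy.ffill()
```

Median and mode filling keep every row, which suits a small data set. Forward fill only makes sense when neighbouring rows are related, for example in time-ordered data.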

In [None]:
# check for duplicates
dupes = df.duplicated().sum()
print("Duplicate rows:", dupes)
# keep=False flags every copy of a duplicated row so they can be inspected together
duplicates_all = df[df.duplicated(keep=False)]
duplicates_all


There are no duplicated rows. If there were, they could be dropped with the code shown below:

In [None]:
# drop the duplicate row and modify the dataframe in place
# df.drop_duplicates(keep='first', inplace=True)
# df.shape

In [None]:
# see what the summary statistics look like
# looking at the minimum and maximum values to see if there are any obvious anomalies
df.describe()

There are no negative values. If there were, I would replace them with 0.0.
The maximum values look sensible: 179 months is 14.92 years, and 30 kg is normal for a large dog.
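
Replacing negative values with zero could be done as in the sketch below. It uses a made-up DataFrame rather than the real data, and `AdoptionFee` is a hypothetical column name. The last line checks the months-to-years arithmetic quoted above.

```python
import pandas as pd

# made-up data containing two impossible negative values
toy = pd.DataFrame({
    "AgeMonths": [12, -3, 179],
    "AdoptionFee": [100.0, -5.0, 0.0],
})

# clip every numeric column so that nothing falls below zero
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].clip(lower=0)

# 179 months expressed in years, rounded to two decimal places
max_age_years = round(179 / 12, 2)
```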

In [None]:
# declare categorical types where they are object
# this will help with memory usage and clarity
category_cols = ["PetType", "Breed", "Color", "Size"]
for cat in category_cols:
    df[cat] = df[cat].astype("category")

df.info()
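
The comment in the cell above says that the category type helps with memory usage. The sketch below shows one way to check that claim. The `sizes` Series is invented for the comparison and is not the real column.

```python
import pandas as pd

# made-up column of repeated labels, similar in shape to a Size column
sizes = pd.Series(["Small", "Medium", "Large"] * 1000)

# deep=True counts the memory used by the string values themselves
as_text_bytes = sizes.memory_usage(deep=True)
as_category_bytes = sizes.astype("category").memory_usage(deep=True)

print("text:", as_text_bytes, "bytes")
print("category:", as_category_bytes, "bytes")
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it is largest for columns with few unique values.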

In [None]:
# remove PetID column as it is not needed for analysis
df.drop(columns=['PetID'], inplace=True)
df.head()


PetID is not required for analysis, and removing this identifier also helps to anonymise the data. If this were a real-world data set, any identifiers of specific animals or details of the owners would need to be anonymised, for example by mapping each original ID to an anonymous one as shown in the code below. That mapping would have to run before the PetID column is dropped.

In [None]:
# get unique PetID
# unique_pet_id = df['PetID'].unique()
# map new IDs to replace the old ones
# pet_id_map = {num: f'Pet-{i+1}' for i, num in enumerate(unique_pet_id)}
# make a new column with the anonymous IDs
# df['anon_petID'] = df['PetID'].map(pet_id_map)

In [None]:
# rename the Color column to Colour
df.rename(columns={'Color': 'Colour'}, inplace=True)
df.head()

In [None]:
# create a new column to show the age of the pet in years
df['AgeInYears'] = df['AgeMonths']/12
df.head()

## Save the cleaned data to a new file data_clean.csv

In [None]:
# save the cleaned data to a new csv file
df.to_csv(os.path.join(current_dir, 'data', 'data_clean.csv'), index=False)


## Conclusions and next steps

- The data is now saved to data_clean.csv
- The next step is exploratory data analysis (EDA)
- Please go to notebook 02_eda_visuals.ipynb