In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/austin-animal-center-outcomes/Austin_Animal_Center_Outcomes.csv


I am currently studying the pandas library and am using this dataset to demonstrate what I've learned.

**Questions I want to answer:**
1. How many total records (animals) are in the dataset?
2. What are the most common animal types?
3. What are the most common breeds for dogs and cats?
4. What are the busiest months for animal intakes?
5. Has the number of adoptions changed over time?
6. What are the most common outcomes for animals?
7. Do certain breeds have higher adoption rates?
8. Do younger animals get adopted more quickly?
9. Are certain zip codes associated with more intakes or adoptions?
10. How long do animals typically stay in the shelter before adoption?

### Step 1: Cleaning
1. Does the dataset have missing values? If so, which columns and how many?
2. Are there duplicate rows? Should they be removed?
3. Are there any inconsistent or unexpected values (e.g., negative ages, incorrect dates)?
4. Do column names follow a consistent format (lowercase, snake_case, etc.)?
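
For the column-name question, here is a minimal sketch of one way to move to snake_case. The helper `to_snake_case` is a hypothetical name introduced for illustration, and the toy frame only borrows a few column names from this dataset.

```python
import pandas as pd

# Empty toy frame that reuses a few column names from this dataset.
sample = pd.DataFrame(columns=["Animal ID", "DateTime", "Date of Birth", "Sex upon Outcome"])

def to_snake_case(name: str) -> str:
    """Hypothetical helper: lowercase the name and join its words with underscores."""
    return "_".join(name.strip().lower().split())

# rename() accepts a callable and applies it to every column label.
sample = sample.rename(columns=to_snake_case)
print(list(sample.columns))
```
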

## handling missing data
* Should missing values be filled, removed, or left as is?
* If filling missing values, what’s the best approach (mean, median, mode, forward/backward fill)?
* Are missing values concentrated in specific columns or random?

## data type validation
* Are all columns in the correct data type (e.g., dates as datetime, numbers as int/float)?
* Do categorical columns have correct and consistent labels?
* Are numerical columns formatted correctly (e.g., no misplaced commas or currency symbols)?
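
A small sketch of the data type checks above, assuming the date strings follow the `MM/DD/YYYY hh:mm:ss AM/PM` and `MM/DD/YYYY` patterns seen in the rows displayed in this notebook. The toy frame copies two of those rows; `errors="coerce"` is my choice here so that any value in another format becomes `NaT` instead of raising.

```python
import pandas as pd

# Two rows copied from the displayed data; the real file may contain other formats.
sample = pd.DataFrame({
    "DateTime": ["07/01/2023 06:12:00 PM", "02/13/2016 05:59:00 PM"],
    "Date of Birth": ["03/25/2023", "10/08/2015"],
    "Animal Type": ["Cat", "Dog"],
})

# Parse with explicit formats; unparseable values become NaT rather than raising.
sample["DateTime"] = pd.to_datetime(
    sample["DateTime"], format="%m/%d/%Y %I:%M:%S %p", errors="coerce"
)
sample["Date of Birth"] = pd.to_datetime(
    sample["Date of Birth"], format="%m/%d/%Y", errors="coerce"
)

# A column with only a handful of distinct labels can be stored as a categorical.
sample["Animal Type"] = sample["Animal Type"].astype("category")

print(sample.dtypes)
print("Unparsed DateTime values:", sample["DateTime"].isna().sum())
```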

## standardizing and fixing inconsistencies
* Are animal types (e.g., “Dog” vs. “dog”) consistent?
* Are breed names standardized (e.g., “German Shepherd” vs. “GSD”)?
* Do outcome types and intake types have uniform spelling and categorization?
* Are date formats consistent across the dataset?

## handling outliers and erroneous data
* Are there extreme outliers in numerical columns (e.g., very high ages, negative values)?
* Are there animals with duplicate intake and outcome records?
* Are there any animals with unrealistic age values (e.g., “100 years old” for a dog)?
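
To look for outliers in age, the text ages first need to become numbers. This is a rough sketch, assuming ages are written as "<number> <unit>" as in the rows displayed in this notebook ("3 months", "2 years"). The helper `age_to_days`, the day and week units, the 30-day month, and the negative test value are all my assumptions for illustration.

```python
import pandas as pd

# Approximate days per unit. "day" and "week" are assumed units; only months
# and years appear in the rows displayed in this notebook.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

def age_to_days(age):
    """Hypothetical helper: convert strings like '3 months' to approximate days."""
    if pd.isna(age):
        return float("nan")
    parts = str(age).split()
    if len(parts) != 2:
        return float("nan")
    number, unit = parts
    try:
        # Strip the plural "s" so "years" and "year" share one lookup key.
        return int(number) * DAYS_PER_UNIT[unit.rstrip("s")]
    except (ValueError, KeyError):
        return float("nan")

# The last value is made up to show how a negative age would be flagged.
ages = pd.Series(["3 months", "2 years", "1 year", None, "-1 years"])
days = pd.to_numeric(ages.map(age_to_days))

negative_ages = days[days < 0]
print(days)
print("Negative ages found:", len(negative_ages))
```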

## date and time processing
* Are intake and outcome dates properly formatted and sorted?
* Are there cases where the outcome date is before the intake date?
* Are there unexpected time gaps between intake and outcome?
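
A sketch of a date-order check in the same spirit, assuming the two date columns are named `DateTime` (the outcome timestamp) and `Date of Birth` and have already been parsed as datetimes. Here the comparison is outcome versus birth date rather than outcome versus intake. The IDs are made up, and the third row is deliberately wrong.

```python
import pandas as pd

# Toy rows with made-up IDs; the third row has an outcome before the birth date.
sample = pd.DataFrame({
    "Animal ID": ["X1", "X2", "X3"],
    "DateTime": pd.to_datetime(["2023-07-01 18:12", "2019-05-08 18:20", "2016-02-13 17:59"]),
    "Date of Birth": pd.to_datetime(["2023-03-25", "2017-05-02", "2016-10-08"]),
})

# Rows where the outcome happened before the animal was born are suspect.
born_after_outcome = sample[sample["DateTime"] < sample["Date of Birth"]]

# Check whether the rows are already in chronological order.
is_sorted = sample["DateTime"].is_monotonic_increasing

print(born_after_outcome)
print("Sorted by outcome date:", is_sorted)
```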

## final checks
* Does the cleaned dataset retain all necessary columns for analysis?
* Have you documented the cleaning steps for reproducibility?
* Have you saved the cleaned dataset for further analysis?

In [2]:
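# Walk the input directory and remember the path of the CSV file.
# Note: if several files were present, only the last one found would be kept.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        dataset = os.path.join(dirname, filename)
        print(dataset)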

df = pd.read_csv(dataset)
df.info()

/kaggle/input/austin-animal-center-outcomes/Austin_Animal_Center_Outcomes.csv
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 172113 entries, 0 to 172112
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Animal ID         172113 non-null  object
 1   Name              122845 non-null  object
 2   DateTime          172113 non-null  object
 3   MonthYear         172113 non-null  object
 4   Date of Birth     172113 non-null  object
 5   Outcome Type      172071 non-null  object
 6   Outcome Subtype   78842 non-null   object
 7   Animal Type       172113 non-null  object
 8   Sex upon Outcome  172111 non-null  object
 9   Age upon Outcome  172106 non-null  object
 10  Breed             172113 non-null  object
 11  Color             172113 non-null  object
dtypes: object(12)
memory usage: 15.8+ MB


### 1. Does the dataset have missing values? If so, which columns and how many?
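Yes. Comparing the non-null counts in the `df.info()` output above against the 172,113 total rows, five columns have missing values: Name (49,268), Outcome Type (42), Outcome Subtype (93,271), Sex upon Outcome (2), and Age upon Outcome (7).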
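
The same counts can be read off directly instead of subtracting by hand. A minimal sketch with `isna().sum()` on a toy frame; the values are illustrative and not taken from the full dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values and a few gaps.
sample = pd.DataFrame({
    "Name": ["*Hamilton", "Chunk", np.nan, "Moose"],
    "Outcome Type": ["Adoption", "Rto-Adopt", "Euthanasia", "Adoption"],
    "Outcome Subtype": [np.nan, np.nan, np.nan, "Foster"],
})

# Count missing values per column and express them as a share of all rows.
missing = sample.isna().sum()
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(sample) * 100).round(1),
})

# Keep only the columns that actually have gaps, largest first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```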

In [3]:
df.describe()

,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
count,172113,122845,172113,172113,172113,172071,78842,172113,172111,172106,172113,172113
unique,154722,29585,143835,138,8613,11,26,5,5,55,2990,661
top,A721033,Luna,04/18/2016 12:00:00 AM,Jun 2019,05/01/2016,Adoption,Partner,Dog,Neutered Male,1 year,Domestic Shorthair Mix,Black/White
freq,33,748,39,2244,121,83709,40046,93718,60363,28510,34026,17820


### 2. Are there duplicate rows? Should they be removed?
Yes. `df.duplicated().sum()` reports 25 fully duplicated rows, and since each one is an exact copy of another record, they should be removed.

In [4]:
df.duplicated().sum()

25

In [5]:
# drop_duplicates() returns a new DataFrame, so assign it back to keep the result
df = df.drop_duplicates()
df

,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A882831,*Hamilton,07/01/2023 06:12:00 PM,Jul 2023,03/25/2023,Adoption,,Cat,Neutered Male,3 months,Domestic Shorthair Mix,Black/White
1,A794011,Chunk,05/08/2019 06:20:00 PM,May 2019,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
2,A776359,Gizmo,07/18/2018 04:02:00 PM,Jul 2018,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
3,A821648,,08/16/2020 11:38:00 AM,Aug 2020,08/16/2019,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray
4,A720371,Moose,02/13/2016 05:59:00 PM,Feb 2016,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
...,...,...,...,...,...,...,...,...,...,...,...,...
172108,A922949,*Scout,03/07/2025 11:45:00 AM,Mar 2025,12/16/2024,Adoption,Foster,Dog,Neutered Male,2 months,Chihuahua Shorthair Mix,Buff/White
172109,A925959,,03/07/2025 11:24:00 AM,Mar 2025,03/03/2021,Transfer,Partner,Dog,Intact Male,4 years,Dachshund/Chihuahua Shorthair,Black/White
172110,A925235,*Gertrude,03/07/2025 12:48:00 PM,Mar 2025,02/27/2019,Adoption,,Dog,Spayed Female,6 years,Miniature Schnauzer,Black/Black
172111,A924228,*Penny Lane,03/07/2025 12:41:00 PM,Mar 2025,01/06/2025,Adoption,,Dog,Spayed Female,1 month,Australian Cattle Dog/Pit Bull,Blue Tick
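
Two details are worth illustrating here. First, `drop_duplicates()` returns a new DataFrame and leaves the original untouched unless the result is assigned. Second, a repeated Animal ID is not the same thing as a duplicated row; the `describe()` output shows fewer unique IDs than rows, which suggests some animals have several outcome records. A small sketch on made-up rows:

```python
import pandas as pd

# Made-up rows: rows 0 and 1 are exact copies, row 3 reuses an ID with a new outcome.
sample = pd.DataFrame({
    "Animal ID": ["X1", "X1", "X2", "X1"],
    "Outcome Type": ["Adoption", "Adoption", "Transfer", "Transfer"],
})

# keep=False marks every member of a duplicate group, which helps when inspecting them.
dupes = sample[sample.duplicated(keep=False)]

# The result must be assigned; `sample` itself is not modified.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(dupes)
print(len(sample), "rows before,", len(deduped), "rows after")
```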


### 3. Are there any inconsistent or unexpected values (e.g., negative ages, incorrect dates)?

In [6]:
df['Name'].value_counts()

Name
Luna        748
Max         721
Bella       674
Rocky       490
Daisy       464
           ... 
*Brocade      1
Babydoll      1
A849756       1
Karati        1
Trueno        1
Name: count, Length: 29585, dtype: int64
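
The tail of this output already hints at two things to look into: names that start with an asterisk (`*Brocade`) and a name that looks like an Animal ID (`A849756`). Here is a sketch for flagging both. The pattern `A` followed by six digits is my assumption, based on the IDs displayed in this notebook.

```python
import pandas as pd

# A few names from the output above, plus one missing value.
names = pd.Series(["Luna", "*Brocade", "A849756", None, "Trueno"], name="Name")

# na=False treats missing names as "no match" instead of propagating NaN.
starts_with_star = names.str.startswith("*", na=False)

# Assumed ID pattern: the letter A followed by exactly six digits.
looks_like_id = names.str.fullmatch(r"A\d{6}", na=False)

print(names[starts_with_star])
print(names[looks_like_id])
```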