## Begin to conduct a full exploratory data analysis on your chosen dataset. This means:
Load and inspect your data
Read in your CSV file using pd.read_csv()
Get a sense of the structure
Ask and investigate initial questions
  What are the columns and what types of data do they contain?
  Are there obvious patterns or distributions worth looking at?
  What questions are you curious about?
Create visualizations
(Use Matplotlib and/or Seaborn to create 1 or 2 plots to help you understand the data)
Document your process
Commit your work regularly
Your Project Folder Should Include:
  A data/ folder with your dataset
  A notebooks/ folder with your working .ipynb file

In [1]:
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns

## Load the dataset 
#  Renamed the original CSV to something easier to use

In [2]:
df = pd.read_csv('../data/kentucky_shelter_data.csv')

# Preview and Analyze the data structure

In [3]:
df.head()

Unnamed: 0,kennel,animalid,jurisdiction,intype,insubtype,indate,surreason,outtype,outsubtype,outdate,animaltype,sex,bites,petsize,color,breed,sourcezipcode,ObjectId
0,408,A688221,40216,STRAY,OTC,2021-01-17 00:00:00,STRAY,TRANSPORT,RESCUE GRP,2021-02-06 00:00:00,DOG,M,N,PUPPY,BR BRINDLE,CHIHUAHUA SH / MIX,40218,1
1,ID07,A688234,40222,STRAY,OTC,2021-01-18 00:00:00,STRAY,RTO,IN FIELD,2021-01-18 00:00:00,DOG,N,N,MED,BR BRINDLE / WHITE,BOSTON TERRIER / MIX,40205,2
2,DW13,A688337,40118,STRAY,OTC,2021-01-21 00:00:00,STRAY,RTO,IN KENNEL,2021-01-23 00:00:00,DOG,S,N,LARGE,WHITE / BROWN,CATAHOULA / MIX,40118,3
3,INTAKE,A688419,40204,STRAY,OTC,2021-02-03 00:00:00,STRAY,TNR,CARETAKER,2021-02-04 00:00:00,CAT,S,N,MED,BLK TABBY,DOMESTIC SH,40204,4
4,INTAKE,A688478,40214,STRAY,OTC,2021-02-10 00:00:00,STRAY,TNR,CARETAKER,2021-02-11 00:00:00,CAT,N,N,MED,TORTIE,DOMESTIC SH,40210,5


In [4]:
df.describe()

Unnamed: 0,ObjectId
count,60343.0
mean,30172.0
std,17419.667984
min,1.0
25%,15086.5
50%,30172.0
75%,45257.5
max,60343.0
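Because `ObjectId` is the only numeric column, `df.describe()` says nothing about the 17 text columns. A minimal sketch of how those could be summarized, using a small hypothetical sample (the `sample` frame and its values are illustrative, not taken from the full dataset):

```python
import pandas as pd

# Hypothetical sample mirroring a few columns of the shelter data
sample = pd.DataFrame({
    "animaltype": ["DOG", "DOG", "DOG", "CAT", "CAT"],
    "intype": ["STRAY", "STRAY", "STRAY", "STRAY", "STRAY"],
    "outtype": ["TRANSPORT", "RTO", "RTO", "TNR", "TNR"],
    "ObjectId": [1, 2, 3, 4, 5],
})

# Excluding numeric columns leaves the text columns, which describe()
# summarizes with count, unique, top (most frequent value) and freq
text_summary = sample.describe(exclude=["number"])
print(text_summary)

# value_counts() gives the full frequency breakdown for a single column
animal_counts = sample["animaltype"].value_counts()
print(animal_counts)
```

On the real data the same two calls would be `df.describe(exclude=["number"])` and `df["animaltype"].value_counts()`.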


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60343 entries, 0 to 60342
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   kennel         60343 non-null  object
 1   animalid       60343 non-null  object
 2   jurisdiction   47038 non-null  object
 3   intype         60343 non-null  object
 4   insubtype      60216 non-null  object
 5   indate         60343 non-null  object
 6   surreason      47038 non-null  object
 7   outtype        46755 non-null  object
 8   outsubtype     37967 non-null  object
 9   outdate        46789 non-null  object
 10  animaltype     60343 non-null  object
 11  sex            59318 non-null  object
 12  bites          47053 non-null  object
 13  petsize        57919 non-null  object
 14  color          60342 non-null  object
 15  breed          60267 non-null  object
 16  sourcezipcode  50868 non-null  object
 17  ObjectId       60343 non-null  int64 
dtypes: int64(1), object(17)

In [6]:
df.columns

Index(['kennel', 'animalid', 'jurisdiction', 'intype', 'insubtype', 'indate',
       'surreason', 'outtype', 'outsubtype', 'outdate', 'animaltype', 'sex',
       'bites', 'petsize', 'color', 'breed', 'sourcezipcode', 'ObjectId'],
      dtype='object')

In [7]:
df.shape

(60343, 18)

In [8]:
df.dtypes

kennel           object
animalid         object
jurisdiction     object
intype           object
insubtype        object
indate           object
surreason        object
outtype          object
outsubtype       object
outdate          object
animaltype       object
sex              object
bites            object
petsize          object
color            object
breed            object
sourcezipcode    object
ObjectId          int64
dtype: object

# Data Clean Up
Rename Columns
Check for Missing Data
Count of Missing Data
Drop rows missing critical info
Check for duplicates
Drop Duplicates
Convert date columns to datetime format

In [None]:
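# Assign a new list of column names; it must contain one name per column (18 here),
# so derive it from the existing names: trim whitespace and lowercase
df.columns = df.columns.str.strip().str.lower()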

In [None]:
# check for missing data
df.isna()

In [None]:
# count of missing data
df.isna().sum()
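Raw counts are easier to judge next to the share of rows they represent. For example, the `df.info()` output shows `outsubtype` with 37967 non-null values out of 60343, so roughly 37% of its rows are missing. A minimal sketch of a combined count and percentage table, using a hypothetical `sample` frame:

```python
import pandas as pd

# Hypothetical sample with a few gaps, standing in for the shelter data
sample = pd.DataFrame({
    "animalid": ["A1", "A2", "A3", "A4"],
    "outtype": ["RTO", None, "TNR", None],
    "sourcezipcode": ["40218", "40205", None, "40204"],
})

# isna().sum() counts missing values; isna().mean() gives the missing fraction
missing = pd.DataFrame({
    "missing_count": sample.isna().sum(),
    "missing_pct": (sample.isna().mean() * 100).round(1),
}).sort_values("missing_count", ascending=False)

print(missing)
```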

In [None]:
# check for duplicates: keep=False flags every copy of a repeated row
df.duplicated(keep=False)

# number of rows that have at least one exact copy
df.duplicated(keep=False).sum()
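The clean-up list includes a "Drop Duplicates" step that has no cell yet. One thing to watch: `ObjectId` looks like a running row counter (min 1, max 60343 for 60343 rows), so a full-row comparison would probably never find a duplicate. A minimal sketch, assuming that rows identical in every column except `ObjectId` count as duplicates (that rule, and the `sample` frame, are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample: the first two rows differ only in ObjectId
sample = pd.DataFrame({
    "animalid": ["A1", "A1", "A2", "A3"],
    "indate": ["2021-01-17", "2021-01-17", "2021-01-18", "2021-01-21"],
    "ObjectId": [1, 2, 3, 4],
})

# Compare on every column except the row counter
compare_cols = [c for c in sample.columns if c != "ObjectId"]

# Full-row check finds nothing because ObjectId is unique per row
full_row_dups = sample.duplicated().sum()

# Subset check finds the repeated record
dup_count = sample.duplicated(subset=compare_cols).sum()

# Keep the first occurrence of each record and rebuild the index
deduped = sample.drop_duplicates(subset=compare_cols, keep="first").reset_index(drop=True)

print(full_row_dups, dup_count, len(deduped))
```

Whether repeated `animalid` values are true duplicates or repeat visits by the same animal is worth checking before dropping anything.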

In [None]:
# Convert date columns to datetime format
df['indate'] = pd.to_datetime(df['indate'], errors='coerce')

df['outdate'] = pd.to_datetime(df['outdate'], errors='coerce')

print(df.indate, df.outdate)
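With `errors='coerce'`, any value that cannot be parsed silently becomes `NaT`, so it helps to compare missing counts before and after the conversion. A minimal sketch on a hypothetical `sample` frame (the malformed value is invented to show the effect):

```python
import pandas as pd

# Hypothetical sample: one unparseable indate, one genuinely missing outdate
sample = pd.DataFrame({
    "indate": ["2021-01-17 00:00:00", "2021-01-18 00:00:00", "not a date"],
    "outdate": ["2021-02-06 00:00:00", None, "2021-01-23 00:00:00"],
})
date_cols = ["indate", "outdate"]

# Missing values before conversion
missing_before = sample[date_cols].isna().sum()

for col in date_cols:
    sample[col] = pd.to_datetime(sample[col], errors="coerce")

# Missing values after conversion; the difference is what coerce discarded
missing_after = sample[date_cols].isna().sum()
newly_invalid = missing_after - missing_before

print(newly_invalid)
print(sample.dtypes)
```

A non-zero entry in `newly_invalid` means some strings in that column were not valid dates and deserve a closer look.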

In [None]:
# Drop rows missing critical info
df = df.dropna(subset=['indate', 'outtype', 'animaltype'])
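The project brief asks for one or two plots, and the notebook has none yet. A minimal Matplotlib sketch of a monthly intake bar chart, built on a hypothetical `sample` frame whose `indate` column is already in datetime format:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample with parsed intake dates
sample = pd.DataFrame({
    "indate": pd.to_datetime([
        "2021-01-17", "2021-01-18", "2021-01-21", "2021-02-03", "2021-02-10",
    ]),
    "animaltype": ["DOG", "DOG", "DOG", "CAT", "CAT"],
})

# Count intakes per calendar month
monthly_intake = sample.groupby(sample["indate"].dt.to_period("M")).size()

# Draw one bar per month
fig, ax = plt.subplots(figsize=(8, 4))
monthly_intake.plot(kind="bar", ax=ax)
ax.set_xlabel("Intake month")
ax.set_ylabel("Number of intakes")
ax.set_title("Monthly intakes (sample data)")
fig.tight_layout()
```

On the full dataset a line plot (`kind="line"`) may read better than bars if there are many months.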

# Analysis I would like to accomplish with the data
# 1. Most common outcome type by animal type
# 2. Adoption rates by animal type
# 3. Monthly intake trends
# 4. Intake types count
# 5. Age group effect on outcomes
# 6. Age by Outcome Type by Animal Type
# 7. Rates of adoption by day of week
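Items 1, 2 and 7 can be sketched with a cross-tabulation and a grouped mean. This is a minimal sketch on a hypothetical `sample` frame. The outcome label `"ADOPTION"` is an assumption; the actual labels should be checked with `df['outtype'].unique()`.

```python
import pandas as pd

# Hypothetical sample; "ADOPTION" is an assumed outcome label
sample = pd.DataFrame({
    "animaltype": ["DOG", "DOG", "DOG", "CAT", "CAT", "CAT"],
    "outtype": ["ADOPTION", "RTO", "ADOPTION", "TNR", "TNR", "ADOPTION"],
    "outdate": pd.to_datetime([
        "2021-02-06", "2021-01-18", "2021-01-23",
        "2021-02-04", "2021-02-11", "2021-02-13",
    ]),
})

# 1. Outcome counts per animal type, and the most common outcome in each row
outcome_counts = pd.crosstab(sample["animaltype"], sample["outtype"])
most_common = outcome_counts.idxmax(axis=1)

# 2. Share of each animal type whose outcome is an adoption
is_adoption = sample["outtype"].eq("ADOPTION")
adoption_rate = is_adoption.groupby(sample["animaltype"]).mean()

# 7. Number of adoptions by day of the week of the outcome date
adoptions_by_day = sample.loc[is_adoption, "outdate"].dt.day_name().value_counts()

print(outcome_counts)
print(most_common)
print(adoption_rate)
print(adoptions_by_day)
```

Items 5 and 6 are not sketched because the columns listed by `df.info()` include no age field; `petsize`, which contains values such as `PUPPY`, is the closest available proxy.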


