Begin a full exploratory data analysis (EDA) of your chosen dataset. This means:

Load and inspect your data

Read in your CSV file using pd.read_csv()
Use .head(), .info(), and .describe() to get a sense of the structure
Identify missing values, duplicate rows, or formatting issues

Ask and investigate initial questions

What are the columns and what types of data do they contain?
Are there obvious patterns or distributions worth looking at?
What questions are you curious about?

Create visualizations

Use Matplotlib and/or Seaborn to create one or two plots that help you understand the data. Here are some ideas:

Bar plots: to compare values across categories
Histograms: to explore distributions
Scatter plots: to investigate relationships between numeric variables
Boxplots: to examine spread and outliers
Pie charts or count plots: to show categorical proportions
Line charts: to show trends when your data includes a date or time dimension
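
As one possible starting point, a bar plot of category counts could be sketched as below. The five sample rows and the column name animaltype are stand-ins (borrowed from the shelter dataset used later in this notebook), not the full data; swap in your own DataFrame and column.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in rows; in practice this would be the DataFrame from pd.read_csv().
sample = pd.DataFrame({
    "animaltype": ["DOG", "DOG", "DOG", "CAT", "CAT"],
})

# Count how many records fall into each category.
counts = sample["animaltype"].value_counts()

# Draw one bar per category and label the axes so the chart explains itself.
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(counts.index, counts.values)
ax.set_xlabel("Animal type")
ax.set_ylabel("Number of records")
ax.set_title("Records per animal type (sample rows)")
fig.tight_layout()
```

Seaborn's sns.countplot(data=sample, x="animaltype") should produce a comparable chart in a single call; which one to use is mostly a matter of taste.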


Document your process

Keep notes to explain and understand:

What you’re doing at each step
Why you’re making each chart or calculation
What insights or patterns you’re noticing
Later on, you'll learn Markdown and incorporate these thoughts and findings into your notebook.

Commit your work regularly

Initialize a repository when you begin. Make small, clear Git commits as you work:

After setting up your project
After cleaning data
After each major plot or insight

Your project folder should include:

A data/ folder with your dataset
A notebooks/ folder with your working .ipynb file
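
If it helps, the two folders can be created from Python. This is only a sketch: the project name shelter-eda is hypothetical, and a temporary directory is used so that running it does not touch your real files.

```python
from pathlib import Path
import tempfile

# Hypothetical project root inside a throwaway temp directory.
project_root = Path(tempfile.mkdtemp()) / "shelter-eda"

# Create data/ and notebooks/; exist_ok=True makes the loop safe to re-run.
for name in ("data", "notebooks"):
    (project_root / name).mkdir(parents=True, exist_ok=True)

created = sorted(p.name for p in project_root.iterdir())
print(created)
```

With this layout, a notebook saved in notebooks/ reaches the dataset through a relative path that starts with ../data/.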

Tip

Don’t worry about making it perfect. The goal this week is to explore, get familiar with your data, and surface questions you might want to dig into more deeply later.




In [1]:
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns

# Load the dataset
# Renamed the original CSV to something easier to use

In [None]:
df = pd.read_csv('../data/kentucky_shelter_data.csv')

# Preview the data

In [13]:
df.head()


,kennel,animalid,jurisdiction,intype,insubtype,indate,surreason,outtype,outsubtype,outdate,animaltype,sex,bites,petsize,color,breed,sourcezipcode,ObjectId
0,408,A688221,40216,STRAY,OTC,2021-01-17 00:00:00,STRAY,TRANSPORT,RESCUE GRP,2021-02-06 00:00:00,DOG,M,N,PUPPY,BR BRINDLE,CHIHUAHUA SH / MIX,40218,1
1,ID07,A688234,40222,STRAY,OTC,2021-01-18 00:00:00,STRAY,RTO,IN FIELD,2021-01-18 00:00:00,DOG,N,N,MED,BR BRINDLE / WHITE,BOSTON TERRIER / MIX,40205,2
2,DW13,A688337,40118,STRAY,OTC,2021-01-21 00:00:00,STRAY,RTO,IN KENNEL,2021-01-23 00:00:00,DOG,S,N,LARGE,WHITE / BROWN,CATAHOULA / MIX,40118,3
3,INTAKE,A688419,40204,STRAY,OTC,2021-02-03 00:00:00,STRAY,TNR,CARETAKER,2021-02-04 00:00:00,CAT,S,N,MED,BLK TABBY,DOMESTIC SH,40204,4
4,INTAKE,A688478,40214,STRAY,OTC,2021-02-10 00:00:00,STRAY,TNR,CARETAKER,2021-02-11 00:00:00,CAT,N,N,MED,TORTIE,DOMESTIC SH,40210,5


In [12]:
df.describe()

,ObjectId
count,60343.0
mean,30172.0
std,17419.667984
min,1.0
25%,15086.5
50%,30172.0
75%,45257.5
max,60343.0
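
By default describe() only summarises numeric columns, which is why ObjectId is the only column shown. A possible way to summarise the text columns as well is sketched below. The rows are a cut-down copy of a few columns from the df.head() preview, and the name sample is a stand-in for df.

```python
import pandas as pd

# Stand-in rows copied from the df.head() preview (a few columns only).
sample = pd.DataFrame({
    "intype": ["STRAY"] * 5,
    "outtype": ["TRANSPORT", "RTO", "RTO", "TNR", "TNR"],
    "animaltype": ["DOG", "DOG", "DOG", "CAT", "CAT"],
    "ObjectId": [1, 2, 3, 4, 5],
})

# Excluding numeric columns makes describe() report count / unique / top / freq
# for the remaining (text) columns.
text_summary = sample.describe(exclude="number")
print(text_summary)

# value_counts() gives the full frequency table for one column.
print(sample["animaltype"].value_counts())
```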


In [11]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60343 entries, 0 to 60342
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   kennel         60343 non-null  object
 1   animalid       60343 non-null  object
 2   jurisdiction   47038 non-null  object
 3   intype         60343 non-null  object
 4   insubtype      60216 non-null  object
 5   indate         60343 non-null  object
 6   surreason      47038 non-null  object
 7   outtype        46755 non-null  object
 8   outsubtype     37967 non-null  object
 9   outdate        46789 non-null  object
 10  animaltype     60343 non-null  object
 11  sex            59318 non-null  object
 12  bites          47053 non-null  object
 13  petsize        57919 non-null  object
 14  color          60342 non-null  object
 15  breed          60267 non-null  object
 16  sourcezipcode  50868 non-null  object
 17  ObjectId       60343 non-null  int64 
dtypes: int64(1), object(17)

In [10]:
df.columns

Index(['kennel', 'animalid', 'jurisdiction', 'intype', 'insubtype', 'indate',
       'surreason', 'outtype', 'outsubtype', 'outdate', 'animaltype', 'sex',
       'bites', 'petsize', 'color', 'breed', 'sourcezipcode', 'ObjectId'],
      dtype='object')

In [None]:
df.shape

(60343, 18)

In [14]:
df.dtypes

kennel           object
animalid         object
jurisdiction     object
intype           object
insubtype        object
indate           object
surreason        object
outtype          object
outsubtype       object
outdate          object
animaltype       object
sex              object
bites            object
petsize          object
color            object
breed            object
sourcezipcode    object
ObjectId          int64
dtype: object
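
Every column except ObjectId is stored as object, including indate and outdate, so date arithmetic will not work until they are converted. A minimal sketch of that conversion, using three rows copied from the preview above; stay_days is a hypothetical derived column, not part of the dataset.

```python
import pandas as pd

# Stand-in rows copied from the df.head() preview.
sample = pd.DataFrame({
    "animalid": ["A688221", "A688234", "A688337"],
    "indate": ["2021-01-17 00:00:00", "2021-01-18 00:00:00", "2021-01-21 00:00:00"],
    "outdate": ["2021-02-06 00:00:00", "2021-01-18 00:00:00", "2021-01-23 00:00:00"],
})

for col in ("indate", "outdate"):
    # errors="coerce" turns unparseable strings into NaT instead of raising.
    sample[col] = pd.to_datetime(sample[col], errors="coerce")

# Hypothetical derived column: whole days between intake and outcome.
sample["stay_days"] = (sample["outdate"] - sample["indate"]).dt.days

print(sample.dtypes)
print(sample[["animalid", "stay_days"]])
```

Rows with a missing outdate would end up with a missing stay_days as well, which seems like the right behaviour for animals with no recorded outcome.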

In [16]:
# check for missing data
df.isna()

,kennel,animalid,jurisdiction,intype,insubtype,indate,surreason,outtype,outsubtype,outdate,animaltype,sex,bites,petsize,color,breed,sourcezipcode,ObjectId
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60338,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
60339,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
60340,False,False,True,False,False,False,True,True,True,True,False,False,True,False,False,False,True,False
60341,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False


In [17]:
# count of missing data
df.isna().sum()


kennel               0
animalid             0
jurisdiction     13305
intype               0
insubtype          127
indate               0
surreason        13305
outtype          13588
outsubtype       22376
outdate          13554
animaltype           0
sex               1025
bites            13290
petsize           2424
color                1
breed               76
sourcezipcode     9475
ObjectId             0
dtype: int64
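
Raw counts are easier to compare as percentages. The counts above also show outtype (13588 missing) and outdate (13554 missing) differing by 34, so at least some rows record one without the other. The sketch below shows one way to check both things. The frame is a small made-up stand-in with NaN values placed by hand, so its numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame; the NaN placement is invented for illustration.
sample = pd.DataFrame({
    "animalid": ["A1", "A2", "A3", "A4"],
    "outtype": ["RTO", np.nan, "TNR", np.nan],
    "outdate": ["2021-01-18", np.nan, "2021-02-04", np.nan],
    "sex": ["M", "N", np.nan, "S"],
})

# isna().mean() is the fraction of missing values per column.
missing_pct = (sample.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))

# Rows where exactly one of outtype / outdate is missing.
mismatch = sample["outtype"].isna() != sample["outdate"].isna()
print(int(mismatch.sum()))
```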

In [18]:
# check for fully duplicated rows (keep=False counts every copy, not just the repeats)
df.duplicated(keep=False).sum()

np.int64(0)
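
A result of 0 here may not mean much on its own: ObjectId appears to be a per-row counter (min 1, max 60343 over 60343 rows), so no two complete rows can match. A sketch of a stricter check that ignores ObjectId is below. The three rows are made up to show the effect and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows: the first two are identical apart from ObjectId.
sample = pd.DataFrame({
    "animalid": ["A688221", "A688221", "A688234"],
    "indate": ["2021-01-17", "2021-01-17", "2021-01-18"],
    "intype": ["STRAY", "STRAY", "STRAY"],
    "ObjectId": [1, 2, 3],
})

# Full-row check: the unique ObjectId hides the repeated record.
full_row_dupes = int(sample.duplicated(keep=False).sum())

# Same check on every column except ObjectId.
content_cols = [c for c in sample.columns if c != "ObjectId"]
content_dupes = int(sample.duplicated(subset=content_cols, keep=False).sum())

# Animals that appear more than once (possibly legitimate repeat intakes).
repeat_ids = int(sample["animalid"].duplicated(keep=False).sum())

print(full_row_dupes, content_dupes, repeat_ids)
```

A repeated animalid is not necessarily an error, since the same animal could enter the shelter more than once; that is an assumption worth checking against indate.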