# M1L5 More EDA with Pandas 

This notebook will guide you through some essential data manipulation techniques using the Pandas library in Python. We'll be working with the Austin Animal Center Intakes dataset, which contains information about animals entering the Austin Animal Center.

### **Dataset:** [Austin Animal Center Intakes](https://catalog.data.gov/dataset/austin-animal-center-intakes) (also available in your data folder)

### **Objectives:**

 1.  Load and explore the dataset.
 2.  Use `groupby()` to aggregate data.
 3.  Create contingency tables using `crosstab()`.
 4.  Identify and handle duplicate entries.

## Step 1:  Import pandas and numpy 

In [1]:
#Import packages 

import pandas as pd
import numpy as np

## Step 2:  Load in the data and save it as `df`

In [4]:
df = pd.read_csv('Austin_Animal_Center_Intakes__10_01_2013_to_05_05_2025_.csv')

## Step 3:  Look at the data (can you think of some methods to do this?)

In [6]:
df.dtypes

Animal ID           object
Name                object
DateTime            object
MonthYear           object
Found Location      object
Intake Type         object
Intake Condition    object
Animal Type         object
Sex upon Intake     object
Age upon Intake     object
Breed               object
Color               object
dtype: object
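
`dtypes` is only one way to look at a DataFrame. The sketch below shows a few other common inspection calls. It runs on a small made-up frame (`toy_df` and its values are hypothetical, not rows from the intake file) so that it works on its own; the same calls apply to `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the intake file
toy_df = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A3", "A4"],
    "Name": ["Rex", None, "Tom", "Bella"],
    "Animal Type": ["Dog", "Cat", "Cat", "Dog"],
})

n_rows, n_cols = toy_df.shape          # (number of rows, number of columns)
print(n_rows, n_cols)
print(toy_df.head(2))                  # first two rows
toy_df.info()                          # column dtypes and non-null counts
print(toy_df.describe(include="all"))  # summary statistics for every column
```

Every column in the intake data is read as `object`, so `describe(include="all")` is the variant that reports counts, unique values, and the most frequent value for text columns.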

## Step 4:  Count the missing values in each column (chain two methods: one to flag missing values and one to sum them)

In [18]:
# Flag rows where DateTime is missing; this adds a helper column to df
df['DR_isNull'] = df['DateTime'].isna()

In [None]:
# isnull() returns a boolean Series: True where Animal ID is missing
df['Animal ID'].isnull()

0         False
1         False
2         False
3         False
4         False
          ...  
173807    False
173808    False
173809    False
173810    False
173811    False
Name: Animal ID, Length: 173812, dtype: bool

In [20]:
df.isna().sum()

Animal ID               0
Name                49991
DateTime                0
MonthYear               0
Found Location          0
Intake Type             0
Intake Condition        0
Animal Type             0
Sex upon Intake         1
Age upon Intake         0
Breed                   0
Color                   0
DR_isNull               0
dtype: int64

In [19]:
# isnull() is an alias of isna(), so the result is identical
df.isnull().sum()

Animal ID               0
Name                49991
DateTime                0
MonthYear               0
Found Location          0
Intake Type             0
Intake Condition        0
Animal Type             0
Sex upon Intake         1
Age upon Intake         0
Breed                   0
Color                   0
DR_isNull               0
dtype: int64
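
The chaining works because `isna()` returns a frame of booleans and `sum()` treats `True` as 1. A minimal sketch on made-up data (`toy_df` and its values are hypothetical) also shows `mean()`, which gives the share of missing values instead of the count:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy_df = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A3", "A4"],
    "Name": ["Rex", np.nan, "Tom", np.nan],
    "Sex upon Intake": ["Neutered Male", "Spayed Female", np.nan, "Intact Male"],
})

missing_counts = toy_df.isna().sum()   # number of missing values per column
missing_share = toy_df.isna().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```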

## Step 5:  Count the number of animals by Animal Type

In [21]:
# Count the rows (non-null Animal IDs) for each animal type
animal_counts = df.groupby('Animal Type')['Animal ID'].count()
print(animal_counts)

Animal Type
Bird           878
Cat          69324
Dog          94608
Livestock       34
Other         8968
Name: Animal ID, dtype: int64
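
One detail worth knowing: `count()` only counts non-null values in the selected column, while `size()` counts every row in the group. The sketch below, on hypothetical toy data, shows where the two differ and how `value_counts()` gives the same row counts without an explicit `groupby()`.

```python
import pandas as pd

# Hypothetical toy data; two animals have no name
toy_df = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A3", "A4", "A5"],
    "Animal Type": ["Dog", "Cat", "Cat", "Dog", "Bird"],
    "Name": ["Rex", None, "Tom", "Bella", None],
})

id_counts = toy_df.groupby("Animal Type")["Animal ID"].count()  # non-null IDs per type
name_counts = toy_df.groupby("Animal Type")["Name"].count()     # missing names are skipped
row_counts = toy_df.groupby("Animal Type").size()               # all rows per type
type_counts = toy_df["Animal Type"].value_counts()              # same counts, sorted by frequency

print(name_counts)
print(row_counts)
```

In the intake data `Animal ID` has no missing values, so counting that column gives the same numbers as `size()`.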

## Step 6:  Create a crosstab showing the count of animal types for each intake condition.

In [25]:
# Rows are intake conditions, columns are animal types
cross_table = pd.crosstab(df['Intake Condition'], df['Animal Type'])
print(cross_table)
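
`pd.crosstab()` counts how often each pair of values occurs, and the column names passed to it must match the DataFrame's column labels exactly. The sketch below uses hypothetical toy data to show the basic table plus two optional arguments, `margins` and `normalize`.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "Intake Condition": ["Normal", "Injured", "Normal", "Sick", "Normal"],
    "Animal Type": ["Dog", "Cat", "Cat", "Dog", "Dog"],
})

# Basic contingency table: rows are conditions, columns are animal types
cross_table = pd.crosstab(toy_df["Intake Condition"], toy_df["Animal Type"])

# margins=True adds an "All" row and column holding the totals
with_totals = pd.crosstab(toy_df["Intake Condition"], toy_df["Animal Type"], margins=True)

# normalize="index" turns each row into proportions that sum to 1
row_shares = pd.crosstab(toy_df["Intake Condition"], toy_df["Animal Type"], normalize="index")

print(cross_table)
print(with_totals)
print(row_shares)
```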

## Step 7:  Check for duplicate Animal IDs (pay close attention to the syntax here)

In [28]:
duplicate_ids = df['Animal ID'].duplicated().sum()
print(duplicate_ids)

17525
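
`duplicated()` marks every repeat after the first occurrence, so the sum above counts repeat rows, not distinct IDs. A repeated `Animal ID` may simply be an animal that entered the shelter more than once, so it is worth inspecting the rows before dropping anything. The sketch below, on hypothetical toy data, shows one way to inspect and then handle duplicates.

```python
import pandas as pd

# Hypothetical toy data: animal A1 appears three times
toy_df = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A1", "A3", "A1"],
    "Intake Type": ["Stray", "Stray", "Owner Surrender", "Stray", "Stray"],
})

# Repeats after the first occurrence of each ID
repeat_count = toy_df["Animal ID"].duplicated().sum()

# keep=False flags every row whose ID appears more than once
all_repeats = toy_df[toy_df["Animal ID"].duplicated(keep=False)]

# Keep only the first row for each ID
first_intakes = toy_df.drop_duplicates(subset="Animal ID", keep="first")

# Rows that are identical across all columns
exact_duplicates = toy_df.duplicated().sum()

print(repeat_count)
print(all_repeats)
print(first_intakes)
```

Whether to keep the first intake, the last, or all of them depends on the question being asked, so `keep="first"` here is only one possible choice.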


## Practice Joining Data 

### Scenario 1
You have customer data split into two different files (DataFrames),
and you want to combine them into a single DataFrame for analysis.

In [11]:
# Run the cell without changes to create the two dataframes 

customers_part1_df = pd.DataFrame({'CustomerID': [101, 102, 103],
                                   'FirstName': ['Alice', 'Bob', 'Charlie'],
                                   'City': ['Anytown', 'Otherville', 'Smallburg']})

customers_part2_df = pd.DataFrame({'CustomerID': [104, 105, 106],
                                   'FirstName': ['David', 'Emily', 'Frank'],
                                   'City': ['Bigcity', 'Townsville', 'Villageton']})

### Scenario 1 Task:  Use `pd.concat()` to stack the two DataFrames above (after all, they have the same columns)

In [None]:
# Display the second DataFrame to check its columns
customers_part2_df

   CustomerID  FirstName        City
0         104      David     Bigcity
1         105      Emily  Townsville
2         106      Frank  Villageton


In [32]:
# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 2
all_customers_df = pd.concat([customers_part1_df, customers_part2_df], ignore_index=True)
all_customers_df

   CustomerID  FirstName        City
0         101      Alice     Anytown
1         102        Bob  Otherville
2         103    Charlie   Smallburg
3         104      David     Bigcity
4         105      Emily  Townsville
5         106      Frank  Villageton
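
To see what the `ignore_index` argument changes, the sketch below stacks two small frames both ways. The frame names `part1` and `part2` are hypothetical stand-ins for the customer DataFrames.

```python
import pandas as pd

# Hypothetical stand-ins for the two customer DataFrames
part1 = pd.DataFrame({"CustomerID": [101, 102], "FirstName": ["Alice", "Bob"]})
part2 = pd.DataFrame({"CustomerID": [104, 105], "FirstName": ["David", "Emily"]})

# Default: each frame keeps its own row labels, so labels repeat
kept_index = pd.concat([part1, part2])

# ignore_index=True discards the old labels and renumbers the rows
fresh_index = pd.concat([part1, part2], ignore_index=True)

print(list(kept_index.index))   # [0, 1, 0, 1]
print(list(fresh_index.index))  # [0, 1, 2, 3]
```

Repeated row labels are legal in pandas, but they make label-based lookups such as `.loc[0]` return more than one row, which is usually not what you want after stacking.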


### Scenario 2

Combining customer details and loyalty points.

In [13]:
# Run the cell without changes to create the two dataframes 

customer_details_df = pd.DataFrame({'CustomerID': [101, 102, 103],
                                    'Name': ['Alice', 'Bob', 'Charlie'],
                                    'City': ['Anytown', 'Otherville', 'Smallburg']})

loyalty_points_df = pd.DataFrame({'CustomerID': [101, 102, 103],
                                  'Points': [100, 250, 50]})

### Scenario 2 Task:  Merge the DataFrames on `CustomerID`


In [35]:
merged_customer_df = pd.merge(customer_details_df, loyalty_points_df, on="CustomerID")
merged_customer_df

   CustomerID     Name        City  Points
0         101    Alice     Anytown     100
1         102      Bob  Otherville     250
2         103  Charlie   Smallburg      50
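
In this scenario every `CustomerID` appears in both DataFrames, so the default inner join keeps all rows. The sketch below uses hypothetical frames (`details` and `points`, with one unmatched customer on each side) to show how the `how` and `indicator` arguments of `pd.merge()` behave when the keys do not line up.

```python
import pandas as pd

# Hypothetical frames: 103 has no points, 104 has no details
details = pd.DataFrame({"CustomerID": [101, 102, 103],
                        "Name": ["Alice", "Bob", "Charlie"]})
points = pd.DataFrame({"CustomerID": [101, 102, 104],
                       "Points": [100, 250, 75]})

# Inner join (the default): only IDs present in both frames
inner = pd.merge(details, points, on="CustomerID")

# Left join: every row of details, with NaN where no points exist
left = pd.merge(details, points, on="CustomerID", how="left")

# Outer join: all IDs from both sides; indicator=True adds a _merge column
outer = pd.merge(details, points, on="CustomerID", how="outer", indicator=True)

print(inner)
print(left)
print(outer)
```

The `_merge` column takes the values `both`, `left_only`, and `right_only`, which makes it a quick way to check which keys failed to match.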
