<a href="https://colab.research.google.com/github/Rohanrathod7/my-ds-labs/blob/main/16_Working_with_Categorical_Data_in_Python/03_Visualizing_Categorical_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd



In [2]:
url = "https://raw.githubusercontent.com/Rohanrathod7/my-ds-labs/main/16_Working_with_Categorical_Data_in_Python/dataset/dogs.csv"
dogs = pd.read_csv(url)
display(dogs.head())


Unnamed: 0,ID,name,age,sex,breed,date_found,adoptable_from,posted,color,coat,size,neutered,housebroken,likes_people,likes_children,get_along_males,get_along_females,get_along_cats,keep_in
0,23807,Gida,0.25,female,Unknown Mix,12/10/19,12/11/19,12/11/19,red,short,small,no,,,,,,,
1,533,Frida És Ricsi,0.17,female,Unknown Mix,12/1/19,12/1/19,12/9/19,black and white,short,small,no,,yes,yes,yes,yes,yes,
2,23793,,4.0,male,Unknown Mix,12/8/19,12/23/19,12/8/19,saddle back,short,medium,no,,,,,,,
3,23795,,1.0,male,Unknown Mix,12/8/19,12/23/19,12/8/19,yellow-brown,medium,medium,no,,,,,,,
4,23806,Amy,2.0,female,French Bulldog Mix,12/10/19,12/11/19,12/11/19,black,short,small,no,,,,,,,


**Adding categories**  
The owner of a local dog adoption agency has listings for almost 3,000 dogs. One of the most common questions they have been receiving lately is: "What type of area was the dog previously kept in?". You are setting up a pipeline to do some analysis and want to look into what information is available regarding the "keep_in" variable. Both pandas, as pd, and the dogs dataset have been preloaded.

In [3]:
# Check frequency counts while also printing the NaN count
print(dogs["keep_in"].value_counts(dropna=False))

# Switch to a categorical variable
dogs["keep_in"] = dogs["keep_in"].astype("category")

# Add new categories
new_categories = ["Unknown History", "Open Yard (Countryside)"]
dogs["keep_in"] = dogs["keep_in"].cat.add_categories(new_categories)

# Check frequency counts one more time
print(dogs["keep_in"].value_counts(dropna=False))

# When the adoption agency starts adding more information to this column,
# they will need to use one of the five categories now available in the 'keep_in' variable.

keep_in
both flat and garden    1224
NaN                     1021
garden                   510
flat                     182
Name: count, dtype: int64
keep_in
both flat and garden       1224
NaN                        1021
garden                      510
flat                        182
Unknown History               0
Open Yard (Countryside)       0
Name: count, dtype: int64
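
Why add categories before the data arrives? A categorical column rejects values that are not in its category list. The sketch below illustrates this on a small made-up Series (`keep_in` here is toy data, not the real column), assuming a reasonably recent pandas version:

```python
import pandas as pd

# Toy stand-in for the "keep_in" column (values are illustrative only)
keep_in = pd.Series(["flat", "garden", None, "flat"], dtype="category")

# Assigning a value that is not yet a category is rejected.
# The attempt runs on a copy so the original Series is left untouched.
trial = keep_in.copy()
try:
    trial.iloc[2] = "Unknown History"
except (TypeError, ValueError) as err:
    print("Rejected:", err)

# After add_categories, the same assignment succeeds
keep_in = keep_in.cat.add_categories(["Unknown History"])
keep_in.iloc[2] = "Unknown History"
print(keep_in.value_counts(dropna=False))
```

New categories are appended to the end of the existing category list, which is why they appear last in the frequency table above.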


**Removing categories**   
Before adopting dogs, parents might want to know whether or not a new dog likes children. When looking at the adoptable dogs dataset, dogs, you notice that the frequency of responses for the categorical Series "likes_children" looks like this:

maybe     1718
yes       1172
no          47
The owner of the data wants to convert all "maybe" responses to "no", as it would be unsafe to let a family adopt a dog if it doesn't like children. The code to convert all "maybe" to "no" is provided in Step 1. However, the option for "maybe" still remains as a category.

In [7]:
# Set "maybe" to be "no"
dogs.loc[dogs["likes_children"] == "maybe", "likes_children"] = "no"

# Convert 'likes_children' to categorical dtype
dogs["likes_children"] = dogs["likes_children"].astype("category")

# Print out categories
print(dogs["likes_children"].cat.categories)

# Print the frequency table
print(dogs["likes_children"].value_counts())

# Remove the "maybe" category
# Not needed here: the categories printed above contain only 'no' and 'yes',
# so there is no "maybe" category left to remove.
# dogs["likes_children"] = dogs["likes_children"].cat.remove_categories(["maybe"])
print(dogs["likes_children"].value_counts())

# Print the categories one more time
print(dogs["likes_children"].cat.categories)

#  Telling parents that a dog 'maybe' likes children isn't helpful. To be on the
#  safe side, the adoption agency has decided to remove maybe as an option. You can
#  now do your analysis without worrying about 'Maybe?' showing up in the data.

Index(['no', 'yes'], dtype='object')
likes_children
yes    1172
no       47
Name: count, dtype: int64
likes_children
yes    1172
no       47
Name: count, dtype: int64
Index(['no', 'yes'], dtype='object')
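
The case the exercise describes, where "maybe" lingers as an empty category, occurs when the column is already categorical before the values are recoded. A minimal sketch on made-up data (the `likes` Series is hypothetical):

```python
import pandas as pd

# Illustrative data, not the real "likes_children" column
likes = pd.Series(["yes", "maybe", "no", "maybe"], dtype="category")

# Recode the values first; "maybe" stays behind as an unused category
likes[likes == "maybe"] = "no"
print(likes.cat.categories)  # still lists 'maybe'

# Drop the now-unused category explicitly
likes = likes.cat.remove_categories(["maybe"])
print(likes.cat.categories)
```

If any rows still held "maybe" when `remove_categories` ran, those rows would become NaN, so recoding first is the safer order.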


**Renaming categories**   
The likes_children column of the adoptable dogs dataset needs an update. Here are the current frequency counts:

Maybe?    1718
yes       1172
no          47
Two things that stick out are the differences in capitalization and the ? found in the Maybe? category. The data should be cleaner than this and you are being asked to make a few changes.

In [8]:
# Create the my_changes dictionary
my_changes = {"Maybe?": "Maybe"}

# Rename the categories listed in the my_changes dictionary
dogs["likes_children"] = dogs["likes_children"].cat.rename_categories(my_changes)

# Use a lambda function to convert all categories to uppercase using upper()
dogs["likes_children"] =  dogs["likes_children"].cat.rename_categories(lambda c: c.upper())

# Print the list of categories
print(dogs["likes_children"].cat.categories)

# Using two steps, we have completely updated the likes_children pandas Series.
# You can use these few steps to clean up categorical columns before performing your analysis.

Index(['NO', 'YES'], dtype='object')
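
Because the column above no longer contains "Maybe?", the dictionary rename has nothing to change there. The following sketch repeats both steps on a small made-up Series (`answers` is hypothetical) where the effect is visible:

```python
import pandas as pd

# Illustrative data that still contains the messy "Maybe?" label
answers = pd.Series(["Maybe?", "yes", "no", "yes"], dtype="category")

# Step 1: a dict renames only the categories it mentions
answers = answers.cat.rename_categories({"Maybe?": "Maybe"})

# Step 2: a callable is applied to every category
answers = answers.cat.rename_categories(lambda c: c.upper())

print(answers.cat.categories)
print(answers.tolist())
```

Renaming changes the labels only; the underlying codes, and therefore the row values, follow along automatically.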


**Collapsing categories**  
One problem that users of a local dog adoption website have voiced is that there are too many options. As they look through the different types of dogs, they are getting lost in the overwhelming amount of choice. To simplify some of the data, you are going through each column and collapsing data if appropriate. To preserve the original data, you are going to make new updated columns in the dogs dataset. You will start with the coat column. The frequency table is listed here:

short          1969
medium          565
wirehaired      220
long            180
medium-long       3

In [9]:
# Create the update_coats dictionary
update_coats= {"wirehaired": "medium",
                "medium-long": "medium"}
# Create a new column, coat_collapsed
dogs["coat_collapsed"] = dogs["coat"].replace(update_coats)

# Convert the column to categorical
dogs["coat_collapsed"] = dogs["coat_collapsed"].astype("category")

# Print the frequency table
print(dogs["coat_collapsed"].value_counts())


# By collapsing five categories down to three, you have simplified your data. If
# you repeat this across several columns, the total combination of categories
# across these variables will be greatly reduced.

coat_collapsed
short     1972
medium     785
long       180
Name: count, dtype: int64
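
The collapse pattern, sketched on a tiny made-up Series (`coat` below is toy data): replace the string values first, then convert to categorical so that only the surviving labels become categories.

```python
import pandas as pd

# Illustrative coat values, one per original category
coat = pd.Series(["short", "wirehaired", "medium-long", "long", "medium"])

# Map the rarer labels onto "medium"; unlisted values pass through unchanged
collapse = {"wirehaired": "medium", "medium-long": "medium"}
coat_collapsed = coat.replace(collapse).astype("category")

print(coat_collapsed.value_counts())
```

Writing the result to a new variable (or a new column, as in the cell above) keeps the original values available for comparison.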


**Reordering categories in a Series**  
The owner of a local dog adoption agency has asked you to take a look at her data on adoptable dogs. She is specifically interested in the size of the dogs in her dataset and wants to know if there are differences in other variables, given a dog's size. The adoptable dogs dataset has been loaded as dogs and the "size" variable has already been saved as a categorical column.

In [15]:
# Convert the size column to categorical dtype
dogs["size"] = dogs["size"].astype("category")

# Print out the current categories of the size variable
print(dogs["size"].cat.categories)

# Reorder the categories, specifying the Series is ordinal, and overwriting the original series
dogs["size"] = dogs["size"].cat.reorder_categories(
  new_categories=["small", "medium", "large"],
  ordered=True,
)

# Small is smaller than medium, which is smaller than large. Now all of your analyses will be printed in the same order.

Index(['large', 'medium', 'small'], dtype='object')
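
Marking the categories as ordered does more than fix the print order. As a rough illustration on made-up data (`sizes` is a hypothetical Series), an ordered categorical also supports comparisons, `min`/`max`, and sorting by category order:

```python
import pandas as pd

# Illustrative sizes; categories start out in alphabetical order
sizes = pd.Series(["medium", "small", "large", "small"], dtype="category")
sizes = sizes.cat.reorder_categories(["small", "medium", "large"], ordered=True)

# Comparisons and min/max now follow small < medium < large
print(sizes.min(), sizes.max())
print((sizes > "small").tolist())

# Sorting uses the category order rather than the alphabet
print(sizes.sort_values().tolist())
```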


**Using .groupby() after reordering**  
It is now time to run some analyses on the adoptable dogs dataset that is focused on the "size" of the dog. You have already developed some code to reorder the categories. In this exercise, you will develop two similar .groupby() statements to help better understand the effect of "size" on other variables. dogs has been preloaded for you.

In [17]:
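# Previous code: assign the result back, since reorder_categories returns a new Series
dogs["size"] = dogs["size"].cat.reorder_categories(
  new_categories=["small", "medium", "large"],
  ordered=True,
)

# How many Male/Female dogs are available of each size?
# observed=False keeps every size category and avoids the FutureWarning
# that pandas 2.1+ raises about the changing default
print(dogs.groupby("size", observed=False)["sex"].value_counts())

# Do larger dogs need more room to roam?
print(dogs.groupby("size", observed=False)["keep_in"].value_counts())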

# There are more medium male dogs than any other combination of size and sex. It
# also looks like larger dogs are more often kept outside, as opposed to a flat.
# Isn't it nice that each printout uses the order we specified earlier?

size    sex   
small   male       260
        female     214
medium  male      1090
        female     854
large   male       331
        female     188
Name: count, dtype: int64
size    keep_in                
small   both flat and garden       238
        flat                        80
        garden                      21
        Unknown History              0
        Open Yard (Countryside)      0
medium  both flat and garden       795
        garden                     317
        flat                        97
        Unknown History              0
        Open Yard (Countryside)      0
large   both flat and garden       191
        garden                     172
        flat                         5
        Unknown History              0
        Open Yard (Countryside)      0
Name: count, dtype: int64
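
When grouping by a categorical column, the `observed` argument controls whether categories with no rows still appear in the result. A small sketch on made-up data (the `df` frame and its values are hypothetical):

```python
import pandas as pd

# Illustrative frame: "medium" is a valid category but has no rows
df = pd.DataFrame({
    "size": pd.Categorical(
        ["small", "large", "small", "large"],
        categories=["small", "medium", "large"],
        ordered=True,
    ),
    "age": [1.0, 4.0, 3.0, 6.0],
})

# observed=False keeps the empty "medium" group; observed=True drops it
all_groups = df.groupby("size", observed=False)["age"].count()
seen_groups = df.groupby("size", observed=True)["age"].count()

print(all_groups)
print(seen_groups)
```

In both cases the groups come out in the category order (small, medium, large), not alphabetically.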




**Cleaning variables**  
Users of an online entry system used to have the ability to freely type in responses to questions. This is causing issues when trying to analyze the adoptable dogs dataset, dogs. Here is the current frequency table of the "sex" column:

male      1672
female    1249
 MALE        10
 FEMALE       5
Malez        1
Now that the system only takes responses of "female" and "male", you want this variable to match the updated system.

In [18]:
# Fix the misspelled word
replace_map = {"Malez": "male"}

# Update the sex column using the created map
dogs["sex"] = dogs["sex"].replace(replace_map)

# Strip away leading (and any trailing) whitespace
dogs["sex"] = dogs["sex"].str.strip()

# Make all responses lowercase
dogs["sex"] = dogs["sex"].str.lower()

# Convert to a categorical Series
dogs["sex"] = dogs["sex"].astype("category")

print(dogs["sex"].value_counts())

# Categorical variables are usually just strings with some additional properties.
#  An easy way to update them is using .str. Just don't forget to convert the
#  column back to a categorical Series!

sex
male      1681
female    1256
Name: count, dtype: int64
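
The reminder to convert back matters because `.str` methods return a plain string Series even when the input is categorical. A minimal sketch on made-up values (`raw` is toy data):

```python
import pandas as pd

# Illustrative messy responses stored as a categorical Series
raw = pd.Series(["male", " MALE", "female", " FEMALE", "Malez"], dtype="category")

# .str methods work on categoricals, but the result is no longer categorical
stripped = raw.str.strip().str.lower()
print(stripped.dtype)

# Fix the remaining typo, then convert back to categorical
cleaned = stripped.replace({"malez": "male"}).astype("category")
print(cleaned.cat.categories)
print(cleaned.value_counts())
```

This sketch lowercases before fixing the typo, so the replacement key is "malez"; the cell above does the replacement first, which works equally well.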


**Accessing and filtering data**  
You are working on a Python application to display information about the dogs available for adoption at your local animal shelter. Some of the variables of interest, such as "breed", "size", and "coat", are saved as categorical variables. In order for this application to work properly, you need to be able to access and filter data using these columns.

The ID variable has been set as the index of the pandas DataFrame dogs.



In [20]:
# Set the 'ID' column as the index
dogs.set_index('ID', inplace=True)

# Print the category of the coat for ID 23807
print(dogs.loc[23807, "coat"])

short


In [21]:
# Find the count of male and female dogs who have a "long" coat
print(dogs.loc[dogs["coat"] == "long", "sex"].value_counts())

sex
male      124
female     56
Name: count, dtype: int64


In [22]:
# Print the mean age of dogs with a breed of "English Cocker Spaniel"
print(dogs.loc[dogs["breed"] == "English Cocker Spaniel", "age"].mean())

8.186153846153847


In [23]:
# Count the number of dogs that have "English" in their breed name
print(dogs[dogs["breed"].str.contains("English", regex=False)].shape[0])


# There are currently 35 dogs up for adoption with "English" in their breed name.
# Being able to access values and filter data in a DataFrame is an important skill that will be needed almost anytime pandas is used.

35
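
The three access patterns used above (label lookup, boolean filtering, and substring matching) can be tried on a small made-up frame. The `pets` DataFrame and its values are hypothetical; `na=False` is an optional guard, added here on the assumption that a real column might contain missing breeds:

```python
import pandas as pd

# Illustrative frame indexed by ID, with categorical breed and coat columns
pets = pd.DataFrame(
    {
        "breed": ["English Cocker Spaniel", "Unknown Mix", "English Greyhound"],
        "coat": ["long", "short", "short"],
        "age": [8.0, 2.0, 5.0],
    },
    index=pd.Index([101, 102, 103], name="ID"),
).astype({"breed": "category", "coat": "category"})

# Single value by row label and column name
one_coat = pets.loc[101, "coat"]

# Boolean filter on a categorical column, then aggregate another column
short_ages = pets.loc[pets["coat"] == "short", "age"].mean()

# Substring match; na=False treats missing breeds as non-matches
english = pets["breed"].str.contains("English", regex=False, na=False).sum()

print(one_coat, short_ages, english)
```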
