# Purrfect Purrsonalities:
## How Does a Cat's Color and Acquisition Affect His or Her Temperament?
### And Does This Have the Potential to Increase Cat Adoption?

---

![Everyone needs an Archie.](Archie.jpeg)

## Part One: Cleaning the Survey Results CSV

Survey respondents were promised that any personal information gathered, namely their email address, would not be shared on any publicly visible platform. To keep this promise and to follow data ethics to the best of my ability, my very first cleaning step replaced the email addresses in the original respondent CSV file (exported from the Google Form survey) with an identifying number for each family. The result was saved as a new CSV file that is safe to post on GitHub and Tableau, and that "safe" file is used for the remainder of this project. The original CSV with email addresses was deleted from the repo immediately after cleaning. The code used for that step is shown but commented out to protect the privacy of all respondents.

In [81]:
import pandas as pd
import numpy as np

In [82]:
# survey_df= pd.read_csv('survey_responses.csv')
# survey_df.head()

In [83]:
# survey_df.rename(columns={'Email Address' : 'family'}, inplace=True)
# survey_df.head()

In [84]:
# survey_df.family = pd.factorize(survey_df.family)[0]
# survey_df.head()

In [85]:
# survey_df.to_csv('survey_data.csv')
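Since the real cells stay commented out, here is a minimal sketch of the same idea on made-up data. The addresses and cat names below are invented for illustration; only the `rename` and `pd.factorize` pattern mirrors the cells above.

```python
import pandas as pd

# Hypothetical stand-in for the raw export; these addresses are invented.
raw = pd.DataFrame({
    'Email Address': ['a@example.com', 'b@example.com', 'a@example.com'],
    'name': ['Cat One', 'Cat Two', 'Cat Three'],
})

# factorize returns (codes, uniques). The codes number each distinct address
# in order of first appearance, so cats from the same household share an ID.
codes, uniques = pd.factorize(raw['Email Address'])

safe = raw.rename(columns={'Email Address': 'family'})
safe['family'] = codes

print(safe['family'].tolist())  # [0, 1, 0]
```

The first and third rows share an address, so they end up with the same family number, and no address survives in the "safe" frame.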

Now, let's read in our safe CSV file.

In [86]:
survey_df = pd.read_csv('survey_data.csv', index_col=[0])
survey_df.head()

Unnamed: 0,Timestamp,family,"What is your pet's name? (As a reminder, you should submit one separate survey for each cat you have.)",How did you acquire this cat?,"If your cat came from a cat cafe, or if you indicated you adopted this cat from ""Other"", please give the name of the cafe or other adoption/purchase source below:",In which of the following life stages is this cat?,"How long, in years, have you had this cat? (Please round up or round down to the closest whole number.)",Which of these best describes your cat's coat color?,"If your cat's coat was marked as ""Other"", please describe the cat's coloration below:","Approximately how much does your cat weigh, in pounds? (Please use decimals instead of fractions if you know your cat's weight very specifically.)","On a scale from 1 to 10, with 1 being ""scaredy cat"" and 10 being ""social butterfly"", how social is your cat around people?","On a scale from 1 to 10, with 1 being ""Has one brain cell"" and 10 being ""Kitty Einstein"", how intelligent do you think your cat is? (A nice simple quiz you can use to gauge this based on everyday behaviors can be found here.)","On a scale from 1 to 10, with 1 being ""Couch potato"" and 10 being ""Always in motion"", how active/playful is your cat?","On a scale from 1 to 10, with 1 being ""Cat's got his tongue"" and 10 being ""Chatty Cathy"", how vocal is your cat?"
0,6/5/2023 17:07:17,0,Archie,Adopted at a cat cafe,,Adult (3--6 years),4,Orange tabby/orange tabby with white,,15.0,9,7,6,3
1,6/5/2023 17:15:18,1,Dax,Adopted at a cat cafe,,Junior (7 months--2 years) KITTENS UNDER 1 YEA...,0,Orange tabby/orange tabby with white,,9.5,10,7,9,8
2,6/5/2023 19:54:13,2,Aylin,Acquired through a friend or relative,,Junior (7 months--2 years) KITTENS UNDER 1 YEA...,1,Calico,,7.0,5,8,6,7
3,6/5/2023 20:17:40,3,Ali,Adopted through animal shelter/rescue group,,Mature (7--10 years),6,Tortoiseshell,,15.0,2,7,3,5
4,6/5/2023 20:18:42,3,Warner,Adopted through animal shelter/rescue group,,Mature (7--10 years),6,Gray tuxedo (gray and white),,12.0,9,7,7,10


Because the column names came from the Google Sheet that the survey questions and responses saved into, the column names are currently the full questions from that survey. That simply won't do. Let's make those column names more Pythonic.

In [87]:
column_names = ['timestamp', 'family', 'name', 'acquired_from', 'place_name', 'age', 'home_duration', 'color',
                 'other_color', 'weight', 'socialness', 'intelligence', 'activity_level', 'vocalization']
survey_df.columns = column_names
survey_df

Unnamed: 0,timestamp,family,name,acquired_from,place_name,age,home_duration,color,other_color,weight,socialness,intelligence,activity_level,vocalization
0,6/5/2023 17:07:17,0,Archie,Adopted at a cat cafe,,Adult (3--6 years),4,Orange tabby/orange tabby with white,,15,9,7,6,3
1,6/5/2023 17:15:18,1,Dax,Adopted at a cat cafe,,Junior (7 months--2 years) KITTENS UNDER 1 YEA...,0,Orange tabby/orange tabby with white,,9.5,10,7,9,8
2,6/5/2023 19:54:13,2,Aylin,Acquired through a friend or relative,,Junior (7 months--2 years) KITTENS UNDER 1 YEA...,1,Calico,,7,5,8,6,7
3,6/5/2023 20:17:40,3,Ali,Adopted through animal shelter/rescue group,,Mature (7--10 years),6,Tortoiseshell,,15,2,7,3,5
4,6/5/2023 20:18:42,3,Warner,Adopted through animal shelter/rescue group,,Mature (7--10 years),6,Gray tuxedo (gray and white),,12,9,7,7,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
472,6/13/2023 19:54:33,255,Cash,Found as a stray,,Junior (7 months--2 years) KITTENS UNDER 1 YEA...,1,Solid black,,7,3,10,4,8
473,6/13/2023 19:56:02,255,Penny,Adopted through animal shelter/rescue group,,Junior (7 months--2 years) KITTENS UNDER 1 YEA...,1,Brown tabby/brown tabby with white,,8,1,5,2,3
474,7/6/2023 11:41:41,256,Morty,Adopted at a cat cafe,,Adult (3--6 years),4,Solid black,,12,1,1,2,10
475,7/6/2023 11:46:57,256,Indiana,Adopted at a cat cafe,,Adult (3--6 years),4,Orange tabby/orange tabby with white,,15,5,8,3,2
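Assigning a list to `.columns` is purely positional, so the new names have to be in the same order as the survey questions. As a hedged sketch (the toy frame and its placeholder headers are assumptions, not the real survey columns), an explicit old-to-new mapping makes that pairing easy to eyeball:

```python
import pandas as pd

# Toy frame with placeholder question headers (not the real survey text).
demo = pd.DataFrame(
    [[0, 'Archie', 15]],
    columns=['Question A', 'Question B', 'Question C'],
)
new_names = ['family', 'name', 'weight']

# Positional renaming only works if the counts line up.
assert len(new_names) == len(demo.columns)

# Pair each old header with its new name so the mapping can be inspected.
mapping = dict(zip(demo.columns, new_names))
demo = demo.rename(columns=mapping)

print(list(demo.columns))  # ['family', 'name', 'weight']
```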


We do not need the timestamp column, and the place_name column came from a previous iteration of the survey that was later scrapped. It asked respondents to name the place where their cat was adopted. Let's delete those columns and get some information about our DataFrame. 

In [88]:
survey_df.drop(['timestamp', 'place_name'], axis=1, inplace=True)
survey_df.head()

Unnamed: 0,family,name,acquired_from,age,home_duration,color,other_color,weight,socialness,intelligence,activity_level,vocalization
0,0,Archie,Adopted at a cat cafe,Adult (3--6 years),4,Orange tabby/orange tabby with white,,15.0,9,7,6,3
1,1,Dax,Adopted at a cat cafe,Junior (7 months--2 years) KITTENS UNDER 1 YEA...,0,Orange tabby/orange tabby with white,,9.5,10,7,9,8
2,2,Aylin,Acquired through a friend or relative,Junior (7 months--2 years) KITTENS UNDER 1 YEA...,1,Calico,,7.0,5,8,6,7
3,3,Ali,Adopted through animal shelter/rescue group,Mature (7--10 years),6,Tortoiseshell,,15.0,2,7,3,5
4,3,Warner,Adopted through animal shelter/rescue group,Mature (7--10 years),6,Gray tuxedo (gray and white),,12.0,9,7,7,10


In [89]:
survey_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 477 entries, 0 to 476
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   family          477 non-null    int64 
 1   name            477 non-null    object
 2   acquired_from   477 non-null    object
 3   age             477 non-null    object
 4   home_duration   477 non-null    object
 5   color           477 non-null    object
 6   other_color     46 non-null     object
 7   weight          477 non-null    object
 8   socialness      477 non-null    int64 
 9   intelligence    477 non-null    int64 
 10  activity_level  477 non-null    int64 
 11  vocalization    477 non-null    int64 
dtypes: int64(5), object(7)
memory usage: 48.4+ KB


In [90]:
survey_df.shape

(477, 12)

Respondents were asked to give their cat's weight in pounds (decimals were welcome for owners who knew the exact weight), but that column is showing as an object and not a numeric data type. What's going on there?

In [91]:
print(survey_df['weight'].unique())

['15' '9.5' '7' '12' '8.6' '8 lbs' '13 pounds' '13' '16lbs' '7 lbs' '6lbs'
 '11' '9' '5' '2' '14.7' '10' '10 pounds' '12 lbs' '8' '3.5' '14 lbs'
 '9lbs' '13 lbs' '9.14' '20 lbs' '12 pounds' '17' '8.5' '6 lbs'
 '12.50 pounds' '14' '7 pounds' '15 pounds' '20' '12.5 lbs' 'Ginger 3 lbs'
 '16' '15.5' '6.5' '5.5' '11.5' 'No idea but he’s a smol kitteh' '25'
 '13lbs' '18 lbs' 'No idea ' '14lbs' '7lbs' '8.4 lbs' '8-15 pounds'
 'Grey and white tuxedo cat' '6.4' '16 lbs' '?' '18' '9.3 pounds'
 '6.8 pounds' '9 pounds' '5 pounds' '3 lbs' '12 LBS' '6 pounds 8 ounces'
 '6 pounds' '6.5 pounds' '8.2' '11.8' 'I don’t know. ' '21' '10 lbs'
 '17 pounds' '8.8' 'Brown and black tabby' '3' '10.73 lb ' '8.73 lb '
 '7.5 pounds ' '13.6' '9 lbs' '8lb' 'No idea' 'not sure' '17 lbs' '6' '4'
 'Less then a pound' '5 lbs' '24 lbs' '9.6' '9.8' '7.5' '7.2' '8lbs'
 '11.1 lbs' '8 pounds' '8 poundspp' '17.6'
 "I don't know and I'm bad at guessing these things!"
 "I'm not sure and I would hate to guess!" '9 pounds no' '13

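In case we need to do calculations later based on feline chonkiness, we need to make the weight column an integer data type. Let's pull the first run of digits out of each response in that Series. This drops units and free-text answers (responses with no digits become null), and it also truncates decimals, so 9.5 becomes 9.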

In [92]:
# Raw string so Python passes \d to the regex engine untouched
survey_df['weight'] = survey_df['weight'].str.extract(pat=r'(\d+)', expand=False)
print(survey_df['weight'].unique())

['15' '9' '7' '12' '8' '13' '16' '6' '11' '5' '2' '14' '10' '3' '20' '17'
 nan '25' '18' '21' '4' '24' '22' '28' '23' '19' '112']
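If whole pounds ever turn out to be too coarse, a variation could keep the decimal part. This is only a sketch of an alternative, not what the notebook does: the pattern below is my own assumption, and the sample strings are a handful of the responses printed earlier.

```python
import pandas as pd

# A few of the raw responses listed above.
weights = pd.Series(['15', '9.5', '8 lbs', '16lbs', '12.50 pounds', 'No idea', '?'])

# Capture digits plus an optional decimal part; text with no digits becomes NaN.
extracted = weights.str.extract(r'(\d+(?:\.\d+)?)', expand=False)

# Convert the captured strings to floats (NaN stays NaN).
as_float = pd.to_numeric(extracted)

print(as_float.tolist())
```

Answers such as '6 pounds 8 ounces' or '8-15 pounds' would still yield only their first number under either pattern, so they would need manual review.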


We do have some null values, which will keep us from changing the weight column to an integer type as I'd like. We can convert it to a nullable integer type, however, by using "Int64" instead of "int". 

In [93]:
survey_df.weight = survey_df.weight.astype('Int64')
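To see what the nullable type buys us, here is a small sketch on a toy Series (the values are illustrative). It goes through `pd.to_numeric` first, which is just one cautious route to the same dtype:

```python
import numpy as np
import pandas as pd

digits = pd.Series(['15', '9', np.nan])

# to_numeric parses the strings into floats: 15.0, 9.0, NaN.
numeric = pd.to_numeric(digits)

# The nullable 'Int64' dtype stores whole numbers and marks the gap as <NA>.
nullable = numeric.astype('Int64')

print(nullable.dtype)         # Int64
print(nullable.isna().sum())  # 1
```

Plain `int` has no way to represent the missing entry, which is why the capital-I version is needed here.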

In [94]:
survey_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 477 entries, 0 to 476
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   family          477 non-null    int64 
 1   name            477 non-null    object
 2   acquired_from   477 non-null    object
 3   age             477 non-null    object
 4   home_duration   477 non-null    object
 5   color           477 non-null    object
 6   other_color     46 non-null     object
 7   weight          464 non-null    Int64 
 8   socialness      477 non-null    int64 
 9   intelligence    477 non-null    int64 
 10  activity_level  477 non-null    int64 
 11  vocalization    477 non-null    int64 
dtypes: Int64(1), int64(5), object(6)
memory usage: 48.9+ KB


The survey included an "Other" drop-down option for the cat's coat color, which led respondents to a short-answer question on the Form where they could describe their cat's unconventional coat color. To collapse coat color into a single column, I first replace null values in the other_color column with an empty string. Then, wherever other_color is not an empty string, its value overwrites the color column (which for those rows would hold just the string "other"). Along the way I'll check which strings appear in the other_color column, and to confirm the move worked I'll print the first 50 rows of the DataFrame to get a good representation of how the values transferred.

In [95]:
survey_df.other_color = survey_df.other_color.fillna('')


In [96]:
print(survey_df['other_color'].unique())

['' 'ocicat tabby mix' 'ocicat marble tabby mix' 'abyssian tabby mix'
 'Blonde with tabby markings on the head' 'Snowshoe Siamese '
 'Grey and white long hair' 'White with orange spots'
 'Tabby-tortie or torbie'
 'Red flame, orange tips on ears, orange strips on tail'
 'Grays, orange, white, no cat color fits her'
 'Black with brown/gray stripe marking, brown belly'
 'Black with white spots '
 'Belly, legs and bottom half of his body is white. His back and some of the sides of his body are black.'
 'Mostly white with black spots' 'Brown/orange/white Torby'
 'Part orange tabby, part calico' 'Chocolate tabby point Siamese '
 'Mostly white with a few large brown spots (including a heart on shoulder)'
 'She is a calico and a tabby ' 'Black and white splotches, not tuxedo'
 'almost all white but black tail, and partly black on head and face '
 'White with brown and black spots' 'Hes a cream tabby'
 'Fire point siamese' 'Long hair black and white marbled (non-tuxedo)'
 'Long hair gray with w

In [97]:
survey_df['color'] = survey_df['other_color'].where(survey_df['other_color'].ne(''), survey_df['color'])
survey_df.head(50)

Unnamed: 0,family,name,acquired_from,age,home_duration,color,other_color,weight,socialness,intelligence,activity_level,vocalization
0,0,Archie,Adopted at a cat cafe,Adult (3--6 years),4,Orange tabby/orange tabby with white,,15,9,7,6,3
1,1,Dax,Adopted at a cat cafe,Junior (7 months--2 years) KITTENS UNDER 1 YEA...,0,Orange tabby/orange tabby with white,,9,10,7,9,8
2,2,Aylin,Acquired through a friend or relative,Junior (7 months--2 years) KITTENS UNDER 1 YEA...,1,Calico,,7,5,8,6,7
3,3,Ali,Adopted through animal shelter/rescue group,Mature (7--10 years),6,Tortoiseshell,,15,2,7,3,5
4,3,Warner,Adopted through animal shelter/rescue group,Mature (7--10 years),6,Gray tuxedo (gray and white),,12,9,7,7,10
5,4,Cleo,Acquired through a friend or relative,Super Senior (15 years+),16,Solid black,,8,4,6,2,3
6,5,Cinnamon,Adopted through animal shelter/rescue group,Adult (3--6 years),3 years,Brown tabby/brown tabby with white,,8,8,6,7,9
7,6,Patti LaBelle,Adopted through animal shelter/rescue group,Adult (3--6 years),2,Brown tabby/brown tabby with white,,13,5,8,4,10
8,7,Claire,Adopted through animal shelter/rescue group,Senior (11-14 years),14,Calico,,13,10,8,3,8
9,8,Charlie,Found as a stray,Mature (7--10 years),10 years,Classic tuxedo (black and white),,16,10,3,2,1
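None of the rows shown above had an "Other" color, so here is a toy sketch of the same `where` call on a frame that does. The three rows are made up for illustration, apart from borrowing one description from the other_color list.

```python
import pandas as pd

demo = pd.DataFrame({
    'color': ['Calico', 'Other', 'Solid black'],
    'other_color': ['', 'Fire point siamese', ''],
})

# where() keeps other_color when the condition is True (non-empty)
# and falls back to the existing color value otherwise.
demo['color'] = demo['other_color'].where(demo['other_color'].ne(''), demo['color'])

print(demo['color'].tolist())  # ['Calico', 'Fire point siamese', 'Solid black']
```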


Since that seems to have worked, we can delete the other_color column.

In [98]:
survey_df.drop(['other_color'], axis=1, inplace=True)
survey_df.head()

Unnamed: 0,family,name,acquired_from,age,home_duration,color,weight,socialness,intelligence,activity_level,vocalization
0,0,Archie,Adopted at a cat cafe,Adult (3--6 years),4,Orange tabby/orange tabby with white,15,9,7,6,3
1,1,Dax,Adopted at a cat cafe,Junior (7 months--2 years) KITTENS UNDER 1 YEA...,0,Orange tabby/orange tabby with white,9,10,7,9,8
2,2,Aylin,Acquired through a friend or relative,Junior (7 months--2 years) KITTENS UNDER 1 YEA...,1,Calico,7,5,8,6,7
3,3,Ali,Adopted through animal shelter/rescue group,Mature (7--10 years),6,Tortoiseshell,15,2,7,3,5
4,3,Warner,Adopted through animal shelter/rescue group,Mature (7--10 years),6,Gray tuxedo (gray and white),12,9,7,7,10


Next, I want the cat's age column to be simpler to read and work with. Let's strip everything that comes after the age classification (Mature, Adult, Senior, etc.) from each value. The ranges included in these classifications can be given later in any results write-ups and visualizations.

In [99]:
# Keep the text before the first '(' and trim the space left behind
survey_df['age'] = survey_df['age'].str.split('(').str[0].str.strip()
survey_df.head()

Unnamed: 0,family,name,acquired_from,age,home_duration,color,weight,socialness,intelligence,activity_level,vocalization
0,0,Archie,Adopted at a cat cafe,Adult,4,Orange tabby/orange tabby with white,15,9,7,6,3
1,1,Dax,Adopted at a cat cafe,Junior,0,Orange tabby/orange tabby with white,9,10,7,9,8
2,2,Aylin,Acquired through a friend or relative,Junior,1,Calico,7,5,8,6,7
3,3,Ali,Adopted through animal shelter/rescue group,Mature,6,Tortoiseshell,15,2,7,3,5
4,3,Warner,Adopted through animal shelter/rescue group,Mature,6,Gray tuxedo (gray and white),12,9,7,7,10
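One thing worth knowing about this step: splitting on '(' alone leaves the space that sat before the parenthesis attached to the label. A quick sketch on two of the survey's age labels shows the difference a `.str.strip()` makes:

```python
import pandas as pd

ages = pd.Series(['Adult (3--6 years)', 'Super Senior (15 years+)'])

# Splitting on '(' keeps the trailing space before the parenthesis.
labels = ages.str.split('(').str[0]
print(labels.tolist())    # ['Adult ', 'Super Senior ']

# strip() removes it, so later comparisons like == 'Adult' behave as expected.
stripped = labels.str.strip()
print(stripped.tolist())  # ['Adult', 'Super Senior']
```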


We similarly need to simplify where/how the cats were acquired. These were drop-down-menu choices on the survey, so there aren't too many possible unique values in that column. We can create a dictionary to make these replacements as more Pythonic strings.

In [100]:
print(survey_df['acquired_from'].unique())

['Adopted at a cat cafe' 'Acquired through a friend or relative'
 'Adopted through animal shelter/rescue group' 'Found as a stray'
 'Purchased from a pet store' 'Received as a gift'
 'Purchased from a breeder' 'Bred at home/from owned pet']


In [101]:
replacement_dict = {
    'Adopted at a cat cafe' : 'cat_cafe',
    'Acquired through a friend or relative' : 'friend_or_relative',
    'Adopted through animal shelter/rescue group' : 'shelter_or_rescue',
    'Found as a stray' : 'stray',
    'Purchased from a pet store' : 'pet_store',
    'Received as a gift' : 'gift',
    'Purchased from a breeder' : 'breeder',
    'Bred at home/from owned pet' : 'from_owned_pet'
}
survey_df['acquired_from'] = survey_df['acquired_from'].replace(replacement_dict)
print(survey_df['acquired_from'].unique())

['cat_cafe' 'friend_or_relative' 'shelter_or_rescue' 'stray' 'pet_store'
 'gift' 'breeder' 'from_owned_pet']
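One caveat with `replace` is that any value missing from the dictionary passes through unchanged and silently. A minimal sketch of a pre-check, using a toy Series and a two-entry dictionary (both trimmed down for illustration):

```python
import pandas as pd

demo = pd.Series(['Found as a stray', 'Received as a gift', 'Found as a stray'])
labels = {'Found as a stray': 'stray', 'Received as a gift': 'gift'}

# Values with no dictionary key would be left as-is, so list them first.
unmapped = set(demo.unique()) - set(labels)

renamed = demo.replace(labels)

print(unmapped)          # set()
print(renamed.tolist())  # ['stray', 'gift', 'stray']
```

An empty set means every drop-down choice has a replacement.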


Since this was a survey, it's quite possible there are accidental duplicates. If there are duplicates, they can be easily detected by searching on the family and name columns. If there are duplicates there, we should keep the last duplicate; when people submit two copies of a survey, it's often because they didn't answer the way they wanted to on the first submission and are re-doing it. Keeping the last duplicate gives us a better shot at accurate data.

In [102]:
duplicate_rows = survey_df[survey_df.duplicated(['family', 'name'])]
duplicate_rows

Unnamed: 0,family,name,acquired_from,age,home_duration,color,weight,socialness,intelligence,activity_level,vocalization
23,10,Pandora,stray,Adult,6,Calico,14,3,9,2,2
359,129,Theo,gift,Adult,4,Classic tuxedo (black and white),12,8,2,5,5
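By default `duplicated` flags only the later copies, so the earlier submissions don't appear in a table like the one above. As a sketch on made-up rows (the socialness scores here are invented), `keep=False` shows every copy side by side, and `keep='last'` is the rule used for dropping:

```python
import pandas as pd

demo = pd.DataFrame({
    'family': [10, 10, 11],
    'name': ['Pandora', 'Pandora', 'Theo'],
    'socialness': [5, 3, 8],  # invented values for illustration
})

# keep=False flags every copy, so both submissions can be compared.
all_copies = demo[demo.duplicated(['family', 'name'], keep=False)]

# keep='last' retains only the most recent submission for each cat.
deduped = demo.drop_duplicates(subset=['family', 'name'], keep='last')

print(len(all_copies))                 # 2
print(deduped['socialness'].tolist())  # [3, 8]
```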


In [103]:
survey_df.drop_duplicates(subset=['family', 'name'], keep='last', inplace=True)
survey_df.shape

(475, 11)

Finally, let's save a clean copy of this CSV file to be used for further analysis and for visualization tools later in the project.

In [104]:
survey_df.to_csv('clean_survey_data.csv')
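A note for whoever reads this file back in later: CSV doesn't store dtypes, and the index is written as an unnamed first column. The sketch below uses an in-memory buffer and a two-row toy frame (both assumptions for illustration) to show what a round trip looks like.

```python
import io

import pandas as pd

demo = pd.DataFrame({
    'name': ['Archie', 'Mystery'],
    'weight': pd.array([15, pd.NA], dtype='Int64'),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# index_col=0 reads the saved index back instead of adding 'Unnamed: 0'.
reloaded = pd.read_csv(buffer, index_col=0)

print(list(reloaded.columns))    # ['name', 'weight']
print(reloaded['weight'].dtype)  # float64
```

Because of the missing value, weight comes back as float64, so the `astype('Int64')` step would need to be repeated after loading if integers matter.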