<a href="https://colab.research.google.com/github/1998JustinLee/Animal-Shelter-Outcomes/blob/master/Animal_Shelter_Outcomes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Upload CSV

In [0]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving train.csv to train.csv
User uploaded file "train.csv" with length 2824793 bytes


# **Data Transformation**

### Original Data Transformation Description

* Have to keep the AnimalID column for the final output
* Converted AgeuponOutcome into the equivalent number of days and put it under an Age column
* Changed Name to 1 if the animal has a name and 0 if it has no name
* Grouped Colors into 8 groups (done by searching for keywords in the color value)
  * White
  * Brown
  * Black
  * Gray
  * Tan
  * Orange
  * Tricolor
  * Other
* Kept common breeds (more than 150 entries, copied from Kaggle) for cats and dogs; the rest are labelled exotic. First split the strings by '/' and then removed 'Mix' from them.
* Split DateTime into Date and Time
* Split SexuponOutcome into Sex and Neutered, and changed null values in Sex to Unknown, which covers entries whose value was just 'Unknown' (there was only one missing entry for this, so it could have just been removed)
  * Changed Spayed to Neutered - Necessary?
* Changed null values for Age to 0 (18 missing entries)
* Changed null values for OutcomeSubtype to NA (pretty much half of them were missing, most of them from Adoption and Return_to_owner)
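
The age conversion in the list above can be sketched in isolation. This is a minimal illustration, not the notebook's actual cell: the helper name `age_to_days` and the sample values are assumptions, and the day counts per unit (365, 30, 7, 1) are the same approximations used in the transformation code.

```python
import pandas as pd

# Approximate days per unit, matching the values used in this notebook
UNIT_DAYS = {"year": 365, "month": 30, "week": 7, "day": 1}

def age_to_days(age_text):
    """Convert a string such as '2 years' to days; missing values become 0."""
    if pd.isna(age_text):
        return 0.0
    count, unit = age_text.split(" ")
    # 'years' and 'year' share one entry once the plural 's' is stripped
    return float(count) * UNIT_DAYS[unit.rstrip("s")]

# Hypothetical sample values in the AgeuponOutcome format
ages = pd.Series(["1 year", "3 weeks", "5 months", None])
print(ages.map(age_to_days).tolist())
```

Mapping a function over one column keeps the conversion away from the other columns in the dataframe.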

**Notes:**

* Split up the time and date further?
* Convert all other text into numbers? - scikit-learn's random forest expects numeric features, so text columns need to be encoded first

### Changes
* Colors are now chosen like Breeds: keep the first value of a '/' split, then sort it into a group.
* Breeds now cover any entries that occur more than 100 times
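
As a rough sketch of the color rule described above (first value of a '/' split, then a keyword search), the grouping could be written as a lookup table. The function name `group_color` and the table layout are assumptions for illustration; the keywords themselves are the ones used in the transformation code further down.

```python
# Keyword table: the first group whose keyword appears in the color wins
COLOR_GROUPS = [
    ("White", ["White", "Calico", "Lilac"]),
    ("Brown", ["Brown", "Chocolate", "Torbie", "Sable"]),
    ("Black", ["Black", "Tortie"]),
    ("Gray", ["Gray", "Blue", "Lynx", "Silver", "Agouti"]),
    ("Tan", ["Tan", "Cream", "Buff", "Seal", "Flame", "Fawn", "Apricot"]),
    ("Orange", ["Orange", "Yellow", "Gold", "Red"]),
    ("Tricolor", ["Tricolor"]),
]

def group_color(color):
    """Return the color group for a raw value such as 'Blue Tabby/White'."""
    first = color.split("/")[0]  # keep only the first color of a mix
    for group, keywords in COLOR_GROUPS:
        if any(keyword in first for keyword in keywords):
            return group
    return "Other"

print(group_color("Blue Tabby/White"))
print(group_color("Calico"))
```

Keeping the keywords in one table makes it easier to see which raw colors end up in which group.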

### Printing Functions

In [0]:
# These helpers assume pandas is imported as pd and train.csv is loaded as df
# (see "Actual Code" below); running them before that raises a NameError.
# The SexuponOutcome example only works before that column is split and dropped.

# Print dataframe
print(df)

# More custom print - won't compress columns
with pd.option_context('display.max_rows', 20, 'display.max_columns', 15):
    print(df)

# Print number of entries for each column, doesn't count NULL values
print(df.count())

# Count and print values for a given column, doesn't count NULL values
print(df['SexuponOutcome'].value_counts(sort=False))

# Count the number of Breeds of Cats, sort by count
print(df.loc[df.AnimalType=='Cat', 'Breed'].value_counts(ascending=True))

NameError: ignored

### Actual Code

In [0]:
#!/usr/bin/env python
import pandas as pd
import numpy as np

df = pd.read_csv('train.csv', low_memory=False)

# Split DateTime to Date and Time
df[['Date','Time']] = df['DateTime'].str.split(' ',expand=True)
df.drop(['DateTime'], axis=1, inplace=True)

# Split SexuponOutcome to Neutered and Sex
df[['Neutered','Sex']] = df['SexuponOutcome'].str.split(' ',expand=True)
df.drop(['SexuponOutcome'], axis=1, inplace=True)
df.loc[df.Sex.isnull(), 'Sex'] = 'Unknown'
df.loc[df.Neutered=='Spayed', 'Neutered'] = 'Neutered'

# Turn Age into a number of days
df[['Age1','Age2']] = df['AgeuponOutcome'].str.split(' ',expand=True)
df.drop(['AgeuponOutcome'], axis=1, inplace=True)
# Map the unit to days on the Age2 column only, so no other column is touched
unit_days = {'years': 365, 'year': 365, 'months': 30, 'month': 30,
             'weeks': 7, 'week': 7, 'days': 1, 'day': 1}
df['Age'] = df['Age1'].astype(float) * df['Age2'].map(unit_days)
df.drop(['Age1', 'Age2'], axis=1, inplace=True)
df['Age'] = df['Age'].fillna(0)

# Change name to bool (1 if the animal has a name, 0 otherwise)
df['Name'] = df['Name'].notnull().astype(int)

# Fixing null values in OutcomeSubtype column
df['OutcomeSubtype'] = df['OutcomeSubtype'].fillna('NA')

# Changing colours
df['Color'] = df['Color'].str.split("/").str[0];
df.loc[(df.Color.str.contains('White', regex=False)) |
        (df.Color.str.contains('Calico', regex=False)) |
        (df.Color.str.contains('Lilac', regex=False)), 'Color'] = 'White';

df.loc[(df.Color.str.contains('Brown', regex=False)) |
        (df.Color.str.contains('Chocolate', regex=False)) |
        (df.Color.str.contains('Torbie', regex=False)) |
        (df.Color.str.contains('Sable', regex=False)), 'Color'] = 'Brown';

df.loc[(df.Color.str.contains('Black', regex=False)) |
        (df.Color.str.contains('Tortie', regex=False)), 'Color'] = 'Black';

df.loc[(df.Color.str.contains('Gray', regex=False)) |
        (df.Color.str.contains('Blue', regex=False)) |
        (df.Color.str.contains('Lynx', regex=False)) |
        (df.Color.str.contains('Silver', regex=False)) |
        (df.Color.str.contains('Agouti', regex=False)), 'Color'] = 'Gray';

df.loc[(df.Color.str.contains('Tan', regex=False)) |
        (df.Color.str.contains('Cream', regex=False)) |
        (df.Color.str.contains('Buff', regex=False)) |
        (df.Color.str.contains('Seal', regex=False)) |
        (df.Color.str.contains('Flame', regex=False)) |
        (df.Color.str.contains('Fawn', regex=False)) |
        (df.Color.str.contains('Apricot', regex=False)), 'Color'] = 'Tan';

df.loc[(df.Color.str.contains('Orange', regex=False)) |
        (df.Color.str.contains('Yellow', regex=False)) |
        (df.Color.str.contains('Gold', regex=False)) |
        (df.Color.str.contains('Red', regex=False)), 'Color'] = 'Orange';

df.loc[df.Color.str.contains('Tricolor', regex=False), 'Color'] = 'Tricolor';

df.loc[(df.Color!='Black') &
        (df.Color!='White') &
        (df.Color!='Brown') &
        (df.Color!='Gray') &
        (df.Color!='Tan') &
        (df.Color!='Orange') &
        (df.Color!='Tricolor') , 'Color'] = 'Other';

# Grouping breeds
df['Breed'] = df['Breed'].str.split("/").str[0];
df['Mix'] = np.where(df.Breed.str.contains('Mix', regex=False), 1, 0)
df['Breed'] = df['Breed'].str.replace(' Mix', '');

breeds = ['Blue Lacy','Queensland Heeler','Rhod Ridgeback','Retriever','Chinese Sharpei','Black Mouth Cur','Catahoula','Staffordshire','Affenpinscher','Afghan Hound','Airedale Terrier','Akita','Australian Kelpie','Alaskan Malamute','English Bulldog','American Bulldog','American English Coonhound','American Eskimo Dog (Miniature)','American Eskimo Dog (Standard)','American Eskimo Dog (Toy)','American Foxhound','American Hairless Terrier','American Staffordshire Terrier','American Water Spaniel','Anatolian Shepherd Dog','Australian Cattle Dog','Australian Shepherd','Australian Terrier','Basenji','Basset Hound','Beagle','Bearded Collie','Beauceron','Bedlington Terrier','Belgian Malinois','Belgian Sheepdog','Belgian Tervuren','Bergamasco','Berger Picard','Bernese Mountain Dog','Bichon Fris_','Black and Tan Coonhound','Black Russian Terrier','Bloodhound','Bluetick Coonhound','Boerboel','Border Collie','Border Terrier','Borzoi','Boston Terrier','Bouvier des Flandres','Boxer','Boykin Spaniel','Briard','Brittany','Brussels Griffon','Bull Terrier','Bull Terrier (Miniature)','Bulldog','Bullmastiff','Cairn Terrier','Canaan Dog','Cane Corso','Cardigan Welsh Corgi','Cavalier King Charles Spaniel','Cesky Terrier','Chesapeake Bay Retriever','Chihuahua','Chinese Crested Dog','Chinese Shar Pei','Chinook','Chow Chow',"Cirneco dell'Etna",'Clumber Spaniel','Cocker Spaniel','Collie','Coton de Tulear','Curly-Coated Retriever','Dachshund','Dalmatian','Dandie Dinmont Terrier','Doberman Pinsch','Doberman Pinscher','Dogue De Bordeaux','English Cocker Spaniel','English Foxhound','English Setter','English Springer Spaniel','English Toy Spaniel','Entlebucher Mountain Dog','Field Spaniel','Finnish Lapphund','Finnish Spitz','Flat-Coated Retriever','French Bulldog','German Pinscher','German Shepherd','German Shorthaired Pointer','German Wirehaired Pointer','Giant Schnauzer','Glen of Imaal Terrier','Golden Retriever','Gordon Setter','Great Dane','Great Pyrenees',
'Greater Swiss Mountain Dog','Greyhound','Harrier','Havanese','Ibizan Hound','Icelandic Sheepdog','Irish Red and White Setter','Irish Setter','Irish Terrier','Irish Water Spaniel','Irish Wolfhound','Italian Greyhound','Japanese Chin','Keeshond','Kerry Blue Terrier','Komondor','Kuvasz','Labrador Retriever','Lagotto Romagnolo','Lakeland Terrier','Leonberger','Lhasa Apso','L_wchen','Maltese','Manchester Terrier','Mastiff','Miniature American Shepherd','Miniature Bull Terrier','Miniature Pinscher','Miniature Schnauzer','Neapolitan Mastiff','Newfoundland','Norfolk Terrier','Norwegian Buhund','Norwegian Elkhound','Norwegian Lundehund','Norwich Terrier','Nova Scotia Duck Tolling Retriever','Old English Sheepdog','Otterhound','Papillon','Parson Russell Terrier','Pekingese','Pembroke Welsh Corgi','Petit Basset Griffon Vend_en','Pharaoh Hound','Plott','Pointer','Polish Lowland Sheepdog','Pomeranian','Standard Poodle','Miniature Poodle','Toy Poodle','Portuguese Podengo Pequeno','Portuguese Water Dog','Pug','Puli','Pyrenean Shepherd','Rat Terrier','Redbone Coonhound','Rhodesian Ridgeback','Rottweiler','Russell Terrier','St. Bernard','Saluki','Samoyed','Schipperke','Scottish Deerhound','Scottish Terrier','Sealyham Terrier','Shetland Sheepdog','Shiba Inu','Shih Tzu','Siberian Husky','Silky Terrier','Skye Terrier','Sloughi','Smooth Fox Terrier','Soft-Coated Wheaten Terrier','Spanish Water Dog','Spinone Italiano','Staffordshire Bull Terrier','Standard Schnauzer','Sussex Spaniel','Swedish Vallhund','Tibetan Mastiff','Tibetan Spaniel','Tibetan Terrier','Toy Fox Terrier','Treeing Walker Coonhound','Vizsla','Weimaraner','Welsh Springer Spaniel','Welsh Terrier','West Highland White Terrier','Whippet','Wire Fox Terrier','Wirehaired Pointing Griffon','Wirehaired Vizsla','Xoloitzcuintli','Yorkshire Terrier',

'Entlebucher', 'Treeing Tennesse Brindle', 'Sealyham Terr', 'Spanish Mastiff', 'Hovawart', 'Mexican Hairless', 'Swiss Hound', 'Lowchen', 'Port Water Dog', 'Old English Bulldog', 'Presa Canario', 'Jindo', 'Bull Terrier Miniature', 'English Shepherd', 'Picardy Sheepdog', 'Bedlington Terr', 'Glen Of Imaal', 'Treeing Cur', 'Boykin Span', 'Schnauzer Giant', 'Podengo Pequeno', 'Chinese Crested', 'Landseer', 'Patterdale Terr', 'Feist', 'Bluetick Hound', 'English Coonhound', 'Dutch Shepherd', 'St. Bernard Rough Coat', 'Cavalier Span', 'St. Bernard Smooth Coat', 'Pbgv', 'Chesa Bay Retr', 'American Eskimo', 'English Pointer', 'Alaskan Husky', 'Dogo Argentino', 'Collie Rough', 'Bichon Frise', 'Wire Hair Fox Terrier', 'West Highland', 'German Shorthair Pointer', 'Redbone Hound', 'Bruss Griffon', 'Soft Coated Wheaten Terrier', 'Collie Smooth', 'Carolina Dog', 'Flat Coat Retriever', 'Dachshund Longhair', 'Dachshund Wirehair', 'American Pit Bull Terrier', 'Plott Hound', 'Anatol Shepherd',
'Chihuahua Longhair', 'Jack Russell Terrier', 'Chihuahua Shorthair', 'Black',
          
'Havana Brown', 'Cornish Rex', 'Norwegian Forest Cat', 'Ocicat', 'Burmese', 'Munchkin Longhair', 'Cymric', 'Javanese', 'Turkish Van', 'Devon Rex', 'Sphynx', 'Exotic Shorthair', 'Abyssinian', 'Pixiebob Shorthair', 'Tonkinese', 'British Shorthair', 'Bengal', 'Balinese', 'Bombay', 'Japanese Bobtail', 'Angora', 'American Shorthair', 'Ragdoll', 'Persian', 'Himalayan', 'Russian Blue', 'Maine Coon', 'Manx', 'Snowshoe', 'Siamese', 'Domestic Longhair', 'Domestic Medium Hair', 'Domestic Shorthair']

groups = ['Herding','Herding','Hound','Sporting','Non-Sporting','Herding','Herding','Terrier','Toy','Hound','Terrier','Working','Working','Working','Non-Sporting','Non-Sporting','Hound','Non-Sporting','Non-Sporting','Toy','Hound','Terrier','Terrier','Sporting','Working','Herding','Herding','Terrier','Hound','Hound','Hound','Herding','Herding','Terrier','Herding','Herding','Herding','Herding','Herding','Working','Non-Sporting','Hound','Working','Hound','Hound','Working','Herding','Terrier','Hound','Non-Sporting','Herding','Working','Sporting','Herding','Sporting','Toy','Terrier','Terrier','Non-Sporting','Working','Terrier','Working','Working','Herding','Toy','Terrier','Sporting','Toy','Toy','Non-Sporting','Working','Non-Sporting','Hound','Sporting','Sporting','Herding','Non-Sporting','Sporting','Hound','Non-Sporting','Terrier','Working','Working','Working','Sporting','Hound','Sporting','Sporting','Toy','Herding','Sporting','Herding','Non-Sporting','Sporting','Non-Sporting','Working','Herding','Sporting','Sporting','Working','Terrier','Sporting','Sporting','Working','Working','Working','Hound','Hound','Toy','Hound','Herding','Sporting','Sporting','Terrier','Sporting','Hound','Toy','Toy','Non-Sporting','Terrier','Working','Working','Sporting','Sporting','Terrier','Working','Non-Sporting','Non-Sporting','Toy','Terrier','Working','Herding','Terrier','Toy','Terrier','Working','Working','Terrier','Herding','Hound','Non-Sporting','Terrier','Sporting','Herding','Hound','Toy','Terrier','Toy','Herding','Hound','Hound','Hound','Sporting','Herding','Toy','Non-Sporting','Non-Sporting','Toy','Hound','Working','Toy','Herding','Herding','Terrier','Hound','Hound','Working','Terrier','Working','Hound','Working','Non-Sporting','Hound','Terrier','Terrier','Herding','Non-Sporting','Toy','Working','Toy','Terrier','Hound','Terrier','Terrier','Herding','Sporting','Terrier','Working','Sporting','Herding','Working','Non-Sporting','Non-Sporting','Toy','Hound','Sporting','Sporting','Sporting',
'Terrier','Terrier','Hound','Terrier','Sporting','Sporting','Non-Sporting','Toy',

'Herding', 'Hound', 'Working', 'Working', 'Working', 'Non-Sporting', 'Hound', 'Non-Sporting', 'Working', 'Non-Sporting', 'Working', 'Sporting', 'Terrier', 'Working', 'Herding', 'Terrier', 'Terrier', 'Working', 'Sporting', 'Working', 'Hound', 'Toy', 'Working', 'Terrier', 'Terrier', 'Hound', 'Hound', 'Herding', 'Working', 'Toy', 'Working', 'Hound', 'Sporting', 'Non-Sporting', 'Sporting', 'Working', 'Sporting', 'Herding', 'Non-Sporting', 'Terrier', 'Terrier', 'Sporting', 'Hound', 'Toy', 'Terrier', 'Herding', 'Hound', 'Sporting', 'Hound', 'Hound', 'Terrier', 'Hound', 'Working', 'Toy', 'Terrier', 'Toy', 'Unknown',

'Short', 'Rex', 'Long', 'Short', 'Short', 'Long', 'Long', 'Long', 'Semi-Long', 'Rex', 'Hairless', 'Short', 'Short', 'Short', 'Short', 'Short', 'Short', 'Long', 'Short', 'Short/Long', 'Semi-Long', 'Short', 'Long', 'Long', 'Long', 'Short', 'Long', 'Short/Long', 'Short', 'Short', 'Long', 'Medium', 'Short']

breeds_group = np.array([breeds,groups]).T
dog_groups = np.unique(breeds_group[:,1])

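# Replace each listed breed with its group in one pass; assigning the result back
# avoids relying on an in-place replace through df.Breed.
# Breeds that are not in the list (e.g. 'Pit Bull') keep their original name.
breed_to_group = dict(zip(breeds_group[:,0], breeds_group[:,1]))
df['Breed'] = df['Breed'].replace(breed_to_group)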

# # Create array of columns that need to be one-hot encoded
# one_hot = np.array(['AnimalType', 'Breed', 'Color', 'Neutered', 'Sex'])

# # Using get_dummies for one hot encoding
# df = pd.get_dummies(df, columns=one_hot)

df

Unnamed: 0,AnimalID,Name,OutcomeType,OutcomeSubtype,AnimalType,Breed,Color,Date,Time,Neutered,Sex,Age,Mix
0,A671945,1,Return_to_owner,,Dog,Herding,Brown,2014-02-12,18:22:00,Neutered,Male,365.0,1
1,A656520,1,Euthanasia,Suffering,Cat,Short,Tan,2013-10-13,12:44:00,Neutered,Female,365.0,1
2,A686464,1,Adoption,Foster,Dog,Pit Bull,Gray,2015-01-31,12:28:00,Neutered,Male,730.0,1
3,A683430,0,Transfer,Partner,Cat,Short,Gray,2014-07-11,19:09:00,Intact,Male,21.0,1
4,A667013,0,Transfer,Partner,Dog,Non-Sporting,Tan,2013-11-15,12:52:00,Neutered,Male,730.0,0
5,A677334,1,Transfer,Partner,Dog,Terrier,Black,2014-04-25,13:04:00,Intact,Female,30.0,0
6,A699218,1,Transfer,Partner,Cat,Short,Gray,2015-03-28,13:11:00,Intact,Male,21.0,1
7,A701489,0,Transfer,Partner,Cat,Short,Brown,2015-04-30,17:02:00,Unknown,Unknown,21.0,1
8,A671784,1,Adoption,,Dog,Terrier,Orange,2014-02-04,17:17:00,Neutered,Female,150.0,1
9,A677747,0,Adoption,Offsite,Dog,Terrier,White,2014-05-03,07:48:00,Neutered,Female,365.0,0


# Josh Messing With stuff
Linear Testing

Starting with a function to clean the data so that it only uses integers

In [0]:
df['Breed'] = pd.factorize(df.Breed, sort=True)[0]

# One-hot encoding splits each feature into several columns, but I think that makes
# the result easier to review; David used it below.
# It might be worth moving the one-hot encoding into the data transformation part, since it always needs to be done

In [0]:
# Factorize each categorical column so the codes don't have to be typed by hand;
# the downside is that you won't know which code means what
def Josh_Clean(data):
    # Read from the dataframe that was passed in (data), not the global df
    data['AnimalType'] = pd.factorize(data.AnimalType, sort=False)[0]
    data['OutcomeType'] = pd.factorize(data.OutcomeType, sort=False)[0]
    data['Color'] = pd.factorize(data.Color, sort=False)[0]
    data['Breed'] = pd.factorize(data.Breed, sort=False)[0]
    data['Sex'] = pd.factorize(data.Sex, sort=False)[0]
    data['Neutered'] = pd.factorize(data.Neutered, sort=False)[0]

In [0]:
def Josh_Clean(data):

  #Animal Type
  data.loc[data["AnimalType"] == "Dog", "AnimalType"] = 0
  data.loc[data["AnimalType"] == "Cat", "AnimalType"] = 1

  #Outcome
  data.loc[data["OutcomeType"] == "Return_to_owner", "OutcomeType"] = 0
  data.loc[data["OutcomeType"] == "Euthanasia", "OutcomeType"] = 1
  data.loc[data["OutcomeType"] == "Adoption", "OutcomeType"] = 2
  data.loc[data["OutcomeType"] == "Transfer", "OutcomeType"] = 3
  data.loc[data["OutcomeType"] == "Died", "OutcomeType"] = 4

  #color
  data.loc[data["Color"] == "White", "Color"] = 0
  data.loc[data["Color"] == "Black", "Color"] = 2
  data.loc[data["Color"] == "Brown", "Color"] = 1
  data.loc[data["Color"] == "Gray", "Color"] = 3
  data.loc[data["Color"] == "Tan", "Color"] = 4
  data.loc[data["Color"] == "Orange", "Color"] = 5
  data.loc[data["Color"] == "Tricolor", "Color"] = 6
  data.loc[data["Color"] == "Other", "Color"] = 7

  #Breed
  data.loc[data["Breed"] == "Exotic", "Breed"] = 0
  data.loc[data["Breed"] == "Siamese", "Breed"] = 1
  data.loc[data["Breed"] == "Domestic Shorthair", "Breed"] = 2
  data.loc[data["Breed"] == "Domestic Medium Hair", "Breed"] = 3
  data.loc[data["Breed"] == "Domestic Longhair", "Breed"] = 4
  data.loc[data["Breed"] == "Siberian Husky", "Breed"] = 5
  data.loc[data["Breed"] == "Shih Tzu", "Breed"] = 6
  data.loc[data["Breed"] == "Rottweiler", "Breed"] = 7
  data.loc[data["Breed"] == "Rat Terrier", "Breed"] = 8
  data.loc[data["Breed"] == "Pit Bull", "Breed"] = 9
  data.loc[data["Breed"] == "Miniature Schnauzer", "Breed"] = 10
  data.loc[data["Breed"] == "Miniature Poodle", "Breed"] = 11
  data.loc[data["Breed"] == "Labrador Retriever", "Breed"] = 12
  data.loc[data["Breed"] == "Jack Russell Terrier", "Breed"] = 13
  data.loc[data["Breed"] == "German Shepherd", "Breed"] = 14
  data.loc[data["Breed"] == "Dachshund", "Breed"] = 15
  data.loc[data["Breed"] == "Chihuahua Shorthair", "Breed"] = 16
  data.loc[data["Breed"] == "Chihuahua Longhair", "Breed"] = 17
  data.loc[data["Breed"] == "Catahoula", "Breed"] = 18
  data.loc[data["Breed"] == "Boxer", "Breed"] = 19
  data.loc[data["Breed"] == "Border Collie", "Breed"] = 20
  data.loc[data["Breed"] == "Beagle", "Breed"] = 21
  data.loc[data["Breed"] == "Australian Shepherd", "Breed"] = 22
  data.loc[data["Breed"] == "Australian Cattle Dog", "Breed"] = 23
  data.loc[data["Breed"] == "Yorkshire Terrier", "Breed"] = 24

  #Sex
  data.loc[data["Sex"] == "Male", "Sex"] = 0
  data.loc[data["Sex"] == "Female", "Sex"] = 1
  data.loc[data["Sex"] == "Unknown", "Sex"] = 0

  #Neutered
  data.loc[data["Neutered"] == "Intact" , "Neutered" ] = 0
  data.loc[data["Neutered"] == "Neutered" , "Neutered" ] = 1
  data.loc[data["Neutered"] == "Unknown" , "Neutered" ] = 2


Linear Model Test using logistic regression from the sklearn linear_model module

Result: 0.5489917318268547 (accuracy on the training data)

Not a good model at all for this problem

In [0]:
from sklearn import linear_model

train2 = df.copy()

Josh_Clean(train2)

target = train2["OutcomeType"].values
features = train2[["Name", "AnimalType", "Color", "Age", "Breed", "Sex"]].values

classifier = linear_model.LogisticRegression()
joshresults = classifier.fit(features, target)

print(joshresults.score(features,target))

0.5472707546110965


Polynomial Model Test
0.5438288001795802

In [0]:
from sklearn import linear_model, preprocessing

train2 = df.copy()

Josh_Clean(train2)

target = train2["OutcomeType"].values
features = train2[["Name", "AnimalType", "Color", "Age", "Breed", "Sex"]].values

classifier = linear_model.LogisticRegression()
poly = preprocessing.PolynomialFeatures(degree=2) # degree is the highest total power of the generated feature terms
poly_features = poly.fit_transform(features)

classifier_ = classifier.fit(poly_features, target)
print(classifier_.score(poly_features,target))

0.5378053799244267


Tree Model Selection
0.6747727187698754

In [0]:
from sklearn import tree, model_selection
train2 = df.copy()

Josh_Clean(train2)

target = train2["OutcomeType"].values
features_names = ["Name", "AnimalType", "Color", "Age", "Breed", "Sex"]
features = train2[features_names].values

decision_tree = tree.DecisionTreeClassifier(random_state = 1)
decision_tree_ = decision_tree.fit(features, target)

print (decision_tree_.score(features, target))

0.6998017134947061


### Discussion

With the new changes to the data (detailed above), the Linear (0.549 -> 0.547) and Polynomial (0.544 -> 0.538) models fell in accuracy, while there was a small increase for the Tree Model Selection (0.675 -> 0.700). All of these scores are computed on the same data the models were trained on.
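
The scores in this section come from calling `score(features, target)` on the rows used for fitting, so they may flatter the models. A minimal sketch of a held-out check is below. It uses synthetic data from `make_classification` as a stand-in for the shelter features (an assumption, as are the variable names), so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the shelter features (assumption: not the real data)
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           n_classes=3, random_state=1)

# Hold back a quarter of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

tree_model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Accuracy on the rows used for fitting versus the held-out rows
train_accuracy = tree_model.score(X_train, y_train)
test_accuracy = tree_model.score(X_test, y_test)
print("train accuracy:", train_accuracy)
print("held-out accuracy:", test_accuracy)
```

A large gap between the two numbers is the usual sign that a tree has memorised its training rows.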

# David Random Forest Attempt




Very quick random forest following https://chrisalbon.com/machine_learning/trees_and_forests/random_forest_classifier_example/

Things to do:
1. Split up the training set (I used the whole thing, which should only be done when submitting I think?)
2. Include date/time
3. Try out different feature engineering thingos

## 0. Load packages

In [0]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

## 1. Random Forest Classifier

First I'll just do some preliminary stuff, will change later

In [0]:
# Create duplicate dataframe for working
df_rf = df.copy()

# Drop these columns for now (simplifying things)
df_rf.drop(['Date', 'Time', 'OutcomeSubtype'], axis=1, inplace=True);

# Put OutcomeType at the start
df_rf = df_rf[['OutcomeType', 'Name', 'AnimalType', 'Breed', 'Color', 'Neutered',
               'Sex', 'Age']]

# View dataframe
df_rf.head()

Unnamed: 0,OutcomeType,Name,AnimalType,Breed,Color,Neutered,Sex,Age
0,Return_to_owner,1,Dog,Exotic,White,Neutered,Male,365.0
1,Euthanasia,1,Cat,Domestic Shorthair,Tan,Neutered,Female,365.0
2,Adoption,1,Dog,Pit Bull,White,Neutered,Male,730.0
3,Transfer,0,Cat,Domestic Shorthair,Gray,Intact,Male,21.0
4,Transfer,0,Dog,Exotic,Tan,Neutered,Male,730.0


Need to one-hot encode because we are working with categorical variables

In [0]:
# Create array of columns that need to be one-hot encoded
one_hot = np.array(['AnimalType', 'Breed', 'Color', 'Neutered', 'Sex'])

# Using get_dummies for one hot encoding
df_rf = pd.get_dummies(df_rf, columns=one_hot)

# View dataframe
df_rf.head()

Unnamed: 0,OutcomeType,Name,Age,AnimalType_Cat,AnimalType_Dog,Breed_Australian Cattle Dog,Breed_Australian Shepherd,Breed_Beagle,Breed_Border Collie,Breed_Boxer,...,Color_Other,Color_Tan,Color_Tricolor,Color_White,Neutered_Intact,Neutered_Neutered,Neutered_Unknown,Sex_Female,Sex_Male,Sex_Unknown
0,Return_to_owner,1,365.0,0,1,0,0,0,0,0,...,0,0,0,1,0,1,0,0,1,0
1,Euthanasia,1,365.0,1,0,0,0,0,0,0,...,0,1,0,0,0,1,0,1,0,0
2,Adoption,1,730.0,0,1,0,0,0,0,0,...,0,0,0,1,0,1,0,0,1,0
3,Transfer,0,21.0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
4,Transfer,0,730.0,0,1,0,0,0,0,0,...,0,1,0,0,0,1,0,0,1,0


Next, we extract out the features (column names that are in the dataframe)

In [0]:
# Create a list of the feature column's names
features = df_rf.columns[1:]

# The outcome types encoded as numbers; factorize also returns the outcome
# names in code order, so they no longer need to be typed by hand
y, outcome_names = pd.factorize(df_rf['OutcomeType'])
target_names = np.asarray(outcome_names)

Finally, we train the classifier

In [0]:
# Create a random forest Classifier
clf = RandomForestClassifier(n_jobs=2, random_state=0)

# Train the Classifier to take the training features and learn how they relate
# to the training y (the outcome type)
clf.fit(df_rf[features], y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

## Accuracy

This is what the predicted probabilities for the outcomes looks like

In [0]:
# View the predicted probabilities of the first 10 observations
clf.predict_proba(df_rf[features])[0:10]

array([[0.26642241, 0.04153974, 0.53307332, 0.15497791, 0.00398662],
       [0.        , 0.2       , 0.05      , 0.75      , 0.        ],
       [0.29697975, 0.12623753, 0.35508216, 0.21725455, 0.00444601],
       [0.        , 0.        , 0.        , 1.        , 0.        ],
       [0.        , 0.        , 0.55833333, 0.44166667, 0.        ],
       [0.10666667, 0.        , 0.1       , 0.79333333, 0.        ],
       [0.        , 0.        , 0.        , 0.76666667, 0.23333333],
       [0.        , 0.        , 0.        , 1.        , 0.        ],
       [0.        , 0.        , 0.74395136, 0.25604864, 0.        ],
       [0.05200758, 0.        , 0.80562887, 0.14236355, 0.        ]])

We can do a quick check to see if these predicted outcomes match up with what we originally saw

In [0]:
# Map each predicted class code back to its outcome name
preds = target_names[clf.predict(df_rf[features])]

# View the PREDICTED outcomes for the first five observations
preds[0:5]

array(['Adoption', 'Transfer', 'Adoption', 'Transfer', 'Adoption'],
      dtype='<U15')

Not really: only two of the first five predictions (the third and fourth) match the actual outcomes. Oh well. A confusion matrix shows the predictions in their entirety:

In [0]:
# Create confusion matrix
pd.crosstab(df_rf['OutcomeType'], preds, rownames=['Actual Species'], colnames=['Predicted Species'])

Predicted Species  Adoption  Died  Euthanasia  Return_to_owner  Transfer
Actual Species
Adoption               9277     4          23              769       696
Died                     16    48          14                8       111
Euthanasia              211     3         653              210       478
Return_to_owner        1676     4          42             2649       415
Transfer               1860     8         108              545      6901
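
The confusion matrix also gives the per-outcome hit rate, which the single accuracy figure hides. The sketch below re-enters the counts from the table above by hand (the variable names are assumptions) and computes the overall accuracy and the recall for each actual outcome.

```python
import numpy as np
import pandas as pd

labels = ["Adoption", "Died", "Euthanasia", "Return_to_owner", "Transfer"]

# Rows are actual outcomes, columns are predicted outcomes (counts from the table above)
confusion = pd.DataFrame(
    [[9277, 4, 23, 769, 696],
     [16, 48, 14, 8, 111],
     [211, 3, 653, 210, 478],
     [1676, 4, 42, 2649, 415],
     [1860, 8, 108, 545, 6901]],
    index=labels, columns=labels)

# Correct predictions sit on the diagonal
correct = np.diag(confusion.values)
accuracy = correct.sum() / confusion.values.sum()

# Recall: share of each actual outcome that was predicted correctly
recall = pd.Series(correct / confusion.sum(axis=1).values, index=labels)

print(round(accuracy, 4))
print(recall.round(3))
```

The rarest outcome, Died, has only 197 rows, and 48 of them are predicted correctly, so its recall is far below the overall accuracy.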


At least the accuracy is not like 50% or something

In [0]:
# Get the overall accuracy of the model
accuracy_score(df_rf['OutcomeType'], preds, normalize=True, sample_weight=None)

0.7305922406375098

One advantage of random forest is that you can view how "important" each feature is. I have no idea how this is supposed to work with one-hot encoded variables, because this does not look right at all; the tutorial I followed was not using categorical variables.

Are we supposed to just group together the one-hot encodings to find the final importance scores? Maybe... Could try it. 

In [0]:
# View a list of the features and their importance scores
list(zip(df_rf[features], clf.feature_importances_))

[('Name', 0.0593944384671512),
 ('Age', 0.426384988997659),
 ('AnimalType_Cat', 0.025028758468997582),
 ('AnimalType_Dog', 0.005861932287196567),
 ('Breed_Australian Cattle Dog', 0.004475236176749528),
 ('Breed_Australian Shepherd', 0.003256815136278902),
 ('Breed_Beagle', 0.0026587561816943947),
 ('Breed_Border Collie', 0.0031932727756882164),
 ('Breed_Boxer', 0.003606378634066746),
 ('Breed_Catahoula', 0.0025655450937124872),
 ('Breed_Chihuahua Shorthair', 0.006781964455528445),
 ('Breed_Dachshund', 0.004483599290940865),
 ('Breed_Domestic Longhair', 0.002361323440155965),
 ('Breed_Domestic Medium Hair', 0.0033349935932960235),
 ('Breed_Domestic Shorthair', 0.02302735835765465),
 ('Breed_Exotic', 0.009021703396608552),
 ('Breed_German Shepherd', 0.005187295113362422),
 ('Breed_Jack Russell Terrier', 0.0022470873226062968),
 ('Breed_Labrador Retriever', 0.0061727152949352295),
 ('Breed_Miniature Poodle', 0.003450886345231981),
 ('Breed_Miniature Schnauzer', 0.002947547110586872),
 ('B
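
One way to try the grouping idea raised above is to sum the importances of all one-hot columns that share a prefix. This is only a sketch: it uses a handful of the scores printed above rather than the full list, and it assumes the original column name is everything before the first underscore, which holds for the columns `get_dummies` created here.

```python
import pandas as pd

# A few of the importance scores printed above (subset for illustration only)
importances = pd.Series({
    "Name": 0.0593944384671512,
    "Age": 0.426384988997659,
    "AnimalType_Cat": 0.025028758468997582,
    "AnimalType_Dog": 0.005861932287196567,
    "Breed_Beagle": 0.0026587561816943947,
    "Breed_Boxer": 0.003606378634066746,
})

# Group by the original column name (the part before the first underscore)
grouped = (importances
           .groupby(lambda name: name.split("_")[0])
           .sum()
           .sort_values(ascending=False))
print(grouped)
```

With the real model, the same grouping could be applied to `pd.Series(clf.feature_importances_, index=features)`.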

## Discussion

Just out of curiosity, since Age is the most important feature in David's initial model, I wanted to test it by itself. It got a 0.511 accuracy.
<br>
<hr>
So I split up the Date and Time into their respective segments, year, month etc. I just dropped seconds since they're all just 00. However, there seems to be some leakage since using the Date and Time yields a 0.987 accuracy. So I think that we just have to drop these features.
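
The split into date and time segments could look roughly like this. The new column names (`Year`, `Month`, `Day`, `Hour`, `Minute`) are assumptions, and the two sample timestamps are taken from the first rows of the transformed dataframe shown earlier.

```python
import pandas as pd

# Two timestamps in the DateTime format of train.csv
sample = pd.DataFrame({"DateTime": ["2014-02-12 18:22:00", "2013-10-13 12:44:00"]})

# Parse once, then pull out each segment; seconds are dropped since they are all 00
stamp = pd.to_datetime(sample["DateTime"])
sample["Year"] = stamp.dt.year
sample["Month"] = stamp.dt.month
sample["Day"] = stamp.dt.day
sample["Hour"] = stamp.dt.hour
sample["Minute"] = stamp.dt.minute

print(sample.drop(columns="DateTime"))
```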
<br>
<hr>
In the original model, exotic breeds had an importance of 0.009 which was one of the most important breeds, so there may have been some inaccuracy with this, since almost 5000 entries were put under exotic. The most common breed was Domestic Shorthair with 8958 entries, exotic is second with 4790 and Chihuahua Shorthair is third with 2145. So I changed it so breeds with more than 100 entries were included. 

Exotic now has 3193 entries, and the result was a 0.742 accuracy. That is a marginal improvement, but with a possibility of over-fitting.
<br>
<hr>
For consistency, while the importances of the colours all seem fairly even, I may try changing the colour grouping so that it follows the breed method: take the first colour of a mix instead of checking whether a keyword is found anywhere.

Splitting the colors and then sorting the first color of a mix made the number in each color group more balanced, resulting in 0.736 accuracy compared to 0.731.

But with the new colors and the additional breeds, the result is 0.750
<br>
<hr>

# Justin's K Nearest Neighbours

Just copying stuff from all over the place

In [0]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# The outcome types encoded as numbers; factorize also returns the outcome
# names in code order, so they don't have to be typed by hand
y, outcome_names = pd.factorize(df['OutcomeType'])
target_names = np.asarray(outcome_names)

train2 = df.copy()
Josh_Clean(train2)

# Target is y
# target = train2["OutcomeType"].values
features_names = ["Name", "AnimalType", "Color", "Age", "Breed", "Sex"]
# Features is x
features = train2[features_names].values

### Whole training data

In [0]:
## Instantiate the model with 5 neighbors.
knn = KNeighborsClassifier(n_neighbors=5)
## Fit the model on the training data.
knn.fit(train2[features_names], y)

# Map each predicted class code back to its outcome name
preds = target_names[knn.predict(train2[features_names])]

# Create confusion matrix
print(pd.crosstab(df['OutcomeType'], preds, rownames=['Actual'], colnames=['Predicted']))

Predicted        Adoption  Died  Euthanasia  Return_to_owner  Transfer
Actual                                                                
Adoption             7930     0          96             1716      1027
Died                   55     8           8               18       108
Euthanasia            381     0         323              408       443
Return_to_owner      1382     0          54             2977       373
Transfer             2698    22         133             1202      5367


In [0]:
# See how the model performs on the data it was trained on.
print(accuracy_score(df['OutcomeType'], preds, normalize=True, sample_weight=None))

0.621235362340529


### Split training data

In [0]:
## Split data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(features, y, random_state=42)

## Instantiate the model with 5 neighbors.
knn = KNeighborsClassifier(n_neighbors=5)
## Fit the model on the training data.
knn.fit(X_train, y_train)

# Map each predicted class code back to its outcome name
# (predictions cover the whole dataset, not just the held-out split)
preds = target_names[knn.predict(train2[features_names])]

# Create confusion matrix
print(pd.crosstab(df['OutcomeType'], preds, rownames=['Actual'], colnames=['Predicted']))

Predicted        Adoption  Died  Euthanasia  Return_to_owner  Transfer
Actual                                                                
Adoption             7810     0         117             1752      1090
Died                   56     7          14               17       103
Euthanasia            415     1         319              409       411
Return_to_owner      1540     0          90             2770       386
Transfer             2827     8         195             1199      5193


In [0]:
# See how the model performs on the test data.
print(knn.score(X_test, y_test))

0.5297022295376328


### Discussion

Overall, this isn't a very good model. The poor result for the animals that died makes sense: there are only 197 such entries, so there aren't many neighbours to base a prediction on, and splitting the data leaves even fewer.

The gap between the accuracy on the whole data (0.621, scored on the same rows the model was fitted on) and on the held-out split (0.530) suggests that the model is overfitting.
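
One thing that might be worth trying is comparing several values of `n_neighbors` with cross-validation instead of a single split. The sketch below uses synthetic data from `make_classification` rather than the shelter data, and the candidate values of k are arbitrary assumptions, so it only illustrates the mechanics.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the shelter features (assumption: not the real data)
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_classes=3, random_state=42)

# Mean 5-fold cross-validated accuracy for a few candidate neighbour counts
scores = {}
for k in (1, 5, 15, 31):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

for k, score in scores.items():
    print(k, round(score, 3))
```

Every row is scored by a model that did not see it during fitting, so the averages are less flattering than a score on the training rows.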

# Submission Format
Animal ID then the probability for each possible outcome

In [0]:
AnimalID,Adoption,Died,Euthanasia,Return_to_owner,Transfer
A715022,1,0,0,0,0
A677429,0.5,0.3,0.2,0,0
...
etc.
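
A file in this format could be assembled roughly as follows. The probabilities are the two example rows above, entered by hand in place of real `predict_proba` output, and the variable names are assumptions. With a fitted scikit-learn classifier, the columns of `predict_proba` follow the order of `classes_`, so the header has to be matched to that order.

```python
import numpy as np
import pandas as pd

# Example IDs and probabilities taken from the sample rows above
animal_ids = ["A715022", "A677429"]
classes = ["Adoption", "Died", "Euthanasia", "Return_to_owner", "Transfer"]
probabilities = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
                          [0.5, 0.3, 0.2, 0.0, 0.0]])

# One column per outcome, with AnimalID in front
submission = pd.DataFrame(probabilities, columns=classes)
submission.insert(0, "AnimalID", animal_ids)

# Render as CSV text; pass a file name to to_csv to write it to disk instead
csv_text = submission.to_csv(index=False)
print(csv_text)
```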