# Central Park Squirrel Census Analysis (Logistic Regression)

This project uses the 2018 Central Park Squirrel Census data set. I performed logistic regression to predict whether a squirrel is an adult (1) or a juvenile (0), using pandas, NumPy, Matplotlib, and scikit-learn.

For a description of the data set and the values used, please refer to these links:

https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Squirrel-Data/vfnx-vebw

https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Hectare-Data/ej9h-v6g2

## Import and merge data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils import resample

squirrels = pd.read_csv("/Users/kelvenopoku/Downloads/2018_Central_Park_Squirrel_Census_-_Squirrel_Data.csv")
hectare = pd.read_csv("/Users/kelvenopoku/Downloads/2018_Central_Park_Squirrel_Census_-_Hectare_Data.csv")

In [2]:
squirrels.head()

Unnamed: 0,X,Y,Unique Squirrel ID,Hectare,Shift,Date,Hectare Squirrel Number,Age,Primary Fur Color,Highlight Fur Color,...,Approaches,Indifferent,Runs from,Other Interactions,Lat/Long,Zip Codes,Community Districts,Borough Boundaries,City Council Districts,Police Precincts
0,-73.956134,40.794082,37F-PM-1014-03,37F,PM,10142018,3,,,,...,False,False,False,,POINT (-73.9561344937861 40.7940823884086),,19,4,19,13
1,-73.957044,40.794851,37E-PM-1006-03,37E,PM,10062018,3,Adult,Gray,Cinnamon,...,False,False,True,me,POINT (-73.9570437717691 40.794850940803904),,19,4,19,13
2,-73.976831,40.766718,2E-AM-1010-03,02E,AM,10102018,3,Adult,Cinnamon,,...,False,True,False,,POINT (-73.9768311751004 40.76671780725581),,19,4,19,13
3,-73.975725,40.769703,5D-PM-1018-05,05D,PM,10182018,5,Juvenile,Gray,,...,False,False,True,,POINT (-73.9757249834141 40.7697032606755),,19,4,19,13
4,-73.959313,40.797533,39B-AM-1018-01,39B,AM,10182018,1,,,,...,False,False,False,,POINT (-73.9593126695714 40.797533370163),,19,4,19,13


In [3]:
hectare.head()

Unnamed: 0,Hectare,Shift,Date,Anonymized Sighter,Sighter Observed Weather Data,Litter,Litter Notes,Other Animal Sightings,Hectare Conditions,Hectare Conditions Notes,Number of sighters,Number of Squirrels,Total Time of Sighting
0,01A,AM,10072018,110.0,"70º F, Foggy",Some,,"Humans, Pigeons",Busy,,1,4,22.0
1,01A,PM,10142018,177.0,"54º F, overcast",Abundant,,"Humans, Pigeons",Busy,,1,7,26.0
2,01B,AM,10122018,11.0,"60º F, sunny",Some,,"Humans, Dogs, Pigeons, Horses",Busy,,1,17,23.0
3,01B,PM,10192018,109.0,"59.8º F, Sun, Cool",Some,,"Humans, Dogs, Pigeons, Sparrow, Blue jay",Busy,,1,10,35.0
4,01C,PM,10132018,241.0,"55° F, Partly Cloudy",,,"Humans, Dogs, Pigeons, Birds",Busy,,1,10,25.0


In [4]:
# Merging two datasets together
df=pd.merge(squirrels, hectare, how='left', on='Hectare')
df.columns

Index(['X', 'Y', 'Unique Squirrel ID', 'Hectare', 'Shift_x', 'Date_x',
       'Hectare Squirrel Number', 'Age', 'Primary Fur Color',
       'Highlight Fur Color', 'Combination of Primary and Highlight Color',
       'Color notes', 'Location', 'Above Ground Sighter Measurement',
       'Specific Location', 'Running', 'Chasing', 'Climbing', 'Eating',
       'Foraging', 'Other Activities', 'Kuks', 'Quaas', 'Moans', 'Tail flags',
       'Tail twitches', 'Approaches', 'Indifferent', 'Runs from',
       'Other Interactions', 'Lat/Long', 'Zip Codes', 'Community Districts',
       'Borough Boundaries', 'City Council Districts', 'Police Precincts',
       'Shift_y', 'Date_y', 'Anonymized Sighter',
       'Sighter Observed Weather Data', 'Litter', 'Litter Notes',
       'Other Animal Sightings', 'Hectare Conditions',
       'Hectare Conditions Notes', 'Number of sighters', 'Number of Squirrels',
       'Total Time of Sighting'],
      dtype='object')
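
One thing worth checking after this merge is the row count. The hectare preview above lists hectare 01A twice (once for the AM shift and once for the PM shift), so a left merge on `Hectare` alone can pair each squirrel with more than one hectare row. The sketch below uses small made-up tables (not census data) to show the effect, and one possible way to guard against it by merging on both `Hectare` and `Shift`. Whether `Hectare` plus `Shift` is the right key for the real data is an assumption that should be verified before relying on it.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are illustrative, not census data)
squirrels_toy = pd.DataFrame({
    "Unique Squirrel ID": ["1A-AM-1007-01", "1A-PM-1014-01"],
    "Hectare": ["01A", "01A"],
    "Shift": ["AM", "PM"],
})
hectare_toy = pd.DataFrame({
    "Hectare": ["01A", "01A"],
    "Shift": ["AM", "PM"],
    "Litter": ["Some", "Abundant"],
})

# Merging on 'Hectare' alone pairs every squirrel with both shift rows
by_hectare = pd.merge(squirrels_toy, hectare_toy, how="left", on="Hectare")

# Merging on both keys keeps one row per squirrel; validate="many_to_one"
# makes pandas raise a MergeError if the right-hand keys are not unique
by_hectare_shift = pd.merge(
    squirrels_toy, hectare_toy, how="left",
    on=["Hectare", "Shift"], validate="many_to_one",
)

print(len(squirrels_toy), len(by_hectare), len(by_hectare_shift))
```

In the notebook itself, comparing `len(df)` with `len(squirrels)` right after the merge would show whether any rows were duplicated.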

## Transform Data

Because most of the data came as strings or booleans, I converted those values to numbers. In the Age column, a value of 0 means the squirrel is a Juvenile and a value of 1 means the squirrel is an Adult.

In [5]:
#Shift: 'AM'=0, 'PM'=1
df.loc[df['Shift_x'] == 'AM', 'Shift_x'] = int(0)
df.loc[df['Shift_x'] == 'PM', 'Shift_x'] = int(1)
          
#Age: 'Juvenile'=0, 'Adult'=1
df.loc[df['Age'] == 'Juvenile', 'Age'] = int(0)
df.loc[df['Age'] == 'Adult', 'Age'] = int(1)
df.loc[df['Age'] == '?', 'Age'] = np.nan

#Primary Fur Color: 'Gray'=0, 'Cinnamon'=1, 'Black'=2 
df.loc[df['Primary Fur Color'] == 'Gray', 'Primary Fur Color'] = int(0)
df.loc[df['Primary Fur Color'] == 'Cinnamon', 'Primary Fur Color'] = int(1)
df.loc[df['Primary Fur Color'] == 'Black', 'Primary Fur Color'] = int(2)

#Location: 'Ground Plane'=0, 'Above Ground'=1
df.loc[df['Location'] == 'Ground Plane', 'Location'] = int(0)
df.loc[df['Location'] == 'Above Ground', 'Location'] = int(1)

#All columns from 'Running' to 'Runs from': False=0, True=1
cols=np.array(['Running','Chasing','Climbing','Eating','Foraging','Kuks','Quaas','Moans','Tail flags','Tail twitches','Approaches','Indifferent','Runs from'])
for x in cols:
    df.loc[df[x] == False, x] = int(0)
    df.loc[df[x] == True, x] = int(1)


#Litter 'None'=0, 'Some'=1, 'Abundant'=2
df.loc[df['Litter'] == 'None', 'Litter'] = int(0)
df.loc[df['Litter'] == 'Some', 'Litter'] = int(1)
df.loc[df['Litter'] == 'Abundant', 'Litter'] = int(2)
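
The same label-to-number encoding can also be written with one dictionary per column and `Series.map`. This is only an alternative sketch on a made-up frame, not the code that produced the outputs in this notebook. One difference to be aware of: `map` turns any label that is missing from the dictionary (such as `'?'`) into NaN, and it returns numeric columns directly instead of leaving them as `object`.

```python
import pandas as pd

# Toy frame with the same labels as the notebook (rows are illustrative)
toy = pd.DataFrame({
    "Age": ["Adult", "Juvenile", "?", None],
    "Location": ["Ground Plane", "Above Ground", None, "Ground Plane"],
    "Running": [True, False, True, False],
})

# One dictionary per column; labels missing from a dictionary become NaN
encodings = {
    "Age": {"Juvenile": 0, "Adult": 1},
    "Location": {"Ground Plane": 0, "Above Ground": 1},
}
for column, mapping in encodings.items():
    toy[column] = toy[column].map(mapping)

# Boolean columns can be cast to integers directly
toy["Running"] = toy["Running"].astype(int)

print(toy)
```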

    

## Clean the Data

To clean the data, we first drop the columns we do not need, then drop any rows that contain NaN values:

In [6]:
to_drop=['Date_x','Date_y','Combination of Primary and Highlight Color','Color notes','Above Ground Sighter Measurement','Other Activities','Other Interactions','Shift_y','Zip Codes','Community Districts','Borough Boundaries','City Council Districts','Police Precincts','Sighter Observed Weather Data','Litter Notes','Specific Location','Anonymized Sighter','Other Animal Sightings','Hectare Squirrel Number','Highlight Fur Color','Hectare Conditions Notes','Number of sighters','Hectare Conditions','Lat/Long',]
df.drop(labels=to_drop, axis=1, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6030 entries, 0 to 6029
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   X                       6030 non-null   float64
 1   Y                       6030 non-null   float64
 2   Unique Squirrel ID      6030 non-null   object 
 3   Hectare                 6030 non-null   object 
 4   Shift_x                 6030 non-null   object 
 5   Age                     5780 non-null   object 
 6   Primary Fur Color       5920 non-null   object 
 7   Location                5903 non-null   object 
 8   Running                 6030 non-null   object 
 9   Chasing                 6030 non-null   object 
 10  Climbing                6030 non-null   object 
 11  Eating                  6030 non-null   object 
 12  Foraging                6030 non-null   object 
 13  Kuks                    6030 non-null   object 
 14  Quaas                   6030 non-null   

In [7]:
df.isna().any()

X                         False
Y                         False
Unique Squirrel ID        False
Hectare                   False
Shift_x                   False
Age                        True
Primary Fur Color          True
Location                   True
Running                   False
Chasing                   False
Climbing                  False
Eating                    False
Foraging                  False
Kuks                      False
Quaas                     False
Moans                     False
Tail flags                False
Tail twitches             False
Approaches                False
Indifferent               False
Runs from                 False
Litter                     True
Number of Squirrels       False
Total Time of Sighting     True
dtype: bool

In [8]:
df.dropna(axis=0,inplace=True)
df.isna().any()

X                         False
Y                         False
Unique Squirrel ID        False
Hectare                   False
Shift_x                   False
Age                       False
Primary Fur Color         False
Location                  False
Running                   False
Chasing                   False
Climbing                  False
Eating                    False
Foraging                  False
Kuks                      False
Quaas                     False
Moans                     False
Tail flags                False
Tail twitches             False
Approaches                False
Indifferent               False
Runs from                 False
Litter                    False
Number of Squirrels       False
Total Time of Sighting    False
dtype: bool

Many of our columns have the dtype "object". This is a problem because we cannot do numerical calculations with objects. Therefore, we will use the pandas to_numeric function to convert the object values to numbers:

In [9]:
# Columns to convert ('Primary Fur Color' and 'Location' only need to be listed once)
cols=['Shift_x','Age','Primary Fur Color','Location','Running','Chasing','Climbing','Eating','Foraging','Kuks','Quaas','Moans','Tail flags','Tail twitches','Approaches','Indifferent','Runs from','Litter']
for x in cols:
    df[x]=pd.to_numeric(df[x])
df.dtypes

X                         float64
Y                         float64
Unique Squirrel ID         object
Hectare                    object
Shift_x                     int64
Age                         int64
Primary Fur Color           int64
Location                    int64
Running                     int64
Chasing                     int64
Climbing                    int64
Eating                      int64
Foraging                    int64
Kuks                        int64
Quaas                       int64
Moans                       int64
Tail flags                  int64
Tail twitches               int64
Approaches                  int64
Indifferent                 int64
Runs from                   int64
Litter                      int64
Number of Squirrels         int64
Total Time of Sighting    float64
dtype: object

Lastly, let's find out which variables correlate with Age the most:

In [10]:
df.corr().Age.sort_values()

Location                 -0.083305
Chasing                  -0.039248
Runs from                -0.033159
Climbing                 -0.030557
Eating                   -0.028444
Kuks                     -0.025838
Primary Fur Color        -0.021079
Y                        -0.015962
Tail twitches            -0.015223
Running                  -0.004587
Tail flags               -0.003651
Number of Squirrels      -0.002842
Quaas                    -0.001830
Approaches                0.001513
Moans                     0.005142
Shift_x                   0.008321
Total Time of Sighting    0.010036
X                         0.013683
Indifferent               0.041990
Litter                    0.044100
Foraging                  0.091760
Age                       1.000000
Name: Age, dtype: float64

It seems that Foraging (for food), Location (above ground/ground plane), Litter (amount of litter), Indifferent (toward humans), Chasing (another squirrel), and Runs from (humans) have the strongest correlations with Age, although all of them are weak (below 0.1 in absolute value).
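
Since both strongly negative and strongly positive values matter here, it can be easier to rank the correlations by absolute value. The sketch below does that on a small made-up frame (the numbers are not census data). It also restricts the frame to numeric columns first with `select_dtypes`, because newer pandas versions may reject non-numeric columns such as `Unique Squirrel ID` when computing correlations.

```python
import pandas as pd

# Toy frame (values are illustrative, not census data)
toy = pd.DataFrame({
    "Unique Squirrel ID": ["a", "b", "c", "d", "e", "f"],  # non-numeric, excluded below
    "Age": [1, 1, 0, 1, 0, 1],
    "Foraging": [1, 1, 0, 1, 0, 0],
    "Location": [0, 1, 1, 0, 1, 0],
})

# Keep only numeric columns, then take the correlations with Age
numeric = toy.select_dtypes(include="number")
age_corr = numeric.corr()["Age"].drop("Age")

# Rank by absolute value so strong negative correlations are not buried
order = age_corr.abs().sort_values(ascending=False).index
ranked = age_corr.reindex(order)

print(ranked)
```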

## Imbalanced Classification
Let us examine the distribution of 0's and 1's.

In [11]:
ones = df[df.Age==1] #majority
zero = df[df.Age==0] #minority

print("Values equal to 0: " +str(len(zero)))
print("Values equal to 1: " +str(len(ones)))

Values equal to 0: 553
Values equal to 1: 4306


This is a case of imbalanced classification, where one class has many more observations than the other. Because we have so many more Adults than Juveniles, a model trained on this data could score well by predicting Adult almost every time while rarely identifying Juveniles. Let's fix this by upsampling the minority class, which is the Juveniles in this case.

In [12]:
#Upsample minority class
df_minority =resample(zero, replace=True, n_samples=4306, random_state=123)

#Combine majority class with upsampled minority class
df_balanced=pd.concat([ones,df_minority])

#Display new class counts
df_balanced.Age.value_counts()

1    4306
0    4306
Name: Age, dtype: int64

As you can see, the ratio of the two classes is now 1:1. Let's now perform logistic regression with our balanced dataset.
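
Upsampling is not the only way to handle the imbalance. As an alternative sketch (not the approach used in this notebook), scikit-learn's `LogisticRegression` accepts `class_weight="balanced"`, which reweights the classes during fitting instead of duplicating rows. The data below is synthetic and only meant to show the call; the class sizes and feature values are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: 900 samples of class 1, 100 of class 0
rng = np.random.default_rng(0)
n_major, n_minor = 900, 100
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_major, 2)),
    rng.normal(1.0, 1.0, size=(n_minor, 2)),
])
y = np.array([1] * n_major + [0] * n_minor)

# class_weight="balanced" weights each class inversely to its frequency
weighted = LogisticRegression(class_weight="balanced", max_iter=1000)
weighted.fit(X, y)
pred = weighted.predict(X)

# Number of predictions per class (index 0 = class 0, index 1 = class 1)
print(np.bincount(pred, minlength=2))
```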

## Logistic Regression

To begin our logistic regression model, we must define our independent and dependent variables:

In [13]:
x=df_balanced[["Location","Chasing","Runs from","Climbing","Eating","Kuks","Primary Fur Color","Tail twitches","Running","Tail flags","Number of Squirrels","Quaas","Approaches","Moans","Shift_x","Total Time of Sighting","Indifferent","Litter","Foraging"]]
y=df_balanced["Age"]

In [14]:
#Splitting data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)

### Create and Train Model

In [15]:
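model = LogisticRegression(max_iter=135)
# Note: the model is fitted on the full balanced set (x, y), not on x_train/y_train
model.fit(x,y)
# Predictions are made for the same rows the model was fitted on
y_pred = model.predict(x)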

### Evaluate Model with Confusion Matrix and Classification Report

In [16]:
confusion_matrix(y, y_pred)

array([[2473, 1833],
       [1711, 2595]])

In [17]:
import warnings
warnings.filterwarnings('always')

print(classification_report(y, y_pred))

              precision    recall  f1-score   support

           0       0.59      0.57      0.58      4306
           1       0.59      0.60      0.59      4306

    accuracy                           0.59      8612
   macro avg       0.59      0.59      0.59      8612
weighted avg       0.59      0.59      0.59      8612



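The model reaches an accuracy of 59% on the balanced data, which is only modestly better than the 50% expected from guessing between two equally sized classes. Because the predictions were made on the same rows used to fit the model, this figure likely overstates how well the model would do on unseen squirrels.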
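
For a score that reflects unseen data, one option is to split first, upsample only the training portion, and evaluate on the untouched test portion. Splitting before upsampling matters because the upsampled set contains many copies of each Juvenile row, and those copies could otherwise land in both the training and test sets. The sketch below shows that order of operations on a made-up frame; the column values, the 85% adult share, and the random seeds are assumptions for illustration and say nothing about how the real data would score.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Made-up data: two binary features and an imbalanced Age label
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "Foraging": rng.integers(0, 2, size=n),
    "Location": rng.integers(0, 2, size=n),
})
toy["Age"] = (rng.random(n) < 0.85).astype(int)  # roughly 85% adults

# 1. Split first, keeping the class ratio the same in both parts
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["Age"], random_state=0
)

# 2. Upsample the minority class in the training part only
majority = train[train["Age"] == 1]
minority = train[train["Age"] == 0]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
train_bal = pd.concat([majority, minority_up])

# 3. Fit on the balanced training data, evaluate on the untouched test data
features = ["Foraging", "Location"]
clf = LogisticRegression(max_iter=1000)
clf.fit(train_bal[features], train_bal["Age"])
test_pred = clf.predict(test[features])

print(classification_report(test["Age"], test_pred, zero_division=0))
```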


## Predict the Age of a Squirrel
As a bonus, I'm going to predict the age of a squirrel I saw outside my house.

In [18]:
attributes=[[0,0,1,0,0,0,1,0,1,0,1,0,0,0,0,2,1,0,0]]
if(model.predict(attributes)[0] == 0):
    print("The squirrel you saw was a Juvenile!")
else:
    print("The squirrel you saw was an Adult!")


The squirrel you saw was a Juvenile!
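
The positional list above only works if every value sits in the same slot as the matching column in `x`. A slightly safer way to build the same input, sketched below, is to name the non-zero attributes and let pandas fill in the rest in the right order. The `sighting` dictionary and `observation` variable are names introduced here for illustration; the column order is copied from the cell that defines `x`.

```python
import pandas as pd

# Feature order copied from the definition of x in the notebook
feature_names = [
    "Location", "Chasing", "Runs from", "Climbing", "Eating", "Kuks",
    "Primary Fur Color", "Tail twitches", "Running", "Tail flags",
    "Number of Squirrels", "Quaas", "Approaches", "Moans", "Shift_x",
    "Total Time of Sighting", "Indifferent", "Litter", "Foraging",
]

# Only the non-zero attributes of the sighting need to be written out
sighting = {
    "Runs from": 1,
    "Primary Fur Color": 1,       # 1 = Cinnamon
    "Running": 1,
    "Number of Squirrels": 1,
    "Total Time of Sighting": 2,
    "Indifferent": 1,
}

# One-row DataFrame with every feature present, defaulting to 0
row = {name: sighting.get(name, 0) for name in feature_names}
observation = pd.DataFrame([row], columns=feature_names)

print(observation.iloc[0].tolist())
```

Passing `observation` to `model.predict` in place of `attributes` should give the model the same values as the positional list, with the column names attached.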


## Conclusion

This was a very interesting project. I got to explore binary classification with logistic regression and learned how to deal with issues such as imbalanced classes.