# Assignment 5

### Questions:

1. What types of variables are there in the dataset?  
1. What do their distributions look like?  
1. Do you have missing values (do not fix this)?  
1. Are there any typos (not just misspellings but other things that just don't seem right)?
1. Is there any formatting that causes Python to think a number is a string?
1. Do you observe outliers? 
1. Are the outliers really outliers, or maybe typos?
1. How do the different pairs of features correlate with one another?  
1. Do these correlations make sense?  
1. What is the relationship between the features and the target?
1. Do any features exhibit skew?
1. What do you know now that will inform the modeling strategy?

In [1]:
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
snails = pd.read_excel('snail_size.xlsx', sheet_name='snail_size')

In [3]:
snails.head()

| | gender | length | diameter | height | full_weight | no_shell_weight | core_weight | shell_weight | age |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | NaN | 10 |
| 4 | Infant | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |


### 1) What types of variables are there in the dataset?

In [4]:
snails.dtypes

gender              object
length             float64
diameter           float64
height             float64
full_weight        float64
no_shell_weight    float64
core_weight        float64
shell_weight       float64
age                  int64
dtype: object

There are 3 variable types across the 9 columns: `object` (`str`) for `gender`, `float64` for the seven measurement columns, and `int64` for `age`.
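As a rough sketch of how the same summary could be produced programmatically, the snippet below counts columns per dtype and splits them into numeric and non-numeric groups. The `toy` frame is a hypothetical three-column stand-in for `snails`, not the real data.

```python
import pandas as pd

# Hypothetical toy frame standing in for `snails` (values are illustrative)
toy = pd.DataFrame({
    "gender": ["M", "F", "Infant"],
    "length": [0.455, 0.35, 0.53],
    "age": [15, 7, 9],
})

# Count how many columns share each dtype
dtype_counts = toy.dtypes.astype(str).value_counts()

# Split the column names into numeric and non-numeric groups
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(dtype_counts)
print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```

Running the same three statements on `snails` instead of `toy` should group `gender` on its own and the remaining eight columns as numeric.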

### 2) What do their distributions look like?

In [6]:
snails.hist()

array([[<Axes: title={'center': 'length'}>,
        <Axes: title={'center': 'diameter'}>,
        <Axes: title={'center': 'height'}>],
       [<Axes: title={'center': 'full_weight'}>,
        <Axes: title={'center': 'no_shell_weight'}>,
        <Axes: title={'center': 'core_weight'}>],
       [<Axes: title={'center': 'shell_weight'}>,
        <Axes: title={'center': 'age'}>, <Axes: >]], dtype=object)

The `gender` column is categorical, so `hist()` does not plot it. The numeric columns are distributed as follows:
* length: left skew / negative
* diameter: left skew / negative
* height: right skew / positive
* full_weight: right skew / positive
* no_shell_weight: right skew / positive
* core_weight: right skew / positive
* shell_weight: right skew / positive
* age: right skew / positive, but also
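The visual reading above could be cross-checked numerically with `DataFrame.skew()`, where a negative value indicates a left tail and a positive value a right tail. This is a minimal sketch on hypothetical toy columns, not on the snail data itself.

```python
import pandas as pd

# Hypothetical toy columns: one with a long left tail, one with a long right tail
toy = pd.DataFrame({
    "left_tailed": [1, 8, 9, 9, 10, 10, 10],
    "right_tailed": [1, 1, 1, 2, 2, 3, 10],
})

# Sample skewness per column: negative = left skew, positive = right skew
skewness = toy.skew()
print(skewness)
```

On the real data, `snails.skew(numeric_only=True)` would give one number per numeric column to compare against the histogram impressions.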

### 3) Do you have missing values (do not fix this)?

In [7]:
snails.describe()

| | length | diameter | height | full_weight | no_shell_weight | core_weight | shell_weight | age |
|---|---|---|---|---|---|---|---|---|
| count | 4163.0 | 4163.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4147.0 | 4177.0 |
| mean | 0.524042 | 0.407871 | 0.219368 | 0.828742 | 0.359367 | 0.180594 | 0.239078 | 9.933684 |
| std | 0.1201 | 0.099266 | 3.506068 | 0.490389 | 0.221963 | 0.109614 | 0.139089 | 3.224169 |
| min | 0.075 | 0.055 | 0.0 | 0.002 | 0.001 | 0.0005 | 0.0015 | 1.0 |
| 25% | 0.45 | 0.35 | 0.115 | 0.4415 | 0.186 | 0.0935 | 0.13 | 8.0 |
| 50% | 0.545 | 0.425 | 0.14 | 0.7995 | 0.336 | 0.171 | 0.235 | 9.0 |
| 75% | 0.615 | 0.48 | 0.165 | 1.153 | 0.502 | 0.253 | 0.32975 | 11.0 |
| max | 0.815 | 0.65 | 165.0 | 2.8255 | 1.488 | 0.76 | 1.005 | 29.0 |


In [8]:
snails.count()

gender             4177
length             4163
diameter           4163
height             4177
full_weight        4177
no_shell_weight    4177
core_weight        4177
shell_weight       4147
age                4177
dtype: int64

We do have missing values. `count()` reports non-null entries only: most columns have 4,177, but `length` and `diameter` have 4,163 each (14 missing) and `shell_weight` has 4,147 (30 missing).
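A more direct way to get the number of missing entries per column is `isna().sum()`. The sketch below uses a hypothetical toy frame and also shows that the result matches the "row count minus `count()`" arithmetic used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps (values are illustrative)
toy = pd.DataFrame({
    "length": [0.455, np.nan, 0.53, 0.44],
    "shell_weight": [0.15, 0.07, np.nan, np.nan],
    "age": [15, 7, 9, 10],
})

# Number of missing entries per column
missing = toy.isna().sum()

# Same result derived from the non-null counts
missing_from_count = len(toy) - toy.count()

print(missing)
```

Nothing is filled in or dropped here; the values are only counted.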

### 4) Are there any typos (not just misspellings but other things that just don't seem right)?

In [9]:
snails['gender'].unique()

array(['M', 'F', 'Infant', 'Instant'], dtype=object)

Yes. In the categorical column `gender`, the values `M` and `F` make sense, since they stand for Male and Female. However, `Infant` and `Instant` do not correspond to any gender or sex and instead point to an age-related or unknown category. `Instant` also looks like a misspelling of `Infant`.
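Counting how often each label occurs can help separate a real category from a typo, since a misspelling usually appears only a handful of times. The sketch below uses a hypothetical toy series, and the `expected` set is an assumption made for illustration (the dataset's documentation would be the real authority on valid labels).

```python
import pandas as pd

# Hypothetical toy labels standing in for snails['gender']
gender = pd.Series(["M", "F", "Infant", "M", "Instant", "Infant", "F", "M"])

# Frequency of each label; rare labels are candidates for typos
counts = gender.value_counts()

# ASSUMPTION: these are the intended labels
expected = {"M", "F", "Infant"}

# Labels present in the data but not in the expected set
unexpected = sorted(set(gender.unique()) - expected)

print(counts)
print("unexpected labels:", unexpected)
```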

### 5) Is there any formatting that causes Python to think a number is a string?

Currently, no. All of the numerical columns have dtype `float64` or `int64`, so pandas has interpreted every value in them as a number:

In [10]:
snails.dtypes

gender              object
length             float64
diameter           float64
height             float64
full_weight        float64
no_shell_weight    float64
core_weight        float64
shell_weight       float64
age                  int64
dtype: object
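If a numeric column had been read as text, one way to spot it would be to try converting each non-numeric column with `pd.to_numeric(..., errors="coerce")` and measure how many entries survive. This is only a sketch on a hypothetical toy frame in which two columns were deliberately stored as strings; the helper name `share_numeric` is made up for this example.

```python
import pandas as pd

# Hypothetical toy frame: 'length' and 'height' hold numbers stored as text
toy = pd.DataFrame({
    "gender": ["M", "F", "F"],
    "length": ["0.455", "0.35", "0.53"],
    "height": ["0.095", "0,09", "0.135"],  # one entry uses a comma
})

def share_numeric(col):
    """Fraction of entries in `col` that parse as a number."""
    return pd.to_numeric(col, errors="coerce").notna().mean()

# Only non-numeric columns need checking
text_cols = toy.select_dtypes(exclude="number")
numeric_share = text_cols.apply(share_numeric)

print(numeric_share)
```

A share of 1.0 suggests a numeric column with the wrong dtype, a share between 0 and 1 suggests a few badly formatted entries, and a share of 0 suggests a genuinely categorical column.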

### 6) Do you observe outliers?

In [11]:
snails.describe()

| | length | diameter | height | full_weight | no_shell_weight | core_weight | shell_weight | age |
|---|---|---|---|---|---|---|---|---|
| count | 4163.0 | 4163.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4147.0 | 4177.0 |
| mean | 0.524042 | 0.407871 | 0.219368 | 0.828742 | 0.359367 | 0.180594 | 0.239078 | 9.933684 |
| std | 0.1201 | 0.099266 | 3.506068 | 0.490389 | 0.221963 | 0.109614 | 0.139089 | 3.224169 |
| min | 0.075 | 0.055 | 0.0 | 0.002 | 0.001 | 0.0005 | 0.0015 | 1.0 |
| 25% | 0.45 | 0.35 | 0.115 | 0.4415 | 0.186 | 0.0935 | 0.13 | 8.0 |
| 50% | 0.545 | 0.425 | 0.14 | 0.7995 | 0.336 | 0.171 | 0.235 | 9.0 |
| 75% | 0.615 | 0.48 | 0.165 | 1.153 | 0.502 | 0.253 | 0.32975 | 11.0 |
| max | 0.815 | 0.65 | 165.0 | 2.8255 | 1.488 | 0.76 | 1.005 | 29.0 |


Yes. For `height`, the mean is 0.219368 but the max is 165.0, roughly 47 standard deviations above the mean. The `age` column has a max of 29.0 against a mean of 9.933684, about 6 standard deviations above. Both columns therefore appear to contain outliers.
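Comparing the max to the mean is a quick eyeball test. A common, more systematic alternative is the 1.5 × IQR rule, sketched below on a hypothetical toy series (the rule and its 1.5 multiplier are a convention chosen here, not something prescribed by the assignment).

```python
import pandas as pd

# Hypothetical toy heights with one extreme entry
height = pd.Series([0.095, 0.09, 0.135, 0.125, 0.08, 0.14, 0.165, 165.0])

# Interquartile range and the conventional 1.5 * IQR fences
q1, q3 = height.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers = height[(height < lower) | (height > upper)]
print(outliers)
```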

### 7) Are the outliers really outliers, or maybe typos?

The outliers could be typos. For `height`, the 75th percentile is 0.165 while the max is 165.0, so this looks like a value that was entered with the decimal point missing or misplaced.
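One rough way to test the misplaced-decimal hypothesis is to check whether the suspect value lands in the normal range after shifting it by a whole number of decimal places. The sketch below does this on a hypothetical toy series; using the median as the "typical" value is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy heights with one suspect entry
height = pd.Series([0.095, 0.09, 0.135, 0.125, 0.08, 0.14, 0.165, 165.0])

median = height.median()
suspect = height.max()

# Number of decimal places separating the suspect value from a typical one
shift = int(round(np.log10(suspect / median)))

# The value the entry would have had if the decimal point had been misplaced
rescaled = suspect / 10 ** shift

print(f"shift by {shift} places -> {rescaled}")
```

If the rescaled value sits comfortably among the other observations, a data-entry error becomes a plausible explanation, although it does not prove one.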

### 8) How do the different pairs of features correlate with one another?

In [12]:
snails.loc[:, snails.columns != 'gender'].corr()

| | length | diameter | height | full_weight | no_shell_weight | core_weight | shell_weight | age |
|---|---|---|---|---|---|---|---|---|
| length | 1.0 | 0.986836 | 0.031092 | 0.925109 | 0.897713 | 0.902941 | 0.898193 | 0.556459 |
| diameter | 0.986836 | 1.0 | 0.023029 | 0.9255 | 0.893227 | 0.899629 | 0.905727 | 0.574515 |
| height | 0.031092 | 0.023029 | 1.0 | 0.021602 | 0.019173 | 0.026642 | 0.022889 | 0.008717 |
| full_weight | 0.925109 | 0.9255 | 0.021602 | 1.0 | 0.969405 | 0.966375 | 0.955526 | 0.54039 |
| no_shell_weight | 0.897713 | 0.893227 | 0.019173 | 0.969405 | 1.0 | 0.931961 | 0.883249 | 0.420884 |
| core_weight | 0.902941 | 0.899629 | 0.026642 | 0.966375 | 0.931961 | 1.0 | 0.908726 | 0.503819 |
| shell_weight | 0.898193 | 0.905727 | 0.022889 | 0.955526 | 0.883249 | 0.908726 | 1.0 | 0.627703 |
| age | 0.556459 | 0.574515 | 0.008717 | 0.54039 | 0.420884 | 0.503819 | 0.627703 | 1.0 |


As the output above shows, no pair of columns is negatively correlated, and some pairs are very highly correlated. For example, `diameter` and `length` correlate at 0.986836, and `full_weight` and `no_shell_weight` at 0.969405, meaning that as one value goes up, the other very likely goes up as well.
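Reading the strongest pairs off an 8 × 8 matrix by eye is error-prone, so it can help to flatten the matrix into a ranked list. The following is a sketch on a hypothetical three-column toy frame; it keeps only the upper triangle so that each pair appears once and the diagonal of 1.0 values is excluded.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the numeric snail columns
toy = pd.DataFrame({
    "length": [0.455, 0.35, 0.53, 0.44, 0.33],
    "diameter": [0.365, 0.265, 0.42, 0.365, 0.255],
    "age": [15, 7, 9, 10, 7],
})

corr = toy.corr()

# Boolean mask selecting the strict upper triangle (no diagonal, no duplicates)
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One row per column pair, strongest correlation first
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs)
```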

### 9) Do these correlations make sense?

Some of these do, yes. For example, since snails are almost spherical, it stands to reason that length and diameter would be closely correlated. Likewise, full weight and no-shell weight would be correlated, since most of the weight of the animal is in its body.

### 10) What is the relationship between the features and the target?

The answer to this question depends on what the target is. If we wish to predict the full weight of a snail from its no-shell weight, the relationship is strongly positive (0.969405). However, if we wanted to predict the shell weight from the height, the data as loaded shows almost no linear relationship, since the correlation is 0.022889, which is close to zero. That near-zero value may be distorted by the 165.0 `height` outlier noted above, so it should be rechecked once that value is resolved.
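To illustrate how much a single extreme entry can pull a Pearson correlation toward zero, the sketch below compares the coefficient with and without one corrupted height. The toy values and the `height < 1.0` cut-off are hypothetical choices for this example, not a recommended cleaning rule for the real data.

```python
import pandas as pd

# Hypothetical toy frame: heights rise with shell weight, except one corrupted entry
toy = pd.DataFrame({
    "height": [0.08, 0.09, 0.095, 0.125, 0.135, 0.14, 165.0, 0.18],
    "shell_weight": [0.055, 0.07, 0.15, 0.155, 0.21, 0.22, 0.23, 0.33],
})

# Pearson correlation on the data as loaded
pearson = toy["height"].corr(toy["shell_weight"])

# Same correlation after excluding the implausible height (ASSUMED threshold)
cleaned = toy[toy["height"] < 1.0]
pearson_cleaned = cleaned["height"].corr(cleaned["shell_weight"])

print(f"with outlier:    {pearson:.3f}")
print(f"without outlier: {pearson_cleaned:.3f}")
```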

### 11) Do any features exhibit skew?

In [13]:
snails.plot.scatter(x='age', y='diameter')

<Axes: xlabel='age', ylabel='diameter'>

Yes. For example, when comparing `age` and `diameter`, we see a positive, right-skewed relationship between the two columns. We see a similar one for `age` and `shell_weight`:

In [14]:
snails.plot.scatter(x='age', y='shell_weight')

<Axes: xlabel='age', ylabel='shell_weight'>

### 12) What do you know now that will inform the modeling strategy?

I know that correlation can play a big part in designing a modeling strategy. However, correlation does not prove causation, so even when two columns are strongly correlated with each other, this does not necessarily mean that one feature can confidently predict the future value of a particular target.