# Exploring a new dataset
- For each new dataset you encounter, you'll need to understand what state it is in
- Exploration and cleaning will have to occur with all datasets
- The purpose of this notebook is to function as an exploration checklist

## Import libraries

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

## Read in dataset
- In this dataset, `target` is the plant type, encoded as an integer from 0 to 2
- pandas lets you specify datatypes explicitly when reading the file

In [None]:
df = pd.read_csv('../data/iris.csv')
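As a rough illustration of specifying datatypes at read time, the sketch below passes a `dtype` mapping to `pd.read_csv`. The inline CSV text and the name `df_typed` are made up for the example so that it runs without `../data/iris.csv`; the column names are assumed to match the real file.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for ../data/iris.csv
csv_text = """sepal length (cm),sepal width (cm),target
5.1,3.5,0
4.9,3.0,1
6.3,3.3,2
"""

# dtype maps column names to the type pandas should use while parsing
df_typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"sepal length (cm)": "float64", "target": "category"},
)
print(df_typed.dtypes)
```

Setting types up front means a malformed value raises an error at load time instead of silently turning the column into strings.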

## Examine the data
- Need to ensure the datatypes are correct for each column
- Presence of unexpected datatypes may indicate the presence of errors

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.describe()
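One possible way to spot unexpected datatypes is to list the columns pandas did not parse as numeric. This is only a sketch: `df_demo` and its stray `"?"` value are invented for illustration.

```python
import pandas as pd

# Made-up frame: the "?" forces the whole column to be read as text
df_demo = pd.DataFrame({
    "sepal width (cm)": ["3.5", "3.0", "?"],
    "target": [0, 1, 2],
})

# Columns that should be numeric but are not deserve a closer look
non_numeric = [
    col for col in df_demo.columns
    if not pd.api.types.is_numeric_dtype(df_demo[col])
]
print(non_numeric)
```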

## Convert datatypes
- sepal length was already a float but I'm converting it as an example
- `errors='coerce'` replaces values that cannot be parsed as numbers with `NaN` (not a number)

In [None]:
df['sepal length (cm)'] = pd.to_numeric(df['sepal length (cm)'], errors='coerce')
df['target'] = df['target'].astype('category')
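To see what `errors='coerce'` actually does, the sketch below converts a small made-up Series and then lists the raw values that were turned into `NaN`. The names `raw`, `converted` and `coerced` are illustrative only.

```python
import pandas as pd

# Made-up raw values; "n/a" and "five" stand in for bad entries
raw = pd.Series(["5.1", "4.9", "n/a", "five", "6.3"], name="sepal length (cm)")
converted = pd.to_numeric(raw, errors="coerce")

# Values that were present in the raw data but became NaN after coercion
coerced = raw[converted.isna() & raw.notna()]
print(coerced.tolist())
```

Inspecting the coerced values before discarding them helps distinguish typos from genuinely missing data.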

## Examine categorical variables

In [None]:
df['target'].value_counts()

In [None]:
df['target'].value_counts().plot(kind='barh', color="blue", alpha=.65)
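Raw counts can hide how unbalanced the classes are. A small sketch, using an invented `target` Series, shows class shares with `normalize=True`:

```python
import pandas as pd

# Made-up labels: class 2 is a minority class
target = pd.Series([0, 0, 0, 0, 1, 1, 1, 2], name="target")

# normalize=True returns the fraction of rows in each class
shares = target.value_counts(normalize=True)
print(shares)
```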

## Visualise relationships
- Look at the distribution of the data
- Look for outliers or minority classes

In [None]:
sns.pairplot(data=df) #, hue='target'

In [None]:
df.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()
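Box plots conventionally draw points beyond 1.5 times the interquartile range as outliers. Assuming that rule, the sketch below flags the same kind of points numerically. The values are made up and the 1.5 multiplier is a convention, not a property of this dataset.

```python
import pandas as pd

# Made-up measurements with one obviously extreme value
values = pd.Series([4.9, 5.0, 5.1, 5.2, 5.3, 9.0])

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the whisker range
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```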

In [None]:
df.corr()

## Missing values
### Finding them
- Check for missing values: pandas marks them as `NaN`
- Be careful: some datasets do not leave a blank where a value is missing
- Instead, missing values may be recorded as placeholders, such as:
    - 0s
    - Spaces
- You can either replace them or remove them
- Consider whether values are missing at random or whether there is a systematic bias in the dataset
- Removing data with missing values may throw away too much information

In [None]:
# Examine every column: True means the column has no missing values
pd.notnull(df).all()
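A rough way to check whether missingness is random or systematic is to compare the missing rate across classes. The frame below is invented; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up data: petal width is missing only in classes 0 and 1
df_demo = pd.DataFrame({
    "petal width (cm)": [0.2, np.nan, 1.3, np.nan, 2.1, 1.8],
    "target": [0, 0, 1, 1, 2, 2],
})

# Number of missing values per column
missing_counts = df_demo.isna().sum()

# Fraction of missing petal widths within each class
missing_rate_by_class = (
    df_demo["petal width (cm)"].isna().groupby(df_demo["target"]).mean()
)
print(missing_rate_by_class)
```

A rate that differs sharply between classes suggests the gaps are not random, so dropping those rows could bias the result.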

In [None]:
# Examine rows with missing values
df[df['target'].isnull()]
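Placeholders such as 0s or spaces are not detected by `isnull()` until they are converted to `NaN`. A minimal sketch, assuming 0 is not a plausible measurement and using a hypothetical `note` column:

```python
import numpy as np
import pandas as pd

# Made-up frame with disguised missing values
df_demo = pd.DataFrame({
    "sepal length (cm)": [5.1, 0.0, 6.3, 5.8],
    "note": ["a", " ", "c", ""],
})

# Treat 0 as missing only where 0 cannot be a real measurement
df_demo["sepal length (cm)"] = df_demo["sepal length (cm)"].replace(0, np.nan)

# Treat empty or whitespace-only strings as missing
df_demo["note"] = df_demo["note"].replace(r"^\s*$", np.nan, regex=True)

missing_per_column = df_demo.isna().sum()
print(missing_per_column)
```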

### Replacing missing values
- The simplest methods replace missing values with:
    - 0s
    - mean/median
    - min
- You can also use more advanced techniques to infer the missing values

In [None]:
df = df.fillna(0)
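Filling with 0 can distort measurements that are never zero. As an alternative sketch, the example below fills each column with its own median; the data is made up and `df_filled` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Made-up frame with one gap per column
df_demo = pd.DataFrame({
    "petal length (cm)": [1.0, np.nan, 3.0, 8.0],
    "petal width (cm)": [0.2, 0.4, np.nan, 0.6],
})

# One median per column; fillna matches them to columns by label
medians = df_demo.median(numeric_only=True)
df_filled = df_demo.fillna(medians)
print(df_filled)
```

The median is less sensitive to extreme values than the mean, which matters for a column like the first one here.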

### Dropping missing values
- Can be done at the column or row level
    - Column: axis=1
    - row: axis=0
- You can choose how strict you want to be when filtering out missing values

In [None]:
# Columns: drop if all values are missing (how='all') or if any value is missing (how='any')
df = df.dropna(axis=1, how='all')
df = df.dropna(axis=1, how='any')

# Rows: drop if any value is missing, or keep only rows with at least 2 non-missing values (thresh=2)
df = df.dropna(axis=0, how='any')
df.dropna(thresh=2)
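The three levels of strictness are easier to compare side by side. This sketch applies each to the same made-up frame, where the rows have 3, 2, 1 and 0 non-missing values respectively:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan, np.nan],
    "c": [3.0, 6.0, 9.0, np.nan],
})

rows_any = df_demo.dropna(axis=0, how="any")  # strictest: only complete rows
rows_thresh = df_demo.dropna(thresh=2)        # rows with at least 2 non-missing values
rows_all = df_demo.dropna(axis=0, how="all")  # most lenient: drop fully empty rows only

print(len(rows_any), len(rows_thresh), len(rows_all))
```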

### Eliminating duplicate rows

In [None]:
# keep defaults to 'first'; other options are 'last' and False
df[df.duplicated(keep='first')]

In [None]:
# don't do this if there's a reason there may be duplicates!
df = df.drop_duplicates()
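The `keep` argument controls which copies are flagged as duplicates. A small sketch on an invented frame, where rows 0, 1 and 3 are identical:

```python
import pandas as pd

df_demo = pd.DataFrame({"x": [1, 1, 2, 1], "y": ["a", "a", "b", "a"]})

first = df_demo.duplicated(keep="first")  # flags every copy except the first
last = df_demo.duplicated(keep="last")    # flags every copy except the last
every = df_demo.duplicated(keep=False)    # flags all copies

deduped = df_demo.drop_duplicates()       # keeps the first copy of each row
print(deduped)
```

Using `keep=False` is handy for reviewing all copies together before deciding whether they are genuine repeats.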

## Visualise relationships and categorical variables

In [None]:
sns.lmplot(x="sepal length (cm)", y="sepal width (cm)", hue="target", data=df, fit_reg=True)
sns.lmplot(x="petal length (cm)", y="petal width (cm)", hue="target", data=df, fit_reg=True)