This worksheet is part of the [mlvu machine learning course](https://mlvu.github.io)<br>
Setting up your environment: https://bit.ly/3bzpn5C

For this worksheet, we'll need to install the pandas package. Run the cell below, or run ```pip install pandas``` in the terminal/command-line/command prompt.

In [2]:
!pip install pandas


# Worksheet 3: Pandas

Pandas is a Python package for data analysis. Where numpy is built around the data structure of a <em>matrix</em>, pandas is built around the data structure of a _dataframe_. A dataframe is a lot like a matrix, with some key differences:

* In a dataframe, the columns have header names.
* Different datatypes (int, string, boolean) are allowed within the same dataframe. Each column has its own datatype.

In short, dataframes represent datasets of the kind we've seen in the lectures: an instance per row, and a feature per column. 
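To make these two differences concrete, here is a small sketch that builds a toy dataframe by hand. The column names and values are invented for illustration and are not taken from any real dataset.

```python
import pandas as pd

# A toy dataframe: one instance per row, one feature per column.
# All names and values here are made up for illustration.
toy = pd.DataFrame({
    'stature': [1650, 1720, 1810],     # integers
    'name': ['a', 'b', 'c'],           # strings
    'veteran': [True, False, True],    # booleans
})

print(toy.columns.tolist())  # the header names
print(toy.dtypes)            # one datatype per column
```

Note that the datatype belongs to the column as a whole, not to the individual cells.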

Pandas is designed to work together with numpy and matplotlib. Let's import all of them:

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

To explore Pandas we'll import the ANSUR II dataset (we also used this to create the examples in the first lecture). ANSUR II is an _anthropometric_ dataset: it contains body measurements. ANSUR II contains 108 measurements for about 4000 men and about 2000 women (all US soldiers).

We'll start by reading the data. Like numpy, pandas has a function for reading CSV files. Pandas' function is much more robust, and much less likely to give you trouble. It does come with [a lot of options](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html), so you may have to try a few things before you get it to read your data accurately.
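As a rough illustration of what "trying a few options" can look like, the sketch below reads a tiny in-memory CSV that uses semicolons as separators and commas as decimal marks. The data is invented; ```sep``` and ```decimal``` are two of the many ```read_csv``` parameters, and which ones you need depends entirely on your file.

```python
import io
import pandas as pd

# A small in-memory "file" with semicolon separators and decimal commas.
# The contents are made up for illustration.
raw = io.StringIO("id;height;weight\n1;1,65;60,5\n2;1,80;82,0\n")

# Without sep and decimal, pandas would read this as one column of strings.
df = pd.read_csv(raw, sep=';', decimal=',')

print(df.dtypes)
print(df)
```

Checking ```df.dtypes``` after reading is a quick way to see whether the numeric columns were actually parsed as numbers.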

ANSUR II comes in two separate tables: one for male soldiers, and one for female soldiers:

In [4]:
female = pd.read_csv('./ansur/ANSUR II FEMALE Public.csv')
male = pd.read_csv('./ansur/ANSUR II MALE Public.csv')


All the warnings from the first two social impact videos, and from the fourth lecture, apply here. This is not a sample that is representative of the population, and the gender class is a sensitive attribute that is poorly captured by two classes.

```male``` and ```female``` are pandas DataFrame objects. Jupyter notebooks will print these as tables:

In [None]:
male

These are big dataframes. Pandas can give us some quick summary statistics per column very easily.

In [None]:
male.describe()


In pandas these columns are called <em>Series</em>, and a dataframe is basically a list of Series objects with the same length (indexed in various ways for efficient access).
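A minimal sketch of the relation between a dataframe and its Series, using a toy dataframe with invented values:

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({'stature': [1650, 1720, 1810],
                    'span': [1640, 1750, 1850]})

col = toy['stature']           # selecting a single column gives a Series

print(type(col).__name__)      # the class name of the selected column
print(col.name)                # a Series remembers its column name
print(len(col) == len(toy))    # each column has one entry per row
```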

Let's have a look at all the available measurements.

In [None]:
for i, col in enumerate(female.columns):
    print(i, col)

Each column whose name starts with a lowercase letter represents a physical measurement. Some of these are quite technical (like <em>bizygomaticbreadth</em>). The dataset comes with a very helpful document that shows what each measurement means (and how it should be performed). It's [included with the worksheets](./ansur/Hotzman_2011_ANSURIII_Measurements_a548497.pdf). Scroll down to section 6.4 for the description of the measurements.

Once the dataframe is loaded, you can refer to the columns by name as python objects. For instance:

In [None]:
male.stature
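Attribute access is a shorthand for the more general bracket notation. The sketch below, on a toy dataframe with invented column names, shows that the two give the same Series, and that only the bracket form works when a column name is not a valid Python identifier (here, one containing a space).

```python
import pandas as pd

# Toy data; 'arm span' is a made-up column name containing a space
toy = pd.DataFrame({'stature': [1650, 1720], 'arm span': [1640, 1750]})

by_attr = toy.stature        # attribute access
by_key = toy['stature']      # bracket access, same result

spaced = toy['arm span']     # only possible with brackets

print(by_attr.equals(by_key))
```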

Jupyter notebooks even give you dynamic autocomplete. Try putting your cursor at the end of the next line and pressing the ```TAB``` key on your keyboard (it may take a second).

In [None]:
female.

We can now easily do scatterplots of different measurements. Let's plot the ```stature``` (height) against the ```span``` (distance between outstretched arms).

In [None]:
plt.scatter(female.stature, female.span, color='red')
plt.scatter(male.stature, male.span, color='blue');

Even if we make the points transparent (```alpha=0.1```) and small (```s=1```), it's quite a dense cloud. We can easily select a small subset by [slicing](https://www.pythoncentral.io/how-to-slice-listsarrays-and-tuples-in-python/):

In [None]:
male[:3]

Note that slicing like this only works over the rows. We can't do ```male[:3, 5:12]```, like we could in numpy.
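If you do want numpy-style selection over rows and columns at once, pandas offers the ```.iloc``` (by position) and ```.loc``` (by label) indexers. A minimal sketch on a toy dataframe with invented values:

```python
import numpy as np
import pandas as pd

# Toy dataframe with 5 rows and 4 columns (values invented)
toy = pd.DataFrame(np.arange(20).reshape(5, 4), columns=['a', 'b', 'c', 'd'])

rows = toy[:3]                    # plain slicing: rows only
block = toy.iloc[:3, 1:3]         # by position: first 3 rows, columns 1 and 2
named = toy.loc[:, ['b', 'c']]    # by label: all rows, two named columns

print(block)
```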

For now, we know enough to plot a small subset of the data.

In [None]:
female_sub = female[:50]
male_sub = male[:50]

plt.scatter(female_sub.stature, female_sub.span, color='red')
plt.scatter(male_sub.stature, male_sub.span, color='blue');

To select a set of columns, the best method is to pass a list of strings containing column names. The result is another dataframe.

In [None]:
male[['bicepscircumferenceflexed', 'Age']]

### Simple arithmetic

Like numpy, pandas objects overload basic arithmetic operations. For instance, the units in this dataset are in millimeters, which is a little hard to read. To convert them to meters, we can simply multiply by 0.001.


In [None]:
stat = female.stature * 0.001
span = female.span * 0.001

plt.scatter(stat, span);

plt.xlabel('height (m)')
plt.ylabel('span (m)');
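Arithmetic also works between two columns, element by element. As a small sketch on invented numbers, we can compute the ratio of span to stature per row, or their difference in meters:

```python
import pandas as pd

# Toy measurements in millimeters (values invented)
toy = pd.DataFrame({'stature': [1600, 1700, 1800],
                    'span': [1600, 1751, 1782]})

ratio = toy.span / toy.stature               # element-wise division of two columns
diff_m = (toy.span - toy.stature) * 0.001    # difference, converted to meters

print(ratio)
print(diff_m)
```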

### Descriptive statistics

For most descriptive statistics, pandas provides member functions:

In [None]:
print('mean           ', female.stature.mean())
print('std dev.       ', female.stature.std())
print('median         ', female.stature.median())
print('standard error ', female.stature.sem())
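To see how these statistics relate, the sketch below checks on toy values that the standard error is the standard deviation divided by the square root of the number of instances. It also shows one difference to keep in mind: pandas' ```std()``` divides by n - 1 by default, while numpy's ```np.std``` divides by n.

```python
import numpy as np
import pandas as pd

s = pd.Series([1600.0, 1650.0, 1700.0, 1750.0, 1800.0])  # toy statures in mm

# Standard error computed by hand from the standard deviation
sem_by_hand = s.std() / np.sqrt(len(s))
print(np.isclose(sem_by_hand, s.sem()))

# numpy divides by n by default, pandas by n - 1
print(np.std(s.to_numpy()), s.std())
```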

### Concatenating and sampling

To perform the classification task from the first lecture (predicting gender from physical measurements), we want the male and female data in a single dataframe. To accomplish this, we can concatenate the two dataframes:

In [None]:
people = pd.concat([male, female])

This gives us a dataset of all the male measurements first, and then all the female measurements. For many reasons, it's helpful to shuffle these, so that the order is random. The simplest way to do this in pandas is to _sample_ a new dataframe (without replacement) of the same size:

In [None]:
people = people.sample(frac=1)
people[:5].Gender

Note that the row indices from the original dataframe are retained and shuffled as well. For our purposes this doesn't matter.
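If the retained indices do get in the way, or if you want the same shuffle every time you run the notebook, the following sketch shows two options on toy data: the ```random_state``` argument of ```sample``` fixes the random seed, and ```reset_index(drop=True)``` renumbers the rows.

```python
import pandas as pd

# Two toy dataframes standing in for the two tables
a = pd.DataFrame({'x': [1, 2, 3]})
b = pd.DataFrame({'x': [4, 5]})

both = pd.concat([a, b])
print(both.index.tolist())   # [0, 1, 2, 0, 1]: the original indices are kept

# random_state makes the shuffle reproducible
shuffled = both.sample(frac=1, random_state=0)

# reset_index(drop=True) replaces the shuffled labels with 0..n-1
clean = shuffled.reset_index(drop=True)
print(clean.index.tolist())
```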

## Performing Classification

### Categories and codes

To perform classification on this dataset, we need to convert the target value from strings to categorical data.

Since gender is a sensitive attribute and we're just looking for any example, let's instead try to predict handedness. This is indicated by the attribute ```WritingPreference```.

Right now, pandas thinks the column can have any string value. When we convert it, pandas checks the existing values (```Left Hand```, ```Right hand``` and ```Either hand (No preference)```) and limits the column values to those three options (changing the datatype).

In [None]:
wp_cat = people.WritingPreference.astype('category')
print(wp_cat.dtype)
print(wp_cat.cat.categories)

Note that we haven't changed the original data. To insert the categorized column back into the original dataframe, we just re-assign it.

In [None]:
people.WritingPreference = people.WritingPreference.astype('category')

We can quickly check the class balance using the ```value_counts()``` function. (For a fancier display, try the ```hist()``` function.)

In [None]:
people.WritingPreference.value_counts(normalize=True) 

# normalize=False will give you absolute counts

What does this tell us about the performance of the majority class baseline?
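As a hint, here is a sketch of how the baseline accuracy can be read off the normalized counts. It uses a toy label column with invented values rather than the real data.

```python
import pandas as pd

# Toy labels, invented for illustration: 6 times 'R', 2 times 'L'
labels = pd.Series(['R', 'R', 'R', 'L', 'R', 'L', 'R', 'R'])

freqs = labels.value_counts(normalize=True)

majority = freqs.idxmax()     # the most frequent class
baseline_acc = freqs.max()    # accuracy of always predicting that class

print(majority, baseline_acc)
```

A classifier is only interesting if it does better than this number on data it has not seen.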

For many tasks (including classification with sklearn), we need integers instead of categorical values. Pandas actually uses integer codes behind the scenes for its categories and it's a simple matter to get a column of integers from a column of categorical data:

In [None]:
people.WritingPreference.cat.codes
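To see which integer belongs to which category, note that each code is an index into the sorted list of categories. A minimal sketch on a toy column (the strings are illustrative):

```python
import pandas as pd

# Toy column with three distinct string values
s = pd.Series(['Right hand', 'Left hand', 'Right hand',
               'Either hand (No preference)'])
cat = s.astype('category')

# Each code is the position of the value in the (sorted) categories
mapping = dict(enumerate(cat.cat.categories))

print(mapping)
print(cat.cat.codes.tolist())
```

The integer codes carry no meaning beyond this lookup, so the ordering of the integers should not be read as an ordering of the classes.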

This allows us, for instance, to scatterplot the data using the categories for color.

You can add a cmap argument like ```cmap='copper'``` to change the colors. A [colormap](https://matplotlib.org/examples/color/colormaps_reference.html) maps a range of numeric values to a range of colors. In our case, we only have the codes 0, 1 and 2, so the lowest and highest get mapped to the extremes of the chosen colormap, and the middle one to its midpoint.

In [None]:
sub = people[:500]
plt.scatter(sub.stature, sub.span, c=sub.WritingPreference.cat.codes, alpha=0.3);

Clearly, stature and span are not very predictive. Perhaps the measurements of the right arm will tell us a little more?

In [None]:
sub = people[:500]
plt.scatter(sub.bicepscircumferenceflexed, sub.bideltoidbreadth, c=sub.WritingPreference.cat.codes, alpha=0.3);

Clearly, this is a difficult problem. It may even be impossible from the features we have. Let's try anyway.

### Classification

Running a classifier looks much the same as it did with a numpy array.

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

train = people[:4000]
test  = people[4000:]
# NB: We can split like this because we know the data is shuffled

cls = SVC()
cls.fit(train[['stature','span']], train.WritingPreference)

accuracy_score(test.WritingPreference, cls.predict(test[['stature', 'span']]))

Very slightly higher than the majority class (though your results may differ). Can we conclude that handedness can be predicted from height and span?

Let's see what we get for a kNN classifier on all measurements.

In [None]:
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import accuracy_score

train = people[:4000]
test  = people[4000:]
# NB: We can split like this because we know the data is shuffled

traina, testa = train.iloc[:, 1:94], test.iloc[:, 1:94] # select all measurement columns

cls = KNN(n_neighbors=1)
cls.fit(traina, train.WritingPreference)

accuracy_score(test.WritingPreference, cls.predict(testa))

Lower than the majority class baseline. Note that with the number of neighbors at 1, kNN is likely to overfit. See what happens if you increase the number of neighbors.

At this point, we're in danger of multiple testing, so we should make a proper train/validation/test split. Can you see how you would do that?
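One possible approach, sketched on a toy dataframe: shuffle once, then cut the rows into three consecutive slices. The 60/20/20 proportions are an arbitrary choice for this example, not a recommendation from the course.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataframe
toy = pd.DataFrame({'x': np.arange(100), 'y': np.arange(100) % 2})
toy = toy.sample(frac=1, random_state=0)   # shuffle first

n = len(toy)
n_train = n * 6 // 10    # 60% for training
n_val = n * 2 // 10      # 20% for validation, the rest for testing

train = toy[:n_train]
val = toy[n_train:n_train + n_val]
test = toy[n_train + n_val:]

print(len(train), len(val), len(test))
```

The idea is to compare models and hyperparameters on the validation slice only, and to touch the test slice once, at the very end.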

As you can see, sklearn integrates beautifully with pandas, making our training and testing code even simpler. Not all libraries integrate this well with pandas; mlxtend, for instance, only understands numpy data.

Happily, pandas data contains numpy arrays in the background, and we can simply ask for those by retrieving the ```.values``` attribute.

In [None]:
from mlxtend.plotting import plot_decision_regions

# Plot the decision boundary with the first 25 points in the training set
numpy_x = train[['stature','span']].values
numpy_y = train.WritingPreference.cat.codes.values

# This is necessary if pandas read the CSV files as integers
# (seems to depend on version/OS)
numpy_x = numpy_x.astype(float)

# Rebuild the classifier 
# (a classifier trained on pandas data doesn't interoperate well with pure numpy data)
knn = KNN(n_neighbors=2)
knn.fit(numpy_x, numpy_y)

plot_decision_regions(numpy_x[:25, :], numpy_y[:25], clf=knn);
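As an aside, pandas also provides a ```to_numpy()``` method, which does the same job as ```.values``` and can convert the datatype in the same step. A small sketch on toy values:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({'stature': [1650, 1720], 'span': [1640, 1750]})

# Convert two integer columns to a float numpy array in one call
arr = toy[['stature', 'span']].to_numpy(dtype=float)

print(type(arr).__name__, arr.dtype, arr.shape)
```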



## Final comments

As usual, there is a lot more to learn. A good place to start is the 10-minute quickstart guide to pandas:
https://pandas.pydata.org/pandas-docs/stable/10min.html

One very useful feature we didn't mention is _grouping_ (which will be familiar if you've done a little SQL):
https://pandas.pydata.org/pandas-docs/stable/groupby.html
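To give a flavor of grouping, here is a minimal sketch on a toy dataframe with invented values: we split the rows by the value in one column and compute the mean of another column within each group.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    'hand': ['R', 'L', 'R', 'L', 'R'],
    'stature': [1700, 1600, 1800, 1650, 1750],
})

# Mean stature per value of 'hand'
means = toy.groupby('hand')['stature'].mean()

print(means)
```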

Here's a list of 12 random tips, which gives you a good idea of how far pandas can go:
https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/

Next week, deep learning with _Keras_.