# Unit 3 - A Data Science campaign with pandas and PCA
This unit covers:
* Essential data wrangling with `pandas`;
* Working with different data types;
* Discerning categorical from numerical features;
* Spotting and interpreting the PCA 'variance bug';
* Working with a public data set.


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Essential Python tools and concepts – `pandas`

In this practical we will predominantly be working with the `pandas` library. 


In [2]:
import numpy as np
import pandas as pd

### Pandas Series

A Series is a one-dimensional, labelled array of values. 

Note the `NaN` value: it stands for "Not a Number". It originally represents an undefined numerical result, such as 0/0, and `pandas` also uses it to denote missing values. 

In [3]:
values =[1, 3, 5, np.nan, 6, 8] # a list of values
pd.Series(values)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
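The behaviour of `NaN` as a missing-value marker can be sketched as follows. This is only a minimal illustration; the variable names `values` and `clean` are chosen here for the example.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 3, 5, np.nan, 6, 8])

# NaN never compares equal to anything, not even to itself,
# so use isna() rather than == to find missing values.
print(np.nan == np.nan)
print(values.isna().sum())

# Most pandas reductions skip NaN by default (skipna=True),
# so this is the mean of the five non-missing values.
print(values.mean())

# dropna() returns a new Series without the missing entries.
clean = values.dropna()
print(len(clean))
```

Whether to drop or fill missing values depends on the analysis; `dropna()` is just the simplest option.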

A Series can have an index label associated with each value. 

In [4]:
s = pd.Series(values, index=['a', 'b', 'c', 'd', 'e', 'f'])
s

a    1.0
b    3.0
c    5.0
d    NaN
e    6.0
f    8.0
dtype: float64

**Task 1:** Create a Series holding the values 10, 12, 14, ..., 20 with the indices 0 to 5.
Save it in a variable called *mySeries*.

### Pandas Data Frames
A **data frame** is like a two-dimensional series.

In [5]:
df = pd.DataFrame(np.random.randn(6,4)) # note the size: 6 rows, 4 columns
df

Unnamed: 0,0,1,2,3
0,-1.444459,-2.677714,-0.839878,1.76365
1,-0.473239,-1.100614,-0.643218,-0.482046
2,0.008334,0.755598,-1.102782,0.975557
3,-0.111121,-1.008879,0.778967,0.071188
4,-0.701276,1.036063,0.424508,0.604168
5,-0.666581,1.028694,0.991621,-1.623023


You can pass it an index during construction:

In [6]:
df = pd.DataFrame(np.random.randn(6,4), 
                  index=["row {}".format(i) for i in range(6)]) # note the size: 6 rows, 4 columns
df

Unnamed: 0,0,1,2,3
row 0,-0.164572,-0.276863,-0.489491,-0.000642
row 1,-0.065934,0.689488,-0.398049,-1.141333
row 2,-0.546454,0.584696,-0.881552,-1.630982
row 3,0.495106,0.072019,0.793277,-1.309145
row 4,-2.105708,-0.083496,1.460389,1.165236
row 5,0.708298,-0.444114,-0.405987,-0.995237


Likewise, you can pass column names:

In [7]:
df = pd.DataFrame(np.random.randn(6,4), 
                  index=["row {}".format(i) for i in range(6)],
                  columns=['col {}'.format(i) for i in range(4)]) 
df

Unnamed: 0,col 0,col 1,col 2,col 3
row 0,0.504396,0.65347,-0.006096,-0.659978
row 1,0.208127,1.638025,0.215765,0.096315
row 2,-0.744013,0.411306,0.979918,-1.053079
row 3,1.117119,0.868774,-0.224997,-1.480922
row 4,1.335434,0.699595,-1.427961,-0.557457
row 5,0.557421,-0.460455,0.473014,0.595313


You can also construct it from a dictionary:

In [8]:
columns = {'beep': np.random.randn(6),
           'bop': np.random.randn(6),
           'bup': np.random.randn(6),
           'bap': np.random.randn(6)}
df = pd.DataFrame(columns, index=["row {}".format(i) for i in range(6)])
df

Unnamed: 0,beep,bop,bup,bap
row 0,0.515672,-1.291355,-0.463974,1.097808
row 1,0.343777,1.693518,1.313489,0.391107
row 2,-0.803122,-0.286866,2.592035,1.445094
row 3,-1.011674,0.387013,0.273001,-0.104802
row 4,0.60545,-0.256283,0.059999,1.528498
row 5,-1.08998,0.397341,-1.07037,2.220405


### Data frame addressing
We can now address parts of the data by their row and column names. Columns are addressed like this:

In [9]:
df['bop']

row 0   -1.291355
row 1    1.693518
row 2   -0.286866
row 3    0.387013
row 4   -0.256283
row 5    0.397341
Name: bop, dtype: float64

Note that the return value is a Series, not a DataFrame! This is because a single column is one-dimensional.
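The difference between the two return types can be checked directly. The sketch below uses a small illustrative frame named `demo` (a name chosen for this example) and compares selecting with a single label against selecting with a one-element list.

```python
import numpy as np
import pandas as pd

# Illustrative frame; the column names mirror the example above.
demo = pd.DataFrame(np.random.randn(6, 2), columns=['bop', 'bap'])

as_series = demo['bop']      # single label -> Series (one-dimensional)
as_frame = demo[['bop']]     # list with one label -> DataFrame (two-dimensional)

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

Passing a list, even one with a single name, keeps the result two-dimensional, which can be convenient when later code expects a DataFrame.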

Extract multiple columns at once by passing a list of column names. The names do not need to be unique!

In [10]:
df[['bop', 'bap', 'bap']]

Unnamed: 0,bop,bap,bap
row 0,-1.291355,1.097808,1.097808
row 1,1.693518,0.391107,0.391107
row 2,-0.286866,1.445094,1.445094
row 3,0.387013,-0.104802,-0.104802
row 4,-0.256283,1.528498,1.528498
row 5,0.397341,2.220405,2.220405


Here, the return value is a DataFrame because it's two-dimensional.

Rows use the `.loc` attribute:

In [11]:
df.loc[['row 0', 'row 2']]

Unnamed: 0,beep,bop,bup,bap
row 0,0.515672,-1.291355,-0.463974,1.097808
row 2,-0.803122,-0.286866,2.592035,1.445094


The `.iloc` attribute lets you select a row by its integer position instead of its label:

In [12]:
df.iloc[0]

beep    0.515672
bop    -1.291355
bup    -0.463974
bap     1.097808
Name: row 0, dtype: float64

`.iloc` also supports NumPy-style indexing and slicing over rows and columns:

In [13]:
df.iloc[2,1:4]

bop   -0.286866
bup    2.592035
bap    1.445094
Name: row 2, dtype: float64

In [14]:
df.iloc[1:3,0:2]

Unnamed: 0,beep,bop
row 1,0.343777,1.693518
row 2,-0.803122,-0.286866
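One detail worth knowing when switching between `.loc` and `.iloc` is how slices treat their end point. The following sketch uses an illustrative frame `demo` filled with consecutive integers so the selections are easy to follow.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(24).reshape(6, 4),
                    index=["row {}".format(i) for i in range(6)],
                    columns=['beep', 'bop', 'bup', 'bap'])

# .loc takes labels for both rows and columns.
single = demo.loc['row 2', 'bop']

# Label slices with .loc include the end label ...
by_label = demo.loc['row 1':'row 3', 'beep':'bop']

# ... whereas positional slices with .iloc exclude the end position.
by_position = demo.iloc[1:3, 0:2]

print(single)
print(by_label.shape)
print(by_position.shape)
```

The label slice returns three rows, the positional slice only two, even though both start at the second row.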


### Advanced data frames

`pandas` supports many data types, and a data frame can hold columns of different types at the same time. This is the principal difference from a NumPy array, where all elements need to be of the same data type.

In [15]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


The `info()` method gives you an overview of the data types:

In [16]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   A       4 non-null      float64       
 1   B       4 non-null      datetime64[ns]
 2   C       4 non-null      float32       
 3   D       4 non-null      int32         
 4   E       4 non-null      category      
 5   F       4 non-null      object        
dtypes: category(1), datetime64[ns](1), float32(1), float64(1), int32(1), object(1)
memory usage: 260.0+ bytes
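Once the data types are known, columns can be picked by type. The sketch below is one possible way to do this with `select_dtypes`; the frame `mixed` is a reduced, illustrative version of the one above.

```python
import numpy as np
import pandas as pd

mixed = pd.DataFrame({'A': 1.,
                      'D': np.array([3] * 4, dtype='int32'),
                      'E': pd.Categorical(["test", "train", "test", "train"]),
                      'F': 'foo'})

# The dtypes attribute lists one data type per column.
print(mixed.dtypes)

# Keep only the numeric columns (integers and floats).
numeric_only = mixed.select_dtypes(include='number')
print(list(numeric_only.columns))

# Keep only columns stored with the pandas 'category' dtype.
categorical_only = mixed.select_dtypes(include='category')
print(list(categorical_only.columns))
```

Note that this only reflects how a column is stored: a category encoded as 0/1 integers still counts as numeric here.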


This is all the `pandas` you'll need in this practical. Feel free to make yourself familiar with what else `pandas` has to offer. A good starting point is [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/version/1.3/user_guide/10min.html) in the official `pandas` documentation.

# Data campaign: Cars 
You now know (almost) all you need to know to start your first data science campaign. We will analyse the "cars" data set. It contains data on historic car models. You will be guided through the first steps, then it's up to you to apply PCA to explore the data. 

Let's load the data from the internet and make a data frame:

In [None]:
# use read_csv to read the data from the URL
# use set_index to make the 'model' column the index
# show the data frame

cars = pd.read_csv('https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv')
cars = cars.set_index('model')
cars

### Descriptive statistics


Let's have a quick look at some descriptive statistics.

In [None]:
cars.mean()

In [None]:
cars.var()

#### Task
What do you notice about the variance?

There's also a command that gives you a few common statistical descriptors, all in one data frame. Up to you to decide which you like better!

In [None]:
cars.describe()

## Numerical vs. categorical features
**Numerical features** express a quantitative relationship between an instance and a feature. For example, 'height' is a numerical feature of a human. 

**Categorical features** express whether an instance belongs in a certain category. 'Male', 'female' are two categories that apply to humans (alongside others).

Most interesting data sets contain numerical **and** categorical features.

For PCA, only numerical features are useful (most of the time). 

These are the features of the cars dataset:
* mpg: Miles per (US) gallon
* cyl: Number of cylinders
* disp: Displacement (cubic inches)
* hp: Gross horsepower
* drat: Rear axle ratio
* wt: Weight (1000 lbs)
* qsec: 1/4 mile time
* vs: V-engine (0) or straight engine (1)
* am: Transmission (0 = automatic, 1 = manual)
* gear: Number of forward gears
* carb: Number of carburetors


### **Task: remove categorical features** 
1. Decide which features are categorical, which are numerical. Search the internet if you don't know what a certain feature means.  
2. Delete categorical features from the data frame. Use the `.drop()` function. Documentation is available [online](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) or with the built-in documentation accessed by typing `pd.DataFrame.drop?` in a cell and executing it. 
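As a hint for how `.drop()` behaves, here is a minimal sketch on a toy frame that is unrelated to the cars data set. The frame `pets` and its column names are made up for this illustration.

```python
import pandas as pd

# Toy data, unrelated to the cars data set.
pets = pd.DataFrame({'weight_kg': [4.2, 30.5, 0.03],
                     'age_years': [3, 7, 1],
                     'is_indoor': [1, 0, 1]},
                    index=['cat', 'dog', 'goldfish'])

# drop() returns a new frame; the original stays unchanged
# unless you assign the result back to the same name.
numeric_pets = pets.drop(columns=['is_indoor'])

print(list(numeric_pets.columns))
print(list(pets.columns))
```

Here `is_indoor` is stored as an integer but really encodes a yes/no category, which is the kind of column the task asks you to look for.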

In [None]:
# Please put your solution here, feel free to create more cells if needed

### Task: Analyse the cleaned data set using PCA

1. Perform a PCA on the raw data.
2. Produce a scatter plot of the PCA-transformed data.
3. Produce a scree plot and analyse how much variance is captured in the first 2 components.
4. Plot the covariance matrix of the dataset. What stands out? 
5. Look at the components of the dataset. Which features are highly represented in the first two components?
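The mechanics behind these steps can be sketched with NumPy alone. The example below runs on synthetic data, not on the cars data set, and computes PCA through an eigen-decomposition of the covariance matrix; this is one possible route, and a library implementation such as scikit-learn's `PCA` class would be an alternative. All variable names are chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 correlated features.
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.5]])
X = rng.normal(size=(200, 3)) @ mixing

# 1. Centre the data so every feature has zero mean.
X_centred = X - X.mean(axis=0)

# 2. Covariance matrix of the features (columns are features).
cov = np.cov(X_centred, rowvar=False)

# 3. Eigen-decomposition; eigh returns ascending eigenvalues, so reorder.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project the data onto the principal components.
scores = X_centred @ eigvecs

# 5. Fraction of the total variance per component (the scree plot values).
explained_ratio = eigvals / eigvals.sum()
print(explained_ratio)
```

The columns of `eigvecs` are the components: each entry says how strongly a feature contributes to that component. The first two columns of `scores` are what a scatter plot of the transformed data would show.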

In [None]:
# Please put your solution here, feel free to create more cells if needed

### Task: Normalise and observe the effect (the *variance bug*)
1. Normalise the data to zero mean and unit variance and repeat the steps above. 
2. How does the scatter plot of the first two PCs compare to the PCA on the raw data before normalisation? 
3. What's the difference in the scree plot?
4. How is the covariance matrix different?
5. How do the components differ? 
6. *(Advanced)* Spot a cluster in the plotted data, find the corresponding data points, figure out what they have in common.
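Normalising to zero mean and unit variance is often called standardisation or z-scoring. The sketch below shows the arithmetic on a synthetic frame; the column names `big` and `small` are hypothetical and only illustrate two features on different scales.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic frame with hypothetical columns on very different scales.
data = pd.DataFrame({'big': rng.normal(500, 100, size=50),
                     'small': rng.normal(0.5, 0.1, size=50)})

# z-score: subtract each column's mean, divide by its standard deviation.
standardised = (data - data.mean()) / data.std()

print(standardised.mean())   # approximately 0 for every column
print(standardised.var())    # 1 for every column

# After standardisation, the covariance matrix equals the
# correlation matrix of the original data.
print(np.allclose(standardised.cov(), data.corr()))
```

Both `std()` and `var()` in `pandas` use the sample estimate (division by n - 1) by default, which is why the variance comes out as 1 here.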

In [None]:
# Please put your solution here, feel free to create more cells if needed

# Coronavirus epidemic dynamics

Here's a task for advanced students. As you all are aware, last spring we saw the outbreak of Covid-19, aka coronavirus. Here, we're going to analyse a dataset from the beginning of the outbreak, when it was just about to spread around the world. 

### Task: Explore data on kaggle.com
1. Go to the dataset page on Kaggle: https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset 
2. Inspect the data on the website and the various ways the site lets you explore it. Registration on the website is not necessary.



## Working with the Coronavirus outbreak data
1. The coronavirus dataset is provided on Canvas. Download and save it in the same folder as this notebook.
2. Unzip the data.
3. Verify that the folder in which this notebook resides now contains a sub-folder named `novel-corona-virus-2019-dataset`.

First we load the main dataset into a DataFrame:

In [None]:
df = pd.read_csv('novel-corona-virus-2019-dataset/2019_nCoV_data.csv')
df

If this fails then please check again whether the dataset folder is unzipped and resides in the same folder as this notebook. Check that the filename in the command matches the filename of the data set on disk.

The `Sno` column contains the serial number and is identical to the auto-generated index column. Let's set the index to the `Sno` column:

In [None]:
df = df.set_index('Sno')
df 

Let's explore the data types.

### Task: 
Which features are numeric? Which are continuous? Which are categorical?

## Initial exploration
Let's plot the number of confirmed cases for the whole data set.

In [None]:
a = df['Confirmed'].plot()

### Task

Is this the plot you expected? Why not? 

Solution: The above command naively plots the whole column, but ignores the structure of the data set, where each line applies to a different province/state, or even country.

We need to filter by region! Let's look only at the Hubei province, the root of the outbreak:

In [None]:
df.loc[df['Province/State'] == 'Hubei']

That looks better! Let's plot those values:

In [None]:
df.loc[df['Province/State'] == 'Hubei'].plot()

Note how the `.plot()` function of the `DataFrame` object already gives us a plot of all numerical features, complete with a legend!
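The filtering used above relies on a boolean mask. The sketch below spells out the two steps on toy records with made-up numbers; only the column names are taken from the commands above.

```python
import pandas as pd

# Toy records with made-up values, mimicking the layout of the data set.
records = pd.DataFrame({'Province/State': ['Hubei', 'Guangdong', 'Hubei', 'Guangdong'],
                        'Confirmed': [10, 1, 25, 3]})

# The comparison yields a boolean Series with one entry per row.
mask = records['Province/State'] == 'Hubei'
print(mask.tolist())

# Passing the mask to .loc keeps only the rows where it is True.
hubei = records.loc[mask]
print(hubei['Confirmed'].tolist())
```

Masks can be combined with `&` and `|` (with each condition in parentheses) when more than one criterion is needed.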

The plot is still lacking, though. For example, it needs:
* dates on the x-axis
* a proper label on the y-axis

### Task (advanced)
* Read the documentation of the `DataFrame.plot` command to learn how to make it plot the date on the x-axis.
* Use `ax = plt.gca()` to get an axes object, and call its `set_ylabel()` method to set an appropriate y-label.

### Task (advanced)
* Plot the data for all of China.
* Aggregate the data for the rest of the world and plot it.
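Aggregating over several regions usually means grouping rows and summing within each group. The following is a minimal sketch on toy records with made-up numbers; the `Date` column name and its date format are assumptions for this illustration and should be checked against the actual file.

```python
import pandas as pd

# Toy records with made-up numbers; the 'Date' column and its
# format are assumptions, not taken from the real data set.
records = pd.DataFrame({
    'Date': ['01/22/2020', '01/22/2020', '01/23/2020', '01/23/2020'],
    'Province/State': ['Hubei', 'Guangdong', 'Hubei', 'Guangdong'],
    'Confirmed': [10, 1, 25, 3]})

# Parse the date strings so they sort and plot as dates.
records['Date'] = pd.to_datetime(records['Date'], format='%m/%d/%Y')

# Sum the confirmed cases over all regions for each date.
totals = records.groupby('Date')['Confirmed'].sum()
print(totals.tolist())
```

The result is a Series indexed by date, so calling `.plot()` on it puts the dates on the x-axis.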

## That's it for today!
We have covered:
* Essential `pandas`; how to create, address and modify `pandas` `DataFrame`s.
* How to perform a data science campaign using PCA.
* Initial loading and plotting of time-series data.

Next week we'll continue our analysis of the coronavirus data, with a special focus on visualisation.

# Submission Instructions

Before submitting this notebook via Canvas, please go to Kernel -> 'Restart & Run All'. Afterwards, check whether this introduced any issues and fix them. In the end, your notebook should always represent working code written in a linear way, one cell after the other.

**Do not touch the cells below!**

None of these tests is guaranteed to prove correctness; they are just very general indicators!

In [None]:
from IPython import display
green = "https://www.iconsdb.com/icons/preview/green/checkmark-xxl.png"
orange = "https://www.iconsdb.com/icons/preview/orange/checkmark-xxl.png" 
red = "https://www.iconsdb.com/icons/preview/red/checkmark-xxl.png"

In [None]:
#Task1 Test 
path=""
if(True):
    path = red
if(str(mySeries[4]) == '18'):
    path = orange
if(str(mySeries[[4]])[0]):
    path = green
display.Image(path)