# **Environment Setup**

In [1]:
# datasets to be used: "LaptopSalesJanuary2008.csv", "WHO.csv"
# upload datasets to a folder in Google Drive, e.g., My Drive/Colab Data
# connect to Google Drive, path to datasets is "/content/drive/My Drive/Colab Data/..."
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
## (if running in Colab, you can skip this part)
## install the required packages if they are not already in the environment
## either use the Anaconda Navigator GUI or call conda from the command prompt (Windows) / terminal (macOS)
# conda install numpy
# conda install pandas
# conda install matplotlib
# conda install seaborn
# conda install -c conda-forge plotnine

In [5]:
# after installation, import required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from plotnine import *
import duckdb as db

# **Example 1: Laptop Sales**

In [None]:
# Laptop Sales at a London Computer Chain - LaptopSalesJanuary2008.csv
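# load data as a DataFrame
# note: this relative path assumes a local ../data folder; in Colab, use the mounted Drive path from the first cell instead
laptop_df = pd.read_csv("../data/LaptopSalesJanuary2008.csv")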

# check if data is loaded correctly
laptop_df.head()

Unnamed: 0,Date,Configuration,Customer Postcode,Store Postcode,Retail Price,Screen Size (Inches),Battery Life (Hours),RAM (GB),Processor Speeds (GHz),Integrated Wireless?,HD Size (GB),Bundled Applications?,OS X Customer,OS Y Customer,OS X Store,OS Y Store,CustomerStoreDistance
0,1/1/2008 0:01,163,EC4V 5BH,SE1 2BN,455,15,5,1,2.0,Yes,80,Yes,532041,180995,534057.0,179682.0,2405.873022
1,1/1/2008 0:02,320,SW4 0JL,SW12 9HD,545,15,6,1,2.0,No,300,No,529240,175537,528739.0,173080.0,2507.558574
2,1/1/2008 0:04,23,EC3V 1LR,E2 0RY,515,15,4,1,2.0,Yes,300,Yes,533095,181047,535652.0,182961.0,3194.001409
3,1/1/2008 0:04,169,SW1P 3AU,SE1 2BN,395,15,5,1,2.0,No,40,Yes,529902,179641,534057.0,179682.0,4155.202281
4,1/1/2008 0:06,365,EC4V 4EG,SW1V 4QQ,585,15,6,2,2.0,No,120,Yes,531684,180948,528924.0,178440.0,3729.298057


In [4]:
# print a concise summary of the DataFrame
laptop_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7956 entries, 0 to 7955
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Date                    7956 non-null   object 
 1   Configuration           7956 non-null   int64  
 2   Customer Postcode       7956 non-null   object 
 3   Store Postcode          7956 non-null   object 
 4   Retail Price            7956 non-null   int64  
 5   Screen Size (Inches)    7956 non-null   int64  
 6   Battery Life (Hours)    7956 non-null   int64  
 7   RAM (GB)                7956 non-null   int64  
 8   Processor Speeds (GHz)  7956 non-null   float64
 9   Integrated Wireless?    7956 non-null   object 
 10  HD Size (GB)            7956 non-null   int64  
 11  Bundled Applications?   7956 non-null   object 
 12  OS X Customer           7956 non-null   int64  
 13  OS Y Customer           7956 non-null   int64  
 14  OS X Store              7952 non-null   

## **Pre-processing:**

In [6]:
# Print the list of variables to the screen

In [7]:
# Change the variable names to be more suitable for analysis
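
The two pre-processing steps above could look roughly like the sketch below. It runs on a tiny stand-in frame that reuses four of the real column names; the snake_case target names are an assumption made for illustration, not the names used in the original notebook.

```python
import pandas as pd

# Tiny stand-in frame (illustrative rows) reusing a few real column names.
demo_df = pd.DataFrame({
    "Store Postcode": ["SE1 2BN", "SW12 9HD"],
    "Retail Price": [455, 545],
    "Screen Size (Inches)": [15, 15],
    "Integrated Wireless?": ["Yes", "No"],
})

# Print the list of variables to the screen
print(demo_df.columns.tolist())

# Assumed naming scheme: lowercase, underscores, no units or punctuation
new_names = {
    "Store Postcode": "store_postcode",
    "Retail Price": "retail_price",
    "Screen Size (Inches)": "screen_size",
    "Integrated Wireless?": "integrated_wireless",
}
demo_df = demo_df.rename(columns=new_names)
print(demo_df.columns.tolist())
```

Names without spaces or symbols can also be accessed as attributes (for example `demo_df.retail_price`), which is one reason to rename before analysis.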

## **Task 1:** compute the average retail price by store postcode

In [8]:
# use dataframe "groupby()" function
# note, this is going to convert the 'dataframe' into a 'Series' object
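
A minimal sketch of the grouping step, using a small stand-in frame with illustrative prices instead of the full `laptop_df`. Selecting a single column after `groupby()` and aggregating it is what produces a `Series` indexed by store postcode.

```python
import pandas as pd

# Stand-in data with illustrative prices; the real notebook would use laptop_df.
sales_df = pd.DataFrame({
    "Store Postcode": ["SE1 2BN", "SE1 2BN", "SW12 9HD", "SW12 9HD", "E2 0RY"],
    "Retail Price": [455, 395, 545, 505, 515],
})

# Group rows by store, pick the price column, and average it per group.
mean_price = sales_df.groupby("Store Postcode")["Retail Price"].mean()

print(type(mean_price).__name__)  # the result is a Series, not a DataFrame
print(mean_price)
```

If a DataFrame is preferred for later steps, `mean_price.reset_index()` converts the result back.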

## **Task 2:** compare mean retail prices across the store postcodes in a bar chart. Where are the minimum and maximum prices located?

In [9]:
# bar chart of store vs. mean retail price
# we're going to use pandas' built-in plot() function
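
One way to draw the bar chart and answer the min/max question, sketched on an illustrative `Series` of per-store means (the values are made up; in the notebook they would come from the `groupby()` step). The `Agg` backend line is only there so the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter/Colab
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative mean prices per store postcode (made-up values).
mean_price = pd.Series(
    [515.0, 425.0, 525.0],
    index=pd.Index(["E2 0RY", "SE1 2BN", "SW12 9HD"], name="Store Postcode"),
    name="Retail Price",
)

# pandas' built-in plotting returns a matplotlib Axes we can keep styling.
ax = mean_price.plot(kind="bar", figsize=(6, 4), rot=45)
ax.set_ylabel("Mean retail price")
ax.set_title("Mean retail price by store postcode")
plt.tight_layout()

# idxmin()/idxmax() return the index labels of the extreme values.
cheapest_store = mean_price.idxmin()
priciest_store = mean_price.idxmax()
print("lowest mean price:", cheapest_store)
print("highest mean price:", priciest_store)
```

Reading the extremes from `idxmin()` and `idxmax()` is more reliable than eyeballing bars of similar height.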

## **Task 3:** compare retail price distributions across the various postcodes in a boxplot

In [10]:
# pandas boxplot() function to plot boxplots of retail price by store
# grouping method is embedded, so we can use the original laptop dataframe

# make the figure more readable
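
A sketch of the grouped boxplot on stand-in data (illustrative prices). `DataFrame.boxplot()` does the grouping itself through its `by` argument, so no prior `groupby()` is needed. The readability tweaks at the end are one possible choice, not necessarily those of the original notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter/Colab
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data with illustrative prices; the real notebook would use laptop_df.
sales_df = pd.DataFrame({
    "Store Postcode": ["SE1 2BN"] * 4 + ["SW12 9HD"] * 4 + ["E2 0RY"] * 4,
    "Retail Price": [455, 395, 470, 430, 545, 505, 560, 520, 515, 480, 530, 495],
})

fig, ax = plt.subplots(figsize=(8, 4))

# One box per store; the grouping is handled by the `by` argument.
sales_df.boxplot(column="Retail Price", by="Store Postcode", ax=ax, rot=45)

# Make the figure more readable: drop the automatic super-title, label the axes.
fig.suptitle("")
ax.set_title("Retail price by store postcode")
ax.set_xlabel("Store postcode")
ax.set_ylabel("Retail price")
fig.tight_layout()
```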

# **Example 2: World Health Organization (WHO)**

In [11]:
# World Health Organization – WHO.csv

# load data as a DataFrame

In [12]:
# print a concise summary of the DataFrame

## **Task 1:** what's the relationship between Fertility Rate and GNI (gross national income)?

In [13]:
# using seaborn scatterplot() function

In [14]:
# change marker type & size

In [15]:
# change marker type and color

In [16]:
# add title

In [17]:
# save graph as image
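
The scatterplot steps above (marker type, size, color, title, saving) can be combined as in the following sketch. The frame `who_demo`, its values, the column names `GNI` and `FertilityRate`, and the output file name are all assumptions for illustration; the real notebook would plot the frame loaded from WHO.csv.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter/Colab
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values; column names are assumed, not taken from WHO.csv.
who_demo = pd.DataFrame({
    "GNI": [1000, 5000, 12000, 30000, 45000],
    "FertilityRate": [5.5, 3.8, 2.6, 1.9, 1.7],
})

fig, ax = plt.subplots(figsize=(6, 4))

# marker, s (size) and color are forwarded to matplotlib's scatter.
sns.scatterplot(data=who_demo, x="GNI", y="FertilityRate",
                marker="D", s=80, color="darkorange", ax=ax)

# Add a title, then save the figure as an image file.
ax.set_title("Fertility rate vs. gross national income")
out_path = os.path.join(tempfile.gettempdir(), "fertility_vs_gni.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
```

Saving through the figure object (`fig.savefig`) rather than `plt.savefig` makes it explicit which figure is written when several are open.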

In [18]:
# same thing with lineplot() function
# good idea?

## **Task 2:** Add a 3rd dimension to represent "Regions", then "LifeExpectancy"

In [19]:
# color the observations by region (categorical)

In [20]:
# color the observations according to life expectancy (numeric)
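
Both colorings can be sketched with seaborn's `hue` argument: a categorical column gets one discrete color per level, while a numeric column is mapped onto a continuous palette. The data and the column names (`Region`, `LifeExpectancy`, `GNI`, `FertilityRate`) are made up here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter/Colab
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values; column names are assumed, not taken from WHO.csv.
who_demo = pd.DataFrame({
    "GNI": [1000, 2500, 12000, 15000, 30000, 45000],
    "FertilityRate": [5.5, 4.6, 2.6, 2.3, 1.9, 1.7],
    "Region": ["Africa", "Africa", "Americas", "Americas", "Europe", "Europe"],
    "LifeExpectancy": [55, 60, 72, 75, 80, 82],
})

fig, (ax_cat, ax_num) = plt.subplots(1, 2, figsize=(11, 4))

# Categorical third dimension: one color per region.
sns.scatterplot(data=who_demo, x="GNI", y="FertilityRate",
                hue="Region", ax=ax_cat)

# Numeric third dimension: colors vary continuously with life expectancy.
sns.scatterplot(data=who_demo, x="GNI", y="FertilityRate",
                hue="LifeExpectancy", palette="viridis", ax=ax_num)

fig.tight_layout()
```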

## **Task 3:** is the fertility rate of a country a good predictor of the percentage of the population under 15?

In [21]:
# visualize raw data

In [22]:
# visualize a log transformation by changing the scale of the x axis
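
Changing the axis scale leaves the data untouched and only alters how it is drawn. The sketch below uses pandas' own `plot.scatter()` on made-up data to keep dependencies small; since seaborn also returns a matplotlib `Axes`, the same `set_xscale("log")` call should apply there. Putting fertility rate on the x axis is an assumption based on the task wording.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter/Colab
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values; column names are assumed, not taken from WHO.csv.
who_demo = pd.DataFrame({
    "FertilityRate": [1.3, 1.8, 2.1, 2.9, 3.6, 4.4, 5.2, 6.1],
    "Under15": [14.0, 17.5, 21.0, 28.5, 33.0, 39.5, 43.0, 47.5],
})

ax = who_demo.plot.scatter(x="FertilityRate", y="Under15", figsize=(6, 4))

# Log-scale the x axis; tick labels still show the original units.
ax.set_xscale("log")
ax.set_title("Under 15 (%) vs. fertility rate, log-scaled x axis")
```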

In [23]:
# alternative log transformation
# this method adds a new column to the who_df dataframe containing the log
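
A minimal sketch of the column-based alternative, assuming the variable being transformed is the fertility rate and that the natural logarithm is intended (`np.log10` would be the base-10 variant). The frame and the new column name `LogFertilityRate` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values; column names are assumed, not taken from WHO.csv.
who_demo = pd.DataFrame({
    "FertilityRate": [1.3, 2.1, 3.6, 6.1],
    "Under15": [14.0, 21.0, 33.0, 47.5],
})

# Store the natural log in a new column; the original column is kept as-is.
who_demo["LogFertilityRate"] = np.log(who_demo["FertilityRate"])

print(who_demo)
```

Unlike rescaling the axis, this changes the plotted values themselves, so the axis ticks then show log units rather than the original rates.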

In [24]:
# yet another log transformation

In [25]:
# add a regression line to our plot
# default confidence interval is 95%

In [26]:
# 99% confidence interval

In [27]:
# no confidence interval

In [28]:
# change the color of the regression line
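
The regression-line variants above can be compared side by side with seaborn's `regplot()`, which draws the scatter, a fitted line, and a bootstrapped confidence band controlled by `ci`. This is a sketch on made-up data with assumed column names; `line_kws` passes styling through to the fitted line.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter/Colab
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values; column names are assumed, not taken from WHO.csv.
who_demo = pd.DataFrame({
    "FertilityRate": [1.3, 1.8, 2.1, 2.9, 3.6, 4.4, 5.2, 6.1],
    "Under15": [14.0, 17.5, 21.0, 28.5, 33.0, 39.5, 43.0, 47.5],
})

fig, axes = plt.subplots(1, 3, figsize=(13, 4), sharey=True)

# Default: regression line with a 95% confidence band.
sns.regplot(data=who_demo, x="FertilityRate", y="Under15", ax=axes[0])
axes[0].set_title("ci=95 (default)")

# Wider 99% confidence band.
sns.regplot(data=who_demo, x="FertilityRate", y="Under15", ci=99, ax=axes[1])
axes[1].set_title("ci=99")

# No band, and a red regression line.
sns.regplot(data=who_demo, x="FertilityRate", y="Under15", ci=None,
            line_kws={"color": "red"}, ax=axes[2])
axes[2].set_title("ci=None, red line")

fig.tight_layout()
```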

# **Example 3: Re-plot WHO data using ggplot**

Required components for creating a plot with ggplot (grammar of graphics), here through the plotnine package imported earlier:

1.   Data: the information to use when creating the plot.
2.   Aesthetics (aes): a mapping between data variables and the aesthetic, or graphical, variables used by the underlying drawing system.
3.   Geometric objects (geoms): the type of geometric object to draw. You can use points, lines, bars, and many others.

In [29]:
# Fertility Rate vs. Gross National Income

In [30]:
# Fertility Rate vs. Percentage of the Population under 15