# First data examples

This notebook goes through some examples with real data.

In [None]:
import numpy as np
import pandas as pd

## Google Analytics data

This is a simple dataset about users at a website.

In [None]:
?pd.read_excel

In [None]:
webdata = pd.read_excel("Webanalytics_data_example.xlsx", sheet_name="Dataset1")

In [None]:
webdata

In [None]:
webdata.info()

Selecting only those media channels for which the bounce rate is larger than 0.6:

In [None]:
webdata[webdata["BounceRate"] > 0.6]

In [None]:
webdata[webdata["BounceRate"] > 0.6][["MediaChannel", "BounceRate"]]

Selecting the rows for which Transactions is bigger than 100 and Sessions is less than 30000:

In [None]:
webdata[(webdata["Transactions"] > 100) & (webdata["Sessions"] < 30000)]
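
The same kind of filter can also be written with `DataFrame.query`, which can be easier to read when several conditions are combined. A minimal sketch on a small made-up frame (the values are illustrative and not taken from the Excel file; only the column names match `webdata`):

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the webdata frame.
toy = pd.DataFrame({
    "MediaChannel": ["A", "B", "C", "D"],
    "Sessions": [50000, 20000, 10000, 25000],
    "Transactions": [400, 150, 50, 120],
})

# Boolean-mask version: each condition needs its own parentheses,
# and the conditions are combined with & (not the keyword `and`).
mask_result = toy[(toy["Transactions"] > 100) & (toy["Sessions"] < 30000)]

# Equivalent query-string version.
query_result = toy.query("Transactions > 100 and Sessions < 30000")

print(query_result)
```

Both versions return the same rows, so which one to use is mostly a matter of taste.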

Finding the media channel with the highest revenue by sorting the rows by revenue in descending order, so that it ends up on top:

In [None]:
webdata.sort_values("Revenue", ascending=False)[["MediaChannel", "Revenue"]]
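
If only the top row (or the top few rows) is of interest, sorting the whole frame is not strictly necessary. The sketch below shows two alternatives, `idxmax` and `nlargest`, on made-up numbers (the toy values are assumptions for illustration, not the real revenue figures):

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the webdata frame.
toy = pd.DataFrame({
    "MediaChannel": ["A", "B", "C"],
    "Revenue": [1200.0, 5300.0, 800.0],
})

# Row label of the largest Revenue value, then look up that row's channel.
top_label = toy["Revenue"].idxmax()
top_channel = toy.loc[top_label, "MediaChannel"]

# The n rows with the largest Revenue, returned in descending order.
top_two = toy.nlargest(2, "Revenue")[["MediaChannel", "Revenue"]]

print(top_channel)
print(top_two)
```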

We can create a more precise measure of a media channel's effectiveness by calculating the fraction of sessions that ended in a transaction, that is, how many of the website's user sessions ended with the user actually buying something. This is often referred to as the *conversion rate*. We create a new column called `ConversionRate` that is equal to `Transactions/Sessions`.

In [None]:
webdata["ConversionRate"] = webdata["Transactions"] / webdata["Sessions"]
webdata

In [None]:
webdata.sort_values("ConversionRate", ascending=False)[["MediaChannel", "ConversionRate"]]

## Diabetes dataset

This is a classic dataset in machine learning and one of the example datasets that come with scikit-learn, so we can load it directly from the scikit-learn package.

In [None]:
from sklearn.datasets import load_diabetes

In [None]:
d_data = load_diabetes(as_frame=True)
d_data

In [None]:
# Work on a copy so that adding a column below does not modify d_data.data
diabetes_data = d_data.data.copy()

In [None]:
diabetes_data["Target"] = d_data.target

In [None]:
diabetes_data

In [None]:
diabetes_data.info()

In [None]:
diabetes_data.describe()

We see that the means of the feature columns are very close to zero (unlike the mean of `Target`), which suggests that the features have been centered and probably normalized (more on this later in the course).
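
To make "normalized" concrete: the scikit-learn description of this dataset states that each feature has been mean centered and scaled so that the squared values in each column sum to 1. The sketch below applies that kind of scaling to made-up numbers (the `raw` frame is hypothetical, not the original diabetes measurements) and then checks the two properties:

```python
import numpy as np
import pandas as pd

# Hypothetical raw measurements, not taken from the diabetes data.
raw = pd.DataFrame({
    "age": [25.0, 40.0, 55.0, 70.0],
    "bmi": [19.5, 24.0, 31.0, 27.5],
})

# Centre each column on its mean, then scale so the squared values sum to 1.
centered = raw - raw.mean()
scaled = centered / np.sqrt((centered ** 2).sum())

# The two properties to check: column means and column sums of squares.
col_means = scaled.mean()
col_sum_sq = (scaled ** 2).sum()

print(col_means)
print(col_sum_sq)
```

Running the same two checks on the feature columns of `diabetes_data` (everything except `Target`) is one way to see whether they follow this scaling.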

We want to know what correlates with the target...

In [None]:
diabetes_data["Target"].corr(diabetes_data["age"])

In [None]:
diabetes_data["Target"].corr(diabetes_data["sex"])

In [None]:
diabetes_data["Target"].corr(diabetes_data["bmi"])

In [None]:
diabetes_data["Target"].corr(diabetes_data["bp"])

In [None]:
diabetes_data["Target"].corr(diabetes_data["s1"])

In [None]:
diabetes_data["Target"].corr(diabetes_data["s2"])

In [None]:
diabetes_data["Target"].corr(diabetes_data["s3"])

In [None]:
diabetes_data["Target"].corr(diabetes_data["s4"])

In [None]:
diabetes_data["Target"].corr(diabetes_data["s5"])

In [None]:
diabetes_data["Target"].corr(diabetes_data["s6"])

Or we can get all the correlations as a matrix with one line of code:

In [None]:
diabetes_data.corr()
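
Since we mostly care about the target, it can help to pull out just the target column of the correlation matrix and rank the features by the strength of their correlation. A minimal sketch on a made-up frame (the column names `x_up`, `x_down` and `x_weak` are hypothetical; only `Target` matches the notebook):

```python
import pandas as pd

# Hypothetical toy data; "Target" plays the role of the diabetes target column.
toy = pd.DataFrame({
    "x_up": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_down": [5.0, 4.0, 3.0, 2.0, 1.0],
    "x_weak": [2.0, 1.0, 2.0, 1.0, 2.0],
    "Target": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Take the Target column of the correlation matrix and drop the trivial
# correlation of Target with itself.
target_corr = toy.corr()["Target"].drop("Target")

# Sort by absolute value so strong negative correlations rank high as well.
ranked = target_corr.sort_values(key=abs, ascending=False)

print(ranked)
```

The same two lines should work on `diabetes_data`, assuming the target column is named `Target` as above.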

Later, when we talk about regression, we will try to predict the target from all the other columns.

## Adult dataset from UCI Machine Learning Repository

This dataset contains information about the income of adults in the US. We will load it from the UCI Machine Learning Repository using the `ucimlrepo` package. See: https://archive.ics.uci.edu/dataset/2/adult

In [None]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
adult = fetch_ucirepo(id=2)

# data (as pandas dataframes)
X = adult.data.features
y = adult.data.targets

# metadata
print(adult.metadata)

# variable information
print(adult.variables)

In [None]:
X

In [None]:
X.info()

In [None]:
X.describe()

In [None]:
y

**EXERCISE:** Answer the following questions based on the Adult dataset:
1. What is the mean age of all persons in the data?
2. What is the mean age of female persons? What about male persons?
3. How many different types of education are there?
4. What are the different types of education and how many persons are there for each type?
5. Is there a difference in educational level across sex?
6. What is the most common relationship status?
7. Is there a correlation between hours per week (worked) and age?
8. Is the average hours per week (worked) different across different marital-status groups?
9. Is there an income difference across sexes?