**Introduction to Python**<br/>
Prof. Dr. Jan Kirenz <br/>
Hochschule der Medien Stuttgart

# Table of Contents

- 1 [Import data](#Import-data)
- 2 [Data transformation](#Data-transformation)
  - 2.1 [Descriptive statistics](#Descriptive-statistics)
    - 2.1.1 [Measures of central tendency](#Measures-of-central-tendency)
    - 2.1.2 [Measures of dispersion](#Measures-of-dispersion)
    - 2.1.3 [Summary statistics](#Summary-statistics)

In [1]:
import pandas as pd

## Import data

In [2]:
# Import data from GitHub (or from your local computer)
df = pd.read_csv("https://raw.githubusercontent.com/kirenz/datasets/master/wage.csv")

In [3]:
df

| | Unnamed: 0 | year | age | maritl | race | education | region | jobclass | health | health_ins | logwage | wage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 231655 | 2006 | 18 | 1. Never Married | 1. White | 1. < HS Grad | 2. Middle Atlantic | 1. Industrial | 1. <=Good | 2. No | 4.318063 | 75.043154 |
| 1 | 86582 | 2004 | 24 | 1. Never Married | 1. White | 4. College Grad | 2. Middle Atlantic | 2. Information | 2. >=Very Good | 2. No | 4.255273 | 70.476020 |
| 2 | 161300 | 2003 | 45 | 2. Married | 1. White | 3. Some College | 2. Middle Atlantic | 1. Industrial | 1. <=Good | 1. Yes | 4.875061 | 130.982177 |
| 3 | 155159 | 2003 | 43 | 2. Married | 3. Asian | 4. College Grad | 2. Middle Atlantic | 2. Information | 2. >=Very Good | 1. Yes | 5.041393 | 154.685293 |
| 4 | 11443 | 2005 | 50 | 4. Divorced | 1. White | 2. HS Grad | 2. Middle Atlantic | 2. Information | 1. <=Good | 1. Yes | 4.318063 | 75.043154 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2995 | 376816 | 2008 | 44 | 2. Married | 1. White | 3. Some College | 2. Middle Atlantic | 1. Industrial | 2. >=Very Good | 1. Yes | 5.041393 | 154.685293 |
| 2996 | 302281 | 2007 | 30 | 2. Married | 1. White | 2. HS Grad | 2. Middle Atlantic | 1. Industrial | 2. >=Very Good | 2. No | 4.602060 | 99.689464 |
| 2997 | 10033 | 2005 | 27 | 2. Married | 2. Black | 1. < HS Grad | 2. Middle Atlantic | 1. Industrial | 1. <=Good | 2. No | 4.193125 | 66.229408 |
| 2998 | 14375 | 2005 | 27 | 1. Never Married | 1. White | 3. Some College | 2. Middle Atlantic | 1. Industrial | 2. >=Very Good | 1. Yes | 4.477121 | 87.981033 |
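
The `Unnamed: 0` column in the output above typically appears when a CSV file was saved together with its row index, so the first header field is empty. A minimal sketch of how `index_col=0` makes `read_csv` use that first column as the index instead; `csv_text` is a hypothetical three-row excerpt used here for illustration, not the full file:

```python
import io

import pandas as pd

# Hypothetical excerpt standing in for wage.csv (note the empty first header field)
csv_text = """,year,age,wage
231655,2006,18,75.043154
86582,2004,24,70.476020
161300,2003,45,130.982177
"""

# Default behaviour: the unnamed first column becomes "Unnamed: 0"
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```

Whether the ID-like values should serve as the index depends on the analysis; the sketch only shows the mechanism.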


## Data transformation

### Descriptive statistics

#### Measures of central tendency

We start by computing common statistics for a single variable, using `age` as the example.

In [4]:
# mode
df['age'].mode()

0    40
dtype: int64
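
`mode()` returns a Series rather than a single number because several values can share the highest frequency. A small sketch with made-up data (the `ages` Series is hypothetical, not taken from the wage dataset) shows a tie and how to pick one value from the result:

```python
import pandas as pd

# Hypothetical data in which 30 and 40 both occur twice
ages = pd.Series([25, 30, 30, 40, 40, 51])

modes = ages.mode()          # Series containing every mode, sorted ascending
first_mode = modes.iloc[0]   # pick the smallest mode as a single value

print(modes.tolist())  # [30, 40]
print(first_mode)      # 30
```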

In [5]:
# calculation of the mean (e.g. for age)
df["age"].mean()

42.41466666666667

In [6]:
# calculation of the mean (e.g. for age) and round the result
round(df["age"].mean(), 2)

42.41

In [7]:
# calculation of the median (e.g. for age)
df["age"].median()

42.0

In [8]:
# use an f-string to print the result with a descriptive label
print(f'The median of age is {df["age"].median()}')

The median of age is 42.0
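
For `age`, the mean (42.41) and the median (42.0) are close together. The two measures can diverge when extreme values are present; the following sketch uses made-up numbers (both Series are hypothetical) to illustrate how a single outlier shifts the mean much more than the median:

```python
import pandas as pd

# Hypothetical values without and with one extreme observation
ages = pd.Series([20, 30, 40, 50, 60])
ages_with_outlier = pd.Series([20, 30, 40, 50, 60, 600])

print(ages.mean(), ages.median())  # 40.0 40.0
print(round(ages_with_outlier.mean(), 2), ages_with_outlier.median())  # 133.33 45.0
```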


#### Measures of dispersion 

In [9]:
# quantiles
df['age'].quantile([.25, .5, .75])

0.25    33.75
0.50    42.00
0.75    51.00
Name: age, dtype: float64
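
The non-integer first quartile (33.75) is consistent with pandas' default linear interpolation between neighbouring observations. The quartiles also give the interquartile range (IQR), the difference between the 75% and 25% quantiles; for `age` that would be 51.00 - 33.75 = 17.25. A minimal sketch on made-up data (the `values` Series is hypothetical):

```python
import pandas as pd

# Hypothetical data: four evenly spaced values
values = pd.Series([10, 20, 30, 40])

q1 = values.quantile(0.25)  # interpolated between 10 and 20
q3 = values.quantile(0.75)  # interpolated between 30 and 40
iqr = q3 - q1               # interquartile range

print(q1, q3, iqr)  # 17.5 32.5 15.0
```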

In [10]:
# Range
df['age'].max() - df['age'].min()

62

In [11]:
# standard deviation
round(df['age'].std(),2)

11.54
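
Note that pandas' `std()` computes the sample standard deviation by default (`ddof=1`, dividing by n - 1). The sketch below, on made-up data (the `values` Series is hypothetical), contrasts it with the population standard deviation obtained with `ddof=0`:

```python
import pandas as pd

# Hypothetical data with mean 5 and squared deviations summing to 32
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = values.std()            # ddof=1: sqrt(32 / 7)
population_std = values.std(ddof=0)  # ddof=0: sqrt(32 / 8)

print(round(sample_std, 2))      # 2.14
print(round(population_std, 2))  # 2.0
```

With 3000 observations, as in the wage data, the difference between the two is negligible.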

#### Summary statistics

In [12]:
# summary statistics for all numerical columns, rounded to two decimals
df.describe().round(2)

| | Unnamed: 0 | year | age | logwage | wage |
|---|---|---|---|---|---|
| count | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 |
| mean | 218883.37 | 2005.79 | 42.41 | 4.65 | 111.7 |
| std | 145654.07 | 2.03 | 11.54 | 0.35 | 41.73 |
| min | 7373.0 | 2003.0 | 18.0 | 3.0 | 20.09 |
| 25% | 85622.25 | 2004.0 | 33.75 | 4.45 | 85.38 |
| 50% | 228799.5 | 2006.0 | 42.0 | 4.65 | 104.92 |
| 75% | 374759.5 | 2008.0 | 51.0 | 4.86 | 128.68 |
| max | 453870.0 | 2009.0 | 80.0 | 5.76 | 318.34 |
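
`describe()` summarizes every numerical column, including ID-like ones such as `Unnamed: 0`, whose mean or standard deviation has no useful interpretation. One option is to select the columns of interest first. A minimal sketch, assuming a small hypothetical frame `sample` built from the first four rows shown earlier (wages rounded):

```python
import pandas as pd

# Hypothetical four-row frame; "Unnamed: 0" is an ID-like column
sample = pd.DataFrame({
    "Unnamed: 0": [231655, 86582, 161300, 155159],
    "age": [18, 24, 45, 43],
    "wage": [75.04, 70.48, 130.98, 154.69],
})

# Restrict the summary to the columns that are meaningful to summarize
summary = sample[["age", "wage"]].describe().round(2)
print(summary)
```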


Compare summary statistics for specific groups in the data:

In [13]:
# summary statistics of age, grouped by education level
df.groupby('education')['age'].describe()

| education | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 1. < HS Grad | 268.0 | 41.794776 | 12.611111 | 18.0 | 33.0 | 41.5 | 50.25 | 75.0 |
| 2. HS Grad | 971.0 | 42.217302 | 12.02348 | 18.0 | 33.0 | 42.0 | 50.0 | 80.0 |
| 3. Some College | 650.0 | 40.887692 | 11.523327 | 18.0 | 32.0 | 40.0 | 49.0 | 80.0 |
| 4. College Grad | 685.0 | 42.773723 | 10.902406 | 22.0 | 34.0 | 43.0 | 51.0 | 76.0 |
| 5. Advanced Degree | 426.0 | 45.007042 | 10.263468 | 25.0 | 38.0 | 44.0 | 53.0 | 76.0 |
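
When only a few statistics per group are needed, `agg()` can be used in place of `describe()`. The following is a minimal sketch on a hypothetical four-row frame (`sample` and its wage values are made up for illustration); the same pattern should apply to `df` with any numerical column:

```python
import pandas as pd

# Hypothetical data: two education levels with two wages each
sample = pd.DataFrame({
    "education": ["2. HS Grad", "2. HS Grad", "4. College Grad", "4. College Grad"],
    "wage": [80.0, 100.0, 120.0, 160.0],
})

# Compute selected statistics of wage for each education level
wage_by_education = sample.groupby("education")["wage"].agg(["mean", "median", "count"])
print(wage_by_education)
```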
