# Analysis: Teachers to Students Ratio

Author: Tom Freudenmann
Date: 2024-12-01
Tags: Teachers, Students, Ratio, Analysis
Summary: This is the analysis of the teachers to students ratio in Germany.


In [163]:
from school_analysis.preprocessing.load import Loader
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import school_analysis as sa
from tabulate import tabulate

loader = Loader()
students_per_state = loader.load('school-children-by-state')
students_per_type = loader.load('school-children-by-type')
teachers = loader.load('teachers-per-schooltype')

# 1. Understanding the data structure and validating the downloaded data

This section takes a closer look at the data structure, validates the data, and builds an understanding of what to plot later.

## 1.1 Students per state
First, we look at the students per state and the structure of the data to understand which data is used.

In [164]:
def describe_counts(df, cols):
    """Pretty print the value counts of a dataframe"""
    value_counts = {}
    for c in cols:
        v_counts = df[c].value_counts()
        value_counts[c] = v_counts.keys().tolist()
        value_counts[c + "_count"] = v_counts.values.tolist()
        
    print(tabulate(value_counts, headers='keys'))
    
def analyse_structure(df, cols = None):
    """Analyse the structure of a dataframe"""
    print("Shape: ", df.shape)
    print("Columns: ", df.columns)
    print("Data types:\n", df.dtypes, "\n")
    print("Missing values:\n", df.isnull().sum(), "\n")
    print("Unique values:\n", df.nunique(), "\n")
    print("Value counts: ")
    describe_counts(df, df.columns if cols is None else cols)

def analyse_min_max(df, col="Value"):
    """Print the rows holding the highest and lowest value of a column"""
    # idxmax/idxmin skip missing values by default
    highest = df.loc[df[col].idxmax()]
    lowest = df.loc[df[col].idxmin()]

    # Print both rows, separated by a horizontal rule
    print("Highest {}: \n{}".format(col, highest))
    print("-" * 100)
    print("Lowest {}: \n{}".format(col, lowest))
    print("-" * 100)

# Todo: Move this to the framework

In [165]:
# Specify Analysis
analyse_min_max(students_per_state, "Value")

Highest Value: 
Federal State    Nordrhein-Westfalen
Gender                           all
Type                          Pupils
Value                      2338855.0
Year                            2003
Name: 2280, dtype: object
----------------------------------------------------------------------------------------------------
Lowest Value: 
Federal State              Bremen
Gender                          f
Type             School beginners
Value                      2437.0
Year                         2008
Name: 1048, dtype: object
----------------------------------------------------------------------------------------------------


In [166]:
# General Analysis
analyse_structure(students_per_state, ["Federal State", "Gender", "Year", "Type"])

Shape:  (3600, 5)
Columns:  Index(['Federal State', 'Gender', 'Type', 'Value', 'Year'], dtype='object')
Data types:
 Federal State     object
Gender            object
Type              object
Value            float64
Year               int64
dtype: object 

Missing values:
 Federal State     0
Gender            0
Type              0
Value            99
Year              0
dtype: int64 

Unique values:
 Federal State      16
Gender              3
Type                3
Value            3441
Year               25
dtype: int64 

Value counts: 
Federal State             Federal State_count  Gender      Gender_count    Year    Year_count  Type                                       Type_count
----------------------  ---------------------  --------  --------------  ------  ------------  ---------------------------------------  ------------
Baden-Württemberg                         225  m                   1200    1998           144  Pupils                                           1200
Bayern 

As we can see, the counts are mostly identical, but there are fewer rows with the gender 'all' and fewer results for the year 2022. This suggests that the data may be incomplete.

Next, we list every combination of federal state, type and year that does not contain all three gender values, which reveals where the gender 'all' is missing.

In [167]:
students_per_state.groupby(["Federal State", "Type", "Year"]).apply(lambda x: x["Gender"].nunique() != 3).loc[lambda x: x]

Series([], dtype: bool)

The preprocessing appears to be broken, because there are no results for the year 2022, so this has to be fixed. (The table above may already show the corrected results.)
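
The completeness check in this section can also be phrased as a cross-tabulation, which shows directly which year and gender cells are short of rows. The following is only a minimal sketch on a hypothetical toy frame (`toy` and its values are invented for illustration); it assumes the same column names as `students_per_state`.

```python
import pandas as pd

# Hypothetical toy frame that mimics the columns of `students_per_state`.
# The gender 'all' is deliberately missing for 2022.
toy = pd.DataFrame({
    "Federal State": ["Bayern"] * 5,
    "Gender": ["m", "f", "all", "m", "f"],
    "Type": ["Pupils"] * 5,
    "Year": [2021, 2021, 2021, 2022, 2022],
    "Value": [10.0, 12.0, 22.0, 11.0, 13.0],
})

# Rows per (Year, Gender): in a complete dataset every cell holds the same count.
rows_per_year_gender = pd.crosstab(toy["Year"], toy["Gender"])

# Groups that do not contain all three gender values.
genders_per_group = toy.groupby(["Federal State", "Type", "Year"])["Gender"].nunique()
incomplete_groups = genders_per_group[genders_per_group != 3]

print(rows_per_year_gender)
print(incomplete_groups)
```

On the real data, an empty `incomplete_groups` would mean that every state, type and year combination carries all three gender values.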

## 1.2 Students per school type
Now we want to have a look at the students per school type and the structure of the data to understand what data is used.

In [168]:
analyse_min_max(students_per_type, "Value")

Highest Value: 
School Type         Grammar schools (9 years of schooling)
Certificate Type                                     Total
Gender                                               Total
Value                                             268984.0
Year                                                  2006
Name: 2369, dtype: object
----------------------------------------------------------------------------------------------------
Lowest Value: 
School Type                                           Special schools
Certificate Type    Entrance qualification for univ. of appl. scie...
Gender                                                         Female
Value                                                             1.0
Year                                                             1997
Name: 139, dtype: object
----------------------------------------------------------------------------------------------------


In [169]:
# General Analysis
analyse_structure(students_per_type, ["Certificate Type", "Gender", "Year", "School Type"])

Shape:  (5850, 5)
Columns:  Index(['School Type', 'Certificate Type', 'Gender', 'Value', 'Year'], dtype='object')
Data types:
 School Type          object
Certificate Type     object
Gender               object
Value               float64
Year                  int64
dtype: object 

Missing values:
 School Type            0
Certificate Type       0
Gender                 0
Value               1491
Year                   0
dtype: int64 

Unique values:
 School Type           13
Certificate Type       6
Gender                 3
Value               3294
Year                  25
dtype: int64 

Value counts: 
Certificate Type                                      Certificate Type_count  Gender      Gender_count    Year    Year_count  School Type                                  School Type_count
--------------------------------------------------  ------------------------  --------  --------------  ------  ------------  -----------------------------------------  -------------------
Without sec

Again, we can see that some data is missing for `Entrance qualification for universities of applied sciences`, `Total` and `University entrance qualification`, so this has to be fixed. (The table above may already show the corrected results.)
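
To see how the missing values are distributed across the certificate types, the share of missing `Value` entries can be computed per group. This is a hedged sketch on invented toy data (`toy` is hypothetical); only the column names are taken from `students_per_type`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the relevant columns of `students_per_type`.
toy = pd.DataFrame({
    "Certificate Type": [
        "Total",
        "Total",
        "University entrance qualification",
        "University entrance qualification",
    ],
    "Year": [1997, 1998, 1997, 1998],
    "Value": [100.0, np.nan, np.nan, np.nan],
})

# Fraction of missing values per certificate type, largest share first.
missing_share = (
    toy["Value"].isna()
    .groupby(toy["Certificate Type"])
    .mean()
    .sort_values(ascending=False)
)

print(missing_share)
```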

## 1.3 Teachers per state
Now we want to have a look at the teachers per state and the structure of the data to understand what data is used.

In [170]:
teachers.columns

Index(['School Type', 'Contract Type', 'Federal State', 'Gender', 'Year',
       'Number of Teachers'],
      dtype='object')

In [171]:
analyse_min_max(teachers, "Number of Teachers")

Highest Number of Teachers: 
School Type                              Grundschulen
Contract Type         Vollzeitbeschäftigte Lehrkräfte
Federal State                             Deutschland
Gender                                              z
Year                                             1992
Number of Teachers                           119355.0
Name: 1871, dtype: object
----------------------------------------------------------------------------------------------------
Lowest Number of Teachers: 
School Type                          Abendrealschulen
Contract Type         Vollzeitbeschäftigte Lehrkräfte
Federal State                       Baden-Württemberg
Gender                                              z
Year                                             1992
Number of Teachers                                0.0
Name: 27, dtype: object
----------------------------------------------------------------------------------------------------


In [172]:
analyse_structure(teachers, ["School Type", "Contract Type", "Federal State", "Gender", "Year"])

Shape:  (58261, 6)
Columns:  Index(['School Type', 'Contract Type', 'Federal State', 'Gender', 'Year',
       'Number of Teachers'],
      dtype='object')
Data types:
 School Type            object
Contract Type          object
Federal State          object
Gender                 object
Year                    int64
Number of Teachers    float64
dtype: object 

Missing values:
 School Type           0
Contract Type         0
Federal State         0
Gender                0
Year                  0
Number of Teachers    0
dtype: int64 

Unique values:
 School Type             21
Contract Type            3
Federal State           17
Gender                   3
Year                    29
Number of Teachers    9205
dtype: int64 

Value counts: 
School Type                                  School Type_count  Contract Type                           Contract Type_count  Federal State             Federal State_count  Gender      Gender_count    Year    Year_count
---------------------------------

Since every year appears the same number of times, we can assume that the data is complete. We can also see that the school types differ between the federal states, because the states have different school systems.
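
Both observations can be checked programmatically rather than by reading the value counts. The sketch below uses a hypothetical toy frame (`toy`, with invented numbers) that borrows the column names of `teachers`; it is an illustration, not the check used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame that mimics some columns of `teachers`.
toy = pd.DataFrame({
    "School Type": ["Grundschulen", "Abendrealschulen", "Grundschulen",
                    "Abendrealschulen", "Grundschulen", "Grundschulen"],
    "Federal State": ["Baden-Württemberg", "Baden-Württemberg", "Baden-Württemberg",
                      "Baden-Württemberg", "Bremen", "Bremen"],
    "Year": [1992, 1992, 1993, 1993, 1992, 1993],
    "Number of Teachers": [100.0, 5.0, 101.0, 4.0, 20.0, 21.0],
})

# Every year is equally represented if all per-year row counts are identical.
rows_per_year = toy["Year"].value_counts()
years_balanced = bool(rows_per_year.nunique() == 1)

# School types present in each federal state, as a comma-separated string.
school_types_per_state = toy.groupby("Federal State")["School Type"].agg(
    lambda s: ", ".join(sorted(s.unique()))
)

print(years_balanced)
print(school_types_per_state)
```

Note that a balanced row count per year only shows that no rows are missing; it says nothing about whether the recorded values themselves are plausible.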

# 2. Plotting the data
In this section we plot the data to understand it better. For each dataset we want to plot the following:
  - Evolution over time
  - Comparison of the federal states
  - Comparison of the school types
  - Comparison of the school types in the federal states
  
Then we compare the teacher-to-student ratio across the federal states and the school types.
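
As a rough illustration of how such a ratio could be derived, the sketch below sums the teachers per federal state and year, joins them with the student totals, and divides one by the other. All data is invented, and `toy_students`, `toy_teachers` and the column `Teachers per Student` are hypothetical names. It also assumes that the student totals are the rows with gender 'all' and type 'Pupils'; which gender and contract-type rows of the real teacher data have to be selected to avoid double counting is left open here.

```python
import pandas as pd

# Hypothetical student totals, using the columns of `students_per_state`.
toy_students = pd.DataFrame({
    "Federal State": ["Bremen", "Bremen", "Bayern", "Bayern"],
    "Gender": ["all"] * 4,
    "Type": ["Pupils"] * 4,
    "Year": [2020, 2021, 2020, 2021],
    "Value": [1000.0, 1100.0, 5000.0, 5200.0],
})

# Hypothetical teacher counts, split by school type as in `teachers`.
toy_teachers = pd.DataFrame({
    "School Type": ["Grundschulen", "Abendrealschulen"] * 4,
    "Federal State": ["Bremen"] * 4 + ["Bayern"] * 4,
    "Year": [2020, 2020, 2021, 2021] * 2,
    "Number of Teachers": [90.0, 10.0, 100.0, 10.0, 390.0, 10.0, 500.0, 20.0],
})

# Assumption: total pupils are the rows with gender 'all' and type 'Pupils'.
pupils = toy_students[
    (toy_students["Gender"] == "all") & (toy_students["Type"] == "Pupils")
]

# Sum the teachers over all school types per state and year.
teachers_total = toy_teachers.groupby(
    ["Federal State", "Year"], as_index=False
)["Number of Teachers"].sum()

# Join both tables and compute the ratio per state and year.
merged = pupils.merge(teachers_total, on=["Federal State", "Year"], how="inner")
merged["Teachers per Student"] = merged["Number of Teachers"] / merged["Value"]
ratio = merged.set_index(["Federal State", "Year"])["Teachers per Student"]

print(ratio)
```

An inner join silently drops state and year combinations that exist in only one of the two tables, so the years covered by both datasets should be compared before interpreting the result.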