<a href="https://colab.research.google.com/github/ReDI-School/hh-dcp-intro-to-computer-science/blob/main/content_2022/6.%20Python%20Libraries/6.%20Intro_to_Python_Libraries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 14 Nov - Python Libraries


![python gif](https://media.giphy.com/media/KAq5w47R9rmTuvWOWa/giphy.gif)

## Prerequisites

None for today!

---

## Class Curriculum

| Section content                             | Expected time (mins) | Prerequisites      |
| ------------------------------------------- | -------------------- | ------------------ |
| text   | 5 minutes            | ❌                 |
| Lesson Goals                                | 5 minutes            | ❌                 |

## using libraries

"A library is a collection of materials, books or media that are accessible for use. A library provides physical (hard copies) or digital access (soft copies) materials, and may be a physical location or a virtual space, or both." - *Wikipedia*

Python libraries provide functions and classes for all kinds of uses. They consist of well-tested code that has proved useful in many projects. In the end, a library is just code that you do not need to write yourself and that makes your life as a programmer a little easier.
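As a minimal sketch of what "code you do not need to write yourself" means, the snippet below computes an average by hand and then with `statistics`, a module from Python's standard library. The sample values are made up for illustration.

```python
import statistics

# Hypothetical sample values, chosen only for illustration
grades = [1, 2, 2, 3, 5]

# Without a library: write the calculation yourself
manual_mean = sum(grades) / len(grades)

# With a library: reuse a tested implementation
library_mean = statistics.mean(grades)
library_median = statistics.median(grades)

print(manual_mean, library_mean, library_median)
```

Both approaches give the same mean, but the library also offers related functions such as `median` without any extra work.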

## some libraries and usage examples

There are quite a lot of Python libraries available and we can show only a few examples that we will use later in this course.

### NumPy

NumPy is an open source project aiming to enable numerical computing with Python. [https://numpy.org](https://numpy.org)

#### working with arrays

NumPy supports a variety of use cases for numeric or tabular data, which are especially handy in data processing and data science. You can use the following website as a reference:
[https://numpy.org/devdocs/user/quickstart.html](https://numpy.org/devdocs/user/quickstart.html)



In [None]:
import numpy as np

# create a two dimensional array with some numbers
dim2array = np.array([
    [10, 11, 12, 13, 14, 15],
    [20, 21, 22, 23, 24, 25], 
    [30, 31, 32, 33, 34, 35],
    [40, 41, 42, 43, 44, 45]
])

# to see the size of the array dimensions (4 rows with 6 elements each in this case)
print(dim2array.shape)

# element-wise operations (here applied to the first row only)
print("add '3' to each element of the first row: {}".format(
    (dim2array[0]+3))
)

# you can transform the shape of arrays by providing the new dimensions
dim1array = dim2array.reshape(24)
print("one dimensional: {}".format(dim1array))
print("add all elements: {}".format(dim1array.sum()))
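The following sketch extends the idea of shapes and sums to individual axes. The array `grid` is a hypothetical example and not part of the lesson data; `axis=0` sums down the columns and `axis=1` sums across the rows.

```python
import numpy as np

# Hypothetical 3x4 array holding the numbers 0..11
grid = np.arange(12).reshape(3, 4)

# Sum along axis 0: one result per column
column_sums = grid.sum(axis=0)

# Sum along axis 1: one result per row
row_sums = grid.sum(axis=1)

# -1 lets NumPy work out the length of the flattened array itself
flat = grid.reshape(-1)

print(column_sums, row_sums, flat.shape)
```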


#### loading and storing data to files

In real-life situations you will not be able to write all of your data into your source code file. NumPy has some convenience functions to load and store array data as text files. [https://numpy.org/doc/stable/reference/generated/numpy.savetxt.html](https://numpy.org/doc/stable/reference/generated/numpy.savetxt.html)

When writing data to a file with `savetxt`, you specify a format (`fmt`) using printf-style codes such as:

    c : character
    d or i : signed decimal integer
    f : decimal floating point
    s : string of characters
    u : unsigned decimal integer
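As a small sketch of how these format codes are used, the example below writes a purely numeric array with one code per column and reads it back. The data and the in-memory `StringIO` buffer (used here instead of a real file) are assumptions made for illustration.

```python
import numpy as np
from io import StringIO

# Hypothetical numeric data: a rank column and a percentage column
values = np.array([[1, 29.48], [2, 17.18], [3, 9.14]])

# Write to an in-memory text buffer instead of a file on disk
buffer = StringIO()

# One format code per column: integer, then float with two decimals
np.savetxt(buffer, values, delimiter=';', fmt=['%d', '%.2f'])
text = buffer.getvalue()
print(text)

# Read the same text back into an array
restored = np.loadtxt(StringIO(text), delimiter=';')
print(restored)
```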

In [None]:
import numpy as np
from io import StringIO

names = np.array([['Python',29.48],['Java',17.18],['JavaScript',9.14],['C#',6.94],['PHP',6.49],['C/C++',6.49],['R',3.59],['TypeScript',2.18],['Swift',2.1],['Objective-C',2.06]])
np.savetxt('programming-languages.txt', names, delimiter=';', fmt=("%s"))

# load data from CSV file:  
f = open('programming-languages.txt','r')

# alternatively you could provide Data as String
#f = StringIO("Python;29.48\nJava;17.18\nJavaScript;9.14\nC#;6.94\nPHP;6.49\nC/C++;6.49\nR;3.59\nTypeScript;2.18\nSwift;2.1\nObjective-C;2.06")

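# 'C#' contains '#', the default comment character, so we set comments to a character that does not occur in the data
language, percentage = np.loadtxt(f, dtype={'names': ('language', 'percentage'), 'formats': ('U15', 'f4')}, delimiter=';', comments='§', unpack=True)
f.close()
print(language)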

### pandas

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. [https://pandas.pydata.org/](https://pandas.pydata.org/)

#### loading and storing CSV data from/to files

pandas is a rather large library with many features, so we will look at a data visualization showcase to get a grasp of its capabilities.
Let's take, for example, some RKI data with corona cases per 'Bundesland' from https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Daten/Inzidenz-Tabellen.html?nn=13490888
You can find the data in the 'fallzahlen.txt' file.

In [None]:
import pandas as pd

# parse tab-separated data with non-standard formatting (transpose swaps rows and columns)
fallzahlen = pd.read_csv("fallzahlen.txt", sep='\t', parse_dates=[0], index_col=0).transpose()

# Pandas will try to guess the data type of the columns
# fallzahlen = fallzahlen.apply(pd.to_numeric, errors='ignore')

# to display the first 5 rows of our data:
fallzahlen.head(5)

In [None]:
# display details of the DataFrame
fallzahlen.info()

In [None]:
# plot as a line diagram
fallzahlen[["Hamburg","Berlin","Bayern","Nordrhein-Westfalen"]].plot(rot=45, figsize=(22, 8))

In [8]:
# load the file into the DataFrame 'einwohner'
einwohner = pd.read_csv("Bundesland_Einwohner_2020", sep=';', index_col=0).transpose()
# we can access the numbers like this:
einwohner["Bayern"].iloc[0]

13140183

To be able to compare the numbers, we will calculate the number of infections per 100,000 citizens, for example:

Bayern: 1318 cases

Bayern: 13140183 citizens

      1318 / (13140183/100000) = 10.03030170888792
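The same arithmetic can be sketched as a small Python function. The name `cases_per_100k` is a hypothetical helper introduced only for this illustration.

```python
def cases_per_100k(cases, population):
    """Return the number of cases per 100,000 citizens."""
    return cases / (population / 100_000)


# Numbers for Bayern taken from the example above
bayern = cases_per_100k(1318, 13_140_183)
print(round(bayern, 2))
```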

The number of citizens per Bundesland comes from Wikipedia https://de.wikipedia.org/wiki/Liste_der_deutschen_Bundesl%C3%A4nder_nach_Bev%C3%B6lkerung
and is stored in the file "Bundesland_Einwohner_2020", which we loaded into `einwohner` above:

In [None]:
# we copy the dataframe because we change the contents below
normalized = fallzahlen.copy()

# now we overwrite each Bundesland column with the numbers from the formula
for title in einwohner.columns:
  normalized[title]=normalized[title]/(einwohner[title].iloc[0]/100000)

# "normalized" now holds the number of corona cases for each date and every Bundesland per 100,000 citizens
print(normalized.head())

# now we plot all columns except "Gesamt" 
normalized.drop(['Gesamt'], axis=1).plot(rot=45, figsize=(22, 16))
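The loop above can also be expressed without an explicit loop. The sketch below uses `DataFrame.div` on two hypothetical miniature tables; the Hamburg figures and the dates are invented, and only the Bayern numbers come from the example in the text.

```python
import pandas as pd

# Hypothetical miniature version of the case table (dates as row labels)
cases = pd.DataFrame(
    {"Hamburg": [100, 200], "Bayern": [1000, 1318]},
    index=["2021-01-01", "2021-01-02"],
)

# Hypothetical population figures, one value per Bundesland
population = pd.Series({"Hamburg": 1_850_000, "Bayern": 13_140_183})

# Divide every column by its matching population (aligned on column names),
# then scale the result to cases per 100,000 citizens
per_100k = cases.div(population, axis="columns") * 100_000
print(per_100k.round(2))
```

Because pandas aligns the Series index with the column names, the order of the entries in `population` does not matter.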


In [2]:
#pandasstart
import pandas as pd
import numpy as np

#%% Series
my_series = pd.Series(np.arange(3), index=['a', 'b', 'c'])
# %%
# my_series[0]  # access via index
my_series['a']  # access via label

#%% create dataframe
# from dict
my_dict = {'name': ['Bob', 'Stuart', 'Kevin'], 'grades': [1, 2, 3]}
df = pd.DataFrame(my_dict)
df
# %%
# from np array
# df = pd.DataFrame(np.random.rand(2, 2), columns=['height', 'width'])
# df.columns

# now set/change column names after creation
df = pd.DataFrame(np.random.rand(2, 2))
df.columns = ['height', 'width']
df

Unnamed: 0,height,width
0,0.308683,0.535918
1,0.568412,0.945405


In [None]:
#eda
# %% [markdown]
# Import **pandas** as the required package for working with dataframes.

# %%
import pandas as pd

# A dataframe is a multi-dimensional table with rows and columns.

# %% Creating a Dataframe
# Usually you import a dataframe from a file, a SQL server, or a web-resource. But here I will show you how to create a dataframe from scratch.
# 
# You can create a dataframe based on lists, tuples, arrays. Here we develop it based on a dictionary.

# %%
data = {
    'A': [1,2,3],
    'B': [4,5,6],
    'C': [7,8,9]
}
df = pd.DataFrame(data=data)
df

# A dataframe has columns (here: A, B, C), and rows. The index is created automatically and starts at 0.

# # Import and Export of Dataframes

# You can export a dataframe into different formats like Excel, JSON, ... Here I export it to a CSV file.

# %%
filename = 'df.csv' 
df.to_csv(filename, index=False)

# Similarly the dataframe can be imported with **pandas**. There are many different read-functions to import from different formats.

# %%
df = pd.read_csv(filename)

# # Exploratory Data Analysis

# %% [markdown]
# You can explore the data with *head()* to see the first observations. If you are interested in the last observations go with *tail()*. The argument refers to the number of observations to be shown.

# %%
df.head(2)

# %% [markdown]
# Statistical properties are shown with the *describe()* method.

# %%
df.describe()

# %% [markdown]
# A general summary on the dataframe is provided by *info()* method.

# %%
df.info()

# %% [markdown]
# Often you are interested in getting the number of rows and columns. You can get this with the shape property.

# %%
df.shape

# %% [markdown]
# The column-names are stored in the property *columns*.

# %%
df.columns

In [None]:
#modify
# %% [markdown]
# Import **pandas** as the required package for working with dataframes.

# %%
import pandas as pd

# A dataframe is a multi-dimensional table with rows and columns.

# %% Creating a Dataframe
# Usually you import a dataframe from a file, a SQL server, or a web-resource. But here I will show you how to create a dataframe from scratch.
# 
# You can create a dataframe based on lists, tuples, arrays. Here we develop it based on a dictionary.

# %%
data = {
    'A': [1,2,3],
    'B': [4,5,6],
    'C': [7,8,9]
}
df = pd.DataFrame(data=data)
df

# A dataframe has columns (here: A, B, C), and rows. The index is created automatically and starts at 0.

# # Adding/Modifying Columns

# %%
df['D'] = list(range(10,13))
df

# %% [markdown]
# # Delete Rows or Columns

# %% [markdown]
# If you want to delete a column use method *drop()* and specify the column name. The argument axis needs to be 1 for columns. With inplace set to true the dataframe is directly modified.

# %%
df.drop('C', axis=1, inplace=True)
df

# %% [markdown]
# Similarly you can delete rows by specifying the index of the row, the axis is 0 for rows and inplace is set to true to change the dataframe directly.

# %%
df.drop(1, axis=0, inplace=True)
df

# %% [markdown]
# # Apply a lambda function to a column

# %% [markdown]
# You can also apply a specific function to a column.

# %%
my_func = lambda x: x + 2

df['E'] = df['A'].apply(my_func)
df

# %% [markdown]
# # Reshape your dataframe structure

# %% [markdown]
# You can reshape your dataframe structure from wide data to tidy data and vice versa. We are starting with wide-data.
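A minimal sketch of such a reshape, assuming a small made-up grade table: `melt` turns wide data into tidy data, and `pivot` turns it back. The column names `subject` and `grade` are chosen here for illustration only.

```python
import pandas as pd

# Wide data: one row per student, one column per subject (made-up grades)
wide = pd.DataFrame({
    "student": ["Bob", "Stuart", "Kevin"],
    "art": [1, 2, 3],
    "sport": [2, 2, 1],
})

# Wide -> tidy: one row per (student, subject) observation
tidy = wide.melt(id_vars="student", var_name="subject", value_name="grade")
print(tidy)

# Tidy -> wide: subjects become columns again
wide_again = tidy.pivot(index="student", columns="subject", values="grade").reset_index()
print(wide_again)
```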

In [3]:
#filtering #dataneeded
#%% packages
import pandas as pd

#%%
# source: https://perso.telecom-paristech.fr/eagan/class/igr204/datasets
file_path = './factbook.csv'
df = pd.read_csv(file_path, sep=';', skiprows=[1])

df
# %% Data Filtering
# select columns
# select columns like df['Country']
type(df['Country'])  # returns Series
type(df[['Country']])  # returns DataFrame
# %%
# 1. find number of unique countries
len(pd.unique(df['Country']))
# %%
# select rows
# 2. get all countries with more than 1E6 people
df[df['Population']>1E6]
# %%
df['Population']>1E6
# %%

df[['Country', 'Population']].shape


# %%
(df['Population']>1E6).value_counts()
# %%
df[df['Population']>1E6]
# %% loc
df.loc[1:3, ['Country']]
# %% iloc
df.iloc[10:20, -4:]  # row 10 to 19, last four cols

# integer location
df.iloc[0,0]  # get the first cell (first row, first column)

# %%
df.iloc[1, :]  # 2nd row

# %%
df.iloc[:, -1]  # last column

# %% perform multiplication
df['Population'] /1000 * df['Birth rate(births/1000 population)']
# %% find the most populous country in the world
df['Population'].sort_values(ascending=False)
# %%
df.loc[[49], :]  # manual approach
# %%
df.loc[[df['Population'].idxmax()], :]

Unnamed: 0,Country,Area(sq km),Birth rate(births/1000 population),Current account balance,Death rate(deaths/1000 population),Debt - external,Electricity - consumption(kWh),Electricity - production(kWh),Exports,GDP,...,Oil - production(bbl/day),Oil - proved reserves(bbl),Population,Public debt(% of GDP),Railways(km),Reserves of foreign exchange & gold,Telephones - main lines in use,Telephones - mobile cellular,Total fertility rate(children born/woman),Unemployment rate(%)
49,China,9596960,13.14,30320000000.0,6.94,233300000000.0,1630000000000.0,1910000000000.0,583100000000.0,7262000000000.0,...,3392000.0,17740000000.0,1306314000.0,31.4,70058.0,609900000000.0,263000000.0,269000000.0,1.72,9.8
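To make the difference between boolean masks, `loc`, and `iloc` easier to see, here is a sketch on a tiny hypothetical table. The country names and population figures are invented, and the index deliberately starts at 10 so that labels and positions differ.

```python
import pandas as pd

# Hypothetical mini dataset (figures invented for illustration)
countries = pd.DataFrame(
    {"Country": ["A", "B", "C", "D"], "Population": [5e5, 2e6, 8e6, 3e5]},
    index=[10, 11, 12, 13],
)

# Boolean mask: True for every row that satisfies the condition
mask = countries["Population"] > 1e6
large = countries[mask]

# loc selects by label and includes the end label
by_label = countries.loc[10:11, ["Country"]]

# iloc selects by position and excludes the end position
by_position = countries.iloc[0:1, :]

# idxmax returns the index label of the largest value
largest = countries.loc[countries["Population"].idxmax(), "Country"]

print(large, by_label, by_position, largest, sep="\n\n")
```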


### Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. [https://matplotlib.org/](https://matplotlib.org/)

Matplotlib supports a number of visualizations for common data structures out of the box, for example:

In [None]:
import matplotlib.pyplot as plt

# drawing a bar diagram
plt.bar(language, percentage, color = 'g', label = 'File Data')
  
plt.xlabel('Programming languages', fontsize = 12)
plt.ylabel('popularity in %', fontsize = 12)

  
plt.title('Programming language popularity', fontsize = 20)
plt.xticks(rotation=45, ha="right")
plt.legend()
plt.show()

In [None]:
colors = ['yellow', 'b', 'green', 'cyan','red'] 
top5Percent = np.split(percentage,2)[0]
top5language = np.split(language,2)[0]
print("displaying pie chart of {}".format(top5language))

# plotting pie chart 
plt.pie(top5Percent, labels = top5language, colors = colors, startangle = 90,
        shadow = True, radius = 1.2) 
plt.show()

In [None]:
import pandas as pd
import numpy as np

#%%
df = pd.DataFrame({'language': ['R', 'Python', 'SQL', 'R', 'R', 'Python', 'Python'], 
                    'year': [2020, 2020, 2020, 2021, 2022, 2022, 2022],
                    'users': [1E6, 2E6, 0.5E6, 1.1E6, 1.2E6, 2.2E6, 2.4E6]})
df
# %%
df['users'].plot()
# %%
import matplotlib.pyplot as plt
plt.plot(df.loc[df['language']=='Python', 'users'])
plt.plot(df.loc[df['language']=='R', 'users'])
plt.plot(df.loc[df['language']=='SQL', 'users'])
plt.show()

In [None]:
#matplotlibintro
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

#%% 1D data -> automatic x labels
plt.plot([0, 2, 5, 6])
plt.title('my first plot')
plt.xlabel('automatically generated')
plt.show()


#%% 2D data
plt.plot([1, 3, 5, 7], [-10, 5, 25, 20])

#%% formatting
# color and line type: the default is 'b-' (solid blue line); 'go' draws green circles
plt.plot([1, 3, 5, 7], [-10, 5, 25, 20], 'go')
plt.axis([0, 10, -15, 30])  # [xmin, xmax, ymin, ymax]

#%% multiple series
my_range = np.arange(0, 10, 1)
# plt.plot(my_range, my_range)
# plt.plot(my_range, my_range, 'b-')
plt.plot(my_range, my_range, 'b-o', my_range, my_range**0.5, 'g-s', my_range, my_range**1.5, 'r-^')


#%% Data Import
filename = "Diamonds.csv"
diamonds = pd.read_csv(filename)

# %% additional dimensions
diamonds['volume'] = diamonds['x'] * diamonds['y'] * diamonds['z']
plt.scatter(data=diamonds.iloc[:100, :], x='carat', y = 'depth', s='price', c='volume')
plt.xlabel('carat')
plt.ylabel('depth')
plt.colorbar(label='volume')
plt.show()


In [None]:
import pandas as pd

import matplotlib.pyplot as plt

programmingLanguages = pd.read_csv("programming-languages.txt", sep=';', header=None)
programmingLanguages.head(5).to_csv("programming-languages.csv", header=["language","percent"])

### seaborn

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. [https://seaborn.pydata.org/](https://seaborn.pydata.org/)

In [None]:
#seaborn
import seaborn as sns
import pandas as pd

# %%
filename = "Diamonds.csv"
diamonds = pd.read_csv(filename)
diamonds['volume'] = diamonds['x'] * diamonds['y'] * diamonds['z']
#%%
sns.set_theme(style='darkgrid')
s= sns.relplot(data=diamonds.iloc[1500:2000, :], x='carat', y = 'price', size='depth', hue='cut', style='clarity')
s.set(ylim = (3000, 3150))

## Data Manipulation

### Data Grouping

In [None]:
import pandas as pd
import numpy as np

#%%
df = pd.DataFrame({'language': ['R', 'Python', 'SQL', 'R', 'R', 'Python', 'Python'], 
                    'year': [2020, 2020, 2020, 2021, 2022, 2022, 2022],
                    'users': [1E6, 2E6, 0.5E6, 1.1E6, 1.2E6, 2.2E6, 2.4E6]})
df

# %% understand grouping
for name, group in df.groupby('language'):
    mean_val = group['users'].mean()
    print(f"{name}: {mean_val}")

# %% group on a column and average over all other cols
# df.groupby('language').mean()
df.groupby('language').agg('mean')  # newer pandas versions prefer the string 'mean' over np.mean
# %% group on a column, average over a specific column
df.groupby('language')['users'].mean()
# %% group by multiple columns
df.groupby(['language', 'year'])['users'].mean()
# %% several aggregation functions
df.groupby('language').agg({'users': ['mean', 'min', 'max', 'sum']})

# %%
df.groupby('language').agg(['mean', 'sum'])

# %%
import pandas as pd
data = {'group_col': ['A', 'B', 'C', 'A', 'B', 'A'],
                   'value_col': [0, -2, 5, 2, 2, 4]}
df = pd.DataFrame(data)
df


#%%

# %%
for name, group in df.groupby('group_col'):
    group_values = group['value_col'].tolist()
    print(f"{name}: {group_values}")
# %%
df.groupby('group_col').agg({'value_col': ['mean', 'sum']})

# %%
df.groupby('group_col').size()
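When several aggregations are needed, the result columns can be given readable names. The sketch below uses pandas named aggregation on the same language data as above; the output column names `mean_users` and `max_users` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'language': ['R', 'Python', 'SQL', 'R', 'R', 'Python', 'Python'],
                   'year': [2020, 2020, 2020, 2021, 2022, 2022, 2022],
                   'users': [1E6, 2E6, 0.5E6, 1.1E6, 1.2E6, 2.2E6, 2.4E6]})

# Named aggregation: new_column=(source_column, aggregation_function)
summary = df.groupby('language').agg(
    mean_users=('users', 'mean'),
    max_users=('users', 'max'),
)
print(summary)
```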

In [None]:
#Exercise
#%% packages

#%% 0. add required packages at the top of this script

# %% 1. import dataset from subfolder "data"

# %% 2. how many rows and columns does it have

# %% 3. which columns does it have?

# %% 4. There is a column named 'Unnamed: 0'. Please delete it.

# %% 5. which levels does column cut have?

# %% 6. create a barplot for diamonds and their cuts
#%% 7. save the graph as png-file

# %% 8. find out what the median price per cut is.

# %% 9. Create two filtered dataframes which only have the cut 'Ideal' and 'Premium'. Store them in the dataframes 'df_cut_ideal' and 'df_cut_premium'

# %% 10. stack the dataframes together to get a combined version df_cut_ideal_premium

# %% 11. save the dataframe to a csv file in subfolder data: 'df_cut_ideal_premium.csv'

In [None]:
#join
import pandas as pd
import numpy as np

#%% prepare dataframes
# use main characters of animated films

minions = pd.DataFrame({
    'student': ['Stuart', 'Bob', 'Kevin', 'Gru'],
    'art': [4,2,1, 2]
    
})
print(f"minions:\n {minions}")

despicable_me = pd.DataFrame({
    'student': ['Agnes', 'Margo', 'Edith', 'Gru'],
    'sport': [1,2,2, 3]
    
})
print(f"despicable me:\n {despicable_me}")






frozen = pd.DataFrame({
    'student': ['Anna', 'Elsa', 'Olaf'],
    'art': [4,2,1]
})
print(f"frozen:\n {frozen}")

simpsons = pd.DataFrame({
    'student': ['Bart', 'Lisa'],
    'math': [5,1],
    'sport': [1,5]
    
})
print(f"simpsons:\n {simpsons}")

#%% right join (keeps all rows of the right dataframe)
# how: left, right, inner, outer
# minions.merge(right=despicable_me, how='right', on='student', indicator=True)
minions.merge(right=despicable_me, how='right', left_on='student', right_on='student', indicator=True)

#%% join via index
# set_index returns a new dataframe that uses the 'student' column as index
despicable_me_index = despicable_me.set_index('student')
print(f"despicable me index:\n {despicable_me_index}")

minions.merge(right=despicable_me_index, how='right', left_on='student', right_index=True, indicator=True).reset_index(drop=True)



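# %% append rows
# DataFrame.append() was removed in pandas 2.0, so we use pd.concat() instead
pd.concat([minions, frozen])
# %% append rows and renumber the index from 0
pd.concat([minions, frozen], ignore_index=True)
# %% append columns (axis=1 places the dataframes side by side, aligned on the index)
pd.concat([minions, simpsons], axis=1)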
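To see how the `how` parameter changes the result of a merge, the sketch below joins the two dataframes from this cell with each option and counts the resulting rows. Only 'Gru' appears in both tables, so the row counts differ noticeably.

```python
import pandas as pd

minions = pd.DataFrame({
    'student': ['Stuart', 'Bob', 'Kevin', 'Gru'],
    'art': [4, 2, 1, 2]
})
despicable_me = pd.DataFrame({
    'student': ['Agnes', 'Margo', 'Edith', 'Gru'],
    'sport': [1, 2, 2, 3]
})

# Merge with every join type and record how many rows come out
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = minions.merge(despicable_me, how=how, on='student')
    row_counts[how] = len(merged)

print(row_counts)
# → {'inner': 1, 'left': 4, 'right': 4, 'outer': 7}
```

An inner join keeps only the shared student, left and right joins keep all rows of one side, and an outer join keeps every student from both tables.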