# Data Analysis in Python

Let's start with some basic data analysis in Python! For this we need two popular Python packages:

- ***Pandas***: "[pandas](https://pandas.pydata.org) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language"


- ***Matplotlib***: "[Matplotlib](https://matplotlib.org) is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms"

<img src="images/pandas_matplotlib.png" width=200 align=right />

## Installation

```conda install pandas matplotlib```

OR

```pip install pandas matplotlib```


## Powerful pandas

Pandas is a very powerful and flexible tool for data munging and preparation.

Pandas was designed around 3 data structures, of which only the first two exist in current versions:
- **Series**: homogeneous 1-dimensional array
- **DataFrame**: heterogeneous 2-dimensional array
- **Panel**: heterogeneous 3-dimensional array (deprecated in pandas 0.20 and removed in pandas 0.25)
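
To make the difference between a Series and a DataFrame concrete, here is a minimal sketch. The variable names and the area values are made up for illustration and are not part of the course data.

```python
import pandas as pd

# A Series stores one dtype for all of its values.
elevation = pd.Series([667, 34, 15])

# A DataFrame is a collection of columns, each with its own dtype.
cities_demo = pd.DataFrame({
    "City": ["Madrid", "Berlin", "Lisbon"],   # strings
    "Elevation": [667, 34, 15],               # integers
    "Area": [604.3, 891.7, 100.1],            # floats (illustrative values)
})

print(elevation.dtype)
print(cities_demo.dtypes)

# Selecting a single column of a DataFrame gives back a Series.
print(type(cities_demo["Elevation"]))
```

In other words, a DataFrame can be read as several Series that share one index.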


<img src="images/dataframe.png" width=400   />

### Pandas Series

A Series is a one-dimensional object similar to an array, list, or column in a table.

In [None]:
import pandas as pd

In [None]:
# create a Series with an arbitrary list
s = pd.Series(['Würzburg',127.880, 177, 87.6 ])
s

By default, pandas creates an integer index from 0 to N-1, where N is the number of items. Of course we can also define our own index

In [None]:
s = pd.Series(['Würzburg',127.880, 177, 87.6],
              index=['Name', 'Population', 'Altitude', 'Area'])
s

You can also use dictionaries to create a pandas series

In [None]:
d = {'Name': 'Würzburg', 'Population': 127.880, 'Altitude': 177, 'Area': 87.6}
s = pd.Series(d)
s

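We can now use the index labels to select specific items from the Series, or use .iloc for integer (position-based) indexing and slicing

In [None]:
s['Name']     # select by label
s.iloc[0]     # select by position (s[0] on a labelled Series is deprecated in recent pandas)
s.iloc[:2]    # slice by position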

### DataFrames

In order to create a Pandas DataFrame we can pass a dictionary of lists to the DataFrame constructor. 

In [None]:
data = {'City': ['Madrid', 'Berlin', 'Lisbon', 'Paris', 'Rome', 'Copenhagen', 'London'],
        'Elevation': [667, 34, 15, 34, 14, 10, 14],
        'Population': [3266126, 3664088, 544851, 2165423, 2837332, 805420, 9002488],
        'Country': ['Spain', 'Germany', 'Portugal', 'France', 'Italy', 'Denmark', 'United Kingdom']}
cities = pd.DataFrame(data, columns=['City', 'Elevation', 'Population', 'Country'])
cities

Much more often, you'll have a dataset you want to read into a DataFrame. Let's read a real-world dataset into Python. The dataset contains several observations from three different weather stations (5705: 'Würzburg', 282: 'Bamberg', 1420: 'Frankfurt'). The pd.read_csv function reads a CSV file into a DataFrame in our Python environment. Of course pandas can read many more file formats than CSV. For more information have a look at https://pandas.pydata.org/docs/user_guide/io.html

In [None]:
import pandas as pd

df = pd.read_csv('../Data/nonspatial/station_data.csv', sep=',')
df

Note that rows and columns are labelled using indices

In this case:

- Rows are labelled with integers

- Columns are labelled with column names
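
The two kinds of labels can be inspected directly. The following is a small sketch on a hypothetical mini table (the values are invented and only stand in for the station data):

```python
import pandas as pd

# Hypothetical mini table standing in for the station data.
demo = pd.DataFrame({
    "station_id": [5705, 282, 1420],
    "value": [9.8, 9.1, 10.6],
})

print(demo.index)     # row labels: a RangeIndex 0, 1, 2
print(demo.columns)   # column labels

# Row labels do not have to be integers: any column can take that role.
by_station = demo.set_index("station_id")
print(by_station.loc[282, "value"])
```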

## Data exploration


### DataFrame Properties

First of all we want to get a first glance at the data we are dealing with. Pandas provides built-in methods for data inspection. The methods head() and tail() allow us to look at the first/last rows of our DataFrame

In [None]:
df.head()

In [None]:
df.tail()               # show data tail

DataFrame dimensions (rows, columns) are accessible through the .shape attribute. Note that it is an attribute, so it is used without parentheses

In [None]:
df.shape

The DataFrame column names can be obtained using the .columns attribute

In [None]:
df.columns # display column names

If we want to know the data types of our columns we can just use the .dtypes attribute

In [None]:
df.dtypes

The .info() function is used to get a concise summary of the dataframe.

In [None]:
df.info()

The describe() function is used to generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution.

In [None]:
df.describe()

### Subsetting data

There are numerous ways of subsetting Series and DataFrames in pandas, and it is sometimes confusing.
The recommended subsetting methods are .loc and .iloc. The principal difference between .loc and .iloc is:

- .loc selects by label (the pandas index and the column names), while
- .iloc selects by integer position (numpy-style, starting at 0).

A DataFrame is a 2-dimensional object, which is why .loc and .iloc accept two values separated by a comma: the first one for the rows and the second one for the columns
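
The difference only becomes visible when the row labels are not simply 0, 1, 2, and so on. A minimal sketch, assuming a hypothetical frame with invented values and a custom integer index:

```python
import pandas as pd

# Hypothetical frame whose row labels are deliberately NOT 0, 1, 2, ...
demo = pd.DataFrame(
    {"parameter": ["temp", "prec", "temp"], "value": [281.2, 1.4, 279.9]},
    index=[10, 20, 30],
)

# .loc looks up the row *label* 10 and the column label "value".
by_label = demo.loc[10, "value"]

# .iloc looks up the row at *position* 0 and the column at position 1.
by_position = demo.iloc[0, 1]

print(by_label, by_position)   # both refer to the same cell here

# Slices differ too: .loc includes the end label, .iloc excludes the end position.
print(len(demo.loc[10:20]))    # 2 rows (labels 10 and 20)
print(len(demo.iloc[0:1]))     # 1 row  (position 0 only)
```

With the default integer index both accessors often return the same rows, which is why the distinction is easy to miss.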

### Selecting columns

In [None]:
df.loc[:, "parameter"] # select column as Series

df.loc[:, ["parameter"]] # select column as DataFrame

df["parameter"]  ## Shortcut for df.loc[:, "parameter"]

df[["parameter"]]  ## Shortcut for df.loc[:, ["parameter"]]

df[["parameter", "value"]] ## Select multiple columns

df.loc[:,"station_id":"date"] ## Slice columns

### Selecting rows

Rows can be selected in the same way using .loc[] and .iloc[]

In [None]:
df.iloc[[0], :]     ## get 1st row

df.iloc[0:3, :]     ## 1st row (inclusive) to 4th row (exclusive)

df.iloc[[0, 12], :]  ## get first and 13th row

df.loc[[0, 12], :]  ## get the rows labelled 0 and 12 (the same rows as above with the default index)

### Select rows and columns
Of course we can also combine the methods we used above

In [None]:
df.iloc[0:4, 0:2] 

df.loc[:, "station_id":"date"].iloc[0:4, :]

###  Summary functions

Pandas provides a large set of operators and summary functions that operate on different kinds of pandas objects. Let us use them to find out more about our dataset.

In [None]:
## time period
 
df['date'].min()
df['date'].max()

In [None]:
## parameters measured

df['parameter'].unique()

In [None]:
## mean quality

df['quality'].mean()

In [None]:
## number of observations by station

df['station_id'].value_counts()

###  Filtering

Additionally we can use conditional statements in order to filter our DataFrame. In this example we want to extract the maximum precipitation for the station Würzburg

In [None]:
## extract all data for station Würzburg

df_wue = df.loc[df['station_id'] == 5705]
df_wue

Of course we can also filter data using multiple conditions

In [None]:
prec_wue = df.loc[(df['station_id'] == 5705) & (df['parameter'] == 'precipitation_height') ]
prec_wue['value'].max()

Sometimes the condition we are interested in is not a single value, but several values. For this we could of course just use something like this:

In [None]:
df[(df['station_id'] == 5705) | (df['station_id'] == 282)]

... or we can use the .isin() method

In [None]:
sel = df['station_id'].isin([5705,282])
df[sel]
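
All of these filters rely on boolean masks. The sketch below shows how masks are built and combined; the small table and its values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format observations.
demo = pd.DataFrame({
    "station_id": [5705, 282, 1420, 5705, 282],
    "parameter": ["prec", "prec", "prec", "temp", "temp"],
    "value": [1.2, 0.0, 3.4, 280.1, 279.5],
})

# A comparison produces a boolean Series, one True/False per row.
is_prec = demo["parameter"] == "prec"

# Masks combine with & (and), | (or), ~ (not); each condition needs parentheses.
either = (demo["station_id"] == 5705) | (demo["station_id"] == 282)
same = demo["station_id"].isin([5705, 282])

print(either.equals(same))      # the two masks are identical
print(demo[is_prec & same])     # precipitation rows for the two stations
print(demo[~same])              # everything else
```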

### Sorting

A DataFrame can be sorted using the .sort_values method. We can of course also sort by index using .sort_index()

In [None]:

temp_wue = df.loc[(df['station_id'] == 5705) & (df['parameter'] == 'temperature_air_mean_200') ]

temp_wue.sort_values(by = 'value')                   #Order rows by values of a column (low to high).
temp_wue.sort_values(by = 'value', ascending=False)   #Order rows by values of a column (high to low).

## Data manipulation

In the next section we will prepare our dataset for analysis using some of the basic data manipulation methods provided by pandas. In this example we will

- change station_id to actual station names
- reshape dataset from long to wide format
- use timestamps as index
- convert temperature data to Celsius
- make some first visualisations

### Remapping  values

First of all we want to replace our station_id with the real world name of the station. 

In [None]:
new_values = {5705:'Würzburg',282:'Bamberg',1420:'Frankfurt'}

df2=df.replace({"station_id": new_values})

df2

### Reshaping dataframe

Pandas offers different functions to reshape your DataFrame

<img src="images/reshaping.png" width=700   />

https://pandas.pydata.org/

In [None]:
data = df2.pivot(index=['date','station_id'], columns='parameter', values='value').reset_index()
data
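
What pivot() does is easier to see on a tiny table. This is a minimal sketch with invented values; the short parameter names are placeholders, not the names used in the station data.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, parameter) measurement.
long_df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "parameter": ["temp", "prec", "temp", "prec"],
    "value": [275.3, 0.4, 276.1, 2.0],
})

# Wide format: one row per date, one column per parameter.
wide_df = long_df.pivot(index="date", columns="parameter", values="value")
print(wide_df)

# melt() goes the other way, from wide back to long.
back_to_long = wide_df.reset_index().melt(
    id_vars="date", var_name="parameter", value_name="value"
)
print(back_to_long)
```

pivot() raises an error if an index/column combination occurs more than once; in that case pivot_table() with an aggregation function is the usual alternative.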

## Modifying the index

As we will see, the index plays an important role in many operations in pandas. Therefore, often we would like to set a more meaningful, custom index. Especially when dealing with time series data we want to set the index to the time points 

- .index for Series 
- .set_index for DataFrame 

In [None]:
data = data.set_index('date')                   
data.head()
data

Especially when working with time series it is recommended to convert your timestamps into datetime format, which adds some specific methods to our time data

In [None]:
data.index = pd.to_datetime(data.index)
data.index.year
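
A short sketch of what a datetime index makes possible, using a hypothetical four-day series with invented values:

```python
import pandas as pd

# Hypothetical daily values indexed by date strings.
ts = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0]},
    index=["2019-12-30", "2019-12-31", "2020-01-01", "2020-01-02"],
)

# Convert the string labels into a DatetimeIndex.
ts.index = pd.to_datetime(ts.index)

print(ts.index.year.tolist())    # calendar components become attributes
print(ts.index.month.tolist())

# With a DatetimeIndex, .loc accepts partial date strings such as a year.
print(ts.loc["2020"])

# resample() groups rows by calendar period, here by year start ("YS").
print(ts.resample("YS").sum())
```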

### Arithmetic operations

We can also combine values of different columns and apply pairwise arithmetic or boolean operators. As we can see, our temperature data is still in Kelvin, so let's convert it to degrees Celsius

In [None]:
data.reset_index().reset_index()

In [None]:
data['temperature_air_mean_200'] - 273.15

Or we can write a function and apply it

In [None]:
def kelvin2cel(x):
    x = x - 273.15
    return x

In [None]:
data["temperature_air_mean_200"].apply(kelvin2cel)
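
Both approaches give the same values. The following sketch compares them on a hypothetical Series; the function name `kelvin_to_celsius` is only used in this example.

```python
import pandas as pd

# Hypothetical temperatures in Kelvin.
kelvin = pd.Series([273.15, 283.15, 293.15])

def kelvin_to_celsius(x):
    """Convert a temperature from Kelvin to degrees Celsius."""
    return x - 273.15

# Vectorised arithmetic works on the whole Series at once.
vectorised = kelvin - 273.15

# .apply() calls the function once per element and gives the same values here.
applied = kelvin.apply(kelvin_to_celsius)

print(vectorised.round(2).tolist())
print(applied.round(2).tolist())
```

For simple arithmetic the vectorised form is usually faster; .apply() is mainly useful when the logic cannot be expressed with column operations.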

### Create/Drop columns

If we want to store our new values in the same DataFrame, we can create a new column by assigning a Series to a column name that does not exist yet.

In [None]:
## create new temperature column

data['mean_temp_c'] = data["temperature_air_mean_200"].apply(kelvin2cel)
data['max_temp_c'] = data["temperature_air_max_200"].apply(kelvin2cel)
data['min_temp_c'] = data["temperature_air_min_200"].apply(kelvin2cel)


Of course we can also drop columns we don't need

In [None]:
data = data.drop(columns=["temperature_air_mean_200", "temperature_air_min_200", "temperature_air_max_200"])
data

### Groupby

Next we want to calculate the annual mean for each parameter in our dataset. For this we can use the groupby() method

In [None]:
data['year'] = data.index.year
data_annual = data.groupby(['year','station_id']).mean().reset_index()

In [None]:
data_annual
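
groupby() follows a split-apply-combine pattern. A minimal sketch on a hypothetical table (station names "A" and "B" and all values are invented):

```python
import pandas as pd

# Hypothetical daily means for two stations over two years.
demo = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020, 2019, 2020],
    "station_id": ["A", "A", "A", "A", "B", "B"],
    "mean_temp_c": [8.0, 10.0, 9.0, 11.0, 7.0, 8.0],
})

# Split by (year, station), apply mean() to each group, combine the results.
annual = demo.groupby(["year", "station_id"]).mean().reset_index()
print(annual)

# agg() computes several statistics per group in one call.
summary = demo.groupby("station_id")["mean_temp_c"].agg(["min", "max", "mean"])
print(summary)
```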

## Basic Plotting

Pandas provides some basic plotting functions, which allow us to make some quick visualisations of our dataset.
For example, we can draw a histogram of the average temperature

In [None]:
data_annual[["mean_temp_c"]].hist()

Or create a lineplot showing the mean temperature in Würzburg

In [None]:
temp_wue = data_annual[data_annual['station_id'] == 'Würzburg']
temp_wue.plot(x='year', y='mean_temp_c', title='Temperature in Würzburg', ylabel='Temperature in °C')  # use the year, not the row index, on the x-axis

## Writing to disk

If we are finished with the manipulation of our dataset we can just write it back to disk. In this case we will save the data as CSV. 

In [None]:
data_annual.to_csv("../Data/nonspatial/dwd_annual.csv", index=False)

### Additional: Method Chaining

The nice thing about pandas is that we don't have to apply all the operations one by one. We can just chain our commands

In [None]:
import pandas as pd

new_values = {5705:'Würzburg',282:'Bamberg',1420:'Frankfurt'}

temp = (pd.read_csv('../Data/nonspatial/station_data.csv')
  .replace({"station_id": new_values})
  .query("station_id == 'Würzburg' & parameter == 'temperature_air_mean_200'")
  .assign(datetime=lambda x: pd.to_datetime(x['date']))   
  .assign(mean_temp_c=lambda x: x.value - 273.15 )     
  .set_index('datetime')
  .groupby(pd.Grouper(freq='M')).mean(numeric_only=True)      
 )

temp['mean_temp_c'].plot()
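
The same chaining pattern can be tried without the CSV file. The sketch below uses a small hypothetical table in its place and groups by calendar month via an extra column instead of pd.Grouper, which is a simplification for illustration.

```python
import pandas as pd

# Hypothetical raw observations (Kelvin), standing in for the CSV file.
raw = pd.DataFrame({
    "station_id": [5705, 5705, 282, 5705],
    "date": ["2020-01-01", "2020-01-02", "2020-01-01", "2020-02-01"],
    "value": [274.15, 276.15, 275.15, 278.15],
})

monthly = (
    raw
    .replace({"station_id": {5705: "Würzburg", 282: "Bamberg"}})
    .query("station_id == 'Würzburg'")
    .assign(
        date=lambda d: pd.to_datetime(d["date"]),
        mean_temp_c=lambda d: d["value"] - 273.15,
    )
    .assign(month=lambda d: d["date"].dt.month)
    .groupby("month")["mean_temp_c"]
    .mean()
)
print(monthly.round(2))
```

Each step returns a new object, so no intermediate variables are needed and the original `raw` table stays unchanged.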

## Interactive data analysis with D-Tale

D-Tale is the combination of a Flask back-end and a React front-end to bring you an easy way to view & analyze Pandas data structures. It integrates seamlessly with ipython notebooks & python/ipython terminals. Currently this tool supports such Pandas objects as DataFrame, Series, MultiIndex, DatetimeIndex & RangeIndex.

In [None]:
import dtale

data_annual['date'] = pd.to_datetime(data_annual['year'], format='%Y')  # without format, integer years would be read as nanoseconds since 1970
dtale.show(data_annual)


## Pandas AI

Pandas AI is a Python library that adds generative artificial intelligence capabilities to Pandas, the popular data analysis and manipulation tool. It is designed to be used in conjunction with Pandas, and is not a replacement for it

In [None]:
!pip install pandasai

In [None]:
import pandas as pd
from pandasai import PandasAI

# Sample DataFrame
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064]
})

# Instantiate a LLM
from pandasai.llm.openai import OpenAI
llm = OpenAI(api_token="")

pandas_ai = PandasAI(llm, conversational=False)
pandas_ai.run(
    df,
    "Plot the histogram of countries showing for each the gdp, using different colors for each bar",
)

## Alternative Packages

### Polars: Pandas DataFrame but Much Faster

Polars is a DataFrame interface on top of an OLAP Query Engine implemented in Rust using Apache Arrow Columnar Format as the memory model.

The goal of Polars is to provide a lightning fast DataFrame library that:

- Utilizes all available cores on your machine.
- Optimizes queries to reduce unneeded work/memory allocations.
- Handles datasets much larger than your available RAM.
- Has an API that is consistent and predictable.
- Has a strict schema (data-types should be known before running the query).

Polars is written in Rust which gives it C/C++ performance and allows it to fully control performance critical parts in a query engine

https://pola-rs.github.io/polars-book/user-guide/

In [None]:
%pip install polars

In [None]:
import pandas as pd
import polars as pl
import time
import numpy as np

In [None]:
ptime = []
for i in range(1000):
    start = time.time()
    df_pandas = pd.read_csv("../Data/nonspatial/dwd_annual.csv")
    df_pandas['precipitation_height'].agg(['min','max','mean','median','std'])
    cols=['station_id','precipitation_height'] 
    df_pandas.sort_values(by=cols,ascending=True)
    df_pandas.groupby("station_id", as_index=False)
    end = time.time()
    ptime.append(end-start)
    
print(np.mean(ptime))

In [None]:
ptime = []
for i in range(1000):
    start = time.time()
    df_polars = pl.read_csv("../Data/nonspatial/dwd_annual.csv")
    # current Polars API: .name.suffix() and group_by() (older releases used .suffix() and groupby())
    df_polars.with_columns([
        pl.col('precipitation_height').min().name.suffix('_min'),
        pl.col('precipitation_height').max().name.suffix('_max'),
        pl.col('precipitation_height').mean().name.suffix('_mean'),
        pl.col('precipitation_height').median().name.suffix('_median'),
        pl.col('precipitation_height').std().name.suffix('_std')
    ])
    cols=['station_id','precipitation_height'] 
    df_polars.sort(cols,descending=False)
    df_polars.group_by("station_id")
    end = time.time()
    ptime.append(end-start)
    
print(np.mean(ptime))

## Exercise 7

- Import ***GlobalLandTemperaturesByMajorCity.csv***
- Look at the first 5 entries
- Print the name of all the columns
- What is the number of observations in the dataset?
- Display a summary of the basic information about this DataFrame and its data.
- Select just the 'AverageTemperature' and 'City' columns from the DataFrame
- Count the number of cities in the dataset
- Calculate the mean temperature for each city over all years. Sort the values in 'AverageTemperature' in descending order
- Calculate the maximum AverageTemperature in European cities between 1990 and 2013

## Visualize data with Python

Producing high-quality graphics is one of the main reasons for doing statistical computing. For this purpose, we will use the matplotlib library, which is probably the most widely used Python library for 2D graphics. Matplotlib is the "grandfather" library of data visualization with Python. It was created by John Hunter. Matplotlib allows quick data visualization and the creation of publication-quality figures.
We start by importing matplotlib's pyplot module as well as numpy and pandas.
(pyplot is a module in the matplotlib package that implicitly and automatically creates figures and axes for you.)

Some of the major Pros of Matplotlib are:

- easy to get started for simple plots
- custom labels and texts
- control of every element in a figure
- high-quality output in many formats
 

In [None]:
%pip install matplotlib

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

#### A Basic Plot

First we start with something easy.

In [None]:
df = pd.read_csv('../Data/nonspatial/dwd_annual.csv')
df

In [None]:
temp_wue = df[(df['station_id'] == 'Würzburg')]
temp_wue

In [None]:
plt.plot(temp_wue['mean_temp_c'])

That was easy. But in order to customize our plot we have to understand a bit more about how plt.plot() works.
plt.plot() accepts 3 basic arguments in the following order: (x, y, format). Let's look deeper into the format parameter. We can change the styling of our plot via a combination of {color}{marker}{line}. Let's assume we want to plot green dots connected by a dashed line. For this we would use the format string 'go--'.

In [None]:
plt.plot(temp_wue['mean_temp_c'],'go--' )

Of course matplotlib offers a bunch of different styling options:
- 'b'    : blue markers with default shape
- 'or'   : red circles
- '-g'   : green solid line
- '--'   : dashed line with default color
- '^k:'  : black triangle_up markers connected by a dotted line
- 'r*--' : red stars with dashed lines
- 'kp:'  : black pentagons with dotted line
- 'bD-.' : blue diamonds with dash-dot line
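
A format string is only a shorthand for three keyword arguments. The sketch below draws the same hypothetical data (invented values) once with the short form and once with the keywords spelled out, then reads the resulting line properties back.

```python
import matplotlib.pyplot as plt

# Hypothetical values, only for illustration.
years = [2018, 2019, 2020, 2021]
temps = [10.4, 10.1, 10.6, 9.3]

fig, ax = plt.subplots()

# Short form: colour 'g', marker 'o', line style '--' in one string.
(short,) = ax.plot(years, temps, "go--")

# Long form: the same styling spelled out as keyword arguments.
(verbose,) = ax.plot(years, temps, color="g", marker="o", linestyle="--")

print(short.get_color(), short.get_marker(), short.get_linestyle())
print(verbose.get_color(), verbose.get_marker(), verbose.get_linestyle())
```

The keyword form is longer but also accepts values that have no one-letter code, such as hex colours.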

Now that we have our first plot, let's go through the things that are still missing. First of all we can see that the x-axis is not showing the years. This is because matplotlib uses the index by default. Let's change this

In [None]:
plt.plot(temp_wue['year'],temp_wue['mean_temp_c'])

matplotlib offers a variety of different plot types. Here are just a few examples

In [None]:
#plt.scatter(temp_wue['year'],temp_wue['mean_temp_c'])
#plt.bar(temp_wue['year'],temp_wue['mean_temp_c'])
#plt.stem(temp_wue['year'],temp_wue['mean_temp_c'])
#plt.hist(temp_wue['mean_temp_c'])
#plt.step(temp_wue['year'],temp_wue['mean_temp_c'])
#plt.boxplot(temp_wue[['mean_temp_c','min_temp_c','max_temp_c']])
plt.violinplot(temp_wue[['mean_temp_c','min_temp_c','max_temp_c']])

But let's move back to our lineplot. We already managed to plot the temperature per year. But of course a proper plot consists of more elements than just the graph and its values. 

<img src="images/matplot.jpg"  width=600/>

In [None]:
plt.figure(figsize=(18,4), dpi=120)
plt.plot(temp_wue['year'],temp_wue['mean_temp_c'], label='Würzburg')
plt.title('Mean Temperature Würzburg')
plt.xlabel('Year')
plt.ylabel('Temperature in °Celsius')
plt.legend(loc='best')

Let's also add the maximum and the minimum temperature to our graph by just adding another plt.plot()

In [None]:
plt.figure(figsize=(18,4), dpi=120)
plt.plot(temp_wue['year'],temp_wue['mean_temp_c'],'-g', label='mean_temp')
plt.plot(temp_wue['year'],temp_wue['max_temp_c'],'-r', label='max_temp')
plt.plot(temp_wue['year'],temp_wue['min_temp_c'],'-b', label='min_temp')
plt.title('Temperature Würzburg')
plt.xlabel('Year')
plt.ylabel('Temperature in °Celsius')
plt.legend(loc='best')
plt.grid(True)

You can use the subplot() function to add more than one plot to a figure

In [None]:
temp_bam = df[(df['station_id'] == 'Bamberg')]
temp_bam

In [None]:
plt.figure(figsize=(18, 4))
plt.subplot(1,2,1)
plt.plot(temp_wue['year'],temp_wue['mean_temp_c'], label='Würzburg')
plt.title('Würzburg')
plt.xlabel('Year')
plt.ylabel('Temperature in °Celsius')

plt.subplot(1,2,2)
plt.plot(temp_bam['year'],temp_bam['mean_temp_c'], label='Bamberg')
plt.title('Bamberg')
plt.xlabel('Year')
plt.ylabel('Temperature in °Celsius')

plt.suptitle('Annual Temperature in Würzburg and Bamberg')
plt.show()


In [None]:
plt.figure(figsize=(18, 4))
plt.subplot(2,1,1)
plt.plot(temp_wue['year'],temp_wue['mean_temp_c'], label='Würzburg')
plt.title('Würzburg')
plt.xlabel('Year')
plt.ylabel('Temperature in °Celsius')

plt.subplot(2,1,2)
plt.plot(temp_bam['year'],temp_bam['mean_temp_c'], label='Bamberg')
plt.title('Bamberg')
plt.xlabel('Year')
plt.ylabel('Temperature in °Celsius')

plt.suptitle('Annual Temperature in Würzburg and Bamberg')
plt.show()

Until now you have had a first glimpse into the world of matplotlib. But in order to create more advanced plots it is important to know a little bit more about matplotlib. One important big-picture matplotlib concept is its object hierarchy.

We already called the function plt.plot(). This one-liner hides the fact that a plot is really a hierarchy of nested Python objects. A “hierarchy” here means that there is a tree-like structure of matplotlib objects underlying each plot.

####   The Matplotlib Object Hierarchy


- When we call plt.plot(x, y), we internally create a hierarchy of nested Python objects: Figure and Axes.

- A Figure object is the outermost container for a matplotlib graphic, which can contain multiple Axes objects.

- An Axes actually translates into what we think of as an individual plot or graph

- Below the Axes in the hierarchy are smaller objects such as tick marks, individual lines, legends, and text boxes. Almost every “element” of a chart is its own manipulable Python object, all the way down to the ticks and labels


<img src="images/matplot.png" width=300/>


You can think of the Figure object as a box-like container holding one or more Axes (actual plots).
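
The hierarchy can be explored directly in code. This is a minimal sketch with made-up data points that walks from the Figure down to a single line and back.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()            # one Figure containing one Axes
(line,) = ax.plot([1, 2, 3], [2, 4, 3], label="demo")   # illustrative data
ax.set_title("Hierarchy demo")

# The Figure knows its Axes, and the Axes knows the artists drawn on it.
print(fig.axes[0] is ax)
print(ax.lines[0] is line)
print(line.axes is ax, ax.figure is fig)

# Every element is an object that can be modified after creation.
line.set_linewidth(3)
ax.title.set_color("gray")
for tick_label in ax.get_xticklabels():
    tick_label.set_rotation(45)
```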

### Pyplot vs. Object-Oriented

On the surface, matplotlib's pyplot interface is made to imitate MATLAB's plotting commands. All the pyplot commands make changes to and modify the same figure. This is a state-based interface, where the state (i.e., the figure) is preserved through various function calls (i.e., the functions that modify the figure). This interface allows us to quickly and easily generate plots and to add or modify elements as we need them. This is the method which we have used so far.

The pyplot interface makes it quite easy to create quick plots. So why should we use a different method? Here is an example:

In [None]:
plt.figure(figsize=(18,4), dpi=120)
plt.plot(temp_wue['year'],temp_wue['mean_temp_c'],'-r', label='mean_temp')
plt.plot(temp_wue['year'],temp_wue['precipitation_height'],'-b', label='prec_height')
plt.title('Temperature Würzburg')
plt.xlabel('Year')
plt.ylabel('Temperature in °Celsius')
plt.legend(loc='best')
plt.grid(True)

Here, we run into an obvious issue. Since both quantities share the same y-axis but have very different magnitudes, the graph looks disproportionate. What we need to do is put the two quantities on two different axes. This is where the second approach to making plots comes into play.

Also, the pyplot approach doesn't really scale when we need to make multiple plots or intricate plots that require a lot of customisation. Internally, however, matplotlib has an object-oriented interface that can be accessed just as easily and allows us to reuse objects.

Although this looks more complicated, this method gives us full control over the plot

<img src="https://matplotlib.org/stable/_images/anatomy.png" width=600 />

In [None]:
fig, ax1 = plt.subplots()

# left y-axis: precipitation as bars
ax1.bar(temp_wue['year'],temp_wue['precipitation_height'], color='skyblue')
ax1.set_xlabel("Year")
ax1.set_ylabel('Precipitation in mm')
ax1.yaxis.label.set_color('skyblue')
ax1.tick_params(axis='y', labelcolor='skyblue')

# right y-axis: temperature as a line
ax2 = ax1.twinx() # create another y-axis sharing a common x-axis
ax2.plot(temp_wue['year'],temp_wue['mean_temp_c'], "red")
ax2.set_ylabel('Temperature in °C')
ax2.yaxis.label.set_color('red')
ax2.tick_params(axis='y', labelcolor='red')

fig.set_size_inches(7,3)
fig.set_dpi(100)

plt.show()

Or we can place a second, smaller Axes inside the figure

In [None]:
fig, ax1 = plt.subplots(figsize=(10,6))
ax1.plot(temp_wue['year'],temp_wue['mean_temp_c'], "red")
ax1.set_xlabel("Year")
ax1.set_ylabel('Temperature in °C')

# position of the inset Axes as fractions of the figure: left, bottom, width, height
l, b, w, h = .19, .65, .2, .2
ax2 = fig.add_axes([l, b, w, h])
ax2.bar(temp_wue['year'],temp_wue['precipitation_height'])
ax2.set_xlabel("Year")
ax2.set_ylabel('Precipitation in mm')
plt.show()

## Customizing Matplotlib

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
x = np.random.randn(1000)
plt.hist(x);

### Customization by Hand

In [None]:
ax = plt.axes(facecolor='#E6E6E6')
ax.set_axisbelow(True)
plt.grid(color='w', linestyle='solid')
for spine in ax.spines.values():
    spine.set_visible(False)
ax.xaxis.tick_bottom()
ax.yaxis.tick_left()
ax.tick_params(colors='gray', direction='out')
for tick in ax.get_xticklabels():
    tick.set_color('gray')
for tick in ax.get_yticklabels():
    tick.set_color('gray')
ax.hist(x, edgecolor='#E6E6E6', color='#EE6666');

### Stylesheets

In [None]:
plt.style.available[:10]

In [None]:
plt.style.use('ggplot')
plt.hist(x);
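
Note that plt.style.use() changes the style for the rest of the session. If a style should apply to a single figure only, a context manager is one option. A minimal sketch, with random data generated only for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

values = np.random.default_rng(0).normal(size=1000)  # illustrative data

# style.context() applies the style only inside the with-block.
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    ax.hist(values)
    inside = plt.rcParams["axes.facecolor"]

# After the block the previously active style is restored.
outside = plt.rcParams["axes.facecolor"]
print(inside, outside)
```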

### Additional styling packages

In [None]:
# Generate x values
x = np.linspace(0, 10, 20)
# Generate y values
y = np.sin(x)
y2 = np.cos(x)

In [None]:
!pip install mplcyberpunk
!pip install matplotx
!pip install SciencePlots

In [None]:
import mplcyberpunk
plt.style.use('cyberpunk')
plt.figure()

plt.plot(x, y, marker = 'o')
plt.plot(x, y2, marker = 'o', c='lime')

mplcyberpunk.make_lines_glow()

plt.xlabel('X-Axis')
plt.ylabel('Y-Axis')

plt.show()

In [None]:
import matplotx

with plt.style.context(matplotx.styles.pitaya_smoothie['light']):
  plt.scatter(x, y, c=y2)
  plt.colorbar(label='Y2')
  plt.xlabel('X')
  plt.ylabel('Y')
  plt.show()

In [None]:
import scienceplots

plt.style.use('default')
with plt.style.context(['science', 'ieee', 'std-colors']):
    plt.figure(figsize = (10,10))
    plt.plot(x, y, marker='o', label='Line 1')
    plt.plot(x, y2, marker='o', label='Line 2')

    plt.xlabel('X')
    plt.ylabel('Y')
    plt.legend()
    plt.show()

### Seaborn

Another powerful Python visualization library is seaborn, which is based on matplotlib. Seaborn provides a higher-level interface that makes many common plots easier to create. It is often preferred because users find the default settings in seaborn more pleasing than those in matplotlib. And the good thing is, if you know matplotlib, seaborn is really easy to learn. Let's try some nice example plots (more plot examples can be found on http://seaborn.pydata.org/)

In [None]:
import seaborn as sns

sns.lineplot(x=temp_wue['year'],y=temp_wue['mean_temp_c'], linewidth=2.5)

In [None]:
import seaborn as sns

sns.set(style="darkgrid")

plt.figure(figsize=(13, 5))
ax = sns.scatterplot(x=temp_wue['year'],y=temp_wue['mean_temp_c'], palette="flare", hue=temp_wue['mean_temp_c'])

In [None]:
plt.figure(figsize=(13, 8))
ax = sns.regplot(x=temp_wue['year'],y=temp_wue['mean_temp_c'])

In [None]:
plt.figure(figsize=(18, 4))
sns.lineplot(data=df, x="year", y="precipitation_height", hue="station_id")

## Visualization alternatives

A few years ago matplotlib was the only answer to the question: "How do I make plots in Python?". But nowadays we have a lot of choices. Each library takes a slightly different approach to plotting data. Which one you use is up to you.

In [None]:
%pip install altair

In [None]:
import altair as alt
alt.data_transformers.disable_max_rows()

alt.Chart(temp_wue).mark_line().encode(
    alt.X('year:Q',scale=alt.Scale(zero=False)),
    alt.Y('mean_temp_c:Q',scale=alt.Scale(zero=False)),
    color=alt.Color(
        'mean(mean_temp_c):Q', scale=alt.Scale(scheme='reds', domain=(-5, 20))),
    tooltip=[
        alt.Tooltip('year', title='Year'),
        alt.Tooltip('mean(mean_temp_c):Q', title='AverageTemperature')
    ]
 ).properties(width=600, height=300)




In [None]:
%pip install plotly

In [None]:
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot


init_notebook_mode(connected=True)

trace_middle = go.Scatter(
    x=temp_wue.year, 
    y=temp_wue.mean_temp_c,
    name = "Temp",
    line = dict(color = '#17BECF'),
    opacity = 0.8)

trace_high = go.Scatter(
    x=temp_wue.year, 
    y=temp_wue.max_temp_c,
    name = "High",
    line = dict(color = '#7F7F7F'),
    opacity = 0.8)    
    
trace_low = go.Scatter(
    x=temp_wue.year, 
    y=temp_wue.min_temp_c,
    name = "Low",
    line = dict(color = '#7F7F7F'),
    opacity = 0.8)

data = [trace_high,trace_middle,trace_low]

layout = dict(
    title='Time Series with Rangeslider',
    xaxis=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=1,
                     label='1y',
                     step='year',
                     stepmode='backward'),
                dict(count=10,
                     label='10y',
                     step='year',
                     stepmode='backward'),
                dict(step='all')
            ])
        ),
        rangeslider=dict(
            visible = True
        ),
        type='date'
    )
)

fig = dict(data=data, layout=layout)
iplot(fig, filename = "Time Series with Rangeslider")