
*Introduction to Data Science*

# Assignment 4 - Plotting with Lets-Plot (Solutions)

Gunther Gust / Vanessa Haustein<br>
Chair of Enterprise AI

Winter Semester 24/25

<img src='https://raw.githubusercontent.com/vhaus63/ids_data/main/d3.png?raw=true' style='width:20%; float:left;' />

In [1]:
import pandas as pd
from lets_plot import *
LetsPlot.setup_html()

We will work with the `Cars` dataset from the Altair Viz module. It contains information on manufacturing and other attributes of cars.

In [2]:
df = pd.read_json('https://raw.githubusercontent.com/vhaus63/ids_data/refs/heads/main/cars.json')
df.head()

| | Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 1970-01-01 | USA |
| 1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 1970-01-01 | USA |
| 2 | plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 1970-01-01 | USA |
| 3 | amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 1970-01-01 | USA |
| 4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 1970-01-01 | USA |


### Exercise 1: Visual Check of Correlation

(a) We want to find out whether there is a linear correlation between the attributes `Horsepower` and `Miles_per_Gallon`. We will do this via a scatterplot that is supposed to look like this:

<img src="https://raw.githubusercontent.com/vhaus63/ids_data/main/correlation_cars.png" style="width:60%;" />

Recreate the plot.


In [3]:
(
    ggplot(df, aes(x='Horsepower', y='Miles_per_Gallon'))
    + geom_point(aes(color='Origin'))
    + geom_smooth(color='black')
    + ggtitle('Linear Correlation between Horsepower and Miles per Gallon')
)
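As a numeric companion to the visual check, the Pearson correlation coefficient can be computed with pandas. The sketch below is only an illustration: it uses the five rows shown in the `df.head()` output above rather than the full dataset, so the value says nothing about the full data. The name `sample` is introduced here for the example.

```python
import pandas as pd

# Five rows copied from the df.head() output above (not the full dataset)
sample = pd.DataFrame({
    'Horsepower': [130.0, 165.0, 150.0, 150.0, 140.0],
    'Miles_per_Gallon': [18.0, 15.0, 18.0, 16.0, 17.0],
})

# Pearson correlation: values near -1 or 1 indicate a strong linear relationship
r = sample['Horsepower'].corr(sample['Miles_per_Gallon'])
print(round(r, 3))
```

On the full data, the same call would be `df['Horsepower'].corr(df['Miles_per_Gallon'])`. A negative value matches the downward slope of the regression line in the plot.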

(b) In the previous plot, we added a third dimension, the `Origin` of the cars, via the color of the points. Now we want to include the `Weight_in_lbs` attribute in the plot instead of the origin. Since this is not a categorical variable, we don't want to use color (although we could, of course). Instead, we vary the size of the scatter points depending on the weight variable. Change the plot accordingly.

In [4]:
(
    ggplot(df, aes(x='Horsepower', y='Miles_per_Gallon'))
    + geom_point(aes(size='Weight_in_lbs'), alpha=0.5)
    + ggtitle('Bubble Chart')
)

(c) The plot looks very crowded, especially in its lower part. It would be helpful to be able to zoom into the plot interactively. Can you find a way to do that?

In [5]:
(
    ggplot(df, aes(x='Horsepower', y='Miles_per_Gallon'))
    + ggtb()
    + geom_point(aes(size='Weight_in_lbs'), alpha=0.5)
    + ggtitle('Bubble Chart')
)

### Exercise 2:
Find a good visualization for the interplay between Origin, Year and Horsepower.

Before you start plotting, change the `Year` column. You may have noticed that every year is stored as a string of the form `YYYY-01-01`. We will drop the month and the day, since they carry no additional information and would only clutter the plot legend.

In [6]:
df['Year'] = df['Year'].str.split('-', expand=True)[0]
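To see what the split does, here is a small sketch on a stand-alone Series. The values in `years` are illustrative and the variable names are introduced only for this example. It also shows string slicing as one possible alternative that gives the same result for this date format.

```python
import pandas as pd

# Illustrative values in the same 'YYYY-01-01' format as the Year column
years = pd.Series(['1970-01-01', '1971-01-01', '1972-01-01'])

# Split at '-' into separate columns and keep the first one (the year)
first_part = years.str.split('-', expand=True)[0]

# Alternative: take the first four characters of each string
sliced = years.str[:4]

print(first_part.tolist())
print(sliced.tolist())
```

Both variants return strings, not integers, so the year behaves like a category when it is mapped to an axis.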

In [7]:
(
    ggplot(df, aes(x='Year', y='Origin', fill='Horsepower'))
    + geom_tile()
    + scale_fill_gradient(low='yellow', high='blue')
    + ggtitle('Horsepower per Origin and Year')
)

### Exercise 3: Dot Strip Plot

Recreate this plot in order to evaluate the interplay between Horsepower and Cylinders for each Origin.

<img src="https://raw.githubusercontent.com/vhaus63/ids_data/main/horsepower_cylinders.png" style="width:60%;" />


In [8]:
(
    ggplot(df, aes(x='Horsepower', y='Origin', color='Cylinders'))
    + geom_point(position=position_jitter(height=0.1))
)

### Exercise 4:

(a) We want to aggregate the data per year to see how parameters might have changed over time. Create a DataFrame called `df_agg` that contains the average horsepower and the average miles per gallon for each year.

In [9]:
df_agg = df.groupby('Year').agg({'Horsepower': 'mean', 'Miles_per_Gallon': 'mean'}).reset_index()
df_agg

| | Year | Horsepower | Miles_per_Gallon |
|---|---|---|---|
| 0 | 1970 | 148.857143 | 17.689655 |
| 1 | 1971 | 104.928571 | 21.25 |
| 2 | 1972 | 120.178571 | 18.714286 |
| 3 | 1973 | 130.475 | 17.1 |
| 4 | 1974 | 94.230769 | 22.703704 |
| 5 | 1975 | 101.066667 | 20.266667 |
| 6 | 1976 | 101.117647 | 21.573529 |
| 7 | 1977 | 105.071429 | 23.375 |
| 8 | 1978 | 99.694444 | 24.061111 |
| 9 | 1979 | 101.206897 | 25.093103 |


(b) Now you can recreate the following plot:

<img src="https://raw.githubusercontent.com/vhaus63/ids_data/main/time_horsepower_cars.png" style="width:60%;" />


In [10]:
(
    ggplot(df_agg, aes('Horsepower', 'Miles_per_Gallon'))
    + geom_point(aes(color='Year'))
    # Each segment starts at one year's averages and ends at the next year's (shift(-1))
    + geom_segment(aes(xend=df_agg['Horsepower'].shift(-1),
                       yend=df_agg['Miles_per_Gallon'].shift(-1),
                       color='Year'),
                   arrow=arrow(type='closed', angle=40, length=10))
    + scale_color_gradient(low='lightblue', high='red', guide='none')
    + geom_text(aes(label='Year'), nudge_x=3.5, size=5)
)
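The arrows rely on `shift(-1)`, which moves every value up by one row so that each year is paired with the values of the following year. A minimal sketch, using the first three rows of the `df_agg` output above (the names `demo`, `xend`, and `yend` are chosen for this example):

```python
import pandas as pd

# First three rows of df_agg as printed above
demo = pd.DataFrame({
    'Year': ['1970', '1971', '1972'],
    'Horsepower': [148.857143, 104.928571, 120.178571],
    'Miles_per_Gallon': [17.689655, 21.25, 18.714286],
})

# shift(-1) pairs each row with the values of the next row
demo['xend'] = demo['Horsepower'].shift(-1)
demo['yend'] = demo['Miles_per_Gallon'].shift(-1)

print(demo)
```

The last row gets `NaN` as its end point, because there is no following year to point to.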

(c) What insights does this plot provide?

- It visualizes the relationship between Horsepower and Miles per Gallon (MPG) over time, with points labeled by year
- Higher horsepower correlates with lower MPG, indicating that cars with more powerful engines tend to be less fuel-efficient
- Fuel efficiency seems to have improved over the years
- Around 1980 there was a drastic increase in MPG together with lower horsepower, indicating that fuel efficiency was a high priority in car manufacturing at that time

### Exercise 5:
To better understand how the efficiency (measured by MPG) of the manufactured cars evolved over time, we want to create a statistical plot that shows, for each `Origin`, the mean MPG per year and the 95% confidence interval of the mean.

Recreate this plot:

<img src="https://raw.githubusercontent.com/vhaus63/ids_data/main/mpg_time_cars2.png" style="width:80%;" />


You can either create an aggregated table as we did above and add the lines and fills one by one, or you can use the `stat_summary()` function. For the upper and lower bounds of the 95% confidence interval, use the formula

`x.mean() +- 1.96 * x.std()/len(x)**0.5`.

In [11]:
(
    ggplot(df)
    # Adding mean line
    + stat_summary(aes(x='Year', y='Miles_per_Gallon', color='Origin'),
                   fun_y='mean',
                   geom='line')
    # Adding confidence interval
    + stat_summary(aes(x='Year', y='Miles_per_Gallon', color='Origin', fill='Origin'),
                   fun_y='mean',
                   fun_ymin=lambda x: x.mean() - 1.96 * x.std()/len(x)**0.5,
                   fun_ymax=lambda x: x.mean() + 1.96 * x.std()/len(x)**0.5,
                   geom='ribbon',
                   alpha=0.3)
    + labs(y='Miles per Gallon', x='Year')
    + facet_wrap(facets='Origin', ncol=3, scales='fixed')
    + ggtitle('Miles per Gallon Over Years by Origin')
    + guides(color='none', fill='none')
)
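The exercise text also mentions an aggregated table as an alternative route. The sketch below shows how such a table could be built with pandas, applying the same formula for the bounds. It is an illustration under assumptions: it uses only the five rows from the `df.head()` output (all `USA`, 1970), and the names `cars`, `std_error`, `ci`, `lower`, and `upper` are introduced here. This route does not depend on `stat_summary()` accepting Python callables, which may not be supported by every Lets-Plot version.

```python
import pandas as pd

# Five rows from the df.head() output above, with Year already shortened
cars = pd.DataFrame({
    'Origin': ['USA'] * 5,
    'Year': ['1970'] * 5,
    'Miles_per_Gallon': [18.0, 15.0, 18.0, 16.0, 17.0],
})

def std_error(x):
    """Standard error of the mean, as in the formula: std / sqrt(n)."""
    return x.std() / len(x) ** 0.5

# One row per Origin and Year with the mean and its standard error
ci = (
    cars.groupby(['Origin', 'Year'])['Miles_per_Gallon']
    .agg(mean='mean', se=std_error)
    .reset_index()
)

# Bounds of the 95% confidence interval of the mean
ci['lower'] = ci['mean'] - 1.96 * ci['se']
ci['upper'] = ci['mean'] + 1.96 * ci['se']

print(ci)
```

On the full data, the resulting `mean`, `lower`, and `upper` columns could then be drawn with `geom_line()` and `geom_ribbon(aes(ymin='lower', ymax='upper'))`, faceted by `Origin`.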
