# ICN Programming Course

<p align="center">
    <img width="500" alt="image" src="https://github.com/Lenakeiz/ICN_Programming_Course/blob/main/Images/cog_neuro_logo_blue_png_0.png?raw=true">
</p>

---

## Data visualisation and matplotlib

Data visualization should be the starting point of every analysis. It is often undervalued, on the assumption that statistical analysis alone is sufficient.
However, the importance of visualizing data is well illustrated by [Anscombe's Quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet).
Created by Francis J. Anscombe in 1973, it consists of four datasets.
Each dataset yields nearly identical summary statistics (mean, standard deviation, and correlation), suggesting they are similar.
Yet, when these datasets are plotted, their differences become apparent.
The key lesson of Anscombe's Quartet is not that four datasets can share the same statistical properties, but that datasets which look completely different can hide behind those same properties.
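
As a quick check of the claim above, the following sketch computes the summary statistics of the four datasets. The values are typed in by hand from Anscombe's 1973 paper, and the variable names (`quartet`, `summary`) are only illustrative.

```python
import numpy as np

# Anscombe's quartet; datasets I-III share the same x values
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    summary[name] = {
        "mean_x": round(float(x.mean()), 2),
        "mean_y": round(float(y.mean()), 2),
        "std_y": round(float(y.std(ddof=1)), 2),  # sample standard deviation
        "corr": round(float(np.corrcoef(x, y)[0, 1]), 2),  # Pearson correlation
    }
    print(name, summary[name])
```

The printed rows agree to two decimal places, even though a scatter plot of each dataset looks completely different.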

Over the years, more datasets of this kind have been created to show the importance of data visualization.
One of them is the Datasaurus dataset, which urges people to "never trust summary statistics alone; always visualize your data": while the data exhibits normal-seeming statistics, plotting it reveals a picture of a dinosaur 🦖.
Inspired by Anscombe's Quartet, the Datasaurus can be found in the original publication, the [_datasaurus dozen_](https://dl.acm.org/doi/10.1145/3025453.3025912), _i.e._ 13 datasets (the Datasaurus + $12$ others) having the same summary statistics up to two decimal places.

Hence, the golden rule is the following.
> Always find a good way to visualise your data before applying any statistics.


<p align="center">
    <img width="1000" src="https://blog.revolutionanalytics.com/downloads/DataSaurus%20Dozen.gif">
</p>

### Pandas plotting

Another great thing about pandas is that it integrates with [Matplotlib](https://matplotlib.org/) as well as [Seaborn](https://seaborn.pydata.org/), two very important plotting libraries, so you can plot directly from DataFrames and Series.

To get started we need to import those libraries (make sure to have them installed first).

In [None]:
# Import libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# set plot size
plt.rcParams['figure.figsize'] = (20, 13)
# this is a magic command for Jupyter Notebooks or IPython environments. It sets up the Matplotlib figures to be displayed inline, which means that the plots will be shown directly under the code cell that produced them.
%matplotlib inline
# set the resolution of the plot - retina means higher pixel density
%config InlineBackend.figure_format = "retina"

# set the style for plotting using seaborn. Changes the global defaults for all plots using the matplotlib rcParams system
sns.set_theme()

---
## Graphical analysis

The aim is to start our data analysis using Python data visualization tools.

### Matplotlib and Seaborn

[Matplotlib](https://matplotlib.org/) is a highly customisable library, and that flexibility has a drawback: even simple plots can be verbose to configure.
[Seaborn](https://seaborn.pydata.org/) standardises some graphical aspects and integrates easily with Pandas.

When importing seaborn as `sns` you can also set default options to make the plots more readable.
It is very simple to do so with `sns.set_theme()`, which overrides the same `rcParams` that Matplotlib uses.
Once Seaborn has been imported, calling this function applies its default graphical settings to every subsequent plot.

```python
import seaborn as sns
sns.set_theme()
```
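
To see what `sns.set_theme()` actually changes, one can compare a few `rcParams` entries before and after the call. This is only a sketch: the three keys inspected here are an arbitrary selection, and `mpl.rcdefaults()` is called first so that the comparison starts from Matplotlib's built-in defaults.

```python
import matplotlib as mpl
import seaborn as sns

mpl.rcdefaults()  # reset to Matplotlib's built-in defaults

keys = ["axes.grid", "axes.facecolor", "font.size"]  # arbitrary selection
before = {k: mpl.rcParams[k] for k in keys}

sns.set_theme()  # applies seaborn's default theme through rcParams
after = {k: mpl.rcParams[k] for k in keys}

for k in keys:
    print(f"{k}: {before[k]!r} -> {after[k]!r}")
```

Because the theme lives in `rcParams`, it also affects plots created through pandas or plain Matplotlib, not only seaborn functions.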

Tricky question for you:
> Why do we import Seaborn as `sns`?

<details>
    <summary><b>HINT</b></summary> 
    <p align="center">
      <a href= "https://stackoverflow.com/questions/41499857/seaborn-why-import-as-sns"><img src="https://www.thesun.co.uk/wp-content/uploads/2019/06/NINTCHDBPICT000002475114.jpg" width="350" title="Sam Norman Seaborn"></a>
    </p>
    Click on the image for a little bit more detailed answer.
</details>

#### Import data

Let's start by importing the movie dataset and plotting the relationship between `Rating` and `Revenue (Millions)`.
All we need to do is call `plot()` on `df` with some information about how to construct the plot.

In [None]:
data_url = 'https://raw.githubusercontent.com/LearnDataSci/articles/master/Python%20Pandas%20Tutorial%20A%20Complete%20Introduction%20for%20Beginners/IMDB-Movie-Data.csv'

df = pd.read_csv(data_url)
df.head()

Plotting is easy: DataFrames expose a method called `plot`, which relies on Matplotlib under the hood. It works as a wrapper around Matplotlib's plotting functions and can draw different kinds of plots, selected through the `kind` argument. You can use it on a DataFrame or on a Series (an individual column of a DataFrame).

In [None]:
df.plot(kind='scatter', x='Rating', y='Revenue (Millions)', 
    title = "Revenues vs Rating", figsize=(20,13));
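
The role of the `kind` argument can be sketched on a small example. The `toy` dataframe and its values are made up for illustration; only the `plot` calls mirror what is used in this notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, only to illustrate the API
toy = pd.DataFrame({
    "score": [6.1, 7.4, 8.2, 5.9, 7.0],
    "income": [12.0, 85.5, 140.2, 3.4, 60.1],
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# DataFrame.plot: x and y name the columns to use
toy.plot(kind="scatter", x="score", y="income", ax=axes[0], title="scatter")

# Series.plot: works on a single column
toy["score"].plot(kind="hist", ax=axes[1], title="hist")

# Sort by x first, otherwise the line zig-zags across the plot
toy.sort_values("score").plot(kind="line", x="score", y="income",
                              ax=axes[2], title="line")
fig.tight_layout()
```

Each call returns the Matplotlib `Axes` it drew on, so the result can be customised further with regular Matplotlib methods.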

Pandas' `plot` function has many optional arguments.

Some of these arguments are forwarded to Matplotlib and are not listed in the function documentation (😠).
A complete list of such properties can be found [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.lines.Line2D.html).

While some of these properties may never be of use, others, such as `marker` and `color`, are quite useful.

In [None]:
df.plot(kind='scatter', 
    x='Rating', 
    y='Revenue (Millions)', 
    title = "Revenues vs Rating",
    marker='x',
    color='r',
    figsize=(20,13));

Since `plot` works on Series and DataFrame objects, we can first apply any of the pandas functions that return such objects. In particular, we can filter the data with `query` and then plot the result.

In [None]:
df.query("Director == 'Ridley Scott'").sort_values(by="Year").plot(x="Year",
                                            y="Rating",
                                            linestyle="-.",
                                            title="Ridley Scott movies' rating over time",
                                            color='g',
                                            figsize=(20,13));

This is a nice plot, but entries in a dataset are sometimes typed in manually, so the capitalisation of a string may be inconsistent and an exact, case-sensitive match could miss some rows.
Luckily, we can solve the problem by using string functions as an alternative way of filtering the dataframe.

As you will notice, the output is the same.

In [None]:
# Try to search for "ridley scott", but case-insensitive
# Specifying na to be False instead of NaN replaces the null findings with False. 
# We can then use the filtering series directly on our dataframe.
scott_movies = df[df['Director'].str.contains('ridley scott', case=False, na=False)]

# Sort the filtered DataFrame by year
scott_movies_sorted = scott_movies.sort_values(by="Year")

# Plot
scott_movies_sorted.plot(x="Year", y="Rating", linestyle="-.", title="Movies' rating over time for directors matching 'ridley scott'", color='g', figsize=(20,13))

plt.show()
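
The difference between an exact match and a case-insensitive one is easier to see on a tiny Series. The names below are made up to include inconsistent capitalisation and a missing value; they are not taken from the movie dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical director column with inconsistent capitalisation and a missing entry
directors = pd.Series(
    ["Ridley Scott", "ridley scott", "RIDLEY SCOTT", "Tony Scott", np.nan]
)

# Exact comparison: case-sensitive
exact = directors == "Ridley Scott"

# Substring search ignoring case; na=False turns missing entries into False
insensitive = directors.str.contains("ridley scott", case=False, na=False)

# Alternative: normalise the case first, then compare exactly
normalised = directors.str.lower() == "ridley scott"

print(exact.tolist())
print(insensitive.tolist())
print(normalised.tolist())
```

Note that `str.contains` matches substrings, so it would also select an entry such as "Ridley Scott Jr.", whereas the normalised comparison requires the whole string to match.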


### Histograms

If we want to plot a simple Histogram based on a single column, we can call plot on the column series.

In [None]:
df['Rating'].plot(kind='hist', title='Rating', figsize=(20,13));

We can also make a graphical representation using the interquartile range, the famous __Boxplot__. Let's first recall what `describe` gives us on the ratings column.

In [None]:
df.Rating.describe()

Using a Boxplot we can visualize this data.

In [None]:
df['Rating'].plot(kind="box",
                figsize=(20,13));
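
The quantities drawn by a boxplot can also be computed by hand. The sketch below uses a made-up Series and the common convention (also Matplotlib's default) of flagging points that lie more than 1.5 times the interquartile range beyond the quartiles.

```python
import pandas as pd

# Hypothetical values with one obvious outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Quartiles: the edges of the box and the line inside it
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range, i.e. the height of the box

# Fences: the whiskers reach the most extreme points inside these limits
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn individually as outliers
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers.tolist()}")
```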

To have a brief summary, one can have a look at the picture below.

<p align="center">
    <img width="600" src="https://mathpullzone-8231.kxcdn.com/wp-content/uploads/boxplot-with-outliers.jpg">
</p>

By combining categorical and continuous data, we can create a Boxplot of revenue that is grouped by a Rating Category.

In [None]:
# Create rating category column
df["rating_category"] = df.Rating.apply(lambda x: 'good' if x>= 8.0 else 'bad')
df.head()

You can use the pandas boxplot or the seaborn one; they produce equivalent plots.

The advantage of seaborn is that the grouping (the `groupby`) is handled inside `boxplot`, so the code is slightly more compact.

In [None]:
plt.figure(figsize=(20,13))
# With data=df, columns can be referenced by name
sns.boxplot(x="rating_category",
            y="Revenue (Millions)",
            data=df)
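
For comparison, a grouped boxplot can also be produced with pandas alone through `DataFrame.boxplot` and its `by` argument. This is a sketch on made-up data; the `toy` dataframe and its column names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data: one categorical and one continuous column
toy = pd.DataFrame({
    "category": ["good", "bad", "bad", "good", "bad", "good"],
    "income": [140.2, 12.0, 60.1, 95.3, 3.4, 210.0],
})

fig, ax = plt.subplots(figsize=(6, 4))

# One box per value of "category"
toy.boxplot(column="income", by="category", ax=ax)

ax.set_ylabel("income")
fig.suptitle("")  # remove the automatic "Boxplot grouped by ..." title
```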

#### A final suggestion

Always label your axes!

<p align="center">
    <img width="689" src="https://raw.githubusercontent.com/qingkaikong/blog/master/2017_12_machine_learning_funny_pictures/figures/figure_20.png">
</p>
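
A minimal sketch of how to do that: `plot` returns a Matplotlib `Axes`, whose labels and title can be set explicitly. The data and the label texts below are placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data using the same column names as the movie dataset
toy = pd.DataFrame({
    "Rating": [6.1, 7.4, 8.2, 5.9],
    "Revenue (Millions)": [12.0, 85.5, 140.2, 3.4],
})

ax = toy.plot(kind="scatter", x="Rating", y="Revenue (Millions)")

# Replace the default labels (the raw column names) with descriptive ones
ax.set_xlabel("User rating")
ax.set_ylabel("Revenue (millions)")
ax.set_title("Revenue vs rating")
```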

---

#### Exercises

1. Write Pandas code to select the movies whose revenue is more than $150$ million, sort them by rating, change the good/bad threshold to `7.0` for `rating_category`, and build a histogram for each rating category.

2. Produce a boxplot for visualising the rating grouped by release year.

Let's go further by looking at the IMDB movie dataset and doing more exploratory analysis.
We can answer questions like

1. Can we select all movies within a genre (_e.g._ Sci-Fi) and analyze the annual trend in terms of quantity of movies produced, commercial success and appreciation?
2. What's the relation between `metascore` and `rating`?

#### Using meaningful column names
Before proceeding, it is important to rename the columns to make them easier to use for filtering.
Let's rename them.

In [None]:
df = df.rename(columns={
    'Runtime (Minutes)': 'Runtime',
    'Revenue (Millions)': 'Revenue_millions'
})
df.columns = [col.lower() for col in df]
df.head()
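
To illustrate why clean names help, here is a sketch on a made-up dataframe: once the names contain no spaces or parentheses, they can be used directly inside `query` expressions. The `toy` data is hypothetical, and `columns.str.lower()` is used as an equivalent alternative to the list comprehension above.

```python
import pandas as pd

# Hypothetical toy data with the original, awkward column names
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Runtime (Minutes)": [120, 95, 140],
    "Revenue (Millions)": [333.1, 12.5, 150.0],
})

toy = toy.rename(columns={
    "Runtime (Minutes)": "Runtime",
    "Revenue (Millions)": "Revenue_millions",
})
toy.columns = toy.columns.str.lower()  # lower-case every column name
print(list(toy.columns))

# Clean names can be referenced directly in a query string
big_hits = toy.query("revenue_millions > 100")
print(big_hits)
```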

#### Users and critics ratings

The user rating is stored in the column `rating`, while the critics' rating is in `metascore`.
In principle we can study the difference between these two.
However, they are not on the same scale.
We have two choices here: 
1. Rescale one to the other, if we know the conversion factor.
2. Rescale both to the interval $\left[0,1 \right]$.

We are going with the second choice here.

In [None]:
# A compact way to do so is the following
df[['rating', 'metascore']] -= df[['rating', 'metascore']].min() # subtract the min, such that the new min is 0
df[['rating', 'metascore']] /= df[['rating', 'metascore']].max() # divide by the max, such that the new max is 1
df[['rating', 'metascore']].head()
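
The two steps above amount to min-max scaling, $x' = \dfrac{x - \min(x)}{\max(x) - \min(x)}$, applied column by column. The sketch below shows the same operation as a single expression on made-up values; unlike the cell above, it leaves the original dataframe untouched.

```python
import pandas as pd

# Hypothetical scores on two different scales
toy = pd.DataFrame({
    "rating": [5.0, 7.5, 9.0],        # e.g. a 0-10 scale
    "metascore": [40.0, 70.0, 100.0], # e.g. a 0-100 scale
})
cols = ["rating", "metascore"]

# Min-max scaling: min() and max() are computed per column and broadcast
scaled = (toy[cols] - toy[cols].min()) / (toy[cols].max() - toy[cols].min())
print(scaled)
```

After scaling, the lowest value in each column maps to 0 and the highest to 1, so the two columns become directly comparable.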

We can now evaluate the difference and save it into a new column.

In [None]:
df["score_difference"] = df["metascore"] - df["rating"] # signed difference; wrap it in abs() if you only care about the magnitude.

To start our exploratory analysis, a good strategy is to plot some of the quantities of interest, specifically `rating`, `metascore` and their difference.
Recall that they are now on the same scale, so there is nothing further to worry about!

In [None]:
genre = "Horror"
df[df.genre.str.contains(f"{genre}")].plot(kind='bar', 
                                        x = 'title', 
                                        y = 'rating', 
                                        title= f"{genre} Movies user rating",
                                        figsize=(25,15)); # We plot just one genre for the sake of visualisation clarity.

And we can do the same for the metascore.

In [None]:
genre = "Horror"
df[df.genre.str.contains(f"{genre}")].plot(kind='bar', 
                                        x = 'title', 
                                        y = 'metascore', 
                                        title= f"{genre} Movies critics rating",
                                        figsize=(25,15)); # We plot just one genre for the sake of visualisation clarity.

And finally our calculated value, the score difference.

In [None]:
genre = "Horror"
df[df.genre.str.contains(f"{genre}")].plot(kind='bar', 
                                        x = 'title', 
                                        y = 'score_difference', 
                                        title= f"{genre} Movies score difference between critics and users",
                                        figsize=(25,15)); # We plot just one genre for the sake of visualisation clarity.

One can immediately see that critics are on average stricter than users about movies, with some noteworthy exceptions.

#### Analysis of the annual trend of a specific genre movie in terms of quantity of films produced, commercial success and appreciation

We are going to visualise results for "Sci-Fi" movies, however, let's try to keep the genre-choice parametric so that we can easily explore another genre without re-writing any code.

_In fact_: [Laziness](https://medium.com/@tsecretdeveloper/a-good-coder-is-a-lazy-coder-a678eb56d373) is a good characteristic of good programmers.

First of all let's plot the number of movies in the genre over years.

In [None]:
genre_to_select = "Sci-Fi"

df_genre = df[df["genre"].str.contains(genre_to_select)]
df_genre[["year", "director"]].groupby(by=["year"]).count().plot(linestyle="-.",
                                                                color="b",
                                                                figsize=(20, 10),
                                                                title=f"Number of {genre_to_select} movies over time")
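
One detail worth knowing about the `groupby(...).count()` pattern is that `count` only counts non-missing values in the selected column. The following sketch, on made-up data with one missing director, compares it with `size` and `value_counts`, which count rows.

```python
import pandas as pd

# Hypothetical data: one movie has no director recorded
toy = pd.DataFrame({
    "year": [2014, 2015, 2015, 2016, 2016, 2016],
    "director": ["A", "B", None, "C", "D", "E"],
})

# Number of non-missing directors per year
by_count = toy.groupby("year")["director"].count()

# Number of rows per year, regardless of missing values
by_size = toy.groupby("year").size()

# Same row count, obtained directly from the column
by_value_counts = toy["year"].value_counts().sort_index()

print(by_count.tolist())
print(by_size.tolist())
print(by_value_counts.tolist())
```

If the column used for counting has no missing entries, the three approaches give the same numbers.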


This is a nice plot, but we can obtain the same result by counting the number of occurrences of each value of a categorical variable. We can do this using `countplot` from Seaborn.

In [None]:
from matplotlib.ticker import MaxNLocator

plt.figure(figsize=(20,10))
sns.countplot(x = "year", hue="year", legend=False, data = df_genre, palette="pastel")
plt.gca().yaxis.set_major_locator(MaxNLocator(integer=True))
plt.title(f"Number of {genre_to_select} movies over time")
plt.tight_layout()
plt.show()

Let's now look at the trend of `metascore` and `rating` over time for this genre.

In [None]:
df_genre[["year", "metascore", "rating"]].groupby("year")[["metascore", "rating"]].mean().plot(figsize=(20, 10), title=f"Average ratings over time for {genre_to_select} movies")

It seems that there may be a correlation between the two scores.
Let's try to explore that further.
One can use the `corr` method in pandas.
By default the Pearson correlation coefficient is used; this can be changed via the `method` parameter.

In [None]:
grouped_data = df_genre.groupby("year")[["metascore", "rating"]].mean()

# Calculate the correlation matrix
correlation_matrix = grouped_data.corr()
correlation_matrix
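
The effect of the `method` parameter can be sketched on made-up data in which `y` grows with `x` but not linearly. Pearson measures linear association, while Spearman works on ranks and therefore only looks at whether the relationship is monotonic.

```python
import pandas as pd

# Hypothetical data: y increases with x, but not along a straight line
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Here the Spearman coefficient is 1 because the ranks of `x` and `y` coincide, whereas the Pearson coefficient stays below 1.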

Let's visualize this correlation by creating a scatter plot between the two variables.
This time we are going to work directly on the dataframe extracted for the genre.

In [None]:
plt.figure(figsize=(20, 10))
sns.scatterplot(x="rating", y="metascore", data=df_genre)
plt.tight_layout()
plt.grid()

Let's see if other variables are correlated.
For this part we are going to use the entire dataset.
For example, is the length of a movie an indication that it may be more or less appreciated?

In [None]:
df[["runtime", "metascore", "rating"]].corr()

It does not seem so.
One can visualize the correlation coefficients with a heatmap.
Seaborn comes in handy again for this.

In [None]:
plt.figure(figsize=(10, 10))
sns.heatmap(df[["runtime", "metascore", "rating"]].corr(), annot=True)
plt.tight_layout()

Now let's see if the revenue correlates with how much the movie is appreciated.
We can do that by creating a score that takes into account both `metascore` and `rating`.

In [None]:
df["mean_score"] = (df["metascore"] + df["rating"]) / 2 # mean of the two scores
df[["mean_score", "metascore", "rating"]].head()

# plotting the correlation between mean score and revenue
plt.figure(figsize=(10, 10))
sns.heatmap(df[["mean_score", "revenue_millions"]].corr(), annot=True)
plt.tight_layout()

The two are only weakly correlated.
This will also be clear when looking at the scatterplot.

In [None]:
plt.figure(figsize=(25, 16))
sns.scatterplot(x="mean_score", y="revenue_millions", data=df)
plt.tight_layout()
plt.grid()

There are plenty of movies with low revenue that are still very much appreciated.

On the other hand, the revenue is correlated with the number of votes the movie got on the IMDB platform.

In [None]:
plt.figure(figsize=(10, 10))
sns.heatmap(df[["votes", "revenue_millions"]].corr(), annot=True)
plt.tight_layout()

plt.figure(figsize=(20, 10))
sns.scatterplot(x="revenue_millions", y="votes", data=df)
plt.tight_layout()
plt.grid()
plt.show()

So far we have looked at specific correlations based on assumptions we have about our dataset.
Sometimes it is convenient to visualise the pairwise relationships between all variables at once.
We can do this in Seaborn by using the powerful `pairplot`.
It will take some time to run.

In [None]:
sns.pairplot(df)

Finally, we can look at the continuous variables to check their distribution.
This can be done easily using `displot`.

In [None]:
# looking at the distribution of revenue
sns.set_theme()
sns.displot(
    data=df, kind="hist", kde=True,
    x="revenue_millions",  height=8, aspect=1.5)

### Boxplot or Violin plot
We can check for distributions and outliers of variables with boxplot and violin plot.

In [None]:
plt.figure(figsize=(10, 7))
sns.boxplot(x="year", hue="year", y="revenue_millions", palette="pastel", data=df)
plt.tight_layout()

plt.figure(figsize=(10, 7))
sns.violinplot(x="year", hue="year", y="revenue_millions", palette="pastel", data=df)
plt.tight_layout()

We can also use the `hue` parameter to add a further level of grouping to other types of plots.

In [None]:
plt.figure(figsize=(15, 10))
sns.scatterplot(x="rating", y="metascore", hue="year", data=df, palette="pastel")
plt.tight_layout()
plt.grid()

In [None]:
sns.displot(
    data=df, kind="hist", kde=True,
    x="mean_score", hue="year", palette='pastel', height=10, aspect=1.5)